## Topic: Filtering a Pandas DataFrame by String Conditions

### OUTCOMES

- 1. Introduction to string filtering

- 2. str.contains() method

- 3. str.startswith() and str.endswith()

- 4. Filtering using Regular Expression

- 5. Real-life example

### 1. Introduction to string filtering

- String filtering means selecting the rows in which a column (usually of dtype object or string) matches a given string condition.

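- syntax:
    - df.loc[df['column_name'].str.method(pattern, case=True)]

    - here, 
        - method => a string method such as contains(), startswith() or endswith()
        - pattern => the text (or regular expression) to look for
        - case=True => case-sensitive match (the default); accepted by contains(), but not by startswith() or endswith()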
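
The syntax above works in two steps: the `.str` method returns a boolean Series (a mask), and `.loc` keeps the rows where that mask is `True`. The following is a minimal sketch of those two steps; the three-row frame and the names `demo`, `mask` and `result` are illustrative assumptions, not part of this notebook's data.

```python
import pandas as pd

# Hypothetical three-row frame used only for this illustration
demo = pd.DataFrame({"city": ["Dhaka", "Chittagong", "dhaka"]})

# Step 1: the string method returns one True/False value per row
mask = demo["city"].str.contains("Dhaka", case=True)
print(mask.tolist())

# Step 2: .loc keeps only the rows where the mask is True
result = demo.loc[mask]
print(result)
```

Because the match is case-sensitive here, the lowercase "dhaka" row is not selected.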

In [9]:
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Raihan", "Kamruzzaman", "Alice", "rahim", "KzRaihan"],
    "dept": ["IT", "HR", "IT", "Finance", "IT"],
    "city": ["Dhaka", "Chittagong", "Dhaka", "Rajshahi", "Dhaka"]
})



In [4]:
df.head(3)

Unnamed: 0,id,name,dept,city
0,1,Raihan,IT,Dhaka
1,2,Kamruzzaman,HR,Chittagong
2,3,Alice,IT,Dhaka


In [None]:
# Example -> string filtering
# - filter rows that contain a specific string 

df.loc[df["city"].str.contains("Dhaka")]

# NOTE 
# city - column
# Dhaka - string (filtering string)

Unnamed: 0,id,name,dept,city
0,1,Raihan,IT,Dhaka
2,3,Alice,IT,Dhaka
4,5,KzRaihan,IT,Dhaka


### 2. str.contains() method

In [None]:
# filter the name column for values that contain the letter "R"

df.loc[df['name'].str.contains('R')]

# case=True is the default,
# so only an uppercase 'R' counts as a match

Unnamed: 0,id,name,dept,city
0,1,Raihan,IT,Dhaka
4,5,KzRaihan,IT,Dhaka


In [None]:
# filter the name column with a case-insensitive match

df.loc[df['name'].str.contains("R", case = False)]

# both lowercase 'r' and uppercase 'R' count as a match

Unnamed: 0,id,name,dept,city
0,1,Raihan,IT,Dhaka
1,2,Kamruzzaman,HR,Chittagong
3,4,rahim,Finance,Rajshahi
4,5,KzRaihan,IT,Dhaka
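
Two other `str.contains()` parameters often matter on real data: `na`, which controls what missing values become in the mask, and `regex`, which controls whether the pattern is read as a regular expression (the default) or as literal text. The sketch below uses a hypothetical Series `names` with a missing value and a dot in one entry; it is not part of this notebook's DataFrame.

```python
import pandas as pd

# Hypothetical column with a missing value and a regex metacharacter (".")
names = pd.Series(["Raihan", None, "R.K. Das", "rahim"])

# na=False treats missing values as "no match", so the mask stays purely boolean
mask_r = names.str.contains("R", na=False)
print(mask_r.tolist())

# regex=False matches the text literally, so "." means a dot, not "any character"
mask_dot = names.str.contains(".", regex=False, na=False)
print(mask_dot.tolist())
```

With `regex=False`, only the entry that really contains a dot is selected; with the default `regex=True`, the pattern "." would match every non-empty string.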


### 3. str.startswith() and str.endswith()

In [20]:
# str.startswith()
df[df["name"].str.startswith("K")]

Unnamed: 0,id,name,dept,city
1,2,Kamruzzaman,HR,Chittagong
4,5,KzRaihan,IT,Dhaka


In [22]:
# str.endswith()
end_with_n = df[df['name'].str.endswith('n')]

end_with_n

Unnamed: 0,id,name,dept,city
0,1,Raihan,IT,Dhaka
1,2,Kamruzzaman,HR,Chittagong
4,5,KzRaihan,IT,Dhaka


In [25]:
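# multiple string conditions combined with & (AND)
# the result is named `filtered` to avoid shadowing Python's built-in filter()

filtered = df[
    (df['name'].str.contains('R', case = False)) &
    (df['city'].str.startswith('D'))
]

filtered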

Unnamed: 0,id,name,dept,city
0,1,Raihan,IT,Dhaka
4,5,KzRaihan,IT,Dhaka
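
Unlike `str.contains()`, `str.startswith()` and `str.endswith()` do not take a `case` argument. One common workaround, sketched here under the assumption that lowercasing is acceptable for the data, is to normalise the column with `str.lower()` first. The Series `names` simply repeats the name values from the example DataFrame.

```python
import pandas as pd

# Same name values as the example DataFrame above
names = pd.Series(["Raihan", "Kamruzzaman", "Alice", "rahim", "KzRaihan"])

# Lowercase the column first, then compare against a lowercase prefix
starts_with_r = names[names.str.lower().str.startswith("r")]
print(starts_with_r.tolist())
```

This selects both "Raihan" and "rahim", while "KzRaihan" is excluded because it does not start with the letter.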


### 4. Filtering using Regular Expression

- A regular expression is a pattern that defines a search rule for strings.

- symbol & meaning
    - ^  => start of string
    - $  => end of string
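
To see the two anchors in isolation, the sketch below uses Python's built-in `re` module on a plain list of the city names. By default, `str.contains()` searches each value in the same way `re.search` does, so the behaviour carries over. The list name `cities` is an illustrative assumption.

```python
import re

cities = ["Dhaka", "Chittagong", "Rajshahi"]

# ^C matches only when the string begins with "C"
starts_with_c = [c for c in cities if re.search(r"^C", c)]

# hi$ matches only when the string ends with "hi"
ends_with_hi = [c for c in cities if re.search(r"hi$", c)]

print(starts_with_c)
print(ends_with_hi)
```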

In [27]:
df.loc[df['city'].str.contains(r'^C')]

Unnamed: 0,id,name,dept,city
1,2,Kamruzzaman,HR,Chittagong


In [None]:
# filter the name column for values that start with "Kz" using a regular expression

df.loc[df['name'].str.contains(r'^Kz')]


Unnamed: 0,id,name,dept,city
4,5,KzRaihan,IT,Dhaka


In [None]:
# filter the city column for values that end with 'hi' using a regular expression

df[df['city'].str.contains(r'hi$')]

Unnamed: 0,id,name,dept,city
3,4,rahim,Finance,Rajshahi


In [None]:
# filter the name column for values that contain a vowel (A, E, I, O, U) using a regular expression

df[df['name'].str.contains(r'[AEIOU]', case = False)]

# NOTE
# [AEIOU] matches any single one of the listed characters;
# with case = False this returns rows where the name contains at least one vowel, upper or lower case.

Unnamed: 0,id,name,dept,city
0,1,Raihan,IT,Dhaka
1,2,Kamruzzaman,HR,Chittagong
2,3,Alice,IT,Dhaka
3,4,rahim,Finance,Rajshahi
4,5,KzRaihan,IT,Dhaka


In [None]:
df[df['name'].str.contains(r'^[AEIOU]', case = False)]

# returns rows where the name starts with a vowel (upper or lower case)

Unnamed: 0,id,name,dept,city
2,3,Alice,IT,Dhaka


In [47]:
# Filter by multiple options: | means OR, so this matches "IT" or "HR"

df.loc[df['dept'].str.contains(r'IT|HR', case = False)]

Unnamed: 0,id,name,dept,city
0,1,Raihan,IT,Dhaka
1,2,Kamruzzaman,HR,Chittagong
2,3,Alice,IT,Dhaka
4,5,KzRaihan,IT,Dhaka
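
When the accepted values come from a list, the alternation pattern can be built programmatically. The sketch below assumes a hypothetical list `wanted` and escapes each entry with `re.escape` so that any regex metacharacters are matched literally. It also shows `isin()`, which tests for exact, case-sensitive membership without any regex and is often the simpler choice when whole values are compared.

```python
import re
import pandas as pd

# Same dept values as the example DataFrame above
dept = pd.Series(["IT", "HR", "IT", "Finance", "IT"])
wanted = ["IT", "HR"]  # hypothetical list of accepted values

# Escape each option, then join the options with | (OR)
pattern = "|".join(re.escape(w) for w in wanted)
by_regex = dept[dept.str.contains(pattern, case=False)]

# Exact, case-sensitive membership test without any regex
by_isin = dept[dept.isin(wanted)]

print(pattern)
print(by_regex.index.tolist())
print(by_isin.index.tolist())
```

On this data both approaches select the same rows; they differ once a value merely contains one of the options as a substring.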


### 5. Real-life example

In [52]:
data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace", "Hannah","Sakib"],
    "City": ["New York", "Los Angeles", "Newark", "Boston", "New Delhi", "Chicago", "New Orleans", "Houston","H Los Ang"],
    "Department": ["HR", "IT", "Finance", "IT", "HR", "Marketing", "Finance", "HR", "HR"],
    "Salary": [50000, 60000, 55000, 70000, 52000, 58000, 62000, 51000,70000]
}


- 1. Filter all rows whose city contains "New".

- 2. Find rows whose city starts with "Los".

- 3. Get all rows where the city is not "New York".

- 4. Filter employees whose names end with "b" and whose department is "IT".

In [54]:
df = pd.DataFrame(data)

In [55]:
df.head(3)

Unnamed: 0,Name,City,Department,Salary
0,Alice,New York,HR,50000
1,Bob,Los Angeles,IT,60000
2,Charlie,Newark,Finance,55000


In [None]:
# 1. Filter all rows whose City contains "New".

df[df['City'].str.contains(r'New')]

Unnamed: 0,Name,City,Department,Salary
0,Alice,New York,HR,50000
2,Charlie,Newark,Finance,55000
4,Eve,New Delhi,HR,52000
6,Grace,New Orleans,Finance,62000


In [58]:
# 2. Find rows whose City starts with "Los".

# using a regular expression: ^ anchors the match to the start of the string,
# so "H Los Ang" is not selected

df[df['City'].str.contains(r'^Los')]


Unnamed: 0,Name,City,Department,Salary
1,Bob,Los Angeles,IT,60000


In [61]:
# 3. Get all rows where city is not "New York".

df[~(df['City'].str.contains('New York'))]

Unnamed: 0,Name,City,Department,Salary
1,Bob,Los Angeles,IT,60000
2,Charlie,Newark,Finance,55000
3,David,Boston,IT,70000
4,Eve,New Delhi,HR,52000
5,Frank,Chicago,Marketing,58000
6,Grace,New Orleans,Finance,62000
7,Hannah,Houston,HR,51000
8,Sakib,H Los Ang,HR,70000
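
Negating `str.contains()` removes every row that contains the text anywhere, which is broader than "is not equal to". On the data above the two give the same result, but they can differ. The sketch below uses a hypothetical Series `cities` to show the difference.

```python
import pandas as pd

# Hypothetical values chosen so that substring and exact matching disagree
cities = pd.Series(["New York", "New York City", "Newark"])

# Substring test: drops every value that contains "New York"
not_containing = cities[~cities.str.contains("New York")]

# Exact test: drops only the value that equals "New York"
not_equal = cities[cities != "New York"]

print(not_containing.tolist())
print(not_equal.tolist())
```

The substring version also drops "New York City", while the exact comparison keeps it.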


In [None]:
# 4. Filter employees whose names end with "b" and whose department is "IT".

df[
    (df['Name'].str.contains(r'b$')) &
    (df['Department'].str.contains('IT'))
]


Unnamed: 0,Name,City,Department,Salary
1,Bob,Los Angeles,IT,60000
