
## Import the Pandas library and create DataFrame

Before doing anything else, you'll need to import Pandas and get some data to work with.

In [1]:
import pandas as pd

In [2]:
data = {
    'Name': ['Alice', 'Bob', 'Claire', 'David', 'Emma'],
    'Age': [25, 30, 22, 28, 26],
    'Department': ['Marketing', 'Finance', 'Sales', 'HR', 'Marketing'],
    'City': ['New York', 'London', 'Paris', 'San Francisco', 'Sydney'],
    'Email': ['alice@example.com', 'bob@example.com', 'claire@example.com', 'david@example.com', 'emma@example.com'],
    'Job_Title' : ['Data Scientist', 'Financial Analyst', 'Sales Executive', 'HR Manager', 'Marketing Specialist'],
    'Full_Name' : ['Alice Johnson', 'Bob Smith', 'Claire Williams', 'David Lee', 'Emma Brown'],
    'Phone' : ['123-456-7890', '987-654-3210', '555-123-4567', '111-222-3333', '444-555-6666'],
    'Address' : ['       123 Main Street', '        456 Park Avenue', '         789 Elm Road', '        321 Oak Street','        555 Maple Lane']
}

df = pd.DataFrame(data)

df

|   | Name | Age | Department | City | Email | Job_Title | Full_Name | Phone | Address |
|---|------|-----|------------|------|-------|-----------|-----------|-------|---------|
| 0 | Alice | 25 | Marketing | New York | alice@example.com | Data Scientist | Alice Johnson | 123-456-7890 | 123 Main Street |
| 1 | Bob | 30 | Finance | London | bob@example.com | Financial Analyst | Bob Smith | 987-654-3210 | 456 Park Avenue |
| 2 | Claire | 22 | Sales | Paris | claire@example.com | Sales Executive | Claire Williams | 555-123-4567 | 789 Elm Road |
| 3 | David | 28 | HR | San Francisco | david@example.com | HR Manager | David Lee | 111-222-3333 | 321 Oak Street |
| 4 | Emma | 26 | Marketing | Sydney | emma@example.com | Marketing Specialist | Emma Brown | 444-555-6666 | 555 Maple Lane |


## The `.str` accessor

In Pandas, the `.str` accessor allows us to perform various string operations on DataFrame columns containing string values. This provides a convenient and efficient way to work with textual data within DataFrames.

> **Note:** The `.str` accessor works on Series. To use any of the tools below, be sure to specify which column of a DataFrame you wish to work with.

### String indexing and slicing

We can access individual characters of each string in a DataFrame column using string indexing. Let's select the first letter of each name:

In [None]:
df['Name'].str[0]

0    A
1    B
2    C
3    D
4    E
Name: Name, dtype: object

Using a slice in the string indexer, we'll now select the first four characters of each city.

In [None]:
df['City'].str[:4]

0    New 
1    Lond
2    Pari
3    San 
4    Sydn
Name: City, dtype: object
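The same indexer also accepts negative positions and a step, much like ordinary Python strings. The following is a minimal sketch on a small stand-in Series (the `cities` values are assumed for illustration, not taken from `df`); it also shows that an index past the end of a string yields a missing value rather than an error.

```python
import pandas as pd

# Small illustrative Series (an assumption; any string column behaves the same way)
cities = pd.Series(['New York', 'London', 'Paris'])

last_char = cities.str[-1]               # last character of each string
last_three = cities.str[-3:]             # last three characters
every_other = cities.str.slice(0, 6, 2)  # start, stop, step: same as .str[0:6:2]
out_of_range = cities.str[6]             # NaN where the string is too short

print(last_char.tolist())
print(last_three.tolist())
print(every_other.tolist())
print(out_of_range.tolist())
```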

### Converting cases

`.str.lower()`      
With this method you can change all letters to lower case.

In [None]:
df['Full_Name'].str.lower()

0      alice johnson
1          bob smith
2    claire williams
3          david lee
4         emma brown
Name: Full_Name, dtype: object

Similarly, Pandas offers:  

`.str.upper()`  
Converts all characters to uppercase.  

`.str.title()`  
Converts the first character of each word to uppercase and the remaining characters to lowercase.  

`.str.capitalize()`  
Converts the first character of the whole string to uppercase and the remaining characters to lowercase.  

`.str.swapcase()`  
Converts uppercase to lowercase and lowercase to uppercase.  

`.str.casefold()`  
Removes all case distinctions in the string. It is a more aggressive form of `.str.lower()` intended for case-insensitive comparison, and it also handles characters that lowercasing leaves unchanged, e.g. "ß" becomes "ss".
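To compare these case methods side by side, here is a short sketch on two made-up Series (`names` and `words` are illustrative assumptions). The second half shows why `.str.casefold()` can be preferable to `.str.lower()` when comparing strings without regard to case.

```python
import pandas as pd

# Illustrative data (assumed, not from the DataFrame above)
names = pd.Series(['alice JOHNSON', 'bob smith'])

upper = names.str.upper()              # every letter uppercase
title = names.str.title()              # first letter of each word uppercase
capitalized = names.str.capitalize()   # only the first letter of the string uppercase
swapped = names.str.swapcase()         # upper <-> lower

# Two spellings of the same German word
words = pd.Series(['Straße', 'STRASSE'])
lowered = words.str.lower()     # 'ß' is left as it is
folded = words.str.casefold()   # 'ß' becomes 'ss'

print(title.tolist())
print(lowered.tolist())
print(folded.tolist())
```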


### Conditions

`.str.startswith()` and `.str.endswith()`  
Often used for filtering DataFrames, these methods will check if the first or last character(s) in each string match(es) the given string.

In [None]:
df['Email'].str.startswith('david')

0    False
1    False
2    False
3     True
4    False
Name: Email, dtype: bool

`.str.contains()`  
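Another useful method for filtering, `.str.contains()` checks whether any part of each string matches the given pattern. The pattern is treated as a regular expression by default; pass `regex=False` to match it literally.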

In [None]:
df['Email'].str.contains('@')

0    True
1    True
2    True
3    True
4    True
Name: Email, dtype: bool
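Because these methods return a boolean Series, the result can be used directly as a mask to filter rows. The sketch below uses a small made-up DataFrame (`staff` is an assumption for illustration) and the `case` and `na` parameters of `.str.contains()` to ignore letter case and to treat missing values as non-matches.

```python
import pandas as pd

# Illustrative data with one missing department (assumed for this example)
staff = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Claire'],
    'Department': ['Marketing', 'Finance', None],
})

# case=False ignores letter case; na=False makes missing values count as "no match"
mask = staff['Department'].str.contains('mark', case=False, na=False)

marketing = staff[mask]  # keep only the rows where the mask is True
print(marketing)
```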

### Length and counting

`.str.len()`  
This method will count and return the number of characters in each string.

In [4]:
df['Address'].str.len()

0    22
1    23
2    21
3    22
4    22
Name: Address, dtype: int64

`.str.count()`  
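Returns the number of times a pattern occurs in each string of the column. The pattern is interpreted as a regular expression, so special characters such as `.` must be escaped to be counted literally.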

In [None]:
df['Email'].str.count('example')

0    1
1    1
2    1
3    1
4    1
Name: Email, dtype: int64
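One thing to watch for is that `.str.count()` interprets its argument as a regular expression. The sketch below, on two made-up addresses (an assumption for illustration), contrasts an unescaped `.` with an escaped one; `re.escape()` from the standard library is one way to escape a literal substring.

```python
import re
import pandas as pd

# Illustrative email addresses (assumed for this example)
emails = pd.Series(['alice@example.com', 'b.ob@mail.example.co.uk'])

any_char = emails.str.count('.')                 # '.' matches every character
literal_dots = emails.str.count(r'\.')           # escaped: counts real dots only
escaped_dots = emails.str.count(re.escape('.'))  # same result via re.escape

print(any_char.tolist())
print(literal_dots.tolist())
```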

### Manipulating strings

`.str.replace()`  
Used to locate one sub-string and, if it exists, replace it with another.

In [None]:
df['City'].str.replace('New', 'Old')

0         Old York
1           London
2            Paris
3    San Francisco
4           Sydney
Name: City, dtype: object
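`.str.replace()` can also work with patterns rather than fixed substrings. The following sketch uses made-up phone numbers (an assumption for illustration) and passes `regex` explicitly in both calls, since its default has differed between pandas versions.

```python
import pandas as pd

# Illustrative phone numbers (assumed for this example)
phones = pd.Series(['123-456-7890', '987-654-3210'])

# Literal replacement: every hyphen is removed
digits_only = phones.str.replace('-', '', regex=False)

# Pattern replacement: mask the last four digits of each number
masked = phones.str.replace(r'\d{4}$', 'XXXX', regex=True)

print(digits_only.tolist())
print(masked.tolist())
```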

`.str.strip()`  
It is not uncommon for data to carry artefacts of the ETL process, often as leading or trailing characters. Most commonly these are whitespace; `.str.strip()` removes whitespace from both ends of a string by default, and removes other characters when they are passed as an argument.

In [None]:
df['Address']

0            123 Main Street
1            456 Park Avenue
2               789 Elm Road
3             321 Oak Street
4             555 Maple Lane
Name: Address, dtype: object

In [3]:
df['Address'].str.strip()

0    123 Main Street
1    456 Park Avenue
2       789 Elm Road
3     321 Oak Street
4     555 Maple Lane
Name: Address, dtype: object
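For one-sided cleaning there are also `.str.lstrip()` and `.str.rstrip()`, and all three accept a string of characters to remove. A minimal sketch on made-up values (the `raw` Series is an assumption for illustration):

```python
import pandas as pd

# Illustrative messy values (assumed for this example)
raw = pd.Series(['  123 Main Street  ', '--456 Park Avenue--'])

left = raw.str.lstrip()      # removes leading whitespace only
right = raw.str.rstrip()     # removes trailing whitespace only
both = raw.str.strip('- ')   # removes any mix of hyphens and spaces at both ends

print(left.tolist())
print(right.tolist())
print(both.tolist())
```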

`.str.split()`  
Used to break a string down into its constituent parts, `.str.split()` searches each string for a given separator and starts a new list item each time that separator is encountered. By default, splits are made on whitespace.

In [None]:
df['Full_Name'].str.split()

0      [Alice, Johnson]
1          [Bob, Smith]
2    [Claire, Williams]
3          [David, Lee]
4         [Emma, Brown]
Name: Full_Name, dtype: object

The resulting lists can be accessed with a further `.str` followed by an indexer or `.get()`.

In [8]:
df['Full_Name'].str.split(' ').str[1]

0     Johnson
1       Smith
2    Williams
3         Lee
4       Brown
Name: Full_Name, dtype: object

In [None]:
df['Full_Name'].str.split(' ').str.get(1)

0     Johnson
1       Smith
2    Williams
3         Lee
4       Brown
Name: Full_Name, dtype: object
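When the pieces should end up in separate columns rather than in lists, `.str.split()` takes `expand=True`, and `n` limits the number of splits. The sketch below uses made-up names and hypothetical column names (`First_Name`, `Rest`), chosen only for illustration.

```python
import pandas as pd

# Illustrative names, including one with three words (assumed for this example)
full_names = pd.Series(['Alice Johnson', 'Bob Smith', 'Mary Ann Lee'])

# n=1 splits at the first space only; expand=True returns a DataFrame
parts = full_names.str.split(' ', n=1, expand=True)
parts.columns = ['First_Name', 'Rest']  # hypothetical column names

print(parts)
```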

## Regular expressions

Regular expressions, commonly known as regex, are powerful tools for pattern matching and text manipulation.  
The regex syntax consists of metacharacters, quantifiers, character classes, and more, which define the rules for matching patterns in strings.

Common Metacharacters and Their Meanings:

**. (Period)**: Matches any character except a newline.  
For example, the pattern a.b will match 'aab', 'acb', 'a9b', but not 'a\nb'.

**\* (Asterisk)**: Matches zero or more occurrences of the preceding character.  
For example, the pattern ab*c will match 'ac', 'abc', 'abbc', 'abbbc', and so on.

**\+ (Plus)**: Matches one or more occurrences of the preceding character.  
For example, the pattern ab+c will match 'abc', 'abbc', 'abbbc', but not 'ac'.

**? (Question Mark)**: Matches zero or one occurrence of the preceding character.  
For example, the pattern colou?r will match both 'color' and 'colour'.

**^ (Caret)**: Matches the start of a string.  
For example, the pattern ^abc will match 'abc' only if it appears at the beginning of a string.

**\$ (Dollar)**: Matches the end of a string.  
For example, the pattern abc$ will match 'abc' only if it appears at the end of a string.

**[ ] (Square Brackets)**: Matches any single character within the specified set.  
For example, the pattern [aeiou] will match any vowel.

**[^] (Caret Inside Square Brackets)**: Matches any single character not within the specified set.  
For example, the pattern [^aeiou] will match any character that is not a lowercase vowel, including consonants, digits, spaces, and uppercase letters.

Try [this site](https://regex101.com/) for diving deeper into regular expressions.
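The metacharacters listed above can be tried out with Python's built-in `re` module. The sketch below pairs each pattern with one string it should match and one it should not; the test strings are chosen for illustration, and `re.search()` is used because it looks for the pattern anywhere in the string.

```python
import re

# pattern -> (string that should match, string that should not)
examples = {
    r'a.b':      ('a9b', 'a\nb'),
    r'ab*c':     ('ac', 'adc'),
    r'ab+c':     ('abbc', 'ac'),
    r'colou?r':  ('colour', 'colouur'),
    r'^abc':     ('abcdef', 'xabc'),
    r'abc$':     ('xyzabc', 'abcx'),
    r'[aeiou]':  ('cat', 'sky'),
    r'[^aeiou]': ('a1', 'aei'),
}

results = {}
for pattern, (hit, miss) in examples.items():
    # re.search returns a match object or None; bool() turns that into True/False
    results[pattern] = (bool(re.search(pattern, hit)), bool(re.search(pattern, miss)))
    print(pattern, repr(hit), results[pattern][0], repr(miss), results[pattern][1])
```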

### Regex in Pandas
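In pandas, certain methods allow for regex pattern matching, some by default and others only when explicitly told to. For instance, `.str.contains()` treats its pattern as a regex by default, whereas `.str.replace()` in pandas 2.0 and later does so only when `regex=True` is passed.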

In [None]:
df['Phone'].str.contains(r'\d+-\d+-\d+')

0    True
1    True
2    True
3    True
4    True
Name: Phone, dtype: bool

`\d+-\d+-\d+`     
This pattern matches any string that contains three groups of one or more digits separated by hyphens, such as the phone numbers above.
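Note that `.str.contains()` succeeds if the pattern appears anywhere in the string. When the whole string must fit the pattern, `.str.fullmatch()` is stricter, and `.str.extract()` pulls out the captured groups as columns. The sketch below uses made-up values and hypothetical group names (`area`, `exchange`, `line`), all assumptions for illustration.

```python
import pandas as pd

# Illustrative values: one clean number, one embedded in text, one without hyphens
phones = pd.Series(['123-456-7890', 'call 987-654-3210 now', '5551234567'])

pattern = r'\d{3}-\d{3}-\d{4}'

found_anywhere = phones.str.contains(pattern)   # pattern may appear anywhere
whole_string = phones.str.fullmatch(pattern)    # entire string must match

# Named groups become column names; rows without a match are filled with NaN
parts = phones.str.extract(r'(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})')

print(found_anywhere.tolist())
print(whole_string.tolist())
print(parts)
```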

## Challenges

In [12]:
data = {
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'San Francisco', 'London', 'Paris', 'Berlin', 'Rome', 'Tokyo'],
    'Country': ['USA', 'USA', 'USA', 'USA', 'USA', 'UK', 'France', 'Germany', 'Italy', 'Japan'],
    'Population (Millions)': [8.4, 3.9, 2.7, 2.3, 0.9, 8.9, 2.1, 3.7, 2.8, 13.9],
    'Area (km2)': [468.9, 502.8, 227.6, 1, 121.4, 1572, 105.4, 891.8, 1285, 2187],
    'Language': ['English', 'English', 'English', 'English', 'English', 'English', 'French', 'German', 'Italian', 'Japanese'],
    'Currency': ['USD', 'USD', 'USD', 'USD', 'USD', 'GBP', 'EUR', 'EUR', 'EUR', 'JPY'],
    'Continent': ['North America', 'North America', 'North America', 'North America', 'North America', 'Europe', 'Europe', 'Europe', 'Europe', 'Asia'],
    'Is_Capital': [False, False, False, False, False, True, True, True, True, True]
}

cities_df = pd.DataFrame(data)

# Adding more rows
extra_data = {
    'City': ['Sydney', 'Seoul', 'Beijing', 'Moscow', 'Cairo', 'Mumbai'],
    'Country': ['Australia', 'South Korea', 'China', 'Russia', 'Egypt', 'India'],
    'Population (Millions)': [5.3, 9.7, 21.5, 12.5, 9.5, 20.7],
    'Area (km2)': [1687, 605, 16411, 2561, 3034, 603],
    'Language': ['English', 'Korean', 'Mandarin', 'Russian', 'Arabic', 'Hindi'],
    'Currency': ['AUD', 'KRW', 'CNY', 'RUB', 'EGP', 'INR'],
    'Continent': ['Australia', 'Asia', 'Asia', 'Europe', 'Africa', 'Asia'],
    'Is_Capital': [False, True, True, True, True, False]
}

extra_df = pd.DataFrame(extra_data)
cities_df = pd.concat([cities_df, extra_df], ignore_index=True)

cities_df

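|   | City | Country | Population (Millions) | Area (km2) | Language | Currency | Continent | Is_Capital |
|---|------|---------|-----------------------|------------|----------|----------|-----------|------------|
| 0 | New York | USA | 8.4 | 468.9 | English | USD | North America | False |
| 1 | Los Angeles | USA | 3.9 | 502.8 | English | USD | North America | False |
| 2 | Chicago | USA | 2.7 | 227.6 | English | USD | North America | False |
| 3 | Houston | USA | 2.3 | 1.0 | English | USD | North America | False |
| 4 | San Francisco | USA | 0.9 | 121.4 | English | USD | North America | False |
| 5 | London | UK | 8.9 | 1572.0 | English | GBP | Europe | True |
| 6 | Paris | France | 2.1 | 105.4 | French | EUR | Europe | True |
| 7 | Berlin | Germany | 3.7 | 891.8 | German | EUR | Europe | True |
| 8 | Rome | Italy | 2.8 | 1285.0 | Italian | EUR | Europe | True |
| 9 | Tokyo | Japan | 13.9 | 2187.0 | Japanese | JPY | Asia | True |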


### Challenge 1
Create a new column 'City_Length' that contains the length of each city name

In [14]:
cities_df['City_Length'] = cities_df['City'].str.len()

In [16]:
cities_df



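|   | City | Country | Population (Millions) | Area (km2) | Language | Currency | Continent | Is_Capital | City_Length |
|---|------|---------|-----------------------|------------|----------|----------|-----------|------------|-------------|
| 0 | New York | USA | 8.4 | 468.9 | English | USD | North America | False | 8 |
| 1 | Los Angeles | USA | 3.9 | 502.8 | English | USD | North America | False | 11 |
| 2 | Chicago | USA | 2.7 | 227.6 | English | USD | North America | False | 7 |
| 3 | Houston | USA | 2.3 | 1.0 | English | USD | North America | False | 7 |
| 4 | San Francisco | USA | 0.9 | 121.4 | English | USD | North America | False | 13 |
| 5 | London | UK | 8.9 | 1572.0 | English | GBP | Europe | True | 6 |
| 6 | Paris | France | 2.1 | 105.4 | French | EUR | Europe | True | 5 |
| 7 | Berlin | Germany | 3.7 | 891.8 | German | EUR | Europe | True | 6 |
| 8 | Rome | Italy | 2.8 | 1285.0 | Italian | EUR | Europe | True | 4 |
| 9 | Tokyo | Japan | 13.9 | 2187.0 | Japanese | JPY | Asia | True | 5 |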


### Challenge 2
Convert the 'City' names to uppercase and store them in a new column 'City_Upper'

In [18]:
cities_df['City_Upper'] = cities_df['City'].str.upper()
cities_df


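|   | City | Country | Population (Millions) | Area (km2) | Language | Currency | Continent | Is_Capital | City_Length | City_Upper |
|---|------|---------|-----------------------|------------|----------|----------|-----------|------------|-------------|------------|
| 0 | New York | USA | 8.4 | 468.9 | English | USD | North America | False | 8 | NEW YORK |
| 1 | Los Angeles | USA | 3.9 | 502.8 | English | USD | North America | False | 11 | LOS ANGELES |
| 2 | Chicago | USA | 2.7 | 227.6 | English | USD | North America | False | 7 | CHICAGO |
| 3 | Houston | USA | 2.3 | 1.0 | English | USD | North America | False | 7 | HOUSTON |
| 4 | San Francisco | USA | 0.9 | 121.4 | English | USD | North America | False | 13 | SAN FRANCISCO |
| 5 | London | UK | 8.9 | 1572.0 | English | GBP | Europe | True | 6 | LONDON |
| 6 | Paris | France | 2.1 | 105.4 | French | EUR | Europe | True | 5 | PARIS |
| 7 | Berlin | Germany | 3.7 | 891.8 | German | EUR | Europe | True | 6 | BERLIN |
| 8 | Rome | Italy | 2.8 | 1285.0 | Italian | EUR | Europe | True | 4 | ROME |
| 9 | Tokyo | Japan | 13.9 | 2187.0 | Japanese | JPY | Asia | True | 5 | TOKYO |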


### Challenge 3
Check if the 'City' names end with the letter 'o'. Create a new column 'Ends_With_O' with the boolean results

In [19]:
cities_df['Ends_With_O'] = cities_df['City'].str.endswith('o')
cities_df


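|   | City | Country | Population (Millions) | Area (km2) | Language | Currency | Continent | Is_Capital | City_Length | City_Upper | Ends_With_O |
|---|------|---------|-----------------------|------------|----------|----------|-----------|------------|-------------|------------|-------------|
| 0 | New York | USA | 8.4 | 468.9 | English | USD | North America | False | 8 | NEW YORK | False |
| 1 | Los Angeles | USA | 3.9 | 502.8 | English | USD | North America | False | 11 | LOS ANGELES | False |
| 2 | Chicago | USA | 2.7 | 227.6 | English | USD | North America | False | 7 | CHICAGO | True |
| 3 | Houston | USA | 2.3 | 1.0 | English | USD | North America | False | 7 | HOUSTON | False |
| 4 | San Francisco | USA | 0.9 | 121.4 | English | USD | North America | False | 13 | SAN FRANCISCO | True |
| 5 | London | UK | 8.9 | 1572.0 | English | GBP | Europe | True | 6 | LONDON | False |
| 6 | Paris | France | 2.1 | 105.4 | French | EUR | Europe | True | 5 | PARIS | False |
| 7 | Berlin | Germany | 3.7 | 891.8 | German | EUR | Europe | True | 6 | BERLIN | False |
| 8 | Rome | Italy | 2.8 | 1285.0 | Italian | EUR | Europe | True | 4 | ROME | False |
| 9 | Tokyo | Japan | 13.9 | 2187.0 | Japanese | JPY | Asia | True | 5 | TOKYO | True |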


### Challenge 4
Replace the word 'York' in 'City' names with 'Ville' and update the 'City' column accordingly

In [31]:
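# Note: the result is stored in a new 'Ville' column, which leaves 'City' unchanged;
# assign it to cities_df['City'] instead to update the column in place.
cities_df['Ville'] = cities_df['City'].str.replace('York', 'Ville')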
cities_df

|   | City | Country | Population (Millions) | Area (km2) | Language | Currency | Continent | Is_Capital | City_Length | City_Upper | Ends_With_O | a | Ville |
|---|------|---------|-----------------------|------------|----------|----------|-----------|------------|-------------|------------|-------------|---|-------|
| 0 | New York | USA | 8.4 | 468.9 | English | USD | North America | False | 8 | NEW YORK | False | New Ville | New Ville |
| 1 | Los Angeles | USA | 3.9 | 502.8 | English | USD | North America | False | 11 | LOS ANGELES | False | Los Angeles | Los Angeles |
| 2 | Chicago | USA | 2.7 | 227.6 | English | USD | North America | False | 7 | CHICAGO | True | Chicago | Chicago |
| 3 | Houston | USA | 2.3 | 1.0 | English | USD | North America | False | 7 | HOUSTON | False | Houston | Houston |
| 4 | San Francisco | USA | 0.9 | 121.4 | English | USD | North America | False | 13 | SAN FRANCISCO | True | San Francisco | San Francisco |
| 5 | London | UK | 8.9 | 1572.0 | English | GBP | Europe | True | 6 | LONDON | False | London | London |
| 6 | Paris | France | 2.1 | 105.4 | French | EUR | Europe | True | 5 | PARIS | False | Paris | Paris |
| 7 | Berlin | Germany | 3.7 | 891.8 | German | EUR | Europe | True | 6 | BERLIN | False | Berlin | Berlin |
| 8 | Rome | Italy | 2.8 | 1285.0 | Italian | EUR | Europe | True | 4 | ROME | False | Rome | Rome |
| 9 | Tokyo | Japan | 13.9 | 2187.0 | Japanese | JPY | Asia | True | 5 | TOKYO | True | Tokyo | Tokyo |


In [24]:
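# Drop the leftover scratch column 'a'; errors='ignore' avoids a KeyError if it does not exist
cities_df = cities_df.drop(columns=['a'], errors='ignore')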
cities_df

### Challenge 5
Create a new column 'Country_Code' by extracting the first three characters from the 'Country' names

In [33]:
# Create 'Country_Code' column by extracting the first three characters from 'Country'
cities_df['Country_Code'] = cities_df['Country'].str[:3]

# Display the DataFrame with the new column
print(cities_df)

             City      Country  Population (Millions)  Area (km2)  Language  \
0        New York          USA                    8.4       468.9   English   
1     Los Angeles          USA                    3.9       502.8   English   
2         Chicago          USA                    2.7       227.6   English   
3         Houston          USA                    2.3         1.0   English   
4   San Francisco          USA                    0.9       121.4   English   
5          London           UK                    8.9      1572.0   English   
6           Paris       France                    2.1       105.4    French   
7          Berlin      Germany                    3.7       891.8    German   
8            Rome        Italy                    2.8      1285.0   Italian   
9           Tokyo        Japan                   13.9      2187.0  Japanese   
10         Sydney    Australia                    5.3      1687.0   English   
11          Seoul  South Korea                    9.

### Challenge 6
Count the occurrences of the letter 'a' in each 'City' name

In [37]:
cities_df['City'].str.count('a')

0     0
1     0
2     1
3     0
4     2
5     0
6     1
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    1
15    1
Name: City, dtype: int64

### Challenge 7
Check if the 'City' names start with the letter 'C' and end with the letter 'o'

In [40]:
# Check if 'City' names start with 'C' and end with 'o'
cities_df['Starts_C_Ends_o'] = cities_df['City'].str.startswith('C') & cities_df['City'].str.endswith('o')

# Display the DataFrame with the new column
print(cities_df)

             City      Country  Population (Millions)  Area (km2)  Language  \
0        New York          USA                    8.4       468.9   English   
1     Los Angeles          USA                    3.9       502.8   English   
2         Chicago          USA                    2.7       227.6   English   
3         Houston          USA                    2.3         1.0   English   
4   San Francisco          USA                    0.9       121.4   English   
5          London           UK                    8.9      1572.0   English   
6           Paris       France                    2.1       105.4    French   
7          Berlin      Germany                    3.7       891.8    German   
8            Rome        Italy                    2.8      1285.0   Italian   
9           Tokyo        Japan                   13.9      2187.0  Japanese   
10         Sydney    Australia                    5.3      1687.0   English   
11          Seoul  South Korea                    9.

### Challenge 8
Check if the 'City' names contain exactly two words

In [None]:
# Check if 'City' names contain exactly two words
cities_df['Two_Words'] = cities_df['City'].str.split().apply(len) == 2

# Display the DataFrame with the new column
print(cities_df)
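The same check can be written without `.apply()`. The sketch below shows two possible alternatives on a small made-up Series (the sample city names are an assumption for illustration): taking the length of each split list with `.str.len()`, or matching the whole string against a "word, whitespace, word" pattern with `.str.fullmatch()`.

```python
import pandas as pd

# Illustrative city names with one, two, and three words (assumed for this example)
sample_cities = pd.Series(['New York', 'Chicago', 'San Francisco', 'Rio de Janeiro'])

# Option 1: .str.len() also works on the lists produced by .str.split()
two_words_split = sample_cities.str.split().str.len() == 2

# Option 2: exactly two runs of non-whitespace separated by whitespace
two_words_regex = sample_cities.str.fullmatch(r'\S+\s+\S+')

print(two_words_split.tolist())
print(two_words_regex.tolist())
```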