<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Examples.png"  style="display: block; margin-left: auto; margin-right: auto;"/>
</div>

# Examples: String manipulation with Pandas
© ExploreAI Academy

In this notebook, we return to the Pandas library, this time looking specifically at functionality aimed at string manipulation.

## Learning objectives

By the end of this notebook, you should be able to:
- Understand how to apply Pandas to manipulate strings in Python.

## Examples

Let's explore how Pandas, a powerful Python library, can be used to manipulate strings within DataFrames and Series. This builds on our knowledge of Python regex by showing how the same concepts apply in Pandas.

### Example 1

**Scenario**: Suppose we have been given a dataset of different plant species recorded in a national park. The species names are inconsistently formatted with leading/trailing spaces and varying case letters. 

**Question**: Clean the 'species' column by removing leading/trailing spaces and converting all names to lowercase.

**Solution**:


In [20]:
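import re

i = " LowerCase "

# Extract runs of letters with a regex and keep the first match
x = re.findall(r'[A-Za-z]+', i)
y = x[0]

# strip() removes both leading and trailing spaces (lstrip() only removes leading ones)
i.strip().lower()

'lowercase'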

In [35]:
def col_cleaner(Animals):
    import re

    # Lowercase each entry and keep its first run of letters (the species name);
    # entries that contain no letters become None
    pattern = re.compile(r'[a-z]+')
    cleaned = []
    for i in Animals['species']:
        matches = pattern.findall(i.lower())
        cleaned.append(matches[0] if matches else None)
    Animals['species'] = cleaned

    return Animals

import pandas as pd
data = {'species': [' Maple (10 years) ', 'oak', 'Pine(3 years)', 'maple ', ' Oak (1.5 Years)']}
df = pd.DataFrame(data)

col_cleaner(df)


Unnamed: 0,species
0,maple
1,oak
2,pine
3,maple
4,oak


In [32]:
for i in list(df['species']):
    print(i)

 Maple (10 years) 
oak
Pine(3 years)
maple 
 Oak (1.5 Years)


In [36]:
import pandas as pd

# Sample data
data = {'species': [' Maple (10 years) ', 'oak', 'Pine(3 years)', 'maple ', ' Oak (1.5 Years)']}
df = pd.DataFrame(data)

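# Placeholder for cleaning the 'species' column: this assignment leaves it
# unchanged, and the actual cleaning follows in the next cell
df['species'] = df['species']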

In [49]:
import re

# Compile the regex once, outside the loop, to extract the first word
pattern = re.compile(r'\w+')
for word in df['species']:
    print(pattern.findall(word)[0].lower())

# Build the cleaned column from the same matches
df['species'] = [pattern.findall(word)[0].lower() for word in df['species']]

df


maple
oak
pine
maple
oak


Unnamed: 0,species
0,maple
1,oak
2,pine
3,maple
4,oak


In [None]:
def col_cleaner(df):
    import re
    pattern = re.compile(r'\w+')
    df['species'] = [pattern.findall(word)[0].lower() for word in df['species']]
    
    return df


import pandas as pd

# Sample data
data = {'species': [' Maple (10 years) ', 'oak', 'Pine(3 years)', 'maple ', ' Oak (1.5 Years)']}
df = pd.DataFrame(data)    

col_cleaner(df)
    

In [58]:
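# Note: df['species'] was already reduced to bare names by the regex cleaning
# in an earlier cell, so no digits remain and every row prints 'none'
pattern = re.compile(r'\d+')
for dig in df['species']:
    if pattern.findall(dig):
        print(float(pattern.findall(dig)[0]))
    else:
        print('none')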

none
none
none
none
none


In [59]:
import re

pattern = re.compile(r'\d+')

for dig in df['species']:
    matches = pattern.findall(dig)
    if matches:
        print(float(matches[0]))
    else:
        print('none')


none
none
none
none
none


In [63]:
import pandas as pd

# Sample data
data = {'species': [' Maple (10 years) ', 'oak', 'Pine(3 years)', 'maple ', ' Oak (1.5 Years)']}
df = pd.DataFrame(data)

# Cleaning the 'species' column
df['species'] = df['species'].str.strip().str.lower()
print(df)
df.head()

            species
0  maple (10 years)
1               oak
2     pine(3 years)
3             maple
4   oak (1.5 years)


Unnamed: 0,species
0,maple (10 years)
1,oak
2,pine(3 years)
3,maple
4,oak (1.5 years)


**Explanation**: We first select the column that needs to be formatted, `species`. The `.str` accessor applies string methods to every value in the column: `strip()` removes any leading and trailing spaces, and `lower()` converts all characters to lowercase. The two methods are chained to perform both operations in a single line, and the result is assigned back to the same column in the DataFrame.
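As a minimal sketch of why the vectorised `.str` accessor is convenient, the snippet below applies the same chain to a column that contains a missing value. The `sample` Series is made-up illustration data and is not part of the park dataset.

```python
import pandas as pd

# Hypothetical sample data, including one missing entry
sample = pd.Series([' Maple ', 'OAK', None, 'pine  '])

# .str methods pass missing values through instead of raising an error
cleaned = sample.str.strip().str.lower()

print(cleaned)
```

A plain list comprehension such as `[s.strip().lower() for s in sample]` would fail on the missing entry, because `None` has no `strip()` method.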

### Example 2

**Scenario**: In our dataset, some species have their age mentioned in years within the name in parentheses (e.g. "Maple (10 years)"). 

**Question**: Use a regex to extract the age in years from the species name and create a new column 'age'. Fill with 'Unknown' if there is no age present.


In [76]:

# Extracting age using a regular expression
# (a raw string keeps the backslashes intact; expand=False returns a Series)
df['age'] = df['species'].str.extract(r'(\d+\.\d+|\d+)', expand=False).fillna("Unknown")
print(df)


            species      age
0  maple (10 years)       10
1               oak  Unknown
2     pine(3 years)        3
3             maple  Unknown
4   oak (1.5 years)      1.5


In [75]:
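x = 'jack'
y
s = ''
# fillna() is a method of Series and DataFrame objects, not a top-level pandas
# function, so the call below raises an AttributeError
pd.fillna(s,"sss")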

AttributeError: module 'pandas' has no attribute 'fillna'

In [71]:
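# findall() returns a plain Python list, which has no .str accessor,
# so the call below raises an AttributeError
pattern = re.compile(r'\d+\.\d+|\d+')  # escape the dot so it matches a literal '.'
for dig in df['species']:
    print(pattern.findall(dig).str.fillna('Unknown'))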

AttributeError: 'list' object has no attribute 'str'

**Explanation**:

The regular expression `(\d+\.\d+|\d+)` is used here. This regex has two parts:
* `\d+\.\d+` matches a sequence of digits followed by a decimal point and then another sequence of digits, capturing decimal numbers.
* `\d+` matches a sequence of digits, capturing whole numbers.

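The `|` operator in the regex means "or", so the expression looks for either a decimal number or a whole number. The decimal alternative comes first so that "1.5" is captured whole rather than as "1".
`str.extract()` returns the first match in each string and leaves a missing value (`NaN`) where nothing matches; `fillna("Unknown")` then replaces those missing values.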
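Filling with "Unknown" leaves the `age` column as text. If the ages are needed for arithmetic, one alternative (a sketch only, and a departure from the question's "Unknown" requirement) is to keep the missing values and convert the matches to numbers. The `species` Series below is hypothetical sample data mirroring the strings used above.

```python
import pandas as pd

# Hypothetical sample mirroring the species strings used above
species = pd.Series(['maple (10 years)', 'oak', 'pine(3 years)', 'oak (1.5 years)'])

# Decimal alternative first, so '1.5' is captured whole rather than as '1'
age_text = species.str.extract(r'(\d+\.\d+|\d+)', expand=False)

# Convert the captured text to floats; rows without a match stay as NaN
age_years = pd.to_numeric(age_text, errors='coerce')

print(age_years)
print(age_years.mean())  # NaN values are skipped by default
```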

### Example 3

**Scenario**: After cleaning the dataset, we want to know how many unique species are recorded and their respective counts.

**Question**: Aggregate the data to count the number of occurrences of each unique species.

In [77]:
# Clean and standardise species names
df['species'] = df['species'].str.extract('([a-zA-Z]+)', expand=False).fillna('').str.strip().str.lower()

# Counting occurrences of each species
species_counts = df['species'].value_counts()
print(species_counts)


species
maple    2
oak      2
pine     1
Name: count, dtype: int64


**Explanation**: 
1. `df['species']` is used to access the 'species' column as a Series.
2. `.str.extract('([a-zA-Z]+)', expand=False)` extracts the first run of alphabetic characters from each value, which is the species name; `expand=False` makes it return a Series rather than a single-column DataFrame.
3. The subsequent operations, `fillna('').str.strip().str.lower()`, replace missing values with an empty string, then strip whitespace and convert to lowercase.
4. Finally, `value_counts()` is used to count the occurrences of each unique species in the 'species' Series.
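The effect of `expand=False` can be checked directly. The following sketch uses a hypothetical `species` Series with the same values as the example and compares the two return types; it also shows `normalize=True`, an optional `value_counts()` argument that returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical sample with the same values as the cleaned example column
species = pd.Series(['maple (10 years)', 'oak', 'pine(3 years)', 'maple', 'oak (1.5 years)'])

as_frame = species.str.extract(r'([a-zA-Z]+)')                  # default: expand=True
as_series = species.str.extract(r'([a-zA-Z]+)', expand=False)

print(type(as_frame).__name__)   # DataFrame
print(type(as_series).__name__)  # Series

# Proportions of each species rather than raw counts
print(as_series.value_counts(normalize=True))
```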

### Example 4

**Scenario**: The park authorities are interested in grouping species by their first letter to design different zones in the park.

**Question**: Create a new column 'zone' based on the first letter of each species name and group the DataFrame by this new column.

In [78]:
# Creating a 'zone' column based on the first letter
df['zone'] = df['species'].str[0]
grouped_data = df.groupby('zone').size()
print(grouped_data)

zone
m    2
o    2
p    1
dtype: int64


**Explanation**: The `zone` column is created by extracting the first letter of each species name using `str[0]`. Then, the `groupby` function is used to group the DataFrame by the 'zone' column, and `size()` is used to count the number of rows in each group. The function `count()` would return the same numbers here, but as a DataFrame with a column of non-null counts for each of the other columns, `species` and `age`.
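The difference between `size()` and `count()` is easiest to see side by side. The sketch below rebuilds a small stand-in DataFrame (`df_demo`, a hypothetical name) with the same values as the cleaned data.

```python
import pandas as pd

# Stand-in for the cleaned DataFrame from the examples above
df_demo = pd.DataFrame({
    'species': ['maple', 'oak', 'pine', 'maple', 'oak'],
    'age': ['10', 'Unknown', '3', 'Unknown', '1.5'],
})
df_demo['zone'] = df_demo['species'].str[0]

sizes = df_demo.groupby('zone').size()    # Series: number of rows per group
counts = df_demo.groupby('zone').count()  # DataFrame: non-null values per column

print(sizes)
print(counts)
```

Because `count()` ignores missing values, the two results can differ when a column contains `NaN`; here they agree because the missing ages were filled with "Unknown".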

## Summary

And with that, we've been introduced to key concepts related to manipulating and analysing text data using the Pandas library in Python. We've covered techniques such as cleaning and standardising text data, extracting information using regular expressions, and aggregating data based on text patterns. These concepts will enable us to effectively process and gain insights from textual information within DataFrames.


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px"/>
</div>