
# pandas and text methods

We will create a DataFrame to practice with:

In [71]:
import pandas as pd

names = ["Erika Schumacher", "Javi López", "Maria Rovira", "Ana Garamond", 
         "Shekhar Biswas", "Muriel Adams", "Saira Polom", "Alex Edwin", 
         "Kit Ching", "Dog Woof"]
ages = [22, 50, 23, 29, 44, 30, 25, 71, 35, 2]
nations = ["DE", "ES", "ES", "ES", "IN", "DE", "IN", "UK", "UK", "XX"]
sibilings = [2, 0, 4, 1, 1, 2, 3, 7, 0, 9]
colors = ["Red", "Yellow", "Yellow", "Blue", "Red", "Yellow", "Blue", "Blue", "Red", "Gray"]



people = pd.DataFrame({"name":names,
                       "age":ages,
                       "country":nations,
                       "sibilings":sibilings,
                       "favourite_color":colors
                      })

people.head()

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
1,Javi López,50,ES,0,Yellow
2,Maria Rovira,23,ES,4,Yellow
3,Ana Garamond,29,ES,1,Blue
4,Shekhar Biswas,44,IN,1,Red


## String Operations

You have already learned how to filter data with simple conditions, like getting all people whose favourite colour is "Red":

In [4]:
people.loc[people["favourite_color"]=="Red",:]

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
4,Shekhar Biswas,44,IN,1,Red
8,Kit Ching,35,UK,0,Red


When it comes to text data, sometimes the conditions are more complex. How would we select all the people whose name starts with a certain letter? 

This is where pandas string operations are really helpful. The [user guide](https://pandas.pydata.org/docs/user_guide/text.html#string-methods) in the pandas documentation is a good introduction. Here are some examples:

Filtering rows with name starting with A:

- first we generate the boolean expression

In [None]:
# https://pandas.pydata.org/docs/reference/api/pandas.Series.str.startswith.html

people.name.str.startswith("A")

0    False
1    False
2    False
3     True
4    False
5    False
6    False
7     True
8    False
9    False
Name: name, dtype: bool

- and then pass it to `loc[]`

In [11]:
people.loc[people.name.str.startswith("A"), :]

Unnamed: 0,name,age,country,sibilings,favourite_color
3,Ana Garamond,29,ES,1,Blue
7,Alex Edwin,71,UK,7,Blue
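The mask-then-`loc[]` pattern assumes that every name is present. As a small sketch on a hypothetical `names_with_gap` frame (an assumption, not part of the practice data), passing `na=False` to `str.startswith()` treats a missing name as a non-match, so the mask stays purely boolean and can be passed to `loc[]`:

```python
import pandas as pd

# Hypothetical sample with one missing name (not part of the `people` data)
names_with_gap = pd.DataFrame(
    {"name": ["Ana Garamond", None, "Alex Edwin", "Kit Ching"]}
)

# na=False makes a missing value count as "does not start with A"
starts_with_a = names_with_gap["name"].str.startswith("A", na=False)

# The boolean mask selects the matching rows
a_names = names_with_gap.loc[starts_with_a, :]
print(a_names)
```

Without `na=False`, the result for a missing value depends on the column's dtype and the pandas version, so setting it explicitly is the safer choice when a mask is needed.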


String methods can also change text:

In [None]:
# https://pandas.pydata.org/docs/reference/api/pandas.Series.str.lower.html

people.name.str.lower()

0    erika schumacher
1          javi lópez
2        maria rovira
3        ana garamond
4      shekhar biswas
5        muriel adams
6         saira polom
7          alex edwin
8           kit ching
9            dog woof
Name: name, dtype: object

Note that we have only displayed the lower-case names; the original dataframe has not changed:

In [23]:
people["name"]

0    Erika Schumacher
1          Javi López
2        Maria Rovira
3        Ana Garamond
4      Shekhar Biswas
5        Muriel Adams
6         Saira Polom
7          Alex Edwin
8           Kit Ching
9            Dog Woof
Name: name, dtype: object

pandas will not change the original data unless you explicitly tell it to. To change the original dataframe, we have to assign this output (the names in lower case) to the column we want to change. When doing that, it is important that you select that column using `loc[]`, and not simply `DataFrame.column`:

In [24]:
people.loc[:,"name"] = people.name.str.lower()

In [25]:
# now the original dataframe has been modified:
people.head(2)

Unnamed: 0,name,age,country,sibilings,favourite_color
0,erika schumacher,22,DE,2,Red
1,javi lópez,50,ES,0,Yellow
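One reason to avoid attribute-style assignment can be sketched on a hypothetical two-row frame called `demo` (the frame and the column names `upper_name` and `lower_name` are assumptions made for this illustration). When the name is not yet a column, `demo.upper_name = ...` only sets a Python attribute, while `loc[]` creates a real column:

```python
import warnings
import pandas as pd

# Hypothetical two-row frame used only for this illustration
demo = pd.DataFrame({"name": ["Erika Schumacher", "Javi López"]})

# Attribute-style assignment with a NEW name does not create a column.
# pandas emits a UserWarning here, which is silenced to keep the output clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    demo.upper_name = demo["name"].str.upper()

print("upper_name" in demo.columns)

# Assigning through loc[] creates (or overwrites) a real column
demo.loc[:, "lower_name"] = demo["name"].str.lower()
print(demo.columns.tolist())
```

Attribute access does work for reading an existing column, which is why `people.name.str.lower()` is fine on the right-hand side of the assignment.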


###### **Exercise 1:**
Select all people whose name contains the letter `p`, either in the first name or in the surname.

In [35]:
# str.startswith() only checks the beginning of the string, and no name starts with "P",
# so we need a different method
people.name.str.startswith("P")

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: name, dtype: bool

In [36]:
# We can use .str.contains() and give it "P" as an argument
# There is one match; let's make it easier to see with .loc
people.name.str.contains("P")

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7    False
8    False
9    False
Name: name, dtype: bool

In [37]:
# As we can see, one person fits the criterion
people.loc[people.name.str.contains("P")]

Unnamed: 0,name,age,country,sibilings,favourite_color
6,Saira Polom,25,IN,3,Blue
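`str.contains("P")` is case-sensitive, so it only finds an upper-case "P". Since the exercise asks for the letter `p` in general, a case-insensitive search may be closer to the intent. The following sketch rebuilds the names as a standalone Series (an assumption made so the block runs on its own) and passes `case=False`:

```python
import pandas as pd

# The names from the practice data, rebuilt here so the block is self-contained
names = pd.Series(
    ["Erika Schumacher", "Javi López", "Maria Rovira", "Ana Garamond",
     "Shekhar Biswas", "Muriel Adams", "Saira Polom", "Alex Edwin",
     "Kit Ching", "Dog Woof"],
    name="name",
)

# case=False ignores upper/lower case, so both "P" and "p" match
has_p = names.str.contains("p", case=False)

matches = names[has_p]
print(matches)
```

With this mask, "Javi López" is selected in addition to "Saira Polom", because "López" contains a lower-case "p".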


###### **Exercise 2:**
Select all people whose full name (first name plus surname) has more than 12 characters.

In [42]:
# First we need the length of each string,
# which .str.len() returns
people.name.str.len()

0    16
1    10
2    12
3    12
4    14
5    12
6    11
7    10
8     9
9     8
Name: name, dtype: int64

In [43]:
# Next we keep only the strings that have more than 12 characters
# The comparison returns boolean values
people.name.str.len() > 12

0     True
1    False
2    False
3    False
4     True
5    False
6    False
7    False
8    False
9    False
Name: name, dtype: bool

In [44]:
# Now we can pass that condition to .loc to select the matching rows
# Note that a space (" ") also counts as one character
people.loc[people.name.str.len() > 12]

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
4,Shekhar Biswas,44,IN,1,Red
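Because the space counts as a character, the result can differ if only letters should be counted. As a rough sketch on a subset of the names (the variable names `letters_only` and `long_names` are assumptions), the space can be removed with `str.replace()` before measuring the length:

```python
import pandas as pd

# A subset of the practice names, rebuilt so the block is self-contained
names = pd.Series(
    ["Erika Schumacher", "Maria Rovira", "Shekhar Biswas", "Kit Ching"],
    name="name",
)

# Remove the space first, so only the letters are counted
letters_only = names.str.replace(" ", "", regex=False).str.len()
print(letters_only)

# Same filter as before, now based on the letter count
long_names = names[letters_only > 12]
print(long_names)
```

For this data the selection stays the same, since both long names still have more than 12 letters once the space is removed.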


###### **Exercise 3:**
Select all people whose surname starts with the letter `e`:

In [60]:
# The problem is that first name and surname are stored in a single string
people["name"]

0    Erika Schumacher
1          Javi López
2        Maria Rovira
3        Ana Garamond
4      Shekhar Biswas
5        Muriel Adams
6         Saira Polom
7          Alex Edwin
8           Kit Ching
9            Dog Woof
Name: name, dtype: object

In [59]:
# Checking the first name is easy with .str.startswith(),
# but the surname is harder to reach because the names are not split into separate columns



In [64]:
# So let's separate the names into different columns
# For that we create a new DF that contains only the names, split into first and last names
name_df = (people['name'].str
                 .split(' ',expand = True)
                 .rename(columns = {0:'first name',1:'last name'}))

In [65]:
# Now the names are separated, which makes filtering much easier
name_df

Unnamed: 0,first name,last name
0,Erika,Schumacher
1,Javi,López
2,Maria,Rovira
3,Ana,Garamond
4,Shekhar,Biswas
5,Muriel,Adams
6,Saira,Polom
7,Alex,Edwin
8,Kit,Ching
9,Dog,Woof


In [67]:
name_df.loc[name_df["last name"].str.startswith("E")]

Unnamed: 0,first name,last name
7,Alex,Edwin


###### **Exercise 4:**
Create a new dataframe, `people_names`, where the first name and the last name are split into two different columns, `first_name` and `last_name`. The first row of the new dataframe should look like this:

`name           	first_name	last_name	age	country 	sibilings	favourite_color`

`erika schumacher	erika    	schumacher	22	DE      	2       	Red`

In [84]:
# Here we mostly reuse the previous approach and add the remaining columns to the new DF
# We already separated the names in the previous exercise; they are stored in name_df

name_df

Unnamed: 0,first name,last name
0,Erika,Schumacher
1,Javi,López
2,Maria,Rovira
3,Ana,Garamond
4,Shekhar,Biswas
5,Muriel,Adams
6,Saira,Polom
7,Alex,Edwin
8,Kit,Ching
9,Dog,Woof


In [86]:
# The other DF is people
people

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
1,Javi López,50,ES,0,Yellow
2,Maria Rovira,23,ES,4,Yellow
3,Ana Garamond,29,ES,1,Blue
4,Shekhar Biswas,44,IN,1,Red
5,Muriel Adams,30,DE,2,Yellow
6,Saira Polom,25,IN,3,Blue
7,Alex Edwin,71,UK,7,Blue
8,Kit Ching,35,UK,0,Red
9,Dog Woof,2,XX,9,Gray


In [101]:
# As a test, create a new DF by joining the two tables
# This works; now the columns need to be reordered
test_df_3 = people.join(name_df)
test_df_3

Unnamed: 0,name,age,country,sibilings,favourite_color,first name,last name
0,Erika Schumacher,22,DE,2,Red,Erika,Schumacher
1,Javi López,50,ES,0,Yellow,Javi,López
2,Maria Rovira,23,ES,4,Yellow,Maria,Rovira
3,Ana Garamond,29,ES,1,Blue,Ana,Garamond
4,Shekhar Biswas,44,IN,1,Red,Shekhar,Biswas
5,Muriel Adams,30,DE,2,Yellow,Muriel,Adams
6,Saira Polom,25,IN,3,Blue,Saira,Polom
7,Alex Edwin,71,UK,7,Blue,Alex,Edwin
8,Kit Ching,35,UK,0,Red,Kit,Ching
9,Dog Woof,2,XX,9,Gray,Dog,Woof


In [106]:
# With .loc we can easily rearrange the order of the columns
# Keep in mind that this does not change test_df_3 itself
test_df_3.loc[:,['name','first name','last name','age','country','sibilings','favourite_color']]

Unnamed: 0,name,first name,last name,age,country,sibilings,favourite_color
0,Erika Schumacher,Erika,Schumacher,22,DE,2,Red
1,Javi López,Javi,López,50,ES,0,Yellow
2,Maria Rovira,Maria,Rovira,23,ES,4,Yellow
3,Ana Garamond,Ana,Garamond,29,ES,1,Blue
4,Shekhar Biswas,Shekhar,Biswas,44,IN,1,Red
5,Muriel Adams,Muriel,Adams,30,DE,2,Yellow
6,Saira Polom,Saira,Polom,25,IN,3,Blue
7,Alex Edwin,Alex,Edwin,71,UK,7,Blue
8,Kit Ching,Kit,Ching,35,UK,0,Red
9,Dog Woof,Dog,Woof,2,XX,9,Gray


In [108]:
# To keep the result, assign it to a new variable
people_names = test_df_3.loc[:,['name','first name','last name','age','country','sibilings','favourite_color']]

In [109]:
people_names

Unnamed: 0,name,first name,last name,age,country,sibilings,favourite_color
0,Erika Schumacher,Erika,Schumacher,22,DE,2,Red
1,Javi López,Javi,López,50,ES,0,Yellow
2,Maria Rovira,Maria,Rovira,23,ES,4,Yellow
3,Ana Garamond,Ana,Garamond,29,ES,1,Blue
4,Shekhar Biswas,Shekhar,Biswas,44,IN,1,Red
5,Muriel Adams,Muriel,Adams,30,DE,2,Yellow
6,Saira Polom,Saira,Polom,25,IN,3,Blue
7,Alex Edwin,Alex,Edwin,71,UK,7,Blue
8,Kit Ching,Kit,Ching,35,UK,0,Red
9,Dog Woof,Dog,Woof,2,XX,9,Gray
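The exercise statement asks for columns named `first_name` and `last_name` and shows the names in lower case, while the result above uses `first name` and `last name` with the original capitalisation. One possible way to match the requested layout exactly is sketched below on the first three rows of the data; the helper variable `parts` and the use of `DataFrame.insert()` are choices made for this sketch, not the only option:

```python
import pandas as pd

# First three rows of the practice data, rebuilt so the block is self-contained
people = pd.DataFrame({
    "name": ["Erika Schumacher", "Javi López", "Maria Rovira"],
    "age": [22, 50, 23],
    "country": ["DE", "ES", "ES"],
    "sibilings": [2, 0, 4],
    "favourite_color": ["Red", "Yellow", "Yellow"],
})

# Work on a copy so the original `people` stays unchanged
people_names = people.copy()
people_names["name"] = people_names["name"].str.lower()

# n=1 splits at the first space only, which yields exactly two columns here
parts = people_names["name"].str.split(" ", n=1, expand=True)

# insert() places each new column at a fixed position, right after "name"
people_names.insert(1, "first_name", parts[0])
people_names.insert(2, "last_name", parts[1])

print(people_names.head(1))
```

Using `insert()` avoids the separate join and reordering steps, at the cost of modifying `people_names` in place.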
