# Pandas and Text Methods

We will again use the `people` dataframe, this time with more people and columns:

In [121]:
import pandas as pd
names = ["Erika Schumacher", "Javi López", "Maria Rovira", "Ana Garamond", 
         "Shekhar Biswas", "Muriel Adams", "Saira Polom", "Alex Edwin", 
         "Kit Ching", "Dog Woof"]
ages = [22, 50, 23, 29, 44, 30, 25, 71, 35, 2]
nations = ["DE", "ES", "ES", "ES", "IN", "DE", "IN", "UK", "UK", "XX"]
sibilings = [2, 0, 4, 1, 1, 2, 3, 7, 0, 9]
colors = ["Red", "Yellow", "Yellow", "Blue", "Red", "Yellow", "Blue", "Blue", "Red", "Gray"]



people = pd.DataFrame({"name":names,
                       "age":ages,
                       "country":nations,
                       "sibilings":sibilings,
                       "favourite_color":colors
                      })

people.head()

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
1,Javi López,50,ES,0,Yellow
2,Maria Rovira,23,ES,4,Yellow
3,Ana Garamond,29,ES,1,Blue
4,Shekhar Biswas,44,IN,1,Red


## Filtering data based on conditions

Let's say we want to select only rows for people whose favourite color is "Yellow".

If we just type the condition (`favourite_color=="Yellow"`), we will create a Pandas Series of boolean values of the same length as the rows in the dataframe. It holds `True` for rows where the condition is met, and `False` otherwise:

In [48]:
people.favourite_color=="Yellow"

0    False
1     True
2     True
3    False
4    False
5     True
6    False
7    False
8    False
9    False
Name: favourite_color, dtype: bool

Note: a Pandas Series is like a list, but it has an index and all of its elements must share the same data type. You can think of it as a "single column dataframe".
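
To make the note above concrete, here is a minimal sketch using a made-up Series (the name `colors_demo` and its values are illustrative, not part of the exercise data). It shows that comparing a Series with a value returns a boolean Series that keeps the original index:

```python
import pandas as pd

# illustrative data, not taken from the `people` dataframe
colors_demo = pd.Series(["Red", "Yellow", "Blue"], index=["a", "b", "c"])

# comparing a Series with a value yields a boolean Series with the same index
mask = colors_demo == "Yellow"

print(mask.dtype)          # bool
print(list(mask.index))    # ['a', 'b', 'c']
print(colors_demo[mask])   # only the entry labelled 'b'
```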

We can pass this Series to the `loc[]` indexer we learned earlier to select only the rows that correspond to the `True` values:

In [4]:
people.loc[people.favourite_color=="Yellow",]

Unnamed: 0,name,age,country,sibilings,favourite_color
1,Javi López,50,ES,0,Yellow
2,Maria Rovira,23,ES,4,Yellow
5,Muriel Adams,30,DE,2,Yellow


**Exercise:** Filter the `people` dataframe and keep only people from the UK.

In [3]:
# code here
people.loc[people.country=="UK",]

Unnamed: 0,name,age,country,sibilings,favourite_color
7,Alex Edwin,71,UK,7,Blue
8,Kit Ching,35,UK,0,Red


**Exercise:** Filter the `people` dataframe and keep only people from either the UK or Germany (the country code for Germany is "DE"). 

Tip: To use two conditions inside of `loc[]`, wrap each condition in parentheses and separate them using logical operators `&` if you need both conditions to be met or `|` if meeting one of the conditions is enough.
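
As a rough illustration of the tip, the sketch below applies `&` and `|` to a small made-up dataframe (`demo` and its values are assumptions for this example only):

```python
import pandas as pd

# illustrative data; `demo` is a made-up dataframe
demo = pd.DataFrame({"country": ["UK", "DE", "ES", "UK"],
                     "age": [71, 22, 50, 35]})

# `&`: both conditions must hold
both = demo.loc[(demo.country == "UK") & (demo.age > 40)]

# `|`: at least one condition must hold
either = demo.loc[(demo.country == "UK") | (demo.age > 40)]

print(both.index.tolist())    # [0]
print(either.index.tolist())  # [0, 2, 3]
```

The parentheses matter because `&` and `|` bind more tightly than `==` and `>` in Python, so without them the expression would be grouped differently.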

In [13]:
# code here
people.loc[(people.country =="UK") | (people.country=="DE"),]

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
5,Muriel Adams,30,DE,2,Yellow
7,Alex Edwin,71,UK,7,Blue
8,Kit Ching,35,UK,0,Red
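
When several values of the same column are acceptable, `isin()` can be a shorter alternative to chaining `|`. A minimal sketch on made-up data (the `demo` dataframe is an assumption for illustration):

```python
import pandas as pd

# illustrative data; `demo` is a made-up dataframe
demo = pd.DataFrame({"name": ["Alex", "Erika", "Javi"],
                     "country": ["UK", "DE", "ES"]})

# `isin` checks each value against a list of allowed values
uk_or_de = demo.loc[demo.country.isin(["UK", "DE"])]

print(uk_or_de.name.tolist())  # ['Alex', 'Erika']
```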


**Exercise**: Filter the `people` dataframe and keep only the rows that meet both conditions:

- people from either the UK or Germany (the country code for Germany is "DE").
- people with 2 or more siblings (column `sibilings`).

In [15]:
# code here
# the country test gets its own parentheses so that `&` applies to the whole "UK or DE" condition
people.loc[((people.country=="UK") | (people.country=="DE")) & (people.sibilings >= 2),]

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
5,Muriel Adams,30,DE,2,Yellow
7,Alex Edwin,71,UK,7,Blue


## String Operations

The previous exercises could be solved by combining simple conditions based on equalities (`==`) or comparisons (`>`, `<`...). But when it comes to text data, the conditions are sometimes more complex. How would we select all the people whose name starts with a certain letter?

This is where Pandas string operations are really helpful. Go through [this user guide](https://pandas.pydata.org/docs/user_guide/text.html#string-methods) from the Pandas documentation; it is a good introduction to them. Here are some examples:

Filtering rows with name starting with A:

In [3]:
# we generate the boolean expression
people.name.str.startswith("A")

0    False
1    False
2    False
3     True
4    False
5    False
6    False
7     True
8    False
9    False
Name: name, dtype: bool

In [4]:
# and we pass it to `loc[]`
people.loc[people.name.str.startswith("A"),]

Unnamed: 0,name,age,country,sibilings,favourite_color
3,Ana Garamond,29,ES,1,Blue
7,Alex Edwin,71,UK,7,Blue


String methods can also change text:

In [11]:
# Names to lower case
people.name.str.lower()


0    erika schumacher
1          javi lópez
2        maria rovira
3        ana garamond
4      shekhar biswas
5        muriel adams
6         saira polom
7          alex edwin
8           kit ching
9            dog woof
Name: name, dtype: object

Note that we have just outputted these names, but we have not changed the original dataframe:

In [12]:
people.head(2)

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
1,Javi López,50,ES,0,Yellow


If we wanted to change the original dataframe, we would have to assign this output (the names in lower case) to the column we want to change. When doing that, select the column explicitly with `loc[]` rather than with attribute access (`DataFrame.column`), because attribute access cannot create new columns:

In [14]:
people.loc[:,"name"] = people.name.str.lower()
people.loc[:,"name"]

0    erika schumacher
1          javi lópez
2        maria rovira
3        ana garamond
4      shekhar biswas
5        muriel adams
6         saira polom
7          alex edwin
8           kit ching
9            dog woof
Name: name, dtype: object

In [15]:
# the original dataframe has been modified:
people.head(2)

Unnamed: 0,name,age,country,sibilings,favourite_color
0,erika schumacher,22,DE,2,Red
1,javi lópez,50,ES,0,Yellow
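
The difference between the assignment styles can be sketched on a small made-up dataframe (`demo`, `first_letter` and `initials` are hypothetical names used only for this illustration):

```python
import warnings
import pandas as pd

# illustrative data; `demo` is a made-up dataframe
demo = pd.DataFrame({"name": ["Ana Garamond", "Kit Ching"]})

# overwriting an existing column: select it explicitly, then assign
demo.loc[:, "name"] = demo["name"].str.upper()

# creating a new column works with bracket notation ...
demo["first_letter"] = demo["name"].str[0]

# ... but not with attribute access: pandas only sets a Python attribute
# (and emits a UserWarning, silenced here)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    demo.initials = demo["name"].str[:2]

print(demo.columns.tolist())  # ['name', 'first_letter']
```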


**Exercises:**

Select all people whose name contains (either in the first name or the surname) the letter `p`.

In [31]:
# code here
people = pd.DataFrame({"name":names,
                       "age":ages,
                       "country":nations,
                       "sibilings":sibilings,
                       "favourite_color":colors
                      })

# split the full name once: the first name stays in `name`, the rest goes to `Last_Name`
people[['name', 'Last_Name']] = people['name'].str.split(' ', n=1, expand=True)
# `case=False` matches both "p" and "P" in either part of the name
people.loc[people.name.str.contains("p", case=False) | people.Last_Name.str.contains("p", case=False)]

Unnamed: 0,name,age,country,sibilings,favourite_color,Last_Name
1,Javi,50,ES,0,Yellow,López
6,Saira,25,IN,3,Blue,Polom

Select all people whose first name + surname has more than 12 characters.

In [36]:
# code here
# length of first name plus surname (the space between them is not counted)
people['name_length'] = people['name'].str.len() + people['Last_Name'].str.len()
name_long = people.name_length > 12
people[name_long]

Unnamed: 0,name,age,country,sibilings,favourite_color,Last_Name,name_length
0,Erika,22,DE,2,Red,Schumacher,15
4,Shekhar,44,IN,1,Red,Biswas,13


Select all people whose surname starts with the letter `e`:

In [38]:
# code here
# compare in lower case so that "Edwin" matches the letter `e`
people.loc[people.Last_Name.str.lower().str.startswith("e")]

Unnamed: 0,name,age,country,sibilings,favourite_color,Last_Name,name_length
7,Alex,71,UK,7,Blue,Edwin,9


Create a new dataframe, `people_names`, where the first name and the last name are split into two different columns, `first_name` and `last_name`. The first row of the new dataframe should look like this:

`name           	first_name	last_name	age	country 	sibilings	favourite_color`

`erika schumacher	erika    	schumacher	22	DE      	2       	Red`

In [60]:
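# code here
# start from a copy so that `people` itself is left untouched
people_names = people.copy()
# after the earlier split, `name` holds the first name and `Last_Name` the surname
people_names["first_name"] = people_names["name"].str.lower()
people_names["last_name"] = people_names["Last_Name"].str.lower()
people_names["name"] = people_names["first_name"] + " " + people_names["last_name"]
people_names = people_names[["name", "first_name", "last_name", "age", "country", "sibilings", "favourite_color"]]
people_names

Unnamed: 0,name,first_name,last_name,age,country,sibilings,favourite_color
0,erika schumacher,erika,schumacher,22,DE,2,Red
1,javi lópez,javi,lópez,50,ES,0,Yellow
2,maria rovira,maria,rovira,23,ES,4,Yellow
3,ana garamond,ana,garamond,29,ES,1,Blue
4,shekhar biswas,shekhar,biswas,44,IN,1,Red
5,muriel adams,muriel,adams,30,DE,2,Yellow
6,saira polom,saira,polom,25,IN,3,Blue
7,alex edwin,alex,edwin,71,UK,7,Blue
8,kit ching,kit,ching,35,UK,0,Red
9,dog woof,dog,woof,2,XX,9,Gray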


## Cars challenges

Read the `vehicles.csv` dataset into a Pandas Dataframe called `cars`. We will use it for some extra challenges.

In [117]:
# code here
import pandas as pd
cars=pd.read_csv('vehicles.csv')
cars

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.437500,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.437500,2550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35947,smart,fortwo coupe,2013,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,244.000000,1100
35948,smart,fortwo coupe,2014,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,243.000000,1100
35949,smart,fortwo coupe,2015,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,244.000000,1100
35950,smart,fortwo coupe,2016,0.9,3.0,Auto(AM6),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,39,36,246.000000,1100


Create a column called `Auto` filled with either `True` or `False` depending on whether the transmission is Automatic or not.

In [118]:
# code here
# in the rows shown, automatic transmissions are labelled "Automatic ..." or "Auto(...)"
cars['Auto'] = cars.Transmission.str.startswith('Auto')
cars

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year,Auto
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950,True
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550,True
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.437500,2100,True
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550,True
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.437500,2550,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35947,smart,fortwo coupe,2013,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,244.000000,1100,True
35948,smart,fortwo coupe,2014,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,243.000000,1100,True
35949,smart,fortwo coupe,2015,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,244.000000,1100,True
35950,smart,fortwo coupe,2016,0.9,3.0,Auto(AM6),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,39,36,246.000000,1100,True


Create a column called `Speeds` that contains the number of speeds each transmission has, based on the number that appears in the column `Transmission`. For example, a transmission named "Automatic 4-spd" has 4 speeds, and one named "Auto(AM6)" has 6 speeds. If you find edge cases (e.g. numbers that do not make sense, no number at all...), use your own judgement to assign values to them.

Note: you will most likely need to use something called a "Regular Expression" or "regex" inside of the string method. Regular expressions are sequences of characters designed to match patterns. They can become really complex (to match complex patterns), but for this case, a simple [5 minute tutorial](https://www.youtube.com/watch?v=UQQsYXa1EHs&ab_channel=Kite) or a quick web search should be enough. Whenever you see people writing regex in plain Python, remember that you can use any regular expression directly inside of a Pandas `str` method. In the example below, we use the regular expression `"[v-z]"`, which means "match any lowercase letter between v and z (alphabetically)", in combination with the string method `str.contains()`:

In [119]:
# capture the first run of digits in each transmission name
speed_car = cars.Transmission.str.extract(r'(\d+)', expand=False)
# convert to integers; any transmission without a number would stay missing (<NA>)
cars['Speeds'] = pd.to_numeric(speed_car).astype('Int64')
cars

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year,Auto,Speeds
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950,True,3
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550,True,3
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.437500,2100,True,3
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550,True,3
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.437500,2550,True,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35947,smart,fortwo coupe,2013,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,244.000000,1100,True,5
35948,smart,fortwo coupe,2014,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,243.000000,1100,True,5
35949,smart,fortwo coupe,2015,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,244.000000,1100,True,5
35950,smart,fortwo coupe,2016,0.9,3.0,Auto(AM6),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,39,36,246.000000,1100,True,6


In [None]:
cars.Speeds

In [122]:
people.name.str.contains("[v-z]")

0    False
1     True
2     True
3    False
4     True
5    False
6    False
7     True
8    False
9    False
Name: name, dtype: bool
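
The same idea carries over to pulling numbers out of text. The sketch below uses made-up labels written in the style of the `Transmission` column (the `labels` Series is an assumption, not data read from `vehicles.csv`) and combines a regex with `str.extract()` and `str.contains()`:

```python
import pandas as pd

# made-up transmission labels in the style of the `Transmission` column
labels = pd.Series(["Automatic 4-spd", "Auto(AM6)", "Manual 5-spd", "Auto(AV)"])

# `\d+` matches one or more digits; the parentheses mark the part to extract
speeds = labels.str.extract(r"(\d+)", expand=False)

# labels without any digit come back as missing values
print(speeds.tolist())

# `^Auto` matches only labels that begin with "Auto"
print(labels.str.contains(r"^Auto").tolist())  # [True, True, False, True]
```

A label with no digits, like the last one here, is the kind of edge case where a judgement call is needed.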

In [141]:
cars = cars.rename(columns={"Fuel Type": "Fuel_type"})

In [143]:
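# code here
# `|` separates alternatives in a regex; square brackets would instead match
# any single character from the listed set
cars_gas = cars.Fuel_type.str.contains('CNG|Regular Gas|Propane')
# store the flags in a new column (`is_gas` is just an illustrative name):
# overwriting `Fuel_type` with booleans makes the next `.str` call fail with
# "AttributeError: Can only use .str accessor with string values!"
cars['is_gas'] = cars_gas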

Using string operations and your best judgement, clean the rest of the dataframe:

- Narrow down the "Fuel Type" column to 4-6 categories (include a category named "Others" if needed).
- Narrow down the "Vehicle Class" column to 4-8 categories.
- Remove non-alphanumeric characters from the "Drivetrain" and the "Make" column.
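
One possible way to approach these steps is sketched below on made-up values (the `demo` dataframe, the `main_fuels` list and the chosen categories are assumptions; the real columns may contain different labels and call for different groupings):

```python
import pandas as pd

# made-up values; the real columns may contain different labels
demo = pd.DataFrame({"Fuel Type": ["Regular", "Premium", "Diesel", "Gasoline or E85"],
                     "Drivetrain": ["2-Wheel Drive", "4-Wheel or All-Wheel Drive",
                                    "Rear-Wheel Drive", "Front-Wheel Drive"]})

# keep a few known categories and send everything else to "Others"
main_fuels = ["Regular", "Premium", "Diesel"]
demo["Fuel Type"] = demo["Fuel Type"].where(demo["Fuel Type"].isin(main_fuels), "Others")

# replace every character that is not a letter, digit or space with a space
demo["Drivetrain"] = demo["Drivetrain"].str.replace(r"[^A-Za-z0-9 ]", " ", regex=True)

print(demo["Fuel Type"].tolist())   # ['Regular', 'Premium', 'Diesel', 'Others']
print(demo["Drivetrain"].tolist())
```

Replacing with a space rather than an empty string keeps words such as "2 Wheel" separated; whether that is preferable depends on how the cleaned column will be used.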

In [None]:
# code here