# Python for Data Science
## Session 7
### Datasets – Pandas II

---

## Exercises
### Handling Missing Data and Data Transformation



1. Create a new column called **professor_initials** that stores the initials of each professor's first and last names. Use the following data:

In [1]:
import pandas as pd

In [2]:
data = {
    'professor': ['Ludmila Kuncheva', 'Antonio Torralba', 'Manuel Gonzalez', 'Bastian Leibe'],
    'department': ['Computer Science', 'Computer Vision', 'AI & Robotics', 'Autonomous Systems'],
    'age': [45, 50, 47, 38]
}

df = pd.DataFrame(data)
df

,professor,department,age
0,Ludmila Kuncheva,Computer Science,45
1,Antonio Torralba,Computer Vision,50
2,Manuel Gonzalez,AI & Robotics,47
3,Bastian Leibe,Autonomous Systems,38


In [3]:
# We extract the initials of each professor
df['professor_initials'] = df['professor'].apply(lambda x: ''.join([name[0] for name in x.split()]))

df # Print the dataframe to check it worked.


,professor,department,age,professor_initials
0,Ludmila Kuncheva,Computer Science,45,LK
1,Antonio Torralba,Computer Vision,50,AT
2,Manuel Gonzalez,AI & Robotics,47,MG
3,Bastian Leibe,Autonomous Systems,38,BL


<div class="alert alert-warning"> We run the same code on a second dataset to check that it also works when a professor has more than two names or surnames. </div>

In [4]:
data_test = {
    'professor': ['Ludmila Kuncheva', 'Antonio Torralba','Ricardo Jose Suarez','Carlota Julieta Lopez Obrador'],
    'department': ['Computer Science', 'Computer Vision','Algebra','Social Sciences'],
    'age': [45, 50, 25, 31]
}
df_test = pd.DataFrame(data_test)
df_test['professor_initials'] = df_test['professor'].apply(lambda x: ''.join([name[0] for name in x.split()]))

df_test

,professor,department,age,professor_initials
0,Ludmila Kuncheva,Computer Science,45,LK
1,Antonio Torralba,Computer Vision,50,AT
2,Ricardo Jose Suarez,Algebra,25,RJS
3,Carlota Julieta Lopez Obrador,Social Sciences,31,CJLO
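
The `apply` with a lambda above raises an error if a name is missing, because `None` or `NaN` has no `split` method. As a minimal sketch of one alternative, the pandas `.str` accessor can build the same initials with a regular expression and leaves missing entries as missing. The `professors` series below, including its empty entry, is made up for illustration.

```python
import pandas as pd

# Illustrative data: the last entry is deliberately missing
professors = pd.Series(
    ['Ludmila Kuncheva', 'Carlota Julieta Lopez Obrador', None],
    name='professor',
)

# r'\b\w' matches the first character of each word in the name;
# findall returns a list of those characters per row,
# and str.join('') concatenates each list into one string
initials = professors.str.findall(r'\b\w').str.join('')

print(initials)
```

The first two rows give `LK` and `CJLO`, matching the lambda version, while the missing name stays missing instead of raising an exception. Note that `\b` also treats hyphens as word boundaries, so a hyphenated name would contribute one initial per part.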


2. Given the DataFrame below, use **join** to combine it with the original DataFrame based on the professor column.

In [5]:
courses_data = {
    'professor': ['Ludmila Kuncheva', 'Antonio Torralba', 'Manuel Gonzalez', 'Bastian Leibe'],
    'courses': ['Machine Learning', 'Computer Vision', 'AI Programming', 'Self-Driving Cars']
}
df_courses = pd.DataFrame(courses_data)
df_courses

,professor,courses
0,Ludmila Kuncheva,Machine Learning
1,Antonio Torralba,Computer Vision
2,Manuel Gonzalez,AI Programming
3,Bastian Leibe,Self-Driving Cars


In [6]:
# Set 'professor' as the index of df so that join can match against it
df.set_index('professor', inplace=True)

# Inner join: match the 'professor' column of df_courses against the index of df
df_courses = df_courses.join(df, on='professor', how='inner')

df_courses

,professor,courses,department,age,professor_initials
0,Ludmila Kuncheva,Machine Learning,Computer Science,45,LK
1,Antonio Torralba,Computer Vision,Computer Vision,50,AT
2,Manuel Gonzalez,AI Programming,AI & Robotics,47,MG
3,Bastian Leibe,Self-Driving Cars,Autonomous Systems,38,BL


In [None]:
# Alternative: set 'professor' as the index of both DataFrames and join on the index.
courses_data = {
    'professor': ['Ludmila Kuncheva', 'Antonio Torralba', 'Manuel Gonzalez', 'Bastian Leibe'],
    'courses': ['Machine Learning', 'Computer Vision', 'AI Programming', 'Self-Driving Cars']
}

df_courses = pd.DataFrame(courses_data)
df_courses.set_index('professor', inplace=True)

# Joining the DataFrames on the index
df_combined = df.join(df_courses)

df_combined

professor,department,age,professor_initials,courses
Ludmila Kuncheva,Computer Science,45,LK,Machine Learning
Antonio Torralba,Computer Vision,50,AT,Computer Vision
Manuel Gonzalez,AI & Robotics,47,MG,AI Programming
Bastian Leibe,Autonomous Systems,38,BL,Self-Driving Cars
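
In the exercise every professor appears in both DataFrames, so the choice of `how` makes no visible difference. The sketch below, using a reduced dataset with a hypothetical extra professor ('Ada Example') who has no match, illustrates how `inner` and `left` joins diverge. It also uses `set_index` without `inplace=True`, so the original `df` keeps its professor column.

```python
import pandas as pd

df = pd.DataFrame({
    'professor': ['Ludmila Kuncheva', 'Antonio Torralba'],
    'age': [45, 50],
})
df_courses = pd.DataFrame({
    # 'Ada Example' is a hypothetical professor with no row in df
    'professor': ['Ludmila Kuncheva', 'Ada Example'],
    'courses': ['Machine Learning', 'Statistics'],
})

# set_index returns a new DataFrame; df itself is left unchanged
ages = df.set_index('professor')

# inner keeps only professors present in both DataFrames
inner = df_courses.join(ages, on='professor', how='inner')

# left keeps every row of df_courses and fills unmatched columns with NaN
left = df_courses.join(ages, on='professor', how='left')

print(inner)
print(left)
```

The inner join drops the unmatched professor, while the left join keeps that row with a missing age, which is one common way missing data enters a dataset.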


3. Combine the original df and df_courses DataFrames, this time using **merge**.

In [9]:
data = {
    'professor': ['Ludmila Kuncheva', 'Antonio Torralba', 'Manuel Gonzalez', 'Bastian Leibe'],
    'department': ['Computer Science', 'Computer Vision', 'AI & Robotics', 'Autonomous Systems'],
    'age': [45, 50, 47, 38]
}

df = pd.DataFrame(data)

courses_data = {
    'professor': ['Ludmila Kuncheva', 'Antonio Torralba', 'Manuel Gonzalez', 'Bastian Leibe'],
    'courses': ['Machine Learning', 'Computer Vision', 'AI Programming', 'Self-Driving Cars']
}
df_courses = pd.DataFrame(courses_data)

In [None]:
# Now we use the merge function
df_combined = pd.merge(df, df_courses, on='professor')

df_combined

,professor,department,age,courses
0,Ludmila Kuncheva,Computer Science,45,Machine Learning
1,Antonio Torralba,Computer Vision,50,Computer Vision
2,Manuel Gonzalez,AI & Robotics,47,AI Programming
3,Bastian Leibe,Autonomous Systems,38,Self-Driving Cars
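
When merging, it can help to confirm that the keys line up as expected. The following sketch, on a shortened copy of the data, uses two optional `pd.merge` parameters: `validate='one_to_one'`, which raises an error if a professor is duplicated in either DataFrame, and `indicator=True`, which adds a `_merge` column saying where each row came from.

```python
import pandas as pd

df = pd.DataFrame({
    'professor': ['Ludmila Kuncheva', 'Antonio Torralba'],
    'age': [45, 50],
})
df_courses = pd.DataFrame({
    'professor': ['Ludmila Kuncheva', 'Antonio Torralba'],
    'courses': ['Machine Learning', 'Computer Vision'],
})

merged = pd.merge(
    df,
    df_courses,
    on='professor',
    how='outer',             # keep rows from both sides, matched or not
    validate='one_to_one',   # fail early if a key is duplicated
    indicator=True,          # adds '_merge': 'both', 'left_only' or 'right_only'
)

print(merged)
```

Here every row is marked `both`, which confirms that each professor was found in both DataFrames.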


4. Using the professor column, create a new column professor_last_name by extracting each professor's last name with string operations.

In [None]:
# Creating professor_last_name column by extracting the last name
df_combined['professor_last_name'] = df_combined['professor'].apply(lambda x: x.split()[-1])

df_combined

,professor,department,age,courses,professor_last_name
0,Ludmila Kuncheva,Computer Science,45,Machine Learning,Kuncheva
1,Antonio Torralba,Computer Vision,50,Computer Vision,Torralba
2,Manuel Gonzalez,AI & Robotics,47,AI Programming,Gonzalez
3,Bastian Leibe,Autonomous Systems,38,Self-Driving Cars,Leibe


In [14]:
# The same approach extracts the first name
df_combined['professor_first_name'] = df_combined['professor'].apply(lambda x: x.split()[0])

df_combined

,professor,department,age,courses,professor_last_name,professor_first_name
0,Ludmila Kuncheva,Computer Science,45,Machine Learning,Kuncheva,Ludmila
1,Antonio Torralba,Computer Vision,50,Computer Vision,Torralba,Antonio
2,Manuel Gonzalez,AI & Robotics,47,AI Programming,Gonzalez,Manuel
3,Bastian Leibe,Autonomous Systems,38,Self-Driving Cars,Leibe,Bastian
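
The same columns can also be built with the pandas `.str` accessor instead of `apply`. This is a minimal sketch on a small illustrative DataFrame (`df_names` is not part of the exercise data): the names are split once, and the first and last pieces are then picked by position.

```python
import pandas as pd

df_names = pd.DataFrame({
    'professor': ['Ludmila Kuncheva', 'Carlota Julieta Lopez Obrador'],
})

# Split each name on whitespace; every row becomes a list of words
parts = df_names['professor'].str.split()

# .str[i] indexes into each list, just like x.split()[i] in the lambda
df_names['professor_first_name'] = parts.str[0]
df_names['professor_last_name'] = parts.str[-1]

print(df_names)
```

Both versions take the last word as the last name, so a two-word surname such as "Lopez Obrador" is reduced to "Obrador". Handling that case would need extra information about where the surname starts.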



### Summary

Today we explored more advanced ways of manipulating **dataframes** and **series**: applying functions to columns, merging and joining multiple datasets, replacing missing values, and working with strings, which relies on regex behind the scenes.