# Introduction to Pandas

Pandas is a widely used Python library for working with tabular objects called DataFrames, which are central to data science. This section covers the basics of Pandas; for more depth, you are strongly encouraged to read Chapter 3 of the [Python for Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/).

# Importing Pandas
Pandas is not part of the Python standard library, so it must be installed (for example with `pip install pandas`) and then imported before use.

In [2]:
import pandas as pd
import numpy as np

# DataFrame Basics
A dataframe is essentially a table: it has rows and columns, and each column has a heading. Just as we work with large tables in Excel, we want to work with data in a similar way in Python. The advantages of Pandas are that it can handle far larger tables than Excel, that data wrangling is written as code and is therefore more robust and repeatable, and that, once learned, it is often easier than doing the same work in Excel.

In [3]:
# Create a DataFrame
## From a Python Dictionary
df = pd.DataFrame({'name': ['Bob', 'Lisa', 'Mike', 'Fatima', 'Zahra', 'Ali', 'Maryam', 'Maryam'],
'Age': [23, 23, 25, 19, 42, '54', 107, 107],
'Sex': ['M', 'F', 'M', 'F', 'F', 'M', 'F', 'F'],
'Job': ['Designer', 'Marketing Manager', 'Product Manager', 'Software Engineer', 'Data Scientist', 'Machine Learning Engineer', np.nan, np.nan],  # np.nan marks a missing value (the np.NaN alias was removed in NumPy 2.0)
'Hobbies': ['Waterskiing', 'Skydiving', 'Rock Climbing', 'Skateboarding', 'Baking', 'Improv Acting', 'Trolling', 'Trolling']})

df

Unnamed: 0,name,Age,Sex,Job,Hobbies
0,Bob,23,M,Designer,Waterskiing
1,Lisa,23,F,Marketing Manager,Skydiving
2,Mike,25,M,Product Manager,Rock Climbing
3,Fatima,19,F,Software Engineer,Skateboarding
4,Zahra,42,F,Data Scientist,Baking
5,Ali,54,M,Machine Learning Engineer,Improv Acting
6,Maryam,107,F,,Trolling
7,Maryam,107,F,,Trolling


In [6]:
# Preview a Dataframe
## Head: view first 5 rows
df.head()
#df.head(3)

Unnamed: 0,name,Age,Sex,Job,Hobbies
0,Bob,23,M,Designer,Waterskiing
1,Lisa,23,F,Marketing Manager,Skydiving
2,Mike,25,M,Product Manager,Rock Climbing
3,Fatima,19,F,Software Engineer,Skateboarding
4,Zahra,42,F,Data Scientist,Baking


In [7]:
## Tail: view final 5 rows
df.tail()
#df.tail(3)

Unnamed: 0,name,Age,Sex,Job,Hobbies
3,Fatima,19,F,Software Engineer,Skateboarding
4,Zahra,42,F,Data Scientist,Baking
5,Ali,54,M,Machine Learning Engineer,Improv Acting
6,Maryam,107,F,,Trolling
7,Maryam,107,F,,Trolling


In [8]:
## Show column names 
df.columns

Index(['name', 'Age', 'Sex', 'Job', 'Hobbies'], dtype='object')

In [9]:
## Shape: Get the number of rows and columns of a dataframe
df.shape

(8, 5)

In [11]:
print(len(df)) ## Rows
print(len(df.columns)) ## Columns

8
5


In [12]:
## Show data types for each column in a dataframe
df.dtypes


name       object
Age        object
Sex        object
Job        object
Hobbies    object
dtype: object

In [13]:
## Show the data type of a particular column in a dataframe
df['Sex'].dtype

dtype('O')

In [14]:
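## Subsetting columns in a dataframe. We use square brackets, as we do with lists
## A list of column names returns a dataframe with just those columns
df[['name', 'Hobbies']]
#df['name'] # A single column name returns a Series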

Unnamed: 0,name,Hobbies
0,Bob,Waterskiing
1,Lisa,Skydiving
2,Mike,Rock Climbing
3,Fatima,Skateboarding
4,Zahra,Baking
5,Ali,Improv Acting
6,Maryam,Trolling
7,Maryam,Trolling
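
The difference between single and double square brackets is easy to check. The following is a minimal sketch on a small hypothetical dataframe called `people` (not the `df` used above): one column name returns a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Hypothetical two-column dataframe used only for this illustration
people = pd.DataFrame({'name': ['Bob', 'Lisa'],
                       'Hobbies': ['Waterskiing', 'Skydiving']})

single = people['name']             # one column name -> pandas Series
double = people[['name']]           # list with one name -> one-column DataFrame
both = people[['name', 'Hobbies']]  # list of names -> DataFrame with those columns

print(type(single).__name__)  # Series
print(type(double).__name__)  # DataFrame
print(both.shape)             # (2, 2)
```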


In [16]:
# Subset rows with iloc
df.iloc[[3], :] # Gets the row at position 3 (the 4th row, since positions start at 0) and all columns

Unnamed: 0,name,Age,Sex,Job,Hobbies
3,Fatima,19,F,Software Engineer,Skateboarding


In [17]:
df.iloc[[3,4], :] # Gets the rows at positions 3 and 4 and all columns

Unnamed: 0,name,Age,Sex,Job,Hobbies
3,Fatima,19,F,Software Engineer,Skateboarding
4,Zahra,42,F,Data Scientist,Baking


In [19]:
df.iloc[:, [2]] # Gets all rows and only the column at position 2 (the 3rd column)


Unnamed: 0,Sex
0,M
1,F
2,M
3,F
4,F
5,M
6,F
7,F


In [20]:
df.iloc[:, [3,4]] # Gets all rows and the columns at positions 3 and 4


Unnamed: 0,Job,Hobbies
0,Designer,Waterskiing
1,Marketing Manager,Skydiving
2,Product Manager,Rock Climbing
3,Software Engineer,Skateboarding
4,Data Scientist,Baking
5,Machine Learning Engineer,Improv Acting
6,,Trolling
7,,Trolling


In [21]:
df.iloc[[0,1], [3,4]] # Gets the rows at positions 0 and 1 and the columns at positions 3 and 4


Unnamed: 0,Job,Hobbies
0,Designer,Waterskiing
1,Marketing Manager,Skydiving
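
`iloc` selects by integer position. Its label-based counterpart is `loc`. The sketch below uses a hypothetical dataframe `pets` with string row labels so that positions and labels differ. Note that an `iloc` slice excludes its end position while a `loc` slice includes its end label.

```python
import pandas as pd

# Hypothetical dataframe with string row labels so positions and labels differ
pets = pd.DataFrame({'animal': ['cat', 'dog', 'emu'], 'legs': [4, 4, 2]},
                    index=['a', 'b', 'c'])

by_position = pets.iloc[[0, 1], [1]]        # rows at positions 0 and 1, column at position 1
by_label = pets.loc[['a', 'b'], ['legs']]   # the same cells, selected by label

pos_slice = pets.iloc[0:2]       # positions 0 and 1 (end position excluded)
label_slice = pets.loc['a':'b']  # labels 'a' and 'b' (end label included)

print(by_position.equals(by_label))      # True
print(len(pos_slice), len(label_slice))  # 2 2
```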


In [26]:
## Get unique values for a column
df['Age'].unique()



array([23, 25, 19, 42, '54', 107], dtype=object)

In [27]:
# Descriptive Stats for a Dataframe
df.describe()

Unnamed: 0,name,Age,Sex,Job,Hobbies
count,8,8,8,6,8
unique,7,6,2,6,7
top,Maryam,23,F,Designer,Trolling
freq,2,2,5,1,2


In [28]:
# We can count how often each complete row occurs. Rows containing a null value (the two Maryam rows) are left out by default
df.value_counts()

name    Age  Sex  Job                        Hobbies      
Ali     54   M    Machine Learning Engineer  Improv Acting    1
Bob     23   M    Designer                   Waterskiing      1
Fatima  19   F    Software Engineer          Skateboarding    1
Lisa    23   F    Marketing Manager          Skydiving        1
Mike    25   M    Product Manager            Rock Climbing    1
Zahra   42   F    Data Scientist             Baking           1
dtype: int64
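
The two Maryam rows are missing from the output above because `value_counts` skips null values by default. As a small sketch on a hypothetical Series `jobs`, passing `dropna=False` counts the nulls as well.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values
jobs = pd.Series(['Designer', 'Chef', np.nan, np.nan], name='Job')

default_counts = jobs.value_counts()            # NaN is excluded
with_missing = jobs.value_counts(dropna=False)  # NaN is counted as its own entry

print(default_counts.sum())  # 2
print(with_missing.sum())    # 4
```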

In [31]:
# It's more helpful to count the occurrences of each unique value by column
print(df[['Age']].value_counts())
print(df[['Sex']].value_counts())



Age
23     2
107    2
19     1
25     1
42     1
54     1
dtype: int64
Sex
F      5
M      3
dtype: int64


In [32]:
# Cast column types: the Age column has dtype object because one value ('54') was entered as a string. It doesn't need to be, so we cast it to integer, which is a more appropriate type
df.dtypes

name       object
Age        object
Sex        object
Job        object
Hobbies    object
dtype: object

In [33]:
# To do this, we use the astype() method

df['Age'] = df['Age'].astype(int)

In [34]:
df.dtypes

name       object
Age         int64
Sex        object
Job        object
Hobbies    object
dtype: object
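
`astype(int)` works here because every value in the column can be read as an integer. If a column might contain values that cannot be parsed, one alternative is `pd.to_numeric` with `errors='coerce'`. The sketch below uses a hypothetical Series `raw_age`; the `'unknown'` entry is invented for illustration.

```python
import pandas as pd

# Hypothetical column with one value that cannot be parsed as a number
raw_age = pd.Series([23, '54', 'unknown'])

# astype(int) would raise ValueError on 'unknown';
# errors='coerce' turns unparseable values into NaN instead
clean_age = pd.to_numeric(raw_age, errors='coerce')

print(clean_age.tolist())
print(clean_age.dtype)
```

Because NaN is a floating-point value, the result is a float column rather than an integer one.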

# Data Wrangling

Data wrangling means manipulating the data contained in the dataframe itself. There are 5 major operations in data wrangling, and all of them are essential for data analysis:

1. Filtering rows
2. Mutating columns
3. Sorting rows
4. Renaming columns
5. Grouping and Aggregating rows
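
As a rough preview, the five operations can be chained into one expression. This is only a sketch on a hypothetical dataframe `staff`; the column names and values are invented for illustration, and each step is covered separately in the cells that follow.

```python
import pandas as pd

# Hypothetical data used only for this illustration
staff = pd.DataFrame({'name': ['Ann', 'Ben', 'Cy', 'Di'],
                      'team': ['x', 'y', 'x', 'y'],
                      'pay': [50, 80, 70, 20]})

summary = (
    staff[staff['pay'] > 30]                         # 1. filter rows
    .assign(double_pay=lambda d: d['pay'] * 2)       # 2. mutate: add a column
    .sort_values(by='double_pay', ascending=False)   # 3. sort rows
    .rename(columns={'double_pay': 'Double_Pay'})    # 4. rename a column
    .groupby('team', as_index=False)                 # 5. group rows ...
    .agg({'Double_Pay': 'mean'})                     #    ... and aggregate
)
print(summary)
```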

In [35]:
# Let's make a copy of the dataframe so our changes don't affect the original.
df_copy = df.copy()

In [36]:
# Filter rows
## Single filters
df_copy[df_copy.Job == 'Data Scientist']

Unnamed: 0,name,Age,Sex,Job,Hobbies
4,Zahra,42,F,Data Scientist,Baking


In [40]:
## Multiple Filters
df_copy[(df_copy.Sex == 'F') & (df_copy.Age > 24)]

Unnamed: 0,name,Age,Sex,Job,Hobbies
4,Zahra,42,F,Data Scientist,Baking
6,Maryam,107,F,,Trolling
7,Maryam,107,F,,Trolling


In [41]:
# Check if values are in a given list of values
df_copy[df_copy.Job.isin(['Designer', 'Machine Learning Engineer'])]


Unnamed: 0,name,Age,Sex,Job,Hobbies
0,Bob,23,M,Designer,Waterskiing
5,Ali,54,M,Machine Learning Engineer,Improv Acting


In [42]:
# Return rows whose values are NOT in a given list (~ negates the condition)
df_copy[~df_copy.Hobbies.isin(['Trolling', 'Skydiving'])]

Unnamed: 0,name,Age,Sex,Job,Hobbies
0,Bob,23,M,Designer,Waterskiing
2,Mike,25,M,Product Manager,Rock Climbing
3,Fatima,19,F,Software Engineer,Skateboarding
4,Zahra,42,F,Data Scientist,Baking
5,Ali,54,M,Machine Learning Engineer,Improv Acting
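
`isin` tests membership in a list of values. For a numeric range, one option is `Series.between`, which is inclusive at both ends by default. A minimal sketch on a hypothetical dataframe `ages`:

```python
import pandas as pd

# Hypothetical data used only for this illustration
ages = pd.DataFrame({'name': ['Bob', 'Mike', 'Zahra', 'Ali'],
                     'Age': [23, 25, 42, 54]})

# between() keeps values from 25 to 42, including both end points
in_range = ages[ages['Age'].between(25, 42)]

# Equivalent filter written as two comparisons combined with &
same_rows = ages[(ages['Age'] >= 25) & (ages['Age'] <= 42)]

print(in_range['name'].tolist())  # ['Mike', 'Zahra']
```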


In [43]:
# Mutate columns
## Make a new column from the existing columns
df_copy['new_age'] = df_copy['Age'] - 2
df_copy

Unnamed: 0,name,Age,Sex,Job,Hobbies,new_age
0,Bob,23,M,Designer,Waterskiing,21
1,Lisa,23,F,Marketing Manager,Skydiving,21
2,Mike,25,M,Product Manager,Rock Climbing,23
3,Fatima,19,F,Software Engineer,Skateboarding,17
4,Zahra,42,F,Data Scientist,Baking,40
5,Ali,54,M,Machine Learning Engineer,Improv Acting,52
6,Maryam,107,F,,Trolling,105
7,Maryam,107,F,,Trolling,105


In [44]:
## Apply a function to every value in a column
df_copy['log_age'] = df_copy['Age'].apply(np.log)
df_copy

Unnamed: 0,name,Age,Sex,Job,Hobbies,new_age,log_age
0,Bob,23,M,Designer,Waterskiing,21,3.135494
1,Lisa,23,F,Marketing Manager,Skydiving,21,3.135494
2,Mike,25,M,Product Manager,Rock Climbing,23,3.218876
3,Fatima,19,F,Software Engineer,Skateboarding,17,2.944439
4,Zahra,42,F,Data Scientist,Baking,40,3.73767
5,Ali,54,M,Machine Learning Engineer,Improv Acting,52,3.988984
6,Maryam,107,F,,Trolling,105,4.672829
7,Maryam,107,F,,Trolling,105,4.672829


In [45]:
## Can apply custom lambda functions to columns
df_copy['full_hobbies'] = df_copy['Hobbies'].apply(lambda x: x + ' & Scuba Diving')
df_copy

Unnamed: 0,name,Age,Sex,Job,Hobbies,new_age,log_age,full_hobbies
0,Bob,23,M,Designer,Waterskiing,21,3.135494,Waterskiing & Scuba Diving
1,Lisa,23,F,Marketing Manager,Skydiving,21,3.135494,Skydiving & Scuba Diving
2,Mike,25,M,Product Manager,Rock Climbing,23,3.218876,Rock Climbing & Scuba Diving
3,Fatima,19,F,Software Engineer,Skateboarding,17,2.944439,Skateboarding & Scuba Diving
4,Zahra,42,F,Data Scientist,Baking,40,3.73767,Baking & Scuba Diving
5,Ali,54,M,Machine Learning Engineer,Improv Acting,52,3.988984,Improv Acting & Scuba Diving
6,Maryam,107,F,,Trolling,105,4.672829,Trolling & Scuba Diving
7,Maryam,107,F,,Trolling,105,4.672829,Trolling & Scuba Diving


In [46]:
## We can apply a function to a column and then replace that column
df_copy['Sex'] = df_copy['Sex'].apply(lambda x: x.lower())
df_copy

Unnamed: 0,name,Age,Sex,Job,Hobbies,new_age,log_age,full_hobbies
0,Bob,23,m,Designer,Waterskiing,21,3.135494,Waterskiing & Scuba Diving
1,Lisa,23,f,Marketing Manager,Skydiving,21,3.135494,Skydiving & Scuba Diving
2,Mike,25,m,Product Manager,Rock Climbing,23,3.218876,Rock Climbing & Scuba Diving
3,Fatima,19,f,Software Engineer,Skateboarding,17,2.944439,Skateboarding & Scuba Diving
4,Zahra,42,f,Data Scientist,Baking,40,3.73767,Baking & Scuba Diving
5,Ali,54,m,Machine Learning Engineer,Improv Acting,52,3.988984,Improv Acting & Scuba Diving
6,Maryam,107,f,,Trolling,105,4.672829,Trolling & Scuba Diving
7,Maryam,107,f,,Trolling,105,4.672829,Trolling & Scuba Diving
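
For common string operations, the `.str` accessor is an alternative to `apply` with a lambda. It also copes with missing values, which a lambda calling `x.lower()` would not. A small sketch on a hypothetical Series `sex`:

```python
import numpy as np
import pandas as pd

# Hypothetical column containing a missing value
sex = pd.Series(['M', 'F', np.nan])

# .str applies the string method element-wise and leaves missing values missing,
# whereas apply(lambda x: x.lower()) would raise AttributeError on the NaN
lowered = sex.str.lower()

print(lowered.tolist())
```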


In [47]:
# Sort values
df_copy.sort_values(by = ['Age'], ascending = [True])
#df_copy.sort_values(by = ['Age'], ascending = [False])



Unnamed: 0,name,Age,Sex,Job,Hobbies,new_age,log_age,full_hobbies
3,Fatima,19,f,Software Engineer,Skateboarding,17,2.944439,Skateboarding & Scuba Diving
0,Bob,23,m,Designer,Waterskiing,21,3.135494,Waterskiing & Scuba Diving
1,Lisa,23,f,Marketing Manager,Skydiving,21,3.135494,Skydiving & Scuba Diving
2,Mike,25,m,Product Manager,Rock Climbing,23,3.218876,Rock Climbing & Scuba Diving
4,Zahra,42,f,Data Scientist,Baking,40,3.73767,Baking & Scuba Diving
5,Ali,54,m,Machine Learning Engineer,Improv Acting,52,3.988984,Improv Acting & Scuba Diving
6,Maryam,107,f,,Trolling,105,4.672829,Trolling & Scuba Diving
7,Maryam,107,f,,Trolling,105,4.672829,Trolling & Scuba Diving


In [None]:
## Can also sort by alphabetical order if the values inside are strings
df_copy.sort_values(by = ['Hobbies'], ascending = [True])


In [48]:
## Sorting by multiple columns. Sort by Sex (ascending) and then, within each Sex, by Age with the oldest first
df_copy.sort_values(by = ['Sex', 'Age'], ascending = [True, False])


Unnamed: 0,name,Age,Sex,Job,Hobbies,new_age,log_age,full_hobbies
6,Maryam,107,f,,Trolling,105,4.672829,Trolling & Scuba Diving
7,Maryam,107,f,,Trolling,105,4.672829,Trolling & Scuba Diving
4,Zahra,42,f,Data Scientist,Baking,40,3.73767,Baking & Scuba Diving
1,Lisa,23,f,Marketing Manager,Skydiving,21,3.135494,Skydiving & Scuba Diving
3,Fatima,19,f,Software Engineer,Skateboarding,17,2.944439,Skateboarding & Scuba Diving
5,Ali,54,m,Machine Learning Engineer,Improv Acting,52,3.988984,Improv Acting & Scuba Diving
2,Mike,25,m,Product Manager,Rock Climbing,23,3.218876,Rock Climbing & Scuba Diving
0,Bob,23,m,Designer,Waterskiing,21,3.135494,Waterskiing & Scuba Diving


In [49]:
# Rename Columns
## Assign a new list of column names - We want to change name to Name
df_copy.columns = ['Name', 'Age', 'Sex', 'Job', 'Hobbies', 'new_age', 'log_age', 'full_hobbies']
df_copy


Unnamed: 0,Name,Age,Sex,Job,Hobbies,new_age,log_age,full_hobbies
0,Bob,23,m,Designer,Waterskiing,21,3.135494,Waterskiing & Scuba Diving
1,Lisa,23,f,Marketing Manager,Skydiving,21,3.135494,Skydiving & Scuba Diving
2,Mike,25,m,Product Manager,Rock Climbing,23,3.218876,Rock Climbing & Scuba Diving
3,Fatima,19,f,Software Engineer,Skateboarding,17,2.944439,Skateboarding & Scuba Diving
4,Zahra,42,f,Data Scientist,Baking,40,3.73767,Baking & Scuba Diving
5,Ali,54,m,Machine Learning Engineer,Improv Acting,52,3.988984,Improv Acting & Scuba Diving
6,Maryam,107,f,,Trolling,105,4.672829,Trolling & Scuba Diving
7,Maryam,107,f,,Trolling,105,4.672829,Trolling & Scuba Diving


In [50]:
## Change selected column names via a dictionary.
df_copy = df_copy.rename(columns = {'new_age': 'Younger_Age', 'log_age': 'Log_Age'})
df_copy

Unnamed: 0,Name,Age,Sex,Job,Hobbies,Younger_Age,Log_Age,full_hobbies
0,Bob,23,m,Designer,Waterskiing,21,3.135494,Waterskiing & Scuba Diving
1,Lisa,23,f,Marketing Manager,Skydiving,21,3.135494,Skydiving & Scuba Diving
2,Mike,25,m,Product Manager,Rock Climbing,23,3.218876,Rock Climbing & Scuba Diving
3,Fatima,19,f,Software Engineer,Skateboarding,17,2.944439,Skateboarding & Scuba Diving
4,Zahra,42,f,Data Scientist,Baking,40,3.73767,Baking & Scuba Diving
5,Ali,54,m,Machine Learning Engineer,Improv Acting,52,3.988984,Improv Acting & Scuba Diving
6,Maryam,107,f,,Trolling,105,4.672829,Trolling & Scuba Diving
7,Maryam,107,f,,Trolling,105,4.672829,Trolling & Scuba Diving


In [51]:
# Grouping and Aggregating
## What is the mean age by gender?
df_copy.groupby(['Sex'], as_index=False).agg({'Age': 'mean'})

Unnamed: 0,Sex,Age
0,f,59.6
1,m,34.0


In [52]:
## We can aggregate with multiple functions
df_copy.groupby(['Sex'], as_index=False).agg({'Age': ['mean', 'min', 'max']})

Unnamed: 0_level_0,Sex,Age,Age,Age
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,min,max
0,f,59.6,19,107
1,m,34.0,23,54
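
Passing a list of functions produces two header levels, as in the output above. If flat column names are preferred, named aggregation is one option. The sketch below uses a hypothetical dataframe `people`, and the output column names are invented for illustration.

```python
import pandas as pd

# Hypothetical data used only for this illustration
people = pd.DataFrame({'Sex': ['f', 'f', 'm', 'm'],
                       'Age': [19, 42, 23, 54]})

# Named aggregation: new_column=(source_column, function)
age_stats = people.groupby('Sex', as_index=False).agg(
    mean_age=('Age', 'mean'),
    min_age=('Age', 'min'),
    max_age=('Age', 'max'),
)
print(age_stats.columns.tolist())  # ['Sex', 'mean_age', 'min_age', 'max_age']
```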


In [53]:
## Add a new column for Salary and Ethnic Name
df_copy['Salary'] = [72, 65, 67, 71, 70, 89, 23, 23]
df_copy['Ethnic_Name'] = ["n", "n", "n", "n", "y", "y", "y", "y"]

In [54]:
df_copy.groupby(['Sex'], as_index=False).agg({'Salary': ['mean', 'median']})

Unnamed: 0_level_0,Sex,Salary,Salary
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,median
0,f,50.4,65.0
1,m,76.0,72.0


In [55]:
## We can group by multiple columns
df_copy.groupby(['Sex', 'Ethnic_Name'], as_index=False).agg({'Salary': ['mean', 'median']})

Unnamed: 0_level_0,Sex,Ethnic_Name,Salary,Salary
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,mean,median
0,f,n,68.0,68.0
1,f,y,38.666667,23.0
2,m,n,69.5,69.5
3,m,y,89.0,89.0


In [56]:
## Making your own aggregation functions - These run much slower than the built-in ones on large dataframes.
## We make a function to compute the exponential values for an array and then compute the mean.
def exp_mean(x):
    y = np.exp(np.array(x))
    return np.mean(y)

#exp_mean([1,2,3,4])

df_copy.groupby(['Sex'], as_index=False).agg({'Salary': exp_mean})


Unnamed: 0,Sex,Salary
0,f,1.874012e+30
1,m,1.496538e+38


# Conditional Logic with Dataframes
It is useful to apply if/else logic to columns, both to extract data of interest and to mutate the dataframe more intelligently. For this we will need 2 NumPy functions: `np.where` and `np.select`.

In [57]:
# Binary logic - Suppose we were creating a special club where the age of entry is 25 or over.
# Let's see who in our dataset qualifies to be a member.
df_copy['eligible_for_membership'] = np.where(df_copy.Age >= 25, True, False)
df_copy

Unnamed: 0,Name,Age,Sex,Job,Hobbies,Younger_Age,Log_Age,full_hobbies,Salary,Ethnic_Name,eligible_for_membership
0,Bob,23,m,Designer,Waterskiing,21,3.135494,Waterskiing & Scuba Diving,72,n,False
1,Lisa,23,f,Marketing Manager,Skydiving,21,3.135494,Skydiving & Scuba Diving,65,n,False
2,Mike,25,m,Product Manager,Rock Climbing,23,3.218876,Rock Climbing & Scuba Diving,67,n,True
3,Fatima,19,f,Software Engineer,Skateboarding,17,2.944439,Skateboarding & Scuba Diving,71,n,False
4,Zahra,42,f,Data Scientist,Baking,40,3.73767,Baking & Scuba Diving,70,y,True
5,Ali,54,m,Machine Learning Engineer,Improv Acting,52,3.988984,Improv Acting & Scuba Diving,89,y,True
6,Maryam,107,f,,Trolling,105,4.672829,Trolling & Scuba Diving,23,y,True
7,Maryam,107,f,,Trolling,105,4.672829,Trolling & Scuba Diving,23,y,True


In [59]:
# More than 2 conditions. When we have complex if else logic we can use np.select
# Suppose we wanted to flag which hobbies were considered adventurous in varying degrees
df_copy['Adventurous'] = np.select(
    [df_copy.Hobbies == 'Waterskiing',
     df_copy.Hobbies == 'Skydiving',
     df_copy.Hobbies == 'Rock Climbing',
     df_copy.Hobbies == 'Skateboarding',
     df_copy.Hobbies == 'Baking'],
    ['very_adventurous',
     'extremely_adventurous',
     'adventurous',
     'adventurous',
     'not adventurous'],
    default = 'mildly_adventurous')

df_copy[['Name', 'Hobbies', 'Adventurous']]

Unnamed: 0,Name,Hobbies,Adventurous
0,Bob,Waterskiing,very_adventurous
1,Lisa,Skydiving,extremely_adventurous
2,Mike,Rock Climbing,adventurous
3,Fatima,Skateboarding,adventurous
4,Zahra,Baking,not adventurous
5,Ali,Improv Acting,mildly_adventurous
6,Maryam,Trolling,mildly_adventurous
7,Maryam,Trolling,mildly_adventurous
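
When conditions overlap, `np.select` uses the first one that is true for each element, so the order of the condition list matters. A minimal sketch with a hypothetical Series `ages`; the age bands are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages used only for this illustration
ages = pd.Series([19, 25, 42, 107])

# Conditions are checked in order and the first True one wins,
# so 107 is labelled 'senior' even though it also satisfies ages >= 25
band = np.select(
    [ages >= 65, ages >= 25],
    ['senior', 'adult'],
    default='young',
)
print(band.tolist())  # ['young', 'adult', 'adult', 'senior']
```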


# Multiple Dataframes

We will often have more than one dataframe to work with at once, and we may want to combine the data scattered across them in different ways.

In [61]:
# Union Dataframes/ Stack them on top of each other
df2 = pd.DataFrame({'name': ['James', 'Frankie'],
'Age': [47, 48],
'Sex': ['M', 'F'],
'Job': ['Accountant', 'Chef'],
'Hobbies': ['Travelling', 'Kickboxing']})

df2

Unnamed: 0,name,Age,Sex,Job,Hobbies
0,James,47,M,Accountant,Travelling
1,Frankie,48,F,Chef,Kickboxing


In [62]:
df3 = pd.concat([df, df2], ignore_index=True)
df3

Unnamed: 0,name,Age,Sex,Job,Hobbies
0,Bob,23,M,Designer,Waterskiing
1,Lisa,23,F,Marketing Manager,Skydiving
2,Mike,25,M,Product Manager,Rock Climbing
3,Fatima,19,F,Software Engineer,Skateboarding
4,Zahra,42,F,Data Scientist,Baking
5,Ali,54,M,Machine Learning Engineer,Improv Acting
6,Maryam,107,F,,Trolling
7,Maryam,107,F,,Trolling
8,James,47,M,Accountant,Travelling
9,Frankie,48,F,Chef,Kickboxing


In [64]:
## Add columns to a dataframe 
df_extra_cols = pd.DataFrame({'Country': ['USA', 'Japan', 'UK', 'UK', 'UK', 'Tanzania', 'Narnia', 'Narnia', 'Australia', 'Netherlands']})
df4 = pd.concat([df3, df_extra_cols], axis = 1) # axis = 1 tells pandas that you're joining columns and not rows
df4

Unnamed: 0,name,Age,Sex,Job,Hobbies,Country
0,Bob,23,M,Designer,Waterskiing,USA
1,Lisa,23,F,Marketing Manager,Skydiving,Japan
2,Mike,25,M,Product Manager,Rock Climbing,UK
3,Fatima,19,F,Software Engineer,Skateboarding,UK
4,Zahra,42,F,Data Scientist,Baking,UK
5,Ali,54,M,Machine Learning Engineer,Improv Acting,Tanzania
6,Maryam,107,F,,Trolling,Narnia
7,Maryam,107,F,,Trolling,Narnia
8,James,47,M,Accountant,Travelling,Australia
9,Frankie,48,F,Chef,Kickboxing,Netherlands
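
With `axis = 1`, `pd.concat` lines rows up by index label rather than by position. The example above works because both dataframes share the labels 0 to 9. The following sketch, on two hypothetical dataframes with deliberately offset labels, shows what happens otherwise and one way to force a positional join.

```python
import pandas as pd

# Hypothetical dataframes whose index labels only partly overlap
left = pd.DataFrame({'name': ['Bob', 'Lisa']}, index=[0, 1])
right = pd.DataFrame({'Country': ['USA', 'Japan']}, index=[1, 2])

# Rows are aligned on index labels, so non-matching labels produce NaN
misaligned = pd.concat([left, right], axis=1)

# Resetting both indexes first gives a plain side-by-side join by position
aligned = pd.concat([left.reset_index(drop=True),
                     right.reset_index(drop=True)], axis=1)

print(misaligned.shape)  # (3, 2)
print(aligned.shape)     # (2, 2)
```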


In [67]:
# Joining Dataframes
## One common key between dataframes - this replaces VLOOKUP in Excel
df_a = pd.DataFrame({'Country': ['USA', 'Japan', 'UK', 'Tanzania', 'Australia', 'Netherlands'],
'National_Sport': ['Baseball', 'Sumo', 'Football', 'Football', 'Netball', 'Cycling']})

df_a


Unnamed: 0,Country,National_Sport
0,USA,Baseball
1,Japan,Sumo
2,UK,Football
3,Tanzania,Football
4,Australia,Netball
5,Netherlands,Cycling


In [68]:
df_4a = pd.merge(df4, df_a, on = ['Country'], how = 'left')
df_4a

Unnamed: 0,name,Age,Sex,Job,Hobbies,Country,National_Sport
0,Bob,23,M,Designer,Waterskiing,USA,Baseball
1,Lisa,23,F,Marketing Manager,Skydiving,Japan,Sumo
2,Mike,25,M,Product Manager,Rock Climbing,UK,Football
3,Fatima,19,F,Software Engineer,Skateboarding,UK,Football
4,Zahra,42,F,Data Scientist,Baking,UK,Football
5,Ali,54,M,Machine Learning Engineer,Improv Acting,Tanzania,Football
6,Maryam,107,F,,Trolling,Narnia,
7,Maryam,107,F,,Trolling,Narnia,
8,James,47,M,Accountant,Travelling,Australia,Netball
9,Frankie,48,F,Chef,Kickboxing,Netherlands,Cycling
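
In a left join, rows with no match (Narnia above) keep their place and receive nulls in the new columns. To list such rows explicitly, `pd.merge` accepts `indicator=True`. A small sketch on hypothetical dataframes `people` and `sports`:

```python
import pandas as pd

# Hypothetical data used only for this illustration
people = pd.DataFrame({'name': ['Bob', 'Maryam'],
                       'Country': ['USA', 'Narnia']})
sports = pd.DataFrame({'Country': ['USA', 'Japan'],
                       'National_Sport': ['Baseball', 'Sumo']})

# indicator=True adds a _merge column: 'both' for matched rows,
# 'left_only' for rows that found no match in the right dataframe
merged = pd.merge(people, sports, on='Country', how='left', indicator=True)

unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['name'].tolist())  # ['Maryam']
```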


In [69]:
# Multiple Keys 
df_b = pd.DataFrame({'Country': ['USA', 'USA', 'Japan', 'Japan', 'UK', 'UK'],
'Sex': ['M', 'F', 'M', 'F', 'M', 'F'],
'Sexist_National_Sport': ['Baseball', 'Cheerleading', 'Sumo', 'Ballet', 'Football', 'Gymnastics']})

df_b

Unnamed: 0,Country,Sex,Sexist_National_Sport
0,USA,M,Baseball
1,USA,F,Cheerleading
2,Japan,M,Sumo
3,Japan,F,Ballet
4,UK,M,Football
5,UK,F,Gymnastics


In [70]:
df_4b = pd.merge(df4, df_b, on = ['Country', 'Sex'], how = 'left')
df_4b

Unnamed: 0,name,Age,Sex,Job,Hobbies,Country,Sexist_National_Sport
0,Bob,23,M,Designer,Waterskiing,USA,Baseball
1,Lisa,23,F,Marketing Manager,Skydiving,Japan,Ballet
2,Mike,25,M,Product Manager,Rock Climbing,UK,Football
3,Fatima,19,F,Software Engineer,Skateboarding,UK,Gymnastics
4,Zahra,42,F,Data Scientist,Baking,UK,Gymnastics
5,Ali,54,M,Machine Learning Engineer,Improv Acting,Tanzania,
6,Maryam,107,F,,Trolling,Narnia,
7,Maryam,107,F,,Trolling,Narnia,
8,James,47,M,Accountant,Travelling,Australia,
9,Frankie,48,F,Chef,Kickboxing,Netherlands,
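
A lookup join like this assumes each key combination appears only once in the lookup table. If it appeared twice, the matching rows on the left would be duplicated. One way to guard against that is the `validate` argument of `pd.merge`. The sketch below uses hypothetical dataframes, and the `Sport` column name is invented for illustration.

```python
import pandas as pd

# Hypothetical data used only for this illustration
people = pd.DataFrame({'name': ['Bob', 'Lisa'],
                       'Country': ['USA', 'USA'],
                       'Sex': ['M', 'F']})
lookup = pd.DataFrame({'Country': ['USA', 'USA'],
                       'Sex': ['M', 'F'],
                       'Sport': ['Baseball', 'Cheerleading']})

# validate='many_to_one' raises a MergeError if a (Country, Sex) pair
# is duplicated in the right-hand dataframe
checked = pd.merge(people, lookup, on=['Country', 'Sex'],
                   how='left', validate='many_to_one')

print(len(checked) == len(people))  # True
```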


# Dealing with Null Values and Duplicates

In [71]:
# Check Null values
df['Job'].isna()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7     True
Name: Job, dtype: bool

In [72]:
# Drop all rows with a Null value in at least one column. Not an inplace operation.
df.dropna()


Unnamed: 0,name,Age,Sex,Job,Hobbies
0,Bob,23,M,Designer,Waterskiing
1,Lisa,23,F,Marketing Manager,Skydiving
2,Mike,25,M,Product Manager,Rock Climbing
3,Fatima,19,F,Software Engineer,Skateboarding
4,Zahra,42,F,Data Scientist,Baking
5,Ali,54,M,Machine Learning Engineer,Improv Acting


In [77]:
# Replace Null value with a value
df['Job_notnull']= df['Job'].fillna('Tik Tok Star')
df

Unnamed: 0,name,Age,Sex,Job,Hobbies,Job_notnull
0,Bob,23,M,Designer,Waterskiing,Designer
1,Lisa,23,F,Marketing Manager,Skydiving,Marketing Manager
2,Mike,25,M,Product Manager,Rock Climbing,Product Manager
3,Fatima,19,F,Software Engineer,Skateboarding,Software Engineer
4,Zahra,42,F,Data Scientist,Baking,Data Scientist
5,Ali,54,M,Machine Learning Engineer,Improv Acting,Machine Learning Engineer
6,Maryam,107,F,,Trolling,Tik Tok Star
7,Maryam,107,F,,Trolling,Tik Tok Star


In [78]:
# Restrict dataframe to only Null rows
df[pd.isnull(df.Job)]

Unnamed: 0,name,Age,Sex,Job,Hobbies,Job_notnull
6,Maryam,107,F,,Trolling,Tik Tok Star
7,Maryam,107,F,,Trolling,Tik Tok Star
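
Two related checks are often handy: counting nulls per column, and dropping rows only when a specific column is null. A minimal sketch on a hypothetical dataframe `table`:

```python
import numpy as np
import pandas as pd

# Hypothetical data used only for this illustration
table = pd.DataFrame({'name': ['Ali', 'Maryam', 'Maryam'],
                      'Job': ['Engineer', np.nan, np.nan]})

# isna() gives a boolean dataframe; summing counts the True values per column
null_counts = table.isna().sum()

# subset= limits the null check to the listed columns
has_job = table.dropna(subset=['Job'])

print(null_counts)
print(len(has_job))  # 1
```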


In [79]:
# Check which values in a column are duplicates. Flags the repeated values but not their first occurrence.
df['Age'].duplicated()

0    False
1     True
2    False
3    False
4    False
5    False
6    False
7     True
Name: Age, dtype: bool

In [80]:
# Remove all duplicate rows in a dataframe. Not an inplace operation.
df.drop_duplicates()

Unnamed: 0,name,Age,Sex,Job,Hobbies,Job_notnull
0,Bob,23,M,Designer,Waterskiing,Designer
1,Lisa,23,F,Marketing Manager,Skydiving,Marketing Manager
2,Mike,25,M,Product Manager,Rock Climbing,Product Manager
3,Fatima,19,F,Software Engineer,Skateboarding,Software Engineer
4,Zahra,42,F,Data Scientist,Baking,Data Scientist
5,Ali,54,M,Machine Learning Engineer,Improv Acting,Machine Learning Engineer
6,Maryam,107,F,,Trolling,Tik Tok Star
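
By default `drop_duplicates` compares whole rows and keeps the first copy. The `subset` and `keep` arguments change both behaviours. A short sketch on a hypothetical dataframe `table`:

```python
import pandas as pd

# Hypothetical data used only for this illustration
table = pd.DataFrame({'name': ['Bob', 'Lisa', 'Maryam', 'Maryam'],
                      'Age': [23, 23, 107, 107]})

full_row = table.drop_duplicates()                 # drops only the repeated Maryam row
by_age = table.drop_duplicates(subset=['Age'])     # keeps the first row per Age, so Lisa goes too
none_kept = table.drop_duplicates(subset=['Age'], keep=False)  # drops every row whose Age repeats

print(len(full_row), len(by_age), len(none_kept))  # 3 2 0
```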


# Importing and Exporting Dataframes
We won't usually build our own dataframes. Instead we load data from CSV files or a database, so we need to learn how to import and export data.

In [82]:
# Reading from a CSV file
diamonds = pd.read_csv("../data/diamonds.csv")
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [83]:
# Writing to CSV
diamonds['Junaid_verdict'] = np.where(diamonds.cut == "Ideal", "Me Likey!", "Meh")
diamonds.to_csv("../data/diamonds_edited.csv", index = False)
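
The `index = False` argument matters on a round trip. Without it, the row labels are written as an extra unnamed column, which `read_csv` then loads back as `Unnamed: 0`. The sketch below shows both cases on a small hypothetical dataframe, using an in-memory buffer instead of a file so that it runs anywhere.

```python
import io
import pandas as pd

# Hypothetical data used only for this illustration
table = pd.DataFrame({'carat': [0.23, 0.21], 'cut': ['Ideal', 'Premium']})

buffer = io.StringIO()
table.to_csv(buffer, index=True)    # row labels are written as an unnamed first column
buffer.seek(0)
with_index = pd.read_csv(buffer)    # the labels come back as a column called 'Unnamed: 0'

buffer = io.StringIO()
table.to_csv(buffer, index=False)   # row labels are left out
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(with_index.columns.tolist())     # ['Unnamed: 0', 'carat', 'cut']
print(without_index.columns.tolist())  # ['carat', 'cut']
```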
