# CME538 - Introduction to Data Science

## Introduction to Pandas: Code Along Activity with `titanic.csv`!

**Data Dictionary**
| Column Name | Definition                         | Key                              |
|-------------|------------------------------------|----------------------------------|
| Survived    | Survival                           | 0 = No, 1 = Yes                  |
| Pclass      | Ticket class                       | 1 = 1st, 2 = 2nd, 3 = 3rd        |
| Sex         | Sex                                |                                  |
| Age         | Age in years                       |                                  |
| SibSp       | # of siblings / spouses aboard     |                                  |
| Parch       | # of parents / children aboard     |                                  |
| Ticket      | Ticket number                      |                                  |
| Fare        | Passenger fare                     |                                  |
| Cabin       | Cabin number                       |                                  |
| Embarked    | Port of Embarkation                | C = Cherbourg, Q = Queenstown, S = Southampton |

In [None]:
# Import 3rd Party Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

First, let's import the dataset `titanic.csv`

In [None]:
# also see available functions in pandas
dir(pd)

In [None]:
# to see the arguments with pd.read_csv
help(pd.read_csv) 

What if I have multiple `csv` files, or they are in another location?

In [None]:
import os

# list the files in the current working directory
print(os.listdir())
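
One possible way to handle several CSV files, or files that live in another folder, is to build the paths with `pathlib` and stack the results with `pd.concat`. This is only a sketch: the file names `part_1.csv` and `part_2.csv` and their contents are made up, and they are written to a temporary folder so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical setup: write two small CSV files into a temporary folder
data_dir = Path(tempfile.mkdtemp())
pd.DataFrame({"Name": ["A", "B"], "Age": [22, 38]}).to_csv(data_dir / "part_1.csv", index=False)
pd.DataFrame({"Name": ["C"], "Age": [26]}).to_csv(data_dir / "part_2.csv", index=False)

# Find every CSV file in that folder (sorted so the order is predictable)
csv_paths = sorted(data_dir.glob("*.csv"))

# Read each file, then stack them into a single DataFrame
frames = [pd.read_csv(path) for path in csv_paths]
combined_df = pd.concat(frames, ignore_index=True)

print(combined_df)
```

For a single file in another location, passing the full path to `pd.read_csv` works the same way.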

Let's read in `titanic.csv`

In [None]:
titanic_df = pd.read_csv("titanic.csv")
titanic_df # preview your data

Let's use `.head()` and `.tail()` to preview the results:

In [None]:
# first 5 results
titanic_df.head()

In [None]:
# first 2 results
titanic_df.head(2)

In [None]:
# see the last 5 rows
titanic_df.tail()

In [None]:
# see the last 10 rows
titanic_df.tail(10)

DataFrame attributes: `index`, `columns` and `shape`:

In [None]:
type(titanic_df)

In [None]:
# find all column names
titanic_df.columns

In [None]:
# print out the index
titanic_df.index

In [None]:
# shape of data; row/column
titanic_df.shape

In [None]:
# create a series
titanic_df['Survived']

In [None]:
# create a dataframe
titanic_df[['Survived']]
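
The two cells above differ only in their brackets. As a quick check, here is a small sketch on a made-up table (`demo_df` stands in for `titanic_df`) that prints the type and shape each form returns.

```python
import pandas as pd

# Small made-up table, standing in for titanic_df
demo_df = pd.DataFrame({"Survived": [0, 1, 1], "Age": [22.0, 38.0, 26.0]})

single = demo_df["Survived"]      # single brackets -> Series
double = demo_df[["Survived"]]    # double brackets (a list of columns) -> DataFrame

print(type(single).__name__, single.shape)   # Series (3,)
print(type(double).__name__, double.shape)   # DataFrame (3, 1)
```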

It is very easy to create a new column:

In [None]:
# preview data
titanic_df.head()

In [None]:
# create a new column Inflation_Fare
titanic_df["Inflation_Fare"] = titanic_df["Fare"] * 25

# preview result
titanic_df.head(3)

In [None]:
# add a new dummy column
titanic_df["Dummy_Column"] = "CME538 is great"

# preview
titanic_df.head(3)

Also very easy to update an existing column in our DataFrame:

In [None]:
# let's modify the Sex column
titanic_df["Sex"] = titanic_df["Sex"].str.upper()

# preview
titanic_df.head(3)

Let's practice using some of the in-built **utility methods**:
- `max()`
- `min()`
- `mean()`
- `unique()`
- `sort_values()`
- `value_counts()`
- `astype()` *very important!*

In [None]:
# let's do some stats on Fare!
print(titanic_df["Fare"].mean())
print(titanic_df["Fare"].max())
print(titanic_df["Fare"].min())

In [None]:
# let's do some more stats on Age
# print(titanic_df["Age"].unique())
print(titanic_df["Age"].value_counts())

In [None]:
# make a mini variable
# df = titanic_df
df_col = titanic_df["Name"]
print(df_col.max())

In [None]:
# displaying a returned value instead of using print
# (by default a notebook cell only displays its last expression, so only min() is shown)
df_col.max()
df_col.min()

In [None]:
# what about methods on a string? max of a string?
print(titanic_df["Name"].max())

In [None]:
# what about methods on a string? what about mean?
# (this raises a TypeError, since strings cannot be averaged)
print(titanic_df["Name"].mean())

In [None]:
# use describe to see numeric column stats
titanic_df.describe()

In [None]:
# use describe on a subset of columns
titanic_df[["Age","Fare"]].describe()

In [None]:
# to see column dtypes and non-null counts (reveals missing values)
titanic_df.info()
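
Two methods from the list above, `sort_values()` and `astype()`, were not used in the cells. Here is a minimal sketch on a made-up table, where the fares are deliberately stored as text to show why `astype()` matters.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Name": ["Allen", "Braund", "Cumings"],
    "Fare": ["8.05", "7.25", "71.28"],   # fares stored as text on purpose
})

# Text columns cannot be averaged and sort alphabetically,
# so convert the column to floats first
demo_df["Fare"] = demo_df["Fare"].astype(float)

# Sort from most to least expensive fare
sorted_df = demo_df.sort_values(by="Fare", ascending=False)

print(demo_df.dtypes)
print(sorted_df)
```

Note that `sort_values()` returns a new DataFrame; `demo_df` itself keeps its original row order.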

There are also built-in string methods we can apply. Note that we need to go through the `.str` accessor before calling these:
- `.str.upper()`
- `.str.lower()`
- `.str.len()`
- `.str.replace()`
- `.str.startswith()` / `.str.endswith()` *returns a boolean series*

In [None]:
titanic_df.head(3)

In [None]:
# lowercase the sex column
titanic_df["Sex"] = titanic_df["Sex"].str.lower()

titanic_df.head(3)

In [None]:
titanic_df["Name Length"] = titanic_df["Name"].str.len() # find the length
titanic_df.head(3) # preview

In [None]:
# let's replace a string
titanic_df["Dummy_Column"] = titanic_df["Dummy_Column"].str.replace("great", "wonderful!!")
titanic_df["Dummy_Column"] = titanic_df["Dummy_Column"].str.replace("!!", "??")

# preview
titanic_df.head()

In [None]:
# let's create a last name column

# strategy - split on the comma, retrieve the first entry (the last name, at position 0)
titanic_df["Last_Name"] = titanic_df["Name"].str.split(",").str.get(0)

titanic_df.head()
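
The `.str.startswith()` method from the list above returns a boolean series, which makes it handy for filtering. A small sketch on three made-up entries written in the `Last, Title. First` style (the `names` series is only for illustration):

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
])

# Everything before the comma is the last name
last_names = names.str.split(",").str.get(0)

# Boolean Series: True where the full name starts with "B"
starts_with_b = names.str.startswith("B")

print(last_names.tolist())             # ['Braund', 'Heikkinen', 'Allen']
print(starts_with_b.tolist())          # [True, False, False]
print(names[starts_with_b].tolist())   # ['Braund, Mr. Owen Harris']
```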

What about filtering our DataFrame? Using conditionals:

In [None]:
filt_50 = titanic_df['Age'] > 50
filt_50

In [None]:
# remember that comparisons always return a boolean in Python
4 > 5

In [None]:
# let's get everyone over 50
filt_50_df = titanic_df[titanic_df['Age'] > 50]

# preview
filt_50_df.head()

Let's filter on male over 50

In [None]:
filt_50_male_df = titanic_df[
    (titanic_df['Age'] > 50) &  # filter criteria 1
    (titanic_df["Sex"] == 'male') # filter criteria 2
]

filt_50_male_df.head(3)

In [None]:
titanic_df['Sex'].unique()

In [None]:
filt_50_not_male_df = titanic_df[
    (titanic_df['Age'] > 50) &  # filter criteria 1, over 50
    ~(titanic_df["Sex"] == 'male') # filter criteria 2, not male with ~
]

filt_50_not_male_df.head(3)

What if you are over 50, **or** male with a fare above 50?

In [None]:
filt_50_or_male_df = titanic_df[
    (titanic_df['Age'] > 50) |  # filter criteria 1, over 50
    (
        (titanic_df["Sex"] == 'male') &  # filter criteria 2, male
        (titanic_df["Fare"] > 50)  # filter criteria 3, fare > 50
    )  # & binds tighter than |, so criteria 2 and 3 are combined first
]

filt_50_or_male_df.head(3)
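
The filter above mixes `|` and `&`. In Python, `&` is evaluated before `|`, so the grouping matters. A small sketch on made-up rows (`demo_df` is only for illustration) shows how parentheses change the result:

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Age":  [60, 30, 30, 60],
    "Sex":  ["female", "male", "male", "male"],
    "Fare": [10, 80, 10, 10],
})

over_50 = demo_df["Age"] > 50
male = demo_df["Sex"] == "male"
high_fare = demo_df["Fare"] > 50

# Without extra parentheses, & is applied first: over_50 | (male & high_fare)
implicit = demo_df[over_50 | male & high_fare]

# Parentheses change the meaning: (over_50 | male) & high_fare
grouped = demo_df[(over_50 | male) & high_fare]

print(implicit.index.tolist())   # [0, 1, 3]
print(grouped.index.tolist())    # [1]
```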

Let's practice now using `.loc` (label-based selection) and `.iloc` (integer position-based selection):

Using `loc`, filter to all rows where the passenger's `Age` is greater than 50 and they `Embarked` from "S", keeping only the `Name` and `Cabin` columns:

In [None]:
# let's practice filtering with loc -> [row, col]
filt_df = titanic_df.loc[(titanic_df["Age"] > 50) &  # row filter 1
                         (titanic_df["Embarked"] == "S"),  # row filter 2
                         ["Name","Cabin"]] # this filters the columns returned (Name and Cabin)
filt_df.head(3)

Using `iloc`, select the first 10 rows and the columns `Name`, `Age` and `Fare`:

In [None]:
filt_iloc_df = titanic_df.iloc[0:10] # just first 10 rows
filt_iloc_df

Easy alternative to get the first 10 rows with `head`:

In [None]:
temp_var = titanic_df.head(10)
temp_var

In [None]:
filt_iloc_df = titanic_df.iloc[0:10,[3,5,9]] # rows 0-9 (the end, 10, is excluded), columns by position (Name, Age, Fare)
filt_iloc_df
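
With the default index (0, 1, 2, ...) labels and positions look the same, which can hide the difference between `loc` and `iloc`. A minimal sketch with a made-up table whose index labels are 10, 20, 30, 40:

```python
import pandas as pd

demo_df = pd.DataFrame(
    {"Name": ["Allen", "Braund", "Cumings", "Futrelle"], "Age": [35, 22, 38, 35]},
    index=[10, 20, 30, 40],   # labels that differ from positions 0..3
)

by_label = demo_df.loc[10:30, ["Name"]]    # labels 10 through 30, end included
by_position = demo_df.iloc[0:2, [0]]       # positions 0 and 1, end excluded

print(by_label["Name"].tolist())      # ['Allen', 'Braund', 'Cumings']
print(by_position["Name"].tolist())   # ['Allen', 'Braund']
```

Label slices with `loc` include the end label, while position slices with `iloc` exclude the end position, like regular Python slicing.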

What about getting aggregate statistics? Let's practice using `.groupby()`:

1. Find the average fare paid by passengers, grouped by their `Pclass` and `Survived` status:

In [None]:
# find count
grouped_df = titanic_df.groupby(["Pclass","Survived"]).agg("count")
grouped_df

In [None]:
# find average paid fare by passengers based on their Pclass and Survived status
grouped_df = titanic_df.groupby(["Pclass","Survived"])[['Fare']].mean()
grouped_df

2. Find the total number of passengers based on their `Sex` and `Embarked` location, then sort the results by the total number of passengers in descending order:

In [None]:
# Find the total number of passengers based on their Sex and Embarked location, 
# then sort the results by the total number of passengers in descending order.
passengers_df = titanic_df.groupby(["Sex","Embarked"])[["PassengerId"]].count()

# now let's sort
passengers_df = passengers_df.sort_values(by = "PassengerId", ascending = False)

# preview
passengers_df
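
As an alternative sketch (not the approach used above), `.size()` counts rows per group without picking a column, and `.reset_index()` turns the group labels back into regular columns. The table and the column name `Passengers` are made up for illustration.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Sex":      ["male", "male", "female", "female", "male"],
    "Embarked": ["S", "S", "C", "S", "C"],
})

counts_df = (
    demo_df.groupby(["Sex", "Embarked"])
    .size()                              # number of rows per group
    .reset_index(name="Passengers")      # group labels become columns again
    .sort_values(by="Passengers", ascending=False)
)

print(counts_df)
```

One difference to keep in mind: `.count()` skips missing values in the counted column, while `.size()` counts every row in the group.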

3. Find the survival of passengers based on `Pclass`, `Sex`, `Survived` and express this as a **percent (%)**:

In [None]:
titanic_df.shape

In [None]:
passengers_grouped = titanic_df.groupby(['Pclass','Sex','Survived']).agg("count")
passengers_grouped

In [None]:
# notice that without an aggregation we get a DataFrameGroupBy object back, not a DataFrame
titanic_df.groupby(['Pclass','Sex','Survived'])

In [None]:
# 1 - find the number of passengers for each Pclass, Sex and Survived combination
passengers_grouped = titanic_df.groupby(['Pclass','Sex','Survived'])[["PassengerId"]].count()

# 2 - find the total number of passengers
total_passengers = passengers_grouped['PassengerId'].sum()

# alternative way to find the number of people: use shape to get the row count
# total_passengers = titanic_df.shape[0]

# 3 - create a new column
passengers_grouped['Percent'] = passengers_grouped["PassengerId"] / total_passengers * 100

# preview results
passengers_grouped
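
The `Percent` column above is each group's share of *all* passengers. A related but different number is the survival rate *within* each `Pclass`/`Sex` group. Since `Survived` is coded 0/1, one way to sketch this is to take the group mean. The rows below are made up for illustration.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 1],
})

# Survived is 0/1, so its mean within a group is the fraction who survived
survival_rate = demo_df.groupby(["Pclass", "Sex"])["Survived"].mean() * 100

print(survival_rate.round(1))
```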

Creating Functions and Using `.apply()`, Looping over Rows

In [None]:
# quick review of functions

def celsius_to_fahrenheit(x):
    """
    Argument: x (float), temperature in degrees Celsius
    Returns: y (float), temperature in degrees Fahrenheit
    
    Description: a function that converts a temperature x from Celsius to Fahrenheit
    """
    y = x * 9/5 + 32
    return y

variable = 3
print(celsius_to_fahrenheit(variable))
print(celsius_to_fahrenheit(2))
print(celsius_to_fahrenheit(5))

In [None]:
variable = 6

if variable > 20: # 6 fails this condition, so skip to else
    print("Greater than 20")
else: # every other value, including 6, lands here
    print("20 or less") # print

Create a function that categorizes passengers based on their age and apply it to create a new column `AgeCategory`. 
The function should return:
- 'Child' for passengers under 18
- 'Adult' for those between 18 and 60
- 'Senior' for those above 60

In [None]:
# define the function here
def age_category(age):
    """
    Input: age (numeric)
    Output: category (string)
    
    This function compares the age and returns one of 3 outputs, Child/Adult/Senior
    """
    if age < 18:
        return "Child"
    
    elif 18 <= age <= 60:
        return "Adult"
    
    else:
        return "Senior"

# code is starting here
test_age = 34
print(age_category(test_age)) # let's call our function on a single variable

In [None]:
# write code here
def age_category(age):
    """
    Input: age (numeric)
    Output: category (string)
    
    This function compares the age and returns one of 3 outputs, Child/Adult/Senior
    """
    if age < 18:
        return "Child"
    
    elif 18 <= age <= 60:
        return "Adult"
    
    else:
        return "Senior"
    
    
# let's define a new column   
titanic_df["AgeCategory"] = titanic_df["Age"].apply(age_category)

# preview
titanic_df.head(4)

In [None]:
# preview results from our column
titanic_df["AgeCategory"].value_counts()
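
One thing to watch for: if `Age` contains missing values (`NaN`), every comparison with `NaN` is `False`, so `age_category` falls through to the `else` branch and labels those rows "Senior". A hedged sketch of one way to keep missing ages missing; the name `age_category_with_missing` is hypothetical and not part of the original activity.

```python
import numpy as np
import pandas as pd

def age_category_with_missing(age):
    """Hypothetical variant of age_category that keeps missing ages missing."""
    if pd.isna(age):
        return np.nan      # leave unknown ages uncategorized
    if age < 18:
        return "Child"
    elif age <= 60:
        return "Adult"
    else:
        return "Senior"

ages = pd.Series([4, 18, 60, 71, np.nan])
categories = ages.apply(age_category_with_missing)

print(categories.tolist())   # ['Child', 'Adult', 'Adult', 'Senior', nan]
```

With this variant, `value_counts()` leaves the missing rows out by default; passing `dropna=False` shows them as their own group.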

What about saving our data?

In [None]:
# let's save into the current working directory
# index=False avoids writing the row index as an extra unnamed column
titanic_df.to_csv("my_new_titanic.csv", index=False)
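
By default, `to_csv` also writes the DataFrame's index, which shows up as an extra `Unnamed: 0` column when the file is read back. A small sketch of the difference, using a made-up table and a temporary folder so nothing in the working directory is overwritten:

```python
import tempfile
from pathlib import Path

import pandas as pd

demo_df = pd.DataFrame({"Name": ["Allen", "Braund"], "Fare": [8.05, 7.25]})
out_dir = Path(tempfile.mkdtemp())   # temporary folder, just for this example

# Default: the index is written as an extra, unnamed first column
demo_df.to_csv(out_dir / "with_index.csv")
# index=False: only the real columns are written
demo_df.to_csv(out_dir / "no_index.csv", index=False)

with_index_cols = pd.read_csv(out_dir / "with_index.csv").columns.tolist()
no_index_cols = pd.read_csv(out_dir / "no_index.csv").columns.tolist()

print(with_index_cols)   # ['Unnamed: 0', 'Name', 'Fare']
print(no_index_cols)     # ['Name', 'Fare']
```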

Almost done! Let's save our notebook `.ipynb` and also export as an HTML!