---
# **Lab 6: Titanic Project**
---


Through the following questions, you will hone and apply your skills to a famous dataset containing information about passengers on the Titanic and whether they survived or not: [Titanic dataset from Kaggle](https://www.kaggle.com/competitions/titanic/overview).


<br>

**About the Dataset**

There are twelve columns in the dataset. The target column is `Survived`, which indicates whether a passenger survived (1) or not (0). The features initially available are:

* `PassengerId`: Numeric, a unique number for each passenger.
* `Pclass`: Numeric, the ticket class. 1 = 1st, 2 = 2nd, 3 = 3rd.
* `Name`: Categorical, the name of the passenger.
* `Sex`: Categorical, the sex of the passenger.
* `Age`: Numeric, the passenger's age in years.
* `Sibsp`: Numeric, the number of siblings / spouses aboard the Titanic.
* `Parch`: Numeric, the number of parents / children aboard the Titanic.
* `Ticket`: Categorical, ticket number.
* `Fare`: Numeric, passenger fare.
* `Cabin`: Categorical, cabin number.
* `Embarked`: Categorical, port of embarkation. C = Cherbourg, Q = Queenstown, S = Southampton.
* `Hometown`: Categorical, passenger home town.
* `Destination`: Categorical, ultimate return point.
* `HasCabin`: Categorical, whether the passenger(s) had a cabin or not.

<br>

**Resources**
* [Python Basics Cheat Sheet](https://docs.google.com/document/d/1NcIC6so-GM5t5kd-iS7HGapWmKc8WJyAQAI00gemBms/edit?usp=drive_link)

* [pandas Commands](https://docs.google.com/document/d/1pLCyzig38Mop0Iib021X47S25WBEqZCWf7LRdpC8hGw/edit?usp=drive_link)

* [Data Visualizations with matplotlib](https://docs.google.com/document/d/1tCKyB_E2A-S_rwTIN6JHE9lCQiK4DLTQTt25Lc-uQcs/edit?usp=drive_link)

* [Data Wrangling Cheat Sheet](https://docs.google.com/document/d/1rQaux3Ccj7x-cDdIGfd56BszSkqQpbZ3EonMFbJxfxI/edit?usp=drive_link)



<a name="p1"></a>

---
## **Week 6: Data Cleaning**
---

#### **Problem #0**

In the cell below, import the necessary libraries and functions you've used in the course so far.

Using the URL below, import the dataset from GitHub.

* https://raw.githubusercontent.com/the-codingschool/TRAIN-datasets/refs/heads/main/titanic/titanic_raw.csv

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
url = 'https://raw.githubusercontent.com/the-codingschool/TRAIN-datasets/refs/heads/main/titanic/titanic_raw.csv'
titanic_df = pd.read_csv(url)
titanic_df.head()  # call the method to preview the first five rows

#### **Problem #1**

Drop any duplicates from the dataset. How many duplicates were dropped?

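In [None]:
n_before = len(titanic_df)  # row count before dropping
titanic_df = titanic_df.drop_duplicates()  # reassign: drop_duplicates() returns a new DataFrame
titanic_df  # n_before - len(titanic_df) is the number of duplicates dropped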

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Sibsp,Parch,Ticket,Fare,Cabin,Embarked,Hometown,Boarded,Destination
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,"Bridgerule, Devon, England",Southampton,"Qu'Appelle Valley, Saskatchewan, Canada"
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"New York, New York, US",Cherbourg,"New York, New York, US"
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,"Jyväskylä, Finland",Southampton,New York City
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,"Scituate, Massachusetts, US",Southampton,"Scituate, Massachusetts, US"
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,"Birmingham, West Midlands, England",Southampton,New York City
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,1305,,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S,"London, England",Southampton,New York City
1305,1306,,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C,"Madrid, Spain",Cherbourg,"New York, New York, US"
1306,1307,,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S,"Skaun, Sør-Trøndelag, Norway",Southampton,US
1307,1308,,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S,"Greenwich, London, England",Southampton,New York City


#### **Problem #2**

Determine how many missing values there are in the dataset.

In [None]:
titanic_df.isna().sum()  # number of missing values in each column

#### **Problem #3**

Drop all rows with a null `Survived` value in place, using `Survived` as your subset. We aren't imputing these null values because `Survived` is supposed to record what actually happened in reality, so we don't want to replace those values and skew the results. We also don't want to introduce bias by imputing on our entire dataset and then splitting the data into train and test sets.
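
As a rough illustration of dropping rows based on a single column, here is a minimal sketch on a made-up DataFrame. The names `toy_df`, `feature`, and `label` are hypothetical and are not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "label" plays the role of a target column.
toy_df = pd.DataFrame({
    "feature": [1.0, np.nan, 3.0, 4.0],
    "label": [0.0, 1.0, np.nan, 1.0],
})

# Drop only the rows whose "label" is null; nulls in other columns are kept.
# inplace=True modifies toy_df directly instead of returning a copy.
toy_df.dropna(subset=["label"], inplace=True)

print(toy_df)
```

Note that the row with a missing `feature` survives, because only the columns listed in `subset` are checked.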

In [None]:
# COMPLETE THIS CODE

#### **Problem #4**

Drop the column `Cabin`. The majority of the values in this column are null, so we might skew our dataset using imputation.

In [None]:
# COMPLETE THIS CODE

#### **Problem #5**

Impute the missing values in the `Age` column using the median.
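
For reference, median imputation might look like the sketch below on a made-up column. The names `toy_df` and `height_cm` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing heights.
toy_df = pd.DataFrame({"height_cm": [150.0, np.nan, 170.0, 190.0, np.nan]})

# median() ignores NaN by default, so it is computed from the known values only.
median_height = toy_df["height_cm"].median()

# Replace every missing value with that median.
toy_df["height_cm"] = toy_df["height_cm"].fillna(median_height)

print(toy_df)
```

The median is often preferred over the mean for skewed columns because a few extreme values do not pull it far from the typical value.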

In [None]:
# COMPLETE THIS CODE

#### **Problem #6**

Filter the data frame to visualize the remaining null values in the dataset.
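
One possible way to show only the rows that still contain a null is a boolean mask, sketched here on made-up data. The names `toy_df`, `a`, and `b` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 2 each contain one missing value.
toy_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", None],
})

# isna() marks each missing cell; any(axis=1) flags rows with at least one.
has_null = toy_df.isna().any(axis=1)

# Keep only the flagged rows.
rows_with_nulls = toy_df[has_null]

print(rows_with_nulls)
```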

In [None]:
# COMPLETE THIS CODE


#### **Problem #7**

Drop the `Boarded` column and replace the missing values in the `Embarked`, `Hometown`, and `Destination` columns with the most commonly occurring value.
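
As a small sketch of both steps on made-up data, the example below drops one column and fills another with its most common value. The names `toy_df`, `city`, and `notes` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "city" has one missing entry, "notes" is unwanted.
toy_df = pd.DataFrame({
    "city": ["Lyon", "Paris", np.nan, "Paris"],
    "notes": ["a", "b", "c", "d"],
})

# Drop the unwanted column.
toy_df = toy_df.drop(columns=["notes"])

# mode() returns a Series, since ties are possible; [0] takes the first entry.
most_common_city = toy_df["city"].mode()[0]

# Fill the missing entry with the most common value.
toy_df["city"] = toy_df["city"].fillna(most_common_city)

print(toy_df)
```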

In [None]:
# COMPLETE THIS CODE

#### **Problem #8**

Check for any potential missing values that aren't categorized as nulls.
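
Missing values are sometimes stored as placeholders that `isna()` does not detect. The sketch below shows one way to look for them on made-up data. The column `port` and the `placeholders` list are assumptions for illustration, so the placeholders in a real dataset may differ.

```python
import pandas as pd

# Hypothetical toy data: "?", " ", and "unknown" stand in for missing values.
toy_df = pd.DataFrame({"port": ["S", "?", "C", " ", "unknown"]})

# isna() finds nothing, because the placeholders are ordinary strings.
print(toy_df.isna().sum())

# Listing the distinct values makes suspicious entries easy to spot.
print(toy_df["port"].unique())

# Assumed placeholder strings, checked after trimming whitespace.
placeholders = ["?", "", "unknown"]
stripped = toy_df["port"].str.strip()
is_placeholder = stripped.isin(placeholders)

# mask() replaces the flagged entries with NaN so they count as nulls.
cleaned_port = stripped.mask(is_placeholder)

print(cleaned_port)
```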

In [None]:
# COMPLETE THIS CODE

#### **Problem #9**

Visualize the number of passengers who did versus did not survive using a bar graph.

In [None]:
categories = ['Not Survived', 'Survived']
bars = # COMPLETE THIS CODE

plt.# COMPLETE THIS CODE

plt.ylabel(# COMPLETE THIS CODE
plt.title(# COMPLETE THIS CODE

plt.show()
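
For comparison, here is a minimal sketch of a two-category bar graph built from made-up data. The names `toy_df` and `passed`, along with the labels, are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: 1 = passed, 0 = did not pass.
toy_df = pd.DataFrame({"passed": [1, 0, 1, 1, 0, 1]})

# Count each outcome; sort_index() puts 0 first so the order matches the labels.
counts = toy_df["passed"].value_counts().sort_index()

categories = ["Did Not Pass", "Passed"]
plt.bar(categories, counts.values)

plt.ylabel("Number of Students")
plt.title("Exam Outcomes")

plt.show()
```

Sorting the counts by index matters here: `value_counts()` orders by frequency by default, which could silently swap the bars relative to the labels.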

#### **Problem #10**

Visualize the number of male versus female passengers.

In [None]:
categories = ['Male', 'Female']
bars = # COMPLETE THIS CODE

#### **Problem #11**

Encode the `Sex` and `Embarked` columns using dictionary encoding.
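
Dictionary encoding can be sketched as follows on a made-up column. The names `toy_df`, `size`, and `size_codes` are hypothetical, and the numeric codes are arbitrary choices.

```python
import pandas as pd

# Hypothetical toy data with a categorical column.
toy_df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# The dictionary maps each category to the number that should replace it.
size_codes = {"small": 0, "medium": 1, "large": 2}

# map() looks up each value in the dictionary.
toy_df["size_encoded"] = toy_df["size"].map(size_codes)

print(toy_df)
```

Any value missing from the dictionary becomes `NaN` after `map()`, so it is worth checking for nulls afterwards.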

In [None]:
# COMPLETE THIS CODE

# Homework: 2015 World Happiness Report Project

---



### **Homework from Weeks 6 and 7 will both be based on the following information...**

---



### **Description**

In this project, you will use what you have learned so far about the machine learning process and Linear Regression to analyze the official 2015 World Happiness Report from the United Nations. In particular, you will explore and visualize this data and then model the Happiness Score based on the variables reported in this dataset.

<br>


### **Overview**
For this project, you are given data collected for the 2015 UN Happiness Report. The 2015 Happiness Report, also known as the World Happiness Report 2015, is a publication that presents rankings of countries based on their levels of happiness and well-being. The report is a collaborative effort between the Sustainable Development Solutions Network (SDSN) and the Earth Institute at Columbia University, with contributions from various researchers and experts.

<br>

The report includes rankings of 158 countries based on the "World Happiness Index," which is calculated using survey data from the Gallup World Poll and other sources. The index combines factors such as GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity, and perceptions of corruption to assess overall happiness levels.

<br>

The 2015 Happiness Report sheds light on the relationship between happiness, well-being, and sustainable development, emphasizing the importance of incorporating measures of happiness into policy-making and development strategies. It provides valuable insights into global happiness levels, highlighting the factors that contribute to happiness and offering recommendations for policymakers and individuals to improve overall well-being.

<br>

Everything you need is provided below. If you are curious to learn more, the [official source can be found here](https://worldhappiness.report/ed/2015/#appendices-and-data). Here is a list of variables for your reference:

* `Country`: The country that the data corresponds to.

* `Region`: The region that this country is classified as belonging to.

* `Happiness Score`: A metric measured in 2015 by asking the sampled people the question: "How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest."

* `GDP`: The extent to which GDP contributes to the calculation of the Happiness Score.

* `Social Support`: The extent to which Family contributes to the calculation of the Happiness Score.

* `Health Life Expectancy`: The extent to which Life expectancy contributes to the calculation of the Happiness Score.

* `Freedom`: The extent to which Freedom contributes to the calculation of the Happiness Score.

* `Corruption Perception`: The extent to which Perception of Corruption contributes to the Happiness Score.

* `Generosity`: A model of the national average of responses to the question “Have you donated money to a charity in the past month?” on GDP per capita.

**NOTE**: All numerical variables except `Happiness Score` have already been standardized.

<br>

**Resources**
- [EDA Cheat Sheet](https://docs.google.com/document/d/1pLCyzig38Mop0Iib021X47S25WBEqZCWf7LRdpC8hGw/edit?usp=drive_link)

- [Data Wrangling Cheat Sheet](https://docs.google.com/document/d/1rQaux3Ccj7x-cDdIGfd56BszSkqQpbZ3EonMFbJxfxI/edit?usp=drive_link)

- [Linear Regression Cheat Sheet](https://docs.google.com/document/d/10ON1Vt_Ll3Bduu12r8g52VZzsMFt21lga_6zTvYDPVc/edit?usp=drive_link)

### **Homework 6: Data Exploration, Wrangling, and Visualization**
---
**There are corresponding questions about this project on Canvas.**

---

In this part, you will load in and explore the dataset for this project. This will involve using functions from pandas as well as reading source material to understand the data that you are working with.

<br>

***Note:*** *In most real world situations, you will not do data exploration, wrangling, and visualization separately as we have done in the past. As such, you will simply be asked to perform tasks throughout this section without explicitly distinguishing between exploration, wrangling, and visualization.*

<br>

---

<br>

#### **Problem #0**

In the cell below, import the necessary libraries and functions you've used in the course so far.

Using the URL below, import the dataset from the Google Sheets link.

* https://docs.google.com/spreadsheets/d/e/2PACX-1vSUoGLZ90Qr6A5-DmdYD30CIEwMqIAmtWSbdcLgi10u5WoCtCuj_RuSm7wDsFsfcwPGRB6ZZDduCxpO/pub?gid=108149846&single=true&output=csv


In [None]:
#COMPLETE THIS CODE

#### **Problem #1**

Spend a few minutes getting familiar with the data. Some things to consider: how many instances are there? How many features? What are the features' datatypes?

In [None]:
# COMPLETE THIS CODE

#### **Problem #2**

This data currently has no consistent naming convention for columns, which is very bad practice. So, rename each column to be of the style, `'Column Name'`, where each word is separated by a space (not an underscore, slash, or anything else) and starts uppercase. Furthermore, make sure all words are spelled correctly.

<br>

**Hint**: It may make your life easier to quickly print the current column names here using the `.columns` attribute.
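
As an illustration of the renaming pattern, here is a sketch on a made-up DataFrame. The column names are hypothetical and are not the ones in the happiness data.

```python
import pandas as pd

# Hypothetical toy data with inconsistent and misspelled column names.
toy_df = pd.DataFrame({
    "first_name": ["Ada"],
    "home/town": ["London"],
    "favorite colr": ["blue"],
})

# Print the current names to see what needs fixing.
print(toy_df.columns)

# rename() takes a dictionary of {old name: new name}.
toy_df = toy_df.rename(columns={
    "first_name": "First Name",
    "home/town": "Home Town",
    "favorite colr": "Favorite Color",
})

print(toy_df.columns)
```

Old names that do not appear in the DataFrame are ignored without an error by default, so printing the columns afterwards is a quick way to confirm that each rename took effect.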

In [None]:
# COMPLETE THIS CODE

#### **Problem #3**

Drop any duplicate rows.

In [None]:
# COMPLETE THIS CODE

#### **Problem #4**

Determine the datatypes of each feature. Determine the number of non-null values in each column.
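
Both pieces of information can be read from a single summary or retrieved separately, as sketched here on made-up data. The names `toy_df`, `name`, `score`, and `rank` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing string and one missing number.
toy_df = pd.DataFrame({
    "name": ["A", "B", None],
    "score": [1.5, np.nan, 3.0],
    "rank": [1, 2, 3],
})

# info() prints the datatype and the non-null count of every column.
toy_df.info()

# The same facts are also available as separate objects.
column_dtypes = toy_df.dtypes
non_null_counts = toy_df.count()

print(column_dtypes)
print(non_null_counts)
```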

In [None]:
# COMPLETE THIS CODE

#### **Problem #5**

You should have seen from Problem #4 that there are 3 columns with null values. We need to either impute by filling with the average or drop the rows with null values.


Let's deal with these columns by type, specifically:
1. Impute or drop the numerical null values.
2. Impute or drop the object (string) null values.

##### **1. Impute or drop the numerical null values.**

Complete the code below to *drop* the numerical null values. There's an argument for dropping or imputing, but dropping is a safer choice that does not rely on making any assumptions about these variables since we aren't necessarily subject matter experts.

In [None]:
happy_df = happy_df.dropna(axis = 0, how='any', subset = [# COMPLETE THIS LINE

##### **2. Impute or drop the object (string) null values.**

Complete the code below to *impute* the object (string) null value(s). This is something we can look up, so it's completely reasonable to fill in the missing values manually and not have to sacrifice more data points.

<br>

**NOTE**: You will likely need to use the following three commands to accomplish this:

1. `happy_df[happy_df['column name'].isnull()]`: Print the specific data point(s) with a null value for `'column name'`.
2. `happy_df['column name'].unique()`: Print the possible values that we could use to fill in the null value found above.
3. `happy_df.loc[happy_df['column name'].isnull(), 'column name'] = 'non-null value'`: Fill in the null value with a new value. This should be the best option from the list of unique values found above.
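
The three commands above can be sketched together on a made-up DataFrame. The names `toy_df`, `animal`, and `group` are hypothetical, and the fill value is chosen by looking the answer up manually.

```python
import pandas as pd

# Hypothetical toy data: the group for "dog" is missing.
toy_df = pd.DataFrame({
    "animal": ["cat", "sparrow", "dog", "eagle"],
    "group": ["mammal", "bird", None, "bird"],
})

# 1. Find the data point(s) with a null value.
missing_rows = toy_df[toy_df["group"].isnull()]
print(missing_rows)

# 2. List the existing values that could be used to fill it in.
options = toy_df["group"].unique()
print(options)

# 3. Fill in the null with the correct existing value (a dog is a mammal).
toy_df.loc[toy_df["group"].isnull(), "group"] = "mammal"

print(toy_df)
```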

In [None]:
# COMPLETE THIS CODE

In [None]:
# COMPLETE THIS CODE

In [None]:
# COMPLETE THIS CODE

#### **Problem #6**

Determine all the regions that were included in this dataset.

In [None]:
# COMPLETE THIS CODE

#### **Problem #7**

Complete the code below to create a new feature called `Region Encoded` that encodes the regions into numerical values using a Python dictionary.

In [None]:
# COMPLETE THIS CODE

#### **Problem #8**

Let's visualize some of the data and see if we can discover some relationships. Specifically, create bar graphs of `Happiness Score` for the countries in several different regions: `"Middle East and Northern Africa"`, `"Southern Asia"`, and `"North America"`.


**NOTE:** Some of the code has already been provided for the first example to help you get started.

##### **Middle East and Northern Africa**

In [None]:
x = happy_df[happy_df["Region"] == # COMPLETE THIS LINE
y = # COMPLETE THIS LINE

plt.# COMPLETE THIS LINE

plt.title(# COMPLETE THIS LINE
plt.xlabel(# COMPLETE THIS LINE
plt.ylabel(# COMPLETE THIS LINE
plt.xticks(rotation = 90)

plt.show()

##### **Southern Asia**

In [None]:
# COMPLETE THIS CODE

##### **North America**

In [None]:
# COMPLETE THIS CODE

#### **Problem #9**

Now, create scatter plots of `Happiness Score` on the y-axis versus several different numerical variables on the x-axis: `Social Support`, `Freedom`, and `GDP`.

##### **Happiness Score vs. Social Support**

In [None]:
# COMPLETE THIS CODE

##### **Happiness Score vs. Freedom**

In [None]:
# COMPLETE THIS CODE

##### **Happiness Score vs. GDP**

In [None]:
# COMPLETE THIS CODE

---

# End of Notebook

© 2024 The Coding School, All rights reserved