---
---
---

## 🚢 **_Titanic_ Exploratory Data Analysis** 🔍

In this notebook, we'll introduce the core technical concepts of data analysis by exploring the well-known _Titanic Dataset_.

---
---

### 🔹 Import Our Data Science Toolkit

In [1]:
import numpy as np                  # Numerical/Mathematical Operations
import pandas as pd                 # Data Manipulation Operations
import seaborn as sns               # Extended (Beautified) Graphing Operations

### 🔹 Get the Titanic Dataset

1. Define the path to our data.
2. Load the data as a _Pandas DataFrame_.

In [2]:
DATAPATH = "titanic.csv"

dataset = pd.read_csv(DATAPATH)

#### 🔹 Peek at Our Data

In [3]:
dataset.head(5)     # Shows the first five rows

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
dataset.tail(5)     # Shows the last five rows

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


#### 🔹 Create a Quick Data Dictionary

- `PassengerId`: IDs of Passengers
- `Survived`: Whether a passenger survived or not.
  - `0`: Did not survive.
  - `1`: Did survive.
- `Pclass`: Economic class of passenger.
- `Name`: Name of passenger.
- `Sex`: Sex of passenger.
- `Age`: Age of passenger.
- `SibSp`: Number of siblings and spouses that passenger has on board the vessel.
- `Parch`: (ParCh) Number of parents and children that passenger has on board the vessel.
- `Ticket`: Ticket number of passenger.
- `Fare`: Amount (in dollars) paid by passenger for ticket.
- `Cabin`: Precise cabin that passenger stayed in on vessel.
- `Embarked`: Location that passenger embarked upon vessel.

---

**Our goal is to distinguish between descriptive questions, inferential questions, and predictive questions.**

Descriptive questions examine basic patterns and relationships across our dataset.

> EX: What percentages of passengers were 1st class, 2nd class, and 3rd class?

Inferential questions explore deeper relationships with a heavier reliance on statistics and probability.

> EX: Given that a passenger paid less than $200 and was relatively young, what's the likelihood that they were 3rd class?

Predictive questions use machine learning and modeling to learn patterns from existing data and forecast outcomes for new, unseen data.

> EX: If a new 3rd-class passenger embarks on the Titanic, are they going to survive or not?
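
To make the inferential example more concrete, here is a minimal sketch that estimates a conditional share with boolean masks. The `toy` DataFrame is hypothetical stand-in data rather than the real Titanic file, and "relatively young" is assumed to mean under 30. Treat the result as a rough estimate only; a full inferential analysis would also quantify uncertainty.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Titanic columns used here.
toy = pd.DataFrame({
    "Pclass": [3, 3, 1, 2, 3, 1],
    "Age":    [22.0, 19.0, 38.0, 27.0, 45.0, 24.0],
    "Fare":   [7.25, 8.05, 71.28, 13.00, 7.75, 263.00],
})

# Condition: paid less than 200 and was under 30 (our assumed meaning of "young").
condition = (toy["Fare"] < 200) & (toy["Age"] < 30)

# Among the rows that meet the condition, the share that are 3rd class.
share_third_class = (toy.loc[condition, "Pclass"] == 3).mean()

print(f"Estimated share of 3rd class given the condition: {share_third_class:.2f}")
```

The mean of a boolean Series is the fraction of `True` values, which is why `.mean()` gives the share directly.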

---

### 🔹 Get Some Descriptive Statistics Across Our Data

In [5]:
dataset.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [6]:
dataset.describe(include="O")

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Braund, Mr. Owen Harris",male,347082,B96 B98,S
freq,1,577,7,4,644


#### 🔹 Get the Dataset's Dimensionality

In [7]:
len(dataset)            # Returns number of rows in data

891

In [8]:
np.shape(dataset)       # Returns number of rows and columns in data

(891, 12)

#### 🔹 Detect Null Values Across Dataset

_This is one of the immediate first things you want to do for ANY data: **ensure all null values are accounted for and "imputed".**_

In [9]:
dataset.isna().sum()      # isna() marks each null value as True; sum() counts the Trues per column

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Most `Cabin` rows do _not_ have values - they are _null_.

Some `Age` rows do _not_ have values and are null.

Very few `Embarked` rows do _not_ have values and are null.
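
This notebook only detects nulls, so the following is a hedged sketch of what a simple imputation could look like. It assumes a median fill for a numeric column and a most-frequent-value fill for a categorical column, and it runs on a hypothetical `toy` DataFrame rather than the real data. Other strategies may suit your data better.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps similar to the `Age` and `Embarked` columns.
toy = pd.DataFrame({
    "Age":      [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
})

imputed = toy.copy()

# Numeric column: fill missing values with the median of the known values.
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())

# Categorical column: fill missing values with the most frequent value (the mode).
imputed["Embarked"] = imputed["Embarked"].fillna(imputed["Embarked"].mode()[0])

print(imputed.isna().sum())
```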

#### 🔹 Drop Heavily-Null Data and Useless Data (`Cabin`, `PassengerId`)

In [10]:
COLUMNS_TO_DROP = ["Cabin", "PassengerId"]

# errors="ignore" skips columns that are already gone, so this cell is safe to re-run
dataset.drop(columns=COLUMNS_TO_DROP, inplace=True, errors="ignore")

In [11]:
dataset.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In [12]:
dataset.isna().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Embarked      2
dtype: int64

---

### 🔶 Q1: What are the percentages of passengers per class?

#### 🔹 Slice Our Data

**Data Slicing** is a way of getting subsets of our dataset/DataFrame based on _logical statements_.

In [13]:
"""
QUESTION: What are the percentages of passengers per class?

ASSUMPTIONS:
   - There are 891 passengers that represent 100% of the data.
   - There are three classes.

METHODOLOGY:
   1. Determine specific values per passenger class (`Pclass`).
   2. Get slices (subsets) of our data per each class.
   3. Divide size of sliced data by size of original data.
"""

#STEP 1: Determine specific values per passenger class (`Pclass`).

#Slicing works the same way we get values from dictionaries: dict[key] -> values

dataset["Pclass"]               # Returns Series representation of sliced feature
dataset[["Pclass"]]             # Returns DF representation of sliced feature

dataset["Pclass"].unique()      # Returns set of unique values within feature

# Returns the number of rows for each unique value
dataset["Pclass"].value_counts()

# STEP 2: Get slices (subsets) of our data per each class.

### COMMON METHOD OF SLICING BY LOGICAL ARGUMENTS

dataset[(dataset["Pclass"] == 1)]
dataset[(dataset["Pclass"] == 2)]
dataset[(dataset["Pclass"] == 3)]

### DESCRIPTIVE METHOD OF SLICING BY LOGICAL ARGUMENTS

ARG_PCLASS_FIRST = (dataset["Pclass"] == 1)
dataset_pclass_first = dataset[ARG_PCLASS_FIRST]

ARG_PCLASS_SECOND = (dataset["Pclass"] == 2)
dataset_pclass_second = dataset[ARG_PCLASS_SECOND]

ARG_PCLASS_THIRD = (dataset["Pclass"] == 3)
dataset_pclass_third = dataset[ARG_PCLASS_THIRD]

# STEP 3: Divide size of sliced data by size of original data.

total_p = len(dataset)
p1 = len(dataset_pclass_first)
p2 = len(dataset_pclass_second)
p3 = len(dataset_pclass_third)

print("Percentage of 1st-Class Passengers: {:.3f}%".format(100 * p1 / total_p))
print("Percentage of 2nd-Class Passengers: {:.3f}%".format(100 * p2 / total_p))
print("Percentage of 3rd-Class Passengers: {:.3f}%".format(100 * p3 / total_p))

Percentage of 1st-Class Passengers: 24.242%
Percentage of 2nd-Class Passengers: 20.651%
Percentage of 3rd-Class Passengers: 55.107%
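
As a shorter alternative to slicing each class by hand, pandas can compute the shares in one call with `value_counts(normalize=True)`. The sketch below uses a small hypothetical `toy` DataFrame so that it runs on its own; the same call should work on `dataset["Pclass"]`.

```python
import pandas as pd

# Hypothetical toy column standing in for `Pclass`.
toy = pd.DataFrame({"Pclass": [3, 1, 3, 1, 3, 2, 3, 2]})

# normalize=True returns fractions instead of raw counts; multiply by 100 for percentages.
class_percentages = toy["Pclass"].value_counts(normalize=True).sort_index() * 100

print(class_percentages.round(3))
```

Sorting by index lists the classes in 1st, 2nd, 3rd order rather than by frequency.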


### 🔶 Q2: How many passengers were female and under the age of 30?

**Two major logical arguments:**
- `Sex`: _Female_
- `Age`: <30

In [14]:
# Identify logical arguments/constraints
ARG_FEMALE =        (dataset["Sex"] == "female")
ARG_LESS_THAN_30 =  (dataset["Age"] < 30)

# Multislice dataset based on logical arguments
dataset_young_women = dataset[ARG_FEMALE & ARG_LESS_THAN_30]

len(dataset_young_women)

147
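
One detail worth keeping in mind: a comparison such as `Age < 30` evaluates to `False` for missing ages, so passengers with an unknown age are silently left out of the count. The sketch below illustrates this on a hypothetical `toy` DataFrame and counts the unknown-age rows separately.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the second passenger has no recorded age.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "female"],
    "Age": [26.0, np.nan, 22.0, 35.0],
})

is_female = toy["Sex"] == "female"
under_30 = toy["Age"] < 30          # NaN < 30 evaluates to False

# Women known to be under 30, and women whose age is simply unknown.
known_young_women = int((is_female & under_30).sum())
unknown_age_women = int((is_female & toy["Age"].isna()).sum())

print(known_young_women, unknown_age_women)
```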

### 🔶 Q3: In Class Question:

#### `What percentage of passengers paid more than $38.00 per ticket and survived the wreck?`

(**NOTE**: _Make sure it's a descriptive question and not inferential or predictive!_)


In [15]:
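ARG_FARE_OVER_38 = (dataset["Fare"] > 38.00)
ARG_SURVIVED =     (dataset["Survived"] == 1)

# Count the matching passengers, then divide by the total to get a percentage
matching = len(dataset[ARG_FARE_OVER_38 & ARG_SURVIVED])
print("{} of {} passengers: {:.3f}%".format(matching, len(dataset), 100 * matching / len(dataset)))

118 of 891 passengers: 13.244%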
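
Since the question asks for a percentage, another option is to take the mean of the boolean mask, which equals the fraction of rows where the mask is `True`. This is a minimal sketch on a hypothetical `toy` DataFrame, not the real data.

```python
import pandas as pd

# Hypothetical toy rows standing in for `Fare` and `Survived`.
toy = pd.DataFrame({
    "Fare":     [7.25, 71.28, 53.10, 8.05, 263.00],
    "Survived": [0, 1, 1, 0, 0],
})

# True where the passenger paid more than 38.00 and survived.
mask = (toy["Fare"] > 38.00) & (toy["Survived"] == 1)

# The mean of a boolean Series is the share of True values.
percentage = 100 * mask.mean()

print(f"{percentage:.1f}%")
```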

---

## 📌 **REQUIRED OBJECTIVE!** 📌

**Ask at least three more descriptive questions and make use of data slicing, basic data analysis techniques, and any methods available in the Numpy/Pandas cheatsheets to answer the questions.**

(_If you do not know the technical method to answer a question, at least define the general process for how to do so._)

---

In [16]:
# Question One:
# Of the survivors, how many had family onboard?
survivors = dataset["Survived"] == 1
# A passenger had family aboard if either count is above zero
had_family = (dataset["SibSp"] > 0) | (dataset["Parch"] > 0)

len(dataset[survivors & had_family])

179
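
When combining count columns such as `SibSp` and `Parch`, it helps to turn each one into a True/False condition first. Applying `&` directly to two integer columns performs a bitwise AND, which is not the same as asking whether either count is non-zero. The sketch below shows the difference on a hypothetical `toy` DataFrame.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Titanic columns used here.
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 1],
    "SibSp":    [1, 0, 0, 2],
    "Parch":    [2, 0, 1, 0],
})

# Bitwise AND compares bits, so 1 & 2 is 0 even though both counts are non-zero.
bitwise = toy["SibSp"] & toy["Parch"]

# Comparing with zero first gives a proper True/False mask per passenger.
had_family = (toy["SibSp"] > 0) | (toy["Parch"] > 0)
survivors_with_family = int(((toy["Survived"] == 1) & had_family).sum())

print(bitwise.tolist(), survivors_with_family)
```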

In [17]:
# Question Two:
# Chunking the ages into groups of 5, what age chunk had the most passengers?

only_aged_passengers = dataset[dataset["Age"].notna()]
age_groups = only_aged_passengers["Age"].apply(lambda age: f"{(age // 5) * 5}-{((age // 5) * 5) + 4}")

# Find the age_group that appears the most
age_groups.value_counts().head(1)


Age
20.0-24.0    114
Name: count, dtype: int64
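
Pandas also offers `pd.cut` for this kind of binning, which avoids building the labels by hand. The sketch below is one possible setup, assuming 5-year bins that include the left edge, and it uses a hypothetical `ages` Series instead of the real column.

```python
import pandas as pd

# Hypothetical ages standing in for the `Age` column.
ages = pd.Series([0.42, 22.0, 24.5, 20.0, 26.0, 80.0], name="Age")

# Bin edges every 5 years: 0, 5, ..., 85.
bin_edges = list(range(0, 90, 5))

# right=False makes each bin include its left edge, e.g. [20, 25).
age_bins = pd.cut(ages, bins=bin_edges, right=False)

# Count rows per bin and find the fullest one.
counts = age_bins.value_counts()
largest_bin = counts.idxmax()

print(largest_bin, counts.max())
```

To restrict the count to survivors, the Series could be filtered with a survival mask before binning.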

In [18]:
# Question Three:
# How many children, those aged 16 or younger, survived?

young_survivors = only_aged_passengers[(only_aged_passengers["Age"] <= 16) & (only_aged_passengers["Survived"] == 1)]

len(young_survivors)

55

---

## 📌 **REQUIRED OBJECTIVE!** 📌

Refer to the **[Pandas Cheatsheet](https://images.datacamp.com/image/upload/v1676302204/Marketing/Blog/Pandas_Cheat_Sheet.pdf)** and **[NumPy Cheatsheet](https://images.datacamp.com/image/upload/v1676302459/Marketing/Blog/Numpy_Cheat_Sheet.pdf)** and implement at least three unique functions/methods from each package that you have not yet used before.

Feel free to refer to documentation and resources online to better understand what you can do with `pandas` and `numpy`!

---

In [19]:
# Students: write some code here!

# accidentally did all of that above lol
# notna
# apply
# dataset slicing (only_aged_passengers = dataset[dataset["Age"].notna()])
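
The objective also asks for NumPy functions, so here is a small sketch of three that could be tried: `np.where`, `np.percentile`, and `np.unique`. The arrays are hypothetical values, not taken from the dataset, and the "high"/"low" cutoff of 38.0 is only an illustration.

```python
import numpy as np

# Hypothetical fares and passenger classes.
fares = np.array([7.25, 71.28, 7.93, 53.10, 8.05])
classes = np.array([3, 1, 3, 1, 3])

# np.where: choose a label per element based on a condition.
fare_band = np.where(fares > 38.0, "high", "low")

# np.percentile: the 50th percentile is the median.
median_fare = np.percentile(fares, 50)

# np.unique with return_counts=True: distinct values and how often each occurs.
class_values, class_counts = np.unique(classes, return_counts=True)

print(fare_band, median_fare, class_values, class_counts)
```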

---

## ✨ **LEVEL UP IN DATA SCIENCE** ✨

Complete the following objectives to ensure you have a good introductory presence in the world of data science:

1. Download an individual version of the **[Anaconda Distribution](https://www.anaconda.com/products/individual)** so you can have a Python distribution dedicated for data science and analysis on your local computer. 🐍

2. From your Anaconda Distribution, open a **Jupyter Notebook** and save it to a relevant location on your computer (e.g. a development folder or your Desktop). 🪐

3. Ensure your Python distribution is working effectively by creating a **dummy script/function** (e.g. Hello World) in a Jupyter cell. ✅

4. Take a quick break and visit **[Kaggle.com](https://www.kaggle.com/)**; create an individual account using either your DU credentials or an email account of your preference. 👤

5. Using the sidebar on the Kaggle page, explore the expanse of **[Kaggle Datasets](https://www.kaggle.com/datasets)** and download a dataset of your choice that seems interesting to explore. (**NOTE**: _For simplicity's sake, ensure that the dataset is a `.csv`-type file._) 📜

6. Utilize **`NumPy`** and **`Pandas`** in a Jupyter Notebook to begin **exploring your dataset** as a DataFrame object. 🔍

7. Perform an introductory **Exploratory Data Analysis** on your newly downloaded data similar to what we performed in this current notebook: explore and clean up your data a little bit before asking and answering some **basic descriptive analysis questions**. 💬

---

---
---
---