Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menu bar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menu bar, select Cell$\rightarrow$Run All).

Make sure that in addition to the code, you provide written answers for all questions of the assignment. 

Below, please fill in your name and collaborators:

In [None]:
NAME = "Sean(Cheng) Xu"
COLLABORATORS = ""

## Assignment 2 - Data Analysis using Pandas
**(15 points total)**

For this assignment, we will analyze the open dataset with data on the passengers aboard the Titanic.

The data file for this assignment can be downloaded from the Kaggle website: https://www.kaggle.com/c/titanic/data, file `train.csv`. It is also attached to the assignment page. The definitions of all variables can be found on the same Kaggle page, in the Data Dictionary section.

Read the data from the file into a pandas DataFrame. Analyze, clean and transform the data to answer the following question: 

**What categories of passengers were most likely to survive the Titanic disaster?**

**Question 1.**  _(4 points)_
* The answer to the main question - What categories of passengers were most likely to survive the Titanic disaster? _(2 points)_
* The detailed explanation of the logic of the analysis _(2 points)_

**Question 2.**  _(3 points)_
* What other attributes did you use for the analysis? Explain how you used them and why you decided to use them. 
* Provide a complete list of all attributes used.

**Question 3.**  _(3 points)_
* Did you engineer any attributes (create new attributes)? If yes, explain the rationale and how the new attributes were used in the analysis.
* If you have excluded any attributes from the analysis, provide an explanation why you believe they can be excluded.

**Question 4.**  _(5 points)_
* How did you treat missing values for those attributes that you included in the analysis (for example, `age` attribute)? Provide a detailed explanation in the comments.


In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("train.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [3]:
#Keep only the columns 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp' and 'Parch'

titanic = df[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch']]
titanic.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch
0,0,3,male,22.0,1,0
1,1,1,female,38.0,1,0
2,1,3,female,26.0,0,0
3,1,1,female,35.0,1,0
4,0,3,male,35.0,0,0


In [4]:
#Find the average ages for passengers in Pclass 1, 2 and 3

avg_age = titanic.groupby('Pclass')[['Age']].agg('mean').round(0)
avg_age

Pclass,Age
1,38.0
2,30.0
3,25.0


In [5]:
avg_age_pclass_1 = avg_age.iloc[0,0]
avg_age_pclass_2 = avg_age.iloc[1,0]
avg_age_pclass_3 = avg_age.iloc[2,0]
(avg_age_pclass_1, avg_age_pclass_2, avg_age_pclass_3)

(38.0, 30.0, 25.0)

In [6]:
#Filter the dataframe for passengers in Pclass 1, 2 and 3

pclass_1 = titanic[titanic['Pclass'] == 1]
pclass_2 = titanic[titanic['Pclass'] == 2]
pclass_3 = titanic[titanic['Pclass'] == 3]

In [7]:
#Fill the NaN values in 'Age' with the average age of each Pclass
#Assign the result back instead of calling fillna(inplace=True) on a slice,
#which raises a SettingWithCopyWarning and may not update the data

pclass_1 = pclass_1.assign(Age = pclass_1['Age'].fillna(avg_age_pclass_1))
pclass_2 = pclass_2.assign(Age = pclass_2['Age'].fillna(avg_age_pclass_2))
pclass_3 = pclass_3.assign(Age = pclass_3['Age'].fillna(avg_age_pclass_3))
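The per-class mean imputation above can also be written without splitting the frame into three pieces. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative, not taken from `train.csv`); it assumes the same idea of filling each missing age with the rounded mean age of the passenger's own class.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `titanic` DataFrame; the values are invented.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [38.0, np.nan, 40.0, 30.0, np.nan, 20.0, 30.0, np.nan],
})

# For every row, the rounded mean age of that row's class,
# aligned with the original index.
class_mean_age = toy.groupby("Pclass")["Age"].transform("mean").round(0)

# Fill only the missing ages and assign the result back to the column.
toy["Age"] = toy["Age"].fillna(class_mean_age)

print(toy)
```

Because nothing is filtered or concatenated, the rows stay in their original order, and no chained in-place assignment on a slice is involved.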


In [8]:
#Concatenate the three dataframes after filling in the missing data

titanic_cleaned = pd.concat([pclass_1, pclass_2, pclass_3], axis = 0)
titanic_cleaned.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch
1,1,1,female,38.0,1,0
3,1,1,female,35.0,1,0
6,0,1,male,54.0,0,0
11,1,1,female,58.0,0,0
23,1,1,male,28.0,0,0


In [9]:
titanic_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 890
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
dtypes: float64(1), int64(4), object(1)
memory usage: 48.7+ KB


In [10]:
#Find passengers' min and max age in column 'Age'

(titanic_cleaned['Age'].min(), titanic_cleaned['Age'].max())

(0.42, 80.0)

In [11]:
#Cut age into 10-year bins to create age groups

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
titanic_cleaned['Age_Group'] = pd.cut(titanic_cleaned['Age'], bins)

#Add columns SibSp and Parch to get the total number of family members

titanic_cleaned['Family_Size'] = titanic_cleaned['SibSp'] + titanic_cleaned['Parch']

titanic_cleaned.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Age_Group,Family_Size
1,1,1,female,38.0,1,0,"(30, 40]",1
3,1,1,female,35.0,1,0,"(30, 40]",1
6,0,1,male,54.0,0,0,"(50, 60]",0
11,1,1,female,58.0,0,0,"(50, 60]",0
23,1,1,male,28.0,0,0,"(20, 30]",0
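The labels such as `(30, 40]` come from how `pd.cut` builds its intervals: by default each bin excludes its left edge and includes its right edge. A small sketch with hand-picked ages (illustrative values, reusing the same `bins` list) shows where boundary values land.

```python
import pandas as pd

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
ages = pd.Series([0.42, 10.0, 10.5, 35.0, 80.0])

# Right-closed intervals by default: 10.0 falls in (0, 10], 10.5 in (10, 20].
age_group = pd.cut(ages, bins)

# Integer position of each age's bin within `bins` (0 = first bin).
codes = age_group.cat.codes.tolist()
print(codes)

# An age of exactly 0 would fall outside the first bin (0, 10] and become NaN.
outside = pd.cut(pd.Series([0.0]), bins)
print(outside.isna().all())
```

Since the minimum age in the data is 0.42 and the maximum is 80.0, every passenger falls inside one of the eight bins.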


In [12]:
#Filter the dataframe to keep only the passengers with Survived == 1

titanic_survived = titanic_cleaned[titanic_cleaned['Survived'] == 1]
titanic_survived.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Age_Group,Family_Size
1,1,1,female,38.0,1,0,"(30, 40]",1
3,1,1,female,35.0,1,0,"(30, 40]",1
11,1,1,female,58.0,0,0,"(50, 60]",0
23,1,1,male,28.0,0,0,"(20, 30]",0
31,1,1,female,38.0,1,0,"(30, 40]",1


In [13]:
#Find the survival rate for each Pclass

pclass_series = titanic_survived['Pclass'].value_counts() / titanic_cleaned['Pclass'].value_counts()
pclass_series.apply(lambda x : "%.2f%%" % (x * 100))

1    62.96%
2    47.28%
3    24.24%
Name: Pclass, dtype: object

In [14]:
#Find the survival rate for each Sex

sex_series = titanic_survived['Sex'].value_counts() / titanic_cleaned['Sex'].value_counts()
sex_series.apply(lambda x : "%.2f%%" % (x * 100))

female    74.20%
male      18.89%
Name: Sex, dtype: object

In [15]:
#Find the survival rate for each Age Group

age_series = titanic_survived['Age_Group'].value_counts() / titanic_cleaned['Age_Group'].value_counts()
age_series.apply(lambda x : "%.2f%%" % (x * 100))

(0, 10]     59.38%
(10, 20]    38.26%
(20, 30]    32.36%
(30, 40]    44.86%
(40, 50]    38.37%
(50, 60]    40.48%
(60, 70]    23.53%
(70, 80]    20.00%
Name: Age_Group, dtype: object

In [16]:
#Find the survival rate for each Family Size

family_series = titanic_survived['Family_Size'].value_counts() / titanic_cleaned['Family_Size'].value_counts()
family_series.apply(lambda x : "%.2f%%" % (x * 100))

0     30.35%
1     55.28%
2     57.84%
3     72.41%
4     20.00%
5     13.64%
6     33.33%
7       nan%
10      nan%
Name: Family_Size, dtype: object
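The `nan%` entries for family sizes 7 and 10 come from dividing two `value_counts()` results: a family size with no survivors is absent from the numerator, so the aligned division yields NaN rather than 0. Because `Survived` is coded 0/1, its mean within a group is the survival rate directly. A minimal sketch on a made-up frame (`toy` is illustrative, not the real data):

```python
import pandas as pd

# Toy data: nobody with Family_Size 7 survived.
toy = pd.DataFrame({
    "Family_Size": [0, 0, 0, 1, 1, 7, 7],
    "Survived":    [1, 0, 0, 1, 1, 0, 0],
})

# Ratio of counts: family size 7 is missing from the survivors' counts,
# so index alignment produces NaN for it.
survivors = toy[toy["Survived"] == 1]
ratio = survivors["Family_Size"].value_counts() / toy["Family_Size"].value_counts()

# Mean of a 0/1 column: the fraction of survivors, 0.0 when nobody survived.
rate = toy.groupby("Family_Size")["Survived"].mean()

print(ratio.sort_index())
print(rate)
```

On the notebook's data the equivalent call would be `titanic_cleaned.groupby('Family_Size')['Survived'].mean()`, which does not need the separate `titanic_survived` frame.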

In [17]:
#Find the survival rate of female passengers for each Pclass and age group
#Use .sum() to count the female passengers who survived, since 1 means survived and 0 means did not survive
#Use .count() to count the total number of female passengers

titanic_cleaned[titanic_cleaned['Sex'] == 'female'].groupby(['Pclass', 'Age_Group'])['Survived'].apply(lambda x : "%.2f%%" % ((x.sum() / x.count() * 100)))

Pclass  Age_Group
1       (0, 10]        0.00%
        (10, 20]     100.00%
        (20, 30]      95.24%
        (30, 40]     100.00%
        (40, 50]      92.31%
        (50, 60]     100.00%
        (60, 70]     100.00%
        (70, 80]         NaN
2       (0, 10]      100.00%
        (10, 20]     100.00%
        (20, 30]      90.00%
        (30, 40]      94.12%
        (40, 50]      90.00%
        (50, 60]      66.67%
        (60, 70]         NaN
        (70, 80]         NaN
3       (0, 10]       50.00%
        (10, 20]      52.00%
        (20, 30]      55.41%
        (30, 40]      42.86%
        (40, 50]       0.00%
        (50, 60]         NaN
        (60, 70]     100.00%
        (70, 80]         NaN
Name: Survived, dtype: object

In [18]:
#Find the survival rate of female passengers for each Pclass and family size
#Use .sum() to count the female passengers who survived, since 1 means survived and 0 means did not survive
#Use .count() to count the total number of female passengers

titanic_cleaned[titanic_cleaned['Sex'] == 'female'].groupby(['Pclass', 'Family_Size'])['Survived'].apply(lambda x : "%.2f%%" % ((x.sum() / x.count() * 100)))

Pclass  Family_Size
1       0               97.06%
        1              100.00%
        2              100.00%
        3               50.00%
        4              100.00%
        5              100.00%
2       0               90.62%
        1               89.47%
        2               92.86%
        3              100.00%
        4              100.00%
        5              100.00%
3       0               61.67%
        1               51.72%
        2               54.55%
        3               83.33%
        4                0.00%
        5                0.00%
        6               37.50%
        7                0.00%
        10               0.00%
Name: Survived, dtype: object

In [19]:
#Find the survival rate of male passengers for each Pclass and age group
#Use .sum() to count the male passengers who survived, since 1 means survived and 0 means did not survive
#Use .count() to count the total number of male passengers

titanic_cleaned[titanic_cleaned['Sex'] == 'male'].groupby(['Pclass', 'Age_Group'])['Survived'].apply(lambda x : "%.2f%%" % ((x.sum() / x.count() * 100)))

Pclass  Age_Group
1       (0, 10]      100.00%
        (10, 20]      40.00%
        (20, 30]      47.37%
        (30, 40]      39.13%
        (40, 50]      37.50%
        (50, 60]      28.57%
        (60, 70]       0.00%
        (70, 80]      33.33%
2       (0, 10]      100.00%
        (10, 20]      10.00%
        (20, 30]       4.76%
        (30, 40]      11.54%
        (40, 50]      11.11%
        (50, 60]       0.00%
        (60, 70]      33.33%
        (70, 80]         NaN
3       (0, 10]       36.36%
        (10, 20]      12.96%
        (20, 30]      12.04%
        (30, 40]      14.29%
        (40, 50]       9.09%
        (50, 60]       0.00%
        (60, 70]       0.00%
        (70, 80]       0.00%
Name: Survived, dtype: object

In [20]:
#Find the survival rate of male passengers for each Pclass and family size
#Use .sum() to count the male passengers who survived, since 1 means survived and 0 means did not survive
#Use .count() to count the total number of male passengers

titanic_cleaned[titanic_cleaned['Sex'] == 'male'].groupby(['Pclass', 'Family_Size'])['Survived'].apply(lambda x : "%.2f%%" % ((x.sum() / x.count() * 100)))

Pclass  Family_Size
1       0               33.33%
        1               38.71%
        2               45.45%
        3              100.00%
        5                0.00%
2       0                9.72%
        1                6.67%
        2               47.06%
        3               25.00%
3       0               12.12%
        1               17.86%
        2               32.00%
        3               33.33%
        4                0.00%
        5                0.00%
        6               25.00%
        7                0.00%
        10               0.00%
Name: Survived, dtype: object
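The grouped tables above report only percentages, so a rate of 100.00% or 0.00% may rest on one or two passengers. One way to make that visible is to report the group size next to the rate. The sketch below uses a made-up frame (`toy` is illustrative) and pandas named aggregation; the column names `passengers`, `survived` and `rate_pct` are chosen here for illustration.

```python
import pandas as pd

# Toy data standing in for titanic_cleaned.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male"],
    "Pclass":   [1, 1, 3, 1, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 1],
})

# One row per (Sex, Pclass) group: group size, survivors and survival rate.
summary = (
    toy.groupby(["Sex", "Pclass"])["Survived"]
       .agg(passengers="count", survived="sum", rate="mean")
)
summary["rate_pct"] = (summary["rate"] * 100).round(2)

print(summary)
```

Keeping the rate numeric (instead of formatting it as a string inside `apply`) also allows sorting and filtering, for example keeping only groups with at least some minimum number of passengers.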

**Question 1.**  _(4 points)_
* The answer to the main question - What categories of passengers were most likely to survive the Titanic disaster? _(2 points)_
* The detailed explanation of the logic of the analysis _(2 points)_

In [None]:
# Within each sex, the groups with the highest survival rates are:
# - Female, Pclass 1, family size of 1, 2, 4 or 5, aged between 10 and 40.
# - Female, Pclass 2, family size of 3 to 5, aged between 0 and 20.
# - Male, Pclass 1, family size of 3, aged between 0 and 10.
# Looking at each attribute on its own, the highest survival rates belong to female passengers, Pclass 1,
# the (0, 10] age group and a family size of 3.

# For the analysis, I computed the survival rate for each value of Pclass, Sex, Family Size and Age Group, since
# different categories can have different survival rates. I could not simply count the survivors, because a
# category with many survivors may also have many deaths, so its survival rate could still be very low.
# I then looked at each sex separately, because the overall analysis (regardless of sex) showed that female
# passengers were much more likely to survive than male passengers, so the groups with the best chance of
# survival were likely to be groups of female passengers. It turns out that many categories of female passengers
# have a 100% survival rate, while far fewer categories of male passengers do.

**Question 2.**  _(3 points)_
* What other attributes did you use for the analysis? Explain how you used them and why you decided to use them.
* Provide a complete list of all attributes used.

In [None]:
# I used 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp' and 'Parch'. 'Survived' is the outcome being analyzed, so it is
# the most important attribute. I expected 'Pclass', 'Sex' and 'Age' to be related to the survival rate in this
# disaster. Family size, computed as the sum of 'SibSp' and 'Parch', should also be related to the survival rate,
# since passengers travelling with family would try to bring their family members with them, which could affect
# their own chance of survival either positively or negatively.

**Question 3.**  _(3 points)_
* Did you engineer any attributes (create new attributes)? If yes, explain the rationale and how the new attributes were used in the analysis.
* If you have excluded any attributes from the analysis, provide an explanation why you believe they can be excluded.

In [None]:
# Yes, I created two new attributes, 'Age_Group' and 'Family_Size'. 'Age_Group' assigns each passenger to a
# 10-year age bin, which makes it possible to check whether passengers' survival rate is related to their age.
# 'Family_Size' is the total number of family members, computed as the sum of 'SibSp' and 'Parch', since family
# size might be another key attribute that affects the survival rate.

# I dropped 'PassengerId', 'Name', 'Ticket', 'Fare', 'Cabin' and 'Embarked'. I do not expect 'PassengerId', 'Name',
# 'Ticket' or 'Embarked' to be related to the survival rate. 'Fare' and the location of the 'Cabin' might affect
# the survival rate, but both can be approximated by 'Pclass', since a passenger's class generally indicates
# where the cabin is and how much the fare was.
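The assumption that 'Pclass' can stand in for 'Fare' can be checked by summarizing the fare within each class. The sketch below uses only the five rows printed by `df.head()` earlier in the notebook, so it demonstrates the check rather than proving the claim; on the full data the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Pclass and Fare for the five rows shown by df.head() above.
head_rows = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Fare":   [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Spread of fares inside each class; little overlap between classes would
# support using Pclass in place of Fare.
fare_by_class = head_rows.groupby("Pclass")["Fare"].agg(["min", "median", "max"])

print(fare_by_class)
```

If the fare ranges of the classes overlapped heavily on the full data, 'Fare' would carry information that 'Pclass' alone does not, and dropping it would be harder to justify.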

**Question 4.**  _(5 points)_
* How did you treat missing values for those attributes that you included in the analysis (for example, `age` attribute)? Provide a detailed explanation in the comments.

In [None]:
# There are missing values in 'Age', 'Cabin' and 'Embarked'. Since I dropped 'Cabin' and 'Embarked', I only needed
# to deal with the missing values in 'Age'. I expected the average age to differ between classes: passengers in
# Pclass 1 should be wealthier than those in Pclass 2 and 3, and wealthier passengers might be somewhat older on
# average. Therefore I calculated the average age separately for each of the 3 classes (38, 30 and 25 after
# rounding) and replaced the missing 'Age' values in each class with that class's average age.