# FINAL PROJECT - The Titanic shipwreck

##### In this project, I analyze the Titanic shipwreck data using the data analysis process detailed below: 

1. **Posing questions:** In this phase, questions are posed about the data that is going to be explored. You can also explore the data a little first and then pose your questions.

2. **Data Wrangling:** In this phase, data is acquired and then cleaned for easy analysis. For instance, if the data is not in CSV format, we convert it to CSV before loading and analyzing it.  

3. **Exploration:** In this phase, data analysts explore the data and look for patterns that help answer the questions.  

4. **Drawing conclusions:** In this phase, data analysts draw conclusions based on what the exploration showed. 

5. **Communication:** In this phase, the findings are communicated via an email, a blog post, a PowerPoint presentation, a book, etc.

### Onto the first step:-

## 1. POSING QUESTIONS 

**Here are the questions that I have prepared:** 

1. What is the distribution of male and female data in the sample? Was there any correlation between gender and the survival rate? If so, which gender had the higher survival rate, and why might that be? Were there any empty fields?  
    
2. What is the distribution of data based on age? Was there any correlation between age and the chances of survival? Did any particular age group have a better survival rate? How old were the oldest and youngest survivors?

3. What are the percentages of passengers in each class? Did the class of the passengers identified as Pclass have any impact on the survival? If so, which ticket class had the best survival rate? 
    
4. Did more or fewer people travel with their families? What was the correlation between a passenger having a family member and the survival rate for such groups? 

### Onto the second step:-

## 2. DATA WRANGLING

### file

In [99]:
titanic_file = "data/titanic_survival.csv"

### libraries to import 

In [66]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### create titanic dataframe

In [73]:
initial_titanic_df = pd.read_csv(titanic_file)

### modifying the dataset

In [5]:
# For our data analysis, we do not need some of the extra columns, so we are going to 
# keep only the columns from 'pclass' through 'parch'.
modified_titanic_df = initial_titanic_df.loc[:,'pclass':'parch']
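
The column selection above relies on label-based slicing, which behaves differently from ordinary Python slicing. The following is a minimal sketch on a hypothetical toy frame (the values and the trailing `ticket` and `fare` columns are made up for illustration; only the column names follow the notebook). It shows that a `.loc` label slice includes both endpoints.

```python
import pandas as pd

# Hypothetical toy frame; values are invented, column names follow the notebook.
toy_df = pd.DataFrame({
    "pclass": [1.0, 3.0],
    "survived": [1.0, 0.0],
    "name": ["A", "B"],
    "sex": ["female", "male"],
    "age": [29.0, None],
    "sibsp": [0.0, 1.0],
    "parch": [0.0, 0.0],
    "ticket": ["T1", "T2"],
    "fare": [211.3, 7.25],
})

# Label slices in .loc include BOTH endpoints, unlike positional slices,
# so 'parch' is kept while 'ticket' and 'fare' are dropped.
selected = toy_df.loc[:, "pclass":"parch"]
print(list(selected.columns))
```

Because the slice is by label, it depends on the column order in the file; selecting with an explicit list of column names would be the more defensive choice if that order might change.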

### a little about the data

1. **_Passengerid_:** This is the passenger id starting from 1 to 891. (not a column name) 

2. **_Survived_:** The value is numerical of 0 or 1; 1 indicating the passenger survived and 0 otherwise.

3. **_Pclass_:** The ticket class, also numerical: 1 = first (upper) class, 2 = second (middle) class, and 3 = third (lower) class. 

4. **_Name_:** This column lists the name of passenger. 

5. **_Sex_:** This column lists the gender of the passenger. 

6. **_Age_:** This column shows the age of passenger in years. 

7. **_SibSp_:** This column shows the number of siblings or spouses the passenger had aboard, if any.  

8. **_Parch_:** This column shows the number of parents or children the passenger had aboard, if any. 

-------- our dataframe does not cover the columns below this line.
9. **_Ticket_:** This shows the ticket number of the ticket issued. 

10. **_Fare_:** This shows the fare the ticket is bought at. 

11. **_Cabin_:** This shows the cabin number. 

12. **_Embarked_:** This shows the port the passenger embarked at. C = Cherbourg, Q = Queenstown, S = Southampton

Credit: https://www.kaggle.com/c/titanic/data

### first three rows

In [71]:
modified_titanic_df.head(3)

|   | pclass | survived | name | sex | age | sibsp | parch |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0.0 | 0.0 |
| 1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 |
| 2 | 1.0 | 0.0 | Allison, Miss. Helen Loraine | female | 2.0 | 1.0 | 2.0 |


### columns with null values

In [7]:
# so the age column has quite a few null values. 
modified_titanic_df.isnull().sum()

pclass        1
survived      1
name          1
sex           1
age         264
sibsp         1
parch         1
dtype: int64

### let us do a little cleaning

In [74]:
# The last row is entirely null, so let us keep only the first 1309 rows.
cleaned_titanic_df = modified_titanic_df.iloc[:1309]

# let us rename it to titanic_df
titanic_df = cleaned_titanic_df
titanic_df.isnull().sum()

pclass        0
survived      0
name          0
sex           0
age         263
sibsp         0
parch         0
dtype: int64
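
The cleaning step removes a row in which every column is null while keeping the rows that only lack an age. As a minimal sketch, assuming a hypothetical three-row toy frame, the same effect can be obtained without hard-coding a row position by using `dropna(how="all")`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 lacks only an age, row 2 is entirely empty.
toy_df = pd.DataFrame({
    "pclass": [1.0, 3.0, np.nan],
    "sex": ["female", "male", np.nan],
    "age": [29.0, np.nan, np.nan],
})

nulls_before = toy_df.isnull().sum()

# how="all" drops a row only when every column in it is null,
# so the passenger with a missing age is kept.
cleaned = toy_df.dropna(how="all")
nulls_after = cleaned.isnull().sum()

print(nulls_before.to_dict())
print(nulls_after.to_dict())
```

Each column's null count falls by exactly one, which is the same pattern seen in the two `isnull().sum()` outputs above.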

### sample size and survivors

In [82]:
sample_size = len(titanic_df)
survivors_df = titanic_df[titanic_df.survived == 1.0]
all_survivors = len(survivors_df)
all_dead = len(titanic_df[titanic_df.survived == 0.0])

### gender-based numbers and survival

In [83]:
male_count = len(titanic_df[titanic_df.sex == 'male'])
female_count = len(titanic_df[titanic_df.sex == 'female'])
male_survivors = len(survivors_df[survivors_df.sex == 'male'])
female_survivors = len(survivors_df[survivors_df.sex == 'female'])
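
The four counts above can also be produced in one step with a group-by. This is only a sketch on a hypothetical six-passenger sample, not the notebook's data; the column names `sex` and `survived` match the notebook, everything else is invented.

```python
import pandas as pd

# Hypothetical six-passenger sample; column names match the notebook.
toy_df = pd.DataFrame({
    "sex": ["male", "male", "male", "male", "female", "female"],
    "survived": [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
})

# Because survived is coded 0/1:
#   count = number of passengers, sum = number of survivors, mean = survival rate.
by_sex = toy_df.groupby("sex")["survived"].agg(["count", "sum", "mean"])
print(by_sex)
```

The `mean` column is the survival rate within each group, which is the quantity the gender question asks about.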

### age-based numbers and survival

**Assumptions:** 

* I am going to divide the age-related data into two categories for the sake of simplicity. 
* Those who are 18 years old or older will be considered adults, and those younger than 18 will be considered children. 
* I will also provide some interesting numbers concerning the ages of those on board the Titanic. 


In [90]:
# let us take out the nulls and save the titanic_df in a different variable
titanic_cleaned_byAge = titanic_df[titanic_df.age.notnull()]
titanic_with_nullAge = titanic_df.age.isnull().sum()

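# let us count the adults and adult survivors
# (adult_survivors is a dataframe; len(adult_survivors) gives the count)
adults_df = titanic_cleaned_byAge[titanic_cleaned_byAge.age >= 18.0]
adult_survivors = adults_df[adults_df.survived == 1.0]
adults_count = len(adults_df)

# let us count the children and the child survivors
# (child_survivors is a dataframe as well)
children_df = titanic_cleaned_byAge[titanic_cleaned_byAge.age < 18.0]
child_survivors = children_df[children_df.survived == 1.0]
children_count = len(children_df)

# let us find the youngest survivor (age in months) and the oldest survivor (age in years),
# looking only at survivors rather than at all passengers.
survivor_ages = titanic_cleaned_byAge[titanic_cleaned_byAge.survived == 1.0].age
youngest_survivor = survivor_ages.min() * 12
oldest_survivor = survivor_ages.max()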

len(titanic_cleaned_byAge)

1046
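
The adult/child split described in the assumptions can also be expressed as a single labeled column. The sketch below uses `pd.cut` on a hypothetical five-row frame (all values invented); the 18-year threshold is the one stated in the assumptions, and the `age_group` column name is made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical ages and outcomes, for illustration only.
toy_df = pd.DataFrame({
    "age": [0.9167, 17.0, 18.0, 45.0, np.nan],
    "survived": [1.0, 1.0, 1.0, 0.0, 1.0],
})

# right=False makes the bins [0, 18) and [18, inf),
# so an 18-year-old counts as an adult.
toy_df["age_group"] = pd.cut(
    toy_df["age"],
    bins=[0, 18, np.inf],
    labels=["child", "adult"],
    right=False,
)

# A missing age yields a missing age_group, so that row drops out of the groupby.
rates = toy_df.groupby("age_group", observed=False)["survived"].mean()
print(rates)
```

Passengers without an age are excluded automatically here, which mirrors the explicit `notnull()` filter used in the cell above.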

### ticket_class-based numbers and survival

**Titanic Data stats by Pclass**

*There are three classes as defined in Pclass column of the titanic data.*

**Assumptions:**

* I will call the class with Pclass value of 1 first_class
* I will call the class with Pclass value of 2 second_class
* I will call the class with Pclass value of 3 third_class

First of all, let us see if the pclass column has any empty or null values. The check below shows that there are none. 

In [54]:
#it seems like the column does not have any null values. 
print ("There are *{}* null values in pclass column.".format(titanic_df.pclass.isnull().sum()))

There are *0* null values in pclass column.


In [84]:
#Let us find the first_class count and survivors count
first_class_count = len(titanic_df[titanic_df.pclass == 1.0])
first_class_survivors = len(survivors_df[survivors_df.pclass == 1.0]) 

#Now, let's find the second_class count and survivors count
second_class_count = len(titanic_df[titanic_df.pclass == 2.0])
second_class_survivors = len(survivors_df[survivors_df.pclass == 2.0])

#Finally, let's find the third_class count and survivors count
third_class_count = len(titanic_df[titanic_df.pclass == 3.0])
third_class_survivors = len(survivors_df[survivors_df.pclass == 3.0])
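
The per-class counts above can be cross-checked with a contingency table. This is a sketch on a hypothetical eight-passenger frame (values invented, column names as in the notebook), using `pd.crosstab` with and without row normalization.

```python
import pandas as pd

# Hypothetical eight-passenger sample; column names match the notebook.
toy_df = pd.DataFrame({
    "pclass": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0],
    "survived": [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0],
})

# Raw counts: one row per ticket class, one column per outcome (0.0 or 1.0).
counts = pd.crosstab(toy_df["pclass"], toy_df["survived"])

# normalize="index" divides each row by its total, giving per-class rates.
rates = pd.crosstab(toy_df["pclass"], toy_df["survived"], normalize="index")

print(counts)
print(rates)
```

The `1.0` column of `rates` is the survival rate within each class, so the three classes can be compared directly even though they differ in size.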


### traveling_companion-based numbers and survival

In [53]:
#let us see if there are any empty cells in the Parch and SibSp columns. 

print ("There are *{}* null values in parch column.".format(titanic_df.parch.isnull().sum()))
print ("There are *{}* null values in sibsp column.".format(titanic_df.sibsp.isnull().sum()))

There are *0* null values in parch column.
There are *0* null values in sibsp column.


In [57]:
#since there are no null values, let us go ahead and build the lonely_traveling and family_traveling data using
#vectorized boolean masks. 

lonely_travelers_df = titanic_df[(titanic_df['sibsp'] == 0) & (titanic_df['parch'] == 0)]
lonely_travelers_count = len(lonely_travelers_df)
lonely_traveling_survivors_count = len(lonely_travelers_df[lonely_travelers_df.survived == 1.0])
family_travelers_df = titanic_df[(titanic_df['sibsp'] > 0) | (titanic_df['parch'] > 0)]
family_travelers_count = len(family_travelers_df)
family_traveling_survivors_count = len(family_travelers_df[family_travelers_df.survived == 1.0])
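
The two masks above are complements of each other: a passenger is either alone (both counts are zero) or with family (at least one count is positive). A minimal sketch of the same split as a single labeled column follows, on a hypothetical four-row frame; the `travel_group` column name is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical four-passenger sample; column names match the notebook.
toy_df = pd.DataFrame({
    "sibsp": [0.0, 1.0, 0.0, 0.0],
    "parch": [0.0, 0.0, 2.0, 0.0],
    "survived": [0.0, 1.0, 1.0, 1.0],
})

# Both counts are non-negative, so their sum is zero only for solo travelers.
family_size = toy_df["sibsp"] + toy_df["parch"]
toy_df["travel_group"] = np.where(family_size > 0, "family", "alone")

# count = passengers, sum = survivors, mean = survival rate per group.
summary = toy_df.groupby("travel_group")["survived"].agg(["count", "sum", "mean"])
print(summary)
```

Deriving one column keeps the two groups mutually exclusive by construction, so their counts always add up to the sample size.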


### Onto the third step:-


## 3. DATA EXPLORATION 

In total there were 1309 people in the sample. I will break down the sample size as well as the 
survivors data by gender, age, ticket class, and family association. 

   **1. Distribution of data by gender:** 
   
   Run the following Python code to see stats related to gender distribution (the `if True:` guard can be set to `False` to skip the cell): 
 

In [91]:
sample_size = len(titanic_df)

if True: 
    def printing(**kwargs):
        """
        Print the value of every keyword argument, prefixed with "There were".
        The keyword names only label the arguments; they are not printed.
        """
        for name, value in kwargs.items():
            print("There were {}".format(value))

    printing(sample_size = str(sample_size) + " people in the sample.", 
             survs = str(all_survivors) + " survivors.",
             male_num = str(male_count) + " males.",
             mal_surv = str(male_survivors) + " male survivors.",
             fem_num = str(female_count) + " females.",
             fem_surv = str(female_survivors) + " female survivors."
            )

There were 843 males.
There were 161 male survivors.
There were 466 females.
There were 339 female survivors.
There were 1309 people in the sample.
There were 500 survivors.


   **2. Distribution of data by age:-**

   Run the following Python code to see stats related to age distribution (the `if True:` guard can be set to `False` to skip the cell): 


In [97]:
#the printing function is called to print stats related to age. 
#adult_survivors and child_survivors are dataframes, so len() is used to get the counts.

if True: 
    printing(with_age_data_length = str(len(titanic_cleaned_byAge)) + " people whose ages were specified.",
             adults = str(adults_count) + " adults.", 
             children = str(children_count) + " children.", 
             all_survs = str(len(adult_survivors) + len(child_survivors)) + " survivors with ages specified.", 
             adult_surv = str(len(adult_survivors)) + " adult survivors.", 
             child_surv = str(len(child_survivors)) + " child survivors."
            )
    
    print("The oldest survivor was {} years old.".format(oldest_survivor))
    print("The youngest survivor was {} months old.".format(youngest_survivor))

   **3. Distribution of data by ticket class:** 
    
   Run the following Python code to see stats related to ticket class distribution (the `if True:` guard can be set to `False` to skip the cell):


In [98]:
#We are going to call the printing function to print stats related to ticket class. 
if True: 
    printing( first_class = str(first_class_count)+" people in first class category.", 
            first_class_survs = str(first_class_survivors)+" survivors in that category.", 
            second_class = str(second_class_count)+" people in second class category.", 
            second_class_survs = str(second_class_survivors)+" survivors in that category.", 
             third_class = str(third_class_count)+" people in third class category.", 
             third_class_survs = str(third_class_survivors)+ " survivors in that category."
            ) 

There were 181 survivors in that category.
There were 200 survivors in that category.
There were 277 people in second class category.
There were 709 people in third class category.
There were 119 survivors in that category.
There were 323 people in first class category.


   **4. Distribution of data by family indicators:**
    
  Run the following Python code to see stats related to family distribution (the `if True:` guard can be set to `False` to skip the cell):


In [None]:
#We are going to call the printing function to print stats related to family. 
if True: 
    printing(lonely_travelers = str(lonely_travelers_count)+" people, traveling alone.", 
            lonely_traveling_survs = str(lonely_traveling_survivors_count)+" survivors in the lonely category.", 
            family_travelers = str(family_travelers_count)+" people, traveling with family.", 
             family_traveling_survs = str(family_traveling_survivors_count)+" survivors in the family category." 
            ) 

### Onto the fourth step:-

## 4. CONCLUSION

### Findings:- 

I was able to find answers to the questions I raised. Following are my conclusions: 

1. On survival and gender, I found that persons identified as female had a higher survival rate than persons identified as male.

2. On survival and age, I found that persons younger than 18 had a better survival rate than adults. 

3. On survival and ticket class, I found that first-class passengers had the best survival rate and third-class passengers the worst. 

4. On survival and traveling companions, I found that passengers traveling as part of a family had a better survival rate than those who traveled by themselves. 

All of the above findings are backed by the data explored above and by the visuals in the Communication section. 

### Limitations:- 
* The dataset records what happened to each passenger but not why. For instance, the exploration shows that children
  and women had a better survival rate than adults and males, but the dataset does not say whether children and women
  were given preference over adults and males. 

* Similarly, people traveling as part of a family had a better survival rate than people traveling by themselves. A
  plausible reason is that family members looked out for each other, but the dataset cannot confirm this. 

* The dataset also gives no reason for the better survival of the higher ticket classes. They may have been given
  preference when it came to lifeboats, but lifeboat access is not recorded in the dataset, and even being on a
  lifeboat does not by itself guarantee survival. 



In [None]:
#Here are some stats about the data. 

titanic_df.describe()

### Onto the fifth step:-

## 5. COMMUNICATION

**GENDER: Let us remind ourselves of the questions related to gender.**

***What is the distribution of male and female data in the sample? Was there any correlation between gender and the survival rate? If so, which gender had the higher survival rate, and why might that be? Were there any empty fields?*** 

First of all, information was collected on 891 people, which is our sample size. Of those 891 people, 577 were identified as male, of whom 109 (19%) survived, and 314 were identified as female, of whom 233 (74%) survived.  

So yes, there was a correlation between gender and survival rate: people identified as female had a much higher survival rate than people identified as male. 
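
To put a number on the strength of that association, one option is a Pearson chi-square statistic and the phi coefficient for the 2x2 table of gender against survival. The notebook itself does not compute these; the following is only a sketch, using the four counts quoted in the paragraph above and plain Python.

```python
import math

# Counts quoted in the paragraph above.
male_total, male_survived = 577, 109
female_total, female_survived = 314, 233

# Rows: male, female. Columns: survived, died.
observed = [
    [male_survived, male_total - male_survived],
    [female_survived, female_total - female_survived],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square: compare each cell with the count expected
# if survival were independent of gender.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

# Phi coefficient: an effect size for a 2x2 table, between 0 and 1.
phi = math.sqrt(chi2 / n)

print("male survival rate:   {:.1%}".format(male_survived / male_total))
print("female survival rate: {:.1%}".format(female_survived / female_total))
print("chi-square = {:.1f}, phi = {:.2f}".format(chi2, phi))
```

A phi value near 0 would mean gender and survival are essentially unrelated, and a value near 1 a very strong association. Neither statistic says anything about the cause, which is the limitation noted earlier.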

Run the following code to observe the visualization of the above (the `if True:` guard can be set to `False` to skip the cell). 

In [None]:
#this is to print out the output in the same notebook. 
%matplotlib inline 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

def barlabel(plots, ax=None):
    """
    Attach a text label above each bar displaying its height.
    If no axes are passed, the current axes (plt.gca()) are used.
    """
    if ax is None:
        ax = plt.gca()
    for plot in plots:
        height = plot.get_height()
        ax.text(plot.get_x() + plot.get_width()/2, height,
                '%d' % int(height),
                ha='center', va='bottom')

if True: 
    
    N = 3
    x = np.arange(N)  
    width = 0.35      

    fig = plt.figure(figsize=(14,7)) 
    ax = fig.add_subplot(221)

    mf_data1 = (len(titanic_df), male_count, female_count)
    plot_mf1 = ax.bar(x, mf_data1, width, color='c')

    # all_survivors, male_survivors and female_survivors come from the wrangling step
    mf_data2 = (all_survivors, male_survivors, female_survivors)
    plot_mf2 = ax.bar(x+width, mf_data2, width, color='y')

    ax.set_title('Titanic Overall Survival Numbers')
    ax.set_ylabel('Number of people')
    totals_leg = mpatches.Patch(color='c', label='Totals')
    surv_leg = mpatches.Patch(color='y', label='survivors')
    ax.legend(handles = [totals_leg, surv_leg])
    ax.set_xticks(x+width/2)
    ax.set_xticklabels(('sample size & survivors','males','females'))
    
    barlabel(plot_mf1, ax)
    barlabel(plot_mf2, ax)    
    
    # the column names in titanic_df are lowercase
    survivors_gender_table = pd.crosstab(titanic_df['survived'], titanic_df['sex'])
    fig, (ax1, ax2) = plt.subplots(ncols=2)
    colors = ['y','c']
    # pie_ax avoids overwriting the bar chart's ax variable
    for sex, pie_ax in zip(['male', 'female'], [ax1, ax2]):
        survivors_gender_table.plot.pie(y = sex, 
                                    ax=pie_ax, 
                                    labels =['Dead', 'Survived'],
                                    autopct='%1.2f%%', 
                                    colors=colors, 
                                    legend=False, 
                                    startangle=90)
        pie_ax.set_aspect('equal')
    fig.suptitle('Survivors and Dead by Gender', y=0.8, fontsize = 12)
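
The `barlabel` helper writes each bar's height above it by hand. Assuming matplotlib 3.4 or newer is installed, the built-in `Axes.bar_label` method does the same job; the sketch below uses hypothetical counts and a headless backend so that it runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Hypothetical counts, for illustration only.
bars = ax.bar(["males", "females"], [8, 5])

# bar_label (matplotlib >= 3.4) places one text label per bar in the container.
texts = ax.bar_label(bars, fmt="%d")
labels = [t.get_text() for t in texts]
print(labels)

plt.close(fig)
```

On older matplotlib versions the method is not available, in which case a hand-written helper like `barlabel` remains the way to go.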

**AGE: Let us take a look at the questions related to age.**

***What is the distribution of data based on age? Was there any correlation between age and the chances of survival? Did any particular age group have a better survival rate? How old were the oldest and youngest survivors?***

There are two age categories, adults and children. People aged 18 or over are considered adults and those younger than 18 are considered children. Out of the sample of 891 people, 177 had no age specified, so we excluded them from the adult/child analysis. That leaves a new sample of 714 people. 

Of the 714 people with a known age, 649 were adults and 65 were children, that is, 91% adults and 9% children. The survival rate within each group was 39% for adults (255 survivors out of 649) and 54% for children (35 survivors out of 65). Hence, children had a better survival rate than adults. One possible explanation is that children, as the most vulnerable group, were given priority to board the lifeboats. 

Run the following code to get the visuals backing up the above (the `if True:` guard can be set to `False` to skip the cell). 

In [None]:
if True:     
    fig = plt.figure(figsize=(12,5)) #the figure along with the size of it.
    ax = fig.add_subplot(121)
    colors = ['b','c','y']
    # titanic_cleaned_byAge and titanic_with_nullAge come from the wrangling step
    bar1 = ax.bar(range(3), [len(titanic_df), len(titanic_cleaned_byAge), titanic_with_nullAge], color = colors)
    ax.set(title = 'Overall Composition by Age', 
          ylabel = 'Number of people', 
          xticks = range(3),
          xticklabels = ['Sample Size','With Ages','Without Ages']
          )
    barlabel(bar1)
    
    ax = fig.add_subplot(122)

    # bar positions are defined here so the cell does not depend on an earlier cell
    x = np.arange(3)
    width = 0.35

    # adult_survivors and child_survivors are dataframes, so their lengths are the counts
    adult_survivors_count = len(adult_survivors)
    child_survivors_count = len(child_survivors)

    ac_data1 = (len(titanic_cleaned_byAge), adults_count, children_count)
    plot_ac1 = ax.bar(x, ac_data1, width, color='c')

    ac_data2 = (adult_survivors_count+child_survivors_count, adult_survivors_count, child_survivors_count)
    plot_ac2 = ax.bar(x+width, ac_data2, width, color='y')

    ax.set_title('Detailed Composition by Age')
    ax.set_ylabel('Number of people')
    totals_leg = mpatches.Patch(color='c', label='Totals')
    surv_leg = mpatches.Patch(color='y', label='survivors')
    ax.legend(handles = [totals_leg, surv_leg])
    ax.set_xticks(x+width/2)
    ax.set_xticklabels(('sample size & survivors','adults','children'))


    barlabel(plot_ac1)
    barlabel(plot_ac2)

**TICKET CLASS. Let us tackle the questions related to ticket class of the passengers.**

***What are the proportions of passengers in each class and what are their survival rates? Was there any correlation between ticket class and survival? If so, which ticket classes had the best and worst survival rate?***

*The 891 passengers were grouped into three ticket classes:* 

1. 216 first_class ticket holders (Pclass = '1')
2. 184 second_class ticket holders (Pclass = '2')
3. 491 third_class ticket holders (Pclass = '3')

*Let us take a look at their survival rates:* 

1. 136/216 first_class survivors, 63% survival rate
2. 87/184 second_class survivors, 47% survival rate
3. 119/491 third_class survivors, 24% survival rate

Looking at the above stats, we can clearly see that first_class had the best survival rate, 63%, and third_class had the worst, 24%. The ticket class on the Titanic represented socio-economic status, with first_class being the highest. 

Thus, we can say that there was a correlation between ticket class and survival rate: the higher the ticket class, and hence the socio-economic status of the passenger, the more likely the passenger was to survive. 


*Run the following code to observe representative visuals (the `if True:` guard can be set to `False` to skip the cell).*

In [None]:
if True: 
    fig = plt.figure(figsize=(10, 7))
    N = 4
    x = np.arange(N)  
    width = 0.35      

    ax = fig.add_subplot(111)

    tc_data1 = (len(titanic_df), first_class_count, second_class_count, third_class_count)
    plot_tc1 = ax.bar(x,tc_data1, width, color='c')

    # all_survivors is the total survivor count from the wrangling step
    tc_data2 = (all_survivors, first_class_survivors, second_class_survivors, third_class_survivors)
    plot_tc2 = ax.bar(x+width, tc_data2, width, color='y')

    ax.set_title('Titanic Numbers by Ticket Class')
    ax.set_ylabel('Number of people')
    totals_leg = mpatches.Patch(color='c', label='Totals')
    surv_leg = mpatches.Patch(color='y', label='survivors')
    ax.legend(handles = [totals_leg, surv_leg])
    ax.set_xticks(x+width/2)
    ax.set_xticklabels(('sample', 'First_class','Second_class', 'Third_class'))
    
    barlabel(plot_tc1)
    barlabel(plot_tc2)

**FAMILY VS LONELY TRAVELERS: Let us tackle questions related to whether people traveled with their families or not.**

***Did more or fewer people travel with their families? What was the correlation between a passenger having a family member and the survival rate for such groups?***

Out of 891 people, 537 traveled by themselves, so they did not have any family onboard, and 354 traveled with children, parents, spouses or siblings. So fewer people traveled with family than alone. 

People who traveled with their families had a better survival rate than people traveling by themselves. There were 179 survivors among the 354 family travelers, a survival rate of about 51%. By contrast, there were 163 survivors among the 537 solo travelers, a survival rate of 30%, or roughly every third person. 

Thus, people traveling with family were more likely to survive than those traveling alone. 

*Run the following code to observe representative visuals (the `if True:` guard can be set to `False` to skip the cell).*

In [None]:
if True: 
    fig = plt.figure(figsize=(14, 5))
    N = 3
    x = np.arange(N)  
    width = 0.35      

    ax = fig.add_subplot(121)

    tc_data1 = (len(titanic_df), lonely_travelers_count, family_travelers_count)
    plot_tc1 = ax.bar(x,tc_data1, width, color='c')

    # all_survivors is the total survivor count from the wrangling step
    tc_data2 = (all_survivors, lonely_traveling_survivors_count, family_traveling_survivors_count)
    plot_tc2 = ax.bar(x+width, tc_data2, width, color='y')

    ax.set_title('Titanic Numbers by Family Association')
    ax.set_ylabel('Number of people')
    totals_leg = mpatches.Patch(color='c', label='Totals')
    surv_leg = mpatches.Patch(color='y', label='survivors')
    ax.legend(handles = [totals_leg, surv_leg])
    ax.set_xticks(x+width/2)
    ax.set_xticklabels(('sample', 'Lonely Travelers', 'Family Travelers'))
    
    barlabel(plot_tc1)
    barlabel(plot_tc2)
    
    ax = fig.add_subplot(122)
    # each slice is that group's share of all survivors, not the survival rate within the group
    slices = [lonely_traveling_survivors_count/all_survivors, family_traveling_survivors_count/all_survivors]
    labels = ['Lonely Travelers', 'Family Travelers']
    colors = ['#00bfff', 'c']
    plt.title('Share of Survivors')
    plt.pie(slices, 
            labels=labels, 
            colors=colors,
            autopct='%1.2f%%', 
           startangle = 90, 
           shadow = True)
    plt.axis('equal')
    

Sources: 

http://matplotlib.org

http://kaggle.com

http://stackoverflow.com/
