# Project: Titanic survival predictions

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

In [1]:
pwd

'E:\\ALL_data_science\\Udacity\\Advanced_track\\Titanic investigation\\titanic desktop\\Titanic-Dataset-Investigation'

<a id='intro'></a>
## Introduction

> **Tip**: The Titanic dataset describes the survival status of individual passengers
on the Titanic. The dataframe contains the columns PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin and Embarked.


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


<a id='wrangling'></a>
## Data Wrangling

### General Properties

In [3]:
df=pd.read_csv("titanic-data.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# get the data shape
df.shape

(891, 12)

In [5]:
#get the data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [6]:
df.Age.value_counts()

24.00    30
22.00    27
18.00    26
19.00    25
30.00    25
         ..
55.50     1
70.50     1
66.00     1
23.50     1
0.42      1
Name: Age, Length: 88, dtype: int64

* The Cabin column has only 204 non-null values out of 891 rows, so it carries too little information to be useful. Let's drop it.

In [7]:
# check the underlying Python type of the values in the object-dtype columns
type(df["Name"][0])

str

In [8]:
type(df["Sex"][0])

str

In [9]:
type(df["Ticket"][0])

str

In [10]:
type(df["Embarked"][0])

str

#### Check whether there are duplicates in the dataset


In [11]:
# there are no duplicates in this dataset
sum(df.duplicated())

0

#### Check whether there are null values in the dataset

In [12]:
# count the total number of null values across the whole dataset
print("The count of nulls is {}".format(df.isnull().sum().sum()))
# the per-column counts show which features the nulls belong to
df.isnull().sum()

The count of nulls is 866


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
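The raw counts above are easier to compare as a share of rows: Cabin is missing in 687 of 891 rows (about 77%), Age in 177 (about 20%) and Embarked in 2. A minimal sketch of computing that share, using a small made-up frame (`toy` and its values are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives a boolean frame; its column-wise mean is the fraction missing.
missing_share = toy.isnull().mean().sort_values(ascending=False) * 100
print(missing_share)
```

On the real dataframe the same expression would be `df.isnull().mean() * 100`.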

In [13]:
# Check unique values for Pclass
df["Pclass"].unique()
#df.Pclass.nunique()

array([3, 1, 2], dtype=int64)

In [14]:
# Check unique values for Fare
print(df["Fare"].unique())
print(" The max value is {},\n The min value is {},\n And the count is {}\n".format(df["Fare"].max(), df["Fare"].min(), df["Fare"].nunique()))

[  7.25    71.2833   7.925   53.1      8.05     8.4583  51.8625  21.075
  11.1333  30.0708  16.7     26.55    31.275    7.8542  16.      29.125
  13.      18.       7.225   26.       8.0292  35.5     31.3875 263.
   7.8792   7.8958  27.7208 146.5208   7.75    10.5     82.1708  52.
   7.2292  11.2417   9.475   21.      41.5792  15.5     21.6792  17.8
  39.6875   7.8     76.7292  61.9792  27.75    46.9     80.      83.475
  27.9     15.2458   8.1583   8.6625  73.5     14.4542  56.4958   7.65
  29.      12.475    9.       9.5      7.7875  47.1     15.85    34.375
  61.175   20.575   34.6542  63.3583  23.      77.2875   8.6542   7.775
  24.15     9.825   14.4583 247.5208   7.1417  22.3583   6.975    7.05
  14.5     15.0458  26.2833   9.2167  79.2      6.75    11.5     36.75
   7.7958  12.525   66.6      7.3125  61.3792   7.7333  69.55    16.1
  15.75    20.525   55.      25.925   33.5     30.6958  25.4667  28.7125
   0.      15.05    39.      22.025   50.       8.4042   6.4958  10.4625
  1

In [15]:
#Check unique values for Embarked
df["Embarked"].unique()

array(['S', 'C', 'Q', nan], dtype=object)



### Data Cleaning 

#### We can get the total number of relatives aboard by adding a new column called Tot_Relatives, which is the sum of Parch and SibSp

In [16]:
df['Tot_Relatives']=df['Parch']+df['SibSp']
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Tot_Relatives
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0


In [17]:
df.Tot_Relatives.describe()

count    891.000000
mean       0.904602
std        1.613459
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max       10.000000
Name: Tot_Relatives, dtype: float64

#### First we drop the PassengerId, Name, Ticket and Cabin columns. We also drop SibSp and Parch, since Tot_Relatives now holds their sum.

In [18]:
df.drop(['PassengerId', 'Name',"Ticket","Cabin","Parch","SibSp"], axis=1, inplace=True)

#### I think the Age feature is important to survival, so rather than dropping the 177 rows with a missing age we should fill the null values.


In [19]:
# checking whether the mean is a reasonable value to fill with
print(df["Age"].describe())


count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
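In the summary above the mean (29.70) and the median (28.0) are close, which suggests the mean is not badly distorted by extreme ages. A minimal sketch of how that comparison could be made explicit before choosing a fill value. The `ages` series is made up for illustration, and median imputation is shown only as an alternative, not as what this notebook does:

```python
import pandas as pd

# Illustrative ages with one missing value and a long right tail.
ages = pd.Series([2.0, 18.0, 22.0, 24.0, 28.0, 30.0, 80.0, None], name="Age")

mean_age = ages.mean()      # pulled upward by the 80-year-old
median_age = ages.median()  # robust to the tail
skewness = ages.skew()      # > 0 indicates a right-skewed distribution

# Two candidate imputations; NaN is skipped by mean() and median().
filled_mean = ages.fillna(mean_age)
filled_median = ages.fillna(median_age)
print(mean_age, median_age, skewness)
```

When the two statistics differ a lot, the median is usually the safer fill value because a few extreme ages cannot move it far.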


In [20]:
# filling the null ages with the mean age
# (assigning the result back avoids the chained inplace fillna that newer pandas versions warn about)
df["Age"] = df["Age"].fillna(df["Age"].mean())
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,Tot_Relatives
0,0,3,male,22.0,7.25,S,1
1,1,1,female,38.0,71.2833,C,1
2,1,3,female,26.0,7.925,S,0
3,1,1,female,35.0,53.1,S,1
4,0,3,male,35.0,8.05,S,0


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Survived       891 non-null    int64  
 1   Pclass         891 non-null    int64  
 2   Sex            891 non-null    object 
 3   Age            891 non-null    float64
 4   Fare           891 non-null    float64
 5   Embarked       889 non-null    object 
 6   Tot_Relatives  891 non-null    int64  
dtypes: float64(2), int64(3), object(2)
memory usage: 48.9+ KB
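The `df.info()` output shows that Embarked still has 889 non-null values out of 891, so two nulls remain after cleaning. One common option, shown here only as a sketch on made-up data, is to fill them with the most frequent port. Dropping the two rows would be an equally defensible choice.

```python
import numpy as np
import pandas as pd

# Illustrative column with two missing ports.
toy = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q", "S", np.nan]})

# mode() returns a Series (ties are possible), so take the first entry.
most_common_port = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common_port)
print(toy["Embarked"].value_counts())
```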


<a id='eda'></a>
## Exploratory Data Analysis

### Research Question 1 (What factors are important for us to know in order to predict if a passenger would survive or not?)

In [None]:
# displaying full data
pd.set_option('display.max_rows', 900)
pd.set_option('display.max_colwidth', None)

In [24]:
# survived passengers with at least one relative aboard
sur_rel = df[(df["Survived"]== 1) & (df["Tot_Relatives"] >= 1)].sort_values("Tot_Relatives", ascending=False)
sur_rel.head(10)

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,Tot_Relatives
25,1,3,female,38.0,31.3875,S,6
233,1,3,female,5.0,31.3875,S,6
261,1,3,male,3.0,31.3875,S,6
68,1,3,female,17.0,7.925,S,6
341,1,1,female,24.0,263.0,S,5
437,1,2,female,24.0,18.75,S,5
88,1,1,female,23.0,263.0,S,5
311,1,1,female,18.0,262.375,C,4
742,1,1,female,21.0,262.375,C,4
774,1,2,female,54.0,23.0,S,4


In [28]:
#checking sur_rel stats
sur_rel.describe()

Unnamed: 0,Survived,Pclass,Age,Fare,Tot_Relatives
count,179.0,179.0,179.0,179.0,179.0
mean,1.0,1.843575,25.969012,58.254191,1.793296
std,0.0,0.833364,15.646723,64.37332,1.074087
min,1.0,1.0,0.42,7.775,1.0
25%,1.0,1.0,15.5,19.5,1.0
50%,1.0,2.0,27.0,31.3875,2.0
75%,1.0,3.0,36.0,77.9583,2.0
max,1.0,3.0,63.0,512.3292,6.0


In [32]:
#getting the percent of each gender
df['Sex'].value_counts(normalize=True) * 100

male      64.758698
female    35.241302
Name: Sex, dtype: float64

In [27]:
# getting the survival proportion within each sex
df.groupby('Sex')['Survived'].value_counts(normalize=True) * 100

Sex     Survived
female  1           74.203822
        0           25.796178
male    0           81.109185
        1           18.890815
Name: Survived, dtype: float64
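Because Survived is coded as 0/1, its mean within a group is that group's survival rate, which gives a more compact view of the same numbers. A small sketch on made-up rows (the `toy` frame is illustrative only):

```python
import pandas as pd

# Illustrative rows: 4 women (3 survived) and 5 men (1 survived).
toy = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 5,
    "Survived": [1, 1, 1, 0, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
rate = toy.groupby("Sex")["Survived"].mean() * 100
print(rate)
```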

In [25]:
#checking ages less than 1 year
df[(df["Age"] < 1.0)]

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,Tot_Relatives
78,1,2,male,0.83,29.0,S,2
305,1,1,male,0.92,151.55,S,3
469,1,3,female,0.75,19.2583,C,3
644,1,3,female,0.75,19.2583,C,3
755,1,2,male,0.67,14.5,S,2
803,1,3,male,0.42,8.5167,C,1
831,1,2,male,0.83,18.75,S,2


In [26]:
#checking age and class by survived status
df[(df["Age"] >= 30) & (df["Pclass"] == 1)].sort_values("Survived", ascending=False)

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,Tot_Relatives
1,1,1,female,38.0,71.2833,C,1
447,1,1,male,34.0,26.5500,S,0
621,1,1,male,42.0,52.5542,S,1
609,1,1,female,40.0,153.4625,S,0
604,1,1,male,35.0,26.5500,C,0
...,...,...,...,...,...,...,...
434,0,1,male,50.0,55.9000,S,1
659,0,1,male,58.0,113.2750,C,2
662,0,1,male,47.0,25.5875,S,0
671,0,1,male,31.0,52.0000,S,1


In [None]:
def plot_features(column_name, category_1, category_2):
    """Overlay histograms of one column for survived and not survived passengers."""
    fig, ax = plt.subplots(figsize=(10, 8))
    # alpha=0.5 keeps both histograms visible where they overlap
    ax.hist(category_1[column_name], alpha=0.5, label="Survived")
    ax.hist(category_2[column_name], alpha=0.5, label="Not survived")
    ax.set_title("Distributions of Survived and Not survived " + column_name)
    ax.set_xlabel(column_name)
    ax.set_ylabel("count")
    ax.legend(loc="upper right")
    plt.show()
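
Each `ax.hist` call in `plot_features` picks its own bins from the data it receives, so the bars of the two groups may not line up. A sketch of one way to get shared bins for a numeric column, assuming made-up age arrays (`survived_ages` and `not_survived_ages` are hypothetical names):

```python
import numpy as np

# Illustrative ages for the two groups.
survived_ages = np.array([1.0, 4.0, 22.0, 35.0, 48.0])
not_survived_ages = np.array([19.0, 21.0, 25.0, 28.0, 40.0, 60.0])

# Compute one set of edges from the combined data so both groups share bins.
combined = np.concatenate([survived_ages, not_survived_ages])
edges = np.histogram_bin_edges(combined, bins=6)

survived_counts, _ = np.histogram(survived_ages, bins=edges)
not_survived_counts, _ = np.histogram(not_survived_ages, bins=edges)
print(edges, survived_counts, not_survived_counts)
```

The same `edges` array could be passed as `bins=edges` to both `ax.hist` calls. This applies to numeric columns such as Age and Fare, not to a text column such as Sex.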

In [None]:
df_survived=df[df["Survived"]==1]
df_not_survived=df[df["Survived"]==0]

In [None]:
plot_features("Age",df_survived,df_not_survived)


#### From this graph we can conclude that
* From age 0 (which can mean a few months old) to nearly 18 years, the number of passengers who survived is greater than the number who did not.
* For adults (from age 18 to the 30s), the number of passengers who did not survive is greater than the number who survived.
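
The age ranges read off the histogram can be checked numerically by cutting Age into explicit bands and taking the survival rate in each. This is a sketch on made-up rows, and the band edges (0, 18, 30, 80) are an assumption chosen to match the ranges discussed above. Note that the 177 ages filled with the mean (about 29.7) all fall into the 18 to 30 band in the real data, which inflates that band.

```python
import pandas as pd

# Illustrative rows only.
toy = pd.DataFrame({
    "Age": [0.8, 5.0, 16.0, 22.0, 25.0, 29.0, 40.0, 55.0],
    "Survived": [1, 1, 0, 0, 0, 1, 1, 0],
})

# pd.cut uses right-closed intervals by default: (0, 18], (18, 30], (30, 80].
bins = [0, 18, 30, 80]
labels = ["0-18", "18-30", "30-80"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Survival rate per age band.
rate = toy.groupby("AgeGroup", observed=True)["Survived"].mean()
print(rate)
```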



In [None]:
plot_features("Sex",df_survived,df_not_survived)


#### From this graph we can conclude that
* More males than females did not survive.
* Most females survived (about 74%).
* Most males did not survive (about 81%).

In [None]:
plot_features("Tot_Relatives",df_survived,df_not_survived)


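#### From this graph we can conclude that
* Most passengers in both groups travelled with few or no relatives, so the counts are highest at small values of Tot_Relatives.
* The histogram shows counts rather than rates, so on its own it cannot show whether having fewer relatives aboard means a higher probability of surviving.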
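
A statement about the probability of surviving needs a rate per group, because most passengers have few relatives and so dominate the counts in both histograms. A minimal sketch of computing group size, survivors and survival rate together, on made-up rows (the `toy` frame and the `summary` column names are illustrative):

```python
import pandas as pd

# Illustrative rows only.
toy = pd.DataFrame({
    "Tot_Relatives": [0, 0, 0, 0, 1, 1, 2, 2, 5, 5],
    "Survived":      [0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
})

# Named aggregation: group size, number of survivors, and survival rate.
summary = toy.groupby("Tot_Relatives")["Survived"].agg(
    passengers="count", survivors="sum", rate="mean"
)
print(summary)
```

Reading `passengers` next to `rate` also shows which rates rest on only a handful of rows.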

In [None]:
plot_features("Pclass",df_survived,df_not_survived)


#### From this graph we can conclude that
* Most passengers in class 3 did not survive.
* Most passengers in class 1 survived.
* So the probability of surviving is highest in class 1.
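
The comparison between classes can be expressed as row percentages with `pd.crosstab`. This is a sketch on made-up rows rather than the notebook's own output:

```python
import pandas as pd

# Illustrative rows only.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# normalize="index" makes each row (class) sum to 1; multiply for percentages.
table = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index") * 100
print(table)
```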



In [None]:
plot_features("Fare",df_survived,df_not_survived)

* Among the passengers who did not survive, far more paid a low fare than a high fare. I think the fare can reflect where passengers were located on the ship, and so it may be related to their survival.
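
Fare is heavily skewed, so equal-width histogram bins put most passengers in the first bar. A sketch of an alternative view that splits Fare into quartile bands with `pd.qcut` and compares survival rates, on made-up rows (the band labels are hypothetical):

```python
import pandas as pd

# Illustrative rows only.
toy = pd.DataFrame({
    "Fare":     [7.0, 8.0, 9.0, 13.0, 15.0, 26.0, 30.0, 50.0, 80.0, 120.0, 200.0, 500.0],
    "Survived": [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1],
})

# qcut puts roughly the same number of passengers in each band.
toy["FareBand"] = pd.qcut(toy["Fare"], q=4, labels=["lowest", "low", "high", "highest"])

# Survival rate per fare band.
rate = toy.groupby("FareBand", observed=True)["Survived"].mean()
print(rate)
```

Any pattern found this way is still only an association: fare is tied to class, so it does not show that fare itself affects survival.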

In [None]:
# export the cleaned df for Tableau analysis
df.to_csv("Titanic_DF.csv", mode="w")

<a id='conclusions'></a>
## Conclusions

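#### Limitations of this data
* The data does not record how many of a passenger's relatives survived. More than one passenger with relatives aboard survived, but the fate of those relatives cannot be counted from these columns.
* The data says little about where passengers were on the ship: Cabin is missing for most rows, and nothing records how far each passenger was from the lifeboats or other life-saving equipment.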