# **The "Titanic" Project**

## *Contents*

1) Problem Analysis and Definition

2) Data "Collection"

3) Light Data Analysis

4) Data Cleaning

5) Data Transformation

6) Data Modelling

7) Results Submission

## **Problem Analysis and Definition**

Understanding the history and context of the data allows for a better analysis.
The data set covers the fate of the passengers on the Titanic, a ship that sank after hitting an iceberg. 
The data provided covers only a subset of the people on board the ship (the training set holds 891 passengers, out of roughly 2,200 people on the ship), 
split into the following two groups: 
1) A set for testing ("test.csv", which doesn't contain details of the passengers' survival) <br>
2) A set for training ("train.csv", which does contain details of the passengers' survival). <br>

The data contains several fields about each passenger (alongside the key detail of whether they lived or not)
such as their age, information about their family, their social class, and so on. 
The aim of this project is to create a model that predicts a passenger's survival using this given data.

## **Data "Collection"**

The cell below imports the modules used later in the project (including visualisation and modelling modules).<br>
The final lines list the input files (datasets) provided by Kaggle.

In [1]:
# This is the code given by default by kaggle for the competition - it has been edited in some places

# Default Imports
import numpy as np 
import pandas as pd 

# Added Imports for Visualisation
import seaborn as sns 
import matplotlib.pyplot as plt

# Added Imports for Modelling/ML
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import classification_report


# Default given code to output input files from Kaggle
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


The code below imports the data from the given CSVs into data frames.<br>
Both data frames are also collected into a single list, so that the same transformation can later be applied to each of them in one loop (eg: changing column names of both in one pass).
<br><br>
As can be seen by printing the column headings for both datasets, the only current difference is the lack of the "Survived" attribute in the testing data.

In [None]:
# Import the data frames from the provided CSVs
training_data = pd.read_csv('/kaggle/input/titanic/train.csv')
testing_data = pd.read_csv('/kaggle/input/titanic/test.csv')

# Combined into a list, in order to iterate through later when changing both data frames at once
combined_datasets = [training_data, testing_data]

print(training_data.columns.values)
print(testing_data.columns.values)
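
A minimal sketch of how such a list can be used, with two made-up frames standing in for the real ones (the names `train`, `test`, `combined` and the `is_female` column are illustrative only). It also shows a caveat: only in-place operations reach the frames through the list, while reassigning a variable leaves the list pointing at the old object.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (values are made up)
train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Sex": ["male", "female"]})
test = pd.DataFrame({"PassengerId": [3, 4], "Sex": ["female", "male"]})
combined = [train, test]

for df in combined:
    # In-place operations change the objects the list points at
    df.rename(columns={"Sex": "sex"}, inplace=True)
    df["is_female"] = (df["sex"] == "female").astype(int)

print(list(train.columns))
print(list(test.columns))

# Reassigning a name creates a new frame; the list still holds the old one
train = train.drop(columns=["is_female"])
print(combined[0] is train)
```

Because later cells reassign `training_data` and `testing_data`, the list would need rebuilding after those cells if it were to be used again.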

## **Light Data Analysis**
In this section, I will inspect whether each attribute is categorical or numerical, as this changes how the data is analysed at first. <br>
This can be seen by inspecting the head of the data frame and checking the values carefully to confirm their data type.

In [None]:
training_data.head()

As seen in the snippet of the data frame above, it is fairly clear which attributes are categorical and which are numerical. These two groups will be explored differently, as categorical datatypes cannot be analysed using popular visualisations such as histograms <br><br>
***Categorical Attributes***
- Survived (Boolean, Categorical)
- Sex (Boolean, Categorical)
- Embarked (Char, Categorical)
- Pclass (Int, Ordinal) <br>

***Numerical Attributes***
- Age (Float)
- Fare (Float)
- SibSp (Int)
- Parch (Int)

***Mixed DataTypes***
- Ticket (Mostly Ints, some with a leading String)
- Cabin (Int with leading Char) <br>

Mixed datatypes make analysing and exploring those attributes more complex. To better analyse and compare this data, mixed datatypes should be sorted into categories (such as taking only the leading letter of each cabin number) or transformed into pure ints (such as removing the leading string from a ticket, leaving only the integer that most tickets already consist of).
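
The two transformations described above could be sketched roughly as follows. This is an illustration on a made-up frame, not a step the notebook actually runs; the column names `Deck` and `TicketNumber` are hypothetical.

```python
import pandas as pd

# Toy rows in the same shape as the Cabin and Ticket columns (values are illustrative)
toy = pd.DataFrame({
    "Cabin": ["C85", None, "E46"],
    "Ticket": ["A/5 21171", "113803", "STON/O2. 3101282"],
})

# Category from a mixed value: keep only the leading letter of the cabin
toy["Deck"] = toy["Cabin"].str[0]  # missing cabins stay missing

# Pure number from a mixed value: keep only the trailing run of digits
digits = toy["Ticket"].str.extract(r"(\d+)$", expand=False)
toy["TicketNumber"] = pd.to_numeric(digits, errors="coerce")

print(toy)
```

Tickets without any trailing digits would come out as missing values under this approach and would need separate handling.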
<br>
Before visualising or cleaning the still untransformed data, I will use the pandas info() and describe() functions to get a general idea of the data and raise some questions. Answering these will allow for a greater understanding of the data, and therefore (hopefully) a better prepared data frame for training. 

In [None]:
training_data.info()
print(".........................................................")
testing_data.info()

**Info() Findings:**
In general, the proportion of nulls for each variable is similar in the training and testing sets (the data frames are comparable, as expected).
There are very few NA values for most variables, making this a fairly complete data frame. The only variables with large null counts are Age, which lacks about 1/5 of the possible values in both data sets, and Cabin (which stores the cabin ID), which lacks more than 3/4 of the possible values. This makes using Cabin much more complex, as the number of values that would have to be approximated may be too large to make up for the lost values accurately. 
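
Rather than reading the null counts off the info() output by eye, the fractions can be computed directly. A small sketch on made-up data; the helper name `missing_fraction` is my own and not part of pandas.

```python
import numpy as np
import pandas as pd

def missing_fraction(df):
    """Return the share of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

# Toy frame with gaps in Age and Cabin (values are illustrative)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [None, "C85", None, None, None],
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05],
})

fractions = missing_fraction(toy)
print(fractions)
```

Calling the same helper on `training_data` and `testing_data` would give the two sets of proportions side by side for comparison.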

In [None]:
training_data.describe()

In [None]:
testing_data.describe()

In [None]:
# Describe the dataframe object ('O')
training_data.describe(include=['O'])

**Describe() Findings**
These describe tables are useful for closely inspecting numerical data. <br>
From these tables we can see that there is a large spread in the fare paid (shown by a large standard deviation) and that the average age is around 30 years old, suggesting that the ship carried mostly younger adults. The mean of SibSp is around 0.5, so a sizeable share of passengers were travelling with siblings or a spouse; including the number of parents or children (Parch), the share travelling with family likely climbs even higher.<br>
Calling describe(include=['O']) describes the columns with the "object" datatype, which in this case are the text columns such as Name, Sex and Ticket. This shows how many unique values each has. As can be seen by inspecting the training data, there is a large number of duplicated ticket values. This looks like an issue that needs resolving, as the ticket value would be expected to be unique; however, it could be due to the existence of family tickets or something similar. As names are not duplicated, we can safely assume that entire passengers aren't being duplicated, just the ticket number. 
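
One way to check the "family ticket" idea is to list the tickets that occur more than once and see who shares them. A sketch on a made-up frame (names shortened and values illustrative):

```python
import pandas as pd

# Toy passengers: two share a ticket, two do not
toy = pd.DataFrame({
    "Name": ["Palsson, Master. Gosta", "Palsson, Miss. Torborg",
             "Braund, Mr. Owen", "Allen, Mr. William"],
    "Ticket": ["349909", "349909", "A/5 21171", "373450"],
})

print(toy["Ticket"].nunique(), "unique tickets for", len(toy), "passengers")

# keep=False marks every row of a duplicated ticket, not just the repeats
shared = toy[toy["Ticket"].duplicated(keep=False)]
group_sizes = shared.groupby("Ticket")["Name"].count()
print(group_sizes)
print(shared)
```

If the shared tickets mostly group passengers with the same surname, that would support the family-ticket explanation over a data error.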

<br><br>
**Basic Numeric Distributions**


In [None]:
# Split dataframe into categorical and numerical columns 
training_data_numerical = training_data[['Age','SibSp','Parch','Fare']]
training_data_catagorical = training_data[['Survived','Pclass','Sex','Ticket','Cabin','Embarked']]

# Display numerical variables using histograms to observe distributions
for x in training_data_numerical.columns:
    plt.hist(training_data_numerical[x])
    plt.title(x)
    plt.show()

*Findings*
- Age gives a roughly normal distribution, so it will be useful untransformed 
- SibSp and Parch have similar distributions, likely because families are travelling together (increasing both at the same rate) - could consider combining them into one "family" attribute
- Fare has a heavily skewed distribution, with a shape that resembles an exponential distribution - taking the log of the fare should give something closer to a normal distribution
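
Both ideas from the findings above can be tried quickly. The sketch below uses made-up values; `FamilySize` is a hypothetical column name, and counting the passenger themselves (the `+ 1`) is an assumption about how a "family" attribute might be defined.

```python
import numpy as np
import pandas as pd

# Toy values with one very large fare, mimicking the long right tail
toy = pd.DataFrame({
    "SibSp": [1, 0, 3, 0, 1, 0],
    "Parch": [0, 0, 1, 0, 2, 0],
    "Fare": [7.25, 8.05, 13.0, 26.0, 71.28, 512.33],
})

# Combine the two family counts into one attribute (passenger included)
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# log1p(x) = log(1 + x), which stays defined for a fare of 0
toy["LogFare"] = np.log1p(toy["Fare"])

raw_skew = toy["Fare"].skew()
log_skew = toy["LogFare"].skew()
print(toy[["FamilySize", "LogFare"]])
print(f"skew before: {raw_skew:.2f}, after: {log_skew:.2f}")
```

A skew closer to zero after the transform is what indicates a more symmetric, more normal-looking distribution.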



In [None]:
# Display a heatmap to display correlations between numerical variables
sns.heatmap(training_data_numerical.corr())

*Findings* <br>
This step is important to ensure that two attributes aren't too strongly correlated, which could lead to multicollinearity, in which the model weights something too strongly because two strongly correlated attributes compound each other. 
- Age and SibSp are inversely correlated - plausibly because children tend to travel with several siblings, while an adult has at most one spouse aboard
- The strongest correlation is between "Parch" and "SibSp" - which suggests that if a person has a spouse/sibling aboard, they are also likely to have a parent/child present. This follows the logic that entire families are present on the ship. 
- Other correlations are present, but not strong enough to be concerned about them affecting results negatively.
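
Reading a heatmap by eye can be backed up by listing the pairs whose correlation exceeds a chosen cut-off. A sketch with made-up columns; the helper `strongly_correlated_pairs` and the 0.8 threshold are assumptions for illustration, not a fixed rule.

```python
import pandas as pd

def strongly_correlated_pairs(df, threshold=0.8):
    """List (left, right, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:  # upper triangle only, skips the diagonal
            r = float(corr.loc[left, right])
            if abs(r) >= threshold:
                pairs.append((left, right, round(r, 3)))
    return pairs

# Toy data: b follows a almost exactly, c is unrelated
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})

pairs = strongly_correlated_pairs(toy)
print(pairs)
```

An empty list for the Titanic numerical columns would agree with the conclusion that none of the correlations are strong enough to worry about.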


In [None]:
# Create a simple table that shows the link between variables and survival
pd.pivot_table(training_data, index = 'Survived', values = ['Age','SibSp','Parch','Fare'])

*Findings* <br>
- In general, younger people are more likely to survive 
- In general, people who paid more for their ticket are more likely to survive (significantly more likely)
- In general, people with parents or children aboard are more likely to survive
- In general, those with siblings or spouses aboard are more likely to not survive (although not significantly)


In [None]:
# Display a series of tables which show how survival varies across the categories of each categorical variable
print(pd.pivot_table(training_data, index = 'Survived', columns = 'Pclass', values = 'Ticket' ,aggfunc ='count'))
print()
print(pd.pivot_table(training_data, index = 'Survived', columns = 'Sex', values = 'Ticket' ,aggfunc ='count'))
print()
print(pd.pivot_table(training_data, index = 'Survived', columns = 'Embarked', values = 'Ticket' ,aggfunc ='count'))
print()
print(pd.pivot_table(training_data, index = 'Survived', columns = 'Parch', values = 'Ticket' ,aggfunc ='count'))
print()
print(pd.pivot_table(training_data, index = 'Survived', columns = 'SibSp', values = 'Ticket' ,aggfunc ='count'))

*Findings* <br>
- Class seems to have a large effect on survival, with nearly 2/3 of 1st class passengers surviving, compared to only 1/4 of 3rd class.
- Sex also seems to have a massive effect: a far larger proportion of women survived than of men.
- As shown before, travelling with family seems to have a positive effect, whether that's having a spouse or having children aboard. 
- Port of embarkation does not seem to be statistically significant
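
The pivot tables above give raw counts, so the proportions quoted in the findings have to be worked out by hand. A sketch of getting survival rates directly, on a made-up frame:

```python
import pandas as pd

# Toy passengers (values are illustrative, not the real data)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Survived": [0, 1, 1, 0, 0, 1],
})

# Survived is 0/1, so its mean within a group is that group's survival rate
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)

# normalize="index" turns each row of counts into row proportions
rate_by_class = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(rate_by_class)
```

The same two calls on `training_data` would give the per-class and per-sex rates without manual arithmetic.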




In [None]:
# Create a faceted histogram which better shows how age affects survival
g = sns.FacetGrid(training_data, col='Survived')
g.map(plt.hist, 'Age', bins=20)

Earlier analysis suggested that the age of the passenger was not that significant (although younger people were slightly more likely to live). This graph shows a much clearer and more easily understandable distribution. It shows that young children are much more likely to survive. The distribution also shows that for all other age groups the numbers of survivors and non-survivors follow a similar shape, so that, ignoring the skew provided by children, the chance of surviving is not heavily correlated with age.
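
The same observation can be put into numbers by grouping ages into bands and comparing survival rates per band. A sketch on made-up data; the band edges and labels are my own assumptions, not values taken from the notebook.

```python
import pandas as pd

# Toy ages and outcomes (values are illustrative)
toy = pd.DataFrame({
    "Age": [2, 8, 25, 40, 31, 70, 4, 55],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Bands are right-inclusive: (0, 12], (12, 18], (18, 60], (60, 120]
bins = [0, 12, 18, 60, 120]
labels = ["child", "teen", "adult", "senior"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# observed=True leaves out bands that contain no passengers
rates = toy.groupby("AgeBand", observed=True)["Survived"].mean()
print(rates)
```

On the real data, rows with a missing Age fall into no band and are simply left out of the grouped rates.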

In [None]:
# Extract the title from each name (the word followed by a full stop, e.g. "Mr")
training_data['Title'] = training_data.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
# Display the number of each titles present
training_data['Title'].value_counts()

After listing the titles, we can see that the only titles common enough to matter are Mr, Miss, Mrs and Master. These titles are already covered by the age and sex variables. The other titles which could be significant, such as Countess, are covered by the class variable. This makes including the title redundant. 
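
The claim that titles overlap with sex can be checked with a cross-tabulation. A sketch on a few made-up rows in the same "Surname, Title. Forenames" format as the Name column:

```python
import pandas as pd

# Toy names in the dataset's "Surname, Title. Forenames" layout
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley",
             "Heikkinen, Miss. Laina", "Palsson, Master. Gosta Leonard"],
    "Sex": ["male", "female", "female", "male"],
})

# Same pattern as the notebook: a space, letters, then a literal full stop
toy["Title"] = toy["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Each row is a title; a title confined to one column is fully explained by sex
overlap = pd.crosstab(toy["Title"], toy["Sex"])
print(overlap)
```

If every title on the real data sits in a single sex column, the title adds nothing about sex that the Sex variable does not already carry.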


In [None]:
# Removing the Title variable
training_data = training_data.drop(['Title'], axis=1)

**Conclusion**

*Variables Included in Modelling*
- Age
- Class
- Sex
- Parch
- SibSp
- Fare

*Variables NOT Included in Modelling*
- Ticket
    - Not relevant: including the fare tells the model everything it needs to know about the ticket, and the alphanumerical string itself tells us nothing 
- Cabin
    - Excluded due to the lack of a complete dataset; too many null values make using this variable too inaccurate
- Embarked
    - Does not appear to be statistically relevant, and is also not logically relevant. The Titanic sank after all of these ports had been visited. 
- Name
    - Once the title is excluded (due to the overlap between a passenger's title and their sex and class), the rest of the name is not relevant at all. 


## **Data Cleaning**


In [None]:
# Fill NA values with the median (Age) and the mean (Fare) to complete the data set
testing_data.Age = testing_data.Age.fillna(testing_data.Age.median())
testing_data.Fare = testing_data.Fare.fillna(testing_data.Fare.mean())

# Drop unneeded data
testing_data = testing_data.drop(['Ticket'], axis=1)
testing_data = testing_data.drop(['Cabin'], axis=1)
testing_data = testing_data.drop(['Embarked'], axis=1)
testing_data = testing_data.drop(['Name'], axis=1)

# Translate fare prices onto a logarithmic scale to normalise, then drop the raw fare
testing_data['LogFare'] = np.log(testing_data.Fare+1)
testing_data['LogFare'].hist()
testing_data = testing_data.drop(['Fare'], axis=1)

# Replace categorical sex strings with int values
testing_data['Sex'] = testing_data['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

In [None]:
# Fill NA values with the median (Age) and the mean (Fare) to complete the data set
training_data.Age = training_data.Age.fillna(training_data.Age.median())
training_data.Fare = training_data.Fare.fillna(training_data.Fare.mean())

# Drop unneeded data
training_data = training_data.drop(['Ticket'], axis=1)
training_data = training_data.drop(['Cabin'], axis=1)
training_data = training_data.drop(['Embarked'], axis=1)
training_data = training_data.drop(['Name'], axis=1)

# Translate fare prices onto a logarithmic scale to normalise, then drop the raw fare
training_data['LogFare'] = np.log(training_data.Fare+1)
training_data['LogFare'].hist()
training_data = training_data.drop(['Fare'], axis=1)

# Replace categorical sex strings with int values
training_data['Sex'] = training_data['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

In [None]:
# Confirm data changes
training_data.info()
training_data.head()

In [None]:
# Confirm data changes
testing_data.info()
testing_data.head()
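
The two cleaning cells above repeat the same steps for each data frame. One possible alternative is a single helper applied to both. The sketch below runs on made-up frames; the function `clean` is hypothetical, and passing in fill values computed from the training set is a design choice that differs from the cells above, which fill each set from its own statistics.

```python
import numpy as np
import pandas as pd

def clean(df, age_fill, fare_fill):
    """Return a cleaned copy; fill values are passed in so both sets share them."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(age_fill)
    out["Fare"] = out["Fare"].fillna(fare_fill)
    # errors="ignore" lets the toy frames omit some of these columns
    out = out.drop(columns=["Ticket", "Cabin", "Embarked", "Name"], errors="ignore")
    out["LogFare"] = np.log1p(out["Fare"])  # same as log(Fare + 1)
    out = out.drop(columns=["Fare"])
    out["Sex"] = out["Sex"].map({"female": 1, "male": 0}).astype(int)
    return out

# Toy frames (values are illustrative)
train = pd.DataFrame({"Age": [22.0, np.nan, 30.0], "Fare": [7.25, 71.28, 8.05],
                      "Sex": ["male", "female", "male"], "Ticket": ["a", "b", "c"]})
test = pd.DataFrame({"Age": [np.nan, 40.0], "Fare": [np.nan, 13.0],
                     "Sex": ["female", "male"], "Ticket": ["d", "e"]})

age_fill = train["Age"].median()
fare_fill = train["Fare"].mean()
train_clean = clean(train, age_fill, fare_fill)
test_clean = clean(test, age_fill, fare_fill)
print(test_clean)
```

Keeping the steps in one place means a change to the cleaning logic cannot be applied to one set and forgotten on the other.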

## **Data Modelling**


***Data Preparation***

In [None]:
# Separate independent and dependent variables
X_train = training_data[['Pclass','Sex','Age','SibSp','Parch','LogFare']]
y_train = training_data['Survived']
X_test = testing_data[['Pclass','Sex','Age','SibSp','Parch','LogFare']]

len(X_test)

***Model Usage***

In [None]:
# Logistic Regression Model Use
lr=LogisticRegression()
lr.fit(X_train,y_train)
y_predict = lr.predict(X_test)
lrScore = round(lr.score(X_train, y_train), 2)
print(lrScore)

In [None]:
# Support Vector Classification Use
svc = SVC()
svc.fit(X_train, y_train)
y_test = svc.predict(X_test)
svcScore = round(svc.score(X_train, y_train), 2)
print(svcScore)

In [None]:
# K Neighbors Classifier Use
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)
y_test = knn.predict(X_test)
knnScore = round(knn.score(X_train, y_train), 2)
knnScore

In [None]:
# Gaussian Naive Bayes Use
gs = GaussianNB()
gs.fit(X_train, y_train)
y_test = gs.predict(X_test)
gsScore = round(gs.score(X_train, y_train), 2)
gsScore

In [None]:
# Random Forest Classifier Use
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
y_test = rf.predict(X_test)
rfScore = round(rf.score(X_train, y_train), 2)
rfScore

In [None]:
# Decision Tree Use
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_test = dt.predict(X_test)
dtScore = round(dt.score(X_train, y_train), 2)
dtScore

Random Forest Classifier and Decision Tree clearly score highest of the models used, although these scores are measured on the training data itself rather than on unseen passengers. 
I will choose to submit the Decision Tree model's predictions, as the data set is small and random forest is likely too complex for the predictions that are to be made, and may therefore overfit the data (as well as simply not being as efficient for the small task at hand).
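
Since the scores above come from the data the models were fitted on, a held-out split gives a fairer comparison. The sketch below uses `train_test_split` (already imported at the top of the notebook) on synthetic data from `make_classification`; it is not the Titanic data, and the split size and random seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the six Titanic features (not the real data)
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           flip_y=0.1, random_state=0)

# Hold back a quarter of the rows that the models never see during fitting
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

models = [
    ("decision tree", DecisionTreeClassifier(random_state=0)),
    ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]

scores = {}
for name, model in models:
    model.fit(X_fit, y_fit)
    scores[name] = (model.score(X_fit, y_fit), model.score(X_val, y_val))
    print(f"{name}: train={scores[name][0]:.2f}, validation={scores[name][1]:.2f}")
```

A large gap between the two numbers for a model is a sign of overfitting, which is the concern raised above. With the real data, `X_train` and `y_train` would take the place of `X` and `y`.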

## **Submission**

In [None]:
# Use the predictions of the Decision Tree model chosen above
submission = pd.DataFrame({
        "PassengerId": testing_data["PassengerId"],
        "Survived": dt.predict(X_test)
    })

submission.to_csv('submission.csv', index =False)
print(submission)
