<a href="https://colab.research.google.com/github/Camicb/practice/blob/main/Titanic_survival.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Titanic - Machine Learning from Disaster**

# **1. Introduction**

In this notebook, I will work through Kaggle's Titanic machine learning competition. The goal is to build a machine learning model that predicts which passengers survived the Titanic shipwreck.

**The Challenge**

The sinking of the Titanic is one of the most infamous shipwrecks in history. 
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. 
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, the idea is to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

# **2. Import Required Libraries**

In [None]:
# !pip install missingno
# !pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip 
# !pip install pycaret
# !pip install plotly


In [None]:
# Importing libraries
import pandas as pd
import numpy as np
%matplotlib inline 
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno 
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
from IPython.core.interactiveshell import InteractiveShell
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
InteractiveShell.ast_node_interactivity = "all"
from pycaret.utils import enable_colab
enable_colab()        



In [None]:
# Enabling plotly - This is useful when an empty white space appears instead of the plot after executing the cell

def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
  display(IPython.core.display.HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
  init_notebook_mode(connected=False)

from plotly.offline import iplot
import plotly.graph_objs as go
import plotly.express as px

enable_plotly_in_cell()

# choosing some colors for plotly
colors = ['rgb(0, 0, 100)', 'rgb(0, 200, 200)']

# **3. Exploratory Data Analysis**

## **3.1 About the data**

Variables in the dataset:

**survived:** Survival (0 = No, 1 = Yes)

**pclass:** Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)

**sex:** Sex

**age:** Age in years

**sibsp:** Number of siblings / spouses aboard the Titanic

**parch:** Number of parents / children aboard the Titanic

**ticket:** Ticket number

**fare:** Passenger fare

**cabin:** Cabin number

**embarked:** Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)


In [None]:
#Loading the training dataset
train=pd.read_csv('https://raw.githubusercontent.com/Camicb/practice/main/train_titanic.csv')
#Loading the testing dataset
test=pd.read_csv('https://raw.githubusercontent.com/Camicb/practice/main/test_titanic.csv')


## **3.2 Exploratory data analysis**

I will explore each variable from a statistical point of view.

In [None]:
train.head()
train.info()

In [None]:
train.tail()

In [None]:
test.head()
test.info()

In [None]:
# Interactive Statistical report
profile = ProfileReport(train, html={'style': {'full_width': True, 'primary_color': '#30b6c2'}},  samples=None, missing_diagrams=None, interactions=None)
profile.to_file("report.html")
profile.to_notebook_iframe()

In [None]:
# Visualization of missing values 
msno.matrix(train, figsize=(10,5), fontsize=10, color=(0.0, 0.75, 0.75))

In [None]:
# Survival by Ticket Class
colors = ['rgb(0, 0, 100)', 'rgb(0, 200, 200)']
fig1 = px.histogram(train, 
                   x='Pclass', 
                   color='Survived', 
                   histnorm='percent', 
                   width=400, height=400, 
                   marginal='violin', 
                   hover_data=train.columns, 
                   color_discrete_sequence=colors)
fig1.update_xaxes(type='category')

In [None]:
# Survival by Sex
fig2 = px.histogram(train,
                    x='Sex',
                    color='Survived',
                    width=400, height=400,
                    hover_data=train.columns,
                    color_discrete_sequence=colors)
fig2.update_xaxes(type='category')


# **4. Preprocessing**

I will modify each column to prepare the training data for modeling. I will drop the columns with more than 70% missing values, as well as those with irrelevant information, and transform some of the remaining columns to derive new features.
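As a minimal sketch of the 70% rule mentioned above, the share of missing values per column can be computed with `isna().mean()` and compared against a cut-off. The toy DataFrame and the `threshold` variable are assumptions made for illustration; they are not the competition data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real training set (illustrative values only)
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, 35.0, 54.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan, np.nan],
    'Fare':  [7.25, 71.28, 7.92, 53.10, 8.05],
})

threshold = 0.7  # assumed cut-off: drop columns with more than 70% missing values

# Fraction of missing values in each column
missing_share = toy.isna().mean()

# Columns whose missing share exceeds the threshold
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(missing_share)
print(to_drop)
```

In this toy frame only `Cabin` (4 of 5 values missing) crosses the cut-off, so it is the only column removed.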

In [None]:
def feat_eng(data):
  # Work on a copy so the original DataFrame is left untouched
  data = data.copy()

  # Name: keep the title, i.e. the first word after the comma
  titule = data['Name'].str.split(",", expand=True)
  data['Titule'] = titule[1].str.split(expand=True)[0]

  # Sex: encode as 0 (male) / 1 (female)
  data['Sex'] = data['Sex'].replace({'male': 0, 'female': 1})

  # Family: 'Yes' if any sibling, spouse, parent or child was aboard
  data['Family'] = data['SibSp'] + data['Parch']
  data['Family'] = data['Family'].apply(lambda i: 'No' if i == 0 else 'Yes')

  # Drop columns that are no longer needed
  data = data.drop(['Name', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'PassengerId'], axis=1)

  return data

In [None]:
training=feat_eng(train)
training.head()
training.info()
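To see what the title extraction in `feat_eng` produces, here is a small sketch on a few sample names. The `str.extract` variant is an alternative I am assuming for comparison, not the method used in this notebook; it keeps multi-word titles intact, whereas taking the first word after the comma does not.

```python
import pandas as pd

# Illustrative names in the "Surname, Title. Given names" format
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Approach used in feat_eng: text after the comma, then the first
# whitespace-separated word
first_word = names.str.split(',', expand=True)[1].str.split(expand=True)[0]

# Alternative (assumption, not the notebook's method): capture everything
# between the comma and the first period
by_regex = names.str.extract(r',\s*([^.]+)\.', expand=False)

print(first_word.tolist())
print(by_regex.tolist())
```

The first approach keeps the trailing period (`Mr.`) and reduces `the Countess` to `the`, which may matter if rare titles are grouped later.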

In [None]:
# Split the train data into a new training and validation dataset
X=training.drop(['Survived'], axis=1)
y=training['Survived']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.head()

In [None]:
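# Preprocessing:

# Work on an explicit copy of the split to avoid SettingWithCopyWarning
X_train = X_train.copy()

# Age: fill missing values with the mean age of the training split
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_train['Age'] = mean_imputer.fit_transform(X_train[['Age']]).ravel()

# Embarked: fill missing values with the most frequent port
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X_train['Embarked'] = mode_imputer.fit_transform(X_train[['Embarked']]).ravel()

# One hot encoding
#data=pd.get_dummies(data, columns=['Embarked', 'Titule', 'Family'])

# Check that no missing values remain
msno.matrix(X_train, figsize=(10,5), fontsize=10, color=(0.0, 0.75, 0.75))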
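The validation split needs the same treatment. A minimal sketch, assuming the fill values should be learned from the training split only and then reused: call `fit_transform` on the training data and plain `transform` on the validation data. The toy frames and the names `age_imputer` and `port_imputer` are illustrative stand-ins, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for X_train and X_val (illustrative values only)
toy_train = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0],
                          'Embarked': ['S', 'S', 'C', np.nan]})
toy_val = pd.DataFrame({'Age': [np.nan, 50.0],
                        'Embarked': [np.nan, 'Q']})

age_imputer = SimpleImputer(strategy='mean')
port_imputer = SimpleImputer(strategy='most_frequent')

# Learn the fill values from the training split only
toy_train['Age'] = age_imputer.fit_transform(toy_train[['Age']]).ravel()
toy_train['Embarked'] = port_imputer.fit_transform(toy_train[['Embarked']]).ravel()

# Reuse the same fill values on the validation split (transform, not fit_transform)
toy_val['Age'] = age_imputer.transform(toy_val[['Age']]).ravel()
toy_val['Embarked'] = port_imputer.transform(toy_val[['Embarked']]).ravel()

print(toy_val)
```

Here the missing validation age is filled with 30.0, the training mean, and the missing port with `S`, the training mode, so no information from the validation split leaks into the fill values.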



In [None]:
X_train['Titule'].value_counts(normalize=True)
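Rare titles matter for the one-hot encoding step that is still commented out above: a title that appears only in the validation or test data would create a column the model never saw. One possible way to handle this, sketched here with `pd.get_dummies` and `reindex` on toy data (the frames and `cat_cols` are assumptions for illustration), is to align the encoded columns with those of the training split.

```python
import pandas as pd

# Toy stand-ins for the training and validation features (illustrative values only)
toy_train = pd.DataFrame({'Titule': ['Mr.', 'Mrs.', 'Miss.', 'Mr.'],
                          'Family': ['No', 'Yes', 'No', 'Yes']})
toy_val = pd.DataFrame({'Titule': ['Mr.', 'Dr.'],
                        'Family': ['Yes', 'No']})

cat_cols = ['Titule', 'Family']

train_enc = pd.get_dummies(toy_train, columns=cat_cols)
val_enc = pd.get_dummies(toy_val, columns=cat_cols)

# Align the validation columns with the training columns:
# categories unseen in training (here 'Dr.') are dropped, and
# categories absent from validation are added as all-zero columns
val_enc = val_enc.reindex(columns=train_enc.columns, fill_value=0)

print(val_enc)
```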

# **5. Modeling**

# **6. Model Performance**

In [None]:
#from sklearn.metrics import accuracy_score
#accuracy_score(y_test, predictions)

# The confusion matrix
#from sklearn.metrics import confusion_matrix
#confusion_matrix(y_test, predictions)
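The commented-out cell above refers to `y_test` and `predictions`, which are not defined anywhere in this notebook. As a small sketch of what the two metrics return, here they are on made-up labels; with a fitted model, `y_val` and the model's predictions for `X_val` would presumably take their place.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Share of correct predictions
acc = accuracy_score(y_true, y_pred)

# Rows are the true classes, columns the predicted classes
cm = confusion_matrix(y_true, y_pred)

print(acc)
print(cm)
```

Four of the five predictions are correct, giving an accuracy of 0.8; the single mistake is a survivor predicted as a non-survivor, which shows up in the bottom-left cell of the matrix.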

In [None]:
#from sklearn.ensemble import RandomForestClassifier
# Create a random forest classifier
#rf_clf = RandomForestClassifier(n_estimators=10000, random_state=1, n_jobs=-1)

# n_estimators = The number of trees in the forest.
# n_jobs = -1 : Use all processors for training

# Train the classifier
#rf_clf.fit(X_train, y_train)

#plot graph of feature importances for better visualization
#feat_importances = pd.Series(rf_clf.feature_importances_, index=X_train.columns)
#feat_importances.nlargest(10).plot(kind='barh')
#plt.show()
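The random forest cell above expects purely numeric features, so it would only run once the string columns (`Embarked`, `Titule`, `Family`) are encoded. Below is a runnable sketch of the same idea on synthetic data; the data, the column values, and the choice of 100 trees instead of 10000 are assumptions made to keep the example fast and self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic, fully numeric features (illustrative only; not the Titanic data)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    'Sex': rng.integers(0, 2, size=200),
    'Age': rng.uniform(1, 80, size=200),
    'Fare': rng.uniform(5, 100, size=200),
})

# Make the target depend mostly on 'Sex' so that its importance stands out
y_toy = (X_toy['Sex'] + rng.normal(0, 0.3, size=200) > 0.5).astype(int)

# A smaller forest than in the commented-out cell, to keep the sketch fast
rf_clf = RandomForestClassifier(n_estimators=100, random_state=1, n_jobs=-1)
rf_clf.fit(X_toy, y_toy)

# One importance value per feature; the values sum to 1
feat_importances = pd.Series(rf_clf.feature_importances_, index=X_toy.columns)
print(feat_importances.sort_values(ascending=False))
```

Because the synthetic target is built almost entirely from `Sex`, that feature should receive the largest importance, which is a quick sanity check that the importances behave as expected.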


# **7. PyCaret**

I will repeat the modeling part with this library.