<h2> Titanic Survival Prediction🚢⚓ </h2>

- This notebook aims to predict the survival of passengers on the Titanic from the data and features available to us.
- We first explore the data to gain some initial insights, then process it to support effective model development.
- The notebook follows these steps:
1. [Exploratory Data Analysis 📊](#section-one)
2. [Feature Engineering 🔧](#section-two)
3. [Model Development and Metric Analysis🚀](#section-three)
4. [Final Predictions Submission📝](#section-four)

In [None]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier
from xgboost import XGBClassifier

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
# Loading the dataset
train_data = pd.read_csv('/kaggle/input/titanic/train.csv')
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')

In [None]:
train_data.head()

In [None]:
train_data.info()

In [None]:
train_data.describe()

<a id="section-one"></a>
<h2 align = center>1. Exploratory Data Analysis📊</h2>

In [None]:
# Getting the Categorical and Numerical Features
numerical_features = [x for x in train_data.columns if train_data[x].dtypes != 'O']
categorical_features = [x for x in train_data.columns if train_data[x].dtypes == 'O']

In [None]:
numerical_features, categorical_features

In [None]:
# Name and PassengerId are not used as predictors, so they are dropped (PassengerId is kept in test_data for the submission file)
train_data.drop(['Name','PassengerId'],axis = 1,inplace = True)
test_data.drop(['Name'],axis = 1,inplace = True)
categorical_features.remove('Name')

<h2>NULL Value Imputation</h2>

In [None]:
sns.heatmap(train_data.isnull())

The heatmap above shows that the dataset contains NULL/NaN values.
These need to be handled before we proceed with the analysis.
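
A heatmap shows where the gaps are, but a per-column count is easier to act on. The sketch below uses a small made-up frame (`toy`, a hypothetical stand-in for `train_data`) to tabulate how many values are missing in each column and what share of the column that is.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Count and percentage of missing values per column, largest share first
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
}).sort_values("pct_missing", ascending=False)

print(missing)
```

Running the same two expressions on `train_data` should give the exact counts behind the heatmap.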

In [None]:
sns.histplot(data = train_data,x = 'Age',kde = True,hue = 'Survived')

The plot above shows that Age is roughly bell-shaped (approximately normal), so we use the mean to impute its NULL values

In [None]:
train_data.Age.describe()

In [None]:
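# Assign the result back instead of calling fillna(inplace=True) on a column,
# which may not update the DataFrame in recent pandas versions
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].mean())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].mean())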

In [None]:
sns.countplot(x = train_data['Embarked'])

Embarked is a categorical feature, so we impute its NULL values with the mode, i.e. the most common value, which the plot above shows to be <b>'S'</b>
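
The notebook fills each dataset from its own statistics. A common alternative, sketched here on made-up data, is to learn the fill values from the training set only and reuse them for the test set, so that both sets go through the same transformation. The frames `toy_train` and `toy_test` and the dictionary `fill_values` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train_data and test_data (values are made up)
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0],
                          "Embarked": ["S", "S", "C", None]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0],
                         "Embarked": [None, "Q"]})

# Learn the fill values from the training data only:
# mean for the numerical column, mode for the categorical one
fill_values = {
    "Age": toy_train["Age"].mean(),
    "Embarked": toy_train["Embarked"].mode()[0],
}

# Apply the same values to both sets; fillna with a dict fills column by column
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)

print(toy_train)
print(toy_test)
```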

In [None]:
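# Assign the result back rather than relying on a chained inplace fill
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])
test_data['Embarked'] = test_data['Embarked'].fillna(test_data['Embarked'].mode()[0])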

In [None]:
(train_data[train_data['Cabin'].isna()].shape[0]/train_data.shape[0])*100

For the Cabin column, <b>77% of the values are NULL</b>. Imputing that many values would introduce bias into the model, so we drop the column from further analysis

In [None]:
train_data.groupby(['Cabin','Survived']).size().reset_index(name = 'Counts').sort_values(by = 'Counts',ascending = False)[:10]

In [None]:
train_data[train_data['Cabin'] == 'B96 B98']

From the above, we can observe that all passengers in Cabin <b>B96 B98</b> survived

In [None]:
train_data.drop(['Cabin','Ticket'],axis=1,inplace = True)
test_data.drop(['Cabin','Ticket'],axis=1,inplace = True)

In [None]:
for i in ['Sex','Embarked']:
    sns.countplot(x = train_data[i],hue = train_data['Survived'])
    plt.show()

From the above plots, we can derive the following conclusions:
1. Females were more likely to survive than males. <b>74.2%</b> of all females survived, compared with only <b>18.9%</b> of all males, so Sex is an important feature for predicting survival
2. In Embarked, category S has the largest number of passengers and also the largest number of survivors in absolute terms
3. In relative terms, <b>55%</b> of the passengers who embarked at C survived, compared with about <b>39%</b> for Q and only <b>33.7%</b> for S. Hence passengers in category C were the most likely to survive.
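
Percentages like the ones above can be read straight off a group-by, because the mean of a 0/1 column is the share of ones. This is a minimal sketch on made-up rows; `toy` is a hypothetical stand-in for `train_data`.

```python
import pandas as pd

# Toy data: 4 females (3 survived) and 4 males (1 survived)
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female",
            "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# sum = survivors, count = passengers, mean = survival rate per group
rate = toy.groupby("Sex")["Survived"].agg(["sum", "count", "mean"])
rate["pct"] = (rate["mean"] * 100).round(1)

print(rate)
```

Swapping `"Sex"` for `"Embarked"` on the real data would give the per-port rates in the same way.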

In [None]:
train_data.groupby(['Sex','Survived']).size().reset_index(name = 'Counts')

In [None]:
# How does survival vary across the Embarked categories?
train_data.groupby(['Embarked','Survived']).size().reset_index(name = 'Counts')

In [None]:
# 38.38% of the passengers survived
sns.countplot(x = train_data.Survived)
print(train_data.Survived.value_counts())

In [None]:
numerical_features.remove('PassengerId')
numerical_features.remove('Survived')

In [None]:
features = ['Pclass','SibSp','Parch']
for i in features:
    sns.countplot(x = train_data[i],hue = train_data['Survived'])
    plt.show()

From the above plots, we can derive the following conclusions:
1. Pclass 3 has the highest number of passengers who did not survive
2. <b>None of the passengers with SibSp 5 or 8 survived</b>. Both groups share <b>Pclass 3 and have no specific Cabin value recorded</b>, and the Embarked value is <b>S</b> for both
3. None of the passengers with Parch 4 or 6 survived

In [None]:
train_data[train_data['SibSp'] == 8]

<a id="section-two"></a>
<h2 align = center>2. Feature Engineering🛠🔧</h2>

In [None]:
sns.boxplot(x = train_data.Fare)
q1 = train_data.Fare.quantile(0.25)
q3 = train_data.Fare.quantile(0.75)
iqr = q3-q1

print(q1,q3,iqr)
outlier_data = train_data[(train_data['Fare']<q1-1.5*iqr) | (train_data['Fare']>q3+1.5*iqr)]

For this use case, considering the business side of travel, we cannot remove all 116 outliers identified by the IQR rule, since some travel classes legitimately command higher fares; these can be left as they are.
However, the 3 training records whose fare is >500 are extreme and need to be treated.
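
The IQR rule used above can be wrapped in a small helper so that the bounds and the flagged rows are explicit. The function name `iqr_bounds` and the toy fares are assumptions made for this sketch; the multiplier 1.5 is the same one used in the cell above.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences of the IQR outlier rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up fares with one extreme value
toy_fare = pd.Series([7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0])

lower, upper = iqr_bounds(toy_fare)
# A value is flagged when it falls outside the fences
is_outlier = (toy_fare < lower) | (toy_fare > upper)

print(lower, upper)
print(toy_fare[is_outlier])
```

Flagging first and deciding afterwards keeps the legitimately expensive tickets in the data while making the extreme ones easy to inspect.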

In [None]:
train_data[train_data['Fare']>500]

* The rows above are the passengers who paid the highest fares (>500); all of them survived

In [None]:
sns.displot(train_data['Fare'])

In [None]:
train_data['Fare'].skew()

From the above distribution plot, we can see that the <b>Fare variable is right skewed</b>. Before working with it further, we apply a transformation to bring its distribution closer to normal.

- Several transformations are available to handle this skewness, including:
1. Log transform: take the log of the variable to bring it closer to a normal distribution. The values must be positive; log(1 + x) can be used when zeros are present.
2. Square root transform: apply the square root to all values of the variable.
3. Box-Cox transform: often removes more skewness than the two above, but requires all values of the variable to be strictly positive.
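
To see how these options compare, one can compute the skewness before and after each transform. The sketch below does this on a handful of made-up fares (`toy_fare` is hypothetical, not the real column). Box-Cox is left out because the toy data, like the real Fare column, contains a zero, and `np.log1p` is used in place of a plain log for the same reason.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed fares including a zero
toy_fare = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 71.3, 263.0, 512.3])

# Skewness of the raw values and of each transformed version
skews = pd.Series({
    "raw": toy_fare.skew(),
    "sqrt": np.sqrt(toy_fare).skew(),
    "log1p": np.log1p(toy_fare).skew(),
})

print(skews.round(2))
```

A value closer to 0 indicates a more symmetric distribution; which transform works best on the real column is something to check on `train_data['Fare']` itself.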


Here, we apply the square-root transform, as shown below

In [None]:
# Square-root transform of Fare (np.sqrt, not Box-Cox); unlike Box-Cox it is defined for zero fares
fare_sqrt = np.sqrt(train_data['Fare'])
fare_sqrt.skew()

In [None]:
fare_sqrt

In [None]:
type(fare_sqrt)

In [None]:
# Wrap the transformed Series in a DataFrame with a descriptive column name
fare = pd.DataFrame(fare_sqrt)
fare.rename(columns = {'Fare':'Fare_transform'},inplace = True)
fare

In [None]:
train_data = pd.concat([train_data,fare],axis = 1)
train_data

In [None]:
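# Sex and Embarked are still text columns here, so restrict the correlation to numeric columns
sns.heatmap(train_data.corr(numeric_only = True),annot = True)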

In [None]:
train_data.drop('Fare',axis = 1,inplace = True)

<h2>Feature Encoding</h2>

- Categorical variables are encoded as numbers so that the machine learning models can use them
- We use one-hot encoding here, since the features have only a few categories, which keeps the number of added columns small
- The same transformations are applied to the test data
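
Applying "the same transformations" to the test data has one pitfall: if a category is missing from the test set, `drop_first` drops a different column there and the two frames no longer line up. One possible safeguard, sketched on made-up data, is to encode the test set without `drop_first` and then reindex it to the training columns. The frames `toy_train` and `toy_test` are hypothetical, and the prefixed column names (`Sex_male`, `Embarked_Q`, ...) differ from the unprefixed ones produced in this notebook.

```python
import pandas as pd

# Toy data: the test set never contains Embarked == 'C'
toy_train = pd.DataFrame({"Sex": ["male", "female", "female"],
                          "Embarked": ["S", "C", "Q"]})
toy_test = pd.DataFrame({"Sex": ["female", "male"],
                         "Embarked": ["S", "Q"]})

# Encode the training set, dropping the first level of each feature
train_enc = pd.get_dummies(toy_train, columns=["Sex", "Embarked"],
                           drop_first=True, dtype=int)

# Encode the test set fully, then keep exactly the training columns;
# columns absent from the test set are filled with 0
test_enc = pd.get_dummies(toy_test, columns=["Sex", "Embarked"], dtype=int)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```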

In [None]:
sex = pd.get_dummies(train_data.Sex,drop_first = True)
emb = pd.get_dummies(train_data.Embarked,drop_first = True)
train_data = pd.concat([train_data,sex,emb],axis = 1)
train_data.drop(['Sex','Embarked'],axis = 1,inplace = True)
train_data.head()

In [None]:
plt.figure(figsize = (7,6))
sns.heatmap(train_data.corr(),annot = True)


From the correlation plot above, we can conclude that:

1. The transformed fare (Fare_transform) is positively correlated with survival
2. Males survived less often than females (the male column is negatively correlated with Survived)


In [None]:
#Splitting Data into Features and Labels
X = train_data.drop('Survived',axis = 1)
y = train_data.Survived

<a id="section-three"></a>
<h2 align = center>3. Model Development and Metric Analysis📊🛠</h2>

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 42)
model = LogisticRegression(solver = 'newton-cg')
model.fit(X_train,y_train)
model.score(X_test,y_test)

In [None]:
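model_1 = XGBClassifier(max_depth = 3,n_estimators = 200)
# Fit on the training split only; fitting on all of X would leak the held-out rows into the score
model_1.fit(X_train,y_train)
model_1.score(X_test,y_test)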

In [None]:
# Random forest baseline on the unscaled features (seeded for reproducible scores)
model_2 = RandomForestClassifier(random_state = 42)
model_2.fit(X_train,y_train)
model_2.score(X_test,y_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,model_2.predict(X_test)))
print(classification_report(y_test,model_2.predict(X_test)))

In [None]:
from sklearn.preprocessing import RobustScaler
# Split first, then fit the scaler on the training fold only,
# so that statistics of the held-out rows do not leak into training
xtrain,xtest,ytrain,ytest = train_test_split(X,y,random_state = 42)
rs = RobustScaler()
xtrain = rs.fit_transform(xtrain)
xtest = rs.transform(xtest)

In [None]:
model = LogisticRegression(solver = 'newton-cg')
model.fit(xtrain,ytrain)
print(model.score(xtest,ytest))

model_1 = XGBClassifier(max_depth = 3,n_estimators = 200)
model_1.fit(xtrain,ytrain)
print(model_1.score(xtest,ytest))

from sklearn.linear_model import SGDClassifier
model_2 = SGDClassifier(l1_ratio = 0.5,random_state = 42)
model_2.fit(xtrain,ytrain)
print(model_2.score(xtest,ytest))

In [None]:
from sklearn.ensemble import StackingClassifier,VotingClassifier
estimators = [('mdl1',model),('mdl3',model_2),('mdl2',model_1)]
model_stack = StackingClassifier(estimators = estimators,final_estimator = model,stack_method = 'predict',cv = 6)
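# Evaluate on the held-out split first, so the score is not inflated by rows seen during training
model_stack.fit(X_train,y_train)
print(model_stack.score(X_test,y_test))
# Refit on all labelled data for the final submission
model_stack.fit(X,y)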

In [None]:
model_vote = VotingClassifier(estimators = estimators)
# Fit on the training split only so that the score reflects unseen data
model_vote.fit(X_train,y_train)
model_vote.score(X_test,y_test)

<a id="section-four"></a>
<h2 align = 'center'>4. Final Prediction Submission📑</h2>

In [None]:
test_data.head()

In [None]:
test_data.isnull().sum()

In [None]:
# Assign the result back rather than relying on a chained inplace fill
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())

In [None]:
test_fare = np.sqrt(test_data['Fare'])
test_fare = pd.DataFrame(test_fare)
test_fare.rename(columns = {'Fare':'Fare_transform'},inplace = True)
test_data.drop('Fare',axis = 1,inplace = True)
test_data = pd.concat([test_data,test_fare],axis = 1)

In [None]:
sex = pd.get_dummies(test_data['Sex'],drop_first = True)
emb = pd.get_dummies(test_data['Embarked'],drop_first = True)
test_data = pd.concat([test_data,sex,emb],axis = 1)
test_data.drop(['Sex','Embarked'],axis = 1,inplace = True)
test_data.head()

In [None]:
test_data

In [None]:
train_data.shape,test_data.shape

In [None]:
X.head()

In [None]:
# Drop the id and put the columns in the same order as the training features
test_data_1 = test_data.drop('PassengerId',axis = 1)[X.columns]
pred = model_stack.predict(test_data_1)

In [None]:
sample = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')
sample.head()

In [None]:
id_p = test_data.PassengerId
pred =  pd.DataFrame(pred,columns = ['Survived'])
df = pd.concat([id_p,pred],axis = 1)
df.head()

In [None]:
df.to_csv('submission.csv',index = False)

<b>More updates will be made soon!</b>
- Do Upvote the kernel, if you found something interesting or new✨
- Suggestions are always welcome😊