**Introduction**



Predictive analysis: the main objective is to predict which passengers survived the well-known Titanic shipwreck.
Two datasets include passenger information such as name, age, gender, and socio-economic class. One dataset is titled `train.csv` and the other is titled `test.csv`.

The `train.csv` file contains the details of a subset of the passengers on board (891 to be exact) and, importantly, reveals whether they survived or not, also known as the “ground truth”.

The `test.csv` dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.



In [None]:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import KFold
from matplotlib import pyplot as plt
import xgboost as xgb
import seaborn as sns
import pandas as pd
import numpy as np
import shap
import math
%matplotlib inline



In [None]:
#data import
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')
#read_csv already returns a DataFrame, so no extra conversion is needed
#keep an untouched copy of the test set: PassengerId is needed for the submission file
data_test = test.copy()

print("Shape of the train set is", train.shape, " and the shape of the test is ",test.shape)

In [None]:
train.head(10)

In [None]:
test.head(10)

#Data analysis & cleansing



In [None]:
train.info()


Features are as follows:

PassengerId  :  numerical: a row identifier that carries no information about survival and can be removed

Survived   :   binary label (0 = did not survive, 1 = survived); serves as the label y_train

Pclass     :    numerical (ordinal): passenger class (1 = first, 2 = second, 3 = third)

Name      :     categorical: passenger's name

Sex       :     categorical: passenger's sex

Age        :   numerical: passenger's age

SibSp      :    numerical: number of siblings and spouses aboard. Sibling = brother, sister, stepbrother, stepsister; spouse = husband, wife (mistresses and fiancés were ignored)

Parch      :    numerical: number of parents and children aboard

Ticket     :    categorical: the ticket ID

Fare      :    numerical: the ticket cost

Cabin      :    categorical: the cabin number

Embarked    :   categorical: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)


According to this description, some features seem not to be useful for prediction, such as PassengerId and Ticket. These two variables do not appear to correlate with the other variables, so they are dropped.


In [None]:
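#the correlation matrix shows how the numerical variables are related to each other
#this gives an idea of how to impute missing values without dropping columns
#numeric_only=True skips text columns such as Name, which recent pandas versions refuse to correlate
sns.heatmap(train.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()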

In [None]:
#drop the PassengerId, Ticket and Name columns
train.drop(['PassengerId', 'Ticket','Name'], axis=1, inplace=True)
test.drop(['PassengerId', 'Ticket', 'Name'], axis=1, inplace=True)
train.shape, test.shape

In [None]:
train.info()

In [None]:
train.describe()

In [None]:
#Columns with missing values
cols_with_missing_train = train.isnull().sum()
print("Training set columns with missing values are :\n", cols_with_missing_train[cols_with_missing_train>0])

cols_with_missing_test = test.isnull().sum()
print("\n\n Test set columns with missing values are :\n", cols_with_missing_test[cols_with_missing_test>0])

print("\n\n Survival rate in the training set:\n",train["Survived"].mean())

Only about 38% of the passengers in the training set survived.
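
The survival rate comes from the mean of the `Survived` column: for a 0/1 column, the mean equals the fraction of ones. A minimal sketch on toy data (the values below are hypothetical and not taken from `train.csv`):

```python
import pandas as pd

# Toy data (hypothetical values, not taken from train.csv)
toy = pd.DataFrame({"Survived": [1, 0, 0, 1, 0]})

# The mean of a 0/1 column is the fraction of ones
survival_rate = toy["Survived"].mean()

# value_counts(normalize=True) reports the same share for each class
shares = toy["Survived"].value_counts(normalize=True)

print(f"Survival rate: {survival_rate:.0%}")
print(shares)
```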

In [None]:
#Survived by Age
ax = sns.boxplot(x="Survived", y="Age",
                data=train)
ax = sns.stripplot(x="Survived", y="Age",
                   data=train, jitter=True,
                   edgecolor="gray")
plt.title("Survived by Age in training data")

Observation:

1- Survivors have a median age of about 28, which suggests that younger passengers were somewhat more likely to survive.
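
A boxplot can hide small differences between groups, so it may help to compare the numbers directly. The sketch below uses toy data with hypothetical ages to show how the median, mean, and count of `Age` could be computed per outcome; on the real data the same `groupby` call would be applied to `train`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); one age is missing on purpose
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Age": [22.0, 35.0, np.nan, 4.0, 28.0, 40.0],
})

# Missing ages are skipped by median, mean and count
age_by_outcome = toy.groupby("Survived")["Age"].agg(["median", "mean", "count"])
print(age_by_outcome)
```

Comparing both groups side by side makes it easier to judge whether the difference in age is large enough to matter.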

In [None]:
sns.countplot(x='Pclass',hue='Survived',data=train)
plt.show()

In [None]:
#Relationship between Age and Parch
sns.boxplot(x='Parch',y='Age', data=train, palette='hls')
plt.title("Age by Parch in training data")

In [None]:
sns.boxplot(x='Parch',y='Age', data=test, palette='hls')
plt.title("Age by Parch in test data")
#Age is correlated with Parch, so the mean age per Parch value is used to impute missing ages
print("Correlation of Age with the other numerical features \n",train.corr(numeric_only=True)["Age"].sort_values(ascending = False))

mean_age_train=train.groupby(['Parch'])['Age'].mean()
mean_age_test= test.groupby(['Parch'])['Age'].mean()      

In [None]:
#Imputer function: fill each missing age with the mean age of the passenger's Parch group
def fill_age(data,mean_age):
    #look up the group mean for every row's Parch value
    group_mean = data['Parch'].map(mean_age)
    #fall back to the overall mean age when a Parch group has no known mean
    group_mean = group_mean.fillna(data['Age'].mean())
    #only missing ages are replaced; known ages are left untouched
    data['Age'] = data['Age'].fillna(group_mean)
    return data

In [None]:
fill_age(train,mean_age_train)
fill_age(test,mean_age_test)
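
Group-wise mean imputation can also be written without a helper function by using `groupby(...).transform`. The following is a minimal sketch on toy data (hypothetical values), not the notebook's own method; the `Age_filled` column name is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); the Parch == 2 group has no known age at all
toy = pd.DataFrame({
    "Parch": [0, 0, 0, 1, 1, 2],
    "Age": [30.0, 40.0, np.nan, 10.0, np.nan, np.nan],
})

# Mean age of each row's own Parch group, aligned with the original rows
group_mean = toy.groupby("Parch")["Age"].transform("mean")

# Fill from the group mean first, then from the overall mean as a fallback
toy["Age_filled"] = toy["Age"].fillna(group_mean).fillna(toy["Age"].mean())
print(toy)
```

The fallback matters when a group contains only missing ages, because its group mean is itself missing.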

In [None]:
#Embarked has missing values; look at the affected rows
train[train['Embarked'].isnull()]

#Both passengers with a missing Embarked value have Pclass=1 and Fare=80; compare Embarked against these values


In [None]:
sns.boxplot(x="Embarked", y="Fare", hue="Pclass", data=train)

For Pclass=1, the median fare is close to 80 for passengers who embarked at C, so the missing Embarked values are most likely C.
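
The same reasoning can be checked numerically instead of reading it off the plot: compute the median fare per port for first-class passengers and pick the port whose median is closest to the fare in question. A small sketch on toy data (the fares below are hypothetical, and `target_fare` and `closest_port` are names introduced here for illustration):

```python
import pandas as pd

# Toy first-class passengers (hypothetical fares)
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "S", "S", "S", "Q"],
    "Pclass": [1, 1, 1, 1, 1, 1, 1],
    "Fare": [70.0, 80.0, 90.0, 50.0, 55.0, 60.0, 90.0],
})

target_fare = 80.0  # fare paid by the passengers whose port is unknown

# Median fare per port, restricted to the same passenger class
median_fare = toy[toy["Pclass"] == 1].groupby("Embarked")["Fare"].median()

# Port whose median fare is closest to the target fare
closest_port = (median_fare - target_fare).abs().idxmin()
print(median_fare)
print("Closest port:", closest_port)
```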

In [None]:
#Fill Embarked with C
train["Embarked"] = train["Embarked"].fillna('C')

In [None]:
#Categorical to numerical Embarked in train/test
#fit the encoder on the training set only and reuse the same mapping for the test set
from sklearn.preprocessing import LabelEncoder
Enc=LabelEncoder()
train["Embarked"]=Enc.fit_transform(train["Embarked"])
test["Embarked"]=Enc.transform(test["Embarked"])
train.head()

In [None]:
test.head()
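
`LabelEncoder` assigns integers in sorted order of the categories it sees during fitting, so two encoders fitted on different data can give the same category different codes. The sketch below uses toy series (hypothetical values) to show the difference between fitting separately and fitting once on the training data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical): the test series happens to contain no "C"
toy_train = pd.Series(["C", "Q", "S", "S"])
toy_test = pd.Series(["Q", "S", "S"])

# Fitted separately: "Q" is encoded as 1 in train but as 0 in test
separate_train = LabelEncoder().fit_transform(toy_train)
separate_test = LabelEncoder().fit_transform(toy_test)

# Fitted once on train and reused: "Q" is 1 in both sets
enc = LabelEncoder().fit(toy_train)
shared_test = enc.transform(toy_test)

print("separate:", list(separate_test))
print("shared:  ", list(shared_test))
```

Note that `transform` raises an error for a category that was not present during fitting, which is a useful signal rather than a silent mismatch.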

In [None]:
#The Cabin feature is missing many values in both the training and test datasets
print(train["Cabin"].isnull().sum(),test["Cabin"].isnull().sum())
train["Cabin"].unique()

Feature engineering: create a new feature, Part, that contains only the cabin's partition (deck) letter, for example C23 is in partition C.


In [None]:
train['Part']=train['Cabin'].str[0]
test['Part']=test['Cabin'].str[0]
train.drop(['Cabin'],axis=1, inplace=True)
test.drop(['Cabin'],axis=1, inplace=True)
train.head()
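
Taking the first character with `.str[0]` keeps missing cabins missing, which is why `Part` still needs imputation afterwards. A short sketch on toy cabin values (hypothetical) illustrates this, including an entry that lists several cabins:

```python
import numpy as np
import pandas as pd

# Toy cabin values (hypothetical); one passenger has no cabin recorded
toy = pd.DataFrame({"Cabin": ["C23", "E46", np.nan, "B57 B59"]})

# .str[0] takes the first character and propagates missing values
toy["Part"] = toy["Cabin"].str[0]
print(toy)
```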

In [None]:
sns.boxplot(x="Part", y="Fare",  data=train)

In [None]:
train.isnull().sum()

In [None]:
train['Part'].unique()


In [None]:
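#Fill missing Part values with random choices among the observed letters
#the seed is arbitrary and only makes the result reproducible
rng = np.random.default_rng(0)
train["Part"] = train['Part'].fillna(pd.Series(rng.choice(['C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], size=len(train.index)), index=train.index))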


In [None]:
test['Part'].unique()

In [None]:
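#Fill missing Part values with random choices among the observed letters
#the seed is arbitrary and only makes the result reproducible
rng = np.random.default_rng(0)
test["Part"] = test['Part'].fillna(pd.Series(rng.choice([ 'B', 'E', 'A', 'C', 'D', 'F', 'G'], size=len(test.index)), index=test.index))
#Categorical to numerical Part: fit on the training set and reuse the mapping for the test set
train["Part"]=Enc.fit_transform(train["Part"])
test["Part"]=Enc.transform(test["Part"])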
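
Random imputation gives different results on every run unless the random generator is seeded, and `fillna` with a Series relies on matching index labels. The sketch below is one possible variant on toy data (hypothetical values): it draws only for the missing rows and assigns by mask, so it does not depend on the index being 0..n-1. The seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with a non-default index
toy = pd.DataFrame(
    {"Part": ["C", np.nan, "E", np.nan, np.nan]},
    index=[10, 11, 12, 13, 14],
)

rng = np.random.default_rng(0)  # fixed seed so the fill is reproducible

# Letters that actually occur in the column
observed = toy["Part"].dropna().unique().tolist()

# Draw one letter per missing row and assign through the boolean mask
missing = toy["Part"].isna()
toy.loc[missing, "Part"] = rng.choice(observed, size=int(missing.sum()))
print(toy)
```

Because most cabin values are missing, randomly assigned letters carry little real information, so treating missing cabins as their own category may be worth comparing.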

In [None]:
#convert Sex from categorical to numeric (male = 0, female = 1)
train['Sex'] = train['Sex'].map({'male': 0, 'female': 1})
test['Sex'] = test['Sex'].map({'male': 0, 'female': 1})

In [None]:
#Test Fare missing value 
test[test['Fare'].isnull()]

In [None]:
#Replace the missing Fare value with the median fare of the test passengers in 3rd class
#the mask must come from test itself, not from train
median_fare=test[(test['Pclass'] == 3)]['Fare'].median()
test["Fare"] = test["Fare"].fillna(median_fare)

#After exploring and analysing the data, we scale the features and split the training data into features and label


In [None]:
from sklearn import preprocessing


#convert Age from float to int (fractional ages are truncated)
train['Age'] = train['Age'].astype(int)
test['Age']    = test['Age'].astype(int)


#fit the scaler on the training set only, then apply the same scaling to both sets
std_scale = preprocessing.StandardScaler().fit(train[['Age', 'Fare']])
train[['Age', 'Fare']] = std_scale.transform(train[['Age', 'Fare']])
test[['Age', 'Fare']] = std_scale.transform(test[['Age', 'Fare']])
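
A scaler stores the mean and standard deviation of the data it was fitted on. Reusing the scaler fitted on the training set keeps both sets on the same scale, so a given age or fare maps to the same standardized value everywhere. A minimal sketch with toy arrays (hypothetical values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy [Age, Fare] rows (hypothetical values)
toy_train = np.array([[20.0, 10.0], [30.0, 20.0], [40.0, 30.0]])
toy_test = np.array([[30.0, 20.0], [50.0, 40.0]])

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(toy_train)

# Apply the same parameters to both sets
train_scaled = scaler.transform(toy_train)
test_scaled = scaler.transform(toy_test)

print(scaler.mean_)   # per-column training means
print(test_scaled)
```

Here the first test row equals the training mean, so it is mapped to zero in both columns.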

In [None]:
#train.drop(['Part'], axis=1, inplace=True)
#test.drop(['Part'], axis=1, inplace=True)
train.head()


In [None]:

test.head()


In [None]:
y_train = train["Survived"]
X_train = train.drop("Survived",axis=1)
X_test=test

In [None]:
X_train.shape , y_train.shape, X_test.shape

In [None]:
X_train.head()

#Logistic regression 


In [None]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train,y_train)
predictions = log_reg.predict(X_test)
#accuracy on the training data itself, so it is an optimistic estimate of test performance
LR_score= log_reg.score(X_train, y_train)
LR_score
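
A score computed on the rows the model was trained on tends to overstate how well it generalizes. One common way to get a more realistic estimate is to hold out part of the labelled data or to use cross-validation. The sketch below runs on synthetic data generated with NumPy (not the Titanic features); `X_toy`, `y_toy`, the 80/20 split, and the 5 folds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: the label depends on the first feature plus noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hold out 20% of the rows, keeping the class balance similar in both parts
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Fit on one part, score on the rows the model has not seen
model = LogisticRegression().fit(X_fit, y_fit)
val_accuracy = model.score(X_val, y_val)

# 5-fold cross-validation repeats this with different held-out parts
cv_scores = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=5)

print("Validation accuracy:", val_accuracy)
print("Cross-validation accuracy:", cv_scores.mean())
```

On the notebook's data the same calls would take `X_train` and `y_train` in place of the synthetic arrays.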

In [None]:
predictions


In [None]:
data_test

In [None]:

output = pd.DataFrame({'PassengerId': data_test['PassengerId'], 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)
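
Before uploading, it can be worth checking that the submission has the expected shape: one row per test passenger, the two columns `PassengerId` and `Survived`, and only 0/1 predictions. A small sketch on toy values (the IDs and predictions below are hypothetical stand-ins for `data_test['PassengerId']` and `predictions`):

```python
import pandas as pd

# Toy stand-ins (hypothetical) for the real IDs and predictions
toy_ids = pd.Series([892, 893, 894], name="PassengerId")
toy_predictions = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": toy_ids, "Survived": toy_predictions})

# Basic sanity checks on the submission table
assert list(submission.columns) == ["PassengerId", "Survived"]
assert len(submission) == len(toy_ids)
assert submission["Survived"].isin([0, 1]).all()
assert submission.notna().all().all()

# Without a path, to_csv returns the CSV text instead of writing a file
csv_text = submission.to_csv(index=False)
print(csv_text)
```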