<a id="Intro"></a>
<div class="alert alert-block alert-info">
<b>Introduction</b>
</div>

This is my first kernel on Kaggle. I chose the Titanic competition because it is a very good way to introduce feature engineering and classification models.

**Objective** : Predict whether a passenger survived the sinking of the Titanic.

![Titanic](https://pngimg.com/uploads/titanic/titanic_PNG6.png)

* **Content :**
1. [Introduction](#Intro)
2. [Importing Libraries](#ImportingLabreries)
3. [Loading the data](#LoadingData)
4. [Exploratory Data Analysis](#EDA)
5. [Data Pre-Processing](#Pre-Processing)
6. [Modeling](#Modling)
7. [Submission](#Submition)

<a id="ImportingLabreries"></a>
<div class="alert alert-block alert-info">
<b>Importing Libraries</b>
</div>

In [3]:
import numpy as np 
import pandas as pd
import os

#Data Visualization:
import matplotlib.pyplot as plt
import seaborn as sns

#Text Color
from termcolor import colored

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import warnings

warnings.filterwarnings("ignore")
%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 6)

<a id="LoadingData"></a>
<div class="alert alert-block alert-info">
<b>Loading the data</b>
</div>

In [4]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [5]:
train=pd.read_csv('/kaggle/input/titanic/train.csv')
train.head()

In [None]:
test=pd.read_csv('/kaggle/input/titanic/test.csv')
test.head()

In [None]:
print('train : {} , test : {}'.format(colored(train.shape,'blue'),colored(test.shape,'blue')))

In [None]:
gender_submission=pd.read_csv('/kaggle/input/titanic/gender_submission.csv')
gender_submission.head()

<a id="EDA"></a>

<div class="alert alert-block alert-info">
<b>Exploratory Data Analysis</b>
</div>

* **Survivors and gender :**

In [None]:
# Number of survivors by gender
sns.countplot(x='Survived',hue='Sex',data=train)
plt.title('#Survivors by gender')
plt.show()

**Women were more likely to survive than men.**

In [None]:
train.groupby(['Sex']).Survived.mean().plot.bar(color=['fuchsia', 'blue'])
plt.xlabel('Sex')
plt.ylabel('Survival rate')  # the group mean of Survived is a fraction between 0 and 1
plt.title('Sex vs Survived')
plt.show()

**In the training set, about 74% of women and 19% of men survived the Titanic.**

* **Passenger Age :**

In [None]:
sns.violinplot(x='Survived',y='Age',data=train)
plt.title('Survived VS Age')
plt.show()

In [None]:
# Distribution of Age
sns.displot(x='Age',data=train,color='r',bins=45)
plt.show()

* **Passenger Class (Pclass) :**

In [None]:
train.groupby(['Pclass']).Survived.agg(['count','mean'])

In [None]:
sns.countplot(x='Survived',hue='Pclass',data=train)
plt.show()

**About 63% of first-class passengers survived.**

Traveling in a higher class increases the probability of survival.
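The survival rate within a class and a class's share of all survivors are different quantities, and they are easy to mix up when reading a `groupby` table. Here is a minimal sketch on a few made-up rows (the `toy` frame is hypothetical, not the Titanic data) that uses `pd.crosstab` with `normalize` to compute both.

```python
import pandas as pd

# Hypothetical toy rows, not the real Titanic data.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0, 0, 0],
})

# Survival rate within each class: every row sums to 1.
rate_within_class = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")

# Share of each class among survivors and non-survivors: every column sums to 1.
share_of_outcome = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="columns")

print(rate_within_class)
print(share_of_outcome)
```

In this toy table two of the three first-class passengers survive, but first class accounts for only half of all survivors, so the two numbers answer different questions.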

* **Port of Embarkation :**

* C = Cherbourg
* Q = Queenstown
* S = Southampton

In [None]:
train.Embarked.value_counts().to_frame()

In [None]:
sns.countplot(x='Survived',hue='Embarked',data=train)
plt.title('#Survivors by port of embarkation')
plt.show()

In [None]:
train.groupby(['Embarked']).Survived.mean().plot.barh(color=['Red','green','blue' ])
plt.xlabel('Survival rate')
plt.show()

* **Family (SibSp and Parch):**

In [None]:
train.groupby(['SibSp']).Survived.agg(['count','mean'])

In [None]:
train.groupby(['Parch']).Survived.agg(['count','mean'])

In [None]:
train.groupby(['SibSp','Parch']).Survived.agg(['count','mean'])

* **Passenger Fare :**

In [None]:
#Fare Distribution:

train.Fare.hist(bins=70,color='b')
plt.xlabel('Fare')
plt.show()

* **Cardinality:**

In [None]:
for col in ['Name','Ticket','Cabin']:
    print('Cardinality of {} is : {} '.format(colored(col,'green'),colored(len(train[col].unique()),'blue')))

**We can see that the cardinality of Name, Ticket and Cabin is high.**
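One detail worth keeping in mind when counting distinct values: `unique()` keeps `NaN` as a value of its own, while `nunique()` skips it by default. A minimal sketch on a made-up frame (the `toy` data is hypothetical) shows the difference for a column with missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the real Titanic data.
toy = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2", "C3"],
    "Cabin":  ["C85", np.nan, np.nan, "E46"],
})

# len(unique()) counts NaN as one extra distinct value.
with_nan = {col: len(toy[col].unique()) for col in toy.columns}

# nunique() drops NaN by default (dropna=True).
without_nan = toy.nunique().to_dict()

print(with_nan)
print(without_nan)
```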

* **Correlation**

In [None]:
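# numeric_only=True skips text columns such as Name and Sex (required on pandas >= 2.0)
heatmap=sns.heatmap(train.corr(numeric_only=True),annot=True,cmap='coolwarm')

In [None]:
train.corr(numeric_only=True)[['Survived']].T

Correlations between the target variable and the numerical variables aren't high.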

<a id="Pre-Processing"></a>

<div class="alert alert-block alert-info">
<b>Data Pre-Processing</b>
</div>


Let's combine the train and test sets to make preprocessing easier.

In [None]:
ntrain=train.shape[0] # will be used to split combined data set

data=pd.concat((train,test)).reset_index(drop=True)
print('The shape of the combined dataframe is:', colored(data.shape,'blue'))

In [None]:
data.info()

In [None]:
train.describe()

#### Handling Missing values :

In [None]:
#Check for missing values :
train.isnull().sum()

In [None]:
# Plot the missing-value count per column in the train set, annotated with percentages
ax = train.isna().sum().sort_values().plot(kind = 'barh', figsize = (8, 7),color='b')
plt.title('Percentage of Missing Values Per Column in Train Set', fontdict={'size':15})
for p in ax.patches:
    percentage ='{:,.0f}%'.format((p.get_width()/train.shape[0])*100)
    width, height =p.get_width(),p.get_height()
    x=p.get_x()+width+0.02
    y=p.get_y()+height/2
    ax.annotate(percentage,(x,y))

In [None]:
test.isnull().sum()

In [None]:
# Plot the missing-value count per column in the test set, annotated with percentages
ax = test.isna().sum().sort_values().plot(kind = 'barh', figsize = (8, 7),color='b')
plt.title('Percentage of Missing Values Per Column in Test Set', fontdict={'size':15})
for p in ax.patches:
    percentage ='{:,.0f}%'.format((p.get_width()/test.shape[0])*100)
    width, height =p.get_width(),p.get_height()
    x=p.get_x()+width+0.02
    y=p.get_y()+height/2
    ax.annotate(percentage,(x,y))

***The Cabin column has the most missing values in both datasets.***
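The same percentages can also be read as numbers rather than from a plot. A minimal sketch, assuming a made-up frame (`toy` is hypothetical, not the Titanic data): the mean of `isna()` is the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the real Titanic data.
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, np.nan, np.nan, "C85"],
    "Fare":  [7.25, 71.28, 8.05, 53.1],
})

# isna() yields booleans; their column mean is the fraction of missing entries.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Applied to `train` or `test`, each frame is divided by its own length, which avoids mixing up the denominators.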

*  **Fill in missing values :**

In [None]:
# Save the locations of the NaN values before filling them:
cabin_nan=np.where(data.Cabin.isnull(),1,0)
age_nan=np.where(data.Age.isnull(),1,0)

* For the Embarked and Cabin variables, I use the most frequent value.
* For Age and Fare, I use the mean value.

In [None]:
data['Cabin']=data['Cabin'].fillna(data.Cabin.value_counts().index[0])
data['Embarked']=data['Embarked'].fillna(data.Embarked.value_counts().index[0])
data['Age']=data['Age'].fillna(data['Age'].mean())
data['Fare']=data['Fare'].fillna(data['Fare'].mean())
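The fill strategy above can be checked on a small example. This is a sketch on a made-up frame (`toy` and its values are hypothetical): it records where `Age` was missing, then fills the categorical column with its most frequent value and the numeric column with its mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the real Titanic data.
toy = pd.DataFrame({
    "Embarked": ["S", "S", np.nan, "C"],
    "Age":      [20.0, np.nan, 40.0, 30.0],
})

# Record where Age was missing before the gaps are filled.
toy["AgeNan"] = toy["Age"].isna().astype(int)

# value_counts() ignores NaN and sorts by frequency, so index[0] is the most frequent value.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].value_counts().index[0])

# mean() skips NaN, so the fill value is the mean of the observed ages.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

print(toy)
```

The indicator has to be computed before the fill; afterwards the filled rows can no longer be told apart from observed ones.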

### Feature Engineering

These two columns record where the Cabin and Age values were originally missing.

In [None]:
data['CabinNan']=cabin_nan
data['AgeNan']=age_nan

Here I create a new feature from the first letter of each Cabin value in the dataset.

In [None]:
data['cabinLetter']=data.Cabin.apply(lambda x:x[0])
data['cabinLetter'].value_counts()

In [None]:
data['familySize']=data.SibSp+data.Parch+1

In [None]:
# factorplot was renamed to catplot in seaborn 0.9; kind="point" matches its old default
sns.catplot(x="familySize",y="Survived",data = data,kind="point").set_ylabels('Survived Probability')
plt.show()

In [None]:
data['Alone']=[1 if Fsize==1 else 0 for Fsize in data['familySize']]
data['withFamily']=[1 if Fsize>=2 else 0 for Fsize in  data['familySize']]

* **Name :**

In [None]:
data.Name.sample(10)

**Maybe we can extract the title from each name ;)**

For that we will use a regular expression: the title always starts with one capital letter, followed by lowercase letters, and ends with a period.

In [None]:
import re
# Raw string so that \. is passed to the regex engine as a literal period
data["NameTitle"] = data.Name.apply(lambda x:re.search(r' ([A-Z][a-z]+)\. ',x).group(1))
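To see what the pattern captures, here is a minimal sketch on a few strings in the dataset's `Name` format. The helper `extract_title` is a hypothetical name introduced for illustration, and returning `None` when nothing matches is an assumption about how one might handle unexpected names.

```python
import re

# Space, one capital letter, one or more lowercase letters, a period, then a space.
TITLE_RE = re.compile(r" ([A-Z][a-z]+)\. ")

def extract_title(name):
    """Return the captured title, or None when the pattern does not match."""
    match = TITLE_RE.search(name)
    return match.group(1) if match else None

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "No title here",
]
titles = [extract_title(n) for n in names]
print(titles)
```

The lowercase "the" before "Countess" is skipped because the pattern requires a capital first letter.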

In [None]:
data.NameTitle.value_counts()

In [None]:
data[data.NameTitle=='Dr'][['Sex','Age','NameTitle']]

There is just one female passenger with Dr as her title, and she is 49 years old, so we can change her title to Mrs.

In [None]:
data.loc[[796],['NameTitle']]='Mrs' 
data.loc[[796]]

In [None]:
# Col, Major, Jonkheer and Capt:
data[(data.NameTitle=='Col')|(data.NameTitle=='Major')|(data.NameTitle=='Jonkheer')|(data.NameTitle=='Capt')][['Sex','NameTitle']]

All Majors, Colonels, Captains and Jonkheers are male, so we can use Mr as their title.

In [None]:
data[data['NameTitle']=='Ms']

In [None]:
data['NameTitle']=data['NameTitle'].replace(['Lady','Mme','Dona','Countess'],'Mrs')
data['NameTitle']=data['NameTitle'].replace(['Mlle','Ms'],'Miss')
data['NameTitle']=data['NameTitle'].replace(['Rev','Dr','Major','Don','Capt','Col','Sir','Jonkheer'],'Mr')

In [None]:
# catplot replaces the removed factorplot; its size argument is now called height
sns.catplot(x="Survived", y ="NameTitle", data=data, kind="bar", height=6)
plt.show()

In [None]:
def Create_Cat(col,q):
    # Bin the requested column into q quantile-based categories (0 .. q-1)
    return pd.qcut(data[col],  q=q, labels=False)
    
data['Age_Cat']=Create_Cat('Age',4)
data['Fare_Cat']=Create_Cat('Fare',4)
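Quantile binning with `pd.qcut` puts roughly the same number of rows in each bin, and `labels=False` returns the bin index instead of an interval. A minimal sketch with made-up values (the `values` series is hypothetical):

```python
import pandas as pd

# Hypothetical values, not the real Fare column.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], name="Fare")

# Four quantile bins; labels=False yields integer codes 0..3.
bins = pd.qcut(values, q=4, labels=False)

# Each quartile receives the same number of rows here.
counts = bins.value_counts().sort_index()
print(bins.tolist())
print(counts.to_dict())
```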

### Label Encoding

In [None]:
numericalFeatures = [feature for feature in data.columns if data[feature].dtypes!='O']
CategoricalFeatures=[feature for feature in data.columns if data[feature].dtype=='O']

In [None]:
from sklearn.preprocessing import LabelEncoder
for col in CategoricalFeatures:
    data[col]=LabelEncoder().fit_transform(data[col])

<a id="Modling"></a>

<div class="alert alert-block alert-info">
<b>Modeling</b>
</div>


* **Split the data**:

In [None]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
data[['Age','Fare']]=sc.fit_transform(data[['Age','Fare']])

In [None]:
# Separate train and test data from the combined dataframe
train_df=data[:ntrain]
test_df=data[ntrain:].drop(['Survived'],axis=1)

# Check the shapes of the split dataset
train_df.shape, test_df.shape

In [None]:
data.head()

In [None]:
# Separate the features and the target variable:
drop_cols=['Survived','PassengerId','Name','Ticket','CabinNan','AgeNan']
X=train_df.drop(drop_cols,axis=1)
y=train_df.Survived.astype(int)
# Split data into train and test sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=44)

* **Logistic Regression :**

In [None]:
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
lr.fit(X_train,y_train)
y_predict= lr.predict(X_test)
print('score =',lr.score(X_test,y_test))
print ("Accuracy = %.2f" % (accuracy_score(y_test,y_predict)))

* **CatBoost :**

In [None]:
from catboost import CatBoostClassifier

cb = CatBoostClassifier(iterations=1000,
                           depth=3,
                           learning_rate=0.002,
                           loss_function='Logloss',
                           verbose=False)
cb.fit(X_train,y_train)



y_predict= cb.predict(X_test)
print('score =',cb.score(X_test,y_test))
print ("Accuracy = %.2f" % (accuracy_score(y_test,y_predict)))

* **XGBoost classifier :**

In [None]:
from xgboost.sklearn import XGBClassifier
XgC = XGBClassifier(learning_rate=0.001, n_estimators=3000,
                    max_depth=2, min_child_weight=0,
                    subsample=0.5,
                    colsample_bytree=0.5,
                    scale_pos_weight=1,
                    random_state=44,  # the scikit-learn wrapper's name for the seed
                    reg_alpha=0.001)
XgC.fit(X_train,y_train)

y_predict= XgC.predict(X_test)
print('score =',XgC.score(X_test,y_test))
print ("Accuracy = %.2f" % (accuracy_score(y_test,y_predict)))

* **Random Forest :**

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)

y_predict= clf.predict(X_test)
print ("Accuracy = %.2f" % (accuracy_score(y_test,y_predict)))

<a id="Submition"></a>
<div class="alert alert-block alert-info">
<b>Submission</b>
</div>

In [None]:
X_Submission=test_df.drop(drop_cols[1:],axis=1)
predictions = cb.predict(X_Submission)

# Generate Submission
output = pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':predictions})
output.to_csv('gender_submission.csv', index=False)
print("Submission successfully saved")

**Thanks for your time :)**