# **April Tabular Playground Series - Your Baseline Model**

It is extremely important to start somewhere and treat that starting point as your first standard of comparison for the progress you make. This notebook helps you build a baseline model and get a baseline score. Once you get the hang of this, try to improve your score from here by adding more features, trying more classification models, tuning the parameters of the models already given, and so on.

Shall we get started then?

Alright, in case you forgot, these are the variables in the dataset:
### **Variable Notes**

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

# **Importing Libraries**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import KNNImputer

import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import StackingClassifier

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Loading the Data**

In [None]:
data = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/train.csv')
data.head()

In [None]:
data.info()

In [None]:
data.shape[0]

# **Deciding What to Impute, What to Drop, and How**

**ft. NaN Values**

In [None]:
for i in data.columns:
    print(i,":",data[str(i)].isnull().sum())

Alright, so let's list the numerical columns that have missing values:

1. Age
2. Fare

The remaining columns with missing values are not numerical.

Let's explore the distribution of these two columns a bit before we replace their NaN values.

# **Studying Distribution of Age and Fare**

**What are we looking for?**

We are looking for a suitable statistic for imputation. We will choose the median if the data contains many extreme values, because outliers pull the mean and the range but barely affect the median and the IQR. Otherwise we will use the mean, or, if neither fits, some other tool.
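
To see why outliers push us toward the median, here is a minimal sketch on made-up numbers (the `fares` array is hypothetical, not taken from the dataset). A couple of very large values drag the mean far above the typical fare, while the median stays put.

```python
import numpy as np

# Hypothetical fares: mostly small values plus two very large ones.
fares = np.array([7.0, 8.0, 10.0, 12.0, 15.0, 250.0, 500.0])

mean_fare = fares.mean()        # pulled upward by the two large fares
median_fare = np.median(fares)  # middle value, insensitive to the size of the tail

print(f"mean   = {mean_fare:.2f}")
print(f"median = {median_fare:.2f}")
```

Here the median is one of the typical fares, whereas the mean lands well above every fare except the two outliers.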


## **Fare**

In [None]:
figure , axes = plt.subplots(ncols = 3 , figsize = (17,4) , dpi = 100)

sns.boxplot(x = 'Fare' , data = data , color = 'red' , ax = axes[0])
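# histplot with kde=True replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(x = data['Fare'] , kde = True , stat = 'density' , ax = axes[1])
sns.histplot(x = data['Fare'].apply(np.log) , kde = True , stat = 'density' , ax = axes[2])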

for i in range(3):
    axes[i].set_ylabel('')
    axes[i].set_xlabel('Fare')
    axes[i].tick_params(axis='x', labelsize=10)
    axes[i].tick_params(axis='y', labelsize=10)

axes[0].set_title('Descriptive Statistics For Fare', fontsize=13)
axes[1].set_title('Distribution (Fare)', fontsize=13)
axes[2].set_title('Logarithmic Distribution (Fare)', fontsize=13)

plt.show()

From the boxplot, we can see that anything above 100 looks like an outlier, and that there are a lot of outliers. The median seems to lie somewhere between 0 and 100.

The distribution plot (for now we are assuming a roughly normal distribution) suggests that the data is right-skewed. We could also deduce this from the boxplot, as the data is concentrated between 0 and 100 and the rest is a long tail of large outliers.

Upon taking a closer look using the log function, we can see that my assumption that Fare is normally distributed was wrong. The logarithmic distribution clearly shows three different peaks. This also matches the fact that our data has three classes, so one can assume that the three peaks result from the price distribution across those classes.
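
If you want a number to go with the plots, pandas can compute the sample skewness directly. The sketch below uses synthetic log-normal values as a stand-in for a fare-like column (`fare_like` is made up for illustration, not the real data): a long tail of large values shows up as clearly positive skewness, and taking the log pulls it back toward 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for a fare-like column: log-normal values have a long right tail.
fare_like = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=10_000))

skew_raw = fare_like.skew()          # positive => tail on the right (large values)
skew_log = np.log(fare_like).skew()  # near 0 once the tail is compressed by the log

print(f"skew (raw) = {skew_raw:.2f}")
print(f"skew (log) = {skew_log:.2f}")
```

On the real data you could run the same two lines on `data['Fare']`; note that skewness says nothing about the number of peaks, so the plots are still needed for that.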

Would the mean be an accurate choice for imputation here, or would the median be better?

## **Let's explore the fares of each class!**

**Assumption**: The distribution is multimodal only because of the different classes.

In [None]:
firstclass = data[data['Pclass'] == 1]
secondclass = data[data['Pclass'] == 2]
thirdclass = data[data['Pclass'] == 3]

## **First Class Fare**

In [None]:
figure , axes = plt.subplots(ncols = 3 , figsize = (17,4) , dpi = 100)

sns.boxplot(x = firstclass['Fare'] , color = 'red' , ax = axes[0])
sns.histplot(x = firstclass['Fare'] , kde = True , stat = 'density' , ax = axes[1])
sns.histplot(x = firstclass['Fare'].apply(np.log) , kde = True , stat = 'density' , ax = axes[2])

for i in range(3):
    axes[i].set_ylabel('')
    axes[i].set_xlabel('Fare')
    axes[i].tick_params(axis='x', labelsize=10)
    axes[i].tick_params(axis='y', labelsize=10)

axes[0].set_title('Descriptive Statistics For First Class Fare', fontsize=13)
axes[1].set_title('Distribution (Fare) For First Class Fare', fontsize=13)
axes[2].set_title('Logarithmic Distribution (Fare) For First Class Fare', fontsize=13)

plt.show()

## **Second Class Fare**

In [None]:
figure , axes = plt.subplots(ncols = 3 , figsize = (17,4) , dpi = 100)

sns.boxplot(x = secondclass['Fare'] , color = 'blue' , ax = axes[0])
sns.histplot(x = secondclass['Fare'] , kde = True , stat = 'density' , ax = axes[1])
sns.histplot(x = secondclass['Fare'].apply(np.log) , kde = True , stat = 'density' , ax = axes[2])

for i in range(3):
    axes[i].set_ylabel('')
    axes[i].set_xlabel('Fare')
    axes[i].tick_params(axis='x', labelsize=10)
    axes[i].tick_params(axis='y', labelsize=10)

axes[0].set_title('Descriptive Statistics For Second Class Fare', fontsize=13)
axes[1].set_title('Distribution (Fare) For Second Class Fare', fontsize=13)
axes[2].set_title('Logarithmic Distribution (Fare) For Second Class Fare', fontsize=13)

plt.show()

## **Third Class Fare**

In [None]:
figure , axes = plt.subplots(ncols = 3 , figsize = (17,4) , dpi = 100)

sns.boxplot(x = thirdclass['Fare'] , color = '#42e3bb' , ax = axes[0])
sns.histplot(x = thirdclass['Fare'] , kde = True , stat = 'density' , ax = axes[1])
sns.histplot(x = thirdclass['Fare'].apply(np.log) , kde = True , stat = 'density' , ax = axes[2])

for i in range(3):
    axes[i].set_ylabel('')
    axes[i].set_xlabel('Fare')
    axes[i].tick_params(axis='x', labelsize=10)
    axes[i].tick_params(axis='y', labelsize=10)

axes[0].set_title('Descriptive Statistics For Third Class Fare', fontsize=13)
axes[1].set_title('Distribution (Fare) For Third Class Fare', fontsize=13)
axes[2].set_title('Logarithmic Distribution (Fare) For Third Class Fare', fontsize=13)

plt.show()

Well, we can see that my assumption was wrong once again. It seems that tickets for the Titanic did not have a fixed price within any particular class.

This means we will have to scale the data and then impute it with its mean.

In [None]:
data['Fare'] = (data['Fare'] - data['Fare'].mean()) / data['Fare'].std()
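
As a quick sanity check on what this scaling does, the sketch below uses a toy column (`fare_demo` holds made-up values, not the real data). It assumes the same formula as above and shows that, after standardization, imputing with the mean amounts to filling the gaps with roughly 0.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (values are made up for illustration).
fare_demo = pd.Series([10.0, 20.0, 30.0, np.nan, 40.0])

# pandas skips NaN in mean() and std() by default, so the missing entry stays NaN.
fare_scaled = (fare_demo - fare_demo.mean()) / fare_demo.std()

# The standardized column has mean ~0, so mean-imputation fills the gap with ~0.
fare_filled = fare_scaled.fillna(fare_scaled.mean())

print(fare_filled.round(3).tolist())
print("missing after fill:", fare_filled.isnull().sum())
```

In other words, scaling first and imputing with the mean afterwards places every missing fare exactly at the centre of the scaled distribution.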

## **Age**

In [None]:
figure , axes = plt.subplots(ncols = 2 , figsize = (17,4) , dpi = 100)

sns.boxplot(x = 'Age' , data = data , color = 'pink' , ax = axes[0])
sns.histplot(x = data['Age'] , kde = True , stat = 'density' , ax = axes[1])

for i in range(2):
    axes[i].set_ylabel('')
    axes[i].set_xlabel('Age')
    axes[i].tick_params(axis='x', labelsize=10)
    axes[i].tick_params(axis='y', labelsize=10)

axes[0].set_title('Descriptive Statistics For Age', fontsize=13)
axes[1].set_title('Distribution (Age)', fontsize=13)

plt.show()

I think going with the median for Age will do the job for us.

In [None]:
data['Age'].describe()

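We will also be imputing the Embarked values using the KNN imputer. Note that the imputer is given only the Embarked column, so it has no other features to measure distances on; in that case scikit-learn's `KNNImputer` falls back to the column mean.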

In [None]:
data['Embarked'].value_counts()

In [None]:
embarked = {'S' : 0, 'C': 1, 'Q':2}
data['Embarked'] = data['Embarked'].map(embarked)

**Imputing Values**

In [None]:
impute = KNNImputer()
# fit_transform expects a 2-D input, so pass a one-column DataFrame and flatten the result
data['Embarked'] = impute.fit_transform(data[['Embarked']]).ravel()
# Age: fill gaps with the median; Fare (already standardized): fill gaps with the mean
data['Age'] = data['Age'].fillna(data['Age'].median())
data['Fare'] = data['Fare'].fillna(data['Fare'].mean())
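
It is worth checking what `KNNImputer` actually does when it only sees one column. The sketch below runs it on a toy encoded column (`embarked_codes` is made up for illustration). With no other feature available, the distance from a missing row to every other row is undefined, and scikit-learn then uses the column average instead of neighbours.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy single column using the same encoding idea (S=0, C=1, Q=2), one value missing.
embarked_codes = np.array([[0.0], [0.0], [1.0], [2.0], [np.nan]])

imputed_codes = KNNImputer().fit_transform(embarked_codes)

# The gap is filled with the mean of the observed values, not with a neighbour's code.
print(imputed_codes.ravel())
print("column mean of observed values:", np.nanmean(embarked_codes))
```

The filled value is therefore generally not one of the codes 0, 1 or 2. If you want neighbour-based imputation, one option is to pass several numeric columns to the imputer together; for a categorical column, filling with the most frequent value is another common choice.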

In [None]:
for i in data.columns:
    print(i,":",data[str(i)].isnull().sum())

# **Transformation and Correlations With Target Variable**

For now we are not using the variables ticket and cabin. You may use them in the future to increase the accuracy and draw more insights from the data.

In [None]:
sex = {'male':0,'female':1}
data['Sex'] = data['Sex'].map(sex)

In [None]:
# Note the naming: `train` holds the features and `test` holds the target (Survived)
train = data.drop(columns = ['Name','Ticket','Cabin','Survived'])
test = data['Survived']

In [None]:
train.corrwith(test).plot.bar(figsize=(15,10),title="Correlation with response variable",fontsize=15,rot=90, color = 'red', grid=True )

So, from this we can see that the variables SibSp, Parch, and PassengerId have very small correlation coefficients compared to the others. We will be dropping these variables.

**P.S.:**

Please remember that correlation plots can only show linear relationships. The correlation will be high only if a variable has a roughly straight-line relationship with the target variable (the values are based on the Pearson correlation coefficient).
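
To make that caveat concrete, here is a small sketch on synthetic data (the series `x`, `linear` and `quadratic` are made up for illustration). Both derived series are completely determined by `x`, but only the straight-line one gets a high Pearson correlation.

```python
import numpy as np
import pandas as pd

# Evenly spaced values, symmetric around 0.
x = pd.Series(np.linspace(-1.0, 1.0, 201))

linear = 3 * x + 1   # straight-line relationship with x
quadratic = x ** 2   # fully determined by x, but not linear

corr_linear = x.corr(linear)        # Pearson by default
corr_quadratic = x.corr(quadratic)

print(f"corr(x, linear)    = {corr_linear:.3f}")
print(f"corr(x, quadratic) = {corr_quadratic:.3f}")
```

So a near-zero bar in the correlation plot does not prove that a variable is useless; it only says there is no linear trend.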

In [None]:
train.drop(['PassengerId','SibSp','Parch'],axis=1,inplace=True)
train.columns

In [None]:
train.head()

Let's scale Age as well so that all variables are in nearly the same range.

In [None]:
train['Age'] = (train['Age'] - train['Age'].mean()) / train['Age'].std()
train.head()

# **ML Models Implementation**

We will first split the data into training and validation sets and then fit the models.

In [None]:
x_train,x_test,y_train,y_test=train_test_split(train,test,test_size=0.2)
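
The split above is random on every run, so the scores will vary slightly between runs. If reproducibility matters, one option is to pass `random_state`, and `stratify` keeps the class balance the same in both parts. A minimal sketch on synthetic data (the `X_demo` and `y_demo` arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))             # synthetic features
y_demo = (rng.random(1000) < 0.4).astype(int)   # synthetic binary target

# random_state fixes the shuffle; stratify preserves the share of each class.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_val.shape)
print(f"positive rate: train={y_tr.mean():.3f}, validation={y_val.mean():.3f}")
```

In this notebook the equivalent call would pass `train` and `test` as the first two arguments, with `stratify=test`.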

In [None]:
lr=LogisticRegression()
lr.fit(x_train,y_train)
y_pred=lr.predict(x_test)
print(classification_report(y_test,y_pred))

In [None]:
clf = DecisionTreeClassifier()
clf.fit(x_train,y_train)
y_pred=clf.predict(x_test)
print(classification_report(y_test,y_pred))

In [None]:
knn=KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train,y_train)
y_pred=knn.predict(x_test)
print(classification_report(y_test,y_pred))

In [None]:
random_forest = RandomForestClassifier()
random_forest.fit(x_train, y_train)
y_pred = random_forest.predict(x_test)
print(classification_report(y_test,y_pred))

In [None]:
estimators = [('knn',KNeighborsClassifier()),('lr',LogisticRegression()),('dtr',DecisionTreeClassifier()),('rf',random_forest)]
clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
clf.fit(x_train,y_train)
y_pred=clf.predict(x_test)
print(classification_report(y_test,y_pred))

I think we can try the Stacking Classifier and Logistic Regression on the test data.
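
That choice rests on a single random split. If you would like a steadier comparison, cross-validation averages the score over several splits. The sketch below is only an illustration on synthetic data from `make_classification` (the names `X_syn`, `y_syn` and `candidates` are hypothetical, and only two of the models are included to keep it quick):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem standing in for the real features.
X_syn, y_syn = make_classification(n_samples=500, n_features=5, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Mean accuracy over 5 folds for each candidate model.
cv_accuracy = {
    name: cross_val_score(model, X_syn, y_syn, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in cv_accuracy.items():
    print(f"{name}: {score:.3f}")
```

On the real data you would pass `train` and `test` in place of `X_syn` and `y_syn`, and could add the other models to the dictionary.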

# **Implementing it on Test Data**

In [None]:
test_data = pd.read_csv("/kaggle/input/tabular-playground-series-apr-2021/test.csv")
for i in test_data.columns:
    print(i,":",test_data[str(i)].isnull().sum())

In [None]:
# Age: fill missing values with the median, then standardize
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].median())
test_data['Age'] = (test_data['Age'] - test_data['Age'].mean()) / test_data['Age'].std()
# Fare: standardize, then fill missing values with the mean (same order as for the training data)
test_data['Fare'] = (test_data['Fare'] - test_data['Fare'].mean()) / test_data['Fare'].std()
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())
# Embarked and Sex: reuse the mappings defined for the training data
test_data['Embarked'] = test_data['Embarked'].map(embarked)
test_data['Embarked'] = impute.fit_transform(test_data[['Embarked']]).ravel()
test_data['Sex'] = test_data['Sex'].map(sex)
# Keep the ids for the submission file, then drop the columns the model was not trained on
passengers = test_data['PassengerId']
test_data.drop(columns = ['PassengerId','Name','SibSp','Cabin','Ticket','Parch'], inplace = True)
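
The cell above computes the median, mean and standard deviation from the test set itself. A common alternative is to learn those statistics on the training data and reuse them on the test data, so both sets go through exactly the same transformation. The sketch below shows the idea on toy values; `fit_standardizer` and `apply_standardizer` are hypothetical helpers written for this example, not part of any library.

```python
import numpy as np
import pandas as pd

def fit_standardizer(column: pd.Series) -> dict:
    """Learn the median, mean and std from a training column (hypothetical helper)."""
    median = column.median()
    filled = column.fillna(median)
    return {"median": median, "mean": filled.mean(), "std": filled.std()}

def apply_standardizer(column: pd.Series, stats: dict) -> pd.Series:
    """Impute and scale a column using previously learned statistics."""
    return (column.fillna(stats["median"]) - stats["mean"]) / stats["std"]

# Toy values, made up for illustration.
train_age = pd.Series([20.0, 30.0, np.nan, 40.0, 50.0])
test_age = pd.Series([25.0, np.nan, 60.0])

age_stats = fit_standardizer(train_age)                    # learned on training data only
train_age_scaled = apply_standardizer(train_age, age_stats)
test_age_scaled = apply_standardizer(test_age, age_stats)  # reused on the test data

print(test_age_scaled.round(3).tolist())
```

With this approach a given age maps to the same scaled value in both sets, which is what a model fitted on the training data expects.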


In [None]:
ans = lr.predict(test_data)
pd.DataFrame({'PassengerId' : passengers , 'Survived': ans}).to_csv("my_submission.csv",index=False)

# **That's all Folks!**

Liked my work? Give an upvote!
Have some suggestions? Leave a comment and I'll get back to you ASAP.