# Lecture 2
## Introduction to Sklearn
### Custom transformers in sklearn

<ol>
<li> Data used: Titanic data set ( https://www.kaggle.com/c/titanic ) with slight adaptations
<li> Notebook goal: Learn how to construct a custom transformer in sklearn.
<li> Extra Exercise: Yes, coding.
</ol>

In [8]:
#Necessary Imports
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

We start by loading the data set we will use in this notebook. This is a slight variation of the famous Titanic data set.

In [9]:
df = pd.read_csv('Data/Titanic.csv')
df.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Yes | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 1 | 4 | Yes | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 2 | 7 | No | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 3 | 11 | Yes | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7 | G6 | S |
| 4 | 12 | Yes | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.55 | C103 | S |


We see that our data set contains a mix of categorical variables and numerical variables. 
Suppose we want to
<ol>
<li> Drop the columns we are not interested in: PassengerId, Name, Ticket, Cabin, Embarked, Parch
<li> One-hot encode the remaining categorical columns (Sex, SibSp, Pclass)
<li> Standardize the remaining numerical columns (Age, Fare)
</ol>

We want to build a ColumnTransformer object that does this in one go. 

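A ColumnTransformer takes a list of named steps, where each step is a tuple of the form (Chosen_name_step, Method_used_in_step, Columns_in_step). The method is either a transformer object or one of the strings 'drop' and 'passthrough'. Columns that are not listed in any step are dropped by default (remainder='drop').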

In [10]:
t = ColumnTransformer(
    [
    ('drop_columns', 'drop', ['PassengerId','Name','Ticket','Cabin','Embarked','Parch']),
    ('NameActuallyDoesntMatter', OneHotEncoder(handle_unknown='ignore'), ['Sex','SibSp','Pclass']),
    ('StandardScaler',StandardScaler(), ['Age','Fare'])
    ]
)
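To see what such a transformer produces, here is a minimal sketch on a made-up frame. The column names `id`, `sex`, `age` and the variable `toy_transformer` are hypothetical and not part of the Titanic data; `sparse_threshold=0` is set only so that the result is a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame; names and values are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "sex": ["female", "male", "female", "male"],
    "age": [20.0, 30.0, 40.0, 50.0],
})

toy_transformer = ColumnTransformer(
    [
        ("drop_id", "drop", ["id"]),                                  # (name, 'drop', columns)
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["sex"]),  # (name, transformer, columns)
        ("scale", StandardScaler(), ["age"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

out = toy_transformer.fit_transform(toy)
print(out.shape)  # two one-hot columns for "sex" plus one scaled "age" column
print(out)
```

The output columns follow the order of the steps: first the one-hot columns, then the scaled column. The dropped `id` column does not appear.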

Like most sklearn transformers, the resulting object has the methods fit, transform and fit_transform. 
Let us apply logistic regression to this data set, after preprocessing it with the column transformer.

In [11]:
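# Start by defining the target variable and the feature matrix.

y = df['Survived']
X = df.drop(columns=['Survived'])
# No random_state is set, so the split (and the accuracy below) varies between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the transformer on the training data only, then fit the model.
X_train = t.fit_transform(X_train)
lr = LogisticRegression()
lr.fit(X_train, y_train)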

Let us now use this fitted model to predict the survival of the passengers in the test data set. 
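Notice that the test data needs to be transformed as well, __using the transformer fitted on the training data__: we call transform, not fit_transform, so that no information from the test set leaks into the preprocessing.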
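The following sketch illustrates what reusing a fitted transformer means in practice. The two small frames and the category `"unknown"` are made up for illustration. The scaler keeps the mean learned from the training rows, and an encoder created with `handle_unknown='ignore'` maps a category it has not seen to a row of zeros.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical train/test frames; values are made up for illustration.
train = pd.DataFrame({"Sex": ["female", "male", "male"], "Age": [20.0, 30.0, 40.0]})
test = pd.DataFrame({"Sex": ["male", "unknown"], "Age": [30.0, 50.0]})

# The scaler learns its mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(train[["Age"]])
test_scaled = scaler.transform(test[["Age"]])
print(scaler.mean_)   # mean of the training ages
print(test_scaled)    # test ages expressed relative to the training statistics

# The encoder learns its categories from the training rows only.
encoder = OneHotEncoder(handle_unknown="ignore").fit(train[["Sex"]])
test_encoded = encoder.transform(test[["Sex"]]).toarray()
print(encoder.categories_)
print(test_encoded)   # the unseen category becomes an all-zero row
```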

After the prediction, we compute the accuracy.

In [12]:
X_test = t.transform(X_test)
preds = lr.predict(X_test)

accuracy_score(y_test, preds)

0.7608695652173914
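As an alternative to transforming the training and test data by hand, the transformer and the model can be chained in a `sklearn.pipeline.Pipeline`. This is not what the cells above do; it is a sketch of a common variant, run on made-up toy data so that it is self-contained. The frame `toy` and the names `X_toy`, `preprocess` and `model` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data reusing a few Titanic column names; the values are made up.
toy = pd.DataFrame({
    "Sex": ["female", "male"] * 10,
    "Pclass": [1, 2, 3, 1, 2] * 4,
    "Age": [float(20 + i) for i in range(20)],
    "Fare": [10.0 + 3 * i for i in range(20)],
    "Survived": ["Yes", "No"] * 10,
})
X_toy = toy.drop(columns="Survived")
y_toy = toy["Survived"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0, stratify=y_toy)

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Pclass"]),
    ("scale", StandardScaler(), ["Age", "Fare"]),
])
model = Pipeline([("preprocess", preprocess), ("classifier", LogisticRegression())])

model.fit(X_tr, y_tr)        # fits the transformer and the classifier on the training data
preds_toy = model.predict(X_te)  # applies the already fitted transformer, then predicts
acc = accuracy_score(y_te, preds_toy)
print(acc)
```

With a pipeline, `predict` only ever calls `transform` on the preprocessing step, so the test data cannot accidentally be used to refit the transformer.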