# Feature Engineering
## Feature Construction - Feature Splitting

Unlike the previous techniques, this one is driven entirely by creativity and intuition. We have to come up with a reason for changing the columns, and then write the code to do it.

For this demo, we are using the evergreen Titanic dataset.
We will combine the columns SibSp (siblings/spouses aboard) and Parch (parents/children aboard) into a single column and see a small increase in accuracy as a result.

The reasoning may not be obvious to a beginner; with practice you will build the intuition to know when this is worth doing.

We'll also look at Feature Splitting, which is just as intuition-driven.

In [36]:
import numpy as np
import pandas as pd
import seaborn as sns

In [37]:
df = pd.read_csv('train.csv')[['Age', 'Pclass', 'SibSp', 'Parch', 'Survived']]
df.head()

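|   | Age | Pclass | SibSp | Parch | Survived |
|---|-----|--------|-------|-------|----------|
| 0 | 22.0 | 3 | 1 | 0 | 0 |
| 1 | 38.0 | 1 | 1 | 0 | 1 |
| 2 | 26.0 | 3 | 0 | 0 | 1 |
| 3 | 35.0 | 1 | 1 | 0 | 1 |
| 4 | 35.0 | 3 | 0 | 0 | 0 |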


In [38]:
df.shape

(891, 5)

In [39]:
# drop rows with missing values (Age has some); LogisticRegression cannot handle NaN
df.dropna(inplace=True)

In [40]:
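# input features (first four columns) and target (last column)
x = df.iloc[:, 0:4].copy()  # copy so new columns can be added without a SettingWithCopyWarning
y = df.iloc[:, -1]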

In [41]:
# quickly evaluate a Logistic Regression with cross_val_score
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# cross_val_score returns one accuracy per fold (20 folds here);
# the mean summarises them in a single number
np.mean(cross_val_score(LogisticRegression(), x, y, scoring='accuracy', cv=20))

0.6933333333333332

Note that we get an accuracy of 69.33%.
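
`cross_val_score` returns one score per fold, so taking the mean collapses them into a single number. A minimal sketch of looking at the spread as well, using synthetic data from `make_classification` as a stand-in (an assumption, so the numbers are not the Titanic results; `X_demo`, `y_demo` and `scores` are illustrative names):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 200 rows, 4 numeric features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# One accuracy value per fold; cv=5 gives an array of length 5.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, scoring="accuracy", cv=5
)

print(scores)                       # per-fold accuracies
print(scores.mean(), scores.std())  # average and spread across folds
```

Looking at the standard deviation next to the mean helps judge whether a small change in mean accuracy is larger than the fold-to-fold variation.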

### Applying Feature Construction

In [42]:
x['family_size'] = x['SibSp'] + x['Parch'] + 1
# the +1 counts the passenger themselves as part of the family

In [43]:
# map family size to a coarse family type
def myfunc(num):
    if num == 1:
        return 0    # travelling alone
    elif 1 < num <= 4:
        return 1    # small family
    else:
        return 2    # large family

# we assign numeric codes to what is really categorical data

In [44]:
# create a new column with this function
x['family_type'] = x['family_size'].apply(myfunc)

In [45]:
x.head()

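|   | Age | Pclass | SibSp | Parch | family_size | family_type |
|---|-----|--------|-------|-------|-------------|-------------|
| 0 | 22.0 | 3 | 1 | 0 | 2 | 1 |
| 1 | 38.0 | 1 | 1 | 0 | 2 | 1 |
| 2 | 26.0 | 3 | 0 | 0 | 1 | 0 |
| 3 | 35.0 | 1 | 1 | 0 | 2 | 1 |
| 4 | 35.0 | 3 | 0 | 0 | 1 | 0 |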
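
The same binning can also be expressed without a hand-written function. A sketch using `pd.cut`, assuming the same thresholds as above (alone, 2 to 4, more than 4); the `toy` frame holds made-up values, not rows from `train.csv`:

```python
import numpy as np
import pandas as pd

# Toy passengers (made-up values, not rows from train.csv).
toy = pd.DataFrame({"SibSp": [0, 1, 1, 3, 5], "Parch": [0, 0, 2, 2, 2]})
toy["family_size"] = toy["SibSp"] + toy["Parch"] + 1  # 1, 2, 4, 6, 8

# Bin edges are right-inclusive: (0, 1] -> 0, (1, 4] -> 1, (4, inf) -> 2.
toy["family_type"] = pd.cut(
    toy["family_size"], bins=[0, 1, 4, np.inf], labels=[0, 1, 2]
).astype(int)

print(toy)
```

Keeping the bin edges in one list makes the thresholds easy to change, which is handy when experimenting with different definitions of a "small" family.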


In [46]:
# now we remove the columns we no longer need
x.drop(columns=['SibSp', 'Parch', 'family_size'], inplace=True)

In [48]:
# now we run the cross val score again
np.mean(cross_val_score(LogisticRegression(), x, y, scoring='accuracy', cv=20))

0.7003174603174602

The accuracy rose from 69.33% to 70.03%, a modest improvement of about 0.7 percentage points.

### Feature Splitting

This one is purely for demonstration; there is no particular need for it in this dataset, but there are cases where it comes in handy.

In [52]:
# read the csv file again
df = pd.read_csv('train.csv')
df.head()

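|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |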


In [66]:
# we are using "Name" for the demo
df['Name']
# notice the format of the names: "Family name, Title. Given names"

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [64]:
# create a new column holding the title
df['Title'] = df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
# .str.split is vectorized over the whole Series, so no per-row function is needed

In [69]:
df[['Title', 'Name']]

|   | Title | Name |
|---|-------|------|
| 0 | Mr | Braund, Mr. Owen Harris |
| 1 | Mrs | Cumings, Mrs. John Bradley (Florence Briggs Th... |
| 2 | Miss | Heikkinen, Miss. Laina |
| 3 | Mrs | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
| 4 | Mr | Allen, Mr. William Henry |
| ... | ... | ... |
| 886 | Rev | Montvila, Rev. Juozas |
| 887 | Miss | Graham, Miss. Margaret Edith |
| 888 | Miss | Johnston, Miss. Catherine Helen "Carrie" |
| 889 | Mr | Behr, Mr. Karl Howell |
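
An alternative to chaining two splits is a single regular expression. A sketch with `Series.str.extract`, assuming every name follows the "Family name, Title. Given names" layout; the pattern and the `names`/`titles` variables are illustrative, not part of the original notebook:

```python
import pandas as pd

# Names in the same "Family name, Title. Given names" layout as the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Capture everything between the comma (plus whitespace) and the first "." after it.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(titles.tolist())
```

With `expand=False` and a single capture group, the result is a Series, so it can be assigned directly to a new column.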


In [74]:
# interesting: the survival rate varies a lot by title
df.groupby('Title')['Survived'].mean()

Title
Capt            0.000000
Col             0.500000
Don             0.000000
Dr              0.428571
Jonkheer        0.000000
Lady            1.000000
Major           0.500000
Master          0.575000
Miss            0.697802
Mlle            1.000000
Mme             1.000000
Mr              0.156673
Mrs             0.792000
Ms              1.000000
Rev             0.000000
Sir             1.000000
the Countess    1.000000
Name: Survived, dtype: float64

In [79]:
# order the titles by survival rate
df.groupby('Title')['Survived'].mean().sort_values(ascending=False)

# women's titles (Mrs, Miss) and a few noble titles (Lady, Sir, the Countess) have high survival rates,
# whereas the most common male title, Mr, has a low one (about 16%)
# rates of exactly 0, 0.5 or 1 deserve a second look: they usually mean the title covers very few passengers

Title
the Countess    1.000000
Mlle            1.000000
Sir             1.000000
Ms              1.000000
Lady            1.000000
Mme             1.000000
Mrs             0.792000
Miss            0.697802
Master          0.575000
Col             0.500000
Major           0.500000
Dr              0.428571
Mr              0.156673
Jonkheer        0.000000
Rev             0.000000
Don             0.000000
Capt            0.000000
Name: Survived, dtype: float64
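
A mean on its own hides how many passengers stand behind each title. A small sketch that reports the group size next to the rate, using named aggregation; the `toy` frame holds made-up values and the column names `survival_rate` and `passengers` are illustrative:

```python
import pandas as pd

# Toy data (made-up values): one common title, one rare title.
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Mr", "Capt"],
    "Survived": [0, 0, 1, 0, 0],
})

# Report the group size next to the mean so tiny groups are visible.
summary = (
    toy.groupby("Title")["Survived"]
    .agg(survival_rate="mean", passengers="count")
    .sort_values("survival_rate", ascending=False)
)

print(summary)
```

Applied to the real `df`, the same aggregation would show which titles are common enough for their survival rate to be meaningful.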

In [101]:
# you can create categories based on these titles as well
# a single .loc call avoids chained assignment (df[...].loc[...] = ...),
# which triggers a SettingWithCopyWarning and may not update df
df['is_married'] = 0
df.loc[df['Title'] == 'Mrs', 'is_married'] = 1

In [105]:
df['is_married'].value_counts()
# 125 passengers carry the title Mrs
# other married women may use a different title, so this flag undercounts

0    766
1    125
Name: is_married, dtype: int64
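
If more than one title should count towards the flag, a membership test handles it in one step. A sketch on made-up data, where the set `married_titles` is purely a hypothetical choice for illustration and not something the dataset confirms:

```python
import pandas as pd

# Toy titles (made-up sample, not rows from train.csv).
toy = pd.DataFrame({"Title": ["Mr", "Mrs", "Miss", "Mme", "Mrs"]})

# Hypothetical set of titles treated as "married"; adjust to your own judgement.
married_titles = {"Mrs", "Mme"}

# isin() gives a boolean mask; astype(int) turns it into a 0/1 flag.
toy["is_married"] = toy["Title"].isin(married_titles).astype(int)

print(toy["is_married"].tolist())
```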