# 2 Feature Engineering

- In this section, we'll do four things.

- Cleaning: we'll fill in missing values.

- Removing: we'll remove the less relevant variables that don't improve predictions, such as "Ticket" and "Cabin"; "Cabin" has many null values in both the training and test datasets. "PassengerId" can also be dropped from the training dataset, as it does not contribute to survival.

- Data Binning: we'll transform categorical features such as "Sex" and "Embarked" into numerical ones, and group "Age" into bins.

- Creating New Variables: we'll create a new variable called "FamSize" based on the "Parch" and "SibSp" features.
  We'll create an "IsAlone" feature to check whether a person traveling alone was more or less likely to survive.
  We'll also extract Title from the Name feature as a new feature, to determine whether it played a role in survival.

##### Checking for missing values

In [19]:
# let's take a more detailed look at what data is actually missing
total = train.isnull().sum().sort_values(ascending=False)
percent_1 = train.isnull().sum()/train.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
missing_data.head(5)

| Column | Total | % |
|---|---|---|
| Cabin | 687 | 77.1 |
| Age | 177 | 19.9 |
| Embarked | 2 | 0.2 |
| Fare | 0 | 0.0 |
| Ticket | 0 | 0.0 |


A critical part of the success of a machine learning project is feature engineering.

In [20]:
train.isnull().any()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

In [21]:
test.isnull().any()

PassengerId    False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare            True
Cabin           True
Embarked       False
dtype: bool
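The two checks above can also be combined into one side-by-side view. The following is only a sketch on made-up data: `toy_train`, `toy_test`, and the helper `missing_summary` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real train/test frames (hypothetical values).
toy_train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Embarked": ["S", "C", None, "S"],
})
toy_test = pd.DataFrame({
    "Pclass": [3, 3, 2, 3],
    "Age": [34.5, np.nan, 62.0, 27.0],
    "Fare": [7.83, np.nan, 9.69, 8.66],
    "Embarked": ["Q", "S", "Q", "S"],
})

def missing_summary(train_df, test_df):
    """Return per-column missing counts for both frames, side by side."""
    summary = pd.concat(
        [train_df.isnull().sum(), test_df.isnull().sum()],
        axis=1,
        keys=["train_missing", "test_missing"],
    )
    # Keep only the columns that have a gap in at least one frame.
    return summary[(summary > 0).any(axis=1)]

summary = missing_summary(toy_train, toy_test)
print(summary)
```

On the real data, this kind of table makes it easy to see that Embarked needs filling only in the training set and Fare only in the test set.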

## 2.1 Filling missing Values

In [22]:
# Processing Age: fill missing ages with the mean age of each dataset

train["Age"] = train["Age"].fillna(train["Age"].mean())
test["Age"] = test["Age"].fillna(test["Age"].mean())

In [23]:
# Processing Fare

test["Fare"] = test["Fare"].fillna(test["Fare"].mean())
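The cells above fill each dataset with its own mean. A common alternative, shown here only as a sketch on made-up data, is to compute the fill value once on the training set and reuse it for the test set, so that both sets are imputed with the same value. The helper `fill_from_train` and the toy frames are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frames with gaps (hypothetical values).
toy_train = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Fare": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0], "Fare": [np.nan, 8.0]})

def fill_from_train(train_df, test_df, columns):
    """Fill NaNs in both frames with the training-set mean of each column."""
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in columns:
        fill_value = train_df[col].mean()  # NaNs are skipped by default
        train_out[col] = train_out[col].fillna(fill_value)
        test_out[col] = test_out[col].fillna(fill_value)
    return train_out, test_out

filled_train, filled_test = fill_from_train(toy_train, toy_test, ["Age", "Fare"])
print(filled_test)
```

Whether this changes the final score on this dataset is not something the notebook measures; the sketch only shows the mechanics.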

## 2.2 Binning Categorical variables

In [24]:
# Processing Sex


Before we fit the data to a machine learning algorithm, a crucial step is to make sure the categorical variables are encoded correctly.

We will change Sex to binary: 1 for female and 0 for male. We encode Embarked in the same way, using 0 for S, 1 for C, and 2 for Q. We apply the same process to both the training and test sets to prepare our data for machine learning.

In [25]:
train.loc[train["Sex"] == "male" , "Sex"] = 0
train.loc[train["Sex"] == "female","Sex"] = 1



test.loc[test["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "female", "Sex"] = 1


In [26]:
# Processing Embarked

In [27]:
train.loc[train["Embarked"] == "S", "Embarked"] = 0
train.loc[train["Embarked"] == "C", "Embarked"] = 1
train.loc[train["Embarked"] == "Q", "Embarked"] = 2


test.loc[test["Embarked"]  == "S", "Embarked"] = 0
test.loc[test["Embarked"]  == "C", "Embarked"] = 1
test.loc[test["Embarked"]  == "Q", "Embarked"] = 2

In [28]:
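# Embarked is already encoded at this point, so fill the two missing
# training values with 0, the code assigned to "S" above
train["Embarked"] = train["Embarked"].fillna(0)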
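The same encoding can be written more compactly with `Series.map` and a dictionary of codes. This is a minimal sketch on a toy frame (the frame and the dictionary names are made up for illustration); it fills the missing port before encoding, so the filled value is encoded along with the rest.

```python
import pandas as pd

# Toy frame with one missing port (hypothetical values).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", None, "Q"],
})

sex_codes = {"male": 0, "female": 1}
embarked_codes = {"S": 0, "C": 1, "Q": 2}

toy["Sex"] = toy["Sex"].map(sex_codes)
# Fill the gap with "S" first, then encode every value in one pass.
toy["Embarked"] = toy["Embarked"].fillna("S").map(embarked_codes)
print(toy)
```

Keeping the codes in one dictionary also guarantees that the training and test sets are encoded identically.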

## 2.3 Creating New Features

Next, we introduce a new feature, family size, which combines Parch and SibSp (plus one for the passenger).

In [29]:
train["FamSize"] = train["SibSp"] + train["Parch"] + 1
test["FamSize"]  =  test["SibSp"] + test["Parch"]  + 1

The next step is to create an IsAlone feature to check whether a person traveling alone was more or less likely to survive.

In [30]:
train["IsAlone"] = train.FamSize.apply(lambda x: 1 if x == 1 else 0)
test["IsAlone"]  = test.FamSize.apply( lambda x: 1 if x == 1 else 0)
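To make the two new features concrete, here is a small worked example on made-up passengers (the toy frame is an assumption for illustration only). It uses a vectorized comparison instead of `apply`, which gives the same 0/1 result.

```python
import pandas as pd

# Three hypothetical passengers: alone, with one sibling/spouse, with a large family.
toy = pd.DataFrame({"SibSp": [0, 1, 3], "Parch": [0, 0, 2]})

# Family size counts siblings/spouses, parents/children, and the passenger.
toy["FamSize"] = toy["SibSp"] + toy["Parch"] + 1
# A passenger is alone exactly when the family size is 1.
toy["IsAlone"] = (toy["FamSize"] == 1).astype(int)
print(toy)
```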

In [31]:
# Extracting the passengers' titles

A quick look at the passengers' names shows that each name contains a title, which can be useful information for our analysis. We can therefore extract the title from each passenger's name and then encode it as we did for Sex and Embarked.

In [32]:
# str.extract is vectorized, so no loop over the names is needed;
# the pattern captures the word that ends with a period (e.g. "Mr.")
train["Title"] = train["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)
test["Title"] = test["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)
    
title_replacements = {"Mlle": "Other", "Major": "Other", "Col": "Other", "Sir": "Other", "Don": "Other", "Mme": "Other",
          "Jonkheer": "Other", "Lady": "Other", "Capt": "Other", "Countess": "Other", "Ms": "Other", "Dona": "Other", "Rev": "Other", "Dr": "Other"}

train.replace({"Title": title_replacements}, inplace=True)
test.replace({"Title": title_replacements}, inplace=True)

train.loc[train["Title"] == "Miss", "Title"] = 0
train.loc[train["Title"] == "Mr", "Title"] = 1
train.loc[train["Title"] == "Mrs", "Title"] = 2
train.loc[train["Title"] == "Master", "Title"] = 3
train.loc[train["Title"] == "Other", "Title"] = 4

test.loc[test["Title"] == "Miss", "Title"] = 0
test.loc[test["Title"] == "Mr", "Title"] = 1
test.loc[test["Title"] == "Mrs", "Title"] = 2
test.loc[test["Title"] == "Master", "Title"] = 3
test.loc[test["Title"] == "Other", "Title"] = 4

In [33]:
print(set(train["Title"]))

{0, 1, 2, 3, 4}
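The title extraction can be checked on a few names written in the dataset's "Surname, Title. Given names" format. This is a sketch: the sample `names` series and the reduced `rare` dictionary are assumptions for illustration, while the regular expression and the numeric codes follow the cell above.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the first run of letters that is followed by a period.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# Group rare titles under "Other", then map every title to its code.
rare = {"Countess": "Other"}
title_codes = {"Miss": 0, "Mr": 1, "Mrs": 2, "Master": 3, "Other": 4}
encoded = titles.replace(rare).map(title_codes)
print(pd.DataFrame({"title": titles, "code": encoded}))
```

A title that appears in neither dictionary would come out of `map` as NaN, which is a convenient way to spot titles that were not anticipated.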


In [1]:
# Age

In [35]:
train['Age'] = train['Age'].astype(int)
train.loc[ train['Age'] <= 11, 'Age'] = 0
train.loc[(train['Age'] > 11) & (train['Age'] <= 18), 'Age'] = 1
train.loc[(train['Age'] > 18) & (train['Age'] <= 22), 'Age'] = 2
train.loc[(train['Age'] > 22) & (train['Age'] <= 27), 'Age'] = 3
train.loc[(train['Age'] > 27) & (train['Age'] <= 33), 'Age'] = 4
train.loc[(train['Age'] > 33) & (train['Age'] <= 40), 'Age'] = 5
train.loc[(train['Age'] > 40) & (train['Age'] <= 66), 'Age'] = 6
train.loc[ train['Age'] > 66, 'Age'] = 6

# let's see how it's distributed: train['Age'].value_counts()

In [36]:
test['Age'] = test['Age'].astype(int)
test.loc[ test['Age'] <= 11, 'Age'] = 0
test.loc[(test['Age'] > 11) & (test['Age'] <= 18), 'Age'] = 1
test.loc[(test['Age'] > 18) & (test['Age'] <= 22), 'Age'] = 2
test.loc[(test['Age'] > 22) & (test['Age'] <= 27), 'Age'] = 3
test.loc[(test['Age'] > 27) & (test['Age'] <= 33), 'Age'] = 4
test.loc[(test['Age'] > 33) & (test['Age'] <= 40), 'Age'] = 5
test.loc[(test['Age'] > 40) & (test['Age'] <= 66), 'Age'] = 6
test.loc[ test['Age'] > 66, 'Age'] = 6
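The same age groups can be produced in one call with `pd.cut`. The sketch below uses the bin edges from the cells above on a handful of made-up ages; `age_edges` is a name introduced here for illustration. Because the cells above assign group 6 to everything over 40, the last bin is left open-ended.

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges.
ages = pd.Series([4, 11, 12, 18, 22, 27, 30, 40, 41, 80])

# Right-closed intervals: (-inf, 11], (11, 18], ..., (33, 40], (40, inf).
age_edges = [-np.inf, 11, 18, 22, 27, 33, 40, np.inf]

# labels=False returns the integer index of the bin each age falls into.
age_bins = pd.cut(ages, bins=age_edges, labels=False)
print(age_bins.tolist())
```

Defining the edges once and applying them to both `train` and `test` would avoid keeping two copies of the binning rules in sync.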

## 2.4 Removing irrelevant variables

The next step is to drop the less relevant features. The problem with less important features is that they create more noise and can take over the importance of real features like Sex and Pclass.

In [38]:
features_drop = ['Ticket', 'SibSp', 'Parch', "Name", "Cabin", "Fare"]
train = train.drop(features_drop, axis=1)
test = test.drop(features_drop, axis=1)
train = train.drop(['PassengerId'], axis=1)

## 2.5 Creating dummy variables

In [37]:
train = pd.get_dummies(train, columns=['Pclass','Sex','Embarked','Title'], 
                       drop_first=False)
test = pd.get_dummies(test, columns=['Pclass','Sex','Embarked','Title'],
                      drop_first=False)
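Because `get_dummies` is called separately on each dataset, a category that appears in only one of them produces a column the other lacks. The following sketch, on made-up data, shows one way to align the test columns with the training features using `reindex`; the toy frames and the name `feature_cols` are assumptions for illustration.

```python
import pandas as pd

# Toy frames: Title code 4 appears only in train, code 2 only in test.
toy_train = pd.DataFrame({"Survived": [0, 1, 1], "Title": [1, 0, 4]})
toy_test = pd.DataFrame({"Title": [1, 2]})

train_d = pd.get_dummies(toy_train, columns=["Title"], dtype=int)
test_d = pd.get_dummies(toy_test, columns=["Title"], dtype=int)

# Use the training feature columns (everything except the target) as the reference.
feature_cols = [c for c in train_d.columns if c != "Survived"]
# Add missing columns as all zeros and drop columns the model never saw.
test_aligned = test_d.reindex(columns=feature_cols, fill_value=0)
print(test_aligned)
```

Note that a test-only category (here `Title_2`) is dropped by this approach, so it is worth checking how often that happens before relying on it.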