## 2. Feature Engineering

* **Inspect the missing values** 

In [22]:
# Let's take a more detailed look at what data is actually missing

total = train.isnull().sum().sort_values(ascending=False)
percent_1 = train.isnull().sum()/train.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
missing_data.head(5)

| | Total | % |
|---|---|---|
| Cabin | 687 | 77.1 |
| Age | 177 | 19.9 |
| Embarked | 2 | 0.2 |
| Fare | 0 | 0.0 |
| Ticket | 0 | 0.0 |
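The summary above can be wrapped in a small reusable helper so the same check runs on both `train` and `test`. This is a minimal sketch: the function name `missing_summary` and the toy DataFrame are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, most-missing first."""
    total = df.isnull().sum()
    # isnull().mean() is the fraction of missing rows in each column
    percent = (df.isnull().mean() * 100).round(1)
    summary = pd.concat([total, percent], axis=1, keys=["Total", "%"])
    return summary.sort_values("Total", ascending=False)


# Toy frame standing in for `train` (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(train)` and `missing_summary(test)` would then give comparable tables for both sets.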


* **Missing values in train data**

In [23]:
train.isnull().any()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

* **Missing values in test data**

In [24]:
test.isnull().any()

PassengerId    False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare            True
Cabin           True
Embarked       False
dtype: bool

### 2.1 Filling missing Values

In [25]:
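# Age: fill missing values with the column mean
train["Age"] = train["Age"].fillna(train["Age"].mean())
test["Age"] = test["Age"].fillna(test["Age"].mean())

# Fare: only the test set has missing values
test["Fare"] = test["Fare"].fillna(test["Fare"].mean())

# Embarked: fill the two missing training values with "S".
# Assign the result back rather than calling fillna(..., inplace=True) on a
# column selection, which may not update `train` in recent pandas versions.
train["Embarked"] = train["Embarked"].fillna("S")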
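A common variant of the imputation above computes the fill values on the training set only and reuses them for the test set, so that no test-set statistics enter the preprocessing. The sketch below illustrates that variant on toy data. It is an alternative to the cell above, not what the notebook does, and the frames `toy_train` and `toy_test` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for `train` and `test` (values are made up)
toy_train = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Fare": [10.0, 20.0, 30.0, 40.0],
    "Embarked": ["S", None, "S", "C"],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 30.0],
    "Fare": [np.nan, 10.0],
})

# Compute the fill values once, on the training data only
age_mean = toy_train["Age"].mean()
fare_mean = toy_train["Fare"].mean()
most_common_port = toy_train["Embarked"].mode()[0]

# Apply the same training-set statistics to both frames
toy_train["Age"] = toy_train["Age"].fillna(age_mean)
toy_test["Age"] = toy_test["Age"].fillna(age_mean)
toy_test["Fare"] = toy_test["Fare"].fillna(fare_mean)
toy_train["Embarked"] = toy_train["Embarked"].fillna(most_common_port)
```

Using the most frequent port instead of a hard-coded value is also an assumption of this sketch.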

### 2.2 Encoding Categorical Variables


* Before we fit a machine learning algorithm, we must encode the categorical variables as numbers. We map Sex to a binary value (0 for male, 1 for female) and Embarked to an integer code (0 for S, 1 for C, 2 for Q). We apply the same mapping to both the training and the test set so that the codes mean the same thing in each.

* **Preprocessing Sex**

In [26]:
train.loc[train["Sex"] == "male" , "Sex"] = 0
train.loc[train["Sex"] == "female","Sex"] = 1


test.loc[test["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "female", "Sex"] = 1

* **Processing Embarked**

In [27]:
train.loc[train["Embarked"] == "S", "Embarked"] = 0
train.loc[train["Embarked"] == "C", "Embarked"] = 1
train.loc[train["Embarked"] == "Q", "Embarked"] = 2


test.loc[test["Embarked"]  == "S", "Embarked"] = 0
test.loc[test["Embarked"]  == "C", "Embarked"] = 1
test.loc[test["Embarked"]  == "Q", "Embarked"] = 2
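The Sex and Embarked encodings above can also be written with `Series.map` and one dictionary per column, which keeps the code table in one place and makes it easy to apply identically to both sets. This is a minimal sketch on a toy frame: the constant names `SEX_CODES` and `EMBARKED_CODES` are assumptions, while the codes themselves are the ones used above.

```python
import pandas as pd

# Same codes as in the cells above, collected in one place
SEX_CODES = {"male": 0, "female": 1}
EMBARKED_CODES = {"S": 0, "C": 1, "Q": 2}

# Toy frame standing in for `train` / `test`
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "Q", "C"],
})

# map() looks each value up in the dictionary; values missing from the
# dictionary become NaN, so unexpected categories are easy to spot
toy["Sex"] = toy["Sex"].map(SEX_CODES)
toy["Embarked"] = toy["Embarked"].map(EMBARKED_CODES)
```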

### 2.3 Creating New Features

- Next, we introduce a new feature, family size, which combines SibSp and Parch (plus one for the passenger themselves).

* **Family size**

In [28]:
train["FamSize"] = train["SibSp"] + train["Parch"] + 1
test["FamSize"]  =  test["SibSp"] + test["Parch"]  + 1

- Next, we create an IsAlone feature to check whether a person traveling alone was more or less likely to survive.

* **IsAlone**

In [29]:
train["IsAlone"] = train.FamSize.apply(lambda x: 1 if x == 1 else 0)
test["IsAlone"]  = test.FamSize.apply( lambda x: 1 if x == 1 else 0)
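Both features can be computed with vectorized column operations. The sketch below uses a boolean comparison instead of `apply` with a lambda and should give the same 0/1 values. The toy frame is made up for illustration.

```python
import pandas as pd

# Toy frame standing in for `train` (values are made up)
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Family size = siblings/spouses + parents/children + the passenger
toy["FamSize"] = toy["SibSp"] + toy["Parch"] + 1

# A passenger is alone when the family size is exactly 1
toy["IsAlone"] = (toy["FamSize"] == 1).astype(int)
```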

* **Extracting the passengers' titles**

* A quick look at the passenger names shows that each name contains a title, which can be useful information for our analysis. We therefore extract the title from each name and then encode it as we did for Sex and Embarked.

In [30]:
# str.extract is vectorized, so no loop over the names is needed.
# The pattern captures the word immediately before the first period.
train["Title"] = train["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)
test["Title"] = test["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)
    
# Group the rarer titles into a single "Other" category
title_replacements = {
    "Mlle": "Other", "Major": "Other", "Col": "Other", "Sir": "Other",
    "Don": "Other", "Mme": "Other", "Jonkheer": "Other", "Lady": "Other",
    "Capt": "Other", "Countess": "Other", "Ms": "Other", "Dona": "Other",
    "Rev": "Other", "Dr": "Other",
}

train.replace({"Title": title_replacements}, inplace=True)
test.replace({"Title": title_replacements}, inplace=True)

train.loc[train["Title"] == "Miss", "Title"] = 0
train.loc[train["Title"] == "Mr", "Title"] = 1
train.loc[train["Title"] == "Mrs", "Title"] = 2
train.loc[train["Title"] == "Master", "Title"] = 3
train.loc[train["Title"] == "Other", "Title"] = 4

test.loc[test["Title"] == "Miss", "Title"] = 0
test.loc[test["Title"] == "Mr", "Title"] = 1
test.loc[test["Title"] == "Mrs", "Title"] = 2
test.loc[test["Title"] == "Master", "Title"] = 3
test.loc[test["Title"] == "Other", "Title"] = 4

In [31]:
print(set(train["Title"]))

{0, 1, 2, 3, 4}
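The extraction and encoding steps can be checked on a handful of names. The sketch below uses sample names in the `Surname, Title. Given names` format of the Name column and a variant of the encoding in which any title outside the four common ones falls back to code 4 ("Other"), so an unseen title never stays unencoded. The fallback approach and the name `TITLE_CODES` are assumptions of this sketch, not the notebook's method.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Uruchurtu, Don. Manuel E",
])

# Capture the word immediately before the first period in each name
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# Same codes as above for the common titles; everything else becomes 4
TITLE_CODES = {"Miss": 0, "Mr": 1, "Mrs": 2, "Master": 3}
title_codes = titles.map(TITLE_CODES).fillna(4).astype(int)
```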


### 2.4 Removing irrelevant variables

* The next step is to drop the less relevant features. Less important features add noise and can obscure the signal from stronger features such as Sex and Pclass.

In [32]:
features_drop = ['Ticket', 'SibSp', 'Parch', "Name", "Cabin", "Fare", "PassengerId"]

train = train.drop(features_drop, axis=1)

test = test.drop(features_drop, axis=1)

### 2.5 Creating dummy variables

In [33]:
train = pd.get_dummies(train, columns=['Pclass','Sex','Embarked','Title'], 
                       drop_first=False)

test = pd.get_dummies(test, columns=['Pclass','Sex','Embarked','Title'],
                      drop_first=False)
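Because `pd.get_dummies` is called separately on each frame, a category that appears in only one of them would produce mismatched columns. A minimal sketch of one way to guard against that is shown below: reindex the test columns to match the training feature columns. The toy frames are made up for illustration, and whether the guard is needed depends on the actual data.

```python
import pandas as pd

# Toy frames: Title code 4 appears in train but not in test
toy_train = pd.DataFrame({"Survived": [0, 1, 1], "Title": [0, 1, 4]})
toy_test = pd.DataFrame({"Title": [0, 1, 1]})

train_d = pd.get_dummies(toy_train, columns=["Title"])
test_d = pd.get_dummies(toy_test, columns=["Title"])

# Use the training feature columns (everything except the target) as the
# reference layout; columns missing from test are added and filled with 0
feature_cols = [c for c in train_d.columns if c != "Survived"]
test_d = test_d.reindex(columns=feature_cols, fill_value=0)
```

After this step, `test_d` has the same feature columns, in the same order, as the training frame.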