![title](http://www.netresec.com/images/NetworkMiner_logo_313x313.png)

In Data Science, **Feature Engineering** is the process of creating new features from the existing features of a dataset. This is where you can really get creative with the data, since there can be a lot of features hidden in a dataset, and some of them can improve the accuracy of the predictions that you later make with the data. Some frequent examples of features that can be created from common data include:

- Current Date - Birth Date = Age
- Date2 - Date1 = Days between
- Divided Full name = First Name, Surname
- Divided date = Day, Month, Year

And so on. The possibilities depend on your imagination and on the data that you are working with. Of course, you also have to determine how these extra features affect the performance of your model. Let's see this in action with the Titanic dataset that we worked with previously.
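Before moving to the Titanic data, here is a minimal sketch of the derived features listed above, using pandas. The table, its column names (`full_name`, `birth_date`, `signup_date`) and the fixed `reference_date` are all made up for illustration and are not part of the Titanic dataset.

```python
import pandas as pd

# Hypothetical toy data: every name, date, and column name here is made up.
people = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Alan Turing"],
    "birth_date": pd.to_datetime(["1990-05-17", "1985-11-02"]),
    "signup_date": pd.to_datetime(["2020-01-10", "2021-03-05"]),
})

# A fixed reference date stands in for the current date so the result is reproducible.
reference_date = pd.Timestamp("2024-01-01")

# Current Date - Birth Date = Age, in whole years.
# One year is subtracted when the birthday has not yet occurred in the reference year.
birthday_passed = (
    (people.birth_date.dt.month < reference_date.month)
    | (
        (people.birth_date.dt.month == reference_date.month)
        & (people.birth_date.dt.day <= reference_date.day)
    )
)
people["age"] = reference_date.year - people.birth_date.dt.year - (~birthday_passed).astype(int)

# Date2 - Date1 = Days between.
people["days_since_signup"] = (reference_date - people.signup_date).dt.days

# Divided full name = first name, surname (split on the first space only).
name_parts = people.full_name.str.split(" ", n=1, expand=True)
people["first_name"] = name_parts[0]
people["surname"] = name_parts[1]

# Divided date = day, month, year.
people["birth_day"] = people.birth_date.dt.day
people["birth_month"] = people.birth_date.dt.month
people["birth_year"] = people.birth_date.dt.year

print(people[["age", "days_since_signup", "first_name", "surname",
              "birth_day", "birth_month", "birth_year"]])
```

Each new column is computed from columns that were already in the table, which is the essence of feature engineering.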

In [7]:
import pandas as pd

cleaned_titanic = pd.read_csv("cleaned_titanic.csv")

## Creating new features

First, we have to analyze the data that we currently have to see if we can find anything useful to transform into a new feature. 

In [8]:
cleaned_titanic.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S
5,6,0,3,"Moran, Mr. James",male,35.0,0,0,330877,8.4583,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,C


Two interesting columns here are the Sibling/Spouse count (SibSp) and the Parent/Child count (Parch), which denote how many siblings and spouses, and how many parents and children, the passenger has on board respectively. If we add these two columns, we get how many **family members** the passenger has on board.

In [10]:
#Perform row-wise sum between the two columns.
family = cleaned_titanic.SibSp + cleaned_titanic.Parch
cleaned_titanic["Fam"] = family
cleaned_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Fam
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,0


Now we have the total family members each passenger has on board. Unfortunately, some of our passengers are completely alone, with their total family members on board being 0. We can create a new boolean or "Yes or No" column with this information, classifying those that do not have family on board as being **alone**.

In [12]:
#Create a new column with those that are alone being True, and those that are not being False.
#The comparison already returns a boolean value for each row.
Is_Alone = cleaned_titanic.Fam == 0

cleaned_titanic["Alone"] = Is_Alone
cleaned_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Fam,Alone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,1,False
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,1,False
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,0,True
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,1,False
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,0,True


Now this feature may seem redundant, since we already know whether someone is alone from the number of family members. However, being alone may be a stronger predictor of survival than the actual **number** of family members on board. You can always compare how useful each feature is when you evaluate the model's predictions later on.
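As a rough first look before any model is built, one option is to compare the survival rate across the values of each feature. The sketch below rebuilds only the `Survived`, `SibSp` and `Parch` values of the ten rows printed earlier, so it runs without the CSV file; the variable names `sample`, `survival_by_alone` and `survival_by_fam` are introduced here for illustration. Ten rows are far too few to draw conclusions, so treat this as a demonstration of the technique only.

```python
import pandas as pd

# Survived, SibSp and Parch for the first ten passengers shown earlier.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
    "SibSp":    [1, 1, 0, 1, 0, 0, 0, 3, 0, 1],
    "Parch":    [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
})
sample["Fam"] = sample.SibSp + sample.Parch
sample["Alone"] = sample.Fam == 0

# Mean of a 0/1 column = share of survivors within each group.
survival_by_alone = sample.groupby("Alone").Survived.mean()
survival_by_fam = sample.groupby("Fam").Survived.mean()

print(survival_by_alone)
print(survival_by_fam)
```

On the full dataset, the same two `groupby` calls can be run on `cleaned_titanic` directly.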

#### Now your turn.

Create a new boolean feature, Is_Mother, which is True for passengers who are female, over the age of 20, and have a parent/child count greater than 0, and False otherwise.

In [13]:
#Your code here

## Creating new features from Text.

Text often hides new features waiting to be created. However, text is also harder to analyze than numbers, since there is no easy way to know whether every entry follows the same format. For the purposes of this notebook, however, we assume that all the text follows the same format.

When working with text features, you'll often have to extract a pattern that may become a significant feature later on. To do this we use **Regular Expressions**. Regular expressions are a language of their own, and they are an incredible help when extracting patterns from text.

To get started with regular expressions, first read this:

https://developers.google.com/edu/python/regular-expressions

For example, one use for regular expressions is extracting an email address from a text.

In [18]:
sentence_email = "My email is  realemail@email.com"
# Import the regular expression library.
import re
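#The pattern for an email here is (one or more word characters) @ (one or more word characters) .com
#The r prefix makes it a raw string, so Python leaves the backslashes for the regex engine.
pattern = r"\w+@\w+\.com"
#Find the first occurrence of the pattern in the text.
match = re.search(pattern,sentence_email)
#Print the matched text.
print(match.group())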

realemail@email.com
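
Note that `re.search` stops at the first occurrence and returns `None` when nothing matches. The short sketch below contrasts it with `re.findall`, which returns every occurrence as a list. The example sentences and addresses are made up for illustration.

```python
import re

# Made-up sentence containing two addresses.
text = "Write to first@email.com or second@email.com"
# Same email pattern, written as a raw string.
pattern = r"\w+@\w+\.com"

first_match = re.search(pattern, text)   # match object for the first hit, or None
all_matches = re.findall(pattern, text)  # list with every non-overlapping hit

print(first_match.group())  # first@email.com
print(all_matches)          # ['first@email.com', 'second@email.com']

# When nothing matches, re.search returns None, so check before calling .group().
no_match = re.search(pattern, "no address here")
print(no_match is None)     # True
```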


Practice a lot with regular expressions in your own time; they are a very powerful tool to have in your arsenal when working with text.

Let's see how we can use regular expressions in the titanic dataset. Our text column is the name of the passenger, so let's have a look at it. 

In [21]:
cleaned_titanic.head().Name

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

Now let's use our imagination. It's the early 1900s and you are a passenger on the Titanic. Back in those years, almost everyone had some type of title that was used to address them. This is reflected in our dataset, as almost every person has a title, like Mr. Owen Harris. Let's create a new feature with these titles.

In [33]:
#The pattern for a title: a space, a word, a period and another space, like " Mr. "
title_pattern = r"\s\w+\.\s"
#Use a lambda function to apply it to all names.
titles = cleaned_titanic.Name.apply(lambda name:re.search(title_pattern,name).group())
cleaned_titanic["title"] = titles
cleaned_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Fam,Alone,title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,1,False,Mr.
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,1,False,Mrs.
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,0,True,Miss.
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,1,False,Mrs.
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,0,True,Mr.


To get cleaner text, let's remove the non-alphabetic characters from the titles. We can also use regular expressions for this, but this time with the **sub()** function. Sub stands for substitute: instead of finding the text that matches the specified pattern, we replace the text that matches the pattern with another text.

In [34]:
#Create a lambda function that removes all non-alphabetic characters (spaces and periods) from a title.
replace_nonalpha = lambda title: re.sub("[^A-Za-z]","",title)
cleaned_titanic.title = cleaned_titanic.title.apply(replace_nonalpha)
cleaned_titanic.title.head()

0      Mr
1     Mrs
2    Miss
3     Mrs
4      Mr
Name: title, dtype: object

Regular expressions can also be used for text cleaning as we can observe. I can't stress enough how important they are, so make sure to learn them well!
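
As an alternative sketch, pandas offers the vectorized `Series.str.extract`, which can do the extraction and the cleaning in one step by using a capture group. It also returns a missing value instead of raising an error when a name has no title. The `names` Series below is built by hand from names shown earlier, and its last entry is made up to show the no-match case.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Name Without Title",   # made-up row: no period, so nothing matches
])

# The parentheses form a capture group: only the word between
# the space and the period is returned, without the punctuation.
titles = names.str.extract(r"\s(\w+)\.", expand=False)
print(titles.tolist())
```

With `expand=False` and a single capture group, the result is a Series, so it can be assigned directly to a new column.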

#### Now your turn.

Create a new feature, surname, which holds the last name taken from each passenger's Name. For example, for "Braund, Mr. Owen Harris", the surname would be Braund.

In [35]:
#Your code here

## Minimizing existing features

Another form of feature engineering is taking existing features and making them smaller, that is, reducing them to fewer distinct values. It's similar to what we did with the Alone feature, where instead of keeping the total number of family members on board, we reduced it to two conditions: the passenger is alone or not. Let's try this with the numerical feature Age. To help us, let's look at its summary statistics.

In [36]:
cleaned_titanic.Age.describe()

count    891.000000
mean      30.752155
std       13.173100
min        0.420000
25%       22.000000
50%       32.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64

Let's say we want to minimize this feature by grouping the ages into equally wide ranges (the same idea applies to other numerical features, such as Fare). If we want to divide it into n equally wide ranges, the width of each range is:

    (maximum value - minimum value)/n 
    
For n = 3, taking the minimum age as 0, the width is (80 - 0) / 3, which is approximately 26.7; we round it up to 27.

So for n = 3, those aged 0-26 would be "Young", those aged 27-53 would be "Middle-Aged" and those aged 54-80 would be "Old". We can write a function that takes an age and returns the label of the range it falls in.

In [40]:
def Label_Age(age):
    #Function that converts an age to a label.
    if age >= 0 and age <27:
        return "Young"
    elif age>=27 and age <54:
        return "Middle-Aged"
    else:
        return "Old"
    
labeled_Age = cleaned_titanic.Age.apply(Label_Age)
cleaned_titanic["Labeled_Age"] = labeled_Age
cleaned_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Fam,Alone,title,Labeled_Age
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,1,False,Mr,Young
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,1,False,Mrs,Middle-Aged
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,0,True,Miss,Young
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,1,False,Mrs,Middle-Aged
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,0,True,Mr,Middle-Aged


Just like with the Alone feature, we can only know whether this feature helps or hinders us when we build a predictive model with the dataset and compare its accuracy with and without the feature.
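
The same labelling can also be sketched with `pandas.cut`, which assigns each value to a bin given a list of bin edges. This is an alternative to the hand-written function, not a replacement for it. The `ages` Series is a small hand-made sample, and the last edge is set to infinity as an assumption that mirrors the function's `else` branch.

```python
import numpy as np
import pandas as pd

# Small hand-made sample of ages.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 80.0])

# Width of each of n equal ranges: (maximum - minimum) / n.
n = 3
width = (80 - 0) / n          # about 26.67; the function above uses 27
edges = [0, 27, 54, np.inf]   # open-ended last bin, like the else branch

# right=False makes each bin include its left edge and exclude its right edge:
# [0, 27), [27, 54), [54, inf).
age_labels = pd.cut(ages, bins=edges, right=False,
                    labels=["Young", "Middle-Aged", "Old"])
print(age_labels.tolist())
```

Changing the number of groups then only requires editing the `edges` and `labels` lists rather than adding more `elif` branches.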

#### Your turn now.

Create a new, smaller feature based on the Fare feature. Divide it into 5 labels covering ranges of approximately equal width, like what was done with Age before. You can name the labels however you want.

In [41]:
#Your code here.

In a nutshell, feature engineering is where you can let your imagination loose. When you have cleaned and dealt with missing data in a dataset, try to create new features with the data that you have. Sometimes, creative feature engineering is more powerful than fancy algorithms when creating a predictive model, so always work a lot with your data before moving forward!

## Exercise.

Create at least two new features for the Titanic dataset that we've been working with so far.

**Hint:** Try to minimize the Title feature, or take only the numbers from a ticket.

In [42]:
#Your cells below.

## Further reading

Feature Engineering in more depth: https://www.slideshare.net/HJvanVeen/feature-engineering-72376750