# **Instructions**
Please make a copy of this notebook first and then answer the questions in your own copy.

**To make a copy of the notebook:**
- Click on **`File`**
- Click on  **`Save a copy in Drive`**
- Work from that saved file.

**Another alternative would be to:**
- Download the notebook and the data file
- Work on it from your computer
- Reupload your solution notebook to [Google's Colaboratory](https://colab.research.google.com/) so I can leave comments on it.


**Please ensure that you have run all the cells of your submission notebook so I'll be able to see your results without having to run them myself.** 

Thank you!!!

# **How to get the Data file**
The exercises here are based on the Titanic training dataset, which was available on Kaggle [here](https://www.kaggle.com/c/titanic/data) during the Titanic competition. It is possibly one of the most popular datasets available online. The goal of the competition was to **predict who would survive the Titanic**.

**Note:**
If you are getting the data from Kaggle, you'll need to **JOIN** the competition before you can access the data.

You can also download it directly from Google Drive through this [link](https://drive.google.com/file/d/1Eipec9zf0NNiM4WsLqVGqMh_Tr1Ee6Vk/view?usp=sharing).

# **Import Pandas and read in the CSV File**
*You can also do all other library imports here*

In [1]:
import pandas as pd
#file_path ="add the path to the dataset here"
df = pd.read_csv('kaggle_titanic_train_data.csv')

In [2]:
import numpy as np

# **Check to see what the data looks like**
**Hint:** You need a Pandas function for this

***Your solution:***

In [3]:
# Your solution goes in here
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
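
The `info()` output above shows fewer than 891 non-null values for `Age`, `Cabin` and `Embarked`. A quick way to count missing values per column is `isna().sum()`. The sketch below uses a small made-up frame (`sample` is illustrative, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps, standing in for the real dataset (illustrative only)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [None, "C85", None],
})

# isna() marks missing cells as True; sum() counts them per column
missing = sample.isna().sum()
print(missing)
```

On the real dataframe the same call would be `df.isna().sum()`.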


# **Data Cleanup Exercises**
Use the appropriate pandas function or method to perform the tasks below

## **1. Replace male/female with boolean values**
Boolean values are binary values. You can use 1/0 or True/False.

In [4]:
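# Your solution goes in here
# drop_first=True drops the 'female' dummy and leaves 'male' (1 = male, 0 = female);
# select that single column and cast to int so the result is 1/0 rather than a DataFrame
df['Sex'] = pd.get_dummies(df['Sex'], drop_first=True)['male'].astype(int)
df.head()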

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
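
An explicit mapping is one alternative to `get_dummies` that makes the encoding visible in the code. A minimal sketch on a toy frame (the `sample` DataFrame is made up for illustration):

```python
import pandas as pd

# Toy data standing in for the Titanic 'Sex' column (illustrative only)
sample = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Map each label to 1/0 explicitly; any label missing from the dict would become NaN
sample["Sex"] = sample["Sex"].map({"male": 1, "female": 0})
print(sample["Sex"].tolist())
```

Which label gets `1` is a free choice; the mapping here matches the output above, where male passengers are `1`.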


## **2. Fill any missing age values with the average age of the passengers**

In [5]:
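# Your solution goes in here
mean_age = df['Age'].mean()  # mean() skips missing values by default
# fillna returns a new Series, so assign it back to the column
df['Age'] = df['Age'].fillna(mean_age)
df['Age']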

0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
         ...    
886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: Age, Length: 891, dtype: float64
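
Note that `fillna` returns a new Series and leaves the original untouched, so the result has to be assigned back to the column for the change to stick. A small sketch with made-up ages (`ages` is illustrative):

```python
import numpy as np
import pandas as pd

# Toy ages with two gaps (illustrative only)
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0], name="Age")

# mean() ignores NaN by default, so no dropna() is needed
filled = ages.fillna(ages.mean())

print(ages.isna().sum())    # the original Series still has its gaps
print(filled.isna().sum())  # the returned Series has none
```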

## **3. Use a regular expression on the `Ticket` column to extract just the ticket number**


**`A/5 21171`** should now be represented as **`21171`**

## Hi Mercy, 

I tried using the apply() method here, but it didn't work. I'm fairly sure the function itself is fine, since it's the same code I used with a for loop in the next cell, and that worked. Could you please show me what I'm doing wrong with the apply() method? Thanks

In [86]:
def ticket_cleaner(Ticket_column):
    for row in Ticket_column:
        pk=re.findall(r'\b\d+\b',str(row))
        if pk!=[]:
            return(pk[-1])
        else:
            return 0
dfs['Ticket']=dfs['Ticket'].apply(ticket_cleaner)
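
One likely cause: `Series.apply` calls the function once per element, so `ticket_cleaner` receives a single ticket string rather than the whole column. The `for` loop then walks over the characters of that string and returns after the first one. The cell also depends on `re` already being imported and on a dataframe named `dfs` existing. A sketch of a per-value version, using a small made-up `tickets` Series and a hypothetical function name `ticket_number`:

```python
import re
import pandas as pd

def ticket_number(ticket):
    """Return the last standalone run of digits in one ticket value, or 0 if none."""
    matches = re.findall(r"\b\d+\b", str(ticket))
    return int(matches[-1]) if matches else 0

# Sample values in the same format as the Ticket column (illustrative only)
tickets = pd.Series(["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "LINE"])

# apply() passes each value to the function, one at a time
cleaned = tickets.apply(ticket_number)
print(cleaned.tolist())
```

On the real dataframe this would be `df['Ticket'] = df['Ticket'].apply(ticket_number)`.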
        

## Here's the for loop that eventually worked

In [7]:
# Your solution goes in here
import re

# copy the dataframe (plain assignment would only create a second name for the same object)
dfk = df.copy()
# create an empty list to collect the extracted numbers
holder = []
# iterate through the column to extract numbers as strings
for row in dfk['Ticket']:
    pk = re.findall(r'\b\d+\b', str(row))  # all standalone runs of digits
    if pk:
        holder.append(pk[-1])  # keep the last run of digits
    else:
        holder.append(0)  # handle the cases where there are no numbers
# convert the extracted strings to integers
holder_int = [int(t) for t in holder]
holder_series = pd.Series(holder_int)  # convert from list to a Series
df['Ticket'] = holder_series  # attach back to the dataframe
df['Ticket']





0        21171
1        17599
2      3101282
3       113803
4       373450
        ...   
886     211536
887     112053
888       6607
889     111369
890     370376
Name: Ticket, Length: 891, dtype: int64
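
The same extraction can also be done without an explicit loop by using the pandas string accessor. This is a sketch under the assumption that the number of interest always sits at the end of the ticket string; `tickets` is a made-up sample:

```python
import pandas as pd

# Sample values in the same format as the Ticket column (illustrative only)
tickets = pd.Series(["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "LINE"])

# Capture the run of digits at the end of each string; rows without one become NaN
extracted = tickets.str.extract(r"(\d+)$", expand=False)

# Convert to numbers, use 0 where nothing was found, then cast to int
numbers = pd.to_numeric(extracted).fillna(0).astype(int)
print(numbers.tolist())
```

The pattern differs slightly from the `\b\d+\b` one used above: it only looks at the end of the string, so it is worth checking against the real column before relying on it.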

## **4. Use the Sklearn `Binarizer` on the `Fare` Column**
Any value **less than 40** should belong to the **`0`** class, and values **greater than 40** should belong to the **`1`** class.

In [8]:
# Binarizer expects 2-D input, so convert the column to a NumPy array first (it is reshaped below)
temp_fare = df['Fare'].to_numpy()

In [9]:
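# Your solution goes in here
from sklearn.preprocessing import Binarizer
# values greater than the threshold become 1, everything else becomes 0
binarizer = Binarizer(threshold=40)
df_fare_temp = binarizer.fit_transform(temp_fare.reshape(-1, 1))
df['Fare'] = df_fare_temp.ravel()  # flatten back to 1-D before assigning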

In [10]:
df['Fare']

0      0.0
1      1.0
2      0.0
3      1.0
4      0.0
      ... 
886    0.0
887    0.0
888    0.0
889    0.0
890    0.0
Name: Fare, Length: 891, dtype: float64
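
`Binarizer` maps values strictly greater than its `threshold` to `1` and values less than or equal to it to `0`, so the choice of threshold decides what happens near the boundary. A small sketch with made-up fares around 40:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up fares around the boundary, shaped as a single 2-D column
fares = np.array([7.25, 39.5, 40.0, 40.01, 71.2833]).reshape(-1, 1)

# Values strictly greater than 40 become 1; 40 itself and anything below become 0
binarized = Binarizer(threshold=40).fit_transform(fares)
print(binarized.ravel().tolist())
```

With a threshold of 39, a fare such as 39.5 would land in the `1` class even though it is below 40.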

## **5. Extract the titles from the names and add it to the dataframe as a new column**
Titles in the name column are **Mr., Mrs., Miss.**, etc. Extract the title from the name and add it into a new column. The column should only include Mr, Mrs, Miss, etc.

**Hints:** You may need to separate the work into several lines of code. You may need to chain together several functions such as **apply(), split(), lambda**. Check the Python documentation to learn more about each function.

In [12]:
# Your solution goes in here
title_container=[]
for k in df['Name']:
    full_string = k.split()
    title_container.append(full_string[1])
    
df['Titles']=title_container
df['Titles']

0        Mr.
1       Mrs.
2      Miss.
3       Mrs.
4        Mr.
       ...  
886     Rev.
887    Miss.
888    Miss.
889      Mr.
890      Mr.
Name: Titles, Length: 891, dtype: object
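
Taking the second whitespace-separated word works when the surname is a single word, but it would pick up part of the surname for names such as `Vander Planke, Mrs. Julius`. One possible alternative, following the `apply()` / `split()` / `lambda` hint, is to split on the comma first and then on the period. The sketch below uses a few sample names in the dataset's format (`names` is illustrative):

```python
import pandas as pd

# Sample names in "Surname, Title. Given names" format (illustrative only)
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)",
])

# Take the text after the comma, keep what precedes the first period, trim spaces
titles = names.apply(lambda name: name.split(",")[1].split(".")[0].strip())
print(titles.tolist())
```

This version also drops the trailing period, which matches the requirement that the column contain only `Mr`, `Mrs`, `Miss`, etc.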


You can refer back to the Notebook from my session [here](https://colab.research.google.com/drive/1dmMKVs2uOJIVuOF3G2RavLnebkwIQ_RN)

# **Resources**

**Here's a list of some resources that'll help with Feature Engineering:**
- [Feature Engineering Made Easy](https://github.com/PacktPublishing/Feature-Engineering-Made-Easy)
- [Tips Of Feature Engineering](https://github.com/Pysamlam/Tips-of-Feature-engineering)
- [Awesome Feature Engineering for Machine Learning](https://github.com/aikho/awesome-feature-engineering)