# Hands-On Lab 3 - Wrangling Strings

In this lab you will use the *pandas* library to wrangle the Titanic dataset. The findings of the previous lab will guide the wrangling performed.

### Step 1 - Load Data

The *titanic_train.csv* file is the training dataset. Run the following code cell to load the dataset.

In [None]:
import pandas as pd

# Load Titanic training data from CSV file
titanic_train = pd.read_csv('titanic_train.csv')
titanic_train.head()

### Step 2 - Exploring the *Name* Feature

Your Lab 1 EDA showed that the *Name* feature has potential as a machine learning feature. A deeper look at the *Name* data will confirm these initial findings. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
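
The lab's own code for this step is not reproduced here. As a minimal sketch of what a deeper look at *Name* could involve, the cell below prints the untruncated names and counts how many contain a few common titles. The `sample` DataFrame is a hand-typed stand-in for `titanic_train`, and the list of titles is an assumption chosen for illustration.

```python
import pandas as pd

# Hand-typed stand-in for titanic_train (illustrative rows, not the full dataset)
sample = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
             'Heikkinen, Miss. Laina',
             'Palsson, Master. Gosta Leonard'],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 2.0],
})

# Print the full names; pandas truncates long strings by default
with pd.option_context('display.max_colwidth', None):
    print(sample[['Name', 'Sex', 'Age']])

# Count how many names contain each title (assumed list, not exhaustive)
titles = ['Mr.', 'Mrs.', 'Miss.', 'Master.']
title_counts = {
    title: int(sample['Name'].str.contains(', ' + title, regex=False).sum())
    for title in titles
}
print(title_counts)
```

With the real data, swapping `sample` for `titanic_train` should show how the titles line up with *Sex* and, roughly, with *Age*.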

### Step 3 - Engineering a Proxy Feature

The analysis of the *Name* feature data shows that passenger titles are highly associated with both the *Sex* and *Age* features. Since every passenger has a title, this information can be used as a **proxy feature** for both *Sex* and *Age*.

Your Lab 1 data profiling showed that the *Age* feature has about 20% missing data values. While *Age* is a desirable feature for the machine learning model, it is just missing too much data to be used as-is.

However, if you can extract the passenger titles, you can provide some information to the machine learning algorithm about passenger ages by proxy.

**NOTE** - Be sure to copy and paste the following code, because the lab's next step builds on it.

```python
# Create the train_wrangled DataFrame
train_wrangled = (titanic_train
                    # map() yields an integer column without relying on replace() downcasting
                    .assign(Female = lambda df_: df_['Sex'].map({'female': 1, 'male': 0}),
                            Embarked = lambda df_: df_['Embarked'].fillna('S'),
                            PartySize = lambda df_: df_['SibSp'] + df_['Parch'] + 1,
                            PartyFare = lambda df_: df_['Fare'] / df_['PartySize'],
                            # Number of passengers sharing each ticket
                            TicketSize = lambda df_: df_.groupby('Ticket')['Ticket'].transform('size'))
                 )

train_wrangled.head()
```

### Step 4 - Splitting the Name Feature

The *Name* feature data follows a format where passenger surnames come first, followed by a comma, space, and title. The *split()* method can be used to iteratively split the *Name* feature to get at the passenger titles. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

**NOTE** - The next few steps of the lab will iteratively build up the code to arrive at the *Title* feature. Be sure to copy and paste code from one code cell to the next.

In [None]:
# Enter your lab code here
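
The lab's code for this step is not shown here. The following is a minimal sketch of the first split, assuming that *CommaSplit* holds the text after the comma (title plus given names). The `sample` DataFrame is a hand-typed stand-in for `train_wrangled`.

```python
import pandas as pd

# Hand-typed stand-in for train_wrangled (illustrative rows only)
sample = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
             'Heikkinen, Miss. Laina',
             'Palsson, Master. Gosta Leonard'],
})

# Split each name once on ', ' and keep the part after the surname.
# Assumption: CommaSplit stores the second piece of the split.
comma_split = sample.assign(
    CommaSplit = lambda df_: df_['Name'].str.split(', ', n=1).str[1]
)

print(comma_split['CommaSplit'].tolist())
```

Passing `n=1` limits the split to the first comma, so a name containing a second comma would still yield exactly two pieces.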

### Step 5 - Extracting the *Title* Feature

With the first split stored in the *CommaSplit* feature, another call to the *split()* method will create the new *Title* feature. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
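
As a sketch of the second split (again, not the lab's own code), the cell below rebuilds *CommaSplit* and then splits it on the period to obtain *Title*. It assumes the title is everything before the first period in *CommaSplit*. The `sample` DataFrame is a hand-typed stand-in for `train_wrangled`.

```python
import pandas as pd

# Hand-typed stand-in for train_wrangled (illustrative rows only)
sample = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
             'Heikkinen, Miss. Laina',
             'Palsson, Master. Gosta Leonard'],
    'Age': [22.0, 38.0, 26.0, 2.0],
})

with_title = sample.assign(
    # First split: keep the text after the surname and comma
    CommaSplit = lambda df_: df_['Name'].str.split(', ', n=1).str[1],
    # Second split: keep the text before the first period (assumed to be the title)
    Title = lambda df_: df_['CommaSplit'].str.split('.', n=1).str[0],
)

print(with_title[['Name', 'Title', 'Age']])
```

Later keyword arguments to `assign()` can refer to columns created by earlier ones in the same call, which is why *Title* can be built from *CommaSplit* in a single step.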

### Step 6 - Dropping Features

The wrangled data is looking great! Time to drop the unneeded features. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
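
The lab does not list here which features it drops. As a hedged sketch, the cell below assumes that the raw *Name* column and the intermediate *CommaSplit* column are no longer needed once *Title* exists, and removes them with `drop()`. The `sample` DataFrame is a hand-typed stand-in for `train_wrangled`.

```python
import pandas as pd

# Hand-typed stand-in for train_wrangled (illustrative rows only)
sample = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Heikkinen, Miss. Laina',
             'Palsson, Master. Gosta Leonard'],
    'Sex': ['male', 'female', 'male'],
    'Age': [22.0, 26.0, 2.0],
})

dropped = (sample
           .assign(CommaSplit = lambda df_: df_['Name'].str.split(', ', n=1).str[1],
                   Title = lambda df_: df_['CommaSplit'].str.split('.', n=1).str[0])
           # Assumed drop list: the raw and intermediate string columns
           .drop(columns = ['Name', 'CommaSplit'])
          )

print(dropped.columns.tolist())
```

In the full dataset, other columns would likely join the drop list; which ones depends on the lab's earlier findings.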

### Step 7 - Encoding Categorical Features

The last step is to encode the categorical features (i.e., *Pclass*, *Embarked*, and *Title*). Type the following code into the blank code cell in your lab notebook and run it to produce the results.

**NOTE** - The following cell contains all the code needed to perform the data wrangling for the lab. A best practice is to place a cell like this near the top of a notebook so that all the wrangling code is located in one place.

In [None]:
# Enter your lab code here
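
The lab does not state here which encoding it uses. The sketch below assumes one-hot encoding with `pd.get_dummies()`, one common option for categorical features. The `sample` DataFrame is a hand-typed stand-in for the wrangled data, with *Title* already extracted.

```python
import pandas as pd

# Hand-typed stand-in for the wrangled data (illustrative rows only)
sample = pd.DataFrame({
    'Pclass': [3, 1, 3, 3],
    'Embarked': ['S', 'C', 'S', 'S'],
    'Title': ['Mr', 'Mrs', 'Miss', 'Master'],
    'Female': [0, 1, 1, 0],
})

# One-hot encode the three categorical features (assumed encoding technique).
# dtype=int gives 0/1 columns instead of booleans.
encoded = pd.get_dummies(sample,
                         columns = ['Pclass', 'Embarked', 'Title'],
                         dtype = int)

print(encoded.columns.tolist())
```

Listing *Pclass* in `columns` matters because it is stored as an integer; without that, `get_dummies()` would leave it as a single numeric column.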