# Hands-On Lab 2 - pandas

In this lab you will use the *pandas* library to wrangle the Titanic dataset. The findings from the previous lab guide the wrangling you perform.

### Step 1 - Load Data

The *titanic_train.csv* file is the training dataset. Run the following code cell to load the dataset.

In [None]:
import pandas as pd

# Load Titanic training data from CSV file
titanic_train = pd.read_csv('titanic_train.csv')
titanic_train.head()

### Step 2 - Wrangle the *Female* Feature

As you learned during the lecture, the *scikit-learn* machine learning library only supports numeric features. In its raw form, the *Sex* feature is a binary categorical feature with string values. While this feature could be one-hot encoded, the encoding would produce two redundant features (i.e., *Sex_male* and *Sex_female*). A better strategy is to create a single new binary numeric feature indicating whether a passenger is female. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

**NOTE** - You will iteratively build up the data-wrangling code using a single call to the *assign()* method. Each code cell in the lab will add additional wrangling so you can see the progress step-by-step.

In [None]:
# Enter your lab code here
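
As a rough sketch of the idea (not necessarily the lab's exact solution), the cell above could build *Female* by comparing *Sex* to the string `'female'` inside *assign()*. The `toy` DataFrame and the `wrangled` name are hypothetical stand-ins so the example runs without the CSV file; in the lab you would call *assign()* on `titanic_train`.

```python
import pandas as pd

# Hypothetical toy rows standing in for titanic_train (not real data)
toy = pd.DataFrame({'Sex': ['male', 'female', 'female']})

# Female is 1 when Sex equals 'female', otherwise 0
wrangled = toy.assign(
    Female=lambda df_: (df_['Sex'] == 'female').astype(int)
)
print(wrangled)
```

Using a lambda inside *assign()* keeps the original DataFrame unchanged and makes it easy to add more features to the same call in later steps.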

### Step 3 - Wrangle the *Embarked* Feature

As you learned in the last lab, the *Embarked* feature has two missing values. A common strategy for handling missing data in a categorical feature is to **impute** (i.e., replace) the missing values with the most frequently occurring category (i.e., the **mode**). Type the following code into the blank code cell in your lab notebook and run it to produce the results.

**NOTE** - Be sure to copy and paste the code from the previous cell and extend the code within the *assign()* method call.

In [None]:
# Enter your lab code here
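
One possible way to express mode imputation, assuming `fillna()` combined with `mode()` is acceptable for the lab, is sketched here. The `toy` DataFrame is a hypothetical stand-in for `titanic_train`, and the *Female* line is carried over from Step 2 to show the *assign()* call growing.

```python
import pandas as pd

# Hypothetical toy rows standing in for titanic_train (not real data)
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male'],
    'Embarked': ['S', 'C', None, 'S', 'Q'],
})

wrangled = toy.assign(
    Female=lambda df_: (df_['Sex'] == 'female').astype(int),
    # mode() ignores missing values; take the first mode if there are ties
    Embarked=lambda df_: df_['Embarked'].fillna(df_['Embarked'].mode().iloc[0]),
)
print(wrangled)
```

Assigning to an existing column name (*Embarked*) overwrites that column in the returned DataFrame rather than adding a new one.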

### Step 4 - Wrangle the *PartySize* Feature

It is common to craft new features based on domain knowledge. Experimenting with features is the hallmark of crafting the most valuable machine learning models. Most of the features you engineer will not be useful. Despite the expected high failure rate, experimenting with features is where you can add the most value as a data scientist. For example, you can engineer a *PartySize* feature that provides the count of all the passengers traveling together on the Titanic. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

**NOTE** - Be sure to copy and paste the code from the previous cell and extend the code within the *assign()* method call.

In [None]:
# Enter your lab code here
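
The text above does not spell out how *PartySize* is computed. A minimal sketch, assuming it is defined as the passenger plus the siblings/spouses (*SibSp*) and parents/children (*Parch*) aboard, might look like the following. Both the definition and the `toy` DataFrame are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for titanic_train (not real data)
toy = pd.DataFrame({
    'SibSp': [1, 0, 3],
    'Parch': [0, 0, 2],
})

wrangled = toy.assign(
    # Assumed definition: relatives aboard plus the passenger themselves
    PartySize=lambda df_: df_['SibSp'] + df_['Parch'] + 1
)
print(wrangled)
```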

### Step 5 - Exploring the *Ticket* Feature

The last lab illustrated that the values of the *Ticket* feature are not unique. This lack of uniqueness implies that the same *Ticket* "numbers" are shared by multiple passengers. You can start to confirm this by aggregating the *Ticket* feature. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
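
Counting how often each *Ticket* value occurs is one way to perform this aggregation. The sketch below uses `value_counts()` on a hypothetical `toy` DataFrame; the lab's intended code may use a different aggregation such as `groupby()`.

```python
import pandas as pd

# Hypothetical toy rows standing in for titanic_train (not real data)
toy = pd.DataFrame({'Ticket': ['A1', 'A1', 'B2', 'A1', 'C3', 'B2']})

# Number of passengers per Ticket value, most shared first
ticket_counts = toy['Ticket'].value_counts()
print(ticket_counts.head())
```

Any count greater than 1 indicates a *Ticket* value shared by more than one passenger.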

### Step 6 - Exploring the *Ticket* Feature Continued

The last step illustrates that a single *Ticket* value is shared among multiple passengers. A closer look at the first couple of *Ticket* values is a good idea. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
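
To take that closer look, one approach is to filter the rows down to the most frequently shared *Ticket* values and inspect them side by side. This is only a sketch on a hypothetical `toy` DataFrame; the names `top_tickets` and `shared` are illustrative.

```python
import pandas as pd

# Hypothetical toy rows standing in for titanic_train (not real data)
toy = pd.DataFrame({
    'Ticket': ['A1', 'B2', 'A1', 'A1'],
    'Name': ['P1', 'P2', 'P3', 'P4'],
    'Fare': [30.0, 8.0, 30.0, 30.0],
})

# The two most frequently occurring Ticket values
top_tickets = toy['Ticket'].value_counts().head(2).index

# All passenger rows for those tickets, grouped together for inspection
shared = toy[toy['Ticket'].isin(top_tickets)].sort_values('Ticket')
print(shared)
```

When inspecting the output, compare the *Fare* values of the rows that share a *Ticket* value.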

### Step 7 - Wrangling the *PartyFare* Feature

The last step shows that *Ticket* "numbers" can be shared across groups of passengers. The last step also showed that the *Fare* paid for the ticket includes all passengers. This information can be used to engineer a new *PartyFare* feature that distributes the *Fare* across all the passengers traveling together. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

**NOTE** - Be sure to copy and paste the code from Step 4 and extend the code within the *assign()* method call.

In [None]:
# Enter your lab code here
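
Because the note above says to extend the Step 4 code, a plausible reading is that *PartyFare* divides *Fare* by *PartySize*. The sketch below makes that assumption, along with the assumed *PartySize* definition of *SibSp* + *Parch* + 1, on a hypothetical `toy` DataFrame.

```python
import pandas as pd

# Hypothetical toy rows standing in for titanic_train (not real data)
toy = pd.DataFrame({
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 0],
    'Fare': [20.0, 20.0, 8.0],
})

wrangled = toy.assign(
    # Assumed definition carried over from Step 4
    PartySize=lambda df_: df_['SibSp'] + df_['Parch'] + 1,
    # Later keyword arguments can use columns created earlier in the same call
    PartyFare=lambda df_: df_['Fare'] / df_['PartySize'],
)
print(wrangled)
```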

### Step 8 - Wrangling the *TicketSize* Feature

Based on the knowledge gleaned from Step 5, you can also engineer a *TicketSize* feature for the count of passengers sharing the same *Ticket* "number." Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
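
One way to attach a per-row count of passengers sharing a *Ticket* value is to map each value to its frequency. The following is a sketch on a hypothetical `toy` DataFrame and may differ from the lab's intended approach.

```python
import pandas as pd

# Hypothetical toy rows standing in for titanic_train (not real data)
toy = pd.DataFrame({'Ticket': ['A1', 'B2', 'A1', 'A1']})

wrangled = toy.assign(
    # Look up, for every row, how many rows share that row's Ticket value
    TicketSize=lambda df_: df_['Ticket'].map(df_['Ticket'].value_counts())
)
print(wrangled)
```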

### Step 9 - Dropping Features

Your EDA from Lab 1 showed that many features are either not useful at all (e.g., *PassengerId*) or require additional feature engineering (e.g., *Name*). At this stage of the course, dropping these features is reasonable. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
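
Dropping columns can be sketched with `drop(columns=...)`. Only the two features named in the text are dropped here; the full list used in the lab is not given above, so treat this list and the `toy` DataFrame as assumptions.

```python
import pandas as pd

# Hypothetical toy rows standing in for the wrangled data (not real data)
toy = pd.DataFrame({
    'PassengerId': [1, 2],
    'Name': ['P1', 'P2'],
    'Fare': [7.25, 71.28],
    'Female': [0, 1],
})

# drop() returns a new DataFrame, so it can be chained after assign()
trimmed = toy.drop(columns=['PassengerId', 'Name'])
print(trimmed)
```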

### Step 10 - Encoding Categorical Features

With the initial set of features complete, the last step is to one-hot encode the categorical features that are not binary (i.e., *Pclass* and *Embarked*). Type the following code into the blank code cell in your lab notebook and run it to produce the results.

**NOTE** - The following cell contains all the code needed to perform the data wrangling for the lab. A best practice is to place a cell like this near the top of a notebook so that all the wrangling code lives in one place.

In [None]:
# Enter your lab code here
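
One-hot encoding the non-binary categorical features could be done with `pd.get_dummies()`, as in this sketch on a hypothetical `toy` DataFrame. In the lab, the call would come at the end of the full wrangling chain built up in the earlier steps.

```python
import pandas as pd

# Hypothetical toy rows standing in for the wrangled data (not real data)
toy = pd.DataFrame({
    'Pclass': [1, 3, 3],
    'Embarked': ['S', 'C', 'S'],
    'Female': [0, 1, 1],
})

# Replace Pclass and Embarked with one indicator column per category
encoded = pd.get_dummies(toy, columns=['Pclass', 'Embarked'])
print(encoded.columns.tolist())
```

Columns not listed in `columns=` (here *Female*) pass through unchanged, while each listed column is replaced by indicator columns named with the original column name as a prefix.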