<a href="https://colab.research.google.com/github/ML-Challenge/week5-preprocessing-and-tunning/blob/master/L1.Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" /></a>

This lesson covers the basics of how and when to perform data preprocessing, the essential step in any machine learning project where we get our data ready for modeling. Preprocessing comes into play after importing and cleaning the data and before fitting the machine learning model. We'll learn how to standardize the data so that it's in the right form for the model, create new features to best leverage the information in the dataset, and select the best features to improve the model fit. Finally, we'll practice preprocessing by getting a dataset on UFO sightings ready for modeling.

# Setup

In [None]:
# Download lesson datasets
# Required if you're using Google Colab
#!wget "https://github.com/ML-Challenge/week5-preprocessing-and-tunning/raw/master/datasets.zip"
#!unzip -o datasets.zip

In [None]:
# Import utils
# We'll be using this module throughout the lesson
import utils

In [None]:
# Import dependencies
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
# Set the default size of all plots
plt.rcParams['figure.figsize'] = [11, 7]

# Introduction to Data Preprocessing

In this lesson, we'll learn exactly what it means to `preprocess` data. We'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data.

## What is data preprocessing?

### Missing data - columns

We have a dataset comprised of volunteer information from New York City. The dataset has a number of features, but we want to get rid of features that have too few usable values. Note that the `thresh` argument of `dropna()` counts non-missing values: `thresh=3` keeps only the columns that have at least 3 non-missing values.

How many features are in the original dataset, and how many features are in the set after columns with fewer than 3 non-missing values are removed?

In [None]:
utils.volunteer.info()

In [None]:
volunteer_clean = utils.volunteer.dropna(axis=1, thresh=3)
volunteer_clean.info()
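
Because the meaning of `thresh` is easy to misread, here is a minimal sketch on a small made-up DataFrame (the `toy` data and variable names are hypothetical, not part of the volunteer dataset). It contrasts `thresh`, which counts non-missing values, with an explicit count of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the volunteer dataset
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],        # 4 non-missing values
    "b": [1.0, np.nan, np.nan, 4.0],  # 2 non-missing values
    "c": [np.nan, 2.0, 3.0, 4.0],     # 3 non-missing values
})

# thresh=3 keeps columns that have at least 3 NON-missing values
kept_by_thresh = toy.dropna(axis=1, thresh=3)

# To drop columns with at least 3 MISSING values instead, count the nulls
missing_counts = toy.isnull().sum()
kept_by_missing = toy.loc[:, missing_counts < 3]

print(list(kept_by_thresh.columns))
print(list(kept_by_missing.columns))
```

Column `b` is dropped by `thresh=3` because it has only 2 non-missing values, yet it survives the second filter because it has only 2 missing values.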

### Missing data - rows

Taking a look at the `volunteer` dataset again, we want to drop rows where the `category_desc` column values are missing. We're going to do this using boolean indexing: we check which values in the column are not null, and then filter the dataset so that we keep only the rows where `category_desc` is present.

In [None]:
# Check how many values are missing in the category_desc column
print(utils.volunteer['category_desc'].isnull().sum())

In [None]:
# Subset the volunteer dataset
volunteer_subset = utils.volunteer[utils.volunteer['category_desc'].notnull()]

In [None]:
# Print out the shape of the subset
print(volunteer_subset.shape)

> **Note**: we can use boolean indexing to effectively subset DataFrames.
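
As a small illustration of boolean indexing, the sketch below builds a mask on a made-up DataFrame (the `toy` data is hypothetical) and compares it with `dropna(subset=...)`, which should give the same rows here.

```python
import pandas as pd

# Hypothetical toy data with one missing category
toy = pd.DataFrame({
    "title": ["Park cleanup", "Tutoring", "Food drive"],
    "category_desc": ["Environment", None, "Helping Neighbors in Need"],
})

# Boolean mask: True where category_desc is present
mask = toy["category_desc"].notnull()
subset_mask = toy[mask]

# Equivalent result using dropna restricted to one column
subset_dropna = toy.dropna(subset=["category_desc"])

print(subset_mask.shape)
```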

## Working with data types

### Exploring data types

Taking another look at the dataset comprised of volunteer information from New York City, we want to know what types we'll be working with as we start to do more preprocessing. Which data types are present in the volunteer dataset?

In [None]:
utils.volunteer.dtypes

We have int, float, and object (string) columns.

### Converting a column type

If we take a look at the `volunteer` dataset types, we'll see that the column `hits` is type object. But, if we actually look at the column, we'll see that it consists of integers. Let's convert that column to type int.

In [None]:
# Print the head of the hits column
print(utils.volunteer["hits"].head())

We can use `astype()` to convert between a variety of types.

In [None]:
# Convert the hits column to type int
utils.volunteer["hits"] = utils.volunteer["hits"].astype(int)

In [None]:
# Look at the dtypes of the dataset
print(utils.volunteer.dtypes)
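
`astype(int)` raises a `ValueError` if any value cannot be parsed as an integer. As a hedged alternative for messier columns, the sketch below uses `pd.to_numeric` with `errors="coerce"` on made-up values (the `hits` Series here is hypothetical, not the real column).

```python
import pandas as pd

# Hypothetical string column with one value that is not a number
hits = pd.Series(["737", "22", "n/a", "62"])

# Unparseable entries become NaN instead of raising an error
hits_num = pd.to_numeric(hits, errors="coerce")

# The nullable Int64 dtype keeps integers while allowing missing values
hits_int = hits_num.astype("Int64")

print(hits_int)
```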

## Class distribution

### Class imbalance

In the `volunteer` dataset, we're thinking about trying to predict the `category_desc` variable using the other features in the dataset. First, though, we need to know what the class distribution (and imbalance) is for that label.

Which descriptions occur fewer than 50 times in the volunteer dataset?

In [None]:
utils.volunteer['category_desc'].value_counts()

Both Emergency Preparedness and Environment occur fewer than 50 times.

### Stratified sampling

We know that the distribution of classes in the `category_desc` column in the `volunteer` dataset is uneven. If we wanted to train a model to try to predict `category_desc`, we would want to train the model on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this.

In [None]:
# Create a DataFrame with all columns except category_desc
volunteer_X = utils.volunteer.drop('category_desc', axis=1)

In [None]:
# Create a category_desc labels dataset
volunteer_y = utils.volunteer[['category_desc']]

In [None]:
from sklearn.model_selection import train_test_split

# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y, random_state=42)

In [None]:
# Print out the category_desc counts on the training y labels
print(y_train['category_desc'].value_counts())
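
To see what `stratify` does, here is a minimal sketch on made-up labels with an 80/20 class split (the `toy` data is hypothetical). With stratification, both the training and test sets should keep roughly the same class proportions as the full dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 "common" rows and 20 "rare" rows
toy = pd.DataFrame({
    "feature": range(100),
    "label": ["common"] * 80 + ["rare"] * 20,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["feature"]], toy["label"],
    test_size=0.25, stratify=toy["label"], random_state=42,
)

# Class shares in each split
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)

print(train_share)
print(test_share)
```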

# Standardizing Data

This part of the lesson is all about standardizing data. Often a model will make some assumptions about the distribution or scale of the features. Standardization is a way to make the data fit these assumptions and improve the algorithm's performance.

## What is data standardization?

### When to standardize

Now that we've learned when it is appropriate to standardize our data, which of these scenarios would we **NOT** want to standardize?

**Possible Answers**

1. A column we want to use for modeling has extremely high variance.
2. We have a dataset with several continuous columns on different scales and we'd like to use a linear model to train the data.
3. The models we're working with use some sort of distance metric in a linear space, like the Euclidean metric.
4. Our dataset is comprised of categorical data.

In [None]:
# Use 1,2,3 or 4 as a parameter
utils.when_to_standardize()

### Modeling without normalizing

Let's take a look at what might happen to our model's accuracy if we try to model data without doing some sort of standardization first. Here we have a subset of the `wine` dataset. One of the columns, `Proline`, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which we'll learn about later in this lesson.

In [None]:
utils.wine.head()

In [None]:
# Create a subset of data
wine_X = utils.wine[['Proline', 'Total phenols', 'Hue', 'Nonflavanoid phenols']]

In [None]:
# Create a Type labels dataset
wine_y = utils.wine['Type']

In [None]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(wine_X, wine_y, stratify=wine_y, random_state=42)

In [None]:
# Fit the k-nearest neighbors model to the training data
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

In [None]:
# Score the model on the test data
print(knn.score(X_test, y_test))

We can see that the accuracy score is pretty low. Let's explore methods to improve this score.

## Log normalization

### Checking the variance

Let's check the variance of the columns in the wine dataset and see which column is a candidate for normalization.

In [None]:
utils.wine.var()

We can see that the `Proline` column has an extremely high variance.

### Log normalization in Python

Now that we know that the `Proline` column in our wine dataset has a large amount of variance, let's log normalize it.

In [None]:
# Print out the variance of the Proline column
print(utils.wine['Proline'].var())

In [None]:
# Apply the log normalization function to the Proline column
utils.wine['Proline_log'] = np.log(utils.wine['Proline'])

In [None]:
# Check the variance of the normalized Proline column
print(utils.wine['Proline_log'].var())

The `np.log()` function is an easy way to log normalize a column.
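
The effect of log normalization on variance can be sketched with synthetic, right-skewed data. The values below are randomly generated and only stand in for a column like `Proline`.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data drawn from a log-normal distribution
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=6.0, sigma=1.0, size=1000))

# Taking the natural log compresses the large values
logged = np.log(skewed)

var_before = skewed.var()
var_after = logged.var()

print(var_before, var_after)
```

Note that `np.log()` is only defined for positive values, so a column containing zeros or negatives needs to be handled before this step.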

In [None]:
# Create a subset of data
wine_X = utils.wine[['Proline_log', 'Total phenols', 'Hue', 'Nonflavanoid phenols']]

# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(wine_X, wine_y, stratify=wine_y, random_state=42)

In [None]:
# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

In [None]:
# Score the model on the test data
print(knn.score(X_test, y_test))

## Scaling data for feature comparison

### Scaling data - investigating columns

We want to use the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the wine dataset to train a linear model, but it's possible that these columns are all measured in different ways, which would bias a linear model. Using `describe()` to return descriptive statistics about this dataset, which of the following statements are true about the scale of data in these columns?

**Possible Answers**

1. The max of `Ash` is 3.23, the max of `Alcalinity of ash` is 30, and the max of `Magnesium` is 162.
2. The means of `Ash` and `Alcalinity of ash` are less than 20, while the mean of `Magnesium` is greater than 90.
3. The standard deviations of `Ash` and `Alcalinity of ash` are equal.
4. 1 and 2 are true

In [None]:
utils.wine.describe()

### Scaling data - standardizing columns

Since we know that the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the wine dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.

In [None]:
# Import StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler

In [None]:
# Create the scaler
ss = StandardScaler()

In [None]:
# Take a subset of the DataFrame we want to scale 
wine_subset = utils.wine[['Ash', 'Alcalinity of ash', 'Magnesium']]

In [None]:
# Apply the scaler to the DataFrame subset
wine_subset_scaled = ss.fit_transform(wine_subset)

In [None]:
pd.DataFrame(wine_subset_scaled, columns=['Ash', 'Alcalinity of ash', 'Magnesium']).describe()

In scikit-learn, running `fit_transform` during preprocessing will both fit the method to the data as well as transform the data in a single step.
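
Under the hood, `StandardScaler` subtracts each column's mean and divides by its standard deviation, $z = (x - \mu) / \sigma$, using the population standard deviation. The sketch below checks this on a small made-up array (the `data` values are hypothetical).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two columns on very different scales
data = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
    [4.0, 400.0],
])

# Scale with scikit-learn
scaled = StandardScaler().fit_transform(data)

# Same computation by hand; np.std defaults to the population formula (ddof=0)
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, each column has a mean of 0 and a standard deviation of 1, which is what makes the columns comparable.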

## Standardized data and modeling

### KNN on non-scaled data

Let's first take a look at the accuracy of a K-nearest neighbors model on the `wine` dataset without standardizing the data.

In [None]:
X = utils.wine.drop(['Type', 'Proline_log'], axis=1)
y = utils.wine['Type']

In [None]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, stratify=y, random_state=42)

In [None]:
# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

In [None]:
# Score the model on the test data
print(knn.score(X_test, y_test))

### KNN on scaled data

The accuracy score on the unscaled `wine` dataset was decent, but we can likely do better if we scale the dataset. The process is mostly the same as the previous example, with the added step of scaling the data.

In [None]:
# Create the scaling method.
ss = StandardScaler()

In [None]:
# Apply the scaling method to the dataset used for modeling.
X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, stratify=y, random_state=42)

In [None]:
# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

In [None]:
# Score the model on the test data
print(knn.score(X_test, y_test))

The increase in accuracy is worth the extra step of scaling the dataset.
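
One caveat: fitting the scaler on the full dataset before splitting lets information from the test rows influence the scaling. A common way to avoid this is to fit the scaler on the training data only, for example with a pipeline. The sketch below is one possible approach, not the lesson's own method; it uses scikit-learn's bundled `load_wine` data as a stand-in, on the assumption that it resembles `utils.wine`.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption: similar to utils.wine, but not guaranteed identical)
X_demo, y_demo = load_wine(return_X_y=True)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training data only,
# then reuses those training statistics to transform the test data
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_tr, y_tr)

pipeline_score = pipe.score(X_te, y_te)
print(pipeline_score)
```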

# Feature Engineering

In this section we'll learn about feature engineering. We'll explore different ways to create new, more useful, features from the ones already in the dataset. We'll see how to encode, aggregate, and extract information from both numerical and textual features.

## What is feature engineering?

### Feature engineering knowledge test

Now that we've learned about feature engineering, which of the following examples are good candidates for creating new features?

**Possible Answers**

1. A column of timestamps
2. A column of newspaper headlines
3. A column of weight measurements
4. 1 and 2
5. None of the above

In [None]:
# Use 1,2,3,4 or 5 as parameter
utils.feature_engineering_puzzle()

### Identifying areas for feature engineering

Let's take an exploratory look at the `volunteer` dataset. Which of the following columns would we want to perform a feature engineering task on?

**Possible Answers**

1. `vol_requests`
2. `title`
3. `created_date`
4. `category_desc`
5. 2,3 and 4

In [None]:
utils.volunteer.head()

In [None]:
# Use 1,2,3,4 or 5 as parameter
utils.identity_features_puzzle()

## Encoding categorical variables

### Encoding categorical variables - binary

Let's take a look at the `hiking` dataset. Several of its columns need encoding before they can be modeled, one of which is the `Accessible` column. `Accessible` is a binary feature with two values, `Y` or `N`, so it needs to be encoded into 1s and 0s. We'll use scikit-learn's `LabelEncoder` class to do that transformation.

In [None]:
utils.hiking.head()

In [None]:
# Import dependencies
from sklearn.preprocessing import LabelEncoder

In [None]:
# Set up the LabelEncoder object
enc = LabelEncoder()

In [None]:
# Apply the encoding to the "Accessible" column
utils.hiking['Accessible_enc'] = enc.fit_transform(utils.hiking['Accessible'])

In [None]:
# Compare the two columns
utils.hiking[['Accessible', 'Accessible_enc']].head()

Nice! `.fit_transform()` is a good way to both fit an encoding and transform the data in a single step.
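
It helps to know which label gets which number. `LabelEncoder` sorts the classes, so with `Y` and `N` the value `N` becomes 0 and `Y` becomes 1. Here is a minimal sketch on made-up values (the `toy` Series and `enc_demo` name are hypothetical).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary column
toy = pd.Series(["Y", "N", "N", "Y"])

enc_demo = LabelEncoder()
encoded = enc_demo.fit_transform(toy)

# classes_ holds the sorted labels; the position is the encoded value
print(list(enc_demo.classes_))
print(list(encoded))

# inverse_transform maps the numbers back to the original labels
decoded = enc_demo.inverse_transform(encoded)
```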

### Encoding categorical variables - one-hot

One of the columns in the volunteer dataset, `category_desc`, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to transform this column numerically. We'll use Pandas' `get_dummies()` function to do so.

In [None]:
# Transform the category_desc column
category_enc = pd.get_dummies(utils.volunteer['category_desc'])

In [None]:
# Take a look at the encoded columns
category_enc.head()

`get_dummies()` is a simple and quick way to encode categorical variables.
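
The encoded columns usually need to be joined back to the rest of the data. The sketch below shows one way to do that on a made-up DataFrame (the `toy` data and the `cat` prefix are hypothetical choices).

```python
import pandas as pd

# Hypothetical data with a three-row categorical column
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "category_desc": ["Health", "Education", "Health"],
})

# One column per category; the prefix keeps the new names easy to spot
dummies = pd.get_dummies(toy["category_desc"], prefix="cat")

# Replace the original column with its encoded version
toy_encoded = pd.concat([toy.drop("category_desc", axis=1), dummies], axis=1)

print(toy_encoded)
```

Each row has exactly one active dummy column, since every row belongs to exactly one category.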

## Engineering numerical features

### Engineering numerical features - taking an average

A good use case for taking an aggregate statistic to create a new feature is to take the mean of columns. Here, we have a DataFrame of running times named `running_times_5k`. For each `name` in the dataset, take the mean of their 5 run times.

In [None]:
utils.running_times_5k

In [None]:
# Create a list of the columns to average
run_columns = ['run1', 'run2', 'run3', 'run4', 'run5']

In [None]:
# Use apply to create a mean column
utils.running_times_5k["mean"] = utils.running_times_5k.apply(lambda row: row[run_columns].mean(), axis=1)

In [None]:
# Take a look at the results
utils.running_times_5k

Lambdas are especially helpful for operating across columns.
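
For a simple aggregate like the mean, pandas can also compute across columns directly with `mean(axis=1)`, which avoids calling a Python function per row. The sketch below compares both approaches on made-up run times (the `runs` data is hypothetical).

```python
import pandas as pd

# Hypothetical run times for two runners
runs = pd.DataFrame({
    "name": ["Sue", "Mark"],
    "run1": [20.1, 24.5],
    "run2": [18.5, 25.8],
    "run3": [19.6, 27.1],
})
cols = ["run1", "run2", "run3"]

# Row-wise lambda, as in the lesson
mean_apply = runs.apply(lambda row: row[cols].mean(), axis=1)

# Vectorized alternative: select the columns, then average across them
mean_vectorized = runs[cols].mean(axis=1)

print(mean_vectorized)
```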

### Engineering numerical features - datetime

There are several columns in the `volunteer` dataset comprised of datetimes. Let's take a look at the `start_date_date` column and extract just the month to use as a feature for modeling.

In [None]:
utils.volunteer[['start_date_date', 'end_date_date']].head()

In [None]:
# First, convert string column to date column
utils.volunteer["start_date_converted"] = pd.to_datetime(utils.volunteer["start_date_date"])

In [None]:
utils.volunteer[['start_date_date', 'start_date_converted']].head()

In [None]:
# Extract just the month from the converted column
utils.volunteer['start_date_month'] = utils.volunteer['start_date_converted'].apply(lambda row: row.month)

In [None]:
# Take a look at the converted and new month columns
utils.volunteer[['start_date_date', 'start_date_converted', 'start_date_month']].head()

We can also use attributes like `.day` to get the day and `.year` to get the year from datetime columns.
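
On a whole datetime column, the same attributes are available through the `.dt` accessor, which avoids the row-wise lambda. A minimal sketch on made-up dates (the `dates` Series is hypothetical):

```python
import pandas as pd

# Hypothetical date strings in ISO format
dates = pd.Series(["2011-07-30", "2011-02-01", "2011-01-29"])
converted = pd.to_datetime(dates)

# The .dt accessor exposes datetime attributes for the whole column
months = converted.dt.month
days = converted.dt.day
years = converted.dt.year

print(list(months))
```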

## Text classification

### Engineering features from strings - extraction

The `Length` column in the `hiking` dataset is a column of strings, but contained in the column is the mileage for the hike. We're going to extract this mileage using regular expressions, and then use a lambda in Pandas to apply the extraction to the DataFrame.

In [None]:
utils.hiking.head()

In [None]:
# Write a pattern to extract numbers and decimals
import re

def return_mileage(length):
    if length is not None:
        pattern = re.compile(r"\d+\.\d+")

        # Match the pattern at the start of the text (re.match only matches from the beginning)
        mile = re.match(pattern, length)

        # If a value is returned, use group(0) to return the found value
        if mile is not None:
            return float(mile.group(0))

In [None]:
# Apply the function to the Length column
utils.hiking["Length_num"] = utils.hiking["Length"].apply(return_mileage)

In [None]:
# Take a look at both columns
utils.hiking[["Length", "Length_num"]].head()
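
Two details of this kind of extraction are worth a quick sketch: `re.match` only matches at the start of the string, while `re.search` scans the whole string, and missing values may arrive as `NaN` (a float) rather than `None`. The helper below, `extract_mileage`, is a hypothetical variant for illustration, not a replacement for the lesson's function.

```python
import re

# Same pattern as in the lesson: digits, a dot, then more digits
MILEAGE = re.compile(r"\d+\.\d+")

def extract_mileage(text):
    """Return the first decimal number found in text, or None."""
    # Skip None, NaN, and any other non-string value
    if not isinstance(text, str):
        return None
    # search scans the whole string, not just the beginning
    found = MILEAGE.search(text)
    return float(found.group(0)) if found else None

print(extract_mileage("0.8 miles"))
print(extract_mileage("About 1.25 miles"))
print(re.match(MILEAGE, "About 1.25 miles"))
```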

### Engineering features from strings - tf/idf

Let's transform the `volunteer` dataset's `title` column into a text vector, to use in a prediction task in the next example.

In [None]:
# Take the title text
title_text = utils.volunteer.title

In [None]:
# Create the vectorizer method
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()

In [None]:
# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)
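
To get a feel for what `fit_transform` returns here, the sketch below vectorizes three made-up titles (the `titles` list and `vec_demo` name are hypothetical). The result is a sparse matrix with one row per document and one column per vocabulary word.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical short documents
titles = [
    "park cleanup volunteers",
    "volunteers for tutoring",
    "food drive",
]

vec_demo = TfidfVectorizer()
matrix = vec_demo.fit_transform(titles)

# Rows are documents, columns are vocabulary words
print(matrix.shape)

# vocabulary_ maps each word to its column index
print(sorted(vec_demo.vocabulary_))
```

With the default settings, each row is scaled to unit length, so weights are comparable across documents of different sizes.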

### Text classification using tf/idf vectors

Now that we've encoded the `volunteer` dataset's `title` column into tf/idf vectors, let's use those vectors to try to predict the `category_desc` column. Notice that we have to run the `toarray()` method on the tf/idf vector in order to get it into the proper format for scikit-learn.

In [None]:
# Split the dataset according to the class distribution of category_desc
y = utils.volunteer["category_desc"]
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y, random_state=42)

In [None]:
# Let's use Naive Bayes
from sklearn.naive_bayes import GaussianNB

# Fit the model to the training data
nb = GaussianNB()
nb.fit(X_train, y_train)

In [None]:
# Print out the model's accuracy
nb.score(X_test, y_test)

Notice that the model doesn't score very well. We'll work on selecting the best features for modeling in the next part of the lesson.

# Selecting features for modeling

This section goes over a few different techniques for selecting the most important features from the dataset. We'll learn how to drop redundant features, work with text vectors, and reduce the number of features in the dataset using principal component analysis (PCA).

## Feature selection

### When to use feature selection

Let's say we had finished standardizing our data and creating new features. Which of the following scenarios is **NOT** a good candidate for feature selection?

**Possible Answers**

1. Several columns of running times that have been averaged into a new column.
2. A text field that hasn't been turned into a tf/idf vector yet.
3. A column of text that has already had a float extracted out of it.
4. A categorical field that has been one-hot encoded.
5. The dataset contains columns related to whether something is a fruit or vegetable, the name of the fruit or vegetable, and the scientific name of the plant.

In [None]:
# Use 1,2,3,4 or 5 as parameter
utils.feature_selction_puzzle()

## Removing redundant features

### Selecting relevant features

Now let's identify the redundant columns in the `volunteer_processed` dataset and perform feature selection on the dataset to return a DataFrame of the relevant features.

For example, if we explore the volunteer dataset, we'll see three features which are related to location: `locality`, `region`, and `postalcode`. They contain repeated information, so it would make sense to keep only one of the features.

In [None]:
utils.volunteer[['locality', 'region', 'postalcode']].head(10)

There are also features that have gone through the feature engineering process: columns like 'Education' and 'Emergency Preparedness' are a product of encoding the categorical variable `category_desc`, so `category_desc` itself is redundant now.

In [None]:
utils.volunteer_processed.head()

In [None]:
utils.volunteer_processed.info()

In [None]:
# Create a list of redundant column names to drop
to_drop = ["category_desc", "created_date", "locality", "region", "vol_requests"]

In [None]:
# Drop those columns from the dataset
volunteer_subset = utils.volunteer_processed.drop(to_drop, axis=1)

In [None]:
# Print out the head of the new dataset
volunteer_subset.head()

### Checking for correlated features

Let's take a look at the wine dataset again, which is made up of continuous, numerical features. We'll compute Pearson's correlation coefficient between the columns to determine which ones are good candidates for elimination, and then remove those columns from the DataFrame.

In [None]:
# Print out the column correlations of the wine dataset
corr_matrix = utils.wine.corr()
corr_matrix

In [None]:
# Take a minute to find the column where the correlation value is greater than 0.75 at least twice or run the following code
corrs = corr_matrix.abs().unstack().sort_values(kind='quicksort', ascending=False)
corrs[(corrs>0.75) & (corrs<1.0)]
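
Because a correlation matrix is symmetric, the unstacked view lists every pair twice. One way to list each pair once is to keep only the upper triangle before filtering. The sketch below does this on synthetic data (the `toy` columns are hypothetical and randomly generated).

```python
import numpy as np
import pandas as pd

# Hypothetical data: x_noisy is nearly a copy of x, unrelated is independent
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "x": base,
    "x_noisy": base + rng.normal(scale=0.1, size=200),
    "unrelated": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the entries above the diagonal, so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (column, column) pairs and keep the highly correlated ones
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.75]

print(high_pairs)
```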

In [None]:
# Flavanoids has high correlation with Total phenols and OD280/OD315 of diluted wines
# Proline is already redundant because of Proline_log
# We don't drop Type because it is the target column
to_drop = ["Flavanoids", "Proline"]

In [None]:
# Drop those columns from the DataFrame
utils.wine = utils.wine.drop(to_drop, axis=1)
utils.wine.head()

In [None]:
utils.wine.corr()

## Selecting features using text vectors

### Exploring text vectors, part 1

Let's expand on the text vector exploration method we just learned about, using the `volunteer` dataset's title tf/idf vectors. In this first part of text vector exploration, we're going to add to that function we learned about above. We'll return a list of numbers with the function. In the next example, we'll write another function to collect the top words across all documents, extract them, and then use that list to filter down our `text_tfidf` vector.

In [None]:
# Add in the rest of the parameters
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Let's transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Let's sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

In [None]:
# Print out the weighted words
vocab = {v:k for k,v in tfidf_vec.vocabulary_.items()}
return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, 8, 3)

### Exploring text vectors, part 2

Using the function we wrote in the previous example, we're going to extract the top words from each document in the text vector, return a list of the word indices, and use that list to filter the text vector down to those top words.

In [None]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Here we'll call the function from the previous example, and extend the list we're creating
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

In [None]:
# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# By converting filtered_words back to a list, we can use it to filter the columns in the text vector
filtered_text = text_tfidf[:, list(filtered_words)]

### Training Naive Bayes with feature selection

Let's re-run the Naive Bayes text classification model from earlier, with our selection choices from the previous example, on the `volunteer` dataset's `title` and `category_desc` columns.

In [None]:
# Split the dataset according to the class distribution of category_desc
y = utils.volunteer["category_desc"]
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y, random_state=42)

In [None]:
# Fit the model to the training data
nb.fit(train_X, train_y)

In [None]:
# Print out the model's accuracy
nb.score(test_X, test_y)

We can see that our accuracy score wasn't that different from the previous one. That's okay; the title field is a very small text field, appropriate for demonstrating how filtering vectors works.

## Dimensionality reduction

### Using PCA

Let's apply PCA to the wine dataset, to see if we can get an increase in our model's accuracy.

In [None]:
from sklearn.decomposition import PCA

In [None]:
# Set up PCA and the X vector for dimensionality reduction
pca = PCA()
wine_X = utils.wine.drop("Type", axis=1)

In [None]:
# Apply PCA to the wine dataset X vector
transformed_X = pca.fit_transform(wine_X)

In [None]:
# Look at the percentage of variance explained by the different components
pca.explained_variance_ratio_
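
A common way to read `explained_variance_ratio_` is to look at its cumulative sum and count how many components are needed to reach a chosen share of the variance. The sketch below uses synthetic data built from two hidden factors; the data and the 95% cutoff are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 6 observed columns driven by 2 hidden factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
data = latent @ mixing + rng.normal(scale=0.05, size=(300, 6))

pca_demo = PCA().fit(data)

# Running total of explained variance, component by component
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%
n_needed = int(np.argmax(cumulative >= 0.95)) + 1

print(cumulative)
print(n_needed)
```

Keep in mind that PCA is sensitive to feature scale, so columns on very different scales are often standardized before applying it.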

In the next section we'll train a model using the PCA-transformed vector.

### Training a model with PCA

Now that we have run PCA on the `wine` dataset, let's try training a model with it.

In [None]:
# Split the transformed X and the y labels into training and test sets
wine_y = utils.wine["Type"]
X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(transformed_X, wine_y, random_state=42) 

In [None]:
# Fit knn to the training data
knn = KNeighborsClassifier()
knn.fit(X_wine_train, y_wine_train)

In [None]:
# Score knn on the test data and print it out
knn.score(X_wine_test, y_wine_test)

# Putting it all together

Now that we've learned all about preprocessing we'll try these techniques out on a dataset that records information on UFO sightings.

## UFOs and preprocessing

### Checking column types

Let's take a look at the UFO dataset's column types using the `dtypes` attribute. One column jumps out for transformation: the `date` column, which can be transformed into the `datetime` type. That will make our feature engineering efforts easier later on.

In [None]:
ufo = pd.read_csv('data/ufo_sightings_large.csv')
ufo.head()

In [None]:
ufo.dtypes

In [None]:
# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo["date"])

### Dropping missing data

Let's remove some of the rows where certain columns have missing values. We're going to look at the `length_of_time` column, the `state` column, and the `type` column. If any of the values in these columns are missing, we're going to drop the rows.

In [None]:
ufo.info()

In [None]:
# Check how many values are missing in the length_of_time, state, and type columns
ufo[['length_of_time', 'state', 'type']].isnull().sum()

In [None]:
# Keep only rows where length_of_time, state, and type are not null
ufo = ufo[ufo["length_of_time"].notnull() & ufo["state"].notnull() & ufo["type"].notnull()]
ufo.reset_index(drop=True, inplace=True)

In [None]:
# Print a summary of the new dataset
ufo.info()

## Categorical variables and standardization

### Extracting numbers from strings

The `length_of_time` field in the UFO dataset is a text field that has the number of minutes within the string. Here, we'll extract that number from that text field using regular expressions.

In [None]:
import math
def return_minutes(time_string):
    # We'll use \d+ to grab digits and match it to the column values
    pattern = re.compile(r"(\d+|\d{1,2}\.\d{1,2})(?:[^0-9\.]*)?(?:minutes*|mins*)|(\d+):(?:\d*?)")    
        
    # Use search to find the pattern anywhere in the string
    num = re.search(pattern, time_string)
    if num is not None:
        return math.floor(float((num.group(1) if num.group(1) is not None else num.group(2))))

In [None]:
# Apply the extraction to the length_of_time column
ufo["minutes"] = ufo["length_of_time"].apply(return_minutes)

In [None]:
# Take a look at the head of both of the columns
ufo[["length_of_time", "minutes"]].head(10)

As we can see, we end up with some `NaN`s in the DataFrame. That's okay for now; we'll take care of those before modeling.

In [None]:
ufo = ufo[(ufo['seconds'] != 0.0) & ufo['minutes'].notnull()]
ufo.reset_index(drop=True, inplace=True)

### Identifying features for standardization

In this section, we'll investigate the variance of columns in the UFO dataset to determine which features should be standardized. After taking a look at the variances of the `seconds` and `minutes` columns, we'll see that the variance of the `seconds` column is extremely high. Because `seconds` and `minutes` are related to each other (an issue we'll deal with when we select features for modeling), let's log normalize the `seconds` column.

In [None]:
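# Check the variance of the numeric columns
# (numeric_only=True skips text and datetime columns, which recent pandas versions would otherwise reject)
ufo.var(numeric_only=True)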

In [None]:
# Log normalize the seconds column
ufo["seconds_log"] = np.log(ufo["seconds"])

In [None]:
# Print out the variance of just the seconds_log column
print(ufo["seconds_log"].var())
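
This step relies on the earlier filter that removed rows where `seconds` is 0, because the log of 0 is negative infinity. The sketch below shows the problem on made-up values and one possible alternative, `np.log1p`, which computes `log(1 + x)`. The lesson itself uses `np.log`, so `np.log1p` is only an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical durations, including a zero
seconds = pd.Series([0.0, 60.0, 3600.0])

# np.log(0) is -inf; silence the divide-by-zero warning for the demo
with np.errstate(divide="ignore"):
    plain_log = np.log(seconds)

# log1p(x) = log(1 + x) stays finite at zero
shifted_log = np.log1p(seconds)

print(plain_log[0], shifted_log[0])
```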

## Engineering new features

### Encoding categorical variables

There are a couple of columns in the UFO dataset that need to be encoded before they can be modeled through scikit-learn. We'll do that transformation here, using both binary and one-hot encoding methods.

In [None]:
# Use Pandas to encode us values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda val: 1 if val == "us" else 0)

In [None]:
# Print the number of unique type values
print(len(ufo["type"].unique()))

In [None]:
# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo["type"])

In [None]:
# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)

In [None]:
ufo.head()

### Features from dates

Another feature engineering task to perform is month and year extraction. We'll perform this task on the `date` column of the `ufo` dataset.

In [None]:
# Look at the first 5 rows of the date column
ufo["date"].head()

In [None]:
# Extract the month from the date column
ufo["month"] = ufo["date"].apply(lambda row: row.month)

In [None]:
# Extract the year from the date column
ufo["year"] = ufo["date"].apply(lambda row: row.year)

In [None]:
# Take a look at the head of all three columns
ufo[["date", "month", "year"]].head()

### Text vectorization

Let's transform the `desc` column in the UFO dataset into tf/idf vectors, since there's likely something we can learn from this field.

In [None]:
# Take a look at the head of the desc field
ufo["desc"].head()

In [None]:
# Create the tfidf vectorizer object
vec = TfidfVectorizer()

In [None]:
# Use vec's fit_transform method on the desc field
desc_tfidf = vec.fit_transform(ufo["desc"])

In [None]:
# Look at the number of columns this creates.
desc_tfidf.shape

The text vector has a large number of columns. We'll work on selecting the features we want to use for modeling in the next section.

## Feature selection and modeling

### Selecting the ideal dataset

Let's get rid of some of the unnecessary features. Because we have an encoded country column, `country_enc`, we keep it and drop other columns related to location: `city`, `country`, `lat`, `long`, `state`.

We have columns related to `month` and `year`, so we don't need the `date` or `recorded` columns.

We vectorized `desc`, so we don't need it anymore. For now we'll keep `type`.

We'll keep `seconds_log` and drop `seconds` and `minutes`.

Let's also get rid of the `length_of_time` column, which is unnecessary after extracting `minutes`.

In [None]:
# Check the correlation between the seconds, seconds_log, and minutes columns
ufo[["seconds", "seconds_log", "minutes"]].corr()

In [None]:
# Make a list of features to drop   
to_drop = ["city", "country", "date", "desc", "lat", "length_of_time", "long", "minutes", "recorded", "seconds", "state"]

In [None]:
# Drop those features
ufo = ufo.drop(to_drop, axis=1)

In [None]:
# Let's also filter some words out of the text vector we created
vocab = {v:k for k,v in vec.vocabulary_.items()}
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)

We're almost done. In the next examples, we'll try modeling the UFO data in a couple of different ways.

### Modeling the UFO dataset, part 1

In this example, we're going to build a k-nearest neighbor model to predict which country the UFO sighting took place in. Our `X` dataset has the log-normalized seconds column, the one-hot encoded type columns, as well as the month and year when the sighting took place. The `y` labels are the encoded country column, where 1 is `us` and 0 is `ca`.

In [None]:
X = ufo.drop(['type', 'country_enc'], axis=1)
y = ufo['country_enc']

In [None]:
# Take a look at the features in the X set of data
X.columns

In [None]:
# Split the X and y sets using train_test_split, setting stratify=y
train_X, test_X, train_y, test_y = train_test_split(X, y, stratify=y, random_state=42)

In [None]:
# Fit knn to the training sets
knn.fit(train_X, train_y)

In [None]:
# Print the score of knn on the test sets
print(knn.score(test_X, test_y))

This model performs pretty well. It seems like we've made pretty good feature selection choices here.

### Modeling the UFO dataset, part 2

Finally, let's build a model using the text vector we created, `desc_tfidf`, using the `filtered_words` list to create a filtered text vector. Let's see if we can predict the `type` of the sighting based on the text. We'll use a Naive Bayes model for this.

In [None]:
# Use the list of filtered words we created to filter the text vector
filtered_text = desc_tfidf[:, list(filtered_words)]

In [None]:
# Split the X and y sets using train_test_split, setting stratify=y
y = ufo['type']
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y, random_state=42)

In [None]:
# Fit nb to the training sets
nb.fit(train_X, train_y)

In [None]:
# Print the score of nb on the test sets
nb.score(test_X, test_y)

As we can see, this model performs very poorly on this text data. This is a clear case where iteration would be necessary to figure out what subset of text improves the model, and if perhaps any of the other features are useful in predicting `type`.

---
**[Week 5 - Data Preprocessing and Hyperparameter Tuning](https://radu-enuca.gitbook.io/ml-challenge/preprocessing-and-tuning)**

*Have questions or comments? Visit the ML Challenge Mattermost Channel.*