<a href="https://colab.research.google.com/github/ML-Challenge/week5-preprocessing-and-tunning/blob/master/L1.Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" /></a>

This lesson covers the basics of how and when to perform data preprocessing. This essential step in any machine learning project is when we get our data ready for modeling. Preprocessing comes into play after we import and clean the data and before we fit the machine learning model. We'll learn how to standardize the data so that it's in the right form for the model, create new features to best leverage the information in the dataset, and select the best features to improve the model fit. Finally, we'll get some practice with preprocessing by getting a dataset on UFO sightings ready for modeling.

# Setup

In [None]:
# Download lesson datasets
# Required if you're using Google Colab
!wget "https://github.com/ML-Challenge/week5-preprocessing-and-tunning/raw/master/datasets.zip"
!unzip -o datasets.zip

In [None]:
# Import utils
# We'll be using this module throughout the lesson
import utils

In [None]:
# Import dependencies
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
# Set the default size of all plots
plt.rcParams['figure.figsize'] = [11, 7]

# Introduction to Data Preprocessing

In this lesson, we'll learn exactly what it means to `preprocess` data. We'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data.

## What is data preprocessing?

Data preprocessing comes after we've cleaned up our data and after we've done some exploratory analysis to understand our dataset. Once we understand our dataset, we'll probably have some idea about how we want to model our data. 

Machine learning models in Python require numerical input, so if our dataset has categorical variables, we'll need to transform them. Think of data preprocessing as a prerequisite for modeling.

### Missing Data

One of the first steps we can take to preprocess our data is to **remove missing data**. There are many ways to deal with missing data, but here we're only going to cover ways to remove either columns or rows with missing data.

If we want to drop all rows from a dataframe that contain missing values, we can do that with `dropna`. We can also drop specific rows by passing index labels to the `drop` method, which drops rows by default.
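
To make the difference between the two methods concrete, here is a minimal sketch on a hypothetical toy dataframe (the column names `a` and `b` are made up for illustration and are not part of the lesson's datasets):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the lesson's volunteer dataset
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, "z"],
})

# Drop every row that contains at least one missing value
complete_rows = df.dropna()

# Drop rows by index label (rows are the default axis)
without_first_and_last = df.drop([0, 3])

print(complete_rows.index.tolist())           # [0, 3]
print(without_first_and_last.index.tolist())  # [1, 2]
```

Note that `dropna` decides what to remove by looking at the values, while `drop` removes exactly the labels we name, whether or not those rows contain missing data.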

### Missing data - columns

Usually we'll want to focus on dropping a particular column, especially if all or most of its values are missing. We can
use the `drop` method as well, though the parameters are different. The first parameter is the column name. We have to specify `axis=1` in order to designate that we want to drop a column.
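
As a small illustration, the sketch below drops a column by name and also shows `dropna` with `axis=1`. The dataframe and its column names are hypothetical. One detail worth keeping in mind is that the `thresh` argument of `dropna` counts non-missing values, so `thresh=3` keeps columns that have at least 3 values present.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one nearly empty column
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "mostly_missing": [np.nan, np.nan, np.nan, 7.0],
    "complete": [0.1, 0.2, 0.3, 0.4],
})

# Drop a column by name; axis=1 means "columns"
dropped_by_name = df.drop("mostly_missing", axis=1)

# thresh counts NON-missing values: keep columns with at least 3 of them
dropped_by_thresh = df.dropna(axis=1, thresh=3)

print(dropped_by_name.columns.tolist())    # ['id', 'complete']
print(dropped_by_thresh.columns.tolist())  # ['id', 'complete']
```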

We have a dataset composed of volunteer information from New York City. The dataset has a number of features, but we want to get rid of features that are almost entirely empty. Passing `thresh=3` to `dropna` keeps only the columns that have at least 3 non-missing values.

How many features are in the original dataset, and how many features are left after columns with fewer than 3 non-missing values are removed?

In [None]:
utils.volunteer.info()

In [None]:
volunteer_clean = utils.volunteer.dropna(axis=1, thresh=3)
volunteer_clean.info()

### Missing data - rows

What if we want to drop rows where data is missing in a particular column? We can do this with the help of boolean
indexing, which is a way to filter a dataframe based on certain values. Instead of indexing a dataframe using column or
row names, we can set a condition and return only the rows that satisfy it.

Taking a look at the `volunteer` dataset again, we want to drop rows where the `category_desc` column values are missing. We're going to do this using boolean indexing, by checking which values are null and then filtering the dataset so that we keep only the rows where `category_desc` is present.

First, let's take a look at how many null values we have in column `category_desc`, using `isnull` to flag null values and then `sum` to output a count.

In [None]:
# Check how many values are missing in the category_desc column
print(utils.volunteer['category_desc'].isnull().sum())

To filter those out, we can simply use the `notnull` method on column `category_desc` as a boolean index.

In [None]:
# Subset the volunteer dataset
volunteer_subset = utils.volunteer[utils.volunteer['category_desc'].notnull()]

In [None]:
# Print out the shape of the subset
print(volunteer_subset.shape)

> **Note**: we can use boolean indexing to effectively subset DataFrames.

## Working with data types

Now that we've reviewed some Pandas basics, we need to start thinking about other steps we have to take in order to
prepare the data. One of these steps is to think about the types that are present in our dataset, because
we'll likely have to transform some of these columns to other types later on. Let's take a deeper look at types, as well
as how to convert column types in our dataset. Pandas datatypes are similar to native Python types, but there are a couple of things to be aware of. The most common types we'll be working with are `object` (string values or a mix of types), `int64` (the counterpart of the Python integer type), and `float64` (the counterpart of the float type).

### Exploring data types

Taking another look at the dataset comprised of volunteer information from New York City, we want to know what types we'll be working with as we start to do more preprocessing. Which data types are present in the volunteer dataset?

> **Note:** Recall that we can check the types of a dataframe by using the `dtypes` attribute.

In [None]:
utils.volunteer.dtypes

We have int, float, and object (string) columns.

### Converting column types

Sometimes, we'll start working with a dataset that has an incorrect column type: maybe a numerical column was written out into a csv as a string, and when we try to work with that column, numerical operations won't work. It's also good to be as sure as we can that the column type we want to convert to is representative of the whole column. The object type can represent a column that includes both string and numeric types.
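
One way to check whether a type is representative of the whole column is to attempt the conversion in a forgiving mode first. The sketch below uses `pd.to_numeric` with `errors="coerce"` on a hypothetical column (the values, including the `"n/a"` entry, are made up for illustration). A plain `astype(int)` would raise an error on such a column.

```python
import pandas as pd

# Hypothetical column that was written to a csv as text, with one bad entry
raw = pd.Series(["10", "25", "n/a", "7"], dtype="object")

# errors="coerce" turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

# Count how many entries could not be converted
print(numeric.isnull().sum())  # 1
print(numeric.dtype)           # float64
```

If the count of unparseable entries is zero, converting the column with `astype` is safe. If not, we know which values to inspect before deciding on a type.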

Let's take a look at how to adjust the type of a column if the type that pandas has inferred upon reading in the file is incorrect.

If we take a look at the `volunteer` dataset types, we'll see that the column `hits` is type object. But, if we actually look at the column, we'll see that it consists of integers. Let's convert that column to type int.

In [None]:
# Print the head of the hits column
print(utils.volunteer["hits"].head())

We can change the type using the `astype` method and passing in the type we want to convert it to.

In [None]:
# Convert the hits column to type int
utils.volunteer["hits"] = utils.volunteer["hits"].astype(int)

In [None]:
# Look at the dtypes of the dataset
print(utils.volunteer.dtypes)

## Class distribution

One of the most important preprocessing steps is splitting our data into training and test sets. We do this so that we can detect overfitting. If we train a model on our entire set of data, we won't have any way to test and validate our model because the model will essentially know the dataset by heart. Holding out a test set allows us to preserve some data the model hasn't seen yet.

In scikit-learn, we can split our dataset by using the `train_test_split` function. The function shuffles our dataset and then randomly splits it. By default, the function will split 75% of the data into the training set and 25% into the test set. In many scenarios, the default splitting parameters will work well. However, if our labels have an uneven distribution, our test and training sets might not be representative samples of our dataset and could bias the model we're trying to train.
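
Here is a minimal sketch of the default behavior on synthetic data (the arrays are made up for illustration). With 100 rows, the default split yields 75 training rows and 25 test rows. The share of the rare class in the test set is printed without an expected value, because with a purely random split it can drift away from the 10% in the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, with 10% of labels in the rare class
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

print(len(X_tr), len(X_te))  # 75 25

# Share of the rare class in the test set (not guaranteed to be 0.10)
print(y_te.mean())
```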

### Class imbalance

In the `volunteer` dataset, we're thinking about trying to predict the `category_desc` variable using the other features in the dataset. First, though, we need to know what the class distribution (and imbalance) is for that label.

Which descriptions occur fewer than 50 times in the volunteer dataset?

In [None]:
utils.volunteer['category_desc'].value_counts()

Both Emergency Preparedness and Environment occur fewer than 50 times.

### Stratified sampling

A good technique for sampling more accurately when we have imbalanced classes is **stratified sampling**, which is a way of sampling that takes into account the distribution of
classes or features in our dataset.

We know that the distribution of variables in the `category_desc` column in the `volunteer` dataset is uneven. If we wanted to train a model to try to predict `category_desc`, we would want to train the model on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this.

In [None]:
# Create a dataset with all columns except category_desc
# (uses volunteer_subset, because stratifying on labels that contain NaN raises an error)
volunteer_X = volunteer_subset.drop('category_desc', axis=1)

In [None]:
# Create a category_desc labels dataset
volunteer_y = volunteer_subset[['category_desc']]

There's a really easy way to do this in scikit-learn using the `train_test_split` function. The function comes with a `stratify` parameter, and to stratify according to class labels, we just pass `volunteer_y` to that parameter.

In [None]:
from sklearn.model_selection import train_test_split

# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y, random_state=42)

If we check the distribution of classes for our training and test labels, we can see the distribution of classes is in accordance with the original `category_desc` class distribution.

In [None]:
# Print out the category_desc counts on the training y labels
print(y_train['category_desc'].value_counts())

# Print out the category_desc counts on the test y labels
print(y_test['category_desc'].value_counts())
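
Raw counts can be hard to compare across sets of different sizes. As a sketch, the example below compares proportions instead by passing `normalize=True` to `value_counts`. It uses synthetic labels (the `common` and `rare` class names are hypothetical) so that it runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 90% "common", 10% "rare"
labels = pd.Series(["common"] * 90 + ["rare"] * 10, name="label")
features = pd.DataFrame({"x": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, stratify=labels, random_state=42
)

# Proportions in the full data, the training set, and the test set
print(labels.value_counts(normalize=True))
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

With stratification, the three sets of proportions should be close to one another, differing only by the rounding needed to split whole rows.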

# Standardizing Data

This part of the lesson is all about standardizing data. Often a model will make some assumptions about the distribution or scale of the features. Standardization is a way to make the data fit these assumptions and improve the algorithm's performance.

## What is data standardization?

It's possible that we'll come across datasets with lots of numerical noise built in, such as lots of variance or differently-scaled data. The preprocessing solution for that is standardization. 

Standardization is a preprocessing method used to transform continuous data so that features are on comparable scales and, ideally, closer to a normal distribution. In scikit-learn, this is often a necessary step, because many models assume that the features they are trained on are centered and similarly scaled, and if they aren't, we risk biasing our model.

We can standardize our data in different ways, but in this lesson we're going to talk about two methods: `log normalization` and `scaling`. It's also important to note that standardization is a preprocessing method applied to continuous, numerical data. We'll learn methods for dealing with categorical data later in the lesson.

### When to standardize

There are a few different scenarios in which we want to standardize our data. First, if we're working with any kind of model that uses a linear distance metric or operates in a linear space like k-nearest neighbors, linear regression, or k-means clustering, the model is assuming that the data and features we're giving it are related in a linear fashion,
or can be measured with a linear distance metric. There are a number of models that deal with nonlinear spaces, but for those models that are in a linear space, the data must also be in that space.

The case when a feature or features in our dataset have high variance is related to this. This could bias a model that assumes the data is normally distributed. If a feature in our dataset has a variance that's an order of magnitude or more greater than the other features, this could impact the model's ability to learn from other features in the dataset.

Modeling a dataset that contains continuous features that are on different scales is another scenario to watch out for. For example, consider a dataset that contains a column related to height and another related to weight. In order to compare these features, they must be in the same linear space, and therefore must be standardized in some way. 
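
To see why scale matters for distance-based models, consider the following sketch with three hypothetical people described by height in meters and weight in kilograms (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical people: [height in meters, weight in kilograms]
a = np.array([1.60, 70.0])
b = np.array([1.95, 70.5])  # very different height, similar weight
c = np.array([1.61, 75.0])  # almost the same height, different weight

# Euclidean distances on the raw, unscaled features
dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)

print(dist_ab, dist_ac)
```

On the raw features, `a` appears much closer to `b` than to `c`, even though a 35 cm height difference is large and a 5 kg weight difference is modest. The weight column dominates the distance simply because its units produce bigger numbers.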

All of these scenarios assume we're working with a model that makes some kind of linearity assumptions. There are a number of models that are perfectly fine operating in a nonlinear space or do a certain amount of standardization upon input, but that's outside the scope of this lesson.

Now that we've learned when it is appropriate to standardize our data, in which of these scenarios would we **NOT** want to standardize?

**Possible Answers**

1. A column we want to use for modeling has extremely high variance.
2. We have a dataset with several continuous columns on different scales and we'd like to use a linear model to train the data.
3. The models we're working with use some sort of distance metric in a linear space, like the Euclidean metric.
4. Our dataset is comprised of categorical data.

In [None]:
# Use 1, 2, 3 or 4 as a parameter
utils.when_to_standardize()

### Modeling without normalizing

Let's take a look at what might happen to our model's accuracy if we try to model data without doing some sort of standardization first. Here we have a subset of the `wine` dataset. One of the columns, `Proline`, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which we'll learn about later in this lesson.

In [None]:
utils.wine.head()

In [None]:
# Create a subset of data
wine_X = utils.wine[['Proline', 'Total phenols', 'Hue', 'Nonflavanoid phenols']]

In [None]:
# Create a Type labels dataset
wine_y = utils.wine['Type']

In [None]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(wine_X, wine_y, stratify=wine_y, random_state=42)

In [None]:
# Fit the k-nearest neighbors model to the training data
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

In [None]:
# Score the model on the test data
print(knn.score(X_test, y_test))

We can see that the accuracy score is pretty low. Let's explore methods to improve this score.

## Log normalization

Log normalization is a method for standardizing our data that can be useful when we have a particular column with **high variance**. As we saw in the previous section's examples, training a k-nearest neighbors classifier on that subset of the wine dataset didn't get a very high accuracy score. This is because within that subset, the `Proline` column has extremely high variance, which is affecting the accuracy of the classifier.

Log normalization is a good strategy:
* when we care about relative changes in a linear model
* when we still want to capture the magnitude of change
* when we want to keep everything in the positive space.

It's a nice way to minimize the variance of a column and make it comparable to other columns for modeling.

### Checking the variance

Let's check the variance of the columns in the wine dataset and see which column is a candidate for normalization.

In [None]:
utils.wine.var()

We can see that the `Proline` column has an extremely high variance.

### Log normalization in Python

The method of log normalization we're going to work with in Python takes the natural log of each value in a column. The natural log of a number is simply the exponent to which we would raise the mathematical constant e (approximately 2.718) to get that number.
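
A quick sketch on made-up values shows both properties: `np.exp` undoes `np.log`, and values that span several orders of magnitude end up much closer together after the transformation. Keep in mind that the natural log is only defined for positive numbers.

```python
import numpy as np

# Made-up positive values spanning several orders of magnitude
values = np.array([1.0, 10.0, 100.0, 1000.0])
logged = np.log(values)

# Raising e to the logged values recovers the originals
print(np.allclose(np.exp(logged), values))  # True

# The variance shrinks dramatically after taking the log
print(values.var(), logged.var())
```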

Now that we know that the `Proline` column in our wine dataset has a large amount of variance, let's log normalize it.

In [None]:
# Print out the variance of the Proline column
print(utils.wine['Proline'].var())

In [None]:
# Apply the log normalization function to the Proline column
utils.wine['Proline_log'] = np.log(utils.wine['Proline'])

In [None]:
# Check the variance of the normalized Proline column
print(utils.wine['Proline_log'].var())

The `np.log()` function is an easy way to log normalize a column.

In [None]:
# Create a subset of data
wine_X = utils.wine[['Proline_log', 'Total phenols', 'Hue', 'Nonflavanoid phenols']]

# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(wine_X, wine_y, stratify=wine_y, random_state=42)

In [None]:
# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

In [None]:
# Score the model on the test data
print(knn.score(X_test, y_test))

## Scaling data for feature comparison

### What is feature scaling?

Scaling is a method of standardization that's most useful when we're working with a dataset that contains continuous features that are on different scales, and we're using a model that operates in some sort of linear space (like linear regression or k-nearest neighbors).

Feature scaling transforms the features in our dataset so they have a mean of 0 and a variance of 1. This will make it easier to linearly compare features. Many models in scikit-learn either expect this or perform noticeably better with it.
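
In other words, each value x in a column is replaced by (x - mean) / standard deviation of that column. The sketch below, on a made-up array, computes this by hand and checks that it agrees with scikit-learn's `StandardScaler`, which uses the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])

# Manual z-score per column; np.std defaults to the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))               # True
print(scaled.mean(axis=0), scaled.var(axis=0))   # approximately [0, 0] and [1, 1]
```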

### Scaling data - investigating columns

We want to use the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the wine dataset to train a linear model, but it's possible that these columns are all measured in different ways, which would bias a linear model. Using `describe()` to return descriptive statistics about this dataset, which of the following statements are true about the scale of data in these columns?

**Possible Answers**

1. The max of `Ash` is 3.23, the max of `Alcalinity of ash` is 30, and the max of `Magnesium` is 162.
2. The means of `Ash` and `Alcalinity of ash` are less than 20, while the mean of `Magnesium` is greater than 90.
3. The standard deviations of `Ash` and `Alcalinity of ash` are equal.
4. 1 and 2 are true

In [None]:
utils.wine.describe()

### Scaling data - standardizing columns

Since we know that the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the wine dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.

Scikit-learn has a variety of scaling capabilities, but we're only going to focus on the `StandardScaler` class, imported from `sklearn.preprocessing`. It works by removing the mean and scaling each feature to have unit variance. There's a simpler `scale` function in scikit-learn, but the benefit of using a `StandardScaler` object is that we can apply the same transformation to other data, like a test set or new data from the same source, without having to refit the scaler. Once we've created the scaler object, we can call its `fit_transform` method on the dataframe.

In [None]:
# Import StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler

In [None]:
# Create the scaler
ss = StandardScaler()

In [None]:
# Take a subset of the DataFrame we want to scale 
wine_subset = utils.wine[['Ash', 'Alcalinity of ash', 'Magnesium']]

In [None]:
# Apply the scaler to the DataFrame subset
wine_subset_scaled = ss.fit_transform(wine_subset)

In [None]:
pd.DataFrame(wine_subset_scaled, columns=['Ash', 'Alcalinity of ash', 'Magnesium']).describe()

In scikit-learn, running `fit_transform` during preprocessing will both fit the scaler to the data and transform the data in a single step.
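
The two steps can also be run separately, which is what lets us reuse a fitted scaler on other data. A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train = np.array([[10.0], [20.0], [30.0]])
new = np.array([[20.0], [40.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from train
train_scaled = scaler.fit_transform(train)

# transform reuses those learned statistics; nothing is refit
new_scaled = scaler.transform(new)

print(scaler.mean_)        # [20.]
print(new_scaled.ravel())
```

The value 20 in the new data maps to 0 because it equals the mean learned from the training data, not the mean of the new data.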

## Standardized data and modeling

Now that we've learned a couple of different methods for standardization, it's time to put this into practice with modeling. As mentioned before, many models in scikit-learn require our data to be scaled appropriately across columns, otherwise we risk biasing our results. The last part of this section will be dedicated to modeling data on both unscaled and scaled data, so we can see the difference in model performance. The model we're going to use is k-nearest neighbors.

### KNN on non-scaled data

Let's first take a look at the accuracy of a K-nearest neighbors model on the `wine` dataset without standardizing the data.

In [None]:
X = utils.wine.drop(['Type', 'Proline_log'], axis=1)
y = utils.wine['Type']

In [None]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, stratify=y, random_state=42)

In [None]:
# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

In [None]:
# Score the model on the test data
print(knn.score(X_test, y_test))

### KNN on scaled data

The accuracy score on the unscaled `wine` dataset was decent, but we can likely do better if we scale the dataset. The process is mostly the same as the previous example, with the added step of scaling the data.

In [None]:
# Create the scaling method.
ss = StandardScaler()

In [None]:
# Apply the scaling method to the dataset used for modeling.
X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, stratify=y, random_state=42)

In [None]:
# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

In [None]:
# Score the model on the test data
print(knn.score(X_test, y_test))

The increase in accuracy is worth the extra step of scaling the dataset.
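
One caveat about the cells above: the scaler was fit on the full dataset before splitting, so the test rows influenced the mean and variance used for scaling. A common way to avoid this is to split first and wrap the scaler and the model in a pipeline, so that the scaler is fit on the training rows only. The sketch below assumes scikit-learn's bundled copy of the wine data (`load_wine`) as a stand-in for `utils.wine`, so its score may differ from the one printed above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for utils.wine
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# The pipeline fits the scaler on X_train only, then fits the classifier
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(X_train, y_train)

# score() scales X_test with the training statistics before predicting
score = model.score(X_test, y_test)
print(score)
```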

# Feature Engineering

In this section we'll learn about feature engineering. We'll explore different ways to create new, more useful, features from the ones already in the dataset. We'll see how to encode, aggregate, and extract information from both numerical and textual features.

## What is feature engineering?

Feature engineering is the creation of new features based on existing features, and it adds information to our dataset that is useful in some way: it adds features useful for our prediction or clustering task, or it sheds insight into relationships between features. Real world data is often not neat and tidy, and in addition to preprocessing steps like standardization, we'll likely have to extract and expand information that exists in the columns in our dataset.

Feature engineering is also something that is very dependent on the particular dataset we're analyzing.

There are a variety of scenarios in which we might want to engineer features from existing data. An extremely common one is with **text data**. For example, if we're building some kind of natural language processing model, we'll have to create a vector of the words in our dataset.

Another scenario might also be related to string data: maybe we have a column that records people's favorite colors. In order to feed this information into a model in scikit-learn, we'll have to encode this information numerically.

Another common example is with timestamps. We might see a full timestamp that includes the time down to the second or millisecond, which might be much too granular for a prediction task, so we'll want to create a new column with the day or the month. Perhaps a column contains a list of some kind: test scores, or running times, and maybe it's more useful to use an average.

### Feature engineering knowledge test

Now that we've learned about feature engineering, which of the following examples are good candidates for creating new features?

**Possible Answers**

1. A column of timestamps
2. A column of newspaper headlines
3. A column of weight measurements
4. 1 and 2
5. None of the above

In [None]:
# Use 1,2,3,4 or 5 as parameter
utils.feature_engineering_puzzle()

### Identifying areas for feature engineering

Let's take an exploratory look at the `volunteer` dataset. Which of the following columns would we want to perform a feature engineering task on?

**Possible Answers**

1. `vol_requests`
2. `title`
3. `created_date`
4. `category_desc`
5. 2,3 and 4

In [None]:
utils.volunteer.head()

In [None]:
# Use 1,2,3,4 or 5 as parameter
utils.identity_features_puzzle()

## Encoding categorical variables

Because models in scikit-learn require numerical input, if our dataset contains categorical variables, we'll have to encode them.

### Encoding categorical variables - binary

Let's take a look at the `hiking` dataset. There are several columns here that need encoding, one of which is the `Accessible` column. `Accessible` is a binary feature with two values, either `Y` or `N`, so it needs to be encoded into 1s and 0s. We'll use scikit-learn's `LabelEncoder` class to do that transformation.

In [None]:
utils.hiking.head()

In [None]:
# Import dependencies
from sklearn.preprocessing import LabelEncoder

In [None]:
# Set up the LabelEncoder object
enc = LabelEncoder()

In [None]:
# Apply the encoding to the "Accessible" column
utils.hiking['Accessible_enc'] = enc.fit_transform(utils.hiking['Accessible'])

In [None]:
# Compare the two columns
utils.hiking[['Accessible', 'Accessible_enc']].head()

`fit_transform()` is a good way to both fit an encoding and transform the data in a single step.
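
It can be useful to check which label was mapped to which number. A fitted `LabelEncoder` stores the labels in sorted order in its `classes_` attribute, and `inverse_transform` maps the numbers back. A small sketch on made-up values:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["Y", "N", "N", "Y"])

# Labels are sorted, so 'N' becomes 0 and 'Y' becomes 1
print(enc.classes_)                  # ['N' 'Y']
print(codes)                         # [1 0 0 1]

# Recover the original labels from the codes
print(enc.inverse_transform(codes))  # ['Y' 'N' 'N' 'Y']
```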

### Encoding categorical variables - one-hot

One of the columns in the volunteer dataset, `category_desc`, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to transform this column numerically. We'll use Pandas' `get_dummies()` function to do so.

In [None]:
# Transform the category_desc column
category_enc = pd.get_dummies(utils.volunteer['category_desc'])

In [None]:
# Take a look at the encoded columns
category_enc.head()

`get_dummies()` is a simple and quick way to encode categorical variables.
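
To see the shape of the result, here is a sketch on a hypothetical column of favorite colors. Each distinct value becomes its own column, and every row has exactly one of those columns switched on.

```python
import pandas as pd

# Hypothetical categorical column
colors = pd.Series(["red", "blue", "green", "blue"], name="favorite_color")
dummies = pd.get_dummies(colors)

# One column per distinct value, in sorted order
print(dummies.columns.tolist())      # ['blue', 'green', 'red']

# Each row has exactly one active column
print(dummies.sum(axis=1).tolist())  # [1, 1, 1, 1]
```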

## Engineering numerical features

Though we may have a dataset filled with numerical features, they may need a little bit of feature engineering to properly prepare for modeling. In this section, we'll talk about aggregate statistics as well as dates and how engineering numerical features can add value to our dataset.

If we had, say, a collection of features related to a single feature, like temperature or running time, we might want to take an average or median to use as a feature for modeling instead. A common method of feature engineering is to take an aggregate of a set of numbers to use in place of those features. This can be helpful in reducing the dimensionality
of our feature space, or perhaps we simply don't need multiple similar values that are close in distance to each other.

### Engineering numerical features - taking an average

A good use case for taking an aggregate statistic to create a new feature is to take the mean of columns. Here, we have a DataFrame of running times named `running_times_5k`. For each `name` in the dataset, take the mean of their 5 run times.

In [None]:
utils.running_times_5k

In [None]:
# Create a list of the columns to average
run_columns = ['run1', 'run2', 'run3', 'run4', 'run5']

In [None]:
# Use apply to create a mean column
utils.running_times_5k["mean"] = utils.running_times_5k.apply(lambda row: row[run_columns].mean(), axis=1)

In [None]:
# Take a look at the results
utils.running_times_5k

Lambdas are especially helpful for operating across columns.
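
For a simple aggregate like the mean, pandas also offers a vectorized alternative: selecting the columns and calling `mean(axis=1)`. The sketch below, on a made-up dataframe with three runs per person, checks that both approaches give the same result.

```python
import pandas as pd

# Made-up running times, not the lesson's running_times_5k data
runs = pd.DataFrame({
    "name": ["Sue", "Mark"],
    "run1": [20.1, 24.5],
    "run2": [18.5, 25.5],
    "run3": [19.6, 20.7],
})
run_columns = ["run1", "run2", "run3"]

# Row-by-row with a lambda, as in the lesson
via_apply = runs.apply(lambda row: row[run_columns].mean(), axis=1)

# Vectorized: mean across the selected columns for every row at once
via_mean = runs[run_columns].mean(axis=1)

print(via_apply.tolist())
print(via_mean.tolist())
```

The lambda form generalizes to arbitrary per-row logic, while the vectorized form is typically faster on large dataframes.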

### Engineering numerical features - datetime

There are several columns in the `volunteer` dataset comprised of datetimes. Let's take a look at the `start_date_date` column and extract just the month to use as a feature for modeling.

In [None]:
utils.volunteer[['start_date_date', 'end_date_date']].head()

In [None]:
# First, convert string column to date column
utils.volunteer["start_date_converted"] = pd.to_datetime(utils.volunteer["start_date_date"])

In [None]:
utils.volunteer[['start_date_date', 'start_date_converted']].head()

In [None]:
# Extract just the month from the converted column
utils.volunteer['start_date_month'] = utils.volunteer['start_date_converted'].apply(lambda row: row.month)

In [None]:
# Take a look at the converted and new month columns
utils.volunteer[['start_date_date', 'start_date_converted', 'start_date_month']].head()

We can also use attributes like `.day` to get the day and `.year` to get the year from datetime columns.
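
On a whole datetime column, the same attributes are available through the `.dt` accessor, without a lambda. A minimal sketch on made-up dates:

```python
import pandas as pd

# Made-up date strings converted to a datetime column
dates = pd.to_datetime(pd.Series(["2011-07-30", "2011-02-01"]))

# The .dt accessor exposes datetime attributes for every row at once
print(dates.dt.month.tolist())  # [7, 2]
print(dates.dt.day.tolist())    # [30, 1]
print(dates.dt.year.tolist())   # [2011, 2011]
```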

## Text classification

Though text data is a little more complicated to work with, there's a lot of useful feature engineering we can do with it. One method is to extract the pieces of information that we need: maybe part of a string, or extracting a number, and transforming it into a feature. We can also transform the text itself into features, for use with natural language processing methods or prediction tasks.

### Engineering features from strings - extraction

The `Length` column in the `hiking` dataset is a column of strings, but contained in the column is the mileage for the hike. We're going to extract this mileage using regular expressions, and then use a lambda in Pandas to apply the extraction to the DataFrame.

In [None]:
utils.hiking.head()

In [None]:
# Write a pattern to extract decimal numbers such as "0.8"
import re

pattern = re.compile(r"\d+\.\d+")

def return_mileage(length):
    # Missing values (None or NaN) are not strings, so skip them
    if isinstance(length, str):
        # Try to match the pattern at the start of the text
        mile = pattern.match(length)

        # If a match is found, use group(0) to return the matched value
        if mile is not None:
            return float(mile.group(0))
    return None

In [None]:
# Apply the function to the Length column
utils.hiking["Length_num"] = utils.hiking["Length"].apply(lambda row: return_mileage(row))

In [None]:
# Take a look at both columns
utils.hiking[["Length", "Length_num"]].head()
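
As an alternative sketch, pandas can apply a regular expression to a whole column with `Series.str.extract`. The values below are made up to resemble the `Length` column. Note that `str.extract` looks for the pattern anywhere in the string and returns NaN where nothing matches or the value is missing.

```python
import pandas as pd

# Made-up values resembling the Length column, including a missing entry
lengths = pd.Series(["0.8 miles", "1.0 mile", None, "2.25 miles"])

# The parentheses define the capture group to extract
extracted = lengths.str.extract(r"(\d+\.\d+)", expand=False).astype(float)

print(extracted.tolist())
```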

### Vectorizing text

If we're working with text, we might want to model it in some way. In order to do that, we'll need to vectorize the text and transform it into a numerical input that scikit-learn can use.

Let's transform the `volunteer` dataset's `title` column into a text vector, to use in a prediction task in the next example.

In [None]:
# Take the title text
title_text = utils.volunteer.title

We're going to create a `tf/idf` vector. `tf/idf` is a way of vectorizing text that reflects how important a word is in a document beyond how frequently it occurs. It stands for term frequency / inverse document frequency and places the weight on words that are ultimately more significant in the entire corpus of words.

Creating `tf/idf` vectors is straightforward in scikit-learn, and we can use the `TfidfVectorizer` class to do it. In order to vectorize `title_text`, we can simply pass it into `TfidfVectorizer`'s `fit_transform` method.

In [None]:
# Create the vectorizer method
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()

In [None]:
# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)
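
To build some intuition for what the vectorizer produces, here is a sketch on three made-up titles. The result has one row per title and one column per distinct word. Within the first title, the word `cleanup` (which appears in only one title) receives a higher weight than `park` (which appears in two).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up titles, not the volunteer dataset
titles = [
    "park cleanup volunteers",
    "tutoring volunteers needed",
    "park tree planting",
]

vec = TfidfVectorizer()
matrix = vec.fit_transform(titles)

# One row per title, one column per distinct word
print(matrix.shape)  # (3, 7)

# Weights for the first title; vocabulary_ maps each word to its column
row = matrix[0].toarray().ravel()
print(row[vec.vocabulary_["cleanup"]], row[vec.vocabulary_["park"]])
```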

### Text classification using tf/idf vectors

Now that we've encoded the `volunteer` dataset's `title` column into `tf/idf` vectors, let's use those vectors to try to predict the `category_desc` column. Notice that we have to run the `toarray()` method on the tf/idf vector, in order to get it into the proper format for scikit-learn.

We'll use a Naive Bayes classifier, which is based on Bayes' theorem of conditional probability, and performs well on text classification tasks. Naive Bayes treats each feature as independent from the others, which can be a naive assumption, but this works out well on text data. Because each feature is treated independently, this classifier works well on high-dimensional data and is very efficient.

In [None]:
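# Split the dataset according to the class distribution of category_desc
# Rows with a missing label are left out, since stratify cannot handle NaN
has_label = utils.volunteer["category_desc"].notnull().to_numpy()
y = utils.volunteer["category_desc"][has_label]
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray()[has_label], y, stratify=y, random_state=42)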

In [None]:
# Let's use Naive Bayes
from sklearn.naive_bayes import GaussianNB

# Fit the model to the training data
nb = GaussianNB()
nb.fit(X_train, y_train)

In [None]:
# Print out the model's accuracy
nb.score(X_test, y_test)

Notice that the model doesn't score very well. We'll work on selecting the best features for modeling in the next part of the lesson.

# Selecting features for modeling

This section goes over a few different techniques for selecting the most important features from the dataset. We'll learn how to drop redundant features, work with text vectors, and reduce the number of features in the dataset using principal component analysis (PCA).

## Feature selection

Feature selection is a method of selecting features from our feature set to be used for modeling. It draws from a set of existing features, so it's different from feature engineering because it doesn't create new features.

The overarching goal of feature selection is to improve our model's performance. Perhaps our existing feature set is much too large, or some of the features we're working with are unnecessary. There are different ways we can perform feature selection.

### When to use feature selection

It helps to get rid of noise in our model. Maybe we have redundant features, like both latitude and longitude and city and state, that are adding noise. Or maybe we have features that are strongly statistically correlated, which breaks the assumptions of certain models and thus impacts model performance. If we're working with text vectors, we'll want to use those tf-idf vectors to determine which set of words to train our model on. And finally, if our feature set is large, it may be beneficial to use dimensionality reduction to combine and reduce the number of features in our dataset in a way that also reduces the overall variance.

Let's say we had finished standardizing our data and creating new features. Which of the following scenarios is **NOT** a good candidate for feature selection?

**Possible Answers**

1. Several columns of running times that have been averaged into a new column.
2. A text field that hasn't been turned into a tf/idf vector yet.
3. A column of text that has already had a float extracted out of it.
4. A categorical field that has been one-hot encoded.
5. The dataset contains columns related to whether something is a fruit or vegetable, the name of the fruit or vegetable, and the scientific name of the plant.

In [None]:
# Use 1,2,3,4 or 5 as parameter
utils.feature_selction_puzzle()

Correct! The text field needs to be vectorized before we can eliminate it, otherwise we might miss out on important data.


## Removing redundant features

One of the easiest ways to determine if a feature is unnecessary is if it is redundant in some way. For example, if it exists in another form as another feature, or if two features are very strongly correlated. Sometimes, when we create features through feature engineering, we end up duplicating existing features in some way.

Some redundant features can be identified manually, simply by understanding the features in our dataset. Like the machine learning process in general, feature selection is an iterative process. We might try removing some features only to find it doesn't improve our model's performance, and we might have to reassess our selection choices.

### Selecting relevant features

Now let's identify the redundant columns in the `volunteer_processed` dataset and perform feature selection on the dataset to return a DataFrame of the relevant features.

For example, if we explore the `volunteer` dataset, we'll see three features which are related to location: `locality`, `region`, and `postalcode`. They contain repeated information, so it would make sense to keep only one of the features.

In [None]:
utils.volunteer[['locality', 'region', 'postalcode']].head(10)

There are also features that have gone through the feature engineering process: columns like 'Education' and 'Emergency Preparedness' are a product of encoding the categorical variable `category_desc`, so `category_desc` itself is redundant now.
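
A small sketch of why the original column becomes redundant once it has been one-hot encoded: the original values can be rebuilt from the dummy columns alone. The toy frame and its `hits` column are illustrative stand-ins, not the real `volunteer` data.

```python
import pandas as pd

# Toy frame standing in for the volunteer data (values are illustrative)
df = pd.DataFrame({
    "category_desc": ["Education", "Health", "Education", "Environment"],
    "hits": [10, 3, 7, 5],
})

# One-hot encode the categorical column and attach the result
dummies = pd.get_dummies(df["category_desc"], dtype=int)
encoded = pd.concat([df, dummies], axis=1)

# The original column can be rebuilt from the dummy columns alone,
# which is what makes it redundant once the encoding exists
rebuilt = dummies.idxmax(axis=1)
print((rebuilt == df["category_desc"]).all())

# So the original column can be dropped without losing information
reduced = encoded.drop(columns=["category_desc"])
print(reduced.columns.tolist())
```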

In [None]:
utils.volunteer_processed.head()

In [None]:
utils.volunteer_processed.info()

In [None]:
# Create a list of redundant column names to drop
to_drop = ["category_desc", "created_date", "locality", "region", "vol_requests"]

In [None]:
# Drop those columns from the dataset
volunteer_subset = utils.volunteer_processed.drop(to_drop, axis=1)

In [None]:
# Print out the head of the new dataset
volunteer_subset.head()

### Checking for correlated features

A clear situation in which we'd want to drop features is when they are statistically correlated, meaning they move together directionally. Linear models in particular assume that features are independent of each other, and if features are strongly correlated, that could introduce bias into our model.

Let's take a look at the wine dataset again, which is made up of continuous, numerical features. We'll compute Pearson's correlation coefficient for each pair of columns to determine which columns are good candidates for elimination, and then remove those columns from the DataFrame.
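
As a rough sketch of how such candidates could be listed programmatically, the helper below keeps only the upper triangle of the correlation matrix so each pair shows up once. The `correlated_pairs` function, the synthetic columns, and the 0.75 default threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is built to track "a" closely, "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

def correlated_pairs(frame, threshold=0.75):
    """Return (col_1, col_2, |r|) for each pair whose |Pearson r| exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Masked-out entries are NaN, and NaN never passes the comparison below
    pairs = pairs[pairs > threshold]
    return [(i, j, round(float(r), 3)) for (i, j), r in pairs.items()]

print(correlated_pairs(df))
```

A column that appears in several of the returned pairs is usually the first candidate to drop.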

In [None]:
# Print out the column correlations of the wine dataset
corr_matrix = utils.wine.corr()
corr_matrix

In [None]:
# Take a minute to find the column where the correlation value is greater than 0.75 at least twice or run the following code
corrs = corr_matrix.abs().unstack().sort_values(kind='quicksort', ascending=False)
corrs[(corrs>0.75) & (corrs<1.0)]

In [None]:
# Flavanoids has high correlation with Total phenols and OD280/OD315 of diluted wines
# Proline is already redundant because of Proline_log
# We don't drop Type because it is the target column
to_drop = ["Flavanoids", "Proline"]

In [None]:
# Drop those columns from the DataFrame
utils.wine = utils.wine.drop(to_drop, axis=1)
utils.wine.head()

In [None]:
utils.wine.corr()

## Selecting features using text vectors

Previously, we used scikit-learn to create a tf-idf vector of text in one of our datasets. We don't necessarily need the entire vector to train a model, though. We could potentially take something like the top 20% of weighted words across the vector. This is a scenario where iteration is important, and it might be helpful to test out different subsets of our tf-idf vector to see what works.

Rather than just blindly taking some top percentage of a tf-idf vector, let's look at how to pull out the words and their weights on a per document basis. It isn't especially straightforward to do this in scikit-learn, but it's very useful.

### Exploring text vectors, part 1

Let's expand on the text vector exploration method we just learned about, using the `volunteer` dataset's title `tf/idf` vectors. The function will return a list of numbers: the vocabulary indices of the top-weighted words in one document. In the next example, we'll write another function to collect the top words across all documents, extract them, and then use that list to filter down our `text_tfidf` vector.

We'll pass in the reversed vocab list, the original vocab, the vector, the row we want to retrieve data for, and the number of top words to keep. Inside the function, we'll zip the row's indices and weights into a dictionary, sort the words by weight, and return the indices of the top-weighted words.
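
It may help to first see what `.indices` and `.data` hold for a single row of a tf-idf matrix. This is a minimal sketch on made-up documents. The names `index_to_word` and `weights` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (illustrative only)
docs = ["paint the school fence", "paint the park benches", "school reading tutor"]

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)

# vocabulary_ maps word -> column index; reverse it to map index -> word
index_to_word = {idx: word for word, idx in vec.vocabulary_.items()}

row = matrix[0]  # sparse 1 x n_words row for the first document
# .indices holds the column numbers of the non-zero entries, .data their weights
weights = {index_to_word[i]: w for i, w in zip(row.indices, row.data)}

# Print the words by weight, highest first
for word, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {weight:.3f}")
```

"fence" appears only in the first document, so it gets the largest weight in that row.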

In [None]:
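# Return the vocabulary indices of the top_n highest-weighted words in one row of the vector
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    # Map each non-zero column index in the row to its tf-idf weight
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))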
    
    # Let's transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Let's sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

In [None]:
# Print out the vocabulary indices of the top 3 weighted words in document 8
# It'll be easier later on if we have the index number in the key position in the dictionary
vocab = {v:k for k,v in tfidf_vec.vocabulary_.items()}
return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, 8, 3)

At this point we could sort by score, or eliminate words below a certain threshold.

### Exploring text vectors, part 2

Using the function we wrote in the previous example, we're going to extract the top words from each document in the text vector, return a list of the word indices, and use that list to filter the text vector down to those top words.
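
The filtering step itself comes down to column indexing on a sparse matrix. Here is a minimal sketch with made-up weights. It uses a simpler top-n rule than `return_weights` and densifies one row at a time, which is fine for a toy matrix but is an assumption made for brevity.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny stand-in for a tf-idf matrix: 3 documents x 5 words (made-up weights)
dense = np.array([
    [0.9, 0.0, 0.1, 0.0, 0.4],
    [0.0, 0.8, 0.0, 0.2, 0.6],
    [0.3, 0.0, 0.0, 0.7, 0.0],
])
matrix = csr_matrix(dense)

top_n = 1
keep = set()
for i in range(matrix.shape[0]):
    row = matrix[i].toarray().ravel()
    # argsort is ascending, so the last top_n positions are the largest weights
    keep.update(np.argsort(row)[-top_n:].tolist())

# Sorting the indices keeps the column order predictable
keep_cols = sorted(keep)
filtered = matrix[:, keep_cols]

print(keep_cols)
print(filtered.shape)
```

Collecting the indices in a set removes duplicates, and sorting them before indexing avoids depending on set iteration order.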

In [None]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Here we'll call the function from the previous example, and extend the list we're creating
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

In [None]:
# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# By converting filtered_words back to a list, we can use it to filter the columns in the text vector
filtered_text = text_tfidf[:, list(filtered_words)]

### Training Naive Bayes with feature selection

Let's re-run the Naive Bayes text classification model from earlier, with our selection choices from the previous example, on the `volunteer` dataset's `title` and `category_desc` columns.

In [None]:
# Split the dataset according to the class distribution of category_desc
y = utils.volunteer["category_desc"]
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y, random_state=42)

In [None]:
# Fit the model to the training data
nb.fit(train_X, train_y)

In [None]:
# Print out the model's accuracy
nb.score(test_X, test_y)

We can see that our accuracy score wasn't that different from our previous one. That's okay; the title field is a very small text field, appropriate for demonstrating how filtering vectors works.

## Dimensionality reduction

A less manual way of reducing the size of our feature set is through dimensionality reduction.

Dimensionality reduction is a form of unsupervised learning that transforms our data in a way that shrinks the number of features in our feature space. This data transformation can be done in a linear or nonlinear fashion. Dimensionality reduction is a feature extraction method, given that data is being transformed into new and different features.

### Using PCA

PCA uses a linear transformation to project features into a space where they are completely uncorrelated. While the feature space is reduced, the variance is captured in a meaningful way by combining features into components. PCA captures, in each component, as much of the variance in the dataset as possible. In terms of feature selection, it can be a useful method when we have a large number of features and no strong candidates for elimination.
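
A short sketch on synthetic data can illustrate both claims: the transformed components are uncorrelated, and the explained variance is ordered from largest to smallest. The three features below are made up, with the second built to correlate with the first.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: the second feature is mostly a rescaled copy of the first
rng = np.random.default_rng(42)
x1 = rng.normal(size=300)
x2 = 0.8 * x1 + rng.normal(scale=0.3, size=300)
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])

pca = PCA()
components = pca.fit_transform(X)

# Correlation between the original features vs. between the components
print(np.round(np.corrcoef(X, rowvar=False), 2))
print(np.round(np.corrcoef(components, rowvar=False), 2))

# Share of variance per component, and the running total
ratios = pca.explained_variance_ratio_
print(np.round(ratios, 3))
print(np.round(np.cumsum(ratios), 3))
```

The running total is one way to judge how many components might be enough to keep.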

Let's apply PCA to the wine dataset, to see if we can get an increase in our model's accuracy.

In [None]:
from sklearn.decomposition import PCA

In [None]:
# Set up PCA and the X vector for dimensionality reduction
pca = PCA()
wine_X = utils.wine.drop("Type", axis=1)

In [None]:
# Apply PCA to the wine dataset X vector
transformed_X = pca.fit_transform(wine_X)

In [None]:
# Look at the percentage of variance explained by the different components
pca.explained_variance_ratio_

### PCA caveats

* It is difficult to interpret PCA components beyond which components explain the most variance.
* PCA is a good step to do at the end of our preprocessing journey, because of the way the data gets transformed and reshaped.

In the next section we'll train a model using the PCA-transformed vector.

### Training a model with PCA

Now that we have run PCA on the `wine` dataset, let's try training a model with it.

In [None]:
# Split the transformed X and the y labels into training and test sets
wine_y = utils.wine["Type"]
X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(transformed_X, wine_y, random_state=42) 

In [None]:
# Fit knn to the training data
knn = KNeighborsClassifier()
knn.fit(X_wine_train, y_wine_train)

In [None]:
# Score knn on the test data and print it out
knn.score(X_wine_test, y_wine_test)

# Putting it all together

Now that we've learned all about preprocessing, we'll try these techniques out on a dataset that records information on UFO sightings.

Each row in this dataset contains information like the location, the type of the sighting, the number of seconds and minutes the sighting lasted, a description of the sighting, and the date the sighting was recorded.

## UFOs and preprocessing

### Checking column types

Let's take a look at the UFO dataset's column types using the `dtypes` attribute. One column jumps out for transformation: the `date` column, which can be transformed into the `datetime` type. That will make our feature engineering efforts easier later on.
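
As a minimal sketch of the conversion on a few made-up values: the `errors="coerce"` argument is an addition that the lesson does not use. It turns unparseable strings into `NaT` instead of raising an error.

```python
import pandas as pd

# Illustrative values; the real file's date format may differ
raw = pd.Series(["2011-11-03 19:21:00", "2004-10-03 19:05:00", "unknown"])

# Parse the strings; anything that cannot be parsed becomes NaT
parsed = pd.to_datetime(raw, errors="coerce")

print(raw.dtype, "->", parsed.dtype)
print(parsed.isna().tolist())
```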

In [None]:
ufo = pd.read_csv('data/ufo_sightings_large.csv')
ufo.head()

In [None]:
ufo.dtypes

In [None]:
# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo["date"])

### Dropping missing data

Let's remove some of the rows where certain columns have missing values. We're going to look at the `length_of_time` column, the `state` column, and the `type` column. If any of the values in these columns are missing, we're going to drop the rows.
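
For reference, pandas offers two equivalent ways to do this: a boolean mask, or `dropna` with the `subset` argument. The toy frame below is illustrative. Note that gaps in columns outside the subset (here `city`) do not cause a row to be dropped.

```python
import pandas as pd

# Toy frame with gaps (illustrative values)
df = pd.DataFrame({
    "length_of_time": ["5 minutes", None, "1 hour", "10 mins"],
    "state": ["ny", "ca", None, "tx"],
    "type": ["light", "circle", "disk", None],
    "city": [None, "fresno", "austin", "dallas"],
})

cols = ["length_of_time", "state", "type"]

# Boolean-mask version: keep rows where all three columns are present
masked = df[df[cols].notnull().all(axis=1)]

# dropna with subset= does the same and ignores gaps in other columns
dropped = df.dropna(subset=cols)

print(masked.equals(dropped))
print(len(df), "->", len(dropped))
```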

In [None]:
ufo.info()

In [None]:
# Check how many values are missing in the length_of_time, state, and type columns
ufo[['length_of_time', 'state', 'type']].isnull().sum()

In [None]:
# Keep only rows where length_of_time, state, and type are not null
ufo = ufo[ufo["length_of_time"].notnull() & ufo["state"].notnull() & ufo["type"].notnull()]
ufo.reset_index(drop=True, inplace=True)

In [None]:
# Print out a summary of the new dataset
ufo.info()

## Categorical variables and standardization

There are a number of categorical variables in the UFO dataset, including location data and the type of the encounter.

### Extracting numbers from strings

The `length_of_time` field in the UFO dataset is a text field that has the number of minutes within the string. Here, we'll extract that number from that text field using regular expressions.
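
To see the idea on a few strings before applying it to the whole column, here is a sketch with a deliberately simplified pattern. The `MINUTES` pattern and the `extract_minutes` helper are assumptions for illustration and are not the lesson's exact regular expression.

```python
import re

# Simplified pattern (an assumption for illustration, not the lesson's exact regex):
# a number, optional whitespace, then "minute(s)" or "min(s)"
MINUTES = re.compile(r"(\d+(?:\.\d+)?)\s*(?:minutes?|mins?)")

def extract_minutes(text):
    """Return the first minute count found in text as a float, or None."""
    match = MINUTES.search(text)
    return float(match.group(1)) if match else None

for sample in ["about 5 minutes", "2.5 mins", "10 seconds", "1 hour"]:
    print(repr(sample), "->", extract_minutes(sample))
```

Strings with no minute count return `None`, which pandas stores as `NaN` when the function is applied to a column.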

In [None]:
import math
import re

def return_minutes(time_string):
    # Match a number followed by "minutes"/"mins", or the number before a colon
    pattern = re.compile(r"(\d+|\d{1,2}\.\d{1,2})(?:[^0-9\.]*)?(?:minutes*|mins*)|(\d+):(?:\d*?)")

    # Search the string for the first match of the pattern
    num = re.search(pattern, time_string)
    if num is not None:
        # Group 1 holds the "minutes" form, group 2 the colon form
        return math.floor(float(num.group(1) if num.group(1) is not None else num.group(2)))

In [None]:
# Apply the extraction to the length_of_time column
ufo["minutes"] = ufo["length_of_time"].apply(return_minutes)

In [None]:
# Take a look at the head of both of the columns
ufo[["length_of_time", "minutes"]].head(10)

As we can see, we end up with some `NaN`s in the DataFrame. That's okay for now; we'll take care of those before modeling.

In [None]:
# Drop rows with zero seconds (the log is undefined there) or with no extracted minutes
ufo = ufo[(ufo['seconds'] != 0.0) & ufo['minutes'].notnull()]
ufo.reset_index(drop=True, inplace=True)

### Identifying features for standardization

In this section, we'll investigate the variance of columns in the UFO dataset to determine which features should be standardized. After taking a look at the variances of the `seconds` and `minutes` columns, we'll see that the variance of the `seconds` column is extremely high. Because `seconds` and `minutes` are related to each other (an issue we'll deal with when we select features for modeling), let's log normalize the `seconds` column.
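
A quick sketch of the effect on made-up durations: a few extreme values dominate the raw variance, and taking the log compresses them. It also shows why rows with zero seconds cannot be log-transformed as they are.

```python
import numpy as np

# Made-up durations in seconds with a few extreme values
seconds = np.array([5.0, 30.0, 60.0, 120.0, 600.0, 3600.0, 86400.0])

raw_var = seconds.var(ddof=1)
log_var = np.log(seconds).var(ddof=1)

print("variance before:", raw_var)
print("variance after: ", log_var)

# log(0) is -inf, so zero-length durations have to be removed (or shifted) first
with np.errstate(divide="ignore"):
    print(np.log(0.0))
```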

In [None]:
# Check the variance of the seconds and minutes columns
ufo[["seconds", "minutes"]].var()

In [None]:
# Log normalize the seconds column
ufo["seconds_log"] = np.log(ufo["seconds"])

In [None]:
# Print out the variance of just the seconds_log column
print(ufo["seconds_log"].var())

## Engineering new features

### Encoding categorical variables

There are a couple of columns in the UFO dataset that need to be encoded before they can be modeled through scikit-learn. We'll do that transformation here, using both binary and one-hot encoding methods.
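
For the binary case, an `apply` with a lambda and a vectorized comparison give the same result. The sketch below uses made-up country values. The vectorized form is usually faster on large columns.

```python
import pandas as pd

countries = pd.Series(["us", "ca", "us", "us"], name="country")  # illustrative values

# apply-based encoding: 1 for "us", 0 for anything else
via_apply = countries.apply(lambda val: 1 if val == "us" else 0)

# Vectorized comparison producing the same 0/1 values
via_compare = (countries == "us").astype(int)

print(via_apply.tolist())
print(via_compare.tolist())
```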

In [None]:
# Use pandas to encode "us" values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda val: 1 if val == "us" else 0)

In [None]:
# Print the number of unique type values
print(len(ufo["type"].unique()))

In [None]:
# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo["type"])

In [None]:
# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)

In [None]:
ufo.head()

### Features from dates

Another feature engineering task to perform is month and year extraction. We'll perform this task on the `date` column of the `ufo` dataset.
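
Once a column has the `datetime` type, the `.dt` accessor is another way to pull out date parts for the whole column at once. This is a small sketch on made-up timestamps.

```python
import pandas as pd

# Illustrative timestamps
dates = pd.Series(pd.to_datetime(["2011-11-03 19:21:00", "2004-10-03 19:05:00"]))

# The .dt accessor exposes datetime parts for the whole column at once
months = dates.dt.month
years = dates.dt.year

print(months.tolist())
print(years.tolist())
```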

In [None]:
# Look at the first 5 rows of the date column
ufo["date"].head()

In [None]:
# Extract the month from the date column
ufo["month"] = ufo["date"].apply(lambda row: row.month)

In [None]:
# Extract the year from the date column
ufo["year"] = ufo["date"].apply(lambda row: row.year)

In [None]:
# Take a look at the head of all three columns
ufo[["date", "month", "year"]].head()

### Text vectorization

Let's transform the `desc` column in the UFO dataset into tf/idf vectors, since there's likely something we can learn from this field.

In [None]:
# Take a look at the head of the desc field
ufo["desc"].head()

In [None]:
# Create the tfidf vectorizer object
vec = TfidfVectorizer()

In [None]:
# Use vec's fit_transform method on the desc field
desc_tfidf = vec.fit_transform(ufo["desc"])

In [None]:
# Look at the number of columns this creates.
desc_tfidf.shape

The text vector has a large number of columns. We'll work on selecting the features we want to use for modeling in the next section.

## Feature selection and modeling

### Selecting the ideal dataset

Let's get rid of some of the unnecessary features. Because we have an encoded country column, `country_enc`, we keep it and drop other columns related to location: `city`, `country`, `lat`, `long`, `state`.

We have columns related to `month` and `year`, so we don't need the `date` or `recorded` columns.

We vectorized `desc`, so we don't need it anymore. For now we'll keep `type`.

We'll keep `seconds_log` and drop `seconds` and `minutes`.

Let's also get rid of the `length_of_time` column, which is unnecessary after extracting `minutes`.

In [None]:
# Check the correlation between the seconds, seconds_log, and minutes columns
ufo[["seconds", "seconds_log", "minutes"]].corr()

In [None]:
# Make a list of features to drop   
to_drop = ["city", "country", "date", "desc", "lat", "length_of_time", "long", "minutes", "recorded", "seconds", "state"]

In [None]:
# Drop those features
ufo = ufo.drop(to_drop, axis=1)

In [None]:
# Let's also filter some words out of the text vector we created
vocab = {v:k for k,v in vec.vocabulary_.items()}
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)

We're almost done. In the next examples, we'll try modeling the UFO data in a couple of different ways.

### Modeling the UFO dataset, part 1

In this example, we're going to build a k-nearest neighbor model to predict which country the UFO sighting took place in. Our `X` dataset has the log-normalized seconds column, the one-hot encoded type columns, as well as the month and year when the sighting took place. The `y` labels are the encoded country column, where 1 is `us` and 0 is `ca`.
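
Because the country labels are likely to be imbalanced, the split uses `stratify=y`. As a sketch of what that does, the toy labels below (80 ones and 20 zeros, made up for illustration) keep the same class balance in both the training and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 ones and 20 zeros (illustrative)
y = np.array([1] * 80 + [0] * 20)
X = np.arange(100).reshape(-1, 1)

train_X, test_X, train_y, test_y = train_test_split(
    X, y, stratify=y, random_state=42
)

# With stratify=y, both splits keep the 80/20 class balance
print(train_y.mean(), test_y.mean())
```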

In [None]:
X = ufo.drop(['type', 'country_enc'], axis=1)
y = ufo['country_enc']

In [None]:
# Take a look at the features in the X set of data
X.columns

In [None]:
# Split the X and y sets using train_test_split, setting stratify=y
train_X, test_X, train_y, test_y = train_test_split(X, y, stratify=y, random_state=42)

In [None]:
# Fit knn to the training sets
knn.fit(train_X, train_y)

In [None]:
# Print the score of knn on the test sets
print(knn.score(test_X, test_y))

This model performs pretty well. It seems like we've made pretty good feature selection choices here.

### Modeling the UFO dataset, part 2

Finally, let's build a model using the text vector we created, `desc_tfidf`, using the `filtered_words` list to create a filtered text vector. Let's see if we can predict the `type` of the sighting based on the text. We'll use a Naive Bayes model for this.

In [None]:
# Use the list of filtered words we created to filter the text vector
filtered_text = desc_tfidf[:, list(filtered_words)]

In [None]:
# Split the X and y sets using train_test_split, setting stratify=y
y = ufo['type']
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y, random_state=42)

In [None]:
# Fit nb to the training sets
nb.fit(train_X, train_y)

In [None]:
# Print the score of nb on the test sets
nb.score(test_X, test_y)

As we can see, this model performs very poorly on this text data. This is a clear case where iteration would be necessary to figure out what subset of text improves the model, and if perhaps any of the other features are useful in predicting `type`.

We now know how to deal with basic data issues like missing data and incorrect types, how to standardize numerical values and process categorical ones, how to engineer new features that will improve our dataset, and finally, how to select features for modeling.

---
**[Week 5 - Data Preprocessing and Hyperparameter Tuning](https://radu-enuca.gitbook.io/ml-challenge/preprocessing-and-tuning)**

*Have questions or comments? Visit the ML Challenge Mattermost Channel.*