In [None]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

# Missing Values, Categorical Features, and Text


In this notebook, we discuss:
1. how to deal with missing values,
2. how to encode categorical features, and
3. how to encode text features.

In the process, we will work through feature engineering to construct a model that predicts vehicle efficiency.


## The Data

For this notebook, we will use the seaborn `mpg` data set which describes the fuel mileage (measured in miles per gallon or mpg) of various cars along with characteristics of those cars.  Our goal will be to build a model that can predict the fuel mileage of a car based on the characteristics of that car.

In [None]:
from seaborn import load_dataset
data = load_dataset("mpg")
data

## Quantitative Continuous Features

This data set has several quantitative continuous features that we can use to build our first model.  However, even for quantitative continuous features, we may want to do some additional feature engineering.  Things to consider are:

1. transforming features with non-linear functions (log, exp, sine, polynomials)
2. constructing products or ratios of features
3. dealing with missing values
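
As a rough illustration of the three ideas above, the following sketch applies each of them to a tiny made-up table. The column values and the derived feature names (`log_weight`, `power_to_weight`) are assumptions chosen for illustration; they are not part of the `mpg` analysis below.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the mpg data set)
toy = pd.DataFrame({
    "weight": [2000.0, 3000.0, 4000.0],
    "horsepower": [100.0, np.nan, 200.0],
})

features = toy.copy()
# 1. Non-linear transformation of a feature
features["log_weight"] = np.log(features["weight"])
# 3. Record which values were missing, then fill them with the column mean
features["horsepower_missing"] = features["horsepower"].isna().astype(float)
features["horsepower"] = features["horsepower"].fillna(features["horsepower"].mean())
# 2. Ratio of two features (computed after imputation so it is defined everywhere)
features["power_to_weight"] = features["horsepower"] / features["weight"]
features
```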

### Missing Values

We can use the Pandas `DataFrame.isna` function to find rows with missing values:

In [None]:
data[data.isna().any(axis=1)]

There are many ways to deal with missing values.  A common strategy is to substitute the mean.  Because the fact that a value is missing can itself be a useful signal, it is often a good idea to include a feature indicating that the value was missing.

In [None]:
def phi_cont(df):
    Phi = df[["cylinders", "displacement", 
              "horsepower", "weight", 
              "acceleration", 
              "model_year"]].copy()
    Phi["horsepower_missing"] = Phi["horsepower"].isna()
    Phi = Phi.fillna(Phi.mean())
    return Phi

Using our feature function, we can fit our first model to the transformed data:

In [None]:
model = LinearRegression()
model.fit(phi_cont(data), data[["mpg"]])

### Keeping Track of Progress

Because we are going to be building multiple models with different feature functions it is important to have a standard way to track each of the models.  

The following function takes the name of a model, the fitted model, its feature function, and the dictionary of models that we have already constructed.  It then evaluates the new model on the data and plots how the new model performs relative to the previous models as well as the $Y$ vs $\hat{Y}$ scatter plot.  

In addition, it updates the dictionary of models to include the new model for future plotting.

In [None]:
def evaluate_model(name, model, phi, models=None):
    # avoid sharing one mutable default dictionary across calls
    if models is None:
        models = {}
    # run the prediction function and compute the RMSE
    Yhat = model.predict(phi(data)).flatten()
    Y = data['mpg'].to_numpy()
    rmse = np.sqrt(mean_squared_error(Y, Yhat))
    print("Root Mean Squared Error:", rmse)
    
    # Save the model and rmse to the collection of models 
    models[name] = dict(model=model, phi=phi, rmse=rmse)
    
    # Generate diagnostic and model comparison plots
    fig = make_subplots(rows=1, cols=2)
    fig.add_trace(go.Scatter(x=Yhat, y=Y, mode="markers"), row=1, col=1)
    fig.update_xaxes(title = "Yhat", row=1, col=1)
    fig.update_yaxes(title = "Y", row=1, col=1)
    ymin = np.min(Yhat)
    ymax = np.max(Yhat)
    fig.add_trace(go.Scatter(x=[ymin,ymax], y=[ymin,ymax], name="y=yhat"), row=1, col=1)
    fig.add_trace(go.Bar(x=list(models.keys()), 
                         y=[models[k]['rmse'] for k in models]), row=1, col=2)    
    fig.update_layout(showlegend=False)
    fig.update_yaxes(title = "RMSE", row=1, col=2)
    fig.show()
    


models = {}

In [None]:
evaluate_model("cont.", model, phi_cont, models)

### Stable Feature Functions

Unfortunately, the feature function we just implemented applies a different transformation depending on what input we provide. Specifically, if the `horsepower` is missing when we go to make a prediction, we will substitute it with a different mean than was used when we fit our model.  Furthermore, if we only want predictions on a few records and the `horsepower` is missing from all of those records, then the feature function will be unable to substitute a meaningful value.

For example, if we were to get new records that look like the following:

In [None]:
new_data = data[data['horsepower'].isna()].head(3)
new_data

The feature function is unable to substitute the mean since none of the records have a `horsepower` value.

In [None]:
try:
    model.predict(phi_cont(new_data))
except Exception as e:
    print(e)

We can fix this by computing the mean on the original data and using that mean on any new data.

In [None]:
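# The default argument is evaluated once, when the function is defined,
# so the mean always comes from the original data.
# numeric_only=True skips the string columns (origin, name).
def phi_cont(df, data_mean = data.mean(numeric_only=True)):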
    feature_cols = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]
    Phi = df[feature_cols].copy()
    Phi["horsepower_missing"] = Phi["horsepower"].isna().astype(float)
    Phi = Phi.fillna(data_mean)
    return Phi

In [None]:
model.predict(phi_cont(new_data))

### Scikit-learn Model Imputer

Because these kinds of transformations are fairly common, scikit-learn has built-in transformations for data imputation.  These transformations have a common pattern of `fit` and `transform`.  You first `fit` the transformation to your data and then you can `transform` your data and any future data using the same transformation.

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="mean")

In [None]:
imputer.fit(data[['weight', 'horsepower']])

In [None]:
imputer.transform(data[['weight', 'horsepower']])[32]
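
To make the `fit`/`transform` split concrete, here is a minimal sketch on made-up data. The names `train`, `new_records`, and `toy_imputer` are hypothetical and the values are invented; the point is only that the mean learned during `fit` is reused when transforming records that have no `horsepower` values at all.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training data with one missing horsepower value
train = pd.DataFrame({"horsepower": [100.0, 200.0, np.nan]})
# Hypothetical new records where every horsepower value is missing
new_records = pd.DataFrame({"horsepower": [np.nan, np.nan]})

toy_imputer = SimpleImputer(strategy="mean")
toy_imputer.fit(train)                        # learns the mean of the training column
filled = toy_imputer.transform(new_records)   # reuses that mean on the new data
print(toy_imputer.statistics_)
print(filled)
```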

In [None]:
imputer.fit(data[['horsepower']])
def phi_cont(df, imputer=imputer):
    feature_cols = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]
    Phi = df[feature_cols].copy()
    Phi["horsepower_missing"] = Phi["horsepower"].isna().astype(float)
    Phi["horsepower"] = imputer.transform(Phi[["horsepower"]]).flatten()
    return Phi

In [None]:
model = LinearRegression()
model.fit(phi_cont(data), data[["mpg"]])
evaluate_model("cont.", model, phi_cont, models)

## Applying Domain Knowledge

The displacement of an engine is defined as the product of the volume of each cylinder and number of cylinders.  However, not all cylinders fire at the same time (at least in a functioning engine) so the fuel economy might be more closely related to the volume of any one cylinder.  

![Cylinders from https://gifimage.net/piston-gif-3/](https://gifimage.net/wp-content/uploads/2018/04/piston-gif-3.gif)

We can use this "domain knowledge" to compute a new feature encoding the volume per cylinder by taking the ratio of displacement and cylinders. 

In [None]:
def phi_with_displacement(df):
    Phi = phi_cont(df)
    Phi['displacement/cylinder'] = Phi['displacement'] / Phi['cylinders']
    return Phi

In [None]:
phi_with_displacement(data).head()

Fitting and evaluating our model again, we see a reduction in prediction error (RMSE).

In [None]:
model = LinearRegression()
model.fit(phi_with_displacement(data), data[["mpg"]])
evaluate_model("cont.+(d/c)", model, phi_with_displacement, models)

## Encoding Categorical Data 

The `origin` column in this data set is categorical (nominal) data taking on a fixed set of possible values.

In [None]:
data.head()

In [None]:
px.histogram(data, x='origin')

To use this kind of data in a model, we need to transform it into a vector encoding that treats each distinct value as a separate dimension.  This is called One-hot Encoding or Dummy Encoding. 

### One-Hot Encoding (Dummy Encoding)


One-hot encoding, sometimes also called **dummy encoding**, is a simple mechanism to encode categorical data as real numbers such that the magnitude of each dimension is meaningful.  Suppose a feature can take on $k$ distinct values (e.g., $k=50$ for the 50 states in the United States).  A new feature (dimension) is created for each distinct value.  For each record, all the new features are set to zero except the one corresponding to the value in the original feature, which is set to one. 

<img src="images/one_hot_state.png" width="600px">

The term one-hot encoding comes from a digital circuit encoding of a categorical state as particular "hot" wire:

<img src="images/one_hot_encoding.png" width="400px">
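
The encoding described above can be written out by hand in a few lines. This is only a sketch on a made-up column of values; the variable names are illustrative and the library functions used later in this notebook are the practical choice.

```python
import numpy as np

# Hypothetical categorical column
values = ["usa", "japan", "europe", "usa"]

# Fix the category order once so every record is encoded the same way
categories = sorted(set(values))
index = {cat: i for i, cat in enumerate(categories)}

one_hot = np.zeros((len(values), len(categories)))
for row, value in enumerate(values):
    one_hot[row, index[value]] = 1.0   # exactly one "hot" entry per row

print(categories)
print(one_hot)
```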

### Dummy Encoding in Pandas

We can construct a one-hot (dummy) encoding of the origin column using the `pandas.get_dummies` function:

In [None]:
pd.get_dummies(data[['origin']])

Using `pandas.get_dummies`, we can build a new feature function which extends our previous features with the additional dummy encoding columns.

In [None]:
def phi_with_origin(df):
    Phi = phi_with_displacement(df)
    return Phi.join(pd.get_dummies(df[['origin']]))

In [None]:
phi_with_origin(data).head()

We fit a new model with the origin feature encoding:

In [None]:
model = LinearRegression()
model.fit(phi_with_origin(data), data[["mpg"]])
evaluate_model("cont.+(d/c)+o", model, phi_with_origin, models)

Unfortunately, the above feature function is not stable.  For example, if we are asked to make a prediction for a single vehicle, the model will fail:

In [None]:
try:
    model.predict(phi_with_origin(data.head(1)))
except Exception as e:
    print(e)

To see why this fails, look at the feature transformation for a single row:

In [None]:
phi_with_origin(data.head(1))

The dummy columns are not created for the other categories.  

There are a couple of solutions.  We could maintain a list of dummy columns and always add these columns.  Alternatively, we could use a library function designed to solve this problem.  The second option is much easier.  
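
For comparison, the first option (maintaining the list of dummy columns ourselves) might look like the sketch below. The toy frames and the helper name `stable_dummies` are assumptions for illustration only; they are not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical training data and a single new record
train = pd.DataFrame({"origin": ["usa", "japan", "europe", "usa"]})
new_record = pd.DataFrame({"origin": ["japan"]})

# Record the dummy columns seen when the model was fit
dummy_columns = pd.get_dummies(train[["origin"]]).columns

def stable_dummies(df, columns=dummy_columns):
    """One-hot encode `origin`, always returning the same columns."""
    dummies = pd.get_dummies(df[["origin"]])
    # Add any missing columns (filled with 0) and drop unseen categories
    return dummies.reindex(columns=columns, fill_value=0).astype(float)

stable_dummies(new_record)
```

This works, but the column list has to be stored and passed around alongside the model, which is the bookkeeping that a fitted encoder object handles for us.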

### Scikit-learn One-hot Encoder

The scikit-learn library has a wide range of feature transformations and a framework for composing them in reusable (stable) pipelines.  Let's first look at a basic [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) transformation.

In [None]:
from sklearn.preprocessing import OneHotEncoder
oh_enc = OneHotEncoder()

We then fit that instance to some data.  This is where we would determine the specific values that a categorical feature can take:

In [None]:
oh_enc.fit(data[['origin']])

Once we fit the transformation, we can then use it to transform new data:

In [None]:
oh_enc.transform(data[['origin']].head())

In [None]:
oh_enc.transform(data[['origin']].head()).todense()

We can also inspect the categories of the one-hot encoder:

In [None]:
oh_enc.get_feature_names_out()

We can update our feature function to use the one-hot encoder instead.

In [None]:
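def phi_with_origin(df):
    Phi = phi_with_displacement(df)
    # toarray() gives a plain ndarray; get_feature_names_out() replaces
    # get_feature_names(), which was removed from recent scikit-learn releases
    dummies = pd.DataFrame(oh_enc.transform(df[['origin']]).toarray(), 
                           columns=oh_enc.get_feature_names_out(),
                           index = df.index)
    return Phi.join(dummies)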

In [None]:
phi_with_origin(data.head())

In [None]:
model = LinearRegression()
model.fit(phi_with_origin(data), data[["mpg"]])
evaluate_model("cont.+(d/c)+o", model, phi_with_origin, models)

In [None]:
data[['name']]

## Encoding _Text_ Features

The only remaining feature to encode is the vehicle name.  Is there potentially signal in the vehicle name?


In [None]:
data[['name']].head(10)

Encoding text can be challenging.  Capturing the semantics and grammar of language in mathematical (vector) representations is an active area of research.  State-of-the-art techniques often rely on neural networks trained on large collections of text. In this class, we will focus on basic text encoding techniques that are still widely used.  If you are interested in learning more, check out [BERT Explained: A Complete Guide with Theory and Tutorial](https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial).



Here we present two widely used representations of text:

* **Bag-of-Words Encoding**: encodes text by the frequency of each word
* **N-Gram Encoding**: encodes text by the frequency of sequences of words of length $N$

Both of these encoding strategies are related to the one-hot encoding with dummy features created for every word or sequence of words and with multiple dummy features having counts greater than zero.


## The Bag-of-Words Encoding


The bag-of-words encoding is widely used and a standard representation for text in many of the popular text clustering algorithms.  The following is a simple illustration of the bag-of-words encoding:

<img src="images/bag_of_words.png" width="600px">

**Notice**
1. **Stop words are often removed.** Stop-words are words like `is` and `about` that in isolation contain very little information about the meaning of the sentence.  Here is a good list of [stop-words in many languages](https://code.google.com/archive/p/stop-words/). 
1. **Word order information is lost.**  Nonetheless, the vector still suggests that the sentence is about `fun`, `machines`, and `learning`, though there are many possible meanings: _learning machines have fun learning_ or _learning about machines is fun learning_ ...
1. **Capitalization and punctuation are typically removed.**  However, emoji symbols may be worth preserving.  
1. **Sparse Encoding:** is necessary to represent the bag-of-words efficiently.  There are millions of possible words (including terminology, names, and misspellings) and so instantiating a `0` for every word that is not in each record would be inefficient.  
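
The points above can be sketched with the standard library alone. The stop-word set and the helper `bag_of_words` are hypothetical and deliberately tiny; a real application would use a fuller stop-word list and a proper tokenizer.

```python
import re
from collections import Counter

# A small, hypothetical stop-word list for illustration only
STOP_WORDS = {"is", "about", "the", "a"}

def bag_of_words(text, stop_words=STOP_WORDS):
    """Return a sparse word-count mapping for `text`."""
    # Lowercase and keep only runs of letters, dropping punctuation
    tokens = re.findall(r"[a-z]+", text.lower())
    # A Counter stores only the words that occur: a sparse representation
    return Counter(t for t in tokens if t not in stop_words)

bag_of_words("Learning about machine learning is fun!")
```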


### Professor Gonzalez is an "artist"

When Professor Gonzalez was a graduate student at Carnegie Mellon University, he and several other computer scientists created the following art piece on display in the Gates Center:

<img src="images/bag_of_words_art.jpg" width="300px">

Is this art or science? 

**Notice**
1. The unordered collection of words in the bag.
1. The stop words on the floor.
1. _The missing broom._  The original sculpture had a broom attached but the janitor got confused .... 



### Bag-of-words in Scikit-learn

We can use scikit-learn to construct a bag-of-words representation of text:

In [None]:
frost_text = [x for x in """
Some say the world will end in fire,
Some say in ice.
From what Ive tasted of desire
I hold with those who favor fire.
""".split("\n") if len(x) > 0]

frost_text

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Construct the tokenizer with English stop words
bow = CountVectorizer(stop_words="english")

# fit the model to the passage
bow.fit(frost_text)

In [None]:
# Print the words that are kept
print("Words:", list(enumerate(bow.get_feature_names_out())))

In [None]:
print("Sentence Encoding: \n")
# Print the encoding of each line
for (text, encoding) in zip(frost_text, bow.transform(frost_text)):
    print(text)
    print(encoding.todense())
    print("------------------")

## The N-Gram Encoding

The n-gram encoding is a generalization of the bag-of-words encoding designed to capture information about word ordering.  Consider the following passage of text:

> _The book was not well written but I did enjoy it._

If we rearrange the words, we can also write a sentence with the opposite sentiment:

> _The book was well written but I did not enjoy it._

A bag-of-words encoding cannot tell these two sentences apart because they contain exactly the same words, so local word order can be important when making decisions about text.  The n-gram encoding captures local word order by defining counts over sliding windows. In the following example, a bi-gram ($n=2$) encoding is constructed:

<img src="images/ngram.png" width="800px">

The above n-gram would be encoded in the sparse vector:

<img src="images/ngram_vector.png" width="300px">

Notice that the n-gram captures key pieces of sentiment information: `"well written"` and `"not enjoy"`.  
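
A sliding-window count is short to write by hand. The sketch below uses a hypothetical helper `ngrams` with a naive whitespace tokenizer (an assumption made for brevity) and compares the two example sentences: their word counts are identical, while their bi-grams differ.

```python
from collections import Counter

def ngrams(text, n=2):
    """Count the n-grams (as space-joined strings) in `text`."""
    tokens = text.lower().replace(".", "").split()
    # Slide a window of length n across the token list
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(w) for w in windows)

first = "The book was not well written but I did enjoy it."
second = "The book was well written but I did not enjoy it."

# With n=1 the two sentences have the same counts
print(ngrams(first, n=1) == ngrams(second, n=1))

# Bi-grams that appear in only one of the two sentences
print(sorted(set(ngrams(first)) - set(ngrams(second))))
print(sorted(set(ngrams(second)) - set(ngrams(first))))
```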

N-grams are often used for other types of sequence data beyond text. For example, n-grams can be used to encode genomic data, protein sequences, and click logs. 

**N-Gram Issues**
1. Maintaining the dictionary of possible n-grams can be very costly.  There are several approximations leveraging hashing that can be used to closely approximate n-gram encoding without the need to maintain the dictionary of all possible n-grams. 
1. As the size $n$ of n-grams increases the chance of observing more than one instance decreases limiting their value as a feature.
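
One way to try the hashing approximation mentioned in the first issue is scikit-learn's `HashingVectorizer`, which maps each n-gram to a column by hashing it instead of looking it up in a stored dictionary. The sketch below is a minimal illustration; the choice of `n_features=2**10` is arbitrary, and distinct n-grams may collide in the same column.

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = [
    "The book was not well written but I did enjoy it.",
    "The book was well written but I did not enjoy it.",
]

# n_features fixes the output width; no vocabulary is stored.
# alternate_sign=False and norm=None keep the entries as non-negative counts.
hasher = HashingVectorizer(n_features=2**10, ngram_range=(1, 2),
                           alternate_sign=False, norm=None)
X = hasher.transform(docs)   # stateless: no call to fit is needed

print(X.shape)
print(X.sum(axis=1))
```

The trade-off is that the mapping cannot be inverted, so we lose the ability to list which n-gram each column represents.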

In [None]:
# Construct the vectorizer for unigrams and bigrams (no stop-word removal)
bigram = CountVectorizer(ngram_range=(1, 2))
# fit the model to the passage
bigram.fit(frost_text)

In [None]:
# Print the unigrams and bigrams that are kept
print("\nWords:", list(enumerate(bigram.get_feature_names_out())))

In [None]:
print("\nSentence Encoding: \n")
# Print the encoding of each line
for (text, encoding) in zip(frost_text, bigram.transform(frost_text)):
    print(text)
    print(encoding.todense())
    print("------------------")

### Applying Text Encoding

We can add the text encoding features to our feature function:

In [None]:
bow = CountVectorizer()
bow.fit(data["name"])

def phi_with_name(df):
    Phi = phi_with_origin(df)
    bow_encoding = pd.DataFrame(
        bow.transform(df['name']).toarray(), 
        columns=bow.get_feature_names_out(),
        index = df.index)
    return Phi.join(bow_encoding)

In [None]:
Phi = phi_with_name(data)
Phi.head()

In [None]:
model = LinearRegression()
model.fit(phi_with_name(data), data[["mpg"]])
evaluate_model("cont.+(d/c)+o+n", model, phi_with_name, models)

## Quick Reflection

Notice that as we added more features we were able to improve the accuracy of our model.  This is not always a good thing and we will see the problems associated with this in a future lecture.  

It is also worth noting that our feature functions each depended on the last and that in some cases we were converting sparse features to dense features.  There is a better way to deal with feature pipelines using the scikit-learn pipelines module.  
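
As a preview, the sketch below shows one way such a pipeline could be assembled with `ColumnTransformer` and `Pipeline`. It runs on a small made-up table (the `toy` frame and its values are invented for illustration) rather than the `mpg` data, and it is not the only reasonable arrangement of these steps.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the mpg data (values invented for illustration)
toy = pd.DataFrame({
    "horsepower": [130.0, np.nan, 95.0, 70.0],
    "weight": [3500.0, 3400.0, 2400.0, 2100.0],
    "origin": ["usa", "usa", "japan", "europe"],
    "name": ["chevrolet chevelle", "ford pinto", "toyota corona", "vw rabbit"],
    "mpg": [18.0, 25.0, 24.0, 29.0],
})

features = ColumnTransformer([
    # Mean imputation plus a missing-value indicator for numeric columns
    ("cont", SimpleImputer(strategy="mean", add_indicator=True),
     ["horsepower", "weight"]),
    # Categories unseen during fit are encoded as all zeros
    ("origin", OneHotEncoder(handle_unknown="ignore"), ["origin"]),
    # CountVectorizer expects a 1-D column, so the name is given as a string
    ("name", CountVectorizer(), "name"),
])

pipeline = Pipeline([("features", features), ("model", LinearRegression())])
pipeline.fit(toy.drop(columns="mpg"), toy["mpg"])

# A single new record is transformed with the statistics learned during fit
pred = pipeline.predict(toy.drop(columns="mpg").head(1))
pred
```

Because every step is fit once and then reused, the single-record prediction at the end does not hit the stability problems we ran into with the hand-written feature functions.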

## Success!!!!!