# Week 07

## Scikit-Learn and regression modeling

### Setup

Run the following two cells to import all necessary libraries and helpers for this week's exercises.

In [None]:
!wget -q https://github.com/DM-GY-9103-2024S-R/9103-utils/raw/main/src/data_utils.py
!wget -q https://github.com/DM-GY-9103-2024S-R/9103-utils/raw/main/src/io_utils.py

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OrdinalEncoder

from data_utils import LinearRegression, MinMaxScaler, PolynomialFeatures
from io_utils import object_from_json_url

### Regression

Regression, or Regression Analysis, is a set of statistical processes for estimating the relationship between a dependent variable (sometimes called the 'outcome', 'response' or 'label') and one or more independent variables (called 'features', 'dimensions' or 'columns').

For example, let's say we have the following data about people's wages and years of experience:

<img src="./imgs/wages-exp.png" width="620px"/>

We could use regression to calculate how the values for wages are affected by years of experience in our dataset, and then create a function to more generally estimate the relation between wages and experience:

<img src="./imgs/wages-exp-fit.png" width="620px"/>

We could now estimate wages for values of years of experience that we didn't have measurements for.

This is only an estimate. In general, the more data points we use and the more relevant features we have in our dataset, the better the regression results tend to be.
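
As a minimal sketch of the idea above, the following plain-Python snippet fits a straight line to a handful of (years of experience, wage) pairs using the closed-form least-squares formulas, and then estimates a wage for a value that is not in the data. The numbers and the `fit_line` helper are hypothetical; they are made up for illustration and are not taken from the dataset shown in the figures.

```python
# Hypothetical data: years of experience and hourly wage (made-up values)
years = [1, 2, 3, 4, 5]
wages = [12.0, 14.0, 16.0, 18.0, 20.0]

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(years, wages)

# Estimate the wage for 7 years of experience, a value that is not in the data
estimate = slope * 7 + intercept
print(slope, intercept, estimate)
```

With real data the points will not sit exactly on a line, so the fitted line only approximates the relationship.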

### Setting up Regression

For a simple dataset we can perform regression by following these steps:

1. Load dataset
2. Encode non-numeric features as numbers
3. Normalize the data
4. Separate the outcome variable and the feature variables
5. Create a regression model
6. Run model on input data and measure error
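
Steps 2 and 3 can be sketched in plain Python to show what they do conceptually. This is an illustration on a tiny made-up dataset, not the implementation used by the `OrdinalEncoder` or `MinMaxScaler` objects; the `cut_values`, `carat_values` and `min_max_scale` names are hypothetical.

```python
# Hypothetical mini-dataset: one categorical column and one numeric column
cut_values = ["Good", "Ideal", "Fair", "Premium"]
carat_values = [0.5, 1.0, 0.3, 2.3]

# Step 2: ordinal encoding maps each category to its position in an ordered list
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
cuts_encoded = [cut_order.index(c) for c in cut_values]

def min_max_scale(values):
    """Step 3: rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

cuts_scaled = min_max_scale(cuts_encoded)
carats_scaled = min_max_scale(carat_values)

print(cuts_encoded)  # [1, 4, 0, 3]
print(cuts_scaled)
print(carats_scaled)
```

After both steps every column lives in the same 0 to 1 range, so no single feature dominates just because of its units.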

### Diamond Prices

Let's use the dataset from last week to set up a diamond price estimator.

Steps 1 - 3 should look familiar:

In [None]:
## 1. Load Dataset
DIAMONDS_FILE = "https://raw.githubusercontent.com/DM-GY-9103-2024S-R/9103-utils/main/datasets/json/diamonds.json"

# Read into DataFrame
diamonds_data = object_from_json_url(DIAMONDS_FILE)
diamonds_df = pd.DataFrame.from_records(diamonds_data)


## 2. Encode non-numeric values
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
color_order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']

diamond_encoder = OrdinalEncoder(categories=[cut_order, color_order, clarity_order])

ccc_vals = diamond_encoder.fit_transform(diamonds_df[["cut", "color", "clarity"]].values)
diamonds_df[["cut", "color", "clarity"]] = ccc_vals


## 3. Normalize
diamond_scaler = MinMaxScaler()
diamonds_scaled = diamond_scaler.fit_transform(diamonds_df)

#### Choose Features

Now we separate the outcome variable values and the independent variables.

Let's start simple and use only one feature: carat.

In [None]:
## 4. Separate the outcome variable and the independent variables
prices = diamonds_scaled["price"]
carats = diamonds_scaled[["carat"]]

# Plot the variables, just for checking
plt.scatter(carats, prices, marker='o', linestyle='', alpha=0.3)
plt.xlabel("carat")
plt.ylabel("price")
plt.show()

#### Model

Setup and create regression model:

In [None]:
## 5. Create a LinearRegression object
price_model = LinearRegression()

# Create a model that relates price of diamonds to their carat value
model = price_model.fit(carats, prices)

#### Evaluate

Run the regression model, put the result back in a DataFrame and measure its error.

There are a lot of steps here that are just for re-structuring the data back to its original shape and range (before normalization).
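
To make the un-normalizing step less mysterious, here is a small sketch of how min-max scaling can be reversed, assuming the scaler uses the standard formula `(value - min) / (max - min)`. The prices are made-up values and the variable names are hypothetical; the scaler object used in this notebook may organize this differently internally.

```python
# Hypothetical prices in dollars (made-up values)
example_prices = [300.0, 1500.0, 4200.0, 18000.0]

lo, hi = min(example_prices), max(example_prices)

# Forward: map dollars into the 0..1 range
scaled = [(p - lo) / (hi - lo) for p in example_prices]

# Inverse: map 0..1 values back to dollars
restored = [s * (hi - lo) + lo for s in scaled]

# A prediction made in the scaled range can be converted the same way
example_prediction_scaled = 0.5
example_prediction_dollars = example_prediction_scaled * (hi - lo) + lo
print(restored, example_prediction_dollars)
```

The inverse needs the same minimum and maximum that were used for scaling, which is why the same scaler object is reused for both directions.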

In [None]:
## 6. Run the model on the training data
predicted_scaled = price_model.predict(carats)

# Un-normalize the data
predicted = diamond_scaler.inverse_transform(predicted_scaled)

# Measure error (RMSE: the square root of the mean squared error)
mean_squared_error(diamonds_df["price"].values, predicted["price"].values) ** 0.5

#### Result

Hmmm... this value is the root mean squared error (RMSE), which means that our model's price estimates are typically off by about $\$1388$.
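
For reference, here is a minimal sketch of how a root mean squared error can be computed by hand, next to the plain mean absolute error. The `actual_prices` and `predicted_prices` values are made up for illustration and are not outputs of the model above.

```python
import math

# Hypothetical true and predicted prices in dollars (made-up values)
actual_prices = [1000.0, 2000.0, 3000.0, 4000.0]
predicted_prices = [1100.0, 1900.0, 3300.0, 3700.0]

# Squared error for each prediction
squared_errors = [(a - p) ** 2 for a, p in zip(actual_prices, predicted_prices)]

# Mean squared error, then its square root to get back to dollars
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

# Mean absolute error, for comparison
mae = sum(abs(a - p) for a, p in zip(actual_prices, predicted_prices)) / len(actual_prices)
print(rmse, mae)
```

Because the errors are squared before averaging, large mistakes count more heavily, so the RMSE is never smaller than the mean absolute error.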

We can plot our predictions with the original data:

In [None]:
# Plot the original values
plt.scatter(carats, prices, marker='o', linestyle='', alpha=0.3)
plt.xlabel("carat")
plt.ylabel("price")

# Plot the predictions
plt.scatter(carats, predicted_scaled, color='r', marker='o', linestyle='', alpha=0.05)
plt.xlabel("carat")
plt.ylabel("price")
plt.show()

#### Using more features

Let's use a few more features to build our model:

In [None]:
## 4. Separate the outcome variable and the independent variables
prices = diamonds_scaled["price"]
features = diamonds_scaled[["carat", "x", "y"]]

## 5. Create a LinearRegression object
price_model = LinearRegression()

# Create a model that relates price of diamonds to their carat value as well as width and length
price_model.fit(features, prices)

## 6. Run the model on the training data
predicted_scaled = price_model.predict(features)

# Un-normalize the data
predicted = diamond_scaler.inverse_transform(predicted_scaled)

# Measure error (RMSE: the square root of the mean squared error)
mean_squared_error(diamonds_df["price"].values, predicted["price"].values) ** 0.5

#### Using all features

The model is better, but it only improved by a little bit.

Let's use all of the features to build our model:

In [None]:
## 4. Separate the outcome variable and the independent variables
prices = diamonds_scaled["price"]

# All except price
features = diamonds_scaled.drop(columns=["price"])

## 5. Create a LinearRegression object
price_model = LinearRegression()

# Create a model that relates price of diamonds to many features
price_model.fit(features, prices)

## 6. Run the model on the training data
predicted_scaled = price_model.predict(features)

# Un-normalize the data
predicted = diamond_scaler.inverse_transform(predicted_scaled)

# Measure error (RMSE: the square root of the mean squared error)
mean_squared_error(diamonds_df["price"].values, predicted["price"].values) ** 0.5

#### Plot Result

The error is getting smaller.

Let's plot all of the prices from the original dataset and the reconstructed prices from our model:

In [None]:
prices_original = diamonds_df["price"]
prices_predicted = predicted["price"]

# Plot the original and predicted prices
plt.plot(sorted(prices_original), marker='o', linestyle='', alpha=0.3)
plt.plot(sorted(prices_predicted), color='r', marker='o', markersize=3, linestyle='', alpha=0.1)
plt.ylabel("price")
plt.show()

#### Interpretation

The model doesn't look too bad.

It doesn't seem to perform very well for diamonds at the extreme ends of the price range: the cheapest and the most expensive ones.

And since even a small percentage of error for an expensive diamond contributes to a large error in dollars, this is probably where a lot of the error is coming from.
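
A quick bit of arithmetic illustrates the point. The two prices and the 10% error below are made-up numbers, chosen only to show how the same relative error translates into very different dollar amounts.

```python
# Hypothetical prices in dollars (made-up values)
cheap_price = 500.0
expensive_price = 15000.0

# The same relative error on both diamonds, in percent
relative_error_percent = 10

cheap_error = cheap_price * relative_error_percent / 100
expensive_error = expensive_price * relative_error_percent / 100

# Squared errors are what the mean squared error averages
print(cheap_error, cheap_error ** 2)
print(expensive_error, expensive_error ** 2)

# How many times more the expensive diamond contributes to the squared error
print(expensive_error ** 2 / cheap_error ** 2)
```

In this example the expensive diamond's squared error is 900 times larger, even though both predictions are off by the same percentage.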

#### Even More Features !

One trick to improve our model is to create some extra features from the current ones.

For example, in addition to considering carat and width of each diamond separately, we can create a feature that is a combination of these two values, such as their product.

Scikit-Learn has an object called [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) that does exactly this.

We just have to instantiate it and use it to create some extra features for us.
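
To get a feel for what these extra features are, here is a plain-Python sketch that expands one row of values into all products up to a given degree. The `polynomial_features` helper and the example row are hypothetical; this is an illustration of the idea, not the library's implementation, and the ordering of the output is just one possible choice.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_features(row, degree):
    """Return all products of the values in `row` with total degree 1..degree."""
    expanded = []
    for d in range(1, degree + 1):
        # Each combination picks d column indices, with repeats allowed
        for combo in combinations_with_replacement(range(len(row)), d):
            expanded.append(prod(row[i] for i in combo))
    return expanded

# Hypothetical scaled values for one diamond: [carat, width]
example_row = [0.5, 0.25]

# Degree 2 yields: carat, width, carat^2, carat*width, width^2
expanded_row = polynomial_features(example_row, degree=2)
print(expanded_row)
```

The number of generated features grows quickly with the number of input columns and with the degree, which is worth keeping in mind when choosing the degree.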

In [None]:
## 4. Separate the outcome variable and the independent variables
prices = diamonds_scaled["price"]
features = diamonds_scaled.drop(columns=["price"])

## 4B. Create extra features
poly = PolynomialFeatures(degree=3, include_bias=False)
features_poly = poly.fit_transform(features)

## 5. Create a LinearRegression object
price_model = LinearRegression()

# Create a model that relates price of diamonds to many features
result = price_model.fit(features_poly, prices)

## 6. Run the model on the training data
predicted_scaled = price_model.predict(features_poly)

# Un-normalize the data
predicted = diamond_scaler.inverse_transform(predicted_scaled)

# Measure error (RMSE: the square root of the mean squared error)
mean_squared_error(diamonds_df["price"].values, predicted["price"].values) ** 0.5

#### Better !

Let's plot the resulting prices:

In [None]:
prices_original = diamonds_df["price"]
prices_predicted = predicted["price"]

# Plot the original and predicted prices
plt.plot(sorted(prices_original), marker='o', linestyle='', alpha=0.3)
plt.plot(sorted(prices_predicted), color='r', marker='o', markersize=3, linestyle='', alpha=0.1)
plt.ylabel("price")
plt.show()