# Predictive Modeling for the Bachelorette Predictor
### Kwame V. Taylor

I will use linear regression, a supervised machine learning method, to predict each contestant's ```ElimWeek```.

## Set up Environment

In [1]:
import numpy as np
import pandas as pd

from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error, explained_variance_score, mean_absolute_error
from sklearn.linear_model import LinearRegression, TweedieRegressor
from sklearn.feature_selection import RFE
from sklearn.preprocessing import PolynomialFeatures

import warnings
warnings.filterwarnings("ignore")

In [10]:
from wrangle import acquire_data, join_dfs, drop_extra_cols
from preprocessing import handle_dates_and_elims, train_validate_test
from model import model_1, model_1_test

## Wrangle and Preprocess the Data

In [3]:
df, join = acquire_data()
df = join_dfs(df, join)
df = drop_extra_cols(df)

In [4]:
df = handle_dates_and_elims(df)
X_train, y_train, X_validate, y_validate, X_test, y_test, train, validate, test = train_validate_test(df, 'ElimWeek')

Shape of train: (147, 4) | Shape of validate: (64, 4) | Shape of test: (53, 4)
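
The body of `train_validate_test` lives in `preprocessing.py` and is not shown in this notebook. As a rough sketch only, a three-way split with the same row counts as above (147 / 64 / 53 out of 264) could be built from two calls to scikit-learn's `train_test_split`. The helper name `split_three_ways`, the 20% and 30% proportions, the seed, and the synthetic data are all assumptions for illustration, not the project's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_three_ways(df, target, seed=123):
    """Hypothetical helper: split df into train, validate, and test sets."""
    # Hold out 20% of all rows as the test set.
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    # Hold out 30% of the remaining rows as the validate set.
    train, validate = train_test_split(train_validate, test_size=0.3, random_state=seed)

    # Separate the features (X) from the target (y) for each split.
    X_train, y_train = train.drop(columns=[target]), train[target]
    X_validate, y_validate = validate.drop(columns=[target]), validate[target]
    X_test, y_test = test.drop(columns=[target]), test[target]
    return X_train, y_train, X_validate, y_validate, X_test, y_test


# Synthetic stand-in data with the same number of rows as the real dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(23, 40, size=264),
    "ElimWeek": rng.integers(1, 11, size=264),
})
X_tr, y_tr, X_va, y_va, X_te, y_te = split_three_ways(demo, "ElimWeek")
print(len(X_tr), len(X_va), len(X_te))
```

Splitting before any modeling keeps the validate and test rows unseen until they are needed for evaluation.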


## Modeling

**The goal is to produce a predictive model that outperforms the baseline in predicting the target value -- in this case, ```ElimWeek```.**

### Define and Evaluate Baseline

In [5]:
#np.median(y_train)
np.mean(y_train)

3.727891156462585

In [6]:
#baseline = y_train.median()
baseline = y_train.mean()

# ** binds tighter than /, so the exponent is parenthesized to take the square root
baseline_rmse_train = round(mean_squared_error(y_train, np.full(len(y_train), baseline))**(1/2), 4)
print('RMSE (Root Mean Square Error) of Baseline on train data:\n', baseline_rmse_train)
baseline_rmse_validate = round(mean_squared_error(y_validate, np.full(len(y_validate), baseline))**(1/2), 4)
print('RMSE (Root Mean Square Error) of Baseline on validate data:\n', baseline_rmse_validate)

RMSE (Root Mean Square Error) of Baseline on train data:
 2.8869
RMSE (Root Mean Square Error) of Baseline on validate data:
 2.8935


The mean performed better than the median as a constant prediction, so my baseline will be the train mean, ```3.727891156462585```.
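
To make the mean-versus-median comparison concrete, here is a minimal sketch on made-up elimination weeks (the values in `y_demo` are invented, not project data). RMSE is the square root of the mean squared error, and the mean is the constant that minimizes it, so on the data it was computed from a mean baseline can never score a worse RMSE than a median baseline.

```python
import numpy as np


def constant_rmse(y_true, constant):
    """RMSE of predicting the same constant for every observation."""
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((y_true - constant) ** 2)))


# Invented elimination weeks, for illustration only.
y_demo = np.array([1, 1, 1, 2, 2, 3, 4, 6, 8, 10])

rmse_mean = constant_rmse(y_demo, y_demo.mean())
rmse_median = constant_rmse(y_demo, np.median(y_demo))

print("mean baseline RMSE:  ", round(rmse_mean, 6))
print("median baseline RMSE:", round(rmse_median, 6))
```

The RMSE of the mean baseline equals the population standard deviation of the target, which gives a quick sanity check on the number.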

For the MVP, I'll build just one model, provided it beats the baseline.

### Model 1 - Ordinary Least Squares (OLS) using Linear Regression

In [7]:
X_train.columns

Index(['Age', 'Season', 'One-on-One_Score', 'FirstDate'], dtype='object')

In [8]:
# use all features except season

X = X_train.drop(columns=['Season'])
y = y_train

X_v = X_validate.drop(columns=['Season'])
y_v = y_validate

lm_pred, lm_rmse, lm_pred_v, lm_rmse_v = model_1(X, y, X_v, y_v)

RMSE for OLS using LinearRegression

On train data:
 1.130448 

 On validate data:
 1.005982


This model's RMSE is well below the baseline's on both train and validate. 🎉
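
`model_1` is imported from `model.py`, so its body is not shown in this notebook. A minimal sketch of what such a helper might do, assuming it fits scikit-learn's `LinearRegression` on train and reports RMSE on both splits, is below. The name `ols_rmse`, its return values, and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def ols_rmse(X_train, y_train, X_validate, y_validate):
    """Hypothetical stand-in for model_1: fit OLS on train, score both splits."""
    lm = LinearRegression()
    lm.fit(X_train, y_train)  # learn coefficients from train only

    pred_train = lm.predict(X_train)
    pred_validate = lm.predict(X_validate)

    # RMSE is the square root of the mean squared error.
    rmse_train = float(np.sqrt(mean_squared_error(y_train, pred_train)))
    rmse_validate = float(np.sqrt(mean_squared_error(y_validate, pred_validate)))
    return lm, rmse_train, rmse_validate


# Synthetic data: the target is a noisy linear function of three features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 4 + rng.normal(scale=0.5, size=200)

lm, rmse_tr, rmse_va = ols_rmse(X_demo[:150], y_demo[:150], X_demo[150:], y_demo[150:])
print("train RMSE:", round(rmse_tr, 4), "| validate RMSE:", round(rmse_va, 4))
```

Returning the fitted model lets the same object score the test set later without refitting on test data.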

**Now that I know Model 1 beats the baseline on train and validate, I will evaluate it on the test data.**

In [11]:
# use all features except season

X = X_test.drop(columns=['Season'])
y = y_test

lm_pred, lm_rmse = model_1_test(X, y)

RMSE for OLS using LinearRegression

On test data:
 0.896826


**Looks good!! Model 1's test RMSE is in line with its train and validate scores, and it beats the baseline.** 🥳

In the next iteration of this project, I will save a DataFrame (as a .csv) containing the best model's predictions.
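
One possible way to do that, sketched with invented values: collect the actual and predicted values in a DataFrame and write it out with `DataFrame.to_csv`. The column names and the file name `predictions.csv` are placeholders, not names used in the project.

```python
import os
import tempfile

import pandas as pd

# Invented actual and predicted elimination weeks, for illustration only.
predictions = pd.DataFrame({
    "actual_ElimWeek": [1, 3, 5, 8],
    "predicted_ElimWeek": [1.4, 2.7, 5.9, 7.2],
})
predictions["residual"] = predictions["predicted_ElimWeek"] - predictions["actual_ElimWeek"]

# Write to a temporary directory here; in the project this would be a real path.
out_path = os.path.join(tempfile.mkdtemp(), "predictions.csv")
predictions.to_csv(out_path, index=False)  # index=False omits the row index column

# Read the file back to confirm the round trip.
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```

Keeping the residual alongside each prediction makes it easier to inspect where the model misses most.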

Now I will put this code into functions in a ```model.py``` file, and transfer my findings to the main notebook.