Python 3.8

Project Instructions

In this project, you will use regression models to predict the number of days a customer rents DVDs for.

As with most data science projects, you will need to pre-process the data provided, in this case, a csv file called rental_info.csv. Specifically, you need to:

    Read in the csv file rental_info.csv using pandas.
    Create a column named "rental_length_days" using the columns "return_date" and "rental_date", and add it to the pandas DataFrame. This column should contain information on how many days a DVD has been rented by a customer.
    Create two columns of dummy variables from "special_features", each taking the value 1 when:
        The value contains "Deleted Scenes", stored as a column called "deleted_scenes".
        The value contains "Behind the Scenes", stored as a column called "behind_the_scenes".
    Make a pandas DataFrame called X containing all the appropriate features you can use to run the regression models, avoiding columns that leak data about the target.
    Choose the "rental_length_days" as the target column and save it as a pandas Series called y.

Following the preprocessing you will need to:

    Split the data into training and test sets named X_train, X_test, y_train, and y_test, avoiding any features that leak data about the target variable, and put 20% of the total data in the test set.
    Set random_state to 9 whenever you use a function/method involving randomness, for example, when doing a test-train split.

Recommend a model yielding a mean squared error (MSE) less than 3 on the test set

    Save the model you would recommend as a variable named best_model, and save its MSE on the test set as best_mse.

How to approach the project

1. Getting the number of rental days.

Notice that the columns "return_date" and "rental_date" are not already in a pandas datetime format - you need to convert them into a datetime format.
How to convert into datetime format?

    For example, you can convert the column "return_date" in the DataFrame df into datetime format using pd.to_datetime(df["return_date"]).

How to get number of rental days?

    Once you have converted "return_date" and "rental_date" into pandas datetime format, you can get the number of days by using the following code:


df["rental_length"] = pd.to_datetime(df["return_date"]) - pd.to_datetime(df["rental_date"])
df["rental_length_days"] = df["rental_length"].dt.days
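
To see what .dt.days returns, here is a minimal sketch on a hypothetical two-row DataFrame (the dates are made up for illustration and are not taken from rental_info.csv). It suggests that .dt.days keeps only the whole-day part of each duration and discards the remaining hours and minutes.

```python
import pandas as pd

# Hypothetical toy data; the real file may format its timestamps differently
toy = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33", "2005-06-15 23:19:16"],
    "return_date": ["2005-05-28 23:40:33", "2005-06-18 19:24:16"],
})

# Subtracting two datetime columns gives a timedelta column
toy["rental_length"] = pd.to_datetime(toy["return_date"]) - pd.to_datetime(toy["rental_date"])

# .dt.days extracts the whole-day component of each timedelta
toy["rental_length_days"] = toy["rental_length"].dt.days

print(toy[["rental_length", "rental_length_days"]])
```

The first rental lasts 3 days and about 21 hours, yet it is recorded as 3 days, so the target is a whole number of days rather than a rounded duration.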


2. Adding dummy variables using the special features column.

One way of adding dummy variables according to the entries in the "special_features" column is using np.where().
What is the code to add the dummy variables?

    If df_rental is the pandas DataFrame read from the csv file, then you can add a dummy variable for "deleted_scenes" using the following syntax: df_rental["deleted_scenes"] = np.where(df_rental["special_features"].str.contains("Deleted Scenes"), 1, 0)
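
The same pattern covers both dummy columns. The sketch below uses a hypothetical three-row DataFrame; the exact text format of "special_features" in the real file is an assumption here, and only substring matching is being illustrated.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real "special_features" strings may look different
toy = pd.DataFrame({
    "special_features": [
        '{Trailers,"Behind the Scenes"}',
        '{"Deleted Scenes"}',
        '{Trailers}',
    ]
})

# Map each new column name to the substring it should flag
dummies = {
    "deleted_scenes": "Deleted Scenes",
    "behind_the_scenes": "Behind the Scenes",
}

for column, phrase in dummies.items():
    # regex=False treats the phrase as a literal substring
    has_phrase = toy["special_features"].str.contains(phrase, regex=False)
    toy[column] = np.where(has_phrase, 1, 0)

print(toy)
```

Because the match is on substrings, a row listing several features can have both dummies set to 1.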

3. Executing a train-test split

You can perform a train-test split by using the train_test_split() function from sklearn.model_selection.
What's the code to do a train-test split?

    If X is the feature matrix and y is the target variable, a train-test split can be performed using the code X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9).
    The function train_test_split() also takes the proportion of entries to be in the test set, called test_size, which is 20%, or 0.2, in this case.
    The random_state keyword argument sets a seed so that the split can be replicated in the future.
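
A minimal sketch of the split on synthetic data (the feature names and values are hypothetical stand-ins for the rental data) shows the 80/20 proportions and how a fixed random_state reproduces the same split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 100 rows, 3 features
rng = np.random.default_rng(9)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y = pd.Series(rng.normal(size=100), name="target")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=9
)

# Repeating the call with the same seed selects the same rows
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=9
)

print(len(X_train), len(X_test))
print(X_test.index.equals(X_test_again.index))
```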

4. Performing feature selection

Identify the features with the best prediction power for the target variable.
Using Lasso Regression

    Using Lasso() from sklearn.linear_model allows you to look at feature importance by accessing the model's .coef_ attribute, where non-zero values indicate features the model retained (Lasso shrinks the coefficients of uninformative features to exactly 0).
    Lasso() can be instantiated by setting random_state to 9 and providing a positive decimal value to the alpha keyword argument for regularization (the lower the number, the weaker the regularization).
    You can subset the training and test features using .iloc[], keeping all rows and filtering columns with a boolean mask such as lasso_coef > 0. Note that this mask keeps only features with positive coefficients; lasso_coef != 0 would also keep features with negative coefficients.
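
The difference between a "> 0" mask and a "!= 0" mask is easiest to see on synthetic data. In this sketch the feature names, the coefficients used to generate y, and the sample size are all assumptions chosen for illustration: x0 has a positive effect, x1 a negative effect, and x2 no effect.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Hypothetical synthetic data: y depends on x0 (positively) and x1 (negatively)
rng = np.random.default_rng(9)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["x0", "x1", "x2"])
y = 3 * X["x0"] - 2 * X["x1"] + rng.normal(scale=0.1, size=500)

lasso = Lasso(alpha=0.3, random_state=9)
lasso.fit(X, y)
lasso_coef = lasso.coef_

# Mask used in the hint: keeps only positive coefficients
X_positive = X.iloc[:, lasso_coef > 0]

# Alternative mask: keeps every feature Lasso did not shrink to zero
X_nonzero = X.iloc[:, lasso_coef != 0]

print(dict(zip(X.columns, lasso_coef.round(2))))
print(list(X_positive.columns), list(X_nonzero.columns))
```

Here the positive mask drops x1 even though it is informative, so which mask to use depends on whether negatively related features should remain in the model.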

5. Choosing models and performing hyperparameter tuning

Try a variety of regression models.
Fitting models to the data

    You can try models such as LinearRegression(), DecisionTreeRegressor(), and RandomForestRegressor() to estimate the target variable based on the features.
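
One way to compare several candidates is to fit each on the training set and record its test MSE in a dictionary. This is a sketch on synthetic data, not the project's solution; the data, the dictionary keys, and n_estimators=50 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data standing in for the rental features
rng = np.random.default_rng(9)
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f0", "f1", "f2", "f3"])
y = 2 * X["f0"] + X["f1"] ** 2 + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=9
)

candidates = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=9),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=9),
}

# Fit each candidate on the training set and score it on the test set
test_mse = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test))

best_name = min(test_mse, key=test_mse.get)
print(test_mse)
print("lowest test MSE:", best_name)
```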

Tuning your model

    The RandomizedSearchCV() function allows you to search for the best model performance using random values from ranges of hyperparameters.
    To use this function you should pass the model object, set the param_distributions keyword argument equal to a dictionary of hyperparameters to search over, set cv equal to the number of folds of cross-validation to perform, and set random_state equal to 9.
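
A minimal sketch of a randomized search over a random forest follows. The synthetic data, the small hyperparameter ranges, n_iter=5, cv=3, and the choice of scoring are assumptions made to keep the example quick to run; they are not the project's required settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical synthetic data
rng = np.random.default_rng(9)
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=200)

# Ranges to sample hyperparameter combinations from
param_dist = {
    "n_estimators": np.arange(10, 51, 10),
    "max_depth": np.arange(1, 6),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=9),  # seed the forest itself
    param_distributions=param_dist,
    n_iter=5,                               # number of sampled combinations
    cv=3,                                   # folds of cross-validation
    scoring="neg_mean_squared_error",       # higher is better, so MSE is negated
    random_state=9,                         # seed the sampling of combinations
)
search.fit(X, y)

best_params = search.best_params_
tuned_model = search.best_estimator_  # refit on all of X, y by default
cv_mse = -search.best_score_

print(best_params, round(cv_mse, 3))
```

With the default refit=True, best_estimator_ is already trained on the full data passed to fit(), so it can be used for prediction directly.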

6. Predicting values on test set

Since you are using sklearn, you can use the .predict() method of the trained model to get predicted values. The .predict() method takes as its argument the feature matrix whose outcome you want to predict.
What is some example code to predict values?

    In the following code, a linear regression model is fit, where X is the feature matrix and y is the target variable.
    The .predict() method is then used to predict the target variable using the features in X:
    ols = LinearRegression()
    ols = ols.fit(X, y)
    y_pred = ols.predict(X)


7. Computing mean squared error

You can use the mean_squared_error() function from sklearn.metrics to compute mean squared error. It takes as input the observed target variable and the predicted target variable.
How to compute a model's mean squared error?

    You can use the code mean_squared_error(y, y_pred) to compute mean squared error. Here y is the observed target variable and y_pred contains the corresponding predicted values.
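
As a quick check of what the function computes, this sketch compares mean_squared_error() with the average of squared differences calculated by hand. The four values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted values
y = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse_sklearn = mean_squared_error(y, y_pred)

# Manual calculation: mean of the squared differences
mse_manual = np.mean((y - y_pred) ** 2)

print(mse_sklearn, mse_manual)
```

The squared errors are 0.25, 0, 4, and 1, so both calculations give 5.25 / 4 = 1.3125.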

![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to figure out how many days a customer will rent a DVD for based on some features, and they have approached you for help. They want you to try out some regression models that will help predict the number of days a customer will rent a DVD for. The company wants a model which yields an MSE of 3 or less on a test set. The model you make will help the company become more efficient at inventory planning.

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented for.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie is rated as the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.

# Start your coding from below
import pandas as pd
import numpy as np

# Modelling, tuning, and evaluation tools
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Read in data
df_rental = pd.read_csv("rental_info.csv")

# Add information on rental duration
df_rental["rental_length"] = pd.to_datetime(df_rental["return_date"]) - pd.to_datetime(df_rental["rental_date"])
df_rental["rental_length_days"] = df_rental["rental_length"].dt.days

### Add dummy variables
# Add dummy for deleted scenes
df_rental["deleted_scenes"] =  np.where(df_rental["special_features"].str.contains("Deleted Scenes"), 1, 0)
# Add dummy for behind the scenes
df_rental["behind_the_scenes"] =  np.where(df_rental["special_features"].str.contains("Behind the Scenes"), 1, 0)

# Choose columns to drop
cols_to_drop = ["special_features", "rental_length", "rental_length_days", "rental_date", "return_date"]

# Split into feature and target sets
X = df_rental.drop(cols_to_drop, axis=1)
y = df_rental["rental_length_days"]

# Further split into training and test data
X_train,X_test,y_train,y_test = train_test_split(X, 
                                                 y, 
                                                 test_size=0.2, 
                                                 random_state=9)

# Create the Lasso model
lasso = Lasso(alpha=0.3, random_state=9) 

# Train the model and access the coefficients
lasso.fit(X_train, y_train)
lasso_coef = lasso.coef_

# Perform feature selection by keeping columns with positive coefficients
X_lasso_train, X_lasso_test = X_train.iloc[:, lasso_coef > 0], X_test.iloc[:, lasso_coef > 0]

# Fit an OLS model on the Lasso-selected features
ols = LinearRegression()
ols = ols.fit(X_lasso_train, y_train)
y_test_pred = ols.predict(X_lasso_test)
mse_lin_reg_lasso = mean_squared_error(y_test, y_test_pred)

# Random forest hyperparameter space
param_dist = {'n_estimators': np.arange(1, 101, 1),
              'max_depth': np.arange(1, 11, 1)}

# Create a random forest regressor (seeded, as the instructions require for anything random)
rf = RandomForestRegressor(random_state=9)

# Use random search to find the best hyperparameters
rand_search = RandomizedSearchCV(rf, 
                                 param_distributions=param_dist, 
                                 cv=5, 
                                 random_state=9)

# Fit the random search object to the data
rand_search.fit(X_train, y_train)

# Store the best hyperparameters found by the search
hyper_params = rand_search.best_params_

# Run the random forest on the chosen hyper parameters
rf = RandomForestRegressor(n_estimators=hyper_params["n_estimators"], 
                           max_depth=hyper_params["max_depth"], 
                           random_state=9)
rf.fit(X_train,y_train)
rf_pred = rf.predict(X_test)
mse_random_forest = mean_squared_error(y_test, rf_pred)

# The random forest gives the lowest test MSE, so recommend it:
best_model = rf
best_mse = mse_random_forest

# Congratulations, you completed the project!