# Lab | Making predictions with logistic regression

**In this lab, you will be using the Sakila database of movie rentals.**

In order to optimize our inventory, we would like to know which films will be rented next month and we are asked to create a model to predict it.

**Instructions**

1. Create a query or queries to extract the information you think may be relevant for building the prediction model. It should include some film features and some rental features. Use the data from 2005.
2. Create a query to get the list of films and a boolean indicating if it was rented last month (August 2005). This would be our target variable.
3. Read the data into a Pandas dataframe.
4. Analyze extracted features and transform them. You may need to encode some categorical variables, or scale numerical variables.
5. Create a logistic regression model to predict this variable from the cleaned data.
6. Evaluate the results.

In [3]:
import pymysql
from sqlalchemy import create_engine
import pandas as pd
import getpass 

In [4]:
password = getpass.getpass()

········


In [5]:
connection_string = 'mysql+pymysql://root:' + password + '@localhost/sakila'
engine = create_engine(connection_string)
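
One caveat with building the URL by plain concatenation: a password that contains characters such as `@` or `/` would be misread as part of the URL. A minimal sketch of one way around this is to percent-encode the password with the standard library's `urllib.parse.quote_plus` first. The helper name `build_connection_string` and the example password are made up for illustration; the defaults simply mirror the values used in the cell above.

```python
from urllib.parse import quote_plus


def build_connection_string(password, user="root", host="localhost", db="sakila"):
    """Return a MySQL/pymysql URL with the password percent-encoded.

    Hypothetical helper: the defaults mirror the values used in this notebook.
    """
    return f"mysql+pymysql://{user}:{quote_plus(password)}@{host}/{db}"


# Made-up password containing URL-special characters.
example_url = build_connection_string("p@ss/word")
print(example_url)
```

The resulting string can be passed to `create_engine` in the same way as `connection_string`.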

With some help from Ferreira I managed to import the data I want. I adapted his SQL code to pull the values I deemed important: film id, how many times the film was rented (counted separately for May and for the rest of 2005), a True/False flag marking the May rows, rental duration, length, rating, category and rental rate.

I did not include extra features (I don't believe they would matter much), language (all films are in English anyway) or title (we are judging by category, rating, length, price and historical demand, and the film id lets us look up a title if we need one).

In [6]:
query = '''SELECT f.film_id, COUNT(r.rental_id) AS times_rented, f.rental_duration, f.length,
    f.rating, c.name AS category, f.rental_rate,
    CASE
        WHEN r.rental_date BETWEEN '2005-05-01' AND '2005-05-31' THEN TRUE
        ELSE FALSE END AS may
FROM sakila.film f
LEFT JOIN inventory i
    ON f.film_id = i.film_id
JOIN sakila.rental r
    ON i.inventory_id = r.inventory_id
JOIN sakila.film_category fc
    ON fc.film_id = f.film_id
JOIN sakila.category c
    ON c.category_id = fc.category_id
WHERE r.rental_date BETWEEN '2005-01-01' AND '2005-12-31'
GROUP BY f.film_id, f.rental_duration, f.length, f.rating, category, may, f.rental_rate;'''
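
A note on the date filter: `rental_date` in Sakila is a DATETIME column, and MySQL reads a bare date such as `'2005-05-31'` as midnight at the start of that day. `BETWEEN '2005-05-01' AND '2005-05-31'` therefore leaves out rentals made later on 31 May. The sketch below mimics that comparison in plain Python with made-up timestamps. The half-open range (`>= '2005-05-01' AND < '2005-06-01'`) is a common alternative, not what the query above uses.

```python
from datetime import datetime

# Made-up rental timestamps, for illustration only.
rentals = [
    datetime(2005, 5, 30, 23, 59),
    datetime(2005, 5, 31, 0, 0),
    datetime(2005, 5, 31, 14, 30),
]

start = datetime(2005, 5, 1)
end_inclusive = datetime(2005, 5, 31)  # what BETWEEN ... AND '2005-05-31' compares against
end_exclusive = datetime(2005, 6, 1)   # half-open upper bound

# BETWEEN is inclusive on both ends, but the upper end is midnight on 31 May.
between_style = [start <= r <= end_inclusive for r in rentals]
# The half-open range keeps every timestamp that falls in May.
half_open = [start <= r < end_exclusive for r in rentals]

print(between_style)  # [True, True, False]
print(half_open)      # [True, True, True]
```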

In [7]:
film = pd.read_sql(query, engine)
film.head()

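| | film_id | times_rented | rental_duration | length | rating | category | rental_rate | may |
|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 19 | 6 | 113 | PG | Action | 0.99 | 0 |
| 1 | 19 | 1 | 6 | 113 | PG | Action | 0.99 | 1 |
| 2 | 21 | 2 | 3 | 129 | R | Action | 4.99 | 1 |
| 3 | 21 | 19 | 3 | 129 | R | Action | 4.99 | 0 |
| 4 | 29 | 10 | 5 | 168 | NC-17 | Action | 2.99 | 0 |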
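
Note that films 19 and 21 each appear twice above, once with `may = 0` and once with `may = 1`, because `may` is part of the GROUP BY. If one row per film is wanted for modelling, the rows could be collapsed in pandas. This is only a sketch on the five rows shown above; the column names `rentals_in_may`, `rentals_other_months` and `rented_in_may` are made up for illustration.

```python
import pandas as pd

# The five rows shown in film.head(), reduced to the columns needed here.
sample = pd.DataFrame({
    "film_id":      [19, 19, 21, 21, 29],
    "times_rented": [19, 1, 2, 19, 10],
    "may":          [0, 1, 1, 0, 0],
})

# Split the rental counts by the may flag.
sample["rentals_in_may"] = sample["times_rented"] * sample["may"]
sample["rentals_other_months"] = sample["times_rented"] * (1 - sample["may"])

# Aggregate to one row per film.
per_film = sample.groupby("film_id", as_index=False).agg(
    rentals_in_may=("rentals_in_may", "sum"),
    rentals_other_months=("rentals_other_months", "sum"),
)

# Boolean target: was the film rented at least once in May?
per_film["rented_in_may"] = per_film["rentals_in_may"] > 0
print(per_film)
```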


### Analyze extracted features and transform them. You may need to encode some categorical variables, or scale numerical variables.

In [12]:
film.shape

(1585, 8)

In [10]:
film.isna().sum()

film_id            0
times_rented       0
rental_duration    0
length             0
rating             0
category           0
rental_rate        0
may                0
dtype: int64

In [8]:
film.describe()


| | film_id | times_rented | rental_duration | length | rental_rate | may |
|---|---|---|---|---|---|---|
| count | 1585.0 | 1585.0 | 1585.0 | 1585.0 | 1585.0 | 1585.0 |
| mean | 499.197476 | 10.007571 | 4.958991 | 115.574763 | 2.957192 | 0.395584 |
| std | 285.964805 | 8.317785 | 1.39948 | 40.388044 | 1.654176 | 0.48913 |
| min | 1.0 | 1.0 | 3.0 | 46.0 | 0.99 | 0.0 |
| 25% | 255.0 | 2.0 | 4.0 | 82.0 | 0.99 | 0.0 |
| 50% | 501.0 | 9.0 | 5.0 | 114.0 | 2.99 | 0.0 |
| 75% | 745.0 | 17.0 | 6.0 | 150.0 | 4.99 | 1.0 |
| max | 1000.0 | 31.0 | 7.0 | 185.0 | 4.99 | 1.0 |


In [13]:
film.dtypes


film_id              int64
times_rented         int64
rental_duration      int64
length               int64
rating              object
category            object
rental_rate        float64
may                  int64
dtype: object

In [14]:
film['rental_duration'].value_counts()

6    341
4    328
3    324
5    307
7    285
Name: rental_duration, dtype: int64

In [15]:
film['rental_rate'].value_counts()

0.99    555
4.99    529
2.99    501
Name: rental_rate, dtype: int64

I will treat rental duration and rental rate as categorical variables, since each column takes only a handful of distinct values (five and three, respectively).

In [16]:
film['rental_rate'] = film['rental_rate'].astype('object') 

In [18]:
film['rental_duration'] = film['rental_duration'].astype('object') 

In [19]:
film.dtypes

film_id             int64
times_rented        int64
rental_duration    object
length              int64
rating             object
category           object
rental_rate        object
may                 int64
dtype: object
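
With both columns now stored as `object`, one possible next step is one-hot encoding. The following is a minimal sketch using `pd.get_dummies` on a few made-up rows; the values come from the ranges seen above, but the rows themselves and the name `toy` are illustrative only.

```python
import pandas as pd

# Made-up rows using values that occur in the real columns.
toy = pd.DataFrame({
    "rental_duration": [3, 6, 7],
    "rental_rate": [0.99, 4.99, 2.99],
    "length": [113, 129, 168],
})

# Cast to object, mirroring the cells above.
toy["rental_duration"] = toy["rental_duration"].astype("object")
toy["rental_rate"] = toy["rental_rate"].astype("object")

# One-hot encode only the two categorical columns; length stays numeric.
encoded = pd.get_dummies(toy, columns=["rental_duration", "rental_rate"])
print(encoded.columns.tolist())
```

Passing `columns=` explicitly keeps the encoding independent of how pandas infers the dtypes.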

In [22]:
# examining relationship between rate and duration

film.groupby(['rental_rate','rental_duration']).agg({'rental_duration':'count'})

rental_rate,rental_duration,count
0.99,3,135
0.99,4,115
0.99,5,91
0.99,6,126
0.99,7,88
2.99,3,90
2.99,4,102
2.99,5,93
2.99,6,111
2.99,7,105
