![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# Lab | Making predictions with logistic regression

In this lab, you will be using the [Sakila](https://dev.mysql.com/doc/sakila/en/) database of movie rentals.

To optimize our inventory, we would like to know which films will be rented next month, and we have been asked to build a model that predicts this.




1. Create a query or queries to extract the information you think may be relevant for building the prediction model. It should include some film features and some rental features. Use the data from 2005.

In [17]:
# prep: import modules and get pwd
import pymysql
from sqlalchemy import create_engine
import pandas as pd
import getpass  # To get the password without showing the input
import numpy as np
password = getpass.getpass()



In [29]:
# get the data
connection_string = 'mysql+pymysql://root:' + password + '@localhost/sakila'
engine = create_engine(connection_string)

In [37]:
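query_features = '''SELECT f.film_id, f.rental_rate, f.special_features, f.rating,
       MIN(fc.category_id) AS category_id,  -- aggregate so the query is valid under ONLY_FULL_GROUP_BY
       MIN(fa.actor_id) AS actor_id         -- keeps a single (lowest-ID) actor per film
FROM sakila.film AS f
LEFT JOIN sakila.film_actor fa USING (film_id)
LEFT JOIN sakila.film_category fc USING (film_id)
GROUP BY f.film_id;'''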
features = pd.read_sql_query(query_features, engine)

In [127]:
query_target = '''SELECT f.film_id, MIN(rental_date) AS rental_date  -- first May rental of each film
FROM sakila.rental
LEFT JOIN sakila.inventory i USING (inventory_id)
LEFT JOIN sakila.film f USING (film_id)
WHERE rental_date BETWEEN '2005-05-01 00:00:00' AND '2005-05-31 23:59:59'
GROUP BY f.film_id
ORDER BY rental_date;'''
target = pd.read_sql_query(query_target, engine)

We want to use the film title, category name, and actor name instead of `film_id`, `category_id`, and `actor_id`.

Later, we will encode those variables.
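The note above can be sketched as a query that joins the lookup tables to fetch readable names. This is only a sketch: it runs against a tiny in-memory SQLite database standing in for Sakila. The table and column names follow the Sakila schema, but the rows (`FILM A`, `CATEGORY X`, and so on) are made-up placeholders, and `query_names` is a hypothetical variable name.

```python
import sqlite3

# In-memory stand-in for the Sakila tables involved (placeholder rows, not real Sakila data)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE film_category (film_id INTEGER, category_id INTEGER);
CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE film_actor (actor_id INTEGER, film_id INTEGER);
INSERT INTO film VALUES (1, 'FILM A'), (2, 'FILM B');
INSERT INTO category VALUES (6, 'CATEGORY X'), (11, 'CATEGORY Y');
INSERT INTO film_category VALUES (1, 6), (2, 11);
INSERT INTO actor VALUES (1, 'ANNA', 'ONE'), (19, 'BOB', 'TWO');
INSERT INTO film_actor VALUES (1, 1), (19, 2);
""")

# Join through the bridge tables to replace IDs with names.
# Note: `||` is SQLite string concatenation; on MySQL, CONCAT(...) would be used instead.
query_names = """
SELECT f.title, c.name AS category, a.first_name || ' ' || a.last_name AS actor
FROM film f
LEFT JOIN film_category fc ON fc.film_id = f.film_id
LEFT JOIN category c ON c.category_id = fc.category_id
LEFT JOIN film_actor fa ON fa.film_id = f.film_id
LEFT JOIN actor a ON a.actor_id = fa.actor_id
ORDER BY f.film_id;
"""
rows = conn.execute(query_names).fetchall()
print(rows)
```

With the real data, a film with several actors would produce one row per actor, so some aggregation or selection step would still be needed before modelling.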

In [39]:
features

| | film_id | rental_rate | special_features | rating | category_id | actor_id |
|---|---|---|---|---|---|---|
| 0 | 1 | 0.99 | Deleted Scenes,Behind the Scenes | PG | 6 | 1.0 |
| 1 | 2 | 4.99 | Trailers,Deleted Scenes | G | 11 | 19.0 |
| 2 | 3 | 2.99 | Trailers,Deleted Scenes | NC-17 | 6 | 2.0 |
| 3 | 4 | 2.99 | Commentaries,Behind the Scenes | G | 11 | 41.0 |
| 4 | 5 | 2.99 | Deleted Scenes | G | 8 | 51.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | 996 | 0.99 | Trailers,Behind the Scenes | G | 6 | 3.0 |
| 996 | 997 | 0.99 | Trailers,Behind the Scenes | NC-17 | 12 | 23.0 |
| 997 | 998 | 0.99 | Deleted Scenes | NC-17 | 11 | 13.0 |
| 998 | 999 | 2.99 | Trailers,Deleted Scenes | R | 3 | 52.0 |


In [40]:
target

| | film_id | rental_date |
|---|---|---|
| 0 | 80 | 2005-05-24 22:53:30 |
| 1 | 333 | 2005-05-24 22:54:33 |
| 2 | 373 | 2005-05-24 23:03:39 |
| 3 | 535 | 2005-05-24 23:04:41 |
| 4 | 450 | 2005-05-24 23:05:21 |
| ... | ... | ... |
| 681 | 864 | 2005-05-31 19:19:36 |
| 682 | 859 | 2005-05-31 19:30:27 |
| 683 | 689 | 2005-05-31 20:34:45 |
| 684 | 47 | 2005-05-31 21:32:17 |


In [41]:
target['rented_in_may']=1

In [133]:
target

| | film_id | rental_date | rented_in_may |
|---|---|---|---|
| 0 | 80 | 2005-05-24 22:53:30 | 1 |
| 1 | 333 | 2005-05-24 22:54:33 | 1 |
| 2 | 373 | 2005-05-24 23:03:39 | 1 |
| 3 | 535 | 2005-05-24 23:04:41 | 1 |
| 4 | 450 | 2005-05-24 23:05:21 | 1 |
| ... | ... | ... | ... |
| 681 | 864 | 2005-05-31 19:19:36 | 1 |
| 682 | 859 | 2005-05-31 19:30:27 | 1 |
| 683 | 689 | 2005-05-31 20:34:45 | 1 |
| 684 | 47 | 2005-05-31 21:32:17 | 1 |


We want the target to cover the same film IDs (rows) as the features.
We put each `film_id` column in a list, compare the two lists, and collect the IDs that appear only in the features in a new list (314 film IDs).
Later, we want to add those IDs to the target DataFrame using a query.

In [109]:
# Collect the film IDs that appear in the features but not in the target
film_id_features = features['film_id']
film_id_target = target['film_id']

def compare_lists(l1, l2):
    """Return the values of l1 that are not present in l2."""
    seen = set(l2)  # set membership replaces the nested loop, which appended once per non-matching pair
    return [i for i in l1 if i not in seen]

missing_film_id = compare_lists(film_id_features, film_id_target)

In [111]:
len(missing_film_id)

314
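The goal described in the notes above (a target row for every film in the features) can also be reached in pandas alone, without comparing lists by hand. A minimal sketch, assuming made-up film IDs; `features_demo`, `target_demo`, and `labels` are hypothetical names standing in for the real DataFrames:

```python
import pandas as pd

# Hypothetical stand-ins for the `features` and `target` DataFrames
features_demo = pd.DataFrame({"film_id": [1, 2, 3, 4, 5]})
target_demo = pd.DataFrame({"film_id": [2, 5]})  # films rented in May

# One row per film in the features; True where the film also appears in the target
labels = features_demo[["film_id"]].copy()
labels["rented_in_may"] = labels["film_id"].isin(target_demo["film_id"])

# Film IDs present in the features but absent from the target
missing = labels.loc[~labels["rented_in_may"], "film_id"].tolist()
print(labels)
print(missing)
```

Because `isin` already yields a boolean column, the films that were not rented get `False` without a separate fill step.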

2. Create a query to get the list of films and a boolean indicating whether each film was rented last month (May 2005). This will be our target variable.


In [130]:
# Compare against the integer 1 assigned earlier; the string "1" never matches it
target['rented_in_may'] = target['rented_in_may'] == 1

In [131]:
target.head()

| | film_id | rental_date | rented_in_may |
|---|---|---|---|
| 0 | 80 | 2005-05-24 22:53:30 | True |
| 1 | 333 | 2005-05-24 22:54:33 | True |
| 2 | 373 | 2005-05-24 23:03:39 | True |
| 3 | 535 | 2005-05-24 23:04:41 | True |
| 4 | 450 | 2005-05-24 23:05:21 | True |
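Step 2 asks for every film together with a boolean, which a single query can produce by starting from the film table and left-joining the rentals. The sketch below is one possible approach, run against an in-memory SQLite stand-in with three made-up films (only the first rental date is taken from the output above); the same query shape should carry over to MySQL, though that is not verified here.

```python
import sqlite3

# In-memory stand-in for the Sakila tables involved (made-up inventory and rental rows)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, rental_date TEXT, inventory_id INTEGER);
INSERT INTO film VALUES (1), (2), (3);
INSERT INTO inventory VALUES (10, 1), (20, 2), (30, 3);
INSERT INTO rental VALUES
    (100, '2005-05-24 22:53:30', 10),  -- film 1: rented in May
    (101, '2005-06-15 10:00:00', 20);  -- film 2: rented only in June; film 3: never rented
""")

# Start from film so that unrented films are kept; MAX(...) is 1 if any rental falls in May
query_target_all = """
SELECT f.film_id,
       MAX(CASE WHEN r.rental_date >= '2005-05-01'
                 AND r.rental_date <  '2005-06-01' THEN 1 ELSE 0 END) AS rented_in_may
FROM film f
LEFT JOIN inventory i ON i.film_id = f.film_id
LEFT JOIN rental r ON r.inventory_id = i.inventory_id
GROUP BY f.film_id
ORDER BY f.film_id;
"""
result = conn.execute(query_target_all).fetchall()
print(result)
```

Using a half-open date range (`>= '2005-05-01'` and `< '2005-06-01'`) avoids having to spell out the last second of the month.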


3. Read the data into a Pandas dataframe.
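For step 3, `pd.read_sql_query` returns the query result as a DataFrame, after which a quick look at the shape, dtypes, and missing values is usually worthwhile. A minimal sketch, assuming an in-memory SQLite table in place of the MySQL engine used above; the three rows reuse values from the features output, and `films` is a hypothetical name:

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the database connection
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, rental_rate REAL, rating TEXT);
INSERT INTO film VALUES (1, 0.99, 'PG'), (2, 4.99, 'G'), (3, 2.99, 'NC-17');
""")

# Read the query result straight into a DataFrame
films = pd.read_sql_query(
    "SELECT film_id, rental_rate, rating FROM film ORDER BY film_id", conn
)

# Basic checks before any transformation
n_rows, n_cols = films.shape
missing_per_column = films.isna().sum()
print(films.dtypes)
print(missing_per_column)
```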


4. Analyze extracted features and transform them. You may need to encode some categorical variables, or scale numerical variables.
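For step 4, one possible treatment of the columns extracted above is sketched below: `special_features` holds several comma-separated values per film, `rating` is a single category, and `rental_rate` is numeric. The five rows are copied from the features output shown earlier; the choice of `StandardScaler` and the names `sample` and `X` are assumptions, not part of the lab.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First five rows of the features output shown earlier
sample = pd.DataFrame({
    "film_id": [1, 2, 3, 4, 5],
    "rental_rate": [0.99, 4.99, 2.99, 2.99, 2.99],
    "special_features": [
        "Deleted Scenes,Behind the Scenes",
        "Trailers,Deleted Scenes",
        "Trailers,Deleted Scenes",
        "Commentaries,Behind the Scenes",
        "Deleted Scenes",
    ],
    "rating": ["PG", "G", "NC-17", "G", "G"],
})

# Multi-valued column: one 0/1 indicator column per special feature
special = sample["special_features"].str.get_dummies(sep=",")

# Single-valued categorical column: one-hot encode it
rating = pd.get_dummies(sample["rating"], prefix="rating", dtype=int)

# Numeric column: scale to zero mean and unit variance
scaled = StandardScaler().fit_transform(sample[["rental_rate"]])

# Assemble the model-ready feature matrix
X = pd.concat([special, rating], axis=1)
X["rental_rate_scaled"] = scaled[:, 0]
print(X)
```

In a real workflow the scaler would be fitted on the training split only and then applied to the test split.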


5. Create a logistic regression model to predict this variable from the cleaned data.
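For step 5, a minimal sketch of fitting scikit-learn's `LogisticRegression` follows. The data here is randomly generated purely for illustration (it is not Sakila data), and the split ratio, `random_state`, and `max_iter` values are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up data: 200 "films" with 3 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Made-up binary label that depends on the first feature plus noise
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hold out a test set; stratify keeps the class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

pred = model.predict(X_test)               # hard 0/1 predictions
proba = model.predict_proba(X_test)[:, 1]  # estimated probability of class 1
print(model.score(X_test, y_test))         # mean accuracy on the test set
```

With the lab data, `X` would be the encoded and scaled feature matrix and `y` the `rented_in_may` column.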


6. Evaluate the results.
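For step 6, accuracy alone can mislead when the rented and non-rented classes are not balanced, so it helps to look at the confusion matrix, precision, and recall as well. The sketch below uses ten hand-written labels and predictions (hypothetical values, not model output from this lab) so the numbers can be checked by hand.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions (1 = rented, 0 = not rented)
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 1, 0, 0]

# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)    # (TN + TP) / total = 7 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 5 / 7
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 5 / 6

print(cm)
print(accuracy, precision, recall)
```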