# Machine Learning Project

The goal of this project is to predict a movie's IMDb rating from the available information on streaming platform, directors, countries, and genres, among others. The data were taken from:
https://www.kaggle.com/ruchi798/movies-on-netflix-prime-video-hulu-and-disney

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

In [2]:
rating_db = pd.read_csv("MoviesOnStreamingPlatforms_updated.csv")
rating_db.head()

,Unnamed: 0,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime
0,0,1,Inception,2010,13+,8.8,87%,1,0,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0
1,1,2,The Matrix,1999,18+,8.7,87%,1,0,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0
2,2,3,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0
3,3,4,Back to the Future,1985,7+,8.5,96%,1,0,0,0,0,Robert Zemeckis,"Adventure,Comedy,Sci-Fi",United States,English,116.0
4,4,5,"The Good, the Bad and the Ugly",1966,18+,8.8,97%,1,0,1,0,0,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0


In [3]:
rating_db.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16744 entries, 0 to 16743
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       16744 non-null  int64  
 1   ID               16744 non-null  int64  
 2   Title            16744 non-null  object 
 3   Year             16744 non-null  int64  
 4   Age              7354 non-null   object 
 5   IMDb             16173 non-null  float64
 6   Rotten Tomatoes  5158 non-null   object 
 7   Netflix          16744 non-null  int64  
 8   Hulu             16744 non-null  int64  
 9   Prime Video      16744 non-null  int64  
 10  Disney+          16744 non-null  int64  
 11  Type             16744 non-null  int64  
 12  Directors        16018 non-null  object 
 13  Genres           16469 non-null  object 
 14  Country          16309 non-null  object 
 15  Language         16145 non-null  object 
 16  Runtime          16152 non-null  float64
dtypes: float64(2), int64(8), object(7)

In [4]:
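# Splits a comma-separated string into a list
def splitter(x):
    try:
        return x.split(",")
    except AttributeError:
        # Missing values (NaN) are floats and have no .split(); treat them as empty
        return []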

# Lists every value with its number of occurrences, counting each value of a multi-valued cell separately.
# E.g.: 'Ademar, Bernardo' --> 'Ademar': 1; 'Bernardo': 1
def sorted_bom(x):
    return sorted(list(Counter(rating_db[x].apply(splitter).apply(pd.Series).unstack().reset_index(drop=True).dropna().values).items()),key = lambda item: -item[1])
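
The counting step above can also be sketched on a toy column. This is only an illustration with made-up names, using `Series.str.split` and `explode` instead of `apply(pd.Series)`; it is not the code that was run in this notebook.

```python
from collections import Counter

import pandas as pd

# Toy column with comma-separated values and one missing entry (made-up data).
toy = pd.Series(["Ademar,Bernardo", "Ademar", None, "Carla,Ademar"])

# Split each cell into a list, flatten with explode(), and drop missing values.
values = toy.str.split(",").explode().dropna()

# most_common() returns (value, count) pairs sorted by descending count.
counts = Counter(values).most_common()
print(counts)
```

Each value of a multi-valued cell is counted once, which is the same convention used by `sorted_bom`.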

In [5]:
# Do not touch:
# sorted(list(Counter(rating_db["Directors"].apply(splitter).apply(pd.Series).unstack().reset_index(drop=True).dropna().values).items()),key = lambda x: x[1])

In [6]:
# Group the classes with a single occurrence into an "Outros <nome_coluna>" (others) category:
def generaliza_valores(x, nome_coluna):
    class_per_movie = rating_db[x].apply(splitter).apply(pd.Series)
    all_class_contributions = sorted_bom(x)

    # Classes to be replaced
    others = []

    # Identify the classes with only one occurrence
    for classes, contributions in all_class_contributions:
        if int(contributions) < 2:
            others.append(classes)

    # Keep only the classes with at least 2 occurrences; the rest share one label
    official_classes = class_per_movie.replace(others, "Outros "+nome_coluna)

    # Join each movie's values back into a single comma-separated string
    construtor = list(official_classes.columns)
    rating_db[x] = official_classes[construtor].apply(lambda row: ','.join(row[row.notnull()]), axis = 1)

    # Returned so that callers assigning the result do not receive None
    return official_classes

In [7]:
%%time

official_languages = generaliza_valores("Language", "Idiomas")
print("Idiomas ok")
official_directors = generaliza_valores("Directors", "Diretores")
print("Diretores ok")
official_countries = generaliza_valores("Country", "Paises")
print("Paises ok")

Idiomas ok
Diretores ok
Paises ok
Wall time: 2min 49s
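
The grouping of rare values into an "Outros ..." label can be sketched on a toy column as follows. This is a hedged alternative based on `explode`, `value_counts` and `groupby`, with made-up data; it was not timed against the cell above and is not the code that produced its output.

```python
import pandas as pd

# Made-up column: "Zed" and "Yan" each appear only once.
toy = pd.Series(["Ana,Zed", "Ana,Bob", "Bob", None, "Yan"], name="Directors")

# One row per (movie, value) pair; the original index is preserved.
exploded = toy.str.split(",").explode()

# Values that occur fewer than 2 times are treated as rare.
counts = exploded.value_counts()
rare = counts[counts < 2].index

# Replace rare values by a shared label, then join the values back per movie.
grouped = exploded.where(~exploded.isin(rare), "Outros Diretores")
result = (
    grouped.dropna()
    .groupby(level=0)
    .agg(",".join)
    .reindex(toy.index, fill_value="")
)
print(result.tolist())
```

Rows with a missing value end up as an empty string, which matches what the `','.join(...)` step in `generaliza_valores` yields for rows with no values.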


In [8]:
# https://stackoverflow.com/questions/57469676/python-one-hot-encoding-for-comma-separated-values

In [9]:
%%time

#Multiple One-Hot-Encoding
# `columns=` and `axis=1` are passed by keyword: pandas 2.x no longer accepts them positionally.
new_df = pd.concat([rating_db.drop(columns='Language'), rating_db['Language'].str.get_dummies(sep=",")], axis=1)
print("Rodou 1 coluna")
new_df2 = pd.concat([new_df.drop(columns='Directors'), new_df['Directors'].str.get_dummies(sep=",")], axis=1)
print("Rodou 2 colunas")
new_df3 = pd.concat([new_df2.drop(columns='Genres'), new_df2['Genres'].str.get_dummies(sep=",")], axis=1)
print("Rodou 3 colunas")
rating_db_tratado = pd.concat([new_df3.drop(columns='Country'), new_df3['Country'].str.get_dummies(sep=",")], axis=1)
print("Rodou 4 colunas")

Rodou 1 coluna
Rodou 2 colunas
Rodou 3 colunas
Rodou 4 colunas
Wall time: 47.5 s
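
To make the multi-valued one-hot encoding concrete, here is a minimal sketch on a made-up three-row table. The titles and genres are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Title": ["A", "B", "C"],
        "Genres": ["Action,Sci-Fi", "Comedy", "Action,Comedy"],
    }
)

# One indicator column per distinct genre; a row gets 1 in every genre it lists.
dummies = toy["Genres"].str.get_dummies(sep=",")

# Replace the text column by its indicator columns.
encoded = pd.concat([toy.drop(columns="Genres"), dummies], axis=1)
print(encoded)
```

Unlike a plain one-hot encoding, a single row can have several indicator columns set to 1.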


In [10]:
# https://stackoverflow.com/questions/37646473/how-could-i-do-one-hot-encoding-with-multiple-values-in-one-cell
# pd.DataFrame({"diretor":pd.Series(['2','1','3']).astype('category',categories = [sorted_bom("Directors")])})

In [11]:
# directors = sorted(list(Counter(official_directors.unstack().reset_index(drop=True).dropna().values).items()),key = lambda x: -x[1])

In [12]:
# Since the goal is to predict IMDb, rows without that value are dropped
rating_db_nans = rating_db_tratado.copy()
rating_db_nans = rating_db_nans.dropna(subset=["IMDb"])
rating_db_nans.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16173 entries, 0 to 16742
Columns: 3217 entries, Unnamed: 0 to Zimbabwe
dtypes: float64(2), int64(3212), object(3)
memory usage: 397.1+ MB


In [13]:
from sklearn.impute import SimpleImputer

# Fill the missing Runtime values with the column median
imputer = SimpleImputer(strategy="median")

rating_db_imputer = rating_db_nans.copy()

# Fit the imputer
imputer.fit(rating_db_imputer[["Runtime"]])

# Fill in the missing values
rating_db_imputer["Runtime"] = imputer.transform(rating_db_imputer[["Runtime"]])
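
The effect of the median imputation can be checked on a small made-up column. This is a sketch only; the values are not from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({"Runtime": [90.0, 100.0, np.nan, 150.0, np.nan]})

imputer = SimpleImputer(strategy="median")

# fit_transform learns the median of the observed values (100.0 here)
# and fills the gaps in a single call.
filled = imputer.fit_transform(toy[["Runtime"]])
toy["Runtime"] = filled.ravel()

print(imputer.statistics_)
print(toy["Runtime"].tolist())
```

In a stricter setup the imputer would be fitted on the training split only, so that the median does not depend on test rows; the notebook fits it on the full table before splitting.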

In [14]:
# from sklearn.impute import SimpleImputer

# # Before fitting the SimpleImputer, remove the categorical column. The resulting dataset has only
# # the numeric independent variables.
# rating_db_imputer = rating_db_nans.copy()

# # Create an imputer that replaces invalid cells (NaN) with the median of the column the cell belongs to.
# imputer = SimpleImputer(strategy="median")

# # Now fit the imputer. This computes the median of each column,
# # which is stored in the imputer for later use.
# rating_db_imputer["Runtime"] = imputer.fit_transform(rating_db_imputer[["Runtime"]])

In [15]:
rating_db_nans["Runtime"] = rating_db_imputer["Runtime"]
rating_db_nans[["Runtime"]].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16173 entries, 0 to 16742
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Runtime  16173 non-null  float64
dtypes: float64(1)
memory usage: 252.7 KB


In [16]:
# # Replace NaN in Age with "Desconhecido" (unknown)
# rating_db_age = rating_db_nans.copy()
# rating_db_age = rating_db_age.replace(np.nan, value = "Desconhecido", regex = True)
# # rating_db_nans["Age"] = rating_db_age["Age"]
# rating_db_age[["Age"]].info()

## Regressor

In [17]:
# Random seed
RANDOM_SEED = 42

# Columns to drop
colunas = ["Rotten Tomatoes", "Age", "Title", "ID", "Unnamed: 0"]
rating_db_reg = rating_db_nans.copy()
rating_db_reg = rating_db_reg.drop(columns = colunas)
rating_db_reg.head()

,Year,IMDb,Netflix,Hulu,Prime Video,Disney+,Type,Runtime,Aboriginal,Afrikaans,...,Ukraine,United Arab Emirates,United Kingdom,United States,Uruguay,Venezuela,Vietnam,West Germany,Yugoslavia,Zimbabwe
0,2010,8.8,1,0,0,0,0,148.0,0,0,...,0,0,1,1,0,0,0,0,0,0
1,1999,8.7,1,0,0,0,0,136.0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,2018,8.5,1,0,0,0,0,149.0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,1985,8.5,1,0,0,0,0,116.0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,1966,8.8,1,0,1,0,0,161.0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [18]:
from sklearn.model_selection import train_test_split

# Split into training and test sets
train_set, test_set = train_test_split(
    rating_db_reg,
    test_size=0.2,
    random_state=RANDOM_SEED,
)

In [19]:
X_train = train_set.drop(columns=["IMDb"])
y_train = train_set["IMDb"]

In [20]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Evaluates the model with 10-fold cross-validation and reports the RMSE per fold
def testa_modelo(model, X, y):
    scores = cross_val_score(
            model,
            X,
            y,
            scoring="neg_mean_squared_error",
            cv = 10,
            n_jobs=-1)
    
    rmse_scores = np.sqrt(-scores)
    
    print("Scores:", rmse_scores.round(decimals=2))
    print("Mean:", rmse_scores.mean())
    print("Standard deviation:", rmse_scores.std())
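
The sign flip in `np.sqrt(-scores)` is needed because `scoring="neg_mean_squared_error"` returns the MSE multiplied by -1, so that larger is always better for scikit-learn. A small worked example with hypothetical fold scores:

```python
import numpy as np

# Hypothetical per-fold scores, as returned with scoring="neg_mean_squared_error".
neg_mse_scores = np.array([-1.44, -1.21, -1.00])

# Negate to recover the MSE, then take the square root to get the RMSE per fold.
rmse_scores = np.sqrt(-neg_mse_scores)

print("Scores:", rmse_scores.round(decimals=2))
print("Mean:", rmse_scores.mean())
print("Standard deviation:", rmse_scores.std())
```

Note that the mean of the per-fold RMSE values is not the same quantity as the square root of the mean MSE, although the two are usually close.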

In [21]:
# # To get the predictions, just call the predict() method
# predicted_labels = lin_reg.predict(X_train)
# print("Prediction: {}".format(predicted_labels.round(decimals=2)))
# # Compare with the original values:
# print("Original: {}".format(y_train.values.round(decimals=2)))

In [22]:
# from sklearn.metrics import mean_squared_error

# predicted_labels = lin_reg.predict(X_train)
# lin_mse = mean_squared_error(y_train, predicted_labels)
# lin_rmse = np.sqrt(lin_mse)
# print("Linear regression: RMSE = {:.2f}".format(lin_rmse))

In [23]:
%%time

# Linear regressor
lin_reg = LinearRegression()

testa_modelo(lin_reg, X_train, y_train)

Scores: [2.67628194e+09 3.40492460e+07 8.88945916e+08 2.19161218e+09
 4.88177109e+08 3.85815577e+09 3.76640388e+08 1.05194886e+08
 1.20000000e+00 5.24881828e+08]
Mean: 1114393926.941948
Standard deviation: 1260571405.2570174
Wall time: 4min 12s
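
The notebook does not diagnose why the unregularized `LinearRegression` reaches RMSE values of the order of 1e9 in nine of the ten folds. One plausible cause, stated here as an assumption and not verified against this dataset, is near-collinearity among the roughly 3,200 mostly binary columns. The sketch below uses synthetic data and plain NumPy to show how two almost identical features inflate least-squares coefficients, and how an L2 penalty keeps them small.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Two almost identical features (an assumption made for this illustration).
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=n)

# Ordinary least squares: nothing limits the size of the coefficients.
ols_coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge closed form: (X^T X + alpha * I)^-1 X^T y (no intercept, for brevity).
alpha = 0.1
ridge_coef = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)

print("Condition number:", np.linalg.cond(X))
print("OLS coefficients:", ols_coef)
print("Ridge coefficients:", ridge_coef)
```

Large coefficients of opposite sign cancel on the training rows but can produce very large predictions on held-out rows, which would be consistent with the fold scores above.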


In [24]:
%%time

# Ridge regressor
ridge_reg = Ridge(alpha=0.1)

testa_modelo(ridge_reg, X_train, y_train)

Scores: [1.14 1.18 1.11 1.17 1.14 1.12 1.18 1.16 1.16 1.1 ]
Mean: 1.1464891392293741
Standard deviation: 0.026632713529391637
Wall time: 27.6 s


In [25]:
%%time

# Lasso regressor
lasso_reg = Lasso(alpha=0.1)

testa_modelo(lasso_reg, X_train, y_train)

Scores: [1.26 1.27 1.21 1.3  1.26 1.27 1.26 1.28 1.24 1.21]
Mean: 1.2551632317735129
Standard deviation: 0.027267966552262476
Wall time: 8.66 s


In [26]:
%%time

# ElasticNet regressor
elastic_reg = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state = RANDOM_SEED)

testa_modelo(elastic_reg, X_train, y_train)

Scores: [1.22 1.23 1.17 1.25 1.21 1.21 1.21 1.23 1.19 1.17]
Mean: 1.2081254494399114
Standard deviation: 0.02477249563609101
Wall time: 8.76 s


In [27]:
%%time

# Random forest regressor
forest_reg = RandomForestRegressor(random_state=RANDOM_SEED)

testa_modelo(forest_reg, X_train, y_train)

Scores: [1.1  1.11 1.02 1.09 1.06 1.06 1.07 1.06 1.05 1.08]
Mean: 1.0699053184293368
Standard deviation: 0.023809048289999996
Wall time: 9min 12s
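
For a side-by-side view, the mean cross-validated RMSE values printed by the cells above can be collected into a single sorted table. The numbers below are copied (rounded) from those outputs; the labels are just descriptive strings.

```python
import pandas as pd

# Mean 10-fold RMSE values copied from the cell outputs above (rounded).
results = pd.Series(
    {
        "LinearRegression": 1114393926.94,
        "Ridge(alpha=0.1)": 1.1465,
        "Lasso(alpha=0.1)": 1.2552,
        "ElasticNet(alpha=0.1, l1_ratio=0.5)": 1.2081,
        "RandomForestRegressor": 1.0699,
    },
    name="mean_rmse",
).sort_values()

print(results)
```

With these settings the random forest has the lowest mean RMSE, followed by Ridge, at the cost of the longest wall time.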


* Remove the most important variables: how does the regressor's performance change?
* Exploratory analysis - visualization
* Create a list of questions to answer, e.g. which director has the most positive impact on the rating? What is the relationship between rating and release year?
* Code quality - improve variable names
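
As a starting point for the first item, the most important variables could be read from the fitted forest's `feature_importances_` attribute and then dropped before refitting. The sketch below uses synthetic data as a stand-in for `X_train` and `y_train`; the column names `signal`, `noise_a` and `noise_b` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic stand-in for X_train / y_train: only "signal" drives the target.
X = pd.DataFrame(
    {
        "signal": rng.normal(size=300),
        "noise_a": rng.normal(size=300),
        "noise_b": rng.normal(size=300),
    }
)
y = 3.0 * X["signal"] + 0.1 * rng.normal(size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X, y)

# Impurity-based importances, one per column, sorted from most to least important.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Drop the top-ranked column to see how a refitted model performs without it.
top_features = importances.index[:1].tolist()
X_reduced = X.drop(columns=top_features)
```

On the real data, `X_reduced` could be passed to `testa_modelo` in place of `X_train` to compare the resulting RMSE with the values reported above.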