## Importando bibliotecas e o dataset

In [32]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [33]:
df = pd.read_excel("Data_Train.xlsx")


In [34]:
df.head()

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,"Paperback,– 10 Mar 2016",4.0 out of 5 stars,8 customer reviews,THE HUNTERS return in their third brilliant no...,Action & Adventure (Books),Action & Adventure,220.0
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,"Paperback,– 7 Nov 2012",3.9 out of 5 stars,14 customer reviews,A layered portrait of a troubled genius for wh...,Cinema & Broadcast (Books),"Biographies, Diaries & True Accounts",202.93
2,Leviathan (Penguin Classics),Thomas Hobbes,"Paperback,– 25 Feb 1982",4.8 out of 5 stars,6 customer reviews,"""During the time men live without a common Pow...",International Relations,Humour,299.0
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,"Paperback,– 5 Oct 2017",4.1 out of 5 stars,13 customer reviews,A handful of grain is found in the pocket of a...,Contemporary Fiction (Books),"Crime, Thriller & Mystery",180.0
4,LIFE 70 Years of Extraordinary Photography,Editors of Life,"Hardcover,– 10 Oct 2006",5.0 out of 5 stars,1 customer review,"For seven decades, ""Life"" has been thrilling t...",Photography Textbooks,"Arts, Film & Photography",965.62


## Primeira Análise e Feature engineerig 

In [35]:
df.shape

(6237, 9)

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6237 entries, 0 to 6236
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Title         6237 non-null   object 
 1   Author        6237 non-null   object 
 2   Edition       6237 non-null   object 
 3   Reviews       6237 non-null   object 
 4   Ratings       6237 non-null   object 
 5   Synopsis      6237 non-null   object 
 6   Genre         6237 non-null   object 
 7   BookCategory  6237 non-null   object 
 8   Price         6237 non-null   float64
dtypes: float64(1), object(8)
memory usage: 438.7+ KB


Precisamos mudar Reviews e Rating para numérico.

In [37]:
df['Reviews'] = df['Reviews'].apply(lambda r: float(r.split()[0]))
df['Ratings']= df['Ratings'].str.extract('(\d+)').astype(int)



In [38]:
df.head()

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,"Paperback,– 10 Mar 2016",4.0,8,THE HUNTERS return in their third brilliant no...,Action & Adventure (Books),Action & Adventure,220.0
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,"Paperback,– 7 Nov 2012",3.9,14,A layered portrait of a troubled genius for wh...,Cinema & Broadcast (Books),"Biographies, Diaries & True Accounts",202.93
2,Leviathan (Penguin Classics),Thomas Hobbes,"Paperback,– 25 Feb 1982",4.8,6,"""During the time men live without a common Pow...",International Relations,Humour,299.0
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,"Paperback,– 5 Oct 2017",4.1,13,A handful of grain is found in the pocket of a...,Contemporary Fiction (Books),"Crime, Thriller & Mystery",180.0
4,LIFE 70 Years of Extraordinary Photography,Editors of Life,"Hardcover,– 10 Oct 2006",5.0,1,"For seven decades, ""Life"" has been thrilling t...",Photography Textbooks,"Arts, Film & Photography",965.62


Vamos separar as varáveis categóricas e as númericas

In [39]:
def separar_cat_col(df):
    
    # selecionar colunas categóricas
    cat_cols = list(df.select_dtypes(include=['object']).columns)
    
    # selecionar colunas numéricas
    numeric_cols = list(df.select_dtypes(include=['int', 'float']).columns)
    
    return cat_cols, numeric_cols

In [40]:
separar_cat_col(df)

(['Title', 'Author', 'Edition', 'Synopsis', 'Genre', 'BookCategory'],
 ['Reviews', 'Ratings', 'Price'])

In [41]:
df.nunique()

Title           5568
Author          3679
Edition         3370
Reviews           36
Ratings          323
Synopsis        5549
Genre            345
BookCategory      11
Price           1614
dtype: int64

Para criarmos um modelo de regressão precisamos das features numéricas.

Transformaremos a featue BookCategory em númerica com one-hot-encoding

In [42]:
df.drop(columns= ['Title', 'Author', 'Edition', 'Synopsis', 'Genre'], inplace=True)

In [43]:
df_1 =  pd.get_dummies(df, columns= ['BookCategory'])

Vamos obersvar a correlação das features

In [44]:
corr_matrix = df_1.corr(method='spearman')
corr_matrix

Unnamed: 0,Reviews,Ratings,Price,BookCategory_Action & Adventure,"BookCategory_Arts, Film & Photography","BookCategory_Biographies, Diaries & True Accounts",BookCategory_Comics & Mangas,"BookCategory_Computing, Internet & Digital Media","BookCategory_Crime, Thriller & Mystery",BookCategory_Humour,"BookCategory_Language, Linguistics & Writing",BookCategory_Politics,BookCategory_Romance,BookCategory_Sports
Reviews,1.0,-0.237119,0.221209,-0.001964,0.055514,-0.012549,0.105757,-0.019096,-0.106099,0.054828,-0.055098,-0.01423,-0.05472,0.064412
Ratings,-0.237119,1.0,-0.288681,-0.013456,-0.062887,0.232679,-0.067619,-0.039075,0.09103,-0.110408,0.02387,0.005089,0.038062,-0.125817
Price,0.221209,-0.288681,1.0,-0.093823,0.168208,-0.116849,0.117048,0.192154,-0.190717,0.098524,-0.091353,0.024289,-0.169109,0.13636
BookCategory_Action & Adventure,-0.001964,-0.013456,-0.093823,1.0,-0.116806,-0.126288,-0.124759,-0.115941,-0.140687,-0.119616,-0.126054,-0.091094,-0.122026,-0.111043
"BookCategory_Arts, Film & Photography",0.055514,-0.062887,0.168208,-0.116806,1.0,-0.097722,-0.096539,-0.089716,-0.108864,-0.09256,-0.097541,-0.070489,-0.094424,-0.085925
"BookCategory_Biographies, Diaries & True Accounts",-0.012549,0.232679,-0.116849,-0.126288,-0.097722,1.0,-0.104376,-0.096999,-0.117701,-0.100073,-0.105459,-0.076211,-0.102089,-0.092901
BookCategory_Comics & Mangas,0.105757,-0.067619,0.117048,-0.124759,-0.096539,-0.104376,1.0,-0.095825,-0.116277,-0.098862,-0.104182,-0.075289,-0.100853,-0.091776
"BookCategory_Computing, Internet & Digital Media",-0.019096,-0.039075,0.192154,-0.115941,-0.089716,-0.096999,-0.095825,1.0,-0.108058,-0.091875,-0.096819,-0.069967,-0.093725,-0.085289
"BookCategory_Crime, Thriller & Mystery",-0.106099,0.09103,-0.190717,-0.140687,-0.108864,-0.117701,-0.116277,-0.108058,1.0,-0.111483,-0.117483,-0.0849,-0.113729,-0.103493
BookCategory_Humour,0.054828,-0.110408,0.098524,-0.119616,-0.09256,-0.100073,-0.098862,-0.091875,-0.111483,1.0,-0.099888,-0.072185,-0.096696,-0.087993


In [45]:
corr_matrix.Price.sort_values(ascending=False)

Price                                                1.000000
Reviews                                              0.221209
BookCategory_Computing, Internet & Digital Media     0.192154
BookCategory_Arts, Film & Photography                0.168208
BookCategory_Sports                                  0.136360
BookCategory_Comics & Mangas                         0.117048
BookCategory_Humour                                  0.098524
BookCategory_Politics                                0.024289
BookCategory_Language, Linguistics & Writing        -0.091353
BookCategory_Action & Adventure                     -0.093823
BookCategory_Biographies, Diaries & True Accounts   -0.116849
BookCategory_Romance                                -0.169109
BookCategory_Crime, Thriller & Mystery              -0.190717
Ratings                                             -0.288681
Name: Price, dtype: float64

## Pre processamento dos dados

In [46]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

import warnings
warnings.filterwarnings("ignore")

In [47]:
y = df_1['Price'].values
X = df_1.drop('Price',axis = 1).values

## Criando os modelos preditivos

Usando o módulo Pipeline vamos comparar o MSE (negativo) e desvio padrão obtido por cada algoritmo, através do crossvalidation

In [48]:
from sklearn.pipeline import Pipeline

pipelines = []
pipelines.append(('ScaledLR', Pipeline([('Scaler', StandardScaler()),('LR',LinearRegression())])))
pipelines.append(('ScaledCART', Pipeline([('Scaler', StandardScaler()),('DT', DecisionTreeRegressor())])))
pipelines.append(('ScaledGBM', Pipeline([('Scaler', StandardScaler()),('GBM', GradientBoostingRegressor())])))

results = []
names = []

for name, model in pipelines:
    kfold = KFold(n_splits=10,shuffle=True, random_state=21)
    cv_results = cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_squared_error')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)


ScaledLR: -429639.010334 (96653.327741)
ScaledCART: -637136.822422 (246473.202087)
ScaledGBM: -426574.686145 (95320.190701)


Escolhemos o GradientBoosting pelo melhor desepenho e agora vamos para o hypertunning com BayesSearchCV

In [49]:
from sklearn.ensemble import GradientBoostingRegressor
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical

In [50]:
search_space = {
    'learning_rate': Real(0.01, 1.0, prior='log-uniform'),
    'n_estimators': Integer(50, 500),
    'max_depth': Integer(1, 10),
    'min_samples_split': Integer(2, 20),
    'min_samples_leaf': Integer(1, 10),
    'max_features': Categorical(['sqrt', 'log2', None])
}

# Criando um objeto BayesSearchCV com GradientBoostingRegressor
model = GradientBoostingRegressor(random_state=7)
bayes_cv_tuner = BayesSearchCV(
    estimator=model,
    search_spaces=search_space,
    n_iter=50,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=0,
    random_state=42
)

# Ajustando o modelo com a otimização dos hiperparâmetros
result = bayes_cv_tuner.fit(X, y)

# Imprimindo o melhor conjunto de hiperparâmetros e o erro médio quadrático negativo correspondente

print("Melhores parâmetros encontrados: ", result.best_params_)
print("Negative mean squared error: ", result.best_score_)

Melhores parâmetros encontrados:  OrderedDict([('learning_rate', 0.4224214733521323), ('max_depth', 1), ('max_features', 'sqrt'), ('min_samples_leaf', 10), ('min_samples_split', 2), ('n_estimators', 50)])
Negative mean squared error:  -422353.3118215559


In [51]:
import joblib

# Treinando o modelo com os melhores hiperparâmetros
best_model = GradientBoostingRegressor(**result.best_params_, random_state=7)
best_model.fit(X,y)

# Salvando o modelo com o joblib
joblib.dump(best_model, 'modelo_otimizado.joblib')

['modelo_otimizado.joblib']

## Usando o modelo para prever os dados de teste

In [52]:
df_test = pd.read_excel("Data_Test.xlsx")

In [53]:
df_test

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory
0,The Complete Sherlock Holmes: 2 Boxes sets,Sir Arthur Conan Doyle,"Mass Market Paperback,– 1 Oct 1986",4.4 out of 5 stars,960 customer reviews,A collection of entire body of work of the She...,Short Stories (Books),"Crime, Thriller & Mystery"
1,Learn Docker - Fundamentals of Docker 18.x: Ev...,Gabriel N. Schenker,"Paperback,– Import, 26 Apr 2018",5.0 out of 5 stars,1 customer review,Enhance your software deployment workflow usin...,Operating Systems Textbooks,"Computing, Internet & Digital Media"
2,Big Girl,Danielle Steel,"Paperback,– 17 Mar 2011",5.0 out of 5 stars,4 customer reviews,"'Watch out, world. Here I come!'\nFor Victoria...",Romance (Books),Romance
3,Think Python: How to Think Like a Computer Sci...,Allen B. Downey,"Paperback,– 2016",4.1 out of 5 stars,11 customer reviews,"If you want to learn how to program, working w...",Programming & Software Development (Books),"Computing, Internet & Digital Media"
4,Oxford Word Skills: Advanced - Idioms & Phrasa...,Redman Gairns,"Paperback,– 26 Dec 2011",4.4 out of 5 stars,9 customer reviews,"Learn and practise the verbs, prepositions and...",Linguistics (Books),"Language, Linguistics & Writing"
...,...,...,...,...,...,...,...,...
1555,100 Things Every Designer Needs to Know About ...,Susan Weinschenk,"Paperback,– 14 Apr 2011",5.0 out of 5 stars,4 customer reviews,We design to elicit responses from people. We ...,Design,"Computing, Internet & Digital Media"
1556,"Modern Letter Writing Course: Personal, Busine...",ARUN SAGAR,"Paperback,– 8 May 2013",3.6 out of 5 stars,13 customer reviews,"A 30-day course to write simple, sharp and att...",Children's Reference (Books),"Biographies, Diaries & True Accounts"
1557,The Kite Runner Graphic Novel,Khaled Hosseini,"Paperback,– 6 Sep 2011",4.0 out of 5 stars,5 customer reviews,The perennial bestseller-now available as a se...,Humour (Books),Humour
1558,Panzer Leader (Penguin World War II Collection),Heinz Guderian,"Paperback,– 22 Sep 2009",3.5 out of 5 stars,3 customer reviews,Heinz Guderian - master of the Blitzkrieg and ...,United States History,"Biographies, Diaries & True Accounts"


Precisamos dar o mesmo tratamento para os dados de teste

In [54]:
df_test['Reviews'] = df_test['Reviews'].apply(lambda r: float(r.split()[0]))
df_test['Ratings']= df_test['Ratings'].str.extract('(\d+)').astype(int)


df_test.drop(columns= ['Title', 'Author', 'Edition', 'Synopsis', 'Genre'], inplace=True)

df_2 =  pd.get_dummies(df, columns= ['BookCategory'])

X_test = df_2.drop(columns=['Price'])
Y_test = df_2['Price']

Criando as previsões e exportando para um arquivo csv

In [55]:
Y_submit = best_model.predict(X_test)

In [66]:
df_test = pd.read_excel("Data_Test.xlsx")

# Criando um DataFrame com as previsões

Y_submit = pd.DataFrame(np.round(Y_submit,2), columns=['Predictions'])

df_submissao = df_test.merge(Y_submit, left_index=True, right_index=True)

# Exportando as previsões para um arquivo CSV
df_submissao.to_excel('Predicoes.xlsx', index=False)

In [67]:
df_submissao.head()

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Predictions
0,The Complete Sherlock Holmes: 2 Boxes sets,Sir Arthur Conan Doyle,"Mass Market Paperback,– 1 Oct 1986",4.4 out of 5 stars,960 customer reviews,A collection of entire body of work of the She...,Short Stories (Books),"Crime, Thriller & Mystery",356.67
1,Learn Docker - Fundamentals of Docker 18.x: Ev...,Gabriel N. Schenker,"Paperback,– Import, 26 Apr 2018",5.0 out of 5 stars,1 customer review,Enhance your software deployment workflow usin...,Operating Systems Textbooks,"Computing, Internet & Digital Media",287.67
2,Big Girl,Danielle Steel,"Paperback,– 17 Mar 2011",5.0 out of 5 stars,4 customer reviews,"'Watch out, world. Here I come!'\nFor Victoria...",Romance (Books),Romance,722.19
3,Think Python: How to Think Like a Computer Sci...,Allen B. Downey,"Paperback,– 2016",4.1 out of 5 stars,11 customer reviews,"If you want to learn how to program, working w...",Programming & Software Development (Books),"Computing, Internet & Digital Media",253.84
4,Oxford Word Skills: Advanced - Idioms & Phrasa...,Redman Gairns,"Paperback,– 26 Dec 2011",4.4 out of 5 stars,9 customer reviews,"Learn and practise the verbs, prepositions and...",Linguistics (Books),"Language, Linguistics & Writing",1011.62
