## Laptop Price Forecast

**Objective**: As a Data Scientist at Amazon, your objective is to develop a system that predicts a provisional price for a laptop based on the configuration the user wants. The prediction serves as a benchmark for deciding which specifications the customer should prioritize when choosing a laptop.

## Exploratory data analysis

### Dataset Columns

In [1]:
# Importing the libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import r2_score,mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

**Company**: Manufacturer or brand (Apple, HP, Lenovo, Asus)

---------------------------------------------------------------------------
**TypeName**: Laptop type (Ultrabook, MacBook, 2 in 1 Convertible, Notebook)

---------------------------------------------------------------------------
**Ram**: RAM memory (4 GB, 8 GB, 16 GB, 32 GB)

---------------------------------------------------------------------------
**Weight**: Weight (kg)

---------------------------------------------------------------------------
**Touchscreen**: Whether the laptop has a touchscreen (Yes/No, stored as 1/0)

---------------------------------------------------------------------------
**Price**: Price (in Brazilian reais)

---------------------------------------------------------------------------
**Ips**: Whether the display uses an IPS panel (Yes/No, stored as 1/0)

---------------------------------------------------------------------------
**ppi**: Pixels per inch (e.g. 220)

---------------------------------------------------------------------------
**Cpu_brand**: Processor brand (Intel Core i7, Intel Core i5, Other Intel Processor)

---------------------------------------------------------------------------
**HDD**: Hard disk capacity in GB (e.g. 500)

---------------------------------------------------------------------------
**SSD**: SSD capacity in GB (e.g. 256)

---------------------------------------------------------------------------
**Gpu_brand**: Video card brand (e.g. Nvidia, Intel, AMD)

---------------------------------------------------------------------------
**os**: Operating system (Windows, Others/No OS/Linux, Mac)


In [2]:
# Installing the automated reporting package for data analysis
# (newer releases of this package are published under the name ydata-profiling)
!pip install pandas-profiling



In [3]:
# Loading the data
df = pd.read_csv('data.csv', delimiter=';')
df

Unnamed: 0,Company,TypeName,Ram,Weight,Touchscreen,Price,Ips,ppi,Cpu_brand,HDD,SSD,Gpu_brand,os
0,Apple,Ultrabook,8,1.37,0,5491,1,226.983005,Intel Core i5,0,128,Intel,Mac
1,Apple,Ultrabook,8,1.34,0,3684,0,127.677940,Intel Core i5,0,0,Intel,Mac
2,HP,Notebook,8,1.86,0,2357,0,141.211998,Intel Core i5,0,256,Intel,Others/No OS/Linux
3,Apple,Ultrabook,16,1.83,0,10400,1,220.534624,Intel Core i7,0,512,AMD,Mac
4,Apple,Ultrabook,8,1.37,0,7392,1,226.983005,Intel Core i5,0,256,Intel,Mac
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1297,Lenovo,2 in 1 Convertible,4,1.80,1,2615,1,157.350512,Intel Core i7,0,128,Intel,Windows
1298,Lenovo,2 in 1 Convertible,16,1.30,1,6144,1,276.053530,Intel Core i7,0,512,Intel,Windows
1299,Lenovo,Notebook,2,1.50,0,939,0,111.935204,Other Intel Processor,0,0,Intel,Windows
1300,HP,Notebook,6,2.19,0,3131,0,100.454670,Intel Core i7,1000,0,AMD,Windows


In [4]:
# Automated report
from pandas_profiling import ProfileReport
profile = ProfileReport(df, title="Laptop Profiling Report")

In [5]:
profile.to_file("report.html")

## Data pre-processing

In [6]:
# Explanatory variables
X = df.drop(columns=['Price'])

In [7]:
# Target: what we want to predict (the natural log of the price)
y = np.log(df['Price'])
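
Because the target is the natural log of the price, anything the model predicts is also on the log scale and has to go through `np.exp` to become a price in reais again. The following is a minimal sketch of that round trip, using three prices from the dataset preview above; the variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Three prices (in reais) taken from the first rows of the dataset preview
prices = np.array([5491.0, 3684.0, 2357.0])

# Forward transform, the same one applied to build the target y
log_prices = np.log(prices)

# Inverse transform: what a deployment step would apply to model predictions
recovered = np.exp(log_prices)

print(log_prices.round(3))
print(recovered.round(2))
```

Training on the log scale tends to reduce the influence of very expensive laptops on the fit, at the cost of reporting metrics in log units rather than reais.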

### Some tests with test_size and random_state to see which split works best with the model

- test_size=0.22, random_state=7
  - R2 score 0.9091915379824
  - MAE 0.15052603283795518
- test_size=0.21, random_state=7
  - R2 score 0.9145226769778227
  - MAE 0.14801615001618848
- test_size=0.20, random_state=7
  - R2 score 0.9156235861402156
  - MAE 0.14764715063561304
- test_size=0.19, random_state=7 (the winner)
  - R2 score 0.9168877460053753
  - MAE 0.1439709364900349
- test_size=0.18, random_state=7
  - R2 score 0.9149143673731063
  - MAE 0.1466489470816401
- test_size=0.17, random_state=7
  - R2 score 0.9123834335574939
  - MAE 0.14691635148390247
- test_size=0.16, random_state=7
  - R2 score 0.9108143414466527
  - MAE 0.14904559077796425
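
To put these splits in perspective, it may help to see how many rows each one holds out. The sketch below assumes the test count is the fraction of the 1302 rows rounded up, which is how scikit-learn's `train_test_split` is understood to treat a float `test_size`; `split_sizes` is a hypothetical helper, not something the notebook defines.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the remaining rows
    go to the training set.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_rows = 1302  # row count of the dataset
for test_size in (0.22, 0.21, 0.20, 0.19, 0.18, 0.17, 0.16):
    n_train, n_test = split_sizes(n_rows, test_size)
    print(f"test_size={test_size:.2f}: train={n_train}, test={n_test}")
```

Adjacent settings differ by only about 13 test rows, so part of the difference between the scores may come from which rows happen to land in the test set rather than from the split size itself.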

In [8]:
# Splitting the data into training and test sets for the algorithm (predictive machine)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.19, random_state=7)

In [9]:
# DataFrame information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1302 entries, 0 to 1301
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Company      1302 non-null   object 
 1   TypeName     1302 non-null   object 
 2   Ram          1302 non-null   int64  
 3   Weight       1302 non-null   float64
 4   Touchscreen  1302 non-null   int64  
 5   Price        1302 non-null   int64  
 6   Ips          1302 non-null   int64  
 7   ppi          1302 non-null   float64
 8   Cpu_brand    1302 non-null   object 
 9   HDD          1302 non-null   int64  
 10  SSD          1302 non-null   int64  
 11  Gpu_brand    1302 non-null   object 
 12  os           1302 non-null   object 
dtypes: float64(2), int64(6), object(5)
memory usage: 132.4+ KB


In [10]:
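# Transforming object (categorical) variables into numeric ones with one-hot encoding
# Columns 0, 1, 7, 10 and 11 of X are Company, TypeName, Cpu_brand, Gpu_brand and os
# Note: scikit-learn 1.2+ renames the `sparse` argument to `sparse_output`
step1 = ColumnTransformer(transformers=[
        ('col_tnf', OneHotEncoder(handle_unknown='ignore', sparse=False),[0,1,7,10,11])],
         remainder='passthrough')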
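
As a small illustration of what this step does to a single categorical column, the sketch below one-hot encodes a toy `os` column and then transforms a category that was not present during fitting. The toy arrays and the `"ChromeOS"` value are assumptions made for the example; calling `.toarray()` on the default sparse output avoids depending on the `sparse` / `sparse_output` argument name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training column with the three operating-system labels used in the dataset
train_os = np.array(
    [["Mac"], ["Windows"], ["Others/No OS/Linux"], ["Windows"]], dtype=object
)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_os)

# "ChromeOS" is a hypothetical category that was not seen during fit
new_os = np.array([["Windows"], ["ChromeOS"]], dtype=object)
encoded = encoder.transform(new_os).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # the unseen category becomes a row of zeros
```

With `handle_unknown='ignore'`, an unseen category does not raise an error at prediction time; it is simply encoded as all zeros for that column.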

## Creation of the predictive machine

#### Doing A/B tests to improve my model
- Starting with the model without tuned hyperparameters

In [11]:
# Creating the predictive machine with a regressor algorithm
step2 = RandomForestRegressor()

## Pipeline: Automating Data Processing and Machine Learning

In [12]:
# Automating pre-processing and the creation of the predictive machine in a single step
pipe = Pipeline([

# Data processing: transforming object variables into numeric ones
('step1', step1),

# Creating the predictive machine with machine learning (RandomForest algorithm)
('step2', step2)

])

In [13]:
# Training with the training data
pipe.fit(X_train,y_train)

Pipeline(steps=[('step1',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('col_tnf',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  [0, 1, 7, 10, 11])])),
                ('step2', RandomForestRegressor())])

In [14]:
# New predictions with the test data
y_pred = pipe.predict(X_test)

## Predictive Machine Assessment

In [15]:
# Using the R2 metric
print('R2 score',r2_score(y_test,y_pred))

R2 score 0.9152314280456784


In [16]:
# Using the MAE metric (on the log-price scale, since y is the log of the price)
print('MAE',mean_absolute_error(y_test,y_pred))

MAE 0.14673844845795203
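
Since the target is a log price, this MAE is in log units and is easier to read as a multiplicative factor. The sketch below converts it with `np.exp`; the 5000 reais laptop is a hypothetical example, not a row from the dataset.

```python
import numpy as np

mae_log = 0.14673844845795203  # MAE reported above, on the log-price scale

# On a log target, an absolute error of d corresponds to a factor of exp(d)
error_factor = np.exp(mae_log)
print(f"an error of one MAE corresponds to a factor of about {error_factor:.3f}")

# Hypothetical laptop with a true price of 5000 reais
true_price = 5000.0
low, high = true_price / error_factor, true_price * error_factor
print(f"a prediction off by one MAE would be near {low:.0f} or {high:.0f} reais")
```

Read this way, an error the size of the MAE means a prediction that is off by roughly 15 to 16 percent in either direction.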


- Creating a new pipeline to search for the best hyperparameters with GridSearchCV

In [17]:
sorted(pipe.get_params().keys())

['memory',
 'step1',
 'step1__col_tnf',
 'step1__col_tnf__categories',
 'step1__col_tnf__drop',
 'step1__col_tnf__dtype',
 'step1__col_tnf__handle_unknown',
 'step1__col_tnf__sparse',
 'step1__n_jobs',
 'step1__remainder',
 'step1__sparse_threshold',
 'step1__transformer_weights',
 'step1__transformers',
 'step1__verbose',
 'step1__verbose_feature_names_out',
 'step2',
 'step2__bootstrap',
 'step2__ccp_alpha',
 'step2__criterion',
 'step2__max_depth',
 'step2__max_features',
 'step2__max_leaf_nodes',
 'step2__max_samples',
 'step2__min_impurity_decrease',
 'step2__min_samples_leaf',
 'step2__min_samples_split',
 'step2__min_weight_fraction_leaf',
 'step2__n_estimators',
 'step2__n_jobs',
 'step2__oob_score',
 'step2__random_state',
 'step2__verbose',
 'step2__warm_start',
 'steps',
 'verbose']

In [18]:
from sklearn.model_selection import GridSearchCV

In [19]:
# n_estimators: number of trees that will be created in the forest
n_estimators=[195, 198, 200, 205]

# random_state: random seed, fixed so the results can be replicated
random_state=[2, 4, 5, 6]

# max_depth: maximum depth of each tree
max_depth=[10, 13, 15, 17]

# min_samples_split: minimum number of samples required to split a node
# Note: scikit-learn requires an integer min_samples_split to be at least 2,
# so the value 1 makes those fits fail (see the warning printed by grid.fit)
min_samples_split = [1, 2, 4, 6 ]

# min_samples_leaf: minimum number of samples required at a leaf node
min_samples_leaf = [1, 2, 3, 5]

# bootstrap: whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
# Note: these values are strings, not booleans. A non-empty string is truthy in Python,
# so 'False' does not behave like the boolean False; [True, False] is needed to compare both settings
bootstrap = ['True', 'False']

In [20]:
# Defining a dictionary to receive the parameters and values
grid_param = dict(step2__n_estimators=n_estimators,
                  step2__random_state=random_state,
                  step2__max_depth=max_depth,
                  step2__min_samples_split=min_samples_split,
                  step2__min_samples_leaf=min_samples_leaf,
                  step2__bootstrap=bootstrap
                  )
grid_param

{'step2__n_estimators': [195, 198, 200, 205],
 'step2__random_state': [2, 4, 5, 6],
 'step2__max_depth': [10, 13, 15, 17],
 'step2__min_samples_split': [1, 2, 4, 6],
 'step2__min_samples_leaf': [1, 2, 3, 5],
 'step2__bootstrap': ['True', 'False']}

In [21]:
grid = GridSearchCV(pipe, grid_param)

In [22]:
grid.fit(X_train, y_train)

2560 fits failed out of a total of 10240.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2560 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\LuisFS\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\LuisFS\anaconda3\lib\site-packages\sklearn\pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "C:\Users\LuisFS\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 450, in fit
    trees = Parallel(
  File "C:\Users\LuisFS\anaconda3\lib\site-packages\joblib\parallel.py", line 1043, in __call__
    if self.dispatch_one_batch(iterator

GridSearchCV(estimator=Pipeline(steps=[('step1',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('col_tnf',
                                                                         OneHotEncoder(handle_unknown='ignore',
                                                                                       sparse=False),
                                                                         [0, 1,
                                                                          7, 10,
                                                                          11])])),
                                       ('step2', RandomForestRegressor())]),
             param_grid={'step2__bootstrap': ['True', 'False'],
                         'step2__max_depth': [10, 13, 15, 17],
                         'step2__min_samples_leaf': [1, 2, 3, 5],
                         'step2__min_samples_split': [1,
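
The fit counts in the warning can be reproduced by enumerating the grid. The sketch below assumes 5 cross-validation folds, which is `GridSearchCV`'s default when `cv` is not passed, and treats every combination with `min_samples_split` below 2 as a failing fit.

```python
from itertools import product

# Same candidate values as the grid defined above
grid = {
    "n_estimators": [195, 198, 200, 205],
    "random_state": [2, 4, 5, 6],
    "max_depth": [10, 13, 15, 17],
    "min_samples_split": [1, 2, 4, 6],
    "min_samples_leaf": [1, 2, 3, 5],
    "bootstrap": ["True", "False"],
}
n_folds = 5  # assumption: default number of folds when cv is not given

# Every combination of candidate values, as a list of dicts
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

total_fits = len(combos) * n_folds
invalid_fits = sum(c["min_samples_split"] < 2 for c in combos) * n_folds

print(len(combos), total_fits, invalid_fits)
```

The totals match the warning: 2048 combinations give 10240 fits, and the quarter of them that use `min_samples_split=1` account for the 2560 failures.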

In [23]:
grid.cv_results_

{'mean_fit_time': array([0.06080279, 0.06099887, 0.05798311, ..., 0.60299139, 0.60119872,
        0.61919885]),
 'std_fit_time': array([0.0048746 , 0.00613947, 0.00290745, ..., 0.00788819, 0.00932153,
        0.01376531]),
 'mean_score_time': array([0.        , 0.        , 0.        , ..., 0.02739549, 0.034793  ,
        0.02759843]),
 'std_score_time': array([0.        , 0.        , 0.        , ..., 0.00393131, 0.0117288 ,
        0.00184095]),
 'param_step2__bootstrap': masked_array(data=['True', 'True', 'True', ..., 'False', 'False', 'False'],
              mask=[False, False, False, ..., False, False, False],
        fill_value='?',
             dtype=object),
 'param_step2__max_depth': masked_array(data=[10, 10, 10, ..., 17, 17, 17],
              mask=[False, False, False, ..., False, False, False],
        fill_value='?',
             dtype=object),
 'param_step2__min_samples_leaf': masked_array(data=[1, 1, 1, ..., 5, 5, 5],
              mask=[False, False, False, ..., False, F

In [24]:
grid.best_params_

{'step2__bootstrap': 'True',
 'step2__max_depth': 15,
 'step2__min_samples_leaf': 1,
 'step2__min_samples_split': 2,
 'step2__n_estimators': 200,
 'step2__random_state': 5}
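
The keys in `best_params_` carry the pipeline step name as a prefix, so they cannot be passed to `RandomForestRegressor` directly. One possible way to turn them into constructor arguments is sketched below; `step_kwargs` is a hypothetical helper, and the conversion of `'True'` / `'False'` strings to booleans is specific to how this grid was written.

```python
# best_params_ as reported above
best_params = {
    "step2__bootstrap": "True",
    "step2__max_depth": 15,
    "step2__min_samples_leaf": 1,
    "step2__min_samples_split": 2,
    "step2__n_estimators": 200,
    "step2__random_state": 5,
}


def step_kwargs(params, step):
    """Strip the '<step>__' prefix and turn 'True'/'False' strings into booleans."""
    prefix = step + "__"
    kwargs = {}
    for key, value in params.items():
        if not key.startswith(prefix):
            continue
        if isinstance(value, str) and value in ("True", "False"):
            value = value == "True"
        kwargs[key[len(prefix):]] = value
    return kwargs


rf_kwargs = step_kwargs(best_params, "step2")
print(rf_kwargs)
```

The resulting dictionary could be unpacked as `RandomForestRegressor(**rf_kwargs)`. With the default `refit=True`, scikit-learn also exposes the already refitted pipeline as `grid.best_estimator_`, which avoids rebuilding it by hand.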

In [25]:
# Creating the predictive machine with the regressor algorithm and the best hyperparameters found
step2 = RandomForestRegressor(bootstrap=True,  # boolean rather than the string 'True'
                              max_depth=15,
                              min_samples_leaf=1,
                              min_samples_split=2,
                              n_estimators=200,
                              random_state=5
                             )

In [26]:
# Automating pre-processing and the creation of the predictive machine in a single step
pipe = Pipeline([

# Data processing: transforming object variables into numeric ones
('step1', step1),

# Creating the predictive machine with machine learning (RandomForest algorithm)
('step2', step2)

])

In [27]:
# Training with the training data
pipe.fit(X_train,y_train)

Pipeline(steps=[('step1',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('col_tnf',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  [0, 1, 7, 10, 11])])),
                ('step2',
                 RandomForestRegressor(max_depth=15, n_estimators=200,
                                       random_state=5))])

In [28]:
# New predictions with the test data
y_pred = pipe.predict(X_test)

In [29]:
print('MAE',mean_absolute_error(y_test,y_pred))

MAE 0.14331508114484326


In [30]:
print('R2 score',r2_score(y_test,y_pred))

R2 score 0.9177995249180473


In [31]:
# Exporting the data for the implementation phase
df.to_csv("data_mp.csv", index=False)

In [32]:
# Saving the predictive machine for deployment
import pickle
with open('MP.pkl', 'wb') as f:
    pickle.dump(pipe, f)
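
For the implementation phase, the saved pipeline would be loaded, given one row with the same columns as `X`, and its output passed through `np.exp` to obtain a price in reais. The sketch below is only an illustration under stated assumptions: it trains a small stand-in pipeline on the nine rows visible in the dataset preview, round-trips it through pickle in memory instead of reading `MP.pkl`, and uses a made-up query configuration.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

columns = ["Company", "TypeName", "Ram", "Weight", "Touchscreen", "Price", "Ips",
           "ppi", "Cpu_brand", "HDD", "SSD", "Gpu_brand", "os"]
# Rows copied from the dataset preview, only to keep the sketch self-contained
rows = [
    ("Apple", "Ultrabook", 8, 1.37, 0, 5491, 1, 226.983005, "Intel Core i5", 0, 128, "Intel", "Mac"),
    ("Apple", "Ultrabook", 8, 1.34, 0, 3684, 0, 127.677940, "Intel Core i5", 0, 0, "Intel", "Mac"),
    ("HP", "Notebook", 8, 1.86, 0, 2357, 0, 141.211998, "Intel Core i5", 0, 256, "Intel", "Others/No OS/Linux"),
    ("Apple", "Ultrabook", 16, 1.83, 0, 10400, 1, 220.534624, "Intel Core i7", 0, 512, "AMD", "Mac"),
    ("Apple", "Ultrabook", 8, 1.37, 0, 7392, 1, 226.983005, "Intel Core i5", 0, 256, "Intel", "Mac"),
    ("Lenovo", "2 in 1 Convertible", 4, 1.80, 1, 2615, 1, 157.350512, "Intel Core i7", 0, 128, "Intel", "Windows"),
    ("Lenovo", "2 in 1 Convertible", 16, 1.30, 1, 6144, 1, 276.053530, "Intel Core i7", 0, 512, "Intel", "Windows"),
    ("Lenovo", "Notebook", 2, 1.50, 0, 939, 0, 111.935204, "Other Intel Processor", 0, 0, "Intel", "Windows"),
    ("HP", "Notebook", 6, 2.19, 0, 3131, 0, 100.454670, "Intel Core i7", 1000, 0, "AMD", "Windows"),
]
toy = pd.DataFrame(rows, columns=columns)

# Stand-in pipeline with the same two steps, selecting categorical columns by name
categorical = ["Company", "TypeName", "Cpu_brand", "Gpu_brand", "os"]
toy_pipe = Pipeline([
    ("step1", ColumnTransformer(
        [("col_tnf", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough")),
    ("step2", RandomForestRegressor(n_estimators=10, random_state=5)),
])
toy_pipe.fit(toy.drop(columns=["Price"]), np.log(toy["Price"]))

# Round trip through pickle, in memory rather than through the MP.pkl file
loaded = pickle.loads(pickle.dumps(toy_pipe))

# Hypothetical configuration requested by a user, in the same column order as X
query = pd.DataFrame([{
    "Company": "HP", "TypeName": "Notebook", "Ram": 8, "Weight": 1.9,
    "Touchscreen": 0, "Ips": 1, "ppi": 141.2, "Cpu_brand": "Intel Core i5",
    "HDD": 0, "SSD": 256, "Gpu_brand": "Intel", "os": "Windows",
}])

# The model predicts log(price), so exponentiate to get reais
predicted_price = float(np.exp(loaded.predict(query)[0]))
print(round(predicted_price, 2))
```

A model trained on nine rows says nothing about real prices; the point is only the order of operations: unpickle, build a one-row DataFrame with the training columns, predict, then exponentiate.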