## Data Preparation
1. Data split:
   - Train/Test/Validation
   - Keep the split stratified if necessary
2. Variable processing (fit on train only):
   - Encoding of categorical variables
   - Normalization/standardization of numeric variables
3. Class balancing (on train only, if necessary):
   - Oversampling
   - Undersampling
   - Hybrid techniques
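
The first two steps can be sketched as follows. This is a minimal illustration on made-up data, not the notebook's actual pipeline: the arrays `X_demo` and `y_demo`, the 60/20/20 proportions, and the choice of `StandardScaler` are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): 100 rows, 2 numeric features, imbalanced binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = np.array([0] * 80 + [1] * 20)

# First carve out the test set, then split the remainder into train/validation.
# stratify keeps the 80/20 class ratio in every subset.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)

# Fit the scaler on train only; validation and test reuse the train statistics.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_val_s = scaler.transform(X_val)
X_te_s = scaler.transform(X_te)

print(len(X_tr), len(X_val), len(X_te))
```

Fitting the scaler on the training subset alone keeps information from the validation and test rows out of the learned statistics.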

### Libraries

In [188]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
import re

### Data

In [191]:
df1 = pd.read_csv(r'C:\Users\nuria\OneDrive\Escritorio\ML_laptops\notebooks\DataFrame_laptops_limpio_1')
df1.head()

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,...,velocidad_cpu_ghz,marca_gpu,modelo_gpu,OpSys_general,2 in 1 Convertible,Gaming,Netbook,Notebook,Ultrabook,Workstation
0,1223,Dell,Inspiron 5567,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,3,256GB SSD,AMD Radeon R7 M445,...,2.5,AMD,Radeon R7 M445,Windows,0,0,0,1,0,0
1,78,Lenovo,IdeaPad 320-15IKBN,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,3,2TB HDD,Intel HD Graphics 620,...,2.5,Intel,HD Graphics 620,Sin OS,0,0,0,1,0,0
2,1267,Dell,XPS 13,2 in 1 Convertible,13.3,Quad HD+ / Touchscreen 3200x1800,Intel Core i5 7Y54 1.2GHz,3,256GB SSD,Intel HD Graphics 615,...,1.2,Intel,HD Graphics 615,Windows,1,0,0,0,0,0
3,161,Dell,Inspiron 5579,2 in 1 Convertible,15.6,Full HD / Touchscreen 1920x1080,Intel Core i7 8550U 1.8GHz,3,256GB SSD,Intel UHD Graphics 620,...,1.8,Intel,UHD Graphics 620,Windows,1,0,0,0,0,0
4,922,LG,Gram 14Z970,Ultrabook,14.0,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 7500U 2.7GHz,3,512GB SSD,Intel HD Graphics 620,...,2.7,Intel,HD Graphics 620,Windows,0,0,0,0,1,0


In [192]:
df1.isnull().sum()

laptop_ID             0
Company               0
Product               0
TypeName              0
Inches                0
ScreenResolution      0
Cpu                   0
Ram                   0
Memory                0
Gpu                   0
OpSys                 0
Weight                0
Price_euros           0
Resolución            0
tipo_pantalla         0
memoria               0
tipo_memoria          0
tipo_cpu              0
Marca_cpu             0
Serie_cpu             0
Modelo_cpu            0
velocidad_cpu_ghz     0
marca_gpu             0
modelo_gpu            0
OpSys_general         0
2 in 1 Convertible    0
Gaming                0
Netbook               0
Notebook              0
Ultrabook             0
Workstation           0
dtype: int64

In [193]:
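# Convert to numeric using category codes.
# Note: the codes depend on the values present in this DataFrame, and encoding the
# already-numeric 'velocidad_cpu_ghz' this way replaces the GHz values with ranks.
df1['OpSys'] = df1['OpSys'].astype('category').cat.codes
df1['velocidad_cpu_ghz'] = df1['velocidad_cpu_ghz'].astype('category').cat.codes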

In [194]:
# Convert every column to numeric, coercing to NaN where conversion is not possible
# df_numeric = df1.apply(pd.to_numeric, errors='coerce')

## Processing for ML

In [195]:
df1.head(7)

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,...,velocidad_cpu_ghz,marca_gpu,modelo_gpu,OpSys_general,2 in 1 Convertible,Gaming,Netbook,Notebook,Ultrabook,Workstation
0,1223,Dell,Inspiron 5567,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,3,256GB SSD,AMD Radeon R7 M445,...,16,AMD,Radeon R7 M445,Windows,0,0,0,1,0,0
1,78,Lenovo,IdeaPad 320-15IKBN,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,3,2TB HDD,Intel HD Graphics 620,...,16,Intel,HD Graphics 620,Sin OS,0,0,0,1,0,0
2,1267,Dell,XPS 13,2 in 1 Convertible,13.3,Quad HD+ / Touchscreen 3200x1800,Intel Core i5 7Y54 1.2GHz,3,256GB SSD,Intel HD Graphics 615,...,3,Intel,HD Graphics 615,Windows,1,0,0,0,0,0
3,161,Dell,Inspiron 5579,2 in 1 Convertible,15.6,Full HD / Touchscreen 1920x1080,Intel Core i7 8550U 1.8GHz,3,256GB SSD,Intel UHD Graphics 620,...,8,Intel,UHD Graphics 620,Windows,1,0,0,0,0,0
4,922,LG,Gram 14Z970,Ultrabook,14.0,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 7500U 2.7GHz,3,512GB SSD,Intel HD Graphics 620,...,18,Intel,HD Graphics 620,Windows,0,0,0,0,1,0
5,1040,HP,ProBook 650,Notebook,14.0,1366x768,Intel Core i5 7200U 2.5GHz,1,500GB HDD,Intel HD Graphics 620,...,16,Intel,HD Graphics 620,Windows,0,0,0,1,0,0
6,1039,HP,Elitebook 820,Ultrabook,12.5,1366x768,Intel Core i5 7200U 2.5GHz,1,256GB SSD,Intel HD Graphics 620,...,16,Intel,HD Graphics 620,Windows,0,0,0,0,1,0


### 1. Define X and y

In [196]:
X = df1.drop(['Company', 'Product', 'TypeName', 'ScreenResolution', 'Cpu', 'Memory', 'marca_gpu', 'modelo_gpu', 'OpSys_general', 'Gpu', 'Price_euros', 'Resolución', 'tipo_pantalla', 'memoria', 'Marca_cpu', 'Serie_cpu', 'Modelo_cpu', 'Weight', 'tipo_memoria', 'tipo_cpu'], axis=1)
y = df1['Price_euros'].copy()
X.shape

(912, 11)

In [126]:
y.shape

(912,)

### 2. Split into X_train, X_test, y_train, y_test

In [197]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

In [198]:
X_train

Unnamed: 0,laptop_ID,Inches,Ram,OpSys,velocidad_cpu_ghz,2 in 1 Convertible,Gaming,Netbook,Notebook,Ultrabook,Workstation
25,439,14.0,1,5,18,0,0,0,1,0,0
84,979,15.6,1,2,11,0,0,0,1,0,0
10,337,15.6,3,5,16,0,0,0,1,0,0
342,139,15.6,1,4,2,0,0,0,1,0,0
890,517,13.3,3,5,16,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...
106,474,15.6,3,5,7,0,0,0,1,0,0
270,478,15.6,3,5,16,0,0,0,1,0,0
860,952,14.0,3,5,16,0,0,0,0,1,0
435,370,15.6,3,4,11,0,0,0,1,0,0


### 3. Assign the (unfitted) model to a variable

In [199]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

regressor.fit(X_train, y_train) # Train

### Compute metrics and evaluate the model
RMSE is the metric used for the competition.

In [200]:
from sklearn.metrics import root_mean_squared_error

predictions = regressor.predict(X_test)

rmse = root_mean_squared_error(y_test, predictions)

In [201]:
print(rmse)

421.92689491903496
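
To make the metric concrete, RMSE is the square root of the mean of the squared errors, so it is expressed in the same unit as the target (euros here). The sketch below computes it by hand on three made-up values; `y_true_demo` and `y_pred_demo` are illustrative and unrelated to the notebook's data.

```python
import numpy as np

# Hypothetical true prices and predictions.
y_true_demo = np.array([100.0, 200.0, 300.0])
y_pred_demo = np.array([110.0, 190.0, 330.0])

errors = y_true_demo - y_pred_demo         # -10, 10, -30
rmse_demo = np.sqrt(np.mean(errors ** 2))  # sqrt((100 + 100 + 900) / 3)
print(rmse_demo)
```

Because the errors are squared before averaging, the single 30-euro miss weighs more than the two 10-euro misses combined.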


## Once the model is ready, predict on the prediction dataset

Definition of a **model that is ready**.

After running enough experiments, analyzing the data, doing feature engineering, and trying different models with different parameters, this is the one that shows the best metrics and the least overfitting. Watch out for overfitting here! If your model learns these data very well but overfits, then when we feed it the unseen data from `test.csv` we risk results that are, let's say, not what we expected.

### 1. Train that model on ALL your training data, that is, on the complete `train.csv`.


**INCLUDING THE TRANSFORMATIONS YOU APPLIED TO `X`.**


For example:
- Standardization/normalization
- Outlier removal
- Column removal
- Creation of new columns
- Handling of null values
- And a long list of other techniques that you, as a Data Scientist, have judged best for your dataset.
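
One way to keep the transformations attached to the model when refitting on all the training data is a scikit-learn pipeline. The sketch below is only an illustration under assumptions: the data are synthetic, and `StandardScaler` stands in for whatever transformations were actually chosen; the notebook itself fits a bare `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the complete training set (hypothetical).
rng = np.random.default_rng(0)
X_full = rng.normal(size=(50, 3))
y_full = X_full @ np.array([2.0, -1.0, 0.5]) + 10.0

# The pipeline stores the scaler fitted on X_full together with the model.
final_model = make_pipeline(StandardScaler(), LinearRegression())
final_model.fit(X_full, y_full)

# At prediction time the same scaling is applied automatically.
X_new = rng.normal(size=(5, 3))
preds_new = final_model.predict(X_new)
print(preds_new.shape)
```

With this arrangement there is no separate scaling step to forget when predicting on new rows.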

In [None]:
X_train

Unnamed: 0.1,Unnamed: 0,laptop_ID,Inches,Ram,OpSys,velocidad_cpu_ghz,2 in 1 Convertible,Gaming,Netbook,Notebook,Ultrabook,Workstation
25,25,439,14.0,1,5,18,0,0,0,1,0,0
84,84,979,15.6,1,2,11,0,0,0,1,0,0
10,10,337,15.6,3,5,16,0,0,0,1,0,0
342,342,139,15.6,1,4,2,0,0,0,1,0,0
890,890,517,13.3,3,5,16,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
106,106,474,15.6,3,5,7,0,0,0,1,0,0
270,270,478,15.6,3,5,16,0,0,0,1,0,0
860,860,952,14.0,3,5,16,0,0,0,0,1,0
435,435,370,15.6,3,4,11,0,0,0,1,0,0


### 2. Load the `test.csv` data to predict on.

In [202]:
X_pred = pd.read_csv(r"C:\Users\nuria\OneDrive\Escritorio\ML_laptops\data\raw_data\test.csv")
X_pred.head()

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight
0,539,Asus,Zenbook UX510UW-FI095T,Notebook,15.6,IPS Panel 4K Ultra HD 3840x2160,Intel Core i7 7500U 2.7GHz,8GB,256GB SSD + 1TB HDD,Nvidia GeForce GTX 960M,Windows 10,2kg
1,327,Asus,ZenBook UX410UA-GV183T,Notebook,14.0,Full HD 1920x1080,Intel Core i7 7500U 2.7GHz,8GB,256GB SSD,Intel HD Graphics 620,Windows 10,2kg
2,563,Mediacom,SmartBook 130,Notebook,13.3,IPS Panel Full HD 1920x1080,Intel Atom x5-Z8350 1.44GHz,4GB,32GB Flash Storage,Intel HD Graphics,Windows 10,1.35kg
3,13,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.8GHz,16GB,256GB SSD,AMD Radeon Pro 555,macOS,1.83kg
4,935,HP,EliteBook 850,Ultrabook,15.6,Full HD 1920x1080,Intel Core i7 6500U 2.5GHz,8GB,256GB SSD,AMD Radeon R7 M365X,Windows 10,1.84kg


In [203]:
X_pred.shape

(391, 12)

In [204]:
# ScreenResolution
# Collect the pixel dimensions in a new column
X_pred['Resolución'] = X_pred['ScreenResolution'].apply(
    lambda x: re.search(r'(\d{3,4}x\d{3,4})', x).group(0) if re.search(r'(\d{3,4}x\d{3,4})', x) else None)
# Create another variable for the screen characteristics
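
The same extraction can be written with the vectorized `str.extract`, which avoids running the regular expression twice per row. This is a sketch on three sample strings; the column names `width_px` and `height_px` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

screen = pd.Series([
    "Full HD 1920x1080",
    "IPS Panel 4K Ultra HD 3840x2160",
    "1366x768",
])

# One capture group, expand=False -> a Series of matched strings (NaN if no match).
resolution = screen.str.extract(r'(\d{3,4}x\d{3,4})', expand=False)

# Optionally split into numeric width and height, which a model can use directly.
wh = resolution.str.split('x', expand=True).astype(int)
wh.columns = ['width_px', 'height_px']
print(wh)
```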

In [205]:
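# Extract and create new columns.
# Note: r'^(\w+)' captures only the first word ('Full', 'IPS', ...); a bare value
# such as '1366x768' is captured whole, which is what the mapping below relies on.
X_pred['tipo_pantalla'] = X_pred['ScreenResolution'].str.extract(r'^(\w+)')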

X_pred['tipo_pantalla'] = X_pred['tipo_pantalla'].replace({
    '1366x768': 'HD',
    '1600x900':'HD+',
    '1920x1080':'Full HD',
    '1440x900':'WXGA+',
    '2560x1440':'Quad-HD'
    })

In [207]:
X_pred['memoria'] = X_pred['Memory'].str.extract(r'^(\w+)')
X_pred['tipo_memoria'] = X_pred['Memory'].str.extract(r"([A-Za-z]+)$")
X_pred['tipo_memoria'].value_counts()

tipo_memoria
SSD        187
HDD        185
Storage     17
Hybrid       2
Name: count, dtype: int64
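
For a two-drive entry such as `256GB SSD + 1TB HDD`, the two patterns above capture the size of the first drive and the type of the last one. A possible alternative, sketched here under assumptions, is to describe only the first drive listed. The group names `size`, `unit`, `kind` and the 1 TB = 1000 GB conversion are choices made for this example.

```python
import pandas as pd

memory = pd.Series([
    "256GB SSD + 1TB HDD",
    "256GB SSD",
    "32GB Flash Storage",
    "2TB HDD",
])

# Keep only the first drive, so size and type describe the same device.
primary = memory.str.split('+').str[0].str.strip()
parts = primary.str.extract(
    r'^(?P<size>\d+(?:\.\d+)?)(?P<unit>GB|TB)\s+(?P<kind>.+)$')

# Express every size in GB (assumption: 1 TB is counted as 1000 GB).
size_gb = parts['size'].astype(float) * parts['unit'].map({'GB': 1, 'TB': 1000})
print(pd.DataFrame({'size_gb': size_gb, 'kind': parts['kind']}))
```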

In [208]:
X_pred['tipo_cpu'] = X_pred['Cpu'].str.extract(r'^(.*)\s')
X_pred['tipo_cpu'].value_counts()

tipo_cpu
Intel Core i5 7200U              58
Intel Core i7 7700HQ             48
Intel Core i7 7500U              39
Intel Core i5 6200U              28
Intel Core i7 8550U              28
Intel Core i3 6006U              21
Intel Core i5 8250U              18
Intel Core i7 6700HQ             17
Intel Core i7 6500U              17
Intel Core i3 7100U              15
Intel Core i5 7300HQ             11
Intel Celeron Dual Core N3350     8
Intel Celeron Dual Core N3060     6
Intel Core i7 6820HK              6
Intel Core i7 6820HQ              4
Intel Pentium Quad Core N3710     3
Intel Celeron Dual Core 3855U     3
Intel Core i7 6600U               3
Intel Core i5 6300U               3
Intel Core i5 7300U               3
Intel Core i5                     3
AMD A8-Series 7410                3
AMD A9-Series 9420                2
Intel Core i7                     2
Intel Core i7 7820HK              2
Intel Atom x5-Z8350               2
Intel Atom X5-Z8350               2
AMD A6-Series 9220 

In [209]:
X_pred[['Marca_cpu', 'Serie_cpu', 'Modelo_cpu']] = X_pred['tipo_cpu'].str.extract(
    r'^(Intel|AMD)\s+([\w\-]+(?:\s[\w\-]+)?)\s+(.*)$')

In [210]:
# Check that every value is expressed in GHz
cantidad_con_ghz = X_pred['Cpu'].str.contains('GHz', case=False).sum()
print(f"Values with 'GHz': {cantidad_con_ghz} of {len(X_pred)}")
# Extract the CPU speed
X_pred['velocidad_cpu_ghz'] = X_pred['Cpu'].str.extract(r'(\d+(?:\.\d+)?)GHz')

Values with 'GHz': 391 of 391
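
Note that `str.extract` returns strings, so the extracted speed is text until it is cast. A minimal sketch of casting it to a float, which keeps the real GHz values and their spacing, on three sample CPU strings (`speed_ghz` is a hypothetical name):

```python
import pandas as pd

cpu = pd.Series([
    "Intel Core i7 7500U 2.7GHz",
    "Intel Atom x5-Z8350 1.44GHz",
    "Intel Core i5 7200U 2.5GHz",
])

# expand=False returns a Series; astype(float) turns '2.7' into 2.7.
speed_ghz = cpu.str.extract(r'(\d+(?:\.\d+)?)GHz', expand=False).astype(float)
print(speed_ghz)
```

A float column needs no further encoding before it is passed to a regression model.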


In [211]:
#Gpu
X_pred[['marca_gpu', 'modelo_gpu']] = X_pred['Gpu'].str.extract(r'(\w+) (.*)')
X_pred['modelo_gpu'].value_counts()

modelo_gpu
HD Graphics 620       79
HD Graphics 520       56
UHD Graphics 620      21
GeForce GTX 1050      21
GeForce GTX 1060      18
                      ..
R4 Graphics            1
Radeon R7 Graphics     1
HD Graphics 630        1
Mali T860 MP4          1
Radeon R4 Graphics     1
Name: count, Length: 69, dtype: int64

In [212]:
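# OpSys
# Note: the data contains 'macOS' (lowercase m), which none of the keys below
# match, so that value is left unchanged.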
X_pred['OpSys_general']= X_pred['OpSys'].replace({
    'Windows 10':'Windows',
    'Windows 7':'Windows',
    'Windows 10 S':'Windows',
    'Linux':'Linux',
    'MacOS':'MacOS',
    'Mac OS X':'MacOS',
    'Android':'Android',
    'Chrome OS':'Chrome OS',
    'No OS':'Sin OS'
    })

X_pred.head(4)

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,...,memoria,tipo_memoria,tipo_cpu,Marca_cpu,Serie_cpu,Modelo_cpu,velocidad_cpu_ghz,marca_gpu,modelo_gpu,OpSys_general
0,539,Asus,Zenbook UX510UW-FI095T,Notebook,15.6,IPS Panel 4K Ultra HD 3840x2160,Intel Core i7 7500U 2.7GHz,8GB,256GB SSD + 1TB HDD,Nvidia GeForce GTX 960M,...,256GB,HDD,Intel Core i7 7500U,Intel,Core i7,7500U,2.7,Nvidia,GeForce GTX 960M,Windows
1,327,Asus,ZenBook UX410UA-GV183T,Notebook,14.0,Full HD 1920x1080,Intel Core i7 7500U 2.7GHz,8GB,256GB SSD,Intel HD Graphics 620,...,256GB,SSD,Intel Core i7 7500U,Intel,Core i7,7500U,2.7,Intel,HD Graphics 620,Windows
2,563,Mediacom,SmartBook 130,Notebook,13.3,IPS Panel Full HD 1920x1080,Intel Atom x5-Z8350 1.44GHz,4GB,32GB Flash Storage,Intel HD Graphics,...,32GB,Storage,Intel Atom x5-Z8350,Intel,Atom,x5-Z8350,1.44,Intel,HD Graphics,Windows
3,13,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.8GHz,16GB,256GB SSD,AMD Radeon Pro 555,...,256GB,SSD,Intel Core i7,Intel,Core,i7,2.8,AMD,Radeon Pro 555,macOS


In [213]:
ram= {
    '2GB': 0,
    '4GB': 1,
    '6GB': 2,
    '8GB': 3,
    '12GB': 4,
    '16GB': 5,
    '24GB': 6,
    '32GB': 7,
    '64GB': 8
}
X_pred.head(2)

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,...,memoria,tipo_memoria,tipo_cpu,Marca_cpu,Serie_cpu,Modelo_cpu,velocidad_cpu_ghz,marca_gpu,modelo_gpu,OpSys_general
0,539,Asus,Zenbook UX510UW-FI095T,Notebook,15.6,IPS Panel 4K Ultra HD 3840x2160,Intel Core i7 7500U 2.7GHz,8GB,256GB SSD + 1TB HDD,Nvidia GeForce GTX 960M,...,256GB,HDD,Intel Core i7 7500U,Intel,Core i7,7500U,2.7,Nvidia,GeForce GTX 960M,Windows
1,327,Asus,ZenBook UX410UA-GV183T,Notebook,14.0,Full HD 1920x1080,Intel Core i7 7500U 2.7GHz,8GB,256GB SSD,Intel HD Graphics 620,...,256GB,SSD,Intel Core i7 7500U,Intel,Core i7,7500U,2.7,Intel,HD Graphics 620,Windows


In [214]:
X_pred['Ram'].value_counts(normalize=True)

Ram
8GB     0.480818
4GB     0.271100
16GB    0.171355
6GB     0.033248
12GB    0.020460
2GB     0.010230
32GB    0.010230
24GB    0.002558
Name: proportion, dtype: float64

In [215]:
X_pred['Ram'] = X_pred['Ram'].replace(ram)
X_pred['Ram'] = X_pred['Ram'].astype(int)
X_pred.head(3)


Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,...,memoria,tipo_memoria,tipo_cpu,Marca_cpu,Serie_cpu,Modelo_cpu,velocidad_cpu_ghz,marca_gpu,modelo_gpu,OpSys_general
0,539,Asus,Zenbook UX510UW-FI095T,Notebook,15.6,IPS Panel 4K Ultra HD 3840x2160,Intel Core i7 7500U 2.7GHz,3,256GB SSD + 1TB HDD,Nvidia GeForce GTX 960M,...,256GB,HDD,Intel Core i7 7500U,Intel,Core i7,7500U,2.7,Nvidia,GeForce GTX 960M,Windows
1,327,Asus,ZenBook UX410UA-GV183T,Notebook,14.0,Full HD 1920x1080,Intel Core i7 7500U 2.7GHz,3,256GB SSD,Intel HD Graphics 620,...,256GB,SSD,Intel Core i7 7500U,Intel,Core i7,7500U,2.7,Intel,HD Graphics 620,Windows
2,563,Mediacom,SmartBook 130,Notebook,13.3,IPS Panel Full HD 1920x1080,Intel Atom x5-Z8350 1.44GHz,1,32GB Flash Storage,Intel HD Graphics,...,32GB,Storage,Intel Atom x5-Z8350,Intel,Atom,x5-Z8350,1.44,Intel,HD Graphics,Windows
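
As a side note, `Series.map` with a dictionary performs the same lookup as `replace` here and turns any unmapped label into `NaN`, which makes gaps easy to detect. The sketch below shows that, plus an alternative that keeps the real number of gigabytes. `ram_codes` mirrors the `ram` dictionary above; `ram_raw`, `ram_ordinal` and `ram_gb` are hypothetical names.

```python
import pandas as pd

ram_codes = {'2GB': 0, '4GB': 1, '6GB': 2, '8GB': 3, '12GB': 4,
             '16GB': 5, '24GB': 6, '32GB': 7, '64GB': 8}
ram_raw = pd.Series(['8GB', '4GB', '16GB', '8GB'])

# Ordinal codes: map() yields NaN for labels missing from the dictionary.
ram_ordinal = ram_raw.map(ram_codes)
assert ram_ordinal.notna().all(), "unexpected RAM label"
ram_ordinal = ram_ordinal.astype(int)

# Alternative (assumption, not the notebook's choice): keep the size in GB.
ram_gb = ram_raw.str.replace('GB', '', regex=False).astype(int)
print(ram_ordinal.tolist(), ram_gb.tolist())
```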


In [216]:
# The unique values of 'TypeName'
tipos = ['2 in 1 Convertible', 'Gaming', 'Netbook', 'Notebook', 'Ultrabook', 'Workstation']

# One binary column per type
for tipo in tipos:
    X_pred[tipo] = (X_pred['TypeName'] == tipo).astype(int)

# Drop the original column if you no longer need it
# df = df.drop(columns=['TypeName'])

X_pred.head(2)

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,...,velocidad_cpu_ghz,marca_gpu,modelo_gpu,OpSys_general,2 in 1 Convertible,Gaming,Netbook,Notebook,Ultrabook,Workstation
0,539,Asus,Zenbook UX510UW-FI095T,Notebook,15.6,IPS Panel 4K Ultra HD 3840x2160,Intel Core i7 7500U 2.7GHz,3,256GB SSD + 1TB HDD,Nvidia GeForce GTX 960M,...,2.7,Nvidia,GeForce GTX 960M,Windows,0,0,0,1,0,0
1,327,Asus,ZenBook UX410UA-GV183T,Notebook,14.0,Full HD 1920x1080,Intel Core i7 7500U 2.7GHz,3,256GB SSD,Intel HD Graphics 620,...,2.7,Intel,HD Graphics 620,Windows,0,0,0,1,0,0
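
An equivalent way to build these indicator columns is `pd.get_dummies` followed by `reindex` against a fixed list of types, so that a type absent from one dataset still gets an all-zero column. This is a sketch on four made-up rows; `tipos_demo` repeats the list above and `type_name` is a hypothetical sample.

```python
import pandas as pd

tipos_demo = ['2 in 1 Convertible', 'Gaming', 'Netbook',
              'Notebook', 'Ultrabook', 'Workstation']
type_name = pd.Series(['Notebook', 'Ultrabook', 'Notebook', 'Gaming'],
                      name='TypeName')

# get_dummies only creates columns for values it sees; reindex adds the rest
# as zeros and fixes the column order.
dummies = (
    pd.get_dummies(type_name)
      .astype(int)
      .reindex(columns=tipos_demo, fill_value=0)
)
print(dummies)
```

Fixing the column list up front keeps the train and prediction frames aligned even when their category sets differ.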


In [217]:
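# Convert to numeric using category codes.
# Caution: the codes are assigned from the values present in X_pred, so they are
# not guaranteed to match the codes assigned to the same values in df1.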
X_pred['OpSys'] = X_pred['OpSys'].astype('category').cat.codes
X_pred['velocidad_cpu_ghz'] = X_pred['velocidad_cpu_ghz'].astype('category').cat.codes
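
Because `.cat.codes` numbers the categories found in each DataFrame separately, the same value can receive different codes in train and in the prediction set. The sketch below reproduces the effect on made-up values and shows one possible remedy, which is to reuse the category list learned from train. `train_os`, `test_os` and `categories` are hypothetical names, and mapping unseen values to `-1` is simply how `pd.Categorical` behaves.

```python
import pandas as pd

train_os = pd.Series(['Windows 10', 'Linux', 'No OS', 'Windows 10'])
test_os = pd.Series(['macOS', 'Windows 10', 'Linux'])

# Independent encoding: 'Windows 10' gets a different code in each set.
print(train_os.astype('category').cat.codes.tolist())
print(test_os.astype('category').cat.codes.tolist())

# Shared encoding: learn the categories from train and reuse them.
categories = sorted(train_os.unique())
train_codes = pd.Categorical(train_os, categories=categories).codes
test_codes = pd.Categorical(test_os, categories=categories).codes

# Values never seen in train (here 'macOS') are encoded as -1.
print(train_codes.tolist(), test_codes.tolist())
```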

In [218]:
#X_pred = X_pred[['laptop_ID', 'Inches', 'Ram',	'OpSys', 'velocidad_cpu_ghz',	'2 in 1 Convertible', 'Gaming',	'Netbook',	'Notebook',	'Ultrabook', 'Workstation']]
# X_pred

In [219]:
print(X_pred.isnull().sum())

laptop_ID             0
Company               0
Product               0
TypeName              0
Inches                0
ScreenResolution      0
Cpu                   0
Ram                   0
Memory                0
Gpu                   0
OpSys                 0
Weight                0
Resolución            0
tipo_pantalla         0
memoria               0
tipo_memoria          0
tipo_cpu              0
Marca_cpu             1
Serie_cpu             1
Modelo_cpu            1
velocidad_cpu_ghz     0
marca_gpu             0
modelo_gpu            0
OpSys_general         0
2 in 1 Convertible    0
Gaming                0
Netbook               0
Notebook              0
Ultrabook             0
Workstation           0
dtype: int64


In [166]:
#X_pred.drop(columns=['laptop_ID'])

Unnamed: 0,Inches,Ram,OpSys,velocidad_cpu_ghz,2 in 1 Convertible,Gaming,Netbook,Notebook,Ultrabook,Workstation
0,15.6,3,5,17,0,0,0,1,0,0
1,14.0,3,5,17,0,0,0,1,0,0
2,13.3,1,5,4,0,0,0,1,0,0
3,15.4,5,8,19,0,0,0,0,1,0
4,15.6,3,5,15,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...
386,13.3,3,5,15,0,0,0,1,0,0
387,13.3,3,5,6,0,0,0,0,1,0
388,15.6,2,5,15,0,0,0,1,0,0
389,14.0,3,7,13,0,0,0,1,0,0


In [220]:
X_pred.head()

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,...,velocidad_cpu_ghz,marca_gpu,modelo_gpu,OpSys_general,2 in 1 Convertible,Gaming,Netbook,Notebook,Ultrabook,Workstation
0,539,Asus,Zenbook UX510UW-FI095T,Notebook,15.6,IPS Panel 4K Ultra HD 3840x2160,Intel Core i7 7500U 2.7GHz,3,256GB SSD + 1TB HDD,Nvidia GeForce GTX 960M,...,17,Nvidia,GeForce GTX 960M,Windows,0,0,0,1,0,0
1,327,Asus,ZenBook UX410UA-GV183T,Notebook,14.0,Full HD 1920x1080,Intel Core i7 7500U 2.7GHz,3,256GB SSD,Intel HD Graphics 620,...,17,Intel,HD Graphics 620,Windows,0,0,0,1,0,0
2,563,Mediacom,SmartBook 130,Notebook,13.3,IPS Panel Full HD 1920x1080,Intel Atom x5-Z8350 1.44GHz,1,32GB Flash Storage,Intel HD Graphics,...,4,Intel,HD Graphics,Windows,0,0,0,1,0,0
3,13,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.8GHz,5,256GB SSD,AMD Radeon Pro 555,...,19,AMD,Radeon Pro 555,macOS,0,0,0,0,1,0
4,935,HP,EliteBook 850,Ultrabook,15.6,Full HD 1920x1080,Intel Core i7 6500U 2.5GHz,3,256GB SSD,AMD Radeon R7 M365X,...,15,AMD,Radeon R7 M365X,Windows,0,0,0,0,1,0


IMPORTANT: APPLY TO THESE DATA THE SAME STEPS YOU APPLIED TO THE TRAINING DATA

- IF THE ARRAY YOU CALLED `.fit()` ON HAD 4 COLUMNS, THE ARRAY FOR `.predict()` MUST HAVE THE SAME COLUMNS
- IF YOU NORMALIZED THE ARRAY YOU CALLED `.fit()` ON, YOU MUST ALSO NORMALIZE THE ARRAY FOR `.predict()`
- EVERYTHING THE SAME EXCEPT DELETING ROWS: THE NUMBER OF ROWS IN THIS SET MUST NOT CHANGE, BECAUSE THE PREDICTION MUST HAVE EXACTLY 391 ROWS
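
These rules can be checked in code before calling `.predict()`. The sketch below is an illustration on a two-row frame: `feature_cols` is a hypothetical stand-in for `X_train.columns.tolist()`, and `X_pred_demo` for the prediction frame.

```python
import pandas as pd

# Hypothetical stand-in for the columns the model was fitted on, in fit order.
feature_cols = ['Inches', 'Ram', 'Notebook']

X_pred_demo = pd.DataFrame({
    'Notebook': [1, 0],
    'Company': ['Asus', 'HP'],
    'Ram': [3, 5],
    'Inches': [15.6, 14.0],
})
n_rows_before = len(X_pred_demo)

# Fail early if a feature is missing, then select the columns in fit order.
missing = [c for c in feature_cols if c not in X_pred_demo.columns]
assert not missing, f"Missing feature columns: {missing}"
X_aligned = X_pred_demo[feature_cols]

# Selecting columns must never change the number of rows.
assert len(X_aligned) == n_rows_before
print(X_aligned.columns.tolist())
```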

**So, if I loaded the training data with `index_col=0` so that the first column of the dataset is used as the index, do I have to do the same for `test.csv`?**

In [221]:
X_pred = X_pred[['laptop_ID', 'Inches', 'Ram', 'OpSys', 'velocidad_cpu_ghz', '2 in 1 Convertible', 'Gaming', 'Netbook', 'Notebook', 'Ultrabook', 'Workstation']]
X_pred

Unnamed: 0,laptop_ID,Inches,Ram,OpSys,velocidad_cpu_ghz,2 in 1 Convertible,Gaming,Netbook,Notebook,Ultrabook,Workstation
0,539,15.6,3,5,17,0,0,0,1,0,0
1,327,14.0,3,5,17,0,0,0,1,0,0
2,563,13.3,1,5,4,0,0,0,1,0,0
3,13,15.4,5,8,19,0,0,0,0,1,0
4,935,15.6,3,5,15,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...
386,742,13.3,3,5,15,0,0,0,1,0,0
387,660,13.3,3,5,6,0,0,0,0,1,0
388,983,15.6,2,5,15,0,0,0,1,0,0
389,1137,14.0,3,7,13,0,0,0,1,0,0


In [174]:
#X_train.drop(columns='laptop_ID')

Unnamed: 0,Inches,Ram,OpSys,velocidad_cpu_ghz,2 in 1 Convertible,Gaming,Netbook,Notebook,Ultrabook,Workstation
25,14.0,1,5,18,0,0,0,1,0,0
84,15.6,1,2,11,0,0,0,1,0,0
10,15.6,3,5,16,0,0,0,1,0,0
342,15.6,1,4,2,0,0,0,1,0,0
890,13.3,3,5,16,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...
106,15.6,3,5,7,0,0,0,1,0,0
270,15.6,3,5,16,0,0,0,1,0,0
860,14.0,3,5,16,0,0,0,0,1,0
435,15.6,3,4,11,0,0,0,1,0,0


### 3. NOW I can make the prediction, which is what you will upload to Kaggle.

**What will you upload to Kaggle?**

In [222]:
print("X_train columns:", X_train.columns.tolist())
print("X_pred columns:", X_pred.columns.tolist())


X_train columns: ['laptop_ID', 'Inches', 'Ram', 'OpSys', 'velocidad_cpu_ghz', '2 in 1 Convertible', 'Gaming', 'Netbook', 'Notebook', 'Ultrabook', 'Workstation']
X_pred columns: ['laptop_ID', 'Inches', 'Ram', 'OpSys', 'velocidad_cpu_ghz', '2 in 1 Convertible', 'Gaming', 'Netbook', 'Notebook', 'Ultrabook', 'Workstation']


In [223]:
predictions_submit = regressor.predict(X_pred)
predictions_submit

array([1080.32886206, 1085.01765349,  400.02617444, 2135.83202782,
       1466.36240239,  568.17578484, 1961.08693927,  709.85226282,
       1309.72829767,  555.99579392, 1090.80253691, 2002.44295374,
        414.04909704,  388.58395021, 1334.06550284, 1848.10478377,
        106.25140265, 1075.27845474, 1962.23214379,  463.30629719,
       1636.95168959, 1497.00773585, 1963.94077201,  494.8025847 ,
        416.20704329,  554.50233291, 1182.52673611, 1959.54719014,
       1376.4908962 ,  644.15626187, 1752.77315856, 1344.42107168,
        418.63804114, 1992.65599756, 1170.54861111,  158.57185913,
       1281.11148889, 1845.73488341, 1004.46734892, 1423.21399932,
       2055.79400325, 1673.23146673, 1602.21822396, 1423.59316623,
       1314.29893115,  870.24815338,  511.97752696, 2381.17853058,
       1137.76069941,  910.86115195,  429.87933913, 1060.62470852,
        269.35441433,  344.23074744, 1247.53394871,   29.01815984,
       1325.63333501,  957.09476079, 1938.83935626, 1221.83633

**BUT! To upload the prediction to Kaggle, it must have one specific shape; no other will do.**

In this case, the **SAME** shape as `sample_submission.csv`.

In [224]:
sample = pd.read_csv(r'C:\Users\nuria\OneDrive\Escritorio\ML_laptops\data\raw_data\sample_submission.csv')
sample.head()

Unnamed: 0,laptop_ID,Price_euros
0,539,650
1,327,650
2,563,650
3,13,650
4,935,500


In [225]:
sample.shape

(391, 2)

### 4. Put your predictions in a dataframe.

In this case, the **SAME** shape as `sample_submission.csv`.

In [236]:
submission = pd.DataFrame({"laptop_ID": sample['laptop_ID'], "Price_euros": predictions_submit})
submission.head()

Unnamed: 0,laptop_ID,Price_euros
0,539,1080.328862
1,327,1085.017653
2,563,400.026174
3,13,2135.832028
4,935,1466.362402


### 5. Run it through the CHEQUEATOR to confirm that it really is ready to upload to Kaggle.

In [233]:
import urllib.request
from PIL import Image

In [234]:
def chequeator(df_to_submit, version):
    """
    This function makes sure your submission has the shape Kaggle requires.

    If it does, the dataframe is saved to a `csv` and is ready to upload to Kaggle.

    If it does not, READ THE MESSAGE AND DO WHAT IT SAYS.

    If it still does not:
    - turn off your computer,
    - go for a walk,
    - turn it on again,
    - open this notebook and
    - read it all again.
    We all deserve a second chance. So do you.
    """
    if df_to_submit.shape == sample.shape:
        # Compare the column names element by element (shapes already match here).
        if (df_to_submit.columns == sample.columns).all():
            # Compare the ids by position, ignoring index labels.
            if (df_to_submit['laptop_ID'].to_numpy() == sample['laptop_ID'].to_numpy()).all():
                print("You're ready to submit!")
                # Save the dataframe that was checked; index=False is very important.
                df_to_submit.to_csv(f"submission_{version}.csv", index=False)
                urllib.request.urlretrieve("https://i.kym-cdn.com/photos/images/facebook/000/747/556/27a.jpg", "gfg.png")
                img = Image.open("gfg.png")
                img.show()
            else:
                print("Check the ids and try again")
        else:
            print("Check the names of the columns and try again")
    else:
        print("Check the number of rows and/or columns and try again")
        print("\nSecret message from Clara: I can't believe that after this whole notebook you changed the rows of `test.csv`. I'm crying.")


In [235]:
chequeator(submission,2)

You're ready to submit!
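
For repeated use, the same checks can be written as a function with no side effects that returns a list of problems instead of printing, which is easier to test. This is a hypothetical sketch: `check_submission` and its messages are not part of the notebook, and the small frames stand in for `sample` and `submission`.

```python
import pandas as pd

def check_submission(df_to_submit, sample_df, id_col='laptop_ID'):
    """Return a list of problems; an empty list means the frame looks valid."""
    problems = []
    if df_to_submit.shape != sample_df.shape:
        problems.append("shape mismatch")
    if list(df_to_submit.columns) != list(sample_df.columns):
        problems.append("column names mismatch")
    elif df_to_submit[id_col].tolist() != sample_df[id_col].tolist():
        problems.append("ids differ or are in a different order")
    return problems

# Hypothetical stand-ins for sample_submission.csv and two candidate submissions.
sample_demo = pd.DataFrame({'laptop_ID': [539, 327], 'Price_euros': [650, 650]})
good = pd.DataFrame({'laptop_ID': [539, 327], 'Price_euros': [1080.3, 1085.0]})
bad = pd.DataFrame({'laptop_ID': [327, 539], 'Price_euros': [1080.3, 1085.0]})

print(check_submission(good, sample_demo))
print(check_submission(bad, sample_demo))
```

Comparing the id lists element by element catches reordered rows, which a comparison of two aggregated `.all()` values would not.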
