# Data Processing + Feature Engineering

## Initial Data Preprocessing
1. Remove duplicates, if any
2. Handle missing values:
   - Decide between dropping and imputing
   - Document the chosen strategy
3. Identify and handle outliers:
   - Statistical analysis of outliers
   - Decide on a strategy (remove, transform, or keep)
4. General data cleaning
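
The steps listed above can be sketched on a small toy table. This is only an illustration under stated assumptions: the values are made up rather than taken from `train.csv`, median imputation and the 1.5 × IQR rule are just one possible choice for each step, and the `price_outlier` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from train.csv)
toy = pd.DataFrame({
    "Inches": [15.6, 15.6, 13.3, 14.0, np.nan, 17.3],
    "Price_euros": [889.0, 889.0, 1813.0, 519.0, 650.0, 9999.0],
})

# 1. Drop exact duplicate rows
toy = toy.drop_duplicates().reset_index(drop=True)

# 2. Impute missing numeric values with the column median (one possible strategy)
toy["Inches"] = toy["Inches"].fillna(toy["Inches"].median())

# 3. Flag (rather than delete) outliers with the 1.5 * IQR rule
q1, q3 = toy["Price_euros"].quantile([0.25, 0.75])
iqr = q3 - q1
toy["price_outlier"] = ~toy["Price_euros"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(toy)
```

Flagging outliers in a separate column keeps the decision to remove, transform, or keep them open until later.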

### Libraries

In [2]:
import pandas as pd
import numpy as np

import urllib.request
from PIL import Image
import re

import matplotlib.pyplot as plt
import seaborn as sns

### Data

In [3]:
df = pd.read_csv(r'C:\Users\nuria\OneDrive\Escritorio\ML_laptops\data\raw_data\train.csv')
df.head(3)

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price_euros
0,1223,Dell,Inspiron 5567,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,AMD Radeon R7 M445,Windows 10,2.36kg,889.0
1,78,Lenovo,IdeaPad 320-15IKBN,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,2TB HDD,Intel HD Graphics 620,No OS,2.2kg,519.0
2,1267,Dell,XPS 13,2 in 1 Convertible,13.3,Quad HD+ / Touchscreen 3200x1800,Intel Core i5 7Y54 1.2GHz,8GB,256GB SSD,Intel HD Graphics 615,Windows 10,1.24kg,1813.0


## Feature Engineering
1. Initial feature selection:
   - Correlation analysis
   - Feature importance
2. Creation of new features:
   - Combinations of existing variables
   - Mathematical transformations
3. Unsupervised techniques (if needed):
   - PCA for dimensionality reduction
   - Clustering to derive new features
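
As a rough illustration of the correlation analysis mentioned in step 1, the sketch below ranks numeric features by the absolute value of their Pearson correlation with the target. The toy values are invented for the example, and it assumes `Ram` has already been converted to a number.

```python
import pandas as pd

# Toy numeric table (hypothetical values)
toy = pd.DataFrame({
    "Ram": [1, 3, 3, 5, 7],
    "Inches": [15.6, 15.6, 13.3, 14.0, 17.3],
    "Price_euros": [400.0, 889.0, 1100.0, 1813.0, 3000.0],
})

# Pearson correlation of each numeric feature with the target,
# ordered from strongest to weakest absolute correlation
corr_with_price = (
    toy.corr(numeric_only=True)["Price_euros"]
    .drop("Price_euros")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_price)
```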

In [4]:
# ScreenResolution

# Extract the pixel dimensions (e.g. '1920x1080') into a new column
df['Resolución'] = df['ScreenResolution'].str.extract(r'(\d{3,4}x\d{3,4})', expand=False)
# Create another variable for the screen characteristics
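
A possible next step, sketched here under assumptions, is to split the resolution into numeric width and height so a model can use them directly. The column names `width_px`, `height_px` and `total_pixels` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Sample strings in the same format as the ScreenResolution column
toy = pd.DataFrame({"ScreenResolution": [
    "Full HD 1920x1080",
    "Quad HD+ / Touchscreen 3200x1800",
    "1366x768",
]})

# Named groups become the column names of the extracted frame
dims = toy["ScreenResolution"].str.extract(
    r"(?P<width_px>\d{3,4})x(?P<height_px>\d{3,4})"
)
toy[["width_px", "height_px"]] = dims.astype(int)

# A simple combined feature: total number of pixels
toy["total_pixels"] = toy["width_px"] * toy["height_px"]
print(toy)
```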

In [5]:
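# Extract the first token of ScreenResolution into a new column.
# Note: r'^(\w+)' captures only the first word, so 'Full HD 1920x1080' yields 'Full',
# while a bare '1366x768' yields the resolution itself, which the mapping below renames.
df['tipo_pantalla'] = df['ScreenResolution'].str.extract(r'^(\w+)')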

df['tipo_pantalla'] = df['tipo_pantalla'].replace({
    '1366x768': 'HD',
    '1600x900':'HD+',
    '1920x1080':'Full HD',
    '1440x900':'WXGA+',
    '2560x1440':'Quad-HD'
    })
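
Because the screen description can combine several attributes in one string, a complementary approach is to derive one binary flag per attribute. The sketch below assumes the keywords `Touchscreen` and `IPS` are the ones of interest; the column names `is_touchscreen` and `is_ips` are hypothetical.

```python
import pandas as pd

# Sample strings in the same format as the ScreenResolution column
toy = pd.DataFrame({"ScreenResolution": [
    "IPS Panel Full HD / Touchscreen 1920x1080",
    "Full HD 1920x1080",
    "Quad HD+ / Touchscreen 3200x1800",
]})

# Literal substring checks (regex=False), stored as 0/1 flags
toy["is_touchscreen"] = (
    toy["ScreenResolution"].str.contains("Touchscreen", regex=False).astype(int)
)
toy["is_ips"] = toy["ScreenResolution"].str.contains("IPS", regex=False).astype(int)
print(toy)
```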

In [6]:
df['memoria'] = df['Memory'].str.extract(r'^(\w+)')
df['tipo_memoria'] = df['Memory'].str.extract(r"([A-Za-z]+)$")
df['tipo_memoria'].value_counts()

tipo_memoria
SSD        454
HDD        391
Storage     57
Hybrid      10
Name: count, dtype: int64
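
The `memoria` column still holds strings such as `256GB` or `2TB`. A minimal sketch of converting capacity to a single numeric unit follows. It assumes 1 TB = 1024 GB, uses a hypothetical column name `memory_gb`, and parses only the first drive if a value describes more than one. The pattern also accepts decimals, since `^(\w+)` would stop at a decimal point.

```python
import pandas as pd

# Sample strings in the same format as the Memory column
toy = pd.DataFrame({"Memory": [
    "256GB SSD", "2TB HDD", "1.0TB Hybrid", "32GB Flash Storage",
]})

# Capture the leading number (with optional decimals) and its unit
parts = toy["Memory"].str.extract(r"^(?P<size>\d+(?:\.\d+)?)(?P<unit>GB|TB)")

# Assumption: 1 TB = 1024 GB
factor = parts["unit"].map({"GB": 1, "TB": 1024})
toy["memory_gb"] = parts["size"].astype(float) * factor
print(toy)
```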

In [7]:
df['tipo_cpu'] = df['Cpu'].str.extract(r'^(.*)\s')
df['tipo_cpu'].value_counts()

tipo_cpu
Intel Core i5 7200U     135
Intel Core i7 7700HQ     99
Intel Core i7 7500U      96
Intel Core i3 6006U      60
Intel Core i5 8250U      54
                       ... 
Intel Atom Z8350          1
Intel Core i7 8650U       1
Intel Core M 7Y30         1
Intel Core i5 6440HQ      1
Intel Core i7 6560U       1
Name: count, Length: 81, dtype: int64

In [8]:
df[['Marca_cpu', 'Serie_cpu', 'Modelo_cpu']] = df['tipo_cpu'].str.extract(
    r'^(Intel|AMD)\s+([\w\-]+(?:\s[\w\-]+)?)\s+(.*)$')
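
To see what this pattern captures, it can be tried on a few strings. The sketch below reuses the same regular expression. The last three inputs are illustrative and may not occur in `train.csv`; they show that a three-word series is split across `Serie_cpu` and `Modelo_cpu`, and that a brand other than Intel or AMD produces missing values.

```python
import pandas as pd

toy = pd.DataFrame({"tipo_cpu": [
    "Intel Core i5 7200U",
    "AMD A9-Series 9420",             # illustrative input
    "Intel Celeron Dual Core N3350",  # illustrative input
    "Samsung Cortex A72&A53",         # illustrative input
]})

pattern = r'^(Intel|AMD)\s+([\w\-]+(?:\s[\w\-]+)?)\s+(.*)$'
parts = toy["tipo_cpu"].str.extract(pattern)
parts.columns = ["Marca_cpu", "Serie_cpu", "Modelo_cpu"]
print(parts)

# Rows the pattern could not parse at all
print(parts["Marca_cpu"].isna().sum())
```

Counting the missing values in `Marca_cpu` after the extraction is a quick way to check how many rows the pattern failed to parse.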

In [9]:
# Check that every value is expressed in GHz
cantidad_con_ghz = df['Cpu'].str.contains('GHz', case=False).sum()
print(f"Values with 'GHz': {cantidad_con_ghz} of {len(df)}")

# Extract the CPU clock speed
df['velocidad_cpu_ghz'] = df['Cpu'].str.extract(r'(\d+(?:\.\d+)?)GHz')

Values with 'GHz': 912 of 912
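
Note that `str.extract` returns text, so the extracted speed is a string until it is cast. A minimal sketch of the conversion, using two sample strings (the second is made up to show an integer speed):

```python
import pandas as pd

toy = pd.DataFrame({"Cpu": [
    "Intel Core i5 7200U 2.5GHz",
    "Intel Core i7 6500U 3GHz",  # illustrative input with an integer speed
]})

# expand=False returns a Series, which can then be cast to float
toy["velocidad_cpu_ghz"] = (
    toy["Cpu"].str.extract(r"(\d+(?:\.\d+)?)GHz", expand=False).astype(float)
)
print(toy.dtypes)
```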


In [10]:
#Gpu
df[['marca_gpu', 'modelo_gpu']] = df['Gpu'].str.extract(r'(\w+) (.*)')
df['modelo_gpu'].value_counts()

modelo_gpu
HD Graphics 620      202
HD Graphics 520      129
UHD Graphics 620      47
GeForce GTX 1050      45
Radeon 530            33
                    ... 
HD Graphics 530        1
Radeon R5 430          1
FirePro W5130M         1
GeForce GTX 940M       1
GeForce GTX 1070M      1
Name: count, Length: 91, dtype: int64

In [11]:
#OpSys
df['OpSys_general']= df['OpSys'].replace({
    'Windows 10':'Windows',
    'Windows 7':'Windows',
    'Windows 10 S':'Windows',
    'Linux':'Linux',
    'MacOS':'MacOS',
    'Mac OS X':'MacOS',
    'Android':'Android',
    'Chrome OS':'Chrome OS',
    'No OS':'Sin OS'
    })

df.head(2)

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,...,memoria,tipo_memoria,tipo_cpu,Marca_cpu,Serie_cpu,Modelo_cpu,velocidad_cpu_ghz,marca_gpu,modelo_gpu,OpSys_general
0,1223,Dell,Inspiron 5567,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,AMD Radeon R7 M445,...,256GB,SSD,Intel Core i5 7200U,Intel,Core i5,7200U,2.5,AMD,Radeon R7 M445,Windows
1,78,Lenovo,IdeaPad 320-15IKBN,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,2TB HDD,Intel HD Graphics 620,...,2TB,HDD,Intel Core i5 7200U,Intel,Core i5,7200U,2.5,Intel,HD Graphics 620,Sin OS


In [17]:
# Ordinal encoding: each RAM size is mapped to its rank (0 = smallest)
ram = {
    '2GB': 0,
    '4GB': 1,
    '6GB': 2,
    '8GB': 3,
    '12GB': 4,
    '16GB': 5,
    '24GB': 6,
    '32GB': 7,
    '64GB': 8
}
df.head(5)

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,...,velocidad_cpu_ghz,marca_gpu,modelo_gpu,OpSys_general,2 in 1 Convertible,Gaming,Netbook,Notebook,Ultrabook,Workstation
0,1223,Dell,Inspiron 5567,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,3,256GB SSD,AMD Radeon R7 M445,...,2.5,AMD,Radeon R7 M445,Windows,0,0,0,1,0,0
1,78,Lenovo,IdeaPad 320-15IKBN,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,3,2TB HDD,Intel HD Graphics 620,...,2.5,Intel,HD Graphics 620,Sin OS,0,0,0,1,0,0
2,1267,Dell,XPS 13,2 in 1 Convertible,13.3,Quad HD+ / Touchscreen 3200x1800,Intel Core i5 7Y54 1.2GHz,3,256GB SSD,Intel HD Graphics 615,...,1.2,Intel,HD Graphics 615,Windows,1,0,0,0,0,0
3,161,Dell,Inspiron 5579,2 in 1 Convertible,15.6,Full HD / Touchscreen 1920x1080,Intel Core i7 8550U 1.8GHz,3,256GB SSD,Intel UHD Graphics 620,...,1.8,Intel,UHD Graphics 620,Windows,1,0,0,0,0,0
4,922,LG,Gram 14Z970,Ultrabook,14.0,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 7500U 2.7GHz,3,512GB SSD,Intel HD Graphics 620,...,2.7,Intel,HD Graphics 620,Windows,0,0,0,0,1,0


In [13]:
df['Ram'].value_counts(normalize=True)

Ram
8GB     0.472588
4GB     0.294956
16GB    0.145833
6GB     0.030702
2GB     0.019737
12GB    0.018640
32GB    0.014254
24GB    0.002193
64GB    0.001096
Name: proportion, dtype: float64

In [14]:
# Apply the ordinal mapping defined in `ram` and store the result as integers
df['Ram'] = df['Ram'].map(ram).astype(int)
df.head(3)

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,...,memoria,tipo_memoria,tipo_cpu,Marca_cpu,Serie_cpu,Modelo_cpu,velocidad_cpu_ghz,marca_gpu,modelo_gpu,OpSys_general
0,1223,Dell,Inspiron 5567,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,3,256GB SSD,AMD Radeon R7 M445,...,256GB,SSD,Intel Core i5 7200U,Intel,Core i5,7200U,2.5,AMD,Radeon R7 M445,Windows
1,78,Lenovo,IdeaPad 320-15IKBN,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,3,2TB HDD,Intel HD Graphics 620,...,2TB,HDD,Intel Core i5 7200U,Intel,Core i5,7200U,2.5,Intel,HD Graphics 620,Sin OS
2,1267,Dell,XPS 13,2 in 1 Convertible,13.3,Quad HD+ / Touchscreen 3200x1800,Intel Core i5 7Y54 1.2GHz,3,256GB SSD,Intel HD Graphics 615,...,256GB,SSD,Intel Core i5 7Y54,Intel,Core i5,7Y54,1.2,Intel,HD Graphics 615,Windows
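
The rank encoding treats the step from 2GB to 4GB the same as the step from 32GB to 64GB. An alternative, sketched here as an option rather than as the notebook's method, keeps the actual capacity as a number. The column name `ram_gb` is hypothetical.

```python
import pandas as pd

# Sample strings in the original format of the Ram column
toy = pd.DataFrame({"Ram": ["8GB", "4GB", "16GB", "64GB"]})

# Keep the numeric part so that distances between sizes are preserved
toy["ram_gb"] = toy["Ram"].str.extract(r"(\d+)", expand=False).astype(int)
print(toy)
```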


In [15]:
# The unique values of 'TypeName'
tipos = ['2 in 1 Convertible', 'Gaming', 'Netbook', 'Notebook', 'Ultrabook', 'Workstation']

# One binary column per type
for tipo in tipos:
    df[tipo] = (df['TypeName'] == tipo).astype(int)

# Drop the original column if it is no longer needed
# df = df.drop(columns=['TypeName'])

df.head(2)

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,...,velocidad_cpu_ghz,marca_gpu,modelo_gpu,OpSys_general,2 in 1 Convertible,Gaming,Netbook,Notebook,Ultrabook,Workstation
0,1223,Dell,Inspiron 5567,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,3,256GB SSD,AMD Radeon R7 M445,...,2.5,AMD,Radeon R7 M445,Windows,0,0,0,1,0,0
1,78,Lenovo,IdeaPad 320-15IKBN,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,3,2TB HDD,Intel HD Graphics 620,...,2.5,Intel,HD Graphics 620,Sin OS,0,0,0,1,0,0
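
The same one-hot encoding can also be produced with `pd.get_dummies`, shown below on toy data as an alternative. One difference to keep in mind: it only creates columns for categories present in the data, whereas the explicit list always yields the same six columns.

```python
import pandas as pd

toy = pd.DataFrame({"TypeName": [
    "Notebook", "2 in 1 Convertible", "Ultrabook", "Notebook",
]})

# One 0/1 column per category found in the data
dummies = pd.get_dummies(toy["TypeName"], dtype=int)
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```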


In [16]:
# Save the cleaned dataframe
df.to_csv('DataFrame_laptops_limpio_1', index=False)