# Differences between pd.get_dummies, LabelEncoder, OneHotEncoder, and ColumnTransformer



## pd.get_dummies:

- A pandas function that converts categorical variables into dummy/indicator variables.
- Creates one new column per category, holding 0/1 values (booleans by default in pandas 2.x, as in the output below).

### Example:

In [267]:
import pandas as pd
df = pd.DataFrame({'color': ['red', 'blue', 'green']})
df_dummies = pd.get_dummies(df)
df_dummies

Unnamed: 0,color_blue,color_green,color_red
0,False,False,True
1,True,False,False
2,False,True,False


In [268]:
# idxmax returns, for each row, the name of the dummy column that is True (prefix included)
df_original = df_dummies.idxmax(axis=1)
df_original

0      color_red
1     color_blue
2    color_green
dtype: object

In [269]:
df = pd.DataFrame({'color': ['red', 'blue', 'green','red','red']})
df_dummies = pd.get_dummies(df)
df_dummies

Unnamed: 0,color_blue,color_green,color_red
0,False,False,True
1,True,False,False
2,False,True,False
3,False,False,True
4,False,False,True


In [270]:
df_original = df_dummies.idxmax(axis=1)
df_original

0      color_red
1     color_blue
2    color_green
3      color_red
4      color_red
dtype: object
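
As the output above shows, `idxmax` returns the dummy column names (`color_red`), not the original values (`red`). A minimal sketch of one way to strip the prefix, assuming the column prefix itself contains no underscore (the name `recovered` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'red']})
df_dummies = pd.get_dummies(df)

# Name of the True column in each row, e.g. 'color_red'
labels = df_dummies.idxmax(axis=1)

# Split once on the first '_' and keep what follows the 'color' prefix
recovered = labels.str.split('_', n=1).str[1]
print(recovered.tolist())  # ['red', 'blue', 'green', 'red', 'red']
```

This only works when every row has exactly one True column; a row of all False values would still be mapped to the first column by `idxmax`.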

## LabelEncoder:

- A class in sklearn.preprocessing that converts categorical labels into integers.
- Assigns a unique integer to each category, in sorted order. It is intended for the target `y`, not for input features.

### Example:

In [214]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
labels = ['red', 'blue', 'green']
encoded_labels = le.fit_transform(labels)
encoded_labels

array([2, 0, 1])

In [215]:
original_labels = le.inverse_transform(encoded_labels)
original_labels

array(['red', 'blue', 'green'], dtype='<U5')

In [216]:
transformado = le.transform(['red', 'blue'])  
transformado

array([2, 0])
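
The mapping behind these codes can be read from the fitted encoder. The sketch below shows the `classes_` attribute and what happens with a label that was not present during `fit`; the names `mapping` and `unseen_ok` are only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['red', 'blue', 'green'])

# classes_ holds the sorted categories; each position is the integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}

# A label that was not seen during fit cannot be encoded
try:
    le.transform(['yellow'])
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(unseen_ok)  # False
```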

## OneHotEncoder

- A class in sklearn.preprocessing that converts categorical variables into a matrix of dummy/indicator variables.
- Similar to pd.get_dummies, but it remembers the categories seen in `fit`, so the same encoding can be applied to new data, and it plugs into scikit-learn pipelines.

### Example:

In [271]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
ohe = OneHotEncoder()
data = np.array([['red'], ['blue'], ['green']])
encoded_data = ohe.fit_transform(data).toarray()
encoded_data

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

In [272]:
original_data = ohe.inverse_transform(encoded_data)
original_data

array([['red'],
       ['blue'],
       ['green']], dtype='<U5')

In [273]:
ohe.transform( [['red']] ).toarray()

array([[0., 0., 1.]])
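
By default the encoder raises an error for a category it did not see during `fit`. A small sketch of the `handle_unknown='ignore'` option, which encodes such values as a row of zeros instead (the value `'yellow'` is just an illustrative unseen category):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([['red'], ['blue'], ['green']])

# With handle_unknown='ignore', unseen categories become all-zero rows
ohe_ignore = OneHotEncoder(handle_unknown='ignore')
ohe_ignore.fit(train)

encoded = ohe_ignore.transform([['red'], ['yellow']]).toarray()
print(encoded)
```

The first row is the usual one-hot vector for `red`; the second row contains only zeros because `yellow` was never seen.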

## ColumnTransformer:

- A class in sklearn.compose that applies different transformations to different columns of a DataFrame or array.
- Useful for running column-specific preprocessing in a single step.

### Example:


In [220]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'green'],
    'size': [1, 2, 3]
})

ct = ColumnTransformer(
    transformers=[
        ('color_one', OneHotEncoder(), ['color']),
        ('size_scaler', StandardScaler(), ['size'])
    ]
)

transformed_data = ct.fit_transform(df)
transformed_data

array([[ 0.        ,  0.        ,  1.        , -1.22474487],
       [ 1.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  1.22474487]])

In [221]:
# Note: ColumnTransformer has no inverse_transform of its own, so each fitted
# transformer is inverted separately on its own slice of the output columns
original_data = ct.named_transformers_['color_one'].inverse_transform(transformed_data[:, :3])
original_size = ct.named_transformers_['size_scaler'].inverse_transform(transformed_data[:, 3:])

# Combine the recovered columns
original_df = pd.DataFrame({
    'color': original_data.flatten(),
    'size': original_size.flatten()
})

print(original_df)

   color  size
0    red   1.0
1   blue   2.0
2  green   3.0


In [222]:
original_data

array([['red'],
       ['blue'],
       ['green']], dtype=object)

In [223]:
trans = ct.transformers_[0][1]
trans.inverse_transform([[1, 0, 0]]),trans.inverse_transform([[0, 1, 0]]),trans.inverse_transform([[0, 0, 1]])


(array([['blue']], dtype=object),
 array([['green']], dtype=object),
 array([['red']], dtype=object))

<h2>Categorical Variables and One-Hot Encoding</h2>

In [224]:
import pandas as pd
import numpy as np

In [225]:
df = pd.read_csv("homeprices.csv")
df

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


<h2 style='color:purple'>Using pandas to create dummy variables</h2>

In [226]:
dummies = pd.get_dummies(df.town)
dummies

Unnamed: 0,monroe township,robinsville,west windsor
0,True,False,False
1,True,False,False
2,True,False,False
3,True,False,False
4,True,False,False
5,False,False,True
6,False,False,True
7,False,False,True
8,False,False,True
9,False,True,False


In [227]:
df_dummies= pd.concat([df,dummies],axis='columns')
df_dummies

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,True,False,False
1,monroe township,3000,565000,True,False,False
2,monroe township,3200,610000,True,False,False
3,monroe township,3600,680000,True,False,False
4,monroe township,4000,725000,True,False,False
5,west windsor,2600,585000,False,False,True
6,west windsor,2800,615000,False,False,True
7,west windsor,3300,650000,False,False,True
8,west windsor,3600,710000,False,False,True
9,robinsville,2600,575000,False,True,False


In [228]:
# Drop the original 'town' column; the check avoids a KeyError if this cell is run twice
if 'town' in df_dummies.columns:
    df_dummies.drop('town', axis='columns', inplace=True)

df_dummies

Unnamed: 0,area,price,monroe township,robinsville,west windsor
0,2600,550000,True,False,False
1,3000,565000,True,False,False
2,3200,610000,True,False,False
3,3600,680000,True,False,False
4,4000,725000,True,False,False
5,2600,585000,False,False,True
6,2800,615000,False,False,True
7,3300,650000,False,False,True
8,3600,710000,False,False,True
9,2600,575000,False,True,False


<h3 style='color:purple'>Dummy Variable Trap</h3>


When one variable can be derived from the others, the variables are said to be multicollinear. Here, if you know the values of the monroe township and robinsville columns, you can infer the value of west windsor (for example, monroe township=0 and robinsville=0 means west windsor=1). The town columns are therefore multicollinear. In that situation the regression coefficients are not uniquely determined and linear regression may not behave as expected, so one of the dummy columns should be dropped.

**NOTE: sklearn's LinearRegression still fits and predicts when none of the town columns is dropped, because its least-squares solver tolerates collinear columns. Even so, it is a good habit to handle the dummy variable trap yourself, in case the library you are using does not handle it for you.**
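
The redundancy can be checked numerically. The sketch below uses a small illustrative series of town names (not the full dataset) and a hypothetical helper `with_intercept`; it compares the rank of the design matrix with and without `drop_first=True`.

```python
import numpy as np
import pandas as pd

towns = pd.Series(
    ['monroe township', 'west windsor', 'robinsville', 'west windsor'],
    name='town',
)

full = pd.get_dummies(towns, dtype=float)                      # 3 dummy columns
reduced = pd.get_dummies(towns, drop_first=True, dtype=float)  # 2 dummy columns

def with_intercept(dummies):
    """Prepend a column of ones, as a regression with an intercept does."""
    ones = np.ones((len(dummies), 1))
    return np.hstack([ones, dummies.to_numpy()])

rank_full = np.linalg.matrix_rank(with_intercept(full))
rank_reduced = np.linalg.matrix_rank(with_intercept(reduced))
print(rank_full, with_intercept(full).shape[1])        # 3 4
print(rank_reduced, with_intercept(reduced).shape[1])  # 3 3
```

With all dummies kept, the matrix has 4 columns but rank 3, because the dummy columns add up to the intercept column. Dropping one dummy leaves 3 independent columns.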

In [229]:
df_dummies.drop('west windsor',axis='columns',inplace=True)
df_dummies

Unnamed: 0,area,price,monroe township,robinsville
0,2600,550000,True,False
1,3000,565000,True,False
2,3200,610000,True,False
3,3600,680000,True,False
4,4000,725000,True,False
5,2600,585000,False,False
6,2800,615000,False,False
7,3300,650000,False,False
8,3600,710000,False,False
9,2600,575000,False,True


In [230]:
X = df_dummies.drop('price',axis='columns')
y = df_dummies.price

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X.values,y)

LinearRegression()


In [231]:
X

Unnamed: 0,area,monroe township,robinsville
0,2600,True,False
1,3000,True,False
2,3200,True,False
3,3600,True,False
4,4000,True,False
5,2600,False,False
6,2800,False,False
7,3300,False,False
8,3600,False,False
9,2600,False,True


In [232]:
model.predict(X.values) # predictions for every row of the training data

array([539709.7398409 , 590468.71640508, 615848.20468716, 666607.18125134,
       717366.15781551, 579723.71533005, 605103.20361213, 668551.92431735,
       706621.15674048, 565396.15136531, 603465.38378844, 628844.87207052,
       692293.59277574])

In [233]:
model.score(X.values,y)

0.9573929037221873

In [234]:
model.predict([[3400,0,0]]) # 3400 sqr ft home in west windsor

array([681241.66845839])

In [235]:
model.predict([[2800,0,1]]) # 2800 sq ft home in robinsville (columns: area, monroe township, robinsville)

array([590775.63964739])

<h2 style='color:purple'>Using sklearn OneHotEncoder</h2>

The first step is to use a label encoder to convert the town names into numbers. (Recent versions of OneHotEncoder also accept string categories directly, as cell In [249] shows.)

In [236]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [237]:
df

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


In [238]:
dfle = df.copy()
dfle.town = le.fit_transform(dfle.town)
dfle

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000
5,2,2600,585000
6,2,2800,615000
7,2,3300,650000
8,2,3600,710000
9,1,2600,575000


In [239]:
X = dfle[['town','area']].values

In [240]:
X

array([[   0, 2600],
       [   0, 3000],
       [   0, 3200],
       [   0, 3600],
       [   0, 4000],
       [   2, 2600],
       [   2, 2800],
       [   2, 3300],
       [   2, 3600],
       [   1, 2600],
       [   1, 2900],
       [   1, 3100],
       [   1, 3600]])

In [241]:
y = dfle.price.values
y

array([550000, 565000, 610000, 680000, 725000, 585000, 615000, 650000,
       710000, 575000, 600000, 620000, 695000])

Now use one hot encoder to create dummy variables for each of the towns

In [242]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop='first')

In [243]:
X_tmp = ohe.fit_transform(X[:,0].reshape(-1, 1)).toarray()
X_tmp

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.]])

In [244]:
X[:,1:]

array([[2600],
       [3000],
       [3200],
       [3600],
       [4000],
       [2600],
       [2800],
       [3300],
       [3600],
       [2600],
       [2900],
       [3100],
       [3600]])

In [245]:
X = np.hstack((X_tmp[:,:2],X[:,1:]))
X

array([[0.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 3.0e+03],
       [0.0e+00, 0.0e+00, 3.2e+03],
       [0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 1.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 2.9e+03],
       [1.0e+00, 0.0e+00, 3.1e+03],
       [1.0e+00, 0.0e+00, 3.6e+03]])

In [246]:
model.fit(X,y)

LinearRegression()


In [247]:
model.predict([[0,1,3400]]) # 3400 sqr ft home in west windsor

array([681241.6684584])

In [248]:
model.predict([[1,0,2800]]) # 2800 sq ft home in robinsville

array([590775.63964739])

In [249]:
dfle = df.copy()
ohe = OneHotEncoder(drop='first')
X = dfle[['town','area']].values
y = dfle.price.values
X_tmp = ohe.fit_transform(X[:,0].reshape(-1, 1)).toarray()
X = np.hstack((X_tmp[:,:2],X[:,1:]))
model.fit(X,y)
model.predict([[1,0,2800]]) # 2800 sq ft home in robinsville

array([590775.63964739])

In [250]:
ohe.inverse_transform( [[1,0]])

array([['robinsville']], dtype=object)

In [251]:
ohe.inverse_transform( [[0,0]]),ohe.inverse_transform( [[1,0]]),ohe.inverse_transform( [[0,1]]),ohe.inverse_transform( [[1,1]]) # Why does [1, 1] decode to robinsville?

(array([['monroe township']], dtype=object),
 array([['robinsville']], dtype=object),
 array([['west windsor']], dtype=object),
 array([['robinsville']], dtype=object))
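
The row `[1, 1]` is not a valid one-hot vector, yet it decodes to robinsville. A likely explanation, consistent with the output above, is that `inverse_transform` picks the position of the first maximum in each row and treats an all-zero row as the dropped category. The sketch below compares that argmax rule with the encoder's result; `kept` and `by_argmax` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

towns = np.array(
    [['monroe township'], ['robinsville'], ['west windsor']], dtype=object
)
ohe = OneHotEncoder(drop='first').fit(towns)

# Categories that still have a column after drop='first'
kept = ohe.categories_[0][1:]

row = np.array([[1, 1]])
# argmax returns the first position of the maximum, here column 0
by_argmax = kept[row.argmax(axis=1)]
decoded = ohe.inverse_transform(row)
print(by_argmax[0], decoded[0, 0])
```

Both values are `robinsville`, since the tie between the two columns is resolved in favour of the first one.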

<h2 style='color:purple'>Using sklearn ColumnTransformer</h2>

In [252]:
from sklearn.compose import ColumnTransformer 

In [253]:
dfle = df.copy()
dfle

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


In [254]:
X = dfle[['town','area']].values
y = dfle.price.values


In [255]:
# The last arg ([0]) is the list of columns you want to transform in this step
ct = ColumnTransformer([("town", OneHotEncoder(drop='first'),[0])], remainder="passthrough")  
X = ct.fit_transform(X) 


In [256]:
X

array([[0.0, 0.0, 2600],
       [0.0, 0.0, 3000],
       [0.0, 0.0, 3200],
       [0.0, 0.0, 3600],
       [0.0, 0.0, 4000],
       [0.0, 1.0, 2600],
       [0.0, 1.0, 2800],
       [0.0, 1.0, 3300],
       [0.0, 1.0, 3600],
       [1.0, 0.0, 2600],
       [1.0, 0.0, 2900],
       [1.0, 0.0, 3100],
       [1.0, 0.0, 3600]], dtype=object)

In [257]:
model.fit(X,y)


LinearRegression()


In [258]:
model.predict([[1,0,2800]]) # 2800 sq ft home in robinsville

array([590775.63964739])

In [259]:
categories  = ct.transformers_[0][1].categories_[0]
categories


array(['monroe township', 'robinsville', 'west windsor'], dtype=object)

In [260]:
tmp = ct.transform([['robinsville',2800]])
tmp

array([[1.0, 0.0, 2800]], dtype=object)

In [261]:
model.predict(tmp)

array([590775.63964739])

In [262]:
trans = ct.transformers_[0][1]
trans.inverse_transform([[1.0, 0.0]])

array([['robinsville']], dtype=object)
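
The encoding and the regression can also be chained so that raw rows go straight into `predict`. This is a sketch, not the notebook's own method: the data is rebuilt from the `X` and `y` arrays printed above rather than read from homeprices.csv, and the step names `preprocess` and `regress` are arbitrary.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Rows reconstructed from the arrays shown earlier in this notebook
homes = pd.DataFrame({
    'town': ['monroe township'] * 5 + ['west windsor'] * 4 + ['robinsville'] * 4,
    'area': [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    'price': [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})

# One-hot encode 'town' (dropping the first category), pass 'area' through
preprocess = ColumnTransformer(
    [('town', OneHotEncoder(drop='first'), ['town'])],
    remainder='passthrough',
)
pipe = Pipeline([('preprocess', preprocess), ('regress', LinearRegression())])
pipe.fit(homes[['town', 'area']], homes['price'])

# New data is passed in its raw form; the pipeline encodes it before predicting
new_home = pd.DataFrame({'town': ['robinsville'], 'area': [2800]})
prediction = pipe.predict(new_home)
print(prediction)
```

The prediction should agree with the value obtained step by step above, up to floating-point rounding.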

https://www.kaggle.com/code/ksvmuralidhar/columntransformer-pipeline-simplified/notebook

<h2 style='color:green'>Exercise</h2>

The Exercise folder contains carprices.csv.
This file has sale prices for 3 different car models. First plot the data points on a scatter plot
to see whether a linear regression model can be applied. If so, build a model that can answer
the following questions:

**1) Predict the price of a Mercedes Benz that is 4 years old with 45000 miles**

**2) Predict the price of a BMW X5 that is 7 years old with 86000 miles**

**3) Report the score of your model. (Hint: use LinearRegression().score(), which returns R², the coefficient of determination, rather than a classification accuracy)**

In [276]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


In [287]:

df = pd.read_csv("Exercise/carprices.csv")
df.columns = ['car_model', 'mileage', 'price', 'year']
df

Unnamed: 0,car_model,mileage,price,year
0,BMW X5,69000,18000,6
1,BMW X5,35000,34000,3
2,BMW X5,57000,26100,5
3,BMW X5,22500,40000,2
4,BMW X5,46000,31500,4
5,Audi A5,59000,29400,5
6,Audi A5,52000,32000,5
7,Audi A5,72000,19300,6
8,Audi A5,91000,12000,8
9,Mercedez Benz C class,67000,22000,6


In [288]:
X = df.drop('price',axis='columns')
y = df.price


In [292]:
ohe = OneHotEncoder()

In [302]:
transformed_data = ohe.fit_transform(X[['car_model']])
transformed_data

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 13 stored elements and shape (13, 3)>

In [307]:
ohe.get_feature_names_out(['car_model'])

array(['car_model_Audi A5', 'car_model_BMW X5',
       'car_model_Mercedez Benz C class'], dtype=object)

In [None]:

feature_names = ohe.get_feature_names_out(['car_model'])
df_ohe = pd.DataFrame(transformed_data.toarray(), columns=feature_names)
df_ohe.head()

# Optional: join with the original numeric columns
X_num = X.drop(columns=['car_model'])
df_encoded = pd.concat([X_num.reset_index(drop=True), df_ohe], axis=1)
df_encoded.head()

Unnamed: 0,mileage,year,car_model_Audi A5,car_model_BMW X5,car_model_Mercedez Benz C class
0,69000,6,0.0,1.0,0.0
1,35000,3,0.0,1.0,0.0
2,57000,5,0.0,1.0,0.0
3,22500,2,0.0,1.0,0.0
4,46000,4,0.0,1.0,0.0
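
One possible way to finish the exercise from here is sketched below. It uses only the ten rows displayed above, while the sparse matrix shape indicates the real file has 13, so the numbers it produces will differ from a solution on the full carprices.csv. The names `cars`, `queries`, and `X_new` are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Only the ten rows shown in the table above
cars = pd.DataFrame({
    'car_model': ['BMW X5'] * 5 + ['Audi A5'] * 4 + ['Mercedez Benz C class'],
    'mileage': [69000, 35000, 57000, 22500, 46000, 59000, 52000, 72000, 91000, 67000],
    'price': [18000, 34000, 26100, 40000, 31500, 29400, 32000, 19300, 12000, 22000],
    'year': [6, 3, 5, 2, 4, 5, 5, 6, 8, 6],
})

# Encode the model name and drop one dummy to avoid the dummy variable trap
X = pd.get_dummies(cars.drop(columns='price'), columns=['car_model'],
                   drop_first=True, dtype=float)
y = cars['price']
model = LinearRegression().fit(X, y)

# New rows must end up with exactly the training columns, in the same order
queries = pd.DataFrame({
    'car_model': ['Mercedez Benz C class', 'BMW X5'],
    'mileage': [45000, 86000],
    'year': [4, 7],
})
X_new = pd.get_dummies(queries, columns=['car_model'], dtype=float)
X_new = X_new.reindex(columns=X.columns, fill_value=0)

predictions = model.predict(X_new)
r2 = model.score(X, y)  # R² on the training data
print(predictions, r2)
```

The `reindex` step matters because `get_dummies` only creates columns for categories present in the frame it is given; a fitted `OneHotEncoder` avoids that bookkeeping.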
