
## Transformation - Exercise 4: transformacion_titanic.ipynb

In this exercise you will learn how to apply basic transformations to the variables of a dataset to prepare it for a machine learning model.

We will work with the classic Titanic dataset and apply normalization, standardization, and categorical-variable encoding.

### Objectives
- Apply Min-Max and Z-score scaling to numeric variables.
- Apply one-hot encoding and label encoding to categorical variables.
- Identify which variables need to be transformed, and how.

### Dataset description
We will use the publicly available Titanic dataset. It contains passenger information such as age, sex, ticket class, fare paid, and whether or not each passenger survived.

In [1]:
# Import the required libraries
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Load the Titanic dataset from GitHub
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### 1. Selecting the variables to transform

In [9]:
# Select the numeric and categorical columns by dtype
# ('number' matches every integer and float dtype, independent of platform)
vars_numericas = df.select_dtypes(include='number').columns
vars_categoricas = df.select_dtypes(include=['object', 'category', 'boolean']).columns

In [10]:
vars_numericas

Index(['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')
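Selecting by dtype alone also picks up `PassengerId` (an identifier) and `Survived` (the prediction target in most uses of this dataset), which are usually not scaled as features. The following is a minimal sketch of narrowing the list, assuming those two columns should be left untouched. The names `sample`, `excluded`, and `numeric_features` are illustrative and not part of the notebook, and the small frame stands in for `df`.

```python
import pandas as pd

# Small stand-in for the Titanic frame (values from the first rows shown above)
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

# All integer and float columns
numeric_cols = sample.select_dtypes(include="number").columns

# Assumption: the identifier and the target are not features to be scaled
excluded = ["PassengerId", "Survived"]
numeric_features = [c for c in numeric_cols if c not in excluded]

print(numeric_features)
```

Whether `Pclass` should be scaled as a number or encoded as a category is a modelling decision; this sketch leaves it in the numeric list.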

In [11]:
# Percentage of missing values per column
df.isnull().mean().sort_values(ascending=False) * 100

| Column | Missing values (%) |
|---|---|
| Cabin | 77.104377 |
| Age | 19.86532 |
| Embarked | 0.224467 |
| PassengerId | 0.0 |
| Name | 0.0 |
| Pclass | 0.0 |
| Survived | 0.0 |
| Sex | 0.0 |
| Parch | 0.0 |
| SibSp | 0.0 |


In [12]:
# Impute missing values before transforming
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Cabin has too many missing values, so we drop it
df = df.drop("Cabin", axis = 1)
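To make the imputation step concrete, here is a small sketch on toy data (the frame `toy` and its values are illustrative, not taken from the full dataset). It fills a numeric column with its median and a categorical column with its most frequent value, as the cell above does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 2.0],
    "Embarked": ["S", "C", "S", None, "Q"],
})

# median() ignores missing values: the median of 2, 22, 35, 38 is 28.5
age_median = toy["Age"].median()

# mode() returns a Series (there may be ties); [0] takes the first entry
embarked_mode = toy["Embarked"].mode()[0]

toy["Age"] = toy["Age"].fillna(age_median)
toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)

print(age_median, embarked_mode)
print(toy.isnull().sum())
```

The median is a common choice for skewed numeric columns because, unlike the mean, it is not pulled towards extreme values.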

In [13]:
# Check that no missing values remain
df.isnull().mean().sort_values(ascending=False) * 100

| Column | Missing values (%) |
|---|---|
| PassengerId | 0.0 |
| Survived | 0.0 |
| Pclass | 0.0 |
| Name | 0.0 |
| Sex | 0.0 |
| Age | 0.0 |
| SibSp | 0.0 |
| Parch | 0.0 |
| Ticket | 0.0 |
| Fare | 0.0 |


In [14]:
df.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | 28.0 | 0 | 0 | 330877 | 8.4583 | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | C |


### 2. Normalization (Min-Max Scaling)

In [15]:
# Apply Min-Max scaling to the numeric variables
scaler_minmax = MinMaxScaler()
df_minmax = df.copy()
df_minmax[vars_numericas] = scaler_minmax.fit_transform(df[vars_numericas])
df_minmax[vars_numericas].describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.5 | 0.383838 | 0.654321 | 0.363679 | 0.065376 | 0.063599 | 0.062858 |
| std | 0.289162 | 0.486592 | 0.418036 | 0.163605 | 0.137843 | 0.134343 | 0.096995 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.25 | 0.0 | 0.5 | 0.271174 | 0.0 | 0.0 | 0.01544 |
| 50% | 0.5 | 0.0 | 1.0 | 0.346569 | 0.0 | 0.0 | 0.028213 |
| 75% | 0.75 | 1.0 | 1.0 | 0.434531 | 0.125 | 0.0 | 0.060508 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
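Min-Max scaling maps each column to the range [0, 1] using `x' = (x - min) / (max - min)`, which is why every `min` above is 0 and every `max` is 1. As a quick check, the sketch below computes the formula by hand on a few ages from the rows shown earlier and compares it with `MinMaxScaler` (the frame `ages` is a small stand-in, not the full dataset).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

ages = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0]})

# Manual Min-Max: subtract the column minimum, divide by the column range
manual = (ages - ages.min()) / (ages.max() - ages.min())

# Same transformation with scikit-learn (default feature_range is (0, 1))
scaled = MinMaxScaler().fit_transform(ages)

print(np.allclose(manual.to_numpy(), scaled))
```

Because the result depends only on the minimum and maximum, a single extreme value (such as a very high fare) compresses all other values towards 0.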


### 3. Standardization (StandardScaler)

In [16]:
# Apply StandardScaler (Z-score) to the numeric variables
scaler_std = StandardScaler()
df_std = df.copy()
df_std[vars_numericas] = scaler_std.fit_transform(df[vars_numericas])
df_std[vars_numericas].describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 6.379733e-17 | 3.987333e-17 | -8.772133e-17 | 2.27278e-16 | 4.386066e-17 | 5.3829e-17 | 3.987333e-18 |
| std | 1.000562 | 1.000562 | 1.000562 | 1.000562 | 1.000562 | 1.000562 | 1.000562 |
| min | -1.730108 | -0.7892723 | -1.566107 | -2.224156 | -0.4745452 | -0.4736736 | -0.6484217 |
| 25% | -0.865054 | -0.7892723 | -0.3693648 | -0.5657365 | -0.4745452 | -0.4736736 | -0.4891482 |
| 50% | 0.0 | -0.7892723 | 0.8273772 | -0.1046374 | -0.4745452 | -0.4736736 | -0.3573909 |
| 75% | 0.865054 | 1.26699 | 0.8273772 | 0.4333115 | 0.4327934 | -0.4736736 | -0.02424635 |
| max | 1.730108 | 1.26699 | 0.8273772 | 3.891554 | 6.784163 | 6.974147 | 9.667167 |
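The means above are zero up to floating-point error, but `std` reads 1.000562 rather than exactly 1. `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `describe()` reports the sample standard deviation (`ddof=1`); the two differ by a factor of `sqrt(n / (n - 1))`, which for n = 891 is about 1.000562. The sketch below reproduces this on a small stand-in frame (`fares` is illustrative).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

fares = pd.DataFrame({"Fare": [7.25, 71.2833, 7.925, 53.1, 8.05]})

# Manual Z-score using the population standard deviation (ddof=0)
manual = (fares - fares.mean()) / fares.std(ddof=0)
scaled = StandardScaler().fit_transform(fares)
print(np.allclose(manual.to_numpy(), scaled))

# pandas defaults to ddof=1, so the scaled column reports sqrt(n / (n - 1))
n = len(fares)
print(manual["Fare"].std(), np.sqrt(n / (n - 1)))

# The same factor for the 891 rows of the Titanic dataset
print(np.sqrt(891 / 890))
```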


### 4. Encoding categorical variables

In [17]:
# Label encoding for 'Sex'
le = LabelEncoder()
df['Sex_encoded'] = le.fit_transform(df['Sex'])
df[['Sex', 'Sex_encoded']].head()

| | Sex | Sex_encoded |
|---|---|---|
| 0 | male | 1 |
| 1 | female | 0 |
| 2 | female | 0 |
| 3 | female | 0 |
| 4 | male | 1 |
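`LabelEncoder` assigns integer codes in the sorted order of the labels, which is why `female` becomes 0 and `male` becomes 1. The sketch below shows how to inspect that mapping and reverse it (the variable names `codes` and `mapping` are illustrative). Note that scikit-learn documents `LabelEncoder` as intended for target labels; for input features with more than two categories, `OrdinalEncoder` or one-hot encoding is generally the safer choice, since integer codes imply an order.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["male", "female", "female", "female", "male"])

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = list(le.inverse_transform(codes))
print(restored)
```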


In [None]:
df.head(5)

In [18]:
# One-hot encoding for 'Embarked'
df_encoded = pd.get_dummies(df, columns=['Embarked'])

df_encoded[['Embarked_S', 'Embarked_Q', 'Embarked_C']].head(5)

| | Embarked_S | Embarked_Q | Embarked_C |
|---|---|---|---|
| 0 | True | False | False |
| 1 | False | False | True |
| 2 | True | False | False |
| 3 | True | False | False |
| 4 | True | False | False |
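Two optional `pd.get_dummies` parameters are often useful here. `dtype=int` produces 0/1 columns instead of booleans, and `drop_first=True` drops the first category, since it is implied when all remaining columns are 0 (this matters mainly for linear models). A minimal sketch on a stand-in frame (`ports` is illustrative):

```python
import pandas as pd

ports = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})

# Categories are ordered C, Q, S; drop_first=True removes 'Embarked_C'
dummies = pd.get_dummies(ports, columns=["Embarked"], drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

A passenger who embarked at `C` is then represented by zeros in both remaining columns.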


### Conclusion
We have learned how to apply basic data-preparation transformations:
- Scaling with MinMaxScaler and StandardScaler
- Encoding with LabelEncoder and one-hot encoding

These transformations are needed before feeding data to machine learning models that cannot process categorical variables directly or that are sensitive to the scale of numeric features.
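As a possible next step, the individual transformations can be combined into a single preprocessing object that is fitted on the training rows only and then reused on new data. The following is a sketch under stated assumptions, not the method used in this notebook: it uses scikit-learn's `ColumnTransformer`, `Pipeline`, `SimpleImputer`, and `OneHotEncoder`, and a ten-row stand-in frame `data` built from the rows shown earlier, with one age left missing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Stand-in data (assumption: only these four columns are used as features)
data = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None, 54.0, 2.0, 27.0, 14.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708],
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "male", "female", "female"],
    "Embarked": ["S", "C", "S", "S", "S", "Q", "S", "S", "S", "C"],
})

numeric = ["Age", "Fare"]
categorical = ["Sex", "Embarked"]

# Numeric columns: median imputation, then Z-score scaling
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_steps, numeric),
        # Categories unseen during fit are encoded as all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0,  # always return a dense array
)

train, test = train_test_split(data, test_size=0.3, random_state=0)

# Statistics (median, mean, std, categories) come from the training rows only
X_train = preprocess.fit_transform(train)
X_test = preprocess.transform(test)

print(X_train.shape, X_test.shape)
```

Fitting on the training split and only calling `transform` on the test split keeps information from the test rows out of the learned statistics.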