# Classification Model Training - Passenger Satisfaction

This notebook walks through training a Machine Learning model (XGBoost) to predict passenger satisfaction.

## 1. Library Installation and Data Loading

In [1]:
import os
import pandas as pd

# Robust path handling
file_name = 'airline_passenger_satisfaction.csv'
data_path = file_name
if not os.path.exists(data_path):
    # Look in the data folder (local project layout)
    possible_path = os.path.join('..', 'data', file_name)
    if os.path.exists(possible_path):
        data_path = possible_path
    else:
        try:
            # In Google Colab, ask the user to upload the file
            from google.colab import files
            print("Upload the file:")
            uploaded = files.upload()
            data_path = file_name
        except ImportError:
            # Outside Colab there is no fallback, so stop with a clear error
            raise FileNotFoundError(f"File not found: {file_name}") from None

df = pd.read_csv(data_path)
df.head()


Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,...,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,...,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,...,5,4,3,4,4,4,5,0,0.0,satisfied
3,3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,...,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,...,3,3,4,4,3,3,3,0,0.0,satisfied
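
The path lookup in the cell above can also be written as a small reusable helper. The sketch below is an illustration and not part of the original notebook: `resolve_data_path` and its `search_dirs` parameter are hypothetical names, and the default directories mirror the two locations the cell checks.

```python
from pathlib import Path


def resolve_data_path(file_name, search_dirs=(".", "../data")):
    """Return the first existing path to file_name in search_dirs.

    Hypothetical helper: the name and the default directories are
    assumptions that mirror the lookup order used in the cell above.
    """
    for directory in search_dirs:
        candidate = Path(directory) / file_name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(
        f"{file_name} not found in any of: {', '.join(map(str, search_dirs))}"
    )
```

With this helper, the loading step would reduce to `pd.read_csv(resolve_data_path('airline_passenger_satisfaction.csv'))`, and a missing file fails immediately with a message that lists the directories that were searched.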


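## 2. Data Preprocessing
We prepare the data for the model: fill missing arrival delays, separate the features from the target, encode the target, and identify the categorical and numerical columns.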

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# 1. Fill missing arrival delays with the departure delay of the same flight
df['Arrival Delay in Minutes'] = df['Arrival Delay in Minutes'].fillna(df['Departure Delay in Minutes'])

# 2. Separate features (X) and target (y)
X = df.drop('satisfaction', axis=1)
y = df['satisfaction']

# 3. Encode the target variable. LabelEncoder sorts classes alphabetically,
#    so 'neutral or dissatisfied' becomes 0 and 'satisfied' becomes 1
le = LabelEncoder()
y = le.fit_transform(y)

# 4. Identify column types
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f"Categorical variables: {categorical_cols}")
print(f"Numerical variables: {numerical_cols}")

Categorical variables: ['Gender', 'Customer Type', 'Type of Travel', 'Class']
Numerical variables: ['Unnamed: 0', 'id', 'Age', 'Flight Distance', 'Inflight wifi service', 'Departure/Arrival time convenient', 'Ease of Online booking', 'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort', 'Inflight entertainment', 'On-board service', 'Leg room service', 'Baggage handling', 'Checkin service', 'Inflight service', 'Cleanliness', 'Departure Delay in Minutes', 'Arrival Delay in Minutes']
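
The printed list of numerical variables includes `Unnamed: 0` and `id`. These look like a row index and a passenger identifier rather than survey features, yet they are passed to the model as inputs. The sketch below shows one way to exclude such columns before detecting column types. The helper name `split_feature_types` and the `id_cols` default are assumptions for illustration, and the metrics reported later in this notebook were obtained without this step.

```python
import pandas as pd


def split_feature_types(X, id_cols=("Unnamed: 0", "id")):
    """Drop identifier-like columns, then list categorical and numerical columns.

    Hypothetical helper: id_cols is an assumption based on the column
    names printed above, not something the dataset documents.
    """
    # errors="ignore" keeps the call safe if a column is already absent
    X = X.drop(columns=list(id_cols), errors="ignore")
    categorical_cols = X.select_dtypes(include=["object"]).columns.tolist()
    numerical_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
    return X, categorical_cols, numerical_cols


# Toy frame with the same kinds of columns as the real dataset
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "id": [70172, 5047],
    # Explicit object dtype, since select_dtypes above filters on it
    "Gender": pd.Series(["Male", "Female"], dtype="object"),
    "Age": [13, 25],
    "Arrival Delay in Minutes": [18.0, 6.0],
})
toy_X, toy_cat, toy_num = split_feature_types(toy)
print(toy_cat)
print(toy_num)
```

If this step were adopted, the model would need to be retrained and re-evaluated, since the feature set changes.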


## 3. Pipeline Creation and Training

In [3]:
from xgboost import XGBClassifier

# Numeric pipeline: impute missing values with the median, then standardize
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline: impute missing values with the mode, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)

# Define the XGBoost model
model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=6,
    random_state=42,
    eval_metric='logloss'
)

# Full pipeline: preprocessing followed by the classifier
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', model)
])

# Train/test split (80/20), stratified to preserve the class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Training model...")
clf.fit(X_train, y_train)
print("Training complete!")

Training model...
Training complete!
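
One detail of the preprocessing above is `handle_unknown='ignore'` in the `OneHotEncoder`: a category that was not seen during `fit` is encoded as all zeros instead of raising an error at prediction time. A minimal sketch on toy data (the category values below are made up for illustration):

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on two known categories (toy values, not taken from the dataset)
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["Business"], ["Eco"]])

# "First" was never seen during fit, so its row is all zeros
encoded = encoder.transform([["Eco"], ["First"]]).toarray()
print(encoder.categories_[0].tolist())  # ['Business', 'Eco']
print(encoded.tolist())                 # [[0.0, 1.0], [0.0, 0.0]]
```

This matters for a deployed model, where incoming requests may contain values that never appeared in the training split.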


## 4. Performance Evaluation

In [4]:
from sklearn.metrics import classification_report

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
# Gap between training and test accuracy, in percentage points
overfitting = (train_acc - test_acc) * 100

print(f"Training accuracy: {train_acc:.4f}")
print(f"Test accuracy: {test_acc:.4f}")
print(f"Overfitting: {overfitting:.2f}%")

y_pred = clf.predict(X_test)
print("\n--- Classification Report ---")
print(classification_report(y_test, y_pred, target_names=le.classes_))

Training accuracy: 0.9738
Test accuracy: 0.9652
Overfitting: 0.86%

--- Classification Report ---
                         precision    recall  f1-score   support

neutral or dissatisfied       0.96      0.98      0.97     11776
              satisfied       0.97      0.95      0.96      9005

               accuracy                           0.97     20781
              macro avg       0.97      0.96      0.96     20781
           weighted avg       0.97      0.97      0.97     20781
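
To put the 0.9652 test accuracy in context, it helps to compare it with a trivial baseline that always predicts the majority class. The support column above gives the class counts in the test set, so the baseline follows from simple arithmetic. The helper name `majority_baseline_accuracy` is an illustrative assumption.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


# Test-set support values copied from the classification report above
support = {"neutral or dissatisfied": 11776, "satisfied": 9005}
baseline = majority_baseline_accuracy(support)
print(f"Majority-class baseline accuracy: {baseline:.4f}")
```

The baseline comes out at roughly 0.567, so the model improves on always predicting `neutral or dissatisfied` by about 40 percentage points.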



## 5. Exporting the Model
We save the trained pipeline and the label encoder so the web application can load them.

In [None]:
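import joblib

joblib.dump(clf, 'airline_model.joblib')
joblib.dump(le, 'label_encoder.joblib')
print("Models saved as 'airline_model.joblib' and 'label_encoder.joblib'")

# Download the files to your computer (only available in Google Colab)
try:
    from google.colab import files
    files.download('airline_model.joblib')
    files.download('label_encoder.joblib')
except ImportError:
    print("Not running in Colab; the files are in the working directory.")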
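
On the web application side, the two saved files would be loaded back with `joblib.load`, and numeric predictions mapped to the original labels with `LabelEncoder.inverse_transform`. The sketch below demonstrates that round trip on a small stand-in model so that it runs without the dataset. The `LogisticRegression` classifier, the toy data, and the temporary file names are assumptions for illustration; the application itself would load `airline_model.joblib` and `label_encoder.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Stand-in model and encoder (toy data, not the airline dataset)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
labels = ["neutral or dissatisfied", "neutral or dissatisfied", "satisfied", "satisfied"]
toy_le = LabelEncoder()
toy_model = LogisticRegression().fit(X_toy, toy_le.fit_transform(labels))

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")
    encoder_path = os.path.join(tmp_dir, "label_encoder.joblib")
    joblib.dump(toy_model, model_path)
    joblib.dump(toy_le, encoder_path)

    # This is the part a consuming application would run
    loaded_model = joblib.load(model_path)
    loaded_le = joblib.load(encoder_path)

# Map numeric predictions (0/1) back to the original string labels
predicted = loaded_le.inverse_transform(loaded_model.predict(X_toy))
print(predicted.tolist())
```

Because joblib files are pickle-based, loading one can execute arbitrary code, so the application should only load model files from a trusted source.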