# Titanic: Advanced Workflow



<p align="center">
  <img src="separador.png" alt="Separator" width="300"/>
</p>





## Goal
- Build a **competitive submission**
- Stable, reliable validation
- Expected target: **≥ 0.80 accuracy**

---

## Overall strategy
- New notebook from scratch
- Clean pipeline
- Validation first, Kaggle second
- Optimize deliberately, not by trial and error

---

## 1️⃣ Initial setup
- Import the standard libraries (pandas, numpy, sklearn)
- Set a global `random_state`
- Load `train.csv` and `test.csv`

---

## 2️⃣ Data preparation
- Separate `y = Survived`
- Concatenate `train + test` **only for feature engineering**
- Keep the indices so the two sets can be split again at the end
- Do not touch the target again

---

## 3️⃣ Feature Engineering (core)

### Numeric
- `Age`: impute by group (`Sex`, `Pclass`)
- `Fare`: impute with the median

### Categorical
- `Sex`: binary
- `Embarked`: impute with the mode + one-hot
- `Pclass`: one-hot (not ordinal)

### New features (key)
- `FamilySize = SibSp + Parch + 1`
- `IsAlone = FamilySize == 1`
- `Title` (extracted from `Name`)
  - Group rare titles → `Rare`

### Optional (if it helps)
- `Fare_per_person = Fare / FamilySize`
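The optional feature is not computed later in this notebook. A minimal sketch of how it could be derived, using a small made-up frame (`df` and its values are hypothetical stand-ins for the combined dataset):

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would use the combined frame.
df = pd.DataFrame({
    "Fare": [7.25, 71.28, 30.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 2],
})

df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
# FamilySize is always >= 1 by construction, so the division cannot hit zero.
df["Fare_per_person"] = df["Fare"] / df["FamilySize"]
```

Whether this feature actually improves the CV score would have to be checked with the same validation setup as the other features.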

---

## 4️⃣ Encoding and final dataset
- One-hot encode once (train + test together)
- Align columns
- Split back into train / test
- Final dataset ready for modeling
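Encoding the combined frame once gives both splits identical columns automatically. If the two splits were encoded separately instead, an explicit alignment step would be needed. A sketch of that alternative on toy data (`train_part` and `test_part` are hypothetical names):

```python
import pandas as pd

# Toy splits: the test split is missing the "C" and "Q" categories.
train_part = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
test_part = pd.DataFrame({"Embarked": ["S", "S"]})

train_enc = pd.get_dummies(train_part, columns=["Embarked"])
test_enc = pd.get_dummies(test_part, columns=["Embarked"])

# Reindex test to the training columns; categories absent from test are
# filled with False so both frames share the same layout.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=False)
```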

---

## 5️⃣ Validation
- `StratifiedKFold` (5 folds)
- Metric: **accuracy**
- Compare models by:
  - mean CV score
  - stability (std)
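The notebook implements this with an explicit loop over folds. As a more compact equivalent, scikit-learn's `cross_val_score` returns the per-fold scores directly. The sketch below uses synthetic data as a stand-in for the Titanic features (`X_demo`, `y_demo` and `skf_demo` are hypothetical names):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real feature matrix (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=11)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
model = LogisticRegression(max_iter=1000)

# One accuracy value per fold.
scores = cross_val_score(model, X_demo, y_demo, cv=skf_demo, scoring="accuracy")
cv_mean, cv_std = scores.mean(), scores.std()
print(f"accuracy: {cv_mean:.3f} +/- {cv_std:.3f}")
```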

---

## 6️⃣ Models to evaluate
Try only these:

1. **Logistic Regression**
   - low / medium `C`
   - high `max_iter`

2. **Random Forest**
   - `n_estimators`: 300–500
   - `max_depth`: limited
   - `min_samples_leaf` > 1

3. **GradientBoostingClassifier**
   - No aggressive tuning

Choose the model with the best mean CV score and the lowest variance.

---

## 7️⃣ Final training
- Retrain the winning model on **the full training set**
- Predict on `test.csv`

---

## 8️⃣ Submission
- Columns:
  - `PassengerId`
  - `Survived`
- Standard Kaggle format

---





<p align="center">
  <img src="separador.png" alt="Separator" width="300"/>
</p>



In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score


In [2]:
# Fixed seed for reproducibility
RANDOM_STATE = 11

In [3]:
# Load the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [4]:
# Separate the target from the rest of the features
y = train["Survived"]
train = train.drop(columns=["Survived"])

In [5]:
# Keep each frame's index so we remember which rows of "full" came from train and which from test.
# Note: the split further down relies on row order (len(train)), not on these indices.
train_ids = train.index
test_ids = test.index

In [6]:
# Concatenate train and test for feature engineering.
full = pd.concat([train, test], axis=0).reset_index(drop=True)

In [7]:
# Quick checkpoint:
full.shape, train.shape, test.shape

((1309, 11), (891, 11), (418, 11))

In [8]:
# All good

<p align="center">
  <img src="separador.png" alt="Separator" width="300"/>
</p>



In [9]:
# Family features.
full["FamilySize"] = full["SibSp"] + full["Parch"] + 1
full["IsAlone"] = (full["FamilySize"] == 1).astype(int)


In [10]:
# Extract and group titles from Name

# Extract the title: the text between the comma and the first period
full["Title"] = full["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)


In [11]:
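# Group rare titles
# Note: the regex above captures "the Countess" (with the article), so the
# "Countess" entry below does not match it and that title stays ungrouped.
rare_titles = ["Lady", "Master", "Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer", "Dona"]

full["Title"] = full["Title"].replace(rare_titles, "Rare")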
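A quick check of the extraction regex on two names in the dataset's "Surname, Title. Given names" format shows why a `Title_the Countess` column appears in the encoded output further down. The sketch also shows one possible adjustment to the rare-title list. `rare_titles_adjusted` is a hypothetical name, and using it would change the feature set and therefore the recorded scores:

```python
import pandas as pd

# Two illustrative names in the "Surname, Title. Given names" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Same pattern as the notebook: everything between ", " and the first period.
titles = names.str.extract(r",\s*([^\.]+)\.", expand=False)

# Hypothetical adjusted list: "the Countess" replaces "Countess" so it matches.
rare_titles_adjusted = ["Lady", "Master", "the Countess", "Capt", "Col", "Don",
                        "Dr", "Major", "Rev", "Sir", "Jonkheer", "Dona"]
grouped = titles.replace(rare_titles_adjusted, "Rare")
```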

In [12]:
# Normalize equivalent titles to their common form
full["Title"]=full["Title"].replace({
    "Mlle":"Miss",
    "Ms":"Miss",
    "Mme":"Mrs"
})

In [13]:
# Impute missing values (NaN) in "Age" in a group-aware way:
# fill each missing age with the median age of passengers of the same sex and class.

full["Age"] = full.groupby(["Sex", "Pclass"])["Age"].transform(
    lambda x: x.fillna(x.median())
)

In [14]:
# Now handle "Fare" (median) and "Embarked" (mode)
full["Fare"] = full["Fare"].fillna(full["Fare"].median())
full["Embarked"] = full["Embarked"].fillna(full["Embarked"].mode()[0])

<p align="center">
  <img src="separador.png" alt="Separator" width="300"/>
</p>



In [15]:
# Drop columns we will not use
full=full.drop(columns=[
    "Name",
    "Ticket",
    "Cabin"
])

In [16]:
# Final encoding
# One-hot encode each categorical column into several binary columns
full = pd.get_dummies(
    full,
    columns=["Sex","Embarked","Title","Pclass"],
    drop_first=True
)

In [17]:
full[:1]

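|   | PassengerId | Age | SibSp | Parch | Fare | FamilySize | IsAlone | Sex_male | Embarked_Q | Embarked_S | Title_Mr | Title_Mrs | Title_Rare | Title_the Countess | Pclass_2 | Pclass_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 22.0 | 1 | 0 | 7.25 | 2 | 0 | True | False | True | True | False | False | False | False | True |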


In [18]:
# Split back into train / test
# "full" (train + test) was concatenated for the transformations; the first
# len(train) rows are train and the remaining rows are test.
X = full.iloc[:len(train)]
X_test = full.iloc[len(train):]

In [19]:
X.shape, X_test.shape

((891, 16), (418, 16))
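Note that `PassengerId` is still one of the 16 columns at this point (it appears in the row preview above), so the models below receive a row identifier as a feature. A minimal sketch of excluding it before modeling, on a toy frame (`X_demo` and `feature_cols` are hypothetical names). The scores recorded in this notebook were obtained with `PassengerId` included, so applying this would change them:

```python
import pandas as pd

# Toy stand-in for the encoded feature frame.
X_demo = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Age": [22.0, 38.0, 26.0],
    "Sex_male": [True, False, False],
})

# Keep every column except the identifier.
feature_cols = [c for c in X_demo.columns if c != "PassengerId"]
X_features = X_demo[feature_cols]
```

The identifier is still needed for the submission file, which the notebook reads from `test["PassengerId"]` rather than from the feature matrix.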

<p align="center">
  <img src="separador.png" alt="Separator" width="300"/>
</p>



In [20]:
# Validation + models

In [21]:
# Validation setup (Stratified K-Fold)
skf = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=RANDOM_STATE
)

In [22]:
# Model 1: Logistic Regression. This will be a strong baseline
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(
    C=1.0,
    max_iter=1000,
    random_state=RANDOM_STATE
)


In [23]:
lr_scores = []

for train_idx, val_idx in skf.split(X, y):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    lr.fit(X_train, y_train)
    preds = lr.predict(X_val)

    lr_scores.append(accuracy_score(y_val, preds))

np.mean(lr_scores), np.std(lr_scores)


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

(np.float64(0.824882304940054), np.float64(0.03447877743329289))
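The convergence warnings above suggest scaling the data. One way to do that without leaking validation statistics is to put the scaler inside a pipeline, so it is refit on the training part of each fold. The sketch below uses synthetic data with one deliberately inflated column. `X_demo`, `y_demo` and `scaled_lr` are hypothetical names, and this was not run on the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=11)
# Inflate one column to mimic an unscaled feature with a large range.
X_demo[:, 0] *= 1000.0

# The scaler is fit only on the training part of each fold.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, max_iter=1000),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
scaled_scores = cross_val_score(scaled_lr, X_demo, y_demo, cv=cv, scoring="accuracy")
print(scaled_scores.mean(), scaled_scores.std())
```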

In [24]:
# Modelo 2: Random Forest
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=400,
    max_depth=8,
    min_samples_leaf=2,
    random_state=RANDOM_STATE
)


In [25]:
rf_scores = []

for train_idx, val_idx in skf.split(X, y):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    rf.fit(X_train, y_train)
    preds = rf.predict(X_val)

    rf_scores.append(accuracy_score(y_val, preds))

np.mean(rf_scores), np.std(rf_scores)


(np.float64(0.8204067541271736), np.float64(0.028312941935997652))

In [27]:
# Model 3: Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=3,
    random_state=RANDOM_STATE
)


In [28]:
gb_scores = []

for train_idx, val_idx in skf.split(X, y):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    gb.fit(X_train, y_train)
    preds = gb.predict(X_val)

    gb_scores.append(accuracy_score(y_val, preds))

np.mean(gb_scores), np.std(gb_scores)


(np.float64(0.814776222459356), np.float64(0.04232751877630252))

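### We choose model 2, Random Forest:

Its mean CV accuracy (0.820) is slightly below that of Logistic Regression (0.825), but it has the lowest standard deviation of the three models (0.028), which makes it the most stable candidate.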
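The three recorded results can be compared side by side. The sketch below copies the means and standard deviations from the cell outputs above, rounded to four decimals (`cv_results` is a hypothetical name):

```python
# (mean accuracy, std) per model, rounded from the recorded cell outputs.
cv_results = {
    "Logistic Regression": (0.8249, 0.0345),
    "Random Forest": (0.8204, 0.0283),
    "Gradient Boosting": (0.8148, 0.0423),
}

best_mean = max(cv_results, key=lambda name: cv_results[name][0])
most_stable = min(cv_results, key=lambda name: cv_results[name][1])

for name, (mean, std) in cv_results.items():
    print(f"{name:<20} mean={mean:.4f} std={std:.4f}")
print("highest mean:", best_mean)
print("lowest std:  ", most_stable)
```

Logistic Regression has the highest mean and Random Forest the lowest standard deviation, so the two selection criteria from the plan point to different models. The gaps between the means are also smaller than any of the standard deviations.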

<p align="center">
  <img src="separador.png" alt="Separator" width="300"/>
</p>


In [39]:
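# Final training
# Note: this cell fits a GradientBoostingClassifier with n_estimators=300,
# not the Random Forest selected above, and not the 200-estimator
# configuration that was cross-validated.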
final_model = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,
    random_state=RANDOM_STATE
)

final_model.fit(X, y)
test_preds = final_model.predict(X_test)


    

In [40]:
# Predict on the test set
test_preds = final_model.predict(X_test)

In [41]:
# Build the Kaggle submission in the format the competition requires
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": test_preds
})

In [42]:
submission.to_csv("submission_4_advanced.csv", index=False)


> ⚠️ **Important note**
>
> Locally we get a score of 0.81477 (the Gradient Boosting CV mean).
>
> We would expect roughly the same from Kaggle...
>
> ...but Kaggle scores the submission at 0.77.

# RESULT? NOT THERE YET!

0.77 is a decent result, but we have not outdone ourselves...