## **PPG11_s2 - Practical Exercise: Did They Survive the Titanic?**

**Context**
The sinking of the RMS Titanic is one of the most well-known maritime disasters in history. In this exercise, you will work with a real dataset containing information about Titanic passengers. You will use **logistic regression** to predict whether a passenger survived or not based on a few basic characteristics.

**Exercise Objective**
Build a **logistic regression model** to predict the `Survived` variable, which indicates whether a passenger survived (`1`) or not (`0`), using the following predictors:

- `Sex`
- `Age`
- `Pclass` (Passenger class: 1st, 2nd, or 3rd)



**Suggested Steps**



1. **Import the necessary libraries**: pandas, scikit-learn, etc.


In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

2. **Load the dataset** from the following URL:
   ```
   https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
   ```


In [12]:
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# View the first rows of the DataFrame
display(df.head())

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


3. **Select the relevant variables**: `Survived`, `Sex`, `Age`, `Pclass`.


In [13]:
df = df[["Survived", "Sex", "Age", "Pclass"]]

4. **Clean the data**:
   - Drop rows with missing values.
   - Encode the `Sex` variable as numeric (e.g., 0 = female, 1 = male).


In [14]:
# Drop rows with missing values; .copy() avoids a SettingWithCopyWarning
# when the 'Sex' column is reassigned below
df = df.dropna().copy()

# Encode the 'Sex' variable: female = 0, male = 1
df["Sex"] = df["Sex"].map({"female": 0, "male": 1})

In [15]:
display(df.head())

|   | Survived | Sex | Age | Pclass |
|---|---|---|---|---|
| 0 | 0 | 1 | 22.0 | 3 |
| 1 | 1 | 0 | 38.0 | 1 |
| 2 | 1 | 0 | 26.0 | 3 |
| 3 | 1 | 0 | 35.0 | 1 |
| 4 | 0 | 1 | 35.0 | 3 |


5. **Split the data** into training and test sets.


In [16]:
# Separate the independent variables (X) from the dependent variable (y)
X = df[["Sex", "Age", "Pclass"]]
y = df["Survived"]

In [17]:
# Split into training and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes to confirm the split
print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)

Training set size: (571, 3)
Test set size: (143, 3)
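
The sizes above follow from the 714 rows left after cleaning (571 + 143): 20% of 714 is 142.8, and scikit-learn rounds the test count up to 143. The sketch below reproduces this with placeholder arrays; `X_demo` and `y_demo` are hypothetical stand-ins for the real data, used only to check the shapes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows and columns as the cleaned dataset
n_rows = 714  # 571 training rows + 143 test rows
X_demo = np.zeros((n_rows, 3))
y_demo = np.zeros(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("Training shape:", X_tr.shape)
print("Test shape:", X_te.shape)
```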


6. **Train a logistic regression model** using the training set.


In [20]:
# Step 6: Train the model
model = LogisticRegression()
model.fit(X_train, y_train)
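
As a rough sketch of what the fitted model computes: logistic regression forms a weighted sum of the predictors and passes it through the sigmoid function to obtain a survival probability. The intercept and coefficients below are made-up illustrative values, not the ones learned by `model`.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters, chosen for illustration only
intercept = 5.0
coefs = {"Sex": -2.5, "Age": -0.04, "Pclass": -1.2}

# Example passenger: male (Sex = 1), 30 years old, 3rd class
passenger = {"Sex": 1, "Age": 30.0, "Pclass": 3}

# Linear combination of predictors (the log-odds of survival)
z = intercept + sum(coefs[name] * passenger[name] for name in coefs)
prob = sigmoid(z)

# Class 1 is predicted when the probability exceeds 0.5
pred = int(prob > 0.5)
print(f"log-odds = {z:.2f}, probability = {prob:.3f}, prediction = {pred}")
```

With these illustrative numbers the log-odds are negative, so the probability falls below 0.5 and the passenger is classified as not surviving.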

7. **Make predictions** using the model on the test set.


In [23]:
# Step 7: Make predictions on the test set
y_pred = model.predict(X_test)

# Show the first 25 predictions
print("Predictions:", y_pred[:25])

Predictions: [0 1 1 1 0 0 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 0 0]


8. **Evaluate the model** using:
   - Confusion matrix
   - Accuracy
   - Sensitivity and specificity
   - Classification report



In [25]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", cm)

Confusion matrix:
 [[68 19]
 [17 39]]
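
To make the layout easier to read: scikit-learn puts the actual classes in rows and the predicted classes in columns. The sketch below labels the matrix printed above; the row and column labels and the name `cm_df` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Values copied from the confusion matrix printed above
cm_values = [[68, 19],
             [17, 39]]

cm_df = pd.DataFrame(
    cm_values,
    index=["Actual 0 (did not survive)", "Actual 1 (survived)"],
    columns=["Predicted 0", "Predicted 1"],
)

# Unpack the four cells by name
tn, fp = cm_values[0]  # true negatives, false positives
fn, tp = cm_values[1]  # false negatives, true positives

print(cm_df)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```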


In [None]:
# Accuracy
acc = accuracy_score(y_test, y_pred)
print("\nAccuracy:", round(acc, 4))

# The model is correct 74.83% of the time


Accuracy: 0.7483


In [None]:
# Sensitivity (recall for class 1)
sensitivity = cm[1,1] / (cm[1,0] + cm[1,1])
print("Sensitivity (recall for class 1):", round(sensitivity, 4))

# The model correctly detects 69.6% of the passengers who survived.

Sensitivity (recall for class 1): 0.6964


In [None]:
# Specificity (recall for class 0)
specificity = cm[0,0] / (cm[0,0] + cm[0,1])
print("Specificity (recall for class 0):", round(specificity, 4))

# The model correctly identifies 78.2% of the passengers who did not survive.

Specificity (recall for class 0): 0.7816


In [29]:
# Classification report
print("\nClassification report:\n", classification_report(y_test, y_pred))


Classification report:
               precision    recall  f1-score   support

           0       0.80      0.78      0.79        87
           1       0.67      0.70      0.68        56

    accuracy                           0.75       143
   macro avg       0.74      0.74      0.74       143
weighted avg       0.75      0.75      0.75       143



📋 Classification report summary:

Precision for class 0 (did not survive): 80%

Precision for class 1 (survived): 67%

The weighted-average F1-score is about 0.75 (0.79 for class 0, 0.68 for class 1).
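
The precision and F1 figures in the report can be recomputed by hand from the four confusion matrix cells. The sketch below does so with the counts printed earlier; the variable names are illustrative.

```python
# Counts taken from the confusion matrix [[68, 19], [17, 39]]
tn, fp, fn, tp = 68, 19, 17, 39

# Class 1 (survived): precision, recall, and F1
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (did not survive): the roles of the cells are mirrored
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"Class 0: precision={precision_0:.2f}, recall={recall_0:.2f}, f1={f1_0:.2f}")
print(f"Class 1: precision={precision_1:.2f}, recall={recall_1:.2f}, f1={f1_1:.2f}")
```

The rounded values agree with the per-class rows of the classification report.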

**Additional Reflection Questions**
- Which variable seems to have the most influence on survival?
- What happens if you change the probability threshold (default is 0.5)?
- How could you improve the model?

In [30]:
# Inspect the model coefficients

feature_names = X.columns
coefficients = model.coef_[0]

for name, coef in zip(feature_names, coefficients):
    print(f"{name}: {coef:.4f}")

Sex: -2.5323
Age: -0.0425
Pclass: -1.2488
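
One way to read these coefficients is to exponentiate them into odds ratios: each value is the change in the log-odds of survival per one-unit increase in the predictor, holding the others fixed. The sketch below uses the coefficients printed above; the dictionary names are illustrative.

```python
import math

# Coefficients copied from the output above
coefficients = {"Sex": -2.5323, "Age": -0.0425, "Pclass": -1.2488}

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in the predictor
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

Under this reading, being male (`Sex` = 1) multiplies the odds of survival by roughly 0.08 relative to being female, which suggests `Sex` has the strongest influence among the three predictors. Note that the predictors are on different scales (`Age` is per year, `Pclass` per class step), so the raw magnitudes are not directly comparable.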


In [31]:
# Get the predicted probabilities for class 1 (survived)
y_probs = model.predict_proba(X_test)[:, 1]

# Change the threshold, for example to 0.4
y_pred_custom = (y_probs > 0.4).astype(int)

# Evaluate again (confusion_matrix was already imported in step 1)
print(confusion_matrix(y_test, y_pred_custom))


[[61 26]
 [13 43]]
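
To see the effect of the lower threshold, the two confusion matrices shown in this exercise can be compared directly. The sketch below is a minimal comparison; the helper `rates` is a hypothetical function written for this illustration.

```python
def rates(cm):
    """Return accuracy, sensitivity, and specificity for a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Confusion matrices copied from the outputs above, keyed by threshold
matrices = {
    0.5: [[68, 19], [17, 39]],
    0.4: [[61, 26], [13, 43]],
}

results = {thr: rates(cm) for thr, cm in matrices.items()}

for thr, r in results.items():
    print(
        f"threshold={thr}: accuracy={r['accuracy']:.4f}, "
        f"sensitivity={r['sensitivity']:.4f}, specificity={r['specificity']:.4f}"
    )
```

Lowering the threshold to 0.4 classifies more passengers as survivors, so sensitivity rises while specificity and overall accuracy fall.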
