# Predictive Analysis of Titanic Survivors with Python and Scikit-learn

This machine learning mini-project is for study purposes and serves as a reference of good practices to apply in future analyses.


### Prepare the environment

First, open CMD or your JupyterLab workspace and run the following command to install the required libraries.

- pip install pandas numpy matplotlib seaborn scikit-learn


### Import libraries and scikit-learn


In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


### Loading data

In [15]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(train_df.head())


   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


### Checking for null values in the columns

In [16]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print("Train dataset nulls:")
print(train_df.isnull().sum())

print("\nTest dataset nulls:")
print(test_df.isnull().sum())

Train dataset nulls:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Test dataset nulls:
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


### Handle Missing values

- We now need to fill in the missing values, because the model used below cannot be trained on columns that contain NaN.

In [17]:
# Fill missing Age
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].median())

# Fill missing Embarked (training only)
train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])

# Fill missing Fare (test only)
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())

# Drop the Cabin column (too many NaNs to fill reliably)
train_df = train_df.drop(columns=['Cabin'])
test_df = test_df.drop(columns=['Cabin'])
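
A common variation, sketched here only as an illustration, is to compute the fill values on the training set and reuse them for the test set, so that both sets are filled in the same way. The two small DataFrames below (`train_demo`, `test_demo`) are made-up stand-ins, not the real Titanic files.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-ins for train_df and test_df
train_demo = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Embarked": ["S", "C", None, "S"],
})
test_demo = pd.DataFrame({
    "Age": [np.nan, 47.0],
    "Fare": [7.83, np.nan],
    "Embarked": ["Q", "S"],
})

# Learn the fill values from the training data only
age_median = train_demo["Age"].median()
fare_median = train_demo["Fare"].median()
embarked_mode = train_demo["Embarked"].mode()[0]

# Apply the same values to both sets
for df in (train_demo, test_demo):
    df["Age"] = df["Age"].fillna(age_median)
    df["Fare"] = df["Fare"].fillna(fare_median)
    df["Embarked"] = df["Embarked"].fillna(embarked_mode)

# Count the remaining nulls in each set
print(train_demo.isnull().sum().sum(), test_demo.isnull().sum().sum())
```

Whether this is better than filling each set with its own median depends on the goal; it mainly keeps the test set from influencing the preprocessing.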



### Convert categorical variables to numeric

- The goal is to transform categorical variables (text labels) into integers. This matters because machine learning models don't understand text: after encoding, the Sex and Embarked columns are numeric and ready to use.

In [18]:

le = LabelEncoder()

# Encode 'Sex'
train_df['Sex'] = le.fit_transform(train_df['Sex'])
test_df['Sex'] = le.transform(test_df['Sex'])

# Encode 'Embarked'
train_df['Embarked'] = le.fit_transform(train_df['Embarked'])
test_df['Embarked'] = le.transform(test_df['Embarked'])
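
To see which integer each label receives, the encoder's `classes_` attribute can be inspected. The sketch below uses a small made-up sample of the Sex column (the `sex` Series is hypothetical); LabelEncoder sorts labels alphabetically, so 'female' comes before 'male'.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the raw 'Sex' column
sex = pd.Series(["male", "female", "female", "male"])

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

# classes_ holds the original labels; the position of each label is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'female': 0, 'male': 1}
```

Because the same `le` object is refit for Embarked in the cell above, its `classes_` afterwards describe Embarked only. Using one encoder per column is one way to keep each mapping available.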



### Select Features



In [19]:
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = train_df[features]
y = train_df['Survived']
X_test = test_df[features]

# Standardize features (zero mean, unit variance)
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_test = scaler.transform(X_test)
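
Note that the scaler above is fit on every training row, so any validation rows split off afterwards have already influenced the scaling. A possible alternative, sketched here on synthetic data from `make_classification` (not the Titanic data), is to put the scaler and the model in a scikit-learn pipeline so that the scaler is fit only on the rows used for training.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with seven features (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=42)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on X_tr only, then reuses it for X_va
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
pipe.fit(X_tr, y_tr)

val_accuracy = pipe.score(X_va, y_va)
print("Validation accuracy on synthetic data:", val_accuracy)
```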


### Train the model 



In [20]:
# Split training data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression model

model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)

# Evaluate on validation set

y_pred = model.predict(X_val)
print("Accuracy:", accuracy_score(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))


Accuracy: 0.8044692737430168
[[90 15]
 [20 54]]
              precision    recall  f1-score   support

           0       0.82      0.86      0.84       105
           1       0.78      0.73      0.76        74

    accuracy                           0.80       179
   macro avg       0.80      0.79      0.80       179
weighted avg       0.80      0.80      0.80       179
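
The numbers in the report can be checked by hand from the confusion matrix. The sketch below recomputes accuracy and the class 1 (survived) metrics from the four counts printed above.

```python
import numpy as np

# Confusion matrix reported above: rows = actual class, columns = predicted class
cm = np.array([[90, 15],
               [20, 54]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)  # of the predicted survivors, how many survived
recall_1 = tp / (tp + fn)     # of the actual survivors, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
# accuracy=0.80 precision=0.78 recall=0.73 f1=0.76
```

The recall of 0.73 means the model misses about a quarter of the actual survivors (20 of 74).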



### Actual percentage of survivors in the dataset

In [25]:
# Actual survival rate
survival_rate = train_df['Survived'].mean()  # value between 0 and 1
print("Survival rate:", survival_rate*100, "%")



Survival rate: 38.38383838383838 %
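
The survival rate also gives a simple baseline for judging the model. A classifier that always predicts "did not survive" would be right for every passenger who did not survive, as this small calculation shows.

```python
# Survival rate reported above (fraction of passengers who survived)
survival_rate = 0.3838383838383838

# Always predicting the majority class is correct for that class's share of rows
baseline_accuracy = max(survival_rate, 1 - survival_rate)

print(f"Majority-class baseline: {baseline_accuracy:.1%}")
# Majority-class baseline: 61.6%
```

This baseline is computed on the full training set, while the 80.4% accuracy above was measured on the validation split, so the comparison is only approximate.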


### Comparison of survival between males and females

- Survival rate for each sex, showing which had the greater chance of survival
- 0 = female
- 1 = male

In [31]:

survival_by_sex = train_df.groupby('Sex')['Survived'].mean() * 100
print("Survival rate by Sex (%):")
print(survival_by_sex)


Survival rate by Sex (%):
Sex
0    74.203822
1    18.890815
Name: Survived, dtype: float64


In conclusion, this analysis could be improved with other techniques that often perform better on this kind of data. One example is Random Forest.

- What it is: an ensemble of multiple decision trees that vote on the final prediction.
- Why it's effective:
  - Handles mixed categorical and continuous features well.
  - Captures non-linear interactions between variables, such as age + class + gender.
  - Less sensitive to outliers and less prone to overfitting than a single tree.
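
As a starting point for that next step, here is a minimal sketch of training a Random Forest with scikit-learn. It runs on synthetic data from `make_classification`, not on the Titanic features, and the settings (`n_estimators=200`, `n_informative=4`) are arbitrary choices for illustration, so the accuracy it prints says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with seven features (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Each of the 200 trees votes; the majority decides the prediction
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_tr, y_tr)

rf_accuracy = accuracy_score(y_va, forest.predict(X_va))
print("Validation accuracy on synthetic data:", rf_accuracy)

# Relative importance of each feature (the values sum to 1)
importances = forest.feature_importances_
print(importances)
```

To try it on the project's data, `X_tr` and `y_tr` would be replaced by the `X_train` and `y_train` from the training cell. Tree-based models do not need the features to be standardized.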