<a href="https://colab.research.google.com/github/Abdulmajed48/titanic-classification/blob/main/titanic-classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score



# Step 1: Load the dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(url)
# Step 2: Data preprocessing

# Step 2.1: Drop columns that are not used as features
# inplace=True modifies `data` directly; without it, drop() returns a new
# DataFrame and leaves the original unchanged.
columns_to_drop = ['PassengerId', 'Name', 'Ticket', 'Cabin']
data.drop(columns=columns_to_drop, inplace=True)

# Step 2.2: Fill missing values
# print(data.isnull().sum())  # check which columns have null values
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection: that chained form acts on a copy and will not update `data` in pandas 3.0.
# Age: the median is less sensitive than the mean to a skewed distribution.
data['Age'] = data['Age'].fillna(data['Age'].median())
# Embarked: categorical, so fill with the most common category (the mode).
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])

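# Step 2.3: Encode categorical variables
# LabelEncoder maps each category to an integer in sorted order
# (Sex: female=0, male=1; Embarked: C=0, Q=1, S=2).
# The encoder is refit for each column, so afterwards it only holds the Embarked classes.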
encoder = LabelEncoder()
data['Sex'] = encoder.fit_transform(data['Sex'])
data['Embarked'] = encoder.fit_transform(data['Embarked'])

# Step 2.4: Separate features and target
# Keeping the target out of X avoids accidentally using it as a predictor, which would be data leakage.
X = data.drop('Survived', axis=1)  # axis=1 drops a column; axis=0 (the default) drops rows
y = data['Survived']

# Step 2.5: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2.6: Standardize the features
# The scaler learns the mean and standard deviation of each feature from the training data only.
# Tree-based models such as random forests do not depend on feature scale, so this step
# matters mainly if the model is later swapped for a scale-sensitive one.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on the training data, then transform it
X_test = scaler.transform(X_test)  # reuse the training statistics; never refit on test data


# Step 3: Train the model
# Why a random forest?
# 1. It works on mixed feature types once categorical columns are encoded as numbers (Step 2.3).
# 2. Averaging many trees reduces variance, so it overfits less readily than a single decision tree.
# 3. Unlike linear models, it can capture non-linear relationships and interactions between features.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Step 4: Make predictions
y_pred = model.predict(X_test)

# Step 5: Evaluate the model
# Accuracy = number of correct predictions / total number of predictions
# classification_report summarizes per-class performance:
# Precision = fraction of predicted positives that are actually positive
# Recall = fraction of actual positives that the model identified
# F1 score = harmonic mean of precision and recall
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", report)



Accuracy: 0.82
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.88      0.85       105
           1       0.81      0.74      0.77        74

    accuracy                           0.82       179
   macro avg       0.82      0.81      0.81       179
weighted avg       0.82      0.82      0.82       179
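
As a sanity check on how the report's columns relate, the F1 values can be recomputed from the precision and recall shown above. This is only a sketch: the inputs are the rounded figures copied from the report, and the names `report_rows` and `f1_from` are illustrative, not part of scikit-learn.

```python
# Per-class precision, recall, and support copied from the report above
# (already rounded to two decimals, so results are approximate).
report_rows = {
    0: {"precision": 0.83, "recall": 0.88, "support": 105},
    1: {"precision": 0.81, "recall": 0.74, "support": 74},
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = {label: f1_from(row["precision"], row["recall"])
      for label, row in report_rows.items()}
total = sum(row["support"] for row in report_rows.values())

# Macro average: plain mean over classes.
macro_f1 = sum(f1.values()) / len(f1)
# Weighted average: mean over classes weighted by support.
weighted_f1 = sum(f1[label] * row["support"]
                  for label, row in report_rows.items()) / total

for label, value in f1.items():
    print(f"class {label}: f1 = {value:.2f}")
print(f"macro avg f1 = {macro_f1:.2f}, weighted avg f1 = {weighted_f1:.2f}")
```

With these inputs the recomputed values agree with the report's f1-score column to two decimals; in general, small differences can appear because the report rounds precision and recall before they are copied.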

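
The missing-value handling in Step 2.2 can be tried in isolation on a small frame. The sketch below uses a hypothetical `toy` DataFrame with made-up values (only the column names come from the Titanic data) and assigns the filled column back, which avoids calling an inplace method on a column selection.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names match the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0, 80.0],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Numeric column: fill with the median, which the outlier 80.0 barely affects.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: fill with the mode. mode() returns a Series because
# ties are possible, so take its first entry.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
print(toy.isnull().sum())
```

Here the missing age becomes 32.0 (the median of 22, 26, 38, and 80) and the missing port becomes "S", the most frequent value.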

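
Step 2.6 fits the scaler on the training split and only transforms the test split. A minimal sketch of what that means, using hypothetical single-feature arrays `train` and `test` in place of `X_train` and `X_test`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature arrays standing in for X_train and X_test.
train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
test = np.array([[3.0], [10.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print("mean:", scaler.mean_, "scale:", scaler.scale_)
print("train mean/std after scaling:", train_scaled.mean(), train_scaled.std())
print("test after scaling:", test_scaled.ravel())
```

The scaled training data has mean 0 and standard deviation 1, while the scaled test data generally does not, because it is measured against the training statistics. That is the intended behavior: refitting on the test split would let information from the test data influence preprocessing.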