
In [2]:
import numpy as np
import pandas as pd

# Load the datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Always keep raw backups
train_raw = train.copy()
test_raw = test.copy()

# info() prints its report directly and returns None, so it is not wrapped in print()
train.info()
print(train.isnull().sum())
train.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
# PassengerId should be unique
print("Duplicates:", train.duplicated().sum())
print("Unique Passenger IDs:", train['PassengerId'].nunique())

# Make sure Survived is int
train['Survived'] = train['Survived'].astype(int)

Duplicates: 0
Unique Passenger IDs: 891


In [5]:
# Check again
train.isnull().sum()

# Fill Embarked missing with mode.
# Assign the result back: calling fillna(inplace=True) on a selected column
# raises a FutureWarning and stops working in pandas 3.0.
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])

# Fill Fare missing (if any) with median
train['Fare'] = train['Fare'].fillna(train['Fare'].median())
test['Fare'] = test['Fare'].fillna(test['Fare'].median())
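
Filling the test frame with statistics computed on the training frame keeps both frames imputed with the same values. The sketch below is one possible way to do that, not the notebook's own method; `train_toy`, `test_toy`, and `fill_values` are hypothetical names, and the rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames (hypothetical values, not read from train.csv / test.csv)
train_toy = pd.DataFrame({
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 8.05, np.nan],
})
test_toy = pd.DataFrame({
    "Embarked": ["Q", None],
    "Fare": [np.nan, 26.0],
})

# Compute the fill values once, from the training frame only
fill_values = {
    "Embarked": train_toy["Embarked"].mode()[0],
    "Fare": train_toy["Fare"].median(),
}

# DataFrame.fillna accepts a per-column dict and returns a new frame
train_filled = train_toy.fillna(fill_values)
test_filled = test_toy.fillna(fill_values)
```

Because `fillna` returns a new frame here, the original toy frames stay untouched, which mirrors the raw-backup habit used at the top of the notebook.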

In [6]:
def get_title(name):
    title = name.split(',')[1].split('.')[0].strip()
    return title

train['Title'] = train['Name'].apply(get_title)
test['Title'] = test['Name'].apply(get_title)

# Replace rare titles (assign back instead of replace(..., inplace=True) on a column)
rare_titles = ['Lady','Countess','Capt','Col','Don','Dr','Major','Rev','Sir','Jonkheer','Dona']
train['Title'] = train['Title'].replace(rare_titles, 'Other')
test['Title'] = test['Title'].replace(rare_titles, 'Other')

# Normalize similar titles
title_replace = {'Mlle':'Miss', 'Ms':'Miss', 'Mme':'Mrs'}
train['Title'] = train['Title'].replace(title_replace)
test['Title'] = test['Title'].replace(title_replace)
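
A fixed list of rare titles misses anything it does not spell exactly. For instance, a name formatted like "Rothes, the Countess. of (...)" yields the title `the Countess`, which does not match `Countess`. A possible alternative, sketched here on made-up sample names, is to keep a small set of common titles and send everything else to `Other`. The names `names`, `title_map`, and `common` are hypothetical.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" layout (illustrative only)
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Reynaldo, Ms. Encarnacion",
])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Normalize spelling variants, then bucket everything uncommon as 'Other'
title_map = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}
common = {"Mr", "Mrs", "Miss", "Master"}
titles = titles.replace(title_map)
titles = titles.where(titles.isin(common), "Other")
```

The allow-list approach trades a little detail for robustness: any title that appears only in the test set still lands in a category the encoder has seen.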


In [7]:
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
test['FamilySize'] = test['SibSp'] + test['Parch'] + 1

train['IsAlone'] = (train['FamilySize'] == 1).astype(int)
test['IsAlone'] = (test['FamilySize'] == 1).astype(int)

In [8]:
train['Deck'] = train['Cabin'].fillna('U').astype(str).str[0]
test['Deck'] = test['Cabin'].fillna('U').astype(str).str[0]

In [9]:
age_medians = train.groupby(['Title', 'Pclass', 'Sex'])['Age'].median()

def fill_age(row):
    if np.isnan(row['Age']):
        return age_medians.loc[row['Title'], row['Pclass'], row['Sex']]
    else:
        return row['Age']

train['Age'] = train.apply(fill_age, axis=1)
test['Age'] = test.apply(fill_age, axis=1)
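
The row-wise lookup raises a `KeyError` if a frame contains a group that never occurs in the training data, and it returns NaN when every age in a group is missing. One way to guard against both, shown as a sketch on toy data, is a left merge against the group medians with a global-median fallback. The helper `impute_age` and the toy frames are hypothetical, and the grouping uses only two keys to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy frames (hypothetical values)
train_toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Miss", "Miss", "Mr"],
    "Pclass": [3, 3, 1, 1, 1],
    "Age": [20.0, 30.0, 18.0, np.nan, np.nan],
})
test_toy = pd.DataFrame({
    "Title": ["Mr", "Master"],   # ('Master', 2) never occurs in train_toy
    "Pclass": [3, 2],
    "Age": [np.nan, np.nan],
})

keys = ["Title", "Pclass"]
# Per-group medians from the training frame, as a flat lookup table
group_medians = (
    train_toy.groupby(keys)["Age"].median().rename("AgeMedian").reset_index()
)
global_median = train_toy["Age"].median()

def impute_age(df):
    """Fill missing Age with the group median, then the global median."""
    merged = df.merge(group_medians, on=keys, how="left")
    by_group = pd.Series(merged["AgeMedian"].to_numpy(), index=df.index)
    return df["Age"].fillna(by_group).fillna(global_median)

train_age = impute_age(train_toy)
test_age = impute_age(test_toy)
```

The left merge keeps the row order of `df`, so the looked-up medians can be realigned to the original index before filling.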

In [10]:
drop_cols = ['Name','Ticket','Cabin']
train.drop(columns=drop_cols, inplace=True)
test.drop(columns=drop_cols, inplace=True)

In [11]:
# Convert categorical to string (so encoders see them as categorical)
for col in ['Pclass','Sex','Embarked','Title','Deck']:
    train[col] = train[col].astype(str)
    test[col] = test[col].astype(str)

In [12]:
train['Fare'] = np.log1p(train['Fare'])
test['Fare'] = np.log1p(test['Fare'])
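
`np.log1p` computes log(1 + x), so a fare of zero maps to zero instead of negative infinity, and `np.expm1` undoes the transform. A small sketch, using two fares from the `head()` output plus two made-up values (0.0 and 500.0):

```python
import numpy as np

# 7.25 and 71.2833 appear in the head() output; 0.0 and 500.0 are made up
fares = np.array([0.0, 7.25, 71.2833, 500.0])

log_fares = np.log1p(fares)     # log(1 + x), well defined at x = 0
restored = np.expm1(log_fares)  # exp(y) - 1, the inverse transform
```

The transform compresses the long right tail of the fare distribution while preserving the ordering of the values.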

In [13]:
from sklearn.model_selection import train_test_split

X = train.drop(['Survived','PassengerId'], axis=1)
y = train['Survived']

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_val.shape)

(712, 11) (179, 11)


In [14]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

num_cols = ['Age','SibSp','Parch','Fare','FamilySize']
cat_cols = ['Pclass','Sex','Embarked','Title','Deck','IsAlone']

# Numeric pipeline
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine both
preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])

In [15]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

model = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', RandomForestClassifier(
        n_estimators=200, random_state=42, max_depth=8, min_samples_split=4
    ))
])

model.fit(X_train, y_train)

# Validate
pred_val = model.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, pred_val))
print(confusion_matrix(y_val, pred_val))
print(classification_report(y_val, pred_val))

Validation Accuracy: 0.8100558659217877
[[99 11]
 [23 46]]
              precision    recall  f1-score   support

           0       0.81      0.90      0.85       110
           1       0.81      0.67      0.73        69

    accuracy                           0.81       179
   macro avg       0.81      0.78      0.79       179
weighted avg       0.81      0.81      0.81       179



In [16]:
# Retrain on all data
X_full = train.drop(['Survived','PassengerId'], axis=1)
y_full = train['Survived']

model.fit(X_full, y_full)

# Predict on test data
test_pred = model.predict(test.drop(['PassengerId'], axis=1))

In [17]:
submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': test_pred
})

submission.to_csv('submission.csv', index=False)
print("submission.csv created successfully!")

submission.csv created successfully!


In [18]:
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_full, y_full, cv=cv, scoring='accuracy')
print("CV scores:", scores)
print("Mean accuracy:", scores.mean())

CV scores: [0.84916201 0.83146067 0.79775281 0.83707865 0.85393258]
Mean accuracy: 0.8338773460548616


In [19]:
# Get feature importances from RandomForest
final_model = model.named_steps['clf']
# Need to get feature names after one-hot encoding
feature_names = (
    model.named_steps['preprocessor']
    .transformers_[0][2]
    + list(model.named_steps['preprocessor'].transformers_[1][1]
           .named_steps['encoder'].get_feature_names_out(cat_cols))
)

importances = pd.Series(final_model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(15))

Title_Mr        0.145991
Sex_male        0.124348
Fare            0.116691
Sex_female      0.110432
Age             0.088949
Pclass_3        0.063563
FamilySize      0.054263
Title_Mrs       0.044256
Deck_U          0.033742
Title_Miss      0.030286
SibSp           0.026079
Pclass_1        0.025500
Pclass_2        0.017037
Parch           0.016620
Title_Master    0.016025
dtype: float64
