<a href="https://colab.research.google.com/github/PatelHarshitt/ML2025/blob/main/titanic_lab2_decisiontree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install any missing libraries (optional, Colab usually has these)
!pip install pandas scikit-learn

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder




In [2]:
# Load datasets
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Combine for uniform preprocessing
combined = pd.concat([train_df, test_df], sort=False)

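# Fill missing values. Assign the result back to the column instead of using
# chained inplace=True, which pandas warns will stop working in pandas 3.0.
combined['Age'] = combined['Age'].fillna(combined['Age'].median())
combined['Fare'] = combined['Fare'].fillna(combined['Fare'].median())
combined['Embarked'] = combined['Embarked'].fillna(combined['Embarked'].mode()[0])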

# Encode categorical variables
le = LabelEncoder()
combined['Sex'] = le.fit_transform(combined['Sex'])
combined['Embarked'] = le.fit_transform(combined['Embarked'])

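# Separate processed data. .copy() makes each frame independent of `combined`,
# so adding columns later does not raise SettingWithCopyWarning.
train_processed = combined[:len(train_df)].copy()
test_processed = combined[len(train_df):].copy()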
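
The integer codes produced by `LabelEncoder` are easier to interpret when the mapping back to the original labels is kept. The following is a minimal sketch on a small hand-made frame (the `toy` values and the `encoders` dictionary are illustrative assumptions, not part of the notebook); it fits one encoder per column and reads the mapping from `classes_`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the two categorical columns (hypothetical rows, not the real data)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Fit one encoder per column and keep it so the mapping can be looked up later
encoders = {}
for col in ["Sex", "Embarked"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted; position i holds the original label for integer code i
sex_mapping = dict(enumerate(encoders["Sex"].classes_))
embarked_mapping = dict(enumerate(encoders["Embarked"].classes_))
print(sex_mapping)
print(embarked_mapping)
```

Because the classes are sorted alphabetically, `female` maps to 0 and `male` to 1, which matches the encoded `Sex` column shown in the notebook output.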



In [6]:
combined

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.2500,,2
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,0
2,3,1.0,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.9250,,2
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1000,C123,2
4,5,0.0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.0500,,2
...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,,3,"Spector, Mr. Woolf",1,28.0,0,0,A.5. 3236,8.0500,,2
414,1306,,1,"Oliva y Ocana, Dona. Fermina",0,39.0,0,0,PC 17758,108.9000,C105,0
415,1307,,3,"Saether, Mr. Simon Sivertsen",1,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,2
416,1308,,3,"Ware, Mr. Frederick",1,28.0,0,0,359309,8.0500,,2


In [3]:
# Select features and labels
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = train_processed[features]
y = train_processed['Survived']
X_test = test_processed[features]

# Split training data for evaluation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict and print accuracy
y_pred = clf.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print("Classification Accuracy:", round(accuracy * 100, 2), "%")



Classification Accuracy: 78.21 %
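
The accuracy above comes from a single 80/20 split, so it can shift noticeably with a different `random_state`. One way to put it in context is to cross-validate the tree and compare it with a majority-class baseline. The sketch below assumes synthetic data from `make_classification` as a stand-in for the Titanic features; the names `X_demo`, `y_demo`, `tree_scores`, and `dummy_scores` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 7 features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)

# 5-fold cross-validated accuracy of an unconstrained decision tree
tree_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=5, scoring="accuracy"
)

# Same folds, but always predicting the most frequent class
dummy_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_demo, y_demo, cv=5, scoring="accuracy"
)

print("Tree accuracy per fold:", tree_scores.round(3))
print("Tree mean / std:", tree_scores.mean().round(3), tree_scores.std().round(3))
print("Majority-class baseline mean:", dummy_scores.mean().round(3))
```

The spread across folds gives a rough sense of how much a difference of one or two percentage points between models can be trusted.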


In [4]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {
    'max_depth': [3, 5, 7, 9, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

# Initialize classifier
dt = DecisionTreeClassifier(random_state=42)

# Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best model
best_clf = grid_search.best_estimator_

# Evaluate on validation set
y_pred = best_clf.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print("Improved Accuracy:", round(accuracy * 100, 2), "%")
print("Best Parameters:", grid_search.best_params_)


Improved Accuracy: 79.89 %
Best Parameters: {'criterion': 'gini', 'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 2}
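
The grid search settled on a shallow tree (`max_depth=3`), which is small enough to read as a set of rules. A hedged sketch of how such a tree could be inspected with `export_text` follows; it fits on synthetic data, and the feature names `f0` to `f3` are placeholders rather than the Titanic columns.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption): 4 features, binary target
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3"]  # placeholder names

# Same depth and leaf-size settings as the best parameters reported above
shallow = DecisionTreeClassifier(
    criterion="gini", max_depth=3, min_samples_leaf=2, min_samples_split=2, random_state=42
)
shallow.fit(X_demo, y_demo)

# Text rendering of the split rules, one line per node
rules = export_text(shallow, feature_names=feature_names)
print(rules)
```

With the notebook's own objects, the equivalent call would take `best_clf` and the `features` list.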


In [5]:
# Work on an explicit copy so the column assignments below modify a real frame
train_processed = train_processed.copy()

# Create new features
train_processed['FamilySize'] = train_processed['SibSp'] + train_processed['Parch'] + 1
train_processed['IsAlone'] = (train_processed['FamilySize'] == 1).astype(int)

# Extract Title from Name (raw string so the regex escape \. is passed through unchanged)
train_processed['Title'] = train_df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
train_processed['Title'] = train_processed['Title'].replace(['Mlle', 'Ms'], 'Miss')
train_processed['Title'] = train_processed['Title'].replace(['Mme'], 'Mrs')
# Group titles that occur fewer than 10 times into a single 'Rare' category
title_counts = train_processed['Title'].value_counts()
rare_titles = title_counts[title_counts < 10].index
train_processed['Title'] = train_processed['Title'].replace(rare_titles, 'Rare')
train_processed['Title'] = LabelEncoder().fit_transform(train_processed['Title'])

# Update features
features = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'FamilySize', 'IsAlone', 'Title']
X = train_processed[features]
y = train_processed['Survived']

# Split again
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the best model again
best_clf.fit(X_train, y_train)
y_pred = best_clf.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print("Accuracy After Feature Engineering:", round(accuracy * 100, 2), "%")


Accuracy After Feature Engineering: 79.89 %
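
The cell above adds the new columns to the training frame only, so the test frame would need the same steps before it could be scored. One possible arrangement, sketched here under assumptions, is to wrap the steps in a function and call it on each frame. The helper name `add_family_and_title` is hypothetical, the fourth passenger name is made up for illustration, and the rare-title grouping and label encoding are left out for brevity.

```python
import pandas as pd

def add_family_and_title(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with FamilySize, IsAlone, and Title columns added."""
    out = df.copy()
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    # Title is the word that directly precedes the first period in the name
    out["Title"] = out["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
    # Merge spelling variants into the common titles
    out["Title"] = out["Title"].replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
    return out

toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Oliva y Ocana, Dona. Fermina",
        "Doe, Mlle. Jane",  # hypothetical passenger
    ],
    "SibSp": [1, 0, 0, 0],
    "Parch": [0, 0, 0, 0],
})

engineered = add_family_and_title(toy)
print(engineered[["FamilySize", "IsAlone", "Title"]])
```

Returning a copy leaves the input frame untouched, which also avoids the SettingWithCopyWarning that pandas raises when columns are assigned on a slice.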



Unnamed: 0,Sex
0,male
1,female
2,male
3,male
4,female


Unnamed: 0,Sex
0,1
1,0
2,1
3,1
4,0


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Unnamed: 0,Sex
0,1
1,0
2,0
3,0
4,1


Unnamed: 0,Sex
35,1
46,1
453,1
291,0
748,1


ValueError: Expected a 2-dimensional container but got <class 'pandas.core.series.Series'> instead. Pass a DataFrame containing a single row (i.e. single sample) or a single column (i.e. single feature) instead.
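
The cell that raised this error is not shown, but an error of this kind generally appears when a single pandas Series is passed to a scikit-learn estimator or transformer that expects a 2-D input. The sketch below reproduces the situation with `OrdinalEncoder` on a hand-made frame (an assumption chosen for illustration, not necessarily the call used in the notebook) and shows the double-bracket selection that yields a one-column DataFrame instead.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Sex": ["male", "female", "female"]})
enc = OrdinalEncoder()

# df["Sex"] is a 1-D Series, which the encoder rejects
try:
    enc.fit_transform(df["Sex"])
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# df[["Sex"]] is a 2-D DataFrame with a single column, which is accepted
encoded = enc.fit_transform(df[["Sex"]])
print(encoded.shape)
```

`LabelEncoder` is the exception to this rule: it is meant for target labels and takes a 1-D input, which is why `le.fit_transform(combined['Sex'])` works in the preprocessing cell.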
