In [21]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import joblib

df = pd.read_csv("cleaned_adult_data.csv")



This cell imports all the required libraries and loads the cleaned dataset.

In [12]:
df.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,education_age_ratio,work_age_ratio
0,38,Private,215646,Graduate,9,Divorced,Skilled Trade,Not-in-family,White,Male,0,0,40,United-States,<=50K,0.236842,54.736842
1,53,Private,234721,School,7,Married-civ-spouse,Skilled Trade,Husband,Black,Male,0,0,40,United-States,<=50K,0.132075,39.245283
2,28,Private,338409,Bachelor,13,Married-civ-spouse,Professional Jobs,Wife,Black,Female,0,0,40,Cuba,<=50K,0.464286,74.285714
3,37,Private,284582,Graduate,14,Married-civ-spouse,Professional Jobs,Wife,White,Female,0,0,40,United-States,<=50K,0.378378,56.216216
4,49,Private,160187,School,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K,0.102041,16.979592
5,52,Self-Employed,209642,Graduate,9,Married-civ-spouse,Professional Jobs,Husband,White,Male,0,0,45,United-States,>50K,0.173077,45.0
6,31,Private,45781,Graduate,14,Never-married,Professional Jobs,Not-in-family,White,Female,14084,0,50,United-States,>50K,0.451613,83.870968
7,42,Private,159449,Bachelor,13,Married-civ-spouse,Professional Jobs,Husband,White,Male,5178,0,40,United-States,>50K,0.309524,49.52381
8,37,Private,280464,Bachelor,10,Married-civ-spouse,Professional Jobs,Husband,Black,Male,0,0,80,United-States,>50K,0.27027,112.432432
9,30,Government,141297,Bachelor,13,Married-civ-spouse,Professional Jobs,Husband,Asian-Pac-Islander,Male,0,0,40,India,>50K,0.433333,69.333333


A quick look at the first ten rows to check whether the dataset is clean. The data has already been cleaned at this point.
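The Unnamed: 0 column in the preview above looks like a row index that was written to the CSV when the cleaned data was saved. That is an assumption based on its values (0, 1, 2, ...), not something the notebook states. If it is the case, the column would otherwise be used as a feature later on. A minimal sketch on an in-memory toy CSV (the contents are illustrative, not the real file) shows how `index_col=0` keeps it out of the columns:

```python
import io

import pandas as pd

# Toy CSV text with an unnamed first column, mimicking a saved index.
# The values are illustrative only.
csv_text = ",age,income\n0,38,<=50K\n1,53,<=50K\n"

# Default read: the unnamed first column becomes "Unnamed: 0".
with_index_col = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index so it is not a feature.
without_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)

cols_default = list(with_index_col.columns)
cols_fixed = list(without_index_col.columns)
print(cols_default)
print(cols_fixed)
```

Changing how the file is read would change the feature set, so the accuracy reported later in this notebook would likely differ.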

In [13]:
categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])



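Here, the categorical features are label-encoded as integers so that the model can work with them. Because income is also a text column, the target is encoded in the same loop.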
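The loop above reuses a single le variable, so only the encoder for the last column is still available afterwards. If the integer codes need to be mapped back to their original labels later, one option is to keep one fitted encoder per column. The sketch below uses a toy frame and a hypothetical `encoders` dictionary; neither is part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "workclass": ["Private", "Government", "Private"],
    "income": ["<=50K", ">50K", "<=50K"],
})

encoders = {}  # hypothetical name: column -> fitted LabelEncoder
for col in ["workclass", "income"]:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])
    encoders[col] = le  # keep the fitted encoder instead of discarding it

# LabelEncoder assigns integer codes following the sorted order of the labels.
income_classes = list(encoders["income"].classes_)

# Map the integer codes back to the original labels.
decoded = list(encoders["income"].inverse_transform(toy["income"]))
print(income_classes)
print(decoded)
```

The dictionary could also be saved with joblib next to the model, so that a later notebook can decode predictions.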

In [22]:
X = df.drop(columns=["income"]) 
y = df["income"]

I dropped the income column from the features because it is the target variable for this model.

In [16]:

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

df["dataset_split"] = "test" 

df.loc[X_train.index, "dataset_split"] = "train"
df.loc[X_val.index, "dataset_split"] = "validation"

df.to_csv("cleaned_adult_data_with_splits.csv", index=False)

print("Dataset with splits saved successfully.")



Dataset with splits saved successfully.


I split the dataset into three parts: the train set (70%), the validation set (15%) and the test set (15%). I then saved a new CSV file with a dataset_split column that records which split each row belongs to, for later use.
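The split above is purely random. Because the two income classes are imbalanced, a possible variant is to pass `stratify` to `train_test_split` so that each part keeps roughly the same class ratio. This is a swapped-in alternative, not what the notebook does, and the data below is synthetic (1000 rows, 25% positives) purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 25% positive labels.
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 250 + [0] * 750)

# Same 70/15/15 scheme as the notebook, with stratification added.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

# Check the size of each part and its share of positive labels.
sizes = (len(X_tr), len(X_va), len(X_te))
pos_rates = [round(float(y.mean()), 2) for y in (y_tr, y_va, y_te)]
print(sizes)
print(pos_rates)
```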

In [17]:
# Standardize all features; fit the scaler on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

# Train a Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

I chose the Random Forest classifier because my dataset is structured (tabular) data, and I learnt from the lecture recording that RandomForestClassifier is effective for such datasets.

In [18]:
# Evaluate the model
y_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print("Validation Accuracy:", accuracy)
print("Classification Report:\n", classification_report(y_val, y_pred))

Validation Accuracy: 0.8514588859416445
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.93      0.90      3375
           1       0.75      0.62      0.68      1149

    accuracy                           0.85      4524
   macro avg       0.81      0.78      0.79      4524
weighted avg       0.85      0.85      0.85      4524



The validation accuracy is about 0.851. The model does well on class 0 (recall 0.93) but is weaker on class 1, where recall is 0.62 and precision is 0.75.
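To put the accuracy in context, it helps to compare it with a trivial model that always predicts the majority class. The short calculation below uses only the support counts and the accuracy printed in the report above; the variable names are my own.

```python
# Support counts copied from the classification report above.
support = {0: 3375, 1: 1149}
total = sum(support.values())

# Accuracy of a trivial model that always predicts the majority class (0).
baseline_accuracy = support[0] / total

# Validation accuracy printed by the evaluation cell above.
model_accuracy = 0.8514588859416445

improvement = model_accuracy - baseline_accuracy
print(f"Majority-class baseline: {baseline_accuracy:.3f}")
print(f"Improvement over baseline: {improvement:.3f}")
```

The baseline comes out at roughly 0.75, so the Random Forest gains about ten percentage points over always predicting class 0.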

In [25]:
# Save the fitted scaler and the trained model
joblib.dump(scaler, "scaler.pkl")
joblib.dump(model, "random_forest_model.pkl")


['random_forest_model.pkl']

I saved the scaler and the model here. On to the next notebook :)
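For reference, here is a minimal sketch of how the saved files could be loaded again with `joblib.load`. It trains a tiny toy model and writes to a temporary directory so that it runs on its own. The toy data and the variable names are assumptions; in the next notebook the real scaler.pkl and random_forest_model.pkl would be loaded instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the real features (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Fit a scaler and a small forest, mirroring the steps in this notebook.
scaler_toy = StandardScaler().fit(X_toy)
model_toy = RandomForestClassifier(n_estimators=10, random_state=42)
model_toy.fit(scaler_toy.transform(X_toy), y_toy)

with tempfile.TemporaryDirectory() as tmp:
    scaler_path = os.path.join(tmp, "scaler.pkl")
    model_path = os.path.join(tmp, "random_forest_model.pkl")
    joblib.dump(scaler_toy, scaler_path)
    joblib.dump(model_toy, model_path)

    # Load both artifacts back from disk.
    loaded_scaler = joblib.load(scaler_path)
    loaded_model = joblib.load(model_path)

# Apply the same order as in training: scale first, then predict.
preds_original = model_toy.predict(scaler_toy.transform(X_toy))
preds_loaded = loaded_model.predict(loaded_scaler.transform(X_toy))
same_predictions = bool((preds_original == preds_loaded).all())
print(same_predictions)
```

New data has to go through the loaded scaler before prediction, because the model was trained on scaled features.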