(Noa) I have chosen a dataset about adults, most of them native to the United States, whose income is either at most 50K (<=50K) or more than 50K (>50K). The features include workclass, education, marital-status, occupation, relationship, race, gender, capital-gain, capital-loss, and hours per week.

But why would somebody want to know which income class someone belongs to?

Some practical examples of why it could be interesting to predict someone's income:
1. Credit scoring and lending. Banks can use income predictions to decide whether they are willing to lend someone money and to set credit limits and terms.
2. Targeted marketing. Companies can use predicted income to decide whom to target with their advertising.
3. Insurance. Predicted income could be handy for assessing risks and setting premiums.
4. Real estate. Predicted income could help when evaluating mortgage applications.


In [15]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

data = pd.read_csv('adult.csv')
data.dropna(inplace=True)
label_encoder = LabelEncoder()
categorical_columns = ['workclass', 'education', 'marital-status', 'occupation',
                       'relationship', 'race', 'gender', 'native-country', 'income']

for col in categorical_columns:
    data[col] = label_encoder.fit_transform(data[col])

X = data.drop('income', axis=1)
y = data['income']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
classification_report_output = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(classification_report_output)

Accuracy: 0.8344366341363544
Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.91      0.89     11233
           1       0.66      0.60      0.63      3420

    accuracy                           0.83     14653
   macro avg       0.77      0.75      0.76     14653
weighted avg       0.83      0.83      0.83     14653



(Noa) I asked ChatGPT for a basic kNN model layout and it provided me with this model. I noticed that the original file contains some '?' entries (missing values), so I asked ChatGPT to remove them, hoping this would make the kNN model more precise.
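To see why the '?' entries need separate handling, here is a minimal sketch on a tiny made-up table (the rows are invented for illustration and are not taken from adult.csv). pandas reads '?' as an ordinary string, so dropna() on its own leaves those rows in place. Converting '?' to NaN first lets dropna() remove them.

```python
import io

import numpy as np
import pandas as pd

# Tiny invented sample that mimics the '?' placeholders in adult.csv
csv_text = """age,workclass,occupation,income
39,Private,Sales,<=50K
50,?,?,>50K
38,Private,Tech-support,<=50K
28,Local-gov,?,<=50K
"""

toy = pd.read_csv(io.StringIO(csv_text))

# '?' is an ordinary string, so pandas finds no missing values yet
rows_after_plain_dropna = len(toy.dropna())

# Count the '?' entries per column, then mark them as missing and drop those rows
question_marks = (toy == '?').sum()
cleaned = toy.replace('?', np.nan).dropna()

print(f"Rows after dropna() only: {rows_after_plain_dropna}")
print(f"'?' per column:\n{question_marks}")
print(f"Rows after replace + dropna(): {len(cleaned)}")
```

An alternative, assuming the file marks missing values consistently with '?', would be to pass na_values='?' to pd.read_csv so the conversion happens while loading.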

In [19]:
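data = pd.read_csv('adult.csv')
data.replace('?', np.nan, inplace=True)   # mark the '?' entries in the original file as missing
data.dropna(inplace=True)                 # drop the rows that contain them
# ... then the same encoding, scaling, split and kNN steps as in the first cell are re-run ...

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(classification_report_output)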

Accuracy: 0.8269329991892092
Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.90      0.89     10241
           1       0.66      0.60      0.63      3326

    accuracy                           0.83     13567
   macro avg       0.77      0.75      0.76     13567
weighted avg       0.82      0.83      0.82     13567



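(Noa) If I compare the two classification reports, I notice that they are almost the same but not quite. For class 0, precision and recall were both 1 percentage point higher in the first report (0.88 and 0.91 versus 0.87 and 0.90), while class 1 remains the same. Apparently, after I replace the "?" entries, the accuracy drops slightly, from 0.834 to 0.827. The two test sets are not identical, though (14653 versus 13567 rows), so the scores are not strictly comparable.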

(Noa) The accuracy score shows that my kNN model correctly predicted the target (dependent) variable for about 83% of the test data.

Precision means that when the model predicted class 0/1, it was correct X% of the time: 87% for class 0 and 66% for class 1.

Recall means that the model correctly identified X% of the actual class 0/1 instances: 90% for class 0 and 60% for class 1.

F1-score is the harmonic mean of precision and recall for class 0/1, so it is only high when both are high.
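As a worked example of these three metrics, the sketch below computes them by hand for ten invented predictions (the labels are made up for illustration, and per_class_scores is a hypothetical helper, not part of scikit-learn). It mirrors what classification_report prints per class.

```python
# Invented labels for illustration: 1 = income >50K, 0 = income <=50K
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]


def per_class_scores(y_true, y_pred, positive):
    """Return precision, recall and F1 when `positive` is the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correctly predicted
    fp = sum(t != positive and p == positive for t, p in pairs)  # predicted but wrong
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed instances
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


for label in (0, 1):
    precision, recall, f1 = per_class_scores(y_true_toy, y_pred_toy, positive=label)
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

toy_accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(f"accuracy={toy_accuracy:.2f}")
```

In this toy case class 1 has two correct predictions, one false alarm and two misses, so its precision is 2/3 and its recall is 2/4, which shows how a class can have decent precision and still be found only half of the time.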

Observations: 

1. The model performs better at predicting class 0, since both precision and recall are considerably higher.
I asked ChatGPT why this could be the case and it suggested that class 0 is probably the majority class.

2. I notice that class 1 is predicted less effectively. I asked ChatGPT why this could be the case and it suggested that there might be an imbalance in the data. Techniques like resampling, adjusting class weights, or using different algorithms may help improve class 1 predictions.



(Noa) Discussion: 

Class Imbalance. The adult income dataset has an imbalance between the two income classes (<=50K and >50K). If the majority of people in the dataset earn 50K or less, the model may predict the majority class well while underperforming for the minority class (people with income >50K). This is what the reports show: precision and recall for class 1 (high income) are much lower than those for class 0.
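The size of the imbalance can be read from the support column of the first classification report. The short sketch below uses those two numbers to compute the class shares and the accuracy of a hypothetical model that always predicts the majority class, which is a useful baseline for judging the kNN accuracy.

```python
# Support values copied from the first classification report
support = {0: 11233, 1: 3420}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a hypothetical model that always predicts the majority class
baseline_accuracy = support[majority_class] / total

for label, count in support.items():
    print(f"class {label}: {count} rows ({count / total:.1%})")
print(f"Always predicting class {majority_class} gives accuracy {baseline_accuracy:.3f}")
```

The kNN accuracy of about 83% should therefore be compared with a baseline of roughly 77%, not with 50%.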

(Noa) Further steps to improve my kNN model:

1. Use SMOTE (Synthetic Minority Over-sampling Technique) to reduce the class imbalance.
2. Use GridSearchCV for hyperparameter tuning.

SMOTE helps by generating synthetic samples instead of duplicating the existing ones. It does this by interpolating between a minority class data point and one of its nearest minority class neighbours, which reduces the overfitting that can happen with simple oversampling by duplication.
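The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the core idea only, not the imblearn implementation; the points are invented and synthetic_sample is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented minority-class points with two (already scaled) features each
minority = np.array([[1.0, 2.0],
                     [1.5, 1.8],
                     [2.0, 2.5],
                     [3.0, 3.0]])


def synthetic_sample(points, index, k, rng):
    """Create one synthetic point between points[index] and one of its k nearest neighbours."""
    base = points[index]
    distances = np.linalg.norm(points - base, axis=1)
    # Position 0 of the sorted order is the base point itself (distance 0), so skip it
    neighbour_ids = np.argsort(distances)[1:k + 1]
    neighbour = points[rng.choice(neighbour_ids)]
    gap = rng.random()  # random position along the line segment, in [0, 1)
    return base + gap * (neighbour - base)


new_point = synthetic_sample(minority, index=0, k=2, rng=rng)
print(new_point)
```

The new point always lies on the line segment between two real minority points, so it is similar to existing data without being an exact copy.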

Different values of hyperparameters can lead to significant differences in model performance. GridSearchCV systematically tries every combination of the given hyperparameter values, scores each one with cross-validation, and keeps the best settings for the model.
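To make "every combination" concrete, the sketch below enumerates the grid used for the kNN model with plain Python. It only counts and lists the candidates and does not train anything; cv_folds matches the cv=5 setting.

```python
from itertools import product

# Same hyperparameter values as the param_grid used for the kNN model
param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan'],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(f"Candidates: {len(candidates)}")
print(f"Total fits with {cv_folds}-fold cross-validation: {len(candidates) * cv_folds}")
print(f"First candidate: {candidates[0]}")
```

The counts, 4 x 2 x 2 = 16 candidates and 16 x 5 = 80 fits, agree with the "Fitting 5 folds for each of 16 candidates, totalling 80 fits" line that GridSearchCV prints.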

In [23]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score
from imblearn.over_sampling import SMOTE

data = pd.read_csv('adult.csv')
data.dropna(inplace=True)

label_encoder = LabelEncoder()
categorical_columns = ['workclass', 'education', 'marital-status', 'occupation',
                       'relationship', 'race', 'gender', 'native-country', 'income']

for col in categorical_columns:
    data[col] = label_encoder.fit_transform(data[col])

X = data.drop('income', axis=1)
y = data['income']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Apply SMOTE to the training set
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

knn = KNeighborsClassifier()

param_grid = {
    'n_neighbors': [3, 5, 7, 9],         
    'weights': ['uniform', 'distance'],  
    'metric': ['euclidean', 'manhattan'] 
}

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X_train_smote, y_train_smote)

# Print best parameters and best score from GridSearchCV
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Score: {grid_search.best_score_}")

# Evaluate the model with the best parameters on the test set
best_knn = grid_search.best_estimator_
y_pred = best_knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
classification_report_output = classification_report(y_test, y_pred)

print(f"Test Set Accuracy: {accuracy}")
print("Classification Report:")
print(classification_report_output)


Fitting 5 folds for each of 16 candidates, totalling 80 fits
Best Parameters: {'metric': 'manhattan', 'n_neighbors': 3, 'weights': 'distance'}
Best Cross-Validation Score: 0.8886082919141696
Test Set Accuracy: 0.7999044564253054
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.83      0.86     11233
           1       0.56      0.70      0.62      3420

    accuracy                           0.80     14653
   macro avg       0.73      0.77      0.74     14653
weighted avg       0.82      0.80      0.81     14653



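(Noa) After using SMOTE and GridSearchCV, the best cross-validation score is about 89% (0.8886). This means the model correctly predicted the target (dependent) variable for 88.86% of the validation samples during cross-validation. The test set accuracy is clearly lower (about 80%), which might indicate overfitting: the model performs well during cross-validation but struggles on unseen test data. Part of the gap may also come from the setup, because SMOTE was applied before cross-validation, so the validation folds contain synthetic, balanced data while the test set keeps the original imbalance. For class 1, recall improved from 0.60 to 0.70, but precision dropped from 0.66 to 0.56.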

It is possible that kNN is more prone to overfitting than other models, especially with a small k such as the selected n_neighbors=3.
This is my final model. I am keen to see what Rowan thinks of my model and whether he has some improvements for me.

Review

(Rowan) After looking at Noa's code and reading his comments, I can see that he tried his best to improve the overall prediction scores, especially the scores for class 1. I also like that he made his updates to the code in separate code cells, so it is easy to see where and what he changed to make the predictions better. His kNN model also shows the same thing I noticed with my NB model: class 0 is predicted much better than class 1.
To improve this model, something like class weights could maybe be used to further improve the predictions. Something like Lasso, used to select features before kNN, might also help to prevent overfitting.

Comparing the prediction scores of Noa's kNN model and my NB model:
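Noa's best cross-validation score (89%) is higher than the accuracy of my NB model (81%), but that score was measured on the SMOTE-resampled training data. On the test set, his kNN model reaches 83% (first model) and 80% (final model), so the overall accuracy of the two models is close.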
Noa's model is more accurate in class 1 while my NB model is more accurate in class 0.