In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, accuracy_score

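# Load the Adult income dataset.
data = pd.read_csv('adult.csv')

# The file marks missing values with '?'; convert them to NaN and drop those rows.
data.replace('?', np.nan, inplace=True)
data.dropna(inplace=True)

# Integer-encode the categorical columns, including the target 'income'.
# The same encoder object is refit for every column, so afterwards it only
# holds the mapping of the last column.
label_encoder = LabelEncoder()

categorical_columns = ['workclass', 'education', 'marital-status', 'occupation',
                       'relationship', 'race', 'gender', 'native-country', 'income']

for col in categorical_columns:
    data[col] = label_encoder.fit_transform(data[col])

# Separate the features from the target.
X = data.drop('income', axis=1)
y = data['income']

# Standardize the features to zero mean and unit variance.
# Note: the scaler is fit on all rows before the split, so the test rows
# influence the scaling statistics.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Fit Gaussian Naive Bayes and predict the test labels.
nb = GaussianNB()
nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)

# Evaluate on the held-out rows.
accuracy = accuracy_score(y_test, y_pred)
classification_report_output = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(classification_report_output)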


Accuracy: 0.7986290263138498
Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.95      0.88     10241
           1       0.68      0.33      0.45      3326

    accuracy                           0.80     13567
   macro avg       0.75      0.64      0.66     13567
weighted avg       0.78      0.80      0.77     13567



(Rowan) I asked ChatGPT for a basic Naive Bayes model and it provided me with this code. Noa and I noticed that our CSV file had some missing values, which are marked as '?' in the file. I asked ChatGPT to drop the rows containing these question marks in order to make the model more accurate.

(Rowan) I noticed that this model is much better at predicting class 0 than class 1 (recall of 95% versus 33%), so I will ask ChatGPT to improve the class 1 predictions.

In [2]:
import numpy as np

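# Class probabilities per test row; column 1 is P(class 1).
y_prob = nb.predict_proba(X_test)

# Predict class 1 when its probability exceeds 0.4 instead of the default 0.5.
threshold = 0.4
y_pred_adjusted = np.where(y_prob[:, 1] > threshold, 1, 0)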

accuracy_adjusted = accuracy_score(y_test, y_pred_adjusted)
classification_report_adjusted = classification_report(y_test, y_pred_adjusted)

print(f"Adjusted Accuracy: {accuracy_adjusted}")
print("Adjusted Classification Report:")
print(classification_report_adjusted)


Adjusted Accuracy: 0.8037148964398909
Adjusted Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.94      0.88     10241
           1       0.68      0.37      0.48      3326

    accuracy                           0.80     13567
   macro avg       0.75      0.66      0.68     13567
weighted avg       0.79      0.80      0.78     13567



(Rowan) I asked ChatGPT to try and improve the predictions for class 1. It did this by lowering the probability threshold for predicting class 1 from the default of 0.5 to 0.4. After that change the overall accuracy was slightly higher: it went from 79.9% to 80.4%, which both round to 80% in the report.
Comparing the predictions between the classes shows that after the change the precision of class 0 went up by 1 percentage point (81% to 82%) while the recall went down by 1 point (95% to 94%); the f1-score did not change (88%).
For class 1 the precision stayed the same (68%) while the recall was higher by 4 points (33% to 37%) and the f1-score was higher by 3 points (45% to 48%).

Precision means that when the model predicted a given class, it was correct x% of the time. For class 0 this was 82%, while for class 1 it was 68%.
Recall means that the model correctly identified x% of the actual instances of a class. For class 0 this was 94%, and for class 1 it was 37%.
The f1-score is the harmonic mean of precision and recall, so it indicates a balance between the two. For class 0 this was 88%, and for class 1 it was 48%.
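
To make these three definitions concrete, here is a minimal sketch that computes them by hand. The toy labels are made up for illustration and are not taken from our dataset; the function names are my own. The last lines check that the f1-scores in the report at threshold 0.4 follow from the reported precision and recall.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 3 actual positives, 2 predicted positives.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 0, 0, 1]
print(precision_recall_f1(toy_true, toy_pred))

# Rounded precision and recall from the report at threshold 0.4.
print(round(f1_from(0.82, 0.94), 2))  # class 0
print(round(f1_from(0.68, 0.37), 2))  # class 1
```

Because the report only shows two decimals, recomputing f1 from the rounded precision and recall can differ by about 0.01 from the printed f1-score.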

Looking at these results I conclude that the model is much better at predicting class 0 than class 1, since precision, recall and f1-score are all higher for class 0 than for class 1.

(Rowan) discussion:

The model predicts class 0 better than class 1. A likely reason is class imbalance: in our dataset the majority of people earn less than 50K rather than more than 50K (10241 versus 3326 rows in the test set). The model is therefore better at predicting class 0, because it has more class 0 data to train on.
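
To put the imbalance in perspective, the sketch below computes the accuracy of a trivial baseline that always predicts the majority class. It only uses the support counts from the classification reports above; the variable names are my own.

```python
# Test-set support per class, copied from the classification reports.
support = {0: 10241, 1: 3326}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = support[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Roughly three quarters of the test rows are class 0, so an accuracy around 80% is only a few points above always answering class 0. This is why recall and f1-score for class 1 say more about the model than accuracy alone.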

To try and improve the class 1 prediction I will ask ChatGPT for an improvement of the code.

In [3]:
# Lower the threshold further so that more rows are predicted as class 1.
threshold = 0.2
y_pred_more_adjusted = np.where(y_prob[:, 1] > threshold, 1, 0)

accuracy_more_adjusted = accuracy_score(y_test, y_pred_more_adjusted)
classification_report_more_adjusted = classification_report(y_test, y_pred_more_adjusted)

print(f"More Adjusted Accuracy: {accuracy_more_adjusted}")
print("More Adjusted Classification Report:")
print(classification_report_more_adjusted)


More Adjusted Accuracy: 0.8142551780054544
More Adjusted Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.92      0.88     10241
           1       0.66      0.49      0.57      3326

    accuracy                           0.81     13567
   macro avg       0.76      0.71      0.72     13567
weighted avg       0.80      0.81      0.80     13567



(Rowan) I asked ChatGPT to try and improve the model again, specifically the predictions of class 1. This time it lowered the threshold from 0.4 to 0.2, which gave the results above.

Looking at these results and comparing them to the earlier results, I saw that even though I asked the AI to improve class 1, the precision of class 0 also improved.
The precision for class 0 increased from 82% to 85%, the recall decreased from 94% to 92% and the f1-score stayed the same at 88%.
For class 1 the precision decreased from 68% to 66%, while the recall improved from 37% to 49% and the f1-score increased from 48% to 57%.
The overall prediction accuracy also increased, from 80% to 81%.
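
The trade-off between precision and recall for class 1 can be explored by trying several thresholds at once. The sketch below is a plain-Python illustration, not the notebook's code: the labels and probabilities are invented toy values and `sweep_thresholds` is a hypothetical helper. It applies the same rule as above, predicting class 1 when its probability exceeds the threshold.

```python
def sweep_thresholds(y_true, prob_positive, thresholds):
    """Return (threshold, accuracy, precision, recall) for class 1 per threshold."""
    rows = []
    for threshold in thresholds:
        # Predict 1 when P(class 1) exceeds the threshold.
        y_pred = [1 if p > threshold else 0 for p in prob_positive]
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        rows.append((threshold, accuracy, precision, recall))
    return rows


# Hypothetical toy data: 6 rows of class 0 and 4 rows of class 1.
toy_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_prob = [0.05, 0.10, 0.15, 0.25, 0.30, 0.45, 0.22, 0.35, 0.55, 0.90]

rows = sweep_thresholds(toy_true, toy_prob, [0.5, 0.4, 0.2])
for threshold, accuracy, precision, recall in rows:
    print(f"{threshold:.1f}  acc={accuracy:.2f}  prec={precision:.2f}  rec={recall:.2f}")
```

Lowering the threshold can only add predictions of class 1, so recall for class 1 never goes down, while precision usually does. Which threshold is best depends on whether missing a class 1 person or wrongly flagging a class 0 person is worse.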

Some practical examples of why it could be interesting for someone to predict someone else’s income are:
1. Credit scoring and lending. Banks can use income predictions to decide whether they are willing to lend someone money, and to set credit limits and terms.
2. Targeted marketing. Being able to predict someone’s income could help companies decide to whom they will advertise.
3. Insurance. Being able to predict someone's income could be handy for assessing risks and setting premiums.
4. Real estate. Being able to predict income could be an advantage when evaluating mortgage applications.

REVIEW

(Noa) After looking at the models made and the improvements tried, I can conclude that Rowan did a fine job. What the model means is well explained and a discussion is included. The only thing that I am missing is more information on the steps that he has taken and on what he did to improve the model.

Compared to my own model, the kNN model, Rowan's NB model is more accurate in class 0 and mine is more accurate in class 1. The overall accuracy is better with the NB model.

Possible further improvements are assigning higher weights to class 1 in order to make the model more sensitive to misclassifications in class 1. Another improvement could be using SMOTE to oversample the underrepresented class, which is in this case class 1.
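
As a rough sketch of the weighting idea, the snippet below computes "balanced" class weights, where each class gets the weight n_samples / (n_classes * class_count). The labels are invented toy values and `balanced_class_weights` is a hypothetical helper, not part of the notebook. The resulting per-row weights could then be passed to `GaussianNB.fit` through its `sample_weight` argument; whether that actually improves class 1 here has not been tested.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical toy labels: 6 rows of class 0 and 2 rows of class 1.
toy_labels = [0] * 6 + [1] * 2

weights = balanced_class_weights(toy_labels)

# One weight per row, e.g. for nb.fit(X_train, y_train, sample_weight=sample_weight).
sample_weight = [weights[label] for label in toy_labels]

print(weights)
```

With these weights both classes contribute the same total weight during training, so the minority class is no longer outvoted simply because it has fewer rows.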
