## Business Objective
The business objective is to predict whether an individual earns more than $50,000 a year (`>50K`) or not (`<=50K`) from demographic and work-related features. Such a model can support tasks like targeted marketing, talent acquisition, or workforce planning, where an estimate of salary level informs the decision.

## Constraints
- The dataset is provided in two parts (train and test), so care must be taken to avoid data leakage by ensuring that the test dataset is only used for final model evaluation, not training.
  

In [3]:
import pandas as pd
import numpy as np

In [4]:
df=pd.read_csv("SalaryData_Train.csv")
df.head()

Unnamed: 0,age,workclass,education,educationno,maritalstatus,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native,Salary
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [5]:
df.shape

(30161, 14)

In [7]:
df.value_counts('Salary')

Salary
<=50K    22653
>50K      7508
Name: count, dtype: int64
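The class counts above show that the data is imbalanced, which matters when reading accuracy figures later. As a rough reference point, the sketch below computes the accuracy a trivial classifier would get by always predicting the majority class. The counts are copied from the output above; the variable names are illustrative only.

```python
# Class counts copied from the value_counts output above.
counts = {"<=50K": 22653, ">50K": 7508}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

print(f"Total rows: {total}")
print(f"Majority class: {majority_label}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any model trained on this data should be compared against this baseline rather than against 50%.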

### Preprocessing

In [12]:
df.isnull().sum()

age              0
workclass        0
education        0
educationno      0
maritalstatus    0
occupation       0
relationship     0
race             0
sex              0
capitalgain      0
capitalloss      0
hoursperweek     0
native           0
Salary           0
dtype: int64

In [14]:
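from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
# Encode every text column as integer codes. le is refit on each column,
# so after the loop it only remembers the classes of the last column (Salary).
for column in df.columns:
    if df[column].dtypes=='object':
        df[column]=le.fit_transform(df[column])

# Show the encoded target (0 = <=50K, 1 = >50K)
print(df["Salary"])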

0        0
1        0
2        0
3        0
4        0
        ..
30156    0
30157    1
30158    0
30159    0
30160    1
Name: Salary, Length: 30161, dtype: int32
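Reusing a single `LabelEncoder` for every column works, but the mapping from integer codes back to the original labels is lost for all columns except the last one. One possible alternative, sketched here on a tiny made-up frame (`demo` and `encoders` are hypothetical names, not part of the notebook), keeps one fitted encoder per column so the codes stay interpretable.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny made-up frame standing in for the salary data.
demo = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "age": [38, 39, 53],
    "Salary": ["<=50K", ">50K", "<=50K"],
})

encoders = {}  # column name -> fitted LabelEncoder
for column in demo.columns:
    # Only encode non-numeric (text) columns.
    if not pd.api.types.is_numeric_dtype(demo[column]):
        encoders[column] = LabelEncoder()
        demo[column] = encoders[column].fit_transform(demo[column])

# classes_ is sorted, so each label's position is its integer code.
salary_mapping = {
    str(label): code for code, label in enumerate(encoders["Salary"].classes_)
}
print(salary_mapping)
```

With the encoders kept in a dictionary, `encoders["Salary"].inverse_transform(...)` can turn predictions back into the original labels.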


### Define Features and Labels

In [17]:
X=df.drop(columns=["Salary"])
y=df["Salary"]

In [19]:
X.shape

(30161, 13)

In [21]:
y.shape

(30161,)

### Split the data

In [34]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [36]:
X_train.shape

(24128, 13)

In [38]:
X_test.shape

(6033, 13)

In [40]:
y_train.shape

(24128,)

In [42]:
y_test.shape

(6033,)

### Initialize and Train the Naive Bayes Model

In [45]:
from sklearn.naive_bayes import GaussianNB
nb_model=GaussianNB()

In [47]:
nb_model.fit(X_train,y_train)

### Make Predictions on the Test Set

In [50]:
y_pred=nb_model.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

### Evaluate the model

In [53]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score

In [55]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.7908171722194597


In [57]:
print("Classification Report:\n", classification_report(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.95      0.87      4490
           1       0.69      0.33      0.44      1543

    accuracy                           0.79      6033
   macro avg       0.75      0.64      0.66      6033
weighted avg       0.78      0.79      0.76      6033



In [59]:
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Confusion Matrix:
 [[4268  222]
 [1040  503]]


- TP (503): cases where the model correctly predicted the positive class, i.e. salary >50K
- TN (4268): cases where the model correctly predicted the negative class, i.e. salary <=50K
- FP (222): cases where the model predicted >50K but the person actually earns <=50K
- FN (1040): cases where the model predicted <=50K but the person actually earns >50K
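The figures in the classification report for class 1 can be recovered from these four counts. The short calculation below is only a worked check, using the numbers from the confusion matrix above.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 4268, 222, 1040, 503

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted >50K that really are >50K
recall = tp / (tp + fn)      # share of actual >50K that the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The low recall reflects the 1040 false negatives: the model misses roughly two out of three people who earn more than 50K.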

### Run on the SalaryData_Test.csv file

In [63]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

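# Load the test file (this cell fits a new model on part of it; see the split below)
test_data = pd.read_csv("SalaryData_Test.csv")
test_data.head()

# One-hot encode categorical features (the target becomes the boolean column "Salary_ >50K")
test_data = pd.get_dummies(test_data, drop_first=True)

# Separate features and target
X = test_data.drop(columns=["Salary_ >50K"])
y = test_data["Salary_ >50K"]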

# Hold out 30% of the test file for validation; the model below is fit on the other 70%
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

y_val_pred = nb_model.predict(X_val)

conf_matrix = confusion_matrix(y_val, y_val_pred)
print("Confusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", classification_report(y_val, y_val_pred))
print("\nAccuracy:", accuracy_score(y_val, y_val_pred))


Confusion Matrix:
 [[2924  462]
 [ 326  806]]

Classification Report:
               precision    recall  f1-score   support

       False       0.90      0.86      0.88      3386
        True       0.64      0.71      0.67      1132

    accuracy                           0.83      4518
   macro avg       0.77      0.79      0.78      4518
weighted avg       0.83      0.83      0.83      4518


Accuracy: 0.8255865427180168


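### About 83% accuracy on the held-out 30% of SalaryData_Test.csv

This figure comes from a model that was refit on 70% of the test file with one-hot encoding, so it is not directly comparable to the 79% obtained above with label encoding on the training file.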
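The constraint stated at the top asks for the test file to be used only for final evaluation. A minimal sketch of that workflow follows, assuming the preprocessing is fit on the training data and then reused unchanged on the test data. The two small frames are made-up stand-ins for `SalaryData_Train.csv` and `SalaryData_Test.csv`, and the pipeline with `OrdinalEncoder` is an illustrative choice, not the method used in the cells above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Made-up stand-ins for the two CSV files.
train_df = pd.DataFrame({
    "age": [25, 47, 38, 52, 29, 44],
    "workclass": ["Private", "State-gov", "Private",
                  "Self-emp", "Private", "State-gov"],
    "Salary": ["<=50K", ">50K", "<=50K", ">50K", "<=50K", ">50K"],
})
test_df = pd.DataFrame({
    "age": [31, 50],
    "workclass": ["Private", "Federal-gov"],  # second value is unseen in training
    "Salary": ["<=50K", ">50K"],
})

categorical = ["workclass"]
encode = ColumnTransformer(
    [("cat",
      OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
      categorical)],
    remainder="passthrough",  # numeric columns pass through unchanged
)
model = make_pipeline(encode, GaussianNB())

# Fit the encoder and the classifier on the training data only.
model.fit(train_df.drop(columns=["Salary"]), train_df["Salary"])

# The test data is touched once, for prediction and scoring.
X_final = test_df.drop(columns=["Salary"])
test_pred = model.predict(X_final)
test_accuracy = model.score(X_final, test_df["Salary"])
print("Test accuracy:", test_accuracy)
```

Wrapping the encoder and the classifier in one pipeline means the category-to-code mapping learned from the training file is the one applied to the test file, and categories that appear only in the test file get the placeholder code -1 instead of raising an error.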