#### Income Prediction

#### Overview
This Python script predicts whether an individual's income exceeds $50,000 from four demographic and economic features. Financial institutions, government agencies, and businesses use income predictions of this kind for targeted marketing, policy-making, and resource allocation.

#### Problem Description
The task is a binary classification of individual income into two categories: <=$50,000 and >$50,000. The classifier uses four features: age, education level (`education.num`), capital gain, and hours worked per week.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

In [2]:
# Load Dataset
dataset = pd.read_csv('salary.csv')
print(dataset.shape)
print(dataset.head())

(32561, 5)
   age  education.num  capital.gain  hours.per.week income
0   90              9             0              40  <=50K
1   82              9             0              18  <=50K
2   66             10             0              40  <=50K
3   54              4             0              40  <=50K
4   41             10             0              40  <=50K


In [3]:
# Mapping Salary Data to Binary Value
dataset['income'] = dataset['income'].map({'<=50K': 0, '>50K': 1}).astype(int)
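
The mapping step fails at `astype(int)` if any label is missing from the dictionary, because unmapped values become NaN. The following is a minimal sketch of a more defensive version, assuming labels might carry stray whitespace or unexpected values. The toy Series `raw` is invented for illustration and is not taken from `salary.csv`.

```python
import pandas as pd

# Toy labels standing in for the 'income' column (hypothetical values).
raw = pd.Series(['<=50K', ' >50K', '>50K ', 'unknown'])

mapping = {'<=50K': 0, '>50K': 1}

# Strip surrounding whitespace before mapping so ' >50K' matches '>50K'.
mapped = raw.str.strip().map(mapping)

# Labels that are not in the mapping come back as NaN; list them explicitly.
unmapped = raw[mapped.isna()]
print(unmapped.tolist())

# Convert to int only for the rows that mapped cleanly.
clean = mapped.dropna().astype(int)
print(clean.tolist())
```

Inspecting `unmapped` before converting makes a labelling problem visible instead of surfacing it as a conversion error.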

In [4]:
print(dataset)

       age  education.num  capital.gain  hours.per.week  income
0       90              9             0              40       0
1       82              9             0              18       0
2       66             10             0              40       0
3       54              4             0              40       0
4       41             10             0              40       0
...    ...            ...           ...             ...     ...
32556   22             10             0              40       0
32557   27             12             0              38       0
32558   40              9             0              40       1
32559   58              9             0              40       0
32560   22              9             0              20       0

[32561 rows x 5 columns]


In [5]:
# Split the dataset into X (input features) and Y (target variable)
X = dataset.drop(columns=['income'])
Y = dataset['income']

In [6]:
# Splitting Dataset into Train & Test
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.22, random_state=0
)
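
Because the two income classes are not equally frequent, a purely random split can leave the train and test sets with slightly different class proportions. One option is the `stratify` argument of `train_test_split`. The sketch below uses synthetic arrays (`X_demo`, `y_demo`, both hypothetical) rather than the notebook's data, so it only illustrates the mechanism.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 4 features, 25% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 250 + [0] * 750)

# stratify=y_demo keeps the class ratio nearly identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.22, random_state=0, stratify=y_demo
)

print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

In the notebook this would correspond to passing `stratify=Y`, which changes the split and therefore the reported scores.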

In [7]:
# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
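
Fitting the scaler on the whole training set before cross-validation means each validation fold has already influenced the scaling statistics. The effect is usually small, but it can be avoided by placing the scaler inside a pipeline so it is refitted on each training fold. This is a sketch on synthetic data from `make_classification`, not the notebook's dataset, and `pipe` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a four-feature binary classification problem.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# The scaler is part of the estimator, so each CV fold fits its own scaler
# on that fold's training portion only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='accuracy')

print(scores.mean())
```

A fitted pipeline also scales new inputs automatically in `predict`, which removes the need to call the scaler by hand at prediction time.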

In [8]:
# List of classifiers to try
classifiers = {
    'Logistic Regression': LogisticRegression(random_state=0),
    'k-Nearest Neighbors': KNeighborsClassifier(n_neighbors=22, metric='minkowski', p=2),
    'Support Vector Machine': SVC(random_state=0),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(random_state=0),
    'Naive Bayes': GaussianNB()
}

In [9]:
# Evaluate each classifier using cross-validation and keep the mean scores
cv_means = {}
for clf_name, clf in classifiers.items():
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
    cv_means[clf_name] = scores.mean()
    print(f"{clf_name} Cross-Validation Accuracy: {cv_means[clf_name]}")

Logistic Regression Cross-Validation Accuracy: 0.8097805461115943
k-Nearest Neighbors Cross-Validation Accuracy: 0.8172619695426435
Support Vector Machine Cross-Validation Accuracy: 0.821081107478222
Decision Tree Cross-Validation Accuracy: 0.7986769359087055
Random Forest Cross-Validation Accuracy: 0.8052131673883351
Naive Bayes Cross-Validation Accuracy: 0.7917468408593049


In [10]:
# Choose the best-performing classifier, reusing the stored scores
# instead of running cross-validation a second time
best_classifier_name = max(cv_means, key=cv_means.get)
best_classifier = classifiers[best_classifier_name]
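
The mean accuracies of the candidates lie within a few percentage points of each other, so the fold-to-fold spread and a trivial baseline help judge whether the differences are meaningful. The sketch below reports mean and standard deviation per model and adds a majority-class `DummyClassifier` for reference. The data is synthetic and the `candidates` and `summary` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the salary dataset.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

candidates = {
    'Majority-class baseline': DummyClassifier(strategy='most_frequent'),
    'Logistic Regression': LogisticRegression(random_state=0),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

# Store (mean, std) of the five fold accuracies for each candidate.
summary = {}
for name, clf in candidates.items():
    fold_scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring='accuracy')
    summary[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {summary[name][0]:.3f} +/- {summary[name][1]:.3f}")
```

If two models differ by less than their standard deviations, picking the one with the highest mean is close to an arbitrary choice.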

In [11]:
# Train the best classifier on the entire training set
best_classifier.fit(X_train, y_train)

In [12]:
# Make predictions on the test set
y_pred = best_classifier.predict(X_test)

In [13]:
# Evaluate the performance of the best classifier
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

In [14]:
print(f"\nBest Classifier: {best_classifier_name}")
print("Confusion Matrix:")
print(cm)
print(f"Accuracy of the Model: {accuracy * 100}%")


Best Classifier: Support Vector Machine
Confusion Matrix:
[[5243  200]
 [1089  632]]
Accuracy of the Model: 82.0072585147962%
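
Accuracy alone hides how the errors are distributed across the two classes. The confusion matrix printed above follows scikit-learn's layout (rows are true classes, columns are predicted classes), so precision, recall, and a majority-class baseline can be derived directly from its four counts. The following is a worked calculation using only those printed numbers.

```python
# Counts copied from the printed confusion matrix:
# [[TN, FP],
#  [FN, TP]]
tn, fp = 5243, 200
fn, tp = 1089, 632

total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted >50K that are truly >50K
recall = tp / (tp + fn)           # share of true >50K cases the model finds
baseline = (tn + fp) / total      # accuracy of always predicting <=50K

print(f"accuracy={accuracy:.4f}")
print(f"precision={precision:.4f}")
print(f"recall={recall:.4f}")
print(f"baseline={baseline:.4f}")
```

The model recovers roughly 37% of the true >50K cases, and its accuracy is about six percentage points above the roughly 76% obtained by always predicting <=50K.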


In [15]:
# Predict whether a new employee earns more than 50K from age, education, capital gain, and hours per week
age = int(input("Enter New Employee's Age: "))
edu = int(input("Enter New Employee's Education: "))
cg = int(input("Enter New Employee's Capital Gain: "))
wh = int(input("Enter New Employee's Hours Per Week: "))
# Reuse the training column names so the scaler receives the features it was fitted on
new_emp = pd.DataFrame([[age, edu, cg, wh]], columns=X.columns)
result = best_classifier.predict(sc.transform(new_emp))
print(age, edu, cg, wh)
print(result)

# predict returns a one-element array, so compare its single entry
if result[0] == 1:
    print("Employee might have a salary above 50K.")
else:
    print("Employee might not have a salary above 50K.")

20 2 2 2
[0]
Employee might not have a salary above 50K.
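
For repeated predictions, the interactive cell can be wrapped in a small function. This is a sketch under stated assumptions: `predict_income` and `FEATURES` are hypothetical names, the training data is synthetic, and the labelling rule is invented so the example runs without `salary.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Column names as shown in the dataset preview.
FEATURES = ['age', 'education.num', 'capital.gain', 'hours.per.week']


def predict_income(model, scaler, age, edu, cg, wh):
    """Return 1 if the model predicts >50K for one person, else 0."""
    row = pd.DataFrame([[age, edu, cg, wh]], columns=FEATURES)
    return int(model.predict(scaler.transform(row))[0])


# Synthetic stand-in data (assumption, not the real dataset).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'age': rng.integers(18, 90, 200),
    'education.num': rng.integers(1, 17, 200),
    'capital.gain': rng.integers(0, 5000, 200),
    'hours.per.week': rng.integers(1, 80, 200),
})
y_demo = (X_demo['education.num'] > 9).astype(int)  # made-up labelling rule

scaler = StandardScaler().fit(X_demo)
model = SVC(random_state=0).fit(scaler.transform(X_demo), y_demo)

prediction = predict_income(model, scaler, 20, 2, 2, 2)
print(prediction)
```

Passing the model and scaler as arguments keeps the function independent of notebook globals, so it can be tested outside the notebook.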


