# Introduction

In today's world, income is an essential factor that determines the quality of life. Understanding the factors that contribute to high income levels can help individuals, businesses, and governments make better decisions.

In this Kaggle notebook, we will explore the Income Classification dataset from the UC Irvine Machine Learning Repository. The task is to predict whether an individual's income exceeds $50,000 per year based on features such as age, education level, occupation, and more.

The Income Classification problem is a binary classification task where the target variable is either 0 or 1, representing whether the individual's income is at most $50,000 per year (0) or above that threshold (1).

We will be using various machine learning techniques to build a model that can accurately predict an individual's income level based on the given features. This problem provides an excellent opportunity to explore different machine learning algorithms and techniques and compare their performance on a real-world dataset.

Let's dive into the data and see what insights we can uncover!

# Libraries and Data Import

To begin, let's import the necessary libraries that we'll be using throughout this notebook:

In [None]:
# Data Manipulation Libraries
import pandas as pd
import numpy as np

# Data Visualization Libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Machine Learning Libraries
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Machine Learning Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier, GradientBoostingClassifier, AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

Now, let's import the Income Classification dataset from the UC Irvine Machine Learning Repository using the read_csv() function from pandas:

In [None]:
df = pd.read_csv('/kaggle/input/income-classification/income_evaluation.csv')
df.head()

The head() function displays the first 5 rows of the dataset, allowing us to get a sense of what the data looks like. 

In [None]:
df.info()

In [None]:
df.describe(include='all')

The info() function reports the number of non-null values and the data type of each column. The describe() function gives summary statistics; with include='all' it covers the categorical columns (count, unique, top, freq) as well as the numerical ones.

# Data Cleaning

Before we can start modeling, we need to clean the Income Classification dataset by tidying the column names, handling missing values, and removing duplicates.

In [None]:
# Rename columns with spaces to column names without spaces
df.columns = df.columns.str.replace(' ', '')

In [None]:
# Replace the " ?" placeholder for missing values with " Others"
df = df.replace(' ?',' Others')

In [None]:
# Remove duplicate instances
df.drop_duplicates(inplace=True)
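Before replacing the placeholder, it can help to know how many values are affected in each column. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and its values are assumptions made for illustration, not rows from the dataset); the same comparison could be applied to `df`.

```python
import pandas as pd

# Hypothetical toy frame; values carry a leading space, like the ones in this notebook
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov", " ?"],
    "occupation": [" Sales", " ?", " Tech-support", " Sales"],
    "age": [39, 50, 38, 53],
})

# Count the " ?" placeholders per column before replacing them
placeholder_counts = (toy == " ?").sum()
print(placeholder_counts)

# Same replacement as in the cleaning cell above
toy_clean = toy.replace(" ?", " Others")
```

Columns with a high placeholder count are the ones where the " Others" category will carry the most weight later on.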

# Data Visualization

Here, I attempted to use a pie chart to assess the balance of the dataset by visualizing the proportions of each category in the target variable (income level).

In [None]:
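# Take the labels from the value_counts() index so they always match the counts
income_counts = df['income'].value_counts()
plt.pie(income_counts, labels=income_counts.index, autopct='%1.1f%%')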
plt.show()
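The class balance shown in the pie chart can also be read as plain numbers. This is a small sketch using a hypothetical `income` series in place of `df['income']`; the labels and counts are made up for illustration.

```python
import pandas as pd

# Hypothetical labels standing in for df['income']
income = pd.Series([" <=50K", " <=50K", " >50K", " <=50K"], name="income")

# Fraction of rows in each class; the index carries the matching labels
class_share = income.value_counts(normalize=True)
print(class_share)
```

A strongly uneven split is a hint that accuracy alone may be misleading, which is one reason to look at precision and recall later.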

In [None]:
capital_gain = df[df['capital-gain'] > 0]
capital_loss = df[df['capital-loss'] > 0]

In [None]:
capital_gain.tail(3)

Here, I calculated the average 'capital-gain' for each income class, using only the rows where the value is non-zero, to investigate whether it is related to the target variable. Comparing the two averages gives a first indication of whether larger capital gains go together with higher income.

In [None]:
xx = capital_gain['income'].value_counts().keys()
yy=[capital_gain[capital_gain['income'] == i]['capital-gain'].mean() for i in xx]
plt.bar(xx, yy,width = 0.4)
 
plt.xlabel("Income")
plt.ylabel("Capital Gain")
plt.show()

I also calculated the average value of the 'capital-loss' column to investigate if non-zero values in this column have any relationship with the target variable.

In [None]:
xx = capital_loss['income'].value_counts().keys()
yy=[capital_loss[capital_loss['income'] == i]['capital-loss'].mean() for i in xx]
plt.bar(xx, yy,width = 0.4)
 
plt.title('Capital loss')
plt.xlabel("Income")
plt.ylabel("Capital Loss")
plt.show()
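The per-class averages behind the two bar charts can also be computed with a single groupby. The sketch below assumes a hypothetical toy frame and a helper name, `mean_nonzero_by_income`, that does not appear in the notebook.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook
toy = pd.DataFrame({
    "income": [" <=50K", " >50K", " <=50K", " >50K", " <=50K"],
    "capital-gain": [0, 15000, 2000, 5000, 4000],
    "capital-loss": [0, 0, 1500, 1900, 0],
})

def mean_nonzero_by_income(frame, column):
    """Mean of `column` per income class, using only rows where it is > 0."""
    nonzero = frame[frame[column] > 0]
    return nonzero.groupby("income")[column].mean()

gain_means = mean_nonzero_by_income(toy, "capital-gain")
loss_means = mean_nonzero_by_income(toy, "capital-loss")
print(gain_means)
print(loss_means)
```

The resulting Series could be passed to a bar plot directly, with its index as the x-axis labels.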

I used the value_counts function with normalize=True to see the share of each value in the 'native-country' column. This shows which countries are most common in the dataset and whether some countries are heavily over-represented.

In [None]:
print(df['native-country'].value_counts(normalize=True)[:5])

Here, I examined the 'relationship' column and how it relates to the target variable 'income' using a countplot. The plot shows the frequency of each relationship category split by income class, which helps to identify categories that are more often associated with higher or lower incomes.

In [None]:
sns.countplot(data=df, x="relationship", hue="income")
plt.xticks(rotation=60)

To see whether 'marital-status' relates to 'income' differently from 'relationship', I created the same countplot for the 'marital-status' column. Comparing the two plots shows where the income split across marital statuses resembles or differs from the split across relationship categories, which gives a first idea of how much information each variable adds about the target.

In [None]:
sns.countplot(data=df, x="marital-status", hue="income")
plt.xticks(rotation=60)

I used a countplot to examine the relationship between 'workclass' and 'income' and observe the distribution of income levels across different work classes.

In [None]:
sns.countplot(data=df, x="workclass", hue="income")
plt.xticks(rotation=60)

In [None]:
sns.countplot(data=df, x="occupation", hue="income")
plt.xticks(rotation=80)
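Countplots show absolute counts, so large categories can dominate the picture. A row-normalized cross-tabulation gives the share of each income class within a category instead. This is a minimal sketch on hypothetical toy data; the same call could be used with any of the categorical columns plotted above.

```python
import pandas as pd

# Hypothetical toy data; category values carry a leading space as in the notebook
toy = pd.DataFrame({
    "relationship": [" Husband", " Husband", " Own-child", " Own-child", " Wife", " Husband"],
    "income": [" >50K", " <=50K", " <=50K", " <=50K", " >50K", " >50K"],
})

# normalize="index" makes every row sum to 1
share_by_relationship = pd.crosstab(toy["relationship"], toy["income"], normalize="index")
print(share_by_relationship)
```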

After visualizing the data and identifying the columns that are not useful for the modeling process, I removed those columns from the dataset. This helps to simplify the dataset and prevent irrelevant or redundant information from impacting the model's performance.

In [None]:
df_new = df.drop(['marital-status','race','fnlwgt', 'education','native-country','workclass'],axis=1)
df_new.head()

In [None]:
df_new.head(3)

In [None]:
df_new.select_dtypes(include=np.number).hist(figsize=(8,8))

In [None]:
xx = pd.cut(df_new['hours-per-week'], bins=[0,20,40,70,100], include_lowest=True, labels=['0-20', '20-40', '40-70','70-100'])
sns.countplot(x=xx, hue=df["income"])

In [None]:
xx = pd.cut(df_new['age'], bins=[17,23,40,60,100], include_lowest=True, labels=['17-23', '23-40', '40-60','60-100'])
sns.countplot(x=xx, hue=df["income"])

# Data Preprocessing

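First, I encode the categorical data as numerical values: the binary columns 'income' and 'sex' are mapped to 0/1, and 'occupation' and 'relationship' are one-hot encoded.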

In [None]:
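# Map the binary columns to 0/1 (1 = income above 50K, as described in the introduction).
# Assigning the result avoids chained inplace replacement, which newer pandas versions warn about.
df_new['income'] = df_new['income'].map({' <=50K': 0, ' >50K': 1})
df_new['sex'] = df_new['sex'].map({' Male': 1, ' Female': 0})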
df_new.head(2)

In [None]:
one_hot_encoded_data = pd.get_dummies(df_new, columns = ['occupation', 'relationship'])
one_hot_encoded_data
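To make the effect of one-hot encoding concrete, here is a small sketch on a hypothetical two-column frame. The `dtype=int` argument is an addition for readability (0/1 instead of booleans) and is not used in the notebook's own cell.

```python
import pandas as pd

# Hypothetical toy frame; category values keep their leading space
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "relationship": [" Husband", " Wife", " Husband"],
})

# One new 0/1 column per category; non-listed columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["relationship"], dtype=int)
print(encoded.columns.tolist())
```

Note that the leading space in the category values is carried into the new column names (for example `relationship_ Husband`).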

Splitting the data into input features (X) and target variable (y).

In [None]:
X = one_hot_encoded_data.drop('income',axis=1)
y = one_hot_encoded_data.income
X.shape,y.shape

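Splitting the data into training and testing sets using train_test_split. The split comes before scaling so that the test set does not influence the scaling statistics.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25, random_state=42)
X_train.shape, X_test.shape

Standardizing the input features using StandardScaler, fitted on the training set only.

In [None]:
scaler = StandardScaler()

# Fit on the training data, then apply the same transformation to the test data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)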
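When a scaler is fitted once and cross-validation is then run on the scaled array, every validation fold has already contributed to the scaling statistics. One way to avoid this is to put the scaler and the model in a scikit-learn Pipeline, so the scaler is refitted on the training part of each fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the income features; `X_demo`, `y_demo`, and `pipe` are illustrative names, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the income features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# The scaler is fitted inside each training fold, never on the validation fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(pipe_scores.mean())
```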

# Model Building

For the model building step, I first utilized a decision tree algorithm to create a classifier. To ensure the model's generalization performance, I performed cross-validation.

In [None]:
clf = DecisionTreeClassifier(random_state=42)

# 10-fold cross-validation on the training set
scores = cross_val_score(clf, X_train, y_train, cv = 10)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
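Besides an integer, the `cv` argument of cross_val_score also accepts a splitter object, which gives control over shuffling and the random seed. With an integer and a classifier, scikit-learn uses stratified folds without shuffling. The following sketch passes an explicit StratifiedKFold on synthetic data; the fold count of 4 and the names `splitter` and `tree_scores` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Explicit splitter: stratified, shuffled, and reproducible
splitter = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
tree_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=splitter
)
print(tree_scores)
```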

After using a decision tree model with cross-validation, I decided to try out a different algorithm to see if it would perform better. I chose to use logistic regression, which is a commonly used algorithm for binary classification problems like the one in this dataset.

In [None]:
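clf = LogisticRegression(random_state=0).fit(X_train, y_train)
# Print explicitly: a bare expression is only displayed when it is the last line of a cell
print("Train accuracy:", clf.score(X_train, y_train))
print("Test accuracy: ", clf.score(X_test, y_test))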

scores = cross_val_score(clf, X_train, y_train, cv = 10)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

After using logistic regression, I observed that the model performed better than the decision tree algorithm. To further evaluate the model's performance, I computed a confusion matrix on the training set, from which accuracy, precision, recall, and F1-score can be derived.

In [None]:
y_pred = clf.predict(X_train)
confusion_matrix(y_train, y_pred)
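To show how the metrics follow from a confusion matrix, here is a small worked sketch with hypothetical labels and predictions (ten made-up samples, not notebook output). For binary labels 0/1, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)
```

The hand-computed values agree with `precision_score`, `recall_score`, and `f1_score` from scikit-learn on the same arrays.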

To improve the model's performance, I utilized ensemble learning, which combines the predictions of multiple models to create a single prediction. Specifically, I used a soft-voting classifier that averages the predicted class probabilities of five different models and picks the class with the highest average. Because the models make different kinds of errors, averaging them can reduce overfitting and often increases accuracy.

In [None]:
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = GradientBoostingClassifier()
clf4 = XGBClassifier()
clf5 = AdaBoostClassifier()
eclf1 = VotingClassifier(
    estimators=[('lr', clf1), ('rf', clf2), ('gbc', clf3), ('xgb', clf4), ('abc', clf5)],
    voting='soft',
)

In [None]:
eclf1 = eclf1.fit(X_train, y_train)
y_pred = eclf1.predict(X_train)
confusion_matrix(y_train, y_pred)
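Soft voting can be reproduced by hand: average the `predict_proba` outputs of the fitted member models and take the class with the highest average. The sketch below checks this on synthetic data with only two members to keep it short; `voter`, `X_demo`, and `y_demo` are illustrative names, and the smaller ensemble is an assumption made for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
    ],
    voting="soft",
).fit(X_demo, y_demo)

# Average the class probabilities of the fitted member models (equal weights)
member_probas = [est.predict_proba(X_demo) for est in voter.estimators_]
manual_average = np.mean(member_probas, axis=0)

# The predicted class is the one with the highest averaged probability
manual_labels = voter.classes_[np.argmax(manual_average, axis=1)]
```

With `voting='hard'`, the ensemble would instead take a majority vote over the predicted labels, and `predict_proba` would not be available.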

# Model Evaluation

In [None]:
scores = cross_val_score(eclf1, X_train, y_train, cv = 5)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

After training the voting classifier, I evaluated its performance on the testing set to assess how well the model generalizes to new data. This step is important because it allows us to estimate the model's true performance on unseen data. I utilized various evaluation metrics such as accuracy, precision, recall, and F1-score. These metrics help us understand how well the model is performing in terms of correctly classifying income levels.

In [None]:
y_pred = eclf1.predict(X_test)
print(classification_report(y_test, y_pred))

# Conclusion

After performing data exploration, preprocessing, model building, and evaluation, we can conclude that:

1. The data contains information about individuals' demographic, education, and work-related attributes, which can be used to predict their income level.
2. The dataset was preprocessed to handle missing values, encode categorical features, and scale the numerical features.
3. We trained multiple models, including Decision Tree, Logistic Regression, and a Voting Classifier, to predict the income level.
4. The Voting Classifier outperformed the other models in terms of accuracy, precision, recall, and F1-score. It achieved an accuracy of 86.1% on the testing set.
5. The Voting Classifier model can be used to predict income levels for individuals based on their demographic, education, and work-related attributes.

Overall, the project demonstrates the importance of data exploration, preprocessing, and model selection in developing accurate and reliable machine learning models.