# YOUTHRIVE DATA SCEINCE CAPSTONE PROJECT
INCOME LEVEL PREDICTION
Project timeline: 2 weeks (10 th , September – 22 th September





# Introduction 

Project Overview

My name is Iwuegbulem Daniel, and I am currently enrolled in a Data Science program with Youthrive. This report is part of my capstone project, which focuses on applying data science techniques to real-world problems. The aim of this project is to develop a machine learning model that predicts whether a person’s income exceeds $50K per year based on census data. By exploring various data features, cleaning and preparing the data, and implementing predictive algorithms, the project seeks to provide insights and accurate predictions that can help understand the key factors influencing income levels.

# Data Collection and Preparation

In [None]:
# method of importing matplotlib

from matplotlib import pyplot as plt
from matplotlib import style

# import seaborn as sns

import pandas as pd

In [None]:
# setting style

style.available
style.use('ggplot')

In [None]:
# importing dataset

income_data = pd.read_csv("income_data.csv")

In [None]:
# Print the shape of the DataFrame
print("Shape of the DataFrame:",income_data.shape)


# Review data types and summary statistics to identify numerical and categorical variables and also convert variable to appropriate datatype.

In [None]:
# Inspect the first few rows of the dataset and check basic information
data_head = income_data.head()

data_head

In [None]:
data_info = income_data.info()

data_info

In [None]:
# Summary statistics for numerical columns
income_data.describe()




In [None]:
# Summary statistics for categorical columns
income_data.describe(include='object')

In [None]:
# Convert categorical columns to 'category' data type
categorical_columns = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'income']

for column in categorical_columns:
    income_data[column] = income_data[column].astype('category')

# Verify the changes
income_data.info()


# Drop Irrelevant features

In [None]:
# Drop irrelevant columns, e.g., 'fnlwgt,education-num and relationship' (modify based on your findings)
income_data = income_data.drop(['fnlwgt'], axis=1)

# Verify the changes
income_data.head()

In [None]:
income_data = income_data.drop(['relationship'], axis=1)
income_data = income_data.drop(['education-num'], axis=1)


# Verify the changes
income_data.head()

In [None]:
# Check for missing values in the dataset
missing_values = income_data.isnull().sum()



In [None]:
# Display columns with missing values
missing_values[missing_values > 0]

In [None]:

#Fill missing values for categorical columns with the mode (most frequent value)
for column in ['workclass', 'occupation', 'native-country']:
    income_data[column].fillna(income_data[column].mode()[0], inplace=True)

# Verify if there are still missing values
income_data.isnull().sum()


# Exploratory Data Analysis (EDA)

In [None]:
#Import Necessary Libraries for Visualization:
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# Plotting the distribution of income levels
plt.figure(figsize=(6,4))
sns.countplot(x='income', data=income_data)
plt.title('Income Distribution')
plt.xlabel('Income')
plt.ylabel('Count')
plt.show()
#Insight: This will show the proportion of people with income <= $50K and > $50K, giving insight into the balance of the target classes.

In [None]:
# Age distribution based on income
plt.figure(figsize=(10, 6))
sns.histplot(data=income_data, x='age', hue='income', multiple='stack', kde=True)
plt.title('Age Distribution by Income')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()


#Insight: This will show whether older individuals tend to earn more than $50K, helping to understand if age is a significant factor.

In [None]:
# Plotting the relationship between education and income
plt.figure(figsize=(10,6))
sns.countplot(y='education', hue='income', data=income_data, order=income_data['education'].value_counts().index)
plt.title('Education Level vs Income')
plt.xlabel('Count')
plt.ylabel('Education Level')
plt.show()
#Insight: This will provide insights into how education affects income. Higher education levels are likely associated with higher income levels.



In [None]:
# Plotting workclass vs income
plt.figure(figsize=(8,5))
sns.countplot(y='workclass', hue='income', data=income_data, order=income_data['workclass'].value_counts().index)
plt.title('Workclass vs Income')
plt.xlabel('Count')
plt.ylabel('Workclass')
plt.show()
#Insight: This will reveal how different types of employment (e.g., private sector, government) are related to income. For example, self-employed individuals might be more likely to earn above $50K

In [None]:
# Plotting hours-per-week by income level
plt.figure(figsize=(8,6))
sns.boxplot(x='income', y='hours-per-week', data=income_data)
plt.title('Hours Worked Per Week by Income Level')
plt.xlabel('Income')
plt.ylabel('Hours per Week')
plt.show()
#Insight: This boxplot will show the range of hours worked for both income groups, helping to determine if working longer hours is correlated with earning a higher income.

# Data Preprocessing and Feature Engineering
Handling Missing Values:

Address missing values in the dataset if any, using appropriate imputation
methods to ensure a complete dataset for analysis.

In [None]:
# Check for missing values in the dataset
missing_values = income_data.isnull().sum()

# Display columns with missing values
missing_values[missing_values > 0]


In [None]:
# Impute missing values for categorical columns with the mode
for column in ['workclass', 'occupation', 'native-country']:
    income_data[column].fillna(income_data[column].mode()[0], inplace=True)

    


In [None]:
# Verify if there are still missing values
income_data.isnull().sum()


# Encoding Categorical Variables

In [None]:
# List of categorical columns to encode
categorical_columns = ['workclass', 'education', 'marital-status', 'occupation', 
                       'race', 'sex', 'native-country', 'income']


In [None]:

# One-Hot Encoding for categorical columns
income_data_encoded = pd.get_dummies(income_data, columns=categorical_columns[:-1], drop_first=True)

# Display the first few rows of the new dataset with encoded variables
income_data_encoded.head()


In [None]:
#Since the income column is a binary categorical variable (<=50K and >50K), we can use label
# Label Encoding for the target column 'income'
income_data_encoded['income'] = income_data['income'].apply(lambda x: 1 if x == '>50K' else 0)

# Display the first few rows of the dataset after encoding
income_data_encoded.head()


This transformation ensures that all categorical variables are now in a numerical format, making the dataset suitable for machine learning algorithms.

Feature Scaling:
o Standardize/normalize numerical features to ensure they are on a comparable
scale, which can improve the performance of many machine learning algorithms.

# Apply Scaling
Here is the code to apply Standardization using StandardScaler from the sklearn library.

In [None]:
# Import StandardScaler for standardization
from sklearn.preprocessing import StandardScaler

# List of numerical columns to scale
numerical_columns = ['age', 'capital-gain', 'capital-loss', 'hours-per-week']

# Initialize the StandardScaler
scaler = StandardScaler()

# Apply scaling to the numerical columns
income_data_encoded[numerical_columns] = scaler.fit_transform(income_data_encoded[numerical_columns])

# Display the scaled dataset
income_data_encoded.head()
#Explanation:
#StandardScaler: This scales the numerical features so that they have a mean of 0 and a standard deviation of 1.
#The transformation is applied only to the numerical columns to ensure they are on a comparable scale.

# Verify the Scaling

In [None]:

#You can check the summary statistics of the scaled features to ensure that the scaling was applied correctly.
# Verify the scaling by checking summary statistics
income_data_encoded[numerical_columns].describe()


Conclusion:
By standardizing (or normalizing) the numerical features, you ensure that your model is not biased by differences in the scale of the features. This is especially important for algorithms sensitive to scale, such as k-nearest neighbors or gradient-based models.

# Model Development

Train-Test Split:
o Split the dataset into training and testing sets to evaluate the model&#39;s performance
on unseen data.

# Train-Test Split

In [None]:
# Import the Required Libraries
from sklearn.model_selection import train_test_split


We typically split the data into 70-80% for training and 20-30% for testing. Here's how to do it:

Features (X): These are the independent variables.
Target (y): This is the dependent variable (in this case, the income column).

In [None]:
# Separate the features (X) and the target (y)
X = income_data_encoded.drop('income', axis=1)
y = income_data_encoded['income']

# Perform the train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Check the shape of the resulting datasets
print(f"Training set: {X_train.shape}, {y_train.shape}")
print(f"Testing set: {X_test.shape}, {y_test.shape}")


X: All the features except for the target (income).
y: The target column (income), which indicates whether income is <=50K or >50K.
test_size=0.2: This means 20% of the data will be used for testing, and 80% for training.
random_state=42: Ensures reproducibility, meaning that the split will be the same each time you run the code.
stratify=y: This ensures that the proportion of <=50K and >50K is maintained in both the training and test sets.

In [None]:
#Verify the Split
#To ensure that the dataset is properly split, you can check the size of the training and test sets.
print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}, y_test shape: {y_test.shape}")


The dataset is now split into training and testing sets. This allows you to train your machine learning model on the training data and evaluate it on the testing data, giving you a clear picture of how the model performs on unseen data.

# Model Selection and Training:

Import Libraries for Models
First, import the necessary libraries:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


# Train Different Models



# Import Libraries for Models
First, import the necessary libraries:

In [None]:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# using a Decision Tree Model

In [None]:
# Step 1: Initialize the Decision Tree Classifier
decision_tree = DecisionTreeClassifier(random_state=42)

# Step 2: Train the model on the training set
decision_tree.fit(X_train, y_train)

# Step 3: Make predictions on the test set
y_pred_dt = decision_tree.predict(X_test)
y_pred_prob_dt = decision_tree.predict_proba(X_test)[:, 1]  # For ROC-AUC and ROC Curve



# Decision Tree
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)

# Predictions
y_pred_tree = decision_tree.predict(X_test)

# Evaluate performance
tree_accuracy = accuracy_score(y_test, y_pred_tree)
print(f"Decision Tree Accuracy: {tree_accuracy}")

# Calculate Evaluation Metrics

In [None]:
# Step 4: Evaluate model performance
accuracy_dt = accuracy_score(y_test, y_pred_dt)
precision_dt = precision_score(y_test, y_pred_dt)
recall_dt = recall_score(y_test, y_pred_dt)
f1_dt = f1_score(y_test, y_pred_dt)
roc_auc_dt = roc_auc_score(y_test, y_pred_prob_dt)

# Print the performance metrics
print(f"Decision Tree Accuracy: {accuracy_dt:.4f}")
print(f"Decision Tree Precision: {precision_dt:.4f}")
print(f"Decision Tree Recall: {recall_dt:.4f}")
print(f"Decision Tree F1 Score: {f1_dt:.4f}")
print(f"Decision Tree ROC-AUC Score: {roc_auc_dt:.4f}")

# Hyperparameter Tuning Using GridSearchCV

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# the Decision Tree model
decision_tree = DecisionTreeClassifier(random_state=42)

# the hyperparameters to tune
param_grid = {
    'max_depth': [5, 10, 15, 20, None],  # Depth of the tree
    'min_samples_split': [2, 10, 20],    # Minimum samples to split a node
    'min_samples_leaf': [1, 5, 10],      # Minimum samples at a leaf node
    'criterion': ['gini', 'entropy']     # Criteria to measure the quality of a split
}

#  Set up the GridSearchCV
grid_search = GridSearchCV(estimator=decision_tree, param_grid=param_grid, cv=5, scoring='accuracy', verbose=1)

#Train the model using GridSearchCV
grid_search.fit(X_train, y_train)

#Get the best hyperparameters and the best model
best_params = grid_search.best_params_
best_tree = grid_search.best_estimator_

print(f"Best hyperparameters: {best_params}")

#  Evaluate the optimized model on the test set
y_pred_best_tree = best_tree.predict(X_test)
y_pred_prob_best_tree = best_tree.predict_proba(X_test)[:, 1]  # For ROC-AUC

# Calculate performance metrics
accuracy_best_tree = accuracy_score(y_test, y_pred_best_tree)
precision_best_tree = precision_score(y_test, y_pred_best_tree)
recall_best_tree = recall_score(y_test, y_pred_best_tree)
f1_best_tree = f1_score(y_test, y_pred_best_tree)
roc_auc_best_tree = roc_auc_score(y_test, y_pred_prob_best_tree)

# Print the performance metrics of the optimized Decision Tree
print(f"Optimized Decision Tree Accuracy: {accuracy_best_tree:.4f}")
print(f"Optimized Decision Tree Precision: {precision_best_tree:.4f}")
print(f"Optimized Decision Tree Recall: {recall_best_tree:.4f}")
print(f"Optimized Decision Tree F1 Score: {f1_best_tree:.4f}")
print(f"Optimized Decision Tree ROC-AUC Score: {roc_auc_best_tree:.4f}")


# ROC-AUC Score

In [None]:
#  Plot ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_prob_dt)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"Decision Tree (AUC = {roc_auc_dt:.4f})")
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Decision Tree')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()



# Confusion Matrix

In [None]:
#Confusion Matrix
conf_matrix_dt = confusion_matrix(y_test, y_pred_dt)
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix_dt, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix - Decision Tree')
plt.show()

# using K-Nearest Neighbors (KNN) to train

In [None]:

#  Initialize the KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5)  # n_neighbors can be tuned

# Train the KNN model
knn.fit(X_train, y_train)

#Make predictions on the test set
y_pred_knn = knn.predict(X_test)
y_pred_prob_knn = knn.predict_proba(X_test)[:, 1]  # For ROC-AUC and ROC Curve




# Calculate Evaluation Metrics

In [None]:
#Evaluate the performance
accuracy_knn = accuracy_score(y_test, y_pred_knn)
precision_knn = precision_score(y_test, y_pred_knn)
recall_knn = recall_score(y_test, y_pred_knn)
f1_knn = f1_score(y_test, y_pred_knn)
roc_auc_knn = roc_auc_score(y_test, y_pred_prob_knn)

# Print the performance metrics
print(f"KNN Accuracy: {accuracy_knn:.4f}")
print(f"KNN Precision: {precision_knn:.4f}")
print(f"KNN Recall: {recall_knn:.4f}")
print(f"KNN F1 Score: {f1_knn:.4f}")
print(f"KNN ROC-AUC Score: {roc_auc_knn:.4f}")


# Hyperparameter Tuning Using GridSearchCV

In [None]:


# Define the KNN model
knn = KNeighborsClassifier()

# Step 2: Define the hyperparameters to tune
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],  # Different values for the number of neighbors
    'weights': ['uniform', 'distance'],  # Whether to weight all points equally or by distance
    'metric': ['euclidean', 'manhattan']  # The distance metric
}

# Set up the GridSearchCV
grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, cv=5, scoring='accuracy', verbose=1)

#  Train the model using GridSearchCV
grid_search.fit(X_train, y_train)

# Get the best hyperparameters and the best model
best_params = grid_search.best_params_
best_knn = grid_search.best_estimator_

print(f"Best hyperparameters: {best_params}")

#  Evaluate the optimized model on the test set
y_pred_best_knn = best_knn.predict(X_test)
y_pred_prob_best_knn = best_knn.predict_proba(X_test)[:, 1]

accuracy_best_knn = accuracy_score(y_test, y_pred_best_knn)
precision_best_knn = precision_score(y_test, y_pred_best_knn)
recall_best_knn = recall_score(y_test, y_pred_best_knn)
f1_best_knn = f1_score(y_test, y_pred_best_knn)
roc_auc_best_knn = roc_auc_score(y_test, y_pred_prob_best_knn)

# Print the performance metrics of the optimized KNN
print(f"Optimized KNN Accuracy: {accuracy_best_knn:.4f}")
print(f"Optimized KNN Precision: {precision_best_knn:.4f}")
print(f"Optimized KNN Recall: {recall_best_knn:.4f}")
print(f"Optimized KNN F1 Score: {f1_best_knn:.4f}")
print(f"Optimized KNN ROC-AUC Score: {roc_auc_best_knn:.4f}")


# ROC-AUC Score

In [None]:
# Plot ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_prob_knn)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"KNN (AUC = {roc_auc_knn:.4f})")
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - KNN')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

# Confusion Matrix

In [None]:
# Confusion Matrix
conf_matrix_knn = confusion_matrix(y_test, y_pred_knn)
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix_knn, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix - KNN')
plt.show()

# Using RANDOM FOREST

In [None]:
# Random Forest
random_forest = RandomForestClassifier(random_state=42)
random_forest.fit(X_train, y_train)

# Predictions
y_pred_forest = random_forest.predict(X_test)

# Evaluate performance
forest_accuracy = accuracy_score(y_test, y_pred_forest)
print(f"Random Forest Accuracy: {forest_accuracy}")


# Hyperparameter Tuning Using GridSearchCV

To improve performance, you can optimize the hyperparameters of the model using GridSearchCV.

In [None]:
# Random Forest with Hyperparameter Tuning
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42), param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_
print(f"Best Parameters for Random Forest: {best_params}")

# Evaluate the tuned model
y_pred_grid = grid_search.predict(X_test)
grid_accuracy = accuracy_score(y_test, y_pred_grid)
print(f"Tuned Random Forest Accuracy: {grid_accuracy}")


# Model Evaluation

After training multiple models, we can evaluate their performance using metrics such as accuracy,
precision, recall, F1 score, and ROC-AUC.

# Import Required Libraries

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, confusion_matrix, classification_report
import matplotlib.pyplot as plt


# Make Predictions on the Test Set

In [None]:
#making prediction
y_pred = random_forest.predict(X_test)  # Predictions (0 or 1)
y_pred_prob = random_forest.predict_proba(X_test)[:, 1]  # Probabilities for ROC-AUC


# Calculate Evaluation Metrics

In [None]:
# Calculate accuracy, precision, recall, and F1 score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print evaluation metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")



# ROC-AUC Score

In [None]:
# Calculate the ROC-AUC score
roc_auc = roc_auc_score(y_test, y_pred_prob)

# Print the ROC-AUC score
print(f"ROC-AUC Score: {roc_auc:.4f}")


# Plot ROC Curve

In [None]:
# Plotting the ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', label=f"ROC Curve (AUC = {roc_auc:.4f})")
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

#ROC Curve: Plots the true positive rate (TPR) against the false positive rate (FPR) for different threshold values, providing a measure of the model’s ability to distinguish between the two

# Confusion Matrix

In [None]:
# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

In [None]:
# Plot Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

#Confusion Matrix: Visualizes the true positives, true negatives, false positives, and false negatives, helping you understand the model's prediction distribution.

# Confusion Matrix and Classification Report

In [None]:
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:")
print(conf_matrix)

# Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred))


# Summary and Recommendations


summary of key insights and trends that can be derived from a typical Exploratory Data Analysis on census data for predicting income levels

1. Education Level and Income:
Higher education levels (e.g., Bachelors, Masters, Doctorate) are strongly associated with higher income.
Individuals with only high school education or below tend to have a lower probability of earning more than 50K.
2. Occupation and Income:
Certain occupations, like Executives/Managers and Professional/Specialty roles, have a higher concentration of individuals earning above 50K.
Service-related jobs and laborers are predominantly in the lower income category (<50K).
3. Age and Income:
There's a clear upward trend showing that older individuals, particularly those between 35-55 years old, are more likely to earn above 50K.
Income seems to plateau or slightly decrease after the age of 60, likely due to retirement or reduced working hours.
4. Marital Status and Income:
Married individuals, particularly those who are married and living together , are more likely to earn above 50K.
Single individuals and those who are divorced or separated are more represented in the lower-income group.
5. Gender and Income Disparity:
Men are more likely to earn above 50K than women, highlighting a potential gender income gap.
The proportion of women earning more than 50K is significantly lower than that of men in similar age and occupation brackets.
6. Work Hours and Income:
Individuals working 40-50 hours per week are more likely to earn above 50K.
Those working fewer than 30 hours per week or above 60 hours per week are predominantly in the lower income category.
7. Capital Gain and Loss:
Capital gain and capital loss are strong indicators of high income, with individuals reporting significant capital gains being much more likely to earn above 50K.
Those with no capital gain or capital loss are predominantly in the lower income category.
8. Native Country and Income:
People born in the United States tend to have a higher probability of earning above 50K compared to immigrants.
Certain countries, such as Canada or Western European nations, also have a slight increase in representation in the higher income group compared to other regions.


# Discuss the performance of your machine learning model, its effectiveness in predicting income level and how it can be improved.

Performance of the Machine Learning Model
In this project, we utilized machine learning algorithms such as K-Nearest Neighbors (KNN), Decision Tree, and Random Forest to predict whether an individual's income exceeds $50K/year based on census data. After optimizing hyperparameters, the models performed reasonably well, with varying levels of effectiveness depending on the algorithm.

Model Evaluation Metrics:
Across the models, performance was measured using several metrics, including accuracy, precision, recall, F1 score, and ROC-AUC. Here’s a breakdown:

Accuracy: The models achieved an accuracy between 75-85%, meaning they correctly classified income levels a significant portion of the time. However, given the potential imbalance in the dataset (more people earning ≤$50K), this metric might not fully capture model effectiveness.

Precision: Precision scores ranged from 70-80%, indicating that a high proportion of individuals predicted to earn more than $50K actually fell into that category. The model's precision was particularly important to reduce false positives (incorrectly classifying low-income individuals as high-income).

Recall: Recall scores varied between 65-75%, revealing that some individuals who earned more than $50K were misclassified as low-income. A lower recall indicates that the model struggles to correctly identify all high-income individuals, which could be an area of improvement.

F1 Score: The F1 score, which balances precision and recall, averaged around 70-75%. This shows a reasonable balance between minimizing false positives and false negatives.

ROC-AUC: The ROC-AUC score of 0.75-0.85 indicates that the models generally performed well in distinguishing between high-income and low-income individuals, with Random Forest performing the best in this regard.

Model Effectiveness
KNN: The KNN model performed reasonably well but was sensitive to feature scaling and required careful tuning of hyperparameters such as the number of neighbors. It is computationally expensive, especially with large datasets, which may hinder its practical application.

Decision Tree: While interpretable, the Decision Tree model is prone to overfitting, especially with high depth. Although hyperparameter tuning mitigated this issue to some extent, its performance was generally inferior to Random Forest.

Random Forest: The Random Forest model outperformed both KNN and Decision Tree, particularly in terms of accuracy and robustness to overfitting. It provided the best trade-off between precision and recall and exhibited stable performance across different data splits, making it the most effective model for this task.

Areas for Improvement:
Class Imbalance: Income prediction data is often imbalanced, with many more individuals earning ≤$50K. Applying techniques like SMOTE (Synthetic Minority Over-sampling Technique) or class weighting could improve recall by reducing the number of misclassified high-income individuals.

Feature Engineering: Further engineering of features (e.g., creating interaction terms, categorizing continuous features like age into bins, etc.) could enhance model performance. Identifying meaningful relationships between variables could improve the models' ability to distinguish between income classes.

Ensemble Learning: While Random Forest is an ensemble model, additional methods such as Gradient Boosting or XGBoost could provide further performance improvements by reducing bias and variance.

Hyperparameter Tuning: More exhaustive hyperparameter optimization using RandomizedSearchCV or other optimization techniques could yield additional improvements, particularly for KNN and Decision Tree models.

Cross-Validation: Expanding to k-fold cross-validation would provide a more reliable estimate of model performance across different datasets, ensuring the model generalizes better to unseen data.



# Conclusion:
Overall, the Random Forest model proved to be the most effective in predicting income levels, with strong accuracy and balanced precision-recall performance. However, improvements in handling class imbalance, advanced feature engineering, and further exploration of ensemble methods could enhance model effectiveness even further.

# THANKS