<h1><center>INCOME PREDICTION</center></h1>        

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [None]:
data = pd.read_csv("incomeData.csv")

In [None]:
data.head(3)

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
data.duplicated().sum()

In [None]:
data.drop_duplicates(inplace=True)

In [None]:
data.duplicated().sum()

In [None]:
data.dtypes

In [None]:
data.rename(columns={'nan':'age'},inplace=True)

In [None]:
data.columns

In [None]:
data.describe()

In [None]:
data['native-country'].unique()

In [None]:
corr_inc = data[['age','fnlwgt','education-num','capital-gain','capital-loss','hours-per-week']].corr()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(corr_inc,cbar=True,annot=True,cmap='Blues') # other cmap options: coolwarm, Reds, Greens, viridis
plt.title('Correlation Matrix of Numerical Features')
plt.show()

#### There is not much correlation between any pair of numerical features
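To back the reading of the heatmap with numbers, the pairwise correlations can be ranked by absolute value. The following is a minimal sketch on a small synthetic frame; `rank_correlations` and `toy` are hypothetical names, and the toy values are generated only for illustration, not taken from `incomeData.csv`.

```python
import numpy as np
import pandas as pd

def rank_correlations(frame: pd.DataFrame) -> pd.Series:
    """Return absolute pairwise Pearson correlations, strongest first."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    return corr.where(mask).stack().dropna().sort_values(ascending=False)

# Hypothetical stand-in for the numeric columns of `data`.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(17, 90, size=200),
    "hours-per-week": rng.integers(1, 99, size=200),
})
# Build one column that deliberately depends on age so a strong pair exists.
toy["education-num"] = (toy["age"] // 10) + rng.integers(0, 3, size=200)

pairs = rank_correlations(toy)
print(pairs)
```

On the real data, the same helper could be called on `corr_inc`'s source columns; values close to 0 at the top of the ranking would support the statement above.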

## Check for error values

In [None]:
data['age'].unique()

In [None]:
data['workclass'].unique()

In [None]:
data['fnlwgt'].unique()

In [None]:
data['education'].unique()

In [None]:
data['education-num'].unique()

In [None]:
data['marital-status'].unique()

In [None]:
data['occupation'].unique()

In [None]:
data['relationship'].unique()

In [None]:
data['race'].unique()

In [None]:
data['sex'].unique()

In [None]:
data['capital-gain'].unique()

In [None]:
data['capital-loss'].unique()

In [None]:
data['hours-per-week'].unique()

In [None]:
data['native-country'].unique()

In [None]:
data['Income'].unique()

Three columns contain missing values, encoded as the placeholder ' ?': ['workclass'], ['occupation'] and ['native-country'].
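Instead of reading every `unique()` output by eye, the placeholder can be counted per column in one pass. This is a small sketch on hypothetical rows; `count_placeholder` and `toy` are illustrative names, and the function assumes the placeholder is a question mark that may carry surrounding spaces.

```python
import pandas as pd

def count_placeholder(frame: pd.DataFrame, token: str = "?") -> pd.Series:
    """Count cells equal to `token` (ignoring surrounding spaces) in non-numeric columns."""
    counts = {}
    for name in frame.columns:
        col = frame[name]
        if pd.api.types.is_numeric_dtype(col):
            continue  # numeric columns cannot hold the text placeholder
        n = int((col.astype(str).str.strip() == token).sum())
        if n > 0:
            counts[name] = n
    return pd.Series(counts, dtype="int64")

# Hypothetical rows that mimic the leading-space style of the dataset.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": [" Private", " ?", " State-gov", " ?"],
    "occupation": [" Sales", " ?", " Tech-support", " Sales"],
    "sex": [" Male", " Female", " Male", " Male"],
})

placeholder_counts = count_placeholder(toy)
print(placeholder_counts)
```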

## Treating the error values

In [None]:
data['workclass'] = data['workclass'].replace(' ?',np.nan)

In [None]:
data['workclass'].unique()

In [None]:
data['workclass'].isnull().sum()

In [None]:
data['workclass'] = data['workclass'].fillna(data['workclass'].mode()[0])

In [None]:
data['workclass'].isnull().sum()

In [None]:
data['occupation'] = data['occupation'].replace(' ?',np.nan)

In [None]:
data['occupation'].unique()

In [None]:
data['occupation'].isnull().sum()

In [None]:
# fill missing occupations with the most frequent occupation
data['occupation'] = data['occupation'].fillna(data['occupation'].mode()[0])

In [None]:
data['occupation'].isnull().sum()

In [None]:
data['native-country'] = data['native-country'].replace(' ?',np.nan)

In [None]:
data['native-country'].isnull().sum()

In [None]:
# fill missing countries with the most frequent country
data['native-country'] = data['native-country'].fillna(data['native-country'].mode()[0])

In [None]:
data['native-country'].isnull().sum()
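The replace-then-fill steps in this section follow the same pattern for each column, so they can be sketched as one loop in which every column is filled with its own most frequent value. `fill_with_own_mode` and `toy` are hypothetical names used only for this illustration; the sketch assumes each column has at least one non-missing value, otherwise `mode()` would be empty.

```python
import numpy as np
import pandas as pd

def fill_with_own_mode(frame: pd.DataFrame, columns, token: str = " ?") -> pd.DataFrame:
    """Return a copy where `token` in each listed column is replaced by that column's mode."""
    out = frame.copy()
    for name in columns:
        col = out[name].replace(token, np.nan)
        # mode() ignores NaN, so the placeholder itself can never be chosen.
        out[name] = col.fillna(col.mode()[0])
    return out

# Hypothetical rows in the same format as the dataset.
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " Private", " State-gov"],
    "native-country": [" United-States", " United-States", " ?", " India"],
})

clean = fill_with_own_mode(toy, ["workclass", "native-country"])
print(clean)
```

Working on a copy and assigning the result back to the column avoids relying on `inplace=True` on a selected column, which newer pandas versions warn about.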

## Univariate Analysis

### Income Distribution

In [None]:
# income class distribution
plt.figure(figsize=(6,4))
sns.countplot(data=data,x='Income',palette='BuGn_r')
plt.title("Income class distribution")
plt.xlabel("Income")
plt.ylabel("Count")
plt.show()

* The dataset is imbalanced, with a significantly higher number of individuals earning <=50K.
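The degree of imbalance can be quantified as class shares, which also gives the accuracy a model would reach by always predicting the majority class. A minimal sketch with hypothetical labels (the 3:1 ratio below is made up for illustration and is not the dataset's actual ratio):

```python
import pandas as pd

# Hypothetical labels standing in for data['Income'].
income = pd.Series([" <=50K"] * 3 + [" >50K"], name="Income")

shares = income.value_counts(normalize=True)   # fraction of rows per class
majority_class = shares.idxmax()               # label of the largest class
baseline_accuracy = shares.max()               # accuracy of always predicting it

print(shares)
print(majority_class, baseline_accuracy)
```

Any later classifier would need to beat this baseline to be useful, so plain accuracy should be read against it.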

### Impact of Education 

In [None]:
# number of employees at each education level
sns.histplot(data['education'])  # palette has no effect without a hue variable
plt.xticks(rotation=90)
plt.title("Impact of Education on Employment")
plt.show()

* The number of employees varies widely with education level.
* More than 10,000 employees are high-school graduates (HS-grad).

### Number of employees in each sector

In [None]:
# Analysing employment opportunities in each sector
sns.countplot(x=data['workclass'])
plt.xticks(rotation=45)
plt.title("Number of employees in each workclass")
plt.show()

* The private sector provides the most employment opportunities.

In [None]:
# Getting information about people with high income
income_high=data[data['Income']==" >50K"]

In [None]:
income_high

### Work class distribution on high income

In [None]:
#Analysing work class of high income individuals
highworkclass=income_high['workclass'].value_counts()
plt.pie(highworkclass,labels=highworkclass.index, autopct='%1.1f%%', startangle=90,colors=sns.color_palette('pastel'))
plt.title("Workclass Distribution Among >50K Earners")
plt.tight_layout()
plt.show()

* Most high-income individuals belong to Private, Self-employed, or Government sectors.
* Private sector is likely dominant, reflecting broader employment trends.
* Workclass appears to be associated with income potential, possibly through benefits, job roles, or access to higher salaries.



In [None]:
#filtering low income individuals
low_income=data[data['Income'] == ' <=50K']
low_income

### Work class distribution on low income

In [None]:
#Analysing work class of low income individuals
lowworkclass=low_income['workclass'].value_counts()
plt.pie(lowworkclass,labels=lowworkclass.index, autopct='%1.1f%%', startangle=90,colors=sns.color_palette('pastel'))
plt.title("Workclass Distribution Among <=50K Earners")
plt.tight_layout()
plt.show()

* Most low-income individuals belong to Private, Self-employed, or Government sectors.
* Private sector is likely dominant, reflecting broader employment trends.
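Both pie charts show how each income group is composed, so a large workclass dominates both. A complementary view is the share of >50K earners within each workclass. The sketch below uses made-up rows; `toy`, `composition` and `high_rate` are hypothetical names, and the counts are chosen only to show that the two views can disagree.

```python
import pandas as pd

# Hypothetical rows: Private is the largest group but has the lowest >50K rate.
toy = pd.DataFrame({
    "workclass": [" Private"] * 12 + [" Self-emp-inc"] * 2 + [" State-gov"] * 2,
    "Income": [" <=50K"] * 9 + [" >50K"] * 3      # Private
              + [" >50K"] * 2                     # Self-emp-inc
              + [" <=50K", " >50K"],              # State-gov
})

# View 1 (what the pie chart shows): composition of the >50K group.
composition = toy.loc[toy["Income"] == " >50K", "workclass"].value_counts(normalize=True)

# View 2: share of >50K earners inside each workclass (rows sum to 1).
rates = pd.crosstab(toy["workclass"], toy["Income"], normalize="index")
high_rate = rates[" >50K"].sort_values(ascending=False)

print(composition)
print(high_rate)
```

In this toy example Private makes up half of the high earners but has the lowest within-group rate, which is why the second view is the more direct check of whether workclass relates to income.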

In [None]:
data['occupation'].unique()

### Occupation of high income individuals

In [None]:

sns.countplot(x=income_high['occupation'],palette='pastel')
plt.xticks(rotation=90)
plt.title("Occupation of high income individuals") 
plt.show()

* Most high-income individuals work in Exec-managerial, Prof-specialty, Craft-repair, or Sales occupations.
* Exec-managerial is likely dominant.
* Occupation influences an individual's income.

### Age Distribution Across Income Groups

In [None]:
plt.figure(figsize=(8, 5))
sns.histplot(data=data, x='age', hue='Income', kde=True, bins=30, palette='Set2', multiple='stack')
plt.title('Age Distribution by Income Group')
plt.xlabel('Age')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

* Higher income ( >50K ) is more common among people aged 35–55.
* People earning ≤50K are spread more broadly across ages, including younger demographics.
* This suggests age is a significant predictor of income level — potentially nonlinear in influence.
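The age pattern can be checked numerically by binning age and computing the share of >50K earners per band. This is a sketch on hypothetical rows; the bin edges are an assumption chosen for illustration, not values used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical rows in the same format as the dataset.
toy = pd.DataFrame({
    "age": [19, 23, 28, 37, 41, 45, 52, 58, 63, 70],
    "Income": [" <=50K", " <=50K", " <=50K", " >50K", " <=50K",
               " >50K", " >50K", " <=50K", " <=50K", " <=50K"],
})

# Assumed bin edges; each interval is open on the left and closed on the right.
bins = [16, 25, 35, 45, 55, 65, 90]
age_band = pd.cut(toy["age"], bins=bins)

# True for high earners; the mean of a boolean column is the share of True values.
is_high = toy["Income"].str.strip().eq(">50K")
rate_by_band = is_high.groupby(age_band, observed=False).mean()

print(rate_by_band)
```

A rate that rises and then falls across the bands would be consistent with the nonlinear influence mentioned above.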

### Education Level Distribution by Income

In [None]:
plt.figure(figsize=(10, 5))
sns.countplot(data=data, x='education', hue='Income', order=data['education'].value_counts().index, palette='Set2')
plt.title('Education Level vs Income Group')
plt.xlabel('Education Level')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

* Individuals with Bachelors, Masters, and Doctorate degrees have a higher proportion of >50K income.
* Lower education levels (like HS-grad, Some-college, or 11th) are mostly associated with ≤50K income.
* Education is a strong determinant of income, and can be a powerful feature for predictive modeling.
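The count plot shows absolute counts, so the proportions mentioned above are easier to compare in a small table of group size and >50K rate per education level. A minimal sketch with made-up rows (`toy` and `summary` are hypothetical names):

```python
import pandas as pd

# Hypothetical rows in the same format as the dataset.
toy = pd.DataFrame({
    "education": [" HS-grad"] * 4 + [" Bachelors"] * 4 + [" Doctorate"] * 2,
    "Income": [" <=50K", " <=50K", " <=50K", " >50K",   # HS-grad
               " <=50K", " <=50K", " >50K", " >50K",    # Bachelors
               " >50K", " >50K"],                       # Doctorate
})

summary = (
    toy.assign(high=toy["Income"].str.strip().eq(">50K"))
       .groupby("education")["high"]
       .agg(n="count", high_rate="mean")   # group size and share of >50K earners
       .sort_values("high_rate", ascending=False)
)

print(summary)
```

Keeping the group size `n` next to the rate helps to spot levels where a high rate rests on very few rows.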

### Occupation vs Income

In [None]:
plt.figure(figsize=(12, 5))
sns.countplot(data=data, x='occupation', hue='Income', order=data['occupation'].value_counts().index, palette='Set2')
plt.title('Occupation vs Income Group')
plt.xlabel('Occupation')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

* Occupations such as Exec-managerial, Prof-specialty, Craft-repair, and Sales have higher proportions of >50K earners.
* Roles like Handlers-cleaners, Machine-op-inspct, and Other-service are dominated by ≤50K earners.
* This suggests that occupation is strongly associated with income class and is likely a strong predictive feature.
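Because occupation and income class are both categorical, Pearson correlation does not apply; one common way to express the strength of such an association is Cramér's V, which ranges from 0 (no association) to 1 (perfect association). The sketch below computes it with NumPy from a contingency table, without bias correction. `cramers_v` is a hypothetical helper, the toy series are made up, and the function assumes both inputs have at least two categories.

```python
import numpy as np
import pandas as pd

def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V (no bias correction) for two categorical series."""
    observed = pd.crosstab(a, b).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Hypothetical data: one pairing with perfect association, one with none.
occupation = pd.Series([" Exec-managerial"] * 4 + [" Other-service"] * 4)
income_same = pd.Series([" >50K"] * 4 + [" <=50K"] * 4)
income_mixed = pd.Series([" >50K", " >50K", " <=50K", " <=50K"] * 2)

v_perfect = cramers_v(occupation, income_same)
v_none = cramers_v(occupation, income_mixed)
print(v_perfect, v_none)
```

On the real data this could be called as `cramers_v(data['occupation'], data['Income'])` and compared with the value for other categorical columns.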

### Boxplot of Hours Worked per Week by Income Group

In [None]:
plt.figure(figsize=(8, 5))
sns.boxplot(data=data, x='Income', y='hours-per-week', palette='Set2')
plt.title('Hours Worked per Week by Income Group')
plt.xlabel('Income Group')
plt.ylabel('Hours per Week')
plt.tight_layout()
plt.show()

* The median for both income groups is around 40 hours per week.
* High-income earners have a wider range and more outliers (some working 60–80+ hours).
* The low-income group shows a tighter, more consistent spread.
* This suggests that working more hours does not guarantee higher income, but many high earners do work longer hours.
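The quantities the boxplot draws (quartiles, median, and points beyond 1.5 times the interquartile range) can also be computed per income group. This is a sketch on made-up hours; `iqr_outliers` and `toy` are hypothetical names, and the 1.5 factor is the usual boxplot whisker rule.

```python
import pandas as pd

# Hypothetical rows in the same format as the dataset.
toy = pd.DataFrame({
    "Income": [" <=50K"] * 6 + [" >50K"] * 6,
    "hours-per-week": [40, 40, 40, 38, 40, 20,
                       40, 45, 50, 40, 60, 80],
})

def iqr_outliers(hours: pd.Series) -> int:
    """Count values outside the 1.5 * IQR fences."""
    q1, q3 = hours.quantile([0.25, 0.75])
    spread = q3 - q1
    outside = (hours < q1 - 1.5 * spread) | (hours > q3 + 1.5 * spread)
    return int(outside.sum())

grouped = toy.groupby("Income")["hours-per-week"]
summary = grouped.quantile([0.25, 0.5, 0.75]).unstack()  # one row per income group
outliers = grouped.apply(iqr_outliers)

print(summary)
print(outliers)
```

Note that a group with a tight interquartile range flags more points as outliers, so outlier counts should be read together with the quartiles.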

## Conclusion

* Most individuals earn <=50K.
* The private sector provides the most employment opportunities, covering both high-income and low-income jobs.
* Most high-income individuals work in Exec-managerial, Prof-specialty, Craft-repair, or Sales occupations.
* Higher income ( >50K ) is more common among people aged 35–55.
* Individuals with Bachelors, Masters, and Doctorate degrees have a higher proportion of >50K income.
* Lower education levels (like HS-grad, Some-college, or 11th) are mostly associated with ≤50K income.
* Education is a strong determinant of income, and can be a powerful feature for predictive modeling.
* Working more hours does not guarantee higher income, but many high earners do work longer hours.
