# Census Income

## Description
We want to predict whether a person earns more than \\$50,000 a year based on census data. This is a binary classification problem: each individual belongs to one of two groups, those whose annual income exceeds \\$50,000 and those whose income is at or below this threshold. Because labelled examples are available, a supervised machine learning algorithm can be applied to the problem.

To achieve this, we will analyze the "Census Income" dataset, which contains various demographic and employment features, and use these features to train a machine learning model. Such a classifier can be useful for applications such as targeted marketing, economic research, and policy-making.

Once the model is trained, we will evaluate its performance using metrics such as accuracy, precision, recall, and the F1 score. These metrics will help us ensure that our model is reliable and effective in distinguishing between the two income groups.
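
As a reminder of what these four metrics measure, here is a minimal sketch that computes them from scratch. The labels below are hypothetical values chosen for illustration (1 stands for an income above \\$50,000), and `binary_metrics` is a helper name introduced here, not part of any library.

```python
# Hypothetical labels for illustration only: 1 = income >50K, 0 = income <=50K.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                               # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can be misleading when one income group is much larger than the other, which is why precision, recall and F1 are reported alongside it.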


## Dataset
- Source: https://archive.ics.uci.edu/dataset/2/adult
- Summary: Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: `((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))`.
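
The extraction condition above can be read as a simple record filter. The sketch below is one possible Python rendering; the argument names reflect an assumed reading of the original field codes (AAGE as age, AGI as adjusted gross income, AFNLWGT as final weight, HRSWK as hours worked per week), and `keep_record` is a hypothetical helper.

```python
def keep_record(age, agi, fnlwgt, hours_per_week):
    """Python rendering of ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))."""
    return age > 16 and agi > 100 and fnlwgt > 1 and hours_per_week > 0


# Hypothetical records, not taken from the census data.
print(keep_record(age=39, agi=40000, fnlwgt=77516, hours_per_week=40))  # all conditions hold
print(keep_record(age=16, agi=40000, fnlwgt=77516, hours_per_week=40))  # fails: age is not > 16
```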

## Data fetching

In [1]:
import warnings

warnings.simplefilter("ignore")

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [3]:
!pip install ucimlrepo



In [None]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
adult = fetch_ucirepo(id=2) 
  
# data (as pandas dataframes) 
X = adult.data.features 
y = adult.data.targets 

## Exploratory Data Analysis

In [None]:
df = pd.concat([X,y], axis=1)

In [None]:
df.shape

Our dataset contains 48,842 observations and 15 columns: 14 features plus the target variable *income*.

In [None]:
df.head()

The above cell gives us an overview of the first rows of our dataset.

In [None]:
df.dtypes.value_counts().plot.pie(autopct='%1.1f%%');

As the pie chart above shows, 60% of our columns are of type object and 40% are of type int. We therefore only have categorical and discrete numerical variables.

In [None]:
# Share of missing values per column, expressed as a percentage
df.isna().mean().mul(100).to_frame("Missing pct")

The above shows us that some of the values from the columns *workclass*, *occupation* and *native-country* are missing.

### Categorical variables

In [None]:
pd.DataFrame( df.select_dtypes("O").nunique() ).rename({0: "distinct values"}, axis=1)

The table above shows the number of unique values in each categorical variable. Notably, the *native-country*, *education* and *occupation* columns contain a large number of unique values. We also observe an issue with the *income* column, which should have only two unique values but instead has four.

In [None]:
g = sns.FacetGrid(
    data=df.select_dtypes("O").melt().value_counts().reset_index(),
    col="variable",
    col_wrap=3,
    sharex=False,
    sharey=False,
    height=5,
    aspect=1.5,
)
g.map(sns.barplot, "count", "value", orientation="horizontal");

The bar plots above illustrate the distribution of observations across the values of each categorical variable, including the target variable. We observe strong imbalances, particularly in the *native-country*, *race*, and *workclass* variables. These imbalances could bias our model toward the dominant categories, which raises ethical concerns.
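
To put a number on such an imbalance, one option is to look at each category's share of the observations. The sketch below uses hypothetical category values, and `category_shares` is a helper name introduced for illustration.

```python
from collections import Counter

# Hypothetical category values for illustration only.
values = ["A"] * 8 + ["B"] * 1 + ["C"] * 1


def category_shares(values):
    """Return each category's share of the observations, largest first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.most_common()}


shares = category_shares(values)
print(shares)
```

On the actual dataframe, `df["race"].value_counts(normalize=True)` should give the same kind of summary for a single column.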

Regarding the *income* variable, we observe that there are equivalent classes that can be merged to achieve the two classes of interest.
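
A minimal sketch of such a merge, assuming the four labels differ only by a trailing period (for example `<=50K` and `<=50K.`), which should be checked against `df["income"].unique()` before relying on it. `normalize_income` is a hypothetical helper.

```python
def normalize_income(label):
    """Strip whitespace and a trailing period so that "<=50K." matches "<=50K"."""
    return label.strip().rstrip(".")


# Assumed raw labels; verify them against the actual column values.
raw_labels = ["<=50K", "<=50K.", ">50K", ">50K."]
merged = sorted({normalize_income(label) for label in raw_labels})
print(merged)  # ['<=50K', '>50K']
```

With pandas, the same idea can be applied to the whole column with `df["income"].str.rstrip(".")`.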

In [None]:
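# Merge the equivalent income labels before counting, so each class is counted once
_d = df.select_dtypes("O").copy()
_d["income"] = _d["income"].str.replace(r"<=.*", "<=50K", regex=True).str.replace(r">.*", ">50K", regex=True)
_d = _d.melt(id_vars="income").value_counts().reset_index()

g = sns.FacetGrid(
    data=_d,
    col="variable",
    col_wrap=3,
    sharex=False,
    sharey=False,
    height=5,
    aspect=1.5,
)
# Split each bar by income class so the two groups can be compared
g.map_dataframe(sns.barplot, x="count", y="value", hue="income", orient="h");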