# U.S. Census Data Exploratory Analysis and Classification
---


## Table of Contents

1. [Project Overview](#1-project-overview)  
2. [Dataset Information](#2-dataset-information)  
   - [2.1 Data Source](#21-data-source)  
   - [2.2 Initial Setup](#22-initial-setup)  
3. [Data Cleaning & Preparation](#3-data-cleaning--preparation)  
   - [3.1 Column Cleaning](#31-column-cleaning)  
   - [3.2 Handling Missing Values](#32-handling-missing-values)  
   - [3.3 Feature Usability Check](#33-feature-usability-check)  
4. [Exploratory Data Analysis (EDA)](#4-exploratory-data-analysis-eda)  
   - [4.1 Univariate Analysis](#41-univariate-analysis)  
   - [4.2 Bivariate Analysis](#42-bivariate-analysis)  
   - [4.3 Correlation & Insights](#43-correlation--insights)  
   - [4.4 Dataiku Visual Integration](#44-dataiku-visual-integration)  
5. [Classification Preparation](#5-classification-preparation)  
   - [5.1 Define Target Variable](#51-define-target-variable)  
   - [5.2 Feature Encoding & Selection](#52-feature-encoding--selection)  
6. [Summary & Key Takeaways](#6-summary--key-takeaways)  
7. [References](#7-references)

---




## 1. Project Overview

This project explores U.S. Census employment and demographic data to uncover trends and relationships within the workforce.  
 
The analysis focuses on identifying patterns across variables such as age, education, and occupation, and on preparing the data for a **classification task** that predicts an individual's *income level* (whether annual income exceeds $50K).

In addition to the Python-based workflow, further exploratory analysis and visualization were conducted in **Dataiku** to enhance interpretability and support data-driven insights.

---

## 2. Dataset Information

### 2.1 Data Source

The dataset was extracted from the **U.S. Census Bureau** database, available at [https://www.census.gov/data.html](https://www.census.gov/data.html).  
  
The data is used to study **income classification**, where the goal is to predict whether an individual's income exceeds **$50K per year**, based on demographic and employment attributes.  
  
It includes **40 attributes** (7 continuous, 33 categorical) covering features such as **age, education, occupation, class of worker,** and **hours worked**, with income level serving as the **target variable**.
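
The continuous/categorical split described above can be checked once the data is loaded. The snippet below is a minimal sketch, not the project's actual code: the helper `summarize_attribute_types`, the toy DataFrame, and the column name `income` are all hypothetical stand-ins. It treats every numeric column as continuous, which may overcount if some categorical attributes are stored as numeric codes.

```python
import pandas as pd


def summarize_attribute_types(df: pd.DataFrame, target: str) -> dict:
    """Split feature columns into numeric and non-numeric groups.

    Assumption: numeric dtype means "continuous". Categorical attributes
    stored as integer codes would need to be reassigned by hand.
    """
    features = df.drop(columns=[target])
    continuous = features.select_dtypes(include="number").columns.tolist()
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    return {"continuous": continuous, "categorical": categorical}


# Toy stand-in for the census data (hypothetical values and column names)
toy = pd.DataFrame(
    {
        "age": [34, 52, 27],
        "education": ["Bachelors", "HS-grad", "Masters"],
        "hours_worked": [40, 45, 20],
        "income": ["<=50K", ">50K", "<=50K"],
    }
)

summary = summarize_attribute_types(toy, target="income")
print(len(summary["continuous"]), "continuous:", summary["continuous"])
print(len(summary["categorical"]), "categorical:", summary["categorical"])
```

Running the same helper on the full dataset should give counts that can be compared against the 7 continuous and 33 categorical attributes stated above; any mismatch would point to columns whose dtype does not reflect their meaning.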


### 2.2 Initial Setup
```python
# Import core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
