# Exploratory Data Analysis

**Project:** Income Predictors

**Team:** Team 1 C4: Masha Bystritskii & Sara Mahmoud

**Date:** February 19th, 2026




## Table of Contents
1. Setup & Load Data
2. Data Quality Check
3. Target Variable Analysis
4. Feature Distributions
5. Correlation Analysis
6. Key Findings Summary



## 1. Setup & Load Data

In [3]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
plt.style.use('seaborn-v0_8')
pd.set_option('display.max_columns', None)

print("✓ Libraries loaded!")

✓ Libraries loaded!


In [7]:
# Define column names (passed via names= because the file has no header row)
columns = [
    "age","workclass","fnlwgt","education","education_num",
    "marital_status","occupation","relationship","race","sex",
    "capital_gain","capital_loss","hours_per_week","native_country","income"
]

# Load raw data
df = pd.read_csv('adult.data', names=columns, skipinitialspace=True)

df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K



## 2. Data Quality Check

**Questions to answer:**
- What are the data types?
- Are there missing values?
- Are there duplicate rows?

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.duplicated().sum()

In [None]:
df.describe()

### Data Quality Observations


1. **Data types:** The dataset contains 32,561 rows and 15 columns. There are 6 numeric variables (age, fnlwgt, education_num, capital_gain, capital_loss, hours_per_week) and 9 categorical variables (workclass, education, marital_status, occupation, relationship, race, sex, native_country, income).

2. **Missing values:** No missing NaN values were detected using `df.isnull()`. However, this dataset represents missing values as "?" in certain categorical columns, which will need to be handled during preprocessing.

3. **Duplicates:** There are 24 duplicate rows in the dataset.

4. **Potential issues:** 
   - capital_gain has a maximum value of 99,999 and capital_loss has a maximum of 4,356, indicating extreme outliers.
   - hours_per_week ranges from 1 to 99 hours, so the extremes at both ends may be unrealistic values.
   - Categorical variables will require encoding before modeling.
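
The "?" placeholders described in the missing-values observation can be counted directly and, if desired, converted to NaN at load time. The sketch below is a minimal illustration on a made-up three-row sample rather than on `adult.data` itself; the sample rows and the variable names `sample`, `raw`, and `clean` are assumptions introduced here. `na_values` is a standard `pd.read_csv` parameter.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for adult.data (illustrative values only)
sample = io.StringIO(
    "39, State-gov, Adm-clerical, United-States\n"
    "54, ?, ?, United-States\n"
    "32, Private, Sales, ?\n"
)
cols = ["age", "workclass", "occupation", "native_country"]

# Read the same way as the notebook: "?" is kept as an ordinary string
raw = pd.read_csv(sample, names=cols, skipinitialspace=True)
question_marks = (raw == "?").sum()  # per-column count of "?" placeholders
print(question_marks)

# Re-read with na_values so "?" becomes NaN and isnull() can detect it
sample.seek(0)
clean = pd.read_csv(sample, names=cols, skipinitialspace=True, na_values="?")
missing = clean.isnull().sum()
print(missing)
```

Applied to the full dataset, the same two steps would show which categorical columns carry "?" and how many rows are affected. Whether to drop or impute those rows is a separate preprocessing decision.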



## 3. Target Variable Analysis

**Your target variable:** `income` (binary: <=50K or >50K)

In [None]:
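# Print the class counts (only a cell's last expression is displayed automatically)
print(df['income'].value_counts())

# Bar chart of the same counts
df['income'].value_counts().plot(kind='bar')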

### Target Variable Observations


1. **Distribution shape:** The target variable is binary, with two categories: <=50K and >50K.

2. **Class counts:** There are 24,720 individuals earning <=50K and 7,841 individuals earning >50K.

3. **Class balance:** The dataset is imbalanced, with approximately 76% of individuals earning <=50K and 24% earning >50K.

4. **Potential issues:** Class imbalance may bias a model toward predicting the majority class (<=50K), so evaluation metrics beyond accuracy (e.g., precision, recall) will be important in later stages.
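
To make the imbalance concern concrete, the sketch below takes the class counts reported above and computes what a model that always predicts <=50K would score. The always-majority predictor is a hypothetical baseline introduced here for illustration, not a model from this project.

```python
# Class counts reported in the observations above
n_low, n_high = 24_720, 7_841  # <=50K, >50K
total = n_low + n_high

# Class shares
share_low = n_low / total
share_high = n_high / total
print(f"<=50K: {share_low:.1%}, >50K: {share_high:.1%}")

# Hypothetical baseline: always predict "<=50K"
baseline_accuracy = n_low / total  # every <=50K row is classified correctly
recall_high = 0 / n_high           # no >50K row is ever identified
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"recall on >50K:    {recall_high:.1f}")
```

The baseline reaches roughly 76% accuracy while never identifying a single >50K earner, which is why accuracy alone can give a misleading picture on this target.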



## 4. Feature Distributions

### Feature Distribution Observations

*TODO: Write your observations here*




## 5. Correlation Analysis

### Correlation Observations

*TODO: Write your observations here*

1. **Strongest predictor:** ...
2. **Other important features:** ...
3. **Multicollinearity concerns:** ...


## 6. Key Findings Summary

## EDA Checklist

Before moving to modeling, ensure you've completed:

- [ ] Loaded and examined the data
- [ ] Checked data types
- [ ] Identified and documented missing values
- [ ] Analyzed target variable distribution
- [ ] Examined feature distributions
- [ ] Created correlation analysis
- [ ] Documented key findings
- [ ] Identified potential data quality issues