# 🟣 Project 2: Demographic Data Analysis
🎯 Project Goal:

Analyze a real-world dataset about adults' personal information and economic status.

Answer key questions such as:
1. Which education levels earn more?
2. Do men or women work longer hours on average?
3. What is the income distribution across countries?
4. Can we detect patterns in age vs income?

## 🧪 Step 1: Load and Inspect the Dataset

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv('../data/adult-data.csv')

# Show basic info
print(df.shape)
print(df.columns)
print(df.head())

## 🔍 Step 2: Clean the Data

In [None]:
# Strip whitespace from string values
# (Series.map is used because DataFrame.applymap is deprecated in recent pandas)
df = df.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))

# Check for missing or placeholder values
print((df == '?').sum())


📌 In this dataset, missing values are often marked as '?'.

We will treat them as missing:

In [None]:
# Replace '?' with the pandas missing-value marker (pd.NA)
df = df.replace('?', pd.NA)

# Drop rows with missing values
df = df.dropna()
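
As an alternative to cleaning after loading, the same result can usually be had at read time. The sketch below assumes the file uses a comma followed by a space as separator and '?' as the placeholder; the three-row sample is made up for illustration and is not taken from the real dataset.

```python
import io
import pandas as pd

# Hypothetical sample in a comma-plus-space layout (an assumption about the real file)
sample_csv = io.StringIO(
    "age,workclass,sex\n"
    "39, State-gov, Male\n"
    "54, ?, Female\n"
    "28, Private, Female\n"
)

# skipinitialspace drops the blank after each comma;
# na_values tells pandas to read '?' as a missing value
sample = pd.read_csv(sample_csv, skipinitialspace=True, na_values="?")

print(sample.isna().sum())  # missing values per column
print(sample.dropna())      # rows that would survive the cleaning step
```

With this approach, the strip and replace steps above become unnecessary, though it is worth checking `isna().sum()` either way.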


## 🧠 Step 3: Simple Demographic Questions


1. What is the average age?

In [None]:
print("Average age:", df['age'].mean())

2. What is the average working hours per week by gender?

In [None]:
print(df.groupby('sex')['hours-per-week'].mean())
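
A mean on its own can hide skew or very different group sizes. As a rough sketch on made-up numbers (the `toy` frame below is hypothetical), the same groupby can report the median and the row count next to the mean:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Female"],
    "hours-per-week": [40, 60, 40, 20, 30],
})

# Several aggregations at once: one row per group, one column per statistic
summary = toy.groupby("sex")["hours-per-week"].agg(["mean", "median", "count"])
print(summary)
```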

3. How many people are in each education level?

In [None]:
print(df['education'].value_counts())

4. Salary distribution by education

In [None]:
salary_by_education = df.groupby('education')['salary'].value_counts(normalize=True).unstack()
print(salary_by_education)
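
To see what the groupby / value_counts / unstack chain produces, here is a small sketch on invented rows (the `toy` frame is hypothetical). One detail worth knowing: if an education level has nobody in a salary band, `unstack()` leaves NaN there, and `fill_value=0` is one way to get an explicit zero instead.

```python
import pandas as pd

# Hypothetical toy data; 'Preschool' and 'HS-grad' have no '>50K' rows
toy = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "HS-grad", "HS-grad", "Preschool"],
    "salary":    [">50K", "<=50K", "<=50K", "<=50K", "<=50K"],
})

shares = (
    toy.groupby("education")["salary"]
    .value_counts(normalize=True)  # share of each salary band within an education level
    .unstack(fill_value=0)         # salary bands become columns; absent combinations get 0
)
print(shares)
```

Each row sums to 1, so the '>50K' column can be read directly as the share of high earners at that education level.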

## 📈 Step 4: Visual Exploration

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Average working hours by education
plt.figure(figsize=(12, 4))
sns.barplot(data=df, x='education', y='hours-per-week', errorbar=None)
plt.xticks(rotation=90)
plt.title('Average Working Hours by Education')
plt.show()


## 🔄 Step 5: Income Distribution by Age Group

In [None]:
# Create age buckets (pd.cut intervals include the right edge: (0, 25], (25, 35], ...)
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 25, 35, 50, 65, 100],
    labels=['<=25', '26-35', '36-50', '51-65', '66+']
)

# Group by age_group, then count salary values and normalize to percentage
income_by_age = (
    df.groupby('age_group', observed=True)['salary']
    .value_counts(normalize=True)
    .unstack()
)

# Plot % of people earning >50K by age group (proportions scaled from 0-1 to 0-100)
(income_by_age['>50K'] * 100).plot(kind='bar', color='green', title='% Earning >50K by Age Group')
plt.ylabel("Percentage")
plt.show()

📊 What this code does:

It creates 5 age ranges to simplify age-based analysis.

It then calculates the percentage of people in each age group who earn:
- <=50K → lower income
- >50K → higher income

Finally, it plots only the percentage of high-income earners (>50K) by age group.
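
One thing to keep in mind about the bucketing: by default `pd.cut` builds intervals that are open on the left and closed on the right, so an age exactly on a bin edge falls into the lower bucket. A minimal sketch with a few made-up ages, leaving out the labels so the intervals themselves are visible:

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([17, 25, 26, 50, 51, 90])

# Without labels, pd.cut returns the intervals, e.g. (0, 25] and (25, 35]
groups = pd.cut(ages, bins=[0, 25, 35, 50, 65, 100])

print(pd.DataFrame({"age": ages, "age_group": groups}))
```

Here 25 lands in the same bucket as 17, and 50 lands in the bucket that ends at 50. Passing `right=False` to `pd.cut` would flip that behavior if left-inclusive buckets are preferred.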

🔎 Insights:

- The share of people earning >50K rises with age and peaks in the 35–50 group
- It drops slightly after 50, and further after 65

## 📊 Step 6: Income by Country

In [None]:
# Get top 10 countries with most entries
top_countries = df['native-country'].value_counts().head(10).index

# Filter to those countries
filtered_df = df[df['native-country'].isin(top_countries)]

# Share of each salary band per country
country_income = filtered_df.groupby('native-country')['salary'].value_counts(normalize=True).unstack()

# Plot % earning >50K per country (proportions scaled from 0-1 to 0-100)
(country_income['>50K'] * 100).sort_values().plot(kind='barh', color='orange', title='% Earning >50K by Country')
plt.xlabel("Percentage")
plt.show()
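
A share per country is only as reliable as the number of rows behind it, and the top 10 countries can differ a lot in size. As a sketch on invented rows (the `toy` frame and the output column names `rows` and `share_over_50k` are hypothetical), the count can be reported next to the share:

```python
import pandas as pd

# Hypothetical toy data: one large group, one medium, one single-row group
toy = pd.DataFrame({
    "native-country": ["United-States"] * 6 + ["Mexico"] * 3 + ["India"] * 1,
    "salary": [">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K",
               "<=50K", "<=50K", "<=50K", ">50K"],
})

summary = (
    toy.assign(high=toy["salary"].eq(">50K"))          # True where salary is '>50K'
    .groupby("native-country")["high"]
    .agg(rows="count", share_over_50k="mean")          # group size and share of high earners
    .sort_values("rows", ascending=False)
)
print(summary)
```

In this toy example the single-row country shows a share of 1.0, which says more about the sample size than about the country.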


## 🧠 Advanced Real-World Use Cases


### 🔍 Advanced Use Case 01: Scatter Plot of Hours-per-week vs Age, Colored by Income

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='age', y='hours-per-week', hue='salary', alpha=0.6)
plt.title('Age vs Working Hours by Income Group')
plt.show()


📌 This helps show whether younger or older people work more hours, and how that relates to income.

### 🔍 Advanced Use Case 02: Decision-Making – Who earns more and why?
Let’s explore how education and working hours relate to income:

In [None]:
# Mean weekly hours for each education level, split by salary band
pivot = df.pivot_table(index='education', columns='salary', values='hours-per-week', aggfunc='mean')
pivot = pivot.sort_values('>50K', ascending=False)
pivot.plot(kind='barh', figsize=(10, 6), title="Avg Hours per Education Level by Salary")
plt.xlabel("Average Hours")
plt.grid(True)
plt.show()


📌 Insight: For each education level, the chart compares the average weekly hours of people earning >50K with those earning <=50K.

### 📦 Advanced Use Case 03: Predictive Pattern — Logistic Trends
We create a binary column high_income that is 1 when salary is '>50K' and 0 otherwise:

In [None]:
df['high_income'] = (df['salary'] == '>50K').astype(int)

# Average high_income rate by age
age_income = df.groupby('age')['high_income'].mean()

plt.figure(figsize=(12, 4))
age_income.plot(title="Probability of >50K Income by Age", color='green')
plt.ylabel("Probability")
plt.grid(True)
plt.show()


📌 Insight: This shows the share of high earners at each age, which approximates the probability of high income by age and is useful for modeling.
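
Per-age rates tend to jump around when only a few people share an age. One possible way to make the trend easier to read is a centered rolling mean over neighboring ages. The sketch below uses made-up rows, and the window of 3 is an arbitrary choice for illustration, not a recommendation.

```python
import pandas as pd

# Hypothetical toy data: two people per age
toy = pd.DataFrame({
    "age":         [30, 30, 31, 31, 32, 32, 33, 33],
    "high_income": [0,  0,  1,  0,  0,  0,  1,  1],
})

# Raw share of high earners per age
rate = toy.groupby("age")["high_income"].mean()

# Average each age with its neighbors; min_periods=1 keeps the first and last ages
smoothed = rate.rolling(window=3, center=True, min_periods=1).mean()

print(pd.DataFrame({"raw": rate, "smoothed": smoothed}))
```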

### 🗂 Advanced Use Case 04: Crosstab Heatmap – Gender vs Occupation

In [None]:
# Cross-tabulation of occupation and sex
occupation_sex = pd.crosstab(df['occupation'], df['sex'], normalize='index') * 100

# Heatmap
sns.heatmap(occupation_sex, annot=True, cmap='Blues', fmt='.1f')
plt.title("Gender Proportion per Occupation (%)")
plt.ylabel("Occupation")
plt.show()


📌 Insight: Great for detecting gender imbalance in different job categories.
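
The `normalize` argument decides which question the table answers. A small sketch on invented rows (the `toy` frame is hypothetical) shows the difference between normalizing by row and by column:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "occupation": ["Sales", "Sales", "Sales", "Sales", "Tech-support", "Tech-support"],
    "sex":        ["Male", "Male", "Male", "Female", "Female", "Male"],
})

# normalize='index': each occupation row sums to 100 (gender mix within an occupation)
by_occupation = pd.crosstab(toy["occupation"], toy["sex"], normalize="index") * 100

# normalize='columns': each gender column sums to 100 (occupation mix within a gender)
by_gender = pd.crosstab(toy["occupation"], toy["sex"], normalize="columns") * 100

print(by_occupation)
print(by_gender)
```

The heatmap above uses the row-normalized form, so each row reads as "of the people in this occupation, what percentage are women and men".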

### 📌 Final Professional Tip: Feature Engineering for Modeling
Let’s create features that are often used in machine learning:

In [None]:
# Vectorized comparisons give the same 0/1 flags without row-by-row apply
df['is_married'] = df['marital-status'].str.contains('Married', regex=False).astype(int)
df['has_capital_gain'] = (df['capital-gain'] > 0).astype(int)
df['work_overtime'] = (df['hours-per-week'] > 45).astype(int)


📌 These features can later be used for predictive modeling, e.g., with scikit-learn.
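
As a rough idea of what the step before modeling could look like, the sketch below assembles a numeric feature table and a target from a few made-up rows. The names `toy`, `X`, and `y` are hypothetical, and including one-hot encoded education is just one possible choice; no model is fitted here.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the dataset
toy = pd.DataFrame({
    "marital-status": ["Married-civ-spouse", "Never-married", "Divorced"],
    "capital-gain":   [0, 2174, 0],
    "hours-per-week": [50, 40, 45],
    "education":      ["Bachelors", "HS-grad", "Bachelors"],
    "salary":         [">50K", "<=50K", "<=50K"],
})

# Same three binary features as above
toy["is_married"] = toy["marital-status"].str.contains("Married", regex=False).astype(int)
toy["has_capital_gain"] = (toy["capital-gain"] > 0).astype(int)
toy["work_overtime"] = (toy["hours-per-week"] > 45).astype(int)

# One-hot encode the categorical column so every feature is numeric
X = pd.get_dummies(
    toy[["is_married", "has_capital_gain", "work_overtime", "education"]],
    columns=["education"],
    dtype=int,
)

# Binary target: 1 for '>50K', 0 otherwise
y = (toy["salary"] == ">50K").astype(int)

print(X)
print(y)
```

Note that 'Never-married' is not flagged as married, because the match on 'Married' is case-sensitive.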

## 📌 Summary
In this project, we:
- Cleaned and processed demographic data
- Explored education, age, gender, and salary relationships
- Visualized distributions, patterns, and trends
- Used both simple and advanced techniques for real-world questions