# Assignment: Exploratory Data Analysis with Pandas

In this assignment, you will use **Pandas** to explore the [Adult Census Income dataset](https://archive.ics.uci.edu/ml/datasets/Adult). The dataset contains demographic information about individuals and a `salary` column indicating whether each person earns `<=50K` or `>50K` per year.

**Instructions:**
- Write your code in the empty cells below each question.
- Run the setup cell first to load the data.
- Use `print()` to display your final answers.

## Setup: Load the Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('data.csv')
data.head()

---
### Question 1
How many rows and columns does the dataset have? Use `.shape`.

In [None]:
# Your code here


---
### Question 2
What are the column names and data types? Use `.info()` or `.dtypes`.

In [None]:
# Your code here


---
### Question 3
How many men and women are in the dataset? Use `.value_counts()` on the `sex` column.

In [None]:
# Your code here


---
### Question 4
What is the average age of women in the dataset?

*Hint: Filter the dataframe where `sex == 'Female'`, then use `.mean()` on the `age` column.*
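
If boolean filtering is new to you, it may help to try it on a tiny table first. The sketch below uses made-up data with hypothetical column names (`species`, `weight_kg`), not the assignment dataset:

```python
import pandas as pd

# Toy data, invented for illustration only.
pets = pd.DataFrame({
    "species": ["cat", "dog", "cat", "dog", "cat"],
    "weight_kg": [4.0, 20.0, 5.0, 30.0, 6.0],
})

# The comparison produces a boolean Series; indexing with it keeps the True rows.
is_cat = pets["species"] == "cat"
cats = pets[is_cat]

# Mean of a single column within the filtered rows.
mean_cat_weight = cats["weight_kg"].mean()
print(mean_cat_weight)  # 5.0
```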

In [None]:
# Your code here


---
### Question 5
What percentage of people in the dataset are from the United States?

*Hint: Filter on `native-country`, count the matching rows, divide by the total number of rows, and multiply by 100.*
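
A share of rows can be turned into a percentage in two equivalent ways. This is a minimal sketch on invented data (the `orders` table and its `status` column are hypothetical):

```python
import pandas as pd

# Toy data, invented for illustration only.
orders = pd.DataFrame({"status": ["shipped", "pending", "shipped", "shipped"]})

# Option 1: count the matching rows and divide by the total row count.
n_shipped = (orders["status"] == "shipped").sum()
pct_shipped = 100 * n_shipped / len(orders)
print(f"{pct_shipped:.1f}%")  # 75.0%

# Option 2: the mean of a boolean mask is the fraction of True values.
pct_alt = 100 * (orders["status"] == "shipped").mean()
```

Both options give the same number; the second is shorter, while the first also leaves you with the raw count.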

In [None]:
# Your code here


---
### Question 6
What is the average age of people who earn **>50K** vs. those who earn **<=50K**?

*Hint: Use `.groupby('salary')['age'].mean()`.*
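
The same split-then-average pattern, sketched on a small invented table (the `mode` and `minutes` columns are hypothetical), looks roughly like this:

```python
import pandas as pd

# Toy data, invented for illustration only.
trips = pd.DataFrame({
    "mode": ["bus", "bike", "bus", "bike", "bus"],
    "minutes": [30, 12, 40, 18, 50],
})

# Split rows by "mode", select one column, and average it within each group.
mean_minutes = trips.groupby("mode")["minutes"].mean()
print(mean_minutes)  # one row per group: bike 15.0, bus 40.0
```

The result is a Series indexed by the group labels, so a single value can be read with `mean_minutes["bus"]`.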

In [None]:
# Your code here


---
### Question 7
What are the top 5 most common occupations in the dataset?

*Hint: Use `.value_counts()` and `.head(5)` on the `occupation` column.*
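
As a quick illustration on invented data (the `colors` Series is hypothetical), `.value_counts()` already sorts from most to least frequent, so `.head(n)` returns the top `n` categories:

```python
import pandas as pd

# Toy data, invented for illustration only.
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"], name="color")

# Counts per distinct value, sorted in descending order by default.
counts = colors.value_counts()

# Keep only the two most frequent values.
top2 = counts.head(2)
print(top2)  # red 3, blue 2
```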

In [None]:
# Your code here


---
### Question 8
What is the maximum number of hours a person works per week? How many people work that many hours?

*Hint: Use `.max()` to find the maximum, then filter and count.*
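
The two steps in the hint can be sketched on a toy table as follows (the `scores` data and its column names are made up for illustration):

```python
import pandas as pd

# Toy data, invented for illustration only.
scores = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E"],
    "points": [7, 10, 3, 10, 8],
})

# Step 1: find the maximum value of the column.
top = scores["points"].max()

# Step 2: count the rows that reach it (True counts as 1 when summed).
n_at_top = (scores["points"] == top).sum()
print(top, n_at_top)  # 10 2
```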

In [None]:
# Your code here


---
### Question 9
Display age statistics (count, mean, std, min, max) grouped by `race` and `sex`. Use `.groupby()` and `.describe()`.

Then answer: What is the maximum age of men in the `Amer-Indian-Eskimo` group?
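
Grouping by two columns gives a result whose row index has two levels, and a single cell can then be read with `.loc` and a tuple. A minimal sketch on invented data (the `site`, `kind`, and `height_cm` columns are hypothetical):

```python
import pandas as pd

# Toy data, invented for illustration only.
plants = pd.DataFrame({
    "site": ["north", "north", "north", "south", "south", "south"],
    "kind": ["fern", "moss", "fern", "fern", "moss", "moss"],
    "height_cm": [30, 4, 50, 45, 6, 2],
})

# One row of summary statistics per (site, kind) combination.
stats = plants.groupby(["site", "kind"])["height_cm"].describe()
print(stats[["count", "mean", "std", "min", "max"]])

# Read one statistic for one group: the row key is a tuple of both labels.
tallest_north_fern = stats.loc[("north", "fern"), "max"]
print(tallest_north_fern)  # 50.0
```

Note that `.describe()` also reports quartiles, and that `std` is `NaN` for any group with a single row.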

In [None]:
# Your code here


---
### Question 10
What is the average `hours-per-week` for each `salary` group (`<=50K` and `>50K`)?

*Hint: Use `.groupby('salary')['hours-per-week'].mean()`.*

In [None]:
# Your code here


---
## Part 2: Groupby and Aggregation

The questions below require combining filtering, grouping, and aggregation.

---
### Question 11
For each `workclass`, compute the **mean age**, **mean hours-per-week**, and **count** of people. Sort the result by count in descending order.

*Hint: Use `.groupby('workclass').agg(...)` with named aggregations. Use `.sort_values()` to sort.*
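
Named aggregation takes `new_name=(column, function)` pairs. Because the source column is passed as a string, it also works for column names that contain hyphens. The sketch below uses invented data with hypothetical column names:

```python
import pandas as pd

# Toy data, invented for illustration only.
sales = pd.DataFrame({
    "region": ["east", "west", "east", "east", "west", "north"],
    "unit-price": [10.0, 20.0, 30.0, 20.0, 40.0, 5.0],
    "quantity": [1, 2, 3, 2, 4, 6],
})

summary = (
    sales.groupby("region")
    .agg(
        mean_price=("unit-price", "mean"),    # output name = (input column, function)
        mean_quantity=("quantity", "mean"),
        n_rows=("quantity", "count"),         # counts non-missing values
    )
    .sort_values("n_rows", ascending=False)   # largest groups first
)
print(summary)
```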

In [None]:
# Your code here


---
### Question 12
For each `education` level, compute the **min**, **max**, and **mean** of `hours-per-week`. Sort by mean in descending order.

Which education level has the highest average working hours?

*Hint: Use `.groupby('education')['hours-per-week'].agg(['min', 'max', 'mean'])`.*
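
Passing a list of function names to `.agg()` yields one column per function. A small sketch on invented data (the `team` and `km` columns are hypothetical):

```python
import pandas as pd

# Toy data, invented for illustration only.
runs = pd.DataFrame({
    "team": ["a", "a", "b", "b", "c"],
    "km": [5.0, 7.0, 10.0, 4.0, 8.0],
})

# One column per statistic, then order the groups by their mean.
km_stats = (
    runs.groupby("team")["km"]
    .agg(["min", "max", "mean"])
    .sort_values("mean", ascending=False)
)
print(km_stats)

# After sorting, the first index label is the group with the highest mean.
best_team = km_stats.index[0]
print(best_team)  # c
```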

In [None]:
# Your code here


---
### Question 13
Create a crosstab showing the **count** of people for each combination of `education` and `salary`. Then compute the **proportion** of `>50K` earners for each education level.

Which education level has the **highest** proportion of people earning >50K?

*Hint: Use `pd.crosstab(data['education'], data['salary'])`. To get proportions, divide the `>50K` column by the row total.*
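
To see how counts become row-wise proportions, here is a sketch on a tiny invented table (the `degree` and `outcome` columns are hypothetical):

```python
import pandas as pd

# Toy data, invented for illustration only.
apps = pd.DataFrame({
    "degree": ["BSc", "BSc", "MSc", "MSc", "MSc", "PhD"],
    "outcome": ["reject", "accept", "accept", "accept", "reject", "accept"],
})

# Rows are degrees, columns are outcomes, cells are counts.
counts = pd.crosstab(apps["degree"], apps["outcome"])

# Divide one column by the row totals to get a per-row proportion.
accept_rate = counts["accept"] / counts.sum(axis=1)
print(accept_rate.sort_values(ascending=False))

# Alternative: let crosstab normalize each row to sum to 1.
rates = pd.crosstab(apps["degree"], apps["outcome"], normalize="index")
```

Combinations that never occur (here, a rejected PhD application) appear as a count of 0 rather than being dropped.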

In [None]:
# Your code here


---
### Question 14
Among people who work **more than 40 hours per week**, what is the average `capital-gain` for each `occupation`? Show only the **top 5** occupations by average capital-gain.

*Hint: First filter the dataframe for `hours-per-week > 40`, then use `.groupby('occupation')['capital-gain'].mean()` and `.sort_values(ascending=False).head(5)`.*
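
The order of operations matters here: rows are filtered first, so excluded rows never influence the group means. A sketch on invented data (the `films` table and its columns are hypothetical):

```python
import pandas as pd

# Toy data, invented for illustration only.
films = pd.DataFrame({
    "genre": ["drama", "comedy", "drama", "horror", "comedy", "horror"],
    "minutes": [130, 95, 150, 125, 110, 85],
    "rating": [8.0, 6.0, 7.0, 5.0, 7.0, 9.0],
})

# Step 1: keep only the rows that satisfy the condition.
long_films = films[films["minutes"] > 100]

# Step 2: average within each group, sort, and keep the top entries.
top2 = (
    long_films.groupby("genre")["rating"]
    .mean()
    .sort_values(ascending=False)
    .head(2)
)
print(top2)  # drama 7.5, comedy 7.0
```

The 85-minute horror film has the highest rating in the table, but it is removed by the filter and so does not raise the horror average.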

In [None]:
# Your code here


---
## Bonus (Extra Difficult)

### Bonus Question
Create a new column called `age_group` that bins ages into the following categories:
- `17-30`
- `31-45`
- `46-60`
- `61+`

Then, for **each `age_group` and `sex`**, compute a summary table with three columns:
1. **pct_over_50K** — the percentage of people earning >50K
2. **avg_hours_per_week** — the average hours worked per week
3. **most_common_occupation** — the most frequently occurring occupation

Display the result as a single DataFrame.

*Hints:*
- *Use `pd.cut()` to create the `age_group` column.*
- *Create a helper column, for example `data['is_over_50K'] = (data['salary'] == '>50K').astype(int)`, to make computing the percentage easier.*
- *Use `.groupby(['age_group', 'sex']).agg(...)` with custom lambda functions.*
- *To find the most common value, use `.mode().iloc[0]` inside a lambda.*
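
The hints above can be combined as in the following sketch. It uses invented data and hypothetical names (`tenure`, `passed`, `tool`, `tenure_band`), groups by one key rather than two, and is meant only to show how the pieces fit together:

```python
import pandas as pd

# Toy data, invented for illustration only.
staff = pd.DataFrame({
    "tenure": [1, 3, 5, 6, 8, 12, 20, 10],
    "passed": ["yes", "no", "yes", "yes", "no", "yes", "yes", "no"],
    "tool": ["vim", "vim", "emacs", "vim", "emacs", "emacs", "vim", "emacs"],
})

# pd.cut bins are right-inclusive by default: (0, 5], (5, 10], (10, inf).
staff["tenure_band"] = pd.cut(
    staff["tenure"],
    bins=[0, 5, 10, float("inf")],
    labels=["1-5", "6-10", "11+"],
)

# A 0/1 helper column: its mean is the fraction of "yes" rows.
staff["is_pass"] = (staff["passed"] == "yes").astype(int)

summary = staff.groupby("tenure_band", observed=True).agg(
    pct_pass=("is_pass", lambda s: 100 * s.mean()),
    # .mode() returns all tied values in sorted order; .iloc[0] takes the first.
    top_tool=("tool", lambda s: s.mode().iloc[0]),
)
print(summary)
```

To group by a second key as well, pass a list of column names to `.groupby()`.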

In [None]:
# Your code here


---
## Part 3: Data Visualization

The questions below require you to create visualizations using **matplotlib** and/or **seaborn**. Make sure your plots have appropriate titles, axis labels, and legends where needed.

---
### Question 15
Create a **histogram** of the `age` column with 20 bins. Add a title and axis labels.

*Hint: Use `data['age'].plot(kind='hist', bins=20)` or `plt.hist(data['age'], bins=20)`.*
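
Titles and axis labels can be set on the axes object that holds the plot. A minimal sketch on invented data (the `waits` Series and the label text are hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data, invented for illustration only.
waits = pd.Series([2, 3, 3, 4, 5, 5, 5, 6, 8, 12], name="wait_min")

fig, ax = plt.subplots()

# hist returns the bin counts, the bin edges, and the drawn bars.
counts, edges, _ = ax.hist(waits, bins=5)

# Label the figure so a reader knows what is being shown.
ax.set_title("Distribution of wait times (toy data)")
ax.set_xlabel("Wait (minutes)")
ax.set_ylabel("Number of visits")
```

In a notebook the figure is displayed when the cell finishes; in a plain script you would typically call `plt.show()` or `fig.savefig(...)` afterwards.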

In [None]:
# Your code here


---
### Question 16
Create a **bar chart** showing the count of people in each `workclass` category. Rotate the x-axis labels for readability.

*Hint: Use `data['workclass'].value_counts().plot(kind='bar')`. Use `plt.xticks(rotation=45)` to rotate labels.*
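
The pandas plotting call returns a matplotlib axes object, which can then be labelled like any other plot. A sketch on invented data (the `payments` Series and the label text are hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data, invented for illustration only.
payments = pd.Series(["card", "cash", "card", "voucher", "card", "cash"])

# One bar per category, with bar height equal to its frequency.
ax = payments.value_counts().plot(kind="bar")
ax.set_title("Payments by method (toy data)")
ax.set_xlabel("Payment method")
ax.set_ylabel("Count")

# Rotate the category labels so long names do not overlap.
plt.xticks(rotation=45)
plt.tight_layout()
```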

In [None]:
# Your code here


---
### Question 17
Create a **boxplot** comparing the distribution of `age` across the two `salary` groups (`<=50K` and `>50K`).

What can you observe about the age distributions of the two groups?

*Hint: Use `sns.boxplot(x='salary', y='age', data=data)` or `data.boxplot(column='age', by='salary')`.*

In [None]:
# Your code here
