# Homework 1: NumPy, Pandas, and Data Exploration
## CS 2100/4700: Introduction to Machine Learning

**Due Date:** 01/29/2026

**Total Points:** 100

## Instructions

- Complete all exercises in this notebook
- Include your code AND output for each question
- Comment your code where appropriate
- Do NOT use loops where vectorized operations are possible
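
As a rough illustration of the last rule (a sketch only, not part of the graded work; the array `values` and the variable names are made up for this example), the loop and the vectorized expression below compute the same count. The vectorized form is the style this assignment asks for.

```python
import numpy as np

# Hypothetical example data, not taken from the assignment
values = np.array([3, 8, 1, 9, 4])

# Loop version (avoid this when a vectorized form exists)
count_loop = 0
for v in values:
    if v > 3:
        count_loop += 1

# Vectorized version: the comparison yields a boolean array,
# and summing it counts the True entries
count_vectorized = int(np.sum(values > 3))

print(count_loop, count_vectorized)
```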


## Part 1: NumPy Basics (20 points)

### Question 1.1 (10 points)

Given the following array of weekly hours worked:

```python
hours = np.array([40, 52, 35, 45, 60, 38, 42, 55, 48, 37])
```

a) Calculate and print the mean, min, and max hours

b) Find all values greater than 45 (use boolean indexing)

c) Count how many people worked more than 45 hours

In [None]:
import numpy as np

hours = np.array([40, 52, 35, 45, 60, 38, 42, 55, 48, 37])

# a) Calculate and print the mean, min, and max hours
print("Mean:", hours.mean())
print("Min:", hours.min())
print("Max:", hours.max())

# b) Find all values greater than 45 (use boolean indexing)
greater_45 = hours[hours > 45]
print("Hours > 45:", greater_45)

# c) Count how many people worked more than 45 hours
count_over_45 = np.sum(hours > 45)
print("Count > 45:", count_over_45)
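
Boolean indexing works in two steps, which the sketch below separates for clarity. The intermediate name `mask` is introduced here only for illustration, and `np.count_nonzero` is shown as one possible alternative to `np.sum` for counting.

```python
import numpy as np

hours = np.array([40, 52, 35, 45, 60, 38, 42, 55, 48, 37])

# Step 1: the comparison produces a boolean array, one entry per element
mask = hours > 45
print(mask)

# Step 2: indexing with the mask keeps only the entries where it is True
print(hours[mask])

# Counting the True entries gives the same result as np.sum(mask)
print(np.count_nonzero(mask))
```

Note that 45 itself is excluded, because the comparison is strictly greater than.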

### Question 1.2 (10 points)

Given this 2D array of student grades (rows = students, columns = [midterm, final, project]):

```python
grades = np.array([
    [85, 90, 88],
    [72, 68, 75],
    [90, 95, 92],
    [65, 70, 68]
])
```

a) Print the shape of the array

b) Extract all final exam scores (second column)

c) Calculate the average grade for each student (row means)

In [None]:
grades = np.array([
    [85, 90, 88],
    [72, 68, 75],
    [90, 95, 92],
    [65, 70, 68]
])

# a) Print the shape of the array
print("Shape:", grades.shape)

# b) Extract all final exam scores (second column)
final_scores = grades[:, 1]
print("Final exam scores:", final_scores)

# c) Calculate the average grade for each student (row means)
student_averages = grades.mean(axis=1)
print("Student averages:", student_averages)
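
The `axis` argument is a common source of confusion, so here is a small sketch contrasting the two directions on the same array. The names `row_means` and `col_means` are chosen for this example only.

```python
import numpy as np

grades = np.array([
    [85, 90, 88],
    [72, 68, 75],
    [90, 95, 92],
    [65, 70, 68]
])

# axis=1 averages across the columns of each row: one value per student
row_means = grades.mean(axis=1)

# axis=0 averages down the rows of each column: one value per assessment
col_means = grades.mean(axis=0)

print(row_means.shape)  # four students
print(col_means.shape)  # three assessments
print(col_means)
```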

## Part 2: Pandas Basics (30 points)

### Question 2.1 (10 points)

Create a DataFrame called `employees` with the following data:

| Name | Age | Department | Salary |
|------|-----|------------|--------|
| Alice | 28 | Engineering | 75000 |
| Bob | 35 | Marketing | 65000 |
| Carol | 42 | Engineering | 95000 |
| David | 31 | Sales | 55000 |
| Eve | 29 | Engineering | 80000 |

Then print: (a) the shape, (b) column names, and (c) data types

In [None]:
import pandas as pd

# Create the employees DataFrame
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'David', 'Eve'],
    'Age': [28, 35, 42, 31, 29],
    'Department': ['Engineering', 'Marketing', 'Engineering', 'Sales', 'Engineering'],
    'Salary': [75000, 65000, 95000, 55000, 80000]
})

# a) Print the shape
print("Shape:", employees.shape)

# b) Print column names
print("Columns:", employees.columns.tolist())

# c) Print data types
print(employees.dtypes)

### Question 2.2 (10 points)

Using the `employees` DataFrame:

a) Select only the 'Name' and 'Salary' columns

b) Filter for employees in the Engineering department

c) Filter for employees with Salary >= 70000 AND Age < 40

In [None]:
# a) Select only the 'Name' and 'Salary' columns
print(employees[['Name', 'Salary']])

# b) Filter for employees in the Engineering department
print(employees[employees['Department'] == 'Engineering'])

# c) Filter for employees with Salary >= 70000 AND Age < 40
print(employees[(employees['Salary'] >= 70000) & (employees['Age'] < 40)])
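
The parentheses around each condition in part (c) matter: `&` binds more tightly than `>=` and `<`, so leaving them out changes how the expression is parsed. The sketch below rebuilds the DataFrame so it runs on its own, names the mask explicitly, and shows `DataFrame.query` as one possible alternative spelling; the names `mask`, `by_mask`, and `by_query` are introduced only for this example.

```python
import pandas as pd

employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'David', 'Eve'],
    'Age': [28, 35, 42, 31, 29],
    'Department': ['Engineering', 'Marketing', 'Engineering', 'Sales', 'Engineering'],
    'Salary': [75000, 65000, 95000, 55000, 80000]
})

# Each comparison is wrapped in parentheses before combining with &
mask = (employees['Salary'] >= 70000) & (employees['Age'] < 40)
by_mask = employees[mask]

# Alternative: the same filter written as a query string
by_query = employees.query("Salary >= 70000 and Age < 40")

print(by_mask['Name'].tolist())
print(by_mask.equals(by_query))
```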


### Question 2.3 (10 points)

Using the `employees` DataFrame:

a) Calculate the mean salary

b) Use `value_counts()` to count employees in each Department

c) Add a new column `Senior` that is `True` if Age >= 35, otherwise `False`

In [None]:
# a) Calculate the mean salary
print("Mean salary:", employees['Salary'].mean())

# b) Use value_counts() to count employees in each Department
print(employees['Department'].value_counts())

# c) Add a new column 'Senior' that is True if Age >= 35, otherwise False
employees['Senior'] = employees['Age'] >= 35
print(employees)


## Part 3: Exploring the Adult Dataset (50 points)

Run the cell below to load the UCI Adult Census Income dataset:

In [None]:
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'race',
           'sex', 'capital_gain', 'capital_loss', 'hours_per_week',
           'native_country', 'income']

# skipinitialspace=True strips the space after each comma, so missing values appear as '?'
df = pd.read_csv(url, names=columns, na_values='?', skipinitialspace=True)
print(f"Dataset loaded: {len(df)} records")
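
The raw file separates fields with a comma followed by a space and marks unknown values with `?`. The sketch below uses two made-up rows and a shortened, hypothetical column list (not the real file) to show how `skipinitialspace=True` and `na_values` interact: once the leading space is stripped, the missing-value marker the parser sees is `'?'`, and the income labels have no leading space.

```python
import io
import pandas as pd

# Two made-up rows in the same ", "-separated layout as the raw file
sample = io.StringIO(
    "39, State-gov, Bachelors, <=50K\n"
    "54, ?, HS-grad, >50K\n"
)
cols = ['age', 'workclass', 'education', 'income']  # shortened for the demo

demo = pd.read_csv(sample, names=cols, na_values='?', skipinitialspace=True)

# The '?' entry is read as a missing value
print(demo['workclass'].isna().sum())

# String fields carry no leading space, so comparisons like == '>50K' work
print(demo['income'].tolist())
```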

### Question 3.1 (20 points)

a) How many rows and columns are in the dataset?

b) Which columns have missing values and how many?

c) What percentage of people earn >50K?

In [None]:
# a) How many rows and columns are in the dataset?
print("Shape:", df.shape)

# b) Which columns have missing values and how many?
missing_counts = df.isna().sum()
print(missing_counts[missing_counts > 0])  # show only columns that have missing values

# c) What percentage of people earn >50K?
percent_over_50k = (df['income'] == '>50K').mean() * 100
print(f"Percent earning >50K: {percent_over_50k:.2f}%")
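
Taking the mean of a boolean Series gives the fraction of `True` values, which is why part (c) works. The sketch below checks this on a tiny made-up Series (the values are illustrative, not from the dataset) and compares it with `value_counts(normalize=True)`, which is another way to get the same share.

```python
import pandas as pd

# Made-up labels for illustration only
income = pd.Series(['<=50K', '>50K', '<=50K', '<=50K'])

# Mean of a boolean Series = fraction of True values
share_mean = (income == '>50K').mean() * 100

# value_counts(normalize=True) returns the share of every category
share_vc = income.value_counts(normalize=True) * 100

print(share_mean)
print(share_vc['>50K'])
```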

### Question 3.2 (15 points)

a) What is the average age in the dataset?

b) What is the average hours_per_week?

c) What are the top 3 most common education levels?

In [None]:
# a) What is the average age in the dataset?
print("Average age:", df['age'].mean())

# b) What is the average hours_per_week?
print("Average hours/week:", df['hours_per_week'].mean())

# c) What are the top 3 most common education levels?
print(df['education'].value_counts().head(3))


### Question 3.3 (15 points)

a) Filter the dataset for people aged 30 or younger. How many are there?

b) Filter for people with income '>50K' AND hours_per_week > 40. How many?

c) What percentage of people work in the 'Private' sector?

In [None]:
# a) Filter the dataset for people aged 30 or younger. How many are there?
age_30_or_younger = df[df['age'] <= 30]
print("People age 30 or younger:", len(age_30_or_younger))

# b) Filter for people with income '>50K' AND hours_per_week > 40. How many?
high_income_long_hours = df[(df['income'] == '>50K') & (df['hours_per_week'] > 40)]
print("High income & >40 hours:", len(high_income_long_hours))

# c) What percentage of people work in the 'Private' sector?
percent_private = (df['workclass'] == 'Private').mean() * 100
print(f"Percent in Private sector: {percent_private:.2f}%")

### Question 3.4 (10 points, bonus)

a) Create a cross-tabulation of `sex` vs `income` showing row percentages

b) **Written Answer:** Based on the cross-tabulation, describe the income difference between males and females in 1-2 sentences.

In [None]:
# a) Create a cross-tabulation of sex vs income showing row percentages
crosstab = pd.crosstab(df['sex'], df['income'], normalize='index') * 100
print(crosstab)
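
With `normalize='index'`, each row of the cross-tabulation is divided by its own row total, so every row sums to 100 after scaling. The sketch below shows this on a small made-up table; the `toy` DataFrame and its values are hypothetical and say nothing about the real dataset.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female'],
    'income': ['>50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K'],
})

# Row percentages: each row is normalized by its own total
row_pct = pd.crosstab(toy['sex'], toy['income'], normalize='index') * 100
print(row_pct)

# Every row adds up to 100
print(row_pct.sum(axis=1))
```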

**b) Written Answer:**

Males earn `>50K` at a markedly higher rate than females. A much larger share of females falls in the `<=50K` category, indicating a notable income gap by sex in this dataset.

### Question 3.5 (10 points, bonus)

a) Create a new column `is_high_earner` that is 1 if income is '>50K', otherwise 0

b) Create a new column `age_group` using `pd.cut()` with bins [0, 30, 50, 100] and labels ['Young', 'Middle', 'Senior']

c) Show the first 5 rows with columns: age, age_group, income, is_high_earner

In [None]:
# a) Create a new column 'is_high_earner' that is 1 if income is '>50K', otherwise 0
df['is_high_earner'] = (df['income'] == '>50K').astype(int)

# b) Create a new column 'age_group' using pd.cut()
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 30, 50, 100],
    labels=['Young', 'Middle', 'Senior']
)

# c) Show the first 5 rows with columns: age, age_group, income, is_high_earner
print(df[['age', 'age_group', 'income', 'is_high_earner']].head())
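
By default `pd.cut` uses right-inclusive intervals, so the bins above are (0, 30], (30, 50], and (50, 100]. The sketch below checks the boundary ages on a small made-up Series; `ages` and `groups` are names chosen for this example.

```python
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges
ages = pd.Series([17, 30, 31, 50, 51, 90])

groups = pd.cut(
    ages,
    bins=[0, 30, 50, 100],
    labels=['Young', 'Middle', 'Senior']
)

# 30 falls in 'Young' and 50 in 'Middle' because the right edge is included
print(groups.tolist())
```

Passing `right=False` would instead make the intervals left-inclusive, which moves ages 30 and 50 into the next group up.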

## Submission Checklist

Before submitting, ensure:

- [ ] All code cells show output
- [ ] Written answer is complete
- [ ] Notebook runs top to bottom without errors
- [ ] File named: `HW1_LastName_FirstName.ipynb`


## Grading Rubric

| Part | Points |
|------|--------|
| Part 1: NumPy | 20 |
| Part 2: Pandas | 30 |
| Part 3: Adult Dataset | 50 |
| **Total** | **100** |
| Bonus (Questions 3.4 and 3.5) | +20 |