# Data 101 — Module 5 Assignment

**What you will practice:**
- Importing and using libraries (`numpy`, `pandas`)
- Array creation, slicing, broadcasting, reshaping
- DataFrame creation, selection with `loc`/`iloc`, boolean filtering
- Data wrangling: missing values, types, duplicates, outliers
- Summarizing with `groupby`, `agg`, and pivot tables

**Rules:**
- Write your code in the `#TODO` sections only.
- Do not change variable or function names.
- You may add new cells for scratch work.

**Grading rubric:** 100 points total
- A. NumPy (30 pts)
- B. Pandas basics (25 pts)
- C. Wrangling & summarizing (45 pts)

In [1]:
# Setup
import numpy as np
import pandas as pd

# Display options for readability
pd.set_option("display.max_rows", 20)
pd.set_option("display.max_columns", 20)
print("Versions:", {"numpy": np.__version__, "pandas": pd.__version__})

Versions: {'numpy': '2.3.3', 'pandas': '2.3.2'}


## Part A — NumPy (30 pts)
Work only where marked `#TODO`. Keep the variable names as given.

### A1. Create arrays (5 pts)
Create:
- `a`: 1D array with values 1..10
- `b`: 2D array of shape (3, 4) filled with ones
- `c`: 10 numbers linearly spaced from 0 to 1 inclusive

In [None]:
# ASSIGNMENT CELL: Q1
a = np.arange(1,11,1)
b = np.ones(shape=(3,4))
c = np.linspace(0,1,10)
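
As a rough illustration of how the stop value is treated, the sketch below contrasts `np.arange` (stop excluded) with `np.linspace` (stop included by default). The names `ints` and `line` are throwaway names chosen for this example, not assignment variables.

```python
import numpy as np

# Throwaway names so the graded variables a, b, c are not overwritten.
ints = np.arange(1, 11)        # stop value 11 is excluded, so the last element is 10
line = np.linspace(0, 1, 10)   # stop value 1 is included; 10 points in total

print(ints[0], ints[-1], ints.size)
print(line[0], line[-1], line.size)
```

Because `linspace` takes a point count rather than a step, it is usually the safer choice when both endpoints must appear in the result.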

### A2. Indexing and slicing (10 pts)
Using `a` from A1:
- Set `a_first` to the first element
- Set `a_mid` to the elements at zero-based positions 3..6 inclusive (1-based positions 4..7), using a slice
- Using a new array `d = np.arange(12).reshape(3,4)`, set `d_edge` to the last column

In [None]:
# ASSIGNMENT CELL: Q2
a_first = a[0]
a_mid = a[3:7]
d = np.arange(12).reshape(3,4)
d_edge = d[:,3]
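
A small sketch of why zero-based positions 3..6 inclusive correspond to the slice `3:7`: the stop index of a slice is excluded. The arrays `demo` and `grid` are illustrative stand-ins, not assignment variables.

```python
import numpy as np

# Illustrative arrays; names chosen to avoid clobbering a and d.
demo = np.arange(1, 11)              # values 1..10
mid = demo[3:7]                      # indices 3, 4, 5, 6 (stop index 7 is excluded)

grid = np.arange(12).reshape(3, 4)
last_col = grid[:, -1]               # -1 picks the last column whatever the width

print(mid)
print(last_col)
```

Using `-1` instead of a hard-coded `3` is one possible choice for the last column; both select the same data for a 4-column array.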

### A3. Broadcasting and vectorization (10 pts)

Let `x = np.array([2, 4, 6, 8])`.

- Use a single vectorized expression to scale each element by 10 (* 10) and then shift the result by 1 (+ 1). Store in `y`.
- Use broadcasting to combine `x` with a column vector of shape (2,1), so that the result has shape (2,4). The first row should reproduce `x`, and the second row should be `x` shifted by one unit (+ 1). Store in `z`.

In [None]:
# ASSIGNMENT CELL: Q3
x = np.array([2,4,6,8])
y = x * 10 + 1
z = x + np.array([[0],[1]])
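
To make the broadcasting step concrete, here is a minimal sketch that inspects the shapes involved. The names `row`, `col`, and `out` are illustrative only.

```python
import numpy as np

row = np.array([2, 4, 6, 8])     # shape (4,)
col = np.array([[0], [1]])       # shape (2, 1), a column vector

# Shapes are aligned from the right: (4,) is treated as (1, 4),
# then every size-1 axis is stretched to match the other operand.
result_shape = np.broadcast_shapes(row.shape, col.shape)
out = row + col

print(result_shape)
print(out)
```

The first row of `out` adds 0 to every element of `row`, and the second row adds 1, which matches the layout the task describes.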

### A4. Reshape and combine (5 pts)
- Reshape `a` into a 2D array `a2` with shape (2,5)
- Vertically stack `a2` on top of itself to form `a3` with shape (4,5)

In [None]:
# ASSIGNMENT CELL: Q4
a2 = a.reshape((2,5))
a3 = np.vstack([a2,a2])
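
A brief sketch of how the shapes change under reshaping and stacking, with `np.hstack` shown only for contrast. All names here are illustrative.

```python
import numpy as np

vals = np.arange(1, 11)
two_rows = vals.reshape(2, -1)                   # -1 lets NumPy infer 5 columns
stacked = np.vstack([two_rows, two_rows])        # rows are appended: (4, 5)
side_by_side = np.hstack([two_rows, two_rows])   # columns are appended: (2, 10)

print(two_rows.shape, stacked.shape, side_by_side.shape)
```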

## Part B — Pandas basics (25 pts)

### B1. Load and inspect (5 pts)
Load the CSV at `./data/students_messy.csv` into `df`.

In [None]:
# ASSIGNMENT CELL: Q5
df = pd.read_csv("./data/students_messy.csv")

### B2. Column and row selection (15 pts)
- Set `gpa_series` to the `GPA` column as a Series
- Set `name_gpa_df` to a DataFrame with columns `Name` and `GPA`
- Using `loc`, select rows 1..3 inclusive into `mid_rows`
- Using `iloc`, select all rows and the first two columns into `first_two_cols`

In [None]:
# ASSIGNMENT CELL: Q6
gpa_series = df["GPA"]
name_gpa_df = df[["Name", "GPA"]]
mid_rows = df.loc[1:3]
first_two_cols = df.iloc[:,0:2]
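
The difference between label-based and position-based slicing is easy to miss, so here is a small sketch on made-up data. The frame `toy` and its values are hypothetical; the contents of the real CSV are not shown in this notebook.

```python
import pandas as pd

# Hypothetical toy data with the default 0..n-1 integer index.
toy = pd.DataFrame(
    {
        "Name": ["Ana", "Ben", "Cy", "Dee", "Eli"],
        "GPA": [3.9, 2.7, 3.4, 3.1, 2.2],
    }
)

by_label = toy.loc[1:3]        # label slice: both endpoints 1 and 3 are included
by_position = toy.iloc[1:3]    # position slice: position 3 is excluded

print(len(by_label), len(by_position))
```

Labels and positions only coincide while the index is the default range; after rows are dropped or reordered, `loc[1:3]` and `iloc[1:4]` may select different rows.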

### B3. Boolean filtering (5 pts)
Filter to rows with `GPA >= 3.5` into `high_gpa`. Use numeric comparison, not strings.

In [None]:
# ASSIGNMENT CELL: Q7
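# Coerce to numeric first so the comparison is not done on strings
gpa_numeric = pd.to_numeric(df["GPA"], errors="coerce")
high_gpa = df[gpa_numeric >= 3.5]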

## Part C — Wrangling & summarizing (45 pts)

### C1. Clean data (20 pts)

Create a new DataFrame named `df_clean` as a copy of `df`, then do all of the following on `df_clean`:

- Fix text: strip leading and trailing whitespace, and make `Major` title case
- Convert `GPA` and `Hours_Studied` to numbers
- Convert `ExamDate` to dates
- Remove duplicate rows
- Fill missing `GPA` with the mean
- Convert `Major` and `Gender` to category

After this, `df_clean` should have these types:  
`Name` (object), `Major` (category), `GPA` (float), `Hours_Studied` (float), `Gender` (category), `ExamDate` (datetime).

In [None]:
# ASSIGNMENT CELL: Q8
df_clean = df.copy()
df_clean["Major"] = df_clean["Major"].str.strip().str.title()
df_clean["GPA"] = pd.to_numeric(df_clean["GPA"], errors="coerce")
df_clean["Hours_Studied"] = pd.to_numeric(df_clean["Hours_Studied"], errors="coerce")
df_clean["ExamDate"] = pd.to_datetime(df_clean["ExamDate"], errors="coerce")
df_clean = df_clean.drop_duplicates()
df_clean["GPA"] = df_clean["GPA"].fillna(df_clean["GPA"].mean())
df_clean["Major"] = df_clean["Major"].astype("category")
df_clean["Gender"] = df_clean["Gender"].astype("category")
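
The cleaning steps can be tried on a tiny frame first. The sketch below assumes, purely for illustration, that messy values look like padded strings and an unparsable `"n/a"`; the frame `raw` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical messy values, not taken from the real file.
raw = pd.DataFrame(
    {
        "Major": ["  math", "PHYSICS ", "math"],
        "GPA": ["3.5", "n/a", "2.0"],
    }
)

clean = raw.copy()
clean["Major"] = clean["Major"].str.strip().str.title()
clean["GPA"] = pd.to_numeric(clean["GPA"], errors="coerce")   # "n/a" becomes NaN
clean["GPA"] = clean["GPA"].fillna(clean["GPA"].mean())       # mean ignores NaN
clean["Major"] = clean["Major"].astype("category")

print(clean.dtypes)
print(clean)
```

With `errors="coerce"`, values that cannot be parsed become `NaN` instead of raising, which is what allows the later `fillna` step to handle them.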

### C2. Outlier handling (5 pts)
Use the IQR rule on `GPA` to build `df_no_outliers`, keeping only the rows whose `GPA` lies within `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]`.

In [None]:
# ASSIGNMENT CELL: Q9
# Work on df_clean, where GPA was already converted to float in C1
q1, q3 = df_clean["GPA"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
df_no_outliers = df_clean[(df_clean["GPA"] >= lower) & (df_clean["GPA"] <= upper)]
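
A worked sketch of the IQR rule on a handful of made-up values may help when checking the bounds by hand. The series `gpa` is hypothetical, and `Series.between` is used here as one possible way to express the inclusive range.

```python
import pandas as pd

# Made-up values: 0.4 and 9.9 are meant to look like data-entry errors.
gpa = pd.Series([0.4, 3.0, 3.2, 3.4, 3.6, 9.9])

q1 = gpa.quantile(0.25)          # default linear interpolation between neighbors
q3 = gpa.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

kept = gpa[gpa.between(lower, upper)]   # between() includes both endpoints

print(round(lower, 2), round(upper, 2))
print(kept.tolist())
```

Here the quartiles are roughly 3.05 and 3.55, so the fences sit near 2.3 and 4.3 and only the two extreme values are dropped.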

### C3. Derived columns (5 pts)
Add a `Failed` column to `df_clean` that is `'Yes'` if `GPA < 2.0` else `'No'`. Add a `Study_Efficiency` column defined as `GPA / Hours_Studied`.

In [None]:
# ASSIGNMENT CELL: Q10
df_clean["Failed"] = np.where(df_clean["GPA"] < 2.0, "Yes", "No")
df_clean["Study_Efficiency"] = df_clean["GPA"] / df_clean["Hours_Studied"]
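
One thing worth checking with a ratio column is what happens when the denominator is zero. The sketch below uses a hypothetical frame `toy`; the `Efficiency_safe` column is an optional idea for illustration and is not something the assignment asks for.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the first student reports zero study hours.
toy = pd.DataFrame({"GPA": [3.0, 1.5, 2.0], "Hours_Studied": [0.0, 5.0, 4.0]})

toy["Failed"] = np.where(toy["GPA"] < 2.0, "Yes", "No")
toy["Study_Efficiency"] = toy["GPA"] / toy["Hours_Studied"]   # 3.0 / 0.0 gives inf

# Optional and not required by the task: treat infinite ratios as missing.
toy["Efficiency_safe"] = toy["Study_Efficiency"].replace([np.inf, -np.inf], np.nan)

print(toy)
```

pandas does not raise on float division by zero, so an infinite value can silently skew a later mean unless it is handled.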

### C4. Grouping and aggregation (10 pts)
Group by `Major` and compute:

- `GPA_mean` and `GPA_std`
- `Hours_mean`

Assign the result to `by_major`.

In [None]:
# ASSIGNMENT CELL: Q11
by_major = df_clean.groupby("Major", observed=True).agg(
    GPA_mean=("GPA","mean"),
    GPA_std=("GPA","std"),
    Hours_mean=("Hours_Studied","mean")
)
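
Named aggregation pairs each output column with a `(source column, function)` tuple. The sketch below shows the pattern on a hypothetical three-row frame `toy`, including what the standard deviation looks like for a group with a single row.

```python
import pandas as pd

# Hypothetical data: two Math students and one Physics student.
toy = pd.DataFrame(
    {
        "Major": ["Math", "Math", "Physics"],
        "GPA": [3.0, 4.0, 3.5],
        "Hours_Studied": [10.0, 20.0, 15.0],
    }
)

summary = toy.groupby("Major").agg(
    GPA_mean=("GPA", "mean"),
    GPA_std=("GPA", "std"),              # sample standard deviation (ddof=1)
    Hours_mean=("Hours_Studied", "mean"),
)

print(summary)
```

A group with only one row has an undefined sample standard deviation, so `GPA_std` is `NaN` for Physics in this toy example.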

### C5. Pivot table (5 pts)

Make a pivot table named `gpa_by_major_gender`.  
- Use `Major` as the index (rows).  
- Use `Gender` as the columns.  
- The values should be the **maximum GPA** within each Major–Gender group.  

In [None]:
# ASSIGNMENT CELL: Q12
gpa_by_major_gender = df_clean.pivot_table(values="GPA", index="Major",
    columns="Gender", aggfunc="max", observed=False)
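
To see how the pivot handles a Major and Gender combination that has no rows, here is a sketch on a hypothetical frame `toy` with plain string columns.

```python
import pandas as pd

# Hypothetical data: there is no male Physics student.
toy = pd.DataFrame(
    {
        "Major": ["Math", "Math", "Math", "Physics"],
        "Gender": ["F", "F", "M", "F"],
        "GPA": [3.0, 3.6, 3.8, 3.5],
    }
)

table = toy.pivot_table(values="GPA", index="Major", columns="Gender", aggfunc="max")

print(table)
```

Cells for combinations that never occur come back as `NaN`, and each filled cell holds the largest GPA in its group rather than an average.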