# HW3: Python Fundamentals - Housing Price Prediction

This notebook performs the required tasks for the assignment step-by-step. Each section includes a detailed explanation of the task, the code implementation, and comments for clarity.

## Step 1: NumPy Operations

**Detailed Explanation:**
- We start by importing NumPy and the time module to measure execution time.
- Create a large array using np.arange() to demonstrate operations on a sizable dataset (1,000,000 elements).
- Perform element-wise squaring using a traditional Python loop (list comprehension) and measure the time taken.
- Then, perform the same operation using NumPy's vectorized np.square() function and measure the time.
- Compare the two timings to show how much faster vectorized operations are than Python-level loops (in the run below, 0.0453 s versus 0.0006 s, roughly 75 times faster).
- Finally, print the first 5 results to verify the operation.

This step addresses the requirement to create an array, perform element-wise operations, and compare loop vs. vectorized execution.

In [1]:
import numpy as np
import time

# Step 1.1: Create a large array for demonstration
array = np.arange(1000000)  # Array from 0 to 999999

# Step 1.2: Perform squaring using a loop (non-vectorized)
start_time = time.perf_counter()  # Start timing (perf_counter is a high-resolution clock meant for measuring intervals)
loop_result = [x**2 for x in array]  # Square each element using list comprehension
loop_time = time.perf_counter() - start_time  # End timing

# Step 1.3: Perform squaring using NumPy vectorization
start_time = time.perf_counter()  # Start timing
vectorized_result = np.square(array)  # Vectorized squaring
vectorized_time = time.perf_counter() - start_time  # End timing

# Step 1.4: Print results and comparison
print(f'Loop time: {loop_time:.4f} seconds')  # Display loop execution time
print(f'Vectorized time: {vectorized_time:.4f} seconds')  # Display vectorized execution time
print(f'First 5 squared values: {vectorized_result[:5]}')  # Verify results with first 5 elements

Loop time: 0.0453 seconds
Vectorized time: 0.0006 seconds
First 5 squared values: [ 0  1  4  9 16]
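
A single timing measurement can be noisy. As a hedged sketch (not part of the original assignment code), the same comparison can be repeated with the standard-library timeit module, keeping the best of several runs. The repeat count is an arbitrary choice, and dtype=np.int64 is set explicitly as an assumption so the squares do not overflow on platforms where NumPy's default integer is 32-bit.

```python
import timeit

import numpy as np

# Explicit int64 (an assumption for this sketch): 999_999**2 does not fit in 32 bits.
array = np.arange(1_000_000, dtype=np.int64)


def square_with_loop(values):
    """Square each element one at a time in the Python interpreter."""
    return [x ** 2 for x in values]


def square_vectorized(values):
    """Square all elements in one call to NumPy's compiled routine."""
    return np.square(values)


# Best of three runs for each approach; min() reduces noise from other processes.
loop_time = min(timeit.repeat(lambda: square_with_loop(array), number=1, repeat=3))
vectorized_time = min(timeit.repeat(lambda: square_vectorized(array), number=1, repeat=3))

print(f"Loop time (best of 3): {loop_time:.4f} seconds")
print(f"Vectorized time (best of 3): {vectorized_time:.4f} seconds")

# Both approaches should produce the same values (checked on the first 1000 elements).
loop_result = square_with_loop(array[:1000])
vectorized_result = square_vectorized(array[:1000])
assert np.array_equal(np.array(loop_result), vectorized_result)
```

The absolute numbers depend on the machine, so only the ratio between the two timings is meaningful.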


## Step 2: Dataset Loading

**Detailed Explanation:**
- Import pandas for data handling.
- Load the CSV file from the project's data/ directory using pd.read_csv(). Relative to the notebooks folder, the file is at '../data/starter_data.csv'.
- Use df.info() to display the dataset structure, including column names, data types, and non-null counts.
- Use df.head() to show the first 5 rows for a quick inspection of the data.

This step fulfills the requirement to load the provided CSV and inspect it with .info() and .head().

In [2]:
import pandas as pd

# Step 2.1: Load the dataset from the project's data/ directory
df = pd.read_csv('../data/starter_data.csv')  # Read the CSV file into a DataFrame (path is relative to notebooks/)

# Step 2.2: Inspect the dataset structure
print('Dataset Info:')  # Header for info output
df.info()  # Display info: columns, types, non-null counts

# Step 2.3: Display the first few rows
print('\nFirst 5 rows:')  # Header for head output
print(df.head())  # Show first 5 rows

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  10 non-null     object
 1   value     10 non-null     int64 
 2   date      10 non-null     object
dtypes: int64(1), object(2)
memory usage: 372.0+ bytes

First 5 rows:
  category  value      date
0        A     10  2025/8/1
1        B     15  2025/8/2
2        A     12  2025/8/3
3        B     18  2025/8/4
4        C     25  2025/8/5
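
The info() output shows that the date column is loaded as object (plain strings) rather than as a datetime type. As a minimal sketch, assuming only the five rows printed above and the year/month/day layout they display, the column can be converted with pd.to_datetime(). The in-memory CSV text stands in for starter_data.csv so the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for starter_data.csv: only the five rows shown by df.head() above.
csv_text = """category,value,date
A,10,2025/8/1
B,15,2025/8/2
A,12,2025/8/3
B,18,2025/8/4
C,25,2025/8/5
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

# Convert the string dates; the format is an assumption based on the rows shown.
sample_df["date"] = pd.to_datetime(sample_df["date"], format="%Y/%m/%d")

sample_df.info()  # The date column should now report a datetime64 dtype
print(sample_df.head())
```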


## Step 3: Summary Statistics

**Detailed Explanation:**
- To import custom utilities, add the project root directory to sys.path, as the notebook runs in 'notebooks/' and utils.py is in '../src/'.
- Use os.getcwd() to get the current working directory and navigate to the parent directory (project root).
- Import the get_summary_stats function from src/utils.py.
- Call the function on the DataFrame to compute summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for numeric columns.
- Print the results for verification.

This step calculates .describe() for numeric columns using a reusable function, addressing the requirement for summary statistics and reusable functions (bonus: moved to src/utils.py).

In [3]:
import sys
import os
# Add project root to sys.path (parent of notebooks/ directory)
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))  # Move up one directory from notebooks/
sys.path.append(project_root)  # Add project root to Python path

# Verify the path for debugging
print(f'Project root added to sys.path: {project_root}')
print(f'src/utils.py exists: {os.path.exists(os.path.join(project_root, "src", "utils.py"))}')

from src.utils import get_summary_stats

# Step 3.1: Calculate summary statistics using the reusable function
summary_stats = get_summary_stats(df)  # Call the function to get .describe()

# Step 3.2: Display the summary statistics
print('Summary Statistics:')  # Header for output
print(summary_stats)  # Print the DataFrame of statistics

Project root added to sys.path: /Users/junshao/bootcamp_Jun_Shao/homework/hw3
src/utils.py exists: True
Summary Statistics:
           value
count  10.000000
mean   17.600000
std     7.381659
min    10.000000
25%    12.250000
50%    14.500000
75%    23.250000
max    30.000000
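
The source of src/utils.py is not shown in this notebook. A minimal sketch of what get_summary_stats might look like, assuming it simply wraps .describe() on the numeric columns as described above (the function body, type hints, and sample data are illustrative, not the actual file contents):

```python
import pandas as pd


def get_summary_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Return describe() statistics for the numeric columns of df.

    Hypothetical reconstruction; the real src/utils.py may differ.
    """
    numeric_df = df.select_dtypes(include="number")
    return numeric_df.describe()


# Illustrative data: the five rows shown by df.head() in Step 2.
sample_df = pd.DataFrame(
    {
        "category": ["A", "B", "A", "B", "C"],
        "value": [10, 15, 12, 18, 25],
        "date": ["2025/8/1", "2025/8/2", "2025/8/3", "2025/8/4", "2025/8/5"],
    }
)

summary_stats = get_summary_stats(sample_df)
print(summary_stats)
```

describe() already skips non-numeric columns when numeric ones are present; the explicit select_dtypes call only makes that intent visible.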


## Step 4: Groupby Aggregation

**Detailed Explanation:**
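- Use df.groupby() to group the data by a categorical column. The code assumes a column named 'location', which the starter dataset does not have; its categorical column is 'category'.
- Aggregate the 'price' column to compute mean and count. The starter dataset has no 'price' column either; its numeric column is 'value'.
- Wrap the call in a try-except block so that a missing column prints the available columns instead of raising, which makes debugging easier.
- Print the aggregation results.

This step is meant to satisfy the .groupby() aggregation requirement. As written, the cell takes the except branch and produces no aggregation, so 'location' and 'price' need to be replaced with 'category' and 'value' for this dataset.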

In [4]:
# Step 4.1: Attempt to group by 'location' and aggregate 'price'
try:
    groupby_stats = df.groupby('location')['price'].agg(['mean', 'count'])  # Group and aggregate mean and count
    print('Groupby Aggregation (by location):')  # Header for output
    print(groupby_stats)  # Display the aggregated DataFrame
except KeyError:
    print('No "location" column found. Please replace with a valid categorical column.')  # Error message
    print('Available columns:', df.columns.tolist())  # List all columns for reference

No "location" column found. Please replace with a valid categorical column.
Available columns: ['category', 'value', 'date']
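
The printed column list shows that the dataset's categorical column is category and its numeric column is value. Below is a sketch of the same aggregation with those names, using an explicit column check in place of the try-except. It runs on only the five rows shown in Step 2, so the numbers it prints are illustrative and are not the result for the full ten-row file.

```python
import pandas as pd

# Illustrative data: the five rows shown by df.head() in Step 2.
sample_df = pd.DataFrame(
    {
        "category": ["A", "B", "A", "B", "C"],
        "value": [10, 15, 12, 18, 25],
    }
)

group_col, value_col = "category", "value"

# Check both columns up front so the message names whichever one is missing.
missing = [col for col in (group_col, value_col) if col not in sample_df.columns]
if missing:
    print(f"Missing columns: {missing}")
    print("Available columns:", sample_df.columns.tolist())
    groupby_stats = None
else:
    # Mean and row count of the numeric column within each category.
    groupby_stats = sample_df.groupby(group_col)[value_col].agg(["mean", "count"])
    print(f"Groupby Aggregation (by {group_col}):")
    print(groupby_stats)
```

Setting groupby_stats to None on failure lets later cells test for it directly instead of relying on a NameError.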


## Step 5: Save Outputs

**Detailed Explanation:**
- Save the summary statistics to a CSV file in the processed data folder using to_csv().
- For the bonus task: Import matplotlib, create a bar plot of the average prices from the groupby results.
- Customize the plot with title, labels, and layout.
- Save the plot as a PNG file.
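- Catch the NameError raised when groupby_stats does not exist because the groupby in Step 4 failed.

This step saves the summary statistics to CSV. The bonus plot is created and saved only when groupby_stats exists; in the run shown below, Step 4 failed, so no plot is written.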

In [5]:
# Step 5.1: Save summary statistics to CSV
summary_stats.to_csv('../data/processed/summary.csv')  # Export to CSV
print('Summary statistics saved to data/processed/summary.csv')  # Confirmation message

# Step 5.2: Bonus - Create and save a bar plot
import matplotlib.pyplot as plt  # Import plotting library

plt.figure(figsize=(8, 6))  # Set figure size
try:
    groupby_stats['mean'].plot(kind='bar')  # Plot mean values as bars
    plt.title('Average House Price by Location')  # Set title
    plt.xlabel('Location')  # Set x-axis label
    plt.ylabel('Average Price')  # Set y-axis label
    plt.tight_layout()  # Adjust layout
    plt.savefig('../data/processed/bar_plot.png')  # Save as PNG
    print('Bar plot saved to data/processed/bar_plot.png')  # Confirmation message
except NameError:
    print('Cannot create plot due to missing groupby_stats.')  # Error message if groupby failed

Summary statistics saved to data/processed/summary.csv
Cannot create plot due to missing groupby_stats.


<Figure size 800x600 with 0 Axes>
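
The empty figure reported above appears because plt.figure() runs before groupby_stats is accessed. A hedged sketch of one way to restructure the cell: create the output directory if needed, save the CSV, and build the figure only when grouped data is available. The temporary output directory, the Agg backend, and the sample values (taken from the five rows shown in Step 2) are assumptions made so the snippet runs on its own; in the notebook the target would be ../data/processed.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # Non-interactive backend (assumption: no display is needed)
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data: the five rows shown by df.head() in Step 2.
sample_df = pd.DataFrame(
    {"category": ["A", "B", "A", "B", "C"], "value": [10, 15, 12, 18, 25]}
)
summary_stats = sample_df.describe()
groupby_stats = sample_df.groupby("category")["value"].agg(["mean", "count"])

# Temporary directory as a stand-in for ../data/processed.
output_dir = Path(tempfile.mkdtemp()) / "processed"
output_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing folders

summary_path = output_dir / "summary.csv"
summary_stats.to_csv(summary_path)
print(f"Summary statistics saved to {summary_path}")

plot_path = output_dir / "bar_plot.png"
if groupby_stats is not None and not groupby_stats.empty:
    # Create the figure only once there is something to draw.
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.bar(list(groupby_stats.index), groupby_stats["mean"])
    ax.set_title("Average value by category")
    ax.set_xlabel("Category")
    ax.set_ylabel("Average value")
    fig.tight_layout()
    fig.savefig(plot_path)
    plt.close(fig)  # Release the figure so it is not displayed again
    print(f"Bar plot saved to {plot_path}")
else:
    print("Skipping plot: no grouped data available.")
```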