# Python Fundamentals Summary - Housing Price Prediction

This notebook demonstrates Python, NumPy, and pandas operations using the Kaggle House Prices train dataset as part of the Housing Price Prediction Project. It includes:
- Loading and inspecting the dataset (`../data/raw/train.csv`).
- Performing NumPy operations (e.g., vectorized calculations on SalePrice).
- Performing pandas operations (e.g., summary statistics, grouping by Neighborhood).
- Applying reusable utility functions from `src/utils.py` (e.g., `clean_column_names`, `convert_year_to_age`).
- Saving cleaned data to `data/processed/train_cleaned.csv`.

The goal is to establish foundational Python skills for data preprocessing, exploratory data analysis (EDA), and modeling in later stages.

## Setup and Imports

**Explanation**:
- Import libraries: `pandas` for data handling, `numpy` for numerical operations, `os` and `dotenv` for environment variables.
- Add project root to `sys.path` to import `src.utils`.
- Load environment variables for paths.

In [15]:
import pandas as pd
import numpy as np
import os
import sys
from dotenv import load_dotenv

# Add project root to sys.path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root)


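# Utility functions (described as living in src/utils.py), defined inline
# here so the notebook is self-contained; pandas and os are imported above.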

def clean_column_names(df):
    """
    Standardize DataFrame column names by converting to lowercase and replacing spaces/special
    characters with underscores.
    
    Args:
        df (pd.DataFrame): Input DataFrame.
        
    Returns:
        pd.DataFrame: DataFrame with cleaned column names.
    """
    df = df.copy()
    df.columns = df.columns.str.lower().str.replace(r'[^a-z0-9]', '_', regex=True)
    return df

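def convert_year_to_age(df, year_columns, current_year=2025):
    """
    Convert year columns to age relative to current_year.

    Args:
        df (pd.DataFrame): Input DataFrame.
        year_columns (list[str]): Year columns to convert (e.g., 'yearbuilt').
        current_year (int): Reference year used to compute ages.

    Returns:
        pd.DataFrame: Copy of df with one age column added per year column.
    """
    # Known columns keep their original age names; any other column gets
    # '<col>_age' so that it no longer overwrites 'remodel_age'.
    age_names = {"yearbuilt": "house_age", "yearremodadd": "remodel_age"}
    df = df.copy()
    for col in year_columns:
        if col not in df.columns:
            raise KeyError(f"Column {col} not found in DataFrame")
        df[age_names.get(col, f"{col}_age")] = current_year - df[col]
    return df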

def save_data(df, filename, data_dir):
    """
    Save DataFrame to CSV in the specified directory, creating folders if needed.
    
    Args:
        df (pd.DataFrame): Input DataFrame.
        filename (str): Name of the file (e.g., 'train_cleaned.csv').
        data_dir (str): Directory path to save the file.
        
    Returns:
        str: Path to the saved file.
    """
    os.makedirs(data_dir, exist_ok=True)  # Create directory if it doesn't exist
    filepath = os.path.join(data_dir, filename)
    df.to_csv(filepath, index=False)
    return filepath

# Load environment variables
load_dotenv()
DATA_DIR_RAW = os.getenv('DATA_DIR_RAW')
DATA_DIR_PROCESSED = os.getenv('DATA_DIR_PROCESSED')

# Verify environment variables
print(f'DATA_DIR_RAW: {DATA_DIR_RAW}')
print(f'DATA_DIR_PROCESSED: {DATA_DIR_PROCESSED}')

DATA_DIR_RAW: data/raw
DATA_DIR_PROCESSED: data/processed
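
If the `.env` file is missing, `os.getenv` returns `None` and later `os.path.join` calls fail with a `TypeError`. The following is a minimal sketch of a fallback, assuming a hypothetical helper named `resolve_data_dir` (not part of `src/utils.py`) and default directories that match the values printed above:

```python
import os

def resolve_data_dir(var_name, default, base_dir="."):
    """Return an absolute data directory read from an environment variable.

    `resolve_data_dir`, `default`, and `base_dir` are illustrative names,
    not part of the project's utilities.
    """
    # Fall back to the default when the variable is unset or empty
    value = os.getenv(var_name) or default
    # os.path.join keeps `value` unchanged if it is already absolute
    return os.path.abspath(os.path.join(base_dir, value))

raw_dir = resolve_data_dir("DATA_DIR_RAW", "data/raw")
processed_dir = resolve_data_dir("DATA_DIR_PROCESSED", "data/processed")
print(raw_dir)
print(processed_dir)
```

Passing the project root as `base_dir` would make the result independent of the directory the notebook is launched from.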


## Load Data

**Explanation**:
- Load the Kaggle train dataset from `../data/raw/train.csv` (path relative to the notebook, which lives in `notebooks/`).
- Display shape, preview, and missing value counts to confirm structure (includes SalePrice, LotArea, Neighborhood, etc.).

In [9]:
# Load dataset (build the path from the project root instead of hard-coding
# a machine-specific absolute path)
data_path = os.path.join(project_root, DATA_DIR_RAW, 'train.csv')
df = pd.read_csv(data_path)

# Display initial state
print('Data Shape:', df.shape)
print('Data Preview:')
print(df.head())
print('\nMissing Values:')
print(df.isna().sum())

Data Shape: (1460, 81)
Data Preview:
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  SaleType  SaleConditi
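
With 81 columns, the raw `df.isna().sum()` listing is long and mostly zeros. A minimal sketch of a more compact view, shown on a toy frame with made-up values (only the column names are borrowed from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names borrowed from the dataset
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})

missing = toy.isna().sum()
summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(1),
})
# Keep only columns with at least one missing value, worst first
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(summary)
```

Applied to the real `df`, the same steps would list only the columns that need imputation or removal.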

## NumPy Operations

**Explanation**:
- Extract the `SalePrice` column as a NumPy array.
- Perform element-wise squaring with a Python-level loop (a list comprehension) and with the vectorized `np.square`.
- Compare execution times to demonstrate NumPy's efficiency.

In [10]:
import time

# Extract SalePrice column as a NumPy array
prices = df['SalePrice'].to_numpy()

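# Loop-based squaring (Python-level iteration over each element)
# perf_counter is a monotonic, high-resolution clock suited to short timings
start_time = time.perf_counter()
loop_result = [x**2 for x in prices]
loop_time = time.perf_counter() - start_time

# Vectorized squaring (a single call; the loop runs in compiled code)
start_time = time.perf_counter()
vectorized_result = np.square(prices)
vectorized_time = time.perf_counter() - start_time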

# Print results
print(f'Loop time: {loop_time:.4f} seconds')
print(f'Vectorized time: {vectorized_time:.4f} seconds')
print(f'First 5 squared values: {vectorized_result[:5]}')

Loop time: 0.0002 seconds
Vectorized time: 0.0001 seconds
First 5 squared values: [43472250000 32942250000 49952250000 19600000000 62500000000]
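
A single timed run over 1,460 values is close to the clock's resolution, so the gap between the two methods can look smaller than it is. As a rough sketch, the comparison can be repeated with `timeit` on a larger array; the array below is synthetic (not the real `SalePrice` column), and `loop_square` and `vectorized_square` are illustrative names:

```python
import timeit
import numpy as np

# Synthetic prices (not the real SalePrice column), large enough to time reliably
rng = np.random.default_rng(0)
prices_demo = rng.integers(50_000, 500_000, size=100_000)

def loop_square(values):
    # Python-level iteration: one interpreter step per element
    return [int(x) ** 2 for x in values]

def vectorized_square(values):
    # Single call; the element loop runs in compiled code
    return np.square(values)

# Take the best of several runs to reduce noise from other processes
loop_time = min(timeit.repeat(lambda: loop_square(prices_demo), number=1, repeat=3))
vec_time = min(timeit.repeat(lambda: vectorized_square(prices_demo), number=1, repeat=5))
print(f"Loop: {loop_time:.6f} s, vectorized: {vec_time:.6f} s")
```

Exact timings depend on the machine, so no expected output is shown here.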


## Pandas Operations

**Explanation**:
- Calculate summary statistics for numeric columns (e.g., SalePrice, LotArea, GrLivArea, OverallQual).
- Group by `Neighborhood` to compute mean SalePrice, demonstrating aggregation.

In [11]:
# Select key numeric columns for summary
numeric_cols = ['SalePrice', 'LotArea', 'GrLivArea', 'OverallQual']
print('Summary Statistics:')
print(df[numeric_cols].describe())

# Groupby aggregation
print('\nGroupby Neighborhood (Mean SalePrice):')
print(df.groupby('Neighborhood')['SalePrice'].mean())

Summary Statistics:
           SalePrice        LotArea    GrLivArea  OverallQual
count    1460.000000    1460.000000  1460.000000  1460.000000
mean   180921.195890   10516.828082  1515.463699     6.099315
std     79442.502883    9981.264932   525.480383     1.382997
min     34900.000000    1300.000000   334.000000     1.000000
25%    129975.000000    7553.500000  1129.500000     5.000000
50%    163000.000000    9478.500000  1464.000000     6.000000
75%    214000.000000   11601.500000  1776.750000     7.000000
max    755000.000000  215245.000000  5642.000000    10.000000

Groupby Neighborhood (Mean SalePrice):
Neighborhood
Blmngtn    194870.882353
Blueste    137500.000000
BrDale     104493.750000
BrkSide    124834.051724
ClearCr    212565.428571
CollgCr    197965.773333
Crawfor    210624.725490
Edwards    128219.700000
Gilbert    192854.506329
IDOTRR     100123.783784
MeadowV     98576.470588
Mitchel    156270.122449
NAmes      145847.080000
NPkVill    142694.444444
NWAmes     189050.0
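
A mean alone hides how many sales sit behind each neighborhood and how skewed they are. As a sketch, named aggregation can report several statistics in one pass; the frame below is a toy example with made-up neighborhoods and prices:

```python
import pandas as pd

# Toy data with made-up neighborhoods and prices
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B"],
    "SalePrice": [100_000, 200_000, 150_000, 250_000, 200_000],
})

stats = (
    toy.groupby("Neighborhood")["SalePrice"]
    # Named aggregation: output column name = aggregation function
    .agg(mean_price="mean", median_price="median", n_sales="count")
    .sort_values("mean_price", ascending=False)
)
print(stats)
```

Sorting by `mean_price` puts the most expensive neighborhoods first, and `n_sales` flags groups whose mean rests on only a handful of sales.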

## Apply Utility Functions

**Explanation**:
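- Apply `clean_column_names` to standardize column names (e.g., `YearBuilt` -> `yearbuilt`; names are lowercased, and no underscores are inserted between words).
- Apply `convert_year_to_age` to create the age features `house_age` and `remodel_age` from `yearbuilt` and `yearremodadd`.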
- Demonstrate reusability for preprocessing.

In [16]:
# Apply clean_column_names
df_clean = clean_column_names(df)
print('Cleaned Column Names:')
print(df_clean.columns)

# Apply convert_year_to_age
df_clean = convert_year_to_age(df_clean, ['yearbuilt', 'yearremodadd'])
print('\nData Types After Converting Years:')
print(df_clean[['yearbuilt', 'yearremodadd', 'house_age', 'remodel_age']].dtypes)

Cleaned Column Names:
Index(['id', 'mssubclass', 'mszoning', 'lotfrontage', 'lotarea', 'street',
       'alley', 'lotshape', 'landcontour', 'utilities', 'lotconfig',
       'landslope', 'neighborhood', 'condition1', 'condition2', 'bldgtype',
       'housestyle', 'overallqual', 'overallcond', 'yearbuilt', 'yearremodadd',
       'roofstyle', 'roofmatl', 'exterior1st', 'exterior2nd', 'masvnrtype',
       'masvnrarea', 'exterqual', 'extercond', 'foundation', 'bsmtqual',
       'bsmtcond', 'bsmtexposure', 'bsmtfintype1', 'bsmtfinsf1',
       'bsmtfintype2', 'bsmtfinsf2', 'bsmtunfsf', 'totalbsmtsf', 'heating',
       'heatingqc', 'centralair', 'electrical', '1stflrsf', '2ndflrsf',
       'lowqualfinsf', 'grlivarea', 'bsmtfullbath', 'bsmthalfbath', 'fullbath',
       'halfbath', 'bedroomabvgr', 'kitchenabvgr', 'kitchenqual',
       'totrmsabvgrd', 'functional', 'fireplaces', 'fireplacequ', 'garagetype',
       'garageyrblt', 'garagefinish', 'garagecars', 'garagearea', 'garagequal',
       'ga
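
`clean_column_names` lowercases names without inserting word separators, so `YearBuilt` becomes `yearbuilt`. If underscore-separated names such as `year_built` are preferred, one possible approach is sketched below; `to_snake_case` is a hypothetical helper, not part of `src/utils.py`:

```python
import re
import pandas as pd

def to_snake_case(name):
    """Hypothetical helper: 'YearBuilt' -> 'year_built'."""
    # Insert '_' wherever a lowercase letter or digit is followed by an uppercase letter
    with_breaks = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Then apply the same normalization as clean_column_names
    return re.sub(r"[^a-z0-9]", "_", with_breaks.lower())

# Empty toy frame; only the column names matter here
toy = pd.DataFrame(columns=["YearBuilt", "YearRemodAdd", "SalePrice"])
toy.columns = [to_snake_case(c) for c in toy.columns]
print(list(toy.columns))
# ['year_built', 'year_remod_add', 'sale_price']
```

Runs of capitals are not split (`MSSubClass` becomes `mssub_class`), and adopting this naming would also require passing the new column names to `convert_year_to_age`.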

## Save Cleaned Data

**Explanation**:
- Save the cleaned DataFrame to `data/processed/train_cleaned.csv` using `save_data` from `src/utils.py`.
- This demonstrates reproducible data storage.

In [17]:
# Save cleaned data
saved_path = save_data(df_clean, 'train_cleaned.csv', DATA_DIR_PROCESSED)
print(f'Saved cleaned data to {saved_path}')

Saved cleaned data to data/processed/train_cleaned.csv
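
A quick way to confirm that a save succeeded is to read the file back and compare it with the in-memory frame. The sketch below does this in a temporary directory with a toy frame of made-up values; the file name `toy_cleaned.csv` is illustrative:

```python
import os
import tempfile
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({"yearbuilt": [2003, 1976], "house_age": [22, 49]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "processed")
    os.makedirs(out_dir, exist_ok=True)  # same directory handling as save_data
    path = os.path.join(out_dir, "toy_cleaned.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# True when values, column names, and dtypes all match
round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```

CSV does not store dtypes, so a frame with categorical or datetime columns may not compare equal after reloading even when the values are intact.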
