# Comprehensive k-Nearest Neighbors Assignment - Quiz 1
## Real Estate Price Prediction with Advanced k-NN Implementation

**Due Date:** Aug 31 11:59pm  
**Points:** 100  
**Individual Assignment - Academic Integrity Required**

---

## Assignment Overview

You will build a complete machine learning pipeline that uses k-Nearest Neighbors to predict house prices from a housing dataset of your choice. The assignment requires you to demonstrate a deep understanding of k-NN concepts, data preprocessing, feature engineering, and model evaluation, and to implement several components from scratch to prove that understanding.

---

## Dataset Information

One option is the California Housing Dataset, described below, although you may choose a different dataset. You must create a folder on GitHub containing your data as well as your `.ipynb` notebook.

**Dataset:** California Housing Dataset (`sklearn.datasets.fetch_california_housing`)
- **Samples:** 20,640 house observations
- **Features:** 8 numerical features
- **Target:** Median house value (in hundreds of thousands of dollars)

**Features:**
- `MedInc`: Median income in block group
- `HouseAge`: Median house age in block group  
- `AveRooms`: Average number of rooms per household
- `AveBedrms`: Average number of bedrooms per household
- `Population`: Block group population
- `AveOccup`: Average number of household members
- `Latitude`: Block group latitude
- `Longitude`: Block group longitude

---

## Part 1: Data Loading and Exploration 

### 1.1 Data Loading and Initial Inspection 
```python
# Load the dataset and create initial DataFrame
# Display basic information about the dataset
# Show first/last few rows
# Check data types and basic statistics
```

**Requirements:**
- Load the dataset using sklearn
- Create a pandas DataFrame with proper column names
- Display dataset shape, info(), and describe()
- Identify the target variable and feature types
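
As a starting point, the loading step could look like the sketch below. It assumes the California Housing option; the helper names `bunch_to_frame`, `load_california_frame`, and `summarize` are hypothetical, and the target column name `MedHouseVal` follows the name scikit-learn uses for this dataset.

```python
import pandas as pd


def bunch_to_frame(data, target, feature_names, target_name="MedHouseVal"):
    """Combine a feature matrix and a target vector into one labelled DataFrame."""
    df = pd.DataFrame(data, columns=list(feature_names))
    df[target_name] = target
    return df


def load_california_frame():
    """Fetch the California Housing data (downloads the file on first use)."""
    # Imported here so the helper above can be reused without scikit-learn.
    from sklearn.datasets import fetch_california_housing

    raw = fetch_california_housing()
    return bunch_to_frame(raw.data, raw.target, raw.feature_names)


def summarize(df):
    """Print the shape, the dtypes with non-null counts, and summary statistics."""
    print("shape:", df.shape)
    df.info()
    print(df.describe().T)
```

Calling `summarize(load_california_frame())` in a notebook cell would cover the shape, `info()`, and `describe()` requirements in one place; `df.head()` and `df.tail()` show the first and last rows.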

### 1.2 Comprehensive EDA 

Create visualizations and analysis for:

**A. Target Variable Analysis:**
- Distribution of house prices (histogram, box plot)
- Summary statistics with interpretation
- Identify potential outliers in target variable

**B. Feature Analysis:**
- Individual feature distributions (histograms for all features)
- Identify skewed distributions and potential transformations needed
- Correlation matrix heatmap
- Scatter plots of each feature vs target

**C. Geographic Analysis:**
- Scatter plot of Latitude vs Longitude colored by house price
- Identify geographic patterns in housing prices
- Discussion of California geography impact on prices

**D. Feature Relationships:**
- Identify the top 3 strongest correlations with target
- Create scatter plots with trend lines for these relationships
- Analyze multicollinearity between features
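
One possible way to rank features by their correlation with the target is sketched here. The function name `top_correlations` is hypothetical, and the ranking uses the absolute Pearson correlation, which is an assumption about what "strongest" means in item D.

```python
import pandas as pd


def top_correlations(df, target, n=3):
    """Return the n features with the largest absolute Pearson correlation to target."""
    # Restrict to numeric columns so categorical features do not break corr().
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index[:n]
    return corr.loc[order]  # signed values, strongest first
```

For the California data this might be called as `top_correlations(df, "MedHouseVal")`. The signed values are kept so that strong negative relationships remain visible in the write-up.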

**Written Requirement:** Provide 2-3 paragraph analysis of your EDA findings, including insights about the housing market and potential challenges for modeling.

---

## Part 2: Data Cleaning and Preprocessing 

### 2.1 Missing Value Analysis 
- Check for missing values in all columns
- If missing values exist, analyze patterns and implement appropriate handling
- Document your decision-making process

### 2.2 Outlier Detection and Handling 

**Implement and apply TWO methods:**

**Method 1: Statistical Outlier Detection**
```python
def detect_outliers_iqr(data, column, factor=1.5):
    """
    Detect outliers using IQR method
    
    Parameters:
    data: DataFrame
    column: column name to check
    factor: IQR multiplier (default 1.5)
    
    Returns:
    Boolean series indicating outliers
    """
    # YOUR IMPLEMENTATION HERE
    pass
```

**Method 2: Z-Score Outlier Detection**
```python
def detect_outliers_zscore(data, column, threshold=3):
    """
    Detect outliers using Z-score method
    
    Parameters:
    data: DataFrame  
    column: column name to check
    threshold: Z-score threshold (default 3)
    
    Returns:
    Boolean series indicating outliers
    """
    # YOUR IMPLEMENTATION HERE
    pass
```

**Requirements:**
- Apply both methods to all numerical features
- Create visualizations showing outliers before and after removal
- Compare the two methods and justify which approach you choose
- Document impact on dataset size
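
To make the comparison concrete, the sketch below applies both rules to a small made-up series. The helpers `iqr_mask` and `zscore_mask` are hypothetical and take a single pandas Series, so they are not drop-in replacements for the stubs above; the z-score version uses the population standard deviation, which is an assumption.

```python
import pandas as pd


def iqr_mask(values, factor=1.5):
    """Flag values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - factor * iqr) | (values > q3 + factor * iqr)


def zscore_mask(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    std = values.std(ddof=0)
    if std == 0:  # constant column: nothing can be an outlier
        return pd.Series(False, index=values.index)
    z = (values - values.mean()) / std
    return z.abs() > threshold


# Made-up data with one extreme value.
toy = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 40.0])
print("IQR flags:    ", int(iqr_mask(toy).sum()))
print("Z-score flags:", int(zscore_mask(toy).sum()))
```

On this toy series the IQR rule flags the value 40.0 while the z-score rule flags nothing, because the extreme value inflates the standard deviation it is measured against. Differences of this kind are worth discussing when you justify your choice.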

### 2.3 Feature Engineering 

**Create the following engineered features:**

1. **Ratio Features:**
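   - `rooms_per_household`: AveRooms / AveOccup (because `AveRooms` is already a per-household average, this ratio effectively measures rooms per person)
   - `bedrooms_per_room`: AveBedrms / AveRooms
   - `population_per_household`: Population / AveOccup (this ratio approximates the number of households in the block group)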

2. **Geographic Features:**
   - `distance_to_LA`: Distance from Los Angeles (34.0522°N, 118.2437°W)
   - `distance_to_SF`: Distance from San Francisco (37.7749°N, 122.4194°W)
   - `coastal_proximity`: Binary feature (1 if longitude > -121, 0 otherwise)

3. **Categorical Features:**
   - `income_category`: Low (<3), Medium (3-6), High (6-9), Very High (9+)
   - `house_age_category`: New (<10), Medium (10-30), Old (30+)
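
The geographic and categorical features could be built along the lines of the sketch below. It assumes great-circle (haversine) distance in kilometres with a mean Earth radius of 6371 km, and it treats each lower category bound as inclusive; the assignment does not fix either choice, and the function names are hypothetical. West longitudes are negative, so Los Angeles is passed as `(34.0522, -118.2437)`.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat, lon, ref_lat, ref_lon):
    """Great-circle distance in km from (lat, lon) to a reference point, all in degrees."""
    lat, lon, ref_lat, ref_lon = map(np.radians, (lat, lon, ref_lat, ref_lon))
    a = (np.sin((ref_lat - lat) / 2) ** 2
         + np.cos(lat) * np.cos(ref_lat) * np.sin((ref_lon - lon) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def income_category(med_inc):
    """Bin MedInc into four labels using the intervals [..3), [3, 6), [6, 9), [9, ..)."""
    return pd.cut(med_inc, bins=[-np.inf, 3, 6, 9, np.inf],
                  labels=["Low", "Medium", "High", "Very High"], right=False)
```

With a DataFrame `df`, a call such as `haversine_km(df["Latitude"], df["Longitude"], 34.0522, -118.2437)` returns one distance per row. Categorical outputs like `income_category` need to be encoded numerically before they can enter a distance calculation.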


---

## Part 3: Custom k-NN Implementation 

### 3.1 Distance Metrics Implementation 

**Implement these distance functions from scratch:**

```python
import numpy as np

def euclidean_distance(point1, point2):
    """
    Calculate Euclidean distance between two points
    
    Parameters:
    point1, point2: numpy arrays of equal length
    
    Returns:
    float: Euclidean distance
    """
    # YOUR IMPLEMENTATION HERE
    pass

def manhattan_distance(point1, point2):
    """
    Calculate Manhattan distance between two points
    
    Parameters:
    point1, point2: numpy arrays of equal length
    
    Returns:
    float: Manhattan distance  
    """
    # YOUR IMPLEMENTATION HERE
    pass

def minkowski_distance(point1, point2, p=2):
    """
    Calculate Minkowski distance between two points
    
    Parameters:
    point1, point2: numpy arrays of equal length
    p: parameter (p=1 gives Manhattan, p=2 gives Euclidean)
    
    Returns:
    float: Minkowski distance
    """
    # YOUR IMPLEMENTATION HERE
    pass
```
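
For reference, the three metrics are related through the standard Minkowski definition for points $x, y \in \mathbb{R}^n$, where the symbols $x_i$, $y_i$, and $n$ are notation introduced here:

```latex
d_p(x, y) = \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p},
\qquad
d_1(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert \quad \text{(Manhattan)},
\qquad
d_2(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \quad \text{(Euclidean)}
```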

### 3.2 k-NN Class Implementation 

**Implement a complete k-NN class:**

```python
class CustomKNN:
    def __init__(self, k=5, distance_metric='euclidean', weights='uniform'):
        """
        Custom k-NN implementation
        
        Parameters:
        k: number of neighbors
        distance_metric: 'euclidean', 'manhattan', or 'minkowski'
        weights: 'uniform' or 'distance'
        """
        self.k = k
        self.distance_metric = distance_metric
        self.weights = weights
        self.X_train = None
        self.y_train = None
    
    def fit(self, X, y):
        """Store training data"""
        # YOUR IMPLEMENTATION HERE
        pass
    
    def _calculate_distance(self, point1, point2):
        """Calculate distance based on selected metric"""
        # YOUR IMPLEMENTATION HERE
        pass
    
    def _get_neighbors(self, test_point):
        """Find k nearest neighbors for a test point"""
        # YOUR IMPLEMENTATION HERE
        # Return indices of k nearest neighbors
        pass
    
    def predict_single(self, test_point):
        """Predict for a single test point"""
        # YOUR IMPLEMENTATION HERE
        # Handle both uniform and distance-weighted predictions
        pass
    
    def predict(self, X_test):
        """Predict for multiple test points"""
        # YOUR IMPLEMENTATION HERE
        pass
    
    def score(self, X_test, y_test):
        """Calculate R-squared score"""
        # YOUR IMPLEMENTATION HERE
        pass
```

**Requirements:**
- Implement all methods completely
- Handle both uniform and distance-weighted predictions
- Include proper error handling
- Add docstrings for all methods
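
The neighbor search at the heart of the class can be sketched in vectorized form as follows. This is one possible approach rather than a required design: the standalone function `k_nearest_indices` is hypothetical, it is hard-wired to Euclidean distance, and the `ValueError` is just one example of the error handling asked for above.

```python
import numpy as np


def k_nearest_indices(X_train, query, k):
    """Return indices of the k training rows closest to `query` (Euclidean), nearest first."""
    X_train = np.asarray(X_train, dtype=float)
    query = np.asarray(query, dtype=float)
    if not 1 <= k <= len(X_train):
        raise ValueError("k must be between 1 and the number of training rows")
    # Broadcasting subtracts the query from every training row at once.
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # A stable sort keeps the lower index first when two distances tie.
    return np.argsort(distances, kind="stable")[:k]
```

Sorting every distance costs O(n log n) per query; `np.argpartition` is a possible alternative when only the k smallest are needed, at the cost of an extra sort over those k.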

---

## Part 4: Manual Calculations (Proof of Understanding) 

### 4.1 Distance Calculations 

**Select 3 specific data points from your dataset and calculate the following by hand (upload a picture of your work to the same GitHub folder):**

1. Euclidean distance between points 1 and 2
2. Manhattan distance between points 1 and 3  
3. Minkowski distance (p=3) between points 2 and 3

**Show all work step by step. Include:**
- The actual feature values used
- Complete mathematical formulation
- Step-by-step calculation
- Verification using your implemented functions
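
The verification step might look like the sketch below, which uses two made-up 3-feature points chosen so that the arithmetic is easy to follow. Your own submission should use real rows from your dataset and your own distance functions in place of the inline NumPy expressions.

```python
import numpy as np

# Hypothetical points, not taken from any dataset.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

diff = np.abs(a - b)                          # [3, 4, 0]
euclidean = np.sqrt(np.sum(diff ** 2))        # sqrt(9 + 16 + 0) = 5
manhattan = np.sum(diff)                      # 3 + 4 + 0 = 7
minkowski_3 = np.sum(diff ** 3) ** (1 / 3)    # (27 + 64 + 0)^(1/3) = 91^(1/3)

# Compare a hand-computed value with the code using a tolerance,
# since floating-point results rarely match to the last digit.
hand_euclidean = 5.0
print("matches hand calculation:", bool(np.isclose(euclidean, hand_euclidean)))
```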

### 4.2 k-NN Prediction Walkthrough 

**For a specific test instance:**

1. **Manual Neighbor Finding:**
   - Select a test point from your test set
   - Calculate distances to ALL training points (show first 10 calculations)
   - Identify the 5 nearest neighbors manually
   - Show the ranking process

2. **Prediction Calculation:**
   - Calculate the uniform-weighted prediction
   - Calculate the distance-weighted prediction
   - Show all mathematical steps
   - Compare with your implementation results

**Format Example:**
```
Test Point: [6.2, 15.0, 5.5, 1.1, 3000, 3.2, 33.8, -118.1]

Distance Calculations:
Point 1: [5.8, 20.0, 4.2, ...]  → Distance = √[(6.2-5.8)² + (15.0-20.0)² + ...] = X.XX
Point 2: [7.1, 10.0, 6.1, ...]  → Distance = √[(6.2-7.1)² + (15.0-10.0)² + ...] = X.XX
...

5 Nearest Neighbors:
1. Point Index: 1234, Distance: 2.45, Target: 3.2
2. Point Index: 5678, Distance: 2.67, Target: 2.9
...

Uniform Prediction: (3.2 + 2.9 + ...) / 5 = X.XX
Distance Weighted: [Show complete calculation]
```
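
The two prediction rules can be checked numerically with a sketch like the one below. It assumes inverse-distance weights (1/d), which is a common choice and the one scikit-learn applies for `weights='distance'`. Only the first two distance and target pairs come from the format example; the remaining three are invented for illustration.

```python
import numpy as np

# First two entries follow the format example; the last three are hypothetical.
distances = np.array([2.45, 2.67, 2.80, 3.10, 3.25])
targets = np.array([3.2, 2.9, 3.5, 2.4, 3.0])

# Uniform weighting: every neighbor counts equally.
uniform_pred = targets.mean()


def distance_weighted(distances, targets):
    """Inverse-distance weighted mean; an exact match (distance 0) decides the prediction."""
    exact = distances == 0
    if exact.any():
        # Avoid division by zero by averaging the exactly matching neighbors.
        return targets[exact].mean()
    weights = 1.0 / distances
    return np.sum(weights * targets) / np.sum(weights)


weighted_pred = distance_weighted(distances, targets)
print(round(uniform_pred, 4), round(weighted_pred, 4))
```

Because closer neighbors receive larger weights, the weighted prediction is pulled toward the targets of the nearest points, while always staying between the smallest and largest neighbor target.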

---

## Part 5: Model Evaluation and Hyperparameter Tuning 

### 5.1 Train-Test Split and Scaling 

```python
# Split data (80/20 split)
# Apply appropriate scaling (compare StandardScaler vs MinMaxScaler)
# Justify your scaling choice based on EDA findings
```
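
The key point in this step is that scaling statistics come from the training rows only and are then reused on the test rows. The sketch below shows the arithmetic with plain NumPy on synthetic stand-in data; the formulas mirror what `StandardScaler` (population standard deviation) and `MinMaxScaler` (default 0 to 1 range) compute, and the random data and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic two-feature data on very different scales.
X = rng.normal(loc=[10.0, 200.0], scale=[2.0, 50.0], size=(100, 2))

# 80/20 split on shuffled row indices.
order = rng.permutation(len(X))
cut = int(0.8 * len(X))
X_train, X_test = X[order[:cut]], X[order[cut:]]

# Standardization: mean and std are computed on the training rows only.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

# Min-max scaling, again fitted on the training rows only.
lo, hi = X_train.min(axis=0), X_train.max(axis=0)
X_train_mm = (X_train - lo) / (hi - lo)
X_test_mm = (X_test - lo) / (hi - lo)
```

Test rows can fall slightly outside the 0 to 1 range after min-max scaling, which is expected. Fitting the scaler on all rows before splitting would leak test information into training.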

### 5.2 Hyperparameter Grid Search 

**Implement a manual grid search:**

```python
def manual_grid_search(X_train, y_train, X_val, y_val, param_grid):
    """
    Perform manual grid search for k-NN hyperparameters
    
    Parameters:
    X_train, y_train: Training data
    X_val, y_val: Validation data  
    param_grid: Dictionary of parameters to search
    
    Returns:
    best_params: Best parameter combination
    results_df: DataFrame with all results
    """
    # YOUR IMPLEMENTATION HERE
    pass
```

**Search Grid:**
- k: [3, 5, 7, 9, 11, 15, 21]
- distance_metric: ['euclidean', 'manhattan']  
- weights: ['uniform', 'distance']

**Requirements:**
- Use 5-fold cross-validation
- Track both training and validation scores
- Create visualization of results
- Identify best parameters and analyze patterns
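
Two building blocks of a manual search, enumerating the grid and producing 5-fold index splits, could be sketched as follows. The helper `kfold_indices` and its seed are hypothetical, and how the folds are combined with the `X_val`/`y_val` arguments of `manual_grid_search` is left to your own design.

```python
from itertools import product

import numpy as np

param_grid = {
    "k": [3, 5, 7, 9, 11, 15, 21],
    "distance_metric": ["euclidean", "manhattan"],
    "weights": ["uniform", "distance"],
}

# Every combination of grid values, as a list of keyword dictionaries.
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]


def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs that use each row for validation exactly once."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, n_splits)
    for i in range(n_splits):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, folds[i]


print(len(combinations), "parameter combinations")
```

The grid above yields 7 × 2 × 2 = 28 combinations, so 5-fold cross-validation means 140 fit-and-score rounds. That count is worth keeping in mind given that a from-scratch k-NN is slow on the full dataset.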

### 5.3 Performance Analysis

**Create comprehensive evaluation:**

1. **Learning Curves:** Plot training and validation scores vs k
2. **Distance Metric Comparison:** Visualize performance differences
3. **Feature Importance Analysis:** Which features contribute most to neighbor selection?
4. **Error Analysis:** Analyze residuals and identify prediction patterns
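
For the error analysis, the usual quantities are the residuals, RMSE, and R². The sketch below uses hypothetical helper names; `r_squared` follows the standard definition 1 − SS_res / SS_tot.

```python
import numpy as np


def residuals(y_true, y_pred):
    """Per-sample errors; positive values mean the model under-predicted."""
    return np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(np.mean(residuals(y_true, y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    ss_res = np.sum(residuals(y_true, y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    return float(1.0 - ss_res / ss_tot)
```

Plotting the residuals against the predictions, or against `Latitude` and `Longitude`, is one way to look for systematic patterns.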

---

## Part 6: Comparison and Advanced Analysis

### 6.1 Sklearn Comparison
- Compare your implementation with sklearn's KNeighborsRegressor
- Verify results match (within reasonable tolerance)
- Analyze any performance differences

### 6.2 Curse of Dimensionality Analysis
- Create a synthetic high-dimensional dataset of random data
- Show how k-NN performance degrades with increasing dimensions
- Discuss implications for feature selection
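
One mechanism behind the degradation is distance concentration: in high dimensions, the nearest and farthest points end up almost equally far away. The sketch below measures this on uniform random data. The function name, point count, and seed are assumptions, and it illustrates the mechanism only, so a full answer would still need to measure prediction performance.

```python
import numpy as np


def relative_contrast(n_dims, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
```

As the contrast shrinks toward zero, "nearest" carries less information, which is one argument for removing uninformative features before running k-NN.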

---

## Deliverables and Submission Requirements

### Required Files:
1. **Jupyter Notebook:** `lastname_firstname_knn_assignment.ipynb`
2. **PDF Report:** `lastname_firstname_knn_report.pdf` (exported from notebook)

### Academic Integrity Requirements:

**To prove you completed this work yourself, include:**

1. **Process Documentation:**
   - Comments explaining your thought process for each major decision throughout the notebook
   - Personal insights and analysis throughout

2. **Understanding Verification:**
   - All manual calculations must be shown step-by-step
   - Explain why you chose specific parameter values
   - Discuss what you learned from each analysis

3. **Code Comments:**
   - Every function must have detailed comments
   - Complex logic must be commented line-by-line
   - Include reasoning for implementation choices

4. **Personal Analysis:**
   - Write conclusions in your own words
   - Discuss surprises or unexpected findings
   - Explain how this assignment changed your understanding of k-NN

### Further Measures:

**Include these specific elements to demonstrate understanding:**

1. **Conceptual Questions (Answer in your Colab or Jupyter notebook):**
   - Why might Manhattan distance be preferable to Euclidean distance in certain scenarios?
   - How does the choice of k affect bias-variance tradeoff in k-NN?
   - What are the computational implications of different distance metrics?
   - How would you modify k-NN for categorical features?

2. **Implementation Decisions:**
   - Discuss alternative approaches you considered
   - Identify limitations of your implementation

3. **Personal Reflection:**
   - What was the most challenging part of this assignment?
   - How would you improve your approach if you had more time?
   - What real-world applications could benefit from your analysis?

---

## Grading Rubric

| Component | Criteria |
|-----------|----------|
| **Data Exploration** | Complete EDA with insights, proper visualizations, written analysis |
| **Data Preprocessing** | Correct outlier handling, feature engineering, documentation |
| **Custom Implementation** | Working k-NN class, multiple distance metrics, proper structure |
| **Manual Calculations** | Step-by-step work shown, correct calculations, verification |
| **Model Evaluation** | Comprehensive analysis, proper validation, hyperparameter tuning |
| **Advanced Analysis** | Sklearn comparison, dimensionality analysis |
| **Code Quality** | Clean code, comments, documentation, organization |
| **Academic Integrity** | Evidence of original work, personal insights, understanding demonstration |
| **Presentation** | Clear notebook organization, professional formatting |

**Total: 100 points** 

---


## Academic Integrity Statement

By submitting this assignment, I certify that:
- All code was written by me (or by our group) personally
- All analysis and insights are my (or our group's) original work
- I understand the concepts demonstrated in my implementation
- I have properly cited any external resources used
- I am prepared to explain any part of my solution in detail

**Student Signature:** ___Ahsan Imran_______ **Date:** ___8/31/2025___

