
# Midterm Project Part 2: Analyzing Home Prices

#### CMP 333: Data Management and Analysis - Isaac D. Hoyos

In this second part of the project, we continue analyzing the house price dataset and apply the **k-nearest-neighbors** method to predict the prices of a few examples in the test set.

1. Load the `HousingData_processed.csv` file created in Part 1 and display its first five rows.

In [1]:
import pandas as pd
import numpy as np
from google.colab import files
import os

HousingData_processed = "HousingData_processed.csv"

# Check if the CSV file already exists in the environment.
if not os.path.exists(HousingData_processed):
    print(f'Please upload the "{HousingData_processed}" file...\n')

    while True:
        uploaded = files.upload()

        # Get the name of the uploaded file (None if nothing was uploaded).
        uploaded_filename = next(iter(uploaded), None)

        # Check if it's the correct file.
        if uploaded_filename == HousingData_processed:
            print(f'\n✅ File "{HousingData_processed}" has been uploaded successfully!')
            break
        else:
            print(f'\n❌ Error: Please upload the correct "{HousingData_processed}" file...\n')
else:
    print(f'Using the uploaded "{HousingData_processed}" file.')

# Read the uploaded CSV file.
HousingData_processed_df = pd.read_csv(HousingData_processed)

# Display the first five rows of the DataFrame
print("\n=== First 5 Rows of the HousingData_processed.csv DataFrame ===")
HousingData_processed_df.head()

Please upload the "HousingData_processed.csv" file...



Saving HousingData_processed.csv to HousingData_processed.csv

✅ File "HousingData_processed.csv" has been uploaded successfully!

=== First 5 Rows of the HousingData_processed.csv DataFrame ===


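| | SalePrice | OverallQual | YearBuilt | TotalBsmtSF | GrLivArea | TotalArea | AreaPerRoom | GarageCars |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 208500 | 7 | 2003 | 856 | 1710 | 2566 | 213.75 | 2 |
| 1 | 181500 | 6 | 1976 | 1262 | 1262 | 2524 | 210.333333 | 2 |
| 2 | 223500 | 7 | 2001 | 920 | 1786 | 2706 | 297.666667 | 2 |
| 3 | 140000 | 7 | 1915 | 756 | 1717 | 2473 | 245.285714 | 3 |
| 4 | 250000 | 8 | 2000 | 1145 | 2198 | 3343 | 244.222222 | 3 |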


2. Load the first three instances in `test.csv` as a data frame and display these instances.

In [2]:
from google.colab import files
import os

test = "test.csv"

# Check if the CSV file already exists in the environment.
if not os.path.exists(test):
    print(f'Please upload the "{test}" file...\n')

    while True:
        uploaded = files.upload()

        # Get the name of the uploaded file (None if nothing was uploaded).
        uploaded_filename = next(iter(uploaded), None)

        # Check if it's the correct file.
        if uploaded_filename == test:
            print(f'\n✅ File "{test}" has been uploaded successfully!')
            break
        else:
            print(f'\n❌ Error: Please upload the correct "{test}" file...\n')
else:
    print(f'Using the uploaded "{test}" file.')

# Read the uploaded CSV file.
test_df = pd.read_csv(test)

# Display the first three instances in test.csv (the full DataFrame is kept for later tasks).
print("\n=== First 3 Test Instances of the test.csv DataFrame ===")
test_df.head(3)

Please upload the "test.csv" file...



Saving test.csv to test.csv

✅ File "test.csv" has been uploaded successfully!

=== First 3 Test Instances of the test.csv DataFrame ===


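| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |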


3. Let's predict the price of a test house using a method called "k-nearest-neighbors." To find similar houses, we need a similarity measure that combines the differences across all input features.

Define a `similarity` function as follows:
* Parameters: `row1` (one instance), `row2` (another instance), and `weights` (used for combining the differences).
* Calculate the absolute value of the difference between `row1` and `row2` in each input feature. For example, the difference for the `OverallQual` feature should be calculated as:
`np.abs(row1['OverallQual'] - row2['OverallQual'])`.
* The similarity is calculated as a weighted sum of all feature differences. Use the reciprocal of the standard deviation of each feature as its weight, so that features measured on large scales (such as `GrLivArea`) do not dominate the sum.
* Return the similarity value. Because it is a sum of differences, a smaller value means the two houses are more alike.
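
Written as a formula, the measure described in the list above is a weighted sum of absolute differences. The symbols are notation introduced here for illustration and are not part of the assignment: $F$ is the set of input features, $a_f$ and $b_f$ are the values of feature $f$ in `row1` and `row2`, and $\sigma_f$ is the standard deviation of feature $f$ in the housing data.

$$
\text{similarity}(a, b) = \sum_{f \in F} w_f \,\lvert a_f - b_f \rvert, \qquad w_f = \frac{1}{\sigma_f}
$$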

In [3]:
# Define the similarity function as a weighted sum of absolute differences.
def similarity(row1, row2, weights):
    """Return the weighted sum of absolute feature differences (lower = more similar)."""
    return np.sum([
        weights[feature] * np.abs(row1[feature] - row2[feature])
        for feature in weights
    ])

# Define the feature list.
features = ['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea', 'GarageCars']

# Calculate the weights using the reciprocal of the standard deviation.
weights = {}
for feature in features:
    std = HousingData_processed_df[feature].std()
    weights[feature] = 1.0 / std if std != 0 else 0.0

print("=== Similarity Function Defined Successfully ===")
print("Features Used:", ', '.join(features))

similarity_value = similarity(
    HousingData_processed_df.iloc[0],
    test_df.iloc[0],
    weights)
print(f"Total Weighted Similarity Value: {similarity_value:.4f}")

# Detailed comparison table.
comparison = pd.DataFrame({
    'Feature': features,
    'HouseValue': [HousingData_processed_df.iloc[0][f] for f in features],
    'TestValue': [test_df.iloc[0][f] for f in features],
    'AbsoluteDifference': [
        np.abs(HousingData_processed_df.iloc[0][f] - test_df.iloc[0][f])
        for f in features
    ],
    'Weight (1/std)': [weights[f] for f in features],
    'WeightedDifference': [
        weights[f] * np.abs(HousingData_processed_df.iloc[0][f] - test_df.iloc[0][f])
        for f in features
    ]})
display(comparison)

=== Similarity Function Defined Successfully ===
Features Used: OverallQual, YearBuilt, TotalBsmtSF, GrLivArea, GarageCars
Total Weighted Similarity Value: 5.7832


| | Feature | HouseValue | TestValue | AbsoluteDifference | Weight (1/std) | WeightedDifference |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | OverallQual | 7.0 | 5.0 | 2.0 | 0.723068 | 1.446135 |
| 1 | YearBuilt | 2003.0 | 1961.0 | 42.0 | 0.033109 | 1.390595 |
| 2 | TotalBsmtSF | 856.0 | 882.0 | 26.0 | 0.002279 | 0.059265 |
| 3 | GrLivArea | 1710.0 | 896.0 | 814.0 | 0.001903 | 1.549059 |
| 4 | GarageCars | 2.0 | 1.0 | 1.0 | 1.338124 | 1.338124 |
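
As a quick sanity check, the total similarity value should equal the sum of the `WeightedDifference` column. The short sketch below re-adds the five values copied from the comparison table; the dictionary name `weighted_differences` is introduced here only for illustration.

```python
# Weighted differences copied from the comparison table above.
weighted_differences = {
    "OverallQual": 1.446135,
    "YearBuilt": 1.390595,
    "TotalBsmtSF": 0.059265,
    "GrLivArea": 1.549059,
    "GarageCars": 1.338124,
}

# The total similarity is the sum of the per-feature weighted differences.
total = sum(weighted_differences.values())
print(f"Sum of weighted differences: {total:.4f}")

# The feature contributing the most to the total.
largest = max(weighted_differences, key=weighted_differences.get)
print(f"Largest contribution: {largest}")
```

The sum agrees with the reported total of 5.7832. Although `GrLivArea` has the smallest weight, it still contributes the most here because the raw difference of 814 square feet is large.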


4. Calculate the difference between the first test house and each house in the data.

In [4]:
# Calculate the similarity between the first test house and each house in the dataset.

# Select the first test instance.
test_house_1 = test_df.iloc[0]

# Compute the similarity with every house using the 1/std weights defined in Task 3.
similarities = HousingData_processed_df.apply(
    lambda row: similarity(row, test_house_1, weights),
    axis=1
)

# Combine SalePrice and computed similarity into one table.
similarity_table = HousingData_processed_df[['SalePrice']].copy()
similarity_table['Similarity'] = similarities

# Sort by similarity (lower = more similar).
similarity_table = similarity_table.sort_values(by='Similarity', ascending=True).reset_index(drop=True)

print("=== Similarity Values for the First Test House ===")
display(similarity_table)

=== Similarity Values for the First Test House ===


| | SalePrice | Similarity |
| --- | --- | --- |
| 0 | 109500 | 0.064269 |
| 1 | 144000 | 0.131941 |
| 2 | 138500 | 0.163597 |
| 3 | 109900 | 0.194448 |
| 4 | 125000 | 0.196706 |
| ... | ... | ... |
| 1455 | 555000 | 15.755785 |
| 1456 | 755000 | 17.453003 |
| 1457 | 745000 | 17.714292 |
| 1458 | 184750 | 20.150440 |


5. Calculate the average price of the five closest houses, as this will be our prediction for the test house.

In [5]:
# Predict the price of the first test house using the 5 nearest neighbors.

# Attach the computed similarity values to the housing dataset.
data_with_similarity = HousingData_processed_df.copy()
data_with_similarity['Similarity'] = similarities

# Select the 5 houses with the smallest similarity values.
nearest_5 = data_with_similarity.nsmallest(5, 'Similarity').reset_index(drop=True)

# Calculate the average SalePrice of these 5 houses.
predicted_price_1 = nearest_5['SalePrice'].mean()

# Display the results.
print("=== 5 Most Similar Houses to Test House 1 ===")
display(nearest_5[['SalePrice', 'Similarity']])
print(f"\nAverage Price of the 5 Closest Houses: ${predicted_price_1:,.2f}")

print("\n=== Nearest Neighbors Statistics ===")
print(f"Price Range: ${nearest_5['SalePrice'].min():,.0f} - ${nearest_5['SalePrice'].max():,.0f}")
print(f"Average Similarity Value: {nearest_5['Similarity'].mean():.4f}")

=== 5 Most Similar Houses to Test House 1 ===


| | SalePrice | Similarity |
| --- | --- | --- |
| 0 | 109500 | 0.064269 |
| 1 | 144000 | 0.131941 |
| 2 | 138500 | 0.163597 |
| 3 | 109900 | 0.194448 |
| 4 | 125000 | 0.196706 |



Average Price of the 5 Closest Houses: $125,380.00

=== Nearest Neighbors Statistics ===
Price Range: $109,500 - $144,000
Average Similarity Value: 0.1502
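
The prediction is simply the arithmetic mean of the five neighbors' prices, which can be confirmed by hand. The sketch below repeats that arithmetic on the values copied from the output above, using plain Python lists rather than the notebook's DataFrame; the variable names are chosen here for illustration.

```python
# SalePrice and similarity values of the five nearest houses, copied from the output above.
neighbor_prices = [109500, 144000, 138500, 109900, 125000]
neighbor_similarities = [0.064269, 0.131941, 0.163597, 0.194448, 0.196706]

# k-nearest-neighbors prediction with k = 5: the unweighted mean of the neighbors' prices.
predicted = sum(neighbor_prices) / len(neighbor_prices)
mean_similarity = sum(neighbor_similarities) / len(neighbor_similarities)

print(f"Mean of the five prices: ${predicted:,.2f}")
print(f"Mean similarity value: {mean_similarity:.4f}")
```

Both figures agree with the cell output: $125,380.00 and 0.1502.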


6. Repeat Task 4 and Task 5 on the other test houses and display the predictions.

In [6]:
# Repeat Tasks 4 & 5 for all test houses with similarity weights.

def predict_price(test_row, data, features, weights, k=5):
    """Predict a price as the mean SalePrice of the k most similar houses."""
    # Convert weights to an array in feature order.
    weight_array = np.array([weights[f] for f in features])

    # Compute absolute differences between every house and the test row.
    # Casting to float avoids an object-dtype array when the test row mixes column types.
    diffs = np.abs(
        data[features].to_numpy(dtype=float) - test_row[features].to_numpy(dtype=float)
    )

    # Weighted similarity scores (lower = more similar).
    similarities = np.dot(diffs, weight_array)

    # Get the indices of the k smallest similarity values (in no particular order).
    nearest_idx = np.argpartition(similarities, k)[:k]

    # Prediction is the mean SalePrice of the k nearest houses.
    return data.iloc[nearest_idx]['SalePrice'].mean()

# Compute predictions for all test houses.
predictions = [
    predict_price(test_df.iloc[i], HousingData_processed_df, features, weights)
    for i in range(len(test_df))]

# Append predictions to test set.
results = test_df.copy()
results['PredictedPrice'] = predictions

# Display the full results and summary.
print("=== Predicted Sale Prices for All Test Houses ===")
display(results)
print(f"Total Test Houses Evaluated: {len(predictions)}")
print("\n=== The First Five Predictions ===")
for i, price in enumerate(predictions[:5]):
    print(f"Test House #{i+1}: ${price:,.2f}")

=== Predicted Sale Prices for All Test Houses ===


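| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | PredictedPrice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | 125380.0 |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | 153630.0 |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal | 172000.0 |
| 3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal | 179600.0 |
| 4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal | 205300.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1454 | 2915 | 160 | RM | 21.0 | 1936 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2006 | WD | Normal | 85200.0 |
| 1455 | 2916 | 160 | RM | 21.0 | 1894 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2006 | WD | Abnorml | 96500.0 |
| 1456 | 2917 | 20 | RL | 160.0 | 20000 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2006 | WD | Abnorml | 145600.0 |
| 1457 | 2918 | 85 | RL | 62.0 | 10441 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | Shed | 700 | 7 | 2006 | WD | Normal | 120020.0 |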


Total Test Houses Evaluated: 1459

=== The First Five Predictions ===
Test House #1: $125,380.00
Test House #2: $153,630.00
Test House #3: $172,000.00
Test House #4: $179,600.00
Test House #5: $205,300.00
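
One caveat worth checking before trusting all of these predictions: `predict_price` assumes every test row has a value for each of the five features. If a test row were missing one of them (NaN), every weighted sum for that row would become NaN, and the selected "nearest" houses would no longer be meaningful. The sketch below illustrates the effect on made-up numbers and shows one possible guard. The toy arrays and the helper names `weighted_differences` and `has_missing_features` are hypothetical and not part of the notebook.

```python
import numpy as np

# Toy data (made up for illustration): three houses, two features.
houses = np.array([[5.0, 1000.0],
                   [7.0, 1500.0],
                   [6.0, 1200.0]])
toy_weights = np.array([0.7, 0.002])

complete_row = np.array([6.0, 1100.0])
incomplete_row = np.array([6.0, np.nan])  # the second feature is missing

def weighted_differences(test_row):
    """Weighted sum of absolute differences between test_row and every house."""
    return np.abs(houses - test_row) @ toy_weights

def has_missing_features(test_row):
    """Hypothetical guard: True if any feature value is NaN."""
    return bool(np.isnan(test_row).any())

scores_ok = weighted_differences(complete_row)
scores_bad = weighted_differences(incomplete_row)

print("Scores for the complete row:", scores_ok)
print("Closest house index:", int(np.argmin(scores_ok)))
print("All scores NaN for the incomplete row:", bool(np.isnan(scores_bad).all()))
print("Guard flags the incomplete row:", has_missing_features(incomplete_row))
```

Under these assumptions, a simple option would be to check each test row with such a guard and either skip it or fill the missing value before calling `predict_price`; which choice is appropriate depends on the assignment's requirements.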
