
# Midterm Project Part 2: Analyzing Home Prices

For the second part of the project, we will continue analyzing the house price dataset and apply the **k-nearest-neighbor** method to predict the prices of a few examples in the test set.
#### CMP 333: Data Management and Analysis - Isaac D. Hoyos

1. Load the `HousingData_processed.csv` file created in Part 1 and display its first five rows.

In [1]:
import pandas as pd
import numpy as np
from google.colab import files
import os

HousingData_processed = "HousingData_processed.csv"

# Check if the CSV file already exists in the environment.
if not os.path.exists(HousingData_processed):
    print(f"Please upload the \"HousingData_processed.csv\" file...\n")

    while True:
        uploaded = files.upload()

        # Get the name of the uploaded file (None if the upload was cancelled).
        uploaded_filename = next(iter(uploaded), None)

        # Check if it's the correct file.
        if uploaded_filename == HousingData_processed:
            print(f"\n✅ File \"HousingData_processed.csv\" has been uploaded successfully!")
            break
        else:
            print(f"\n❌ Error: Please upload the correct \"HousingData_processed.csv\" file...\n")
else:
    print(f"Using the uploaded \"HousingData_processed.csv\" file.")

# Read the uploaded CSV file.
HousingData_processed_df = pd.read_csv(HousingData_processed)

# Display the first five rows of the DataFrame
print("\n=== First 5 Rows of the HousingData_processed.csv DataFrame ===")
HousingData_processed_df.head()

Please upload the "HousingData_processed.csv" file...



Saving HousingData_processed.csv to HousingData_processed.csv

✅ File "HousingData_processed.csv" has been uploaded successfully!

=== First 5 Rows of the HousingData_processed.csv DataFrame ===


| | SalePrice | OverallQual | YearBuilt | TotalBsmtSF | GrLivArea | TotalArea | AreaPerRoom | GarageCars |
|---|---|---|---|---|---|---|---|---|
| 0 | 208500 | 7 | 2003 | 856 | 1710 | 2566 | 213.75 | 2 |
| 1 | 181500 | 6 | 1976 | 1262 | 1262 | 2524 | 210.333333 | 2 |
| 2 | 223500 | 7 | 2001 | 920 | 1786 | 2706 | 297.666667 | 2 |
| 3 | 140000 | 7 | 1915 | 756 | 1717 | 2473 | 245.285714 | 3 |
| 4 | 250000 | 8 | 2000 | 1145 | 2198 | 3343 | 244.222222 | 3 |


2. Load the first three instances in `test.csv` as a data frame and display these instances.

In [2]:
from google.colab import files
import os

test = "test.csv"

# Check if the CSV file already exists in the environment.
if not os.path.exists(test):
    print(f"Please upload the \"test.csv\" file...\n")

    while True:
        uploaded = files.upload()

        # Get the name of the uploaded file (None if the upload was cancelled).
        uploaded_filename = next(iter(uploaded), None)

        # Check if it's the correct file.
        if uploaded_filename == test:
            print(f"\n✅ File \"test.csv\" has been uploaded successfully!")
            break
        else:
            print(f"\n❌ Error: Please upload the correct \"test.csv\" file...\n")
else:
    print(f"Using the uploaded \"test.csv\" file.")

# Read the uploaded CSV file.
test_df = pd.read_csv(test)

# Load the first three instances in test.csv as a DataFrame.
print("\n=== First 3 Test Instances of the test.csv DataFrame ===")
test_df.head(3)

Please upload the "test.csv" file...



Saving test.csv to test.csv

✅ File "test.csv" has been uploaded successfully!

=== First 3 Test Instances of the test.csv DataFrame ===


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
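
Task 2 asks for only the first three instances, while the cell above reads the whole file and then shows `head(3)`. The notebook keeps the full frame because Task 6 later predicts every test house, but if only those rows were needed, `pd.read_csv` accepts an `nrows` argument. A minimal sketch, using a small in-memory CSV as a stand-in for `test.csv` (the three columns and their values are copied from the outputs in this notebook; `toy_csv` and `first_three` are hypothetical names):

```python
import io

import pandas as pd

# Stand-in for test.csv: three columns for the first four test houses.
toy_csv = io.StringIO(
    "Id,MSSubClass,LotArea\n"
    "1461,20,11622\n"
    "1462,20,14267\n"
    "1463,60,13830\n"
    "1464,60,9978\n"
)

# nrows=3 stops parsing after the first three data rows.
first_three = pd.read_csv(toy_csv, nrows=3)
print(first_three)
```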


3. Let's predict the price of the first test house using a method called "k-nearest-neighbors." To find similar houses, we need a similarity measure that combines the differences across all input features.

Define a `similarity` function as follows:
* Parameters: `row1` (one instance), `row2` (another instance), and `weights` (used for combining differences).
* Calculate the absolute value of the difference between `row1` and `row2` in each input feature. For example, the difference for the `OverallQual` feature should be calculated as `np.abs(row1['OverallQual'] - row2['OverallQual'])`.
* The similarity is calculated as a weighted sum of all feature differences, so a smaller value means the two houses are more alike.
* Return the similarity value.
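
Written as a formula, with $x$ and $y$ standing for `row1` and `row2`, $F$ for the set of input features, and $w_f$ for the weight of feature $f$ (these symbols are notation introduced here for illustration, not part of the assignment text):

$$
\operatorname{similarity}(x, y) = \sum_{f \in F} w_f \,\lvert x_f - y_f \rvert
$$

When every weight equals 1, this is the Manhattan (L1) distance between the two rows, and a value of 0 means they agree on every feature.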

In [3]:
# Define a similarity function for the differences from all features.

def similarity(row1, row2, weights):
    # Weighted sum of absolute differences over the features listed in `weights`.
    return np.sum([
        weights[feature] * np.abs(row1[feature] - row2[feature])
        for feature in weights
    ])

# Define the features and weights from the data set.
features = ['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea', 'GarageCars']
weights = {feature: 1.0 for feature in features}

# Print the features used and the total similarity value.
print("=== Similarity Function Defined Successfully ===")
print("Features Used for Comparison:", ', '.join(features))
similarity_value = similarity(HousingData_processed_df.iloc[0], test_df.iloc[0], weights)
print(f"Total Weighted Similarity Value: {similarity_value:.2f}")

# Display detailed comparison from the defined similarity function.
print("\n=== Detailed Feature Comparison for the First Test House ===")
comparison = pd.DataFrame({
    'Feature': features,
    'HouseValue': [HousingData_processed_df.iloc[0][f] for f in features],
    'TestValue': [test_df.iloc[0][f] for f in features],
    'AbsoluteDifference': [np.abs(HousingData_processed_df.iloc[0][f] - test_df.iloc[0][f]) for f in features],
    'Weight': [weights[f] for f in features],
    'WeightedDifference': [weights[f] * np.abs(HousingData_processed_df.iloc[0][f] - test_df.iloc[0][f]) for f in features]
})
display(comparison)

=== Similarity Function Defined Successfully ===
Features Used for Comparison: OverallQual, YearBuilt, TotalBsmtSF, GrLivArea, GarageCars
Total Weighted Similarity Value: 885.00

=== Detailed Feature Comparison for the First Test House ===


| | Feature | HouseValue | TestValue | AbsoluteDifference | Weight | WeightedDifference |
|---|---|---|---|---|---|---|
| 0 | OverallQual | 7.0 | 5.0 | 2.0 | 1.0 | 2.0 |
| 1 | YearBuilt | 2003.0 | 1961.0 | 42.0 | 1.0 | 42.0 |
| 2 | TotalBsmtSF | 856.0 | 882.0 | 26.0 | 1.0 | 26.0 |
| 3 | GrLivArea | 1710.0 | 896.0 | 814.0 | 1.0 | 814.0 |
| 4 | GarageCars | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 |
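
The total of 885 can be checked by hand from the table above, and doing so shows how unevenly the features contribute. The sketch below re-enters the five absolute differences from the table and reports each feature's share of the total. The names `abs_diff`, `unit_weights`, and `share` are introduced only for this illustration.

```python
import pandas as pd

# Absolute differences copied from the comparison table above.
abs_diff = pd.Series({
    "OverallQual": 2.0,
    "YearBuilt": 42.0,
    "TotalBsmtSF": 26.0,
    "GrLivArea": 814.0,
    "GarageCars": 1.0,
})

# Same unit weights as in the cell above (every feature weighted 1.0).
unit_weights = pd.Series(1.0, index=abs_diff.index)

weighted = unit_weights * abs_diff
total = weighted.sum()

# Fraction of the total similarity value contributed by each feature.
share = (weighted / total).round(3)

print(f"Total: {total:.2f}")
print(share.sort_values(ascending=False))
```

With unit weights, `GrLivArea` accounts for about 92% of this particular comparison because its values sit on a much larger numeric scale, while `OverallQual` and `GarageCars` barely register. The `weights` argument is the place to rebalance this if the features should count more evenly.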


4. Calculate the difference between the first test house and each house in the data.

In [4]:
# Calculate the difference between the first test house and each house in the dataset.

# Select the first test instance.
test_house_1 = test_df.iloc[0]

# Compute similarity with all houses.
similarities = HousingData_processed_df.apply(
    lambda row: similarity(test_house_1, row, weights), axis=1
)

# Create a table combining SalePrice and computed similarity.
similarity_table = HousingData_processed_df[['SalePrice']].copy()
similarity_table['Similarity'] = similarities

# Sort by similarity and display the values.
similarity_table = similarity_table.sort_values(by='Similarity', ascending=True).reset_index(drop=True)
print("=== Similarity Values for the First Test House ===")
display(similarity_table)

=== Similarity Values for the First Test House ===


| | SalePrice | Similarity |
|---|---|---|
| 0 | 122000 | 10.0 |
| 1 | 109500 | 15.0 |
| 2 | 125500 | 18.0 |
| 3 | 138500 | 18.0 |
| 4 | 125000 | 19.0 |
| ... | ... | ... |
| 1455 | 430000 | 4685.0 |
| 1456 | 755000 | 5022.0 |
| 1457 | 745000 | 5136.0 |
| 1458 | 184750 | 6089.0 |


5. Calculate the average price of the five closest houses, as this will be our prediction for the test house.

In [5]:
# Predict the price of the first test house using the 5 nearest neighbors.

# Combine housing data with computed similarity values.
data_with_similarity = HousingData_processed_df.copy()
data_with_similarity['Similarity'] = similarities

# Identify the five most similar houses.
nearest_5 = data_with_similarity.nsmallest(5, 'Similarity').reset_index(drop=True)

# Calculate the average SalePrice of these 5 houses.
predicted_price_1 = nearest_5['SalePrice'].mean()

# Display the results and summary statistics of the nearest houses.
print("=== 5 Most Similar Houses to Test House 1 ===")
display(nearest_5[['SalePrice', 'Similarity']])
print(f"\nAverage Price of the 5 Closest Houses: ${predicted_price_1:,.2f}")

print("\n=== Nearest Neighbors Statistics ===")
print(f"Price Range: ${nearest_5['SalePrice'].min():,.0f} - ${nearest_5['SalePrice'].max():,.0f}")
print(f"Average Similarity Value: {nearest_5['Similarity'].mean():.2f}")

=== 5 Most Similar Houses to Test House 1 ===


| | SalePrice | Similarity |
|---|---|---|
| 0 | 122000 | 10.0 |
| 1 | 109500 | 15.0 |
| 2 | 138500 | 18.0 |
| 3 | 125500 | 18.0 |
| 4 | 109900 | 19.0 |



Average Price of the 5 Closest Houses: $121,080.00

=== Nearest Neighbors Statistics ===
Price Range: $109,500 - $138,500
Average Similarity Value: 16.00
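
The sorted table from Task 4 and the `nsmallest` result above list different houses at similarity 19.0 and swap the two houses at 18.0, which is what one would expect if several houses tie at those values. How such ties are broken affects which prices enter the average. A minimal sketch on made-up numbers (the `toy` frame is hypothetical, not taken from the housing data) showing the `keep` option of `DataFrame.nsmallest`:

```python
import pandas as pd

# Made-up data with a tie at the second-smallest similarity value.
toy = pd.DataFrame({
    "SalePrice": [100_000, 120_000, 140_000, 200_000],
    "Similarity": [1.0, 2.0, 2.0, 5.0],
})

# Default keep="first": exactly k rows, ties resolved by original row order.
first_k = toy.nsmallest(2, "Similarity")

# keep="all": every row tied with the k-th smallest value is retained,
# so the result can contain more than k rows.
with_ties = toy.nsmallest(2, "Similarity", keep="all")

print(first_k["SalePrice"].mean())
print(with_ties["SalePrice"].mean())
```

Either choice is defensible for this project. The point is only that the prediction can shift when neighbors tie at the cutoff, so it helps to pick one rule and apply it consistently.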


6. Repeat Task 4 and Task 5 on the other test houses and display the predictions.

In [6]:
# Repeat Tasks 4 & 5 for all other test houses and display the predictions.

def predict_price(test_row, data, features, weights, k=5):
    # Convert weights to an array ordered by features.
    weight_array = np.array([weights[f] for f in features])

    # Compute absolute feature differences (cast to float, since a test row is an object-dtype Series).
    diffs = np.abs(data[features].to_numpy(dtype=float) - test_row[features].to_numpy(dtype=float))

    # Compute similarity as weighted sum of differences.
    similarities = np.dot(diffs, weight_array)

    # Find indices of k smallest similarity values.
    nearest_idx = np.argpartition(similarities, k)[:k]

    # Average their SalePrice values to make prediction.
    predicted_price = data.iloc[nearest_idx]['SalePrice'].mean()
    return predicted_price

# Generate predictions for every test house (the first one is recomputed as well).
predictions = [
    predict_price(test_df.iloc[i], HousingData_processed_df, features, weights)
    for i in range(len(test_df))
]

# Combine test data with predictions.
results = test_df.copy()
results['PredictedPrice'] = predictions

# Display the results and print summary for the first five houses.
print("=== Predicted Sale Prices for All Other Test Houses ===")
display(results)
print(f"Total Test Houses Evaluated: {len(predictions)}")
print("\n=== The First Five Test Houses Predictions ===")
for i, price in enumerate(predictions[:5]):
    print(f"Prediction Test House {i+1}: ${price:,.2f}")

=== Predicted Sale Prices for All Other Test Houses ===


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,PredictedPrice
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,6,2010,WD,Normal,121080.0
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,,,Gar2,12500,6,2010,WD,Normal,161180.0
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,,MnPrv,,0,3,2010,WD,Normal,178700.0
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,,,,0,6,2010,WD,Normal,178700.0
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,0,,,,0,1,2010,WD,Normal,193540.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2915,160,RM,21.0,1936,Pave,,Reg,Lvl,AllPub,...,0,,,,0,6,2006,WD,Normal,83900.0
1455,2916,160,RM,21.0,1894,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2006,WD,Abnorml,83900.0
1456,2917,20,RL,160.0,20000,Pave,,Reg,Lvl,AllPub,...,0,,,,0,9,2006,WD,Abnorml,137190.0
1457,2918,85,RL,62.0,10441,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,Shed,700,7,2006,WD,Normal,136100.0


Total Test Houses Evaluated: 1459

=== The First Five Test Houses Predictions ===
Prediction Test House 1: $121,080.00
Prediction Test House 2: $161,180.00
Prediction Test House 3: $178,700.00
Prediction Test House 4: $178,700.00
Prediction Test House 5: $193,540.00
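
Because `predict_price` replaces the row-by-row `similarity` call with array operations, it is worth confirming on a small example that both routes pick the same neighbors. The sketch below is self-contained: it re-declares simplified versions of both approaches under the hypothetical names `predict_rowwise` and `predict_vectorized`, uses a made-up six-house frame with two features (none of the values come from the housing data), and prints both predictions so they can be compared.

```python
import numpy as np
import pandas as pd

features = ["OverallQual", "GrLivArea"]
weights = {"OverallQual": 1.0, "GrLivArea": 1.0}

# Made-up training houses; not taken from HousingData_processed.csv.
train = pd.DataFrame({
    "OverallQual": [5, 6, 7, 5, 8, 4],
    "GrLivArea": [900, 1000, 1500, 950, 2000, 700],
    "SalePrice": [110_000, 130_000, 180_000, 120_000, 260_000, 90_000],
})
test_row = pd.Series({"OverallQual": 5, "GrLivArea": 920})

def similarity(row1, row2, weights):
    # Weighted sum of absolute differences, one feature at a time.
    return sum(w * abs(row1[f] - row2[f]) for f, w in weights.items())

def predict_rowwise(test_row, data, weights, k):
    # Score every training row with similarity(), then average the k closest prices.
    scores = data.apply(lambda row: similarity(test_row, row, weights), axis=1)
    return data.loc[scores.nsmallest(k).index, "SalePrice"].mean()

def predict_vectorized(test_row, data, features, weights, k):
    # Same computation expressed with NumPy arrays.
    w = np.array([weights[f] for f in features])
    diffs = np.abs(data[features].to_numpy(dtype=float)
                   - test_row[features].to_numpy(dtype=float))
    scores = diffs @ w
    nearest = np.argpartition(scores, k)[:k]  # k smallest, in no particular order
    return data.iloc[nearest]["SalePrice"].mean()

k = 3
rowwise = predict_rowwise(test_row, train, weights, k)
vectorized = predict_vectorized(test_row, train, features, weights, k)
print(rowwise, vectorized)
```

In this toy frame the three closest houses have similarity values 20, 30, and 81, with no ties, so both functions average the same three prices. When ties occur at the cutoff, `argpartition` and `nsmallest` may select different rows.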
