
# Midterm Project Part 2: Analyzing Home Prices

#### CMP 333: Data Management and Analysis - Isaac D. Hoyos

In the second part of the project, we continue analyzing the house price dataset and apply the **k-nearest-neighbors** method to predict the prices of a few houses in the test set.

1. Load the `HousingData_processed.csv` file created in Part 1 and display its first 5 rows.

In [1]:
import pandas as pd
import numpy as np

# Load HousingData_processed.csv and display the first five rows.
housing_df = pd.read_csv("HousingData_processed.csv")

print("=== First 5 Rows of The HousingData_processed.csv DataFrame ===")
housing_df.head()

=== First 5 Rows of The HousingData_processed.csv DataFrame ===


|  | SalePrice | OverallQual | YearBuilt | TotalBsmtSF | GrLivArea | TotalArea | AreaPerRoom | GarageCars |
|---|---|---|---|---|---|---|---|---|
| 0 | 208500 | 7 | 2003 | 856 | 1710 | 2566 | 213.75 | 2 |
| 1 | 181500 | 6 | 1976 | 1262 | 1262 | 2524 | 210.333333 | 2 |
| 2 | 223500 | 7 | 2001 | 920 | 1786 | 2706 | 297.666667 | 2 |
| 3 | 140000 | 7 | 1915 | 756 | 1717 | 2473 | 245.285714 | 3 |
| 4 | 250000 | 8 | 2000 | 1145 | 2198 | 3343 | 244.222222 | 3 |


2. Load the first three instances in `test.csv` as a data frame and display these instances.

In [2]:
# Load the first three instances in test.csv as a DataFrame.
test_df = pd.read_csv("test.csv").head(3)
print("=== First 3 Test Instances of The test.csv DataFrame ===")
test_df

=== First 3 Test Instances of The test.csv DataFrame ===


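|  | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave |  | Reg | Lvl | AllPub | ... | 120 | 0 |  | MnPrv |  | 0 | 6 | 2010 | WD | Normal |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave |  | IR1 | Lvl | AllPub | ... | 0 | 0 |  |  | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave |  | IR1 | Lvl | AllPub | ... | 0 | 0 |  | MnPrv |  | 0 | 3 | 2010 | WD | Normal |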


3. Predict the price of each test house using a method called "k-nearest neighbors." To find similar houses, we need a similarity measure that combines the differences across all features.

Define a `similarity` function as follows:
* Parameters: `row1` (one instance), `row2` (another instance), and `weights` (a mapping from each feature name to the weight used when combining differences).
* Calculate the absolute difference between `row1` and `row2` for each input feature. For example, the difference for the `OverallQual` feature should be calculated as:
`np.abs(row1['OverallQual'] - row2['OverallQual'])`.
* Compute the similarity as the weighted sum of all feature differences. Because it adds up differences, a smaller value means the two houses are more alike.
* Return the similarity value.

In [3]:
# Define a similarity function.

# Calculate the weighted sum of absolute feature differences between two rows.
# Smaller values mean the two houses are more alike.
def similarity(row1, row2, weights):
    diff = 0
    for feature, weight in weights.items():
        diff += weight * np.abs(row1[feature] - row2[feature])
    return diff

# Define features and equal weights for all.
features = ['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea', 'GarageCars']
weights = {feature: 1.0 for feature in features}
print("=== Similarity Function Defined Successfully ===")
print("\nFeatures Used:", features)

# Compare the first row of the processed housing data with the first test row.
test_similarity_value = similarity(housing_df.iloc[0], test_df.iloc[0], weights)
print(f"\nSimilarity Value Between The First House & First Test Row: {test_similarity_value:.2f}")

=== Similarity Function Defined Successfully ===

Features Used: ['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea', 'GarageCars']

Similarity Value Between The First House & First Test Row: 885.00
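With equal weights, features measured in large units (such as square footage) contribute far more to the sum than features on a small scale (such as a 1-10 quality rating). The sketch below illustrates this on two made-up houses. The function name `weighted_distance`, the house values, and the alternative weights are assumptions chosen for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

def weighted_distance(row1, row2, weights):
    """Weighted sum of absolute feature differences (smaller = more alike)."""
    return sum(w * np.abs(row1[f] - row2[f]) for f, w in weights.items())

# Two made-up houses, for illustration only.
house_a = pd.Series({"OverallQual": 5, "GrLivArea": 1500})
house_b = pd.Series({"OverallQual": 9, "GrLivArea": 1520})

equal_weights = {"OverallQual": 1.0, "GrLivArea": 1.0}
# Hypothetical weights: one quality point counts as much as 100 square feet.
scaled_weights = {"OverallQual": 100.0, "GrLivArea": 1.0}

print(weighted_distance(house_a, house_b, equal_weights))   # 4 + 20 = 24.0
print(weighted_distance(house_a, house_b, scaled_weights))  # 400 + 20 = 420.0
```

Under equal weights, the four-point quality gap is outweighed by a 20-square-foot difference in living area; the second set of weights is one possible way to rebalance the two.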


4. Calculate the similarity value between the first test house and each house in the data.

In [4]:
# Calculate differences between the first test house and each house in the data.
test_house_1 = test_df.iloc[0]
similarities = housing_df.apply(lambda x: similarity(test_house_1, x, weights), axis=1)

# Combine with original data for visualization.
similarity_table = housing_df.copy()
similarity_table['Similarity'] = similarities

print("=== Similarity Values For The First Test House ===")
similarity_table[['SalePrice', 'Similarity']]

=== Similarity Values For The First Test House ===


|  | SalePrice | Similarity |
|---|---|---|
| 0 | 208500 | 885.0 |
| 1 | 181500 | 763.0 |
| 2 | 223500 | 971.0 |
| 3 | 140000 | 997.0 |
| 4 | 250000 | 1609.0 |
| ... | ... | ... |
| 1455 | 175000 | 862.0 |
| 1456 | 210000 | 1856.0 |
| 1457 | 266500 | 1736.0 |
| 1458 | 142125 | 389.0 |
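The row-by-row `apply` call works, but the same weighted sum can also be computed with whole-column arithmetic. The following is a minimal sketch on made-up rows, assuming every weighted feature is numeric and present in both the data and the test row. The helper name `weighted_distances` and the toy values are hypothetical.

```python
import pandas as pd

def weighted_distances(data, test_row, weights):
    """Weighted sum of absolute differences between test_row and every row of data."""
    cols = list(weights)
    # DataFrame minus Series aligns the Series index with the column labels.
    diffs = (data[cols] - test_row[cols].astype(float)).abs()
    # Scale each column by its weight, then add across columns for each row.
    return diffs.mul(pd.Series(weights), axis=1).sum(axis=1)

# Made-up rows for illustration only; these are not values from the dataset.
toy_houses = pd.DataFrame({"OverallQual": [5, 7, 8], "GrLivArea": [1500, 1700, 2200]})
toy_test_row = pd.Series({"OverallQual": 6, "GrLivArea": 1550})
toy_weights = {"OverallQual": 1.0, "GrLivArea": 1.0}

toy_distances = weighted_distances(toy_houses, toy_test_row, toy_weights)
print(toy_distances.tolist())  # [51.0, 151.0, 652.0]
```

On the real data, this approach should give the same values as the `apply` version, though that has not been checked here.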


5. Calculate the average price of the 5 closest houses. This will be our prediction on the test house.

In [5]:
# Find the five most similar houses and calculate the average SalePrice.

# Combine housing data with similarity values.
data_with_similarity = housing_df.copy()
data_with_similarity['Similarity'] = similarities

# Select the five closest houses.
nearest_5 = data_with_similarity.nsmallest(5, 'Similarity')

# Compute the average SalePrice.
predicted_price_1 = nearest_5['SalePrice'].mean()
print("=== 5 Most Similar Houses To Test House 1 ===")
display(nearest_5[['SalePrice', 'Similarity']])
print(f"\nPredicted Price For Test House 1: ${predicted_price_1:,.2f}")

=== 5 Most Similar Houses To Test House 1 ===


|  | SalePrice | Similarity |
|---|---|---|
| 288 | 122000 | 10.0 |
| 870 | 109500 | 15.0 |
| 698 | 138500 | 18.0 |
| 904 | 125500 | 18.0 |
| 709 | 109900 | 19.0 |



Predicted Price For Test House 1: $121,080.00
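Rows 698 and 904 above share a similarity value of 18.0, so ties do occur in this data. By default, `nsmallest` keeps the first rows it encounters when a tie falls at the cutoff; passing `keep="all"` retains every tied row instead, which can return more than the requested number of rows. The sketch below shows the difference on a made-up table (the values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Made-up table with a tie at the cutoff for k = 2.
toy_table = pd.DataFrame({
    "SalePrice": [100000, 120000, 140000, 200000],
    "Similarity": [10.0, 15.0, 15.0, 30.0],
})

# Default behavior: exactly 2 rows, the first tied row wins.
first_two = toy_table.nsmallest(2, "Similarity")
# keep="all": both tied rows are kept, so 3 rows come back.
with_ties = toy_table.nsmallest(2, "Similarity", keep="all")

print(first_two["SalePrice"].mean())  # 110000.0
print(with_ties["SalePrice"].mean())  # 120000.0
```

Which behavior is preferable depends on whether the prediction should use exactly five houses or every house at least as close as the fifth.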


6. Repeat Task 4 and Task 5 on the other test houses and display the predictions.

In [6]:
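# Predict a price as the average SalePrice of the k most similar houses.
# Note: `features` is accepted for the call below, but the features actually
# compared are the keys of `weights`.
def predict_price(test_row, data, features, weights, k=5):
    # Compute the similarity between the test row and every row of the dataset.
    similarities = data.apply(lambda x: similarity(test_row, x, weights), axis=1)
    data_with_similarity = data.copy()
    data_with_similarity['Similarity'] = similarities
    # Keep the k rows with the smallest values, i.e. the closest houses.
    nearest = data_with_similarity.nsmallest(k, 'Similarity')
    predicted_price = nearest['SalePrice'].mean()
    return predicted_price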

# Generate predicted prices for test dataset.
predictions = [predict_price(test_df.iloc[i], housing_df, features, weights) for i in range(len(test_df))]

# Combine predictions with test data.
results = test_df.copy()
results['PredictedPrice'] = predictions
print("=== Predictions For The Other Test Houses ===")
display(results)

for i, price in enumerate(predictions):
    print(f"\nPredicted Price For Test House {i+1}: ${price:,.2f}")

=== Predictions For The Other Test Houses ===


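|  | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | PredictedPrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  | MnPrv |  | 0 | 6 | 2010 | WD | Normal | 121080.0 |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  | Gar2 | 12500 | 6 | 2010 | WD | Normal | 161180.0 |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  | MnPrv |  | 0 | 3 | 2010 | WD | Normal | 178700.0 |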



Predicted Price For Test House 1: $121,080.00

Predicted Price For Test House 2: $161,180.00

Predicted Price For Test House 3: $178,700.00
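The predictions above all use the default `k=5`. Since the prediction is an average over the `k` closest houses, it can shift as `k` changes. The sketch below shows this on a made-up table of prices and similarity values; the helper name `knn_average` and all numbers are hypothetical and only meant to illustrate the effect.

```python
import pandas as pd

# Made-up prices and similarity values, for illustration only.
toy_table = pd.DataFrame({
    "SalePrice": [100000, 120000, 140000, 200000],
    "Similarity": [10.0, 15.0, 18.0, 30.0],
})

def knn_average(table, k):
    """Average SalePrice of the k rows with the smallest Similarity."""
    return float(table.nsmallest(k, "Similarity")["SalePrice"].mean())

toy_predictions = {k: knn_average(toy_table, k) for k in (1, 3, 4)}
print(toy_predictions)  # {1: 100000.0, 3: 120000.0, 4: 140000.0}
```

A small `k` follows the single closest house closely, while a larger `k` pulls in more distant houses and smooths the estimate.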
