<a href="https://colab.research.google.com/github/Leesanchez/Madrid_Property_Predictions/blob/main/PropertyPriceEstimation(Anthony_LeeSanchez).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 Madrid Property Predictions
### Regression & Classification Analysis

This notebook uses **regression** to predict Madrid property prices and **classification** to label each property as "cheap" or "expensive."

Colab notebook: https://colab.research.google.com/drive/1DuwPcD15K55JssT-R9TWiKtizAqBOT-p?usp=sharing

In [1]:
from google.colab import files
import pandas as pd

# Upload file manually
uploaded = files.upload()

# Get the file name automatically
file_name = list(uploaded.keys())[0]

# Load the dataset
df = pd.read_excel(file_name)

# Display first few rows
df.head()

Saving Session 7 Dataset.xlsx to Session 7 Dataset (1).xlsx


Unnamed: 0.1,Unnamed: 0,inm_floor,inm_size,inm_price,inm_longitude,inm_latitude,inm_barrio,inm_distrito,his_price,his_quarterly_variation,...,dem_TasaDeParo,dem_TamanoMedioDelHogar,dem_PropSinEstudiosUniversitarios,dem_PropSinEstudios,dem_Proporcion_de_nacidos_fuera_de_Espana,dem_PropConEstudiosUniversitarios,dem_PobTotal,dem_NumViviendas,dem_EdadMedia,dem_Densidad_(Habit/Ha)
0,0,3.0,141.0,990000,-3.656875,40.464347,Canillas,Hortaleza,3250,2.2,...,8.724674,2.527886,0.488949,0.175632,15.456193,,40838,16155,,161.894356
1,1,2.0,159.0,940000,-3.703523,40.419427,Universidad,Centro,5106,1.4,...,9.006094,1.975877,0.386598,0.083812,32.10246,0.52959,33418,16913,43.678945,352.500616
2,2,,,549000,-3.669626,40.435362,Guindalera,Salamanca,4100,0.6,...,7.441379,2.369951,0.365818,0.070351,18.224365,0.563831,42306,17851,46.477166,263.952286
3,3,2.0,232.0,750000,-3.720619,40.424164,Argüelles,Moncloa - Aravaca,4773,0.5,...,6.709633,2.328217,0.343683,0.066403,20.963846,0.589914,24423,10490,46.972342,322.402577
4,4,4.0,183.0,1550000,-3.705909,40.413214,Sol,Centro,4739,-5.5,...,9.05898,1.994244,0.43375,0.082242,39.490947,0.484009,7622,3822,44.632774,171.165183


In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

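# Drop the saved-index column if it exists. The raw file shows two such columns
# ("Unnamed: 0.1" and "Unnamed: 0"); only one is dropped here, so the other
# remains in df and is later used as a model feature.
df = df.drop(columns=["Unnamed: 0"], errors='ignore')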

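# Handle missing values: median for numeric columns, most frequent value for text columns.
# Both statistics are computed on the full dataset, before any train/test split.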
for col in df.select_dtypes(include=["float64", "int64"]).columns:
    df[col] = df[col].fillna(df[col].median())

for col in df.select_dtypes(include=["object"]).columns:
    df[col] = df[col].fillna(df[col].mode()[0])

# Encode categorical variables
categorical_cols = ["inm_barrio", "inm_distrito"]
label_encoders = {}

for col in categorical_cols:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])

# Display processed data
df.head()

Unnamed: 0,inm_floor,inm_size,inm_price,inm_longitude,inm_latitude,inm_barrio,inm_distrito,his_price,his_quarterly_variation,his_annual_variation,...,dem_TasaDeParo,dem_TamanoMedioDelHogar,dem_PropSinEstudiosUniversitarios,dem_PropSinEstudios,dem_Proporcion_de_nacidos_fuera_de_Espana,dem_PropConEstudiosUniversitarios,dem_PobTotal,dem_NumViviendas,dem_EdadMedia,dem_Densidad_(Habit/Ha)
0,3.0,141.0,990000,-3.656875,40.464347,19,8,3250,2.2,0.3,...,8.724674,2.527886,0.488949,0.175632,15.456193,0.512828,40838,16155,45.113343,161.894356
1,2.0,159.0,940000,-3.703523,40.419427,108,3,5106,1.4,-4.3,...,9.006094,1.975877,0.386598,0.083812,32.10246,0.52959,33418,16913,43.678945,352.500616
2,2.0,98.0,549000,-3.669626,40.435362,52,14,4100,0.6,-4.1,...,7.441379,2.369951,0.365818,0.070351,18.224365,0.563831,42306,17851,46.477166,263.952286
3,2.0,232.0,750000,-3.720619,40.424164,13,10,4773,0.5,-3.7,...,6.709633,2.328217,0.343683,0.066403,20.963846,0.589914,24423,10490,46.972342,322.402577
4,4.0,183.0,1550000,-3.705909,40.413214,105,3,4739,-5.5,-5.3,...,9.05898,1.994244,0.43375,0.082242,39.490947,0.484009,7622,3822,44.632774,171.165183
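
To make the preprocessing above concrete, here is a minimal sketch on a made-up four-row frame. The `toy` frame, its values, and the `inm_barrio_code` column are illustrative assumptions, not taken from the dataset. It applies the same median/mode fill and `LabelEncoder` step, and shows that the encoder assigns integer codes in sorted label order, so the numbers say nothing about the neighborhoods themselves.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "inm_size": [141.0, None, 98.0, 232.0],
    "inm_barrio": ["Sol", "Canillas", None, "Sol"],
})

# Numeric gaps get the column median, text gaps get the most frequent value.
toy["inm_size"] = toy["inm_size"].fillna(toy["inm_size"].median())
toy["inm_barrio"] = toy["inm_barrio"].fillna(toy["inm_barrio"].mode()[0])

# LabelEncoder maps each distinct label to an integer, in sorted label order.
encoder = LabelEncoder()
toy["inm_barrio_code"] = encoder.fit_transform(toy["inm_barrio"])

print(toy)
print(list(encoder.classes_))
```

Because the codes are arbitrary integers, a linear model will read them as an ordered quantity. That may be worth keeping in mind when comparing the linear and tree-based results.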


## 🏡 Regression Analysis (Predicting Property Prices)
We will train two models to predict `inm_price` and compare them by RMSE and R²:
- **Linear Regression**
- **Random Forest Regressor**
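
As a hedged aside, the scaling step for the linear model can also be bundled with the estimator. The sketch below uses scikit-learn's `make_pipeline`, which this notebook does not use, on synthetic data (`X_demo`, `y_demo`, and the model names are assumptions made up for the example). The pipeline fits the scaler on the training rows only, while the random forest is trained on unscaled features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Madrid dataset): a noisy linear target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows and reuses it at predict time.
linear_model = make_pipeline(StandardScaler(), LinearRegression())
linear_model.fit(X_tr, y_tr)

# The forest is trained directly on the unscaled features.
forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
forest_model.fit(X_tr, y_tr)

# .score() returns R² on the held-out rows for both regressors.
linear_r2 = linear_model.score(X_te, y_te)
forest_r2 = forest_model.score(X_te, y_te)
print(f"linear R²: {linear_r2:.3f}, forest R²: {forest_r2:.3f}")
```

On this synthetic target the linear model should come out ahead because the relationship is linear by construction. That says nothing about which model fits the property data better.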

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt

# Select features and target
X = df.drop(columns=["inm_price"])
y = df["inm_price"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features, fitting the scaler on the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

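# Train Linear Regression on the standardized features
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)

# Train Random Forest on the unscaled features (tree splits do not depend on feature scale)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)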

# Evaluate models
lr_rmse = sqrt(mean_squared_error(y_test, y_pred_lr))
lr_r2 = r2_score(y_test, y_pred_lr)

rf_rmse = sqrt(mean_squared_error(y_test, y_pred_rf))
rf_r2 = r2_score(y_test, y_pred_rf)

# Display results
regression_results = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest"],
    "RMSE": [lr_rmse, rf_rmse],
    "R² Score": [lr_r2, rf_r2]
})

print("🏡 Regression Results:")
print(regression_results)

🏡 Regression Results:
               Model           RMSE  R² Score
0  Linear Regression  431478.500854  0.744158
1      Random Forest  301271.743257  0.875270
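
To read the two metrics in the table above, it can help to compute them by hand. The sketch below does so in plain Python on four invented prices (the `actual` and `predicted` lists are assumptions for illustration, not model output). RMSE is in the same units as the price, and R² is the share of price variation the predictions account for.

```python
from math import sqrt

# Hypothetical prices, invented for illustration only.
actual = [200_000, 350_000, 500_000, 950_000]
predicted = [250_000, 300_000, 550_000, 900_000]

n = len(actual)
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

# RMSE: square root of the mean squared error, in the same units as the price.
rmse = sqrt(sum(squared_errors) / n)

# R²: 1 minus the ratio of residual variation to total variation around the mean.
mean_actual = sum(actual) / n
ss_res = sum(squared_errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(f"RMSE = {rmse:.0f}, R² = {r2:.4f}")  # RMSE = 50000, R² = 0.9683
```

Every toy prediction is off by exactly 50,000, so the RMSE is 50,000. In the same way, the Random Forest's RMSE of about 301,000 is a typical error size in price units.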


## 🔍 Classification Analysis (Cheap vs Expensive Properties)
We will classify properties as **cheap (0)** or **expensive (1)**, split at the median price, using:
- **Perceptron**
- **Logistic Regression**
- **LDA (Linear Discriminant Analysis)**
- **QDA (Quadratic Discriminant Analysis)**
- **KNN (K-Nearest Neighbors)**
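
The cheap/expensive target comes from `pd.qcut` with `q=2`, which cuts at the median. A minimal sketch on six invented prices (the `prices` series is an assumption for illustration) shows how the labels fall out:

```python
import pandas as pd

# Hypothetical prices, invented for illustration only.
prices = pd.Series([150_000, 220_000, 310_000, 480_000, 750_000, 1_200_000])

# q=2 cuts at the median: label 0 for the lower half, label 1 for the upper half.
category = pd.qcut(prices, q=2, labels=[0, 1])

print(prices.median())    # 395000.0
print(category.tolist())  # [0, 0, 0, 1, 1, 1]
```

Because the cut is at the median, the two classes are close to balanced by construction. This matches the near-equal support counts in the classification reports.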

In [4]:
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Create classification target: 0 = at or below the median price (cheap), 1 = above it (expensive)
df["price_category"] = pd.qcut(df["inm_price"], q=2, labels=[0, 1])

# Select features and target
X = df.drop(columns=["inm_price", "price_category"])
y = df["price_category"]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

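# Standardize with a fresh scaler fitted on the classification training split,
# so the scaler from the regression cell is not silently overwritten
clf_scaler = StandardScaler()
X_train_scaled = clf_scaler.fit_transform(X_train)
X_test_scaled = clf_scaler.transform(X_test)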

# Train classifiers
models = {
    "Perceptron": Perceptron(),
    "Logistic Regression": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(n_neighbors=5)
}

classification_results = {}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    classification_results[name] = accuracy
    print(f"\n🔍 {name} Accuracy: {accuracy:.4f}")
    print(classification_report(y_test, y_pred))

# Convert results to a DataFrame
classification_results_df = pd.DataFrame(list(classification_results.items()), columns=["Model", "Accuracy"])
print("\n📊 Classification Results:")
print(classification_results_df)


🔍 Perceptron Accuracy: 0.9087
              precision    recall  f1-score   support

           0       0.93      0.89      0.91      1777
           1       0.89      0.93      0.91      1795

    accuracy                           0.91      3572
   macro avg       0.91      0.91      0.91      3572
weighted avg       0.91      0.91      0.91      3572


🔍 Logistic Regression Accuracy: 0.9286
              precision    recall  f1-score   support

           0       0.93      0.93      0.93      1777
           1       0.93      0.93      0.93      1795

    accuracy                           0.93      3572
   macro avg       0.93      0.93      0.93      3572
weighted avg       0.93      0.93      0.93      3572


🔍 LDA Accuracy: 0.8541
              precision    recall  f1-score   support

           0       0.89      0.81      0.85      1777
           1       0.83      0.90      0.86      1795

    accuracy                           0.85      3572
   macro avg       0.86      0.85

## 📁 Save Results
Both result tables are saved as CSV files and then downloaded through the browser.
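
The download step in this section relies on `google.colab.files`, which exists only inside Colab. For runs elsewhere, a minimal sketch of a fallback might look like the following. The helper name `save_and_offer_download` and the `demo_results` table are hypothetical, introduced only for this example.

```python
import pandas as pd


def save_and_offer_download(frame: pd.DataFrame, path: str) -> str:
    """Write frame to CSV; trigger a browser download only when running in Colab."""
    frame.to_csv(path, index=False)
    try:
        from google.colab import files  # importable only inside Colab
    except ImportError:
        print(f"Saved {path} (not running in Colab, so no browser download)")
    else:
        files.download(path)
    return path


# Hypothetical results table, invented for illustration only.
demo_results = pd.DataFrame({"Model": ["A", "B"], "Accuracy": [0.9, 0.8]})
saved_path = save_and_offer_download(demo_results, "demo_results.csv")
```

Outside Colab the file simply stays in the working directory, where it can be read back with `pd.read_csv(saved_path)`.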

In [5]:
# Save regression results
regression_results.to_csv("regression_results.csv", index=False)

# Save classification results
classification_results_df.to_csv("classification_results.csv", index=False)

# Trigger browser downloads (Colab only); files is re-imported so this cell runs on its own
from google.colab import files

print("\n📥 Download regression results:")
files.download("regression_results.csv")

print("\n📥 Download classification results:")
files.download("classification_results.csv")


📥 Download regression results:



📥 Download classification results:

