# Predicting Price with Size

Given a real estate dataset of property listings in Buenos Aires, this analysis focuses on a subset of the data: apartments located in "Capital Federal" that cost less than $400,000 USD.

After filtering the data accordingly, the objectives are to:

1. Analyze the relationship between property characteristics and price.
2. Develop a linear regression model that predicts apartment price from covered surface area.

The task involves data cleaning, filtering, exploratory data analysis, and fitting a simple machine learning model for price prediction.

### Import Libraries

In [1]:
import warnings

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.utils.validation import check_is_fitted

warnings.simplefilter(action="ignore", category=FutureWarning)

# Prepare Data

In [None]:
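def wrangle(filepath):
    """Load the CSV and keep Capital Federal apartments under $400,000 USD."""
    df = pd.read_csv(filepath)

    # Subset to apartments in "Capital Federal" that cost less than $400,000
    apt_mask = df["property_type"] == "apartment"
    price_mask = df["price_aprox_usd"] < 400_000
    city = df["place_with_parent_names"].astype(str).str.split("|", expand=True)[2]
    city_mask = city == "Capital Federal"
    df = df[apt_mask & price_mask & city_mask]

    # Drop size outliers: keep the 10th to 90th percentile of covered surface
    # (named low/high to avoid shadowing the built-ins min and max)
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    size_mask = df["surface_covered_in_m2"].between(low, high)

    return df[size_mask]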
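
To see what the location mask and the quantile trim inside wrangle do, here is a minimal sketch on made-up values. The toy rows are hypothetical, and the sketch assumes that "place_with_parent_names" starts with a leading "|", which would explain why the city sits at index 2 after the split.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the real dataset.
toy = pd.DataFrame(
    {
        "place_with_parent_names": [
            "|Argentina|Capital Federal|Palermo|",
            "|Argentina|Bs.As. G.B.A. Zona Norte|Tigre|",
            "|Argentina|Capital Federal|Belgrano|",
        ]
    }
)

# Splitting on "|" gives an empty string at index 0 (from the leading pipe),
# the country at index 1, and the city or region at index 2.
parts = toy["place_with_parent_names"].str.split("|", expand=True)
city_mask = parts[2] == "Capital Federal"
print(city_mask.tolist())  # [True, False, True]

# Quantile trim on eleven evenly spaced, made-up sizes (0, 10, ..., 100).
sizes = pd.Series([float(v) for v in range(0, 101, 10)])
low, high = sizes.quantile([0.1, 0.9])
trimmed = sizes[sizes.between(low, high)]  # between() is inclusive by default
print(low, high, len(trimmed))
```

The smallest and largest sizes fall outside the 10th to 90th percentile band, so the trim removes the extremes at both ends before modeling.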

In [None]:
df = wrangle("data/buenos-aires-real-estate-1.csv")
print("df shape:", df.shape)
df.head()

# Explore

## Histogram of "surface_covered_in_m2"

In [None]:
plt.hist(df["surface_covered_in_m2"])
plt.xlabel("Area [sq meters]")
plt.title("Distribution of Apartment Sizes");

## Summary statistics for df

In [None]:
df.describe()

## Scatter plot of price ("price_aprox_usd") vs. area ("surface_covered_in_m2")

In [None]:
plt.scatter(df["surface_covered_in_m2"],df["price_aprox_usd"])
plt.xlabel("Area [sq meters]")
plt.ylabel("Price [USD]");

# Split

In [None]:
features = ["surface_covered_in_m2"]
X_train = df[features]
target = "price_aprox_usd"
y_train = df[target]
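
The feature matrix is selected with a list of column names while the target is selected with a single name. A small sketch on a hypothetical toy frame shows why that matters: the list yields a two-dimensional DataFrame, which scikit-learn estimators expect for features, and the single label yields a one-dimensional Series.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame(
    {
        "surface_covered_in_m2": [40.0, 65.0, 90.0],
        "price_aprox_usd": [95_000.0, 140_000.0, 210_000.0],
    }
)

X_toy = toy[["surface_covered_in_m2"]]  # list of columns -> 2-D DataFrame
y_toy = toy["price_aprox_usd"]          # single label -> 1-D Series

print(X_toy.shape)  # (3, 1)
print(y_toy.shape)  # (3,)
```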

# Build Model

## Baseline

In [None]:
y_mean = y_train.mean()
y_pred_baseline = [y_mean]*len(y_train)

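Plot the relationship between the observations in X_train and the naive baseline model's predictions y_pred_baseline. Because the baseline always predicts the mean price, it appears as a horizontal line.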

In [None]:
plt.plot(X_train,y_pred_baseline,color="magenta",label="baseline model")
plt.scatter(X_train, y_train)
plt.xlabel("Area [sq meters]")
plt.ylabel("Price [USD]")
plt.title("Buenos Aires: Price vs. Area")
plt.legend();

Calculate the mean absolute error for the baseline predictions in y_pred_baseline.

In [None]:
mae_baseline = mean_absolute_error(y_train,y_pred_baseline)

print("Mean apt price", round(y_mean, 2))
print("Baseline MAE:", round(mae_baseline, 2))
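
To make the baseline metric concrete, here is a hand-checkable sketch in plain Python. The three prices are invented for illustration. It computes the same quantity as mean_absolute_error: the average of the absolute differences between each true price and the predicted (mean) price.

```python
# Made-up prices, chosen so the arithmetic is easy to verify by hand.
prices = [100_000.0, 150_000.0, 200_000.0]

mean_price = sum(prices) / len(prices)
baseline_pred = [mean_price] * len(prices)  # same prediction for every row

# Absolute errors are 50,000, 0 and 50,000.
abs_errors = [abs(y - p) for y, p in zip(prices, baseline_pred)]
mae = sum(abs_errors) / len(abs_errors)

print("Mean price:", mean_price)       # 150000.0
print("Baseline MAE:", round(mae, 2))  # 33333.33
```

Any model worth keeping should reach a lower MAE than this constant prediction.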

# Iterate

In [None]:
model = LinearRegression()
model.fit(X_train,y_train)
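
The notebook imports check_is_fitted but never calls it. One possible use, sketched here on a hypothetical toy model, is to confirm that an estimator has been fitted before predicting with it. The toy areas and prices are invented.

```python
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression
from sklearn.utils.validation import check_is_fitted

# An estimator that has not seen data yet: check_is_fitted raises.
unfitted = LinearRegression()
try:
    check_is_fitted(unfitted)
    was_fitted = True
except NotFittedError:
    was_fitted = False
print("Unfitted model passes the check:", was_fitted)  # False

# After fit(), the estimator has learned attributes such as coef_,
# so the same check returns without raising.
toy_model = LinearRegression().fit(
    [[30.0], [50.0], [70.0]], [90_000.0, 130_000.0, 170_000.0]
)
check_is_fitted(toy_model)
print("Fitted model passes the check:", hasattr(toy_model, "coef_"))  # True
```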

# Evaluate

In [None]:
y_pred_training = model.predict(X_train)
mae_training = mean_absolute_error(y_train,y_pred_training)
print("Training MAE:", round(mae_training, 2))

# Communicate Results

In [None]:
intercept = model.intercept_
print("Model Intercept:", intercept)
coefficient = model.coef_[0]
print('Model coefficient for "surface_covered_in_m2":', coefficient)
print(f"apartment_price = {intercept:.2f} + {coefficient:.2f} * surface_covered")
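
The printed equation can be applied by hand to estimate a price. The sketch below, fitted on invented and perfectly linear toy data, shows that plugging an area into intercept + coefficient * area gives the same value as calling predict. The names toy_model and new_area are hypothetical.

```python
from sklearn.linear_model import LinearRegression

# Made-up data that follows price = 30,000 + 2,000 * area exactly.
areas = [[30.0], [50.0], [70.0]]
prices = [90_000.0, 130_000.0, 170_000.0]

toy_model = LinearRegression().fit(areas, prices)
intercept = toy_model.intercept_
coefficient = toy_model.coef_[0]

# Apply the equation manually for a 60 sq meter apartment...
new_area = 60.0
manual = intercept + coefficient * new_area

# ...and compare with the model's own prediction for the same input.
from_model = toy_model.predict([[new_area]])[0]

print("Manual estimate:", round(manual, 2))
print("model.predict:  ", round(from_model, 2))
```

Both values agree up to floating-point rounding, which is why the coefficient can be read as the estimated price change per additional square meter of covered surface.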

Plot the relationship between the observations in X_train and the model's predictions y_pred_training.


In [None]:
plt.plot(X_train,y_pred_training,color="magenta",label="model_prediction")
plt.scatter(X_train, y_train)
plt.xlabel("surface covered [sq meters]")
plt.ylabel("price [usd]")
plt.legend();