Predicting Price with Size

In [None]:
import warnings

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

warnings.simplefilter(action="ignore", category=FutureWarning)

Task 2.1.1: Write a function named wrangle that takes a file path as an argument and returns a DataFrame.

In [None]:
def wrangle(filepath):
    # Read CSV file into DataFrame
    df = pd.read_csv(filepath)

    # Subset to apartments in "Capital Federal" that cost less than 400,000 USD
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    mask_apt = df["property_type"] == "apartment"
    mask_price = df["price_aprox_usd"] < 400_000
    df = df[mask_ba & mask_apt & mask_price]

    # Remove outliers: keep rows between the 0.1 and 0.9 quantiles of covered area
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]

    return df

Task 2.1.2: Use your wrangle function to create a DataFrame df from the CSV file data/buenos-aires-real-estate-1.csv.
Task 2.1.3: Add to your wrangle function so that the DataFrame it returns only includes apartments in Buenos Aires ("Capital Federal") that cost less than $400,000 USD. Then recreate df from data/buenos-aires-real-estate-1.csv by re-running the cells above.

In [None]:
df = wrangle("data/buenos-aires-real-estate-1.csv")
print("df shape:", df.shape)
df.head()
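
To see how the boolean masks in wrangle combine, here is a minimal sketch on a made-up three-row DataFrame. The rows and values are invented for illustration; only the column names are taken from the dataset used above.

```python
import pandas as pd

# Toy data: values are invented, column names match the real dataset
toy = pd.DataFrame(
    {
        "place_with_parent_names": [
            "|Argentina|Capital Federal|Palermo|",
            "|Argentina|Bs.As. G.B.A. Zona Norte|Tigre|",
            "|Argentina|Capital Federal|Recoleta|",
        ],
        "property_type": ["apartment", "apartment", "house"],
        "price_aprox_usd": [150_000.0, 120_000.0, 300_000.0],
    }
)

# Each mask is a boolean Series with one True/False value per row
mask_ba = toy["place_with_parent_names"].str.contains("Capital Federal")
mask_apt = toy["property_type"] == "apartment"
mask_price = toy["price_aprox_usd"] < 400_000

# "&" keeps only the rows where all three masks are True
toy_subset = toy[mask_ba & mask_apt & mask_price]
print(toy_subset.shape)  # (1, 3)
```

Only the first row survives: the second is outside Capital Federal and the third is a house.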

Task 2.1.4: Create a histogram of "surface_covered_in_m2". Make sure that the x-axis has the label "Area [sq meters]" and the plot has the title "Distribution of Apartment Sizes".

In [None]:
plt.hist(df["surface_covered_in_m2"])
plt.xlabel("Area [sq meters]")
plt.title("Distribution of Apartment Sizes");

Task 2.1.5: Calculate the summary statistics for df using the describe method.

In [None]:
df.describe()["surface_covered_in_m2"]

Task 2.1.6: Add to your wrangle function so that it removes observations that are outliers in the "surface_covered_in_m2" column. Specifically, all observations should fall between the 0.1 and 0.9 quantiles for "surface_covered_in_m2".

In [None]:
low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
mask_area = df["surface_covered_in_m2"].between(low, high)
mask_area.head()
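
The quantile-based trimming can be checked on a small Series. This is a sketch with invented areas and hypothetical toy_ variable names. It assumes the pandas defaults: linear interpolation in quantile and inclusive bounds in between.

```python
import pandas as pd

# Ten invented apartment areas in square meters
toy_areas = pd.Series(
    [10, 20, 30, 40, 50, 60, 70, 80, 90, 100], name="surface_covered_in_m2"
)

# Cut-offs at the 0.1 and 0.9 quantiles
toy_low, toy_high = toy_areas.quantile([0.1, 0.9])

# True for rows whose area lies between the two cut-offs
toy_mask = toy_areas.between(toy_low, toy_high)
toy_trimmed = toy_areas[toy_mask]

print(toy_trimmed.tolist())  # [20, 30, 40, 50, 60, 70, 80, 90]
```

The smallest and largest values fall outside the cut-offs and are dropped, which is the same effect the mask has on the real data.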

Task 2.1.7: Create a scatter plot that shows price ("price_aprox_usd") vs area ("surface_covered_in_m2") in our dataset. Make sure to label your x-axis "Area [sq meters]" and your y-axis "Price [USD]".

In [None]:
plt.scatter(x=df["surface_covered_in_m2"], y=df["price_aprox_usd"])


# Label axes
plt.xlabel("Area [sq meters]")
plt.ylabel("Price [USD]")

# Add title
plt.title("Capital Federal: Price vs. Area");

# Pearson correlation between area and price
CF = {"Capital Federal": df["surface_covered_in_m2"].corr(df["price_aprox_usd"])}
CF

Task 2.1.8: Create the feature matrix named X_train, which you'll use to train your model. It should contain one feature only: ["surface_covered_in_m2"]. Remember that your feature matrix should always be two-dimensional.

In [None]:
features = ["surface_covered_in_m2"]
X_train = df[features]
X_train.head()

Task 2.1.9: Create the target vector named y_train, which you'll use to train your model. Your target should be "price_aprox_usd". Remember that, in most cases, your target vector should be one-dimensional.

In [None]:
target = "price_aprox_usd"
y_train = df[target]
y_train.head()
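
Tasks 2.1.8 and 2.1.9 depend on the difference between selecting with a list of column names and selecting with a single name. A minimal sketch on invented data (the toy_ names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "surface_covered_in_m2": [35.0, 50.0, 70.0],
        "price_aprox_usd": [90_000.0, 120_000.0, 160_000.0],
    }
)

# A list of column names returns a two-dimensional DataFrame
toy_X = toy[["surface_covered_in_m2"]]

# A single column name returns a one-dimensional Series
toy_y = toy["price_aprox_usd"]

print(type(toy_X).__name__, toy_X.shape)  # DataFrame (3, 1)
print(type(toy_y).__name__, toy_y.shape)  # Series (3,)
```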

Task 2.1.10: Calculate the mean of your target vector y_train and assign it to the variable y_mean.

In [None]:
y_mean = y_train.mean()
y_mean

Task 2.1.11: Create a list named y_pred_baseline that contains the value of y_mean repeated so that it's the same length as y_train.

In [None]:
y_pred_baseline = [y_mean] * len(y_train)
y_pred_baseline[:5]

Task 2.1.12: Add a line to the plot below that shows the relationship between the observations X_train and our baseline model's predictions y_pred_baseline. Be sure that the line color is orange, and that it has the label "Baseline Model".

In [None]:
plt.plot(X_train.values, y_pred_baseline, color="orange", label="Baseline Model")

plt.scatter(X_train, y_train)
plt.xlabel("Area [sq meters]")
plt.ylabel("Price [USD]")
plt.title("Buenos Aires: Price vs. Area")
plt.legend();

Task 2.1.13: Calculate the baseline mean absolute error for your predictions in y_pred_baseline as compared to the true targets in y_train.

In [None]:
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)

print("Mean apt price:", round(y_mean, 2))
print("Baseline MAE:", round(mae_baseline, 2))
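
Mean absolute error is the average of the absolute differences between the true values and the predictions. The following sketch computes it by hand for a mean baseline, using four invented prices and hypothetical toy_ names:

```python
# Invented prices in USD, for illustration only
toy_prices = [100_000, 150_000, 200_000, 250_000]

# The baseline predicts the mean price for every observation
toy_mean = sum(toy_prices) / len(toy_prices)
toy_baseline = [toy_mean] * len(toy_prices)

# Absolute error per observation, then the average
toy_abs_errors = [abs(t - p) for t, p in zip(toy_prices, toy_baseline)]
toy_mae = sum(toy_abs_errors) / len(toy_abs_errors)

print(toy_mean, toy_mae)  # 175000.0 50000.0
```

Any model worth keeping should achieve a lower MAE than this baseline on the same data.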

Task 2.1.14: Instantiate a LinearRegression model named model.
Task 2.1.15: Fit your model to the data, X_train and y_train.

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)
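
LinearRegression fits an ordinary least squares model. With a single feature, the least-squares line has a closed-form solution: the slope is the sum of (x - mean_x) * (y - mean_y) divided by the sum of (x - mean_x) squared, and the intercept is mean_y - slope * mean_x. The sketch below applies those formulas in plain Python to invented points that lie exactly on a line. It illustrates the math only and is not a description of scikit-learn's internals; fit_line and the toy_ names are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Invented areas (sq meters) and prices (USD) on the line y = 10000 + 2000x
toy_areas = [30, 40, 50, 60]
toy_prices = [70_000, 90_000, 110_000, 130_000]

toy_intercept, toy_slope = fit_line(toy_areas, toy_prices)
print(toy_intercept, toy_slope)  # 10000.0 2000.0
```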

Task 2.1.16: Using your model's predict method, create an array of predictions for the observations in your feature matrix X_train. Name this array y_pred_training.

In [None]:
y_pred_training = model.predict(X_train)
y_pred_training[:5]

Task 2.1.17: Calculate your training mean absolute error for your predictions in y_pred_training as compared to the true targets in y_train.

In [None]:
mae_training = mean_absolute_error(y_train, y_pred_training)
print("Training MAE:", round(mae_training, 2))

Task 2.1.18: Run the code below to import your test data buenos-aires-test-features.csv into a DataFrame and generate a Series of predictions using your model.

In [None]:
X_test = pd.read_csv("data/buenos-aires-test-features.csv")[features]
y_pred_test = pd.Series(model.predict(X_test))
y_pred_test.head()

Task 2.1.19: Extract the intercept from your model, and assign it to the variable intercept.

In [None]:
intercept = round(model.intercept_, 2)
print("Model Intercept:", intercept)

Task 2.1.20: Extract the coefficient associated with "surface_covered_in_m2" in your model, and assign it to the variable coefficient.

In [None]:
coefficient = round(model.coef_[0], 2)
print('Model coefficient for "surface_covered_in_m2":', coefficient)


Task 2.1.21: Complete the code below and run the cell to print the equation that your model has determined for predicting apartment price based on size.

In [None]:
print(f"The equation of the model is: {intercept} + {coefficient}x")
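
The printed equation can be used directly to estimate a price. In this sketch the intercept and coefficient are hypothetical placeholders, not the values your model produces, and predict_price is a helper invented for illustration.

```python
def predict_price(area_m2, intercept, coefficient):
    """Apply the linear equation: price = intercept + coefficient * area."""
    return intercept + coefficient * area_m2


# Hypothetical values; substitute the intercept and coefficient from your model
example_intercept = 10_000.0
example_coefficient = 2_000.0

price_50 = predict_price(50, example_intercept, example_coefficient)
price_51 = predict_price(51, example_intercept, example_coefficient)

print(price_50)             # 110000.0
print(price_51 - price_50)  # 2000.0
```

The second print shows how to read the coefficient: each additional square meter changes the predicted price by that amount.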

Task 2.1.22: Add a line to the plot below that shows the relationship between the observations in X_train and your model's predictions y_pred_training. Be sure that the line color is red, and that it has the label "Linear Model".

In [None]:
plt.plot(X_train.values, y_pred_training, color="r", label="Linear Model")
plt.scatter(X_train, y_train)
plt.xlabel("Area [sq meters]")
plt.ylabel("Price [USD]")
plt.legend();