<font size="+3"><strong>Predicting Apartment Prices in Mexico City</strong></font>

## Import

In [None]:
# Import libraries here
from glob import glob
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline

# Prepare Data

**1:** Write a `wrangle` function that takes the name of a CSV file as input and returns a DataFrame. The function should do the following steps:

1. Subset the data in the CSV file and return only apartments in Mexico City (`"Distrito Federal"`) that cost less than \$100,000.
2. Remove outliers by trimming the bottom and top 10\% of properties in terms of `"surface_covered_in_m2"`.
3. Create separate `"lat"` and `"lon"` columns.
4. Mexico City is divided into [16 boroughs](https://en.wikipedia.org/wiki/Boroughs_of_Mexico_City). Create a `"borough"` feature from the `"place_with_parent_names"` column.
5. Drop columns that are more than 50\% null values.
6. Drop columns containing low- or high-cardinality categorical values. 
7. Drop any columns that would constitute leakage for the target `"price_aprox_usd"`.
8. Drop any columns that would create issues of multicollinearity.

In [None]:
# Build your `wrangle` function
def wrangle(path):
    df = pd.read_csv(path)
    
    # Subset data: Apartments in "Distrito Federal", less than $100,000
    mask_df = df["place_with_parent_names"].str.contains("Distrito Federal")
    mask_apt = df["property_type"] == "apartment"
    mask_price = df["price_aprox_usd"] < 100_000
    df = df[mask_df & mask_apt & mask_price]

    # Subset data: Remove outliers for "surface_covered_in_m2"
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    # Copy so the column assignments below do not act on a view of the original
    df = df[mask_area].copy()

    # Split "lat-lon" into separate numeric "lat" and "lon" columns
    df[['lat', 'lon']] = df['lat-lon'].str.split(',', expand=True).astype('float')
    df.drop('lat-lon', axis=1, inplace=True)

    # Extract the borough, the second "|"-separated field of the place string
    df['borough'] = df['place_with_parent_names'].str.split('|', expand=True)[1]
    df.drop(columns="place_with_parent_names", inplace=True)

    # Drop columns where more than half of the values are null
    nulls_count_mask = df.count() < len(df) / 2
    nulls_count_cols = nulls_count_mask[nulls_count_mask].index
    df.drop(nulls_count_cols, axis=1, inplace=True)

    # Drop low- and high-cardinality categorical columns
    n_unique = df.select_dtypes(include="object").nunique()
    cardinality_mask = (n_unique < 0.005 * len(df)) | (n_unique > 0.95 * len(df))
    low_high_cardinality_cols = cardinality_mask[cardinality_mask].index
    df.drop(low_high_cardinality_cols, axis=1, inplace=True)

    # Drop columns derived from the price, which would leak the target
    columns_dep_price = ['price', 'price_aprox_local_currency', 'price_per_m2']
    df.drop(columns_dep_price, axis=1, inplace=True)
    
    return df
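
Step 8 asks for multicollinear columns to be dropped. One way to look for them, sketched here on made-up numbers rather than the real dataset, is to inspect pairwise correlations among the numeric features. The `toy` DataFrame, its values, and the 0.9 threshold are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: "surface_total_in_m2" closely tracks
# "surface_covered_in_m2", while "lat" does not.
toy = pd.DataFrame(
    {
        "surface_covered_in_m2": [50, 60, 70, 80, 90],
        "surface_total_in_m2": [55, 66, 75, 88, 97],
        "lat": [19.40, 19.35, 19.42, 19.33, 19.38],
    }
)

# Absolute pairwise Pearson correlations between features
corr = toy.corr().abs()

# Flag each pair of features whose correlation exceeds an assumed threshold
threshold = 0.9
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(high_pairs)
```

When a pair is flagged, keeping only one of the two columns is usually enough. Which one to keep is a judgment call that depends on missing values and on which feature is easier to interpret.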

**2:** Use glob to create the list `files`. It should contain the filenames of all the Mexico City real estate CSVs in the `./data` directory, except for `mexico-city-test-features.csv`.

In [None]:
files = glob("./data/mexico-city-real-estate-[0-9].csv")
files
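
To see why a `[0-9]` pattern leaves out the test-features file, the sketch below applies the same shell-style wildcard rules through the standard-library `fnmatch` function. The filenames in `names` are a hypothetical listing, and the real contents of `./data` may differ.

```python
from fnmatch import fnmatch

# Hypothetical directory listing; the real contents of ./data may differ
names = [
    "mexico-city-real-estate-1.csv",
    "mexico-city-real-estate-5.csv",
    "mexico-city-test-features.csv",
]

# "[0-9]" matches exactly one digit, so only the numbered files qualify
pattern = "mexico-city-real-estate-[0-9].csv"
matched = [name for name in names if fnmatch(name, pattern)]
print(matched)
```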

**3:** Combine your `wrangle` function, a list comprehension, and `pd.concat` to create a DataFrame `df`. It should contain all the properties from the five CSVs in `files`.

In [None]:
df = pd.concat([wrangle(file) for file in files])
print(df.info())
df.head()

## Explore

**4:** Create a histogram showing the distribution of apartment prices (`"price_aprox_usd"`) in `df`. Be sure to label the x-axis `"Price [$]"`, the y-axis `"Count"`, and give it the title `"Distribution of Apartment Prices"`. Use Matplotlib (`plt`).

What does the distribution of price look like? Is the data normal, a little skewed, or very skewed?

In [None]:
# Build histogram
plt.hist(df['price_aprox_usd'])


# Label axes
plt.xlabel('Price [$]')
plt.ylabel('Count')

# Add title
plt.title('Distribution of Apartment Prices')
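
The skewness question can also be answered with a number, using pandas' `Series.skew()`. The sketch below uses two hypothetical price samples, not values from `df`. A result near zero suggests a roughly symmetric distribution, and a positive result suggests a long right tail.

```python
import pandas as pd

# Hypothetical prices: most values are low, one is high (right-skewed)
right_skewed = pd.Series([30_000, 35_000, 40_000, 45_000, 95_000])
# Hypothetical prices spread evenly around the center (symmetric)
symmetric = pd.Series([30_000, 40_000, 50_000, 60_000, 70_000])

print("Right-skewed sample:", right_skewed.skew())
print("Symmetric sample:", symmetric.skew())
```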

**5:** Create a scatter plot that shows apartment price (`"price_aprox_usd"`) as a function of apartment size (`"surface_covered_in_m2"`). Be sure to label your x-axis `"Area [sq meters]"` and y-axis `"Price [USD]"`. Your plot should have the title `"Mexico City: Price vs. Area"`. Use Matplotlib (`plt`).

In [None]:
# Build scatter plot
plt.scatter(x=df['surface_covered_in_m2'], y=df['price_aprox_usd'])


# Label axes
plt.xlabel('Area [sq meters]')
plt.ylabel('Price [USD]')

# Add title
plt.title('Mexico City: Price vs. Area')

Do you see a relationship between price and area in the data? How is this similar to or different from the Buenos Aires dataset?

**6:** Create a Mapbox scatter plot that shows the location of the apartments in your dataset and represent their price using color.

What areas of the city seem to have higher real estate prices?

In [None]:
# Plot Mapbox location and price
fig = px.scatter_mapbox(
    df,  # Our DataFrame
    lat="lat",
    lon="lon",
    width=800,  # Width of map
    height=800,  # Height of map
    color="price_aprox_usd",
    hover_data=["price_aprox_usd"],  # Display price when hovering mouse over apartment
)

fig.update_layout(mapbox_style="open-street-map")

fig.show()

## Split

**7:** Create your feature matrix `X_train` and target vector `y_train`. Your target is `"price_aprox_usd"`. Your features should be all the columns that remain in the DataFrame you cleaned above.

In [None]:
# Split data into feature matrix `X_train` and target vector `y_train`.

X_train = df.drop('price_aprox_usd', axis=1)
y_train = df['price_aprox_usd']

# Build Model

## Baseline

**8:** Calculate the baseline mean absolute error for your model.

In [None]:
y_mean = y_train.mean()
y_pred_baseline = [y_mean] * len(y_train)
baseline_mae = mean_absolute_error(y_train, y_pred_baseline)
print("Mean apt price:", y_mean)
print("Baseline MAE:", baseline_mae)
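
As a worked example of what the baseline MAE measures, the sketch below computes it by hand in plain Python. The four values in `y_toy` are hypothetical and chosen so the arithmetic is easy to follow.

```python
# Hypothetical target values in USD
y_toy = [40_000, 50_000, 60_000, 90_000]

# The baseline predicts the mean for every observation
toy_mean = sum(y_toy) / len(y_toy)

# MAE is the average absolute gap between the true values and that mean
toy_mae = sum(abs(y - toy_mean) for y in y_toy) / len(y_toy)

print("Mean:", toy_mean)
print("Baseline MAE:", toy_mae)
```

Here the mean is 60,000 and the absolute errors are 20,000, 10,000, 0, and 30,000, so the baseline MAE is 15,000. A trained model is only useful if its MAE comes in below this kind of figure.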

## Iterate

**9:** Create a pipeline named `model` that contains all the transformers necessary for this dataset and one of the predictors you've used during this project. Then fit your model to the training data.

In [None]:
df.info()

In [None]:
# Build Model
model = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    Ridge(),
)
# Fit model
model.fit(X_train, y_train)

## Evaluate

**10:** Read the CSV file `mexico-city-test-features.csv` into the DataFrame `X_test`.

<div class="alert alert-block alert-info">
<b>Tip:</b> Make sure the <code>X_train</code> you used to train your model has the same column order as <code>X_test</code>. Otherwise, it may hurt your model's performance.
</div>
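
One way to follow the tip is to reorder the test columns by the training columns. The sketch below is a minimal illustration, assuming both frames contain the same set of column names. `X_train_toy` and `X_test_toy` are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical frames with the same columns in a different order
X_train_toy = pd.DataFrame(
    {"surface_covered_in_m2": [60], "lat": [19.4], "lon": [-99.1]}
)
X_test_toy = pd.DataFrame(
    {"lon": [-99.2], "surface_covered_in_m2": [75], "lat": [19.3]}
)

# Select the test columns in the training order
X_test_toy = X_test_toy[X_train_toy.columns]

print(list(X_test_toy.columns))
```

If a training column is missing from the test frame, this selection raises a `KeyError`, which doubles as a quick check that the two frames agree.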

In [None]:
X_test = pd.read_csv('data/mexico-city-test-features.csv')
print(X_test.info())
X_test.head()

**11:** Use your model to generate a Series of predictions for `X_test`. When you submit your predictions to the grader, it will calculate the mean absolute error for your model.

In [None]:
y_test_pred = pd.Series(model.predict(X_test))
y_test_pred.head()

# Communicate Results

**12:** Create a Series named `feat_imp`. The index should contain the names of all the features your model considers when making predictions; the values should be the coefficient values associated with each feature. The Series should be sorted ascending by absolute value.

In [None]:
coefficients = model.named_steps['ridge'].coef_
features     = model.named_steps['onehotencoder'].get_feature_names()
# Sort ascending by absolute value, as the task requires
feat_imp     = pd.Series(index=features, data=coefficients).sort_values(key=abs)
feat_imp
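
Task 12 asks for the Series to be sorted ascending by absolute value. A minimal sketch of that sort, using hypothetical coefficients rather than the fitted model's, passes `key=abs` to `sort_values` (available in pandas 1.1 and later):

```python
import pandas as pd

# Hypothetical coefficients, not taken from the fitted model
toy_imp = pd.Series({"feature_a": -500.0, "feature_b": 120.0, "feature_c": 2_000.0})

# Sort ascending by magnitude, ignoring sign
toy_sorted = toy_imp.sort_values(key=abs)
print(toy_sorted)
```

Sorting by magnitude keeps large negative coefficients next to large positive ones, so the last rows of the Series are the most influential features in either direction.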

**13:** Create a horizontal bar chart that shows the **10 most influential** coefficients for your model. Be sure to label your x- and y-axis `"Importance [USD]"` and `"Feature"`, respectively, and give your chart the title `"Feature Importances for Apartment Price"`. Use pandas.

In [None]:
# Build bar chart
# Keep only the 10 coefficients with the largest absolute value
feat_imp.sort_values(key=abs).tail(10).plot(kind='barh')


# Label axes
plt.xlabel('Importance [USD]')
plt.ylabel('Feature')

# Add title
plt.title('Feature Importances for Apartment Price')