# Visualizations and pre-processing
Here we'll visualize the data to explore it further and decide which pre-processing steps are most reasonable.

> **Note**: The graphs render very slowly, so the interactive ones are saved as HTML files. To view them, open the HTML files in your browser after running the notebook. The static ones are shown in the notebook.

In [12]:
import pandas as pd
from pathlib import Path
df_transactions = pd.read_csv(Path("data") / "transactions_cleaned.csv", index_col='id')

df_transactions.info()

ValueError: Index id invalid

## Basic Histograms

Let's plot histograms for every column and see if there are any outliers and if the data is balanced.

In [None]:
import numpy as np
import plotly.express as px

# split up dataframes by data type (should be all if all types converted correctly above)
df_num = df_transactions.select_dtypes(include=np.number)

# first let's plot all the numerical columns using a distribution plot
for col in df_num:
    # fig = ff.create_distplot([df_transactions[col]], group_labels=[col], bin_size = 0.5)
    # fig.write_html(str(Path("figures") / f"{col}_distplot.html"))
    fig = px.histogram(df_transactions[col], nbins=100, marginal='box', title=f"{col} distribution")
    fig.write_html(str(Path("figures") / f"{col}_histplot.html"))
    fig.show(renderer='svg')


### Outliers
Together with the `df.describe` output from `exploration`, we've seen that some numerical columns have a very high max value but a much lower 75% quantile. This suggests outliers, which the distribution plots confirm. Here are the columns with outliers:
- price (and totalPrice) have outliers towards higher values (none extreme).
- ladderRatio has one very high outlier.
- square has outliers towards higher values, with 2 larger ones.
- constructionTime has some outliers towards lower values.
- tradeTime has some outliers towards lower values.
- followers has some outliers towards higher values.
- DOM has some outliers towards higher values.
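
As a complement to reading the plots, the "high max but much lower 75% quantile" observation can be turned into a simple rule. The sketch below flags values outside the usual 1.5 × IQR fences; the helper name `iqr_outlier_mask` and the toy values are made up for illustration and are not part of the notebook.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical toy column standing in for something like `ladderRatio`
toy = pd.Series([0.2, 0.25, 0.3, 0.33, 0.5, 0.5, 1.0, 1000.0])
mask = iqr_outlier_mask(toy)
print(toy[mask])  # only the single extreme value is flagged
```

On the real data this could be applied per numerical column, e.g. `iqr_outlier_mask(df_num[col]).sum()`, to count the flagged rows.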

### (Im)Balanced columns
We have a lot of categorical data. This can be problematic if one category is far more prevalent than the others, as it can bias a model towards the prevalent category. Here are the columns with a very unbalanced distribution:
- renovationCondition doesn't have many data points in 1 (corresponding to 2: rough condition)
- buildingStructure doesn't have many data points in 0 (1: unknown), 2 (3: concrete), 4 (5: steel)
- kitchen really only has values for 1 kitchen <-- can be thrown out
- All the room columns have some categories with very few data points, but none as extreme (concentrated in one category) as kitchen.
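
One way to put a number on "unbalanced" is to look at the relative frequency of each category. The sketch below is a minimal example on made-up data; the helper `rare_categories` and the 1% threshold are assumptions for illustration, not values used in the notebook.

```python
import pandas as pd


def rare_categories(series: pd.Series, min_share: float = 0.01) -> pd.Series:
    """Return the categories whose relative frequency is below `min_share`."""
    shares = series.value_counts(normalize=True)
    return shares[shares < min_share]


# Hypothetical toy column imitating `kitchen`: almost every row has 1 kitchen
toy_kitchen = pd.Series([1] * 995 + [0] * 3 + [2] * 2)
rare = rare_categories(toy_kitchen)
print(rare)
```

A column where nearly all rows fall into one category and the rest are rare carries very little information, which matches the case made for dropping kitchen.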

## Conclusions to deal with outliers and (im)balanced data:
1. Delete the few extreme rows for ladderRatio and square
2. If applying a scaler, we probably want to use a robust scaler to give the outliers in the other columns less influence
3. Throw out the kitchen column
4. For the remaining categorical data, let's do one-hot encoding for the ones that have more than 2 categories.
      - `floorPosition`
      - `renovationCondition`
      - `buildingStructure`
      - `district`
Note: we don't do this with cId, as it has way too many categories. We might drop this column later anyway.
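
To illustrate what one-hot encoding with a dropped first category does, here is a minimal sketch on a toy frame. The string labels (`low`, `middle`, `high`) are made up for readability; the real `floorPosition` column may be coded differently.

```python
import pandas as pd

# Hypothetical toy data; the category labels are illustrative only
toy = pd.DataFrame({
    "floorPosition": ["low", "middle", "high", "low"],
    "elevator": [1, 0, 1, 1],
})

# One dummy column per category, minus the first (alphabetically "high"),
# which becomes the implicit baseline and avoids perfectly collinear columns
encoded = pd.get_dummies(toy, columns=["floorPosition"], drop_first=True)
print(encoded.columns.tolist())
```

A column with n categories therefore adds n - 1 columns, and a row with all dummies set to 0 belongs to the dropped baseline category.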

In [22]:
print(" current shape", df_transactions.shape)
# Filter out the extreme outliers for square and ladderRatio by cutting off the top 0.1% of the data
df_transactions = df_transactions[df_transactions['square'] < df_transactions['square'].quantile(0.999)]
df_transactions = df_transactions[df_transactions['ladderRatio'] < df_transactions['ladderRatio'].quantile(0.999)]

# drop kitchen column
try:
    df_transactions.drop(columns=['kitchen'], inplace=True)
except KeyError:
    pass
print("shape after removing outliers and kitchen", df_transactions.shape)

# One-hot encode the categorical data
multi_cat_cols = ['floorPosition', 'renovationCondition', 'buildingStructure', 'district']
# get_dummies adds extra columns for each category (n_cat), and we drop the first one to avoid collinearity
try:
    df_transactions = pd.get_dummies(data=df_transactions, columns=multi_cat_cols, drop_first=True)
except KeyError:
    pass
print(f"shape after encoding: {df_transactions.shape}")

 current shape (298561, 43)
shape after removing outliers and kitchen (297838, 43)
shape after encoding: (297838, 43)


We only lost ~700 rows to the extreme outliers, so that's fine.

## Correlation Matrix
From the correlation matrix below we can see some positive correlation with price in:
- tradeTime
- DOM
- communityAverage
- totalPrice
- renovationCondition
- followers
- subway
- Some districts

and some strongly negative correlation with price in:
- Lng
- constructionTime
- square
- drawingRoom

In [None]:
# Directly copied from: https://www.geeksforgeeks.org/display-the-pandas-dataframe-in-heatmap-style/
df_transactions.corr().style.background_gradient(cmap='viridis') \
    .set_properties(**{'font-size': '20px'})
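
Reading individual correlations off a large styled matrix is tedious. A possible shortcut is to take only the `price` column of the correlation matrix and sort it. The sketch below uses synthetic data, and the column names and coefficients are made up, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: `price` falls with `square`, `noise` is unrelated
rng = np.random.default_rng(0)
n = 500
square = rng.uniform(30, 200, n)
toy = pd.DataFrame({
    "square": square,
    "noise": rng.normal(size=n),
    "price": 100_000 - 300 * square + rng.normal(scale=1_000, size=n),
})

# Correlation of every column with price, most negative first
corr_with_price = toy.corr()["price"].drop("price").sort_values()
print(corr_with_price)
```

On the notebook's data the same one-liner would be `df_transactions.corr()["price"].drop("price").sort_values()`.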


## Heatmap of price
We saw a strong negative correlation with Lng and some correlations with different districts. Given that the housing market is all about location, I suspect that the price depends on the precise location.

In [None]:
import folium
import branca.colormap as cmp

# Let's plot the data on a map with price as heatmap color
df_heatmap = df_transactions[['Lng', 'Lat', 'price']]

# Getting color gradient
linear = cmp.LinearColormap(
    ['green', 'red'],
    vmin=min(df_heatmap['price']), vmax=max(df_heatmap['price']),
    caption='price of property'
)
# Create a map
m = folium.Map(location=[40, 116], zoom_start=10)
for _, row in df_heatmap.iterrows():
    col_grad = linear(row['price'])
    folium.CircleMarker(
        location=(row['Lat'], row['Lng']),
        radius=1,
        color=col_grad,
        fill=False,
        fill_color=col_grad,
        fill_opacity=0.3
    ).add_to(m)
linear.add_to(m)

# Can't render the map in notebook statically, so only save.
m.save(str(Path("figures") / "price_map.html"))

## City center more expensive
It looks like the city center (of Beijing!) is more expensive than the outskirts. The dependence looks roughly circular around the center. Let's keep this in mind for the model later.
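
One way a circular dependence could be captured is with a single "distance to the center" feature. The sketch below computes the haversine distance from each property to an assumed center point. The center coordinates (roughly Tiananmen), the helper name and the `distToCenter` column are assumptions for illustration; none of them come from the dataset or the notebook.

```python
import numpy as np
import pandas as pd

# Assumed city-center coordinates (roughly Tiananmen); not taken from the dataset
CENTER_LAT, CENTER_LNG = 39.9042, 116.4074
EARTH_RADIUS_KM = 6371.0


def distance_to_center_km(lat, lng):
    """Haversine great-circle distance in km from each (lat, lng) to the assumed center."""
    lat1, lng1 = np.radians(lat), np.radians(lng)
    lat2, lng2 = np.radians(CENTER_LAT), np.radians(CENTER_LNG)
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Toy rows: one at the assumed center, one 0.1 degrees of latitude further north
toy = pd.DataFrame({"Lat": [39.9042, 40.0042], "Lng": [116.4074, 116.4074]})
toy["distToCenter"] = distance_to_center_km(toy["Lat"], toy["Lng"])
print(toy)
```

A linear model cannot express "far from the center in any direction" using Lng and Lat separately, whereas a distance feature makes that relation one-dimensional.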

## Time dependency
One of the features most strongly correlated with price is time. Let's see what this dependency looks like.

In [None]:
import plotly.graph_objects as go

df_transactions.sort_values(by='tradeTime', inplace=True)
price_rolling_avg = df_transactions['price'].rolling(window=2000, min_periods=50).mean()
datetimes = pd.to_datetime(df_transactions['tradeTime'])

# Create figure and add lines
fig = go.Figure(data=go.Scatter(x=datetimes, y=df_transactions['price'], name='transaction', mode='markers', marker=dict(size=2, opacity=0.2)))
fig.add_trace(go.Scatter(x=datetimes, y=price_rolling_avg, name='rolling avg (2000)', mode="lines",
                         line=dict(color='red', width=5)))
fig.update_layout(
    xaxis_title="time",
    yaxis_title="price",
    title="price over time"
)
fig.write_html(str(Path("figures") / "time_v_price.html"))
fig.show(renderer="svg")


## Price over time
The price seems to increase over time, with a dip at the end. Considering we're going to split the data based on time, this might be right around where we'll split the data into training and test sets. Since we're using the datetime data to split between training and testing, and it's not a simple linear increase over time, we probably don't want to use the time data for this model. Let's also drop totalPrice, since that's the same as price * square and it wasn't originally in the dataset.

## Split data into training and test sets

In [None]:
# drop totalPrice
try:
    df_transactions.drop(columns=['totalPrice'], inplace=True)
except KeyError:
    # the column was already dropped on a previous run of this cell
    pass

# Let's add the datetimes column back for splitting the data
df_transactions['tradeTime'] = datetimes

# now let's split the data into training and test sets by making everything before the start of 2017 training data:
df_train = df_transactions[df_transactions['tradeTime'] < pd.to_datetime('2017-01-01')]
df_test = df_transactions[df_transactions['tradeTime'] >= pd.to_datetime('2017-01-01')]
print(f"training shape: {df_train.shape}")
print(f"test shape: {df_test.shape}")
print(f"training set is: {df_train.shape[0] / (df_train.shape[0] + df_test.shape[0]) * 100}% of the data ")




## Scaling
Now that we've removed the extreme outliers, we can scale the data. There are 3 main scaling methods: Standard, Robust and MinMax scaler. The Robust scaler will probably work best, as it is less sensitive to outliers. Let's see how the data looks after scaling with all methods to confirm.
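
To make the "less sensitive to outliers" point concrete, here is a small sketch on a toy column with one extreme value. The numbers are made up; only the three scikit-learn scaler classes match what the notebook uses.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Toy column: four ordinary values and one extreme outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

results = {}
for scaler in (StandardScaler(), RobustScaler(), MinMaxScaler()):
    results[type(scaler).__name__] = scaler.fit_transform(x).ravel()
    print(type(scaler).__name__, np.round(results[type(scaler).__name__], 3))
```

The robust scaler centers on the median and divides by the interquartile range, so the four ordinary values keep a usable spread. With min-max scaling the outlier sets the range, and the ordinary values are squeezed close to 0.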

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler

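# drop the tradeTime column from both sets, as we can't scale it and decided not to use it.
# (the original loop only ever dropped it from df_train, leaving it in df_test)
df_train = df_train.drop(columns=['tradeTime'], errors='ignore')
df_test = df_test.drop(columns=['tradeTime'], errors='ignore')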

# Do all the scaling, only with the training data
std_scaler = StandardScaler()
array_num_standard_scaled = std_scaler.fit_transform(df_train)
df_standard_scaled = pd.DataFrame(array_num_standard_scaled, columns=df_train.columns)
rbst_scaler = RobustScaler()
array_num_robust_scaled = rbst_scaler.fit_transform(df_train)
df_robust_scaled = pd.DataFrame(array_num_robust_scaled, columns=df_train.columns)
minmax_scaler = MinMaxScaler()
array_num_minmax_scaled = minmax_scaler.fit_transform(df_train)
df_minmax_scaled = pd.DataFrame(array_num_minmax_scaled, columns=df_train.columns)

# Compare the scaling methods
scaling = ("no", "standard", "robust", "minmax")
top_features = {}
for i, df in enumerate((df_train, df_standard_scaled, df_robust_scaled, df_minmax_scaled)):
    print(f"Scaling: {scaling[i]}")
    Y = df['price']
    X = df.drop(columns=['price'])
    pca = PCA(n_components=3).fit(X)
    X_pca = pca.transform(X)
    print(f"Explained variance: {pca.explained_variance_ratio_}")
    print(f"Total explained variance: {np.sum(pca.explained_variance_ratio_)}")
    # sum the absolute PCA loadings per feature as a rough importance score
    # (this is not the variance explained by the feature itself)
    dict_features_explained = {col: 0.0 for col in X.columns}
    for j, col in enumerate(X.columns):
        for pca_component in pca.components_:
            dict_features_explained[col] += abs(pca_component[j])
    dict_features_explained = sorted(sorted(dict_features_explained.items()), key=lambda x: x[1], reverse=True)
    top_15_dict = dict_features_explained[:15]
    # note: the counter starts at 0, so a count of n means n + 1 appearances in a top 15
    for key, value in top_15_dict:
        if key in top_features:
            top_features[key] += 1
        else:
            top_features[key] = 0

    print(f"top features are:{top_15_dict} \n")


print(f"top features with combined variance explained: {[key for key, value in top_features.items() if value >= 2.0]}")


# Directly copied from: https://www.geeksforgeeks.org/display-the-pandas-dataframe-in-heatmap-style/
df_robust_scaled.corr().style.background_gradient(cmap='viridis') \
    .set_properties(**{'font-size': '20px'})

## Best Scaling
The robust scaling method seems to explain the most variance (aside from no scaling, but the #1 feature there is cId, and that's only because its values are large numbers). However, the minmax scaler's top features are more intuitive. Let's save both and potentially use both in the model.

In [None]:
# Rename the scaled dataframes
df_train_robust = df_robust_scaled
df_train_minmax = df_minmax_scaled

# Scale the test data with the training data scaling
test_robust_scaled = rbst_scaler.transform(df_test)
df_test_robust = pd.DataFrame(test_robust_scaled, columns=df_test.columns)
test_minmax_scaled = minmax_scaler.transform(df_test)
df_test_minmax = pd.DataFrame(test_minmax_scaled, columns=df_test.columns)

# Save all the dataframes
df_train.to_csv(str(Path("data") / "train.csv"), index=False)
df_test.to_csv(str(Path("data") / "test.csv"), index=False)
df_train_robust.to_csv(str(Path("data") / "train_robust.csv"), index=False)
df_train_minmax.to_csv(str(Path("data") / "train_minmax.csv"), index=False)
df_test_robust.to_csv(str(Path("data") / "test_robust.csv"), index=False)
df_test_minmax.to_csv(str(Path("data") / "test_minmax.csv"), index=False)

## Conclusions and observations

In essence, what we have is a regression problem: we want to predict the price of a house based on its features. Some considerations and lessons learned:
1. We have a lot of features, but the PCA showed that reducing them to 3 components still explains a lot of the variance.
2. We've seen that there's some dependence on the location using the coordinates (Lng, Lat). A kernel PCA might be useful for that.
3. In addition to 1: we've only removed 1 column, and many more columns could be removed. We could return here and quickly check which columns correlate the least with the price and account for the least variance. But the best approach will be to do PCA regardless.
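
Instead of fixing the number of components at 3, one option is to pick the smallest number that reaches a chosen share of the variance. The sketch below does this on synthetic data built from two hidden factors. The data, the 95% threshold and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

# Synthetic data: two hidden factors spread over eight observed columns, plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + 0.01 * rng.normal(size=(300, 8))

# Fit a full PCA on the scaled data and accumulate the explained variance ratios
pca = PCA().fit(RobustScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"components needed for 95% of the variance: {n_keep}")
```

scikit-learn also accepts a float such as `PCA(n_components=0.95)`, which selects the number of components by the same criterion.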


## Supervised Models to consider
1. Multiple linear regression
2. The above with (kernel) PCA <-- didn't get the kernel PCA to work because of memory issues.
3. The above with Elastic Net, lasso and/or ridge <-- somewhat redundant with PCA
4. More variations of linear regression
5. Random forest (or just decision trees)
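
As a rough idea of how these candidates could be compared, the sketch below fits three of them on synthetic data with an ordered split (first 80% for training, last 20% for testing), mimicking the time-based split. The data, hyperparameters and model names in the dictionary are placeholders, not results or settings from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the housing data: 4 features, linear target plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=500)

# Ordered split instead of a shuffled one, as with data split by time
X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]

models = {
    "linear": make_pipeline(RobustScaler(), LinearRegression()),
    "ridge": make_pipeline(RobustScaler(), Ridge(alpha=1.0)),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, in line with how the scalers are fitted in the notebook.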