# Train Dataset EDA
I'm using this notebook for basic exploratory data analysis on the training set, to see whether any trends stand out among the features.

In [0]:
%sql
select * from workspace.rental_predictions.train

In [0]:
%sql
with total_listings as (
  select count(*) as total_count from workspace.rental_predictions.train
)

select 
  t.District, 
  count(*) as NumListings, 
  c.total_count, 
  round(100*numlistings/c.total_count, 2) as pct
from workspace.rental_predictions.train as t
cross join total_listings as c
group by t.District, c.total_count
order by t.District

/*
Immediately we see that Manhattan listings comprise ~98% of the total training set. Given this level of skew, I would not feel confident training a model on all the districts listed, so I will build a model that serves Manhattan records only. This could be alleviated by collecting more listings for the other boroughs.
*/
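
The same share calculation can be reproduced in pandas once the table is in a DataFrame. This is a minimal sketch, not part of the original analysis: the `listings` frame holds made-up counts chosen only to mimic the skew, and `MIN_SHARE` is a hypothetical cutoff for deciding which districts have enough data to model.

```python
import pandas as pd

# Hypothetical stand-in for the training table; the counts are made up
listings = pd.DataFrame({
    "District": ["Manhattan"] * 98 + ["Brooklyn"] + ["Queens"],
})

# Percentage of listings per district, mirroring the SQL `pct` column
district_pct = (
    listings["District"].value_counts(normalize=True).mul(100).round(2)
)

# Assumed cutoff: districts below this share are treated as too sparse to model
MIN_SHARE = 5.0
modelable_districts = district_pct[district_pct >= MIN_SHARE].index.tolist()

print(district_pct)
print(modelable_districts)
```

With a cutoff like this, the choice to model Manhattan only becomes an explicit, adjustable rule rather than a one-off judgment.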

In [0]:
%sql
with manhattan_count as (
  select count(*) as total_count from workspace.rental_predictions.train
  where District = 'Manhattan'
)

select 
  t.Neighborhood,
  count(*) as NumListings,
  m.total_count,
  round(100*numlistings/m.total_count, 2) as pct
from workspace.rental_predictions.train as t
cross join manhattan_count as m
where t.District = 'Manhattan'
group by t.Neighborhood, m.total_count
order by NumListings desc

/*
The Neighborhood I am most concerned about is Marble Hill, which has only one listing. A single record is unlikely to be sufficient for a model to learn pricing in this area. It is safer to exclude it; in a production model we could collect more listings for this area or merge it with an adjacent neighborhood.
*/
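
Both options mentioned above (excluding a sparse neighborhood or merging it with a neighbor) can be sketched in pandas. The sample data, the `MIN_LISTINGS` threshold, and the `MERGE_INTO` mapping are all assumptions made for illustration; in particular, which neighborhood Marble Hill should be folded into is a modeling decision, not something the data dictates.

```python
import pandas as pd

# Hypothetical sample; the real counts come from the neighborhood query
listings = pd.DataFrame({
    "Neighborhood": ["Harlem"] * 5 + ["Inwood"] * 3 + ["Marble Hill"],
})

# Assumed threshold for "too few listings to learn from"
MIN_LISTINGS = 2
counts = listings["Neighborhood"].value_counts()
sparse = counts[counts < MIN_LISTINGS].index.tolist()

# Option A: exclude the sparse neighborhoods entirely
dropped = listings[~listings["Neighborhood"].isin(sparse)]

# Option B: fold them into a neighbor (this mapping is an assumption)
MERGE_INTO = {"Marble Hill": "Inwood"}
merged = listings.assign(
    Neighborhood=listings["Neighborhood"].replace(MERGE_INTO)
)

print(sparse)
print(merged["Neighborhood"].value_counts())
```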

In [0]:
%sql
select 
  Accommodates,
  count(*) as NumListings
from workspace.rental_predictions.train
where District = 'Manhattan' and Neighborhood != 'Marble Hill'
group by Accommodates
order by Accommodates 

/*
Another interesting finding is that Accommodates has outliers as well. Only 6 listings accommodate more than 10 guests, with a single listing able to accommodate 16. The model will likely have a hard time learning from so few examples, so these listings should be removed.
*/
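
A minimal sketch of removing the high-capacity listings, assuming the cutoff of 10 applies to the `Accommodates` column. The rows below are invented for illustration, and `MAX_GUESTS` is a hypothetical name for that cutoff.

```python
import pandas as pd

# Made-up rows standing in for the Manhattan training data
listings = pd.DataFrame({
    "Accommodates": [1, 2, 2, 4, 6, 10, 12, 16],
    "Price": [80, 120, 110, 200, 260, 400, 650, 900],
})

# Assumed cutoff taken from the observation above
MAX_GUESTS = 10

# Mark, then drop, listings that exceed the cutoff
is_outlier = listings["Accommodates"] > MAX_GUESTS
trimmed = listings.loc[~is_outlier].reset_index(drop=True)

print(f"Removed {is_outlier.sum()} of {len(listings)} listings")
```

Keeping the mask in its own variable makes it easy to report how many records were dropped before training.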

Databricks visualization. Run in Databricks to view.

In [0]:
%sql
select 
  Bedrooms,
  count(*) as NumListings
from workspace.rental_predictions.train
where District = 'Manhattan' and Neighborhood != 'Marble Hill'
group by Bedrooms
order by Bedrooms

/*
This query reveals an odd floating-point value for Bedrooms. I do not know the ground truth for how this data was created, but I assume it is the result of a faulty calculation that affected a small number of listings. Since these values are only slightly above 1, I assume these records can be safely coerced to the integer 1. Of course, in a production setting it would be imperative to understand the sourcing of the data and how the floating-point value was generated before "cleaning" it for modeling purposes.
*/ 
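
One cautious way to apply this coercion is to snap only values that sit very close to a whole number and flag everything else for review. The values below are made up (the real fractional value is not reproduced here), and the tolerance `TOL` is an assumption that would need to be chosen after inspecting the data.

```python
import pandas as pd

# Made-up Bedrooms values; 1.0119 stands in for the odd fractional value
bedrooms = pd.Series([0.0, 1.0, 1.0119, 2.0, 1.0119, 2.4], name="Bedrooms")

# Assumed tolerance for "slightly off" a whole number
TOL = 0.05

nearest = bedrooms.round()
deviation = (bedrooms - nearest).abs()

is_fractional = deviation > 0
safe_to_coerce = is_fractional & (deviation <= TOL)

# Snap near-integers; leave anything further away untouched for manual review
cleaned = bedrooms.where(~safe_to_coerce, nearest)
needs_review = is_fractional & ~safe_to_coerce

print(cleaned.tolist())
print(f"{needs_review.sum()} value(s) need a closer look")
```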

In [0]:
%sql
select 
  RoomType,
  count(*) as NumListings
from workspace.rental_predictions.train
where District = 'Manhattan' and Neighborhood != 'Marble Hill'
group by RoomType
order by RoomType

/*
The distribution of RoomType has some skew.
*/

In [0]:
%sql
select 
  Bathrooms,
  count(*) as NumListings
from workspace.rental_predictions.train
where District = 'Manhattan' and Neighborhood != 'Marble Hill'
group by Bathrooms
order by Bathrooms

/*
Here we see fractional values for Bathrooms. While the concept of a "half-bath" is not unusual, the long floating-point values do not fit that explanation. I assume these values result from a faulty calculation and should be rounded to the nearest integer, but again, in a production setting the ground truth needs to be established before handling such cases.
*/
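
One caveat if the rounding is done in pandas: `Series.round()` rounds exact halves to the nearest even number, so genuine half-baths such as 0.5 and 2.5 round down while 1.5 rounds up. The sketch below shows this on made-up values, next to an alternative that snaps to the nearest 0.5 instead. The alternative is my own illustration rather than the approach stated above, and which one is right depends on the ground truth for this column.

```python
import pandas as pd

# Made-up Bathrooms values: real half-baths plus one noisy value
bathrooms = pd.Series([0.5, 1.0, 1.5, 2.5, 1.0119], name="Bathrooms")

# Nearest integer; exact halves go to the even neighbor
rounded_int = bathrooms.round()

# Alternative: nearest half, which keeps 0.5 / 1.5 / 2.5 intact
rounded_half = (bathrooms * 2).round() / 2

print(rounded_int.tolist())   # [0.0, 1.0, 2.0, 2.0, 1.0]
print(rounded_half.tolist())  # [0.5, 1.0, 1.5, 2.5, 1.0]
```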

In [0]:
%sql
select 
  ReviewRating,
  count(*) as NumListings
from workspace.rental_predictions.train
where District = 'Manhattan' and Neighborhood != 'Marble Hill'
group by ReviewRating
order by ReviewRating

Databricks visualization. Run in Databricks to view.

In [0]:
%sql
select 
  Price,
  count(*) as NumListings
from workspace.rental_predictions.train
where District = 'Manhattan' and Neighborhood != 'Marble Hill'
group by Price
order by Price

Databricks visualization. Run in Databricks to view.

In [0]:
%sql
select 
  PropertyType,
  count(*) as NumListings
from workspace.rental_predictions.train
where District = 'Manhattan' and Neighborhood != 'Marble Hill'
group by PropertyType
order by NumListings desc

/*
Several PropertyType values here are represented by only a single listing.
*/
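
One possible way to handle singleton categories, should they turn out to be a problem for the model, is to lump them into a shared bucket. This is only a sketch: the property types and counts are invented, and both the `MIN_LISTINGS` threshold and the `"Other"` label are assumptions.

```python
import pandas as pd

# Made-up PropertyType values with two singleton categories
listings = pd.DataFrame({
    "PropertyType": ["Apartment"] * 6 + ["House"] * 3 + ["Boat", "Castle"],
})

# Assumed threshold below which a category is considered rare
MIN_LISTINGS = 2
counts = listings["PropertyType"].value_counts()
rare = counts[counts < MIN_LISTINGS].index

# Keep common categories as-is and replace rare ones with a shared label
is_rare = listings["PropertyType"].isin(rare)
listings["PropertyTypeGrouped"] = listings["PropertyType"].where(~is_rare, "Other")

print(listings["PropertyTypeGrouped"].value_counts())
```

Writing the result to a new column keeps the original values available for later inspection.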

# Evaluate Cross-Relationships
Here I will look at the relationships between variables and see if I can discern anything in particular.

In [0]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px

In [0]:
train = spark.sql("""
select *
from workspace.rental_predictions.train
where District = 'Manhattan' and Neighborhood != 'Marble Hill'
"""
).toPandas()

display(train)

Databricks visualization. Run in Databricks to view.


### Box-Whiskers Plots
I will use box-and-whisker plots to understand how Price varies across the categorical variables.

In [0]:
def plot_box_whiskers(field: str) -> None:
    """Draw a box-and-whisker plot of Price for each category of `field`."""
    # Order the categories by median price, highest first
    median_prices = train.groupby(field)['Price'].median().sort_values(ascending=False)

    fig, ax = plt.subplots(figsize=(12, 7), dpi=200)

    # One series of prices per category, in the order computed above.
    # Plotting from this list avoids mutating the shared `train` frame.
    data = [train.loc[train[field] == category, 'Price']
            for category in median_prices.index]

    ax.boxplot(data)
    ax.set_xticklabels(median_prices.index, rotation=90)

    ax.set_title(f'Manhattan Listings: {field} vs. Price - Box/Whiskers')
    ax.set_xlabel(field)
    ax.set_ylabel('Price')
    fig.tight_layout()

In [0]:
fields_to_evaluate = [
    "Neighborhood",
    "PropertyType",
    "CancellationPolicy",
    "RoomType"
]
for field in fields_to_evaluate:
    plot_box_whiskers(field)

In [0]:
fig = px.scatter(
    train,
    x="Longitude",
    y="Latitude",
    color="Price",
    color_continuous_scale='Greens',
    opacity=0.4,
    size_max=15,
    title='Manhattan Listings: Price Color Scale',
    labels={'Latitude': 'Latitude', 'Longitude': 'Longitude', 'Price': 'Price'}
)
fig.update_traces(marker=dict(size=5))
fig.show()
"""The general trend here is that higher prices tend to cluster in Lower Manhattan and Midtown (though not uniformly so) and gradually fade as we move up the Upper East and Upper West Sides into Harlem. Above Central Park (the rectangular void in the middle of the plot) prices tend to stay lower, especially at the northern end of the island in Washington Heights and Inwood."""

In [0]:
fig = px.scatter(
    train,
    x="Longitude",
    y="Latitude",
    color="Neighborhood",
    # color_continuous_scale='Greens',
    opacity=0.4,
    size_max=15,
    title='Manhattan Listings: Neighborhood Color Scale',
    labels={'Latitude': 'Latitude', 'Longitude': 'Longitude', 'Price': 'Price'}
)
fig.update_traces(marker=dict(size=5))
fig.show()

In [0]:
fig = px.scatter(
    train,
    x="Longitude",
    y="Latitude",
    color="RoomType",
    opacity=0.4,
    size_max=15,
    title='Manhattan Listings: RoomType Color Scale',
    labels={'Latitude': 'Latitude', 'Longitude': 'Longitude', 'Price': 'Price'}
)
fig.update_traces(marker=dict(size=5))
fig.show()

In [0]:
fig = px.scatter(
    train,
    x="Longitude",
    y="Latitude",
    color="ReviewRating",
    color_continuous_scale='Greens',
    opacity=0.4,
    size_max=15,
    title='Manhattan Listings: ReviewRating Color Scale',
    labels={'Latitude': 'Latitude', 'Longitude': 'Longitude', 'Price': 'Price'}
)
fig.update_traces(marker=dict(size=5))
fig.show()