# Train Dataset EDA
I'm using this notebook for basic exploratory data analysis of the training set, to see whether any trends among the features stand out.

In [0]:
%sql
select * from workspace.rental_predictions.train

In [0]:
%sql
with total_listings as (
  select count(*) as total_count from workspace.rental_predictions.train
)

select 
  t.District, 
  count(*) as NumListings, 
  c.total_count, 
  round(100 * count(*) / c.total_count, 2) as pct
from workspace.rental_predictions.train as t
cross join total_listings as c
group by t.District, c.total_count
order by t.District

/*
Immediately we see that Manhattan listings make up ~98% of the total training set.  With the data this skewed toward a single district, I would not feel confident training a model on all the districts listed, so I will build a model that serves Manhattan records only.  This could be revisited if more listings were collected for the other boroughs.
*/

In [0]:
%sql
with manhattan_count as (
  select count(*) as total_count from workspace.rental_predictions.train
  where District = 'Manhattan'
)

select 
  t.Neighborhood,
  count(*) as NumListings,
  m.total_count,
  round(100 * count(*) / m.total_count, 2) as pct
from workspace.rental_predictions.train as t
cross join manhattan_count as m
where t.District = 'Manhattan'
group by t.Neighborhood, m.total_count
order by NumListings desc

/*
The neighborhood I am most concerned about is Marble Hill, which has only one listing.  A single record will likely not be enough for a model to learn pricing in this area.  It is safer to exclude it; for a production model we could collect more listings for this area or merge it with an adjacent neighborhood.
*/
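
In [0]:
%sql
/*
A rough sketch of the merge option for Marble Hill, not something applied in the rest of this notebook.  It assumes Marble Hill would be folded into Inwood, the Manhattan neighborhood it borders.  'Inwood' is an assumed label that should be checked against the actual Neighborhood values in the table, and NeighborhoodGroup is a hypothetical column name used only for illustration.
*/
select
  -- Relabel the single Marble Hill listing; every other neighborhood is unchanged
  case
    when Neighborhood = 'Marble Hill' then 'Inwood'  -- assumed target label
    else Neighborhood
  end as NeighborhoodGroup,
  count(*) as NumListings
from workspace.rental_predictions.train
where District = 'Manhattan'
group by NeighborhoodGroup
order by NumListings desc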

In [0]:
%sql
select 
  Accommodates,
  count(*) as NumListings
from workspace.rental_predictions.train
where District = 'Manhattan' and Neighborhood != 'Marble Hill'
group by Accommodates
order by Accommodates 

/*
Another interesting find is that there are also outliers in the Accommodates column.  Only 6 listings accommodate more than 10 guests, with a single listing able to accommodate 16.  The model will likely have a hard time learning from so few examples, so these listings should be removed.
*/
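
In [0]:
%sql
/*
A minimal sketch of how the exclusions noted in this notebook could be applied together, assuming the training subset keeps only Manhattan listings outside Marble Hill that accommodate at most 10 guests.  The cutoff of 10 follows the Accommodates observation; keep_row, KeptListings, TotalListings and PctKept are illustrative names, not columns in the table.  Rows where any of the three columns is null are counted as dropped, since the comparison does not evaluate to true.
*/
with flagged as (
  select
    (District = 'Manhattan'
      and Neighborhood != 'Marble Hill'
      and Accommodates <= 10) as keep_row
  from workspace.rental_predictions.train
)

select
  count_if(keep_row) as KeptListings,   -- rows that pass all three filters
  count(*) as TotalListings,            -- all rows in the training table
  round(100 * count_if(keep_row) / count(*), 2) as PctKept
from flagged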
