# DISCLAIMER: WE STRONGLY RECOMMEND NOT RUNNING THE CODE SINCE SOME BLOCKS TAKE HOURS TO COMPLETE.


# Instructions

In this file we provide instructions on how to navigate through this folder and the correct order to inspect the code in it. 

Because a single notebook would be very long, the code is split across several notebooks. Below is a brief explanation of what is done in each of them.

- [1_Get_data.ipynb](_1_Get_data.ipynb)<br>
    Downloading all the datasets that will be used: one main Airbnb dataset and three additional datasets (for trees, rats, and tourist places) with some extra information.<br>

- [2_Clean_data.ipynb](_2_Clean_data.ipynb)<br>
    Prepping the Airbnb dataset for analysis. This includes removing outliers, discarding unnecessary data (columns in the dataframe) and encoding categorical features. The output is a clean dataset stored in [data_air/AB_data_clean.csv](data_air/AB_data_clean.csv).<br>

- [3_Clean_data_2.ipynb](_3_Clean_data_2.ipynb)<br>
    Similar to the notebook above, this one preps the additional datasets for analysis. Only the features related to the location are kept. The output is three clean datasets, stored in [data_trees/trees_data_clean.csv](data_trees/trees_data_clean.csv), [data_rats/trees_data_clean.csv](data_rats/trees_data_clean.csv) and [data_places/trees_data_clean.csv](data_places/trees_data_clean.csv).<br>

- [4_Merge_data.ipynb](_4_Merge_data.ipynb)<br>
    For each Airbnb listing in the main dataset, the number of trees, rats and tourist places within 500, 1000 and 2500 m is counted. As a result, three datasets with the same length as the main one are generated. They are stored in [data_trees/trees_distances_simple.csv](data_trees/trees_distances_simple.csv), [data_rats/rats_distances_simple.csv](data_rats/rats_distances_simple.csv) and [data_places/places_distances_simple.csv](data_places/places_distances_simple.csv), and are merged into the main dataset afterwards, generating an output dataset stored in [joined_data.csv](joined_data.csv).<br>

- [5_Data_exploration.ipynb](_5_Data_exploration.ipynb)<br>
    We dive into our clean, now enhanced Airbnb dataset ([joined_data.csv](joined_data.csv)) and plot the listings' rental price against other variables in the dataset to look for relationships between them.<br>

- [6_ML.ipynb](_6_ML.ipynb)<br>
    Several machine learning models are built from the cleaned original dataset ([data_air/AB_data_clean.csv](data_air/AB_data_clean.csv)). To do this, a baseline test and some data prepping are done first; in some cases this also involves a search for the optimal features to feed to the models.<br>

- [7_ML_on_joined.ipynb](_7_ML_on_joined.ipynb)<br>
    The same process as in [_6_ML.ipynb](_6_ML.ipynb) is followed, this time on the enhanced clean dataset with the additional features ([joined_data.csv](joined_data.csv)).
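
The counting step described for `4_Merge_data.ipynb` can be illustrated with a minimal, dependency-free sketch. It is not the notebook's actual code: the function names and coordinates below are hypothetical, and the haversine (great-circle) distance is an assumption about how distance could be measured.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def count_within(listing, points, radii_m=(500, 1000, 2500)):
    """Count how many points lie within each radius (in metres) of one listing."""
    distances = [haversine_m(listing[0], listing[1], lat, lon) for lat, lon in points]
    return {r: sum(d <= r for d in distances) for r in radii_m}


# Hypothetical coordinates, for illustration only.
listing = (40.7580, -73.9855)
trees = [
    (40.7590, -73.9845),
    (40.7650, -73.9800),
    (40.7700, -73.9700),
    (40.6500, -73.9500),
]
print(count_within(listing, trees))
```

A brute-force loop like this compares every listing with every point, so its cost grows with the product of both dataset sizes. That may be one reason a step of this kind is slow on the full data; a spatial index would be one possible way to speed it up.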


# Explainer: Report

## 1. Motivation
#### What is your dataset?

Our main dataset stores information about Airbnb activity in New York in 2019. It initially contains approximately 50,000 Airbnb listings, with plenty of information about each of them, such as the host name, the location, the room type and the number of reviews.

The additional information is contained in three datasets:
- [Tree Census in New York City](https://www.kaggle.com/nycparks/tree-census)<br>
    This dataset stores a record for every publicly owned tree in New York City and includes each tree's location by borough and latitude/longitude, species, size, health, and more. Three censuses, from 2015, 2005, and 1995, were conducted by NYC Parks and Recreation staff, TreesCount! program staff, and hundreds of volunteers. We chose the one from 2015, with almost 700,000 entries, since it is the closest in time to the records of our main Airbnb dataset.<br>

- [NYC Rat Sightings](https://www.kaggle.com/new-york-city/nyc-rat-sightings)<br>
    This dataset contains information about rat sightings in New York. The data runs from 2010 to September 16th, 2017 and includes date, location (lat/lon), type of structure, borough, and community board. We filtered the dataset and kept only the reports from 2017, narrowing it down to approximately 15,000 rat sightings.<br>
    
- [348 New York Tourist Locations](https://www.kaggle.com/anirudhmunnangi/348-new-york-tourist-locations)<br>
    This dataset gathers information about 348 tourist places in New York. It only has the name, address and zip code of each place, so these have to be translated into latitude and longitude coordinates before we can perform our analysis.<br><br>

#### What is the idea?
We want to be able to estimate the price of Airbnb apartments in New York, based on information about the location we will stay in and the type of house/room we want to rent.
#### Why is it interesting?
Different factors influence how Airbnb clients choose between places to stay. One of the most influential is surely the rating of the place, but what if there are external factors that also contribute to this decision? In this project, we therefore try to find any existing relationship between the main dataset and the three external datasets listed above.
#### Why did you choose this/these particular dataset(s)?
We chose the ‘New York City Airbnb Open Data’ because the Airbnb phenomenon is interesting due to its size and availability, and at first glance a little confusing in terms of pricing.
Having first found the dataset for Airbnbs in New York City, we set about finding other datasets from the same region that were fun and might plausibly influence the pricing of a condo. Rats, trees and tourist places certainly seemed to fit that description, so we kept three datasets that gather data about those.
#### What was your goal for the end user's experience?
We wanted to provide the end user with the tools to get a good suggested price for their Airbnb hosting. More than suggesting a price, however, we wanted to help the reader understand what influences the pricing of their condo, so that they can see why our suggested price is not the definitive truth and that they themselves may have “features” that can influence the price of their condo.

## 2. Basic stats. Let's understand the dataset better
#### Write about your choices in data cleaning and preprocessing.
Essentially, the process for cleaning and preprocessing started with a plot of the data that we wanted to keep and that would likely contain outliers. It was all about choosing a threshold and printing the number of entries that would be discarded until the number and the distribution seemed reasonable. Encoding was also applied to some of the features in the original dataset such as the borough and the type of room.
After this process, we started preparing the data to make it suitable for the machine learning algorithms. For instance, we calculated the distance from a given condo in the dataset to any given tree in New York, and we one-hot encoded variables to reduce the chance of errors.
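
The threshold inspection and the one-hot encoding described above can be sketched as follows. This is a dependency-free illustration rather than the notebooks' actual code: the helper names, the column names and the listings are hypothetical, and only the 2500 price threshold comes from this report.

```python
def discarded_counts(values, thresholds):
    """For each candidate upper threshold, count how many values would be dropped."""
    return {t: sum(v > t for v in values) for t in thresholds}


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = int(row[column] == cat)
        encoded.append(new_row)
    return encoded


# Hypothetical listings, for illustration only.
listings = [
    {"price": 80, "room_type": "Private room"},
    {"price": 150, "room_type": "Entire home/apt"},
    {"price": 3000, "room_type": "Entire home/apt"},
    {"price": 60, "room_type": "Shared room"},
]

# Inspect how many entries each candidate threshold would discard.
prices = [row["price"] for row in listings]
print(discarded_counts(prices, thresholds=[500, 2500]))

# Keep the listings at or below the chosen threshold, then encode the room type.
kept = [row for row in listings if row["price"] <= 2500]
print(one_hot(kept, "room_type")[0])
```

With pandas, the encoding step could be written with `pandas.get_dummies` instead of the hand-rolled helper.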
#### Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.
Although we generated plenty of figures and plots, not all of them were kept in the notebooks, and only a few made it to the website. For example, [_5_Data_exploration.ipynb](_5_Data_exploration.ipynb) contains further exploration of how the price is influenced, comparing it with the minimum staying period and the number of listings of the host, but these features turned out not to contribute strongly to the price of a listing. The main takeaway from the exploratory data analysis was that some of the features could be seen to have an influence on the price, but that in general the features did not show any pronounced trends.

 
## 3. Data Analysis
#### Describe your data analysis and explain what you've learned about the dataset.
In general terms, we found that the most influential feature for the price of a condo is its location, as longitude and latitude were the two most important features, but other aspects, such as the number of rats in the vicinity of a condo, also lower its price.
Something we learned from the simpler histograms is the very uneven distribution of rats and tourist places in New York: Manhattan has several times as many tourist attractions as the other boroughs, while Brooklyn has massive issues with rats, something that is apparently a big issue in the [local newspapers](https://www.brooklynpaper.com/brooklyn-is-the-citys-most-rat-infested-borough-report/).
#### If relevant, talk about your machine-learning.
While our machine learning did improve our ability to predict prices, the predictions were still far from the actual values. A stricter cleaning of the dataset, such as removing all prices above \$500 instead of the \$2500 threshold we kept, might have helped with this estimation.
The exploratory process of creating the machine learning section was nevertheless an interesting way to get to know the dataset and to see which features were deemed important and which did not explain much about it. For instance, it was quite interesting that the neighborhood and borough groupings became less relevant when the latitude and longitude features were kept.
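
One way to quantify "improvement" is to compare a model's error with that of a trivial baseline. The sketch below is only an illustration under stated assumptions: this report does not say which error metric or baseline was used, so the mean absolute error, the median-price baseline and all the numbers are hypothetical.

```python
from statistics import median


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted prices."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def baseline_mae(y_train, y_test):
    """Error of a constant model that always predicts the training median."""
    guess = median(y_train)
    return mean_absolute_error(y_test, [guess] * len(y_test))


# Hypothetical prices, for illustration only.
train_prices = [60, 80, 100, 150, 400]
test_prices = [70, 120, 300]
model_predictions = [90, 110, 180]  # stand-in for the output of any fitted model

print("baseline MAE:", baseline_mae(train_prices, test_prices))
print("model MAE:   ", mean_absolute_error(test_prices, model_predictions))
```

A model is only worth keeping if its error is clearly below the baseline's, and the gap between the two gives a rough sense of how much the features actually explain.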

 
## 4. Genre. Which genre of data story did you use?
We made a slideshow using the Martini glass structure. The first slides are annotated graphs and maps showing interesting relationships between the price and other features, and the final slide is an interactive slide where the reader can try different configurations and see how they affect the expected price of an Airbnb condo, based on a random forest prediction.
By having a slideshow-based introduction, we are able to guide the reader's experience and explicitly give the reader background information that will be useful for getting the most out of the final slide.
The final slide includes both sliders and a map to allow the reader to interact and experiment with the visualization.
#### Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer). Why?
We used a consistent visual platform: each page of the website shows one visualisation around the same visual reference point, which also provides a form of object continuity. We also colored our features consistently. Together, these keep the reader familiar with the setup of each page.
#### Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer). Why?
Our narrative structure is built around a linear ordering, which makes it easier to teach the reader how to use the high degree of interactivity on the website. We feature hover highlighting on the visuals, navigation buttons, explicit instructions in the text and tacit tutorials through the ordering of our increasingly advanced visuals. We also use stimulating default views on our maps, by choosing what is displayed on them by default.
 
 
## 5. Visualizations.
#### Explain the visualizations you've chosen.
We have chosen spreadsheets (if they count), maps, line graphs, bar charts, a line chart with a fitted spline curve going through it, and an array of multiple charts with fitted lines going through them.
#### Why are they right for the story you want to tell?
By choosing these types of visualisations and presenting them in this order, we gradually increase the complexity of the visuals without losing the reader, since each new visual builds on concepts that were shown before.
 
## 6. Discussion. Think critically about your creation
#### What went well?
Although a big part of the time was spent on sharpening our dataset and adding the external information to it, we eventually managed to find some relationships between certain features and the price of a condo. The way we set up the website also worked well: the Martini glass structure gradually introduces more advanced concepts to the reader before letting them loose on predicting condo prices.

#### What is still missing? What could be improved? Why?
Our main dataset, along with all the extra features that we incorporated, turned out to be a very interesting dataset with plenty of information about each Airbnb listing. It would have been great to have more time to explore in greater depth the influence that these features have on the final rental price of a condo.
The weak point of this project is mostly the price estimator built at the end of the webpage. It is able to provide reasonable prices for the input location it is given, but the sliders (which essentially add more inputs to it) did not seem to refine its output.
Further work can be done on this project: it would be very interesting to generate an estimator that actually uses the external features that were added to the dataset. It could also have been relevant to present the plot we made of price vs number of reviews on the website, to introduce this relationship that a reader will encounter when using the sliders on page 9.
As a minor point, the plots on pages 6 and 7 of the website could have been made interactive, or at least made with plotly for consistency. We chose to simply import the images, since the data processed for those graphs is large and they would take some time to generate.

 
## 7. Contributions. Who did what?
* Katharina: Website creation, transformation of matplotlib plots to plotly, and maps
* Rolf: Machine learning and text for the website
* Ana: Data prepping/processing and plotting.<br>
All members spent countless days working on the project during endless calls on Teams, reporting, documenting the Explainer folder and reviewing each other’s work on an almost daily basis. The initial steps (choosing the datasets, initial exploration) were carried out together, and all decisions were made jointly.

