# Short Term Rentals - Exploratory Data Analysis

Now we're going to explore what we've imported. As in the previous notebook, let's import py2neo and pandas, plus matplotlib for plotting:

In [2]:
%matplotlib notebook

from py2neo import Graph
import pandas as pd

import matplotlib 
import matplotlib.pyplot as plt

In [3]:
graph = Graph("bolt://localhost", auth=("neo4j", "neo"))

Now we can run the following queries to inspect the database schema and check how many nodes our database contains:

In [17]:
query = """
CALL db.schema() 
"""

graph.run(query).data()

[{'nodes': [(_-11:Node {constraints: ['CONSTRAINT ON ( node:Node ) ASSERT node.id IS UNIQUE'], indexes: [], name: 'Node'})],
  'relationships': [(Node)-[:LINK {}]->(Node)]}]

In [22]:
query = """
MATCH () 
RETURN COUNT(*) AS nodeCount
"""

graph.run(query).to_data_frame()

Unnamed: 0,nodeCount
0,1978892


Let's drill down a bit. What types of nodes do we have?

In [15]:
result = {"label": [], "count": []}
for label in graph.run("CALL db.labels()").to_series():
    query = f"MATCH (:`{label}`) RETURN count(*) as count"
    count = graph.run(query).to_data_frame().iloc[0]['count']
    result["label"].append(label)
    result["count"].append(count)
nodes_df = pd.DataFrame(data=result)
nodes_df.sort_values("count")

Unnamed: 0,label,count
2,Amenity,127
1,Neighborhood,224
3,Host,40309
0,Listing,50914
4,User,877779
5,Review,1009539
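
The loop above issues one query per label. As a possible alternative, the same counts can be gathered in a single query by unwinding `labels(n)`. This is only a sketch: `count_nodes_by_label` is a hypothetical helper name, and it assumes `graph` is the py2neo `Graph` created earlier, whose `run(...).data()` returns a list of dictionaries as in the schema call above.

```python
# One query for all labels; a node with several labels is counted once per label.
LABEL_COUNT_QUERY = """
MATCH (n)
UNWIND labels(n) AS label
RETURN label, count(*) AS count
ORDER BY count
"""


def count_nodes_by_label(graph):
    """Return {label: count} using one query instead of one per label.

    `graph` is assumed to be a py2neo Graph (or anything with a compatible
    run(query).data() interface).
    """
    rows = graph.run(LABEL_COUNT_QUERY).data()
    return {row["label"]: row["count"] for row in rows}
```

The result can be turned into the same DataFrame shape with `pd.DataFrame(list(counts.items()), columns=["label", "count"])`. On a large graph the per-label queries may actually be faster, since this version scans every node, so treat it as a convenience rather than an optimization.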


We can visualize these counts using matplotlib with the following code:

In [16]:
plt.style.use('fivethirtyeight')
nodes_df.plot(kind='bar', x='label', y='count', legend=None, title="Node Cardinalities")
plt.yscale("log")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

[Figure: bar chart of node counts per label, log-scale y-axis]

And what types of relationships?

In [23]:
result = {"relType": [], "count": []}
for relationship_type in graph.run("CALL db.relationshipTypes()").to_series():
    query = f"MATCH ()-[:`{relationship_type}`]->() RETURN count(*) as count"
    count = graph.run(query).to_data_frame().iloc[0]['count']
    result["relType"].append(relationship_type)
    result["count"].append(count)
rels_df = pd.DataFrame(data=result)
rels_df.sort_values("count")

Unnamed: 0,relType,count
0,IN_NEIGHBORHOOD,50914
2,HOSTS,50914
1,HAS,981512
3,WROTE,1009539
4,REVIEWS,1009539
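
The counts already hint at the shape of the model: `WROTE` and `REVIEWS` both match the number of `Review` nodes, and `HOSTS` and `IN_NEIGHBORHOOD` both match the number of `Listing` nodes. The sketch below turns those observations into simple ratios, using only the numbers printed in the two tables above. The helper name `per_node` is hypothetical, and matching totals are consistent with, but do not prove, a one-per-node pattern.

```python
# Counts copied from the node and relationship tables above.
node_counts = {"Listing": 50914, "Review": 1009539, "Amenity": 127}
rel_counts = {
    "IN_NEIGHBORHOOD": 50914,
    "HOSTS": 50914,
    "HAS": 981512,
    "WROTE": 1009539,
    "REVIEWS": 1009539,
}


def per_node(rel_type, label):
    """Average number of `rel_type` relationships per `label` node."""
    return rel_counts[rel_type] / node_counts[label]


# IN_NEIGHBORHOOD and HOSTS each occur once per Listing node, on average.
print(per_node("IN_NEIGHBORHOOD", "Listing"))  # 1.0
print(per_node("HOSTS", "Listing"))            # 1.0
# WROTE and REVIEWS each occur once per Review node, on average.
print(per_node("WROTE", "Review"))             # 1.0
print(per_node("REVIEWS", "Review"))           # 1.0
# HAS links listings to amenities: roughly 19 amenities per listing.
print(round(per_node("HAS", "Listing"), 2))    # 19.28
```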


We can visualize these counts using matplotlib with the following code:

In [24]:
plt.style.use('fivethirtyeight')
rels_df.plot(kind='bar', x='relType', y='count', legend=None, title="Relationship Cardinalities")
plt.yscale("log")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

[Figure: bar chart of relationship counts per type, log-scale y-axis]

Now let's explore the neighborhood data:

In [26]:
exploratory_query = """
MATCH (n:Neighborhood)<-[:IN_NEIGHBORHOOD]-(l:Listing)-[:HAS]->(a:Amenity) 
RETURN n.name AS neighborhood, l.name AS name, collect(a.name) AS amenities, l.price AS price 
LIMIT 25
"""

graph.run(exploratory_query).to_data_frame()

Unnamed: 0,amenities,name,neighborhood,price
0,"[Hot water, Bed linens, Shampoo, Hangers, Carb...",Spacious room in Harlem sanctuary.,Harlem,45.0
1,"[TV, Cable TV, Internet, Wifi, Kitchen, Elevat...","Marilyn's Home Stay 1, Brooklyn, NY",Flatbush,70.0
2,"[Air conditioning, Kitchen, TV, Wifi, Hangers,...",Clean cozy room 10 min away from Manhattan,Long Island City,43.0
3,"[TV, Cable TV, Wifi, Air conditioning, Kitchen...",Stay in the Heart of Lincoln Square,Upper West Side,142.0
4,"[Wheelchair accessible, Air conditioning, Wifi...","5* Views, Terrace, 2BR2B, Modern Luxury, Gym, ...",Long Island City,379.0
5,"[Dryer, Smoke detector, Essentials, Hangers, A...",Nice Studio in safe area,Concourse,55.0
6,"[Indoor fireplace, Kitchen, Air conditioning, ...",Unique Designer 1BR in Best NYC Neighborhood,West Village,300.0
7,"[Internet, Laptop friendly workspace, Essentia...",Sunny Brooklyn 2BR w/ HUGE terrace!,Bushwick,125.0
8,"[translation missing: en.hosting_amenity_50, L...",Large 1BR in Heart of LES,Lower East Side,245.0
9,"[Iron, Hair dryer, Essentials, Fire extinguish...",1 Bedroom Apt in Chelsea,Chelsea,200.0
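
Row 8 contains an amenity named `translation missing: en.hosting_amenity_50`, which looks like a placeholder that leaked in from the source data rather than a real amenity. Before aggregating on amenities it may be worth filtering such values out. A minimal sketch, assuming every placeholder starts with `translation missing:` (only one example is visible here) and using a hypothetical helper `clean_amenities`:

```python
# Assumed prefix, inferred from the single placeholder visible in the output.
PLACEHOLDER_PREFIX = "translation missing:"


def clean_amenities(amenities):
    """Drop placeholder entries, keeping the remaining names in order."""
    return [a for a in amenities if not a.startswith(PLACEHOLDER_PREFIX)]


# Illustrative list, not an actual row from the database.
sample = ["translation missing: en.hosting_amenity_50", "Wifi", "Kitchen"]
print(clean_amenities(sample))  # ['Wifi', 'Kitchen']
```

On the DataFrame returned by the query, this could be applied with something like `df["amenities"].apply(clean_amenities)`.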


Which neighborhoods are the most expensive, by average listing price?

In [4]:
query = """
MATCH (l:Listing)-[:IN_NEIGHBORHOOD]->(n:Neighborhood)
WITH n, avg(l.price) AS averagePrice
RETURN n.id AS zip, n.name AS neighborhood, averagePrice
"""

(graph.run(query).to_data_frame()
 .sort_values("averagePrice", ascending=False)
 .head(10))

Unnamed: 0,averagePrice,neighborhood,zip
29,391.473684,Steiner Ranch,78732
6,316.593939,Barton Hills,78746
7,299.970822,Clarksville,78703
25,273.533333,,78725
2,266.29772,,78704
9,265.03937,,78702
16,258.25,,78734
30,257.252427,Northwest Hills,78731
36,251.645833,Downtown,78701
11,240.0,Oak Hill,78735
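
Four of the ten rows have no neighborhood name, which would leave blank tick labels in a chart. One option is to fall back to the zip code. The sketch below uses rows copied from the table above. `display_name` is a hypothetical helper, and it treats `None`, NaN, and empty strings alike because the output does not show which of these the blanks actually are.

```python
def display_name(neighborhood, zip_code):
    """Use the neighborhood name when present, otherwise the zip code."""
    # Non-string values (None, NaN) and blank strings all count as missing.
    if isinstance(neighborhood, str) and neighborhood.strip():
        return neighborhood
    return f"zip {zip_code}"


# (neighborhood, zip) pairs taken from the first four rows above.
rows = [
    ("Steiner Ranch", 78732),
    ("Barton Hills", 78746),
    ("Clarksville", 78703),
    (None, 78725),
]
print([display_name(name, z) for name, z in rows])
# ['Steiner Ranch', 'Barton Hills', 'Clarksville', 'zip 78725']
```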


Add some charts with matplotlib
Take some ideas from the EDA of the Yelp dataset (look at Grace's notebooks for this)

Add an exercise for people to plot along some other dimension