<a href="https://colab.research.google.com/github/araujoghm/DataScienceEMAp_AraujoNovais/blob/master/FDS_Airbnb_prices_and_Crime_FINAL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#The relation between criminality and rent prices: a case study of Airbnb in Chicago
<b>Guilherme Araújo & Gabriel Novais</b>:


The goal of this work is to analyze the relation between the prices of Airbnb listings in Chicago and records of criminal occurences in the city for the period of July 2018 to July 2019.

Why Airbnb? Because price rates are more dynamic, since they operate on a more immediate supply-demand equiilibrium, can change daily and respond to many factors such as criminality, in particular. While some caveats have to be made, since many Airbnb listings are likely to be closer to touristic spots and to be less present in poor neighbourhoods, most listings are made available for most of the year, which would suggest there's an underlying mid-to-long term optimization logic for the hosts. This is not meant as an accurate proxy for long-term rent process, but more of an insight into how the decision-making process (hosts deciding at which prices to list their places for each date, consumers deciding which places to rent given price, location and other factors) can be affected by surrounding criminality. 

How are we doing it? We're going to estimate via linear regression a relationship between 1) prices and nearby criminal occurences, for each day in our sample where we have information on both crimes and Airbnb listings  and 2) variations on listed prices and on nearby criminal occurences, for the listings whose prices were changed by the hosts between the first posting of the listings and the actual renting date.  


<b>Sources and Links</b>:

<b>Airbnb</b>
<li><a href="http://insideairbnb.com/get-the-data.html">http://insideairbnb.com/get-the-data.html</a></li>

<b>Chicago</b>
    <li><a>https://data.cityofchicago.org/Public-Safety/Crimes-One-year-prior-to-present/x2n5-8w5q/data</li></a>

In [0]:
#Setting up Python
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import glob
import re
import io
import requests
import csv

from sklearn.linear_model import LinearRegression
from math import radians, sin, cos, acos, log, pi, tan, asin,sqrt
from decimal import Decimal
from bokeh.plotting import figure, show, output_notebook
from bokeh.tile_providers import CARTODBPOSITRON
from ast import literal_eval
from scipy import stats
import statsmodels.api as sm

In [0]:
#from google.colab import drive
#drive.mount('drive')

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).


In [0]:
def indices(lst, element):
    result = []
    offset = -1
    while True:
        try:
            offset = lst.index(element, offset+1)
        except ValueError:
            return result
        result.append(offset)

In [0]:
def distance(a,b):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees). Output in KM
    """
    lat1 = a[0]
    lat2 = b[0]
    lon1 = a[1]
    lon2 = b[1]
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    km = 6371 * c
    return km

#Data

## Importing, cleaning and organizing Airbnb data

Airbnb does not publically release information on its listings. When opening a listing on the Airbnb, all the information we can find are informations about the listing and its host, reviews and a calendar that shows the future dates when the place will be available and the rent price for each day. So how can we make inferences about Airbnb activity?

http://insideairbnb.com/ is a website that provides data scraped periodically from the Airbnb website for selected cities. For the city of Chicago, which will be our subject of choice, we have 14 different iterations of this scraping process, the earliest from April 15th, 2018 and the latest from July 15th, 2019. 

We'll be building 3 datasets from the data obtained from InsideAirbnb:
- <b>listings </b>, which has data for each listings such as host identification, neighborhood and location
- <b>reviews </b>, which compiles the dates of each review posted on the website for each listings
- <b>calendar</b>, which shows the availability and pricing for future dates; by joining data from different iterations of their web scraping, we can build a very accurate database of pricing for the cumulative time period.

It's important to highlight that the availability information is noisy, since booked dates are listed as unavailable, and we don't have explicit information on which dates the places were actually rented. What we do is use the date of reviews as proxy, assuming that users post a review as soon as they leave the rented place, which makes the data of a listing on the day a consumer posted a review relevant. While users may take a day or two to post their reviews, since prices don't vary much from day to day (even though it changes throughout the year), we assume any imprecision here is irrelevant on the aggregate.

### Listings

We'll import the data directly from our GitHub repository, where we've previously saved and organized the data extracted from Inside AirBnb.

Each listings.csv file features data from all the listings on the Airbnb website on that day. The most recent information is what is of our interest; however, it doesn't feature the entire history of listings. Thus, we appended data from previous versions of the listings dataset and only kept the most recent data, so we can have the most accurate information on the largest set of listings.

In order to select relevant listings, we discarded listings which are available for less than 10 days a year and that have had less than 10 reviews, to not burden ourselves with skewed information based on one-off rents. We have also discarded listings from dates previous to April 15th, 2018, since we have no calendar information on them.

One of the most important information on this dataset is the location for each listing, provided by latitude and longitude. Since our main interest in this information is to calculate distances between the listings and nearby crimes on each date, both latitude and longitude information have been rounded to 2 decimals to avoid redundant calculations and to offset errors in measurement, since rounding up to 3 or more decimals made it so that some listings showed up with different locations on different dates. This reduced thousands of listings to 370 general locations, for which we then created and id for each of those locations.

Also note that the method we used for creating id's generated ordered values but of seemingly random values, so we decide to create a second id reordering those values starting from 0 and incrementing by 1, which will facilitate consulting locations later on.

In [0]:
#Import listings data from each scraping iteration (from oldest to newest)
url_l1 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_18_4_15.csv'
url_l2 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_18_5_18.csv'
url_l3 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_18_7_18.csv'
url_l4 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_18_9_14.csv'
url_l5 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_18_10_11.csv'
url_l6 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_18_11_15.csv'
url_l7 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_18_12_13.csv'
url_l8 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_19_1_17.csv'
url_l9 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_19_2_9.csv'
url_l10 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_19_3_12.csv'
url_l11 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_19_4_15.csv'
url_l12 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_19_5_19.csv'
url_l13 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_19_6_14.csv'
url_l14 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_19_7_15.csv'


listings_1 = pd.read_csv(url_l1)
listings_2 = pd.read_csv(url_l2)
listings_3 = pd.read_csv(url_l3)
listings_4 = pd.read_csv(url_l4)
listings_5 = pd.read_csv(url_l5)
listings_6 = pd.read_csv(url_l6)
listings_7 = pd.read_csv(url_l7)
listings_8 = pd.read_csv(url_l8)
listings_9 = pd.read_csv(url_l9)
listings_10 = pd.read_csv(url_l10)
listings_11 = pd.read_csv(url_l11)
listings_12 = pd.read_csv(url_l12)
listings_13 = pd.read_csv(url_l13)
listings_14 = pd.read_csv(url_l14)

In [0]:
#The most recent listing data is the one we want, but some past listings may no longer show up
#We'll append to the most recent listings data from past scrapings, but we'll only keep the most recent information for each id 
listings=listings_14
listings=listings.append(listings_13)
listings=listings.append(listings_12)
listings=listings.append(listings_11)
listings=listings.append(listings_10)
listings=listings.append(listings_9)
listings=listings.append(listings_8)
listings=listings.append(listings_7)
listings=listings.append(listings_6)
listings=listings.append(listings_5)
listings=listings.append(listings_4)
listings=listings.append(listings_3)
listings=listings.append(listings_2)
listings=listings.append(listings_1)

listings=listings.drop_duplicates(subset="id", keep='first')
listings=listings.drop(columns=['name','host_name','price','minimum_nights','neighbourhood_group'])
listings=listings.rename(index=str, columns={"id": "listing_id"})

listings=listings.dropna(subset=['last_review'], axis=0)
listings['lr_m']=listings.last_review.apply(lambda x: int(x[5:7]))
listings['lr_d']=listings.last_review.apply(lambda x: int(x[8:10]))
listings['lr_y']=listings.last_review.apply(lambda x: int(x[0:4]))
listings.last_review = pd.to_datetime(listings.last_review)
listings['lat']=listings.latitude.round(2)
listings['lon']=listings.longitude.round(2)
listings['location'] = list(zip(listings.latitude, listings.longitude))
listings['loc'] = list(zip(listings.lat, listings.lon))

listings = listings.assign(loc_id=(listings['loc'].astype('category').cat.codes))

listings.room_type = listings.room_type.apply(lambda x: 1 if x=="Entire home/apt" else 2 if x=="Private room" else 3)

listings=listings[listings.number_of_reviews > 9]
listings=listings[listings.lr_y > 2017]
listings=listings[listings.availability_365>9]
listings=listings.drop(listings[(listings.lr_y==2018) & (listings.lr_m<4)].index)
listings=listings.drop(listings[(listings.lr_y==2018) & (listings.lr_m==4) & (listings.lr_d<15)].index)

listings=listings.drop(columns=['last_review'])

In [0]:
#We'll create a dataframe storing each pair of location and id
listings_locations = listings[['loc','loc_id']]
listings_locations = listings_locations.drop_duplicates('loc_id')
listings_locations = listings_locations.set_index('loc_id')
listings_locations = listings_locations.sort_index()
listings_locations = listings_locations.reset_index()
listings_locations['loc_id2']=listings_locations.index
list_locs=list(listings_locations['loc'])
len(list_locs)

370

In [0]:
listings_locations.head()

Unnamed: 0,loc_id,loc,loc_id2
0,0,"(41.65, -87.54)",0
1,2,"(41.66, -87.55)",1
2,4,"(41.67, -87.66)",2
3,10,"(41.69, -87.68)",3
4,11,"(41.69, -87.67)",4


In [0]:
listings=listings.merge(listings_locations,on=['loc_id','loc'])
listings.head()

Unnamed: 0,listing_id,host_id,neighbourhood,latitude,longitude,room_type,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,lr_m,lr_d,lr_y,lat,lon,location,loc,loc_id,loc_id2
0,2384,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",124,71
1,2604454,13339125,Hyde Park,41.78977,-87.58916,1,85,1.42,3,41,7,11,2019,41.79,-87.59,"(41.789770000000004, -87.58915999999999)","(41.79, -87.59)",124,71
2,6524346,34121377,Hyde Park,41.79119,-87.59099,2,26,0.52,1,34,7,7,2019,41.79,-87.59,"(41.79119, -87.59099)","(41.79, -87.59)",124,71
3,18549719,47172572,Hyde Park,41.79296,-87.59275,1,127,4.77,60,96,7,2,2019,41.79,-87.59,"(41.79296, -87.59275)","(41.79, -87.59)",124,71
4,22320506,47172572,Hyde Park,41.79386,-87.59469,1,99,5.32,60,93,7,8,2019,41.79,-87.59,"(41.793859999999995, -87.59469)","(41.79, -87.59)",124,71


In [0]:
#listings.to_csv('listings.csv')
#!cp listings.csv drive/My\ Drive/

#list_locs.to_csv('list_locs.csv')
#!cp list_locs.csv drive/My\ Drive/


### Reviews

Our reviews dataset is much simpler, since the latest information stores the entire history of Airbnb reviews by listing and date. We simply discarded information for dates outside of our interest and create a dummy variable called 'review', so when we merge the reviews to our calendar we can establish which dates of our listings we can assume have been actually rented. 

In [0]:
#Import review data
url_r = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/reviews/reviews_15_7_19.csv'
reviews = pd.read_csv(url_r)

In [0]:
reviews['month']=reviews.date.apply(lambda x: int(x[5:7]))
reviews['day']=reviews.date.apply(lambda x: int(x[8:10]))
reviews['year']=reviews.date.apply(lambda x: int(x[0:4]))
reviews.date = pd.to_datetime(reviews.date)

reviews=reviews[reviews.year > 2017]
reviews=reviews.drop(reviews[(reviews.year==2018) & (reviews.month<4)].index)
reviews=reviews.drop(reviews[(reviews.year==2018) & (reviews.month==4) & (reviews.day<15)].index)
reviews=reviews.drop(columns=['month','day','year'])
reviews['review']=1

In [0]:
reviews.head()

Unnamed: 0,listing_id,date,review
112,2384,2018-04-15,1
113,2384,2018-04-22,1
114,2384,2018-04-25,1
115,2384,2018-05-05,1
116,2384,2018-05-14,1


### Calendar

As previously explained, each scraping iteration of the calendars features prices and available dates for the near future, as provided by the host. We assumed the latest information is more likely to reflect the actual price exercised on each date. Thus, for our main analysis, we're only keeping the most recent prices made available on the website on our calendar dataset. 

However, since we're also interested in how hosts change their prices for future dates, we created another dataset named cal_change which stores listings for which different prices have been listed on different scraping dates. For now, we'll leave it aside and focus on our calendar dataset.

In [0]:
#Import calendar data from each scraping iteration (from oldest to newest)
url_c1 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_18_4_15.zip'
url_c2 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_18_5_18.zip'
url_c3 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_18_7_18.zip'
url_c4 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_18_9_14.zip'
url_c5 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_18_10_11.zip'
url_c6 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_18_11_15.zip'
url_c7 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_18_12_13.zip'
url_c8 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_19_1_17.zip'
url_c9 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_19_2_9.zip'
url_c10 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_19_3_12.zip'
url_c11 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_19_4_15.zip'
url_c12 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_19_5_19.zip'
url_c13 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_19_6_14.zip'
url_c14 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_19_7_15.zip'


calendar_1 = pd.read_csv(url_c1)
calendar_2 = pd.read_csv(url_c2)
calendar_3 = pd.read_csv(url_c3)
calendar_4 = pd.read_csv(url_c4)
calendar_5 = pd.read_csv(url_c5)
calendar_6 = pd.read_csv(url_c6)
calendar_7 = pd.read_csv(url_c7)
calendar_8 = pd.read_csv(url_c8)
calendar_9 = pd.read_csv(url_c9)
calendar_10 = pd.read_csv(url_c10)
calendar_11 = pd.read_csv(url_c11)
calendar_12 = pd.read_csv(url_c12)
calendar_13 = pd.read_csv(url_c13)
calendar_14 = pd.read_csv(url_c14)

In [0]:
#For each calendar scraping, add scraping date
calendar_1['scr_date']='2018-04-15'
calendar_2['scr_date']='2018-05-18'
calendar_3['scr_date']='2018-07-18'
calendar_4['scr_date']='2018-09-14'
calendar_5['scr_date']='2018-10-11'
calendar_6['scr_date']='2018-11-15'
calendar_7['scr_date']='2018-12-13'
calendar_8['scr_date']='2019-01-17'
calendar_9['scr_date']='2019-02-09'
calendar_10['scr_date']='2019-03-12'
calendar_11['scr_date']='2019-04-15'
calendar_12['scr_date']='2019-05-19'
calendar_13['scr_date']='2019-06-14'
calendar_14['scr_date']='2019-07-15'

In [0]:
calendar=calendar_14
calendar=calendar.append(calendar_13)
calendar=calendar.append(calendar_12)
calendar=calendar.append(calendar_11)
calendar=calendar.append(calendar_10)
calendar=calendar.append(calendar_9)
calendar=calendar.append(calendar_8)
calendar=calendar.append(calendar_7)
calendar=calendar.append(calendar_6)
calendar=calendar.append(calendar_5)
calendar=calendar.append(calendar_4)
calendar=calendar.append(calendar_3)
calendar=calendar.append(calendar_2)
calendar=calendar.append(calendar_1)

calendar=calendar[['listing_id','date','price','scr_date']]
calendar=calendar.drop_duplicates(subset=['listing_id','date','price'])
calendar=calendar.dropna(axis=0,subset=['price'])
calendar['month']=calendar.date.apply(lambda x: int(x[5:7]))
calendar['day']=calendar.date.apply(lambda x: int(x[8:10]))
calendar['year']=calendar.date.apply(lambda x: int(x[0:4]))
calendar.date = pd.to_datetime(calendar.date)
calendar.scr_date = pd.to_datetime(calendar.scr_date)
calendar.price = calendar.price.apply(lambda x: float(re.sub("[^\d\.]", "", (x[1:-3]))))
calendar.price = pd.to_numeric(calendar.price)

calendar=calendar.merge(reviews,on=['listing_id','date'],how='left')
calendar['review']=calendar['review'].fillna(0)
calendar['review']=calendar['review'].astype(int)

cal_change = calendar[calendar.duplicated(['listing_id','date'],keep=False)]
cal_change = cal_change.sort_values(by=['listing_id','date'])

calendar=calendar.drop(columns=['scr_date'])
calendar=calendar.drop_duplicates(subset=['listing_id','date'])
calendar=calendar.sort_values(by=['listing_id','date'])

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  sort=sort)


In [0]:
calendar.head()

Unnamed: 0,listing_id,date,price,month,day,year,review
14493724,2384,2018-04-15,55.0,4,15,2018,1
14493723,2384,2018-04-16,55.0,4,16,2018,0
14493722,2384,2018-04-17,55.0,4,17,2018,0
14493721,2384,2018-04-18,55.0,4,18,2018,0
14493720,2384,2018-04-22,55.0,4,22,2018,1


In [0]:
cal_change.head()

Unnamed: 0,listing_id,date,price,scr_date,month,day,year,review
13873240,2384,2018-06-11,65.0,2018-05-18,6,11,2018,0
14493696,2384,2018-06-11,80.0,2018-04-15,6,11,2018,0
13873226,2384,2018-07-01,65.0,2018-05-18,7,1,2018,1
14493689,2384,2018-07-01,60.0,2018-04-15,7,1,2018,1
13873225,2384,2018-07-02,65.0,2018-05-18,7,2,2018,0


In [0]:
#calendar.to_csv('calendar.csv')
#!cp calendar.csv drive/My\ Drive/

#cal_change.to_csv('cal_change.csv')
#!cp cal_change.csv drive/My\ Drive/

## Importing, cleaning and organizing crimes data

We extracted our data on crimes for the city of Chicago from the Chicago Data Portal website, which amongst its Public Safety data features a dataset named "Crimes: one year prior to present", which lists all reports of criminal occurences for an entire year up to the latest update (roughly a week before the present date). For the version of this file saved on our GitHub, data spans from July 9th, 2018 to July 8th, 2019. We decided to drop data from July, 2019 since the its few entries seem incomplete, listing only a handful of occurrences.

We decided to discard crimes of certain categories such as 'deceptive practice' (e.g. credit card frauds) and 'liquor law violation' (e.g. selling alcoholic drinks without a permit), which we deemed to not be relevant when it comes to the decision-making process from both hosts and consumers in regards to rent.

When checking how many occurrences there are on our dataset for each crime category (as listed by the Chicago Data Portal), it can be seen that the most frequent crimes are related to stealing private possessions ('theft','burglary', 'robbery', 'motor vehicle theft'), criminal damage and physical violence ('battery', 'assault'), while the number of homicides pale in comparison (which can be at least partially attributed to less reporting, as public information would suggest many more homicides happened on Chicago for that time period).

Since some types of crimes are much more reported than others, the relation between aggregate criminality and prices might be unclear and dominated by the categories with more representation. For example, we'd expect the correlation between price and nearby homicides to be negative, but the correlation between theft and price might actually be positive since higher rent prices are likely to be present in richer areas or more populated areas, where thefts might be more present (or at least, reported more often).

To make a more thorough analysis, we'll deal with the full set of criminal occurrences as well as subsets for crimes related to physical violence, stealing property and homicides.

Like on our listings dataset, we rounded locations (latitude and longitude) to 2 decimals, reducing over a hundred thousand criminal reports to 708 locations.

In [0]:
url_cr = "https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/crimes/crimes.csv"
crimes = pd.read_csv(url_cr)
crimes = crimes[['DATE  OF OCCURRENCE','LATITUDE','LONGITUDE','ARREST',' PRIMARY DESCRIPTION']]
crimes = crimes.rename(index=str, columns={"DATE  OF OCCURRENCE": "date","LONGITUDE": "longitude","LATITUDE": "latitude"," PRIMARY DESCRIPTION": "desc", "ARREST": "arrest"})
crimes = crimes[crimes.desc!="CONCEALED CARRY LICENSE VIOLATION"]
crimes = crimes[crimes.desc!="DECEPTIVE PRACTICE"]
crimes = crimes[crimes.desc!="INTERFERENCE WITH PUBLIC OFFICER"]
crimes = crimes[crimes.desc!="OBSCENITY"]
crimes = crimes[crimes.desc!="NON-CRIMINAL"]
crimes = crimes[crimes.desc!="NON-CRIMINAL (SUBJECT SPECIFIED)"]
crimes = crimes[crimes.desc!="LIQUOR LAW VIOLATION"]
crimes = crimes[crimes.desc!="PUBLIC INDECENCY"]

crimes['lat']=crimes.latitude.round(2)
crimes['lon']=crimes.longitude.round(2)
crimes['location'] = list(zip(crimes.latitude, crimes.longitude))
crimes['loc'] = list(zip(crimes.lat, crimes.lon))
crimes = crimes.assign(loc_id=(crimes['loc'].astype('category').cat.codes))
crimes.arrest = crimes.arrest.apply(lambda x: 0 if x=="N" else 1)
crimes.date = crimes.date.apply(lambda x: x[0:10])
crimes.date = pd.to_datetime(crimes.date)
crimes['date_str'] = crimes.date.astype('str')
crimes['month']=crimes.date_str.apply(lambda x: int(x[5:7]))
crimes['day']=crimes.date_str.apply(lambda x: int(x[8:10]))
crimes['year']=crimes.date_str.apply(lambda x: int(x[0:4]))

crimes=crimes.drop(columns=['date_str'])
#Dropping incomplete observations
crimes=crimes.drop(crimes[(crimes.year==2019) & (crimes.month==7)].index)

crimes=crimes.dropna(axis=0)
crimes=crimes.sort_values(by='date')
print(crimes['desc'].value_counts())

THEFT                         60775
BATTERY                       48332
CRIMINAL DAMAGE               26229
ASSAULT                       20093
OTHER OFFENSE                 16370
NARCOTICS                     12648
BURGLARY                      10272
MOTOR VEHICLE THEFT            9277
ROBBERY                        8448
CRIMINAL TRESPASS              6582
WEAPONS VIOLATION              5736
OFFENSE INVOLVING CHILDREN     2138
CRIM SEXUAL ASSAULT            1542
PUBLIC PEACE VIOLATION         1440
SEX OFFENSE                    1129
PROSTITUTION                    666
HOMICIDE                        551
ARSON                           357
STALKING                        200
INTIMIDATION                    182
GAMBLING                        168
KIDNAPPING                      161
HUMAN TRAFFICKING                14
OTHER NARCOTIC VIOLATION          4
Name: desc, dtype: int64


In [0]:
homicides=crimes[crimes.desc=="HOMICIDE"]
homicides=homicides.drop(columns=['desc'])

stealing=crimes[crimes.desc.isin(["BURGLARY", "THEFT", "ROBBERY", "MOTOR VEHICLE THEFT"])]
stealing=stealing.drop(columns=['desc'])

violence=crimes[crimes.desc.isin(["BATTERY", "ASSAULT"])]
violence=violence.drop(columns=['desc'])

crimes=crimes.drop(columns=['desc'])

Like we did for our listings, we'll create dataframe pairing locations to their id's (both the original and our "corrected" version)

In [0]:
crimes_locations = crimes[['loc','loc_id']]
crimes_locations = crimes_locations.drop_duplicates('loc_id')
crimes_locations = crimes_locations.set_index('loc_id')
crimes_locations = crimes_locations.sort_index()
crimes_locations = crimes_locations.reset_index()
crimes_locations['crim_loc_id2']=crimes_locations.index
crim_locs=list(crimes_locations['loc'])
len(crim_locs)

708

In [0]:
homicides_locations = homicides[['loc','loc_id']]
homicides_locations = homicides_locations.drop_duplicates('loc_id')
homicides_locations = homicides_locations.set_index('loc_id')
homicides_locations = homicides_locations.sort_index()
homicides_locations = homicides_locations.reset_index()
homicides_locations['homi_loc_id2']=homicides_locations.index
homi_locs=list(homicides_locations['loc'])
len(homi_locs)

244

In [0]:
violence_locations = violence[['loc','loc_id']]
violence_locations = violence_locations.drop_duplicates('loc_id')
violence_locations = violence_locations.set_index('loc_id')
violence_locations = violence_locations.sort_index()
violence_locations = violence_locations.reset_index()
violence_locations['viol_loc_id2']=violence_locations.index
viol_locs=list(violence_locations['loc'])
len(viol_locs)

682

In [0]:
stealing_locations = stealing[['loc','loc_id']]
stealing_locations = stealing_locations.drop_duplicates('loc_id')
stealing_locations = stealing_locations.set_index('loc_id')
stealing_locations = stealing_locations.sort_index()
stealing_locations = stealing_locations.reset_index()
stealing_locations['stea_loc_id2']=stealing_locations.index
stea_locs=list(stealing_locations['loc'])
len(stea_locs)

681

In [0]:
crimes = crimes.merge(crimes_locations,on=['loc_id','loc'])
homicides = homicides.merge(homicides_locations,on=['loc_id','loc'])
violence = violence.merge(violence_locations,on=['loc_id','loc'])
stealing = stealing.merge(stealing_locations,on=['loc_id','loc'])

In [0]:
#crimes.to_csv('crimes.csv')
#!cp crimes.csv drive/My\ Drive/

#stealing.to_csv('stealing.csv')
#!cp stealing.csv drive/My\ Drive/

#homicides.to_csv('homicides.csv')
#!cp homicides.csv drive/My\ Drive/

#violence.to_csv('violence.csv')
#cp violence.csv drive/My\ Drive/

##Crimes by location and date

Now we'll create dataframes for counting criminal occurences for each date and listing the locations in which those crimes happened

###Crimes by date

In [0]:
crimes_date=crimes[['date','loc_id','crim_loc_id2']]
crimes_date=crimes_date.groupby('date').agg(lambda x: list(x))
crimes_date['crimes_count_date']=np.nan
for i in range(len(crimes_date)):
  crimes_date.crimes_count_date.iloc[i]=len(list(crimes_date.loc_id.iloc[i]))
  crimes_date.loc_id.iloc[i]=np.unique(list(crimes_date.loc_id.iloc[i]))
  crimes_date.crim_loc_id2.iloc[i]=np.unique(list(crimes_date.crim_loc_id2.iloc[i]))
crimes_date.crimes_count_date=crimes_date.crimes_count_date.astype(int)
crimes_date=crimes_date.reset_index()
crimes_date=crimes_date.rename(index=str, columns={"loc_id": "crim_loc_id"})
crimes_date.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


Unnamed: 0,date,crim_loc_id,crim_loc_id2,crimes_count_date
0,2018-07-09,"[0, 5883, 5900, 5903, 5904, 5905, 5906, 5908, ...","[0, 27, 34, 37, 38, 39, 40, 42, 47, 48, 49, 50...",622
1,2018-07-10,"[0, 5885, 5892, 5900, 5902, 5906, 5907, 5908, ...","[0, 28, 30, 34, 36, 40, 41, 42, 44, 46, 47, 49...",732
2,2018-07-11,"[0, 5883, 5893, 5894, 5900, 5903, 5904, 5905, ...","[0, 27, 31, 32, 34, 37, 38, 39, 41, 43, 44, 45...",665
3,2018-07-12,"[0, 5893, 5897, 5900, 5907, 5908, 5913, 5915, ...","[0, 31, 33, 34, 41, 42, 47, 49, 51, 57, 58, 60...",732
4,2018-07-13,"[0, 5890, 5892, 5894, 5903, 5904, 5905, 5906, ...","[0, 29, 30, 32, 37, 38, 39, 40, 41, 46, 47, 51...",753


In [0]:
homicides_date=homicides[['date','loc_id','homi_loc_id2']]
homicides_date=homicides_date.groupby('date').agg(lambda x: list(x))
homicides_date['homicides_count_date']=np.nan
for i in range(len(homicides_date)):
  homicides_date.homicides_count_date.iloc[i]=len(list(homicides_date.loc_id.iloc[i]))
  homicides_date.loc_id.iloc[i]=np.unique(list(homicides_date.loc_id.iloc[i]))
  homicides_date.homi_loc_id2.iloc[i]=np.unique(list(homicides_date.homi_loc_id2.iloc[i]))
homicides_date.homicides_count_date=homicides_date.homicides_count_date.astype(int)
homicides_date=homicides_date.reset_index()
homicides_date=homicides_date.rename(index=str, columns={"loc_id": "homi_loc_id"})
homicides_date.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


Unnamed: 0,date,homi_loc_id,homi_loc_id2,homicides_count_date
0,2018-07-09,[6088],[108],1
1,2018-07-10,"[6001, 6101]","[45, 115]",2
2,2018-07-11,"[5980, 6029, 6103]","[38, 66, 117]",4
3,2018-07-12,"[5917, 6023, 6104, 6202, 6463]","[11, 60, 118, 156, 238]",5
4,2018-07-13,[6068],[95],1


In [0]:
violence_date=violence[['date','loc_id','viol_loc_id2']]
violence_date=violence_date.groupby('date').agg(lambda x: list(x))
violence_date['violence_count_date']=np.nan
for i in range(len(violence_date)):
  violence_date.violence_count_date.iloc[i]=len(list(violence_date.loc_id.iloc[i]))
  violence_date.loc_id.iloc[i]=np.unique(list(violence_date.loc_id.iloc[i]))
  violence_date.viol_loc_id2.iloc[i]=np.unique(list(violence_date.viol_loc_id2.iloc[i]))
violence_date.violence_count_date=violence_date.violence_count_date.astype(int)
violence_date=violence_date.reset_index()
violence_date=violence_date.rename(index=str, columns={"loc_id": "viol_loc_id"})
violence_date.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


Unnamed: 0,date,viol_loc_id,viol_loc_id2,violence_count_date
0,2018-07-09,"[0, 5883, 5903, 5904, 5906, 5915, 5916, 5917, ...","[0, 19, 29, 30, 32, 41, 42, 43, 67, 70, 77, 78...",160
1,2018-07-10,"[0, 5900, 5906, 5907, 5908, 5915, 5916, 5925, ...","[0, 26, 32, 33, 34, 41, 42, 51, 54, 57, 62, 66...",193
2,2018-07-11,"[0, 5893, 5904, 5909, 5917, 5923, 5924, 5925, ...","[0, 23, 30, 35, 43, 49, 50, 51, 52, 53, 54, 65...",196
3,2018-07-12,"[0, 5893, 5907, 5908, 5926, 5927, 5929, 5938, ...","[0, 23, 33, 34, 52, 53, 55, 64, 70, 71, 72, 78...",200
4,2018-07-13,"[0, 5890, 5892, 5903, 5904, 5917, 5926, 5927, ...","[0, 21, 22, 29, 30, 43, 52, 53, 54, 55, 64, 77...",215


In [0]:
stealing_date=stealing[['date','loc_id','stea_loc_id2']]
stealing_date=stealing_date.groupby('date').agg(lambda x: list(x))
stealing_date['stealing_count_date']=np.nan
for i in range(len(stealing_date)):
  stealing_date.stealing_count_date.iloc[i]=len(list(stealing_date.loc_id.iloc[i]))
  stealing_date.loc_id.iloc[i]=np.unique(list(stealing_date.loc_id.iloc[i]))
  stealing_date.stea_loc_id2.iloc[i]=np.unique(list(stealing_date.stea_loc_id2.iloc[i]))
stealing_date.stealing_count_date=stealing_date.stealing_count_date.astype(int)
stealing_date=stealing_date.reset_index()
stealing_date=stealing_date.rename(index=str, columns={"loc_id": "stea_loc_id"})
stealing_date.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


Unnamed: 0,date,stea_loc_id,stea_loc_id2,stealing_count_date
0,2018-07-09,"[0, 5905, 5906, 5913, 5915, 5928, 5930, 5933, ...","[0, 32, 33, 40, 42, 55, 57, 60, 82, 84, 90, 92...",259
1,2018-07-10,"[0, 5892, 5913, 5915, 5916, 5920, 5922, 5924, ...","[0, 23, 40, 42, 43, 47, 49, 51, 52, 55, 59, 60...",317
2,2018-07-11,"[0, 5883, 5905, 5911, 5917, 5923, 5926, 5927, ...","[0, 20, 32, 38, 44, 50, 53, 54, 60, 65, 67, 68...",264
3,2018-07-12,"[0, 5893, 5900, 5913, 5917, 5924, 5926, 5928, ...","[0, 24, 27, 40, 44, 51, 53, 55, 68, 89, 91, 94...",310
4,2018-07-13,"[0, 5903, 5904, 5905, 5906, 5907, 5913, 5923, ...","[0, 30, 31, 32, 33, 34, 40, 50, 55, 69, 80, 82...",328


In [0]:
#crimes_date.to_csv('crimes_date.csv')
#!cp crimes_date.csv drive/My\ Drive/

#stealing_date.to_csv('stealing_date.csv')
#!cp stealing_date.csv drive/My\ Drive/

#homicides_date.to_csv('homicides_date.csv')
#!cp homicides_date.csv drive/My\ Drive/

#violence_date.to_csv('violence_date.csv')
#!cp violence_date.csv drive/My\ Drive/

###Crimes by location

In [0]:
crimes_loc=crimes[['loc_id','crim_loc_id2']]
crimes_loc=crimes_loc.groupby(['loc_id']).agg(lambda x: list(x))
crimes_loc['crimes_count_loc']=np.nan
for i in range(len(crimes_loc)):
  crimes_loc.crimes_count_loc.iloc[i]=len(list(crimes_loc.crim_loc_id2.iloc[i]))
crimes_loc.crimes_count_loc=crimes_loc.crimes_count_loc.astype(int)
crimes_loc=crimes_loc.reset_index()
crimes_loc.crim_loc_id2=crimes_loc.index
crimes_loc=crimes_loc.rename(index=str, columns={"loc_id": "crim_loc_id"})

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


In [0]:
homicides_loc=homicides[['loc_id','homi_loc_id2']]
homicides_loc=homicides_loc.groupby(['loc_id']).agg(lambda x: list(x))
homicides_loc['homicides_count_loc']=np.nan
for i in range(len(homicides_loc)):
  homicides_loc.homicides_count_loc.iloc[i]=len(list(homicides_loc.homi_loc_id2.iloc[i]))
homicides_loc.homicides_count_loc=homicides_loc.homicides_count_loc.astype(int)
homicides_loc=homicides_loc.reset_index()
homicides_loc.homi_loc_id2=homicides_loc.index
homicides_loc=homicides_loc.rename(index=str, columns={"loc_id": "homi_loc_id"})

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


In [0]:
violence_loc=violence[['loc_id','viol_loc_id2']]
violence_loc=violence_loc.groupby(['loc_id']).agg(lambda x: list(x))
violence_loc['violence_count_loc']=np.nan
for i in range(len(violence_loc)):
  violence_loc.violence_count_loc.iloc[i]=len(list(violence_loc.viol_loc_id2.iloc[i]))
violence_loc.violence_count_loc=violence_loc.violence_count_loc.astype(int)
violence_loc=violence_loc.reset_index()
violence_loc.viol_loc_id2=violence_loc.index
violence_loc=violence_loc.rename(index=str, columns={"loc_id": "viol_loc_id"})

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


In [0]:
stealing_loc=stealing[['loc_id','stea_loc_id2']]
stealing_loc=stealing_loc.groupby(['loc_id']).agg(lambda x: list(x))
stealing_loc['stealing_count_loc']=np.nan
for i in range(len(stealing_loc)):
  stealing_loc.stealing_count_loc.iloc[i]=len(list(stealing_loc.stea_loc_id2.iloc[i]))
stealing_loc.stealing_count_loc=stealing_loc.stealing_count_loc.astype(int)
stealing_loc=stealing_loc.reset_index()
stealing_loc.stea_loc_id2=stealing_loc.index
stealing_loc=stealing_loc.rename(index=str, columns={"loc_id": "stea_loc_id"})

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


In [0]:
#crimes_loc.to_csv('crimes_loc.csv')
#!cp crimes_loc.csv drive/My\ Drive/

#stealing_loc.to_csv('stealing_loc.csv')
#!cp stealing_loc.csv drive/My\ Drive/

#homicides_loc.to_csv('homicides_loc.csv')
#!cp homicides_loc.csv drive/My\ Drive/

#violence_loc.to_csv('violence_loc.csv')
#!cp violence_loc.csv drive/My\ Drive/

# 1 - Airbnb listing prices x nearby crimes on each date 

## Calculating distances between Airbnb listings and criminal occurrences

We're interested in knowing the criminal activity surrounding each Airbnb listing. To do that, we'll calculate the distance between each unique location on our listings database and each unique location on our criminal occurrence datasets. In particular, we'll obtain a listing of which crimes locations are within a radius of 1km, 2km and 5km of each Airbnb listing location.

In [0]:
#list_locs=pd.read_csv('/content/list_locs.csv')
#crim_locs=pd.read_csv('/content/crim_locs.csv')
#viol_locs=pd.read_csv('/content/viol_locs.csv')
#stea_locs=pd.read_csv('/content/stea_locs.csv')
#homi_locs=pd.read_csv('/content/homi_locs.csv')

In [0]:
dist_cr=[None]*len(list_locs)
for i in range(len(list_locs)):
  a=[None]*len(crim_locs)
  x=list_locs[i]
  for j in range(len(crim_locs)):
    a[j]=round(distance(x,crim_locs[j]),2)
  dist_cr[i]=a
  
dist_cr=pd.DataFrame(dist_cr)
dist_cr.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707
0,27.71,4.15,17.39,3.73,3.34,6.0,5.55,2.22,42.94,6.94,42.99,46.14,41.98,23.29,16.13,18.84,3.44,49.74,49.54,24.61,46.97,36.57,48.54,3.44,4.0,45.88,41.97,0.0,9.21,5.92,5.82,4.99,0.83,6.65,8.38,7.56,6.74,5.92,5.11,1.11,...,43.42,42.14,40.26,39.61,42.34,47.15,44.8,44.36,43.93,41.28,44.89,42.63,42.44,43.49,44.7,43.7,38.89,43.33,40.64,46.83,44.48,44.09,26.0,45.69,3.5,5.46,49.92,6.74,47.23,19.35,45.05,16.4,4.75,43.98,4.71,5.11,663.84,45.59,45.71,44.88
1,26.42,3.5,16.12,2.37,2.0,4.71,4.16,1.39,41.55,5.55,41.61,44.76,40.64,21.98,15.12,17.59,2.78,48.35,48.16,23.23,45.59,35.18,47.15,2.22,2.73,44.5,40.61,1.39,8.31,5.46,5.11,4.3,1.11,5.92,7.48,6.65,5.82,4.98,4.15,0.83,...,42.03,40.76,38.91,38.27,40.97,45.76,43.41,42.98,42.55,39.93,43.51,41.28,41.11,42.17,43.33,42.34,37.62,41.97,39.31,45.46,43.11,42.72,24.61,44.33,2.49,4.3,48.53,6.23,45.86,18.1,43.67,15.32,4.16,42.63,3.34,4.71,664.3,44.21,44.32,43.49
2,24.59,6.23,8.33,8.38,7.48,5.11,7.01,9.97,36.21,6.7,35.89,38.33,37.15,14.39,5.92,9.41,10.85,42.03,42.14,17.66,39.27,29.44,41.35,9.2,6.64,37.86,36.53,10.21,1.39,5.33,4.71,5.46,9.4,4.0,2.0,2.73,3.5,4.3,5.11,10.03,...,36.93,36.11,35.1,34.83,36.73,39.98,38.27,37.97,37.69,36.19,38.74,37.44,38.03,39.14,39.09,38.02,35.59,37.81,35.93,40.85,38.48,38.24,18.99,40.17,6.74,4.98,42.72,4.71,41.11,17.66,37.46,6.22,11.84,38.7,6.74,6.0,660.19,37.89,38.91,37.62
3,22.62,8.7,5.55,10.03,9.4,6.74,8.3,11.84,33.52,7.55,33.16,35.56,34.72,11.62,4.3,6.68,12.51,39.26,39.39,15.04,36.5,26.73,38.61,10.85,8.6,35.09,34.01,12.45,4.16,8.04,7.32,8.0,11.68,6.68,4.71,5.33,6.0,6.7,7.43,12.1,...,34.27,33.51,32.63,32.42,34.18,37.25,35.6,35.32,35.05,33.73,36.11,34.95,35.67,36.78,36.53,35.45,33.45,35.27,33.52,38.24,35.88,35.65,16.34,37.62,8.95,7.01,39.98,7.47,38.48,16.27,34.7,4.15,13.29,36.19,8.38,8.66,661.15,35.12,36.21,34.89
4,22.48,8.0,6.08,9.2,8.6,5.92,7.47,11.02,33.84,6.73,33.52,36.01,34.83,12.17,5.11,7.32,11.68,39.71,39.79,15.29,36.94,27.08,38.99,10.03,7.8,35.57,34.18,11.68,3.73,7.47,6.68,7.32,10.91,6.09,4.16,4.71,5.33,6.0,6.7,11.3,...,34.56,33.74,32.76,32.51,34.37,37.62,35.9,35.6,35.32,33.86,36.37,35.1,35.74,36.84,36.73,35.65,33.4,35.45,33.61,38.48,36.11,35.88,16.61,37.81,8.19,6.22,40.36,6.94,38.74,15.91,35.12,4.98,12.46,36.35,7.56,8.04,661.6,35.56,36.54,35.27


In [0]:
dist1_cr=dist_cr[dist_cr<=1].notnull().astype('int')
dist1_cr=dist1_cr.fillna(0)
dist1_cr.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
dist2_cr=dist_cr[dist_cr<=2].notnull().astype('int')
dist2_cr=dist2_cr.fillna(0)
dist2_cr.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
dist5_cr=dist_cr[dist_cr<=5].notnull().astype('int')
dist5_cr=dist5_cr.fillna(0)
dist5_cr.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707
0,0,1,0,1,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,1,0,0,0,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
1,0,1,0,1,1,1,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,1,0,0,0,1,1,0,0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,1,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0


In [0]:
dist_hm=[None]*len(list_locs)
for i in range(len(list_locs)):
  a=[None]*len(homi_locs)
  x=list_locs[i]
  for j in range(len(homi_locs)):
    a[j]=round(distance(x,homi_locs[j]),2)
  dist_hm[i]=a
  
dist_hm=pd.DataFrame(dist_hm)
dist_hm.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243
0,4.99,6.65,8.38,6.74,5.92,5.11,8.6,7.01,10.51,9.73,8.19,7.44,11.68,10.91,9.42,8.7,8.0,7.32,15.17,12.14,10.69,9.32,8.66,8.04,5.8,5.56,10.65,10.02,9.42,8.85,6.67,12.64,12.0,10.79,9.24,8.82,8.17,7.78,13.99,13.36,...,29.86,36.96,36.02,35.13,34.71,34.3,33.9,33.53,36.95,36.08,34.89,34.18,33.53,32.02,38.33,35.89,34.87,33.3,37.25,37.9,37.24,36.93,35.66,38.58,38.27,36.95,41.35,38.02,41.25,42.75,42.53,41.39,40.06,41.45,42.98,9.4,11.41,23.59,35.19,40.37
1,4.3,5.92,7.48,5.82,4.98,4.15,7.56,5.92,9.4,8.6,7.01,6.23,10.51,9.73,8.19,7.43,6.7,6.0,14.01,10.91,9.42,8.0,7.32,6.68,4.52,4.52,9.31,8.66,8.04,7.47,5.62,11.31,10.65,9.41,7.86,7.45,6.88,6.72,12.64,12.0,...,28.58,35.57,34.63,33.75,33.33,32.92,32.53,32.16,35.57,34.7,33.53,32.82,32.19,30.75,36.95,34.53,33.53,32.02,35.89,36.54,35.9,35.6,34.37,37.24,36.93,35.66,39.98,36.73,39.97,41.45,41.25,40.06,38.74,40.17,41.68,8.38,10.16,22.2,33.84,39.0
2,5.46,4.0,2.0,3.5,4.3,5.11,1.66,3.32,1.11,1.39,2.73,3.5,2.37,2.22,2.78,3.34,4.0,4.71,5.33,3.44,3.44,4.16,4.71,5.33,8.95,10.51,4.75,5.1,5.55,6.08,10.91,5.56,5.62,6.09,7.47,8.04,9.31,11.41,6.72,6.67,...,26.8,30.12,29.53,29.01,28.78,28.58,28.4,28.24,30.57,30.08,29.49,29.21,29.02,29.02,31.9,30.58,30.21,30.07,31.83,32.76,32.51,32.42,32.26,33.73,33.61,33.36,35.66,34.47,37.81,38.95,38.93,36.93,35.74,37.82,39.0,0.83,3.34,17.19,30.31,34.58
3,8.0,6.68,4.71,6.0,6.7,7.43,4.0,5.46,2.0,2.73,4.3,5.11,0.83,1.66,3.32,4.15,4.98,5.81,2.73,1.39,2.73,4.3,5.1,5.92,10.03,11.68,4.0,4.71,5.46,6.22,11.83,3.73,4.16,5.33,7.43,8.18,9.72,12.09,4.52,4.75,...,24.81,27.44,26.88,26.42,26.22,26.05,25.91,25.79,27.94,27.5,27.01,26.8,26.7,27.01,29.26,28.11,27.85,28.0,29.34,30.31,30.13,30.07,30.13,31.31,31.23,31.18,33.1,32.29,35.62,36.69,36.7,34.56,33.4,35.59,36.7,3.34,2.0,14.67,27.91,32.02
4,7.32,6.09,4.16,5.33,6.0,6.7,3.34,4.71,1.39,2.0,3.5,4.3,0.0,0.83,2.49,3.32,4.15,4.98,3.5,1.11,2.0,3.5,4.3,5.1,9.2,10.85,3.34,4.0,4.71,5.46,11.02,3.44,3.73,4.71,6.7,7.43,8.95,11.3,4.45,4.52,...,24.69,27.75,27.15,26.64,26.42,26.22,26.05,25.91,28.2,27.71,27.15,26.89,26.74,26.89,29.52,28.24,27.91,27.91,29.49,30.43,30.21,30.13,30.07,31.41,31.31,31.15,33.29,32.26,35.59,36.7,36.69,34.63,33.45,35.58,36.73,2.78,1.39,14.83,28.0,32.22


In [0]:
dist1_hm=dist_hm[dist_hm<=1].notnull().astype('int')
dist1_hm=dist1_hm.fillna(0)
#dist1_hm.head()

dist2_hm=dist_hm[dist_hm<=2].notnull().astype('int')
dist2_hm=dist2_hm.fillna(0)
#dist2_hm.head()

dist5_hm=dist_hm[dist_hm<=5].notnull().astype('int')
dist5_hm=dist5_hm.fillna(0)
#dist5_hm.head()


In [0]:
dist_vi=[None]*len(list_locs)
for i in range(len(list_locs)):
  a=[None]*len(viol_locs)
  x=list_locs[i]
  for j in range(len(viol_locs)):
    a[j]=round(distance(x,viol_locs[j]),2)
  dist_vi[i]=a
  
dist_vi=pd.DataFrame(dist_vi)
#dist_vi.head()

In [0]:
dist1_vi=dist_vi[dist_vi<=1].notnull().astype('int')
dist1_vi=dist1_vi.fillna(0)
#dist1_vi.head()

dist2_vi=dist_vi[dist_vi<=2].notnull().astype('int')
dist2_vi=dist2_vi.fillna(0)
#dist2_vi.head()

dist5_vi=dist_vi[dist_vi<=5].notnull().astype('int')
dist5_vi=dist5_vi.fillna(0)
#dist5_vi.head()

In [0]:
dist_st=[None]*len(list_locs)
for i in range(len(list_locs)):
  a=[None]*len(stea_locs)
  x=list_locs[i]
  for j in range(len(stea_locs)):
    a[j]=round(distance(x,stea_locs[j]),2)
  dist_st[i]=a
  
dist_st=pd.DataFrame(dist_st)
#dist_st.head()

In [0]:
dist1_st=dist_st[dist_st<=1].notnull().astype('int')
dist1_st=dist1_st.fillna(0)
#dist1_st.head()

dist2_st=dist_st[dist_st<=2].notnull().astype('int')
dist2_st=dist2_st.fillna(0)
#dist2_st.head()

dist5_st=dist_st[dist_st<=5].notnull().astype('int')
dist5_st=dist5_st.fillna(0)
#dist5_st.head()

##Merging listings, calendars, reviews and crimes to build our databases

First, we'll merge calendar and listings datasets, which will give us all the listing informations for each listing, for every date the place is listed on our calendar dataset.

As previously explained, we don't know to the fullest extent on which dates each listing was actually rented; we only know the dates when reviews were posted. In our attempt to only select relevant information to be used in our regression, we decide to only keep the date for the listings on dates they were reviewed.

Next, we'll merge our new database with our crimes_date databases, which will gives us info on criminal occurences of each type (all crimes, homicides, physical violence and stealing) for each date.



In [0]:
airbnb = calendar.merge(listings,on=['listing_id'],how='inner')
airbnb = airbnb[airbnb.review==1]
airbnb = airbnb.drop(columns=['review'])

In [0]:
airbnb_cr=airbnb.merge(crimes_date,on=['date'])
airbnb_cr=airbnb_cr.merge(homicides_date,on=['date'])
airbnb_cr=airbnb_cr.merge(stealing_date,on=['date'])
airbnb_cr=airbnb_cr.merge(violence_date,on=['date'])

airbnb_cr['loc_id'] = airbnb_cr.loc_id2
airbnb_cr['crim_loc_id'] = airbnb_cr.crim_loc_id2
airbnb_cr['homi_loc_id'] = airbnb_cr.homi_loc_id2
airbnb_cr['stea_loc_id'] = airbnb_cr.stea_loc_id2
airbnb_cr['viol_loc_id'] = airbnb_cr.viol_loc_id2

airbnb_cr = airbnb_cr.drop(columns=['loc_id2','crim_loc_id2','stea_loc_id2','viol_loc_id2','homi_loc_id2'])
airbnb_cr = airbnb_cr.assign(date_id=(airbnb_cr['date'].astype('category').cat.codes))
airbnb_cr = airbnb_cr.rename(index=str, columns={"crim_loc_id": "crimes_that_date","homi_loc_id": "homicides_that_date","viol_loc_id": "violence_that_date","stea_loc_id": "stealing_that_date"})
airbnb_cr = airbnb_cr.sort_values(by=['listing_id','date'])

In [0]:
airbnb_cr.head()

Unnamed: 0,listing_id,date,price,month,day,year,host_id,neighbourhood,latitude,longitude,room_type,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,lr_m,lr_d,lr_y,lat,lon,location,loc,loc_id,crimes_that_date,crimes_count_date,homicides_that_date,homicides_count_date,stealing_that_date,stealing_count_date,violence_that_date,violence_count_date,date_id
0,2384,2018-07-26,75.0,7,26,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 33, 34, 35, 37, 40, 45, 49, 50, 51, 56, 58...",731,"[121, 127]",2,"[0, 26, 28, 30, 38, 44, 53, 55, 56, 61, 64, 67...",305,"[25, 32, 41, 43, 48, 52, 53, 54, 56, 60, 63, 6...",186,15
268,2384,2018-07-29,69.0,7,29,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 36, 37, 38, 40, 41, 42, 46, 47, 48, 49, 50...",731,"[25, 48, 190]",3,"[0, 31, 34, 40, 42, 44, 54, 55, 68, 74, 76, 77...",280,"[0, 29, 32, 38, 50, 53, 55, 78, 80, 98, 105, 1...",249,18
847,2384,2018-08-05,65.0,8,5,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 32, 33, 36, 37, 43, 45, 47, 50, 51, 52, 57...",811,"[60, 76, 132, 146, 154, 161, 166, 200]",8,"[0, 29, 30, 43, 44, 50, 54, 55, 56, 73, 78, 81...",297,"[0, 24, 25, 39, 49, 51, 53, 54, 55, 57, 63, 66...",276,24
1343,2384,2018-10-01,65.0,10,1,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 32, 34, 36, 37, 41, 47, 50, 54, 56, 58, 60...",732,"[3, 175, 232]",4,"[0, 25, 30, 34, 43, 53, 55, 65, 66, 71, 78, 81...",298,"[0, 26, 39, 48, 50, 52, 53, 65, 66, 84, 93, 98...",221,72
1793,2384,2018-10-31,75.0,10,31,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 27, 33, 37, 41, 44, 46, 47, 48, 50, 51, 56...",802,[208],1,"[0, 37, 39, 40, 41, 49, 51, 64, 68, 71, 77, 85...",279,"[29, 33, 43, 55, 63, 65, 66, 70, 79, 80, 81, 8...",208,97


##Counting crimes on the vicinity of each Airbnb location for each listed date

This will be done in 3 steps, for each of Airbnb listing and date: 1) we'll obtain a list of the locations (given by an id) within 1km, 2km and 5km of the listing's location; 2) we'll obtain a list of which crimes happened on that date and 3) the intersection of both sets we'll give us the nearby crimes that day, which we will count by summing the occurences of crimes with these location ids on that date.

Note that our counts represent disjoint sets, that is, crimes within a 1km radius are not included in our counts of crimes within a 2km radius; the 2km counts actually refer to crimes that happened between 1km and 2km, and 5km counts refer to crimes that happened between 2km and 5km.

In [0]:
#Step 1: Finding locations of crimes near the listing's location 
airbnb_cr['crimes_loc_1km']=np.nan
airbnb_cr['crimes_loc_2km']=np.nan
airbnb_cr['crimes_loc_5km']=np.nan

airbnb_cr['homicides_loc_1km']=np.nan
airbnb_cr['homicides_loc_2km']=np.nan
airbnb_cr['homicides_loc_5km']=np.nan

airbnb_cr['stealing_loc_1km']=np.nan
airbnb_cr['stealing_loc_2km']=np.nan
airbnb_cr['stealing_loc_5km']=np.nan

airbnb_cr['violence_loc_1km']=np.nan
airbnb_cr['violence_loc_2km']=np.nan
airbnb_cr['violence_loc_5km']=np.nan



crimes_loc_1km=[None]*len(airbnb_cr)
crimes_loc_2km=[None]*len(airbnb_cr)
crimes_loc_5km=[None]*len(airbnb_cr)

homicides_loc_1km=[None]*len(airbnb_cr)
homicides_loc_2km=[None]*len(airbnb_cr)
homicides_loc_5km=[None]*len(airbnb_cr)

stealing_loc_1km=[None]*len(airbnb_cr)
stealing_loc_2km=[None]*len(airbnb_cr)
stealing_loc_5km=[None]*len(airbnb_cr)

violence_loc_1km=[None]*len(airbnb_cr)
violence_loc_2km=[None]*len(airbnb_cr)
violence_loc_5km=[None]*len(airbnb_cr)



for i in range(len(airbnb_cr)):
  loc_id=airbnb_cr.loc_id.iloc[i]
  
  crimes_loc_1km[i]=indices(list(dist1_cr.iloc[loc_id]),1)
  crimes_loc_2km[i]=list(np.setdiff1d(indices(list(dist2_cr.loc[loc_id]),1),indices(list(dist1_cr.loc[loc_id]),1)))
  crimes_loc_5km[i]=list(np.setdiff1d(indices(list(dist5_cr.loc[loc_id]),1),indices(list(dist2_cr.loc[loc_id]),1)))

  homicides_loc_1km[i]=indices(list(dist1_hm.iloc[loc_id]),1)
  homicides_loc_2km[i]=list(np.setdiff1d(indices(list(dist2_hm.loc[loc_id]),1),indices(list(dist1_hm.loc[loc_id]),1)))
  homicides_loc_5km[i]=list(np.setdiff1d(indices(list(dist5_hm.loc[loc_id]),1),indices(list(dist2_hm.loc[loc_id]),1)))

  stealing_loc_1km[i]=indices(list(dist1_st.iloc[loc_id]),1)
  stealing_loc_2km[i]=list(np.setdiff1d(indices(list(dist2_st.loc[loc_id]),1),indices(list(dist1_st.loc[loc_id]),1)))
  stealing_loc_5km[i]=list(np.setdiff1d(indices(list(dist5_st.loc[loc_id]),1),indices(list(dist2_st.loc[loc_id]),1)))
  
  violence_loc_1km[i]=indices(list(dist1_vi.iloc[loc_id]),1)
  violence_loc_2km[i]=list(np.setdiff1d(indices(list(dist2_vi.loc[loc_id]),1),indices(list(dist1_vi.loc[loc_id]),1)))
  violence_loc_5km[i]=list(np.setdiff1d(indices(list(dist5_vi.loc[loc_id]),1),indices(list(dist2_vi.loc[loc_id]),1)))

  
airbnb_cr.crimes_loc_1km=crimes_loc_1km
airbnb_cr.crimes_loc_2km=crimes_loc_2km
airbnb_cr.crimes_loc_5km=crimes_loc_5km

airbnb_cr.homicides_loc_1km=homicides_loc_1km
airbnb_cr.homicides_loc_2km=homicides_loc_2km
airbnb_cr.homicides_loc_5km=homicides_loc_5km

airbnb_cr.stealing_loc_1km=stealing_loc_1km
airbnb_cr.stealing_loc_2km=stealing_loc_2km
airbnb_cr.stealing_loc_5km=stealing_loc_5km

airbnb_cr.violence_loc_1km=violence_loc_1km
airbnb_cr.violence_loc_2km=violence_loc_2km
airbnb_cr.violence_loc_5km=violence_loc_5km

In [0]:
#Step 2: Finding which close crime locations had crimes on that date
airbnb_cr['crimes_1km']=np.nan
airbnb_cr['crimes_2km']=np.nan
airbnb_cr['crimes_5km']=np.nan

airbnb_cr['homicides_1km']=np.nan
airbnb_cr['homicides_2km']=np.nan
airbnb_cr['homicides_5km']=np.nan

airbnb_cr['stealing_1km']=np.nan
airbnb_cr['stealing_2km']=np.nan
airbnb_cr['stealing_5km']=np.nan

airbnb_cr['violence_1km']=np.nan
airbnb_cr['violence_2km']=np.nan
airbnb_cr['violence_5km']=np.nan

crimes_1km=[None]*len(airbnb_cr)
crimes_2km=[None]*len(airbnb_cr)
crimes_5km=[None]*len(airbnb_cr)

homicides_1km=[None]*len(airbnb_cr)
homicides_2km=[None]*len(airbnb_cr)
homicides_5km=[None]*len(airbnb_cr)

violence_1km=[None]*len(airbnb_cr)
violence_2km=[None]*len(airbnb_cr)
violence_5km=[None]*len(airbnb_cr)

stealing_1km=[None]*len(airbnb_cr)
stealing_2km=[None]*len(airbnb_cr)
stealing_5km=[None]*len(airbnb_cr)



for i in range(len(airbnb_cr)):
  crimes_that_date = airbnb_cr.crimes_that_date.iloc[i]
  homicides_that_date = airbnb_cr.homicides_that_date.iloc[i]
  stealing_that_date = airbnb_cr.stealing_that_date.iloc[i]
  violence_that_date = airbnb_cr.violence_that_date.iloc[i]
  
  crimes_1km[i] = list(set(crimes_that_date).intersection(airbnb_cr.crimes_loc_1km.iloc[i]))
  crimes_2km[i] = list(set(crimes_that_date).intersection(airbnb_cr.crimes_loc_2km.iloc[i]))
  crimes_5km[i] = list(set(crimes_that_date).intersection(airbnb_cr.crimes_loc_5km.iloc[i]))
  
  homicides_1km[i] = list(set(homicides_that_date).intersection(airbnb_cr.crimes_loc_1km.iloc[i]))
  homicides_2km[i] = list(set(homicides_that_date).intersection(airbnb_cr.crimes_loc_2km.iloc[i]))
  homicides_5km[i] = list(set(homicides_that_date).intersection(airbnb_cr.crimes_loc_5km.iloc[i]))
  
  violence_1km[i] = list(set(violence_that_date).intersection(airbnb_cr.crimes_loc_1km.iloc[i]))
  violence_2km[i] = list(set(violence_that_date).intersection(airbnb_cr.crimes_loc_2km.iloc[i]))
  violence_5km[i] = list(set(violence_that_date).intersection(airbnb_cr.crimes_loc_5km.iloc[i]))
  
  stealing_1km[i] = list(set(stealing_that_date).intersection(airbnb_cr.crimes_loc_1km.iloc[i]))
  stealing_2km[i] = list(set(stealing_that_date).intersection(airbnb_cr.crimes_loc_2km.iloc[i]))
  stealing_5km[i] = list(set(stealing_that_date).intersection(airbnb_cr.crimes_loc_5km.iloc[i]))
  
airbnb_cr.crimes_1km=crimes_1km
airbnb_cr.crimes_2km=crimes_2km
airbnb_cr.crimes_5km=crimes_5km

airbnb_cr.homicides_1km=homicides_1km
airbnb_cr.homicides_2km=homicides_2km
airbnb_cr.homicides_5km=homicides_5km

airbnb_cr.stealing_1km=stealing_1km
airbnb_cr.stealing_2km=stealing_2km
airbnb_cr.stealing_5km=stealing_5km

airbnb_cr.violence_1km=violence_1km
airbnb_cr.violence_2km=violence_2km
airbnb_cr.violence_5km=violence_5km

In [0]:
#Step 3: Counting how many crimes happened that date on those close locations
airbnb_cr['crimes_1km_count']=np.nan
airbnb_cr['crimes_2km_count']=np.nan
airbnb_cr['crimes_5km_count']=np.nan

airbnb_cr['homicides_1km_count']=np.nan
airbnb_cr['homicides_2km_count']=np.nan
airbnb_cr['homicides_5km_count']=np.nan

airbnb_cr['stealing_1km_count']=np.nan
airbnb_cr['stealing_2km_count']=np.nan
airbnb_cr['stealing_5km_count']=np.nan

airbnb_cr['violence_1km_count']=np.nan
airbnb_cr['violence_2km_count']=np.nan
airbnb_cr['violence_5km_count']=np.nan

crimes_1km_count=[None]*len(airbnb_cr)
crimes_2km_count=[None]*len(airbnb_cr)
crimes_5km_count=[None]*len(airbnb_cr)

homicides_1km_count=[None]*len(airbnb_cr)
homicides_2km_count=[None]*len(airbnb_cr)
homicides_5km_count=[None]*len(airbnb_cr)

stealing_1km_count=[None]*len(airbnb_cr)
stealing_2km_count=[None]*len(airbnb_cr)
stealing_5km_count=[None]*len(airbnb_cr)

violence_1km_count=[None]*len(airbnb_cr)
violence_2km_count=[None]*len(airbnb_cr)
violence_5km_count=[None]*len(airbnb_cr)

for i in range(len(airbnb_cr)):
  c=crimes[crimes["date"]==airbnb_cr.date.iloc[i]]
  h=homicides[homicides["date"]==airbnb_cr.date.iloc[i]]
  s=stealing[stealing["date"]==airbnb_cr.date.iloc[i]]
  v=violence[violence["date"]==airbnb_cr.date.iloc[i]]
    
  l1_c=airbnb_cr.crimes_1km.iloc[i]
  l2_c=airbnb_cr.crimes_2km.iloc[i]
  l5_c=airbnb_cr.crimes_5km.iloc[i]
  
  l1_h=airbnb_cr.homicides_1km.iloc[i]
  l2_h=airbnb_cr.homicides_2km.iloc[i]
  l5_h=airbnb_cr.homicides_5km.iloc[i]
  
  l1_s=airbnb_cr.stealing_1km.iloc[i]
  l2_s=airbnb_cr.stealing_2km.iloc[i]
  l5_s=airbnb_cr.stealing_5km.iloc[i]
  
  l1_v=airbnb_cr.violence_1km.iloc[i]
  l2_v=airbnb_cr.violence_2km.iloc[i]
  l5_v=airbnb_cr.violence_5km.iloc[i]
  
  c1_c=c[c["crim_loc_id2"].isin(l1_c)]
  c2_c=c[c["crim_loc_id2"].isin(l2_c)]
  c5_c=c[c["crim_loc_id2"].isin(l5_c)]  
  
  c1_h=h[h["homi_loc_id2"].isin(l1_h)]
  c2_h=h[h["homi_loc_id2"].isin(l2_h)]
  c5_h=h[h["homi_loc_id2"].isin(l5_h)]  
  
  c1_s=s[s["stea_loc_id2"].isin(l1_s)]
  c2_s=s[s["stea_loc_id2"].isin(l2_s)]
  c5_s=s[s["stea_loc_id2"].isin(l5_s)]  
  
  c1_v=v[v["viol_loc_id2"].isin(l1_v)]
  c2_v=v[v["viol_loc_id2"].isin(l2_v)]
  c5_v=v[v["viol_loc_id2"].isin(l5_v)]  
    
  crimes_1km_count[i]=len(c1_c)
  crimes_2km_count[i]=len(c2_c)
  crimes_5km_count[i]=len(c5_c)  
  
  homicides_1km_count[i]=len(c1_h)
  homicides_2km_count[i]=len(c2_h)
  homicides_5km_count[i]=len(c5_h)  
  
  violence_1km_count[i]=len(c1_v)
  violence_2km_count[i]=len(c2_v)
  violence_5km_count[i]=len(c5_v)  
  
  stealing_1km_count[i]=len(c1_s)
  stealing_2km_count[i]=len(c2_s)
  stealing_5km_count[i]=len(c5_s)   
  
airbnb_cr.crimes_1km_count=crimes_1km_count
airbnb_cr.crimes_2km_count=crimes_2km_count
airbnb_cr.crimes_5km_count=crimes_5km_count

airbnb_cr.homicides_1km_count=homicides_1km_count
airbnb_cr.homicides_2km_count=homicides_2km_count
airbnb_cr.homicides_5km_count=homicides_5km_count

airbnb_cr.stealing_1km_count=stealing_1km_count
airbnb_cr.stealing_2km_count=stealing_2km_count
airbnb_cr.stealing_5km_count=stealing_5km_count

airbnb_cr.violence_1km_count=violence_1km_count
airbnb_cr.violence_2km_count=violence_2km_count
airbnb_cr.violence_5km_count=violence_5km_count

Now we have the database we'll use for our regression.

In [0]:
airbnb_cr.head()

Unnamed: 0,listing_id,date,price,month,day,year,host_id,neighbourhood,latitude,longitude,room_type,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,lr_m,lr_d,lr_y,lat,lon,location,loc,loc_id,crimes_that_date,crimes_count_date,homicides_that_date,homicides_count_date,stealing_that_date,stealing_count_date,violence_that_date,violence_count_date,date_id,crimes_loc_1km,crimes_loc_2km,crimes_loc_5km,homicides_loc_1km,homicides_loc_2km,homicides_loc_5km,stealing_loc_1km,stealing_loc_2km,stealing_loc_5km,violence_loc_1km,violence_loc_2km,violence_loc_5km,crimes_1km,crimes_2km,crimes_5km,homicides_1km,homicides_2km,homicides_5km,stealing_1km,stealing_2km,stealing_5km,violence_1km,violence_2km,violence_5km,crimes_1km_count,crimes_2km_count,crimes_5km_count,homicides_1km_count,homicides_2km_count,homicides_5km_count,stealing_1km_count,stealing_2km_count,stealing_5km_count,violence_1km_count,violence_2km_count,violence_5km_count
0,2384,2018-07-26,75.0,7,26,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 33, 34, 35, 37, 40, 45, 49, 50, 51, 56, 58...",731,"[121, 127]",2,"[0, 26, 28, 30, 38, 44, 53, 55, 56, 61, 64, 67...",305,"[25, 32, 41, 43, 48, 52, 53, 54, 56, 60, 63, 6...",186,15,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...","[154, 155, 156, 157, 158, 170, 171, 172, 173, ...","[235, 236, 237]","[214, 215, 216, 234, 255, 256, 257, 258, 619, ...","[153, 154, 155, 156, 157, 169, 170, 171, 172, ...","[244, 246]","[224, 225, 264, 265, 266, 633, 223]","[262, 263, 279, 281, 162, 163, 164, 165, 166, ...",[],[],[],[244],"[224, 267]","[289, 198, 166, 200, 263, 178, 184, 185, 282, ...",[],"[224, 243]","[163, 637, 263, 200, 202, 239, 281, 186, 221, ...",4,14,89,0,0,0,1,2,15,0,2,11
268,2384,2018-07-29,69.0,7,29,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 36, 37, 38, 40, 41, 42, 46, 47, 48, 49, 50...",731,"[25, 48, 190]",3,"[0, 31, 34, 40, 42, 44, 54, 55, 68, 74, 76, 77...",280,"[0, 29, 32, 38, 50, 53, 55, 78, 80, 98, 105, 1...",249,18,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...","[154, 155, 156, 157, 158, 170, 171, 172, 173, ...","[235, 236, 237]","[214, 215, 216, 234, 255, 256, 257, 258, 619, ...","[153, 154, 155, 156, 157, 169, 170, 171, 172, ...",[],"[224, 225, 265, 266, 267, 243, 223]","[262, 263, 278, 279, 280, 281, 289, 162, 163, ...",[],[],[],[],"[225, 264, 265, 267, 243]","[162, 163, 291, 262, 263, 200, 202, 278, 184, ...",[],"[224, 225, 264, 265]","[292, 165, 166, 199, 263, 282, 304, 278, 183, ...",0,20,76,0,0,0,0,5,14,0,5,16
847,2384,2018-08-05,65.0,8,5,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 32, 33, 36, 37, 43, 45, 47, 50, 51, 52, 57...",811,"[60, 76, 132, 146, 154, 161, 166, 200]",8,"[0, 29, 30, 43, 44, 50, 54, 55, 56, 73, 78, 81...",297,"[0, 24, 25, 39, 49, 51, 53, 54, 55, 57, 63, 66...",276,24,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...","[154, 155, 156, 157, 158, 170, 171, 172, 173, ...","[235, 236, 237]","[214, 215, 216, 234, 255, 256, 257, 258, 619, ...","[153, 154, 155, 156, 157, 169, 170, 171, 172, ...",[246],"[265, 266, 267, 243, 223]","[262, 263, 280, 281, 289, 162, 163, 164, 165, ...",[],[],"[200, 166]",[],"[265, 267, 223]","[280, 162, 163, 204, 240, 180, 279, 184, 281, ...",[246],"[224, 225, 264]","[163, 165, 166, 197, 281, 203, 180, 278, 185, ...",1,12,94,0,0,2,0,3,11,2,5,14
1343,2384,2018-10-01,65.0,10,1,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 32, 34, 36, 37, 41, 47, 50, 54, 56, 58, 60...",732,"[3, 175, 232]",4,"[0, 25, 30, 34, 43, 53, 55, 65, 66, 71, 78, 81...",298,"[0, 26, 39, 48, 50, 52, 53, 65, 66, 84, 93, 98...",221,72,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...","[154, 155, 156, 157, 158, 170, 171, 172, 173, ...","[235, 236, 237]","[214, 215, 216, 234, 255, 256, 257, 258, 619, ...","[153, 154, 155, 156, 157, 169, 170, 171, 172, ...",[244],"[224, 264, 265, 266, 223]","[263, 279, 280, 281, 289, 162, 163, 164, 165, ...",[],[],[],[246],"[632, 266, 267, 223]","[203, 240, 241, 182, 282, 222]",[246],"[264, 266]","[290, 164, 165, 166, 263, 200, 292, 203, 183, ...",1,7,86,0,0,0,1,5,7,1,5,17
1793,2384,2018-10-31,75.0,10,31,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 27, 33, 37, 41, 44, 46, 47, 48, 50, 51, 56...",802,[208],1,"[0, 37, 39, 40, 41, 49, 51, 64, 68, 71, 77, 85...",279,"[29, 33, 43, 55, 63, 65, 66, 70, 79, 80, 81, 8...",208,97,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...","[154, 155, 156, 157, 158, 170, 171, 172, 173, ...","[235, 236, 237]","[214, 215, 216, 234, 255, 256, 257, 258, 619, ...","[153, 154, 155, 156, 157, 169, 170, 171, 172, ...","[244, 245]","[225, 264, 265, 266, 267, 243, 633, 223]","[263, 278, 279, 281, 289, 162, 163, 164, 165, ...",[],[],[],[],"[224, 264, 267, 632, 223]","[163, 198, 202, 282, 179, 181, 184, 186, 639]",[],[224],"[280, 162, 164, 166, 199, 262, 281, 282, 182, ...",7,58,81,0,0,0,0,7,15,0,1,17
1969,2384,2018-11-09,65.0,11,9,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 2, 32, 38, 40, 49, 50, 51, 52, 55, 56, 57,...",663,[236],1,"[0, 25, 31, 49, 53, 65, 66, 67, 68, 77, 80, 83...",281,"[0, 2, 41, 42, 43, 44, 47, 49, 54, 65, 67, 69,...",161,102,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...","[154, 155, 156, 157, 158, 170, 171, 172, 173, ...","[235, 236, 237]","[214, 215, 216, 234, 255, 256, 257, 258, 619, ...","[153, 154, 155, 156, 157, 169, 170, 171, 172, ...",[],"[264, 243, 223]","[262, 263, 279, 282, 162, 163, 164, 165, 166, ...",[],[],[],[245],"[264, 267]","[289, 166, 203, 304, 242, 179, 180, 181, 184, ...",[245],"[265, 266, 223]","[290, 166, 281, 202, 178, 242, 183, 184, 185, ...",0,8,78,0,0,0,1,4,12,1,3,10
2154,2384,2018-11-12,65.0,11,12,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 23, 27, 37, 38, 41, 45, 49, 51, 52, 56, 57...",631,"[45, 166]",2,"[0, 20, 55, 56, 66, 71, 81, 91, 94, 100, 112, ...",242,"[0, 16, 30, 33, 37, 43, 44, 48, 55, 69, 71, 77...",162,105,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...","[154, 155, 156, 157, 158, 170, 171, 172, 173, ...","[235, 236, 237]","[214, 215, 216, 234, 255, 256, 257, 258, 619, ...","[153, 154, 155, 156, 157, 169, 170, 171, 172, ...","[244, 245]",[266],"[263, 279, 280, 289, 162, 163, 164, 165, 166, ...",[],[],[166],[246],"[264, 265, 266]","[165, 262, 203, 242, 181, 282, 637]",[],"[266, 267]","[165, 304, 183, 185, 221]",3,3,70,0,0,1,1,4,9,0,2,9
2687,2384,2018-11-30,65.0,11,30,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 26, 30, 36, 37, 40, 42, 47, 49, 50, 51, 54...",674,"[191, 234]",2,"[0, 19, 23, 30, 42, 43, 49, 52, 55, 63, 64, 68...",266,"[0, 34, 50, 52, 65, 71, 78, 90, 92, 106, 107, ...",193,119,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...","[154, 155, 156, 157, 158, 170, 171, 172, 173, ...","[235, 236, 237]","[214, 215, 216, 234, 255, 256, 257, 258, 619, ...","[153, 154, 155, 156, 157, 169, 170, 171, 172, ...","[244, 246]","[225, 266, 243, 265]","[262, 263, 278, 279, 280, 281, 282, 289, 162, ...",[],[],[],[],"[225, 223]","[290, 165, 281, 201, 242, 184, 185, 186, 220, ...",[],"[264, 225, 632]","[290, 163, 165, 202, 203, 278, 184]",2,8,69,0,0,0,0,2,12,0,4,8
2914,2384,2018-12-03,75.0,12,3,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 27, 30, 31, 34, 36, 37, 38, 40, 41, 46, 50...",688,"[14, 24, 68, 88, 101, 164, 210]",7,"[0, 24, 31, 49, 50, 52, 55, 61, 64, 81, 91, 92...",309,"[0, 26, 32, 42, 52, 63, 65, 74, 80, 81, 92, 95...",172,122,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...","[154, 155, 156, 157, 158, 170, 171, 172, 173, ...","[235, 236, 237]","[214, 215, 216, 234, 255, 256, 257, 258, 619, ...","[153, 154, 155, 156, 157, 169, 170, 171, 172, ...","[244, 245]","[264, 225, 266, 223]","[262, 263, 279, 281, 162, 163, 164, 165, 166, ...",[],[],[164],"[245, 246]","[224, 225, 264, 267, 243]","[291, 164, 166, 198, 262, 201, 304, 241, 181, ...",[244],[223],"[289, 163, 291, 165, 166, 199, 179, 278, 183, ...",2,9,89,0,0,1,2,7,15,1,1,11
3148,2384,2019-02-17,49.0,2,17,2019,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 5, 34, 37, 38, 40, 41, 47, 48, 49, 56, 59,...",572,[124],1,"[0, 3, 40, 53, 82, 85, 86, 93, 118, 124, 128, ...",205,"[0, 26, 29, 30, 32, 33, 39, 40, 41, 48, 51, 53...",199,165,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...","[154, 155, 156, 157, 158, 170, 171, 172, 173, ...","[235, 236, 237]","[214, 215, 216, 234, 255, 256, 257, 258, 619, ...","[153, 154, 155, 156, 157, 169, 170, 171, 172, ...",[],"[224, 264, 267, 243, 223]","[262, 263, 279, 162, 163, 164, 290, 166, 178, ...",[],[],[],[],"[225, 265, 223]","[163, 262, 279, 184, 185, 282, 221]",[],[264],"[163, 165, 304, 182, 183, 184, 185, 186]",0,12,58,0,0,0,0,3,7,0,1,10


In [0]:
#airbnb_cr.to_csv('airbnb_cr.csv')
#!cp airbnb_cr.csv drive/My\ Drive/

## Statistical analysis

As a preliminary step, we'll check correlations between price and our counts of nearby crimes on the date of the listing.

In [0]:
Z_c=airbnb_cr[['price','crimes_1km_count','crimes_2km_count','crimes_5km_count']]
Z_h=airbnb_cr[['price','homicides_1km_count','homicides_2km_count','homicides_5km_count']]
Z_v=airbnb_cr[['price','violence_1km_count','violence_2km_count','violence_5km_count']]
Z_s=airbnb_cr[['price','stealing_1km_count','stealing_2km_count','stealing_5km_count']]

In [0]:
Z_c.corr()

Unnamed: 0,price,crimes_1km_count,crimes_2km_count,crimes_5km_count
price,1.0,0.156351,0.159833,0.033166
crimes_1km_count,0.156351,1.0,0.566875,-0.011291
crimes_2km_count,0.159833,0.566875,1.0,0.267541
crimes_5km_count,0.033166,-0.011291,0.267541,1.0


In [0]:
Z_h.corr()

Unnamed: 0,price,homicides_1km_count,homicides_2km_count,homicides_5km_count
price,1.0,-0.012954,-0.035319,-0.074769
homicides_1km_count,-0.012954,1.0,0.021951,0.02562
homicides_2km_count,-0.035319,0.021951,1.0,0.118846
homicides_5km_count,-0.074769,0.02562,0.118846,1.0


In [0]:
Z_v.corr()

Unnamed: 0,price,violence_1km_count,violence_2km_count,violence_5km_count
price,1.0,0.086223,0.057213,0.082182
violence_1km_count,0.086223,1.0,0.370663,0.3303
violence_2km_count,0.057213,0.370663,1.0,0.530205
violence_5km_count,0.082182,0.3303,0.530205,1.0


In [0]:
Z_s.corr()

Unnamed: 0,price,stealing_1km_count,stealing_2km_count,stealing_5km_count
price,1.0,0.082256,0.003433,-0.002306
stealing_1km_count,0.082256,1.0,0.295942,0.095814
stealing_2km_count,0.003433,0.295942,1.0,0.330896
stealing_5km_count,-0.002306,0.095814,0.330896,1.0


Correlations above are very close to 0, which suggests price and occurences of those types of crimes are not much correlated. With the exception of homicides, crimes still seem to have a very weak but positive correlation with prices. 

Given we have left out so many important variables, we'll have better insight after doing a regression, which will take into account factors such as the type of room (entire home/apt, private room or shared room), neighborhood, listing availability, the listing history (number of reviews, reviews per month), as well as including counts of each of our categories of crime for our 3 radiuses. Note that even when take this into account, we still have a major omitted variable problem, since we don't have any kind of socioeconomical and demographical information other than what's encapsulated by the neighborhood dummy variable, but as said before, this is a rough estimate that's supposed to hint to us whether there's an evidence that nearby criminal occurences influence the supply-demand model for Airbnb listings, particularly on the city of Chicago.

We'll set up our regression by selecting which variables to include, as well as creating dummies for neighbourhoods and room types. Below, we'll check the results of our regression.

In [0]:
X = airbnb_cr[['crimes_1km_count','crimes_2km_count','crimes_5km_count','crimes_count_date',
               'homicides_1km_count','homicides_2km_count','homicides_5km_count','homicides_count_date',
               'violence_1km_count','violence_2km_count','violence_5km_count','violence_count_date',
               'stealing_1km_count','stealing_2km_count','stealing_5km_count','stealing_count_date',
               'room_type','neighbourhood','number_of_reviews','reviews_per_month','availability_365']]
y = airbnb_cr[['price']]

X = pd.concat([X, pd.get_dummies(X.room_type, prefix='room_type', drop_first=True)], axis=1)
X = pd.concat([X, pd.get_dummies(X.neighbourhood, drop_first=True)], axis=1)

X=X.drop(['neighbourhood','room_type'],axis=1)

reg = LinearRegression().fit(X, y)


X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

  return ptp(axis=axis, out=out, **kwargs)


                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.292
Model:                            OLS   Adj. R-squared:                  0.291
Method:                 Least Squares   F-statistic:                     448.6
Date:                Wed, 04 Sep 2019   Prob (F-statistic):               0.00
Time:                        17:43:47   Log-Likelihood:            -5.8394e+05
No. Observations:               97060   AIC:                         1.168e+06
Df Residuals:                   96970   BIC:                         1.169e+06
Df Model:                          89                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                     30

For further explanation and commentary on the results, check the Report section of our website.

#2 - Airbnb listed price changes x criminal occurences variation

For our second analysis, we'll go back to our cal_change dataset, which contains all listing for whom the hosts changed the listed price at least once before the listing date. On the occasion prices changed more than once, we kept the earliest and latest prices. As before, we'll only keep the entries for the dates which had reviews and crime information.

It's important to note we'll assume the prices were changed the day the new price was scraped by insideairbnb.com. This a very rough estimative but we chose it for simplicity and conservativeness, given we have no way of knowing which date prices were changed otherwise. Since most scraping iterations were done roughly a month apart, we risk missing on our estimate by up to a month or so, which is obviously not ideal.

In [0]:
#cal_change=pd.read_csv('/content/cal_change.csv')
#listings=pd.read_csv('/content/listings.csv')
#crimes_date=pd.read_csv('/content/crimes_date.csv')
#homicides_date=pd.read_csv('/content/homicide_date.csv')
#violence_date=pd.read_csv('/content/violence_date.csv')
#stealing_date=pd.read_csv('/content/stealing_date.csv')

In [0]:
cal_change=cal_change.drop(cal_change[(cal_change.year==2018) & (cal_change.month==7) & (cal_change.day<9)].index)
cal_change=cal_change.drop(cal_change[(cal_change.year>2019)].index)
cal_change=cal_change.drop(cal_change[(cal_change.year==2019) & (cal_change.month>6)].index)                           
cal_change.head()

Unnamed: 0,listing_id,date,price,scr_date,month,day,year,review
13873240,2384,2018-06-11,65.0,2018-05-18,6,11,2018,0
14493696,2384,2018-06-11,80.0,2018-04-15,6,11,2018,0
13203073,2384,2018-07-26,75.0,2018-07-18,7,26,2018,1
13873224,2384,2018-07-26,65.0,2018-05-18,7,26,2018,1
14493666,2384,2018-07-26,60.0,2018-04-15,7,26,2018,1


In [0]:
cal_change_r=cal_change[cal_change['review']==1]
cal_change_r_latest=cal_change_r.drop_duplicates(subset=['listing_id','date'],keep='first')
cal_change_r_latest=cal_change_r_latest.rename(index=str, columns={"price": "price_latest","scr_date": "scr_date_latest"})
cal_change_r_earliest=cal_change_r.drop_duplicates(subset=['listing_id','date'],keep='last')
cal_change_r_earliest=cal_change_r_earliest.rename(index=str, columns={"price": "price_earliest","scr_date": "scr_date_earliest"})
cal_change_r_earliest=cal_change_r_earliest.drop(['month','day','year','review'],axis=1)
cal_change_r=cal_change_r_latest.merge(cal_change_r_earliest,on=['listing_id','date'])
cal_change_r=cal_change_r.merge(listings, on='listing_id')
cal_change_r=cal_change_r.drop(['loc_id'], axis=1)
cal_change_r=cal_change_r.rename(index=str, columns={"loc_id2": "loc_id"})

cal_change_r['scr_lat_str']=cal_change_r.scr_date_latest.astype('str')
cal_change_r['scr_lat_m']=cal_change_r.scr_lat_str.apply(lambda x: int(x[5:7]))
cal_change_r['scr_lat_y']=cal_change_r.scr_lat_str.apply(lambda x: int(x[0:4]))
cal_change_r['scr_ear_str']=cal_change_r.scr_date_earliest.astype('str')
cal_change_r['scr_ear_m']=cal_change_r.scr_ear_str.apply(lambda x: int(x[5:7]))
cal_change_r['scr_ear_y']=cal_change_r.scr_ear_str.apply(lambda x: int(x[0:4]))
cal_change_r=cal_change_r.drop(['scr_lat_str','scr_ear_str'],axis=1)

cal_change_r=cal_change_r.drop(cal_change_r[(cal_change_r.scr_lat_y==2018) & (cal_change_r.scr_lat_m<7)].index)
cal_change_r=cal_change_r.drop(cal_change_r[(cal_change_r.scr_ear_y==2018) & (cal_change_r.scr_ear_m<7)].index)

cal_change_r['price_dif']=cal_change_r['price_latest']-cal_change_r['price_earliest']

cal_change_r.head()

Unnamed: 0,listing_id,date,price_latest,scr_date_latest,month,day,year,review,price_earliest,scr_date_earliest,host_id,neighbourhood,latitude,longitude,room_type,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,lr_m,lr_d,lr_y,lat,lon,location,loc,loc_id,scr_lat_m,scr_lat_y,scr_ear_m,scr_ear_y,price_dif
4,2384,2018-11-05,65.0,2018-10-11,11,5,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0
5,2384,2018-11-09,65.0,2018-10-11,11,9,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0
6,2384,2018-11-12,65.0,2018-10-11,11,12,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0
7,2384,2018-11-30,65.0,2018-10-11,11,30,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0
8,2384,2018-12-03,75.0,2018-11-15,12,3,2018,1,80.0,2018-09-14,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,11,2018,9,2018,-5.0


##Counting crimes on the vicinity of each Airbnb location for both scraping dates

Like we did on part 1 of our analysis, we'll count crimes within 1km, 2km and 5km of each location, but now we'll do it for the two dates the prices were scraped. After that, we have an extra step where we'll calculate the difference for prices and crime counts between both dates.

In [0]:
#Step 1: Finding locations of crimes near the listing's location 
cal_change_r['crimes_loc_1km']=np.nan
cal_change_r['crimes_loc_2km']=np.nan
cal_change_r['crimes_loc_5km']=np.nan

cal_change_r['homicides_loc_1km']=np.nan
cal_change_r['homicides_loc_2km']=np.nan
cal_change_r['homicides_loc_5km']=np.nan

cal_change_r['stealing_loc_1km']=np.nan
cal_change_r['stealing_loc_2km']=np.nan
cal_change_r['stealing_loc_5km']=np.nan

cal_change_r['violence_loc_1km']=np.nan
cal_change_r['violence_loc_2km']=np.nan
cal_change_r['violence_loc_5km']=np.nan


crimes_loc_1km=[None]*len(cal_change_r)
crimes_loc_2km=[None]*len(cal_change_r)
crimes_loc_5km=[None]*len(cal_change_r)

homicides_loc_1km=[None]*len(cal_change_r)
homicides_loc_2km=[None]*len(cal_change_r)
homicides_loc_5km=[None]*len(cal_change_r)

stealing_loc_1km=[None]*len(cal_change_r)
stealing_loc_2km=[None]*len(cal_change_r)
stealing_loc_5km=[None]*len(cal_change_r)

violence_loc_1km=[None]*len(cal_change_r)
violence_loc_2km=[None]*len(cal_change_r)
violence_loc_5km=[None]*len(cal_change_r)



for i in range(len(cal_change_r)):
  loc_id=cal_change_r.loc_id.iloc[i]
  
  crimes_loc_1km[i]=indices(list(dist1_cr.iloc[loc_id]),1)
  crimes_loc_2km[i]=list(np.setdiff1d(indices(list(dist2_cr.loc[loc_id]),1),indices(list(dist1_cr.loc[loc_id]),1)))
  crimes_loc_5km[i]=list(np.setdiff1d(indices(list(dist5_cr.loc[loc_id]),1),indices(list(dist2_cr.loc[loc_id]),1)))

  homicides_loc_1km[i]=indices(list(dist1_hm.iloc[loc_id]),1)
  homicides_loc_2km[i]=list(np.setdiff1d(indices(list(dist2_hm.loc[loc_id]),1),indices(list(dist1_hm.loc[loc_id]),1)))
  homicides_loc_5km[i]=list(np.setdiff1d(indices(list(dist5_hm.loc[loc_id]),1),indices(list(dist2_hm.loc[loc_id]),1)))

  stealing_loc_1km[i]=indices(list(dist1_st.iloc[loc_id]),1)
  stealing_loc_2km[i]=list(np.setdiff1d(indices(list(dist2_st.loc[loc_id]),1),indices(list(dist1_st.loc[loc_id]),1)))
  stealing_loc_5km[i]=list(np.setdiff1d(indices(list(dist5_st.loc[loc_id]),1),indices(list(dist2_st.loc[loc_id]),1)))
  
  violence_loc_1km[i]=indices(list(dist1_vi.iloc[loc_id]),1)
  violence_loc_2km[i]=list(np.setdiff1d(indices(list(dist2_vi.loc[loc_id]),1),indices(list(dist1_vi.loc[loc_id]),1)))
  violence_loc_5km[i]=list(np.setdiff1d(indices(list(dist5_vi.loc[loc_id]),1),indices(list(dist2_vi.loc[loc_id]),1)))

  
cal_change_r.crimes_loc_1km=crimes_loc_1km
cal_change_r.crimes_loc_2km=crimes_loc_2km
cal_change_r.crimes_loc_5km=crimes_loc_5km

cal_change_r.homicides_loc_1km=homicides_loc_1km
cal_change_r.homicides_loc_2km=homicides_loc_2km
cal_change_r.homicides_loc_5km=homicides_loc_5km

cal_change_r.stealing_loc_1km=stealing_loc_1km
cal_change_r.stealing_loc_2km=stealing_loc_2km
cal_change_r.stealing_loc_5km=stealing_loc_5km

cal_change_r.violence_loc_1km=violence_loc_1km
cal_change_r.violence_loc_2km=violence_loc_2km
cal_change_r.violence_loc_5km=violence_loc_5km

In [0]:
#Step 2.1: Finding which close crime locations had crimes on that date
cal_change_r['crimes_1km_lat']=np.nan
cal_change_r['crimes_2km_lat']=np.nan
cal_change_r['crimes_5km_lat']=np.nan
cal_change_r['crimes_1km_ear']=np.nan
cal_change_r['crimes_2km_ear']=np.nan
cal_change_r['crimes_5km_ear']=np.nan

crimes_1km_lat=[None]*len(cal_change_r)
crimes_2km_lat=[None]*len(cal_change_r)
crimes_5km_lat=[None]*len(cal_change_r)
crimes_1km_ear=[None]*len(cal_change_r)
crimes_2km_ear=[None]*len(cal_change_r)
crimes_5km_ear=[None]*len(cal_change_r)

for i in range(len(cal_change_r)):
  date_lat = cal_change_r.scr_date_latest.iloc[i]
  date_ear = cal_change_r.scr_date_earliest.iloc[i]
  
  crimes_date_lat = crimes_date[crimes_date['date']==date_lat].crim_loc_id2
  crimes_date_ear = crimes_date[crimes_date['date']==date_ear].crim_loc_id2
 
  crimes_1km_lat[i] = list(set(crimes_date_lat[0]).intersection(cal_change_r.crimes_loc_1km.iloc[i]))
  crimes_1km_ear[i] = list(set(crimes_date_ear[0]).intersection(cal_change_r.crimes_loc_1km.iloc[i]))
  crimes_2km_lat[i] = list(set(crimes_date_lat[0]).intersection(cal_change_r.crimes_loc_2km.iloc[i]))
  crimes_2km_ear[i] = list(set(crimes_date_ear[0]).intersection(cal_change_r.crimes_loc_2km.iloc[i]))
  crimes_5km_ear[i] = list(set(crimes_date_ear[0]).intersection(cal_change_r.crimes_loc_5km.iloc[i]))
  crimes_5km_lat[i] = list(set(crimes_date_lat[0]).intersection(cal_change_r.crimes_loc_5km.iloc[i]))

cal_change_r.crimes_1km_lat = crimes_1km_lat
cal_change_r.crimes_2km_lat = crimes_2km_lat
cal_change_r.crimes_5km_lat = crimes_5km_lat
cal_change_r.crimes_1km_ear = crimes_1km_ear
cal_change_r.crimes_2km_ear = crimes_2km_ear
cal_change_r.crimes_5km_ear = crimes_5km_ear

In [0]:
#Step 2.2: Finding which close crime locations had homicides on that date

cal_change_r['homicides_1km_lat']=np.nan
cal_change_r['homicides_2km_lat']=np.nan
cal_change_r['homicides_5km_lat']=np.nan
cal_change_r['homicides_1km_ear']=np.nan
cal_change_r['homicides_2km_ear']=np.nan
cal_change_r['homicides_5km_ear']=np.nan

homicides_1km_lat=[None]*len(cal_change_r)
homicides_2km_lat=[None]*len(cal_change_r)
homicides_5km_lat=[None]*len(cal_change_r)
homicides_1km_ear=[None]*len(cal_change_r)
homicides_2km_ear=[None]*len(cal_change_r)
homicides_5km_ear=[None]*len(cal_change_r)

for i in range(len(cal_change_r)):
  date_lat = cal_change_r.scr_date_latest.iloc[i]
  date_ear = cal_change_r.scr_date_earliest.iloc[i]
  
  homicides_date_lat = homicides_date[homicides_date['date']==date_lat].homi_loc_id2
  homicides_date_ear = homicides_date[homicides_date['date']==date_ear].homi_loc_id2
  if len(homicides_date_lat)>0:
    homicides_1km_lat[i] = list(set(homicides_date_lat[0]).intersection(cal_change_r.homicides_loc_1km.iloc[i]))
    homicides_2km_lat[i] = list(set(homicides_date_lat[0]).intersection(cal_change_r.homicides_loc_2km.iloc[i]))
    homicides_5km_lat[i] = list(set(homicides_date_lat[0]).intersection(cal_change_r.homicides_loc_5km.iloc[i]))
  if len(homicides_date_ear)>0:
    homicides_1km_ear[i] = list(set(homicides_date_ear[0]).intersection(cal_change_r.homicides_loc_1km.iloc[i]))
    homicides_2km_ear[i] = list(set(homicides_date_ear[0]).intersection(cal_change_r.homicides_loc_2km.iloc[i]))
    homicides_5km_ear[i] = list(set(homicides_date_ear[0]).intersection(cal_change_r.homicides_loc_5km.iloc[i]))
    
cal_change_r.homicides_1km_lat = homicides_1km_lat
cal_change_r.homicides_2km_lat = homicides_2km_lat
cal_change_r.homicides_5km_lat = homicides_5km_lat
cal_change_r.homicides_1km_ear = homicides_1km_ear
cal_change_r.homicides_2km_ear = homicides_2km_ear
cal_change_r.homicides_5km_ear = homicides_5km_ear

In [0]:
#Step 2.3: Finding which close crime locations had stealing-related crimes on that date

cal_change_r['stealing_1km_lat']=np.nan
cal_change_r['stealing_2km_lat']=np.nan
cal_change_r['stealing_5km_lat']=np.nan
cal_change_r['stealing_1km_ear']=np.nan
cal_change_r['stealing_2km_ear']=np.nan
cal_change_r['stealing_5km_ear']=np.nan

stealing_1km_lat=[None]*len(cal_change_r)
stealing_2km_lat=[None]*len(cal_change_r)
stealing_5km_lat=[None]*len(cal_change_r)
stealing_1km_ear=[None]*len(cal_change_r)
stealing_2km_ear=[None]*len(cal_change_r)
stealing_5km_ear=[None]*len(cal_change_r)

for i in range(len(cal_change_r)):
  date_lat = cal_change_r.scr_date_latest.iloc[i]
  date_ear = cal_change_r.scr_date_earliest.iloc[i]
  
  stealing_date_lat = stealing_date[stealing_date['date']==date_lat].stea_loc_id2
  stealing_date_ear = stealing_date[stealing_date['date']==date_ear].stea_loc_id2
  if len(stealing_date_lat)>0:
    stealing_1km_lat[i] = list(set(stealing_date_lat[0]).intersection(cal_change_r.stealing_loc_1km.iloc[i]))
    stealing_2km_lat[i] = list(set(stealing_date_lat[0]).intersection(cal_change_r.stealing_loc_2km.iloc[i]))
    stealing_5km_lat[i] = list(set(stealing_date_lat[0]).intersection(cal_change_r.stealing_loc_5km.iloc[i]))
  if len(stealing_date_ear)>0:
    stealing_1km_ear[i] = list(set(stealing_date_ear[0]).intersection(cal_change_r.stealing_loc_1km.iloc[i]))
    stealing_2km_ear[i] = list(set(stealing_date_ear[0]).intersection(cal_change_r.stealing_loc_2km.iloc[i]))
    stealing_5km_ear[i] = list(set(stealing_date_ear[0]).intersection(cal_change_r.stealing_loc_5km.iloc[i]))
    
cal_change_r.stealing_1km_lat = stealing_1km_lat
cal_change_r.stealing_2km_lat = stealing_2km_lat
cal_change_r.stealing_5km_lat = stealing_5km_lat
cal_change_r.stealing_1km_ear = stealing_1km_ear
cal_change_r.stealing_2km_ear = stealing_2km_ear
cal_change_r.stealing_5km_ear = stealing_5km_ear

In [0]:
#Step 2.4: Finding which close crime locations had physical violence-related crimes on that date

cal_change_r['violence_1km_lat']=np.nan
cal_change_r['violence_2km_lat']=np.nan
cal_change_r['violence_5km_lat']=np.nan
cal_change_r['violence_1km_ear']=np.nan
cal_change_r['violence_2km_ear']=np.nan
cal_change_r['violence_5km_ear']=np.nan

violence_1km_lat=[None]*len(cal_change_r)
violence_2km_lat=[None]*len(cal_change_r)
violence_5km_lat=[None]*len(cal_change_r)
violence_1km_ear=[None]*len(cal_change_r)
violence_2km_ear=[None]*len(cal_change_r)
violence_5km_ear=[None]*len(cal_change_r)

for i in range(len(cal_change_r)):
  date_lat = cal_change_r.scr_date_latest.iloc[i]
  date_ear = cal_change_r.scr_date_earliest.iloc[i]
  
  violence_date_lat = violence_date[violence_date['date']==date_lat].viol_loc_id2
  violence_date_ear = violence_date[violence_date['date']==date_ear].viol_loc_id2
  if len(violence_date_lat)>0:
    violence_1km_lat[i] = list(set(violence_date_lat[0]).intersection(cal_change_r.violence_loc_1km.iloc[i]))
    violence_2km_lat[i] = list(set(violence_date_lat[0]).intersection(cal_change_r.violence_loc_2km.iloc[i]))
    violence_5km_lat[i] = list(set(violence_date_lat[0]).intersection(cal_change_r.violence_loc_5km.iloc[i]))
  if len(violence_date_ear)>0:
    violence_1km_ear[i] = list(set(violence_date_ear[0]).intersection(cal_change_r.violence_loc_1km.iloc[i]))
    violence_2km_ear[i] = list(set(violence_date_ear[0]).intersection(cal_change_r.violence_loc_2km.iloc[i]))
    violence_5km_ear[i] = list(set(violence_date_ear[0]).intersection(cal_change_r.violence_loc_5km.iloc[i]))
    
cal_change_r.violence_1km_lat = violence_1km_lat
cal_change_r.violence_2km_lat = violence_2km_lat
cal_change_r.violence_5km_lat = violence_5km_lat
cal_change_r.violence_1km_ear = violence_1km_ear
cal_change_r.violence_2km_ear = violence_2km_ear
cal_change_r.violence_5km_ear = violence_5km_ear

In [0]:
#Step 3.1: Counting how many crimes happened that date on those close locations
cal_change_r['crimes_1km_count_lat']=np.nan
cal_change_r['crimes_2km_count_lat']=np.nan
cal_change_r['crimes_5km_count_lat']=np.nan
cal_change_r['crimes_1km_count_ear']=np.nan
cal_change_r['crimes_2km_count_ear']=np.nan
cal_change_r['crimes_5km_count_ear']=np.nan

crimes_1km_count_lat=[None]*len(cal_change_r)
crimes_2km_count_lat=[None]*len(cal_change_r)
crimes_5km_count_lat=[None]*len(cal_change_r)
crimes_1km_count_ear=[None]*len(cal_change_r)
crimes_2km_count_ear=[None]*len(cal_change_r)
crimes_5km_count_ear=[None]*len(cal_change_r)

for i in range(len(cal_change_r)):
  date_lat = cal_change_r.scr_date_latest.iloc[i]
  date_ear = cal_change_r.scr_date_earliest.iloc[i]
  
  c_lat=crimes[crimes["date"]==date_lat]
  c_ear=crimes[crimes["date"]==date_ear]
  
  l1_lat=cal_change_r.crimes_1km_lat.iloc[i]
  l2_lat=cal_change_r.crimes_2km_lat.iloc[i]
  l5_lat=cal_change_r.crimes_5km_lat.iloc[i]  
  l1_ear=cal_change_r.crimes_1km_ear.iloc[i]
  l2_ear=cal_change_r.crimes_2km_ear.iloc[i]
  l5_ear=cal_change_r.crimes_5km_ear.iloc[i]
  
  c1_lat=c_lat[c_lat['crim_loc_id2'].isin(l1_lat)]
  c2_lat=c_lat[c_lat['crim_loc_id2'].isin(l2_lat)]
  c5_lat=c_lat[c_lat['crim_loc_id2'].isin(l5_lat)]   
  c1_ear=c_ear[c_ear['crim_loc_id2'].isin(l1_ear)]
  c2_ear=c_ear[c_ear['crim_loc_id2'].isin(l2_ear)]
  c5_ear=c_ear[c_ear['crim_loc_id2'].isin(l5_ear)]  
  
  crimes_1km_count_lat[i]=len(c1_lat)
  crimes_2km_count_lat[i]=len(c2_lat)
  crimes_5km_count_lat[i]=len(c5_lat)    
  crimes_1km_count_ear[i]=len(c1_ear)
  crimes_2km_count_ear[i]=len(c2_ear)
  crimes_5km_count_ear[i]=len(c5_ear)
  
cal_change_r.crimes_1km_count_lat=crimes_1km_count_lat
cal_change_r.crimes_2km_count_lat=crimes_2km_count_lat
cal_change_r.crimes_5km_count_lat=crimes_5km_count_lat
cal_change_r.crimes_1km_count_ear=crimes_1km_count_ear
cal_change_r.crimes_2km_count_ear=crimes_2km_count_ear
cal_change_r.crimes_5km_count_ear=crimes_5km_count_ear

In [0]:
#Step 3.2: Counting how many homicides happened that date on those close locations
cal_change_r['homicides_1km_count_lat']=np.nan
cal_change_r['homicides_2km_count_lat']=np.nan
cal_change_r['homicides_5km_count_lat']=np.nan
cal_change_r['homicides_1km_count_ear']=np.nan
cal_change_r['homicides_2km_count_ear']=np.nan
cal_change_r['homicides_5km_count_ear']=np.nan

homicides_1km_count_lat=[0]*len(cal_change_r)
homicides_2km_count_lat=[0]*len(cal_change_r)
homicides_5km_count_lat=[0]*len(cal_change_r)
homicides_1km_count_ear=[0]*len(cal_change_r)
homicides_2km_count_ear=[0]*len(cal_change_r)
homicides_5km_count_ear=[0]*len(cal_change_r)

for i in range(len(cal_change_r)):
  date_lat = cal_change_r.scr_date_latest.iloc[i]
  date_ear = cal_change_r.scr_date_earliest.iloc[i]
  
  c_lat=homicides[homicides["date"]==date_lat]
  c_ear=homicides[homicides["date"]==date_ear]

  if len(c_lat)>0:
    l1_lat=cal_change_r.homicides_1km_lat.iloc[i]
    if len(l1_lat)>0:
      c1_lat=c_lat[c_lat['homi_loc_id2'].isin(l1_lat)]
      homicides_1km_count_lat[i]=len(c1_lat)

    l2_lat=cal_change_r.homicides_2km_lat.iloc[i]  
    if len(l2_lat)>0:
      c2_lat=c_lat[c_lat['homi_loc_id2'].isin(l2_lat)]
      homicides_2km_count_lat[i]=len(c2_lat)

    l5_lat=cal_change_r.homicides_5km_lat.iloc[i]  
    if len(l5_lat)>0:
      c5_lat=c_lat[c_lat['homi_loc_id2'].isin(l5_lat)]   
      homicides_5km_count_lat[i]=len(c5_lat)    
 
  if len(c_ear)>0:
    l1_ear=cal_change_r.homicides_1km_ear.iloc[i]
    if len(l1_ear)>0:
      c1_ear=c_ear[c_ear['homi_loc_id2'].isin(l1_ear)]
      homicides_1km_count_ear[i]=len(c1_ear)
  
    l2_ear=cal_change_r.homicides_2km_ear.iloc[i]
    if len(l2_ear)>0:
      c2_ear=c_ear[c_ear['homi_loc_id2'].isin(l2_ear)]
      homicides_2km_count_ear[i]=len(c2_ear)
  
    l5_ear=cal_change_r.homicides_5km_ear.iloc[i]
    if len(l5_ear)>0:
      c5_ear=c_ear[c_ear['homi_loc_id2'].isin(l5_ear)]  
      homicides_5km_count_ear[i]=len(c5_ear)
  
cal_change_r.homicides_1km_count_lat=homicides_1km_count_lat
cal_change_r.homicides_2km_count_lat=homicides_2km_count_lat
cal_change_r.homicides_5km_count_lat=homicides_5km_count_lat
cal_change_r.homicides_1km_count_ear=homicides_1km_count_ear
cal_change_r.homicides_2km_count_ear=homicides_2km_count_ear
cal_change_r.homicides_5km_count_ear=homicides_5km_count_ear

In [0]:
#Step 3.3: Counting how many physical violence-related crimes happened that date on those close locations
cal_change_r['violence_1km_count_lat']=np.nan
cal_change_r['violence_2km_count_lat']=np.nan
cal_change_r['violence_5km_count_lat']=np.nan
cal_change_r['violence_1km_count_ear']=np.nan
cal_change_r['violence_2km_count_ear']=np.nan
cal_change_r['violence_5km_count_ear']=np.nan

violence_1km_count_lat=[0]*len(cal_change_r)
violence_2km_count_lat=[0]*len(cal_change_r)
violence_5km_count_lat=[0]*len(cal_change_r)
violence_1km_count_ear=[0]*len(cal_change_r)
violence_2km_count_ear=[0]*len(cal_change_r)
violence_5km_count_ear=[0]*len(cal_change_r)

for i in range(len(cal_change_r)):
  date_lat = cal_change_r.scr_date_latest.iloc[i]
  date_ear = cal_change_r.scr_date_earliest.iloc[i]
  
  c_lat=violence[violence["date"]==date_lat]
  c_ear=violence[violence["date"]==date_ear]

  if len(c_lat)>0:
    l1_lat=cal_change_r.violence_1km_lat.iloc[i]
    if len(l1_lat)>0:
      c1_lat=c_lat[c_lat['viol_loc_id2'].isin(l1_lat)]
      violence_1km_count_lat[i]=len(c1_lat)

    l2_lat=cal_change_r.violence_2km_lat.iloc[i]  
    if len(l2_lat)>0:
      c2_lat=c_lat[c_lat['viol_loc_id2'].isin(l2_lat)]
      violence_2km_count_lat[i]=len(c2_lat)

    l5_lat=cal_change_r.violence_5km_lat.iloc[i]  
    if len(l5_lat)>0:
      c5_lat=c_lat[c_lat['viol_loc_id2'].isin(l5_lat)]   
      violence_5km_count_lat[i]=len(c5_lat)    
 
  if len(c_ear)>0:
    l1_ear=cal_change_r.violence_1km_ear.iloc[i]
    if len(l1_ear)>0:
      c1_ear=c_ear[c_ear['viol_loc_id2'].isin(l1_ear)]
      violence_1km_count_ear[i]=len(c1_ear)
  
    l2_ear=cal_change_r.violence_2km_ear.iloc[i]
    if len(l2_ear)>0:
      c2_ear=c_ear[c_ear['viol_loc_id2'].isin(l2_ear)]
      violence_2km_count_ear[i]=len(c2_ear)
  
    l5_ear=cal_change_r.violence_5km_ear.iloc[i]
    if len(l5_ear)>0:
      c5_ear=c_ear[c_ear['viol_loc_id2'].isin(l5_ear)]  
      violence_5km_count_ear[i]=len(c5_ear)
  
cal_change_r.violence_1km_count_lat=violence_1km_count_lat
cal_change_r.violence_2km_count_lat=violence_2km_count_lat
cal_change_r.violence_5km_count_lat=violence_5km_count_lat
cal_change_r.violence_1km_count_ear=violence_1km_count_ear
cal_change_r.violence_2km_count_ear=violence_2km_count_ear
cal_change_r.violence_5km_count_ear=violence_5km_count_ear

In [0]:
#Step 3.4: Counting how many stealing-related crimes happened that date on those close locations
cal_change_r['stealing_1km_count_lat']=np.nan
cal_change_r['stealing_2km_count_lat']=np.nan
cal_change_r['stealing_5km_count_lat']=np.nan
cal_change_r['stealing_1km_count_ear']=np.nan
cal_change_r['stealing_2km_count_ear']=np.nan
cal_change_r['stealing_5km_count_ear']=np.nan

stealing_1km_count_lat=[0]*len(cal_change_r)
stealing_2km_count_lat=[0]*len(cal_change_r)
stealing_5km_count_lat=[0]*len(cal_change_r)
stealing_1km_count_ear=[0]*len(cal_change_r)
stealing_2km_count_ear=[0]*len(cal_change_r)
stealing_5km_count_ear=[0]*len(cal_change_r)

for i in range(len(cal_change_r)):
  date_lat = cal_change_r.scr_date_latest.iloc[i]
  date_ear = cal_change_r.scr_date_earliest.iloc[i]
  
  c_lat=stealing[stealing["date"]==date_lat]
  c_ear=stealing[stealing["date"]==date_ear]

  if len(c_lat)>0:
    l1_lat=cal_change_r.stealing_1km_lat.iloc[i]
    if len(l1_lat)>0:
      c1_lat=c_lat[c_lat['stea_loc_id2'].isin(l1_lat)]
      stealing_1km_count_lat[i]=len(c1_lat)

    l2_lat=cal_change_r.stealing_2km_lat.iloc[i]  
    if len(l2_lat)>0:
      c2_lat=c_lat[c_lat['stea_loc_id2'].isin(l2_lat)]
      stealing_2km_count_lat[i]=len(c2_lat)

    l5_lat=cal_change_r.stealing_5km_lat.iloc[i]  
    if len(l5_lat)>0:
      c5_lat=c_lat[c_lat['stea_loc_id2'].isin(l5_lat)]   
      stealing_5km_count_lat[i]=len(c5_lat)    
 
  if len(c_ear)>0:
    l1_ear=cal_change_r.stealing_1km_ear.iloc[i]
    if len(l1_ear)>0:
      c1_ear=c_ear[c_ear['stea_loc_id2'].isin(l1_ear)]
      stealing_1km_count_ear[i]=len(c1_ear)
  
    l2_ear=cal_change_r.stealing_2km_ear.iloc[i]
    if len(l2_ear)>0:
      c2_ear=c_ear[c_ear['stea_loc_id2'].isin(l2_ear)]
      stealing_2km_count_ear[i]=len(c2_ear)
  
    l5_ear=cal_change_r.stealing_5km_ear.iloc[i]
    if len(l5_ear)>0:
      c5_ear=c_ear[c_ear['stea_loc_id2'].isin(l5_ear)]  
      stealing_5km_count_ear[i]=len(c5_ear)
  
cal_change_r.stealing_1km_count_lat=stealing_1km_count_lat
cal_change_r.stealing_2km_count_lat=stealing_2km_count_lat
cal_change_r.stealing_5km_count_lat=stealing_5km_count_lat
cal_change_r.stealing_1km_count_ear=stealing_1km_count_ear
cal_change_r.stealing_2km_count_ear=stealing_2km_count_ear
cal_change_r.stealing_5km_count_ear=stealing_5km_count_ear

In [0]:
cal_change_r.head()

Unnamed: 0,listing_id,date,price_latest,scr_date_latest,month,day,year,review,price_earliest,scr_date_earliest,host_id,neighbourhood,latitude,longitude,room_type,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,lr_m,lr_d,lr_y,lat,lon,location,loc,loc_id,scr_lat_m,scr_lat_y,scr_ear_m,scr_ear_y,price_dif,crimes_loc_1km,crimes_loc_2km,crimes_loc_5km,homicides_loc_1km,homicides_loc_2km,homicides_loc_5km,stealing_loc_1km,stealing_loc_2km,...,homicides_5km_ear,stealing_1km_lat,stealing_2km_lat,stealing_5km_lat,stealing_1km_ear,stealing_2km_ear,stealing_5km_ear,violence_1km_lat,violence_2km_lat,violence_5km_lat,violence_1km_ear,violence_2km_ear,violence_5km_ear,crimes_1km_count_lat,crimes_2km_count_lat,crimes_5km_count_lat,crimes_1km_count_ear,crimes_2km_count_ear,crimes_5km_count_ear,homicides_1km_count_lat,homicides_2km_count_lat,homicides_5km_count_lat,homicides_1km_count_ear,homicides_2km_count_ear,homicides_5km_count_ear,violence_1km_count_lat,violence_2km_count_lat,violence_5km_count_lat,violence_1km_count_ear,violence_2km_count_ear,violence_5km_count_ear,stealing_1km_count_lat,stealing_2km_count_lat,stealing_5km_count_lat,stealing_1km_count_ear,stealing_2km_count_ear,stealing_5km_count_ear,crimes_1km_count_dif,crimes_2km_count_dif,crimes_5km_count_dif
4,2384,2018-11-05,65.0,2018-10-11,11,5,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...",...,[],[237],"[257, 258]","[192, 193, 233, 170, 171, 190, 174, 175, 176, ...","[237, 238]","[257, 258, 619, 235, 215]","[272, 274, 155, 156, 157, 158, 295, 170, 172, ...",[],[216],"[192, 194, 230, 232, 282, 172, 173, 271, 212, ...",[],"[216, 214, 255]","[273, 281, 153, 154, 283, 157, 294, 170, 171, ...",1,5,60,3,9,78,0,0,0,0,0,0,0,1,15,0,3,29,1,2,26,3,6,31,-2,-4,-18
5,2384,2018-11-09,65.0,2018-10-11,11,9,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...",...,[],[237],"[257, 258]","[192, 193, 233, 170, 171, 190, 174, 175, 176, ...","[237, 238]","[257, 258, 619, 235, 215]","[272, 274, 155, 156, 157, 158, 295, 170, 172, ...",[],[216],"[192, 194, 230, 232, 282, 172, 173, 271, 212, ...",[],"[216, 214, 255]","[273, 281, 153, 154, 283, 157, 294, 170, 171, ...",1,5,60,3,9,78,0,0,0,0,0,0,0,1,15,0,3,29,1,2,26,3,6,31,-2,-4,-18
6,2384,2018-11-12,65.0,2018-10-11,11,12,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...",...,[],[237],"[257, 258]","[192, 193, 233, 170, 171, 190, 174, 175, 176, ...","[237, 238]","[257, 258, 619, 235, 215]","[272, 274, 155, 156, 157, 158, 295, 170, 172, ...",[],[216],"[192, 194, 230, 232, 282, 172, 173, 271, 212, ...",[],"[216, 214, 255]","[273, 281, 153, 154, 283, 157, 294, 170, 171, ...",1,5,60,3,9,78,0,0,0,0,0,0,0,1,15,0,3,29,1,2,26,3,6,31,-2,-4,-18
7,2384,2018-11-30,65.0,2018-10-11,11,30,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...",...,[],[237],"[257, 258]","[192, 193, 233, 170, 171, 190, 174, 175, 176, ...","[237, 238]","[257, 258, 619, 235, 215]","[272, 274, 155, 156, 157, 158, 295, 170, 172, ...",[],[216],"[192, 194, 230, 232, 282, 172, 173, 271, 212, ...",[],"[216, 214, 255]","[273, 281, 153, 154, 283, 157, 294, 170, 171, ...",1,5,60,3,9,78,0,0,0,0,0,0,0,1,15,0,3,29,1,2,26,3,6,31,-2,-4,-18
8,2384,2018-12-03,75.0,2018-11-15,12,3,2018,1,80.0,2018-09-14,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,11,2018,9,2018,-5.0,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...",...,[],[237],"[256, 257, 258, 259, 217]","[192, 194, 195, 231, 233, 170, 282, 172, 174, ...",[236],"[256, 257, 258, 259, 217]","[192, 194, 195, 196, 232, 234, 171, 174, 177, ...",[],"[256, 257, 215]","[194, 195, 232, 233, 170, 174, 270, 176, 177, ...",[],"[257, 234, 214, 216, 255]","[193, 194, 195, 230, 231, 232, 233, 171, 176, ...",1,15,73,1,17,87,0,0,1,0,0,0,0,4,16,0,6,21,1,8,28,1,9,28,0,-2,-14


In [0]:
#Step 4: calculating difference between dates
cal_change_r['crimes_1km_count_dif']=cal_change_r['crimes_1km_count_lat']-cal_change_r['crimes_1km_count_ear']
cal_change_r['crimes_2km_count_dif']=cal_change_r['crimes_2km_count_lat']-cal_change_r['crimes_2km_count_ear']
cal_change_r['crimes_5km_count_dif']=cal_change_r['crimes_5km_count_lat']-cal_change_r['crimes_5km_count_ear']

cal_change_r['homicides_1km_count_dif']=cal_change_r['homicides_1km_count_lat']-cal_change_r['homicides_1km_count_ear']
cal_change_r['homicides_2km_count_dif']=cal_change_r['homicides_2km_count_lat']-cal_change_r['homicides_2km_count_ear']
cal_change_r['homicides_5km_count_dif']=cal_change_r['homicides_5km_count_lat']-cal_change_r['homicides_5km_count_ear']

cal_change_r['stealing_1km_count_dif']=cal_change_r['stealing_1km_count_lat']-cal_change_r['stealing_1km_count_ear']
cal_change_r['stealing_2km_count_dif']=cal_change_r['stealing_2km_count_lat']-cal_change_r['stealing_2km_count_ear']
cal_change_r['stealing_5km_count_dif']=cal_change_r['stealing_5km_count_lat']-cal_change_r['stealing_5km_count_ear']

cal_change_r['violence_1km_count_dif']=cal_change_r['violence_1km_count_lat']-cal_change_r['violence_1km_count_ear']
cal_change_r['violence_2km_count_dif']=cal_change_r['violence_2km_count_lat']-cal_change_r['violence_2km_count_ear']
cal_change_r['violence_5km_count_dif']=cal_change_r['violence_5km_count_lat']-cal_change_r['violence_5km_count_ear']

In [0]:
cal_change_r.head()

Unnamed: 0,listing_id,date,price_latest,scr_date_latest,month,day,year,review,price_earliest,scr_date_earliest,host_id,neighbourhood,latitude,longitude,room_type,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,lr_m,lr_d,lr_y,lat,lon,location,loc,loc_id,scr_lat_m,scr_lat_y,scr_ear_m,scr_ear_y,price_dif,crimes_loc_1km,crimes_loc_2km,crimes_loc_5km,homicides_loc_1km,homicides_loc_2km,homicides_loc_5km,stealing_loc_1km,stealing_loc_2km,...,violence_5km_lat,violence_1km_ear,violence_2km_ear,violence_5km_ear,crimes_1km_count_lat,crimes_2km_count_lat,crimes_5km_count_lat,crimes_1km_count_ear,crimes_2km_count_ear,crimes_5km_count_ear,homicides_1km_count_lat,homicides_2km_count_lat,homicides_5km_count_lat,homicides_1km_count_ear,homicides_2km_count_ear,homicides_5km_count_ear,violence_1km_count_lat,violence_2km_count_lat,violence_5km_count_lat,violence_1km_count_ear,violence_2km_count_ear,violence_5km_count_ear,stealing_1km_count_lat,stealing_2km_count_lat,stealing_5km_count_lat,stealing_1km_count_ear,stealing_2km_count_ear,stealing_5km_count_ear,crimes_1km_count_dif,crimes_2km_count_dif,crimes_5km_count_dif,homicides_1km_count_dif,homicides_2km_count_dif,homicides_5km_count_dif,stealing_1km_count_dif,stealing_2km_count_dif,stealing_5km_count_dif,violence_1km_count_dif,violence_2km_count_dif,violence_5km_count_dif
4,2384,2018-11-05,65.0,2018-10-11,11,5,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...",...,"[192, 194, 230, 232, 282, 172, 173, 271, 212, ...",[],"[216, 214, 255]","[273, 281, 153, 154, 283, 157, 294, 170, 171, ...",1,5,60,3,9,78,0,0,0,0,0,0,0,1,15,0,3,29,1,2,26,3,6,31,-2,-4,-18,0,0,0,-2,-4,-5,0,-2,-14
5,2384,2018-11-09,65.0,2018-10-11,11,9,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...",...,"[192, 194, 230, 232, 282, 172, 173, 271, 212, ...",[],"[216, 214, 255]","[273, 281, 153, 154, 283, 157, 294, 170, 171, ...",1,5,60,3,9,78,0,0,0,0,0,0,0,1,15,0,3,29,1,2,26,3,6,31,-2,-4,-18,0,0,0,-2,-4,-5,0,-2,-14
6,2384,2018-11-12,65.0,2018-10-11,11,12,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...",...,"[192, 194, 230, 232, 282, 172, 173, 271, 212, ...",[],"[216, 214, 255]","[273, 281, 153, 154, 283, 157, 294, 170, 171, ...",1,5,60,3,9,78,0,0,0,0,0,0,0,1,15,0,3,29,1,2,26,3,6,31,-2,-4,-18,0,0,0,-2,-4,-5,0,-2,-14
7,2384,2018-11-30,65.0,2018-10-11,11,30,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...",...,"[192, 194, 230, 232, 282, 172, 173, 271, 212, ...",[],"[216, 214, 255]","[273, 281, 153, 154, 283, 157, 294, 170, 171, ...",1,5,60,3,9,78,0,0,0,0,0,0,0,1,15,0,3,29,1,2,26,3,6,31,-2,-4,-18,0,0,0,-2,-4,-5,0,-2,-14
8,2384,2018-12-03,75.0,2018-11-15,12,3,2018,1,80.0,2018-09-14,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,11,2018,9,2018,-5.0,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[123],"[109, 110, 111]","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...","[236, 237, 238]","[215, 216, 217, 235, 256, 257, 258, 259, 618, ...",...,"[194, 195, 232, 233, 170, 174, 270, 176, 177, ...",[],"[257, 234, 214, 216, 255]","[193, 194, 195, 230, 231, 232, 233, 171, 176, ...",1,15,73,1,17,87,0,0,1,0,0,0,0,4,16,0,6,21,1,8,28,1,9,28,0,-2,-14,0,0,1,0,-1,0,0,-2,-5


In [0]:
cal_change_r.to_csv('cal_change_r.csv')
!cp cal_change_r.csv drive/My\ Drive/

##Statistical analysis

Like before, we'll do a regression but this time we'll considere the difference between scraping dates in regards to price and crime counts. Note that this model implies a very strong (and rough) assumption: that the influence crime had on the price can be measured by the difference in criminal occurences between scraping dates, and nothing more. Again, this is not a very precise estimate but should be enough to suggest us whether there's an influence at all.

In [0]:
X = cal_change_r[['crimes_1km_count_dif','crimes_2km_count_dif','crimes_5km_count_dif',
               'homicides_1km_count_dif','homicides_2km_count_dif','homicides_5km_count_dif',
               'violence_1km_count_dif','violence_2km_count_dif','violence_5km_count_dif',
               'stealing_1km_count_dif','stealing_2km_count_dif','stealing_5km_count_dif',
               'room_type','neighbourhood','number_of_reviews','reviews_per_month','availability_365']]
y = cal_change_r[['price_dif']]

X = pd.concat([X, pd.get_dummies(X.room_type, prefix='room_type', drop_first=True)], axis=1)
X = pd.concat([X, pd.get_dummies(X.neighbourhood, drop_first=True)], axis=1)

X=X.drop(['neighbourhood','room_type'],axis=1)

reg = LinearRegression().fit(X, y)

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

  return ptp(axis=axis, out=out, **kwargs)


                            OLS Regression Results                            
Dep. Variable:              price_dif   R-squared:                       0.089
Model:                            OLS   Adj. R-squared:                  0.088
Method:                 Least Squares   F-statistic:                     70.03
Date:                Wed, 04 Sep 2019   Prob (F-statistic):               0.00
Time:                        23:36:16   Log-Likelihood:            -3.6202e+05
No. Observations:               59947   AIC:                         7.242e+05
Df Residuals:                   59862   BIC:                         7.250e+05
Df Model:                          84                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                     

For an interpretation of the results, check the Report section of our website.