#The relation between criminality and rent prices: a case study of Airbnb in Chicago
<b>Guilherme Araújo & Gabriel Novais</b>:


The objective of this work is to analyze the relation between the prices of Airbnb listings in Chicago and records of criminal occurences in the city for the period of July 2018 to July 2019.

Why Airbnb? Because price rates are more dynamic, since they operate on a more immediate supply-demand equiilibrium, can change daily and respond to many factors such as criminality, in particular. While some caveats have to be made, since many Airbnb listings are likely to be closer to touristic spots and to be less present in poor neighbourhoods, most listings are made available for most of the year, which would suggest there's an underlying mid-to-long term optimization logic for the hosts. This is not meant as an accurate proxy for long-term rent process, but more of an insight into how the decision-making process (hosts deciding at which prices to list their places for each date, consumers deciding which places to rent given price, location and other factors) can be affected by surrounding criminality. 

<b>Sources and Links</b>:

<b>Airbnb</b>
<li><a href="http://insideairbnb.com/get-the-data.html">http://insideairbnb.com/get-the-data.html</a></li>

<b>Chicago</b>
    <li><a>https://data.cityofchicago.org/Public-Safety/Crimes-One-year-prior-to-present/x2n5-8w5q/data</li></a>

In [0]:
#Setting up Python
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import glob
import re
import io
import requests
import csv

from sklearn.linear_model import LinearRegression
from math import radians, sin, cos, acos, log, pi, tan, asin,sqrt
from decimal import Decimal
from bokeh.plotting import figure, show, output_notebook
from bokeh.tile_providers import CARTODBPOSITRON
from ast import literal_eval
from scipy import stats

In [0]:
#from google.colab import drive
#drive.mount('drive')

## Importing and organizing Airbnb data

Airbnb does not publically release information on its listings. When opening a listing on the Airbnb, all the information we can find are informations about the listing and its host, reviews and a calendar that shows the future dates when the place will be available and the rent price for each day. So how can we make inferences about Airbnb activity?

http://insideairbnb.com/ is a website that provides data scraped periodically from the Airbnb website for selected cities. For the city of Chicago, which will be our subject of choice, we have 14 different iterations of this scraping process, the earliest from April 15th, 2018 and the latest from July 15th, 2019. 

We'll be building 3 datasets from the data obtained from InsideAirbnb:
- <b>listings </b>, which has data for each listings such as host identification, neighborhood and location
- <b>reviews </b>, which compiles the dates of each review posted on the website for each listings
- <b>calendar</b>, which shows the availability and pricing for future dates; by joining data from different iterations of their web scraping, we can build a very accurate database of pricing for the cumulative time period.

It's important to highlight that the availability information is noisy, since booked dates are listed as unavailable, and we don't have explicit information on which dates the places were actually rented. What we do is use the date of reviews as proxy, assuming that users post a review as soon as they leave the rented place, which makes the data of a listing on the day a consumer posted a review relevant. While users may take a day or two to post their reviews, since prices don't vary much from day to day (even though it changes throughout the year), we assume any imprecision here is irrelevant on the aggregate.

### Listings

We'll import the data directly from our GitHub repository, where we've previously saved and organized the data extracted from Inside AirBnb.

Each listings.csv file features data from all the listings on the Airbnb website on that day. The most recent information is what is of our interest; however, it doesn't feature the entire history of listings. Thus, we appended data from previous versions of the listings dataset and only kept the most recent data, so we can have the most accurate information on the largest set of listings.

In order to select relevant listings, we discarded listings which are available for less than 10 days a year and that have had less than 10 reviews, to not burden ourselves with skewed information based on one-off rents. We have also discarded listings from dates previous to April 15th, 2018, since we have no calendar information on them.

One of the most important information on this dataset is the location for each listing, provided by latitude and longitude. Since our main interest in this information is to calculate distances between the listings and nearby crimes on each date, both latitude and longitude information have been rounded to 2 decimals to avoid redundant calculations and to offset errors in measurement, since rounding up to 3 or more decimals made it so that some listings showed up with different locations on different dates. This reduced thousands of listings to 370 general locations, for which we then created and id for each of those locations.

Also note that the method we used for creating id's generated ordered values but of seemingly random values, so we decide to create a second id reordering those values starting from 0 and incrementing by 1, which will facilitate consulting locations later on.

In [0]:
#Import listings data from each scraping iteration (from oldest to newest)
url_l1 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_18_4_15.csv'
url_l2 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_18_5_18.csv'
url_l3 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_18_7_18.csv'
url_l4 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_18_9_14.csv'
url_l5 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_18_10_11.csv'
url_l6 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_18_11_15.csv'
url_l7 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_18_12_13.csv'
url_l8 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_19_1_17.csv'
url_l9 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_19_2_9.csv'
url_l10 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_19_3_12.csv'
url_l11 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_19_4_15.csv'
url_l12 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_19_5_19.csv'
url_l13 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_19_6_14.csv'
url_l14 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/listings/listings_19_7_15.csv'


listings_1 = pd.read_csv(url_l1)
listings_2 = pd.read_csv(url_l2)
listings_3 = pd.read_csv(url_l3)
listings_4 = pd.read_csv(url_l4)
listings_5 = pd.read_csv(url_l5)
listings_6 = pd.read_csv(url_l6)
listings_7 = pd.read_csv(url_l7)
listings_8 = pd.read_csv(url_l8)
listings_9 = pd.read_csv(url_l9)
listings_10 = pd.read_csv(url_l10)
listings_11 = pd.read_csv(url_l11)
listings_12 = pd.read_csv(url_l12)
listings_13 = pd.read_csv(url_l13)
listings_14 = pd.read_csv(url_l14)

In [0]:
#The most recent listing data is the one we want, but some past listings may no longer show up
#We'll append to the most recent listings data from past scrapings, but we'll only keep the most recent information for each id 
listings=listings_14
listings=listings.append(listings_13)
listings=listings.append(listings_12)
listings=listings.append(listings_11)
listings=listings.append(listings_10)
listings=listings.append(listings_9)
listings=listings.append(listings_8)
listings=listings.append(listings_7)
listings=listings.append(listings_6)
listings=listings.append(listings_5)
listings=listings.append(listings_4)
listings=listings.append(listings_3)
listings=listings.append(listings_2)
listings=listings.append(listings_1)

listings=listings.drop_duplicates(subset="id", keep='first')
listings=listings.drop(columns=['name','host_name','price','minimum_nights','neighbourhood_group'])
listings=listings.rename(index=str, columns={"id": "listing_id"})

listings=listings.dropna(subset=['last_review'], axis=0)
listings['lr_m']=listings.last_review.apply(lambda x: int(x[5:7]))
listings['lr_d']=listings.last_review.apply(lambda x: int(x[8:10]))
listings['lr_y']=listings.last_review.apply(lambda x: int(x[0:4]))
listings.last_review = pd.to_datetime(listings.last_review)
listings['lat']=listings.latitude.round(2)
listings['lon']=listings.longitude.round(2)
listings['location'] = list(zip(listings.latitude, listings.longitude))
listings['loc'] = list(zip(listings.lat, listings.lon))

listings = listings.assign(loc_id=(listings['loc'].astype('category').cat.codes))

listings.room_type = listings.room_type.apply(lambda x: 1 if x=="Entire home/apt" else 2 if x=="Private room" else 3)

listings=listings[listings.number_of_reviews > 9]
listings=listings[listings.lr_y > 2017]
listings=listings[listings.availability_365>9]
listings=listings.drop(listings[(listings.lr_y==2018) & (listings.lr_m<4)].index)
listings=listings.drop(listings[(listings.lr_y==2018) & (listings.lr_m==4) & (listings.lr_d<15)].index)

listings=listings.drop(columns=['last_review'])

In [89]:
#We'll create a dataframe storing each pair of location and id
listings_locations = listings[['loc','loc_id']]
listings_locations = listings_locations.drop_duplicates('loc_id')
listings_locations = listings_locations.set_index('loc_id')
listings_locations = listings_locations.sort_index()
listings_locations = listings_locations.reset_index()
listings_locations['loc_id2']=listings_locations.index
list_locs=list(listings_locations['loc'])
len(list_locs)

370

In [90]:
listings_locations.head()

Unnamed: 0,loc_id,loc,loc_id2
0,0,"(41.65, -87.54)",0
1,2,"(41.66, -87.55)",1
2,4,"(41.67, -87.66)",2
3,10,"(41.69, -87.68)",3
4,11,"(41.69, -87.67)",4


In [91]:
listings=listings.merge(listings_locations,on=['loc_id','loc'])
listings.head()

Unnamed: 0,listing_id,host_id,neighbourhood,latitude,longitude,room_type,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,lr_m,lr_d,lr_y,lat,lon,location,loc,loc_id,loc_id2
0,2384,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",124,71
1,2604454,13339125,Hyde Park,41.78977,-87.58916,1,85,1.42,3,41,7,11,2019,41.79,-87.59,"(41.789770000000004, -87.58915999999999)","(41.79, -87.59)",124,71
2,6524346,34121377,Hyde Park,41.79119,-87.59099,2,26,0.52,1,34,7,7,2019,41.79,-87.59,"(41.79119, -87.59099)","(41.79, -87.59)",124,71
3,18549719,47172572,Hyde Park,41.79296,-87.59275,1,127,4.77,60,96,7,2,2019,41.79,-87.59,"(41.79296, -87.59275)","(41.79, -87.59)",124,71
4,22320506,47172572,Hyde Park,41.79386,-87.59469,1,99,5.32,60,93,7,8,2019,41.79,-87.59,"(41.793859999999995, -87.59469)","(41.79, -87.59)",124,71


### Reviews

Our reviews dataset is much simpler, since the latest information stores the entire history of Airbnb reviews by listing and date. We simply discarded information for dates outside of our interest and create a dummy variable called 'review', so when we merge the reviews to our calendar we can establish which dates of our listings we can assume have been actually rented. 

In [0]:
#Import review data
url_r = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/reviews/reviews_15_7_19.csv'
reviews = pd.read_csv(url_r)

In [0]:
reviews['month']=reviews.date.apply(lambda x: int(x[5:7]))
reviews['day']=reviews.date.apply(lambda x: int(x[8:10]))
reviews['year']=reviews.date.apply(lambda x: int(x[0:4]))
reviews.date = pd.to_datetime(reviews.date)

reviews=reviews[reviews.year > 2017]
reviews=reviews.drop(reviews[(reviews.year==2018) & (reviews.month<4)].index)
reviews=reviews.drop(reviews[(reviews.year==2018) & (reviews.month==4) & (reviews.day<15)].index)
reviews=reviews.drop(columns=['month','day','year'])
reviews['review']=1

In [94]:
reviews.head()

Unnamed: 0,listing_id,date,review
112,2384,2018-04-15,1
113,2384,2018-04-22,1
114,2384,2018-04-25,1
115,2384,2018-05-05,1
116,2384,2018-05-14,1


### Calendar

As previously explained, each scraping iteration of the calendars features prices and available dates for the near future, as provided by the host. We assumed the latest information is more likely to reflect the actual price exercised on each date. Thus, for our main analysis, we're only keeping the most recent prices made available on the website on our calendar dataset. 

However, since we're also interested in how hosts change their prices for future dates, we created another dataset named cal_change which stores listings for which different prices have been listed on different scraping dates. For now, we'll leave it aside and focus on our calendar dataset.

In [0]:
#Import calendar data from each scraping iteration (from oldest to newest)
url_c1 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_18_4_15.zip'
url_c2 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_18_5_18.zip'
url_c3 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_18_7_18.zip'
url_c4 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_18_9_14.zip'
url_c5 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_18_10_11.zip'
url_c6 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_18_11_15.zip'
url_c7 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_18_12_13.zip'
url_c8 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_19_1_17.zip'
url_c9 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_19_2_9.zip'
url_c10 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_19_3_12.zip'
url_c11 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_19_4_15.zip'
url_c12 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_19_5_19.zip'
url_c13 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_19_6_14.zip'
url_c14 = 'https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/calendar/calendar_19_7_15.zip'


calendar_1 = pd.read_csv(url_c1)
calendar_2 = pd.read_csv(url_c2)
calendar_3 = pd.read_csv(url_c3)
calendar_4 = pd.read_csv(url_c4)
calendar_5 = pd.read_csv(url_c5)
calendar_6 = pd.read_csv(url_c6)
calendar_7 = pd.read_csv(url_c7)
calendar_8 = pd.read_csv(url_c8)
calendar_9 = pd.read_csv(url_c9)
calendar_10 = pd.read_csv(url_c10)
calendar_11 = pd.read_csv(url_c11)
calendar_12 = pd.read_csv(url_c12)
calendar_13 = pd.read_csv(url_c13)
calendar_14 = pd.read_csv(url_c14)

In [0]:
#For each calendar scraping, add scraping date
calendar_1['scr_date']='2018-04-15'
calendar_2['scr_date']='2018-05-18'
calendar_3['scr_date']='2018-07-18'
calendar_4['scr_date']='2018-09-14'
calendar_5['scr_date']='2018-10-11'
calendar_6['scr_date']='2018-11-15'
calendar_7['scr_date']='2018-12-13'
calendar_8['scr_date']='2019-01-17'
calendar_9['scr_date']='2019-02-09'
calendar_10['scr_date']='2019-03-12'
calendar_11['scr_date']='2019-04-15'
calendar_12['scr_date']='2019-05-19'
calendar_13['scr_date']='2019-06-14'
calendar_14['scr_date']='2019-07-15'

In [97]:
calendar=calendar_14
calendar=calendar.append(calendar_13)
calendar=calendar.append(calendar_12)
calendar=calendar.append(calendar_11)
calendar=calendar.append(calendar_10)
calendar=calendar.append(calendar_9)
calendar=calendar.append(calendar_8)
calendar=calendar.append(calendar_7)
calendar=calendar.append(calendar_6)
calendar=calendar.append(calendar_5)
calendar=calendar.append(calendar_4)
calendar=calendar.append(calendar_3)
calendar=calendar.append(calendar_2)
calendar=calendar.append(calendar_1)

calendar=calendar[['listing_id','date','price','scr_date']]
calendar=calendar.drop_duplicates(subset=['listing_id','date','price'])
calendar=calendar.dropna(axis=0,subset=['price'])
calendar['month']=calendar.date.apply(lambda x: int(x[5:7]))
calendar['day']=calendar.date.apply(lambda x: int(x[8:10]))
calendar['year']=calendar.date.apply(lambda x: int(x[0:4]))
calendar.date = pd.to_datetime(calendar.date)
calendar.scr_date = pd.to_datetime(calendar.scr_date)
calendar.price = calendar.price.apply(lambda x: float(re.sub("[^\d\.]", "", (x[1:-3]))))
calendar.price = pd.to_numeric(calendar.price)

calendar=calendar.merge(reviews,on=['listing_id','date'],how='left')
calendar['review']=calendar['review'].fillna(0)
calendar['review']=calendar['review'].astype(int)

cal_change = calendar[calendar.duplicated(['listing_id','date'],keep=False)]
cal_change = cal_change.sort_values(by=['listing_id','date'])

calendar=calendar.drop(columns=['scr_date'])
calendar=calendar.drop_duplicates(subset=['listing_id','date'])
calendar=calendar.sort_values(by=['listing_id','date'])

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  sort=sort)


In [98]:
calendar.head()

Unnamed: 0,listing_id,date,price,month,day,year,review
14493724,2384,2018-04-15,55.0,4,15,2018,1
14493723,2384,2018-04-16,55.0,4,16,2018,0
14493722,2384,2018-04-17,55.0,4,17,2018,0
14493721,2384,2018-04-18,55.0,4,18,2018,0
14493720,2384,2018-04-22,55.0,4,22,2018,1


In [99]:
cal_change.head()

Unnamed: 0,listing_id,date,price,scr_date,month,day,year,review
13873240,2384,2018-06-11,65.0,2018-05-18,6,11,2018,0
14493696,2384,2018-06-11,80.0,2018-04-15,6,11,2018,0
13873226,2384,2018-07-01,65.0,2018-05-18,7,1,2018,1
14493689,2384,2018-07-01,60.0,2018-04-15,7,1,2018,1
13873225,2384,2018-07-02,65.0,2018-05-18,7,2,2018,0


## Importing crimes data

We extracted our data on crimes for the city of Chicago from the Chicago Data Portal website, which amongst its Public Safety data features a dataset named "Crimes: one year prior to present", which lists all reports of criminal occurences for an entire year up to the latest update (roughly a week before the present date). For the version of this file saved on our GitHub, data spans from July 9th, 2018 to July 8th, 2019. We decided to drop data from July, 2019 since the its few entries seem incomplete, listing only a handful of occurrences.

We decided to discard crimes of certain categories such as 'deceptive practice' (e.g. credit card frauds) and 'liquor law violation' (e.g. selling alcoholic drinks without a permit), which we deemed to not be relevant when it comes to the decision-making process from both hosts and consumers in regards to rent.

When checking how many occurrences there are on our dataset for each crime category (as listed by the Chicago Data Portal), it can be seen that the most frequent crimes are related to stealing private possessions ('theft','burglary', 'robbery', 'motor vehicle theft'), criminal damage and violence ('battery', 'assault'), while the number of homicides pale in comparison (which can be at least partially attributed to less reporting, as public information would suggest many more homicides happened on Chicago for that time period).

Since some types of crimes are much more reported than others, the relation between aggregate criminality and prices might be unclear and dominated by the categories with more representation. For example, we'd expect the correlation between price and nearby homicides to be negative, but the correlation between theft and price might actually be positive since higher rent prices are likely to be present in richer areas or more populated areas, where thefts might be more present (or at least, reported more often).

To make a more thorough analysis, we'll deal with the full set of criminal occurrences as well as subsets for crimes related to physical violence, stealing private property and homicides.

Like on our listings dataset, we rounded locations (latitude and longitude) to 2 decimals, reducing over a hundred thousand criminal reports to 708 locations.

In [100]:
url_cr = "https://raw.githubusercontent.com/araujoghm/DataScienceEMAp_AraujoNovais/master/dados/crimes/crimes.csv"
crimes = pd.read_csv(url_cr)
crimes = crimes[['DATE  OF OCCURRENCE','LATITUDE','LONGITUDE','ARREST',' PRIMARY DESCRIPTION']]
crimes = crimes.rename(index=str, columns={"DATE  OF OCCURRENCE": "date","LONGITUDE": "longitude","LATITUDE": "latitude"," PRIMARY DESCRIPTION": "desc", "ARREST": "arrest"})
crimes = crimes[crimes.desc!="CONCEALED CARRY LICENSE VIOLATION"]
crimes = crimes[crimes.desc!="DECEPTIVE PRACTICE"]
crimes = crimes[crimes.desc!="INTERFERENCE WITH PUBLIC OFFICER"]
crimes = crimes[crimes.desc!="OBSCENITY"]
crimes = crimes[crimes.desc!="NON-CRIMINAL"]
crimes = crimes[crimes.desc!="NON-CRIMINAL (SUBJECT SPECIFIED)"]
crimes = crimes[crimes.desc!="LIQUOR LAW VIOLATION"]
crimes = crimes[crimes.desc!="PUBLIC INDECENCY"]

crimes['lat']=crimes.latitude.round(2)
crimes['lon']=crimes.longitude.round(2)
crimes['location'] = list(zip(crimes.latitude, crimes.longitude))
crimes['loc'] = list(zip(crimes.lat, crimes.lon))
crimes = crimes.assign(loc_id=(crimes['loc'].astype('category').cat.codes))
crimes.arrest = crimes.arrest.apply(lambda x: 0 if x=="N" else 1)
crimes.date = crimes.date.apply(lambda x: x[0:10])
crimes.date = pd.to_datetime(crimes.date)
crimes['date_str'] = crimes.date.astype('str')
crimes['month']=crimes.date_str.apply(lambda x: int(x[5:7]))
crimes['day']=crimes.date_str.apply(lambda x: int(x[8:10]))
crimes['year']=crimes.date_str.apply(lambda x: int(x[0:4]))

crimes=crimes.drop(columns=['date_str'])
#Dropping incomplete observations
crimes=crimes.drop(crimes[(crimes.year==2019) & (crimes.month==7)].index)

crimes=crimes.dropna(axis=0)
crimes=crimes.sort_values(by='date')
print(crimes['desc'].value_counts())

THEFT                         60775
BATTERY                       48332
CRIMINAL DAMAGE               26229
ASSAULT                       20093
OTHER OFFENSE                 16370
NARCOTICS                     12648
BURGLARY                      10272
MOTOR VEHICLE THEFT            9277
ROBBERY                        8448
CRIMINAL TRESPASS              6582
WEAPONS VIOLATION              5736
OFFENSE INVOLVING CHILDREN     2138
CRIM SEXUAL ASSAULT            1542
PUBLIC PEACE VIOLATION         1440
SEX OFFENSE                    1129
PROSTITUTION                    666
HOMICIDE                        551
ARSON                           357
STALKING                        200
INTIMIDATION                    182
GAMBLING                        168
KIDNAPPING                      161
HUMAN TRAFFICKING                14
OTHER NARCOTIC VIOLATION          4
Name: desc, dtype: int64


In [0]:
homicides=crimes[crimes.desc=="HOMICIDE"]
homicides=homicides.drop(columns=['desc'])

stealing=crimes[crimes.desc.isin(["BURGLARY", "THEFT", "ROBBERY", "MOTOR VEHICLE THEFT"])]
stealing=stealing.drop(columns=['desc'])

violence=crimes[crimes.desc.isin(["BATTERY", "ASSAULT"])]
violence=violence.drop(columns=['desc'])

crimes=crimes.drop(columns=['desc'])

Like we did for our listings, we'll create dataframe pairing locations to their id's (both the original and our "corrected" version)

In [102]:
crimes_locations = crimes[['loc','loc_id']]
crimes_locations = crimes_locations.drop_duplicates('loc_id')
crimes_locations = crimes_locations.set_index('loc_id')
crimes_locations = crimes_locations.sort_index()
crimes_locations = crimes_locations.reset_index()
crimes_locations['crim_loc_id2']=crimes_locations.index
crim_locs=list(crimes_locations['loc'])
len(crim_locs)

708

In [103]:
homicides_locations = homicides[['loc','loc_id']]
homicides_locations = homicides_locations.drop_duplicates('loc_id')
homicides_locations = homicides_locations.set_index('loc_id')
homicides_locations = homicides_locations.sort_index()
homicides_locations = homicides_locations.reset_index()
homicides_locations['homi_loc_id2']=homicides_locations.index
homi_locs=list(homicides_locations['loc'])
len(homi_locs)

244

In [104]:
violence_locations = violence[['loc','loc_id']]
violence_locations = violence_locations.drop_duplicates('loc_id')
violence_locations = violence_locations.set_index('loc_id')
violence_locations = violence_locations.sort_index()
violence_locations = violence_locations.reset_index()
violence_locations['viol_loc_id2']=violence_locations.index
viol_locs=list(violence_locations['loc'])
len(viol_locs)

682

In [105]:
stealing_locations = stealing[['loc','loc_id']]
stealing_locations = stealing_locations.drop_duplicates('loc_id')
stealing_locations = stealing_locations.set_index('loc_id')
stealing_locations = stealing_locations.sort_index()
stealing_locations = stealing_locations.reset_index()
stealing_locations['stea_loc_id2']=stealing_locations.index
stea_locs=list(stealing_locations['loc'])
len(stea_locs)

681

In [106]:
crimes = crimes.merge(crimes_locations,on=['loc_id','loc'])
homicides = homicides.merge(homicides_locations,on=['loc_id','loc'])
violence = violence.merge(violence_locations,on=['loc_id','loc'])
stealing = stealing.merge(stealing_locations,on=['loc_id','loc'])
crimes.sort_values(by='date')

Unnamed: 0,date,latitude,longitude,arrest,lat,lon,location,loc,loc_id,month,day,year,crim_loc_id2
0,2018-07-09,41.894328,-87.628143,1,41.89,-87.63,"(41.894327845999996, -87.62814321)","(41.89, -87.63)",0,7,9,2018,0
76680,2018-07-09,42.018312,-87.675867,0,42.02,-87.68,"(42.018311737, -87.675866628)","(42.02, -87.68)",6424,7,9,2018,556
76681,2018-07-09,42.018312,-87.675867,0,42.02,-87.68,"(42.018311737, -87.675866628)","(42.02, -87.68)",6424,7,9,2018,556
77295,2018-07-09,41.866363,-87.772931,0,41.87,-87.77,"(41.866363205999996, -87.772931225)","(41.87, -87.77)",6211,7,9,2018,345
77296,2018-07-09,41.869786,-87.774282,0,41.87,-87.77,"(41.869786389, -87.774281815)","(41.87, -87.77)",6211,7,9,2018,345
77519,2018-07-09,41.883292,-87.731694,0,41.88,-87.73,"(41.883292437, -87.73169417700001)","(41.88, -87.73)",6232,7,9,2018,366
77520,2018-07-09,41.879693,-87.731379,0,41.88,-87.73,"(41.879693076, -87.73137909399999)","(41.88, -87.73)",6232,7,9,2018,366
77521,2018-07-09,41.879661,-87.734130,0,41.88,-87.73,"(41.879660809, -87.734129778)","(41.88, -87.73)",6232,7,9,2018,366
77522,2018-07-09,41.882185,-87.725712,0,41.88,-87.73,"(41.882185193000005, -87.72571151)","(41.88, -87.73)",6232,7,9,2018,366
145132,2018-07-09,41.886193,-87.766724,0,41.89,-87.77,"(41.886192908000005, -87.76672413200001)","(41.89, -87.77)",6245,7,9,2018,379


Now we'll create dataframes for counting criminal occurences for each date and listing the locations in which those crimes happened 

Crimes by date:

In [107]:
crimes_date=crimes[['date','loc_id','crim_loc_id2']]
crimes_date=crimes_date.groupby('date').agg(lambda x: list(x))
crimes_date['crimes_count_date']=np.nan
for i in range(len(crimes_date)):
  crimes_date.crimes_count_date.iloc[i]=len(list(crimes_date.loc_id.iloc[i]))
  crimes_date.loc_id.iloc[i]=np.unique(list(crimes_date.loc_id.iloc[i]))
  crimes_date.crim_loc_id2.iloc[i]=np.unique(list(crimes_date.crim_loc_id2.iloc[i]))
crimes_date.crimes_count_date=crimes_date.crimes_count_date.astype(int)
crimes_date=crimes_date.reset_index()
crimes_date=crimes_date.rename(index=str, columns={"loc_id": "crim_loc_id"})
crimes_date.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


KeyboardInterrupt: ignored

In [0]:
homicides_date=homicides[['date','loc_id','homi_loc_id2']]
homicides_date=homicides_date.groupby('date').agg(lambda x: list(x))
homicides_date['homicides_count_date']=np.nan
for i in range(len(homicides_date)):
  homicides_date.homicides_count_date.iloc[i]=len(list(homicides_date.loc_id.iloc[i]))
  homicides_date.loc_id.iloc[i]=np.unique(list(homicides_date.loc_id.iloc[i]))
  homicides_date.homi_loc_id2.iloc[i]=np.unique(list(homicides_date.homi_loc_id2.iloc[i]))
homicides_date.homicides_count_date=homicides_date.homicides_count_date.astype(int)
homicides_date=homicides_date.reset_index()
homicides_date=homicides_date.rename(index=str, columns={"loc_id": "homi_loc_id"})
homicides_date.head()

In [0]:
violence_date=violence[['date','loc_id','viol_loc_id2']]
violence_date=violence_date.groupby('date').agg(lambda x: list(x))
violence_date['violence_count_date']=np.nan
for i in range(len(violence_date)):
  violence_date.violence_count_date.iloc[i]=len(list(violence_date.loc_id.iloc[i]))
  violence_date.loc_id.iloc[i]=np.unique(list(violence_date.loc_id.iloc[i]))
  violence_date.viol_loc_id2.iloc[i]=np.unique(list(violence_date.viol_loc_id2.iloc[i]))
violence_date.violence_count_date=violence_date.violence_count_date.astype(int)
violence_date=violence_date.reset_index()
violence_date=violence_date.rename(index=str, columns={"loc_id": "viol_loc_id"})
violence_date.head()

In [0]:
stealing_date=stealing[['date','loc_id','stea_loc_id2']]
stealing_date=stealing_date.groupby('date').agg(lambda x: list(x))
stealing_date['stealing_count_date']=np.nan
for i in range(len(stealing_date)):
  stealing_date.stealing_count_date.iloc[i]=len(list(stealing_date.loc_id.iloc[i]))
  stealing_date.loc_id.iloc[i]=np.unique(list(stealing_date.loc_id.iloc[i]))
  stealing_date.stea_loc_id2.iloc[i]=np.unique(list(stealing_date.stea_loc_id2.iloc[i]))
stealing_date.stealing_count_date=stealing_date.stealing_count_date.astype(int)
stealing_date=stealing_date.reset_index()
stealing_date=stealing_date.rename(index=str, columns={"loc_id": "stea_loc_id"})
stealing_date.head()

Crimes by location:

In [0]:
crimes_loc=crimes[['loc_id','crim_loc_id2']]
crimes_loc=crimes_loc.groupby(['loc_id']).agg(lambda x: list(x))
crimes_loc['crimes_count_loc']=np.nan
for i in range(len(crimes_loc)):
  crimes_loc.crimes_count_loc.iloc[i]=len(list(crimes_loc.crim_loc_id2.iloc[i]))
crimes_loc.crimes_count_loc=crimes_loc.crimes_count_loc.astype(int)
crimes_loc=crimes_loc.reset_index()
crimes_loc.crim_loc_id2=crimes_loc.index
crimes_loc=crimes_loc.rename(index=str, columns={"loc_id": "crim_loc_id"})

In [0]:
homicides_loc=homicides[['loc_id','homi_loc_id2']]
homicides_loc=homicides_loc.groupby(['loc_id']).agg(lambda x: list(x))
homicides_loc['homicides_count_loc']=np.nan
for i in range(len(homicides_loc)):
  homicides_loc.homicides_count_loc.iloc[i]=len(list(homicides_loc.homi_loc_id2.iloc[i]))
homicides_loc.homicides_count_loc=homicides_loc.homicides_count_loc.astype(int)
homicides_loc=homicides_loc.reset_index()
homicides_loc.homi_loc_id2=homicides_loc.index
homicides_loc=homicides_loc.rename(index=str, columns={"loc_id": "homi_loc_id"})

In [0]:
violence_loc=violence[['loc_id','viol_loc_id2']]
violence_loc=violence_loc.groupby(['loc_id']).agg(lambda x: list(x))
violence_loc['violence_count_loc']=np.nan
for i in range(len(violence_loc)):
  violence_loc.violence_count_loc.iloc[i]=len(list(violence_loc.viol_loc_id2.iloc[i]))
violence_loc.violence_count_loc=violence_loc.violence_count_loc.astype(int)
violence_loc=violence_loc.reset_index()
violence_loc.viol_loc_id2=violence_loc.index
violence_loc=violence_loc.rename(index=str, columns={"loc_id": "viol_loc_id"})

In [0]:
stealing_loc=stealing[['loc_id','stea_loc_id2']]
stealing_loc=stealing_loc.groupby(['loc_id']).agg(lambda x: list(x))
stealing_loc['stealing_count_loc']=np.nan
for i in range(len(stealing_loc)):
  stealing_loc.stealing_count_loc.iloc[i]=len(list(stealing_loc.stea_loc_id2.iloc[i]))
stealing_loc.stealing_count_loc=stealing_loc.stealing_count_loc.astype(int)
stealing_loc=stealing_loc.reset_index()
stealing_loc.stea_loc_id2=stealing_loc.index
stealing_loc=stealing_loc.rename(index=str, columns={"loc_id": "stea_loc_id"})

## Calculating distances between Airbnb listings and criminal occurrences

We're interested in knowing the criminal activity surrounding each Airbnb listing. To do that, we'll calculate the distance between each unique location on our listings database and each unique location on our criminal occurrence datasets.

In [0]:
def distance(a,b):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees). Output in KM
    """
    lat1 = a[0]
    lat2 = b[0]
    lon1 = a[1]
    lon2 = b[1]
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    km = 6371 * c
    return km

In [0]:
dist_cr=[None]*len(list_locs)
for i in range(len(list_locs)):
  a=[None]*len(crim_locs)
  x=list_locs[i]
  for j in range(len(crim_locs)):
    a[j]=round(distance(x,crim_locs[j]),2)
  dist_cr[i]=a
  
dist_cr=pd.DataFrame(dist_cr)
dist_cr.head()

In [0]:
dist1_cr=dist_cr[dist_cr<=1].notnull().astype('int')
dist1_cr=dist1_cr.fillna(0)
dist1_cr.head()

In [0]:
dist2_cr=dist_cr[dist_cr<=2].notnull().astype('int')
dist2_cr=dist2_cr.fillna(0)
dist2_cr.head()

In [0]:
dist5_cr=dist_cr[dist_cr<=5].notnull().astype('int')
dist5_cr=dist5_cr.fillna(0)
dist5_cr.head()

In [0]:
dist_hm=[None]*len(list_locs)
for i in range(len(list_locs)):
  a=[None]*len(homi_locs)
  x=list_locs[i]
  for j in range(len(homi_locs)):
    a[j]=round(distance(x,homi_locs[j]),2)
  dist_hm[i]=a
  
dist_hm=pd.DataFrame(dist_hm)
dist_hm.head()

In [0]:
dist1_hm=dist_hm[dist_hm<=1].notnull().astype('int')
dist1_hm=dist1_hm.fillna(0)
#dist1_hm.head()

dist2_hm=dist_hm[dist_hm<=2].notnull().astype('int')
dist2_hm=dist2_hm.fillna(0)
#dist2_hm.head()

dist5_hm=dist_hm[dist_hm<=5].notnull().astype('int')
dist5_hm=dist5_hm.fillna(0)
#dist5_hm.head()


In [0]:
dist_vi=[None]*len(list_locs)
for i in range(len(list_locs)):
  a=[None]*len(aggr_locs)
  x=list_locs[i]
  for j in range(len(aggr_locs)):
    a[j]=round(distance(x,aggr_locs[j]),2)
  dist_vi[i]=a
  
dist_vi=pd.DataFrame(dist_vi)
#dist_vi.head()

In [0]:
dist1_vi=dist_vi[dist_vi<=1].notnull().astype('int')
dist1_vi=dist1_vi.fillna(0)
#dist1_vi.head()

dist2_vi=dist_vi[dist_vi<=2].notnull().astype('int')
dist2_vi=dist2_vi.fillna(0)
#dist2_vi.head()

dist5_vi=dist_vi[dist_vi<=5].notnull().astype('int')
dist5_vi=dist5_vi.fillna(0)
#dist5_vi.head()

In [0]:
dist_st=[None]*len(list_locs)
for i in range(len(list_locs)):
  a=[None]*len(stea_locs)
  x=list_locs[i]
  for j in range(len(stea_locs)):
    a[j]=round(distance(x,stea_locs[j]),2)
  dist_st[i]=a
  
dist_st=pd.DataFrame(dist_st)
#dist_st.head()

In [0]:
dist1_st=dist_st[dist_st<=1].notnull().astype('int')
dist1_st=dist1_st.fillna(0)
#dist1_st.head()

dist2_st=dist_st[dist_st<=2].notnull().astype('int')
dist2_st=dist2_st.fillna(0)
#dist2_st.head()

dist5_st=dist_st[dist_st<=5].notnull().astype('int')
dist5_st=dist5_st.fillna(0)
#dist5_st.head()

##Merging listings, calendars reviews to build our complete Airbnb database

In [0]:
airbnb = calendar.merge(listings,on=['listing_id'],how='inner')
airbnb = airbnb[airbnb.review==1]
airbnb = airbnb.drop(columns=['review'])

In [0]:
airbnb_cr=airbnb.merge(crimes_date,on=['date'])
airbnb_cr=airbnb_cr.merge(homicides_date,on=['date'])
airbnb_cr=airbnb_cr.merge(stealing_date,on=['date'])
airbnb_cr=airbnb_cr.merge(violence_date,on=['date'])

airbnb_cr['loc_id'] = airbnb_cr.loc_id2
airbnb_cr['crim_loc_id'] = airbnb_cr.crim_loc_id2
airbnb_cr['homi_loc_id'] = airbnb_cr.homi_loc_id2
airbnb_cr['stea_loc_id'] = airbnb_cr.stea_loc_id2
airbnb_cr['viol_loc_id'] = airbnb_cr.viol_loc_id2

airbnb_cr = airbnb_cr.drop(columns=['loc_id2','crim_loc_id2','stea_loc_id2','viol_loc_id2','homi_loc_id2'])
airbnb_cr = airbnb_cr.assign(date_id=(airbnb_cr['date'].astype('category').cat.codes))
airbnb_cr = airbnb_cr.rename(index=str, columns={"crim_loc_id": "crimes_that_date"})
airbnb_cr = airbnb_cr.sort_values(by=['listing_id','date'])

In [0]:
def indices(lst, element):
    result = []
    offset = -1
    while True:
        try:
            offset = lst.index(element, offset+1)
        except ValueError:
            return result
        result.append(offset)

In [0]:
#Step 1: Finding locations of crimes near the listing's location 
airbnb_cr['crimes_loc_1km']=np.nan
airbnb_cr['crimes_loc_2km']=np.nan
airbnb_cr['crimes_loc_5km']=np.nan

crimes_loc_1km=[None]*len(airbnb_cr)
crimes_loc_2km=[None]*len(airbnb_cr)
crimes_loc_5km=[None]*len(airbnb_cr)

for i in range(len(airbnb_cr)):
  loc_id=airbnb_cr.loc_id.iloc[i]
  crimes_loc_1km[i]=indices(list(dist1_cr.iloc[loc_id]),1)
  crimes_loc_2km[i]=list(np.setdiff1d(indices(list(dist2_cr.loc[loc_id]),1),indices(list(dist1_cr.loc[loc_id]),1)))
  crimes_loc_5km[i]=list(np.setdiff1d(indices(list(dist5_cr.loc[loc_id]),1),indices(list(dist2_cr.loc[loc_id]),1)))

airbnb_cr.crimes_loc_1km=crimes_loc_1km
airbnb_cr.crimes_loc_2km=crimes_loc_2km
airbnb_cr.crimes_loc_5km=crimes_loc_5km


#Step 2: Finding which close crime locations had crimes on that date
airbnb_cr['crimes_1km']=np.nan
airbnb_cr['crimes_2km']=np.nan
airbnb_cr['crimes_5km']=np.nan

crimes_1km=[None]*len(airbnb_cr)
crimes_2km=[None]*len(airbnb_cr)
crimes_5km=[None]*len(airbnb_cr)

for i in range(len(airbnb_cr)):
  crimes_that_date = airbnb_cr.crimes_that_date.iloc[i]
  crimes_1km[i] = list(set(crimes_that_date).intersection(airbnb_cr.crimes_loc_1km.iloc[i]))
  crimes_2km[i] = list(set(crimes_that_date).intersection(airbnb_cr.crimes_loc_2km.iloc[i]))
  crimes_5km[i] = list(set(crimes_that_date).intersection(airbnb_cr.crimes_loc_5km.iloc[i]))
  
airbnb_cr.crimes_1km=crimes_1km
airbnb_cr.crimes_2km=crimes_2km
airbnb_cr.crimes_5km=crimes_5km


#Step 3: Counting how many crimes happened that date on those close locations
airbnb_cr['crimes_1km_count']=np.nan
airbnb_cr['crimes_2km_count']=np.nan
airbnb_cr['crimes_5km_count']=np.nan

crimes_1km_count=[None]*len(airbnb_cr)
crimes_2km_count=[None]*len(airbnb_cr)
crimes_5km_count=[None]*len(airbnb_cr)

for i in range(len(airbnb_cr)):
  c=crimes[crimes["date"]==airbnb_cr.date.iloc[i]]
  l1=airbnb_cr.crimes_1km.iloc[i]
  l2=airbnb_cr.crimes_2km.iloc[i]
  l5=airbnb_cr.crimes_5km.iloc[i]
  
  c1=c[c["crim_loc_id2"].isin(l1)]
  c2=c[c["crim_loc_id2"].isin(l2)]
  c5=c[c["crim_loc_id2"].isin(l5)]  
  
  crimes_1km_count[i]=len(c1)
  crimes_2km_count[i]=len(c2)
  crimes_5km_count[i]=len(c5)  
  
airbnb_cr.crimes_1km_count=crimes_1km_count
airbnb_cr.crimes_2km_count=crimes_2km_count
airbnb_cr.crimes_5km_count=crimes_5km_count

In [57]:
airbnb_cr.head()

Unnamed: 0,listing_id,date,price,month,day,year,host_id,neighbourhood,latitude,longitude,room_type,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,lr_m,lr_d,lr_y,lat,lon,location,loc,loc_id,crimes_that_date,crimes_count_date,date_id,crimes_loc_1km,crimes_loc_2km,crimes_loc_5km,crimes_1km,crimes_2km,crimes_5km,crimes_1km_count,crimes_2km_count,crimes_5km_count
0,2384,2018-07-26,75.0,7,26,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 33, 34, 35, 37, 40, 45, 49, 50, 51, 56, 58...",731,17,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...","[244, 246]","[224, 225, 264, 265, 266, 633, 223]","[262, 263, 279, 281, 162, 163, 164, 165, 166, ...",4,14,89
268,2384,2018-07-29,69.0,7,29,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 36, 37, 38, 40, 41, 42, 46, 47, 48, 49, 50...",731,20,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[],"[224, 225, 265, 266, 267, 243, 223]","[262, 263, 278, 279, 280, 281, 289, 162, 163, ...",0,20,76
847,2384,2018-08-05,65.0,8,5,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 32, 33, 36, 37, 43, 45, 47, 50, 51, 52, 57...",811,27,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[246],"[265, 266, 267, 243, 223]","[262, 263, 280, 281, 289, 162, 163, 164, 165, ...",1,12,94
1343,2384,2018-08-12,75.0,8,12,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 27, 30, 31, 34, 36, 37, 40, 41, 46, 47, 48...",781,34,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...","[244, 246]","[224, 225, 265, 266, 267, 223]","[263, 278, 279, 289, 162, 163, 164, 165, 166, ...",3,14,84
1976,2384,2018-10-01,65.0,10,1,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[0, 32, 34, 36, 37, 41, 47, 50, 54, 56, 58, 60...",732,84,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[244],"[224, 264, 265, 266, 223]","[263, 279, 280, 281, 289, 162, 163, 164, 165, ...",1,7,86


In [0]:
#Step 1: Finding locations of crimes near the listing's location 
airbnb_hm['crimes_loc_1km']=np.nan
airbnb_hm['crimes_loc_2km']=np.nan
airbnb_hm['crimes_loc_5km']=np.nan

crimes_loc_1km=[None]*len(airbnb_hm)
crimes_loc_2km=[None]*len(airbnb_hm)
crimes_loc_5km=[None]*len(airbnb_hm)

for i in range(len(airbnb_hm)):
  loc_id=airbnb_hm.loc_id.iloc[i]
  crimes_loc_1km[i]=indices(list(dist1_hm.iloc[loc_id]),1)
  crimes_loc_2km[i]=list(np.setdiff1d(indices(list(dist2_hm.loc[loc_id]),1),indices(list(dist1_hm.loc[loc_id]),1)))
  crimes_loc_5km[i]=list(np.setdiff1d(indices(list(dist5_hm.loc[loc_id]),1),indices(list(dist2_hm.loc[loc_id]),1)))

airbnb_hm.crimes_loc_1km=crimes_loc_1km
airbnb_hm.crimes_loc_2km=crimes_loc_2km
airbnb_hm.crimes_loc_5km=crimes_loc_5km


#Step 2: Finding which close crime locations had crimes on that date
airbnb_hm['crimes_1km']=np.nan
airbnb_hm['crimes_2km']=np.nan
airbnb_hm['crimes_5km']=np.nan

crimes_1km=[None]*len(airbnb_hm)
crimes_2km=[None]*len(airbnb_hm)
crimes_5km=[None]*len(airbnb_hm)

for i in range(len(airbnb_hm)):
  crimes_that_date = airbnb_hm.crimes_that_date.iloc[i]
  crimes_1km[i] = list(set(crimes_that_date).intersection(airbnb_hm.crimes_loc_1km.iloc[i]))
  crimes_2km[i] = list(set(crimes_that_date).intersection(airbnb_hm.crimes_loc_2km.iloc[i]))
  crimes_5km[i] = list(set(crimes_that_date).intersection(airbnb_hm.crimes_loc_5km.iloc[i]))
  
airbnb_hm.crimes_1km=crimes_1km
airbnb_hm.crimes_2km=crimes_2km
airbnb_hm.crimes_5km=crimes_5km


#Step 3: Counting how many crimes happened that date on those close locations
airbnb_hm['crimes_1km_count']=np.nan
airbnb_hm['crimes_2km_count']=np.nan
airbnb_hm['crimes_5km_count']=np.nan

crimes_1km_count=[None]*len(airbnb_hm)
crimes_2km_count=[None]*len(airbnb_hm)
crimes_5km_count=[None]*len(airbnb_hm)

for i in range(len(airbnb_hm)):
  c=crimes[crimes["date"]==airbnb_hm.date.iloc[i]]
  l1=airbnb_hm.crimes_1km.iloc[i]
  l2=airbnb_hm.crimes_2km.iloc[i]
  l5=airbnb_hm.crimes_5km.iloc[i]
  
  c1=c[c["crim_loc_id2"].isin(l1)]
  c2=c[c["crim_loc_id2"].isin(l2)]
  c5=c[c["crim_loc_id2"].isin(l5)]  
  
  crimes_1km_count[i]=len(c1)
  crimes_2km_count[i]=len(c2)
  crimes_5km_count[i]=len(c5)  
  
airbnb_hm.crimes_1km_count=crimes_1km_count
airbnb_hm.crimes_2km_count=crimes_2km_count
airbnb_hm.crimes_5km_count=crimes_5km_count

In [59]:
airbnb_hm.head()

Unnamed: 0,listing_id,date,price,month,day,year,host_id,neighbourhood,latitude,longitude,room_type,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,lr_m,lr_d,lr_y,lat,lon,location,loc,loc_id,homi_loc_id,homicides_count_date,crimes_that_date,date_id,crimes_loc_1km,crimes_loc_2km,crimes_loc_5km,crimes_1km,crimes_2km,crimes_5km,crimes_1km_count,crimes_2km_count,crimes_5km_count
0,2384,2018-07-26,75.0,7,26,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[6107, 6127]",2,"[121, 127]",15,[123],"[223, 224, 225, 243, 244, 245, 246, 264, 265, ...","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...",[],[],[121],0,0,1
268,2384,2018-07-29,69.0,7,29,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[5946, 6005, 6265]",3,"[25, 48, 190]",18,[123],"[223, 224, 225, 243, 244, 245, 246, 264, 265, ...","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...",[],[],[],0,0,0
847,2384,2018-08-05,65.0,8,5,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[6023, 6043, 6141, 6181, 6200, 6216, 6230, 6282]",8,"[60, 76, 132, 146, 154, 161, 166, 200]",24,[123],"[223, 224, 225, 243, 244, 245, 246, 264, 265, ...","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...",[],[],[],0,0,0
1343,2384,2018-10-01,65.0,10,1,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,"[5902, 6243, 6423]",4,"[3, 175, 232]",72,[123],"[223, 224, 225, 243, 244, 245, 246, 264, 265, ...","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...",[],[],[],0,0,0
1793,2384,2018-10-31,75.0,10,31,2018,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,[6301],1,[208],97,[123],"[223, 224, 225, 243, 244, 245, 246, 264, 265, ...","[65, 66, 67, 68, 69, 77, 78, 79, 80, 81, 82, 8...",[],[],[],0,0,0


In [0]:
#Step 1: Finding locations of crimes near the listing's location 
#airbnb_ag['crimes_loc_1km']=np.nan
#airbnb_ag['crimes_loc_2km']=np.nan
#airbnb_ag['crimes_loc_5km']=np.nan

#crimes_loc_1km=[None]*len(airbnb_ag)
#crimes_loc_2km=[None]*len(airbnb_ag)
#crimes_loc_5km=[None]*len(airbnb_ag)

#for i in range(len(airbnb_ag)):
#  loc_id=airbnb_ag.loc_id.iloc[i]
#  crimes_loc_1km[i]=indices(list(dist1_ag.iloc[loc_id]),1)
#  crimes_loc_2km[i]=list(np.setdiff1d(indices(list(dist2_ag.loc[loc_id]),1),indices(list(dist1_ag.loc[loc_id]),1)))
#  crimes_loc_5km[i]=list(np.setdiff1d(indices(list(dist5_ag.loc[loc_id]),1),indices(list(dist2_ag.loc[loc_id]),1)))

#airbnb_ag.crimes_loc_1km=crimes_loc_1km
#airbnb_ag.crimes_loc_2km=crimes_loc_2km
#airbnb_ag.crimes_loc_5km=crimes_loc_5km


#Step 2: Finding which close crime locations had crimes on that date
#airbnb_ag['crimes_1km']=np.nan
#airbnb_ag['crimes_2km']=np.nan
#airbnb_ag['crimes_5km']=np.nan

#crimes_1km=[None]*len(airbnb_ag)
#crimes_2km=[None]*len(airbnb_ag)
#crimes_5km=[None]*len(airbnb_ag)

#for i in range(len(airbnb_ag)):
#  crimes_that_date = airbnb_ag.crimes_that_date.iloc[i]
#  crimes_1km[i] = list(set(crimes_that_date).intersection(airbnb_ag.crimes_loc_1km.iloc[i]))
#  crimes_2km[i] = list(set(crimes_that_date).intersection(airbnb_ag.crimes_loc_2km.iloc[i]))
#  crimes_5km[i] = list(set(crimes_that_date).intersection(airbnb_ag.crimes_loc_5km.iloc[i]))
  
#airbnb_ag.crimes_1km=crimes_1km
#airbnb_ag.crimes_2km=crimes_2km
#airbnb_ag.crimes_5km=crimes_5km


#Step 3: Counting how many crimes happened that date on those close locations
#airbnb_ag['crimes_1km_count']=np.nan
#airbnb_ag['crimes_2km_count']=np.nan
#airbnb_ag['crimes_5km_count']=np.nan##

#crimes_1km_count=[None]*len(airbnb_ag)
#crimes_2km_count=[None]*len(airbnb_ag)
#crimes_5km_count=[None]*len(airbnb_ag)

#for i in range(len(airbnb_ag)):
#  c=crimes[crimes["date"]==airbnb_ag.date.iloc[i]]
#  l1=airbnb_ag.crimes_1km.iloc[i]
#  l2=airbnb_ag.crimes_2km.iloc[i]
#  l5=airbnb_ag.crimes_5km.iloc[i]
  
#  c1=c[c["crim_loc_id2"].isin(l1)]
#  c2=c[c["crim_loc_id2"].isin(l2)]
#  c5=c[c["crim_loc_id2"].isin(l5)]  
  
#  crimes_1km_count[i]=len(c1)
#  crimes_2km_count[i]=len(c2)
#  crimes_5km_count[i]=len(c5)  
  
#airbnb_ag.crimes_1km_count=crimes_1km_count
#airbnb_ag.crimes_2km_count=crimes_2km_count
#airbnb_ag.crimes_5km_count=crimes_5km_count

In [0]:
#airbnb_ag.head()

In [0]:
#Step 1: Finding locations of crimes near the listing's location 
#airbnb_st['crimes_loc_1km']=np.nan
#airbnb_st['crimes_loc_2km']=np.nan
#airbnb_st['crimes_loc_5km']=np.nan

#crimes_loc_1km=[None]*len(airbnb_st)
#crimes_loc_2km=[None]*len(airbnb_st)
#crimes_loc_5km=[None]*len(airbnb_st)

#for i in range(len(airbnb_st)):
#  loc_id=airbnb_st.loc_id.iloc[i]
#  crimes_loc_1km[i]=indices(list(dist1_st.iloc[loc_id]),1)
#  crimes_loc_2km[i]=list(np.setdiff1d(indices(list(dist2_st.loc[loc_id]),1),indices(list(dist1_st.loc[loc_id]),1)))
#  crimes_loc_5km[i]=list(np.setdiff1d(indices(list(dist5_st.loc[loc_id]),1),indices(list(dist2_st.loc[loc_id]),1)))

#airbnb_st.crimes_loc_1km=crimes_loc_1km
#airbnb_st.crimes_loc_2km=crimes_loc_2km
#airbnb_st.crimes_loc_5km=crimes_loc_5km


#Step 2: Finding which close crime locations had crimes on that date
#airbnb_st['crimes_1km']=np.nan
#airbnb_st['crimes_2km']=np.nan
#airbnb_st['crimes_5km']=np.nan

#crimes_1km=[None]*len(airbnb_st)
#crimes_2km=[None]*len(airbnb_st)
#crimes_5km=[None]*len(airbnb_st)

#for i in range(len(airbnb_st)):
#  crimes_that_date = airbnb_st.crimes_that_date.iloc[i]
#  crimes_1km[i] = list(set(crimes_that_date).intersection(airbnb_st.crimes_loc_1km.iloc[i]))
#  crimes_2km[i] = list(set(crimes_that_date).intersection(airbnb_st.crimes_loc_2km.iloc[i]))
#  crimes_5km[i] = list(set(crimes_that_date).intersection(airbnb_st.crimes_loc_5km.iloc[i]))
  
#airbnb_st.crimes_1km=crimes_1km
#airbnb_st.crimes_2km=crimes_2km
#airbnb_st.crimes_5km=crimes_5km


#Step 3: Counting how many crimes happened that date on those close locations
#airbnb_st['crimes_1km_count']=np.nan
#airbnb_st['crimes_2km_count']=np.nan
#airbnb_st['crimes_5km_count']=np.nan

#crimes_1km_count=[None]*len(airbnb_st)
#crimes_2km_count=[None]*len(airbnb_st)
#crimes_5km_count=[None]*len(airbnb_st)

#for i in range(len(airbnb_st)):
#  c=crimes[crimes["date"]==airbnb_st.date.iloc[i]]
#  l1=airbnb_st.crimes_1km.iloc[i]
#  l2=airbnb_st.crimes_2km.iloc[i]
#  l5=airbnb_st.crimes_5km.iloc[i]
  
#  c1=c[c["crim_loc_id2"].isin(l1)]
#  c2=c[c["crim_loc_id2"].isin(l2)]
#  c5=c[c["crim_loc_id2"].isin(l5)]  
  
#  crimes_1km_count[i]=len(c1)
#  crimes_2km_count[i]=len(c2)
#  crimes_5km_count[i]=len(c5)  
  
#airbnb_st.crimes_1km_count=crimes_1km_count
#airbnb_st.crimes_2km_count=crimes_2km_count
#airbnb_st.crimes_5km_count=crimes_5km_count

In [0]:
#airbnb_st.head()

In [64]:
X = airbnb_cr[['crimes_1km_count','crimes_2km_count','crimes_5km_count','crimes_count_date','room_type','neighbourhood','reviews_per_month','availability_365']]
y = airbnb_cr[['price']]

X.availability_365=X.availability_365/365

X = pd.concat([X, pd.get_dummies(X.room_type, prefix='room_type', drop_first=True)], axis=1)
X = pd.concat([X, pd.get_dummies(X.neighbourhood, drop_first=True)], axis=1)

X=X.drop(['neighbourhood','room_type'],axis=1)

reg = LinearRegression().fit(X, y)
print(reg.score(X, y))
print(reg.coef_)
print(reg.intercept_)
#reg.predict(np.array([[3, 5]]))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


0.28387532020747297
[[   1.3169166     1.37700426    0.11322049    0.17016341   -8.57663254
    28.55716909  -79.73901249 -112.32743761   61.44671771   11.39955402
   -21.18959845  -63.40837379  -50.81933529  -53.77777379   -1.61070395
   -23.43462287  -19.1950975    -6.36107795  -27.74228641  -20.9956442
   -53.97722995   15.71390368  -12.4573762   -54.71827823   -1.88331464
     8.39012497  -57.08370693   18.21619486  -51.46328842  -25.33704504
   -16.36764424  -23.94105097  -82.28521769  -11.18150964  -35.09182942
   -30.18379556   -8.5263818    10.26353314  -15.21618857    0.68658051
    49.15594963   34.52618262  -13.91025547    0.53318993   78.71223832
   -11.05243493  -12.48800025  -67.82278485  -16.25213255   52.1946042
    59.89375659   10.92982532  -42.01323725   22.49760578  -51.61243594
    -6.17225352   16.62416333    4.092964    -12.02187757  -34.53482451
   -12.88647367  -61.93748242  -66.56374116  -35.24909274  -59.55479518
    15.82872868  -12.56560452    7.84971241  -

In [65]:
X = airbnb_hm[['crimes_1km_count','crimes_2km_count','crimes_5km_count','homicides_count_date','room_type','neighbourhood','reviews_per_month','availability_365']]
y = airbnb_hm[['price']]

X.availability_365=X.availability_365/365

X = pd.concat([X, pd.get_dummies(X.room_type, prefix='room_type', drop_first=True)], axis=1)
X = pd.concat([X, pd.get_dummies(X.neighbourhood, drop_first=True)], axis=1)

X=X.drop(['neighbourhood','room_type'],axis=1)

reg = LinearRegression().fit(X, y)
print(reg.score(X, y))
print(reg.coef_)
print(reg.intercept_)
#reg.predict(np.array([[3, 5]]))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


0.2632378099267898
[[  -6.37464342   -4.2506025     0.37254859    3.50899838   -9.00953224
    31.95618605  -78.12206346 -110.26709727   62.06059615   18.9196523
   -26.37413156  -34.24304103  -33.77379194  -38.95279219    9.10355792
   -16.69660563  -23.45009476   -3.1365229   -28.33926297  -13.14932772
   -29.65655299   24.98635817  -23.42341109  -10.17527947    0.46043128
   -12.22409482  -25.81833718    4.56995374  -43.49296407  -17.17251966
   -27.4958234   -11.03560772  -44.97053409  -21.99770241  -22.01658036
     2.56541937   -4.18926961   15.93809642  -26.01310887    5.29936904
    64.08037036   56.24362374  -10.00302249   19.05247858  127.40242194
     0.56894439  -13.30698238  -69.67229812  -29.56906796   99.20402257
    72.73918545   52.83906944  -28.53656324   29.99499653  -16.62125312
   -12.80602854    0.86032743   11.42102253  -15.62273657  -33.10123807
   -13.43242707  -49.28700605  -42.66444596  -14.65535287  -39.24470852
    21.41301798  -10.02823102   26.74383155  -

In [0]:
#X = airbnb_ag[['crimes_1km_count','crimes_2km_count','crimes_5km_count','violence_count_date','room_type','neighbourhood','reviews_per_month','availability_365']]
#y = airbnb_ag[['price']]

#X.availability_365=X.availability_365/365

#X = pd.concat([X, pd.get_dummies(X.room_type, prefix='room_type', drop_first=True)], axis=1)
#X = pd.concat([X, pd.get_dummies(X.neighbourhood, drop_first=True)], axis=1)

#X=X.drop(['neighbourhood','room_type'],axis=1)

#reg = LinearRegression().fit(X, y)
#print(reg.score(X, y))
#print(reg.coef_)
#print(reg.intercept_)
#reg.predict(np.array([[3, 5]]))

In [0]:
#X = airbnb_st[['crimes_1km_count','crimes_2km_count','crimes_5km_count','stealing_count_date','room_type','neighbourhood','reviews_per_month','availability_365']]
#y = airbnb_st[['price']]

#X.availability_365=X.availability_365/365

#X = pd.concat([X, pd.get_dummies(X.room_type, prefix='room_type', drop_first=True)], axis=1)
#X = pd.concat([X, pd.get_dummies(X.neighbourhood, drop_first=True)], axis=1)

#X=X.drop(['neighbourhood','room_type'],axis=1)

#reg = LinearRegression().fit(X, y)
#print(reg.score(X, y))
#print(reg.coef_)
#print(reg.intercept_)
#reg.predict(np.array([[3, 5]]))

In [68]:
Z=airbnb_cr[['price','crimes_1km_count','crimes_2km_count','crimes_5km_count']]
Z.corr()

Unnamed: 0,price,crimes_1km_count,crimes_2km_count,crimes_5km_count
price,1.0,0.151853,0.158395,0.038009
crimes_1km_count,0.151853,1.0,0.571406,-0.012513
crimes_2km_count,0.158395,0.571406,1.0,0.266614
crimes_5km_count,0.038009,-0.012513,0.266614,1.0


In [69]:
Z=airbnb_hm[['price','crimes_1km_count','crimes_2km_count','crimes_5km_count']]
Z.corr()

Unnamed: 0,price,crimes_1km_count,crimes_2km_count,crimes_5km_count
price,1.0,-0.015099,-0.016759,-0.022705
crimes_1km_count,-0.015099,1.0,-0.000926,0.177514
crimes_2km_count,-0.016759,-0.000926,1.0,-0.006148
crimes_5km_count,-0.022705,0.177514,-0.006148,1.0


In [0]:
#Z=airbnb_st[['price','crimes_1km_count','crimes_2km_count','crimes_5km_count']]
#Z.corr()

#Price Changes

In [0]:
cal_change=cal_change.drop(cal_change[(cal_change.year==2018) & (cal_change.month==7) & (cal_change.day<9)].index)
cal_change=cal_change.drop(cal_change[(cal_change.year>2019)].index)
cal_change=cal_change.drop(cal_change[(cal_change.year==2019) & (cal_change.month>6)].index)                           


In [72]:
cal_change_r=cal_change[cal_change['review']==1]
cal_change_r_latest=cal_change_r.drop_duplicates(subset=['listing_id','date'],keep='first')
cal_change_r_latest=cal_change_r_latest.rename(index=str, columns={"price": "price_latest","scr_date": "scr_date_latest"})
cal_change_r_earliest=cal_change_r.drop_duplicates(subset=['listing_id','date'],keep='last')
cal_change_r_earliest=cal_change_r_earliest.rename(index=str, columns={"price": "price_earliest","scr_date": "scr_date_earliest"})
cal_change_r_earliest=cal_change_r_earliest.drop(['month','day','year','review'],axis=1)
cal_change_r=cal_change_r_latest.merge(cal_change_r_earliest,on=['listing_id','date'])
cal_change_r=cal_change_r.merge(listings, on='listing_id')
cal_change_r=cal_change_r.drop(['loc_id'], axis=1)
cal_change_r=cal_change_r.rename(index=str, columns={"loc_id2": "loc_id"})

cal_change_r.head()

Unnamed: 0,listing_id,date,price_latest,scr_date_latest,month,day,year,review,price_earliest,scr_date_earliest,host_id,neighbourhood,latitude,longitude,room_type,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,lr_m,lr_d,lr_y,lat,lon,location,loc,loc_id
0,2384,2018-07-26,75.0,2018-07-18,7,26,2018,1,60.0,2018-04-15,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71
1,2384,2018-07-29,69.0,2018-07-18,7,29,2018,1,60.0,2018-04-15,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71
2,2384,2018-08-12,75.0,2018-07-18,8,12,2018,1,65.0,2018-05-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71
3,2384,2018-10-31,75.0,2018-07-18,10,31,2018,1,65.0,2018-05-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71
4,2384,2018-11-05,65.0,2018-10-11,11,5,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71


In [73]:
cal_change_r['scr_lat_str']=cal_change_r.scr_date_latest.astype('str')
cal_change_r['scr_lat_m']=cal_change_r.scr_lat_str.apply(lambda x: int(x[5:7]))
cal_change_r['scr_lat_y']=cal_change_r.scr_lat_str.apply(lambda x: int(x[0:4]))
cal_change_r['scr_ear_str']=cal_change_r.scr_date_earliest.astype('str')
cal_change_r['scr_ear_m']=cal_change_r.scr_ear_str.apply(lambda x: int(x[5:7]))
cal_change_r['scr_ear_y']=cal_change_r.scr_ear_str.apply(lambda x: int(x[0:4]))
cal_change_r=cal_change_r.drop(['scr_lat_str','scr_ear_str'],axis=1)

cal_change_r=cal_change_r.drop(cal_change_r[(cal_change_r.scr_lat_y==2018) & (cal_change_r.scr_lat_m<7)].index)
cal_change_r=cal_change_r.drop(cal_change_r[(cal_change_r.scr_ear_y==2018) & (cal_change_r.scr_ear_m<7)].index)

cal_change_r.head()

Unnamed: 0,listing_id,date,price_latest,scr_date_latest,month,day,year,review,price_earliest,scr_date_earliest,host_id,neighbourhood,latitude,longitude,room_type,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,lr_m,lr_d,lr_y,lat,lon,location,loc,loc_id,scr_lat_m,scr_lat_y,scr_ear_m,scr_ear_y
4,2384,2018-11-05,65.0,2018-10-11,11,5,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018
5,2384,2018-11-09,65.0,2018-10-11,11,9,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018
6,2384,2018-11-12,65.0,2018-10-11,11,12,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018
7,2384,2018-11-30,65.0,2018-10-11,11,30,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018
8,2384,2018-12-03,75.0,2018-11-15,12,3,2018,1,80.0,2018-09-14,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,11,2018,9,2018


In [74]:
cal_change_r['price_dif']=cal_change_r['price_latest']-cal_change_r['price_earliest']
cal_change_r.head()

Unnamed: 0,listing_id,date,price_latest,scr_date_latest,month,day,year,review,price_earliest,scr_date_earliest,host_id,neighbourhood,latitude,longitude,room_type,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,lr_m,lr_d,lr_y,lat,lon,location,loc,loc_id,scr_lat_m,scr_lat_y,scr_ear_m,scr_ear_y,price_dif
4,2384,2018-11-05,65.0,2018-10-11,11,5,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0
5,2384,2018-11-09,65.0,2018-10-11,11,9,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0
6,2384,2018-11-12,65.0,2018-10-11,11,12,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0
7,2384,2018-11-30,65.0,2018-10-11,11,30,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0
8,2384,2018-12-03,75.0,2018-11-15,12,3,2018,1,80.0,2018-09-14,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,11,2018,9,2018,-5.0


In [75]:
len(cal_change_r)

59947

In [0]:
#Step 1: Finding locations of crimes near the listing's location 
cal_change_r['crimes_loc_1km']=np.nan
cal_change_r['crimes_loc_2km']=np.nan
cal_change_r['crimes_loc_5km']=np.nan

crimes_loc_1km=[None]*len(cal_change_r)
crimes_loc_2km=[None]*len(cal_change_r)
crimes_loc_5km=[None]*len(cal_change_r)

for i in range(len(cal_change_r)):
  loc_id=cal_change_r.loc_id.iloc[i]
  crimes_loc_1km[i]=indices(list(dist1_cr.iloc[loc_id]),1)
  crimes_loc_2km[i]=list(np.setdiff1d(indices(list(dist2_cr.loc[loc_id]),1),indices(list(dist1_cr.loc[loc_id]),1)))
  crimes_loc_5km[i]=list(np.setdiff1d(indices(list(dist5_cr.loc[loc_id]),1),indices(list(dist2_cr.loc[loc_id]),1)))

cal_change_r.crimes_loc_1km=crimes_loc_1km
cal_change_r.crimes_loc_2km=crimes_loc_2km
cal_change_r.crimes_loc_5km=crimes_loc_5km

In [0]:
#Step 2: Finding which close crime locations had crimes on that date
cal_change_r['crimes_1km_lat']=np.nan
cal_change_r['crimes_2km_lat']=np.nan
cal_change_r['crimes_5km_lat']=np.nan
cal_change_r['crimes_1km_ear']=np.nan
cal_change_r['crimes_2km_ear']=np.nan
cal_change_r['crimes_5km_ear']=np.nan

crimes_1km_lat=[None]*len(cal_change_r)
crimes_2km_lat=[None]*len(cal_change_r)
crimes_5km_lat=[None]*len(cal_change_r)
crimes_1km_ear=[None]*len(cal_change_r)
crimes_2km_ear=[None]*len(cal_change_r)
crimes_5km_ear=[None]*len(cal_change_r)


for i in range(len(cal_change_r)):
  date_lat = cal_change_r.scr_date_latest.iloc[i]
  date_ear = cal_change_r.scr_date_earliest.iloc[i]
  crimes_date_lat = crimes_date[crimes_date['date']==date_lat].crim_loc_id2
  crimes_date_ear = crimes_date[crimes_date['date']==date_ear].crim_loc_id2
  crimes_1km_lat[i] = list(set(crimes_date_lat[0]).intersection(cal_change_r.crimes_loc_1km.iloc[i]))
  crimes_2km_lat[i] = list(set(crimes_date_lat[0]).intersection(cal_change_r.crimes_loc_2km.iloc[i]))
  crimes_5km_lat[i] = list(set(crimes_date_lat[0]).intersection(cal_change_r.crimes_loc_5km.iloc[i]))
 
  crimes_1km_ear[i] = list(set(crimes_date_ear[0]).intersection(cal_change_r.crimes_loc_1km.iloc[i]))
  crimes_2km_ear[i] = list(set(crimes_date_ear[0]).intersection(cal_change_r.crimes_loc_2km.iloc[i]))
  crimes_5km_ear[i] = list(set(crimes_date_ear[0]).intersection(cal_change_r.crimes_loc_5km.iloc[i]))
  
cal_change_r.crimes_1km_lat=crimes_1km_lat
cal_change_r.crimes_2km_lat=crimes_2km_lat
cal_change_r.crimes_5km_lat=crimes_5km_lat

cal_change_r.crimes_1km_ear=crimes_1km_ear
cal_change_r.crimes_2km_ear=crimes_2km_ear
cal_change_r.crimes_5km_ear=crimes_5km_ear

In [0]:
#Step 3: Counting how many crimes happened that date on those close locations
cal_change_r['crimes_1km_count_lat']=np.nan
cal_change_r['crimes_2km_count_lat']=np.nan
cal_change_r['crimes_5km_count_lat']=np.nan

cal_change_r['crimes_1km_count_ear']=np.nan
cal_change_r['crimes_2km_count_ear']=np.nan
cal_change_r['crimes_5km_count_ear']=np.nan

crimes_1km_count_lat=[None]*len(cal_change_r)
crimes_2km_count_lat=[None]*len(cal_change_r)
crimes_5km_count_lat=[None]*len(cal_change_r)

crimes_1km_count_ear=[None]*len(cal_change_r)
crimes_2km_count_ear=[None]*len(cal_change_r)
crimes_5km_count_ear=[None]*len(cal_change_r)


for i in range(len(cal_change_r)):
  date_lat = cal_change_r.scr_date_latest.iloc[i]
  date_ear = cal_change_r.scr_date_earliest.iloc[i]
  
  c_lat=crimes[crimes["date"]==date_lat]
  c_ear=crimes[crimes["date"]==date_ear]
  
  l1_lat=cal_change_r.crimes_1km_lat.iloc[i]
  l2_lat=cal_change_r.crimes_2km_lat.iloc[i]
  l5_lat=cal_change_r.crimes_5km_lat.iloc[i]
  
  l1_ear=cal_change_r.crimes_1km_ear.iloc[i]
  l2_ear=cal_change_r.crimes_2km_ear.iloc[i]
  l5_ear=cal_change_r.crimes_5km_ear.iloc[i]
  
  c1_lat=c_lat[c_lat['crim_loc_id2'].isin(l1_lat)]
  c2_lat=c_lat[c_lat['crim_loc_id2'].isin(l2_lat)]
  c5_lat=c_lat[c_lat['crim_loc_id2'].isin(l5_lat)]  
  
  c1_ear=c_ear[c_ear['crim_loc_id2'].isin(l1_ear)]
  c2_ear=c_ear[c_ear['crim_loc_id2'].isin(l2_ear)]
  c5_ear=c_ear[c_ear['crim_loc_id2'].isin(l5_ear)]  
  
  crimes_1km_count_lat[i]=len(c1_lat)
  crimes_2km_count_lat[i]=len(c2_lat)
  crimes_5km_count_lat[i]=len(c5_lat)  
  
  crimes_1km_count_ear[i]=len(c1_ear)
  crimes_2km_count_ear[i]=len(c2_ear)
  crimes_5km_count_ear[i]=len(c5_ear)  
  
cal_change_r.crimes_1km_count_lat=crimes_1km_count_lat
cal_change_r.crimes_2km_count_lat=crimes_2km_count_lat
cal_change_r.crimes_5km_count_lat=crimes_5km_count_lat

cal_change_r.crimes_1km_count_ear=crimes_1km_count_ear
cal_change_r.crimes_2km_count_ear=crimes_2km_count_ear
cal_change_r.crimes_5km_count_ear=crimes_5km_count_ear

In [0]:
cal_change_r['crimes_1km_count_dif']=cal_change_r['crimes_1km_count_lat']-cal_change_r['crimes_1km_count_ear']
cal_change_r['crimes_2km_count_dif']=cal_change_r['crimes_2km_count_lat']-cal_change_r['crimes_2km_count_ear']
cal_change_r['crimes_5km_count_dif']=cal_change_r['crimes_5km_count_lat']-cal_change_r['crimes_5km_count_ear']

In [80]:
cal_change_r.head()


Unnamed: 0,listing_id,date,price_latest,scr_date_latest,month,day,year,review,price_earliest,scr_date_earliest,host_id,neighbourhood,latitude,longitude,room_type,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,lr_m,lr_d,lr_y,lat,lon,location,loc,loc_id,scr_lat_m,scr_lat_y,scr_ear_m,scr_ear_y,price_dif,crimes_loc_1km,crimes_loc_2km,crimes_loc_5km,crimes_1km_lat,crimes_2km_lat,crimes_5km_lat,crimes_1km_ear,crimes_2km_ear,crimes_5km_ear,crimes_1km_count_lat,crimes_2km_count_lat,crimes_5km_count_lat,crimes_1km_count_ear,crimes_2km_count_ear,crimes_5km_count_ear,crimes_1km_count_dif,crimes_2km_count_dif,crimes_5km_count_dif
4,2384,2018-11-05,65.0,2018-10-11,11,5,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[245],"[264, 225, 266, 265]","[263, 280, 282, 162, 163, 164, 165, 166, 290, ...","[245, 246]","[225, 264, 265, 266, 243, 633, 223]","[262, 263, 279, 280, 282, 162, 163, 164, 165, ...",1,5,60,3,9,78,-2,-4,-18
5,2384,2018-11-09,65.0,2018-10-11,11,9,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[245],"[264, 225, 266, 265]","[263, 280, 282, 162, 163, 164, 165, 166, 290, ...","[245, 246]","[225, 264, 265, 266, 243, 633, 223]","[262, 263, 279, 280, 282, 162, 163, 164, 165, ...",1,5,60,3,9,78,-2,-4,-18
6,2384,2018-11-12,65.0,2018-10-11,11,12,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[245],"[264, 225, 266, 265]","[263, 280, 282, 162, 163, 164, 165, 166, 290, ...","[245, 246]","[225, 264, 265, 266, 243, 633, 223]","[262, 263, 279, 280, 282, 162, 163, 164, 165, ...",1,5,60,3,9,78,-2,-4,-18
7,2384,2018-11-30,65.0,2018-10-11,11,30,2018,1,75.0,2018-07-18,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,10,2018,7,2018,-10.0,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[245],"[264, 225, 266, 265]","[263, 280, 282, 162, 163, 164, 165, 166, 290, ...","[245, 246]","[225, 264, 265, 266, 243, 633, 223]","[262, 263, 279, 280, 282, 162, 163, 164, 165, ...",1,5,60,3,9,78,-2,-4,-18
8,2384,2018-12-03,75.0,2018-11-15,12,3,2018,1,80.0,2018-09-14,2613,Hyde Park,41.78886,-87.58671,2,159,2.89,1,306,7,11,2019,41.79,-87.59,"(41.78886, -87.58671)","(41.79, -87.59)",71,11,2018,9,2018,-5.0,"[244, 245, 246]","[223, 224, 225, 243, 264, 265, 266, 267, 632, ...","[162, 163, 164, 165, 166, 178, 179, 180, 181, ...",[245],"[224, 225, 264, 265, 266, 267, 223]","[262, 263, 278, 279, 280, 281, 289, 162, 163, ...",[244],"[225, 264, 265, 266, 267, 243, 223]","[279, 280, 282, 289, 162, 163, 164, 165, 291, ...",1,15,73,1,17,87,0,-2,-14


In [84]:
X = cal_change_r[['crimes_1km_count_dif','crimes_2km_count_dif','crimes_5km_count_dif','room_type','neighbourhood','reviews_per_month','availability_365']]
y = cal_change_r[['price_dif']]

X.availability_365=X.availability_365/365

X = pd.concat([X, pd.get_dummies(X.room_type, prefix='room_type', drop_first=True)], axis=1)
X = pd.concat([X, pd.get_dummies(X.neighbourhood, drop_first=True)], axis=1)

X=X.drop(['neighbourhood','room_type'],axis=1)

reg = LinearRegression().fit(X, y)
print(reg.score(X, y))
print(reg.coef_)
print(reg.intercept_)
#reg.predict(np.array([[3, 5]]))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


0.08512025606732498
[[  0.6545262    0.08157831   0.1435885    1.1993805  -31.46564681
   31.66531718  35.62511642 -51.53972417  13.27926913  29.14985424
    7.58565962  13.00885298  37.79710824   3.65092963  11.34531969
   -5.72744896   1.90272536  19.87527566   1.65727244   8.07050025
  -18.73095852   4.2486526    1.45965175  -1.01598176  17.78914883
   -8.82831119 -12.49160206  13.42120816  16.38016285  31.75683647
    5.85579895  25.78301666  14.54712031  15.04131079  12.13338057
  -13.73251306  -7.40708236  -0.93748639   6.69717577 -25.54385569
  -31.08007819  11.41055673  -3.29389818 -61.80442419  -1.67565716
   -6.31521831  33.01133713  -4.89849782 -69.95640548 -59.786115
  -37.88133365  39.97837557   2.15941713  10.05675499  10.68851013
   -4.60380975  21.50784103   8.1569795   19.80427862  13.39995773
   38.20360793  49.69655625  11.12064633  18.01423551  -4.13313342
   -8.246183     3.383518    63.22150623  -5.21070059  -0.50374015
    8.17632131   8.74072234  -7.47877728   9