
Instructions

General requirements

auth.json

In order to run this project, the auth.json file should be structured as follows:

{
    "services": {
        "cityofbostondataportal": {
            "service": "https://data.cityofboston.gov/",
            "token": "XXXXXXXXXXXXXXXXXXXXXXXXX"
        }
    }
}
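
For reference, a script can read the endpoint and token from this file with something like the following (a minimal sketch; how get_data.py actually consumes the token may differ):

import json

# Load the Socrata endpoint and app token from auth.json.
with open('auth.json') as f:
    auth = json.load(f)

portal = auth['services']['cityofbostondataportal']
service_url = portal['service']  # e.g. https://data.cityofboston.gov/
app_token = portal['token']      # personal Socrata app token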

Trial Mode

To run any transformation in trial mode:

>>> python3 <filename>.py trial

Project 1

2.a

We chose to combine datasets holding the following data: [Crime], [Schools], [Hospitals], [Food Establishment Inspections], and [311] reports. An interesting project would be to rate a given zip code based on the quality of its surroundings.

2.b

The algorithm that retrieves these datasets automatically may be found in the file get_data.py. To run it:

>>> python3 get_data.py

Make sure to uncomment the last lines in the file:

# get_data.execute()

2.c

The following transformations must be executed in order.

Transformation 1

The first transformation aims to standardize the geographic information of the datasets. The GeoJSON format was used in the following way:

"geo_info": {
    "type": "Feature",
    "properties": {
        "zip_code": 02215
    },
    "geometry": {
        "type": "Point",
        "coordinates": [-71.00, 42.00]
    }
}
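
As an illustration, a record's zip code and coordinates can be wrapped into this structure with a small helper (a sketch only; the field names handled by transformation1.py are assumptions):

def to_geo_info(zip_code, longitude, latitude):
    # Build the standardized GeoJSON Feature used across the datasets.
    # Zip codes are kept as strings to preserve leading zeros (e.g. "02215").
    return {
        'geo_info': {
            'type': 'Feature',
            'properties': {'zip_code': zip_code},
            'geometry': {
                'type': 'Point',
                'coordinates': [longitude, latitude]  # GeoJSON order: [lon, lat]
            }
        }
    }

# Example: to_geo_info("02215", -71.00, 42.00)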

This script may be found in transformation1.py. To run it:

>>> python3 transformation1.py

Make sure to uncomment the last lines in the file:

# transformation1.execute()

Transformation 2

The purpose of the second transformation is to populate the zip_code field of the crime dataset. Based on the information from the other datasets, it is possible to build an index. Then, given the coordinates of each crime entry, we find an indexed entry within a 1 km range and assign its zip code. This script may be found in transformation2.py. To run it:

>>> python3 transformation2.py

Make sure to uncomment the last lines in the file:

# transformation2.execute()
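
The 1 km lookup described above can be pictured with the brute-force sketch below (transformation2.py builds an index, so the actual implementation is more efficient; the record layout is an assumption):

from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    # Great-circle distance between two (lon, lat) points, in kilometers.
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def assign_zip_code(crime_coords, indexed_entries):
    # Return the zip code of the first indexed entry within 1 km, if any.
    lon, lat = crime_coords
    for zip_code, (elon, elat) in indexed_entries:
        if haversine_km(lon, lat, elon, elat) <= 1.0:
            return zip_code
    return None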

Transformation 3

Finally, in the third transformation, the numbers of crimes and 311 service reports are grouped per zip code. This script may be found in transformation3.py. To run it:

>>> python3 transformation3.py

Make sure to uncomment the last lines in the file:

# transformation3.execute()
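
Conceptually, this grouping amounts to counting records per zip code, as in the sketch below (the input layout assumes the standardized geo_info field produced by the earlier transformations):

from collections import Counter

def count_per_zip(records):
    # records: iterable of documents carrying the standardized geo_info field.
    counts = Counter()
    for r in records:
        counts[r['geo_info']['properties']['zip_code']] += 1
    return counts

# crimes_per_zip = count_per_zip(crime_records)
# reports_311_per_zip = count_per_zip(report_311_records)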

Project 2

Problem 1

Following the same idea as Project 1, the goal is to rank the zip codes according to the information we have so far, that is, [Crime], [Schools], [Hospitals], [Food Establishment Inspections], and [311] reports. Having this data, we can derive a new dataset with the following structure:

(zipcode, #crimes, #311 reports, #passed food inspections, #schools, #hospitals)

A user might want to query this dataset in order to know which zip code to choose to live in, based on the attributes mentioned above. To perform this analysis, a multi-objective query must be defined. In this case, the multi-objective query could be defined as follows: minimize the #crimes and #311 reports, while maximizing the #passed food inspections (as a proxy for the quality of the surrounding restaurants), #schools and #hospitals, giving equal importance to all five attributes.

This can be computed using skyline queries, where the result of the query consists of all non-dominated tuples, following the Pareto optimality definition [1]. An element a = (a1, ..., an) dominates an element b = (b1, ..., bn) if:

for all i in {1, ..., n}, ai ≥ bi, and there exists j in {1, ..., n} such that aj > bj

The skyline set is then defined as the set of all elements in the dataset that are not dominated by any other element.
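
A straightforward way to compute this set is a pairwise dominance check, sketched below (skyline.py may implement it differently; attributes that should be minimized, such as #crimes and #311 reports, are assumed to have been negated beforehand so that larger is always better):

def dominates(a, b):
    # a dominates b if it is at least as good in every attribute
    # and strictly better in at least one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(points):
    # Keep every point that no other point dominates.
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# Example with tuples (-#crimes, -#311 reports, #passed inspections, #schools, #hospitals):
# skyline([(-10, -5, 40, 3, 1), (-20, -8, 35, 2, 0), (-10, -5, 42, 3, 1)])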

To solve this optimization problem, first run transformation4.py:

>>> python3 transformation4.py

and then:

>>> python3 skyline.py

Make sure to uncomment the last lines in the file:

# transformation4.execute()

and

# skyline.execute()

Problem 2

Given the crimes dataset, another interesting problem is to find the minimum number of police patrols, and where these patrols should be located, in order to minimize the distance between the patrols and the historic crime locations. User input can be used to define which types of crimes should have priority and the minimum distance between these newly added patrols and the crime locations. This can be solved using k-means.

To customize the results just edit the settings.py file. It should look something like this:

MIN_PATROLS = 15
MAX_PATROLS = 30
MIN_DISTANCE = 4
CODES = ['18xx', '14xx']

After that, in order to solve this optimization problem, run:

>>> python3 kmeans.py

Make sure to uncomment the last lines in the file:

# kmeans.execute()
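
The idea can be sketched with scikit-learn's KMeans as below. This is an illustration only, not necessarily the exact logic of kmeans.py; in particular, it assumes that MIN_DISTANCE is the largest acceptable distance from a crime location to its nearest patrol, expressed in the same units as the coordinates.

import numpy as np
from sklearn.cluster import KMeans
from settings import MIN_PATROLS, MAX_PATROLS, MIN_DISTANCE

def allocate_patrols(crime_coords):
    # crime_coords: array of shape (n, 2), historic crime locations already
    # filtered by the crime codes selected in settings.CODES.
    for k in range(MIN_PATROLS, MAX_PATROLS + 1):
        km = KMeans(n_clusters=k, random_state=0).fit(crime_coords)
        # Distance from each crime location to its assigned patrol (centroid).
        distances = np.linalg.norm(
            crime_coords - km.cluster_centers_[km.labels_], axis=1)
        if distances.max() <= MIN_DISTANCE:
            # Smallest number of patrols satisfying the distance constraint.
            return k, km.cluster_centers_
    return MAX_PATROLS, km.cluster_centers_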

Problem 3

This problem involves some statistical analysis of inspection and social media data. Given the [Food Establishment Inspections] dataset and the [Yelp Academic Dataset], we want to determine whether a correlation exists between the average ratings and the penalty scores from the inspections.

The files from the [Yelp Academic Dataset] used to solve this problem are yelp_academic_dataset_business.json and yelp_academic_dataset_review.json. These two files should be placed in a /yelp directory outside the jas91_smaf91 folder.

To store the Yelp datasets, run load_yelp_data.py:

>>> python3 load_yelp_data.py

Make sure to uncomment the last lines in the file:

# load_yelp_data.execute()

Let b be a business in the Food Establishment Inspections dataset, inspected at a time t. The penalty score of b is then defined as follows:

penalty_b = minor_b + major_b + severe_b

where minor_b, major_b and severe_b are the numbers of minor, major and severe violations of b (represented in the dataset as the strings '*', '**' and '***').
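
In code, the penalty score can be computed directly from the violation-level strings (a sketch; the way the violation levels are stored per business is an assumption):

def penalty_score(violations):
    # violations: list of violation-level strings for one business and one
    # inspection, e.g. ['*', '***', '*'], where '*', '**' and '***' mark
    # minor, major and severe violations respectively.
    minor = violations.count('*')
    major = violations.count('**')
    severe = violations.count('***')
    return minor + major + severe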

The datasets were joined by name and (latitude, longitude), using an index on both attributes and allowing an error of 50 meters in the coordinates. To perform the correlation analysis, the data was partitioned by date: we defined a window of time between two consecutive inspections i and i+1, and the average rating was calculated over the reviews that fall in that interval. We chose this approach because we assumed that, if the ratings are correlated with the inspections, they would reflect the results of the most recently performed inspection, without having an important effect on subsequent or previous inspections.

To determine whether the average rating and the penalty score are truly correlated, the Pearson correlation coefficient (from the scipy.stats Python package) was used. The results are shown below:

                      correlation coefficient    p-value
minor                          -0.036             0.006
major                          -0.038             0.003
severe                         -0.030             0.02
penalty score                  -0.042             0.001
# violations                   -0.041             0.001

(Figure: scatter plot of average ratings against penalty scores)

The results indicate that there is a negative correlation between the average ratings and the penalty score. That is, if the penalty score is high, one can expect that the average rating is low and vice versa.
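
For reference, the correlation test itself reduces to a call to scipy.stats.pearsonr, as in the sketch below (the arrays are placeholders for the per-window average ratings and penalty scores):

from scipy.stats import pearsonr

# avg_ratings[i] and penalty_scores[i] refer to the same business and the
# same inspection window.
avg_ratings = [3.5, 4.0, 2.5, 4.5]    # placeholder values
penalty_scores = [12, 4, 20, 2]       # placeholder values

coefficient, p_value = pearsonr(avg_ratings, penalty_scores)
print(coefficient, p_value)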

The algorithm that performs this analysis can be found in rating_inspection_correlation.py. To run it:

>>> python3 rating_inspection_correlation.py

Make sure to uncomment the last lines in the file:

# rating_inspection_correlation.execute()

Project 3

For this part of the project, we decided to create an interactive client-server application based on Problem 2 of the previous project, using FlaskAPI to create a simple API and D3 + Google Maps to create the visualization. The visualization source code is located in the /visualization folder, while the API is located in the file api.py. For now, the visualization and the API run on localhost:8000 and localhost:5000 respectively.

API

To run the API, just execute from the shell:

>>> python api.py

This will run the API in port 5000.

Make sure to install the Python module FlaskAPI. To make a GET request to the API, the request settings (in JSON) look like this:

{
  "async": true,
  "crossDomain": true,
  "url": "http://localhost:5000/patrols_coordinates?max_patrols=X1&min_patrols=X2&min_distance=X3&codes=Y1,Y2,Y3",
  "method": "GET"
}

where X1, X2 and X3 are integers and Y1, Y2 and Y3 are crime codes.
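
Equivalently, the endpoint can be queried from Python (a sketch using the requests library; the parameter names come from the URL above, and the values are placeholders):

import requests

params = {
    'max_patrols': 30,       # X1
    'min_patrols': 15,       # X2
    'min_distance': 4,       # X3
    'codes': '18xx,14xx',    # Y1,Y2,...
}
response = requests.get('http://localhost:5000/patrols_coordinates', params=params)
print(response.json())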

Visualization

Police patrol allocation

To run the visualization part of the project, go to the /visualization/police-patrol-allocation folder and execute:

python -m SimpleHTTPServer

and open the URL http://localhost:8000/ in the browser. That will take you to the index page.

(Screenshot: index page of the police patrol allocation visualization)

Here, just fill in the inputs and click the Submit button; this will call the API and show the results on the map.

Zip code ranking

To run the visualization part of the project, go to the /visualization/zip-code-ranking folder and execute:

python -m SimpleHTTPServer

and open the URL http://localhost:8000/ in the browser. That will take you to the index page.

(Screenshot: zip code ranking visualization)

References

[1] U. Guntzer, W.T. Balke. Multi-objective query processing for database systems. 2004.

[Crime]: https://data.cityofboston.gov/Public-Safety/Crime-Incident-Reports-July-2012-August-2015-Sourc/7cdf-6fgx
[Schools]: https://data.cityofboston.gov/Facilities/School-Gardens/cxb7-aa9j
[Hospitals]: https://data.cityofboston.gov/Public-Health/Hospital-Locations/46f7-2snz
[Food Establishment Inspections]: https://data.cityofboston.gov/Health/Food-Establishment-Inspections/qndu-wx8w
[311]: https://data.cityofboston.gov/City-Services/311-Open-Service-Requests/rtbk-4hc4
[Yelp Academic Dataset]: https://www.yelp.com/dataset_challenge/drivendata