CSCI-GA.3033-005 - Big Data Application Development
We are using the following data to analyse crime in Chicago and best places to live in Chicago. we have taken into account the following datasets
- Crimes (Presently Used)[https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2]
- Socioeconomic_Indicators (Presently Used) https://data.cityofchicago.org/Health-Human-Services/Census-Data-Selected-socioeconomic-indicators-in-C/kn9c-c2s2
- Public Health Statistics- Selected public health indicators by Chicago community area(Presently Used) https://data.cityofchicago.org/Health-Human-Services/Public-Health-Statistics-Selected-public-health-in/iqnk-2tcu
- Crimes Description (Presently Used) https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e
- Sex Offenders (Presently Used) https://data.cityofchicago.org/Public-Safety/Sex-Offenders/vc9r-bqvy
- neighbourhoods (shape file) (Presently Used) [https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6]
- Food Inspection (Presently Used) https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5
- Affordable Rental Housing (Presently Used) https://data.cityofchicago.org/Community-Economic-Development/Affordable-Rental-Housing-Developments/s6ha-ppgi
- 311 Service Requests - Vacant and Abandoned Buildings Reported.csv (Future Work) https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd
All the data can be found on https://data.cityofchicago.org/
-
The code to convert latitude and longitude to community area was ideated from this:
https://github.com/craigmbooth/chicago_neighborhood_finder -
A Google Maps geocoding library for Scala https://github.com/KoddiDev/geocoder
- Sree Lakshmi Addepalli
- Divya Juneja
- Sree Gowri Addepalli
It contains the datasets used for the project. Some might not be present as they are greater than 25 MB
It contains the scala code for ingesting the datasets into the HDFS.
It contains the scala code for preprocessing the datasets. The following methods are done in preprocessing.
The Community Area is an important part of the dataset as it gives values community area wise.After getting the latitude and longitude of the rows which had missing values we applied two procedures:
i.Used the communities areas shape file dataset given by Chicago government and generated a json for range of geopoints for each community. Later we tagged the unknown community area by passing the latitude and longitude.
ii.Still there were few values giving none and not being detected any community. We further went ahead and generated a script which calculated the Euclidean Squared Distance Metric from each community area center and later the nearest community area was set to this row.
The following steps are used for cleaning and preprocessing:
a. Fill missing values: Ignore/drop the rows having missing values.
If the column is numerical, filled in the missing value with the mean/avg value of the column.
If the column is categorical, fill in with the most occurring category
b. Feature Engineering: input data was used to derive new features.(Example Date)
withColumn was used for adding/replacing an existing column
3 udfs were used to generate new features
Binning / Bucketing was used to generate new features
c. Feature Selection: Correlation analysis was done to get the interdependence between variables and to drop columns which are highly dependent based on pvalue.
The following Analytics was done:
Analytics of CrimeData
Analytics Between Poverty and Arrests
CommunityWise Crime Analysis and Housing Crowded
CommunityWise Primary Crime Income and Unemployment
CommunityWise Primary Crime and Hardship
Community Wise Primary Type And Illiteracy
Community Wise Total Vacant Buildings
The visualisations of above crime analytics and clusters of restaurants, affordable housing, crimes, sexoffenders and safe places.
the following mllib algorithms have been used
a. Classification - Logistic Regression: to classify an arrest will happen or not Neural Networks: to classify an arrest will happen or not(was getting garbage collection issues)
b. Clustering - Kmeans was used to cluster similar things together here restaurants, affordable housing, crimes, sexoffenders and safe places.
It contains the scala code for extracting the final results from the HDFS to local.
We made two applications:
a. To predict and arrest will happen or not given the following parameters: Year, Month, Date, Time(24 hr format), Crime Type, Crime Description and address of crime. we looked at the historical crime data from Chicago and joined this data with other socioeconomic factors and public health. We are predicting an arrest and with such analytics we can deploy more police over the unsafe areas and make the streets more safe.
b. To recommend nearby safe places to an end user given an input as an address. It takes into contch sideration the nearby clean restaurants, affordable housing, crimes and if the user is a female then the use of sexoffenders list.
To execute the codes import both the applications in intellij for scala and run it with the given program arguments as above.
It contains the screenshots and results of above remediation code
All the intermediate code files or results files.
List of issues we encountered in the project.
The papers draft uptil now.
https://drive.google.com/file/d/1iWswYLWG_wRtc532N6XWOXUmewQ8SP1C/view?usp=sharing https://drive.google.com/file/d/1a3mHz7f5DVtvH3gFASzhiuNGR8NTi9lo/view?usp=sharing https://drive.google.com/open?id=1TE6E-aPG6n5MnsdmjYqXBBfspn3yMtRA
Our midsem project which we had to drop due to issues.