BigDataProject

CSCI-GA.3033-005 - Big Data Application Development

We are using the following data to analyse crime in Chicago and best places to live in Chicago. we have taken into account the following datasets

Crimes (Presently Used)[https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2]
Socioeconomic_Indicators (Presently Used) https://data.cityofchicago.org/Health-Human-Services/Census-Data-Selected-socioeconomic-indicators-in-C/kn9c-c2s2
Public Health Statistics- Selected public health indicators by Chicago community area(Presently Used) https://data.cityofchicago.org/Health-Human-Services/Public-Health-Statistics-Selected-public-health-in/iqnk-2tcu
Crimes Description (Presently Used) https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e
Sex Offenders (Presently Used) https://data.cityofchicago.org/Public-Safety/Sex-Offenders/vc9r-bqvy
neighbourhoods (shape file) (Presently Used) [https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6]
Food Inspection (Presently Used) https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5
Affordable Rental Housing (Presently Used) https://data.cityofchicago.org/Community-Economic-Development/Affordable-Rental-Housing-Developments/s6ha-ppgi
311 Service Requests - Vacant and Abandoned Buildings Reported.csv (Future Work) https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd

All the data can be found on https://data.cityofchicago.org/

Acknowledgement:

The code to convert latitude and longitude to community area was ideated from this:
https://github.com/craigmbooth/chicago_neighborhood_finder
A Google Maps geocoding library for Scala https://github.com/KoddiDev/geocoder

Team Members:

Sree Lakshmi Addepalli
Divya Juneja
Sree Gowri Addepalli

Folders and process to run ...

01. Data

It contains the datasets used for the project. Some might not be present as they are greater than 25 MB

02. DataIngestion

It contains the scala code for ingesting the datasets into the HDFS.

03. Preprocessing

It contains the scala code for preprocessing the datasets. The following methods are done in preprocessing.

The Community Area is an important part of the dataset as it gives values community area wise.After getting the latitude and longitude of the rows which had missing values we applied two procedures:

i.Used the communities areas shape file dataset given by Chicago government and generated a json for range of geopoints for each community. Later we tagged the unknown community area by passing the latitude and longitude.

ii.Still there were few values giving none and not being detected any community. We further went ahead and generated a script which calculated the Euclidean Squared Distance Metric from each community area center and later the nearest community area was set to this row.

04. Cleaning and Profiling

The following steps are used for cleaning and preprocessing:

a. Fill missing values: Ignore/drop the rows having missing values.

                    If the column is numerical, filled in the missing value with the mean/avg value of the column. 
                    
                    If the column is categorical, fill in with the most occurring category

b. Feature Engineering: input data was used to derive new features.(Example Date)

                    withColumn was used for adding/replacing an existing column
                    
                    3 udfs were used to generate new features
                    
                    Binning / Bucketing was used to generate new features

c. Feature Selection: Correlation analysis was done to get the interdependence between variables and to drop columns which are highly dependent based on pvalue.

05. Profiling - Mentioned above

06. Analytics

The following Analytics was done:

Analytics of CrimeData

Analytics Between Poverty and Arrests

CommunityWise Crime Analysis and Housing Crowded

CommunityWise Primary Crime Income and Unemployment

CommunityWise Primary Crime and Hardship

Community Wise Primary Type And Illiteracy

Community Wise Total Vacant Buildings

07. Visualisation

The visualisations of above crime analytics and clusters of restaurants, affordable housing, crimes, sexoffenders and safe places.

08. DL_Algorithm

the following mllib algorithms have been used

a. Classification - Logistic Regression: to classify an arrest will happen or not Neural Networks: to classify an arrest will happen or not(was getting garbage collection issues)

b. Clustering - Kmeans was used to cluster similar things together here restaurants, affordable housing, crimes, sexoffenders and safe places.

10. dataextraction

It contains the scala code for extracting the final results from the HDFS to local.

11. Remediation_Code

We made two applications:

a. To predict and arrest will happen or not given the following parameters: Year, Month, Date, Time(24 hr format), Crime Type, Crime Description and address of crime. we looked at the historical crime data from Chicago and joined this data with other socioeconomic factors and public health. We are predicting an arrest and with such analytics we can deploy more police over the unsafe areas and make the streets more safe.

b. To recommend nearby safe places to an end user given an input as an address. It takes into contch sideration the nearby clean restaurants, affordable housing, crimes and if the user is a female then the use of sexoffenders list.

To execute the codes import both the applications in intellij for scala and run it with the given program arguments as above.

09. Outputs

It contains the screenshots and results of above remediation code

12. Extras

All the intermediate code files or results files.

13. issues.md

List of issues we encountered in the project.

14. Papers and ppt

The papers draft uptil now.

https://drive.google.com/file/d/1iWswYLWG_wRtc532N6XWOXUmewQ8SP1C/view?usp=sharing https://drive.google.com/file/d/1a3mHz7f5DVtvH3gFASzhiuNGR8NTi9lo/view?usp=sharing https://drive.google.com/open?id=1TE6E-aPG6n5MnsdmjYqXBBfspn3yMtRA

15. Bitcoin

Our midsem project which we had to drop due to issues.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BigDataProject

We are using the following data to analyse crime in Chicago and best places to live in Chicago. we have taken into account the following datasets

Acknowledgement:

Team Members:

Folders and process to run ...

01. Data

02. DataIngestion

03. Preprocessing

04. Cleaning and Profiling

05. Profiling - Mentioned above

06. Analytics

07. Visualisation

08. DL_Algorithm

10. dataextraction

11. Remediation_Code

09. Outputs

12. Extras

13. issues.md

14. Papers and ppt

15. Bitcoin

About

Releases

Packages

Contributors 3

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 261 Commits
Analytics		Analytics
Bitcoin		Bitcoin
Cleaning		Cleaning
DL_Algorithm		DL_Algorithm
DataIngestion		DataIngestion
Extras		Extras
Outputs		Outputs
Papers		Papers
Preprocessing		Preprocessing
Profiling		Profiling
Remediation_Code		Remediation_Code
Visualisation		Visualisation
data		data
dataextraction		dataextraction
README.md		README.md
issues.md		issues.md

Lakshmiaddepalli/BigDataProject

Folders and files

Latest commit

History

Repository files navigation

BigDataProject

We are using the following data to analyse crime in Chicago and best places to live in Chicago. we have taken into account the following datasets

Acknowledgement:

Team Members:

Folders and process to run ...

01. Data

02. DataIngestion

03. Preprocessing

04. Cleaning and Profiling

05. Profiling - Mentioned above

06. Analytics

07. Visualisation

08. DL_Algorithm

10. dataextraction

11. Remediation_Code

09. Outputs

12. Extras

13. issues.md

14. Papers and ppt

15. Bitcoin

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages