BU CS591L1 Fall 2016 Project 1

Team Members:

  • Yehui Huang
  • Yingqiao Xiong

##Narrative 2a

In this project, we consider the relationship between average income and the number of certain public facilities (hospitals, schools, healthy corner stores, and community gardens). The analysis is grouped by zip code. We want to see whether the total number of these facilities is proportional to average earnings.

##Datasets

The datasets we used (the first five in Project 1; the last two were added for Project 2):

  1. 'hospital_locations':'https://data.cityofboston.gov/resource/u6fv-m8v4.json'
  2. 'employee_earnings_report_2015':'https://data.cityofboston.gov/resource/bejm-5s9g.json'
  3. 'health_corner_stores':'https://data.cityofboston.gov/resource/ybm6-m5qd.json'
  4. 'community_gardens':'https://data.cityofboston.gov/resource/rdqf-ter7.json'
  5. 'public_schools':'https://data.cityofboston.gov/resource/492y-i77g.json'
  6. 'crime_incident_reports':'https://data.cityofboston.gov/Public-Safety/Crime-Incident-Reports-July-2012-August-2015-Sourc/7cdf-6fgx/data'
  7. 'approved_building_permit':'https://data.cityofboston.gov/Permitting/Approved-Building-Permits/msk6-43c6/data'

##Process

First, we compute the average earnings in each zip code area. We use MapReduce to project each earnings record onto its zip code and reduce to the total earnings and the record count per zip code; the average is then the total earnings divided by the count.
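The averaging step can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual code: the `map_reduce` helper, the field names `zipcode` and `total_earnings`, and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group emitted values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

def earnings_mapper(record):
    # Emit (zipcode, (earnings, 1)) so totals and counts are summed in one pass.
    # Field names are assumed; the real dataset may use different ones.
    yield record["zipcode"], (float(record["total_earnings"]), 1)

def sum_reducer(key, values):
    # Reduce to (total earnings, number of records) for this zipcode.
    total = sum(v[0] for v in values)
    count = sum(v[1] for v in values)
    return total, count

# Made-up rows, for illustration only.
sample_earnings = [
    {"zipcode": "02118", "total_earnings": "50000"},
    {"zipcode": "02118", "total_earnings": "70000"},
    {"zipcode": "02134", "total_earnings": "40000"},
]

totals = map_reduce(sample_earnings, earnings_mapper, sum_reducer)
avg_earnings = {z: total / count for z, (total, count) in totals.items()}
```

Carrying the count alongside the total keeps the reducer a plain sum, so the division only happens once per zip code at the end.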

Second, we count the hospitals, gardens, stores, and schools in each zip code by applying a simple MapReduce function to each dataset.

Combining the datasets generated in the second step gives the number of each kind of building per zip code.

Lastly, we merge that dataset with the average earnings dataset to obtain a dataset containing the building counts and the average earnings for each zip code.
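The counting and merging steps might look roughly like the sketch below. The `zipcode` field name, the sample rows, and the hard-coded `avg_earnings` values are assumptions for illustration. Keying the merge on the zip codes present in the earnings data is also just one possible choice.

```python
from collections import Counter

def count_by_zipcode(records, zip_field="zipcode"):
    """Count records per zipcode, skipping rows without one."""
    return Counter(r[zip_field] for r in records if r.get(zip_field))

# Made-up rows, for illustration only.
datasets = {
    "hospitals": [{"zipcode": "02118"}, {"zipcode": "02118"}],
    "gardens": [{"zipcode": "02134"}],
    "stores": [{"zipcode": "02118"}, {"zipcode": "02134"}],
    "schools": [{"zipcode": "02134"}, {}],  # last row has no zipcode
}
counts = {name: count_by_zipcode(rows) for name, rows in datasets.items()}

# Assumed output of the averaging step.
avg_earnings = {"02118": 60000.0, "02134": 40000.0}

# One merged row per zipcode: average earnings plus a count per building type.
merged = {}
for zipcode, avg in avg_earnings.items():
    row = {"avg_earnings": avg}
    for name, counter in counts.items():
        row[name] = counter.get(zipcode, 0)  # 0 when no such building exists
    merged[zipcode] = row
```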

##Instruction

To run the project, please follow the instructions below.

###Authentication

The auth.json file contains the credentials in the following format:

  "db_username": "alice_bob",
  "db_password": "alice_bob"

If the database you connect to requires authentication, the helpers.py script, which is imported in every activity script, provides functions for connecting to it. Every activity script uses the following line to connect, reading db_username and db_password from your own auth.json:

repo = openDb(getAuth("db_username"), getAuth("db_password"))

You only need to store your own db_username and db_password in auth.json.
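For readers who have not opened helpers.py, a helper like getAuth could plausibly be as small as the sketch below. This is an assumption about its shape, not the actual implementation: `read_auth_key` is a hypothetical name, and the demo writes a throwaway file instead of touching a real auth.json.

```python
import json
import os
import tempfile

def read_auth_key(key, path="auth.json"):
    """Return one credential field from a JSON credentials file (hypothetical helper)."""
    with open(path) as f:
        return json.load(f)[key]

# Demonstrate against a temporary file so no real credentials are needed.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "auth.json")
    with open(demo_path, "w") as f:
        json.dump({"db_username": "alice_bob", "db_password": "alice_bob"}, f)
    username = read_auth_key("db_username", demo_path)
```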

###Run the program

Run execute.py to execute the whole program:

python3 execute.py

#Project 2 Updates

Team Members:

  • Yehui Huang
  • Yingqiao Xiong
  • Hongyu Zhou
  • Chang Gao

###K-Means Algorithm in Crime and Zip

First, we extract two new datasets, CrimeIncidentReport and ApprovedBuildingPermit. We use k-means as an optimization to calculate the approximate mean point of each zip code area, then use argmin to assign each crime coordinate to its closest zip code mean point.
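The assignment step can be sketched as below. This shows only the "mean point plus argmin" part, which is a single k-means-style assignment with fixed centers, not a full iterative k-means. It assumes the mean points are averaged from records that carry both a zip code and coordinates (for example building permits), which the text does not state explicitly. The function names and sample coordinates are invented for illustration.

```python
import math
from collections import defaultdict

def mean_points(located_rows):
    """Average (lat, lon) over rows of (zipcode, lat, lon): one mean point per zipcode."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for zipcode, lat, lon in located_rows:
        s = sums[zipcode]
        s[0] += lat
        s[1] += lon
        s[2] += 1
    return {z: (s[0] / s[2], s[1] / s[2]) for z, s in sums.items()}

def nearest_zipcode(point, centers):
    """argmin over zipcodes of the Euclidean distance from point to each mean point."""
    return min(centers, key=lambda z: math.dist(point, centers[z]))

# Made-up rows with known zipcodes, for illustration only.
located = [
    ("02118", 42.34, -71.07),
    ("02118", 42.36, -71.09),
    ("02134", 42.35, -71.13),
]
centers = mean_points(located)

# Made-up crime coordinates that lack a zipcode.
crimes = [(42.351, -71.075), (42.349, -71.12)]
crime_zip = [nearest_zipcode(c, centers) for c in crimes]
```

Plain Euclidean distance on latitude and longitude is a simplification that is usually tolerable at city scale; a projected coordinate system would be more accurate.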

###Map Reduce to merge Crime into the Result Dataset

We used MapReduce again to merge the crime_zip dataset into the result dataset from Project 1.

###Statistic Operations

  • Correlation Coefficient: taking the numbers of corner stores, hospitals, public schools, community gardens, and crimes as independent variables and average income as the dependent variable, we calculate the correlation coefficient between the dependent variable and each independent variable to estimate how strongly they are related. The formulas applied are identical to those shown in the class notes.

  • Linear Regression: using the variables described above, we fit the data to a linear least-squares regression model and calculate the estimated slope and intercept of each regression line. The returned coefficients let us quantify the effect of each independent variable. This method should be improved with non-linear regression, as many of the relationships are not linear.

  • Coefficient of Determination (R-squared): using the estimates derived above, we calculate the coefficient of determination for each regression to measure how well the linear model fits the data.
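The three statistics above can be sketched with the standard textbook formulas, as below. This is an illustration, not the contents of statisticOperations.py: the function names and sample numbers are made up, and the class notes may define the formulas slightly differently (for example, sample rather than population covariance).

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    # Population covariance; covariance(xs, xs) is the variance of xs.
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    # Pearson correlation coefficient, in the range [-1, 1].
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

def linear_fit(xs, ys):
    # Least-squares slope and intercept for the line y = slope * x + intercept.
    slope = covariance(xs, ys) / covariance(xs, xs)
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept

def r_squared(xs, ys):
    # 1 - (residual sum of squares / total sum of squares) for the fitted line.
    slope, intercept = linear_fit(xs, ys)
    my = mean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up values: building counts per zipcode and average income in thousands.
counts = [0, 1, 2, 3, 4]
incomes = [52.0, 48.0, 55.0, 50.0, 58.0]

r = correlation(counts, incomes)
slope, intercept = linear_fit(counts, incomes)
r2 = r_squared(counts, incomes)
```

For a regression with a single independent variable, R-squared equals the square of the correlation coefficient, which gives a quick consistency check on the results.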

The statisticOperations.py file extracts the processed data, performs the mathematical operations, and outputs the statistical results.

###Interpretation of Results

The correlations between income and each of the possible factors are weak, as most of the correlation coefficients fall in the range [-0.1, 0.1]. The linear regression model does not fit the datasets well, as the coefficients are noticeably close to 0.

Surprisingly, we see a clear diminishing oscillation pattern when the dependent variable is plotted against each independent variable. This could be caused by the un-normalized data, or by a distribution that we have not yet tested for. We will therefore normalize the data before further operations and consider other models in future work.

In addition, we shall consider other factors that could affect the average income in a given area, as well as the correlations among the independent variables themselves.