# 1. Project Understanding

The problem is this: instructors and decision makers in education have a difficult time deciding where and how to invest limited resources.  Their goal is to help students succeed, so the ability to gauge which neighborhoods' students are more or less successful than others, and *why*, is very valuable.  This analysis seeks to give decision makers a way to answer these questions:
* Is any given student a product of their environment?  Put more precisely: is there a relationship between a student's environment (at the neighborhood level) and their academic success?
* If such a relationship exists, is it possible to say whether a body of students is likely to succeed, based on one parameter or another?

Given the scope of this project and the amount of data being compared, we will limit our examination to a single city: New York City.  The variety of neighborhoods and success rates among students (as well as the high volume of quality data) makes this a choice of convenience, though, we hope, not a fatal one. If a reliable relationship between a student's environment and their success could be discovered, then decision makers would be better able to change that environment to make their students more successful, and teachers would be better able to instruct their students, knowing what factors they are working against.

# 2. Analytic Approach

First, we must find a way of quantifying "success" in an academic setting. Thankfully, metrics such as GPA and SAT scores have long been used for such a purpose, and are readily available.

Secondly, we must find a way of identifying those "features" that make any given neighborhood different from or similar to another.  The sky is the limit in this regard.  However, for the purposes of this analysis, we will focus on the following:

* Average per-capita income
* Number of occurrences of crime (separated by type) in a given neighborhood
* Number of neighborhood features in the form of venues (separated by type)

Each of these values is a numeric feature (a count or an average), and the quantity we are trying to predict (success in the form of a test score) is also numeric, so we should be able to model the problem using various forms of regression.  For the Data Exploration phase, we will use various forms of linear regression to identify relationships.  For the model-building phase, we will use polynomial regression to try to predict student success.
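
As a rough illustration of the two regression steps described above, the sketch below fits a straight line and a quadratic to made-up data with NumPy. The arrays `income` and `score` are synthetic placeholders, not values from the datasets used in this project, and `np.polyfit` is only one possible tool for the job.

```python
import numpy as np

# Synthetic placeholder data: NOT taken from the project's datasets.
rng = np.random.default_rng(0)
income = np.linspace(20, 120, 50)  # hypothetical income, in thousands
score = 900 + 8 * income - 0.03 * income**2 + rng.normal(0, 15, income.size)

def fit_poly(x, y, degree):
    """Least-squares polynomial fit; returns the coefficients and R^2."""
    coeffs = np.polyfit(x, y, degree)
    predicted = np.polyval(coeffs, x)
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return coeffs, 1 - ss_res / ss_tot

# Exploration phase: a straight-line fit to look for a relationship.
linear_coeffs, linear_r2 = fit_poly(income, score, degree=1)
# Model-building phase: a polynomial (here quadratic) fit.
quad_coeffs, quad_r2 = fit_poly(income, score, degree=2)

print(f"linear R^2:    {linear_r2:.3f}")
print(f"quadratic R^2: {quad_r2:.3f}")
```

On the training data a higher-degree fit can never score a lower R^2 than a lower-degree one, so a held-out set would be needed to compare models fairly.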

# 3. Data Requirements

Any geocoded data, such as the following, should be organized either by ZIP code or by neighborhood name.  Furthermore, dated information, such as SAT scores, crimes, and average income, should come from around the same time period.  After all, we are supposing that the students producing these scores were subject to the environment we are measuring.
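
The two requirements above (a common geographic key and a common time period) might be enforced along the following lines. This is only a sketch: the miniature tables, their column names, and their values are hypothetical, and 2015 is used purely as an example year.

```python
import pandas as pd

# Hypothetical miniature tables; column names and values are illustrative only.
scores = pd.DataFrame({"Zip Code": [10001, 10002], "avg_sat": [1300.0, 1250.0]})
crimes = pd.DataFrame({
    "zipcode": ["10001", "10002", "10002"],
    "date": ["12/31/2015", "06/15/2015", "03/02/2009"],
})

def normalize_zip(series):
    """Represent ZIP codes as zero-padded 5-character strings."""
    return series.astype(str).str.strip().str.zfill(5)

scores["zip"] = normalize_zip(scores["Zip Code"])
crimes["zip"] = normalize_zip(crimes["zipcode"])

# Keep only records from the period of interest (2015 here, as an example).
crimes["date"] = pd.to_datetime(crimes["date"], format="%m/%d/%Y")
crimes_2015 = crimes[crimes["date"].dt.year == 2015]

# Count incidents per ZIP and attach them to the score table.
crime_counts = crimes_2015.groupby("zip").size().rename("crime_count").reset_index()
combined = scores.merge(crime_counts, on="zip", how="left")
print(combined)
```

Storing ZIP codes as strings avoids losing leading zeros when a CSV column is parsed as an integer.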

# 4. Data Collection

### Neighborhood Venue data

*(Constructed during previous Coursera exercises)* Contains Foursquare Data for all of New York City's neighborhoods.  This data will be crucial to try to get an overall impression regarding "what kind of neighborhood" each neighborhood is.   It features a column for each type of venue recorded, along with the number of each such venue located within the neighborhood.  

### Income Data

[Income data on Kaggle](https://www.kaggle.com/wpncrh/zip-code-income-tax-data-2014)  This data set will give us what we need to compare neighborhoods based on average income.  It features an "average income" field, as well as a ZIP code, for each instance.

### Average SAT data for NY schools

[NY SAT data on Kaggle](https://www.kaggle.com/nycopendata/high-schools)  This data set gives us average SAT scores for each school in New York. The ZIP code for each school is given, which will allow us to match up the data, geographically, with our venue/income/crime counterparts.
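
One way the per-school scores could be rolled up to the ZIP level is sketched below. The column names match the SAT data set, and the sample rows are copied from the `sat_df.head()` output later in this notebook; summing the three sections and taking an unweighted mean per ZIP code is an assumption, not necessarily the aggregation used in the final analysis.

```python
import pandas as pd

nan = float("nan")

# A few rows copied from the sat_df.head() output; the real frame has many more.
sample = pd.DataFrame({
    "Zip Code": [10001, 10002, 10002, 10002],
    "Average Score (SAT Math)": [nan, 657.0, 395.0, 418.0],
    "Average Score (SAT Reading)": [nan, 601.0, 411.0, 428.0],
    "Average Score (SAT Writing)": [nan, 601.0, 387.0, 415.0],
})

score_cols = [
    "Average Score (SAT Math)",
    "Average Score (SAT Reading)",
    "Average Score (SAT Writing)",
]

# Drop schools with no reported scores, then sum the three sections per school.
scored = sample.dropna(subset=score_cols).copy()
scored["total_sat"] = scored[score_cols].sum(axis=1)

# Unweighted mean of school totals within each ZIP code (an assumed aggregation).
sat_by_zip = scored.groupby("Zip Code")["total_sat"].mean().reset_index()
print(sat_by_zip)
```

Weighting each school by its "Student Enrollment" would be a reasonable alternative if larger schools should count for more.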

### New York City Crime Data 

[Crime Data on Kaggle](https://www.kaggle.com/adamschroeder/crimes-new-york-city)   This data set includes a column for the "type" of crime committed, as well as the latitude and longitude of the incident. This data will allow us to find the ZIP code in which each incident occurred and, from this, match it to the neighborhoods being analyzed.
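
Once each incident has a ZIP code, the per-type counts listed in the Analytic Approach could be built with a cross-tabulation. In the sketch below, `OFNS_DESC` is a real column of the crime data, while the `zip` column and its values are hypothetical stand-ins for the result of a coordinate lookup.

```python
import pandas as pd

# Toy frame: OFNS_DESC is a real column in the crime data, but the zip
# column and its values are hypothetical stand-ins for a coordinate lookup.
toy_crimes = pd.DataFrame({
    "OFNS_DESC": [
        "FORGERY",
        "DANGEROUS DRUGS",
        "ASSAULT 3 & RELATED OFFENSES",
        "ASSAULT 3 & RELATED OFFENSES",
        "FORGERY",
    ],
    "zip": ["10451", "10029", "10029", "10010", "10451"],
})

# One row per ZIP code, one column per offense type; cells hold incident counts.
crime_counts = pd.crosstab(toy_crimes["zip"], toy_crimes["OFNS_DESC"])
print(crime_counts)
```

The resulting table has the same shape as the venue counts (one row per area, one column per category), which should make the two easy to combine.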

In [68]:
# import basic libraries

import pandas as pd 
import numpy as np

In [1]:
# import data from CSV files

ny_crimes_df = pd.read_csv('data/NYPD_Complaint_Data_Historic.csv')
ny_crimes_df_desc = pd.read_csv('data/Crime_Column_Description.csv')


In [2]:
# Crime Data
print(ny_crimes_df_desc['Description'])
ny_crimes_df.head()

0     Randomly generated persistent ID for each comp...
1     Exact date of occurrence for the reported even...
2     Exact time of occurrence for the reported even...
3     Ending date of occurrence for the reported eve...
4     Ending time of occurrence for the reported eve...
5                    Date event was reported to police 
6               Three digit offense classification code
7     Description of offense corresponding with key ...
8     Three digit internal classification code (more...
9     Description of internal classification corresp...
10    Indicator of whether crime was successfully co...
11    Level of offense: felony, misdemeanor, violation 
12    Jurisdiction responsible for incident. Either ...
13    The name of the borough in which the incident ...
14          The precinct in which the incident occurred
15    Specific location of occurrence in or around t...
16    Specific description of premises; grocery stor...
17    Name of NYC park, playground or greenspace

Unnamed: 0,CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,CMPLNT_TO_DT,CMPLNT_TO_TM,RPT_DT,KY_CD,OFNS_DESC,PD_CD,PD_DESC,...,ADDR_PCT_CD,LOC_OF_OCCUR_DESC,PREM_TYP_DESC,PARKS_NM,HADEVELOPT,X_COORD_CD,Y_COORD_CD,Latitude,Longitude,Lat_Lon
0,101109527,12/31/2015,23:45:00,,,12/31/2015,113,FORGERY,729.0,"FORGERY,ETC.,UNCLASSIFIED-FELO",...,44.0,INSIDE,BAR/NIGHT CLUB,,,1007314.0,241257.0,40.828848,-73.916661,"(40.828848333, -73.916661142)"
1,153401121,12/31/2015,23:36:00,,,12/31/2015,101,MURDER & NON-NEGL. MANSLAUGHTER,,,...,103.0,OUTSIDE,,,,1043991.0,193406.0,40.697338,-73.784557,"(40.697338138, -73.784556739)"
2,569369778,12/31/2015,23:30:00,,,12/31/2015,117,DANGEROUS DRUGS,503.0,"CONTROLLED SUBSTANCE,INTENT TO",...,28.0,,OTHER,,,999463.0,231690.0,40.802607,-73.945052,"(40.802606608, -73.945051911)"
3,968417082,12/31/2015,23:30:00,,,12/31/2015,344,ASSAULT 3 & RELATED OFFENSES,101.0,ASSAULT 3,...,105.0,INSIDE,RESIDENCE-HOUSE,,,1060183.0,177862.0,40.654549,-73.726339,"(40.654549444, -73.726338791)"
4,641637920,12/31/2015,23:25:00,12/31/2015,23:30:00,12/31/2015,344,ASSAULT 3 & RELATED OFFENSES,101.0,ASSAULT 3,...,13.0,FRONT OF,OTHER,,,987606.0,208148.0,40.738002,-73.987891,"(40.7380024, -73.98789129)"


In [3]:
# Income Data
income_df = pd.read_csv('data/14zpallagi.csv')

In [4]:
income_df.head()

Unnamed: 0,STATEFIPS,STATE,zipcode,agi_stub,N1,mars1,MARS2,MARS4,PREP,N2,...,N10300,A10300,N85530,A85530,N85300,A85300,N11901,A11901,N11902,A11902
0,1,AL,0,1,850050.0,481840.0,115070.0,240450.0,479900.0,1401930.0,...,389850.0,324575.0,0.0,0.0,0.0,0.0,62690.0,47433.0,744910.0,1964826.0
1,1,AL,0,2,491370.0,200750.0,150290.0,125560.0,281350.0,1016010.0,...,397110.0,950446.0,0.0,0.0,0.0,0.0,70780.0,101969.0,413790.0,1177400.0
2,1,AL,0,3,259540.0,75820.0,142970.0,34070.0,156720.0,589190.0,...,250230.0,1319641.0,0.0,0.0,0.0,0.0,62170.0,132373.0,192050.0,538160.0
3,1,AL,0,4,164840.0,26730.0,125410.0,10390.0,99750.0,423300.0,...,163580.0,1394913.0,0.0,0.0,0.0,0.0,45120.0,124048.0,115470.0,375882.0
4,1,AL,0,5,203650.0,18990.0,177070.0,5860.0,122670.0,565930.0,...,203050.0,3655700.0,610.0,135.0,270.0,66.0,81180.0,387298.0,114380.0,448442.0


In [5]:
# SAT Data
sat_df = pd.read_csv('data/scores.csv')

In [6]:
sat_df.head()

Unnamed: 0,School ID,School Name,Borough,Building Code,Street Address,City,State,Zip Code,Latitude,Longitude,...,End Time,Student Enrollment,Percent White,Percent Black,Percent Hispanic,Percent Asian,Average Score (SAT Math),Average Score (SAT Reading),Average Score (SAT Writing),Percent Tested
0,02M260,Clinton School Writers and Artists,Manhattan,M933,425 West 33rd Street,Manhattan,NY,10001,40.75321,-73.99786,...,,,,,,,,,,
1,06M211,Inwood Early College for Health and Informatio...,Manhattan,M052,650 Academy Street,Manhattan,NY,10002,40.86605,-73.92486,...,3:00 PM,87.0,3.4%,21.8%,67.8%,4.6%,,,,
2,01M539,"New Explorations into Science, Technology and ...",Manhattan,M022,111 Columbia Street,Manhattan,NY,10002,40.71873,-73.97943,...,4:00 PM,1735.0,28.6%,13.3%,18.0%,38.5%,657.0,601.0,601.0,91.0%
3,02M294,Essex Street Academy,Manhattan,M445,350 Grand Street,Manhattan,NY,10002,40.71687,-73.98953,...,2:45 PM,358.0,11.7%,38.5%,41.3%,5.9%,395.0,411.0,387.0,78.9%
4,02M308,Lower Manhattan Arts Academy,Manhattan,M445,350 Grand Street,Manhattan,NY,10002,40.71687,-73.98953,...,3:00 PM,383.0,3.1%,28.2%,56.9%,8.6%,418.0,428.0,415.0,65.1%


In [9]:
# Neighborhood Data
venues_df = pd.read_csv('data/100_venues_for_all_ny_hoods.csv')
neighborhood_df = pd.read_csv('data/ny_neighborhood_venu_cat_count.csv')

In [10]:
neighborhood_df.head()

Unnamed: 0.1,Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,...,Volleyball Court,Warehouse Store,Waste Facility,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,Allerton,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,Annadale,0,0,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,Arden Heights,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,Arlington,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,Arrochar,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# 5. Data Understanding

# 6. Data Preparation

In [54]:
# Wrangle venues_df to extract Lat/Lng data and assign to neighborhood_df
temp = venues_df[['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude']].drop_duplicates().reset_index(drop=True)

# Get lat-lng values in order of neighborhood_df

lat = []
lng = []
for n in neighborhood_df['Neighborhood']:
    r = temp[temp['Neighborhood'] == n]
    lat.append(r['Neighborhood Latitude'].iloc[0])
    lng.append(r['Neighborhood Longitude'].iloc[0])

# Assign new columns to neighborhood_df

neighborhood_df['Neighborhood Latitude'] = lat
neighborhood_df['Neighborhood Longitude'] = lng

# Drop unnecessary "Unnamed: 0" column from neighborhood_df

neighborhood_df.drop(columns=["Unnamed: 0"], inplace=True)


In [55]:
neighborhood_df.head()

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arcade,...,Waste Facility,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Neighborhood Latitude,Neighborhood Longitude
0,Allerton,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,40.865788,-73.859319
1,Annadale,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,40.538114,-74.178549
2,Arden Heights,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,40.549286,-74.185887
3,Arlington,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,40.635325,-74.165104
4,Arrochar,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,40.596313,-74.067124
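
The row-by-row lookup in the cell above could also be written as a single merge. The sketch below uses two toy frames whose names (`toy_venues`, `toy_counts`) are hypothetical; the coordinates for Allerton and Annadale are copied from the output above.

```python
import pandas as pd

# Toy stand-in for venues_df: one row per venue, so neighborhoods repeat.
toy_venues = pd.DataFrame({
    "Neighborhood": ["Allerton", "Allerton", "Annadale"],
    "Neighborhood Latitude": [40.865788, 40.865788, 40.538114],
    "Neighborhood Longitude": [-73.859319, -73.859319, -74.178549],
})

# Toy stand-in for neighborhood_df: one row per neighborhood.
toy_counts = pd.DataFrame({
    "Neighborhood": ["Allerton", "Annadale"],
    "American Restaurant": [1, 2],
})

# Keep one coordinate pair per neighborhood, then attach it by name.
coords = toy_venues.drop_duplicates(subset="Neighborhood")
toy_counts = toy_counts.merge(coords, on="Neighborhood", how="left", validate="one_to_one")
print(toy_counts)
```

With `how="left"`, a neighborhood missing from the venue table ends up with empty coordinates instead of raising an error, which differs from the loop's behavior.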


### Note about the uszipcode library

At this point we need a ZIP code for each of the neighborhoods in "neighborhood_df".  With this information, we can match the venue data to our other geocoded data, which is organized by ZIP code.

For notes on API usage, see the [docs](https://pypi.org/project/uszipcode/).

In [66]:
# Use the uszipcode library to look up a ZIP code for each neighborhood

import uszipcode 

search = uszipcode.SearchEngine()

zips = []

for lat, lng in zip(neighborhood_df['Neighborhood Latitude'], neighborhood_df['Neighborhood Longitude']):
    print(f'Getting ZIP for {lat}, {lng}')
    # Take the single closest match within the search radius
    res = search.by_coordinates(lat, lng, radius=10, returns=1)
    zips.append(res[0].zipcode)
    print(f'Found ZIP, {res[0].zipcode}')

# Count distinct ZIP codes; a plain list has no unique(), so use a set
len(set(zips))

Found ZIP, 11357
Getting ZIP for 40.73301404027834, -73.73889198912481
Found ZIP, 11427
Getting ZIP for 40.576155565431094, -73.8540175039252
Found ZIP, 11694
Getting ZIP for 40.72857318176675, -73.72012814826904
Found ZIP, 11426
Getting ZIP for 40.857277100738955, -73.88845196134805
Found ZIP, 10458
Getting ZIP for 40.61100890202044, -73.99517998380729
Found ZIP, 11204
Getting ZIP for 40.615149550453076, -73.89855633630317
Found ZIP, 11234
Getting ZIP for 40.73725071694497, -73.93244235260178
Found ZIP, 11101
Getting ZIP for 40.60577868452358, -74.18725638381567
Found ZIP, 10311
Getting ZIP for 40.68568291209144, -73.98374824115798
Found ZIP, 11217
Getting ZIP for 40.633130512758015, -73.99049823044811
Found ZIP, 11219
Getting ZIP for 40.55740128845452, -73.92551196994168
Found ZIP, 11697
Getting ZIP for 40.71093547252271, -73.81174822458634
Found ZIP, 11435
Getting ZIP for 40.57682506566604, -73.96509448785335
Found ZIP, 11235
Getting ZIP for 40.60302658351238, -73.8200548911032

In [75]:
zips = pd.Series(zips)
len(zips.unique())

148
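
Because several neighborhoods may resolve to the same ZIP code, the venue counts probably need to be aggregated per ZIP before they can be joined to the other data sets. The sketch below shows one way that could look; every frame, neighborhood name, and value in it is a hypothetical stand-in, and summing venue counts is an assumed choice.

```python
import pandas as pd

# Toy stand-ins: names, counts, and ZIP codes are illustrative, not lookup results.
toy_neighborhoods = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "American Restaurant": [1, 2, 0],
})
toy_zips = ["10001", "10002", "10002"]  # same order as the rows above

toy_neighborhoods["zip"] = toy_zips

# Several neighborhoods can share a ZIP code, so sum venue counts per ZIP.
venues_by_zip = (
    toy_neighborhoods.drop(columns="Neighborhood").groupby("zip").sum().reset_index()
)

# Hypothetical per-ZIP table from another source (e.g. aggregated SAT scores).
other_by_zip = pd.DataFrame({"zip": ["10002", "10003"], "total_sat": [1400.0, 1500.0]})

# An inner join keeps only ZIP codes present in both tables.
merged = venues_by_zip.merge(other_by_zip, on="zip", how="inner")
print(merged)
```

An inner join silently drops ZIP codes that appear in only one table, so it is worth checking how many rows survive the merge.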

# 7. Modeling

# 8. Evaluation

# 9. Deployment

# 10. Feedback