# 1. Project Understanding

The problem is this: instructors and decision-makers in education have a difficult time deciding where and how to invest limited resources. Their goal is to help students succeed. The ability to gauge which neighborhoods' students are more or less successful than others, and *why*, is therefore very valuable. This analysis seeks to give decision-makers a way to answer these questions:
* Is any given student a product of their environment? Put more succinctly: is there a relationship between a student's environment (at the neighborhood level) and their academic success?
* If such a relationship exists, is it possible to say whether a body of students is likely to succeed, based on one parameter or another?

Given the scope of this project and the amount of data being compared, we will limit our examination to a single city: New York City. Its variety of neighborhoods and student success rates, along with the high volume of quality data available, makes it a choice of convenience, though hopefully not a fatal one.

# 2. Analytic Approach

First, we must find a way of quantifying "success" in an academic setting. Thankfully, metrics such as GPA and SAT scores have long been used for such a purpose, and are readily available.
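As a minimal sketch of one way to turn these metrics into a single number, the three SAT section averages can be summed per school. The column names and the three rows are copied from the SAT file loaded later in this notebook; treating the sum as the "success" measure, and the name `sat_total`, are assumptions made for illustration.

```python
import pandas as pd

# Three rows copied from the SAT file shown later in this notebook.
sat_sample = pd.DataFrame({
    "School ID": ["01M539", "02M294", "02M308"],
    "Average Score (SAT Math)": [657.0, 395.0, 418.0],
    "Average Score (SAT Reading)": [601.0, 411.0, 428.0],
    "Average Score (SAT Writing)": [601.0, 387.0, 415.0],
})

score_cols = [
    "Average Score (SAT Math)",
    "Average Score (SAT Reading)",
    "Average Score (SAT Writing)",
]

# Assumption: "success" is the sum of the three section averages.
# skipna=False leaves the total missing if any section is missing,
# so schools with partial data are not silently under-scored.
sat_sample["sat_total"] = sat_sample[score_cols].sum(axis=1, skipna=False)

print(sat_sample[["School ID", "sat_total"]])
```

Schools with no reported scores would end up with a missing `sat_total` and would need to be dropped or handled separately before modeling.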

Second, we must find a way of identifying those "features" that make any given neighborhood different from or similar to another. The sky is the limit in this regard. However, for the purposes of this analysis, we will focus on the following:

* Average per-capita income
* Number of occurrences of crime (separated by type) in a given neighborhood
* Number of neighborhood features in the form of venues (separated by type)

Each of these features is numeric (an average or a count), and the quantity we are trying to predict (success in the form of a test score) is continuous, so we should be able to model the problem using various forms of regression. For the data exploration phase, we will use linear regression to identify relationships. For the model-building phase, we will use polynomial regression to try to predict student success.
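The difference between the two regression forms can be sketched with NumPy's least-squares polynomial fit. The numbers below are invented purely for illustration and do not come from any of the datasets; `income_k`, `sat_total`, and `fit_and_score` are hypothetical names. The point is only that a degree-2 fit can capture curvature that a straight line misses.

```python
import numpy as np

# Invented numbers for illustration only; not taken from any dataset.
income_k = np.array([20.0, 35.0, 50.0, 65.0, 80.0, 95.0])  # hypothetical income, thousands USD
sat_total = 900.0 + 12.0 * income_k - 0.05 * income_k ** 2  # hypothetical curved relationship


def fit_and_score(x, y, degree):
    """Least-squares polynomial fit; returns (coefficients, R^2)."""
    coeffs = np.polyfit(x, y, degree)      # highest-degree coefficient first
    predicted = np.polyval(coeffs, x)
    ss_res = np.sum((y - predicted) ** 2)  # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
    return coeffs, 1.0 - ss_res / ss_tot


linear_coeffs, linear_r2 = fit_and_score(income_k, sat_total, 1)
quad_coeffs, quad_r2 = fit_and_score(income_k, sat_total, 2)

print("linear R^2:   ", linear_r2)
print("quadratic R^2:", quad_r2)
```

On real data, R² should be computed on held-out rows rather than the training rows, since a higher-degree polynomial will always fit its own training data at least as well as a line.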

# 3. Data Requirements

Any geocoded data, such as the datasets listed in the next section, should be organized either by zip code or by neighborhood name. Furthermore, dated information, such as SAT scores, crimes, and average income, should come from around the same time period. After all, we are supposing that the students producing these scores were subject to the environment we are measuring.
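A rough sketch of how both requirements might be applied to the crime records: restrict to one time period, then count offenses per area and type. The column names (`CMPLNT_FR_DT`, `OFNS_DESC`, `BORO_NM`) and the first five rows mirror the crime file loaded later in this notebook; the sixth row is hypothetical and exists only to show the date filter at work. Borough is used as a stand-in area key here, because the file carries borough, precinct, and coordinates but no zip code column; mapping coordinates to zip codes would be a separate step.

```python
import pandas as pd

# First five rows mirror the crime file shown later in this notebook;
# the last row is hypothetical, added only to show the date filter working.
crimes_sample = pd.DataFrame({
    "CMPLNT_FR_DT": ["12/31/2015"] * 5 + ["06/15/2009"],
    "OFNS_DESC": [
        "FORGERY",
        "MURDER & NON-NEGL. MANSLAUGHTER",
        "DANGEROUS DRUGS",
        "ASSAULT 3 & RELATED OFFENSES",
        "ASSAULT 3 & RELATED OFFENSES",
        "FORGERY",
    ],
    "BORO_NM": ["BRONX", "QUEENS", "MANHATTAN", "QUEENS", "MANHATTAN", "BRONX"],
})

# Parse MM/DD/YYYY dates; unparseable values become NaT instead of raising.
crimes_sample["date"] = pd.to_datetime(
    crimes_sample["CMPLNT_FR_DT"], format="%m/%d/%Y", errors="coerce"
)

# Keep only the period of interest (2015 is an assumed choice).
in_period = crimes_sample[crimes_sample["date"].dt.year == 2015]

# One row per area, one column per offense type, cells are counts.
crime_counts = pd.crosstab(in_period["BORO_NM"], in_period["OFNS_DESC"])
print(crime_counts)
```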

# 4. Data Collection

### Neighborhood Venue data

*(Constructed during previous Coursera exercises)* Contains Foursquare venue counts, by venue category, for each of New York City's neighborhoods.

### Income Data

[Income data on Kaggle](https://www.kaggle.com/wpncrh/zip-code-income-tax-data-2014)

### Average SAT data for NY schools

[NY SAT data on Kaggle](https://www.kaggle.com/nycopendata/high-schools)

### New York City Crime Data 

[Crime Data on Kaggle](https://www.kaggle.com/adamschroeder/crimes-new-york-city)

In [9]:
# import basic libraries

import pandas as pd 

# import data from CSV files

ny_crimes_df = pd.read_csv('data/NYPD_Complaint_Data_Historic.csv')
ny_crimes_df_desc = pd.read_csv('data/Crime_Column_Description.csv')


In [18]:
# Crime Data
print(ny_crimes_df_desc['Description'])
ny_crimes_df.head()

0     Randomly generated persistent ID for each comp...
1     Exact date of occurrence for the reported even...
2     Exact time of occurrence for the reported even...
3     Ending date of occurrence for the reported eve...
4     Ending time of occurrence for the reported eve...
5                    Date event was reported to police 
6               Three digit offense classification code
7     Description of offense corresponding with key ...
8     Three digit internal classification code (more...
9     Description of internal classification corresp...
10    Indicator of whether crime was successfully co...
11    Level of offense: felony, misdemeanor, violation 
12    Jurisdiction responsible for incident. Either ...
13    The name of the borough in which the incident ...
14          The precinct in which the incident occurred
15    Specific location of occurrence in or around t...
16    Specific description of premises; grocery stor...
17    Name of NYC park, playground or greenspace

Unnamed: 0,CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,CMPLNT_TO_DT,CMPLNT_TO_TM,RPT_DT,KY_CD,OFNS_DESC,PD_CD,PD_DESC,CRM_ATPT_CPTD_CD,LAW_CAT_CD,JURIS_DESC,BORO_NM,ADDR_PCT_CD,LOC_OF_OCCUR_DESC,PREM_TYP_DESC,PARKS_NM,HADEVELOPT,X_COORD_CD,Y_COORD_CD,Latitude,Longitude,Lat_Lon
0,101109527,12/31/2015,23:45:00,,,12/31/2015,113,FORGERY,729.0,"FORGERY,ETC.,UNCLASSIFIED-FELO",COMPLETED,FELONY,N.Y. POLICE DEPT,BRONX,44.0,INSIDE,BAR/NIGHT CLUB,,,1007314.0,241257.0,40.828848,-73.916661,"(40.828848333, -73.916661142)"
1,153401121,12/31/2015,23:36:00,,,12/31/2015,101,MURDER & NON-NEGL. MANSLAUGHTER,,,COMPLETED,FELONY,N.Y. POLICE DEPT,QUEENS,103.0,OUTSIDE,,,,1043991.0,193406.0,40.697338,-73.784557,"(40.697338138, -73.784556739)"
2,569369778,12/31/2015,23:30:00,,,12/31/2015,117,DANGEROUS DRUGS,503.0,"CONTROLLED SUBSTANCE,INTENT TO",COMPLETED,FELONY,N.Y. POLICE DEPT,MANHATTAN,28.0,,OTHER,,,999463.0,231690.0,40.802607,-73.945052,"(40.802606608, -73.945051911)"
3,968417082,12/31/2015,23:30:00,,,12/31/2015,344,ASSAULT 3 & RELATED OFFENSES,101.0,ASSAULT 3,COMPLETED,MISDEMEANOR,N.Y. POLICE DEPT,QUEENS,105.0,INSIDE,RESIDENCE-HOUSE,,,1060183.0,177862.0,40.654549,-73.726339,"(40.654549444, -73.726338791)"
4,641637920,12/31/2015,23:25:00,12/31/2015,23:30:00,12/31/2015,344,ASSAULT 3 & RELATED OFFENSES,101.0,ASSAULT 3,COMPLETED,MISDEMEANOR,N.Y. POLICE DEPT,MANHATTAN,13.0,FRONT OF,OTHER,,,987606.0,208148.0,40.738002,-73.987891,"(40.7380024, -73.98789129)"


In [13]:
# Income Data
income_df = pd.read_csv('data/14zpallagi.csv')

In [19]:
income_df.head()

Unnamed: 0,STATEFIPS,STATE,zipcode,agi_stub,N1,mars1,MARS2,MARS4,PREP,N2,NUMDEP,TOTAL_VITA,VITA,TCE,A00100,N02650,A02650,N00200,A00200,N00300,A00300,N00600,A00600,N00650,A00650,N00700,A00700,N00900,A00900,N01000,A01000,N01400,A01400,N01700,A01700,SCHF,N02300,A02300,N02500,A02500,...,N07230,A07230,N07240,A07240,N07220,A07220,N07260,A07260,N09400,A09400,N85770,A85770,N85775,A85775,N09750,A09750,N10600,A10600,N59660,A59660,N59720,A59720,N11070,A11070,N10960,A10960,N11560,A11560,N06500,A06500,N10300,A10300,N85530,A85530,N85300,A85300,N11901,A11901,N11902,A11902
0,1,AL,0,1,850050.0,481840.0,115070.0,240450.0,479900.0,1401930.0,548630.0,24840.0,16660.0,8180.0,11004990.0,850050.0,11187657.0,682860.0,8746419.0,95140.0,64688.0,43950.0,72642.0,38880.0,46689.0,13400.0,5825.0,145420.0,810441.0,35870.0,38739.0,37300.0,223749.0,107590.0,1047421.0,8170.0,33740.0,97553.0,36220.0,64407.0,...,32980.0,16979.0,32000.0,5460.0,31350.0,9543.0,2240.0,604.0,116380.0,153768.0,24900.0,84425.0,25060.0,83442.0,56250.0,6175.0,787080.0,2241625.0,390640.0,1169025.0,361400.0,1050890.0,251070.0,319709.0,80750.0,77190.0,12390.0,6237.0,252000.0,157928.0,389850.0,324575.0,0.0,0.0,0.0,0.0,62690.0,47433.0,744910.0,1964826.0
1,1,AL,0,2,491370.0,200750.0,150290.0,125560.0,281350.0,1016010.0,375670.0,10850.0,7080.0,3780.0,17658446.0,491370.0,17836190.0,425830.0,14494884.0,92610.0,69421.0,41040.0,96100.0,36250.0,64683.0,51450.0,27881.0,62500.0,250528.0,31950.0,75029.0,34210.0,317668.0,101020.0,1791833.0,8600.0,18180.0,52682.0,88200.0,560917.0,...,45700.0,47932.0,67530.0,12516.0,122540.0,108415.0,10560.0,4749.0,37790.0,80221.0,10870.0,34145.0,11810.0,38398.0,33560.0,6324.0,480490.0,2034491.0,134550.0,272098.0,112880.0,232001.0,103680.0,154145.0,37240.0,34513.0,3410.0,2665.0,368090.0,850897.0,397110.0,950446.0,0.0,0.0,0.0,0.0,70780.0,101969.0,413790.0,1177400.0
2,1,AL,0,3,259540.0,75820.0,142970.0,34070.0,156720.0,589190.0,186770.0,3170.0,1680.0,1490.0,15963943.0,259540.0,16117661.0,223910.0,12316371.0,82760.0,69005.0,39530.0,123290.0,35530.0,85619.0,62280.0,40993.0,39640.0,253529.0,31650.0,108577.0,28200.0,360311.0,72620.0,1757156.0,7680.0,8900.0,27490.0,60570.0,882812.0,...,24340.0,31082.0,17950.0,2792.0,76950.0,116151.0,8970.0,3975.0,26800.0,65138.0,1720.0,5212.0,2660.0,9160.0,7470.0,2734.0,255980.0,1736742.0,660.0,138.0,410.0,71.0,10670.0,14029.0,18570.0,17072.0,360.0,460.0,246400.0,1236058.0,250230.0,1319641.0,0.0,0.0,0.0,0.0,62170.0,132373.0,192050.0,538160.0
3,1,AL,0,4,164840.0,26730.0,125410.0,10390.0,99750.0,423300.0,133020.0,1260.0,700.0,560.0,14294375.0,164840.0,14422811.0,143710.0,10817987.0,69880.0,62269.0,35700.0,126688.0,32500.0,90127.0,57000.0,43748.0,28060.0,232026.0,28920.0,139040.0,22560.0,378333.0,52890.0,1584142.0,6130.0,5060.0,16508.0,40540.0,793454.0,...,17800.0,23168.0,0.0,0.0,54560.0,90111.0,6020.0,2195.0,19600.0,54016.0,200.0,426.0,750.0,2206.0,2100.0,1206.0,163320.0,1658063.0,0.0,0.0,0.0,0.0,740.0,938.0,12980.0,11946.0,40.0,41.0,163100.0,1323173.0,163580.0,1394913.0,0.0,0.0,0.0,0.0,45120.0,124048.0,115470.0,375882.0
4,1,AL,0,5,203650.0,18990.0,177070.0,5860.0,122670.0,565930.0,185150.0,1260.0,900.0,360.0,27387096.0,203650.0,27664725.0,181410.0,20155298.0,113360.0,141176.0,69620.0,368076.0,64740.0,274484.0,107690.0,111331.0,38850.0,571411.0,59170.0,547842.0,32290.0,825820.0,69810.0,2690393.0,8720.0,4820.0,17488.0,45910.0,1038902.0,...,22740.0,31720.0,0.0,0.0,46520.0,63034.0,7940.0,2660.0,28990.0,113649.0,0.0,0.0,450.0,1574.0,1360.0,1325.0,201900.0,3756035.0,0.0,0.0,0.0,0.0,90.0,120.0,20660.0,19605.0,0.0,0.0,202820.0,3501461.0,203050.0,3655700.0,610.0,135.0,270.0,66.0,81180.0,387298.0,114380.0,448442.0
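The income file has several rows per zip code (one per `agi_stub` income bracket), so an average has to be computed by aggregating first. The sketch below uses the five rows shown above and assumes, based on my reading of the IRS column naming, that `N1` is the number of returns and `A00100` is adjusted gross income in thousands of dollars; both should be checked against the dataset's documentation. The result is income per return, which only approximates per-capita income, and `agi_per_return_k` is a name chosen for illustration.

```python
import pandas as pd

# Values copied from the five rows displayed above (STATE == "AL", zipcode == 0).
income_sample = pd.DataFrame({
    "STATE": ["AL"] * 5,
    "zipcode": [0] * 5,
    "agi_stub": [1, 2, 3, 4, 5],
    "N1": [850050.0, 491370.0, 259540.0, 164840.0, 203650.0],
    "A00100": [11004990.0, 17658446.0, 15963943.0, 14294375.0, 27387096.0],
})

# Sum returns and income across all brackets within each zip code.
totals = income_sample.groupby(["STATE", "zipcode"])[["N1", "A00100"]].sum()

# Assumption: A00100 is in thousands of dollars, so this is thousands per return.
totals["agi_per_return_k"] = totals["A00100"] / totals["N1"]
print(totals)
```

For this project the same aggregation would presumably be run after filtering to `STATE == "NY"` and dropping any summary rows (zip code `0` appears to be a state-level total rather than a real zip code).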


In [21]:
# SAT Data
sat_df = pd.read_csv('data/scores.csv')

In [22]:
sat_df.head()

Unnamed: 0,School ID,School Name,Borough,Building Code,Street Address,City,State,Zip Code,Latitude,Longitude,Phone Number,Start Time,End Time,Student Enrollment,Percent White,Percent Black,Percent Hispanic,Percent Asian,Average Score (SAT Math),Average Score (SAT Reading),Average Score (SAT Writing),Percent Tested
0,02M260,Clinton School Writers and Artists,Manhattan,M933,425 West 33rd Street,Manhattan,NY,10001,40.75321,-73.99786,212-695-9114,,,,,,,,,,,
1,06M211,Inwood Early College for Health and Informatio...,Manhattan,M052,650 Academy Street,Manhattan,NY,10002,40.86605,-73.92486,718-935-3660,8:30 AM,3:00 PM,87.0,3.4%,21.8%,67.8%,4.6%,,,,
2,01M539,"New Explorations into Science, Technology and ...",Manhattan,M022,111 Columbia Street,Manhattan,NY,10002,40.71873,-73.97943,212-677-5190,8:15 AM,4:00 PM,1735.0,28.6%,13.3%,18.0%,38.5%,657.0,601.0,601.0,91.0%
3,02M294,Essex Street Academy,Manhattan,M445,350 Grand Street,Manhattan,NY,10002,40.71687,-73.98953,212-475-4773,8:00 AM,2:45 PM,358.0,11.7%,38.5%,41.3%,5.9%,395.0,411.0,387.0,78.9%
4,02M308,Lower Manhattan Arts Academy,Manhattan,M445,350 Grand Street,Manhattan,NY,10002,40.71687,-73.98953,212-505-0143,8:30 AM,3:00 PM,383.0,3.1%,28.2%,56.9%,8.6%,418.0,428.0,415.0,65.1%


In [23]:
# Neighborhood Data
neighborhood_df = pd.read_csv('data/ny_neighborhood_venu_cat_count.csv')

In [24]:
neighborhood_df.head()

Unnamed: 0.1,Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Arts & Entertainment,Asian Restaurant,Astrologer,Athletics & Sports,Auditorium,Australian Restaurant,Austrian Restaurant,Auto Garage,Automotive Shop,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Court,Basketball Stadium,Bath House,Beach,Beach Bar,Bed & Breakfast,Beer Bar,...,Tea Room,Tech Startup,Temple,Tennis Court,Tennis Stadium,Tex-Mex Restaurant,Thai Restaurant,Theater,Theme Park,Theme Park Ride / Attraction,Thrift / Vintage Store,Tibetan Restaurant,Tiki Bar,Toll Plaza,Tourist Information Center,Toy / Game Store,Track,Trail,Train,Train Station,Turkish Restaurant,Udon Restaurant,Used Bookstore,Vape Store,Varenyky restaurant,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Video Store,Vietnamese Restaurant,Volleyball Court,Warehouse Store,Waste Facility,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,Allerton,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,Annadale,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,2,Arden Heights,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,3,Arlington,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,4,Arrochar,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# 5. Data Understanding

# 6. Data Preparation

# 7. Modeling

# 8. Evaluation

# 9. Deployment

# 10. Feedback