## Steps for Notebook
1. Get the datasets: Cleveland, census, and ACS data (https://la.arcgis.com/databrowser/index.html); immigration and emigration data (https://www.census.gov/acs/www/data/data-tables-and-tools/data-profiles/2022/) (**Ethan**)
2. Combine into one dataset via geoenrichment, per block per year (could be per zip code per year) (**Calvin**)
4. Initial visualization: hotspots for crime and socioeconomic factors, blocks/zip codes in Cleveland, and the difference in demographics via a hotspot comparison of the 2010 vs 2020 census (**Calvin** for hotspots, **Ethan** for choropleth maps)
5. Perform correlation analysis with hotspots: how much do socioeconomic factors explain the variability in crime rate? (**Ethan**)
6. kNN, Isolation Forest, One-class SVM, Random Forest. Predict the crime rate for a zip code/block based on socioeconomic factors. Split into a training and a test set (80/20) and see which models have the lowest MSE. (**Ethan** for Random Forest and Isolation Forest, **Calvin** for kNN and One-class SVM)
8. Compare the models and determine the best one. (**Calvin**)

Finish up to step 4 by Thursday 
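
The modelling step in the plan (80/20 split, compare models by MSE) could look roughly like the sketch below. It is only an illustration under stated assumptions: the features and target are synthetic random data standing in for the socioeconomic factors and crime rate, and the hyperparameters are arbitrary. Only kNN and Random Forest are included, because scikit-learn's Isolation Forest and One-class SVM are outlier detectors that return inlier/outlier labels rather than a continuous prediction, so an MSE on crime rate is not directly available for them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption): 200 areas, 3 socioeconomic features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# 80/20 train/test split, as described in the plan
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate regressors; hyperparameters are arbitrary placeholders
models = {
    'kNN': KNeighborsRegressor(n_neighbors=5),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model on the training set and score it on the held-out test set
mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))

# The model with the lowest test MSE is the current best candidate
best = min(mse, key=mse.get)
print(mse, best)
```

With real data, `X` would hold the merged census columns and `y` the crime rate per block or zip code.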

# Part 0: Imports and Initialization

## Imports

In [1]:
%matplotlib inline

import pandas as pd
import geopandas as gpd
import numpy as np
import os

import censusdata
from census import Census
from us import states

from matplotlib import pyplot as plt
import pygris
import folium

pd.set_option('display.max_columns', None)

## Functions

In [9]:

def download_OH_data(var_map, year_start, year_end):
    """Download ACS 5-year tract-level variables and merge them into one wide DataFrame.

    var_map maps raw Census variable codes to readable aliases. Each
    variable/year pair becomes one column named "<alias>_<year>".
    """
    # NOTE: county FIPS 039 is Defiance County; Cuyahoga County (Cleveland) is 035.
    geo = {'for': 'tract:*', 'in': f'state:{states.OH.fips} county:039'}
    geo_cols = ['NAME', 'state', 'county', 'tract']
    df_final = None

    for yr in range(year_start, year_end + 1):
        c = Census("4977648d549eae5dd6bc0563b7c148db6c44642d", year=yr)

        for raw_var, alias in var_map.items():
            # Data-profile variables (DP...) come from a separate endpoint
            dataset = c.acs5dp if raw_var.startswith('DP') else c.acs5
            data = dataset.get(('NAME', raw_var), geo)

            df_temp = pd.DataFrame(data)
            df_temp = df_temp.rename(columns={raw_var: f"{alias}_{yr}"})

            if df_final is None:
                # First chunk of data: nothing to merge with yet
                df_final = df_temp
            else:
                # Outer merge on the geo-id columns keeps tracts that are
                # missing from some years (use 'inner' to drop them instead)
                df_final = pd.merge(df_final, df_temp, on=geo_cols, how='outer')

    return df_final
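
The merge logic in `download_OH_data` can be checked offline, without calling the Census API. The sketch below is a minimal illustration on made-up rows: the tract names and counts are hypothetical, not real ACS values, and `merge_yearly_frames` is a helper introduced only for this example. Each per-year frame is renamed to `{alias}_{yr}` and outer-merged on the geography columns, so a tract missing from one year is kept with `NaN` rather than dropped.

```python
import pandas as pd

GEO_COLS = ['NAME', 'state', 'county', 'tract']

def merge_yearly_frames(rows_by_year, raw_var, alias):
    """Outer-merge per-year rows on the geography columns."""
    df_final = None
    for yr, rows in rows_by_year.items():
        df_temp = pd.DataFrame(rows).rename(columns={raw_var: f"{alias}_{yr}"})
        if df_final is None:
            df_final = df_temp
        else:
            df_final = pd.merge(df_final, df_temp, on=GEO_COLS, how='outer')
    return df_final

# Hypothetical rows shaped like an API response (values are made up)
fake = {
    2015: [
        {'NAME': 'Tract A', 'B15003_022E': 100.0,
         'state': '39', 'county': '039', 'tract': '000100'},
        {'NAME': 'Tract B', 'B15003_022E': 200.0,
         'state': '39', 'county': '039', 'tract': '000200'},
    ],
    2016: [
        # Tract B is deliberately missing in 2016
        {'NAME': 'Tract A', 'B15003_022E': 110.0,
         'state': '39', 'county': '039', 'tract': '000100'},
    ],
}

merged = merge_yearly_frames(fake, 'B15003_022E', 'bachelors_degree')
print(merged)
```

With `how='inner'`, Tract B would disappear from the result entirely, which is the trade-off behind the `how` argument in the function.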

# Part 1: Getting Data

## Census Data

- Utilizing search functionality to locate variables

In [4]:
s = censusdata.search('acs5', 2015, 'label', 'graduate')
s

[('B06009PR_002E',
  'PLACE OF BIRTH BY EDUCATIONAL ATTAINMENT IN PUERTO RICO',
  'Estimate!!Total!!Less than high school graduate'),
 ('B06009PR_003E',
  'PLACE OF BIRTH BY EDUCATIONAL ATTAINMENT IN PUERTO RICO',
  'Estimate!!Total!!High school graduate (includes equivalency)'),
 ('B06009PR_006E',
  'PLACE OF BIRTH BY EDUCATIONAL ATTAINMENT IN PUERTO RICO',
  'Estimate!!Total!!Graduate or professional degree'),
 ('B06009PR_008E',
  'PLACE OF BIRTH BY EDUCATIONAL ATTAINMENT IN PUERTO RICO',
  'Estimate!!Total!!Born in Puerto Rico!!Less than high school graduate'),
 ('B06009PR_009E',
  'PLACE OF BIRTH BY EDUCATIONAL ATTAINMENT IN PUERTO RICO',
  'Estimate!!Total!!Born in Puerto Rico!!High school graduate (includes equivalency)'),
 ('B06009PR_012E',
  'PLACE OF BIRTH BY EDUCATIONAL ATTAINMENT IN PUERTO RICO',
  'Estimate!!Total!!Born in Puerto Rico!!Graduate or professional degree'),
 ('B06009PR_014E',
  'PLACE OF BIRTH BY EDUCATIONAL ATTAINMENT IN PUERTO RICO',
  'Estimate!!Total!!Born 
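
A broad keyword such as `'graduate'` returns many near-duplicate hits, so it can help to narrow the list after the search. The sketch below assumes each hit is a `(code, concept, label)` tuple, as in the output above; `filter_hits` is a hypothetical helper for illustration, not part of `censusdata`, and the sample hits are copied from that output.

```python
def filter_hits(hits, label_contains):
    """Keep (code, concept, label) tuples whose label contains the given text."""
    needle = label_contains.lower()
    return [hit for hit in hits if needle in hit[2].lower()]

# A few hits copied from the search output above
sample_hits = [
    ('B06009PR_002E',
     'PLACE OF BIRTH BY EDUCATIONAL ATTAINMENT IN PUERTO RICO',
     'Estimate!!Total!!Less than high school graduate'),
    ('B06009PR_006E',
     'PLACE OF BIRTH BY EDUCATIONAL ATTAINMENT IN PUERTO RICO',
     'Estimate!!Total!!Graduate or professional degree'),
    ('B06009PR_012E',
     'PLACE OF BIRTH BY EDUCATIONAL ATTAINMENT IN PUERTO RICO',
     'Estimate!!Total!!Born in Puerto Rico!!Graduate or professional degree'),
]

matches = filter_hits(sample_hits, 'graduate or professional degree')
for code, concept, label in matches:
    print(code, '|', label)
```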

- Codes of collected variables we will be using

In [5]:
'''
Additional codes to find:
 - Immigration/emigration (how many people are moving to/from the area?)
    - mobility/migration
 - Housing information (how many people own/rent their homes?)
 - Age distributions
 - Familial structures
    - Household size, family size
 - Gini index (measure of income inequality)
 - In/out of labor force (different from income?)
 - Educational attainment
'''
codes = {
    'B06010_003E': 'income_yes',
    'B06010_004E': 'income_no',
    'B06010_005E': 'income_0_10k',
    'B06010_006E': 'income_10_25k',
    'B06010_007E': 'income_25_35k',
    'B06010_008E': 'income_35_45k',
    'B06010_009E': 'income_45_55k',
    'B06010_010E': 'income_55_65k',
    'B06010_011E': 'income_65_75k',
    'B06010_013E': 'income_over_75k',
    'B06002_001E': 'median_age',
    'B15003_022E': 'bachelors_degree'
}

- Running the downloads

In [10]:
test_code = {'B15003_022E': 'bachelors_degree'}

df = download_OH_data(test_code, 2015, 2017)
df.head()

| | NAME | bachelors_degree_2015 | state | county | tract | bachelors_degree_2016 | bachelors_degree_2017 |
|---|---|---|---|---|---|---|---|
| 0 | Census Tract 9587, Defiance County, Ohio | 550.0 | 39 | 39 | 958700 | 514.0 | 591.0 |
| 1 | Census Tract 9589, Defiance County, Ohio | 250.0 | 39 | 39 | 958900 | 197.0 | 168.0 |
| 2 | Census Tract 9581, Defiance County, Ohio | 317.0 | 39 | 39 | 958100 | 303.0 | 326.0 |
| 3 | Census Tract 9583, Defiance County, Ohio | 176.0 | 39 | 39 | 958300 | 193.0 | 274.0 |
| 4 | Census Tract 9584, Defiance County, Ohio | 233.0 | 39 | 39 | 958400 | 243.0 | 268.0 |
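
The download produces one column per variable per year (wide format), while the plan calls for one row per area per year. A possible way to reshape it is `pd.wide_to_long`, sketched below on the first two rows of the output above. This is only one option under the assumption that every value column follows the `<alias>_<year>` naming used by `download_OH_data`; the name `long_df` is introduced here for illustration.

```python
import pandas as pd

# First two rows of the download output above, in wide format
wide = pd.DataFrame({
    'NAME': ['Census Tract 9587, Defiance County, Ohio',
             'Census Tract 9589, Defiance County, Ohio'],
    'bachelors_degree_2015': [550.0, 250.0],
    'state': [39, 39],
    'county': [39, 39],
    'tract': [958700, 958900],
    'bachelors_degree_2016': [514.0, 197.0],
    'bachelors_degree_2017': [591.0, 168.0],
})

# Split "<alias>_<year>" columns into a 'year' column plus one column per alias
long_df = pd.wide_to_long(
    wide,
    stubnames=['bachelors_degree'],
    i=['NAME', 'state', 'county', 'tract'],
    j='year',
    sep='_',
).reset_index()

print(long_df)
```

With several variables, `stubnames` would list every alias (for example `list(codes.values())`), giving one column per variable and one row per tract per year.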


# Part 2: Combining Data

# Part 3: Visualizations

# Part 4: Correlation Analysis