## Steps for Notebook
1. Getting the datasets (Cleveland, census data, ACS data) (https://la.arcgis.com/databrowser/index.html), (immigration, emigration) (https://www.census.gov/acs/www/data/data-tables-and-tools/data-profiles/2022/) (**Ethan**)
2. Combine into one dataset via geoenrichment (per block per year) (could be zip_code per year) (**Calvin**)
4. Initial Visualization (hotspots for crime and socioeconomic factors, visualize blocks/zip_codes in Cleveland, show difference in demographics with a hotspot comparing 2010 vs 2020 census) (**Calvin** for hotspot), (**Ethan** for choropleth maps)
5. Perform correlation analysis with hotspots: how much of the variability in crime rate do socioeconomic factors explain? (**Ethan**)
6. kNN, Isolation Forest, One-class SVM, Random Forest. Predict the crime rate for a zip-code/block based on socioeconomic factors. Split into a training vs test set (80/20). See which ones have the lowest MSE. (**Ethan** for Random Forest, Isolation Forest) (**Calvin** for kNN and One-class SVM)
8. Comparing the models and determining the best one. (**Calvin**)
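
The split-and-compare procedure in step 6 can be sketched roughly as follows. This is only an illustration on synthetic data: the helper names (`train_test_split_80_20`, `knn_predict`, `mse`) are hypothetical, and a NumPy-only kNN regressor plus a mean baseline stand in for the models listed above, which the notebook would presumably take from a library such as scikit-learn.

```python
import numpy as np

def train_test_split_80_20(X, y, seed=0):
    """Shuffle the rows and split them 80/20 into train and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(0.8 * len(X))
    train, test = idx[:cut], idx[cut:]
    return X[train], X[test], y[train], y[test]

def knn_predict(X_train, y_train, X_test, k=5):
    """Predict each test target as the mean target of its k nearest training rows."""
    # Pairwise Euclidean distances, shape (n_test, n_train)
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def mse(y_true, y_pred):
    """Mean squared error between true and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

# Synthetic stand-in for (socioeconomic features, crime rate); not real data
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split_80_20(X, y)

# Score every candidate on the same held-out 20%
scores = {
    "mean_baseline": mse(y_test, np.full(len(y_test), y_train.mean())),
    "knn_k5": mse(y_test, knn_predict(X_train, y_train, X_test, k=5)),
}
best_model = min(scores, key=scores.get)
print(scores, "best:", best_model)
```

Scoring all candidates on the same held-out rows keeps the MSE values directly comparable; the mean baseline gives a floor that any useful model should beat.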

Finish up to step 4 by Thursday 

# Part 0: Imports and Initialization

## Imports

In [1]:
%matplotlib inline

import pandas as pd
import geopandas as gpd
import numpy as np
import os

import censusdata
from census import Census
from us import states

from matplotlib import pyplot as plt
import pygris
import folium

pd.set_option('display.max_columns', None)

## Functions

In [27]:

def download_OH_data(var_map, year_start, year_end):
    """Download ACS 5-year tract-level data for Cuyahoga County, Ohio (county FIPS 035).

    var_map maps raw Census variable codes to readable aliases. Each variable
    is fetched once per year (year_start..year_end, inclusive) and stored in a
    column named "<alias>_<year>".
    """
    # Same geography for every request: all tracts in Cuyahoga County
    geo = {'for': 'tract:*', 'in': f'state:{states.OH.fips} county:035'}
    df_final = None

    for yr in range(year_start, year_end + 1):
        c = Census("4977648d549eae5dd6bc0563b7c148db6c44642d", year=yr)

        for raw_var, alias in var_map.items():
            # Data-profile variables (codes starting with 'DP') use a separate endpoint
            if raw_var.startswith('DP'):
                data = c.acs5dp.get(('NAME', raw_var), geo)
            else:
                data = c.acs5.get(('NAME', raw_var), geo)

            df_temp = pd.DataFrame(data)
            df_temp.rename(columns={raw_var: f"{alias}_{yr}"}, inplace=True)

            # Merge it into df_final
            if df_final is None:
                # If this is the first chunk of data, just assign
                df_final = df_temp
            else:
                # Otherwise, merge on the geo-id columns
                df_final = pd.merge(
                    df_final, df_temp,
                    on=['NAME', 'state', 'county', 'tract'],
                    how='outer'  # or 'inner', your choice
                )

    return df_final
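
The API key above is written directly into the function. One possible alternative, sketched below, reads it from an environment variable so the key does not end up in a shared notebook. The helper name `get_census_api_key` and the variable name `CENSUS_API_KEY` are assumptions made for this example, not part of the `census` package.

```python
import os

def get_census_api_key(env_var="CENSUS_API_KEY"):
    """Return the Census API key stored in the given environment variable.

    The variable name is a hypothetical convention chosen for this sketch.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Census API key."
        )
    return key
```

Inside `download_OH_data`, the client could then be created with `Census(get_census_api_key(), year=yr)`.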

# Part 1: Getting Data

## Census Data

- Utilizing search functionality to locate variables

In [22]:
# s = censusdata.search('acs5', 2015, 'label', 'family')
# s

s = censusdata.search('acs5', 2015, 'name', 'B')
s

[]

- Codes of collected variables we will be using

In [None]:
'''
Additional codes to find:
 - Immigration/emigration (how many people are moving to/from the area?)
    - mobility/migration
 - Housing information (how many people own/rent their homes?)
 - Age distributions
 - Familial structures
    - Household size, family size
 - in/out of labor force (different from income?)
 - educational attainment
'''
codes = {
    'B06010_003E': 'income_yes',
    'B06010_004E': 'income_no',
    'B06010_005E': 'income_0_10k',
    'B06010_006E': 'income_10_25k',
    'B06010_007E': 'income_25_35k',
    'B06010_008E': 'income_35_45k',
    'B06010_009E': 'income_45_55k',
    'B06010_010E': 'income_55_65k',
    'B06010_011E': 'income_65_75k',
    'B06010_013E': 'income_over_75k',

    'B06002_001E': 'median_age',

    'B15003_022E': 'bachelors_degree',

    'B19083_001E': 'gini_index',  # Measure of income inequality
}

- Running the downloads

In [28]:
test_code = {'B15003_022E': 'bachelors_degree'}

df = download_OH_data(test_code, 2015, 2017)
df.head()

| | NAME | bachelors_degree_2015 | state | county | tract | bachelors_degree_2016 | bachelors_degree_2017 |
|---|---|---|---|---|---|---|---|
| 0 | Census Tract 1514, Cuyahoga County, Ohio | 62.0 | 39 | 35 | 151400 | 83.0 | 74.0 |
| 1 | Census Tract 1524, Cuyahoga County, Ohio | 130.0 | 39 | 35 | 152400 | 170.0 | 174.0 |
| 2 | Census Tract 1527.02, Cuyahoga County, Ohio | 177.0 | 39 | 35 | 152702 | 191.0 | 139.0 |
| 3 | Census Tract 1542, Cuyahoga County, Ohio | 139.0 | 39 | 35 | 154200 | 118.0 | 113.0 |
| 4 | Census Tract 1605, Cuyahoga County, Ohio | 813.0 | 39 | 35 | 160500 | 864.0 | 961.0 |
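
The download returns one column per variable per year. For the "per tract per year" layout mentioned in the plan, a wide-to-long reshape is one option. The sketch below uses a small hand-built copy of two rows from the output above; `wide` and `long_df` are illustrative names, and it assumes every year-specific column follows the `<alias>_<year>` pattern.

```python
import pandas as pd

# Toy copy of two rows from the output above (codes kept as strings)
wide = pd.DataFrame({
    "NAME": [
        "Census Tract 1514, Cuyahoga County, Ohio",
        "Census Tract 1524, Cuyahoga County, Ohio",
    ],
    "state": ["39", "39"],
    "county": ["035", "035"],
    "tract": ["151400", "152400"],
    "bachelors_degree_2015": [62.0, 130.0],
    "bachelors_degree_2016": [83.0, 170.0],
    "bachelors_degree_2017": [74.0, 174.0],
})

# Turn the <alias>_<year> columns into one row per tract per year
long_df = (
    pd.wide_to_long(
        wide,
        stubnames=["bachelors_degree"],
        i=["NAME", "state", "county", "tract"],
        j="year",
        sep="_",
    )
    .reset_index()
    .sort_values(["tract", "year"])
    .reset_index(drop=True)
)
print(long_df)
```

With more variables, each alias would be added to `stubnames` so that all of them share the same `year` column.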


In [23]:
cuyahoga_tracts = pygris.tracts(state="39", county="035", year=2022)  # Adjust year if needed

In [26]:
cuyahoga_tracts.head()

| | STATEFP | COUNTYFP | TRACTCE | GEOID | NAME | NAMELSAD | MTFCC | FUNCSTAT | ALAND | AWATER | INTPTLAT | INTPTLON | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 190 | 39 | 35 | 175109 | 39035175109 | 1751.09 | Census Tract 1751.09 | G5020 | S | 3598417 | 0 | 41.3432585 | -81.7720694 | POLYGON ((-81.78524 41.33829, -81.78522 41.339... |
| 191 | 39 | 35 | 175110 | 39035175110 | 1751.1 | Census Tract 1751.10 | G5020 | S | 5527803 | 25406 | 41.320746 | -81.7646461 | POLYGON ((-81.78528 41.32275, -81.78528 41.323... |
| 192 | 39 | 35 | 190506 | 39035190506 | 1905.06 | Census Tract 1905.06 | G5020 | S | 9542971 | 0 | 41.376237 | -81.9424544 | POLYGON ((-81.97096 41.36935, -81.97096 41.369... |
| 193 | 39 | 35 | 172105 | 39035172105 | 1721.05 | Census Tract 1721.05 | G5020 | S | 773060 | 0 | 41.5266147 | -81.4355169 | POLYGON ((-81.43879 41.53301, -81.43292 41.533... |
| 194 | 39 | 35 | 152605 | 39035152605 | 1526.05 | Census Tract 1526.05 | G5020 | S | 4055218 | 7587 | 41.5793017 | -81.5206108 | POLYGON ((-81.53602 41.57285, -81.52908 41.577... |


# Part 2: Combining Data
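
One way to join the census variables to the tract geometries is through the 11-digit `GEOID`, which concatenates the 2-digit state, 3-digit county, and 6-digit tract codes. The sketch below uses small stand-in frames (`census_df`, `tracts_df`) rather than the real `df` and `cuyahoga_tracts`, and it assumes the codes may have lost leading zeros and need re-padding.

```python
import pandas as pd

# Stand-in for the output of download_OH_data
census_df = pd.DataFrame({
    "state": ["39", "39"],
    "county": ["35", "35"],  # leading zero lost, as in the displayed output
    "tract": ["151400", "152400"],
    "bachelors_degree_2015": [62.0, 130.0],
})

# Stand-in for cuyahoga_tracts (geometry column omitted here)
tracts_df = pd.DataFrame({
    "GEOID": ["39035151400", "39035152400", "39035175109"],
    "NAMELSAD": ["Census Tract 1514", "Census Tract 1524", "Census Tract 1751.09"],
})

# Rebuild the 11-digit GEOID: state (2) + county (3) + tract (6)
census_df["GEOID"] = (
    census_df["state"].astype(str).str.zfill(2)
    + census_df["county"].astype(str).str.zfill(3)
    + census_df["tract"].astype(str).str.zfill(6)
)

# Left join keeps every tract; tracts without census rows get NaN
combined = tracts_df.merge(
    census_df.drop(columns=["state", "county", "tract"]),
    on="GEOID",
    how="left",
    validate="one_to_one",
)
print(combined)
```

With the real data, putting the GeoDataFrame on the left side of the merge should preserve its geometry column. Tract boundaries are redrawn between decennial censuses, so the boundary year passed to `pygris.tracts` may need to match the ACS years being joined.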

# Part 3: Visualizations

# Part 4: Correlation Analysis
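
As a possible starting point for this part, the sketch below computes Pearson correlations between each socioeconomic factor and crime rate, then the share of crime-rate variance explained by a linear fit (R²). The data is synthetic and the `crime_rate` column is hypothetical; only `median_age` and `gini_index` correspond to aliases defined earlier in the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the combined tract-level table; not real data
rng = np.random.default_rng(0)
n = 200
df_demo = pd.DataFrame({
    "median_age": rng.normal(38, 5, n),
    "gini_index": rng.uniform(0.3, 0.6, n),
})
df_demo["crime_rate"] = (
    50
    + 80 * df_demo["gini_index"]
    - 1.0 * df_demo["median_age"]
    + rng.normal(0, 2, n)
)

# Pearson correlation of each factor with crime rate
corr = df_demo.corr(method="pearson")["crime_rate"].drop("crime_rate")
print(corr)

# R^2 of an ordinary least-squares fit with an intercept
X = np.column_stack([np.ones(n), df_demo[["median_age", "gini_index"]].to_numpy()])
y = df_demo["crime_rate"].to_numpy()
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_res = np.sum((y - X @ beta) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")
```

The pairwise correlations show the direction and strength of each factor on its own, while R² summarizes how much of the variability the factors explain jointly under a linear model.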