# Stitching together neighborhood data

Several issues to address:
1. The population crosswalk isn't actually a crosswalk. It needs to become one, because communities are aggregated up to larger geographies and the combined population belongs to the aggregate. Right now, we're merging in populations much larger than the actual size of the community (Whittier's population covers Whittier, Unincorporated - Whittier, La Habra Heights, Unincorporated - La Habra Heights, and Unincorporated - Sunrise Village), but we are displaying case totals for La Habra Heights, Sunrise Village, and the rest separately from Whittier. That's misleading, because the population-adjusted cases will look much lower for Whittier (numerator too small, denominator too big). [Use the PDF, not the JSON or CSV, to develop this.](https://github.com/ANRGUSC/lacounty_covid19_data/blob/master/data/population.pdf)
1. Aggregate to these larger neighborhoods, and feed that into interactive charting.
1. Check that the neighborhood names in our historical df match what we cleaned for the missing early July dates (cleaned in Stata).
1. Check that the neighborhood names in our historical df match what we cleaned for the parquets (July 10 onward).
1. Append the missing days and the parquets to the historical df.
1. Feed that into the interactive charts, and schedule the DAG.
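
The first two items can be sketched as below. This is only an illustration: the `crosswalk` and `agg_pop` tables, the `aggregate_region` column, the `aggregate_cases` helper, and all the numbers are made up, and the real groupings would have to come from the population PDF.

```python
import pandas as pd

# Hypothetical crosswalk: each reported community maps to one aggregate region.
# The grouping mirrors the Whittier example; it is not the official crosswalk.
crosswalk = pd.DataFrame({
    "Region": [
        "Whittier", "Unincorporated - Whittier", "La Habra Heights",
        "Unincorporated - La Habra Heights", "Unincorporated - Sunrise Village",
        "Altadena",
    ],
    "aggregate_region": ["Whittier"] * 5 + ["Altadena"],
})

# Made-up case counts for a single day, one row per reported community
cases = pd.DataFrame({
    "Region": crosswalk["Region"],
    "date": pd.Timestamp("2020-07-01"),
    "cases": [100, 20, 5, 3, 2, 40],
})

# Made-up populations, one row per aggregate region
agg_pop = pd.DataFrame({
    "aggregate_region": ["Whittier", "Altadena"],
    "Population": [100_000, 50_000],
})


def aggregate_cases(cases, crosswalk, agg_pop):
    """Sum cases up to the aggregate region, then attach its population."""
    merged = pd.merge(cases, crosswalk, on="Region", how="left", validate="m:1")
    totals = merged.groupby(
        ["aggregate_region", "date"], as_index=False
    )["cases"].sum()
    totals = pd.merge(
        totals, agg_pop, on="aggregate_region", how="left", validate="m:1"
    )
    # Numerator and denominator now describe the same geography
    totals["cases_per100k"] = totals["cases"] / totals["Population"] * 100_000
    return totals


aggregated = aggregate_cases(cases, crosswalk, agg_pop)
print(aggregated)
```

With the cases summed before dividing, the rate for the aggregate uses a numerator and a denominator that cover the same set of communities.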

In [1]:
import pandas as pd

from datetime import datetime, date

S3_FILE_PATH = "s3://public-health-dashboard/jhu_covid19/"

In [48]:
POP_JSON = "https://raw.githubusercontent.com/ANRGUSC/lacounty_covid19_data/master/data/population.json"
pd.read_json(POP_JSON, orient="index")

Unnamed: 0,0
Exposition Park--(Region:107),45190
Altadena--(Region:13),53439
Canoga Park--(Region:162),63694
Castaic--(Region:2),28325
Park La Brea--(Region:100),27005
...,...
Maywood--(Region:51),41472
Culver City--(Region:77),56391
Westwood--(Region:153),54421
Bell Gardens--(Region:56),67342
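
The index labels in this JSON combine a name and a region number, so they won't match `Region` in the historical df as they are. A minimal sketch of splitting them, assuming every label follows the `Name--(Region:N)` pattern seen above; `split_region_labels` and `region_id` are hypothetical names, and the three rows are copied from the output.

```python
import pandas as pd

# Three labels copied from the output above, held as a Series for simplicity
raw = pd.Series(
    {
        "Exposition Park--(Region:107)": 45190,
        "Altadena--(Region:13)": 53439,
        "Castaic--(Region:2)": 28325,
    },
    name="Population",
)


def split_region_labels(pop):
    """Split 'Name--(Region:N)' labels into Region and region_id columns."""
    labels = pop.index.to_series()
    parts = labels.str.extract(
        r"^(?P<Region>.+?)--\(Region:(?P<region_id>\d+)\)$"
    )
    out = parts.assign(Population=pop.to_numpy()).reset_index(drop=True)
    # astype(int) raises if any label failed to match the assumed pattern
    out["region_id"] = out["region_id"].astype(int)
    return out


pop_clean = split_region_labels(raw)
print(pop_clean)
```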


In [4]:
historical = pd.read_parquet(f"{S3_FILE_PATH}lacounty-neighborhood-time-series.parquet")

In [33]:
POP_URL = (
    "https://raw.githubusercontent.com/ANRGUSC/"
    "lacounty_covid19_data/master/data/processed_population.csv"
)

pop = pd.read_csv(POP_URL)

In [42]:
m1 = pd.merge(names, pop, on = "Region", how ="left", validate = "1:1")

In [45]:
m1[m1.Population.isna()].Region.value_counts()

Faircrest Heights    1
Vermont Square       1
Cheviot Hills        1
Placerita Canyon     1
Little Tokyo         1
                    ..
Lennox               1
Rolling Hills        1
Whittier Narrows     1
Padua Hills          1
East Pasadena        1
Name: Region, Length: 167, dtype: int64

In [32]:
historical[historical.Region.str.contains("Florence")]

Unnamed: 0,Region,Latitude,Longitude,date,date2,cases,Population,new_cases,cases_per100k,cases_avg7,new_cases_avg7,cases_per100k_avg7,cases_p25,cases_p50,cases_p75,ncases_p25,ncases_p50,ncases_p75,rank,max_rank
9104,Florence,33.974159,-118.243286,2020-03-23,2020-03-23,1,,,,85.428571,,,3.000000,11.428571,75.000000,3.509203,12.318305,201.238688,,104
9105,Florence,33.974159,-118.243286,2020-03-24,2020-03-24,1,,0.0,,72.428571,,,3.000000,10.428571,71.000000,4.220627,16.508762,260.525444,,116
9106,Florence,33.974159,-118.243286,2020-03-25,2020-03-25,4,,3.0,,59.714286,,,2.500000,5.857143,35.571429,3.830987,9.094055,93.128122,,121
9107,Florence,33.974159,-118.243286,2020-03-26,2020-03-26,7,,3.0,,46.571429,,,2.571429,6.142857,31.857143,4.462887,9.892583,41.239935,,134
9108,Florence,33.974159,-118.243286,2020-03-28,2020-03-28,6,,-1.0,,32.714286,,,4.250000,23.142857,165.607143,5.684826,11.477478,42.546715,,134
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9200,Florence,33.974159,-118.243286,2020-06-28,2020-06-28,1271,,32.0,,1187.285714,29.142857,,23.642857,110.571429,334.000000,378.562442,604.788839,966.986383,,155
9201,Florence,33.974159,-118.243286,2020-06-29,2020-06-29,1325,,54.0,,1218.571429,31.285714,,24.285714,113.285714,340.642857,386.413165,618.805758,994.934580,,155
9202,Florence,33.974159,-118.243286,2020-06-30,2020-06-30,1375,,50.0,,1253.000000,34.428571,,24.785714,117.428571,347.642857,392.256372,633.676852,1023.074889,,155
9203,Florence,33.974159,-118.243286,2020-07-01,2020-07-01,1394,,19.0,,1287.000000,34.000000,,25.571429,120.428571,355.571429,401.898922,649.313171,1046.797227,,155


## Check that missing days have all the same neighborhood names
If not, go back and clean in Stata
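
One way to sketch this check is an outer merge with `indicator=True`, which reports names found on only one side in a single table. The two small frames below are stand-ins for the real name lists, and `compare_names` is a hypothetical helper.

```python
import pandas as pd

# Stand-ins for the unique Region names on each side (illustrative values)
historical_names = pd.DataFrame(
    {"Region": ["Athens Village", "Florence", "La Crescenta"]}
)
missing_day_names = pd.DataFrame(
    {"Region": ["Athens Village", "Athens-Westmont", "Florence-Firestone"]}
)


def compare_names(left, right):
    """Return the Region names that appear on only one side of the merge."""
    merged = pd.merge(
        left.drop_duplicates(),
        right.drop_duplicates(),
        on="Region",
        how="outer",
        validate="1:1",
        indicator=True,  # adds a _merge column: left_only, right_only, both
    )
    only_one_side = merged[merged["_merge"] != "both"]
    return only_one_side.sort_values("Region").reset_index(drop=True)


mismatches = compare_names(missing_day_names, historical_names)
print(mismatches)
```

Rows marked `left_only` are names in the missing-day file that the historical df lacks, and `right_only` rows are the reverse.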

In [11]:
names = historical[["Region"]].drop_duplicates()
names['in'] = 1

In [12]:
df = pd.read_excel('../data/lacounty_cleaned.xlsx')
names2 = df[['Region']].drop_duplicates()

In [13]:
test = pd.merge(names2, names, how = 'left', on = "Region", validate = "1:1")

In [30]:
fix_me = ["Adams", "Cadillac", "Florence", "Mid-", "Pico", "Temple", "Athens", "La Crescenta"]
names[names.Region.str.contains("Crescenta")]

Unnamed: 0,Region,in
12544,La Crescenta,1


In [31]:
test[test['in'].isna()]

fix_me = {
    "Adams-Normandie": "Adams",
    "Cadillac-Corning": "Cadillac",
    "Florence-Firestone": "Florence",
    "Mid-city": "Mid", 
    "Pico-Union": "Pico",
    "Athens-Westmont": "Athens",
    "La Crescenta-Montrose": "La Crescenta"
}


# Probably want to rename Mid to Mid-City
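
A minimal sketch of applying a mapping like `fix_me` in pandas, assuming the goal is to rename the long names to the short ones used in the historical df. The `sample` frame and its case counts are made up.

```python
import pandas as pd

# Mapping copied from the cell above: name in the cleaned file -> historical name
fix_me = {
    "Adams-Normandie": "Adams",
    "Cadillac-Corning": "Cadillac",
    "Florence-Firestone": "Florence",
    "Mid-city": "Mid",
    "Pico-Union": "Pico",
    "Athens-Westmont": "Athens",
    "La Crescenta-Montrose": "La Crescenta",
}

# Made-up rows standing in for the cleaned file
sample = pd.DataFrame({
    "Region": ["Florence-Firestone", "Athens-Westmont", "Athens Village", "Mid-city"],
    "cases": [10, 5, 2, 7],
})

# Series.replace with a dict swaps exact matches only, so "Athens Village"
# is left alone even though it also contains "Athens"
sample["Region"] = sample["Region"].replace(fix_me)
print(sample)
```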

In [29]:
test[test.Region.str.contains("Athens")]

Unnamed: 0,Region,in
228,Athens-Westmont,
229,Athens Village,1.0


In [None]:
historical.head(2)

In [None]:
historical.dtypes

In [None]:
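def grab_missing_days():
    """Load the cleaned missing-day rows and build the date columns."""
    df = pd.read_excel('../data/lacounty_cleaned.xlsx')
    
    # All rows are assigned to 2020
    df["year"] = 2020
    
    # date2 is a datetime64 column; date holds the same values as Python dates
    date2 = pd.to_datetime(df[["year", "month", "day"]])
    
    df = (df.assign(
        date = date2.dt.date,
        date2 = date2,
        ).drop(columns = ["year", "month", "day"])
    )
    
    return df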

In [None]:
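def standardize(df):
    """Attach Latitude, Longitude, and Population from the historical parquet."""
    old = pd.read_parquet(f"{S3_FILE_PATH}lacounty-neighborhood-time-series.parquet")
    
    # Keep the distinct attribute rows, skipping any with missing values
    old = old[(old.Latitude.notna()) & 
              (old.Population.notna()) & 
              (old.Region.notna())
              ][["Region", "Latitude", "Longitude", "Population"]].drop_duplicates()
    
    # validate="m:1" raises if a Region still has more than one attribute row
    df = pd.merge(df, old, on = "Region", how = "left", validate = "m:1")
    
    return df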

In [None]:
df2 = grab_missing_days()

df3 = standardize(df2)

In [None]:
df3[df3.Region.isna()]

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([historical, df3], sort=False)

df = df.sort_values(["Region", "date"]).reset_index(drop=True)

In [None]:
df

In [None]:
import la_neighborhood

In [None]:
sort_cols = 
group_cols = 
final = derive_columns(df, sort_cols, group_cols)

Need to make sure neighborhoods from the missing days are named the same way as in the historical df.

Need to make sure neighborhoods pulled from the ESRI layer are named the same way as in the historical df.

# Clean up parquets