# Introduction
This notebook normalizes the age-group shares. In the preprocessed AgeRange.csv files in the [Age](../../dataset_additional/Age/) folder, each age group's distribution has been min-max scaled again, one group at a time, before being written to CSV files. As a result, the age-group values for a single county sum to more than 1. This can be confusing, because these shares should sum to 1.00 for each county. This notebook therefore rescales each county's values so that they sum to 1 and writes the combined file [Age Groups.csv](../../dataset_raw/CovidMay17-2022/Age%20Groups.csv).
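
To make the problem concrete, here is a minimal sketch on made-up data. The three counties and the column names `young`, `adult`, and `senior` are hypothetical, not the real age groups. It shows that scaling each column independently breaks the per-county sum, and that dividing each row by its sum restores it.

```python
import pandas as pd

# Hypothetical toy data: three counties whose age-group shares each sum to 1.
shares = pd.DataFrame(
    {
        "young": [0.30, 0.25, 0.40],
        "adult": [0.50, 0.45, 0.35],
        "senior": [0.20, 0.30, 0.25],
    }
)

# Column-wise min-max scaling: each column is mapped to [0, 1] on its own.
scaled = (shares - shares.min()) / (shares.max() - shares.min())

# After independent scaling, the rows no longer sum to 1.
print(scaled.sum(axis=1).round(3).tolist())

# L1 normalization: divide each row by its row sum so the row sums to 1 again.
normalized = scaled.div(scaled.sum(axis=1), axis=0)
print(normalized.sum(axis=1).round(3).tolist())
```

Note that the normalized values are not the original shares: the first toy row becomes 0.25, 0.75, 0.0 rather than 0.30, 0.50, 0.20. The rescaling only guarantees that each row sums to 1.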

# List files

In [1]:
import pandas as pd
import os
root_dir = '../../dataset_additional/Age/CovidOct25-2022-Age-20200401/'
# os.listdir returns entries in arbitrary order; sort for a deterministic column order
filenames = sorted(os.listdir(root_dir))
print(filenames)

['AGE019.csv', 'AGE2029.csv', 'AGE3039.csv', 'AGE4049.csv', 'AGE5064.csv', 'AGE6579.csv', 'AGE80PLUS.csv']


# Combine

In [2]:
results = {}
age_groups = [filename.split(".")[0] for filename in filenames]

for index, filename in enumerate(filenames):
    df = pd.read_csv(os.path.join(root_dir, filename), usecols=['FIPS', '2020-02-28'])

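    # The age shares are static, so every date column holds the same value for a county;
    # reading a single date column is enough.
    # Rows are combined by position, which assumes every file lists counties in the same order.
    results[age_groups[index]] = df['2020-02-28']
    if index == 0:
        results['FIPS'] = df['FIPS']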

df = pd.DataFrame(results)
df.describe().T[['mean', 'std', 'min', 'max']]

| | mean | std | min | max |
|---|---|---|---|---|
| AGE019 | 0.537431 | 0.080973 | 0.0 | 1.0 |
| FIPS | 30383.649268 | 15162.508374 | 1001.0 | 56045.0 |
| AGE2029 | 0.323373 | 0.080992 | 0.0 | 1.0 |
| AGE3039 | 0.376278 | 0.109956 | 0.0 | 1.0 |
| AGE4049 | 0.345274 | 0.078599 | 0.0 | 1.0 |
| AGE5064 | 0.578315 | 0.099217 | 0.0 | 1.0 |
| AGE6579 | 0.274364 | 0.084424 | 0.0 | 1.0 |
| AGE80PLUS | 0.190533 | 0.060821 | 0.0 | 1.0 |
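
The combine step above lines the files up by row position. If the per-group files could ever list counties in a different order, joining on `FIPS` would be a safer alternative. The following is only a sketch of that idea, using small in-memory frames as hypothetical stand-ins for two of the CSV files.

```python
from functools import reduce

import pandas as pd

# Hypothetical stand-ins for two per-group files; the second has a different row order.
frames = {
    "AGE019": pd.DataFrame({"FIPS": [1001, 1003, 1005], "2020-02-28": [0.5, 0.6, 0.4]}),
    "AGE2029": pd.DataFrame({"FIPS": [1005, 1001, 1003], "2020-02-28": [0.2, 0.3, 0.1]}),
}

# Rename the date column to the group name so the columns do not collide.
renamed = [frame.rename(columns={"2020-02-28": name}) for name, frame in frames.items()]

# Inner-join all frames on FIPS; each value follows its county regardless of row order.
combined = reduce(lambda left, right: left.merge(right, on="FIPS", how="inner"), renamed)
print(combined)
```

An inner join also drops any county that is missing from one of the files, which may or may not be the desired behavior for this dataset.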


In [3]:
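from sklearn.preprocessing import Normalizer

# L1 normalization divides each row by the sum of its absolute values,
# so the age-group values of each county sum to 1.
df[age_groups] = Normalizer(norm='l1').fit_transform(df[age_groups])
# Put FIPS first, followed by the age-group columns.
df = df[['FIPS'] + age_groups]
df.round(6).to_csv('../../dataset_raw/CovidMay17-2022/Age Groups old.csv', index=False)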
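
A quick sanity check after normalization is to confirm that every county's age groups sum to 1. The sketch below uses a hypothetical helper, `rows_sum_to_one`, and a made-up frame; the tolerance of `1e-5` is an assumption chosen to absorb the rounding to six decimals.

```python
import numpy as np
import pandas as pd


def rows_sum_to_one(frame, columns, tol=1e-5):
    """Return a boolean Series: True where the given columns sum to 1 within tol."""
    row_sums = frame[columns].sum(axis=1)
    return pd.Series(np.isclose(row_sums, 1.0, atol=tol), index=frame.index)


# Hypothetical normalized output; the last row mimics a county whose values are all zero.
toy = pd.DataFrame(
    {
        "FIPS": [1001, 1003, 1005],
        "AGE019": [0.6, 0.25, 0.0],
        "AGE2029": [0.4, 0.75, 0.0],
    }
)

ok = rows_sum_to_one(toy, ["AGE019", "AGE2029"])

# List the counties that fail the check.
print(toy.loc[~ok, "FIPS"].tolist())  # prints [1005]
```

A row whose values are all zero cannot be rescaled to sum to 1, so a check like this would surface any such county before the file is used downstream.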