# Introduction
This notebook normalizes the industry groups. In the preprocessed industry occupation CSV files in the [Industry](../../dataset_additional/Industry/) folder, each group's distribution was min-max scaled again before being written to CSV. As a result, the values across all groups can sum to more than 1 for a single county, which is confusing because these shares should sum to 1.00 per county. This notebook corrects that by rescaling each county's values so they sum to 1, and writes the combined file [Industry Groups.csv](../../dataset_raw/CovidMay17-2022/Industry%20Groups.csv).
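
To illustrate why independently min-max scaled columns stop summing to 1 per county, and how L1 row normalization restores that property, here is a minimal pure-Python sketch. The three counties, the three groups, and the helper names `min_max_scale` and `l1_normalize_rows` are made up for illustration and are not taken from the dataset.

```python
# Toy shares for three hypothetical counties across three hypothetical groups.
# Each row sums to 1 before any scaling.
shares = [
    [0.2, 0.3, 0.5],
    [0.4, 0.4, 0.2],
    [0.1, 0.7, 0.2],
]

def min_max_scale(column):
    """Scale one column to [0, 1] independently of the others.

    Assumes the column is not constant (max > min).
    """
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def l1_normalize_rows(rows):
    """Divide each row by the sum of its absolute values; all-zero rows stay zero."""
    out = []
    for row in rows:
        total = sum(abs(v) for v in row)
        out.append([v / total if total else 0.0 for v in row])
    return out

# Scale column by column, then transpose back to one row per county.
scaled_columns = [min_max_scale(list(col)) for col in zip(*shares)]
scaled_rows = [list(row) for row in zip(*scaled_columns)]

# After per-column scaling, the row sums are no longer all equal to 1.
print([round(sum(row), 6) for row in scaled_rows])
# After L1 normalization, every row sums to 1 again.
print([round(sum(row), 6) for row in l1_normalize_rows(scaled_rows)])
```

Note that the normalized values are not the original shares; normalization only restores the property that each county's values sum to 1.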

# List files

In [2]:
import pandas as pd
import os
root_dir = '../../dataset_additional/Industry/CovidOct25-2022-Industry/'
filenames = os.listdir(root_dir)
print(filenames)

['Accommodation and Food Services.csv', 'Administrative and Support and Waste Management and Remediation Services.csv', 'Agriculture, Forestry, Fishing and Hunting.csv', 'Arts, Entertainment, and Recreation.csv', 'Construction.csv', 'Educational Services.csv', 'Finance and Insurance.csv', 'Health Care and Social Assistance.csv', 'Information.csv', 'Management of Companies and Enterprises.csv', 'Manufacturing.csv', 'Mining, Quarrying, and Oil and Gas Extraction.csv', 'Other Services (except Public Administration).csv', 'Professional, Scientific, and Technical Services.csv', 'Retail Trade.csv', 'Transportation and Warehousing.csv', 'Utilities.csv', 'Wholesale Trade.csv']
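
`os.listdir` returns entries in arbitrary order, and a directory may contain files other than the expected CSVs. When group names are derived from filenames, a slightly more defensive approach is sketched below. The helper name `group_names` is hypothetical, and the sample list is a short excerpt of the listing above.

```python
import os

def group_names(filenames):
    """Return industry group names from CSV filenames, in sorted order."""
    csv_files = sorted(f for f in filenames if f.lower().endswith('.csv'))
    # splitext strips only the final extension, so a period inside a name survives
    return [os.path.splitext(f)[0] for f in csv_files]

sample = [
    'Utilities.csv',
    'Retail Trade.csv',
    'Other Services (except Public Administration).csv',
]
print(group_names(sample))
```

Sorting makes the column order of the combined file reproducible across machines and file systems.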


# Combine

In [3]:
results = {}
industry_groups = [filename.split(".")[0] for filename in filenames]

for index, filename in enumerate(filenames):
    df = pd.read_csv(os.path.join(root_dir, filename), usecols=['FIPS', '2020-02-28'])

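    # The industry feature is static, so every date column holds the same value
    # for a county; reading a single date column is enough.
    results[industry_groups[index]] = df['2020-02-28']
    if index == 0:
        # FIPS is taken from the first file only; this assumes every file lists
        # counties in the same row order.
        results['FIPS'] = df['FIPS']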

df = pd.DataFrame(results)
df.describe().T[['mean', 'std', 'min', 'max']]

| | mean | std | min | max |
|---|---|---|---|---|
| Accommodation and Food Services | 0.118235 | 0.073311 | 0.0 | 1.0 |
| FIPS | 30383.649268 | 15162.508374 | 1001.0 | 56045.0 |
| Administrative and Support and Waste Management and Remediation Services | 0.039657 | 0.051361 | 0.0 | 1.0 |
| Agriculture, Forestry, Fishing and Hunting | 0.012396 | 0.044188 | 0.0 | 1.0 |
| Arts, Entertainment, and Recreation | 0.019038 | 0.037505 | 0.0 | 1.0 |
| Construction | 0.062849 | 0.05438 | 0.0 | 1.0 |
| Educational Services | 0.029785 | 0.064419 | 0.0 | 1.0 |
| Finance and Insurance | 0.07472 | 0.055008 | 0.0 | 1.0 |
| Health Care and Social Assistance | 0.181913 | 0.090542 | 0.0 | 1.0 |
| Information | 0.040185 | 0.049887 | 0.0 | 1.0 |


In [6]:
from sklearn.preprocessing import Normalizer

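# L1-normalize each row so that a county's industry group values sum to 1
# (rows that are all zero are left as zeros)
df[industry_groups] = Normalizer(norm='l1').fit_transform(df[industry_groups])
# put FIPS first, then write the combined file
df = df[['FIPS'] + industry_groups]
df.round(6).to_csv('../../dataset_raw/CovidMay17-2022/Industry Groups.csv', index=False)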

In [5]:
df.describe().T[['mean', 'std', 'min', 'max']]

| | mean | std | min | max |
|---|---|---|---|---|
| FIPS | 30383.649268 | 15162.508374 | 1001.0 | 56045.0 |
| Accommodation and Food Services | 0.10175 | 0.067439 | 0.0 | 1.0 |
| Administrative and Support and Waste Management and Remediation Services | 0.033638 | 0.044198 | 0.0 | 0.910305 |
| Agriculture, Forestry, Fishing and Hunting | 0.010313 | 0.033247 | 0.0 | 0.678271 |
| Arts, Entertainment, and Recreation | 0.016055 | 0.031023 | 0.0 | 0.772795 |
| Construction | 0.054091 | 0.049938 | 0.0 | 1.0 |
| Educational Services | 0.023926 | 0.047663 | 0.0 | 0.618201 |
| Finance and Insurance | 0.062769 | 0.042355 | 0.0 | 0.666667 |
| Health Care and Social Assistance | 0.156155 | 0.081821 | 0.0 | 1.0 |
| Information | 0.032991 | 0.036194 | 0.0 | 0.554281 |
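
One way to confirm that exported shares behave as intended is to check each county's row sum. The sketch below uses only the standard library and an in-memory sample. The helper `rows_not_summing_to_one`, the tolerance, and the sample values are assumptions for illustration. Because the exported values are rounded to six decimals, sums are compared with a small tolerance, and all-zero rows are accepted since L1 normalization leaves them as zeros.

```python
import csv
import io

def rows_not_summing_to_one(csv_file, id_column='FIPS', tol=1e-4):
    """Return ids of rows whose non-id columns sum to neither 1 nor 0 (within tol)."""
    bad = []
    for row in csv.DictReader(csv_file):
        total = sum(float(v) for k, v in row.items() if k != id_column)
        if abs(total - 1.0) > tol and abs(total) > tol:
            bad.append(row[id_column])
    return bad

# Hypothetical sample laid out like the exported file: FIPS first, then groups.
sample = io.StringIO(
    "FIPS,Construction,Information,Utilities\n"
    "1001,0.25,0.5,0.25\n"   # sums to 1
    "1003,0.0,0.0,0.0\n"     # all-zero row, accepted
    "1005,0.9,0.3,0.2\n"     # sums to 1.4, flagged
)
print(rows_not_summing_to_one(sample))  # ['1005']
```

To run the same check on the exported file, pass a handle opened with `open(path, newline='')` instead of the in-memory sample.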
