# Introduction
This notebook converts the cleaned raw dataset into a single merged file that the TFTModel can work on. A script version is available at [prepare_data.py](../script/prepare_data.py).

If you need to change the input feature set, add that information only in the `"data"` section of the JSON configuration file. This notebook will update the rest (at least the feature column mappings and locations). If you have a pivoted dynamic feature whose date columns need to be melted, make sure to keep the feature name as a `string` in `"dynamic_features_map"`. If the file is already melted and has a `Date` column, either `list` or `string` format is fine.
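
As a rough illustration of what melting a pivoted dynamic feature involves, here is a minimal sketch. The frame, its values, and the feature name are made up for illustration, and the actual `DataMerger` code may do this differently. A single value column comes out of the melt, which is presumably why the feature name has to be one `string` in that case.

```python
import pandas as pd

# Hypothetical pivoted dynamic feature: one row per county, one column per date.
pivoted = pd.DataFrame({
    "FIPS": [1001, 1003],
    "2020-02-28": [0.0, 1.0],
    "2020-02-29": [2.0, 3.0],
})

# Melt the date columns into long format: one row per (county, date) pair.
# The value column takes the single feature name (assumed here to be "DiseaseSpread").
melted = pivoted.melt(id_vars=["FIPS"], var_name="Date", value_name="DiseaseSpread")
melted["Date"] = pd.to_datetime(melted["Date"])
melted = melted.sort_values(["FIPS", "Date"]).reset_index(drop=True)

print(melted)
```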

In the final output all null values are replaced with 0. If you don't want that, comment that out.
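
The null replacement can be pictured with the sketch below. It assumes the step is a plain `fillna(0)` on the merged frame, which is a guess about the implementation; the toy frame is made up. Gaps like this arise naturally, for example because the vaccination data starts later than the other dynamic features.

```python
import numpy as np
import pandas as pd

# Hypothetical merged rows: the first date has no vaccination record yet.
df = pd.DataFrame({
    "FIPS": [1001, 1001],
    "VaccinationFull": [np.nan, 0.3],
})

# Replace every null with 0 (assumed equivalent of the default behavior).
filled = df.fillna(0)

print(filled)
```

Skipping the `fillna` call would leave the `NaN` values in place for a later imputation step.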

# Import libraries

In [2]:
import argparse
import json
import os
import sys
import pandas as pd
sys.path.append('..')  # make the Class package importable

# Setup storage

You need the `CovidMay17-2022` and `Support files` folders for the dataset, and the `v0` folder for the code. Upload them to the location you run the code from. My folder structure looks like this:
* dataset_raw
    * CovidMay17-2022
    * Support files
* v0

## Google Drive
Mounting Google Drive is not needed, since you can run this on CPU. If you do run on Colab, set `running_on_colab = True` and update the `cd` path so that it points to the notebook folder in your drive.

In [4]:
running_on_colab = False

if running_on_colab:
    from google.colab import drive
    drive.mount('/content/drive')

    %cd /content/drive/My Drive/Projects/Covid/v0/notebooks

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/Projects/Covid/v0/notebooks


## Input
If running on Colab, modify the paths below accordingly. Note that the config file used here is different from the `config.json` in the TF2 folder, which is for the old dataset.

In [5]:
from dataclasses import dataclass
from Class.DataMerger import *

@dataclass
class args:
    # folder where the cleaned feature files are located
    # dataPath = '../../dataset_raw/CovidDecember12-2021'
    dataPath = '../../dataset_raw/CovidMay17-2022'
    supportPath = '../../dataset_raw/Support files'
    outputPath = '../2022_May/'
    configPath = '../config_2022_May.json'
    cachePath = None # '../2022_May/Total.csv'

In [6]:
# create output path if it doesn't exist
if not os.path.exists(args.outputPath):
    print(f'Creating output directory {args.outputPath}')
    os.makedirs(args.outputPath, exist_ok=True)

# load config file (the with block closes the file automatically)
with open(args.configPath) as inputFile:
    config = json.load(inputFile)
    print(f'Config file loaded from {args.configPath}')

Config file loaded from ../config_2022_May.json


# Data merger

## Total features

In [7]:
# get merger class
dataMerger = DataMerger(config, args.dataPath, args.supportPath)

# if you have already created the total df once and now just want to
# reuse it to create a different population or rurality cut
if args.cachePath:
    total_df = pd.read_csv(args.cachePath)
else:
    total_df = dataMerger.get_all_features()

Unique counties present 3142
Merging feature Age Distribution.csv with length 3142
Merging feature Air Pollution.csv with length 3142
Merging feature Health Disparities.csv with length 3142

Merged static features have 3142 counties
   FIPS  AgeDist  AirPollution  HealthDisp
0  1001   0.5017        0.5210      0.2606
1  1003   0.6095        0.4371      0.2039
2  1005   0.5797        0.5090      0.6562
3  1007   0.5427        0.4910      0.5320
4  1009   0.5755        0.5210      0.4462
Reading Disease Spread.csv
Min date 2020-02-28 00:00:00, max date 2022-05-17 00:00:00
Length 2545020.

Reading Transmissible Cases.csv
Min date 2020-02-28 00:00:00, max date 2022-05-17 00:00:00
Length 2545020.

Reading Vaccination.csv
Min date 2020-12-13 00:00:00, max date 2022-05-17 00:00:00
Length 1679704.

Reading Social Distancing.csv
Min date 2020-02-28 00:00:00, max date 2022-05-17 00:00:00
Length 2545020.

Total dynamic feature shape (2587742, 7)
   FIPS              Name       Date  DiseaseSpread
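
The log above shows static features keyed by `FIPS` and dynamic features keyed by `FIPS` and `Date`. A minimal sketch of how such tables can be combined is given below. It assumes a left join of the dynamic rows onto the static table; the real `get_all_features` may use a different join strategy, and the dynamic values are made up.

```python
import pandas as pd

# Static features: one row per county (values taken from the log above).
static_df = pd.DataFrame({"FIPS": [1001, 1003], "AgeDist": [0.5017, 0.6095]})

# Hypothetical dynamic feature: one row per county and date.
dynamic_df = pd.DataFrame({
    "FIPS": [1001, 1001, 1003],
    "Date": pd.to_datetime(["2020-02-28", "2020-02-29", "2020-02-28"]),
    "DiseaseSpread": [0.0, 1.0, 0.0],
})

# The static value is repeated on every date of the same county.
total = dynamic_df.merge(static_df, on="FIPS", how="left")

print(total)
```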

In [8]:
output_path_total = os.path.join(args.outputPath, 'Total.csv') 
print(f'Writing total data to {output_path_total}\n')
total_df.round(4).to_csv(output_path_total, index=False)

Writing total data to ../2022_May/Total.csv



In [9]:
print('Updating config file')
dataMerger.update_config(args.configPath)

Updating config file
static locs: [0, 1, 2]
future locs: [7, 8, 9, 10, 11, 12, 13, 14]
target loc: 15. total input 16
col_mappings: Static ['AgeDist', 'AirPollution', 'HealthDisp']
col_mappings: Future ['LinearSpace', 'Constant', 'LinearTime', 'P2Time', 'P3Time', 'P4Time', 'CosWeekly', 'SinWeekly']
col_mappings: Known Regular  ['AgeDist', 'AirPollution', 'HealthDisp', 'DiseaseSpread', 'Transmission', 'VaccinationFull', 'SocialDist']
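
The locations in the log are consistent with a column order of static, then dynamic, then future features, followed by the target. The sketch below reproduces those numbers under that assumption. The ordering is inferred from the log, not taken from `update_config`, and the target name `"Target"` is a placeholder.

```python
# Feature names as printed in the log above.
static = ["AgeDist", "AirPollution", "HealthDisp"]
dynamic = ["DiseaseSpread", "Transmission", "VaccinationFull", "SocialDist"]
future = ["LinearSpace", "Constant", "LinearTime", "P2Time",
          "P3Time", "P4Time", "CosWeekly", "SinWeekly"]
target = "Target"  # placeholder name, the real target column is not shown

# Assumed input column order: static, dynamic, future, target.
columns = static + dynamic + future + [target]

static_locs = [columns.index(name) for name in static]
future_locs = [columns.index(name) for name in future]
target_loc = columns.index(target)

print("static locs:", static_locs)
print("future locs:", future_locs)
print(f"target loc: {target_loc}. total input {len(columns)}")
```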


## Rurality cut

In [10]:
# can be used as cache to perform different rurality or population cuts
# total_df = pd.read_csv(output_path_total)

# you can define "Rurality cut" in 'data'->'support'
# "Rurality cut" has to be set to true; also set the lower and upper limits in RuralityRange and/or MADRange
# a -1 in either of these two ranges means that key is ignored
if dataMerger.need_rurality_cut():
    rurality_df = dataMerger.rurality_cut(total_df)

    output_path_rurality_cut = os.path.join(args.outputPath, 'Rurality_cut.csv')
    print(f'Writing rurality cut data to {output_path_rurality_cut}\n')
    rurality_df.round(4).to_csv(output_path_rurality_cut, index=False)

Lost number of locations from median cut 3039
Remaining number of locations from median cut 182
Lost Num Locations from MAD Cut 103
Remaining Num Locations from MAD Cut 79
##################################################
Final Location Count: 79
Rurality cut dataset shape (63990, 20)
Writing rurality cut data to ../2022_May/Rurality_cut.csv
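
The range logic described in the cell comments can be sketched as follows. This is only an illustration: the helper `range_mask`, the column names `Rurality` and `MAD`, the inclusive bounds, and all values are assumptions, not the actual `rurality_cut` implementation.

```python
import pandas as pd

def range_mask(values: pd.Series, bounds) -> pd.Series:
    """True where values lie in [lower, upper]; a -1 in bounds disables the filter."""
    lower, upper = bounds
    if lower == -1 or upper == -1:
        return pd.Series(True, index=values.index)
    return values.between(lower, upper)

# Hypothetical per-county support table.
support = pd.DataFrame({
    "FIPS": [1001, 1003, 1005, 1007],
    "Rurality": [2.0, 5.0, 6.5, 8.0],
    "MAD": [0.5, 1.5, 0.8, 3.0],
})
rurality_range = [4, 9]  # stands in for RuralityRange
mad_range = [0, 2]       # stands in for MADRange

# A county is kept only if it passes both range checks.
keep = range_mask(support["Rurality"], rurality_range) & range_mask(support["MAD"], mad_range)
kept_fips = support.loc[keep, "FIPS"].tolist()

print(kept_fips)
```

The merged frame can then be restricted with something like `total_df[total_df['FIPS'].isin(kept_fips)]`.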



## Population cut

In [11]:
# you can define 'Population cut' in 'data'->'support'
# this is the number of most populous counties to keep
if dataMerger.need_population_cut():
    top_df = dataMerger.population_cut(total_df)

    output_path_population_cut = os.path.join(args.outputPath, 'Population_cut.csv')
    print(f'Writing population cut data to {output_path_population_cut}\n')
    top_df.round(4).to_csv(output_path_population_cut, index=False)

Slicing based on top 100 counties by population
Writing population cut data to ../2022_May/Population_cut.csv
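
Keeping only the most populous counties can be sketched like this. The population table, its `Population` column, and the toy numbers are assumptions for illustration; the actual `population_cut` reads its own support file.

```python
import pandas as pd

# Hypothetical population table, one row per county.
population = pd.DataFrame({
    "FIPS": [1001, 1003, 1005],
    "Population": [100, 300, 200],
})

# Hypothetical merged data with several rows per county.
merged = pd.DataFrame({
    "FIPS": [1001, 1001, 1003, 1005],
    "Cases": [1, 2, 3, 4],
})

top_n = 2  # stands in for the 'Population cut' value in the config

# Pick the FIPS codes of the top_n counties by population, then filter.
top_fips = population.nlargest(top_n, "Population")["FIPS"]
top_rows = merged[merged["FIPS"].isin(top_fips)].reset_index(drop=True)

print(top_rows)
```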

