# Introduction
This notebook converts the cleaned raw dataset into a single merged file that the TFTModel can work on. It follows the [`prepareData_Main.py`](../TF2/TFTTF2_ModelDev/prepareData_main.py) and [`data_preparation.py`](../TF2/TFTTF2_ModelDev/data_preparation.py) scripts, which were written for the old dataset.

To change the input feature set, edit only the `"data"` section of `config.json`. This notebook updates the rest (at least the feature column mappings and locations). If a dynamic feature file is pivoted, with dates as columns that need to be melted, keep the feature name as a string in `"dynamic_features_map"`. If the file is already melted and has a `Date` column, either a list or a string is fine.
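
As a rough illustration of the paragraph above, the sketch below shows how a feature map value can be either a single name or a list, and how both forms flatten into one feature list. The dictionary is a hypothetical excerpt, not the real `config.json`; the list form for `Vaccination.csv` and the helper name `flatten_feature_map` are assumptions made for the example.

```python
# Hypothetical excerpt of the "data" section; values are illustrative only.
data_config = {
    "static_features_map": {
        "Age Distribution.csv": "AgeDist",
        "Air Pollution.csv": "AirPollution",
    },
    "dynamic_features_map": {
        # pivoted file (dates as columns): keep the name as a string
        "Disease Spread.csv": "DiseaseSpread",
        # assumed to be already melted: a list is also accepted
        "Vaccination.csv": ["VaccinationFull"],
    },
}

def flatten_feature_map(feature_map):
    """Collect feature names whether each value is a string or a list."""
    features = []
    for value in feature_map.values():
        if isinstance(value, list):
            features.extend(value)
        else:
            features.append(value)
    return features

static_features = flatten_feature_map(data_config["static_features_map"])
dynamic_features = flatten_feature_map(data_config["dynamic_features_map"])
print(static_features, dynamic_features)
```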

In the final output all null values are replaced with 0. To keep them, comment out the `total_df.fillna(0)` step.

### Differences from the old script

|Old script|This notebook|
|---|---|
|Can only keep one feature per static feature file.|Can keep as many features as needed per static feature file.|
|One feature per dynamic feature file. | Can handle one or multiple dynamic features per file. |
|Converts static features into dynamic (adds dates), then drops the dates again later.|No need to add dates to static features or convert them to dynamic.|
| Left joins features on `Date` and `FIPS`. This can produce different merged files depending on the order of the input feature files, so even the same feature files can yield very different results.| Outer joins features on `Date` and inner joins on `FIPS`, which removes the dependence on processing order that the `left` join had.| 
| Uses `Population.csv` file as a base for `FIPS`. | Same |
| Uses a random first csv file as a base for county `Name`s. Uses `Name` as id in config. | Uses `Population.csv` file as a base for `County` names. Uses `County` as id, since `Name` would be ambiguous.|
| Re-implements custom MinMaxScaler. | Uses a MinMaxScaler from sklearn library. |
| Scales down only the cluster chosen by Rurality (local scaling), so the score is more like the `RMSE` of the deviation ratio from the minimum value for that particular cluster. | Same, but this will probably change to global scaling later, since that is generally used in practice and makes scores across different clusters directly comparable. I am also in favor of not scaling the target feature, so that the `RMSE` reflects actual covid cases.|
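
The join-order row in the table can be checked on toy data. The sketch below uses two made-up feature tables with partially overlapping dates; it only illustrates the general pandas behavior and is not taken from either script.

```python
import pandas as pd

# Two toy feature tables whose dates only partially overlap.
a = pd.DataFrame({"FIPS": [1, 1], "Date": ["2020-01-01", "2020-01-02"], "A": [0.1, 0.2]})
b = pd.DataFrame({"FIPS": [1, 1], "Date": ["2020-01-02", "2020-01-03"], "B": [5, 6]})
keys = ["FIPS", "Date"]

def key_set(df):
    """Set of (FIPS, Date) pairs present in a merged frame."""
    return set(zip(df["FIPS"], df["Date"]))

# Left join: the surviving rows depend on which file comes first.
left_ab = a.merge(b, how="left", on=keys)
left_ba = b.merge(a, how="left", on=keys)

# Outer join: the union of dates is kept regardless of order.
outer_ab = a.merge(b, how="outer", on=keys)
outer_ba = b.merge(a, how="outer", on=keys)

print(key_set(left_ab) == key_set(left_ba))    # order matters for left joins
print(key_set(outer_ab) == key_set(outer_ba))  # order does not matter for outer joins
```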

# Import libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import os, json
import math

# Utils

In [3]:
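def valid_date(date):
    # True if pandas can parse the value (e.g. a column label) as a date
    try:
        pd.to_datetime(date)
        return True
    except Exception:
        # avoid a bare except, which would also swallow KeyboardInterrupt
        return False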

def read_feature_file(file_name):
    df = pd.read_csv(os.path.join(dataPath, file_name))
    # drop empty column names in the feature file
    df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
    return df

def convert_cumulative_to_daily(df):
    date_columns = [col for col in df.columns if valid_date(col)]
    df_advanced = df[date_columns].shift(periods=1, axis=1, fill_value=0)
    df[date_columns] -= df_advanced[date_columns]
    return df

def missing_percentage(df):
    return df.isnull().mean().round(4).mul(100).sort_values(ascending=False)

In [4]:
def add_embeddings(data):
    def LinearLocationEncoding(TotalLoc):
        linear = np.empty(TotalLoc, dtype=float)
        for i in range(0, TotalLoc):
            linear[i] = float(i) / float(TotalLoc)
        return linear

    def LinearTimeEncoding(Dateslisted):
        Firstdate = Dateslisted[0]
        numtofind = len(Dateslisted)
        dayrange = (Dateslisted[numtofind - 1] - Firstdate).days + 1
        linear = np.empty(numtofind, dtype=float)
        for i in range(0, numtofind):
            linear[i] = float((Dateslisted[i] - Firstdate).days) / float(dayrange)
        return linear

    def P2TimeEncoding(numtofind):
        P2 = np.empty(numtofind, dtype=float)
        for i in range(0, numtofind):
            x = -1 + 2.0 * i / (numtofind - 1)
            P2[i] = 0.5 * (3 * x * x - 1)
        return P2

    def P3TimeEncoding(numtofind):
        P3 = np.empty(numtofind, dtype=float)
        for i in range(0, numtofind):
            x = -1 + 2.0 * i / (numtofind - 1)
            P3[i] = 0.5 * (5 * x * x - 3) * x
        return P3

    def P4TimeEncoding(numtofind):
        P4 = np.empty(numtofind, dtype=float)
        for i in range(0, numtofind):
            x = -1 + 2.0 * i / (numtofind - 1)
            P4[i] = 0.125 * (35 * x * x * x * x - 30 * x * x + 3)
        return P4

    def WeeklyTimeEncoding(Dateslisted):
        numtofind = len(Dateslisted)
        costheta = np.empty(numtofind, dtype=float)
        sintheta = np.empty(numtofind, dtype=float)
        for i in range(0, numtofind):
            j = Dateslisted[i].date().weekday()
            theta = float(j) * 2.0 * math.pi / 7.0
            costheta[i] = math.cos(theta)
            sintheta[i] = math.sin(theta)
        return costheta, sintheta

    # Set up linear location encoding for all of the data
    LLE = LinearLocationEncoding(config["support"]["Nloc"])

    for idx, i in enumerate(data['FIPS'].unique()):
        data.loc[data['FIPS'] == i, 'LinearSpace'] = LLE[idx]

    # Set up constant encoding
    data['Constant'] = 0.5

    # Set up linear time encoding
    dates = pd.to_datetime(data['Date'].unique())

    LTE = LinearTimeEncoding(dates)
    P2E = P2TimeEncoding(len(dates))
    P3E = P3TimeEncoding(len(dates))
    P4E = P4TimeEncoding(len(dates))

    CosWeeklyTE, SinWeeklyTE = WeeklyTimeEncoding(dates)

    for idx, i in enumerate(dates):
        data.loc[data['Date'] == i, 'LinearTime'] = LTE[idx]
        data.loc[data['Date'] == i, 'P2Time'] = P2E[idx]
        data.loc[data['Date'] == i, 'P3Time'] = P3E[idx]
        data.loc[data['Date'] == i, 'P4Time'] = P4E[idx]
        data.loc[data['Date'] == i, 'CosWeekly'] = CosWeeklyTE[idx]
        data.loc[data['Date'] == i, 'SinWeekly'] = SinWeeklyTE[idx]

    return data
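
The `P2Time`, `P3Time` and `P4Time` formulas above coincide with the Legendre polynomials of degree 2 to 4 evaluated on an evenly spaced grid over [-1, 1]. The sketch below is one possible vectorized cross-check using NumPy's `Legendre.basis`; the helper name `legendre_time_encoding` is hypothetical and the notebook itself does not use it.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

def legendre_time_encoding(num_dates, degree):
    """Legendre polynomial of the given degree on an even grid over [-1, 1]."""
    x = np.linspace(-1.0, 1.0, num_dates)  # same grid as -1 + 2*i/(n-1)
    return Legendre.basis(degree)(x)

n = 5
x = np.linspace(-1.0, 1.0, n)
p2 = legendre_time_encoding(n, 2)
p3 = legendre_time_encoding(n, 3)
p4 = legendre_time_encoding(n, 4)

# Compare against the explicit formulas used in add_embeddings
print(np.allclose(p2, 0.5 * (3 * x**2 - 1)))
print(np.allclose(p3, 0.5 * (5 * x**2 - 3) * x))
print(np.allclose(p4, 0.125 * (35 * x**4 - 30 * x**2 + 3)))
```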

# Setup storage

## Input
If running on Colab, prepend the root path to both `dataPath` and `configPath`. Note that this `config.json` differs from the `config.json` in the `TF2` folder, which is for the old dataset.

In [33]:
# folder containing the cleaned feature files
# dataPath = '../../dataset_raw/CovidDecember12-2021'
dataPath = '../../dataset_raw/CovidMay17-2022'
support_path = '../../dataset_raw/Support files'
output_folder = '2022_May/'
# exist_ok=True already covers the case where the folder exists
os.makedirs(output_folder, exist_ok=True)

configPath = output_folder + 'config.json'
# the with block closes the file, no explicit close() needed
with open(configPath) as inputFile:
    config = json.load(inputFile)

config = config['TFTparams']['data']

## Output
Creates the following files
* `output_total`
  * path where the merged csv file will be dumped
  * contains all counties
  * can be reused to create further clusters

* `output_rurality_cut`
  * only contains counties selected by the current Rurality cut in config.json

* `output_top500`
  * only contains the 500 most populous counties

In [16]:
output_total = output_folder + 'Total.csv'
output_top500 = output_folder + 'Top_500.csv'
output_rurality_cut = output_folder + 'Rurality_cut.csv'

# Static features
## Static features mapping

In [7]:
# map between csv filename and feature columns extracted from that file
# each file must have FIPS column, no index
static_features_map = config['static_features_map']
static_features = []
for value in static_features_map.values():
    if type(value)==list:
        static_features.extend(value)
    else:
        static_features.append(value)

print(f'Static features {static_features}')

Static features ['AgeDist', 'AirPollution', 'HealthDisp']


## Read base static feature
All other static features will be merged on this. County names are also extracted from this base feature file.

In [9]:
# We'll use population file as the base and take the county names from it
# then merge other files to it

support_file = config['support']['Population']
population = pd.read_csv(os.path.join(support_path, f'{support_file}'))

id_columns = ['FIPS']
# population.rename({'COUNTY':'County'}, axis=1, inplace=True)
static_df = population[id_columns]

locs = static_df['FIPS'].nunique()
print(f'Unique counties present {locs}')

Unique counties present 3142


## Merge

In [10]:
for file_name in static_features_map.keys():
    if file_name == support_file: continue

    feature_df = read_feature_file(file_name)
    print(f'Merging feature {file_name} with length {feature_df.shape[0]}')

    has_date_columns = False
    for column in feature_df.columns:
        if valid_date(column):
            has_date_columns = True
            break

    # if static feature has date column, convert the first date column into feature of that name
    # this is for PVI data, and in that case static_features_map[file_name] is a single value
    if has_date_columns:
        feature_column = static_features_map[file_name]
        feature_df.rename({column: feature_column}, axis=1, inplace=True)
        feature_df = feature_df[['FIPS', feature_column]]
    else: 
        feature_columns = static_features_map[file_name]
        if type(feature_columns) == list:
            feature_df = feature_df[['FIPS'] + feature_columns]
        else:
            feature_df = feature_df[['FIPS', feature_columns]]

    static_df = static_df.merge(feature_df, how='inner', on='FIPS')

print(f"\nMerged static features have {static_df['FIPS'].nunique()} counties")
static_df.head()

Merging feature Age Distribution.csv with length 3142
Merging feature Air Pollution.csv with length 3142
Merging feature Health Disparities.csv with length 3142

Merged static features have 3142 counties


| |FIPS|AgeDist|AirPollution|HealthDisp|
|---|---|---|---|---|
|0|1001|0.5017|0.521|0.2606|
|1|1003|0.6095|0.4371|0.2039|
|2|1005|0.5797|0.509|0.6562|
|3|1007|0.5427|0.491|0.532|
|4|1009|0.5755|0.521|0.4462|


# Dynamic features
## Dynamic feature mapping

In [11]:
# notice: file names must include the .csv extension (read_feature_file does not append it)
# {feature_file_name: feature_name}
dynamic_features_map = config['dynamic_features_map']

dynamic_features = []
for value in dynamic_features_map.values():
    if type(value)==list:
        dynamic_features.extend(value)
    else:
        dynamic_features.append(value)
print(dynamic_features)

['DiseaseSpread', 'Transmission', 'VaccinationFull', 'SocialDist']


In [12]:
first_date = pd.to_datetime(config['support']['FirstDate'])
last_date = pd.to_datetime(config['support']['LastDate'])

In [13]:
dynamic_df = None
merge_keys = ['FIPS', 'Date']

for file_name in dynamic_features_map.keys():
    print(f'Reading {file_name}')
    df = read_feature_file(file_name)
    
    # check whether the Date column has been pivoted
    if 'Date' not in df.columns:
        # technically this should be set of common columns
        id_vars = [col for col in df.columns if not valid_date(col)]
        df = df.melt(
            id_vars= id_vars,
            var_name='Date', value_name=dynamic_features_map[file_name]
        ).reset_index(drop=True)

    # can be needed as some feature files may have different date format
    df['Date'] = pd.to_datetime(df['Date'])
    print(f'Min date {df["Date"].min()}, max date {df["Date"].max()}')
    df = df[(df['Date'] >= first_date) & (df['Date'] <= last_date)]

    print(f'Length {df.shape[0]}.')

    if dynamic_df is None: dynamic_df = df
    else:
        # if a single file has multiple features
        if type(dynamic_features_map[file_name]) == list:
            selected_columns = merge_keys + dynamic_features_map[file_name]
        else:
            selected_columns = merge_keys + [dynamic_features_map[file_name]]

        # using outer to keep the union of dates 
        # as vaccination dates are not available before late in 2020
        dynamic_df = dynamic_df.merge(df[selected_columns], how='outer',on=merge_keys)

        # however, we don't need to keep mismatch of FIPS
        dynamic_df = dynamic_df[~dynamic_df['FIPS'].isna()]

print(f'Total dynamic feature shape {dynamic_df.shape}')
dynamic_df.head()

Reading Disease Spread.csv
Min date 2020-02-28 00:00:00, max date 2022-05-17 00:00:00
Length 2545020.
Reading Transmissible Cases.csv
Min date 2020-02-28 00:00:00, max date 2022-05-17 00:00:00
Length 2545020.
Reading Vaccination.csv
Min date 2020-12-13 00:00:00, max date 2022-05-17 00:00:00
Length 1679704.
Reading Social Distancing.csv
Min date 2020-02-28 00:00:00, max date 2022-05-17 00:00:00
Length 2545020.
Total dynamic feature shape (2587742, 7)


| |FIPS|Name|Date|DiseaseSpread|Transmission|VaccinationFull|SocialDist|
|---|---|---|---|---|---|---|---|
|0|1001|Alabama, Autauga|2020-02-28|0.0|0.0|NaN|1.0|
|1|1003|Alabama, Baldwin|2020-02-28|0.0|0.0|NaN|1.0|
|2|1005|Alabama, Barbour|2020-02-28|0.0|0.0|NaN|0.825|
|3|1007|Alabama, Bibb|2020-02-28|0.0|0.0|NaN|1.0|
|4|1009|Alabama, Blount|2020-02-28|0.0|0.0|NaN|1.0|
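
For readers unfamiliar with the melt step used for pivoted files, here is a minimal sketch on a made-up two-county frame. The values are invented, and `id_vars` is given explicitly rather than detected with `valid_date` to keep the example short.

```python
import pandas as pd

# Toy pivoted file: one row per county, one column per date.
wide = pd.DataFrame({
    "FIPS": [1001, 1003],
    "2020-02-28": [0.0, 0.1],
    "2020-02-29": [0.2, 0.3],
})

# Melt the date columns into a single Date column, one row per (FIPS, Date).
long = wide.melt(id_vars=["FIPS"], var_name="Date", value_name="DiseaseSpread")
long["Date"] = pd.to_datetime(long["Date"])
long = long.sort_values(["FIPS", "Date"]).reset_index(drop=True)

print(long)
```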


# Target feature
Converts cumulative covid cases into daily cases and sets negative daily counts (caused by retroactive corrections) to 0. Only one target is handled for now.
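
A small worked example of this conversion on toy numbers may help. It follows the same shift-and-subtract idea as `convert_cumulative_to_daily`, but clips negatives right away on the wide frame, whereas the notebook clips after melting; the counts are invented.

```python
import pandas as pd

# Toy cumulative counts; FIPS 1001 has a downward revision on the last day.
cumulative = pd.DataFrame({
    "FIPS": [1001, 1003],
    "2020-03-01": [2, 0],
    "2020-03-02": [5, 4],
    "2020-03-03": [4, 4],
})
date_columns = [col for col in cumulative.columns if col != "FIPS"]

# Previous day's cumulative value; the first day is compared against 0.
previous = cumulative[date_columns].shift(periods=1, axis=1, fill_value=0)

daily = cumulative.copy()
# Daily = today - yesterday, with negative corrections clipped to 0.
daily[date_columns] = (cumulative[date_columns] - previous).clip(lower=0)

print(daily)
```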

In [25]:
# cases
target_column = list(config['targets'].keys())[0]

# read cumulative cases.csv
target_df = read_feature_file(config['targets'][target_column])

if 'Date' not in target_df.columns:
    target_df = convert_cumulative_to_daily(target_df)
    target_df.fillna(0, inplace=True)

target_df.head()

| |FIPS|2020-01-22|2020-01-23|2020-01-24|2020-01-25|2020-01-26|2020-01-27|2020-01-28|2020-01-29|2020-01-30|...|2022-05-16|2022-05-17|2022-05-18|2022-05-19|2022-05-20|2022-05-21|2022-05-22|2022-05-23|2022-05-24|2022-05-25|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|1001|0|0|0|0|0|0|0|0|0|...|7|1|2|12|6|0|0|13|0|12|
|1|1003|0|0|0|0|0|0|0|0|0|...|54|25|19|36|35|0|0|103|0|88|
|2|1005|0|0|0|0|0|0|0|0|0|...|7|0|3|0|1|0|0|2|0|0|
|3|1007|0|0|0|0|0|0|0|0|0|...|6|2|1|2|1|0|0|4|0|6|
|4|1009|0|0|0|0|0|0|0|0|0|...|10|4|6|6|8|0|0|6|0|7|


In [26]:
target_df = target_df.melt(
    id_vars= ['FIPS'],
    var_name='Date', value_name='Cases'
).reset_index(drop=True)
target_df = target_df.fillna(0)
target_df['Date'] = pd.to_datetime(target_df['Date'])

# some days have negative counts because earlier case totals were corrected downward
target_df.loc[target_df['Cases']<0, 'Cases'] = 0

target_df = target_df[(target_df['Date'] >= first_date) & (target_df['Date'] <= last_date)]

# Merge all together

In [27]:
# the join types should be inner for consistency
total_df = dynamic_df.merge(target_df, how='outer', on=['FIPS', 'Date'])
total_df = static_df.merge(total_df, how='inner', on='FIPS')
total_df = total_df.reset_index(drop=True)

print(total_df.shape)
total_df.head()

(2545020, 11)


| |FIPS|AgeDist|AirPollution|HealthDisp|Name|Date|DiseaseSpread|Transmission|VaccinationFull|SocialDist|Cases|
|---|---|---|---|---|---|---|---|---|---|---|---|
|0|1001|0.5017|0.521|0.2606|Alabama, Autauga|2020-02-28|0.0|0.0|NaN|1.0|0.0|
|1|1001|0.5017|0.521|0.2606|Alabama, Autauga|2020-02-29|0.0|0.0|NaN|1.0|0.0|
|2|1001|0.5017|0.521|0.2606|Alabama, Autauga|2020-03-01|0.0|0.0|NaN|1.0|0.0|
|3|1001|0.5017|0.521|0.2606|Alabama, Autauga|2020-03-02|0.0|0.0|NaN|1.0|0.0|
|4|1001|0.5017|0.521|0.2606|Alabama, Autauga|2020-03-03|0.0|0.0|NaN|1.0|0.0|


In [28]:
missing_percentage(total_df)

VaccinationFull    35.68
FIPS                0.00
AgeDist             0.00
AirPollution        0.00
HealthDisp          0.00
Name                0.00
Date                0.00
DiseaseSpread       0.00
Transmission        0.00
SocialDist          0.00
Cases               0.00
dtype: float64

In [29]:
total_df = total_df.fillna(0)

## Add embeddings

In [30]:
total_df['TimeFromStart'] = (total_df['Date'] - total_df['Date'].min()).dt.days

pre_columns = total_df.columns
total_df = add_embeddings(total_df)
known_future_features = [col for col in total_df.columns if col not in pre_columns]

print(total_df.shape)
total_df.head()

(2545020, 20)


| |FIPS|AgeDist|AirPollution|HealthDisp|Name|Date|DiseaseSpread|Transmission|VaccinationFull|SocialDist|Cases|TimeFromStart|LinearSpace|Constant|LinearTime|P2Time|P3Time|P4Time|CosWeekly|SinWeekly|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|1001|0.5017|0.521|0.2606|Alabama, Autauga|2020-02-28|0.0|0.0|0.0|1.0|0.0|0|0.0|0.5|0.0|1.0|-1.0|1.0|-0.900969|-0.433884|
|1|1001|0.5017|0.521|0.2606|Alabama, Autauga|2020-02-29|0.0|0.0|0.0|1.0|0.0|1|0.0|0.5|0.001235|0.992593|-0.985213|0.975415|-0.222521|-0.974928|
|2|1001|0.5017|0.521|0.2606|Alabama, Autauga|2020-03-01|0.0|0.0|0.0|1.0|0.0|2|0.0|0.5|0.002469|0.985204|-0.970517|0.951104|0.62349|-0.781831|
|3|1001|0.5017|0.521|0.2606|Alabama, Autauga|2020-03-02|0.0|0.0|0.0|1.0|0.0|3|0.0|0.5|0.003704|0.977833|-0.955912|0.927065|1.0|0.0|
|4|1001|0.5017|0.521|0.2606|Alabama, Autauga|2020-03-03|0.0|0.0|0.0|1.0|0.0|4|0.0|0.5|0.004938|0.97048|-0.941398|0.903296|0.62349|0.781831|


In [31]:
# checkpoint the total merged data
total_df.round(4).to_csv(output_total, index=False)

# Uncomment if only starting from here
# total_df = pd.read_csv(output_total)

# Rurality median based cut

In [34]:
MADRANGE = config['support']['MADRange']
RURRANGE = config['support']['RuralityRange']

# reading with the default UTF-8 encoding fails, so use latin1
rur = pd.read_csv(os.path.join(support_path, config['support']["Rurality"]), encoding = 'latin1')

locs = rur.FIPS

if -1 in RURRANGE:
    print('No Median Rurality Cut')
    lost = []
else:
    locs = rur[(rur['Median'] >= RURRANGE[0]) & (rur['Median'] <= RURRANGE[1])].FIPS
    lost = rur[~((rur['Median'] >= RURRANGE[0]) & (rur['Median'] <= RURRANGE[1]))].FIPS
    rur = rur[rur['FIPS'].isin(locs)]

print('Lost number of locations from median cut ' + str(len(lost)))
print('Remaining number of locations from median cut ' + str(len(locs)))

if -1 in MADRANGE:
    print('No MAD cut')
    lost = []
else:
    locs = rur[(rur['MAD'] >= MADRANGE[0]) & (rur['MAD'] < MADRANGE[1])].FIPS
    lost = rur[~((rur['MAD'] >= MADRANGE[0]) & (rur['MAD'] < MADRANGE[1]))].FIPS

print('Lost Num Locations from MAD Cut ' + str(len(lost)))
print('Remaining Num Locations from MAD Cut ' + str(len(locs)))

print('#' * 50)
print('Final Location Count: ' + str(len(locs)))

Lost number of locations from median cut 3039
Remaining number of locations from median cut 182
Lost Num Locations from MAD Cut 103
Remaining Num Locations from MAD Cut 79
##################################################
Final Location Count: 79
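
The two cuts above can also be written with `Series.between`, which makes the interval bounds explicit: the median cut is inclusive on both ends, the MAD cut only on the left. This is a sketch on invented values, assuming pandas 1.3 or later for the string form of `inclusive`; the `-1` sentinel that disables a cut is left out for brevity.

```python
import pandas as pd

# Toy rurality table; the ranges below are made-up, not the configured ones.
rur = pd.DataFrame({
    "FIPS": [1, 2, 3, 4],
    "Median": [1.0, 3.0, 5.0, 5.0],
    "MAD": [0.0, 1.0, 2.0, 0.5],
})
rurality_range = [3, 5]  # keeps lower <= Median <= upper
mad_range = [0, 2]       # keeps lower <= MAD < upper

median_mask = rur["Median"].between(*rurality_range, inclusive="both")
mad_mask = rur["MAD"].between(*mad_range, inclusive="left")

# Counties that pass both cuts
selected = rur.loc[median_mask & mad_mask, "FIPS"]
print(selected.tolist())
```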


In [35]:
# only keep the selected counties
df = total_df[total_df['FIPS'].isin(locs)].reset_index(drop=True)
df.shape

(63990, 20)

In [36]:
df.round(4).to_csv(output_rurality_cut, index=False)

# Top 500 counties by population

In [37]:
sorted_fips = population.sort_values(by=['POPESTIMATE2019'], ascending=False)['FIPS'].values
df = total_df[total_df['FIPS'].isin(sorted_fips[:500])]
df.round(4).to_csv(output_top500, index=False)

# Top 100 counties by population

In [38]:
df = total_df[total_df['FIPS'].isin(sorted_fips[:100])]
df.round(4).to_csv(output_folder + 'Top_100.csv', index=False)
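
Since the top 500 and top 100 selections differ only in the count, they could share one helper. The sketch below is one way to do that with `DataFrame.nlargest`; the helper name `top_n_by_population` and the toy population numbers are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the population file and the merged data.
population = pd.DataFrame({
    "FIPS": [1001, 1003, 1005, 1007],
    "POPESTIMATE2019": [100, 400, 50, 20],
})
total_df = pd.DataFrame({
    "FIPS": [1001, 1003, 1005, 1007, 1003],
    "Cases": [1, 2, 3, 4, 5],
})

def top_n_by_population(total_df, population, n):
    """Keep only rows whose county is among the n most populous."""
    top_fips = population.nlargest(n, "POPESTIMATE2019")["FIPS"]
    return total_df[total_df["FIPS"].isin(top_fips)].reset_index(drop=True)

top2 = top_n_by_population(total_df, population, 2)
print(top2)
```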

# Update config.json
Make sure your `config.json` is consistent with this information. The last cell in this section writes these values to the config directly; in the future a separate config for the model may be created.

In [39]:
static_locs = [i for i in range(len(static_features))]
print(f'static locs: {static_locs}')

start = len(static_features) + len(dynamic_features)
future_locs = [i for i in range(start, start + len(known_future_features))]
print(f'future locs: {future_locs}')

target_loc = start + len(known_future_features)
print(f'target loc: {target_loc}. total input {target_loc+1}')

print(f'col_mappings: Static {static_features}')
print(f'col_mappings: Future {known_future_features}')
print(f'col_mappings: Known Regular  {static_features + dynamic_features}')

static locs: [0, 1, 2]
future locs: [7, 8, 9, 10, 11, 12, 13, 14]
target loc: 15. total input 16
col_mappings: Static ['AgeDist', 'AirPollution', 'HealthDisp']
col_mappings: Future ['LinearSpace', 'Constant', 'LinearTime', 'P2Time', 'P3Time', 'P4Time', 'CosWeekly', 'SinWeekly']
col_mappings: Known Regular  ['AgeDist', 'AirPollution', 'HealthDisp', 'DiseaseSpread', 'Transmission', 'VaccinationFull', 'SocialDist']
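
The indices printed above follow from an assumed column order of static, then dynamic, then known future features, then the target. The sketch below derives the same numbers by looking names up in that ordered list; the list `input_columns` is an illustration and not something the notebook builds.

```python
static_features = ["AgeDist", "AirPollution", "HealthDisp"]
dynamic_features = ["DiseaseSpread", "Transmission", "VaccinationFull", "SocialDist"]
known_future_features = ["LinearSpace", "Constant", "LinearTime", "P2Time",
                         "P3Time", "P4Time", "CosWeekly", "SinWeekly"]
target_column = "Cases"

# Assumed ordering of the model input columns
input_columns = (static_features + dynamic_features
                 + known_future_features + [target_column])

# Look each feature up by name instead of counting offsets by hand
static_locs = [input_columns.index(col) for col in static_features]
future_locs = [input_columns.index(col) for col in known_future_features]
target_loc = input_columns.index(target_column)

print(static_locs, future_locs, target_loc, len(input_columns))
```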


In [40]:
# read the config file again
with open(configPath) as inputFile:
    config = json.load(inputFile)

config["TFTparams"]["static_locs"] = static_locs
config["TFTparams"]["future_locs"] = future_locs
config["TFTparams"]["target_loc"] = [target_loc]
config["TFTparams"]["total_inputs"] = target_loc + 1

config["TFTparams"]["col_mappings"]["Static"] = static_features

# this notebook doesn't support multiple target columns yet
config["TFTparams"]["col_mappings"]["Target"] = [target_column]
config["TFTparams"]["col_mappings"]["Future"] = known_future_features
config["TFTparams"]["col_mappings"]["Known Regular"] = static_features + dynamic_features

# dump the json config
with open(configPath, 'w') as outputFile:
    json.dump(config, outputFile, indent=4)

# Run TFT model

Now go to `TF2/TFTTF2_ModelDev` and run the following command to run TFT on this new dataset:

```bash
python main.py -p "../../dataset_new/config.json" -c checkpoints -d "../../dataset_new/TFTdfCurrent.csv"
```