# 1 Make contact matrices
This script processes GREASYPOP data so that the contact matrices are in a format readable by Covasim. We'll need to make a number of changes.
1. In GREASYPOP, people are uniquely identified by three IDs (p_id, hh_id, and cbg_id for person, household, and CBG). In Covasim, people only have one unique identifier (uid).
2. The contact networks for GREASYPOP are stored as upper triangle matrices for each layer (household, school, workplace, and group quarters). Covasim needs a dictionary for each layer with three arrays: person one (p1), person two (p2), and a weight of one (beta).
3. The 'non-household' layer in GREASYPOP is a combination of the school, workplace, and group quarters layers; it does not represent other interactions outside of the household. We will create a custom 'community' layer to represent contacts such as going to the grocery store or a restaurant, which will use random mixing with a mean of 4 contacts per person.
4. In GREASYPOP, CBGs have unique identifiers (cbg_id) that are not the same as the actual Census Block Group numbers. The file `cbg_idxs.csv` maps cbg_id to the actual CBG number.
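
As a rough illustration of the random-mixing idea in item 3, the sketch below draws random pairs of agents so that the mean number of contacts per agent is close to 4. The agent count, the seed, and the pair-sampling scheme are assumptions made for this example; this is not necessarily how the community layer is built in the next notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n_agents = 1000                 # hypothetical population size
mean_contacts = 4               # target mean number of community contacts

# Each link gives one contact to both of its endpoints, so n * mean / 2 links are needed
n_edges = n_agents * mean_contacts // 2
p1 = rng.integers(0, n_agents, size=n_edges)
p2 = rng.integers(0, n_agents, size=n_edges)

keep = p1 != p2  # drop the occasional self-link; duplicate pairs are left in for simplicity
community = {'p1': p1[keep], 'p2': p2[keep], 'beta': np.ones(keep.sum())}

# Count contacts per agent to check the mean
degree = np.bincount(np.concatenate([community['p1'], community['p2']]), minlength=n_agents)
print(degree.mean())
```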

**Input files**
- `people.csv`: All the people in the synth pop with columns for all their attributes
- `cbg_idxs.csv`: All of the CBGs in the synth pop identified by their cbg_id
- `adj_mat_keys.csv`: All the people in the synth pop identified by p_id, hh_id, and cbg_id
- `adj_upper_triang_hh.mtx`: Upper triangle household matrix
- `adj_upper_triang_sch.mtx`: Upper triangle school matrix
- `adj_upper_triang_wp.mtx`: Upper triangle workplace matrix
- `adj_upper_triang_gq.mtx`: Upper triangle group quarters matrix (long-term care and correctional facilities)

**Output files**
- `people_all.csv`: Dataframe of all agents and attributes including uid, cbg, county, and state
- `gplayer_c.csv`: Community layer as dataframe using uid
- `gplayer_h.csv`: Household layer as dataframe using uid
- `gplayer_s.csv`: School layer as dataframe using uid
- `gplayer_w.csv`: Workplace layer as dataframe using uid
- `gplayer_g.csv`: Group quarters layer as dataframe using uid

In [1]:
# Import packages and set path
import os
from scipy.io import mmread # Only mmread is needed to read the .mtx files
import pandas as pd
import numpy as np

path = "" # Set this to the folder that contains the GREASYPOP input files

## 1.1 Data cleaning
First we will read in `people.csv`. This is a GREASYPOP dataframe of all the agents in the synthpop and their attributes. Agents are uniquely identified by p_id, hh_id, and cbg_id. Attributes include age, sex, race, school grade, worker status, commuter status, commuter income category, and commuter workplace category. See the [GREASYPOP-CO manuscript](https://arxiv.org/abs/2406.14698) to learn more about how agent attributes and contacts are generated.

In [2]:
people = pd.read_csv(f'{path}/people.csv') # Read in people.csv (big file, takes a bit)
people.head() # Preview data

Unnamed: 0,p_id,hh_id,cbg_id,sample_index,age,female,working,commuter,commuter_income_category,commuter_workplace_category,race_black_alone,white_non_hispanic,hispanic,sch_grade
0,1,1,1,1025535,51,0.0,1,0,,,0.0,1.0,0.0,
1,2,1,1,1025536,46,1.0,1,1,2.0,9.0,0.0,0.0,0.0,
2,3,1,1,1025537,10,1.0,0,0,,,0.0,0.0,0.0,5
3,4,1,1,1025538,6,1.0,0,0,,,0.0,0.0,0.0,k
4,5,1,1,1025539,22,0.0,1,1,1.0,8.0,0.0,0.0,0.0,c


Now we will read in `cbg_idxs.csv` and create columns for state and county from the Census geocode.

In [3]:
cbg_idxs = pd.read_csv(f'{path}/cbg_idxs.csv') # Read in cbg_idxs.csv
cbg_idxs['state'] = cbg_idxs['cbg_geocode'].astype(str).str[:2] # Make column for state FIPS code
cbg_idxs['county'] = cbg_idxs['cbg_geocode'].astype(str).str[:5] # Make column for county FIPS code
cbg_idxs.head() # Preview data

Unnamed: 0,cbg_id,cbg_geocode,state,county
0,1,510131019003,51,51013
1,2,510131035022,51,51013
2,3,510131023024,51,51013
3,4,510131013005,51,51013
4,5,510131014023,51,51013
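
One caveat with slicing the geocode as a string: if `cbg_geocode` is parsed as an integer, any state whose FIPS code starts with 0 (for example Colorado, 08) loses its leading zero, and the first two characters no longer give the state. The sketch below shows one possible guard, padding the code back to 12 digits before slicing. The dataframe and the Colorado geocode are made-up values for illustration only.

```python
import pandas as pd

# Toy geocodes as they would look after being parsed as integers:
# the Virginia code (51...) keeps all 12 digits, the Colorado code (08...) drops its leading zero
cbg_toy = pd.DataFrame({'cbg_id': [1, 2],
                        'cbg_geocode': [510131019003, 80010078011]})

geocode_str = cbg_toy['cbg_geocode'].astype(str).str.zfill(12)  # Restore the 12-digit width
cbg_toy['state'] = geocode_str.str[:2]   # State FIPS code
cbg_toy['county'] = geocode_str.str[:5]  # State + county FIPS code
print(cbg_toy)
```

For the states in the preview above (FIPS codes 51, 24, and 11) the padding changes nothing, so it only matters when the synth pop includes a state with a single-digit FIPS code.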


Next, we create a new file, `people_all.csv`, which lists all the agents with a unique identifier (uid). The GREASYPOP population includes workers from out of state who commute into the area of interest, but these agents do not have any attribute data. Covasim needs every agent to have a value for each attribute, so we randomly assign these agents ages between 16 and 65, make them all female, and set any missing state and county to 0. These values are placeholders that keep the arrays used by Covasim the same length (the total number of agents); the agents are filtered out when we calculate health burden for our area of interest. This will become clearer in the next notebook. GREASYPOP uses the column 'female' to identify sex, but Covasim uses 'sex' with 0 for female and 1 for male, so we make this change as well.

In [4]:
keys_mat = pd.read_csv(f'{path}/adj_mat_keys.csv') # Read in adj_mat_keys.csv
keys_mat['uid'] = keys_mat['index_zero'] # We can use the column 'index_zero' as the 'uid' for Covasim
keys_mat = keys_mat.drop(columns=['index_zero','index_one']) # Drop the original index columns 'index_zero' and 'index_one'
keys_mat = keys_mat.merge(people, how='left',on=['p_id','hh_id','cbg_id']) # Merge to make dataframe of people with uid and attributes
keys_mat_age_null = keys_mat.loc[keys_mat['age'].isnull()] # Identify people who commute into the synth pop (they have no attribute data)
keys_mat.loc[keys_mat['age'].isnull(), 'age'] = 16 + (np.random.uniform(low=0, high=49, size=(len(keys_mat_age_null)))).round() # Assign them random ages from 16 to 65
keys_mat.loc[keys_mat['female'].isnull(), 'female'] = 1 # Make all agents with missing sex female
keys_mat['sex'] = 1 - keys_mat['female'] # Create column 'sex' where 0=female and 1=male
keys_mat = keys_mat.merge(cbg_idxs, how='left',on='cbg_id') # Merge all agents to their home CBG on cbg_id to add the geocode, state, and county
keys_mat['state'] = keys_mat['state'].fillna(0).astype(str) # Assign agents with missing states to have state 0
keys_mat['county'] = keys_mat['county'].fillna(0).astype(str) # Assign agents with missing counties to have county 0
keys_mat.to_csv(f'{path}/people_all.csv') # Export (the dataframe index is written as an unnamed first column)
keys_mat.head() # Preview data

Unnamed: 0,p_id,hh_id,cbg_id,uid,sample_index,age,female,working,commuter,commuter_income_category,commuter_workplace_category,race_black_alone,white_non_hispanic,hispanic,sch_grade,sex,cbg_geocode,state,county
0,1,636,3879,0,2554510.0,45.0,1.0,1.0,0.0,,,0.0,1.0,0.0,,0.0,240317100000.0,24,24031
1,1,850,4915,1,2575249.0,50.0,0.0,1.0,1.0,2.0,2.0,0.0,1.0,0.0,,1.0,240378800000.0,24,24037
2,4,552,2920,2,2563912.0,21.0,0.0,0.0,0.0,,,0.0,0.0,0.0,,1.0,240135100000.0,24,24013
3,1227,0,1711,3,0.0,30.0,1.0,0.0,0.0,,,,,,,0.0,110010000000.0,11,11001
4,2,113,5593,4,2470655.0,56.0,1.0,0.0,0.0,,,0.0,1.0,0.0,,0.0,245102600000.0,24,24510
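
The merge-and-fill logic can be checked on a toy example. The sketch below uses two made-up dataframes in place of `adj_mat_keys.csv` and `people.csv`, and it swaps the `np.random.uniform(...).round()` call for a seeded integer draw so the result is reproducible. That substitution is an assumption of this sketch, not part of the original cell.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: three agents in the key file, but only two have attribute data
keys = pd.DataFrame({'p_id': [1, 2, 1], 'hh_id': [1, 1, 0], 'cbg_id': [1, 1, 9],
                     'index_zero': [0, 1, 2]})
people_toy = pd.DataFrame({'p_id': [1, 2], 'hh_id': [1, 1], 'cbg_id': [1, 1],
                           'age': [51, 46], 'female': [0.0, 1.0]})

# The zero-based index becomes the uid; a left merge keeps agents with no attributes
merged = keys.rename(columns={'index_zero': 'uid'}).merge(
    people_toy, how='left', on=['p_id', 'hh_id', 'cbg_id'])

missing = merged['age'].isnull()  # True for the in-commuter with no attribute data
rng = np.random.default_rng(seed=1)
merged.loc[missing, 'age'] = rng.integers(16, 66, size=missing.sum())  # Integer ages 16 to 65
merged['female'] = merged['female'].fillna(1)  # Placeholder: missing sex becomes female
merged['sex'] = 1 - merged['female']           # 0=female, 1=male
print(merged)
```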


## 1.2 Define new layers
The next cell defines a function that reads in a GREASYPOP upper triangle matrix and converts it to a dataframe with the columns 'p1', 'p2', and 'beta' that Covasim expects. The values in 'p1' and 'p2' are uids, and 'beta' is the weight of the link between the two people (always 1). For our purposes, in a .mtx file, 'col' corresponds to 'p1' and 'row' corresponds to 'p2'. For now, we store each layer in a dataframe, which will be turned into a dictionary in the next notebook.

In [5]:
def newLayer(file):
    #file = f'{path}/adj_upper_triang_wp.mtx' # uncomment this if you'd like to go through line by line
    m = mmread(file) # Read in the mtx file
    mat_data = {'p1': m.col, 'p2': m.row}
    mat = pd.DataFrame(data=mat_data) # Join p1 and p2 in a dataframe
    mat['beta'] = 1 # Add the column 'beta' used by Covasim. This is the weight of the link between p1 and p2
    return mat

The function works for the household, school, workplace, and group quarters layers. The community layer is made in the next notebook.
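
To see what the conversion does, the sketch below writes a tiny upper triangle matrix to a temporary .mtx file and reads it back with the same steps as the function. The four-agent network and the file name `toy_upper.mtx` are invented for this example.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy.io import mmread, mmwrite
from scipy.sparse import coo_matrix

# Toy network of 4 agents with links 0-1, 0-3, and 1-2, stored above the diagonal (row < col)
rows = np.array([0, 0, 1])
cols = np.array([1, 3, 2])
upper = coo_matrix((np.ones(3), (rows, cols)), shape=(4, 4))

with tempfile.TemporaryDirectory() as tmp:
    file = os.path.join(tmp, 'toy_upper.mtx')  # Hypothetical file name
    mmwrite(file, upper)                       # Write the toy matrix in Matrix Market format
    m = mmread(file)                           # Read it back as a sparse COO matrix

layer = pd.DataFrame({'p1': m.col, 'p2': m.row})  # Same column/row mapping as the function
layer['beta'] = 1                                 # Every link has weight 1
print(layer)
```

Because only the upper triangle is stored, each link appears once and p1 is always larger than p2.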

In [6]:
household = newLayer(f'{path}/adj_upper_triang_hh.mtx')
school = newLayer(f'{path}/adj_upper_triang_sch.mtx')
work = newLayer(f'{path}/adj_upper_triang_wp.mtx')
groupquarters = newLayer(f'{path}/adj_upper_triang_gq.mtx')

household.to_csv(f'{path}/gplayer_h.csv')
school.to_csv(f'{path}/gplayer_s.csv')
work.to_csv(f'{path}/gplayer_w.csv')
groupquarters.to_csv(f'{path}/gplayer_g.csv')

school.head() # Preview data

Unnamed: 0,p1,p2,beta
0,1893,1601,1
1,2240,2053,1
2,4863,2292,1
3,5457,3305,1
4,6110,4409,1
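
Because the layers are exported with the default index, each CSV gains an unnamed first column (the `Unnamed: 0` seen in the previews). The sketch below shows one way a saved layer could be read back and turned into a dictionary of three arrays. The inline CSV text stands in for a file such as `gplayer_s.csv`, and the 32-bit dtypes are an assumption; the next notebook may handle this step differently.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for a saved layer file, including the unnamed index column
csv_text = ",p1,p2,beta\n0,1893,1601,1\n1,2240,2053,1\n"
layer_df = pd.read_csv(io.StringIO(csv_text), index_col=0)  # index_col=0 absorbs the saved index

# One array per column: person one, person two, and the link weight
layer = {
    'p1': layer_df['p1'].to_numpy(dtype=np.int32),
    'p2': layer_df['p2'].to_numpy(dtype=np.int32),
    'beta': layer_df['beta'].to_numpy(dtype=np.float32),
}
print(layer)
```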
