In [1]:
%run ./resources/library.py
style_notebook()

Digital Case Study: Multidrug-Resistant Tuberculosis (MDR-TB) Outbreak - Revisiting the 2005 Outbreak Investigation in Thailand by John Oeltmann

# Part 1: Mapping the 2005 MDR-TB Outbreak in Thailand

The Oeltmann et al. (2008) paper describes an outbreak of MDR-TB within a refugee camp in Thailand during 2005. You may notice some similarities between the workflow used to map this outbreak and the one Dr. John Snow followed when he investigated the cholera outbreak in London.

In [2]:
show_pdf('resources/genotyping.pdf',900,600)

For Notebooks 2 and 3, we will follow the workflow outlined in the PDF file below.

In [3]:
show_pdf("./resources/Work_flow_diagram_08012017.pdf",800,500)

## Step 1. Outbreak Notification

You just received the **notification** below of an outbreak by email.

The notification arrived as a PDF attachment, which is displayed below. Please take time to read the contents of the attachment.

In [4]:
show_pdf("./resources/GDD.pdf",800,500)

![2005 Worldwide Incidence of TB](./images/Incidence-of-TB-worldwide-2005-WHO.png)

**Figure 1**. Worldwide incidence of TB in 2005 (WHO).

##  Step 2. Initial Cases Received

You then received a separate email with an attached [Excel Spreadsheet](./resources/Thailand_cases_exercise_1st_spreadsheet_07232017.xls) that contains data about the initial cases. 

Now it's time to use your data science hacking skills. Using Python code, read this spreadsheet as a data frame to use within the Jupyter Notebooks environment.

In [5]:
import pandas as pd

In [6]:
# Import the excel file and call it xls_file
xls_file = pd.ExcelFile('resources/Thailand_cases_exercise_1st_spreadsheet_07232017.xls')
xls_file

<pandas.io.excel.ExcelFile at 0x7f5e766b28d0>

In [7]:
# View the excel file's sheet names
xls_file.sheet_names

['TBDATA']

In [8]:
# Load the 'TBDATA' sheet of the xls file as a dataframe
df1 = xls_file.parse('TBDATA')

In [9]:
# To view the column names with some data we can use the "head" function and indicate we want to display only 4 rows (the top 4).
# You can change the number 4 to something else, try it.
df1.head(4)

Unnamed: 0,CaseNo,LON,LAT,COORDS
0,TH-102678,100.828607,14.704461,"14.704461,100.828607"
1,TH-101007,100.829347,14.702266,"14.702266,100.829347"
2,TH-101290,100.825159,14.699828,"14.699828,100.825159"
3,TH-101067,100.824887,14.700197,"14.700197,100.824887"


You can also read the Excel file [here](./resources/Thailand_cases_exercise_1st_spreadsheet_07232017.xls) to check the outputs above.
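Before mapping, it can help to confirm that the text `COORDS` field agrees with the numeric `LAT` and `LON` columns. The sketch below is only an illustration: it rebuilds two rows from the output above by hand instead of reading the spreadsheet, it assumes `COORDS` is ordered latitude first, and the `tolerance` value is a hypothetical choice.

```python
import pandas as pd

# Two rows copied by hand from the df1.head(4) output above
cases = pd.DataFrame({
    "CaseNo": ["TH-102678", "TH-101007"],
    "LON": [100.828607, 100.829347],
    "LAT": [14.704461, 14.702266],
    "COORDS": ["14.704461,100.828607", "14.702266,100.829347"],
})

# Split "lat,lon" into two float columns (assumed order: latitude first)
parsed = cases["COORDS"].str.split(",", expand=True).astype(float)
parsed.columns = ["lat_from_coords", "lon_from_coords"]

# Flag rows where the text field disagrees with the numeric columns
tolerance = 1e-6  # hypothetical tolerance, in degrees
mismatch = (
    (parsed["lat_from_coords"] - cases["LAT"]).abs() > tolerance
) | (
    (parsed["lon_from_coords"] - cases["LON"]).abs() > tolerance
)

# Case numbers whose COORDS text does not match LAT/LON
print(cases.loc[mismatch, "CaseNo"].tolist())
```

An empty list means the two representations agree for every row checked.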

> Question 1. What are the column names in the Excel spreadsheet? 

Answer Cell for Question #1.
Please type your answer below. Double-click to begin typing.


> Question 2. What is the name of the Sheet where the data are found?

> Question 3. How do you list the columns of the dataframe `df1`?

In [10]:
# For Question #3, you may want to try typing the code here...
# Try: list(df1) or df1.columns.values


Let's "pickle" this data frame so we can retrieve it later.

In [11]:
df1.to_pickle("outputs/df1.pickle")
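Retrieving a pickled data frame later is done with `pd.read_pickle`. The following is a minimal round-trip sketch, assuming a small hand-built frame and a temporary directory (so it does not touch the notebook's `outputs/` folder); the variable names are illustrative.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for df1, built by hand for illustration
frame = pd.DataFrame({
    "CaseNo": ["TH-102678", "TH-101007"],
    "LON": [100.828607, 100.829347],
    "LAT": [14.704461, 14.702266],
})

# Write to a temporary directory so the sketch does not touch outputs/
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "df1.pickle")
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame should match the original, dtypes included
print(restored.equals(frame))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.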

## Step 3. Mapping the Cases (data) in the Spreadsheet


In [54]:
# use folium
import folium

In [55]:
#Store the coordinates of the refugee camp listed in the EOC email
CAMP_COORDINATES = (14.699859, 100.829019)

In [56]:
# create empty map zoomed in on Refugee camp
map1 = folium.Map(location=CAMP_COORDINATES, zoom_start=16)

In [57]:
# Let's see what this map looks like...
map1

In [58]:
# Let's add a marker for every record in df1, with the case number as the pop-up
for _, row in df1.iterrows():
    folium.Marker([row['LAT'], row['LON']], popup=row['CaseNo']).add_to(map1)


In [59]:
# Let's display the map again!                                       
map1

You can click on each marker in the map above to launch the pop-up that displays the case number.
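A quick numeric check can complement the visual one: if a case lies far from the camp, its coordinates may have been mistyped or swapped. The sketch below computes great-circle (haversine) distances from the camp coordinates to two cases from `df1`. The helper name `haversine_km` and the `MAX_DISTANCE_KM` threshold are assumptions made for illustration.

```python
import math

CAMP_COORDINATES = (14.699859, 100.829019)
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# (CaseNo, LAT, LON) taken from the first rows of df1
cases = [
    ("TH-102678", 14.704461, 100.828607),
    ("TH-101007", 14.702266, 100.829347),
]

MAX_DISTANCE_KM = 2.0  # hypothetical threshold for "near the camp"
distances = {
    case_no: haversine_km(CAMP_COORDINATES, (lat, lon))
    for case_no, lat, lon in cases
}
far_away = [case_no for case_no, dist in distances.items()
            if dist > MAX_DISTANCE_KM]

print(distances)
print(far_away)
```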

## Step 4. Learning about Genotyping Terminology

The data received include a unique identifier for each case within the line listing. This may not always be present in real-life outbreak data, but is very helpful. We will be using this ID to join data from various sources, including molecular epidemiology data with genotypes of TB isolates. The CDC documentation on genotyping can help you to better understand the molecular epi data.

In a separate email, you receive Study IDs and genotypes within a 24-loci MIRU-VNTR coded system. You can read the genotyping documentation [here](./resources/genotypingterminology.pdf). Please take some time to understand the terminology used.
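To get a feel for how two genotype codes can be compared, the sketch below lists the positions at which two 24-character codes differ. It assumes one character per locus, and the two codes are placeholders in the style of this exercise's data, not real genotypes; `differing_loci` is a hypothetical helper name.

```python
def differing_loci(code_a, code_b):
    """Return the 1-based locus positions at which two codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same number of loci")
    return [
        position
        for position, (a, b) in enumerate(zip(code_a, code_b), start=1)
        if a != b
    ]


# Placeholder codes (assumption: one character per locus)
code_3 = "012345678901234567893120"
code_4 = "012345678901234567894320"

print(len(code_3))                     # number of loci in the code
print(differing_loci(code_3, code_4))  # loci where the two codes disagree
```

Because the codes are compared as text, they should be kept as strings; a numeric type would drop the leading zero.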

## Step 5. Data Update from the Field!

You receive an email from the field with genotyping data in an [Excel spreadsheet](./resources/Thailand_cases_exercise_2nd_spreadsheet_07232017.xls). This spreadsheet includes the unique ID for the patient and a 24-loci MIRU-VNTR string code as described in the CDC genotyping documentation. We will read this spreadsheet into a data frame in a similar manner to the case line listing data. Please run the Python code below to read the genotyping data.

In [18]:
# Import the excel file and call it xls_file2
xls_file2 = pd.ExcelFile('resources/Thailand_cases_exercise_2nd_spreadsheet_07232017.xls')
xls_file2

<pandas.io.excel.ExcelFile at 0x7f5e4a2de8d0>

In [19]:
# View the excel file's sheet names
xls_file2.sheet_names

['TBDATA2']

In [20]:
# Load the 'TBDATA2' sheet of the xls file as a dataframe
df2 = xls_file2.parse('TBDATA2')

In [21]:
# Let's display the dataframe
df2

Unnamed: 0,CaseNo,FAKEMIRUVNTR,FAKEMIRUID,SYMBOL
0,TH-102678,012345678901234567893120,3,Blue Circle
1,TH-101007,012345678901234567894320,4,Green Circle
2,TH-101290,012345678901234567894320,4,Green Circle
3,TH-101067,012345678901234567894320,4,Green Circle
4,TH-101184,012345678901234567890423,5,Green Circle
5,TH-100913,012345678901234567893120,3,Green Circle
6,TH-101176,012345678901234567894320,4,Green Circle
7,TH-101497,012345678901234567894320,4,Green Circle
8,TH-101280,012345678901234567894320,4,Blue Circle
9,TH-101055,012345678901234567894320,4,Green Circle
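Cases that share a MIRU-ID form a genotype cluster, so counting cases per ID is a natural first summary. A minimal sketch, assuming a frame rebuilt by hand from the `CaseNo` and `FAKEMIRUID` columns shown above (the name `genotypes` is illustrative):

```python
import pandas as pd

# CaseNo and FAKEMIRUID copied by hand from the df2 output above
genotypes = pd.DataFrame({
    "CaseNo": ["TH-102678", "TH-101007", "TH-101290", "TH-101067",
               "TH-101184", "TH-100913", "TH-101176", "TH-101497",
               "TH-101280", "TH-101055"],
    "FAKEMIRUID": [3, 4, 4, 4, 5, 3, 4, 4, 4, 4],
})

# Number of cases in each genotype cluster
cluster_sizes = genotypes.groupby("FAKEMIRUID").size()
print(cluster_sizes)
```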


In [22]:
# Display the first four records of the dataframe df2
# Type the command below and press Shift-Enter


Note that we have pre-assigned symbols in the last column to facilitate the visualization of different types of data on a map.

In [23]:
# Let's pickle this dataframe and use it later
df2.to_pickle("outputs/df2.pickle")

## Step 6. Merging Two Data Sets

In the steps above, you have created two dataframes:
1. `df1`: Line listing of cases 
2. `df2`: Cases with MIRU-VNTR (genotyping) string code

Next we need to **`merge`** the case line listing with the genotyping data to create an analytic dataset of cases and their genotypes, using the `CaseNo` field as the linking variable. Please run the Python code below, which uses `pandas.merge` to join these data:


In [24]:
# Merge the dataframe with case-line listed data with the data frame with genotyping data...
df3 = pd.merge(df1, df2, on='CaseNo')
df3

Unnamed: 0,CaseNo,LON,LAT,COORDS,FAKEMIRUVNTR,FAKEMIRUID,SYMBOL
0,TH-102678,100.828607,14.704461,"14.704461,100.828607",012345678901234567893120,3,Blue Circle
1,TH-101007,100.829347,14.702266,"14.702266,100.829347",012345678901234567894320,4,Green Circle
2,TH-101290,100.825159,14.699828,"14.699828,100.825159",012345678901234567894320,4,Green Circle
3,TH-101067,100.824887,14.700197,"14.700197,100.824887",012345678901234567894320,4,Green Circle
4,TH-101184,100.829032,14.697482,"14.697482,100.829032",012345678901234567890423,5,Green Circle
5,TH-100913,100.829261,14.702418,"14.702418,100.829261",012345678901234567893120,3,Green Circle
6,TH-101176,100.829228,14.702959,"14.702959,100.829228",012345678901234567894320,4,Green Circle
7,TH-101497,100.829140,14.702244,"14.702244,100.82914",012345678901234567894320,4,Green Circle
8,TH-101280,100.829344,14.702908,"14.702908,100.829344",012345678901234567894320,4,Blue Circle
9,TH-101055,100.829494,14.701485,"14.701485,100.829494",012345678901234567894320,4,Green Circle
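When joining outbreak data from different sources, it is worth checking that every case matched and that no case number is duplicated. The sketch below uses the `how`, `validate` and `indicator` parameters of `pd.merge` on two small frames with hypothetical case numbers; it illustrates the check and is not part of the exercise data.

```python
import pandas as pd

# Hypothetical case numbers: TH-3 has no genotype, TH-4 has no line-list entry
line_list = pd.DataFrame({
    "CaseNo": ["TH-1", "TH-2", "TH-3"],
    "LAT": [14.70, 14.71, 14.72],
})
genotypes = pd.DataFrame({
    "CaseNo": ["TH-1", "TH-2", "TH-4"],
    "FAKEMIRUID": [3, 4, 4],
})

# An outer join keeps unmatched rows; validate raises if CaseNo repeats
# on either side; indicator adds a "_merge" column naming the source
checked = pd.merge(line_list, genotypes, on="CaseNo", how="outer",
                   validate="one_to_one", indicator=True)

unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["CaseNo", "_merge"]])
```

If `unmatched` is empty, the default inner join used in this step loses no cases.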


In [25]:
df3.to_pickle("outputs/df3.pickle")

## Step 7. Mapping the Merged Data Sets

Please create a map of the TB cases, colored by MIRU-ID, from the merged dataframe `df3`.

In [26]:
# create 2nd map of the case data
# Store the coordinates of the refugee camp listed in the EOC email
CAMP_COORDINATES = (14.699859, 100.829019)

# create empty map zoomed in on Refugee camp
map2 = folium.Map(location=CAMP_COORDINATES, zoom_start=16)

# marker color for each MIRUID
MIRUID_COLORS = {4: 'green', 3: 'blue', 2: 'yellow', 1: 'red'}

for _, row in df3.iterrows():
    # we change the color of the marker based on MIRUID;
    # cases whose MIRUID has no assigned color (e.g. 5) are not drawn
    color = MIRUID_COLORS.get(row['FAKEMIRUID'])
    if color is not None:
        folium.Marker([row['LAT'], row['LON']],
                      popup=row['CaseNo'],
                      icon=folium.Icon(color=color)).add_to(map2)
map2
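In the loop above, a case whose `FAKEMIRUID` has no assigned color (such as 5) gets no marker. One way to make this visible is to compute the color as a column first and give unmapped IDs a fallback. This is a sketch under assumptions: the frame is a hand-built subset of `df3`, and `FALLBACK_COLOR` and the `marker_color` column are hypothetical names.

```python
import pandas as pd

# Hand-built subset of the merged data, including one case with MIRU-ID 5
merged = pd.DataFrame({
    "CaseNo": ["TH-102678", "TH-101007", "TH-101184"],
    "FAKEMIRUID": [3, 4, 5],
})

MIRUID_COLORS = {4: "green", 3: "blue", 2: "yellow", 1: "red"}
FALLBACK_COLOR = "gray"  # hypothetical choice for IDs without a color

# Look up each ID; IDs missing from the mapping become NaN, then the fallback
merged["marker_color"] = (
    merged["FAKEMIRUID"].map(MIRUID_COLORS).fillna(FALLBACK_COLOR)
)
print(merged)
```

The marker loop could then read `row['marker_color']` instead of branching on the ID.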

## Step 8. Mapping by Drug Resistance Type

Now that we have successfully joined the case line listing data with the genotyping data and created the analytic dataset, we will create a map of the genotyped cases and symbolize these cases by drug-resistance type.

We receive an email with a [third spreadsheet](./resources/Thailand_cases_exercise_3rd_spreadsheet_07232017.xls) attached that includes the MIRU-ID and a **classification of MIRU-ID by drug-resistance type**. We need to create a new field in the analytic dataset for DRTYPE and code each line-listed case by this type.

Please run the Python code below to classify the MIRU-IDs by DRTYPE.

In [27]:
# read the 3rd spreadsheet with MIRU-IDs and DRTYPE and create a data frame
# Import the excel file and call it xls_file3
xls_file3 = pd.ExcelFile('resources/Thailand_cases_exercise_3rd_spreadsheet_07232017.xls')
xls_file3

<pandas.io.excel.ExcelFile at 0x7f5e4a2ac198>

In [28]:
# View the excel file's sheet names
xls_file3.sheet_names

['TBDATA3']

In [29]:
# Load the 'TBDATA3' sheet of the xls file as a dataframe
df4 = xls_file3.parse('TBDATA3')
# Pickle df4 under its own name so the merged df3 saved earlier is not overwritten
df4.to_pickle("outputs/df4.pickle")

In [30]:
df3.head(4)

Unnamed: 0,CaseNo,LON,LAT,COORDS,FAKEMIRUVNTR,FAKEMIRUID,SYMBOL
0,TH-102678,100.828607,14.704461,"14.704461,100.828607",012345678901234567893120,3,Blue Circle
1,TH-101007,100.829347,14.702266,"14.702266,100.829347",012345678901234567894320,4,Green Circle
2,TH-101290,100.825159,14.699828,"14.699828,100.825159",012345678901234567894320,4,Green Circle
3,TH-101067,100.824887,14.700197,"14.700197,100.824887",012345678901234567894320,4,Green Circle


In [31]:
df4.head(4)

Unnamed: 0,CaseNo,FAKEMIRUVNTR,FAKEMIRUID,DRTYPE
0,TH-102678,012345678901234567893120,3,PANSUSCEPTIBLE
1,TH-101007,012345678901234567894320,4,UNKNOWN
2,TH-101290,012345678901234567894320,4,UNKNOWN
3,TH-101067,012345678901234567894320,4,UNKNOWN


In [32]:
# merge the dataframe with DRTYPE with the previous merge to form the analytic dataset
df5 = pd.merge(df4, df3, left_on='CaseNo', right_on='CaseNo')
df5

Unnamed: 0,CaseNo,FAKEMIRUVNTR_x,FAKEMIRUID_x,DRTYPE,LON,LAT,COORDS,FAKEMIRUVNTR_y,FAKEMIRUID_y,SYMBOL
0,TH-102678,012345678901234567893120,3,PANSUSCEPTIBLE,100.828607,14.704461,"14.704461,100.828607",012345678901234567893120,3,Blue Circle
1,TH-101007,012345678901234567894320,4,UNKNOWN,100.829347,14.702266,"14.702266,100.829347",012345678901234567894320,4,Green Circle
2,TH-101290,012345678901234567894320,4,UNKNOWN,100.825159,14.699828,"14.699828,100.825159",012345678901234567894320,4,Green Circle
3,TH-101067,012345678901234567894320,4,UNKNOWN,100.824887,14.700197,"14.700197,100.824887",012345678901234567894320,4,Green Circle
4,TH-101184,012345678901234567890423,5,UNKNOWN,100.829032,14.697482,"14.697482,100.829032",012345678901234567890423,5,Green Circle
5,TH-100913,012345678901234567893120,3,PANSUSCEPTIBLE,100.829261,14.702418,"14.702418,100.829261",012345678901234567893120,3,Green Circle
6,TH-101176,012345678901234567894320,4,UNKNOWN,100.829228,14.702959,"14.702959,100.829228",012345678901234567894320,4,Green Circle
7,TH-101497,012345678901234567894320,4,UNKNOWN,100.829140,14.702244,"14.702244,100.82914",012345678901234567894320,4,Green Circle
8,TH-101280,012345678901234567894320,4,UNKNOWN,100.829344,14.702908,"14.702908,100.829344",012345678901234567894320,4,Blue Circle
9,TH-101055,012345678901234567894320,4,UNKNOWN,100.829494,14.701485,"14.701485,100.829494",012345678901234567894320,4,Green Circle
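The `_x` and `_y` suffixes in the output above appear because `FAKEMIRUVNTR` and `FAKEMIRUID` exist in both data frames. One way to avoid the duplicates, sketched here on small hand-built frames with illustrative names, is to bring over only the key and the new `DRTYPE` column:

```python
import pandas as pd

# Stand-in for df3 (merged line list and genotypes)
merged = pd.DataFrame({
    "CaseNo": ["TH-102678", "TH-101007"],
    "LAT": [14.704461, 14.702266],
    "LON": [100.828607, 100.829347],
    "FAKEMIRUID": [3, 4],
})
# Stand-in for df4 (MIRU-ID with drug-resistance type)
resistance = pd.DataFrame({
    "CaseNo": ["TH-102678", "TH-101007"],
    "FAKEMIRUID": [3, 4],
    "DRTYPE": ["PANSUSCEPTIBLE", "UNKNOWN"],
})

# Select only the join key and the new column before merging
analytic = pd.merge(merged, resistance[["CaseNo", "DRTYPE"]], on="CaseNo")
print(analytic.columns.tolist())
```

This assumes the overlapping columns hold identical values in both frames, as they do in the output above.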


## Step 9. Mapping the Analytic Data Set

Next, we will create a map of the analytic dataset with genotyped TB cases, symbolized by drug-resistance type. We wish to see a map of the cases within the refugee camp and to examine the types of TB within the camp, including MDR-TB cases. Please run the code below to create the map of TB cases within the camp:


In [52]:
# create 3rd map of the case data
# Store the coordinates of the refugee camp listed in the EOC email
CAMP_COORDINATES = (14.699859, 100.829019)

# create empty map zoomed in on Refugee camp
map3 = folium.Map(location=CAMP_COORDINATES, zoom_start=16)
folium.TileLayer("openstreetmap").add_to(map3)

# for each MIRUID type we use a specific map symbol:
# (fill color, number of sides, radius)
MIRUID_SYMBOLS = {
    4: ('lightgreen', 12, 6),  # unknown type: green circle
    3: ('blue', 12, 6),        # pansusceptible type: blue circle
    2: ('yellow', 4, 8),       # resistant but not MDRTB: yellow square
    1: ('red', 3, 9),          # mdr-tb type: red triangle
}

for _, row in df3.iterrows():
    symbol = MIRUID_SYMBOLS.get(row['FAKEMIRUID'])
    if symbol is None:
        continue  # MIRUIDs without a symbol (e.g. 5) are not drawn
    fill_color, sides, radius = symbol
    folium.RegularPolygonMarker([row['LAT'], row['LON']],
                                popup=row['CaseNo'],
                                fill_color=fill_color,
                                number_of_sides=sides,
                                radius=radius).add_to(map3)

map3
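The map above picks its symbols from `FAKEMIRUID`. If you would rather drive the symbols directly from the `DRTYPE` column of the analytic dataset, a lookup keyed by the label is one option. In this sketch `PANSUSCEPTIBLE` and `UNKNOWN` are labels seen in the exercise data, while `RESISTANT-NOT-MDR` and `MDR-TB` are hypothetical labels, and `style_for` is an illustrative helper name.

```python
# Marker styles keyed by drug-resistance label. "PANSUSCEPTIBLE" and "UNKNOWN"
# appear in the exercise data; the other two labels are hypothetical.
DRTYPE_STYLES = {
    "UNKNOWN": {"fill_color": "lightgreen", "number_of_sides": 12, "radius": 6},
    "PANSUSCEPTIBLE": {"fill_color": "blue", "number_of_sides": 12, "radius": 6},
    "RESISTANT-NOT-MDR": {"fill_color": "yellow", "number_of_sides": 4, "radius": 8},
    "MDR-TB": {"fill_color": "red", "number_of_sides": 3, "radius": 9},
}


def style_for(drtype):
    """Return marker keyword arguments for a DRTYPE label.

    Labels are matched case-insensitively; unrecognized labels fall back
    to the UNKNOWN style.
    """
    key = str(drtype).strip().upper()
    return DRTYPE_STYLES.get(key, DRTYPE_STYLES["UNKNOWN"])


print(style_for("pansusceptible"))
```

Inside the marker loop, the returned dictionary could be unpacked into the call, for example `folium.RegularPolygonMarker([lat, lon], popup=case_no, **style_for(drtype))`, where `lat`, `lon`, `case_no` and `drtype` come from the current row.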

You have now created a map similar to the one on page 1718 of the Oeltmann et al. (2008) paper.
<img src="./images/oeltmann_map.png">

You have now mapped the genotyped TB cases by drug-resistance type within a refugee camp in Thailand. Please take some time to think about the question below and reflect on the data requirements and purpose of this exercise. Thank you.

> Question 1: The TB case data provided in this exercise included the latitude and longitude coordinates gathered from GPS units. How critical are these geospatial data to the completion of this exercise?


## Congratulations!

You have completed Notebook 2. Please proceed to Notebook 3.

## References

CDC, 2017 Tuberculosis Genotyping: What is tuberculosis (TB) genotyping? CDC TB Fact Sheets 2017. URL: https://www.cdc.gov/tb/publications/factsheets/statistics/genotyping.htm

CDC, 2017 GENType: New Genotyping Terminology to Integrate 24-locus MIRU-VNTR. CDC TB Fact Sheets 2017. URL: https://www.cdc.gov/tb/publications/factsheets/statistics/genotypingterminology.pdf

CDC, 2017 A New Tool to Diagnose Tuberculosis: The Xpert MTB/RIF Assay. URL: https://www.cdc.gov/tb/publications/factsheets/pdf/xpertmtb-rifassayfactsheet_final.pdf

Oeltmann, J. E., Varma, J. K., Ortega, L., Liu, Y., O’Rourke, T., Cano, M., … Maloney, S. A. (2008). Multidrug-Resistant Tuberculosis Outbreak among US-bound Hmong Refugees, Thailand, 2005. Emerging Infectious Diseases, 14(11), 1715–1721. http://doi.org/10.3201/eid1411.071629

Shah, N.S., et al. 2017. Transmission of Extensively Drug-Resistant Tuberculosis in South Africa. New England Journal of Medicine. January 19, 2017. 376:3. URL: http://www.nejm.org/doi/pdf/10.1056/NEJMoa1604544

Additional readings:

Reichman, Lee B. and Janice Hopkins Tanne. 2001. Timebomb: The Global Epidemic of Multi-Drug Resistant Tuberculosis. McGraw-Hill. ISBN 0-07-135924-9. URL: https://www.goodreads.com/book/show/1733578