# Exploring outbreaks in India using geopandas

## Outline

This notebook explores outbreaks of different diseases in India. Outbreak data by district is merged with the geospatial information for each district so that outbreaks can be mapped. The time evolution of outbreaks can also be mapped.


## Background

Disease outbreak data for India was downloaded from the [Integrated Disease Surveillance Program](https://idsp.nic.in/) in the form of weekly reports in .pdf format. The dataset spans 2009 to the present day and is extracted from the .pdf files using `idsp_parser.py`. It is then merged with district data from [Global Administrative Area Maps](https://gadm.org) to form a _master_ geopandas dataframe, which has all the information needed for plotting.

In [1]:
import geopandas as gpd
import pandas as pd
import random
import matplotlib.pyplot as plt
import scipy as sp
from datetime import datetime
import shapely as sh
%matplotlib notebook

df = pd.read_csv("/users/rsg/anla/podcast/country_disease_outbreaks/india/idsp_reporting/IDSP_data.csv")
df_1 = df.copy()
IND_2 = gpd.read_file("/data/datasets/Projects/PODCAST/country_district_shape_files/INDIA/gadm36_IND_2.shp")

In [8]:
df

Unnamed: 0.1,Unnamed: 0,ID_code,state,district,disease,cases,deaths,start_date,report_date,status,comments,raw
0,0,[],Dadra and Nagar Haveli,Dadra and Nagar Haveli,?,7,12,?,?,[],?,5 7 12 1 2 1 1 144 71 40 36
1,1,[],Dadra and Nagar Haveli,Dadra and Nagar Haveli,?,34,17,?,?,[],?,34 17 15 17 10 7 5 3 2 2 1 1 1
2,2,[],NCT of Delhi,West,?,2,2,?,?,[],?,3 2 2 1 1 1 1 0 20 40 60 80 100 120 140 160 ...
3,3,[],Andhra Pradesh,Prakasam,?,?,?,27-06-09,30-06-09,['Under Control'],?,taken Nellore i. Acute Diarrhoeal Disease 43 ...
4,4,[],Gujarat,Chhota Udaipur,Food Poisoning,?,?,25-06-09,27-06-09,['Under Control'],?,Pradesh Prakasam ii. Acute Diarrhoeal Disease ...
...,...,...,...,...,...,...,...,...,...,...,...,...
17371,17371,['TN/RMN/2019/06/0161'],Tamil Nadu,Ramanathapuram,Chikungunya,21,00,05-02-19,11-02-19,['Under Control'],?,TN/RMN/2019/06/0161 Tamil Nadu Ramanathapura m...
17372,17372,['TL/BLY/2019/06/0162'],Telangana,Ranga Reddy,Food Poisoning,40,00,04-02-19,06-02-19,['Under Control'],?,TL/BLY/2019/06/0162 Telangana Jayashankar Bhup...
17373,17373,['KN/CMN/2019/06/0163'],Karnataka,Mysore,?,31,00,21-01-19,?,['Under Surveillance'],?,KN/CMN/2019/06/0163 Karnataka Chamarajanagar a...
17374,17374,['KL/WYN/2019/06/0164'],Kerala,Wayanad,Food Poisoning,55,00,28-01-19,?,['Under Surveillance'],?,KL/WYN/2019/06/0164 Kerala Wayanad Food Poison...


# Cleaning up the data

The `idsp_parser.py` program returns a .csv file of outbreaks. However, further processing is required to clean the dataset. The steps are:

* Consolidate duplicated reporting
* Convert string columns to numeric, datetime and other datatypes
* Merge the outbreak DataFrame with the GeoDataFrame to connect the outbreak data with geospatial information

Once these basic steps are complete, the data can be queried and plotted.

## Converting strings to data types

The start date and report date columns are converted to datetime objects.

The cases and deaths columns are converted to numeric values.

In [2]:
df_1[['start_date','report_date']] = df[['start_date','report_date']].apply(lambda x: pd.to_datetime(x, dayfirst=True, errors='coerce'))
df_1[['cases','deaths']] = df[['cases','deaths']].apply(lambda x : pd.to_numeric(x, errors='coerce'))
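As a minimal sketch of what `dayfirst=True` and `errors='coerce'` do, the toy frame below mixes valid values with the `?` placeholder. The rows are made up for illustration and are not taken from the IDSP data.

```python
import pandas as pd

# Made-up rows that mimic the IDSP layout; '?' marks a value the parser could not find.
toy = pd.DataFrame({
    "cases": ["7", "?", "40"],
    "deaths": ["00", "2", "?"],
    "start_date": ["27-06-09", "?", "05-02-19"],
})

# Strings that cannot be parsed as numbers become NaN instead of raising an error.
toy[["cases", "deaths"]] = toy[["cases", "deaths"]].apply(pd.to_numeric, errors="coerce")

# dayfirst=True reads '27-06-09' as 27 June 2009; the '?' entry becomes NaT.
toy["start_date"] = pd.to_datetime(toy["start_date"], dayfirst=True, errors="coerce")

print(toy.dtypes)
print(toy)
```

Because NaN is a floating point value, a column that contains any missing entry ends up as float rather than integer.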

In [4]:
df_1.to_csv('idsp_outbreaks_raw.csv')

## Duplicates

If an outbreak continues into the next week, a follow-up report is common. It will have the same location, disease and start date, with revised figures for the total number of cases (I believe). Also, post-2016 it will share the ID code of the original report of that outbreak.

The solution is to use:

```
DataFrame.duplicated()
```

In [None]:
df_1 = df_1.sort_values(by='start_date')#.sort_values(by='report_date')

In [None]:
df_1[df_1.duplicated(subset=['state','district','disease','start_date'], keep='last')].shape

__There appear to be 664 follow-up report duplications.__ This means that we should probably throw away these records. First, explore them.
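One possible way to discard follow-up reports, sketched here on made-up rows, is to sort by report date and keep only the latest report for each outbreak. Treating `state`, `district`, `disease` and `start_date` as the outbreak key is an assumption carried over from the cell above, not a confirmed rule of the IDSP reports.

```python
import pandas as pd

# Made-up reports: the first two rows are the same outbreak reported in consecutive weeks.
reports = pd.DataFrame({
    "state": ["Kerala", "Kerala", "Gujarat"],
    "district": ["Wayanad", "Wayanad", "Surat"],
    "disease": ["Cholera", "Cholera", "Cholera"],
    "start_date": pd.to_datetime(["2019-01-28", "2019-01-28", "2019-02-04"]),
    "report_date": pd.to_datetime(["2019-02-01", "2019-02-08", "2019-02-06"]),
    "cases": [55, 61, 12],
})

key = ["state", "district", "disease", "start_date"]

# Sort so the most recent report of each outbreak comes last, then keep only that one.
latest = (
    reports.sort_values("report_date")
           .drop_duplicates(subset=key, keep="last")
           .sort_values("start_date")
           .reset_index(drop=True)
)
print(latest[["district", "report_date", "cases"]])
```

Note that `sort_values` places missing dates last by default, so a report without a `report_date` would be treated as the latest one unless `na_position='first'` is passed.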

In [None]:
all_duplicates = df_1[(df_1.disease != '?') &
     (df_1.duplicated(subset=['state','district','disease','start_date'],
                      keep=False))].sort_values(by=['start_date','cases'])

In [None]:
all_duplicates[:10]

Let's explore these datasets. _outbreaks_ holds the text information about each outbreak: when, where, what, how many cases and the current status. There is also a comments field that contains potentially useful but unstructured data.

Count the rows with the same ID_code.

In [None]:
outbreaks[outbreaks.duplicated('ID_code') == False].set_index(['state','district']).count()

In [None]:
outbreaks[outbreaks.duplicated('cases',keep=False) == False]#.groupby(['state','district']).count()

__ISSUE:__ Follow-up reports are easy to find for the post-2016 records, which have codified ID_codes. The pre-2016 records, however, reuse the serial numbers each week, resulting in hundreds of repeats. This could be solved by recording only proper ID codes in the ID code column. A more sophisticated matching routine could also be used to find reports with the same day, disease and location and treat them as one outbreak. We could also retrospectively generate ID codes for the pre-2016 records, which could be neat.
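A rough sketch of generating retrospective identifiers: number each distinct combination of state, district, disease and start date with `groupby(...).ngroup()`. The column name `outbreak_key` and the rows are hypothetical, and whether this key really identifies a single outbreak is an assumption that would need checking against the reports.

```python
import pandas as pd

# Made-up pre-2016 style rows: serial numbers restart every week, so ID_code is not unique.
old = pd.DataFrame({
    "ID_code": ["1", "1", "2"],
    "state": ["Kerala", "Kerala", "Kerala"],
    "district": ["Wayanad", "Wayanad", "Kottayam"],
    "disease": ["Cholera", "Cholera", "Cholera"],
    "start_date": pd.to_datetime(["2012-03-05", "2012-03-05", "2012-03-05"]),
})

key = ["state", "district", "disease", "start_date"]

# ngroup() gives every distinct key combination its own integer label;
# `outbreak_key` is a hypothetical column name, not part of the IDSP data.
old["outbreak_key"] = old.groupby(key, sort=False).ngroup()
print(old[["ID_code", "district", "outbreak_key"]])
```

Rows with a missing value in any key column would need separate handling before they can be matched this way.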

Any value that was not found when creating the dataframe has been replaced with a `?`. One way to handle this is to drop all records that contain a `?`. Another is to go through and format the columns correctly. The latter is better as it retains more information.
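One option, shown here as a sketch on made-up CSV text, is to tell pandas about the `?` placeholder when the file is read, so that missing values can be counted and dropped selectively rather than discarding whole records.

```python
import io
import pandas as pd

# Made-up CSV text in the same spirit as the parser output.
csv_text = """state,district,disease,cases
Kerala,Wayanad,Food Poisoning,55
Karnataka,Mysore,?,31
Gujarat,Surat,Cholera,?
"""

# na_values tells pandas to treat '?' as a missing value while reading.
toy = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

print(toy.isna().sum())                 # missing values per column
print(toy.dropna(subset=["disease"]))   # drop rows only where the needed field is missing
```

With the placeholder recognised at read time, the `cases` column is parsed as numeric directly.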

We would like to

* Convert the cases and deaths fields to numeric values
* Convert the dates to datetime objects

Using `apply`, call `pd.to_numeric()` on the cases and deaths columns. Errors are coerced, meaning a value that cannot be parsed becomes NaN, which can then be dropped. A `lambda` function is used here so that `errors='coerce'` can be passed.

In [5]:
outbreaks = df.copy()  # assumption: `outbreaks` is a working copy of the raw IDSP table (it is not assigned anywhere else in this notebook)
outbreaks[['cases','deaths']] = outbreaks[['cases','deaths']].apply(lambda x : pd.to_numeric(x, errors='coerce'))

Similarly, `apply` `pd.to_datetime` to convert the dates.

In [None]:
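# errors='coerce' turns unparseable entries such as '?' into NaT; with errors='ignore' a single bad value leaves the whole column as strings
outbreaks[['start_date','report_date']] = outbreaks[['start_date','report_date']].apply(lambda x: pd.to_datetime(x, dayfirst=True, errors='coerce'))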

In [None]:
outbreaks.start_date[0]

## Adding the geospatial element

There are a few ways we can connect the outbreak data with the region map of India.

`IND_2` is a geopandas GeoDataFrame that holds the shapefile geometry for the administrative regions of India. The GADM column names follow this convention:

 * NAME_1 is the state
 * NAME_2 is the district
 * NAME_3 is the city


In [None]:
IND_2

Most of the data columns in the IND_2 geodataframe are not populated and therefore not helpful.

## Quick plot

In [None]:
IND_2.plot()

The two dataframes can be merged using the dataframe method `.merge()`. By merging the two, we attach the district geometry to the outbreaks in that district.

In this case we are interested in the _state_ and _district_ columns because that is the highest resolution the outbreak dataframe has. State must be included because some districts share the same name but are located in different states.

To do this, _NAME\_1_ and _NAME\_2_ are renamed to _state_ and _district_ for compatibility with the outbreak dataframe. Next, any finer-grained (level 3, settlement) rows are collapsed: the `dissolve` method groups rows that share a state and district and unions their geometries into one shape per district, while `aggfunc='first'` keeps the first value of any remaining attribute column.

In [7]:
district_locations = IND_2[['NAME_1','NAME_2','geometry']]\
                    .rename(columns={'NAME_1':'state','NAME_2':'district'})\
                    .dissolve(by=['state','district'],aggfunc='first')

AttributeError: 'GeoDataFrame' object has no attribute 'df'

In [None]:
district_locations
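To illustrate what `dissolve` does, here is a small sketch with made-up squares instead of GADM shapes. District "A" is deliberately split into two rows so that the union of geometries is visible; the `label` column is hypothetical.

```python
import geopandas as gpd
from shapely.geometry import box

# Toy layer: district "A" is split into two adjacent unit squares, "B" is a single square.
toy = gpd.GeoDataFrame(
    {
        "state": ["S", "S", "S"],
        "district": ["A", "A", "B"],
        "label": ["a1", "a2", "b1"],
    },
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(5, 0, 6, 1)],
)

# One row per (state, district): geometries are unioned, attribute columns use aggfunc.
dissolved = toy.dissolve(by=["state", "district"], aggfunc="first")

area_a = dissolved.geometry.loc[("S", "A")].area
label_a = dissolved.loc[("S", "A"), "label"]
print(len(dissolved), area_a, label_a)
```

The merged shape for "A" covers both squares, while the `label` column keeps only the value from the first row of each group.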

In [None]:
IND_2[IND_2.NAME_1 == 'Kerala'].set_index('NAME_2').loc[['Alappuzha','Ernakulam','Kottayam']][['geometry']]

This geodataframe contains the unique names and geometry for each district in India. It must now be merged with the outbreak data on the state and district columns.

## Plot a random district and label it

To plot the districts themselves we can use `gpd.plotting.plot_dataframe()`.

In [None]:
fig, ax = plt.subplots()

# choose a (state, district) index entry at random
district = random.choice(district_locations.index)
selected = district_locations.loc[[district]]

# draw every district, then overlay the chosen one in red
gpd.plotting.plot_dataframe(district_locations, ax=ax)
selected.plot(ax=ax, color='red')

# label the chosen district at its centroid
centroid = selected.geometry.centroid.iloc[0]
ax.annotate(", ".join(district), xy=(centroid.x, centroid.y))

plt.show()

# Merge outbreaks with geospatial data on state and district columns

In [None]:
master = district_locations.merge(df_1,on=['state','district'])
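Because `merge` defaults to an inner join, reports whose state or district name is spelled differently in the two tables are silently dropped. A minimal sketch of how this could be checked, using made-up tables with one deliberate spelling mismatch:

```python
import pandas as pd

# Made-up tables; the district names are chosen to show one spelling mismatch.
districts = pd.DataFrame({
    "state": ["Karnataka", "Kerala"],
    "district": ["Mysore", "Wayanad"],
})
reports = pd.DataFrame({
    "state": ["Karnataka", "Kerala", "Kerala"],
    "district": ["Mysuru", "Wayanad", "Wayanad"],
    "cases": [31, 55, 61],
})

# how='left' keeps every report; indicator=True adds a `_merge` column
# that records whether a matching district was found.
audit = reports.merge(districts, on=["state", "district"], how="left", indicator=True)
unmatched = audit[audit["_merge"] == "left_only"]
print(unmatched[["state", "district"]].drop_duplicates())
```

Listing the unmatched names first makes it possible to correct spellings before building the master table.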

In [6]:
master


In [None]:
master[['state','district','disease','cases','geometry']].to_file('IND_outbreaks.shp')

In [None]:
%matplotlib
%matplotlib inline

In [None]:
district_locations

In [None]:
cholera_case_sum = district_locations.merge(master[(master.disease ==  'Cholera')].groupby(['state','district'])['cases'].sum(),
                         left_index=True,
                         right_index=True)

In [None]:
import numpy as np  # the NumPy aliases in the top-level scipy namespace (such as sp.log) are deprecated
cholera_case_sum['log_cases'] = np.log(cholera_case_sum['cases'])

In [None]:
cholera_case_sum

In [None]:
from matplotlib.colors import LogNorm, SymLogNorm

In [None]:
cholera_case_sum[cholera_case_sum.cases > 0].plot(figsize=(15,15), cmap='Reds', column='log_cases', legend=True)

Save to file for later use.

In [None]:
cholera_case_sum[cholera_case_sum.cases == cholera_case_sum.cases.max()].plot()

In [None]:
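# store the (state, district) index as plain strings, since tuple values may not be writable to a shapefile
cholera_case_sum['region'] = [" ".join(idx) for idx in cholera_case_sum.index]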

In [None]:
cholera_case_sum

In [None]:
cholera_case_sum.to_file("cholera_case_sum.shp")

# Combination of data sets to map outbreaks

## Plot Cholera outbreaks

Let's plot the total cholera outbreaks. To do this we select the cholera outbreaks, then keep only the district and cases fields before merging with the district locations. The result is a composite dataframe that can be mapped.

In [None]:
set(outbreaks.disease)

In [None]:
pwd

In [None]:
district_locations.merge(
    outbreaks[outbreaks['disease'] == 'Cholera'],
    on=['state','district']
).to_file('./IDSP_cholera_outbreaks_india_2009_to_2016_geopandas.shp')

In [None]:
test = gpd.read_file('./IDSP_cholera_outbreaks_india_2009_to_2016_geopandas.shp')

In [None]:
test

In [None]:
ls

First, let's figure out the total number of cholera cases.

In [None]:
outbreaks[outbreaks.disease == 'Cholera'].cases.dropna().sum()

The district locations are ready to be merged with the outbreak data. In this case we just need the number of cholera cases, so we select those before merging. That gives a concise geodataframe. Note that the geodataframe should be the one calling the `merge()` method; otherwise the result will be a normal dataframe and lose its geo prefix and special abilities. This can be rectified, but it is not quite so pleasing.

In [None]:
composite = district_locations.merge(
    outbreaks[outbreaks['disease'] == 'Cholera'][['state','district','cases']],
    on=['state','district']
)
print(type(composite))
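The effect of the merge order on the result type can be seen in this small sketch with made-up points. It also shows one way to restore a GeoDataFrame when the plain dataframe was on the left.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Made-up geometry and case counts for two districts.
shapes = gpd.GeoDataFrame(
    {"district": ["A", "B"]},
    geometry=[Point(0, 0), Point(1, 1)],
)
counts = pd.DataFrame({"district": ["A", "B"], "cases": [3, 5]})

geo_first = shapes.merge(counts, on="district")     # GeoDataFrame on the left
plain_first = counts.merge(shapes, on="district")   # plain DataFrame on the left

# Wrapping the plain result restores the geo behaviour.
restored = gpd.GeoDataFrame(plain_first, geometry="geometry")

print(type(geo_first).__name__, type(plain_first).__name__, type(restored).__name__)
```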

Drop NaN values and aggregate using `.dissolve()`. This gives us a single row for each district that contains the geometry and the number of cholera cases only.

In [None]:
composite

In [None]:
cholera_district_cases = composite.dropna().dissolve(by=['state','district'],aggfunc='sum')
print(type(cholera_district_cases))
print(cholera_district_cases['cases'].sum())  # sum only the numeric column, not the geometry

In [None]:
%matplotlib notebook

In [None]:
fig, ax = plt.subplots(figsize=(5,5),dpi=150)

IND_2.plot(ax=ax,
           color='white',
           edgecolor='black',
           alpha=1,
           linewidth=0.05,
          )

cholera_district_cases.plot(column = 'cases',
                            cmap='Reds',
                            ax=ax,
                            legend=True,
                           )

plt.title('Cholera cases 2009-present')

# label the 3 most infected districts
for index in cholera_district_cases.nlargest(3,columns='cases').index:
    top_district = cholera_district_cases.loc[[index]]
    # GeoSeries.plot is the public equivalent of plot_point_collection, which is deprecated in recent geopandas
    top_district.centroid.plot(ax=ax,
                               marker='+',
                               label=" ".join(index)+": "+str(top_district.cases.sum())
                              )
    
#     plt.annotate(s= " ".join(index),
#                  xy=(cholera_district_cases.loc[[index]].centroid.x,
#                      cholera_district_cases.loc[[index]].centroid.y),
#                  horizontalalignment='left',
#                  verticalalignment='bottom'
#                 )

# get the total bounding box
x0,y0,x1,y1 = cholera_district_cases.total_bounds

# display total cases as an inset
plt.text(x0 + 1  * (x1-x0),
         y0 + 1  * (y1-y0),
         'total cases = '+str(int(cholera_district_cases.cases.sum())),
         horizontalalignment='right',
        )

# plt.legend(loc=1)

plt.tight_layout()

plt.show()

Heatmap showing total cholera cases by district from 2009 to present. Later versions of the data analysis code seem to produce wildly different results, which isn't reassuring. A ground-truth metric against which the data can be compared would be very useful!

At this point we have some informative data. However, it should be noted that this representation shows the total number of reported cases by district. The districts themselves are not equal in area or population, so this graphic shows neither the spatial density of cholera nor the infection rate.
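A spatial density could be approximated by dividing cases by district area, as in the sketch below. The squares are made up and their coordinates are assumed to be in metres. The GADM shapes are in latitude and longitude, so the real data would first need reprojecting with `to_crs` to an equal-area coordinate system; the choice of system is left open here. An infection rate would additionally need population figures, which this notebook does not have.

```python
import geopandas as gpd
from shapely.geometry import box

# Made-up districts with coordinates in metres: a 10 km square and a 20 km square.
toy = gpd.GeoDataFrame(
    {"district": ["small", "large"], "cases": [100, 100]},
    geometry=[box(0, 0, 10_000, 10_000), box(0, 0, 20_000, 20_000)],
)

# .area is in squared coordinate units, so divide by 1e6 to convert m^2 to km^2.
toy["area_km2"] = toy.geometry.area / 1e6
toy["cases_per_km2"] = toy["cases"] / toy["area_km2"]
print(toy[["district", "cases", "area_km2", "cases_per_km2"]])
```

Both toy districts report the same number of cases, but the smaller one has four times the density.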

In [None]:
cholera_state_cases = cholera_district_cases.dissolve(by='state',aggfunc='sum')

In [None]:
cholera_state_cases

In [None]:
mah_cholera = outbreaks[(outbreaks['state'] == 'Maharashtra') & (outbreaks.disease == 'Cholera')]

In [None]:
mah_cholera[['district','cases','start_date']]

In [None]:
outbreaks.loc[8801].raw

In [None]:
# isin() matches any of the three diseases; a chained `or` of strings evaluates to 'Cholera' alone
master[(master['state'] == 'Kerala') &
       (master['disease'].isin(['Cholera', 'Acute Diarrheal Disease', 'Food Poisoning']))].to_file('idsp_IVO_kerala_lake.shp')

In [None]:
test_kerala_lake = gpd.read_file('idsp_IVO_kerala_lake.shp')

In [None]:
set(test_kerala_lake.disease)

In [None]:
!cp idsp_IVO* /data/datasets/Projects/REVIVAL/disease_data/