# Continuing from the previous answer: making a choropleth map

#### Question 2: What strains and subtypes were most common in the US in 2017-18?

Answering questions 2 and 3 will require retrieving records from sequence databases.

#### Data source

The datasets in this project are from the [Influenza Research Database](https://www.fludb.org/brc/home.spg?decorator=influenza), a viral data repository that pulls from [NCBI GenBank](https://www.ncbi.nlm.nih.gov/genbank/) and [NCBI RefSeq](https://www.ncbi.nlm.nih.gov/refseq/).

I'm going to pull records by strain, for both influenza A and B, for the 2017-18 flu season.

**Note:** This analysis is performed only on data from the Influenza Research Database and is not a complete analysis of the 2017-18 flu season, since not all US flu cases may have been reported, sequenced, and deposited in this database.


---
# Preparing the Data: Moving flu raw data into DataFrames

In [1]:
# Import data analysis libraries 
import numpy as np
import pandas as pd

# Import visualization libraries
import seaborn as sns
sns.set_style('whitegrid')
sns.set_context('paper', font_scale=1.)
import matplotlib.pyplot as plt
%matplotlib inline 

In [2]:
# Load the flu datasets and look at the available column names

df_fluA = pd.read_csv('fluA_strains.tsv', sep='\t')
df_fluB = pd.read_csv('fluB_strains.tsv', sep='\t')

df_fluA.columns

Index(['Strain Name', 'Complete Genome', 'Subtype', 'Collection Date', 'Host',
       'Country', 'State/Province', 'Geographic Grouping', 'Flu Season',
       'Submission Date', 'Passage History', 'Specimen Source Health Status',
       '1 PB2', '2 PB1', '3 PA', '4 HA', '5 NP', '6 NA', '7 MP', '8 NS', 'Age',
       'Gender', 'M2 31N', 'M2 26F', 'M2 27A', 'M2 30T', 'M2 34E',
       'NA 275Y N1', 'NA 292K N2', 'NA 119V N2', 'NA 294S N2', 'PB1-F2 66S',
       'PB2 E627K', 'PB2 D701N', 'PB2 A199S', 'PB2 A661T', 'PB2 V667I',
       'PB2 K702R', 'PA S409N', 'NP L136M', 'M2 A16G', 'M2 C55F', 'NS1 T92E',
       'RERRRKKR', 'Sensitive Drug', 'Resistant Drug', 'Submission Date.1',
       'NCBI Taxon ID', 'pH1N1-like', 'US Swine H1 Clade',
       'Global Swine H1 Clade test', 'H5 Clade', 'Unnamed: 52'],
      dtype='object')
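The trailing `'Unnamed: 52'` column in the list above is most likely the result of a trailing tab on each line of the TSV file; that is an assumption, not something verified against the raw file. A minimal sketch of how such an empty column could be dropped, using a tiny invented TSV in place of `fluA_strains.tsv`:

```python
import io
import pandas as pd

# Tiny invented TSV with a trailing tab on each line, which pandas reads
# as an extra, empty 'Unnamed: N' column (stand-in for the real file)
raw = (
    "Strain Name\tSubtype\tState/Province\t\n"
    "A/Alabama/01/2018\tH1N1\tAlabama\t\n"
)
toy = pd.read_csv(io.StringIO(raw), sep='\t')

# Drop columns that contain no values at all
toy = toy.dropna(axis=1, how='all')
print(list(toy.columns))
```

Dropping by "all values missing" avoids hard-coding the column number, which would change if the export gained or lost fields.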

In [6]:
# Let's take a look at the first few rows of data for Flu A
df_fluA.head(3)

Unnamed: 0,Strain Name,Complete Genome,Subtype,Collection Date,Host,Country,State/Province,Geographic Grouping,Flu Season,Submission Date,...,RERRRKKR,Sensitive Drug,Resistant Drug,Submission Date.1,NCBI Taxon ID,pH1N1-like,US Swine H1 Clade,Global Swine H1 Clade test,H5 Clade,Unnamed: 52
0,A/Alabama/01/2018,Yes,H1N1,01/02/2018,Human,USA,Alabama,North America,17-18,2018-03-24,...,No,-N/A-,-N/A-,03/24/2018,11320,Mixed Positive and Negative Segments,npdm,1A.3.3.2,-N/A-,
1,A/Alabama/02/2018,Yes,H1N1,01/03/2018,Human,USA,Alabama,North America,17-18,2018-03-24,...,No,"Oseltamivir,Zanamivir",-N/A-,03/24/2018,11320,Mixed Positive and Negative Segments,npdm,1A.3.3.2,-N/A-,
2,A/Alabama/03/2018,Yes,H3N2,01/03/2018,Human,USA,Alabama,North America,17-18,2018-03-24,...,No,-N/A-,-N/A-,03/24/2018,11320,Negative,-N/A-,-N/A-,-N/A-,


In [7]:
# And here are the first few rows for Flu B
df_fluB.head(3)

Unnamed: 0,Strain Name,Complete Genome,Subtype,Collection Date,Host,Country,State/Province,Geographic Grouping,Flu Season,Submission Date,...,RERRRKKR,Sensitive Drug,Resistant Drug,Submission Date.1,NCBI Taxon ID,pH1N1-like,US Swine H1 Clade,Global Swine H1 Clade test,H5 Clade,Unnamed: 52
0,B/Alabama/01/2018,No,-N/A-,01/15/2018,Human,USA,Alabama,North America,17-18,2018-04-02,...,-N/A-,-N/A-,-N/A-,04/02/2018,11520,-N/A-,-N/A-,-N/A-,-N/A-,
1,B/Alabama/02/2018,Yes,-N/A-,01/22/2018,Human,USA,Alabama,North America,17-18,2018-04-03,...,-N/A-,-N/A-,-N/A-,04/03/2018,11520,-N/A-,-N/A-,-N/A-,-N/A-,
2,B/Alabama/03/2018,Yes,-N/A-,02/06/2018,Human,USA,Alabama,North America,17-18,2018-04-03,...,-N/A-,-N/A-,-N/A-,04/03/2018,11520,-N/A-,-N/A-,-N/A-,-N/A-,
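With the tables loaded, a quick tally of the `Subtype` column gives a first pass at Question 2. The sketch below uses a small invented frame in place of `df_fluA`, and the helper name `subtype_counts` is hypothetical, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for df_fluA (values are invented for illustration)
toy_fluA = pd.DataFrame({
    'Strain Name': ['A/Alabama/01/2018', 'A/Alabama/02/2018',
                    'A/Alabama/03/2018', 'A/Iowa/01/2018'],
    'Subtype': ['H1N1', 'H1N1', 'H3N2', 'H3N2'],
    'Flu Season': ['17-18', '17-18', '17-18', '16-17'],
})

def subtype_counts(df, season='17-18'):
    """Return the number of records per subtype for one flu season."""
    in_season = df[df['Flu Season'] == season]
    return in_season['Subtype'].value_counts()

counts = subtype_counts(toy_fluA)
print(counts)
```

On the real data the same call would be `subtype_counts(df_fluA)`. Note that the Flu B rows shown above carry `-N/A-` in `Subtype`, so this tally is only informative for Flu A.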


---
# Viewing data on a US Map

In [3]:
# imports libraries for a choropleth map

import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)


The first thing to do is convert the state names to two-letter abbreviations so the map can read them correctly.

Here's a dictionary of state names to abbreviations, thanks to [rogerallen](https://gist.github.com/rogerallen/1583593):


In [40]:
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
}

In [52]:
# Use pandas.Series.map to create a new column 'abbrev' with the two-letter state abbreviation

df_fluA['abbrev'] = df_fluA['State/Province'].map(us_state_abbrev)
df_fluA.head(3)

Unnamed: 0,Strain Name,Complete Genome,Subtype,Collection Date,Host,Country,State/Province,Geographic Grouping,Flu Season,Submission Date,...,Sensitive Drug,Resistant Drug,Submission Date.1,NCBI Taxon ID,pH1N1-like,US Swine H1 Clade,Global Swine H1 Clade test,H5 Clade,Unnamed: 52,abbrev
0,A/Alabama/01/2018,Yes,H1N1,01/02/2018,Human,USA,Alabama,North America,17-18,2018-03-24,...,-N/A-,-N/A-,03/24/2018,11320,Mixed Positive and Negative Segments,npdm,1A.3.3.2,-N/A-,,AL
1,A/Alabama/02/2018,Yes,H1N1,01/03/2018,Human,USA,Alabama,North America,17-18,2018-03-24,...,"Oseltamivir,Zanamivir",-N/A-,03/24/2018,11320,Mixed Positive and Negative Segments,npdm,1A.3.3.2,-N/A-,,AL
2,A/Alabama/03/2018,Yes,H3N2,01/03/2018,Human,USA,Alabama,North America,17-18,2018-03-24,...,-N/A-,-N/A-,03/24/2018,11320,Negative,-N/A-,-N/A-,-N/A-,,AL
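`Series.map` returns `NaN` for any value that has no key in the dictionary, so places outside the 50 states (for example District of Columbia or Puerto Rico, if they occur in the data) would silently drop out of the map. A minimal sketch of a check for this, using a toy frame and a hypothetical two-entry subset of the dictionary:

```python
import pandas as pd

# Hypothetical subset of the full state-to-abbreviation mapping
abbrev_subset = {'Alabama': 'AL', 'Iowa': 'IA'}

# Toy stand-in for the 'State/Province' column (invented values)
toy = pd.DataFrame({'State/Province': ['Alabama', 'Iowa',
                                       'District of Columbia', None]})
toy['abbrev'] = toy['State/Province'].map(abbrev_subset)

# Rows where a state name was present but had no entry in the dictionary
unmapped = toy[toy['abbrev'].isna() & toy['State/Province'].notna()]
unmapped_names = sorted(unmapped['State/Province'].unique())
print(unmapped_names)
```

Any names that show up here can either be added to the dictionary or excluded deliberately.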


In [53]:
# select only the H1N1 records into a new dataframe

df_fluA_H1N1 = df_fluA[df_fluA['Subtype'] == 'H1N1']

In [94]:
# count H1N1 records per state (a Series), then convert to a dataframe to load into the choropleth map

df_fluA_H1N1_cts = df_fluA_H1N1['abbrev'].value_counts()

dfA = pd.DataFrame(df_fluA_H1N1_cts)
dfA = dfA.reset_index()
dfA.columns = ['abbrev', 'counts']
dfA.head(3)

Unnamed: 0,abbrev,counts
0,IA,172
1,CA,69
2,MN,62


In [96]:
data_H1N1 = dict(
        type = 'choropleth',
        colorscale = 'Greens',
        reversescale = True,
        locations = dfA['abbrev'],
        z = dfA['counts'],
        locationmode = 'USA-states',
        text = dfA['abbrev'],
        marker = dict(line = dict(color = 'rgb(255,255,255)',width = 1)),
        colorbar = {'title':'reported H1N1 cases'}
            ) 

In [97]:
layout = dict(title = 'Reported/Sequenced Influenza H1N1 for 2017-2018',
              geo = dict(scope='usa',
                         showlakes = True,
                         lakecolor = 'rgb(85,173,240)')
             )

In [98]:
choromap = go.Figure(data = [data_H1N1],layout = layout)
iplot(choromap,validate=False)
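Since the same steps are needed for each subtype, they can be wrapped in two small helpers. This is only a sketch: the function names `state_counts` and `choropleth_trace` are hypothetical, the toy frame is invented, and the hover text is assumed to be the state abbreviation.

```python
import pandas as pd

def state_counts(df, subtype):
    """Count records per state abbreviation for one subtype."""
    subset = df[df['Subtype'] == subtype]
    return (subset['abbrev'].value_counts()
            .rename_axis('abbrev')
            .reset_index(name='counts'))

def choropleth_trace(counts, colorscale, label):
    """Build a trace dictionary in the same shape as data_H1N1."""
    return dict(
        type='choropleth',
        colorscale=colorscale,
        reversescale=True,
        locations=counts['abbrev'],
        z=counts['counts'],
        locationmode='USA-states',
        text=counts['abbrev'],  # assumption: hover text is the abbreviation
        marker=dict(line=dict(color='rgb(255,255,255)', width=1)),
        colorbar={'title': label},
    )

# Toy stand-in for df_fluA after the 'abbrev' column has been added
toy = pd.DataFrame({
    'Subtype': ['H1N1', 'H1N1', 'H3N2', 'H1N1'],
    'abbrev': ['IA', 'IA', 'CA', 'CA'],
})
h1n1_counts = state_counts(toy, 'H1N1')
trace = choropleth_trace(h1n1_counts, 'Greens', 'reported H1N1 cases')
print(h1n1_counts)
```

The resulting dictionary can be passed to `go.Figure(data=[trace], layout=layout)` in the same way as `data_H1N1`.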

In [None]:
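# Here is a map for H3N2: count H3N2 records per state, as was done for H1N1

df_byState_H3N2 = (df_fluA[df_fluA['Subtype'] == 'H3N2']['abbrev']
                   .value_counts()
                   .rename_axis('abbrev')
                   .reset_index(name='counts'))
df_byState_H3N2.head(5)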

In [None]:
data_H3N2 = dict(
        type = 'choropleth',
        colorscale = 'Blues',
        reversescale = True,
        locations = df_byState_H3N2['abbrev'],
        z = df_byState_H3N2['counts'],
        locationmode = 'USA-states',
        text = df_byState_H3N2['abbrev'],
        marker = dict(line = dict(color = 'rgb(255,255,255)',width = 1)),
        colorbar = {'title':'reported H3N2 cases'}
            ) 

In [None]:
layout = dict(title = 'Reported/Sequenced Influenza H3N2 for 2017-2018',
              geo = dict(scope='usa',
                         showlakes = True,
                         lakecolor = 'rgb(85,173,240)')
             )

In [None]:
choromap = go.Figure(data = [data_H3N2],layout = layout)
iplot(choromap,validate=False)

Now that I have a general idea of what this dataset holds, my next analysis will download the sequences themselves. To be continued...