

# Question 2: What strains and subtypes were most common in the US in 2017-18?

Answering questions 2 and 3 will require retrieving records from sequence databases.

#### Data source

The datasets in this project are from the [Influenza Research Database](https://www.fludb.org/brc/home.spg?decorator=influenza), a viral data repository that pulls from [NCBI GenBank](https://www.ncbi.nlm.nih.gov/genbank/) and [NCBI RefSeq](https://www.ncbi.nlm.nih.gov/refseq/).

I'm going to pull records by strain, for both influenza A and B, for the 2017-18 flu season.

**Note:** This analysis uses only data from the Influenza Research Database and is not a complete picture of the 2017-18 flu season, since not all US flu cases were necessarily reported, sequenced, and deposited in this database.


---
# Preparing the Data: Moving flu raw data into DataFrames

In [1]:
# Import data analysis libraries 
import numpy as np
import pandas as pd

# Import visualization libraries
import seaborn as sns
sns.set_style('whitegrid')
sns.set_context('paper', font_scale=1.)
import matplotlib.pyplot as plt
%matplotlib inline 

In [2]:
# Load the flu datasets and look at the available column names

df_fluA = pd.read_csv('fluA_strains.tsv', sep='\t')
df_fluB = pd.read_csv('fluB_strains.tsv', sep='\t')

df_fluA.columns

FileNotFoundError: File b'raw_data/fluA_strains.tsv' does not exist
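The error above mentions a `raw_data/` folder, while the cell reads from the working directory, so which location is right depends on where the downloads were saved. Here is a minimal sketch of a loader that fails with a clearer message. The helper name `load_strain_table` and the default `data_dir` are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_strain_table(filename, data_dir='raw_data'):
    """Read a tab-separated strain export, checking that the file exists first.

    `data_dir` is an assumed location; point it at wherever the
    downloaded .tsv files actually live.
    """
    path = Path(data_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(
            "Expected {} (current directory: {})".format(path, Path.cwd())
        )
    return pd.read_csv(path, sep='\t')
```

With the files in place, `df_fluA = load_strain_table('fluA_strains.tsv')` would stand in for the direct `pd.read_csv` call.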

There are plenty of columns to work with, ranging from subtypes to known mutations and drug resistance. I'm going to keep it simple at first, before moving on to sequence data, and just explore some basic record attributes such as:
* subtypes; 
* locations; and 
* collection dates. 

In [None]:
# Let's take a look at the first few rows of data for Flu A
df_fluA.head(3)

In [None]:
# And here are the first few rows for Flu B
df_fluB.head(3)

---

# Extracting Subtype Data by Month 

Now that the DataFrames have been loaded, I can check how many records exist and what information they hold. This will help determine which tables and figures are appropriate to neatly summarize the data.

In [None]:
df_fluA.info()

There are about 5k records, and I'd like to limit this investigation to strains collected from Humans. Let's take a look to see if there are any other "Host" animals.

In [None]:
df_fluA['Host'].unique()


Plenty of bird hosts, along with the unfortunate dog and pig, so let's make new DataFrames containing only human hosts, named:

df_fluA_Human and df_fluB_Human


In [None]:
df_fluA_Human = df_fluA[df_fluA['Host'] == 'Human']
df_fluB_Human = df_fluB[df_fluB['Host'] == 'Human']
df_fluA_Human.info()

Now that we have a proper DataFrame containing only records from human infections, let's look at the following, for Flu A only:

##### General Info
* How many records: 'n_records' 
* What were the top 5 states reporting: 'top5_states'
* What states had lowest reporting: 'bottom5_states'

##### Subtype Details
* How many unique subtypes are there: 'n_uni_subtypes'
* What are those subtypes: 'uni_subtypes'


In [None]:
n_records = len(df_fluA_Human)
top5_states = df_fluA_Human['State/Province'].value_counts().head(5)
bottom5_states = df_fluA_Human['State/Province'].value_counts().tail(5)
n_uni_subtypes = df_fluA_Human['Subtype'].nunique()
uni_subtypes = df_fluA_Human['Subtype'].unique()

print("\033[1m" + "General Info" + "\033[0m")
print("Total number of flu A records: {}".format(n_records))
print("Top 5 states reporting Flu A:\n{}\n".format(top5_states))
print("Bottom 5 states reporting Flu A:\n{}\n".format(bottom5_states))
print("\033[1m" + "Subtype Details" + "\033[0m")
print("Number of unique subtypes reported: {}".format(n_uni_subtypes))
print("Those subtypes are: {}".format(uni_subtypes))

For Flu A, there are only four reported subtypes, but for simplicity I will count 'H1' with 'H1N1', and 'H3' with 'H3N2'. 

To see the trend over the season, let's plot when these samples were collected ('Collection Date') by month.
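The month-by-month grouping can be sketched on a toy frame before touching the real data. The rows below are made up for illustration. Storing the month as an ordered categorical in flu-season order (October first) is one option; it keeps grouped counts in season order without passing an explicit order each time.

```python
import calendar

import pandas as pd

# Flu-season month order, starting in October
SEASON_ORDER = ['Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar',
                'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep']

# Made-up records, for illustration only
toy = pd.DataFrame({
    'Collection Date': ['2017-10-03', '2017-12-15', '2018-01-20', '2018-01-22'],
    'Subtype': ['H3N2', 'H3N2', 'H1N1', 'H3N2'],
})

# Parse the date strings, then derive the abbreviated month name
toy['Collection Date'] = pd.to_datetime(toy['Collection Date'])
month_names = toy['Collection Date'].dt.month.map(lambda m: calendar.month_abbr[m])

# An ordered categorical sorts by season position rather than alphabetically
toy['Month'] = pd.Categorical(month_names, categories=SEASON_ORDER, ordered=True)

# Records per month and subtype, in season order
monthly_counts = toy.groupby(['Month', 'Subtype'], observed=True).size()
print(monthly_counts)
```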

In [None]:
# Relabel the partial subtypes "H1" and "H3" with their full names.
# The nested dict limits the replacement to exact matches in the 'Subtype' column.
df_fluA_Human = df_fluA_Human.replace({'Subtype': {'H1': 'H1N1', 'H3': 'H3N2'}})

In [None]:
# Let's check the data type of the 'Collection Date' column

type(df_fluA_Human.loc[:,'Collection Date'].iloc[0])

In [None]:
# Since the Collection Date is a str, convert the column to datetime objects for ease in plotting 

df_fluA_Human = df_fluA_Human.copy()  
df_fluA_Human['Collection Date'] = pd.to_datetime(df_fluA_Human['Collection Date'])

# check the conversion
type(df_fluA_Human['Collection Date'].iloc[0])

In [None]:
# Creates a new column holding the month of each record. This will be the x-axis of my graph

df_fluA_Human['Month'] = df_fluA_Human['Collection Date'].dt.month
df_fluA_Human.columns

In [None]:
# Double check the 'Month' column from a random sample

df_fluA_Human['Month'].sample(5)

In [None]:
# Now that I have the 'Month', I'd like to change the int to month names
import calendar

df_fluA_Human['Month'] = df_fluA_Human['Month'].apply(lambda x: calendar.month_abbr[x])
df_fluA_Human['Month'].sample(5)

In [None]:
# Plots the records by month, starting with the beginning of the flu season in Oct 2017

order = ['Oct', 'Nov', 'Dec','Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep']

sns.countplot(x='Month', data=df_fluA_Human, hue='Subtype', order = order, palette='rainbow').set_title('Reported Flu A Subtypes in US during 2017-18')

# relocate the legend
plt.legend(bbox_to_anchor = (0.8, 0.98), loc =2, borderaxespad=0.)


plt.savefig('FluA.png')

### Partial answer to Question 2: For Flu A, the only reported subtypes are H1N1 and H3N2 with H3N2 being reported at higher rates. There is a peak of cases in Dec 2017 for H3N2 and Jan 2018 for H1N1.

In contrast to the two subtypes, many more distinct strains are reported:


In [None]:
df_fluA_Human["Strain Name"].nunique()

Because there are so many strains, a figure won't work, but we can list them to get an idea of the most common submissions. This may prove useful later when we need to look at exact sequences.

A little primer on [how to read strain names](https://www.cdc.gov/flu/about/viruses/types.htm):

Antigenic type/host origin/geographical origin/strain #/isolation year (if fluA, hemagglutinin and neuraminidase description, e.g., H1N1)
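As a rough sketch of that naming scheme, the helper below splits a strain name into labeled fields. The function name `parse_strain_name` and the two example names in the usage note are illustrative and not taken from the dataset. By convention the host field is left out for human isolates, so a four-part name is treated as human here.

```python
import re


def parse_strain_name(name):
    """Split a strain name into labeled parts (a sketch, not a validator)."""
    # Separate an optional trailing subtype such as "(H3N2)"
    match = re.match(
        r'^(?P<body>[^()]+?)\s*(?:\((?P<subtype>[^()]+)\))?$', name.strip()
    )
    if match is None:
        raise ValueError("Unrecognized strain name: {!r}".format(name))

    parts = match.group('body').split('/')
    if len(parts) == 4:
        # Host omitted: assumed to be a human isolate
        antigenic_type, location, strain_number, year = parts
        host = 'human'
    elif len(parts) == 5:
        antigenic_type, host, location, strain_number, year = parts
    else:
        raise ValueError("Unexpected number of fields in {!r}".format(name))

    return {
        'type': antigenic_type,
        'host': host,
        'location': location,
        'strain_number': strain_number,
        'year': year,
        'subtype': match.group('subtype'),  # None when not given
    }
```

For example, `parse_strain_name('A/Texas/50/2012(H3N2)')` yields a human host with location `Texas`, while `parse_strain_name('A/swine/Iowa/15/1930')` reports `swine` as the host and no subtype.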

In [None]:
df_H1N1_strains = df_fluA_Human[df_fluA_Human["Subtype"] == "H1N1"]
H1N1_strains = df_H1N1_strains['Strain Name'].value_counts().head(10)
df_H3N2_strains = df_fluA_Human[df_fluA_Human["Subtype"] == "H3N2"]
H3N2_strains = df_H3N2_strains['Strain Name'].value_counts().head(10)

print("\033[1m" + "Most Common Strains per Subtype" + "\033[0m")
print("H1N1 strains:\n{}\n".format(H1N1_strains))
print("H3N2 strains:\n{}\n".format(H3N2_strains))


So it looks like strains are being deposited as they are newly sequenced, with no repeating strains. 
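The top-ten listing only shows the most frequent names, so the "no repeating strains" observation can also be checked directly. This is a small sketch on made-up strain names; on the real data the same calls would be applied to the 'Strain Name' column.

```python
import pandas as pd

# Made-up strain names, with one deliberate repeat
strain_names = pd.Series([
    'A/Example/1/2018',
    'A/Example/2/2018',
    'A/Example/2/2018',
    'A/Example/3/2018',
])

# True only if every name occurs exactly once
all_unique = strain_names.is_unique

# Names that occur more than once, with their counts
counts = strain_names.value_counts()
repeated = counts[counts > 1]

print("All strain names unique:", all_unique)
print(repeated)
```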

Now let's perform the same analysis for Influenza Type B.

In [None]:
df_fluB_Human.info()

In [None]:
n_records = len(df_fluB_Human)
top5_states = df_fluB_Human['State/Province'].value_counts().head(5)
bottom5_states = df_fluB_Human['State/Province'].value_counts().tail(5)
n_uni_subtypes = df_fluB_Human['Subtype'].nunique()
uni_subtypes = df_fluB_Human['Subtype'].unique()

print("\033[1m" + "General Info" + "\033[0m")
print("Total number of flu B records: {}".format(n_records))
print("Top 5 states reporting Flu B:\n{}\n".format(top5_states))
print("Bottom 5 states reporting Flu B:\n{}\n".format(bottom5_states))
print("\033[1m" + "Subtype Details" + "\033[0m")
print("Number of unique subtypes reported: {}".format(n_uni_subtypes))
print("Those subtypes are: {}".format(uni_subtypes))

Interestingly, the largest group of records has no state filled in. There is also only one subtype reported, and it is unnamed. We can still take a look at the monthly collection dates and the most common strains.

To simplify the graphs later on, I will relabel the unnamed subtype as just "FluB_Sub_unkwn".
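One thing to watch when relabeling a placeholder is that a frame-wide replacement also changes any other column that uses the same value. The sketch below uses a toy frame and assumes, for illustration, that the state column uses the same "-N/A-" placeholder. It limits the relabeling to 'Subtype', so missing states can still be counted afterwards.

```python
import pandas as pd

# Toy records; the placeholder in 'State/Province' is an assumption
toy = pd.DataFrame({
    'Subtype': ['-N/A-', '-N/A-'],
    'State/Province': ['-N/A-', 'Texas'],
})

# A nested dict restricts the replacement to the 'Subtype' column
relabeled = toy.replace({'Subtype': {'-N/A-': 'FluB_Sub_unkwn'}})

# The state placeholder is untouched, so missing states remain countable
n_missing_state = int((relabeled['State/Province'] == '-N/A-').sum())
print(relabeled)
print("Records without a state:", n_missing_state)
```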

In [None]:
df_fluB_Human = df_fluB_Human.replace(to_replace="-N/A-", value="FluB_Sub_unkwn")

I also want to take a look at the strains for Flu B.

In [None]:
fluB_strains = df_fluB_Human["Strain Name"].value_counts().head(10)

print("\033[1m" + "Most Common Strains for Flu B" + "\033[0m")
print("FluB_Sub_unkwn:\n{}\n".format(fluB_strains))

There are no repeated strains for Flu B either. Next, I'll convert the Flu B DataFrame so that it includes the collection month.

In [None]:
df_fluB_Human = df_fluB_Human.copy()  
df_fluB_Human['Collection Date'] = pd.to_datetime(df_fluB_Human['Collection Date'])

In [None]:
# check the conversion
type(df_fluB_Human['Collection Date'].iloc[0])

In [None]:
# Create a new column for just the month of each record. This will be the x-axis of my graph

df_fluB_Human['Month'] = df_fluB_Human['Collection Date'].dt.month

In [None]:
# The month is an int; convert it to the abbreviated month name (str)
import calendar

df_fluB_Human['Month'] = df_fluB_Human['Month'].apply(lambda x: calendar.month_abbr[x])

In [None]:
#Plots the records by month, starting with the beginning of the flu season in Oct 2017

order = ['Oct', 'Nov', 'Dec','Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep']

sns.countplot(x='Month', data=df_fluB_Human, hue='Subtype', order = order, palette='rainbow').set_title('Reported Flu B Subtypes in US during 2017-18')

# relocate the legend
plt.legend(bbox_to_anchor = (0.68, 0.98), loc =2, borderaxespad=0.)

plt.savefig('FluB.png')

Here's a look at just Flu B for the season, with a peak around January with about 500 submissions. 

Let's see how it compares to Flu A, by combining this all into one figure.
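Combining the two tables amounts to stacking their rows. A minimal sketch with `pd.concat` on toy frames follows; the `Type` label column is a hypothetical addition that keeps track of which table each row came from.

```python
import pandas as pd

# Toy stand-ins for the Flu A and Flu B tables
flu_a = pd.DataFrame({'Subtype': ['H1N1', 'H3N2'], 'Month': ['Dec', 'Jan']})
flu_b = pd.DataFrame({'Subtype': ['FluB_Sub_unkwn'], 'Month': ['Jan']})

# Stack the rows, tagging each with its source table
combined = pd.concat(
    [flu_a.assign(Type='A'), flu_b.assign(Type='B')],
    ignore_index=True,
)
print(combined)
```

Row-wise stacking keeps every record from both tables, so the combined length equals the sum of the two inputs.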

In [None]:
# Stack the Flu A and Flu B records row-wise into a single DataFrame
df_AllFlu = pd.concat([df_fluA_Human, df_fluB_Human], ignore_index=True)

In [None]:
#Plotting out all three subtypes
order = ['Oct', 'Nov', 'Dec','Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep']

sns.countplot(x='Month', data=df_AllFlu, hue='Subtype', order = order, palette='rainbow').set_title('Reported Flu Subtypes in US during 2017-18')

# relocate the legend
plt.legend(bbox_to_anchor = (0.68, 0.98), loc =2, borderaxespad=0.)

plt.savefig('Flu_allsubtypes.png')

### Here is a combined look at all three subtypes: Flu A H1N1, Flu A H3N2, and Flu B 'N/A'. What's interesting is how Flu B persists toward the end of the season while Flu A slowly wanes.

Although these figures answer Question 2, it'd be fun to take it one step further and look at these data using a map.

---
# Looking at Subtypes on a US Map

In [None]:
# Import libraries for a choropleth map.
# The unused `plotly.plotly` import is dropped: that module was removed in plotly 4.

import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

In [None]:
init_notebook_mode(connected=True)

First thing we have to do is convert the state names to abbreviations so they read into the map correctly.

Here's a dictionary of state names to abbreviations thanks to [rogerallen](https://gist.github.com/rogerallen/1583593)
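Before mapping, it can help to see which values fail to map, since names missing from the dictionary silently become NaN. The sketch below uses a hypothetical two-entry subset of the dictionary and made-up records; the per-state count table it builds is also an assumed shape.

```python
import pandas as pd

# Hypothetical two-entry subset of the full state dictionary
state_to_abbrev = {'California': 'CA', 'Texas': 'TX'}

# Made-up records, including one non-US entry
records = pd.DataFrame({
    'State/Province': ['California', 'California', 'Texas', 'Ontario'],
    'Subtype': ['H3N2', 'H3N2', 'H1N1', 'H3N2'],
})

# Count records per state and subtype
by_state = (records.groupby(['State/Province', 'Subtype'])
                   .size()
                   .reset_index(name='counts'))

# Map names to abbreviations; unknown names become NaN
by_state['abbrev'] = by_state['State/Province'].map(state_to_abbrev)

# List the names that did not map, so they can be reviewed or dropped
unmapped = by_state.loc[by_state['abbrev'].isna(), 'State/Province'].unique()
print(by_state)
print("Unmapped:", list(unmapped))
```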


In [None]:
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
}

In [None]:
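# df_byState is not defined earlier; assumed here to be per-state, per-subtype record counts
df_byState = df_AllFlu.groupby(['State/Province', 'Subtype']).size().reset_index(name='counts')

# Map full state names to abbreviations (the dictionary is already keyed by state name)
df_byState['abbrev'] = df_byState['State/Province'].map(us_state_abbrev)
df_byState.head(5)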

In [None]:
# Let's do H1N1 cases first.

# Filter first, then copy, so the new frame is independent of df_byState
df_byState_H1N1 = df_byState[df_byState['Subtype'] == 'H1N1'].copy()
df_byState_H1N1.head(5)

In [None]:
data_H1N1 = dict(
        type = 'choropleth',
        colorscale = 'Greens',
        reversescale = True,
        locations = df_byState_H1N1['abbrev'],
        z = df_byState_H1N1['counts'],
        locationmode = 'USA-states',
        text = df_byState_H1N1['State/Province'],  # hover text: full state name per row
        marker = dict(line = dict(color = 'rgb(255,255,255)',width = 1)),
        colorbar = {'title':'reported H1N1 cases'}
            ) 

In [None]:
layout = dict(title = 'Reported/Sequenced Influenza H1N1 for 2017-2018',
              geo = dict(scope='usa',
                         showlakes = True,
                         lakecolor = 'rgb(85,173,240)')
             )

In [None]:
choromap = go.Figure(data = [data_H1N1],layout = layout)
iplot(choromap,validate=False)

In [None]:
# Here is a map for H3N2

df_byState_H3N2 = df_byState[df_byState['Subtype'] == 'H3N2']
df_byState_H3N2.head(5)

In [None]:
data_H3N2 = dict(
        type = 'choropleth',
        colorscale = 'Blues',
        reversescale = True,
        locations = df_byState_H3N2['abbrev'],
        z = df_byState_H3N2['counts'],
        locationmode = 'USA-states',
        text = df_byState_H3N2['State/Province'],  # hover text: full state name per row
        marker = dict(line = dict(color = 'rgb(255,255,255)',width = 1)),
        colorbar = {'title':'reported H3N2 cases'}
            ) 

In [None]:
layout = dict(title = 'Reported/Sequenced Influenza H3N2 for 2017-2018',
              geo = dict(scope='usa',
                         showlakes = True,
                         lakecolor = 'rgb(85,173,240)')
             )

In [None]:
choromap = go.Figure(data = [data_H3N2],layout = layout)
iplot(choromap,validate=False)
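The H1N1 and H3N2 traces differ only in the subtype and the color scale, so they could be built by one small helper. The sketch below uses a hypothetical function name, `make_choropleth_trace`, and a toy table with the same columns as the per-state counts. It returns a plain dict, so it does not itself need plotly.

```python
import pandas as pd


def make_choropleth_trace(df, subtype, colorscale):
    """Build a choropleth trace dict for one subtype (illustrative helper)."""
    subset = df[df['Subtype'] == subtype]
    return dict(
        type='choropleth',
        colorscale=colorscale,
        reversescale=True,
        locations=subset['abbrev'],
        z=subset['counts'],
        locationmode='USA-states',
        text=subset['State/Province'],
        marker=dict(line=dict(color='rgb(255,255,255)', width=1)),
        colorbar={'title': 'reported {} cases'.format(subtype)},
    )


# Toy per-state counts, for illustration only
toy_counts = pd.DataFrame({
    'State/Province': ['California', 'Texas', 'California'],
    'abbrev': ['CA', 'TX', 'CA'],
    'Subtype': ['H1N1', 'H3N2', 'H3N2'],
    'counts': [3, 5, 7],
})

trace_h1n1 = make_choropleth_trace(toy_counts, 'H1N1', 'Greens')
trace_h3n2 = make_choropleth_trace(toy_counts, 'H3N2', 'Blues')
```

Each returned dict could then be passed to `go.Figure(data=[...], layout=layout)` in the same way as the hand-written ones.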

Now that there's a general idea of what this dataset holds, in my next analysis I will download sequences for analysis. To be continued...