## Influenza Behavior 1997-2022: Visual Illustrations of Outbreaks and Beyond

###### Background

Seasonal outbreaks of influenza are a frequent threat to public health. Circulating viruses also cause disease in domestic and wild animals. Viral mutations in animal hosts are a source of zoonotic disease and of pandemics such as the influenza pandemic of 1918 and the H1N1 pandemic of 2009 [1]. While influenza circulates seasonally, periodic pandemics lead to high rates of hospitalization and unexpectedly high mortality [2].

Influenza A is the only type known to cause pandemics, with swine and avian strains representing important sources of novel strains that cause severe disease in humans [1, 2]. As such, monitoring future outbreaks requires not only observation of seasonal trends in humans, but also monitoring of animal population health [3].

Influenza's seasonality makes it a useful illustration of how the dynamics of a circulating disease change over time, and of how a pandemic affects those dynamics. In this project, I use 26 years of data collected by the CDC to illustrate: 1) the graphical appearance of the 2009 H1N1 pandemic, 2) the possible extinction of an Influenza B lineage, and 3) differences (or lack thereof) between age groups and between seasons in the proportions of flu subtypes reported.

###### Methodology

The United States Centers for Disease Control and Prevention (CDC) offers tools to view nationally pooled mortality, hospitalization, and virologic surveillance data. For this exercise, I queried the CDC's FluView tool to examine data for the subtypes of Influenza A and the lineages of Influenza B, aggregated by season and age group. The most recent data at the time of the query included the 2022-2023 flu season. My interest was in subtype and lineage characteristics of the two types of influenza that circulate seasonally in humans, type A and type B. Because I was interested in changes in subtype between years, I chose a timespan of at least 25 years so that the visualizations would cover at least two decades and better illustrate the emergence of the 2009 pandemic compared to earlier years. In the query, I included all categories of Influenza A subtypes and Influenza B lineages, as well as every age group category. After the query, the raw data were saved as a CSV file for later use.

From here, Python and relevant libraries, including pandas and matplotlib, were used throughout the project. Code and explanations of how the data were processed are detailed in the sections that follow.

###### Data Cleaning

Having established which questions the project was designed to answer and collected the initial data, I began by cleaning the data. Fortunately, the queried data were already reasonably clean. In order to evaluate all flu data together, I did an outer join on two separate dataframes (1997-2015 and 2016-present). Because the two files do not share every column (for example, the B lineage columns BVic and BYam are only populated for the later data), the join produced NaN values, which I filled with zeros.
Detailed below is the code used to prepare the data for this project.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
prior_2015 = pd.read_csv('WHO_NREVSS_Combined_prior_to_2015_16.csv',header=1)
post_2015 = pd.read_csv('WHO_NREVSS_Public_Health_Labs.csv',header=1)

In [11]:
post_2015.info()
prior_2015.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 479 entries, 0 to 478
Data columns (total 13 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   REGION TYPE                  479 non-null    object
 1   REGION                       479 non-null    object
 2   YEAR                         479 non-null    int64 
 3   WEEK                         479 non-null    int64 
 4   TOTAL SPECIMENS              479 non-null    int64 
 5   A (2009 H1N1)                479 non-null    int64 
 6   A (H3)                       479 non-null    int64 
 7   A (Subtyping not Performed)  479 non-null    int64 
 8   B                            479 non-null    int64 
 9   BVic                         479 non-null    int64 
 10  BYam                         479 non-null    int64 
 11  H3N2v                        479 non-null    int64 
 12  A (H5)                       479 non-null    int64 
dtypes: int64(11), object(2)
memory usag

In [13]:
df=post_2015.merge(prior_2015,how='outer')

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1419 entries, 0 to 1418
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   REGION TYPE                  1419 non-null   object 
 1   REGION                       1419 non-null   object 
 2   YEAR                         1419 non-null   int64  
 3   WEEK                         1419 non-null   int64  
 4   TOTAL SPECIMENS              1419 non-null   int64  
 5   A (2009 H1N1)                1419 non-null   int64  
 6   A (H3)                       1419 non-null   int64  
 7   A (Subtyping not Performed)  1419 non-null   int64  
 8   B                            1419 non-null   int64  
 9   BVic                         479 non-null    float64
 10  BYam                         479 non-null    float64
 11  H3N2v                        1419 non-null   int64  
 12  A (H5)                       1419 non-null   int64  
 13  PERCENT POSITIVE  

In [17]:
df = df.fillna(0) #columns missing from one of the two files are NaN after the merge; treat them as zero counts

In [23]:
df.isnull().sum()

REGION TYPE                    0
REGION                         0
YEAR                           0
WEEK                           0
TOTAL SPECIMENS                0
A (2009 H1N1)                  0
A (H3)                         0
A (Subtyping not Performed)    0
B                              0
BVic                           0
BYam                           0
H3N2v                          0
A (H5)                         0
PERCENT POSITIVE               0
A (H1)                         0
A (Unable to Subtype)          0
dtype: int64
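The effect of the outer join and the zero fill can be seen on a toy example. This is a minimal sketch with made-up numbers; the frames `early` and `late` are hypothetical stand-ins for the two CSV files, not the FluView data.

```python
import pandas as pd

# Hypothetical miniature versions of the two exports (values are made up).
early = pd.DataFrame({"YEAR": [2014, 2014], "WEEK": [1, 2], "B": [5, 7]})
late = pd.DataFrame({"YEAR": [2016, 2016], "WEEK": [1, 2], "B": [3, 4],
                     "BVic": [1, 2], "BYam": [2, 0]})

# With no `on=` argument, merge joins on every shared column (YEAR, WEEK, B).
# No rows match here, so the outer join simply stacks the two frames.
combined = late.merge(early, how="outer")

# Columns missing from the early frame come back as NaN (and become float64).
missing_before = combined.isnull().sum()

# Filling with 0 keeps the columns numeric, but 0 now means "not reported"
# as well as "tested and found absent".
filled = combined.fillna(0)

print(missing_before)
print(filled)
```

One consequence worth keeping in mind is that, after the fill, a zero in a lineage column for an early year cannot be distinguished from a true zero count.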

In [None]:
df = df.rename(columns={'Age Group':'Age_Group', 'A (H1)':'A_H1', 
                        'A (H3)':'A_H3', 'A (H1N1)pdm09':'A_H1N1_pdm09',
                        'B (Victoria Lineage)':'B_Victoria_Lineage',
                        'B (Yamagata Lineage)':'B_Yamagata_Lineage',
                        'B (Lineage Unspecified)':'B_Unspecified_Lineage',
                        'H3N2v':'A_H3N2v'})

When preparing to change the column types, I realized that the formatting of the 'Season' column presented difficulties. As formatted, it wouldn't convert appropriately to datetime. Additionally, early visualizations of flu subtypes by season were visually confusing with the seasons labelled as ranges on a graph. In the interest of making comprehensible visualizations, I used a regular expression to remove the end year of the range. I then converted the 'Season' column to datetime and from there to a yearly period. Because flu seasons don't start and end on January 1 and FluView provided no further date information, I thought that showing a specific day and month would be misleading.

In [None]:
df["Age_Group"] = df['Age_Group'].astype("category")

In [None]:
df["Season"] = df["Season"].str.replace(r'-\d\d$', '', regex=True) #strips end date from 'Season' column

In [None]:
df["Season"] = pd.to_datetime(df["Season"], format="%Y") #the column now holds only a four-digit starting year

In [None]:
df["Season"] = df["Season"].dt.to_period('Y')
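The three 'Season' steps can be traced on a few labels. This is a minimal sketch that assumes the labels have the form "YYYY-YY" (which is what the regular expression above implies); the sample values are made up.

```python
import pandas as pd

# Hypothetical season labels in the "YYYY-YY" form.
seasons = pd.Series(["1997-98", "2009-10", "2022-23"])

# Drop the trailing "-YY" so only the starting year remains.
start_year = seasons.str.replace(r"-\d\d$", "", regex=True)

# Parse the four-digit year explicitly, then keep year-level precision only,
# so no day or month is implied in later tables and plots.
as_period = pd.to_datetime(start_year, format="%Y").dt.to_period("Y")

print(start_year.tolist())
print(as_period.tolist())
```

Each season is therefore labelled by the calendar year in which it began, so the label 2009 refers to the 2009-10 season.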

In [None]:
df["A_Subtype_not_Available"] = df.loc[:,["A (Unable to Subtype)",
                                          "A (Subtyping not Performed)"]].sum(axis=1)

In [None]:
df = df.drop(["A (Unable to Subtype)", "A (Subtyping not Performed)"], axis=1) 
#now that the aggregated column exists, these columns are redundant and I dropped them

In [None]:
cols = df.columns.tolist() #separating column data into a list makes it easier to change the structure of the dataframe

In [None]:
cols = cols[0:5] + cols[8:] + cols[5:8]

In [None]:
df = df[cols] #updates the dataframe to include the new arrangement

In [None]:
df.info() #just to display what the columns look like so far

At least for now, the data cleaning process is complete!

###### Preliminary Analysis

I began my analysis by examining summary statistics. The pandas command DataFrame.describe() usually provides some of the most important summary statistics. For this project, I was interested in eventually building boxplots to show the variability of case counts. Thus, I thought it would be helpful to include the interquartile range as well.

The Python library scipy does provide statistical functions to calculate the interquartile range. However, I wrote a custom function that calculates it by building on pandas' DataFrame.quantile() function.

In [None]:
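def iqr(arg):
    #interquartile range (Q3 - Q1) for each numeric column
    q1 = arg.quantile(q=0.25, axis=0, numeric_only=True, method='single')
    q3 = arg.quantile(q=0.75, axis=0, numeric_only=True, method='single')
    return q3 - q1 #returned directly to avoid shadowing the built-in range()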
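As a quick sanity check on the approach, the function can be run on a small frame where the quartiles are easy to work out by hand. This is a sketch with made-up values; the cross-check uses NumPy's percentile function, which by default uses the same linear interpolation as pandas.

```python
import numpy as np
import pandas as pd

def iqr(frame):
    """Column-wise interquartile range (Q3 - Q1) of the numeric columns."""
    q1 = frame.quantile(q=0.25, numeric_only=True)
    q3 = frame.quantile(q=0.75, numeric_only=True)
    return q3 - q1

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
toy = pd.DataFrame({"A_H3": [0, 10, 20, 30, 40], "label": list("abcde")})

result = iqr(toy)

# Independent calculation of the same quantity with NumPy.
expected = np.percentile(toy["A_H3"], 75) - np.percentile(toy["A_H3"], 25)

print(result["A_H3"], expected)
```

The non-numeric `label` column is skipped because of `numeric_only=True`, which is why the function can be applied to the whole dataframe.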

In [None]:
summary_stats = pd.DataFrame({'Mean':df.mean(axis=0, numeric_only=True),
         'Median':df.median(axis=0, numeric_only=True),
         'Max':df.max(axis=0, numeric_only=True), 
         'Min':df.min(axis=0, numeric_only=True),
         'StDev':df.std(axis=0, numeric_only=True),
         'Q1':df.quantile(q=0.25, axis=0, numeric_only=True, method='single'),
         'Q3':df.quantile(q=0.75, axis=0, numeric_only=True, method='single'),
         'IQR':iqr(df)})

In [None]:
summary_stats

At first glance, this dataset looks strange. Not only are half of the columns' medians equal to zero, but their standard deviations are also quite high. Importantly, the dataset queried for this project came from public health laboratories for the stated time frame. A strain that is not reported may still be circulating: a laboratory can only report a case if one of its samples tests positive for it. That means that while the dataset is useful for identifying *trends*, it cannot be used to make generalizations about the population.
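A zero median alongside a large standard deviation is what one would expect from counts that are zero in most rows and large in a few. A minimal sketch with made-up numbers (not FluView values) shows the pattern:

```python
import pandas as pd

# Hypothetical counts for a subtype that is reported only sporadically.
counts = pd.Series([0, 0, 0, 0, 0, 0, 250, 900])

median = counts.median()  # at least half of the rows are zero
mean = counts.mean()      # pulled upward by the two large values
stdev = counts.std()      # sample standard deviation

print(median, mean, stdev)
```

In a column like this the standard deviation exceeds the mean, so the median and IQR describe the typical row better than the mean does.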

###### Case Study 1: Is There a Difference in Dominant Influenza Strains Between Seasons?

For my first question, I wanted to explore whether, and if so how, dominant subtypes change by season. To do this, I calculated the sum of positive cases for each category by season using the DataFrame.groupby() function. The table is shown below as a reference, with an initial graph to better visualize the data. What stood out in the initial line chart is that a single chart contained too many strains to visualize appropriately. I went back to my data and broke it into two new frames, one for Influenza A and one for Influenza B.

In [None]:
season_group = df.groupby('Season').sum(numeric_only=True)

In [None]:
seasons_a = season_group.drop(["B_Victoria_Lineage", "B_Yamagata_Lineage", "B_Unspecified_Lineage"], axis=1)

In [None]:
seasons_a

In [None]:
seasons_a.plot()

In [None]:
seasons_a.boxplot()

The illustrations above show changes in Influenza A subtypes by season. The line chart demonstrates the fluctuating nature of flu strains, including a very prominent spike of H1N1 (pdm09) cases in 2009, corresponding to the swine flu pandemic of that year.

In [None]:
seasons_b = season_group.drop(["A_H1", "A_H3", "A_H1N1_pdm09", "A_H3N2v",
                               "A_Subtype_not_Available"], axis=1)

In [None]:
seasons_b.plot()

In [None]:
seasons_b.boxplot()

The visualizations help clarify some of the questions about seasons. On the line charts, subtypes and lineages fluctuate in frequency relative to each other from season to season. The box plots further show a number of outliers for many subtypes and lineages, especially H3 and H1N1 (pdm09). Additional statistics testing whether these differences are significant will be helpful in future analyses.
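One simple way to read "dominant subtype" off the season totals, without relying on the charts alone, is to take the column with the largest count in each row. This is a sketch on made-up numbers; the column names follow the renamed dataframe, and the string index stands in for the 'Season' periods.

```python
import pandas as pd

# Hypothetical season totals (not the FluView values).
toy = pd.DataFrame(
    {"A_H1N1_pdm09": [50, 900, 300],
     "A_H3": [400, 100, 350],
     "A_H1": [200, 20, 0]},
    index=pd.Index(["2008", "2009", "2010"], name="Season"),
)

# Share of each subtype within a season (each row sums to 1).
shares = toy.div(toy.sum(axis=1), axis=0)

# Label of the column with the largest count in each season.
dominant = toy.idxmax(axis=1)

print(shares.round(2))
print(dominant)
```

Working with shares rather than raw counts also reduces the influence of season-to-season changes in how many specimens were tested.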

###### Case Study 2: COVID-19 Impacts on Flu Cases and Influenza B Yamagata Lineage

The line charts above suggested a change in case numbers for the 2020 and 2021 flu seasons, so I explored the nature of these differences as a case study.

In [None]:
pandemic_a = seasons_a.loc[["2019", "2020", "2021", "2022"]]

In [None]:
pandemic_a

The table, a slice of the years before and after the onset of the COVID-19 pandemic, also demonstrates how outside factors can affect circulating flu strains. Prior to the COVID-19 pandemic, the predominant strain was H1N1 (pdm09). There was a dramatic drop in 2020, possibly attributable, at least in part, to increased social distancing and stay-at-home orders. Following 2020, H3 became the predominant subtype, and positive influenza results have risen to levels similar to those seen before the pandemic. The line graph below provides a visualization of the trend.

In [None]:
pandemic_a.plot()

In [None]:
pandemic_b = seasons_b.loc[["2019", "2020", "2021", "2022"]]

In [None]:
pandemic_b

In [None]:
pandemic_b.plot()

Furthermore, there is an interesting trend in the Influenza B data for 2019-2022. While both lineages were detected less frequently in 2020, the Victoria lineage returned to typical case numbers in the following years. The outlier to this trend is the Influenza B Yamagata lineage.

The World Health Organization's program, FluNet, collects data on influenza from around the globe. Researchers studying global trends have observed a significant decline in the Yamagata lineage elsewhere as well. While more data are needed, researchers have suggested that the COVID-19 pandemic may have resulted in the extinction of the Yamagata lineage [3, 4].
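The size of a drop and any later rebound can be put into numbers by comparing each season with the previous one and with a 2019 baseline. This is a sketch on made-up numbers (not the FluView or FluNet values); the integer index stands in for the 'Season' periods.

```python
import pandas as pd

# Hypothetical season totals for the two B lineages.
toy = pd.DataFrame(
    {"B_Victoria_Lineage": [1000, 50, 400, 900],
     "B_Yamagata_Lineage": [200, 4, 0, 0]},
    index=pd.Index([2019, 2020, 2021, 2022], name="Season"),
)

# Relative change from the previous season; -0.95 means a 95% drop.
# (A 0 -> 0 step has no defined relative change and shows up as NaN.)
change = toy.pct_change()

# Each season expressed as a fraction of the 2019 baseline.
relative_to_2019 = toy.div(toy.loc[2019])

print(change)
print(relative_to_2019)
```

In this toy frame one lineage recovers to most of its baseline while the other stays at zero, which is the shape of the pattern described above.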

###### Case Study 3: Is There a Difference in Dominant Influenza Strains Between Age Groups?

In [None]:
grouped_ages = df.groupby('Age_Group').sum(numeric_only=True)

In [None]:
grouped_ages

In [None]:
a_ages = grouped_ages.drop(["B_Victoria_Lineage", "B_Yamagata_Lineage", "B_Unspecified_Lineage"], axis=1)

In [None]:
a_ages.plot.bar(stacked=True)

It seems that in all age groups, H3 and H1N1pdm09 are generally the most common subtypes. A noteworthy difference appears in people over the age of 65: in that group, H3 is much more common than H1N1pdm09. Researchers have observed that during the swine flu pandemic, younger people were disproportionately affected by severe disease. The higher proportion of H1N1pdm09 cases in younger age groups appears consistent with that observation.

In [None]:
b_ages = grouped_ages.drop(["A_H1", "A_H3", "A_H1N1_pdm09", "A_H3N2v", "A_Subtype_not_Available"], axis=1)

In [None]:
b_ages

In [None]:
b_ages.plot.bar(stacked=True)

In contrast to Influenza A, there seem to be similar proportions of each lineage across age groups. Further statistical analysis would help uncover whether or not the differences are statistically significant.
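One candidate for that further analysis is a chi-square test of independence on the age group by lineage table. The sketch below computes the test statistic by hand on made-up counts, only to illustrate the mechanics. It treats each positive result as an independent observation, which laboratory surveillance counts may not satisfy. A library routine such as scipy.stats.chi2_contingency would also return a p-value.

```python
import numpy as np
import pandas as pd

# Hypothetical totals per age group (not the FluView values).
toy = pd.DataFrame(
    {"B_Victoria_Lineage": [120, 200, 60],
     "B_Yamagata_Lineage": [80, 150, 90]},
    index=pd.Index(["0-4 yr", "25-64 yr", "65+ yr"], name="Age_Group"),
)

observed = toy.to_numpy(dtype=float)
row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
grand_total = observed.sum()

# Counts expected if lineage were independent of age group.
expected = row_totals @ col_totals / grand_total

# Pearson chi-square statistic and its degrees of freedom.
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(round(chi2, 2), dof)
```

A large statistic relative to the chi-square distribution with `dof` degrees of freedom would suggest that lineage proportions differ between age groups.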

###### Discussion

###### Conclusions

In summary, using the available information I make the following arguments:
1) The dominant Influenza A subtype seems to vary by season, with H1N1 (pdm09), H3, and H1 being the most common subtypes. Dynamics are less clear for Influenza B, but the data seem to suggest that after 2015 the predominant lineage alternates between Victoria and Yamagata.
2) There was a precipitous drop in positive influenza tests during the COVID-19 pandemic. Most strains have seen a resurgence, but the Yamagata lineage is either dormant or extinct.
3) There appear to be some differences in predominant influenza subtypes between age groups, particularly for people who are 65+.

###### Limitations

This analysis used a single, constrained dataset from the CDC's FluView tool. Importantly, the CDC relies on reports from many public health laboratories to build the dataset, and the number of participating laboratories varies from season to season. As such, extrapolations to the general population are limited.

A wealth of additional resources is publicly available online. The CDC WONDER database provides mortality statistics, including influenza mortality. The WHO also monitors influenza globally, providing data from many countries. Incorporating these sources into the dataset may provide further insight.

###### Future Avenues

Future avenues for Influenza data exploration include:
1) Incorporation of global influenza data and influenza mortality data for further analysis
2) Advanced statistical analysis

###### References

[1] Monto, A. S., & Fukuda, K. (2020). Lessons from influenza pandemics of the last 100 years. Clinical Infectious Diseases, 70(5), 951-957.
[2] Harrington, W. N., Kackos, C. M., & Webby, R. J. (2021). The evolution and future of influenza pandemic preparedness. Experimental & Molecular Medicine, 53(5), 737-749.
[3] Koutsakos, M., Wheatley, A. K., Laurie, K., Kent, S. J., & Rockman, S. (2021). Influenza lineage extinction during the COVID-19 pandemic? Nature Reviews Microbiology, 19(12), 741-742.
[4] Dhanasekaran, V., Sullivan, S., Edwards, K. M., Xie, R., Khvorov, A., Valkenburg, S. A., ... & Barr, I. G. (2022). Human seasonal influenza under COVID-19 and the potential consequences of influenza lineage elimination. Nature Communications, 13(1), 1721.