# <font color= blue> Raw Data Collection

- The main purpose of this work is to apply the theoretical framework developed by the Stockholm Paradigm to analyse and rank a sampled group of Influenza H1N1 strains according to their emergence risk in human populations. To do so, we developed a method to standardize raw data collection, processing and analysis, which is succinctly described here. We begin with the sampling, organization, and processing of the raw data.

- The Stockholm Paradigm discusses at length the evolutionary processes involved in the (re)emergence of new symbiont interactions, and the Influenza-human symbiosis is a possible study object for this coevolutionary framework. It can be summarized in three essential biological processes: Ecological Fitting, which addresses the population-level, inter-species adjustment that precedes and enables new symbioses; the Oscillation Hypothesis, which describes the cophylogenetic evolution of host and parasite lineages; and the Taxon Pulse, which addresses the phylogenetic and ecological circumstances involved in the maintenance and evolution of symbiont interactions, that is, coevolution in the face of ecological (spatio-temporal) change.

- This work relies heavily on the assumptions of ecological fitting, where _capacity_ and _opportunity_ conditions precede _and_ enable the emergence of new symbiont interactions. The idea here is to describe, as far as the available data allows, the capacity that different Influenza H1N1 strains (human, non-human mammalian, or avian) have to establish themselves in human populations, coupled with the current ecological circumstances that allow them to do so. By analysing both capacity and opportunity, we expect to be able to visualize and rank the sampled strains according to their emergence risk in humans, contributing to preparedness for emerging infectious diseases.

- To approach our objective, we began by sampling the data available in the public Influenza databases. By the end of the data sampling period, all the data used in this work had been collected from two main platforms:

> ## <font color=blue> Influenza Research Database (IRD)

>>    This database is the largest Influenza dataset we could find. At the beginning of this work it was an independent database, organized and financed by the NIH, but during the development of this research it was modified and incorporated into the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) 3.30.19, a now larger database that includes all bacterial and viral research databases, as well as archaea and their eukaryotic hosts. The current database version can be accessed at [BV-BRC](https://www.bv-brc.org/).
>>
>>    From this database we collected most of the data we used. The tables that describe the _viral capacity to utilize the host_ will be referred to here as the (1) Strain Table, (2) Genome Table, (3) Epitope Table, and (4) Substitutions Table. The last of these lists known genomic substitutions associated with capacity alterations, such as changes in transmission capacity, replication, infectivity in the host, and antiviral resistance, among others. Opportunity will be described by the (5) Surveillance Table, which includes variables that describe the ecological context of different H1N1 strains.

> ## <font color=blue> FluPrint Database (FluPrint)

>>     Since one of our objectives was also to characterize Influenza's capacity according to the immunological imprint it leaves in the human host, we decided to explore the immunological data this dataset provides, which were gathered in eight clinical studies conducted between 2007 and 2015 at the Human Immune Monitoring Center of Stanford University. This database accumulates characterizations and quantifications of several immune cell populations in order to systematically describe the immunological pattern that different Influenza vaccines (and strains) induce in the human immune system.
>>
>>     We decided to use the H1N1 strain data present in this dataset, but only as supplementary variables in our statistical analysis, since our main data source (IRD) did not contain the specific H1N1 strain that was used in the 2009 Fluzone vaccine. The dataset can be accessed here: [FluPrint](https://fluprint.com/#/about).

## <font color=blue> Raw Datasets

> ### <font color=blue> Strain Table

The raw version of this data was obtained in January 2023 through this link [Raw Strain Data IRD] (). The Influenza A H1N1 strain information downloaded from the BV-BRC database comprised 18,045 rows and 21 columns after we manually deleted all unnecessary variables (originally it had 37 columns). The full raw dataset is summarized in the following table.

|Column|Variable|Description|Example|
|---|---|---|---|
|0|subtype|Influenza subtype|H1N1|
|1|2_pb1|Genbank ID for second genomic segment|CY188807|
|2|6_na|Genbank ID for sixth genomic segment|CY188803|
|3|family|Taxon level|Orthomyxoviridae|
|4|taxon_lineage_ids|Taxonomic lineage IDs|10239;2559587...|
|5|8_ns|Genbank ID for eighth genomic segment|CY188805|
|6|5_np|Genbank ID for fifth genomic segment|CY188804|
|7|taxon_lineage_names|Full taxonomy|Viruses;Riboviria;Orthornavirae;Negarnaviricota;Polyploviricotina;Insthoviricetes;Articulavirales;Orthomyxoviridae;Alphainfluenzavirus;Influenza A virus;H1N1 subtype;Influenza A virus (A/New York/WC-LVD-14-047/2014(H1N1))|
|8|1_pb2|Genbank ID for first genomic segment|CY188808|
|9|host_common_name|Host common name|Avian|
|10|isolation_country|Isolation country|USA|
|11|7_mp|Genbank ID for seventh genomic segment|CY188802|
|12|species|Influenza species|Influenza A virus|
|13|host_name|Host name|swine|
|14|taxon_id|Taxon identification in the IRD database|1523128|
|15|3_pa|Genbank ID for third genomic segment|CY188806|
|16|genome_ids|Genome identifiers in the IRD database|1523128.10;1523128.8;1523128.9;1523128.7;1523128.4;1523128.3;1523128.6;1523128.5|
|17|status|Strain classification in IRD|Complete|
|18|strain|Strain identification|A/New York/WC-LVD-14-047/2014|
|19|genbank_accessions|All segment Genbank IDs|CY188801;CY188806;CY188807;CY188804;CY188802;CY188803;CY188805;CY188808|
|20|geographic_group|Continent|North America|
|21|host_group|Host group|Nonhuman Mammal|
|22|collection_year|Year of collection|2014|
|23|passage|Undetermined information|E2|
|24|n_type|Neuraminidase type|1|
|25|4_ha|Genbank ID for fourth genomic segment|CY188801|
|26|segment_count|Genomic segment number|8|
|27|h_type|Hemagglutinin type|1|
|28|id|Undetermined information|0007de89-acf4-4ca4-aa89-ffeda1bb451d|
|29|_version_|Undetermined information|1,75113026210995E+018|
|30|date_inserted|Date of upload|2022-12-02T19:05:42.309Z|
|31|date_modified|Date of last modification|2022-12-02T19:05:42.309Z|

After inspecting the type of information each column contained and checking whether the column had too many missing values, we deleted all variables that did not carry relevant information. In this table, we were only interested in the Strain Name and the Genbank ID information for all eight genome segments.

Therefore, the dataset included in the pre-processing, later named _BVBRC_strain/BVBRC_strain_filtered.csv_, contains the columns below:

|Column|Variable|Description|Example|
|---|---|---|---|
|0|subtype|Influenza subtype|H1N1|
|1|2_pb1|Genbank ID for second genomic segment|CY188807|
|2|6_na|Genbank ID for sixth genomic segment|CY188803|
|3|8_ns|Genbank ID for eighth genomic segment|CY188805|
|4|5_np|Genbank ID for fifth genomic segment|CY188804|
|5|1_pb2|Genbank ID for first genomic segment|CY188808|
|6|host_common_name|Host common name|Avian|
|7|isolation_country|Isolation country|USA|
|8|7_mp|Genbank ID for seventh genomic segment|CY188802|
|9|host_name|Host name|swine|
|10|taxon_id|Taxon identification in the IRD database|1523128|
|11|3_pa|Genbank ID for third genomic segment|CY188806|
|12|strain|Strain identification|A/New York/WC-LVD-14-047/2014|
|13|genbank_accessions|All segment Genbank IDs|CY188801;CY188806;CY188807;CY188804;CY188802;CY188803;CY188805;CY188808|
|14|geographic_group|Continent|North America|
|15|host_group|Host group|Nonhuman Mammal|
|16|collection_year|Year of collection|2014|
|17|n_type|Neuraminidase type|1|
|18|4_ha|Genbank ID for fourth genomic segment|CY188801|
|19|segment_count|Genomic segment number|8|
|20|h_type|Hemagglutinin type|1|
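
The column selection above was done by hand. As a rough illustration only, the same selection could be sketched in Python as follows. This is not the procedure actually used: the helper names `filter_columns` and `has_all_segments` are hypothetical, and the in-memory CSV is a small stand-in for the real download that reuses the example values from the table.

```python
import csv
import io

# The 21 column names listed in the filtered Strain Table above.
KEEP_COLUMNS = [
    "subtype", "2_pb1", "6_na", "8_ns", "5_np", "1_pb2", "host_common_name",
    "isolation_country", "7_mp", "host_name", "taxon_id", "3_pa", "strain",
    "genbank_accessions", "geographic_group", "host_group", "collection_year",
    "n_type", "4_ha", "segment_count", "h_type",
]

# Example Genbank IDs for the eight segments, copied from the table above.
SEGMENT_IDS = {
    "1_pb2": "CY188808", "2_pb1": "CY188807", "3_pa": "CY188806",
    "4_ha": "CY188801", "5_np": "CY188804", "6_na": "CY188803",
    "7_mp": "CY188802", "8_ns": "CY188805",
}


def filter_columns(csv_file, keep=KEEP_COLUMNS):
    """Read a CSV file object and return its rows restricted to `keep`."""
    reader = csv.DictReader(csv_file)
    return [{name: (row.get(name) or "") for name in keep} for row in reader]


def has_all_segments(row):
    """True when each of the eight segment columns holds a non-empty ID."""
    return all((row.get(col) or "").strip() for col in SEGMENT_IDS)


# Build a one-row stand-in for the raw file, with two columns to be dropped.
raw = io.StringIO()
writer = csv.DictWriter(raw, fieldnames=KEEP_COLUMNS + ["family", "passage"])
writer.writeheader()
example = {name: "" for name in writer.fieldnames}
example.update(SEGMENT_IDS)
example.update(strain="A/New York/WC-LVD-14-047/2014", subtype="H1N1", passage="E2")
writer.writerow(example)
raw.seek(0)

filtered = filter_columns(raw)
print(len(filtered[0]), has_all_segments(filtered[0]))
```

The segment check reflects the stated interest in strains with Genbank IDs for all eight segments; whether incomplete strains were actually discarded at this stage is not stated here.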

> ### <font color=blue> Surveillance Table

This dataset includes all information on data collection (date and location in terms of coordinates, city, state/province and country), and on host identification and condition (in terms of natural state, capture mode and health at the time of sampling). The raw data (_BVBRC_surveillance.csv_) has 96 columns, many of which are incomplete, incorrectly filled or completely empty. Therefore, we manually deleted all irrelevant columns and ended up with the modified raw table described below, later named _BVBRC_surveillance_filtered.csv_.

|Column|Variable|Description|Example|
|---|---|---|---|
|0|S|Sample number|STP_2020_1527|
|1|Sequence Accession|Genbank ID for all genomic segments|CY168429,CY168430,CY168427,CY168428,CY168423,CY168424,CY168425,CY168426|
|2|Sample Material|Sample type|CY188803|
|3|Collection Year|Year of collection|2016|
|4|Collection Country|Collection country|USA|
|5|Collection State Province|Collection region|Wisconsin|
|6|Collection City|City of collection|Cahuil|
|7|Collection Latitude|Collection latitude|-34.47928|
|8|Collection Longitude|Collection longitude|-72.02064|
|9|Pathogen Test Type|Viral detection test type|Influenza A virus|
|10|Pathogen Test Result|Viral detection test result|Positive|
|11|Subtype|Influenza A subtype|H3N2|
|12|Strain|Strain identification|A/Green-winged Teal/Wisconsin/08OS2292/2008(H3N2)|
|13|Host Identifier|Host ID in IRD|UGAI14-2125|
|14|Host Species|Specific taxon|Gallus gallus domesticus|
|15|Host Common Name|Host common name|Domestic Chicken|
|16|Host Group|Host group|Avian|
|17|Host Sex|Host sex|Female|
|18|Host Natural State|Host natural state|Domestic|
|19|Host Capture Status|Host capture strategy|Active surveillance (e.g. trap)|
|20|Host Health|Host's condition at time of sampling|Healthy|
|21|Symptoms|Notable symptom|Temperature:101.6|
|22|Onset Hours|Time of initial symptoms|cough:4|
|23|Sudden Onset|Undetermined information|-|
|24|Diagnosis|Undetermined information|-|
|25|Pre Visit Medication|Undetermined information|-|
|26|Treatment|Undetermined information|-|
|27|Initiation Of Treatment|Undetermined information|-|
|28|Duration of Treatment|Undetermined information|-|
|29|Treatment Dosage|Undetermined information|-|
|30|Vaccination Type|Undetermined information|-|
|31|Days Elapsed to Vaccination|Undetermined information|-|
|32|Source of Vaccine Information|Undetermined information|-|
|33|Vaccine Lot Number|Undetermined information|-|
|34|Vaccine Manufacturer|Undetermined information|-|
|35|Vaccine Dosage|Undetermined information|-|
|36|Other Vaccinations|Undetermined information|-|
|37|Additional Metadata|Undetermined information|-|
|38|Comments|Undetermined information|-|
|39-79|-|Undetermined information|-|

This was then imported into the **Surveillance data processing** script as _BVBRC_surveillance_filtered.csv_.
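
The removal of incomplete or empty columns was carried out manually. A possible programmatic screen is sketched below, under stated assumptions: the `max_missing` threshold of 0.5 and the set of tokens treated as missing are hypothetical choices, not values used in this work, and the three rows are invented for illustration (only the first strain name comes from the table above).

```python
# Tokens treated as "missing" (an assumption; adjust to the actual file).
MISSING_TOKENS = {"", "-", "nan", "na"}


def is_missing(value):
    """True for None or for any token in MISSING_TOKENS (case-insensitive)."""
    return value is None or str(value).strip().lower() in MISSING_TOKENS


def missing_fraction(rows, column):
    """Fraction of rows in which `column` is missing."""
    if not rows:
        return 1.0
    return sum(is_missing(row.get(column)) for row in rows) / len(rows)


def drop_sparse_columns(rows, max_missing=0.5):
    """Keep only the columns whose missing fraction is at most `max_missing`."""
    columns = list(rows[0].keys()) if rows else []
    kept = [c for c in columns if missing_fraction(rows, c) <= max_missing]
    return [{c: row.get(c) for c in kept} for row in rows], kept


# Invented rows standing in for BVBRC_surveillance.csv.
rows = [
    {"Strain": "A/Green-winged Teal/Wisconsin/08OS2292/2008(H3N2)",
     "Collection Country": "USA", "Host Sex": "Female", "Vaccine Lot Number": ""},
    {"Strain": "hypothetical-strain-2",
     "Collection Country": "USA", "Host Sex": "", "Vaccine Lot Number": "-"},
    {"Strain": "hypothetical-strain-3",
     "Collection Country": "", "Host Sex": "", "Vaccine Lot Number": ""},
]

cleaned, kept = drop_sparse_columns(rows)
print(kept)
```

A screen like this only flags sparsity; columns that are filled but incorrectly filled, as mentioned above, would still need manual inspection.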

> ### <font color=blue> Epitope Table

This dataset includes information on all linear peptides registered in the BV-BRC (or previous IRD) Epitope platform. Unlike all other datasets used in this work, this table does not have a column with the _Strain Name_. It was therefore necessary to use the Genbank ID to collect the RNA sequence, translate it into its amino acid sequence, and then scan each of the eight sequences of every strain for epitope presence.
    
The raw Epitope table was manually filtered so that it included only complete and unique (not present in other datasets) information. We therefore edited the original table _BVBRC_epitope.csv_ so that it included, for the 8,595 records, only the following 13 parameters:


|Column|Variable|Description|Example|
|---|---|---|---|
|0|Epitope ID|Epitope number in BV-BRC|10003|
|1|Epitope Type|Epitope type|Linear peptide|
|2|Epitope Sequence|Amino acid sequence|DRLFFKCI|
|3|Organism|Species|Influenza A virus|
|4|Protein Name|Protein name|Matrix protein 2|
|5|Protein ID|Protein ID originally pulled from the UniProtKB/Swiss-Prot database|P06821.1|
|6|Protein Accession|Protein accession code originally pulled from the UniProtKB/Swiss-Prot database|P06821|
|7|Start|Amino acid start position in the protein sequence|44.0|
|8|End|Amino acid end position in the protein sequence|51.0|
|9|Total Assays|Total epitope identification assays|2|
|10|Bcell Assays|Epitope identification assays with B cells|NaN|
|11|Tcell Assays|Epitope identification assays with T cells|0/3|
|12|Comments|Additional information|A/Short text|
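
The translate-and-scan step described at the start of this section can be sketched as follows. This is a simplified illustration under explicit assumptions, not the processing code used in this work: the input is assumed to be a coding sequence that starts in frame at its first nucleotide, untranslated regions and spliced products are ignored, and the function names and the short test sequence are hypothetical. The sequence was built so that it encodes the example epitope `DRLFFKCI` from the table above.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate an in-frame coding sequence up to the first stop codon.

    Accepts DNA or RNA letters; codons with ambiguous bases become 'X'.
    """
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def find_epitope(protein, epitope):
    """Return 1-based (start, end) positions of every exact occurrence."""
    hits = []
    start = protein.find(epitope)
    while start != -1:
        hits.append((start + 1, start + len(epitope)))
        start = protein.find(epitope, start + 1)
    return hits


# Hypothetical coding sequence: start codon, the epitope, stop codon.
cds = "ATGGATCGTCTGTTTTTCAAATGTATTTAA"
protein = translate(cds)
hits = find_epitope(protein, "DRLFFKCI")
print(protein, hits)
```

Exact substring matching is the simplest possible criterion; it reports an epitope as absent if a single residue differs, so any tolerance for mismatches would need a different comparison.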

> ### <font color=blue> Substitutions Table

The fourth raw dataset includes the original data registered in the database on genomic substitutions that published works (indexed in PubMed) have identified and described as altering the viral capacity to interact with the host. For this dataset, we downloaded the substitutions, which were divided into Avian, Human and Mammalian substitution tables. All three have a similar structure: the first column contains the _Strain Name_, whereas the substitution columns record **presence**, **absence** or **unknown** status of substitution detection in the respective strain.

> Human Substitutions Table:

|Column|Variable|Description|Example|
|---|---|---|---|
|0|Strain Name|Viral strain name|A/England/257/2009|
|1|Subtype|Influenza A subtype|H1N1|
|2|Collection Date|Collection date|05/09/2009|
|3|State / Province|Collection region|Influenza A virus|
|4|Country|Collection country|USA|
|5|Host|Host group|Human|
|6-34|Phenotypic substitutions|Substitution detection status (presence, absence, or unknown)|-|

> Avian Substitutions Table:

|Column|Variable|Description|Example|
|---|---|---|---|
|0|Strain Name|Viral strain name|A/England/257/2009|
|1|Subtype|Influenza A subtype|H1N1|
|2|Collection Date|Collection date|05/09/2009|
|3|State / Province|Collection region|Influenza A virus|
|4|Country|Collection country|USA|
|5|Host|Host group|Human|
|6-31|Phenotypic substitutions|Substitution detection status (presence, absence, or unknown)|-|


> Mammalian (non-human) Substitutions Table:

|Column|Variable|Description|Example|
|---|---|---|---|
|0|Strain Name|Viral strain name|A/England/257/2009|
|1|Subtype|Influenza A subtype|H1N1|
|2|Collection Date|Collection date|05/09/2009|
|3|State / Province|Collection region|Influenza A virus|
|4|Country|Collection country|USA|
|5|Host|Host group|Human|
|6-31|Phenotypic substitutions|Substitution detection status (presence, absence, or unknown)|-|
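
Because the three substitution tables share the same metadata columns, they can be combined on _Strain Name_. The sketch below shows one way this might be done; it is an assumption-laden illustration rather than the merging procedure used in this work. In particular, the cell labels `Present`, `Absent` and `Unknown`, the substitution column names `sub_A`, `sub_B` and `sub_C`, and the `source:column` naming scheme are all hypothetical.

```python
# Metadata columns shared by the three substitution tables above.
METADATA_COLUMNS = [
    "Strain Name", "Subtype", "Collection Date",
    "State / Province", "Country", "Host",
]

# Hypothetical cell labels mapped to numeric codes (unknown stays None).
STATUS_CODES = {"present": 1, "absent": 0, "unknown": None}


def encode_status(value):
    """Map a status label to 1, 0 or None; unrecognized labels give None."""
    return STATUS_CODES.get(str(value).strip().lower())


def merge_tables(tables):
    """Combine tables keyed by source name into one record per strain.

    Metadata is taken from the first table in which a strain appears;
    substitution columns are prefixed with their source table name.
    """
    merged = {}
    for source, rows in tables.items():
        for row in rows:
            record = merged.setdefault(row["Strain Name"], {})
            for column, value in row.items():
                if column in METADATA_COLUMNS:
                    record.setdefault(column, value)
                else:
                    record[f"{source}:{column}"] = encode_status(value)
    return merged


# Invented rows; only the strain name and metadata come from the tables above.
tables = {
    "human": [{"Strain Name": "A/England/257/2009", "Subtype": "H1N1",
               "Host": "Human", "sub_A": "Present", "sub_B": "Unknown"}],
    "mammal": [{"Strain Name": "A/England/257/2009", "Subtype": "H1N1",
                "Host": "Human", "sub_C": "Absent"}],
}

merged = merge_tables(tables)
print(merged["A/England/257/2009"])
```

Keeping unknown status as `None` rather than 0 avoids treating a substitution that was never assessed as one that was shown to be absent.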


After completing the pre-processing described here, we imported the respective data into the subsequent processing steps:

1. Strain Table was saved as _BVBRC_strain_filtered.csv_ and imported into _Strain data processing.ipynb_
2. Surveillance Table was saved as _BVBRC_surveillance_filtered.csv_ and imported into _Surveillance data processing.ipynb_
3. Epitope Table was saved as _BVBRC_epitope_filtered_ and imported into _BVBRC_epitope_filtered.ipynb_
4. Substitutions Tables were saved as _Avian_pheno_subs.csv_, _human_pheno_subs.csv_, and _mammal_pheno_subs.csv_ and imported into _Substitutions data processing and merging.ipynb_.

In [29]:
pip install nbconvert[webpdf]

Note: you may need to restart the kernel to use updated packages.


In [30]:
!jupyter nbconvert --to webpdf --allow-chromium-download Data_descrption.ipynb