<h1>Group 38 Project Proposal: Determining the Connection Between Country Wealth and Tuberculosis Mortality</h1>
    
<img src="images/TB_img.jpg" alt="Tuberculosis under EM Microscope" width = "1000"/>
    
<font size="2"> <i>image attribution</i>: NIAID Mycobacterium tuberculosis Bacteria, the Cause of TB, CC BY 2.0 <https://creativecommons.org/licenses/by/2.0>, via Flickr at <https://www.flickr.com/photos/niaid/51637606937/in/photostream/> </font>

<h3>Introduction</h3>
<hr>

*Total Word Count:* ---

Tuberculosis (TB) is an airborne respiratory disease that mainly affects the lungs. It can also affect other organs, and the symptoms differ depending on the site of infection. TB spreads through bacteria released into the air when an infected person coughs or sneezes. People who breathe in these bacteria do not necessarily become sick right away, because the bacteria can remain inactive, although they may become sick later on. Once the bacteria are active, they multiply and attack organ tissue, which can be life-threatening.  
HIV (human immunodeficiency virus) is a virus that attacks the immune system. Early symptoms are often flu-like and include chills, rash, and fatigue. HIV spreads mainly through sexual contact and shared needles. Left untreated, HIV can progress to AIDS (acquired immunodeficiency syndrome). There is currently no cure or vaccine for HIV. 
Because HIV weakens the immune system, people living with HIV are more likely to become sick with other diseases, especially tuberculosis. 

Preventive measures and access to healthcare services are essential for reducing these infections in the population. However, they are not equally available in every part of the world. In this observational study, we aim to compare the number of deaths due to tuberculosis across countries in different World Bank income groups, specifically between low-income and high-income countries. This comparison lets us examine how a country's wealth, and by extension its capacity to invest in healthcare, relates to deaths caused by diseases such as tuberculosis. 
The dataset that we are using is “Tuberculosis > Mortality Data by Country” (source: https://apps.who.int/gho/data/view.main.57020ALL?lang=en ).
The data file is `xmart.csv`.

**The columns in the full dataset are:**  
**Country**  
**Year**  
**Number of deaths due to tuberculosis, excluding HIV**: number of deaths caused by tuberculosis in a given year, rounded to 2 significant figures  
**Deaths due to tuberculosis among HIV-negative people (per 100 000 population)**: number of tuberculosis deaths among HIV-negative people in a given year, expressed per 100 000 population
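
The two death columns are linked by population size: a rate per 100 000 is the death count divided by the population, scaled by 100 000. The following is a minimal sketch of that arithmetic, not part of the WHO data or of our analysis code. The helper name `rate_per_100k` and the population figure are assumptions made purely for illustration.

```r
# Hypothetical helper: convert a death count into a rate per 100 000 population
rate_per_100k <- function(deaths, population) {
  deaths / population * 100000
}

# Illustrative inputs only: 12 000 deaths in an assumed population of 40 million
example_rate <- rate_per_100k(12000, 40e6)
example_rate
```

Rates of this kind make countries with very different population sizes comparable, which raw death counts do not.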



<h3>Preliminary Results</h3>
<hr>

In [1]:
library(tidyverse)
library(broom)
library(repr)
library(digest)
library(infer)
library(gridExtra)
options(repr.matrix.max.rows = 6)
options(repr.matrix.max.cols = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘gridExtra’

The following object is masked from ‘package:dplyr’:

    combine




For this investigation, we use two datasets: the **World Health Organization (WHO)** Tuberculosis Mortality dataset, and the **OECD** GDP by country dataset.

We begin with the **WHO** dataset, which we'll wrangle into tidy data before incorporating the **OECD** GDP data.

In [3]:
# URL of the WHO dataset csv file
tb_url <- "https://github.com/Remembria/Group-38-Project-Proposal/raw/main/xmart.csv"

# Reading this csv file into a dataframe
tb_df <- read.csv(tb_url)

head(tb_df)

| | Country..Year | Number.of.deaths.due.to.tuberculosis..excluding.HIV | Deaths.due.to.tuberculosis.among.HIV.negative.people..per.100.000.population. |
|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` |
| 1 | Afghanistan; 2021 | 12 000 [7300-19 000] | 31 [18-47] |
| 2 | Afghanistan; 2020 | 13 000 [7800-20 000] | 34 [20-52] |
| 3 | Afghanistan; 2019 | 9800 [5800-15 000] | 26 [15-39] |
| 4 | Afghanistan; 2018 | 11 000 [6300-16 000] | 29 [17-44] |
| 5 | Afghanistan; 2017 | 11 000 [6300-16 000] | 30 [18-45] |
| 6 | Afghanistan; 2016 | 12 000 [6900-18 000] | 34 [20-51] |


As it stands, however, this dataset is unsorted, larger than we need, and cluttered with unnecessary metadata. We fix this with a series of operations that wrangle the data into tidy format with three columns: *Country*, *Year*, and *Number of Deaths due to TB*.
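
Each death count is stored as a string such as `12 000 [7300-19 000]`: a point estimate followed by a bracketed range, with a space as the thousands separator. As a minimal sketch of the parsing this requires, the base R helper below drops the bracketed range and the spaces before converting to a number. The name `parse_estimate` is hypothetical and is not used elsewhere in this notebook. The sample strings are copied from the first rows of the table above.

```r
# Hypothetical helper: extract the point estimate from strings such as
# "12 000 [7300-19 000]" (estimate first, then a bracketed range).
parse_estimate <- function(x) {
  # Drop everything from the first "[" onwards, leaving e.g. "12 000 "
  estimate <- sub("\\[.*$", "", x)
  # Remove all whitespace, including the space used as a thousands separator
  estimate <- gsub("\\s", "", estimate)
  as.numeric(estimate)
}

# Sample values taken from the WHO table shown earlier
raw_deaths <- c("12 000 [7300-19 000]", "9800 [5800-15 000]", "31 [18-47]")
parsed_deaths <- parse_estimate(raw_deaths)
parsed_deaths
```

Keeping only the point estimate discards the uncertainty range, which is a simplification we accept for this proposal.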

In [17]:
# Column names containing spaces are difficult to reference. read.csv already replaces them with dots by default
# (check.names = TRUE); we apply make.names here as a safeguard so every name is syntactically valid
colnames(tb_df) <- make.names(colnames(tb_df))
tb_df

| Country | Year | Number.of.deaths.due.to.tuberculosis..excluding.HIV | Deaths.due.to.tuberculosis.among.HIV.negative.people..per.100.000.population. |
|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| Afghanistan | 2021 | 12 000 [7300-19 000] | 31 [18-47] |
| Afghanistan | 2020 | 13 000 [7800-20 000] | 34 [20-52] |
| Afghanistan | 2019 | 9800 [5800-15 000] | 26 [15-39] |
| ⋮ | ⋮ | ⋮ | ⋮ |
| Zimbabwe | 2002 | 2900 [1200-5400] | 24 [9.9-45] |
| Zimbabwe | 2001 | 3100 [1300-5800] | 26 [11-49] |
| Zimbabwe | 2000 | 3300 [1300-6400] | 28 [11-54] |


In [28]:
# Keep only the columns we need and rename them so we can easily work with them.
# Note: this cell overwrites tb_df, so re-run the read.csv cell before re-running it.
tb_df <- tb_df %>%
    select(Country, Year, Number.of.deaths.due.to.tuberculosis..excluding.HIV) %>%
    rename(country = Country, deaths_exclude_HIV = Number.of.deaths.due.to.tuberculosis..excluding.HIV) %>%
    # Split "12 000 [7300-19 000]" at " [" and keep only the point estimate
    separate(deaths_exclude_HIV, into = c("deaths_exclude_HIV", NA), sep = " \\[", fill = "right") %>%
    # Remove the space used as a thousands separator and convert to numeric
    mutate(deaths_exclude_HIV = as.numeric(str_remove_all(deaths_exclude_HIV, " ")))

tb_df


In [4]:
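# URL of the OECD GDP dataset csv file
gdp_url <- "https://stats.oecd.org/sdmx-json/data/DP_LIVE/.GDP.TOT.USD_CAP.A/OECD?contentType=csv&detail=code&separator=comma&csv-lang=en&startPeriod=2000&endPeriod=2021"

# Reading this csv file into a dataframe
gdp_df <- read.csv(gdp_url)
head(gdp_df)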
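
Incorporating the GDP data amounts to joining the two tables on country and year. The sketch below illustrates that step with base R's `merge` on two small toy tables. All values are made up, and the shared `country`, `year`, and `gdp_per_capita` column names are assumptions: the real OECD file may identify countries by code rather than by name, in which case a name-to-code lookup would be needed first.

```r
# Toy stand-ins for the two tables; all values are illustrative, not real data
tb_toy <- data.frame(
  country = c("A", "A", "B"),
  year = c(2020, 2021, 2021),
  deaths_exclude_HIV = c(100, 120, 30)
)
gdp_toy <- data.frame(
  country = c("A", "A", "B", "C"),
  year = c(2020, 2021, 2021, 2021),
  gdp_per_capita = c(2000, 2100, 45000, 30000)
)

# Inner join: keep only the country-year pairs present in both tables
tb_gdp <- merge(tb_toy, gdp_toy, by = c("country", "year"))
tb_gdp
```

An inner join silently drops countries that appear in only one table (country "C" here), so it is worth comparing row counts before and after the join.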

<h3>Methods: Plan</h3>
<hr>

<h3>References</h3>
<hr>