# Task 4 - Covid-19 Globally

Go to the ECDC and WHO websites and research Covid-19 globally. Document what you investigate and what you come up with. Note that you have to navigate and read your way through their websites to find relevant data.

This presentation examines Covid-19 globally from the start in 2020 until now (November 2023).

1. **Temporal Analysis:**
   - What is the overall trend of new cases and new deaths over time?
   - Are there any noticeable patterns or spikes in new cases or deaths during specific periods?
   - How has the number of cumulative cases and cumulative deaths changed over time in different countries or regions?

2. **Geographical Analysis:**
   - Which countries or regions have reported the highest and lowest number of new cases and new deaths?
   - How does the distribution of new cases and new deaths vary across different WHO regions?
   - Are there specific countries or regions that experienced significant changes in cumulative cases or cumulative deaths during certain periods?

3. **Rate of Spread and Mortality:**
   - What is the average daily increase in new cases and new deaths globally or within specific regions?
   - How does the mortality rate (number of deaths divided by the number of cases) vary across different countries or regions?
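
The mortality rate named in the last question is a plain ratio of deaths to cases. A minimal sketch on made-up numbers (the frame `totals` and its values are hypothetical, not WHO data):

```python
import pandas as pd

# Toy cumulative totals (made-up numbers, not WHO data)
totals = pd.DataFrame(
    {
        "WHO_region": ["EURO", "AMRO", "AFRO"],
        "Cumulative_cases": [1000, 800, 200],
        "Cumulative_deaths": [10, 16, 4],
    }
)

# Mortality rate as defined in the question: deaths divided by cases
totals["mortality_rate"] = totals["Cumulative_deaths"] / totals["Cumulative_cases"]
print(totals)
```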

In [72]:
import pandas as pd
import plotly.express as px

#### Data Download

The URL comes from the [WHO](https://covid19.who.int/data) website and points to daily cases and deaths by date reported to WHO. A copy of the data is stored in the Data folder, and the analysis is performed on that copy. Because the data on the website is updated daily, the download code below is commented out.

In [73]:
# url = "https://covid19.who.int/WHO-COVID-19-global-data.csv"

# # Read the data from the URL into a Pandas DataFrame
# data = pd.read_csv(url)

# # The local file path where I want to save the data
# local_file_path = "Data/WHO-COVID-19-global-data.csv"

# # Save the DataFrame to a CSV file in the specified local path
# data.to_csv(local_file_path, index=False)

# print(f"Data has been downloaded and saved to: {local_file_path}")
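
One way to avoid commenting the download in and out is to download only when the local copy is missing. The sketch below is an assumption about how that could look, not part of the original analysis; the helper name `load_covid_data` is hypothetical, and the demo uses a temporary file so that no network access is needed.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_covid_data(local_file_path, url):
    """Read the local copy if it exists, otherwise download it once and save it."""
    path = Path(local_file_path)
    if not path.exists():
        data = pd.read_csv(url)
        path.parent.mkdir(parents=True, exist_ok=True)
        data.to_csv(path, index=False)
    return pd.read_csv(path, parse_dates=["Date_reported"], index_col="Date_reported")


# Demo with a temporary file, so the URL is never requested
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("Date_reported,New_cases\n2020-01-03,0\n2020-01-04,5\n")
    demo = load_covid_data(demo_path, url="not-used-because-file-exists")
print(demo)
```

With this approach the saved snapshot stays fixed until the local file is deleted, which keeps the analysis reproducible.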

In [74]:
covid_global_data = pd.read_csv("Data/WHO-COVID-19-global-data.csv", parse_dates=['Date_reported'], index_col='Date_reported')

The code below uses `.head()` to give an initial understanding of the data. The dataset starts on 3 January 2020 and contains, for each day, the number of new cases and deaths as well as the cumulative number of cases and deaths. When reading the file above, we set the date column as the index and parsed it as datetime objects.

In [75]:
covid_global_data.head()

Date_reported,Country_code,Country,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
2020-01-03,AF,Afghanistan,EMRO,0,0,0,0
2020-01-04,AF,Afghanistan,EMRO,0,0,0,0
2020-01-05,AF,Afghanistan,EMRO,0,0,0,0
2020-01-06,AF,Afghanistan,EMRO,0,0,0,0
2020-01-07,AF,Afghanistan,EMRO,0,0,0,0


The code below prints the index, which shows the number of rows in the dataset (`length`), and the columns, which lists all column names. As seen, the dataframe contains values from 2020-01-03 to 2023-11-09.

In [76]:
print(covid_global_data.index)
print(covid_global_data.columns)

DatetimeIndex(['2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06',
               '2020-01-07', '2020-01-08', '2020-01-09', '2020-01-10',
               '2020-01-11', '2020-01-12',
               ...
               '2023-10-31', '2023-11-01', '2023-11-02', '2023-11-03',
               '2023-11-04', '2023-11-05', '2023-11-06', '2023-11-07',
               '2023-11-08', '2023-11-09'],
              dtype='datetime64[ns]', name='Date_reported', length=333459, freq=None)
Index(['Country_code', 'Country', 'WHO_region', 'New_cases',
       'Cumulative_cases', 'New_deaths', 'Cumulative_deaths'],
      dtype='object')


The code below uses `value_counts()` and `tail()` to show that the dataset contains 237 countries with 1407 rows each and covers the period from January 2020 (see above) to November 2023.

In [77]:
display(f"Number of countries: {len(covid_global_data['Country'].value_counts())}")
display(covid_global_data["Country"].value_counts())
display(covid_global_data.tail())

'Number of countries: 237'

Country
Afghanistan        1407
Paraguay           1407
Nigeria            1407
Niue               1407
North Macedonia    1407
                   ... 
Grenada            1407
Guadeloupe         1407
Guam               1407
Guatemala          1407
Zimbabwe           1407
Name: count, Length: 237, dtype: int64

Date_reported,Country_code,Country,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
2023-11-05,ZW,Zimbabwe,AFRO,0,265848,0,5723
2023-11-06,ZW,Zimbabwe,AFRO,0,265848,0,5723
2023-11-07,ZW,Zimbabwe,AFRO,0,265848,0,5723
2023-11-08,ZW,Zimbabwe,AFRO,0,265848,0,5723
2023-11-09,ZW,Zimbabwe,AFRO,0,265848,0,5723
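
The counts above fit together: 237 countries with 1407 rows each gives 237 × 1407 = 333459 rows, which matches the index length printed earlier. A small sketch of how such a check could be written, shown on a made-up frame (`toy` is hypothetical, not the WHO data):

```python
import pandas as pd

# Toy frame: 3 countries x 4 days (made-up values, not WHO data)
dates = pd.date_range("2020-01-03", periods=4, freq="D")
toy = pd.DataFrame(
    {"Country": [c for c in ["A", "B", "C"] for _ in dates]},
    index=pd.DatetimeIndex(list(dates) * 3, name="Date_reported"),
)

rows_per_country = toy["Country"].value_counts()
n_countries = toy["Country"].nunique()

# Every country should have the same number of rows (one per day)
is_balanced = rows_per_country.nunique() == 1
# Countries times days should equal the total number of rows
expected_rows = n_countries * rows_per_country.iloc[0]

print(is_balanced, expected_rows == len(toy))
```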


The `unique()` method below lists the WHO regions represented in our dataset. The organization has six regions ([WHO regions](https://www.who.int/about/who-we-are/regional-offices)). All six are represented in our dataset, alongside some entries not covered by any of the regions, labeled 'Other'.

In [93]:
covid_global_data["WHO_region"].unique()

array(['EMRO', 'EURO', 'AFRO', 'WPRO', 'AMRO', 'SEARO', 'Other'],
      dtype=object)

### 1. **Temporal Analysis:**
   - What is the overall trend of new cases and new deaths over time?
   - Are there any noticeable patterns or spikes in new cases or deaths during specific periods?
   - How has the number of cumulative cases and cumulative deaths changed over time in different countries or regions?

For the whole time period, the mean is about 2315 new cases and about 21 new deaths per row, that is, per country per day (the mean is taken over all rows, not over global daily totals). One could guess that these values would differ if we looked at only one specific year (especially before vaccines became more widely available globally).

In [79]:
display(covid_global_data["New_cases"].mean())
display(covid_global_data["New_deaths"].mean())

2314.590210490645

20.9266356583568
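
Calling `.mean()` on the full frame averages over every country-day row. A global daily figure would first sum over countries for each date and then average. The difference can be sketched on made-up numbers (the frame `toy` is hypothetical):

```python
import pandas as pd

# Toy frame: 2 countries x 2 days (made-up values, not WHO data)
toy = pd.DataFrame(
    {"Country": ["A", "B", "A", "B"], "New_cases": [10, 30, 20, 40]},
    index=pd.DatetimeIndex(
        ["2020-01-03", "2020-01-03", "2020-01-04", "2020-01-04"], name="Date_reported"
    ),
)

# Mean over all rows: average per country per day
mean_per_country_day = toy["New_cases"].mean()

# Sum over countries for each date first, then average: global mean per day
mean_global_per_day = toy.groupby(level="Date_reported")["New_cases"].sum().mean()

print(mean_per_country_day, mean_global_per_day)
```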

The code below takes values from each specified year (using loc) and prints the mean values over the year for the columns with data on new cases and deaths reported daily.

In [90]:
print(f"Mean new cases 2020: {covid_global_data.loc['2020', 'New_cases'].mean()}")
print(f"Mean new cases 2021: {covid_global_data.loc['2021', 'New_cases'].mean()}")
print(f"Mean new cases 2022: {covid_global_data.loc['2022', 'New_cases'].mean()}")
print(f"Mean new cases 2023: {covid_global_data.loc['2023', 'New_cases'].mean()}")
print()
print(f"Mean new deaths 2020: {covid_global_data.loc['2020', 'New_deaths'].mean()}")
print(f"Mean new deaths 2021: {covid_global_data.loc['2021', 'New_deaths'].mean()}")
print(f"Mean new deaths 2022: {covid_global_data.loc['2022', 'New_deaths'].mean()}")
print(f"Mean new deaths 2023: {covid_global_data.loc['2023', 'New_deaths'].mean()}")



Mean new cases 2020: 954.4161450363982
Mean new cases 2021: 2347.9719206982254
Mean new cases 2022: 5132.990208658459
Mean new cases 2023: 570.8299429773123

Mean new deaths 2020: 22.450410349144526
Mean new deaths 2021: 40.89197156233744
Mean new deaths 2022: 14.392855904282989
Mean new deaths 2023: 3.491581402245858
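
The same yearly means can be obtained without one line per year by grouping on the year of the datetime index. This is a sketch of an alternative, shown on a made-up frame (`toy` is hypothetical), not the code that produced the output above:

```python
import pandas as pd

# Toy frame with two dates per year (made-up values, not WHO data)
toy = pd.DataFrame(
    {"New_cases": [1, 3, 10, 20], "New_deaths": [0, 2, 4, 6]},
    index=pd.DatetimeIndex(
        ["2020-06-01", "2020-07-01", "2021-06-01", "2021-07-01"], name="Date_reported"
    ),
)

# Group by the year part of the index and average both columns at once
yearly_means = toy.groupby(toy.index.year)[["New_cases", "New_deaths"]].mean()
print(yearly_means)
```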


As seen above, the number of cases and deaths differs quite a lot between years. The highest mean number of new cases per day was in 2022, and the highest mean number of deaths per day was in 2021.

Below we plot the total number of new cases and new deaths per day globally.

Note:

*WHO Regions*
- EMRO: Eastern Mediterranean Region
- EURO: European Region
- AFRO: African Region
- WPRO: Western Pacific Region
- AMRO: Region of the Americas
- SEARO: South-East Asia Region
- Other: Other Regions

In [100]:
# Group data by 'Date_reported' and sum the 'New_cases' and 'New_deaths' column for each date
global_cases_and_deaths_by_date = covid_global_data.groupby(covid_global_data.index)[['New_cases', 'New_deaths']].sum()

# Create a line chart using Plotly Express
fig = px.line(global_cases_and_deaths_by_date, 
              x=global_cases_and_deaths_by_date.index, 
              y='New_cases', 
              log_y=True,
              labels={'New_cases': 'Total Cases', 'Date_reported': 'Date'},
              title='Total Number of Cases by Day Globally',
              template='plotly_dark')
fig.show()

# Create a line chart using Plotly Express
fig = px.line(global_cases_and_deaths_by_date, 
              x=global_cases_and_deaths_by_date.index, 
              y='New_deaths', 
              log_y=True,
              labels={'New_deaths': 'Total deceased', 'Date_reported': 'Date'},
              title='Total Number of Deaths by Day Globally',
              template='plotly_dark')
fig.show()

In the graphs above we can detect two clear peaks in the number of cases, one in January 2022 and one in December of the same year; the peaks for deaths come a few weeks later. Higher case numbers appear to be followed by higher death numbers, even after vaccinations started.
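
The delay between a peak in cases and a peak in deaths can also be estimated numerically instead of read off the graph. The sketch below is one possible approach on made-up numbers (the frame `toy` is hypothetical): it smooths both series with a centered rolling mean and compares the dates of the maxima. The 3-day window is an assumption that suits the tiny example; for real daily reports a longer window such as 7 days may be more appropriate.

```python
import pandas as pd

# Toy daily series with one peak each (made-up values, not WHO data)
dates = pd.date_range("2022-01-01", periods=10, freq="D", name="Date_reported")
toy = pd.DataFrame(
    {
        "New_cases": [1, 2, 5, 9, 5, 2, 1, 1, 1, 1],
        "New_deaths": [0, 0, 1, 1, 2, 4, 8, 4, 2, 1],
    },
    index=dates,
)

# Centered 3-day rolling mean to damp day-to-day reporting noise
smoothed = toy.rolling(window=3, center=True).mean()

# Dates of the highest smoothed values (NaN edges are skipped by idxmax)
cases_peak = smoothed["New_cases"].idxmax()
deaths_peak = smoothed["New_deaths"].idxmax()
lag_days = (deaths_peak - cases_peak).days

print(cases_peak.date(), deaths_peak.date(), lag_days)
```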

Now let's take a look at the cumulative number of cases and deaths by region. The code is almost the same, but we add WHO_region when we group the data and change the aggregated columns to cumulative cases and deaths.

As seen in the graphs below, the highest number of cases is found in the EURO region (European Region), followed by the AMRO region (Region of the Americas). At the bottom we find the 'Other' category (it is not clear from the data which places this covers) and the AFRO region (African Region). The same goes for the total number of deaths, but with the Region of the Americas at the top.

The plateau for cases and deaths happens at about the same time, in the summer/autumn of 2020. After that point, the increase is not as dramatic as in the initial stage. The curve that looks a bit different from the others is Western Pacific.
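
The ranking of regions can be checked in a table as well as in a graph. Because the cumulative columns already hold running totals, one option is to keep only the most recent date and sum the countries within each region. A sketch on made-up numbers (the frame `toy` is hypothetical):

```python
import pandas as pd

# Toy frame: 3 countries x 2 days (made-up values, not WHO data)
toy = pd.DataFrame(
    {
        "Country": ["A", "A", "B", "B", "C", "C"],
        "WHO_region": ["EURO", "EURO", "EURO", "EURO", "AFRO", "AFRO"],
        "Cumulative_cases": [5, 9, 2, 4, 1, 3],
    },
    index=pd.DatetimeIndex(["2023-11-08", "2023-11-09"] * 3, name="Date_reported"),
)

# Keep only the most recent date, where cumulative values are the totals to date
latest = toy[toy.index == toy.index.max()]

# Sum the countries' totals within each region and rank them
region_totals = (
    latest.groupby("WHO_region")["Cumulative_cases"].sum().sort_values(ascending=False)
)
print(region_totals)
```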

In [None]:
# Function for updating the names of WHO regions in the legend

def labeling_who_regions():
    """Rename the traces of the current global `fig` to readable WHO region names."""
    region_names = {
        'EMRO': 'Eastern Mediterranean (EMRO)',
        'EURO': 'European (EURO)',
        'AFRO': 'African (AFRO)',
        'WPRO': 'Western Pacific (WPRO)',
        'AMRO': 'Americas (AMRO)',
        'SEARO': 'South-East Asia (SEARO)',
    }

    # Traces whose name is not in the mapping (e.g. 'Other') keep their name
    fig.for_each_trace(lambda t: t.update(name=region_names.get(t.name, t.name)))


In [None]:
# Group data by 'Date_reported' and 'WHO_region' and sum the 'Cumulative_cases' and 'Cumulative_deaths' columns for each date and region
global_cases_by_date_region = covid_global_data.groupby(['Date_reported', 'WHO_region'])[['Cumulative_cases', 'Cumulative_deaths']].sum().reset_index()

fig = px.line(
        data_frame=global_cases_by_date_region,
        y='Cumulative_cases',
        log_y=True,
        x='Date_reported',
        color='WHO_region',
        labels={'Cumulative_cases': 'Cumulative cases', 'Date_reported': 'Date'},
        title='Cumulative Cases by WHO Region',
        template='plotly_dark'
    )
labeling_who_regions()
fig.show()

fig = px.line(
        data_frame=global_cases_by_date_region,
        y='Cumulative_deaths',
        log_y=True,
        x='Date_reported',
        color='WHO_region',
        labels={'Cumulative_deaths': 'Cumulative deaths', 'Date_reported': 'Date'},
        title='Cumulative Deaths by WHO Region',
        template='plotly_dark'
    )
labeling_who_regions()
fig.show()

### 2. **Geographical Analysis:**
   - Which countries or regions have reported the highest and lowest number of new cases and new deaths?
   - How does the distribution of new cases and new deaths vary across different WHO regions?
   - Are there specific countries or regions that experienced significant changes in cumulative cases or cumulative deaths during certain periods?

The code below groups our dataframe by date and WHO region and then aggregates the maximum and minimum values of the columns with new cases and deaths per day. The result is then presented in bar plots.

In [None]:
max_min_deaths_cases_regions = (covid_global_data.groupby(['Date_reported', 'WHO_region'])[['New_cases', 'New_deaths']]
    .agg(
        max_cases=pd.NamedAgg(column="New_cases", aggfunc="max"),
        min_cases=pd.NamedAgg(column="New_cases", aggfunc="min"),
        max_deaths=pd.NamedAgg(column="New_deaths", aggfunc="max"),
        min_deaths=pd.NamedAgg(column="New_deaths", aggfunc="min")
    )
)

display(max_min_deaths_cases_regions)

Date_reported,WHO_region,max_cases,min_cases,max_deaths,min_deaths
2020-01-03,AFRO,0,0,0,0
2020-01-03,AMRO,0,0,0,0
2020-01-03,EMRO,0,0,0,0
2020-01-03,EURO,0,0,0,0
2020-01-03,Other,0,0,0,0
...,...,...,...,...,...
2023-11-09,EMRO,0,0,0,0
2023-11-09,EURO,0,0,0,0
2023-11-09,Other,0,0,0,0
2023-11-09,SEARO,0,0,0,0


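The minimum values for the number of cases and deaths apparently include negative values, although none are visible in the truncated output above. Why this is so is hard to tell from the data alone; one possible explanation is that earlier reports were corrected afterwards. Down below, the maximum values for each region are shown in two separate graphs.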
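
To see where negative values occur, the rows can be filtered with a boolean mask. A minimal sketch on made-up numbers (the frame `toy` is hypothetical):

```python
import pandas as pd

# Toy frame with a few negative daily values (made-up, not WHO data)
toy = pd.DataFrame(
    {
        "Country": ["A", "A", "B", "B"],
        "New_cases": [5, -2, 0, 7],
        "New_deaths": [1, 0, -1, 0],
    }
)

# Rows where either daily column is below zero
negative_rows = toy[(toy["New_cases"] < 0) | (toy["New_deaths"] < 0)]

# Number of negative entries per column
negative_counts = (toy[["New_cases", "New_deaths"]] < 0).sum()

print(negative_rows)
print(negative_counts)
```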

In [None]:
fig = px.bar(
        data_frame=max_min_deaths_cases_regions,
        x='WHO_region',
        y="max_cases",
        log_y=True,
        labels=dict(max_cases= 'Max Cases', WHO_region= 'WHO Region'),
        hover_data=['max_cases', 'WHO_region'],
        title='Maximum number of cases in One Day by WHO Region',
        template='plotly_dark',
    )
fig.show()

fig = px.bar(
        data_frame=max_min_deaths_cases_regions,
        x='WHO_region',
        y="max_deaths",
        log_y=True,
        labels=dict(max_deaths='Max deaths', WHO_region= 'WHO Region'),
        hover_data=['max_deaths', 'WHO_region'],
        title='Maximum number of deaths in One Day by WHO Region',
        template='plotly_dark',
    )
fig.show()
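
Since `max_min_deaths_cases_regions` holds one row per date and region, a chart of the single highest day per region needs one row per region instead. A possible variant, sketched on made-up numbers (the frame `toy` is hypothetical), groups by region only and calls `reset_index()` so that `WHO_region` becomes an ordinary column that can be passed as `x`:

```python
import pandas as pd

# Toy frame of country-day rows (made-up values, not WHO data)
toy = pd.DataFrame(
    {
        "WHO_region": ["EURO", "EURO", "AFRO", "AFRO"],
        "New_cases": [10, 50, 3, 7],
        "New_deaths": [1, 4, 0, 2],
    }
)

# Group by region only, then turn the index back into a column for plotting
max_by_region = (
    toy.groupby("WHO_region")
    .agg(
        max_cases=pd.NamedAgg(column="New_cases", aggfunc="max"),
        max_deaths=pd.NamedAgg(column="New_deaths", aggfunc="max"),
    )
    .reset_index()
)
print(max_by_region)
```

Each value is then the highest number reported by a single country on a single day within that region.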


As seen above 