# Task 4 - Covid-19 Globally

Go to the ECDC and WHO websites and research Covid-19 globally. Document what you investigate and what you come up with. Note that you have to navigate and read your way through their websites to find relevant data.

This presentation examines Covid-19 globally from the start in 2020 until now (November 2023).

1. **Temporal Analysis:**
   - What is the overall trend of new cases and new deaths over time?
   - Are there any noticeable patterns or spikes in new cases or deaths during specific periods?
   - How has the number of cumulative cases and cumulative deaths changed over time in different countries or regions?

2. **Geographical Analysis:**
   - Which countries have reported the highest and lowest number of new cases and new deaths?

3. **Rate of Spread and Mortality in the autumn of 2023:**
   - What is the average daily increase in new cases and new deaths globally or within specific regions?

In [18]:
import pandas as pd
import plotly.express as px

#### Data Download

The URL comes from the [WHO](https://covid19.who.int/data) website and points to daily cases and deaths by date reported to WHO. A copy of the data is stored in the Data folder, and it is this copy the examination is performed on. Because the data on the website is updated every day, the download code below is commented out.

In [19]:
# url = "https://covid19.who.int/WHO-COVID-19-global-data.csv"

# # Read the data from the URL into a Pandas DataFrame
# data = pd.read_csv(url)

# # The local file path where I want to save the data
# local_file_path = "Data/WHO-COVID-19-global-data.csv"

# # Save the DataFrame to a CSV file in the specified local path
# data.to_csv(local_file_path, index=False)

# print(f"Data has been downloaded and saved to: {local_file_path}")
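One way to keep the analysis reproducible while still allowing a refresh is to download the file only when no local copy exists. The sketch below is one possible approach, not the notebook's own method; the helper name `load_or_download` is hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_or_download(url: str, local_path: str) -> pd.DataFrame:
    """Return the local CSV if it exists; otherwise download and cache it.

    Hypothetical helper, not part of the original notebook.
    """
    path = Path(local_path)
    if path.exists():
        # Reuse the frozen snapshot so results stay reproducible.
        return pd.read_csv(path)
    # pandas can read a CSV directly from a URL.
    data = pd.read_csv(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    data.to_csv(path, index=False)
    return data
```

With the paths used in this notebook, the call would be `load_or_download("https://covid19.who.int/WHO-COVID-19-global-data.csv", "Data/WHO-COVID-19-global-data.csv")`. Deleting the local file forces a fresh download.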

In [20]:
covid_global_data = pd.read_csv("Data/WHO-COVID-19-global-data.csv", parse_dates=['Date_reported'], index_col='Date_reported')

The code below uses `.head()` to give an initial understanding of what kind of data to expect. The data set starts on 3 January 2020 and contains, for each day, the number of new cases and deaths as well as the cumulative number of cases and deaths. Above we chose to set the date column as the index and to parse it as datetime objects.

In [21]:
covid_global_data.head()

| Date_reported | Country_code | Country | WHO_region | New_cases | Cumulative_cases | New_deaths | Cumulative_deaths |
|---|---|---|---|---|---|---|---|
| 2020-01-03 | AF | Afghanistan | EMRO | 0 | 0 | 0 | 0 |
| 2020-01-04 | AF | Afghanistan | EMRO | 0 | 0 | 0 | 0 |
| 2020-01-05 | AF | Afghanistan | EMRO | 0 | 0 | 0 | 0 |
| 2020-01-06 | AF | Afghanistan | EMRO | 0 | 0 | 0 | 0 |
| 2020-01-07 | AF | Afghanistan | EMRO | 0 | 0 | 0 | 0 |
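Setting the date column as a parsed index is what later makes selections such as `.loc['2021']` possible. A minimal sketch on a tiny made-up sample (the values are invented and only the column names mirror the WHO file):

```python
import io

import pandas as pd

# Tiny made-up sample in the same column layout as the WHO file.
csv_text = """Date_reported,Country,New_cases
2020-12-31,A,5
2021-01-01,A,7
2021-01-02,A,9
"""

sample = pd.read_csv(io.StringIO(csv_text),
                     parse_dates=["Date_reported"],
                     index_col="Date_reported")

# The index is a DatetimeIndex, so a year string selects all rows of that year.
rows_2021 = sample.loc["2021"]
print(type(sample.index).__name__)
print(rows_2021["New_cases"].tolist())
```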


Printing the index shows the number of rows in the dataset, and `columns` lists all of its columns. As seen, the dataframe contains values from 2020-01-03 to 2023-11-09.

In [22]:
print(covid_global_data.index)
print(covid_global_data.columns)

DatetimeIndex(['2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06',
               '2020-01-07', '2020-01-08', '2020-01-09', '2020-01-10',
               '2020-01-11', '2020-01-12',
               ...
               '2023-10-31', '2023-11-01', '2023-11-02', '2023-11-03',
               '2023-11-04', '2023-11-05', '2023-11-06', '2023-11-07',
               '2023-11-08', '2023-11-09'],
              dtype='datetime64[ns]', name='Date_reported', length=333459, freq=None)
Index(['Country_code', 'Country', 'WHO_region', 'New_cases',
       'Cumulative_cases', 'New_deaths', 'Cumulative_deaths'],
      dtype='object')


The code below, using `value_counts()` and `tail()`, shows that the dataset contains data for 237 countries with 1407 rows each, and covers the period from January 2020 (see above) to November 2023.

In [23]:
display(f"Number of countries: {len(covid_global_data['Country'].value_counts())}")
display(covid_global_data["Country"].value_counts())
display(covid_global_data.tail())

'Number of countries: 237'

Country
Afghanistan        1407
Paraguay           1407
Nigeria            1407
Niue               1407
North Macedonia    1407
                   ... 
Grenada            1407
Guadeloupe         1407
Guam               1407
Guatemala          1407
Zimbabwe           1407
Name: count, Length: 237, dtype: int64

| Date_reported | Country_code | Country | WHO_region | New_cases | Cumulative_cases | New_deaths | Cumulative_deaths |
|---|---|---|---|---|---|---|---|
| 2023-11-05 | ZW | Zimbabwe | AFRO | 0 | 265848 | 0 | 5723 |
| 2023-11-06 | ZW | Zimbabwe | AFRO | 0 | 265848 | 0 | 5723 |
| 2023-11-07 | ZW | Zimbabwe | AFRO | 0 | 265848 | 0 | 5723 |
| 2023-11-08 | ZW | Zimbabwe | AFRO | 0 | 265848 | 0 | 5723 |
| 2023-11-09 | ZW | Zimbabwe | AFRO | 0 | 265848 | 0 | 5723 |
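These figures are consistent with one row per country per calendar day. A quick arithmetic check using only the numbers printed above (the variable names are illustrative):

```python
from datetime import date

first_day = date(2020, 1, 3)
last_day = date(2023, 11, 9)

# Inclusive number of calendar days covered by the index.
days_covered = (last_day - first_day).days + 1

countries = 237      # from value_counts() above
total_rows = 333459  # length of the DatetimeIndex printed earlier

print(days_covered)                            # 1407
print(countries * days_covered == total_rows)  # True
```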


The `unique` method below returns the WHO regions represented in our dataset. The organization has six regions ([WHO regions](https://www.who.int/about/who-we-are/regional-offices)). All six are represented in our dataset, along with some countries not covered by the organization, labelled 'Other'.

In [24]:
covid_global_data["WHO_region"].unique()

array(['EMRO', 'EURO', 'AFRO', 'WPRO', 'AMRO', 'SEARO', 'Other'],
      dtype=object)

### 1. **Temporal Analysis:**
   - What is the overall trend of new cases and new deaths over time?
   - Are there any noticeable patterns or spikes in new cases or deaths during specific periods?
   - How has the number of cumulative cases and cumulative deaths changed over time in different countries or regions?

For the whole time period the mean is 2314 new cases and about 21 new deaths per day. Note that these means are taken over all rows, and each row is one country on one day, so they are averages per country per day rather than worldwide daily totals. One could guess that these values would differ if we looked at only one specific year (especially before vaccines became widely available globally).

In [25]:
display(covid_global_data["New_cases"].mean())
display(covid_global_data["New_deaths"].mean())

2314.590210490645

20.9266356583568
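Calling `.mean()` directly on the column averages over every country-day row. To get the average worldwide total per day instead, the rows would first have to be summed per date. A minimal sketch on made-up numbers:

```python
import pandas as pd

# Made-up numbers: two countries reporting on the same two days.
toy = pd.DataFrame(
    {"Country": ["A", "B", "A", "B"], "New_cases": [10, 30, 20, 40]},
    index=pd.to_datetime(["2020-01-03", "2020-01-03", "2020-01-04", "2020-01-04"]),
)
toy.index.name = "Date_reported"

# Mean over all rows: average per country per day.
mean_per_country_day = toy["New_cases"].mean()

# Sum per date first, then average: average worldwide total per day.
mean_global_per_day = toy.groupby(toy.index)["New_cases"].sum().mean()

print(mean_per_country_day)  # 25.0
print(mean_global_per_day)   # 50.0
```

Because every date has the same number of rows, the second value is the first multiplied by the number of countries. Applied to the real data, 2314.59 × 237 gives roughly 548,600 new cases per day worldwide.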

The code below takes values from each specified year (using loc) and prints the mean values over the year for the columns with data on new cases and deaths reported daily.

In [26]:
# Mean of the daily new cases and new deaths per country, for each calendar year
print(f"Mean new cases 2020: {covid_global_data.loc['2020', 'New_cases'].mean()}")
print(f"Mean new cases 2021: {covid_global_data.loc['2021', 'New_cases'].mean()}")
print(f"Mean new cases 2022: {covid_global_data.loc['2022', 'New_cases'].mean()}")
print(f"Mean new cases 2023: {covid_global_data.loc['2023', 'New_cases'].mean()}")
print()
print(f"Mean new deaths 2020: {covid_global_data.loc['2020', 'New_deaths'].mean()}")
print(f"Mean new deaths 2021: {covid_global_data.loc['2021', 'New_deaths'].mean()}")
print(f"Mean new deaths 2022: {covid_global_data.loc['2022', 'New_deaths'].mean()}")
print(f"Mean new deaths 2023: {covid_global_data.loc['2023', 'New_deaths'].mean()}")



Mean new cases 2020: 954.4161450363982
Mean new cases 2021: 2347.9719206982254
Mean new cases 2022: 5132.990208658459
Mean new cases 2023: 570.8299429773123

Mean new deaths 2020: 22.450410349144526
Mean new deaths 2021: 40.89197156233744
Mean new deaths 2022: 14.392855904282989
Mean new deaths 2023: 3.491581402245858
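The eight `print` calls could also be expressed as a single grouping on the year of the index. This is an alternative sketch on made-up numbers, not the notebook's own approach:

```python
import pandas as pd

# Made-up values for two years.
toy = pd.DataFrame(
    {"New_cases": [1, 3, 10, 20], "New_deaths": [0, 2, 4, 6]},
    index=pd.to_datetime(["2020-05-01", "2020-06-01", "2021-05-01", "2021-06-01"]),
)

# Group rows by the year of the DatetimeIndex and average both columns at once.
yearly_means = toy.groupby(toy.index.year)[["New_cases", "New_deaths"]].mean()
print(yearly_means)
```

The result has one row per year, so a new year in the data shows up without any change to the code.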


As seen above, the number of cases and deaths differs quite a lot between the years. The year with the most cases was 2022, and the year with the highest number of deaths per day was 2021.

Below we find the total number of cases globally day by day. Since each date appears in the index once per country, we group the data by date (the index) and calculate the sum of new cases and deaths for each day.

In [27]:
# Group data by 'Date_reported' and sum the 'New_cases' and 'New_deaths' column for each date
global_cases_and_deaths_by_date = covid_global_data.groupby(covid_global_data.index)[[
    'New_cases', 'New_deaths']].sum()

# Create a line chart using Plotly Express
fig = px.line(global_cases_and_deaths_by_date, 
              x=global_cases_and_deaths_by_date.index, 
              y='New_cases', 
              log_y=True,
              labels={'New_cases': 'Total Cases', 'Date_reported': 'Date'},
              title='Total Number of Cases by Day Globally',
              template='plotly_dark')
fig.write_html("Visualizations/cases_by_day_globally.html")

fig.show()

# Create a line chart of daily deaths using Plotly Express
fig = px.line(global_cases_and_deaths_by_date, 
              x=global_cases_and_deaths_by_date.index, 
              y='New_deaths', 
              log_y=True,
              labels={'New_deaths': 'Total deceased', 'Date_reported': 'Date'},
              title='Total Number of Deaths by Day Globally',
              template='plotly_dark')
fig.write_html("Visualizations/deaths_by_day_globally.html")
fig.show()

From the graphs above we can detect two clear peaks in the number of cases, one in January 2022 and one in December of the same year; the peaks for deaths come a few weeks later. It's quite clear that the higher the number of cases, the higher the number of deaths, even after the vaccinations started.
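Peak dates can also be read from the data rather than from the chart. The sketch below uses made-up daily totals standing in for `global_cases_and_deaths_by_date`. Note that `idxmax` only finds the single largest value per column, so locating a second peak would require restricting the date range first.

```python
import pandas as pd

# Made-up daily totals standing in for global_cases_and_deaths_by_date.
daily = pd.DataFrame(
    {"New_cases": [100, 900, 300, 200], "New_deaths": [1, 5, 20, 8]},
    index=pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04"]),
)

# idxmax returns the index label (here a date) of the largest value per column.
peak_dates = daily.idxmax()

# Time between the peak in cases and the peak in deaths.
lag = peak_dates["New_deaths"] - peak_dates["New_cases"]

print(peak_dates["New_cases"].date())
print(lag.days)
```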

Now let's take a look at the cumulative number of cases and deaths by region. The code is almost the same, but we add WHO_region when we group the data and change the aggregated columns to cumulative cases and deaths.

As seen in the graphs below, the highest number of cases is found in the EURO region (European Region), followed by the AMRO region (Region of the Americas). At the bottom we find the Other region (it is not quite clear what it covers) and the AFRO region (African Region). The same goes for the total number of deaths, but with the Region of the Americas at the top.

The plateau for cases and deaths happens at about the same time, in the summer/autumn of 2020. After that point, the increase is not as dramatic as in the initial stage. The curve that looks a bit different from the others is that of the Western Pacific.

Note:

*WHO Regions*
- EMRO: Eastern Mediterranean Region
- EURO: European Region
- AFRO: African Region
- WPRO: Western Pacific Region
- AMRO: Region of the Americas
- SEARO: South-East Asia Region
- Other: Other Regions

In [28]:
# Define a function to update the names on the lines
def labeling_who_regions(fig):
    region_names = {
        'EMRO': 'Eastern Mediterranean (EMRO)',
        'EURO': 'European (EURO)',
        'AFRO': 'African (AFRO)',
        'WPRO': 'Western Pacific (WPRO)',
        'AMRO': 'Americas (AMRO)',
        'SEARO': 'South-East Asia (SEARO)',
    }

    fig.for_each_trace(lambda t: t.update(name=region_names.get(t.name, t.name)))



In [29]:
# Group by 'WHO_region' and date and sum the cumulative values
sum_by_region = covid_global_data.groupby(['WHO_region', covid_global_data.index])[
    ['Cumulative_cases', 'Cumulative_deaths']].sum().reset_index()

# Plot Cumulative Cases
fig = px.line(sum_by_region, 
                    x='Date_reported', 
                    y='Cumulative_cases', 
                    color='WHO_region',
                    template='plotly_dark',
                    labels={'WHO_region': 'WHO Region'},
                    title='Cumulative Cases by WHO Region')
fig.update_xaxes(title_text='Date Reported')
fig.update_yaxes(title_text='Cumulative Cases')
labeling_who_regions(fig)
fig.write_html("Visualizations/cumulative_cases_by_who_region.html")
fig.show()

# Plot Cumulative Deaths
# Assign to `fig` (not a separate name) so the axis updates, labels,
# export and display below all apply to the deaths figure.
fig = px.line(sum_by_region, 
                    x='Date_reported', 
                    y='Cumulative_deaths', 
                    color='WHO_region',
                    template='plotly_dark',
                    labels={'WHO_region': 'WHO Region'},
                    title='Cumulative Deaths by WHO Region')
fig.update_xaxes(title_text='Date Reported')
fig.update_yaxes(title_text='Cumulative Deaths')
labeling_who_regions(fig)
fig.write_html("Visualizations/cumulative_deaths_by_who_region.html")
fig.show()

As seen above, the steepest curves are found in the European and Western Pacific regions. The European region has had the highest number of reported cases, and the lowest number is found in the African region. But even though the most reported cases are found in Europe, the most deaths due to Covid-19 are found in the Americas.
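The ranking of the regions can also be checked numerically by taking the cumulative values on the last reported date. A sketch on made-up numbers (the region codes are real, the counts are invented):

```python
import pandas as pd

# Made-up cumulative counts for two regions on two dates.
toy = pd.DataFrame(
    {
        "WHO_region": ["EURO", "AFRO", "EURO", "AFRO"],
        "Cumulative_cases": [100, 10, 150, 12],
        "Cumulative_deaths": [5, 1, 6, 2],
    },
    index=pd.to_datetime(["2023-11-08", "2023-11-08", "2023-11-09", "2023-11-09"]),
)

# Keep only the most recent date, then sum the countries within each region.
latest = toy[toy.index == toy.index.max()]
totals = (latest.groupby("WHO_region")[["Cumulative_cases", "Cumulative_deaths"]]
          .sum()
          .sort_values("Cumulative_cases", ascending=False))
print(totals)
```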

### 2. **Geographical Analysis:**
   - Which countries have reported the highest and lowest number of new cases and new deaths?

The code below groups our dataframe by country and then aggregates the maximum and minimum values from the columns with new cases and deaths per day. The result is then presented in a scatter plot.

The minimum values aren't used since they contain negative values; why this is so is hard to tell. The graph tells us that Pitcairn Islands had the lowest number of cases and actually no deaths reported at all. This was also the case for Holy See and Tokelau.

The most deaths reported in one day were in Chile and Ecuador, but the most cases in one day are found in China.

In [30]:
# Group by 'Country' and aggregate the max and min values for new cases and new deaths
max_min_deaths_cases_countries = (covid_global_data.groupby('Country')[[
    'New_cases', 'New_deaths']].agg(
        max_cases=pd.NamedAgg(column="New_cases", aggfunc="max"),
        min_cases=pd.NamedAgg(column="New_cases", aggfunc="min"),
        max_deaths=pd.NamedAgg(column="New_deaths", aggfunc="max"),
        min_deaths=pd.NamedAgg(column="New_deaths", aggfunc="min")
    )
).reset_index()

fig = px.scatter(data_frame=max_min_deaths_cases_countries,
                y='max_cases',
                x='max_deaths',
                log_y=True,
                hover_name='Country',
                color='max_deaths',
                labels=dict(max_cases='Max cases in one day', max_deaths='Max deaths in one day'),
                title='Max Cases in One Day by Country')
fig.write_html("Visualizations/max_cases_by_country.html")
fig.show()
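To see how widespread the negative values are, one could count them per column and list the affected countries. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows, two of which contain a negative daily value.
toy = pd.DataFrame(
    {
        "Country": ["A", "A", "B", "B"],
        "New_cases": [5, -2, 7, 0],
        "New_deaths": [1, 0, -1, 0],
    }
)

# Boolean mask of negative entries; summing the True values counts them per column.
negative_counts = (toy[["New_cases", "New_deaths"]] < 0).sum()

# Rows where either column is negative, to see which countries are affected.
negative_rows = toy[(toy["New_cases"] < 0) | (toy["New_deaths"] < 0)]

print(negative_counts)
print(negative_rows["Country"].tolist())
```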

As seen above, the minimum values for the number of cases and deaths are apparently negative; why this is so is hard to tell. Down below the maximum values for each region are shown in two separate graphs.
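The per-region maxima could be computed with the same named aggregation as in the per-country cell, grouping on `WHO_region` instead. This is a sketch on made-up numbers, not the notebook's original cell:

```python
import pandas as pd

# Made-up daily values for two regions.
toy = pd.DataFrame(
    {
        "WHO_region": ["EURO", "EURO", "AMRO", "AMRO"],
        "New_cases": [50, 80, 70, 20],
        "New_deaths": [3, 1, 9, 4],
    }
)

# Largest single-day value per region, using named aggregation.
max_by_region = toy.groupby("WHO_region").agg(
    max_cases=("New_cases", "max"),
    max_deaths=("New_deaths", "max"),
).reset_index()
print(max_by_region)
```

The resulting frame has one row per region and could then be plotted, for example with one `px.bar` call per column.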

### 3. **Rate of Spread and Mortality in the autumn of 2023:**
   - What is the average daily increase in new cases and new deaths globally or within specific regions?

Below we take a look at the data from 2023 by using `loc` on the index. We then calculate the mean number of cases and deaths reported per day.

In [31]:
covid_data_2023 = covid_global_data.loc["2023"]
mean_cases_2023 = covid_data_2023.loc[:, "New_cases"].mean()
mean_deaths_2023 = covid_data_2023.loc[:, "New_deaths"].mean()

print(f"The average daily increase in new cases in 2023 is: {mean_cases_2023:.0f}.")
print(f"The average daily increase in new deaths in 2023 is: {mean_deaths_2023:.0f}.")

The average daily increase in new cases in 2023 is: 571.
The average daily increase in new deaths in 2023 is: 3.


If we sort the values by `New_cases` in descending order (by setting `ascending=False`), we can see that the first 20 values are mostly from January and found in China and the USA.

In [32]:
covid_data_2023.sort_values(by="New_cases", ascending=False).head(20)

| Date_reported | Country_code | Country | WHO_region | New_cases | Cumulative_cases | New_deaths | Cumulative_deaths |
|---|---|---|---|---|---|---|---|
| 2023-01-01 | CN | China | WPRO | 2165484 | 87090526 | 4344 | 56888 |
| 2023-01-02 | CN | China | WPRO | 1606621 | 88697147 | 4189 | 61077 |
| 2023-01-04 | CN | China | WPRO | 1364408 | 91344022 | 4370 | 69879 |
| 2023-01-03 | CN | China | WPRO | 1282467 | 89979614 | 4432 | 65509 |
| 2023-01-05 | CN | China | WPRO | 1250579 | 92594601 | 3566 | 73445 |
| 2023-01-06 | CN | China | WPRO | 991889 | 93586490 | 3229 | 76674 |
| 2023-01-07 | CN | China | WPRO | 863250 | 94449740 | 3618 | 80292 |
| 2023-01-08 | CN | China | WPRO | 702548 | 95152288 | 3810 | 84102 |
| 2023-01-09 | CN | China | WPRO | 527649 | 95679937 | 3482 | 87584 |
| 2023-01-06 | US | United States of America | AMRO | 471714 | 99883410 | 2764 | 1085220 |
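To summarize which countries dominate the top rows without reading the table line by line, the top-N selection can be followed by a count per country. A sketch on made-up rows, using `nlargest` as a shorthand for sorting in descending order and taking the head:

```python
import pandas as pd

# Made-up rows: country A appears twice among the largest values.
toy = pd.DataFrame(
    {"Country": ["A", "A", "B", "C"], "New_cases": [40, 30, 20, 10]},
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"]),
)

# The three rows with the most new cases, in descending order.
top_rows = toy.nlargest(3, "New_cases")

# How often each country appears among those rows.
top_counts = top_rows["Country"].value_counts()
print(top_counts)
```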


The above-mentioned countries are also found at the top for the number of deaths. This might not be a surprise since the two countries have large populations.

In [33]:
covid_data_2023.sort_values(by="New_deaths", ascending=False).head(20)

| Date_reported | Country_code | Country | WHO_region | New_cases | Cumulative_cases | New_deaths | Cumulative_deaths |
|---|---|---|---|---|---|---|---|
| 2023-01-03 | CN | China | WPRO | 1282467 | 89979614 | 4432 | 65509 |
| 2023-01-13 | US | United States of America | AMRO | 448092 | 100331502 | 4407 | 1089627 |
| 2023-01-04 | CN | China | WPRO | 1364408 | 91344022 | 4370 | 69879 |
| 2023-01-01 | CN | China | WPRO | 2165484 | 87090526 | 4344 | 56888 |
| 2023-01-02 | CN | China | WPRO | 1606621 | 88697147 | 4189 | 61077 |
| 2023-01-08 | CN | China | WPRO | 702548 | 95152288 | 3810 | 84102 |
| 2023-01-27 | US | United States of America | AMRO | 294867 | 100934394 | 3743 | 1097034 |
| 2023-01-20 | US | United States of America | AMRO | 308025 | 100639527 | 3664 | 1093291 |
| 2023-01-07 | CN | China | WPRO | 863250 | 94449740 | 3618 | 80292 |
| 2023-01-05 | CN | China | WPRO | 1250579 | 92594601 | 3566 | 73445 |


Below we can see the mean values for cases and deaths for each month. The highest numbers are, as above, found in January. In the autumn of 2023 the numbers of cases and deaths decrease month by month (note that November only covers data through 9 November). Let's hope this trend holds!

In [34]:
display(covid_data_2023.loc[:,'New_cases'].resample('M').mean())
display(covid_data_2023.loc[:,'New_deaths'].resample('M').mean())


Date_reported
2023-01-31    3131.626106
2023-02-28     728.036769
2023-03-31     523.320675
2023-04-30     457.134880
2023-05-31     293.698244
2023-06-30     146.660478
2023-07-31     191.834899
2023-08-31     200.966245
2023-09-30      94.202250
2023-10-31      85.536546
2023-11-30      12.038912
Freq: M, Name: New_cases, dtype: float64

Date_reported
2023-01-31    16.190690
2023-02-28     6.019439
2023-03-31     3.328297
2023-04-30     3.494093
2023-05-31     2.108071
2023-06-30     1.098875
2023-07-31     0.572070
2023-08-31     1.117055
2023-09-30     1.238959
2023-10-31     0.830679
2023-11-30     0.090014
Freq: M, Name: New_deaths, dtype: float64
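The monthly means can also be computed by grouping on the calendar month of the index, which avoids the `'M'` resample alias that recent pandas releases deprecate in favour of `'ME'`. The sketch below uses made-up numbers and also counts the rows behind each mean, which helps spot partial months such as November here:

```python
import pandas as pd

# Made-up values spanning the end of one month and the start of the next.
toy = pd.DataFrame(
    {"New_cases": [10, 20, 6, 2]},
    index=pd.to_datetime(["2023-10-30", "2023-10-31", "2023-11-01", "2023-11-02"]),
)

# Group by calendar month of the DatetimeIndex and average within each month.
monthly_mean = toy["New_cases"].groupby(toy.index.to_period("M")).mean()

# Number of rows behind each mean, useful for spotting partial months.
rows_per_month = toy["New_cases"].groupby(toy.index.to_period("M")).size()

print(monthly_mean)
print(rows_per_month)
```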