# Downloading the Data from Our World in Data (https://ourworldindata.org/covid-cases)

In [None]:
!pip install --upgrade --no-cache-dir gdown
!gdown 1wufZEVE6NO5o3SU3q0DVrErtuFnk8pXU

# Importing the necessary library

In [2]:
import pandas as pd

# Loading and filtering the data

In [3]:
# load the data into a pandas DataFrame
df = pd.read_csv("/content/owid-covid-data.csv")

In [4]:
# filter the data to only include rows for US, China, France, and Germany
df = df[df['location'].isin(["United States", "China", "France", "Germany"])]
df.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
45159,CHN,Asia,China,2020-01-22,547.0,,,17.0,,,...,48.4,,4.34,76.91,0.761,1425887000.0,,,,
45160,CHN,Asia,China,2020-01-23,639.0,92.0,,18.0,1.0,,...,48.4,,4.34,76.91,0.761,1425887000.0,,,,
45161,CHN,Asia,China,2020-01-24,916.0,277.0,,26.0,8.0,,...,48.4,,4.34,76.91,0.761,1425887000.0,,,,
45162,CHN,Asia,China,2020-01-25,1399.0,483.0,,42.0,16.0,,...,48.4,,4.34,76.91,0.761,1425887000.0,,,,
45163,CHN,Asia,China,2020-01-26,2062.0,663.0,,56.0,14.0,,...,48.4,,4.34,76.91,0.761,1425887000.0,,,,


# Preprocessing the data by replacing the NaN values

In [5]:
us = df[df['location'] == "United States"][['total_deaths', 'median_age']].fillna(0)
us.head()

Unnamed: 0,total_deaths,median_age
235547,0.0,38.3
235548,0.0,38.3
235549,0.0,38.3
235550,0.0,38.3
235551,0.0,38.3


In [6]:
china = df[df['location'] == "China"][['total_deaths', 'median_age']].fillna(0)
china.head()

Unnamed: 0,total_deaths,median_age
45159,17.0,38.7
45160,18.0,38.7
45161,26.0,38.7
45162,42.0,38.7
45163,56.0,38.7


In [7]:
france = df[df['location'] == "France"][['total_deaths', 'median_age']].fillna(0)
france.head()

Unnamed: 0,total_deaths,median_age
78319,0.0,42.0
78320,0.0,42.0
78321,0.0,42.0
78322,0.0,42.0
78323,0.0,42.0


In [8]:
germany = df[df['location'] == "Germany"][['total_deaths', 'median_age']].fillna(0)
germany.head()

Unnamed: 0,total_deaths,median_age
83579,0.0,46.6
83580,0.0,46.6
83581,0.0,46.6
83582,0.0,46.6
83583,0.0,46.6


# Calculating and printing the correlation

In [9]:
# calculate the correlation between total deaths and median age for each country
corr_us = us.corr()
corr_china = china.corr()
corr_france = france.corr()
corr_germany = germany.corr()

print("Correlation between total deaths and median age for US:\n", corr_us)
print("Correlation between total deaths and median age for China:\n", corr_china)
print("Correlation between total deaths and median age for France:\n", corr_france)
print("Correlation between total deaths and median age for Germany:\n", corr_germany)

Correlation between total deaths and median age for US:
               total_deaths  median_age
total_deaths           1.0         NaN
median_age             NaN         NaN
Correlation between total deaths and median age for China:
               total_deaths  median_age
total_deaths           1.0         NaN
median_age             NaN         NaN
Correlation between total deaths and median age for France:
               total_deaths  median_age
total_deaths           1.0         NaN
median_age             NaN         NaN
Correlation between total deaths and median age for Germany:
               total_deaths  median_age
total_deaths           1.0         NaN
median_age             NaN         NaN
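
The per-country cells above repeat the same steps four times. As a sketch only, the same matrices could be produced in a single loop over the countries. The `toy` frame and its values are hypothetical stand-ins for the filtered `df`; only the column names `location`, `total_deaths`, and `median_age` come from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the filtered OWID frame: two countries,
# a varying total_deaths column and a per-country constant median_age.
toy = pd.DataFrame({
    "location": ["China"] * 3 + ["France"] * 3,
    "total_deaths": [17.0, 18.0, 26.0, None, 1.0, 3.0],
    "median_age": [38.0] * 3 + [42.0] * 3,
})

cols = ["total_deaths", "median_age"]

# One correlation matrix per country, using the same fillna(0) step as above.
corr_by_country = {
    name: group[cols].fillna(0).corr()
    for name, group in toy.groupby("location")
}

for name, corr in corr_by_country.items():
    print(f"Correlation between total deaths and median age for {name}:\n{corr}")
```

Replacing `toy` with the filtered `df` should give one matrix per country without a separate cell for each.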


The correlation matrices above contain NaN values. A NaN correlation can have several causes, such as missing data. Here, however, missing data is not the cause, because every NaN value in the two columns was already replaced with 0.
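
One way to check this, sketched here on hypothetical values rather than the real file, is to count the missing values and the distinct values in each column. The frame `us_toy` is an assumed stand-in for a per-country frame such as `us` after `fillna(0)`.

```python
import pandas as pd

# Hypothetical stand-in for one country's frame after fillna(0).
us_toy = pd.DataFrame({
    "total_deaths": [0.0, 0.0, 1.0, 6.0, 11.0],
    "median_age": [38.3] * 5,
})

# Per-column count of missing values and of distinct values.
check = pd.DataFrame({
    "missing": us_toy.isna().sum(),
    "distinct": us_toy.nunique(),
})
print(check)
```

A column with no missing values but only one distinct value is constant, which is worth ruling out before interpreting a correlation.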

The actual reason for the NaN correlation is that the median age does not vary within a country. This follows from the formula for the Pearson correlation, which pandas uses by default:

**cor(i,j) = cov(i,j)/[stdev(i)*stdev(j)]**

As the formula shows, the covariance is divided by the product of the standard deviations of the two variables. If one of the variables does not vary, its standard deviation is zero, so the denominator is zero, the division is undefined, and the result is NaN. That is the case here: within each country, every row has the same median age, so the standard deviation of `median_age` is zero.
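
The formula can be applied by hand to see where the NaN comes from. This is a minimal sketch on hypothetical values, not the notebook's data: the `toy` frame pairs a varying `total_deaths` column with a constant `median_age` column, and the names `cov_xy`, `std_x`, `std_y`, and `manual_corr` are introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: total_deaths varies, median_age is constant.
toy = pd.DataFrame({
    "total_deaths": [17.0, 18.0, 26.0, 42.0, 56.0],
    "median_age": [40.0] * 5,
})

# The three ingredients of cor(i,j) = cov(i,j) / [stdev(i) * stdev(j)].
cov_xy = toy["total_deaths"].cov(toy["median_age"])
std_x = toy["total_deaths"].std()
std_y = toy["median_age"].std()

# Both numerator and denominator are zero here; NumPy returns NaN for 0/0.
# errstate only silences the accompanying RuntimeWarning.
with np.errstate(divide="ignore", invalid="ignore"):
    manual_corr = np.divide(cov_xy, std_x * std_y)

print("stdev(median_age):", std_y)
print("manual correlation:", manual_corr)
print(toy.corr())
```

The matrix printed by `toy.corr()` shows the same pattern as the per-country output above: a valid entry for `total_deaths` against itself and NaN wherever `median_age` is involved.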