In the **data_acquisition** notebook, we gathered data from three different sources and processed it to create tables in the database. We created three tables and populated them using SQL queries. The database is therefore complete and ready to be used for analysis.

In this notebook, we use the database to analyze the data and reach a conclusion for our central question. We use SQL queries to obtain the required data from the tables in the database and load the results into pandas dataframes, so that we can perform further operations on them as needed. We then represent the data visually through graphs to make it easier to understand.

**Note: The database connection is closed later in this notebook. After running all the cells for the first time, re-run the second code cell to re-establish the connection before re-running any cell that queries the database.**

Importing the necessary libraries.

In [17]:
import pandas as pd
import numpy as np 
import sqlite3
import plotly.express as px
import plotly.graph_objects as go
import warnings 
warnings.filterwarnings('ignore') 

We use the first table in the database, which contains the **Covid-19 data**, to create a dataframe. To do this, we first create a connection to the database using `sqlite3.connect()`. Then we write a SQL query that selects the specific columns from the table and read the result of that query into a dataframe using the `read_sql_query()` function.

In [18]:
conn = sqlite3.connect("covid.db")

table_1 = 'covid_data'
cols = 'country, [Tot\xa0Cases/1M pop], [Deaths/1M pop]'

qry = f"SELECT {cols} FROM {table_1}"

df = pd.read_sql_query(qry, conn) 
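The query-to-dataframe step can be tried in isolation. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in for `covid.db`; the `demo_conn` and `demo_df` names and the three rows are illustrative only. It shows that square brackets let SQLite reference column names containing spaces or slashes.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for covid.db; table contents are illustrative only.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE covid_data (country TEXT, [Deaths/1M pop] TEXT)")
demo_conn.executemany(
    "INSERT INTO covid_data VALUES (?, ?)",
    [("USA", "3315"), ("India", "377"), ("MS Zaandam", "")],
)
demo_conn.commit()

# Square brackets quote identifiers that contain spaces or slashes.
demo_df = pd.read_sql_query(
    "SELECT country, [Deaths/1M pop] FROM covid_data", demo_conn
)
demo_conn.close()
print(demo_df)
```

The empty string in the last row mirrors the kind of missing value that the next cell replaces.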

To make the dataframe easier to read, we use `set_index()` to set **country** as the index. Then, with `replace()` and `fillna()`, we convert empty strings to missing values and fill them with 0 to indicate the absence of a value. We also drop the unnecessary `Total:` row and then display our first dataframe.

This dataframe has the countries as its index and two columns, **Tot Cases/1M pop** and **Deaths/1M pop** (both Covid-19 figures), which we chose specifically for plotting.

In [19]:
df = df.set_index("country")
df = df.replace('', np.nan).fillna(0)
df = df.drop(['Total:'])
df

country,Tot Cases/1M pop,Deaths/1M pop
World,83898,854.4
USA,302728,3315
India,31761,377
France,587268,2438
Germany,438214,1895
...,...,...
Niue,117756,0
Vatican City,36295,0
Western Sahara,16,2
MS Zaandam,0,0
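The cleaning steps above can be sketched on a small hypothetical frame. The rows below are illustrative, not taken from the database, and the `raw` and `clean` names are assumptions. The `dtype=object` argument is only there so the example behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the scraped table: empty strings for missing
# values and a trailing 'Total:' summary row.
raw = pd.DataFrame(
    {
        "country": ["USA", "Niue", "MS Zaandam", "Total:"],
        "Deaths/1M pop": ["3315", "", "", "854.4"],
    },
    dtype=object,
)

clean = (
    raw.set_index("country")
    .replace("", np.nan)      # empty strings become real missing values
    .fillna(0)                # missing values become 0, as in the notebook
    .drop(index=["Total:"])   # remove the summary row by its index label
)
print(clean)
```

Note that after `fillna(0)` the column holds a mix of strings and integer zeros, so it still needs a numeric conversion before plotting.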


We use the second table in the database, which contains the **GDP (worldwide) from 2019**, to create a dataframe. First we write a SQL query that selects the specific columns from the table and then read the result of that query into a dataframe using the `read_sql_query()` function.

In [20]:
table_2 = 'gdp2019'
cols = '[Country/Territory],[GDP(US$million)]'

qry = f"SELECT {cols} FROM {table_2}"

df1 = pd.read_sql_query(qry, conn)

We make a few changes to this dataframe. To align it with the first dataframe, we use `rename()` to change the name of the first column to **country** and then rename a couple of countries so that their names match. Then we use `set_index()` to set **country** as the index and display our second dataframe.

This dataframe has the countries as its index and a single column with the **GDP (US$million) from 2019** of each listed country. This data will also be used for further analysis.

In [21]:
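df1 = df1.rename(columns={"Country/Territory": "country"})
# Use .loc for label-based assignment; chained indexing
# (df1['country'][1] = ...) may not update df1 itself.
df1.loc[1, 'country'] = 'USA'
df1.loc[2, 'country'] = 'China'
df1 = df1.set_index("country")
df1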

country,GDP(US$million)
World,87751541
USA,21427700
China,14342903
Japan,5081770
Germany,3845630
...,...
Palau (2018),284
Marshall Islands (2018),221
Kiribati,195
Nauru,118
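Renaming countries by row position works only while the row order stays fixed. As a hedged alternative, the sketch below matches on the label itself. The original labels `"United States"` and `"China (mainland)"` are hypothetical placeholders, since the source table's exact spellings are not shown here, and `gdp` is an illustrative name.

```python
import pandas as pd

# Hypothetical original labels; the GDP values are the ones displayed above.
gdp = pd.DataFrame(
    {
        "Country/Territory": ["World", "United States", "China (mainland)"],
        "GDP(US$million)": [87751541, 21427700, 14342903],
    }
)
gdp = gdp.rename(columns={"Country/Territory": "country"})

# Match on the value rather than the row position, so the fix
# still applies if the rows are reordered.
gdp["country"] = gdp["country"].replace(
    {"United States": "USA", "China (mainland)": "China"}
)
gdp = gdp.set_index("country")
print(gdp)
```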


We use the third table in the database, which contains the **GDP (worldwide) from 1960-2020**, to create a dataframe. First we write a SQL query that selects the specific columns from the table and then read the result of that query into a dataframe using the `read_sql_query()` function. Finally, we close the connection to the database using `conn.close()`.

In [22]:
table_3 = 'gdpall'
cols = '[Country Name], [2019], [2020]'

qry = f"SELECT {cols} FROM {table_3}"

df2 = pd.read_sql_query(qry, conn)

conn.close()
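Because the connection is closed explicitly here, any later query needs a fresh connection. One possible pattern, sketched below under the assumption of an in-memory stand-in database, wraps the connection in `contextlib.closing` so it is closed automatically when the block ends. The single row is the Zambia row from the output below; `demo_conn` and `growth` are illustrative names.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# closing() calls demo_conn.close() when the with-block exits,
# even if a query raises an exception.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute(
        "CREATE TABLE gdpall ([Country Name] TEXT, [2019] REAL, [2020] REAL)"
    )
    demo_conn.execute("INSERT INTO gdpall VALUES ('Zambia', 1.441306, -2.785055)")
    growth = pd.read_sql_query(
        "SELECT [Country Name], [2019], [2020] FROM gdpall", demo_conn
    )

print(growth)
```

The dataframe remains usable after the block because `read_sql_query()` has already copied the rows into memory.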

We make a few changes to this dataframe. To align it with the first dataframe, we use `rename()` to change the name of the first column to **country**. Then, with `replace()` and `fillna()`, we fill all the missing data with the value 0 to indicate the absence of a value, and we rename one country so that its name matches the other tables. After that we display the dataframe.

This dataframe lists the countries in its first column, and the other two columns display the **GDP of 2019 and 2020.** This data will also be used for further analysis and in the plotting.

In [23]:
df2 = df2.rename(columns={"Country Name": "country"})
df2 = df2.replace('', np.nan).fillna(0)
# .loc avoids chained assignment, which may not modify df2 itself
df2.loc[251, 'country'] = 'USA'
df2

,country,2019,2020
0,Aruba,0.000000,0.000000
1,Africa Eastern and Southern,2.077898,-2.939186
2,Afghanistan,3.911603,-2.351101
3,Africa Western and Central,3.190336,-0.884981
4,Angola,-0.624644,-5.399987
...,...,...,...
261,Kosovo,4.756831,-5.340275
262,"Yemen, Rep.",0.000000,0.000000
263,South Africa,0.113054,-6.431975
264,Zambia,1.441306,-2.785055


Next, we define a function `merge_tables()` that takes the parameters `table1`, `table2`, and `merge_column` and returns the result of `pd.merge()`, which merges two dataframes into a new dataframe while leaving the source dataframes unchanged. By default, `pd.merge()` performs an inner join, so only countries present in both dataframes are kept.

After that, we call `merge_tables()` to merge the first and the third dataframes, with `merge_column` set to **country**. Then we rename the two columns that came from the third dataframe to make the merged dataframe easier to understand. Finally, we display the merged dataframe.



In [24]:
def merge_tables(table1, table2, merge_column):
  return pd.merge(table1, table2, on=merge_column)

merged_df =  merge_tables(df,df2,"country")
merged_df.rename(columns = {"2019":"gdp 2019"}, inplace = True)
merged_df.rename(columns = {"2020":"gdp 2020"}, inplace = True)
merged_df 

,country,Tot Cases/1M pop,Deaths/1M pop,gdp 2019,gdp 2020
0,World,83898,854.4,2.600878,-3.293497
1,USA,302728,3315,2.161177,-3.404592
2,India,31761,377,4.041554,-7.251755
3,France,587268,2438,1.842972,-7.855256
4,Germany,438214,1895,1.055508,-4.569617
...,...,...,...,...,...
173,Palau,323370,384,-1.896243,-9.737418
174,Nauru,423828,92,0.000000,1.149425
175,Kiribati,27792,105,3.926445,-1.947557
176,Tuvalu,232471,0,9.756098,4.400000
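Since `pd.merge()` keeps only matching keys by default, countries whose names differ between tables drop out silently. The sketch below, on small illustrative frames, uses the `how="outer"` and `indicator=True` options of `pd.merge()` to list the rows that an inner join would discard. The `cases`, `growth`, `audit`, and `unmatched` names are assumptions made for this example.

```python
import pandas as pd


def merge_tables(table1, table2, merge_column):
    # Same helper as in the notebook: an inner join on merge_column.
    return pd.merge(table1, table2, on=merge_column)


# Small illustrative frames; each has one country the other lacks.
cases = pd.DataFrame(
    {"country": ["USA", "India", "MS Zaandam"], "Deaths/1M pop": [3315, 377, 0]}
)
growth = pd.DataFrame(
    {"country": ["USA", "India", "Aruba"], "2020": [-3.404592, -7.251755, 0.0]}
)

inner = merge_tables(cases, growth, "country")

# An outer join with indicator=True adds a '_merge' column showing
# whether each row came from the left frame, the right frame, or both.
audit = pd.merge(cases, growth, on="country", how="outer", indicator=True)
unmatched = audit.loc[audit["_merge"] != "both", "country"].tolist()

print(inner)
print("Dropped by the inner join:", sorted(unmatched))
```

Checking the unmatched names is one way to spot spelling differences, such as the `USA` rename applied earlier.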


Now, we merge the merged dataframe with the `df1` dataframe using our `merge_tables()` function. Before we move on to plotting, we need to convert the values in the `Tot Cases/1M pop` column from strings to numbers. To do this, we first remove the commas using `.str.replace()` and then use `pd.to_numeric()` to convert the values in that column to numeric values.

In [25]:
merged = merge_tables(merged_df,df1,"country")
cases_col = "Tot\xa0Cases/1M pop"
# Cast to str first so any integer zeros left by fillna(0) are not turned into NaN
merged[cases_col] = merged[cases_col].astype(str).str.replace(",", "", regex=False)
merged[cases_col] = pd.to_numeric(merged[cases_col])
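The string-to-number conversion can be checked on a few hypothetical values. This sketch assumes the column may contain comma-formatted strings, integer zeros from the earlier `fillna(0)`, and an unparseable entry such as `"N/A"` (an invented example). The `errors="coerce"` option of `pd.to_numeric()` turns anything unparseable into `NaN` instead of raising.

```python
import pandas as pd

col = "Tot\xa0Cases/1M pop"  # \xa0 is the non-breaking space in the column name

# Hypothetical values: formatted strings, an integer zero, and a bad entry.
demo = pd.DataFrame({col: ["302,728", "31,761", 0, "N/A"]}, dtype=object)

# astype(str) makes every value a string so .str.replace applies to all rows.
cleaned = demo[col].astype(str).str.replace(",", "", regex=False)

# errors="coerce" converts unparseable entries to NaN rather than raising.
demo[col] = pd.to_numeric(cleaned, errors="coerce")
print(demo)
```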

Using the merged dataframe above, we plot **bar charts** of the change in GDP for the years 2019 and 2020, in which negative changes appear as downward ("upside-down") bars.

For plotting, we use `go.Bar()`, which creates a bar trace with the `x` coordinate set to **country** and the `y` coordinate set to **gdp 2019** or **gdp 2020**. We give each **trace** a name for easier comparison and display both sets of bars in the same figure using `fig.show()`.

In [26]:
fig =  go.Figure()
fig.add_trace(go.Bar(x=merged["country"], y =merged["gdp 2019"], name='change in gdp for the year 2019'))
fig.add_trace(go.Bar(x=merged["country"], y =merged["gdp 2020"], name='change in gdp for the year 2020'))

fig.show()

This time we use the merged data from above to display a scatter plot of **Total Cases (Covid)/1M Population against GDP of 2019**.

For plotting, we use `px.scatter()`, which represents each data point as a marker whose location is given by the `x` and `y` columns. We sort the merged data by total cases and set the axes of the plot as needed. We also set the `color` of the markers by **country**, which makes individual countries easier to identify. After that we display the scatter plot using `fig.show()`.

In [27]:
fig = px.scatter(merged.sort_values(by="Tot\xa0Cases/1M pop", ascending = True), x="Tot\xa0Cases/1M pop", y="gdp 2019", color='country', title = 'Total Cases (Covid)/1M Population against GDP of 2019')
fig.show()

Now, we again use the merged data from above to display a scatter plot of **Total Cases (Covid)/1M Population against GDP of 2020**.

For plotting, we use `px.scatter()`, which represents each data point as a marker whose location is given by the `x` and `y` columns. We sort the merged data by total cases and set the axes of the plot as needed. We also set the `color` of the markers by **country**, which makes individual countries easier to identify. After that we display the scatter plot using `fig.show()`.

In [28]:
fig = px.scatter(merged.sort_values(by="Tot\xa0Cases/1M pop", ascending = True), x="Tot\xa0Cases/1M pop", y="gdp 2020", color='country', title='Total Cases (Covid)/1M Population against GDP of 2020')
fig.show()

# **Analysis**

The results of the scatter plots are surprising, to say the least. If there were a strong negative relationship between the number of COVID-19 cases and the change in GDP, the points in the scatter plot would form a line or curve that slopes downwards from left to right. This would indicate that countries with higher numbers of COVID-19 cases tend to have larger declines in GDP. However, there is no clear relationship between these two variables in the scatter plot; the points are scattered without forming any clear pattern. This suggests that, for the countries in this data, the number of COVID-19 cases per million is not clearly associated with the change in GDP.
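A visual impression of "no clear pattern" can be complemented with a correlation coefficient. The sketch below computes Pearson and rank-based (Spearman) correlations on the first five rows of the merged table shown earlier. Five rows are far too few to support any conclusion, and the `World` row is an aggregate, so this only illustrates the calculation. The `cases_per_1m` column name is a simplified stand-in chosen for the example.

```python
import pandas as pd

# First five rows of the merged table displayed earlier (illustration only).
sample = pd.DataFrame(
    {
        "country": ["World", "USA", "India", "France", "Germany"],
        "cases_per_1m": [83898, 302728, 31761, 587268, 438214],
        "gdp 2020": [-3.293497, -3.404592, -7.251755, -7.855256, -4.569617],
    }
)

# Pearson correlation measures the strength of a linear relationship.
pearson = sample["cases_per_1m"].corr(sample["gdp 2020"])

# Spearman correlation is the Pearson correlation of the ranks, which
# captures monotonic relationships and is less sensitive to outliers.
spearman = sample["cases_per_1m"].rank().corr(sample["gdp 2020"].rank())

print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

Applied to the full `merged` dataframe, values near zero would be consistent with the pattern described above, while values near -1 would contradict it.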

However, as we saw in the earlier bar chart, most countries did suffer an economic decline in 2020, during the COVID-19 pandemic. A scatter plot alone is not enough to determine the exact relationship between these two variables. One possible explanation is the spending in different economies during this time. Many countries, including the United States, increased their spending on goods, whether through welfare programs or the stock market. The stock exchange and financial markets were major factors that drove up investment during this time, which may have resulted in a relatively low decline in many economies, even though other sectors were heavily affected. Other possible reasons include the continuation of essential workers and services, as well as the growth of online services and remote work, which helped to support the economy and limit the decline in GDP during the COVID-19 pandemic.

# **Conclusion**

The scatter plots show no strong relationship between the number of COVID-19 cases and the change in GDP for the countries in this data. This suggests that the number of COVID-19 cases alone does not have a significant effect on the change in GDP. However, it is important to note that other factors, such as government spending, the growth of online services, and the continuation of essential workers and services, may have played a role in limiting the decline in GDP in many countries. It is also possible that the scatter plots do not accurately reflect the relationship between these two variables because of the limited data available. Overall, it is difficult to determine the exact relationship between the number of COVID-19 cases and the change in GDP without further information and analysis.