# COVID-19 : Worldwide & USA Analysis

## Table of Contents

1. [Problem Statement](#section1)<br>
2. [Data Loading and Description](#section2)<br/>
3. [Data Profiling](#section3)
    - 3.1 [Understanding the Dataset](#section301)<br/>
    - 3.2 [Preprocessing](#section302)<br/>
4. [Questions](#section4)
    - 4.1 [How are COVID-19 cases distributed worldwide?](#section401)<br/>
    - 4.2 [How have cases increased with time in the most infected countries?](#section402)<br/>
    - 4.3 [At what rate have cases increased daily in the most infected countries?](#section403)<br/>
    - 4.4 [Which states in the US have the most and least positive cases?](#section404)<br/>
    - 4.5 [Which states in the US have performed the highest and lowest number of tests?](#section405)<br/>
    - 4.6 [Which states in the US show the highest positive case rate with respect to tests conducted?](#section406)<br/>
    - 4.7 [Which states in the US show the highest death rate with respect to positive cases?](#section407)<br/>
5. [Conclusion](#section5)<br/>  

<a id='section1'></a>
### 1. Problem Statement

This notebook explores the spread of COVID-19 worldwide using Python libraries for visualization and numerical manipulation. We perform a preliminary __Exploratory Data Analysis (EDA)__ of our __Global COVID case tracking__ dataset, then look into the regions with the highest number of cases. The data is analysed using basic statistical tools and charts.

Our end goal in this notebook is to analyze the current spread of COVID-19 and visualize the number of cases country-wise. We also look at the number of new cases per country, which gives a picture of how well the virus is being contained. Lastly, we look into the regions that have the most cases and examine how the virus has spread state-wise.

* __Exploratory Data Analysis__ <br/>
Understand the data through EDA and derive simple models with Pandas as a baseline.
EDA is a critical first step in analyzing the data, and we do it for the reasons below:
    - Finding patterns in Data
    - Determining relationships in Data
    - Checking of assumptions
    - Detection of mistakes 

<a id='section2'></a>
### 2. Data Loading and Description

In this project we will use two datasets (one with the overall worldwide data and one with cases in the USA only) to get a clearer picture of how COVID-19 has spread across the globe and across the country with the highest number of cases (USA).

We will analyze and process these datasets in order to answer several questions.


__1. worldwide_df :__
- The dataset consists of data about the spread of COVID-19 across the globe.
- The dataset comprises __Country Names__ (split by province or state for some countries) as rows and various data about them as columns. The table below lists the columns and their descriptions.

| Column Name    | Description |
| -------------- |:-----------:|
| Province/State | Name of the province or state within the country |
| Country/Region | Name of the country or region |
| Lat            | Latitude of the location |
| Long           | Longitude of the location |
| Dates          | One column per date, holding the number of cases as of the date in the column name |

__2. USA_state_stats :__
 - The dataset contains state-wise case information for the USA, including the number of positive tests, negative tests and deaths.
 - The dataset comprises __30 columns__, and each row represents a state. As we will not be using all the columns, the full description of the data can be found at 'https://covidtracking.com/'. The table below lists the columns we will be using and their descriptions.

| Column Name      | Description |
| ---------------- |:-----------:|
| state            | Name of the state |
| positive         | Number of positive cases |
| negative         | Number of negative cases |
| death            | Number of deaths |
| totalTestResults | Total number of tests conducted |

__Both of these datasets are updated daily, so importing them should automatically fetch the most recently updated data.__

#### Importing packages 

In [1]:
import pandas as pd
import plotly.graph_objs as go 
from plotly.offline import init_notebook_mode,iplot,plot
init_notebook_mode(connected=True) 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np

In [2]:
import cufflinks as cf
init_notebook_mode(connected=True)
# For offline use
cf.go_offline()

#### Importing the Dataset

__worldwide_df__

In [3]:
worldwide_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
worldwide_df.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,5/20/20,5/21/20,5/22/20,5/23/20,5/24/20,5/25/20,5/26/20,5/27/20,5/28/20,5/29/20
0,,Afghanistan,33.0,65.0,0,0,0,0,0,0,...,8145,8676,9216,9998,10582,11173,11831,12456,13036,13659
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,964,969,981,989,998,1004,1029,1050,1076,1099
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,7542,7728,7918,8113,8306,8503,8697,8857,8997,9134
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,762,762,762,762,762,763,763,763,763,764
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,52,58,60,61,69,70,70,71,74,81


__USA_state_stats__

In [4]:
#COVID USA stats from covidtracking.com
USA_state_stats = pd.read_csv('https://covidtracking.com/api/v1/states/current.csv')
USA_state_stats.head()

Unnamed: 0,state,positive,positiveScore,negativeScore,negativeRegularScore,commercialScore,grade,score,notes,dataQualityGrade,...,checkTimeEt,death,hospitalized,total,totalTestResults,posNeg,fips,dateModified,dateChecked,hash
0,AK,430,1.0,1.0,1.0,1.0,A,4.0,"Please stop using the ""total"" field. Use ""tota...",A,...,5/29 15:22,10,,49439,49439,49439,2,2020-05-29T04:00:00Z,2020-05-29T19:22:00Z,68ad70f66afd5ef31321c3295c4e3eed051ea4cf
1,AL,16823,1.0,1.0,0.0,1.0,B,3.0,"Please stop using the ""total"" field. Use ""tota...",B,...,5/29 14:44,605,1800.0,208883,208883,208883,1,2020-05-29T04:00:00Z,2020-05-29T18:44:00Z,1f8c806e84306966f71133639ab0c9c6d2d6e9d6
2,AR,6538,1.0,1.0,1.0,1.0,A,4.0,"Please stop using the ""total"" field. Use ""tota...",A,...,5/29 14:33,125,667.0,119768,119768,119768,5,2020-05-28T23:50:00Z,2020-05-29T18:33:00Z,b9004684021e4cc5e66645821204cf3087d2fcfc
3,AZ,18465,1.0,1.0,0.0,1.0,B,3.0,"Please stop using the ""total"" field. Use ""tota...",A+,...,5/29 16:01,885,2911.0,209813,209813,209813,4,2020-05-29T04:00:00Z,2020-05-29T20:01:00Z,d8b6f669548cb8b8581cac9025835946c4c79726
4,CA,103886,1.0,1.0,0.0,1.0,B,3.0,"Please stop using the ""total"" field. Use ""tota...",B,...,5/29 16:11,4068,,1835478,1835478,1835478,6,2020-05-29T04:00:00Z,2020-05-29T20:11:00Z,63a9cdfd96ee08c583425814721389853ab749f4


<a id='section3'></a>
## 3. Data Profiling

- In the upcoming section we will first __understand our dataset__ using various pandas functionalities.
- Once we identify if there are any inconsistencies and shortcomings in the data, we can begin preprocessing it.
- In __preprocessing__, we will deal with erroneous and missing values in the columns. If necessary, we may also add columns to make the analysis easier.

<a id='section301'></a>
### 3.1 Understanding the Dataset

__worldwide_df :__

In [5]:
worldwide_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Columns: 133 entries, Province/State to 5/29/20
dtypes: float64(2), int64(129), object(2)
memory usage: 276.5+ KB


In [6]:
worldwide_df.shape

(266, 133)

In [7]:
worldwide_df.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,5/20/20,5/21/20,5/22/20,5/23/20,5/24/20,5/25/20,5/26/20,5/27/20,5/28/20,5/29/20
0,,Afghanistan,33.0,65.0,0,0,0,0,0,0,...,8145,8676,9216,9998,10582,11173,11831,12456,13036,13659
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,964,969,981,989,998,1004,1029,1050,1076,1099
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,7542,7728,7918,8113,8306,8503,8697,8857,8997,9134
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,762,762,762,762,762,763,763,763,763,764
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,52,58,60,61,69,70,70,71,74,81


__USA_state_stats :__

In [8]:
USA_state_stats.head()

Unnamed: 0,state,positive,positiveScore,negativeScore,negativeRegularScore,commercialScore,grade,score,notes,dataQualityGrade,...,checkTimeEt,death,hospitalized,total,totalTestResults,posNeg,fips,dateModified,dateChecked,hash
0,AK,430,1.0,1.0,1.0,1.0,A,4.0,"Please stop using the ""total"" field. Use ""tota...",A,...,5/29 15:22,10,,49439,49439,49439,2,2020-05-29T04:00:00Z,2020-05-29T19:22:00Z,68ad70f66afd5ef31321c3295c4e3eed051ea4cf
1,AL,16823,1.0,1.0,0.0,1.0,B,3.0,"Please stop using the ""total"" field. Use ""tota...",B,...,5/29 14:44,605,1800.0,208883,208883,208883,1,2020-05-29T04:00:00Z,2020-05-29T18:44:00Z,1f8c806e84306966f71133639ab0c9c6d2d6e9d6
2,AR,6538,1.0,1.0,1.0,1.0,A,4.0,"Please stop using the ""total"" field. Use ""tota...",A,...,5/29 14:33,125,667.0,119768,119768,119768,5,2020-05-28T23:50:00Z,2020-05-29T18:33:00Z,b9004684021e4cc5e66645821204cf3087d2fcfc
3,AZ,18465,1.0,1.0,0.0,1.0,B,3.0,"Please stop using the ""total"" field. Use ""tota...",A+,...,5/29 16:01,885,2911.0,209813,209813,209813,4,2020-05-29T04:00:00Z,2020-05-29T20:01:00Z,d8b6f669548cb8b8581cac9025835946c4c79726
4,CA,103886,1.0,1.0,0.0,1.0,B,3.0,"Please stop using the ""total"" field. Use ""tota...",B,...,5/29 16:11,4068,,1835478,1835478,1835478,6,2020-05-29T04:00:00Z,2020-05-29T20:11:00Z,63a9cdfd96ee08c583425814721389853ab749f4


In [9]:
USA_state_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56 entries, 0 to 55
Data columns (total 30 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   56 non-null     object 
 1   positive                56 non-null     int64  
 2   positiveScore           52 non-null     float64
 3   negativeScore           52 non-null     float64
 4   negativeRegularScore    52 non-null     float64
 5   commercialScore         52 non-null     float64
 6   grade                   52 non-null     object 
 7   score                   52 non-null     float64
 8   notes                   56 non-null     object 
 9   dataQualityGrade        56 non-null     object 
 10  negative                55 non-null     float64
 11  pending                 5 non-null      float64
 12  hospitalizedCurrently   45 non-null     float64
 13  hospitalizedCumulative  35 non-null     float64
 14  inIcuCurrently          24 non-null     floa

In [10]:
USA_state_stats.shape

(56, 30)

<a id='section302'></a>
### 3.2 Preprocessing

__worldwide_df :__

For our analysis, we do not need information about __Province/State.__ We will only be analyzing country information. Below we explore the Province/State column.

In [11]:
worldwide_df[worldwide_df['Country/Region'] == 'China'].head(10)

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,5/20/20,5/21/20,5/22/20,5/23/20,5/24/20,5/25/20,5/26/20,5/27/20,5/28/20,5/29/20
49,Anhui,China,31.8257,117.2264,1,9,15,39,60,70,...,991,991,991,991,991,991,991,991,991,991
50,Beijing,China,40.1824,116.4142,14,22,36,41,68,80,...,593,593,593,593,593,593,593,593,593,593
51,Chongqing,China,30.0572,107.874,6,9,27,57,75,110,...,579,579,579,579,579,579,579,579,579,579
52,Fujian,China,26.0789,117.9874,1,5,10,18,35,59,...,356,356,356,356,356,357,357,358,358,358
53,Gansu,China,37.8099,101.0583,0,2,2,4,7,14,...,139,139,139,139,139,139,139,139,139,139
54,Guangdong,China,23.3417,113.4244,26,32,53,78,111,151,...,1590,1590,1591,1592,1592,1592,1592,1592,1592,1593
55,Guangxi,China,23.8298,108.7881,2,5,23,23,36,46,...,254,254,254,254,254,254,254,254,254,254
56,Guizhou,China,26.8154,106.8748,1,3,3,4,5,7,...,147,147,147,147,147,147,147,147,147,147
57,Hainan,China,19.1959,109.7453,4,5,8,19,22,33,...,169,169,169,169,169,169,169,169,169,169
58,Hebei,China,39.549,116.1306,1,1,2,8,13,18,...,328,328,328,328,328,328,328,328,328,328


As we can see above, the cases of some countries (China, in this example) are split across their states and provinces. To simplify this, we will add up the values of all provinces and states within each country to get the total number of cases per country.

In [12]:
df_group_country = worldwide_df.groupby('Country/Region').sum()
df_group_country

Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,...,5/20/20,5/21/20,5/22/20,5/23/20,5/24/20,5/25/20,5/26/20,5/27/20,5/28/20,5/29/20
Afghanistan,33.000000,65.000000,0,0,0,0,0,0,0,0,...,8145,8676,9216,9998,10582,11173,11831,12456,13036,13659
Albania,41.153300,20.168300,0,0,0,0,0,0,0,0,...,964,969,981,989,998,1004,1029,1050,1076,1099
Algeria,28.033900,1.659600,0,0,0,0,0,0,0,0,...,7542,7728,7918,8113,8306,8503,8697,8857,8997,9134
Andorra,42.506300,1.521800,0,0,0,0,0,0,0,0,...,762,762,762,762,762,763,763,763,763,764
Angola,-11.202700,17.873900,0,0,0,0,0,0,0,0,...,52,58,60,61,69,70,70,71,74,81
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
West Bank and Gaza,31.952200,35.233200,0,0,0,0,0,0,0,0,...,398,423,423,423,423,423,429,434,446,446
Western Sahara,24.215500,-12.885800,0,0,0,0,0,0,0,0,...,6,6,6,6,9,9,9,9,9,9
Yemen,15.552727,48.516388,0,0,0,0,0,0,0,0,...,184,197,209,212,222,233,249,256,278,283
Zambia,-15.416700,28.283300,0,0,0,0,0,0,0,0,...,832,866,920,920,920,920,920,1057,1057,1057


By grouping by country and performing an aggregate sum function, we are able to get the total number of cases for each country.
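
The sum above is applied to every column, including `Lat` and `Long`, which is why the United Kingdom row shows a latitude of 270.03 in the sorted output later in the notebook. As a minimal sketch of an alternative, assuming only the date columns matter for the analysis, the aggregation can be restricted to those columns. The frame `toy` and its values are made up for illustration; only the column layout mirrors `worldwide_df`.

```python
import pandas as pd

# Toy frame mimicking the layout of worldwide_df (values are illustrative only).
toy = pd.DataFrame({
    "Province/State": ["Anhui", "Beijing", None],
    "Country/Region": ["China", "China", "Spain"],
    "Lat": [31.8, 40.2, 40.0],
    "Long": [117.2, 116.4, -4.0],
    "1/22/20": [1, 14, 0],
    "1/23/20": [9, 22, 0],
})

# Keep only the date columns so coordinates and province names are not summed.
non_date_cols = ("Province/State", "Country/Region", "Lat", "Long")
date_cols = [c for c in toy.columns if c not in non_date_cols]

by_country = toy.groupby("Country/Region")[date_cols].sum()
print(by_country)
```

If this variant were adopted, the later `iloc[:,2:]` slice, which appears intended to strip the two coordinate columns, would no longer be needed.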

In [13]:
df_group_country.iloc[:, -1]

Country/Region
Afghanistan           13659
Albania                1099
Algeria                9134
Andorra                 764
Angola                   81
                      ...  
West Bank and Gaza      446
Western Sahara            9
Yemen                   283
Zambia                 1057
Zimbabwe                149
Name: 5/29/20, Length: 188, dtype: int64

In [14]:
df_group_country['Total Cases'] = df_group_country.iloc[:, -1]
df_group_country.head()

Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,...,5/21/20,5/22/20,5/23/20,5/24/20,5/25/20,5/26/20,5/27/20,5/28/20,5/29/20,Total Cases
Afghanistan,33.0,65.0,0,0,0,0,0,0,0,0,...,8676,9216,9998,10582,11173,11831,12456,13036,13659,13659
Albania,41.1533,20.1683,0,0,0,0,0,0,0,0,...,969,981,989,998,1004,1029,1050,1076,1099,1099
Algeria,28.0339,1.6596,0,0,0,0,0,0,0,0,...,7728,7918,8113,8306,8503,8697,8857,8997,9134,9134
Andorra,42.5063,1.5218,0,0,0,0,0,0,0,0,...,762,762,762,762,763,763,763,763,764,764
Angola,-11.2027,17.8739,0,0,0,0,0,0,0,0,...,58,60,61,69,70,70,71,74,81,81


We have now added a column called __Total Cases__ which contains the total number of cases for each country so that it will be easier to analyze.

In [15]:
df_group_country = df_group_country.sort_values('Total Cases',ascending=False)
df_group_country_top10 = df_group_country.head(10)
df_group_country

Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,...,5/21/20,5/22/20,5/23/20,5/24/20,5/25/20,5/26/20,5/27/20,5/28/20,5/29/20,Total Cases
US,37.090200,-95.712900,1,1,2,2,5,5,5,5,...,1577147,1600937,1622612,1643246,1662302,1680913,1699176,1721753,1746019,1746019
Brazil,-14.235000,-51.925300,0,0,0,0,0,0,0,0,...,310087,330890,347398,363211,374898,391222,411821,438238,465166,465166
Russia,60.000000,90.000000,0,0,0,0,0,0,0,0,...,317554,326448,335882,344481,353427,362342,370680,379051,387623,387623
United Kingdom,270.029900,-482.924700,0,0,0,0,0,0,0,0,...,252246,255544,258504,260916,262547,266599,268619,270508,272607,272607
Spain,40.000000,-4.000000,0,0,0,0,0,0,0,0,...,233037,234824,235290,235772,235400,236259,236259,237906,238564,238564
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Seychelles,-4.679600,55.492000,0,0,0,0,0,0,0,0,...,11,11,11,11,11,11,11,11,11,11
MS Zaandam,0.000000,0.000000,0,0,0,0,0,0,0,0,...,9,9,9,9,9,9,9,9,9,9
Western Sahara,24.215500,-12.885800,0,0,0,0,0,0,0,0,...,6,6,6,9,9,9,9,9,9,9
Papua New Guinea,-6.315000,143.955500,0,0,0,0,0,0,0,0,...,8,8,8,8,8,8,8,8,8,8


With the above code, we sort the total cases in descending order so that we have the countries with the highest cases at the top of the list. We will make use of this further down in the notebook.
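
When only the top entries are needed, sorting and slicing can be combined into a single call. The sketch below uses `DataFrame.nlargest` on a small stand-in frame named `toy` (a hypothetical name); its four totals are taken from the 5/29/20 column shown earlier.

```python
import pandas as pd

# Stand-in for df_group_country with four of the totals shown earlier.
toy = pd.DataFrame(
    {"Total Cases": [81, 13659, 764, 9134]},
    index=pd.Index(["Angola", "Afghanistan", "Andorra", "Algeria"],
                   name="Country/Region"),
)

# Equivalent to sort_values('Total Cases', ascending=False).head(2)
top2 = toy.nlargest(2, "Total Cases")
print(top2)
```

The notebook keeps the fully sorted frame as well, so `sort_values` followed by `head(10)` remains the natural choice there.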

__USA_state_stats :__

In [16]:
# Drop the ventilator, scoring/grading and redundant total columns that are not used in this analysis
USA_state_stats.drop(['onVentilatorCumulative','onVentilatorCurrently','score','grade','commercialScore','negativeRegularScore','negativeScore','positiveScore','posNeg','total','notes'],axis=1,inplace=True)
USA_state_stats.head(5)

Unnamed: 0,state,positive,dataQualityGrade,negative,pending,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,recovered,lastUpdateEt,checkTimeEt,death,hospitalized,totalTestResults,fips,dateModified,dateChecked,hash
0,AK,430,A,49009.0,,14.0,,,,367.0,5/29 00:00,5/29 15:22,10,,49439,2,2020-05-29T04:00:00Z,2020-05-29T19:22:00Z,68ad70f66afd5ef31321c3295c4e3eed051ea4cf
1,AL,16823,B,192060.0,,,1800.0,,577.0,9355.0,5/29 00:00,5/29 14:44,605,1800.0,208883,1,2020-05-29T04:00:00Z,2020-05-29T18:44:00Z,1f8c806e84306966f71133639ab0c9c6d2d6e9d6
2,AR,6538,A,113230.0,,104.0,667.0,,,4583.0,5/28 19:50,5/29 14:33,125,667.0,119768,5,2020-05-28T23:50:00Z,2020-05-29T18:33:00Z,b9004684021e4cc5e66645821204cf3087d2fcfc
3,AZ,18465,A+,191348.0,,931.0,2911.0,378.0,,4551.0,5/29 00:00,5/29 16:01,885,2911.0,209813,4,2020-05-29T04:00:00Z,2020-05-29T20:01:00Z,d8b6f669548cb8b8581cac9025835946c4c79726
4,CA,103886,B,1731592.0,,4414.0,,1328.0,,,5/29 00:00,5/29 16:11,4068,,1835478,6,2020-05-29T04:00:00Z,2020-05-29T20:11:00Z,63a9cdfd96ee08c583425814721389853ab749f4


In [17]:
USA_state_stats.drop(['hospitalized','dataQualityGrade','inIcuCurrently','inIcuCumulative','pending','hospitalizedCurrently','hospitalizedCumulative','hash','dateChecked','dateModified','fips','checkTimeEt','lastUpdateEt'],axis=1,inplace=True)
USA_state_stats.head(5)

Unnamed: 0,state,positive,negative,recovered,death,totalTestResults
0,AK,430,49009.0,367.0,10,49439
1,AL,16823,192060.0,9355.0,605,208883
2,AR,6538,113230.0,4583.0,125,119768
3,AZ,18465,191348.0,4551.0,885,209813
4,CA,103886,1731592.0,,4068,1835478


In [18]:
USA_state_stats.isnull().sum()

state                0
positive             0
negative             1
recovered           12
death                0
totalTestResults     0
dtype: int64

In [19]:
USA_state_stats[USA_state_stats['recovered'].isnull()]

Unnamed: 0,state,positive,negative,recovered,death,totalTestResults
4,CA,103886,1731592.0,,4068,1835478
9,FL,54497,928742.0,,2495,983239
10,GA,45670,404601.0,,1974,450271
14,IL,117455,734307.0,,5270,851762
15,IN,33558,215155.0,,2110,248713
19,MA,95512,476233.0,,6718,571745
24,MO,12795,165055.0,,738,177850
29,NE,13261,81542.0,,164,94803
35,OH,34566,335324.0,,2131,369890
47,WA,20764,322327.0,,1106,343091


In [20]:
USA_state_stats.drop('recovered',axis=1,inplace=True)
USA_state_stats.head()

Unnamed: 0,state,positive,negative,death,totalTestResults
0,AK,430,49009.0,10,49439
1,AL,16823,192060.0,605,208883
2,AR,6538,113230.0,125,119768
3,AZ,18465,191348.0,885,209813
4,CA,103886,1731592.0,4068,1835478


We also have one row where the __negative__ column is null. We could drop just this row, but instead we recompute __negative__ for every row as totalTestResults minus positive, which also fills the missing value.

In [21]:
USA_state_stats['negative'] = USA_state_stats['totalTestResults'] - USA_state_stats['positive']
USA_state_stats.isnull().sum()

state               0
positive            0
negative            0
death               0
totalTestResults    0
dtype: int64
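
A narrower variant fills only the missing entries and leaves the reported values untouched. The sketch below assumes, as the cell above does, that `totalTestResults` equals `positive` plus `negative`. The frame `toy`, its state codes and its numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing 'negative' value (illustrative data only).
toy = pd.DataFrame({
    "state": ["AA", "BB"],
    "positive": [10, 5],
    "negative": [90.0, np.nan],
    "totalTestResults": [100, 50],
})

# Assumption: totalTestResults = positive + negative.
derived = toy["totalTestResults"] - toy["positive"]

# Fill only the gaps, then convert back to integers now that no NaN remains.
toy["negative"] = toy["negative"].fillna(derived).astype(int)
print(toy)
```

Recomputing the whole column, as the notebook does, gives the same result whenever the assumption holds for every row.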

In [22]:
USA_state_stats.head()

Unnamed: 0,state,positive,negative,death,totalTestResults
0,AK,430,49009,10,49439
1,AL,16823,192060,605,208883
2,AR,6538,113230,125,119768
3,AZ,18465,191348,885,209813
4,CA,103886,1731592,4068,1835478


In [23]:
USA_state_stats['positive/tests %'] = (USA_state_stats['positive']/USA_state_stats['totalTestResults'])*100
USA_state_stats['death/positive %'] = (USA_state_stats['death']/USA_state_stats['positive'])*100
USA_state_stats.head()

Unnamed: 0,state,positive,negative,death,totalTestResults,positive/tests %,death/positive %
0,AK,430,49009,10,49439,0.869759,2.325581
1,AL,16823,192060,605,208883,8.053791,3.596267
2,AR,6538,113230,125,119768,5.458887,1.9119
3,AZ,18465,191348,885,209813,8.800694,4.792851
4,CA,103886,1731592,4068,1835478,5.659888,3.915831


In [24]:
USA_state_stats['state'].nunique()

56

In [25]:
USA_state_stats['state'].unique()

array(['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
       'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
       'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
       'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
       'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY', 'PR', 'AS', 'GU', 'MP',
       'VI'], dtype=object)

In [26]:
# Drop the rows at index 52 ('AS') and 54 ('MP'), per the order of the array above
USA_state_stats.drop([52,54],axis=0,inplace=True)
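
Positional index labels such as 52 and 54 depend on the row order of the downloaded file, which can change between daily updates. A minimal sketch of a more robust alternative selects the rows by their `state` code instead. The notebook does not state why these two rows are removed, so the list `codes_to_drop` simply mirrors the two codes found at those positions; `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy frame with a few of the state/territory codes listed above (made-up counts).
toy = pd.DataFrame({
    "state": ["WY", "PR", "AS", "GU", "MP", "VI"],
    "positive": [1, 2, 0, 3, 4, 5],
})

# Codes at index 52 and 54 in the array printed by the notebook.
codes_to_drop = ["AS", "MP"]

# Keep every row whose state code is not in the drop list.
filtered = toy[~toy["state"].isin(codes_to_drop)]
print(filtered)
```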

<a id='section4'></a>
## 4. Questions

<a id='section401'></a>
### 4.1 How are COVID-19 cases distributed worldwide?

To answer this question, we will use a choropleth map to visualize the cases across the globe and gain a holistic view of the spread of the virus.

In [27]:
data = dict(
        type = 'choropleth',
        colorscale = 'agsunset',
        reversescale = True,
        locations = df_group_country.index,
        locationmode = "country names",
        z = df_group_country['Total Cases'],
        text = df_group_country.index,
        colorbar = {'title' : 'COVID-19 cases by Country'},
      ) 

layout = dict(title = 'COVID-19 cases by Country',
                geo = dict(showframe = False,projection = {'type':'orthographic'})
             )

In [30]:
choromap = go.Figure(data = [data],layout = layout)
iplot(choromap,validate=False)

* The interactive map above gives an overall picture of the spread of the virus across the globe. The US is clearly the worst affected country at the time of writing this notebook.

In [31]:
df_group_country_top10 = df_group_country_top10.iloc[:,2:]
df_group_country_top10

Country/Region,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,2/1/20,2/2/20,...,5/21/20,5/22/20,5/23/20,5/24/20,5/25/20,5/26/20,5/27/20,5/28/20,5/29/20,Total Cases
US,2,2,5,5,5,5,5,7,8,8,...,1577147,1600937,1622612,1643246,1662302,1680913,1699176,1721753,1746019,1746019
Brazil,0,0,0,0,0,0,0,0,0,0,...,310087,330890,347398,363211,374898,391222,411821,438238,465166,465166
Russia,0,0,0,0,0,0,0,2,2,2,...,317554,326448,335882,344481,353427,362342,370680,379051,387623,387623
United Kingdom,0,0,0,0,0,0,0,2,2,2,...,252246,255544,258504,260916,262547,266599,268619,270508,272607,272607
Spain,0,0,0,0,0,0,0,0,1,1,...,233037,234824,235290,235772,235400,236259,236259,237906,238564,238564
Italy,0,0,0,0,0,0,0,2,2,2,...,228006,228658,229327,229858,230158,230555,231139,231732,232248,232248
France,2,3,3,3,4,5,5,5,6,6,...,181951,182354,182694,182709,183067,182847,183038,186364,186923,186923
Germany,0,0,0,1,4,4,4,5,8,10,...,179021,179710,179986,180328,180600,181200,181524,182196,182922,182922
India,0,0,0,0,0,0,1,1,1,2,...,118226,124794,131423,138536,144950,150793,158086,165386,173491,173491
Turkey,0,0,0,0,0,0,0,0,0,0,...,153548,154500,155686,156827,157814,158762,159797,160979,162120,162120


* We have created a new dataframe with the top 10 most infected countries. We can now use this dataframe to plot the trajectory of cases daily for these 10 countries.

* We will transpose this dataframe so that the date columns become the index, which allows us to plot the number of cases per day for the 10 most infected countries.
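
The effect of the transpose can be sketched on a two-country frame. As an optional refinement that the notebook itself does not apply, the sketch also drops the `Total Cases` column first and parses the date labels, so that the index holds real timestamps rather than strings. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Two countries, two date columns and the summary column (illustrative values).
toy = pd.DataFrame(
    {"1/22/20": [1, 0], "1/23/20": [1, 2], "Total Cases": [1, 2]},
    index=pd.Index(["US", "Italy"], name="Country/Region"),
)

# Dates become the index, countries become the columns.
trans = toy.drop(columns="Total Cases").transpose()

# Optional: parse the 'month/day/year' labels into timestamps.
trans.index = pd.to_datetime(trans.index, format="%m/%d/%y")
print(trans)
```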

<a id='section402'></a>
### 4.2 How have cases increased with time in the most infected countries?

#### Visualizing daily number of cases among top 10 most infected countries:

In [32]:
df_group_country_top10_trans = df_group_country_top10.transpose()
df_group_country_top10_trans.iplot(width=2.5,size=20)

From the graph above, we can see that the trajectory of the US stands out from the other countries. Its curve is much steeper, and the number of cases increases at a much higher rate than in the other countries.

In [33]:
df_group_country_top10_trans

Country/Region,US,Brazil,Russia,United Kingdom,Spain,Italy,France,Germany,India,Turkey
1/24/20,2,0,0,0,0,0,2,0,0,0
1/25/20,2,0,0,0,0,0,3,0,0,0
1/26/20,5,0,0,0,0,0,3,0,0,0
1/27/20,5,0,0,0,0,0,3,1,0,0
1/28/20,5,0,0,0,0,0,4,4,0,0
...,...,...,...,...,...,...,...,...,...,...
5/26/20,1680913,391222,362342,266599,236259,230555,182847,181200,150793,158762
5/27/20,1699176,411821,370680,268619,236259,231139,183038,181524,158086,159797
5/28/20,1721753,438238,379051,270508,237906,231732,186364,182196,165386,160979
5/29/20,1746019,465166,387623,272607,238564,232248,186923,182922,173491,162120


<a id='section403'></a>
### 4.3 At what rate have cases increased daily in the most infected countries?

#### Visualizing daily new cases among top 10 most infected countries:

Next we will create new columns for each of our 10 countries to show the number of __new cases__ each day for each one of them. To do this, we will create a new dataframe named __df_case_increment__.

In [34]:
# Copy the frame so that the new columns are not also added to df_group_country_top10_trans
df_case_increment = df_group_country_top10_trans.copy()
df_case_increment

Country/Region,US,Brazil,Russia,United Kingdom,Spain,Italy,France,Germany,India,Turkey
1/24/20,2,0,0,0,0,0,2,0,0,0
1/25/20,2,0,0,0,0,0,3,0,0,0
1/26/20,5,0,0,0,0,0,3,0,0,0
1/27/20,5,0,0,0,0,0,3,1,0,0
1/28/20,5,0,0,0,0,0,4,4,0,0
...,...,...,...,...,...,...,...,...,...,...
5/26/20,1680913,391222,362342,266599,236259,230555,182847,181200,150793,158762
5/27/20,1699176,411821,370680,268619,236259,231139,183038,181524,158086,159797
5/28/20,1721753,438238,379051,270508,237906,231732,186364,182196,165386,160979
5/29/20,1746019,465166,387623,272607,238564,232248,186923,182922,173491,162120


In [35]:
df_case_increment.columns

Index(['US', 'Brazil', 'Russia', 'United Kingdom', 'Spain', 'Italy', 'France',
       'Germany', 'India', 'Turkey'],
      dtype='object', name='Country/Region')

In [36]:
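newCaseList = []
oldColumns = []
# Add a '<country>_newcases' column per country, initially a copy of the cumulative counts
for x in df_case_increment.columns:
    oldColumns.append(x)
    newcases = x + '_newcases'
    df_case_increment[newcases] = df_case_increment[x]
    newCaseList.append(newcases)
# Shift the copies down by one day (all rows except the last) so each row holds the previous day's total
df_case_increment.iloc[:-1,10:] = df_case_increment.iloc[:-1,10:].shift(periods = 1,fill_value=0)

# New cases = cumulative total for the day minus the previous day's total
for y in range(len(newCaseList)):
    df_case_increment[newCaseList[y]] = df_case_increment[oldColumns[y]] - df_case_increment[newCaseList[y]]
df_case_increment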

Country/Region,US,Brazil,Russia,United Kingdom,Spain,Italy,France,Germany,India,Turkey,US_newcases,Brazil_newcases,Russia_newcases,United Kingdom_newcases,Spain_newcases,Italy_newcases,France_newcases,Germany_newcases,India_newcases,Turkey_newcases
1/24/20,2,0,0,0,0,0,2,0,0,0,2,0,0,0,0,0,2,0,0,0
1/25/20,2,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,1,0,0,0
1/26/20,5,0,0,0,0,0,3,0,0,0,3,0,0,0,0,0,0,0,0,0
1/27/20,5,0,0,0,0,0,3,1,0,0,0,0,0,0,0,0,0,1,0,0
1/28/20,5,0,0,0,0,0,4,4,0,0,0,0,0,0,0,0,1,3,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5/26/20,1680913,391222,362342,266599,236259,230555,182847,181200,150793,158762,18611,16324,8915,4052,859,397,-220,600,5843,948
5/27/20,1699176,411821,370680,268619,236259,231139,183038,181524,158086,159797,18263,20599,8338,2020,0,584,191,324,7293,1035
5/28/20,1721753,438238,379051,270508,237906,231732,186364,182196,165386,160979,22577,26417,8371,1889,1647,593,3326,672,7300,1182
5/29/20,1746019,465166,387623,272607,238564,232248,186923,182922,173491,162120,24266,26928,8572,2099,658,516,559,726,8105,1141


The above code is used to create new columns for each of our 10 countries and perform mathematical operations to give us the number of new cases per day for each of them.
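
The same day-over-day difference can be computed more compactly with `DataFrame.diff`. The sketch below is an alternative to the shift-and-subtract approach, not the notebook's own method. It uses the first four days of the US and France series from the table above; the name `cumulative` is hypothetical.

```python
import pandas as pd

# Cumulative counts for two countries, taken from the first rows shown above.
cumulative = pd.DataFrame(
    {"US": [2, 2, 5, 5], "France": [2, 3, 3, 3]},
    index=["1/24/20", "1/25/20", "1/26/20", "1/27/20"],
)

# diff() subtracts the previous row from each row; the first row has no
# predecessor and becomes NaN, so it is dropped before casting to int.
new_cases = cumulative.diff().iloc[1:].astype(int).add_suffix("_newcases")
print(new_cases)
```

On the real data, any summary row such as `Total Cases` would need to be excluded before calling `diff`, since it is not a date.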

In [37]:
df_case_increment_top10 = df_case_increment.iloc[1:-1,10:] 

In [38]:
df_case_increment_top10

Country/Region,US_newcases,Brazil_newcases,Russia_newcases,United Kingdom_newcases,Spain_newcases,Italy_newcases,France_newcases,Germany_newcases,India_newcases,Turkey_newcases
1/25/20,0,0,0,0,0,0,1,0,0,0
1/26/20,3,0,0,0,0,0,0,0,0,0
1/27/20,0,0,0,0,0,0,0,1,0,0
1/28/20,0,0,0,0,0,0,1,3,0,0
1/29/20,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
5/25/20,19056,11687,8946,1631,-372,300,358,272,6414,987
5/26/20,18611,16324,8915,4052,859,397,-220,600,5843,948
5/27/20,18263,20599,8338,2020,0,584,191,324,7293,1035
5/28/20,22577,26417,8371,1889,1647,593,3326,672,7300,1182


Lastly, we trim the dataframe to give us only the daily new case data so that we will easily be able to plot them.

In [39]:
df_case_increment_top10.iplot(size=20,width=1.5)

All observations below were made at the time of writing this notebook:

- The number of new cases shows an overall downward trend in the US, the UK, Italy, France, Germany and Turkey.

- In Spain, the number of new cases seems to be constant, which means the situation is neither worsening nor improving.

- In Russia, Brazil and India, the number of new cases seems to be increasing daily, which shows that the virus is still spreading substantially in these countries.

- Again, we can see that the new case line of the US stands out from the others due to the fact that the number of cases daily is much higher than that of any of the other 9 countries. As this graph is interactive, we are able to deselect the countries we do not want to view from the legend or hover over the lines to see the value of new cases at that point.

### Analyzing spread of COVID-19 cases in the US

<a id='section404'></a>
### 4.4 Which states in the US have the most and least positive cases?

Firstly, we will look at the United States of America as a whole and see how the cases are distributed across the country.

We will use a choropleth map to get a geographic plot of the country along with a color-coded legend based on the number of positive cases in each state.

In [40]:
fig = go.Figure(data=go.Choropleth(
    locations=USA_state_stats['state'], # two-letter state codes
    z = USA_state_stats['positive'], # data to be color-coded
    locationmode = 'USA-states', # interpret `locations` as US state codes
    colorscale = 'Reds', # sequential red color scale
    colorbar_title = "COVID-19 cases in USA by state",
))

fig.update_layout(
    width = 800,
    height = 800,
    title_text = 'COVID-19 cases in USA by state',
    geo_scope='usa', # limit map scope to the USA
)

fig.show()

Now that we have an overall picture of how positive cases are spread across the USA and a rough idea of the hotspots and coldspots, we can dig a little deeper into the data to see exactly which states have the highest and lowest number of confirmed cases.
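As an alternative to sorting the whole frame and slicing it with `head` or `tail`, pandas offers `nlargest` and `nsmallest`. The sketch below is only an illustration on a handful of rows copied from the state tables in this section; the frame name `states` is hypothetical.

```python
import pandas as pd

# A few rows copied from the state tables in this section
states = pd.DataFrame(
    {
        "state": ["NY", "NJ", "IL", "AK", "GU", "VI"],
        "positive": [368284, 158844, 117455, 430, 172, 69],
    }
)

top2 = states.nlargest(2, "positive")      # highest counts first
bottom2 = states.nsmallest(2, "positive")  # lowest counts first
print(top2, bottom2, sep="\n\n")
```

Note that `nsmallest` lists the lowest value first, whereas `sort_values(..., ascending=False).tail()` lists it last.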

#### States with the highest number of confirmed cases

In [41]:
USA_state_stats.sort_values('positive',ascending=False).head(10)

Unnamed: 0,state,positive,negative,death,totalTestResults,positive/tests %,death/positive %
34,NY,368284,1575846,23780,1944130,18.943383,6.456973
31,NJ,158844,557567,11531,716411,22.172189,7.259324
14,IL,117455,734307,5270,851762,13.78965,4.486825
4,CA,103886,1731592,4068,1835478,5.659888,3.915831
19,MA,95512,476233,6718,571745,16.705349,7.033671
38,PA,71339,366970,5464,438309,16.275961,7.659205
43,TX,61006,832269,1626,893275,6.829476,2.665312
22,MI,56621,464986,5406,521607,10.855107,9.547694
9,FL,54497,928742,2495,983239,5.5426,4.578234
20,MD,50988,233530,2466,284518,17.920835,4.836432


#### States with the lowest number of confirmed cases

In [42]:
USA_state_stats.sort_values('positive',ascending=False).tail(10)

Unnamed: 0,state,positive,negative,death,totalTestResults,positive/tests %,death/positive %
28,ND,2520,66933,59,69453,3.628353,2.34127
21,ME,2226,43480,85,45706,4.870258,3.818509
49,WV,1951,91426,74,93377,2.08938,3.792927
46,VT,975,30910,55,31885,3.057864,5.641026
50,WY,891,22378,15,23269,3.829129,1.683502
11,HI,647,52177,17,52824,1.224822,2.627512
26,MT,493,38036,17,38529,1.279556,3.448276
0,AK,430,49009,10,49439,0.869759,2.325581
53,GU,172,5923,5,6095,2.821985,2.906977
55,VI,69,1633,6,1702,4.054054,8.695652


<a id='section405'></a>
### 4.5 Which states in the US have performed the highest and lowest number of tests?

As with the number of positive cases, we will first look at how the number of tests is distributed across all states using a choropleth map.

In [43]:
fig = go.Figure(data=go.Choropleth(
    locations=USA_state_stats['state'], # two-letter state codes
    z = USA_state_stats['totalTestResults'], # data to be color-coded
    locationmode = 'USA-states', # interpret `locations` as US state codes
    colorscale = 'Reds', # sequential red color scale
    colorbar_title = "COVID-19 tests in USA by state",
))

fig.update_layout(
    width = 800,
    height = 800,
    title_text = 'COVID-19 tests in USA by state',
    geo_scope='usa', # limit map scope to the USA
)

fig.show()

With this understanding, let us move on to analyzing which states have the highest and lowest testing numbers.

#### States with the highest number of tests done

In [44]:
USA_state_stats.sort_values('totalTestResults',ascending=False).head(10)

Unnamed: 0,state,positive,negative,death,totalTestResults,positive/tests %,death/positive %
34,NY,368284,1575846,23780,1944130,18.943383,6.456973
4,CA,103886,1731592,4068,1835478,5.659888,3.915831
9,FL,54497,928742,2495,983239,5.5426,4.578234
43,TX,61006,832269,1626,893275,6.829476,2.665312
14,IL,117455,734307,5270,851762,13.78965,4.486825
31,NJ,158844,557567,11531,716411,22.172189,7.259324
19,MA,95512,476233,6718,571745,16.705349,7.033671
22,MI,56621,464986,5406,521607,10.855107,9.547694
10,GA,45670,404601,1974,450271,10.142781,4.322312
38,PA,71339,366970,5464,438309,16.275961,7.659205


#### States with the lowest number of tests done

In [45]:
USA_state_stats.sort_values('totalTestResults',ascending=False).tail(10)

Unnamed: 0,state,positive,negative,death,totalTestResults,positive/tests %,death/positive %
21,ME,2226,43480,85,45706,4.870258,3.818509
13,ID,2769,41992,82,44761,6.186189,2.961358
7,DC,8538,35320,460,43858,19.467372,5.387679
41,SD,4866,35816,59,40682,11.961064,1.212495
26,MT,493,38036,17,38529,1.279556,3.448276
46,VT,975,30910,55,31885,3.057864,5.641026
50,WY,891,22378,15,23269,3.829129,1.683502
53,GU,172,5923,5,6095,2.821985,2.906977
51,PR,3647,0,132,3647,100.0,3.619413
55,VI,69,1633,6,1702,4.054054,8.695652


<a id='section406'></a>
### 4.6 Which states in the US show the highest rate of positive cases with respect to tests conducted?

To understand this, we have created the column __positive/tests %__. In essence, this column shows how many out of every 100 tests turn out to be positive. This can give us a good idea of the extent of infection spread in the state.
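The column itself is built earlier in the notebook. Judging from the values in the tables, it appears to be `positive / totalTestResults * 100`; the minimal sketch below checks that reading against two rows copied from the state tables.

```python
import pandas as pd

# Two rows copied from the state tables
df = pd.DataFrame(
    {
        "state": ["NY", "AK"],
        "positive": [368284, 430],
        "totalTestResults": [1944130, 49439],
    }
)

# Share of tests that came back positive, as a percentage
df["positive/tests %"] = df["positive"] / df["totalTestResults"] * 100
print(df.round(2))
```

The results agree with the 18.943383 and 0.869759 shown for NY and AK in the tables.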

#### States with the highest percentage of positive cases with respect to total tests

In [46]:
USA_state_stats.sort_values('positive/tests %',ascending=False).head(10)

Unnamed: 0,state,positive,negative,death,totalTestResults,positive/tests %,death/positive %
51,PR,3647,0,132,3647,100.0,3.619413
31,NJ,158844,557567,11531,716411,22.172189,7.259324
7,DC,8538,35320,460,43858,19.467372,5.387679
34,NY,368284,1575846,23780,1944130,18.943383,6.456973
20,MD,50988,233530,2466,284518,17.920835,4.836432
6,CT,41762,199631,3868,241393,17.300419,9.262009
19,MA,95512,476233,6718,571745,16.705349,7.033671
38,PA,71339,366970,5464,438309,16.275961,7.659205
8,DE,9236,48297,356,57533,16.053395,3.854482
5,CO,25121,143838,1421,168959,14.868104,5.656622


In [47]:
top10_pos_tests = USA_state_stats.sort_values('positive/tests %',ascending=False).head(10)
top10_pos_tests.iplot(kind='bar',x='state',y='positive/tests %',color='Blue',fill=True)

From the table and graph above we can see that __PR__ shows a 100% positivity rate, but only because the data records 0 negative tests for it, so this figure is not reliable. It is followed by __NJ__, __DC__ and __NY__ in terms of positive cases relative to tests done.
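Rows with no recorded negative tests, such as PR's, can be kept out of the ranking by filtering on `negative > 0` first. This is a sketch of one possible guard, not something the notebook does; the rows are copied from the table above.

```python
import pandas as pd

# Rows copied from the table above
df = pd.DataFrame(
    {
        "state": ["PR", "NJ", "DC", "NY"],
        "negative": [0, 557567, 35320, 1575846],
        "positive/tests %": [100.0, 22.172189, 19.467372, 18.943383],
    }
)

# Drop rows with no recorded negative tests before ranking
ranked = df[df["negative"] > 0].sort_values("positive/tests %", ascending=False)
print(ranked["state"].tolist())
```

Whether a zero really means missing data is an interpretation here; the dataset documentation would be the place to confirm it.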

#### States with the lowest percentage of positive cases with respect to total tests

In [48]:
USA_state_stats.sort_values('positive/tests %',ascending=False).tail(10)

Unnamed: 0,state,positive,negative,death,totalTestResults,positive/tests %,death/positive %
50,WY,891,22378,15,23269,3.829129,1.683502
28,ND,2520,66933,59,69453,3.628353,2.34127
36,OK,6338,181060,329,187398,3.382107,5.190912
37,OR,4131,118550,151,122681,3.36727,3.655289
46,VT,975,30910,55,31885,3.057864,5.641026
53,GU,172,5923,5,6095,2.821985,2.906977
49,WV,1951,91426,74,93377,2.08938,3.792927
26,MT,493,38036,17,38529,1.279556,3.448276
11,HI,647,52177,17,52824,1.224822,2.627512
0,AK,430,49009,10,49439,0.869759,2.325581


In [49]:
bottom10_pos_tests = USA_state_stats.sort_values('positive/tests %',ascending=False).tail(10)
bottom10_pos_tests.iplot(kind='bar',x='state',y='positive/tests %',color='Blue',fill=True)

As for the states with the lowest share of positive cases relative to total tests, __AK__ has the lowest rate at about 0.9%, followed by __HI__ at about 1.2% and __MT__ at about 1.3%.

<a id='section407'></a>
### 4.7 Which states in the US show the highest death rate with respect to positive cases?

To answer this question, we have created the column __death/positive %__. This shows us how many deaths occur for every 100 positive cases, which gives us an insight into how likely it is that contracting the virus leads to death.

In [50]:
fig = go.Figure(data=go.Choropleth(
    locations=USA_state_stats['state'], # two-letter state codes
    z = USA_state_stats['death/positive %'], # data to be color-coded
    locationmode = 'USA-states', # interpret `locations` as US state codes
    colorscale = 'YlOrRd', # yellow-orange-red color scale
    colorbar_title = "Deaths wrt positive cases",
))

fig.update_layout(
    width = 800,
    height = 800,
    title_text = 'Deaths with respect to positive cases',
    geo_scope='usa', # limit map scope to the USA
)

fig.show()

#### States with the highest percentage of deaths with respect to positive cases

In [51]:
USA_state_stats.sort_values('death/positive %',ascending=False).head(10)

Unnamed: 0,state,positive,negative,death,totalTestResults,positive/tests %,death/positive %
22,MI,56621,464986,5406,521607,10.855107,9.547694
6,CT,41762,199631,3868,241393,17.300419,9.262009
55,VI,69,1633,6,1702,4.054054,8.695652
38,PA,71339,366970,5464,438309,16.275961,7.659205
31,NJ,158844,557567,11531,716411,22.172189,7.259324
18,LA,38802,316225,2766,355027,10.92931,7.128499
19,MA,95512,476233,6718,571745,16.705349,7.033671
34,NY,368284,1575846,23780,1944130,18.943383,6.456973
15,IN,33558,215155,2110,248713,13.49266,6.287621
35,OH,34566,335324,2131,369890,9.34494,6.165018


From the dataframe above we can see that __MI__ has the highest death rate at about 9.5%, followed by __CT__ at about 9.3%.
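Rates computed from small counts are unstable: VI ranks third in the table above with only 69 positive cases and 6 deaths. One way to guard against this is to require a minimum number of positive cases before ranking. In the sketch below, the rows are copied from the table above, and the cutoff `MIN_CASES` is a hypothetical value chosen only for illustration.

```python
import pandas as pd

# Rows copied from the table above
df = pd.DataFrame(
    {
        "state": ["MI", "CT", "VI", "PA"],
        "positive": [56621, 41762, 69, 71339],
        "death": [5406, 3868, 6, 5464],
    }
)
df["death/positive %"] = df["death"] / df["positive"] * 100

MIN_CASES = 1000  # hypothetical cutoff, chosen only for illustration
stable = df[df["positive"] >= MIN_CASES].sort_values(
    "death/positive %", ascending=False
)
print(stable[["state", "death/positive %"]].round(2))
```

With this filter VI drops out and PA moves up to third place among the rows shown.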

#### States with the lowest percentage of deaths with respect to positive cases

In [52]:
USA_state_stats.sort_values('death/positive %',ascending=False).tail(10)

Unnamed: 0,state,positive,negative,death,totalTestResults,positive/tests %,death/positive %
11,HI,647,52177,17,52824,1.224822,2.627512
28,ND,2520,66933,59,69453,3.628353,2.34127
0,AK,430,49009,10,49439,0.869759,2.325581
16,KS,9719,85230,208,94949,10.236021,2.140138
2,AR,6538,113230,125,119768,5.458887,1.9119
50,WY,891,22378,15,23269,3.829129,1.683502
42,TN,22085,399882,360,421967,5.233822,1.630066
29,NE,13261,81542,164,94803,13.987954,1.236709
41,SD,4866,35816,59,40682,11.961064,1.212495
44,UT,9264,196591,107,205855,4.500255,1.155009


The state with the lowest death rate is __UT__ at just under 1.2%, followed by __SD__ and __NE__ at just over 1.2%.

In [53]:
infected_rate = pd.read_csv('COVID19_line_list_data.csv')
infected_rate.head()

Unnamed: 0,id,case_in_country,reporting date,Unnamed: 3,summary,location,country,gender,age,symptom_onset,...,recovered,symptom,source,link,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26
0,1,,1/20/2020,,First confirmed imported COVID-19 pneumonia pa...,"Shenzhen, Guangdong",China,male,66.0,01/03/20,...,0,,Shenzhen Municipal Health Commission,http://wjw.sz.gov.cn/wzx/202001/t20200120_1898...,,,,,,
1,2,,1/20/2020,,First confirmed imported COVID-19 pneumonia pa...,Shanghai,China,female,56.0,1/15/2020,...,0,,Official Weibo of Shanghai Municipal Health Co...,https://www.weibo.com/2372649470/IqogQhgfa?fro...,,,,,,
2,3,,1/21/2020,,First confirmed imported cases in Zhejiang: pa...,Zhejiang,China,male,46.0,01/04/20,...,0,,Health Commission of Zhejiang Province,http://www.zjwjw.gov.cn/art/2020/1/21/art_1202...,,,,,,
3,4,,1/21/2020,,new confirmed imported COVID-19 pneumonia in T...,Tianjin,China,female,60.0,,...,0,,人民日报官方微博,https://m.weibo.cn/status/4463235401268457?,,,,,,
4,5,,1/21/2020,,new confirmed imported COVID-19 pneumonia in T...,Tianjin,China,male,58.0,,...,0,,人民日报官方微博,https://m.weibo.cn/status/4463235401268457?,,,,,,


<a id='section5'></a>
## 5. Conclusion 

- In this notebook, we used various numerical and visualization libraries to perform an Exploratory Data Analysis of COVID-19 data.
- We were able to successfully process the datasets by removing irrelevant data and creating new columns where necessary.
- We made use of packages like __pandas and plotly__ to develop better insights about the data using visualization. <br/>
- We have also seen how __preprocessing__ helps in dealing with __missing__ and __erroneous__ values and irregularities present in the data. We also _created new features_, which in turn helped us to better understand the data.
- We used plotly to visualize geographical data and better understand how our data is distributed.
- These steps helped us develop a deeper understanding of the spread of COVID-19 across the globe and in the US. We were able to understand the current situation and estimate how severely each country was hit by the virus.<br/><br/>