<a href="https://colab.research.google.com/github/SriSatyaLokesh/COVID19-DataAnalysis/blob/master/COVID19_Worldwide_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## COVID19 Exploratory Data Analysis
### Worldwide

#### The following cells set up the data analysis in Google Colab

In [3]:
# upload your kaggle API token (you can get that from your account) 
from google.colab import files
uploaded = files.upload()

Saving kaggle.json to kaggle.json


In [0]:
# Run this to create a kaggle environment
# !pip install -q kaggle
# !mkdir -p ~/.kaggle
# !cp kaggle.json ~/.kaggle/kaggle.json
# !chmod 600 /root/.kaggle/kaggle.json

# import numpy as np
# import pandas as pd
# import plotly.express as px

**Let's perform exploratory data analysis on COVID-19 data**
- I'm using data from Kaggle and GitHub
- Global COVID-19 data: https://www.kaggle.com/imdevskp/corona-virus-report/
- India COVID-19 data: https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset/
- Time series COVID-19 data: https://github.com/CSSEGISandData/COVID-19.git

In [5]:
# Get data from kaggle 
import zipfile
# Download data
!kaggle datasets download -d imdevskp/corona-virus-report/
!kaggle datasets download -d sudalairajkumar/novel-corona-virus-2019-dataset/

# Unzip data (the context manager closes each archive once it is extracted)
for archive in ("corona-virus-report.zip", "novel-corona-virus-2019-dataset.zip"):
    with zipfile.ZipFile(archive, 'r') as zip_ref:
        zip_ref.extractall()

Downloading corona-virus-report.zip to /content
 43% 3.00M/6.90M [00:00<00:00, 23.9MB/s]
100% 6.90M/6.90M [00:00<00:00, 33.8MB/s]
Downloading novel-corona-virus-2019-dataset.zip to /content
  0% 0.00/713k [00:00<?, ?B/s]
100% 713k/713k [00:00<00:00, 47.1MB/s]
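The download-and-extract step can also be wrapped in a small helper. The following is only a sketch: `extract_all` and the demo file names are hypothetical and not part of the original notebook, and the demo works on a throwaway archive so that it runs without the Kaggle downloads.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_all(archives, dest="."):
    """Extract every archive in `archives` into `dest`; return the member names."""
    extracted = []
    for archive in archives:
        # The context manager closes each archive even if extraction fails.
        with zipfile.ZipFile(archive, "r") as zf:
            zf.extractall(dest)
            extracted.extend(zf.namelist())
    return extracted


# Self-contained demo on a temporary archive (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("example.csv", "Date,Confirmed\n4/7/20,2\n")
    names = extract_all([demo_zip], dest=tmp)
    demo_exists = (Path(tmp) / "example.csv").exists()

print(names, demo_exists)
```

In the notebook itself, the call would presumably be `extract_all(["corona-virus-report.zip", "novel-corona-virus-2019-dataset.zip"])`.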


In [6]:
# Get data from github 

# Download data
!git clone https://github.com/CSSEGISandData/COVID-19.git

Cloning into 'COVID-19'...
remote: Enumerating objects: 18918, done.
remote: Total 18918 (delta 0), reused 0 (delta 0), pack-reused 18918
Receiving objects: 100% (18918/18918), 76.14 MiB | 32.19 MiB/s, done.
Resolving deltas: 100% (9737/9737), done.


In [1]:
!ls

corona-virus-report.zip		     sample_data
COVID-19			     time_series_covid_19_confirmed.csv
covid_19_clean_complete.csv	     time_series_covid_19_confirmed_US.csv
covid_19_data.csv		     time_series_covid_19_deaths.csv
COVID19_line_list_data.csv	     time_series_covid_19_deaths_US.csv
COVID19_open_line_list.csv	     time_series_covid_19_recovered.csv
kaggle.json			     usa_county_wise.csv
novel-corona-virus-2019-dataset.zip


In [0]:
#IMPORT required libraries
import numpy as np
import pandas as pd
import plotly.express as px

In [0]:
# load the data
data = pd.read_csv("covid_19_clean_complete.csv")

In [5]:
print(data.columns)
data.tail(5)

Index(['Province/State', 'Country/Region', 'Lat', 'Long', 'Date', 'Confirmed',
       'Deaths', 'Recovered'],
      dtype='object')


Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered
20092,Falkland Islands (Malvinas),United Kingdom,-51.7963,-59.5236,4/7/20,2,0,0
20093,Saint Pierre and Miquelon,France,46.8852,-56.3159,4/7/20,1,0,0
20094,,South Sudan,6.877,31.307,4/7/20,2,0,0
20095,,Western Sahara,24.2155,-12.8858,4/7/20,4,0,0
20096,,Sao Tome and Principe,0.18636,6.613081,4/7/20,4,0,0


In [6]:
data[data["Country/Region"]=="India"]["Province/State"]

131      NaN
392      NaN
653      NaN
914      NaN
1175     NaN
        ... 
18923    NaN
19184    NaN
19445    NaN
19706    NaN
19967    NaN
Name: Province/State, Length: 77, dtype: object

#### Even India has no state-level breakdown, so we should fill in those missing values

In [0]:
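# Replace every NaN in Province/State with the row's Country/Region.
# Assigning the result back avoids calling inplace=True on a selected column,
# which newer pandas versions warn about and may not apply to the DataFrame.
data["Province/State"] = data["Province/State"].fillna(data["Country/Region"])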
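To make the fill step concrete, here is a minimal sketch on a toy frame (the rows are illustrative, not taken from the dataset). Passing a Series to `fillna` aligns on the index, so each missing province receives the country from the same row.

```python
import numpy as np
import pandas as pd

# Toy data: one row with a province, two rows without.
toy = pd.DataFrame({
    "Province/State": ["Hubei", np.nan, np.nan],
    "Country/Region": ["China", "India", "Zambia"],
})

# fillna with a Series fills each NaN from the matching index position.
toy["Province/State"] = toy["Province/State"].fillna(toy["Country/Region"])

filled = toy["Province/State"].tolist()
remaining_nans = int(toy["Province/State"].isna().sum())
print(filled, remaining_nans)
```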

In [8]:
data.tail(5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered
20092,Falkland Islands (Malvinas),United Kingdom,-51.7963,-59.5236,4/7/20,2,0,0
20093,Saint Pierre and Miquelon,France,46.8852,-56.3159,4/7/20,1,0,0
20094,South Sudan,South Sudan,6.877,31.307,4/7/20,2,0,0
20095,Western Sahara,Western Sahara,24.2155,-12.8858,4/7/20,4,0,0
20096,Sao Tome and Principe,Sao Tome and Principe,0.18636,6.613081,4/7/20,4,0,0


In [9]:
data[data["Country/Region"]=="India"]["Province/State"]

131      India
392      India
653      India
914      India
1175     India
         ...  
18923    India
19184    India
19445    India
19706    India
19967    India
Name: Province/State, Length: 77, dtype: object

###### All NaN values in Province/State are now filled, so we are ready to perform the analysis

In [10]:
data["Date"].tail(5)

20092    4/7/20
20093    4/7/20
20094    4/7/20
20095    4/7/20
20096    4/7/20
Name: Date, dtype: object

In [11]:
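# Build the date string in the dataset's own format (M/D/YY, no zero padding).
# Use the latest date present in the data instead of a fixed offset from the
# run date, so the cell still selects rows when the notebook is rerun later.
from datetime import datetime as dt,date,timedelta
latest = pd.to_datetime(data["Date"], format="%m/%d/%y").max()
today = latest.strftime("%-m/%-d/%y")
print(today)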

4/7/20
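The `%-m` and `%-d` directives are a platform-specific `strftime` extension: they work on Linux (and therefore in Colab) but are not guaranteed elsewhere, for example on Windows. A portable sketch, where `to_dataset_date` is a hypothetical helper name, builds the same string from the date's attributes:

```python
from datetime import date


def to_dataset_date(d):
    """Format a date as M/D/YY without zero padding, matching the Date column."""
    # d.month and d.day are plain ints, so no padding is added;
    # %y is a standard directive for the two-digit year.
    return f"{d.month}/{d.day}/{d:%y}"


formatted = to_dataset_date(date(2020, 4, 7))
print(formatted)
```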


In [0]:

latest_data = data.loc[data["Date"]==today][['Province/State',"Country/Region",'Lat','Long',"Confirmed","Deaths","Recovered"]]

### Let's find total active cases


In [0]:

# Active cases per row: confirmed minus deaths minus recovered
latest_data["Active"] = latest_data["Confirmed"] - latest_data["Deaths"] - latest_data["Recovered"]

In [14]:
latest_data.tail()

Unnamed: 0,Province/State,Country/Region,Lat,Long,Confirmed,Deaths,Recovered,Active
20092,Falkland Islands (Malvinas),United Kingdom,-51.7963,-59.5236,2,0,0,2
20093,Saint Pierre and Miquelon,France,46.8852,-56.3159,1,0,0,1
20094,South Sudan,South Sudan,6.877,31.307,2,0,0,2
20095,Western Sahara,Western Sahara,24.2155,-12.8858,4,0,0,4
20096,Sao Tome and Principe,Sao Tome and Principe,0.18636,6.613081,4,0,0,4


###### Aggregating the results specific to each country

In [0]:
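# Sum the numeric columns per country. The grouping key is returned as a column
# because as_index=False, so it must not be repeated in the column selection.
latest_data_aggregated = latest_data.groupby("Country/Region",as_index=False)[["Confirmed","Deaths","Recovered","Active"]].sum()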

In [16]:
latest_data_aggregated.tail()

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active
179,Vietnam,249,0,123,126
180,West Bank and Gaza,261,1,42,218
181,Western Sahara,4,0,0,4
182,Zambia,39,1,7,31
183,Zimbabwe,11,2,0,9
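The two steps above (deriving `Active`, then summing provinces into countries) can be checked on a toy frame. The numbers below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical province-level rows: two provinces for one country, one row for another.
toy = pd.DataFrame({
    "Province/State": ["Hubei", "Beijing", "India"],
    "Country/Region": ["China", "China", "India"],
    "Confirmed": [100, 20, 50],
    "Deaths": [5, 1, 2],
    "Recovered": [60, 10, 8],
})

# Active = Confirmed - Deaths - Recovered, computed row by row.
toy["Active"] = toy["Confirmed"] - toy["Deaths"] - toy["Recovered"]

# Collapse provinces into one row per country.
per_country = toy.groupby("Country/Region", as_index=False)[
    ["Confirmed", "Deaths", "Recovered", "Active"]
].sum()

china_active = int(
    per_country.loc[per_country["Country/Region"] == "China", "Active"].iloc[0]
)
print(per_country)
```

Since `Active` is a linear combination of the other columns, summing it per country gives the same result as recomputing it from the summed columns.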


Let's see how each country is affected by COVID-19 **Confirmed** cases, using a color scale from **green to red** on a choropleth map.

- green means the fewest confirmed cases
- yellow means a mid-range number of confirmed cases
- red means the most confirmed cases

We can see that the **US, Italy, China, Spain, Germany, and France** are among the most affected countries.

In [17]:
config = {'scrollZoom': False}  # disable scroll-wheel zoom on the maps
worldwide_confirmed_cases_fig = px.choropleth(latest_data_aggregated, locations="Country/Region", 
                    locationmode='country names', color="Confirmed", 
                    hover_name="Country/Region", range_color=[1,8000], 
                    # color_continuous_scale="peach", 
                    # color_continuous_scale="Inferno", 
                    # color_continuous_scale=px.colors.sequential.Cividis_r,
                    color_continuous_scale=["green", "yellow", "red"], 
                    title='Countries with Confirmed Cases - '+str(today))

worldwide_confirmed_cases_fig.show(config=config)
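Note that `range_color=[1,8000]` caps the color scale, so every country above 8,000 confirmed cases is drawn in the same red. A rough sketch for checking how many countries hit that cap follows; the country names and counts are hypothetical placeholders, and in the notebook the same filter would presumably be applied to `latest_data_aggregated`.

```python
import pandas as pd

# Hypothetical country totals, for illustration only.
toy = pd.DataFrame({
    "Country/Region": ["Country A", "Country B", "Country C", "Country D"],
    "Confirmed": [150000, 9000, 249, 39],
})

cap = 8000  # upper bound passed to range_color

# Countries whose value exceeds the cap all receive the top color.
saturated = toy.loc[toy["Confirmed"] > cap, "Country/Region"].tolist()
share_saturated = len(saturated) / len(toy)
print(saturated, share_saturated)
```

If a large share of countries saturates, a higher cap or a log-transformed color column may separate them better; that is a design choice rather than something the original notebook does.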

Let's see how each country is affected by COVID-19 **Active** cases, using a color scale from **green to red** on a choropleth map.

- green means the fewest active cases
- yellow means a mid-range number of active cases
- red means the most active cases



In [32]:
worldwide_active_cases_fig = px.choropleth(latest_data_aggregated, locations="Country/Region", 
                    locationmode='country names', color="Active", 
                    hover_name="Country/Region", range_color=[1,8000], 
                    color_continuous_scale=["green", "yellow", "red"], 
                    title='Countries with Active Cases - '+str(today))

worldwide_active_cases_fig.show(config=config)

Let's see how each country is affected by COVID-19 **Deaths**, using a color scale from **green to red** on a choropleth map.

- green means the fewest deaths
- yellow means a mid-range number of deaths
- red means the most deaths

We can see that **Italy, Spain, China, Iran, and France** have high death counts.

The countries with the lowest death counts include **Guinea, Haiti, Rwanda, Qatar, and Liberia**.

In [25]:
worldwide_death_cases_fig = px.choropleth(latest_data_aggregated, locations="Country/Region", 
                    locationmode='country names', color="Deaths", 
                    hover_name="Country/Region", range_color=[1,2000], 
                    color_continuous_scale=["green", "yellow", "red"], 
                    title='Countries with Death Cases - '+str(today))

worldwide_death_cases_fig.show(config=config)

Let's see how each country is affected by COVID-19 **Recovered** cases, using a color scale from **yellow to green** on a choropleth map.

- yellow means the fewest recovered cases
- green means the most recovered cases


In [31]:
worldwide_recovered_cases_fig = px.choropleth(latest_data_aggregated, locations="Country/Region", 
                    locationmode='country names', color="Recovered", 
                    hover_name="Country/Region", range_color=[1,8000], 
                    color_continuous_scale=["yellow", "green"], 
                    title='Countries with Recovered Cases - '+str(today))

worldwide_recovered_cases_fig.show(config=config)
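The four map cells differ only in the plotted column, the color cap, the color scale, and the title. One possible way to remove the repetition is a small helper that builds the shared keyword arguments; `choropleth_kwargs` is a hypothetical name, not something from the original notebook, and the sketch deliberately avoids importing plotly so that it runs on its own.

```python
def choropleth_kwargs(metric, cap, scale, date_label, label=None):
    """Build the keyword arguments shared by the country-level maps.

    metric: column to color by, e.g. "Confirmed"
    cap: upper bound of the color range
    scale: list of colors for the continuous scale
    label: word used in the title (defaults to the metric name)
    """
    label = label or metric
    return dict(
        locations="Country/Region",
        locationmode="country names",
        color=metric,
        hover_name="Country/Region",
        range_color=[1, cap],
        color_continuous_scale=scale,
        title=f"Countries with {label} Cases - {date_label}",
    )


kwargs = choropleth_kwargs(
    "Deaths", 2000, ["green", "yellow", "red"], "4/7/20", label="Death"
)
print(kwargs["title"])
```

In the notebook, a map could then be drawn with `px.choropleth(latest_data_aggregated, **choropleth_kwargs("Confirmed", 8000, ["green", "yellow", "red"], today))`.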

### Let's see how this spread over time worldwide

#####  1. This is how the confirmed cases have grown over time