<a href="https://colab.research.google.com/github/Cryptonex7/covid19-analysis/blob/collab-files/SarthakTesting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

link to gov data : https://www.kaggle.com/sudalairajkumar/covid19-in-india

# Importing  Libraries

In [1]:
# Imports
import numpy as np
import pandas as pd

import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_dark"
from plotly.subplots import make_subplots

# Supress Warnings
import warnings
warnings.simplefilter('ignore')

In [2]:
# Function for fetching url
import requests

def get(url):
  try:
    response = requests.get(url)
    print(f"Request returned {response.status_code} : '{response.reason}'")
    return response.json()
  except requests.HTTPError:
    print(response.status_code, response.reason)
    raise

## Fetching Data from Covid19 API

In [3]:
# enum containing all the API link templates
import enum
class Data(enum.Enum):
  raw_data = 'https://api.covid19india.org/csv/latest/raw_data{}.csv'

In [4]:
def getDataFromCsv(link_template, number):
  try:
    df = pd.read_csv(link_template.format(number))
    if not df.empty:
      return {'data': df, 'status': 'ok'}
  except:
    return {'data': None, 'status': 'error'}

In [5]:
# returns an array of data
def getData(data):
  data_array = []
  i = 1
  while True:
    result = getDataFromCsv(link_template=data.value, number=i)
    if result['status'] != 'error':
      data_array.append(result['data'])
      i += 1
    else:
      break
  return data_array

In [6]:
#  Fetching raw data
raw_data = getData(Data.raw_data)

# Data Extraction

### Web Scraping

#### Statewise data

In [7]:
# Import gdrive
from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


### Normalize the fetched JSON

In [8]:
#add
from datetime import datetime

file_loc = datetime.today().strftime('%Y-%m-%d')

# Import gdrive
from google.colab import drive
drive.mount('/drive')


Mounted at /drive


#### Government Data

In [9]:
gov_data = pd.read_csv('/drive/My Drive/covid_19_india.csv')

# Date Wrangling

In [10]:
# for govt data
gov_data['Date'] = pd.to_datetime(gov_data['Date'],dayfirst=True)

In [11]:
gov_data = gov_data.rename(columns={"Confirmed": "Total Confirmed cases", 'Cured': "Cured/Discharged/Migrated", "State/UnionTerritory" : "Name of State / UT", "Deaths":"Death"})

In [12]:
confirmed_death_recovered = gov_data.groupby('Date')['Date', 'Total Confirmed cases','Cured/Discharged/Migrated', 'Death'].sum().reset_index()

In [13]:
c = 0
r = 0
d = 0
confirmed_death_recovered['day_conf'] = confirmed_death_recovered['Total Confirmed cases']
confirmed_death_recovered['day_rec'] = confirmed_death_recovered['Cured/Discharged/Migrated']
confirmed_death_recovered['day_death'] = confirmed_death_recovered['Death']
for row in confirmed_death_recovered.index:
  confirmed_death_recovered['day_conf'][row] -= c
  confirmed_death_recovered['day_rec'][row] -= r
  confirmed_death_recovered['day_death'][row] -= d
  c = confirmed_death_recovered['Total Confirmed cases'][row]
  r = confirmed_death_recovered['Cured/Discharged/Migrated'][row]
  d = confirmed_death_recovered['Death'][row]


In [14]:
confirmed_death_recovered['tot_active'] = confirmed_death_recovered['Total Confirmed cases'] - confirmed_death_recovered['Death'] - confirmed_death_recovered['Cured/Discharged/Migrated']

# Understanding the Data

In [15]:
print("External Data")
print(f"First recorded Case: {gov_data['Date'].min()}")
print(f"Last recorded Case: {gov_data['Date'].max()}")
print(f"Total Days recorded: {gov_data['Date'].max() - gov_data['Date'].min()}")

External Data
First recorded Case: 2020-01-30 00:00:00
Last recorded Case: 2020-07-03 00:00:00
Total Days recorded: 155 days 00:00:00


# Country Analysis

## 1. Confirmed Cases Over Time


### 1.1 Confirmed Cases ( Cumulative )

In [16]:
fig = px.line(confirmed_death_recovered, x="Date", y="Total Confirmed cases", title="Day Wise Overall Confirmed Cases in India", width=900, height=650)

fig.show()

print("\n")

fig = px.line(confirmed_death_recovered, x="Date", y="Total Confirmed cases", title="Day Wise Confirmed Cases in India(Logarithmic Scale)", log_y=True, width=900, height=650)

fig.show()





### 1.2 Confirmed Cases ( Day Wise )

In [17]:
fig = px.line(confirmed_death_recovered, x="Date", y="day_conf", title="Day Wise Encountered Cases in India", width=900, height=650)

fig.show()

print("\n")

fig = px.line(confirmed_death_recovered, x="Date", y="day_conf", title="Day Wise Encountred Cases in India(Logarithmic Scale)", log_y=True, width=900, height=650)

fig.show()





## Observation from the above graph:

## 2. Total Confirmed Cases in Various States till Date

In [18]:
#cases state wise

#Updated to 36

fig = px.bar(gov_data[-gov_data[gov_data['Date'] == gov_data['Date'].max()].shape[0]:].sort_values('Total Confirmed cases', ascending=False)[:36][::-1], 
             x='Total Confirmed cases', y='Name of State / UT',
             title='Confirmed Cases in Various States in India', text='Total Confirmed cases', height=800,width = 1200, orientation='h')
fig.show()

# Recovered Cases




##Cumulative Recovered Cases

In [19]:
fig = px.line(confirmed_death_recovered, x="Date", y="Cured/Discharged/Migrated", title="Recovered Cases in India(Cumulative)", width=900, height=650)

fig.show()

print("\n")

fig = px.line(confirmed_death_recovered, x="Date", y="Cured/Discharged/Migrated", title="Recovered Cases in India(Cumulative)(Logarithmic Scale)", log_y=True, width=900, height=650)

fig.show()





## Day Wise Recovered Cases

In [20]:
fig = px.line(confirmed_death_recovered, x="Date", y="day_rec", title="Recovered Cases in India(Day Wise)",color_discrete_sequence=['#F42272'])

fig.show()

print("\n")

fig = px.line(confirmed_death_recovered, x="Date", y="day_rec", title="Recovered Cases in India(Day Wise)(Logarithmic Scale)", log_y=True, width=900, height=650,color_discrete_sequence=['#F42272'])

fig.show()





# Deceased Cases

## Cumulative Deaths

In [21]:
fig = px.line(confirmed_death_recovered, x="Date", y="Death", title="Deaths in India(Cumulative)")

fig.show()

print("\n")

fig = px.line(confirmed_death_recovered, x="Date", y="Death", title="Deaths in India(Cumulative)(Logarithmic Scale)", log_y=True)

fig.show()





## Deaths Day Wise

In [22]:
fig = px.line(confirmed_death_recovered, x="Date", y="day_death", title="Deaths Day Wise in India",
             color_discrete_sequence=['#F42272'])
fig.show()

## Recoveries and Deaths District Wise and State Wise

In [23]:
fig = px.bar(gov_data[-gov_data[gov_data['Date'] == gov_data['Date'].max()].shape[0]:].sort_values('Cured/Discharged/Migrated', ascending=False)[:33][::-1], 
             x='Cured/Discharged/Migrated', y='Name of State / UT',
             title='Recovered Cases in Various States in India', text='Cured/Discharged/Migrated', height=800,width = 1400, orientation='h', color_discrete_sequence=['green'])
fig.show()

In [24]:
fig = px.bar(gov_data[-gov_data[gov_data['Date'] == gov_data['Date'].max()].shape[0]:].sort_values('Death', ascending=False)[:33][::-1], 
             x='Death', y='Name of State / UT',
             title='Deceased Cases in Various States in India', text='Death', height=800,width = 1400, orientation='h', color_discrete_sequence=['red'])
fig.show()

# Comparisons

In [25]:
temp = confirmed_death_recovered.melt(id_vars="Date", value_vars=['Death', 'Cured/Discharged/Migrated' , 'tot_active'],
                 var_name='case', value_name='count')

fig = px.line(temp, x="Date", y="count", color='case',
             title='Cases over time: Line Plot', color_discrete_sequence = ['red', 'cyan', 'orange'], width=1150)
fig.show()


# fig = px.area(temp, x="Date", y="count", color='case',
#              title='Cases over time: Area Plot', color_discrete_sequence = ['red', 'cyan', 'orange'])
# fig.show()

# Mortality Rate

In [26]:
statewise_data = gov_data[-gov_data[gov_data['Date'] == gov_data['Date'].max()].shape[0]:]

In [27]:
statewise_data['mortalityRate'] = round((statewise_data['Death']/statewise_data['Total Confirmed cases'])*100, 2)

temp = statewise_data[statewise_data['Total Confirmed cases']>10]
temp = temp.sort_values('mortalityRate', ascending=False)

fig = px.bar(temp.sort_values(by="mortalityRate", ascending=False)[:22][::-1],
             x = 'mortalityRate', y = 'Name of State / UT', 
             title='Deaths per 10 Confirmed Cases', text='mortalityRate', height=500, orientation='h',
             color_discrete_sequence=['darkred']
            )
fig.show()

# Recovery Rate

In [28]:
statewise_data['recoveryRate'] = round((statewise_data['Cured/Discharged/Migrated']/statewise_data['Total Confirmed cases'])*100, 2)
temp = statewise_data[statewise_data['Total Confirmed cases']>10]
temp = temp.sort_values('recoveryRate', ascending=False)

fig = px.bar(temp.sort_values(by="recoveryRate", ascending=False)[:25][::-1],
             x = 'recoveryRate', y = 'Name of State / UT', 
             title='Recovery per 10 Confirmed Cases', text='recoveryRate', height=600, orientation='h',
             color_discrete_sequence=['darkgreen']
            )
fig.show()

# Travel History Analysis

In [29]:
raw_data[0].rename(columns={'Num cases':'Num Cases'}, inplace=True)
raw_data[1].rename(columns={'Num cases':'Num Cases'}, inplace=True)

In [30]:
 clean_notes = {'notes' : raw_data[0]['Notes'], 'confirmed' : raw_data[0]['Num Cases']}
 for i in range (1, len(raw_data)):
  clean_notes['notes'] = clean_notes['notes'].append(raw_data[i]['Notes'],ignore_index=True)
  clean_notes['confirmed'] = clean_notes['confirmed'].append(raw_data[i]['Num Cases'],ignore_index=True)

In [31]:
clean_notes = pd.DataFrame(clean_notes)

In [32]:
# Filter all the Travel data and pick all instances > 5:
notes_cleaned = clean_notes[clean_notes["notes"].str.contains("Travelled", na=False)]
v = notes_cleaned[['notes']]
notes_cleaned = notes_cleaned[v.replace(v.stack().value_counts()).gt(5).all(1)]
notes_cleaned['notes'].unique()

array(['Travelled from Italy', 'Travelled from Dubai',
       'Travelled from Middle East', 'Travelled from UK',
       'Travelled from UAE', 'Travelled from Saudi Arabia',
       'Travelled from London', 'Travelled from Dubai, UAE',
       'Travelled from Delhi', 'Travelled from Kolkata',
       'Travelled to Delhi', 'Travelled from Rajasthan',
       'Travelled from Mumbai',
       'Travelled from Iran, Resident of Ladakh( S.N Medical College ) - Evacuee',
       'Travelled from Delhi and Contact history with TN-P5 and TN-P6',
       'Travelled from Iran, Resident of Ladakh( AIIMS ) - Evacuee',
       'Travelled from Abu Dhabi', 'Travelled from West Bengal',
       'Travelled from Maharashtra', 'Travelled from Mumbai, Maharashtra',
       'Travelled from Ahmedabad', 'Travelled from Ahmedabad, Gujarat',
       'Travelled from Ajmer, Rajasthan', 'Travelled from Chennai',
       'Travelled from Kuwait', 'Travelled from Mumbai, Maharastra',
       'Travelled from Tamil Nadu', 'Travelled 

Here we observe there is redundancy in the form of duplicacy, we thus convert 'Travelled from Dubai, UAE' & 'Travelled from Dubai' and analyze the over all spread of the disease due to travel(few more locations are merged)

In [33]:
# Removing the different labels
notes_cleaned["notes"] = notes_cleaned["notes"].str.replace('Travelled from Dubai, UAE', 'Travelled from Dubai')
notes_cleaned["notes"] = notes_cleaned["notes"].str.replace('Travelled from London', 'Travelled from UK')
notes_cleaned["notes"] = notes_cleaned["notes"].str.replace('Travelled to Delhi', 'Travel within India')
notes_cleaned["notes"] = notes_cleaned["notes"].str.replace('Travelled to Delhi', 'Travel within India')
notes_cleaned["notes"] = notes_cleaned["notes"].str.replace('Travelled from Delhi and Contact history with TN-P5 and TN-P6', 'Travelled from Delhi')
#dont know why but everything is working except the below two lines
notes_cleaned["notes"] = notes_cleaned["notes"].str.replace('Travelled from Iran, Resident of Ladakh( S.N Medical College ) - Evacuee', 'Travelled from Iran')
notes_cleaned["notes"] = notes_cleaned["notes"].str.replace('Travelled from Iran, Resident of Ladakh( AIIMS ) - Evacuee', 'Travelled from Iran')

# Rename column name to Available Information
notes_cleaned = notes_cleaned.rename(columns={'notes':'Available Information'})

# Pie Chart to show the travel related spread of Coronavirus
fig = go.Figure(data=[go.Pie(labels=notes_cleaned['Available Information'], values = notes_cleaned['confirmed'],pull=0.05)])
fig.show()
notes_cleaned['Available Information'].unique()

array(['Travelled from Italy', 'Travelled from Dubai',
       'Travelled from Middle East', 'Travelled from UK',
       'Travelled from UAE', 'Travelled from Saudi Arabia',
       'Travelled from Delhi', 'Travelled from Kolkata',
       'Travel within India', 'Travelled from Rajasthan',
       'Travelled from Mumbai',
       'Travelled from Iran, Resident of Ladakh( S.N Medical College ) - Evacuee',
       'Travelled from Iran, Resident of Ladakh( AIIMS ) - Evacuee',
       'Travelled from Abu Dhabi', 'Travelled from West Bengal',
       'Travelled from Maharashtra', 'Travelled from Mumbai, Maharashtra',
       'Travelled from Ahmedabad', 'Travelled from Ahmedabad, Gujarat',
       'Travelled from Ajmer, Rajasthan', 'Travelled from Chennai',
       'Travelled from Kuwait', 'Travelled from Mumbai, Maharastra',
       'Travelled from Tamil Nadu', 'Travelled from Telangana',
       'Travelled from Gujarat', 'Travelled from Surat',
       'Travelled from Mumbai in private vehicle',
       

In [34]:
pie_data = {}
pie_data['travel'] = notes_cleaned['Available Information'].unique()
pie_data = pd.DataFrame.from_dict(pie_data)
pie_data['per'] = 0
pie_data

Unnamed: 0,travel,per
0,Travelled from Italy,0
1,Travelled from Dubai,0
2,Travelled from Middle East,0
3,Travelled from UK,0
4,Travelled from UAE,0
5,Travelled from Saudi Arabia,0
6,Travelled from Delhi,0
7,Travelled from Kolkata,0
8,Travel within India,0
9,Travelled from Rajasthan,0


In [35]:
Travelled_from_Italy = 0
Travelled_from_Dubai = 0
Travelled_from_MiddleEast = 0
Travelled_from_UK = 0
Travelled_from_SaudiArabia = 0
Travelled_from_Delhi = 0
Travelled_from_IranResidentofLadakhSNMedicalCollegeEvacuee = 0
Travelled_from_IranResidentofLadakhAIIMSEvacuee = 0 

In [36]:
for row in notes_cleaned.index:
  if(notes_cleaned['Available Information'][row] == "Travelled from Italy"):
    Travelled_from_Italy += 1
  elif(notes_cleaned['Available Information'][row] == "Travelled from Dubai"):
    Travelled_from_Dubai += 1
  elif(notes_cleaned['Available Information'][row] == "Travelled from Middle East"):
    Travelled_from_MiddleEast += 1
  elif(notes_cleaned['Available Information'][row] == "Travelled from UK"):
    Travelled_from_UK += 1    
  elif(notes_cleaned['Available Information'][row] == "Travelled from Saudi Arabia"):
    Travelled_from_SaudiArabia += 1 
  elif(notes_cleaned['Available Information'][row] == "Travelled from Delhi"):
    Travelled_from_Delhi += 1 
  elif(notes_cleaned['Available Information'][row] == "Travelled from Iran, Resident of Ladakh( S.N Medical College ) - Evacuee"):
    Travelled_from_IranResidentofLadakhSNMedicalCollegeEvacuee += 1 
  elif(notes_cleaned['Available Information'][row] == "Travelled from Iran, Resident of Ladakh( AIIMS ) - Evacuee"):
    Travelled_from_IranResidentofLadakhAIIMSEvacuee += 1                    

In [37]:
# pie_data['per'][pie_data['travel'] == ('Travelled from Italy') ] = (Travelled_from_Italy/total)*100
# pie_data['per'][pie_data['travel'] == ('Travelled from Dubai') ] = (Travelled_from_Dubai/total)*100
# pie_data['per'][pie_data['travel'] == ('Travelled from Middle East') ] = (Travelled_from_MiddleEast/total)*100
# pie_data['per'][pie_data['travel'] == ('Travelled from UK') ] = (Travelled_from_UK/total)*100
# pie_data['per'][pie_data['travel'] == ('Travelled from Delhi') ] = (Travelled_from_Delhi/total)*100
# pie_data['per'][pie_data['travel'] == ('Travelled from Saudi Arabia') ] = (Travelled_from_SaudiArabia/total)*100
# pie_data['per'][pie_data['travel'] == ('Travelled from Iran, Resident of Ladakh( S.N Medical College ) - Evacuee') ] = (Travelled_from_IranResidentofLadakhSNMedicalCollegeEvacuee/total)*100
# pie_data['per'][pie_data['travel'] == ('Travelled from Iran, Resident of Ladakh( AIIMS ) - Evacuee') ] = (Travelled_from_IranResidentofLadakhAIIMSEvacuee/total)*100

# Before Lockdown v/s After lockdown

In [38]:
gov_data['Active'] = 1
for i in gov_data.index:
    gov_data['Active'][i] = (gov_data['Total Confirmed cases'][i] - (gov_data['Cured/Discharged/Migrated'][i] + gov_data['Death'][i] ))

In [39]:
bef_lockdown = confirmed_death_recovered[confirmed_death_recovered['Date'] < '2020-03-25' ]
fig = px.bar(bef_lockdown, x="Date", y="tot_active", title="Day Wise active Cases in India Before Lockdown", width=900, height=650)

fig.show()

print("\n")

after_lockdown = confirmed_death_recovered[confirmed_death_recovered['Date'] >= '2020-03-25' ]
fig = px.bar(after_lockdown, x="Date", y="tot_active", title="Day Wise active Cases in India After/During Lockdown", width=900, height=650)

fig.add_shape(
        # Line Horizontal
            type="line",
            x0="2020-04-14",
            y0=0,
            x1="2020-04-14",
            y1=int(max(confirmed_death_recovered["tot_active"])),
            line=dict(
                color="LightSeaGreen",
                width=4,
                dash="dashdot",
            ),
    )

fig.add_shape(
        # Line Horizontal
            type="line",
            x0="2020-05-03",
            y0=0,
            x1="2020-05-03",
            y1=int(max(confirmed_death_recovered["tot_active"])),
            line=dict(
                color="LightSeaGreen",
                width=4,
                dash="dashdot",
            ),
    )

fig.add_shape(
        # Line Horizontal
            type="line",
            x0="2020-05-17",
            y0=0,
            x1="2020-05-17",
            y1=int(max(confirmed_death_recovered["tot_active"])),
            line=dict(
                color="LightSeaGreen",
                width=4,
                dash="dashdot",
            ),
    )

fig.add_shape(
        # Line Horizontal
            type="line",
            x0="2020-05-31",
            y0=0,
            x1="2020-05-31",
            y1=int(max(confirmed_death_recovered["tot_active"])),
            line=dict(
                color="LightSeaGreen",
                width=4,
                dash="dashdot",
            ),
    )

fig.show()





# Age Analysis

In [81]:
 clean_age = {'Age' : raw_data[0]['Age Bracket']}
 for i in range (1, len(raw_data)):
  clean_age['Age'] = clean_age['Age'].append(raw_data[i]['Age Bracket'],ignore_index=True)

clean_age = pd.DataFrame(clean_age)
# clean_age = clean_age[clean_age['Age'] != '28-35']
# clean_age = clean_age[clean_age['Age'] != '8 Months']
# clean_age = clean_age[clean_age['Age'] != '6 Months']
# clean_age = clean_age[clean_age['Age'] != '5 Months']
# clean_age = clean_age[clean_age['Age'] != '5 months']

# clean_age['Age'] = clean_age['Age'].astype(float)
# clean_age['Age'].dropna(inplace=True)
# clean_age['Age'] = np.floor(clean_age['Age'])

In [82]:
num = []
for i in range(0,111):
  i = str(i)
  num.append(i)

In [83]:
clean = []
for age in clean_age['Age']:
  if age in num:
    clean.append(age)

In [84]:
cl = {'Age' : clean}
cl = pd.DataFrame(cl)
cl['Age'] = cl['Age'].astype(float)
cl['Age'].dropna(inplace=True)
cl['Age'] = np.floor(cl['Age'])

In [88]:
fig = px.histogram(cl, x="Age")
fig.show()

# Gender Analysis

In [41]:
 clean_Gender = {'Gender' : raw_data[0]['Gender']}
 for i in range (1, len(raw_data)):
  clean_Gender['Gender'] = clean_Gender['Gender'].append(raw_data[i]['Gender'],ignore_index=True)

clean_Gender = pd.DataFrame(clean_Gender)
clean_Gender.dropna(inplace = True)
clean_Gender = clean_Gender['Gender'].str.replace('M ','M')

In [42]:
fig = px.pie(clean_Gender, names='Gender')
fig.show()

 - This may be indicative of a greater number of males, highlighting the already present gender disparity in India, but at the same time may be attributed to smoking, drinking, general poor health practiced by a greater number of men in the Indian subcontinent as compared to women .

- Since coronavirus spreads from person to person, it is quite obvious that chances of a person getting in contact with coronavirus increases when he/she is at workplace or at some crowded place.
 - According to a website catalyst(https://www.catalyst.org/research/women-in-the-workforce-india/) the male workforce is 78.6% compared to 23.6% of female workforce in India. This can be another reason of more males getting affected with the virus.


# Gender Age Corelation

In [98]:
clean_g = {'Gender' : raw_data[0]['Gender'],'Age' : raw_data[0]['Age Bracket']}
for i in range (1, len(raw_data)):
  clean_g['Gender'] = clean_g['Gender'].append(raw_data[i]['Gender'],ignore_index=True)
  clean_g['Age'] = clean_g['Age'].append(raw_data[i]['Age Bracket'],ignore_index=True)

clean_g = pd.DataFrame(clean_g)
clean_g.dropna(inplace=True)


# clean = clean[clean['Age'] != '8 Months']
# clean = clean[clean['Age'] != '28-35']
# clean = clean[clean['Age'] != '6 Months']
# clean = clean[clean['Age'] != '5 Months']
# clean = clean[clean['Age'] != '5 months']
# clean['Age'] = clean['Age'].astype(float)
# clean['Age'] = np.floor(clean['Age'])
# clean['Gender'] = clean['Gender'].str.replace('M ','M')

In [97]:
clean_g.head()

Unnamed: 0,Gender,Age
0,F,20
3,M,45
4,M,24
5,M,69
20,F,70


In [101]:
clean = []
gen = []
for row in clean_g:
  print(clean_g['Age'][row])
  # if clean_g['Age'][row] in num:
  #   clean.append(clean_g['Age'][row])
    #gen.append(clean_g['Gender'][row])

KeyError: ignored

In [None]:
cl = {'Age' : clean}
cl = pd.DataFrame(cl)
cl['Age'] = cl['Age'].astype(float)
cl['Age'].dropna(inplace=True)
cl['Age'] = np.floor(cl['Age'])

In [None]:
#comparing age gender wise
fig = px.histogram(clean, x="Age", color='Gender')
fig.show()

# Testing Analysis

In [43]:
testing_data = get('https://api.covid19india.org/state_test_data.json')

from pandas.io.json import json_normalize
testing_data = json_normalize(testing_data)

testing_data = pd.DataFrame.from_dict(testing_data['states_tested_data'][0])

# Removing redundant entries
testing_data = testing_data[testing_data['updatedon'] != '']

# Converting date string to datatime objects
testing_data['updatedon'] = pd.to_datetime(testing_data['updatedon'], format='%d/%m/%Y')

# Removing wrong data entry
testing_data = testing_data[ testing_data['totaltested'] != '14/04/2020' ]

# Converting strings to floating point values after removing ','

testing_data.replace(to_replace="", value="0", inplace=True)   ######VERY CRUCIAL, WITHOUT THIS CODE BREAKS, IDK WHY THOUGH {TOOK 30 MIN TO FIND}
testing_data.replace(to_replace=" ", value="0", inplace=True)   ## NEW Type of issue addressed here! (More efficient code will be written in a later version so that such probelms do not arise, for now hard-coding till V1.0)

testing_data['totaltested'] = testing_data['totaltested'].str.replace(',', '').astype(float)
testing_data['positive'] = testing_data['positive'].str.replace(',', '').astype(float)
testing_data['negative'] = testing_data['negative'].str.replace(',', '').astype(float)
testing_data['unconfirmed'] = testing_data['unconfirmed'].str.replace(',', '').astype(float)

last_updated_on = testing_data['updatedon'].max()

# Seperating the data to be grouped

grouped_testing_data = testing_data

# Sorting the data according to the total number of tested cases

grouped_testing_data = grouped_testing_data.sort_values(['totaltested'],ascending=False)

# Grouped_testing_data

grouped_testing_data = grouped_testing_data.groupby('updatedon')['totaltested', 'positive', 'negative',
       'unconfirmed'].sum().reset_index()


Request returned 200 : 'OK'


In [44]:
# import time module to access time based functions
import time

# Create the date we want to exclude and convert it be a time object
d1 = "2020-02-04"
newdate1 = time.strptime(d1, "%Y-%m-%d")

# remove the date
grouped_testing_data = grouped_testing_data[grouped_testing_data['updatedon'] != d1 ]

In [45]:
# Bar graph which showcases the total number of tests conducted daywise in India

fig = px.bar(grouped_testing_data,
             x='updatedon', y='totaltested',
             title='Tests Conducted state-wise in India: ', text='totaltested', height=800, width=1000)

fig.show()

# Quick Analysis( Obesity )

In [None]:
# flg = data.groupby('Detected state')['confirmed', 'death', 'recovered', 'active'].sum().reset_index()

# flg['mortalityRate'] = round((flg['death']/flg['confirmed'])*100, 2)

# temp = flg[flg['confirmed']>10]
# temp = temp.sort_values('mortalityRate', ascending=False)

# # print(flg)

# fig = px.bar(temp.sort_values(by="mortalityRate", ascending=False)[:8][::-1],
#              x = 'mortalityRate', y = 'Detected state', 
#              title='Deaths per 10 Confirmed Cases', text='mortalityRate', height=500, orientation='h',
#              color_discrete_sequence=['darkred']
#             )
# fig.show()

### District Wise analysis updated (28/04/20)

In [None]:
# Fetching and Parsing the data
state_district_wise = get('https://api.covid19india.org/v2/state_district_wise.json')

df = pd.DataFrame(columns=['district', 'notes', 'active', 'confirmed', 'deceased', 'recovered', 'delta.confirmed', 'delta.deceased', 'delta.recovered'])
for row in state_district_wise:
  state_district_wise = row
  data = json_normalize(state_district_wise)
  state = json_normalize(data['districtData'][0])
  df = df.append(state)

Request returned 200 : 'OK'


In [None]:
df = df[["district","active","confirmed","deceased","recovered"]]
df = df[df.district != "Unknown"]

In [None]:
latest_grouped = df.groupby('district')['confirmed'].sum().reset_index()

fig = px.bar(latest_grouped.sort_values('confirmed', ascending=False)[:20][::-1], 
             x='confirmed', y='district',
             title='20 most affected Districts in India', text='confirmed', height=850, orientation='h', color_discrete_sequence = ['maroon'])
fig.show()

Mumbai is heavily affected
> Mumbai being the hub for tourist arrivals and international flights, as well as the most populous city in india and 5th most populous city worldwide has been the most affected city thus far.

>Thane & Pune being close to Mumbai are also affected heavily which can be attributed to inter city travel between the major neighbouring cities.



In [None]:
latest_grouped = df.groupby('district')['active'].sum().reset_index()

fig = px.bar(latest_grouped.sort_values('active', ascending=False)[:20][::-1], 
             x='active', y='district',
             title='Districts with most active cases in India(20)', text='active', height=850, orientation='h')
fig.show()

In [None]:
latest_grouped = df.groupby('district')['deceased'].sum().reset_index()

fig = px.bar(latest_grouped.sort_values('deceased', ascending=False)[:20][::-1], 
             x='deceased', y='district',
             title='Districts with most deceased cases in India(20)', text='deceased', height=850, orientation='h')
fig.show()

In [None]:
latest_grouped = df.groupby('district')['recovered'].sum().reset_index()

fig = px.bar(latest_grouped.sort_values('recovered', ascending=False)[:20][::-1], 
             x='recovered', y='district',
             title='Districts with most recovered cases in India(20)', text='recovered', height=850, orientation='h')
fig.show()

Here we form a hypothesis that there seems to be a correlation between the cities the virus is spreading in a greater number & said cities being costal areas.

With this hypothesis, let us explore this idea and see if we can reach any conclusions-

In [None]:
# # For analysis section
# # Finding number of patients in districts with humid climate
latest_grouped = df.groupby('district')['confirmed'].sum().reset_index()
most_affected_districts = latest_grouped

close_to_water = most_affected_districts[most_affected_districts["district"].isin(["Mumbai", "Kasaragod", "Pune", "Kochi", "Sangli",'Chennai', "Kolkata"])] #Ahemdabad is ~326km from the sea, vadodra ~263, "Sangli city is situated on the bank of Krishna river", 

#Segregate the remaining cities
far_from_water = most_affected_districts[~(most_affected_districts["district"].isin(["Mumbai", "Kasaragod", "Pune", "Kochi", "Sangli", 'Chennai', "Kolkata"]))]

#Calculate the total number of confirmed cases in the two cases
x = close_to_water['confirmed'].sum()
y = far_from_water['confirmed'].sum()

#Lists used to feed the pie chart
labels = ['Close to water','Far from water']
values = [x,y]

# plotting a Pie chart to see the distribution of confirmed cases in the two cases
fig = go.Figure(data=[go.Pie(labels=labels, values = values,pull=0.05)])
fig.show()

While the graph shows that just the districts near costal areas are having over 1/3rd the total known cases, thus putting weight behind our hypothesis, but recent research has shown this to be untrue ().

We can cite the reason for the graph showing results against the research in these cities, especially Mumbai, Chennai, and, Kolkata as them being trading hubs for the import/export businesses as well as popular tourist destinations, attracting thousands of tourists every year.