## Covid-19 Prediction

***Effective SARS-CoV-2 screening enables a quick and accurate diagnosis of COVID-19 and can lessen the burden on healthcare systems. Prediction models that combine several features to estimate the likelihood of infection have been built to help medical professionals all over the world treat patients, especially where healthcare resources are scarce. The current dataset was downloaded from the ‘ABC’ government website and contains 278,848 individuals who took the RT-PCR test. The dataset has 11 columns, including 8 features suspected to play an important role in predicting the COVID-19 outcome. The outcome variable is the COVID test result (positive or negative). The data cover 11 March 2020 to 30 April 2020.***


## Features:

A. Basic information: 
1. ID (Individual ID)

2. Sex (male/female). 

3. Age 60 years or above (true/false) 

4. Test date (date when tested for COVID)


B. Symptoms: 

5. Cough (true/false).

6. Fever (true/false). 

7. Sore throat (true/false). 

8. Shortness of breath (true/false). 

9. Headache (true/false). 


C. Other information: 

10. Known contact with an individual confirmed to have COVID-19 (true/false).


D. Covid report

11. Corona positive or negative

In [337]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

In [382]:
# Importing the dataset
covid_raw = pd.read_csv(r'E:\Data Analyst Journey\Odin SChool\Notes\EDA&ML\Projects\Covid Project\corona_tested_006.csv')
# Keep a second, untouched copy of the raw data
covid_raw1 = pd.read_csv(r'E:\Data Analyst Journey\Odin SChool\Notes\EDA&ML\Projects\Covid Project\corona_tested_006.csv')


In [383]:
## Checking the first 10 rows
covid_raw.head(10)

Unnamed: 0,Ind_ID,Test_date,Cough_symptoms,Fever,Sore_throat,Shortness_of_breath,Headache,Corona,Age_60_above,Sex,Known_contact
0,1,11-03-2020,True,False,True,False,False,negative,,,Abroad
1,2,11-03-2020,False,True,False,False,False,positive,,,Abroad
2,3,11-03-2020,False,True,False,False,False,positive,,,Abroad
3,4,11-03-2020,True,False,False,False,False,negative,,,Abroad
4,5,11-03-2020,True,False,False,False,False,negative,,,Contact with confirmed
5,6,11-03-2020,True,False,False,False,False,other,,,Contact with confirmed
6,7,11-03-2020,False,False,False,False,False,negative,,,Other
7,8,11-03-2020,False,True,False,False,False,negative,,,Abroad
8,9,11-03-2020,True,False,False,False,False,negative,,,Abroad
9,10,11-03-2020,True,False,True,False,False,negative,,,Abroad


In [340]:
covid_raw1 = covid_raw.copy()

In [341]:
covid_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278848 entries, 0 to 278847
Data columns (total 11 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   Ind_ID               278848 non-null  int64 
 1   Test_date            278848 non-null  object
 2   Cough_symptoms       278596 non-null  object
 3   Fever                278596 non-null  object
 4   Sore_throat          278847 non-null  object
 5   Shortness_of_breath  278847 non-null  object
 6   Headache             278847 non-null  object
 7   Corona               278848 non-null  object
 8   Age_60_above         151528 non-null  object
 9   Sex                  259285 non-null  object
 10  Known_contact        278848 non-null  object
dtypes: int64(1), object(10)
memory usage: 23.4+ MB


Here we can see that every column has the object dtype except Ind_ID, which is int64. All the features are categorical in nature.
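One way to check that the non-numeric columns really are categorical is to list the distinct values each one takes. The following is a minimal sketch on a small made-up frame; the `toy` data and its values are assumptions for illustration, not rows from the real file.

```python
import pandas as pd

# Hypothetical miniature of the dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "Ind_ID": [1, 2, 3],
    "Corona": ["negative", "positive", "other"],
    "Sex": ["male", None, "female"],
})

# Every column that is not numeric is treated as a categorical candidate.
non_numeric = toy.select_dtypes(exclude="number").columns

# Count each distinct value, keeping missing entries visible with dropna=False.
summary = {col: toy[col].value_counts(dropna=False) for col in non_numeric}

for col, counts in summary.items():
    print(f"{col}: {len(counts)} distinct values (missing included)")
    print(counts)
```

Run against `covid_raw`, the same loop would also surface unexpected labels, such as the `other` result visible in the `Corona` column above.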

In [342]:
# Percentage of Data Missing
covid_raw.isnull().sum()*100/len(covid_raw)

Ind_ID                  0.000000
Test_date               0.000000
Cough_symptoms          0.090372
Fever                   0.090372
Sore_throat             0.000359
Shortness_of_breath     0.000359
Headache                0.000359
Corona                  0.000000
Age_60_above           45.659284
Sex                     7.015650
Known_contact           0.000000
dtype: float64

In [343]:
len(covid_raw)

278848

The total length of the dataset is 278848 rows.

We can see that Age_60_above has the most missing values (about 45.7%).

There are a few ways to deal with this:

1. Delete the rows (not advisable, because the null values make up almost 50% of the data).

2. Replace the nulls with the most frequent value (this might result in an imbalanced feature).

3. Apply a classification algorithm to predict the missing values.
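
To illustrate why option 2 can skew the feature, here is a small sketch of most-frequent-value imputation. The series below is made up (it only reuses the `Yes`/`No` labels of `Age_60_above`), so the proportions are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical column with the same labels as Age_60_above; values are made up.
age = pd.Series(["No", "No", "Yes", None, None, None], name="Age_60_above")

# Class shares among the known values only.
before = age.value_counts(normalize=True)

# Option 2: fill every null with the most frequent label.
most_frequent = age.mode().iloc[0]
filled = age.fillna(most_frequent)

# Class shares after imputation: the majority label grows at the minority's expense.
after = filled.value_counts(normalize=True)

print("before:", before.round(2).to_dict())
print("after: ", after.round(2).to_dict())
```

When nearly half of the column is missing, this kind of fill pushes almost all of the imputed rows into one class, which is the imbalance the list above warns about.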


In [344]:
# Dropping unnecessary column
covid_raw.drop(columns='Test_date',inplace = True)

In [345]:
covid_raw

Unnamed: 0,Ind_ID,Cough_symptoms,Fever,Sore_throat,Shortness_of_breath,Headache,Corona,Age_60_above,Sex,Known_contact
0,1,True,False,True,False,False,negative,,,Abroad
1,2,False,True,False,False,False,positive,,,Abroad
2,3,False,True,False,False,False,positive,,,Abroad
3,4,True,False,False,False,False,negative,,,Abroad
4,5,True,False,False,False,False,negative,,,Contact with confirmed
...,...,...,...,...,...,...,...,...,...,...
278843,278844,False,False,False,False,False,positive,,male,Other
278844,278845,False,False,False,False,False,negative,,female,Other
278845,278846,False,False,False,False,False,negative,,male,Other
278846,278847,False,False,False,False,False,negative,,male,Other


In [346]:
# Dropping rows with NaN values in Fever, Sex or Cough_symptoms
covid_raw.dropna(subset=['Fever', 'Sex', 'Cough_symptoms'], inplace=True)

In [347]:
covid_raw.isna().sum()

Ind_ID                      0
Cough_symptoms              0
Fever                       0
Sore_throat                 0
Shortness_of_breath         0
Headache                    0
Corona                      0
Age_60_above           120870
Sex                         0
Known_contact               0
dtype: int64

***As we can see, the "Age_60_above" feature still has 120,870 null values, close to half of the remaining rows. I therefore predict the missing values with a classification machine learning algorithm and replace the nulls with the predictions.***

## Encoding 

In [348]:
## All the features are categorical, so we first encode them as binary (0/1) values.
## Values that match neither label (for example NaN) are left unchanged.
covid_raw['Cough_symptoms'] = [1 if x==True else 0 if x==False else x for x in covid_raw['Cough_symptoms']]
covid_raw['Fever'] = [1 if x==True else 0 if x==False else x for x in covid_raw['Fever']]
covid_raw['Sore_throat'] = [1 if x==True else 0 if x==False else x for x in covid_raw['Sore_throat']]
covid_raw['Shortness_of_breath'] = [1 if x==True else 0 if x==False else x for x in covid_raw['Shortness_of_breath']]
covid_raw['Headache'] = [1 if x==True else 0 if x==False else x for x in covid_raw['Headache']]
## Any result other than 'positive' (i.e. 'negative' or 'other') is encoded as 0.
covid_raw['Corona'] = [1 if x=='positive' else 0 for x in covid_raw['Corona']]
covid_raw['Age_60_above'] = [1 if x=='Yes' else 0 if x=='No' else x for x in covid_raw['Age_60_above']]
covid_raw['Sex'] = [1 if x=='male' else 0 if x=='female' else x for x in covid_raw['Sex']]
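
The same binary encoding can be written more compactly with `Series.map` and one dictionary per column. This is only a sketch on made-up data: the `toy` frame and the `binary_maps` name are assumptions, and unlike the list comprehensions above, `map` turns any value missing from the dictionary into NaN instead of leaving it unchanged.

```python
import pandas as pd

# Hypothetical sample; the values are made up for illustration.
toy = pd.DataFrame({
    "Fever": [True, False, True],
    "Sex": ["male", "female", None],
    "Corona": ["positive", "negative", "other"],
})

# One mapping per column; 'other' is grouped with 'negative', as in the notebook.
binary_maps = {
    "Fever": {True: 1, False: 0},
    "Sex": {"male": 1, "female": 0},
    "Corona": {"positive": 1, "negative": 0, "other": 0},
}

encoded = toy.copy()
for col, mapping in binary_maps.items():
    # Values not present in the mapping (including None) become NaN.
    encoded[col] = toy[col].map(mapping)

print(encoded)
```

Keeping the mappings in one dictionary makes decisions such as "`other` counts as 0" visible in a single place.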


In [349]:
# 'Known_contact' has more than two categories, so it is one-hot encoded with pandas' get_dummies function.
covid_raw = pd.get_dummies(covid_raw,prefix=['Known_contact'],columns=['Known_contact']).astype(int,errors='ignore')
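
As a small illustration of what one-hot encoding does to `Known_contact`, the sketch below applies `pd.get_dummies` to three made-up rows. Passing `dtype=int` is a variation that asks for 0/1 integer columns directly; it is an alternative to the `astype` call above, not what the notebook ran.

```python
import pandas as pd

# Hypothetical rows covering the three categories seen in the data.
toy = pd.DataFrame({
    "Ind_ID": [1, 2, 3],
    "Known_contact": ["Abroad", "Contact with confirmed", "Other"],
})

# Each category becomes its own 0/1 column; the original column is dropped.
encoded = pd.get_dummies(
    toy, columns=["Known_contact"], prefix="Known_contact", dtype=int
)

dummy_cols = [c for c in encoded.columns if c.startswith("Known_contact_")]
print(encoded)
```

Exactly one of the three new columns is 1 in every row, so the category of each individual is still recoverable.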


## Applying a classifier to predict the missing Age_60_above values

In [364]:

from sklearn.linear_model import LogisticRegression
covid_raw.isna().sum()

Ind_ID                                       0
Cough_symptoms                               0
Fever                                        0
Sore_throat                                  0
Shortness_of_breath                          0
Headache                                     0
Corona                                       0
Age_60_above                            120870
Sex                                          0
Known_contact_Abroad                         0
Known_contact_Contact with confirmed         0
Known_contact_Other                          0
dtype: int64

In [365]:
#Rows having null values of 'Age_60_above' feature are stored in test_data
test_data = covid_raw[covid_raw['Age_60_above'].isnull()]


In [366]:
len(test_data)

120870

In [368]:
#Dropping the rows with null values to get non-null training data
covid_raw.dropna(inplace= True)
y_train=covid_raw['Age_60_above']

X_train = covid_raw.drop(['Age_60_above'],axis=1)
X_test= test_data.drop(['Age_60_above'],axis=1)


In [369]:
#Performing model training
model = LogisticRegression()
model.fit(X_train,y_train)
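
Before trusting the imputed values, it may be worth checking how well such a model predicts `Age_60_above` on rows where the answer is known. The sketch below is an assumption-laden illustration, not part of the original workflow: it uses random synthetic data, leaves out `Ind_ID` as a predictor, and compares hold-out accuracy with a majority-class baseline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the rows where Age_60_above is known (values are random).
rng = np.random.default_rng(0)
n = 500
known = pd.DataFrame({
    "Cough_symptoms": rng.integers(0, 2, n),
    "Fever": rng.integers(0, 2, n),
    "Sex": rng.integers(0, 2, n),
    "Age_60_above": rng.integers(0, 2, n),
})

X = known.drop(columns="Age_60_above")
y = known["Age_60_above"]

# Hold out 20% of the known rows to estimate imputation quality.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
val_accuracy = accuracy_score(y_val, clf.predict(X_val))

# Accuracy obtained by always predicting the more common class.
baseline = max(y_val.mean(), 1 - y_val.mean())

print(f"hold-out accuracy: {val_accuracy:.3f}, majority baseline: {baseline:.3f}")
```

If the hold-out accuracy is no better than the baseline, the classifier is effectively filling in the most frequent value, which is option 2 from the earlier list.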

In [372]:
#Predicting the values
X_test.dropna(inplace = True)
y_pred = model.predict(X_test)

In [374]:
# The number of predicted values equals the number of null values stored in test_data
len(y_pred)

120870

In [375]:
#Adding the predicted values to test_data
test_data['Age_60_above']=y_pred

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_data['Age_60_above']=y_pred
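
The warning above appears because `test_data` was created by filtering `covid_raw`, so pandas cannot tell whether the assignment is meant for the subset or for the original frame. A common way to make the intent explicit is to take a `.copy()` when creating the subset. The sketch below shows this on made-up data; `frame` and `missing` are hypothetical names.

```python
import pandas as pd

# Hypothetical frame with two missing Age_60_above values.
frame = pd.DataFrame({
    "Ind_ID": [1, 2, 3, 4],
    "Age_60_above": [1.0, None, 0.0, None],
})

# .copy() makes the filtered subset an independent DataFrame.
missing = frame[frame["Age_60_above"].isnull()].copy()

# Assigning into the copy is unambiguous and leaves `frame` untouched.
missing["Age_60_above"] = [0.0, 1.0]

print(missing)
print(frame["Age_60_above"].isna().sum())
```

The assignment in the notebook still takes effect despite the warning, as the `test_data` output that follows shows; the explicit copy only removes the ambiguity.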


In [376]:
len(covid_raw)

138172

In [377]:
test_data

Unnamed: 0,Ind_ID,Cough_symptoms,Fever,Sore_throat,Shortness_of_breath,Headache,Corona,Age_60_above,Sex,Known_contact_Abroad,Known_contact_Contact with confirmed,Known_contact_Other
156040,156041,0,0,0,0,0,0,0.0,1,0,0,1
156041,156042,0,0,0,0,0,0,0.0,1,0,0,1
156042,156043,0,0,0,0,0,0,0.0,1,0,0,1
156043,156044,0,0,0,0,0,0,0.0,0,0,0,1
156044,156045,0,0,0,0,0,0,0.0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...
278843,278844,0,0,0,0,0,1,0.0,1,0,0,1
278844,278845,0,0,0,0,0,0,0.0,0,0,0,1
278845,278846,0,0,0,0,0,0,0.0,1,0,0,1
278846,278847,0,0,0,0,0,0,0.0,1,0,0,1


***Adding the test_data back to covid_raw data***

In [378]:
new_covid=pd.concat([covid_raw,test_data]).drop_duplicates()
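
Since the two frames hold disjoint sets of rows, a few sanity checks after the concatenation can confirm that nothing was lost or duplicated. This is a minimal sketch on made-up frames (`known` and `imputed` are hypothetical stand-ins for `covid_raw` and `test_data`); sorting by index restores the original row order.

```python
import pandas as pd

# Hypothetical stand-ins: rows with known values and rows with imputed values.
known = pd.DataFrame({"Ind_ID": [1, 3], "Age_60_above": [1.0, 0.0]}, index=[0, 2])
imputed = pd.DataFrame({"Ind_ID": [2, 4], "Age_60_above": [0.0, 0.0]}, index=[1, 3])

# Stack the two parts and put the rows back in their original order.
combined = pd.concat([known, imputed]).sort_index()

# Sanity checks: no rows lost, no individual repeated, no nulls left.
rows_preserved = len(combined) == len(known) + len(imputed)
ids_unique = combined["Ind_ID"].is_unique
no_nulls = not combined.isna().any().any()

print(rows_preserved, ids_unique, no_nulls)
```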

In [384]:
len(new_covid)

259042

In [385]:
#checking for null values again
new_covid.isna().sum()

Ind_ID                                  0
Cough_symptoms                          0
Fever                                   0
Sore_throat                             0
Shortness_of_breath                     0
Headache                                0
Corona                                  0
Age_60_above                            0
Sex                                     0
Known_contact_Abroad                    0
Known_contact_Contact with confirmed    0
Known_contact_Other                     0
dtype: int64