# <Font color=cyan>**CORONA VIRUS SYMPTOMS DATASET**

**INTRODUCTION**

- 2019, a dark year for humanity, engulfed us with a deadly virus named Corona Virus, which was soon declared a pandemic due to its rampant spread throughout the world, virtually erasing borders.
- To break the chain of transmission, people had to undergo tests and be observed for symptoms related to the infection.
- The Corona dataset provided here records these symptoms together with the Corona test results.
- We plan to work on the dataset with the following targets:
  - to understand the spread of the virus.
  - to find out whether a particular symptom, or a combination of a few, gives a fair indication of whether the patient has or is likely to have the disease.
  - to understand the relation between the different symptoms and the test results.
  - to predict, with a Machine Learning model trained on this data, the spread, the symptoms and the likelihood of the patient being positive for Corona virus infection.


- Importance of predicting the disease accurately:
  - helps in curbing the spread of the disease if it is contagious.
  - helps in early diagnosis and, in turn, early treatment of the patient, which saves lives.
    - arranging the required equipment, beds and accessories, for example oxygen cylinders, masks etc.
    - arranging ambulances and the professionals required to treat the disease.
    - training and deploying the required staff and doctors in the meantime.
  - helps bio-scientists and pharma companies to research and develop the vaccines and medicines which could further the cause of humanity.
  - overall, keeps the healthcare system of the country prepared to face the brunt of such deadly diseases.

- Impact on the medical field: effective screening and a reduced healthcare burden.
  - Data analysis and appropriate prediction of the disease could
    - help the system to prioritise the patients in severe stages
    - help to diagnose the patient in comparatively less time
    - predict the disease so that appropriate measures can be taken
    - segregate the patients according to parameters related to the disease, for example age, symptoms etc.; providing the needed treatment accordingly could ease the burden on the system.
    - finally, a good data-based system for dealing with such diseases could improve the healthcare system for the public in general.

- Proposed method to deal with such diseases in the future:
  - First and foremost should be the aggregation of data related to the disease and the patients.
  - Involving stakeholders actively to appropriately determine
    - the root cause of the particular disease
    - how to control the spread of the disease
    - how to arrange the required equipment, staff and professionals
    - how to involve scientists, pharma companies and PSUs to quickly start research on vaccines or medicines for the disease.
    - In general, use of technology could speed up execution.

**Initial Hypothesis :**
 - The symptoms provided in the dataset can be used to predict whether a person is Corona positive or negative, on the basis of the relevant variables and the information in the dataset.

**Approach for Data analysis:**

- **Correlation** : check the correlation of the different variables with the target variable, i.e. the Corona test.
- **Relevant variables** : run tests on the relevant variables to confirm the relation and drop the irrelevant ones.
- **Structuring data** : standardise the data values, especially the boolean values, which are written inconsistently.
- **Missing values** : deal with the missing values to complete the dataset.
- **Feature Selection** : Chi2 test on the variables to select the related features.
- **Encoding** : encode the features so that they can be fed to the model.


**Modelling** - running different models on cleaned data
- Models used:
  - Random Forest Classifier
  - XGBoost Classifier
  - Logistic Regression model
  - Decision Tree Classifier

- **Hyperparameter tuning** on the models to counter overfitting
- **Accuracy scores and Confusion matrices** to evaluate the models.

**DATA DESCRIPTION :**

A. Basic information:

1. ID (Individual ID)

2. Sex (male/female).

3. Age 60 or above (true/false)

4. Test date (date when tested for COVID)

B. Symptoms:

5. Cough (true/false).

6. Fever (true/false).

7. Sore throat (true/false).

8. Shortness of breath (true/false).

9. Headache (true/false).

C. Other information:

10. Known contact with an individual confirmed to have COVID-19 (true/false).


D. Covid report

11. Corona positive or negative

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
# Importing dataset
# low_memory=False reads each column in one pass, which avoids the
# mixed-type DtypeWarning pandas raises for columns 2-6 of this file
data = pd.read_csv('corona_tested_006.csv', low_memory=False)



### **1.** **Data Construct**

In [None]:
data.shape # rows, columns

In [None]:
data.info()

In [None]:
data.describe(include='all')

In [None]:
data.columns

In [None]:
# dropping Ind_ID column since it just provides serial no.
data = data.drop(columns=['Ind_ID'])

In [None]:
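from scipy.stats import spearmanr

# Spearman correlation ranks its inputs, so both columns are made ordinal first:
# dates are parsed from dd-mm-yyyy text (string order is not chronological)
# and the test result is encoded with integer codes.
test_day = pd.to_datetime(data['Test_date'], format='%d-%m-%Y').astype('int64')
corona_code = data['Corona'].map({'negative': 0, 'positive': 1, 'other': 2})
spearman_correlation = spearmanr(test_day, corona_code)[0]

print('Spearman correlation:', spearman_correlation)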

- The Spearman correlation coefficient between the 'Test_date' variable and the 'Corona' test variable is negligible, which suggests that the two variables do not have a strong relation.
- Thus, the 'Test_date' variable can be dropped.

In [None]:
# Test_date shows no meaningful correlation with the Corona test results, so the column is dropped.
data = data.drop(columns=['Test_date'])

In [None]:
# Unique values of Known_contact (needed for the correlation check below)
data['Known_contact'].unique()

In [None]:
data['Corona'].unique()

In [None]:
data['Known_contact'] = data['Known_contact'].map({'Abroad':0,'Contact with confirmed':1,'Other':2})
data['Corona'] = data['Corona'].map({'negative':0,'positive':1,'other':2})
X = data[['Known_contact','Corona']]
matrix = X.corr()
print(matrix)

As the correlation matrix shows, the Known_contact column has no correlation with the Corona result under this integer encoding. Thus the Known_contact column can be dropped.
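Known_contact is a nominal variable, so a Pearson coefficient computed on arbitrary integer codes depends on how the codes happen to be ordered. As a cross-check, an association measure for categorical data such as Cramér's V (derived from the chi-squared statistic) could be used. The following is a minimal sketch on made-up toy data; the helper name `cramers_v` and the toy values are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V for two categorical series (0 = no association, 1 = perfect)."""
    table = pd.crosstab(x, y)
    # correction=False gives the plain chi-squared statistic used in the V formula
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Toy data (made up for illustration, not taken from the dataset):
# here the test result is fully determined by the contact type.
toy = pd.DataFrame({
    'Known_contact': ['Abroad', 'Abroad', 'Other', 'Other',
                      'Contact with confirmed', 'Contact with confirmed'] * 10,
    'Corona': ['negative', 'negative', 'negative', 'negative',
               'positive', 'positive'] * 10,
})
v = cramers_v(toy['Known_contact'], toy['Corona'])
print(round(v, 3))
```

On the real data the same helper could be called as `cramers_v(data['Known_contact'], data['Corona'])` before the column is mapped to integers; the result does not depend on how the categories are labelled.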

In [None]:
# dropping Known_contact column
data = data.drop(columns=['Known_contact'])

In [None]:
data['Corona'] = data['Corona'].map({0:'negative',1:'positive',2:'other'})

In [None]:
# Unique values in variables
for i in data.columns:
  print(i, data[i].unique())

- Here, some boolean values are written in capitals and some in lower case. They carry the same information but are written differently, and since operations on the data are case sensitive, these values need to be standardised.

<u>**Standardising Boolean Values:** </u>

In [None]:
# Standardising Boolean Values
for i in data.columns:
  data[i] = data[i].apply(lambda x : 'True' if x=='TRUE' else x)
  data[i] = data[i].apply(lambda x : 'False' if x=='FALSE' else x)
  data[i] = data[i].apply(lambda x : 'True' if x==True else x)
  data[i] = data[i].apply(lambda x : 'False' if x==False else x)

In [None]:
for i in data.columns:
  print(i, data[i].unique())

### **2. About Null values:**

Also, here we can see there are numerous 'None' values in different columns which points towards the presence of null values.

In [None]:
# Converting 'None' strings to NaN (null values)
# replace() avoids the chained assignment data[i][mask] = ..., which may not write back to the frame
data = data.replace('None', np.nan)

In [None]:
# number of null values in each column
data.isnull().sum()

In [None]:
# Percentage of null values in each column
data.isnull().sum()/len(data)*100

Cough_symptoms          0.090372
Fever                   0.090372
Sore_throat             0.000359
Shortness_of_breath     0.000359
Headache                0.000359
Corona                  0.000000
Age_60_above           45.659284
Sex                     7.015650
dtype: float64

In [None]:
# Matrix for the missing values
import missingno as msno
msno.matrix(data)

### **3. Handling Missing Values**

**Insights:**

- The Age_60_above column has about 45% missing values, as can be observed in the matrix.
- A few other columns have missing values, though very few.

In [None]:
# 1. Dropping Age_60_above column
data = data.drop(columns='Age_60_above')

In [None]:
for i in data.columns:
  print(i, data[i].unique())

- Missing values in the categorical variables are filled statistically with the mode (the most frequent value).

In [None]:
# 2. Filling the missing values in each categorical variable with its mode.
for col in ['Sex', 'Cough_symptoms', 'Fever', 'Sore_throat', 'Shortness_of_breath', 'Headache']:
  data[col] = data[col].fillna(data[col].mode()[0])

In [None]:
for i in data.columns:
  p = data[i].unique()
  print(i,p)

Test_date ['11-03-2020' '12-03-2020' '13-03-2020' '14-03-2020' '15-03-2020'
 '16-03-2020' '17-03-2020' '18-03-2020' '19-03-2020' '20-03-2020'
 '21-03-2020' '22-03-2020' '23-03-2020' '24-03-2020' '25-03-2020'
 '26-03-2020' '27-03-2020' '28-03-2020' '29-03-2020' '30-03-2020'
 '31-03-2020' '01-04-2020' '02-04-2020' '03-04-2020' '04-04-2020'
 '05-04-2020' '06-04-2020' '07-04-2020' '08-04-2020' '09-04-2020'
 '10-04-2020' '11-04-2020' '12-04-2020' '13-04-2020' '14-04-2020'
 '15-04-2020' '16-04-2020' '17-04-2020' '18-04-2020' '19-04-2020'
 '20-04-2020' '21-04-2020' '22-04-2020' '23-04-2020' '24-04-2020'
 '25-04-2020' '26-04-2020' '27-04-2020' '28-04-2020' '29-04-2020'
 '30-04-2020']
Cough_symptoms ['True' 'False']
Fever ['False' 'True']
Sore_throat ['True' 'False']
Shortness_of_breath ['False' 'True']
Headache ['False' 'True']
Corona ['negative' 'positive' 'other']
Age_60_above [nan 'No' 'Yes']
Sex ['female' 'male']


- Thus we can see that the missing values have been handled statistically.

**4. Duplicates:**
- Since all the variables are categorical and each has very few unique values, the probability of duplicated rows is very high, as shown below.
- This also implies that people with positive and negative Corona results often share the same symptoms, or have no symptoms at all.

In [None]:
data.duplicated().sum()/len(data)*100

99.94082797796649
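The very high duplicate share follows from simple counting: five binary symptom columns, a three-valued result and a two-valued sex column allow at most 2^5 × 3 × 2 = 192 distinct rows, so any large table of this shape must consist almost entirely of repeats. The sketch below illustrates this on randomly generated data; the frame `toy`, its size and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 10_000  # arbitrary size for the illustration

# Synthetic frame with the same kind of columns: five binary symptoms,
# a three-valued result and a two-valued sex column (values are random).
symptoms = ['Cough_symptoms', 'Fever', 'Sore_throat', 'Shortness_of_breath', 'Headache']
toy = pd.DataFrame({c: rng.choice(['True', 'False'], n_rows) for c in symptoms})
toy['Corona'] = rng.choice(['negative', 'positive', 'other'], n_rows)
toy['Sex'] = rng.choice(['female', 'male'], n_rows)

max_distinct = 2 ** 5 * 3 * 2             # 192 possible value combinations
n_distinct = len(toy.drop_duplicates())   # combinations that actually occur
dup_share = toy.duplicated().mean() * 100 # percentage of rows repeating an earlier row

print(max_distinct, n_distinct, round(dup_share, 2))
```

Because these duplicates are distinct people with identical answers rather than repeated records, dropping them would discard real observations.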

**5. Outliers:**

- Since the remaining variables are all categorical, outliers are not a concern.

Since missing values of the target variable are never imputed, the rows where 'Corona' is neither positive nor negative ('other') have to be dropped.

In [None]:
# deleting rows where the Corona variable is 'other'
data = data[data['Corona']!='other']

In [None]:
data['Corona'].unique()

In [None]:
data.info()

**Insights**

- All the columns are categorical.
- target variable - 'Corona'
- input variables - 'Cough_symptoms', 'Fever', 'Sore_throat', 'Shortness_of_breath', 'Headache'
- 'Sex' - to examine the distribution of the disease by gender.


### **6. Feature Selection**
  - Chi-Squared test

In [None]:
# Chi-Squared test on the data for Feature Selection
from scipy.stats import chi2_contingency
for i in data.columns.drop('Corona'):  # test every column except the target itself
  table = pd.crosstab(data[i],data['Corona'])
  print(table)
  chi2,p_value,dof,expected = chi2_contingency(table)
  print(i)
  print('p_value:',p_value)
  print('\n')


**Insights**
  - Here, the p_value of the variables 'Cough_symptoms', 'Fever', 'Sore_throat', 'Headache' and 'Shortness_of_breath' is less than the significance level of 0.05. **(p_val < 0.05)**
  - Thus we conclude that there is a **relation** between these variables and the Corona test being positive or negative, and hence they are **important variables** for predicting the test result in future.

**ANALYSIS PLOTS**

- The independent variables are plotted against the dependent variable to analyse the data as a whole.

In [None]:
for i in ['Cough_symptoms','Fever','Sore_throat','Shortness_of_breath','Headache']:
  # count rows per (test result, symptom value) pair
  df = data.groupby(['Corona',i],as_index=False).size()
  df.rename(columns={'size':'Count'},inplace=True)
  # share of each symptom value within its test-result group
  df['percent']=round(df['Count']*100/df.groupby('Corona')['Count'].transform('sum'),1)
  df['percent']=df['percent'].apply(lambda x: '{}%'.format(x))
  # px (plotly.express) is already imported at the top of the notebook
  fig = px.bar(df, x=i, y='Count',
             color='Corona',text='percent', barmode='group',
             height=400)
  fig.show()

**Insights:**
- **For Cough_symptoms**
  - 44.7% of those tested Corona positive had Cough symptoms.
  - 55.3% of those tested Corona positive did not have Cough symptoms.
  - 86.6% of those tested negative did not have Cough symptoms.
  - 13.4% of those tested negative had Cough symptoms.
- **For Fever**
  - 37.7% of those tested Corona positive had Fever.
  - 62.3% of those tested Corona positive did not have Fever.
  - 93.9% of those tested negative did not have Fever.
  - 6.1% of those tested Corona negative had Fever.
- **For Sore_throat**
  - 10.4% of those tested Corona positive had sore throat.
  - 89.6% of those tested Corona positive did not have sore throat.
  - 99.9% of those tested Corona negative did not have sore throat.
  - 0.1% of those tested Corona negative had sore throat.
- **For Shortness_of_breath**
  - 7.9% of those tested Corona positive had shortness of breath.
  - 92.1% of those tested Corona positive did not have shortness of breath.
  - 99.9% of those tested Corona negative did not have shortness of breath.
  - 0.1% of those tested Corona negative had shortness of breath
- **For Headache**
  - 15.2% of those tested Corona positive had Headache.
  - 84.8% of those tested Corona positive did not have Headache.
  - 99.9% of those tested Corona negative did not have Headache.
  - 0.1% of those tested Corona negative had Headache.

Thus, analysing the plots and the respective data, we can conclude that Cough symptoms and Fever are the symptoms most strongly associated with a positive Corona test, since a comparatively higher percentage of positive patients (44.7% and 37.7%) reported them than negative patients (13.4% and 6.1%).

Sore throat, shortness of breath and Headache appeared in fewer cases, but they were far more frequent among positive than among negative patients, so these symptoms cannot be neglected.
Having said that, patients with Cough symptoms and Fever can be prioritised first.
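The percentages above are shares within each test result (how many positives had a symptom). Prioritising patients depends on the reverse quantity: how many people with a symptom tested positive. A minimal sketch of that calculation, using the cough and fever counts reported in the SQL section further down in this notebook; the helper name `positive_share` is an assumption for illustration.

```python
# Counts reported in the SQL section of this notebook
# (people with the symptom, split by test result).
counts = {
    'Cough_symptoms': {'positive': 6584, 'negative': 34987},
    'Fever': {'positive': 5559, 'negative': 15816},
}

def positive_share(symptom_counts):
    """Percentage of symptomatic people who tested positive."""
    total = symptom_counts['positive'] + symptom_counts['negative']
    return round(100 * symptom_counts['positive'] / total, 1)

shares = {name: positive_share(c) for name, c in counts.items()}
print(shares)
```

With these counts, roughly 15.8% of the people with a cough and 26.0% of the people with fever tested positive, so each symptom raises the likelihood of a positive result without making it the most probable outcome on its own.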

### **7. Encoding**

- Encoding the data as numbers so that it can be given as input to the machine learning models.

In [None]:
data.head()

In [None]:
var = ['Cough_symptoms','Fever','Sore_throat','Shortness_of_breath','Headache']
for i in var:
  data[i]=data[i].map({'True':1,'False':0})

In [None]:
data['Sex']=data['Sex'].map({'male':1,'female':0})
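`Series.map` with a dictionary returns NaN for every value that is not a key of the dictionary, so a spelling that slipped through the standardisation step would silently become a missing value. A small check like the following could be run on a column before it is encoded; the toy column and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy column with one unexpected spelling ('TRUE'); values are illustrative
toy = pd.DataFrame({'Fever': ['True', 'False', 'TRUE', 'False']})

mapping = {'True': 1, 'False': 0}
encoded = toy['Fever'].map(mapping)

# Series.map returns NaN for any value missing from the mapping,
# so the NaN positions point at the values that were not covered
unmapped = toy.loc[encoded.isna(), 'Fever'].unique().tolist()
print(unmapped)
```

An empty list means every value was covered by the mapping.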

### **8. Modelling**

<font color='red'>**MODEL-I**

 **i. RandomForest Classifier** - Bagging technique

  
- using RandomForestClassifier; at this point the target 'Corona' is still kept as the labels 'negative'/'positive'.

In [None]:
# data['Corona'] = data['Corona'].map({0:'negative',1:'positive'})

In [None]:
X = data[['Cough_symptoms','Fever','Sore_throat','Shortness_of_breath','Headache']]
y = data['Corona']

In [None]:
# split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=1)  # fixed seed (True was interpreted as 1)

# Modelling
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=3)
model.fit(X_train, y_train)

# Prediction
y_pred = model.predict(X_test)

# Evaluation - accuracy_Score
from sklearn.metrics import accuracy_score
ac_score_rf = round((accuracy_score(y_test,y_pred)*100),2)
print('Accuracy_score for RandomForestClassifier',ac_score_rf)

# Cross Validation
from sklearn.model_selection import cross_val_score
cv_score_rf = round((cross_val_score(model,X_train,y_train, cv=10).mean()*100),2)
print('Cross Validation score for RandomForestClassifier: ',cv_score_rf)

# Train accuracy
y_pred_train = model.predict(X_train)
train_accuracy = accuracy_score(y_train,y_pred_train)
print('Train Accuracy:',round((train_accuracy*100),2))

Accuracy_score for RandomForestClassifier 95.69
Cross Validation score for RandomForestClassifier:  95.75
Train Accuracy: 95.75


In [None]:
# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# Grid - the list of values for which you have to check
# Search - identify the best one
# CV - which gives highest/max CV score

estimator = RandomForestClassifier()
param_grid = {'n_estimators':list(range(1,10))} # if the last value is the best value increase the value
grid = GridSearchCV(estimator,param_grid,cv=5)
grid.fit(X_train,y_train)
grid.best_params_

# then put n_estimators = ans in RandomForestClassifier model but if the last value is the best value increase the range.

{'n_estimators': 2}
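Instead of copying the best value back into the model by hand, the fitted search object can be used directly: with the default `refit=True`, `GridSearchCV` retrains the estimator on the whole training set with the best parameters. A minimal sketch on synthetic data; `X_demo`, `y_demo` and `grid_demo` are hypothetical names and the data is generated, not taken from the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary data standing in for the symptom features
X_demo, y_demo = make_classification(n_samples=500, n_features=5,
                                     n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

grid_demo = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': list(range(1, 10))},
    cv=5,
)
grid_demo.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already retrained on X_tr
# using best_params_, so it can be scored on the held-out split directly.
tuned_model = grid_demo.best_estimator_
test_accuracy = tuned_model.score(X_te, y_te)
print(grid_demo.best_params_, round(test_accuracy * 100, 2))
```

Scoring `best_estimator_` on the test split keeps the reported accuracy tied to the tuned configuration rather than to a separately constructed model.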

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
report = classification_report(y_test,y_pred)
print('Classification_report: ',report)

matrix = confusion_matrix(y_test,y_pred)
cm_display = ConfusionMatrixDisplay(confusion_matrix=matrix,display_labels=['Negative','Positive'])
cm_display.plot()

Validation of the RandomForest model:

In [None]:
model.predict([[0,1,1,0,1]])



array(['positive'], dtype=object)

<font color='red'>**MODEL-II**

**ii. XGB Classifier** - Boosting technique

In [None]:
data.head()

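| | Cough_symptoms | Fever | Sore_throat | Shortness_of_breath | Headache | Corona | Sex |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | negative | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | positive | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | positive | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | negative | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | negative | 0 |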


In [None]:
data['Corona']=data['Corona'].map({'positive':1,'negative':0})

In [None]:
X = data[['Cough_symptoms','Fever','Sore_throat','Shortness_of_breath','Headache']]
y = data['Corona']

# split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=1)  # fixed seed (True was interpreted as 1)

# Modelling
from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=6)
model.fit(X_train,y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluation
from sklearn.metrics import accuracy_score
xg_score = round((accuracy_score(y_test,y_pred)*100),2)
print('Accuracy_Score for XGBoost: ',xg_score)

# Cross Validation
from sklearn.model_selection import cross_val_score
cv_score_xg = round((cross_val_score(model,X_train,y_train, cv=5).mean()*100),2)
print('Cross Validation score for XGBoost: ',cv_score_xg)

# Train accuracy
y_pred_train = model.predict(X_train)
train_accuracy = accuracy_score(y_train,y_pred_train)
print('Train Accuracy:',round((train_accuracy*100),2))

Accuracy_Score for XGBoost:  95.69
Cross Validation score for XGBoost:  95.75
Train Accuracy: 95.75


In [None]:
# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# Grid - the list of values for which you have to check
# Search - identify the best one
# CV - which gives highest/max CV score

estimator = XGBClassifier()
param_grid = {'n_estimators':list(range(1,10))} # if the last value is the best value increase the value
grid = GridSearchCV(estimator,param_grid,cv=5)
grid.fit(X_train,y_train)
grid.best_params_

# then put n_estimators = ans in XGBClassifier model but if the last value is the best value increase the range.

{'n_estimators': 6}

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
report = classification_report(y_test,y_pred)
print('Classification_report: ',report)

matrix = confusion_matrix(y_test,y_pred)
cm_display = ConfusionMatrixDisplay(confusion_matrix=matrix,display_labels=['Negative','Positive'])
cm_display.plot()

In [None]:
data.head()

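| | Cough_symptoms | Fever | Sore_throat | Shortness_of_breath | Headache | Corona | Sex |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |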


Validation of the XGB model:

In [None]:
model.predict([[0,1,1,0,1]])

array([1])

<font color='red'>**MODEL-III**

**iii. Logistic Regression**

In [None]:
data['Corona']=data['Corona'].map({1:'positive',0:'negative'})

In [None]:
X = data[['Cough_symptoms','Fever','Sore_throat','Shortness_of_breath','Headache']]
y = data['Corona']

# split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=1)  # fixed seed (True was interpreted as 1)

In [None]:
# Modelling
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train,y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluation
from sklearn.metrics import accuracy_score
log_score = round((accuracy_score(y_test,y_pred)*100),2)
print('Accuracy_Score for LogisticRegression: ',log_score)

# Cross Validation
from sklearn.model_selection import cross_val_score
cv_score_log = round((cross_val_score(model,X_train,y_train, cv=5).mean()*100),2)
print('Cross Validation score for LogisticRegression: ',cv_score_log)

# Train accuracy
y_pred_train = model.predict(X_train)
train_accuracy = accuracy_score(y_train,y_pred_train)
print('Train Accuracy:',round((train_accuracy*100),2))

Accuracy_Score for LogisticRegression:  95.65
Cross Validation score for LogisticRegression:  95.69
Train Accuracy: 95.69


In [None]:
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
report = classification_report(y_test,y_pred)
print('Classification_report: ',report)

matrix = confusion_matrix(y_test,y_pred)
cm_display = ConfusionMatrixDisplay(confusion_matrix=matrix,display_labels=['Negative','Positive'])
cm_display.plot()

Validation of the Logistic Regression model:

In [None]:
model.predict([[True,True,True,True,True]])

<font color='red'>**MODEL-IV**

**iv. DecisionTree Classifier**

In [None]:
X = data[['Cough_symptoms','Fever','Sore_throat','Shortness_of_breath','Headache']]
y = data['Corona']

# split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=1)  # fixed seed (True was interpreted as 1)

# Modelling
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=4)
model.fit(X_train,y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluation
from sklearn.metrics import accuracy_score
dt_score =round((accuracy_score(y_test,y_pred)*100),2)
print('Accuracy_Score for DecisionTreeClassifier ',dt_score)

# Cross Validation
from sklearn.model_selection import cross_val_score
cv_score_dt = round((cross_val_score(model,X_train,y_train, cv=5).mean()*100),2)
print('Cross Validation score: ',cv_score_dt)

# Train accuracy
y_pred_train = model.predict(X_train)
train_accuracy = accuracy_score(y_train,y_pred_train)
print('Train Accuracy:',round((train_accuracy*100),2))

Accuracy_Score for DecisionTreeClassifier  95.69
Cross Validation score:  95.75
Train Accuracy: 95.75


In [None]:
# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# Grid - the list of values for which you have to check
# Search - identify the best one
# CV - which gives highest/max CV score

estimator = DecisionTreeClassifier()
param_grid = {'max_depth':list(range(1,10))} # if the last value is the best value increase the value
grid = GridSearchCV(estimator,param_grid,cv=5)
grid.fit(X_train,y_train)
grid.best_params_

# then put max_depth = ans in the DecisionTreeClassifier model, but if the last value is the best value increase the range.

{'max_depth': 4}

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
report = classification_report(y_test,y_pred)
print('Classification_report: ',report)

matrix = confusion_matrix(y_test,y_pred)
cm_display = ConfusionMatrixDisplay(confusion_matrix=matrix,display_labels=['Negative','Positive'])
cm_display.plot()

Validation of the Decision Tree model:

In [None]:
model.predict([[True,True,True,True,True]])



array(['positive'], dtype=object)

In [None]:
print('Accuracy_score for RandomForestClassifier',ac_score_rf)
print('Cross Validation score for RandomForestClassifier: ',cv_score_rf)
print('\nAccuracy_Score for XGBoost: ',xg_score)
print('Cross Validation score for XGBoost: ',cv_score_xg)
print('\nAccuracy_Score for LogisticRegression: ',log_score)
print('Cross Validation score for LogisticRegression: ',cv_score_log)
print('\nAccuracy_Score for DecisionTreeClassifier ',dt_score)
print('Cross Validation score: ',cv_score_dt)

Accuracy_score for RandomForestClassifier 95.69
Cross Validation score for RandomForestClassifier:  95.75

Accuracy_Score for XGBoost:  95.69
Cross Validation score for XGBoost:  95.75

Accuracy_Score for LogisticRegression:  95.65
Cross Validation score for LogisticRegression:  95.69

Accuracy_Score for DecisionTreeClassifier  95.69
Cross Validation score:  95.75


**9. Comparing Accuracy scores and Cross_Validation scores for each used Classifier.**

In [None]:
Classifiers = ['RandomForestClassifier','XGBClassifier','LogisticRegression','DecisionTreeClassifier']
Accuracy = [ac_score_rf,xg_score,log_score,dt_score]
Cross_Validation = [cv_score_rf,cv_score_xg,cv_score_log,cv_score_dt]

In [None]:
df = pd.DataFrame({'Classifiers':Classifiers,'Accuracy':Accuracy,'Cross_Validation':Cross_Validation})

In [None]:
df

| | Classifiers | Accuracy | Cross_Validation |
|---|---|---|---|
| 0 | RandomForestClassifier | 95.69 | 95.75 |
| 1 | XGBClassifier | 95.69 | 95.75 |
| 2 | LogisticRegression | 95.65 | 95.69 |
| 3 | DecisionTreeClassifier | 95.69 | 95.75 |


**10. Plotting accuracy and cross_validation for different models**

In [None]:
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Bar( x = Classifiers, y=list(Accuracy), name= 'Accuracy', marker_color='indianred' ))
fig.add_trace(go.Bar( x = Classifiers, y=list(Cross_Validation), name= 'Cross_Validation', marker_color='lightsalmon' ))
fig.update_layout(barmode='group')
fig.show()

**Conclusion:**

Different Machine Learning models have been trained and tested. Since their accuracy scores are nearly identical (95.65% to 95.69%), any of the above models could be used.
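Positive results are far less frequent than negative ones in this dataset (compare the counts in the SQL section), so accuracy alone can look high even for a model that never predicts a positive case. One way to put the scores in context is a majority-class baseline together with the recall on the positive class. The sketch below uses synthetic labels with an assumed 5% positive rate; the rate and all variable names are illustrative assumptions, not figures from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with 5% positives (an assumed rate, chosen for illustration)
y_demo = np.array([1] * 50 + [0] * 950)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for this baseline

# The baseline always predicts the most frequent class (negative)
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_base)
baseline_recall = recall_score(y_demo, y_base, pos_label=1)
print(baseline_accuracy, baseline_recall)
```

Here the baseline reaches 95% accuracy while finding none of the positive cases, which is why the classification reports and confusion matrices above are worth reading alongside the accuracy scores.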

# <font color=Cyan>**SQL QUESTIONS**

**Cleaning Data:**

In [None]:
data = pd.read_csv('corona_tested_006.csv')

In [None]:
# Cleaning
# dropping Ind_ID column since it just provides serial no.
data = data.drop(columns=['Ind_ID'])

# Standardising Boolean Values
for i in data.columns:
  data[i] = data[i].apply(lambda x : 'True' if x=='TRUE' else x)
  data[i] = data[i].apply(lambda x : 'False' if x=='FALSE' else x)
  data[i] = data[i].apply(lambda x : 'True' if x==True else x)
  data[i] = data[i].apply(lambda x : 'False' if x==False else x)

# Converting 'None' strings to NaN (null values)
data = data.replace('None', np.nan)

In [None]:
# 2. Dealing the missing values in categorical variable statistically using mode.
data['Sex']= data['Sex'].fillna(data['Sex'].mode()[0])
data['Cough_symptoms']= data['Cough_symptoms'].fillna(data['Cough_symptoms'].mode()[0])
data['Fever']= data['Fever'].fillna(data['Fever'].mode()[0])
data['Sore_throat']= data['Sore_throat'].fillna(data['Sore_throat'].mode()[0])
data['Shortness_of_breath']= data['Shortness_of_breath'].fillna(data['Shortness_of_breath'].mode()[0])
data['Headache']= data['Headache'].fillna(data['Headache'].mode()[0])
data['Age_60_above'] = data['Age_60_above'].fillna(data['Age_60_above'].mode()[0])

# thus deleting rows where Corona variable have 'other'
data = data[data['Corona']!='other']

In [None]:
import duckdb
conn = duckdb.connect()
conn.register('data',data)

<duckdb.DuckDBPyConnection at 0x781e0daaee70>

**1. Find the number of corona patients who faced shortness of breath.**

In [None]:
conn.execute("Select count(*) as Count from data where Corona = 'positive' and Shortness_of_breath = 'True';").fetchdf()

- This returns the number of Corona-positive patients whose Shortness_of_breath is 'True'.

**2. Find the number of negative corona patients who have fever and sore_throat.**

In [None]:
conn.execute("Select count(*) as Count from data where Corona = 'negative' and Fever = 'True' and Sore_throat = 'True';").fetchdf()

- This returns the number of Corona-negative patients for whom both Fever and Sore_throat are 'True'.

**3. Group the data by month and rank the number of positive cases.**


In [None]:
data_date = data.copy(deep=True)

In [None]:
data_date[['date','month','year']] = data_date['Test_date'].str.split('-',expand=True)
data_date.head()

In [None]:
grouped = conn.execute("Select year, month,Corona, count(*) as Count from data_date where Corona=='positive' GROUP BY year,month,Corona;").fetchdf()
grouped

| | year | month | Corona | Count |
|---|---|---|---|---|
| 0 | 2020 | 4 | positive | 8881 |
| 1 | 2020 | 3 | positive | 5848 |


In [None]:
ranked = conn.execute("Select year, month, Corona, Count, ROW_NUMBER() OVER(ORDER BY Count desc) Rank_by_positives from grouped").fetchdf()
ranked

| | year | month | Corona | Count | Rank_by_positives |
|---|---|---|---|---|---|
| 0 | 2020 | 4 | positive | 8881 | 1 |
| 1 | 2020 | 3 | positive | 5848 | 2 |


In [None]:
import plotly.express as px
px.bar(ranked,x='month', y='Count', text_auto='0.2s',color='month', title = 'Number of Positive cases in different Months',width=600, height=400)

- As the bar graph shows, more positive cases were detected in April (04) than in March (03): 8881 against 5848.
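The same grouping can also be done in pandas with parsed dates, which keeps the months in chronological order even when the data spans more than one year. A minimal sketch on made-up rows in the notebook's dd-mm-yyyy format; the frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy rows in the notebook's dd-mm-yyyy format (values are illustrative)
toy = pd.DataFrame({
    'Test_date': ['11-03-2020', '30-03-2020', '01-04-2020', '02-04-2020', '15-04-2020'],
    'Corona': ['positive', 'negative', 'positive', 'positive', 'positive'],
})

# Parse the text dates and derive a year-month period for grouping
toy['Test_date'] = pd.to_datetime(toy['Test_date'], format='%d-%m-%Y')
toy['month'] = toy['Test_date'].dt.to_period('M')

# Count positive results per month, then rank months by that count
monthly = (
    toy[toy['Corona'] == 'positive']
    .groupby('month')
    .size()
    .reset_index(name='Count')
)
monthly['Rank_by_positives'] = (
    monthly['Count'].rank(method='dense', ascending=False).astype(int)
)
print(monthly)
```

`rank(method='dense')` gives tied months the same rank, whereas `ROW_NUMBER()` in the SQL version always assigns distinct numbers.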

**4. Find the female negative corona patients who faced cough and headache.**

In [None]:
conn.execute("Select * from data where Corona=='negative' and Cough_symptoms=='True' and Headache=='True' and Sex=='female' ").fetchdf()

**5. How many elderly corona patients have faced breathing problems?**

- Patients aged 60 or above are considered elderly.
- The dataset has a separate variable, Age_60_above, for this.

In [None]:
conn.execute("Select Age_60_above, Shortness_of_breath, Count(*) as Count from data where Shortness_of_breath==True and Age_60_above == 'Yes' Group by  Age_60_above,Shortness_of_breath").fetchdf()

| | Age_60_above | Shortness_of_breath | Count |
|---|---|---|---|
| 0 | Yes | True | 287 |


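- 287 people aged 60 or above reported shortness of breath; note that the query counts all tested individuals and does not filter on the Corona result.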

**6. Which three symptoms were more common among COVID positive patients?**

In [None]:
# Segregating Covid positive patients
positive = conn.execute("Select * from data where Corona=='positive'").fetchdf()
positive.head()

| | Test_date | Cough_symptoms | Fever | Sore_throat | Shortness_of_breath | Headache | Corona | Age_60_above | Sex | Known_contact |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11-03-2020 | False | True | False | False | False | positive | No | female | Abroad |
| 1 | 11-03-2020 | False | True | False | False | False | positive | No | female | Abroad |
| 2 | 11-03-2020 | False | False | False | False | False | positive | No | female | Abroad |
| 3 | 11-03-2020 | True | False | False | False | False | positive | No | female | Abroad |
| 4 | 11-03-2020 | False | False | False | False | False | positive | No | female | Abroad |


In [None]:
# Separating Covid positive patients with respect to Cough_symptoms
Cough = conn.execute("Select Cough_symptoms, count(*) as Count from positive Group by Cough_symptoms").fetchdf()
Cough = conn.execute("Select * from Cough where Cough_symptoms=='True'").fetchdf()
Cough

| | Cough_symptoms | Count |
|---|---|---|
| 0 | True | 6584 |


In [None]:
# Separating Covid positive patients with respect to Fever symptom
Fever = conn.execute("Select Fever, count(*) as Count from positive Group by Fever").fetchdf()
Fever = conn.execute("Select * from Fever where Fever=='True'").fetchdf()
Fever

| | Fever | Count |
|---|---|---|
| 0 | True | 5559 |


In [None]:
# Separating Covid positive patients with respect to Sore_throat symptom
Sore_throat = conn.execute("Select Sore_throat, count(*) as Count from positive Group by Sore_throat").fetchdf()
Sore_throat = conn.execute("Select * from Sore_throat where Sore_throat=='True'").fetchdf()
Sore_throat

| | Sore_throat | Count |
|---|---|---|
| 0 | True | 1526 |


In [None]:
# Separating Covid positive patients with respect to Shortness_of_breath symptom
Shortness_of_breath = conn.execute("Select Shortness_of_breath, count(*) as Count from positive Group by Shortness_of_breath").fetchdf()
Shortness_of_breath = conn.execute("Select * from Shortness_of_breath where Shortness_of_breath=='True'").fetchdf()
Shortness_of_breath

Unnamed: 0,Shortness_of_breath,Count
0,True,1164


In [None]:
# Separating Covid positive patients with respect to Headache symptom
Headache = conn.execute("Select Headache, count(*) as Count from positive Group by Headache").fetchdf()
Headache = conn.execute("Select * from Headache where Headache=='True'").fetchdf()
Headache

Unnamed: 0,Headache,Count
0,True,2235


In [None]:
# Building a summary DataFrame from the 'True' counts obtained in the cells above
Symptoms = ['Cough','Fever','Sore_throat','Shortness_of_breath','Headache']
Count = [6584,5559,1526,1164,2235]
df = pd.DataFrame({'Symptoms':Symptoms,'Covid Positive patients':Count})

In [None]:
px.bar(df,x='Symptoms',y='Covid Positive patients', title = 'No of Covid Positive patients with different symptoms',text_auto='0.2s',
       color='Symptoms')

- Referring to the bar graph, we can conclude that **Cough, Fever and Headache** were the three most common symptoms among Covid positive patients.
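
The same ranking can be read off programmatically rather than from the chart. The following is a minimal sketch that reuses the counts reported in the cells above; the names `counts` and `top_three` are illustrative and not part of the original notebook.

```python
# Counts of Covid positive patients per symptom, copied from the outputs above
counts = {
    'Cough': 6584,
    'Fever': 5559,
    'Sore_throat': 1526,
    'Shortness_of_breath': 1164,
    'Headache': 2235,
}

# Sort the symptom names by their count, largest first, and keep the first three
top_three = sorted(counts, key=counts.get, reverse=True)[:3]
print(top_three)
```

Sorting the dictionary keys by their counts avoids relying on a visual reading of bar heights when two bars are close.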

**7. Which symptom was less common among COVID negative people?**

In [None]:
# Segregating Covid negative patients
negative = conn.execute("Select * from data where Corona=='negative'").fetchdf()

In [None]:
# Separating Covid negative patients with respect to Cough_symptoms
Cough = conn.execute("Select Cough_symptoms, count(*) as Count from negative Group by Cough_symptoms").fetchdf()
Cough = conn.execute("Select * from Cough where Cough_symptoms=='True'").fetchdf()
Cough

Unnamed: 0,Cough_symptoms,Count
0,True,34987


In [None]:
# Separating Covid negative patients with respect to Fever symptom
Fever = conn.execute("Select Fever, count(*) as Count from negative Group by Fever").fetchdf()
Fever = conn.execute("Select * from Fever where Fever=='True'").fetchdf()
Fever

Unnamed: 0,Fever,Count
0,True,15816


In [None]:
# Separating Covid negative patients with respect to Sore_throat symptom
Sore_throat = conn.execute("Select Sore_throat, count(*) as Count from negative Group by Sore_throat").fetchdf()
Sore_throat = conn.execute("Select * from Sore_throat where Sore_throat=='True'").fetchdf()
Sore_throat

Unnamed: 0,Sore_throat,Count
0,True,366


In [None]:
# Separating Covid negative patients with respect to Shortness_of_breath symptom
Shortness_of_breath = conn.execute("Select Shortness_of_breath, count(*) as Count from negative Group by Shortness_of_breath").fetchdf()
Shortness_of_breath = conn.execute("Select * from Shortness_of_breath where Shortness_of_breath=='True'").fetchdf()
Shortness_of_breath

Unnamed: 0,Shortness_of_breath,Count
0,True,385


In [None]:
# Separating Covid negative patients with respect to Headache symptom
Headache = conn.execute("Select Headache, count(*) as Count from negative Group by Headache").fetchdf()
Headache = conn.execute("Select * from Headache where Headache=='True'").fetchdf()
Headache

Unnamed: 0,Headache,Count
0,True,148


In [None]:
# Building a summary DataFrame from the 'True' counts obtained for Covid negative patients
Symptoms = ['Cough','Fever','Sore_throat','Shortness_of_breath','Headache']
Count = [34987,15816,366,385,148]
df = pd.DataFrame({'Symptoms':Symptoms,'Covid negative patients':Count})

In [None]:
px.bar(df,x='Symptoms',y='Covid negative patients', title = 'No of Covid negative patients with different symptoms',text_auto='0.2s',
       color='Symptoms')

- Referring to the bar graph, we can conclude that **Headache** was the least common symptom among Covid negative patients (148 patients).
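
Placing the positive and negative counts side by side gives, for each symptom, the share of symptomatic patients who tested positive. This is a rough sketch based only on the counts reported above. It assumes that only the `positive` and `negative` results are considered (any other test outcome is ignored), and the names `positive_counts`, `negative_counts` and `positive_share` are illustrative.

```python
symptoms = ['Cough', 'Fever', 'Sore_throat', 'Shortness_of_breath', 'Headache']
positive_counts = [6584, 5559, 1526, 1164, 2235]   # from question 6
negative_counts = [34987, 15816, 366, 385, 148]    # from question 7

# Percentage of patients with a given symptom who tested positive,
# out of all patients with that symptom who tested positive or negative
positive_share = {
    s: round(100 * p / (p + n), 1)
    for s, p, n in zip(symptoms, positive_counts, negative_counts)
}

for s, share in positive_share.items():
    print(f"{s}: {share}% tested positive")
```

On these counts, Headache has the highest positive share (about 94%), while Cough, although the most frequent symptom in both groups, has the lowest (about 16%).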

**8. What are the most common symptoms among COVID positive males whose known contact was abroad?**

In [None]:
data.head()

Unnamed: 0,Test_date,Cough_symptoms,Fever,Sore_throat,Shortness_of_breath,Headache,Corona,Age_60_above,Sex,Known_contact
0,11-03-2020,True,False,True,False,False,negative,No,female,Abroad
1,11-03-2020,False,True,False,False,False,positive,No,female,Abroad
2,11-03-2020,False,True,False,False,False,positive,No,female,Abroad
3,11-03-2020,True,False,False,False,False,negative,No,female,Abroad
4,11-03-2020,True,False,False,False,False,negative,No,female,Contact with confirmed


In [None]:
df = conn.execute("Select * from data Where Sex == 'male' and Corona=='positive' and Known_contact=='Abroad'").fetchdf()
df.head()

Unnamed: 0,Test_date,Cough_symptoms,Fever,Sore_throat,Shortness_of_breath,Headache,Corona,Age_60_above,Sex,Known_contact
0,22-03-2020,True,True,False,False,False,positive,Yes,male,Abroad
1,22-03-2020,True,False,False,True,False,positive,No,male,Abroad
2,22-03-2020,True,True,False,True,True,positive,Yes,male,Abroad
3,22-03-2020,False,False,False,True,False,positive,No,male,Abroad
4,22-03-2020,True,True,False,True,False,positive,Yes,male,Abroad


In [None]:
# Separating Covid positive males (known contact abroad) with respect to Cough_symptoms
Cough = conn.execute("Select Cough_symptoms, count(*) as Count from df Group by Cough_symptoms").fetchdf()
Cough = conn.execute("Select * from Cough where Cough_symptoms=='True'").fetchdf()
Cough

Unnamed: 0,Cough_symptoms,Count
0,True,532


In [None]:
# Separating Covid positive males (known contact abroad) with respect to Fever symptom
Fever = conn.execute("Select Fever, count(*) as Count from df Group by Fever").fetchdf()
Fever = conn.execute("Select * from Fever where Fever=='True'").fetchdf()
Fever

Unnamed: 0,Fever,Count
0,True,407


In [None]:
# Separating Covid positive males (known contact abroad) with respect to Sore_throat symptom
Sore_throat = conn.execute("Select Sore_throat, count(*) as Count from df Group by Sore_throat").fetchdf()
Sore_throat = conn.execute("Select * from Sore_throat where Sore_throat=='True'").fetchdf()
Sore_throat

Unnamed: 0,Sore_throat,Count
0,True,87


In [None]:
# Separating Covid positive males (known contact abroad) with respect to Shortness_of_breath symptom
Shortness_of_breath = conn.execute("Select Shortness_of_breath, count(*) as Count from df Group by Shortness_of_breath").fetchdf()
Shortness_of_breath = conn.execute("Select * from Shortness_of_breath where Shortness_of_breath=='True'").fetchdf()
Shortness_of_breath

Unnamed: 0,Shortness_of_breath,Count
0,True,84


In [None]:
# Separating Covid positive males (known contact abroad) with respect to Headache symptom
Headache = conn.execute("Select Headache, count(*) as Count from df Group by Headache").fetchdf()
Headache = conn.execute("Select * from Headache where Headache=='True'").fetchdf()
Headache

Unnamed: 0,Headache,Count
0,True,129


In [None]:
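# Building a summary DataFrame from the 'True' counts obtained above
# (note: this reassigns df, which previously held the filtered rows)
Symptoms = ['Cough','Fever','Sore_throat','Shortness_of_breath','Headache']
Count = [532,407,87,84,129]
df = pd.DataFrame({'Symptoms':Symptoms,'Covid positive patients':Count})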

In [None]:
px.bar(df,x='Symptoms',y='Covid positive patients', title = 'No of Covid positive males (known contact abroad) with different symptoms',text_auto='0.2s',
       color='Symptoms')

- Referring to the plot, we can conclude that **Cough, Fever and Headache** were the most common symptoms among COVID positive males whose known contact was abroad.
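
As a quick cross-check, the ordering of symptoms in this subgroup can be compared with the ordering found for all Covid positive patients in question 6. This is a small sketch using only the counts reported above; the helper `ranking` and the variable names are illustrative.

```python
symptoms = ['Cough', 'Fever', 'Sore_throat', 'Shortness_of_breath', 'Headache']
all_positive = [6584, 5559, 1526, 1164, 2235]   # counts from question 6
males_abroad = [532, 407, 87, 84, 129]          # counts from question 8

def ranking(counts):
    """Return the symptom names ordered from most to least frequent."""
    return [name for _, name in sorted(zip(counts, symptoms), reverse=True)]

same_order = ranking(all_positive) == ranking(males_abroad)
print(ranking(males_abroad))
print("Same ordering as all positive patients:", same_order)
```

If the two rankings match, the subgroup of males with contact abroad follows the same symptom pattern as the positive population as a whole, at least in terms of ordering.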