<a href="https://colab.research.google.com/github/DocBot-Bangkit-2021/DocBot-MachineLearningModels/blob/main/Covid-19/Covid19_Baseline_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Import Library**

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
import zipfile
import os

In [2]:
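# The 'seaborn' style was renamed to 'seaborn-v0_8' in Matplotlib 3.6;
# on older Matplotlib versions use plt.style.use('seaborn') instead.
plt.style.use('seaborn-v0_8')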

# **Data Loading**

[Early stage symptoms of COVID-19 patients](https://www.kaggle.com/martuza/early-stage-symptoms-of-covid19-patients)

### **A machine learning model to identify early stage symptoms of SARS-Cov-2 infected patients**

### **Dataset description**

The dataset contains 6,512 individuals with the following attributes:

* `Gender` - (male, female)
* `Age` - (Numeric)
* `Fever` - (yes-1, no-0)
* `Cough` - (yes-1, no-0)
* `Runny nose` - (yes-1, no-0)
* `Muscle soreness` - (yes-1, no-0)
* `Pneumonia` - (yes-1, no-0)
* `Diarrhea` - (yes-1, no-0)
* `Lung infection` - (yes-1, no-0)
* `Travel history` - (yes-1, no-0)
* `Isolation treatment` - (yes-1, no-0)
* `SARS-CoV-2 Positive` - (positive-1, suspected-0)
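
A quick sanity check of this schema could look like the sketch below. The column names are the ones in the CSV header shown later in this notebook, and the five sample rows are copied from the `df.head()` output; the helper name `check_binary_columns` is hypothetical and only used for illustration.

```python
import pandas as pd

# Binary (0/1) columns as named in the CSV file
BINARY_COLS = ['fever', 'cough', 'runny_nose', 'muscle_soreness', 'pneumonia',
               'diarrhea', 'lung_infection', 'travel_history',
               'isolation_treatment', 'SARS-CoV-2 Positive']


def check_binary_columns(frame, cols=BINARY_COLS):
    """Return the columns that contain values other than 0 or 1."""
    return [c for c in cols if not frame[c].isin([0, 1]).all()]


# First five rows of the dataset, as displayed by df.head()
sample = pd.DataFrame(
    [['male', 89, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0],
     ['male', 68, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     ['male', 68, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
     ['male', 68, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1],
     ['male', 50, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1]],
    columns=['gender', 'age_year'] + BINARY_COLS)

bad_cols = check_binary_columns(sample)
print(bad_cols)  # an empty list means every flag column holds only 0/1
```

On the full data, the same call would be `check_binary_columns(df)` once `df` has been loaded.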

### **Related paper**

[A machine learning model to identify early stage symptoms of SARS-Cov-2 infected patients](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7305929/)

Ahamad MM, Aktar S, Rashed-Al-Mahfuz M, et al. A machine learning model to identify early stage symptoms of SARS-Cov-2 infected patients. Expert Systems with Applications. 2020;160:113661. doi:10.1016/j.eswa.2020.113661

## **Download dataset from Kaggle**

How to get data from Kaggle:
https://www.kaggle.com/general/51898
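
Outside Colab, the credential setup can also be done with the Python standard library. The following is a minimal sketch, assuming a `kaggle.json` token has already been downloaded from your Kaggle account page; the function name `install_kaggle_token` and its `home` parameter are hypothetical helpers, not part of the Kaggle API.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner."""
    home = Path(home) if home is not None else Path.home()
    kaggle_dir = home / '.kaggle'
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    target = kaggle_dir / 'kaggle.json'
    shutil.copyfile(token_path, target)
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600
    return target


# Example call (uncomment once kaggle.json is in the working directory):
# install_kaggle_token('kaggle.json')
```

The Kaggle command-line tool looks for the token at `~/.kaggle/kaggle.json`, which is why the file ends up there.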

In [3]:
# Upload kaggle.json 
print("Upload your kaggle.json")
from google.colab import files
kaggle_file = files.upload()

# Change file permission
! chmod 600 kaggle.json 
# Check or make kaggle folder
! (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle)
# Move kaggle.json to kaggle folder
! mv kaggle.json ~/.kaggle/ && echo 'Done'
# Download dataset from kaggle
! kaggle datasets download -d martuza/early-stage-symptoms-of-covid19-patients

# Extract the downloaded .zip archive
file_zip = 'early-stage-symptoms-of-covid19-patients.zip'
with zipfile.ZipFile(file_zip) as zip_file:
    zip_file.extractall('./sars-cov-2/')

print('Done')
print(os.listdir('./sars-cov-2/'))

Upload your kaggle.json


Saving kaggle.json to kaggle.json
Done
Downloading early-stage-symptoms-of-covid19-patients.zip to /content
  0% 0.00/15.5k [00:00<?, ?B/s]
100% 15.5k/15.5k [00:00<00:00, 29.3MB/s]
Done
['covid_early_stage_symptoms.csv']


In [4]:
import pandas as pd
df = pd.read_csv('sars-cov-2/covid_early_stage_symptoms.csv')
df.head()

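| | gender | age_year | fever | cough | runny_nose | muscle_soreness | pneumonia | diarrhea | lung_infection | travel_history | isolation_treatment | SARS-CoV-2 Positive |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | male | 89 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | male | 68 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | male | 68 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | male | 68 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 4 | male | 50 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |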


## **Download dataset from GDrive [Optional]**

In [5]:
# import pandas as pd
# df = pd.read_csv('https://drive.google.com/uc?id=11w6cKQANeVoWnbjNJ8RQy2BLomdfCTBJ')
# df.head()

# **Exploratory Data Analysis (EDA)**

Display the list of columns:

In [6]:
df.columns

Index(['gender', 'age_year', 'fever', 'cough', 'runny_nose', 'muscle_soreness',
       'pneumonia', 'diarrhea', 'lung_infection', 'travel_history',
       'isolation_treatment', 'SARS-CoV-2 Positive'],
      dtype='object')

In [7]:
df.describe()

| | age_year | fever | cough | runny_nose | muscle_soreness | pneumonia | diarrhea | lung_infection | travel_history | isolation_treatment | SARS-CoV-2 Positive |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6512.0 | 6512.0 | 6512.0 | 6512.0 | 6512.0 | 6512.0 | 6512.0 | 6512.0 | 6512.0 | 6512.0 | 6512.0 |
| mean | 44.019502 | 0.41078 | 0.303286 | 0.084306 | 0.003993 | 0.074785 | 0.005682 | 0.131296 | 0.650952 | 0.216984 | 0.2414 |
| std | 16.112865 | 0.492013 | 0.459713 | 0.277867 | 0.063066 | 0.263064 | 0.075169 | 0.33775 | 0.476706 | 0.412223 | 0.427965 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 32.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 43.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 75% | 55.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| max | 96.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
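
For a 0/1 column, the `mean` row is the share of ones, and `std` follows directly from that share. As a check under this reading, the `fever` figures can be reproduced from its value counts (2,675 ones out of 6,512 rows, as listed further down in this notebook). Note that `describe()` reports the sample standard deviation (`ddof=1`), hence the `n / (n - 1)` factor.

```python
import math

n = 6512      # total rows
ones = 2675   # rows with fever == 1

p = ones / n  # share of ones, i.e. the column mean
# Sample standard deviation of a 0/1 variable (ddof=1, as pandas uses)
sample_std = math.sqrt(p * (1 - p) * n / (n - 1))

print(round(p, 5))           # 0.41078
print(round(sample_std, 6))  # 0.492013
```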


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6512 entries, 0 to 6511
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   gender               6512 non-null   object
 1   age_year             6512 non-null   int64 
 2   fever                6512 non-null   int64 
 3   cough                6512 non-null   int64 
 4   runny_nose           6512 non-null   int64 
 5   muscle_soreness      6512 non-null   int64 
 6   pneumonia            6512 non-null   int64 
 7   diarrhea             6512 non-null   int64 
 8   lung_infection       6512 non-null   int64 
 9   travel_history       6512 non-null   int64 
 10  isolation_treatment  6512 non-null   int64 
 11  SARS-CoV-2 Positive  6512 non-null   int64 
dtypes: int64(11), object(1)
memory usage: 610.6+ KB
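
`gender` is the only `object` column, so most estimators will need it converted to a number first. One possible encoding, not necessarily the one used later in this project, maps the two labels to 0/1. The column name `gender_male` and the three-row frame below are made up for illustration.

```python
import pandas as pd

# Tiny illustrative frame; on the real data, use df instead of demo
demo = pd.DataFrame({'gender': ['male', 'female', 'male']})

# Map the two string labels to integers (1 = male, 0 = female)
demo['gender_male'] = demo['gender'].map({'male': 1, 'female': 0})

# Any label missing from the mapping would become NaN, so check for that
assert demo['gender_male'].notna().all()
print(demo['gender_male'].tolist())  # [1, 0, 1]
```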


In [9]:
df.isnull().sum()

gender                 0
age_year               0
fever                  0
cough                  0
runny_nose             0
muscle_soreness        0
pneumonia              0
diarrhea               0
lung_infection         0
travel_history         0
isolation_treatment    0
SARS-CoV-2 Positive    0
dtype: int64

In [10]:
df["SARS-CoV-2 Positive"].value_counts()

0    4940
1    1572
Name: SARS-CoV-2 Positive, dtype: int64
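
These counts show that the target is imbalanced. A short sketch of what that implies for a baseline: always predicting the majority class (0) already reaches the accuracy computed below, so any model should be judged against that figure. The counts are taken from the output above; the variable names are arbitrary.

```python
# Class counts from value_counts() above
counts = {0: 4940, 1: 1572}
total = sum(counts.values())

positive_rate = counts[1] / total                 # share of positive cases
majority_accuracy = max(counts.values()) / total  # accuracy of always predicting 0

print(f"positive rate:     {positive_rate:.4f}")      # 0.2414
print(f"majority baseline: {majority_accuracy:.4f}")  # 0.7586
```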

In [11]:
print('Total Row:', df.shape[0])
print('Total Columns:', df.shape[1])
df.shape

Total Row: 6512
Total Columns: 12


(6512, 12)

In [12]:
col_cat = ['gender', 'fever', 'cough', 'runny_nose', 'muscle_soreness',
           'pneumonia', 'diarrhea', 'lung_infection', 'travel_history',
           'isolation_treatment', 'SARS-CoV-2 Positive']
# Print the frequency of each value for every categorical column
for col in col_cat:
    print(df[col].value_counts(), "\n")

male      3367
female    3145
Name: gender, dtype: int64 

0    3837
1    2675
Name: fever, dtype: int64 

0    4537
1    1975
Name: cough, dtype: int64 

0    5963
1     549
Name: runny_nose, dtype: int64 

0    6486
1      26
Name: muscle_soreness, dtype: int64 

0    6025
1     487
Name: pneumonia, dtype: int64 

0    6475
1      37
Name: diarrhea, dtype: int64 

0    5657
1     855
Name: lung_infection, dtype: int64 

1    4239
0    2273
Name: travel_history, dtype: int64 

0    5099
1    1413
Name: isolation_treatment, dtype: int64 

0    4940
1    1572
Name: SARS-CoV-2 Positive, dtype: int64 

