### Simple Model for Predicting the COVID-19 Test Result or the Possibility of COVID Infection from the Given Data

We will use a dataset of simple COVID-19 attributes with alphanumeric values.
<br></br>
Let's start by importing the libraries.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier  
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import pickle

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Loading Raw Data and Converting It to Cleaned Data

In [None]:
data = pd.read_csv('../input/corona-symptoms-datasets/corona_tested_individuals_ver_006.english.csv')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.columns

Here, the test date is not crucial for prediction, so we remove the column that does not contribute to the prediction model.
That column is `test_date`, which only records the date of testing.


In [None]:
# drop test-date column
data.drop('test_date',axis=1,inplace=True)

In [None]:
# keep only the rows whose symptom columns contain digit values; rows with any non-digit entry are dropped
data = data[data['cough'].apply(lambda x: str(x).isdigit())]
data = data[data['fever'].apply(lambda x: str(x).isdigit())]
data = data[data['sore_throat'].apply(lambda x: str(x).isdigit())]
data = data[data['shortness_of_breath'].apply(lambda x: str(x).isdigit())]
data = data[data['head_ache'].apply(lambda x: str(x).isdigit())]

# final_data is our main data
final_data = data
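
As a quick check of what the filter above keeps, the sketch below applies the same `str(x).isdigit()` test to a handful of made-up values (the sample values are assumptions, not taken from the dataset). Note that a float such as `1.0` fails the test because its string form contains a decimal point, so a column that pandas happened to load as float would lose those rows as well.

```python
# Hypothetical sample values; only the isdigit test mirrors the cell above.
samples = [0, 1, "0", "1", "None", None, 1.0, float("nan")]

def is_digit_value(value):
    """Return True when the value's string form consists only of digits."""
    return str(value).isdigit()

kept = [v for v in samples if is_digit_value(v)]
dropped = [v for v in samples if not is_digit_value(v)]

print("kept:   ", kept)
print("dropped:", dropped)
```

Comparing the row count before and after the filter is a simple way to see how much data this step removes.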

In [None]:
final_data.shape

The shape is now 278594 rows by 9 columns.
<br></br>
Let's see what other patterns we can find in the dataset by performing EDA.

### EDA

Getting to know the data

In [None]:
display("Data to deal", final_data.head())

In [None]:
#size of Data
display("Shape of dataset")
print("Rows:",final_data.shape[0],"\nColumns:",final_data.shape[1])

In [None]:
#checking for the Null values
display('NULL Values', final_data.isnull().sum())

This shows that there are no null values, so the data is clean.

In [None]:
display("Description",final_data.describe())

In [None]:
final_data.info()

In [None]:
# checking the distribution of the data
for i in final_data.columns:
    print("\nColumn Name:",i,5*":",final_data[i].unique(),5*":","Unique Count",len(final_data[i].unique()))

In [None]:
# convert data types of column stated in convert_dict
convert_dict = {'cough': int, 
                'fever': int, 
                'sore_throat': int, 
                'shortness_of_breath': int, 
                'head_ache': int}
final_data = final_data.astype(convert_dict)

In [None]:
for i in final_data.columns:
    print("\nColumn Name:",i,5*":",final_data[i].unique(),5*":","Unique Count",len(final_data[i].unique()))

In [None]:
# frequency plot of corona_result
# x is passed by keyword because newer seaborn releases do not accept it positionally
sns.countplot(x=final_data['corona_result'])

The plot shows that a positive result is far from the norm; interestingly, the result is negative most of the time.
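
Because negatives dominate, plain accuracy can look high even for a model that never predicts a positive. Here is a minimal sketch of that baseline, using made-up label counts (the 90/8/2 split and the label names are assumptions for illustration, not the dataset's actual distribution):

```python
from collections import Counter

# Hypothetical labels: the proportions are illustrative only.
labels = ["negative"] * 90 + ["positive"] * 8 + ["other"] * 2

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a "model" that always predicts the most frequent class.
baseline_accuracy = majority_count / len(labels)

print(counts)
print("Always predicting", majority_label, "scores", baseline_accuracy)
```

Any accuracy reported for a trained model is easier to judge against this kind of majority-class baseline.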

In [None]:
# target (corona_result) vs. feature plots
# x and y are passed by keyword because newer seaborn releases require it
sns.barplot(x=final_data['fever'], y=final_data['corona_result'])

Here we can see that fever is associated with a fair share of the positive COVID-19 results, which suggests that it is one of the major factors.

In [None]:
sns.barplot(x=final_data['shortness_of_breath'], y=final_data['corona_result'])

Shortness of breath shows a strong response here. We can see that a positive test result is strongly associated with breathlessness.

**Preprocessing** the dataset: some of the columns are not numerical, and we need numerical values for training the model.

In [None]:
# label encoding of the non-numerical columns
le = preprocessing.LabelEncoder()
final_data['corona_result'] = le.fit_transform(final_data['corona_result'])
final_data['gender'] = le.fit_transform(final_data['gender'])
final_data['age_60_and_above'] = le.fit_transform(final_data['age_60_and_above'])
final_data['test_indication'] = le.fit_transform(final_data['test_indication'])
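
A rough plain-Python sketch of the idea behind label encoding: the distinct values are sorted and each one is replaced by its position in that sorted list, which is how `LabelEncoder` assigns its codes. The helper name `label_encode` and the sample values are assumptions for illustration.

```python
def label_encode(values):
    """Map each distinct value to an integer code based on sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# Hypothetical column values, not read from the dataset.
column = ["negative", "positive", "other", "negative"]
codes, mapping = label_encode(column)

print(codes)
print(mapping)
```

Because the codes follow sorted order, they carry no meaning of their own. In the cell above the same `le` object is refit for every column, so afterwards it only remembers the classes of the last column; keeping one mapping per column makes it possible to translate predictions back to labels.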

In [None]:
final_data.head()

Here we can see that every column in the table is now numerical.

In [None]:
for i in final_data.columns:
    print("\nColumn Name:",i,"-->",final_data[i].unique(),"-->Unique Count",len(final_data[i].unique()))

## Model

In [None]:
final_data.head()

In [None]:
# now target is y and features in X
y = final_data['corona_result']
X = final_data.drop(['corona_result'], axis = 1)

In [None]:
X.head()

In [None]:
# Splitting the data to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

Splitting is a crucial step if you want to test your model right away. By holding back part of the dataset itself, we can check whether the predicted values are close to the values already present in the data.
<br></br>
In short, this lets us see whether our model works.
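
A minimal sketch of what a hold-out split does, written in plain Python rather than with `train_test_split`: shuffle the row indices with a fixed seed, then set aside 20% of them for testing. The helper name `holdout_split` is hypothetical, and the resulting indices will not match those produced by scikit-learn; only the principle is the same.

```python
import random

def holdout_split(n_rows, test_size=0.20, seed=42):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(10)

print("train:", sorted(train_idx))
print("test: ", sorted(test_idx))
```

The fixed seed plays the same role as `random_state=42` in the cell above: rerunning the split gives the same partition, so results stay comparable between runs.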

### K-Nearest Neighbors

In [None]:
# this will be used to plot the accuracy of the different algorithms
scores_dict = {}

In [None]:
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2 )  
classifier.fit(X_train, y_train)
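
To make the settings above concrete, here is a small from-scratch sketch of the idea (not scikit-learn's implementation): with `metric='minkowski'` and `p=2` the distance is the ordinary Euclidean distance, and the prediction is the majority label among the `n_neighbors` closest training points. The toy points and the helper names are assumptions.

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Minkowski distance; p=2 gives the Euclidean distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(train_X, train_y, query, k=5):
    """Predict the majority label among the k nearest training points."""
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: minkowski(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy binary features, loosely in the spirit of symptom flags.
train_X = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 1), (1, 1), (1, 1)]
train_y = [0, 0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1, 1), k=5))
print(knn_predict(train_X, train_y, (0, 0), k=3))
```

With mostly binary features, many training points sit at exactly the same distance from a query, so in this sketch the choice among tied neighbors depends on their order in the training data.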

In [None]:
pred = classifier.predict(X_test) 
accuracy_knn = accuracy_score(y_test, pred)
print("KNN accuracy_score: ", accuracy_knn)
scores_dict['K-NearestNeighbors'] = accuracy_knn * 100
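
`accuracy_score` reports the share of predictions that match the true labels. A minimal plain-Python equivalent, with made-up labels (the helper name `accuracy` and the sample values are assumptions):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and truth agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels for illustration.
y_true = [0, 0, 1, 1, 0, 2, 0, 1]
y_pred = [0, 0, 1, 0, 0, 0, 0, 1]

print(accuracy(y_true, y_pred))
```

In this toy case both mistakes fall on the less frequent classes, which a single accuracy number does not reveal.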

### Random Forest Classifier

Random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction.
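
The voting step described above can be sketched in a few lines. The three "trees" below are hypothetical hand-written rules standing in for trained decision trees, and the feature names in the sample are assumptions. Note also that scikit-learn's forest combines the trees' class probability estimates rather than counting hard votes, so this is a simplification.

```python
from collections import Counter

# Hypothetical stand-ins for trained trees: each maps a sample to a class.
def tree_a(sample):
    return 1 if sample["fever"] == 1 else 0

def tree_b(sample):
    return 1 if sample["shortness_of_breath"] == 1 else 0

def tree_c(sample):
    return 1 if sample["fever"] == 1 and sample["cough"] == 1 else 0

def forest_predict(trees, sample):
    """Return the class that receives the most votes from the trees."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

sample = {"fever": 1, "cough": 0, "shortness_of_breath": 1}
print(forest_predict([tree_a, tree_b, tree_c], sample))
```

Here two of the three rules vote for class 1, so the ensemble predicts 1 even though one rule disagrees.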

In [None]:
RandomForest = RandomForestClassifier()
RandomForest = RandomForest.fit(X_train, y_train)
predRandomForest = RandomForest.predict(X_test)
accuracy_rf = accuracy_score(y_test, predRandomForest)
print('RandomForest accuracy_score: ', accuracy_rf)
scores_dict['RandomForestClassifier'] = accuracy_rf * 100

In [None]:
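# distribution of the random forest's errors (true label minus predicted label);
# histplot replaces the deprecated distplot, and predRandomForest replaces pred,
# which at this point still holds the KNN predictions
sns.histplot(y_test - predRandomForest)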

### Decision Tree Classifier

In [None]:
DecisionTree = DecisionTreeClassifier()
DecisionTree = DecisionTree.fit(X_train, y_train)
pred = DecisionTree.predict(X_test)
accuracy_dt = accuracy_score(y_test, pred)
print('DecisionTree accuracy_score: ', accuracy_dt)
scores_dict['DecisionTreeClassifier'] = accuracy_dt * 100

Let's compare the accuracy of the three models.

In [None]:
scores_dict

In [None]:
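with sns.color_palette('muted'):
    algo_name = list(scores_dict.keys())
    scores = list(scores_dict.values())

    sns.set(rc={'figure.figsize': (9, 5)})

    # x and y are passed by keyword because newer seaborn releases require it
    ax = sns.barplot(x=algo_name, y=scores)
    # label the axes after plotting so the labels are applied to this plot
    ax.set_xlabel("Algorithms")
    ax.set_ylabel("Accuracy score (%)")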

As we can see, the three models reach almost equally high accuracy.