# Project Description
The data used in this project helps predict whether a person will recover from 
coronavirus symptoms, based on a set of predefined standard symptoms. These symptoms 
follow guidelines published by the World Health Organization (WHO).
This dataset contains daily-level information on the number of affected cases, deaths, and 
recoveries from the 2019 novel coronavirus. Note that this is time series data, so the number 
of cases on any given day is cumulative.
The data is available from 22 January 2020 and is stored in “data.csv”.
The dataset contains 14 major variables that may affect whether someone recovers. 
Each variable is described below:
1. Country: where the person resides
2. Location: the part of the country where the person resides
3. Age: Classification of the age group for each person, based on WHO Age Group Standard
4. Gender: Male or Female 
5. Visited_Wuhan: whether the person has visited Wuhan, China or not
6. From_Wuhan: whether the person is from Wuhan, China or not
7-12. Symptoms: six families of symptoms, coded in six fields (symptom1 to symptom6)
13. Time_before_symptoms_appear: 
14. Result: death (1) or recovered (0)

### First, import the main libraries used throughout the project

In [4]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

### Load and visualize the data. Since it is high-dimensional, we only display it as a table

In [70]:
names = ['location',"country","gender","age","vis_wuhan","from_wuhan","symptom1",
        "symptom2","symptom3","symptom4","symptom5","symptom6","diff_sym_hos","result"]
df = pd.read_csv("data.csv",header=None,names=names)
print(df.loc[0:10])  # show the first 11 rows (label-based slice, end label included)

     location country gender   age vis_wuhan from_wuhan symptom1 symptom2  \
0.0       104       8      1  66.0         1          0       14       31   
1.0       101       8      0  56.0         0          1       14       31   
2.0       137       8      1  46.0         0          1       14       31   
3.0       116       8      0  60.0         1          0       14       31   
4.0       116       8      1  58.0         0          0       14       31   
5.0        23       8      0  44.0         0          1       14       31   
6.0       105       8      1  34.0         0          1       14       31   
7.0        13       8      1  37.0         1          0       14       31   
8.0        13       8      1  39.0         1          0       14       31   
9.0        13       8      1  56.0         1          0       14       31   
10.0       13       8      0  18.0         1          0       14       31   

     symptom3 symptom4 symptom5 symptom6 diff_sym_hos result  
0.0        1
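
The `df.loc[0:10]` call above slices by index label and includes both endpoints, which is why it yields 11 rows rather than 10. The sketch below contrasts it with positional access on a small made-up frame; the name `demo` and the column `value` are placeholders for illustration, not part of data.csv.

```python
import pandas as pd

# Hypothetical stand-in for the real table; the values are illustrative only.
demo = pd.DataFrame({"value": range(20)})

by_label = demo.loc[0:10]      # label-based slice, both ends included
by_position = demo.iloc[0:10]  # position-based slice, end excluded
first_rows = demo.head(11)     # first 11 rows, whatever the index labels are

print(len(by_label), len(by_position), len(first_rows))
```

When the index is not a clean 0..n-1 range, `head(n)` or `iloc` is usually the safer way to ask for "the first n rows".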

### Divide the data into three partitions: training, validation, and testing
#### Use the conventional 70% training, 15% validation, and 15% testing partitioning with random shuffling

In [83]:
from sklearn.model_selection import train_test_split
print(df.shape)
X = df.drop(columns=['result'])  #dropping the column of the target
Y = df['result']


train_ratio = 0.7
validation_ratio = 0.15
test_ratio = 0.15

# Split the data using train_test_split with randomness
XTrain, X_temp, YTrain, Y_temp = train_test_split(X, Y, test_size=1 - train_ratio, random_state=42)
XValidation, XTest, YValidation, YTest = train_test_split(
    X_temp, Y_temp, test_size=test_ratio / (validation_ratio + test_ratio), random_state=42)


(864, 14)
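
As a quick sanity check on the 70/15/15 split, the partition sizes can be printed and compared with the total row count. The sketch below repeats the same two-step split on a synthetic frame of the same shape (864 rows, 14 columns). The frame contents, the name `demo_df`, and the feature names `f0` to `f12` are placeholders, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic placeholder with the same shape as the real table (864 x 14).
rng = np.random.default_rng(0)
cols = [f"f{i}" for i in range(13)] + ["result"]
demo_df = pd.DataFrame(rng.integers(0, 2, size=(864, 14)), columns=cols)

X = demo_df.drop(columns=["result"])
Y = demo_df["result"]

# Step 1: hold out 30% of the rows. Step 2: split the held-out part in half.
XTrain, X_temp, YTrain, Y_temp = train_test_split(
    X, Y, test_size=0.3, random_state=42)
XValidation, XTest, YValidation, YTest = train_test_split(
    X_temp, Y_temp, test_size=0.5, random_state=42)

# Report each partition's size and its share of the full dataset.
for name, part in [("train", XTrain), ("validation", XValidation), ("test", XTest)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(X):.1%})")
```

Because 864 rows cannot be divided exactly in these ratios, the resulting shares are only approximately 70/15/15.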


## First classification method: KNN