# Project Description
The data used in this project helps predict whether a person will recover from 
coronavirus symptoms, based on a set of pre-defined standard symptoms. These symptoms 
follow guidelines given by the World Health Organization (WHO).
The dataset has daily-level information on the number of affected cases, deaths, and recoveries from 
the 2019 novel coronavirus. Please note that this is time series data, so the number of cases on 
any given day is the cumulative number.
The data is available from 22 Jan, 2020 and is stored in “data.csv”.
The dataset contains 14 major variables that have an impact on whether someone has 
recovered or not. Each variable is described below:
1. Country: where the person resides
2. Location: which part of the country
3. Age: Classification of the age group for each person, based on WHO Age Group Standard
4. Gender: Male or Female 
5. Visited_Wuhan: whether the person has visited Wuhan, China or not
6. From_Wuhan: whether the person is from Wuhan, China or not
7. Symptoms: there are six families of symptoms that are coded in six fields.
13. Time_before_symptoms_appear: 
14. Result: death (1) or recovered (0)

### First, we import the main libraries for the whole project

In [8]:
import math
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

### Load and inspect the data. Since it is high-dimensional, we will only look at the table of data

In [9]:
names = ['location',"country","gender","age","vis_wuhan","from_wuhan","symptom1",
        "symptom2","symptom3","symptom4","symptom5","symptom6","diff_sym_hos","result"]
df = pd.read_csv("data.csv", header=None, skiprows=1, names=names)
print(df.loc[0:10])  # show the first 11 rows (.loc slices include the end label)

    location  country  gender   age  vis_wuhan  from_wuhan  symptom1  \
0        104        8       1  66.0          1           0        14   
1        101        8       0  56.0          0           1        14   
2        137        8       1  46.0          0           1        14   
3        116        8       0  60.0          1           0        14   
4        116        8       1  58.0          0           0        14   
5         23        8       0  44.0          0           1        14   
6        105        8       1  34.0          0           1        14   
7         13        8       1  37.0          1           0        14   
8         13        8       1  39.0          1           0        14   
9         13        8       1  56.0          1           0        14   
10        13        8       0  18.0          1           0        14   

    symptom2  symptom3  symptom4  symptom5  symptom6  diff_sym_hos  result  
0         31        19        12         3         1      

### Divide the data into three partitions: training, validation, and testing
#### Used the conventional 70% training, 15% validation, and 15% testing partitioning with random shuffling

In [58]:
from sklearn.model_selection import train_test_split
print(df.shape)
X = df.drop(columns=['result'])  #dropping the column of the target
Y = df['result']

randomState=44
train_ratio = 0.7
validation_ratio = 0.15
test_ratio = 0.15

# Split the data using train_test_split with randomness
XTrain, X_temp, YTrain, Y_temp = train_test_split(X, Y, test_size=1 - train_ratio, random_state=randomState)
XValidation, XTest, YValidation, YTest = train_test_split(
    X_temp, Y_temp, test_size=test_ratio / (validation_ratio + test_ratio), random_state=randomState)

print(YTrain)


(863, 14)
786    0
190    0
764    0
52     0
851    0
      ..
571    0
173    0
753    0
419    1
788    0
Name: result, Length: 604, dtype: int64
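
The length of 604 printed above follows from how the split ratios translate into row counts. The sketch below reproduces that arithmetic in plain Python, assuming scikit-learn rounds the held-out share of each split up to a whole row and gives the remaining rows to the other partition. `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_rows, train_ratio, validation_ratio, test_ratio):
    """Hypothetical helper: row counts for the two-stage split used above.

    Assumes the held-out share of each split is rounded up (ceil) and the
    remaining rows go to the other side.
    """
    # Stage 1: hold out everything that is not training data.
    n_temp = math.ceil((1 - train_ratio) * n_rows)
    n_train = n_rows - n_temp
    # Stage 2: divide the held-out rows between validation and test.
    n_test = math.ceil(test_ratio / (validation_ratio + test_ratio) * n_temp)
    n_validation = n_temp - n_test
    return n_train, n_validation, n_test

sizes = split_sizes(863, 0.7, 0.15, 0.15)
print(sizes)
```

Under these assumptions, the 863 rows split into 604 training, 129 validation, and 130 test rows, so the partitions are close to, but not exactly, 70/15/15.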


## First Classification method is KNN

### Normalizing, Fitting, Choosing Parameters

1- I used StandardScaler from sklearn to normalize the data, because some features are in the hundreds while others are just binary <br>
2- I tried odd values of k from 1 up to the square root of the number of training samples, a common heuristic that works well in practice <br>
3- I used the validation data to choose the best k
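
To illustrate why scaling matters for a distance-based method like KNN, here is a minimal sketch of the standardization step in plain Python. It assumes StandardScaler's default behavior: subtract the column mean and divide by the population standard deviation. `standardize` is a hypothetical helper, and the two columns are the first four `location` and `vis_wuhan` values from the table above.

```python
import statistics

def standardize(column):
    """Z-score a list of numbers using the population standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]

location = [104, 101, 137, 116]   # values in the hundreds
vis_wuhan = [1, 0, 0, 1]          # binary flag

location_scaled = standardize(location)
vis_wuhan_scaled = standardize(vis_wuhan)
print([round(v, 3) for v in location_scaled])
print([round(v, 3) for v in vis_wuhan_scaled])
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates the distance computation simply because of its units.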

In [59]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
XTrain_scaled_ForKNN = scaler.fit_transform(XTrain)
XValidation_scaled_ForKNN = scaler.transform(XValidation)
xTest_scaled_ForKNN = scaler.transform(XTest)

# Upper bound for k: the square root of the number of training samples
num_training_samples = len(XTrain_scaled_ForKNN)
max_k = int(math.sqrt(num_training_samples))

best_k = None
best_f1 = 0.0
for k in range(1, max_k + 1, 2):  # odd values of k only
    knn_classifier = KNeighborsClassifier(n_neighbors=k)
    knn_classifier.fit(XTrain_scaled_ForKNN,YTrain)
    y_pred_KNN = knn_classifier.predict(XValidation_scaled_ForKNN)

    f1_KNN = f1_score(YValidation, y_pred_KNN)

    if f1_KNN > best_f1:
        best_f1 = f1_KNN
        best_k = k
print("Best F1 Score",best_f1)
print("The K with the best F1 Score",best_k)

Best F1 Score 0.717948717948718
The K with the best F1 Score 3
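
The search above can be summarized in two small steps: build the list of candidate k values, then keep the one with the highest validation F1. The sketch below shows both in isolation. `candidate_ks` and `pick_best_k` are hypothetical helpers, and the scores passed to `pick_best_k` are made up for illustration; they are not the notebook's results.

```python
import math

def candidate_ks(num_training_samples):
    """Odd k values from 1 up to floor(sqrt(n)), as in the loop above."""
    max_k = int(math.sqrt(num_training_samples))
    return list(range(1, max_k + 1, 2))

def pick_best_k(validation_f1_by_k):
    """Return (k, f1) with the highest score; strict '>' keeps the smallest k on ties."""
    best_k, best_f1 = None, 0.0
    for k, f1 in validation_f1_by_k.items():
        if f1 > best_f1:
            best_k, best_f1 = k, f1
    return best_k, best_f1

ks = candidate_ks(604)
print(ks)

# Hypothetical validation scores, for illustration only.
made_up_scores = {1: 0.60, 3: 0.70, 5: 0.70, 7: 0.65}
print(pick_best_k(made_up_scores))
```

With 604 training rows this gives the twelve odd values from 1 to 23. One consequence of starting `best_f1` at 0.0 with a strict comparison is that `best_k` stays `None` if every candidate scores exactly 0, so it may be worth checking for that case before using the result.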


### Testing
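Now I will evaluate the model on the test data using K=3

In [60]:
# Refit with best_k: after the loop, knn_classifier holds the last k tried (23), not best_k.
# The score recorded below predates this refit, so a rerun may print a different value.
knn_classifier = KNeighborsClassifier(n_neighbors=best_k).fit(XTrain_scaled_ForKNN, YTrain)
y_test_KNN = knn_classifier.predict(xTest_scaled_ForKNN)
f1_Test_KNN = f1_score(YTest, y_test_KNN)
print("F1 Score for testing",f1_Test_KNN)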

F1 Score for testing 0.5000000000000001
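
For reference, the F1 score reported here is the harmonic mean of precision and recall for the positive class, which `f1_score` takes to be label 1 (death) by default. The sketch below computes it directly from confusion counts. `f1_from_counts` is a hypothetical helper, and the counts are invented to illustrate the formula; they are not taken from this notebook's test set.

```python
def f1_from_counts(true_positives, false_positives, false_negatives):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        return 0.0  # no positives predicted or present; this sketch treats the score as 0
    return 2 * true_positives / denominator

# Hypothetical counts, chosen only to illustrate the formula.
example_f1 = f1_from_counts(true_positives=2, false_positives=1, false_negatives=3)
print(example_f1)
```

Because true negatives do not appear in the formula, F1 is a reasonable choice when most rows belong to the negative class, as the mostly-zero `result` column printed earlier suggests.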
