# NOTE
This notebook is for educational purposes only; it cannot replace a specialized doctor.



In [None]:
# IMPORTANT
# scikit-learn must be downgraded to version 0.23.2 because sklearn-porter doesn't work with later versions
# If a future version of sklearn-porter supports newer scikit-learn releases, you may skip this cell
!pip install --upgrade scikit-learn==0.23.2
!pip install sklearn-porter

In [None]:
# Import all necessary modules, classes and functions
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn_porter import Porter

The dataset used in this notebook was downloaded from [Kaggle](https://www.kaggle.com/hemanthhari/symptoms-and-covid-presence).
>**Step 1:** Loading the dataset

In [None]:
# Loading the data from the CSV file
covid = pd.read_csv("../covid_dataset.csv")
covid

Now we have to encode every 'Yes' as 1 and every 'No' as 0 using [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
>**Step 2:** Encoding the data

In [None]:
# Creating an instance of the LabelEncoder
encoder = LabelEncoder()

# Loop through every column of the dataset, encoding 'Yes' as 1 and 'No' as 0
for column in covid.columns:
    covid[column] = encoder.fit_transform(covid[column])

covid
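
The encoding step can be checked on a toy example. The sketch below is not part of the original notebook; it assumes the columns hold only the strings 'Yes' and 'No'. LabelEncoder assigns codes in sorted label order, so 'No' becomes 0 and 'Yes' becomes 1. One caveat worth knowing: because the encoder is refitted on every column, a column that contains only a single value is encoded as all zeros, whichever value that is.

```python
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical, not taken from the dataset)
toy_encoder = LabelEncoder()
mixed = toy_encoder.fit_transform(["Yes", "No", "Yes"])

# Recover the label -> code mapping learned by the encoder
mapping = {
    str(label): int(code)
    for label, code in zip(toy_encoder.classes_,
                           toy_encoder.transform(toy_encoder.classes_))
}

# A column holding a single distinct value is encoded as all zeros
only_yes = LabelEncoder().fit_transform(["Yes", "Yes"])

print(mixed.tolist())
print(mapping)
print(only_yes.tolist())
```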

Before using the dataset, it is good practice to filter it, keeping only the most relevant features. To do so, we calculate the correlation between the dependent column, "COVID-19", and the remaining columns using the method [corr()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html).

The closer a correlation coefficient is to zero, the weaker the relationship. (Note: a negative coefficient means the columns are inversely related.)
>**Step 3:** Analyzing the dataset to extract the most relevant features

In [None]:
# Calculating the correlation matrix and retrieving each column's correlation with the 'COVID-19' column
corr_covid19 = covid.corr()["COVID-19"]

# Keeping only columns with correlation greater than 0.05; negative and NaN values are excluded
most_relevant_col = corr_covid19[corr_covid19 > 0.05]

# Retrieving names of most relevant columns
most_relevant_col_names = most_relevant_col.index
most_relevant_col_names
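
To see what the threshold does, here is a minimal sketch on a toy DataFrame. The column names `fever`, `mask` and `constant` are hypothetical and chosen only for illustration. A constant column yields a NaN correlation and an inversely related column yields a negative one, so both are dropped by `> 0.05`. Filtering on the absolute value instead would keep inversely related columns; that is an alternative, not what this notebook does.

```python
import pandas as pd

# Toy data (hypothetical): one identical feature, one inverse feature, one constant
toy = pd.DataFrame({
    "COVID-19": [1, 0, 1, 0],
    "fever":    [1, 0, 1, 0],   # perfectly correlated with the target
    "mask":     [0, 1, 0, 1],   # perfectly inversely correlated
    "constant": [0, 0, 0, 0],   # no variance, so the correlation is NaN
})

toy_corr = toy.corr()["COVID-19"]

# Same rule as the notebook: positive correlation above 0.05 only
kept_positive = toy_corr[toy_corr > 0.05].index.tolist()

# Alternative rule: keep strong correlations regardless of sign
kept_abs = toy_corr[toy_corr.abs() > 0.05].index.tolist()

print(kept_positive)
print(kept_abs)
```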

>**Step 4:** Filtering the dataset keeping only the most related columns (features)

In [None]:
# Filtering the dataset keeping only the most relevant columns
covid = covid[most_relevant_col_names]
covid

>**Step 5:** Splitting our dataset into two datasets, X as input and y as output, then splitting each into training and testing datasets

In [None]:
# Splitting the main dataset into two datasets, X as input and y as output
X = covid.drop('COVID-19', axis=1)
y = covid['COVID-19']

# Splitting the input dataset (X) and the output dataset (y) into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
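
The split above is random, so the reported scores change from run to run. As a sketch of one possible variation (an assumption, not what the notebook does), `train_test_split` accepts `random_state` to make the split reproducible and `stratify` to keep the class proportions equal in both parts. The toy data below is hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 rows with a balanced binary target
toy_X = pd.DataFrame({"symptom": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]})
toy_y = pd.Series([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# random_state fixes the shuffle; stratify keeps the 50/50 class ratio in each part
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=42, stratify=toy_y
)

print(len(toy_X_train), len(toy_X_test))
print(sorted(toy_y_test.tolist()))
```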

Before building our classifier, it can be difficult to choose good settings (hyperparameters). To make life easier we can use either [GridSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) or [RandomizedSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html). We will use the latter because it evaluates only a random sample of the parameter combinations, which makes it faster.
>**Step 6:** Run a randomized search to find the best hyperparameters for a DecisionTreeClassifier; the search fits the model to the training data as it runs

In [None]:
# Instantiating a DecisionTreeClassifier as the estimator for the randomized search
model_dt = DecisionTreeClassifier()

# Setting the hyperparameters we want to optimize
param_dist = {"max_depth": list(range(6, 10)),
              "min_samples_split": list(range(4, 8)),
              "min_samples_leaf": list(range(4, 8)),
              "criterion": ["gini", "entropy"]}

# Creating an instance of RandomizedSearchCV, supplying the estimator and the parameter distributions
search = RandomizedSearchCV(model_dt, param_dist)

# Fitting the randomized search to our training dataset
search.fit(X_train, y_train)

# Retrieving the best estimator found, which is the model we will use
model_dt_best = search.best_estimator_

# Making a prediction using our testing dataset
y_pred = model_dt_best.predict(X_test)

# Calculating accuracy and F1 score
ac = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Printing the best parameters found, followed by the accuracy and F1 score
print(search.best_params_)
print("accuracy", ac)
print("f1", f1)
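
To make the speed difference between the two search strategies concrete, the sketch below counts the candidates each one would evaluate for the parameter lists used above. It relies on one documented default: `RandomizedSearchCV` samples `n_iter=10` candidates unless told otherwise, while a grid search tries every combination. The name `toy_param_dist` is only used for this illustration.

```python
import math

# Same parameter lists as in the search above
toy_param_dist = {
    "max_depth": list(range(6, 10)),
    "min_samples_split": list(range(4, 8)),
    "min_samples_leaf": list(range(4, 8)),
    "criterion": ["gini", "entropy"],
}

# A grid search evaluates the full Cartesian product of all lists
grid_candidates = math.prod(len(values) for values in toy_param_dist.values())

# A randomized search evaluates n_iter sampled candidates (default: 10)
randomized_candidates = 10

print("grid search candidates:", grid_candidates)
print("randomized search candidates:", randomized_candidates)
```

Each candidate is additionally fitted once per cross-validation fold, so the gap in total training runs is larger still. The trade-off is that a randomized search may miss the best combination in the grid.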

After some trial and error we should have a good model ready to deploy, but what if you want to deploy it to a different platform that uses a different programming language? One solution is [sklearn-porter](https://pypi.org/project/sklearn-porter/).
>**Step 7:** Port the model to your preferred programming language

In [None]:
# Instantiating Porter with our best model and Java as the target language
porter = Porter(model_dt_best, language='java')

# Port our model to the desired language as a string
output = porter.export(embed_data=True)

# Creating a file and writing the output string to it
with open("DecisionTreeClassifier.java", "w") as f:
    f.write(output)

# Print the output
print(output)