# Classify Single Input

## Description

This notebook classifies a single new data entry to determine whether a sample of water is potable. It uses previously trained models to classify the new input and determine the outcome.

## Imports and Configuration

### Imports needed

In [80]:
import pickle

### Load helper methods

The [helper_methods.ipynb](helper_methods.ipynb) notebook contains the paths to the trained models we need (found in the `models/` folder), as well as some other useful functions.

In [81]:
%run helper_methods.ipynb
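
The `predict_outcome` function called later in this notebook is not defined here and presumably comes from the helper notebook, which is not reproduced. As a rough sketch only, assuming the models encode potable as `1` and not potable as `0`, a helper producing similar output might look like the following. The name `describe_outcome` and its body are hypothetical, not the project's actual implementation.

```python
def describe_outcome(x_new, y_new):
    """Return a two-line summary for one sample and its predicted label.

    Assumption: label 1 means potable and 0 means not potable.
    """
    features = list(x_new[0])  # first (and only) row of the 2D input
    label = "Potable" if int(y_new[0]) == 1 else "Not potable"
    return f"X = {features}\nPredicted Outcome: {label}"


# Example with made-up values (two features only, for brevity)
print(describe_outcome([[7.0, 200.0]], [0]))
```

Returning a string rather than printing directly keeps the formatting easy to check separately from the model call.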

## Load Models

We can now load the saved models from disk and use them to classify new samples.

In [82]:
#Load the saved models, closing each file once it has been read
def load_model(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

random_forest = load_model(random_forest_saved_model)
decision_tree = load_model(decision_tree_saved_model)
knn = load_model(knn_saved_model)
mlp = load_model(mlp_saved_model)
svc = load_model(svc_saved_model)
gaussian_nb = load_model(gaussian_nb_saved_model)
adaboost = load_model(boosting_saved_model)

https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations


## Load new data

We can create a new data entry to see the prediction. This new data sample has no predetermined outcome indicating whether it is potable, so we rely on our models to give us that prediction.

In [99]:
#New samples of water
sample1 = [[5.064867,   #pH
          132.894724,   #Hardness
          12642.385122, #Solids
          6.546600,     #Chloramines
          310.135738,   #Sulfate
          398.410813,   #Conductivity
          24.000385,    #Organic Carbon
          19.836572,    #Trihalomethanes
          3.030454]]    #Turbidity

sample2 = [[6.702547,   #pH
          207.321086,   #Hardness
          17246.920347, #Solids
          7.708117,     #Chloramines
          304.510230,   #Sulfate
          329.266002,   #Conductivity
          16.217303,    #Organic Carbon
          28.878601,    #Trihalomethanes
          3.442983]]    #Turbidity

X_new = sample1
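
Each sample is written as a nested list (one row of nine values) because scikit-learn's `predict` expects a 2D input of shape `(n_samples, n_features)`. A minimal sketch of a shape check that could run before predicting follows; `check_sample` and `N_FEATURES` are hypothetical names, and the count of nine is taken from the samples above.

```python
N_FEATURES = 9  # assumption: number of values in each sample above


def check_sample(sample, n_features=N_FEATURES):
    """Raise ValueError unless sample is a list of rows with n_features values each."""
    if not sample or not all(isinstance(row, (list, tuple)) for row in sample):
        raise ValueError("sample must be a non-empty list of rows, e.g. [[...]]")
    for row in sample:
        if len(row) != n_features:
            raise ValueError(f"expected {n_features} features, got {len(row)}")
    return sample


# A nested single-row list passes; a flat list such as [0.0] * 9 would raise ValueError
check_sample([[0.0] * 9])
```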

## Running the Models

Now we can run `X_new` through each of the models to see what the predicted outcome will be.
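
The per-model cells in this section all follow the same pattern: call `predict` on `X_new`, then report the result. As a sketch of that pattern written as one loop, the example below uses two stand-in classes in place of the loaded models so that it runs on its own. `AlwaysZero`, `AlwaysOne`, and `run_all` are hypothetical and not part of the project.

```python
class AlwaysZero:
    """Stand-in model: predicts 0 for every row."""
    def predict(self, X):
        return [0 for _ in X]


class AlwaysOne:
    """Stand-in model: predicts 1 for every row."""
    def predict(self, X):
        return [1 for _ in X]


def run_all(models, X):
    """Call predict on each model and collect the first row's label by name."""
    return {name: model.predict(X)[0] for name, model in models.items()}


models = {"model_a": AlwaysZero(), "model_b": AlwaysOne()}
results = run_all(models, [[1.0, 2.0, 3.0]])
print(results)
```

With the real models, the dictionary could map names such as `"Random Forest"` to `random_forest`, and so on for the others.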

### Random Forest

In [92]:
y_new = random_forest.predict(X_new)

predict_outcome(X_new, y_new)

X = [5.064867, 132.894724, 12642.385122, 6.5466, 310.135738, 398.410813, 24.000385, 19.836572, 3.030454]
Predicted Outcome: Not potable


### Decision Tree

In [93]:
y_new = decision_tree.predict(X_new)

predict_outcome(X_new, y_new)

X = [5.064867, 132.894724, 12642.385122, 6.5466, 310.135738, 398.410813, 24.000385, 19.836572, 3.030454]
Predicted Outcome: Not potable


### K Nearest Neighbors

In [94]:
y_new = knn.predict(X_new)

predict_outcome(X_new, y_new)

X = [5.064867, 132.894724, 12642.385122, 6.5466, 310.135738, 398.410813, 24.000385, 19.836572, 3.030454]
Predicted Outcome: Potable


### MLP

In [95]:
y_new = mlp.predict(X_new)

predict_outcome(X_new, y_new)

X = [5.064867, 132.894724, 12642.385122, 6.5466, 310.135738, 398.410813, 24.000385, 19.836572, 3.030454]
Predicted Outcome: Potable


### SVC

In [96]:
y_new = svc.predict(X_new)

predict_outcome(X_new, y_new)

X = [5.064867, 132.894724, 12642.385122, 6.5466, 310.135738, 398.410813, 24.000385, 19.836572, 3.030454]
Predicted Outcome: Not potable


### Gaussian Naive Bayes

In [97]:
y_new = gaussian_nb.predict(X_new)

predict_outcome(X_new, y_new)

X = [5.064867, 132.894724, 12642.385122, 6.5466, 310.135738, 398.410813, 24.000385, 19.836572, 3.030454]
Predicted Outcome: Potable


### Boosting

In [98]:
y_new = adaboost.predict(X_new)

predict_outcome(X_new, y_new)

X = [5.064867, 132.894724, 12642.385122, 6.5466, 310.135738, 398.410813, 24.000385, 19.836572, 3.030454]
Predicted Outcome: Potable


## Analysis

As we can see, different models give different results for the same sample: Random Forest, Decision Tree, and SVC predict not potable, while K Nearest Neighbors, MLP, Gaussian Naive Bayes, and Boosting predict potable. We know from calculating the accuracies of the models in [model_training.ipynb](model_training.ipynb) that we get an average accuracy of 64.16% across all the models. This suggests that our training data is not the best, meaning the machine learning models can be expected to make mistakes (roughly 36% of the time). A better and more realistic dataset could be used to improve the accuracies of our models and give us more reliable results when predicting unknown samples.
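
To make the disagreement concrete, the predictions printed above for `sample1` can be tallied, and the error rate implied by the reported average accuracy can be computed. This is only a sketch: the dictionary transcribes the outputs shown in this notebook, and the majority vote is an illustration, not a method the project uses.

```python
from collections import Counter

# Predictions for sample1, transcribed from the outputs above
predictions = {
    "Random Forest": "Not potable",
    "Decision Tree": "Not potable",
    "K Nearest Neighbors": "Potable",
    "MLP": "Potable",
    "SVC": "Not potable",
    "Gaussian Naive Bayes": "Potable",
    "Boosting": "Potable",
}

votes = Counter(predictions.values())
majority_label, majority_count = votes.most_common(1)[0]
print(votes)           # 4 votes for Potable, 3 for Not potable
print(majority_label)

# Error rate implied by the reported 64.16% average accuracy
average_accuracy = 0.6416
error_rate = 1 - average_accuracy
print(f"{error_rate:.2%}")  # 35.84%
```

A 4 to 3 split among models that are each wrong about a third of the time is weak evidence either way, which is consistent with the caution expressed above.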