# Introduction
This notebook walks you through using K-Nearest Neighbors (KNN) to develop a model which is fitted to a training data set. Using the model, decision boundaries can be determined and predictions made about which class a sample falls into. Principal component analysis (PCA) is conducted alongside KNN in order to take into account relationships between variables when developing the model.
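
To make the idea behind KNN concrete, here is a minimal sketch of the majority vote it performs. The function `knn_predict` and the toy data are made up for illustration and are not part of this notebook's model, which uses scikit-learn's `KNeighborsClassifier` instead.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, sample, k=3):
    """Classify one sample by majority vote among its k nearest training points."""
    distances = np.linalg.norm(train_X - sample, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    votes = Counter(train_y[nearest].tolist())             # count the class labels among them
    return votes.most_common(1)[0][0]                      # the most frequent label wins

# Toy data (made up for illustration): two features, two classes
toy_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

prediction = knn_predict(toy_X, toy_y, np.array([3.9, 4.1]), k=3)
print(prediction)
```

The new sample sits among the second cluster, so all three of its nearest neighbours carry the label 1.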

Here we are working through a data set which includes different features of tumours and classifies them as either benign or malignant.

Follow the instructions and work through the notebook.

# To start with...
We will need to import the necessary modules.

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)        # to silence future warning messages
import numpy as np                                                    # for working with arrays
import pandas as pd                                                   # for storing our data in DataFrames
from sklearn.model_selection import train_test_split                  # to split data into test and training data
from sklearn.decomposition import PCA                                 # to conduct the PCA
from sklearn.neighbors import KNeighborsClassifier                    # we will be using a K-Nearest Neighbours model
from sklearn.preprocessing import StandardScaler                      # for rescaling each variable
from sklearn.pipeline import make_pipeline                            # to make a pipeline we can pass our data through
from sklearn.model_selection import GridSearchCV                      # for optimising the hyper-parameters

# The Data
If you would like to import and use your own data set, do not run the next cell; skip to the one after it.

Remember, your data set must be correctly formatted as a CSV file. Your columns should include those labelled: "mean radius", "mean texture", "mean perimeter", "mean area", "mean smoothness", "mean compactness", "mean concavity", "mean concave points", "mean symmetry", "mean fractal dimension". The column representing the $y$ result should only contain values of 0 or 1, using the same encoding as scikit-learn's load_breast_cancer data: 0 represents a malignant diagnosis and 1 a benign diagnosis.
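
Before going further with your own data, it can help to confirm that it has the expected layout. The sketch below is one possible check; the function `check_data` and the column name `"target"` for the $y$ result are assumptions made for illustration, so adjust them to match your file.

```python
import pandas as pd

REQUIRED_COLUMNS = [
    "mean radius", "mean texture", "mean perimeter", "mean area",
    "mean smoothness", "mean compactness", "mean concavity",
    "mean concave points", "mean symmetry", "mean fractal dimension",
]

def check_data(data, target_column="target"):
    """Raise a ValueError if the DataFrame does not have the expected layout."""
    missing = [column for column in REQUIRED_COLUMNS if column not in data.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    if target_column not in data.columns:
        raise ValueError(f"Missing target column: {target_column!r}")
    unexpected = set(data[target_column].unique()) - {0, 1}
    if unexpected:
        raise ValueError(f"Target column contains values other than 0 and 1: {unexpected}")
    return True

# A tiny made-up DataFrame with the right layout, just to show the check passing
example = pd.DataFrame({column: [1.0, 2.0] for column in REQUIRED_COLUMNS})
example["target"] = [0, 1]
print(check_data(example))
```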


In [2]:
# Run this cell if you do not have your own data set to work with

# Here we will import a data set from scikit-learn called 'load_breast_cancer'
from sklearn.datasets import load_breast_cancer

# We can now split the data to have all the features grouped together in a DataFrame (X) and the y result put in a separate Series.
X, y = load_breast_cancer(as_frame=True, return_X_y=True)

In [None]:
# Run this cell if you have your own correctly formatted data set you would like to import and explore
name_data = " " # write, between the quote marks, the file path of your data CSV (including the .csv extension),
                # or just the filename if the file is in the same directory as this notebook.
                # A URL can also be used to import your data set
target_column = "target" # assumed name of the column holding the y result; change it to match your CSV

data = pd.read_csv(name_data)

# We can now split the data to have all the features grouped together in a DataFrame (X) and the y result put in a separate Series.
X = data.drop(columns=[target_column])
y = data[target_column]

In [3]:
#### Run this Cell no matter your data source!
# We want to use only the subset of the data which holds the mean of each variable, and not the standard error or worst value of each variable
X_subset = X[["mean radius", "mean texture", "mean perimeter", "mean area", "mean smoothness", "mean compactness", "mean concavity", "mean concave points", "mean symmetry", "mean fractal dimension"]]

# We can now randomly split the subset data into training and test subsets
train_X_s, test_X_s, train_y_s, test_y_s = train_test_split(X_subset, y, random_state=42)

# Using machine learning to optimise the model
We will use the function "optimise_model" from the file "model" to do this. The output will be a table containing different values for the hyper-parameters and a score showing how well each combination performs.
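
The file "model" is not shown in this notebook, so the following is only a guess at what such a function could look like, based on the columns of the table it returns. The name `optimise_model_sketch`, the use of `GridSearchCV` over the number of PCA components, and the 5-fold cross-validation are all assumptions; the real `optimise_model` may work differently.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def optimise_model_sketch(train_X, train_y, test_X, test_y, neighbours=range(1, 4)):
    """Hypothetical stand-in for model.optimise_model; the real function may differ."""
    component_options = list(range(1, train_X.shape[1] + 1))
    rows = []
    for k in neighbours:
        pipeline = make_pipeline(StandardScaler(), PCA(), KNeighborsClassifier(n_neighbors=k))
        # Cross-validated search over the number of PCA components for this value of k
        search = GridSearchCV(pipeline, {"pca__n_components": component_options}, cv=5)
        search.fit(train_X, train_y)
        rows.append({
            "number of neighbours": k,
            "best PCA": search.best_params_["pca__n_components"],
            "score": search.score(test_X, test_y),  # accuracy of the refitted best model on the test data
        })
    return pd.DataFrame(rows)

# Try it on the mean features of scikit-learn's breast cancer data
X, y = load_breast_cancer(as_frame=True, return_X_y=True)
X_means = X.loc[:, X.columns.str.startswith("mean ")]
train_X, test_X, train_y, test_y = train_test_split(X_means, y, random_state=42)

sketch_results = optimise_model_sketch(train_X, train_y, test_X, test_y)
print(sketch_results)
```

Only three values of the number of neighbours are tried here to keep the run short; pass a wider `neighbours` range to search further.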

In [5]:
import model   #import the file needed (make sure it is in the same directory as the notebook)
results = model.optimise_model(train_X_s, train_y_s, test_X_s, test_y_s)  #the test and training data subset are put as arguments
print(results)

    number of neighbours  best PCA     score
0                      1         3  0.895105
1                      2         2  0.881119
2                      3         5  0.923077
3                      4         5  0.937063
4                      5         6  0.951049
5                      6         2  0.909091
6                      7         5  0.958042
7                      8         5  0.944056
8                      9         5  0.958042
9                     10         5  0.972028
10                    11         8  0.965035
11                    12         9  0.951049
12                    13         5  0.951049
13                    14         7  0.958042
14                    15         6  0.972028
15                    16         7  0.958042
16                    17         6  0.972028
17                    18         8  0.972028
18                    19         7  0.965035
19                    20         8  0.965035
20                    21         4  0.958042
21        

In [6]:
max_score = results["score"].max() # to get the score for the best performing model
index_of_max_score = results["score"].idxmax() # get index of the score
parameters_for_optimum = results.loc[index_of_max_score] # obtain the parameters which lead to the score

print(" The parameters for the optimum model are:")
print(parameters_for_optimum)

if max_score >= 0.97:
    print(f" The optimised model is very good, with a score of {max_score:.2f}")
else:
    print(f" The optimised model is not a very good model as it only has a score of {max_score:.2f}")


 The parameters for the optimum model are:
number of neighbours    10.000000
best PCA                 5.000000
score                    0.972028
Name: 9, dtype: float64
 The optimised model is very good, with a score of 0.97


# Fitting the optimum model 
This stores the model we want so that it can be called upon when we use predict later.

In [14]:
best_pca_parameters = int(results.loc[index_of_max_score, "best PCA"]) # the number of PCA components to use
best_neighbours_parameters = int(results.loc[index_of_max_score, "number of neighbours"]) # the number of neighbours to use

# make a pipeline which scales our data, conducts a PCA and sets the number of neighbours to consider
optimised_model = make_pipeline(
    StandardScaler(),
    PCA(n_components = best_pca_parameters),   
    KNeighborsClassifier(n_neighbors = best_neighbours_parameters )
)

fitted_model = optimised_model.fit(train_X_s, train_y_s) # fit our optimised model to the training data 

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('pca', PCA(n_components=5)),
                ('kneighborsclassifier', KNeighborsClassifier(n_neighbors=10))])


In [18]:
percentage_pca = sum(fitted_model["pca"].explained_variance_ratio_) *100  # percentage of the variance in the data which is
                                                                          # explained by all the retained PCA components together
print(f" {percentage_pca:.1f}% of our variance is explained by the {best_pca_parameters} PCA components included in our model.") 

 97.5% of our variance is explained by the 5 PCA components included in our model.
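
To see how the explained variance builds up as components are added, the ratios can be accumulated one component at a time. This is a small sketch under its own assumptions: it refits a separate PCA on all of the mean features from scikit-learn's breast cancer data, rather than reusing the training split and fitted model above, so its percentages need not match the figure printed by the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(as_frame=True, return_X_y=True)
X_means = X.loc[:, X.columns.str.startswith("mean ")]  # the ten "mean ..." features

scaled = StandardScaler().fit_transform(X_means)       # PCA is sensitive to scale, so standardise first
pca = PCA().fit(scaled)                                # keep every component

# Running total of the variance explained by the first n components, as a percentage
cumulative = np.cumsum(pca.explained_variance_ratio_) * 100
for n, percentage in enumerate(cumulative, start=1):
    print(f"{n} components: {percentage:.1f}%")
```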


# Visualising the model...or not
Depending on the number of PCA components used by the model, it may not be possible to visualise the model. This is because each PCA component becomes an axis in our graph, and this notebook cannot plot the model with only one axis or with four or more axes!
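
For the two-component case, the decision regions can be computed by predicting a class for every point on a grid laid over the two PCA axes. The sketch below is not the `plot_knn` function from the "plot" file, which is not shown here; it fits its own two-component pipeline with an arbitrarily chosen 5 neighbours, purely to illustrate the idea.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(as_frame=True, return_X_y=True)
X_means = X.loc[:, X.columns.str.startswith("mean ")]

two_component_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),  # arbitrary choice for this illustration
)
two_component_model.fit(X_means, y)

# Project the samples onto the two PCA axes using every step except the classifier
points = two_component_model[:-1].transform(X_means)

# Lay a 100 x 100 grid over the region covered by the projected samples
pc1 = np.linspace(points[:, 0].min() - 1, points[:, 0].max() + 1, 100)
pc2 = np.linspace(points[:, 1].min() - 1, points[:, 1].max() + 1, 100)
grid_1, grid_2 = np.meshgrid(pc1, pc2)
grid_points = np.column_stack([grid_1.ravel(), grid_2.ravel()])

# Ask the classifier step alone for the class of each grid point
grid_classes = two_component_model[-1].predict(grid_points).reshape(grid_1.shape)
print(grid_classes.shape)
```

The arrays `grid_1`, `grid_2` and `grid_classes` could then be passed to a filled contour plot, for example matplotlib's `contourf`, with `points` drawn on top as a scatter plot.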

In [22]:
if best_pca_parameters >= 4 or best_pca_parameters == 1:
    print(f"As we have {best_pca_parameters} PCA components, we are unable to visualise our model as we would need to plot a {best_pca_parameters} dimensional graph and that's just not possible yet!")
elif best_pca_parameters == 2:
    from plot import plot_knn # import a function which will plot a 2D graph; make sure the plot file is in the same directory
    plot_knn(fitted_model, X_subset, y)
else:
    print(f"Plotting a graph for {best_pca_parameters} PCA components is possible but difficult, and this notebook does not deal with this")

As we have 5 PCA components, we are unable to visualise our model as we would need to plot a 5 dimensional graph and that's just not possible yet!


# Making predictions
Now that we have our model, and we are sufficiently happy with it, we can input some data and see what diagnosis is predicted. Have a go at inputting some values where indicated.

An example of a set of input values is [12.4, 20, 80.9, 470, 0.2, 0.1, 0.007, 0.033, 0.18, 0.072]. When the load_breast_cancer data is used to fit the model, this example is predicted as class 1, which is a benign diagnosis in scikit-learn's encoding (0 is malignant, 1 is benign).

Go on, play around with it!
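
As a rough guide to filling in the values, the sketch below builds a one-row DataFrame from the example values and asks a model for both the predicted class and the class probabilities. It is self-contained, so it fits its own pipeline on all of the mean features (not on the training split used above) with 5 PCA components and 10 neighbours; the name `example_model` is made up for this illustration. It reads the class name from the data set's own `target_names` rather than hard-coding it.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
feature_names = [name for name in data.data.columns if name.startswith("mean ")]
X_means = data.data[feature_names]

example_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    KNeighborsClassifier(n_neighbors=10),
)
example_model.fit(X_means, data.target)

# One row of input values, in the same column order as the training features
example_values = [12.4, 20, 80.9, 470, 0.2, 0.1, 0.007, 0.033, 0.18, 0.072]
new_X = pd.DataFrame([example_values], columns=feature_names)

predicted_class = example_model.predict(new_X)[0]
probabilities = example_model.predict_proba(new_X)[0]  # share of the 10 neighbours in each class

print(f"Predicted class: {predicted_class} ({data.target_names[predicted_class]})")
print(f"Class probabilities: {probabilities}")
```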

In [24]:
# Enter data into the new_X DataFrame being created below.
# You can put as many values in each list as you want; just be consistent across all the variables.

new_X = pd.DataFrame({
    "mean radius": [],   
    "mean texture": [],
    "mean perimeter": [],
    "mean area": [],
    "mean smoothness": [],
    "mean compactness": [],
    "mean concavity": [],
    "mean concave points": [],
    "mean symmetry":[],
    "mean fractal dimension": [],
})


predicted_y = optimised_model.predict(new_X)

# Note: scikit-learn's load_breast_cancer data encodes 0 as malignant and 1 as benign
for i in predicted_y:
    if i == 0:
        print("Tumour is malignant")
    elif i == 1:
        print("Tumour is benign")
    else:
        print("There has been an error, please check the data values entered")


ValueError: Found array with 0 sample(s) (shape=(0, 10)) while a minimum of 1 is required by StandardScaler.