# Machine Learning - Individual Assessment - University of the West of England
<a href="https://github.com/Sp0xF8/Individual-Project">GitHub Project Link</a>

# Imports

## These imports are shared across the whole application and not specific to either model

In [129]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import json 

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score

## SVM Imports
<p>These imports are specifically related to the SVM's functionality</p>

In [130]:
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

from sklearn.pipeline import make_pipeline

## ADA Ensembles
<p>These imports are specifically related to the ADA models</p>

In [131]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

# Getting the Data Ready for Interpreting
<p>In this stage we use the Pandas library to load the CSV data file.</p>
	<div style="margin-left: 20px;">
		<p>- Loading the data into a DataFrame gives access to a wide array of methods for inspecting and manipulating it.</p>
	</div>
	<p>The data is then printed to allow for easy referencing and understanding of the base data.</p>



In [132]:
## Load the CSV file

data = pd.read_csv('diabetes.csv')

print(data[:10])
print(data.shape)

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   
5            5      116             74              0        0  25.6   
6            3       78             50             32       88  31.0   
7           10      115              0              0        0  35.3   
8            2      197             70             45      543  30.5   
9            8      125             96              0        0   0.0   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   2

## Importance of Splitting Test and Training Data in Machine Learning Models

In Machine Learning, it is essential to split the dataset into training and test sets.
<div style="margin-left: 20px;">
	<p>The purpose of holding back a <strong>test dataset</strong> is to have a selection of data, unseen during training, against which the learned model can be scored. Because the true outcomes of these inputs are known, they can be matched against the predictions made by the model. The alternative would be to test against the training data; however, this would not reflect the real-world task of predicting labels for new data. Knowing the success rate of the predictions means the model can be adjusted until the performance is satisfactory.
	</p>
</div>

Additionally, from the <strong>data</strong> provided by the CSV, the outcome column must be dropped to form the list of inputs ($X$). Conversely, everything besides the outcome must be dropped to form the list of outputs ($y$).


In [133]:
# Creating the initial X and y lists from the data CSV file
X = data.drop("Outcome", axis=1)
y = data["Outcome"]


#split the data into training and testing data, 80% training and 20% testing- random state is set to 42 because it is the answer to everything
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Setting up the Variables

## Creating SVM Variables
These two variables are of the type Dictionary, which is similar in format to a <strong>JSON</strong> file. Advantages of setting the models up in this way include ease of access and increased readability. It also gives the ability to store multiple different datatypes in one element and easily export it to a JSON file.
<div style="margin-left: 20px;">
	<p>
		<strong>svm_tests</strong> 			: This Dictionary is used to store the range of hyper-parameters passed to the depth-first grid search.
	</p>
	<p>
		<strong>current_svm_data</strong> 	: This Dictionary is used to store the current hyper-parameters being passed to the <strong>svm_model</strong> class.
	</p>
</div>


In [134]:
# Dictionary to store the tests which will be performed on the SVM
svm_tests = {
	'C': np.linspace(1, 100, 100),
	'tollerance': np.linspace(0.0001, 0.1, 100),
	'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
	'max_iter': np.linspace(10, 10000, 100).astype(int),
	'decision_function_shape': ['ovo', 'ovr'],
	'possibility': [True, False],
}

# Dictionary to store the current test's data for the SVM
current_svm_data = {
	'kernel': None,#
	'max_iter': None,
	'decision_function_shape': None,#
	'probability': None,#
	'shrinking': None,#
	'C': None,
	'tollerance':None#
}

## Creating Ensemble Variables


Similarly to the previous definitions, these variables are also of the type Dictionary. These, however, are multi-layered.
<div style="margin-left: 20px;">
	<p>
		<strong>ada_ensambles_tests</strong> : This Dictionary is used to store the range of hyper-parameters passed to the depth-first grid search.
		<div style="margin-left: 50px;">
			<p>
				<strong>Estimator</strong> 			: This Dictionary is used to store the range of hyper-parameters passed to the <strong>Random Forest Classifier</strong>.
			</p>
			<p>
				<strong>Params</strong> 			: This Dictionary is used to store the range of hyper-parameters passed to the <strong>AdaBoost Classifier</strong>.
			</p>
		</div>
	</p>
	<hr style="width:75%;" align="left">
	<p>
		<strong>current_ada_data</strong> 	: This Dictionary is used to store the current hyper-parameters being passed to the <strong>ada_model</strong> class.
		<div style="margin-left: 50px;">
			<p>
				<strong>Estimator</strong> 			: This Dictionary is used to store the current hyper-parameters being passed to the <strong>Random Forest Classifier</strong>.
			</p>
			<p>
				<strong>Params</strong> 			: This Dictionary is used to store the current hyper-parameters being passed to the <strong>AdaBoost Classifier</strong>.
			</p>
		</div>
	</p>
</div>


In [135]:
# Dictionary to store the tests which will be performed on the AdaBoost Classifier and Random Forest Classifier
ada_ensambles_tests = {
	'Estimator': {
		'n_estimators': np.linspace(1, 200, 10).astype(int),
		'criterion': ['gini', 'entropy', 'log_loss'],
		'max_features': ['sqrt', 'log2'],
		'bootstrap': [True, False],
		'min_samples_split': np.linspace(2, 11, 10).astype(int),
		'min_samples_leaf': np.linspace(2, 6, 5).astype(int)
	},
	'Params': {
		'n_estimators': np.linspace(1, 100, 10).astype(int),
		'learning_rate': np.linspace(0.1, 3, 10),
		'algorithim': ['SAMME', 'SAMME.R']
	}
}

# Dictionary to store the current test's data for the AdaBoost Classifier and Random Forest Classifier
current_ada_data = {
	'Estimator': {
		'n_estimators': None,
		'criterion': None,
		'max_features': None,
		'bootstrap': None,
		'min_samples_split': None,
		'min_samples_leaf': None
	},
	'Params': {
		'n_estimators': None,
		'learning_rate': None,
		'algorithim': None
	}
}

## Defining the test comparison
These two variables are of type List and are used to store the best solutions found and their respective metrics. This is essential in performing a goal-orientated depth-first search: without a comparison, there is no way of knowing whether a better solution has been found.

<div style="margin-left: 20px;">
	<p>
		<strong>solution_list</strong> 		: This List stores class objects of the top 10 solutions, allowing for instant referencing once the search has been completed.
	</p>
	<p>
		<strong>accuracy_list</strong> 		: This List stores the score of the class object at the same offset. It is the list used for direct comparison, which avoids having to reference the solutions list and so avoids slowing down the hundreds of thousands of tests.
	</p>
</div>


In [136]:
solution_list = []
accuracy_list = []

# Ratios Function
This function is an easy way of returning multiple metrics more efficiently, using far fewer function calls.

### Parameters:

<div style="margin-left: 20px;">
	<p>
		<strong>y_true</strong> 		: This variable is of type NumPy array and holds the actual true values for the test data (<strong>y_test</strong>).
	</p>
	<p>
		<strong>y_pred</strong> 		: This variable is also of type NumPy array and holds the values predicted by the model for the test inputs (<strong>X_test</strong>).
	</p>
</div>

### Prediction Types:

<div style="margin-left: 20px;">
	<p>
		<strong>True Positive</strong> 		: This type of classification occurs when the model <em><strong>predicts a Positive result</strong>, and the result is <strong>actually Positive</strong></em>.
	</p>
	<p>
		<strong>True Negative</strong> 		: This type of classification occurs when the model <em><strong>predicts a Negative result</strong>, and the result is <strong>actually Negative</strong></em>.
	</p>
	<p>
		<strong>False Positive</strong> 	: This type of classification occurs when the model <em><strong>predicts a Positive result</strong>, and the result is <strong>actually Negative</strong></em>.
	</p>
	<p>
		<strong>False Negative</strong> 	: This type of classification occurs when the model <em><strong>predicts a Negative result</strong>, and the result is <strong>actually Positive</strong></em>.
	</p>
</div>



In [137]:
# Function to compute several metrics from one confusion matrix, rather than calling a separate metric function for each
def ratios(y_true, y_pred):
    ## Get the confusion matrix counts (labels are fixed so a 2x2 matrix is always returned)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

    ## Calculate False Negative Ratio
    fnr = fn / (fn + tp) if (fn + tp) > 0 else 0

    ## Calculate Recall
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0

    ## Calculate Precision
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0

    ## Calculate Specificity
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0

    return fnr, recall, precision, specificity

# Class Definitions

Using a class instead of a standard function is an efficient way to organise the work. It allows for better reusability, and storage, of data in the long term, which is a fundamental concept of Object-Oriented Programming (OOP). By using OOP in this project, it is much easier to rapidly cycle through a grid of hyper-parameters and subsequently output any results gained. This also means that fewer levels of indentation are required to accomplish the same task, resulting in cleaner code which is much easier to understand.

By choosing to use a class, there is inherent access to standardised methods such as the constructor. The **svm_model** and **ada_model** classes both use **constructors** which accept only one parameter: the respective, previously defined Dictionary (**current_svm_data** or **current_ada_data**). In this regard, the constructor saves the parameters from which the Machine Learning model is later built. These can also be referenced later by a different function, which prints the best solutions found by the search.

Additionally, both classes provide a **predict** method. While the two models have vastly different features, they return the same metric values. The models themselves are stored in function-local variables, meaning that once the predictions have been made they go out of scope, freeing memory. The metric data is then stored in appropriately named instance variables for later use.

### Metric Variables:

<div style="margin-left: 20px;">
	<p>
		<strong>closed_pred</strong> 		: This variable is of type NumPy array and holds the <em>predicted values for the test data</em> (<strong>y_pred</strong>).
	</p>
	<p>
		<strong>closed_accuracy</strong> 	: This variable is of type Float and holds the simplest of the metrics: the <em>ratio of correctly classified datapoints to the total number of datapoints</em>. Having a good overall accuracy is important but, depending on the application the model is used for, it can be misleading.
	</p>
	<p>
		<strong>closed_precision</strong> 	: This variable is also of type Float and holds the ratio of <em>correctly identified <strong>True Positives</strong> to the total number of positive predictions</em> made by the model. This is useful for analysing the validity of the model's positive predictions.
	</p>
	<p>
		<strong>closed_specificity</strong> : Similar to <strong>closed_precision</strong>, this variable is also of type Float. Instead, however, it stores the ratio of <em>correctly identified <strong>True Negatives</strong> to the total number of actually negative instances</em> (True Negatives plus False Positives).
	</p>
	<p>
		<strong>closed_recall</strong> 		: This variable of type Float is used for the <em><strong>Recall</strong>, or <strong>Sensitivity</strong>,</em> of the model's predictions. This is <em>the ratio of <strong>True Positive</strong> predictions to the total number of actually positive instances</em>. Recall pairs well with <strong>closed_precision</strong>. Together they give more complete information: precision does not specify how many positive instances were missed, only how reliable the positive predictions are, while recall does not specify how many negatives were wrongly flagged. On its own, recall could be misleading, as an easy way to never miss anyone with diabetes is to simply say <em>everyone has diabetes</em>. This would <em><strong>never miss a diabetic</strong>, but would <strong>instantly slow down formal diagnosis</strong> because everyone would be referred</em>.
	</p>
	<p>
		<strong>closed_fnr</strong> 		: This variable is of type Float and holds the <em><strong>False Negative Ratio</strong></em>. This is the complement of <strong>Recall</strong> (FNR = 1 - Recall); however, for transparency, using it as a separate, clear metric seemed preferable. Often referred to as the <em>miss rate</em>, this metric is highly valued when missing positive instances is critical. When dealing with any form of diagnosis, it is far more important that the model correctly identifies as many <strong>True Positives</strong> as possible. <em>While it is <strong>less than ideal if a non-diabetic is referred</strong>, as this would cause the diagnosis system to slow down, it is <strong>far more important that the diabetic is not overlooked</strong> and left without treatment</em>. It is important to note that <strong>this should not be the only metric used for judgement</strong>. A good choice of pairing for this metric would be <em>Precision or Specificity</em>. Specificity would provide a well-rounded overview by giving insight into whether the model is simply predicting everyone as diabetic.
	</p>
	<p>
		<strong>closed_f1</strong> 			: Of type Float, this variable stores the <em>harmonic mean of <strong>precision and recall</strong></em>. This ensures <strong>False Positives</strong> and <strong>False Negatives</strong> are both accounted for, and is <em>particularly useful when the two have similar levels of importance</em>. Given the previously stressed importance of avoiding False Negatives, while allowing some leeway on False Positives, the F1 score is not entirely relevant for this model.
	</p>
</div>
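
As a worked illustration of the definitions above, the sketch below computes each metric from raw confusion-matrix counts in plain Python. The function name `metrics_from_counts` and the example counts are hypothetical and are not part of the project code; they only show how the metrics relate to one another, including the "everyone has diabetes" case described under recall.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Compute the metrics described above from raw confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "fnr": fnr,
        "f1": f1,
    }


# Hypothetical counts for a reasonable model
balanced = metrics_from_counts(tp=40, tn=90, fp=10, fn=14)

# Hypothetical counts for a model that labels every patient as diabetic:
# no positives are missed, but no negative is ever recognised
everyone_positive = metrics_from_counts(tp=54, tn=0, fp=100, fn=0)

for name, value in balanced.items():
    print(f"{name}: {value:.3f}")
```

In the second case recall is perfect and the miss rate is zero, yet specificity falls to zero, which is why the miss rate is paired with a second metric.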

### svm_model.predict Method

Typically, a prediction function would allow for some way of dynamically setting the data to be predicted. However, given the time complexity of the grid searches employed, it was far more efficient to define the test data globally and reference it when needed. If this were to be developed further, the predict function would need to take a multi-dimensional array as its input (representing the data to classify). For the current scenario, the global approach is well suited and keeps each test lightweight.

The prediction function begins by defining a pipeline through which the data is processed before reaching the **Support Vector Classifier**. Given the type of model, and the large difference in scale between features (*e.g. pregnancies: 1, glucose: 168*), the data must be pre-processed before it can be accurately analysed by the **SVC**. Unscaled features can easily reduce the effectiveness of linear-based algorithms; this is where the **StandardScaler** is useful inside the pipeline. Instead of scaling the patient data (**X_train**, **X_test**, ...) ahead of time, which would also alter the input to the other ML model (**ada_model**), the data is scaled when the model accesses it. This means new test data will not need to be scaled in advance either, as the pipeline automatically handles any scaling needs.

	closed_clf.fit(X_train, y_train)

This line of code is where the SVC pipeline is called to train the model. The first parameter, **X_train**, is passed through the pipeline: it is first scaled, putting every feature on a comparable scale, then passed further down the pipeline to the SVC model and its specified kernel. At its root, the **StandardScaler** is designed to convert values measured on different scales into a single scale, ensuring that each feature has a mean of 0 and unit standard deviation. This is useful for kernels such as *RBF*, which assume that all features are centred around 0.
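
To make the scaling step concrete, the sketch below standardises a single feature column by hand using only the standard library. The helper `standardise` is hypothetical and is not how the project scales its data (the pipeline does that internally); the glucose values are the first five rows printed earlier.

```python
from statistics import fmean, pstdev


def standardise(values):
    """Scale one feature column to mean 0 and standard deviation 1."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    if std == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


glucose = [148, 85, 183, 89, 137]  # first five Glucose values from the data
scaled = standardise(glucose)

print([round(value, 3) for value in scaled])
```

Inside the pipeline the mean and standard deviation are learned from the training data during `fit` and then reused when the test data is transformed, so the test set never influences the scaling.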

In [138]:

# SVM Model Definition
class svm_model:

	## Constructor - Takes in a data dictionary (current_svm_data) and assigns the values to the instance variables
	def __init__(self, data):
		self.kernel 					= data['kernel']
		self.max_iter 					= data['max_iter']
		self.func_shape 				= data['decision_function_shape']
		self.probability 				= data['probability']
		self.shrinking 					= data['shrinking']
		self.tollerance 				= data['tollerance']
		self.C 							= data['C']



	## Predict Function - Uses the values stored by the constructor to create a pipeline and predict the outcome of the test data
	def predict(self):

		### Create the pipeline as a local variable
		closed_clf = make_pipeline(	
									StandardScaler(), # Standardise the data before training
									SVC( # Create the SVC model with the given parameters from the current_svm_test dictionary
										
										kernel=self.kernel, 
										max_iter=self.max_iter, 
										decision_function_shape=self.func_shape, 
										probability=self.probability, 
										shrinking=self.shrinking,
										C=self.C,
										tol=self.tollerance
									)
								)
		
		### Fit the model to the training data 
		closed_clf.fit(X_train, y_train)

		### Predict the outcome of the test data
		self.closed_pred 		= closed_clf.predict(X_test)

		### Calculate the accuracy, f1 score, false negative rate, recall, precision, and specificity of the model
		self.closed_accuracy 	= accuracy_score(y_test, self.closed_pred)
		self.closed_f1 			= f1_score(y_test, self.closed_pred)
		
		self.closed_fnr, self.closed_recall, self.closed_precision, self.closed_specificity	= ratios(y_test, self.closed_pred)

## ada_model.predict Method

### Random Forest Classifier
	closed_hype_clf = RandomForestClassifier(...)

1. At its most basic, a **Random Forest Classifier** is a combination of multiple decision trees, each trained on a different *randomly selected* sample of the data. This simply involves drawing multiple random samples from the original dataset. These samples are then used to build the trees.

2. At *each node of each decision tree*, a random subset of features is selected (*limited in size by **max_features***). This helps to ensure the decision trees do not all grow alike, which would defeat the purpose of having multiple trees.

3. After a node has selected its features, the one which provides the best split, *according to the **criterion***, is chosen from among them.

4. The remaining data is then split into subsets based on the selected feature, creating different branches on the tree. This process is repeated at each node until a stopping criterion is met.

5. After growing ***n_estimators*** trees, the results are aggregated to make predictions. When these predictions are made, each tree independently makes its own prediction. The final prediction for the new input data, **X_test**, is the *average result of all of the trees*.
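
The first and last steps above can be sketched in a few lines of plain Python. This is a simplified illustration and not the library's implementation: `bootstrap_sample` and `average_vote` are hypothetical helpers, and the per-tree probabilities are made-up values.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as used to train one tree."""
    return [rng.choice(rows) for _ in rows]


def average_vote(tree_probabilities, threshold=0.5):
    """Average each tree's positive-class probability, then apply a threshold."""
    mean_probability = sum(tree_probabilities) / len(tree_probabilities)
    label = 1 if mean_probability > threshold else 0
    return label, mean_probability


rng = random.Random(42)          # seeded so the sample is repeatable
rows = list(range(10))           # stand-in for ten training rows
sample = bootstrap_sample(rows, rng)

# Three hypothetical trees disagree; the averaged probability decides
label, probability = average_vote([0.9, 0.4, 0.7])

print(sample)
print(label, round(probability, 3))
```

Because rows are drawn with replacement, some rows appear more than once in a sample and others not at all, which is what makes each tree see slightly different data.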

### AdaBoost Classifier
	closed_clf = AdaBoostClassifier(closed_hype_clf, ...)

When you pass a Random Forest Classifier as an estimator to an AdaBoost Classifier, the AdaBoost algorithm works in conjunction with the Random Forest base estimator to create a strong ensemble model. Here's how it works:

1. The Random Forest Classifier is initialized as the base estimator within the AdaBoost. AdaBoost uses the Random Forest Classifier to build a sequence of learners.

2. AdaBoost trains multiple instances of the Random Forest Classifier, focusing more on the entities that were misclassified by the previous learners. This is achieved by adjusting the weights of the training entities during each iteration.

3. After training multiple Random Forest models, AdaBoost combines their predictions using a weighted voting system. The final prediction is determined from the weighted sum of individual Random Forest predictions, where the weight assigned to each model depends on its performance during training.

4. By iteratively focusing on difficult-to-classify entities and combining the predictions of multiple Random Forest models, AdaBoost enhances the overall performance of the ensemble classifier. This results in a robust and accurate model that utilises the strengths of both AdaBoost and Random Forest Classifiers.
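
The re-weighting in step 2 and the model weight in step 3 can be illustrated with a single boosting round. The sketch below follows the discrete SAMME update as it is commonly described; it is a simplified, hypothetical helper (`samme_update`) rather than the library's code, and it assumes the weighted error is strictly between 0 and 1.

```python
import math


def samme_update(weights, misclassified, learning_rate=1.0, n_classes=2):
    """Run one discrete SAMME round: return new sample weights and the model weight."""
    total_weight = sum(weights)
    # Weighted error of the learner that was just trained
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong) / total_weight
    # Weight given to this learner in the final vote (assumes 0 < error < 1)
    alpha = learning_rate * (math.log((1 - error) / error) + math.log(n_classes - 1))
    # Increase the weight of every misclassified sample, then renormalise
    boosted = [w * math.exp(alpha) if wrong else w
               for w, wrong in zip(weights, misclassified)]
    boosted_total = sum(boosted)
    return [w / boosted_total for w in boosted], alpha


# Four equally weighted samples; only the last one was misclassified
start_weights = [0.25, 0.25, 0.25, 0.25]
new_weights, alpha = samme_update(start_weights, [False, False, False, True])

print([round(w, 4) for w in new_weights], round(alpha, 4))
```

After one round the single misclassified sample carries half of the total weight, so the next learner is pushed to classify it correctly.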


As previously mentioned, data only needs to be scaled for models which assume the data is centred around 0. Neither of these classifiers does, meaning the data can be left in its original form. This is because tree-based models make choices based on relative values, not absolute ones. As mentioned, they choose splits which maximise information gain, and each split is made within a single feature, meaning there are no scale conflicts between features either.



In [139]:

# AdaBoost Model Definition
class ada_model:

	## Constructor - Takes in a data dictionary (current_ada_data)
	def __init__(self, current_test):

		## Assign the values from the current_test dictionary to the class variables, which are also dictionaries. this removes the outer layer, improving readability later on
		self.clf_estimator = current_test['Estimator']

		self.clf_params = current_test['Params']
		

	## Predict Function - Uses the values stored by the constructor to create an AdaBoost Classifier, with a Random Forest Classifier as its estimator, to predict the outcome of the test data
	def predict(self):

		### Create the estimator for the AdaBoost Classifier as a local variable 
		closed_hype_clf = RandomForestClassifier(
									n_estimators=self.clf_estimator['n_estimators'], 
									criterion=self.clf_estimator['criterion'],
									max_features=self.clf_estimator['max_features'], 
									bootstrap=self.clf_estimator['bootstrap'],
									min_samples_split=self.clf_estimator['min_samples_split'], 
									min_samples_leaf=self.clf_estimator['min_samples_leaf'], 
									n_jobs=-1
								)
			
		### Create the AdaBoost Classifier as a local variable with the given parameters from the current_ada_data dictionary
		closed_clf = AdaBoostClassifier(
							closed_hype_clf,
							n_estimators=self.clf_params['n_estimators'],
							learning_rate=self.clf_params['learning_rate'],
							algorithm=self.clf_params['algorithim'],
							random_state=1)
		
		### Fit the model to the training data
		closed_clf.fit(X_train, y_train)

		### Predict the outcome of the test data
		self.closed_pred 		= closed_clf.predict(X_test)

		### Calculate the accuracy, f1 score, false negative rate, recall, precision, and specificity of the model
		self.closed_accuracy 	= accuracy_score(y_test, self.closed_pred)
		self.closed_f1 			= f1_score(y_test, self.closed_pred)

		### Unpack all four ratios in a single assignment; splitting the targets across separate lines would not unpack the tuple
		self.closed_fnr, self.closed_recall, self.closed_precision, self.closed_specificity	= ratios(y_test, self.closed_pred)

# Helper Functions

### Convert to Serialisable
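This simple helper function takes an entity as a parameter and returns it after converting it to a JSON-serialisable type. The only non-native type being used is the NumPy integer produced by **astype(int)** (**int32** or **int64**, depending on the platform), so only one conditional is needed. Every other stored type is already handled by the **json** module.

In [140]:
def convert_to_serialisable(ent):
    # np.integer matches every NumPy integer width (int32, int64, ...)
    if isinstance(ent, np.integer):
        return int(ent)
    # json expects this hook to raise TypeError for anything it cannot convert
    raise TypeError(f"Object of type {type(ent).__name__} is not JSON serialisable")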

### Write JSON

While it would have been entirely possible to use the class's default \_\_dict\_\_ attribute, this gives little control over how the results are displayed, decreasing readability. It is true that, given the JSON file format, the data could simply be extracted and accessed in a more appropriate way. However, it made more sense to provide a file that is also legible to a human. This means that the user can quickly look at the results manually, without having to decipher the positioning of the variables within the dictionary.

Additionally, it would have been possible to store the predictions as an array. However, given the length of the array, it seemed more logical to include it as a string, creating a single row instead of a column of more than 100 lines. This also makes direct comparison between predictions and expected results much easier, as they run in parallel.

#### Parameters:
<div style="margin-left: 20px;">
	<p>
		<strong>class_name</strong> 		: This parameter should be either an <strong>svm_model</strong> or an <strong>ada_model</strong> instance.
	</p>
	<p>
		<strong>output_file</strong> 		: This parameter should be of type <strong>string</strong> and <em>finish with a ".json" file extension</em>. It will not create an error if a txt is used, but some JSON functionality could be limited.
	</p>
</div>

In [141]:
def write_json(class_name, output_file):
	
	## Create a dictionary to store the class variables and the results of the model
	class_dict = {
		"features": {# Dictionary created empty, to be filled with the features of corresponding model.
		},
		"results": { # Dictionary used to store the metrics of the model 
			"accuracy": class_name.closed_accuracy,#float
			"f1": class_name.closed_f1,#float
			"fnr": class_name.closed_fnr,#float
			"recall": class_name.closed_recall,#float
			"precision": class_name.closed_precision,#float
			"specificity": class_name.closed_specificity#float
		},
		"predictions": { # Dictionary used to store the predictions of the model
			"y_pred": str(class_name.closed_pred.tolist()),#array as string
			"y_true": str(y_test.tolist())#array as string 
		}
	}

	## Check if the class_name is an instance of the svm_model class
	if(isinstance(class_name, svm_model)):
		class_dict["features"] = { # If the class_name is an instance of the svm_model class, fill the features dictionary with the features of the model
			"kernel": str(class_name.kernel),#str
			"max_iter": class_name.max_iter,#int
			"decision_function_shape": str(class_name.func_shape),#str
			"probability": class_name.probability,#bool
			"shrinking": class_name.shrinking,#bool
			"tollerance": class_name.tollerance,#float
			"C": class_name.C#float
		}
	elif(isinstance(class_name, ada_model)):
		class_dict["features"] = { # If the class_name is an instance of the ada_model class, fill the features dictionary with the features of the model
			"Estimator": {
				"n_estimators": class_name.clf_estimator['n_estimators'],#int
				"criterion": str(class_name.clf_estimator['criterion']),#str
				"max_features": str(class_name.clf_estimator['max_features']),#str
				"bootstrap": class_name.clf_estimator['bootstrap'],#bool
				"min_samples_split": class_name.clf_estimator['min_samples_split'],#int
				"min_samples_leaf": class_name.clf_estimator['min_samples_leaf']#int
			},
			"Params": {
				"n_estimators": class_name.clf_params['n_estimators'],#int
				"learning_rate": class_name.clf_params['learning_rate'],#float
				"algorithim": str(class_name.clf_params['algorithim'])#str
			}
		}

	_tojson_ = json.dumps(class_dict, default=convert_to_serialisable, indent=4) # Convert the class_dict to a JSON string

	with open(output_file, 'a') as f: # Open the output file in append mode and write the JSON string to the file

		f.write(str(_tojson_ ) + ',\n') # Add a comma and a newline character to the end of the JSON string to separate the different tests
		

### Solution Comparison

This is quite a simple function. It checks that the chosen performance metric is higher than 70%, then proceeds to compare the current test against the current Top 10 solutions (the code below compares **closed_fnr**). If a new best solution is found, the test and its score are appended to the **solution_list** and **accuracy_list** respectively. The solution is then written to the correctly named JSON file. If the length of the list exceeds 10 items, the item at the front of the list is removed.

#### Parameters:
<div style="margin-left: 20px;">
	<p>
		<strong>current_test</strong> 	: This parameter should be either an <strong>svm_model</strong> or an <strong>ada_model</strong> instance.
	</p>
	<p>
		<strong>solution_name</strong> 	: This parameter should be of type <strong>string</strong> and <em>finish with a ".json" file extension</em>. It will not create an error if a txt is used, but some JSON functionality could be limited.
	</p>
</div>

In [142]:
def check_worst(current_test, solution_name):

	## for each solution in the solution list, where i stores the current solution being compared to the current test
	for i in range(len(solution_list)):

		## check if the current test's performance metric (closed_fnr, where lower is better) beats the stored score
		if current_test.closed_fnr < accuracy_list[i]:
			## append the current test to the solution list and its closed_fnr score to the accuracy list
			solution_list.append(current_test)
			accuracy_list.append(current_test.closed_fnr)

			## write the current test to the output file
			write_json(current_test, solution_name)


			## check if the length of the solution list is greater than 10
			if(len(solution_list) > 10):
				## pop the first element from the solution list and the accuracy list
				solution_list.pop(0)
				accuracy_list.pop(0)

			return

In [143]:
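def check_best(current_test, solution_name):

	## same structure as check_worst, with the comparison reversed: the test is kept when its closed_fnr is higher than a stored score
	for i in range(len(solution_list)):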

		if current_test.closed_fnr > accuracy_list[i]:
			solution_list.append(current_test)
			accuracy_list.append(current_test.closed_fnr)

			write_json(current_test, solution_name)

			if(len(solution_list) > 10):
				solution_list.pop(0)
				accuracy_list.pop(0)

			return
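
The functions above drop the oldest entry once more than 10 solutions are stored. As an alternative sketch, and not the approach used in this project, the list of scores can be kept sorted so that the worst entry is always the one removed. The helper `keep_best` is hypothetical and assumes a lower score (such as the False Negative Ratio) is better.

```python
import bisect


def keep_best(scores, new_score, limit=10):
    """Insert new_score into an ascending list and keep only the `limit` lowest."""
    bisect.insort(scores, new_score)
    if len(scores) > limit:
        scores.pop()  # the last element is the largest, i.e. the worst score


best_fnr = []
for fnr in [0.40, 0.25, 0.31, 0.18, 0.50]:
    keep_best(best_fnr, fnr, limit=3)

print(best_fnr)
```

The same idea extends to storing the model objects alongside their scores, for example by inserting `(score, index)` pairs so that the objects themselves never need to be compared.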

# Systematic Grid Search

**SKLearn** deos include a standard grid search function: *GridSearchCV*. It is described as an "*exhaustive* search" over a definied paramater list. First it trains each possibility of **hyperparamaters** and scores them, using the most admirable for the final predictions. This function is ***increably exhaustive*** when searching over a large dataset. Instead of preforming an optimised search, this algorithm aims to save all possibilities in memory. This would be insignifigant when only checking a few hundred tests. However, with the hundreds of millions of combinations: **how do you know which to test**? 

It is possible to use the GridSearchCV function for these millions of test paramaters, however the computational requirements are *linear to the **number of tests** $*$ **memory usage for the model selected***. This can easily cause any computer to come to a grinding hault and freeze, even possibly resulting in a BSOD error. The *optimal memory usage should be **memory usage for the model selected** $*$ **number of best solutions***. For this search, as seen in the *Comparison Function*, the max number of best solitions to be stored in active memory should not exceed the number of best solutions, *10*, and an additional solution for the one being tested. This gives a *final memory usage of **11** $*$ **memory usage for the model selected***.

Given the memory optimisation of the custom search functions, it would be possible to perform searches on computers with limited memory. Depending on the speed of the computer and the size of the test, finding the optimal hyperparameters could still take days; however, in theory the search should only need to run once, as all best solutions found are stored in an external data structure. **More will be said about that later**, but the key benefit of searching in this style is that it gives the *functionality to find the best parameters inside an **infinitely large pool of parameters***. That is, ***so long as you have enough time to wait***. 

This is significant for the health industry, whose computing power usually lags behind due to a lack of funding.
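
The bounded-memory idea described above can be illustrated with a minimal sketch. It is not the notebook's *Comparison Function*: the `BestN` class and its method names are hypothetical, and it assumes that a higher score is better (a lower-is-better metric could be negated before being offered). It keeps at most `n` candidates in memory, however many are tested.

```python
import heapq
from itertools import count


class BestN:
    """Keep only the n highest-scoring items seen so far (hypothetical helper)."""

    def __init__(self, n=10):
        self.n = n
        self._heap = []      # min-heap of (score, tie_breaker, item)
        self._tie = count()  # unique tie-breaker so items are never compared

    def offer(self, score, item):
        """Store the item if it ranks among the best n; return True if kept."""
        entry = (score, next(self._tie), item)
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, entry)
            return True
        if score > self._heap[0][0]:
            # Replace the current worst of the stored solutions
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def best(self):
        """Return (score, item) pairs, best first."""
        return [(s, item) for s, _, item in sorted(self._heap, reverse=True)]


tracker = BestN(n=3)
for score in [0.2, 0.9, 0.5, 0.7, 0.1]:
    tracker.offer(score, {"score": score})

top_scores = [s for s, _ in tracker.best()]
print(top_scores)  # [0.9, 0.7, 0.5]
```

Unlike the list-based approach, which drops the oldest stored entry, this sketch always drops the lowest-scoring one. Which behaviour is preferable depends on how the results are used afterwards.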

## Grid Search Functions

Both of these functions are structured in the same way to keep the calling convention simple. They also use the same logic, with exceptions only where the individual model requires them.

When the function is called, a few new *local variables* are defined. **iterationCount** is inconsequential to the flow of the actual search; however, it allows the program operator to see how many tests have been completed. This is used in combination with the **number_of_tests** constant, which is calculated by multiplying together the lengths of all of the hyperparameter arrays. These are later used to report the completed tests out of the total, affirming to the operator: *tests are still being executed*. 
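
As a rough illustration of the test count, the product of the list lengths can be computed with `math.prod`. The `example_grid` dictionary and the `progress_line` helper below are hypothetical and much smaller than the notebook's test dictionaries. Note that a plain product over the dictionary values only matches the real count when each list is looped over exactly once, which is not the case for a list that drives two loops.

```python
import math

# Hypothetical grid, far smaller than the notebook's svm_tests
example_grid = {
    "kernel": ["linear", "poly"],
    "C": [1, 7],
    "tollerance": [0.0001, 0.0002, 0.0003],
}

# Multiply the lengths of all hyperparameter lists together
number_of_tests = math.prod(len(values) for values in example_grid.values())
print(number_of_tests)  # 12


def progress_line(done, total):
    """Format the progress message, with a completion percentage added."""
    return f"Iteration: {done}/{total} ({done / total:.1%})"


print(progress_line(3, number_of_tests))  # Iteration: 3/12 (25.0%)
```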

Next, the **Best Solutions** output is prepared. A new file is created in the JSON file format, and a single opening square bracket and a newline are written. This allows the different best solutions to be indexed and referenced as independent dictionaries. The final stage of preparation pushes an empty solution, together with the worst possible score for the model to achieve, onto the respective lists. This kick-starts the solution comparison: without it, there is nothing to compare against and nothing to iterate over. Because this data is added to the respective lists directly, it will not be output to the **Best Solutions**. Neither will it be output to the **Best 10 Solutions**: the first few tests will overwrite it *no matter how good they actually are*, because the initial push is so much worse.

Finally, the bulk of the functions. Both are rooted in the same principle: multiple nested for loops, incrementally cycling through the test parameters held by **data**. These values are passed to the relevant model's dictionary, which is subsequently passed to the class instance representing the chosen Machine Learning model. The prediction function is then called; once it finishes, the class instance is passed to the *Comparison Function*. 
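
For comparison, the nested-loop pattern can also be written with `itertools.product`, which yields the same combinations in the same order as nested for loops (the last list varies fastest). This is only a sketch with a hypothetical `example_grid`; it collects the combinations instead of training a model.

```python
from itertools import product

# Hypothetical grid used only for illustration
example_grid = {
    "kernel": ["linear", "poly"],
    "shrinking": [True, False],
    "C": [1, 7],
}

keys = list(example_grid)
combos = []

# One flat loop replaces one nested loop per hyperparameter
for iteration, values in enumerate(product(*(example_grid[k] for k in keys)), start=1):
    params = dict(zip(keys, values))  # a fresh dictionary for every test
    combos.append(params)

print(iteration)   # 8
print(combos[0])   # {'kernel': 'linear', 'shrinking': True, 'C': 1}
print(combos[-1])  # {'kernel': 'poly', 'shrinking': False, 'C': 7}
```

Building a fresh dictionary per combination means a stored result cannot be altered by a later iteration, at the cost of a small allocation per test.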

On completion of each iteration, the **iterationCount** variable is incremented so that the progress of the search can be seen accurately. Every time the algorithm completes one innermost loop, the **Iterations** TXT file is overwritten.

Once the entire search has been completed, a final closing bracket is added to the **Best Solutions**. Finally, the **Best 10 Solutions** are produced.
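
The sequence of opening bracket, appended entries and closing bracket can be sketched as below. The helper names are hypothetical, and the notebook's own **write_json** function (defined elsewhere) may handle separators differently. The sketch writes a comma before every entry except the first, so that the finished file parses as a valid JSON array.

```python
import json
import os
import tempfile


def start_array(path):
    """Create the file and open the JSON array."""
    with open(path, "w") as f:
        f.write("[\n")


def append_entry(path, entry, first):
    """Append one dictionary, preceded by a comma unless it is the first entry."""
    with open(path, "a") as f:
        if not first:
            f.write(",\n")
        json.dump(entry, f)


def close_array(path):
    """Close the JSON array once the search has finished."""
    with open(path, "a") as f:
        f.write("\n]\n")


# Write to a temporary folder so that no real results file is touched
path = os.path.join(tempfile.mkdtemp(), "example_results.json")
entries = [{"C": 7, "closed_fnr": 0.25}, {"C": 3, "closed_fnr": 0.2}]

start_array(path)
for index, entry in enumerate(entries):
    append_entry(path, entry, first=(index == 0))
close_array(path)

with open(path) as f:
    loaded = json.load(f)

print(loaded == entries)  # True
```

One consequence of this layout is that the file is not valid JSON until the closing bracket has been written, so an interrupted search leaves a file that needs the bracket added by hand before it can be loaded.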

### Parameters:
<div style="margin-left: 20px;">
	<p>
		<strong>data</strong> 	: This parameter should be of type Dictionary and should hold the <strong>svm_tests</strong>, or <strong>ada_ensambles_tests</strong>. 
	</p>
</div>

### Outputs:
<div style="margin-left: 20px;">
	<p>
		<strong>Best Results</strong> 	: This output will be in the JSON file format and <em>will be generated <strong>only on completion of the entire search</strong></em>.
	</p>
	<p>
		<strong>Results</strong> 	: This output will be in the JSON file format and <em>will be appended <strong>every time a new best solution is found</strong></em>.
	</p>
	<p>
		<strong>Iterations</strong> 	: This output will be in the TXT file format and <em>will be overwritten <strong>every few hundred test solutions</strong></em>.
	</p>
</div>

In [144]:
# Function to perform a depth first search on the SVM model
def svmDepthFirstSearch(data):

	## Clear the solution list and the accuracy list
	solution_list.clear()
	accuracy_list.clear()

	## Create a variable to store the number of tests that have been performed
	iterationCount = 0

	## Calculate the number of tests that will be performed
	number_of_tests = (
		len(data['kernel']) *
		len(data['decision_function_shape']) *
		len(data['possibility']) *
		len(data['possibility']) *
		len(data['max_iter']) *
		len(data['tollerance']) *
		len(data['C'])
	)


	## Set up the svm_results.json file to store the best solutions discovered by the tests
	with open("svm_results.json", 'w') as f:
		## write the opening square bracket to the file to start the JSON array 
		f.write("[\n")
	

	## Append an empty object to the solution list and set the accuracy to the worst possible metric score
	solution_list.append(current_svm_data)
	accuracy_list.append(1.0)

	## Iterate through the different parameters in the data dictionary
	for kernel in data['kernel']:
		for function_shape in data['decision_function_shape']:
			for probability in data['possibility']:
				for shrinking in data['possibility']:
					for max_iter in data['max_iter']:
						for tol in data['tollerance']:
							for C in data['C']:

								## Set the current_svm_data dictionary to the current parameters indexed by the loop
								current_svm_data['kernel'] = kernel
								current_svm_data['decision_function_shape'] = function_shape
								current_svm_data['probability'] = probability
								current_svm_data['shrinking'] = shrinking
								current_svm_data['max_iter'] = max_iter
								current_svm_data['tollerance'] = tol
								current_svm_data['C'] = C


								## Create a new instance of the svm_model class with the current_svm_data dictionary
								current_test = svm_model(current_svm_data)

								## Call the predict method of the current_test instance
								current_test.predict()
								
								## Perform a comparison check
								check_worst(current_test, "svm_results.json")

								## Increment the iteration count
								iterationCount += 1
								
							## Write the current iteration count to the file to keep track of the progress EVERY TIME THE TOLERANCE IS CHANGED
							with open("svm_iterations.txt", 'w') as f:
								f.write("Iteration: " + str(iterationCount) + "/" + str(number_of_tests))


	## Write the closing square bracket to the file to end the JSON array
	with open("svm_results.json", 'a') as f:
		f.write("]\n")


	## Write the best solutions to the output file
		
	with open("svm_best_results.json", 'w') as f:
		f.write("[\n")

	for i in range(len(solution_list)):
		write_json(solution_list[i], "svm_best_results.json")

	with open("svm_best_results.json", 'a') as f:
		f.write("]\n")

In [145]:
def adaDepthSearch(data):


	## Clear the solution list and the accuracy list
	solution_list.clear()
	accuracy_list.clear()

	## Create a variable to store the number of tests that have been performed
	iterationCount = 0

	## Calculate the number of tests which will be performed
	number_of_tests = (
		len(data['Estimator']['n_estimators']) *
		len(data['Estimator']['criterion']) *
		len(data['Estimator']['max_features']) *
		len(data['Estimator']['bootstrap']) *
		len(data['Estimator']['min_samples_split']) *
		len(data['Estimator']['min_samples_leaf']) *
		len(data['Params']['n_estimators']) *
		len(data['Params']['learning_rate']) *
		len(data['Params']['algorithim'])
	)

	## Set up the ada_results.json file to store the best solutions discovered by the tests
	with open("ada_results.json", 'w') as f:

		## write the opening square bracket to the file to start the JSON array
		f.write("[\n")
		

	## Append an empty object to the solution list
	solution_list.append(current_ada_data)
	## Set the accuracy to the worst possible metric score
	accuracy_list.append(0.0)


	## Iterate through the different parameters in the data dictionary relating to the Random Forest Classifier
	for n_estimators in data['Estimator']['n_estimators']:
		for criterion in data['Estimator']['criterion']:
			for max_features in data['Estimator']['max_features']:
				for bootstrap in data['Estimator']['bootstrap']:
					for min_samples_split in data['Estimator']['min_samples_split']:
						for min_samples_leaf in data['Estimator']['min_samples_leaf']:
							

							## Set the estimator dictionary to the current parameters indexed by the loop
							current_ada_data['Estimator']['n_estimators'] = n_estimators
							current_ada_data['Estimator']['criterion'] = criterion
							current_ada_data['Estimator']['max_features'] = max_features
							current_ada_data['Estimator']['bootstrap'] = bootstrap
							current_ada_data['Estimator']['min_samples_split'] = min_samples_split
							current_ada_data['Estimator']['min_samples_leaf'] = min_samples_leaf


							## Iterate through the different parameters in the data dictionary relating to the AdaBoost Classifier
							for n_estimators_params in data['Params']['n_estimators']:
								for learning_rate in data['Params']['learning_rate']:
									for algorithim in data['Params']['algorithim']:

										
										## Set the hyperparameters dictionary to the current parameters indexed by the loop
										current_ada_data['Params']['n_estimators'] = n_estimators_params
										current_ada_data['Params']['learning_rate'] = learning_rate
										current_ada_data['Params']['algorithim'] = algorithim

										## Create a new instance of the ada_model class with the current_ada_data dictionary
										current_test = ada_model(current_ada_data)

										## Call the predict method of the current_test instance
										current_test.predict()
										

										## Perform a comparison check
										check_best(current_test, "ada_results.json")

										## Increment the iteration count
										iterationCount += 1
										
									## Write the current iteration count to the file to keep track of the progress EVERY TIME THE LEARNING_RATE IS CHANGED
									with open("ada_iterations.txt", 'w') as f:
										f.write("Iteration: " + str(iterationCount) + "/" + str(number_of_tests))


	## Write the closing square bracket to the file to end the JSON array
	with open("ada_results.json", 'a') as f:
		f.write("]\n")


	## Write the best solutions to the output file
	with open("ada_best_results.json", 'w') as f:
		f.write("[\n")

	for i in range(len(solution_list)):
		write_json(solution_list[i], "ada_best_results.json")

	with open("ada_best_results.json", 'a') as f:
		f.write("]\n")

# Execute the Search Algorithms

These functions simply call the aforementioned search algorithms with the predefined list of hyperparameters.

In [146]:
# svmDepthFirstSearch(svm_tests)

In [147]:
# adaDepthSearch(ada_ensambles_tests)

# Read the Results into memory

This is where the application reads the results of a previously completed search. It is important to note that this function does not load the newly created JSON files. Instead, it reads a finished test stored in a separate folder. This ensures that the previously learned solutions are not removed and can be reapplied in the future. Using the **json.load** method, it's possible to easily translate a JSON array into a Python list of dictionaries. This will make referencing much easier.

In [148]:
# Define the array to store the solutions

array_of_solutions = []

def read_json(filepath):
	## Open the file in read mode
	with open(filepath, 'r') as f:
		## Parse the JSON file and return its contents
		return json.load(f)


## Read the best results from the SVM model
## (a forward slash avoids the invalid "\s" escape sequence and works on all platforms)
array_of_solutions = read_json("TestOutputsSVM/soltuions320k.json")

## Panda Restructure

Creating a pandas **DataFrame** automatically converts the data into columns and rows, like a standard table. This data structure should be much more readable to the operator and can then be used to make predictions about how the model could perform under the correct hyperparameters. The cell below displays the 80 solutions found in the previous test. It may be apparent that the **solutions320k.json** file is formatted differently to the one produced by the grid search. This results file was generated before the **write_json** function was designed. The previous method resulted in poor readability and messy data structures. The new method is much more streamlined, but converting the reading method for the svm_results would take some consideration.

***Note**: This step is unnecessary. It just helps to give an idea of the data to be processed, and gives a better understanding of how the following functions handle said data.*

In [149]:
#frame the data into a pandas dataframe
df = pd.DataFrame(array_of_solutions)

#display the data
print(df[0:80])


    kernel func_shape  shrinking  probability  tollerance  max_iter  C  \
0     poly        ovo       True         True      0.0002      1000  7   
1     poly        ovo       True         True      0.0003      1000  7   
2     poly        ovo       True         True      0.0004      1000  7   
3     poly        ovo       True         True      0.0005      1000  7   
4     poly        ovo       True         True      0.0006      1000  7   
..     ...        ...        ...          ...         ...       ... ..   
75  linear        ovo       True         True      0.0006      1303  7   
76  linear        ovo       True         True      0.0007      1303  7   
77  linear        ovo       True         True      0.0008      1303  7   
78  linear        ovo       True         True      0.0009      1303  7   
79  linear        ovo       True         True      0.0010      1303  7   

    closed_accuracy  closed_precision  closed_recall  closed_f1  closed_fnr  \
0          0.740260          0.6

## Reading Helper Functions 

These functions, while simple in nature, allow for far cleaner code, improving overall readability. Instead of writing these code blocks in-line, placing them inside functions splits the code up. This in turn helps to lessen the code's indentation, reducing confusion. 

### *Results to Grid* Function

These functions are designed to take the results from **read_json**, translating them into a dictionary with arrays corresponding to the unique values for each parameter. 

They start by defining the dictionary in which to store the values, then iterate over each individual entry in the results file. During iteration, each value is appended to a list. Once this cycle has completed, a new dictionary is defined with the same keys. The lists are then converted into sets, removing any duplicate values, and finally converted back into lists for consistency. This final dictionary, filled with keys associated with lists of unique values, is returned for later use. 

It is worth noting that the outputs will **not** include the metrics produced by the model. These will not be needed for the following functions. 

#### Parameters:
<div style="margin-left: 20px;">
	<p>
		<strong>non_unique_values</strong> 	: This parameter is an array of dictionaries produced by the <strong>read_json()</strong> function. 
	</p>
</div>

#### Returns:
<div style="margin-left: 20px;">
	<p>
		<strong>unique_dict</strong> 	: This variable is a Dictionary which stores all unique <em>best</em> inputs associated with the respective model.
	</p>
</div>
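
The list-to-set-to-list approach described above can be sketched generically. The `unique_values_per_key` helper and the sample records are hypothetical and are not part of the notebook. The sketch sorts each result because sets have no guaranteed order; this assumes the values under each key can be compared with one another.

```python
def unique_values_per_key(records, keys):
    """Collect the unique values found under each key across all records."""
    unique = {key: set() for key in keys}
    for record in records:
        for key in keys:
            unique[key].add(record[key])
    # Sorting gives a stable, repeatable order (values must be comparable)
    return {key: sorted(values) for key, values in unique.items()}


# Hypothetical results, including a metric that should be left out
records = [
    {"kernel": "poly", "C": 7, "shrinking": True, "closed_fnr": 0.3},
    {"kernel": "linear", "C": 7, "shrinking": False, "closed_fnr": 0.4},
    {"kernel": "poly", "C": 7, "shrinking": True, "closed_fnr": 0.5},
]

grid = unique_values_per_key(records, ["kernel", "C", "shrinking"])
print(grid)
# {'kernel': ['linear', 'poly'], 'C': [7], 'shrinking': [False, True]}
```

Only the keys passed in are collected, so metrics are excluded simply by not listing them.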


In [150]:
def svm_results_to_grid(non_unique_values):

	## Create a dictionary to store every value found in the non_unique_values list
	unique_search = {
		'kernel': [],
		'max_iter': [],
		'decision_function_shape': [],
		'probability': [],
		'shrinking': [],
		'C': [],
		'tollerance': [],
	}

	## Iterate through the non_unique_values list
	for i in range(len(non_unique_values)):
		## Append the values to the unique_search dictionary
		unique_search['kernel'].append(non_unique_values[i]['kernel'])
		unique_search['max_iter'].append(non_unique_values[i]['max_iter'])
		unique_search['decision_function_shape'].append(non_unique_values[i]['func_shape'])
		unique_search['probability'].append(non_unique_values[i]['probability'])
		unique_search['shrinking'].append(non_unique_values[i]['shrinking'])
		unique_search['C'].append(non_unique_values[i]['C'])
		unique_search['tollerance'].append(non_unique_values[i]['tollerance'])

	## Create a dictionary to store the unique values of the unique_search dictionary
	unique_dict = {
		## Cast the lists to sets to remove duplicates and then cast them back to lists
		'kernel': list(set(unique_search['kernel'])),
		'max_iter': list(set(unique_search['max_iter'])),
		'decision_function_shape': list(set(unique_search['decision_function_shape'])),
		'probability': list(set(unique_search['probability'])),
		'shrinking': list(set(unique_search['shrinking'])),
		'C': list(set(unique_search['C'])),
		'tollerance': list(set(unique_search['tollerance'])),
	}


	return unique_dict

In [151]:
def ada_results_to_grid(non_unique_values):
	unique_search = {
		'estimator': {
				"n_estimators": [],#int
				"criterion": [],#str
				"max_features": [],#str
				"bootstrap": [],#bool
				"min_samples_split": [],#int
				"min_samples_leaf": []#int
			},
		'params': {
				"n_estimators": [],#int
				"learning_rate": [],#float
				"algorithim": []#str
			}
	}

	for i in range(len(non_unique_values)):
		unique_search['estimator']['n_estimators'].append(non_unique_values[i]['Estimator']['n_estimators'])
		unique_search['estimator']['criterion'].append(non_unique_values[i]['Estimator']['criterion'])
		unique_search['estimator']['max_features'].append(non_unique_values[i]['Estimator']['max_features'])
		unique_search['estimator']['bootstrap'].append(non_unique_values[i]['Estimator']['bootstrap'])
		unique_search['estimator']['min_samples_split'].append(non_unique_values[i]['Estimator']['min_samples_split'])
		unique_search['estimator']['min_samples_leaf'].append(non_unique_values[i]['Estimator']['min_samples_leaf'])

		unique_search['params']['n_estimators'].append(non_unique_values[i]['Params']['n_estimators'])
		unique_search['params']['learning_rate'].append(non_unique_values[i]['Params']['learning_rate'])
		unique_search['params']['algorithim'].append(non_unique_values[i]['Params']['algorithim'])

	
	unique_dict = {
		'estimator': {
				"n_estimators": list(set(unique_search['estimator']['n_estimators'])),#int
				"criterion": list(set(unique_search['estimator']['criterion'])),#str
				"max_features": list(set(unique_search['estimator']['max_features'])),#str
				"bootstrap": list(set(unique_search['estimator']['bootstrap'])),#bool
				"min_samples_split": list(set(unique_search['estimator']['min_samples_split'])),#int
				"min_samples_leaf": list(set(unique_search['estimator']['min_samples_leaf']))#int
			},
		'params': {
				"n_estimators": list(set(unique_search['params']['n_estimators'])),#int
				"learning_rate": list(set(unique_search['params']['learning_rate'])),#float
				"algorithim": list(set(unique_search['params']['algorithim']))#str
			}
	}

	return unique_dict

### Small example code

This example shows how the functions can be used in conjunction. It is designed to quickly show the operator the best values found for each hyperparameter. 

In [152]:
print(json.dumps(svm_results_to_grid(read_json("TestOutputsSVM/soltuions320k.json")), indent=8))

{
        "kernel": [
                "poly",
                "linear"
        ],
        "max_iter": [
                1121,
                1060,
                1000,
                1161,
                1141,
                1303,
                1080,
                1020
        ],
        "decision_function_shape": [
                "ovo"
        ],
        "probability": [
                true
        ],
        "shrinking": [
                false,
                true
        ],
        "C": [
                7
        ],
        "tollerance": [
                0.0009000000000000001,
                0.0004,
                0.0008,
                0.00030000000000000003,
                0.0002,
                0.0007000000000000001,
                0.0011,
                0.0006000000000000001,
                0.0001,
                0.001,
                0.0005
        ]
}


# Creating Optimised Models

