# Data Mining ACW: Coding Submission

--IMPORTANT! Before running, ensure scikit-learn, pandas and seaborn are installed properly! This script was written with Python 3.7.3, 32-bit.--


## Step 1: Importing our Libraries and Raw Data

Below, you can see the data we've read in.

In [1]:
import pandas as pd
import sklearn as skl
import seaborn as sb
import math
print("Getting to work...")
rawData = pd.read_csv('ACWData.csv')

Getting to work...


In [2]:
rawData.describe()

| | Random | Id | IPSI |
|---|---|---|---|
| count | 1520.0 | 1520.0 | 1516.0 |
| mean | 0.509545 | 188365.022368 | 78.872032 |
| std | 0.284006 | 64355.870242 | 10.162351 |
| min | 0.000295 | 78261.0 | 35.0 |
| 25% | 0.268531 | 137130.75 | 73.0 |
| 50% | 0.517616 | 191344.5 | 77.0 |
| 75% | 0.754724 | 244559.5 | 85.0 |
| max | 0.999448 | 295978.0 | 99.0 |



## Step 2: Data Cleaning

Next, a second data frame is created to house only entries that adhere to the specified data taxonomy,
along with another to hold entries with erroneous feature values, so that all problematic samples can be removed from the clean set.
Duplicate entries are also removed to prevent interference with classification later on.


In [3]:
cleanData = rawData.copy()
cleanData['Indication'] = cleanData['Indication'].str.upper()
cleanData['IPSI'] = pd.to_numeric(cleanData['IPSI'] , errors='coerce', downcast='float')
cleanData['Contra'] = pd.to_numeric(cleanData['Contra'] , errors='coerce', downcast='float')

In [4]:
badData = pd.DataFrame(columns = cleanData.columns)
badData_ref = cleanData.isna()

In [5]:
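# Collect every row that contains at least one null value.
for index, row in badData_ref.iterrows():
    for column in badData_ref:
        if row[column]:
            badData = badData.append(cleanData.loc[[index]])
            break  # one copy per row, even if several attributes are null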

In [6]:
cleanData = cleanData.dropna()
cleanData = cleanData.drop_duplicates(subset='Id')
cleanData.describe()

| | Random | Id | IPSI | Contra |
|---|---|---|---|---|
| count | 1502.0 | 1502.0 | 1502.0 | 1502.0 |
| mean | 0.509369 | 188063.770306 | 78.832886 | 56.69574 |
| std | 0.284234 | 64454.25388 | 10.163915 | 29.52651 |
| min | 0.000295 | 78261.0 | 35.0 | 10.0 |
| 25% | 0.268254 | 135885.25 | 73.0 | 30.0 |
| 50% | 0.516824 | 191053.0 | 77.0 | 50.0 |
| 75% | 0.754513 | 244417.5 | 85.0 | 85.0 |
| max | 0.999448 | 295978.0 | 99.0 | 100.0 |



The for loop above iterates over all 1,520 rows in the table, checking each one against a reference dataframe of boolean values that flag the presence of any null attributes.
Once a null is detected, the index is read from its row, and the corresponding row in the clean data frame is added to the bank of bad data.
The resultant tables are cleanData, which holds only clean, non-erroneous entries; badData, which holds only entries with null or unknown values; and badData_ref, which acts as a flag table for finding null values.
Rows with nulls are dropped from the clean data frame entirely; they will later be added back to a composite of the two frames when the missing values are repaired.
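
The same separation can also be written without an explicit loop. The sketch below is an alternative for illustration, not the code used in this submission: it runs on a small made-up frame (only the column names mirror the data set) and uses `isna().any(axis=1)` to flag each row once, however many of its attributes are missing.

```python
import pandas as pd

# Hypothetical miniature of the data set; the values are made up for illustration.
demo = pd.DataFrame({
    'Id':     [1, 2, 3, 4],
    'IPSI':   [70.0, None, 85.0, None],
    'Contra': [30.0, 50.0, None, None],
})

# One boolean per row: True when any attribute in that row is null.
has_null = demo.isna().any(axis=1)

bad_demo = demo[has_null]      # rows needing repair, each listed exactly once
clean_demo = demo[~has_null]   # rows with every attribute present

print(bad_demo['Id'].tolist())
print(clean_demo['Id'].tolist())
```

Row 4 is missing both values but still appears only once in `bad_demo`, which keeps the bad-data table free of duplicates.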
<br>
<br>



## Step 3: Removing Outliers

The following few lines build additional dataframes that isolate each of the two classes.
This is for ease of comparison, and for validation while debugging.<br>


In [7]:
cDataOutFree = cleanData.copy()
# Boolean indexing keeps only the matching rows (where() would leave all-NaN rows behind).
cDataOutFreeR = cDataOutFree[cDataOutFree.label == 'Risk'].copy()
cDataOutFreeNR = cDataOutFree[cDataOutFree.label == 'NoRisk'].copy()

Now, we'll compare how far each IPSI and Contra value lies from its mean,
in both global and class-specific contexts. Values more than three standard
deviations from the mean are treated as outliers. Each class of data has been
partitioned into its own table, so its numeric features are judged against the
mean and standard deviation of that class alone.<br>

Both of the data frames containing class specific data are then merged into one frame,
removing duplicates to ensure integrity in representing a cleaned set with respect to
original raw values.
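
As a rough illustration of the class-specific check, the sketch below applies a three-standard-deviation rule per class on a made-up frame. The helper name `within_three_sigma` and all values are assumptions for the example; only the column names follow the data set.

```python
import pandas as pd

def within_three_sigma(frame, column):
    """Boolean mask: True where `column` lies within 3 standard deviations of its class mean."""
    grouped = frame.groupby('label')[column]
    deviation = (frame[column] - grouped.transform('mean')).abs()
    return deviation <= 3 * grouped.transform('std')

# Hypothetical data: 12 rows per class, with one extreme IPSI value in the Risk class (Id 12).
demo = pd.DataFrame({
    'Id':     list(range(1, 25)),
    'label':  ['Risk'] * 12 + ['NoRisk'] * 12,
    'IPSI':   [80.0] * 11 + [20.0] + [75.0, 77.0] * 6,
    'Contra': [30.0, 50.0] * 12,
})

# A row is kept only if it passes the check for both numeric features.
keep = within_three_sigma(demo, 'IPSI') & within_three_sigma(demo, 'Contra')
filtered = demo[keep]

print(len(demo), len(filtered))
```

Combining the two masks with `&` means a row must be unremarkable in both features to survive, which is one reasonable reading of "all outliers removed".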

In [8]:
# Keep rows within three standard deviations of the mean for both IPSI and Contra.
cDataOutFree = cDataOutFree[abs(cDataOutFree.IPSI - cDataOutFree.IPSI.mean()) <= 3 * cleanData.IPSI.std()]
cDataOutFree = cDataOutFree[abs(cDataOutFree.Contra - cDataOutFree.Contra.mean()) <= 3 * cleanData.Contra.std()]
cDataOutFree = cDataOutFree.drop_duplicates(subset='Id')

In [9]:
# Repeat the three-sigma filter per class, using each class's own standard deviations.
riskStd = cleanData[cleanData.label == 'Risk'][['IPSI', 'Contra']].std()
noRiskStd = cleanData[cleanData.label == 'NoRisk'][['IPSI', 'Contra']].std()
cDataOutFreeR = cDataOutFreeR[abs(cDataOutFreeR.IPSI - cDataOutFreeR.IPSI.mean()) <= 3 * riskStd.IPSI]
cDataOutFreeR = cDataOutFreeR[abs(cDataOutFreeR.Contra - cDataOutFreeR.Contra.mean()) <= 3 * riskStd.Contra]
cDataOutFreeNR = cDataOutFreeNR[abs(cDataOutFreeNR.IPSI - cDataOutFreeNR.IPSI.mean()) <= 3 * noRiskStd.IPSI]
cDataOutFreeNR = cDataOutFreeNR[abs(cDataOutFreeNR.Contra - cDataOutFreeNR.Contra.mean()) <= 3 * noRiskStd.Contra]

In [10]:
# Merge the two class-specific frames, then de-duplicate the merged frame itself.
cDataOutFreeCR = cDataOutFreeR.append(cDataOutFreeNR)
cDataOutFreeCR = cDataOutFreeCR.drop_duplicates(subset='Id')
cDataOutFreeCR = cDataOutFreeCR.sort_index()


## Step 4: Transforms

Now that we have our clean sets with all outliers removed, we need to make them numeric so any classifier we use can make sense of them.<br>

Rather than defining individual data frames for each of the sets we have,
it is much easier to write a function that takes a given data frame and
performs all the transforms we need. That way, it can be called whenever a
classifier needs to be fitted.

### numGen

numGen takes a dataframe and returns a numeric version of it.
It generates dummy (one-hot) columns from the nominal features of the supplied dataframe,
copies the already-numeric IPSI and Contra columns and the label across, and drops
every dummy column in which 1 corresponds to a 'no' value in any of our nominal types.
As part of this, the Indication column is expanded into four individual binary columns.
- Parameters:
    - data (dataframe object) - the array of data you want to transform.
- Returns:
    - numData (dataframe object) - a transformed array of data.
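
As a minimal sketch of the transform described above, the example below runs the same three steps on a small made-up frame. The rows are hypothetical; the column names and category values follow those used by numGen.

```python
import pandas as pd

# Hypothetical rows; only the column names and category values mirror the data set.
demo = pd.DataFrame({
    'Indication': ['A-F', 'TIA', 'CVA', 'ASX'],
    'Diabetes':   ['no', 'yes', 'no', 'no'],
    'IPSI':       [78.0, 85.0, 70.0, 95.0],
})

# One binary column per category value, named <column>_<value>.
encoded = pd.get_dummies(demo[['Indication', 'Diabetes']])

# For a yes/no attribute the two dummies mirror each other, so one is dropped.
encoded = encoded.drop(columns=['Diabetes_no'])

# Numeric columns are carried across unchanged.
encoded['IPSI'] = demo['IPSI']

print(list(encoded.columns))
```

The resulting column names (`Indication_A-F`, `Diabetes_yes`, and so on) are the ones the feature list in the modelling step expects.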

### genMet

In addition to this, it would be helpful to transform any performance metrics
we get back into something more legible. This is what genMet does, taking the
name of the data set, the fitted classifier, any parameter descriptions of note,
a precomputed accuracy score and a confusion matrix ndarray.
Using these, it generates metrics in a format matching our log table, which will
later be used to note any successes.
- Parameters:
    - data (string) - the name of the dataframe the supplied classifier has been fitted to.
    - clf (classifier object) - a fitted classifier of any type.
    - params (string) - any parameters of note to be attached to the results log entry.
    - acc (float) - the accuracy associated with the supplied classifier's predictivity.
    - matrix (ndarray) - the confusion matrix associated with the supplied classifier's predictions.   
- Returns:
    - cols (dict) - formatted results for the supplied classifier, ready to be added as a row of the main results log.
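
The derived columns of the log can be reproduced from the four counts alone. The sketch below is a standalone illustration with hypothetical helper names (`safe_ratio`, `distance_metrics`); it reads the Manhattan and Euclidean figures as the distance of (1 - sensitivity, 1 - specificity) from the origin, so 0 would mean a perfect classifier.

```python
import math

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0 when the denominator is 0."""
    return numerator / denominator if denominator else 0

def distance_metrics(tp, tn, fp, fn):
    sens = safe_ratio(tp, tp + fn)   # share of positives that were found
    spec = safe_ratio(tn, tn + fp)   # share of negatives that were found
    miss, false_alarm = 1 - sens, 1 - spec
    return {
        'Sens': sens,
        'Spec': spec,
        'Manhattan': miss + false_alarm,
        'Euclidean': math.hypot(miss, false_alarm),
    }

# Counts taken from the first row of the results table (cleanData, DecisionTreeClassifier).
m = distance_metrics(tp=289, tn=160, fp=2, fn=0)
print(round(m['Spec'], 6), round(m['Manhattan'], 6), round(m['Euclidean'], 6))
```

Because sensitivity is 1.0 in that row, both distances collapse to 1 - specificity, which is why the logged Manhattan and Euclidean values are identical there.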

In [11]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

In [12]:
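def numGen(data):
    # One-hot encode the nominal columns; Indication expands to one column per value.
    numData = pd.get_dummies(data[['Indication', 'Diabetes', 'IHD', 'Hypertension', 'Arrhythmia', 'History']])
    # Carry the numeric features and the class label across unchanged.
    numData['IPSI'] = data['IPSI']
    numData['Contra'] = data['Contra']
    numData['label'] = data['label']
    # Drop the redundant '_no' dummies, leaving a single 0/1 column per yes/no attribute.
    numData = numData.drop(['Diabetes_no', 'IHD_no', 'Hypertension_no', 'Arrhythmia_no', 'History_no'], axis=1)
    return numData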

In [13]:
def genMet (data, clf, params, acc, matrix):
        
        datName = data
        clfType = str(clf).split('(')[0]
        tp = matrix[0,0]
        tn = matrix[1:,1:].sum()
        fp = matrix[0,1:].sum()
        fn = matrix[1:,0].sum()
        tp = int(tp)
        tn = int(tn)
        fp = int(fp)
        fn = int(fn)
        acc = "{:.4%}".format(acc)
        sens = zeroCatch(tp,(tp+fn))
        spec = zeroCatch(tn,(tn+fp))
        manhat = (1-sens)+(1-spec)
        eucl = math.sqrt(math.pow((1-sens),2)+math.pow((1-spec),2))
        print("Logged " + datName + " entry for " + clfType + " classifier, accuracy: " + acc)
        cols = {'Data Set':datName,'Classifier':clfType,'Parameters':params,'Accuracy':acc,'TP':tp,'TN':tn,'FP':fp,'FN':fn,'Spec':spec,'Sens':sens,'1-Spec':1-spec,'1-Sens':1-sens,'Manhattan':manhat,'Euclidian':eucl}
        return cols

In [14]:
def zeroCatch (a, b) :
        return a / b if b else 0 #stops any dividing by zero nonsense!


## Step 5: Modelling

Finally, fitPredictData takes whatever dataframe is passed to it, along with a
string to identify any log entries pertaining to it. It initialises the same
set of models every time and creates 70/30 training/test splits from the output
of the numeric transformer, numGen, written in the previous step.
These are then used to fit, predict and log the results from each of the classifiers,
using the parameters they were initialised with.


### fitPredictData

Takes a given data frame, splits it by a 70/30 ratio into training and test sets respectively,
and fits three different types of classifiers to it, with eight permutations of parameters and meta-parameters.
Then, using a local function to supply each one with a test set to predict, numGen and genMet are
called to supply the appropriate entries for the results log for each individual classifier.

- Parameters:
    - argData (dataframe object) - the dataframe you want to classify.<br>
    - dataName (string) - the name you want to use to refer to the dataframe in the results logs.<br>
- Returns:
    - resList (dataframe object) - the results log for each of the 8 classifier permutations, collated into a dataframe and sorted in order of execution.<br>
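
Because the split is shuffled, the rows that land in each set change from run to run. A minimal sketch, assuming toy data and a fixed `random_state` (neither is part of the original call), shows how `train_test_split` can be made repeatable; `stratify` is a further assumption that keeps class proportions equal across the two sets.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, one feature, two balanced classes.
X = [[i] for i in range(20)]
Y = [0] * 10 + [1] * 10

# random_state pins the shuffle; stratify keeps the class ratio the same in both partitions.
split_a = train_test_split(X, Y, test_size=0.3, shuffle=True, random_state=0, stratify=Y)
split_b = train_test_split(X, Y, test_size=0.3, shuffle=True, random_state=0, stratify=Y)

X_train, X_test, Y_train, Y_test = split_a
print(len(X_train), len(X_test))   # sizes of the 70/30 partitions
print(split_a == split_b)          # True when both calls produced the same partitions
```

With a fixed seed, differences between classifiers in the log can be attributed to the models rather than to which rows happened to fall in the test set.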

### clfPredict

Takes a given fitted model and a string containing any parameters to note in the results log.
Predicts the model against a test set, and generates accuracy and confusion matrix metrics to
supply to genMet to generate log entries for the respective classifier/data combination.

- Parameters:
    - model (classifier object) - the fully fitted model to generate performance metrics for.
    - args (string) - any arguments of note supplied to the classifier that may affect its operation.

- Returns:
    - cols (dict) - clfPredict calls genMet and passes the object returned by that function straight back to whatever called it. Probably bad practice, but it keeps the code working.

In [15]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.neural_network import MLPClassifier
from sklearn.cluster import KMeans
from sklearn.ensemble import AdaBoostClassifier,BaggingClassifier
import time

In [16]:
def fitPredictData(argData, dataName) :
        data = ((numGen(argData.copy())))
        le = preprocessing.LabelEncoder()
        features = ['Indication_A-F', 'Indication_ASX', 'Indication_CVA', 'Indication_TIA', 'Diabetes_yes', 'IHD_yes', 'Hypertension_yes', 'Arrhythmia_yes', 'History_yes', 'IPSI', 'Contra',]
        X = data.loc[:, features]
        Y = le.fit_transform(data.label)
        X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.3,shuffle=True)
        tree = DecisionTreeClassifier()
        km = KMeans(n_clusters=2, random_state=0)
        mlp = MLPClassifier(max_iter=350)
        metaTree = DecisionTreeClassifier(min_samples_split=8,max_depth=350)
        metaKm = KMeans(n_clusters=2, n_init=650)
        metaMlp = MLPClassifier(max_iter=700, activation='logistic')
        adaT = AdaBoostClassifier(DecisionTreeClassifier(min_samples_split=4, max_depth=350),n_estimators=50,learning_rate=1)
        bgK = BaggingClassifier(KMeans(n_clusters=2, n_init=350),n_estimators=50,max_features=11)
        resList =  pd.DataFrame(columns=['Data Set', 'Classifier','Parameters','Accuracy','TP','TN','FP','FN','Spec','Sens','1-Spec','1-Sens','Manhattan','Euclidian'])
    
        tree = tree.fit(X_train, Y_train)
        km = km.fit(X_train)
        mlp = mlp.fit(X_train,Y_train)
        metaTree = metaTree.fit(X_train,Y_train)
        metaKm = metaKm.fit(X_train,Y_train)
        metaMlp = metaMlp.fit(X_train,Y_train)
    
        
        adaT = adaT.fit(X_train,Y_train)
        bgK = bgK.fit(X_train,Y_train)
        
        print("Working on the next set of predictors...")
        def clfPredict (model, args) :
                result = (model).predict(X_test)
                accuracy = metrics.accuracy_score(Y_test,result)
                matrix = metrics.confusion_matrix(Y_test,result)
                
                return genMet(dataName,model,args,accuracy,matrix)
        resList = resList.append(clfPredict(tree,"std"), ignore_index=True)
        resList = resList.append(clfPredict(km,"std"),ignore_index=True)
        resList = resList.append(clfPredict(mlp,"300 max iterations"),ignore_index=True)
        resList = resList.append(clfPredict(metaTree,"350 max depth, 8 min split samples"),ignore_index=True)
        resList = resList.append(clfPredict(metaKm,"350 max depth"),ignore_index=True)
        resList = resList.append(clfPredict(metaMlp, "350 max iterations, logistic activator"),ignore_index=True)
        resList = resList.append(clfPredict(adaT,"AdaBoosted Decision Tree"),ignore_index=True)
        resList = resList.append(clfPredict(bgK,"Bagged KMeans"),ignore_index=True)
        return resList

At last, the classification function fitPredictData is called on each of the dataframes we cleaned
previously; each is automatically transformed, split, classified and logged with a
single function call. Once all results have been logged, the resultant dataframe is written to a CSV
file, results.csv, in the directory the script was run from.<br>

On repeated executions it becomes quite evident that the KMeans classifier is incredibly inconsistent, which I suspect is due to how the training and test sets are passed to it. My attempts to mitigate this have been unsuccessful, but given the comparatively consistent success of the other options, we are not without choice for implementation in a hypothetical deployment stage.
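
One possible contributor, offered as an assumption rather than a confirmed diagnosis: KMeans is unsupervised, so the cluster indices it returns are arbitrary and need not match the encoded class labels. In the results table the two KMeans runs on cleanData show mirrored counts (52/27/239/133 against 239/133/52/27), which is consistent with the clusters simply being numbered the other way round. The sketch below uses made-up labels and a hypothetical helper, `align_clusters`, to map each cluster to the majority class among its members before scoring.

```python
from collections import Counter

def align_clusters(true_labels, cluster_ids):
    """Map each cluster id to the most common true label among its members."""
    members = {}
    for label, cluster in zip(true_labels, cluster_ids):
        members.setdefault(cluster, Counter())[label] += 1
    mapping = {cluster: counts.most_common(1)[0][0] for cluster, counts in members.items()}
    return [mapping[cluster] for cluster in cluster_ids]

def accuracy(true_labels, predicted):
    return sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)

# Made-up example: the clustering separates the classes well, but numbers them in reverse.
y_true   = [0, 0, 0, 0, 1, 1, 1, 1]
clusters = [1, 1, 1, 0, 0, 0, 0, 0]

raw_score = accuracy(y_true, clusters)
aligned_score = accuracy(y_true, align_clusters(y_true, clusters))
print(raw_score, aligned_score)
```

In a real run the cluster-to-class mapping would be learned on the training set and then applied to the test predictions, so that the test labels are not used to choose the mapping.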

In [17]:
clfList = pd.DataFrame(columns=['Data Set', 'Classifier','Parameters','Accuracy','TP','TN','FP','FN','Spec','Sens','1-Spec','1-Sens','Manhattan','Euclidian'])

In [18]:
clfList = clfList.append(fitPredictData(cleanData, 'cleanData'), ignore_index=True)
clfList = clfList.append(fitPredictData(cDataOutFree, 'cDataOutFree'), ignore_index=True)
clfList = clfList.append(fitPredictData(cDataOutFreeCR, 'cDataOutFreeCR'), ignore_index=True)
export = clfList.to_csv(path_or_buf=r'results.csv', index=None, header=True)



Working on the next set of predictors...
Logged cleanData entry for DecisionTreeClassifier classifier, accuracy: 99.3348%
Logged cleanData entry for KMeans classifier, accuracy: 17.2949%
Logged cleanData entry for MLPClassifier classifier, accuracy: 95.5654%
Logged cleanData entry for DecisionTreeClassifier classifier, accuracy: 98.6696%
Logged cleanData entry for KMeans classifier, accuracy: 82.4834%
Logged cleanData entry for MLPClassifier classifier, accuracy: 97.1175%
Logged cleanData entry for AdaBoostClassifier classifier, accuracy: 99.1131%
Logged cleanData entry for BaggingClassifier classifier, accuracy: 82.4834%
Working on the next set of predictors...
Logged cDataOutFree entry for DecisionTreeClassifier classifier, accuracy: 98.8914%
Logged cDataOutFree entry for KMeans classifier, accuracy: 17.7384%
Logged cDataOutFree entry for MLPClassifier classifier, accuracy: 96.6741%
Logged cDataOutFree entry for DecisionTreeClassifier classifier, accuracy: 98.8914%
Logged cDataOutFre

If you can see a table below this cell, all data has been processed, cleaned, modelled and classified. View the results in results.csv, in the directory you ran this script from!

In [19]:
clfList

| | Data Set | Classifier | Parameters | Accuracy | TP | TN | FP | FN | Spec | Sens | 1-Spec | 1-Sens | Manhattan | Euclidian |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cleanData | DecisionTreeClassifier | std | 99.3348% | 289 | 160 | 2 | 0 | 0.987654 | 1.0 | 0.012346 | 0.0 | 0.012346 | 0.012346 |
| 1 | cleanData | KMeans | std | 17.2949% | 52 | 27 | 239 | 133 | 0.101504 | 0.281081 | 0.898496 | 0.718919 | 1.617415 | 1.150713 |
| 2 | cleanData | MLPClassifier | 300 max iterations | 95.5654% | 282 | 149 | 9 | 11 | 0.943038 | 0.962457 | 0.056962 | 0.037543 | 0.094505 | 0.068221 |
| 3 | cleanData | DecisionTreeClassifier | 350 max depth, 8 min split samples | 98.6696% | 289 | 156 | 2 | 4 | 0.987342 | 0.986348 | 0.012658 | 0.013652 | 0.02631 | 0.018617 |
| 4 | cleanData | KMeans | 350 max depth | 82.4834% | 239 | 133 | 52 | 27 | 0.718919 | 0.898496 | 0.281081 | 0.101504 | 0.382585 | 0.298847 |
| 5 | cleanData | MLPClassifier | 350 max iterations, logistic activator | 97.1175% | 289 | 149 | 2 | 11 | 0.986755 | 0.963333 | 0.013245 | 0.036667 | 0.049912 | 0.038986 |
| 6 | cleanData | AdaBoostClassifier | AdaBoosted Decision Tree | 99.1131% | 289 | 159 | 2 | 1 | 0.987578 | 0.996552 | 0.012422 | 0.003448 | 0.015871 | 0.012892 |
| 7 | cleanData | BaggingClassifier | Bagged KMeans | 82.4834% | 239 | 133 | 52 | 27 | 0.718919 | 0.898496 | 0.281081 | 0.101504 | 0.382585 | 0.298847 |
| 8 | cDataOutFree | DecisionTreeClassifier | std | 98.8914% | 308 | 138 | 1 | 4 | 0.992806 | 0.987179 | 0.007194 | 0.012821 | 0.020015 | 0.014701 |
| 9 | cDataOutFree | KMeans | std | 17.7384% | 51 | 29 | 258 | 113 | 0.101045 | 0.310976 | 0.898955 | 0.689024 | 1.587979 | 1.13264 |
