# CS3033/CS6405 - Data Mining - Second Assignment

### Submission

This assignment is **due on 06/04/22 at 23:59**. You should submit a single .ipynb file with your Python code and analysis electronically via Canvas.
Please note that this assignment will account for 25 Marks of your module grade.

### Declaration

By submitting this assignment, I agree to the following:

<font color="red">“I have read and understand the UCC academic policy on plagiarism, and agree to the requirements set out thereby in relation to plagiarism and referencing. I confirm that I have referenced and acknowledged properly all sources used in the preparation of this assignment.
I declare that this assignment is entirely my own work based on my personal study. I further declare that I have not engaged the services of another to either assist me in, or complete this assignment”</font>

### Objective

The Boolean satisfiability (SAT) problem consists in determining whether a Boolean formula F is satisfiable or not. F is represented by a pair (X, C), where X is a set of Boolean variables and C is a set of clauses in Conjunctive Normal Form (CNF). Each clause is a disjunction of literals (a variable or its negation). This problem is one of the most widely studied combinatorial problems in computer science. It is the classic NP-complete problem. Over the past number of decades, a significant amount of research work has focused on solving SAT problems with both complete and incomplete solvers.

Recent advances in supervised learning have provided powerful techniques for classifying problems. In this project, we see the SAT problem as a classification problem. Given a Boolean formula (represented by a vector of features), we are asked to predict if it is satisfiable or not.

In this project, we represent SAT problems with a vector of 327 features with general information about the problem, e.g., number of variables, number of clauses, fraction of horn clauses in the problem, etc. There is no need to understand the features to be able to complete the assignment.
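
To make the CNF representation and a few of the summary features concrete, here is a minimal sketch in plain Python. The example formula is made up, the literal encoding (positive integer for a variable, negative for its negation) is an assumption borrowed from the common DIMACS convention, and `horn_fraction` is a hypothetical name; only `c`, `v` and `clauses_vars_ratio` mirror column names in the dataset. The brute-force check is only practical for tiny formulas.

```python
from itertools import product

# A CNF formula as a list of clauses; each clause is a list of non-zero ints.
# 3 means x3, -3 means NOT x3. This formula is an illustrative assumption.
formula = [[1, -2], [-1, 2, 3], [-3], [2, 3]]

def basic_features(clauses):
    """Compute a few simple structural features of a CNF formula."""
    variables = {abs(lit) for clause in clauses for lit in clause}
    n_vars, n_clauses = len(variables), len(clauses)
    # A Horn clause contains at most one positive literal.
    horn = sum(1 for clause in clauses if sum(lit > 0 for lit in clause) <= 1)
    return {
        "c": n_clauses,
        "v": n_vars,
        "clauses_vars_ratio": n_clauses / n_vars,
        "horn_fraction": horn / n_clauses,  # hypothetical feature name
    }

def is_satisfiable(clauses):
    """Try every assignment; exponential in the number of variables."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        # A clause holds if at least one of its literals is true.
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

print(basic_features(formula))
print(is_satisfiable(formula))
```

The classifiers in this assignment never see the formula itself, only a feature vector of this kind, and predict the answer that `is_satisfiable` would compute exactly.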

The dataset is available at:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/dm_assignment2/sat_dataset_train.csv

This is original unpublished data.

## Data Preparation

In [92]:
import pandas as pd

df = pd.read_csv("https://github.com/andvise/DataAnalyticsDatasets/blob/6d5738101d173b97c565f143f945dedb9c42a400/dm_assignment2/sat_dataset_train.csv?raw=true")
df.head()

Unnamed: 0,c,v,clauses_vars_ratio,vars_clauses_ratio,vcg_var_mean,vcg_var_coeff,vcg_var_min,vcg_var_max,vcg_var_entropy,vcg_clause_mean,...,rwh_0_max,rwh_1_mean,rwh_1_coeff,rwh_1_min,rwh_1_max,rwh_2_mean,rwh_2_coeff,rwh_2_min,rwh_2_max,target
0,420,10,42.0,0.02381,0.6,0.0,0.6,0.6,0.0,0.6,...,78750.0,8e-06,0.0,7.875e-06,8e-06,2.385082e-21,0.0,2.385082e-21,2.385082e-21,1
1,230,20,11.5,0.086957,0.137826,0.089281,0.117391,0.16087,2.180946,0.137826,...,6646875.0,17433.722184,1.0,2.981244e-12,34867.444369,17277.21,1.0,1.358551e-53,34554.42,0
2,240,16,15.0,0.066667,0.3,0.0,0.3,0.3,0.0,0.3,...,500000.0,1525.878932,0.0,1525.879,1525.878932,1525.879,0.0,1525.879,1525.879,1
3,424,30,14.133333,0.070755,0.226415,0.485913,0.056604,0.45283,2.220088,0.226415,...,87500.0,0.000122,1.0,6.535723e-14,0.000245,8.218628e-07,1.0,1.499676e-61,1.643726e-06,0
4,162,19,8.526316,0.117284,0.139701,0.121821,0.111111,0.185185,1.940843,0.139701,...,5859400.0,16591.49431,1.0,6.912725999999999e-42,33182.988621,16659.03,1.0,0.0,33318.07,1


In [93]:
df.dtypes

c                       int64
v                       int64
clauses_vars_ratio    float64
vars_clauses_ratio    float64
vcg_var_mean          float64
                       ...   
rwh_2_mean            float64
rwh_2_coeff           float64
rwh_2_min             float64
rwh_2_max             float64
target                  int64
Length: 328, dtype: object

Replacing infinite values with NaN and then filling every NaN with 0, so that the data can be passed to the scikit-learn models used below.

In [94]:
import numpy as np
df['target'].value_counts()
#Change infinity value to nan
df = df.replace([np.inf, -np.inf], np.nan)
#Filling Nan with 0
df = df.fillna(0)
df.shape
print(df)



        c    v  clauses_vars_ratio  vars_clauses_ratio  vcg_var_mean  \
0     420   10           42.000000            0.023810      0.600000   
1     230   20           11.500000            0.086957      0.137826   
2     240   16           15.000000            0.066667      0.300000   
3     424   30           14.133333            0.070755      0.226415   
4     162   19            8.526316            0.117284      0.139701   
...   ...  ...                 ...                 ...           ...   
1924  910   50           18.200000            0.054945      0.045055   
1925  440   30           14.666667            0.068182      0.229091   
1926  372   28           13.285714            0.075269      0.103111   
1927  821  181            4.535912            0.220463      0.019105   
1928  775   90            8.611111            0.116129      0.033118   

      vcg_var_coeff  vcg_var_min  vcg_var_max  vcg_var_entropy  \
0          0.000000     0.600000     0.600000         0.000000   
1  
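
The cleaning step above can be checked on a toy frame before applying it to the real data. This is a small sketch under stated assumptions: the column names `a` and `b` and their values are made up, and the median fill is shown only as one possible alternative to filling with 0, not as the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data with one +inf, one -inf and one NaN (values are illustrative only).
toy = pd.DataFrame({
    "a": [1.0, np.inf, 3.0],
    "b": [np.nan, -np.inf, 2.0],
    "target": [1, 0, 1],
})

# Count the problem values first, to know how much data the fill will touch.
n_inf = int(np.isinf(toy.to_numpy(dtype=float)).sum())
n_nan = int(toy.isna().sum().sum())
print("infinite values:", n_inf, "| NaN values:", n_nan)

# Same two steps as in the notebook: inf -> NaN, then NaN -> 0.
with_nan = toy.replace([np.inf, -np.inf], np.nan)
cleaned = with_nan.fillna(0)

# Alternative (an assumption, not the notebook's choice): fill with column medians.
median_filled = with_nan.fillna(with_nan.median())

print(cleaned)
```

Filling with 0 is simple but treats a missing value as a real measurement of 0; counting the affected cells first shows whether that choice is likely to matter.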

Separating the target from the rest of the features

In [95]:
# Features: every column except the last one
x = df.iloc[:, :-1].values
# Label: 'target' is the last column (index 327)
y = df.iloc[:, -1].values



# Tasks

## Basic models and evaluation (5 Marks)

Using Scikit-learn, train and evaluate K-NN and decision tree classifiers using 70% of the dataset for training and 30% for testing. For this part of the project, we are not interested in optimising the parameters; we just want to get an idea of the dataset. Compare the results of both classifiers.

**KNN**

* Imported train_test_split to split the features and labels into training and test sets.
* Scaled the data with a MinMaxScaler: fitted the scaler on the training features only, then transformed both the training and test features.
* Trained KNN with a k value of 5, a standard starting value for this hyperparameter.
* Predicted on the test data, then calculated and printed the accuracy.

**Decision Tree**

* Imported the decision tree classifier and set its random state to 0.
* Fitted the classifier on the training features and labels.
* Predicted on the test set, then calculated and printed the accuracy.

In [96]:
# Split the data: 70% for training, 30% for testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
# Scale the features: fit the scaler on the training set only, then transform both sets
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# K-NN: fit on the training set, predict on the test set
from sklearn.neighbors import KNeighborsClassifier
K = KNeighborsClassifier(n_neighbors=5)
K.fit(X_train, y_train)
prediction = K.predict(X_test)
print(prediction)
accuracy = sum(prediction==y_test)/y_test.shape[0]
print("Accuracy:", accuracy)

# Decision tree: fit on the training set, predict on the test set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print(predictions)
acc = sum(predictions==y_test)/y_test.shape[0]
print(acc)
 






[0 1 1 1 0 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 1 1 0 1 0 0 1 0 1 1 1 1 1 1 0
 1 1 1 1 1 0 0 1 1 0 1 0 1 1 1 1 0 0 0 1 0 1 1 0 1 0 0 1 0 0 1 1 0 0 0 0 0
 0 1 1 0 1 1 1 0 0 0 1 0 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0
 0 1 1 1 1 1 1 0 0 1 0 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 1 1 0 1 0 1 1 0 1 1 0
 1 1 1 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 1 0 0 0 1 1 1 1 0 1 1 1 1 0 1 1 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 1 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 1 1 0 0
 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 0 1 0 0 1 0
 0 1 1 0 0 1 1 1 0 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 0 1 0 0 1 1 0 0 0 0 1 0 0
 1 0 0 1 0 1 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0
 0 0 1 1 1 0 1 0 1 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1
 0 0 0 0 1 1 1 1 0 0 1 0 1 1 1 1 0 0 0 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1 0 1 0
 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 0 0 1 1 1 1 1 0 0 0 1 0 0 1 0 0 1 1
 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 1 1 0 1 1 0 0 1 1 1 1 0 1 0 1 1 0 0 1
 1 1 1 1 1 1 1 1 1 0 1 1 

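The accuracy of the decision tree is much higher than the accuracy of the KNN. A likely reason is that KNN measures distance over all 327 features, so features that carry little information about satisfiability dilute the neighbourhoods, whereas a decision tree picks the most informative feature at each split and is unaffected by how the features are scaled. Therefore I will mainly use the decision tree for the robust evaluation, as it is performing better.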
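
The explanation above can be probed on synthetic data. The sketch below is an illustration under assumptions, not a result for the SAT dataset: `make_classification` generates 100 features of which only 5 are informative, and the sample size and seeds are arbitrary. It prints the two test accuracies so the effect of many uninformative features can be inspected directly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the SAT features: few informative columns, many noisy ones.
X, y = make_classification(n_samples=600, n_features=100, n_informative=5,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                          random_state=0, stratify=y)

models = {
    # KNN needs scaling because it compares raw distances.
    "knn": make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5)),
    # A tree splits on one feature at a time, so scaling is not needed.
    "tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model on the training split and record its test accuracy.
results = {name: model.fit(X_tr, y_tr).score(X_te, y_te)
           for name, model in models.items()}
print(results)
```

Changing `n_informative` or `n_features` shows how sensitive each model is to the share of uninformative columns.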

## Robust evaluation (10 Marks)

In this section, we are interested in more rigorous techniques by implementing more sophisticated methods, for instance:
* Hold-out and cross-validation.
* Hyper-parameter tuning.
* Feature reduction.
* Feature normalisation.

Your report should provide concrete information of your reasoning; everything should be well-explained.

Do not get stressed if the things you try do not improve the accuracy. The key to getting good marks is to show that you evaluated different methods and that you correctly selected the configuration.

**Hold-out and cross-validation**

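Performed 10-fold cross-validation on the training split for the KNN and the decision tree, keeping the 30% test split held out.
KNN reached a mean accuracy of 0.88 (standard deviation 0.03) and the decision tree 0.98 (standard deviation 0.01). Compared with the single hold-out split, the accuracy for the KNN stayed the same and the accuracy for the decision tree was about 1% higher. This shows that for this data the decision tree is the better model. Cross-validation does not make a model more accurate; because it averages over several train and validation splits, it gives a more reliable estimate of the accuracy than a single split does.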

In [97]:
# 10-fold cross-validation on the training split for the KNN
from sklearn import model_selection
scores = model_selection.cross_val_score(K, X_train, y_train, cv=10)
print(scores)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
# 10-fold cross-validation on the training split for the Decision Tree
scores = model_selection.cross_val_score(classifier, X_train, y_train, cv=10)
print(scores)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

[0.92592593 0.84444444 0.87407407 0.92592593 0.86666667 0.91851852
 0.88148148 0.83703704 0.83703704 0.88148148]
0.88 accuracy with a standard deviation of 0.03
[0.98518519 0.96296296 0.98518519 0.97777778 0.97777778 0.98518519
 0.98518519 0.97037037 0.97777778 0.97777778]
0.98 accuracy with a standard deviation of 0.01
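
In the cells above the scaler is fitted once on the whole training split before cross-validation, so each validation fold has already influenced the scaling. A stricter variant, sketched here on synthetic data as an assumption (the dataset, fold seed and sizes are made up), puts the scaler inside a pipeline so it is refitted on the training part of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the training split.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# The pipeline is cloned and refitted per fold, so the scaler only ever
# sees the training part of that fold.
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))

# Shuffled, stratified folds with a fixed seed make the scores reproducible.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)

print("%0.2f accuracy with a standard deviation of %0.2f"
      % (scores.mean(), scores.std()))
```

With min-max scaling the difference is usually small, but the pipeline form stays correct when stronger preprocessing such as PCA is added.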


**Hyper-parameter tuning - Decision Tree.**

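Set up a function for the decision tree to allow several values of random_state to be tested. Note that random_state is a random seed rather than a true hyperparameter: it only affects how features are shuffled and how ties between equally good splits are broken, so differences between values mostly reflect noise. I tested 7 different values. The value 45 had the highest test accuracy with 98.8%, while the other values ranged from 97.8% to 98.1%. This is a small improvement on the original decision tree with a random state of 0 and is in line with the 98% mean accuracy from cross-validation.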

In [73]:
def decision_tree(X_tr, y_tr, X_te, y_te, seed):
  """Fit a decision tree with the given random_state and return its test accuracy."""
  classifier = DecisionTreeClassifier(random_state=seed)
  classifier.fit(X_tr, y_tr)
  predictions = classifier.predict(X_te)
  acc = sum(predictions==y_te)/y_te.shape[0]
  return acc

# Try each random_state value in turn and print the test accuracy
for seed in [1, 3, 5, 63, 103, 45, 13]:
  acc = decision_tree(X_train, y_train, X_test, y_test, seed)
  print("Accuracy when the hyperparameter is", seed, "for the decision tree : ", acc)


Accuracy when the hyperparameter is 1 for the decision tree :  0.9792746113989638
Accuracy when the hyperparameter is 3 for the decision tree :  0.9775474956822107
Accuracy when the hyperparameter is 5 for the decision tree :  0.9810017271157168
Accuracy when the hyperparameter is 63 for the decision tree :  0.9792746113989638
Accuracy when the hyperparameter is 103 for the decision tree :  0.9810017271157168
Accuracy when the hyperparameter is 45 for the decision tree :  0.9879101899827288
Accuracy when the hyperparameter is 13 for the decision tree :  0.9810017271157168
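
The values tried above only change the random seed of the tree. As a hedged sketch of what tuning structural hyperparameters could look like, the block below grid-searches `max_depth`, `min_samples_leaf` and `criterion` with cross-validation. The synthetic data and the candidate values are assumptions chosen for illustration, not values taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the SAT features.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

# Candidate values are illustrative assumptions.
param_grid = {
    "max_depth": [3, 5, 10, None],       # None lets the tree grow fully
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}

# Keeping random_state fixed means score differences come from the grid only.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

# The test set is used once, after the configuration has been selected.
test_acc = search.score(X_te, y_te)
print(search.best_params_, search.best_score_, test_acc)
```

Selecting the configuration by cross-validated score and touching the test set only once avoids picking a setting that happens to suit one particular test split.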


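Set up a pipeline to allow me to test PCA with the min-max scaler and the standard scaler, to perform feature reduction and normalisation. PCA is a form of feature reduction: it changes the basis of the data to the directions of greatest variance (the principal components) and keeps only the first few, so the result is a smaller set of combined features rather than a subset of the original ones.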



In [74]:
# Feature reduction - projecting the features onto 3 principal components using PCA
# Use a pipeline to chain the scaler, PCA and the classifier
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
classifier = Pipeline([("scaler", MinMaxScaler()), 
    ("pca", PCA(n_components=3)),
    ("predictor", DecisionTreeClassifier())])
classifier.fit(X_train, y_train)
classifier.predict(X_test)


array([1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0,
       0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1,
       0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1,
       1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1,
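
One way to choose the number of PCA components, instead of trying arbitrary values, is to look at the cumulative explained variance. The sketch below uses synthetic data and a 95% threshold; both are assumptions for illustration and not settings taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with correlated (redundant) columns, so PCA has something to compress.
X, _ = make_classification(n_samples=300, n_features=30, n_informative=5,
                           n_redundant=10, random_state=0)

# PCA is variance-based, so features are put on a comparable scale first.
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the variance each one explains.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%.
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("components needed for 95% of the variance:", n_for_95)
```

scikit-learn also accepts a fraction directly, for example `PCA(n_components=0.95)`, which keeps just enough components to explain that share of the variance.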

Used the pipeline to try 7 different values of the PCA n_components hyperparameter (1, 3, 5, 13, 45, 63 and 103) and two split criteria (gini and entropy), with the data scaled using the min-max scaler. The accuracy with PCA is not as high as with the other methods.
Performed a grid search on the data, fitted to the X train and y train sets.
Printed the pipeline with the best parameters, then refitted it on the training set and computed the accuracy on the test set. I then repeated these steps for the StandardScaler.
On the test data the StandardScaler pipeline was more accurate (90.8%) than the MinMaxScaler pipelines (86.2% to 86.9%). However, all of these accuracies are lower than the roughly 98% from the decision tree without PCA above. For n_components I reused the same values that were used for random_state above.


In [75]:

# Grid search over the pipeline hyperparameters (PCA size and split criterion)
# Feature reduction - projecting the features onto principal components using PCA
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

def tune_pipeline(scaler, criteria, show_best=True):
    """Grid-search a scaler -> PCA -> decision tree pipeline and print its test accuracy."""
    pipe = Pipeline([("scaler", scaler),
                     ("pca", PCA()),
                     ("predictor", DecisionTreeClassifier())])
    param_grid = {"pca__n_components": [1, 3, 5, 63, 103, 45, 13],
                  "predictor__criterion": criteria}
    # The grid search finds the best hyperparameter values based on cross-validated accuracy
    gs = GridSearchCV(pipe, param_grid, scoring="accuracy")
    gs.fit(X_train, y_train)
    # Apply the best parameters, refit on the train data and print accuracy for the test set
    pipe.set_params(**gs.best_params_)
    if show_best:
        print(pipe)
    pipe.fit(X_train, y_train)
    acc = accuracy_score(y_test, pipe.predict(X_test))
    print(acc)
    return pipe, acc

# Min-max scaler with both criteria
classifier, acc = tune_pipeline(MinMaxScaler(), ["gini", "entropy"])
# Repeating the same steps but using only the best criterion, entropy
classifier, acc1 = tune_pipeline(MinMaxScaler(), ["entropy"])

# Feature normalisation
# Use the min-max scaler and the standard scaler and see which is better
# Repeating the above steps but changing the scaler
classifier1, acc2 = tune_pipeline(StandardScaler(), ["gini", "entropy"])
classifier2, acc3 = tune_pipeline(MinMaxScaler(), ["entropy"], show_best=False)


Pipeline(steps=[('scaler', MinMaxScaler()), ('pca', PCA(n_components=63)),
                ('predictor', DecisionTreeClassifier(criterion='entropy'))])
0.8618307426597582
Pipeline(steps=[('scaler', MinMaxScaler()), ('pca', PCA(n_components=63)),
                ('predictor', DecisionTreeClassifier(criterion='entropy'))])
0.8652849740932642
Pipeline(steps=[('scaler', StandardScaler()), ('pca', PCA(n_components=45)),
                ('predictor', DecisionTreeClassifier(criterion='entropy'))])
0.9084628670120898
0.8687392055267703
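
The scaler comparison above runs a separate grid search per scaler. A more compact alternative, sketched here on synthetic data as an assumption, treats the scaler step itself as a grid parameter so that one search compares both scalers under the same cross-validation folds. The candidate values for `pca__n_components` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the SAT features.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

pipe = Pipeline([("scaler", MinMaxScaler()),
                 ("pca", PCA()),
                 ("predictor", DecisionTreeClassifier(random_state=0))])

param_grid = {
    # A whole pipeline step can be swapped by using its name as the key.
    "scaler": [MinMaxScaler(), StandardScaler()],
    "pca__n_components": [3, 5, 13],
    "predictor__criterion": ["gini", "entropy"],
}

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_tr, y_tr)

# By default GridSearchCV refits the best pipeline, so it can score the test set directly.
test_acc = search.score(X_te, y_te)
print(search.best_params_, test_acc)
```

Because every combination is scored on the same folds, the scalers are compared by validation accuracy rather than by test accuracy, which keeps the test set for the final estimate.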


## New classifier (10 Marks)

Replicate the previous task for a classifier that we did not cover in class. So different than K-NN and decision trees. Briefly describe your choice.
Try to create the best model for the given dataset.
Save your best model into your github. And create a single code cell that loads it and evaluate it on the following test dataset:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/dm_assignment2/sat_dataset_test.csv

This link currently contains a sample of the training set. The real test set will be released after the submission. I should be able to run the code cell independently, load all the libraries you need as well.

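The classifier I chose is the multi-layer perceptron (MLP), a feed-forward neural network. It consists of an input layer, one or more hidden layers and an output layer, with every neuron connected to all neurons in the next layer. Training uses backpropagation: the gradient of the error with respect to each weight of the model is calculated and the weights are updated to reduce the error. MLP had an accuracy of 95%, which is lower than the decision tree, which had 98% accuracy.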

Resources used for MLP:
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
https://panjeh.medium.com/scikit-learn-hyperparameter-optimization-for-mlpclassifier-4d670413042b

In [76]:
# MLP neural network model
from sklearn.neural_network import MLPClassifier
mlpclassifier = MLPClassifier(random_state=1, max_iter=300).fit(X_train, y_train)
mlpclassifier.predict(X_test)
# Mean accuracy on the test set
mlpclassifier.score(X_test, y_test)



0.9499136442141624
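
To illustrate what "the gradient of the error against the weight" means for an MLP, here is a minimal NumPy sketch of one forward and backward pass through a network with a single tanh hidden layer and a sigmoid output. The data, layer sizes and initial weights are made-up assumptions, and this is not how scikit-learn implements `MLPClassifier` internally; the last lines compare one backpropagated gradient entry with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))                         # 8 samples, 4 features (made up)
y = rng.integers(0, 2, size=(8, 1)).astype(float)   # binary labels

# One hidden layer with 3 units, one output unit.
W1 = rng.normal(scale=0.5, size=(4, 3))
b1 = np.zeros((1, 3))
W2 = rng.normal(scale=0.5, size=(3, 1))
b2 = np.zeros((1, 1))

def forward(W1, b1, W2, b2):
    """Return hidden activations, predicted probabilities and mean log loss."""
    hidden = np.tanh(X @ W1 + b1)
    prob = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))
    loss = -np.mean(y * np.log(prob) + (1 - y) * np.log(1 - prob))
    return hidden, prob, loss

hidden, prob, loss = forward(W1, b1, W2, b2)

# Backpropagation: apply the chain rule from the output back to the input.
n = X.shape[0]
d_out = (prob - y) / n                          # gradient at the output pre-activation
grad_W2 = hidden.T @ d_out
grad_b2 = d_out.sum(axis=0, keepdims=True)
d_hidden = (d_out @ W2.T) * (1 - hidden ** 2)   # tanh'(z) = 1 - tanh(z)^2
grad_W1 = X.T @ d_hidden
grad_b1 = d_hidden.sum(axis=0, keepdims=True)

# Numerical check of one entry of grad_W1 with a central difference.
eps = 1e-6
W1_up = W1.copy()
W1_up[0, 0] += eps
W1_down = W1.copy()
W1_down[0, 0] -= eps
numeric = (forward(W1_up, b1, W2, b2)[2] - forward(W1_down, b1, W2, b2)[2]) / (2 * eps)
print("backprop:", grad_W1[0, 0], "numeric:", numeric)
```

A solver such as 'sgd' or 'adam' then moves each weight a small step against its gradient and repeats the process over many iterations.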

Performed 10-fold cross-validation for the MLP on the training split. The mean accuracy was 96%, about 1% above the hold-out result. However, it is still lower than when this method was used on the decision tree, which gave 98% accuracy.

In [77]:
# Hold out and cross validation for MLP
from sklearn import model_selection
mlpscore = model_selection.cross_val_score(mlpclassifier, X_train, y_train, cv=10)
print(mlpscore)
print("%0.2f accuracy with a standard deviation of %0.2f" % (mlpscore.mean(), mlpscore.std()))



[0.94814815 0.99259259 0.96296296 0.94814815 0.95555556 0.97037037
 0.92592593 0.95555556 0.94074074 0.97037037]
0.96 accuracy with a standard deviation of 0.02


**Hyper-parameter tuning - MLP.**

Performed a grid search for the MLP to determine the best hyperparameters, which are 'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (20,), 'learning_rate': 'constant', 'solver': 'adam'. Then used a for loop to iterate through all hyperparameter combinations and print the mean cross-validation score and twice its standard deviation for each. Finally, evaluated the best model against the test set and printed a grid showing the precision, recall, f1-score and accuracy of the model. The accuracy computed was 95%, which is in line with the hold-out and cross-validation results and less than the accuracy obtained for the decision tree, which was 98%.

In [78]:
from sklearn.metrics import classification_report
parameters = {
    'hidden_layer_sizes': [(10,),(20,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}
mlpclf = GridSearchCV(mlpclassifier, parameters, n_jobs=-1, cv=5)
mlpclf.fit(X_train, y_train) 
print('Best parameters found:\n', mlpclf.best_params_)
mlpmeans = mlpclf.cv_results_['mean_test_score']
mlpstds = mlpclf.cv_results_['std_test_score']
for mlpmean, mlpstd, mlpparams in zip(mlpmeans, mlpstds, mlpclf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mlpmean, mlpstd * 2, mlpparams))

y_true, y_pred = y_test , mlpclf.predict(X_test)

print('Results on the test set:')
print(classification_report(y_true, y_pred))


Best parameters found:
 {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (20,), 'learning_rate': 'constant', 'solver': 'adam'}
0.887 (+/-0.048) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.950 (+/-0.019) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10,), 'learning_rate': 'constant', 'solver': 'adam'}
0.887 (+/-0.048) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.950 (+/-0.019) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.885 (+/-0.049) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (20,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.959 (+/-0.016) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (20,), 'learning_rate': 'constant', 'solver': 'adam'}
0.885 (+/-0.049) for {'activation': 'tanh
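
The classification report used above lists precision, recall and F1-score next to accuracy. The sketch below computes these by hand for the positive class on a small made-up set of labels, to show what each figure measures; the label vectors are assumptions and have no connection to the SAT results.

```python
# Made-up true labels and predictions for 8 instances (1 = satisfiable).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

# Precision: of the instances predicted positive, how many really are positive.
precision = tp / (tp + fp)
# Recall: of the truly positive instances, how many were found.
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(pairs)

print(precision, recall, f1, accuracy)
```

Accuracy alone can hide an imbalance between the two error types, which is why the report is worth reading alongside it.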

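Feature reduction using PCA, with a pipeline to pass in different numbers of components and understand which is better. Note that the final estimator in the two pipelines below is LogisticRegression rather than MLPClassifier, so these results describe PCA followed by logistic regression. When accuracy is computed against the test set, it is 95.9% with either scaler, which is lower than the accuracy computed for the decision tree.
Changing the scaler (data normalisation) does not change the test accuracy, although the best cross-validation score is higher with the StandardScaler (0.976) than with the MinMaxScaler (0.960).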

In [90]:
# PCA and standard scaler in a pipeline (the final estimator here is logistic regression, not the MLP)
from sklearn.linear_model import LogisticRegression
pca = PCA()
scaler = StandardScaler()
logistic = LogisticRegression(max_iter=1000, tol=0.1)
pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("logistic", logistic)])
param_grid = {
    "pca__n_components": [1,3,5,63,103,45,13],
    "logistic__C": np.logspace(-4, 4, 4),
}
mlp_gs = GridSearchCV(pipe, param_grid, n_jobs=2)
mlp_gs.fit(X_train, y_train)
print("Best CV score: %0.3f" % mlp_gs.best_score_)
print(mlp_gs.best_params_)
# Accuracy on the training set
mlpaccpca = accuracy_score(y_train, mlp_gs.predict(X_train))
print(mlpaccpca)
# Accuracy on the test set
mlpaccpcat = accuracy_score(y_test, mlp_gs.predict(X_test))
print(mlpaccpcat)


Best CV score: 0.976
{'logistic__C': 21.54434690031882, 'pca__n_components': 63}
0.9955555555555555
0.9585492227979274


In [91]:
# PCA and min-max scaler in a pipeline (the final estimator here is logistic regression, not the MLP)
from sklearn.linear_model import LogisticRegression
pca1 = PCA()
scaler1 = MinMaxScaler()
logistic1 = LogisticRegression(max_iter=1000, tol=0.1)
pipe1 = Pipeline(steps=[("scaler", scaler1), ("pca", pca1), ("logistic", logistic1)])
param_grid1 = {
    "pca__n_components": [1,3,5,63,103,45,13],
    "logistic__C": np.logspace(-4, 4, 4),
}
mlp_gs1 = GridSearchCV(pipe1, param_grid1, n_jobs=2)
mlp_gs1.fit(X_train, y_train)
print("Best CV score: %0.3f" % mlp_gs1.best_score_)
print(mlp_gs1.best_params_)
# Accuracy on the training set
mlpaccpca1 = accuracy_score(y_train, mlp_gs1.predict(X_train))
print(mlpaccpca1)
# Accuracy on the test set
mlpaccpcat1 = accuracy_score(y_test, mlp_gs1.predict(X_test))
print(mlpaccpcat1)


Best CV score: 0.960
{'logistic__C': 10000.0, 'pca__n_components': 103}
1.0
0.9585492227979274


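Best Model: the decision tree. In the seed comparison above, a random state of 45 gave the highest test accuracy (98.8%) and a random state of 13 gave 98.1%, while 10-fold cross-validation gave a mean accuracy of 98%. Because random_state is only a seed, these differences are small enough to be noise. The cell below refits the tree with a random state of 13.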

In [None]:
def decision_tree(X_tr, y_tr, X_te, y_te, seed):
  """Fit a decision tree with the given random_state and return its test accuracy."""
  classifier = DecisionTreeClassifier(random_state=seed)
  classifier.fit(X_tr, y_tr)
  predictions = classifier.predict(X_te)
  acc = sum(predictions==y_te)/y_te.shape[0]
  return acc
DecisionTree13 = decision_tree(X_train, y_train, X_test, y_test, 13)
print("Accuracy when the hyperparameter is 13 for the decision tree : ",DecisionTree13)

# <font color="blue">FOR GRADING ONLY</font>

Save your best model into your github. And create a single code cell that loads it and evaluate it on the following test dataset: 
https://github.com/andvise/DataAnalyticsDatasets/blob/main/dm_assignment2/sat_dataset_test.csv

I could not get the GitHub push to work: the bestModel file would not push to my GitHub repository.

In [154]:
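!git clone https://github.com/Laurenigoe/Datamining
%cd Datamining
!git init
!git config --global user.email "laurenigoe1@gmail.com"
!git config --global user.name "Laurenigoe"

import pickle
# NOTE: if the cells are run in order, `classifier` here is the PCA pipeline from the
# grid-search cell, not the tree fitted inside decision_tree() (that one is local).
with open("model.pkl", "wb") as f:
    pickle.dump(classifier, f)
with open("model.pkl", "rb") as f:
    clss = pickle.load(f)
clss
clss.predict(X_test)

from joblib import load
from io import BytesIO
import requests

# INSERT YOUR MODEL'S URL
mLink = 'https://github.com/Laurenigoe/Datamining/blob/main/model.pkl?raw=true'
response = requests.get(mLink)
# Fail early with a clear HTTP error if the file is not available at this URL
response.raise_for_status()
mfile = BytesIO(response.content)
model = load(mfile)
model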


Cloning into 'Datamining'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 9 (delta 0), reused 3 (delta 0), pack-reused 0
Unpacking objects: 100% (9/9), done.
/content/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Assignment2/Assignment2/Assignment2/Assignment2/Assignment2/Assignment2/Assignment2/Assignment2/Assignment2/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining
Reinitialized existing Git repository in /content/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Assignment2/Assignment2/Assignment2/Assignment2/Assignment2/Assignment2/Assignment2/Assignment2/Assignment2/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datamining/Datami

KeyError: ignored
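
A loading error of this kind can occur when the downloaded bytes are not a serialized model, for example when the URL returns an HTML page because the file was never pushed. That is one plausible cause here and has not been confirmed. The sketch below shows a save and load round trip with joblib through an in-memory buffer, plus a simple guard against HTML content. The helper `load_model`, the synthetic data and the HTML check are assumptions for illustration; in the notebook the bytes would come from `requests.get(mLink).content`.

```python
from io import BytesIO

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a small tree standing in for the best model.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = DecisionTreeClassifier(random_state=13).fit(X, y)

# Serialize to bytes; writing to a file named "model.pkl" works the same way.
buffer = BytesIO()
joblib.dump(model, buffer)
payload = buffer.getvalue()

def load_model(raw_bytes):
    """Hypothetical helper: reject obvious HTML, then deserialize with joblib."""
    if raw_bytes.lstrip()[:1] == b"<":
        raise ValueError("Downloaded content looks like HTML, not a saved model")
    return joblib.load(BytesIO(raw_bytes))

restored = load_model(payload)

# The restored model should reproduce the original predictions exactly.
same = bool((restored.predict(X) == model.predict(X)).all())
print("round trip OK:", same)
```

Saving and loading with the same library (both joblib or both pickle), and in the same scikit-learn version, reduces the chance of incompatibilities when the grading cell is run independently.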