# CS3033/CS6405 - Data Mining - Second Assignment

### Submission

This assignment is **due on 07/04/22 at 23:59**. You should submit a single .ipynb file with your Python code and analysis electronically via Canvas.
Please note that this assignment will account for 25 Marks of your module grade.

### Declaration

By submitting this assignment, I agree to the following:

<font color="red">“I have read and understand the UCC academic policy on plagiarism, and agree to the requirements set out thereby in relation to plagiarism and referencing. I confirm that I have referenced and acknowledged properly all sources used in the preparation of this assignment.
I declare that this assignment is entirely my own work based on my personal study. I further declare that I have not engaged the services of another to either assist me in, or complete this assignment”</font>

### Objective

The Boolean satisfiability (SAT) problem consists in determining whether a Boolean formula F is satisfiable or not. F is represented by a pair (X, C), where X is a set of Boolean variables and C is a set of clauses in Conjunctive Normal Form (CNF). Each clause is a disjunction of literals (a variable or its negation). This problem is one of the most widely studied combinatorial problems in computer science. It is the classic NP-complete problem. Over the past number of decades, a significant amount of research work has focused on solving SAT problems with both complete and incomplete solvers.
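
To make the definition concrete, here is a minimal sketch of a CNF formula and a brute-force satisfiability check. The integer encoding of literals (as in the DIMACS convention) and the helper name `is_satisfiable` are assumptions for illustration; real solvers use far more efficient search than trying every assignment.

```python
from itertools import product

# A clause is a list of non-zero integers: k means variable k,
# -k means the negation of variable k (illustrative encoding).
# (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)
formula = [[1, -2], [2, 3], [-1, -3]]


def is_satisfiable(clauses):
    """Brute-force check: try every assignment of the variables."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        # A literal is true when its sign matches the assigned value;
        # a clause needs one true literal, the formula needs every clause.
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True, assignment
    return False, None


sat, model = is_satisfiable(formula)
unsat, _ = is_satisfiable([[1], [-1]])  # x1 AND NOT x1 has no solution
print(sat, model, unsat)
```

The search is exponential in the number of variables, which is why portfolio approaches that pick a suitable solver per instance are attractive.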

One of the most successful approaches is an algorithm portfolio, where a solver is selected among a set of candidates depending on the problem type. Your task is to create a classifier that takes as input the SAT instance's features and identifies the class.

In this project, we represent SAT problems with a vector of 327 features with general information about the problem, e.g., number of variables, number of clauses, the fraction of horn clauses in the problem, etc. There is no need to understand the features to be able to complete the assignment.
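
As a rough illustration of what such features look like, the sketch below computes a few of them for a tiny formula. The names `c`, `v`, `clauses_vars_ratio` and `vars_clauses_ratio` mirror columns of the dataset; `horn_fraction` and the function `basic_features` are hypothetical names, and the actual feature extractor may preprocess the formula first, so treat this as an approximation rather than the dataset's exact definition.

```python
def basic_features(clauses):
    """Compute a handful of size-based features of a CNF formula.

    Clauses are lists of non-zero integers (k = variable k, -k = its negation).
    """
    variables = {abs(lit) for clause in clauses for lit in clause}
    n_clauses = len(clauses)
    n_vars = len(variables)
    # A Horn clause contains at most one positive literal.
    n_horn = sum(1 for clause in clauses
                 if sum(lit > 0 for lit in clause) <= 1)
    return {
        "c": n_clauses,
        "v": n_vars,
        "clauses_vars_ratio": n_clauses / n_vars,
        "vars_clauses_ratio": n_vars / n_clauses,
        "horn_fraction": n_horn / n_clauses,  # hypothetical feature name
    }


features = basic_features([[1, -2], [2, 3], [-1, -3]])
print(features)
```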


The original dataset is available at:
https://github.com/bprovanbessell/SATfeatPy/blob/main/features_csv/all_features.csv



## Data Preparation

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv("https://raw.githubusercontent.com/andvise/DataAnalyticsDatasets/main/train_dataset.csv", index_col=0)
df = df.replace([np.inf, -np.inf], np.nan).dropna(axis=1)
df

Unnamed: 0,c,v,clauses_vars_ratio,vars_clauses_ratio,vcg_var_mean,vcg_var_coeff,vcg_var_min,vcg_var_max,vcg_var_entropy,vcg_clause_mean,...,rwh_0_max,rwh_1_mean,rwh_1_coeff,rwh_1_min,rwh_1_max,rwh_2_mean,rwh_2_coeff,rwh_2_min,rwh_2_max,target
0,608,71,8.563380,0.116776,0.045172,0.173688,0.029605,0.060855,2.802758,0.045172,...,5078250.0,1056.695041,1.000000,2.981935e-09,2113.390083,1081.900778,1.000000,1.302080e-29,2163.801556,matching
1,615,70,8.785714,0.113821,0.049617,0.168633,0.032520,0.069919,2.607264,0.049617,...,5469376.0,1207.488426,1.000000,6.927306e-28,2414.976852,1186.623627,1.000000,3.491123e-120,2373.247255,matching
2,926,105,8.819048,0.113391,0.033385,0.186444,0.017279,0.047516,3.022879,0.033385,...,4297025.0,441.327046,1.000000,1.194627e-76,882.654092,474.697562,1.000000,0.000000e+00,949.395124,matching
3,603,70,8.614286,0.116086,0.049799,0.133441,0.033167,0.063018,2.688342,0.049799,...,6640651.0,1181.583331,1.000000,2.437278e-30,2363.166661,1149.059132,1.000000,4.670090e-147,2298.118264,matching
4,228,43,5.302326,0.188596,0.067319,0.162581,0.048246,0.087719,2.203308,0.067319,...,2437500.0,1091.423921,0.999966,3.723599e-02,2182.810606,1296.888087,1.000000,6.307424e-06,2593.776167,matching
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2407,1668,56,29.785714,0.033573,0.056130,0.056969,0.046763,0.062350,2.504465,0.056130,...,19921877.0,6612.008184,1.000000,1.120646e-100,13224.016367,6615.279871,1.000000,0.000000e+00,13230.559743,5clique
2408,420,28,15.000000,0.066667,0.107143,0.046765,0.092857,0.114286,1.798018,0.107143,...,9375075.0,12725.565085,1.000000,4.271720e-43,25451.130169,12602.458887,1.000000,0.000000e+00,25204.917773,4clique
2409,1827,59,30.966102,0.032293,0.052898,0.045194,0.039409,0.055829,2.267027,0.052898,...,19531254.0,6245.948689,1.000000,2.996198e-101,12491.897377,6216.493633,1.000000,0.000000e+00,12432.987265,5clique
2410,932,44,21.181818,0.047210,0.064524,0.090983,0.049356,0.074034,2.633887,0.064524,...,12890627.0,7820.073756,1.000000,7.261942e-74,15640.147512,7680.635364,1.000000,0.000000e+00,15361.270729,5clique


In [4]:
# Label or target variable
X = df.iloc[:,:-1]
y = df['target']



# Tasks

## Basic models and evaluation (5 Marks)

Using Scikit-learn, train and evaluate a decision tree classifier using 70% of the dataset for training and 30% for testing. For this part of the project, we are not interested in optimising the parameters; we just want to get an idea of the dataset.

In [5]:
from sklearn import model_selection
from sklearn import preprocessing
from sklearn import tree

encoder = preprocessing.LabelEncoder()
y = encoder.fit_transform(y)

train_features, test_features, train_labels, test_labels = model_selection.train_test_split(X,y, test_size=0.3, random_state=999)

cls = tree.DecisionTreeClassifier(random_state = 42)

cls.fit(train_features, train_labels)

cls.score(test_features, test_labels)

0.9765193370165746
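
A single accuracy figure is easier to interpret next to a trivial baseline and a per-class breakdown. The following is a minimal sketch on synthetic data (generated with `make_classification`, so the numbers say nothing about the SAT dataset); it assumes a stratified 70/30 split, which keeps class proportions equal in both parts.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels.
X_demo, y_demo = make_classification(n_samples=600, n_features=20,
                                     n_informative=8, n_classes=3,
                                     random_state=0)

# stratify=y_demo preserves the class balance in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
tree_clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
tree_acc = tree_clf.score(X_te, y_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, tree_clf.predict(X_te))
print("baseline:", baseline_acc, "tree:", tree_acc)
print(cm)
```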

## Robust evaluation (10 Marks)

In this section, we are interested in more rigorous techniques by implementing more sophisticated methods, for instance:
* Hold-out and cross-validation.
* Hyper-parameter tuning.
* Feature reduction.
* Feature selection.
* Feature normalisation.

Your report should provide concrete information about your reasoning; everything should be well-explained.

The key to getting good marks is to show that you evaluated different methods and that you correctly selected the configuration.
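
One way to compare configurations fairly is to wrap every preprocessing step in a pipeline and score each candidate with the same cross-validation folds, so scaling and feature selection are fitted only on the training part of each fold. The sketch below uses synthetic data; the candidate names and the choice of `SelectKBest` with `k=10` are assumptions for illustration, not the configuration used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=30,
                                     n_informative=8, n_classes=3,
                                     random_state=0)

# Two illustrative candidates: a plain tree, and a tree preceded by
# normalisation and univariate feature selection.
candidates = {
    "tree only": Pipeline([
        ("cls", DecisionTreeClassifier(random_state=0)),
    ]),
    "scale + select 10": Pipeline([
        ("scaler", StandardScaler()),
        ("select", SelectKBest(f_classif, k=10)),
        ("cls", DecisionTreeClassifier(random_state=0)),
    ]),
}

# The same shuffled, stratified folds are reused for every candidate.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {name: cross_val_score(pipe, X_demo, y_demo, cv=cv).mean()
           for name, pipe in candidates.items()}

for name, score in results.items():
    print(f"{name}: mean CV accuracy = {score:.3f}")
```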

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# Split the data into train and test sets: 70% for training, 30% for testing, with random_state=999 for reproducibility.
train_features, test_features, train_labels, test_labels = model_selection.train_test_split(X, y, test_size=0.3, random_state=999)


# Define the MinMax scaler, Principal Component Analysis and decision tree classifier, then combine them into a pipeline.
scaler = MinMaxScaler()
pca = PCA()
cls = tree.DecisionTreeClassifier(random_state=42)
# Reuse the objects defined above so that the seeded classifier is the one being tuned.
pipe = Pipeline([('scaler', scaler), ('pca', pca), ('cls', cls)])


# Parameters for GridSearchCV: the lists need to be wide enough to catch the best parameters but narrow enough to be practical to run. The values were changed multiple times.
# OPTIMAL VALUES -- pca__n_components: 24, cls__criterion: entropy, cls__max_depth: None, cls__min_samples_split: 2
param_grid = {'pca__n_components': [15, 20, 24, 25, 45, 60],  # 24 included so the reported optimum is in the grid
              'cls__criterion': ['entropy'],
              'cls__max_depth': [None],
              'cls__min_samples_split': [2]}  # an integer min_samples_split must be at least 2


# Define the grid search by passing the pipeline and the parameter grid. cv=5 performs 5-fold stratified cross-validation:
# each fold holds out part of the training data for validation, while the test set stays untouched.
# Fit the grid on the training data.
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(train_features, train_labels)


# Print the best classifier, its hyper-parameters and its mean cross-validated score
print("The best classifier is:", grid.best_estimator_)
print("Best hyper-parameters: ", grid.best_params_)
print("Best score: ", grid.best_score_)



y_pred = grid.best_estimator_.predict(test_features)

# Compute the accuracy score of the model
accuracy = accuracy_score(test_labels, y_pred)
print("Accuracy score: ", accuracy)




#https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html 
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Best hyper-parameters:  {'cls__criterion': 'entropy', 'cls__max_depth': None, 'cls__min_samples_split': 2, 'pca__n_components': 24}
Best score:  0.9466665496110828
Accuracy score:  0.9392265193370166
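
Beyond the single best configuration, `cv_results_` shows how close the other candidates were and how much the scores vary across folds. Below is a minimal sketch on synthetic data with a deliberately small grid; the names `demo_pipe` and `demo_grid` are placeholders and the grid values are not the ones tuned above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=8, n_classes=3,
                                     random_state=0)

demo_pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("pca", PCA()),
    ("cls", DecisionTreeClassifier(random_state=0)),
])
demo_grid = {"pca__n_components": [2, 5], "cls__max_depth": [None, 5]}

search = GridSearchCV(demo_pipe, demo_grid, cv=5).fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
table = (pd.DataFrame(search.cv_results_)
         [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
         .sort_values("rank_test_score")
         .reset_index(drop=True))
print(table)
```

If the top rows differ by less than their standard deviations, the simpler configuration is usually the safer choice.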


In [None]:
param_grid = {'pca__n_components': [20, 25,15,60,45],
              'cls__criterion': ['entropy'],
              'cls__max_depth': [None],
              'cls__min_samples_split': [2]}  # an integer min_samples_split must be at least 2

param_grid = {'pca__n_components': [2, 5, 10, 20, 30, 40],
              'cls__criterion': ['gini', 'entropy'],
              'cls__max_depth': [None, 5, 10],
              'cls__min_samples_split': [2, 5, 10]}



## New classifier (10 Marks)

Replicate the previous task for a classifier other than K-NN and decision trees. Briefly describe your choice.
Try to create the best model for the given dataset.


Save your best model to your GitHub repository, then create a single code cell that loads it and evaluates it on the following test dataset:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv

This link currently contains a sample of the training set. The real test set will be released after the submission. I should be able to run the code cell independently, so load all the libraries you need in it as well.

In [19]:
from sklearn.ensemble import RandomForestClassifier

#Data Split as before

train_features, test_features, train_labels, test_labels = model_selection.train_test_split(X,y, test_size=0.3, random_state=999)

#Define a new Pipeline using the RandomForest Classifier 
pipe = Pipeline([('scaler', MinMaxScaler()), ('pca', PCA()), ('rf', RandomForestClassifier())])


#update parameter grid for new random forest and pca arguments
param_grid = {'pca__n_components': [24],
              'pca__whiten' : [True],
              'rf__max_features': [0.3, 0.4, 0.5,],
              'rf__max_depth': [None],
              'rf__min_samples_leaf' : [2],
              'rf__n_estimators': [6, 7, 8, 9]}


#Define and fit the grid as before
grid = GridSearchCV(pipe, param_grid, cv=8)
grid.fit(train_features, train_labels)

#print the best combination
print("The best classifier is:", grid.best_estimator_)
print("Best hyper-parameters: ", grid.best_params_)
print("Best score: ", grid.best_score_)

y_pred = grid.best_estimator_.predict(test_features)

# Compute the accuracy score of the model
accuracy = accuracy_score(test_labels, y_pred)
print("Accuracy score: ", accuracy)     


#https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

The best classifier is: Pipeline(steps=[('scaler', MinMaxScaler()),
                ('pca', PCA(n_components=24, whiten=True)),
                ('rf',
                 RandomForestClassifier(max_features=0.3, min_samples_leaf=2,
                                        n_estimators=7))])
Best hyper-parameters:  {'pca__n_components': 24, 'pca__whiten': True, 'rf__max_depth': None, 'rf__max_features': 0.3, 'rf__min_samples_leaf': 2, 'rf__n_estimators': 7}
Best score:  0.9686018957345972
Accuracy score:  0.9488950276243094


In [None]:
param_grid = {'pca__n_components': [24],
              'pca__whiten' : [True],
              'rf__max_features': [0.3],
              'rf__max_depth': [None],
              'rf__min_samples_leaf' : [2],
              'rf__n_estimators': [9]}

param_grid = {'pca__n_components': [2, 5, 10, 20, 30, 40],
              'pca__whiten' : [True, False],
              'rf__max_features': [0.2, 0.4, 0.6, 0.8, 1.0],  # 1.0 = all features; the integer 1 would mean a single feature
              'rf__max_depth': [None, 5, 10],
              'rf__min_samples_leaf' : [1,2,3,4,5],
              'rf__n_estimators': [2, 5, 10]}

# <font color="blue">FOR GRADING ONLY</font>

Save your best model to your GitHub repository, then create a single code cell that loads it and evaluates it on the following test dataset:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv


In [None]:
from joblib import dump, load
from io import BytesIO
import requests

# INSERT YOUR MODEL'S URL
mLink = '122108729_Data_Mining_Second_Assignment.ipynb.raw=true'
# mfile = BytesIO(requests.get(mLink).content)
# model = load(mfile)
# YOUR CODE HERE
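
The commented-out lines above load a model from an in-memory byte buffer. The sketch below shows that round trip without any network access, on synthetic data: a fitted pipeline and its `LabelEncoder` are dumped into a `BytesIO` buffer and loaded back. Bundling the encoder with the model in one dictionary is an assumption made here, useful because the pipeline was trained on encoded labels and its predictions need to be mapped back to class names.

```python
from io import BytesIO

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Synthetic features with text labels standing in for the real classes.
X_demo, y_int = make_classification(n_samples=200, n_features=10,
                                    n_informative=5, n_classes=3,
                                    random_state=0)
class_names = np.array(["matching", "4clique", "5clique"])
y_text = class_names[y_int]

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_text)

model = Pipeline([
    ("scaler", MinMaxScaler()),
    ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
]).fit(X_demo, y_encoded)

# Serialise model and encoder together, then read them back from bytes,
# as would happen after downloading the file content with requests.
buffer = BytesIO()
dump({"model": model, "encoder": encoder}, buffer)
buffer.seek(0)
restored = load(buffer)

predicted_names = restored["encoder"].inverse_transform(
    restored["model"].predict(X_demo))
accuracy = (predicted_names == y_text).mean()
print("Accuracy on the demo data:", accuracy)
```

For the real submission, the buffer would hold the bytes fetched from the model's raw file URL, and the evaluation data would come from the test CSV after the same `inf` replacement and column selection used in training.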