
# CS3033/CS6405 - Data Mining - Second Assignment

### Submission

This assignment is **due on 07/04/22 at 23:59**. You should submit a single .ipynb file with your Python code and analysis electronically via Canvas.
Please note that this assignment will account for 25 Marks of your module grade.

### Declaration

By submitting this assignment, I agree to the following:

<font color="red">“I have read and understand the UCC academic policy on plagiarism, and agree to the requirements set out thereby in relation to plagiarism and referencing. I confirm that I have referenced and acknowledged properly all sources used in the preparation of this assignment.
I declare that this assignment is entirely my own work based on my personal study. I further declare that I have not engaged the services of another to either assist me in, or complete this assignment”</font>

### Objective

The Boolean satisfiability (SAT) problem consists in determining whether a Boolean formula F is satisfiable or not. F is represented by a pair (X, C), where X is a set of Boolean variables and C is a set of clauses in Conjunctive Normal Form (CNF). Each clause is a disjunction of literals (a variable or its negation). This problem is one of the most widely studied combinatorial problems in computer science. It is the classic NP-complete problem. Over the past number of decades, a significant amount of research work has focused on solving SAT problems with both complete and incomplete solvers.
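
To make the definition concrete, here is a minimal sketch of a CNF formula and a brute-force satisfiability check. The integer encoding of literals (`k` for a variable, `-k` for its negation) follows the common DIMACS convention and is an assumption here, as are the function names; real solvers use far more efficient search than trying every assignment.

```python
from itertools import product

# A CNF formula as a list of clauses; each clause is a list of non-zero ints.
# Literal k means "variable k is true", -k means "variable k is false".
# This one encodes (x1 or not x2) and (x2 or x3) and (not x1 or not x3).
formula = [[1, -2], [2, 3], [-1, -3]]


def satisfies(assignment, clauses):
    """Return True if every clause contains at least one true literal."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def brute_force_sat(clauses):
    """Try all 2^n assignments; return a satisfying one, or None if there is none."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if satisfies(assignment, clauses):
            return assignment
    return None


model = brute_force_sat(formula)        # a satisfying assignment exists
unsat = brute_force_sat([[1], [-1]])    # x1 and not x1 can never both hold
```

The exhaustive loop grows as 2^n in the number of variables, which is why practical work relies on specialised complete and incomplete solvers.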

One of the most successful approaches is an algorithm portfolio, where a solver is selected among a set of candidates depending on the problem type. Your task is to create a classifier that takes as input the SAT instance's features and identifies the class.

In this project, we represent SAT problems with a vector of 327 features with general information about the problem, e.g., number of variables, number of clauses, the fraction of Horn clauses in the problem, etc. There is no need to understand the features to be able to complete the assignment.
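
As an illustration of what such features look like, the sketch below computes three of the quantities mentioned above from a clause list. The dictionary keys and the function name are hypothetical and are not the column names used in the dataset; a Horn clause is taken to be a clause with at most one positive literal.

```python
def basic_features(clauses):
    """Compute a few simple structural features of a CNF formula.

    `clauses` is a list of clauses, each a list of non-zero ints
    (k = variable k, -k = its negation).
    """
    variables = {abs(lit) for clause in clauses for lit in clause}
    # A Horn clause has at most one positive (non-negated) literal.
    horn = sum(1 for clause in clauses if sum(lit > 0 for lit in clause) <= 1)
    return {
        "n_vars": len(variables),
        "n_clauses": len(clauses),
        "horn_fraction": horn / len(clauses) if clauses else 0.0,
    }


# (x1 or not x2), (x2 or x3), (not x1 or not x3): the middle clause is not Horn.
features = basic_features([[1, -2], [2, 3], [-1, -3]])
```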


The original dataset is available at:
https://github.com/bprovanbessell/SATfeatPy/blob/main/features_csv/all_features.csv



## Data Preparation

In [None]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/andvise/DataAnalyticsDatasets/main/train_dataset.csv", index_col=0)
old_shape=df.shape
old_shape


(2412, 328)

In [None]:
# Label or target variable
df['target'].value_counts()

tseitin           298
dominating        294
cliquecoloring    268
php               266
subsetcard        263
op                201
tiling            120
5clique           108
3color            104
matching          102
5color             98
4color             98
3clique            98
4clique            94
Name: target, dtype: int64

# Tasks

## Basic models and evaluation (5 Marks)

Using Scikit-learn, train and evaluate a decision tree classifier using 70% of the dataset for training and 30% for testing. For this part of the project, we are not interested in optimising the parameters; we just want to get an idea of the dataset.
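
The task can be sketched in a few lines. The example below uses synthetic data from `make_classification` as a stand-in for the SAT feature matrix (an assumption, so the accuracy it produces says nothing about the real dataset), and it passes `stratify=y`, which may be worth considering here because the class counts shown earlier are uneven.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and labels (not the real data).
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# 70% training / 30% testing, keeping class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Default hyper-parameters: this is only a baseline, not a tuned model.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, tree.predict(X_test))
```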

In [None]:
#When I tried to apply scaling there was an error regarding na and inf

#To remove na/infinity values
print('Na Values: ',df.isnull().values.any()) #True

df=df.dropna(axis=1) 
#originally tried rows but all were removed

import numpy as np
print('Infinity Values:',df.isin([np.inf, -np.inf]).values.any()) #True

#replace infinities with NaN and then drop the affected columns (axis=1)
df = df.replace([np.inf, -np.inf], np.nan).dropna(axis=1) 

print('Sanity Check:',df.isnull().values.any())

print('Rows Lost =',old_shape[0]-df.shape[0],'Columns Lost =',old_shape[1]-df.shape[1] )

Na Values:  True
Infinity Values: True
Sanity Check: False
Rows Lost = 0 Columns Lost = 18
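
One caveat with dropping columns this way: the set of dropped columns depends on which file is being cleaned (this notebook reports 18 columns lost here and 16 on the test sample later on), so a model fitted on one file may not see the same features in another. A minimal sketch on toy data, with hypothetical column names, of deciding the kept columns once on the training frame and reusing them:

```python
import numpy as np
import pandas as pd


def drop_bad_columns(frame):
    """Drop every column that contains NaN or +/-inf."""
    return frame.replace([np.inf, -np.inf], np.nan).dropna(axis=1)


# Toy frames: 'b' and 'c' are bad in train, only 'c' is bad in test.
train = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 1.0],
                      "c": [np.inf, 0.0], "d": [3.0, 4.0]})
test = pd.DataFrame({"a": [5.0, 6.0], "b": [1.0, 2.0],
                     "c": [np.inf, 1.0], "d": [7.0, 8.0]})

train_clean = drop_bad_columns(train)
kept_columns = list(train_clean.columns)      # decided on the training data only

# Cleaning the test frame on its own would keep a different set of columns...
independent_columns = list(drop_bad_columns(test).columns)

# ...so select the training columns instead.
test_aligned = test[kept_columns]
```

If a kept column still contained NaN or infinite values in the test frame, those entries would need to be imputed rather than dropped.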


In [None]:
# YOUR CODE HERE
df_target=df['target']
df=df.drop(columns=['target'])

from sklearn import model_selection
train_features,test_features,train_labels,test_labels=model_selection.train_test_split(df,df_target,test_size=0.3, random_state=17)

from sklearn.preprocessing import MinMaxScaler
scaler= MinMaxScaler()
train_features = scaler.fit_transform(train_features)
test_features = scaler.transform(test_features)

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
train_labels=le.fit_transform(train_labels)
test_labels=le.transform(test_labels)
#used PCA to reduce the dimensionality of the data, as 309 features is a lot
#n_components=8 was chosen arbitrarily, as in the lab
from sklearn.decomposition import PCA
pca = PCA(n_components=8)
train_features_pca=pca.fit_transform(train_features)
test_features_pca=pca.transform(test_features)

from sklearn.tree import DecisionTreeClassifier
clf1 = DecisionTreeClassifier(random_state=0)
clf1.fit(train_features_pca,train_labels)
results=clf1.predict(test_features_pca)

from sklearn.metrics import accuracy_score
accuracy_score(test_labels, results)

0.93646408839779

In [None]:
train_features.shape

(1688, 309)

## Robust evaluation (10 Marks)

In this section, we are interested in more rigorous techniques by implementing more sophisticated methods, for instance:
* Hold-out and cross-validation.
* Hyper-parameter tuning.
* Feature reduction.
* Feature selection.
* Feature normalisation.

Your report should provide concrete information about your reasoning; everything should be well-explained.

The key to getting good marks is to show that you evaluated different methods and that you correctly selected the configuration.
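
One way to combine several of the techniques listed above is to put normalisation, feature reduction and the classifier into a single `Pipeline` and tune it with cross-validation, so that the scaler and PCA are re-fitted on each training fold only. The sketch below assumes synthetic data and a deliberately small parameter grid; the grid values are illustrative, not recommendations for the SAT dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Hold-out: the test part is never touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),          # feature normalisation
    ("pca", PCA()),                      # feature reduction
    ("predictor", DecisionTreeClassifier(random_state=0)),
])

param_grid = {
    "pca__n_components": [4, 8],
    "predictor__criterion": ["gini", "entropy"],
}

# 5-fold cross-validation on the training part only.
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

cv_accuracy = search.best_score_                   # mean CV accuracy of the best setting
holdout_accuracy = search.score(X_test, y_test)    # refitted best pipeline on unseen data
```

Because `GridSearchCV` refits the best configuration on the whole training part by default, `search.score` can be used directly for the final hold-out check.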

In [None]:
# Feature Reduction
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
#PCA
PCA_classifierDT = Pipeline([
    ('pca',PCA()),
    ("predictor", DecisionTreeClassifier(random_state=0))
    ])

param_grid = {"pca__n_components": [8,10,12], #10,20,30 --> then tried 5,7,9,10,12 8,9,10,11 
              "predictor__criterion": ["gini", "entropy", "log_loss"],
              "predictor__splitter": ["best","random"],
              "predictor__max_features":["sqrt", "log2",None]
              }

#cross validation
DT_gs = GridSearchCV(PCA_classifierDT, param_grid, scoring="accuracy",cv=10)

# Run the GridSearchCV
DT_gs.fit(train_features, train_labels)

# Print the best parameters and the score
DT_gs.best_params_, DT_gs.best_score_

({'pca__n_components': 10,
  'predictor__criterion': 'entropy',
  'predictor__max_features': None,
  'predictor__splitter': 'random'},
 0.9543850380388841)

In [None]:
#Final Test on unseen data
PCA_classifierDT.set_params(**DT_gs.best_params_) 
PCA_classifierDT.fit(train_features, train_labels)
accuracy_score(test_labels, PCA_classifierDT.predict(test_features))

0.9488950276243094

In [None]:
# Feature Selection
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import VarianceThreshold


#Remove all features which have a variance below 0.03
constant_filter = VarianceThreshold(threshold=0.03)
train_features0=constant_filter.fit_transform(train_features)
test_features0=constant_filter.transform(test_features)


FS_classifierDT = Pipeline([
    ("feature_selection",SelectKBest(f_classif)),
    ("predictor", DecisionTreeClassifier(random_state=0))
    ])

param_grid = {"feature_selection__k": [20,40,80,'all'],
              "predictor__criterion": ["gini", "entropy", "log_loss"],
              "predictor__splitter": ["best","random"],
              "predictor__max_features":["sqrt", "log2",None]
              }

#cross validation
FS_DT_gs = GridSearchCV(FS_classifierDT, param_grid, scoring="accuracy",cv=10)

# Run the GridSearchCV
FS_DT_gs.fit(train_features0, train_labels)

# Print the best parameters and the score
FS_DT_gs.best_params_, FS_DT_gs.best_score_

({'feature_selection__k': 40,
  'predictor__criterion': 'entropy',
  'predictor__max_features': None,
  'predictor__splitter': 'random'},
 0.9846012961397577)

In [None]:
train_features0.shape 
#Features reduced from 309 to 130 by the variance filter; SelectKBest then keeps the k best (k=40 in the grid search above)

(1688, 130)

In [None]:
#Final Test on Unseen Data
FS_classifierDT.set_params(**FS_DT_gs.best_params_) 
FS_classifierDT.fit(train_features0, train_labels)
accuracy_score(test_labels, FS_classifierDT.predict(test_features0))

0.9917127071823204

## New classifier (10 Marks)

Replicate the previous task for a classifier different than K-NN and decision trees. Briefly describe your choice.
Try to create the best model for the given dataset.


Save your best model to your GitHub repository, and create a single code cell that loads it and evaluates it on the following test dataset:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv

This link currently contains a sample of the training set. The real test set will be released after the submission. I should be able to run the code cell independently, so load all the libraries you need as well.
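
The save-and-load step can be sketched as a round trip through `joblib`. The example below assumes a small random forest fitted on synthetic data and uses an in-memory `BytesIO` buffer in place of the bytes that would be downloaded from a repository; no real URL is involved.

```python
from io import BytesIO

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small fitted model (assumption, not the real model).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# joblib can write to and read from file-like objects as well as file names.
buffer = BytesIO()
dump(model, buffer)
buffer.seek(0)                 # rewind before reading the bytes back

restored = load(buffer)
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
```

Only a fitted estimator is worth saving, and the loading environment should use a scikit-learn version compatible with the one that produced the file.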

In [None]:
# YOUR CODE HERE
import numpy as np
import pandas as pd
df = pd.read_csv("https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv?raw=True",index_col=0)
old_shape=df.shape

print('Na Values: ',df.isnull().values.any()) #True

df=df.dropna(axis=1) 
#originally tried rows but all were removed

print('Infinity Values:',df.isin([np.inf, -np.inf]).values.any()) #True

#replace infinities with NaN and then drop the affected columns (axis=1)
df = df.replace([np.inf, -np.inf], np.nan).dropna(axis=1) 

print('Sanity Check:',df.isnull().values.any())

print('Rows Lost =',old_shape[0]-df.shape[0],'Columns Lost =',old_shape[1]-df.shape[1] )

df_target=df['target']
df=df.drop(columns=['target'])

from sklearn import model_selection
train_features,test_features,train_labels,test_labels=model_selection.train_test_split(df,df_target,test_size=0.3, random_state=17)

from sklearn.preprocessing import MinMaxScaler
scaler= MinMaxScaler()
train_features = scaler.fit_transform(train_features)
test_features = scaler.transform(test_features)

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
train_labels=le.fit_transform(train_labels)
test_labels=le.transform(test_labels)

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
PCA_classifierRF = Pipeline([
    ('pca',PCA()),
    ("predictor", RandomForestClassifier())
    ])

param_grid = {"pca__n_components": [10,12,14], #5,10,15,20 --> 15 --> 12,14,16,18
              "predictor__n_estimators": [250,275],
              "predictor__criterion": ["gini", "log_loss"],
              "predictor__max_features":["sqrt", "log2", None]
              }

#cross validation
RF_gs = GridSearchCV(PCA_classifierRF, param_grid, scoring="accuracy",cv=5)

# Run the GridSearchCV
RF_gs.fit(train_features, train_labels)

# Print the best parameters and the score
RF_gs.best_params_, RF_gs.best_score_

Na Values:  True
Infinity Values: True
Sanity Check: False
Rows Lost = 0 Columns Lost = 16


({'pca__n_components': 14,
  'predictor__criterion': 'gini',
  'predictor__max_features': 'log2',
  'predictor__n_estimators': 275},
 0.93801580333626)

In [None]:
PCA_classifierRF.set_params(**RF_gs.best_params_) 
PCA_classifierRF.fit(train_features, train_labels)
accuracy_score(test_labels, PCA_classifierRF.predict(test_features))

0.9726027397260274

In [5]:
import pandas as pd
import numpy as np
df = pd.read_csv("https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv?raw=True",index_col=0)
old_shape=df.shape

print('Na Values: ',df.isnull().values.any()) #True

df=df.dropna(axis=1) 
#originally tried rows but all were removed

print('Infinity Values:',df.isin([np.inf, -np.inf]).values.any()) #True

#replace infinities with NaN and then drop the affected columns (axis=1)
df = df.replace([np.inf, -np.inf], np.nan).dropna(axis=1) 

print('Sanity Check:',df.isnull().values.any())

print('Rows Lost =',old_shape[0]-df.shape[0],'Columns Lost =',old_shape[1]-df.shape[1] )

df_target=df['target']
df=df.drop(columns=['target'])

from sklearn import model_selection
train_features,test_features,train_labels,test_labels=model_selection.train_test_split(df,df_target,test_size=0.3, random_state=17)

from sklearn.preprocessing import MinMaxScaler
scaler= MinMaxScaler()
train_features = scaler.fit_transform(train_features)
test_features = scaler.transform(test_features)

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
train_labels=le.fit_transform(train_labels)
test_labels=le.transform(test_labels)

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import VarianceThreshold


#Remove all features which have a variance below 0.03
constant_filter = VarianceThreshold(threshold=0.03)
train_features0=constant_filter.fit_transform(train_features)
test_features0=constant_filter.transform(test_features)


FS_classifierRF = Pipeline([
    ("feature_selection",SelectKBest(f_classif)),
    ("predictor", RandomForestClassifier())
    ])

param_grid = {"feature_selection__k": [80,100,'all'],
              "predictor__n_estimators": [200,250],
              "predictor__criterion": ["gini", "log_loss"],
              "predictor__max_features":["sqrt", "log2", None]
              }
              
#cross validation
FS_RF_gs = GridSearchCV(FS_classifierRF, param_grid, scoring="accuracy",cv=10)

# Run the GridSearchCV
FS_RF_gs.fit(train_features0, train_labels)

# Print the best parameters and the score
FS_RF_gs.best_params_, FS_RF_gs.best_score_

Na Values:  True
Infinity Values: True
Sanity Check: False
Rows Lost = 0 Columns Lost = 16


KeyboardInterrupt: ignored

In [None]:
FS_classifierRF.set_params(**FS_RF_gs.best_params_) 
FS_classifierRF.fit(train_features0, train_labels)
accuracy_score(test_labels, FS_classifierRF.predict(test_features0))

# <font color="blue">FOR GRADING ONLY</font>

Save your best model to your GitHub repository, and create a single code cell that loads it and evaluates it on the following test dataset:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv


In [6]:
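from joblib import dump, load
from io import BytesIO
import requests
# Serialise the pipeline locally; it must already be fitted, and the resulting
# file still has to be uploaded to the GitHub repository.
dump(FS_classifierRF,'bestmodel.joblib')
# INSERT YOUR MODEL'S URL
# NOTE: the link below is the GitHub page of this notebook, not the raw
# bestmodel.joblib file, so load() receives HTML instead of a serialised model
# and fails (see the KeyError below). It needs to point at the raw model file.
mLink = 'https://github.com/Squatsit/CS6405/blob/2ba9f769796e123688600cea97d5784bf5665b38/122114180__Data_Mining__Second_Assignment.ipynb'
response = requests.get(mLink)
response.raise_for_status()  # fail early on a bad HTTP status
mfile = BytesIO(response.content)
model = load(mfile)
# YOUR CODE HERE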


KeyError: ignored
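
The `KeyError` above is consistent with `load` being handed an HTML page rather than a serialised model: a `github.com/.../blob/...` link returns the web page for a file, while the file's bytes are served from `raw.githubusercontent.com` (the dataset cells in this notebook get the same effect by appending `?raw=True`). A minimal sketch of the URL rewrite, using a hypothetical helper name and the dataset link from this assignment as the example input:

```python
def github_blob_to_raw(url):
    """Convert a github.com '/blob/' page URL into its raw.githubusercontent.com form."""
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        raise ValueError("not a GitHub blob URL: " + url)
    # "owner/repo" comes before "/blob/", "ref/path/to/file" after it.
    owner_repo, ref_and_path = url[len(prefix):].split("/blob/", 1)
    return "https://raw.githubusercontent.com/" + owner_repo + "/" + ref_and_path


raw_url = github_blob_to_raw(
    "https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv")
```

The same rewrite would apply to the link of an uploaded `.joblib` file before passing the downloaded bytes to `load`.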