# CS3033/CS6405 - Data Mining - Second Assignment

### Submission

This assignment is **due on 07/04/22 at 23:59**. You should submit a single .ipynb file with your Python code and analysis electronically via Canvas.
Please note that this assignment will account for 25 Marks of your module grade.

### Declaration

By submitting this assignment, I agree to the following:

<font color="red">“I have read and understand the UCC academic policy on plagiarism, and agree to the requirements set out thereby in relation to plagiarism and referencing. I confirm that I have referenced and acknowledged properly all sources used in the preparation of this assignment.
I declare that this assignment is entirely my own work based on my personal study. I further declare that I have not engaged the services of another to either assist me in, or complete this assignment”</font>

### Objective

The Boolean satisfiability (SAT) problem consists in determining whether a Boolean formula F is satisfiable or not. F is represented by a pair (X, C), where X is a set of Boolean variables and C is a set of clauses in Conjunctive Normal Form (CNF). Each clause is a disjunction of literals (a variable or its negation). This problem is one of the most widely studied combinatorial problems in computer science. It is the classic NP-complete problem. Over the past number of decades, a significant amount of research work has focused on solving SAT problems with both complete and incomplete solvers.
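To make the CNF definition concrete, here is a minimal brute-force satisfiability check. It is only an illustration: the integer encoding of literals (DIMACS style, `k` for a variable and `-k` for its negation) and the function name `is_satisfiable` are assumptions, and real SAT solvers use far more efficient algorithms than exhaustive enumeration.

```python
from itertools import product


def is_satisfiable(num_vars, clauses):
    """Brute-force SAT check for a CNF formula.

    Each clause is a list of non-zero ints: k means variable k,
    -k means its negation. Variables are numbered 1..num_vars.
    """
    for assignment in product([False, True], repeat=num_vars):
        # A clause holds if at least one of its literals is true;
        # the formula holds if every clause holds.
        if all(
            any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        ):
            return True
    return False


# (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)
sat_example = is_satisfiable(3, [[1, -2], [2, 3], [-1, -3]])

# x1 AND NOT x1 can never hold
unsat_example = is_satisfiable(1, [[1], [-1]])

print(sat_example, unsat_example)
```

The loop visits up to 2^n assignments, which is why practical solvers and solver portfolios matter for larger instances.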

One of the most successful approaches is an algorithm portfolio, where a solver is selected among a set of candidates depending on the problem type. Your task is to create a classifier that takes as input the SAT instance's features and identifies the class.

In this project, we represent SAT problems with a vector of 327 features with general information about the problem, e.g., number of variables, number of clauses, the fraction of horn clauses in the problem, etc. There is no need to understand the features to be able to complete the assignment.


The original dataset is available at:
https://github.com/bprovanbessell/SATfeatPy/blob/main/features_csv/all_features.csv



## Data Preparation

In [39]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/andvise/DataAnalyticsDatasets/main/train_dataset.csv", index_col=0)
df.head()
df.tail()

Unnamed: 0,c,v,clauses_vars_ratio,vars_clauses_ratio,vcg_var_mean,vcg_var_coeff,vcg_var_min,vcg_var_max,vcg_var_entropy,vcg_clause_mean,...,rwh_0_max,rwh_1_mean,rwh_1_coeff,rwh_1_min,rwh_1_max,rwh_2_mean,rwh_2_coeff,rwh_2_min,rwh_2_max,target
2407,1668,56,29.785714,0.033573,0.05613,0.056969,0.046763,0.06235,2.504465,0.05613,...,19921877.0,6612.008184,1.0,1.1206459999999999e-100,13224.016367,6615.279871,1.0,0.0,13230.559743,5clique
2408,420,28,15.0,0.066667,0.107143,0.046765,0.092857,0.114286,1.798018,0.107143,...,9375075.0,12725.565085,1.0,4.2717200000000005e-43,25451.130169,12602.458887,1.0,0.0,25204.917773,4clique
2409,1827,59,30.966102,0.032293,0.052898,0.045194,0.039409,0.055829,2.267027,0.052898,...,19531254.0,6245.948689,1.0,2.996198e-101,12491.897377,6216.493633,1.0,0.0,12432.987265,5clique
2410,932,44,21.181818,0.04721,0.064524,0.090983,0.049356,0.074034,2.633887,0.064524,...,12890627.0,7820.073756,1.0,7.261942e-74,15640.147512,7680.635364,1.0,0.0,15361.270729,5clique
2411,1080,45,24.0,0.041667,0.067284,0.055956,0.052778,0.071296,2.141971,0.067284,...,16015628.0,8070.60817,0.952197,385.8019,15755.414435,8031.133183,0.954395,366.262092,15696.004274,5clique


In [40]:
# Label or target variable
df['target'].value_counts()

tseitin           298
dominating        294
cliquecoloring    268
php               266
subsetcard        263
op                201
tiling            120
5clique           108
3color            104
matching          102
5color             98
4color             98
3clique            98
4clique            94
Name: target, dtype: int64

# Tasks

## Basic models and evaluation (5 Marks)

Using Scikit-learn, train and evaluate a decision tree classifier using 70% of the dataset for training and 30% for testing. For this part of the project, we are not interested in optimising the parameters; we just want to get an idea of the dataset.

In [41]:
# YOUR CODE HERE
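import numpy as np

# NOTE: dropna() returns a new DataFrame. The two results below are not
# assigned, so no rows or columns are removed here; any remaining NaN
# values are replaced with 0 by fillna(0) in the next cell.
df.dropna(axis=0, how='any')
df.dropna(axis=1, how='any')

# Drop every column that contains +inf (isin([np.inf]) does not match -inf).
inf_cols = df.isin([np.inf]).any()
df = df.loc[:, ~inf_cols]
print(df.isna().sum())

# Clip numeric columns to the float32 range, then cast to float32.
# Magnitudes too small for float32 (e.g. 1e-100) become 0 after the cast.
num_cols = df.select_dtypes(include=[np.number]).columns
df[num_cols] = np.clip(df[num_cols], np.finfo(np.float32).min, np.finfo(np.float32).max).astype(np.float32)
df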

c                     0
v                     0
clauses_vars_ratio    0
vars_clauses_ratio    0
vcg_var_mean          0
                     ..
rwh_2_mean            0
rwh_2_coeff           0
rwh_2_min             0
rwh_2_max             0
target                0
Length: 328, dtype: int64


Unnamed: 0,c,v,clauses_vars_ratio,vars_clauses_ratio,vcg_var_mean,vcg_var_coeff,vcg_var_min,vcg_var_max,vcg_var_entropy,vcg_clause_mean,...,rwh_0_max,rwh_1_mean,rwh_1_coeff,rwh_1_min,rwh_1_max,rwh_2_mean,rwh_2_coeff,rwh_2_min,rwh_2_max,target
0,608.0,71.0,8.563380,0.116776,0.045172,0.173688,0.029605,0.060855,2.802758,0.045172,...,5078250.0,1056.695068,1.000000,2.981935e-09,2113.390137,1081.900757,1.000000,1.302080e-29,2163.801514,matching
1,615.0,70.0,8.785714,0.113821,0.049617,0.168633,0.032520,0.069919,2.607264,0.049617,...,5469376.0,1207.488403,1.000000,6.927306e-28,2414.976807,1186.623657,1.000000,0.000000e+00,2373.247314,matching
2,926.0,105.0,8.819048,0.113391,0.033385,0.186444,0.017279,0.047516,3.022879,0.033385,...,4297025.0,441.327057,1.000000,0.000000e+00,882.654114,474.697571,1.000000,0.000000e+00,949.395142,matching
3,603.0,70.0,8.614285,0.116086,0.049799,0.133441,0.033167,0.063018,2.688342,0.049799,...,6640651.0,1181.583374,1.000000,2.437278e-30,2363.166748,1149.059082,1.000000,0.000000e+00,2298.118164,matching
4,228.0,43.0,5.302326,0.188596,0.067319,0.162581,0.048246,0.087719,2.203308,0.067319,...,2437500.0,1091.423950,0.999966,3.723599e-02,2182.810547,1296.888062,1.000000,6.307424e-06,2593.776123,matching
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2407,1668.0,56.0,29.785715,0.033573,0.056130,0.056969,0.046763,0.062350,2.504465,0.056130,...,19921876.0,6612.008301,1.000000,0.000000e+00,13224.016602,6615.279785,1.000000,0.000000e+00,13230.559570,5clique
2408,420.0,28.0,15.000000,0.066667,0.107143,0.046765,0.092857,0.114286,1.798018,0.107143,...,9375075.0,12725.565430,1.000000,4.273960e-43,25451.130859,12602.458984,1.000000,0.000000e+00,25204.917969,4clique
2409,1827.0,59.0,30.966103,0.032293,0.052898,0.045194,0.039409,0.055829,2.267027,0.052898,...,19531254.0,6245.948730,1.000000,0.000000e+00,12491.897461,6216.493652,1.000000,0.000000e+00,12432.987305,5clique
2410,932.0,44.0,21.181818,0.047210,0.064524,0.090983,0.049356,0.074034,2.633888,0.064524,...,12890627.0,7820.073730,1.000000,0.000000e+00,15640.147461,7680.635254,1.000000,0.000000e+00,15361.270508,5clique
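The effect of the clip-and-cast step in In [41] can be seen in the output: `rwh_1_min` for row 2407 is about 1.12e-100 in the raw data and 0.000000e+00 afterwards. The following sketch reproduces that behaviour on a few hand-picked values (the array `values` is made up for illustration):

```python
import numpy as np

# Illustrative values: two that overflow float32, one ordinary value,
# and one far smaller in magnitude than float32 can represent.
values = np.array([1e300, -1e300, 1.5, 1.12e-100])

f32 = np.finfo(np.float32)

# Clip to the float32 range first so the cast cannot produce +/-inf.
clipped = np.clip(values, f32.min, f32.max).astype(np.float32)

print(clipped)
```

Clipping protects against overflow only. Values that underflow are silently rounded to zero, which is harmless for tree-based models but worth knowing when reading the table.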


In [42]:
from sklearn.model_selection import train_test_split,KFold
from sklearn.preprocessing import MinMaxScaler,StandardScaler

x=df.iloc[:,:-1].fillna(0)
y=df["target"]
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=11)

In [43]:
from sklearn import tree
clf=tree.DecisionTreeClassifier()
clf=clf.fit(x_train,y_train)

In [44]:
score=clf.score(x_test,y_test)
print(score)

0.9861878453038674
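Because the classes are imbalanced (298 `tseitin` instances against 94 `4clique`), a single accuracy value can hide weak classes. The sketch below shows a stratified 70/30 split and a per-class report. It runs on synthetic data from `make_classification`, which is an assumption standing in for the SAT features, so its numbers say nothing about the assignment dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 3 classes, 20 features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    n_classes=3, random_state=0,
)

# stratify=y_demo keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

tree_demo = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = tree_demo.predict(X_te)

# Per-class precision, recall and F1 instead of one overall accuracy.
report = classification_report(y_te, y_pred, output_dict=True)
print(classification_report(y_te, y_pred))
```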


## Robust evaluation (10 Marks)

In this section, we are interested in a more rigorous evaluation that uses more sophisticated methods, for instance:
* Hold-out and cross-validation.
* Hyper-parameter tuning.
* Feature reduction.
* Feature selection.
* Feature normalisation.

Your report should provide concrete information about your reasoning; everything should be well-explained.

The key to getting good marks is to show that you evaluated different methods and that you correctly selected the configuration.
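As a small illustration of the cross-validation item in the list, the sketch below scores a decision tree with stratified 5-fold cross-validation and reports the mean and spread across folds. The data is synthetic (`make_classification` is an assumption, not the assignment dataset).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=15, n_informative=6,
    n_classes=3, random_state=0,
)

# Shuffled, stratified folds: each fold keeps the class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=cv
)

# The spread across folds shows how much a single hold-out score can vary.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```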

#### Feature selection: remove every column whose values are all identical (zero variance).

In [45]:
from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold(0)  # remove every column with zero variance
x = vt.fit_transform(x)
x.shape

(2412, 257)
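In [45] fits `VarianceThreshold` on all of `x` before the train/test split. The selector is unsupervised, so any leakage is mild, but a stricter variant learns the column mask from the training rows only and reuses it on the test rows. A minimal sketch on toy arrays (the data is made up for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (assumption): the middle column is constant in the training rows.
X_train_demo = np.array([[1.0, 7.0, 0.2],
                         [2.0, 7.0, 0.4],
                         [3.0, 7.0, 0.9]])
X_test_demo = np.array([[4.0, 7.5, 0.1]])

selector = VarianceThreshold(threshold=0.0)

# Learn which columns to keep from the training rows only ...
X_train_reduced = selector.fit_transform(X_train_demo)
# ... and apply the same mask to the test rows.
X_test_reduced = selector.transform(X_test_demo)

kept = selector.get_support()
print(kept, X_train_reduced.shape, X_test_reduced.shape)
```

Placing the selector as the first step of a `Pipeline` achieves the same thing automatically inside cross-validation.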

In [46]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

#### Since the optimal hyperparameter values were not known in advance, I used random search to find a good set of hyperparameters.
#### For dimensionality reduction, I wanted to combine feature selection and feature extraction, so I used FeatureUnion with both SelectKBest and PCA.
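`FeatureUnion` runs its transformers side by side on the same input and concatenates their outputs column-wise, so the classifier sees the PCA components followed by the selected original features. The sketch below checks the resulting width on synthetic data (`make_classification` and the chosen sizes are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in data (assumption): 200 rows, 12 features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=5, random_state=0
)

union = FeatureUnion([
    ("pca", PCA(n_components=3)),            # 3 extracted components
    ("kbest", SelectKBest(f_classif, k=2)),  # 2 selected original columns
])

# y_demo is needed because SelectKBest scores features against the labels.
X_combined = union.fit_transform(X_demo, y_demo)

print(X_combined.shape)  # 3 + 2 = 5 columns
```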

In [47]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif, VarianceThreshold
from sklearn.preprocessing import StandardScaler
# define Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),#feature normalization
    ('reduce_dim', FeatureUnion([
        ('pca', PCA()),
        ('kbest', SelectKBest(f_classif))
    ])),
    ('clf', DecisionTreeClassifier())
])

# define the param space
param_dist = {
    'reduce_dim__pca__n_components': [2, 4, 6, 8, 10],
    'reduce_dim__kbest__k': [2, 4, 6, 8, 10],
    'clf__criterion': ['gini', 'entropy'],
    'clf__max_depth': [3, 5, 7, 9, 11],
    'clf__min_samples_split': [2, 4, 6, 8, 10],
    'clf__min_samples_leaf': [1, 2, 3, 4, 5],
    'clf__class_weight': [None, 'balanced']
}

# run the random search
random_search = RandomizedSearchCV(
    pipeline, param_distributions=param_dist, cv=5, n_iter=50, n_jobs=-1, random_state=1
)  # cv=5: 5-fold cross-validation on the training split

# fit the model
random_search.fit(x_train, y_train)

# print best parameters
print("Best parameters: ", random_search.best_params_)

# evaluate the best model on the test set
print("Test set accuracy: {:.6f}".format(random_search.score(x_test, y_test)))

Best parameters:  {'reduce_dim__pca__n_components': 10, 'reduce_dim__kbest__k': 6, 'clf__min_samples_split': 10, 'clf__min_samples_leaf': 3, 'clf__max_depth': 7, 'clf__criterion': 'entropy', 'clf__class_weight': 'balanced'}
Test set accuracy: 0.966851


In [48]:
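# Best parameter set recorded from an earlier execution of the previous cell.
# It differs from the output printed by that cell in this run, most likely because
# DecisionTreeClassifier and PCA were created without a fixed random_state,
# so the search result can change between runs.
# Best parameters:  {'reduce_dim__pca__n_components': 10, 'reduce_dim__kbest__k': 4, 'clf__min_samples_split': 4, 'clf__min_samples_leaf': 1, 'clf__max_depth': 9, 'clf__criterion': 'entropy', 'clf__class_weight': 'balanced'}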
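One way to make the reported parameters reproducible is to fix `random_state` on the estimator as well as on the search. The sketch below runs the same search twice on synthetic data and compares the results. The data, the reduced parameter grid and the helper `run_search` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)


def run_search():
    """Run one small random search with every source of randomness fixed."""
    search = RandomizedSearchCV(
        DecisionTreeClassifier(random_state=0),  # fixes the tree's own randomness
        param_distributions={
            "max_depth": [3, 5, 7, 9],
            "min_samples_leaf": [1, 2, 3, 4, 5],
        },
        n_iter=5,
        cv=3,
        random_state=1,  # fixes which candidates are sampled
    )
    return search.fit(X_demo, y_demo).best_params_


first, second = run_search(), run_search()
print(first == second, first)
```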

## New classifier (10 Marks)

Replicate the previous task for a classifier different than K-NN and decision trees. Briefly describe your choice.
Try to create the best model for the given dataset.


Save your best model to your GitHub repository, and create a single code cell that loads it and evaluates it on the following test dataset:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv

This link currently contains a sample of the training set. The real test set will be released after the submission. I should be able to run the code cell independently, so load all the libraries you need in that cell as well.

#### On high-dimensional datasets, a single decision tree is prone to overfitting. A random forest reduces this risk and tends to generalise better, because it selects features at random and combines the results of many decision trees.
That is why I decided to use a random forest model.
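The claim can be checked empirically by scoring both models under the same cross-validation. The sketch below does this on synthetic data (`make_classification` is an assumption, and the outcome on the SAT features may differ), so it shows the comparison procedure rather than a result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 30 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=30, n_informative=8,
    n_classes=3, random_state=0,
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Same data and same 5 folds for both models, so the means are comparable.
cv_means = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in models.items()
}

for name, mean in cv_means.items():
    print(f"{name}: {mean:.3f}")
```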

In [49]:
from sklearn.ensemble import RandomForestClassifier

# define Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),#feature normalization
    ('reduce_dim', FeatureUnion([
        ('pca', PCA()),
        ('kbest', SelectKBest(f_classif))
    ])),
    ('clf', RandomForestClassifier())
])

# define the param space
param_dist = {
    'reduce_dim__pca__n_components': [2, 4, 6, 8, 10],
    'reduce_dim__kbest__k': [2, 4, 6, 8, 10],
    'clf__n_estimators': [5, 10, 20, 50],
    'clf__criterion': ['gini', 'entropy'],
    'clf__max_depth': [3, 5, 7, 9, 11],
    'clf__min_samples_split': [2, 4, 6, 8, 10],
    'clf__min_samples_leaf': [1, 2, 3, 4, 5],
    'clf__class_weight': [None, 'balanced']
}

# run the random search
random_search = RandomizedSearchCV(
    pipeline, param_distributions=param_dist, cv=5, n_iter=50, n_jobs=-1, random_state=1
)  # cv=5: 5-fold cross-validation on the training split

# fit the model
random_search.fit(x_train, y_train)

# print best parameters
print("Best parameters: ", random_search.best_params_)

# evaluate the best model on the test set
print("Test set accuracy: {:.6f}".format(random_search.score(x_test, y_test)))

Best parameters:  {'reduce_dim__pca__n_components': 10, 'reduce_dim__kbest__k': 10, 'clf__n_estimators': 50, 'clf__min_samples_split': 6, 'clf__min_samples_leaf': 3, 'clf__max_depth': 9, 'clf__criterion': 'gini', 'clf__class_weight': 'balanced'}
Test set accuracy: 0.984807


In [53]:
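from joblib import dump, load
# NOTE: in the cells shown, clf is only assigned in In [43] (the untuned decision
# tree); the tuned random forest pipeline is random_search.best_estimator_.
dump(clf, 'bestmodel.joblib')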

['bestmodel.joblib']
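To persist the tuned model rather than a standalone classifier, the whole fitted pipeline can be saved: with the default `refit=True`, `best_estimator_` on a fitted `RandomizedSearchCV` is the best pipeline refitted on the full training data, preprocessing included. The sketch below round-trips such a pipeline through joblib. The synthetic data, the small grid and the in-memory buffer (instead of a file such as `bestmodel.joblib`) are assumptions for illustration.

```python
from io import BytesIO

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=0
)

search = RandomizedSearchCV(
    Pipeline([
        ("scaler", StandardScaler()),
        ("clf", RandomForestClassifier(random_state=0)),
    ]),
    param_distributions={"clf__n_estimators": [5, 10, 20]},
    n_iter=3, cv=3, random_state=1,
).fit(X_demo, y_demo)

# Save the whole fitted pipeline, so scaling travels with the model.
buffer = BytesIO()  # in-memory stand-in for a .joblib file
dump(search.best_estimator_, buffer)
buffer.seek(0)
restored = load(buffer)

# The restored pipeline should predict exactly like the original.
same_predictions = np.array_equal(
    restored.predict(X_demo), search.predict(X_demo)
)
print(same_predictions)
```

A pipeline saved this way expects the same input columns it was fitted on, so the loading cell has to apply any column removal that happened outside the pipeline.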

# <font color="blue">FOR GRADING ONLY</font>

Save your best model to your GitHub repository, and create a single code cell that loads it and evaluates it on the following test dataset:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv


In [1]:
from joblib import dump, load
from io import BytesIO
import pandas as pd
import numpy as np
import requests

# INSERT YOUR MODEL'S URL
mLink = 'https://github.com/Alonsdfn/dm_temp/blob/main/bestmodel.joblib?raw=true'
mfile = BytesIO(requests.get(mLink).content)
model = load(mfile)

# Load the dataset and replicate your preprocessing
df = pd.read_csv("https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv?raw=true", index_col=0)
df = df.replace([np.inf, -np.inf], 0)
df = df.fillna(0)
X = df.iloc[:,:-1]
y = df['target']

# Evaluate your model or pipeline
model.score(X,y)

0.9968102073365231