# CS3033/CS6405 - Data Mining - Second Assignment

### Submission

This assignment is **due on 06/04/22 at 23:59**. You should submit a single .ipynb file with your Python code and analysis electronically via Canvas.
Please note that this assignment will account for 25 Marks of your module grade.

### Declaration

By submitting this assignment, I agree to the following:

<font color="red">“I have read and understand the UCC academic policy on plagiarism, and agree to the requirements set out thereby in relation to plagiarism and referencing. I confirm that I have referenced and acknowledged properly all sources used in the preparation of this assignment.
I declare that this assignment is entirely my own work based on my personal study. I further declare that I have not engaged the services of another to either assist me in, or complete this assignment”</font>

### Objective

The Boolean satisfiability (SAT) problem consists in determining whether a Boolean formula F is satisfiable or not. F is represented by a pair (X, C), where X is a set of Boolean variables and C is a set of clauses in Conjunctive Normal Form (CNF). Each clause is a disjunction of literals (a variable or its negation). This problem is one of the most widely studied combinatorial problems in computer science. It is the classic NP-complete problem. Over the past number of decades, a significant amount of research work has focused on solving SAT problems with both complete and incomplete solvers.
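
To make the definition concrete, here is a minimal sketch of a brute-force satisfiability check. It assumes a DIMACS-style encoding that the assignment itself does not prescribe: the integer `k` stands for variable `k` and `-k` for its negation. The function name `is_satisfiable` and the two example formulas are illustrative only. The loop is exponential in the number of variables, so it is only meant for tiny formulas.

```python
from itertools import product


def is_satisfiable(num_vars, clauses):
    """Return True if some assignment satisfies every clause (brute force).

    Assumed encoding: literal k means variable k is true,
    literal -k means variable k is false (variables are numbered from 1).
    """
    for bits in product([False, True], repeat=num_vars):
        # A CNF formula holds when every clause has at least one true literal.
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False


# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
sat_formula = [[1, -2], [2, 3], [-1, -3]]
# x1 and (not x1) can never hold together
unsat_formula = [[1], [-1]]

sat_result = is_satisfiable(3, sat_formula)
unsat_result = is_satisfiable(1, unsat_formula)
print(sat_result, unsat_result)
```

The classifier in this assignment tries to predict the same yes/no answer from features of the formula, without searching over assignments.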

Recent advances in supervised learning have provided powerful techniques for classifying problems. In this project, we see the SAT problem as a classification problem. Given a Boolean formula (represented by a vector of features), we are asked to predict if it is satisfiable or not.

In this project, we represent SAT problems with a vector of 327 features with general information about the problem, e.g., number of variables, number of clauses, fraction of Horn clauses in the problem, etc. There is no need to understand the features to be able to complete the assignment.
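
As an illustration of what such features look like, the sketch below computes a handful of them from a small CNF formula. It is not the feature extractor used to build the dataset. The keys `c`, `v`, `clauses_vars_ratio` and `vars_clauses_ratio` mirror column names in the dataset, and reading `c` as the clause count and `v` as the variable count is an assumption based on the ratios shown by `df.head()`. The key `horn_fraction` is a hypothetical name, and the DIMACS-style literal encoding is also assumed.

```python
def basic_features(num_vars, clauses):
    """Compute a few simple descriptive features of a CNF formula.

    Assumed encoding: positive integer = variable, negative = its negation.
    """
    num_clauses = len(clauses)
    # A Horn clause contains at most one positive literal.
    horn_clauses = sum(
        1 for clause in clauses if sum(lit > 0 for lit in clause) <= 1
    )
    return {
        "c": num_clauses,
        "v": num_vars,
        "clauses_vars_ratio": num_clauses / num_vars,
        "vars_clauses_ratio": num_vars / num_clauses,
        "horn_fraction": horn_clauses / num_clauses,  # hypothetical name
    }


# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
features = basic_features(3, [[1, -2], [2, 3], [-1, -3]])
print(features)
```

In this example two of the three clauses are Horn clauses, since only `(x2 or x3)` has more than one positive literal.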

The dataset is available at:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/dm_assignment2/sat_dataset_train.csv

This is original unpublished data.

## New classifier (10 Marks)

Replicate the previous task for a classifier that we did not cover in class, i.e. something other than K-NN and decision trees. Briefly describe your choice.
Try to create the best model for the given dataset.
Save your best model to your GitHub repository, and create a single code cell that loads it and evaluates it on the following test dataset:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/dm_assignment2/sat_dataset_test.csv

This link currently contains a sample of the training set. The real test set will be released after the submission deadline. I should be able to run the code cell independently, so it must also load all the libraries you need.

In [1]:
# YOUR CODE HERE
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/andvise/DataAnalyticsDatasets/main/dm_assignment2/sat_dataset_train.csv')
df.shape

(1929, 328)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1929 entries, 0 to 1928
Columns: 328 entries, c to target
dtypes: float64(237), int64(91)
memory usage: 4.8 MB


In [4]:
df.head()

Unnamed: 0,c,v,clauses_vars_ratio,vars_clauses_ratio,vcg_var_mean,vcg_var_coeff,vcg_var_min,vcg_var_max,vcg_var_entropy,vcg_clause_mean,...,rwh_0_max,rwh_1_mean,rwh_1_coeff,rwh_1_min,rwh_1_max,rwh_2_mean,rwh_2_coeff,rwh_2_min,rwh_2_max,target
0,420,10,42.0,0.02381,0.6,0.0,0.6,0.6,0.0,0.6,...,78750.0,8e-06,0.0,7.875e-06,8e-06,2.385082e-21,0.0,2.385082e-21,2.385082e-21,1
1,230,20,11.5,0.086957,0.137826,0.089281,0.117391,0.16087,2.180946,0.137826,...,6646875.0,17433.722184,1.0,2.981244e-12,34867.444369,17277.21,1.0,1.358551e-53,34554.42,0
2,240,16,15.0,0.066667,0.3,0.0,0.3,0.3,0.0,0.3,...,500000.0,1525.878932,0.0,1525.879,1525.878932,1525.879,0.0,1525.879,1525.879,1
3,424,30,14.133333,0.070755,0.226415,0.485913,0.056604,0.45283,2.220088,0.226415,...,87500.0,0.000122,1.0,6.535723e-14,0.000245,8.218628e-07,1.0,1.499676e-61,1.643726e-06,0
4,162,19,8.526316,0.117284,0.139701,0.121821,0.111111,0.185185,1.940843,0.139701,...,5859400.0,16591.49431,1.0,6.912725999999999e-42,33182.988621,16659.03,1.0,0.0,33318.07,1


In [5]:
# Check if there are any NaN or infinite values in the dataset
def check_nan_and_infinite_values(df):
    print('NaN values in dataset:', np.any(np.isnan(df)))
    print('Inf values in dataset:', not np.all(np.isfinite(df)))

check_nan_and_infinite_values(df)

NaN values in dataset: True
Inf values in dataset: True


In [6]:
# Since there are NaN and infinite values, we replace them with 0.
# This is necessary because otherwise the model will throw an error when we start training

df = df.replace([np.inf, -np.inf, np.nan], 0)

# Confirm that no NaN or inf values remain
check_nan_and_infinite_values(df)

NaN values in dataset: False
Inf values in dataset: False
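
The check above only reports whether any non-finite value exists. A possible refinement, sketched here on a toy DataFrame (the frame `toy` and its column names are made up for illustration), is to count the non-finite entries per column before replacing them, which shows which features are affected.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the SAT feature table (not the real dataset)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.inf, 2.0, -np.inf],
    "c": [0.5, 1.5, 2.5],
})

# Boolean mask of entries that are NaN, +inf or -inf
non_finite_mask = ~np.isfinite(toy.to_numpy())

# Number of non-finite entries in each column
counts = pd.Series(non_finite_mask.sum(axis=0), index=toy.columns)
print(counts[counts > 0])

# Same replacement strategy as in the cell above
cleaned = toy.replace([np.inf, -np.inf, np.nan], 0)
print(np.isfinite(cleaned.to_numpy()).all())
```

Replacing with 0 is one choice among several. Imputing a per-column statistic such as the median is a common alternative, though whether it helps on this dataset has not been tested here.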


In [7]:
# SVM model using default parameters

# Splitting dataset into train, test with 70, 30 ratio
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['target']) , df['target'], test_size=0.3)

# Declaring default SVC classifier model
svm = SVC()

# Fitting classifier with training data
svm.fit(X_train, y_train)

# NOTE: `model` is an unfitted LinearRegression, unrelated to the SVM fitted above
from sklearn import linear_model
model = linear_model.LinearRegression()


In [8]:
# Performing predictions with test data
y_preds = svm.predict(X_test)
print('Accuracy for SVM model with default parameters:', accuracy_score(y_test, y_preds))



Accuracy for SVM model with default parameters: 0.5284974093264249


In [9]:
# Feature Normalization using `Standard Scaling`

# Scale all columns of df except the `target` column
# (fit_transform returns a new NumPy array, so df itself is not modified)
df2 = StandardScaler().fit_transform(df.drop(columns=['target']))

# Convert the scaled NumPy array back into a DataFrame with the original column names
df2 = pd.DataFrame(df2, columns=df.columns.drop('target'))

df2.shape

(1929, 327)
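
For reference, standard scaling subtracts each column's mean and divides by its standard deviation. The sketch below reproduces that computation with plain NumPy on a made-up matrix `X`. It uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Two features on very different scales (illustrative values)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)

# z-score: every column ends up with mean 0 and standard deviation 1
X_scaled = (X - mean) / std
print(X_scaled)
```

This manual version would divide by zero for a constant column. `StandardScaler` avoids that by treating a zero standard deviation as 1.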

In [10]:
# df with scaled features
df2.head()

Unnamed: 0,c,v,clauses_vars_ratio,vars_clauses_ratio,vcg_var_mean,vcg_var_coeff,vcg_var_min,vcg_var_max,vcg_var_entropy,vcg_clause_mean,...,rwh_0_min,rwh_0_max,rwh_1_mean,rwh_1_coeff,rwh_1_min,rwh_1_max,rwh_2_mean,rwh_2_coeff,rwh_2_min,rwh_2_max
0,-0.289024,-0.943549,3.799785,-1.311479,4.718609,-1.107531,5.832823,2.807536,-1.957152,4.718609,...,-0.18882,-0.794704,-0.329846,-2.446825,-0.120963,-0.334298,-0.37574,-2.647376,-0.129158,-0.370523
1,-0.714431,-0.745699,0.052469,-0.562237,0.257936,-0.63649,0.48616,-0.002255,0.470293,0.257936,...,-0.304107,0.415544,0.335329,0.442685,-0.120963,0.390724,0.020649,0.408452,-0.129158,0.036323
2,-0.692041,-0.824839,0.482489,-0.802977,1.823158,-1.107531,2.509222,0.887976,-1.957152,1.823158,...,0.906215,-0.717084,-0.271627,-2.446825,0.031191,-0.302569,-0.340731,-2.647376,0.024214,-0.352557
3,-0.280068,-0.547848,0.376007,-0.754472,1.112953,1.456128,-0.187285,1.865865,0.51386,1.112953,...,-0.357137,-0.793092,-0.329846,0.442685,-0.120963,-0.334298,-0.37574,0.408452,-0.129158,-0.370523
4,-0.866682,-0.765484,-0.312887,-0.202401,0.276032,-0.464805,0.416584,0.15333,0.203053,0.276032,...,-0.39314,0.270443,0.303194,0.442685,-0.120963,0.355698,0.006466,0.408452,-0.129158,0.021767


In [11]:
# Hold-out method for splitting normalized df
X_train2, X_test2, y_train2, y_test2 = train_test_split(df2, df['target'], test_size=0.3)

In [12]:
# Feature Reduction using PCA

# Retaining 90% of explained variance
pca = PCA(n_components=0.9)   

# Fitting PCA on the scaled features
pca.fit(df2)



PCA(n_components=0.9)

In [13]:
# Transforming the train set  
X_train2 = pca.transform(X_train2)



In [14]:
X_train2.shape

(1350, 19)

In [15]:
# Transforming the test set 
X_test2 = pca.transform(X_test2)



In [16]:
X_test2.shape

(579, 19)
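
Passing a float such as `n_components=0.9` asks PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below shows the idea with NumPy's SVD on synthetic data. The matrix `X` and its column scales are made up, so the resulting count says nothing about the SAT dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 10 features with rapidly shrinking spread
scales = np.array([10, 5, 2, 1, 0.5, 0.3, 0.2, 0.1, 0.05, 0.01])
X = rng.normal(size=(200, 10)) * scales

# PCA works on centred data; squared singular values are proportional
# to the variance captured by each principal component
X_centred = X - X.mean(axis=0)
_, singular_values, _ = np.linalg.svd(X_centred, full_matrices=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# Smallest k whose cumulative explained variance exceeds 90%
k = int(np.searchsorted(cumulative, 0.9, side="right") + 1)
print(k, cumulative[k - 1])
```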

In [17]:
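# SVM using cross-validation and hyperparameter tuning

# Hyperparameter grid for the grid search (takes roughly 1 min)
# Note: `degree` only affects the 'poly' kernel, so it has no effect on the
# 'rbf' and 'linear' kernels listed here; `gamma` is likewise ignored by 'linear'
param_grid = { 
    'C':[0.1,1,100],
    'kernel':['rbf','linear'],
    'degree':[1,3,5],
    'gamma': [0.001, 0.0001]
}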
# Declaring GridSearch with appropriate model and parameters
grid = GridSearchCV(SVC(),param_grid,  scoring='accuracy', return_train_score=False, verbose=1)

# Applying Grid Search and Cross Validation
grid_search = grid.fit(X_train2, y_train2)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
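
The figures in this log follow from the grid itself: every combination of the listed values is one candidate, and each candidate is fitted once per fold. The sketch below recomputes them with `itertools.product`, then counts how many candidates are actually distinct models, given that `SVC` ignores `degree` for every kernel except `'poly'` and ignores `gamma` for the `'linear'` kernel.

```python
from itertools import product

param_grid = {
    'C': [0.1, 1, 100],
    'kernel': ['rbf', 'linear'],
    'degree': [1, 3, 5],
    'gamma': [0.001, 0.0001],
}

# Every combination of parameter values is one candidate
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
n_folds = 5  # GridSearchCV's default number of cross-validation folds
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)

# Keep only the parameters that influence each kernel
distinct_models = {
    (c['C'], c['kernel'], c['gamma'] if c['kernel'] == 'rbf' else None)
    for c in candidates
}
print(len(distinct_models))
```

Only 9 of the 36 candidates differ in practice, so dropping `degree` from this grid would shorten the search without changing the best score it can find.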


In [18]:
# Check the best parameters found
grid_search.best_params_

{'C': 100, 'degree': 1, 'gamma': 0.001, 'kernel': 'rbf'}

In [19]:
# Check the best score achieved with the best parameter
grid_search.best_score_

0.928888888888889

In [20]:
# Retrieve the estimator with the best parameters found in Grid Search (GridSearchCV has already refit it on the training data)
SVM_test = grid_search.best_estimator_

In [21]:
# Refit the best model on the scaled, PCA-reduced training data (redundant, since GridSearchCV refits by default)
SVM_test.fit(X_train2, y_train2)

SVC(C=100, degree=1, gamma=0.001)

In [22]:
# Make predictions with scaled test data
y_preds2 = SVM_test.predict(X_test2) 
print('Accuracy for SVM model with Grid Search and CV:', accuracy_score(y_test2, y_preds2)) 

Accuracy for SVM model with Grid Search and CV: 0.9326424870466321


In [23]:
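# Save the tuned SVM model to disk

import joblib

file = 'model.sav'

# Save the fitted SVM rather than the unfitted LinearRegression stored in `model`
# Note: new data must go through the same scaling and PCA before calling predict
joblib.dump(SVM_test, file)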

['model.sav']
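
Because the SVM was trained on scaled, PCA-reduced features, a loaded model can only be evaluated if the same scaler and PCA are available too. One way to handle this, sketched below on synthetic data, is to wrap all three steps in a scikit-learn `Pipeline` and save that single object. This is a suggestion rather than what the notebook above does. The data, the step names and the file name `pipeline_demo.joblib` are assumptions, and the `SVC` settings are simply copied from the grid search result.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the SAT features and labels (not the real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test = X[:90], X[90:]
y_train = y[:90]

# Scaling, PCA and the classifier are fitted together on the training split only
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.9)),
    ("svm", SVC(C=100, gamma=0.001, kernel="rbf")),
])
pipeline.fit(X_train, y_train)

# Save the whole pipeline, then load it back as a fresh object
path = os.path.join(tempfile.mkdtemp(), "pipeline_demo.joblib")
joblib.dump(pipeline, path)
restored = joblib.load(path)

# The restored pipeline accepts raw (unscaled) features directly
same_predictions = np.array_equal(pipeline.predict(X_test),
                                  restored.predict(X_test))
print(same_predictions)
```

Fitting the scaler and PCA inside the pipeline also keeps the held-out rows out of the preprocessing statistics, which the cells above do not do because they fit both on the full dataset before splitting.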