# CS6405 - Data Mining - Second Assignment

### Objective

The Boolean satisfiability (SAT) problem consists in determining whether a Boolean formula is satisfiable or not. This problem is one of the most widely studied combinatorial problems in computer science. It is the classic NP-complete problem. Over the past number of decades, a significant amount of research work has focused on solving SAT problems with both complete and incomplete solvers.

An extended version of the problem is Model Counting (#SAT). In #SAT the solver needs to compute the number of solutions of a Boolean formula. A wide variety of solvers have been designed to tackle this problem.

In this project, we want to create an Algorithm Selection (AS) approach to Model Counting. For each #SAT instance, there is a specific solver that works better than the others, the goal of your machine learing approach is to classify it.

In AS we represent #SAT problems with a vector of 72 features with general information about the problem, e.g., number of variables, number of clauses, etc. There is no need to understand the features to be able to complete the assignment. For each instance, there is a 'label' column representing the name of the optimal solver.

## Data Preparation

In [None]:
import pandas as pd

df = pd.read_csv("train_data.csv")
df

Unnamed: 0,c,v,clauses_vars_ratio,vars_clauses_ratio,vcg_var_mean,vcg_var_coeff,vcg_var_min,vcg_var_max,vcg_var_entropy,vcg_clause_mean,...,gsat_FirstLocalMinStep_CoeffVariance,gsat_FirstLocalMinStep_Median,gsat_FirstLocalMinStep_Q.10,gsat_FirstLocalMinStep_Q.90,gsat_BestAvgImprovement_Mean,gsat_BestAvgImprovement_CoeffVariance,gsat_FirstLocalMinRatio_Mean,gsat_FirstLocalMinRatio_CoeffVariance,gsat_EstACL_Mean,label
0,681.0,238.0,2.861345,0.349486,0.011143,0.905300,0.005874,0.111601,1.880038,0.011143,...,0.210148,50.0,42.0,57.0,0.954568,0.591296,1.141656,3.197217,9.240515e+09,gpmc
1,368.0,140.0,2.628571,0.380435,0.018012,0.510753,0.005435,0.054348,1.851609,0.018012,...,0.124438,34.0,29.0,40.0,1.693036,0.244951,0.969015,0.029930,5.401642e+03,d4
2,1935.0,1920.0,1.007812,0.992248,0.001760,1.723720,0.000000,0.012403,1.280404,0.001760,...,0.066708,102.0,94.0,111.0,0.398129,0.824694,0.935730,0.092714,3.561823e+04,gpmc
3,3452.0,2821.0,1.223680,0.817207,0.000968,1.436774,0.000290,0.006083,1.192878,0.000968,...,0.053628,192.0,179.0,205.0,0.247528,0.702251,0.923327,0.026977,1.268929e+05,gpmc
4,694.0,294.0,2.360544,0.423631,0.007656,0.493513,0.002882,0.040346,1.776102,0.007656,...,0.086841,72.0,64.0,80.0,0.822829,0.209989,0.855568,0.045802,1.647598e+04,d4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1331,949.0,351.0,2.703704,0.369863,0.006701,1.151888,0.002012,0.063380,1.857919,0.006701,...,0.079108,81.0,73.0,89.0,0.781888,0.256505,0.837929,0.066291,1.643030e+04,sharpsat
1332,1450.0,608.0,2.384868,0.419310,0.004427,1.552830,0.002408,0.161349,1.659495,0.004427,...,0.062664,136.0,125.0,147.0,0.776364,0.174557,0.855907,0.031698,3.816192e+04,gpmc
1333,250.0,100.0,2.500000,0.400000,0.026000,0.555766,0.000000,0.096000,1.751533,0.026000,...,0.140708,26.0,21.0,30.0,2.093007,0.117640,1.000000,0.000000,3.168220e+03,gpmc
1334,4949.0,3422.0,1.446230,0.691453,0.000764,0.770643,0.000202,0.004445,1.835041,0.000764,...,0.032571,461.0,441.0,479.0,0.202805,0.220101,0.856952,0.016884,6.236767e+05,gpmc


In [49]:
# Label or target variable
df['label'].value_counts()

label
gpmc        921
d4          168
ganak       140
addmc        55
sharpsat     52
Name: count, dtype: int64

# Tasks

## Basic models and evaluation

Using Scikit-learn, train and evaluate a k-NN classifier using 70% of the dataset from training and 30% for testing. For this part of the project, we are not interested in optimising the parameters; we just want to get an idea of the dataset.

In [50]:
# YOUR CODE HERE
from sklearn import model_selection
from sklearn import neighbors

labels = df['label']
features = df.drop('label', axis = 1)

train_features, test_features, train_labels, test_labels = model_selection.train_test_split(features, labels, test_size=0.3, random_state= 6405)

knn = neighbors.KNeighborsClassifier()

knn.fit(train_features,train_labels)
print("KNN Train accuracy for default value of k = 5 :", end=" ")
print(knn.score(train_features, train_labels)*100, end="")
print("%")
print("KNN Testing accuracy for default value of k = 5 :", end=" ")
print(knn.score(test_features, test_labels)*100, end="")
print("%")

KNN Train accuracy for default value of k = 5 : 79.25133689839572%
KNN Testing accuracy for default value of k = 5 : 68.3291770573566%


## Robust evaluation

In this section, we are interested in more rigorous techniques by implementing more sophisticated methods. Do your best to improve the k-NN classifier performances on this dataset.

For instance, you could consider:
* Hold-out and cross-validation.
* Hyper-parameter tuning.
* Feature selection.
* Feature normalisation.
* Etc.



Your report should provide concrete information about your reasoning, everything should be well-explained.

The key to geting good marks is to show that you evaluated different methods and that you correctly selected the configuration.

In [2]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectPercentile, f_classif
import warnings
warnings.filterwarnings('ignore')

In [None]:
for x in [StandardScaler(), MinMaxScaler()]:
  scaler = x
  train_features_Scaled = scaler.fit_transform(train_features)
  pca = PCA(2)
  fit = pca.fit(train_features)
  # summarize components
  print("Explained Variance: %s" % fit.explained_variance_ratio_)

Explained Variance: [0.94593909 0.05406091]
Explained Variance: [0.94593909 0.05406091]


In [None]:
knn_param_grid = {
    'KNN__n_neighbors': [2,3,5,7,9,11,13,15,20,25],
    'PCA__n_components': [2,3,4,5]
}
for x in [StandardScaler(), MinMaxScaler()]:
  pipeline_knn = Pipeline([('scaler', x), \
                           ('PCA', PCA()), \
                           ('KNN', neighbors.KNeighborsClassifier())])

  knn_best = GridSearchCV(pipeline_knn, knn_param_grid, scoring="accuracy")

  # Run the GridSearchCV
  knn_best.fit(train_features, train_labels)
  print("best params for", x, ":")
  # Print the best parameters and the score
  print(knn_best.best_params_)
  print("Corresponding Train accuracy:", (knn_best.best_score_ * 100), "%")

  # print the testing data accuracy
  print("Corresponding Test accuracy:", (knn_best.best_estimator_.score(test_features, test_labels) * 100), "%")
  print("-"* 100)

best params for StandardScaler() :
{'KNN__n_neighbors': 5, 'PCA__n_components': 5}
Corresponding Train accuracy: 73.26203208556149 %
Corresponding Test accuracy: 69.82543640897757 %
----------------------------------------------------------------------------------------------------
best params for MinMaxScaler() :
{'KNN__n_neighbors': 7, 'PCA__n_components': 5}
Corresponding Train accuracy: 73.36898395721924 %
Corresponding Test accuracy: 68.07980049875312 %
----------------------------------------------------------------------------------------------------


In [None]:
knn_param_grid = {
    'KNN__n_neighbors': [5,7,9,11,13,15,20,25],
    'PCA__n_components': [2,3,4,5,6,7],
    'SelectPercentile__percentile': [20, 30, 40, 50, 60, 70]
}
for x in [StandardScaler(), MinMaxScaler()]:

    steps_knn = [('scaler', x)]
    steps_knn.append(('SelectPercentile', SelectPercentile(score_func = f_classif)))
    steps_knn.append(('PCA', PCA()))
    steps_knn.append(('KNN', neighbors.KNeighborsClassifier()))

    pipeline_knn = Pipeline(steps_knn)

    knn_best = GridSearchCV(pipeline_knn, knn_param_grid, scoring="accuracy")

    # Run the GridSearchCV
    knn_best.fit(train_features, train_labels)
    print("best params for", x, ":")
    # Print the best parameters and the score
    print(knn_best.best_params_)
    print("Corresponding Train accuracy:", (knn_best.best_score_ * 100), "%")

    # print the testing data accuracy
    print("Corresponding Test accuracy:", (knn_best.best_estimator_.score(test_features, test_labels) * 100), "%")
    print("-"* 100)

best params for StandardScaler() :
{'KNN__n_neighbors': 9, 'PCA__n_components': 4, 'SelectPercentile__percentile': 20}
Corresponding Train accuracy: 74.43850267379679 %
Corresponding Test accuracy: 70.57356608478803 %
----------------------------------------------------------------------------------------------------
best params for MinMaxScaler() :
{'KNN__n_neighbors': 11, 'PCA__n_components': 6, 'SelectPercentile__percentile': 20}
Corresponding Train accuracy: 74.33155080213905 %
Corresponding Test accuracy: 71.3216957605985 %
----------------------------------------------------------------------------------------------------


In [53]:
# FINAL MODEL FOR TASK 2

knn_param_grid = {
    'KNN__n_neighbors': [5,7,9,11,13,15,20,25],
    #'PCA__n_components': [2,3,4],
    'SelectPercentile__percentile': [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
}
for x in [StandardScaler(), MinMaxScaler()]:


    steps_knn = [('scaler', x)]
    steps_knn.append(('SelectPercentile', SelectPercentile(score_func = f_classif)))
    #steps_knn.append(('PCA', PCA()))
    steps_knn.append(('KNN', neighbors.KNeighborsClassifier()))

    pipeline_knn = Pipeline(steps_knn)

    knn_best = GridSearchCV(pipeline_knn, knn_param_grid, scoring="accuracy")

    # Run the GridSearchCV
    knn_best.fit(train_features, train_labels)
    print("best params for", x, ":")
    # Print the best parameters and the score
    print(knn_best.best_params_)
    print("Corresponding Train accuracy:", (knn_best.best_score_ * 100), "%")

    # print the testing data accuracy
    print("Corresponding Test accuracy:", (knn_best.best_estimator_.score(test_features, test_labels) * 100), "%")
    print("Number of columns chosen by SelectPercentile: ", sum(knn_best.best_estimator_.named_steps['SelectPercentile'].get_support()))
    print("-"* 100)

best params for StandardScaler() :
{'KNN__n_neighbors': 9, 'SelectPercentile__percentile': 20}
Corresponding Train accuracy: 74.33155080213902 %
Corresponding Test accuracy: 72.81795511221945 %
Number of columns chosen by SelectPercentile:  15
----------------------------------------------------------------------------------------------------
best params for MinMaxScaler() :
{'KNN__n_neighbors': 11, 'SelectPercentile__percentile': 20}
Corresponding Train accuracy: 74.01069518716577 %
Corresponding Test accuracy: 73.81546134663341 %
Number of columns chosen by SelectPercentile:  15
----------------------------------------------------------------------------------------------------


# Task 2 - Explanation

#### GridSearchCV was performed on a pipeline wherein Scaling (either MinMaxScaler or StandardScaler) along with Feature Selection (SelectPercentile) and Dimensional Reduction technique (PCA) was used either separately or in tandem prior to the K-Nearest Neighbours model. So, the pipeline had either 3 or 4 components in it, and hyperparameterisation was performed for all components except scaling, as only the type of scaling was switched across pipelines. The overall pipeline was fit and evaluated on 5-fold Cross Validation.
#### 1) Pipeline with different scalers switched but with only PCA applied yielded a maximum of 69.8% validation accuracy. (Best one - MinMaxScaler, 4 components of PCA and 7 neighbours in KNN)
#### 2) Next, pipeline was fit with both PCA and SelectPercentile where SelectPercentile performed ANOVA test between each input column with the output column to select the columns in the top 20 percentile of scores. Here the validation accuracy improved but was limited to 71.32%.
#### 3) Finally, the best results were obtained when the cross validation was performed on pipeline with MinMaxScaling, SelectPercentile and KNN model (No PCA in this model) with parameters KNN Neighbours - 11 and 20% of the top features were chosen by SelectPercentile (15 features) method. Final and best validation accuracy achieved with this method - 73.81%

## New classifier

Replicate the previous task for a classifier different than K-NN and decision trees. Briefly describe your choice.
Try to create the best model for the given dataset.






In [9]:
# YOUR CODE HERE

## GRIDSEARCH FOR MULTIPLE MODELS
!pip install catboost
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier, RandomForestClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
import time

# Defining parameter grids for each classifier
gbm_param_grid = {
    'SelectPercentile__percentile': [25, 30, 35, 40],
    'GBM__n_estimators': [50, 100],
    'GBM__learning_rate': [0.05, 0.1, 0.5],
    'GBM__min_samples_leaf': [5, 10, 15, 20]
}

lgbm_param_grid = {
    'SelectPercentile__percentile': [25, 30, 35, 40],
    'LGBM__n_estimators': [15, 25, 40],
    'LGBM__learning_rate': [0.05, 0.1, 0.5],
    'LGBM__min_child_samples': [5, 10, 15]
}

rf_param_grid = {
    'SelectPercentile__percentile': [25, 30, 35, 40],
    'RF__n_estimators': [15, 25, 40],
    'RF__min_samples_leaf': [5, 10, 15]
}

xgb_param_grid = {
    'SelectPercentile__percentile': [25, 30, 35, 40],
    'XGB__n_estimators': [15, 25, 40],
    'XGB__learning_rate': [0.05, 0.1, 0.5],
    'XGB__min_child_weight' : [5, 10, 15]
}

catboost_param_grid = {
    'SelectPercentile__percentile': [25, 30, 35, 40],
    'CatBoost__iterations': [15, 25, 40],
    'CatBoost__learning_rate': [0.05, 0.1, 0.5],
    'CatBoost__min_child_samples' : [5, 10 ,15]
}

ensemble_param_grid = {
    'SelectPercentile__percentile': [25, 30, 35, 40],
    'voting__weights': [[1,1,1],[1,2,1],[1,1,2],[2,1,1]]
}

encoder = LabelEncoder()
train_labels_numeric = encoder.fit_transform(train_labels)

for scaler in [MinMaxScaler()]:

    steps = [('scaler', scaler)]
    steps.append(('SelectPercentile', SelectPercentile(score_func=f_classif)))

    # RandomForest
    steps_rf = steps + [('RF', RandomForestClassifier())]
    pipeline_rf = Pipeline(steps_rf)
    grid_search_rf = GridSearchCV(pipeline_rf, param_grid = rf_param_grid, scoring = 'accuracy')
    grid_search_rf.fit(train_features, train_labels)

    # GBM
    steps_gbm = steps + [('GBM', GradientBoostingClassifier())]
    pipeline_gbm = Pipeline(steps_gbm)
    grid_search_gbm = GridSearchCV(pipeline_gbm, param_grid = gbm_param_grid, scoring='accuracy')
    grid_search_gbm.fit(train_features, train_labels)

    # LightGBM
    steps_lgbm = steps + [('LGBM', LGBMClassifier())]
    pipeline_lgbm = Pipeline(steps_lgbm)
    grid_search_lgbm = GridSearchCV(pipeline_lgbm, param_grid = lgbm_param_grid, scoring='accuracy')
    grid_search_lgbm.fit(train_features, train_labels)

    # XGBoost
    steps_xgb = steps + [('XGB', XGBClassifier())]
    pipeline_xgb = Pipeline(steps_xgb)
    grid_search_xgb = GridSearchCV(pipeline_xgb, param_grid = xgb_param_grid, scoring='accuracy')
    grid_search_xgb.fit(train_features, train_labels_numeric)

    # CatBoost
    steps_catboost = steps + [('CatBoost', CatBoostClassifier())]
    pipeline_catboost = Pipeline(steps_catboost)
    grid_search_catboost = GridSearchCV(pipeline_catboost, param_grid = catboost_param_grid, scoring='accuracy')
    grid_search_catboost.fit(train_features, train_labels)

    # Ensemble models (VotingClassifier)
    ensemble_clf = steps + [('voting', VotingClassifier(estimators=[('GBM', grid_search_gbm.best_estimator_), ('LGBM', grid_search_lgbm.best_estimator_),  ('RF', grid_search_rf.best_estimator_)], voting='hard'))]
    pipeline_ensemble = Pipeline(ensemble_clf)
    grid_search_ensemble = GridSearchCV(pipeline_ensemble, param_grid = ensemble_param_grid, scoring='accuracy')
    grid_search_ensemble.fit(train_features, train_labels)


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
18:	learn: 0.7235716	total: 533ms	remaining: 589ms
19:	learn: 0.7072334	total: 568ms	remaining: 568ms
20:	learn: 0.6990066	total: 594ms	remaining: 537ms
21:	learn: 0.6880641	total: 620ms	remaining: 508ms
22:	learn: 0.6771390	total: 647ms	remaining: 478ms
23:	learn: 0.6679460	total: 680ms	remaining: 453ms
24:	learn: 0.6600950	total: 715ms	remaining: 429ms
25:	learn: 0.6552189	total: 723ms	remaining: 389ms
26:	learn: 0.6476886	total: 749ms	remaining: 361ms
27:	learn: 0.6398618	total: 781ms	remaining: 335ms
28:	learn: 0.6326457	total: 808ms	remaining: 307ms
29:	learn: 0.6254603	total: 836ms	remaining: 279ms
30:	learn: 0.6161183	total: 863ms	remaining: 251ms
31:	learn: 0.6083289	total: 890ms	remaining: 222ms
32:	learn: 0.5998832	total: 918ms	remaining: 195ms
33:	learn: 0.5941276	total: 944ms	remaining: 167ms
34:	learn: 0.5897500	total: 970ms	remaining: 139ms
35:	learn: 0.5851127	total: 1s	remaining: 111ms
36:	learn: 0.5789794

In [56]:
# Print the best parameters and score for each model
print("Best parameters for", MinMaxScaler())
print("Random Forest - Train Score:", grid_search_rf.best_score_ * 100, "%")
print("GBM - Train Score:", grid_search_gbm.best_score_ * 100, "%")
print("LightGBM - Train Score:", grid_search_lgbm.best_score_ * 100, "%")
print("XGBoost - Train Score:", grid_search_xgb.best_score_ * 100, "%")
print("CatBoost - Train Score:", grid_search_catboost.best_score_ * 100, "%")
print("Ensemble (VotingClassifier) - Train Score:", grid_search_ensemble.best_score_ * 100, "%")
print("-" * 200)
print("\nCorresponding Test scores")
print("Random Forest - Test Score:", grid_search_rf.best_estimator_.score(test_features, test_labels) * 100, "%")
print("GBM - Test Score:", grid_search_gbm.best_estimator_.score(test_features, test_labels) * 100, "%")
print("LightGBM - Test Score:", grid_search_lgbm.best_estimator_.score(test_features, test_labels) * 100, "%")
print("XGBoost - Test Score:", grid_search_xgb.best_estimator_.score(test_features, encoder.transform(test_labels)) * 100, "%")
print("CatBoost - Test Score:", grid_search_catboost.best_estimator_.score(test_features, test_labels) * 100, "%")
print("Ensemble (VotingClassifier) - Test Score:", grid_search_ensemble.best_estimator_.score(test_features, test_labels) * 100, "%")

Best parameters for MinMaxScaler()
Random Forest - Train Score: 76.89839572192513 %
GBM - Train Score: 76.14973262032085 %
LightGBM - Train Score: 77.2192513368984 %
XGBoost - Train Score: 76.68449197860963 %
CatBoost - Train Score: 76.04278074866309 %
Ensemble (VotingClassifier) - Train Score: 74.75935828877004 %
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Corresponding Test scores
Random Forest - Test Score: 73.81546134663341 %
GBM - Test Score: 73.31670822942642 %
LightGBM - Test Score: 76.05985037406484 %
XGBoost - Test Score: 75.31172069825436 %
CatBoost - Test Score: 77.80548628428927 %
Ensemble (VotingClassifier) - Test Score: 73.31670822942642 %


In [11]:
task3_bestpipeline_catboost = grid_search_catboost.best_estimator_
task3_bestpipeline_catboost

In [18]:
task3_bestpipeline_catboost.named_steps['CatBoost'].get_params()

{'iterations': 15, 'learning_rate': 0.5, 'min_child_samples': 5}

In [23]:
print("Total features selected by SelectPercentile function in best params:" , sum(task3_bestpipeline_catboost.named_steps['SelectPercentile'].get_support()))
print("Feature Names chosen: ", train_features.columns[task3_bestpipeline_catboost.named_steps['SelectPercentile'].get_support()])

Total features selected by SelectPercentile function in best params: 25
Feature Names chosen:  Index(['clauses_vars_ratio', 'vcg_var_max', 'vcg_var_entropy',
       'vcg_clause_coeff', 'vcg_clause_min', 'vcg_clause_max', 'vg_mean',
       'vg_min', 'vg_max', 'pnc_ratio_mean', 'pnc_ratio_coeff',
       'pnc_ratio_entropy', 'pnv_ratio_coeff', 'pnv_ratio_max',
       'pnv_ratio_entropy', 'pnv_ratio_stdev', 'binary_ratio', 'ternary_ratio',
       'ternary+', 'hc_fraction', 'hc_var_max', 'hc_var_entropy',
       'unit_props_at_depth_4', 'unit_props_at_depth_64',
       'saps_BestSolution_CoeffVariance'],
      dtype='object')


In [27]:
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np

preds = task3_bestpipeline_catboost.predict(test_features)
report = classification_report(test_labels,preds)
print(report)

              precision    recall  f1-score   support

       addmc       0.87      0.81      0.84        16
          d4       0.75      0.31      0.43        59
       ganak       0.54      0.66      0.60        38
        gpmc       0.81      0.94      0.87       269
    sharpsat       0.67      0.11      0.18        19

    accuracy                           0.78       401
   macro avg       0.73      0.56      0.58       401
weighted avg       0.77      0.78      0.75       401



In [47]:
preds = task3_bestpipeline_catboost.predict(test_features)
matrix = confusion_matrix(test_labels,preds)
print(matrix)

[[ 13   0   0   3   0]
 [  0  18   6  35   0]
 [  0   0  25  12   1]
 [  2   4   9 254   0]
 [  0   2   6   9   2]]


In [28]:
from joblib import dump, load
dump(task3_bestpipeline_catboost, 'final_catboost_pipeline.joblib')

['final_catboost_pipeline.joblib']

# Task 3 - Explanation

#### 1) Building on Task 2, only the model was switched in the pipeline from KNN to various other Machine Learning models like RandomForestClassifier, GradientBoostingClassifier (GBM), LightGBM, XGBoost, CatBoost and VotingClassifier (combination of RandomForestClassifier, GBM, LightGBM).
#### 2) Experimentation was performed on separate pipelines to visualise the difference in model fits and choose the best one. MinMaxScaling and SelectPercentile were common components across all the pipelines.
#### 3) Although highest training accuracy (77.2%) was achieved by LightGBM Classifier on it's best set of parameters found out by hyperparameter tuning on GridSearchCV, highest validation accuracy  of 77.8% was achieved by CatBoost Classifier on it's own best set of parameters with a very marginal difference between the train accuracy between the two models. Catboost train accuracy - 76.04%
#### 4) Therefore, Final model chosen was the Catboost Classifier which has been uploaded to GitHub. Details of the model - SelectPercentile chose 25 features with top 35%ile scores, Catboost classifier with {'iterations': 15, 'learning_rate': 0.5, 'min_child_samples': 5} as tuned parameters.


# <font color="blue">Testing part</font>

Save your best model into your GitHub. And create a single code cell that loads it and evaluates it on the test dataset.

In [7]:
from joblib import dump, load
from io import BytesIO
import requests
import pandas as pd

# INSERT YOUR MODEL'S URL
mLink = 'https://github.com/Abhi21298/DataMining-Assignment2/blob/main/final_catboost_pipeline.joblib?raw=true'
mfile = BytesIO(requests.get(mLink).content)
model = load(mfile)

df = pd.read_csv("test_data.csv")

X_test = df.drop(['label'], axis = 1)

Y_test = df['label']
# Your code here
print("Final Accuracy - ", model.score(X_test, Y_test) * 100, "%")

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


Final Accuracy -  74.32835820895522 %
