
# CS3033/CS6405 - Data Mining - Second Assignment

### Submission

This assignment is **due on 06/04/22 at 23:59**. You should submit a single .ipynb file with your Python code and analysis electronically via Canvas.
Please note that this assignment will account for 25 Marks of your module grade.

### Declaration

By submitting this assignment, I agree to the following:

<font color="red">“I have read and understand the UCC academic policy on plagiarism, and agree to the requirements set out thereby in relation to plagiarism and referencing. I confirm that I have referenced and acknowledged properly all sources used in the preparation of this assignment.
I declare that this assignment is entirely my own work based on my personal study. I further declare that I have not engaged the services of another to either assist me in, or complete this assignment”</font>

### Objective

The Boolean satisfiability (SAT) problem consists in determining whether a Boolean formula F is satisfiable or not. F is represented by a pair (X, C), where X is a set of Boolean variables and C is a set of clauses in Conjunctive Normal Form (CNF). Each clause is a disjunction of literals (a variable or its negation). This problem is one of the most widely studied combinatorial problems in computer science. It is the classic NP-complete problem. Over the past number of decades, a significant amount of research work has focused on solving SAT problems with both complete and incomplete solvers.
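To make the definition concrete, here is a minimal sketch of a brute-force satisfiability check for a small CNF formula. The integer encoding of literals (`+i` for variable i, `-i` for its negation) and the function name `is_satisfiable` are assumptions made for this illustration; they are not part of the assignment or of the dataset.

```python
from itertools import product


def is_satisfiable(num_vars, clauses):
    """Brute-force SAT check for a CNF formula.

    Variables are numbered 1..num_vars; a literal is +i (variable i)
    or -i (its negation). Each clause is a list of literals.
    """
    for assignment in product([False, True], repeat=num_vars):
        # A clause holds if at least one literal is true;
        # the formula holds if every clause holds.
        if all(
            any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        ):
            return True
    return False


# (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)
sat_formula = [[1, -2], [2, 3], [-1, -3]]
# x1 AND NOT x1 can never hold
unsat_formula = [[1], [-1]]

print(is_satisfiable(3, sat_formula))
print(is_satisfiable(1, unsat_formula))
```

The loop tries all 2^n assignments, which is only practical for very small formulas; this exponential cost is what motivates predicting satisfiability from features instead.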

Recent advances in supervised learning have provided powerful techniques for classifying problems. In this project, we see the SAT problem as a classification problem. Given a Boolean formula (represented by a vector of features), we are asked to predict if it is satisfiable or not.

In this project, we represent SAT problems with a vector of 327 features with general information about the problem, e.g., number of variables, number of clauses, fraction of horn clauses in the problem, etc. There is no need to understand the features to be able to complete the assignment.

The dataset is available at:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/dm_assignment2/sat_dataset_train.csv

This is original unpublished data.

## Data Preparation

In [1]:
import pandas as pd 
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn import neighbors
from sklearn import metrics
from sklearn import model_selection
from sklearn import tree
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate
from sklearn.inspection import permutation_importance
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

First, we load the dataset from github:

In [2]:
df = pd.read_csv("https://github.com/andvise/DataAnalyticsDatasets/blob/6d5738101d173b97c565f143f945dedb9c42a400/dm_assignment2/sat_dataset_train.csv?raw=true")
df.head()

Unnamed: 0,c,v,clauses_vars_ratio,vars_clauses_ratio,vcg_var_mean,vcg_var_coeff,vcg_var_min,vcg_var_max,vcg_var_entropy,vcg_clause_mean,...,rwh_0_max,rwh_1_mean,rwh_1_coeff,rwh_1_min,rwh_1_max,rwh_2_mean,rwh_2_coeff,rwh_2_min,rwh_2_max,target
0,420,10,42.0,0.02381,0.6,0.0,0.6,0.6,0.0,0.6,...,78750.0,8e-06,0.0,7.875e-06,8e-06,2.385082e-21,0.0,2.385082e-21,2.385082e-21,1
1,230,20,11.5,0.086957,0.137826,0.089281,0.117391,0.16087,2.180946,0.137826,...,6646875.0,17433.722184,1.0,2.981244e-12,34867.444369,17277.21,1.0,1.358551e-53,34554.42,0
2,240,16,15.0,0.066667,0.3,0.0,0.3,0.3,0.0,0.3,...,500000.0,1525.878932,0.0,1525.879,1525.878932,1525.879,0.0,1525.879,1525.879,1
3,424,30,14.133333,0.070755,0.226415,0.485913,0.056604,0.45283,2.220088,0.226415,...,87500.0,0.000122,1.0,6.535723e-14,0.000245,8.218628e-07,1.0,1.499676e-61,1.643726e-06,0
4,162,19,8.526316,0.117284,0.139701,0.121821,0.111111,0.185185,1.940843,0.139701,...,5859400.0,16591.49431,1.0,6.912725999999999e-42,33182.988621,16659.03,1.0,0.0,33318.07,1


The number of features is large (327), so we will apply a feature selection step later:

In [3]:
df.shape

(1929, 328)

We can display features and their datatypes:



In [4]:
df.dtypes

c                       int64
v                       int64
clauses_vars_ratio    float64
vars_clauses_ratio    float64
vcg_var_mean          float64
                       ...   
rwh_2_mean            float64
rwh_2_coeff           float64
rwh_2_min             float64
rwh_2_max             float64
target                  int64
Length: 328, dtype: object

It is important to check that the dataset is balanced:

In [5]:
df['target'].value_counts()

1    976
0    953
Name: target, dtype: int64

Before any model can be trained, we have to handle missing data. Here we display the number of NaN values per column. We can either drop every feature that contains any NaN, or drop only the columns in which more than 10% of the entries are NaN:

In [6]:
%load_ext google.colab.data_table


with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(df.isnull().sum())


c                                           0
v                                           0
clauses_vars_ratio                          0
vars_clauses_ratio                          0
vcg_var_mean                                0
vcg_var_coeff                               0
vcg_var_min                                 0
vcg_var_max                                 0
vcg_var_entropy                             0
vcg_clause_mean                             0
vcg_clause_coeff                            0
vcg_clause_min                              0
vcg_clause_max                              0
vcg_clause_entropy                          0
vg_mean                                     0
vg_coeff                                    0
vg_min                                      0
vg_max                                      0
pnc_ratio_mean                              0
pnc_ratio_coeff                             0
pnc_ratio_min                               0
pnc_ratio_max                     

Here we select the features to drop in the case we drop all columns with one or more NaN:

In [7]:
df.isnull().sum()[df.isnull().sum() != 0].index.tolist()

['v_nd_p_weights_entropy',
 'v_nd_n_weights_entropy',
 'c_nd_p_weights_entropy',
 'c_nd_n_weights_entropy',
 'cg_al_node_entropy',
 'cg_al_weights_entropy',
 'rg_node_entropy',
 'rg_weights_entropy',
 'big_node_entropy',
 'big_weights_entropy',
 'and_node_entropy',
 'and_weights_entropy',
 'band_node_entropy',
 'band_weights_entropy',
 'exo_node_entropy',
 'exo_weights_entropy']

Here we select the features to drop in the case where we drop only the columns with more than 10% of their entries as NaN (more than 192 of the 1929 rows):

In [8]:
df.isnull().sum()[df.isnull().sum() > 192].index.tolist()

['v_nd_p_weights_entropy',
 'v_nd_n_weights_entropy',
 'c_nd_p_weights_entropy',
 'c_nd_n_weights_entropy',
 'big_node_entropy',
 'big_weights_entropy',
 'and_node_entropy',
 'and_weights_entropy',
 'band_node_entropy',
 'band_weights_entropy',
 'exo_node_entropy',
 'exo_weights_entropy']
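The two selections above rely on absolute NaN counts. As a sketch of the same idea expressed as a fraction of rows, so that it does not depend on the dataset size, the example below uses a small toy dataframe; the column names and values are hypothetical and only stand in for the SAT features.

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for the SAT dataset (hypothetical values).
n = 20
toy = pd.DataFrame({
    "a": np.arange(n, dtype=float),
    "b": np.arange(n, dtype=float),
    "c": np.arange(n, dtype=float),
})
toy.loc[0, "b"] = np.nan      # 1 NaN out of 20 rows (5%)
toy.loc[0:5, "c"] = np.nan    # 6 NaN out of 20 rows (30%)

# Fraction of NaN entries per column.
nan_fraction = toy.isnull().mean()

# Strategy "all": every column with at least one NaN.
cols_any_nan = nan_fraction[nan_fraction > 0].index.tolist()
# Strategy "thresh": only columns with more than 10% NaN.
cols_over_10pct = nan_fraction[nan_fraction > 0.10].index.tolist()

print(cols_any_nan)
print(cols_over_10pct)
```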

In [9]:
def pp_dataset(df, na_strategy="all"):

  '''This function takes as an input a dataframe and a strategy for processing 
  NaN values: "all" or "thresh". A basic preprocessing is applied in order to 
  handle missing values and infinite values. The preprocessed df is returned.'''

  # Number of NaN values per column.
  na_counts = df.isnull().sum()

  # Drop all columns with one or more NaN.
  if na_strategy == "all":
    na_col = na_counts[na_counts != 0].index.tolist()
    df_clean = df.drop(columns=na_col)

  # Drop all columns with more than 10% of their entries as NaN
  # (more than 192 rows for the 1929-row training set), then fill the rest with 0.
  elif na_strategy == "thresh":
    na_col_thresh = na_counts[na_counts > 0.1 * len(df)].index.tolist()
    df_cl = df.drop(columns=na_col_thresh)
    df_clean = df_cl.replace(np.nan,0)

  else:
    raise ValueError('na_strategy must be "all" or "thresh"')
  
  # Replace infinite values by large numbers.
  df_p = df_clean.replace(np.inf, 100000000000000000000000000000)
  df_pp = df_p.replace(-np.inf, -100000000000000000000000000000) 

  return df_pp


Below we check the shapes of the different returned dataframes to ensure the preprocessing was correctly applied.

In [10]:
df.shape, np.any(np.isnan(df))

((1929, 328), True)

In [11]:
df_pp_all = pp_dataset(df, na_strategy="all")
df_pp_all.shape, np.any(np.isnan(df_pp_all))

((1929, 312), False)

In [12]:
df_pp_thresh = pp_dataset(df, na_strategy="thresh")
df_pp_thresh.shape, np.any(np.isnan(df_pp_thresh))

((1929, 316), False)

We can display one of the dataframes to check the values:

In [13]:
df_pp_all.head(10)



Unnamed: 0,c,v,clauses_vars_ratio,vars_clauses_ratio,vcg_var_mean,vcg_var_coeff,vcg_var_min,vcg_var_max,vcg_var_entropy,vcg_clause_mean,...,rwh_0_max,rwh_1_mean,rwh_1_coeff,rwh_1_min,rwh_1_max,rwh_2_mean,rwh_2_coeff,rwh_2_min,rwh_2_max,target
0,420,10,42.0,0.02381,0.6,0.0,0.6,0.6,0.0,0.6,...,78750.0,7.875e-06,0.0,7.875e-06,7.875e-06,2.385082e-21,0.0,2.385082e-21,2.385082e-21,1
1,230,20,11.5,0.086957,0.137826,0.089281,0.117391,0.16087,2.180946,0.137826,...,6646875.0,17433.72,1.0,2.981244e-12,34867.44,17277.21,1.0,1.358551e-53,34554.42,0
2,240,16,15.0,0.066667,0.3,0.0,0.3,0.3,0.0,0.3,...,500000.0,1525.879,0.0,1525.879,1525.879,1525.879,0.0,1525.879,1525.879,1
3,424,30,14.133333,0.070755,0.226415,0.485913,0.056604,0.45283,2.220088,0.226415,...,87500.0,0.0001223252,1.0,6.535723e-14,0.0002446504,8.218628e-07,1.0,1.499676e-61,1.643726e-06,0
4,162,19,8.526316,0.117284,0.139701,0.121821,0.111111,0.185185,1.940843,0.139701,...,5859400.0,16591.49,1.0,6.912725999999999e-42,33182.99,16659.03,1.0,0.0,33318.07,1
5,138,18,7.666667,0.130435,0.31723,0.574838,0.115942,0.57971,1.816983,0.31723,...,394625.0,964.3322,1.0,0.0001295548,1928.664,1144.716,1.0,5.333914e-24,2289.433,1
6,564,36,15.666667,0.06383,0.074468,0.122695,0.051418,0.088652,2.538702,0.074468,...,8203127.0,8913.056,1.0,1.581267e-71,17826.11,8750.544,1.0,0.0,17501.09,1
7,400,30,13.333333,0.075,0.222667,0.519994,0.08,0.48,1.83818,0.222667,...,35000.0,5.428106e-08,0.999842,8.568525e-12,1.085536e-07,3.341328e-24,1.0,1.208599e-43,6.682656e-24,0
8,260,36,7.222222,0.138462,0.154274,0.597639,0.046154,0.369231,2.30573,0.154274,...,218750.0,0.0769181,0.999277,5.561134e-05,0.1537806,0.2538011,1.0,4.686094e-08,0.5076022,0
9,666,36,18.5,0.054054,0.084001,0.030072,0.076577,0.087087,1.455378,0.084001,...,12109377.0,10199.25,1.0,8.881271e-97,20398.49,10175.18,1.0,0.0,20350.37,0


# Tasks

## Basic models and evaluation (5 Marks)

Using Scikit-learn, train and evaluate K-NN and decision tree classifiers using 70% of the dataset for training and 30% for testing. For this part of the project, we are not interested in optimising the parameters; we just want to get an idea of the dataset. Compare the results of both classifiers.

Below we use two algorithms, a KNN and a decision tree, to model and classify whether a Boolean formula F is satisfiable or not:

In [14]:
knn = neighbors.KNeighborsClassifier(n_neighbors = 5)
features = df_pp_all.drop(columns="target")
labels = df_pp_all["target"]

# Train - Test Split
train_features, test_features, train_labels, test_labels = model_selection.train_test_split(features, labels, test_size=0.3, random_state=0)
# Train the KNN
knn.fit(train_features,train_labels)
# Predict the class
results = knn.predict(test_features)
# Compute the accuracy
print(metrics.accuracy_score(test_labels, results))

0.8739205526770294


In [15]:
dtc = tree.DecisionTreeClassifier()
# Train the decision tree
dtc.fit(train_features,train_labels)
# Predict the class
results_tree = dtc.predict(test_features)
# Compute the accuracy
print(metrics.accuracy_score(test_labels, results_tree))

0.9827288428324698


In [16]:
knn_thresh = neighbors.KNeighborsClassifier(n_neighbors = 5)
features_thresh = df_pp_thresh.drop(columns="target")
labels_thresh = df_pp_thresh["target"]

# Train - Test Split
train_features_th, test_features_th, train_labels_th, test_labels_th = model_selection.train_test_split(features_thresh, labels_thresh, test_size=0.3, random_state=0)
# Train the KNN
knn_thresh.fit(train_features_th,train_labels_th)
# Predict the class
results_thresh = knn_thresh.predict(test_features_th)
# Compute the accuracy
print(metrics.accuracy_score(test_labels_th, results_thresh))

0.8739205526770294


In [17]:
dtc_thresh = tree.DecisionTreeClassifier()
# Train the decision tree
dtc_thresh.fit(train_features_th,train_labels_th)
# Predict the class
results_tree_thresh = dtc_thresh.predict(test_features_th)
# Compute the accuracy
print(metrics.accuracy_score(test_labels_th, results_tree_thresh))

0.9827288428324698


We can see that even with basic preprocessing, the decision tree classifier obtains an excellent result, with a test accuracy of 0.98. The KNN with k=5 reaches an accuracy of 0.87 on the test set, which is also quite good, but the decision tree clearly surpasses it. The NaN strategy makes no difference in the run shown here: "all" (drop every column containing a NaN) and "thresh" (drop columns with more than 10% NaN) give identical accuracies for both classifiers. Because the decision tree is not seeded, its accuracy varies slightly between runs, so small differences between the two strategies are not significant. We will continue with the "all" strategy.
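Since an unseeded decision tree can give slightly different scores on each run, a comparison like the one above is easier to reproduce when `random_state` is fixed. The sketch below illustrates this on synthetic data from `make_classification` (an assumption; it does not use the SAT dataset) and also prints a confusion matrix, which shows how the errors are split between the two classes.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the SAT features).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    # Fixing random_state makes the tree identical from run to run.
    "tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = accuracy_score(y_test, pred)
    print(name, scores[name])
    print(confusion_matrix(y_test, pred))

# Refitting with the same seed reproduces the same test accuracy.
rerun = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
rerun_score = accuracy_score(y_test, rerun.predict(X_test))
print(rerun_score == scores["tree"])
```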

## Robust evaluation (10 Marks)

In this section, we are interested in more rigorous techniques by implementing more sophisticated methods, for instance:
* Hold-out and cross-validation.
* Hyper-parameter tuning.
* Feature reduction.
* Feature normalisation.

Your report should provide concrete information of your reasoning; everything should be well-explained.

Do not get stressed if the things you try do not improve the accuracy. The key to getting good marks is to show that you evaluated different methods and that you correctly selected the configuration.

In [18]:
df_pp_all.shape

(1929, 312)

In order to identify which features have the most impact on the prediction of the target, we apply a model inspection technique called "permutation_importance". Each feature is shuffled in turn and the resulting decrease in the model's score is measured. We can therefore keep the features that play a role in the predictions and drop the others. In that way we decrease the dimensionality and complexity of the data to be modelled; this often improves accuracy and it definitely saves computation time.

In [19]:
permutation_score = permutation_importance(dtc, train_features, train_labels, n_repeats=10) 
importance_df = pd.DataFrame(np.vstack((train_features.columns, permutation_score.importances_mean)).T) 
importance_df.columns=['feature','score decrease']
importance_df.sort_values(by="score decrease", ascending = False) 

Unnamed: 0,feature,score decrease
57,gsat_BestSolution_Mean,0.285407
58,gsat_BestSolution_CoeffVariance,0.144815
46,saps_BestSolution_CoeffVariance,0.135926
161,vg_al_node_std,0.030148
45,saps_BestSolution_Mean,0.029778
...,...,...
110,v_nd_n_weights_zeros,0.0
109,v_nd_n_weights_std,0.0
108,v_nd_n_weights_mean,0.0
107,v_nd_n_weights_mode,0.0
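The cell above measures the importance on the training split. As a hedged variant, the sketch below computes it on a held-out split, which reflects what the model relies on for unseen data. It uses synthetic data in which only the first three columns are informative (an assumption made so that the result can be checked); it is not the author's procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# With shuffle=False the informative features are columns 0, 1 and 2;
# the remaining seven columns are pure noise.
X, y = make_classification(n_samples=600, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

tree_model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times on the held-out split and record
# the mean decrease in accuracy.
result = permutation_importance(tree_model, X_test, y_test,
                                n_repeats=10, random_state=0)

# Feature indices sorted from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
# Keep only the features whose shuffling lowers the score on average.
selected = [int(i) for i in ranking if result.importances_mean[i] > 0]

print(ranking)
print(selected)
```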


In [20]:
# Store the best predictors output by the permutation importance
importance_df_ind = importance_df.set_index('feature')
best_predictors = importance_df_ind["score decrease"][importance_df_ind["score decrease"] != 0].index.tolist()
best_predictors

['v',
 'vcg_var_mean',
 'vcg_var_min',
 'pnv_ratio_coeff',
 'unit_props_at_depth_256',
 'estimate_log_number_nodes_over_vars',
 'saps_BestSolution_Mean',
 'saps_BestSolution_CoeffVariance',
 'saps_BestAvgImprovement_CoeffVariance',
 'gsat_BestSolution_Mean',
 'gsat_BestSolution_CoeffVariance',
 'v_nd_p_node_val_rate',
 'vg_al_node_std',
 'vg_al_weights_val_rate',
 'band_node_val_rate']

In [21]:
# Create a new dataframe with a reduced number of features
df_best_predictor = df_pp_all[best_predictors]

In [22]:
features = df_best_predictor
labels = df["target"]

Below we fit different scaling methods in order to test them all and determine which one is best for this dataset.

In [23]:
r_scaler = RobustScaler()
r_scaler.fit(df_best_predictor)
r_df_best_predictor = pd.DataFrame(r_scaler.transform(df_best_predictor))


In [24]:
mm_scaler = MinMaxScaler()
mm_scaler.fit(df_best_predictor)
mm_df_best_predictor = pd.DataFrame(mm_scaler.transform(df_best_predictor))

In [25]:
s_scaler = StandardScaler()
s_scaler.fit(df_best_predictor)
s_df_best_predictor = pd.DataFrame(s_scaler.transform(df_best_predictor))

Here we automate the fitting of the model and add a cross-validation step, with the number of folds as a parameter, to obtain a more robust estimate of our model's accuracy.

In [26]:
def model_search(data, classifier_model, kfold):

  # NOTE: `data` is not used; the split below is taken from the global
  # `features` and `labels` defined in cell 22.
  train_features, test_features, train_labels, test_labels = model_selection.train_test_split(features, labels, test_size=0.3, random_state=0)
  model = classifier_model
  model.fit(train_features,train_labels)
  # Predict the class
  results = model.predict(test_features)
  # k-fold cross-validation on the training split
  CV = cross_validate(model, train_features, train_labels, cv=kfold, return_estimator=True)
  
  # Print the mean cross-validated accuracy
  print(CV["test_score"].mean())


In [27]:
model_search(r_df_best_predictor, neighbors.KNeighborsClassifier(n_neighbors = 5), 10)
model_search(mm_df_best_predictor, neighbors.KNeighborsClassifier(n_neighbors = 5), 10)
model_search(s_df_best_predictor, neighbors.KNeighborsClassifier(n_neighbors = 5), 10)

0.9644444444444444
0.9644444444444444
0.9644444444444444


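We can see that the three calls return exactly the same accuracy (0.9644). This is because model_search splits the global features and labels and never uses its data argument, so the scaled dataframes are not actually evaluated. The improvement for the KNN, from 0.87 previously to 0.9644 now, therefore comes from the feature selection rather than from scaling; note also that this figure is a 10-fold cross-validated accuracy on the training split, whereas 0.87 was measured on the test set.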
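A scaler comparison is usually done by putting the scaler and the classifier in a single pipeline, so that the scaler is fitted only on the training folds of each cross-validation split. The following is a minimal sketch of that idea on synthetic data; `scaled_cv_score` is a hypothetical helper introduced for this illustration, and the inflated first feature is an assumption meant to mimic columns with very different ranges.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic data; one feature is given a much larger scale than the others.
X, y = make_classification(n_samples=500, n_features=15, n_informative=5,
                           random_state=0)
X[:, 0] *= 1e6


def scaled_cv_score(X, y, scaler, classifier, kfold=10):
    """Mean k-fold accuracy of `classifier` preceded by `scaler`."""
    # The pipeline refits the scaler inside every fold, so no
    # information from the validation fold leaks into the scaling.
    pipe = make_pipeline(scaler, classifier)
    return cross_val_score(pipe, X, y, cv=kfold).mean()


cv_scores = {}
for scaler in (RobustScaler(), MinMaxScaler(), StandardScaler()):
    name = type(scaler).__name__
    cv_scores[name] = scaled_cv_score(
        X, y, scaler, KNeighborsClassifier(n_neighbors=5))
    print(name, round(cv_scores[name], 4))
```

Because the data passed to the helper is the data that is split and scored, each scaler is really evaluated, unlike a function that reads its features from a global variable.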

In [28]:
model_search(r_df_best_predictor, tree.DecisionTreeClassifier(), 10)
model_search(mm_df_best_predictor, tree.DecisionTreeClassifier(), 10)
model_search(s_df_best_predictor, tree.DecisionTreeClassifier(), 10)

0.9837037037037037
0.9851851851851852
0.9851851851851852


As the accuracy with the decision tree was already high, it is not surprising that there is no observable increase in the average accuracy (0.9837 to 0.9852 here, against 0.9827 on the test set before).
The small differences between the three calls come from the unseeded decision tree rather than from the scalers, since model_search evaluates the same unscaled features each time. Decision trees are in any case largely insensitive to feature scaling, because they split on per-feature thresholds. Overall the best model is the decision tree, but we can now explore the hyperparameter k of K nearest neighbors to see whether its accuracy can be improved.

In [29]:
for x in [3,5,7,9,11,13,15]:
  print("With k =", x, ", we obtain a cross validate accuracy of:")
  model_search(s_df_best_predictor, neighbors.KNeighborsClassifier(n_neighbors = x), 10)

With k = 3 , we obtain a cross validate accuracy of:
0.9651851851851851
With k = 5 , we obtain a cross validate accuracy of:
0.9644444444444444
With k = 7 , we obtain a cross validate accuracy of:
0.9540740740740741
With k = 9 , we obtain a cross validate accuracy of:
0.951851851851852
With k = 11 , we obtain a cross validate accuracy of:
0.9444444444444444
With k = 13 , we obtain a cross validate accuracy of:
0.9474074074074073
With k = 15 , we obtain a cross validate accuracy of:
0.9488888888888889


We can conclude that the decision tree is a better model than KNN for the SAT problem, as none of the values tested for k allowed the KNN algorithm to outperform the decision tree. The best k was k=3 with an average accuracy of 0.9652, vs. about 0.985 for the decision tree.
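The manual loop over k can also be written with `GridSearchCV`, which runs the cross-validation for every candidate and keeps the best one. This is a sketch on synthetic data, assuming the same grid of k values and 10 folds as above; the step names `scaler` and `knn` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the reduced SAT feature set.
X, y = make_classification(n_samples=500, n_features=15, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Same candidate values of k as in the loop above.
k_values = [3, 5, 7, 9, 11, 13, 15]
search = GridSearchCV(pipe, {"knn__n_neighbors": k_values},
                      cv=10, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
print(best_k, search.best_score_)
# The held-out test split is used only once, for the selected model.
print(search.score(X_test, y_test))
```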

## New classifier (10 Marks)

Replicate the previous task for a classifier that we did not cover in class. So different than K-NN and decision trees. Briefly describe your choice.
Try to create the best model for the given dataset.
Save your best model into your github. And create a single code cell that loads it and evaluate it on the following test dataset:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/dm_assignment2/sat_dataset_test.csv

This link currently contains a sample of the training set. The real test set will be released after the submission. I should be able to run the code cell independently, load all the libraries you need as well.

Below we try different classifiers: Naive Bayes, Logistic Regression and Support Vector Classifier.

In [30]:
model_search(s_df_best_predictor, GaussianNB(), 10)

0.9407407407407407


In [31]:
model_search(s_df_best_predictor, LogisticRegression(), 10)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

0.9674074074074074



In [32]:
# for ker in ['poly', 'linear', 'sigmoid', 'rbf']:
#   print("With kernel =", ker, ", we obtain a cross validate accuracy of:")
#   model_search(mm_df_best_predictor, SVC(kernel=ker, C = 1.0), 10)

In [33]:
for ker in ['poly', 'sigmoid', 'rbf']:
  print("With kernel =", ker, ", we obtain a cross validate accuracy of:")
  model_search(mm_df_best_predictor, SVC(kernel=ker, C = 1.0), 10)

With kernel = poly , we obtain a cross validate accuracy of:
0.6755555555555556
With kernel = sigmoid , we obtain a cross validate accuracy of:
0.5237037037037038
With kernel = rbf , we obtain a cross validate accuracy of:
0.8866666666666667


The best average accuracy was given by the SVC with a linear kernel, which is not included in the loop shown above; of the kernels shown, rbf performs best (0.8867), and in earlier runs I also had good results with rbf.
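As a sketch of how all four kernels, including the linear one, could be compared under the same conditions, the example below cross-validates each kernel inside a pipeline with a `StandardScaler`. It runs on synthetic data, so the ranking it prints says nothing about the SAT dataset; the choice of scaler is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the reduced SAT feature set.
X, y = make_classification(n_samples=400, n_features=15, n_informative=5,
                           random_state=0)

kernel_scores = {}
for ker in ["linear", "poly", "rbf", "sigmoid"]:
    # Scaling is part of the pipeline, so it is refitted in every fold.
    pipe = make_pipeline(StandardScaler(), SVC(kernel=ker, C=1.0))
    kernel_scores[ker] = cross_val_score(pipe, X, y, cv=10).mean()
    print("kernel =", ker, "mean CV accuracy:", round(kernel_scores[ker], 4))

# Kernel with the highest mean cross-validated accuracy.
best_kernel = max(kernel_scores, key=kernel_scores.get)
print(best_kernel)
```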

## Github commit

In [34]:
# from sklearn.pipeline import Pipeline
# from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.model_selection import cross_val_score


# classifier = Pipeline([
#     ("scaler", RobustScaler()),
#     ("predictor", SVC(kernel=ker, C = 1.0))])

# classifier.fit(train_features,train_labels)

In [35]:
# pickle.dump(classifier,  open( "model.pkl", "wb" ))

In [36]:
# !git add .

In [37]:
# !git commit -m "pickle"

In [38]:
# import os
# from getpass import getpass
# import urllib

# user = input('Username: ')
# password = getpass('Password: ')
# password = urllib.parse.quote(password)
# repo_name = input('Repo name: ')

In [39]:
# # cmd_string = 'git remote add origin https://{0}:{1}@github.com/{0}/{2}.git'.format(user, password, repo_name)
# cmd_string = 'git remote set-url origin https://{0}:{1}@github.com/{0}/{2}.git'.format(user, password, repo_name)

# os.system(cmd_string)

In [40]:
# !git push origin main

In [41]:
# !git clone https://github.com/AmelieAraji/CS6405

In [42]:
# !git init

In [43]:
# !git config --global user.email "ameliearaji@gmail.com"
# !git config --global user.name "AmelieAraji"

In [44]:
# import pickle

In [45]:
# pickle.dump(classifier,  open( "model.pkl", "wb" ))

In [46]:
# clss = pickle.load(open("model.pkl", "rb"))

In [47]:
# clss.predict(test_features)

In [48]:
# from io import BytesIO
# import requests
# mLink = 'https://github.com/AmelieAraji/CS6405?raw=true'
# mfile = BytesIO(requests.get(mLink).content)
# model = load(mfile)

# <font color="blue">FOR GRADING ONLY</font>

Save your best model into your github. And create a single code cell that loads it and evaluate it on the following test dataset: 
https://github.com/andvise/DataAnalyticsDatasets/blob/main/dm_assignment2/sat_dataset_test.csv

In [49]:
# from joblib import dump, load
# from io import BytesIO
# import requests

# # INSERT YOUR MODEL'S URL
# mLink = 'https://github.com/AmelieAraji/CS6405/model2.pkl?raw=true'
# mfile = BytesIO(requests.get(mLink).content)
# model = load(mfile)
# # YOUR CODE HERE

I tried, but I do not know how to push the model to GitHub, so this is all I have for now, and the whole Colab notebook needs to be run. I will go to the lab to figure it all out.
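For reference, the save and load steps themselves can be sketched as follows with `joblib`. The model here is trained on synthetic data and written to a temporary directory (both assumptions for the sake of a runnable example); in practice the saved file would be committed to the repository and its raw URL downloaded before loading.

```python
import os
import tempfile
from io import BytesIO

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Synthetic stand-in for the reduced SAT feature set.
X, y = make_classification(n_samples=300, n_features=15, random_state=0)

# Saving the scaler and the classifier together means the loaded
# object can be applied directly to raw features.
model = make_pipeline(RobustScaler(), SVC(kernel="linear", C=1.0)).fit(X, y)

# Write the fitted pipeline to a local file (hypothetical location).
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(model, model_path)

# joblib.load also accepts a file-like object, which is what a download
# would provide, e.g. BytesIO(requests.get(raw_url).content) for a
# hypothetical raw file URL.
with open(model_path, "rb") as fh:
    restored = joblib.load(BytesIO(fh.read()))

# The restored pipeline gives the same predictions as the original.
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```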

In [50]:
#TRAIN
features = df_pp_all.drop(columns="target")
features = features[best_predictors]
labels = df_pp_all["target"]
train_features, test_features, train_labels, test_labels = model_selection.train_test_split(features, labels, test_size=0.3, random_state=0)
svc = SVC(kernel='linear', C = 1.0)
svc.fit(train_features,train_labels)

SVC(kernel='linear')

In [51]:
df_test = pd.read_csv("https://github.com/andvise/DataAnalyticsDatasets/blob/f4c1e07915ddfe98f0f5434ec3f0e7f3900f35ab/dm_assignment2/sat_dataset_test.csv?raw=true")
df_pp_test = pp_dataset(df_test, na_strategy="all")
features = df_pp_test.drop(columns="target")
test_features = features[best_predictors]
test_labels = df_pp_test["target"]

In [52]:
# The SVC was trained on unscaled features, so we predict on the unscaled test features.
# Predict the class
results_svc = svc.predict(test_features)
# Compute the accuracy
print(metrics.accuracy_score(test_labels, results_svc))

0.968944099378882
