<a href="https://colab.research.google.com/github/SRI-CSL/CoProver/blob/main/src/notebooks/220629_metitarski/coprover_metitarski_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CoProver 

## &#10024; `MetiTarski`- problem

**Description:** Updated notebook to process v3 and v4 versions of MetiTarski data.

**Checklist**

- [ ] compare SVM with linear and polynomial (at least degree 2) kernels
- [ ] set up the protocol so that memorization of the training set cannot inflate results on either test set
- [ ] clean the code for Eric and the reviewer
- [ ] once the above are done, dedicate time and attention to examining the transformer architectures
  - [ ] EY strongly suspects that adjusting the tokenization scheme might yield additional benefits

**Copyright 2022 SRI International.**

## &#9776; Import `needed` libraries

In [1]:
import os
import sys
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from matplotlib import rc
from tqdm import tqdm

In [2]:
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [3]:
try:
    from google.colab import data_table
    data_table.disable_dataframe_formatter()
    
    from google.colab import output
    output.enable_custom_widget_manager()
except Exception:
    print("Launched notebook locally")

In [4]:
from typing import List, Any, Dict

In [5]:
# install the gdown library, used to download the .csv files
try:
    import gdown
except ImportError:
    !pip install gdown
    import gdown

## &#9997; Set `needed` configuration

In [6]:
warnings.filterwarnings('ignore')

In [7]:
# origin of the .csv data files used
# True: read directly from the public CoProver GitHub repository (SRI-CSL/CoProver)
# False: downloaded from Google Drive via the IDs in DATASET_DICT, especially useful if running on Colab
IS_LOCAL_FILE = False

In [8]:
# dictionary of files for this notebook to work
# the dictionary is composed of (filename, Google ID) key-value pairs

# v3: https://drive.google.com/file/d/1uC0WDg7fyZxwpc9UIgJznDAgT5WPqtA9/view?usp=sharing
# v4: https://drive.google.com/file/d/1uIoGOoHPsugXszScyU4HS9RKznX6tFeO/view?usp=sharing
DATASET_DICT = {
    'metitarski_dataset_v3.csv': '1uC0WDg7fyZxwpc9UIgJznDAgT5WPqtA9',
    'metitarski_dataset_v4.csv': '1uIoGOoHPsugXszScyU4HS9RKznX6tFeO'
    }
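
The Google IDs in `DATASET_DICT` are the path segment after `/file/d/` in the sharing links quoted in the comments. A minimal sketch of pulling the ID out of such a link follows; `extract_drive_file_id` is a hypothetical helper, not part of this notebook or of `gdown`, and the URL pattern is assumed from the two links above.

```python
import re
from typing import Optional

# Assumed pattern: sharing links of the form .../file/d/<ID>/view?...
_DRIVE_ID_PATTERN = re.compile(r"/file/d/([A-Za-z0-9_-]+)")


def extract_drive_file_id(share_url: str) -> Optional[str]:
    """Hypothetical helper: return the file ID embedded in a Drive sharing link."""
    match = _DRIVE_ID_PATTERN.search(share_url)
    return match.group(1) if match else None


v3_link = "https://drive.google.com/file/d/1uC0WDg7fyZxwpc9UIgJznDAgT5WPqtA9/view?usp=sharing"
v3_id = extract_drive_file_id(v3_link)
print(v3_id)
```

The extracted value matches the entry stored for `metitarski_dataset_v3.csv`, so a new dataset version could be registered from its sharing link alone.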

## &#9881; Define `needed` functions

In [9]:
def path_exists(input_path: str) -> bool:
    return os.path.exists(input_path)

In [10]:
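def check_file_status(input_path: str):
    # a local copy exists: report where it is
    if input_path and path_exists(input_path=input_path):
        print(f"- File {os.path.basename(input_path)} exists locally at {input_path}!")
    else:
        if IS_LOCAL_FILE:
            # the path is a raw GitHub URL, so there is no local copy to find
            print(f"- IS_LOCAL_FILE is set to {IS_LOCAL_FILE}. The file is accessed via a public GitHub link!")
        else:
            # the Google Drive download should have created a local file
            print("- Something went wrong with the download. Please try again!")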

In [11]:
def download_dataset_from_google_drive(google_file_id: str, output_file_name: str, quiet_download: bool) -> str:
    file_path = f'./{output_file_name}'

    if not os.path.exists(file_path):
        gdown.download(id=google_file_id, output=output_file_name, quiet=quiet_download)
    else:
        print(f"{output_file_name} already exists!")
        
    return file_path

In [12]:
def get_dataset(dataset_name: str, is_local_file: bool) -> str:
    file_path = f'./{dataset_name}'

    if is_local_file:
        file_path = f'https://raw.githubusercontent.com/SRI-CSL/CoProver/main/data/{dataset_name}'
    else:
        if dataset_name in DATASET_DICT:
            file_path = download_dataset_from_google_drive(google_file_id=DATASET_DICT[dataset_name], output_file_name=dataset_name, quiet_download=False)
        else:
            print(f"{dataset_name} is not present in dataset dictionary! Please ensure the file name is correct!")
            return

    return file_path

* [ ] **TODO 1:** Modify the function to use `how='outer'` so that all three `_merge` outcomes are available, i.e., `left_only`, `right_only`, and `both`.
* [ ] **TODO 2:** Create a Venn diagram of the datasets.


In [13]:
def get_dataframe_differences(df_1: pd.DataFrame, df_2: pd.DataFrame, target_columns: List[str]) -> pd.DataFrame:
    """Obtain the records that are in df_1 but NOT in df_2
       Solution inspired by: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe

    Args:
      df_1: input dataframe 1
      df_2: input dataframe 2
      target_columns: list of columns to perform the difference operation upon

    Returns:
      A dataframe containing the records that are only in df_1 but not df_2
    """

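    # compare on the target columns only: de-duplicating just these columns of df_2
    # keeps the merge from multiplying rows of df_1 or pulling in unrelated columns
    df_2_keys = df_2[target_columns].drop_duplicates()
    tmp_df = df_1.merge(df_2_keys, on=target_columns, how='left', indicator=True)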

    result_df = tmp_df[tmp_df['_merge'] == 'left_only'][target_columns].copy()

    return result_df
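
The outer-merge variant mentioned in TODO 1 could look roughly like the sketch below. `split_by_membership` is a hypothetical name, and the toy frames merely stand in for `df_metitarski_1` and `df_metitarski_2`; the three group sizes are the numbers a Venn diagram (TODO 2) would need.

```python
import pandas as pd


def split_by_membership(df_1, df_2, target_columns):
    """Hypothetical helper: group distinct records by the dataframe(s) they occur in."""
    # De-duplicate on the comparison columns so each distinct record is counted once.
    left = df_1[target_columns].drop_duplicates()
    right = df_2[target_columns].drop_duplicates()
    merged = left.merge(right, on=target_columns, how='outer', indicator=True)
    return {
        key: merged.loc[merged['_merge'] == key, target_columns].reset_index(drop=True)
        for key in ('left_only', 'right_only', 'both')
    }


# Toy data for illustration only.
toy_1 = pd.DataFrame({'a': [1, 2, 3], 'label': [0, 1, 0]})
toy_2 = pd.DataFrame({'a': [2, 3, 4, 4], 'label': [1, 0, 1, 1]})

parts = split_by_membership(toy_1, toy_2, ['a', 'label'])
membership_counts = {key: len(part) for key, part in parts.items()}
print(membership_counts)
```

Because both sides are de-duplicated first, the counts refer to distinct records rather than rows, which is a design choice of this sketch and differs from the row counts printed later in the notebook.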

## &#9749; Download datasets

In [14]:
# metitarski_dataset_v3.csv
path_metitarski_original = get_dataset(dataset_name='metitarski_dataset_v3.csv', is_local_file=IS_LOCAL_FILE)

Downloading...
From: https://drive.google.com/uc?id=1uC0WDg7fyZxwpc9UIgJznDAgT5WPqtA9
To: /content/metitarski_dataset_v3.csv
100%|██████████| 981k/981k [00:00<00:00, 65.7MB/s]


In [15]:
check_file_status(input_path=path_metitarski_original)

- File metitarski_dataset_v3.csv exists locally at ./metitarski_dataset_v3.csv!


In [16]:
# metitarski_dataset_v4.csv
path_metitarski = get_dataset(dataset_name='metitarski_dataset_v4.csv', is_local_file=IS_LOCAL_FILE)

Downloading...
From: https://drive.google.com/uc?id=1uIoGOoHPsugXszScyU4HS9RKznX6tFeO
To: /content/metitarski_dataset_v4.csv
100%|██████████| 6.42M/6.42M [00:00<00:00, 30.0MB/s]


In [17]:
check_file_status(input_path=path_metitarski)

- File metitarski_dataset_v4.csv exists locally at ./metitarski_dataset_v4.csv!


## &#128722; Load data

**Naming Convention**

* Everything related to the original MetiTarski dataset (data, variables) carries the `_1` suffix.
* Everything related to the new dataset, which consists of all possible variable permutations, carries the `_2` suffix.

In [19]:
df_metitarski_1 = pd.read_csv(path_metitarski_original, sep='\t')

In [20]:
df_metitarski_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6895 entries, 0 to 6894
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        6895 non-null   int64  
 1   file_id           6895 non-null   int64  
 2   input_file        6895 non-null   object 
 3   label_file        6895 non-null   object 
 4   nr_polynomials    6895 non-null   int64  
 5   max_total_degree  6895 non-null   int64  
 6   max_x1            6895 non-null   int64  
 7   max_x2            6895 non-null   int64  
 8   max_x3            6895 non-null   int64  
 9   prop_x1           6895 non-null   float64
 10  prop_x2           6895 non-null   float64
 11  prop_x3           6895 non-null   float64
 12  prop_mon_x1       6895 non-null   float64
 13  prop_mon_x2       6895 non-null   float64
 14  prop_mon_x3       6895 non-null   float64
 15  label             6895 non-null   int64  
dtypes: float64(6), int64(8), object(2)
memory 

In [21]:
df_metitarski_1.head()

Unnamed: 0.1,Unnamed: 0,file_id,input_file,label_file,nr_polynomials,max_total_degree,max_x1,max_x2,max_x3,prop_x1,prop_x2,prop_x3,prop_mon_x1,prop_mon_x2,prop_mon_x3,label
0,0,3940,poly3940.txt.ml,comp_times3940.txt,4,1,1,1,1,0.5,0.5,0.25,0.4,0.4,0.2,0
1,1,5554,poly5554.txt.ml,comp_times5554.txt,12,10,10,9,1,0.666667,0.666667,0.25,0.380952,0.52381,0.071429,4
2,2,4063,poly4063.txt.ml,comp_times4063.txt,9,1,1,1,1,0.444444,0.444444,0.555556,0.181818,0.181818,0.227273,5
3,3,4732,poly4732.txt.ml,comp_times4732.txt,7,8,4,2,1,0.428571,0.285714,0.428571,0.285714,0.142857,0.214286,2
4,4,5205,poly5205.txt.ml,comp_times5205.txt,6,18,12,6,1,0.5,0.333333,0.5,0.55,0.55,0.15,5


In [22]:
df_metitarski_1.label.unique()

array([0, 4, 5, 2, 1, 3])

In [23]:
df_metitarski_1.shape

(6895, 16)

In [24]:
df_metitarski_2 = pd.read_csv(path_metitarski, sep='\t')

In [25]:
df_metitarski_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41369 entries, 0 to 41368
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        41369 non-null  int64  
 1   file_id           41369 non-null  int64  
 2   input_file        41369 non-null  object 
 3   label_file        41369 non-null  object 
 4   nr_polynomials    41369 non-null  int64  
 5   max_total_degree  41369 non-null  int64  
 6   max_x1            41369 non-null  int64  
 7   max_x2            41369 non-null  int64  
 8   max_x3            41369 non-null  int64  
 9   prop_x1           41369 non-null  float64
 10  prop_x2           41369 non-null  float64
 11  prop_x3           41369 non-null  float64
 12  prop_mon_x1       41369 non-null  float64
 13  prop_mon_x2       41369 non-null  float64
 14  prop_mon_x3       41369 non-null  float64
 15  label             41369 non-null  int64  
dtypes: float64(6), int64(8), object(2)
memor

In [26]:
df_metitarski_2.head()

Unnamed: 0.1,Unnamed: 0,file_id,input_file,label_file,nr_polynomials,max_total_degree,max_x1,max_x2,max_x3,prop_x1,prop_x2,prop_x3,prop_mon_x1,prop_mon_x2,prop_mon_x3,label
0,0,415,poly415-perm0.txt.ml,comp_times415-perm0.txt,10,2,2,2,1,0.5,0.5,0.5,0.25,0.25,0.25,4
1,1,2230,poly2230-perm2.txt.ml,comp_times2230-perm2.txt,6,4,1,2,2,0.333333,0.666667,0.5,0.333333,0.47619,0.428571,0
2,2,6506,poly6506-perm3.txt.ml,comp_times6506-perm3.txt,6,16,16,1,1,0.5,0.333333,0.5,0.5,0.1,0.15,5
3,3,3998,poly3998-perm5.txt.ml,comp_times3998-perm5.txt,9,3,3,3,3,0.555556,0.555556,0.555556,0.35,0.35,0.35,1
4,4,3730,poly3730-perm2.txt.ml,comp_times3730-perm2.txt,14,9,1,9,3,0.214286,0.785714,0.142857,0.166667,0.611111,0.111111,3


In [27]:
df_metitarski_2.label.unique()

array([4, 0, 5, 1, 3, 2])

In [28]:
df_metitarski_2.shape

(41369, 16)

In [29]:
COMPARE_COLUMNS = ['nr_polynomials', 'max_total_degree', 'max_x1', 'max_x2', 'max_x3', 'prop_x1', 'prop_x2', 'prop_x3', 'prop_mon_x1', 'prop_mon_x2', 'prop_mon_x3', 'label']

In [30]:
df_1_not_2 = get_dataframe_differences(df_1=df_metitarski_1, df_2=df_metitarski_2, target_columns=COMPARE_COLUMNS)

In [31]:
print(f"- There are {df_1_not_2.shape[0]} records in the original dataset (df_metitarski_1) that are not in the newly generated one (df_metitarski_2)")

- There are 0 records in the original dataset (df_metitarski_1) that are not in the newly generated one (df_metitarski_2)


In [32]:
df_2_not_1 = get_dataframe_differences(df_1=df_metitarski_2, df_2=df_metitarski_1, target_columns=COMPARE_COLUMNS)

In [33]:
print(f"- There are {df_2_not_1.shape[0]} records in the newly generated dataset (df_metitarski_2) that are not in the original one (df_metitarski_1)")

- There are 32212 records in the newly generated dataset (df_metitarski_2) that are not in the original one (df_metitarski_1)


## &#129504; MetiTarski RTF

In [34]:
FEATURE_COLUMNS = ['nr_polynomials', 'max_total_degree', 'max_x1', 'max_x2', 'max_x3', 'prop_x1', 'prop_x2', 'prop_x3', 'prop_mon_x1', 'prop_mon_x2', 'prop_mon_x3']

FEATURE_COLUMNS

['nr_polynomials',
 'max_total_degree',
 'max_x1',
 'max_x2',
 'max_x3',
 'prop_x1',
 'prop_x2',
 'prop_x3',
 'prop_mon_x1',
 'prop_mon_x2',
 'prop_mon_x3']

In [35]:
def training_set_scaler(input_df: pd.DataFrame):
    scaler = StandardScaler()
    scaler = scaler.fit(input_df)

    return scaler

In [36]:
def scale_data(input_df: pd.DataFrame, scaler):
    df_scaled = pd.DataFrame(scaler.transform(input_df), index=input_df.index, columns=input_df.columns)

    return df_scaled

### D1: Original MetiTarski Data

In [37]:
# original metitarski dataset features
df_features_1 = df_metitarski_1[FEATURE_COLUMNS].copy().reset_index()

In [38]:
df_features_1.drop(['index'], axis=1, inplace=True)

In [39]:
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(df_features_1, df_metitarski_1.label, test_size=0.1)
X_train_1.shape, X_test_1.shape, y_train_1.shape, y_test_1.shape

((6205, 11), (690, 11), (6205,), (690,))

In [40]:
y_train_1.value_counts()

5    2386
4    1111
3    1095
2     584
1     519
0     510
Name: label, dtype: int64
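
The label counts above are imbalanced (label 5 covers well over a third of the training rows), and the split is drawn without a fixed seed. A minimal sketch of a stratified, reproducible split follows. The toy data and `random_state=0` are assumptions made for illustration, and this is not the protocol that produced the outputs in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data standing in for the MetiTarski features and labels.
toy_X = pd.DataFrame({'feature': range(100)})
toy_y = pd.Series([5] * 60 + [0] * 20 + [1] * 20, name='label')

# stratify keeps the class proportions equal in both parts;
# random_state=0 is an arbitrary seed that makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.1, stratify=toy_y, random_state=0
)

test_counts = y_te.value_counts().to_dict()
print(test_counts)
```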

In [41]:
# fit the scaler on the original training set
scaler_1 = training_set_scaler(input_df=X_train_1)

In [42]:
# rescale data
X_train_D1 = scale_data(input_df=X_train_1, scaler=scaler_1)
X_train_D1.head()

Unnamed: 0,nr_polynomials,max_total_degree,max_x1,max_x2,max_x3,prop_x1,prop_x2,prop_x3,prop_mon_x1,prop_mon_x2,prop_mon_x3
4679,-0.233361,-0.900145,-0.75455,-0.61228,-0.448217,-0.867434,-0.600184,0.204174,-1.586091,-0.810929,-0.131236
3325,-0.532783,0.852412,0.98891,-0.61228,-0.448217,-0.405658,-0.095861,-0.463285,0.354414,-0.889155,-0.808965
5483,-0.532783,-0.740821,-0.75455,-0.61228,-0.448217,0.671821,-0.095861,-1.631339,-0.378717,-0.206453,-1.068383
668,3.359701,-0.262851,-0.596054,-0.123512,-0.448217,-1.256299,-1.024877,1.258058,-0.178415,0.898875,1.514678
5815,2.760857,0.693089,0.196428,0.365257,-0.448217,0.545059,1.7731,0.498642,1.457968,2.040077,1.011329


In [43]:
X_test_D1 = scale_data(input_df=X_test_1, scaler=scaler_1)
X_test_D1.head()

Unnamed: 0,nr_polynomials,max_total_degree,max_x1,max_x2,max_x3,prop_x1,prop_x2,prop_x3,prop_mon_x1,prop_mon_x2,prop_mon_x3
123,-0.832204,-0.740821,-0.75455,-0.61228,-0.448217,0.24083,0.610191,-1.397728,-0.586437,0.223397,-0.872378
2246,0.365483,-0.103528,0.037932,-0.123512,-0.448217,-0.046498,-0.488113,-0.463285,0.213286,-0.722272,-0.495069
6050,0.964326,-0.581498,-0.437557,-0.123512,1.760828,-0.1118,-0.416794,0.386209,-0.378717,-0.52884,0.254648
1838,0.066061,0.693089,0.196428,0.365257,-0.448217,0.402451,0.786704,-1.047312,2.435558,2.320647,0.709885
560,-1.131626,-0.740821,-0.596054,-0.61228,-0.448217,1.21056,-1.860992,-1.047312,0.556024,-1.431524,-0.980181


### D2: New (Larger) MetiTarski Data

In [44]:
df_features_2 = df_metitarski_2[FEATURE_COLUMNS].copy().reset_index()

In [45]:
df_features_2.head()

Unnamed: 0,index,nr_polynomials,max_total_degree,max_x1,max_x2,max_x3,prop_x1,prop_x2,prop_x3,prop_mon_x1,prop_mon_x2,prop_mon_x3
0,0,10,2,2,2,1,0.5,0.5,0.5,0.25,0.25,0.25
1,1,6,4,1,2,2,0.333333,0.666667,0.5,0.333333,0.47619,0.428571
2,2,6,16,16,1,1,0.5,0.333333,0.5,0.5,0.1,0.15
3,3,9,3,3,3,3,0.555556,0.555556,0.555556,0.35,0.35,0.35
4,4,14,9,1,9,3,0.214286,0.785714,0.142857,0.166667,0.611111,0.111111


In [46]:
df_features_2.drop(['index'], axis=1, inplace=True)

In [47]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(df_features_2, df_metitarski_2.label, test_size=0.1)
X_train_2.shape, X_test_2.shape, y_train_2.shape, y_test_2.shape

((37232, 11), (4137, 11), (37232,), (4137,))
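
The v4 file names (`poly415-perm0`, `poly2230-perm2`, ...) suggest that several rows derive from the same problem, so a purely random split can place permutations of one problem in both the training and the test part. A sketch of a group-aware split follows, assuming that `file_id` identifies the underlying problem; that assumption and the toy data are not confirmed by the notebook.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: four problems with three permutations each.
toy = pd.DataFrame({
    'file_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'feature': range(12),
    'label': [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})

# All rows sharing a file_id land on the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(
    splitter.split(toy[['feature']], toy['label'], groups=toy['file_id'])
)

train_groups = set(toy.iloc[train_idx]['file_id'])
test_groups = set(toy.iloc[test_idx]['file_id'])
print(train_groups & test_groups)
```

The intersection of the two group sets is empty by construction, which is the property a memorization-free protocol would rely on.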

In [48]:
# fit the scaler on the new training set
scaler_2 = training_set_scaler(input_df=X_train_2)

In [49]:
# rescale data
X_train_D2 = scale_data(input_df=X_train_2, scaler=scaler_2)
X_train_D2.head()

Unnamed: 0,nr_polynomials,max_total_degree,max_x1,max_x2,max_x3,prop_x1,prop_x2,prop_x3,prop_mon_x1,prop_mon_x2,prop_mon_x3
3530,0.664927,-0.105145,0.681362,-0.482697,-0.485458,1.289855,-0.56979,-1.188488,0.76661,-0.737129,-1.03796
15352,-0.23445,-0.584486,-0.017892,-0.01959,-0.020206,-0.39279,-0.392666,-0.392722,0.334122,0.333213,0.328781
7406,-0.834035,2.770898,-0.484061,4.843029,-0.485458,-0.569911,0.670081,-1.807417,-1.451279,2.917023,-1.695047
33287,-0.23445,3.410018,5.809226,-0.482697,-0.485458,0.492812,-1.278288,-0.392722,2.459431,-1.557106,-1.364931
19644,3.063265,-0.264926,-0.484061,-0.251144,-0.252832,0.394412,-0.98308,-0.982178,0.165932,0.164845,0.292065


In [50]:
# rescale data
X_test_D2 = scale_data(input_df=X_test_2, scaler=scaler_2)
X_test_D2.head()

Unnamed: 0,nr_polynomials,max_total_degree,max_x1,max_x2,max_x3,prop_x1,prop_x2,prop_x3,prop_mon_x1,prop_mon_x2,prop_mon_x3
32815,-0.23445,1.492656,-0.250977,2.990602,-0.485458,-0.39279,-0.392666,-0.392722,-0.990374,0.901457,-0.993001
3495,-1.43362,-0.584486,-0.484061,-0.482697,-0.020206,1.083214,1.083372,3.144013,-0.885255,-0.887458,1.734485
23447,-0.834035,-0.584486,-0.017892,-0.251144,-0.252832,1.909777,-1.809661,-1.807417,1.217119,-1.41361,-1.412615
3683,-0.23445,-0.744266,-0.484061,-0.251144,-0.252832,0.492812,-0.392666,-0.392722,-0.254543,-0.676998,-0.678291
41173,-0.834035,1.492656,3.012209,-0.482697,-0.485458,0.669933,-1.809661,-0.569559,2.005509,-1.545148,-1.150356


### Training and Testing on Original MetiTarski Data (D1)

#### SVM

In [51]:
D1_svm = svm.SVC(C=316, kernel='rbf', gamma=0.08, tol=0.0316)

D1_svm.fit(X_train_D1, y_train_1)

SVC(C=316, gamma=0.08, tol=0.0316)

In [52]:
D1_svm_score = D1_svm.score(X_test_D1, y_test_1)
D1_svm_score

0.5811594202898551
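
The checklist item on comparing kernels could be approached along the lines of the sketch below. It runs on synthetic data from `make_classification` rather than on the MetiTarski features, and it uses scikit-learn's default hyperparameters instead of the tuned `C`, `gamma`, and `tol` values above, so the resulting scores say nothing about this dataset.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 11 features, mirroring the width of FEATURE_COLUMNS.
X, y = make_classification(n_samples=300, n_features=11, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

kernel_configs = {
    'linear': {'kernel': 'linear'},
    'poly_deg2': {'kernel': 'poly', 'degree': 2},
    'rbf': {'kernel': 'rbf'},
}

kernel_scores = {}
for name, params in kernel_configs.items():
    model = svm.SVC(**params)
    model.fit(X_train_s, y_train)
    kernel_scores[name] = model.score(X_test_s, y_test)

print(kernel_scores)
```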

In [58]:
# features and labels for the portion of dataset 2 that was not part of dataset 1
df_2_not_1_features = df_2_not_1[FEATURE_COLUMNS].copy().reset_index(drop=True)
df_2_not_1_labels = df_2_not_1.label

In [59]:
df_2_not_1_scaled = scale_data(input_df=df_2_not_1_features, scaler=scaler_1)

In [60]:
D1_svm_score_D2_data = D1_svm.score(df_2_not_1_scaled, df_2_not_1_labels)
D1_svm_score_D2_data

0.22001117595926983
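
A single accuracy figure does not show which labels the model gets wrong. A sketch of a per-class breakdown with `sklearn.metrics` (imported at the top of the notebook) follows. The toy lists are made up and stand in for `df_2_not_1_labels` and the output of `D1_svm.predict(df_2_not_1_scaled)`.

```python
from sklearn import metrics

# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are true labels, columns are predicted labels.
toy_confusion = metrics.confusion_matrix(y_true, y_pred)
toy_accuracy = metrics.accuracy_score(y_true, y_pred)

print(toy_confusion)
print(metrics.classification_report(y_true, y_pred, zero_division=0))
```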

### Training and Testing on New MetiTarski Data (D2)

#### SVM

In [61]:
D2_svm = svm.SVC(C=316, kernel='rbf', gamma=0.08, tol=0.0316)

D2_svm.fit(X_train_D2, y_train_2)

SVC(C=316, gamma=0.08, tol=0.0316)

In [62]:
D2_svm_score = D2_svm.score(X_test_D2, y_test_2)
D2_svm_score

0.5663524292965917

In [63]:
scaler_3 = training_set_scaler(input_df=df_2_not_1_features)

In [64]:
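# NOTE: scaled with scaler_1 (fit on the D1 training split), whereas the D1 records
# scored against D3_svm below are scaled with scaler_3; scaler_3 may have been intended here
X_train_D3 = scale_data(input_df=df_2_not_1_features, scaler=scaler_1)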
X_train_D3.head()

Unnamed: 0,nr_polynomials,max_total_degree,max_x1,max_x2,max_x3,prop_x1,prop_x2,prop_x3,prop_mon_x1,prop_mon_x2,prop_mon_x3
0,-0.532783,-0.422175,-0.75455,-0.123512,1.760828,-1.483136,1.080892,0.704769,-0.586437,1.236614,2.207694
1,-0.532783,1.489705,1.622895,-0.61228,-0.448217,-0.405658,-1.272615,0.704769,0.556024,-1.431524,-0.495069
2,1.862592,0.374442,-0.75455,3.29787,3.969873,-2.252764,1.921431,-1.798204,-1.728899,2.193541,-0.872378
3,0.066061,-0.740821,-0.596054,-0.61228,1.760828,0.402451,-0.978427,0.704769,-0.205617,-0.95869,0.74466
4,0.964326,-0.103528,-0.596054,-0.61228,3.969873,0.475916,-1.05866,1.66045,0.30524,0.28106,3.01903


In [65]:
D3_svm = svm.SVC(C=316, kernel='rbf', gamma=0.08, tol=0.0316)

D3_svm.fit(X_train_D3, df_2_not_1_labels)

SVC(C=316, gamma=0.08, tol=0.0316)

In [66]:
D1_scaled_on_D3 = scale_data(input_df=df_features_1, scaler=scaler_3)

In [67]:
D1_on_D3_svm_score = D3_svm.score(D1_scaled_on_D3, df_metitarski_1.label)
D1_on_D3_svm_score

0.21058738216098621
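
This notebook fits three scalers (`scaler_1`, `scaler_2`, `scaler_3`) and applies them by hand, which makes it easy to pair a model with the wrong one. One possible alternative, sketched here on synthetic data, is to bundle the scaler and the classifier in a scikit-learn pipeline so that each model always applies its own training statistics. The SVC hyperparameters are copied from the cells above; everything else is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 11 MetiTarski features.
X, y = make_classification(n_samples=200, n_features=11, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# fit() learns the scaling statistics from X_train only;
# score() reuses them on X_test before calling the classifier.
model = make_pipeline(
    StandardScaler(),
    SVC(C=316, kernel='rbf', gamma=0.08, tol=0.0316),
)
model.fit(X_train, y_train)
pipeline_score = model.score(X_test, y_test)
print(pipeline_score)
```

With this layout, scoring any other dataframe against a model only requires passing the unscaled features to `model.score`.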

## &#128218; References

1. SVC, see [HERE](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
2. K-NN, see [HERE](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).
3. Decision Tree, see [HERE](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier).
4. MLP, see [HERE](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier).
5. RF, see [HERE](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier).