This notebook runs logistic regression on elliptic curves whose Tate-Shafarevich group (Sha) has order 4 or 9, removing one BSD feature at a time. The experiment is run on both the original and the log-transformed data.

In [11]:
from lib import utils, models, executor
import torch.nn as nn
import torch.optim as optim
from pathlib import Path
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

# fix the seed for reproducibility
seed = 42

# 1. Create balanced dataset of elliptic curves with size of the Tate-Shafarevich group equal to 4 and 9 containing all BSD features

In [2]:
# Load your data here. Building the path with pathlib ensures this works on Windows as well as Unix.
path = Path("..") / "data_files" / "sha" / "ecq_sha_B_100_conds_1_500000_reg.parquet"
df = utils.load_data(path)

Loaded the dataset with 120 features and 3064705 curves..


In [4]:
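# Downsample the curves with sha == 4 to match the number of curves with sha == 9
len_9 = df[df['sha'] == 9].shape[0]
df_balanced = df[df['sha'] == 4].sample(len_9, random_state=seed)
df_balanced = pd.concat([df_balanced, df[df['sha'] == 9]])
# Confirm that both classes now have the same size
df_balanced.sha.value_counts()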

sha
4    50428
9    50428
Name: count, dtype: int64
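The balancing step above hard-codes which class is the larger one. A minimal sketch of a more general version follows, assuming a hypothetical helper `balance_by_downsampling` (not part of `lib.utils`) that downsamples every class to the size of the smallest one. It is demonstrated on a toy frame, not on the curve data. On the real dataset one would first filter to the rows with `sha` equal to 4 or 9, since other Sha orders may be present.

```python
import pandas as pd


def balance_by_downsampling(df, label_col, seed=42):
    """Downsample every class in label_col to the size of the smallest class.

    Hypothetical helper for illustration only.
    """
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n_min, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)


# Toy data: seven rows labelled 4 and three rows labelled 9
toy = pd.DataFrame({"x": range(10), "sha": [4] * 7 + [9] * 3})
balanced = balance_by_downsampling(toy, "sha")
print(balanced["sha"].value_counts())
```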

In [5]:
# Select the BSD feature columns, from which we will remove one at a time; 'sha' is the prediction target
bsd_features = ['special_value', 'torsion', 'real_period', 'regulator', 'tamagawa_product', 'sha']

df_balanced_bsd = df_balanced[bsd_features].copy()

In [15]:
df_balanced_bsd.head(5)

         special_value  torsion  real_period  regulator  tamagawa_product  sha
 334625        2.19751        2      0.54938        1.0                 4    4
1086182        3.22805        1      0.80701        1.0                 1    4
1782926        3.98612        2      0.49826        1.0                 8    4
2484030        2.99537        1      0.09361        1.0                 8    4
3053287        2.23394        1      0.09308        1.0                 6    4
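As a sanity check on what these columns mean, the five rows displayed above can be tested against the Birch and Swinnerton-Dyer formula. The sketch below assumes the usual normalization, `special_value = real_period * regulator * tamagawa_product * sha / torsion**2`. The notebook does not state this, so treat it as an assumption. The values are copied by hand from the table, and agreement is only expected up to the rounding of the displayed digits.

```python
import pandas as pd

# The five rows displayed above, copied by hand
rows = pd.DataFrame(
    {
        "special_value": [2.19751, 3.22805, 3.98612, 2.99537, 2.23394],
        "torsion": [2, 1, 2, 1, 1],
        "real_period": [0.54938, 0.80701, 0.49826, 0.09361, 0.09308],
        "regulator": [1.0, 1.0, 1.0, 1.0, 1.0],
        "tamagawa_product": [4, 1, 8, 8, 6],
        "sha": [4, 4, 4, 4, 4],
    },
    index=[334625, 1086182, 1782926, 2484030, 3053287],
)

# Right-hand side of the BSD formula under the assumed normalization
predicted = (
    rows["real_period"]
    * rows["regulator"]
    * rows["tamagawa_product"]
    * rows["sha"]
    / rows["torsion"] ** 2
)

# Relative deviation from the stored special value, row by row
relative_error = (predicted - rows["special_value"]).abs() / rows["special_value"]
print(relative_error)
```

If the deviations are all small, the features are tied together by a single multiplicative identity, which matters when interpreting the ablation results later in the notebook.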


# 2. Delete one feature at a time on original data

The best accuracy is about 65%, reached when tamagawa_product is removed.

In [9]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [18]:
# Initialize an empty DataFrame to store the results
results_df_lr = pd.DataFrame({
    'Feature Deleted': pd.Series(dtype='str'),
    'Accuracy': pd.Series(dtype='float')})


# 'sha' is the last entry of bsd_features and is the target, so it is never removed
for feature in bsd_features[:-1]:
    print(f'Running model without {feature}..')
    # Drop the feature under test, then split the rest into predictors and target
    df_sub = df_balanced_bsd.drop(columns=[feature])
    X = df_sub.drop(columns=['sha'])
    y = df_sub['sha']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # Append the results to the DataFrame
    results_df_lr = pd.concat([results_df_lr, pd.DataFrame([{'Feature Deleted': feature, 'Accuracy': accuracy}])], ignore_index=True)
    
print(results_df_lr)

Running model without special_value..


Running model without torsion..


Running model without real_period..


Running model without regulator..


Running model without tamagawa_product..
    Feature Deleted  Accuracy
0     special_value  0.613722
1           torsion  0.646589
2       real_period  0.609310
3         regulator  0.613425
4  tamagawa_product  0.650109
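The leave-one-feature-out loop in this section can also be factored into a reusable function, so that the same code serves the raw and the transformed experiments. This is a minimal sketch, not the notebook's own code. The function name `leave_one_feature_out` and its `transform` argument are hypothetical, and the demonstration uses synthetic toy data rather than the elliptic-curve dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def leave_one_feature_out(df, features, target, transform=None, seed=42):
    """Fit one logistic regression per removed feature and collect accuracies.

    Hypothetical helper: `transform` is an optional elementwise function
    (for example np.log) applied to the remaining features.
    """
    rows = []
    for removed in features:
        kept = [f for f in features if f != removed]
        X = df[kept] if transform is None else df[kept].apply(transform)
        y = df[target]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed
        )
        model = LogisticRegression()
        model.fit(X_train, y_train)
        accuracy = accuracy_score(y_test, model.predict(X_test))
        rows.append({"Feature Deleted": removed, "Accuracy": accuracy})
    return pd.DataFrame(rows)


# Synthetic toy data (NOT the curve data): the label depends on a * b only,
# so column c carries no information about it.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "a": rng.uniform(1, 10, 400),
        "b": rng.uniform(1, 10, 400),
        "c": rng.uniform(1, 10, 400),
    }
)
toy["label"] = np.where(toy["a"] * toy["b"] > 25, 9, 4)

results = leave_one_feature_out(toy, ["a", "b", "c"], "label", transform=np.log)
print(results)
```

Collecting the rows in a list and building the DataFrame once at the end avoids repeated `pd.concat` calls inside the loop.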


# 3. Delete one feature at a time on log-transformed data

With log-transformed features, the best accuracy rises to about 95%, reached when regulator is removed.
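One plausible reading of why the log transform helps follows. It assumes the columns obey the usual Birch and Swinnerton-Dyer normalization, which the notebook does not state explicitly. Under that assumption the features and the target are linked by a product formula, and taking logarithms turns it into a linear relation:

```latex
\texttt{special\_value}
  = \frac{\texttt{real\_period} \cdot \texttt{regulator} \cdot \texttt{tamagawa\_product} \cdot \texttt{sha}}
         {\texttt{torsion}^{2}}

\log(\texttt{sha})
  = \log(\texttt{special\_value}) + 2\log(\texttt{torsion})
    - \log(\texttt{real\_period}) - \log(\texttt{regulator}) - \log(\texttt{tamagawa\_product})
```

After the transform, the target is a linear function of the features, which is the kind of decision boundary logistic regression can represent. Removing regulator may cost little if its value is 1.0 for most curves in the sample, as it is in all five rows displayed earlier, since its logarithm then vanishes. This is a conjecture that the notebook does not verify.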

In [19]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [23]:
# Initialize an empty DataFrame to store the results
results_df_lr = pd.DataFrame({
    'Feature Deleted': pd.Series(dtype='str'),
    'Accuracy': pd.Series(dtype='float')})


# 'sha' is the last entry of bsd_features and is the target, so it is never removed
for feature in bsd_features[:-1]:
    print(f'Running model without {feature}..')
    # Drop the feature under test, then log-transform the remaining predictors
    df_sub_log = df_balanced_bsd.drop(columns=[feature])
    X = df_sub_log.drop(columns=['sha']).apply(np.log)
    y = df_sub_log['sha']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # Append the results to the DataFrame
    results_df_lr = pd.concat([results_df_lr, pd.DataFrame([{'Feature Deleted': feature, 'Accuracy': accuracy}])], ignore_index=True)
    
print(results_df_lr)

Running model without special_value..
Running model without torsion..


Running model without real_period..
Running model without regulator..


Running model without tamagawa_product..
    Feature Deleted  Accuracy
0     special_value  0.697849
1           torsion  0.716984
2       real_period  0.610946
3         regulator  0.948493
4  tamagawa_product  0.642871
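The log transform used in this section relies on every feature value being strictly positive. `np.log` yields `-inf` for zero and `nan` for negative input, and scikit-learn estimators reject non-finite values. A minimal sketch of a guard follows, assuming a hypothetical helper `checked_log` and toy values rather than the curve data:

```python
import numpy as np
import pandas as pd


def checked_log(X):
    """Natural log of every entry, refusing non-positive input.

    Hypothetical helper for illustration only.
    """
    bad = [col for col in X.columns if (X[col] <= 0).any()]
    if bad:
        raise ValueError(f"non-positive values in columns: {bad}")
    return np.log(X)


# Toy values, all strictly positive
toy = pd.DataFrame({"torsion": [1, 2, 4], "real_period": [0.5, 0.8, 0.09]})
logged = checked_log(toy)
print(logged)
```

Failing early with the offending column names is easier to debug than a non-finite value surfacing later inside `model.fit`.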
