# SWMAL Exercise

## Qa Explain GridSearchCV


### Changes
There are a few additions made by the group.

1. Ignoring warnings, since running with low max iterations would result in ConvergenceWarnings.
2. SummarizeParamPerformance, which is used during random search to prune the hyperparameter candidates. It is explained where it is first used.
3. When running on the cluster the results are written to a file using the WriteReportToFile function.

### Code review

#### Cell 1
The LoadAndSetupData function simply loads whichever dataset is desired and splits it into training and test sets.
Then there are several functions that print information from the grid_tuned object: the best parameters and the scores obtained with them.

#### Cell 2
First up the call to LoadAndSetupData is made to get the Iris dataset.
Then the base model is chosen to be a Support Vector Classifier with the hyperparameter "gamma" set to 0.001.

Then the hyperparameters to search are defined. Two kernels, linear and rbf, are selected, and the C parameter is searched over the values 0.1, 1 and 10. This gives 2\*3 = 6 combinations to train and score.

The grid search uses 5-fold cross-validation, so 2\*3\*5 = 30 models are trained and scored during the search. With the default refit=True, the best combination is then fitted once more on the whole training set.
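
The count can be checked with scikit-learn's `ParameterGrid`, which, as far as we understand, is also what `GridSearchCV` uses to enumerate its candidates. This is a minimal sketch; the variable names `candidates`, `n_candidates` and `n_fits` are ours.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in cell 2
tuning_parameters = {
    'kernel': ('linear', 'rbf'),
    'C': [0.1, 1, 10],
}
cv = 5

candidates = list(ParameterGrid(tuning_parameters))
n_candidates = len(candidates)   # 2 kernels * 3 values of C
n_fits = n_candidates * cv       # one fit per candidate per fold

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
for i, params in enumerate(candidates):
    print(i, params)
```

The enumeration order should also explain the "best index" printed by the report: index 2 in this listing is C=1 with the linear kernel.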

The scoring is set to f1_micro, which pools the true positives, false positives and false negatives of all classes and computes a single F1 score from the totals. Every sample counts equally, so no class gets extra weight, although larger classes contribute more samples to the total. (https://datascience.stackexchange.com/questions/40900/whats-the-difference-between-sklearn-f1-score-micro-and-weighted-for-a-mult)
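
To make the pooling concrete, here is a small sketch in plain Python. The helper `micro_f1` and the labels are made up for illustration and are not part of scikit-learn. For single-label multi-class data like Iris, every wrong prediction is one false positive and one false negative, so the pooled F1 should equal the plain accuracy.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute one F1."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif t == c and p != c:
                fn += 1
    # Assumes at least one correct prediction, otherwise this divides by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels for illustration only
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = micro_f1(y_true, y_pred)
print(f"micro F1 = {score:.4f}, accuracy = {accuracy:.4f}")
```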

n_jobs=-1 just means "run on as many CPUs/cores as possible", which parallelizes and speeds up the grid search.

The grid search object implements the fit/predict interface. It is first fitted on the training data and then passed to FullReport, which tells us about the best model.

Apparently the best combination in our case is SVC(C=1, gamma=0.001, kernel='linear') yielding a score of 0.97143.

In [None]:
# TODO: Qa, code review..cell 1) function setup
import sys
import warnings
import os
if not sys.warnoptions:
    warnings.simplefilter("ignore")
    os.environ["PYTHONWARNINGS"] = "ignore" # Also affect subprocesses

from time import time
import numpy as np

from sklearn import svm
from sklearn.linear_model import SGDClassifier

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import classification_report, f1_score
from sklearn import datasets

from libitmal import dataloaders as itmaldataloaders # Needed for load of iris, moon and mnist

# The following function was made by ChatGPT to display the average values of different parameters.
# It should be useful to quickly sort out any really bad parameters.
def SummarizeParamPerformance(search, param='loss', sort=True):
    results = search.cv_results_
    params = np.array(results[f'param_{param}'], dtype=object)
    scores = np.array(results['mean_test_score'])

    unique_params = np.unique(params)
    summary = []
    for val in unique_params:
        mask = params == val
        mean_score = np.mean(scores[mask])
        std_score = np.std(scores[mask])
        summary.append((val, mean_score, std_score))

    if sort:
        summary.sort(key=lambda x: x[1], reverse=True)

    print(f"\nAverage CV score per '{param}':")
    for val, mean_score, std_score in summary:
        print(f"  {param}={val!s:15}  mean_f1_micro={mean_score:.4f}  (+/- {std_score:.4f})")

def WriteReportToFile(grid_tuned, X_test, y_test, t, params_to_summarize=None, 
                      full_results_file='search_results.txt', 
                      status_file='search_status.txt'):
    """
    Captures FullReport and SummarizeParamPerformance output and writes to files.
    
    Args:
        grid_tuned: The fitted GridSearchCV or RandomizedSearchCV object
        X_test: Test features
        y_test: Test labels
        t: Search time in seconds
        params_to_summarize: List of parameter names to summarize (e.g., ['loss', 'penalty'])
        full_results_file: Filename for complete results
        status_file: Filename for status summary
    """
    import sys
    from io import StringIO
    
    # Capture all output to a string buffer
    output_buffer = StringIO()
    old_stdout = sys.stdout
    sys.stdout = output_buffer
    try:
        b0, m0 = FullReport(grid_tuned, X_test, y_test, t)
        
        # Summarize parameters if provided
        if params_to_summarize:
            for param in params_to_summarize:
                SummarizeParamPerformance(grid_tuned, param)
        
        print('OK(random-grid-search)')
        print(b0)
    finally:
        # Always restore stdout, even if the report raises an exception
        sys.stdout = old_stdout
    
    # Write full report to file
    with open(full_results_file, 'w') as f:
        f.write(output_buffer.getvalue())
    
    # Write just the OK and best model to a separate file
    with open(status_file, 'w') as f:
        f.write('OK(random-grid-search)\n')
        f.write(f'{b0}\n')
    
    print(f"Results written to {full_results_file} and {status_file}")
    
    return b0, m0

currmode="N/A" # GLOBAL var!

def SearchReport(model): 
    
    def GetBestModelCTOR(model, best_params):
        def GetParams(best_params):
            ret_str=""          
            for key in sorted(best_params):
                value = best_params[key]
                temp_str = "'" if str(type(value))=="<class 'str'>" else ""
                if len(ret_str)>0:
                    ret_str += ','
                ret_str += f'{key}={temp_str}{value}{temp_str}'  
            return ret_str          
        try:
            param_str = GetParams(best_params)
            return type(model).__name__ + '(' + param_str + ')' 
        except:
            return "N/A(1)"
        
    print("\nBest model set found on train set:")
    print()
    print(f"\tbest parameters={model.best_params_}")
    print(f"\tbest '{model.scoring}' score={model.best_score_}")
    print(f"\tbest index={model.best_index_}")
    print()
    print(f"Best estimator CTOR:")
    print(f"\t{model.best_estimator_}")
    print()
    """
    try:
        print(f"Grid scores ('{model.scoring}') on development set:")
        means = model.cv_results_['mean_test_score']
        stds  = model.cv_results_['std_test_score']
        i=0
        for mean, std, params in zip(means, stds, model.cv_results_['params']):
            print("\t[%2d]: %0.3f (+/-%0.03f) for %r" % (i, mean, std * 2, params))
            i += 1
    except:
        print("WARNING: the random search do not provide means/stds")
    """
    
    global currmode                
    assert "f1_micro"==str(model.scoring), f"come on, we need to fix the scoring to be able to compare model-fits! Your scoring={str(model.scoring)}...remember to add scoring='f1_micro' to the search"   
    return f"best: dat={currmode}, score={model.best_score_:0.5f}, model={GetBestModelCTOR(model.estimator,model.best_params_)}", model.best_estimator_ 

def ClassificationReport(model, X_test, y_test, target_names=None):
    assert X_test.shape[0]==y_test.shape[0]
    print("\nDetailed classification report:")
    print("\tThe model is trained on the full development set.")
    print("\tThe scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, model.predict(X_test)                 
    print(classification_report(y_true, y_pred, target_names=target_names))
    print()
    
def FullReport(model, X_test, y_test, t):
    print(f"SEARCH TIME: {t:0.2f} sec")
    beststr, bestmodel = SearchReport(model)
    ClassificationReport(model, X_test, y_test)    
    print(f"CTOR for best model: {bestmodel}\n")
    print(f"{beststr}\n")
    return beststr, bestmodel
    
def LoadAndSetupData(mode, test_size=0.3):
    assert test_size>=0.0 and test_size<=1.0
    
    def ShapeToString(Z):
        n = Z.ndim
        s = "("
        for i in range(n):
            s += f"{Z.shape[i]:5d}"
            if i+1!=n:
                s += ";"
        return s+")"

    global currmode
    currmode=mode
    print(f"DATA: {currmode}..")
    
    if mode=='moon':
        X, y = itmaldataloaders.MOON_GetDataSet(n_samples=5000, noise=0.2)
        itmaldataloaders.MOON_Plot(X, y)
    elif mode=='mnist':
        X, y = itmaldataloaders.MNIST_GetDataSet(load_mode=0)
        if X.ndim==3:
            X=np.reshape(X, (X.shape[0], -1))
    elif mode=='iris':
        X, y = itmaldataloaders.IRIS_GetDataSet()
    else:
        raise ValueError(f"could not load data for that particular mode='{mode}', only 'moon'/'mnist'/'iris' supported")
        
    print(f'  org. data:  X.shape      ={ShapeToString(X)}, y.shape      ={ShapeToString(y)}')

    assert X.ndim==2
    assert X.shape[0]==y.shape[0]
    assert y.ndim==1 or (y.ndim==2 and y.shape[1]==0)    
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0, shuffle=True
    )
    
    print(f'  train data: X_train.shape={ShapeToString(X_train)}, y_train.shape={ShapeToString(y_train)}')
    print(f'  test data:  X_test.shape ={ShapeToString(X_test)}, y_test.shape ={ShapeToString(y_test)}')
    print()
    
    return X_train, X_test, y_train, y_test

def TryKerasImport(verbose=True):
    
    kerasok = True
    try:
        import keras as keras_try
    except:
        kerasok = False

    tensorflowkerasok = True
    try:
        import tensorflow.keras as tensorflowkeras_try
    except:
        tensorflowkerasok = False
        
    ok = kerasok or tensorflowkerasok
    
    if not ok and verbose:
        if not kerasok:
            print("WARNING: importing 'keras' failed", file=sys.stderr)
        if not tensorflowkerasok:
            print("WARNING: importing 'tensorflow.keras' failed", file=sys.stderr)

    return ok
    
print(f"OK(function setup" + ("" if TryKerasImport() else ", hope MNIST loads works because it seems you miss the installation of Keras or Tensorflow!") + ")")

OK(function setup)


In [2]:
# TODO: Qa, code review..cell 2) the actual grid-search

# Setup data
X_train, X_test, y_train, y_test = LoadAndSetupData(
    'iris')  # 'iris', 'moon', or 'mnist'

# Setup search parameters
model = svm.SVC(
    gamma=0.001
)  # NOTE: gamma="scale" does not work in older Scikit-learn frameworks,
# FIX:  replace with model = svm.SVC(gamma=0.001)

tuning_parameters = {
    'kernel': ('linear', 'rbf'), 
    'C': [0.1, 1, 10]
}

CV = 5
VERBOSE = 0

# Run GridSearchCV for the model
grid_tuned = GridSearchCV(model,
                          tuning_parameters,
                          cv=CV,
                          scoring='f1_micro',
                          verbose=VERBOSE,
                          n_jobs=-1)

start = time()
grid_tuned.fit(X_train, y_train)
t = time() - start

# Report result
b0, m0 = FullReport(grid_tuned, X_test, y_test, t)
print('OK(grid-search)')

DATA: iris..
  org. data:  X.shape      =(  150;    4), y.shape      =(  150)
  train data: X_train.shape=(  105;    4), y_train.shape=(  105)
  test data:  X_test.shape =(   45;    4), y_test.shape =(   45)

SEARCH TIME: 6.30 sec

Best model set found on train set:

	best parameters={'C': 1, 'kernel': 'linear'}
	best 'f1_micro' score=0.9714285714285715
	best index=2

Best estimator CTOR:
	SVC(C=1, gamma=0.001, kernel='linear')


Detailed classification report:
	The model is trained on the full development set.
	The scores are computed on the full evaluation set.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      0.94      0.97        18
           2       0.92      1.00      0.96        11

    accuracy                           0.98        45
   macro avg       0.97      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45


CTOR for best model: SVC(C=1, gamma=0.001, kern

### Qb Hyperparameter Grid Search using an SGD classifier
We set up a lot of hyperparameters and combinations to search through. There is no real structure other than the alpha and eta0 values increasing by a factor of 10.


In [None]:
# TODO: grid search
# Setup data. Done above already
#X_train, X_test, y_train, y_test = LoadAndSetupData('iris')  # 'iris', 'moon', or 'mnist'

model = SGDClassifier()

tuning_parameters = {
    'loss': ['hinge', 'log_loss', 'modified_huber', 'squared_hinge'],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1],
    'learning_rate': ['constant', 'optimal', 'adaptive', 'invscaling'],
    'eta0': [0.00001, 0.0001, 0.001, 0.01, 0.1],
    'max_iter': [1000, 2500, 5000, 10000, 20000],
    'shuffle': [True, False],
    'average': [False, True],
    'early_stopping': [False, True],
}

CV = 5
VERBOSE = 0

# Run GridSearchCV for the model
grid_tuned = GridSearchCV(model,
                          tuning_parameters,
                          cv=CV,
                          scoring='f1_micro',
                          verbose=VERBOSE,
                          n_jobs=-1)

start = time()
grid_tuned.fit(X_train, y_train)
t = time() - start

b0, m0 = WriteReportToFile(
    grid_tuned, X_test, y_test, t,
    params_to_summarize=None,
    full_results_file='mnist_search_results_iris_cv.txt',
    status_file='mnist_search_status_iris_cv.txt'
)

#### Output
best: dat=iris, score=1.00000, model=SGDClassifier(alpha=0.01,average=False,early_stopping=False,eta0=1e-05,learning_rate='optimal',loss='modified_huber',max_iter=1000,penalty='l1',shuffle=False)

SEARCH TIME: 122.54 sec

Best model set found on train set:

	best parameters={'alpha': 0.01, 'average': False, 'early_stopping': False, 'eta0': 1e-05, 'learning_rate': 'optimal', 'loss': 'modified_huber', 'max_iter': 1000, 'penalty': 'l1', 'shuffle': False}
	best 'f1_micro' score=1.0
	best index=28981

Best estimator CTOR:
	SGDClassifier(alpha=0.01, eta0=1e-05, loss='modified_huber', penalty='l1',
              shuffle=False)

#### Comments on the run:
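We can see that the search took about 2 minutes, which isn't a long time. The best index is 28981, so at least that many candidates were evaluated; the grid as defined contains 4\*3\*5\*4\*5\*5\*2\*2\*2 = 48000 combinations, each fitted 5 times.
If the dataset was bigger, training time would explode.

The best score is 1, which would normally be concerning, but the dataset is so small that it is alright in this case.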
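
The size of the grid can be computed directly from the parameter dictionary. A minimal sketch using only the standard library (the names `n_candidates` and `n_fits` are ours):

```python
import math

# Same grid as in the Qb cell
tuning_parameters = {
    'loss': ['hinge', 'log_loss', 'modified_huber', 'squared_hinge'],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1],
    'learning_rate': ['constant', 'optimal', 'adaptive', 'invscaling'],
    'eta0': [0.00001, 0.0001, 0.001, 0.01, 0.1],
    'max_iter': [1000, 2500, 5000, 10000, 20000],
    'shuffle': [True, False],
    'average': [False, True],
    'early_stopping': [False, True],
}
cv = 5

# Every value of every parameter is combined with every other
n_candidates = math.prod(len(values) for values in tuning_parameters.values())
n_fits = n_candidates * cv

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```

Adding one more parameter with k values multiplies both numbers by k, which is why an exhaustive grid gets expensive so quickly.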

### Qc Hyperparameter Random Search using an SGD classifier
The n_iter parameter is the number of random combinations to test. With 4 loss functions, 3 penalties and 4 learning rates, a full grid would mean at least 4\*3\*4 = 48 candidates. If we set n_iter to 20, exactly 20 candidates are sampled, no matter how large the search space is.
We can then quickly test a lot of random combinations and possibly rule out parameters that are not a good fit for the use case.
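
The difference between the full grid and a fixed sampling budget can be sketched with `ParameterGrid` and `ParameterSampler` from scikit-learn. We assume here that `RandomizedSearchCV` draws its candidates in the same way as `ParameterSampler`; the variable names are ours.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_lists = {
    'loss': ['hinge', 'log_loss', 'modified_huber', 'squared_hinge'],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'learning_rate': ['constant', 'optimal', 'adaptive', 'invscaling'],
}

# Exhaustive grid: every combination
n_grid = len(ParameterGrid(param_lists))

# Random search: a fixed budget of sampled combinations
sampled = list(ParameterSampler(param_lists, n_iter=20, random_state=42))

print(f"grid: {n_grid} candidates, random search: {len(sampled)} candidates")
print(sampled[0])
```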

Another advantage of the random search is that a distribution over an interval can be specified rather than fixed values. There is the potential to get a lucky hit on a random number that increases the score dramatically (although it should be unlikely).
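
For parameters such as alpha and eta0 that span several orders of magnitude, a log-uniform distribution spreads the draws evenly over the decades. The sketch below imitates that idea with the standard library only; `sample_loguniform` is a hypothetical helper of ours, not the scipy API used in the cell below.

```python
import math
import random

def sample_loguniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)
low, high = 1e-6, 1e-1   # same range as 'alpha' in the search below
samples = [sample_loguniform(low, high, rng) for _ in range(10_000)]

# The range covers 5 decades, 3 of which lie below 1e-3,
# so about 60% of the draws should fall below 1e-3
below_1e3 = sum(s < 1e-3 for s in samples) / len(samples)
print(f"share of draws below 1e-3: {below_1e3:.2f}")
```

With a plain uniform distribution on the same interval, almost all draws would land in the top decade (0.01 to 0.1) and the small values would hardly be tested.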

In [None]:
# TODO: random grid search
# Setup data. Done above already
#X_train, X_test, y_train, y_test = LoadAndSetupData('iris')  # 'iris', 'moon', or 'mnist'

from scipy.stats import uniform, loguniform, randint
model = SGDClassifier()

tuning_parameters = {
    'loss': ['hinge', 'log_loss', 'modified_huber', 'squared_hinge'],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'learning_rate': ['constant', 'optimal', 'adaptive', 'invscaling'],
    'alpha': loguniform(0.000001, 0.1),
    'eta0': loguniform(0.00001, 0.1),
    'max_iter': randint(1000, 20000),
    'shuffle': [True, False],
    'average': [False, True],
    'early_stopping': [False, True],
}

CV = 5
VERBOSE = 0

# Run RandomizedSearchCV for the model
grid_tuned = RandomizedSearchCV(model,
                          tuning_parameters,
                          n_iter=100,
                          random_state=42,
                          cv=CV,
                          scoring='f1_micro',
                          verbose=VERBOSE,
                          n_jobs=-1)

start = time()
grid_tuned.fit(X_train, y_train)
t = time() - start

# Report result
b0, m0 = FullReport(grid_tuned, X_test, y_test, t)

SummarizeParamPerformance(grid_tuned)
print('OK(grid-search)')

#### Output
best: dat=iris, score=0.99048, model=SGDClassifier(alpha=0.024169519319206773,average=False,early_stopping=False,eta0=3.4360613844702604e-05,learning_rate='optimal',loss='modified_huber',max_iter=12470,penalty='l1',shuffle=True)

SEARCH TIME: 2.90 sec

Best model set found on train set:

	best parameters={'alpha': 0.024169519319206773, 'average': False, 'early_stopping': False, 'eta0': 3.4360613844702604e-05, 'learning_rate': 'optimal', 'loss': 'modified_huber', 'max_iter': 12470, 'penalty': 'l1', 'shuffle': True}
	best 'f1_micro' score=0.9904761904761905
	best index=92

Best estimator CTOR:
	SGDClassifier(alpha=0.024169519319206773, eta0=3.4360613844702604e-05,
              loss='modified_huber', max_iter=12470, penalty='l1')


	SGDClassifier(alpha=0.01, eta0=1e-05, loss='modified_huber', penalty='l1',
              shuffle=False)

#### Comments on the run
The random search that tests 100 random models takes just 3 seconds to run, which is roughly 1/40 of the grid search time (2.90 s against 122.54 s). The best model scores 0.99, which is also very high, but again, the dataset is very small. We see that both searches find the same loss and penalty to be best. The differences are mostly in the alpha and eta0 values. With more models tested we might get closer to the same values.

One thing that is nice about the random search is that its run time scales roughly linearly with n_iter, so doubling n_iter takes about twice the time (disregarding early stopping).
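
The linear scaling makes it possible to budget a longer run from a short one. A rough sketch, where `estimate_search_time` is a hypothetical helper of ours and the estimate ignores that fit time also depends on the sampled parameters (max_iter, early stopping) and on parallel overhead:

```python
def estimate_search_time(measured_seconds, measured_n_iter, planned_n_iter):
    """Extrapolate wall-clock time linearly in the number of sampled candidates."""
    seconds_per_candidate = measured_seconds / measured_n_iter
    return seconds_per_candidate * planned_n_iter

# Measured in the iris run above: 100 candidates took 2.90 s
for n_iter in (100, 200, 1000):
    estimate = estimate_search_time(2.90, 100, n_iter)
    print(f"n_iter={n_iter:5d}  estimated time={estimate:6.2f} s")
```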

### Qd MNIST Search Quest II
This quest is in two parts. The first part is where we tried to use an SGD classifier and the second where we use the SVC.

#### SGD
Initially we thought using the SGD would be okay since it scored well with the IRIS dataset. We went through 3 different searches all using the random approach. After each run we looked at which parameters gave the worst scores and removed them from testing.

For all the tests we scaled the data using the StandardScaler (we did not go back and try to remove it).

The final result is 0.9207, which is posted to Brightspace.
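
In our runs the scaler was fitted once on the whole training set before the search. An alternative, which we did not use, is to put the scaler and the classifier in a pipeline so that the scaler is refitted on the training part of every cross-validation fold. The sketch below shows the idea on the small built-in Iris data (not MNIST) with a deliberately tiny search; the parameter ranges are illustrative assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset so the sketch runs in seconds
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The scaler is fitted inside each CV fold, together with the classifier
pipe = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))

# Step parameters are addressed as '<step name>__<parameter>'
tuning_parameters = {
    'sgdclassifier__loss': ['hinge', 'modified_huber'],
    'sgdclassifier__alpha': loguniform(1e-6, 1e-1),
}

search = RandomizedSearchCV(pipe, tuning_parameters, n_iter=5, cv=3,
                            scoring='f1_micro', random_state=42)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"test score: {search.score(X_test, y_test):.3f}")
```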

In [None]:
# TODO:(in code and text..)
from time import time
import numpy as np
from scipy.stats import loguniform, randint
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = LoadAndSetupData('mnist')

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = SGDClassifier()

# reference for why the loguniform is used.
#https://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-optimization

# Trying all the loss, penalty and learning_rate options
tuning_parameters = {
    'loss': ['hinge', 'log_loss', 'modified_huber', 'squared_hinge'],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'learning_rate': ['constant', 'optimal', 'adaptive', 'invscaling'],
    'alpha': loguniform(0.000001, 0.1),
    'eta0': loguniform(0.00001, 0.1),
    'max_iter': randint(1000, 20000),
    'shuffle': [True, False],
    'average': [False, True],
    'early_stopping': [False, True],
}

CV = 5
VERBOSE = 0

grid_tuned = RandomizedSearchCV(
    model,
    tuning_parameters,
    n_iter=200,              # will take a couple hours
    random_state=42,
    cv=CV,
    scoring='f1_micro',
    verbose=VERBOSE,
    n_jobs=-1
)

start = time()
grid_tuned.fit(X_train, y_train)
t = time() - start

b0, m0 = WriteReportToFile(
    grid_tuned, X_test, y_test, t,
    params_to_summarize=['loss', 'penalty', 'learning_rate', 'shuffle', 'average', 'early_stopping'],
    full_results_file='mnist_search_results.txt',
    status_file='mnist_search_status.txt'
)


### Run 1
#### Input
Trying all the loss, penalty and learning_rate options
tuning_parameters = {
    'loss': ['hinge', 'log_loss', 'modified_huber', 'squared_hinge'],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'learning_rate': ['constant', 'optimal', 'adaptive', 'invscaling'],
    'alpha': loguniform(0.000001, 0.1),
    'eta0': loguniform(0.00001, 0.1),
    'max_iter': randint(1000, 20000),
    'shuffle': [True, False],
    'average': [False, True],
    'early_stopping': [False, True],
}

CV = 5
VERBOSE = 0

grid_tuned = RandomizedSearchCV(
    model,
    tuning_parameters,
    n_iter=200,              # will take a couple hours
    random_state=42,
    cv=CV,
    scoring='f1_micro',
    verbose=VERBOSE,
    n_jobs=-1
)

#### Output
SEARCH TIME: 9788.79 sec

best: dat=mnist, score=0.91400, model=SGDClassifier(alpha=3.46979613767106e-05,average=True,early_stopping=True,eta0=0.0010845668034236337,learning_rate='optimal',loss='modified_huber',max_iter=18429,penalty='l2',shuffle=False)

#### Parameter information
Average CV score per 'loss':
  loss=modified_huber   mean_f1_micro=0.8491  (+/- 0.1689)
  loss=squared_hinge    mean_f1_micro=0.8143  (+/- 0.1880)
  loss=hinge            mean_f1_micro=0.7767  (+/- 0.2730)
  loss=log_loss         mean_f1_micro=0.7235  (+/- 0.3143)

Average CV score per 'penalty':
  penalty=l2               mean_f1_micro=0.8744  (+/- 0.0628)
  penalty=elasticnet       mean_f1_micro=0.7799  (+/- 0.2627)
  penalty=l1               mean_f1_micro=0.7362  (+/- 0.3033)

Average CV score per 'learning_rate':
  learning_rate=optimal          mean_f1_micro=0.8213  (+/- 0.2332)
  learning_rate=invscaling       mean_f1_micro=0.7971  (+/- 0.1656)
  learning_rate=constant         mean_f1_micro=0.7790  (+/- 0.2879)
  learning_rate=adaptive         mean_f1_micro=0.7774  (+/- 0.2857)

Average CV score per 'shuffle':
  shuffle=False            mean_f1_micro=0.8069  (+/- 0.2175)
  shuffle=True             mean_f1_micro=0.7765  (+/- 0.2759)

Average CV score per 'average':
  average=False            mean_f1_micro=0.8623  (+/- 0.0559)
  average=True             mean_f1_micro=0.7240  (+/- 0.3292)

Average CV score per 'early_stopping':
  early_stopping=True             mean_f1_micro=0.8374  (+/- 0.1720)
  early_stopping=False            mean_f1_micro=0.7413  (+/- 0.3031)

These summaries come from the SummarizeParamPerformance function mentioned in the beginning. We remove the parameter values that score low and have a high variation, since they are unlikely to settle down.
Then we start another round of searching. Please note that the search time is ~2.7 hours and yields a score of 0.914.
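
The pruning is based on the average score per parameter value. A value with a low average can still contain the single best candidate, so it may be worth looking at the best score per value as well. The sketch below uses made-up scores (not taken from our runs) and a hypothetical helper `summarize` to show the difference.

```python
from collections import defaultdict
from statistics import mean, pstdev

def summarize(param_values, scores):
    """Group CV scores by parameter value; return (value, mean, std, best) rows."""
    groups = defaultdict(list)
    for value, score in zip(param_values, scores):
        groups[value].append(score)
    rows = [(v, mean(s), pstdev(s), max(s)) for v, s in groups.items()]
    # Highest average first, like SummarizeParamPerformance
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Made-up scores for illustration, NOT taken from the runs in this report
losses = ['hinge', 'hinge', 'hinge', 'log_loss', 'log_loss', 'log_loss']
scores = [0.92, 0.30, 0.88, 0.80, 0.78, 0.82]

rows = summarize(losses, scores)
for value, avg, std, best in rows:
    print(f"loss={value:10} mean={avg:.3f} std={std:.3f} best={best:.3f}")
```

In this toy example log_loss has the higher average, but the best single candidate uses hinge; one bad candidate drags the hinge average down.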

### Run 2
#### Input
tuning_parameters = {
    'loss': ['hinge', 'modified_huber', 'squared_hinge'],
    'penalty': ['l2', 'elasticnet'],
    'learning_rate': ['adaptive', 'optimal'],
    'alpha': loguniform(0.000001, 0.01),
    'eta0': loguniform(0.0001, 0.1),
    'max_iter': randint(5000, 60000),
    'n_iter_no_change': [10],
    'early_stopping': [True],
    'shuffle': [True],
    'average': [True, False],
}

CV = 5
VERBOSE = 0

grid_tuned = RandomizedSearchCV(
    model,
    tuning_parameters,
    n_iter=200, # will take a couple hours
    random_state=42,
    cv=CV,
    scoring='f1_micro',
    verbose=VERBOSE,
    n_jobs=-1
)


#### Output
SEARCH TIME: 4805.65 sec

best: dat=mnist, score=0.91937, model=SGDClassifier(alpha=0.0010316033434719725,average=False,early_stopping=True,eta0=0.013321207301714324,learning_rate='adaptive',loss='hinge',max_iter=57075,n_iter_no_change=10,penalty='elasticnet',shuffle=True,validation_fraction=0.1)

#### Parameter information
Average CV score per 'loss':
  loss=squared_hinge    mean_f1_micro=0.8953  (+/- 0.0304)
  loss=modified_huber   mean_f1_micro=0.8801  (+/- 0.1539)
  loss=hinge            mean_f1_micro=0.8690  (+/- 0.1768)

Average CV score per 'penalty':
  penalty=l2               mean_f1_micro=0.9033  (+/- 0.0195)
  penalty=elasticnet       mean_f1_micro=0.8571  (+/- 0.1960)

Average CV score per 'learning_rate':
  learning_rate=optimal          mean_f1_micro=0.8976  (+/- 0.0854)
  learning_rate=adaptive         mean_f1_micro=0.8601  (+/- 0.1838)

Average CV score per 'shuffle':
  shuffle=True             mean_f1_micro=0.8807  (+/- 0.1398)

Average CV score per 'average':
  average=False            mean_f1_micro=0.9029  (+/- 0.0205)
  average=True             mean_f1_micro=0.8626  (+/- 0.1857)

Average CV score per 'early_stopping':
  early_stopping=True             mean_f1_micro=0.8807  (+/- 0.1398)

#### Discussion
We now see that the squared_hinge loss function has the highest average score and the lowest variation. The best model, however, uses the hinge loss, which has the lowest average, and the elasticnet penalty, which has the lower average of the two penalties. This tells us that a lucky combination can achieve a higher score than the averages suggest. The modified_huber loss had neither the highest average score nor the best model, so we drop it.

### Run 3
#### Input
tuning_parameters = {
    'loss': ['hinge', 'squared_hinge'],     # Dropping modified_huber
    'penalty': ['l2', 'elasticnet'],
    'learning_rate': ['optimal', 'adaptive'],
    'alpha': loguniform(0.00001, 0.005),        # narrowing search around the best parameters found earlier
    'eta0': loguniform(0.001, 0.05),         # narrowing search around the best parameters found earlier
    'max_iter': randint(10000, 80000),      # increasing maximum iterations
    'n_iter_no_change': [10],
    'early_stopping': [True],
    'shuffle': [True],
    'average': [False],
}

CV = 5
VERBOSE = 0

grid_tuned = RandomizedSearchCV(
    model,
    tuning_parameters,
    n_iter=200, # will take some hours
    random_state=42,
    cv=CV,
    scoring='f1_micro',
    verbose=VERBOSE,
    n_jobs=-1
)

#### Output
SEARCH TIME: 10949.07 sec

best: dat=mnist, score=0.92073, model=SGDClassifier(alpha=0.0034952182314214584,average=False,early_stopping=True,eta0=0.030463008520733255,learning_rate='adaptive',loss='hinge',max_iter=75450,n_iter_no_change=10,penalty='l2',shuffle=True)

#### Parameter information
Average CV score per 'loss':
  loss=hinge            mean_f1_micro=0.9115  (+/- 0.0049)
  loss=squared_hinge    mean_f1_micro=0.8978  (+/- 0.0118)

Average CV score per 'penalty':
  penalty=l2               mean_f1_micro=0.9050  (+/- 0.0113)
  penalty=elasticnet       mean_f1_micro=0.9032  (+/- 0.0116)

Average CV score per 'learning_rate':
  learning_rate=optimal          mean_f1_micro=0.9084  (+/- 0.0041)
  learning_rate=adaptive         mean_f1_micro=0.9002  (+/- 0.0144)

Average CV score per 'shuffle':
  shuffle=True             mean_f1_micro=0.9041  (+/- 0.0115)

Average CV score per 'average':
  average=False            mean_f1_micro=0.9041  (+/- 0.0115)

Average CV score per 'early_stopping':
  early_stopping=True             mean_f1_micro=0.9041  (+/- 0.0115)


#### Discussion
We decided to stop searching using the SGD classifier since we barely got an increase in score. Looking at the variance, there isn't much room for improvement.

#### SVC
Now we try the SVC, which has fewer parameters than the SGD. The most important one, however, is the kernel. We chose the rbf kernel since it looked like the best fit for multi-class classification when looking through the scikit-learn documentation. https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html#sphx-glr-auto-examples-svm-plot-svm-kernels-py

We do NOT scale the data. We found that scaling yielded a worse score than not scaling at all.
Due to this late revelation there are not that many runs.

The best score is 0.96949 (might increase) and will be used as a last resort.

The approach was similar to the SGD, but the search time is far longer since SVC training scales badly with dataset size (according to the scikit-learn documentation, at least quadratically in the number of samples), and thus fewer combinations could be tested per run.

Initially we just used the "auto" and "scale" gamma parameters, to avoid having too many variables.



In [None]:
# TODO:(in code and text..)
from time import time
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = LoadAndSetupData('mnist')

from sklearn.svm import SVC
model = SVC(kernel='rbf')

tuning_parameters = {
    'C': [4]
}

CV = 5
VERBOSE = 0

grid_tuned = GridSearchCV(
    model,
    tuning_parameters,
    cv=CV,
    scoring='f1_micro',
    verbose=VERBOSE,
    n_jobs=-1
)

start = time()
grid_tuned.fit(X_train, y_train)
t = time() - start

b0, m0 = WriteReportToFile(
    grid_tuned, X_test, y_test, t,
    params_to_summarize=["C"],
    full_results_file='mnist_search_POC_cv.txt',
    status_file='mnist_search_status_POC_cv.txt'
)

#### Output
SEARCH TIME: 362.07 sec
best: dat=mnist, score=0.98210, model=SVC(C=4)

#### Comments
Compared to the many runs where the data was scaled and couldn't get above ~0.9694, this was a clear improvement and a wider search is needed.
This was just a proof of concept that the data shouldn't be scaled (at least not with the StandardScaler).
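
One possible reason why the type of scaling matters, which we have not verified on MNIST: with gamma='scale' (the default in recent scikit-learn versions) gamma is set to 1 / (n_features \* X.var()), so multiplying all features by one common factor leaves the rbf kernel unchanged, while per-feature standardization changes the relative weight of the features. The sketch below illustrates this with NumPy on made-up data; `rbf_kernel_scale_gamma` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def rbf_kernel_scale_gamma(X):
    """RBF kernel matrix using gamma = 1 / (n_features * X.var())."""
    gamma = 1.0 / (X.shape[1] * X.var())
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
# Made-up "pixel" data, NOT MNIST: later columns vary much more than earlier ones
X = rng.integers(0, 256, size=(20, 8)) * np.linspace(0.05, 1.0, 8)

K_raw = rbf_kernel_scale_gamma(X)
K_global = rbf_kernel_scale_gamma(X / 255.0)   # one common factor
K_standard = rbf_kernel_scale_gamma(           # per feature, like StandardScaler
    (X - X.mean(axis=0)) / X.std(axis=0))

same_after_global = np.allclose(K_raw, K_global)
same_after_standard = np.allclose(K_raw, K_standard)
print("kernel unchanged by global rescaling:  ", same_after_global)
print("kernel unchanged by per-feature scaling:", same_after_standard)
```

Per-feature scaling gives low-variance features (for images, for example border pixels) the same weight as high-variance ones, which may or may not help the classifier.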

In [None]:
# TODO:(in code and text..)
from time import time
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = LoadAndSetupData('mnist')

from sklearn.svm import SVC
model = SVC(kernel='rbf')

tuning_parameters = {
    'C': [1, 2, 4, 8, 12, 16, 25, 35, 50],
    'gamma': ['auto', 'scale'], # using the defined options since they are more, well, defined
}

CV = 5
VERBOSE = 0

grid_tuned = GridSearchCV(
    model,
    tuning_parameters,
    cv=CV,
    scoring='f1_micro',
    verbose=VERBOSE,
    n_jobs=-1
)

start = time()
grid_tuned.fit(X_train, y_train)
t = time() - start


b0, m0 = WriteReportToFile(
    grid_tuned, X_test, y_test, t,
    params_to_summarize=["C", "gamma"],
    full_results_file='mnist_search_long_cv.txt',
    status_file='mnist_search_status_long_cv.txt'
)

#### Output
We hope to have a good output

#### Comments


#### Discussion
The benefits of grid search are clear:
1. Using random search we can quickly track down a set of suitable parameters that could yield a good model.
2. Then with the grid search we can do a more systematic search and remove even more parameters.
3. We can set the processor (CPU or GPU) to just crunch for hours on end without having to manually replace parameters.

Some drawbacks/preconditions:
1. With verbose=0 there is no indication of when a search will be done; we can only estimate based on earlier runs. If the model scales badly (SVC), this becomes even worse.
2. You still need to have some idea of what the parameters do to get a good search result.