This is a simple notebook that applies a Logistic Regression model to the Ising model.

It accompanies Chapter 5 of the book (4 of 5).

Author: Viviana Acquaviva, with contributions by Jake Postiglione and Olga Privman; see also data credits below.

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pickle
from matplotlib import cm
%matplotlib inline

In [None]:
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import cross_val_predict, cross_val_score, cross_validate, train_test_split

from sklearn.model_selection import KFold, StratifiedKFold

from sklearn import metrics

### First, let's take a look at those sigmoids!

In [None]:
x = np.linspace(-10,10,100)

In [None]:
z = 2*x + 5 #Linear bit

Let's say that the probability that something will happen is called $\pi$. 

The logistic model assumes that

$\log \left( \frac{\pi}{1-\pi} \right) = z$

We can now solve for $\pi$:

In [None]:
pi = 1/(1 + np.exp(-z))
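As a quick sanity check of the algebra, the sketch below verifies numerically that the sigmoid is the inverse of the log-odds relation above. The names `x_check`, `z_check`, `pi_check`, and `log_odds` are introduced here only for illustration, and the grid is an arbitrary choice that avoids floating-point saturation.

```python
import numpy as np

# Hypothetical grid, restricted so that pi stays safely away from 0 and 1.
x_check = np.linspace(-7, 3, 101)
z_check = 2 * x_check + 5                      # same linear model as above

pi_check = 1 / (1 + np.exp(-z_check))          # sigmoid of z
log_odds = np.log(pi_check / (1 - pi_check))   # log-odds recovered from pi

# If the sigmoid really solves log(pi / (1 - pi)) = z, these two arrays agree.
print(np.allclose(log_odds, z_check))
```

Exponentiating both sides of the log-odds relation gives $\pi/(1-\pi) = e^{z}$, and rearranging yields $\pi = 1/(1+e^{-z})$, which is the expression coded above.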

In [None]:
plt.plot(x, pi)

plt.xlim(-7,3);

plt.title('Hello, I am a sigmoid!')

plt.xlabel('x', fontsize=14)

plt.ylabel(r'$\pi$',fontsize=14); #raw string, so that \p is not read as an escape sequence

### Learning Check-in
    
For which value of x is $\pi$ = 0.5?

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```
Looking at the definition of the logistic model, we can see that $\pi$ = 0.5 (odds are the same) when z = 0; in our graph, this corresponds to x = -2.5.
```
    
</p>
</details>

</br>

What happens if the slope of the linear model is negative?

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```
The asymptotes of the sigmoid are flipped, and the curve is monotonically decreasing.
```
    
</p>
</details>
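Both answers above can be checked numerically. This is a minimal sketch, assuming the same slope and intercept as in the earlier cell; the helper `sigmoid` and the variables `a`, `b`, `x_half`, `x_grid`, and `pi_neg` are hypothetical names used only here.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps the linear term z to a probability."""
    return 1 / (1 + np.exp(-z))

a, b = 2.0, 5.0                  # slope and intercept of z = a*x + b

# pi = 0.5 when z = 0, i.e. at x = -b/a.
x_half = -b / a
print(x_half, sigmoid(a * x_half + b))

# Flipping the sign of the slope flips the asymptotes: the curve now decreases.
x_grid = np.linspace(-10, 10, 100)
pi_neg = sigmoid(-a * x_grid + b)
print(np.all(np.diff(pi_neg) <= 0))
```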

### We can now see an example from Mehta et al 2018:

["A high-bias, low-variance introduction to Machine Learning for physicists"](https://arxiv.org/abs/1803.08823).

We are trying to use a logistic regression model to predict whether a material is in an ordered or disordered phase, based on its spin configuration. In an ordered phase, the spins are aligned. The representation is a 2D lattice, so our features are the spin states of each element in the lattice. The physical model, known as the Ising model, predicts that the transition depends on temperature and is smeared (for a finite-size lattice) around a critical temperature $T_c$.

The training data is composed of 160,000 Monte Carlo simulations in a range of temperatures, and their labels.

Possible applications of this formalism involve predicting the critical temperature for more complex systems.
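To make the ordered/disordered distinction described above concrete, here is a small sketch on synthetic lattices (not the Monte Carlo data). It assumes an idealized ordered state with all spins up and a disordered state with independent random spins; the helper `abs_magnetization` is a hypothetical name for the absolute mean spin, a quantity this notebook does not otherwise use.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 40                                           # lattice side, as in the data set

ordered = np.ones((L, L), dtype=int)             # idealized ordered state: all spins aligned
disordered = rng.choice([-1, 1], size=(L, L))    # idealized disordered state: random spins

def abs_magnetization(lattice):
    """Absolute value of the mean spin: 1 if fully aligned, near 0 if random."""
    return abs(lattice.mean())

print(abs_magnetization(ordered), abs_magnetization(disordered))
```

Real configurations near $T_c$ sit between these two extremes, which is what makes the classification problem non-trivial.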

Reading in the data might take a little while.

In [None]:
#This is gratefully borrowed with permission from the notebooks maintained by P. Mehta.

######### LOAD DATA
# The data consists of 16*10000 samples taken in T=np.arange(0.25,4.0001,0.25):
data_file_name = '../data/Ising2DFM_reSample_L40_T=All.pkl'
# The labels are obtained from the following file:
label_file_name = '../data/Ising2DFM_reSample_L40_T=All_labels.pkl'


#DATA
with open(data_file_name, 'rb') as pickle_file:
    data = pickle.load(pickle_file) # pickle reads the file and returns the Python object (1D array, compressed bits)

data = np.unpackbits(data).reshape(-1, 1600) # Decompress array and reshape for convenience
data=data.astype('int')
data[np.where(data==0)]=-1 # map 0 state to -1 (Ising variable can take values +/-1)

#LABELS (convention is 1 for ordered states and 0 for disordered states)
with open(label_file_name, 'rb') as pickle_file:
    labels = pickle.load(pickle_file) # pickle reads the file and returns the Python object (here just a 1D array with the binary labels)
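The decompression step in the loading cell can be illustrated on a toy array. This sketch assumes nothing about the actual file beyond what the comments above state (spins stored as packed bits); `toy_spins`, `bits`, `packed`, and `restored` are hypothetical names, and the 3 x 16 shape is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_spins = rng.choice([-1, 1], size=(3, 16))    # 3 toy "lattices" of 16 spins each

# Compress: map -1 -> 0 and +1 -> 1, then pack 8 spins into each byte.
bits = ((toy_spins + 1) // 2).astype(np.uint8)
packed = np.packbits(bits)                       # flattened: 48 bits -> 6 bytes

# Decompress exactly as in the loading cell: unpack, reshape, map 0 -> -1.
restored = np.unpackbits(packed).reshape(-1, 16).astype(int)
restored[restored == 0] = -1

print(packed.shape, np.array_equal(restored, toy_spins))
```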

In [None]:
data.shape

In [None]:
np.unique(labels)

We can take a look at the label distribution:

In [None]:
plt.scatter(np.arange(data.shape[0]),labels)

plt.xlabel('Example #')

plt.ylabel('State');

#labels: 1 = ordered or near-critical
#labels: 0 = disordered

### Learning Check-in
    
Is this a balanced or imbalanced data set?

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```
To check for balance, we can compute the fraction of "1" labels, e.g. with np.sum(labels)/len(labels); we obtain ~ 56%, indicating that the data set is reasonably balanced.
```
    
</p>
</details>
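The balance check in the answer above can be sketched as follows. The array `toy_labels` is a stand-in with 9 ones out of 16 entries, chosen only to mimic the ~56% quoted in the answer; with the real data, `labels` would take its place.

```python
import numpy as np

# Hypothetical stand-in for `labels`: 9 "ordered" and 7 "disordered" entries.
toy_labels = np.array([1] * 9 + [0] * 7)

frac_ordered = toy_labels.mean()         # same as np.sum(labels) / len(labels)
class_counts = np.bincount(toy_labels)   # number of examples with label 0 and label 1

print(frac_ordered, class_counts)
```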

#### We can take a look at a few examples:

In [None]:
#H/T: https://stackoverflow.com/questions/16834861/create-own-colormap-using-matplotlib-and-plot-color-scale

cmap = matplotlib.colors.ListedColormap(["aquamarine","navy"], name='from_list', N=None)

#Set the figure size in plt.subplots directly; a separate plt.figure call would create an extra empty figure
fig, axarr = plt.subplots(nrows=1, ncols=3, figsize=(15,8))
axarr[0].imshow(data[0].reshape(40,40), cmap = cmap) #first object has label "1"
axarr[1].imshow(data[80000].reshape(40,40), cmap = cmap) #from documentation, this is critical-ish (between 60, and 90,000)
axarr[2].imshow(data[100000].reshape(40,40), cmap = cmap) #disordered
for i in range(3):
    axarr[i].set_xticks([0,20,40]);

### Let's pick a random selection to speed up the computations.

In [None]:
np.random.seed(10)

sel = np.random.choice(data.shape[0], 16000, replace = False)

In [None]:
seldata = data[sel,:]

In [None]:
sellabels = labels[sel]

In [None]:
plt.scatter(np.arange(seldata.shape[0]),sellabels); #The random selection also has the advantage of reshuffling the data!
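The two properties of the random selection (no repeated examples, and reshuffling of the originally sorted labels) can be checked on a toy array. This is a sketch under the assumption of a sorted label array with 9000 ones followed by 7000 zeros; `toy_sorted_labels`, `toy_sel`, and `toy_sellabels` are hypothetical names.

```python
import numpy as np

np.random.seed(10)

# Hypothetical sorted labels: all "1" examples first, then all "0" examples.
toy_sorted_labels = np.array([1] * 9000 + [0] * 7000)

# Draw 10% of the indices without replacement, as done for the real data.
toy_sel = np.random.choice(toy_sorted_labels.shape[0], 1600, replace=False)
toy_sellabels = toy_sorted_labels[toy_sel]

# No index is drawn twice, and the class fraction is roughly preserved.
print(len(np.unique(toy_sel)) == len(toy_sel))
print(toy_sorted_labels.mean(), toy_sellabels.mean())
```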

### And now time for the logistic regression model.

In [None]:
model = LogisticRegression(max_iter = 1000) #This uses a numerical method to find the minimum of the loss function

In [None]:
model.get_params() #Note that (unlike in linear regression) regularization is switched on by default!

In [None]:
model

We can use cross validation, as usual:

In [None]:
#Takes 5-10 seconds

results = cross_validate(model, seldata, sellabels, 
                         cv = KFold(n_splits=5, shuffle=True, random_state=10), return_train_score = True)

In [None]:
results 

### Learning Check-in
    
Which metric do you think those numbers represent?

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```
Quite surprisingly, the default score reported for Logistic Regression is accuracy (a classification metric!)
```

</p>

</details>
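One way to confirm this is to compare the default scores with those obtained by requesting accuracy explicitly. The sketch below uses a synthetic data set from `make_classification` rather than the Ising data; `X_toy`, `y_toy`, `model_toy`, and the two result dictionaries are names introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic binary classification problem, used here in place of the Ising data.
X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=10)
model_toy = LogisticRegression(max_iter=1000)

default_res = cross_validate(model_toy, X_toy, y_toy, cv=cv)
accuracy_res = cross_validate(model_toy, X_toy, y_toy, cv=cv, scoring="accuracy")

print(sorted(default_res.keys()))
# True if the default scorer and the explicit accuracy scorer agree on every fold.
print(np.allclose(default_res["test_score"], accuracy_res["test_score"]))
```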

This behavior is sub-optimal because we also want access to the predicted probabilities (or, equivalently, the odds). We'll look at that in a moment.

### We can do our own grid search to optimize the regularization parameter C:

In [None]:
for C in np.logspace(-3,3,7):
    model = LogisticRegression(max_iter=1000, C = C)
    results = cross_validate(model, seldata, sellabels, 
                         cv = KFold(n_splits=5, shuffle=True, random_state=10), return_train_score = True)
    print(f"Average test accuracy for C = {C:.3e} is {results['test_score'].mean():.3f} +- {results['test_score'].std():.3f}")
    print(f"Average train accuracy for C = {C:.3e} is {results['train_score'].mean():.3f} +- {results['train_score'].std():.3f}")
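The same search can also be written with scikit-learn's `GridSearchCV`, which is not used elsewhere in this notebook. The sketch below runs on a synthetic data set (an assumption made so that it is self-contained and fast); `X_toy`, `y_toy`, `param_grid`, and `search` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for (seldata, sellabels).
X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {"C": np.logspace(-3, 3, 7)}        # same grid as in the loop above
cv = KFold(n_splits=5, shuffle=True, random_state=10)

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=cv, return_train_score=True)
search.fit(X_toy, y_toy)

# Mean test/train accuracy for each C, plus the best value found.
print(search.cv_results_["mean_test_score"])
print(search.cv_results_["mean_train_score"])
print(search.best_params_, search.best_score_)
```

The manual loop has the advantage of making each step explicit, while `GridSearchCV` stores every per-fold score in `cv_results_` and refits the best model on the full data by default.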


### Learning Check-in
    
Which value of C should we pick?

<br>

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```
The (test) scores are pretty flat, so the value of C that we pick is not that important.
```

</p>
</details>

</br>

How is this model's performance?

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```
Not great, with an accuracy around 66%.
```

</p>
</details>
    

### Here we generate labels in order to check predictions.

For those classifiers that are solving a regression problem under the hood, there is the handy "predict_proba" method.

In [None]:
model = LogisticRegression(C=1.0, max_iter=1000)

ypred = cross_val_predict(model, seldata, sellabels, \
                               cv = KFold(n_splits=5, shuffle=True, random_state=10))

ypred_prob = cross_val_predict(model, seldata, sellabels, \
                               cv = KFold(n_splits=5, shuffle=True, random_state=10), method = 'predict_proba')

The output of predict_proba gives the probability of belonging to the disordered (label 0) or ordered (label 1) phase, one column per class. The standard classifier output is the class with p > 0.5. We can stack the two outputs side by side to convince ourselves:

In [None]:
np.column_stack([ypred_prob, ypred])
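The relation between the two outputs can also be verified programmatically instead of by eye. This is a sketch on synthetic data, assuming the same cross-validation setup as above; `X_toy`, `y_toy`, `ypred_toy`, and `yprob_toy` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for (seldata, sellabels), with labels 0 and 1.
X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=10)
model_toy = LogisticRegression(C=1.0, max_iter=1000)

ypred_toy = cross_val_predict(model_toy, X_toy, y_toy, cv=cv)
yprob_toy = cross_val_predict(model_toy, X_toy, y_toy, cv=cv, method="predict_proba")

# Each row holds P(label 0) and P(label 1), which sum to one.
print(np.allclose(yprob_toy.sum(axis=1), 1.0))
# The predicted label is 1 exactly when P(label 1) exceeds 0.5.
print(np.array_equal(ypred_toy, (yprob_toy[:, 1] > 0.5).astype(int)))
```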

### We can plot a few examples to see how our classifier is doing. 

In [None]:
fig, axarr = plt.subplots(nrows=1, ncols=8, figsize=(15,5))
for i in range(8):
    axarr[i].imshow(seldata[i].reshape(40,40), cmap = cmap) 
    axarr[i].set_xlabel('True label:'+str(sellabels[i])+'\n'+'Pred label:'+str(ypred[i]))
    axarr[i].set_yticks([])
    axarr[i].set_xticks([])

Unfortunately, two of these instances are misclassified by our Logistic Regression classifier, although the mistakes are understandable, at least visually.

However, a look at the corresponding probabilities reveals some concerns:

In [None]:
ypred_prob[:8]

0 Ordered  (decent confidence) 

1 Ordered  (decent confidence) 

2 is predicted to be Ordered WITH HIGH CONFIDENCE... BUT INCORRECTLY!

.....

Something is going wrong here, because the confidence level of very uncertain cases appears to be too high. 

The conclusion is that the main indicator of the phase in this problem is the (lack of) consistency between spin alignments, which is not modeled well by our logistic regression model. It's a tricky problem because many algorithms look at the value of each feature individually to decide; for many of them, it's hard to represent the correlation among features as an indicator.
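A toy experiment can illustrate why correlations among features matter here. The sketch below does not use the Monte Carlo data: it assumes idealized "ordered" lattices (all spins up or all spins down, with 5% of the spins flipped) and "disordered" lattices of independent random spins. It then compares logistic regression on the raw spins with logistic regression on a single engineered feature, the squared mean spin, which is an average over products of pairs of spins. All names (`X_raw`, `X_m2`, `acc_raw`, `acc_m2`, and so on) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
L, n_per_class = 10, 300                         # small toy lattice, for speed

# Toy "ordered" lattices: globally up or down, with 5% of the spins flipped.
signs = rng.choice([-1, 1], size=(n_per_class, 1))
flips = np.where(rng.random((n_per_class, L * L)) < 0.05, -1, 1)
X_ordered = signs * flips

# Toy "disordered" lattices: every spin is independently +1 or -1.
X_disordered = rng.choice([-1, 1], size=(n_per_class, L * L))

X_raw = np.vstack([X_ordered, X_disordered])
y = np.concatenate([np.ones(n_per_class, dtype=int),
                    np.zeros(n_per_class, dtype=int)])

# Squared mean spin: equals the average of s_i * s_j over all pairs (i, j).
X_m2 = (X_raw.mean(axis=1) ** 2).reshape(-1, 1)

cv = KFold(n_splits=5, shuffle=True, random_state=10)
clf = LogisticRegression(max_iter=1000)

acc_raw = cross_val_score(clf, X_raw, y, cv=cv).mean()   # one weight per spin
acc_m2 = cross_val_score(clf, X_m2, y, cv=cv).mean()     # one pair-correlation feature

print(acc_raw, acc_m2)
```

In this toy setting a linear function of the individual spins cannot assign the same label to the all-up and all-down states while separating both from random configurations, whereas a feature built from products of spins does so easily.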

### Learning Check-in
    
Which algorithm from the ones we have seen so far would you recommend using instead of Logistic Regression?

<br>

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```
Something that seems important here is the ability to combine features together. This is something that (generalized) linear models can't do well, but it is well within reach for Support Vector Machines, for example.
```
    
</p>
</details>
