# Multi Class Problems:

 

## Problem: MRI Data

The dementia level for the Oasis 1 MRI dataset is based on a patient assessment. As a result, it is not clear whether the levels of 0, .5, 1 and 2 should actually be understood as meaningfully numeric, or if they in fact are categorical labels. 

To load all of the files into an array, we need the name of every image file. The labels file makes this easy, since each file name is stored there: we loop through the __Filename__ column of the `labels` dataset and load the images into an array one by one. The labels file lists 609 images in total. 

There are two ways to store the images. First, we can load them into a $609\times 176 \times 176$ array, which is the best option if we care about the 2D structure. However, for algorithms like linear regression that cannot see the 2D structure, we may want to flatten the images to a $609\times 30976$ array (note that $30976 = 176 \times 176$). It's easy enough to switch back and forth between the two array structures later. We will start with the flattened array. 
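The switch between the two layouts can be sketched with `reshape`. This is a minimal illustration on random stand-in data, not the MRI images; the names `images_3d`, `images_flat`, and `images_restored` are hypothetical, and only 4 images are used to keep it light.

```python
import numpy as np

# Stand-in for the image stack: 4 random "images" of 176 x 176 pixels
rng = np.random.default_rng(0)
images_3d = rng.random((4, 176, 176))

# Flatten each image into one row of 176 * 176 = 30976 features
images_flat = images_3d.reshape(images_3d.shape[0], -1)

# Restore the 2D structure; reshape only changes the view of the data,
# so the original pixel values come back unchanged
images_restored = images_flat.reshape(-1, 176, 176)

print(images_flat.shape)      # (4, 30976)
print(images_restored.shape)  # (4, 176, 176)
```

With the real data the first dimension would be 609 instead of 4.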

__Note:__ It is very important that we perform the train test split _before_ we expand the dataset through downsampling. Otherwise, downsampled copies of the same image can land in both sets, and we are effectively training on the test data. 
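One plausible reading of "expanding through downsampling" is that each starting offset of the stride yields a different downsampled copy of the same image. The sketch below follows that assumption on small random stand-in data; all names (`X_demo`, `Xd_train_aug`, and so on) are hypothetical and not part of the lab code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

DS = 8  # downsample rate

# Hypothetical stand-in data: 20 flattened "images" of 64 pixels each
rng = np.random.default_rng(0)
X_demo = rng.random((20, 64))
y_demo = rng.integers(0, 3, size=20)

# 1) Split first, so no test image contributes to the training set
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# 2) Expand only the training set: each offset k gives a different
#    downsampled copy of every training image, sharing its label
Xd_train_aug = np.vstack([Xd_train[:, k::DS] for k in range(DS)])
yd_train_aug = np.tile(yd_train, DS)

# 3) Downsample the test set once, with a single fixed offset
Xd_test_ds = Xd_test[:, ::DS]

print(Xd_train_aug.shape, yd_train_aug.shape)  # (128, 8) (128,)
print(Xd_test_ds.shape)                        # (4, 8)
```

If the expansion happened before the split, copies of one image taken at different offsets could end up on both sides of the split.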

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import matplotlib

In [2]:
#file_dir = '/Users/liweizhang/My Drive/4570MatrixMethods/Labs/Lab2Classification/MRI_Images/'
labels = pd.read_csv('labels.csv')
# Using directory 
display(labels)
y = labels.CDR

Unnamed: 0.1,Unnamed: 0,Filename,ID,M/F,Hand,Age,Educ,SES,MMSE,CDR,eTIV,nWBV,ASF,Delay,Slice
0,0,OAS1_0001_MR1_55.png,OAS1_0001_MR1,F,R,74,2,3.0,29,0.0,1344,0.743,1.306,,55
1,1,OAS1_0001_MR1_120.png,OAS1_0001_MR1,F,R,74,2,3.0,29,0.0,1344,0.743,1.306,,120
2,2,OAS1_0001_MR1_180.png,OAS1_0001_MR1,F,R,74,2,3.0,29,0.0,1344,0.743,1.306,,180
3,3,OAS1_0002_MR1_55.png,OAS1_0002_MR1,F,R,55,4,1.0,29,0.0,1147,0.810,1.531,,55
4,4,OAS1_0002_MR1_120.png,OAS1_0002_MR1,F,R,55,4,1.0,29,0.0,1147,0.810,1.531,,120
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
604,604,OAS1_0449_MR1_120.png,OAS1_0449_MR1,F,R,71,3,4.0,29,0.0,1264,0.818,1.388,,120
605,605,OAS1_0449_MR1_180.png,OAS1_0449_MR1,F,R,71,3,4.0,29,0.0,1264,0.818,1.388,,180
606,606,OAS1_0456_MR1_55.png,OAS1_0456_MR1,M,R,61,5,2.0,30,0.0,1637,0.780,1.072,,55
607,607,OAS1_0456_MR1_120.png,OAS1_0456_MR1,M,R,61,5,2.0,30,0.0,1637,0.780,1.072,,120


In [3]:
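file_dir = 'MRI_Images/'
# One row per image: 609 images, each flattened to 176*176 = 30976 pixels
data = np.zeros([609, 30976])
# Use each file name from the labels file to read the image
for n, file_name in enumerate(labels.Filename):
    # Average over the color channels (axis 2), then flatten to one row
    data[n,:] = np.mean(matplotlib.image.imread(file_dir + file_name),axis=2).reshape(-1)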


In [4]:
# Convert the CDR levels 0, 0.5, 1, 2 to integer class labels 0, 1, 2, 4
y = (y*2).astype(int)
y

0      0
1      0
2      0
3      0
4      0
      ..
604    0
605    0
606    0
607    0
608    0
Name: CDR, Length: 609, dtype: int64
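Doubling the CDR values turns the four levels into integers, so they can serve as categorical class labels. A small sketch of the mapping and its inverse follows; `cdr_levels`, `class_labels`, and `cdr_recovered` are illustrative names only.

```python
import pandas as pd

# The four CDR levels that appear in this dataset
cdr_levels = pd.Series([0.0, 0.5, 1.0, 2.0], name="CDR")

# Same transformation as the cell above: doubling makes every level a whole number
class_labels = (cdr_levels * 2).astype(int)

# Dividing by 2 recovers the original CDR level from a class label
cdr_recovered = class_labels / 2

print(class_labels.tolist())    # [0, 1, 2, 4]
print(cdr_recovered.tolist())   # [0.0, 0.5, 1.0, 2.0]
```

Note that class label 4 corresponds to CDR 2, and no class 3 exists.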

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.2, random_state=0)
print(y_train.shape, y_test.shape)

(487,) (122,)


### Question 1:

Perform Logistic Regression on the above Oasis 1 dataset. Find the score and the confusion matrix.

In [6]:
data.shape

(609, 30976)

In [7]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(487, 30976) (487,)
(122, 30976) (122,)


In [8]:
y_train

503    0
90     0
528    2
446    1
200    1
      ..
277    0
9      0
359    0
192    0
559    1
Name: CDR, Length: 487, dtype: int64

In [9]:
y_train.value_counts()

0    282
1    145
2     55
4      5
Name: CDR, dtype: int64

In [10]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs', max_iter=3000)
# solver='lbfgs' selects the optimization algorithm (L-BFGS, a quasi-Newton method).
clf.fit(X_train,y_train)
print("Logistic Regression Score: %.3f"%clf.score(X_test,y_test))

Logistic Regression Score: 0.484


In [11]:
from sklearn.metrics import confusion_matrix

y_pred=clf.predict(X_test)

conf_mx = confusion_matrix(y_test, y_pred)

conf_mx


array([[49, 13,  4,  0],
       [25,  8,  2,  0],
       [11,  7,  2,  0],
       [ 1,  0,  0,  0]])
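The score can be read directly off the confusion matrix: rows are the true classes (0, 1, 2, 4), columns are the predictions, and the diagonal holds the correct ones. The sketch below copies the matrix from the output above; `cm`, `accuracy`, and `recall` are illustrative names.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = np.array([[49, 13, 4, 0],
               [25,  8, 2, 0],
               [11,  7, 2, 0],
               [ 1,  0, 0, 0]])

# Correct predictions (the diagonal) divided by all predictions
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: correct predictions divided by the number of true samples
recall = np.diag(cm) / cm.sum(axis=1)

print(f"accuracy: {accuracy:.3f}")   # accuracy: 0.484
print("recall per class:", np.round(recall, 3))
```

The accuracy of 59/122 matches the reported score, while the per-class recall shows that most of the correct predictions come from class 0.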

#### Each image contains 176×176=30976 features. Now we shrink each sample to 1/8 of its size.

In [12]:
DS = 8             # Downsample rate, must be a divisor of 30976

N_train = y_train.shape[0]  # The length of the training data
y_train = np.array(y_train)

if 30976 % DS != 0:
    print("Downsample rate is not a divisor of 30976")
    DS = 1
    im_size = 30976
else:
    im_size = 30976 // DS


data = np.zeros([609, im_size])

for i, file_name in enumerate(labels.Filename):
    img = np.mean(matplotlib.image.imread(file_dir + file_name),axis=2).reshape(-1)
    data[i,:] = img[::DS]            # Downsample the image

In [13]:
data.shape

(609, 3872)
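Because the image width 176 is a multiple of 8, taking every 8th entry of the flattened image is the same as keeping every 8th column of each row, which gives a $176 \times 22$ image. The sketch below checks this on a stand-in array; `img_2d`, `flat_ds`, and `cols_ds` are illustrative names.

```python
import numpy as np

DS = 8
# Stand-in 176 x 176 image (pixel values are just running indices, not MRI data)
img_2d = np.arange(176 * 176, dtype=float).reshape(176, 176)

flat_ds = img_2d.reshape(-1)[::DS]   # what the loop above does to each image
cols_ds = img_2d[:, ::DS]            # keep every 8th column of each row

print(flat_ds.shape)   # (3872,)
print(cols_ds.shape)   # (176, 22)
print(np.array_equal(flat_ds.reshape(176, 22), cols_ds))  # True
```

So the downsampled images keep the full vertical resolution and lose only horizontal resolution. For a rate that does not divide the width, the two operations would no longer agree.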

Based on the code above, downsample the test data in the same way. 

In [14]:
from sklearn.model_selection import train_test_split

X_train1, X_test1, y_train1, y_test1 = train_test_split(data, y, test_size=0.2, random_state=0)
print(y_train1.shape, y_test1.shape)

(487,) (122,)


### Question 2:

Perform Logistic Regression on the downsampled Oasis 1 dataset. Find the score and the confusion matrix.

In [15]:
from sklearn.linear_model import LogisticRegression

clf1 = LogisticRegression(solver='lbfgs', max_iter=3000)
# solver='lbfgs' selects the optimization algorithm (L-BFGS, a quasi-Newton method).
clf1.fit(X_train1,y_train1)
print("Logistic Regression Score: %.3f"%clf1.score(X_test1,y_test1))

Logistic Regression Score: 0.525


In [16]:
from sklearn.metrics import confusion_matrix

y_predict1 = clf1.predict(X_test1)

conf_mx = confusion_matrix(y_test1, y_predict1)
conf_mx

array([[54, 11,  1,  0],
       [26,  8,  1,  0],
       [16,  2,  2,  0],
       [ 1,  0,  0,  0]])
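To put the score in context, it may help to compare it with the accuracy of always predicting the most common class. The sketch below derives both numbers from the confusion matrix above; `cm_ds`, `support`, and `baseline` are illustrative names.

```python
import numpy as np

# Confusion matrix of the downsampled model, copied from the output above
cm_ds = np.array([[54, 11, 1, 0],
                  [26,  8, 1, 0],
                  [16,  2, 2, 0],
                  [ 1,  0, 0, 0]])

support = cm_ds.sum(axis=1)                  # test samples per true class
baseline = support.max() / support.sum()     # always predict the largest class
accuracy_ds = np.trace(cm_ds) / cm_ds.sum()  # score of the downsampled model

print("test samples per class:", support.tolist())   # [66, 35, 20, 1]
print(f"majority-class baseline: {baseline:.3f}")     # 0.541
print(f"downsampled model:       {accuracy_ds:.3f}")  # 0.525
```

On this split, both logistic regression scores (0.484 and 0.525) fall slightly below the 0.541 obtained by always predicting class 0, which reflects how imbalanced the classes are.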