1. [50 pts] In this assignment, we will use Apriori analysis to find phrases, or interesting word patterns, in a novel.

Note that you are free to use any Apriori analytics and algorithm library in this assignment. Use the nltk library's gutenberg corpus API to load the novel 'carroll-alice.txt', which is Alice in Wonderland by Lewis Carroll (although his real name was Charles Dodgson). There are 1703 sentences in the novel, which can be represented as 1703 transactions. Use any means you like to parse/extract the words and save them in .csv format to be read by the Weka framework, similar to the Apriori analysis module. (Hint: Feel free to use the mlxtend library instead of Weka.)

Hint: Removing stop words and symbols using regular expressions can be helpful:
```python
import re
from nltk.corpus import gutenberg, stopwords

Stop_words = stopwords.words('english')
Sentences = gutenberg.sents('carroll-alice.txt')
TermsSentences = []

for terms in Sentences:
    # Drop stop words, then keep only tokens that start with two letters
    terms = [w for w in terms if w not in Stop_words]
    terms = [w for w in terms if re.search(r'^[a-zA-Z]{2}', w) is not None]
    TermsSentences.append(terms)  # one transaction per sentence
```

If you choose to use Weka, use FPGrowth and start with the default parameters. Reduce lowerBoundMinSupport until you reach a sweet spot for the support that avoids exploding the number of rules generated.
Report interesting patterns.

(Example: Some of the frequently occurring phrases are “Mock Turtle”, “White Rabbit”, etc.)
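
The effect of lowering the minimum support can be sketched on a toy example. The block below is a brute-force count of single items and pairs, not Apriori or FP-Growth, and the transactions and the helper name `frequent_itemsets` are made up for illustration rather than taken from the novel or from any library. It only shows how quickly the number of surviving itemsets grows as the threshold drops.

```python
from collections import Counter
from itertools import combinations

# Toy transactions standing in for tokenized sentences (illustrative only).
transactions = [
    {"Mock", "Turtle", "said"},
    {"Mock", "Turtle", "sighed"},
    {"Alice", "said"},
    {"Alice", "thought"},
    {"King", "said"},
    {"Alice", "said", "King"},
]

def frequent_itemsets(transactions, min_support, max_len=2):
    """Brute-force count of itemsets up to max_len items (hypothetical helper)."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        for k in range(1, max_len + 1):
            for items in combinations(sorted(t), k):
                counts[items] += 1
    # Keep only itemsets whose relative frequency meets the threshold.
    return {items: c / n for items, c in counts.items() if c / n >= min_support}

# Lowering the threshold lets more itemsets through.
for min_support in (0.5, 0.3, 0.15):
    found = frequent_itemsets(transactions, min_support)
    print(f"min_support={min_support}: {len(found)} itemsets")
```

On real text the growth is far steeper than in this toy case, which is why the threshold is usually lowered in small steps.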

In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.dpi"] = 72
from IPython.display import display
import numpy as np
import pandas as pd
import seaborn as sns; sns.set(style="ticks", color_codes=True)

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [10]:
import nltk
nltk.download('gutenberg')
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/kavyabanerjee/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kavyabanerjee/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kavyabanerjee/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [16]:
from nltk.corpus import gutenberg, stopwords
import re
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Load Alice in Wonderland text
sentences = gutenberg.sents('carroll-alice.txt')

# Define stop words
stop_words = stopwords.words('english')

# Filter sentences to remove stop words and non-alphabetic terms
preprocessed_sentences = []
for sentence in sentences:
    filtered_sentence = [word for word in sentence if word.lower() not in stop_words and re.search(r'^[a-zA-Z]{2,}$', word)]
    if filtered_sentence:  # Ensure the sentence is not empty after filtering
        preprocessed_sentences.append(filtered_sentence)

# Convert the sentences into a one-hot encoded DataFrame
encoder = TransactionEncoder()
onehot = encoder.fit_transform(preprocessed_sentences)
onehot_df = pd.DataFrame(onehot, columns=encoder.columns_)

# Apply the Apriori algorithm to find frequent itemsets
frequent_itemsets = apriori(onehot_df, min_support=0.02, use_colnames=True)

# Generate association rules from the frequent itemsets
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

In [17]:
rules_df = pd.DataFrame(rules, columns = ['antecedents', 'consequents', 'support', 'confidence', 'lift'])
rules_df

# output_csv = 'preprocessed_sentences.csv'
# rules_df.to_csv(output_csv, index=False)

   antecedents  consequents   support  confidence       lift
0     (little)      (Alice)  0.021779    0.318584   1.360774
1      (Alice)     (little)  0.021779    0.093023   1.360774
2      (Alice)       (said)  0.098609    0.421189   1.550612
3       (said)      (Alice)  0.098609    0.363029   1.550612
4      (Alice)    (thought)  0.035088    0.149871    3.34779
5    (thought)      (Alice)  0.035088    0.783784    3.34779
6       (King)       (said)  0.025408    0.711864   2.620739
7       (said)       (King)  0.025408    0.093541   2.620739
8       (Mock)     (Turtle)  0.033878         1.0       28.5
9     (Turtle)       (Mock)  0.033878    0.965517       28.5
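
To make the three metric columns concrete, here is a minimal sketch that computes support, confidence, and lift by hand. The toy transactions and the helper names `support` and `rule_metrics` are assumptions for illustration and are not part of mlxtend; the formulas are the standard definitions.

```python
# Toy transactions (illustrative only, not taken from the novel).
transactions = [
    {"Mock", "Turtle"},
    {"Mock", "Turtle", "said"},
    {"Turtle", "said"},
    {"Alice", "said"},
    {"Alice"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent))
    confidence = both / support(antecedent)
    lift = confidence / support(consequent)
    return both, confidence, lift

print(rule_metrics({"Mock"}, {"Turtle"}))
print(rule_metrics({"Turtle"}, {"Mock"}))
```

Lift is symmetric in the antecedent and consequent, which is why each pair of mirrored rules in the table shares one lift value (for example, 28.5 for both Mock/Turtle rows) while the confidences differ.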


2. [50 pts] In the lecture module, the class NeuralNetMLP implements a neural network with a single hidden layer. Make the necessary modifications to upgrade it to a neural network with two hidden layers. Run it on the MNIST dataset and report its performance.
(Hint: Raschka, Chapter 12)


Modifications include:

- Adding an additional weight matrix to represent the weights between the first and second hidden layers.
- Modifying the _forward method to include computations through the second hidden layer.
- Adjusting the backpropagation steps in the fit method to account for the gradients and updates through the additional hidden layer.
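
Written out as equations, the three modifications amount to the following. This is a sketch in notation assumed here: the superscripts and the symbols $\delta$, $J$, and $\odot$ (element-wise product) do not come from the assignment text. It assumes sigmoid activations $\sigma$ in every layer, no bias units, one-hot targets $Y$, and a summed cross-entropy cost $J$.

```latex
% Forward pass
\begin{aligned}
Z^{(1)} &= X W^{(1)}, & A^{(1)} &= \sigma\!\left(Z^{(1)}\right) \\
Z^{(2)} &= A^{(1)} W^{(2)}, & A^{(2)} &= \sigma\!\left(Z^{(2)}\right) \\
Z^{(\mathrm{out})} &= A^{(2)} W^{(\mathrm{out})}, & A^{(\mathrm{out})} &= \sigma\!\left(Z^{(\mathrm{out})}\right)
\end{aligned}

% Backward pass
\begin{aligned}
\delta^{(\mathrm{out})} &= A^{(\mathrm{out})} - Y \\
\delta^{(2)} &= \left(\delta^{(\mathrm{out})} \, {W^{(\mathrm{out})}}^{\top}\right) \odot A^{(2)} \odot \left(1 - A^{(2)}\right) \\
\delta^{(1)} &= \left(\delta^{(2)} \, {W^{(2)}}^{\top}\right) \odot A^{(1)} \odot \left(1 - A^{(1)}\right)
\end{aligned}

% Gradients and update with learning rate \eta
\begin{aligned}
\nabla_{W^{(\mathrm{out})}} J &= {A^{(2)}}^{\top} \delta^{(\mathrm{out})}, &
\nabla_{W^{(2)}} J &= {A^{(1)}}^{\top} \delta^{(2)}, &
\nabla_{W^{(1)}} J &= X^{\top} \delta^{(1)} \\
W &\leftarrow W - \eta \, \nabla_{W} J
\end{aligned}
```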

In [13]:
def get_acc(_y_test, _y_pred):
    return (np.sum(_y_test == _y_pred)).astype(float) / _y_test.shape[0]

def load_mnist(path, kind='train'):
    from numpy import fromfile, uint8
    import os
    import struct
    
    labels_path = os.path.join(path, '%s-labels.idx1-ubyte' % kind)
    images_path = os.path.join(path, '%s-images.idx3-ubyte' % kind)
    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II', lbpath.read(8))
        labels = fromfile(lbpath, dtype=uint8)
        with open(images_path, 'rb') as imgpath:
            magic, num, rows, cols = struct.unpack(">IIII",imgpath.read(16))
            images = fromfile(imgpath, dtype=uint8).reshape(len(labels), 784)
            images = ((images / 255.) - .5) * 2
    return images, labels

X_train_mnist, y_train_mnist = load_mnist('../../Desktop/APML/Datasets/mnist/', kind='train')
print(f'Rows= {X_train_mnist.shape[0]}, columns= {X_train_mnist.shape[1]}')

X_test_mnist, y_test_mnist = load_mnist('../../Desktop/APML/Datasets/mnist/', kind='t10k')
print(f'Rows= {X_test_mnist.shape[0]}, columns= {X_test_mnist.shape[1]}')

Rows= 60000, columns= 784
Rows= 10000, columns= 784


In [14]:
class NeuralNetMLP(object):

    def __init__(self, n_hidden1=30, n_hidden2=30, epochs=100, eta=0.001, minibatch_size=1, seed=None):
        self.random = np.random.RandomState(seed)  # used to randomize weights
        self.n_hidden1 = n_hidden1  # size of the first hidden layer
        self.n_hidden2 = n_hidden2  # size of the second hidden layer - New addition
        self.epochs = epochs  # number of iterations
        self.eta = eta  # learning rate
        self.minibatch_size = minibatch_size  # size of training batch
        self.w_out, self.w_h1, self.w_h2 = None, None, None  # Initialize w_h2 for the second hidden layer
    
    @staticmethod
    def onehot(_y, _n_classes):  # one hot encode the input class y
        onehot = np.zeros((_n_classes, _y.shape[0]))
        for idx, val in enumerate(_y.astype(int)):
            onehot[val, idx] = 1.0
        return onehot.T
    
    @staticmethod
    def sigmoid(_z):  # Sigmoid activation function
        return 1.0 / (1.0 + np.exp(-np.clip(_z, -250, 250)))

    def _forward(self, _X):
        # First hidden layer
        z_h1 = np.dot(_X, self.w_h1)
        a_h1 = self.sigmoid(z_h1)
        
        # Second hidden layer - New addition
        z_h2 = np.dot(a_h1, self.w_h2)  # Input for h2 is output of h1
        a_h2 = self.sigmoid(z_h2)  # Activation for h2
        
        # Output layer
        z_out = np.dot(a_h2, self.w_out)  # Input for out is output of h2
        a_out = self.sigmoid(z_out)
        return z_h1, a_h1, z_h2, a_h2, z_out, a_out  # Include z_h2 and a_h2 in the return statement

    @staticmethod
    def compute_cost(y_enc, output):  # Compute the cost
        term1 = -y_enc * (np.log(output))
        term2 = (1.0-y_enc) * np.log(1.0-output)
        cost = np.sum(term1 - term2)
        return cost

    def predict(self, _X):
        # Forward pass to predict
        z_h1, a_h1, z_h2, a_h2, z_out, a_out = self._forward(_X)
        y_pred = np.argmax(z_out, axis=1)
        return y_pred

    def fit(self, _X_train, _y_train, _X_valid, _y_valid):
        n_output = np.unique(_y_train).shape[0]  # number of class labels
        n_features = _X_train.shape[1]
        
        # Initialize weights for all layers
        self.w_h1 = self.random.normal(loc=0.0, scale=0.1, size=(n_features, self.n_hidden1))
        self.w_h2 = self.random.normal(loc=0.0, scale=0.1, size=(self.n_hidden1, self.n_hidden2))  # New addition for the second hidden layer
        self.w_out = self.random.normal(loc=0.0, scale=0.1, size=(self.n_hidden2, n_output))

        y_train_enc = self.onehot(_y_train, n_output)  # one-hot encode original y

        for ei in range(self.epochs):  # Training epochs
            indices = np.arange(_X_train.shape[0])
            
            for start_idx in range(0, indices.shape[0] - self.minibatch_size + 1, self.minibatch_size):
                batch_idx = indices[start_idx:start_idx + self.minibatch_size]
                
                # Forward pass
                z_h1, a_h1, z_h2, a_h2, z_out, a_out = self._forward(_X_train[batch_idx])
                
                # Backpropagation
                sigmoid_derivative_h2 = a_h2 * (1.0 - a_h2)  # Derivative for h2
                
                delta_out = a_out - y_train_enc[batch_idx]  # Output layer error
                delta_h2 = np.dot(delta_out, self.w_out.T) * sigmoid_derivative_h2  # Error for h2
                
                sigmoid_derivative_h1 = a_h1 * (1.0 - a_h1)
                delta_h1 = np.dot(delta_h2, self.w_h2.T) * sigmoid_derivative_h1  # Error for h1
                
                # Gradients for weight updates
                grad_w_out = np.dot(a_h2.T, delta_out)
                grad_w_h2 = np.dot(a_h1.T, delta_h2)  # New addition for second hidden layer
                grad_w_h1 = np.dot(_X_train[batch_idx].T, delta_h1)
                
                # Update weights
                self.w_out -= self.eta * grad_w_out
                self.w_h2 -= self.eta * grad_w_h2  # Update for second hidden layer
                self.w_h1 -= self.eta * grad_w_h1

            # Evaluation after each epoch
            z_h1, a_h1, z_h2, a_h2, z_out, a_out = self._forward(_X_train)
            cost = self.compute_cost(y_enc=y_train_enc, output=a_out)
            y_train_pred = self.predict(_X_train)
            y_valid_pred = self.predict(_X_valid)
            train_acc = ((np.sum(_y_train == y_train_pred)).astype(float) / _X_train.shape[0])
            valid_acc = ((np.sum(_y_valid == y_valid_pred)).astype(float) / _X_valid.shape[0])
            print('\r%d/%d | Cost: %.2f ' '| Train/Valid Acc.: %.2f%%/%.2f%% ' %
                  (ei+1, self.epochs, cost, train_acc*100, valid_acc*100), end='')
        
        return self
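
One way to gain confidence in the extra backpropagation step is a finite-difference gradient check. The sketch below re-implements the forward pass, cost, and gradients as standalone functions on a tiny random problem; the function names (`forward`, `cost`, `backprop`, `numerical_grad`) and the toy shapes are assumptions made for illustration, not part of the class above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -250, 250)))

def forward(X, w_h1, w_h2, w_out):
    a_h1 = sigmoid(X @ w_h1)
    a_h2 = sigmoid(a_h1 @ w_h2)
    a_out = sigmoid(a_h2 @ w_out)
    return a_h1, a_h2, a_out

def cost(X, Y, w_h1, w_h2, w_out):
    # Summed cross-entropy over all samples and output units
    a_out = forward(X, w_h1, w_h2, w_out)[2]
    return -np.sum(Y * np.log(a_out) + (1.0 - Y) * np.log(1.0 - a_out))

def backprop(X, Y, w_h1, w_h2, w_out):
    # Analytic gradients, using the same delta recursion as in fit()
    a_h1, a_h2, a_out = forward(X, w_h1, w_h2, w_out)
    delta_out = a_out - Y
    delta_h2 = (delta_out @ w_out.T) * a_h2 * (1.0 - a_h2)
    delta_h1 = (delta_h2 @ w_h2.T) * a_h1 * (1.0 - a_h1)
    return X.T @ delta_h1, a_h1.T @ delta_h2, a_h2.T @ delta_out

def numerical_grad(f, w, eps=1e-5):
    # Central differences, perturbing one weight at a time in place
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        plus = f()
        w[idx] = old - eps
        minus = f()
        w[idx] = old
        grad[idx] = (plus - minus) / (2.0 * eps)
    return grad

rng = np.random.RandomState(0)
X = rng.normal(size=(5, 4))                     # 5 samples, 4 features
Y = np.eye(3)[rng.randint(0, 3, size=5)]        # one-hot targets, 3 classes
weights = [rng.normal(scale=0.1, size=s) for s in [(4, 6), (6, 5), (5, 3)]]

analytic = backprop(X, Y, *weights)
loss = lambda: cost(X, Y, *weights)
max_err = max(np.max(np.abs(g - numerical_grad(loss, w)))
              for g, w in zip(analytic, weights))
print(f"max |analytic - numerical| = {max_err:.2e}")
```

A small maximum difference suggests the deltas and gradients agree with the cost function; a large one usually points to a transposed matrix or a missing derivative term.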

In [15]:
from sklearn.metrics import confusion_matrix

nn = NeuralNetMLP(n_hidden1=20, n_hidden2=15, epochs=300, eta=0.0005, minibatch_size=100, seed=1)
nn.fit(X_train_mnist[:55000], y_train_mnist[:55000], X_train_mnist[55000:], y_train_mnist[55000:])
print()  # end the in-place progress line written by fit()

y_pred = nn.predict(X_test_mnist)

print(f'Accuracy= {get_acc(y_test_mnist, y_pred)*100:.2f}%')
print(confusion_matrix(y_test_mnist, y_pred))

300/300 | Cost: 6944.08 | Train/Valid Acc.: 98.40%/96.26% 
Accuracy= 94.82%
[[ 955    0    3    0    1    5    7    6    2    1]
 [   0 1114    5    1    0    2    2    3    8    0]
 [   6    3  974   16    8    1    7    5   12    0]
 [   3    0   13  956    0   20    0    9    7    2]
 [   1    2    8    0  931    1    9    1    4   25]
 [   8    1    0   15    3  837    8    5    8    7]
 [   8    1    1    2    3   14  920    0    9    0]
 [   2    7   15   10   12    1    0  963    1   17]
 [   7    2    9   19    7   14    8    5  896    7]
 [   1    9    1   13   18    6    2   11   12  936]]
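
The confusion matrix can also be read per digit. The sketch below copies the matrix printed above into a plain list (rows are true labels and columns are predictions, following scikit-learn's convention) and derives per-class recall and precision; the variable names are illustrative.

```python
# Confusion matrix copied from the output above:
# rows are true digits, columns are predicted digits.
cm = [
    [955, 0, 3, 0, 1, 5, 7, 6, 2, 1],
    [0, 1114, 5, 1, 0, 2, 2, 3, 8, 0],
    [6, 3, 974, 16, 8, 1, 7, 5, 12, 0],
    [3, 0, 13, 956, 0, 20, 0, 9, 7, 2],
    [1, 2, 8, 0, 931, 1, 9, 1, 4, 25],
    [8, 1, 0, 15, 3, 837, 8, 5, 8, 7],
    [8, 1, 1, 2, 3, 14, 920, 0, 9, 0],
    [2, 7, 15, 10, 12, 1, 0, 963, 1, 17],
    [7, 2, 9, 19, 7, 14, 8, 5, 896, 7],
    [1, 9, 1, 13, 18, 6, 2, 11, 12, 936],
]

n = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))
accuracy = correct / total

# Recall: correct predictions divided by the row sum (all true instances of the digit).
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]
# Precision: correct predictions divided by the column sum (all predictions of the digit).
precision = [cm[j][j] / sum(row[j] for row in cm) for j in range(n)]

for digit, (r, p) in enumerate(zip(recall, precision)):
    print(f"digit {digit}: recall={r:.4f} precision={p:.4f}")
print(f"overall accuracy={accuracy:.4f}")
```

The diagonal sums to 9482 of 10000 test images, matching the reported 94.82% accuracy, and the digit with the lowest recall is 8.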
