# 1. [50 pts]

In this assignment, we will use Apriori analysis to find phrases, or interesting
patterns, in a novel.
Use the <code>nltk</code> library's <code>gutenberg</code> corpus API to load the novel 'carroll-alice.txt', which is
Alice in Wonderland by L. Carroll. There are 1703 sentences in the novel, which can be
represented as 1703 transactions. Use any means to parse/extract words and save them in CSV
format to be read by the Weka framework, similar to the Apriori Analysis module.
Hint: Removing stop words and using regular expressions can be helpful:

<code>import re
from nltk.corpus import gutenberg, stopwords
Stop_words = stopwords.words('english')
Sentences = gutenberg.sents('carroll-alice.txt')
TermsSentences = []
for terms in Sentences:
     terms = [w for w in terms if w not in Stop_words]
     terms = [w for w in terms if re.search(r'^[a-zA-Z]{2}', w) is not None]
     TermsSentences.append(terms)</code>
 
Use FPGrowth and start with the default parameters. Reduce lowerBoundMinSupport until you
reach a sweet spot for the support that avoids exploding the number of rules generated.
Report interesting patterns.
(Example: Some of the frequently occurring phrases are Mock Turtle, White Rabbit, etc.)
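
The parsing and CSV step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the two toy sentences and the small `stop_words` set stand in for `gutenberg.sents(...)` and `stopwords.words('english')` so that it runs without NLTK data, and `clean` is a hypothetical helper name. It also lowercases each token before the stop-word test, which is a variation on the hint (the hint compares case-sensitively, so capitalized stop words such as "The" would pass through). The row layout Weka expects is not specified in the assignment text, so one transaction per row is an assumption too.

```python
import csv
import io
import re

# Toy stand-ins (assumptions) for the NLTK sentences and stop-word list.
sentences = [
    ['The', 'White', 'Rabbit', 'was', 'late', '.'],
    ['Alice', 'saw', 'the', 'Mock', 'Turtle', '!'],
]
stop_words = {'the', 'was', 'a', 'and'}

def clean(tokens):
    """Keep tokens that start with two letters and are not stop words."""
    return [
        w for w in tokens
        if w.lower() not in stop_words and re.search(r'^[a-zA-Z]{2}', w)
    ]

transactions = [clean(s) for s in sentences]

# Write one transaction per row; rows may differ in length.
buffer = io.StringIO()
csv.writer(buffer).writerows(transactions)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; swapping `io.StringIO()` for `open(path, 'w', newline='')` would write a file instead.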

In [1]:
import re
import pandas as pd

In [2]:
# Load the novel corpus
from nltk.corpus import gutenberg, stopwords

In [3]:
# Load the stop words and the novel's tokenized sentences (filtering happens in the next cell)
Stop_words = stopwords.words('english')
Sentences = gutenberg.sents('carroll-alice.txt')

# Sentences is a list of lists. The nested lists contain the strings/words of each sentence
# len( Sentences ) = 1703

In [4]:
# iterate through the list of strings running against RegEx and Stop_words

# Initialize a list to hold the processed lists of strings (This replaces Sentences)
TermsSentences = [ ]

for terms in Sentences:
    terms = [w for w in terms if w not in Stop_words]
    terms = [w for w in terms if re.search(r'^[a-zA-Z]{2}', w) is not None]
    TermsSentences.append( terms )

In [5]:
# Import FPGrowth implementation so we can avoid Weka
# !conda install -c conda-forge mlxtend
from mlxtend.frequent_patterns import fpgrowth
from mlxtend.preprocessing import TransactionEncoder

In [6]:
# Transform TermsSentences, which is a nested list of strings, into a one-hot encoded dataframe

te = TransactionEncoder( )
te_ary = te.fit( TermsSentences ).transform( TermsSentences )
df = pd.DataFrame( te_ary, columns = te.columns_ )
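
To make the one-hot step concrete, here is a plain-Python sketch of the same idea on toy data. It is not how `TransactionEncoder` is implemented; it only illustrates the shape of the result, with one row per transaction and one boolean column per distinct word. The toy `transactions` list is an assumption standing in for `TermsSentences`.

```python
# Toy transactions (assumption); the real input would be TermsSentences.
transactions = [
    ['White', 'Rabbit', 'late'],
    ['Alice', 'White', 'Rabbit'],
    ['Alice', 'said'],
]

# A sorted vocabulary gives a stable column order.
columns = sorted({item for t in transactions for item in t})

# One row per transaction, one boolean per vocabulary item.
one_hot = []
for t in transactions:
    present = set(t)
    one_hot.append([item in present for item in columns])

print(columns)
for row in one_hot:
    print(row)
```

Because the sort is by character code, capitalized words come before lowercase ones.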

In [7]:
# Look at results
df

,ALICE,ALL,AND,ARE,AT,Ada,Adventures,Advice,After,Ah,...,years,yelled,yelp,yer,yes,yesterday,yet,young,youth,zigzag
0,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1698,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1699,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1700,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1701,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [8]:
# Run fpgrowth on dataframe
results = fpgrowth( df, min_support = 0.01, use_colnames = True, max_len = 2 )
results[ 130 : 170 ]

,support,itemsets
130,0.011157,"(get, Alice)"
131,0.017029,"(could, Alice)"
132,0.014093,"(Alice, would)"
133,0.011744,"(would, said)"
134,0.012918,"(Rabbit, White)"
135,0.012918,"(say, Alice)"
136,0.013506,"(way, Alice)"
137,0.016442,"(much, Alice)"
138,0.011744,"(much, said)"
139,0.012331,"(think, Alice)"


In [9]:
# Check whether the pair { 'White', 'Rabbit' } is included in the results df, as the assignment suggests

# Each itemset is a frozenset, so its iteration order is arbitrary; compare sets rather than joined strings
target = frozenset( [ 'White', 'Rabbit' ] )

for i, itemset in results[ 'itemsets' ].items( ):
    if itemset == target:
        print( 'Index: ', i, ' ', 'White Rabbit' )

# The pair is present: row 134 of the fpgrowth output lists (Rabbit, White) with support 0.012918.
# Joining the set's elements into a string and comparing against 'White Rabbit' misses it,
# because the set happens to iterate as 'Rabbit White'.
# For reference, a min_support of 0.0005 returns 70938 single or paired words.

# 2. [50 pts]

In the lecture module, the class <code>NeuralNetMLP</code> is a single hidden layer neural
network implementation. Make the necessary modifications to upgrade it to a 2 hidden
layer network. Run it on the MNIST dataset and report its performance.

(Hint: Raschka, Chapter 12)
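
One way to write the forward and backward passes for two hidden layers is sketched below. It assumes sigmoid activations in every layer, no bias terms, and the summed logistic cost; the symbols ($X$ for a minibatch, $Y$ for its one-hot labels, $W^{(\cdot)}$ for the weight matrices, $\odot$ for the element-wise product) are notation chosen here rather than taken from the lecture module.

```latex
\begin{aligned}
Z^{(h_1)} &= X\,W^{(h_1)}, & A^{(h_1)} &= \sigma\big(Z^{(h_1)}\big) \\
Z^{(h_2)} &= A^{(h_1)}\,W^{(h_2)}, & A^{(h_2)} &= \sigma\big(Z^{(h_2)}\big) \\
Z^{(out)} &= A^{(h_2)}\,W^{(out)}, & A^{(out)} &= \sigma\big(Z^{(out)}\big) \\[4pt]
\delta^{(out)} &= A^{(out)} - Y \\
\delta^{(h_2)} &= \big(\delta^{(out)}\,(W^{(out)})^{\top}\big) \odot A^{(h_2)} \odot \big(1 - A^{(h_2)}\big) \\
\delta^{(h_1)} &= \big(\delta^{(h_2)}\,(W^{(h_2)})^{\top}\big) \odot A^{(h_1)} \odot \big(1 - A^{(h_1)}\big) \\[4pt]
\nabla W^{(out)} &= (A^{(h_2)})^{\top}\,\delta^{(out)} \\
\nabla W^{(h_2)} &= (A^{(h_1)})^{\top}\,\delta^{(h_2)} \\
\nabla W^{(h_1)} &= X^{\top}\,\delta^{(h_1)}
\end{aligned}
```

The only structural change from a single hidden layer is the extra $\delta^{(h_2)} \to \delta^{(h_1)}$ step: the error is propagated back through $W^{(h_2)}$ and scaled by the sigmoid derivative of the first hidden layer.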

In [10]:
import numpy as np

In [11]:
# Load in the MNIST dataset

def load_mnist(path, kind='train'):
    from numpy import fromfile, uint8
    import os
    import struct
    
    labels_path = os.path.join(path, '%s-labels-idx1-ubyte' % kind)
    images_path = os.path.join(path, '%s-images-idx3-ubyte' % kind)
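    # Label file: 8-byte header (magic number, item count), then one uint8 per label
    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II', lbpath.read(8))
        labels = fromfile(lbpath, dtype=uint8)
    # Image file: 16-byte header, then 28 x 28 = 784 uint8 pixels per image
    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack(">IIII",imgpath.read(16))
        images = fromfile(imgpath, dtype=uint8).reshape(len(labels), 784)
    # Scale pixel values from [0, 255] to [-1, 1]
    images = ((images / 255.) - .5) * 2
    return images, labels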

X_train, y_train = load_mnist('mnist/', kind='train')
print(f'Rows= {X_train.shape[0]}, columns= {X_train.shape[1]}')

X_test, y_test = load_mnist('mnist/', kind='t10k')
print(f'Rows= {X_test.shape[0]}, columns= {X_test.shape[1]}')

Rows= 60000, columns= 784
Rows= 10000, columns= 784


In [12]:
# Define the MLP class: two hidden layers of n_hidden units each, sigmoid activations, no bias terms

class NeuralNetMLP( object ):
    
    def __init__( self, n_hidden = 30, epochs = 100, eta = 0.0001, minibatch_size = 100, seed = None ):
        self.random = np.random.RandomState( seed )
        self.n_hidden = n_hidden
        self.epochs = epochs
        self.eta = eta
        self.minibatch_size = minibatch_size
        
    @staticmethod
    def onehot( y, n_classes ):
        onehot = np.zeros( ( n_classes, y.shape[ 0 ] ) )
        for idx, val in enumerate( y.astype( int ) ):
            onehot[ val, idx ] = 1.0
        return( onehot.T )
    
    @staticmethod
    def sigmoid( z ):
        return( 1.0 / ( 1.0 + np.exp( -np.clip( z, -250, 250 ) ) ) )
    
    def _forward( self, X ): 
        # Equation 2
        z_h1 = np.dot( X, self.w_h1 )
        a_h1 = self.sigmoid( z_h1 )
        
        z_h2 = np.dot( a_h1, self.w_h2 )
        a_h2 = self.sigmoid( z_h2 )
        
        z_out = np.dot( a_h2, self.w_out )
        a_out = self.sigmoid( z_out )
        
        return( z_h1, a_h1, z_h2, a_h2, z_out, a_out )
    
    @staticmethod
    def compute_cost( y_enc, output ):
        term1 = -y_enc * ( np.log( output ) )
        term2 = ( 1.0 - y_enc ) * np.log( 1.0 - output )
        cost = np.sum( term1 - term2 )
        return( cost )
    
    def predict( self, X ):
        z_h1, a_h1, z_h2, a_h2, z_out, a_out = self._forward( X )
        y_pred = np.argmax( z_out, axis = 1 )
        return( y_pred )
    
    def fit( self, X_train, y_train, X_valid, y_valid ):
        import sys
        
        n_output = np.unique( y_train ).shape[ 0 ]
        n_features = X_train.shape[ 1 ]
        
        self.w_out = self.random.normal( loc = 0.0, scale = 0.1, size = ( self.n_hidden, n_output ) )
        
        self.w_h1 = self.random.normal( loc = 0.0, scale = 0.1, size = ( n_features, self.n_hidden ) )
        self.w_h2 = self.random.normal( loc = 0.0, scale = 0.1, size = ( self.n_hidden, self.n_hidden ) )
        
        y_train_enc = self.onehot( y_train, n_output )
        
        for i in range( self.epochs ):
            indices = np.arange( X_train.shape[ 0 ] )
            
            for start_idx in range( 0, indices.shape[ 0 ] - self.minibatch_size + 1, self.minibatch_size ):
                batch_idx = indices[ start_idx : start_idx + self.minibatch_size ]
                
                z_h1, a_h1, z_h2, a_h2, z_out, a_out = self._forward( X_train[ batch_idx ] )
                
                # Equation 3
                sigmoid_derivative_ah1 = a_h1 * ( 1.0 - a_h1 )
                sigmoid_derivative_ah2 = a_h2 * ( 1.0 - a_h2 )
                
                # Equation 5
                delta_out = a_out - y_train_enc[ batch_idx ]
                
                # Equation 6 
                delta_h2 = ( np.dot( delta_out, self.w_out.T ) * sigmoid_derivative_ah2 )
                delta_h1 = ( np.dot( delta_h2, self.w_h2.T ) * sigmoid_derivative_ah1 )
                
                # Equation 7
                grad_w_out = np.dot( a_h2.T, delta_out )
                
                # Equation 8
                grad_w_h2 = np.dot( a_h1.T, delta_h2 )
                grad_w_h1 = np.dot( X_train[ batch_idx ].T, delta_h1 )
                
                # Equation 9
                self.w_out -= self.eta * grad_w_out
                
                self.w_h1 -= self.eta * grad_w_h1
                self.w_h2 -= self.eta * grad_w_h2
                
                
            # Evaluation after each epoch during training
            z_h1, a_h1, z_h2, a_h2, z_out, a_out = self._forward( X_train )
            
            cost = self.compute_cost( y_enc = y_train_enc, output = a_out )
            y_train_pred = self.predict( X_train )  # monitoring training progress through reclassification
            y_valid_pred = self.predict( X_valid )  # monitoring training progress through validation
            train_acc = ( ( np.sum( y_train == y_train_pred ) ).astype( float ) / X_train.shape[ 0 ] )
            valid_acc = ( ( np.sum( y_valid == y_valid_pred ) ).astype( float ) / X_valid.shape[ 0 ] )
            sys.stderr.write( '\r%d/%d | Cost: %.2f ' '| Train/Valid Acc.: %.2f%%/%.2f%% '%
                ( i+1, self.epochs, cost, train_acc * 100, valid_acc * 100 ) )
            sys.stderr.flush( )
            
        return self
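
A numerical gradient check is one way to gain confidence in the two-hidden-layer backpropagation. The sketch below re-implements the same forward pass, cost, and delta equations as standalone functions on a tiny random problem, then compares the analytic gradients against central differences. The function names, layer sizes, and `eps` value are assumptions made for this illustration and are not part of `NeuralNetMLP`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, w_h1, w_h2, w_out):
    a_h1 = sigmoid(X @ w_h1)
    a_h2 = sigmoid(a_h1 @ w_h2)
    a_out = sigmoid(a_h2 @ w_out)
    return a_h1, a_h2, a_out

def cost(X, Y, w_h1, w_h2, w_out):
    # Summed logistic cost, matching compute_cost in the class.
    a_out = forward(X, w_h1, w_h2, w_out)[2]
    return np.sum(-Y * np.log(a_out) - (1.0 - Y) * np.log(1.0 - a_out))

def backprop(X, Y, w_h1, w_h2, w_out):
    # Same delta and gradient equations as in fit().
    a_h1, a_h2, a_out = forward(X, w_h1, w_h2, w_out)
    delta_out = a_out - Y
    delta_h2 = (delta_out @ w_out.T) * a_h2 * (1.0 - a_h2)
    delta_h1 = (delta_h2 @ w_h2.T) * a_h1 * (1.0 - a_h1)
    return X.T @ delta_h1, a_h1.T @ delta_h2, a_h2.T @ delta_out

def numerical_grad(X, Y, weights, k, eps=1e-5):
    """Central-difference gradient of the cost with respect to weights[k]."""
    grad = np.zeros_like(weights[k])
    for idx in np.ndindex(*weights[k].shape):
        original = weights[k][idx]
        weights[k][idx] = original + eps
        plus = cost(X, Y, *weights)
        weights[k][idx] = original - eps
        minus = cost(X, Y, *weights)
        weights[k][idx] = original  # restore the perturbed entry
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

# Tiny random problem: 5 samples, 4 features, 6 hidden units, 3 classes.
rng = np.random.RandomState(1)
X = rng.normal(size=(5, 4))
Y = np.eye(3)[rng.randint(0, 3, size=5)]
weights = [rng.normal(scale=0.1, size=s) for s in [(4, 6), (6, 6), (6, 3)]]

analytic = backprop(X, Y, *weights)
max_errors = [
    np.max(np.abs(analytic[k] - numerical_grad(X, Y, weights, k)))
    for k in range(3)
]
print(max_errors)
```

If the three maximum absolute errors are tiny compared with the gradient magnitudes, the delta equations agree with the cost; a large error in only one entry would point at the corresponding layer.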

In [13]:
# Define and fit the neural network
nn = NeuralNetMLP( n_hidden = 20, epochs = 300, eta = 0.0005, minibatch_size = 100, seed = 1 )

nn.fit( X_train = X_train[ : 55000 ], y_train = y_train[ : 55000 ], X_valid = X_train[ 55000 : ], y_valid = y_train[ 55000 : ] )

300/300 | Cost: 7849.39 | Train/Valid Acc.: 98.05%/95.74%  

<__main__.NeuralNetMLP at 0x1984fdaaec8>
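
The progress line above reports training and validation accuracy only. Performance on the held-out test set could be measured with a helper like the one below. This is a sketch: `accuracy` is a hypothetical helper, and the toy label arrays stand in for `y_test` and `nn.predict( X_test )`, so no test-set figure is claimed here.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Toy labels (assumption); in the notebook these would be
# y_test and nn.predict(X_test).
y_true = np.array([7, 2, 1, 0, 4])
y_pred = np.array([7, 2, 1, 0, 9])
print(f'Test accuracy: {accuracy(y_true, y_pred) * 100:.2f}%')
```

Scoring the test set once, after all hyperparameters are fixed, keeps the number an honest estimate of generalization.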