# Setting up our Notebook

In [1]:
import numpy as np 
import pandas as pd
import re, gensim
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk import word_tokenize
from nltk.corpus import stopwords
import os

# Loading in the Data and Exploration

In [2]:
# Strip non-ASCII characters, which would otherwise cause Unicode errors downstream

def remove_non_ascii(text):
    return ''.join([char for char in text if ord(char) < 128])

In [3]:
def load_data():
    
    # os.listdir() returns the names of all files and directories
    # in the specified directory.
    
    negative_review_strings = os.listdir('Movie Reviews/review_data/tokens/neg')
    positive_review_strings = os.listdir('Movie Reviews/review_data/tokens/pos')
    negative_reviews, positive_reviews = [], []
    
    for positive_review in positive_review_strings:
        with open('Movie Reviews/review_data/tokens/pos/'+str(positive_review), 'r') as positive_file:
            positive_reviews.append(remove_non_ascii(positive_file.read()))
    # Note: open() in 'r' mode returns a file object, so we call .read() to get its contents as a string
    
    for negative_review in negative_review_strings:
        with open('Movie Reviews/review_data/tokens/neg/'+str(negative_review), 'r') as negative_file:
            negative_reviews.append(remove_non_ascii(negative_file.read()))
    
    negative_labels, positive_labels = np.repeat(0, len(negative_reviews)), np.repeat(1, len(positive_reviews))
    # One label per review: 0 for negative, 1 for positive
    
    labels = np.concatenate([negative_labels, positive_labels])
    # Full label vector, negatives first
    reviews = np.concatenate([negative_reviews, positive_reviews])
    # All reviews, in the same order: negative, then positive
    
    # From here, we could sample the rows if we wanted, or we could leave it the full dataset
    # (np.random.random_integers is deprecated; np.random.randint excludes the upper bound)
#     rows = np.random.randint(0, len(reviews), len(reviews)-1) 
#     data = pd.DataFrame({'Label': labels[rows], 'Text': reviews[rows]})

    # Building the frame from a dict keeps Label as integers; stacking both
    # columns into a single NumPy array would convert the labels to strings
    data = pd.DataFrame({'Label': labels, 'Text': reviews})

    return data

In [4]:
movie_data = load_data()

In [5]:
movie_data.head()

|   | Label | Text |
|---|-------|------|
| 0 | 0 | tristar / 1 : 30 / 1997 / r ( language , viole... |
| 1 | 0 | arlington road 1/4 . directed by mark pellingt... |
| 2 | 0 | the brady bunch movie is less a motion picture... |
| 3 | 0 | janeane garofalo in a romantic comedy -- it wa... |
| 4 | 0 | i'm going to keep this plot summary brief , so... |


In [6]:
movie_data['Text']

0       tristar / 1 : 30 / 1997 / r ( language , viole...
1       arlington road 1/4 . directed by mark pellingt...
2       the brady bunch movie is less a motion picture...
3       janeane garofalo in a romantic comedy -- it wa...
4       i'm going to keep this plot summary brief , so...
                              ...                        
1395    one of the last entries in the long-running ca...
1396    hype ? sheesh , like no other . this side of t...
1397    for those of us who weren't yet born when the ...
1398    what starts out as a monotonous talking-head m...
1399    jackie brown ( miramax - 1997 ) starring pam g...
Name: Text, Length: 1400, dtype: object

# Latent Dirichlet Allocation

## Mathematical Definition

*Latent Dirichlet Allocation* (LDA) is a form of unsupervised learning used for topic modeling. We begin by defining some notation:

1. A word is the basic unit of data, drawn from a vocabulary of size $V$.
2. A document is a vector of $N$ words, $\mathbf{w}=(w_1,\dots,w_N)$. Documents need not all have the same length, but assuming a common length $N$ makes the math easier and doesn't change anything substantial.
3. A corpus is a collection of $M$ documents, which will be denoted $D=(\mathbf{w}_1,\dots,\mathbf{w}_M)$.

Further, we assume that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. In other words, we have one Dirichlet distribution describing how topics are distributed within each document and another Dirichlet distribution describing how words are distributed within each topic. Within each document, we pick a topic for every word position through a multinomial distribution, and within each topic, we pick words through a multinomial distribution as well.

Suppose we believe there are $K$ topics in our corpus. Because we model the distribution of topics within each document as a Dirichlet distribution, each document $m$ has a $K$-dimensional vector $\mathbf{\theta}_m \in \Delta^{K-1}$, where $\theta_{m,k} \in [0,1]$ is the share of topic $k$ in document $m$, drawn from a Dirichlet distribution with parameters $\mathbf{\alpha}$. Likewise, each topic $k$ has a $V$-dimensional vector $\mathbf{\phi}_k \in \Delta^{V-1}$, where $\phi_{k,v} \in [0,1]$ is the probability of word $v$ in topic $k$, drawn from a Dirichlet distribution with parameters $\mathbf{\beta}$. We write the topic associated with word $n$ in document $m$, that is, the topic associated with $w_{m,n}$, as $z_{m,n}\in\{1,\dots,K\}$. We can also write our list of topics, with length $K$, as $\mathbf{z}$. With this, we can write out our document generation procedure.

1. Draw $M$ parameter vectors $\mathbf{\theta}_m$, one per document, from $Dir(\mathbf{\alpha})$
2. Draw $K$ parameter vectors $\mathbf{\phi}_k$, one per topic, from $Dir(\mathbf{\beta})$
3. Generate each word $w_{m,n}$ in document $m$ in two steps:
    1. Draw $z_{m,n}$ from $Multi(\mathbf{\theta}_m)$
    2. Draw $w_{m,n}$ from $Multi(\mathbf{\phi}_{z_{m,n}})$
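
The generation procedure above can be simulated directly. The following is a minimal sketch, not part of the original notebook: the sizes `M`, `K`, `V`, `N` and the prior values are arbitrary illustrative choices, and words are represented by integer indices into a hypothetical vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): documents, topics, vocabulary size, words per document
M, K, V, N = 4, 3, 8, 6
alpha = np.full(K, 0.5)   # Dirichlet parameters over topics (arbitrary value)
beta = np.full(V, 0.5)    # Dirichlet parameters over words (arbitrary value)

# Step 1: one topic-share vector theta_m per document
theta = rng.dirichlet(alpha, size=M)   # shape (M, K)
# Step 2: one word-probability vector phi_k per topic
phi = rng.dirichlet(beta, size=K)      # shape (K, V)

# Step 3: for every word position, draw a topic, then a word from that topic
z = np.empty((M, N), dtype=int)
w = np.empty((M, N), dtype=int)
for m in range(M):
    for n in range(N):
        z[m, n] = rng.choice(K, p=theta[m])        # z_{m,n} ~ Multi(theta_m)
        w[m, n] = rng.choice(V, p=phi[z[m, n]])    # w_{m,n} ~ Multi(phi_{z_{m,n}})

print("topic assignments:\n", z)
print("word indices:\n", w)
```

Each row of `theta` and `phi` lies on a simplex, so it sums to one, which is what lets it serve as the probability vector of a multinomial draw.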

Then our model can be written as:

$$P(\mathbf{W}, \mathbf{Z}, \mathbf{\Theta}, \mathbf{\Phi} | \mathbf{\alpha}, \mathbf{\beta})=\prod_{k=1}^{K}p(\mathbf{\phi}_k|\mathbf{\beta})\prod_{m=1}^{M}p(\mathbf{\theta}_m|\mathbf{\alpha})\prod_{n=1}^{N}p(z_{m,n}|\mathbf{\theta}_m)p(w_{m,n}|\mathbf{\phi}_{z_{m,n}})$$

In the above, $\mathbf{W}$ is the collection of all words in all documents, $\mathbf{Z}$ is the collection of all topic assignments in all documents, $\mathbf{\Theta}$ is the collection of all document-topic distributions $\mathbf{\theta}_m$, and $\mathbf{\Phi}$ is the collection of all topic-word distributions $\mathbf{\phi}_k$. This is thus our probability model for an entire corpus. 
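
To make the factorization concrete, here is a small sketch that evaluates the logarithm of the joint probability for hand-picked toy values. The helper names `log_dirichlet_pdf` and `log_joint` are hypothetical, as are all the numbers; words and topics are zero-based integer indices.

```python
import numpy as np
from math import lgamma

def log_dirichlet_pdf(x, conc):
    """Log density of a Dirichlet with parameters `conc` at a point `x` on the simplex."""
    log_norm = lgamma(conc.sum()) - sum(lgamma(c) for c in conc)
    return log_norm + np.sum((conc - 1.0) * np.log(x))

def log_joint(w, z, theta, phi, alpha, beta):
    """log P(W, Z, Theta, Phi | alpha, beta), following the product formula term by term."""
    total = sum(log_dirichlet_pdf(phi_k, beta) for phi_k in phi)   # topic terms
    for m in range(len(w)):
        total += log_dirichlet_pdf(theta[m], alpha)               # document term
        for n in range(len(w[m])):
            total += np.log(theta[m, z[m][n]])        # p(z_{m,n} | theta_m)
            total += np.log(phi[z[m][n], w[m][n]])    # p(w_{m,n} | phi_{z_{m,n}})
    return total

# Toy values (assumptions): M=2 documents, K=2 topics, V=3 words, N=2 words each
alpha = np.array([1.0, 1.0])
beta = np.array([1.0, 1.0, 1.0])
theta = np.array([[0.5, 0.5], [0.25, 0.75]])
phi = np.array([[0.5, 0.25, 0.25], [0.2, 0.3, 0.5]])
z = [[0, 1], [1, 1]]
w = [[0, 2], [1, 2]]

value = log_joint(w, z, theta, phi, alpha, beta)
print(value)
```

With all Dirichlet parameters equal to one, both priors are uniform, so the result depends on the data only through the multinomial terms.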

We then need to train our model. A common training method, used for example in scikit-learn and Gensim, is *Online Variational Bayes*.
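
As a rough illustration of the online flavor of training, scikit-learn's `LatentDirichletAllocation` exposes a `partial_fit` method that updates the model one mini-batch at a time. The sketch below uses a made-up eight-document corpus and an arbitrary batch size; it is not the configuration used later in this notebook.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus (assumption), two loose themes
docs = [
    "alien ship lands on mars",
    "space ship fights alien fleet",
    "mars mission finds alien life",
    "space crew repairs the ship",
    "wedding singer falls in love",
    "romantic comedy about a wedding",
    "singer finds love in the city",
    "comedy about love and marriage",
]
counts = CountVectorizer().fit_transform(docs)   # sparse document-term matrix

lda = LatentDirichletAllocation(
    n_components=2,
    learning_method='online',
    learning_offset=50.,
    total_samples=len(docs),   # corpus size, used to scale mini-batch updates
    random_state=0,
)

batch_size = 4   # arbitrary mini-batch size
for start in range(0, counts.shape[0], batch_size):
    lda.partial_fit(counts[start:start + batch_size])   # one online update per batch

doc_topics = lda.transform(counts)   # per-document topic shares
print(doc_topics.round(2))
```

On a corpus this small the topics are not meaningful; the point is only the shape of the training loop.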

In [8]:
np.random.seed(2018)
n_topics = 10
stop_words = stopwords.words('english')
n_frequent_words = 1500
n_components = 10

## LDA with BoW and TFIDF

In [9]:
def print_topics(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)     
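
The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_topics` is compact enough to deserve a worked example. The sketch below uses made-up weights and words: `argsort` orders indices from the smallest to the largest weight, and the negative-step slice walks backwards from the end, yielding the `n_top_words` largest entries in descending order.

```python
import numpy as np

# Made-up weights of five words in one topic (assumption)
topic = np.array([0.1, 0.7, 0.05, 0.4, 0.9])
feature_names = np.array(["plot", "alien", "the", "ship", "space"])
n_top_words = 3

ascending = topic.argsort()                 # indices from smallest to largest weight
top_idx = ascending[:-n_top_words - 1:-1]   # last three indices, reversed

print(ascending)                # → [2 0 3 1 4]
print(top_idx)                  # → [4 1 3]
print(feature_names[top_idx])   # → ['space' 'alien' 'ship']
```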

In [15]:
def sklearn_topic_model(feat_extractor):
    
    def create_topic_model(model, n_topics=10, max_iter=5, min_df=10, 
                           max_df=300, stop_words='english', token_pattern=r'\w+'):
        
        print(model + ' topic model:')
        data = load_data()['Text']
        if feat_extractor == 'bow':
            feature_extractor = CountVectorizer(min_df=min_df, max_df=max_df, 
                                                stop_words=stop_words, token_pattern=token_pattern)
        elif feat_extractor == 'tfidf':
            feature_extractor = TfidfVectorizer(min_df=min_df, max_df=max_df, 
                                                stop_words=stop_words, token_pattern=token_pattern)
        else:
            raise ValueError("feat_extractor must be 'bow' or 'tfidf'")
            
        processed_data = feature_extractor.fit_transform(data) # Document-term matrix of counts or TF-IDF weights
        lda_model = LatentDirichletAllocation(n_components=n_topics, learning_method='online', 
                                              learning_offset=50., max_iter=max_iter, verbose=0) # Defining the model      
        lda_model.fit(processed_data) # Fitting the model
        features = feature_extractor.get_feature_names_out() # Vocabulary terms, in column order
        print_topics(model=lda_model, feature_names=features, n_top_words=10) # Ten top words per topic

    create_topic_model(model=feat_extractor)   

In [16]:
sklearn_topic_model('bow')

bow topic model:
Topic #0: uk 2000 patch 1 visit produced photographed copyright certificate scale
Topic #1: farrelly senseless wayans marlon joshua senses awake bowling kingpin angel
Topic #2: original family michael david guy american d effects woman high
Topic #3: sandler wedding crystal adam burns singer billy francis buddy julia
Topic #4: neeson wars liam lucas phantom jedi zeta menace haunting jones
Topic #5: sam macdonald taste amusement revenge weaver ensues entirety sandler juice
Topic #6: ruth mcgregor house residents angela simon paul imagery bleak warren
Topic #7: van damme en n claude z met er die knock
Topic #8: apes carpenter ape sly mars professor snake jane ghosts jungle
Topic #9: epps ribisi squad ellis danes blonde giovanni omar silver claire


In [77]:
sklearn_topic_model('tfidf')

tfidf topic model:
Topic #0: mars carpenter franklin ghosts gibson killer sexual paul evil daughter
Topic #1: wild west smith tom patch family kline peter deep chris
Topic #2: grant l hell helen van jessica mario reeves jones question
Topic #3: family michael original david 10 effects woman american guy high
Topic #4: redman bvoice bloomington godzilla blonde sex michael gorilla francis contacted
Topic #5: sandler harrelson woody lynch virus russell item michael series drive
Topic #6: 10 garofalo ribisi todd danes squad epps seagal dialogue watching
Topic #7: redman 2000 kim girls jackie spice macdonald wrestling daniel uk
Topic #8: perry natasha outside species patrick reviewed beast alec smart attributes
Topic #9: mary angel stupid irene kevin breakfast cool bizarre jr richard


## LDA with Gensim

In [11]:
def gensim_topic_model():
    
    def remove_stop_words(text): 
        word_tokens = word_tokenize(text.lower()) # Lower-case the text, then tokenize it
        # Keep tokens that are not stop words and that begin with at least three letters or hyphens
        word_tokens = [word for word in word_tokens if word not in stop_words and re.match(r'[a-zA-Z\-][a-zA-Z\-]{2,}', word)]
        return word_tokens

    data = load_data()['Text']
    cleaned_data = [remove_stop_words(text) for text in data]  # Applying the above function to every review
    print('Cleaned Data Shape:', len(cleaned_data))
    dictionary = gensim.corpora.Dictionary(cleaned_data) # Generating a dictionary for the cleaned data
    print('Dictionary shape:', len(dictionary))
    # Note: no_below is an absolute document count, while no_above is a fraction of the corpus.
    # With no_above=1000 the upper filter never triggers, so only tokens found in at least 500 reviews are kept.
    dictionary.filter_extremes(no_below=500, no_above=1000)
    corpus = [dictionary.doc2bow(text) for text in cleaned_data] # Transforming each document into a BoW over the
    # filtered dictionary; tokens removed by filter_extremes are dropped
    print('Corpus shape:', len(corpus))
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=n_topics, id2word=dictionary)   
    print('Gensim LDA implemenation: ')
    for _id in range(n_topics):
        header = str('Topic #%s: '%(_id))  
        tail = str(lda_model.print_topic(_id, 10))
        print(header + tail)

In [79]:
gensim_topic_model()

Cleaned Data Shape: 1400
Dictionary shape: 38960
Corpus shape: 1400
Gensim LDA implemenation: 
Topic #0: 0.099*"movie" + 0.061*"even" + 0.057*"one" + 0.052*"like" + 0.048*"film" + 0.036*"good" + 0.030*"plot" + 0.030*"movies" + 0.030*"get" + 0.029*"also"
Topic #1: 0.051*"life" + 0.047*"film" + 0.046*"would" + 0.045*"movies" + 0.044*"movie" + 0.043*"like" + 0.042*"one" + 0.038*"much" + 0.036*"time" + 0.034*"two"
Topic #2: 0.084*"one" + 0.078*"movie" + 0.054*"like" + 0.048*"film" + 0.046*"plot" + 0.042*"would" + 0.039*"two" + 0.034*"time" + 0.031*"really" + 0.030*"could"
Topic #3: 0.109*"film" + 0.073*"one" + 0.048*"like" + 0.043*"story" + 0.039*"first" + 0.038*"much" + 0.037*"would" + 0.032*"characters" + 0.032*"movie" + 0.031*"good"
Topic #4: 0.139*"film" + 0.056*"one" + 0.052*"even" + 0.041*"time" + 0.038*"like" + 0.031*"people" + 0.031*"character" + 0.030*"movie" + 0.029*"would" + 0.027*"much"
Topic #5: 0.176*"film" + 0.044*"one" + 0.044*"movie" + 0.039*"time" + 0.038*"like" + 0.035*"

# Non-Negative Matrix Factorization

## Mathematical Definition

In *Non-Negative Matrix Factorization* (NMF), we seek to factor a non-negative matrix $X$ into two matrices $W$ and $H$, also non-negative. The inner dimension of the factorization can be chosen and represents something about the problem at hand. For this script, given that we are doing topic modeling, our matrix $X$ will be an $m \times n$ matrix of features of a corpus, for example Bag-of-Words or TF-IDF, with one row per document and one column per word. After specifying the number of topics $K$, we will have $W$, an $m \times K$ matrix whose columns correspond to the topics and whose rows give each document's weight on every topic, and $H$, a $K \times n$ matrix in which each row gives one topic's weights over the words, allowing us to reconstruct the documents. 
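
The shapes involved are easy to check on a small example. The sketch below factors a random non-negative matrix with scikit-learn's `NMF`; the matrix, its size, and the `init` and `max_iter` settings are illustrative assumptions, not the settings used in this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((6, 5))   # stand-in for 6 documents and 5 words, all entries non-negative

model = NMF(n_components=2, init='nndsvda', max_iter=500, random_state=0)
W = model.fit_transform(X)   # document-topic weights
H = model.components_        # topic-word weights

print(W.shape, H.shape)      # → (6, 2) (2, 5)
print(np.linalg.norm(X - W @ H))   # reconstruction error (Frobenius norm)
```

Note that `model.components_` is the matrix $H$, which is why `print_topics` reads the top words of each topic from its rows.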

Then, $X \approx WH$. There are several ways to find these matrices, but the following function from scikit-learn will minimize the following loss function:

$$ \mathcal{L}(W,H)=d_{loss}(X,WH)+\left(\alpha_W \rho ||W||_1+\frac{\alpha_W (1-\rho)}{2}||W||^2 _{Fro}\right)n_{feat} +
\left(\alpha_H \rho ||H||_1 +\frac{\alpha_H (1-\rho)}{2}||H||^2 _{Fro}\right)n_{samples}$$

Here $\alpha_W$ and $\alpha_H$ are the regularization strengths for $W$ and $H$ (`alpha_W` and `alpha_H` in scikit-learn), and $\rho \in [0,1]$ sets the mix between the $L1$ and Frobenius penalties (`l1_ratio`).

In the above, writing $Y=WH$, the $d_{loss}(X,WH)$ can be the squared Frobenius norm, the Kullback-Leibler divergence, or the Itakura-Saito divergence:

$$ d_{Fro}(X,WH)=\frac{1}{2}||X-WH||^2 _{Fro}=\frac{1}{2} \sum_{i,j} (X_{i,j}-Y_{i,j})^2$$

$$ d_{KLD}(X,WH) = \sum_{i,j} \left( X_{i,j}\log \left(\frac{X_{i,j}}{Y_{i,j}}\right) -X_{i,j} +Y_{i,j} \right)$$

$$ d_{IS}(X,WH) = \sum_{i,j} \left( \frac{X_{i,j}}{Y_{i,j}}-\log \left(\frac{X_{i,j}}{Y_{i,j}}\right) -1 \right)$$

Finally, $||A||_1$ is the elementwise $L1$ norm, or $||A||_1 = \sum_{i,j} |A_{i,j}|$.
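
The three distances and the elementwise $L1$ norm can be written out in a few lines of NumPy. This is a sketch for checking the formulas by hand, with hypothetical function names and a tiny made-up factorization; it assumes strictly positive entries so that the logarithms and ratios are defined.

```python
import numpy as np

def d_fro(X, Y):
    """Half the squared Frobenius distance between X and Y."""
    return 0.5 * np.sum((X - Y) ** 2)

def d_kld(X, Y):
    """Generalized Kullback-Leibler divergence (entries must be positive)."""
    return np.sum(X * np.log(X / Y) - X + Y)

def d_is(X, Y):
    """Itakura-Saito divergence (entries must be positive)."""
    ratio = X / Y
    return np.sum(ratio - np.log(ratio) - 1)

def l1_norm(A):
    """Elementwise L1 norm."""
    return np.sum(np.abs(A))

# Made-up example (assumption): a rank-1 factorization that misses one entry of X by 1
X = np.array([[1.0, 2.0], [3.0, 4.0]])
W = np.array([[1.0], [2.0]])
H = np.array([[1.0, 2.0]])
Y = W @ H   # [[1, 2], [2, 4]]

print(d_fro(X, Y))   # → 0.5
print(d_kld(X, Y))
print(d_is(X, Y))
print(l1_norm(W), l1_norm(H))   # → 3.0 3.0
```

All three distances are zero when $X = WH$ exactly, and only the single mismatched entry contributes here.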

In [12]:
def nmf_topic_model(feat_extractor):

    def create_topic_model(model, n_topics=10, max_iter=5, min_df=10, 
                           max_df=300, stop_words='english', token_pattern=r'\w+'):
        print(model + ' NMF topic model: ')
        data = load_data()['Text']
        if feat_extractor == 'bow':
            feature_extractor = CountVectorizer(min_df=min_df, max_df=max_df, 
                                                stop_words=stop_words, token_pattern=token_pattern)
        elif feat_extractor == 'tfidf':
            feature_extractor = TfidfVectorizer(min_df=min_df, max_df=max_df, 
                                                stop_words=stop_words, token_pattern=token_pattern)

        processed_data = feature_extractor.fit_transform(data)
        nmf_model = NMF(n_components=n_topics, max_iter=max_iter)      
        nmf_model.fit(processed_data)
        features = feature_extractor.get_feature_names_out()
        print_topics(model=nmf_model, feature_names=features, n_top_words=n_topics)
        return nmf_model, processed_data, feature_extractor
           
    create_topic_model(model=feat_extractor)

In [81]:
nmf_topic_model('bow')

bow NMF topic model: 
Topic #0: woman wife michael d david job high guy play sense
Topic #1: jackie tarantino brown chan master drunken hong arts martial fu
Topic #2: scream 2 horror williamson sidney craven killer 3 sequel kevin
Topic #3: effects special space planet wars ship alien earth science computer
Topic #4: disney original family toy animated voice children woody joe kids
Topic #5: black van war white american soldiers men lee battle sweet
Topic #6: 10 7 8 joblo 5 didn guy 4 critique reviews
Topic #7: chicken run park gibson lord mel nick rocky idea voice
Topic #8: wild jay max sam kevin smith sex campbell van west
Topic #9: alien patrick species html shannon original ca mun rated r




In [82]:
nmf_topic_model('tfidf')

tfidf NMF topic model: 




Topic #0: woman american wife sex city david black night job half
Topic #1: alien effects ship space aliens special planet mars virus sci
Topic #2: scream horror williamson 2 craven sidney killer stab kevin campbell
Topic #3: jackie chan tarantino brown tucker mr martial hong arts grier
Topic #4: 10 joblo 7 8 5 critique 4 visit 9 didn
Topic #5: redman bvoice bloomington michael mailto indiana 20redman contacted archive reviews_by
Topic #6: disney animated animation toy family voice joe children bug kids
Topic #7: van en damme n z met soldier er knock wild
Topic #8: sandler wedding adam harry singer armageddon julia robbie willis gilmore
Topic #9: mail reviews ott ejohnsonott nuvo subscribe e johnson carpenter bloom
