# Appendix A - Supporting Python Code

## 1) Clustering

### 1.1) Loading The Corpus

Let's begin by importing our corpus: the dataset of movie reviews.

In [3]:
# Create corpus dataframe
import pandas as pd
import numpy as np
from dataclasses import dataclass

# Create Document class
@dataclass
class Document:
    doc_id: str
    text: str

def add_movie_descriptor(data: pd.DataFrame, corpus_df: pd.DataFrame):
    """
    Adds a "Descriptor" column to `data`, in the form {Genre}_{Movie Title}_{P|N}_{Doc_ID},
    built from the columns of `corpus_df`.
    """
    review = np.where(corpus_df['Review Type (pos or neg)'] == 'Positive', 'P', 'N')
    data['Descriptor'] = corpus_df['Genre of Movie'] + '_' + corpus_df['Movie Title'] + '_' + review + '_' + corpus_df['Doc_ID'].astype(str)

def get_corpus_df(path):
    data = pd.read_csv(path, encoding="utf-8")
    add_movie_descriptor(data, data)
    sorted_data = data.sort_values(['Descriptor'])
    indexed_data = sorted_data.set_index(['Doc_ID'])
    indexed_data['Doc_ID'] = indexed_data.index
    return indexed_data

# Define documents in this new class
corpus_df = get_corpus_df('MSDS453_ClassCorpus_Final_Sec56_v4_20230702.csv')
documents = [Document(x, y) for x, y in zip(corpus_df.Doc_ID, corpus_df.Text)]

### 1.2) Exploratory Data Analysis

Having imported the corpus of movie reviews, let's conduct exploratory data analysis on the raw dataset.

In [4]:
corpus_df.shape

(190, 9)

In [5]:
corpus_df.head(4).T

Doc_ID,101,102,103,104
DSI_Title,SAR_Doc1_Covenant,SAR_Doc2_Covenant,SAR_Doc3_Covenant,SAR_Doc4_Covenant
Submission File Name,SAR_Doc1_Covenant,SAR_Doc2_Covenant,SAR_Doc3_Covenant,SAR_Doc4_Covenant
Student Name,SAR,SAR,SAR,SAR
Genre of Movie,Action,Action,Action,Action
Review Type (pos or neg),Negative,Negative,Negative,Negative
Movie Title,Covenant,Covenant,Covenant,Covenant
Text,Nearly two years after the American military w...,Have Guy Ritchie and Jake Gyllenhaal switched ...,Guy Ritchie's The Covenant notably marks the f...,"In a weird throwback to those Chuck Norris ""Mi..."
Descriptor,Action_Covenant_N_101,Action_Covenant_N_102,Action_Covenant_N_103,Action_Covenant_N_104
Doc_ID,101,102,103,104


In [6]:
print(corpus_df.info());

<class 'pandas.core.frame.DataFrame'>
Int64Index: 190 entries, 101 to 217
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   DSI_Title                 190 non-null    object
 1   Submission File Name      190 non-null    object
 2   Student Name              190 non-null    object
 3   Genre of Movie            190 non-null    object
 4   Review Type (pos or neg)  190 non-null    object
 5   Movie Title               190 non-null    object
 6   Text                      190 non-null    object
 7   Descriptor                190 non-null    object
 8   Doc_ID                    190 non-null    int64 
dtypes: int64(1), object(8)
memory usage: 14.8+ KB
None


In [7]:
print(corpus_df['Movie Title'].unique())

['Covenant' 'Inception' 'No time to die' 'Taken' 'The Dark Knight Rises'
 'Despicable Me 3' 'Holmes and Watson' 'Legally Blonde' 'Lost City'
 'Sisters' 'Drag Me to Hell' 'Fresh' 'It Chapter Two' 'The Toxic Avenger'
 'US' 'Annihilation' 'Minority Report' 'Oblivion' 'Pitch Black']


In [8]:
# Gather the number of reviews by genre
counts_df = corpus_df[['Genre of Movie']].copy()
counts_df['Count'] = 1
counts_df.groupby(['Genre of Movie']).count().reset_index()

Unnamed: 0,Genre of Movie,Count
0,Action,50
1,Comedy,50
2,Horror,50
3,Sci-Fi,40


In [9]:
corpus_df.columns

Index(['DSI_Title', 'Submission File Name', 'Student Name', 'Genre of Movie',
       'Review Type (pos or neg)', 'Movie Title', 'Text', 'Descriptor',
       'Doc_ID'],
      dtype='object')

### 1.3) Data Wrangling 

### 1.4) Vectorization

### 1.5) Clustering Experiments

#### 1.5.1) K-Means Clustering Experiments

Let's conduct our first clustering experiments via K-Means Clustering to determine whether we can successfully cluster movie reviews based on review type, genre, or movie title.

In [None]:
from sklearn.cluster import KMeans

def k_means(titles, tfidf_matrix, k=3):
    """
    Runs K-Means on the TF-IDF matrix.
    Inputs: document titles, the TF-IDF matrix, and the desired k value.
    Returns: a dict mapping each cluster number to its document titles, the list of
    cluster labels, and a dataframe indicating the cluster number per document.
    Note: the processed text is read from the module-level `final_processed_text`.
    """
    km = KMeans(n_clusters=k, random_state=89)
    km.fit(tfidf_matrix)
    clusters = km.labels_.tolist()

    frame = pd.DataFrame({'Cluster': clusters, 'Doc Name': titles, 'Text': final_processed_text})

    # Note: Doc2Vec clusters will not have individual words, because the vector representation
    # is based on the entire document rather than individual words. As a result, there won't be
    # individual word outputs from each cluster.

    # Dictionary to store clusters and their respective titles
    cluster_title = {}
    for i in range(k):
        cluster_title[i] = frame.loc[frame['Cluster'] == i, 'Doc Name'].tolist()

    return cluster_title, clusters, frame


tfidf_matrix = tfidf(final_processed_text, titles, ngram_range = (1,1))

cluster_title,clusters,k_means_df = k_means(titles, tfidf_matrix, k=20)

cluster_title[9]

plot_tfidf_matrix(cluster_title,clusters,tfidf_matrix)

# Encode review type as a binary label: 0 = negative, 1 = positive
labels = corpus_df['Review Type (pos or neg)'].apply(lambda x: 0 if x.lower().split(' ')[0] == 'negative' else 1)
print(labels)
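
One way to judge whether the clusters line up with review type, genre, or movie title is to compare the cluster assignments against those known labels. The sketch below is illustrative only: it uses a hypothetical four-review toy corpus (`toy_reviews`, `toy_genres`) rather than the class corpus, builds its own TF-IDF matrix with scikit-learn's `TfidfVectorizer` instead of the `tfidf` helper used above, and scores agreement with the adjusted Rand index, a metric this appendix does not otherwise use.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy reviews standing in for the corpus (not taken from the dataset)
toy_reviews = [
    "explosive action chase with gunfire and explosions",
    "gunfire explosions and a relentless action chase",
    "hilarious comedy full of jokes and laughs",
    "jokes laughs and a hilarious comedy script",
]
toy_genres = ["Action", "Action", "Comedy", "Comedy"]

# Vectorize the toy reviews and cluster them into two groups
toy_matrix = TfidfVectorizer().fit_transform(toy_reviews)
toy_clusters = KMeans(n_clusters=2, n_init=10, random_state=89).fit_predict(toy_matrix)

# Adjusted Rand index: 1.0 means the clusters match the labels exactly,
# values near 0 mean agreement is no better than chance
ari = adjusted_rand_score(toy_genres, toy_clusters)
print(f"Adjusted Rand index vs. genre: {ari:.2f}")
```

With the real corpus, the same call could take `corpus_df['Genre of Movie']` (or the review type or movie title column) as the reference labels and the `clusters` list returned by `k_means`, provided both are in the same document order.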

#### 1.5.2) Hierarchical Clustering Experiments

In [None]:
########################### THIS MAY BE A STRETCH GOAL SECTION ################################################

#### 1.5.3) DBSCAN Clustering Experiments

In [None]:
################################ THIS MAY BE A STRETCH GOAL SECTION ##########################################

## 2) Sentiment Analysis

Let's examine the effectiveness of various classification algorithms for conducting sentiment analysis on our corpus. We apply Random Forest, Naive Bayes, and Support Vector Machine classifiers to predict whether each review is positive or negative.

### 2.1) Random Forest Classifier Experiments

Let's apply random forest classification to our corpus to see how effectively we can predict which reviews are positive or negative.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

def classifiers(x, y, model_type, cv=3):
    """
    Fits one of four models and returns its accuracy on a held-out test split.
    model_type options:
      'svm'          = Support Vector Machine
      'logistic'     = Logistic Regression
      'naive_bayes'  = Multinomial Naive Bayes
      'randomforest' = Random Forest
    cv is reserved for cross-validation but is not used yet; the score comes
    from a single 90/10 train/test split.
    """
    if model_type == 'svm':
        print("svm")
        model = SVC()

    elif model_type == 'logistic':
        print("logistic")
        model = LogisticRegression()

    elif model_type == 'naive_bayes':
        print("naive_bayes")
        model = MultinomialNB()

    elif model_type == 'randomforest':
        print("randomforest")
        model = RandomForestClassifier()

    else:
        # Fail early instead of hitting a NameError on `model` below
        raise ValueError(f"Unknown model_type: {model_type!r}")

    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.10, random_state=23)
    model.fit(X_train, y_train)

    predictions = model.predict(X_test)
    accy = accuracy_score(y_test, predictions)
    return accy

classifiers(tfidf_matrix, labels, 'randomforest')
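
Because `test_size=0.10` leaves only 19 of the 190 reviews for testing, a single accuracy score can vary noticeably with the split. The `cv` argument of `classifiers` is accepted but never used; the sketch below shows one way k-fold cross-validation could be applied with scikit-learn's `cross_val_score`. It is an assumption about how `cv` might be wired in, not part of the original experiments, and it runs on synthetic data (`X_toy`, `y_toy`) rather than `tfidf_matrix` and `labels`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (tfidf_matrix, labels), sized like the 190-review corpus
X_toy, y_toy = make_classification(n_samples=190, n_features=20, random_state=23)

# 3-fold cross-validation: each fold is held out once while the other two train the model
model = RandomForestClassifier(random_state=23)
scores = cross_val_score(model, X_toy, y_toy, cv=3, scoring="accuracy")

print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.3f}")
```

Reporting the mean across folds gives a steadier estimate than one 90/10 split, at the cost of fitting the model once per fold.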

### 2.2) Naive Bayes Classifier Experiments

Let's apply Naive Bayes Classification to our corpus to determine how effectively we can predict which reviews are positive or negative.

In [None]:
classifiers(tfidf_matrix, labels, 'naive_bayes')

### 2.3) Support Vector Machine Classification Experiments

Let's apply Support Vector Machine Classification to our corpus to determine how effectively we can predict which reviews are positive or negative.

In [None]:
classifiers(tfidf_matrix, labels, 'svm')

## 3) Multi-Class Classification

Let's apply the same three classification algorithms to our corpus to determine how well Random Forest Classification, Naive Bayes Classification, and Support Vector Machine Classification predict movie genre from the movie review data.

### 3.1) Random Forest Classifier Experiments

Let's apply Random Forest Classification to our corpus to predict genre based on the movie review data.

### 3.2) Naive Bayes Classifier Experiments

Let's apply Naive Bayes Classification to our corpus to examine how well we can predict movie genre based on the movie review.

### 3.3) Support Vector Machine Experiments

Let's apply Support Vector Machine Classification to our corpus data to examine how well we can predict movie genre using the movie review data.
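
The three genre experiments above can be sketched with one loop over the classifiers. This is a minimal illustration under stated assumptions: the reviews in `train_texts` are hypothetical toy sentences rather than corpus documents, the genre names mirror the four values of `Genre of Movie`, and each model is wrapped in a scikit-learn pipeline with its own `TfidfVectorizer` instead of the appendix's `tfidf` helper.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy reviews, two per genre (not taken from the class corpus)
train_texts = [
    "a relentless chase with gunfire and explosions",
    "the hero fights through explosions and a car chase",
    "a hilarious script packed with jokes and laughs",
    "silly jokes and big laughs from start to finish",
    "a scary ghost haunts the old house at night",
    "terrifying screams as the monster stalks the house",
    "astronauts explore an alien planet in deep space",
    "a spaceship crew meets an alien in distant space",
]
train_genres = ["Action", "Action", "Comedy", "Comedy",
                "Horror", "Horror", "Sci-Fi", "Sci-Fi"]

models = {
    "randomforest": RandomForestClassifier(random_state=23),
    "naive_bayes": MultinomialNB(),
    "svm": SVC(),
}

# Fit each classifier on TF-IDF features and predict the genre of one unseen review
predictions = {}
for name, clf in models.items():
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    pipeline.fit(train_texts, train_genres)
    predictions[name] = pipeline.predict(["a ghost haunts the house"])[0]

print(predictions)
```

All three estimators handle more than two classes without extra configuration, so on the real corpus the main change from the sentiment experiments would be swapping the binary `labels` for the genre column, for example `corpus_df['Genre of Movie']`.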

## 4) Topic Modeling

### 4.1) Latent Semantic Analysis Experiments

### 4.2) Latent Dirichlet Allocation Experiments