<a href="https://colab.research.google.com/github/JacobTumak/SentimentAnalysisProject/blob/main/SA_Data_PreProcessing_(IMDB).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Importing/Installing Packages**

In [None]:
import json
import nltk
from nltk import (sent_tokenize, word_tokenize)
from nltk.corpus import stopwords
nltk.download(['punkt', 'stopwords'])
stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:

stop_words.add('br')
# 'br' shows up frequently in the review text as a line-break marker.
# Adding it to the stop-word set now will prevent it from
# becoming the most frequent word later on.
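
An alternative to the stop-word approach is to strip the break markers before tokenizing. The sketch below assumes the breaks appear in the raw reviews as literal HTML tags such as `<br />`; `strip_br_tags` is a hypothetical helper, not part of the notebook.

```python
import re

# Assumption: breaks appear as literal HTML tags such as "<br />" or "<br>"
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def strip_br_tags(text):
    """Replace each HTML line-break tag with a single space."""
    return BR_TAG.sub(" ", text)

sample = "Great film.<br /><br />Would watch again."
cleaned = strip_br_tags(sample)
print(cleaned)
```

With the tags removed up front, 'br' never reaches the tokenizer, so the extra stop word would no longer be needed.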

# **Importing Datasets from Google Drive**

Read the files from Drive into a list of reviews. Drive must be mounted with the same Google account that Colab is using, or the mount will fail.

In [None]:
from google.colab import drive # Must use same account as the notebook is in, otherwise it won't mount
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [None]:
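def fetch_data(identifier, min_index, max_index): # Uses "append" method
  output_var = []
  for i in range(min_index, max_index + 1):
    # The number after the underscore is not known in advance, so try each candidate
    for j in range(-1, 11):
      try:
        # The with-block closes each file once it has been read
        with open(f'/content/gdrive/My Drive/Train/{identifier}/{i}_{j}.txt', 'r') as review_file:
          output_var.append(review_file.read())
      except FileNotFoundError:
        # No file with this suffix; move on to the next candidate
        continue
  return output_var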
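
The lookup above tries every possible suffix for each index. A minimal sketch of an alternative, assuming the files follow the `<index>_<number>.txt` naming used above, is to let `pathlib` match whatever suffix exists. The name `fetch_data_glob` and its `folder` parameter are hypothetical.

```python
from pathlib import Path

def fetch_data_glob(folder, min_index, max_index):
    """Read every '<index>_<suffix>.txt' file in folder whose index is in range."""
    reviews = []
    for i in range(min_index, max_index + 1):
        # Sort the matches so the reviews come back in a stable order
        for path in sorted(Path(folder).glob(f"{i}_*.txt")):
            reviews.append(path.read_text(encoding="utf-8"))
    return reviews

# Possible usage with the Drive layout from this notebook:
# data_test_pos = fetch_data_glob('/content/gdrive/My Drive/Train/pos', 0, 10)
```

This avoids opening files that do not exist, at the cost of one directory scan per index.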

Compile the positive and negative reviews with indices 0 through 10 into their respective lists. This is just for testing; the final trial will load more reviews.

In [None]:
data_test_pos = fetch_data('pos', 0,10)
data_test_neg = fetch_data('neg', 0, 10)

# **Tokenizing Raw-Data into word-lists**

In [None]:
def process_data(data_set):
  # Separate each review string into a list of word tokens in original order
  tokenized = [word_tokenize(data) for data in data_set]
  # Flatten into one word list, keeping only alphabetic tokens that are not
  # stop words, and lowercase each kept word
  return [word.lower() for review in tokenized for word in review
          if word.isalpha() and word.lower() not in stop_words]
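
To show what the filtering step keeps and drops without needing the NLTK downloads, here is a small sketch. It swaps in a simple regex tokenizer for `word_tokenize` and uses a tiny hypothetical stop-word set, so its tokens will not always match NLTK's.

```python
import re

# Hypothetical stand-in for the NLTK stop-word set, kept tiny for illustration
demo_stop_words = {"the", "was", "a", "and", "br"}

def filter_words(reviews, stop_words):
    """Flatten reviews into one list of lowercase, alphabetic, non-stop-word tokens."""
    words = []
    for review in reviews:
        # Regex tokenizer (an assumption): runs of word characters, or single punctuation marks
        for token in re.findall(r"\w+|[^\w\s]", review):
            if token.isalpha() and token.lower() not in stop_words:
                words.append(token.lower())
    return words

demo_reviews = ["The movie was GREAT, a 10/10!", "Boring and slow."]
print(filter_words(demo_reviews, demo_stop_words))
# prints ['movie', 'great', 'boring', 'slow']
```

Punctuation and numbers fail the `isalpha()` check, and the stop-word test is done on the lowercased token so capitalized stop words are dropped too.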

In [None]:
filtered_pos = process_data(data_test_pos)
filtered_neg = process_data(data_test_neg)

Make a function that produces a dictionary with the words from both word lists as keys and each word's frequency in the first list as the value. Words that appear only in the second list get a count of 0.

In [None]:
def word_frequency(list1, list2):

  all_words = list1 + list2
  top_words = {word : 0 for word in all_words}
  
  for word in list1:
    top_words[word] += 1

  # Previous, clunkier algorithm:
  # top_words = {}
  # for word in list1:
  #   if word in top_words.keys():
  #     top_words[word] += 1
  #   else:
  #     top_words[word] = 1
  #   for word in list2:
  #     if word not in list1:
  #       top_words[word] = 0
        
  return dict(sorted(top_words.items(), key=lambda x:x[1], reverse=True))
  # return top_words
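
The same counting can be sketched with `collections.Counter` from the standard library. The name `word_frequency_counter` is hypothetical; the intent is to return the same dictionary as the function above, including zero counts for words seen only in the second list.

```python
from collections import Counter

def word_frequency_counter(list1, list2):
    """Count words in list1; words that occur only in list2 get a count of 0."""
    counts = Counter(list1)
    for word in list2:
        # Add a zero entry only if the word was not already counted
        counts.setdefault(word, 0)
    # Sort by count, highest first; ties keep their first-seen order
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))

print(word_frequency_counter(["good", "good", "fun"], ["bad", "good"]))
# prints {'good': 2, 'fun': 1, 'bad': 0}
```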

Compile the negative and positive word counts into dictionaries so the most frequent words in each set can be compared.

In [None]:
neg_words = word_frequency(filtered_neg, filtered_pos)
pos_words = word_frequency(filtered_pos, filtered_neg)

Next, find the most common words used in both sets of reviews (pos and neg) so they can be filtered out.

In [None]:
# This was my first attempt at structuring the data
# first_try = [{word : ({'Positive':[{'Int':pos_words[word]}, {'Dist':float(pos_words[word]/(pos_words[word]+neg_words[word]))}]}, {'Negative':[{'Int':neg_words[word]}, {'Dist':float(neg_words[word])/(pos_words[word]+neg_words[word])}]})} for word in pos_words if word in neg_words]

In [None]:
all_word_data = {}
for word in pos_words:
  if word in neg_words:
    total = pos_words[word] + neg_words[word]
    all_word_data[word] = {
      'Positive': {'Int': pos_words[word], 'Dist': round(pos_words[word] / total, 2)},
      'Negative': {'Int': neg_words[word], 'Dist': round(neg_words[word] / total, 2)},
    }

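Creates a dataset of word data, accessed as *all_word_data[word]['Positive' or 'Negative']['Int' or 'Dist']*, where 'Int' is the word's count in that set of reviews and 'Dist' is that count's share of the word's total occurrences.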
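
As an illustration of how the nested structure described above can be read, the sketch below uses a small made-up sample in the same shape. The sample values and the helper `skewed_words` are hypothetical.

```python
# Hypothetical sample in the same shape as the notebook's word data
sample_word_data = {
    "great": {"Positive": {"Int": 3, "Dist": 0.75}, "Negative": {"Int": 1, "Dist": 0.25}},
    "plot": {"Positive": {"Int": 2, "Dist": 0.5}, "Negative": {"Int": 2, "Dist": 0.5}},
    "awful": {"Positive": {"Int": 0, "Dist": 0.0}, "Negative": {"Int": 4, "Dist": 1.0}},
}

def skewed_words(word_data, label, min_dist):
    """Return words whose share of occurrences under `label` is at least min_dist."""
    return [word for word, stats in word_data.items() if stats[label]["Dist"] >= min_dist]

# Direct lookup: how many times "great" appeared in positive reviews
print(sample_word_data["great"]["Positive"]["Int"])

# Words that lean strongly toward one label
print(skewed_words(sample_word_data, "Negative", 0.7))
print(skewed_words(sample_word_data, "Positive", 0.7))
```

A word such as "plot", split evenly between both sets, is excluded by either threshold, which is the kind of word the filtering step aims to drop.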

# **Compiling and Exporting Data to use in statistical-based sentiment analysis**

In [None]:
# with open("/content/gdrive/My Drive/Train/test2_file.txt", "w") as test_file:
#      test_file.write(json.dumps(all_word_data))
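
The export and a later reload can be checked with a quick round trip. This is a minimal sketch that writes to a temporary file instead of Drive; the path and the sample dictionary are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical sample in the same shape as the word data built earlier
sample_word_data = {
    "great": {"Positive": {"Int": 3, "Dist": 0.75}, "Negative": {"Int": 1, "Dist": 0.25}},
}

# Temporary path standing in for the Drive file
path = os.path.join(tempfile.mkdtemp(), "word_data.json")

# json.dump writes straight to the file object
with open(path, "w") as out_file:
    json.dump(sample_word_data, out_file)

# json.load reads the same nested dictionary back
with open(path, "r") as in_file:
    restored = json.load(in_file)

print(restored == sample_word_data)
```

Because the structure holds only strings, integers, and floats, it should survive the JSON round trip unchanged.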

# **Adverb and adjective based analysis trial**
The previous method of processing data builds toward a machine learning model based on word-use analysis.

My aim in the following section is to develop a system that does sentence-based analysis rather than analysis based solely on words.
My first idea is to compile lists of positive and negative adjectives and adverbs (modifier words) and search for them sentence by sentence.
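
A rough sketch of that sentence-by-sentence search is shown below. The word lists are hypothetical placeholders for the ones stored in the spreadsheet, and the regex sentence split is a simplification; `sent_tokenize` from NLTK would handle abbreviations and other edge cases better.

```python
import re

# Hypothetical word lists; the real ones come from the spreadsheet
modifiers = {
    "pos adj": {"great", "brilliant"},
    "neg adj": {"dull", "awful"},
}

def find_modifiers(text, modifiers):
    """Return, for each sentence, the listed words found in each category."""
    results = []
    # Naive split after ., ! or ? followed by whitespace (an assumption)
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        hits = {label: sorted(tokens & words) for label, words in modifiers.items()}
        results.append((sentence, hits))
    return results

review = "The acting was brilliant. The plot was dull and awful!"
for sentence, hits in find_modifiers(review, modifiers):
    print(sentence, hits)
```

Keeping the hits grouped per sentence leaves room to score each sentence separately before combining them into a verdict for the whole review.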

In [None]:
# Getting adjectives and adverbs stored in a Google Sheet on my Drive
from google.colab import auth
auth.authenticate_user()
import gspread
from google.auth import default
creds, _ = default()
gc = gspread.authorize(creds)
worksheet = gc.open('Trial2Spreadsheet').sheet1

# Sorting sheets data into structured set for easier use later on
rows = worksheet.get_all_values()
data_set = dict()
data_set['pos adj'] = [row[0] for row in rows if row[0] != ""]
data_set['neg adj'] = [row[1] for row in rows if row[1] != ""]
data_set['pos adv'] = [row[2] for row in rows if row[2] != ""]
data_set['neg adv'] = [row[3] for row in rows if row[3] != ""]

MessageError: ignored
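
Since the cell above did not run to completion, here is a sketch of the column-sorting step on made-up rows. It assumes the sheet values arrive as a list of rows of strings with the four columns in the order used above; the sample rows are hypothetical.

```python
# Hypothetical rows in the shape of a sheet: one list of cell strings per row
rows = [
    ["good", "bad", "really", "barely"],
    ["great", "", "truly", ""],
]

# Column order taken from the cell above
labels = ["pos adj", "neg adj", "pos adv", "neg adv"]

# For each column, collect its non-empty cells
data_set = {
    label: [row[i] for row in rows if row[i] != ""]
    for i, label in enumerate(labels)
}

print(data_set)
```

Building the dictionary from a list of labels keeps the column order in one place, so adding a fifth column would only need a new label.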

In [None]:
# Loading the previously made data set from reviews (stored in my Drive as a .txt file)
with open("/content/gdrive/MyDrive/Train/test_file.txt", 'r') as json_file:
  data_set = json.load(json_file)

In [None]:
#Finding and sorting all adjectives and adverbs into 