In [6]:
#Utilities main function
import utility
import os
import pandas as pd
original_wd = os.getcwd()

### **Main functions**

In [2]:
reviews = pd.read_csv("./data/reviews.csv")

### **Preprocessing class**

In [4]:
from preprocess_class import *

# Simple demonstration of how the modify_stop_words_list method is used
reviews_dataset = Dataset(reviews)

reviews_dataset.modify_stop_words_list(replace_stop_words_list = ["I", "am", "very"], include_words = ["happy"], exclude_words = ["very", "sad"])

reviews_dataset.stop_words_list

['I', 'am', 'happy']
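The result above is consistent with a "replace, then include, then exclude" order of operations. The sketch below is a hypothetical stand-in for `Dataset.modify_stop_words_list`, not the actual implementation in `preprocess_class`; the function name `modify_stop_words` and the rule that exclusions win over inclusions are assumptions made for illustration.

```python
def modify_stop_words(current, replace_stop_words_list=None,
                      include_words=None, exclude_words=None):
    """Hypothetical stand-in for Dataset.modify_stop_words_list."""
    # Start from the replacement list if one is given, else from the current list
    if replace_stop_words_list is not None:
        words = list(replace_stop_words_list)
    else:
        words = list(current)
    # Append the words to include, skipping any that are already present
    for word in include_words or []:
        if word not in words:
            words.append(word)
    # Drop excluded words last, so exclusions win over inclusions (assumption)
    excluded = set(exclude_words or [])
    return [word for word in words if word not in excluded]


result = modify_stop_words(
    current=["the", "a"],
    replace_stop_words_list=["I", "am", "very"],
    include_words=["happy"],
    exclude_words=["very", "sad"],
)
print(result)  # ['I', 'am', 'happy']
```

Excluding a word that is not in the list ("sad" here) is treated as a no-op rather than an error.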

In [22]:
# Steps to create the necessary engineered features:
# 1. Modify the stop words list
# 2. Use the create_* methods to generate the features you require: bow, tfidf, word2vec, doc2vec (sklearn)
reviews_dataset = Dataset(reviews)
reviews_dataset.modify_stop_words_list(include_words=['price', 'quality', 'good', 'great'], exclude_words=["not", "no", "least"])
reviews_dataset.create_bow(root_words_option=0, min_doc = 5, max_doc = 0.95, remove_stop_words= True)

print("price removed from bag of words: ", 'price' not in reviews_dataset.bow[0].get_feature_names_out())
print("least not removed from bag of words: ", 'least' in reviews_dataset.bow[0].get_feature_names_out())

price removed from bag of words:  True
least not removed from bag of words:  True


### **Sentiment Analysis Pipeline**

##### Bert Model

In [2]:
from sentimental_analysis.bert.train import *

[nltk_data] Downloading package punkt to /Users/gjj980/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/gjj980/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/gjj980/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/gjj980/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


##### Non-Bert Model

In [3]:
# Pass path components separately so the path works on any operating system
path = os.path.join(original_wd, 'sentimental_analysis', 'non_bert')
os.chdir(path)
%run train
os.chdir(original_wd)

[10, 0.9044021911014027]
Training dataset has been loaded successfully

---------------------------------
LogisticRegression model succesfully trained

---------------------------------

model_accuracy:   0.63
model_precision:   0.88
model_auc:   0.68
model_f1score:   0.70
sensitivity:   0.58
specificity:   0.78

---------------------------------

Threshold parameter tuning

Prediction using best threshold for accuracy
-------------------------

model_accuracy:   0.75
model_precision:   0.76
model_auc:   0.56
model_f1score:   0.85
sensitivity:   0.96
specificity:   0.16
Best threshold for accuracy: 0.14051379451626203
Accuracy score at best threshold: 0.7502295684113865

---------------------------------
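The "Threshold parameter tuning" step in the log above replaces the default cut-off with the one that maximises accuracy. A minimal sketch of that idea follows; it is not the code in `train`, and the function names and the toy probabilities are made up for illustration.

```python
def accuracy_at(threshold, probs, labels):
    """Accuracy when predicting class 1 for probabilities >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    correct = sum(int(pred == y) for pred, y in zip(preds, labels))
    return correct / len(labels)


def best_threshold_for_accuracy(probs, labels):
    """Scan every distinct predicted probability as a candidate threshold."""
    candidates = sorted(set(probs))
    best = max(candidates, key=lambda t: accuracy_at(t, probs, labels))
    return best, accuracy_at(best, probs, labels)


# Toy data (illustrative only): predicted P(class 1) and true labels
probs = [0.10, 0.20, 0.35, 0.60, 0.80]
labels = [0, 1, 1, 1, 1]

threshold, acc = best_threshold_for_accuracy(probs, labels)
print(threshold, acc)  # 0.2 1.0
```

Optimising for accuracy alone can push the threshold very low when the positive class dominates, which matches the log above: sensitivity rises to 0.96 while specificity drops to 0.16.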



### **Topic Modelling Pipeline**

##### Bert Model

##### Non-Bert Model

In [2]:
from topic_modelling.non_bert.non_bert_topic_model import TopicModel

### Example of running a non-BERT topic model

In [9]:
#example of training a model with LDA / NMF method
# Pass path components separately so the path works on any operating system
path = os.path.join(original_wd, 'topic_modelling', 'non_bert')
os.chdir(path)
%run non_bert_topic_model
os.chdir(original_wd)

Train dataset loaded
---------------------------------

------Preprocessing text data--------
------ Training model --------

------ Distribution for number of documents in each topic --------

Topic 0: 116
Topic 1: 365
Topic 2: 274
Topic 3: 500
Topic 4: 125
Topic 5: 153
Topic 6: 272
Topic 7: 216
Topic 8: 801
Topic 9: 598
Topic 10: 227
Topic 11: 316
Topic 12: 315
Topic 13: 77
------4355 documents trained------
------ Generating key words and sample documents for each topic ------------


Topic 0
food, ingredient, baby, best, organic, baby food, earth, dog food, little, pet, real, jar, thing, earth best, dog, healthy, kind, dinner, meat, stuff

yummy, older, baby, food, texture, son, apple, sauce, half

refried, bean, kind, mexican, food, restaurant, cheese, case

condenser, microphone, dynamic, hum, signal, window, build, cheap

oat, bran, creamy, one-minute, quick, oat, taste, oat, perfect, food, somebody, diet, single, tablespoon, oat, bran

bold, taco, bell, spicy, ranchero, sauce, 

### Pre-model training
- Identify word forms that interfere with the generation of meaningful topics (https://aclanthology.org/U15-1013/)
- Add stop words that may affect the quality of the generated topics

Citation:
Fiona Martin and Mark Johnson. 2015. More Efficient Topic Modelling Through a Noun Only Approach. In Proceedings of the Australasian Language Technology Association Workshop 2015, pages 111–115, Parramatta, Australia.

In [9]:
topic_key_words_before_filtering = pd.read_csv("./topic_modelling/non_bert/train_output/test_5/topic_key_words.csv")

for i in range(1,5):
    label = topic_key_words_before_filtering.iloc[i,0]
    key_words = ", ".join(topic_key_words_before_filtering.iloc[i, 1:].tolist())
    print(label)
    print(key_words, "\n")

carbohydrates
free, gluten, gluten free, mix, cracker, rice, pancake, pasta, flour, baking, family, gf, texture, waffle, great, graham, wheat, dairy, maple, bread 

animals_plants
treat, product, item, sauce, chicken, dog, package, bag, good, easy, received, company, amazon, time, came, hard, china, use, cheese, packaging 

spread_bread
butter, peanut, cake, peanut butter, bread, quick, quick easy, almond butter, rum, just like, easy, rice cake, cake mix, wal, bake, mart, pan, wal mart, soup, allergic 

products
price, product, store, food, love, amazon, great, bought, buy, taste, best, time, good, quality, chocolate, did, local, better, brand, hot 



Tokens like "buy", "bought", "just like" and "quick easy" give little insight into a topic, whereas tokens such as "pasta", "drinks", "coffee" and "tea" indicate the type of product being discussed. Adjectives like "gluten free" can also add context, in this case pointing to healthy products. Hence, we focus on nouns and adjectives. However, some nouns and adjectives, such as "amazon", "taste" and "great", still add little value to the topics. We resolve this by adding these words to the stop words list.
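The filtering described above can be sketched as follows. This is an illustration rather than the project's preprocessing code: it assumes tokens have already been tagged with Penn Treebank part-of-speech tags (for example by a tagger such as NLTK's), and the function name `filter_tokens` and the hand-tagged sample are hypothetical.

```python
# Penn Treebank tag prefixes: NN* covers nouns, JJ* covers adjectives
KEEP_TAG_PREFIXES = ("NN", "JJ")


def filter_tokens(tagged_tokens, extra_stop_words):
    """Keep nouns and adjectives that are not in the extra stop words list."""
    stop = {word.lower() for word in extra_stop_words}
    return [
        token.lower()
        for token, tag in tagged_tokens
        if tag.startswith(KEEP_TAG_PREFIXES) and token.lower() not in stop
    ]


# Hand-tagged sample (assumption: a real tagger would produce these pairs)
tagged = [
    ("bought", "VBD"), ("great", "JJ"), ("gluten", "NN"),
    ("free", "JJ"), ("pasta", "NN"), ("amazon", "NN"),
]

kept = filter_tokens(tagged, extra_stop_words=["amazon", "taste", "great"])
print(kept)  # ['gluten', 'free', 'pasta']
```

The verb "bought" is dropped by the part-of-speech filter, while "great" and "amazon" pass that filter and are only removed by the stop words list.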

In [57]:
# Show the topics, per-topic accuracy and average topic accuracy for each model

import re

log_path = os.path.join(os.getcwd(), "topic_modelling/non_bert/logs")

log_files = os.listdir(log_path)

model_eval = []

for file in log_files:
    if file.startswith("model"):
        file_path = os.path.join(log_path, file)
        with open(file_path) as f:
            lines = f.readlines()  # keep the file handle and its contents in separate names

        topic_accuracy = pd.DataFrame(columns = ["Topic_label", "Accuracy"])
        counts = []
        average_accuracy = None  # stays None if the log has no "topic accuracy" line
        for line in lines:
            # "Topic <n>: <count>" lines give the number of documents in each topic
            if re.match(r"^Topic \d{1,2}:", line):
                count = int(line.split(" ")[-1])
                counts.append(count)
            # "<label> ... testing accuracy ..." lines give the sample accuracy of one topic
            elif line.find("testing accuracy") > -1:
                curr_topic_accuracy = line.split(" ")
                topic_label = " ".join(curr_topic_accuracy[:-4])
                accuracy = float(curr_topic_accuracy[-1][:-1])
                # Use the same column name as the frame created above
                curr_df = pd.DataFrame({'Topic_label': [topic_label], 'Accuracy': [accuracy]})
                topic_accuracy = pd.concat([topic_accuracy, curr_df])
            # The "topic accuracy" line gives the average accuracy over all topics
            elif line.find("topic accuracy") > -1:
                average_accuracy = float(line.split(" ")[-2])

        topic_accuracy.insert(1, "Topic_count", counts)
        model_eval.append((topic_accuracy, average_accuracy))

In [65]:
# LDA (without changing doc_word_prior and topic_word_prior), 10 topics, bow
# (poor manual topic labelling as many topics consist of more than 1 category;
# all groups have a similar distribution)
print(model_eval[1][0])
print("Average topic accuracy: ", model_eval[1][1])

      Topic_label  Topic_count  Accuracy
0   breakfast_tea          465     0.750
0          coffee          386     0.500
0     ingredients          400     0.500
0          drinks          487     0.625
0  baking_dessert          420     0.375
0         dog_cat          381     0.250
0    snacks_water          454     0.875
0         grocery          462     0.750
0    mains_spread          451     0.500
0      chips_corn          449     0.500
Average topic accuracy:  0.5625


In [66]:
# LDA, 14 topics, tfidf (same type of product spread across different topics)
print(model_eval[2][0])
print("Average topic accuracy: ", model_eval[2][1])

                      Topic_label  Topic_count  Accuracy
0                breakfast drinks          335     0.625
0                         service          141     0.875
0                         dessert          227     0.500
0  flavoured drinks_carbohydrates          522     0.625
0                    snacks_dairy           95     0.500
0                         dog_cat          459     0.625
0             family_sweet snacks          258     0.875
0                          coffee          134     0.125
0                sauce_condiments          319     0.750
0                     snack_chips          197     0.375
0      calories_protein_snack_bar          525     0.875
0                   coffee_drinks          521     0.750
0                         grocery          461     0.875
0                 crackers_snacks          161     0.250
Average topic accuracy:  0.6160714285714286


In [63]:
# NMF, 14 topics, bow (imbalanced topic counts, poor manual topic labelling)
print(model_eval[0][0])
print("Average topic accuracy: ", model_eval[0][1])

topic_key_words_model_1 = pd.read_csv("./topic_modelling/non_bert/train_output/model_1/topic_key_words.csv")

label_0 = topic_key_words_model_1.iloc[0,0]
key_words_0 = ", ".join(topic_key_words_model_1.iloc[0, 1:].tolist())
print(label_0)
print(key_words_0, "\n")

# Row 12 (zero-based) is the "chocolate_milk_nuts" topic
label_12 = topic_key_words_model_1.iloc[12,0]
key_words_12 = ", ".join(topic_key_words_model_1.iloc[12, 1:].tolist())
print(label_12)
print(key_words_12, "\n")

           Topic_label  Topic_count  Accuracy
0              unknown          116     0.000
0               coffee          365     0.875
0                  tea          274     0.875
0            flavoring          500     1.000
0                 baby          125     0.250
0         healthy food          153     1.000
0                chips          272     0.250
0                  dog          216     0.625
0              grocery          801     0.625
0                snack          598     0.625
0         sweet drinks          227     0.625
0        carbohydrates          316     0.250
0  chocolate_milk_nuts          315     0.125
0                  cat           77     0.750
Average topic accuracy:  0.5625
unknown
food, ingredient, baby, best, organic, baby food, earth, dog food, little, pet, real, jar, thing, earth best, dog, healthy, kind, dinner, meat, stuff 

chocolate_milk_nuts
raw, different, aware, cacao, alive, hot, review, oil, coconut, brand, milk, powder, organic, coco

In [67]:
# NMF, 10 topics, tfidf
# Best performing model with highest sample accuracy
print(model_eval[3][0])
print("Average topic accuracy: ", model_eval[3][1])

          Topic_label  Topic_count  Accuracy
0              coffee          361     0.875
0                 tea          280     1.000
0     flavored drinks          612     0.500
0        cat_dog_baby          728     0.250
0               chips          207     0.875
0             grocery          413     0.875
0              snacks          651     0.875
0      chocolate_milk          476     0.500
0  bag_packaging_meat          408     0.500
0                 dog          219     0.625
Average topic accuracy:  0.6875


From the topics observed across all our topic modelling techniques, we collated some of the more general topics and fed them as candidate labels into a zero-shot model to see how well it performs on the text inputs. We observe that the zero-shot model performs better on shorter texts, and that some texts can be classified under more than one topic. For texts with more than one sentence, we use sent_tokenize to split the text into individual sentences, then assign the text the label with the highest score among all its sentences. To ensure that texts are labelled with high certainty, we only keep labels with scores of at least 0.5; texts scoring below 0.5 are labelled as unknown.
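The sentence-level labelling rule described above can be sketched as follows. This is not the code in `zero_shot_class`: the regex sentence splitter is a dependency-free stand-in for a proper sentence tokenizer, `toy_scorer` is a hypothetical replacement for the zero-shot model, and its scores are invented for illustration.

```python
import re


def split_sentences(text):
    """Naive sentence splitter, used only to keep this sketch dependency-free."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_text(text, score_sentence, min_score=0.5):
    """Return the best (label, score) over all sentences, or 'unknown' if too low."""
    best_label, best_score = "unknown", 0.0
    for sentence in split_sentences(text):
        # score_sentence maps one sentence to a {label: score} dictionary
        for label, score in score_sentence(sentence).items():
            if score > best_score:
                best_label, best_score = label, score
    if best_score < min_score:
        return "unknown", best_score
    return best_label, best_score


def toy_scorer(sentence):
    """Hypothetical stand-in for the zero-shot model (scores are made up)."""
    if "tea" in sentence.lower():
        return {"drinks": 0.9, "snacks": 0.1}
    return {"drinks": 0.3, "snacks": 0.4}


print(label_text("It arrived late. The tea tastes great!", toy_scorer))  # ('drinks', 0.9)
print(label_text("It arrived late.", toy_scorer))  # ('unknown', 0.4)
```

Taking the maximum over sentences lets one clearly on-topic sentence decide the label even when the rest of the review is about delivery or packaging.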

In [36]:
## Zero-shot model (prediction step commented out because it could not be run on all team machines; saved predictions are loaded instead)
# from topic_modelling.non_bert.zero_shot_class import PredictTopic

# data_path = "./data/topics_samples.csv"
# new_feedback = PredictTopic(data_path)
# new_feedback.predict()

zero_shot_final_results = pd.read_csv("./topic_modelling/non_bert/test_zero_shot_output/zero_shot_full_prediction.csv")
zero_shot_final_results = zero_shot_final_results.merge(reviews.reset_index(), how = "left", left_on = "original index", right_on = "index")
print(zero_shot_final_results.sample(10, random_state=100))
zero_shot_final_results['labels'].unique()


      original index    scores  \
2004            2004  0.380852   
1210            1210  0.553230   
2599            2599  0.766479   
561              561  0.737325   
1360            1360  0.472323   
1791            1791  0.740296   
662              662  0.951726   
3499            3499  0.515050   
2213            2213  0.626944   
859              859  0.737283   

                                               sequence                labels  \
2004  It's a spreadable jelly, sweet but not too swe...                   NaN   
1210  Tasted bland - did not have the freshness and ...            condiments   
2599  I always have been of Crystal Light products s...                drinks   
561   This is the first time we bought this tea from...                drinks   
1360  I especially like it with some spinach sauteed...                   NaN   
1791              Nothing comes close to these pickles.            condiments   
662   Unfortunately, I just did not like the consist...   

array(['dog', 'healthy alternatives', 'drinks', 'snacks', 'family', nan,
       'cat', 'household', 'carbohydrates', 'condiments', 'sauce'],
      dtype=object)

In [39]:
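# Assign the result back; in-place fillna on a selected column is unreliable in recent pandas
zero_shot_final_results['labels'] = zero_shot_final_results['labels'].fillna("unknown")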
zero_shot_final_results['labels'].value_counts()
# zero_shot_final_results.loc[zero_shot_final_results['labels'] == "unknown",:].sample(5, random_state=100)

drinks                  1686
unknown                 1530
snacks                   858
healthy alternatives     652
dog                      170
condiments               149
family                   139
sauce                    101
cat                       68
carbohydrates             54
household                 37
Name: labels, dtype: int64

Although the accuracy of the zero-shot model is very high, there is a major challenge in specifying distinct topics. For example, healthy alternatives can also fall under drinks and snacks. However specific we try to make the topics, it is hard to generate topic labels for some products, especially non-edible products such as pesticides. This results in a high share of unknown labels (1,530 of 5,444 texts, about 28%).

### **Dockerised API**

In [None]:
###Create a scoring function with

def score(csv_file):
    #Call Sentimental_analysis via Docker API

    #Return csv file name as reviews_test_predictions_DSAPES.csv
    #Text, Time, Sentimental Probabilities, Sentiments
    pass
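One possible shape for the scoring function is sketched below. It is only an outline under stated assumptions: the input CSV is assumed to have `Text` and `Time` columns, `predict_proba` is a hypothetical callable standing in for the call to the Dockerised sentiment API (whose endpoint is not specified here), and the 0.5 cut-off and the "positive"/"negative" labels are illustrative choices.

```python
import csv


def score(csv_file, predict_proba, out_file="reviews_test_predictions_DSAPES.csv"):
    """Write predictions for every review in csv_file and return the output file name.

    predict_proba is a placeholder for the Docker API call: it takes the review
    text and returns the probability of a positive sentiment (assumption).
    """
    with open(csv_file, newline="", encoding="utf-8") as src:
        rows = list(csv.DictReader(src))

    with open(out_file, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["Text", "Time", "Sentimental Probabilities", "Sentiments"])
        for row in rows:
            prob = predict_proba(row["Text"])
            # Illustrative 0.5 cut-off; a tuned threshold could be used instead
            sentiment = "positive" if prob >= 0.5 else "negative"
            writer.writerow([row["Text"], row["Time"], prob, sentiment])

    return out_file
```

Passing the prediction step in as a callable keeps the file handling testable without a running container; the real version would replace it with a request to the API.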