# NLP Tasks
## NLP Task 1 - Named Entity Recognition

### Literature Review

Named entity recognition (NER) has been used in a variety of papers for brand name recognition. An article in the
journal Electronic Commerce Research investigated entity name recognition for cross-border e-commerce commodity titles
based on TWs-LSTM. The TF-IDF algorithm was applied to the commodity text corpus to extract a weighted matrix, while a
word2vec model represented the semantic meaning of the words in the corpus's bag of words. The two representations were
combined into a one-dimensional matrix that was passed into an LSTM for commodity entity name recognition. The final
accuracy achieved by this model was approximately 65%, which was much higher than other models at the time
(Luo et al., 2019).
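
To make the TF-IDF weighting step more concrete, the following is a minimal pure-Python sketch. It is not the
implementation from the article: the sample titles and the function name `tfidf_weights` are hypothetical, and it
assumes the plain `tf * log(N / df)` form rather than whichever variant the authors used.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {token: weight} dict per tokenised document."""
    n_docs = len(documents)
    # Document frequency: the number of documents each token appears in.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return weights


# Hypothetical commodity titles, already lower-cased and tokenised.
titles = [
    ["lenovo", "thinkpad", "laptop"],
    ["dell", "xps", "laptop"],
    ["dell", "latitude", "laptop"],
]
weights = tfidf_weights(titles)
print(weights[0])
```

In this toy corpus "laptop" appears in every title and so receives a weight of zero, while a token that appears in only
one title receives the largest weight.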

### Rationale

The proposed solution to the overall problem is to create an outlook system based on brands. Each review extracted by
the web scraper includes the laptop model along with the brand that makes it. Named entity recognition can be used to go
through the title of each review and extract the brand based on its entity classification. To accomplish this, an
existing named entity recognition system can be used, or a model can be created and trained specifically for the
purpose.

Both the spaCy and Stanford NER models were used and compared, with the results from the more accurate one appended to
the dataset. Each offers facilities to import custom labelled datasets and train a model on them. These were not used
because a labelled training set is not available; the pretrained models were used instead.

The Stanford model uses the conditional random field (CRF) algorithm, which is described as a combination of the hidden
Markov model and the maximum entropy Markov model. This algorithm assumes that features are dependent on each other,
which is the case within a sentence, as context is created by the dependency of words on each other (Dasagrandhi, 2020).
The CRF algorithm uses this contextual information from the previous labels to improve the current prediction
(Chawla, 2021).
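
The effect of using the previous label can be illustrated with a small sketch of Viterbi decoding, a standard way of
finding the best label sequence in a linear-chain model. This is not Stanford NER's code. A CRF would learn the scores
from labelled data, whereas the emission and transition scores below are hand-picked, hypothetical values, and the
function assumes a non-empty token list.

```python
def viterbi(tokens, labels, emission, transition):
    """Return the highest-scoring label sequence for a non-empty token list.

    emission[(token, label)] and transition[(prev_label, label)] are additive
    scores; missing entries count as 0.0.
    """
    # best[i][label] = (score of the best path ending in label at i, previous label)
    best = [{label: (emission.get((tokens[0], label), 0.0), None) for label in labels}]
    for token in tokens[1:]:
        column = {}
        for label in labels:
            prev_label = max(
                labels,
                key=lambda p: best[-1][p][0] + transition.get((p, label), 0.0),
            )
            score = (best[-1][prev_label][0]
                     + transition.get((prev_label, label), 0.0)
                     + emission.get((token, label), 0.0))
            column[label] = (score, prev_label)
        best.append(column)

    # Trace back from the best final label to recover the whole path.
    label = max(labels, key=lambda l: best[-1][l][0])
    path = [label]
    for column in reversed(best[1:]):
        label = column[label][1]
        path.append(label)
    return list(reversed(path))


tokens = ["Dell", "Inspiron", "review"]
labels = ["ORG", "O"]
emission = {
    ("Dell", "ORG"): 2.0,
    ("Inspiron", "ORG"): 0.4,
    ("Inspiron", "O"): 0.5,
    ("review", "O"): 1.5,
}
transition = {("ORG", "ORG"): 0.5}  # an ORG label makes a following ORG more likely

with_context = viterbi(tokens, labels, emission, transition)
without_context = viterbi(tokens, labels, emission, {})
print(with_context)
print(without_context)
```

With the transition score, "Inspiron" is labelled ORG because it follows an ORG token. Without it, each token is
effectively labelled on its own and "Inspiron" falls back to O.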

spaCy's algorithm is unpublished, but it uses a word embedding method in conjunction with a deep convolutional neural
network. This provides a model that has high adaptability and well-rounded accuracy (Dasagrandhi, 2020).

### Pre-processing
The review titles could be passed directly into the models with no changes required, as each is a single-sentence
input.

### Preliminary Analysis

The performance of each model is judged on the number of detections made and the correctness of those detections.
spaCy has a higher detection rate than the Stanford model; however, most of the detections from spaCy are wrong. It
recognises certain words as companies because of their capitalisation, such as detecting “Laptop Review” as an ORG when
it is clearly not one. The Stanford NER model has a lower detection rate, but all the detections it makes are correct.
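
The two criteria can be written down as simple measures. The sketch below assumes a small hand-checked sample of titles
is available. The brand lists are invented for illustration and are not the results of this analysis, and the function
names are hypothetical. The `"No brand found"` sentinel is the one used in the code for this task.

```python
NO_BRAND = "No brand found"


def detection_rate(predictions):
    """Share of titles for which any brand was returned."""
    detected = [p for p in predictions if p != NO_BRAND]
    return len(detected) / len(predictions)


def precision(predictions, gold):
    """Share of returned brands that match the hand-checked brand."""
    pairs = [(p, g) for p, g in zip(predictions, gold) if p != NO_BRAND]
    if not pairs:
        return 0.0
    correct = sum(1 for p, g in pairs if p.lower() == g.lower())
    return correct / len(pairs)


# Invented example: hand-checked brands and two possible model outputs.
gold = ["Dell", "Lenovo", "HP", "Asus"]
eager_model = ["Dell", "Laptop Review", "HP", "Laptop Review"]
cautious_model = ["Dell", NO_BRAND, "HP", NO_BRAND]

print(detection_rate(eager_model), precision(eager_model, gold))
print(detection_rate(cautious_model), precision(cautious_model, gold))
```

A model that labels every title can have a perfect detection rate and still be wrong half of the time, which is why
both numbers are needed when choosing which results to append to the dataset.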

![Figure 1 - NER Results](NERResults.PNG)

It should be noted that the spaCy model is significantly faster than the Stanford model, taking only 7 seconds to
process the text as opposed to Stanford's 22 minutes.

Compared with the article assessed, this approach achieves a similar level of performance with less complexity. The
article dates from 2019, and in the period since, more sophisticated models have been developed with a much higher
success rate.

### Code

```python
# Packages
import os

import en_core_web_sm
import pandas as pd
import tqdm
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# NLTK's Stanford NER wrapper launches Java, so point it at the local runtime
java_path = r"C:\Program Files\Java\jre1.8.0_291\bin\java.exe"
os.environ['JAVAHOME'] = java_path

NO_BRAND = "No brand found"


def reportDetection(brandNames):
    # Print how many titles yielded no organisation and the resulting detection rate
    counter = brandNames.count(NO_BRAND)
    print("Number of non branded entries: " + str(counter))
    print("Percentage detection: " + str((len(brandNames) - counter) / len(brandNames) * 100))


def spacy(data):
    # Tag each title with spaCy's small English model and keep the first ORG entity
    brandNames = []
    nlp = en_core_web_sm.load()
    for title in tqdm.tqdm(data['Title']):
        organizations = [ent.text for ent in nlp(title).ents if ent.label_ == 'ORG']
        brandNames.append(organizations[0] if organizations else NO_BRAND)

    reportDetection(brandNames)
    return brandNames


def namedEntityRecognition(data):
    # Tag each title with the Stanford 3-class CRF model and keep the first ORGANIZATION token
    brandNames = []
    model_filename = './stanford-ner-2020-11-17/classifiers/english.all.3class.distsim.crf.ser.gz'
    path_to_jar = './stanford-ner-2020-11-17/stanford-ner.jar'
    st = StanfordNERTagger(model_filename=model_filename, path_to_jar=path_to_jar, encoding='utf-8')
    for title in tqdm.tqdm(data['Title']):
        classified_text = st.tag(word_tokenize(title))
        organizations = [token for token, label in classified_text if label == 'ORGANIZATION']
        brandNames.append(organizations[0] if organizations else NO_BRAND)

    reportDetection(brandNames)
    return brandNames


def main():
    data = pd.read_csv('output.csv')

    brands = namedEntityRecognition(data)
    brands2 = spacy(data)

    print(brands)
    print(brands2)

    # Only the Stanford results are appended to the dataset
    data['Brand'] = brands

    data.to_csv('outputBrand.csv', index=False)


main()
```

## NLP Task 2 - Sentiment Analysis

### Literature Review
Sentiment analysis is a natural language processing method that detects whether a body of text expresses a positive,
negative or neutral outlook, for example by scoring the words within the text and summing the scores. An article
published in the journal Applied Soft Computing in 2020 analysed the sentiment of tourism reviews and proposed the
technique of sentiment padding. Sentiment padding creates a more consistent sample size and improves the proportion of
sentiment information within each review. The article goes on to use neural networks, more specifically deep learning
sentiment analysis models named lexicon integrated two-channel CNN-LSTM family models. These models combine CNN and
LSTM/BiLSTM branches in parallel in order to achieve more accurate results. Applied to a variety of complex datasets,
the approach was found to outperform many baseline methods. The final accuracy achieved was 50.68% on a reversed SST
dataset, while 95% accuracy was obtained on a Chinese review dataset. This was 2% higher than a zero-padding method
(Li et al., 2020).

### Rationale

The sentiment of a body of text can be directly linked to the outlook of the laptop. A positive sentiment can be
treated as a positive outlook for the laptop, which is in line with the proposed solution. For this purpose the VADER
model will be used, a lexicon and rule-based sentiment analysis tool that is attuned to sentiments expressed in social
media (Singh, 2021). As these reviews are based on opinions from a range of writers, it should be well suited, since
social media sentiments are opinion based as well. Each section of a review can be assigned a separate sentiment score,
which can then be assessed at a later stage based on which aspect of the laptop the user is most interested in.
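
As a rough illustration of the lexicon part of such a tool, the sketch below sums per-word valence scores and squashes
the total into the range (-1, 1). The word list and its valences are hypothetical. The squashing step is intended to
mirror the normalisation in the VADER reference implementation (`alpha = 15`), but none of VADER's rules for
punctuation, capitalisation, intensifiers or negation are included.

```python
import math

# Hypothetical valence scores; the real VADER lexicon is far larger.
TOY_LEXICON = {"excellent": 3.0, "good": 2.0, "loud": -1.0, "poor": -2.0}


def compound_score(text, lexicon=TOY_LEXICON, alpha=15.0):
    """Sum word valences, then squash the total into (-1, 1)."""
    total = sum(lexicon.get(word, 0.0) for word in text.lower().split())
    return total / math.sqrt(total * total + alpha)


print(compound_score("Excellent display but loud fans"))
print(compound_score("Poor keyboard"))
```

Because the valences are summed before squashing, a long passage containing many mildly positive words will push this
score close to 1.0, which is worth keeping in mind when scoring whole review sections rather than single sentences.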

### Pre-processing

As this data is a large corpus, some pre-processing was required. The first step was the removal of stop words, such as
joining words, which add no additional meaning to a sentence. This was done by tokenizing the text and removing any
token that appears in NLTK's stop words list. Each segment of the dataset besides the title was passed through this
process (the title is not used in this NLP task). Afterwards the non-text characters, such as punctuation and brackets,
were removed.

Lastly, the corpus for each segment was normalized using lemmatization, which reduces words to their base form. This
acts to reduce errors when analysing sentiment, as there are fewer variations of each word, allowing a better sentiment
score to be assigned at the end. The NLTK WordNetLemmatizer was used to perform this.

### Preliminary Analysis

The VADER package was used to apply sentiment analysis to the processed corpus for each segment, and the scores were
appended to the dataset. This final dataset, with the sentiment for each section, was saved to a CSV file. To assess
the performance of this process, a few sections of text are selected along with their corresponding sentiment scores.
Each is then manually reviewed to gauge whether the assigned sentiment matches the actual overall outlook of the text.
The verdict paragraph is used for this, as it gives a finalised opinion of the laptop and is therefore the easiest to
assess the outlook of.

![Figure 2 - Verdict and corresponding sentiment score](SentimentResults.PNG)

The sentiment analysis scores this verdict as highly positive, with little to no negative sentiment detected. When the
paragraph itself is assessed, only one sentence has a negative outlook, and the first half of the paragraph is highly
positive. The compound rating is very high, much higher than it should be. This could be a result of the pre-processing
removing some meaning from the descriptive words in certain areas.
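
One way in which pre-processing could remove meaning is through negation: NLTK's English stop word list contains "not",
so a negated phrase can lose its negation before it reaches the sentiment model. The sketch below illustrates the
mechanism with a toy scorer. The lexicon, stop word set and the single-token negation rule are simplified assumptions,
not VADER's actual resources, although the damping factor is the one VADER applies to negated words.

```python
# Hypothetical miniature resources, for illustration only.
TOY_LEXICON = {"good": 1.9, "impressive": 2.3}
NEGATORS = {"not", "never"}
STOP_WORDS = {"the", "is", "not", "a"}  # NLTK's English list also contains "not"
NEGATION_SCALAR = -0.74  # damping factor VADER applies to negated words


def toy_score(tokens):
    """Sum valences, flipping a word's valence if the previous token negates it."""
    total = 0.0
    for i, token in enumerate(tokens):
        valence = TOY_LEXICON.get(token, 0.0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            valence *= NEGATION_SCALAR
        total += valence
    return total


sentence = "the battery life is not good".split()
filtered = [t for t in sentence if t not in STOP_WORDS]

print(toy_score(sentence))  # negative: "not" flips "good"
print(toy_score(filtered))  # positive: "not" was removed with the stop words
```

VADER also treats capitalisation and exclamation marks as emphasis, so lower-casing the text and stripping punctuation
may affect the scores in a similar way. Running the analysis on the unprocessed text and comparing the two sets of
scores would show how much this matters for these reviews.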

When the sentiment of the verdict segment is assessed across the whole dataset, it shows a very large skew towards
highly positive sentiment. This may be a result of the reviews being sorted from the highest overall rating, with the
lowest rating scraped being 74% (the dataset is skewed towards a positive outlook).
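
The skew can be quantified by bucketing the compound scores. The sketch below assumes the sentiment columns are read
back from the saved CSV, where each score dictionary is stored as a string, and uses the commonly cited VADER
thresholds of ±0.05 on the compound score. The cell values and function names are hypothetical.

```python
import ast
from collections import Counter


def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def sentiment_distribution(cells):
    """Count labels for a column of VADER score dicts, stored as dicts or strings."""
    labels = []
    for cell in cells:
        scores = ast.literal_eval(cell) if isinstance(cell, str) else cell
        labels.append(label_from_compound(scores["compound"]))
    return Counter(labels)


# Hypothetical cells, shaped like the 'Verdict sentiment' column after a CSV round trip.
cells = [
    "{'neg': 0.02, 'neu': 0.7, 'pos': 0.28, 'compound': 0.98}",
    "{'neg': 0.05, 'neu': 0.75, 'pos': 0.2, 'compound': 0.91}",
    "{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}",
]
distribution = sentiment_distribution(cells)
print(distribution)
```

Applied to a real column, for example `sentiment_distribution(data['Verdict sentiment'])`, the counts give a direct
measure of how strongly the scraped reviews lean towards a positive outlook.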

### Code

```python
import re

import nltk
import pandas as pd
import tqdm
from nltk import word_tokenize
from nltk.corpus import stopwords
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Review segments that receive a sentiment score (the title is not used in this task)
SEGMENTS = ['Intro', 'Case', 'Connectivity', 'Input devices', 'Display',
            'Performance', 'Emissions', 'Energy management', 'Verdict']

# Create the analyser once rather than on every call
analyser = SentimentIntensityAnalyzer()


def noiseRemoval(data):
    # Lower-case each segment and drop NLTK stop words, then replace non-word characters with spaces
    stop_words = set(stopwords.words('english'))
    for column in tqdm.tqdm(SEGMENTS):
        data[column] = [
            " ".join(word for word in word_tokenize(text.lower()) if word not in stop_words)
            for text in data[column]
        ]
        data[column] = [re.sub(r'\W', ' ', text) for text in data[column]]
    return data


def normalization(data):
    # Lemmatize every token so that word variants map to a common base form
    lemmatizer = nltk.stem.WordNetLemmatizer()
    for column in tqdm.tqdm(SEGMENTS):
        data[column] = [
            " ".join(lemmatizer.lemmatize(w) for w in word_tokenize(text))
            for text in data[column]
        ]
    return data


# Return sentiment of sentence. Returning 0.0 in all fields if it is none
def vader(sentence):
    if sentence == "none":
        return {'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}
    return analyser.polarity_scores(sentence)


def sentimentAnalysis(data):
    # Score every segment and store the result in a matching '<segment> sentiment' column
    for column in SEGMENTS:
        data[column + ' sentiment'] = [vader(text) for text in tqdm.tqdm(data[column])]

    data.to_csv('outputBrandSentimentNoPreprocessing.csv', index=False)
    return data


def main():
    data = pd.read_csv('outputBrand.csv')
    data = noiseRemoval(data)
    data = normalization(data)
    sentimentAnalysis(data)


main()
```

## References

Luo, Y., Ma, J., & Li, C. (2019). Entity name recognition of cross-border e-commerce commodity titles based on TWs-LSTM.
Electronic Commerce Research, 20(2), 405–426. https://doi.org/10.1007/s10660-019-09371-6

Dasagrandhi, C. S. (2020). Understanding Named Entity Recognition Pre-Trained Models. V-Soft Consulting.
https://blog.vsoftconsulting.com/blog/understanding-named-entity-recognition-pre-trained-models

Chawla, R. (2021, April 19). Overview of Conditional Random Fields - ML 2 Vec. Medium.
https://medium.com/ml2vec/overview-of-conditional-random-fields-68a2a20fa541

Li, W., Zhu, L., Shi, Y., Guo, K., & Cambria, E. (2020). User reviews: Sentiment analysis using lexicon integrated
two-channel CNN–LSTM family models. Applied Soft Computing, 94, 106435. https://doi.org/10.1016/j.asoc.2020.106435

Singh, F. (2021, February 2). Sentiment Analysis Made Easy Using VADER. Analytics India Magazine.
https://analyticsindiamag.com/sentiment-analysis-made-easy-using-vader/