# Opinion lexicon score model (Notebook 5/6)
> Opinion Lexicon: An opinion lexicon is a collection of words and phrases annotated with sentiment information, typically indicating whether each term carries a positive, negative, or neutral connotation.

In this notebook:

- The opinion lexicon is used to build the sentiment scoring model.
- The scoring function first breaks the input text into sentences using the `sent_tokenize` function from nltk.
- For each sentence, it calculates a sentence score by iterating over each token (word) in the sentence.
- If the token is in the list of positive words from the opinion lexicon, the sentence score is incremented by 1.
- If the token is in the list of negative words, the sentence score is decremented by 1.
- Finally, each sentence score is divided by the number of tokens in that sentence, and the function returns the sum of these normalized sentence scores.
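
The steps in the list can be tried on a toy example before loading any NLTK data. The sketch below is only an illustration under stated assumptions: the word sets `POSITIVE` and `NEGATIVE` are a hypothetical stand-in for the opinion lexicon, the helper name `toy_sentiment_score` is made up, and a naive regular-expression split replaces `sent_tokenize`.

```python
import re
from string import punctuation

# Hypothetical toy lexicon, standing in for the NLTK opinion lexicon
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "broken", "poor"}

def toy_sentiment_score(text: str) -> float:
    """Sum over sentences of (positives - negatives) / token count."""
    total = 0.0
    # Naive sentence split on '.', '!' or '?' (the notebook uses sent_tokenize)
    for sentence in re.split(r"[.!?]+", text):
        cleaned = sentence.translate(str.maketrans("", "", punctuation)).lower()
        tokens = cleaned.split()
        if not tokens:
            continue  # skip empty fragments so we never divide by zero
        # +1 for each positive token, -1 for each negative token
        score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
        total += score / len(tokens)
    return total

print(toy_sentiment_score("These are great"))                    # 1 positive / 3 tokens
print(toy_sentiment_score("Great sound. The cable is broken."))  # 1/2 + (-1/4)
```

The second call shows how a positive and a negative sentence partly cancel each other out.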

Next Notebook: `issue-13-as-analysis-on-opinion-lexicon-scores.ipynb`

In [44]:
import pandas as pd
from nltk.corpus import opinion_lexicon
from nltk.tokenize import sent_tokenize, TreebankWordTokenizer
from string import punctuation
pd.set_option('display.max_colwidth', None)

# run if the opinion lexicon cannot be found
# import nltk
# nltk.download('opinion_lexicon')

In [9]:
# source: issue-5-as-data-preprocessing.ipynb
df = pd.read_csv('../data/preprocessed_small_sample.csv')

In [10]:
# extracting the positive and negative words (sets give fast membership tests)
positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())

In [12]:
def get_opinion_lexicion_sentiment_score(text: str) -> float:
    """
    Calculate the sentiment score of a text using the opinion lexicon provided by the Natural Language Toolkit (nltk).

    Parameters:
    - text (str): The input text for which the sentiment score needs to be calculated.

    Returns:
    - float: The sentiment score of the input text, where a higher score indicates a more positive sentiment and a lower score indicates a more negative sentiment.

    Description
    The function first breaks the input text into sentences using the `sent_tokenize` function from nltk. Each sentence is stripped of `<br/>` tags and punctuation, lowercased, and tokenized. A token found in the positive word list adds 1 to the sentence score; a token found in the negative word list subtracts 1. Each sentence score is divided by the number of tokens in that sentence, and the function returns the sum of these normalized sentence scores.
    """

    total_score = 0.0
    tokenizer = TreebankWordTokenizer()

    for sentence in sent_tokenize(text):
        # remove HTML line breaks and punctuation, then lowercase
        sentence = sentence.replace('<br/>', '')\
                            .translate(str.maketrans('', '', punctuation)).lower()
        # tokenize the cleaned sentence (not the whole text)
        tokens = tokenizer.tokenize(sentence)
        if not tokens:
            continue  # nothing left after cleaning; avoid dividing by zero

        sentence_score = 0
        for token in tokens:
            if token in positive_words:
                sentence_score += 1
            elif token in negative_words:
                sentence_score -= 1
        total_score += sentence_score / len(tokens)

    return total_score
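
How the sentence scores are normalized affects the range of the final value. The sketch below compares two options on made-up numbers: adding one ratio per sentence, versus dividing the pooled score by the pooled token count. The type alias `Sentences` and both function names are hypothetical and exist only for this illustration.

```python
from typing import List, Tuple

# Each pair is (raw sentence score, number of tokens); the values are made up
Sentences = List[Tuple[int, int]]

def sum_of_sentence_ratios(sentences: Sentences) -> float:
    """Add score / length for every non-empty sentence."""
    return sum(score / n for score, n in sentences if n > 0)

def pooled_ratio(sentences: Sentences) -> float:
    """Divide the pooled score by the pooled token count."""
    total_tokens = sum(n for _, n in sentences)
    total_score = sum(score for score, _ in sentences)
    return total_score / total_tokens if total_tokens else 0.0

review = [(1, 2), (1, 2), (1, 2)]      # three short, mildly positive sentences
print(sum_of_sentence_ratios(review))  # 1.5: grows with the number of sentences
print(pooled_ratio(review))            # 0.5: always stays within [-1, 1]
```

A sum of per-sentence ratios rewards longer reviews with more opinionated sentences, while a pooled ratio keeps reviews of different lengths on the same scale.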

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9000 entries, 0 to 8999
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   overall         9000 non-null   float64
 1   verified        9000 non-null   bool   
 2   reviewTime      9000 non-null   object 
 3   reviewerID      9000 non-null   object 
 4   asin            9000 non-null   object 
 5   style           5414 non-null   object 
 6   reviewerName    8995 non-null   object 
 7   reviewText      8999 non-null   object 
 8   summary         9000 non-null   object 
 9   unixReviewTime  9000 non-null   int64  
 10  vote            1713 non-null   float64
 11  image           188 non-null    object 
 12  cleaned_review  8998 non-null   object 
 13  tokens          9000 non-null   object 
 14  stemmed_tokens  9000 non-null   object 
 15  lemmas          9000 non-null   object 
dtypes: bool(1), float64(2), int64(1), object(12)
memory usage: 1.0+ MB


In [15]:
# Fill missing reviews with empty strings, then make sure every value is a string
df['cleaned_review'] = df['cleaned_review'].fillna('').astype(str)
df['sentiment_score'] = df['cleaned_review'].apply(get_opinion_lexicion_sentiment_score)
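
Missing reviews deserve care at this step: a NaN cast directly to a string becomes the literal text `nan`, which would then be tokenized like an ordinary word. The pure-Python sketch below shows that effect without pandas. The helper `to_review_text` is hypothetical and only illustrates mapping missing values to an empty string before casting.

```python
import math

missing = float("nan")   # how a missing review typically appears in a column
print(repr(str(missing)))  # the cast yields the text 'nan', not an empty string

def to_review_text(value) -> str:
    """Hypothetical helper: map missing values to '' before casting to str."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(value)

print(repr(to_review_text(missing)))            # ''
print(repr(to_review_text("these are great")))  # 'these are great'
```

An empty string contains no sentences, so a scorer that sums over sentences returns 0 for it.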

In [49]:
df[['cleaned_review', 'sentiment_score']].sample()

       cleaned_review  sentiment_score
7364  these are great         0.333333


In [50]:
df.to_csv('../data/opinion_lexicon_scored_small_sample.csv', index=False)