### 2. Text Preprocessing and Feature Extraction

This section follows the exploration of the Amazon dataset by consolidating the text preprocessing steps into succinct functions and developing features that can be used to build a predictive model. For this step, we implement the following processes:

<br>

#### Text Preprocessing

1. Reduce the Dataset
2. Preprocess the reviewText Data

#### Traditional Feature Extraction

1. Bag of Words
2. Term Frequency Inverse Document Frequency (TF-IDF)



<br>

#### Step 1: Reducing the Dataset

The Amazon review dataset contains a number of fields that are not of significant use to our main task. They are, however, relevant for handling the whole dataset in tasks such as assessing duplicates, converting ratings, and matching products. For our purposes, we reduce the dataset by keeping verified reviews, dropping duplicates, and removing unnecessary columns.


In [1]:
import numpy as np
import pandas as pd

def categorize_review_rating(rating):
    """
    Categorizes a review rating into negative, neutral, or positive.

    Parameters:
    - rating (int or float): The review rating to be categorized.

    Returns:
    - int: Returns -1 for ratings below 3 (negative), 0 for a rating of 3 (neutral), and 1 for ratings above 3 (positive).
    """
    if rating < 3:
        return -1
    elif rating == 3:
        return 0
    else:
        return 1


def subset_verified_reviews(input_df: pd.DataFrame) -> pd.DataFrame:
    """
    Filters a DataFrame for verified reviews, removes duplicates, selects specific columns,
    and applies a categorization to the review ratings.

    Parameters:
    - input_df (pd.DataFrame): The input DataFrame containing review data.

    Returns:
    - pd.DataFrame: A subset of the input DataFrame with only verified reviews, no duplicates,
      and categorized ratings.

    Note: This function assumes the presence of the columns 'verified', 'reviewerID', 'asin',
    'reviewText', and 'overall' in the input DataFrame. It also relies on the
    `categorize_review_rating` function defined above to categorize the 'overall' ratings.
    """

    # Filter for verified reviews
    verified_reviews = input_df[input_df['verified'] == True]

    # Remove duplicate reviews based on reviewerID and product ID (asin), keeping the first occurrence
    verified_reviews_no_duplicates = verified_reviews.drop_duplicates(subset=['reviewerID', 'asin'], keep='first')

    # Select the 'reviewText' and 'overall' columns; copy so we do not write into a slice of the input
    selected_columns = verified_reviews_no_duplicates[['reviewText', 'overall']].copy()

    # Apply the rating categorization function to the 'overall' column
    selected_columns['overall'] = selected_columns['overall'].apply(categorize_review_rating)

    return selected_columns
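
The three reduction steps can be traced on a tiny hand-made table. The sketch below is only an illustration: the rows are made-up values, not taken from the Amazon data, and the rating mapping is written inline rather than calling the functions above.

```python
import pandas as pd

# Toy data (hypothetical values) with the columns the reduction step expects
toy = pd.DataFrame({
    'verified':   [True, True, False, True, True],
    'reviewerID': ['A1', 'A1', 'A2', 'A3', 'A4'],
    'asin':       ['P1', 'P1', 'P1', 'P2', 'P2'],
    'reviewText': ['great', 'great', 'bad', 'okay', 'awful'],
    'overall':    [5, 5, 1, 3, 2],
})

# 1. Keep verified reviews only (drops row 2)
subset = toy[toy['verified']]

# 2. Keep one review per (reviewer, product) pair (drops row 1)
subset = subset.drop_duplicates(subset=['reviewerID', 'asin'], keep='first')

# 3. Keep the text and the rating, mapping ratings to -1 / 0 / 1
subset = subset[['reviewText', 'overall']].copy()
subset['overall'] = subset['overall'].map(lambda r: -1 if r < 3 else (0 if r == 3 else 1))

print(subset)
```

Rows 0, 3, and 4 survive, with ratings mapped to 1, 0, and -1 respectively. The original index labels are kept, which is why the real output further down skips some row numbers.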


<br>

#### Test the Function with All_Beauty_5.json Data

To test and validate the function above, we use the `All_Beauty_5.json` file from the Amazon review dataset, available here: https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/

In [2]:
data = pd.read_json('All_Beauty_5.json', lines=True)
data.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,5,True,"09 1, 2016",A3CIUOJXQ5VDQ2,B0000530HU,"{'Size:': ' 7.0 oz', 'Flavor:': ' Classic Ice ...",Shelly F,As advertised. Reasonably priced,Five Stars,1472688000,,
1,5,True,"11 14, 2013",A3H7T87S984REU,B0000530HU,"{'Size:': ' 7.0 oz', 'Flavor:': ' Classic Ice ...",houserules18,Like the oder and the feel when I put it on my...,Good for the face,1384387200,,
2,1,True,"08 18, 2013",A3J034YH7UG4KT,B0000530HU,"{'Size:': ' 7.0 oz', 'Flavor:': ' Classic Ice ...",Adam,I bought this to smell nice after I shave. Wh...,Smells awful,1376784000,,
3,5,False,"05 3, 2011",A2UEO5XR3598GI,B0000530HU,"{'Size:': ' 7.0 oz', 'Flavor:': ' Classic Ice ...",Rich K,HEY!! I am an Aqua Velva Man and absolutely lo...,Truth is There IS Nothing Like an AQUA VELVA MAN.,1304380800,25.0,
4,5,True,"05 6, 2011",A3SFRT223XXWF7,B00006L9LC,{'Size:': ' 200ml/6.7oz'},C. C. Christian,If you ever want to feel pampered by a shampoo...,Bvlgari Shampoo,1304640000,3.0,


In [3]:
verified_data = subset_verified_reviews(data)
verified_data.head()

Unnamed: 0,reviewText,overall
0,As advertised. Reasonably priced,1
1,Like the oder and the feel when I put it on my...,1
2,I bought this to smell nice after I shave. Wh...,-1
4,If you ever want to feel pampered by a shampoo...,1
7,No change my scalp still itches like crazy. It...,-1


In [4]:
len(data), len(verified_data)

(5269, 3356)

<br>

### Step 2. Preprocessing reviewText

For preprocessing, we have settled on the following simple steps:

#### 1. Punctuation Removal

For this, we first replace `&` with `and` and follow through with removing all the extra punctuation. Later we may wish to revisit this step but for now, we proceed with removing all the punctuation.

#### 2. Replace Numbers with Words

We want to retain as many of the original features as possible, so we convert numbers to words. That is, `4 -> four`.

#### 3. Correct Spelling

This is a feature that we are still developing, as it does not always return the expected results. We expect to develop it further in the future.

#### 4. Lemmatization

We use lemmatization to reduce words to their base (dictionary) form. This reduces the overall vocabulary that we need to manage.
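
As a rough illustration of steps 1 and 2, the sketch below uses only the standard library. The `DIGIT_WORDS` table is a hypothetical stand-in for the `inflect` engine and only covers single digits, and the function names are made up for this example. It assumes digits are kept during punctuation removal so that they can be converted afterwards.

```python
import re

# Hypothetical stand-in for inflect.engine().number_to_words (single digits only)
DIGIT_WORDS = {
    '0': 'zero', '1': 'one', '2': 'two', '3': 'three', '4': 'four',
    '5': 'five', '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine',
}

def strip_punctuation(text: str) -> str:
    """Replace '&' with 'and', then blank out everything except letters and digits."""
    text = text.replace('&', ' and ')
    return re.sub(r'[^a-zA-Z0-9]', ' ', text)

def digits_to_words(text: str) -> str:
    """Swap single-digit tokens for their word form and collapse extra spaces."""
    return ' '.join(DIGIT_WORDS.get(token, token) for token in text.split())

sample = "Smells great & lasts 4 hours!"
cleaned = digits_to_words(strip_punctuation(sample))
print(cleaned)
```

The order matters: if the punctuation step also removed digits, there would be nothing left for the number step to convert.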

<br>

### Class Implementation

The class below implements these steps, with one method per step; each method's name describes what it does.

In [5]:
import nltk
import re
import inflect
from nltk.corpus import wordnet, stopwords
# from symspellpy import SymSpell, Verbosity  # Optional: only needed for spelling correction

class Preprocessor:
    """
    A class for preprocessing text data, designed to perform a series of processing steps such as punctuation removal,
    tokenization, numerical word replacement, spelling correction (optional), and lemmatization.
    """
    
    def __init__(self):
        """
        Initializes the Preprocessor object and sets up the necessary processing tools.

        The text to process is passed to `preprocess`, not to the constructor.
        """
        self.word_tokenize = nltk.word_tokenize
        self.lemmatizer = nltk.WordNetLemmatizer()
        self.pos_dict = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}
        self.p = inflect.engine()  # Num to words engine
        # self.sym_spell = SymSpell()  # Initialize SymSpell for spelling correction if needed
        
    def punctuation_removal(self):
        """Removes punctuation from the text, replacing '&' with 'and'.

        Digits are kept so that `replace_num_with_words` can convert them afterwards.
        """
        self.text = self.text.replace('&', ' and ')
        self.text = re.sub("[^a-zA-Z0-9]", " ", self.text)

    def tokenize(self):
        """Tokenizes the text for further processing."""
        return self.word_tokenize(self.text)

    def replace_num_with_words(self):
        """Replaces all numeric tokens in the text with their word representations."""
        words = []
        for word in self.tokenize():
            if word.isdigit():
                # inflect can emit hyphens and commas (e.g. "twenty-one"); keep letters only
                word = re.sub("[^a-zA-Z]", " ", self.p.number_to_words(word))
            words.append(word)
        self.text = ' '.join(words)

    def correct_spelling(self):
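        """
        Attempts to correct the spelling of words in the text. Optional in the preprocess pipeline.

        Requires the commented-out SymSpell import and `self.sym_spell` initialization to be enabled,
        and a dictionary to be loaded into `self.sym_spell`.
        """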
        new_sentence = []
        for word in self.tokenize():
            try:
                correct_word = self.sym_spell.lookup(word, Verbosity.CLOSEST)[0].term
            except IndexError:  # Handles exceptions where word correction is not possible
                correct_word = word
            new_sentence.append(correct_word)
        self.text = ' '.join(new_sentence)

    def lemmatize_text(self):
        """
        Lemmatizes the text, converting words to their base form according to their parts of speech,
        and removes stopwords.
        """
        # Build the stopword set once per call instead of reloading the list for every token
        stop_words = set(stopwords.words('english'))
        tokens = nltk.pos_tag(self.tokenize())
        lemmatized_tokens = [
            self.lemmatizer.lemmatize(word, pos=self.pos_dict.get(tag[0].upper(), wordnet.NOUN))
            for word, tag in tokens if word.lower() not in stop_words
        ]
        self.text = ' '.join(lemmatized_tokens)

    def preprocess(self, text):
        """
        Executes a preprocessing pipeline on the text, including punctuation removal, numerical word replacement,
        (optional spelling correction), and lemmatization.
        """
        self.text = text
        self.punctuation_removal()
        self.replace_num_with_words()
        # self.correct_spelling()
        self.lemmatize_text()
        return self.text.lower()



<br>

### Applying the Preprocessor to reviewText 

We can now apply the Preprocessor to the reviewText Data.

In [6]:
preprocessor = Preprocessor()  # Build the processing tools once and reuse them for every review
verified_data.loc[:, 'review'] = verified_data['reviewText'].apply(lambda x: preprocessor.preprocess(str(x)))

In [7]:
verified_data.head()

Unnamed: 0,reviewText,overall,review
0,As advertised. Reasonably priced,1,advertise reasonably price
1,Like the oder and the feel when I put it on my...,1,like oder feel put face try brand review peopl...
2,I bought this to smell nice after I shave. Wh...,-1,buy smell nice shave put smell awful smell lik...
4,If you ever want to feel pampered by a shampoo...,1,ever want feel pamper shampoo one one smell li...
7,No change my scalp still itches like crazy. It...,-1,change scalp still itch like crazy doesnt lath...


In [8]:
verified_data = verified_data.rename(columns={'overall': 'rate'})  # Rename by name rather than by position
verified_data.head()

Unnamed: 0,reviewText,rate,review
0,As advertised. Reasonably priced,1,advertise reasonably price
1,Like the oder and the feel when I put it on my...,1,like oder feel put face try brand review peopl...
2,I bought this to smell nice after I shave. Wh...,-1,buy smell nice shave put smell awful smell lik...
4,If you ever want to feel pampered by a shampoo...,1,ever want feel pamper shampoo one one smell li...
7,No change my scalp still itches like crazy. It...,-1,change scalp still itch like crazy doesnt lath...


In [9]:
verified_data[['review', 'rate']].to_csv('verified_csv.csv', index=False)

<br>

### 3. Text Feature Extraction

The next step in the pipeline is to convert text into numerical features that can be used to develop a model that predicts the sentiment of the text. To this end, we cover the traditional and modern features that we will use to develop a sentiment analysis model. Namely, we will cover:

1. Bag of Words
2. Term Frequency Inverse Document Frequency (TF-IDF)
3. Word2Vec

We will use the verified and preprocessed dataset to implement these techniques, but first let's explore these techniques independently.

<br>

### 1. Bag of Words

The bag of words technique generates features by encoding each document against the vocabulary of the whole corpus. Conceptually, all words in the corpus are placed into a bag. To map each document, we assign 0 to a word that is not present in the document and 1 to a word that is (or, more generally, the number of times it occurs). To better understand how Bag of Words works, let's demonstrate it with a corpus of 5 relatively simple documents.


In [10]:
import pandas as pd
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

corpus = [ "the restaurant had great food",
           "i love python programming",
           "i prefer R to python",
           "computers are fun to use",
           "i did not like the movie"]

bows_counter = CountVectorizer( analyzer = 'word',           # Word-level vectorizer
                                lowercase = True,            # Lowercase the text
                                ngram_range = (1, 1),        # Unigrams only
                                tokenizer = word_tokenize,   # Use this tokenizer
                                stop_words = 'english',      # Remove English stopwords
                                token_pattern = None )       # Not used when a tokenizer is supplied

bows_counter.fit(corpus)
features = bows_counter.transform(corpus).toarray()

In [11]:
features

array([[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]])

In [12]:
features_df = pd.DataFrame(features, columns=bows_counter.get_feature_names_out())
features_df

Unnamed: 0,computers,did,food,fun,great,like,love,movie,prefer,programming,python,r,restaurant,use
0,0,0,1,0,1,0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,1,0,0,1,1,0,0,0
2,0,0,0,0,0,0,0,0,1,0,1,1,0,0
3,1,0,0,1,0,0,0,0,0,0,0,0,0,1
4,0,1,0,0,0,1,0,1,0,0,0,0,0,0


Notice that each sentence is its own document and that the columns together make up the vocabulary of the whole corpus. `CountVectorizer` stores counts rather than strictly binary flags; the values here are all 0 or 1 only because no word repeats within a document.

The dataframe has 1-gram tokens and an encoding that shows whether a document contains each token. This set of features can help us model the sentiment of the text.

Another thing to notice is that the matrix can be quite sparse, depending on the size of the vocabulary and the relative frequency of its words. Therefore, it may be useful to limit the n-gram range and to select features using frequency thresholds.
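
To put a number on that sparsity, one option is to count the non-zero cells of the toy matrix printed above. The sketch below copies that matrix by hand and uses plain Python, so it does not depend on the fitted vectorizer.

```python
# The 5 x 14 bag-of-words matrix from the toy corpus, copied from the output above
features = [
    [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0],
]

n_cells = len(features) * len(features[0])                       # documents x vocabulary size
n_nonzero = sum(1 for row in features for value in row if value != 0)
sparsity = 1 - n_nonzero / n_cells                               # share of cells that are zero

print(f"{n_nonzero} of {n_cells} cells are non-zero; sparsity = {sparsity:.3f}")
```

Even with five short documents, roughly four out of five cells are zero. With thousands of reviews and a larger vocabulary the share of zeros grows further.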


<br>

### 2. Term Frequency - Inverse Document Frequency

Term Frequency-Inverse Document Frequency (TF-IDF) is a commonly used weighting technique that assigns weights reflecting the importance of a word to a document. The basis of this technique is the idea that if a word appears frequently across all documents, it is less likely to hold significant information about any specific document. On the other hand, words that appear frequently in one or a few documents and rarely across the rest of the corpus are considered to have specific importance and should be assigned higher weights.

One of the many forms of the TF-IDF expression is:

<br>

 
$$ \text{tf-idf}_{t,d} = \text{frequency}_{t,d} \times \log \frac{\text{total documents}}{\text{total documents containing the term}} $$

<br>

It is simply the number of times a word appears in a document multiplied by the logarithm of the total number of documents divided by the number of documents that contain the word.

Intuitively, a word that appears in every document has a ratio of 1, and since $\log 1 = 0$ its weight is zero. Conversely, words with high frequency within a specific document and low frequency across the corpus receive a higher weight.


Let's see an example using our small corpus above.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer( analyzer='word',          # Word level vectorizer
                                    lowercase=True,           # Lowercase the text
                                    stop_words = 'english',
                                    tokenizer= word_tokenize, # Use this tokenizer
                                    token_pattern = None) 

tfidf_vectorizer.fit(corpus)
tfidf_features = tfidf_vectorizer.transform(corpus).toarray()

In [16]:
tfidf_df = pd.DataFrame(tfidf_features, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df

Unnamed: 0,computers,did,food,fun,great,like,love,movie,prefer,programming,python,r,restaurant,use
0,0.0,0.0,0.57735,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.614189,0.0,0.0,0.614189,0.495524,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.614189,0.0,0.495524,0.614189,0.0,0.0
3,0.57735,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57735
4,0.0,0.57735,0.0,0.0,0.0,0.57735,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0
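
The values in this table do not come from the plain formula alone. With its default settings, `TfidfVectorizer` uses a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales each document vector to unit length (L2 norm). The sketch below recomputes the second document by hand under those assumptions. The token lists are read off the table columns, and the function names are made up for this example.

```python
import math

# Tokens left in each toy document after stop-word removal (read off the table above)
docs = [
    ["restaurant", "great", "food"],
    ["love", "python", "programming"],
    ["prefer", "r", "python"],
    ["computers", "fun", "use"],
    ["did", "like", "movie"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = {}
for doc in docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

def plain_tfidf(term, doc):
    """Plain form from the text: term count times log(N / df)."""
    return doc.count(term) * math.log(n_docs / doc_freq[term])

def smoothed_normalized_tfidf(doc):
    """Smoothed idf followed by L2 normalization (assumed scikit-learn defaults)."""
    raw = {
        term: doc.count(term) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term in set(doc)
    }
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

weights = smoothed_normalized_tfidf(docs[1])
print(round(weights["love"], 6), round(weights["python"], 6))
print(round(plain_tfidf("python", docs[1]), 6))
```

The first line of output should agree with the `love` and `python` entries in row 1 of the table. `python` gets the lower weight because it also appears in the third document.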


<br>

### Generating CountVectorizer and TFIDF Vectorizer

Now that we have an understanding and a template for how these methods work, we can apply them to the text data we just processed.

In [17]:
review_countVectorizer = CountVectorizer( analyzer = 'word', 
                                          lowercase = True, 
                                          tokenizer = word_tokenize, 
                                          token_pattern = None, 
                                          stop_words = 'english', 
                                          ngram_range = (1, 1),
                                          min_df = 5,)

review_countVectorizer.fit( verified_data.review )

In [18]:
bowords_features = pd.DataFrame( review_countVectorizer.transform(verified_data.review).toarray(), columns=review_countVectorizer.get_feature_names_out() )
bowords_features.head()

Unnamed: 0,able,absolute,absolutely,accessory,actual,actually,add,addict,addition,adorable,...,wrinkle,wrong,x,yeah,year,yellow,yes,young,yum,yummy
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
review_tfidf_vectorizer = TfidfVectorizer( analyzer='word',          # Word level vectorizer
                                           lowercase=True,           # Lowercase the text
                                           stop_words = 'english',   # remove english stopwords
                                           min_df = 5,               # keep terms found in at least 5 documents
                                           tokenizer= word_tokenize, # Use this tokenizer
                                           token_pattern = None) 

review_tfidf_vectorizer.fit( verified_data.review )

In [20]:
tfidf_features = pd.DataFrame( review_tfidf_vectorizer.transform(verified_data.review).toarray(), columns = review_tfidf_vectorizer.get_feature_names_out() )
tfidf_features.head()

Unnamed: 0,able,absolute,absolutely,accessory,actual,actually,add,addict,addition,adorable,...,wrinkle,wrong,x,yeah,year,yellow,yes,young,yum,yummy
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.32858,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Notes

With traditional feature extraction methods such as Bag of Words and TF-IDF, we are bound to have a lot of sparsity in the feature matrix. We will use these features as a baseline and improve on them with other feature extraction techniques. We also reduce sparsity by setting the minimum document frequency required for a word to be included in the vocabulary.
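
The effect of a minimum document frequency can be seen on a small scale. The sketch below is a plain-Python illustration of the idea behind `min_df`, not of scikit-learn's internals. The reviews and the threshold of 2 are made up for the example; the notebook itself uses `min_df = 5`.

```python
from collections import Counter

# Toy, already-preprocessed reviews (hypothetical examples)
reviews = [
    "smell great love smell",
    "great price love",
    "smell awful",
    "love great scent",
]
MIN_DF = 2  # hypothetical threshold for this toy corpus

# Document frequency: each term is counted at most once per review
doc_freq = Counter(term for review in reviews for term in set(review.split()))

full_vocab = sorted(doc_freq)
kept_vocab = sorted(term for term, count in doc_freq.items() if count >= MIN_DF)

print(len(full_vocab), "->", len(kept_vocab), kept_vocab)
```

Terms that occur in only one review are dropped, which removes mostly-empty columns from the feature matrix.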

### Pickling Vectorizers and Datasets for Modelling

Preprocessing large datasets often takes a long time, so it is useful to save checkpoints. For the fitted vectorizers and the feature datasets, we use `pickle` to save their state so that the vectorizers can later be applied to new text.

In [21]:
import pickle

def pickle_object(object_to_pickle, file_name):
    """
    Serializes an object and saves it to a file using the pickle protocol.
    
    This function takes any Python object and a file name as input. It serializes the object using pickle and 
    saves it to the specified file. This is particularly useful for saving model objects or data transformers 
    for later use.
    
    Parameters:
    - object_to_pickle: The Python object to serialize. This can be any object that pickle can handle, including
      custom classes, lists, dictionaries, etc.
    - file_name: The name of the file (with path, if necessary) where the serialized object will be saved. 
      It's recommended to use a '.pkl' extension for clarity.
    
    Returns:
    - None
    """
    try:
        with open(file_name, 'wb') as file:
            pickle.dump(object_to_pickle, file)
    except Exception as e:
        print(f"An error occurred while pickling the object: {e}")


In [22]:
data_objects = { 'tfidf_vectorizer.pk': review_tfidf_vectorizer, 
                 'count_vectorizer.pk': review_countVectorizer,
                 'count_features_dataframe.pk': bowords_features,
                 'tfidf_features_dataframe.pk': tfidf_features }

for name, obj in data_objects.items():
    pickle_object(obj, name)
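
To reuse a saved vectorizer on new text, the pickled file has to be read back. The sketch below shows one way to do that with a hypothetical `load_pickled_object` helper, which is not part of the original notebook. It round-trips a plain dictionary as a stand-in for a fitted vectorizer. Only unpickle files you trust, since `pickle.load` can execute arbitrary code.

```python
import os
import pickle
import tempfile

def load_pickled_object(file_name):
    """Hypothetical counterpart to pickle_object: read an object back from disk."""
    with open(file_name, 'rb') as file:
        return pickle.load(file)

# Stand-in object; in the notebook this would be a fitted vectorizer or a DataFrame
stand_in = {'vocabulary': ['great', 'love', 'smell'], 'min_df': 5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'stand_in.pk')
    with open(path, 'wb') as file:
        pickle.dump(stand_in, file)          # same call pickle_object makes
    restored = load_pickled_object(path)

print(restored == stand_in)
```

A vectorizer restored this way can be applied to new reviews with `transform`, provided the new text goes through the same preprocessing and the same library versions are installed.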