<a href="https://colab.research.google.com/github/MarioAvolio/Amazon-Fine-Foods-reviews-Transformers-Text-Classification/blob/main/Text_Preprocessing_Amazon_Fine_Food.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Credits


**Mario Avolio: 880995 - https://marioavolio.netlify.app/**

Credits: 
- https://www.oreilly.com/library/view/practical-natural-language/9781492054047/

Dataset:
- https://snap.stanford.edu/data/web-FineFoods.html



In [1]:
import pandas as pd
import os
import matplotlib.pyplot as plt # plotting
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [None]:
import nltk
#https://www.nltk.org/

#NLTK is a leading platform for building Python programs to work with human language data. 
#It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, 
#along with a suite of text processing libraries for classification, tokenization, 
#stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength 
#NLP libraries, and an active discussion forum.

nltk.download('punkt') 

nltk.download('stopwords')
# Stop word lists shipped with NLTK (used in the "Frequent Steps" section)


from nltk.tokenize import word_tokenize
#Tokenizers divide strings into lists of substrings. For example, tokenizers can 
#be used to find the words and punctuation in a string

# Constants and Methods

In [None]:
PATH_PROJ = "/content/drive/MyDrive/data-proj/"
# if not os.path.exists(PATH_PROJ):
#   PATH_PROJ = "/content/drive/MyDrive/shared/data-proj/"

PATH_DATASET = PATH_PROJ+"food.csv"
PATH_DATASET_PREPROCESSED = PATH_PROJ+"preprocessed.csv"

In [None]:
# Distribution graphs (histogram/bar graph) of column data
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    nunique = df.nunique()
    df = df[[col for col in df if nunique[col] > 1 and nunique[col] < 50]] # For displaying purposes, pick columns that have more than 1 and fewer than 50 unique values
    nRow, nCol = df.shape
    columnNames = list(df)
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow # integer division: plt.subplot needs an int row count
    plt.figure(num = None, figsize = (6 * nGraphPerRow, 8 * nGraphRow), dpi = 80, facecolor = 'w', edgecolor = 'k')
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if (not np.issubdtype(type(columnDf.iloc[0]), np.number)):
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            columnDf.hist()
        plt.ylabel('Number of sentences')
        plt.xticks(rotation = 90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad = 1.0, w_pad = 1.0, h_pad = 1.0)
    plt.show()

In [None]:
def get_random_sentences_from_dataset():
  for elem in df.text.sample(30).to_numpy():
    print(" \n ---> ",elem)
  

# Pipeline

We would normally walk through the requirements and break the problem down into several
subproblems, then try to develop a step-by-step procedure to solve them. Since language
processing is involved, we would also list all the forms of text processing needed at
each step. This step-by-step processing of text is known as a pipeline. It is the series of
steps involved in building any NLP model. These steps are common to every NLP
project.

The first step in the process of developing any NLP system is to collect data relevant
to the given task. Even if we’re building a rule-based system, we still need some data
to design and test our rules. The data we get is seldom clean, and this is where text
cleaning comes into play. After cleaning, text data often has a lot of variations and
needs to be converted into a canonical form. This is done in the pre-processing step.
This is followed by feature engineering, where we carve out indicators that are most
suitable for the task at hand. These indicators are converted into a format that is
understandable by modeling algorithms. Then comes the modeling and evaluation
phase, where we build one or more models and compare and contrast them using a
relevant evaluation metric(s). Once the best model among the ones evaluated is
chosen, we move toward deploying this model in production. Finally, we regularly
monitor the performance of the model and, if need be, update it to keep up its
performance.
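
As a minimal sketch of the idea, a pipeline can be seen as an ordered list of functions where each step consumes the output of the previous one. The step functions and `run_pipeline` below are hypothetical names used only for illustration; they are not part of this notebook's actual processing.

```python
import re

def strip_markup(text):
    # Cleaning: replace anything that looks like an HTML tag with a space.
    return re.sub(r"<[^>]+>", " ", text)

def normalize(text):
    # Pre-processing: collapse repeated whitespace and lowercase.
    return re.sub(r"\s+", " ", text).strip().lower()

def tokenize_whitespace(text):
    # Simplest possible tokenization: split on single spaces.
    return text.split(" ")

def run_pipeline(text, steps):
    # Feed the output of each step into the next one.
    result = text
    for step in steps:
        result = step(result)
    return result

steps = [strip_markup, normalize, tokenize_whitespace]
tokens = run_pipeline("Great <br/>TASTE,  would buy again", steps)
print(tokens)
```

Keeping the steps in a list makes it easy to reorder them or switch one off while experimenting.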


# Data 

In [None]:
df = pd.read_csv(PATH_DATASET)

In [None]:

df

In [None]:
print(type(df))
print(df.shape)

In [None]:
print(df.columns)

In [None]:
nRow, nCol = df.shape
print(f'There are {nRow} rows and {nCol} columns')

In [None]:
df.head(6)

Let's keep only the columns that are useful for our purposes.

In [None]:
df = df[["text","score"]].copy() # copy so that later column assignments do not raise SettingWithCopyWarning

In [None]:
df.iloc[30:40]

## Looking at the Class Distribution


In [None]:
import matplotlib.pyplot as plt
df["score"].value_counts(ascending=True).plot.barh()
plt.title("Frequency of Classes")
plt.show()

In this case, we can see that the dataset is heavily imbalanced. There are several ways to deal with imbalanced data, including:
- Randomly oversample the minority class.
- Randomly undersample the majority class.
- Gather more labeled data from the underrepresented classes.
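
The first two options can be sketched with the standard library alone. This is an illustration on toy `(text, score)` pairs, not the approach used in this notebook; `random_oversample`, `random_undersample`, and `get_score` are hypothetical helper names.

```python
import random
from collections import Counter

def group_by_label(rows, label_of):
    # Collect rows into one list per class label.
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    return by_label

def random_oversample(rows, label_of, seed=0):
    # Duplicate random rows of each smaller class until every class
    # is as large as the largest one.
    rng = random.Random(seed)
    by_label = group_by_label(rows, label_of)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

def random_undersample(rows, label_of, seed=0):
    # Keep a random subset of each class, sized to the smallest class.
    rng = random.Random(seed)
    by_label = group_by_label(rows, label_of)
    target = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    return balanced

# Toy data: (text, score) pairs dominated by score 5.
rows = [("great", 5)] * 6 + [("ok", 3)] * 2 + [("bad", 1)] * 1
get_score = lambda row: row[1]

over = random_oversample(rows, get_score)
under = random_undersample(rows, get_score)
print(Counter(get_score(r) for r in over))
print(Counter(get_score(r) for r in under))
```

Resampling is usually applied to the training split only, so that evaluation still reflects the original class distribution.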

## How Long Are Our Reviews?
Transformer models have a maximum input sequence length that is referred to as the
maximum context size. For applications using DistilBERT, the maximum context size
is 512 tokens, which amounts to a few paragraphs of text. 

In [None]:
df["Words Per Review"] = df["text"].str.split().apply(len)
df.boxplot("Words Per Review", by="score", grid=False,
           showfliers=False, color="black")
plt.suptitle("")
plt.xlabel("")
plt.show()

From the plot we see that, for each score, most reviews are around 60 words long,
which is well below DistilBERT’s maximum context size. Note that the plot hides outliers
(`showfliers=False`) and that word counts only approximate token counts. Texts that
are longer than a model’s context size need to be truncated, which can lead to a loss in
performance if the truncated text contains crucial information; for typical reviews, it looks
like that won’t be an issue.
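
To check how many texts would actually be affected, one could count those above the limit and truncate them. The sketch below uses whitespace-separated words as a rough proxy for tokens, which is an assumption: a subword tokenizer typically produces more tokens than words. `truncate_words` and `share_over_limit` are hypothetical helpers.

```python
MAX_CONTEXT = 512  # DistilBERT's maximum context size, as stated above

def truncate_words(text, max_words=MAX_CONTEXT):
    # Keep only the first max_words whitespace-separated words.
    words = text.split()
    return " ".join(words[:max_words])

def share_over_limit(texts, max_words=MAX_CONTEXT):
    # Fraction of texts whose word count exceeds the limit.
    if not texts:
        return 0.0
    too_long = sum(1 for text in texts if len(text.split()) > max_words)
    return too_long / len(texts)

# Toy reviews: the first one is artificially long (600 words).
reviews = ["tasty " * 600, "good coffee", "would buy again"]
share = share_over_limit(reviews)
truncated_length = len(truncate_words(reviews[0]).split())
print(share, truncated_length)
```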

The target scores map to the following string labels:
1. VERY NEGATIVE
2. NEGATIVE
3. NEUTRAL
4. POSITIVE
5. EXCELLENT
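
One simple way to hold this mapping is a dictionary, with the reverse lookup derived from it. The names `SCORE_TO_LABEL` and `LABEL_TO_SCORE` are assumptions made for this sketch.

```python
# Score-to-label mapping taken from the list above.
SCORE_TO_LABEL = {
    1: "VERY NEGATIVE",
    2: "NEGATIVE",
    3: "NEUTRAL",
    4: "POSITIVE",
    5: "EXCELLENT",
}

# Reverse lookup, handy when converting predictions back to scores.
LABEL_TO_SCORE = {label: score for score, label in SCORE_TO_LABEL.items()}

scores = [5, 1, 3]
labels = [SCORE_TO_LABEL[score] for score in scores]
print(labels)
```

With the DataFrame used in this notebook, the same dictionary could be passed to `df["score"].map(SCORE_TO_LABEL)`.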


# Pre-Processing
Text extraction and cleanup leave us with the plain text of each review.
However, NLP software typically works at the sentence level and
expects at least a separation of words. So, we need some way to split a text
into words and sentences before proceeding further in a processing pipeline.
Sometimes we need to remove special characters and digits, and sometimes we don’t care
whether a word is in upper or lowercase and want everything in lowercase. Many
more decisions like this are made while processing text. Such decisions are addressed
during the pre-processing step of the NLP pipeline. Here are some common pre-processing steps used in NLP software:

- Preliminaries: Sentence segmentation and word tokenization.
- Frequent steps: Stop word removal, stemming and lemmatization, removing digits/punctuation,
lowercasing, etc.
- Other steps: Normalization, language detection, code mixing, transliteration, etc.
- Advanced processing: POS tagging, parsing, coreference resolution, etc.

## First Cleanup
Text extraction and cleanup refers to the process of extracting raw text from the input
data by removing all the other non-textual information, such as markup, metadata,
etc., and converting the text to the required encoding format.

In [None]:
from bs4 import BeautifulSoup

def clean_html_tags(row):
  soupified = BeautifulSoup(row, "html.parser")
  for linebreak in soupified.find_all('br'): # replace <br> tags with a space
    linebreak.replace_with(" ")

  for span in soupified.find_all('span'): # drop <span> tags but keep their text
    span.unwrap()

  return str(soupified)

In [None]:
df.text.iloc[10]

In [None]:
for elem in df.text.sample(30).to_numpy():
  print(" \n ---> ",clean_html_tags(elem))
  

In [None]:
df['text'] = df['text'].apply(clean_html_tags)

## Preliminaries

As mentioned earlier, NLP software typically analyzes text by breaking it up into
words (tokens) and sentences. Hence, any NLP pipeline has to start with a reliable
system to split the text into sentences (sentence segmentation) and further split a sentence into words (word tokenization).

### Sentence segmentation

As a simple rule, we can do sentence segmentation by breaking up text into sentences
at the appearance of full stops and question marks. However, there may be abbreviations,
forms of address (Dr., Mr., etc.), or ellipses (...) that break the simple
rule.
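
The limitation is easy to reproduce with a naive splitter. This is a sketch of the simple rule only, and `naive_sent_split` is a hypothetical helper, not the NLTK function used in this notebook.

```python
import re

def naive_sent_split(text):
    # Split at whitespace that follows a full stop or a question mark.
    parts = re.split(r"(?<=[.?])\s+", text.strip())
    return [part for part in parts if part]

text = "Dr. Smith loved the tea. Would he buy it again? Yes."
sentences = naive_sent_split(text)
for sentence in sentences:
    print(repr(sentence))
```

The title "Dr." ends up as a sentence of its own, which is exactly the kind of error a trained sentence tokenizer is meant to avoid.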

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
df.text.iloc[10]

In [None]:
sent_tokenize(df.text.iloc[10])

In [None]:
df.text = df.text.apply(sent_tokenize)

### Word tokenization
While readily available solutions work for most of our needs and most NLP libraries
will have a tokenizer and sentence splitter bundled with them, it’s important to
remember that they’re far from perfect. For example, consider this sentence: “Mr. Jack
O’Neil works at Melitas Marg, located at 245 Yonge Avenue, Austin, 70272.” If we run
this through the NLTK tokenizer, O, ‘, and Neil are identified as three separate tokens.
Similarly, if we run the sentence: “There are \$10,000 and €1000 which are there just
for testing a tokenizer” through this tokenizer, while $ and 10,000 are identified as
separate tokens, €1000 is identified as a single token. In another scenario, if we want
to tokenize tweets, this tokenizer will separate a hashtag into two tokens: a “#” sign
and the string that follows it. In such cases, we may need to use a custom tokenizer
built for our purpose.
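
A custom tokenizer can be as small as one regular expression. The sketch below is only an illustration under simple assumptions (straight apostrophes, `$` and `€` as the only currency signs); `TOKEN_PATTERN` and `simple_tokenize` are hypothetical names.

```python
import re

TOKEN_PATTERN = re.compile(
    r"#\w+"                          # hashtags such as #nlp
    r"|[$€]?\d+(?:,\d+)*(?:\.\d+)?"  # numbers, optionally with a currency sign
    r"|\w+(?:'\w+)*"                 # words, keeping inner apostrophes
    r"|[^\w\s]"                      # any other single non-space symbol
)

def simple_tokenize(text):
    # The pattern has no capturing groups, so findall returns whole matches.
    return TOKEN_PATTERN.findall(text)

example_tokens = simple_tokenize("Mr. O'Neil paid $10,000 and €1000 for #nlp books.")
print(example_tokens)
```

Here the name with an apostrophe, both currency amounts, and the hashtag each stay in one piece, while "Mr." is still split into two tokens; handling abbreviations would need an extra rule.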

In [None]:
def word_tokenize_custom(list_of_sent):
  list_of_word = []
  for sent in list_of_sent:
    list_of_word.extend(word_tokenize(sent))

  return list_of_word

In [None]:
#word_tokenize function

print(df.text.iloc[10])
print(word_tokenize_custom(df.text.iloc[10]))

When dealing with social media text, we usually want to treat URLs, hashtags, and smileys as single tokens rather than splitting them into individual characters.

In [None]:
import re
 
emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\[/\\OpP] # Mouth
    )"""
 
regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]

    

emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)


tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)
 
def tokenize(s):
    return tokens_re.findall(s)
 
def preprocess(list_of_sentences):
  sentence = []
  for sent in list_of_sentences:
    sent = emoji_pattern.sub(r'', sent)
    sentence.extend(tokenize(sent))
  
  return np.array(sentence)

In [None]:
preprocess(df.text.iloc[10])

In [None]:
df.text = df.text.apply(preprocess) 

## Frequent Steps
Some of the frequently used words in English, such as a, an, the, of, in, etc., are not
particularly useful for this task, as they don’t carry any content on their own to distinguish
between the review scores. Such words are called stop words and are typically
(though not always) removed from further analysis in such problem scenarios. There
is no standard list of stop words for English, though. There are some popular lists
(NLTK has one, for example), although what counts as a stop word can vary depending on what we’re working on.

Similarly, in some cases, upper or lowercase may not make a difference for the problem. So, all text is lowercased (or uppercased, although lowercasing is more common). Removing punctuation and/or numbers is also a common step for many NLP
problems, such as text classification, information retrieval,
and social media analytics.


In [None]:
from nltk.corpus import stopwords
from string import punctuation

# Build the lookup sets once instead of on every call.
mystopwords = set(stopwords.words("english"))

# Extra tokens to drop, added after inspecting the reviews: leftover
# capitalized words, URL fragments, and encoding debris.
stop = {
    "The", "And", "I", "J", "K", "I'd", "That's", "It", "I'm", "it's",
    "...", "``", "''", "http", "https", "co", "000",
    "\x81", "\x89", "\x9d", "ĚĄ", "ă", "âÂĺ", "Ě", "˘", "Â", "âÂ",
    "Ň", "ââ", "ě", "ň",
}

def preprocess_corpus(texts):
  '''
  Remove stop words, digits, and punctuation from a list of tokens,
  then lowercase what remains.
  '''
  def remove_stops_digits(tokens):
    return [token.lower() for token in tokens
            if token.lower() not in mystopwords  # NLTK's list is lowercase
            and not token.isdigit()
            and token not in punctuation
            and token not in stop]

  return remove_stops_digits(texts)



In [None]:
df.text.iloc[10]

In [None]:
preprocess_corpus(df.text.iloc[4])

In [None]:
df.text = df.text.apply(preprocess_corpus)

In [None]:
for elem in df.text.sample(100).to_numpy():
  print(" \n ---> ",elem)
  

### Stemming and lemmatization

Stemming refers to the process of removing suffixes and reducing a word to some
base form such that all different variants of that word can be represented by the same form (e.g., “car” and “cars” are both reduced to “car”). This is accomplished by applying a fixed set of rules (e.g., if the word ends in “-es,” remove “-es”). Although such rules may not always end up in a linguistically correct base form, stemming is commonly used in search engines to match user queries to relevant documents and in text classification to reduce the feature space to train machine learning models.


Lemmatization is the process of mapping all the different forms of a word to its base
word, or lemma. While this seems close to the definition of stemming, they are, in
fact, different. For example, the adjective “better,” when stemmed, remains the same.
However, upon lemmatization, this should become “good.”
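
To make the "fixed set of rules" concrete, here is a toy stemmer that applies only the plural-suffix rules from step 1a of the Porter algorithm. It is a sketch for illustration: the full algorithm has several more steps, and `toy_stem` is a hypothetical name.

```python
def toy_stem(word):
    # Rules are checked from the longest suffix to the shortest.
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress stays caress
    if word.endswith("s"):
        return word[:-1]   # cars -> car
    return word            # no rule applies

for word in ["car", "cars", "caresses", "ponies", "better"]:
    print(word, "->", toy_stem(word))
```

The result for "ponies" shows that a stem is not necessarily a dictionary word, and "better" is left untouched because no suffix rule matches; mapping it to "good" requires the vocabulary knowledge that lemmatization relies on.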


In [None]:
from nltk.stem.porter import PorterStemmer

def make_stemming(sent):
  excluded_words = ["i've", "this"]
  stemmer = PorterStemmer()
  stemmed_sentence = []
  for word in sent:
    if word in excluded_words:
      stemmed_sentence.append(word)
      continue

    stemmed_word = stemmer.stem(word)
    stemmed_sentence.append(stemmed_word)
  
  return stemmed_sentence

In [None]:
df.text.iloc[10]

In [None]:
make_stemming(df.text.iloc[10])

In [None]:
# df.text = df.text.apply(make_stemming)

In [None]:
# get_random_sentences_from_dataset()

In [None]:
import spacy

# Load the model once; reloading it on every call is very slow.
sp = spacy.load('en_core_web_sm')

def make_lemmatization(sentence):
  list_of_lemmatize_words = []
  for word in sentence:
    doc = sp(str(word)) # run the spaCy pipeline on a single token
    list_of_lemmatize_words.append(doc[0].lemma_)

  return list_of_lemmatize_words

In [None]:
df.text.iloc[10]

In [None]:
make_lemmatization(df.text.iloc[10])

In [None]:
from tqdm.auto import tqdm

tqdm.pandas(desc="progress: ")
# df.text = df.text.progress_apply(make_lemmatization) # too expensive to run on the full dataset

## Dictionary check

In [None]:
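nltk.download('words')
from nltk.corpus import words

# Build the vocabulary set once: calling words.words() for every token
# reloads a long list and makes the membership test very slow.
english_vocab = set(words.words())

def return_real_word(sentence):
  return [w for w in sentence if w in english_vocab]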


In [None]:
len(return_real_word(df.text.iloc[10])), len(df.text.iloc[10])

In [None]:
tqdm.pandas(desc="progress: ")

# df.text = df['text'].progress_apply(return_real_word) # too expensive to run on the full dataset

## Remove single-character tokens

In [None]:
def remove_single_token(sentence):
  return [word for word in sentence if len(word)>1]

In [None]:
len(remove_single_token(df.text.iloc[2])), len(df.text.iloc[2])

In [None]:
df.text.iloc[2]

In [None]:
df.text = df.text.apply(remove_single_token)

In [None]:
get_random_sentences_from_dataset()

In [None]:
df.to_csv(PATH_DATASET_PREPROCESSED, index=False)

# Visualize words

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from PIL import Image
import numpy as np
from collections import Counter

from wordcloud import WordCloud

In [None]:
# Function to get the counter (will be helpful later on)
def get_counter(series):
  flat_list = [item for sublist in series for item in sublist]
  c = Counter(flat_list)
  return c

In [None]:
token_counts = get_counter(df.text) # frequency of every token in the corpus

fig = plt.figure(figsize=(20,14))
wordcloud = WordCloud(width=1600, height=800, background_color="black").generate_from_frequencies(token_counts)
plt.axis("off")
plt.imshow(wordcloud, interpolation='antialiased')
plt.show()

In [None]:
get_counter(df.text).most_common(10)