# Natural Language Processing with Python

In this workshop we will use the [Sentiment Labelled Sentences Data Set](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences), which contains 500 positive and 500 negative sentences from each of imdb.com, amazon.com and yelp.com.

* First, we will explore preliminary text analytics and text pre-processing.
* Second, we will evaluate different feature extraction mechanisms for text.
* Third, we will evaluate a simple text classifier on the Amazon review dataset, and an advanced deep neural network for classification accuracy improvement.


## Load the data

Upload the `tp6_reviews.csv` file downloaded from the LMS to the Google Colab file repository.

In [None]:
import pandas as pd

In [None]:
# The dataset file tp6_reviews.csv MUST be uploaded to Google Colab before executing this line
df = pd.read_csv('tp6_reviews.csv')

View a summary of the dataset.

In [None]:
df.info()

In [None]:
df.head()

Check which data sources are available in the dataset.

In [None]:
df['source'].unique()

## Preliminary analysis

***Pandas `apply()` function***  
* It lets you pass a function and apply it to every value of a pandas Series, i.e., a column. [API documentation.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)
* It is a convenient way to update the values in a dataframe column.

The following example:

*   Counts the number of words in each sentence
*   Assigns the word count to a new column named 'word_count_function'



In [None]:
def word_counter(document):
  split_word = str(document).split(" ") # split on single space characters
  word_count = len(split_word) # count the words
  return word_count

df['word_count_function'] = df['sentence'].apply(word_counter)

In [None]:
df.head(5)

The same result can be achieved with a simple lambda function.

In [None]:
df['word_count'] = df['sentence'].apply(lambda x: len(str(x).split(" ")))
df.head(5)
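The word counts above depend on how the sentence is split. The sketch below uses a made-up sentence to compare `split(" ")` with `split()`, which behave differently when the text contains repeated or trailing spaces.

```python
# Illustrative sentence with a double space and a trailing space (made up for this sketch).
sentence = "Great  phone "

# split(" ") splits on every single space, so repeated or trailing
# spaces produce empty strings that are counted as "words".
by_single_space = sentence.split(" ")
print(by_single_space, len(by_single_space))

# split() with no argument splits on runs of whitespace and drops empty strings.
by_whitespace = sentence.split()
print(by_whitespace, len(by_whitespace))
```

If the raw text may contain irregular spacing, `split()` is usually the safer choice for counting words.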

Similarly, count the number of characters in each sentence.

In [None]:
df['char_count'] = df['sentence'].str.len()  # Includes the spaces
df.head(5)

Calculate the average word length for each sentence.

*   First, construct a function (avg_word()) which takes a sentence, splits it into words, then calculates the average word length.
*   Using the pandas apply() function and avg_word(), calculate the average word length of each sentence.
*   Assign the value to a new column named 'avg_word'.



In [None]:
def avg_word(sentence):
  words = str(sentence).split() # split the sentence into words
  if not words: # guard against empty sentences to avoid division by zero
    return 0.0
  return sum(len(word) for word in words) / len(words)

df['avg_word'] = df['sentence'].apply(avg_word)
df.head(5)

## Text pre-processing

Pre-processing is mandatory for most text analytics tasks, as text in its raw format is unstructured and noisy. 

In the following snippets you will run several pre-processing steps.  

**Please note that pre-processing is to be used with clear understanding of the expected outcome of text analytics, as each pre-processing step is not relevant or applicable to every NLP task.**

Uppercase and lowercase characters are used for clarity in human communication. However, for a machine such distinction would create unnecessary complexities. Therefore, we transform all characters to lowercase.

In [None]:
df['sentence'] = df['sentence'].str.lower()
df.head()

The same applies to punctuation marks, which we remove using [regular expressions](https://www.w3schools.com/python/python_regex.asp).  
**Regular Expressions** - A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings.

In [None]:
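# This regular expression keeps only word characters (letters, digits, underscore) and whitespace
# regex=True is required because recent pandas versions treat the pattern as a literal string by default
df['sentence'] = df['sentence'].str.replace(r'[^\w\s]', '', regex=True)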
df.head()
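To see what the pattern `[^\w\s]` removes, here is a minimal sketch using Python's built-in `re` module on a made-up sentence. The helper name `strip_punctuation` is hypothetical and used only for illustration.

```python
import re

def strip_punctuation(text):
    # [^\w\s] matches any character that is neither a word character
    # (letters, digits, underscore) nor whitespace.
    return re.sub(r"[^\w\s]", "", text)

sample = "wow... it's great, isn't it?"
print(strip_punctuation(sample))

# Underscores count as word characters, so they are kept.
print(strip_punctuation("snake_case!"))
```

Note that apostrophes are removed too, so contractions such as "isn't" become "isnt".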

### Remove digits

For a sentiment analytics task, numbers or digits are not needed. Thus, we remove digits from the text dataset. 

However, for other tasks, numbers may be needed. 

In [None]:
def remove_digits(sent):
  return " ".join(w for w in sent.split() if not w.isdigit())

df['sentence'] = df['sentence'].apply(remove_digits)
df.head()

Demonstration of the remove_digits() function

In [None]:
sample_text = 'Covid 19 is spreading fast'
print(remove_digits(sample_text))

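What does " ".join() do? It concatenates the items of a list into one string, inserting the separator string between consecutive items.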

In [None]:
word_list = ["Covid", "is", "spreading", "fast"]
sentence = "     ".join(word_list)
print(sentence)

### Remove Stopwords

[Stopwords](https://en.wikipedia.org/wiki/Stop_words) are very common words (such as "the" or "is") that are usually deemed irrelevant for NLP purposes because they occur frequently and carry little meaning on their own. Therefore, we will omit the stopwords as a pre-processing step. For this, we will use the [NLTK](https://www.nltk.org/) library.

**NLTK is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English, written in Python.**

In [None]:
# Load NLTK library
import nltk

# Download the stopwords to the nltk library
nltk.download('stopwords')

# Load the stopwords
from nltk.corpus import stopwords

Have a look at the stopwords indexed in the NLTK library.

In [None]:
stop = stopwords.words('english')
print(stop)

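Some words in the NLTK stopword list are worth keeping in the text; for example, 'not' can reverse the sentiment of a sentence. Remove such words from the stopword list so that they are not filtered out.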

In [None]:
stop.remove('not')

In [None]:
all_words_i_want = ['had', "has"]
for w in all_words_i_want:
  stop.remove(w)

Remove stopwords from the sentences.

In [None]:
df['sentence'] = df['sentence'].apply(lambda x: " ".join(w for w in x.split() if w not in stop))
df.head(5)
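The effect of keeping 'not' out of the stopword list can be sketched without NLTK. The stopword list and the review below are made up for illustration, and `remove_stopwords` is a hypothetical helper, not part of any library.

```python
# A tiny hand-written stopword list, standing in for NLTK's much longer list.
toy_stopwords = ["the", "is", "a", "not", "this"]

def remove_stopwords(sentence, stopwords):
    stopword_set = set(stopwords)  # set membership tests are fast
    return " ".join(w for w in sentence.split() if w not in stopword_set)

review = "this phone is not good"

# With 'not' treated as a stopword, the negation disappears.
without_negation = remove_stopwords(review, toy_stopwords)
print(without_negation)

# Dropping 'not' from the list keeps the negation in the text.
kept_negation_list = [w for w in toy_stopwords if w != "not"]
with_negation = remove_stopwords(review, kept_negation_list)
print(with_negation)
```

For sentiment tasks, the second result preserves the information that the review is negative.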

### Common and rare word analysis

Aside from stopwords, some words appear rarely (only once or twice) in an entire body of text. 
Depending on the analytics requirement, you can decide whether to keep or remove them, and how aggressively to remove them.

To do this, we first construct a word frequency table.

In [None]:
word_frequency = pd.Series(' '.join(df['sentence']).split()).value_counts()

List the top 10 common words.

In [None]:
# Top common words
word_frequency[:10]  # get top 10

List the 10 rarest words.

In [None]:
# least common words
word_frequency[-10:]  # get bottom 10
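Once word frequencies are known, rare words can be filtered by a threshold. The following is a minimal sketch using `collections.Counter` on made-up sentences; the helper `drop_rare_words` and its `min_count` parameter are assumptions for illustration, and a suitable threshold depends on the task.

```python
from collections import Counter

# Made-up sentences, assumed to be already pre-processed.
sentences = ["good phone", "good battery", "bad phone", "terrible charger"]

# Count every word across the whole corpus.
word_counts = Counter(" ".join(sentences).split())
print(word_counts.most_common(2))

def drop_rare_words(sentence, counts, min_count=2):
    # Keep only words that occur at least `min_count` times in the corpus.
    return " ".join(w for w in sentence.split() if counts[w] >= min_count)

cleaned = [drop_rare_words(s, word_counts) for s in sentences]
print(cleaned)
```

Aggressive thresholds can empty short sentences entirely, as the last sentence here shows, so it is worth inspecting the result before training a model on it.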

### Spelling correction

To correct misspelt words, we will use the [textblob](https://textblob.readthedocs.io/en/dev/) library. Keep in mind that corrections are always bound by the dictionary that you use, and they may not account for context (their vs there).

Due to the time complexity of spell-checking an entire corpus, in this exercise, we will use spell-check for just one example. 

In [None]:
from textblob import TextBlob

In [None]:
# Do not run this line of code.
# The following line of code will correct the spelling of all the sentences in the dataset.
# df['sentence'] = df['sentence'].apply(lambda x: str(TextBlob(x).correct()))   # This will take a long time. Thus, we will show a separate example

Spelling correction example

In [None]:
def correct_word(word):
  return str(TextBlob(word).correct())

print(correct_word('bisness'))

In [None]:
incorrect_text = 'bisness anlytis is an itant skil seit for any organizaton'

# Print the original text followed by the corrected version
print(incorrect_text)
print(correct_word(incorrect_text))
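The point that corrections are bound by the dictionary can be illustrated with Python's built-in `difflib` module. This is not the method TextBlob uses; it is only a rough sketch of dictionary-bound matching, with a hypothetical mini-vocabulary and a hypothetical helper named `suggest`.

```python
import difflib

# Hypothetical mini-dictionary; real spell checkers use far larger word lists.
vocabulary = ["business", "analytics", "skill", "organization"]

def suggest(word, vocab):
    # Return the closest dictionary word, or the input unchanged
    # if nothing in the dictionary is similar enough.
    matches = difflib.get_close_matches(word, vocab, n=1)
    return matches[0] if matches else word

print(suggest("bisness", vocabulary))
print(suggest("zzzz", vocabulary))
```

A word that has no close match in the dictionary is returned as-is, so the quality of the corrections depends directly on the vocabulary supplied.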

### Stemming

[Stemming](https://en.wikipedia.org/wiki/Stemming) removes affixes (mainly suffixes) to reduce a word to a base form, which is not necessarily a valid dictionary word. We will use the NLTK library.
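To build intuition for suffix stripping, here is a deliberately crude sketch. It is far simpler than the Porter algorithm used in this workshop, and both the suffix list and the `crude_stem` helper are assumptions made for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative subset only

def crude_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["jumps", "jumping", "jumped", "movies"]:
    print(w, "->", crude_stem(w))
```

The last example shows a typical side effect of rule-based stemming: the result ("movi") is not a real word.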

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stemming_function(sent):
  word_list = sent.split()
  stemmed_word_list = [stemmer.stem(word) for word in word_list]
  stemmed_sentence = " ".join(stemmed_word_list)
  return stemmed_sentence

df['sentence_stemmed'] = df['sentence'].apply(stemming_function)

df.head()

### Lemmatization

[Lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html), unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization, the root word is called a lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.  
  
We will use WordNet for the lemmatization. Thus, we need to download WordNet to the nltk library.

WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, and provides short definitions and usage examples.


In [None]:
# Download wordnet
nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [None]:
def lemmatize_function(sent):
  word_list = sent.split()
  lemma_word_list = [lemmatizer.lemmatize(word) for word in word_list]
  lemma_sentence = " ".join(lemma_word_list)
  return lemma_sentence

df['sentence_lemmatized'] = df['sentence'].apply(lemmatize_function)

Display original pre-processed sentence, stemmed sentence and lemmatized sentence.

In [None]:
df[['sentence', 'sentence_stemmed', 'sentence_lemmatized']].head(10)

The stemming algorithm seems to work better than lemmatization in this case. It is recommended to inspect the results after each pre-processing task to understand the performance of the third-party libraries used for pre-processing.

## Text Feature Extraction

In a numeric dataset (e.g., house price dataset, titanic survival dataset, dungaree dataset), we had numeric and categorical variables, which we transformed to numeric values for predictive analytics. Those numeric variables are called numeric features of the dataset. Similarly, for NLP, we need to derive features from text data in numerical format, because machine learning models operate on numeric representations.

### N-Grams

An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a contiguous sequence of n items from a given sample of text or speech. N-grams are essentially sets of co-occurring words within a given window. When computing n-grams, the window shifts one step forward at a time (although you can move X words forward in more advanced scenarios). For example, for the sentence "The cow jumps over the moon" with N=2 (known as bigrams), the n-grams would be:
* the cow
* cow jumps
* jumps over
* over the
* the moon
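The bigram list above can be reproduced with a few lines of plain Python. This is only a sketch of the sliding-window idea; the helper `simple_ngrams` is hypothetical and uses a simple whitespace split rather than a proper tokenizer.

```python
def simple_ngrams(text, n):
    # Lowercase and split on whitespace; a real tokenizer would also handle punctuation.
    tokens = text.lower().split()
    # Slide a window of size n over the tokens, one step at a time.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = simple_ngrams("The cow jumps over the moon", 2)
print(bigrams)

# A text with fewer than n words yields no n-grams.
print(simple_ngrams("too short", 3))
```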
 
We will use the NLTK `ngrams` and `word_tokenize` functions for n-gram feature extraction.

Note: The punkt resource must be downloaded for NLTK word tokenization (recent NLTK releases may ask for `punkt_tab` instead).

In [None]:
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

nltk.download('punkt')

First we define the value for *n*, in n-gram representation.

In [None]:
n = 3

The following n_grams() function takes a sentence and constructs a list of n-grams.

In [None]:
def n_grams(text):
  tokens = word_tokenize(text) # tokenize once and reuse
  if len(tokens) < n: # not enough words to form a single n-gram
    return []
  return [' '.join(grams) for grams in ngrams(tokens, n)]

In [None]:
txt = "you want to exclude some stop word being getting ignored"
print(n_grams(txt))

Derive n-grams (n=3) for our dataset.

In [None]:
df['3_grams'] = df['sentence'].apply(lambda x: n_grams(x))

Display original sentence and n-grams.

In [None]:
df[['sentence', '3_grams']].head(20)

Based on the above results (e.g., record 16), you can see that a sentence with only 2 words yields no 3-grams.  
Thus, you may try to derive n-grams with *n=2*.

### Bag of words

[Bag of words](https://machinelearningmastery.com/gentle-introduction-bag-words-model/) is a simple text feature extraction mechanism.   
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
* A vocabulary of known words.  
* A measure of the presence of known words.  

We will use the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class from sklearn to create the bag-of-words model.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

You may refer to the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) API for a detailed description of the parameters.

In [None]:
bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1), analyzer = "word")

Transform the stemmed sentences into a bag-of-words model.

In [None]:
X_bow = bow.fit_transform(df['sentence_stemmed'])

X_bow is a sparse document-term matrix.  
e.g., Output format:  (sentence_id, vocabulary_dictionary_id) count
* sentence_id - id (row position) of the sentence in the dataframe
* vocabulary_dictionary_id - id of the particular word in the bag-of-words vocabulary
* count - number of times that word occurs in the sentence
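The output format described above can be reproduced by hand on a toy corpus. The sketch below uses only the standard library and two made-up sentences; it assumes an alphabetically sorted vocabulary and is an illustration of the idea, not the sklearn implementation.

```python
from collections import Counter

docs = ["good phone good price", "bad phone"]  # made-up, pre-processed sentences

# Vocabulary: every distinct word, sorted alphabetically and given an integer id.
all_words = sorted({w for d in docs for w in d.split()})
vocabulary = {word: idx for idx, word in enumerate(all_words)}
print(vocabulary)

# For each sentence, record (sentence_id, vocabulary_id) -> count.
entries = {}
for sentence_id, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        entries[(sentence_id, vocabulary[word])] = count

for key in sorted(entries):
    print(key, entries[key])
```

Only non-zero counts are stored, which is why this kind of representation is called sparse.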

In [None]:
df['sentence_stemmed'].head()

In [None]:
print(X_bow)

### Term Frequency - Inverse Document Frequency (TF-IDF)

[Term frequency–inverse document frequency](https://www.kdnuggets.com/2018/08/wtf-tf-idf.html) is a numerical statistic that is intended to reflect how important a word is to a document in a collection. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
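To make the weighting concrete, here is a small hand-computed sketch on made-up tokens. It assumes one common variant of the formula: raw term count multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by scaling each document vector to unit length. Other variants exist, and the helper `tfidf` is hypothetical.

```python
import math
from collections import Counter

docs = [["good", "phone"], ["bad", "phone"], ["good", "good", "battery"]]  # made-up tokens
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    # Raw term count times a smoothed inverse document frequency.
    weights = {
        w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
        for w, c in counts.items()
    }
    # Scale the vector to unit Euclidean length.
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

for doc in docs:
    print({w: round(v, 3) for w, v in tfidf(doc).items()})
```

In the second document, "bad" appears in fewer documents than "phone", so it receives the higher weight even though both occur once.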

We will use [feature extraction module](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) of the sklearn library for this.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

Construct TF-IDF using the lemmatized sentences.

In [None]:
tf_idf = vectorizer.fit_transform(df['sentence_lemmatized'])  # use the lemmatized sentences as the text data

Display the list of all the words.

In [None]:
# get_feature_names() was removed in recent scikit-learn versions; use get_feature_names_out()
print(vectorizer.get_feature_names_out())

Here you can see that quite a few tokens include a number (digit).  
In one of the pre-processing steps, we removed the tokens that consist only of digits, but not those that combine digits with other characters.  
You might want to remove these as well.
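One possible way to drop such mixed tokens is sketched below. The helper name `remove_tokens_with_digits` is hypothetical, and whether these tokens should be removed at all depends on the task.

```python
def remove_tokens_with_digits(sentence):
    # Drop any token that contains at least one digit, e.g. "covid19" or "2020".
    return " ".join(
        w for w in sentence.split() if not any(ch.isdigit() for ch in w)
    )

cleaned_sample = remove_tokens_with_digits("covid19 is spreading fast in 2020")
print(cleaned_sample)
```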

A comparison of TF-IDF values with respect to lemmatized sentences.

In [None]:
print(df['sentence_lemmatized'].head())

In [None]:
print(tf_idf[:5])

In each feature vector entry (e.g., (0, 4843)), the first number refers to the sentence row (i.e., 0 is the first data row).  
The second number is the index of the word in the alphabetically ordered word list.

### Sentiment Analysis

Sentiment analysis is the process of determining the attitude or emotion of the writer, i.e., whether it is positive, negative or neutral.

We will use the TextBlob library. The `sentiment` property of a TextBlob object includes the polarity of the sentence: a float in the range [-1, 1], where 1 indicates a positive statement and -1 a negative statement.

In [None]:
from textblob import TextBlob

In [None]:
doc1 = "wow"
TextBlob(doc1).sentiment.polarity

Derive sentiment of each sentence.

In [None]:
df['sentiment'] = df['sentence_lemmatized'].apply(lambda x: TextBlob(x).sentiment.polarity)

Display each lemmatized sentence alongside its sentiment.

In [None]:
print(df[['sentence_lemmatized', 'sentiment']][:40])

## Text Classification

We will explore a few text classification approaches to classify the review data as either positive (1) or negative (0).  
Here we will only use the Amazon reviews (1000 reviews) for the workshop. (You may use the yelp and imdb review data separately to evaluate the approaches.)

Previously, we applied all the pre-processing steps to all three datasets (amazon, yelp and imdb). Thus, for text classification we will filter only the reviews from amazon.

In [None]:
df_amazon = df.loc[df['source'] == 'amazon']

In [None]:
df_amazon.head()

Split the data into training and test sets.


In [None]:
from sklearn.model_selection import train_test_split

sentences_train, sentences_test, y_train, y_test = train_test_split(df_amazon['sentence_lemmatized'], df_amazon['label'], test_size=0.3, random_state=2)

### Logistic Regression

We will use the Bag of Words model as text features.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

Construct the bag of words model.

In [None]:
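# min_df=1 keeps every word that appears in at least one sentence
# (recent scikit-learn versions reject an integer min_df of 0)
bow = CountVectorizer(min_df=1, lowercase=False)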
bow.fit(sentences_train)

Transform the train and test sentences into bag-of-words features using the fitted model.

In [None]:
X_train = bow.transform(sentences_train)
X_test  = bow.transform(sentences_test)

Use a logistic regression model for classification.

In [None]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

Evaluate the model.

In [None]:
score_train = classifier.score(X_train, y_train)
score_test = classifier.score(X_test, y_test)
print('Train accuracy: {:.2f}%'.format(score_train*100))
print('Test accuracy: {:.2f}%'.format(score_test*100))

What can you say about the bias and variance of this model?  
- Low bias and high variance
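A rough way to read the two scores is to look at the gap between them: a high training accuracy combined with a clearly lower test accuracy suggests high variance (overfitting). The sketch below computes accuracy by hand on made-up labels and predictions; the numbers are illustrative only and are not results from this workshop.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels and predictions, for illustration only.
train_acc = accuracy([1, 0, 1, 0], [1, 0, 1, 0])
test_acc = accuracy([1, 0, 1, 0], [1, 1, 0, 0])

print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
print("Gap:", train_acc - test_acc)
```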

Try a few examples.

In [None]:
testing = "happy customer"
vector_representation = bow.transform([testing])
prediction = classifier.predict(vector_representation)[0]

if prediction == 1:
   print("Positive Review")
else:
   print("Negative Review")

Positive Review


## Exercise

You may conduct similar classification exercises for yelp and imdb datasets.