# Natural Language Processing with Python

In this workshop we will use the [Sentiment Labelled Sentences Data Set](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences)  that contains 500 positive and 500 negative sentences from imdb.com, amazon.com and yelp.com.  

* First, we will explore preliminary text analytics and text pre-processing .  
* Second, we will evaluate different feature extraction mechanisms for text.  
* Third, we will evaluate a simple text classification for Amazon review dataset, and an advanced deep neural network for classification accuracy improvement.


## Load the data

Load the reviews.csv file downloaded from LMS to the Google Colab file repository.

In [4]:
import pandas as pd

In [5]:
# Dataset reveiws.csv MUST be uploaded to Google Colab before executing this line
df = pd.read_csv('tp6_reviews.csv')

View a summary of the dataset.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2748 entries, 0 to 2747
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sentence  2748 non-null   object
 1   label     2748 non-null   int64 
 2   source    2748 non-null   object
dtypes: int64(1), object(2)
memory usage: 64.5+ KB


In [7]:
df.head()

Unnamed: 0,sentence,label,source
0,Wow... Loved this place.,1,yelp
1,Crust is not good.,0,yelp
2,Not tasty and the texture was just nasty.,0,yelp
3,Stopped by during the late May bank holiday of...,1,yelp
4,The selection on the menu was great and so wer...,1,yelp


Check what are the sources of data available in the dataset.

In [8]:
df['source'].unique()

array(['yelp', 'amazon', 'imdb'], dtype=object)

## Preliminary analysis

***Pandas dataframe's df.apply() function***  
* This allow the users to pass a function and apply it on every single value of the Pandas series, i.e., column. [API documentation.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)
*  Efficient way to update values in a dataframe column

In the following example,


*   Count the number of words in each sentence
*   Assign the word count to a new attribute  named 'word_count'



In [9]:
def word_counter(document):
  split_word = str(document).split(" ") # split by white space
  word_count = len(split_word) # count the words
  return word_count

df['word_count_function'] = df['sentence'].apply(word_counter)

In [10]:
df.head(5)

Unnamed: 0,sentence,label,source,word_count_function
0,Wow... Loved this place.,1,yelp,4
1,Crust is not good.,0,yelp,4
2,Not tasty and the texture was just nasty.,0,yelp,8
3,Stopped by during the late May bank holiday of...,1,yelp,15
4,The selection on the menu was great and so wer...,1,yelp,12


Same above function can be achieved through a simple lambda function.

In [11]:
df['word_count'] = df['sentence'].apply(lambda x: len(str(x).split(" ")))
df.head(5)

Unnamed: 0,sentence,label,source,word_count_function,word_count
0,Wow... Loved this place.,1,yelp,4,4
1,Crust is not good.,0,yelp,4,4
2,Not tasty and the texture was just nasty.,0,yelp,8,8
3,Stopped by during the late May bank holiday of...,1,yelp,15,15
4,The selection on the menu was great and so wer...,1,yelp,12,12


In [12]:
def func(x):
  return len( str(x).split(" ") )

Similarly, count the number of characters of each sentence

In [13]:
df['char_count'] = df['sentence'].str.len()  # Includes the spaces
df.head(5)

Unnamed: 0,sentence,label,source,word_count_function,word_count,char_count
0,Wow... Loved this place.,1,yelp,4,4,24
1,Crust is not good.,0,yelp,4,4,18
2,Not tasty and the texture was just nasty.,0,yelp,8,8,41
3,Stopped by during the late May bank holiday of...,1,yelp,15,15,87
4,The selection on the menu was great and so wer...,1,yelp,12,12,59


Calculate the average word length for each sentence.

*   First, construct a method (avg_word()) which takes a sentence, split the sentence to words, then calculate the average word length.
*   Using pandas dataframe apply() function and avg_word() method, calculate the average word length
*  Assign the value to new column names 'avg_word'



In [14]:
def avg_word(sentence):
  words = sentence.split() # split the sentence into words
  avg_of_words = (sum(len(word) for word in words)/len(words))
  return avg_of_words

df['avg_word'] = df['sentence'].apply(avg_word)
df.head(5)

Unnamed: 0,sentence,label,source,word_count_function,word_count,char_count,avg_word
0,Wow... Loved this place.,1,yelp,4,4,24,5.25
1,Crust is not good.,0,yelp,4,4,18,3.75
2,Not tasty and the texture was just nasty.,0,yelp,8,8,41,4.25
3,Stopped by during the late May bank holiday of...,1,yelp,15,15,87,4.866667
4,The selection on the menu was great and so wer...,1,yelp,12,12,59,4.0


## Text pre-processing

Pre-processing is mandatory for most text analytics tasks, as text in its raw format is unstructured and noisy.

In the following snippets you will run several pre-processing steps.  

**Please note that pre-processing is to be used with clear understanding of the expected outcome of text analytics, as each pre-processing step is not relevant or applicable to every NLP task.**

Uppercase and lowercase characters are used for clarity in human communication. However, for a machine such distinction would create unnecessary complexities. Therefore, we transform all characters to lowercase.

In [15]:
df['sentence'] = df['sentence'].str.lower()
df.head()

Unnamed: 0,sentence,label,source,word_count_function,word_count,char_count,avg_word
0,wow... loved this place.,1,yelp,4,4,24,5.25
1,crust is not good.,0,yelp,4,4,18,3.75
2,not tasty and the texture was just nasty.,0,yelp,8,8,41,4.25
3,stopped by during the late may bank holiday of...,1,yelp,15,15,87,4.866667
4,the selection on the menu was great and so wer...,1,yelp,12,12,59,4.0


Same with punctuation marks, we remove all using [regular expressions](https://www.w3schools.com/python/python_regex.asp) .  
**Regular Expressions** - A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings.

In [16]:
# This regular expression only keeps words and characters
df['sentence'] = df['sentence'].str.replace('[^\w\s]','')
df.head()

Unnamed: 0,sentence,label,source,word_count_function,word_count,char_count,avg_word
0,wow... loved this place.,1,yelp,4,4,24,5.25
1,crust is not good.,0,yelp,4,4,18,3.75
2,not tasty and the texture was just nasty.,0,yelp,8,8,41,4.25
3,stopped by during the late may bank holiday of...,1,yelp,15,15,87,4.866667
4,the selection on the menu was great and so wer...,1,yelp,12,12,59,4.0


### Remove digits

For a sentiment analytics task, numbers or digits are not needed. Thus, we remove digits from the text dataset.

However, for other tasks, numbers may be needed.

In [17]:
def remove_digits(sent):
  return " ".join(w for w in sent.split() if not w.isdigit())

df['sentence'] = df['sentence'].apply(remove_digits)
df.head()

Unnamed: 0,sentence,label,source,word_count_function,word_count,char_count,avg_word
0,wow... loved this place.,1,yelp,4,4,24,5.25
1,crust is not good.,0,yelp,4,4,18,3.75
2,not tasty and the texture was just nasty.,0,yelp,8,8,41,4.25
3,stopped by during the late may bank holiday of...,1,yelp,15,15,87,4.866667
4,the selection on the menu was great and so wer...,1,yelp,12,12,59,4.0


Demonstrate of the remove digit function

In [18]:
sample_text = 'Covid 19 is spreading fast'
print(remove_digits(sample_text))

Covid is spreading fast


What does "".join() means?

In [19]:
word_list = ["Covid", "is", "spreading", "fast"]
sentence = "     ".join(word_list)
print(sentence)

Covid     is     spreading     fast


### Remove Stopwords

[Stopwords](https://en.wikipedia.org/wiki/Stop_words) are deemed irrelevant for NLP purposes because they occur frequently in the language. Therefore, we will omit the stopwords as a pre-processing step. For this, we will use [NLTK](https://www.nltk.org/) library here.

**NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) specifically for the English language written in  Python.**

In [20]:
# Load NLTK library
import nltk

# Download the stopwords to the nltk library
nltk.download('stopwords')

# Load the stopwords
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Have a look at the stopwords indexed in the NLTK library.

In [21]:
stop = stopwords.words('english')
print(stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Remove unwanted stop words from the NLTK stop word list.

In [22]:
stop.remove('not')

In [23]:
all_words_i_want = ['had', "has"]
for w in all_words_i_want:
  stop.remove(w)

Remove stopwords from the sentences.

In [24]:
df['sentence'] = df['sentence'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df.head(5)

Unnamed: 0,sentence,label,source,word_count_function,word_count,char_count,avg_word
0,wow... loved place.,1,yelp,4,4,24,5.25
1,crust not good.,0,yelp,4,4,18,3.75
2,not tasty texture nasty.,0,yelp,8,8,41,4.25
3,stopped late may bank holiday rick steve recom...,1,yelp,15,15,87,4.866667
4,selection menu great prices.,1,yelp,12,12,59,4.0


### Common and rare word analysis

Aside from stopwords, some words appear rarely (only once or twice) in an entire body of text.
Based on the analytics requirement, you can decide whether to keep or remove, and at what intensity/scale to remove.

In order to do this, first we have to construct a word frequency dictionary.

In [25]:
word_frequency = pd.Series(' '.join(df['sentence']).split()).value_counts()

In [26]:
word_frequency

Unnamed: 0,count
not,302
good,174
great,157
movie,133
had,129
...,...
broke).-,1
case),1
(with,1
spinn,1


List the top 10 common words.

In [27]:
# Top common words
word_frequency[:10]  # get top 10

Unnamed: 0,count
not,302
good,174
great,157
movie,133
had,129
one,124
like,122
phone,118
film,107
really,98


List the top 10 rare words.

In [28]:
# least common words
word_frequency[-10:]  # get top 10

Unnamed: 0,count
psyched,1
iriver,1
strap.,1
magnetic,1
fond,1
broke).-,1
case),1
(with,1
spinn,1
one's,1


### Spelling correction

To correct misspelt words, we will use [textblob library](https://textblob.readthedocs.io/en/dev/) library. Keep in mind that corrections are always bound by the dictionary that you would use, and it may not account for context (their vs there).

Due to the time complexity of spell-checking an entire corpus, in this exercise, we will use spell-check for just one example.

In [29]:
from textblob import TextBlob

In [30]:
# Do not run this line of code.
# Following line of code will correct spellings of all the sentences in the dataset.
# df['sentence'] = df['sentence'].apply(lambda x: str(TextBlob(x).correct()))   # This will take a long time. Thus, we will show an seperate example

Spelling correction example

In [31]:
def correct_word(word):
  return str(TextBlob(word).correct())

print(correct_word('bisness'))

business


In [32]:
incorrect_text = 'bisness anlytis is an itant skil seit for any organizaton'

func = lambda x: str(TextBlob(x).correct())
print(incorrect_text)
print(str(TextBlob(incorrect_text).correct()))

bisness anlytis is an itant skil seit for any organizaton
business analysis is an want skin set for any organization


### Stemming

[Stemming](https://en.wikipedia.org/wiki/Stemming) is the removal of prefix, suffix etc, to derive the base form of a word. We will use the NLTK library.

In [33]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stemming_function(sent):
  word_list = sent.split()
  stemmed_word_list = [stemmer.stem(word) for word in word_list]
  stemmed_sentence = " ".join(stemmed_word_list)
  return stemmed_sentence

df['sentence_stemmed'] = df['sentence'].apply(stemming_function)

df.head()

Unnamed: 0,sentence,label,source,word_count_function,word_count,char_count,avg_word,sentence_stemmed
0,wow... loved place.,1,yelp,4,4,24,5.25,wow... love place.
1,crust not good.,0,yelp,4,4,18,3.75,crust not good.
2,not tasty texture nasty.,0,yelp,8,8,41,4.25,not tasti textur nasty.
3,stopped late may bank holiday rick steve recom...,1,yelp,15,15,87,4.866667,stop late may bank holiday rick steve recommen...
4,selection menu great prices.,1,yelp,12,12,59,4.0,select menu great prices.


### Lemmatization

[Lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html), unlike Stemming, reduces the inflected words properly ensuring that the root word belongs to the language. In Lemmatization root word is called lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.  
  
We will use  Wordnet for the lemmatization. Thus, we need to download Wordnet to the nltk library.

WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage


In [34]:
# Download wordnet
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [35]:
from nltk.stem import WordNetLemmatizer

lemmtizer = WordNetLemmatizer()

In [36]:
def lemmatize_function(sent, pos='n'):
  word_list = sent.split()
  lemma_word_list = [lemmtizer.lemmatize(word, pos=pos) for word in word_list]
  lemma_sentence = " ".join(lemma_word_list)
  return lemma_sentence

df['sentence_lemmatized'] = df['sentence'].apply(lemmatize_function)

In [37]:
lemmatize_function("I cannot believe dogs are walking", pos='v')

'I cannot believe dog be walk'

Display original pre-processed sentence, stemmed sentence and lemmatized sentence.

In [38]:
df[['sentence', 'sentence_stemmed', 'sentence_lemmatized']].head(10)

Unnamed: 0,sentence,sentence_stemmed,sentence_lemmatized
0,wow... loved place.,wow... love place.,wow... loved place.
1,crust not good.,crust not good.,crust not good.
2,not tasty texture nasty.,not tasti textur nasty.,not tasty texture nasty.
3,stopped late may bank holiday rick steve recom...,stop late may bank holiday rick steve recommen...,stopped late may bank holiday rick steve recom...
4,selection menu great prices.,select menu great prices.,selection menu great prices.
5,getting angry want damn pho.,get angri want damn pho.,getting angry want damn pho.
6,honeslty taste fresh.),honeslti tast fresh.),honeslty taste fresh.)
7,potatoes like rubber could tell had made ahead...,potato like rubber could tell had made ahead t...,potato like rubber could tell had made ahead t...
8,fries great too.,fri great too.,fry great too.
9,great touch.,great touch.,great touch.


Stemmed algorithm seems to be working better in this case when compared to the lemmatization. It is recommended to observe the results after pre-processing tasks to understand the performance of the third party libraries we are using in pre-processing.

## Text Feature Extraction

In a numeric dataset (e.g., house price dataset, titanic survival dataset, dungaree dataset), we had numeric and categorical variables, which we transformed to numeric values for predictive analytics. Those numeric variables are called numeric features in the datasets. Similarly, for NLP, we need to derive features from text data in numerical format because machines can only understand numeric representations.

### N-Grams

An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a contiguous sequence of n items from a given sample of text or speech. They are basically a set of co-occuring words within a given window. When computing the n-grams, the shift is one-step forward (although you can move X words forward in more advanced scenarios). For example, for the sentence "The cow jumps over the moon". If N=2 (known as bigrams), then the ngrams would be:
* the cow
* cow jumps
* jumps over
* over the
* the moon

We will use NLTK ngrams and word_tokenizer libraries for n-gram feature extraction.

Note: Need to download punkt resource for nltk for work tokenization

In [39]:
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

First we define the value for *n*, in n-gram representation.

In [40]:
n = 3

Following n_grams() method will take a sentence and construct a list of n-grams.

In [41]:
def n_grams(text):
  if len(word_tokenize(text)) < 3:
    return []
  n_grams = ngrams(word_tokenize(text), n)
  return [' '.join(grams) for grams in n_grams]

In [42]:
txt = "you want to exclude some stop word being getting ignored"
print(n_grams(txt))

['you want to', 'want to exclude', 'to exclude some', 'exclude some stop', 'some stop word', 'stop word being', 'word being getting', 'being getting ignored']


Derive n-grams (n=3) for our dataset.

In [43]:
df['3_grams'] = df['sentence'].apply(lambda x: n_grams(x))

Display original sentence and n-grams.

In [44]:
df[['sentence', '3_grams']].head(20)

Unnamed: 0,sentence,3_grams
0,wow... loved place.,"[wow ... loved, ... loved place, loved place .]"
1,crust not good.,"[crust not good, not good .]"
2,not tasty texture nasty.,"[not tasty texture, tasty texture nasty, textu..."
3,stopped late may bank holiday rick steve recom...,"[stopped late may, late may bank, may bank hol..."
4,selection menu great prices.,"[selection menu great, menu great prices, grea..."
5,getting angry want damn pho.,"[getting angry want, angry want damn, want dam..."
6,honeslty taste fresh.),"[honeslty taste fresh, taste fresh ., fresh . )]"
7,potatoes like rubber could tell had made ahead...,"[potatoes like rubber, like rubber could, rubb..."
8,fries great too.,"[fries great too, great too .]"
9,great touch.,[great touch .]


Based on above results (e.g., record 16) you can see that if there are only 2 words, the 3-grams would result no n-grams.  
Thus, you may try to derive n-grams with *n=2*.

### Bag of words

[Bag of words](https://machinelearningmastery.com/gentle-introduction-bag-words-model/) is a simple text feature extraction mechanism.   
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
* A vocabulary of known words.  
* A measure of the presence of known words.  

We will use [CountVectorizer library](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) on sklearn for bag-of-words model creation.

In [45]:
from sklearn.feature_extraction.text import CountVectorizer

You may refer to [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) API for detailed description about the parameters.

In [46]:
bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1), analyzer = "word")

Transform lemmatized senteces into bag-of-words model.

In [47]:
X_bow = bow.fit_transform(df['sentence_stemmed'])

The X_bow would result in a term-document matrix.  
e.g., Output format:  (sentence_id, vocabulary_dictionary_id) count
* sentence_id - sentence id in the dataframe
* vocabulary_dictionary_id - id of the particular word in the bag of words model dictionary
* count - count of words

In [70]:
df['sentence_stemmed'].shape

(2748,)

In [72]:
X_bow.shape

(2748, 1000)

### Term Frequency - Inverse Document Frequecy (TF-IDF)

[Term frequency–inverse document frequency](https://www.kdnuggets.com/2018/08/wtf-tf-idf.html), is a numerical statistic that is intended to reflect how important a word is to a document in a collection. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.

We will use [feature extraction module](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) of the sklearn library for this.

In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

Construct TF-IDF using the lemmatized senteces.

In [51]:
tf_idf = vectorizer.fit_transform(df['sentence_stemmed'])  # as the text data, we will use lemmatized sentences

Display the list of all the words.

In [52]:
print(vectorizer.get_feature_names_out())

['00' '10' '100' ... 'zombi' 'zombie' 'zombiez']


Here you see there are quite many text that includes a number (digit).  
In one of the pre-processing steps, we removed all the words/text that are only digits, but not combined.  
You might want to remove these as well...  

A comparison of TF-IDF values with respect to lemmatized sentences.

In [53]:
print(df['sentence_lemmatized'].head())

0                                  wow... loved place.
1                                      crust not good.
2                             not tasty texture nasty.
3    stopped late may bank holiday rick steve recom...
4                         selection menu great prices.
Name: sentence_lemmatized, dtype: object


In [54]:
print(tf_idf[:5])

  (0, 3098)	0.4467515072948057
  (0, 2493)	0.4815253044339722
  (0, 4685)	0.7540202065724695
  (1, 1842)	0.3878330443592633
  (1, 2830)	0.35855314218756035
  (1, 1018)	0.8491320120749498
  (2, 2764)	0.5871056522043099
  (2, 4132)	0.5655082660842398
  (2, 4095)	0.5234957828317272
  (2, 2830)	0.24791030534761116
  (3, 2242)	0.19150592485821574
  (3, 3356)	0.2257634873436066
  (3, 3924)	0.3512597313662575
  (3, 3463)	0.3835821062091567
  (3, 2026)	0.3835821062091567
  (3, 350)	0.3835821062091567
  (3, 2579)	0.3000299793060093
  (3, 2365)	0.3408542508108477
  (3, 3937)	0.3134449769315596
  (3, 2493)	0.20765324628761367
  (4, 3204)	0.6029695547245832
  (4, 1874)	0.3020391000394054
  (4, 2625)	0.49730455114224764
  (4, 3628)	0.5457914267701826


In the feature vector row (e.g., (0, 4843)), the first digit refers to the sentence row (i.e., first datarow).  
The second digit is the index of alphebitically ordered word list.

### Sentiment Analysis

Sentiment analysis is basically the process of determining the attitude or the emotion of the writer, i.e., whether it is positive or negative or neutral.

We will use the Textblob library. The sentiment function of textblob returns the polarity of the sentence, i.e., a float value which lies in the range of [-1,1] where 1 means positive statement and -1 means a negative statement.

In [55]:
from textblob import TextBlob

In [56]:
doc1 = "wow"
TextBlob(doc1).sentiment.polarity

0.1

Derive sentiment of each sentence.

In [57]:
df['sentiment'] = df['sentence_lemmatized'].apply(lambda x: TextBlob(x).sentiment.polarity)

Display original sentece with respect to its sentiment.

In [58]:
print(df[['sentence_lemmatized', 'sentiment']][:40])

                                  sentence_lemmatized  sentiment
0                                 wow... loved place.   0.400000
1                                     crust not good.  -0.350000
2                            not tasty texture nasty.  -1.000000
3   stopped late may bank holiday rick steve recom...   0.200000
4                        selection menu great prices.   0.800000
5                        getting angry want damn pho.  -0.500000
6                              honeslty taste fresh.)   0.300000
7   potato like rubber could tell had made ahead t...   0.000000
8                                      fry great too.   0.800000
9                                        great touch.   0.800000
10                                    service prompt.   0.000000
11                                 would not go back.   0.000000
12  cashier had care ever had say still ended wayy...   0.000000
13  tried cape cod ravoli, chicken,with cranberry....   0.000000
14                  disgu

## Text Classification

We will explore few text classification approaches to classify the review data as either positive (1) or negative (0).  
Here we will only use the amazon reviews (1000 reviews) for the workshop. (You may use yelp and imdb review data seperately evaluate the approaches.)

Previously, we conducted all the pre-processing steps to the entire 3 datasets (amazon, yelp and imdb). This for text classification we will filter only the reviews from amazon.

In [59]:
df_amazon = df.loc[df['source'] == 'amazon']

In [60]:
df_amazon.head()

Unnamed: 0,sentence,label,source,word_count_function,word_count,char_count,avg_word,sentence_stemmed,sentence_lemmatized,3_grams,sentiment
1000,way plug us unless go converter.,0,amazon,21,21,82,2.952381,way plug us unless go converter.,way plug u unless go converter.,"[way plug us, plug us unless, us unless go, un...",0.0
1001,"good case, excellent value.",1,amazon,4,4,27,6.0,"good case, excel value.","good case, excellent value.","[good case ,, case , excellent, , excellent va...",0.85
1002,great jawbone.,1,amazon,4,4,22,4.75,great jawbone.,great jawbone.,[great jawbone .],0.8
1003,tied charger conversations lasting minutes.maj...,0,amazon,11,11,79,6.272727,tie charger convers last minutes.major problems!!,tied charger conversation lasting minutes.majo...,"[tied charger conversations, charger conversat...",0.0
1004,mic great.,1,amazon,4,4,17,3.5,mic great.,mic great.,[mic great .],0.8


Split train/validation data


In [61]:
from sklearn.model_selection import train_test_split

sentences_train, sentences_test, y_train, y_test = train_test_split(df_amazon['sentence_lemmatized'], df_amazon['label'], test_size=0.3, random_state=2)

### Logistic Regression

We will use the Bag of Words model as text features.

In [62]:
from sklearn.feature_extraction.text import CountVectorizer

Construct the bag of words model.

In [63]:
bow = CountVectorizer(min_df=0.0, lowercase=False)
bow.fit(sentences_train)

Fit the train and test sentences to transform them to bag-of-word features.

In [64]:
X_train = bow.transform(sentences_train)
X_test  = bow.transform(sentences_test)

Use a logistic regression model for classification.

In [65]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

Evaluate the model.

In [66]:
score_train = classifier.score(X_train, y_train)
score_test = classifier.score(X_test, y_test)
print('Train accuracy: {:.2f}%'.format(score_train*100))
print('Test accuracy: {:.2f}%'.format(score_test*100))

Train accuracy: 98.86%
Test accuracy: 80.33%


What can you say about bias and variance of this model?  
- Low bias and high variance

Try few examples

In [67]:
testing = "happy customer"
vector_representation = bow.transform([testing])
prediction = classifier.predict(vector_representation)[0]

if prediction == 1:
   print("Positive Review")
else:
   print("Negative Review")

Positive Review


## Exercise

You may conduct similar classification exercises for yelp and imdb datasets.