# Working with text

CREDIT: This practical was inspired by [this notebook on NLP](https://www.kaggle.com/code/amar09/text-pre-processing-and-feature-extraction).

## Setup
### Imports

In [None]:
import string

import pandas as pd                                     # for dataset manipulation (DataFrames)
import numpy as np                                      # allows some mathematical operations
import matplotlib.pyplot as plt                         # library used to display graphs
import seaborn as sns                                   # more convenient visualisation library for dataframes
from sklearn.model_selection import train_test_split    # for classification
from sklearn.svm import LinearSVC                       # for classification
from sklearn.metrics import confusion_matrix            # for classification
from sklearn.metrics import accuracy_score              # for classification
import imblearn                                         # for imbalance management
import time                                             # for execution time measurement
import nltk                                             # Natural Language ToolKit for NLP

### Loading the dataset

Today's dataset is the [IMDB Movie Reviews Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).

In [None]:
df = pd.read_csv("imdb_dataset.csv")

## Observing the dataset

Using what you have learned in the previous lessons, examine the dataset and see what you can learn about it.
In particular, identify the classification task this dataset was created for, and the potential issues you could encounter.
Are the classes balanced?

In [None]:
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [None]:
print(df.shape)
print(df['sentiment'].value_counts())
print(df.isna().sum())

(50000, 2)
positive    25000
negative    25000
Name: sentiment, dtype: int64
review       0
sentiment    0
dtype: int64


**Answer**

This dataset contains 50,000 observations with a single feature and two classes.
Each observation is a movie review from IMDB, and the two classes, "positive" and "negative", describe the reviewer's sentiment about the movie. There are no missing values, and the classes are perfectly balanced (25,000 reviews each). This dataset could be used to train and test a classifier performing "sentiment analysis", meaning it would predict whether the sentiment is positive or negative by reading the associated review. One potential issue is already visible in the preview: the reviews contain HTML tags such as `<br />`.

### Analysing the reviews

In order to see what needs to be cleaned, let us first observe the most common words in the dataset.

1. Create a function `create_corpus(texts)` that takes a list / pd.Series of strings, and outputs a list of all the individual words contained in it.
2. Display the most common words in the IMDB dataset.
3. Comment on your observations.

**Answers**

1. (cf. code below).
2. (cf. code below).
3. The most frequent words are function words (articles, prepositions, pronouns) and fragments of HTML tags such as `/><br`, none of which are very useful for sentiment analysis.

In [None]:
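def create_corpus(texts):
    """Return every space-separated word from a list or pd.Series of strings."""
    # " ".join iterates over the values, so it accepts a list as well as a pd.Series
    corpus = " ".join(texts)
    return corpus.split(" ")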

In [None]:
# Most frequent words

from collections import Counter

words = create_corpus(df['review'])
words_low = [word.lower() for word in words]
words_count = Counter(words_low)
most_common_words = words_count.most_common(50)
for word, count in most_common_words:
  print(f"{word}: {count}")


the: 638821
a: 316606
and: 313602
of: 286653
to: 264567
is: 204867
in: 179798
i: 141577
this: 138472
that: 130133
it: 129589
/><br: 100974
was: 93256
as: 88229
with: 84582
for: 84503
but: 77843
on: 62885
movie: 61483
are: 57007
his: 56869
not: 56754
you: 55593
film: 55078
have: 54421
he: 51059
be: 50900
at: 45258
one: 44978
by: 43356
an: 42329
they: 40858
from: 39318
all: 38567
who: 38326
like: 37279
so: 35966
just: 34255
or: 33292
has: 32610
about: 32397
her: 31240
it's: 31207
if: 30786
some: 30164
out: 28977
what: 28003
very: 26907
when: 26901
there: 26070


The NLTK package offers a list of stopwords, which are common words in a language that carry little to no meaning.
Display the most common words in the dataset, this time ignoring stop words.

In [None]:
nltk.download("stopwords")
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Cleaning the data
### Removal of stop words

1. Using the list of stopwords downloaded above, implement a `remove_stopwords(text)` function that takes a string as input, and outputs the same string where stopwords are removed.
2. Apply this function to the data.

*Hint: You can do it on your own, or you can look into `str.translate` and `str.maketrans`.*
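
The two methods named in the hint work on individual characters rather than on whole words, so they are probably most useful for the punctuation step further down. A minimal sketch of how they fit together, assuming the goal is to delete every character of `string.punctuation` (the helper name `strip_punctuation` and the sample sentence are made up for illustration):

```python
import string

# str.maketrans("", "", chars) builds a table that deletes every character in `chars`.
DELETE_PUNCTUATION = str.maketrans("", "", string.punctuation)

def strip_punctuation(text: str) -> str:
    """Return `text` with all ASCII punctuation characters deleted."""
    return text.translate(DELETE_PUNCTUATION)

print(strip_punctuation("Bad plot, bad dialogue, bad acting!"))
# Bad plot bad dialogue bad acting
```

Note that deletion also affects apostrophes and angle brackets: "it's" becomes "its" and "<br" becomes "br".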

In [None]:
# Most frequent words once English stopwords are removed

stops = set(stopwords.words("english"))  # a set makes membership tests fast

def remove_stopwords(text: str) -> str:
    """Return `text` without its stopwords (words are compared in lower case)."""
    kept = [word for word in text.split(" ") if word.lower() not in stops]
    return " ".join(kept)

# The corpus is already a list of lower-cased words, so it is filtered directly
filtered_words = [word for word in words_low if word not in stops]
filtered_words_count = Counter(filtered_words)
for word, count in filtered_words_count.most_common(50):
    print(f"{word}: {count}")

/><br: 100974
movie: 61483
film: 55078
one: 44978
like: 37279
would: 23807
even: 23679
good: 23466
really: 21804
see: 20899
-: 18201
get: 17686
much: 17277
story: 16804
also: 15742
time: 15651
great: 15465
first: 15454
make: 15026
people: 15026
could: 14927
/>the: 14702
made: 13562
bad: 13492
think: 13303
many: 12877
never: 12621
two: 12188
<br: 11955
little: 11825
well: 11678
watch: 11460
way: 11370
it.: 11169
know: 10780
movie.: 10764
love: 10745
best: 10743
seen: 10609
characters: 10597
character: 10385
movies: 10345
ever: 10218
still: 9777
films: 9575
plot: 9452
show: 9376
acting: 9376
better: 9043
film.: 8920


### Removal of punctuation

1. Using the native `string.punctuation` list, implement a `remove_punctuation(text)` function that takes a string as input, and outputs the same string where punctuation is removed.
2. Apply this function to the data.

In [None]:
punctuation_chars = set(string.punctuation)  # `string` is imported in the setup cell

def remove_punctuation(text: str) -> str:
    """Return `text` with every character of string.punctuation removed."""
    return "".join(char for char in text if char not in punctuation_chars)


In [None]:
# Most frequent words once stopwords and punctuation are removed

filtered_words = [word for word in words_low if word not in stops]

# Strip punctuation from each word, then drop the words that became empty
clean_words = [remove_punctuation(word) for word in filtered_words]
clean_words = [word for word in clean_words if word]
clean_words_count = Counter(clean_words)
for word, count in clean_words_count.most_common(50):
    print(f"{word}: {count}")

br: 113700
movie: 83501
film: 74444
one: 51016
like: 38986
good: 28568
the: 24948
even: 24570
would: 24024
it: 23283
time: 23251
really: 22946
see: 22532
story: 22084
much: 18945
well: 18770
get: 18201
great: 17819
also: 17815
bad: 17701
people: 17536
first: 17153
movies: 15449
made: 15415
make: 15303
films: 15281
could: 15155
way: 14995
characters: 14674
think: 14214
watch: 13566
many: 13369
seen: 13053
two: 13018
character: 12919
never: 12874
love: 12566
acting: 12469
plot: 12362
little: 12326
best: 12324
know: 12263
show: 12029
life: 11676
ever: 11623
better: 11042
this: 10921
still: 10739
say: 10620
end: 10536


### Stemming and lemmatization

1. What are stemming and lemmatization? What are some differences between the two?
2. Implement a `stem_text(text)` function that takes a string as input and outputs the same string where words have been stemmed using NLTK's `PorterStemmer`.
3. Implement a `lemmatize_text(text)` function that takes a string as input and outputs the same string where words have been lemmatized using NLTK's `WordNetLemmatizer`.
4. Apply stemming and lemmatization to the dataset and store the results in two different columns. Compare and comment on the results.

**Answers**

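1. Stemming reduces the number of distinct words by stripping suffixes with heuristic rules, so that each word is replaced by its stem, which is not necessarily a real word (e.g. "little" becomes "littl"). Lemmatization also aims to reduce the number of distinct words, but it maps each word to its dictionary form (its lemma), so the output consists of meaningful words only.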
2. (Cf. code below).
3. (Cf. code below).
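4. Stemming changes the text much more aggressively: it lower-cases every word and produces non-words such as "littl", "famili" or "thi". Lemmatization leaves most words untouched, but because `lemmatize` treats every word as a noun by default, it wrongly turns "was" into "wa" and "has" into "ha". Neither function removes punctuation or HTML tags, so tokens such as "production." or "<br" are kept as they are. Further processing could remove the most common words in English, as we did before, and also remove the HTML tags to keep only meaningful information.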
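
To make the difference in answer 1 concrete, here is a deliberately simplified sketch. It is a toy illustration only: the suffix rules and the lookup table are hypothetical and are neither Porter's algorithm nor WordNet. It only shows that a rule-based stemmer can output non-words, whereas a dictionary-based lemmatizer returns real words and leaves unknown words alone.

```python
# Toy illustration only: NOT Porter's algorithm and NOT WordNet.
SUFFIXES = ("ing", "ed", "es", "s", "e", "y")  # hypothetical rule list

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical dictionary mapping inflected forms to their lemma.
TOY_LEMMAS = {"movies": "movie", "was": "be", "better": "good"}

def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["movies", "was", "better", "little"]:
    print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemmatize(word)}")
# movies: stem=movi, lemma=movie
# was: stem=was, lemma=be
# better: stem=better, lemma=good
# little: stem=littl, lemma=little
```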


In [None]:
nltk.download('wordnet')
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Create each tool once instead of once per review
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def stem_text(text: str) -> str:
    """Return `text` with every space-separated word stemmed."""
    return " ".join(stemmer.stem(word) for word in text.split(" "))

def lemmatize_text(text: str) -> str:
    """Return `text` with every space-separated word lemmatized (as a noun)."""
    return " ".join(lemmatizer.lemmatize(word) for word in text.split(" "))

# Store the results in two new columns
df["stem"] = df["review"].apply(stem_text)
df["lemmatize"] = df["review"].apply(lemmatize_text)

df.head(50)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,review,sentiment,stem,lemmatize
0,One of the other reviewers has mentioned that ...,positive,one of the other review ha mention that after ...,One of the other reviewer ha mentioned that af...
1,A wonderful little production. <br /><br />The...,positive,a wonder littl production. <br /><br />the fil...,A wonderful little production. <br /><br />The...
2,I thought this was a wonderful way to spend ti...,positive,i thought thi wa a wonder way to spend time on...,I thought this wa a wonderful way to spend tim...
3,Basically there's a family where a little boy ...,negative,basic there' a famili where a littl boy (jake)...,Basically there's a family where a little boy ...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"petter mattei' ""love in the time of money"" is ...","Petter Mattei's ""Love in the Time of Money"" is..."
5,"Probably my all-time favorite movie, a story o...",positive,"probabl my all-tim favorit movie, a stori of s...","Probably my all-time favorite movie, a story o..."
6,I sure would like to see a resurrection of a u...,positive,i sure would like to see a resurrect of a up d...,I sure would like to see a resurrection of a u...
7,"This show was an amazing, fresh & innovative i...",negative,"thi show wa an amazing, fresh & innov idea in ...","This show wa an amazing, fresh & innovative id..."
8,Encouraged by the positive comments about this...,negative,encourag by the posit comment about thi film o...,Encouraged by the positive comment about this ...
9,If you like original gut wrenching laughter yo...,positive,if you like origin gut wrench laughter you wil...,If you like original gut wrenching laughter yo...
