# Working with text

CREDIT: This practical was inspired by [this notebook on NLP](https://www.kaggle.com/code/amar09/text-pre-processing-and-feature-extraction).

## Setup
### Imports

In [6]:
import string

import pandas as pd                                     # for dataset manipulation (DataFrames)
import numpy as np                                      # allows some mathematical operations
import matplotlib.pyplot as plt                         # library used to display graphs
import seaborn as sns                                   # more convenient visualisation library for dataframes
from sklearn.model_selection import train_test_split    # for classification
from sklearn.svm import LinearSVC                       # for classification
from sklearn.metrics import confusion_matrix            # for classification
from sklearn.metrics import accuracy_score              # for classification
import imblearn                                         # for imbalance management
import time                                             # for execution time measurement
import nltk                                             # Natural Language ToolKit for NLP

### Loading the dataset

Today's dataset is the [IMDB Movie Reviews Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).

In [2]:
df = pd.read_csv("imdb_dataset.csv")

## Observing the dataset

Using what you have learned in the previous lessons, examine the dataset and see what you can learn about it.
In particular, identify the classification task this dataset was created for, and the potential issues you could encounter.
Are the classes balanced?
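
As a starting point for the balance question, here is a minimal sketch on a toy DataFrame. The column names `review` and `sentiment` are assumptions made for illustration; check the actual column names with `df.columns` before adapting it.

```python
import pandas as pd

# Toy stand-in for the real dataset; the column names are assumptions.
toy = pd.DataFrame(
    {
        "review": ["Great film", "Terrible plot", "Loved it", "Not bad at all"],
        "sentiment": ["positive", "negative", "positive", "positive"],
    }
)

# Absolute and relative frequency of each class.
class_counts = toy["sentiment"].value_counts()
class_shares = toy["sentiment"].value_counts(normalize=True)

print(class_counts)
print(class_shares)
```

Shares that are far from equal would suggest an imbalanced dataset, which affects how accuracy should be interpreted later on.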

In [8]:
# Your code here

*[Your comments here]*

### Analysing the reviews

In order to see what needs to be cleaned, let us first observe the most common words in the dataset.

1. Create a function `create_corpus(texts)` that takes a list or `pd.Series` of strings and returns all the individual words it contains (the skeleton below wraps them in a `pd.DataFrame`).
*Hint: You may need to use the `str.split` function.*
2. Display the most common words in the IMDB dataset.
3. Comment on your observations.
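
For step 2, one possible way to count word frequencies is `collections.Counter`. The sketch below runs on a hypothetical three-review Series rather than the real data, and it is only one option among several (`pd.Series.value_counts` would work too).

```python
from collections import Counter

import pandas as pd

# Hypothetical mini dataset used only for illustration.
toy_reviews = pd.Series(["the movie was good", "the plot was thin", "good cast"])

# Flatten every review into one list of whitespace-separated tokens.
words = [word for review in toy_reviews for word in review.split()]

# Counter.most_common returns (word, count) pairs, most frequent first.
top_words = Counter(words).most_common(3)
print(top_words)
```

Note that splitting on whitespace keeps punctuation and capitalisation attached to the words, so `Good`, `good` and `good.` would be counted separately.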

In [24]:
def create_corpus(texts):
    corpus = []
    ... # Your code here
    return pd.DataFrame(corpus)

*[Your comments here]*

The NLTK package offers a list of stopwords, which are common words in a language that carry little to no meaning.
Display the most common words in the dataset, this time ignoring stopwords.

In [0]:
nltk.download("stopwords")
from nltk.corpus import stopwords

In [36]:
stops = stopwords.words("english")
# Your code here

## Cleaning the data
### Removal of stop words

1. Using the list of stopwords downloaded above, implement a function `remove_stopwords(text)` that takes a string as input and returns the same string with the stopwords removed.
2. Apply this function to the data.

*Hint: You can split the text into words, keep only the words that are not stopwords, and join the result back into a single string.*
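
The general pattern of filtering tokens out of a string can be sketched as follows. The helper name `drop_words` and the five-word `toy_stops` set are hypothetical and exist only so that the example runs offline; in the practical you would use `stopwords.words("english")`.

```python
# Hypothetical mini stopword list so the sketch runs offline;
# in the practical, use stopwords.words("english") instead.
toy_stops = {"the", "a", "is", "of", "was"}


def drop_words(text: str, words_to_drop: set) -> str:
    """Return text without the tokens found in words_to_drop (case-insensitive)."""
    kept = [token for token in text.split() if token.lower() not in words_to_drop]
    return " ".join(kept)


print(drop_words("The plot of the movie was a mess", toy_stops))
```

Storing the stopwords in a `set` rather than a list makes each membership test much faster, which matters when the function is applied to tens of thousands of reviews with `Series.apply`.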

In [71]:
def remove_stopwords(text: str):
    ... # Your code here
    return ...

### Removal of punctuation

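1. Using the standard library's `string.punctuation` constant (a string containing the ASCII punctuation characters), implement a function `remove_punctuation(text)` that takes a string as input and returns the same string with the punctuation removed.
2. Apply this function to the data.

*Hint: You can do it on your own, or you can look into `str.translate` and `str.maketrans`.*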
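
To illustrate how `str.maketrans` and `str.translate` work together, here is a short sketch on a made-up sentence. The variable names are chosen for this example only.

```python
import string

# Translation table mapping every punctuation character to None (i.e. delete it).
table = str.maketrans("", "", string.punctuation)

sample = "Wow!!! Best movie ever... (really)"
cleaned = sample.translate(table)
print(cleaned)
```

Be aware that this approach deletes characters without inserting spaces, so `don't` becomes `dont`, and any HTML remnant such as `<br />` would leave the bare token `br` behind.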

In [79]:
import string

punctuation_list = string.punctuation


def remove_punctuation(text: str):
    ...  # Your code here
    return ...

### Stemming and lemmatization

1. What are stemming and lemmatization? What are some differences between the two?
2. Implement a `stem_text(text)` function that takes a string as input and outputs the same string where words have been stemmed using NLTK's `PorterStemmer`.
3. Implement a `lemmatize_text(text)` function that takes a string as input and outputs the same string where words have been lemmatized using NLTK's `WordNetLemmatizer`.
4. Apply stemming and lemmatization to the dataset and store the results in two different columns. Compare and comment on the results.
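
Before working on whole reviews, it can help to see both tools on single words. The sketch below is illustrative only: the word list is made up, and the lemmatization part assumes the WordNet corpus can be downloaded, so it is skipped when the corpus is unavailable.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
words = ["studies", "running", "movies"]

# Stemming strips suffixes with fixed rules; the result need not be a real word.
stems = [stemmer.stem(word) for word in words]
print(stems)

# Lemmatization looks words up in WordNet, so the corpus must be available.
try:
    nltk.download("wordnet", quiet=True)
    lemmatizer = WordNetLemmatizer()
    # Without a part-of-speech tag, NLTK treats every word as a noun.
    lemmas = [lemmatizer.lemmatize(word) for word in words]
    print(lemmas)
except (LookupError, OSError):
    print("WordNet corpus not available; skipping the lemmatization demo.")
```

The stemmer produces truncated forms such as `studi`, whereas the lemmatizer aims for dictionary forms. Passing a part of speech, for example `lemmatizer.lemmatize("running", pos="v")`, changes how a word is lemmatized.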

In [85]:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # WordNetLemmatizer needs the WordNet corpus


def stem_text(text):
    stemmer = PorterStemmer()
    ...  # Your code here
    return ...


def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    ...  # Your code here
    return ...

### Bonus: Other types of pre-processing

We could have done many other types of cleaning: removing emojis, removing URLs, spellchecking, removing frequent or rare words, etc.
If you want to try these cleaning steps yourself, feel free to refer to [this Kaggle notebook](https://www.kaggle.com/code/sudalairajkumar/getting-started-with-text-preprocessing).
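
As an example of one such step, URL removal can be sketched with a regular expression. The pattern and the helper name `strip_urls` are assumptions made for this illustration; the pattern is deliberately simple and will not catch every URL format.

```python
import re

# Deliberately simple pattern: "http://", "https://" or "www."
# followed by any run of non-space characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def strip_urls(text: str) -> str:
    """Remove URL-like tokens and collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())


print(strip_urls("Full review at https://example.com/review?id=1 and www.example.org today"))
```

Because the pattern consumes everything up to the next space, punctuation glued to the end of a URL is removed along with it.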

In [86]:
# Your code here

*[Your comments here]*