# Processing Text Data
In this Notebook, we will be processing text data using the [Twitter Customer Support](https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter) dataset.

Our Notebooks in CSMODEL are designed to be guided learning activities. To use them, simply go through the cells from top to bottom, following the directions along the way. If you find any unclear parts or mistakes in the Notebooks, email your instructor.

## Instructions
* Read each cell and implement the TODOs sequentially. The markdown/text cells also contain instructions which you need to follow to get the whole notebook working.
* Do not change the variable names unless the instructor allows you to.
* Answer all the markdown/text cells with 'Question #' on them. Each answer must fit on a single line.
* You are expected to search how some functions work on the Internet or via the docs.
* The notebooks will undergo a 'Restart and Run All' command, so make sure that your code is working properly.
* You are expected to understand the dataset loading and processing separately from this class.
* You may not reproduce this notebook or share it with anyone.

## Import
Import **numpy**, **pandas**, **re**, **nltk**, and **string**.

In [None]:
import numpy as np
import pandas as pd
import re
import nltk
import string
pd.options.mode.chained_assignment = None

%load_ext autoreload
%autoreload 2

## Twitter Customer Support Dataset
For this notebook, we will work on a reduced version of the dataset called `Twitter Customer Support`. The original dataset contains more than 2M rows. In our reduced version, we only retained the first 50k rows in the dataset.

The dataset is provided to you as a `.csv` file. `.csv` means comma-separated values. You can open the file in Notepad to see how it is exactly formatted.

If you view the `.csv` file in Excel, you can see that our dataset contains 50k **observations** (rows) across 7 **variables** (columns). The following are the descriptions of each variable in the dataset.

- **`tweet_id`**: A unique, anonymized ID for the Tweet. Referenced by `response_tweet_id` and `in_response_to_tweet_id`.
- **`author_id`**: A unique, anonymized user ID. @s in the dataset have been replaced with their associated anonymized user ID.
- **`inbound`**: Whether the tweet is "inbound" to a company doing customer support on Twitter. This feature is useful when re-organizing data for training conversational models.
- **`created_at`**: Date and time when the tweet was sent.
- **`text`**: Tweet content. Sensitive information like phone numbers and email addresses are replaced with mask values like `__email__`.
- **`response_tweet_id`**: IDs of tweets that are responses to this tweet, comma-separated.
- **`in_response_to_tweet_id`**: ID of the tweet this tweet is in response to, if any.

Let's read the dataset.

In [None]:
twcs_df = pd.read_csv('twcs-reduced.csv')

Show the first few rows of the dataset.

In [None]:
twcs_df.head()

For this notebook, we will use the values under the `text` column. Thus, let's instantiate a new `DataFrame` with only the `text` column.

In [None]:
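# Copy the column into its own DataFrame so later assignments do not touch twcs_df.
twcs_text_df = twcs_df[['text']].copy()
# Cast every value to str so the string operations below work on all rows.
twcs_text_df['text'] = twcs_text_df['text'].astype(str)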

Display the first few rows of `twcs_text_df`.

In [None]:
twcs_text_df.head()

## Pre-Processing Text Data

In any machine learning task, preprocessing the data is as important as model building. This process is even more important for unstructured data such as text. In this notebook, we will perform some of the most common text pre-processing steps, including:

* Lower casing
* Removal of Punctuations
* Removal of Stopwords
* Removal of Frequent words
* Stemming
* Lemmatization


Do note that all of these pre-processing steps need not be performed on the dataset all the time. You need to carefully identify appropriate pre-processing techniques depending on the data or the task. For example, emojis or emoticons might be useful in sentiment analysis, thus it might not be a good idea to remove them.

Open the `text_preprocessor.py` file. Some of the functions in the file are not yet implemented. We will implement the missing functions of this file.

### Lower Casing

Lower casing converts the input text to a single case so that 'text', 'Text', and 'TEXT' are treated as the same word. This is especially helpful for getting the correct frequency of a word that appears in different cases. However, it may not be helpful when performing Part-of-Speech tagging (where casing carries information, for example about proper nouns) or sentiment analysis (where upper casing can signal anger or emphasis).
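The effect on word frequencies can be seen in a small sketch. The token list is made up for illustration, and plain Python is used instead of a `Series` to keep the example dependency-free.

```python
from collections import Counter

# Toy tokens, made up for illustration.
tokens = ["text", "Text", "TEXT", "data"]

# Without lower casing, each spelling is counted as a separate word.
raw_counts = Counter(tokens)

# With lower casing, the three spellings collapse into one entry.
lower_counts = Counter(token.lower() for token in tokens)

print(raw_counts["text"])    # 1
print(lower_counts["text"])  # 3
```

For a pandas `Series` of strings, `Series.str.lower()` applies the same conversion element-wise.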

By default, lower casing is done by most modern vectorizers and tokenizers, such as [sklearn TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Thus, disable this option (for `TfidfVectorizer`, set the `lowercase` parameter to `False`) as needed depending on the use case.

Open the `text_preprocessor.py` file and complete the `to_lower_case()` function. This converts characters in each string of a Series to lower case.

Implement the `to_lower_case()` function. Inline comments should help you in completing the contents of the function.

Afterwards, let's import the function.

In [None]:
# Write your code here


Convert the texts in column `text` to lowercase by calling the function `to_lower_case()` and assign the return value to a new column `text_lower`.

In [None]:
# Write your code here


Let's display the `DataFrame`.

In [None]:
twcs_text_df.head()

Display the lowercase version of the string in index `10`.

In [None]:
# Write your code here


**Question #1:** After calling the function `to_lower_case()`, what is the string in index `10`?

Answer: 

Display the lowercase version of the string in index `100`.

In [None]:
# Write your code here


**Question #2:** After calling the function `to_lower_case()`, what is the string in index `100`?

Answer: 

### Removal of Punctuations

Removing punctuations also standardizes the text data. This will treat the words 'hurray!' and 'hurray' in the same way. The list of punctuations to include should depend on the use case. For example, `string.punctuation` in Python contains the following punctuation symbols:

``!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~``

Add or remove punctuations depending on the use case.
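As a minimal sketch of this idea on a single string, one possible approach is `str.translate` with a deletion table. The helper name `strip_punctuation` is hypothetical and is not the function in `text_preprocessor.py`.

```python
import string

def strip_punctuation(text, punctuation=string.punctuation):
    """Return `text` with every character in `punctuation` removed."""
    # The third argument of str.maketrans lists the characters to delete.
    table = str.maketrans("", "", punctuation)
    return text.translate(table)

print(strip_punctuation("hurray!"))  # hurray

# Keep '@' and '#' when mentions and hashtags matter for the task.
custom = string.punctuation.replace("@", "").replace("#", "")
print(strip_punctuation("@user loves #nlp!", custom))  # @user loves #nlp
```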

Get the punctuations in `string.punctuation`.

In [None]:
punctuations = string.punctuation
print(punctuations)

Open the `text_preprocessor.py` file and complete the `remove_punctuations()` function. This takes in a `Series` of strings and removes punctuations in the text.

Implement the `remove_punctuations()` function. Inline comments should help you in completing the contents of the function.

Afterwards, let's import the function.

In [None]:
# Write your code here


Remove punctuations in the texts in column `text_lower` by calling the function `remove_punctuations()` and assign the return value to a new column `text_wo_punct`.

In [None]:
# Write your code here


Let's display the `DataFrame`.

In [None]:
twcs_text_df.head()

Display the string without punctuations in index `300`.

In [None]:
# Write your code here


**Question #3:** After calling the function `remove_punctuations()`, what is the string in index `300`?

Answer: 

Display the string without punctuations in index `1000`.

In [None]:
# Write your code here


**Question #4:** After calling the function `remove_punctuations()`, what is the string in index `1000`?

Answer: 

### Removal of Stopwords

Stopwords are commonly occurring words in a language, such as 'the' and 'a'. These words can be removed from the text most of the time since they do not provide valuable information for analysis. However, these words might be important when performing Part-of-Speech tagging.
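A minimal sketch of stopword filtering on one string follows. The stopword set and the helper name `drop_stopwords` are hypothetical; real stopword lists are much longer.

```python
# A tiny, hypothetical stopword set used only for this illustration.
toy_stopwords = {"the", "a", "is", "on"}

def drop_stopwords(text, stopword_set):
    """Keep only the whitespace-separated words not in `stopword_set`."""
    return " ".join(word for word in text.split() if word not in stopword_set)

print(drop_stopwords("the package is on a truck", toy_stopwords))  # package truck
```

Using a `set` keeps each membership check fast, which matters when the same filter runs over tens of thousands of rows.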

Lists of stopwords have already been compiled for different languages. For example, the list of stopwords for the English language from the `nltk` package can be seen below.

Let's download `stopwords` from the `nltk.corpus` package.

In [None]:
nltk.download('stopwords')

Import `stopwords` and print the stopwords in English.

In [None]:
from nltk.corpus import stopwords
print(stopwords.words('english'))

Open the `text_preprocessor.py` file and complete the `remove_stopwords()` function. This takes in a `Series` of strings and removes stopwords in the text.

Implement the `remove_stopwords()` function. Inline comments should help you in completing the contents of the function.

Afterwards, let's import the function.

In [None]:
# Write your code here


Remove stopwords in the texts in column `text_wo_punct` by calling the function `remove_stopwords()` and assign the return value to a new column `text_wo_stop`.

In [None]:
# Write your code here


Let's display the `DataFrame`.

In [None]:
twcs_text_df.head()

Display the string without stopwords in index `18`.

In [None]:
# Write your code here


**Question #5:** After calling the function `remove_stopwords()`, what is the string in index `18`?

Answer: 

Display the string without stopwords in index `3000`.

In [None]:
# Write your code here


**Question #6:** After calling the function `remove_stopwords()`, what is the string in index `3000`?

Answer: 

### Removal of Frequent Words

If you are working on a domain-specific corpus, most of the frequent words might not be important in processing the text data. Thus, it might be useful to remove frequent words in the given corpus. If you use a technique similar to tf-idf, this is automatically taken care of.
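The idea can be sketched with `collections.Counter` on a toy corpus. The texts below are made up, and this is only one possible approach, not necessarily the one outlined in `text_preprocessor.py`.

```python
from collections import Counter

# Toy corpus, made up for illustration.
texts = ["flight delayed again", "flight cancelled", "refund for flight delayed"]

# Count every whitespace-separated word across the whole corpus.
word_counts = Counter(word for text in texts for word in text.split())

# most_common(n) returns (word, count) pairs sorted by descending count.
top_words = [word for word, _ in word_counts.most_common(2)]
print(top_words)  # ['flight', 'delayed']

# Drop the frequent words from each text.
frequent = set(top_words)
cleaned = [" ".join(w for w in text.split() if w not in frequent) for text in texts]
print(cleaned)  # ['again', 'cancelled', 'refund for']
```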

Open the `text_preprocessor.py` file and complete the `get_frequent_words()` function. This returns the most frequent words in our dataset.

Implement the `get_frequent_words()` function. Inline comments should help you in completing the contents of the function.

Afterwards, let's import the function.

In [None]:
# Write your code here


Let's call the function to get the top 15 most frequent words in the texts in column `text_wo_stop`. Display the words.

In [None]:
# Write your code here


**Question #7:** What is the most frequent word in the dataset?

Answer: 

**Question #8:** What is the 5th most frequent word in the dataset?

Answer: 

**Question #9:** What is the 10th most frequent word in the dataset?

Answer: 

**Question #10:** What is the 15th most frequent word in the dataset?

Answer: 

Open the `text_preprocessor.py` file and complete the `remove_frequent_words()` function. This takes in a `Series` of strings and removes the top frequent words in the dataset.

Implement the `remove_frequent_words()` function. Inline comments should help you in completing the contents of the function.

Afterwards, let's import the function.

In [None]:
# Write your code here


Remove the top 10 frequent words in the texts in column `text_wo_stop` by calling the function `remove_frequent_words()` and assign the return value to a new column `text_wo_freq`.

In [None]:
# Write your code here


Let's display the `DataFrame`.

In [None]:
twcs_text_df.head()

Display the string without frequent words in index `43687`.

In [None]:
# Write your code here


**Question #11:** After calling the function `remove_frequent_words()`, what is the string in index `43687`?

Answer: 

Display the string without frequent words in index `44762`.

In [None]:
# Write your code here


**Question #12:** After calling the function `remove_frequent_words()`, what is the string in index `44762`?

Answer: 

### Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form (From [Wikipedia](https://en.wikipedia.org/wiki/Stemming)).

For example, stemming removes the suffixes in words like 'walks' and 'walking' to convert them to the root word 'walk'. However, for the words 'console' and 'consoling', the stemmer removes the suffix and produces 'consol', which is not a proper English word.
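To make the suffix-stripping idea concrete, here is a deliberately naive sketch. It is not the Porter algorithm, and the helper `naive_stem` and its suffix list are hypothetical, chosen only to cover the four example words above.

```python
def naive_stem(word, suffixes=("ing", "s", "e")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["walks", "walking", "console", "consoling"]:
    print(word, "->", naive_stem(word))
```

Because the rules only look at word endings, nothing checks whether the result is a real word, which is how a stem like 'consol' arises.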

Open the `text_preprocessor.py` file and complete the `stem()` function. This takes in a `Series` of strings and performs stemming on each word in the string. This uses the `PorterStemmer` from the `nltk` package.

Implement the `stem()` function. Inline comments should help you in completing the contents of the function.

Import the function.

In [None]:
# Write your code here


Perform stemming in the texts in column `text_wo_freq` by calling the function `stem()` and assign the return value to a new column `text_stem`.

In [None]:
# Write your code here


Let's display the `DataFrame`.

In [None]:
twcs_text_df.head()

Words like 'private' and 'propose' have their 'e' at the end chopped off due to stemming. This is not intended. Lemmatization is used in such cases.

Display the string with stemmed words in index `18`.

In [None]:
# Write your code here


**Question #13:** After calling the function `stem()`, what is the string in index `18`?

Answer: 

Display the string with stemmed words in index `100`.

In [None]:
# Write your code here


**Question #14:** After calling the function `stem()`, what is the string in index `100`?

Answer: 

### Lemmatization

Lemmatization is similar to stemming in reducing inflected words to their word stem, but differs in that it makes sure the root word (also called the lemma) belongs to the language. As a result, it is generally slower than stemming. Thus, either stemming or lemmatization can be used depending on the speed requirement.

Let's download `wordnet` from the `nltk` package. This is needed for lemmatization.

In [None]:
nltk.download('wordnet')

We will use the `WordNetLemmatizer` from the `nltk` package.

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

Let's try to perform lemmatization on the word 'running'. We would expect this to return the root word 'run'.

In [None]:
lemmatizer.lemmatize('running')

Notice that the lemmatizer returns the word 'running' instead of the root word 'run'. This is because the lemmatization process depends on the Part-of-Speech tag to come up with the correct lemma. 

Lemmatize the word 'running' again by providing the correct Part-of-Speech tag `v` for verbs.

In [None]:
lemmatizer.lemmatize('running', 'v')

Now we are getting the root form 'run'. Thus, the lemmatizer in `nltk` needs the Part-of-Speech tag of the word along with the word itself. Depending on the tag, the lemmatizer may return different results.

Perform lemmatization on the word 'stripes' and check the lemma when it is tagged as a verb and as a noun.

In [None]:
print('Lemma result for verb : ', lemmatizer.lemmatize('stripes', 'v'))
print('Lemma result for noun : ', lemmatizer.lemmatize('stripes', 'n'))

Open the `text_preprocessor.py` file and complete the `lemmatize()` function. This takes in a `Series` of strings and performs lemmatization on each word in the string. This uses the `WordNetLemmatizer` and the `pos_tag` from the `nltk` package.
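One detail to keep in mind is that `pos_tag` returns Penn Treebank tags such as `VBG` or `NNS`, while the lemmatizer expects single-letter tags such as `v` or `n`. A minimal sketch of one possible mapping is shown below. The helper name `to_wordnet_pos` is hypothetical, and falling back to noun for unmapped tags is an assumption made here, not a requirement of the exercise.

```python
def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    # Adjective, noun, verb, and adverb tags start with J, N, V, and R.
    first_letter_to_pos = {"J": "a", "N": "n", "V": "v", "R": "r"}
    # Assumed fallback: noun, which is also the lemmatizer's default pos.
    return first_letter_to_pos.get(treebank_tag[:1], "n")

print(to_wordnet_pos("VBG"))  # v
print(to_wordnet_pos("NNS"))  # n
```

The returned letter can then be passed as the second argument of `lemmatizer.lemmatize()`, in the same way as `'v'` in the cells above.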

Implement the `lemmatize()` function. Inline comments should help you in completing the contents of the function.

Import the function.

In [None]:
# Write your code here


Let's download `averaged_perceptron_tagger` from the `nltk` package. We will use this to get the Part-of-Speech tag of each word in a string.

In [None]:
nltk.download('averaged_perceptron_tagger')

Perform lemmatization in the texts in column `text_wo_freq` by calling the function `lemmatize()` and assign the return value to a new column `text_lemma`.

In [None]:
# Write your code here


Let's display the `DataFrame`.

In [None]:
twcs_text_df.head()

We can see that the trailing 'e' in the words 'propose' and 'private' is retained when we use lemmatization unlike stemming. 

Display the string with lemmatized words in index `49267`.

In [None]:
# Write your code here


**Question #15:** After calling the function `lemmatize()`, what is the string in index `49267`?

Answer: 

Display the string with lemmatized words in index `750`.

In [None]:
# Write your code here


**Question #16:** After calling the function `lemmatize()`, what is the string in index `750`?

Answer: 