<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2023-Tutorial-Notebooks/blob/main/tutorial_notebooks/02_basic_text_preprocessing_template_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Starter Activity - Exploring Bag of Words (Vectorizer)**

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
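
As a minimal sketch of the multiset idea, the snippet below uses only the standard library. It assumes a plain lowercase whitespace split as the tokenizer, which is simpler than what `CountVectorizer` does, and the helper name `bag_of_words` is made up for this illustration. A `collections.Counter` keeps how often each word occurs but not where it occurs:

```python
from collections import Counter

def bag_of_words(text):
    """Return the multiset of lowercase, whitespace-separated words."""
    return Counter(text.lower().split())

original = 'the word on the street is that math is not cool'
shuffled = 'cool is not math that is the street on the word'

bag = bag_of_words(original)
print(bag['the'], bag['is'], bag['cool'])   # 2 2 1 -> multiplicity is kept
print(bag == bag_of_words(shuffled))        # True  -> word order is discarded
```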

The corpus consists of three sentences:

1. A bag was found next to the river
2. Bag of words is a cool way of representing text
3. The word on the street is that math is not cool

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
     'A bag was found next to the river',
     'Bag of words is a cool way of representing text',
     'The word on the street is that math is not cool',
 ]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vectorizer.get_feature_names_out().tolist()

['bag',
 'cool',
 'found',
 'is',
 'math',
 'next',
 'not',
 'of',
 'on',
 'representing',
 'river',
 'street',
 'text',
 'that',
 'the',
 'to',
 'was',
 'way',
 'word',
 'words']

In [None]:
X.toarray()

array([[1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0, 0, 2, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1],
       [0, 1, 0, 2, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 2, 0, 0, 0, 1, 0]])

bag cool found is math next not of on representing river street text that the to was way word words

[1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]

[1, 1, 0, 1, 0, 0, 0, 2, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]

[0, 1, 0, 2, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 2, 0, 0, 0, 1, 0]

How to improve the feature representation:
1.   Stop-word removal
2.   Lemmatization (maps inflected forms to a common base form)
3.   Lowercasing
4.   Limiting the vocabulary size (keep at most the K most frequent words)
5.   Synonym mapping

Problems with bag of words:
1. The same word can have different meanings in different contexts.
2. Context and word order matter, but both are discarded.
3. There is no measure of distance (similarity) between words used in similar contexts.
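
Some of the improvements above (lowercasing, stop-word removal, and limiting the vocabulary) can be sketched in plain Python. The stop-word list and the helper `build_vocabulary` are hypothetical and only meant for illustration; in practice, `CountVectorizer` lowercases by default and exposes `stop_words` and `max_features` parameters for the other two steps.

```python
from collections import Counter

corpus = [
    'A bag was found next to the river',
    'Bag of words is a cool way of representing text',
    'The word on the street is that math is not cool',
]

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {'a', 'the', 'of', 'is', 'to', 'on', 'was', 'that'}

def build_vocabulary(docs, stop_words=frozenset(), max_words=None):
    """Lowercase, drop stop words, and keep the `max_words` most frequent words."""
    counts = Counter(
        word
        for doc in docs
        for word in doc.lower().split()
        if word not in stop_words
    )
    # most_common(None) returns every word, so max_words=None means "no limit"
    return sorted(word for word, _ in counts.most_common(max_words))

print(len(build_vocabulary(corpus)))                      # 21 distinct words
print(build_vocabulary(corpus, STOP_WORDS, max_words=2))  # ['bag', 'cool']
```

Note that this whitespace tokenizer keeps the one-letter word `a`, which is why the unrestricted vocabulary has 21 entries rather than the 20 produced by `CountVectorizer`.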



Text pre-processing is one of the mandatory steps we perform when building an NLP application. The text we write as humans usually contains spelling errors, abbreviations, special symbols, emojis, etc. We can understand such text, but it has to be preprocessed before a computer can work with it. In this notebook, we will discuss some of the text pre-processing steps you will need to perform when working with text data.

## **Table of Contents**
> 1. [Lowercasing](#1)
> 2. [Removing HTML Tags](#2)
> 3. [Removing URLs](#3)
> 4. [Removing Punctuations](#4)
> 5. [Chat word treatment](#5)
> 6. [Spelling Correction](#6)
> 7. [Removing stop words](#7)
> 8. [Handling Emojis](#8)
> 9. [Tokenization](#9)
>10. [Stemming](#10)
>11. [References](#11)

In [None]:
import pandas as pd
import numpy as np
imdb_reviews = pd.read_csv('../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
twitter_tweets = pd.read_csv('../input/twitter-sentiment-analysis-hatred-speech/train.csv')

<div style='color: #216969;
           background-color: #EAF6F6;
           font-size: 200%;
           border-radius:15px;
           text-align:center;
           font-weight:600;
           border-style: solid;
           border-color: dark green;
           font-family: "Verdana";'>
Lowercasing
<a class="anchor" id="1"></a>

Lowercasing is the process of converting a word into lower case. If a word (let's say `Book`) appears at the beginning of a sentence with a capital letter and the same word (`book`) appears later without a capital letter, our model will treat them as two different words. Lowercasing is usually very simple: we can use the `.lower()` method.

In [None]:
def convert_lowercase(column):
    column = column.str.lower()
    return column

In [None]:
print(f"Before applying lower casing: {imdb_reviews['review'][0][:10]}")

imdb_reviews['review'] = convert_lowercase(imdb_reviews['review'])

print(f"After applying lower casing : {imdb_reviews['review'][0][:10]}")

Before applying lower casing: One of the
After applying lower casing : one of the
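
To make the effect on word counts concrete, here is a small sketch on a made-up sentence, using only the standard library. Without lowercasing, `Book` and `book` are counted as different tokens:

```python
from collections import Counter

tokens = 'Book lovers say this book is the best book'.split()

raw_counts = Counter(tokens)
lower_counts = Counter(token.lower() for token in tokens)

print(raw_counts['book'])    # 2 -> 'Book' is counted separately
print(lower_counts['book'])  # 3 -> all three occurrences are merged
```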


<div style='color: #216969;
           background-color: #EAF6F6;
           font-size: 200%;
           border-radius:15px;
           text-align:center;
           font-weight:600;
           border-style: solid;
           border-color: dark green;
           font-family: "Verdana";'>
Removing HTML Tags
<a class="anchor" id="2"></a>

Whenever you scrape a website, HTML tags such as `header`, `body`, `anchor`, etc. will be present in the text. These tags don't add any value to the text data and should therefore be removed. We can remove them by using regular expressions.

In [None]:
import re
def remove_html_tags(text):
    re_html = re.compile('<.*?>')
    return re_html.sub(r'', text)

In [None]:
text = '<h1> This is a h1 tag </h1>'
print(remove_html_tags(text))

 This is a h1 tag 


In [None]:
print(f"Before removing HTML tags: {imdb_reviews['review'][1][:70]}")
imdb_reviews['review'] = imdb_reviews['review'].apply(remove_html_tags)
print(f"After removing HTML tags : {imdb_reviews['review'][1][:70]}")


Before removing HTML tags: a wonderful little production. <br /><br />the filming technique is ve
After removing HTML tags : a wonderful little production. the filming technique is very unassumin
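
The regular expression above removes tags but leaves HTML entities such as `&amp;` untouched. As an alternative sketch (not the approach used in this notebook), the standard library's `html.parser` can collect only the text nodes and decode entities along the way. The class `TextExtractor` and the function `strip_html` are hypothetical names chosen for this example:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into '&'
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

print(strip_html('<h1> This is a h1 tag </h1>'))
print(strip_html('Tom &amp; Jerry <b>rock</b>'))  # Tom & Jerry rock
```

It can be applied to a column in the same way as `remove_html_tags`, for example with `.apply(strip_html)`.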


<div style='color: #216969;
           background-color: #EAF6F6;
           font-size: 200%;
           border-radius:15px;
           text-align:center;
           font-weight:600;
           border-style: solid;
           border-color: dark green;
           font-family: "Verdana";'>
Removing URLs
<a class="anchor" id="3"></a>

`URLs` in a text reference locations on the web, but just like HTML tags, they usually don't provide any useful information. We can remove URLs by using regular expressions.

In [None]:
text = 'My profile link: https://www.kaggle.com/anubhavgoyal10'
def remove_url(text):
    # Raw string, so that \S and \. are passed to the regex engine unchanged
    re_url = re.compile(r'https?://\S+|www\.\S+')
    return re_url.sub('', text)

In [None]:
print(f'Text before removing URL: {text}')
print(f'Text after removing URL : {remove_url(text)}')

Text before removing URL: My profile link: https://www.kaggle.com/anubhavgoyal10
Text after removing URL : My profile link: 
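
As the output shows, deleting a URL can leave stray whitespace behind. The sketch below uses the same pattern and additionally collapses the leftover whitespace; the function name `remove_url_and_tidy` and the example text are made up for illustration:

```python
import re

# Same pattern as above: http(s) links or bare www. links
RE_URL = re.compile(r'https?://\S+|www\.\S+')

def remove_url_and_tidy(text):
    """Delete URLs, then collapse the whitespace they leave behind."""
    without_urls = RE_URL.sub('', text)
    return ' '.join(without_urls.split())

sample = 'Docs at www.example.com and https://example.org/page today'
print(remove_url_and_tidy(sample))  # Docs at and today
```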


<div style='color: #216969;
           background-color: #EAF6F6;
           font-size: 200%;
           border-radius:15px;
           text-align:center;
           font-weight:600;
           border-style: solid;
           border-color: dark green;
           font-family: "Verdana";'>
Removing Punctuations
<a class="anchor" id="4"></a>

The reason for removing punctuation is pretty similar to the one for lowercasing: in certain cases, we want `hello` and `hello!` to be treated in exactly the same way. Be careful when removing punctuation, though: the word `can't` becomes `cant` if punctuation characters are deleted and `can t` if they are replaced with spaces, depending on how you set up the translation.

In [None]:
import string
exclude = string.punctuation

def remove_punc(text):
    return text.translate(str.maketrans('', '', exclude))

In [None]:
text = 'Hello!'
print(f'Text before punctuation: {text}')
text_without_punc = remove_punc(text)
print(f'Text after punctuation : {text_without_punc}')

Text before punctuation: Hello!
Text after punctuation : Hello


In [None]:
print(f"Tweet before removing punctuation: {twitter_tweets['tweet'][0]}")
twitter_tweets['tweet'] = twitter_tweets['tweet'].apply(remove_punc)
print(f"Tweet after removing punctuation : {twitter_tweets['tweet'][0]}")

Tweet before removing punctuation:  @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run
Tweet after removing punctuation :  user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction   run
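
The `can't` caveat mentioned above can be sketched as follows. Both helper names are hypothetical; the first deletes punctuation (as `remove_punc` does), the second replaces each punctuation character with a space and then normalizes whitespace:

```python
import string

def remove_punc_delete(text):
    """Delete every punctuation character."""
    return text.translate(str.maketrans('', '', string.punctuation))

def remove_punc_space(text):
    """Replace every punctuation character with a space, then tidy whitespace."""
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(text.translate(table).split())

text = "I can't stop!"
print(remove_punc_delete(text))  # I cant stop
print(remove_punc_space(text))   # I can t stop
```

Which variant is preferable depends on the task: deleting keeps contractions as one token, while replacing with spaces keeps words apart when punctuation is the only separator (for example `good,bad`).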


<div style='color: #216969;
           background-color: #EAF6F6;
           font-size: 200%;
           border-radius:15px;
           text-align:center;
           font-weight:600;
           border-style: solid;
           border-color: dark green;
           font-family: "Verdana";'>
Chat word treatment
<a class="anchor" id="5"></a>

We often use short abbreviations of words while texting. We need to change them back into their full forms when performing NLP tasks.

In [None]:
# Some chat words examples
chat_words = {
    'FYI' : 'for your information',
    'LOL' : 'laugh out loud',
    'AFK' : 'away from keyboard'
}

def chat_words_conv(text):
    new_text = []
    for word in text.split():
        if word.upper() in chat_words:
            new_text.append(chat_words[word.upper()])
        else:
            new_text.append(word)

    return ' '.join(new_text)

In [None]:
text = 'FyI I was afk for a while'
print(chat_words_conv(text))

for your information I was away from keyboard for a while
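
One limitation of splitting on whitespace is that an abbreviation followed by punctuation (for example `lol!`) is not found in the dictionary. A possible variant, sketched here under the assumption that whole-word, case-insensitive matching is what we want, uses a regular expression with word boundaries. The function name `expand_chat_words` is made up for this example:

```python
import re

chat_words = {
    'FYI': 'for your information',
    'LOL': 'laugh out loud',
    'AFK': 'away from keyboard',
}

# Match any known abbreviation as a whole word, ignoring case
CHAT_PATTERN = re.compile(
    r'\b(' + '|'.join(map(re.escape, chat_words)) + r')\b',
    flags=re.IGNORECASE,
)

def expand_chat_words(text):
    """Replace each matched abbreviation with its full form."""
    return CHAT_PATTERN.sub(lambda m: chat_words[m.group(0).upper()], text)

print(expand_chat_words('FyI I was afk, lol!'))
# for your information I was away from keyboard, laugh out loud!
```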


<div style='color: #216969;
           background-color: #EAF6F6;
           font-size: 200%;
           border-radius:15px;
           text-align:center;
           font-weight:600;
           border-style: solid;
           border-color: dark green;
           font-family: "Verdana";'>
Spelling Correction
<a class="anchor" id="6"></a>

In [None]:
from textblob import TextBlob
text = 'stringg witth lotts of spelingg erors'
textblob_ = TextBlob(text)
print(f'Correct text: {textblob_.correct().string}')

Correct text: string with lots of spelling errors


<div style='color: #216969;
           background-color: #EAF6F6;
           font-size: 200%;
           border-radius:15px;
           text-align:center;
           font-weight:600;
           border-style: solid;
           border-color: dark green;
           font-family: "Verdana";'>
Removing stop words
<a class="anchor" id="7"></a>

Stop words are words such as `the`, `an`, `so`, and `and`, which are abundant in every text but don't provide much useful information to the model. By removing them, we can focus on the more important information in the text. In some cases, however, we don't remove stop words; sentiment analysis is one example.

In [None]:
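from nltk.corpus import stopwords
# Requires a one-time nltk.download('stopwords')
# A set gives constant-time membership tests
stopwords_english = set(stopwords.words('english'))

def remove_stopwords(text):
    # The NLTK list is lowercase, so matching is case-sensitive ('The' would be kept)
    new_text = [word for word in text.split() if word not in stopwords_english]
    return ' '.join(new_text)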

In [None]:
text = 'Stop words are a set of commonly used words in a language'
print(f'Text before removing stop words: {text}')
print(f'Text after removing stop words : {remove_stopwords(text)}')

Text before removing stop words: Stop words are a set of commonly used words in a language
Text after removing stop words : Stop words set commonly used words language
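
The sentiment analysis caveat can be illustrated with a small sketch. The stop-word list below is a hypothetical miniature (NLTK's English list is much longer, and it also contains `not`), and `remove_stopwords_from` is a made-up helper. Removing negations changes the meaning of a review, so one option is to exclude them from the stop-word set:

```python
# Hypothetical mini stop-word list, for illustration only
STOP_WORDS = {'the', 'is', 'a', 'this', 'not', 'no'}
NEGATIONS = {'not', 'no'}

def remove_stopwords_from(text, stop_words):
    """Drop every word whose lowercase form is in `stop_words`."""
    return ' '.join(w for w in text.split() if w.lower() not in stop_words)

review = 'this movie is not good'
print(remove_stopwords_from(review, STOP_WORDS))              # movie good
print(remove_stopwords_from(review, STOP_WORDS - NEGATIONS))  # movie not good
```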


<div style='color: #216969;
           background-color: #EAF6F6;
           font-size: 200%;
           border-radius:15px;
           text-align:center;
           font-weight:600;
           border-style: solid;
           border-color: dark green;
           font-family: "Verdana";'>
Handling Emojis
<a class="anchor" id="8"></a>

Emojis are generally used in a text to express emotions. There are two ways to handle them. The first is to simply remove them, which is not a good option because they can carry useful information. The second, better option is to replace them with a representation the computer can understand, which we can do using the `emoji` library in Python.

In [None]:
import emoji
text = 'He is suffering from fever 🤒'
print(emoji.demojize(text))

He is suffering from fever :face_with_thermometer:
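
If the `emoji` package is not available, both options can be roughly approximated with the standard library. This is only a sketch: it assumes that emojis are single characters in the Unicode category `So` (Symbol, other), which also covers non-emoji symbols such as `©` and does not handle multi-character emoji sequences. The function names are hypothetical:

```python
import unicodedata

def is_symbol(ch):
    # Approximation: most single-character emojis are in category 'So'
    return unicodedata.category(ch) == 'So'

def remove_symbols(text):
    """Option 1: drop the symbols entirely."""
    return ''.join(ch for ch in text if not is_symbol(ch)).strip()

def describe_symbols(text):
    """Option 2: replace each symbol with its Unicode name."""
    out = []
    for ch in text:
        name = unicodedata.name(ch, '') if is_symbol(ch) else ''
        out.append(f":{name.lower().replace(' ', '_')}:" if name else ch)
    return ''.join(out)

text = 'He is suffering from fever 🤒'
print(remove_symbols(text))    # He is suffering from fever
print(describe_symbols(text))  # He is suffering from fever :face_with_thermometer:
```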


<div style='color: #216969;
           background-color: #EAF6F6;
           font-size: 200%;
           border-radius:15px;
           text-align:center;
           font-weight:600;
           border-style: solid;
           border-color: dark green;
           font-family: "Verdana";'>
Tokenization
<a class="anchor" id="9"></a>

Tokenization is the process of breaking text into smaller units (tokens). We can use it to separate a text into sentences, words, or characters. We can perform tokenization using the `nltk` or `spacy` library.

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
sent_1 = 'Life is a matter of choices and every choice makes you!'
print(word_tokenize(sent_1))

['Life', 'is', 'a', 'matter', 'of', 'choices', 'and', 'every', 'choice', 'makes', 'you', '!']


As we can see, it managed to separate `you` from `!`. Python's `str.split()` method would not have done that; it would have returned `you!` as a single token.
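
The difference is easy to check with the standard library. The snippet below compares `str.split()` with a simple regular-expression tokenizer. The pattern is an assumption made for this sketch and is much cruder than what `word_tokenize` does, but it gives the same tokens for this sentence:

```python
import re

sent_1 = 'Life is a matter of choices and every choice makes you!'

# Whitespace splitting keeps punctuation attached to the word
split_tokens = sent_1.split()
print(split_tokens[-1])  # you!

# Runs of word characters, or any single non-space, non-word character
regex_tokens = re.findall(r'\w+|[^\w\s]', sent_1)
print(regex_tokens[-2:])  # ['you', '!']
```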

In [None]:
para = "Don't forget that gifts often come with costs that go beyond their purchase price. When you purchase a child the latest smartphone, you're also committing to a monthly phone bill. When you purchase the latest gaming system, you're likely not going to be satisfied with the games that come with it for long and want to purchase new titles to play. When you buy gifts it's important to remember that some come with additional costs down the road that can be much more expensive than the initial gift itself."
print(sent_tokenize(para))

["Don't forget that gifts often come with costs that go beyond their purchase price.", "When you purchase a child the latest smartphone, you're also committing to a monthly phone bill.", "When you purchase the latest gaming system, you're likely not going to be satisfied with the games that come with it for long and want to purchase new titles to play.", "When you buy gifts it's important to remember that some come with additional costs down the road that can be much more expensive than the initial gift itself."]


In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
sent_2 = 'A 5km bike ride costs around $10 in New York!'
doc = nlp(sent_2)
for token in doc:
    print(token, end=', ')

A, 5, km, bike, ride, costs, around, $, 10, in, New, York, !, 

<div style='color: #216969;
           background-color: #EAF6F6;
           font-size: 200%;
           border-radius:15px;
           text-align:center;
           font-weight:600;
           border-style: solid;
           border-color: dark green;
           font-family: "Verdana";'>
Stemming
<a class="anchor" id="10"></a>

Stemming is a process that reduces words to their root form (stem). For example, the stem of `walking`, `walks`, and `walked` is `walk`. We can do stemming using the `nltk` library.

In [None]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
def perform_stemming(text):
    new_text = [ps.stem(word) for word in text.split()]
    return ' '.join(new_text)

In [None]:
text = 'walk walks walked walking'
perform_stemming(text)

'walk walk walk walk'
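
To give an intuition for what a stemmer does, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm, which applies many more rules (for example, `PorterStemmer` reduces `running` to `run`); the suffix list, the function `naive_stem`, and its `min_stem_len` parameter are assumptions made for this illustration:

```python
SUFFIXES = ('ing', 'ed', 's')  # checked in this order

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

print(' '.join(naive_stem(w) for w in 'walk walks walked walking'.split()))
# walk walk walk walk

# Without extra rules, the doubled consonant is left in place
print(naive_stem('running'))  # runn
```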

<div style='color: #216969;
           background-color: #EAF6F6;
           font-size: 200%;
           border-radius:15px;
           text-align:center;
           font-weight:600;
           border-style: solid;
           border-color: dark green;
           font-family: "Verdana";'>
References
<a class="anchor" id="11"></a>

1. [https://www.youtube.com/watch?v=6C0sLtw5ctc](https://www.youtube.com/watch?v=6C0sLtw5ctc)
2. [https://www.kaggle.com/code/campusx/text-preprocessing/script](https://www.kaggle.com/code/campusx/text-preprocessing/script)