# Introduction to the NLTK library for Python
NLTK (Natural Language Toolkit) is a leading platform for building Python programs that work with human language data.

In [5]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\suyog\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

# 1. Sentence Tokenization 
Sentence tokenization, also called **sentence segmentation**, is the problem of dividing a string of written language into its component sentences.
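To see why sentence segmentation is treated as a problem rather than a one-liner, here is a minimal sketch of a naive rule-based splitter. The function `naive_sent_split` is a made-up helper for illustration and is not how NLTK works internally.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (toy rule only)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Works for simple text.
simple = naive_sent_split("Backgammon is old. It uses two dice.")
print(simple)

# Breaks on abbreviations: "Dr." is wrongly treated as a sentence end.
tricky = naive_sent_split("Dr. Smith plays backgammon. He often wins.")
print(tricky)
```

The second call splits after "Dr.", which is the kind of case a trained tokenizer such as the one behind `nltk.sent_tokenize` is designed to cope with.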

In [6]:
text = "Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice."

In [9]:
sentences = nltk.sent_tokenize(text)

In [12]:
for sentence in sentences:
    print(sentence)

Backgammon is one of the oldest known board games.
Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East.
It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice.


# 2. Word Tokenization
Word tokenization, also called **word segmentation**, is the problem of dividing a string of written language into its component words.
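As a rough illustration of what a word tokenizer has to do, the sketch below compares plain whitespace splitting with a small regex rule. `naive_word_tokenize` is a hypothetical helper, not NLTK's algorithm, and it handles numbers such as "5,000" differently from `nltk.word_tokenize`.

```python
import re

def naive_word_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sample = "Backgammon is one of the oldest known board games."

by_space = sample.split()              # punctuation stays glued to the word
by_regex = naive_word_tokenize(sample)  # punctuation becomes its own token
number = naive_word_tokenize("nearly 5,000 years")

print(by_space)
print(by_regex)
print(number)
```

Whitespace splitting leaves "games." as one token, while the regex rule separates the full stop but also breaks "5,000" into three pieces.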

In [13]:
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print(words)

['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.']
['Its', 'history', 'can', 'be', 'traced', 'back', 'nearly', '5,000', 'years', 'to', 'archeological', 'discoveries', 'in', 'the', 'Middle', 'East', '.']
['It', 'is', 'a', 'two', 'player', 'game', 'where', 'each', 'player', 'has', 'fifteen', 'checkers', 'which', 'move', 'between', 'twenty-four', 'points', 'according', 'to', 'the', 'roll', 'of', 'two', 'dice', '.']


# 3. Text Lemmatization and Stemming
For grammatical reasons, documents can contain different forms of a word such as drive, drives, driving. 
The goal of both Stemming and Lemmatization is to **reduce inflectional forms** and sometimes **derivationally related forms** of a word to a common base form.

E.g.: dog, dogs, dog's, dogs' => dog
    the boy's dogs are different sizes => the boy dog be differ size

**Stemming** refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time.
E.g.: The word "better" has "good" as its lemma. This link is missed by stemming, because it requires a dictionary look-up.

**Lemmatization** usually refers to doing things properly with the use of a vocabulary and morphological analysis of words.
E.g.: The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context. Lemmatization attempts to select the correct lemma depending on the context.
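The difference between the two ideas can be sketched with toy code. Both `toy_stem` and `toy_lemmatize` below are invented for illustration: the first is a bare suffix chopper (not the Porter algorithm), the second is a tiny hand-written dictionary (not WordNet).

```python
# Toy illustration only: neither function is part of NLTK.
SUFFIXES = ("ing", "es", "s")

def toy_stem(word):
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for a real vocabulary.
TOY_LEMMAS = {"better": "good", "drove": "drive", "dogs": "dog"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ["dogs", "driving", "better", "drove"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The suffix chopper produces non-words such as "driv" and cannot relate "better" to "good", whereas the dictionary look-up can, at the cost of needing a vocabulary.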

In [19]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\suyog\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\suyog\AppData\Roaming\nltk_data...


True

In [23]:
def compare(stemmer, lemmatizer, word, pos):
    """Print the stem and the lemma of `word` one after the other."""
    print("Stemmer:", stemmer.stem(word))
    # The part-of-speech tag tells the lemmatizer which lemma to look up.
    print("Lemmatizer:", lemmatizer.lemmatize(word, pos))
    print()

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

compare(stemmer, lemmatizer, word="seen", pos=wordnet.VERB)
compare(stemmer, lemmatizer, word="playing", pos=wordnet.VERB)
compare(stemmer, lemmatizer, word="drove", pos=wordnet.VERB)

Stemmer: seen
Lemmatizer: see

Stemmer: play
Lemmatizer: play

Stemmer: drove
Lemmatizer: drive



# 4. Stop Words
Stop words are words that are **filtered out** before or after processing of text. When applying machine learning to text, these words can add a lot of **noise**.

Stop words usually refer to the most common words in a language, such as "and", "the", and "a", but there is **no single universal list** of stop words.

NLTK provides a predefined list of stop words covering the most common words.
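Because there is no universal list, an application can also define its own. The sketch below assumes a small hand-picked set (`custom_stop_words` is made up for illustration and is not NLTK's list) and lowercases each token before the lookup, since stop word lists are typically stored in lowercase.

```python
# Hypothetical, hand-picked stop word set for illustration only.
custom_stop_words = {"the", "is", "of", "a", "and", "one"}

tokens = ["The", "game", "is", "one", "of", "the", "oldest"]

# Lowercase each token before the lookup so that "The" matches "the".
kept = [token for token in tokens if token.lower() not in custom_stop_words]
print(kept)
```

Without the `.lower()` call, the capitalized "The" at the start of the sentence would slip through the filter.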

In [24]:
from nltk.corpus import stopwords 
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\suyog\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [25]:
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [26]:
stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

In [27]:
words = nltk.word_tokenize(sentence)
without_stop_words = [word for word in words if word not in stop_words]
print(without_stop_words)

['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']


In [28]:
words = nltk.word_tokenize(sentence)
without_stop_words = []
for word in words:
    if word not in stop_words:
        without_stop_words.append(word)

print(without_stop_words)

['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']


We convert the list of stop words to a set. A set is an abstract data type that stores unique values in no particular order. A **search in a set is much faster** than a **search in a list**: a set lookup takes roughly constant time on average, while a list lookup scans the elements one by one.
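A quick way to check the set-versus-list claim on your own machine is to time both lookups. This is a minimal sketch with synthetic data; the exact timings depend on the machine, so none are quoted here.

```python
import timeit

# Synthetic data for illustration: 5,000 made-up "words".
words_list = [f"word{i}" for i in range(5000)]
words_set = set(words_list)

target = "word4999"  # last element: the worst case for a list scan

# Time 1,000 membership tests against each container.
list_time = timeit.timeit(lambda: target in words_list, number=1000)
set_time = timeit.timeit(lambda: target in words_set, number=1000)

print(f"list lookup: {list_time:.4f} s")
print(f"set lookup:  {set_time:.4f} s")
```

Both containers give the same answer to `target in ...`; only the time needed to reach it differs.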

# 5. Regex
A regular expression (regex) is a **sequence of characters** that defines a **search pattern** and is a powerful tool for **pattern matching**.
We can use regex to apply additional filtering to our text. For example, we can remove all the non-word characters. In many cases we don't need the punctuation marks, and it's easy to remove them with regex.

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python's usage of the same character for the same purpose in string literals, which is why regex patterns are usually written as raw strings such as r"[^\w]".
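The backslash collision is easiest to see with `\b`, which means "backspace" to Python's string parser but "word boundary" to the regex engine. A small sketch:

```python
import re

# In a normal string literal Python turns "\b" into a single backspace character.
normal = "\bcat\b"
# In a raw string the backslash is kept, so the regex engine sees \b (word boundary).
raw = r"\bcat\b"

text = "cat concat cats"
normal_matches = re.findall(normal, text)
raw_matches = re.findall(raw, text)

print(len(normal), len(raw))
print(normal_matches)
print(raw_matches)
```

The normal string is two characters shorter and matches nothing in the text, while the raw string finds the standalone word "cat".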

* \w - match a word character (letter, digit, or underscore)
* \d - match a digit
* \s - match a whitespace character
* \W - match a non-word character
* \D - match a non-digit character
* \S - match a non-whitespace character
* [abc] - match any one of a, b, or c
* [^abc] - match any character except a, b, or c
* [a-g] - match any character from a to g
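The character classes listed above can be tried out with `re.findall`. The sample strings are made up for illustration.

```python
import re

sample = "Room 42, floor 3"

digits = re.findall(r"\d", sample)            # single digits
numbers = re.findall(r"\d+", sample)          # runs of digits
words = re.findall(r"\w+", sample)            # runs of word characters
spaces = re.findall(r"\s", sample)            # whitespace characters
punctuation = re.findall(r"[^\w\s]", sample)  # neither word character nor whitespace

any_of_abc = re.findall(r"[abc]", "cabbage")   # only a, b, or c
none_of_abc = re.findall(r"[^abc]", "cabbage")  # everything except a, b, c
a_to_g = re.findall(r"[a-g]", "badge")         # characters in the range a to g

print(digits, numbers, words, punctuation)
print(any_of_abc, none_of_abc, a_to_g)
```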

In [29]:
import re
sentence = "The development of snowboarding was inspired by skateboarding, sledding, surfing and skiing."

In [31]:
pattern = r"[^\w]"
print(re.sub(pattern, " ", sentence))

The development of snowboarding was inspired by skateboarding  sledding  surfing and skiing 


# 6. Bag-of-Words
Machine learning algorithms cannot work with raw text directly; we need to convert the text into vectors of numbers. This is called feature extraction.

Bag-of-words counts the number of times each word or n-gram (a combination of n words) appears in a document.

Any information about the **order** or **structure** of the words is **discarded**, which is why it is called a bag of words.
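That word order is thrown away can be shown with a few lines of plain Python. The helper `to_bag` is a hypothetical name, and its whitespace tokenizer is deliberately simplistic.

```python
from collections import Counter

def to_bag(text):
    """Lowercase, split on whitespace, and count each word."""
    return Counter(text.lower().split())

first = to_bag("the dog chased the cat")
second = to_bag("the cat chased the dog")

print(first)
print(first == second)
```

The two sentences mean different things, yet their bags are identical because only the word counts survive.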

## Steps to create Bag-of-words

### 1. Load the Data

In [33]:
Expression = "I like this movie, it's funny. I hate this movie. This was awesome! I like it. Nice one. I love it."

In [36]:
sentence = nltk.sent_tokenize(Expression)
print(sentence)

["I like this movie, it's funny.", 'I hate this movie.', 'This was awesome!', 'I like it.', 'Nice one.', 'I love it.']


### 2. Design the Vocabulary
Let’s get all the unique words from the six loaded sentences, ignoring case, punctuation, and one-character tokens. These words will be our **vocabulary**.
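To make the vocabulary step concrete, here is a sketch that builds it by hand. It assumes a tokenization rule resembling `CountVectorizer`'s defaults (lowercasing, tokens of at least two word characters); `tokenize` and `docs` are names chosen for this example.

```python
import re

docs = [
    "I like this movie, it's funny.",
    "I hate this movie.",
    "This was awesome!",
    "I like it.",
    "Nice one.",
    "I love it.",
]

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is the sorted set of all tokens seen in any document.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})
print(vocabulary)
```

One-character tokens such as "I" and the "s" of "it's" are dropped by the two-character minimum.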

In [37]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [38]:
count_vectorizer = CountVectorizer()

### 3. Create Document Vector
Next, we need to score the words in each document. The task here is to convert each raw text into a vector of numbers. After that, we can use these vectors as input for a machine learning model.

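The simplest scoring method is to mark the presence of words with 1 for present and 0 for absent. Note that CountVectorizer counts occurrences by default; the matrix below contains only 0s and 1s because no word repeats within a sentence. Passing binary=True to CountVectorizer forces presence/absence scoring.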
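Presence/absence scoring can be sketched without any library. The three documents and the helper names below are chosen for illustration only.

```python
import re

docs = ["I like it.", "Nice one.", "I love it."]

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

def binary_vector(doc):
    """1 if the vocabulary word occurs in the document, 0 otherwise."""
    present = set(tokenize(doc))
    return [1 if word in present else 0 for word in vocabulary]

binary_vectors = [binary_vector(doc) for doc in docs]

print(vocabulary)
for vector in binary_vectors:
    print(vector)
```

Each document becomes a vector with one position per vocabulary word, so all vectors have the same length regardless of how long the document is.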

In [42]:
bag_of_words = count_vectorizer.fit_transform(sentence)

In [56]:
feature_names = count_vectorizer.get_feature_names_out()
pd.DataFrame(bag_of_words.toarray(), columns = feature_names)

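|   | awesome | funny | hate | it | like | love | movie | nice | one | this | was |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |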


In [58]:
print(sentence)

["I like this movie, it's funny.", 'I hate this movie.', 'This was awesome!', 'I like it.', 'Nice one.', 'I love it.']


### NOTE:
The complexity of the bag-of-words model comes in deciding how to design the vocabulary of known words (tokens).

#### Designing the Vocabulary
When the vocabulary size increases, the vector representation of each document also grows in length.

With a **huge amount of data**, the vector representation will contain a **lot of zeros**. Vectors that consist mostly of zeros are called **sparse vectors**. Storing and processing all those zeros requires more memory and computational resources.
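One common way to cope with sparse vectors is to store only the non-zero entries. The sketch below uses a made-up 20-word vocabulary and a plain dictionary; real libraries use dedicated sparse matrix formats, which is why the notebook calls `.toarray()` before building a DataFrame.

```python
# A made-up document vector over a 20-word vocabulary with three non-zero counts.
dense = [0] * 20
dense[3] = 2
dense[7] = 1
dense[15] = 1

# Sparse form: keep only the positions whose value is non-zero.
sparse = {index: value for index, value in enumerate(dense) if value != 0}

print(sparse)
print(len(dense), "stored values versus", len(sparse))
```

The gap between the two sizes widens as the vocabulary grows, because a single document only ever uses a small part of it.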

Another more complex way to create a vocabulary is to use grouped words. This changes the scope of the vocabulary and allows the bag-of-words model to get **more details** about the document. This approach is called **n-grams**.

An **n-gram** is a **sequence of n items** (words, letters, digits, etc.). In the context of text corpora, n-grams typically refer to sequences of words. The “n” in “n-gram” is the number of grouped words.

E.g.: Let's look at the bigrams for the following sentence:
"The office building is open today"

All the bigrams are:

* the office
* office building
* building is
* is open
* open today
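The bigram list above can be generated with a short helper. `ngrams` here is a hand-written function for illustration, assuming simple whitespace tokenization.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the office building is open today".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

for bigram in bigrams:
    print(" ".join(bigram))
```

In scikit-learn, `CountVectorizer` exposes an `ngram_range` parameter for the same purpose, for example `ngram_range=(2, 2)` for bigrams only.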

#### Scoring Words
Once we have created our vocabulary of known words, we need to score the occurrence of the words in our data. We saw one very simple approach: the binary approach (1 for presence, 0 for absence).

Some additional scoring methods are:

* **Counts**. Count the number of times each word appears in a document.
* **Term frequencies**. Calculate how often each word appears in a document relative to the total number of words in that document.
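Both scoring methods fit in a few lines. This is a minimal sketch on a made-up sentence with whitespace tokenization.

```python
from collections import Counter

tokens = "the dog chased the cat".split()

# Counts: how many times each word appears in the document.
counts = Counter(tokens)

# Term frequencies: each count divided by the total number of words.
term_frequencies = {word: count / len(tokens) for word, count in counts.items()}

print(counts)
print(term_frequencies)
```

Here "the" has a count of 2 and a term frequency of 2/5, while every other word has a count of 1 and a term frequency of 1/5.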

# 7. TF-IDF
TF-IDF, short for **term frequency-inverse document frequency**, is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus.

* Term Frequency (TF): a score of how often the word occurs in the current document.

* Inverse Document Frequency (IDF): a score of how rare the word is across documents.
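To show how the two scores combine, here is a hand-rolled sketch. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the unit-length (L2) scaling that scikit-learn's `TfidfVectorizer` applies by default; the helper names are invented for this example.

```python
import math
import re
from collections import Counter

docs = [
    "I like this movie, it's funny.",
    "I hate this movie.",
    "This was awesome!",
    "I like it.",
    "Nice one.",
    "I love it.",
]

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word occurs.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    """Smoothed IDF (assumed formula): ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(tokens):
    """TF-IDF weights for one document, scaled to unit Euclidean length."""
    raw = {word: count * idf(word) for word, count in Counter(tokens).items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {word: weight / norm for word, weight in raw.items()}

weights = tfidf(tokenized[3])  # "I like it."
print({word: round(weight, 4) for word, weight in weights.items()})
```

For "I like it." the rarer word "like" (found in two sentences) gets a higher weight than "it" (found in three), roughly 0.76 versus 0.65.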


In [60]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

tfidf_vectorizer = TfidfVectorizer()
values = tfidf_vectorizer.fit_transform(sentence)

# Show the Model as a pandas DataFrame
feature_names = tfidf_vectorizer.get_feature_names_out()
pd.DataFrame(values.toarray(), columns = feature_names)

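|   | awesome | funny | hate | it | like | love | movie | nice | one | this | was |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.550195 | 0.0 | 0.380907 | 0.451168 | 0.0 | 0.451168 | 0.0 | 0.0 | 0.380907 | 0.0 |
| 1 | 0.0 | 0.0 | 0.681722 | 0.0 | 0.0 | 0.0 | 0.559022 | 0.0 | 0.0 | 0.471964 | 0.0 |
| 2 | 0.635091 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.439681 | 0.635091 |
| 3 | 0.0 | 0.0 | 0.0 | 0.645102 | 0.764096 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.707107 | 0.707107 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.569213 | 0.0 | 0.82219 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |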
