# NLP Concepts using Python

## Text Analysis

In [1]:
import nltk
import spacy
import re
from nltk.tokenize import (word_tokenize,
                           sent_tokenize,
                           TreebankWordTokenizer,
                           wordpunct_tokenize,
                           TweetTokenizer,
                           MWETokenizer)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ganad\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ganad\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [4]:
# a more complex sentence to showcase tokenization
complex_sentence = """Dr. Emily Wong, Ph.D., exclaimed, 'OMG! Despite reaching 7.8 billion in 2021, global population metrics, like those in Section 123(i)(3)(A)(ii), remain baffling,' and then she added in a mix of French and English, 'C’est incroyable, no? Check my blog at www.emilywong-science.com or email me at wong@scienceworld.com for the full story on Mycobacterium tuberculosis complex (MTBC) research, which is, you know, the state-of-the-art stuff—I've literally termed it "the Occam’s razor of epidemiology."""
sentence = "The quick brown fox jumps over the lazy dog."

### Tokenization

In [3]:
tokens_word = nltk.word_tokenize(sentence)
token_sent = nltk.sent_tokenize(sentence)
token_treebank = TreebankWordTokenizer().tokenize(sentence)
token_wordpunct = wordpunct_tokenize(sentence)
token_tweet = TweetTokenizer().tokenize(sentence)
token_mwe = MWETokenizer().tokenize(sentence.split())


display(tokens_word,
        token_sent,
        token_treebank,
        token_wordpunct,
        token_tweet,
        token_mwe)

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

['The quick brown fox jumps over the lazy dog.']

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']

In [4]:
tokens_word_c = nltk.word_tokenize(complex_sentence)
token_sent_c = nltk.sent_tokenize(complex_sentence)
token_treebank_c = TreebankWordTokenizer().tokenize(complex_sentence)
token_wordpunct_c = wordpunct_tokenize(complex_sentence)
token_tweet_c = TweetTokenizer().tokenize(complex_sentence)
token_mwe_c = MWETokenizer().tokenize(complex_sentence.split())

print(tokens_word_c)
print()
print(token_sent_c)
print()
print(token_treebank_c)
print()
print(token_wordpunct_c)
print()
print(token_tweet_c)
print()
print(token_mwe_c)

['Dr.', 'Emily', 'Wong', ',', 'Ph.D.', ',', 'exclaimed', ',', "'OMG", '!', 'Despite', 'reaching', '7.8', 'billion', 'in', '2021', ',', 'global', 'population', 'metrics', ',', 'like', 'those', 'in', 'Section', '123', '(', 'i', ')', '(', '3', ')', '(', 'A', ')', '(', 'ii', ')', ',', 'remain', 'baffling', ',', "'", 'and', 'then', 'she', 'added', 'in', 'a', 'mix', 'of', 'French', 'and', 'English', ',', "'", 'C', '’', 'est', 'incroyable', ',', 'no', '?', 'Check', 'my', 'blog', 'at', 'www.emilywong-science.com', 'or', 'email', 'me', 'at', 'wong', '@', 'scienceworld.com', 'for', 'the', 'full', 'story', 'on', 'Mycobacterium', 'tuberculosis', 'complex', '(', 'MTBC', ')', 'research', ',', 'which', 'is', ',', 'you', 'know', ',', 'the', 'state-of-the-art', 'stuff—I', "'ve", 'literally', 'termed', 'it', '``', 'the', 'Occam', '’', 's', 'razor', 'of', 'epidemiology', '.']

["Dr. Emily Wong, Ph.D., exclaimed, 'OMG!", "Despite reaching 7.8 billion in 2021, global population metrics, like those in Secti

### Observation
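- On the complex sentence, `word_tokenize` and `TreebankWordTokenizer` give cleaner splits than the other tokenizers tried here. `word_tokenize`, for example, keeps `Dr.`, `Ph.D.`, `7.8`, the URL, and `state-of-the-art` intact, although it still splits the email address at `@`.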

- Picking the right tokenizer depends on the task at hand. For example, if you are working on a sentiment analysis task, you might want to use a tokenizer that preserves emoticons and hashtags. If you are working on a machine translation task, you might want to use a tokenizer that preserves punctuation and capitalization.
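
To make the sentiment-analysis point concrete, here is a minimal sketch of a regex tokenizer that keeps emoticons, hashtags, and mentions whole, next to a naive word/punctuation split. The pattern and the names `SOCIAL_TOKEN`, `social_tokenize`, and `naive_tokenize` are hypothetical and only cover a few simple emoticons; this is not how `TweetTokenizer` is implemented.

```python
import re

# Hypothetical pattern for illustration: alternatives are tried left to right,
# so social-media tokens are matched before plain words and punctuation.
SOCIAL_TOKEN = re.compile(
    r"""
    [:;=][-']?[()DPp]      # simple western emoticons
    | [#@]\w+              # hashtags and @mentions
    | \w+(?:'\w+)?         # words, optionally with an internal apostrophe
    | [^\w\s]              # any other single punctuation character
    """,
    re.VERBOSE,
)


def social_tokenize(text):
    """Split text into tokens, keeping emoticons, hashtags and mentions whole."""
    return SOCIAL_TOKEN.findall(text)


def naive_tokenize(text):
    """Baseline: runs of word characters and single punctuation marks only."""
    return re.findall(r"\w+|[^\w\s]", text)


tweet = "Loving the new phone :-) #happy @maker"
print(social_tokenize(tweet))
# ['Loving', 'the', 'new', 'phone', ':-)', '#happy', '@maker']
print(naive_tokenize(tweet))
# ['Loving', 'the', 'new', 'phone', ':', '-', ')', '#', 'happy', '@', 'maker']
```

The naive split breaks `:-)` and `#happy` into pieces, which discards exactly the signals a sentiment model would want to keep.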

### Penn Treebank Tagset Overview

A tagset is a list of part-of-speech tags, i.e., labels used to indicate the part of speech and often also other grammatical categories (case, tense etc.) of each token in a text corpus.

#### Introduction to Penn Treebank Tagset

The English Penn Treebank tagset is used with English corpora annotated by the TreeTagger tool, developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart. The version shown here includes modifications developed by Sketch Engine (earlier version).

[See a more recent version of this tagset](https://www.sketchengine.eu/penn-treebank-tagset/).

### What is a POS Tag?

POS tags classify words into grammatical categories which can help in understanding the structure and context of text. The table below shows the English Penn TreeBank tagset with Sketch Engine modifications (earlier version).

Example: in a CQL concordance search, `[tag="NNS"]` finds all plural nouns, e.g., people, years (always use straight double quotation marks in CQL).

| POS Tag | Description                                   | Example             |
|---------|-----------------------------------------------|---------------------|
| CC      | Coordinating conjunction                      | and                 |
| CD      | Cardinal number                               | 1, third            |
| DT      | Determiner                                    | the                 |
| EX      | Existential there                             | there is            |
| FW      | Foreign word                                  | les                 |
| IN      | Preposition, subordinating conjunction        | in, of, like        |
| IN/that | That as subordinator                          | that                |
| JJ      | Adjective                                     | green               |
| JJR     | Adjective, comparative                        | greener             |
| JJS     | Adjective, superlative                        | greenest            |
| LS      | List marker                                   | 1)                  |
| MD      | Modal                                         | could, will         |
| NN      | Noun, singular or mass                        | table               |
| NNS     | Noun, plural                                  | tables              |
| NP      | Proper noun, singular                         | John                |
| NPS     | Proper noun, plural                           | Vikings             |
| PDT     | Predeterminer                                 | both the boys       |
| POS     | Possessive ending                             | friend’s            |
| PP      | Personal pronoun                              | I, he, it           |
| PPZ     | Possessive pronoun                            | my, his             |
| RB      | Adverb                                        | however, usually    |
| RBR     | Adverb, comparative                           | better              |
| RBS     | Adverb, superlative                           | best                |
| RP      | Particle                                      | give up             |
| SENT    | Sentence-break punctuation                    | . ! ?               |
| SYM     | Symbol                                        | / [ = *             |
| TO      | Infinitive ‘to’                               | to go               |
| UH      | Interjection                                  | uhhuhhuhh           |
| VB      | Verb, base form                               | be                  |
| VBD     | Verb, past tense                              | was, were           |
| VBG     | Verb, gerund/present participle               | being               |
| VBN     | Verb, past participle                         | been                |
| VBP     | Verb, sing. present, non-3rd person           | am, are             |
| VBZ     | Verb, 3rd person sing. present                | is                  |
| VH      | Verb have, base form                          | have                |
| VHD     | Verb have, past tense                         | had                 |
| VHG     | Verb have, gerund/present participle          | having              |
| VHN     | Verb have, past participle                    | had                 |
| VHP     | Verb have, sing. present, non-3rd person      | have                |
| VHZ     | Verb have, 3rd person sing. present           | has                 |
| VV      | Verb, base form                               | take                |
| VVD     | Verb, past tense                              | took                |
| VVG     | Verb, gerund/present participle               | taking              |
| VVN     | Verb, past participle                         | taken               |
| VVP     | Verb, sing. present, non-3rd person           | take                |
| VVZ     | Verb, 3rd person sing. present                | takes               |
| WDT     | Wh-determiner                                 | which               |
| WP      | Wh-pronoun                                    | who, what           |
| WP$     | Possessive wh-pronoun                         | whose               |
| WRB     | Wh-adverb                                     | where, when         |
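
The same kind of tag query can be mimicked in plain Python. The sketch below assumes the tag value is treated as a regular expression that must match the whole tag; the `tagged` list is hand-written for illustration (it is not tagger output), and `find_by_tag` is a hypothetical helper.

```python
import re

# Hand-tagged sample for illustration only (not tagger output); the tags
# follow the table above.
tagged = [
    ("people", "NNS"), ("have", "VHP"), ("taken", "VVN"),
    ("the", "DT"), ("tables", "NNS"), (".", "SENT"),
]


def find_by_tag(tagged_tokens, tag_pattern):
    """Return the words whose tag fully matches the regular expression."""
    rx = re.compile(tag_pattern)
    return [word for word, tag in tagged_tokens if rx.fullmatch(tag)]


print(find_by_tag(tagged, "NNS"))   # plural nouns: ['people', 'tables']
print(find_by_tag(tagged, "V.*"))   # any verb tag: ['have', 'taken']
```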

### Main Differences to the Default Penn Tagset
- In TreeTagger:
  - Distinguishes 'be' (VB) and 'have' (VH) from other (non-modal) verbs (VV).
  - For proper nouns, NNP and NNPS have become NP and NPS.
  - SENT for end-of-sentence punctuation (other punctuation tags may also differ).
- In TreeTagger tool + Sketch Engine modifications:
  - The word 'to' is tagged IN when used as a preposition and TO when used as an infinitive marker.
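
The differences listed above can be sketched as a small mapping back to default Penn tags. `treetagger_to_penn` is a hypothetical helper that covers only the three TreeTagger differences named in the list; the 'to' distinction is not handled because it depends on the word as well as the tag.

```python
def treetagger_to_penn(tag):
    """Map a TreeTagger-style tag to its default Penn Treebank counterpart.

    Assumed mapping, derived only from the differences listed above.
    """
    if tag == "SENT":
        return "."                  # sentence-final punctuation
    if tag in ("NP", "NPS"):
        return "NNP" + tag[2:]      # NP -> NNP, NPS -> NNPS
    if tag[:2] in ("VH", "VV"):
        return "VB" + tag[2:]       # e.g. VHZ -> VBZ, VVD -> VBD
    return tag                      # all other tags are left unchanged


for tag in ["VHZ", "VVD", "VV", "NP", "NPS", "SENT", "NN"]:
    print(tag, "->", treetagger_to_penn(tag))
```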

### Bibliography

M. Marcus, B. Santorini and M.A. Marcinkiewicz (1993). Building a large annotated corpus of English: The Penn Treebank. In Computational Linguistics, volume 19, number 2, pp. 313–330.

### Part-of-Speech (POS) Tags

### Remove Punctuation

In [7]:
def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

print(remove_punctuation(sentence))
print(remove_punctuation(complex_sentence))

The quick brown fox jumps over the lazy dog
Dr Emily Wong PhD exclaimed OMG Despite reaching 78 billion in 2021 global population metrics like those in Section 123i3Aii remain baffling and then she added in a mix of French and English Cest incroyable no Check my blog at wwwemilywongsciencecom or email me at wongscienceworldcom for the full story on Mycobacterium tuberculosis complex MTBC research which is you know the stateoftheart stuffIve literally termed it the Occams razor of epidemiology
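
The output above shows a side effect of removing every punctuation character: `7.8` becomes `78`, and the URL and email address lose their dots and `@`. One possible alternative, sketched below under the assumption that whitespace-separated tokens are a good enough unit, is to strip punctuation only from the edges of each token. `EDGE_CHARS` and `strip_edge_punctuation` are hypothetical names.

```python
import string

# Characters to trim from token edges: ASCII punctuation plus the curly
# quotes that appear in the sample text (an illustrative choice).
EDGE_CHARS = string.punctuation + "\u2018\u2019\u201c\u201d"


def strip_edge_punctuation(text):
    """Remove punctuation at the start and end of each whitespace-separated token."""
    stripped = (token.strip(EDGE_CHARS) for token in text.split())
    return " ".join(token for token in stripped if token)


sample = ("Despite reaching 7.8 billion in 2021, email me at "
          "wong@scienceworld.com for the state-of-the-art stuff!")
print(strip_edge_punctuation(sample))
# Despite reaching 7.8 billion in 2021 email me at wong@scienceworld.com for the state-of-the-art stuff
```

The trade-off is that abbreviations lose their final period (`Ph.D.,` becomes `Ph.D`), so this suits cases where internal punctuation carries meaning and trailing punctuation does not.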


### Lowercasing

In [9]:
def lowercase(text):
    return text.lower()

print(lowercase(sentence), "\n", lowercase(complex_sentence))

the quick brown fox jumps over the lazy dog. 
 dr. emily wong, ph.d., exclaimed, 'omg! despite reaching 7.8 billion in 2021, global population metrics, like those in section 123(i)(3)(a)(ii), remain baffling,' and then she added in a mix of french and english, 'c’est incroyable, no? check my blog at www.emilywong-science.com or email me at wong@scienceworld.com for the full story on mycobacterium tuberculosis complex (mtbc) research, which is, you know, the state-of-the-art stuff—i've literally termed it "the occam’s razor of epidemiology.


### Stemming

In [10]:
from nltk.stem import PorterStemmer

def stemming(words):
    """Stem each token in a list of tokens."""
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in words]

# Tokenize first: passing a raw string would make the loop iterate over characters
print(stemming(word_tokenize(sentence)))
print(stemming(word_tokenize(complex_sentence))[:8])  # first 8 stems only

['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', '.']
['dr.', 'emili', 'wong', ',', 'ph.d.', ',', 'exclaim', ',']

### Lemmatization

In [12]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

def lemmatization(words):
    """Lemmatize each token in a list of tokens (the default part of speech is noun)."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in words]

# Tokenize first: passing a raw string would make the loop iterate over characters
print(lemmatization(word_tokenize(sentence)))
print(lemmatization(word_tokenize(complex_sentence))[:8])  # first 8 lemmas only

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ganad\AppData\Roaming\nltk_data...


['The', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.']
['Dr.', 'Emily', 'Wong', ',', 'Ph.D.', ',', 'exclaimed', ',']
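
To contrast the two ideas without any downloads, here is a toy sketch: stemming as rule-based suffix stripping, lemmatization as a dictionary lookup. The suffix list and the lemma table are made up for illustration; they are not the Porter rules or WordNet, and `toy_stem` and `toy_lemmatize` are hypothetical names.

```python
# Toy suffix rules, checked in order (illustrative only, not the Porter rules).
SUFFIXES = ("ing", "ies", "ed", "es", "s")

# Tiny hand-written lemma dictionary (illustrative only, not WordNet).
LEMMAS = {"studies": "study", "was": "be", "better": "good", "jumps": "jump"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the lowercased word."""
    return LEMMAS.get(word.lower(), word.lower())


for w in ["jumps", "studies", "was", "better"]:
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The stemmer can produce non-words such as `stud` and cannot relate `was` to `be`, while the lookup returns dictionary forms but only for words it knows.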

## Text Classification

In [2]:
import os
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import pickle

model_dir = r'C:\Users\ganad\Desktop\Resume Work\NLP_HUB\models'

# Fetch the dataset
data = fetch_20newsgroups(subset='train')
categories = data.target_names

# Create pipelines for different classifiers
models = {
    'Naive Bayes': make_pipeline(TfidfVectorizer(), MultinomialNB()),
    'SVM': make_pipeline(TfidfVectorizer(), SVC(probability=True)),
    'Logistic Regression': make_pipeline(TfidfVectorizer(), LogisticRegression())
}

# Train and save each model in the specified directory
os.makedirs(model_dir, exist_ok=True)  # create the output directory if it is missing
for name, model in models.items():
    model.fit(data.data, data.target)
    model_path = os.path.join(model_dir, f'{name.lower().replace(" ", "_")}_classifier.pkl')
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)

In [5]:
# Define the directory where models are saved
model_dir = r'C:\Users\ganad\Desktop\Resume Work\NLP_HUB\models'

# Dictionary to map model names to file names
model_files = {
    'Naive Bayes': os.path.join(model_dir, 'naive_bayes_classifier.pkl'),
    'SVM': os.path.join(model_dir, 'svm_classifier.pkl'),
    'Logistic Regression': os.path.join(model_dir, 'logistic_regression_classifier.pkl')
}

# Function to load a model
def load_model(model_name):
    with open(model_files[model_name], 'rb') as f:
        model = pickle.load(f)
    return model

# Load the dataset to fetch categories
data = fetch_20newsgroups(subset='train')
categories = data.target_names

# Load all models
models = {name: load_model(name) for name in model_files.keys()}


# Print predictions from each model
for name, model in models.items():
    prediction = model.predict([sentence])
    print(f"Prediction from {name}: {categories[prediction[0]]}")

for name, model in models.items():
    prediction = model.predict([complex_sentence])
    print(f"Prediction from {name}: {categories[prediction[0]]}")

Prediction from Naive Bayes: rec.motorcycles
Prediction from SVM: rec.motorcycles
Prediction from Logistic Regression: rec.motorcycles
Prediction from Naive Bayes: soc.religion.christian
Prediction from SVM: sci.med
Prediction from Logistic Regression: sci.med


## Text Similarity

## Keyword Extraction

## Topic Modeling

## Text Summarization

## Language Translation

# References:
- Bird, Steven, Edward Loper, and Ewan Klein (2009). Natural Language Processing with Python. O'Reilly Media Inc.
- https://neptune.ai/blog/tokenization-in-nlp
- https://www.sketchengine.eu/penn-treebank-tagset/