# Introduction to Python Libraries for NLP

This notebook introduces two of the most popular Python libraries for NLP: NLTK and spaCy. We will learn how to set up these libraries, understand their fundamental concepts, and apply them to simple NLP tasks.


### Setting Up Your Environment


```bash
pip install nltk spacy scikit-learn
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
```

### Introduction to NLTK

The [Natural Language Toolkit (NLTK)](https://www.nltk.org/) is one of the leading platforms for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a range of functionalities, each suited to a different aspect of NLP. Below we explore some of these functionalities.

In [46]:
import nltk
nltk.download('punkt')  # Download the required models

[nltk_data] Downloading package punkt to /Users/waseem/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### 1. Tokenization
Tokenization is the process of breaking down text into smaller components, such as words or sentences. This is a fundamental step in most NLP tasks as it helps in preparing text for further modeling.

In [10]:
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Hello world. NLTK is cool!"
print("Word Tokenization:", word_tokenize(text))
print("Sentence Tokenization:", sent_tokenize(text))

Word Tokenization: ['Hello', 'world', '.', 'NLTK', 'is', 'cool', '!']
Sentence Tokenization: ['Hello world.', 'NLTK is cool!']


Both `word_tokenize` and `sent_tokenize` return a list of strings.

- **Word Tokenization:** This splits the input text into individual elements where each element is a word. Punctuation marks are also treated as separate tokens.
  
  Output: `['Hello', 'world', '.', 'NLTK', 'is', 'cool', '!']`
  
  This shows each word and punctuation mark from the input text "Hello world. NLTK is cool!" as a separate token.

- **Sentence Tokenization:** This splits the text into individual sentences. Each sentence is a string in the resulting list.
  
  Output: `['Hello world.', 'NLTK is cool!']`
  
  `sent_tokenize` relies on the pre-trained Punkt model downloaded above. It looks for punctuation marks that typically end sentences (like `.`, `!`, `?`) while trying to avoid false splits, for example after abbreviations.
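
To make the idea of tokenization concrete, here is a deliberately simplified sketch using only regular expressions. It is not how NLTK's tokenizers work internally; the function names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical and exist only for illustration.

```python
import re

def simple_word_tokenize(text):
    # Runs of word characters become tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Naive rule: split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Hello world. NLTK is cool!"
print(simple_word_tokenize(text))  # ['Hello', 'world', '.', 'NLTK', 'is', 'cool', '!']
print(simple_sent_tokenize(text))  # ['Hello world.', 'NLTK is cool!']

# The naive rule breaks on abbreviations, which is why trained models are preferred.
print(simple_sent_tokenize("Dr. Smith arrived."))  # ['Dr.', 'Smith arrived.']
```

The last call shows the main weakness of a pure punctuation rule: it cannot tell an abbreviation from the end of a sentence.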

#### 2. Stemming and Lemmatization
Stemming and lemmatization are techniques used to reduce a word to its base or root form. Stemming applies heuristic rules that strip common affixes (the Porter stemmer used below removes suffixes), while lemmatization uses morphological analysis of the word and returns the lemma, or dictionary form, according to the language's lexicon.

In [6]:
nltk.download('wordnet')

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = "running"
print("Stemming:", stemmer.stem(word))
print("Lemmatization:", lemmatizer.lemmatize(word, pos='v'))

[nltk_data] Downloading package wordnet to /Users/waseem/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Stemming: run
Lemmatization: run


Both stemming and lemmatization aim to reduce a word to its base form, but their methods and results differ.

- **Stemming Output**: `run`
  
  The word "running" is reduced to "run". Here the stem happens to be a valid word, but stemming often produces a form that is not a valid word and only captures the core of the original.
  
- **Lemmatization Output**: `run`
  
  Unlike stemming, lemmatization returns a root form that is a valid word ("run"), which is the lemma of "running" when the word is treated as a verb (`pos='v'`).
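
The difference between the two approaches can be sketched in plain Python. This is a toy illustration under strong assumptions, not the Porter algorithm or WordNet: `toy_stem` applies a few hand-picked suffix rules, and `TOY_LEMMAS` is a tiny hypothetical lookup table standing in for a real lexicon.

```python
def toy_stem(word):
    # Rule-based: strip the first matching suffix, with no knowledge of the lexicon.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

# Hypothetical mini-lexicon: (word, part of speech) -> lemma.
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("studies", "n"): "study",
}

def toy_lemmatize(word, pos):
    # Dictionary-based: look the word up; fall back to the word itself.
    return TOY_LEMMAS.get((word, pos), word)

print(toy_stem("running"), toy_lemmatize("running", "v"))  # run run
print(toy_stem("studies"), toy_lemmatize("studies", "n"))  # studi study
print(toy_stem("ran"), toy_lemmatize("ran", "v"))          # ran run
```

The rule-based stemmer yields the non-word "studi" and cannot relate "ran" to "run", while the lookup handles both, at the cost of needing a lexicon and a part-of-speech hint.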

#### 3. Part-of-Speech Tagging
Part-of-speech (POS) tagging is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context. POS tagging is used for syntax parsing, text analysis, and in the preprocessing steps for more complex NLP tasks.

In [7]:
nltk.download('averaged_perceptron_tagger')

words = word_tokenize("And now for something completely different")
print("POS Tagging:", nltk.pos_tag(words))

POS Tagging: [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/waseem/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


The output of `pos_tag` is a list of tuples where each tuple consists of a word from the input text and a tag indicating its part of speech.

Output: `[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]`

Here's what the tags mean:
- **CC**: Coordinating conjunction
- **RB**: Adverb
- **IN**: Preposition or subordinating conjunction
- **NN**: Noun, singular
- **JJ**: Adjective

Each tag helps in understanding the grammatical role of a word in the sentence, which is crucial for syntactic parsing and other language processing tasks.
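
Because the result is an ordinary list of `(word, tag)` tuples, it can be filtered and counted with standard Python tools. The sketch below hard-codes the output shown above, so it runs without NLTK.

```python
from collections import Counter

# The tagged output from the cell above, copied as a literal.
tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
          ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# Keep only the adverbs.
adverbs = [word for word, tag in tagged if tag == 'RB']

print(tag_counts['RB'])  # 2
print(adverbs)           # ['now', 'completely']
```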

#### 4. Named Entity Recognition (NER)

Named Entity Recognition (NER) is used to identify important named entities in the text—people, places, organizations—and label them with the appropriate entity types. NER is crucial for many applications such as building content recommenders, automating customer support, and more.

In [8]:
from nltk import ne_chunk
from nltk import pos_tag

nltk.download('maxent_ne_chunker')
nltk.download('words')

sentence = "Albert Einstein was born in Ulm, Germany in 1879."
tags = pos_tag(word_tokenize(sentence))
entities = ne_chunk(tags)
print("Named Entity Recognition:", entities)

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/waseem/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /Users/waseem/nltk_data...


Named Entity Recognition: (S
  (PERSON Albert/NNP)
  (PERSON Einstein/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Ulm/NNP)
  ,/,
  (GPE Germany/NNP)
  in/IN
  1879/CD
  ./.)


[nltk_data]   Unzipping corpora/words.zip.


`ne_chunk` takes the POS-tagged tokens and returns a nested tree in which each subtree is a labeled entity chunk, while all other tokens remain leaves directly under the root `S`. The labels and tags in this output are:
- **PERSON**: A named entity representing a person's name.
- **GPE**: Geo-Political Entity, which includes countries, cities, states.
- **NNP**: Proper noun, singular
- **VBD**: Verb, past tense
- **VBN**: Verb, past participle
- **IN**: Preposition or subordinating conjunction
- **CD**: Cardinal number

Entities such as "Albert", "Einstein", "Ulm", and "Germany" are recognized and labeled with their type. Note that in this output the chunker tagged "Albert" and "Einstein" as two separate PERSON chunks rather than as a single "Albert Einstein" entity.
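
In the tree above, "Albert" and "Einstein" appear as two adjacent PERSON chunks. One common post-processing step is to merge neighbouring chunks that share a label. The sketch below assumes a simplified, hypothetical representation of the tree as a flat list of `(label, token)` pairs, with `None` for tokens outside any entity, so that it runs without NLTK.

```python
from itertools import groupby

# Flattened version of the tree printed above (hypothetical representation).
chunks = [
    ("PERSON", "Albert"), ("PERSON", "Einstein"),
    (None, "was"), (None, "born"), (None, "in"),
    ("GPE", "Ulm"), (None, ","), ("GPE", "Germany"),
    (None, "in"), (None, "1879"), (None, "."),
]

def merge_adjacent_entities(chunks):
    # groupby collects consecutive items that share the same label.
    entities = []
    for label, group in groupby(chunks, key=lambda c: c[0]):
        if label is not None:
            entities.append((" ".join(token for _, token in group), label))
    return entities

print(merge_adjacent_entities(chunks))
# [('Albert Einstein', 'PERSON'), ('Ulm', 'GPE'), ('Germany', 'GPE')]
```

"Ulm" and "Germany" stay separate because the comma between them is not part of an entity. Merging purely on adjacency is a heuristic and would wrongly join two different people named back to back.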

#### 5. Parsing

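Parsing is the process of analyzing a string of symbols, either in natural language or computer languages, according to the rules of a formal grammar. In the context of NLP, parsing involves determining the parse tree (a graphical representation of the syntactic structure) of a given sentence. The example below performs chunking, a form of shallow parsing: it groups POS-tagged words into flat phrases using a regular-expression grammar rather than building a full parse tree.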

In [9]:
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
sentence = "The quick brown fox jumps over the lazy dog"
tags = pos_tag(word_tokenize(sentence))
parsed_sentence = cp.parse(tags)
print("Parsing:", parsed_sentence)


Parsing: (S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  jumps/VBZ
  over/IN
  (NP the/DT lazy/JJ dog/NN))


The output of parsing with a defined grammar is a tree structure where each node represents a syntactic chunk based on the specified grammar.

- **NP**: Noun Phrase
- **DT**: Determiner
- **JJ**: Adjective
- **NN**: Noun, singular

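The tree identifies the noun phrases in the sentence "The quick brown fox jumps over the lazy dog", which helps in understanding the structure and components of the sentence. Note that the tagger labeled "brown" as a noun (`NN`) here, so the grammar closes the first chunk at "brown" and places "fox" in a noun phrase of its own.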
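
The tag pattern `<DT>?<JJ>*<NN>` reads much like an ordinary regular expression over tags: an optional determiner, any number of adjectives, then one noun. The sketch below imitates that idea with Python's `re` module on the tags shown above. It is an illustration of the concept, not NLTK's actual implementation, and `np_chunks` is a hypothetical helper.

```python
import re

# POS-tagged sentence, copied from the chunker output above.
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]

def np_chunks(tagged):
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>...".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    chunks = []
    for match in pattern.finditer(tag_string):
        # Convert character offsets back to token indices by counting tags.
        start = tag_string[:match.start()].count("<")
        end = start + match.group().count("<")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks

print(np_chunks(tagged))
# [['The', 'quick', 'brown'], ['fox'], ['the', 'lazy', 'dog']]
```

The three chunks match the `NP` subtrees printed above, including the split after "brown".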

#### 6. Sentiment Analysis Using VADER

[VADER](https://github.com/cjhutto/vaderSentiment) (Valence Aware Dictionary and sEntiment Reasoner) is a popular sentiment analysis tool because of its specificity to social media contexts and its ability to understand modern colloquialisms, slang, and emojis. It provides positive, negative, and neutral sentiment scores, as well as a compound score that summarizes the overall sentiment.
VADER analyzes text to classify sentiment into positive, negative, and neutral categories based on a lexicon of words and rules that interpret grammatical constructions of sentiment expressions. The compound score it generates is a normalized, weighted composite score that can be used to gauge the overall sentiment of a text. Scores range from -1 (most extreme negative) to +1 (most extreme positive).

In [45]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download VADER lexicon if not already downloaded
nltk.download('vader_lexicon')

# Create a Sentiment Intensity Analyzer object
sia = SentimentIntensityAnalyzer()

# Example texts
texts = [
    "I love this new phone, it's awesome!",
    "This is the worst movie I have ever seen."
]

# Apply VADER to each text
for text in texts:
    scores = sia.polarity_scores(text)
    print(f"Text: {text}")
    print("Scores:", scores)


Text: I love this new phone, it's awesome!
Scores: {'neg': 0.0, 'neu': 0.318, 'pos': 0.682, 'compound': 0.8622}
Text: This is the worst movie I have ever seen.
Scores: {'neg': 0.369, 'neu': 0.631, 'pos': 0.0, 'compound': -0.6249}


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/waseem/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Interpreting the compound scores:
- positive sentiment: compound score >= 0.05
- neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
- negative sentiment: compound score <= -0.05
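
These thresholds translate directly into a small helper. The function name `label_from_compound` is hypothetical; it simply applies the cut-offs listed above to a compound score.

```python
def label_from_compound(compound, threshold=0.05):
    # Apply the conventional +/-0.05 cut-offs to a VADER compound score.
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(label_from_compound(0.8622))   # positive
print(label_from_compound(-0.6249))  # negative
print(label_from_compound(0.0))      # neutral
```

In the loop above it could be called as `label_from_compound(scores['compound'])`.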

Given the example texts:
1. "I love this new phone, it's awesome!" - The positive score (0.682) dominates, the negative score is 0.0, and the compound score of 0.8622 indicates strong positive sentiment.
2. "This is the worst movie I have ever seen." - The positive score is 0.0 and the negative score is 0.369. Most of the text is rated neutral (0.631), but the compound score of -0.6249 still indicates clearly negative sentiment.

This method is especially useful in cases where quick, effective sentiment analysis is needed, and can be particularly effective in analyzing customer feedback, reviews, or social media data.

### Introduction to spaCy

[spaCy](https://spacy.io/) is a modern, robust library for Natural Language Processing (NLP) in Python. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. It is designed for practical, real-world applications and built for performance, with the ability to scale and integrate into large systems. spaCy provides pre-trained models for various languages and capabilities. Below we explore some of its capabilities.

In [34]:
import spacy

#### 1. Word Vectors and Similarity

Word vectors are numerical representations of words that capture their meanings, semantic relationships, and the contexts in which they appear. spaCy provides support for using these vectors to compute similarities between words, texts, and other linguistic units.
Word vectors, or embeddings, are learned from the context in which words appear. They are fundamental in various NLP tasks because they help algorithms understand synonymy and semantic similarity between different pieces of text. spaCy can use pre-trained word embeddings from models like GloVe, or you can train your own. (Topic explored further in 1.2.2 Basic Text Processing)



In [41]:
# Load the English NLP model
nlp = spacy.load("en_core_web_md")

if not nlp.vocab.vectors.size:
    raise ValueError("The model does not contain any word vectors.")

word1 = nlp("king")
word2 = nlp("prince")

# Compute similarity between two words
similarity = word1.similarity(word2)
print("Similarity:", similarity)


Similarity: 0.7827692113563378


spaCy computes this score as the cosine similarity of the two vectors, so it can in principle range from -1 to 1, with higher values indicating greater similarity. The score of about 0.78 for "king" and "prince" reflects their close semantic relation as royal titles. This illustrates how word embeddings capture not just superficial lexical similarities but deeper semantic relationships, which is useful in tasks like semantic search, text clustering, and even creative language generation.
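
By default, spaCy's `similarity` is the cosine similarity between the two vectors. The sketch below computes that measure in plain Python. The three-dimensional vectors are made-up toy values chosen for illustration; real spaCy vectors have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for zero vectors; 0.0 is a convention here
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, not taken from any real model.
king = [0.8, 0.6, 0.1]
prince = [0.7, 0.7, 0.2]
banana = [-0.1, 0.2, 0.9]

print(cosine_similarity(king, prince) > cosine_similarity(king, banana))  # True
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0
```

The last line shows that vectors pointing in opposite directions give a negative score.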

#### 2. Sentence Segmentation

Like the sentence tokenization seen earlier, sentence segmentation divides a text into its constituent sentences, which is a crucial first step in many NLP applications. In trained pipelines such as the one loaded below, spaCy determines sentence boundaries from the dependency parse by default, which tends to be more robust than splitting on punctuation alone.

In [42]:
nlp = spacy.load("en_core_web_sm")

# Example text with multiple sentences
text = "Hello world! Here is another sentence. And yet another one."

# Process the text
doc = nlp(text)

# Print each sentence in the document
for sentence in doc.sents:
    print(sentence.text)

Hello world!
Here is another sentence.
And yet another one.


#### 3. Custom Rule-Based Entity Recognition

Rule-based entity recognition is similar to NER, but it identifies entities with custom rules instead of a trained model. This can be particularly useful for recognizing specific terms or patterns that are unique to a particular domain, such as product codes in retail or specific jargon in legal or medical texts.
spaCy allows for the addition of rule-based matching capabilities via the `Matcher` and `PhraseMatcher` classes, which can be used to search the text for a series of tokens or phrases based on patterns you define. This feature is highly customizable and can be used in conjunction with the statistical models to enhance the entity recognition capabilities of your NLP pipeline.

In [43]:
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Define the pattern
pattern = [{'IS_DIGIT': True}, {'LOWER': {'IN': ['dollars', 'dollar', 'usd']}}]
matcher.add("MONEY", [pattern])

text = "I need a subscription for 10 dollars."
doc = nlp(text)

# Apply the matcher to the doc
matches = matcher(doc)

for match_id, start, end in matches:
    span = doc[start:end]
    print("Matched span:", span.text)

Matched span: 10 dollars


For the custom rule-based entity recognition, we defined a pattern to identify monetary values, for the input text: "I need a subscription for 10 dollars."

Given this input and the pattern we defined (`{'IS_DIGIT': True}, {'LOWER': {'IN': ['dollars', 'dollar', 'usd']}}`), the matcher identifies "10 dollars" as a match because:
- "10" matches the first part of the pattern (`{'IS_DIGIT': True}`), indicating it’s a digit.
- "dollars" matches the second part of the pattern (`{'LOWER': {'IN': ['dollars', 'dollar', 'usd']}}`), confirming it’s one of the specified keywords.

The matcher reports "10 dollars" as a match for the pattern we named "MONEY". Note that a match by itself does not add the span to `doc.ents`; it only returns the match ID and token offsets, which is what the loop above uses to print the span. This shows how rule-based patterns can identify specific spans of text that fit predefined criteria.
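
The logic of this two-token pattern can be sketched in plain Python. This is a simplified stand-in for the `Matcher`, assuming a naive regex tokenizer instead of spaCy's; `find_money_spans` and `CURRENCY_WORDS` are hypothetical names.

```python
import re

CURRENCY_WORDS = {"dollars", "dollar", "usd"}

def find_money_spans(text):
    # Naive tokenizer: words and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    spans = []
    # Slide over adjacent token pairs, mirroring the two-element pattern.
    for first, second in zip(tokens, tokens[1:]):
        if first.isdigit() and second.lower() in CURRENCY_WORDS:
            spans.append(f"{first} {second}")
    return spans

print(find_money_spans("I need a subscription for 10 dollars."))  # ['10 dollars']
print(find_money_spans("It costs 5 USD or 7 euros."))             # ['5 USD']
```

The second call shows the case-insensitive check on the currency word, which corresponds to the `LOWER` attribute in the spaCy pattern.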


#### 4. Text Classification

Text classification is another task spaCy can support. spaCy includes a trainable text categorizer component (`textcat`), but using it requires training a pipeline. For a quick illustration, the example below instead uses a scikit-learn classifier, a setup in which spaCy can also be used to prepare data and features.
Text classification involves assigning categories or labels to text based on its content. This is commonly used in sentiment analysis, topic categorization, and spam detection. In spaCy, you can use the `Doc` and `Span` objects to extract features and train a classifier using any machine learning library, such as scikit-learn.


In [44]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Sample texts and labels
texts = ["I love this phone", "This movie is terrible", "What a wonderful day"]
labels = [1, 0, 1]  # 1 for positive, 0 for negative

# Bag-of-words feature extractor (note: the spaCy pipeline loaded above is not used for features in this minimal example)
vectorizer = CountVectorizer()

# Create a classification pipeline
classifier = make_pipeline(vectorizer, MultinomialNB())

# Train the classifier
classifier.fit(texts, labels)

# Test the classifier
test_text = "The product was good"
prediction = classifier.predict([test_text])
print("Prediction for '{}': {}".format(test_text, "Positive" if prediction[0] == 1 else "Negative"))

Prediction for 'The product was good': Positive


We used a basic sentiment analysis setup with texts labeled as either positive (1) or negative (0). The classifier was trained on a few sample texts, and we then tested it with the text: "The product was good".

This prediction should be read with caution. None of the words in "The product was good" appear in the three training texts, so the classifier has no evidence about the word "good" and falls back on the class priors: two of the three training examples are positive, so it predicts "Positive". With a realistically sized training set, the same pipeline can be applied to customer feedback, reviews, or any textual data where understanding sentiment or categorizing text is crucial.
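
One way to sanity-check a prediction from such a small model is to look at which test words it actually saw during training. The sketch below does this in plain Python. It assumes a simplified tokenizer (lowercased tokens of two or more word characters) that approximates `CountVectorizer`'s default behaviour; the helper `tokens` is hypothetical.

```python
import re
from collections import Counter

def tokens(text):
    # Lowercase and keep tokens of at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

train_texts = ["I love this phone", "This movie is terrible", "What a wonderful day"]
labels = [1, 0, 1]  # 1 for positive, 0 for negative

# Every word the classifier could have learned something about.
vocabulary = {tok for text in train_texts for tok in tokens(text)}

# Which words of the test sentence are in that vocabulary?
test_tokens = tokens("The product was good")
seen = [tok for tok in test_tokens if tok in vocabulary]

# Class priors: the fraction of training examples per label.
priors = {label: count / len(labels) for label, count in Counter(labels).items()}

print(seen)                   # []
print(priors[1] > priors[0])  # True
```

When no test word is in the vocabulary, a multinomial Naive Bayes model has only the class priors to go on, so the majority class wins regardless of what the sentence says.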

### Conclusion

Both NLTK and spaCy offer comprehensive suites of tools and functionalities for natural language processing, each with its own strengths. While NLTK provides a broad array of algorithms and is highly suitable for academic and educational purposes, spaCy excels in performance and offers industrial-strength capabilities with a focus on efficiency and scalability. Choosing between the two depends largely on the specific requirements of the project, including the complexity of tasks involved and the need for speed and efficiency in processing large volumes of text.