<a href="https://colab.research.google.com/github/Raju1410/Class_IS532E/blob/main/Session3/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install spacy nltk gensim
!python -m spacy download en_core_web_sm




In [None]:
import spacy
from spacy import displacy
import numpy as np
import nltk
from nltk import ngrams
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer
from collections import Counter

# Download the NLTK resources used below: WordNet (word senses) and Punkt (tokenizer).
# Recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('wordnet')
nltk.download('punkt')

# Load the small English spaCy pipeline
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# Sample text for analysis
text = """SpaCy is an open-source software library for advanced Natural Language Processing."""

## Tokenization

In [None]:
# Tokenization
doc = nlp(text)
for token in doc:
    print(token.text)

# Visualize the dependency parse with displaCy (style="dep" draws arcs between tokens)
displacy.render(doc, style="dep", jupyter=True)

SpaCy
is
an
open
-
source
software
library
for
advanced
Natural
Language
Processing
.


## POS Tagging

In [None]:
# Part-of-Speech Tagging
for token in doc:
    print(f"{token.text}: {token.pos_}")

SpaCy: PROPN
is: AUX
an: DET
open: ADJ
-: PUNCT
source: NOUN
software: NOUN
library: NOUN
for: ADP
advanced: ADJ
Natural: PROPN
Language: PROPN
Processing: PROPN
.: PUNCT


## Named Entity Recognition (NER)

In [None]:
text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

# The nlp pipeline loaded earlier is reused here; reloading the model is unnecessary
doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")
displacy.render(doc, style="ent", jupyter=True)

Sebastian Thrun (PERSON)
Google (ORG)
2007 (DATE)


## Stop Word Removal and Text Normalization

In [None]:
# Stop word removal and text normalization (lowercase and lemmatization)
normalized_tokens = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]

# Join the tokens into a normalized string
normalized_text = " ".join(normalized_tokens)

# Output the normalized text
print(normalized_text)

sebastian thrun start work self drive car google 2007 people outside company take seriously


# Stemming vs. Lemmatization

| Feature              | Stemming                                  | Lemmatization                              |
|----------------------|-------------------------------------------|-------------------------------------------|
| **Definition**       | Reducing words to their root form by removing suffixes or prefixes. | Reducing words to their base or dictionary form (lemma). |
| **Process**          | Uses heuristic processes, often removing endings of words. | Involves a detailed dictionary lookup for the base form. |
| **Output**           | Can produce non-existent words (e.g., "previous" → "previou"). | Produces actual words (e.g., "running" → "run"). |
| **Language**         | Rule sets are language-specific; the Porter stemmer used here targets English. | Also needs per-language resources, but because it considers part of speech and context it tends to handle irregular forms better. |
| **Complexity**       | Faster and simpler; does not require extensive knowledge of the language. | More complex; requires understanding of the grammar and context. |
| **Use Cases**        | Useful in information retrieval, search queries, and where precision is less critical. | Preferred in natural language processing tasks where meaning is important. |
| **Examples**         | - "fishing" → "fish"  <br> - "better" → "better" (left unchanged; the link to "good" is lost) | - "better" → "good" <br> - "geese" → "goose" |

## Explanation
- **Stemming**: Stemming algorithms, like the Porter stemmer, apply rule-based heuristics that strip word endings (the Porter stemmer removes suffixes only). Because no dictionary is consulted, the result is not always a real word. It’s useful for applications where the exact meaning of the word is less important than capturing the overall theme.
  
- **Lemmatization**: Lemmatization uses a dictionary or corpus to look up the base form of the word, considering its part of speech and context. It’s more accurate than stemming but requires more computational resources. Lemmatization is important in applications that require understanding the meaning of the text.
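
The contrast between the two approaches can be sketched in plain Python. This is a toy illustration only: it is neither the Porter algorithm nor spaCy's lemmatizer, and the suffix list `SUFFIXES`, the lookup table `LEMMA_TABLE`, and both function names are made up for this example.

```python
# Toy illustration: hypothetical suffix rules and a hypothetical lookup table.
SUFFIXES = ("ing", "ies", "es", "s", "ed")  # tried in this order
LEMMA_TABLE = {"geese": "goose", "better": "good", "studies": "study", "running": "run"}

def toy_stem(word):
    """Strip the first matching suffix; no dictionary, so results may not be words."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself if unknown."""
    return LEMMA_TABLE.get(word, word)

for w in ["studies", "running", "geese", "better"]:
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The rule-based function returns fragments such as "stud" and "runn" and cannot relate "geese" to "goose", while the lookup handles irregular forms but only for words it already knows.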



In [None]:
# Sample text
text = "The cats are running faster than the dogs. She bettered her previous record."

# Process the text using the model
doc = nlp(text)

# Print the tokens and their lemmas
for token in doc:
    print(f"Token: {token.text}, Lemma: {token.lemma_}")

Token: The, Lemma: the
Token: cats, Lemma: cat
Token: are, Lemma: be
Token: running, Lemma: run
Token: faster, Lemma: fast
Token: than, Lemma: than
Token: the, Lemma: the
Token: dogs, Lemma: dog
Token: ., Lemma: .
Token: She, Lemma: she
Token: bettered, Lemma: better
Token: her, Lemma: her
Token: previous, Lemma: previous
Token: record, Lemma: record
Token: ., Lemma: .


In [None]:
stemmer = PorterStemmer()
# Process the text with spaCy
doc = nlp(text)

# Apply stemming using NLTK's PorterStemmer
stems = [stemmer.stem(token.text) for token in doc]

# Output the stemmed words
print(stems)

['the', 'cat', 'are', 'run', 'faster', 'than', 'the', 'dog', '.', 'she', 'better', 'her', 'previou', 'record', '.']


## Dependency Parsing

In [None]:
# Sample text
text = "The quick brown fox jumps over the lazy dog."

# Process the text
doc = nlp(text)

# Print token, POS tags, and dependency information
for token in doc:
    print(f"Token: {token.text}, POS: {token.pos_}, Detailed POS: {token.tag_}, Dependency: {token.dep_}, Head: {token.head.text}")

Token: The, POS: DET, Detailed POS: DT, Dependency: det, Head: fox
Token: quick, POS: ADJ, Detailed POS: JJ, Dependency: amod, Head: fox
Token: brown, POS: ADJ, Detailed POS: JJ, Dependency: amod, Head: fox
Token: fox, POS: NOUN, Detailed POS: NN, Dependency: nsubj, Head: jumps
Token: jumps, POS: VERB, Detailed POS: VBZ, Dependency: ROOT, Head: jumps
Token: over, POS: ADP, Detailed POS: IN, Dependency: prep, Head: jumps
Token: the, POS: DET, Detailed POS: DT, Dependency: det, Head: dog
Token: lazy, POS: ADJ, Detailed POS: JJ, Dependency: amod, Head: dog
Token: dog, POS: NOUN, Detailed POS: NN, Dependency: pobj, Head: over
Token: ., POS: PUNCT, Detailed POS: ., Dependency: punct, Head: jumps


In [None]:
# Visualize dependency parsing
displacy.render(doc, style="dep", jupyter=True, options={'compact': True})
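
The `Head` column printed above defines a tree: every token points at its head, and the root points at itself. As a rough sketch, the tree can be rebuilt from those printed values without spaCy. The triples below are copied from the output above; keying by token text only works here because every token text in this sentence is distinct (with a real `doc`, `token.i` and `token.children` would be the safer choice). The names `parse`, `children`, and `tree_lines` are illustrative.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the printed parse.
parse = [
    ("The", "det", "fox"), ("quick", "amod", "fox"), ("brown", "amod", "fox"),
    ("fox", "nsubj", "jumps"), ("jumps", "ROOT", "jumps"), ("over", "prep", "jumps"),
    ("the", "det", "dog"), ("lazy", "amod", "dog"), ("dog", "pobj", "over"),
    (".", "punct", "jumps"),
]

# Invert the head pointers: map each head to its list of (child, label) pairs.
children = defaultdict(list)
root = None
for token, dep, head in parse:
    if dep == "ROOT":
        root = token
    else:
        children[head].append((token, dep))

def tree_lines(token, dep="ROOT", depth=0):
    """Return the subtree under `token` as indented text lines."""
    lines = [f"{'  ' * depth}{token} ({dep})"]
    for child, child_dep in children[token]:
        lines.extend(tree_lines(child, child_dep, depth + 1))
    return lines

print("\n".join(tree_lines(root)))
```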

In [None]:
# Sample sentence
sentence = "The bass in the song was too loud."

# Process the sentence using spaCy
doc = nlp(sentence)

# Function to get WordNet synsets (senses) of a word
def get_wordnet_senses(word):
    return wordnet.synsets(word)

# For each token in the doc, print its WordNet senses
for token in doc:
    senses = get_wordnet_senses(token.text)
    if senses:
        print(f"Word: {token.text}, Senses: {[sense.definition() for sense in senses]}")

Word: bass, Senses: ['the lowest part of the musical range', 'the lowest part in polyphonic music', 'an adult male singer with the lowest voice', 'the lean flesh of a saltwater fish of the family Serranidae', 'any of various North American freshwater fish with lean flesh (especially of the genus Micropterus)', 'the lowest adult male singing voice', 'the member with the lowest range of a family of musical instruments', 'nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes', 'having or denoting a low vocal or instrumental range']
Word: in, Senses: ['a unit of length equal to one twelfth of a foot', 'a rare soft silvery metallic element; occurs in small quantities in sphalerite', 'a state in midwestern United States', 'holding office', 'directed or bound inward', 'currently fashionable', 'to or toward the inside of']
Word: song, Senses: ['a short musical composition with words', 'a distinctive or characteristic sound', 'the act of singing', 'the character

# N-grams Explanation

N-grams are contiguous sequences of 'n' items (words or characters) from a given sample of text or speech.
They are commonly used in natural language processing (NLP) for tasks like text analysis, language modeling,
and machine learning.

Types of N-grams:
1. Unigrams (1-grams): Single words (e.g., "I", "love", "coding").
2. Bigrams (2-grams): Pairs of consecutive words (e.g., "I love", "love coding").
3. Trigrams (3-grams): Triplets of consecutive words (e.g., "I love coding").
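
The three types listed above all come from the same sliding window. A minimal sketch in plain Python, assuming whitespace tokenization (the helper name `make_ngrams` is made up for this example):

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # The window starts at each index that still leaves room for n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love coding".split()
for n in (1, 2, 3):
    print(n, make_ngrams(tokens, n))
```

A sequence of k tokens yields k - n + 1 n-grams, so asking for n larger than k returns an empty list.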


In [None]:
# Sample text
text = "This is a sentence"

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Generate bigrams
bigrams = list(ngrams(tokens, 2))

# Generate trigrams
trigrams = list(ngrams(tokens, 3))

print("Bigrams:", bigrams)
print("Trigrams:", trigrams)

Bigrams: [('This', 'is'), ('is', 'a'), ('a', 'sentence')]
Trigrams: [('This', 'is', 'a'), ('is', 'a', 'sentence')]


# Bag of Words Explanation


The Bag of Words (BoW) model is a simple and widely used technique in natural language processing (NLP)
for text representation. It transforms text into a numerical format that can be used in machine learning
algorithms.

Key Concepts:
1. **Tokenization**: The text is split into individual words (tokens).
2. **Vocabulary**: A set of unique words from the entire corpus is created.
3. **Frequency Count**: For each document, a vector is created where each element represents the count
   of a word from the vocabulary.

Characteristics:
- The BoW model disregards grammar and word order; it keeps only which words occur and how often.
- It can be used for various NLP tasks, including text classification, sentiment analysis, and topic modeling.
![Alt text](https://user.oc-static.com/upload/2022/12/08/16705125107088_16034397439042_surfin%20bird%20bow.png)  
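
A quick way to see that word order is discarded: two sentences with opposite meanings but the same words produce identical counts. This is a minimal sketch using lowercase whitespace tokenization; the helper name `bow` is made up for this example.

```python
from collections import Counter

def bow(text):
    """Count lowercase whitespace-separated tokens."""
    return Counter(text.lower().split())

a = bow("the dog bit the man")
b = bow("the man bit the dog")

print(a == b)      # True: same words, same counts, different meaning
print(a["the"])    # 2
```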

In [None]:
documents = [
    "I love natural language processing.",
    "NLP is fun and I love it.",
    "Natural language processing can be amazing."
]

# Preprocess and tokenize the documents
tokenized_docs = []
for doc in documents:
    # Process the text with spaCy
    spacy_doc = nlp(doc.lower())  # Convert to lowercase
    tokens = [token.text for token in spacy_doc if not token.is_punct and not token.is_stop]
    tokenized_docs.append(tokens)

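# Create a vocabulary from the tokenized documents
# (set ordering is arbitrary, so the column order can differ between runs;
# wrapping the set in sorted() would make it stable)
vocabulary = list(set(token for tokens in tokenized_docs for token in tokens))
print("Vocabulary:", vocabulary)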

# Create the Bag of Words representation
def create_bow_representation(tokenized_docs, vocabulary):
    bow_matrix = []
    for tokens in tokenized_docs:
        # Create a frequency count for each document
        frequency_count = Counter(tokens)
        # Create a vector for the document
        vector = [frequency_count.get(word, 0) for word in vocabulary]
        bow_matrix.append(vector)
    return np.array(bow_matrix)

# Generate the Bag of Words representation
bow_matrix = create_bow_representation(tokenized_docs, vocabulary)

print("\nBag of Words Representation:")
print(bow_matrix)

Vocabulary: ['language', 'nlp', 'amazing', 'fun', 'love', 'processing', 'natural']

Bag of Words Representation:
[[1 0 0 0 1 1 1]
 [0 1 0 1 1 0 0]
 [1 0 1 0 0 1 1]]


## Gensim and Word Vectors

# Word2Vec and Cosine Similarity Explanation

### Word2Vec

Word2Vec is a popular technique in natural language processing (NLP) for converting words into dense vector representations that downstream models can compute with. Developed by researchers at Google, it trains a shallow neural network to learn word associations from large text corpora.

**Key Concepts:**
1. **Continuous Bag of Words (CBOW):** Predicts a target word based on its surrounding context words.
2. **Skip-Gram:** Predicts context words given a target word. This approach is useful for capturing relationships between words.
![Word2Vec: Skip-Gram vs. CBOW](https://kavita-ganesan.com/wp-content/uploads/skipgram-vs-cbow-continuous-bag-of-words-word2vec-word-representation-1536x806.png)

**Source:** https://kavita-ganesan.com/wp-content/uploads/skipgram-vs-cbow-continuous-bag-of-words-word2vec-word-representation-1536x806.png

**Output:**
- Each word is represented as a dense vector (typically a few hundred dimensions), and semantically similar words are mapped closer together.

**Applications:**
- Text classification, sentiment analysis, and machine translation.
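
The difference between CBOW and Skip-Gram comes down to how training examples are built from the same context window. The sketch below only generates those examples; it does not train a model. The window size of 2 and the helper name `context_windows` are assumptions for illustration.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = "the quick brown fox jumps".split()

# CBOW: the context words are the input, the centre word is the label.
cbow_examples = [(context, target) for target, context in context_windows(tokens)]

# Skip-Gram: the centre word is the input, each context word is a separate label.
skipgram_examples = [
    (target, c) for target, context in context_windows(tokens) for c in context
]

print(cbow_examples[2])
print(skipgram_examples[:2])
```

CBOW yields one example per position, whereas Skip-Gram yields one example per (centre, context) pair, so the same sentence produces more Skip-Gram examples.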

### Cosine Similarity

Cosine similarity is a metric used to measure how similar two vectors are, regardless of their magnitude. It calculates the cosine of the angle between two non-zero vectors in an inner product space.

**Formula:**
$$
\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \cdot \|B\|}
$$

Where:
- \( A \cdot B \) is the dot product of the vectors.
- \( \|A\| \) and \( \|B\| \) are the magnitudes (lengths) of the vectors.


**Interpretation:**
- Cosine similarity ranges from -1 to 1.
  - 1 means the vectors point in the same direction (maximally similar), whatever their lengths.
  - 0 means the vectors are orthogonal (no similarity).
  - -1 means the vectors are diametrically opposed.
![Word2Vec: Word Embeddings](https://lamarr-institute.org/wp-content/uploads/word2vec-Wort-Embeddings.jpg)

**Source:** https://lamarr-institute.org/wp-content/uploads/word2vec-Wort-Embeddings.jpg
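
The formula and its three reference values can be checked with a few lines of plain Python. This is a minimal sketch for non-zero vectors of equal length; the function name `cosine_similarity` is chosen for this example and is not a library call.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2], [2, 4]))    # same direction, different length
print(cosine_similarity([1, 0], [0, 1]))    # orthogonal
print(cosine_similarity([1, 2], [-1, -2]))  # opposite direction
```

The first pair differs in magnitude but not in direction, so its similarity is 1 up to floating-point rounding.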

**Applications:**
- Information retrieval, clustering, and recommendation systems.

### Summary

- Word2Vec transforms words into vector representations, capturing semantic relationships.
- Cosine similarity quantifies the similarity between these vectors, helping to measure the degree of relatedness between words or documents.


In [None]:
import gensim.downloader as api

# Load pre-trained GloVe vectors (100 dimensions) from gensim-data.
# They are not Word2Vec vectors, but they are queried through the same interface.
word_vectors = api.load("glove-wiki-gigaword-100")
# Check the "most similar words", using the default "cosine similarity" measure.
result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
most_similar_key, similarity = result[5]  # look at the sixth match (index 5)
print(f"{most_similar_key}: {similarity:.4f}")

prince: 0.6517


In [None]:
result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
most_similar_key, similarity = result[0]  # look at the first match
print(f"{most_similar_key}: {similarity:.4f}")

queen: 0.7699


In [None]:
result = word_vectors.most_similar(positive=['france', 'berlin'], negative=['paris'])
most_similar_key, similarity = result[0]  # look at the first match
print(f"{most_similar_key}: {similarity:.4f}")

germany: 0.8924
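
The analogy queries above boil down to vector arithmetic followed by a cosine-similarity ranking. The sketch below shows the idea on hand-made two-dimensional vectors. Everything in it is hypothetical: the `toy_vectors` values, the two "axes" (royalty, maleness), and the `analogy` helper. gensim's own implementation differs in details, for example it normalizes the input vectors before combining them.

```python
import math

# Hypothetical 2-D embeddings: (royalty, maleness). Real vectors are much larger.
toy_vectors = {
    "king":   [0.9, 0.8],
    "queen":  [0.9, -0.8],
    "man":    [0.1, 0.8],
    "woman":  [0.1, -0.8],
    "prince": [0.7, 0.8],
    "apple":  [-0.5, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + v for q, v in zip(query, vectors[word])]
    for word in negative:
        query = [q - v for q, v in zip(query, vectors[word])]
    # Skip the query words themselves so they cannot be returned as answers.
    exclude = set(positive) | set(negative)
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranked = analogy(["woman", "king"], ["man"], toy_vectors)
print(ranked[0][0])
```

With these toy values, `king - man + woman` lands on the vector assigned to "queen", so "queen" ranks first.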
