# Natural Language Processing (NLP) Assignment
This assignment guides you through the basic concepts of Natural Language Processing, including:
- Text preprocessing
- Tokenization and N-grams
- Named Entity Recognition (NER)
- Converting text into numbers (vectorization)
- Word embeddings (for experienced learners)

You can run and modify the code cells below to complete the tasks.

In [5]:
# Import required libraries
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import spacy
from nltk import ngrams
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.util import bigrams
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from wordcloud import WordCloud

# Download the NLTK resources used in this notebook
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('stopwords')

# Load spaCy's small English pipeline
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 1. Text Preprocessing
Clean the following text by converting it to lowercase, removing punctuation and stop words.

In [18]:
import string

# Sample text
text = "Natural Language Processing is a fascinating field. It combines linguistics and computer science!"

# Preprocess the text: lowercase, tokenize, then drop punctuation and stop words
def preprocess(text):
    # NLTK's English stop-word list, as a set for fast membership tests
    stop_words = set(stopwords.words('english'))
    # Lowercase first, then tokenize with spaCy
    doc = nlp(text.lower())
    # Keep tokens that are neither punctuation nor stop words.
    # token.is_punct also catches multi-character punctuation such as "...",
    # which a membership test against string.punctuation would miss.
    cleaned_tokens = [
        token.text for token in doc
        if not token.is_punct and token.text not in stop_words
    ]
    return cleaned_tokens

# Print cleaned tokens
cleaned_tokens = preprocess(text)
print(cleaned_tokens)


['natural', 'language', 'processing', 'fascinating', 'field', 'combines', 'linguistics', 'computer', 'science']
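
For comparison, the same three steps can be sketched with only the standard library. This is a minimal illustration, not the spaCy/NLTK pipeline used above. `TINY_STOP_WORDS` is a hypothetical hand-picked set that covers only this sample, and splitting on whitespace is a much cruder tokenizer than spaCy's.

```python
import string

# Hypothetical, hand-picked stop words covering only this sample sentence.
# NLTK's English list is far longer.
TINY_STOP_WORDS = {"is", "a", "it", "and"}

def simple_preprocess(text, stop_words=TINY_STOP_WORDS):
    # Step 1: lowercase
    lowered = text.lower()
    # Step 2: delete every ASCII punctuation character
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Step 3: split on whitespace and drop stop words
    return [tok for tok in no_punct.split() if tok not in stop_words]

sample = "Natural Language Processing is a fascinating field. It combines linguistics and computer science!"
simple_tokens = simple_preprocess(sample)
print(simple_tokens)
```

On this sentence the sketch yields the same nine tokens as `preprocess`, but it would diverge on text with contractions, hyphenated words, or stop words outside the tiny list.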


## 2. Tokenization and N-grams
Generate bigrams (2-grams) from the cleaned tokens.

In [19]:
# Generate bigrams from cleaned tokens.
# Store them as bigram_list so the imported nltk.util.bigrams function is not shadowed.
bigram_list = list(ngrams(cleaned_tokens, 2))
print("Bigrams:", bigram_list)

Bigrams: [('natural', 'language'), ('language', 'processing'), ('processing', 'fascinating'), ('fascinating', 'field'), ('field', 'combines'), ('combines', 'linguistics'), ('linguistics', 'computer'), ('computer', 'science')]
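
The bigrams above come from sliding a window of size 2 across the token list. A minimal pure-Python sketch of that idea follows. `sliding_ngrams` is a hypothetical helper written for illustration, not part of NLTK.

```python
def sliding_ngrams(tokens, n):
    # Zip n staggered views of the list: tokens[0:], tokens[1:], ..., tokens[n-1:].
    # zip stops at the shortest view, so k tokens yield k - n + 1 n-grams
    # (and an empty list when n is larger than k).
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['natural', 'language', 'processing', 'fascinating', 'field']
bigram_pairs = sliding_ngrams(tokens, 2)
trigram_triples = sliding_ngrams(tokens, 3)
print(bigram_pairs)
print(trigram_triples)
```

Five tokens give four bigrams and three trigrams, which matches the k - n + 1 count in the comment.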


In [8]:
# One more example that combines preprocessing (Section 1) and bigrams (Section 2).
import string

# Sample text
text = "Natural language processing is exciting"
# Tokenize
doc = nlp(text.lower())
tokens = [token.text for token in doc]
# Remove punctuation
tokens = [t for t in tokens if t not in string.punctuation]
# Remove stop words
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]

# Generate bigrams with nltk.util.bigrams
bi = list(bigrams(tokens))

# Print results
print("Tokens:", tokens)
print("Bigrams:", bi)

Tokens: ['natural', 'language', 'processing', 'exciting']
Bigrams: [('natural', 'language'), ('language', 'processing'), ('processing', 'exciting')]


## 3. Named Entity Recognition (NER)
Use spaCy to perform NER on a new sentence.

In [20]:
# Example sentence
sentence = "Barack Obama was born in Hawaii and was elected president in 2008."
doc = nlp(sentence)
for ent in doc.ents:
    print(ent.text, ent.label_)

Barack Obama PERSON
Hawaii GPE
2008 DATE
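
A common next step is to group the recognized entities by label. The sketch below does this with the standard library only. The `entities` list is copied from the output above rather than recomputed, and `LABEL_GLOSS` is a hypothetical lookup holding short paraphrased descriptions of just these three labels.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above
entities = [("Barack Obama", "PERSON"), ("Hawaii", "GPE"), ("2008", "DATE")]

# Paraphrased glosses for the three labels seen here (not an exhaustive list)
LABEL_GLOSS = {
    "PERSON": "people, including fictional ones",
    "GPE": "countries, cities, states",
    "DATE": "absolute or relative dates or periods",
}

# Collect entity texts under their label
by_label = defaultdict(list)
for ent_text, label in entities:
    by_label[label].append(ent_text)

for label, texts in by_label.items():
    print(f"{label} ({LABEL_GLOSS[label]}): {texts}")
```

In a live notebook the pairs would come from `[(ent.text, ent.label_) for ent in doc.ents]` instead of a hard-coded list.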


## 4. Converting Text to Numbers
Use CountVectorizer and TfidfVectorizer to convert a list of sentences into numeric vectors.

In [21]:
sentences = [
    "I love machine learning.",
    "Natural language processing is a part of AI.",
    "AI is the future."
]

# CountVectorizer
count_vec = CountVectorizer()
X_count = count_vec.fit_transform(sentences)
print("Count Vectorizer Output:\n", X_count.toarray())

# TfidfVectorizer
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(sentences)
print("\nTF-IDF Vectorizer Output:\n", X_tfidf.toarray())

Count Vectorizer Output:
 [[0 0 0 0 1 1 1 0 0 0 0 0]
 [1 0 1 1 0 0 0 1 1 1 1 0]
 [1 1 1 0 0 0 0 0 0 0 0 1]]

TF-IDF Vectorizer Output:
 [[0.         0.         0.         0.         0.57735027 0.57735027
  0.57735027 0.         0.         0.         0.         0.        ]
 [0.30650422 0.         0.30650422 0.40301621 0.         0.
  0.         0.40301621 0.40301621 0.40301621 0.40301621 0.        ]
 [0.42804604 0.5628291  0.42804604 0.         0.         0.
  0.         0.         0.         0.         0.         0.5628291 ]]
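
To see where these numbers come from, the matrices can be rebuilt by hand. The sketch below assumes scikit-learn's default settings as I understand them: lowercasing, tokens of two or more word characters, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 row normalization. It is an illustration of the arithmetic, not the library's implementation.

```python
import math
import re

sentences = [
    "I love machine learning.",
    "Natural language processing is a part of AI.",
    "AI is the future.",
]

def tokenize(text):
    # Lowercase and keep runs of 2+ word characters, which drops "I" and "a"
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(s) for s in sentences]
vocab = sorted({tok for doc in docs for tok in doc})  # alphabetical column order

# Raw term counts: one row per sentence, one column per vocabulary word
counts = [[doc.count(term) for term in vocab] for doc in docs]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# Weight counts by IDF, then scale each row to unit Euclidean length
tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

print(vocab)
print(counts)
print([round(v, 8) for v in tfidf[2]])
```

The vocabulary has 12 entries, which explains the 12 columns above. Words that occur in two sentences ("ai", "is") get a lower IDF and therefore a smaller weight than words that occur in only one.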


## 5. Word Embeddings (Advanced)
Use spaCy to get word vectors (embeddings) for given words.

In [22]:
# Note: en_core_web_sm ships without static word vectors. With the small model,
# token.vector falls back to context-sensitive tensors rather than pretrained embeddings.
# For real word vectors, uncomment the lines below to install and load the medium model.
# !python -m spacy download en_core_web_md
# nlp = spacy.load("en_core_web_md")

# Example word vector
word = nlp("machine")[0]
print("Vector for 'machine':\n", word.vector)

Vector for 'machine':
 [-1.1848618  -0.5884644  -0.431729    0.04726774  0.15745789 -0.4003174
  0.92419213  0.57095623 -0.12864795 -0.3782031   0.0434885  -1.0776316
 -0.5907476   0.8977387  -0.07443497  1.2179598  -0.47437477 -1.4934833
  0.9644136   0.89380246  0.11737019  0.41316557  0.21915573 -0.18460639
  0.13433756  1.2257113  -0.25236145  0.42400342  0.12917608 -0.01702237
 -0.4671869  -1.0976603   1.0032707   0.79694384 -0.08251538 -0.6763874
  0.6893894   0.03594261  1.398281   -0.6806841  -0.7026392   0.0651902
 -0.34503898  0.6320825  -0.88809305 -0.24504882 -0.66866446  0.517808
  0.02162796 -0.05917531  0.07430831  0.66209084  1.070316   -0.9029243
  0.754373   -0.04681993  0.7206097  -0.3353402  -0.21590832 -0.5775555
 -0.4421476   0.71672314 -0.6513661  -0.47828794  1.7274861   0.9889608
 -0.34796178 -0.4340755   0.4084397   0.21592781  1.1011105   0.2514349
  1.7605747  -1.0425378  -0.5678966  -0.0391078  -0.12766747 -0.21116531
  0.43004888 -0.64141405 -0.39330667 -1

In [23]:
def get_word_vectors(words):
    vectors = {}
    for word in words:
        # Process the word and get the first token's vector
        doc = nlp(word)
        vectors[word] = doc[0].vector
    return vectors

# Example usage with words
words = ["machine", "learning", "science"]
word_vectors = get_word_vectors(words)

# Print vectors
for word, vector in word_vectors.items():
    print(f"Vector for '{word}':")
    print(vector)
    print(f"Vector dimension: {vector.shape}\n")

Vector for 'machine':
[-1.1848618  -0.5884644  -0.431729    0.04726774  0.15745789 -0.4003174
  0.92419213  0.57095623 -0.12864795 -0.3782031   0.0434885  -1.0776316
 -0.5907476   0.8977387  -0.07443497  1.2179598  -0.47437477 -1.4934833
  0.9644136   0.89380246  0.11737019  0.41316557  0.21915573 -0.18460639
  0.13433756  1.2257113  -0.25236145  0.42400342  0.12917608 -0.01702237
 -0.4671869  -1.0976603   1.0032707   0.79694384 -0.08251538 -0.6763874
  0.6893894   0.03594261  1.398281   -0.6806841  -0.7026392   0.0651902
 -0.34503898  0.6320825  -0.88809305 -0.24504882 -0.66866446  0.517808
  0.02162796 -0.05917531  0.07430831  0.66209084  1.070316   -0.9029243
  0.754373   -0.04681993  0.7206097  -0.3353402  -0.21590832 -0.5775555
 -0.4421476   0.71672314 -0.6513661  -0.47828794  1.7274861   0.9889608
 -0.34796178 -0.4340755   0.4084397   0.21592781  1.1011105   0.2514349
  1.7605747  -1.0425378  -0.5678966  -0.0391078  -0.12766747 -0.21116531
  0.43004888 -0.64141405 -0.39330667 -1.
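
Vectors like these are usually compared with cosine similarity, which measures the angle between two vectors and ignores their length. The sketch below uses only the standard library. The three short vectors are made up for illustration and are not real embeddings, and `cosine_similarity` is a hypothetical helper, not a spaCy function.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, for illustration only
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a, twice the length
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

sim_parallel = cosine_similarity(vec_a, vec_b)
sim_orthogonal = cosine_similarity(vec_a, vec_c)
print(sim_parallel)
print(sim_orthogonal)
```

Vectors pointing the same way score close to 1 and orthogonal vectors score 0. With real embeddings, the same function could be applied to the arrays returned by `get_word_vectors`.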