## NLP PROJECTS IMPLEMENTATION

### Introduction to Natural Language Processing (NLP)
Natural Language Processing (NLP) is a field of artificial intelligence (AI) that enables computers to understand, interpret, and generate human language. It covers a wide range of computational techniques for processing spoken and written language, so that computers can communicate effectively with humans.

NLP bridges the gap between human communication and computer understanding. It plays a crucial role in various applications such as:

* Machine Translation (e.g., Google Translate)
* Speech Recognition (e.g., Siri, Alexa)
* Text Summarization
* Sentiment Analysis
* Chatbots and Virtual Assistants
* Spam Detection
* Information Retrieval (e.g., search engines)


**Key components of NLP include:**

* Tokenization – breaking text into words or phrases
* Part-of-Speech Tagging – identifying nouns, verbs, adjectives, etc.
* Named Entity Recognition (NER) – extracting names of people, places and organizations
* Parsing – analyzing grammatical structure
* Semantic Analysis – understanding meaning and context

With the advancement of deep learning and large language models, modern NLP systems have become more accurate and powerful, enabling real-time translation, content generation and more human-like conversations. NLP is now an integral part of technologies we use every day and continues to evolve rapidly.

### Five Significant NLP Applications

**(i) Chatbots and Virtual Assistants (e.g., Siri, Google Assistant)**
NLP allows chatbots and virtual assistants to understand user queries and provide relevant responses. This enhances customer support, automates tasks, and improves overall user experience.

**(ii) Sentiment Analysis (e.g., Social Media Monitoring, Customer Feedback Analysis)**
Businesses use NLP to analyze customer feedback and social media posts to understand public sentiment. This helps companies improve their services, respond to customer concerns, and manage their reputation.

**(iii) Machine Translation (e.g., Google Translate, DeepL)**
NLP enables the translation of text from one language to another, breaking language barriers and making global communication more accessible. This is crucial for international business, education, and cross-cultural exchange.

**(iv) Information Extraction (e.g., Medical Reports, Legal Documents)**
NLP can automatically extract important information from unstructured text, such as identifying key entities, dates, and relationships in medical or legal documents. This saves time, reduces human error, and aids in faster decision-making.

**(v) Text Summarization (e.g., News Aggregators, Research Papers)**
NLP is used to generate concise summaries of long texts, making it easier to digest large volumes of information. Applications range from summarizing news articles to condensing academic papers or legal documents.

### Five Challenges of NLP

**(i) Noisy Data**
NLP models rely on large amounts of text data, but real-world data is often messy. It may contain typos, slang, abbreviations, grammatical errors or inconsistencies. For example, social media posts often have informal language and emojis, making it harder for NLP systems to process them accurately.

**(ii) Context Dependency**
Understanding language requires considering the context in which words or phrases appear. The same words can have different meanings depending on prior conversation, cultural background, or even tone. For example, in "I was at the bank when the rain started", the word "bank" could mean a financial institution or a riverbank; the sentence alone does not settle it, so the reader must rely on the surrounding context or the speaker's intent.
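
One classical way to use context is to compare the words around an ambiguous term with words typical of each sense, a much-simplified cousin of the Lesk algorithm. The sketch below is only an illustration: `SENSE_SIGNATURES` and `disambiguate` are hypothetical names, and the signature words are hand-picked rather than taken from a real lexical resource.

```python
import re

# Hypothetical "signature" words for two senses of "bank" (hand-picked for illustration).
SENSE_SIGNATURES = {
    "financial institution": {"money", "deposit", "loan", "account", "cash", "teller"},
    "riverbank": {"river", "water", "fishing", "shore", "boat", "mud"},
}

def disambiguate(sentence, signatures=SENSE_SIGNATURES):
    """Return the sense whose signature overlaps most with the sentence, or None if nothing overlaps."""
    context = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(context & words) for sense, words in signatures.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(disambiguate("I went to the bank to deposit money"))      # financial institution
print(disambiguate("We sat on the bank of the river fishing"))  # riverbank
print(disambiguate("I was at the bank when the rain started"))  # None
```

The last call returns `None` because no word in that sentence overlaps with either signature set, so this simple approach cannot choose a sense without more context.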

**(iii) Ambiguity and Polysemy**
Many words have multiple meanings (polysemy), and sentences can often be interpreted in more than one way (ambiguity). For example, "She saw the man with the telescope" could mean she used a telescope to see the man, or the man had a telescope. Resolving such ambiguities is a major challenge in NLP.

**(iv) Sarcasm and Irony**
Detecting sarcasm and irony requires an understanding that goes beyond literal meanings. Statements like “Oh great, another Monday” can appear positive on the surface but are often meant sarcastically. NLP systems often struggle to detect these nuances, especially without vocal tone or facial cues.
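
To see why surface cues mislead, consider a naive lexicon-based sentiment scorer. This is a minimal sketch under stated assumptions: the `POSITIVE` and `NEGATIVE` word sets and the `lexicon_sentiment` function are hypothetical and far smaller than any real sentiment lexicon.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"great", "good", "love", "amazing", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def lexicon_sentiment(text):
    """Label text by counting positive and negative words, ignoring tone entirely."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("Oh great, another Monday"))  # positive
print(lexicon_sentiment("I hate Mondays"))            # negative
```

The sarcastic sentence is labeled positive because the scorer only sees the word "great"; nothing in a word-count approach captures the speaker's intent.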

**(v) Low-Resource Languages**
While popular languages like English have vast amounts of training data, many languages lack sufficient labeled datasets and linguistic resources. This creates a performance gap in NLP applications across different languages, limiting accessibility and inclusiveness in global AI solutions.

## Hands-On Projects

### Keyword Extraction with Regular Expressions

In [7]:
# Import library
import re

# Sentence to process
sentence = "Contact us at support@company.com or sales@business.org. For more, email info@service.net."

# Extract emails
emails = re.findall(r'\b[\w.-]+@[\w.-]+\.\w+\b', sentence)

# Print all emails
print('Emails Found:', emails)

Emails Found: ['support@company.com', 'sales@business.org', 'info@service.net']


In [8]:
# Sentence to process
sentence = "NLP is amazing for cleaning processing text while learning new techniques."

# Find words ending with 'ing'
ing_words = re.findall(r"\b\w+ing\b", sentence)
ing_words

['amazing', 'cleaning', 'processing', 'learning']

### Text Cleaning (Preprocessing)

In [10]:
# Text to process
text = "NLP makes AI smarter! But, sometimes, it's challenging... Don't you agree?"

# Remove all punctuation
no_punctuation_text = re.sub(r'[^A-Za-z\s]', '', text)

# Print result
no_punctuation_text

'NLP makes AI smarter But sometimes its challenging Dont you agree'

In [11]:
# Convert to lowercase
no_punctuation_text.lower()

'nlp makes ai smarter but sometimes its challenging dont you agree'

In [12]:
# Split words
split_words = re.split(r'\s+', no_punctuation_text)
split_words

['NLP',
 'makes',
 'AI',
 'smarter',
 'But',
 'sometimes',
 'its',
 'challenging',
 'Dont',
 'you',
 'agree']

In [13]:
# Text to process
text = "OMG!! NLP is soooo coool 🤩...!!! It costs $1000. Learn it now at https://3mtt.com 😎."


# Text cleaning (preprocessing)
text = text.lower()  # Lowercase
text = re.sub(r'at https?://\S+', '', text)  # Remove "at" together with the URL that follows it
text = re.sub(r'\$(\d+)', r'\1', text)  # Drop the dollar sign but keep the captured digits ($1000 → 1000)
text = re.sub(r'([a-zA-Z])\1{2,}', r'\1', text)  # Collapse runs of 3+ identical letters to one (soooo → so, coool → col)
text = re.sub(r'[^a-z0-9\s]', '', text)  # Remove punctuation, symbols and emojis
text = re.sub(r'\bcol\b', 'cool', text)  # Repair the over-collapsed word: col → cool
text = re.sub(r'\s+', ' ', text).strip()  # Normalize whitespace

# Print result
cleaned_text = text
print(cleaned_text)

omg nlp is so cool it costs 1000 learn it now
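
The same steps can be wrapped in a reusable function. The following is a sketch, not a general-purpose cleaner: `clean_text` and the `CORRECTIONS` table are hypothetical names, and the table only covers the one over-collapsed word seen in this example.

```python
import re

# Hypothetical lookup for words damaged by collapsing repeated letters (coool -> col).
CORRECTIONS = {"col": "cool"}

def clean_text(text, corrections=CORRECTIONS):
    """Lowercase, strip URLs, currency signs, elongations, punctuation and emojis."""
    text = text.lower()
    text = re.sub(r'(?:\bat\s+)?https?://\S+', '', text)  # URL, plus a leading "at" if present
    text = re.sub(r'\$(\d+)', r'\1', text)                # $1000 -> 1000
    text = re.sub(r'([a-z])\1{2,}', r'\1', text)          # soooo -> so
    text = re.sub(r'[^a-z0-9\s]', '', text)               # punctuation, symbols, emojis
    tokens = [corrections.get(tok, tok) for tok in text.split()]
    return ' '.join(tokens)                               # split/join also normalizes whitespace

raw = "OMG!! NLP is soooo coool 🤩...!!! It costs $1000. Learn it now at https://3mtt.com 😎."
print(clean_text(raw))  # omg nlp is so cool it costs 1000 learn it now
```

Collapsing every run of three or more letters to a single letter is lossy (a word such as "goooood" becomes "god"), so a lookup table like this only patches known cases; a spelling-correction step would be a more robust choice.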


### Tokenization (Word-Level & Sentence-Level)

In [15]:
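# Import libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# The tokenizer data must be downloaded once, e.g. nltk.download('punkt')
# (recent NLTK releases ask for 'punkt_tab' instead)

# Text to process
text = "Tokenization is the first step in NLP. It splits text into smaller pieces for analysis."

# Apply word-level and sentence-level tokenization
words = word_tokenize(text)
sentences = sent_tokenize(text)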

# Print result
print("\nWord-Level Tokenization:")
print(words)
print("\nSentence-Level Tokenization:")
print(sentences)


Word-Level Tokenization:
['Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP', '.', 'It', 'splits', 'text', 'into', 'smaller', 'pieces', 'for', 'analysis', '.']

Sentence-Level Tokenization:
['Tokenization is the first step in NLP.', 'It splits text into smaller pieces for analysis.']
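
For intuition about what the tokenizers are doing, here is a rough regex-only approximation. It is a sketch with hypothetical function names (`simple_word_tokenize`, `simple_sent_tokenize`) and does not reproduce NLTK's actual rules for abbreviations or contractions.

```python
import re

def simple_word_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Tokenization is the first step in NLP. It splits text into smaller pieces for analysis."
print(simple_word_tokenize(text)[:8])
# ['Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP', '.']
print(simple_sent_tokenize(text))
# ['Tokenization is the first step in NLP.', 'It splits text into smaller pieces for analysis.']

# Limitation: abbreviations are treated as sentence endings.
print(simple_sent_tokenize("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```

On this simple input the regex version agrees with the output above, but the last call shows why trained or rule-rich tokenizers are preferred for real text.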


### Stemming and Lemmatization

* Porter Stemmer for stemming.
* spaCy for lemmatization.

In [17]:
# Import library
from nltk.stem import PorterStemmer

# Words to process
words = ["running", "files", "studies", "easily", "better"]

# Initialize stemmer
stemmer = PorterStemmer()

# Apply stemming
stemmed_words = [stemmer.stem(word) for word in words]

# Print result
print("Porter Stemmer:", stemmed_words)

Porter Stemmer: ['run', 'file', 'studi', 'easili', 'better']


In [18]:
# Import library
import spacy

# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

# Words to process
words = ["running", "files", "studies", "easily", "better"]

# Apply lemmatization with spaCy
doc = nlp(" ".join(words))
lemmatized_words = [token.lemma_ for token in doc]

# Print result
print("spaCy Lemmatizer:", lemmatized_words)

spaCy Lemmatizer: ['run', 'file', 'study', 'easily', 'well']
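
The two outputs differ because a stemmer applies suffix-stripping rules without consulting a dictionary, while a lemmatizer maps each word to a dictionary form. The toy function below imitates only a few early Porter-style rules to show where stems such as "studi" and "easili" come from; `toy_stem` is a hypothetical, heavily simplified sketch and not the Porter algorithm itself.

```python
VOWELS = set("aeiou")

def has_vowel(stem):
    """True if the stem contains at least one vowel (simplified: 'y' is not counted)."""
    return any(ch in VOWELS for ch in stem)

def toy_stem(word):
    """Apply a handful of Porter-like suffix rules; a real stemmer has many more steps."""
    # Plural handling: "ies" -> "i", otherwise drop a final "s" (but keep "ss")
    if word.endswith("ies"):
        word = word[:-2]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    # Strip "ing" when the remaining stem has a vowel, then undouble a final consonant
    if word.endswith("ing") and has_vowel(word[:-3]):
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "lsz":
            word = word[:-1]
    # A final "y" becomes "i" when the stem before it has a vowel
    if word.endswith("y") and has_vowel(word[:-1]):
        word = word[:-1] + "i"
    return word

words = ["running", "files", "studies", "easily", "better"]
print([toy_stem(w) for w in words])
# ['run', 'file', 'studi', 'easili', 'better']
```

Because the rules only look at letters, "better" is left untouched, whereas the spaCy lemmatizer output above maps it to "well".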


### One-Hot Encoding

In [20]:
# Vocabulary (plain Python lists suffice here, so no import is needed)
vocab = ["apple", "banana", "cherry", "mango", "Orange"]

# Create a dictionary mapping each word to its one-hot vector
one_hot_vectors = {}

for i, word in enumerate(vocab):
    # Create a zero vector of length equal to the vocabulary size
    vector = [0] * len(vocab)
    # Set the position for the current word to 1
    vector[i] = 1
    # Store in the dictionary
    one_hot_vectors[word] = vector

# Print the one-hot encoded vectors
print("One-hot encoded vectors:")
for word, vector in one_hot_vectors.items():
    print(f"{word}: {vector}")

One-hot encoded vectors:
apple: [1, 0, 0, 0, 0]
banana: [0, 1, 0, 0, 0]
cherry: [0, 0, 1, 0, 0]
mango: [0, 0, 0, 1, 0]
Orange: [0, 0, 0, 0, 1]
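
A property worth noticing is that any two distinct one-hot vectors have a dot product of zero, so the encoding carries no notion of similarity between words. The sketch below illustrates this with two hypothetical helpers, `one_hot` and `dot`.

```python
def one_hot(word, vocab):
    """Return the one-hot vector for a word; raises ValueError if the word is not in vocab."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

vocab = ["apple", "banana", "cherry", "mango", "Orange"]
apple = one_hot("apple", vocab)
banana = one_hot("banana", vocab)

print(dot(apple, apple))   # 1
print(dot(apple, banana))  # 0
```

Every pair of different words scores 0, whether or not the words are related, which is one motivation for the dense embeddings shown later. Note also that the lookup is case-sensitive: "orange" would not be found in this vocabulary because the entry is "Orange".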


### Bag of Words Using CountVectorizer & TF-IDF Using TfidfVectorizer

In [22]:
# Import library
from sklearn.feature_extraction.text import CountVectorizer

# Dataset
sentences = ["The quick brown fox jumbs over the lazy dog.",
             "The dog sleeps in the kernel."]

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the data
bow_matrix = vectorizer.fit_transform(sentences)

# Convert to array and display
print("Vocabulary:", vectorizer.get_feature_names_out())
print("Bag of Words Matrix:\n", bow_matrix.toarray())

Vocabulary: ['brown' 'dog' 'fox' 'in' 'jumbs' 'kernel' 'lazy' 'over' 'quick' 'sleeps'
 'the']
Bag of Words Matrix:
 [[1 1 1 0 1 0 1 1 1 0 2]
 [0 1 0 1 0 1 0 0 0 1 2]]
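
To make the matrix less of a black box, the counts can be reproduced with the standard library. This sketch assumes `CountVectorizer`'s default settings (lowercasing and tokens of two or more word characters); `tokenize` and `bag_of_words` are hypothetical helper names.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep tokens of 2+ word characters, mirroring CountVectorizer's defaults."""
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(docs):
    """Return a sorted vocabulary and one row of term counts per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

sentences = ["The quick brown fox jumbs over the lazy dog.",
             "The dog sleeps in the kernel."]
vocab, matrix = bag_of_words(sentences)
print(vocab)
for row in matrix:
    print(row)
# [1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 2]
# [0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 2]
```

Each column corresponds to one vocabulary word in alphabetical order, and each cell counts how often that word occurs in the sentence; word order is discarded.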


In [23]:
# Import library
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the data
tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)

# Convert to array and display
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())

Vocabulary: ['brown' 'dog' 'fox' 'in' 'jumbs' 'kernel' 'lazy' 'over' 'quick' 'sleeps'
 'the']
TF-IDF Matrix:
 [[0.342369   0.24359836 0.342369   0.         0.342369   0.
  0.342369   0.342369   0.342369   0.         0.48719673]
 [0.         0.30253071 0.         0.42519636 0.         0.42519636
  0.         0.         0.         0.42519636 0.60506143]]
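
The TF-IDF values can be derived by hand from the count matrix. The sketch below assumes `TfidfVectorizer`'s default settings, namely a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row; the function name `tfidf` is hypothetical.

```python
import math

# Count matrix from the Bag of Words output (columns in alphabetical vocabulary order)
counts = [[1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 2],
          [0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 2]]

def tfidf(counts):
    """Compute smoothed-idf, L2-normalized TF-IDF rows from raw term counts."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents each term appears
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    rows = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted))
        rows.append([x / norm if norm else 0.0 for x in weighted])
    return rows

for row in tfidf(counts):
    print([round(x, 6) for x in row])
```

Words that occur in both sentences ("dog", "the") get the minimum idf of 1, while words unique to one sentence get ln(3/2) + 1 ≈ 1.405, which is why "brown" (0.342) outweighs "dog" (0.244) in the first row even though both occur once.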


### Word2Vec Model Using Gensim

* **Retrieve embedding.**

In [25]:
# Import library
from gensim.models import Word2Vec

# Dataset created: "The elephant trumpets. The snake hisses. The fox yips"
sentences = [["the", "elephant", "trumpets"],
             ["the", "snake", "hisses"],
             ["the", "fox", "yips"]]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, workers=1)

# Retrieve the embedding for the word "snake"
snake_vector = model.wv["snake"]

# Print the embedding
print("Embedding for 'snake':")
print(snake_vector)

Embedding for 'snake':
[ 0.00018913  0.00615464 -0.01362529 -0.00275093  0.01533716  0.01469282
 -0.00734659  0.0052854  -0.01663426  0.01241097 -0.00927464 -0.00632821
  0.01862271  0.00174677  0.01498141 -0.01214813  0.01032101  0.01984565
 -0.01691478 -0.01027138 -0.01412967 -0.0097253  -0.00755713 -0.0170724
  0.01591121 -0.00968788  0.01684723  0.01052514 -0.01310005  0.00791574
  0.0109403  -0.01485307 -0.01481144 -0.00495046 -0.01725145 -0.00316314
 -0.00080687  0.00659937  0.00288376 -0.00176284 -0.01118812  0.00346073
 -0.00179474  0.01358738  0.00794718  0.00905894  0.00286861 -0.00539971
 -0.00873363 -0.00206415]
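
The `window=2` argument controls how many neighbors on each side of a word count as its context. The sketch below lists the (target, context) pairs a full window of 2 would yield for this dataset. It is an illustration only: `context_pairs` is a hypothetical helper, and Gensim's actual training differs in its details (for example, the default architecture is CBOW, and the effective window may be sampled to be smaller than `window`).

```python
def context_pairs(sentences, window=2):
    """List (target, context) pairs for every word and its neighbors within the window."""
    pairs = []
    for sentence in sentences:
        for i, target in enumerate(sentence):
            start = max(0, i - window)
            end = min(len(sentence), i + window + 1)
            for j in range(start, end):
                if j != i:
                    pairs.append((target, sentence[j]))
    return pairs

sentences = [["the", "elephant", "trumpets"],
             ["the", "snake", "hisses"],
             ["the", "fox", "yips"]]

pairs = context_pairs(sentences, window=2)
print(len(pairs))                                      # 18
print([pair for pair in pairs if pair[0] == "snake"])  # [('snake', 'the'), ('snake', 'hisses')]
```

With only nine tokens there is very little signal to learn from, so the small values in the embedding above probably still reflect the random initialization more than any meaning; a much larger corpus is needed for useful vectors.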


### Pretrained GloVe Model (glove-wiki-gigaword-50) Using Gensim

* **Retrieving embedding and similar words.**

In [27]:
# Import library
import gensim.downloader as api

# Load the pretrained GloVe model (50-dimensional vectors)
glove_model = api.load("glove-wiki-gigaword-50")

# Retrieve the embedding for the word "incredible"
print("Word Embedding for 'Incredible':\n", glove_model['incredible'])

# Find the 5 most similar words to "incredible"
print("Words Similar to 'Incredible':\n", glove_model.most_similar('incredible', topn=5))

Word Embedding for 'Incredible':
 [-0.14689   0.54232  -0.011375  0.29162   0.7991   -0.061087  0.44189
  0.1891    0.54886   1.1932   -0.26771   0.064008 -0.69519  -0.51787
  0.67175  -0.70509   0.51642   0.54056  -0.7944   -0.80188  -0.48548
  0.79217  -0.21582  -1.0416    1.2546   -0.27579  -1.4243   -0.06234
  1.2305    0.066436  1.7358    0.88321   0.49214  -0.34407  -0.28822
  0.43435  -0.24437   0.28734   0.020281 -0.56087  -0.1894   -0.26222
 -0.47613   0.14826  -0.42779   0.10252   0.22617   0.14601  -0.050739
  0.36255 ]
Words Similar to 'Incredible':
 [('amazing', 0.9189565181732178), ('astonishing', 0.8662074208259583), ('awesome', 0.8470885157585144), ('unbelievable', 0.8440187573432922), ('tremendous', 0.8422953486442566)]
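
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows that computation on made-up 3-dimensional vectors; the `toy_vectors` values are hypothetical and are not GloVe values, and `cosine_similarity` and `rank_similar` are illustrative helper names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (NOT real GloVe vectors)
toy_vectors = {
    "incredible": [0.9, 0.8, 0.1],
    "amazing": [0.8, 0.9, 0.2],
    "table": [-0.5, 0.1, 0.9],
}

def rank_similar(word, vectors, topn=5):
    """Rank all other words by cosine similarity to the given word."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

for other, score in rank_similar("incredible", toy_vectors):
    print(other, round(score, 3))
```

Vectors pointing in nearly the same direction score close to 1, which matches the pattern in the output above where "amazing" ranks first for "incredible".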
