Question 1: What is Computational Linguistics and how does it relate to NLP?


In [None]:
'''
Computational Linguistics is the field that studies human language with computational methods, aiming both to understand how language works and to let computers process it meaningfully.
It focuses on building models and rules that explain how sentences are formed, how meaning is created, and how language is structured.
Natural Language Processing (NLP) is the applied, engineering side of these ideas. It uses those models to build real-world tools such as chatbots,
translators, voice assistants, and sentiment analysis systems.
'''

Question 2: Briefly describe the historical evolution of Natural Language Processing.


In [None]:
'''
Natural Language Processing (NLP) has evolved step by step with advances in computing and AI. In the 1950s–60s,
it started with rule-based systems, where language was processed using hand-written grammar rules. In the 1980s–90s,
the focus shifted to statistical and machine learning methods, using large datasets to learn language patterns.
From the 2010s onwards, deep learning transformed NLP, first through word embeddings and recurrent networks and later through transformer models, which now power translators, chatbots, and voice assistants.
'''

Question 3: List and explain three major use cases of NLP in today’s tech industry.


In [None]:
'''
Three use cases are:
1. Language Translation: NLP enables automatic translation between languages, as in Google Translate, helping people communicate globally.
2. Chatbots and Voice Assistants: NLP helps machines understand user queries and respond in natural language, as seen in customer support bots, Alexa, and Siri.
3. Sentiment Analysis: Companies use NLP to analyze customer reviews, social media posts, and feedback to understand public opinion and improve services.
'''


Question 4: What is text normalization and why is it essential in text processing tasks?


In [None]:
'''
Text normalization is the process of cleaning and standardizing text so that it becomes easier for computers to understand and process.
It includes steps like converting text to lowercase, removing punctuation, correcting spelling, expanding contractions, and handling special characters.
It is essential because real-world text is often messy and inconsistent. Normalization helps reduce noise, improves accuracy,
and ensures better performance of NLP models in tasks like sentiment analysis, search, and text classification.
'''
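
The normalization steps described above can be sketched with the standard library alone. This is a minimal illustration, not a complete normalizer: the CONTRACTIONS table and the normalize function are hypothetical names introduced here, the table covers only a few contractions, and spelling correction is left out.

```python
import re
import string

# Hypothetical, deliberately tiny contraction table (an assumption for this sketch).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "i'm": "i am"}

def normalize(text):
    """Lowercase, expand a few contractions, strip punctuation, collapse spaces."""
    text = text.lower()
    text = text.replace("\u2019", "'")  # treat curly apostrophes like straight ones
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove ASCII punctuation, then squeeze runs of whitespace into one space.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("  It's GREAT!!  I can't   wait, really. "))
```

Contractions are expanded before punctuation is stripped; otherwise the apostrophe would already be gone and "can't" would turn into "cant".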

Question 5: Compare and contrast stemming and lemmatization with suitable
examples.

In [None]:
'''
Stemming and lemmatization are both text processing techniques used in NLP to reduce words to a base form, but they work in different ways.
Stemming is a simpler and faster approach that strips affixes (mostly suffixes) using fixed rules, without considering grammar or meaning. Because of this,
the result may not be a real word: for example, the Porter stemmer turns “running” into “run” but “studies” into “studi”.
Lemmatization is slower but more precise because it uses a vocabulary and the word's part of speech. It reduces words to
their dictionary form, known as the lemma, such as “running” becoming “run” and “better” (as an adjective) becoming “good”.
'''
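
To make the contrast concrete, here is a toy sketch that needs no external data. It is not the Porter algorithm and not a WordNet lemmatizer: SUFFIXES, LEMMAS, toy_stem, and toy_lemmatize are hypothetical names invented for this illustration. The stemmer blindly strips a suffix, while the lemmatizer looks up a (word, part of speech) pair in a small hand-written table.

```python
# Toy illustration only: not the Porter algorithm and not WordNet.
SUFFIXES = ("ing", "es", "ed", "s")  # checked in this order

# Hypothetical hand-written lemma table keyed by (word, part of speech).
LEMMAS = {
    ("running", "verb"): "run",
    ("studies", "noun"): "study",
    ("studies", "verb"): "study",
    ("better", "adj"): "good",
}

def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word, pos):
    """Look up the dictionary form; fall back to the lowercased word."""
    return LEMMAS.get((word.lower(), pos), word.lower())

for word, pos in [("running", "verb"), ("studies", "noun"), ("better", "adj")]:
    print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemmatize(word, pos)}")
```

The toy stemmer yields "runn" and "studi", and leaves "better" unchanged, because it only looks at word endings. A real stemmer such as Porter adds extra rules (for example, undoing the doubled consonant in "runn"), but it still has no way to relate "better" to "good".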

Question 6: Write a Python program that uses regular expressions (regex) to extract all
email addresses from the following block of text:
“Hello team, please contact us at support@xyz.com for technical issues, or reach out to
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz.”

In [1]:
import re

text = """Hello team, please contact us at support@xyz.com for technical issues, or reach out to
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz."""

emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)

print(emails)


['support@xyz.com', 'hr@xyz.com', 'john.doe@xyz.org', 'jenny_clarke126@mail.co.us', 'partners@xyz.biz']


Question 7: Given the sample paragraph below, perform string tokenization and
frequency distribution using Python and NLTK:
“Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.”

In [3]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Download tokenizer data (only needed the first time)
nltk.download('punkt')
nltk.download('punkt_tab')  # required by word_tokenize in newer NLTK releases

text = """Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical."""

# Tokenization
tokens = word_tokenize(text)

# Frequency Distribution
freq_dist = FreqDist(tokens)

print("Tokens:", tokens)
print("\nWord Frequency:")
print(freq_dist.most_common(10))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.', 'It', 'enables', 'machines', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', '.', 'Applications', 'of', 'NLP', 'include', 'chatbots', ',', 'sentiment', 'analysis', ',', 'and', 'machine', 'translation', '.', 'As', 'technology', 'advances', ',', 'the', 'role', 'of', 'NLP', 'in', 'modern', 'solutions', 'is', 'becoming', 'increasingly', 'critical', '.']

Word Frequency:
[(',', 7), ('.', 4), ('NLP', 3), ('and', 3), ('is', 2), ('of', 2), ('Natural', 1), ('Language', 1), ('Processing', 1), ('(', 1)]
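
In the output above, punctuation tops the counts, and "Language" and "language" are counted separately because the tokens are case-sensitive. The following is a minimal sketch of one way to get word-only, case-insensitive counts. It assumes a simple regex tokenizer from the standard library in place of NLTK's word_tokenize, so words containing digits or hyphens would be split or dropped.

```python
import re
from collections import Counter

text = """Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical."""

# Keep alphabetic runs only and lowercase them, so punctuation is dropped
# and "Language"/"language" collapse into a single entry.
words = re.findall(r"[a-z]+", text.lower())
counts = Counter(words)

print(counts.most_common(5))
```

Removing stop words such as "and", "is", and "of" would be a natural next step if only content words matter.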


Question 8: Create a custom annotator using spaCy or NLTK that identifies and labels
proper nouns in a given text.


In [4]:
import spacy

nlp = spacy.load("en_core_web_sm")

def proper_noun_annotator(text):
    doc = nlp(text)
    for token in doc:
        if token.pos_ == "PROPN":
            print(f"{token.text} → Proper Noun")

text = "John works at Google in New York and studies NLP at Stanford University."
proper_noun_annotator(text)


John → Proper Noun
Google → Proper Noun
New → Proper Noun
York → Proper Noun
NLP → Proper Noun
Stanford → Proper Noun
University → Proper Noun
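
The output above labels "New" and "York" as separate proper nouns. One possible refinement is to merge adjacent proper-noun tokens into a single span. The sketch below works on plain (text, tag) pairs so it runs without spaCy. The PROPN tags come from the output above, while the tags given to the other tokens are assumptions made for illustration, and group_proper_nouns is a hypothetical helper name.

```python
from itertools import groupby

# (token, coarse POS tag) pairs for the example sentence. PROPN tags match the
# output above; the remaining tags are assumed for illustration.
tagged = [
    ("John", "PROPN"), ("works", "VERB"), ("at", "ADP"), ("Google", "PROPN"),
    ("in", "ADP"), ("New", "PROPN"), ("York", "PROPN"), ("and", "CCONJ"),
    ("studies", "VERB"), ("NLP", "PROPN"), ("at", "ADP"),
    ("Stanford", "PROPN"), ("University", "PROPN"), (".", "PUNCT"),
]

def group_proper_nouns(tagged_tokens):
    """Join each run of consecutive PROPN tokens into one string."""
    spans = []
    for is_propn, run in groupby(tagged_tokens, key=lambda pair: pair[1] == "PROPN"):
        if is_propn:
            spans.append(" ".join(token for token, _ in run))
    return spans

spans = group_proper_nouns(tagged)
print(spans)
```

With spaCy, the same pairs could be built as [(token.text, token.pos_) for token in nlp(text)] and passed to the helper.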


Question 9: Using Gensim, demonstrate how to train a simple Word2Vec model on the
following dataset consisting of example sentences:
dataset = [
 "Natural language processing enables computers to understand human language",
 "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
 "Word2Vec is a popular word embedding technique used in many NLP applications",
 "Text preprocessing is a critical step before training word embeddings",
 "Tokenization and normalization help clean raw text for modeling"
]
Write code that tokenizes the dataset, preprocesses it, and trains a Word2Vec model using
Gensim.

In [6]:
!pip install gensim
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')

# Dataset
dataset = [
 "Natural language processing enables computers to understand human language",
 "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
 "Word2Vec is a popular word embedding technique used in many NLP applications",
 "Text preprocessing is a critical step before training word embeddings",
 "Tokenization and normalization help clean raw text for modeling"
]

# Step 1: Tokenization and preprocessing
tokenized_data = [word_tokenize(sentence.lower()) for sentence in dataset]

# Step 2: Train Word2Vec model
model = Word2Vec(sentences=tokenized_data, vector_size=100, window=5, min_count=1, workers=4)

# Step 3: Test the model
print(model.wv["language"])
print(model.wv.most_similar("word"))

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0
[-9.5806085e-03  8.9441882e-03  4.1653137e-03  9.2335343e-03
  6.6461298e-03  2.9213182e-03  9.8062111e-03 -4.4229976e-03
 -6.7968965e-03  4.2171725e-03  3.7335777e-03 -5.6669810e-03
  9.6989106e-03 -3.5659580e-03  9.5487935e-03  8.3945523e-04
 -6.3411104e-03 -1.9765138e-03 -7.3686293e-03 -2.9793803e-03
  1.0386854e-03  9.4879130e-03  9.3503986e-03 -6.6033388e-03
  3.4822454e-03  2.2797708e-03 -2.4912679e-03 -9.2314119e-03
  1.0282570e-03 -8.1718396e-03  6.3123670e-03 -5.7999776e-03
  5.5352398e-03  9.8330248e-03 -1.6240444e-04  4.5257602e-03
 -1.8113367e-03  7.3653422e-03  3.

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Question 10: Imagine you are a data scientist at a fintech startup. You’ve been tasked
with analyzing customer feedback. Outline the steps you would take to clean, process,
and extract useful insights using NLP techniques from thousands of customer reviews.

In [None]:
'''
As a data scientist in a fintech startup, I would start by collecting and cleaning the customer reviews: removing noise such as special characters,
URLs, emojis, and duplicate entries, and converting the text to lowercase.
Next, I would perform text preprocessing such as tokenization, stop-word removal, lemmatization, and handling spelling variations
to make the data consistent and machine-readable.
After that, I would apply feature extraction techniques like TF-IDF or word embeddings to convert text into numerical form.
Then, I would use NLP models for tasks such as sentiment analysis to understand customer opinions, topic modeling to identify common issues or requests,
and keyword extraction to spot trends.
Finally, I would visualize and interpret the results to provide actionable insights to product, support, and business teams,
helping them improve customer experience and decision-making.
'''
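
The first few steps of this plan (cleaning, de-duplication, tokenization, stop-word removal, and TF-IDF weighting) can be sketched with the standard library. Everything here is an assumption made for illustration: the reviews are invented samples, STOP_WORDS is a tiny hypothetical list, and TF-IDF is computed by hand with a plain log(N / df) weight rather than with a library vectorizer.

```python
import math
import re
from collections import Counter

# Made-up sample reviews, for illustration only.
reviews = [
    "The app keeps crashing when I transfer money!! See https://example.com",
    "The app keeps crashing when I transfer money!! See https://example.com",  # duplicate
    "Great app, fast transfer and low fees :)",
    "Customer support was slow and the fees are too high",
]

# Tiny hypothetical stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "when", "i", "see", "and", "was", "are", "too", "a"}

def clean(text):
    text = re.sub(r"https?://\S+", " ", text.lower())  # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)              # drop punctuation, digits, emoji
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    return [tok for tok in text.split() if tok not in STOP_WORDS]

# Clean, drop exact duplicates while keeping order, then tokenize.
cleaned = list(dict.fromkeys(clean(review) for review in reviews))
docs = [tokenize(text) for text in cleaned]

# Document frequency: the number of reviews in which each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """Weight each term by its share of the review times log(N / df)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(len(docs) / doc_freq[term])
        for term, count in counts.items()
    }

for doc in docs:
    weights = tf_idf(doc)
    print(sorted(weights, key=weights.get, reverse=True)[:3])
```

Terms that occur in only one review, such as "crashing", score higher than terms shared across reviews, such as "app", which is what makes TF-IDF useful for surfacing the issue specific to each review. Sentiment analysis and topic modeling would build on these token lists or vectors.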
