
## Question 1: What is Computational Linguistics and how does it relate to NLP?
Computational Linguistics is an interdisciplinary field that focuses on modeling, analyzing, and understanding human language using computational methods. It combines linguistics, computer science, artificial intelligence, and mathematics to study language structure and meaning.

Natural Language Processing (NLP) is an applied subfield of computational linguistics that focuses on enabling machines to process, understand, and generate human language. While computational linguistics emphasizes theoretical models of language, NLP focuses on building practical applications such as chatbots, translation systems, sentiment analysis tools, and speech recognition systems.

Thus, computational linguistics provides the theoretical foundation, while NLP applies these theories to real-world problems.

## Question 2: Briefly describe the historical evolution of Natural Language Processing.
The evolution of NLP can be divided into several phases:

1. 1950s–1960s (Rule-Based Era): Early NLP systems relied on hand-crafted linguistic rules. The Georgetown-IBM experiment (1954) demonstrated basic machine translation.
2. 1970s–1980s (Symbolic Systems and Early Statistical Methods): Systems remained largely symbolic, and toward the end of this period researchers began introducing probabilistic models and corpus-based approaches to overcome the limitations of hand-written rules.
3. 1990s (Machine Learning Era): NLP systems began using machine learning techniques such as Hidden Markov Models (HMMs) and Naive Bayes for tasks like POS tagging and speech recognition.
4. 2000s–2010s (Deep Learning Era): Neural networks and word embeddings (Word2Vec in 2013, GloVe in 2014) improved NLP performance significantly.
5. Recent Advances: Transformer-based models like BERT and GPT reshaped NLP by enabling context-aware language understanding and generation.


## Question 3: List and explain three major use cases of NLP in today’s tech industry.
1. Chatbots and Virtual Assistants: NLP enables conversational interfaces such as customer support chatbots and voice assistants to understand and respond to user queries.
2. Sentiment Analysis: Companies analyze customer reviews, social media posts, and feedback to determine public sentiment and improve products or services.
3. Machine Translation: NLP powers translation tools that convert text from one language to another while preserving meaning and context.


## Question 4: What is text normalization and why is it essential in text processing tasks?
Text normalization is the process of converting text into a consistent and standardized format. It includes operations such as converting text to lowercase, removing punctuation, correcting spelling, expanding contractions, and removing stopwords.

Text normalization is essential because raw text data is often noisy and inconsistent. Normalization shrinks the vocabulary a model has to handle, typically improves model accuracy, and ensures that different forms of the same word (for example, "Payment" and "payment") are treated uniformly during processing.
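The operations listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production normalizer: the `CONTRACTIONS` and `STOPWORDS` tables are hypothetical and deliberately tiny, and spelling correction is left out.

```python
import re
import string

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"doesn't": "does not", "it's": "it is", "can't": "cannot"}
STOPWORDS = {"the", "is", "a", "an", "and", "it", "to", "of"}


def normalize(text: str) -> list[str]:
    """Lowercase, expand contractions, strip punctuation, drop stopwords."""
    text = text.lower()
    # Expand contractions before punctuation removal so apostrophes still match.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace every ASCII punctuation character with a space.
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # Split on whitespace and filter out stopwords.
    return [tok for tok in text.split() if tok not in STOPWORDS]


result = normalize("It's GREAT, and the app doesn't crash!")
print(result)
```

Note that "not" is kept on purpose in this sketch: dropping negations as stopwords would flip the meaning of a sentence in later steps such as sentiment analysis.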

## Question 5: Compare and contrast stemming and lemmatization with suitable examples.
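| Feature | Stemming | Lemmatization |
|---|---|---|
| Definition | Removes word endings | Converts words to base dictionary form |
| Accuracy | Less accurate | More accurate |
| Linguistic knowledge | Not required | Required |
| Speed | Faster | Slower |
| Example | "studies" → "studi" (Porter stemmer) | "studies" → "study" |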
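To make the contrast concrete, here is a toy sketch in plain Python. It is an assumption-laden illustration: `toy_stem` is a simplified suffix stripper (not the Porter algorithm), and `LEMMA_DICT` is a hypothetical four-entry dictionary standing in for a real lexicon such as WordNet.

```python
# Toy suffix-stripping stemmer: a simplified stand-in, NOT the Porter algorithm.
SUFFIXES = ("ing", "ies", "ed", "es", "s")

# Hypothetical mini-dictionary standing in for a real lexicon such as WordNet.
LEMMA_DICT = {"studies": "study", "running": "run", "better": "good", "mice": "mouse"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix; no dictionary, so results may be non-words."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)


for w in ["studies", "running", "better", "mice"]:
    print(f"{w:10} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The stemmer produces fragments such as "stud" and "runn" and leaves irregular forms like "better" and "mice" untouched, while the dictionary lookup returns real words. That is the accuracy-versus-speed trade-off summarized in the table.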


## Question 6: Regex program to extract email addresses

```python
import re

text = """Hello team, please contact us at support@xyz.com for technical issues,
or reach out to our HR at hr@xyz.com. You can also connect with John at
john.doe@xyz.org and jenny via jenny_clarke126@mail.co.us.
For partnership inquiries, email partners@xyz.biz."""

# Local part: letters, digits, and . _ % + -
# Domain: dot-separated labels ending in a top-level domain of at least two letters
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

emails = re.findall(email_pattern, text)
print(emails)
```

```
['support@xyz.com', 'hr@xyz.com', 'john.doe@xyz.org', 'jenny_clarke126@mail.co.us', 'partners@xyz.biz']
```


## Question 7: Tokenization and Frequency Distribution using NLTK

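```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Tokenizer models required by word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')  # Needed by newer NLTK releases

text = """Natural Language Processing (NLP) is a fascinating field that combines
linguistics, computer science, and artificial intelligence."""

# Lowercase first so that "Natural" and "natural" count as the same token
tokens = word_tokenize(text.lower())
freq_dist = FreqDist(tokens)

# Printing a FreqDist shows only a summary (distinct samples and total outcomes)
print(freq_dist)
```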

```
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.

<FreqDist with 20 samples and 21 outcomes>
```
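The summary above reports 20 distinct tokens out of 21 in total, but not which token repeats. As a rough cross-check, the same counts can be reproduced with the standard library alone. The regular-expression tokenizer here is a simplifying assumption and not NLTK's algorithm, although it happens to split this particular sentence into the same number of tokens.

```python
import re
from collections import Counter

text = """Natural Language Processing (NLP) is a fascinating field that combines
linguistics, computer science, and artificial intelligence."""

# Simplified tokenizer (an assumption, not NLTK's algorithm):
# a token is either a run of word characters or a single punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", text.lower())

counts = Counter(tokens)

print(len(counts), "distinct tokens,", sum(counts.values()), "tokens in total")
# The single most frequent token and its count
print(counts.most_common(1))
```

The only repeated token is the comma, which is why punctuation and stopwords are usually filtered out before a frequency distribution is interpreted.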


## Question 9: Train Word2Vec using Gensim

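```python
# Notebook shell command; outside Jupyter/Colab, run "pip install gensim" in a terminal
!pip install gensim
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

dataset = [
    "Natural language processing enables computers to understand human language",
    "Word embeddings are a type of word representation",
    "Word2Vec is a popular word embedding technique",
    "Text preprocessing is a critical step",
    "Tokenization and normalization help clean text"
]

# Word2Vec expects a list of token lists, one per sentence
tokenized_data = [word_tokenize(sentence.lower()) for sentence in dataset]

# vector_size: embedding dimensions; window: context size; min_count=1 keeps every word
model = Word2Vec(tokenized_data, vector_size=50, window=5, min_count=1, workers=2)

# Look up the learned 50-dimensional vector for "language"
print(model.wv['language'])
```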

```
[ 0.00018913  0.00615464 -0.01362529 -0.00275093  0.01533716  0.01469282
 -0.00734659  0.0052854  -0.01663426  0.01241097 -0.00927464 -0.00632821
  0.01862271  0.00174677  0.01498141 -0.01214813  0.01032101  0.01984565
 -0.01691478 -0.01027138 -0.01412967 -0.0097253  -0.00755713 -0.0170724
  0.01591121 -0.00968788  0.01684723  0.01052514 -0.01310005  0.00791574
  0.0109403  -0.01485307 -0.01481144 -0.00495046 -0.01725145 -0.00316314
 -0.00080687  0.00659937  0.00288376 -0.00176284 -0.01118812  0.00346073
 -0.00179474  0.01358738  0.00794718  0.00905894  0.00286861 -0.00539971
```
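A raw vector like this is rarely inspected directly. Embeddings are usually compared with cosine similarity, which is also what Gensim's `model.wv.similarity` reports for a pair of words. The sketch below shows the computation in plain Python. The three-dimensional vectors are made up for illustration, and with only five training sentences the similarities from the model above should not be expected to be meaningful.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors; the embeddings trained above have 50 dimensions.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: nothing in common
```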


## Question 10: NLP pipeline for analyzing fintech customer feedback
1.	Data collection
2.	Text cleaning & normalization
3.	Tokenization
4.	Stopword removal
5.	Lemmatization
6.	Sentiment analysis
7.	Topic modeling
8.	Insight extraction & visualization
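Steps 2 to 4, plus a very simple form of insight extraction, could be chained as in the sketch below. It uses only the standard library. The sample `feedback` strings and the `STOPWORDS` set are hypothetical, and lemmatization, sentiment analysis, and topic modeling are left out because they need external libraries.

```python
import re
from collections import Counter

# Hypothetical stopword list and sample feedback, for illustration only.
STOPWORDS = {"the", "is", "a", "and", "to", "my", "was", "but", "very"}
feedback = [
    "The app is fast, but the KYC verification failed!",
    "KYC verification was slow and support is very slow to respond.",
]


def clean(text: str) -> str:
    """Step 2: lowercase, then replace anything that is not a-z, 0-9, or whitespace."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def tokenize(text: str) -> list[str]:
    """Step 3: split on whitespace."""
    return text.split()


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Step 4: drop high-frequency function words."""
    return [t for t in tokens if t not in STOPWORDS]


def run_pipeline(docs: list[str]) -> Counter:
    """Chain steps 2-4 and count the remaining terms across all documents."""
    counts: Counter = Counter()
    for doc in docs:
        counts.update(remove_stopwords(tokenize(clean(doc))))
    return counts


term_counts = run_pipeline(feedback)
print(term_counts.most_common(3))
```

Keeping each step as its own small function makes it straightforward to swap in a real tokenizer or add a lemmatization step later without touching the rest of the chain.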

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Lexicon required by the VADER sentiment analyzer
nltk.download('vader_lexicon')

reviews = ["The app is excellent and easy to use", "Customer support is very poor"]
sia = SentimentIntensityAnalyzer()

# polarity_scores returns neg/neu/pos proportions and a compound score in [-1, 1]
for review in reviews:
    print(review, sia.polarity_scores(review))
```

```
The app is excellent and easy to use {'neg': 0.0, 'neu': 0.476, 'pos': 0.524, 'compound': 0.765}
Customer support is very poor {'neg': 0.373, 'neu': 0.33, 'pos': 0.297, 'compound': -0.1761}

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
```
