![](./lab%20header%20image.png)

<div style="text-align: center;">
    <h3>Experiment No. 07</h3>
</div>

<img src="./Student%20Information.png" style="width: 100%;" alt="Student Information">

<div style="border: 1px solid #ccc; padding: 8px; background-color: #f0f0f0; text-align: center;">
    <strong>AIM</strong>
</div>

**Study of Python Features in Natural Language Processing (NLP)**

<div style="border: 1px solid #ccc; padding: 8px; background-color: #f0f0f0; text-align: center;">
    <strong>Theory/Procedure/Algorithm</strong>
</div>

Natural Language Processing (NLP) is a branch of artificial intelligence (AI) focused on enabling machines to interpret, understand, and generate human language. NLP tasks can range from simple operations like tokenization and part-of-speech tagging to complex tasks such as machine translation, summarization, and question answering. Python has become the dominant language for NLP due to its rich ecosystem of libraries like `nltk`, `spaCy`, `gensim`, `TextBlob`, and `transformers`.

##### Key Python Features for NLP:
1. **Tokenization**: Breaking down text into smaller units (words, sentences).
2. **Text Cleaning**: Removing stop words, punctuation, and performing lemmatization or stemming.
3. **Vectorization**: Representing text numerically, often using techniques like Bag-of-Words (BoW) or Term Frequency-Inverse Document Frequency (TF-IDF).
4. **Part-of-Speech (POS) Tagging**: Assigning tags like noun, verb, adjective to each token.
5. **Named Entity Recognition (NER)**: Identifying entities like names, places, dates from text.
6. **Sentiment Analysis**: Classifying text based on its sentiment (positive, negative, neutral).
7. **Word Embeddings**: Dense vector representation of words, enabling better semantic understanding.
8. **Language Modeling**: Predicting the next word in a sequence, used in tasks like text generation.

##### Implementation
To demonstrate the above features, a Python implementation was carried out using `nltk`, `spaCy`, `scikit-learn`, `TextBlob`, `gensim`, and `transformers`. Each library was chosen for its strength in a specific NLP task.

**1. Tokenization using NLTK:**

In [1]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "Python is widely used in NLP due to its vast library support."
tokens = word_tokenize(text)
print(tokens)

['Python', 'is', 'widely', 'used', 'in', 'NLP', 'due', 'to', 'its', 'vast', 'library', 'support', '.']




*Explanation*: `word_tokenize` from `nltk.tokenize` splits the sentence into individual word and punctuation tokens. Note that the final period is returned as its own token.
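To make the idea of tokenization concrete, the following is a minimal sketch using only the standard library. It is a simplification and not NLTK's algorithm: the function name `simple_word_tokenize` and the regular expression are assumptions chosen for illustration, and the approach handles contractions, abbreviations, and hyphenated words far less carefully than `word_tokenize`.

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

text = "Python is widely used in NLP due to its vast library support."
tokens = simple_word_tokenize(text)
print(tokens)
```

For this particular sentence the sketch yields the same thirteen tokens as the NLTK cell above; on more complex text the two would likely differ.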

**2. Text Cleaning and Lemmatization with spaCy:**

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp("Running faster is my goal")
cleaned_tokens = [token.lemma_ for token in doc if not token.is_stop]
print(cleaned_tokens)

['run', 'fast', 'goal']


*Explanation*: Lemmatization reduces each word to its base form (e.g., "Running" to "run" and "faster" to "fast"), while the `token.is_stop` filter drops stop words such as "is" and "my".
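The stop-word filtering step can be sketched on its own with the standard library. The stop-word set below is a hypothetical, deliberately tiny list (spaCy ships a much larger one), and the sketch performs no lemmatization, so "running" and "faster" are left unchanged.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"is", "my", "the", "a", "an", "to", "of", "and"}

def remove_stop_words(text):
    """Lowercase the text, strip punctuation, and drop stop words (no lemmatization)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOP_WORDS]

filtered = remove_stop_words("Running faster is my goal")
print(filtered)
```

Comparing this result with the spaCy output shows what lemmatization adds on top of plain filtering.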

**3. Vectorization using TF-IDF:**

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Python is great for NLP", "Machine learning enhances NLP capabilities"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
print(tfidf_matrix.toarray())

[[0.         0.         0.47107781 0.47107781 0.47107781 0.
  0.         0.33517574 0.47107781]
 [0.47107781 0.47107781 0.         0.         0.         0.47107781
  0.47107781 0.33517574 0.        ]]


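*Explanation*: TF-IDF measures the importance of a word in a document relative to the whole corpus, turning text into a numeric matrix with one row per document and one column per vocabulary term (columns follow the alphabetically sorted vocabulary). In the output, "nlp" occurs in both documents and therefore receives a lower weight (0.335) than the terms that occur in only one document (0.471).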
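The numbers in the matrix can be reproduced by hand. The sketch below assumes the settings that scikit-learn documents as the `TfidfVectorizer` defaults: lowercasing, tokens of two or more word characters, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper name `tfidf_row` is an assumption for illustration.

```python
import math
import re
from collections import Counter

documents = ["Python is great for NLP", "Machine learning enhances NLP capabilities"]

# Lowercase and keep tokens of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(documents)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF weights of one document over the vocabulary."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

tfidf = [tfidf_row(tokens) for tokens in tokenized]
print(vocabulary)
for row in tfidf:
    print([round(value, 4) for value in row])
```

A term found in one of the two documents has IDF ln(3/2) + 1 ≈ 1.405, while "nlp" (found in both) has IDF ln(3/3) + 1 = 1. After dividing by the row norm, these become roughly 0.471 and 0.335.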

**4. Named Entity Recognition (NER) using spaCy:**

In [4]:
doc = nlp("Google, based in Mountain View, announced a new feature today.")
for ent in doc.ents:
    print(ent.text, ent.label_)

Google ORG
Mountain View GPE
today DATE


*Explanation*: spaCy extracts the named entities and assigns each a label: "Google" is tagged as an organization (`ORG`), "Mountain View" as a geopolitical entity (`GPE`), and "today" as a date (`DATE`).

**5. Sentiment Analysis using TextBlob:**

In [5]:
from textblob import TextBlob

text = "Python makes NLP simple and intuitive."
blob = TextBlob(text)
print(blob.sentiment)


Sentiment(polarity=0.0, subjectivity=0.35714285714285715)


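*Explanation*: `TextBlob` provides a simple way to perform sentiment analysis. It returns a polarity score from -1.0 (negative) to 1.0 (positive) and a subjectivity score from 0.0 (objective) to 1.0 (subjective). Here the polarity is 0.0, i.e. neutral, even though the sentence reads as mildly positive, which illustrates a limit of TextBlob's default lexicon-based scoring.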
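A lexicon-based scorer can only rate words it knows, which is one plausible reason a sentence can come out with a polarity of 0.0. The sketch below illustrates the general idea with a hypothetical toy lexicon. The words, the scores, and the averaging rule are assumptions for illustration and are not TextBlob's actual data or algorithm.

```python
# Hypothetical toy lexicon: word -> polarity in [-1.0, 1.0]. Not TextBlob's data.
TOY_LEXICON = {"great": 0.8, "good": 0.7, "bad": -0.7, "terrible": -1.0}

def toy_polarity(text):
    """Average the polarity of words found in the lexicon; 0.0 if none match."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    scores = [TOY_LEXICON[word] for word in words if word in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

# No word of this sentence is in the toy lexicon, so the score falls back to 0.0.
neutral_score = toy_polarity("Python makes NLP simple and intuitive.")
# One positive and one negative word partly cancel out.
mixed_score = toy_polarity("The library is great, but the docs are bad.")
print(neutral_score, mixed_score)
```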

**6. Word Embeddings using Gensim’s Word2Vec:**

In [6]:
from gensim.models import Word2Vec

sentences = [["python", "is", "great", "for", "nlp"], ["machine", "learning", "is", "useful"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
vector = model.wv['python']
print(vector)

[ 8.1681199e-03 -4.4430327e-03  8.9854337e-03  8.2536647e-03
 -4.4352221e-03  3.0310510e-04  4.2744912e-03 -3.9263200e-03
 -5.5599655e-03 -6.5123225e-03 -6.7073823e-04 -2.9592158e-04
  4.4630850e-03 -2.4740540e-03 -1.7260908e-04  2.4618758e-03
  4.8675989e-03 -3.0808449e-05 -6.3394094e-03 -9.2608072e-03
  2.6657581e-05  6.6618943e-03  1.4660227e-03 -8.9665223e-03
 -7.9386048e-03  6.5519023e-03 -3.7856805e-03  6.2549924e-03
 -6.6810320e-03  8.4796622e-03 -6.5163244e-03  3.2880199e-03
 -1.0569858e-03 -6.7875278e-03 -3.2875966e-03 -1.1614120e-03
 -5.4709399e-03 -1.2113475e-03 -7.5633135e-03  2.6466595e-03
  9.0701487e-03 -2.3772502e-03 -9.7651005e-04  3.5135616e-03
  8.6650876e-03 -5.9218528e-03 -6.8875779e-03 -2.9329848e-03
  9.1476962e-03  8.6626766e-04 -8.6784009e-03 -1.4469790e-03
  9.4794659e-03 -7.5494875e-03 -5.3580985e-03  9.3165627e-03
 -8.9737261e-03  3.8259076e-03  6.6544057e-04  6.6607012e-03
  8.3127534e-03 -2.8507852e-03 -3.9923131e-03  8.8979173e-03
  2.0896459e-03  6.24894

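*Explanation*: Word2Vec represents each word as a dense vector in a continuous space (100 dimensions here, set by `vector_size=100`; the printed output above is cut off). Trained on a large corpus, such vectors capture semantic relationships between words. With a two-sentence toy corpus like this one, the values remain close to their random initialization and carry little meaning.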
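Semantic relationships between embeddings are usually measured with cosine similarity. The sketch below computes it in plain Python. The three-dimensional vectors are hypothetical values invented for illustration; real Word2Vec vectors would come from a trained model such as the one above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "python": [0.9, 0.1, 0.0],
    "java": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

related = cosine_similarity(toy_vectors["python"], toy_vectors["java"])
unrelated = cosine_similarity(toy_vectors["python"], toy_vectors["banana"])
print(round(related, 3), round(unrelated, 3))
```

Vectors pointing in similar directions score close to 1, while nearly orthogonal vectors score close to 0. With gensim, the same comparison is available through `model.wv.similarity`.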

**7. Language Modeling using Hugging Face Transformers (GPT-2):**

In [7]:
from transformers import pipeline

generator = pipeline('text-generation', model='gpt2')
result = generator("Natural Language Processing with Python", max_length=50, num_return_sequences=1)
print(result)



[{'generated_text': 'Natural Language Processing with Python\n\nPython provides a fairly easy way to program in Python. Python will simply run a list of numbers and return data. If a list of integers is given, in your case, Python will do this to it, or'}]


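*Explanation*: The `pipeline('text-generation', model='gpt2')` call from Hugging Face’s `transformers` library loads a pre-trained GPT-2 model, which extends the prompt one token at a time until the sequence reaches `max_length=50` tokens. The continuation is fluent, but it is not guaranteed to be accurate or meaningful.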
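The core idea of language modeling, predicting the next word from what came before, can be sketched with a simple bigram counter. This is a far simpler technique than GPT-2's transformer architecture; the toy corpus and the function name `predict_next` are assumptions invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for illustration.
corpus = [
    "python is great for nlp",
    "python is easy to learn",
    "nlp is fun",
]

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    followers = bigram_counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("python"))
print(predict_next("unknown"))
```

A bigram model conditions on a single previous word, whereas GPT-2 conditions on the entire preceding context, which is why its output reads far more fluently.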

<div style="border: 1px solid #ccc; padding: 8px; background-color: #f0f0f0; text-align: center;">
    <strong>CONCLUSION</strong>
</div>

This study demonstrates Python’s flexibility and robustness in performing diverse NLP tasks, thanks to its extensive libraries and easy-to-use APIs. Key takeaways include:

- **Text Preprocessing**: Tokenization and text cleaning form the foundation for more advanced NLP tasks, and libraries like `nltk` and `spaCy` provide ready-made tokenizers, stop-word filters, and lemmatizers for them.
- **Feature Extraction**: Techniques like TF-IDF and Word2Vec are vital for converting text data into a format that machine learning models can understand.
- **Sentiment Analysis and NER**: Python libraries make it simple to extract useful information and perform sentiment-based classification.
- **Language Models**: Pre-trained models like GPT-2 allow for advanced language generation, showing the power of transformer-based models.

The study concludes that Python, due to its rich ecosystem of NLP libraries, offers a powerful platform for both academic research and real-world applications in Natural Language Processing.

<div style="border: 1px solid #ccc; padding: 8px; background-color: #f0f0f0; text-align: center;">
    <strong>ASSESSMENT</strong>
</div>

<img src="./marks_distribution.png" style="width: 100%;" alt="marks_distribution">