**Module 7: Natural Language Processing**

**Exercise-1**

**Title:** Cleaning and Preparing Text Data for NLP Tasks

**Problem Statement:**
The goal of this task is to clean and preprocess a given text dataset to make it suitable for natural language processing (NLP) tasks. This includes removing punctuation and numbers, converting text to lowercase, tokenizing the text, and removing stopwords. Proper preprocessing of text data is crucial for improving the performance of NLP models.

**Steps to be Followed:**

1.	Install and Import Necessary Libraries:

    a.	Install the NLTK library.

    b.	Import necessary modules from NLTK and other libraries.

2.	Download NLTK Resources:

    a.	Download the stopwords and punkt resources from NLTK.

3.	Define and Clean the Text Data:

    a.	Create a sample text data string.

    b.	Define a function to clean the text by removing punctuation and numbers and converting the text to lowercase.

4.	Tokenize the Cleaned Text:

    a.	Define a function to tokenize the cleaned text using NLTK's word_tokenize method.

5.	Remove Stopwords:

    a.	Define a function to remove stopwords from the tokenized text.

6.	Display the Results:

    a.	Print the original text, cleaned text, tokenized text, and text after stopword removal.


In [None]:
# Step 1: Install and Import Necessary Libraries
!pip install nltk
import nltk
from nltk.corpus import stopwords
import re

# Step 2: Download Required NLTK Resources
nltk.download('stopwords')
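nltk.download('punkt')
# On newer NLTK releases, word_tokenize looks for 'punkt_tab' instead.
# If tokenization raises a LookupError, run:
# nltk.download('punkt_tab')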

# Sample text data
text_data = "This is a sample text! It includes punctuation, numbers like 123, and stopwords such as the, is, and a."

# Step 3: Define a Function to Clean Text
def clean_text(text):
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Convert text to lowercase
    text = text.lower()
    return text

# Step 4: Define a Function to Tokenize Text
def tokenize_text(text):
    tokens = nltk.word_tokenize(text)
    return tokens

# Step 5: Define a Function to Remove Stopwords
def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return filtered_tokens

# Clean the text
cleaned_text = clean_text(text_data)
# Tokenize the cleaned text
tokens = tokenize_text(cleaned_text)
# Remove stopwords from the tokenized text
filtered_tokens = remove_stopwords(tokens)

# Step 6: Print the Results
print("Original Text:", text_data)
print("Cleaned Text:", cleaned_text)
print("Tokenized Text:", tokens)
print("Text after Stopword Removal:", filtered_tokens)




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Original Text: This is a sample text! It includes punctuation, numbers like 123, and stopwords such as the, is, and a.
Cleaned Text: this is a sample text it includes punctuation numbers like  and stopwords such as the is and a
Tokenized Text: ['this', 'is', 'a', 'sample', 'text', 'it', 'includes', 'punctuation', 'numbers', 'like', 'and', 'stopwords', 'such', 'as', 'the', 'is', 'and', 'a']
Text after Stopword Removal: ['sample', 'text', 'includes', 'punctuation', 'numbers', 'like', 'stopwords']


**Explanation of the Code:**

1.	Install and Import Necessary Libraries:

    a.	The nltk library is installed and necessary modules like stopwords and re (for regular expressions) are imported.

2.	Download Required NLTK Resources:

    a.	Stopwords and punkt resources are downloaded using nltk.download to ensure that stopword removal and tokenization can be performed.

3.	Clean Text:

    a.	The clean_text function removes punctuation using a regular expression that matches any character that is not a word character or whitespace.

    b.	Numbers are removed using another regular expression that matches digits.

    c.	The text is converted to lowercase to ensure uniformity.

4.	Tokenize Text:

    a.	The tokenize_text function uses nltk.word_tokenize to split the text into individual words (tokens).

5.	Remove Stopwords:

    a.	The remove_stopwords function filters out common English stopwords from the tokenized text using a list comprehension and the set of stopwords provided by NLTK.

6.	Print the Results:

    a.	The original text, cleaned text, tokenized text, and the text after stopword removal are printed to verify the results of each preprocessing step.
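
The cleaned text in the output above contains a double space where `123` was removed ("like  and"). A minimal sketch of a variant that also collapses leftover whitespace follows; the name `clean_text_normalized` and the sample sentence are hypothetical additions, not part of the exercise. It also shows that `\w` matches underscores, so they survive the punctuation step.

```python
import re

def clean_text_normalized(text):
    """Variant of clean_text that also collapses leftover whitespace."""
    text = re.sub(r'[^\w\s]', '', text)  # drop punctuation (\w keeps underscores)
    text = re.sub(r'\d+', '', text)      # drop runs of digits
    text = re.sub(r'\s+', ' ', text)     # collapse whitespace left behind by the removals
    return text.strip().lower()

sample = "Numbers like 123, and under_scores stay!"
print(clean_text_normalized(sample))
```

The tokenizer ignores repeated spaces, so this only matters when the cleaned string itself is stored or displayed.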


**Exercise-2**

**Title:** Sentiment Analysis Using Naive Bayes Classifier

**Problem Statement:**
The goal is to build a sentiment analysis model that classifies movie reviews as either positive or negative. Using the NLTK library, we will train a Naive Bayes classifier on the movie reviews dataset and evaluate its performance. Additionally, the model will be tested on a custom review to predict its sentiment.

**Steps to be Followed:**

1.	Install and Import Necessary Libraries:

    a.	Import modules from nltk.

    b.	Import additional necessary functions like shuffle.

2.	Download Required NLTK Resources:

    a.	Download the movie reviews dataset and stopwords.

3.	Load and Prepare Data:

    a.	Load positive and negative movie reviews.

    b.	Combine and shuffle the dataset.

4.	Extract Features:

    a.	Extract the most frequent words from the dataset to use as features.

    b.	Define a function to extract features from each review.

5.	Create Training and Testing Datasets:

    a.	Split the dataset into training and testing sets.

6.	Train the Naive Bayes Classifier:

    a.	Train the classifier using the training set.

7.	Evaluate the Classifier:

    a.	Evaluate the classifier on the test set and print the accuracy.

8.	Test on Custom Review:

    a.	Test the classifier on a custom review and print the predicted sentiment.
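
Step 4a calls for the most frequent words. NLTK's `FreqDist` is a subclass of `collections.Counter`, so the difference between slicing the keys and calling `most_common` can be sketched with the standard library alone. The toy word list below is made up for illustration.

```python
from collections import Counter

words = ["plot", "film", "the", "film", "the", "the"]
counts = Counter(words)

# Slicing the keys follows insertion order, not frequency
first_seen = list(counts)[:2]

# most_common() sorts by count, highest first
most_frequent = [word for word, _ in counts.most_common(2)]

print("First seen:   ", first_seen)
print("Most frequent:", most_frequent)
```

On a real corpus the two lists can differ substantially, which changes the features the classifier sees.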


In [None]:
# Step 1: Install and Import Necessary Libraries
import nltk
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from random import shuffle

# Step 2: Download Required NLTK Resources
nltk.download('movie_reviews')
nltk.download('stopwords')
nltk.download('punkt')  # needed by word_tokenize in Step 8

# Step 3: Load and Prepare Data
# Load positive and negative movie reviews
positive_reviews = [(movie_reviews.words(file_id), 'positive') for file_id in movie_reviews.fileids('pos')]
negative_reviews = [(movie_reviews.words(file_id), 'negative') for file_id in movie_reviews.fileids('neg')]

# Combine positive and negative reviews and shuffle the dataset
all_reviews = positive_reviews + negative_reviews
shuffle(all_reviews)

# Step 4: Extract Features
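# Extract the 2,000 most frequent words as features
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# most_common() sorts by frequency; list(all_words)[:2000] would only
# take the first 2,000 distinct words encountered in the corpus
word_features = [word for word, _ in all_words.most_common(2000)]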

# Define a function to extract features from the review
def extract_features(review):
    words = set(review)
    features = {}
    for word in word_features:
        features[word] = (word in words)
    return features

# Step 5: Create Training and Testing Datasets
featuresets = [(extract_features(review), sentiment) for (review, sentiment) in all_reviews]
train_set, test_set = featuresets[:1500], featuresets[1500:]

# Step 6: Train the Naive Bayes Classifier
classifier = NaiveBayesClassifier.train(train_set)

# Step 7: Evaluate the Classifier
accuracy = nltk_accuracy(classifier, test_set)
print(f'Accuracy: {accuracy * 100:.2f}%')

# Step 8: Test on Custom Review
custom_review = "This movie was fantastic! I loved every moment of it."
custom_tokens = word_tokenize(custom_review.lower())
custom_features = extract_features(custom_tokens)
sentiment = classifier.classify(custom_features)
print(f'Sentiment of the custom review: {sentiment}')


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Accuracy: 80.40%
Sentiment of the custom review: negative


**Explanation of the Code:**
1.	Install and Import Necessary Libraries:

    a.	The necessary modules from the nltk library are imported along with shuffle from the random module.

2.	Download Required NLTK Resources:

    a.	The movie reviews dataset and stopwords are downloaded to ensure they are available for use.

3.	Load and Prepare Data:

    a.	Positive and negative movie reviews are loaded from the movie reviews dataset.

    b.	The reviews are combined into a single dataset and shuffled to ensure randomness.

4.	Extract Features:

    a.	The most frequent words from the dataset are extracted to use as features.

    b.	A function extract_features is defined to extract these features from each review, creating a feature dictionary for the classifier.

5.	Create Training and Testing Datasets:

    a.	The dataset is split into training and testing sets, with the training set consisting of the first 1500 reviews and the testing set consisting of the remaining reviews.

6.	Train the Naive Bayes Classifier:

    a.	A Naive Bayes classifier is trained using the training set.

7.	Evaluate the Classifier:

    a.	The classifier is evaluated on the test set, and the accuracy is printed.

8.	Test on Custom Review:

    a.	The classifier is tested on a custom review to predict its sentiment, and the predicted sentiment is printed.
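
To make the training and classification steps more concrete, here is a minimal sketch of a Naive Bayes classifier over Boolean features like those produced by `extract_features`. It is a simplification under stated assumptions: it uses add-one smoothing, the function names `train_nb` and `classify_nb` are hypothetical, and NLTK's own implementation differs in details such as smoothing.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_featuresets):
    """Count labels and, per label, how often each feature is True."""
    label_counts = Counter(label for _, label in labeled_featuresets)
    feature_counts = defaultdict(Counter)
    for features, label in labeled_featuresets:
        for name, present in features.items():
            if present:
                feature_counts[label][name] += 1
    return label_counts, feature_counts

def classify_nb(features, label_counts, feature_counts):
    """Return the label with the highest log posterior score."""
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for name, present in features.items():
            # Add-one smoothed estimate of P(feature is True | label)
            p_true = (feature_counts[label][name] + 1) / (n + 2)
            score += math.log(p_true if present else 1 - p_true)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data, made up for illustration
train = [
    ({"great": True, "boring": False}, "positive"),
    ({"great": True, "boring": False}, "positive"),
    ({"great": False, "boring": True}, "negative"),
    ({"great": False, "boring": True}, "negative"),
]
label_counts, feature_counts = train_nb(train)
print(classify_nb({"great": True, "boring": False}, label_counts, feature_counts))
```

Working in log space avoids numerical underflow when thousands of feature probabilities are multiplied together.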


**Exercise-3**

**Title:** Building a Sentiment Analyzer for Social Media Posts

**Problem Statement:**
The task is to build a sentiment analysis model that classifies social media posts as either positive or negative. Using a dataset of social media posts, we will preprocess the text data, transform it using TF-IDF vectorization, and train a logistic regression model. The model's performance will be evaluated using accuracy and classification metrics.

**Steps to be Followed:**

1.	Install and Import Necessary Libraries:

    a.	Import necessary libraries for data handling, model training, and evaluation.

2.	Load and Display Dataset:

    a.	Create the sample data.

    b.	Display the first few rows of the dataset to understand its structure.

3.	Define Text Cleaning Function:

    a.	Define a function to clean the text data by removing URLs, mentions, hashtags, digits, punctuation, and stopwords, and converting the text to lowercase.

4.	Apply Text Cleaning Function:

    a.	Apply the cleaning function to the text data in the dataset.

5.	Split Data into Features and Target Labels:

    a.	Split the data into features (X) and target labels (y).

6.	Split Data into Training and Testing Sets:

    a.	Split the data into training and testing sets using a defined test size and random state.

7.	Initialize and Fit TF-IDF Vectorizer:

    a.	Initialize the TF-IDF vectorizer with a specified maximum number of features.

    b.	Fit and transform the training data using the TF-IDF vectorizer.

    c.	Transform the testing data using the fitted TF-IDF vectorizer.

8.	Initialize and Train Logistic Regression Model:

    a.	Initialize the logistic regression model with an increased maximum number of iterations for convergence.

    b.	Train the model on the TF-IDF transformed training data.

9.	Predict and Evaluate the Model:

    a.	Predict sentiment labels for the testing data.

    b.	Calculate and print the accuracy of the model.

    c.	Display the classification report to show detailed performance metrics.
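
One caveat about step 3 for sentiment tasks: NLTK's English stopword list contains negations such as "not" and "no", so removing stopwords can drop the word that carries the sentiment. The sketch below illustrates this with a hypothetical, abbreviated stopword set (`STOP_WORDS`) and a hypothetical `keep_negations` option; neither is part of the exercise.

```python
# Abbreviated stopword set for illustration only; NLTK's list is much longer
STOP_WORDS = {"the", "of", "with", "is", "a", "and", "not", "no", "at", "all"}
NEGATIONS = {"not", "no"}

def remove_stopwords(text, keep_negations=False):
    """Drop stopwords, optionally keeping negation words."""
    stop = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return " ".join(w for w in text.lower().split() if w not in stop)

post = "Not happy with the performance of the new version"
print(remove_stopwords(post))
print(remove_stopwords(post, keep_negations=True))
```

Whether to keep negations is a design choice; it tends to matter more for short posts where a single word flips the meaning.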


In [None]:
# Step 1: Install and Import Necessary Libraries
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import nltk
from nltk.corpus import stopwords

# Download NLTK stopwords if not already present
nltk.download('stopwords')

# Step 2: Create and Load Sample Dataset
data = {
    'Text': [
        "I love the new design of the app! It's fantastic and user-friendly.",
        "The recent update is terrible. It crashes all the time.",
        "Great job on the new features, very innovative!",
        "I'm really disappointed with the customer service.",
        "This is the best product I've ever used!",
        "Not happy with the performance of the new version.",
        "Absolutely amazing experience, I highly recommend it!",
        "The interface is confusing and not intuitive at all.",
        "Fantastic support from the team, helped me resolve my issue quickly.",
        "Horrible experience, I want a refund."
    ],
    'Sentiment': [
        'positive', 'negative', 'positive', 'negative', 'positive',
        'negative', 'positive', 'negative', 'positive', 'negative'
    ]
}

df = pd.DataFrame(data)

# Step 3: Define Text Cleaning Function
# Build the stopword set once; set lookups are fast and avoid reloading the list per word
stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'@\w+', '', text)     # Remove mentions
    text = re.sub(r'#\w+', '', text)     # Remove hashtags
    text = re.sub(r'\d+', '', text)      # Remove digits
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = text.lower()                  # Convert to lowercase
    text = ' '.join(word for word in text.split() if word not in stop_words)  # Remove stopwords
    return text

# Step 4: Apply Text Cleaning Function
df['Text'] = df['Text'].apply(clean_text)

# Step 5: Split Data into Features and Target Labels
X = df['Text']
y = df['Sentiment']

# Step 6: Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 7: Initialize and Fit TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000)

# Fit and transform training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform testing data
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Step 8: Initialize and Train Logistic Regression Model
model = LogisticRegression(max_iter=200)  # Increased max_iter for convergence

# Train model on training data
model.fit(X_train_tfidf, y_train)

# Step 9: Predict and Evaluate the Model
# Predict sentiment labels for testing data
y_pred = model.predict(X_test_tfidf)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Display classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))


Accuracy: 1.0
Classification Report:
              precision    recall  f1-score   support

    negative       1.00      1.00      1.00         1
    positive       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Explanation of the Code:**
1.	Install and Import Necessary Libraries:

    a.	pandas for data handling.

    b.	re for regular expressions.

    c.	train_test_split, TfidfVectorizer, LogisticRegression, accuracy_score, and classification_report from sklearn for model training and evaluation.

    d.	nltk for natural language processing tasks.

2.	Create and Load Sample Dataset:

    a.	The dataset is created with sample text and sentiment labels, mimicking social media posts.

3.	Define Text Cleaning Function:

    a.	The clean_text function removes unwanted elements from the text, such as URLs, mentions, hashtags, digits, punctuation, and stopwords, and converts the text to lowercase.

4.	Apply Text Cleaning Function:

    a.	The cleaning function is applied to the Text column of the dataset to preprocess the text data.

5.	Split Data into Features and Target Labels:

    a.	The Text column is used as features (X), and the Sentiment column is used as target labels (y).

6.	Split Data into Training and Testing Sets:

    a.	The data is split into training and testing sets with a test size of 20% and a random state of 42 for reproducibility.

7.	Initialize and Fit TF-IDF Vectorizer:

    a.	The TF-IDF vectorizer is initialized with a maximum of 1000 features.

    b.	The training data is fit and transformed, and the testing data is transformed using the fitted vectorizer.

8.	Initialize and Train Logistic Regression Model:

    a.	The logistic regression model is initialized with an increased maximum number of iterations (200) for convergence.

    b.	The model is trained on the TF-IDF transformed training data.

9.	Predict and Evaluate the Model:

    a.	The model predicts sentiment labels for the testing data.

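    b.	The accuracy of the model is calculated and printed. Because the test set holds only two posts (20% of ten), the reported accuracy of 1.0 is not a reliable estimate of performance on unseen data.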

    c.	A classification report is displayed, showing detailed performance metrics such as precision, recall, and F1-score for each class (positive and negative).
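
To show what the TF-IDF step computes, here is a minimal pure-Python sketch. It follows the smoothed IDF formula and L2 normalization documented for scikit-learn's default settings, but tokenization is a plain whitespace split, which is a simplification. The function name `tfidf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: tf[term] * idf[term] for term in tf}
        # Scale each document vector to unit length (guard against empty documents)
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        vectors.append({term: v / norm for term, v in raw.items()})
    return vectors

docs = ["love new design app", "recent update terrible", "great new features"]
vectors = tfidf(docs)
for term, weight in sorted(vectors[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "new" appears in two of the three documents and therefore receives a lower weight than "love", which appears in only one.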


**Exercise-4**

**Title:** BERT Implementation for Question-Answering

**Problem Statement:**
Implement a BERT-based model to perform question-answering tasks, allowing users to ask questions based on a given context and receive accurate answers. This implementation demonstrates how to use the Hugging Face transformers library to load a pre-trained BERT model and tokenizer for this purpose.

**Steps to be Followed:**

1.	Install Required Libraries:

    a.	Install the transformers library.

2.	Import Necessary Modules:

    a.	Import the pipeline function from the transformers library.

3.	Load the Question-Answering Pipeline:

    a.	Initialize the question-answering pipeline with a pre-trained BERT model and tokenizer.

4.	Define the Context and Questions:

    a.	Provide the context (text) from which the model will extract answers.

    b.	Define the questions to be asked based on the context.

5.	Perform Question-Answering:

    a.	Use the pipeline to answer the questions based on the given context.

6.	Display the Results:

    a.	Print the questions and their corresponding answers.


In [None]:
# Step 1: Install Required Libraries
# Uncomment the line below if running in an environment where transformers is not installed
# !pip install transformers

# Step 2: Import Necessary Modules
from transformers import pipeline

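# Step 3: Load the Question-Answering Pipeline
# Note: 'bert-base-uncased' is a general-purpose checkpoint with no fine-tuned QA head,
# so its span-prediction layer is randomly initialized and the answers are unreliable.
# For meaningful answers, use a checkpoint fine-tuned on SQuAD, e.g.
# model='bert-large-uncased-whole-word-masking-finetuned-squad'.
qa_pipeline = pipeline('question-answering', model='bert-base-uncased', tokenizer='bert-base-uncased')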

# Step 4: Define the Context and Questions
context = """
Transformers are models that process sequential data, making them ideal for tasks such as language translation and text summarization.
BERT, which stands for Bidirectional Encoder Representations from Transformers, is a transformer-based model designed to understand the context of words in a sentence.
"""

questions = [
    "What are transformers ideal for?",
    "What is BERT?"
]

# Step 5: Perform Question-Answering
for question in questions:
    result = qa_pipeline(question=question, context=context)
    print(f"Question: {question}")
    print(f"Answer: {result['answer']}\n")

# Step 6: Display the Results
# The results are printed within the loop above


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Question: What are transformers ideal for?
Answer: that process sequential data, making them ideal for tasks such

Question: What is BERT?
Answer: that process sequential data, making them ideal for tasks such



**Explanation of the Code:**
1.	Install Required Libraries:

    a.	The transformers library is required to use pre-trained BERT models. The installation command is commented out but can be uncommented if needed.

2.	Import Necessary Modules:

    a.	The pipeline function from the transformers library is imported. This function simplifies the process of using various models for different tasks, such as question-answering.

3.	Load the Question-Answering Pipeline:

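    a.	The pipeline function is used to create a question-answering pipeline with the bert-base-uncased model and tokenizer. This checkpoint is pre-trained on English text but is not fine-tuned for question answering: as the warning in the output shows, its qa_outputs layer is newly initialized, which is why both questions receive the same meaningless span. A checkpoint fine-tuned on SQuAD (for example, bert-large-uncased-whole-word-masking-finetuned-squad) is needed for usable answers.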

4.	Define the Context and Questions:

    a.	The context is a string containing the text from which the model will extract answers.

    b.	A list of questions related to the context is defined.

5.	Perform Question-Answering:

    a.	A loop iterates over each question, and the qa_pipeline is used to generate answers based on the provided context.

    b.	The results, including the questions and their corresponding answers, are printed.

6.	Display the Results:

    a.	The results are displayed within the loop, providing a clear and concise output for each question-answer pair.
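
An extractive question-answering model produces a start score and an end score for every token in the context, and the answer is the span whose combined score is highest. The sketch below illustrates that selection step in plain Python. The scores are made up, the function name `best_span` is hypothetical, and the real pipeline's post-processing involves more steps (for example, handling long contexts).

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end) maximizing start_scores[start] + end_scores[end]."""
    best, best_score = (0, 0), float("-inf")
    for i, start_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start_score + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

# Made-up tokens and scores for illustration
tokens = ["bert", "is", "a", "transformer", "based", "model"]
start_scores = [0.1, 0.2, 2.5, 1.0, 0.3, 0.1]
end_scores = [0.0, 0.1, 0.2, 0.4, 0.5, 3.0]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

When the layer that produces these scores is untrained, the scores are effectively random, so the selected span carries no meaning regardless of the question.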


**Exercise-5**

**Title:** T5 Implementation for Text Summarization

**Problem Statement:**
Implement a text summarization solution using the T5 (Text-To-Text Transfer Transformer) model. T5 is designed to handle various natural language processing (NLP) tasks by treating both input and output as text, allowing for a unified approach to different tasks. This implementation will demonstrate how to use the Hugging Face transformers library to load a pre-trained T5 model and tokenizer for summarizing a given text.

**Steps to be Followed:**
1.	Install Required Libraries:

    a.	Ensure the transformers library is installed.

2.	Import Necessary Modules:

    a.	Import T5ForConditionalGeneration and T5Tokenizer from the transformers library.

3.	Load the T5 Model and Tokenizer:

    a.	Initialize the T5 model and tokenizer using the pre-trained t5-base model.

4.	Define the Text for Summarization:

    a.	Provide the text that needs to be summarized.

5.	Preprocess the Input Text:

    a.	Encode the input text using the T5 tokenizer, adding a task prefix and setting the appropriate length constraints.

6.	Generate the Summary:

    a.	Use the T5 model to generate a summary from the encoded input text.

7.	Decode and Display the Summary:

    a.	Decode the generated summary and print both the original text and the generated summary.


In [None]:
# Step 1: Install Required Libraries
# Uncomment the line below if running in an environment where transformers is not installed
# !pip install transformers

# Step 2: Import Necessary Modules
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Step 3: Load the T5 Model and Tokenizer
model_name = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Step 4: Define the Text for Summarization
text_to_summarize = """
T5 (Text-To-Text Transfer Transformer) is a transformer-based model introduced by Google Research.
It is designed for various natural language processing tasks by formulating them as text-to-text problems.
In T5, input and output are treated as text, allowing the model to handle various tasks in a unified manner.
"""

# Step 5: Preprocess the Input Text
input_ids = tokenizer.encode("summarize: " + text_to_summarize, return_tensors="pt", max_length=512, truncation=True)

# Step 6: Generate the Summary
summary_ids = model.generate(input_ids)

# Step 7: Decode and Display the Summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Original Text:\n", text_to_summarize)
print("\nGenerated Summary:\n", summary)



You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Original Text:
 
T5 (Text-To-Text Transfer Transformer) is a transformer-based model introduced by Google Research. 
It is designed for various natural language processing tasks by formulating them as text-to-text problems. 
In T5, input and output are treated as text, allowing the model to handle various tasks in a unified manner.


Generated Summary:
 Google Research has introduced a transformer-based model for text-to-text problems.


**Explanation of the Code:**

1.	Install Required Libraries:

    a.	The transformers library is necessary to use the T5 model. The installation command is provided but commented out.

2.	Import Necessary Modules:

    a.	The T5ForConditionalGeneration and T5Tokenizer classes from the transformers library are imported to load and use the T5 model and tokenizer.

3.	Load the T5 Model and Tokenizer:

    a.	The pre-trained t5-base model and tokenizer are loaded using their respective classes. The t5-base model is suitable for a variety of text-to-text NLP tasks, including summarization.

4.	Define the Text for Summarization:

    a.	A sample text is provided that needs to be summarized. This text can be replaced with any other text as needed.

5.	Preprocess the Input Text:

    a.	The input text is preprocessed by encoding it with the T5 tokenizer. The prefix "summarize: " indicates the task. The max_length and truncation parameters ensure the input is appropriately sized for the model.

6.	Generate the Summary:

    a.	The T5 model generates a summary from the preprocessed input text. The generate method produces the summary in the form of token IDs.

7.	Decode and Display the Summary:

    a.	The generated summary is decoded back into text using the tokenizer. Both the original text and the generated summary are printed for comparison.
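
The code calls `generate` with its default settings, which in typical configurations means greedy decoding: the model repeatedly picks the highest-scoring next token until it emits an end-of-sequence token or reaches a maximum length. The sketch below illustrates that loop with a hypothetical lookup table (`TABLE`) standing in for the model. It is a conceptual illustration, not the library's implementation.

```python
def greedy_decode(next_token_scores, start_token, eos_token, max_length=20):
    """Append the highest-scoring next token until EOS or max_length is reached."""
    output = [start_token]
    while len(output) < max_length:
        scores = next_token_scores(output)   # {token: score} for the next position
        token = max(scores, key=scores.get)  # greedy choice
        if token == eos_token:
            break
        output.append(token)
    return output[1:]  # drop the start token

# Hypothetical stand-in for a model: scores depend only on the last token
TABLE = {
    "<pad>": {"t5": 0.9, "is": 0.1},
    "t5": {"is": 0.8, "</s>": 0.2},
    "is": {"a": 0.7, "</s>": 0.3},
    "a": {"model": 0.6, "</s>": 0.4},
    "model": {"</s>": 1.0},
}

def toy_scores(prefix):
    return TABLE[prefix[-1]]

summary_tokens = greedy_decode(toy_scores, "<pad>", "</s>")
print(" ".join(summary_tokens))
```

A small maximum length cuts the output short, which is one reason to pass explicit length arguments to `generate` when summarizing longer texts.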
