#Q1: Probabilistic N-Gram Language Model(50 points)

**Objective:**

The objective of this question is to implement and experiment with an N-Gram language model using the Reuters dataset. The task involves building a probabilistic N-Gram model and creating a text generator based on the trained model with customizable parameters.

**Tasks:**


**1.Text Preprocessing (5 points):**
*   Implement the preprocess_text function to perform necessary text preprocessing. You may use NLTK or other relevant libraries for this task. (Already provided, no modification needed)


**2.Build Probabilistic N-Gram Model (15 points):**

*   Implement the build_probabilistic_ngram_model function to construct a probabilistic N-Gram model from the Reuters dataset.


**3.Generate Text with Customizable Parameters (15 points):**

*   Implement the generate_text function to generate text given a seed text and the probabilistic N-Gram model.
*   The function should have parameters for probability_threshold and min_length to customize the generation process.
*   Ensure that the generation stops when either the specified min_length is reached or the probabilities fall below probability_threshold.


**4.Experimentation and Parameter Tuning (5 points):**

*   Use Google Colab to experiment with different values of n_value, probability_threshold, and min_length.
Find the optimal parameters that result in coherent and meaningful generated text.
*   Provide a detailed analysis of the impact of changing each parameter on the generated text's quality.
*   Discuss any challenges faced during parameter tuning and propose potential improvements.


**5.Results and Conclusion (10 points):**

*   Summarize your findings and present the optimal parameter values for n_value, probability_threshold, and min_length.
*   Discuss the trade-offs and considerations when selecting these parameters.
*   Conclude with insights gained from the experimentation.

In [1]:
import nltk
from nltk.corpus import reuters
from nltk import ngrams
import random
import string
from collections import defaultdict

# Download the Reuters dataset if not already downloaded
nltk.download('reuters')
nltk.download('punkt')

nltk.download('stopwords')
from nltk import FreqDist
from nltk.corpus import stopwords
from collections import Counter

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    string.punctuation = string.punctuation +'"'+'"'+'-'+'''+'''+'—'
    removal_list = list(stop_words) + list(string.punctuation) + ['lt', 'rt']

    processed_text = []
    for sentence in text:
        sentence = list(map(lambda x: x.lower(), sentence))
        for word in sentence:
            if word == '.':
                sentence.remove(word)
            else:
                processed_text.append(word)
    return processed_text

def build_probabilistic_ngram_model(corpus, n):
    ngram = []
    for sentence in corpus:
        tokenized_sentence = list(ngrams(sentence, n, pad_left=True, pad_right=True))
        ngram.extend(tokenized_sentence)

    model = {}
    for tokens in ngram:
        prefix = tokens[:n-1]
        next_word = tokens[n-1]
        if prefix not in model:
            model[prefix] = Counter()
        model[prefix][next_word] += 1

    return model

def generate_text(model, seed_text, n, probability_threshold=0.1, min_length=10):
    prefix = tuple(seed_text.split()[-(n-1):])
    generated_text = list(seed_text.split())

    while len(generated_text) < min_length or (generated_text[-1] != '.' and generated_text[-1] != '.\''):
        if prefix in model:
            candidates = model[prefix]
            total_count = sum(candidates.values())
            probabilities = {word: count/total_count for word, count in candidates.items()}
            next_word = random.choices(list(probabilities.keys()), list(probabilities.values()))[0]

            if probabilities[next_word] >= probability_threshold:
                generated_text.append(next_word)
                prefix = tuple(generated_text[-(n-1):])
            else:
                break
        else:
            break

    return ' '.join(generated_text)


In [None]:
# Testing the code
sents = reuters.sents()
processed_text = preprocess_text(sents)
model = build_probabilistic_ngram_model(processed_text, 3)
seed_text = 'He said'
generated_text = generate_text(model, seed_text, 3, probability_threshold=0.1, min_length=10)

print(f'Seed Text: {seed_text}')
print(f'Generated Text: {generated_text}')

# Next word prediction
s=''
def pick_word(counter):
	"Chooses a random element."
	return random.choice(list(counter.elements()))
prefix = "he", "said"
print(" ".join(prefix))
s = " ".join(prefix)
for i in range(19):
	suffix = pick_word(d[prefix])
	s=s+' '+suffix
	print(s)
	prefix = prefix[1], suffix

Seed Text: He said
Generated Text: He said
he said
he said ,
he said , a
he said , a major
he said , a major turnaround


KeyError: ('major', 'turnaround')

In [None]:
# Load the Reuters dataset
corpus = [reuters.raw(file_id) for file_id in reuters.fileids()]

# Preprocess the entire corpus
preprocessed_corpus = [preprocess_text(text) for text in corpus]

# Choose an n for the n-gram model
n_value = 2  # You may change this value

# Build the probabilistic n-gram model
probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

In [None]:
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.02, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: inflation is


In [None]:
# Load the Reuters dataset
corpus = [reuters.raw(file_id) for file_id in reuters.fileids()]

# Preprocess the entire corpus
preprocessed_corpus = [preprocess_text(text) for text in corpus]

# Test the text generator
seed_text = "Inflation is"

for n_value in range (2, 4):
  print("n is: ", n_value)
  # Build the probabilistic n-gram model
  probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)
  generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.02, min_length=10)
  print(f"Generated Text: {generated_text}")


KeyboardInterrupt: 

In [None]:
# imports
import string
import random
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('reuters')
from nltk.corpus import reuters
from nltk import FreqDist
from nltk.corpus import stopwords  # Add this line to import the stopwords module
from collections import Counter  # Add this line to import the Counter class

# input the reuters sentences
sents =reuters.sents()

# write the removal characters such as : Stopwords and punctuation
stop_words = set(stopwords.words('english'))
string.punctuation = string.punctuation +'"'+'"'+'-'+'''+'''+'—'
string.punctuation
removal_list = list(stop_words) + list(string.punctuation)+ ['lt','rt']
removal_list

# generate unigrams bigrams trigrams
unigram=[]
bigram=[]
trigram=[]
tokenized_text=[]
for sentence in sents:
  sentence = list(map(lambda x:x.lower(),sentence))
  for word in sentence:
      if word== '.':
        sentence.remove(word)
      else:
        unigram.append(word)

  tokenized_text.append(sentence)
  bigram.extend(list(ngrams(sentence, 2,pad_left=True, pad_right=True)))
  trigram.extend(list(ngrams(sentence, 3, pad_left=True, pad_right=True)))

# remove the n-grams with removable words
def remove_stopwords(x):
	y = []
	for pair in x:
		count = 0
		for word in pair:
			if word in removal_list:
				count = count or 0
			else:
				count = count or 1
		if (count==1):
			y.append(pair)
	return (y)
unigram = remove_stopwords(unigram)
bigram = remove_stopwords(bigram)
trigram = remove_stopwords(trigram)

# generate frequency of n-grams
freq_bi = FreqDist(bigram)
freq_tri = FreqDist(trigram)

d = {}
for a, b, c in freq_tri:
    if a is not None and b is not None and c is not None:
        if (a, b) not in d:
            d[a, b] = Counter()
        d[a, b] += Counter({c: freq_tri[a, b, c]})


# Next word prediction
s=''
def pick_word(counter):
	"Chooses a random element."
	return random.choice(list(counter.elements()))
prefix = "he", "said"
print(" ".join(prefix))
s = " ".join(prefix)
for i in range(19):
	suffix = pick_word(d[prefix])
	s=s+' '+suffix
	print(s)
	prefix = prefix[1], suffix


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


he said
he said citibank
he said citibank n
he said citibank n a
he said citibank n a .-
he said citibank n a .- c
he said citibank n a .- c 9
he said citibank n a .- c 9 5
he said citibank n a .- c 9 5 pct
he said citibank n a .- c 9 5 pct to
he said citibank n a .- c 9 5 pct to 2
he said citibank n a .- c 9 5 pct to 2 ,
he said citibank n a .- c 9 5 pct to 2 , after
he said citibank n a .- c 9 5 pct to 2 , after trading
he said citibank n a .- c 9 5 pct to 2 , after trading late
he said citibank n a .- c 9 5 pct to 2 , after trading late yesterday
he said citibank n a .- c 9 5 pct to 2 , after trading late yesterday that
he said citibank n a .- c 9 5 pct to 2 , after trading late yesterday that grain
he said citibank n a .- c 9 5 pct to 2 , after trading late yesterday that grain prices
he said citibank n a .- c 9 5 pct to 2 , after trading late yesterday that grain prices would


#Q2: Sentiment Analysis with Naive Bayes Classifier(50 Points)

**Objective:**

You are tasked with implementing a Naive Bayes classifier for sentiment analysis. The provided code is incomplete, and your goal is to complete the missing parts. Additionally, you should train the classifier on a small dataset and analyze its performance.

**Tasks:**

1.**Complete the Code (35 points)**: Fill in the missing parts in the provided Python code for the Naive Bayes classifier. Pay special attention to the `extract_features` function.

2.**Train and Test**: Train the Naive Bayes classifier on the training data and test it on a separate test set. Evaluate the accuracy of the classifier.

3.**Analysis (15 points)**: Discuss the results. Identify any misclassifications and try to understand why the classifier may fail in those cases. Provide examples of sentences that were not predicted correctly and explain possible reasons.


In [None]:
import random
import math
import string
from collections import defaultdict

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.corpus import movie_reviews
import nltk

# Download NLTK resources
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')

In [None]:
def get_features(tokens):
    # Remove punctuation
    tokens = [word for word in tokens if word not in string.punctuation]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word.lower() not in stop_words]

    # Perform stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]

    return tokens

In [None]:
class NaiveBayesClassifier:
    def __init__(self, classes):
        self.classes = classes
        self.class_probs = defaultdict(float)
        self.feature_probs = defaultdict(lambda: defaultdict(float))

    def train(self, training_data):
        # Implement training here
        # You should use get_features function to extract useful tokens from
        # dataset and use them to train the classifier.
        pass

    def classify(self, features):
        # Implement classification here
        pass

In [None]:
# Load the movie reviews dataset from NLTK
data = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(data)

# Shuffle the dataset for randomness
random.shuffle(data)

# Split the dataset into training and testing sets
split_ratio = 0.8
split_index = int(len(data) * split_ratio)
train_set = data[:split_index]
test_set = data[split_index:]

# Train the Naive Bayes classifier
classes = set(sentiment for _, sentiment in train_set)
classifier = NaiveBayesClassifier(classes)
classifier.train(train_set)

def calculate_accuracy(dataset, dataset_type):
    # Test the classifier on the testing set
    correct_predictions = 0
    for example in dataset:
        tokens, true_sentiment = example
        features = get_features(tokens)
        predicted_sentiment = classifier.classify(features)
        if predicted_sentiment == true_sentiment:
            correct_predictions += 1

    accuracy = correct_predictions / len(dataset)
    print(f"{dataset_type} Accuracy: {accuracy}")

calculate_accuracy(train_set, 'Train')
calculate_accuracy(test_set, 'Test')

#Submission Instructions:


1.Submit a Google Colab notebook containing your completed code and experimentation results.

2.Include comments and explanations in your code to help understand the implemented logic.

3.Clearly present the results of your parameter tuning in the notebook.

4.Provide a brief summary of your findings and insights in the conclusion section.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Experiment with various seed texts to showcase the diversity of generated text.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.