# WEEK-1
##Tokenization using UNIX commands


---
### 1. **Create a text file `text1.txt` containing text of your choice.**
**Answer:**
```bash
echo "This is an example text file." > text1.txt
```

---

### 2. **Use `tr` utility to replace every occurrence of a given character with another given character.**
**Answer:**
```bash
tr 'o' 'x' < example.txt
```
**Output:**
```plaintext
Hellx Wxrld!
```

---

### 3. **What are `-s` and `-c` options of `tr` command?**
**Answer:**
- `-s` option (squeeze-repeats): This option is used to squeeze (reduce) repeated characters in the input to a single character.
- `-c` option (complement): This option is used to complement the set of characters. When used, it replaces characters not listed in the first set with the corresponding character from the second set.

**Examples:**

```bash
echo "aaabbbccc" | tr -s 'a'
```
**Output:**
```plaintext
abbbc
```

```bash
echo "abc123" | tr -c 'a-z' 'X'
```
**Output:**
```plaintext
XXX123
```

---

### 4. **What is the output of the following command?**
```bash
tr a-z A-Z < text1.txt
```
**Answer:**
The command `tr a-z A-Z < text1.txt` will transform all lowercase letters to uppercase in the content of the file `text1.txt`. The output will be displayed on the terminal.

**Example output:**
```plaintext
THIS IS AN EXAMPLE TEXT FILE.
```

---

### 5. **Create a text file `text2.txt` containing any text (one word per line) and sort the words in the `text2.txt` file using the `sort` command.**
**Answer:**
```bash
echo -e "banana\napple\norange\nkiwi\ngrape" > text2.txt
sort text2.txt
```
**Output:**
```plaintext
apple
banana
grape
kiwi
orange
```

---

### 6. **Sort and display unique lines in the `text2.txt` file.**
**Answer:**
```bash
sort -u text2.txt
```
**Output:**
```plaintext
apple
banana
grape
kiwi
orange
```

---

### 7. **Sort and display unique lines in `text2.txt` such that each word is preceded with the frequency count of the word in the `text2.txt` file.**
**Answer:**
```bash
sort text2.txt | uniq -c
```
**Output:**
```plaintext
      1 apple
      1 banana
      1 grape
      1 kiwi
      1 orange
```

---

### 8. **Obtain and display the tokens in `text1.txt` file.**
**Answer:**
```bash
tr -sc 'A-Za-z' '\n' < text1.txt | grep -v '^$'
```
**Input (in `text1.txt`):**
```plaintext
This is an example text file.
```

**Output:**
```plaintext
This
is
an
example
text
file
```

---

### 9. **Display the tokens in sorted order.**
**Answer:**
```bash
tr -sc 'A-Za-z' '\n' < text1.txt | grep -v '^$' | sort
```
**Output:**
```plaintext
This
an
example
file
is
text
```

---

### 10. **Display the unique tokens in sorted order.**
**Answer:**
```bash
tr -sc 'A-Za-z' '\n' < text1.txt | grep -v '^$' | sort | uniq
```
**Output:**
```plaintext
This
an
example
file
is
text
```



#WEEK-2
##Implement a word tokenization using regular expressions

In [3]:
#Week-2
#Implement a word tokenization using regular expressions

from nltk.tokenize import RegexpTokenizer
s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
print(tokenizer.tokenize(s))

['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']


In [7]:
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
sentence = """At eight o'clock on Thursday morning... Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)
print(tokens)

['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', '...', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [10]:
import re

# Example 1: Abbreviations like V.C.E, C.S.E, U.S.A
text1 = "This is V.C.E C.S.E I am a Student . I paid a fees of 13.0"
abbreviations = re.findall(r"(?:[A-Z]\.)+[A-Z]", text1)  # Regex to match abbreviations
print("Abbreviations:", abbreviations)

# Example 2: Splitting text into tokens
text2 = "That U.S.A poster-print costs $12.40... which is 3.45."
tokens = re.split(r"\s+", text2)  # Split text by whitespace
print("Tokens:", tokens)

# Example 3: Hyphenated words
hyphenated_words = re.findall(r"\w+-\w+", text2)  # Matches words with hyphens
print("Hyphenated words:", hyphenated_words)

# Example 4: Alternate way to match hyphenated words
hyphenated_words_alt = re.findall(r"(?:\w+-\w+)", text2)  # Non-capturing group for hyphenated words
print("Hyphenated words (non-capturing):", hyphenated_words_alt)


Abbreviations: ['V.C.E', 'C.S.E']
Tokens: ['That', 'U.S.A', 'poster-print', 'costs', '$12.40...', 'which', 'is', '3.45.']
Hyphenated words: ['poster-print']
Hyphenated words (non-capturing): ['poster-print']


In [13]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Sample text
text = "This is V.C.E C.S.E I am a Student. I paid a fees of 13.0"

# Tokenize the text
tokens = word_tokenize(text)
print("Tokenized words:", tokens)

# Remove stopwords
li = []
for w in tokens:
    if w.lower() not in stopwords.words('english'):  # Convert to lowercase for comparison
        li.append(w)

print("Words after removing stopwords:", li)


Tokenized words: ['This', 'is', 'V.C.E', 'C.S.E', 'I', 'am', 'a', 'Student', '.', 'I', 'paid', 'a', 'fees', 'of', '13.0']
Words after removing stopwords: ['V.C.E', 'C.S.E', 'Student', '.', 'paid', 'fees', '13.0']


# WEEK 3
##Implement Minimum Edit Distance (MED) algorithm for spelling correction

In [14]:
# Minimum Edit Distance Algorithm
source = 'kitten'  # Source string
target = 'sitting'  # Target string

m = len(source)
n = len(target)

# Initialize a 2D DP array with dimensions (n+1) x (m+1)
dp = [[0 for _ in range(m + 1)] for _ in range(n + 1)]

# Fill the base cases for the DP table
for i in range(m + 1):
    dp[0][i] = i  # Cost of deleting characters from source
for j in range(n + 1):
    dp[j][0] = j  # Cost of inserting characters into source

# Compute the DP table
for i in range(1, n + 1):
    for j in range(1, m + 1):
        if source[j - 1] == target[i - 1]:
            cost = 0  # No cost if characters match
        else:
            cost = 1  # Substitution cost if characters don't match

        # Calculate the minimum cost among insertion, deletion, and substitution
        dp[i][j] = min(
            dp[i - 1][j] + 1,      # Cost of deletion
            dp[i][j - 1] + 1,      # Cost of insertion
            dp[i - 1][j - 1] + cost  # Cost of substitution
        )

# Display the DP table (optional)
print("DP Table:")
for row in dp:
    print(row)

# Output the minimum edit distance
print("Edit Distance:", dp[n][m])


DP Table:
[0, 1, 2, 3, 4, 5, 6]
[1, 1, 2, 3, 4, 5, 6]
[2, 2, 1, 2, 3, 4, 5]
[3, 3, 2, 1, 2, 3, 4]
[4, 4, 3, 2, 1, 2, 3]
[5, 5, 4, 3, 2, 2, 3]
[6, 6, 5, 4, 3, 3, 2]
[7, 7, 6, 5, 4, 4, 3]
Edit Distance: 3


In [15]:
import nltk

def find_minimum_edit_distance(word1, word2):
    # Use NLTK's edit_distance function
    distance = nltk.edit_distance(word1, word2)
    return distance

# Example usage
word1 = "kitten"
word2 = "sitting"

min_edit_distance = find_minimum_edit_distance(word1, word2)
print(f"The minimum edit distance between '{word1}' and '{word2}' is: {min_edit_distance}")


The minimum edit distance between 'kitten' and 'sitting' is: 3


# WEEK 4
##Implement n-gram language model

In [16]:
import nltk
from nltk.corpus import brown
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.util import bigrams
from nltk.lm import MLE

# Download necessary NLTK resources
nltk.download('brown')

# Load the Brown corpus (news category sentences)
corpus = brown.sents(categories="news")

# Example test sentence
test_sentence = ['There', "wasn't", 'a', 'bit', 'of', 'trouble', 'in', 'Texas']

# Convert test sentence to bigrams with padding
test_sentence_bigrams = list(bigrams(pad_both_ends(test_sentence, n=2)))
print("Test sentence bigrams (with padding):", test_sentence_bigrams)

# Prepare the training data using padded everygram pipeline
n = 2  # Bigram model
train_data, vocab = padded_everygram_pipeline(n, corpus)

# Create an MLE (Maximum Likelihood Estimation) language model
lm = MLE(n)
lm.fit(train_data, vocab)

# Vocabulary size
print('Number of words in vocabulary is:', len(lm.vocab))

# Language model counts (debugging information)
print("Language model counts (partial view):", lm.counts)

# Calculate the probability of the test sentence
prob = 1.0
for t in test_sentence_bigrams:
    score = lm.score(t[1], [t[0]])  # P(word | context)
    print(f"Probability of '{t[1]}' given '{t[0]}': {score}")
    prob *= score

print("Probability of the entire test sentence:", prob)


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


Test sentence bigrams (with padding): [('<s>', 'There'), ('There', "wasn't"), ("wasn't", 'a'), ('a', 'bit'), ('bit', 'of'), ('of', 'trouble'), ('trouble', 'in'), ('in', 'Texas'), ('Texas', '</s>')]
Number of words in vocabulary is: 14397
Language model counts (partial view): <NgramCounter with 2 ngram orders and 214977 ngrams>
Probability of 'There' given '<s>': 0.011464417045208739
Probability of 'wasn't' given 'There': 0.017241379310344827
Probability of 'a' given 'wasn't': 0.3333333333333333
Probability of 'bit' given 'a': 0.002508780732563974
Probability of 'of' given 'bit': 0.2857142857142857
Probability of 'trouble' given 'of': 0.001053001053001053
Probability of 'in' given 'trouble': 0.375
Probability of 'Texas' given 'in': 0.001584786053882726
Probability of '</s>' given 'Texas': 0.125
Probability of the entire test sentence: 3.6943506664397765e-15


# WEEK 5
##Implement Naïve Bayes classification for sentiment analysis

In [17]:
import nltk
from nltk.corpus import movie_reviews
import random

# Download movie_reviews corpus if not already downloaded
nltk.download('movie_reviews')

# Prepare the dataset
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Create a list of the most frequent words in the corpus
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]  # Top 2000 frequent words

# Function to extract features from a document
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

# Create feature sets
featuresets = [(document_features(d), c) for (d, c) in documents]

# Split the dataset into training and testing sets
train_set, test_set = featuresets[100:], featuresets[:100]

# Train a Naïve Bayes Classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Evaluate the classifier on the test set
confusion_matrix = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}

for test_case in test_set:
    predicted = classifier.classify(test_case[0])
    actual = test_case[1]

    if predicted == 'pos' and actual == 'pos':
        confusion_matrix['tp'] += 1
    elif predicted == 'neg' and actual == 'neg':
        confusion_matrix['tn'] += 1
    elif predicted == 'pos' and actual == 'neg':
        confusion_matrix['fp'] += 1
    else:
        confusion_matrix['fn'] += 1

# Display the results
print("Confusion Matrix:")
print("True Positives (TP):", confusion_matrix['tp'])
print("True Negatives (TN):", confusion_matrix['tn'])
print("False Positives (FP):", confusion_matrix['fp'])
print("False Negatives (FN):", confusion_matrix['fn'])

# Accuracy of the model
accuracy = (confusion_matrix['tp'] + confusion_matrix['tn']) / len(test_set)
print(f"Accuracy: {accuracy:.2f}")

# Display the most informative features
print("\nMost Informative Features:")
classifier.show_most_informative_features(10)


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


Confusion Matrix:
True Positives (TP): 38
True Negatives (TN): 39
False Positives (FP): 8
False Negatives (FN): 15
Accuracy: 0.77

Most Informative Features:
Most Informative Features
   contains(outstanding) = True              pos : neg    =     10.7 : 1.0
         contains(mulan) = True              pos : neg    =      9.1 : 1.0
        contains(seagal) = True              neg : pos    =      7.8 : 1.0
   contains(wonderfully) = True              pos : neg    =      6.5 : 1.0
         contains(damon) = True              pos : neg    =      6.1 : 1.0
         contains(flynt) = True              pos : neg    =      5.7 : 1.0
        contains(wasted) = True              neg : pos    =      5.6 : 1.0
          contains(lame) = True              neg : pos    =      5.4 : 1.0
        contains(poorly) = True              neg : pos    =      5.4 : 1.0
         contains(awful) = True              neg : pos    =      5.2 : 1.0


# WEEK 6
##Implement POS tagging using HMM

In [18]:
import nltk
from nltk.corpus import brown
from nltk.tag import hmm

# Download necessary NLTK resources
nltk.download('brown')
nltk.download('punkt')

# Load the tagged sentences from the Brown corpus
sentences = brown.tagged_sents()

# Train a Hidden Markov Model (HMM) POS tagger
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train(sentences)

# Example input text for POS tagging
text = "This is a sample sentence for POS tagging in Python."
words = nltk.word_tokenize(text)

# Perform POS tagging using the trained HMM tagger
tags = tagger.tag(words)

# Display the words with their corresponding tags
for word, tag in tags:
    print(f"{word}: {tag}")


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
  X[i, j] = self._transitions[si].logprob(self._states[j])
  O[i, k] = self._output_logprob(si, self._symbols[k])
  P[i] = self._priors.logprob(si)


This: DT
is: BEZ
a: AT
sample: NN
sentence: NN
for: IN
POS: AT
tagging: AT
in: AT
Python: AT
.: AT


  O[i, k] = self._output_logprob(si, self._symbols[k])


In [20]:
import nltk
from nltk.corpus import brown
from nltk.tag import hmm

# Download necessary NLTK resources
nltk.download('brown')

# Load Brown corpus tagged sentences (news category)
brown_tagged_sentences = brown.tagged_sents(categories='news')

# Split data into training and testing sets (90% train, 10% test)
size = int(len(brown_tagged_sentences) * 0.9)
train_sentences = brown_tagged_sentences[:size]
test_sentences = brown_tagged_sentences[:size]

# Train an HMM POS tagger
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train(train_sentences)

# Evaluate the tagger on the test data
accuracy = tagger.accuracy(test_sentences)
print(f"Accuracy of HMM POS Tagger: {accuracy:.2f}")


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


Accuracy of HMM POS Tagger: 0.98


# WEEK 7
##Implement CKY parsing algorithm

In [21]:
def print_chart(chart, n):
    # Function to print the chart (parse table)
    for p in range(n+1):
        for q in range(n+1):
            print(chart[p][q], end="\t")
        print()

def CKY_PARSE(words, grammar):
    n = len(words)
    print("Words:", words)

    # Initialize parse table (chart)
    table = [[set() for _ in range(n+1)] for _ in range(n+1)]

    # Fill the diagonal (base case) of the chart
    for i in range(1, n+1):
        word = words[i-1]
        for lhs, rhs in grammar:
            if rhs == (word,):  # if the word matches the RHS of a grammar rule
                table[i-1][i].add(lhs)

    # Fill the rest of the chart (dynamic programming part)
    for j in range(2, n+1):  # length of the span
        for i in range(j-2, -1, -1):  # starting point of the span
            for k in range(i+1, j):  # partitioning point
                for lhs, rhs in grammar:
                    if len(rhs) == 2:  # we are handling binary productions
                        if rhs[0] in table[i][k] and rhs[1] in table[k][j]:
                            table[i][j].add(lhs)

    return table

# Sentence and Grammar Setup
sentence = "the dog chased the cat"
words = sentence.split()
n = len(words)

grammar = [
    ('S', ('NP', 'VP')),
    ('NP', ('DET', 'NOMINAL')),
    ('VP', ('VERB', 'NP')),
    ('NOMINAL', ('cat',)),
    ('NOMINAL', ('dog',)),
    ('VERB', ('chased',)),
    ('DET', ('the',))
]

# Perform CKY parsing
chart = CKY_PARSE(words, grammar)

# Print the parse table
print_chart(chart, n)

# Check if the start symbol is in the final cell
start_symbol = 'S'
if start_symbol in chart[0][n]:
    print("The sentence is grammatically correct.")
else:
    print("The sentence is not grammatically correct.")


Words: ['the', 'dog', 'chased', 'the', 'cat']
set()	{'DET'}	{'NP'}	set()	set()	{'S'}	
set()	set()	{'NOMINAL'}	set()	set()	set()	
set()	set()	set()	{'VERB'}	set()	{'VP'}	
set()	set()	set()	set()	{'DET'}	{'NP'}	
set()	set()	set()	set()	set()	{'NOMINAL'}	
set()	set()	set()	set()	set()	set()	
The sentence is grammatically correct.


# WEEK 8
##Implement PCKY parsing algorithm

In [22]:
def print_chart(chart, n, non_terminals):
    for p in range(n + 1):
        for q in range(n + 1):
            print('[', p, ',', q, ']:', end=" ")
            for nt in non_terminals:
                if chart[p][q].get(nt, 0) > 0:
                    print('{', nt, ':', chart[p][q][nt], '}', end=" ")
            print()
        print()

def PCKY_PARSE(words, grammar, non_terminals):
    n = len(words)
    print(words, end="\n")

    # Initialize parse table
    table = [[dict() for _ in range(n + 1)] for _ in range(n + 1)]

    # Initialize table cells with zero probabilities
    for i in range(n + 1):
        for j in range(n + 1):
            for nt in non_terminals:
                table[i][j][nt] = 0.0

    # Fill table cells for the base case
    for j in range(1, n + 1):
        for lhs, rhs, pr in grammar:
            if rhs == (words[j - 1],):  # terminal case
                table[j - 1][j][lhs] = pr

    # Fill the table for larger spans
    for length in range(2, n + 1):  # span length
        for i in range(n - length + 1):  # starting point
            j = i + length  # ending point
            for k in range(i + 1, j):  # partition point
                for lhs, rhs, pr in grammar:
                    if len(rhs) == 2:  # binary rule
                        if (rhs[0] in table[i][k]) and (rhs[1] in table[k][j]):
                            prob = pr * table[i][k].get(rhs[0], 0) * table[k][j].get(rhs[1], 0)
                            if table[i][j].get(lhs, 0) < prob:
                                table[i][j][lhs] = prob

    return table

# Sentence and Grammar Setup
sentence = "the flight includes a meal"
words = sentence.split()
n = len(words)

grammar = [
    ('S', ('NP', 'VP'), 0.80),
    ('NP', ('DET', 'NOMINAL'), 0.30),
    ('VP', ('VERB', 'NP'), 0.20),
    ('NOMINAL', ('meal',), 0.01),
    ('NOMINAL', ('flight',), 0.02),
    ('VERB', ('includes',), 0.05),
    ('DET', ('the',), 0.40),
    ('DET', ('a',), 0.40)
]

non_terminals = ['S', 'NP', 'VP', 'DET', 'NOMINAL', 'VERB']

# Perform PCKY parsing
chart = PCKY_PARSE(words, grammar, non_terminals)

# Print the parse table (chart)
print_chart(chart, n, non_terminals)

# Check if the start symbol is in the final cell and output the result
start_symbol = 'S'
if chart[0][n].get(start_symbol, 0) > 0:
    print("The sentence is grammatically correct.")
else:
    print("The sentence is not grammatically correct.")


['the', 'flight', 'includes', 'a', 'meal']
[ 0 , 0 ]: 
[ 0 , 1 ]: { DET : 0.4 } 
[ 0 , 2 ]: { NP : 0.0024 } 
[ 0 , 3 ]: 
[ 0 , 4 ]: 
[ 0 , 5 ]: { S : 2.3040000000000003e-08 } 

[ 1 , 0 ]: 
[ 1 , 1 ]: 
[ 1 , 2 ]: { NOMINAL : 0.02 } 
[ 1 , 3 ]: 
[ 1 , 4 ]: 
[ 1 , 5 ]: 

[ 2 , 0 ]: 
[ 2 , 1 ]: 
[ 2 , 2 ]: 
[ 2 , 3 ]: { VERB : 0.05 } 
[ 2 , 4 ]: 
[ 2 , 5 ]: { VP : 1.2000000000000002e-05 } 

[ 3 , 0 ]: 
[ 3 , 1 ]: 
[ 3 , 2 ]: 
[ 3 , 3 ]: 
[ 3 , 4 ]: { DET : 0.4 } 
[ 3 , 5 ]: { NP : 0.0012 } 

[ 4 , 0 ]: 
[ 4 , 1 ]: 
[ 4 , 2 ]: 
[ 4 , 3 ]: 
[ 4 , 4 ]: 
[ 4 , 5 ]: { NOMINAL : 0.01 } 

[ 5 , 0 ]: 
[ 5 , 1 ]: 
[ 5 , 2 ]: 
[ 5 , 3 ]: 
[ 5 , 4 ]: 
[ 5 , 5 ]: 

The sentence is grammatically correct.


# WEEK 9
##Compute cosine similarity between the words using term document matrix and term-term matrix

In [24]:
import nltk
import random
import math
from nltk.corpus import brown

# Function to extract words from a document
def extract_words(document):
    all_terms_list = brown.words(fileids=document)
    only_words_list = [w.lower() for w in all_terms_list if w.isalpha()]
    stopwords_list = nltk.corpus.stopwords.words('english')
    final_terms_list = [w for w in only_words_list if w not in stopwords_list]
    return final_terms_list

# Function to calculate the frequency of a word in a document
def freq(word, document):
    d_terms = extract_words(document)
    fdist = nltk.FreqDist(w for w in d_terms)
    return fdist[word]

# List of documents from the Brown corpus
doc_names = ['ca01', 'ca02', 'ca03', 'ca04']

# Constructing the vocabulary set (unique words across all documents)
vocab = set()
for doc in doc_names:
    vocab.update(extract_words(doc))
vocab_len = len(vocab)
print("Length of vocabulary:", vocab_len)

# Randomly select two words from the vocabulary
word1 = random.choice(list(vocab))
word2 = random.choice(list(vocab))
print("word-1:", word1)
print("word-2:", word2)

# Constructing term frequency vectors for the two words
word1_vector = [freq(word1, doc) for doc in doc_names]
word2_vector = [freq(word2, doc) for doc in doc_names]
print("word-1 vector:", word1_vector)
print("word-2 vector:", word2_vector)

# Compute the cosine similarity between the two word vectors
dot_product = sum(word1_vector[i] * word2_vector[i] for i in range(len(doc_names)))

sum1 = sum(word1_vector[i] * word1_vector[i] for i in range(len(doc_names)))
sum2 = sum(word2_vector[i] * word2_vector[i] for i in range(len(doc_names)))

vector1_len = math.sqrt(sum1)
vector2_len = math.sqrt(sum2)

# Cosine similarity formula
cos_theta = dot_product / (vector1_len * vector2_len)
print(f"Cosine similarity between {word1} and {word2}: {cos_theta}")


Length of vocabulary: 2023
word-1: institute
word-2: indispensable
word-1 vector: [0, 1, 1, 2]
word-2 vector: [0, 0, 1, 0]
Cosine similarity between institute and indispensable: 0.4082482904638631


# WEEK 10
##Compute tf-idf matrix for the given document set

In [26]:
import nltk
import random
import math
from nltk.corpus import brown

# Function to calculate Term Frequency (TF)
def term_freq(word, document):
    d_terms = extract_words(document)
    fdist = nltk.FreqDist(w for w in d_terms)
    return math.log10(fdist[word] + 1)  # log-transformation of term frequency

# Function to extract words from a document
def extract_words(document):
    all_terms_list = brown.words(fileids=document)
    only_words_list = [w.lower() for w in all_terms_list if w.isalpha()]
    stopwords_list = nltk.corpus.stopwords.words('english')
    final_terms_list = [w for w in only_words_list if w not in stopwords_list]
    return final_terms_list

# Function to calculate Inverse Document Frequency (IDF)
def idf(word, doc_names):
    df = 0
    for doc in doc_names:
        if word in extract_words(doc):
            df += 1
    return math.log10(len(doc_names) / (df + 1))  # log-transformation of IDF

# List of documents from the Brown corpus
doc_names = ['ca01', 'ca02', 'ca03', 'ca04']

# Construct vocabulary set (unique words across all documents)
vocab = set()
for doc in doc_names:
    vocab.update(extract_words(doc))
vocab_len = len(vocab)
print("Length of vocabulary:", vocab_len)

# Randomly select two words from the vocabulary
word1 = random.choice(list(vocab))
word2 = random.choice(list(vocab))
print("word-1:", word1)
print("word-2:", word2)

# Constructing term frequency vectors (TF * IDF) for the two words
word1_vector = [term_freq(word1, doc) * idf(word1, doc_names) for doc in doc_names]
word2_vector = [term_freq(word2, doc) * idf(word2, doc_names) for doc in doc_names]
print("word-1 vector:", word1_vector)
print("word-2 vector:", word2_vector)

# Computing cosine similarity between word1 and word2
dot_product = sum(word1_vector[i] * word2_vector[i] for i in range(len(doc_names)))

sum1 = sum(word1_vector[i] * word1_vector[i] for i in range(len(doc_names)))
sum2 = sum(word2_vector[i] * word2_vector[i] for i in range(len(doc_names)))

vector1_len = math.sqrt(sum1)
vector2_len = math.sqrt(sum2)

# Cosine similarity formula
cos_theta = dot_product / (vector1_len * vector2_len)
print(f"Cosine similarity between {word1} and {word2}: {cos_theta}")


Length of vocabulary: 2023
word-1: limited
word-2: aged
word-1 vector: [0.0, 0.03761030733945982, 0.0, 0.05961092677364148]
word-2 vector: [0.0, 0.0, 0.2342468675289098, 0.0]
Cosine similarity between limited and aged: 0.0


# WEEK 11
##IMPLEMENT LANGUAGE MODEL USING FEEDFORWARD NEURALNETWORK

In [27]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Toy dataset
corpus = [
    'This is a simple example',
    'Language modeling is interesting',
    'Neural networks are powerful',
    'Feed-forward networks are common in NLP'
]

# Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1  # +1 for padding

# Create input sequences and labels
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# Pad sequences to ensure they all have the same length
max_sequence_length = max([len(x) for x in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre')

# Split into X (input) and y (target)
X, y = input_sequences[:, :-1], input_sequences[:, -1]

# Convert y to categorical (one-hot encoding)
y = tf.keras.utils.to_categorical(y, num_classes=total_words)

# Build the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(total_words, 50, input_length=max_sequence_length-1),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(total_words, activation='softmax')
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X, y, epochs=100, verbose=1)

# Generate text using the trained model
seed_text = "Neural networks"
next_words = 5

for _ in range(next_words):
    # Convert seed text to sequence of integers
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_length-1, padding='pre')

    # Predict the next word
    predicted = np.argmax(model.predict(token_list), axis=-1)

    # Convert the predicted integer back to the word
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break

    # Append the predicted word to the seed text
    seed_text += " " + output_word

print(seed_text)


Epoch 1/100




[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2s/step - accuracy: 0.0000e+00 - loss: 2.8938
Epoch 2/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 165ms/step - accuracy: 0.0625 - loss: 2.8709
Epoch 3/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 63ms/step - accuracy: 0.1875 - loss: 2.8500
Epoch 4/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step - accuracy: 0.3750 - loss: 2.8305
Epoch 5/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step - accuracy: 0.5000 - loss: 2.8120
Epoch 6/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 43ms/step - accuracy: 0.5000 - loss: 2.7933
Epoch 7/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 58ms/step - accuracy: 0.5000 - loss: 2.7743
Epoch 8/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 58ms/step - accuracy: 0.5000 - loss: 2.7548
Epoch 9/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [

# WEEK-12
##IMPLEMENT LANGUAGE MODEL USING RNN

In [28]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Toy dataset
corpus = [
    'This is a simple example',
    'Language modeling is interesting',
    'Neural networks are powerful',
    'Recurrent neural networks capture sequences well'
]

# Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1  # +1 for padding

# Create input sequences and labels
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# Pad sequences to ensure they all have the same length
max_sequence_length = max([len(x) for x in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre')

# Split into X (input) and y (target)
X, y = input_sequences[:, :-1], input_sequences[:, -1]

# Convert y to categorical (one-hot encoding)
y = tf.keras.utils.to_categorical(y, num_classes=total_words)

# Build the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(total_words, 50, input_length=max_sequence_length-1),
    tf.keras.layers.LSTM(100),
    tf.keras.layers.Dense(total_words, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X, y, epochs=100, verbose=1)

# Generate text using the trained model
seed_text = "Recurrent neural networks"
next_words = 5

for _ in range(next_words):
    # Convert seed text to sequence of integers
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_length-1, padding='pre')

    # Predict the next word
    predicted = np.argmax(model.predict(token_list), axis=-1)

    # Convert the predicted integer back to the word
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break

    # Append the predicted word to the seed text
    seed_text += " " + output_word

print(seed_text)


Epoch 1/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2s/step - accuracy: 0.0000e+00 - loss: 2.8330
Epoch 2/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 56ms/step - accuracy: 0.1333 - loss: 2.8268
Epoch 3/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step - accuracy: 0.2000 - loss: 2.8206
Epoch 4/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 56ms/step - accuracy: 0.2000 - loss: 2.8144
Epoch 5/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 56ms/step - accuracy: 0.2000 - loss: 2.8080
Epoch 6/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 37ms/step - accuracy: 0.2000 - loss: 2.8013
Epoch 7/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 58ms/step - accuracy: 0.2000 - loss: 2.7942
Epoch 8/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 41ms/step - accuracy: 0.2000 - loss: 2.7866
Epoch 9/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m