<a href="https://colab.research.google.com/github/SabhyaGrover/ML-Coursework/blob/main/ml_lab1_nltk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Assignment Lab-1**

**Problem Statement**:
Suppose a user wants to find the documents or articles relevant to a particular query in a huge collection. Searching manually is impractical because it takes a great deal of effort. To search efficiently and automatically, we need a recommendation system that returns a document or article in response to a query.


**EXERCISE 1: Pre-Processing Text**

In [None]:
# Import NLTK and download the resources it needs (tokenizer, stop words, tagger, WordNet)
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# Read data from Documents
doc1="Broad to Rogers no run around the wicket Rogers back and across the off stump to block up the wicket"
doc2="Swann to Watson no run covers up on the off strump up the wicket"
doc3="Meth to Shahriar Nafees, no run, on a good length on the off,drives that on the up towards extra cover"

In [None]:
# Convert to Lower Case
doc1=doc1.lower()
doc2=doc2.lower()
doc3=doc3.lower()


In [None]:
# Remove Punctuation from each word
import string
doc1=doc1.translate(str.maketrans('', '', string.punctuation))
doc2=doc2.translate(str.maketrans('', '', string.punctuation))
doc3=doc3.translate(str.maketrans('', '', string.punctuation))
print(doc1)
print(doc2)
print(doc3)


broad to rogers no run around the wicket rogers back and across the off stump to block up the wicket
swann to watson no run covers up on the off strump up the wicket
meth to shahriar nafees no run on a good length on the offdrives that on the up towards extra cover
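
Note that deleting punctuation outright fuses `off,drives` into the single token `offdrives` in the third document. A minimal sketch of one alternative, assuming it is acceptable to map each punctuation character to a space instead; the helper name `strip_punctuation` is hypothetical and not part of the original notebook, and the outputs shown in this notebook were produced with the original deletion approach:

```python
import string

# Map every punctuation character to a space so neighbouring words stay separate.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def strip_punctuation(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())

sample = "on the off,drives that on the up"
cleaned = strip_punctuation(sample)
print(cleaned)  # on the off drives that on the up
```

With this variant, `off` would later be dropped as a stop word and `drives` would survive as its own token.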


In [None]:
# Create Bag Of Words after Tokenizing each Sentence
tokens1=nltk.word_tokenize(doc1)
tokens2=nltk.word_tokenize(doc2)
tokens3=nltk.word_tokenize(doc3)
print(tokens1)
print(tokens2)
print(tokens3)

['broad', 'to', 'rogers', 'no', 'run', 'around', 'the', 'wicket', 'rogers', 'back', 'and', 'across', 'the', 'off', 'stump', 'to', 'block', 'up', 'the', 'wicket']
['swann', 'to', 'watson', 'no', 'run', 'covers', 'up', 'on', 'the', 'off', 'strump', 'up', 'the', 'wicket']
['meth', 'to', 'shahriar', 'nafees', 'no', 'run', 'on', 'a', 'good', 'length', 'on', 'the', 'offdrives', 'that', 'on', 'the', 'up', 'towards', 'extra', 'cover']


In [None]:
# Import Stop Words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests

# Filter out Stop Words from the existing Tokens
tokens1 = [t for t in tokens1 if t not in stop_words]
tokens2 = [t for t in tokens2 if t not in stop_words]
tokens3 = [t for t in tokens3 if t not in stop_words]
print(tokens1)
print(tokens2)
print(tokens3)

['broad', 'rogers', 'run', 'around', 'wicket', 'rogers', 'back', 'across', 'stump', 'block', 'wicket']
['swann', 'watson', 'run', 'covers', 'strump', 'wicket']
['meth', 'shahriar', 'nafees', 'run', 'good', 'length', 'offdrives', 'towards', 'extra', 'cover']




---



**EXERCISE 2: Using the Pre-Processed Text for Analysis**

In [None]:
# Combine the unique tokens of all documents into one vocabulary (the corpus)
corpus = set(tokens1) | set(tokens2) | set(tokens3)
print(corpus)

{'back', 'wicket', 'stump', 'covers', 'watson', 'nafees', 'offdrives', 'rogers', 'broad', 'block', 'run', 'strump', 'length', 'across', 'good', 'meth', 'shahriar', 'towards', 'extra', 'cover', 'around', 'swann'}


In [None]:
# Initialize one zero-count dictionary per document, keyed by the corpus terms
wordDic1 = dict.fromkeys(corpus, 0)
wordDic2 = dict.fromkeys(corpus, 0)
wordDic3 = dict.fromkeys(corpus, 0)

# Update the frequency of each word in each document
for w in tokens1:
  wordDic1[w]+=1

for w in tokens2:
  wordDic2[w]+=1
  
for w in tokens3:
  wordDic3[w]+=1
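
The same per-document counts can be built with `collections.Counter` from the standard library. The following is a small sketch, not the notebook's own method; the helper name `count_terms` and the example variables are hypothetical:

```python
from collections import Counter

def count_terms(tokens, vocabulary):
    """Return a {term: count} dict covering every term in the vocabulary."""
    counts = Counter(tokens)  # missing terms read as 0 without raising KeyError
    return {term: counts[term] for term in vocabulary}

# Tokens of the second document, plus one term that only occurs elsewhere.
example_tokens = ['swann', 'watson', 'run', 'covers', 'strump', 'wicket']
example_vocab = set(example_tokens) | {'rogers'}
example_counts = count_terms(example_tokens, example_vocab)
print(example_counts['wicket'], example_counts['rogers'])  # 1 0
```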

In [None]:
# Steps to compute the TF-IDF score of each term in each document

# Step 1: Compute Term Frequency
def computeTF(wordDict, bow):
  # TF(t) = (No of times term t appears in a document)/(Total no of terms in the
  # document).
  tfDict = {}
  bowCount = len(bow)
  for word, count in wordDict.items():
    tfDict[word] = count / float(bowCount)
  return tfDict

# Step 2: Compute Inverse Document Frequency
import math

def computeIDF(docList):
  # IDF(t) = log_e(Total number of documents / Number of documents with term t
  # in it).
  N = len(docList)
  # Count the number of documents in which each term occurs
  idfDict = dict.fromkeys(docList[0].keys(), 0)
  for doc in docList:
    for word, val in doc.items():
      if val > 0:
        idfDict[word] += 1
  # Every corpus term occurs in at least one document, so val is never 0 here
  for word, val in idfDict.items():
    idfDict[word] = math.log(N / float(val))
  return idfDict

# Step 3: Compute the combined TF-IDF score 
def computeTFIDF(tfBow, idfs):
  tfidf = {}
  for word,val in tfBow.items():
    tfidf[word] = val*idfs[word]
  return tfidf
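
As a quick check of the formulas in the comments above, consider two terms from the first document, which has 11 tokens after stop-word removal. `wicket` occurs twice there and appears in two of the three documents, while `run` occurs once and appears in all three. A short sketch of the arithmetic (the variable names are illustrative only):

```python
import math

# "wicket": 2 occurrences among 11 tokens, present in 2 of the 3 documents.
tf_wicket = 2 / 11
idf_wicket = math.log(3 / 2)
tfidf_wicket = tf_wicket * idf_wicket
print(round(tfidf_wicket, 6))  # 0.073721

# "run": 1 occurrence among 11 tokens, present in all 3 documents.
tfidf_run = (1 / 11) * math.log(3 / 3)
print(tfidf_run)  # 0.0
```

A term that appears in every document gets an IDF of zero, so it contributes nothing to the score no matter how often it occurs.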


In [None]:
# Step 1
tfBow1 = computeTF(wordDic1, tokens1)
tfBow2 = computeTF(wordDic2, tokens2)
tfBow3 = computeTF(wordDic3, tokens3)
# Step 2
idfs = computeIDF([wordDic1,wordDic2,wordDic3])
# Step 3
tfidfBow1 = computeTFIDF(tfBow1,idfs)
tfidfBow2 = computeTFIDF(tfBow2,idfs)
tfidfBow3 = computeTFIDF(tfBow3,idfs)
# Display TF-IDF score in a DataFrame
import pandas as pd
pd.DataFrame([tfidfBow1,tfidfBow2,tfidfBow3])

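| doc | back | wicket | stump | covers | watson | nafees | offdrives | rogers | broad | block | run | strump | length | across | good | meth | shahriar | towards | extra | cover | around | swann |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.099874 | 0.073721 | 0.099874 | 0.0 | 0.0 | 0.0 | 0.0 | 0.199748 | 0.099874 | 0.099874 | 0.0 | 0.0 | 0.0 | 0.099874 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.099874 | 0.0 |
| 1 | 0.0 | 0.067578 | 0.0 | 0.183102 | 0.183102 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.183102 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.183102 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.109861 | 0.109861 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.109861 | 0.0 | 0.109861 | 0.109861 | 0.109861 | 0.109861 | 0.109861 | 0.109861 | 0.0 | 0.0 |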
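
The problem statement asks for documents to be recommended in response to a query, a step this notebook does not reach. One possible way to use the TF-IDF weights for that is to weight the query the same way and rank documents by cosine similarity. This is a hedged sketch of that idea, not the assignment's prescribed method; the function names and the example query are hypothetical, and the token lists are copied from the stop-word filtering output:

```python
import math
from collections import Counter

# Pre-processed tokens of the three documents.
docs = [
    ['broad', 'rogers', 'run', 'around', 'wicket', 'rogers', 'back', 'across', 'stump', 'block', 'wicket'],
    ['swann', 'watson', 'run', 'covers', 'strump', 'wicket'],
    ['meth', 'shahriar', 'nafees', 'run', 'good', 'length', 'offdrives', 'towards', 'extra', 'cover'],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))
idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf_vector(tokens):
    """TF-IDF weights for a token list; terms outside the vocabulary are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    total = len(tokens)
    return {t: (c / total) * idf[t] for t, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_documents(query_tokens):
    """Return (document index, score) pairs, best match first."""
    query_vec = tfidf_vector(query_tokens)
    scores = [(i, cosine(query_vec, tfidf_vector(doc))) for i, doc in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical query, assumed to be lower-cased, tokenized and stop-word filtered already.
ranking = rank_documents(['rogers', 'stump'])
print(ranking)
```

For this query only the first document (index 0) shares any terms, so it is ranked first and the other two documents score zero.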




---

