# Assignment 2

## Problem 1 TF-IDF

Implement TF-IDF using Python, NumPy, Pandas, and whatever text-cleaning library you need.

The tf–idf is the product of two statistics: term frequency and inverse document frequency. There are various ways to determine the exact values of both statistics; you can use the following formulas.

### Term Frequency
$$tf_{t,d} = \log_{10}(count(t,d) +1)$$ 

* $tf_{t,d}$ is the (log-scaled) frequency of the word $t$ in the document $d$, where $count(t,d)$ is the number of times $t$ occurs in $d$
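
As a rough illustration of the term-frequency formula, the sketch below counts a term in one tokenised document and applies $\log_{10}(count(t,d) + 1)$. The helper name `term_frequency` and the whitespace tokenisation are assumptions made for this example; they are not part of the assignment template.

```python
import math
from collections import Counter


def term_frequency(term, doc_tokens):
    """Return log10(count(t, d) + 1) for one term in one tokenised document."""
    counts = Counter(doc_tokens)
    # Counter returns 0 for unseen terms, so an absent term gives log10(1) = 0
    return math.log10(counts[term] + 1)


doc = "hello how are you".split()
tf_how = term_frequency("how", doc)        # count = 1 -> log10(2)
tf_missing = term_frequency("geeks", doc)  # count = 0 -> log10(1)
```

Adding 1 inside the logarithm keeps the value defined (and equal to zero) for terms that do not occur in the document.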

### Inverse Document Frequency
$$idf_t = \log_{10}(\frac{N}{df_t})$$

* $N$ is the total number of documents
* $df_t$ is the number of documents in which term $t$ occurs
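
The inverse document frequency can be sketched in the same spirit. The helper name `inverse_document_frequency`, the toy documents, and the decision to return 0.0 for a term that occurs in no document are assumptions of this example only; the formula itself leaves that case undefined.

```python
import math


def inverse_document_frequency(term, tokenised_docs):
    """Return log10(N / df_t), where df_t counts the documents containing the term."""
    n_docs = len(tokenised_docs)
    df = sum(1 for tokens in tokenised_docs if term in tokens)
    if df == 0:
        # log10(N / 0) is undefined; returning 0.0 is a choice made for this sketch
        return 0.0
    return math.log10(n_docs / df)


docs = [
    "geeksforgeeks is a portal for geeks".split(),
    "hello how are you".split(),
    "hello am good how about yourself".split(),
    "am doing how are you again".split(),
]
idf_how = inverse_document_frequency("how", docs)        # df = 3 -> log10(4 / 3)
idf_portal = inverse_document_frequency("portal", docs)  # df = 1 -> log10(4)
```

Note that $df_t$ counts documents, not occurrences: a term repeated many times in a single document still has $df_t = 1$.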

### TF-IDF
$$tf\text{-}idf_{t,d} = tf_{t,d} \times idf_t $$
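
As a small worked example with made-up numbers, suppose a term occurs once in a document and appears in 3 of $N = 4$ documents. Then, rounded to four decimal places:

$$tf_{t,d} = \log_{10}(1 + 1) \approx 0.3010, \qquad idf_t = \log_{10}\left(\frac{4}{3}\right) \approx 0.1249, \qquad tf\text{-}idf_{t,d} \approx 0.3010 \times 0.1249 \approx 0.0376$$

A term that appears in every document has $idf_t = \log_{10}(1) = 0$, so its tf-idf is zero no matter how often it occurs.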

### What is expected? 
Your implementation should include the following two functions:
 * `compute_tfidf_weights(train_docs)`
 * `word_tfidf_vector(word, tf_df, idf_df)`
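
One possible layout for the two return values is sketched below, purely as an assumption: `tf_df` holds one row per term and one column per document, `idf_df` holds one row per term with a single `idf` column, and the 1xN vector for a word is its `tf` row scaled by its `idf`. The toy values, the column names and the helper `row_times_idf` are hypothetical; the assignment does not prescribe them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy weights for a 3-document corpus (values made up for illustration)
tf_df = pd.DataFrame(
    {"doc0": [0.30, 0.00], "doc1": [0.48, 0.30], "doc2": [0.00, 0.30]},
    index=["hello", "portal"],
)
idf_df = pd.DataFrame({"idf": [0.18, 0.48]}, index=["hello", "portal"])


def row_times_idf(word, tf_df, idf_df):
    """Return the word's tf-idf across all documents as a 1xN numpy array."""
    n_docs = tf_df.shape[1]
    if word not in tf_df.index:
        # Out-of-vocabulary word: all-zero vector (a choice made for this sketch)
        return np.zeros((1, n_docs))
    tf_row = tf_df.loc[word].to_numpy(dtype=float)
    return (tf_row * idf_df.loc[word, "idf"]).reshape(1, n_docs)


vec = row_times_idf("hello", tf_df, idf_df)
```

With this layout, N in "1xN" is the number of training documents, which matches the per-document definition of $tf_{t,d}$.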

To revise what TF-IDF is, you can review the lecture notes and the further reading under Week 7.


In [None]:
import nltk
nltk.download('all')

In [None]:
import math
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
set(stopwords.words('english'))

def compute_tfidf_weights(train_docs):
  '''
  Input arguments:
    train_docs : list of document strings
  Output arguments:
    df_tf : tf as a DataFrame (column 0: word, column 1: tf)
    df_idf : idf as a DataFrame (column 0: word, column 1: idf)
  '''
  docs_idf = {}
  docs_tf = {}
  wordfreq = {}
  for train in train_docs:
    corpus = nltk.sent_tokenize(train)

    # Lowercase, replace non-word characters with spaces, collapse whitespace
    for i in range(len(corpus)):
      corpus[i] = corpus[i].lower()
      corpus[i] = re.sub(r'\W', ' ', corpus[i])
      corpus[i] = re.sub(r'\s+', ' ', corpus[i])

    # Count every token; the counts accumulate across all documents
    for sentence in corpus:
      tokens = nltk.word_tokenize(sentence)
      for token in tokens:
        if token not in wordfreq:
          wordfreq[token] = 1
        else:
          wordfreq[token] += 1

  # tf = log10(count + 1), using the corpus-wide count of each word
  for word, count in wordfreq.items():
    docs_tf[word] = math.log10(count + 1)

  # Note: N is the number of distinct words here, and val is the corpus-wide count
  N = len(wordfreq)

  for word, val in wordfreq.items():
    docs_idf[word] = math.log10(N / float(val) + 1)

  # Two-column DataFrames: column 0 holds the word, column 1 the weight
  data_tf = list(docs_tf.items())
  data_idf = list(docs_idf.items())

  df_tf = pd.DataFrame(data_tf)
  df_idf = pd.DataFrame(data_idf)

  return df_tf, df_idf



In [None]:
lst_1 = ['Geeksforgeeks is a portal for geeks', 'hello how are you', 'hello am good how about yourself', 'am doing how are you agagian']
compute_tfidf_weights(lst_1)

(                0         1
 0   geeksforgeeks  0.301030
 1              is  0.301030
 2               a  0.301030
 3          portal  0.301030
 4             for  0.301030
 5           geeks  0.301030
 6           hello  0.477121
 7             how  0.602060
 8             are  0.477121
 9             you  0.477121
 10             am  0.477121
 11           good  0.301030
 12          about  0.301030
 13       yourself  0.301030
 14          doing  0.301030
 15        agagian  0.301030,                 0         1
 0   geeksforgeeks  1.230449
 1              is  1.230449
 2               a  1.230449
 3          portal  1.230449
 4             for  1.230449
 5           geeks  1.230449
 6           hello  0.954243
 7             how  0.801632
 8             are  0.954243
 9             you  0.954243
 10             am  0.954243
 11           good  1.230449
 12          about  1.230449
 13       yourself  1.230449
 14          doing  1.230449
 15        agagian  1.230449)

In [None]:
def word_tfidf_vector(word, tf_df, idf_df):
  '''
    Input arguments:
      word : a query string
      tf_df : tf as a DataFrame
      idf_df : idf as a DataFrame
    Output arguments:
      tf_idf_value : a numpy array of dimension 1xN
  '''
  # Convert the two-column DataFrames back into {word: value} lookups
  tf_df = dict(tf_df.values)
  idf_df = dict(idf_df.values)
  corpus = nltk.sent_tokenize(word)

  # Same cleaning as in compute_tfidf_weights
  for i in range(len(corpus)):
    corpus[i] = corpus[i].lower()
    corpus[i] = re.sub(r'\W', ' ', corpus[i])
    corpus[i] = re.sub(r'\s+', ' ', corpus[i])

  tfidf_values = {}
  for sentence in corpus:
    tokens = nltk.word_tokenize(sentence)
    for token in tokens:
      # Skip out-of-vocabulary tokens
      if token in tf_df and token in idf_df:
        # Note: each pass overwrites the entry, so the tf that remains
        # is the last value stored in tf_df
        for val in tf_df.values():
          tfidf_values[token] = val * idf_df[token]

  # Note: np.asarray on a dict gives a 0-d object array that wraps the dict
  tf_idf_model = np.asarray(tfidf_values)
  return tf_idf_model


In [None]:
lst_1=  ['Geeksforgeeks is a portal for geeks','hello how are you','hello am good how about yourself','am doing how are you again']
lst =  ' about yourself am doing how are you agagin doing gretaeakjsdkjahsdjkasd kjashdkjashdkbkashbd asjahdjka'
tdf,idf = compute_tfidf_weights(lst_1)
word_tfidf_vector(lst,tdf,idf)

array({'about': 0.37040203346725215, 'yourself': 0.37040203346725215, 'am': 0.2872556184789065, 'doing': 0.37040203346725215, 'how': 0.24131538171067718, 'are': 0.2872556184789065, 'you': 0.2872556184789065},
      dtype=object)

## Problem 2 Word embedding as features for classification

### Task
Implement legal case type classification on a legal case corpus. The corpus contains 39,155 legal cases, including 22,776 taken from the United States Supreme Court.

You can find more details about the dataset at https://osf.io/qvg8s/wiki/home/ 

Implement the classification with whichever libraries you need, using GloVe word embeddings as the features, loaded through Gensim as demonstrated below.

Report the accuracy and F1 score (micro- and macro-averaged).
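
To make the difference between the two averaging modes concrete, here is a minimal pure-Python sketch for single-label predictions. The function name `f1_scores` and the toy labels are hypothetical. In practice a library routine such as `sklearn.metrics.f1_score` with `average='micro'` or `average='macro'` would normally be used instead.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (accuracy, micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    accuracy = sum(tp.values()) / len(y_true)
    # Micro: pool the counts over all classes, then compute one F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return accuracy, micro, macro


# Hypothetical case-type labels
y_true = [41, 41, 41, 7, 7, 12]
y_pred = [41, 41, 7, 7, 7, 41]
accuracy, micro_f1, macro_f1 = f1_scores(y_true, y_pred)
```

For single-label classification the micro-averaged F1 equals the accuracy, so the macro-averaged score is the one that reveals weak performance on rare case types.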

### Dataset
The dataset can be downloaded at https://osf.io/qvg8s/files/.
The files of interest are highlighted below:

<center>
<img width="900px" src="https://drive.google.com/uc?id=1RUVQ8rGyjrv2gspluJM5f6lbqBS6k4yJ"> 
</center>


### Document representation
After cleaning, convert the words into their embeddings, then take the average over all the words in a case (document) to end up with a single vector representing each case. The case vector is then used for case classification.

In the process of finding the embeddings for each word, you can ignore out-of-vocabulary words.
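
The averaging step can be sketched as follows. A plain dictionary of made-up 3-dimensional vectors stands in for the GloVe model here, and the helper name `document_vector` is hypothetical. The same function should work with a Gensim `KeyedVectors` object, since it supports the same membership test (`word in model`) and indexing (`model[word]`), with `model.vector_size` as the dimension.

```python
import numpy as np


def document_vector(tokens, vectors, dim):
    """Average the embeddings of in-vocabulary tokens; zeros if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        # No known words at all: fall back to a zero vector (a choice for this sketch)
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Hypothetical 3-dimensional embeddings standing in for a GloVe model
toy_vectors = {
    "court": np.array([1.0, 0.0, 2.0]),
    "appeal": np.array([3.0, 2.0, 0.0]),
}
doc_vec = document_vector(["court", "appeal", "haceesa"], toy_vectors, dim=3)
empty_vec = document_vector(["haceesa"], toy_vectors, dim=3)
```

Out-of-vocabulary tokens are skipped rather than counted as zeros, so they do not pull the average towards the origin.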

### Classifier choice
The choice of classification method(s) is left to you. You are expected to experiment with more than one type of classifier and comment on your findings.

### Suggestion (Optional)
Consider saving a cleaned-up version of the dataset, after creating the embeddings, to a file that can be loaded and used for further experimentation.
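
One way to do this, sketched under assumptions, is to store the document vectors and their labels together in a NumPy `.npz` archive. The file name, the toy arrays and the temporary directory are placeholders for this example; in Colab the path would more likely point at the mounted Drive.

```python
import os
import tempfile

import numpy as np

# Hypothetical document vectors (2 cases, 3 dimensions) and their case-type labels
features = np.array([[2.0, 1.0, 1.0], [0.5, 0.0, 1.5]])
labels = np.array([41, 7])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "case_embeddings.npz")
    # Store both arrays in a single compressed file
    np.savez_compressed(path, features=features, labels=labels)

    # Later: reload without repeating the cleaning and embedding steps
    with np.load(path) as data:
        loaded_features = data["features"]
        loaded_labels = data["labels"]
```

Keeping features and labels in one file avoids the two getting out of sync between experiments.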

In [164]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [171]:
%%shell
DATA_URL='https://osf.io/8mjcy/download'

pushd /content
wget $DATA_URL -O data.zip
unzip -q data.zip
popd

/content /content
--2021-05-26 11:47:33--  https://osf.io/8mjcy/download
Resolving osf.io (osf.io)... 35.190.84.173
Connecting to osf.io (osf.io)|35.190.84.173|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://files.osf.io/v1/resources/qvg8s/providers/osfstorage/5c2c13130e8efd0017d0aa06?action=download&direct&version=1 [following]
--2021-05-26 11:47:33--  https://files.osf.io/v1/resources/qvg8s/providers/osfstorage/5c2c13130e8efd0017d0aa06?action=download&direct&version=1
Resolving files.osf.io (files.osf.io)... 35.186.214.196
Connecting to files.osf.io (files.osf.io)|35.186.214.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 228944703 (218M) [application/octet-stream]
Saving to: ‘data.zip’


2021-05-26 11:47:37 (79.8 MB/s) - ‘data.zip’ saved [228944703/228944703]

/content




In [190]:
import nltk 
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Downloading package cess_esp to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_esp.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipp

True

In [185]:
import gensim.downloader as api
import pandas as pd
import os
import glob
list(api.info()['models'].keys())

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']

In [186]:
model = api.load("glove-wiki-gigaword-50")
model.most_similar("pneumatic")



[('hydraulic', 0.8155338168144226),
 ('actuators', 0.7667093276977539),
 ('sprinkler', 0.7374148368835449),
 ('valve', 0.727166473865509),
 ('actuation', 0.7141326069831848),
 ('hose', 0.7138993144035339),
 ('paddles', 0.7132105827331543),
 ('valves', 0.709661066532135),
 ('high-pressure', 0.7025710344314575),
 ('turntable', 0.7003635168075562)]

In [173]:
import pandas as pd
import os
# rootdir = "/content/preprocessed_cases[cases_29404]"
# newrootdir = '/content/preprocessed_cases'
# #os.rename(rootdir, newrootdir) #Rename the directory, run only once
# x=[]
# y=[]
# for subdir, dirs, files in os.walk(newrootdir):
#   for file in files:
#     with open(os.path.join(subdir, file), 'r') as f:
#       x.append(f.read())
#       y.append(subdir.strip('/content/preprocessed_cases/'))

In [166]:
import pickle

unique_cases_file = '/content/drive/MyDrive/CSE5NLP/unique_cases.zip (Unzipped Files)/unique_cases_dict.pickle'

with open(unique_cases_file, 'rb') as handle:
    unique_cases_folders = pickle.load(handle)

In [178]:
text=[]
casetype=[]
for files in unique_cases_folders.keys():
  for file in unique_cases_folders[files]:
    with open(os.path.join("/content/preprocessed_cases[cases_29404]/"+files, file), 'r') as f:
      text.append(f.read())
      casetype.append(files)

In [181]:
df = pd.DataFrame({'CaseType': casetype,'sentence':text})
df.head()


| | CaseType | sentence |
|---|---|---|
| 0 | 41 | this case is the second appeal to this court i... |
| 1 | 41 | on the evening of saturday,april,hardy haceesa... |
| 2 | 41 | brockton hospital petitions for review of a de... |
| 3 | 41 | in this action under the individuals with disa... |
| 4 | 41 | this case involves an issue that has repeatedl... |


In [195]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemma = WordNetLemmatizer()

def clean(text):
  text = re.sub(r'http\S+','',text)
  text = re.sub('[^a-zA-Z]',' ',text)
  text = str(text).lower()
  text = word_tokenize(text)
  text = [item for item in text if item not in stop_words]
  text = [lemma.lemmatize(word=w,pos='v') for w in text]
  text = [i for i in text if len(i)>2]
  text = ' '.join(text)
  return text 

df['CleanSentence'] = df['sentence'].apply(clean)


In [196]:
df.head()

| | CaseType | sentence | CleanSentence |
|---|---|---|---|
| 0 | 41 | this case is the second appeal to this court i... | case second appeal court patent litigation cro... |
| 1 | 41 | on the evening of saturday,april,hardy haceesa... | even saturday april hardy haceesa walk hospita... |
| 2 | 41 | brockton hospital petitions for review of a de... | brockton hospital petition review decision ord... |
| 3 | 41 | in this action under the individuals with disa... | action individuals disabilities education act ... |
| 4 | 41 | this case involves an issue that has repeatedl... | case involve issue repeatedly come federal cou... |


In [211]:
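from gensim.models import Word2Vec
# Word2Vec expects token lists, not raw strings; a separate name keeps the GloVe `model`
tokenised_cases = df['CleanSentence'].str.split().tolist()
w2v_model = Word2Vec(tokenised_cases, min_count=1)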

## Problem 3 POS for classification

Robots and chat bots receive different commands to do certain tasks. 

Write a simple program that receives interactions in the form of a sentence and returns:
* A tuple of (command, object) if the sentence is a command
* None if the sentence is not a command

To write this function, you can utilize a part-of-speech tagger or named-entity recognizer from libraries like NLTK and spaCy.

Consider the following EXAMPLE sentences:

* Commands:
  * Grab the book
  * Fetch the ball
  * Open the jar
  * Can hand this spoon to John?

* Not commands:
  * Hey, how is it going?
  * How is your day today?
  * Do you like the weather?

This list is not exhaustive; your function should be able to handle more cases.

### Expected outcome:
1. A function that performs the task
2. If your function has limitations, highlight those limitations with examples.
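
As one possible starting point, the sketch below applies a few hand-written rules to a sentence that has already been tagged with Penn Treebank part-of-speech tags (the tag set produced by `nltk.pos_tag`). The function name `extract_command`, the rules, and the hand-supplied tags in the example are all assumptions; a real tagger may label these words differently, and in particular may mis-tag a sentence-initial imperative verb as a noun.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
SUBJECT_PRONOUNS = {"i", "you", "he", "she", "we", "they"}


def extract_command(tagged):
    """Return (command, object) for an imperative-looking sentence, else None.

    `tagged` is a list of (word, Penn Treebank tag) pairs.
    """
    # Drop punctuation and lowercase the words
    words = [(w.lower(), t) for w, t in tagged if t not in {".", ","}]
    if not words:
        return None
    # Allow a leading modal or "please" before the verb, e.g. "Can hand ..."
    start = 1 if words[0][1] == "MD" or words[0][0] == "please" else 0
    if start >= len(words) or words[start][1] != "VB":
        return None
    verb = words[start][0]
    # A subject pronoun right after the verb suggests a question ("Do you ...")
    if start + 1 < len(words) and words[start + 1][0] in SUBJECT_PRONOUNS:
        return None
    # Take the first noun after the verb as the object
    for word, tag in words[start + 1:]:
        if tag in NOUN_TAGS:
            return (verb, word)
    return None


grab = extract_command([("Grab", "VB"), ("the", "DT"), ("book", "NN")])
```

This sketch has clear limitations of its own: it returns `None` for a polite request such as "Can you hand this spoon to John?", because the pronoun after the modal is not a base-form verb, and it ignores commands without a noun object such as "Run faster".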

In [163]:
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS
from spacy import displacy

# load the English en_core_web_sm model for vocabulary, syntax & entities
nlp = en_core_web_sm.load()
strn = " "
filtered_sent={}
def command(data):
  doc = nlp(data)
  displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
  for word in doc:
    if word.pos_ =="NOUN":
      if word.dep_=="dobj":
        # print(word.head.pos_)
        # print("inside dep")
        if word.head.pos_=="DET":
          # print("2")
          return (doc,word)
        elif word.head.pos_=="VERB":
          strn = [child for child in word.children]
          # print(strn)
          string =','.join(str(v) for v in strn)
          # print(string)
          dc = nlp(string)
          for i in dc:
            print(i.pos_)
            if i.pos_ == "DET":
              return (doc,word)
            else:
              return None
   
    elif word.pos_=="ADV":
      if word.head.pos_=="PROPN" or word.head.pos_=="VERB":
        return (doc,word)
    
lis = ["Grab the book","Fetch the ball","Open the jar","Can hand this spoon to John?"]
lis2 = ["Hey, how is it going?", "How is your day today?", "Do you like the weather?"]
lis3 = ["Hey, how is it going?","How is your day today?"]
print(command("Do you like the weather?"))


[the]
the
DET
(Do you like the weather?, weather)


In [153]:
import spacy
spacy.explain("ADV")

'adverb'

In the model above, I look for a noun that is a direct object with a determiner before it, and classify such a sentence as a command. For some of the given examples the model gives the wrong output, as shown in the graph above. This is a limitation of my model. Furthermore, I also check for adverb commands such as "run faster" or "swim faster". One more limitation I found is that "Sing a song" does not work, whereas "Play the song" does.
