## Q1

Using Version 3.8.0 or higher of the Stanford parser https://nlp.stanford.edu/software/lex-parser.shtml#History, 
parse the following corpus, where each file represents one genre
of text:

https://www.dropbox.com/s/vezkx8znrheti90/Brown_tokenized_text.zip?dl=0

For sentences containing 50 words or fewer (punctuation does not count), obtain the part-of-speech
tags, the context-free phrase structure grammar representation, and the typed dependency
representation as shown in the following sample output:

https://nlp.stanford.edu/software/lex-parser.shtml#Sampl

In [2]:
import os 
import re
import time
import string
import nltk
from nltk.parse import CoreNLPParser
from stanfordcorenlp import StanfordCoreNLP

In [19]:
# Folder containing one tokenized .txt file per genre
corpus_dir = r"C:\Users\mm199\NLP\HW2\Brown_tokenized_text"
data_dict = {}
for name_dataset in os.listdir(corpus_dir):
    if name_dataset.endswith(".txt"):
        file_path = os.path.join(corpus_dir, name_dataset)
        # Undecodable bytes are skipped rather than raising an error
        with open(file_path, encoding="utf8", errors="ignore") as file:
            data_dict[name_dataset] = file.read()

In [115]:
# Dump the raw text of every file for inspection
with open("data_dict.txt", "w", encoding="utf8") as file:
    for name, content in data_dict.items():
        file.write(f"{name} : {content}\n")

In [14]:
nlp = StanfordCoreNLP(r'C:\Users\mm199\NLP\HW2\stanford-corenlp-full-2018-02-27\stanford-corenlp-full-2018-02-27')

In [20]:
# Generate POS tags, the context-free phrase structure parse, and the typed
# dependency representation for every document.
# Snapshot the genre files first, because new keys are added to data_dict below.
filenames = [name for name in data_dict if name.endswith(".txt")]
for key in filenames:
    data_dict[key+'_pos'] = []
    data_dict[key+'_cfg'] = []
    data_dict[key+'_dep'] = []
    # Approximate sentence boundaries by splitting on a standalone period
    for sentences in data_dict[key].split(' . '):
        # Number of words in the sentence, ignoring punctuation
        sent_len = len(nlp.word_tokenize(re.sub(r'[^\w\s]','', sentences)))
        # Keep only non-empty sentences with 50 words or fewer
        if 0 < sent_len <= 50:
            # POS tags
            data_dict[key+'_pos'].append(nlp.pos_tag(sentences))
            # Context-free phrase structure grammar representation
            data_dict[key+'_cfg'].append(nlp.parse(sentences))
            # Typed dependency representation
            data_dict[key+'_dep'].append(nlp.dependency_parse(sentences))
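
The length filter can also be checked without a running CoreNLP server. The sketch below is a minimal stand-alone version of that check. It assumes the text is already whitespace-tokenized, and the helper names `count_words` and `keep_sentence` are hypothetical, not part of the notebook.

```python
import re

def count_words(sentence):
    """Count whitespace-separated tokens that contain at least one letter or digit."""
    # Tokens made only of punctuation (e.g. "," or "``") are not counted
    return sum(1 for token in sentence.split() if re.search(r"[A-Za-z0-9]", token))

def keep_sentence(sentence, max_words=50):
    """Return True for non-empty sentences with at most max_words words."""
    return 0 < count_words(sentence) <= max_words

# Illustrative sentence in a whitespace-tokenized style
sample = "The jury said , `` it did find that many of Georgia's laws are outmoded ''"
print(count_words(sample), keep_sentence(sample))
```

Because this version splits on whitespace instead of calling the tokenizer, its counts may differ slightly from the ones used in the cell above.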

## 1.1

Using the part-of-speech tags that the parser has given you, report the number of verbs in
each file. Also report the part-of-speech tags that you are using to identify the verbs.

In [35]:
def vb_prep_count(res):
    """Count each verb tag (VB, VBD, VBG, VBN, VBP, VBZ) in a list of POS-tagged sentences."""
    different_vbs = {}
    for each_res in res:
        for word, pos_tag in each_res:
            # All Penn Treebank verb tags start with "VB"
            if pos_tag.startswith("VB"):
                different_vbs[pos_tag] = different_vbs.get(pos_tag, 0) + 1
    return different_vbs

In [53]:
for file in data_dict:
    # Keys ending in "_pos" hold the POS-tagged sentences of one genre file
    if file.endswith("_pos"):
        print ("\nFor file: ", file[:-4])
        different_vbs = vb_prep_count(data_dict[file])
        print ("POS VB considered: ", different_vbs)
        print ("Total VB tags: ", sum(different_vbs.values()))



For file:  government.txt
POS VB considered:  {'VBZ': 1081, 'VBP': 823, 'VBG': 802, 'VBN': 1661, 'VB': 1706, 'VBD': 764}
Total VB tags:  6837

For file:  mystery.txt
POS VB considered:  {'VBD': 4221, 'VBG': 893, 'VB': 2013, 'VBN': 1140, 'VBP': 732, 'VBZ': 427}
Total VB tags:  9426

For file:  news.txt
POS VB considered:  {'VBD': 3714, 'VBZ': 1589, 'VBN': 2248, 'VB': 2535, 'VBG': 1296, 'VBP': 978}
Total VB tags:  12360

For file:  reviews.txt
POS VB considered:  {'VBZ': 1185, 'VBD': 887, 'VBP': 492, 'VBN': 762, 'VB': 799, 'VBG': 469}
Total VB tags:  4594

For file:  romance.txt
POS VB considered:  {'VBD': 4979, 'VB': 2325, 'VBN': 1235, 'VBG': 1180, 'VBZ': 485, 'VBP': 821}
Total VB tags:  11025
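
The same tally can be written more compactly with `collections.Counter`. This is a minimal sketch on hand-made data: the `tagged_sentences` list is illustrative, not parser output, and `count_verb_tags` is a hypothetical helper name. It assumes each sentence is a list of `(word, tag)` pairs, which is the shape `vb_prep_count` iterates over.

```python
from collections import Counter

def count_verb_tags(tagged_sentences):
    """Tally Penn Treebank verb tags (those starting with 'VB') across sentences."""
    return Counter(
        tag
        for sentence in tagged_sentences
        for _word, tag in sentence
        if tag.startswith("VB")
    )

# Hand-made illustrative input, not actual parser output
tagged_sentences = [
    [("She", "PRP"), ("was", "VBD"), ("running", "VBG"), ("home", "NN")],
    [("They", "PRP"), ("run", "VBP"), ("and", "CC"), ("walked", "VBD")],
]
verb_counts = count_verb_tags(tagged_sentences)
print(dict(verb_counts), sum(verb_counts.values()))
```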


## 1.2 
Report the number of sentences parsed; do so by searching for ROOT in either the dependency
representation or in the context-free phrase structure grammar representation


In [58]:
num_of_sentences = 0
for file in data_dict:
    # Keys ending in "_cfg" hold the phrase structure trees of one genre file
    if file.endswith("_cfg"):
        print ("\nFor file: ", file[:-4])
        # Each successful parse is a bracketed tree that begins with "(ROOT"
        sent = sum(tree.startswith("(ROOT") for tree in data_dict[file])
        print("Number of sentences parsed: ", sent)
        num_of_sentences += sent
print("\nTotal sentences parsed: ", num_of_sentences)



For file:  government.txt
Number of sentences parsed:  2315

For file:  mystery.txt
Number of sentences parsed:  3293

For file:  news.txt
Number of sentences parsed:  3947

For file:  reviews.txt
Number of sentences parsed:  1499

For file:  romance.txt
Number of sentences parsed:  3661

Total sentences parsed:  14715
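
As a cross-check, ROOT can also be counted in the dependency representation instead of the phrase structure trees. The sketch below assumes each parsed sentence is a list of `(relation, head_index, dependent_index)` triples with the root relation labelled `ROOT`, which is how the `stanfordcorenlp` wrapper is understood to return them. The sample triples are hand-made and `count_roots` is a hypothetical name.

```python
def count_roots(dep_sentences):
    """Count ROOT relations over a list of dependency-parsed sentences."""
    return sum(
        1
        for triples in dep_sentences
        for relation, _head, _dependent in triples
        if relation == "ROOT"
    )

# Hand-made illustrative triples for two short sentences
dep_sentences = [
    [("ROOT", 0, 2), ("nsubj", 2, 1), ("punct", 2, 3)],
    [("ROOT", 0, 3), ("det", 3, 1), ("amod", 3, 2)],
]
print(count_roots(dep_sentences))
```

If a chunk produced by the `' . '` split contains more than one sentence, this count could differ from the tree-based count, so the two numbers need not agree exactly.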


## 1.3
Using the dependency representation (or the context-free phrase structure grammar
representation) that the parser has given you, report the total number of prepositions found in
each file. In addition, report the three most common prepositions overall.

In [112]:
from collections import Counter

total_prep = {}
for file in data_dict:
    if file.endswith("_cfg"):
        preposition = []
        print ("\nFor file: ", file[:-4])
        for i in data_dict[file]:
            # The IN tag covers prepositions and subordinating conjunctions
            preposition.extend(re.findall(r"IN \w+", i))
        prep_counter = Counter(preposition)
        print("Number of preposition found: ", sum(prep_counter.values()))
        total_prep[file] = prep_counter
# Combine the counts across files and track how many files contain each preposition
common_prep = {}
found_across_all = {}
for file in total_prep:
    for key in total_prep[file]:
        if key not in common_prep:
            common_prep[key] = total_prep[file][key]
            found_across_all[key] = 1
        else:
            found_across_all[key] += 1
            common_prep[key] += total_prep[file][key]
num = 0
common_prep = sorted(common_prep.items(),key = lambda x:x[1], reverse = True)
# Report the three most frequent prepositions that occur in every file
for key in common_prep:
    key_with_prep = key[0]
    if found_across_all[key_with_prep] == len(total_prep):
        print ("Top preposition","\'", key_with_prep.split(" ")[1],"\' :", key[1])
        num += 1
        if num == 3:
            break





For file:  government.txt
Number of preposition found:  7127

For file:  mystery.txt
Number of preposition found:  5045

For file:  news.txt
Number of preposition found:  10857

For file:  reviews.txt
Number of preposition found:  4095

For file:  romance.txt
Number of preposition found:  5555
Top preposition ' of ' : 7852
Top preposition ' in ' : 4720
Top preposition ' for ' : 2441
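
A stricter pattern, which anchors on the full `(IN word)` leaf and lowercases each match, is sketched below. The bracketed trees are hand-made examples in the Penn Treebank style and `extract_prepositions` is a hypothetical helper. Because the pattern differs from the one used in the cell above, it would not necessarily reproduce the reported counts.

```python
import re
from collections import Counter

# Matches a complete leaf such as "(IN of)" and captures the word.
# Note that IN also tags subordinating conjunctions, not only prepositions.
IN_LEAF = re.compile(r"\(IN ([^\s()]+)\)")

def extract_prepositions(tree_string):
    """Return the lowercased words tagged IN in one bracketed parse tree."""
    return [word.lower() for word in IN_LEAF.findall(tree_string)]

# Hand-made illustrative trees, not actual parser output
trees = [
    "(ROOT (S (NP (DT The) (NN head)) (VP (VBD spoke) (PP (IN of) (NP (NNS plans))) (PP (IN in) (NP (NNP May)))) (. .)))",
    "(ROOT (S (PP (IN In) (NP (NN spite)) (PP (IN of) (NP (NN rain)))) (NP (PRP we)) (VP (VBD left)) (. .)))",
]
counts = Counter(p for tree in trees for p in extract_prepositions(tree))
print(counts.most_common(3))
```

Lowercasing merges sentence-initial forms such as "In" with "in", which the case-sensitive pattern counts separately.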


## 1.4
Take a look at the constituent parsing and dependency parsing results. List out two common
errors made in each type of parsing results, and briefly discuss potential methods to reduce
each type of error.

In [None]:
# For constituent parsing
# 1. POS tagging is inconsistent, which changes the resulting parse tree. For instance, a verb is sometimes tagged as a noun.
# 2. For some sentences, the word under ROOT is not the one expected.

# For dependency parsing
# 1. POS tags are not always correct, which changes the relations assigned between words.
# 2. Some of the resulting dependency relations are not the ones expected.

# Method to reduce errors -
# Training the POS tagger on more data should give more robust and more consistent tags,
# which in turn should reduce the tagging-driven errors in both kinds of parse.