# Introduction to Natural Language Processing: Assignment 3

In this exercise, we'll practice feature extraction using spaCy.


- You can only use built-in Python packages, spaCy and Pandas.
- Please comment your code
- Submissions are due Sunday at 23:59 **only** on Ilias: **Assignments >> Student Submissions >> Assignment 3 (Deadline: 24.05.2022, at 23:59)**

- Name the file appropriately: "Assignment_3_\<Your_Name\>.ipynb" and submit only the Jupyter Notebook file.

### Task 1 (2 points)

Write a function `find_top_five(my_file_name)` that takes as input a file name, reads its text, converts everything to lower case and returns the five most frequent tokens and their absolute frequency in a Python dictionary. 

**Note:**

You should ignore punctuation and use spaCy to obtain the tokens.


In [1]:
import string
from collections import Counter
import spacy

def find_top_five(my_file_name):
    # load the English language model
    nlp = spacy.load('en_core_web_sm')

    with open(my_file_name, 'r') as file:
        # convert all characters to lower case and strip all punctuation
        text = file.read().lower().translate(str.maketrans('', '', string.punctuation))

    # tokenize the text with the spaCy language model
    doc = nlp(text)

    # count the tokens, skipping whitespace and other non-alphanumeric tokens
    token_counts = Counter(token.text for token in doc if token.text.isalnum())

    # keep the five most frequent tokens as a {token: frequency} dictionary
    return dict(token_counts.most_common(5))


print(find_top_five('corpus.txt'))

{'this': 4, 'is': 4, 'the': 4, 'document': 4, 'first': 2}
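
The task asks for a dictionary, whereas `Counter.most_common` yields a list of `(token, count)` pairs. The conversion can be sketched in isolation as follows. The helper name `top_n_as_dict` and the token list are hypothetical; the list stands in for tokens that spaCy would produce.

```python
from collections import Counter


def top_n_as_dict(tokens, n=5):
    """Return the n most frequent tokens mapped to their absolute counts."""
    counts = Counter(tokens)
    # dict() keeps the order of most_common(), so the result stays sorted by frequency
    return dict(counts.most_common(n))


# Hypothetical, already lower-cased tokens standing in for spaCy's output
tokens = ["this", "is", "the", "first", "document", "this", "is", "the", "second"]
print(top_n_as_dict(tokens, n=3))  # {'this': 2, 'is': 2, 'the': 2}
```

Tokens with equal counts keep the order in which they were first seen, so ties are broken by position in the text.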


### Task 2 (2 points)

Write a function `extract_proper_nouns(my_file_name)` that takes a file name as input and returns a list containing all proper nouns with more than one token.

**Example:**

text = "Hong Kong and Japan are two countries in Asia and New York is the largest city in the world"

return = `["New York", "Hong Kong"]` **(Note: it should not return "Japan")**

In [2]:
import spacy

def extract_proper_nouns(my_file_name):
    # Load the English language model in spaCy
    nlp = spacy.load('en_core_web_sm')

    # Read the contents of the file
    with open(my_file_name, 'r') as file:
        text = file.read()

    # Process the text using spaCy
    doc = nlp(text)

    # Keep noun chunks that span more than one token and whose root is a proper noun
    proper_nouns = []
    for chunk in doc.noun_chunks:
        if len(chunk) > 1 and chunk.root.pos_ == 'PROPN':
            proper_nouns.append(chunk.text)

    return proper_nouns

# Print the list of proper nouns and the total count
proper_nouns = extract_proper_nouns('task02.txt')
print("Proper Nouns:")
print(proper_nouns)
print("Total Count:", len(proper_nouns))


Proper Nouns:
['Hong Kong', 'New York']
Total Count: 2
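
Noun chunks may also contain determiners or adjectives, so a possible alternative is to join runs of consecutive tokens tagged `PROPN`. The following is a minimal sketch of that grouping step only, not the method used in the solution cell. The helper name `multi_token_proper_nouns` and the pre-tagged pairs are hypothetical; in practice the pairs would come from `token.text` and `token.pos_`.

```python
from itertools import groupby


def multi_token_proper_nouns(tagged_tokens):
    """Join runs of consecutive PROPN tokens and keep runs longer than one token."""
    result = []
    # groupby splits the sequence into alternating PROPN and non-PROPN runs
    for is_propn, run in groupby(tagged_tokens, key=lambda pair: pair[1] == "PROPN"):
        words = [text for text, _ in run]
        if is_propn and len(words) > 1:
            result.append(" ".join(words))
    return result


# Hypothetical (text, pos) pairs; the tags are assumed, not produced by a model here
tagged = [
    ("Hong", "PROPN"), ("Kong", "PROPN"), ("and", "CCONJ"), ("Japan", "PROPN"),
    ("are", "AUX"), ("in", "ADP"), ("Asia", "PROPN"), ("and", "CCONJ"),
    ("New", "PROPN"), ("York", "PROPN"), ("is", "AUX"), ("large", "ADJ"),
]
print(multi_token_proper_nouns(tagged))  # ['Hong Kong', 'New York']
```

Single-token names such as "Japan" and "Asia" form runs of length one and are therefore dropped.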


### Task 3 (3 points)

Write a function `common_lemma(my_file_name)` that takes a file name as input and returns a Python dictionary with each lemma as `key` and, as `value`, a list of the verbs and nouns that share that lemma.

**Examples:**

1.
text = "When users google a word or any query, their system internally runs a pipeline in order to process what the person is querying."

return = `{"query": ["query", "querying"]}`

2.
text = "I really loved the movie and show, the movie was showing reality but it showed sometimes nonsense!"

return = `{"show": ["show", "showing", "showed"]}` **(Note: it should not return "movie" because both "movie"s are NOUN)**

In [6]:
import spacy


def common_lemma(my_file_name):
    tokens_with_common_lemma = {}
    # part-of-speech tags observed for each lemma
    pos_per_lemma = {}

    # load the English language model and read the file
    nlp = spacy.load('en_core_web_sm')
    with open(my_file_name, 'r') as file:
        text = file.read()
    doc = nlp(text)

    for token in doc:
        # skip all tokens that are neither a noun nor a verb
        if token.pos_ not in {'NOUN', 'VERB'}:
            continue
        lemma = token.lemma_
        pos_per_lemma.setdefault(lemma, set()).add(token.pos_)
        # collect each distinct word form once under its lemma
        forms = tokens_with_common_lemma.setdefault(lemma, [])
        if token.text not in forms:
            forms.append(token.text)

    # keep only the lemmas that occur both as a noun and as a verb
    return {lemma: forms for lemma, forms in tokens_with_common_lemma.items()
            if pos_per_lemma[lemma] == {'NOUN', 'VERB'}}


print(common_lemma("task03_1.txt"))
print(common_lemma("task03_2.txt"))

{'query': ['query', 'querying']}
{'show': ['show', 'showing', 'showed']}
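
Counting distinct word forms alone cannot separate a noun/verb pair from two forms of the same noun (for example "movie" and "movies"). One way to make the check explicit is to record the part-of-speech tags seen for each lemma. This is a minimal sketch of that bookkeeping; the helper name `group_noun_verb_lemmas` and the pre-analyzed triples are hypothetical stand-ins for `token.text`, `token.lemma_` and `token.pos_`.

```python
from collections import defaultdict


def group_noun_verb_lemmas(analyzed_tokens):
    """Map each lemma seen as both NOUN and VERB to its distinct word forms."""
    forms = defaultdict(list)  # lemma -> distinct surface forms, in order of appearance
    tags = defaultdict(set)    # lemma -> part-of-speech tags observed for it
    for text, lemma, pos in analyzed_tokens:
        if pos not in {"NOUN", "VERB"}:
            continue
        tags[lemma].add(pos)
        if text not in forms[lemma]:
            forms[lemma].append(text)
    # a lemma qualifies only if it was seen with both tags
    return {lemma: forms[lemma] for lemma in forms if tags[lemma] == {"NOUN", "VERB"}}


# Hypothetical (text, lemma, pos) triples; the tags are assumed, not model output
analyzed = [
    ("movie", "movie", "NOUN"), ("show", "show", "NOUN"), ("movie", "movie", "NOUN"),
    ("was", "be", "AUX"), ("showing", "show", "VERB"), ("showed", "show", "VERB"),
]
print(group_noun_verb_lemmas(analyzed))  # {'show': ['show', 'showing', 'showed']}
```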


### Task 4 (3 points)

Write a function `token_matcher(string)` that takes the following text as input and prints a list of all the engineering courses mentioned in it, along with the total number of engineering courses.

**Hint:** You should use `Matcher` in spaCy and define a pattern for it.

In [7]:
text = """If you choose to study chemical engineering, you may like to specialize in chemical reaction engineering, plant design, process engineering, process design or transport phenomena. Civil engineering is the professional practice of designing and developing infrastructure projects. This can be on a huge scale, such as the development of
nationwide transport systems or water supply networks, or on a smaller scale, such as the development of single roads or buildings.
specializations of civil engineering include structural engineering, architectural engineering, transportation engineering, geotechnical engineering,
environmental engineering and hydraulic engineering. If you study aeronautical engineering, you could specialize in aerodynamics, aeroelasticity, 
composites analysis, avionics, propulsion and structures and materials. Computer engineering concerns the design and prototyping of computing hardware and software. 
This subject merges electrical engineering with computer science, oldest and broadest types of engineering, mechanical engineering is concerned with the design,
manufacturing and maintenance of mechanical systems."""

import spacy
from spacy.matcher import Matcher

def token_matcher(string):
    # Load the English language model in spaCy
    nlp = spacy.load('en_core_web_sm')

    # Process the text using spaCy
    doc = nlp(string)

    # Initialize the Matcher
    matcher = Matcher(nlp.vocab)

    # Pattern for engineering courses: one or more adjectives or nouns
    # followed by the word "engineering"
    pattern = [{'POS': {'IN': ['ADJ', 'NOUN', 'PROPN']}, 'OP': '+'}, {'LOWER': 'engineering'}]

    # Add the pattern; greedy='LONGEST' keeps only the longest of overlapping matches
    matcher.add('ENGINEERING_COURSE', [pattern], greedy='LONGEST')

    # Extract the matched phrases in document order, skipping duplicates
    engineering_courses = []
    for _, start, end in sorted(matcher(doc), key=lambda match: match[1]):
        course = doc[start:end].text.lower()
        if course not in engineering_courses:
            engineering_courses.append(course)

    # Print the list of engineering courses and the total count
    print("Engineering Courses:")
    print(engineering_courses)
    print("Total Count:", len(engineering_courses))

# Example usage
token_matcher(text)

Engineering Courses:
['chemical engineering', 'chemical reaction engineering', 'process engineering', 'civil engineering', 'structural engineering', 'architectural engineering', 'transportation engineering', 'geotechnical engineering', 'environmental engineering', 'hydraulic engineering', 'aeronautical engineering', 'computer engineering', 'electrical engineering', 'mechanical engineering']
Total Count: 14
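
The mechanics of `Matcher` from the hint can be tried out on a single sentence without downloading a trained model. This sketch assumes a blank English pipeline, which only tokenizes, so the pattern relies on token text rather than part-of-speech tags. The pattern and the example sentence are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes, so no trained model has to be downloaded
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical pattern: any alphabetic token followed by the word "engineering"
pattern = [{"IS_ALPHA": True}, {"LOWER": "engineering"}]
matcher.add("ENGINEERING_COURSE", [pattern])

doc = nlp("She studied civil engineering, then moved on to chemical engineering.")

# Each match is a (match_id, start, end) triple of token offsets into doc
matches = sorted(matcher(doc), key=lambda match: match[1])
courses = [doc[start:end].text.lower() for _, start, end in matches]
print(courses)  # ['civil engineering', 'chemical engineering']
```

With a trained model, the first pattern element can be restricted by `POS` instead of `IS_ALPHA`, which avoids matching verbs or prepositions that happen to precede "engineering".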
