### <h1> NLP and the Web: Home Exercise 2 </h1>

As discussed in class, <b>spaCy</b> is a useful open-source library that enables the user to perform several NLP tasks with high-quality results. It is not only helpful for beginners in NLP but also for advanced programmers who want to integrate NLP features into real products.

For this exercise, you should only use spaCy; but you may use numpy and pandas if needed. Of course, you are also allowed to use the entire [Python Standard Library](https://docs.python.org/3.9/library/index.html). Please follow the instructions given below. In case of questions, use our Discussion forum in Moodle.

In [1]:
# imports
from typing import List, Mapping
import pandas as pd
import numpy as np
import spacy
from spacy import displacy
from spacy.matcher import Matcher

# download the language model if you haven't already (you may have to restart your Python kernel)
# spacy.cli.download("en_core_web_sm")

nlp = spacy.load('en_core_web_sm')

# this is a bit of a hacky patch for spaCy's missing handling of contractions (haven't, she'll, I'm) in version 3
# don't worry about it
try:
    nlp.get_pipe("attribute_ruler").add([[{"LOWER": "n't"}]], {"LEMMA": "not"})
    nlp.get_pipe("attribute_ruler").add([[{"LOWER": "'ll"}]], {"LEMMA": "will"})
    nlp.get_pipe("attribute_ruler").add([[{"LOWER": "'ve"}]], {"LEMMA": "have"})
    nlp.get_pipe("attribute_ruler").add([[{"LOWER": "'m"}]], {"LEMMA": "be"})
except KeyError:
    print("I'm sorry")

I'm sorry


### General Information
We tried to make the description of the parameters as clear as possible. However, if you believe that something is missing, please reach out to us in Moodle and we will try to help you out.  

We provide type hints for function parameters and return values of functions that you have to implement in the tasks. These are suggestions only, and you may use different types if you prefer. As long as you produce the required output in a coherent and understandable way, you can get full points. 

We use the term 'array-like object' to loosely refer to collection types like lists, arrays, maps, dataframes, etc.

### About the Corpus

In this exercise, you will work with a real dataset of public English-language tweets containing the keyword 'lockdown', posted between Dec 14th and Dec 22nd 2020. It was originally collected for use in a psychological experiment investigating the public perception of COVID lockdowns.  

Tweets were scraped from Twitter search results using the [snscrape](https://github.com/JustAnotherArchivist/snscrape) tool on Dec 22nd 2020. All links and @mentions were removed. The subset of the corpus you are working on has been further trimmed down to reduce spam and off-topic content present in the dataset.

## Task 1 - 5 Points
To get started, you will have to read the dataset from the provided `tweets.txt` file. Each line in this file represents a single tweet. You will need to open and read the file before starting the other subtasks.

**Hint 1**: Depending on how you read the dataset, you may have to remove linebreaks from the end of the tweets. You can use the [`rstrip`](https://docs.python.org/3.9/library/stdtypes.html) function to do so.  
**Hint 2**: You may have to select 'utf-8' as the encoding when opening the file.  
**Hint 3**: For this task you have to use some spaCy functions. You can find some useful information about spaCy tokens and their attributes [here](https://spacy.io/api/token). 
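
The first two hints can be sketched as follows. This is only a minimal illustration: the helper `read_lines`, the file name `sample_tweets.txt`, and its two sample lines are made up for the example, and a temporary file stands in for `tweets.txt`.

```python
import os
import tempfile
from typing import List


def read_lines(path: str) -> List[str]:
    """Read a UTF-8 text file and return its lines without trailing line breaks."""
    with open(path, "r", encoding="utf-8") as f:
        # rstrip() removes the trailing "\n" (and any other trailing whitespace)
        return [line.rstrip() for line in f]


# hypothetical sample data, only used for this illustration
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_tweets.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("first tweet 😷\nsecond tweet\n")
    sample_lines = read_lines(sample_path)

print(sample_lines)
```

Iterating over the file object directly avoids loading the file twice, which `readlines()` followed by a second list would do.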

##### a) Tokenize each tweet in the dataset, then print the tokenized versions of the first five tweets ("token1", "token2", "token3"...). Use spaCy to solve this task. 

In [2]:
%%time
# read the file: one tweet per line, with trailing line breaks removed
with open('tweets.txt', 'r', encoding='utf-8') as f:
    contents = [line.rstrip() for line in f]

# tokenize every tweet with spaCy
docs = [nlp(tweet) for tweet in contents]
for i in range(5):
    print('token' + str(i) + ': ', list(docs[i]))
    print('----------------------------------------')

token0:  [Nothing, stopping, the, nphet, disciples, from, following, their, dogma, in, private, ., No, one, is, forcing, you, and, your, fellow, cult, members, to, not, lockdown, and, follow, your, Leaders]
----------------------------------------
token1:  [After, making, it, through, our, tough, lockdown, in, Melbourne, ,, I, 'll, be, missing, my, son, at, Xmas, ,, police, officer, deployed, up, to, the, border, checkpoints, until, Sunday, ., Wish, you, guys, in, Sydney, wore, masks, 😷]
----------------------------------------
token2:  [Seriously, ,, another, variant, of, the, SARS, Cov-2, perhaps, a, mutated, SARS, Cov-3, virus, would, be, inimical, to, world, growth, only, heaven, knows, if, the, vaccine, would, be, effective, for, the, new, variant, whichsoever, is, a, doubt, ,, I, only, pray, another, phase, of, global, lockdown, does, n't, looms]
----------------------------------------
token3:  [How, about, reply, to, creators, that, need, your, help, to, access, their, accounts

##### b) Implement the function `occurence_lowercase`. It shall calculate the (absolute) number of occurrences of each token that is in lowercase. Apply the function to our dataset of tweets and print the result (i.e. 'token: occurrence') in descending order. Use spaCy to identify lowercased tokens.

**Hint**: Do not lowercase all tokens, instead identify and count all already lowercased tokens. 
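
The distinction made in the hint can be illustrated with plain Python strings. The sketch below assumes the tokens are already available as strings and uses `str.islower`, which the spaCy documentation describes as equivalent to `Token.is_lower`; the sample tokens are invented for the example.

```python
from collections import Counter

# hypothetical tokens, only used for this illustration
sample_tokens = ["Lockdown", "lockdown", "the", "The", "the", "2020", "!"]

# count only tokens that are already lowercase; nothing is converted
already_lower = Counter(tok for tok in sample_tokens if tok.islower())

# for contrast: lowercasing everything first merges "The" into "the"
lowercased_all = Counter(tok.lower() for tok in sample_tokens)

# most_common() returns (token, count) pairs in descending order of count
for token, count in already_lower.most_common():
    print(f"{token}: {count}")
```

Note that tokens without any cased characters, such as `2020` or `!`, are not considered lowercase by `str.islower`.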

In [3]:
from collections import Counter

def occurence_lowercase(data: List[List[str]]) -> Mapping[str, int]:
    """
    Counts occurrences of all lowercased tokens.
    The type hints are suggestions only. Feel free to use whatever works for you.
    
    @param data: array-like object containing tokenized tweets from subtask a)
    @return: array-like object with tokens and their counts
    """
    # count only tokens that spaCy flags as already lowercase
    counts = Counter(token.text for doc in data for token in doc if token.is_lower)
    # list of (token, count) pairs, sorted by count in descending order
    return counts.most_common()

df=pd.DataFrame()
df['elements']=occurence_lowercase(docs)
df

Unnamed: 0,elements
0,"(the, 1111)"
1,"(lockdown, 998)"
2,"(to, 799)"
3,"(and, 609)"
4,"(a, 608)"
...,...
4221,"(attempts, 1)"
4222,"(digibash, 1)"
4223,"(hotrod, 1)"
4224,"(buyer, 1)"


##### c) Implement the function `occurence_no_punctuation`. It shall extract all tokens which occur five or more times, excluding punctuation. Additionally, it shall return the absolute number of occurrences of these tokens, similar to b). Apply the function to our dataset of tweets and print the result in descending order. Use spaCy to identify which tokens are considered to be punctuation.

In [4]:
from collections import Counter

def occurence_no_punctuation(data: List[List[str]]) -> Mapping[str, int]:
    """
    Counts occurrences of all tokens excluding punctuation and returns all that occur 5 or more times.
    The type hints are suggestions only. Feel free to use whatever works for you.
    
    @param data: array-like object containing tokenized tweets from subtask a)
    @return: array-like object with tokens and their counts
    """
    # skip punctuation and whitespace tokens as identified by spaCy
    counts = Counter(
        token.text
        for doc in data
        for token in doc
        if not token.is_punct and not token.is_space
    )
    # keep tokens with at least 5 occurrences, sorted by count in descending order
    return [(text, count) for text, count in counts.most_common() if count >= 5]

df=pd.DataFrame()
df['elements']=occurence_no_punctuation(docs)
df


Unnamed: 0,elements
0,"(the, 1111)"
1,"(lockdown, 998)"
2,"(to, 799)"
3,"(and, 609)"
4,"(a, 608)"
...,...
829,"(test, 5)"
830,"(yes, 5)"
831,"(showing, 5)"
832,"(local, 5)"


##### d) Explain the internal structure of spaCy with respect to tokens in at most ~5 sentences. Make sure you explain the three main data components. Please refer to the notebook from the practice class or the  [spaCy documentation](https://spacy.io/api). You may want to use an example to show how a given token is stored / represented in different ways.

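**Answer:** spaCy stores every string exactly once in the `StringStore` and refers to it everywhere else by a 64-bit hash value. The `Vocab` holds the context-independent lexical entries (`Lexeme` objects), while the `Doc` owns the sequence of `Token` objects together with their context-dependent annotations. The `StringStore` works as a lookup table in both directions (string to hash and hash to string), and we can access it from any object in spaCy, such as `nlp.vocab.strings`, `doc.vocab.strings`, or `span.doc.vocab.strings`.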

In [5]:
nlp = spacy.load('en_core_web_sm')
text = "And it starts with understanding this: Even as the Delta variant 19 [sic] has — COVID-19 — has been hitting this country hard, we have the tools to combat the virus, if we can come together as a country and use those tools."
# Doc: the container that owns the token sequence and its annotations
doc = nlp(text)
# Vocab: the shared vocabulary of context-independent lexical entries
voca = nlp.vocab
# Lexeme: the vocabulary entry for the text of the first token
lexeme = voca[doc[0].text]
print(lexeme.text, lexeme.orth)
# StringStore: the lookup table, which works in both directions
searched_string = nlp.vocab.strings[lexeme.orth]  # hash -> string
searched_hash = nlp.vocab.strings[lexeme.text]    # string -> hash
print("This is my desired string:", searched_string)
print("This is my desired hash:", searched_hash)

And 12172435438170721471
This is my desired string: And
This is my desired hash: 12172435438170721471


## Task 2 - 5 Points

##### a) Use the spaCy matcher to find groups of tokens that are similar in some way. You can decide for yourself what kind of 'similarity' is interesting to you. You should first create your pattern(s) and then use them to find matches in the tweet dataset. For each match, print both the match itself as well as the sentence (not necessarily the entire tweet!) it was found in. Briefly describe how you chose your pattern and your motivation for doing so (why is it relevant / useful / interesting?) in ~2 sentences.

The output for a match could look something like this:
```
This is the sentence of the match.
    - the match
```

**Hint**: You can check out [explosion.ai/demos/matcher](https://explosion.ai/demos/matcher) to play around with different patterns. You can also refer to the [spaCy documentation of the Token class](https://spacy.io/api/token) for interesting attributes and the [spaCy matching documentation](https://spacy.io/usage/rule-based-matching/) for info on how to create patterns.

**Example 1**: All tokens that describe a date or time. 'Let's meet this evening.' ==> *this evening*  
**Example 2**: Tokens describing appearance (e.g. adjectives after 'look'): 'It looks good.' ==> *looks good*

*Sidenote: Recall that the dataset you're working on was originally collected for a psychological study investigating differences in the public perception of the first vs. the later lockdowns. Perhaps you can explore aspects of that with your pattern?*
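
Example 2 could be expressed as a pattern along the following lines. To keep the sketch independent of a trained model, the `Doc` is built by hand with manually assigned POS tags and lemmas; this assumes spaCy v3, where the `Doc` constructor accepts `pos` and `lemmas`. On the real dataset, these annotations would come from `nlp`. The pattern name `LOOK_ADJ` is made up for the example.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()

# hand-annotated sentence (assumption: these tags are what a tagger would assign)
example_doc = Doc(
    vocab,
    words=["It", "looks", "good", "."],
    pos=["PRON", "VERB", "ADJ", "PUNCT"],
    lemmas=["it", "look", "good", "."],
)

# any form of "look" directly followed by an adjective
look_matcher = Matcher(vocab)
look_matcher.add("LOOK_ADJ", [[{"LEMMA": "look"}, {"POS": "ADJ"}]])

found = [example_doc[start:end].text for _, start, end in look_matcher(example_doc)]
print(found)
```

Matching on `LEMMA` rather than on the surface form lets one pattern cover "look", "looks", "looked", and "looking".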

In [None]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# pattern: an adjective directly followed by a noun, e.g. "tough lockdown"
pattern_attr = [{'POS': 'ADJ'}, {'POS': 'NOUN'}]
matcher.add('PATTERN_ATTR', [pattern_attr])

# read one tweet per line, so that sentences of different tweets are not fused
with open("tweets.txt", mode='r', encoding='utf-8') as f:
    tweet_texts = [line.rstrip() for line in f]

for doc in nlp.pipe(tweet_texts):
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        # span.sent is the sentence that contains the match
        print(span.sent.text)
        print('    - ' + span.text + '\n')

##### b) Briefly describe another use case / pattern which you could have implemented and why it could be useful or interesting.

**Answer:** Our pattern matches an adjective directly followed by a noun. We think the best way to learn about changes in people's attitudes is to look at how they describe things. Adjective-noun pairs therefore help us understand whether people had a positive or negative attitude before and after the lockdown.

## Task 3 - 5 Points

##### a) Use the spaCy matcher to extract and print [proper nouns](https://en.wikipedia.org/wiki/Proper_and_common_nouns) that are longer than one token. Print each tweet from the dataset that contains at least one such proper noun together with all the matching proper nouns it contains.

The output for a given tweet should look something like this:  
```
I live in New York City and I like Hot Dogs & Coke
    - New York City
    - Hot Dogs
```

**Hint 1**: If there is a proper noun like 'New York City' you should only print 'New York City' and not 'New York', 'New York City', and 'York City'. As in the previous task, you can quickly test different patterns using [explosion.ai/demos/matcher](https://explosion.ai/demos/matcher).

**Hint 2**: For this task you may have to use some functions that are not provided by spaCy.
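
The filtering described in Hint 1 can be sketched in plain Python on `(start, end)` token offsets. The function name `keep_longest` and the sample offsets are invented for this illustration; spaCy also ships `spacy.util.filter_spans`, which performs a similar filtering on `Span` objects.

```python
from typing import List, Tuple


def keep_longest(spans: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Keep only spans that do not overlap a longer span that was already kept."""
    kept: List[Tuple[int, int]] = []
    # visit the longest spans first; ties are broken by the earlier start
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        # two half-open spans are disjoint if one ends before the other starts
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)


# token offsets as a matcher might return them for
# "I live in New York City and I like Hot Dogs & Coke":
# New York, New York City, York City, Hot Dogs
raw_matches = [(3, 5), (3, 6), (4, 6), (9, 11)]
longest = keep_longest(raw_matches)
print(longest)
```

Only the offsets for 'New York City' and 'Hot Dogs' survive, because the two shorter spans overlap the longer one.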

In [7]:
%%time
# one or more compound proper nouns followed by a final proper noun;
# the negated tokens on both sides must not be proper nouns, which
# anchors the match to a complete run of proper nouns
pattern = [
            {'POS': 'PROPN', 'OP': '!'},
            {'POS': 'PROPN', 'OP': '+', 'DEP': 'compound'},
            {'POS': 'PROPN'},
            {'POS': 'PROPN', 'OP': '!'}
            ]

matcher = Matcher(nlp.vocab)
matcher.add("propnoun",[pattern])

pn=[]
tweets=[]
for doc in docs:
    for match_id,start,end in matcher(doc):
        # drop the two neighbouring tokens that are not proper nouns
        span = doc[start+1:end-1]
        pn.append(span.text)
        tweets.append(doc)

df=pd.DataFrame()
df['PROPN']=pn
df['tweet']=tweets
df

Wall time: 56.1 ms


Unnamed: 0,PROPN,tweet
0,christmas Mongraal,"(No, ,, our, country, is, in, lockdown, for, -..."
1,Storm Oregon,"(Get, their, names, and, make, sure, they, do,..."
2,State Capitol,"(Get, their, names, and, make, sure, they, do,..."
3,South Korea,"(It, was, a, draconian, lockdown, that, would,..."
4,East Asia,"(It, was, a, draconian, lockdown, that, would,..."
...,...,...
217,B4 lockdown,"(Na, the, Koko, ., B4, lockdown, go, reach, 6m..."
218,Us Ontarians,"(LOL, !, Where, are, you, ?, Our, snow, has, n..."
219,Ye Olde Lockdown,"(i, am, ALSO, very, excited, during, Ye, Olde,..."
220,david liebe hart,"(Just, thinking, about, how, I, almost, saw, d..."


##### b) Go over the entire dataset and find verbs and nouns (including proper nouns!) that share the same lemma. Print each lemma that is shared between at least one verb and one noun together with all distinct, lowercased, non-lemmatized nouns and verbs from the dataset that share that lemma. Every lemma should only be printed once.

The output for the lemma 'walk' may look like this...
```
lemma: walk; nouns: walk; verbs: walk, walking, walked;
```
... assuming the dataset contains a sentence like 'We walked (V) the walk (N) and still walk (V) it today. Walking (V) brings us great joy.'

**Hint**: For this task you may need some functions which are not provided by spaCy. For example, you can join two dataframes and group strings by concatenating them. 
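
One way to do the grouping without dataframes is sketched below, using only the standard library. The `(text, lemma, pos)` triples are hand-annotated for the example sentence and are an assumption about what spaCy would return, not actual model output.

```python
from collections import defaultdict

# hypothetical (text, lemma, pos) triples for
# "We walked the walk and still walk it today. Walking brings us great joy."
sample = [
    ("We", "we", "PRON"), ("walked", "walk", "VERB"), ("the", "the", "DET"),
    ("walk", "walk", "NOUN"), ("and", "and", "CCONJ"), ("still", "still", "ADV"),
    ("walk", "walk", "VERB"), ("it", "it", "PRON"), ("today", "today", "NOUN"),
    (".", ".", "PUNCT"), ("Walking", "walk", "VERB"), ("brings", "bring", "VERB"),
    ("us", "we", "PRON"), ("great", "great", "ADJ"), ("joy", "joy", "NOUN"),
    (".", ".", "PUNCT"),
]

nouns = defaultdict(set)  # lemma -> distinct lowercased noun forms
verbs = defaultdict(set)  # lemma -> distinct lowercased verb forms
for text, lemma, pos in sample:
    if pos in ("NOUN", "PROPN"):
        nouns[lemma].add(text.lower())
    elif pos == "VERB":
        verbs[lemma].add(text.lower())

# lemmas shared by at least one noun and at least one verb
shared = sorted(nouns.keys() & verbs.keys())
lines = [
    f"lemma: {lem}; nouns: {', '.join(sorted(nouns[lem]))}; "
    f"verbs: {', '.join(sorted(verbs[lem]))};"
    for lem in shared
]
print("\n".join(lines))
```

Using sets takes care of the "distinct" requirement, and intersecting the two key sets guarantees that every lemma is printed only once.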

In [8]:
%%time
# collect text, lemma and coarse POS tag of every token in the dataset
tokens=[]
lemma=[]
pos=[]
for doc in docs:
    for t in doc:
        tokens.append(t.text)
        lemma.append(t.lemma_)
        pos.append(t.pos_)

df=pd.DataFrame()
df['tokens']=tokens
df['lemma']=lemma
df['pos']=pos

# one group per (lemma, POS) combination
groups_multiple=df.groupby(['lemma','pos'])

# lemmas that occur at least once as a verb and at least once as a (proper) noun
lemma_v=set(gm[0][0] for gm in groups_multiple if gm[0][1]=='VERB')
lemma_n=set(gm[0][0] for gm in groups_multiple if gm[0][1]=='NOUN')
lemma_pn=set(gm[0][0] for gm in groups_multiple if gm[0][1]=='PROPN')
lemma_vn=lemma_v & (lemma_n | lemma_pn)

verbList=[]
nounsList=[]
for lem in lemma_vn:
    verbs=groups_multiple.get_group((lem,'VERB'))
    # fall back to proper nouns if the lemma never occurs as a common noun
    try:
        nouns=groups_multiple.get_group((lem,'NOUN'))
    except KeyError:
        nouns=groups_multiple.get_group((lem,'PROPN'))
    
    verbList.append(set(verbs['tokens']))
    nounsList.append(set(nouns['tokens']))

df=pd.DataFrame()
df['lemma']=list(lemma_vn)
df['nouns']=nounsList
df['verbs']=verbList

df


Wall time: 1.16 s


Unnamed: 0,lemma,nouns,verbs
0,share,{share},"{sharing, share}"
1,Lockdown,{Lockdown},{Lockdown}
2,May,{May},{May}
3,line,"{line, lines}",{lining}
4,post,"{post, posts}","{post, posted}"
...,...,...,...
230,room,{room},{rooming}
231,wanna,{wanna},{wanna}
232,test,"{tests, test}",{tested}
233,back,"{backs, back}","{back, backed}"


## Task 4 - 5 Points

##### a) Use spaCy to find named entities representing organizations in the dataset. Print all (distinct) entities you find.


**Hint**: You can find NER tags available in spaCy's models in the [model documentation](https://spacy.io/models/en#en_core_web_sm). While you *could* use the Matcher again for this task, it is much easier to access the already parsed named entities of a document. You may want to refer to the [document documentation](https://spacy.io/api/doc/).

In [9]:
entities=list(entity for doc in docs for entity in doc.ents if entity.label_=='ORG')

df=pd.DataFrame()
df['Entity']=entities
df

Unnamed: 0,Entity
0,(hbu)
1,(BOTH)
2,(Royals)
3,(Quizzes)
4,(BCG)
...,...
261,(PCR)
262,(Covid)
263,(Paddy)
264,(Covid)
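
The task asks for distinct entities, while the table above still lists `Covid` more than once, because the list comprehension collects every ORG span it encounters. A minimal sketch of de-duplicating by entity text while keeping the order of first occurrence, with invented sample strings standing in for the `entity.text` values:

```python
# hypothetical entity texts, e.g. collected via entity.text for every ORG entity
sample_entity_texts = ["hbu", "Covid", "PCR", "Covid", "Paddy", "Covid"]

# dict keys are unique and keep insertion order, so this drops repeats
# without reordering the remaining entries
distinct_entities = list(dict.fromkeys(sample_entity_texts))
print(distinct_entities)
```

Comparing the text rather than the `Span` objects matters here, since spans from different tweets are different objects even when they read the same.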


##### b) The dataset you've been working with contains real, messy web data. Looking at the output from subtask a), name three difficulties related to working with this kind of social media content rather than, for example, data from Wikipedia and give examples for each.


**Hint**: See where spaCy fails. It may be helpful to inspect the context of the named entities you found above.

**Answer:**

* Abbreviations, e.g. "Nov/Dec", which spaCy recognizes as ORG although it can also refer to the months November and December.
* Emoji. spaCy recognizes some of them as ORG, but it is not clear whether an emoji stands for the name of an organization or is just an emoji.
* Words written entirely in uppercase letters, e.g. "BORING", which could be the name of an organization but can also simply express the emotion "boring".

Please upload your working Jupyter notebook to Moodle **before the next exercise session** <span style="color:red">(Nov 25, 16:14)</span>. Submission format: Group_No_Exercise_No.zip<br>
The submission should contain your filled-out Jupyter notebook (naming schema: Group_No_Exercise_No.ipynb) and any auxiliary files that are necessary to run your code (e.g. datasets provided by us).  
Each submission must only be handed in once per group.