## Lab 2. NLP Pipelines

__Disclaimer: This lab may contain obscene language. The author of this Lab did not write the text provided in the examples and does not hold any responsibility for its contents.__

In this Lab, we are going to look into NLP pipelines and their role in text processing. Additionally, we touch upon noise reduction techniques that were mentioned in the previous Lab. 

### Pipeline

As you might have already guessed from the name, a pipeline is a set of processors combined to form a chain. A user feeds their input into one end of the pipeline and gets the desired output from the other end. As a real-life example, think of a water filter. It contains different materials like sand, rocks or charcoal to filter out various elements from the water and/or add some healthy components. So, you give the filter dirty water as input and get clean water enriched with healthy elements as output.
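
The chaining idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how StanfordNLP is implemented; the processor functions and `run_pipeline` are hypothetical names made up for this example.

```python
from functools import reduce

def lowercase(text):
    """First processor: normalise case."""
    return text.lower()

def tokenize(text):
    """Second processor: split the text on whitespace."""
    return text.split()

def drop_short(tokens):
    """Third processor: filter out one-character tokens."""
    return [token for token in tokens if len(token) > 1]

def run_pipeline(data, processors):
    """Feed the output of each processor into the next one."""
    return reduce(lambda value, step: step(value), processors, data)

toy_pipeline = [lowercase, tokenize, drop_short]
result = run_pipeline("A Royale with Cheese", toy_pipeline)
print(result)  # ['royale', 'with', 'cheese']
```

Each processor only needs to accept what the previous one returns, which is why the order of the chain matters.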

In this Lab, we will use the [StanfordNLP pipeline](https://stanfordnlp.github.io/stanfordnlp/index.html). I recommend this system because it is easy to install and use, works at a reasonable speed, and produces quite accurate results. The latter is supported by the results of the [CoNLL 2018 Shared Task](https://universaldependencies.org/conll18/results.html).

But before we start working with it, we need to analyse our data and treat it accordingly.

In [1]:
from bs4 import BeautifulSoup
import re
from tqdm.notebook import tqdm
from collections import Counter

import spacy
spacy_nlp = spacy.load("en_core_web_sm")

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
warnings.filterwarnings("ignore",category=UserWarning)
    
import stanfordnlp

### Cleaning the data

Along with this Lab, you can find a page saved from [The Internet Movie Script Database (IMSDb)](https://www.imsdb.com/). This website contains many movie scripts. The attached file contains the script of the movie [Pulp Fiction](https://www.imdb.com/title/tt0110912/) by Quentin Tarantino.

One character in this movie, namely Jules, is famous for dropping many F-bombs in his lines. Today, we are going to analyse the F out of it.

First, let's load the HTML file.

In [2]:
with open('Pulp-Fiction.html') as f:  # the context manager closes the file for us
    html_doc = f.read()

To work with the HTML format, we use the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) Python library. Let's parse our HTML with it and look at its contents.

In [3]:
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   Pulp Fiction Script at IMSDb.
  </title>
  <meta content="Pulp Fiction script at the Internet Movie Script Database." name="description"/>
  <meta content="Pulp Fiction script, Pulp Fiction movie script, Pulp Fiction film script" name="keywords"/>
  <meta content="width=device-width, initial-scale=1" name="viewport">
   <meta content="true" name="HandheldFriendly"/>
   <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
   <meta content="EN" http-equiv="Content-Language"/>
   <meta content="Document" name="objecttype"/>
   <meta content="INDEX, FOLLOW" name="ROBOTS"/>
   <meta content="Movie scripts, Film scripts" name="Subject"/>
   <meta content="General" name="rating"/>
   <meta content="Global" name="distribution"/>
   <meta content="2 days" name="revisit-after"/>
   <link href="/style.css" rel="stylesheet" type="text/css"/>
   <script type="text/javascript">
    var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-3785444-3']);
  _gaq

</html>


We can see a lot of technical stuff at the beginning of the page that we are not interested in. Further down, we see the text of the script that we need to extract. We are lucky today: the whole script is contained within a single `<pre>` tag.

We can easily access it with the `find_all()` method. After that, we can get a clean text without HTML tags by accessing the `text` attribute of our found object.

In [4]:
script_text = soup.find_all('pre')[0].text
print(script_text)



"PULP FICTION" -- by Quentin Tarantino & Roger Avary


                                      "PULP FICTION"

                                            By

                             Quentin Tarantino & Roger Avary

                

               PULP [pulp] n.

               1. A soft, moist, shapeless mass or matter.

               2. A magazine or book containing lurid subject matter and 
               being characteristically printed on rough, unfinished paper.

               American Heritage Dictionary: New College Edition

               INT. COFFEE SHOP – MORNING

               A normal Denny's, Spires-like coffee shop in Los Angeles. 
               It's about 9:00 in the morning. While the place isn't jammed, 
               there's a healthy number of people drinking coffee, munching 
               on bacon and eating eggs.

               Two of these people are a YOUNG MAN and a YOUNG WOMAN. The 
               Young Man has a slight working-class English acce




Now, it's time to analyse the text and see how we can process it to extract the lines for each character. 

Notice that all the character names are in capital letters; we can use this to detect them (hint: use the built-in `isupper()` string method). However, other things are capitalized too, such as the names of the places where scenes take place and notes like "CUT TO:", and we want to skip those. If we look at the text, there is no empty line after a character name, but there is one after the name of a place. We can use this to tell them apart.

We propose to process the script and construct a dictionary that contains the lines of each character, for example:
`{'JULES': ['line1', 'line2', ... 'line100'], 'VINCENT': ['line1', 'line2', ... 'line100'], ...}`.
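
Before writing the full function, it may help to see the heuristic at work on a toy excerpt. The sketch below is one possible starting point, not the reference solution: `collect_lines_sketch` and `toy_script` are hypothetical names, the excerpt is hand-written, and the real script may contain cases (for example, character names followed by extra notes) that this simple version does not handle.

```python
from collections import defaultdict

def collect_lines_sketch(text):
    """Group dialogue by speaker using the upper-case / blank-line heuristic."""
    lines = defaultdict(list)
    rows = [row.strip() for row in text.split('\n')]
    i = 0
    while i < len(rows):
        row = rows[i]
        # A character name is upper-case AND directly followed by a non-empty row
        has_next = i + 1 < len(rows) and rows[i + 1] != ''
        if row.isupper() and has_next:
            speech = []
            i += 1
            # Everything up to the next empty row belongs to this speaker
            while i < len(rows) and rows[i] != '':
                speech.append(rows[i])
                i += 1
            lines[row].append(' '.join(speech))
        else:
            i += 1
    return dict(lines)

toy_script = """
INT. COFFEE SHOP - MORNING

A normal coffee shop.

                    JULES
          Okay now, tell me about
          the hash bars?

                    VINCENT
          What do you want to know?

CUT TO:

                    JULES
          Examples?
"""

grouped = collect_lines_sketch(toy_script)
print(grouped)
```

The scene heading and "CUT TO:" are skipped because an empty row follows them, while both JULES speeches end up under the same key.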

The following is left for you to complete.

In [5]:
def get_lines(script_text):
    lines = {}
    
    ...
            
    return lines

In [6]:
lines = get_lines(script_text)
lines['JULES']

[' – Okay now, tell me about the hash  bars?',
 ' Well, hash is legal there, right?',
 ' Those are hash bars?',
 " That did it, man – I'm fuckin' goin',  that's all there is to it.",
 ' What?',
 ' Examples?',
 " They don't call it a Quarter Pounder  with Cheese?",
 " What'd they call it?",
 " (repeating) Royale with Cheese. What'd they call  a Big Mac?",
 ' Le Big Mac. What do they call a  Whopper?',
 ' What?',
 ' Goddamn!',
 ' Uuccch!',
 ' We should have shotguns for this  kind of deal.',
 ' Three or four.',
 " I'm not sure.",
 " It's possible.",
 ' Mia.',
 ' I dunno, however people meet people.  She usta be an actress.',
 ' I think her biggest deal was she  starred in a pilot.',
 ' Well, you know the shows on TV?',
 " Yes, but you're aware that there's  an invention called television, and  on that invention they show shows?",
 " Well, the way they pick the shows on  TV is they make one show, and that  show's called a pilot. And they show  that one show to the people who pick  the sho

To count the number of lines that contain the F-word, we can use a simple substring search.

In [7]:
total_jules_lines = len(lines['JULES'])
f_jules_lines = sum([1 for line in lines['JULES'] if 'fuck' in line])

print(f'JULES has F-words in {f_jules_lines / total_jules_lines:.2%} of his lines')

JULES has F-words in 21.46% of his lines


However, we cannot be satisfied with this amateur result. 

Now, it's time to load our pipeline.

### StanfordNLP

In [8]:
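# The English models must be available locally; if they are not, run stanfordnlp.download('en') once first.
nlp = stanfordnlp.Pipeline()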

Use device: gpu
---
Loading: tokenize
With settings: 
{'model_path': 'C:\\Users\\milin\\stanfordnlp_resources\\en_ewt_models\\en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': 'C:\\Users\\milin\\stanfordnlp_resources\\en_ewt_models\\en_ewt_tagger.pt', 'pretrain_path': 'C:\\Users\\milin\\stanfordnlp_resources\\en_ewt_models\\en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': 'C:\\Users\\milin\\stanfordnlp_resources\\en_ewt_models\\en_ewt_lemmatizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
---
Loading: depparse
With settings: 
{'model_path': 'C:\\Users\\milin\\stanfordnlp_resources\\en_ewt_models\\en_ewt_parser.pt', 'pretrain_path': 'C:\\Users\\milin\

Let's process each of JULES's lines and store the results for further analysis.

In [9]:
def process_lines(character_name, lines):
    processed_lines = []
    for line in tqdm(lines[character_name], desc=f'Analysing lines of {character_name}'):
        processed_lines.append(nlp(line))
    return processed_lines

In [10]:
jules_processed_lines = process_lines('JULES', lines)





After analysis, the pipeline gives you a `Document` object. It contains `Sentence` objects that can be accessed with the `sentences` attribute. A `Sentence` object, in turn, contains `Word` objects that can be accessed with the `words` attribute. Finally, a `Word` object holds all the analysis results, which can be accessed with the following attributes: `index`, `text`, `lemma`, `pos`, `upos`, `xpos`, `feats` (morphological features), `governor` (governor/head in the dependency parse), `dependency_relation` (dependency relation between this word and its head), and `parent_token` (the Token object that this Word is part of).

For any additional information, you can refer to the [official documentation](https://stanfordnlp.github.io/stanfordnlp/data_objects.html).

Let's choose the first line and see what all of this looks like.

We can split the output into two categories: __morphology__ and __syntax__. 

Morphology output contains `upos`, a universal part of speech tag, `xpos`, a language-specific part of speech tag, and `feats`, a set of more detailed morphological features. For more information, consult the [UD page on morphological tags](https://universaldependencies.org/u/overview/morphology.html). 

In [11]:
print(*[f'text: {word.text+" "}\tlemma: {word.lemma}\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {word.feats}' \
        for sent in jules_processed_lines[0].sentences for word in sent.words], sep='\n')

text: – 	lemma: -	upos: PUNCT	xpos: NFP	feats: _
text: Okay 	lemma: okay	upos: INTJ	xpos: UH	feats: _
text: now 	lemma: now	upos: ADV	xpos: RB	feats: _
text: , 	lemma: ,	upos: PUNCT	xpos: ,	feats: _
text: tell 	lemma: tell	upos: VERB	xpos: VB	feats: Mood=Imp|VerbForm=Fin
text: me 	lemma: I	upos: PRON	xpos: PRP	feats: Case=Acc|Number=Sing|Person=1|PronType=Prs
text: about 	lemma: about	upos: ADP	xpos: IN	feats: _
text: the 	lemma: the	upos: DET	xpos: DT	feats: Definite=Def|PronType=Art
text: hash 	lemma: hash	upos: NOUN	xpos: NN	feats: Number=Sing
text: bars 	lemma: bar	upos: NOUN	xpos: NNS	feats: Number=Plur
text: ? 	lemma: ?	upos: PUNCT	xpos: .	feats: _


While morphology shows the information about a particular word, syntax tells us about the relation between the words in the sentence. 

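The `deprel` column in the output below holds the index of the word's _governor_ (head), i.e. the word from which the relation goes out; an index of 0 marks the root. Note that the CoNLL-U format calls this column `HEAD`. For example, in the sentence below, the word `tell` (index 5) is the root of the sentence and it governs `–`, `Okay`, `now`, `,`, `me`, `bars`, `?`. The word `bars` (index 10) governs the words `about`, `the`, `hash`.

The `deps` column shows the type of relation between a word and its governor (CoNLL-U calls this column `DEPREL`).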

For more information, you can consult the [UD page on syntax](https://universaldependencies.org/u/overview/syntax.html).
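
To make the governor indices easier to read, they can be inverted into a "head to dependents" mapping. The following is a minimal sketch in plain Python; the `(index, text, governor)` triples are copied by hand from the pipeline's analysis of the first JULES line so that the example runs without loading the pipeline, and the variable names are made up for illustration.

```python
from collections import defaultdict

# (index, text, governor index) for each word of the first JULES line
words = [
    (1, '–', 5), (2, 'Okay', 5), (3, 'now', 5), (4, ',', 5),
    (5, 'tell', 0), (6, 'me', 5), (7, 'about', 10), (8, 'the', 10),
    (9, 'hash', 10), (10, 'bars', 5), (11, '?', 5),
]

text_by_index = {index: text for index, text, _ in words}
text_by_index[0] = 'ROOT'  # governor index 0 stands for the artificial root

# Invert the mapping: for each governor index, collect its dependents
dependents = defaultdict(list)
for index, text, governor in words:
    dependents[governor].append(text)

for governor, children in sorted(dependents.items()):
    print(f'{text_by_index[governor]} -> {children}')
```

With real pipeline output, the same loop could read `word.index`, `word.text` and `word.governor` from each `Word`; note that `index` may be stored as a string there, so a conversion with `int()` might be needed.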

In [12]:
print(*[f'text: {word.text+" "}\tdeprel: {word.governor}\tdeps: {word.dependency_relation}' \
        for sent in jules_processed_lines[0].sentences for word in sent.words], sep='\n')

text: – 	deprel: 5	deps: punct
text: Okay 	deprel: 5	deps: discourse
text: now 	deprel: 5	deps: advmod
text: , 	deprel: 5	deps: punct
text: tell 	deprel: 0	deps: root
text: me 	deprel: 5	deps: obj
text: about 	deprel: 10	deps: case
text: the 	deprel: 10	deps: det
text: hash 	deprel: 10	deps: compound
text: bars 	deprel: 5	deps: obl
text: ? 	deprel: 5	deps: punct


It is also possible to visualize the syntactic relations with [this online tool](https://urd2.let.rug.nl/~kleiweg/conllu/).

![Syntax visualization](syntax_tree.png)

Finally, you can output everything in the CoNLL-U format, which is, simply put, a text file that contains exactly _ten_ tab-separated values for each word. Each word is on its own line, and sentences are separated by a blank line. See more about the format on the [UD page on the CoNLL-U format](https://universaldependencies.org/format.html).
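
Because the format is so regular, it is easy to read back without any NLP library. The sketch below parses a CoNLL-U string into a list of sentences, each a list of dictionaries keyed by the ten field names from the UD specification. The helper name `parse_conllu` is hypothetical, and the `sample` string is a hand-written toy example, not actual pipeline output.

```python
# The ten CoNLL-U columns, in order, as named in the UD specification
FIELDS = ['ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS',
          'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC']

def parse_conllu(text):
    """Split a CoNLL-U string into sentences, each a list of field dicts."""
    sentences = []
    for block in text.strip().split('\n\n'):  # a blank line separates sentences
        sentence = []
        for row in block.split('\n'):
            if not row or row.startswith('#'):  # skip empty rows and comments
                continue
            values = row.split('\t')
            if len(values) != len(FIELDS):
                raise ValueError(f'expected 10 fields, got {len(values)}')
            sentence.append(dict(zip(FIELDS, values)))
        if sentence:
            sentences.append(sentence)
    return sentences

# Hand-written toy input with two one-word sentences plus punctuation
sample = (
    "1\tWhat\twhat\tPRON\tWP\tPronType=Int\t0\troot\t_\t_\n"
    "2\t?\t?\tPUNCT\t.\t_\t1\tpunct\t_\t_\n"
    "\n"
    "1\tExamples\texample\tNOUN\tNNS\tNumber=Plur\t0\troot\t_\t_\n"
    "2\t?\t?\tPUNCT\t.\t_\t1\tpunct\t_\t_\n"
)

parsed = parse_conllu(sample)
for sentence in parsed:
    print([(word['FORM'], word['UPOS'], word['HEAD']) for word in sentence])
```

This simple version does not treat multiword token ranges or empty nodes specially; it only splits rows into fields.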

In [13]:
print(jules_processed_lines[0].conll_file.conll_as_string())

1	–	-	PUNCT	NFP	_	5	punct	_	_
2	Okay	okay	INTJ	UH	_	5	discourse	_	_
3	now	now	ADV	RB	_	5	advmod	_	_
4	,	,	PUNCT	,	_	5	punct	_	_
5	tell	tell	VERB	VB	Mood=Imp|VerbForm=Fin	0	root	_	_
6	me	I	PRON	PRP	Case=Acc|Number=Sing|Person=1|PronType=Prs	5	obj	_	_
7	about	about	ADP	IN	_	10	case	_	_
8	the	the	DET	DT	Definite=Def|PronType=Art	10	det	_	_
9	hash	hash	NOUN	NN	Number=Sing	10	compound	_	_
10	bars	bar	NOUN	NNS	Number=Plur	5	obl	_	_
11	?	?	PUNCT	.	_	5	punct	_	_




We can write it into a file as well. CoNLL-U files usually have the `.conllu` extension.

In [14]:
jules_processed_lines[0].write_conll_to_file('sentence.conllu')

### Script analysis

Now, we can perform a more sophisticated analysis of the movie script. We can select the F-lemmas and also look at what part of speech each of them is. We can also compute the ratio per word instead of per line to get a better overview of the F-ratio.

In [15]:
def word_ratio(docs, tgt_word):
    total_words_count = 0
    tgt_word_count = 0
    tgt_stats_morph = Counter()
    tgt_stats_synt = Counter()
    for doc in docs:
        for sent in doc.sentences:
            for word in sent.words:
                if word.upos != 'PUNCT':  # compare by value; `is not` tests object identity
                    total_words_count += 1
                    if tgt_word in word.lemma:
                        tgt_word_count += 1
                        tgt_stats_morph[(word.lemma, word.upos)] += 1
                        tgt_stats_synt[(word.lemma, word.dependency_relation)] += 1
    
    return tgt_word_count / total_words_count, tgt_stats_morph, tgt_stats_synt

In [16]:
jules_ratio, jules_stats_morph, jules_stats_synt = word_ratio(jules_processed_lines, 'fuck')

In [17]:
print(f'JULES has F-words in {jules_ratio:.2%} of his words')

JULES has F-words in 1.35% of his words


In [18]:
for k, v in jules_stats_morph.most_common():
    word = k[0].replace('u', '*')
    pos, freq = k[1], v
    print(f'{word}, {pos}, {freq}')

motherf*cker, NOUN, 16
f*ckin, NOUN, 11
f*ck, VERB, 8
f*ck, NOUN, 6
f*ckin, ADJ, 3
motherf*ckin, NOUN, 3
f*ckin', VERB, 3
f*ckin, VERB, 3
Motherf*cker, NOUN, 1
f*ckin', NOUN, 1
motherf*ckin', NOUN, 1
Motherf*cker, PROPN, 1


In [19]:
for k, v in jules_stats_synt.most_common():
    word = k[0].replace('u', '*')
    pos, freq = k[1], v
    print(f'{word}, {pos}, {freq}')

motherf*cker, root, 5
f*ckin, compound, 4
f*ckin, root, 4
f*ck, obj, 4
f*ck, root, 3
motherf*ckin, compound, 3
motherf*cker, obj, 3
f*ck, xcomp, 3
f*ckin, nmod:poss, 3
motherf*cker, compound, 2
f*ckin', root, 2
f*ck, nsubj, 2
motherf*cker, vocative, 2
f*ckin, obl, 2
f*ckin, parataxis, 1
f*ckin, appos, 1
Motherf*cker, nsubj, 1
motherf*cker, parataxis, 1
f*ckin, xcomp, 1
f*ck, conj, 1
f*ckin', compound, 1
motherf*ckin', compound, 1
motherf*cker, nmod:poss, 1
f*ck, amod, 1
f*ckin', aux, 1
f*ckin, acl:relcl, 1
motherf*cker, nsubj, 1
motherf*cker, nsubj:pass, 1
Motherf*cker, obj, 1

The statistics above are fragmented: the lemmatizer does not recognise the colloquial `-in'` spelling, so `f*ckin` and `f*ckin'` are counted as separate lemmas instead of being reduced to `f*ck`. As a simple noise-reduction step, we replace `in' ` with `ing ` in the raw script and repeat the analysis.


In [20]:
script_text_norm = script_text.replace("in' ", "ing ")
lines_norm = get_lines(script_text_norm)
jules_processed_lines_norm = process_lines('JULES', lines_norm)
jules_ratio_norm, jules_stats_morph_norm, jules_stats_synt_norm = \
    word_ratio(jules_processed_lines_norm, 'fuck')





In [21]:
print(f'JULES has F-words in {jules_ratio_norm:.2%} of his words')

JULES has F-words in 1.37% of his words


In [22]:
for k, v in jules_stats_morph_norm.most_common():
    word = k[0].replace('u', '*')
    pos, freq = k[1], v
    print(f'{word}, {pos}, {freq}')

f*ck, VERB, 22
motherf*cker, NOUN, 16
f*ck, NOUN, 13
motherf*cking, NOUN, 3
motherf*ckin, NOUN, 1
Motherf*cker, NOUN, 1
Motherf*cker, PROPN, 1


In [23]:
for k, v in jules_stats_synt_norm.most_common():
    word = k[0].replace('u', '*')
    pos, freq = k[1], v
    print(f'{word}, {pos}, {freq}')

f*ck, root, 9
f*ck, compound, 7
f*ck, advcl, 5
motherf*cker, root, 5
f*ck, obj, 4
motherf*cker, obj, 3
f*ck, xcomp, 3
motherf*cker, compound, 2
f*ck, nsubj, 2
motherf*cking, compound, 2
motherf*cker, vocative, 2
f*ck, amod, 2
f*ck, parataxis, 1
motherf*ckin, compound, 1
Motherf*cker, nsubj, 1
motherf*cker, parataxis, 1
f*ck, conj, 1
motherf*cker, nmod:poss, 1
motherf*cking, obl:npmod, 1
f*ck, acl:relcl, 1
motherf*cker, nsubj, 1
motherf*cker, nsubj:pass, 1
Motherf*cker, obj, 1
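
One reason some `-in` forms survive the normalization is that `replace("in' ", "ing ")` only fires when the apostrophe is followed by a space, so a form followed by a comma or sitting at the end of a line is left untouched. A regular expression can cover those cases. This is a hedged sketch of an alternative, not the method used in this Lab: `restore_g` is a hypothetical helper, and the pattern would also wrongly rewrite a word such as `begin` that is directly followed by a closing single quote.

```python
import re

# A word ending in "in" plus an apostrophe that is NOT followed by a letter,
# so "in'" before a comma, a full stop or the end of the text also matches.
DROPPED_G = re.compile(r"(\w+in)'(?![A-Za-z])")

def restore_g(text):
    """Rewrite informal "-in'" endings as "-ing"."""
    return DROPPED_G.sub(r"\1g", text)

sample = "I'm goin', and I'm takin' it with me. Nothin' else."

naive = sample.replace("in' ", "ing ")  # the approach used in this Lab
regex_based = restore_g(sample)

print(naive)
print(regex_based)
```

On this sample, the naive replacement leaves `goin',` unchanged, while the regular expression rewrites all three forms; contractions such as `I'm` are not affected by either approach.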


We can clearly see that something is still going wrong. That is because the F-word and its derivatives are extremely complex linguistic entities. They can be used as a noun, a verb, an adjective, an adverb, or almost any other part of speech. Their syntactic role is also hard to pin down: for example, in the sentence _I'm f*ckin' goin' there._, it is very difficult to determine the word's syntactic role and build a syntactic tree.

I leave it up to you to inspect the sentences and analyse the pipeline outputs.

In [24]:
def print_line_conllu(lines, idx):
    print(lines[idx].conll_file.conll_as_string())
    
# Change the line index here
print_line_conllu(jules_processed_lines_norm, 1)

1	Well	well	INTJ	UH	_	6	discourse	_	_
2	,	,	PUNCT	,	_	6	punct	_	_
3	hash	hash	NOUN	NN	Number=Sing	6	nsubj	_	_
4	is	be	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	6	cop	_	_
5	legal	legal	ADJ	JJ	Degree=Pos	0	root	_	_
6	there	there	ADV	RB	PronType=Dem	5	advmod	_	_
7	,	,	PUNCT	,	_	6	punct	_	_
8	right	right	ADJ	JJ	Degree=Pos	6	discourse	_	_
9	?	?	PUNCT	.	_	6	punct	_	_




We can also compare JULES to other characters and see if someone swears more than he does in the movie.

In [25]:
for character in ['VINCENT', 'BUTCH', 'MIA', 'MARSELLUS']:
    processed_lines = process_lines(character, lines)
    ratio, _, _ = word_ratio(processed_lines, 'fuck')
    print(f'{character} has F-words in {ratio:.2%} of his words')

VINCENT has F-words in 0.93% of his words
BUTCH has F-words in 0.42% of his words
MIA has F-words in 0.07% of his words
MARSELLUS has F-words in 0.81% of his words


### Try yourself

For the next task, choose the script of any movie that you like from [The Internet Movie Script Database (IMSDb)](https://www.imsdb.com/) and try to do a similar analysis. You may want to choose a script that has a similar structure to the one we analysed above. Also, try to choose a less challenging word to analyse.

In [26]:
# Make your own analysis here

### Project idea

To give you an idea of possible project topics, I used the same Pulp Fiction script to fine-tune a [GPT-2 model](https://github.com/openai/gpt-2). In short, GPT-2 is a language model pretrained on a large amount of text. It can be used for various tasks, one of which is text generation. The model can generate scarily real-looking text given a small hand-written context. You can fine-tune this model on your own text data, and after only about 15-20 minutes of training, it already starts to produce believable text.

Here is one example of what the model produced after training on the Pulp Fiction script for only about 15 minutes:

```
               VINCENT
 Goddammit, what a lazy ass day!

The rattlesnakes disperse in regency, surrounded by the
tremors of human flesh.

INT. VINCENT'S STREET BATHROOM – DAY

His dirty garage door rattles. He peers through the window.

We're in the bathroom bedroom of AQUAMEL GRAHAM'S LIVING ROOM. Butch is playing with a
platter of burgers. The window has tinted black, and the kitchen sink is
red. The glow of a neon light glows through the thin wall.
The towel that's remained of Brett's hands is messy. In his
hand, he'll usually carry his boxer briefs, which he
makes with oil to match the light. Butch seems to have no idea what's
inside of them.

Vincent opens the lock on the bathroom door.

             BUTCH
 Who said dump the grills?

             VINCENT
 You're welcome.

Butch paces about the bedroom, examining the naked body of
Brett.

             BUTCH
 Gotta love New Orleans.

Brett flinches. Butch looks at him.

Vince sees through the panic.

Vincent touches Butch's bare ass.

The red dot on Butch's naked ass disappears.

Butch's bare ass disappears too.

Then... Under the Towel. In front of His Hair.

Butch holds the Towel out to Marty. Underneath it, the
wearing Towel Star.

As Butch dries it, the Waffle House neon light comes on.

             BUTCH
 Underneath The Towel?

Brett rolls his usual silver Mickey Mouse T-shirt.

Butch sees the police cruiser pull up to the bed.

             BUTCH
 You okay?

             BRETT
 Yeah, I'm good.

Butch blows out the T-shirt.

The SWAT team comes knocking.

Butch rises and disappears into the bedroom.

EXT. BOXING BUILDING CITY STREET – DAY

The Van Duzer reek of grease. Five men, all dressed in black, stand between
Patriot and the door. They're smiling at each other.

EXT. LIVING ROOM – DAY

Butch, in a Hawaiian shirt and blue jeans, talks softly to a
lotta animals. He looks like a king. Wolf-face. Marilyn Monroe.

Vincent can't help but laugh at the wit behind the seemingly bizarre character.
Butch seems to recognize a father figure.
The man is Auberjonois Marin, the city's first black
officer.

With his gun out, Vincent draws a square in Mexican and English.
The names on his guns fit the gun SONG THE GUARDIAN HAS FORSWORN is playing.
The room is cordoned off as to not 25 yards lead, but
more like 10 yards at a time. We're not sure why, just the
scene from last week's NCIS: LOSER.
The walkman, Brett Goodyear, who is obviously Roger Corman,
dances his way around the room, hugging Butch. Vince can't make out the smile
on Brett's face, but Brett has a sense of humor.

             VINCENT
 Brett. Did you just hear that right? A black man
 is a danger to others. If he doesn't get along with
us, we're gonna have to be protective of each
others' well-being. Get out of my earholes!

Goodbyes hug Brett.

Butch and Vincent walk through the empty alley, practically bumping
together.

             BRETT
 Chill this chicken out for now, and look
 at your birth certificate. Leavin' 'a
 hundred!

Butch and Vincent stand in front of Brett holding a briefcase
with his birth certificate in front of him.

The African-American man smiles.

             BRETT
 Were you raised by a black man?
```

Of course, the text is not perfect, but you can study this further and get better results.