# NLP Pipeline

## Overview of the Pipeline

![Overview of NLP Pipeline](images/pipeline.png "Overview of NLP Pipeline")

## Text Document

In [1]:
text = """Peaky Blinders is a British crime drama television series created by Steven Knight. Set in Birmingham, England, it follows the exploits of the Peaky Blinders crime gang in the direct aftermath of the First World War. The fictional gang is loosely based on a real urban youth gang of the same name who were active in the city from the 1890s to the 1910s. It features an ensemble cast led by Cillian Murphy, starring as Tommy Shelby, Helen McCrory as Elizabeth "Polly" Gray, Paul Anderson as Arthur Shelby and Joe Cole as John Shelby, the gang's senior members. Tom Hardy, Sam Neill, Annabelle Wallis, Iddo Goldberg, Charlotte Riley, Paddy Considine, Adrien Brody, Aidan Gillen, Anya Taylor-Joy, Sam Claflin, James Frecheville and Stephen Graham have recurring roles. It premiered on 12 September 2013, telecast on BBC Two until the fourth series (with repeats on BBC Four), then moved to BBC One for the fifth and sixth series. The fifth series premiered on BBC One on 25 August 2019 and finished on 22 September 2019. Netflix, under a deal with Weinstein Company and Endemol, acquired the rights to release the show in the United States and around the world. In January 2021, it was announced that series six would be the last, followed by a spinoff film. Series six was broadcast from 27 February 2022 to 3 April 2022."""
print(text)

Peaky Blinders is a British crime drama television series created by Steven Knight. Set in Birmingham, England, it follows the exploits of the Peaky Blinders crime gang in the direct aftermath of the First World War. The fictional gang is loosely based on a real urban youth gang of the same name who were active in the city from the 1890s to the 1910s. It features an ensemble cast led by Cillian Murphy, starring as Tommy Shelby, Helen McCrory as Elizabeth "Polly" Gray, Paul Anderson as Arthur Shelby and Joe Cole as John Shelby, the gang's senior members. Tom Hardy, Sam Neill, Annabelle Wallis, Iddo Goldberg, Charlotte Riley, Paddy Considine, Adrien Brody, Aidan Gillen, Anya Taylor-Joy, Sam Claflin, James Frecheville and Stephen Graham have recurring roles. It premiered on 12 September 2013, telecast on BBC Two until the fourth series (with repeats on BBC Four), then moved to BBC One for the fifth and sixth series. The fifth series premiered on BBC One on 25 August 2019 and finished on 2

## Step 1: Segmentation

In [2]:
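# Naive segmentation: splitting on ". " drops the period from every sentence
# except the last, and would also break on abbreviations such as "Dr. "
sentences = text.split(". ")
sentences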

['Peaky Blinders is a British crime drama television series created by Steven Knight',
 'Set in Birmingham, England, it follows the exploits of the Peaky Blinders crime gang in the direct aftermath of the First World War',
 'The fictional gang is loosely based on a real urban youth gang of the same name who were active in the city from the 1890s to the 1910s',
 'It features an ensemble cast led by Cillian Murphy, starring as Tommy Shelby, Helen McCrory as Elizabeth "Polly" Gray, Paul Anderson as Arthur Shelby and Joe Cole as John Shelby, the gang\'s senior members',
 'Tom Hardy, Sam Neill, Annabelle Wallis, Iddo Goldberg, Charlotte Riley, Paddy Considine, Adrien Brody, Aidan Gillen, Anya Taylor-Joy, Sam Claflin, James Frecheville and Stephen Graham have recurring roles',
 'It premiered on 12 September 2013, telecast on BBC Two until the fourth series (with repeats on BBC Four), then moved to BBC One for the fifth and sixth series',
 'The fifth series premiered on BBC One on 25 August 2
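The `split(". ")` call above discards the sentence-final periods, which is why they are added back by hand in the lemmatization step. As a minimal sketch of an alternative (the helper name `split_sentences` is hypothetical, not part of the notebook), a regular expression with a lookbehind can split on the whitespace after the punctuation and keep the punctuation itself. It still breaks on abbreviations such as "Dr. Smith"; a trained segmenter such as `nltk.tokenize.sent_tokenize` is likely more robust.

```python
import re


def split_sentences(paragraph):
    """Split a paragraph into sentences, keeping the terminal punctuation.

    Hypothetical helper: splits on whitespace that follows '.', '!' or '?'
    and precedes an uppercase letter. Abbreviations are not handled.
    """
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph.strip())
    return [part for part in parts if part]


sample = "It premiered on 12 September 2013. The fifth series premiered on BBC One."
for sentence in split_sentences(sample):
    print(sentence)
```

With this approach each sentence keeps its period, so no punctuation needs to be restored later.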

## Step 2: Tokenization

In [3]:
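import nltk

# Downloads every NLTK corpus and model, which is large. The steps below appear
# to need only a few packages (e.g. 'punkt', 'averaged_perceptron_tagger',
# 'wordnet', 'stopwords'), so those could be downloaded individually instead.
nltk.download('all')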

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\saile\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\saile\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\saile\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\saile\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     C:\Users\saile\AppData\Roaming\nltk_data...
[nltk_data]    | 

True

In [4]:
from nltk.tokenize import word_tokenize

words = []
for sentence in sentences:
    sentence_words = word_tokenize(sentence)
    words.append(sentence_words)

print(words)

[['Peaky', 'Blinders', 'is', 'a', 'British', 'crime', 'drama', 'television', 'series', 'created', 'by', 'Steven', 'Knight'], ['Set', 'in', 'Birmingham', ',', 'England', ',', 'it', 'follows', 'the', 'exploits', 'of', 'the', 'Peaky', 'Blinders', 'crime', 'gang', 'in', 'the', 'direct', 'aftermath', 'of', 'the', 'First', 'World', 'War'], ['The', 'fictional', 'gang', 'is', 'loosely', 'based', 'on', 'a', 'real', 'urban', 'youth', 'gang', 'of', 'the', 'same', 'name', 'who', 'were', 'active', 'in', 'the', 'city', 'from', 'the', '1890s', 'to', 'the', '1910s'], ['It', 'features', 'an', 'ensemble', 'cast', 'led', 'by', 'Cillian', 'Murphy', ',', 'starring', 'as', 'Tommy', 'Shelby', ',', 'Helen', 'McCrory', 'as', 'Elizabeth', '``', 'Polly', "''", 'Gray', ',', 'Paul', 'Anderson', 'as', 'Arthur', 'Shelby', 'and', 'Joe', 'Cole', 'as', 'John', 'Shelby', ',', 'the', 'gang', "'s", 'senior', 'members'], ['Tom', 'Hardy', ',', 'Sam', 'Neill', ',', 'Annabelle', 'Wallis', ',', 'Iddo', 'Goldberg', ',', 'Charlo
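To make the idea behind tokenization concrete, here is a rough regex-based approximation. It is not the algorithm `word_tokenize` uses, and the names `TOKEN_PATTERN` and `simple_tokenize` are hypothetical. It keeps hyphenated names such as "Taylor-Joy" together and emits each punctuation mark as its own token. Unlike the NLTK output above, it does not split clitics, so "gang's" stays a single token.

```python
import re

# A word (optionally with internal hyphens or apostrophes),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def simple_tokenize(sentence):
    """Return the list of word and punctuation tokens in a sentence."""
    return TOKEN_PATTERN.findall(sentence)


print(simple_tokenize("Set in Birmingham, England"))
print(simple_tokenize("Anya Taylor-Joy and the gang's senior members"))
```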

## Step 3: Part-of-Speech Tagging

In [8]:
# Tag each tokenized sentence with Penn Treebank part-of-speech tags
pos_tags = []
for sentence in words:
    pos_tags.append(nltk.pos_tag(sentence))

print(pos_tags)

[[('Peaky', 'NNP'), ('Blinders', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('British', 'JJ'), ('crime', 'NN'), ('drama', 'NN'), ('television', 'NN'), ('series', 'NN'), ('created', 'VBN'), ('by', 'IN'), ('Steven', 'NNP'), ('Knight', 'NNP')], [('Set', 'NNP'), ('in', 'IN'), ('Birmingham', 'NNP'), (',', ','), ('England', 'NNP'), (',', ','), ('it', 'PRP'), ('follows', 'VBZ'), ('the', 'DT'), ('exploits', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Peaky', 'NNP'), ('Blinders', 'NNP'), ('crime', 'NN'), ('gang', 'NN'), ('in', 'IN'), ('the', 'DT'), ('direct', 'JJ'), ('aftermath', 'NN'), ('of', 'IN'), ('the', 'DT'), ('First', 'NNP'), ('World', 'NNP'), ('War', 'NNP')], [('The', 'DT'), ('fictional', 'JJ'), ('gang', 'NN'), ('is', 'VBZ'), ('loosely', 'RB'), ('based', 'VBN'), ('on', 'IN'), ('a', 'DT'), ('real', 'JJ'), ('urban', 'JJ'), ('youth', 'NN'), ('gang', 'NN'), ('of', 'IN'), ('the', 'DT'), ('same', 'JJ'), ('name', 'NN'), ('who', 'WP'), ('were', 'VBD'), ('active', 'JJ'), ('in', 'IN'), ('the', 'DT'), ('city'

## Step 4: Text Lemmatization

In [26]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Collect every POS tag that occurs in the text so that a mapping
# from Penn Treebank tags to WordNet POS types can be built
tag_values = []
for tagged_sentence in pos_tags:
    for _, tag in tagged_sentence:
        tag_values.append(tag)

unique_tag_values = list(set(tag_values))
print(unique_tag_values)

['VBG', 'POS', 'CC', 'IN', 'NN', 'VBD', 'CD', 'WP', 'NNP', 'VB', "''", '(', 'JJ', 'VBN', 'NNPS', '.', 'TO', 'VBP', 'VBZ', 'NNS', 'MD', ')', 'DT', ',', 'RB', 'PRP', '``']


In [55]:
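# Map Penn Treebank tags to WordNet POS constants;
# words whose tag is not listed here are left unlemmatized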
from nltk.corpus.reader.wordnet import ADJ, ADV, NOUN, VERB

pos_tag_map = {
    'VBG': VERB,
    'NN': NOUN,
    'VBD': VERB,
    'JJ': ADJ,
    'VBP': VERB,
    'VBZ': VERB,
    'RB': ADV
}

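lemmatized_words = []
for tagged_sentence in pos_tags:
    lemmas = []
    for word, tag in tagged_sentence:
        wordnet_pos = pos_tag_map.get(tag)
        if wordnet_pos is None:
            # No WordNet equivalent in the mapping: keep the word as-is
            lemmas.append(word)
        else:
            lemmas.append(lemmatizer.lemmatize(word, wordnet_pos))
    lemmatized_words.append(lemmas)

# Flatten the sentences into one token list, restoring the periods
# that were removed by split(". ") during segmentation
flat_words = []
for lemmas in lemmatized_words:
    flat_words.extend(lemmas)
    flat_words.append('.')

# The final sentence kept its own period, so drop the extra one
lemmatized_words = flat_words[:-1]
print(lemmatized_words)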

['Peaky', 'Blinders', 'be', 'a', 'British', 'crime', 'drama', 'television', 'series', 'created', 'by', 'Steven', 'Knight', '.', 'Set', 'in', 'Birmingham', ',', 'England', ',', 'it', 'follow', 'the', 'exploits', 'of', 'the', 'Peaky', 'Blinders', 'crime', 'gang', 'in', 'the', 'direct', 'aftermath', 'of', 'the', 'First', 'World', 'War', '.', 'The', 'fictional', 'gang', 'be', 'loosely', 'based', 'on', 'a', 'real', 'urban', 'youth', 'gang', 'of', 'the', 'same', 'name', 'who', 'be', 'active', 'in', 'the', 'city', 'from', 'the', '1890s', 'to', 'the', '1910s', '.', 'It', 'feature', 'an', 'ensemble', 'cast', 'led', 'by', 'Cillian', 'Murphy', ',', 'star', 'as', 'Tommy', 'Shelby', ',', 'Helen', 'McCrory', 'as', 'Elizabeth', '``', 'Polly', "''", 'Gray', ',', 'Paul', 'Anderson', 'as', 'Arthur', 'Shelby', 'and', 'Joe', 'Cole', 'as', 'John', 'Shelby', ',', 'the', 'gang', "'s", 'senior', 'members', '.', 'Tom', 'Hardy', ',', 'Sam', 'Neill', ',', 'Annabelle', 'Wallis', ',', 'Iddo', 'Goldberg', ',', 'Cha
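The hand-written mapping covers only some of the tags printed earlier. Tags such as `VBN`, `VB` and `NNS` are missing, which is why words like 'created' and 'exploits' pass through unchanged in the output above. One possible way to cover whole tag families is to match on the tag prefix. The sketch below uses a hypothetical helper, `penn_to_wordnet`, and returns the single-letter codes `'a'`, `'v'`, `'n'`, `'r'`, on the assumption that these are the values of NLTK's `ADJ`, `VERB`, `NOUN` and `ADV` constants. Proper nouns are deliberately left unmapped so that names are not altered.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped.

    Assumption: 'a', 'v', 'n', 'r' correspond to NLTK's ADJ, VERB, NOUN, ADV.
    """
    if tag.startswith('JJ'):      # JJ, JJR, JJS
        return 'a'
    if tag.startswith('VB'):      # VB, VBD, VBG, VBN, VBP, VBZ
        return 'v'
    if tag in ('NN', 'NNS'):      # common nouns only; NNP/NNPS stay as-is
        return 'n'
    if tag.startswith('RB'):      # RB, RBR, RBS
        return 'r'
    return None


for tag in ['VBN', 'NNS', 'NNP', 'JJ', 'RB', 'MD', ',']:
    print(tag, penn_to_wordnet(tag))
```

In the loop above, `penn_to_wordnet(tag)` could stand in for `pos_tag_map.get(tag)`; the lemmatized output would then differ from the one shown.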

## Step 5: Removing Stop Words

In [56]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

filtered_paragraph = []

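# Keep only the tokens that are not in the English stop word list
# (the comparison is case-sensitive)
for word in lemmatized_words:
    if word not in stop_words:
        filtered_paragraph.append(word)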

print(filtered_paragraph)

['Peaky', 'Blinders', 'British', 'crime', 'drama', 'television', 'series', 'created', 'Steven', 'Knight', '.', 'Set', 'Birmingham', ',', 'England', ',', 'follow', 'exploits', 'Peaky', 'Blinders', 'crime', 'gang', 'direct', 'aftermath', 'First', 'World', 'War', '.', 'The', 'fictional', 'gang', 'loosely', 'based', 'real', 'urban', 'youth', 'gang', 'name', 'active', 'city', '1890s', '1910s', '.', 'It', 'feature', 'ensemble', 'cast', 'led', 'Cillian', 'Murphy', ',', 'star', 'Tommy', 'Shelby', ',', 'Helen', 'McCrory', 'Elizabeth', '``', 'Polly', "''", 'Gray', ',', 'Paul', 'Anderson', 'Arthur', 'Shelby', 'Joe', 'Cole', 'John', 'Shelby', ',', 'gang', "'s", 'senior', 'members', '.', 'Tom', 'Hardy', ',', 'Sam', 'Neill', ',', 'Annabelle', 'Wallis', ',', 'Iddo', 'Goldberg', ',', 'Charlotte', 'Riley', ',', 'Paddy', 'Considine', ',', 'Adrien', 'Brody', ',', 'Aidan', 'Gillen', ',', 'Anya', 'Taylor-Joy', ',', 'Sam', 'Claflin', ',', 'James', 'Frecheville', 'Stephen', 'Graham', 'recur', 'roles', '.', '
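Because the membership test is case-sensitive and the stop word list is lowercase, capitalized stop words such as 'The' and 'It' survive in the output above. A minimal sketch of a case-insensitive filter follows. The function `remove_stop_words` and the tiny `SAMPLE_STOP_WORDS` set are hypothetical stand-ins; in the notebook the `stop_words` set built from `stopwords.words('english')` would be passed instead.

```python
import string

# Hypothetical, tiny stand-in for the NLTK English stop word list
SAMPLE_STOP_WORDS = {'the', 'it', 'a', 'is', 'in', 'of', 'by'}


def remove_stop_words(tokens, stop_words, drop_punctuation=False):
    """Filter out stop words case-insensitively, optionally dropping punctuation."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue
        # A token made up only of punctuation characters, e.g. ',' or '``'
        if drop_punctuation and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept


tokens = ['The', 'fictional', 'gang', '.', 'It', 'feature', 'ensemble', 'cast']
print(remove_stop_words(tokens, SAMPLE_STOP_WORDS))
print(remove_stop_words(tokens, SAMPLE_STOP_WORDS, drop_punctuation=True))
```

Whether punctuation should be dropped depends on the later steps; the notebook keeps it, so the flag defaults to `False`.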

## Step 6: Dependency Parsing

In [67]:
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')

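# Note: the parser is given the stop-word-filtered tokens rather than the
# original text, so it sees ungrammatical input and the parse may be less reliable
doc = nlp(' '.join(filtered_paragraph))
displacy.render(doc, jupyter=True)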

## Step 7: Named Entity Recognition

In [69]:
# Each item in doc.ents is an entity span, not a single word
for ent in doc.ents:
    print(ent.text, ent.label_)

Peaky Blinders PERSON
British NORP
Steven Knight PERSON
Set Birmingham PERSON
England GPE
Peaky Blinders PERSON
First World War EVENT
1890s 1910s DATE
Cillian Murphy PERSON
Tommy Shelby PERSON
Helen McCrory Elizabeth PERSON
Paul Anderson PERSON
John Shelby PERSON
Tom Hardy PERSON
Sam Neill PERSON
Annabelle Wallis PERSON
Iddo Goldberg PERSON
Charlotte Riley PERSON
Paddy Considine ORG
Adrien Brody PERSON
Aidan Gillen PERSON
Anya Taylor-Joy PERSON
Sam Claflin PERSON
James Frecheville PERSON
Stephen Graham PERSON
12 September 2013 DATE
BBC ORG
Two fourth series DATE
BBC ORG
Four CARDINAL
BBC ORG
sixth ORDINAL
fifth ORDINAL
BBC ORG
August 2019 DATE
September 2019 DATE
Weinstein Company Endemol PERSON
United States GPE
January 2021 DATE
six CARDINAL
six CARDINAL
February 2022 DATE
3 April 2022 DATE
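A flat list like the one above is easier to review when the entities are grouped by label. The sketch below uses a hypothetical helper, `group_entities`, on a few (text, label) pairs copied from the output; with the notebook's `doc` it could be called as `group_entities((ent.text, ent.label_) for ent in doc.ents)`. Some spans above, such as "Set Birmingham" and "Weinstein Company Endemol" tagged as PERSON, look mislabeled, possibly because the stop words were removed before the text reached spaCy.

```python
from collections import defaultdict


def group_entities(entities):
    """Group (text, label) pairs by label, keeping first-seen order without duplicates."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# A few pairs taken from the NER output above
sample_entities = [
    ('Steven Knight', 'PERSON'),
    ('England', 'GPE'),
    ('BBC', 'ORG'),
    ('BBC', 'ORG'),
    ('United States', 'GPE'),
]

for label, texts in group_entities(sample_entities).items():
    print(label, texts)
```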


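## Step 8: Coreference Resolution

In [1]:
!pip install neuralcoref

import spacy
import neuralcoref

# neuralcoref adds coreference resolution to a spaCy pipeline.
# In this environment the install fails (see the build log below),
# so the import raises ModuleNotFoundError and the pipeline is never built.
newnlp = spacy.load('en_core_web_lg')
neuralcoref.add_to_pipe(newnlp)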

Collecting neuralcoref
  Using cached neuralcoref-4.0.tar.gz (368 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: neuralcoref
  Building wheel for neuralcoref (setup.py): started
  Building wheel for neuralcoref (setup.py): finished with status 'error'
  Running setup.py clean for neuralcoref
Failed to build neuralcoref
Installing collected packages: neuralcoref
  Running setup.py install for neuralcoref: started
  Running setup.py install for neuralcoref: finished with status 'error'


  error: subprocess-exited-with-error
  
  python setup.py bdist_wheel did not run successfully.
  exit code: 1
  
  [114 lines of output]
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-3.10
  creating build\lib.win-amd64-3.10\neuralcoref
  copying neuralcoref\file_utils.py -> build\lib.win-amd64-3.10\neuralcoref
  copying neuralcoref\__init__.py -> build\lib.win-amd64-3.10\neuralcoref
  creating build\lib.win-amd64-3.10\neuralcoref\tests
  copying neuralcoref\tests\test_neuralcoref.py -> build\lib.win-amd64-3.10\neuralcoref\tests
  copying neuralcoref\tests\__init__.py -> build\lib.win-amd64-3.10\neuralcoref\tests
  creating build\lib.win-amd64-3.10\neuralcoref\train
  copying neuralcoref\train\algorithm.py -> build\lib.win-amd64-3.10\neuralcoref\train
  copying neuralcoref\train\compat.py -> build\lib.win-amd64-3.10\neuralcoref\train
  copying neuralcoref\train\conllparser.py -> build\lib.win-amd64-3.10\neuralcoref\train
  cop

ModuleNotFoundError: No module named 'neuralcoref'