===========================================


Title: 4.2 Exercises


Author: Chad Wood


Date: 17 Jan 2021


Modified By: Chad Wood


Description: This program demonstrates using a custom module to normalize text, then using NLP libraries to determine parts of speech, dependencies, and lemmas


=========================================== 

In the text, the author creates a text normalizer. Your assignment is to re-create that normalizer as a reusable Python class (within a .py file). However, unlike the book author’s version, pass a Pandas Series (e.g., dataframe['column']) to your normalize_corpus function and use apply/lambda for each cleaning function. (Ask questions in Teams if that’s unclear.)

Using your new text normalizer, create a Jupyter Notebook that uses this class to clean up the text found in the file big.txt (that text file is in the Week 4 GitHub repository). Your resulting text should be a single (long) stream of text.
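The normalizer module itself is not shown in this notebook. As a rough illustration of the pattern the assignment describes (a class that takes a Pandas Series and runs each cleaning step through apply/lambda), a minimal sketch might look like the following. The cleaning steps here are assumptions chosen for brevity; the real normalizer.py evidently does more, such as stopword removal and lemmatization.

```python
import re

import pandas as pd


class Normalizer:
    """Hypothetical sketch of a reusable normalizer; not the actual normalizer.py."""

    def __init__(self, corpus: pd.Series):
        self.corpus = corpus

    def normalize(self) -> pd.Series:
        # Each cleaning step is applied element-wise with apply/lambda
        cleaned = self.corpus.apply(lambda doc: re.sub(r"[^A-Za-z\s]", "", doc))
        cleaned = cleaned.apply(lambda doc: re.sub(r"\s+", " ", doc))
        cleaned = cleaned.apply(lambda doc: doc.strip())
        # Drop lines that are empty after cleaning
        return cleaned[cleaned != ""]


# Illustrative input only; the notebook reads its lines from big.txt
sample = pd.Series(["  The Project Gutenberg EBook!\n", "\n", "(#15 in our series)\n"])
text = " ".join(Normalizer(sample).normalize())
print(text)
```

Keeping every step as its own apply call makes it easy to add, remove, or reorder cleaning functions without touching the rest of the class.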

In [95]:
import pandas as pd
import normalizer as nm

with open('big.txt') as f:
    lines = f.readlines()

# Wraps the lines in a Series, normalizes them, and joins the result into one long stream of text
text_series = pd.Series(lines)
normalizer = nm.Normalizer(text_series)
text = ' '.join(normalizer.normalize())

In [96]:
text[0:1021]

'Project Gutenberg EBok Adventures Sherlock Holmes Sir Arthur Conan Doyle series Sir Arthur Conan Doyle  copyright law change al world sure check copyright law country download redistribute Project Gutenberg eBok  header first thing sen view project Gutenberg file please remove change edit header without writen permision  please read legal smal print information eBok Project Gutenberg botom file include important information specific right restriction file may use also find make donation Project Gutenberg get involve   welcome World Fre Plain Vanila electronic text  eBoks readable Humans computer since  ebok prepare thousand Volunters   Title Adventures Sherlock Holmes  Author Sir Arthur Conan Doyle  Release Date March [ ebok ] [ recently update November ]  Edition  Language English  Character set encode ascus  START PROJECT GUTENBERG EBOK ADVENTURES SHERLOCK HOLMES     aditional editing Jose Menendez    ADVENTURES SHERLOCK HOLMES    SIR ARTHUR CONAN DOYLE  content  scandal Bohemia Red

Using spaCy and NLTK, show the tokens, lemmas, parts of speech, and dependencies in the first 1,021 characters of big.txt.

In [102]:
import spacy
nlp = spacy.load("en_core_web_sm")

txt = ' '.join(lines)[0:1021]
doc = nlp(txt)

# Prints each token's text, part of speech, lemma, and dependency
print(f"{'Text':{10}} {'POS':{6}} {'Lemma':{10}} {'Dep':{10}} {'POS explained':{20}}")
for token in doc:
    print(f'{token.text:{10}} {token.pos_:{6}} {token.lemma_:{10}} {token.dep_:{10}}')

Text       POS    Lemma      Dep        POS explained       
The        DET    the        det       
Project    PROPN  Project    nmod      
Gutenberg  PROPN  Gutenberg  npadvmod  
EBook      PROPN  EBook      appos     
of         ADP    of         prep      
The        DET    the        det       
Adventures PROPN  Adventures pobj      
of         ADP    of         prep      
Sherlock   PROPN  Sherlock   compound  
Holmes     PROPN  Holmes     pobj      

          SPACE  
          dep       
by         ADP    by         prep      
Sir        PROPN  Sir        compound  
Arthur     PROPN  Arthur     compound  
Conan      PROPN  Conan      compound  
Doyle      PROPN  Doyle      pobj      

          SPACE  
          dep       
(          PUNCT  (          punct     
#          SYM    #          nmod      
15         NUM    15         appos     
in         ADP    in         prep      
our        PRON   our        poss      
series     NOUN   series     pobj      
by         ADP    b
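In the table above, the rows for whitespace tokens are split across several lines because the token text and lemma are themselves newline characters. One possible way to keep each token on a single row is to escape whitespace before padding. The helper below is a sketch, with `format_row` and the sample rows as illustrative assumptions rather than part of the original notebook.

```python
def format_row(text, pos, lemma, dep):
    """Format one table row, escaping whitespace so a newline token stays on one line."""

    def visible(s):
        # repr() turns a newline into the visible characters \n; drop the quotes it adds
        return repr(s)[1:-1] if s.isspace() else s

    return f"{visible(text):10} {pos:6} {visible(lemma):10} {dep:10}"


# Illustrative rows modelled on the output above
rows = [("Holmes", "PROPN", "Holmes", "pobj"), ("\n", "SPACE", "\n", "dep")]
for row in rows:
    print(format_row(*row))
```

In the notebook loop, the same helper could be called as `format_row(token.text, token.pos_, token.lemma_, token.dep_)`.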

In [91]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.parse.corenlp import CoreNLPDependencyParser

wordnet_lemmatizer = WordNetLemmatizer()
dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')

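# Takes the same 1,021-character slice used for spaCy and tokenizes it
raw = ' '.join(lines)[0:1021]
txt = nltk.word_tokenize(raw)

# POS tags, lemmas (WordNet treats every word as a noun by default), and dependency parses
part_os = nltk.pos_tag(txt)
lemma = [wordnet_lemmatizer.lemmatize(word) for word in txt]
# Requires a CoreNLP server listening at the URL given above
parses = dep_parser.parse(raw.split())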

In [82]:
for pos, lem in zip(part_os, lemma):
    print(f"Word, POS: {pos} Lemma: {lem}")

Word, POS: ('The', 'DT') Lemma: The
Word, POS: ('Project', 'NNP') Lemma: Project
Word, POS: ('Gutenberg', 'NNP') Lemma: Gutenberg
Word, POS: ('EBook', 'NNP') Lemma: EBook
Word, POS: ('of', 'IN') Lemma: of
Word, POS: ('The', 'DT') Lemma: The
Word, POS: ('Adventures', 'NNP') Lemma: Adventures
Word, POS: ('of', 'IN') Lemma: of
Word, POS: ('Sherlock', 'NNP') Lemma: Sherlock
Word, POS: ('Holmes', 'NNP') Lemma: Holmes
Word, POS: ('by', 'IN') Lemma: by
Word, POS: ('Sir', 'NNP') Lemma: Sir
Word, POS: ('Arthur', 'NNP') Lemma: Arthur
Word, POS: ('Conan', 'NNP') Lemma: Conan
Word, POS: ('Doyle', 'NNP') Lemma: Doyle
Word, POS: ('(', '(') Lemma: (
Word, POS: ('#', '#') Lemma: #
Word, POS: ('15', 'CD') Lemma: 15
Word, POS: ('in', 'IN') Lemma: in
Word, POS: ('our', 'PRP$') Lemma: our
Word, POS: ('series', 'NN') Lemma: series
Word, POS: ('by', 'IN') Lemma: by
Word, POS: ('Sir', 'NNP') Lemma: Sir
Word, POS: ('Arthur', 'NNP') Lemma: Arthur
Word, POS: ('Conan', 'NNP') Lemma: Conan
Word, POS: ('Doyle', 'N
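The lemmas above are identical to the input words. A likely contributing factor is that `WordNetLemmatizer.lemmatize` treats each word as a noun unless a `pos` argument is supplied. A common workaround is to translate the Penn Treebank tags from `nltk.pos_tag` into WordNet's single-letter POS codes. The sketch below shows only that mapping; `penn_to_wordnet` is a hypothetical helper name, and the sample tags are taken from the outputs in this notebook.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"  # noun, which is also the lemmatizer's default


tagged = [("The", "DT"), ("Adventures", "NNP"), ("changing", "VBG")]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
# Each pair could then be passed on as wordnet_lemmatizer.lemmatize(word, pos=pos)
```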

In [92]:
print('Dependencies')
[list(parse.triples()) for parse in parses]

Dependencies


[[(('changing', 'VBG'), 'nsubj', ('EBook', 'NNP')),
  (('EBook', 'NNP'), 'det', ('The', 'DT')),
  (('EBook', 'NNP'), 'compound', ('Project', 'NN')),
  (('EBook', 'NNP'), 'compound', ('Gutenberg', 'NNP')),
  (('EBook', 'NNP'), 'nmod', ('Adventures', 'NNS')),
  (('Adventures', 'NNS'), 'case', ('of', 'IN')),
  (('Adventures', 'NNS'), 'det', ('The', 'DT')),
  (('Adventures', 'NNS'), 'nmod', ('Holmes', 'NNP')),
  (('Holmes', 'NNP'), 'case', ('of', 'IN')),
  (('Holmes', 'NNP'), 'compound', ('Sherlock', 'NNP')),
  (('EBook', 'NNP'), 'nmod', ('Doyle', 'NNP')),
  (('Doyle', 'NNP'), 'case', ('by', 'IN')),
  (('Doyle', 'NNP'), 'compound', ('Sir', 'NNP')),
  (('Doyle', 'NNP'), 'compound', ('Arthur', 'NNP')),
  (('Doyle', 'NNP'), 'compound', ('Conan', 'NNP')),
  (('Doyle', 'NNP'), 'dep', ('15', 'CD')),
  (('15', 'CD'), 'punct', ('-LRB-', '-LRB-')),
  (('15', 'CD'), 'dep', ('#', '#')),
  (('15', 'CD'), 'nmod', ('series', 'NN')),
  (('series', 'NN'), 'case', ('in', 'IN')),
  (('series', 'NN'), 'nmod:
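The raw list of triples is hard to scan. As a sketch of one way to make it more readable, the triples can be grouped by governor so each head word is listed once with its dependents. The sample data is copied from the output above; `group_by_governor` is a hypothetical helper, and grouping by word alone assumes it is acceptable to merge repeated words such as the two occurrences of "Doyle".

```python
from collections import defaultdict

# A few triples copied from the output above: ((governor, tag), relation, (dependent, tag))
triples = [
    (("EBook", "NNP"), "det", ("The", "DT")),
    (("EBook", "NNP"), "compound", ("Project", "NN")),
    (("Doyle", "NNP"), "case", ("by", "IN")),
    (("Doyle", "NNP"), "compound", ("Sir", "NNP")),
]


def group_by_governor(triples):
    """Collect (relation, dependent) pairs under each governor word."""
    grouped = defaultdict(list)
    for (governor, _gov_tag), relation, (dependent, _dep_tag) in triples:
        grouped[governor].append((relation, dependent))
    return dict(grouped)


grouped = group_by_governor(triples)
for governor, deps in grouped.items():
    print(governor, "->", ", ".join(f"{rel}({dep})" for rel, dep in deps))
```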