# Part of Speech Tokenizer, Politics Data

This notebook follows the sentence tokenizer notebook.

It builds on the earlier data cleaning and preparation work. It assumes a sentence tokenizer is available and walks through removing words based on their part-of-speech (POS) tags.

# Contents

1. [Functions](#Functions)
    1. [cleanFile](#Clean-File-Function)
    2. [sentTokenizer](#Sentence-Tokenizer-Function)
    3. [Example Applying `cleanFile` and `sentTokenizer`](#Example-Applying-cleanFile-and-sentTokenizer)
        1. [Example file, `124146.txt`](#Example-file,-124146.txt)
        2. [Apply `cleanFile` function](#Apply-cleanFile-function)
        3. [Apply `sentTokenizer`](#Apply-sentTokenizer)
1. [Part of Speech Tagging](#Part-of-Speech-Tagging)
    1. [Create pos tagged sentences](#Create-pos-tagged-sentences)
    2. [Remove specific tags](#Remove-the-following-tags)
    3. [Lemmatize Words, Remove Punctuation and Stopwords](#Lemmatize-Words,-Remove-Punctuationa-and-Stopwords)

## Functions

### Clean File Function

In [1]:
def cleanFile(fname: str) -> str:
    '''Return the text of `fname` as one string, without header lines or stray symbols.'''
    # lines containing any of these markers are headers, e-mail addresses, or signatures
    skip_markers = ('@', 'Host', 'User', 'VAX', 'USPS Mail:', 'UUCP:', 'P.O. Box')
    clean_file_text = ''
    with open(fname) as fhand:                           # close the file when done
        for line in fhand:
            if any(marker in line for marker in skip_markers):
                continue
            new_line = line.replace('[IC]', '').replace('\n', ' ').replace('e.g.', ' ')

            # keep letters and common punctuation, drop apostrophes, and replace
            # every other character ('>', '+', '-', '#', '*', '^', '|', '/', '_', ...) with a space
            clean_line = str()
            for char in new_line:
                if char in '.!?,' or char.isalpha():
                    clean_line += char
                elif char != "'":
                    clean_line += ' '
            clean_file_text += clean_line

    return clean_file_text
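
The per-line rules can be tried without reading a file. The sketch below is a minimal illustration, assuming a hypothetical helper `clean_line` (not part of this notebook) that mirrors the logic of `cleanFile` for a single line. The two input lines are taken from the example file `124146.txt`.

```python
SKIP_MARKERS = ('@', 'Host', 'User', 'VAX', 'USPS Mail:', 'UUCP:', 'P.O. Box')

def clean_line(line: str) -> str:
    '''Hypothetical helper: apply the per-line rules of cleanFile to one line.'''
    # header, address, and signature lines are dropped entirely
    if any(marker in line for marker in SKIP_MARKERS):
        return ''
    line = line.replace('[IC]', '').replace('\n', ' ').replace('e.g.', ' ')
    cleaned = ''
    for char in line:
        if char in '.!?,' or char.isalpha():
            cleaned += char      # keep letters and common punctuation
        elif char != "'":
            cleaned += ' '       # any other character becomes a space
        # apostrophes are dropped, so "can't" becomes "cant"
    return cleaned

header_result = clean_line('Nntp-Posting-Host: acvax1\n')
quoted_result = clean_line("> +>I can't find my source.\n")
print(repr(header_result))
print(repr(quoted_result))
```

The header line is dropped, while the quoted line keeps its words and final period and loses the `>`, `+`, and apostrophe characters.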


### Sentence Tokenizer Function

In [2]:
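def sentTokenizer(clean_text: str) -> list[str]:
    '''Split cleaned text into a list of sentences.'''
    from nltk.tokenize import sent_tokenize

    # the parameter is named clean_text so that it does not shadow the cleanFile function
    sentences = sent_tokenize(clean_text.strip())
    return sentences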

### Example Applying `cleanFile` and `sentTokenizer`

#### Example file, `124146.txt`

In [3]:
# print each line of the raw file (each line keeps its trailing newline, hence the blank lines)
with open('./data/kaggle/Text_Classification_on_Documents/Politics/124146.txt') as fhand:
    for line in fhand:
        print(line)

Nntp-Posting-Host: acvax1

Nntp-Posting-User: cvads008



visser@convex.com (Lance Visser) writes:

> +>I can't find my source.

> +>But.  If you state that you will retract your claim, I'll go dig one up

> +>at the library.  Fair enough?

> 

> 	ARE YOU SERIOUS?  I'm not talking about retracting anything until

> you have produced SOMETHING.

> 

> 	If you were not just talking off the top of your head, I would

> assume that you have SOME memory of what your source is.

> 

> 	PUT UP NOW without conditions!





Yes, very serious.  I claim that I can substantiate my statement that

Rudman says he doesn't believe Perot was investigating him.  You claim

Perot was investigating him.  If you will state that you were in error

on this point, provided I produce the source, I'll go dig it up.



Now give me one reason why I should go to the trouble if you won't

agree to this?  It is simple enough you know.  But I don't have time

to waste if you'll just blow it off with more of the tripe

#### Apply `cleanFile` function

In [4]:
cleanFile('./data/kaggle/Text_Classification_on_Documents/Politics/124146.txt')

'     I cant find my source.     But.  If you state that you will retract your claim, Ill go dig one up     at the library.  Fair enough?       ARE YOU SERIOUS?  Im not talking about retracting anything until   you have produced SOMETHING.       If you were not just talking off the top of your head, I would   assume that you have SOME memory of what your source is.       PUT UP NOW without conditions!   Yes, very serious.  I claim that I can substantiate my statement that Rudman says he doesnt believe Perot was investigating him.  You claim Perot was investigating him.  If you will state that you were in error on this point, provided I produce the source, Ill go dig it up.  Now give me one reason why I should go to the trouble if you wont agree to this?  It is simple enough you know.  But I dont have time to waste if youll just blow it off with more of the tripe you usually post.        Michael Pye '

#### Apply `sentTokenizer`

In [5]:
# cleanFile -> sentTokenizer
sentTokenizer(cleanFile('./data/kaggle/Text_Classification_on_Documents/Politics/124146.txt'))

['I cant find my source.',
 'But.',
 'If you state that you will retract your claim, Ill go dig one up     at the library.',
 'Fair enough?',
 'ARE YOU SERIOUS?',
 'Im not talking about retracting anything until   you have produced SOMETHING.',
 'If you were not just talking off the top of your head, I would   assume that you have SOME memory of what your source is.',
 'PUT UP NOW without conditions!',
 'Yes, very serious.',
 'I claim that I can substantiate my statement that Rudman says he doesnt believe Perot was investigating him.',
 'You claim Perot was investigating him.',
 'If you will state that you were in error on this point, provided I produce the source, Ill go dig it up.',
 'Now give me one reason why I should go to the trouble if you wont agree to this?',
 'It is simple enough you know.',
 'But I dont have time to waste if youll just blow it off with more of the tripe you usually post.',
 'Michael Pye']

## Part of Speech Tagging

### Create pos tagged sentences

Note: each sentence is split into words with the simple `str.split()` method, so punctuation stays attached to the preceding word (for example, `'source.'`). An alternative is to apply `nltk.word_tokenize` to each sentence, which may be worth trying if tagging accuracy is low and tuning is needed.

In [6]:
import nltk
sentences = sentTokenizer(cleanFile('./data/kaggle/Text_Classification_on_Documents/Politics/124146.txt'))
tagged_word_sentences = nltk.pos_tag_sents([sent.split() for sent in sentences])
print(tagged_word_sentences)

[[('I', 'PRP'), ('cant', 'VBP'), ('find', 'VB'), ('my', 'PRP$'), ('source.', 'NN')], [('But.', 'NN')], [('If', 'IN'), ('you', 'PRP'), ('state', 'NN'), ('that', 'IN'), ('you', 'PRP'), ('will', 'MD'), ('retract', 'VB'), ('your', 'PRP$'), ('claim,', 'NN'), ('Ill', 'NNP'), ('go', 'VBP'), ('dig', 'RB'), ('one', 'CD'), ('up', 'NN'), ('at', 'IN'), ('the', 'DT'), ('library.', 'NN')], [('Fair', 'NNP'), ('enough?', 'NN')], [('ARE', 'NNP'), ('YOU', 'NNP'), ('SERIOUS?', 'NNP')], [('Im', 'NNP'), ('not', 'RB'), ('talking', 'VBG'), ('about', 'IN'), ('retracting', 'VBG'), ('anything', 'NN'), ('until', 'IN'), ('you', 'PRP'), ('have', 'VBP'), ('produced', 'VBN'), ('SOMETHING.', 'NNP')], [('If', 'IN'), ('you', 'PRP'), ('were', 'VBD'), ('not', 'RB'), ('just', 'RB'), ('talking', 'VBG'), ('off', 'RP'), ('the', 'DT'), ('top', 'NN'), ('of', 'IN'), ('your', 'PRP$'), ('head,', 'NN'), ('I', 'PRP'), ('would', 'MD'), ('assume', 'VB'), ('that', 'IN'), ('you', 'PRP'), ('have', 'VBP'), ('SOME', 'NNP'), ('memory', 'NN
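
To see what whitespace splitting does to punctuation, the sketch below compares `str.split()` with a simple regular expression that separates words from punctuation marks. The regular expression is only a stand-in chosen for illustration so the example runs with the standard library. It is not `nltk.word_tokenize`, and its output may differ from that tokenizer's.

```python
import re

sentence = 'If you state that you will retract your claim, Ill go dig one up at the library.'

# str.split() only splits on whitespace, so punctuation stays glued to the word
whitespace_tokens = sentence.split()

# illustrative stand-in for a word tokenizer: words and punctuation become separate tokens
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens[8])    # the ninth token still carries its comma
print(regex_tokens[8:10])      # the word and the comma are now two tokens
```

With separate punctuation tokens, the tagger sees `claim` rather than `claim,`, which is the kind of difference that could affect tagging accuracy.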

### Remove the following tags

Source of the tag list: [blog](https://www.learntek.org/blog/categorizing-pos-tagging-nltk-python/#:~:text=CC%20coordinating%20conjunction)

- IN: preposition/subordinating conjunction
- NNP: proper noun, singular (‘Harrison’)
- NNPS: proper noun, plural (‘Americans’)
- PRP: personal pronoun (I, he, she)
- PRP\$: possessive pronoun (my, his, hers)
- WP: wh-pronoun (who, what)
- WP\$: possessive wh-pronoun (whose)
- WRB: wh-adverb (where, when)
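
As a minimal sketch of the filtering step, the tags listed above can be held in a set and checked for each `(word, tag)` pair. The tagged sentence here is copied by hand from the tagger output shown earlier, so no tagger is needed to run it. The names `REMOVE_TAGS`, `tagged`, and `kept` are illustrative only.

```python
# tags to drop, copied from the list above
REMOVE_TAGS = {'IN', 'NNP', 'NNPS', 'PRP', 'PRP$', 'WP', 'WP$', 'WRB'}

# first tagged sentence, copied from the notebook output
tagged = [('I', 'PRP'), ('cant', 'VBP'), ('find', 'VB'), ('my', 'PRP$'), ('source.', 'NN')]

# keep only the words whose tag is not in the removal set
kept = [word for word, tag in tagged if tag not in REMOVE_TAGS]
filtered_sentence = ' '.join(kept)
print(filtered_sentence)
```

A set gives constant-time membership checks, and `' '.join` avoids the leading spaces that string concatenation would otherwise leave behind.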

In [7]:
import nltk
sentences = sentTokenizer(cleanFile('./data/kaggle/Text_Classification_on_Documents/Politics/124146.txt'))
tagged_word_sentences = nltk.pos_tag_sents([sent.split() for sent in sentences])
for sentence in tagged_word_sentences[:4]:
    print(sentence)

[('I', 'PRP'), ('cant', 'VBP'), ('find', 'VB'), ('my', 'PRP$'), ('source.', 'NN')]
[('But.', 'NN')]
[('If', 'IN'), ('you', 'PRP'), ('state', 'NN'), ('that', 'IN'), ('you', 'PRP'), ('will', 'MD'), ('retract', 'VB'), ('your', 'PRP$'), ('claim,', 'NN'), ('Ill', 'NNP'), ('go', 'VBP'), ('dig', 'RB'), ('one', 'CD'), ('up', 'NN'), ('at', 'IN'), ('the', 'DT'), ('library.', 'NN')]
[('Fair', 'NNP'), ('enough?', 'NN')]


In [8]:
import nltk
sentences = sentTokenizer(cleanFile('./data/kaggle/Text_Classification_on_Documents/Politics/124146.txt'))
tagged_word_sentences = nltk.pos_tag_sents([sent.split() for sent in sentences])
new_document = str()
for sentence in tagged_word_sentences[:4]:
    # print(sentence)
    new_sentence = str()
    for word, tag in sentence:
        print(word, tag)
        if tag not in ['IN', 'NNP', 'NNPS', 'PRP', 'PRP$', 'WP', 'WP$', 'WRB']:
            new_sentence += ' '+word
    new_document += ' '+new_sentence
    # print(new_sentence.lstrip())
    print(new_sentence)

print(new_document)

I PRP
cant VBP
find VB
my PRP$
source. NN
 cant find source.
But. NN
 But.
If IN
you PRP
state NN
that IN
you PRP
will MD
retract VB
your PRP$
claim, NN
Ill NNP
go VBP
dig RB
one CD
up NN
at IN
the DT
library. NN
 state will retract claim, go dig one up the library.
Fair NNP
enough? NN
 enough?
  cant find source.  But.  state will retract claim, go dig one up the library.  enough?


In [9]:
import nltk
sentences = sentTokenizer(cleanFile('./data/kaggle/Text_Classification_on_Documents/Politics/124146.txt'))
tagged_word_sentences = nltk.pos_tag_sents([sent.split() for sent in sentences])
new_document = str()
for sentence in tagged_word_sentences:
    # print(sentence)
    new_sentence = str()
    for word, tag in sentence:
        # print(word, tag)
        if tag not in ['IN', 'NNP', 'NNPS', 'PRP', 'PRP$', 'WP', 'WP$', 'WRB']:
            new_sentence += ' '+word
    new_document += ' '+new_sentence.lstrip()
    # print(new_sentence.lstrip())
    # print(new_sentence)

print(new_document.lstrip())

cant find source. But. state will retract claim, go dig one up the library. enough?  not talking retracting anything have produced were not just talking off the top head, would assume have memory source is. PUT UP conditions! Yes, very serious. claim can substantiate statement says doesnt believe was investigating him. claim was investigating him. will state were error this point, provided produce the source, go dig up. Now give one reason should go to the trouble wont agree to this? is simple enough know. But dont have time to waste youll just blow off more the tripe usually post. 


### Second Example

In [10]:
# print each line of the raw file, without the trailing newline (\n)
with open('./data/kaggle/Text_Classification_on_Documents/Politics/178444.txt') as fhand:
    for line in fhand:
        print(line.rstrip())


In article <1qk73q$3fj@agate.berkeley.edu> dzkriz@ocf.berkeley.edu (Dennis Kriz) writes:
>In article <sandvik-140493233557@sandvik-kent.apple.com> sandvik@newton.apple.com (Kent Sandvik) writes:
>>In article <1qid8s$ik0@agate.berkeley.edu>, dzkriz@ocf.berkeley.edu (Dennis
>>Kriz) wrote:
>>> The most recent reason given by the Clinton Administration for
>>> calling for federally funded abortions is that many private
>>> health insurance programs offer coverage for abortion.
>>
>>> The following are two form letters regarding this.  Please send
>>> them around to friends as well as other BBSs
>>
>>"Just sign it and send it, sonny, don't read the fine print. Just
>>sign it, sonny! :-).
>>
>>Cheers,
>>Kent
>>---
>>sandvik@newton.apple.com. ALink: KSAND -- Private activities on the net.
>
>
>Well you know that you're getting somewhere, when you start getting
>responses like this.
>
>Kent, let me explain it to you.
>
>If you are paying for a phone, and you don't want call-waiting, YOU DON'T

In [11]:
cleanFile('./data/kaggle/Text_Classification_on_Documents/Politics/178444.txt')

'   Kriz  wrote      The most recent reason given by the Clinton Administration for     calling for federally funded abortions is that many private     health insurance programs offer coverage for abortion.         The following are two form letters regarding this.  Please send     them around to friends as well as other BBSs       Just sign it and send it, sonny, dont read the fine print. Just   sign it, sonny!    .      Cheers,   Kent            Well you know that youre getting somewhere, when you start getting  responses like this.    Kent, let me explain it to you.    If you are paying for a phone, and you dont want call waiting, YOU DONT  NEED TO PAY FOR CALl WAITING.    This whole Clinton induced abortion debate SHOULD begin to make NARAL  nervous, because it has exposed a real scam.    If one is paying for a PRIVATE health insurance plan and DOES NOT WANT   abortion coverage  there is NO reason for that person to be COMPLELLED  to pay for it.   Just as one should not be compelle

In [12]:
sentTokenizer(cleanFile('./data/kaggle/Text_Classification_on_Documents/Politics/178444.txt'))

['Kriz  wrote      The most recent reason given by the Clinton Administration for     calling for federally funded abortions is that many private     health insurance programs offer coverage for abortion.',
 'The following are two form letters regarding this.',
 'Please send     them around to friends as well as other BBSs       Just sign it and send it, sonny, dont read the fine print.',
 'Just   sign it, sonny!',
 '.',
 'Cheers,   Kent            Well you know that youre getting somewhere, when you start getting  responses like this.',
 'Kent, let me explain it to you.',
 'If you are paying for a phone, and you dont want call waiting, YOU DONT  NEED TO PAY FOR CALl WAITING.',
 'This whole Clinton induced abortion debate SHOULD begin to make NARAL  nervous, because it has exposed a real scam.',
 'If one is paying for a PRIVATE health insurance plan and DOES NOT WANT   abortion coverage  there is NO reason for that person to be COMPLELLED  to pay for it.',
 'Just as one should not be c

In [13]:
import nltk
sentences = sentTokenizer(cleanFile('./data/kaggle/Text_Classification_on_Documents/Politics/178444.txt'))
tagged_word_sentences = nltk.pos_tag_sents([sent.split() for sent in sentences])
new_document = str()
for sentence in tagged_word_sentences:
    # print(sentence)
    new_sentence = str()
    for word, tag in sentence:
        # print(word, tag)
        if tag not in ['IN', 'NNP', 'NNPS', 'PRP', 'PRP$', 'WP', 'WP$', 'WRB']:
            new_sentence += ' '+word
    new_document += ' '+new_sentence.lstrip()
    # print(new_sentence.lstrip())
    # print(new_sentence)

print(new_document.lstrip())

wrote The most recent reason given the calling federally funded abortions is many private health insurance programs offer coverage abortion. The following are two form letters regarding this. send to friends as well other sign and send it, sonny, dont read the fine print. Just sign it, sonny! . know youre getting somewhere, start getting responses this. let explain to you. are paying a phone, and dont want call waiting, This whole induced abortion debate begin to make nervous, has exposed a real scam. one is paying a health insurance plan and abortion coverage there is reason that person to be COMPLELLED to pay it. Just one should not be compelled to pay lipposuction coverage ONE doesnt kind coverage . There are basic services and there are optional services , Call waiting is an optional service, but having the number work ones phone is a basic service. Just some nutcase doesnt happen to use the phone, none the numbers calls has a it, doesnt mean has the right to demand the phone compa

### Lemmatize Words, Remove Punctuationa and Stopwords

In [14]:
def remove_punctuation(word: str) -> str:
    '''Replace every non-alphabetic character in `word` with a space.'''
    new_word = str()                                     # create a string without punctuation
    for char in word:                                    # look at each character in the word
        if char.isalpha():                               # determine if the character is alphabetic
            new_word += char                             # if it is alphabetic, add it to the new word
        else:
            new_word += ' '                              # otherwise, add a space in its place

    return new_word
    
def lemmatize(word: str) -> str:
    '''Lemmatize `word` as a verb (pos="v").'''
    from nltk.stem import WordNetLemmatizer
    wnl = WordNetLemmatizer()
    new_word = wnl.lemmatize(word, pos="v")
    return new_word
    
def create_lemmatized_stopwords() -> set[str]:
    '''Return the NLTK English stop words with non-alphabetic characters removed.

    Despite the name, the stop words are not lemmatized. Only characters such as
    apostrophes are stripped, so that the entries match the cleaned text.
    '''
    from nltk.corpus import stopwords

    # Create a set of stop words 
    STOP_WORDS = set(stopwords.words('english'))

    stop_words = set()
    for word in STOP_WORDS:
        new_word = str()                                     # create a string without punctuation
        for char in word:                                    # look at each character in the word
            if char.isalpha():                               # determine if the character is alphabetic
                new_word += char                             # if it is alphabetic, add it to the new word
        if new_word != '':
            stop_words.add(new_word)

    return stop_words

In [15]:
stopwords = create_lemmatized_stopwords()
"but" in stopwords

True
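
Stripping non-alphabetic characters from the stop words matters because `cleanFile` has already removed apostrophes from the text, so a word like `don't` appears as `dont`. The sketch below illustrates this with a small hand-written sample rather than the full NLTK list. The sample entries are assumed to appear in NLTK's English stop word list, and the variable names are illustrative.

```python
# hand-written sample, assumed to match entries in NLTK's English stop word list
sample_stop_words = ["don't", "you'll", "won't", "but"]

# same normalization as create_lemmatized_stopwords: keep alphabetic characters only
normalized = {''.join(ch for ch in word if ch.isalpha()) for word in sample_stop_words}

# words as they appear after cleanFile has removed apostrophes
cleaned_words = ['but', 'i', 'dont', 'have', 'time']

with_raw_list = [w for w in cleaned_words if w not in sample_stop_words]
with_normalized = [w for w in cleaned_words if w not in normalized]

print(with_raw_list)     # 'dont' slips through the unmodified stop words
print(with_normalized)   # 'dont' is filtered by the normalized stop words
```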

In [16]:
import nltk
sentences = sentTokenizer(cleanFile('./data/kaggle/Text_Classification_on_Documents/Politics/124146.txt'))
tagged_word_sentences = nltk.pos_tag_sents([sent.split() for sent in sentences])
stopwords = create_lemmatized_stopwords()
new_document = str()
for sentence in tagged_word_sentences[:3]:
    # print(sentence)
    new_sentence = str()
    for word, tag in sentence:
        # print(word, tag)
        if tag not in ['IN', 'NNP', 'NNPS', 'PRP', 'PRP$', 'WP', 'WP$', 'WRB']:
            # new_sentence += ' '+remove_punctuation(word.lower()).strip()
            new_word = lemmatize(remove_punctuation(word.lower()).strip())
            # print(new_word)
            if new_word not in stopwords:
                new_sentence += ' '+new_word
    if new_sentence != '':
        new_document += ' '+new_sentence.strip()
    # print(new_sentence.lstrip())
    # print(new_sentence)

print(new_document.strip())

cant find source state retract claim go dig one library


In [18]:
def clean_document(sentences: list[str]) -> str:
    '''Drop words with unwanted part-of-speech tags, then lowercase, remove punctuation, lemmatize, and remove stop words.'''

    import nltk
    # the parameter is named sentences so that it does not shadow the sentTokenizer function
    remove_tags = ['IN', 'NNP', 'NNPS', 'PRP', 'PRP$', 'WP', 'WP$', 'WRB']
    tagged_word_sentences = nltk.pos_tag_sents([sent.split() for sent in sentences])
    stopwords = create_lemmatized_stopwords()
    new_document = str()
    for sentence in tagged_word_sentences:
        new_sentence = str()
        for word, tag in sentence:
            if tag not in remove_tags:
                # lowercase, strip punctuation, then lemmatize the remaining word
                new_word = lemmatize(remove_punctuation(word.lower()).strip())
                if new_word not in stopwords:
                    new_sentence += ' '+new_word
        # skip sentences in which every word was removed
        if new_sentence != '':
            new_document += ' '+new_sentence.strip()
    
    return new_document.strip()

In [19]:
# document '124146.txt'
clean_document(sentTokenizer(cleanFile('./data/kaggle/Text_Classification_on_Documents/Politics/124146.txt')))

'cant find source state retract claim go dig one library enough talk retract anything produce talk top head would assume memory source put condition yes serious claim substantiate statement say believe investigate claim investigate state error point provide produce source go dig give one reason go trouble agree simple enough know time waste blow tripe usually post'


In [20]:
# document '178444.txt'
clean_document(sentTokenizer(cleanFile('./data/kaggle/Text_Classification_on_Documents/Politics/178444.txt')))

'write recent reason give call federally fund abortions many private health insurance program offer coverage abortion follow two form letter regard send friends sign send sonny read fine print sign sonny  know get somewhere start get responses let explain pay phone want call wait whole induce abortion debate begin make nervous expose real scam one pay health insurance plan abortion coverage reason person complelled pay one compel pay lipposuction coverage one kind coverage basic service optional service call wait optional service number work ones phone basic service nutcase happen use phone none number call mean right demand phone company unbundle charge use phone digit unbundling would horrendously inefficient bill bookkeeping overhead abortion see basic service public fund abortion money view substantial portion population probably clear majority ethical thing save money tax way take tax save invest private charities program help reduce need abortion every pro lifer would create mass