# <span style ="color:#02731e;font-family:calibri"> Text Basics for Natural Language Processing

## <span style ="color:#02731e;font-family:calibri"> Working with Text files

In [1]:
person = "Digamber"

In [4]:
print(f"My name is {person}")

My name is Digamber


#### This is a formatted string literal, or f-string for short

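### Another nice thing about f-strings: expressions are evaluated inside the braces, so we can look up a dictionary key or index a list directly within the literal, as below. Use a different quote style inside the braces than the one enclosing the string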

In [5]:
d = {"a":123,"b":456}

In [6]:
print(f"my number is {d['a']}")

my number is 123


In [7]:
mylist = [0,1,2]

In [9]:
print(f"my number is {mylist[0]}")

my number is 0


### Alignment and padding with f-strings when we are trying to print out multiple items

In [11]:
library=[('Author',"Topic","Pages"),("Twain","Rafting","601"),("Feyman","Physics",95),("hamilton","Mythology",144)]

In [12]:
library

[('Author', 'Topic', 'Pages'),
 ('Twain', 'Rafting', '601'),
 ('Feyman', 'Physics', 95),
 ('hamilton', 'Mythology', 144)]

In [13]:
for book in library:
    print(book)

('Author', 'Topic', 'Pages')
('Twain', 'Rafting', '601')
('Feyman', 'Physics', 95)
('hamilton', 'Mythology', 144)


In [15]:
for book in library:
    print(f"{book[0]}")

Author
Twain
Feyman
hamilton


### We can use tuple unpacking to do the following

In [19]:
for author,topic,pages in library:
    print(f"{author} {topic} {pages}")

Author Topic Pages
Twain Rafting 601
Feyman Physics 95
hamilton Mythology 144


### At first glance the formatting is poor because nothing controls spacing or padding. The first thing we can do is pass in a minimum width for each field

In [23]:
for author,topic,pages in library:
    print(f"{author:{10}} {topic:{30}} {pages:{10}}")

Author     Topic                          Pages     
Twain      Rafting                        601       
Feyman     Physics                                95
hamilton   Mythology                             144


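#### The pages column is still misaligned: strings are left-aligned by default while numbers are right-aligned, and the list mixes both ('Pages' and '601' are strings, 95 and 144 are integers). Adding > right-aligns every value in that column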

In [27]:
for author,topic,pages in library:
    print(f"{author:{10}} {topic:{30}} {pages:>{10}}")

Author     Topic                               Pages
Twain      Rafting                               601
Feyman     Physics                                95
hamilton   Mythology                             144
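The alignment options can also be spelled out explicitly. The following is a minimal sketch, assuming a small made-up `rows` list that is not part of the notebook: `<`, `>` and `^` request left, right and centred alignment, and a character placed directly before the alignment symbol is used as the fill.

```python
# Hypothetical sample data, not from the notebook above
rows = [("Twain", 601), ("Hamilton", 144)]

lines = []
for author, pages in rows:
    # '<' left-aligns in 10 columns; '.>' right-aligns in 6 columns, filling with dots
    lines.append(f"{author:<10}|{pages:.>6}")

for line in lines:
    print(line)

centered = f"{'NLP':^9}"  # '^' centres the text in a field of width 9
print(repr(centered))
```

Spelling out the alignment for every column avoids depending on the different defaults for strings and numbers.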


### F-strings can also format dates

In [28]:
from datetime import datetime

In [29]:
today = datetime(year = 2019, month = 2, day = 28)

In [33]:
print(f"{today}")

2019-02-28 00:00:00


### This is the default way datetime displays a date, but often we want to format it. We can use strftime format codes; a reference is available at https://strftime.org/

In [34]:
print(f"{today:%B}")

February


In [35]:
print(f"{today:%B %d, %Y}")

February 28, 2019
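The part after the colon in the f-string is handed to the datetime object as a strftime format, so both spellings should give the same text. A minimal sketch, using a numeric format so the result does not depend on locale settings:

```python
from datetime import datetime

today = datetime(year=2019, month=2, day=28)

# The format spec after ':' is passed on to strftime, so these two agree
via_fstring = f"{today:%Y-%m-%d}"
via_strftime = today.strftime("%Y-%m-%d")
print(via_fstring, via_strftime)
```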


## <span style ="color:#02731e;font-family:calibri"> Now we will see how to read and write text files with Python

### Let's create a text file. The %%writefile command used below is specific to Jupyter notebooks

In [36]:
%%writefile test.txt
Hello, this is a quick test file.
This is the second line of the file.

Writing test.txt


### The text file has been created, so we can open it. First, see what happens when the file name is wrong

In [37]:
myfile = open("whoops.txt")

FileNotFoundError: [Errno 2] No such file or directory: 'whoops.txt'

### If we ever get this Errno 2 while opening a file, it means we have not given the correct location of the file or we have misspelled the file name

In [38]:
myfile = open("test.txt")

In [39]:
myfile

<_io.TextIOWrapper name='test.txt' mode='r' encoding='cp1252'>

In [40]:
myfile.read()

'Hello, this is a quick test file.\nThis is the second line of the file.\n'

#### "\n" in the text represents a new line

In [41]:
myfile.read()

''

### The second read() returns an empty string. This is not because the file cannot be read twice, but because the cursor is already at the end of the file after the first read. We can reset it as below:

In [42]:
myfile.seek(0)

0

In [43]:
myfile.read()

'Hello, this is a quick test file.\nThis is the second line of the file.\n'

### After the first read, the cursor inside the file sits at the end. With .seek(0) we reset it to the beginning, and the next call to read() returns the text of the file again
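The cursor behaviour can be reproduced in one self-contained cell. This is a minimal sketch that writes a throwaway file in a temporary directory (the file name `cursor_demo.txt` is made up for illustration) so that `test.txt` is left untouched:

```python
import os
import tempfile

# Write a small throwaway file so the sketch does not touch test.txt
path = os.path.join(tempfile.mkdtemp(), "cursor_demo.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

with open(path) as f:
    first_read = f.read()    # cursor moves to the end of the file
    second_read = f.read()   # nothing left to read, so this is ''
    f.seek(0)                # move the cursor back to the start
    third_read = f.read()    # the full text is available again

print(repr(second_read))
print(third_read == first_read)
```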

In [44]:
myfile.seek(0)

0

In [45]:
content = myfile.read()

In [46]:
content

'Hello, this is a quick test file.\nThis is the second line of the file.\n'

In [47]:
print(content)

Hello, this is a quick test file.
This is the second line of the file.



### <span style ="color:red;font-family:calibri"> Important Note: Always close the file once we have completed our operations on it

In [48]:
myfile.close()

### There is also a readlines() method, which returns all lines as a list so that we can work with a particular line

In [49]:
myfile = open('test.txt')

In [50]:
myfile.readlines()

['Hello, this is a quick test file.\n',
 'This is the second line of the file.\n']

In [51]:
myfile.seek(0)

0

In [52]:
mylines = myfile.readlines()

In [53]:
for line in mylines:
    print(line[0])

H
T


In [54]:
for line in mylines:
    print(line.split()[0])

Hello,
This
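`readlines()` loads every line into a list at once. For larger files, a common alternative is to loop over the file object itself, which yields one line at a time. A minimal sketch, again using a made-up temporary file rather than `test.txt`:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")
with open(path, "w") as f:
    f.write("Hello, this is a quick test file.\nThis is the second line of the file.\n")

first_words = []
with open(path) as f:
    for line in f:                           # yields one line at a time
        first_words.append(line.split()[0])  # first whitespace-separated token

print(first_words)
```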


### We have seen how to read files; now we will see how to write them. The modes "w" and "w+" are used for writing, but we should use them with caution: opening a file in either mode truncates the original file, which means anything in it is deleted and overwritten

In [55]:
myfile = open("test.txt","w+")

In [56]:
myfile.read()

''

### We can see everything has been deleted, so this mode should only be used if you want to completely overwrite the content

In [57]:
myfile.write("My Brand New Text")

17

In [58]:
myfile.seek(0)

0

In [59]:
myfile.read()

'My Brand New Text'

In [60]:
myfile.close()

### Now we will see how to append to the file

In [75]:
myfile = open("test.txt",'a+')

#### "a+" is used to append to the file. One important point: if you misspell the file name, it will create a new file with that name

In [76]:
myfile.write("\n My First Line in A+ Opening")

29

In [77]:
myfile.close()

In [88]:
newfile = open("test.txt", mode = "a+")

In [89]:
newfile.write("\nThis is an added line, because I used a+ mode")

46

In [90]:
newfile.seek(0)

0

In [91]:
newfile.read()

'My Brand New TextMy First Line in A+ Opening\n My First Line in A+ OpeningThis is an added line, because I used a+ mode\nThis is an added line, because I used a+ mode'

In [92]:
newfile.seek(0)

0

In [94]:
print(newfile.read())

My Brand New TextMy First Line in A+ Opening
 My First Line in A+ OpeningThis is an added line, because I used a+ mode
This is an added line, because I used a+ mode
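The difference between writing and appending can be seen side by side. This is a minimal sketch on a made-up temporary file. Note that every run of an append cell adds its text again, which may explain the repeated lines in the output above.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(path, "w") as f:      # "w" creates the file (or truncates an existing one)
    f.write("one\n")
with open(path, "a") as f:      # "a" keeps existing content and adds to the end
    f.write("two\n")
with open(path) as f:
    after_append = f.read()

with open(path, "w") as f:      # opening with "w" again discards "one" and "two"
    f.write("three\n")
with open(path) as f:
    after_overwrite = f.read()

print(repr(after_append))
print(repr(after_overwrite))
```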


### We can use the context manager to automatically close the file

In [96]:
with open('test.txt','r') as mynewfile:
    myvariabel = mynewfile.readlines()

In [97]:
myvariabel

['My Brand New TextMy First Line in A+ Opening\n',
 ' My First Line in A+ OpeningThis is an added line, because I used a+ mode\n',
 'This is an added line, because I used a+ mode']

### By using the "with" context manager, the file is closed automatically when the block ends.
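The file object's `closed` attribute makes the automatic closing visible. A minimal sketch on a made-up temporary file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

with open(path, "w") as f:
    f.write("some text")
    inside = f.closed     # False: the file is still open inside the block

outside = f.closed        # True: closed automatically on leaving the block
print(inside, outside)
```

The file is closed even if an exception is raised inside the block, which is the main reason to prefer `with` over a manual `close()`.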

# <span style ="color:#02731e;font-family:calibri"> Working with PDF files
* Often you may need to read in text data from a PDF file.
* We can use the PyPDF2 library to read in the text data from a PDF file.

### <span style ="color:red;font-family:calibri"> Keep in mind: Not All PDFs Have Text That Can Be Extracted!

### <span style ="color:#02731e;font-family:calibri"> Some PDFs are created through scanning, instead of being exported from a text editor like Word.</span>
### <span style ="color:#02731e;font-family:calibri"> These scanned PDFs are more like image files, making it much harder to extract the text.</span>
### <span style ="color:#02731e;font-family:calibri"> Often this requires specialized software!

### <span style ="color:#02731e;font-family:calibri">The PyPDF2 library is made to extract text from PDF files exported directly from a word processor, but keep in mind that not all word processors create PDFs with extractable text!</span>

In [98]:
import PyPDF2

In [99]:
mypdf = open("US_Declaration.pdf", mode = 'rb')

### "rb" opens the file for reading in binary mode

In [101]:
pdf_reader = PyPDF2.PdfReader(mypdf)

In [104]:
len(pdf_reader.pages)

5

In [106]:
page_one = pdf_reader.pages[0]

In [109]:
print(page_one.extract_text())

Declaration of Independence
IN CONGRESS, July 4, 1776.  
The unanimous Declaration of the thirteen united States of America,  
When in the Course of human events, it becomes necessary for one people to dissolve thepolitical bands which have connected them with another, and to assume among the powers of theearth, the separate and equal station to which the Laws of Nature and of Nature's God entitlethem, a decent respect to the opinions of mankind requires that they should declare the causeswhich impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed bytheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit
of Happiness.— That to secure these rights, Governments are instituted among Men, derivingtheir just powers from the consent of the governed,—  That whenever any Form of Government
becomes destructive of these ends, it is the Right of the People to alter or to abolish it, 

In [110]:
mypdf.close()

#### We can also add pages: we copy a page from another PDF (or the same PDF) and append it to the end of a new PDF

In [111]:
f = open("US_Declaration.pdf",'rb')

In [112]:
pdf_reader = PyPDF2.PdfReader(f)

In [125]:
firs_Page = pdf_reader.pages[0]

In [126]:
pdf_writer = PyPDF2.PdfWriter()

In [127]:
pdf_writer.add_page(firs_Page)

{'/Type': '/Page',
 '/Contents': {},
 '/MediaBox': [0, 0, 612, 792],
 '/Resources': {'/Font': {'/F9': {'/Type': '/Font',
    '/Subtype': '/Type1',
    '/Name': '/F9',
    '/Encoding': '/WinAnsiEncoding',
    '/FirstChar': 31,
    '/LastChar': 255,
    '/Widths': [778,
     250,
     333,
     ...

In [128]:
pdf_output =  open("My_Brand_New.pdf","wb")

In [129]:
pdf_writer.write(pdf_output)

(False, <_io.BufferedWriter name='My_Brand_New.pdf'>)

In [130]:
pdf_output.close()

In [131]:
f.close()

In [132]:
brand_new = open("My_Brand_New.pdf",'rb')

In [133]:
pdf_reader = PyPDF2.PdfReader(brand_new)

In [134]:
len(pdf_reader.pages)

1

In [144]:
pdf_text = []

# Open the PDF with a context manager so it is closed automatically
with open("US_Declaration.pdf", "rb") as f:
    pdf_reader = PyPDF2.PdfReader(f)
    # Collect the extracted text of every page, in order
    for page in pdf_reader.pages:
        pdf_text.append(page.extract_text())

In [146]:
len(pdf_text)

5
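Since scanned PDFs may yield little or no text, it can help to check the extracted strings before processing them further. The helper below, `empty_pages`, is a hypothetical name introduced for illustration and is not part of PyPDF2; it works on any list of page strings such as `pdf_text`. The sample list is made up so that the sketch runs without a PDF file.

```python
def empty_pages(page_texts):
    """Return the indexes of pages whose extracted text is missing or blank."""
    return [i for i, text in enumerate(page_texts) if not (text or "").strip()]

# Made-up stand-in for a list like pdf_text
sample_pages = ["Declaration of Independence", "", "   \n", "More text"]
blank = empty_pages(sample_pages)
print(blank)
```

A page that shows up in this list is only a hint that the page might be an image; it could also simply be an empty page.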

# <span style ="color:#02731e;font-family:calibri"> Regular Expressions

### If we need to search a string for a term, such as "phone", we can use the in keyword like this:


In [147]:
"phone " in "Is the phone here?"

True

We can use the "in" operator to do that, but what if we don't know the exact text, such as the exact phone number?

### If we know the format of what we are looking for, such as an email address, phone number, or date, all we need is a regular expression to search through the document for this pattern

### <span style ="color:#02731e;font-family:calibri"> Regular expressions allow for pattern searching in a text document.
* The syntax for regular expressions can be very intimidating at first:</span>
#### <span style ="color:red;font-family:calibri"> r'\d{3}-\d{3}-\d{4}'


* The key thing to keep in mind is that every character type has a corresponding pattern code.
* For example, digits have the placeholder pattern \d
* The backslash signals that this is a special code and not the letter "d"
* So the example above says: find 3 digits, then a dash, then another 3 digits, then a dash, and then 4 digits


In [148]:
text = "The phone number of the agent is 408-555-1234. Call soon!"

In [150]:
"phone" in text

True

In [151]:
"408-555-1234" in text

True

#### That was the in operator; now let's see how to use regular expressions

In [152]:
import re

### This is the regular expression library built into Python

In [153]:
pattern = "phone"

In [154]:
re.search(pattern, text)

<re.Match object; span=(4, 9), match='phone'>

In [155]:
my_match = re.search(pattern, text)

In [156]:
my_match.span()

(4, 9)

This tells us the index where the match starts and the index just past where it ends (the end index is exclusive)

In [157]:
my_match.start()

4

In [158]:
my_match.end()

9
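Because the end index is exclusive, the span can be used directly as a slice. A minimal sketch with the same text, which also shows what `re.search` returns when nothing matches (the search term "fax" is made up for illustration):

```python
import re

text = "The phone number of the agent is 408-555-1234. Call soon!"
my_match = re.search("phone", text)

start, end = my_match.span()
# The end index is exclusive, so ordinary slicing recovers the matched text
sliced = text[start:end]
print(sliced, sliced == my_match.group())

# re.search returns None when nothing matches, so check before using the result
missing = re.search("fax", text)
print(missing is None)
```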

If the pattern occurs more than once, re.search returns only the first match

In [159]:
text = "my phone is a new phone"

In [160]:
match = re.search(pattern, text)

In [161]:
match.span()

(3, 8)

In [163]:
all_match = re.findall(pattern,text)

In [164]:
all_match

['phone', 'phone']

In [165]:
len(all_match)

2

If we want the match objects instead of a list of strings, we can use re.finditer

In [168]:
for match  in re.finditer("phone",text):
    print(match.span())

(3, 8)
(18, 23)


Next, we search for more general patterns

## Identifiers for Characters in Patterns

Character types such as digits or whitespace have different codes that represent them. You can use these to build up a pattern string. Notice how these make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:

    r'mypattern'
    
Placing the r in front of the string tells Python that the \ characters in the pattern string are not meant to be escape characters.
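The effect of the r prefix is easiest to see with a code that also means something in an ordinary Python string. The sketch below uses `\b` for this reason (an illustrative choice, not one of the identifiers discussed here): in a plain string it is the backspace character, while in a regular expression it marks a word boundary.

```python
import re

# Without the r prefix, "\b" is the backspace character (a single character)
plain = "\bcat\b"
# With the r prefix, the backslashes reach the regex engine, where \b is a word boundary
raw = r"\bcat\b"

print(len(plain), len(raw))

plain_hits = re.findall(plain, "cat concat cats")  # looks for literal backspaces
raw_hits = re.findall(raw, "cat concat cats")      # looks for the whole word "cat"
print(plain_hits, raw_hits)
```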

Below is a table of the most common identifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
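The rows of the table can be checked directly. This is a minimal sketch that pairs each example pattern with its example match and uses `re.fullmatch`, which succeeds only if the whole string fits the pattern:

```python
import re

# (pattern, candidate string) pairs taken from the table above
examples = [
    (r"file_\d\d", "file_25"),
    (r"\w-\w\w\w", "A-b_1"),
    (r"a\sb\sc", "a b c"),
    (r"\D\D\D", "ABC"),
    (r"\W\W\W\W\W", "*-+=)"),
    (r"\S\S\S\S", "Yoyo"),
]

# True for every pair whose candidate string fits the pattern completely
results = [bool(re.fullmatch(pattern, candidate)) for pattern, candidate in examples]
print(results)
```

Note that `\w` also matches the underscore, which is why `b_1` fits `\w\w\w`.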

In [170]:
text = 'my telephone number is 777-555-1234'

Let's say we want to find a telephone number in a text and we don't know the number

In [173]:
pattern = r"\d\d\d-\d\d\d-\d\d\d\d"

In [176]:
phone_number = re.search(pattern,text)

In [177]:
phone_number

<re.Match object; span=(23, 35), match='777-555-1234'>

In [178]:
phone_number.group()

'777-555-1234'

Notice the repetition of \d. That is a bit of an annoyance, especially if we are looking for very long strings of numbers. Let's explore the possible quantifiers.

## Quantifiers

Now that we know the special character designations, we can use them along with quantifiers to define how many we expect.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >\*</span></td><td>Occurs zero or more times</td><td>A\*B\*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
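The quantifier rows can be checked the same way. The sketch below tests each example from the table with `re.fullmatch`, then uses a made-up string of digits to show that quantifiers take as many characters as they are allowed to:

```python
import re

# (pattern, candidate string) pairs taken from the table above
examples = [
    (r"Version \w-\w+", "Version A-b1_1"),
    (r"\D{3}", "abc"),
    (r"\d{2,4}", "123"),
    (r"\w{3,}", "anycharacters"),
    (r"A*B*C*", "AAACC"),
    (r"plurals?", "plural"),
]
results = [bool(re.fullmatch(pattern, candidate)) for pattern, candidate in examples]
print(results)

# Quantifiers are greedy by default: {2,4} takes up to 4 digits at a time,
# and a lone digit is skipped because at least 2 are required
greedy = re.findall(r"\d{2,4}", "1 22 333 55555")
print(greedy)
```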

In [180]:
pattern = r"\d{3}-\d{3}-\d{4}"

In [182]:
my_match = re.search(pattern,text)

In [183]:
my_match.group()

'777-555-1234'

In [184]:
pattern = r"(\d{3})-(\d{3})-(\d{4})"

By putting parentheses around parts of the pattern, we can also retrieve the individual groups

In [189]:
my_match = re.search(pattern, text)

In [190]:
my_match.group(1)

'777'

In [191]:
my_match.group(3)

'1234'
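Group numbering starts at 1, while group 0 refers to the whole match. A minimal sketch with the same pattern, which also collects all groups at once with `groups()` (the names `area`, `exchange` and `line` are made up for illustration):

```python
import re

text = "my telephone number is 777-555-1234"
my_match = re.search(r"(\d{3})-(\d{3})-(\d{4})", text)

whole = my_match.group(0)      # group 0 (same as group()) is the entire match
parts = my_match.groups()      # all captured groups as a tuple
area, exchange, line = parts   # tuple unpacking works here too
print(whole, parts, area)
```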

We can use the pipe operator | as an "or" between alternatives

In [193]:
re.search(r"man|woman", "This man was here")

<re.Match object; span=(5, 8), match='man'>

In [194]:
re.search(r"man|woman", "This woman was here")

<re.Match object; span=(5, 10), match='woman'>

In [195]:
re.findall(r".at","The cat in the hat sat")

['cat', 'hat', 'sat']

"." works as a wildcard that matches any single character (except a newline), so ".at" finds any character followed by "at"

In [199]:
re.findall(r"..at","The cat in the hat sat splat")

[' cat', ' hat', ' sat', 'plat']

We might want to find text that starts or ends with certain characters. Note that these anchors apply to the whole string, not to individual words

^ means "starts with"

$ means "ends with"

In [200]:
re.findall(r"\d$", "This ends with a number 2")

['2']

In [201]:
re.findall(r"^\d", "1 is the lonliest number")

['1']
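The anchors refer to the start and end of the entire string. A minimal sketch with a made-up sentence containing two digits, only one of which sits at the end:

```python
import re

sentence = "Room 4 is on floor 2"

# '$' anchors to the end of the whole string, so only the final digit matches
end_digit = re.findall(r"\d$", sentence)
# '^' anchors to the start of the whole string; this text starts with a letter
start_digit = re.findall(r"^\d", sentence)
print(end_digit, start_digit)
```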

In [202]:
phrase = "there are 3 numbers 34 inside 5 this sentence"

If we want to get rid of all the numbers from the above phrase

In [205]:
re.findall(r"[^\d]+", phrase)  # a caret ^ inside square brackets means exclusion

['there are ', ' numbers ', ' inside ', ' this sentence']

In [206]:
test_phrase = "This is a string! but it has punctuation. How to remove it?"

In [209]:
re.findall(r"[^!.?]+",test_phrase)

['This is a string', ' but it has punctuation', ' How to remove it']

#### The punctuation is gone, but the result is a list of fragments. We can join them back into a single string

In [212]:
mylist = re.findall(r"[^!.?]+",test_phrase)

In [213]:
mylist

['This is a string', ' but it has punctuation', ' How to remove it']

In [215]:
' '.join(mylist)

'This is a string  but it has punctuation  How to remove it'
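An alternative that avoids the intermediate list is `re.sub`, which replaces every match of a pattern. This is a sketch of that different technique, not the approach used in the cells above; it also avoids the doubled spaces, since nothing is re-joined.

```python
import re

test_phrase = "This is a string! but it has punctuation. How to remove it?"

# Replace every '!', '.' or '?' with an empty string in one step
cleaned = re.sub(r"[!.?]", "", test_phrase)
print(cleaned)
```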

#### Square brackets define a set of characters, and the "+" after them means one or more characters from that set

In [216]:
text = "only find the hypen-words. Where are the lonh-is dash words?"

In [218]:
re.findall(r"[\w]+-[\w]+",text)

['hypen-words', 'lonh-is']

# <span style ="color:#02731e;font-family:calibri"> Regular Expression Exercise

#### 1. Print an f-string that displays `NLP stands for Natural Language Processing` using the variables provided.

In [221]:
abbr = 'NLP'
full_text = 'Natural Language Processing'

# Enter your code here:
print(f"{abbr} stands for {full_text}")

NLP stands for Natural Language Processing


#### 2. Create a file in the current working directory called `contacts.txt` by running the cell below:

In [267]:
%%writefile contacts.txt
First_Name Last_Name, Title, Extension, Email

Overwriting contacts.txt


#### 3. Open the file and use .read() to save the contents of the file to a string called `fields`.  Make sure the file is closed at the end.

In [268]:
with open("contacts.txt") as f:
    fields = f.read()
print(fields)

First_Name Last_Name, Title, Extension, Email




#### 4. Use PyPDF2 to open the file `Business_Proposal.pdf`. Extract the text of page 2.

In [226]:
mypdf = open("Business_Proposal.pdf",mode='rb')

In [228]:
pdfreader = PyPDF2.PdfReader(mypdf)

In [263]:
pages = pdfreader.pages[1].extract_text()

In [264]:
print(pages)

AUTHORS:  
Amy Baker, Finance Chair, x345, abaker@ourcompany.com  
Chris Donaldson, Accounting Dir., x621, cdonaldson@ourcompany.com  
Erin Freeman, Sr. VP, x879, efreeman@ourcompany.com  


In [265]:
mypdf.close()

#### 5. Open the file `contacts.txt` in append mode. Add the text of page 2 from above to `contacts.txt`
#### CHALLENGE: See if you can remove the word "AUTHORS:"

In [270]:
with open("contacts.txt", mode= "a+") as f:
    f.write(pages[8:])
    f.seek(0)
    print(f.read())

First_Name Last_Name, Title, Extension, Email
  
Amy Baker, Finance Chair, x345, abaker@ourcompany.com  
Chris Donaldson, Accounting Dir., x621, cdonaldson@ourcompany.com  
Erin Freeman, Sr. VP, x879, efreeman@ourcompany.com    
Amy Baker, Finance Chair, x345, abaker@ourcompany.com  
Chris Donaldson, Accounting Dir., x621, cdonaldson@ourcompany.com  
Erin Freeman, Sr. VP, x879, efreeman@ourcompany.com  


#### 6. Using the page 2 text extracted above (stored in the variable `pages` in this notebook, called `page_two_text` in the original exercise), extract any email addresses that were contained in the file `Business_Proposal.pdf`.

In [274]:

pattern = r"[\w]+@[\w]+\.\w{3}"

re.findall(pattern, pages)

['abaker@ourcompany.com',
 'cdonaldson@ourcompany.com',
 'efreeman@ourcompany.com']
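In an email pattern, a dot that is meant literally is worth escaping as `\.`, because an unescaped `.` matches any character. The sketch below compares both spellings on a made-up `sample` string that contains one real address and one look-alike without a dot:

```python
import re

# Made-up sample: the second "address" has an X where the dot should be
sample = "abaker@ourcompany.com and not-an-email@hostXcom"

loose = re.findall(r"[\w]+@[\w]+.\w{3}", sample)    # '.' matches any character
strict = re.findall(r"[\w]+@[\w]+\.\w{3}", sample)  # '\.' matches only a literal dot
print(loose)
print(strict)
```

Both patterns are still simplifications: they assume a three-character ending such as `com` and do not cover every valid address.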