# Nothing fancy here. Basics of working with text

In [1]:
library = [('Author', 'Topic', 'Pages'), ('Twain', 'Rafting', 601), ('Feynman', 'Physics', 95), ('Hamilton', 'Mythology', 144)]

In [4]:
for book in library:
    print(f"Author {book[0]}")

Author Author
Author Twain
Author Feynman
Author Hamilton


In [6]:
for author, topic, pages in library:
    print(f"{author} {topic} {pages}")

Author Topic Pages
Twain Rafting 601
Feynman Physics 95
Hamilton Mythology 144


In [7]:
library = [('Author', 'Topic', 'Pages'), ('Twain', 'Rafting in water alone', 601), ('Feynman', 'Physics', 95), ('Hamilton', 'Mythology', 144)]

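With fixed field widths the columns line up, but the `Pages` header is left-aligned while the numbers are right-aligned, so the last column still doesn't look well aligned: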

In [10]:
for author, topic, pages in library:
    print(f"{author:{10}} {topic:{30}} {pages:{10}}")

Author     Topic                          Pages     
Twain      Rafting in water alone                601
Feynman    Physics                                95
Hamilton   Mythology                             144


Right-aligning the last column with `>` looks better:

In [12]:
for author, topic, pages in library:
    print(f"{author:{10}} {topic:{30}} {pages:>{10}}")

Author     Topic                               Pages
Twain      Rafting in water alone                601
Feynman    Physics                                95
Hamilton   Mythology                             144


Or maybe something like this, with `.` as a fill character placed before the alignment symbol:

In [14]:
for author, topic, pages in library:
    print(f"{author:{10}} {topic:{30}} {pages:.>{10}}")

Author     Topic                          .....Pages
Twain      Rafting in water alone         .......601
Feynman    Physics                        ........95
Hamilton   Mythology                      .......144
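
The specifiers used above follow the pattern `[fill][align][width]`. As a minimal sketch, using a small hypothetical `rows` list instead of the `library` data, the three alignment symbols can be compared side by side. Literal widths can also be written without the nested braces (`{name:<10}` instead of `{name:<{10}}`):

```python
# Hypothetical sample data, used only to illustrate the format specifiers.
rows = [("Twain", 601), ("Feynman", 95)]

lines = []
for name, pages in rows:
    # '<' left-aligns, '^' centres, '>' right-aligns;
    # '*' and '.' are fill characters placed before the alignment symbol.
    lines.append(f"{name:<10}|{name:*^11}|{pages:.>6}")

for line in lines:
    print(line)
```

Strings default to left alignment and numbers to right alignment, which is why the unaligned `Pages` header and the page counts ended up on opposite sides of the column.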


### Working with datetime

f-strings also accept `strftime` format codes for `datetime` objects.

In [15]:
from datetime import datetime

In [16]:
today = datetime(year=2019,month=2,day=28)

In [17]:
print(f"{today}")

2019-02-28 00:00:00


Let's make it look nicer: `%y` is the two-digit year, `%b` the abbreviated month name, and `%d` the day of the month.

In [21]:
print(f"{today:%y-%b-%d}")

19-Feb-28
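
A few more format codes, as a small sketch on the same date. Only numeric codes are used here, since month and weekday names depend on the locale; the variable names are just for illustration:

```python
from datetime import datetime

# Same date as in the cells above.
today = datetime(year=2019, month=2, day=28)

iso_style = f"{today:%Y-%m-%d}"          # four-digit year, zero-padded month and day
with_time = f"{today:%d/%m/%Y %H:%M}"    # day first, followed by 24-hour time
same_thing = today.strftime("%Y-%m-%d")  # strftime accepts the same codes

print(iso_style)
print(with_time)
```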


# Working with text files

To create one, I will use the `%%writefile` magic command, which is specific to Jupyter notebooks.

In [24]:
%%writefile test.txt
Hello, this is a quick test file.
This is the second line of the file.

Writing test.txt


In [30]:
myfile = open('test.txt')

In [31]:
myfile

<_io.TextIOWrapper name='test.txt' mode='r' encoding='cp1252'>

In [32]:
myfile.read()

'Hello, this is a quick test file.\nThis is the second line of the file.\n'

In [33]:
myfile.seek(0)

0

In [38]:
content = myfile.read()

In [40]:
print(content)

Hello, this is a quick test file.
This is the second line of the file.



In [41]:
myfile.close()

In [42]:
myfile = open('test.txt')

In [43]:
myfile.readlines()

['Hello, this is a quick test file.\n',
 'This is the second line of the file.\n']

In [44]:
myfile.seek(0)

0

In [45]:
mylines = myfile.readlines()

In [46]:
mylines

['Hello, this is a quick test file.\n',
 'This is the second line of the file.\n']

In [48]:
for line in mylines:
    print(line.split()[0])

Hello,
This


### Writing to a file

Opening a file with `w+` truncates it, so the existing content is completely overwritten. This is why `myfile.read()` in the cell below returns an empty string.

In [25]:
myfile = open('test.txt', mode='w+')

In [26]:
myfile.read()

''

In [27]:
myfile.write('MY BRAND NEW TEXT')

17

In [28]:
myfile.seek(0)

0

In [29]:
myfile.read()

'MY BRAND NEW TEXT'

In [30]:
myfile.close()

Appending to a file keeps the existing content and adds new text at the end. Append mode is selected by passing `a+` to `open`.

In [31]:
myfile = open('woops.txt', 'a+')

In [32]:
myfile.write('MY FIRST LINE IN A+ OPENING')

27

In [33]:
myfile.close()

In [34]:
newfile = open('woops.txt')

In [35]:
newfile.read()

'MY FIRST LINE IN A+ OPENINGMY FIRST LINE IN A+ OPENINGMY FIRST LINE IN A+ OPENING'

In [36]:
newfile.close()

In [37]:
myfile = open('woops.txt', mode='a+')

In [38]:
myfile.write('This is an added line, because I used a+ mode')

45

In [39]:
myfile.seek(0)

0

In [40]:
myfile.read()

'MY FIRST LINE IN A+ OPENINGMY FIRST LINE IN A+ OPENINGMY FIRST LINE IN A+ OPENINGThis is an added line, because I used a+ mode'

In [41]:
myfile.write('\nThis is a real new line, on the next line')

42

In [42]:
myfile.seek(0)

0

In [43]:
myfile.read()

'MY FIRST LINE IN A+ OPENINGMY FIRST LINE IN A+ OPENINGMY FIRST LINE IN A+ OPENINGThis is an added line, because I used a+ mode\nThis is a real new line, on the next line'

In [44]:
myfile.seek(0)

0

In [45]:
print(myfile.read())

MY FIRST LINE IN A+ OPENINGMY FIRST LINE IN A+ OPENINGMY FIRST LINE IN A+ OPENINGThis is an added line, because I used a+ mode
This is a real new line, on the next line


In [46]:
myfile.close()
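
To compare the modes in one place, here is a minimal sketch. It assumes a throwaway file in a temporary directory (the name `modes_demo.txt` is made up), so `test.txt` and `woops.txt` are not touched:

```python
import os
import tempfile

# Throwaway file in a temporary directory; the file name is hypothetical.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

f = open(path, "w")       # 'w': create the file, or truncate it if it exists
f.write("first")
f.close()

f = open(path, "a")       # 'a': keep the content and write at the end
f.write(" second")
f.close()

f = open(path, "r")       # 'r': read only
after_append = f.read()
f.close()

f = open(path, "w+")      # 'w+': truncate first, then allow reading and writing
empty_on_open = f.read()
f.write("replaced")
f.seek(0)                 # go back to the start before reading
after_overwrite = f.read()
f.close()

print(repr(after_append), repr(empty_on_open), repr(after_overwrite))
```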

A context manager (`with`) closes the file automatically, even if an error occurs inside the block, so there is no risk of forgetting to call `close()`.

In [48]:
with open('woops.txt', 'r') as mynewfile:
    myvariable = mynewfile.readlines()

In [49]:
myvariable

['MY FIRST LINE IN A+ OPENINGMY FIRST LINE IN A+ OPENINGMY FIRST LINE IN A+ OPENINGThis is an added line, because I used a+ mode\n',
 'This is a real new line, on the next line']
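
A file object can also be iterated line by line inside the `with` block, which avoids loading the whole file at once. The sketch below assumes a temporary file with a made-up name and passes `encoding` explicitly, so the result does not depend on the platform default (the `open` output earlier shows `cp1252`):

```python
import os
import tempfile

# Throwaway file in a temporary directory; the file name is hypothetical.
path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("Hello, this is a quick test file.\n")
    f.write("This is the second line of the file.\n")

first_words = []
with open(path, encoding="utf-8") as f:
    for line in f:  # the file object yields one line at a time
        first_words.append(line.split()[0])

print(first_words)
print(f.closed)  # the with block has already closed the file
```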

### Working with PDF files

Using the PyPDF2 library to work with PDFs.

In [50]:
import PyPDF2

In [51]:
myfile = open('US_Declaration.pdf', mode='rb')

In [53]:
pdf_reader = PyPDF2.PdfReader(myfile)

In [57]:
len(pdf_reader.pages)

5

In [60]:
page_one = pdf_reader.pages[0]

In [67]:
print(page_one.extract_text())

Declaration of Independence
IN CONGRESS, July 4, 1776.  
The unanimous Declaration of the thirteen united States of America,  
When in the Course of human events, it becomes necessary for one people to dissolve thepolitical bands which have connected them with another, and to assume among the powers of theearth, the separate and equal station to which the Laws of Nature and of Nature's God entitlethem, a decent respect to the opinions of mankind requires that they should declare the causeswhich impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed bytheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit
of Happiness.— That to secure these rights, Governments are instituted among Men, derivingtheir just powers from the consent of the governed,—  That whenever any Form of Government
becomes destructive of these ends, it is the Right of the People to alter or to abolish it, 

In [65]:
mytext = page_one.extract_text()

In [66]:
myfile.close()

Adding a page to a new PDF: a page read from an existing file is copied into a `PdfWriter` and saved under a new name.

In [68]:
f = open('US_Declaration.pdf', 'rb')

In [69]:
pdf_reader = PyPDF2.PdfReader(f)

In [73]:
first_page = pdf_reader.pages[0]

In [72]:
pdf_writer = PyPDF2.PdfWriter()

In [None]:
pdf_writer.add_page(first_page)

In [76]:
pdf_output = open('MY_BRAND_NEW.pdf', 'wb')

In [77]:
pdf_writer.write(pdf_output)

(False, <_io.BufferedWriter name='MY_BRAND_NEW.pdf'>)

In [78]:
pdf_output.close()

In [79]:
f.close()

Let's check the file.

In [80]:
brand_new = open('MY_BRAND_NEW.pdf', 'rb')

pdf_reader = PyPDF2.PdfReader(brand_new)

In [82]:
len(pdf_reader.pages)

1

In [84]:
f = open('US_Declaration.pdf', 'rb')

pdf_text = [0]  # placeholder so that the list index matches the page number (page 1 -> index 1)

pdf_reader = PyPDF2.PdfReader(f)

for p in range(len(pdf_reader.pages)):
    
    page = pdf_reader.pages[p]
    
    pdf_text.append(page.extract_text())
    
f.close()

In [86]:
pdf_text

[0,
 "Declaration of Independence\nIN CONGRESS, July 4, 1776.  \nThe unanimous Declaration of the thirteen united States of America,  \nWhen in the Course of human events, it becomes necessary for one people to dissolve thepolitical bands which have connected them with another, and to assume among the powers of theearth, the separate and equal station to which the Laws of Nature and of Nature's God entitlethem, a decent respect to the opinions of mankind requires that they should declare the causeswhich impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed bytheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit\nof Happiness.— \x14That to secure these rights, Governments are instituted among Men, derivingtheir just powers from the consent of the governed,—  \x14That whenever any Form of Government\nbecomes destructive of these ends, it is the Right of the People to alter o

In [87]:
len(pdf_text)

6

In [88]:
for page in pdf_text:
    print(page)
    print('\n')
    print('\n')
    print('\n')

0






Declaration of Independence
IN CONGRESS, July 4, 1776.  
The unanimous Declaration of the thirteen united States of America,  
When in the Course of human events, it becomes necessary for one people to dissolve thepolitical bands which have connected them with another, and to assume among the powers of theearth, the separate and equal station to which the Laws of Nature and of Nature's God entitlethem, a decent respect to the opinions of mankind requires that they should declare the causeswhich impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed bytheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit
of Happiness.— That to secure these rights, Governments are instituted among Men, derivingtheir just powers from the consent of the governed,—  That whenever any Form of Government
becomes destructive of these ends, it is the Right of the People to alter or to abol
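
The `pdf_text` output above contains control characters such as `\x14` that are invisible when printed. One possible way to strip them, sketched here on a short sample string shaped like the extracted text (not the real page), is a regular expression over the ASCII control range. This is an assumption about what cleanup is wanted, not something PyPDF2 does itself:

```python
import re

# Short sample shaped like the extract_text() output above, not the full page.
raw = "of Happiness.\u2014 \x14That to secure these rights, Governments are instituted among Men"

# Remove ASCII control characters, keeping tab, newline and carriage return.
cleaned = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", raw)

print(cleaned)
```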

# Regular expressions

Regular expressions let you search for a pattern in text, for example `r'\d{3}-\d{3}-\d{4}'`.
The key thing to keep in mind is that every character type has a corresponding pattern code. For example, digits have the placeholder pattern code `\d`. The backslash tells Python that this is a special code and not the letter `"d"`.

In [2]:
text = 'The phone number of the agent is 408-555-1234. Call soon!'

In [90]:
"phone" in text

True

In [1]:
import re

In [92]:
pattern = "phone"

`re.search` returns a match object:

In [93]:
re.search(pattern, text)

<re.Match object; span=(4, 9), match='phone'>

In [94]:
my_match = re.search(pattern, text)

In [95]:
my_match.span()

(4, 9)

In [96]:
my_match.start()

4

In [97]:
my_match.end()

9

But what if the word appears more than once in the text?

In [98]:
text = "my phone is a new phone"

In [99]:
match = re.search(pattern, text)

`re.search` returns only the first occurrence.

In [100]:
match.span()

(3, 8)

To get all occurrences, use `re.findall` (returns the matched strings) or `re.finditer` (yields match objects):

In [102]:
all_matches = re.findall(pattern,text)

In [103]:
len(all_matches)

2

In [104]:
for match in re.finditer('phone', text):
    print(match.span())

(3, 8)
(18, 23)
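
Because `re.finditer` yields match objects, the matched text and its position can be collected together. A small sketch on the same sentence:

```python
import re

text = "my phone is a new phone"

# Each match object carries both the matched text and its position.
hits = [(m.group(), m.start(), m.end()) for m in re.finditer(r"phone", text)]

print(hits)

# The span can be used directly to slice the original string.
first_start, first_end = hits[0][1], hits[0][2]
print(text[first_start:first_end])
```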


Let's search for the phone number without knowing in advance exactly which number the string contains.

In [106]:
text = 'The phone number of the agent is 408-555-1234. Call soon!'

In [107]:
text

'The phone number of the agent is 408-555-1234. Call soon!'

In [108]:
pattern = r'\d\d\d-\d\d\d-\d\d\d\d'

In [109]:
phone_number = re.search(pattern,text)

In [111]:
phone_number.group()

'408-555-1234'

So whatever the phone number is, it will be found as long as it follows the pattern.

In [115]:
pattern = r'\d{3}-\d{3}-\d{4}' # Quantifiers in braces state how many digits to expect

In [113]:
phone_number = re.search(pattern,text)

In [114]:
phone_number.group()

'408-555-1234'
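
The same quantified pattern also works with `re.findall` when a text holds several numbers. The sketch below uses a made-up `sample` string and `re.compile`, which builds the pattern once so it can be reused:

```python
import re

# re.compile builds the pattern once so it can be reused.
phone_re = re.compile(r"\d{3}-\d{3}-\d{4}")

# Hypothetical text with two numbers, only for illustration.
sample = "Office: 408-555-1234, mobile: 408-555-9876."

numbers = phone_re.findall(sample)
print(numbers)
```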

In [3]:
pattern = r"(\d{3})-(\d{3})-(\d{4})"

In [4]:
mymatch = re.search(pattern, text)

In [5]:
mymatch.group(1)

'408'

In [6]:
mymatch.group(3)

'1234'
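
Groups can also be named, which reads better than counting parentheses. A minimal sketch; the group names `area`, `prefix` and `line` are made up for this example:

```python
import re

text = 'The phone number of the agent is 408-555-1234. Call soon!'

# (?P<name>...) gives a group a name; the names here are hypothetical.
pattern = r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})"
m = re.search(pattern, text)

print(m.group(0))       # group 0 is always the whole match
print(m.groups())       # all groups as a tuple
print(m.group("area"))  # access by name
print(m.groupdict())    # all named groups as a dictionary
```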

Using 'or' (the `|` operator) in regular expressions.

In [7]:
re.search(r"man|woman", "This man was here")

<re.Match object; span=(5, 8), match='man'>

In [8]:
re.search(r"man|woman", "This woman was here")

<re.Match object; span=(5, 10), match='woman'>
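
One thing to watch for: `|` has the lowest precedence, so it splits the whole pattern unless the alternatives are wrapped in a group. A small sketch with a made-up sentence:

```python
import re

sentence = "cat food and dog food"

# Without a group, the pattern reads as "cat" OR "dog food".
loose = re.findall(r"cat|dog food", sentence)

# A non-capturing group (?:...) limits the alternation to the two animals.
grouped = re.findall(r"(?:cat|dog) food", sentence)

print(loose)
print(grouped)
```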

Using a wildcard to search for patterns such as "ends with `at`". The "." is a wildcard character that matches any single character except a newline.

In [14]:
re.findall(r".at", "The cat in the hat sat, splat")

['cat', 'hat', 'sat', 'lat']

In [15]:
re.findall(r"..at", "The cat in the hat sat, splat")

[' cat', ' hat', ' sat', 'plat']
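
A fixed number of wildcards cuts `splat` down to `lat` or `plat` and picks up spaces. If the goal is whole words ending in `at`, one option (a sketch, not the only way) is to combine `\w*` with the word boundary `\b`:

```python
import re

sentence = "The cat in the hat sat, splat"

# \b marks a word boundary; \w* allows any number of word characters before "at".
whole_words = re.findall(r"\b\w*at\b", sentence)

print(whole_words)
```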

We can use "^" for "starts with" and "$" for "ends with". Both apply to the entire string, not to individual words.

In [17]:
re.findall(r"\d$", 'This ends with a number 2')

['2']

In [18]:
re.findall(r"^\d", '1 is the loneliest number')

['1']
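
For multi-line text, the `re.MULTILINE` flag makes the anchors work per line instead of per string. A short sketch with a made-up three-line string:

```python
import re

# Hypothetical three-line text, only for illustration.
lines = "1 apple\nbanana 2\n3 cherries"

# Without flags, ^ matches only at the very start of the string.
start_only = re.findall(r"^\d", lines)

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
each_line_start = re.findall(r"^\d", lines, flags=re.MULTILINE)
each_line_end = re.findall(r"\d$", lines, flags=re.MULTILINE)

print(start_only, each_line_start, each_line_end)
```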

Character exclusion: a `^` at the start of a character set `[...]` matches any character that is not in the set.

In [19]:
phrase = "There are 3 numbers 34 inside 5 this sentence"

In [21]:
re.findall(r"[^\d]+", phrase)

['There are ', ' numbers ', ' inside ', ' this sentence']

In [22]:
test_phrase = "This is a string! But it has punctuation. How to remove it?"

This returns everything that isn't punctuation or a space:

In [23]:
re.findall(r"[^!.? ]+", test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'to',
 'remove',
 'it']

In [24]:
mylist = re.findall(r"[^!.? ]+", test_phrase)

In [26]:
' '.join(mylist)

'This is a string But it has punctuation How to remove it'
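
The same cleanup can be done in one step with `re.sub`, which replaces every match with a given string. This sketch assumes that "punctuation" means the characters in `string.punctuation`, which is broader than the three characters excluded above:

```python
import re
import string

test_phrase = "This is a string! But it has punctuation. How to remove it?"

# re.escape makes every punctuation character safe to place inside [...].
punct_class = "[" + re.escape(string.punctuation) + "]"

# Replace each punctuation character with an empty string.
no_punct = re.sub(punct_class, "", test_phrase)
print(no_punct)
```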

Using brackets to group characters. `\w` matches alphanumeric characters (and the underscore).

In [27]:
text = "Only find the hyphen-words. Where are the long-ish dash words?"

In [28]:
re.findall(r'[\w]+-[\w]+', text)

['hyphen-words', 'long-ish']
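
The pattern above expects exactly one hyphen per word. For words with several hyphens, a repeated non-capturing group is one possible extension, sketched here on a made-up sentence. Note that `[\w]+` and `\w+` are equivalent:

```python
import re

# Hypothetical sentence with a word containing more than one hyphen.
sample = "A state-of-the-art, long-ish example with plain words."

# Exactly one hyphen: longer compounds are split into pairs.
two_parts = re.findall(r"\w+-\w+", sample)

# One or more "-word" parts after the first word.
any_parts = re.findall(r"\w+(?:-\w+)+", sample)

print(two_parts)
print(any_parts)
```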