# Regular Expressions Exercise

Acknowledgement: This notebook is provided by the Intel AI Developer Program.

#### Introduction

Let's download the complete works of Sherlock Holmes and do some detective work using regular expressions.

For many of these exercises it is helpful to compile a regular expression p, call p.findall(), and then apply some simple Python processing to the returned list to answer the questions.
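
As a minimal sketch of that workflow, the snippet below compiles a pattern once and then post-processes the list returned by `findall()`. The sample string and the pattern are made up for illustration and are not part of the exercise data.

```python
import re

# Hypothetical sample text, used only to illustrate the workflow
sample = "The dog barked. The cat slept. A dog and a cat met."

# Compile once, then reuse the compiled pattern
p = re.compile(r"dog|cat")

matches = p.findall(sample)      # every non-overlapping match, in order
distinct = sorted(set(matches))  # the distinct matched strings

print(matches)
print(len(matches))
print(distinct)
```

The same three steps (compile, `findall()`, plain Python on the list) are enough for most of the questions that follow.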


In [14]:
import re

In [15]:
# for linux and mac, uncomment and run the following line
# (-P saves the file into ./data, where the next cell expects it):
# !wget -P ./data https://sherlock-holm.es/stories/plain-text/cnus.txt
# for windows, use your browser to download the txt file and place it in the ./data directory

In [16]:
with open('./data/cnus.txt', 'r', encoding='utf-8') as f:
    # strip each line and join with spaces so the whole book is one long string
    text = " ".join(line.strip() for line in f)

In [17]:
text[2611:3000]

"On landing at Bombay, I learned that my corps had advanced through the passes, and was already deep in the enemy's country. I followed, however, with many other officers who were in the same situation as myself, and succeeded in reaching Candahar in safety, where I found my regiment, and at once entered upon my new duties.  The campaign brought honours and promotion to many, but for me "

## Question 1

One of Sherlock Holmes' famous catchphrases is the word 'undoubtedly'.

* How many times is the word 'undoubtedly' used?
* Hint: findall returns the results in a list. You can use the length of the returned list.

In [18]:
# your answer

In [19]:
# the word undoubtedly only appears 43 times
p = re.compile(r'undoubtedly')
print(len(p.findall(text)))

# alternative

print(len(re.findall(r'undoubtedly', text)))

43
43
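
The count above is case-sensitive, so an occurrence written as 'Undoubtedly' at the start of a sentence is not included. As a sketch on a made-up sample string (not the Holmes text), `re.IGNORECASE` together with `\b` word boundaries counts the word regardless of capitalisation:

```python
import re

# Hypothetical sample, chosen to contain three capitalisations of the word
sample = "Undoubtedly so. It was undoubtedly him. UNDOUBTEDLY!"

# Lowercase literal only
case_sensitive = re.findall(r"undoubtedly", sample)

# Whole word, any capitalisation
any_case = re.findall(r"\bundoubtedly\b", sample, flags=re.IGNORECASE)

print(len(case_sensitive))
print(len(any_case))
```

Whether the two counts differ on the full text depends on how often the word opens a sentence there.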


## Question 2

Characters are announced very deliberately in the language of the Victorian England setting. We can use this later to find characters in the book. But for now let's practice on a character we know.

* How often is Sherlock Holmes referred to as 'Mr. Sherlock Holmes' vs. 'Sherlock Holmes' vs. 'Mr. Holmes' vs. 'Sherlock'?
* Hint 1: We can start by finding the occurrences of the pattern Sherlock Holmes.
* Hint 2: If you have read the novels, you will also know that characters can be referred to by their given name or surname. In this case you can use the "or" operator |, for example "Sherlock Holmes|Mr. Holmes|Mr. Sherlock Holmes"

In [20]:
# your answer based on Hint 1

In [21]:
# let's start with the very simplest thing:
# just find the occurrences of Sherlock Holmes
p = re.compile('Sherlock Holmes')
len(p.findall(text))

361

In [22]:
# your answer based on Hint 2

In [23]:
# one easy way to solve this
# is to just use the 'or' operator | with all the patterns we want to match
# raw string so that \. reaches the regex engine as an escaped full stop
p = re.compile(r'Mr\. Sherlock Holmes|Sherlock Holmes|Mr\. Holmes|Sherlock|Holmes')
results = p.findall(text)
counts = {}
for r in results:
    if r in counts.keys():
        counts[r] += 1
    else:
        counts[r] = 1
        
counts

{'Sherlock Holmes': 268,
 'Mr. Sherlock Holmes': 93,
 'Holmes': 1646,
 'Mr. Holmes': 496,
 'Sherlock': 22}
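
Each match is counted under exactly one key, which is why 'Sherlock Holmes' (268) plus 'Mr. Sherlock Holmes' (93) adds up to the 361 found earlier. Python's `re` tries alternatives from left to right at each position and keeps the first one that matches, so when two alternatives can start at the same place the longer one has to come first. A sketch on a made-up sample string, with `collections.Counter` standing in for the manual tally:

```python
import re
from collections import Counter

# Hypothetical sample text, not taken from the book
sample = "Sherlock Holmes smiled. Sherlock laughed."

# Longer alternative first: the full name is matched as one unit
longer_first = re.findall(r"Sherlock Holmes|Sherlock", sample)

# Shorter alternative first: 'Sherlock' always wins, so the full name never matches
shorter_first = re.findall(r"Sherlock|Sherlock Holmes", sample)

counts = Counter(longer_first)

print(longer_first)
print(shorter_first)
print(counts)
```

`Counter` builds the same kind of dictionary as the loop above in a single call.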

## Question 3

* Find all the doctors in the collection
    
* Hint: Make a list of all the characters that appear in the collection. You can assume that the characters have a proper salutation such as Mrs., Mr., Miss, Dr., etc.

In [24]:
# Your answer

In [25]:
# loosely matches a salutation (Mr., Mrs., Ms., Miss, Dr.) followed by a capitalised name
p = re.compile(r'[MD][irs][s\.]?[s\.]? [A-Z]\w*')
set(p.findall(text))

{'Dr. Ainstree',
 'Dr. Armstrong',
 'Dr. Barnicot',
 'Dr. Becher',
 'Dr. Ferrier',
 'Dr. Fordham',
 'Dr. Grimesby',
 'Dr. Horsom',
 'Dr. Huxtable',
 'Dr. James',
 'Dr. Leon',
 'Dr. Leslie',
 'Dr. Moore',
 'Dr. Mortimer',
 'Dr. Percy',
 'Dr. Richards',
 'Dr. Roylott',
 'Dr. Shlessinger',
 'Dr. Somerton',
 'Dr. Sterndale',
 'Dr. Thorneycroft',
 'Dr. Trevelyan',
 'Dr. Watson',
 'Dr. Willows',
 'Dr. Wood',
 'Miss Adler',
 'Miss Alice',
 'Miss Brenda',
 'Miss Burnet',
 'Miss Cushing',
 'Miss Dobney',
 'Miss Doran',
 'Miss Edith',
 'Miss Ettie',
 'Miss Flora',
 'Miss Fraser',
 'Miss Harrison',
 'Miss Hatty',
 'Miss Helen',
 'Miss Holder',
 'Miss Honoria',
 'Miss Hunter',
 'Miss Irene',
 'Miss M',
 'Miss Marie',
 'Miss Mary',
 'Miss Miles',
 'Miss Morrison',
 'Miss Morstan',
 'Miss Nancy',
 'Miss Rachel',
 'Miss Roylott',
 'Miss Rucastle',
 'Miss S',
 'Miss Sarah',
 'Miss Smith',
 'Miss Stapleton',
 'Miss Stoner',
 'Miss Stoper',
 'Miss Susan',
 'Miss Sutherland',
 'Miss Turner',
 'Miss Viole
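
The pattern above returns every salutation it can match, so the doctors still have to be picked out of the set. A narrower sketch, run here on a made-up sample rather than the full text, anchors the pattern on the literal `Dr.` and uses a capturing group to keep only the name that follows:

```python
import re

# Hypothetical sample text mixing several salutations
sample = ("Dr. Watson greeted Miss Morstan and Mr. Holmes "
          "before Dr. Mortimer and Dr. Watson left.")

# With one capturing group, findall() returns just the captured name
doctor = re.compile(r"Dr\. ([A-Z]\w*)")

# set() removes the repeated mention, sorted() gives a stable order
doctors = sorted(set(doctor.findall(sample)))

print(doctors)
```

Like the original pattern, this captures a single capitalised word, so it cannot tell a given name from a surname.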

## Question 4

* Search out all the years that appear in the story.
* What format do the years take? If you know the book series, which years are the stories most likely set in?

In [26]:
# we can use \d to match any digit; 1[89] limits the matches to the 1800s and 1900s
p = re.compile(r'1[89]\d\d')
years = p.findall(text)
print(years)

['1878', '1860', '1857', '1871', '1878', '1878', '1882', '1882', '1882', '1888', '1858', '1890', '1890', '1869', '1870', '1878', '1883', '1883', '1869', '1869', '1884', '1887', '1846', '1855', '1875', '1891', '1890', '1891', '1894', '1894', '1840', '1881', '1884', '1887', '1894', '1901', '1895', '1900', '1888', '1872', '1883', '1884', '1883', '1883', '1883', '1883', '1894', '1884', '1882', '1882', '1884', '1882', '1883', '1876', '1800', '1865', '1875', '1872', '1874', '1875', '1892', '1895', '1897', '1914', '1911', '1915']
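
The pattern `1[89]\d\d` also matches four digits that sit inside a longer number. A sketch on a made-up sample shows how `\b` word boundaries exclude those cases, and how `collections.Counter` can tally which year is mentioned most often:

```python
import re
from collections import Counter

# Hypothetical sample: 218905 contains the digits 1890 but is not a year
sample = "In 1878 I took my degree. Item 218905 was filed in 1887, and again in 1887."

loose = re.findall(r"1[89]\d\d", sample)       # matches inside 218905 as well
strict = re.findall(r"\b1[89]\d\d\b", sample)  # standalone four-digit numbers only

most_common_year = Counter(strict).most_common(1)

print(loose)
print(strict)
print(most_common_year)
```

Applying `Counter` to the `years` list from the cell above would give the same kind of tally for the real text.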


## Question 5

Sherlock Holmes is frequently smoking his pipe. But like many verbs in English, the word smoke takes several forms depending on the context.

* Capture all sentences that talk about smoking (smoke, smokes, smoking, smoked).
* Hint: For simplicity, since this is a properly written novel, you can assume that a sentence follows a word break, starts with a capital letter, and ends with a full stop.


In [27]:
# Your answer

In [28]:
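# a sentence here is a run of letters and spaces between two full stops
# that contains 'smok' (smoke, smokes, smoking, smoked)
p = re.compile(r'\.[ A-Za-z]+smok[ A-Za-z]+\.')
sentences = p.findall(text)

# raw matches, including the full stop that closes the previous sentence
for s in sentences:
    print(s)
# the same sentences with the leading '. ' removed
for s in sentences:
    clean_s = re.sub(r"\. ", "", s)
    print(clean_s)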

. I am going to smoke and to think over this queer business to which my fair client has introduced us.
. I would have thought no more of knifing him than of smoking this cigar.
. The smoke and shouting were enough to shake nerves of steel.
. He had even smoked there.
. Then I went into the back yard and smoked a pipe and wondered what it would be best to do.
. As we rolled into Eyford Station we saw a gigantic column of smoke which streamed up from behind a small clump of trees in the neighbourhood and hung like an immense ostrich feather over the landscape.
. Then he lit his pipe and sat for some time smoking and turning them over.
. I had smoked two cigarettes before he moved.
.  We had breakfasted and were smoking our morning pipe on the day after the remarkable experience which I have recorded when Mr.
. I observed that he was smoking with extraordinary rapidity.
. He does smoke something terrible.
. From over a distant rise there floated a gray plume of smoke.
. This might partly 
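
Because the pattern above allows only letters and spaces, it skips any sentence containing a comma or apostrophe, and `smok` on its own would also accept words such as 'smoky'. The sketch below, on a made-up sample, is one possible tightening: `[^.]*` permits any character except a full stop, and the group `(?:e|es|ing|ed)` between `\b` boundaries accepts only the four listed forms.

```python
import re

# Hypothetical sample: one sentence with a comma, one with 'smoky'
sample = ("He lit his pipe. Holmes smoked in silence, and I waited. "
          "The smoky room was empty. She smokes rarely.")

# Start at a capital letter, stay within one sentence, end at the full stop
p = re.compile(r"[A-Z][^.]*\bsmok(?:e|es|ing|ed)\b[^.]*\.")

sentences = p.findall(sample)
for s in sentences:
    print(s)
```

Abbreviations such as 'Mr.' still end a match early, as one of the printed sentences above shows, so this remains an approximation of sentence splitting.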

    
## Question 6

Often we will receive a block of unstructured text that is divided by some delimiter. For example,
a book is often segmented into chapters.
We can use a regular expression to find the chapter delimiter in a book and split the book into its chapters.

* View the contents of the text by looking at the first 50,000 characters. Confirm whether the book is divided by CHAPTER headings.


In [29]:
print(text[0:50000])

    THE COMPLETE SHERLOCK HOLMES  Arthur Conan Doyle    Table of contents  A Study In Scarlet  The Sign of the Four  The Adventures of Sherlock Holmes A Scandal in Bohemia The Red-Headed League A Case of Identity The Boscombe Valley Mystery The Five Orange Pips The Man with the Twisted Lip The Adventure of the Blue Carbuncle The Adventure of the Speckled Band The Adventure of the Engineer's Thumb The Adventure of the Noble Bachelor The Adventure of the Beryl Coronet The Adventure of the Copper Beeches  The Memoirs of Sherlock Holmes Silver Blaze The Yellow Face The Stock-Broker's Clerk The "Gloria Scott" The Musgrave Ritual The Reigate Squires The Crooked Man The Resident Patient The Greek Interpreter The Naval Treaty The Final Problem  The Return of Sherlock Holmes The Adventure of the Empty House The Adventure of the Norwood Builder The Adventure of the Dancing Men The Adventure of the Solitary Cyclist The Adventure of the Priory School The Adventure of Black Peter The Adventure of C

* Write a regular expression to list all the chapter headings found in the text

In [30]:
# your answers

In [31]:
# the literal word CHAPTER, one whitespace character, then the Roman numeral
p = re.compile(r'CHAPTER\s[\w]+')
p.findall(text)

['CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER VIII',
 'CHAPTER IX',
 'CHAPTER X',
 'CHAPTER XI',
 'CHAPTER XII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER VIII',
 'CHAPTER IX',
 'CHAPTER X',
 'CHAPTER XI',
 'CHAPTER XII',
 'CHAPTER XIII',
 'CHAPTER XIV',
 'CHAPTER XV',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER VIII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER I',
 'CHAPTER II']
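
The numbering restarts several times because the collection contains more than one work with chapters, so the heading text alone does not say where a chapter sits. `findall()` returns only the matched strings; `finditer()` yields match objects that also carry the offset of each heading. A sketch on a made-up sample:

```python
import re

# Hypothetical sample with two headings
sample = "Intro CHAPTER I First part. CHAPTER II Second part."

p = re.compile(r"CHAPTER\s\w+")

# Pair each heading with the index at which it starts
headings = [(m.group(), m.start()) for m in p.finditer(sample)]

for name, start in headings:
    print(name, start)
```

The offsets can then be compared with the positions of the story titles to work out which story each heading belongs to.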

* Use the split() function to split the book into chapters. You should get a list in which item 0 is the text before the first heading and each later item is a chapter.

In [32]:
# your answers

In [33]:
chapters = p.split(text)
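
Two details of `split()` are worth knowing here. The text before the first heading comes back as element 0, and the headings themselves are discarded unless the pattern is wrapped in a capturing group. A sketch on a made-up sample:

```python
import re

# Hypothetical sample with front matter and two chapters
sample = "Front matter CHAPTER I First part. CHAPTER II Second part."

# Without a group: the headings are dropped
plain = re.split(r"CHAPTER\s\w+", sample)

# With a capturing group: the headings are kept as separate list items
with_headings = re.split(r"(CHAPTER\s\w+)", sample)

print(plain)
print(with_headings)
```

The capturing form is handy when each chapter should later be labelled with its own heading.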

* Print out the content of chapter 1.

In [34]:
# your answers

In [35]:
print(chapters[1])

 Mr. Sherlock Holmes   In the year 1878 I took my degree of Doctor of Medicine of the University of London, and proceeded to Netley to go through the course prescribed for surgeons in the army. Having completed my studies there, I was duly attached to the Fifth Northumberland Fusiliers as Assistant Surgeon. The regiment was stationed in India at the time, and before I could join it, the second Afghan war had broken out. On landing at Bombay, I learned that my corps had advanced through the passes, and was already deep in the enemy's country. I followed, however, with many other officers who were in the same situation as myself, and succeeded in reaching Candahar in safety, where I found my regiment, and at once entered upon my new duties.  The campaign brought honours and promotion to many, but for me it had nothing but misfortune and disaster. I was removed from my brigade and attached to the Berkshires, with whom I served at the fatal battle of Maiwand. There I was struck on the shou

## Question 7

Finally, you can break the book into its respective chapters.
Output the chapters into individual files.

In [36]:
# your answers

In [37]:
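import os

os.makedirs("data/output", exist_ok=True)  # make sure the output directory exists
for i, chapter in enumerate(chapters):
    f_name = f"data/output/Chapter_{i}.txt"
    print(f"Writing to {f_name}")
    # the context manager closes the file even if the write fails
    with open(f_name, "w", encoding='utf-8') as f:
        f.write(chapter)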

Writing to data/output/Chapter_0.txt
Writing to data/output/Chapter_1.txt
Writing to data/output/Chapter_2.txt
Writing to data/output/Chapter_3.txt
Writing to data/output/Chapter_4.txt
Writing to data/output/Chapter_5.txt
Writing to data/output/Chapter_6.txt
Writing to data/output/Chapter_7.txt
Writing to data/output/Chapter_8.txt
Writing to data/output/Chapter_9.txt
Writing to data/output/Chapter_10.txt
Writing to data/output/Chapter_11.txt
Writing to data/output/Chapter_12.txt
Writing to data/output/Chapter_13.txt
Writing to data/output/Chapter_14.txt
Writing to data/output/Chapter_15.txt
Writing to data/output/Chapter_16.txt
Writing to data/output/Chapter_17.txt
Writing to data/output/Chapter_18.txt
Writing to data/output/Chapter_19.txt
Writing to data/output/Chapter_20.txt
Writing to data/output/Chapter_21.txt
Writing to data/output/Chapter_22.txt
Writing to data/output/Chapter_23.txt
Writing to data/output/Chapter_24.txt
Writing to data/output/Chapter_25.txt
Writing to data/output
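
A variation on the loop above, sketched with `pathlib` and a temporary directory so that it runs anywhere. The `chapters` list here is a made-up stand-in for the real one, and zero-padding the index is an added choice (not part of the exercise) that keeps the files in order when they are sorted by name.

```python
import tempfile
from pathlib import Path

# Stand-in for the list produced by p.split(text)
chapters = ["front matter", "first chapter", "second chapter"]

# Temporary location, used instead of data/output for this sketch
out_dir = Path(tempfile.mkdtemp()) / "output"
out_dir.mkdir(parents=True, exist_ok=True)

for i, chapter in enumerate(chapters):
    # Chapter_00.txt, Chapter_01.txt, ... sort correctly as plain strings
    (out_dir / f"Chapter_{i:02d}.txt").write_text(chapter, encoding="utf-8")

written = sorted(path.name for path in out_dir.iterdir())
print(written)
```

Without the padding, a name-based sort would place Chapter_10.txt before Chapter_2.txt.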