# Exercise 0

The goal of this exercise is to gain familiarity with the Python programming language. We are going to do some basic text processing and analysis on a plaintext corpus. If you are not familiar with Python or Jupyter notebooks, we recommend starting with the Python Tutorial notebook before attempting this exercise.

---

For this exercise, we are going to count the 25 most frequent words in **The Adventures of Sherlock Holmes**, by Sir Arthur Conan Doyle. You are free to use any other text of your choice. This notebook contains step-by-step instructions (with some hints), and you are required to fill in the code blocks based on the material covered in the Python Tutorial notebook.

### 0. Download the text file.
Run the cell below to download the book **The Adventures of Sherlock Holmes** as a text file from [Project Gutenberg](http://www.gutenberg.org) and save it to a file called `sherlock.txt`.

In [None]:
!curl https://www.gutenberg.org/files/1661/1661-0.txt > sherlock.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  593k  100  593k    0     0  1413k      0 --:--:-- --:--:-- --:--:-- 1413k


---
### 1. Read text from file.
Open the text file `sherlock.txt` and read all the lines into a list.

In [None]:
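lines = []  # read lines from sherlock.txt into this list
# state the encoding explicitly so the result does not depend on the platform default
with open('sherlock.txt', 'r', encoding='utf-8') as f:
  lines = f.readlines()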

---
### 2. Filter out the metadata.
The text file contains some metadata about the book which is not relevant for our analysis. Discard this information by removing the first 32 lines and the last 368 lines.

In [None]:
lines = lines[32:-368]
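
To see how a positive start and a negative stop combine in a slice, here is a minimal sketch on a toy list. The list `sample` and the smaller offsets are made up for illustration; they are not taken from the book.

```python
# Hypothetical stand-in for the lines of a file:
# 2 header lines, 3 body lines, 1 footer line.
sample = ['header 1', 'header 2', 'body 1', 'body 2', 'body 3', 'footer 1']

# Drop the first 2 items and the last 1 item,
# analogous to lines[32:-368] in the cell above.
body = sample[2:-1]
print(body)
```

The start index skips items from the front, while the negative stop index counts back from the end of the list.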

---
### 3. Remove leading and trailing spaces from each line in the list.
Each line ends with a newline character `\n`, and some lines also contain leading and trailing spaces. This formatting is for presentation purposes only and is not relevant for our analysis.

In [None]:
clean_lines = []  # store the lines in this list after removing the leading and trailing spaces
for line in lines:
  clean_lines.append(line.strip())
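
As a quick check of what `str.strip()` does, the sketch below applies it to a single made-up line (the variable names `raw` and `stripped` are illustrative only).

```python
# Hypothetical sample line with leading spaces, trailing spaces and a newline.
raw = '   To Sherlock Holmes she is always the woman.  \n'

# strip() with no arguments removes whitespace (spaces, tabs, newlines)
# from both ends of the string; whitespace inside the string is kept.
stripped = raw.strip()
print(repr(stripped))
```

Note that `strip()` returns a new string and leaves the original unchanged, which is why the results are collected in a new list.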

---
### 4. Remove empty lines from the list.
After removing the newline character `\n` from each line in the list, some strings are now empty. These can be safely discarded.

In [None]:
non_empty_lines = []  # store non empty lines in this list
for line in clean_lines:
  if line != '':
    non_empty_lines.append(line)
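
The same filtering can also be written as a list comprehension. This is a sketch of an alternative on a made-up list (`clean_sample` is hypothetical), not a replacement for the loop above.

```python
# Hypothetical list of already stripped lines, some of them empty.
clean_sample = ['I.', '', 'A SCANDAL IN BOHEMIA', '', '']

# An empty string is falsy, so `if line` keeps only the non-empty strings.
non_empty_sample = [line for line in clean_sample if line]
print(non_empty_sample)
```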

---
### 5. Join all the non-empty lines into a single string.
Now that we have cleaned the corpus by removing the presentation details, we can focus on the actual text. Create a single string which contains all the lines from the text.

In [None]:
# join all the non-empty lines into a single string
text = ' '.join(non_empty_lines)
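
To illustrate how `str.join` works, here is a small sketch with two made-up lines (`sample_lines` is hypothetical). The string the method is called on is used as the separator between the items.

```python
# Hypothetical pair of cleaned lines that belong to the same sentence.
sample_lines = ['To Sherlock Holmes she is', 'always the woman.']

# The separator ' ' is placed between items, not before the first or after the last.
joined = ' '.join(sample_lines)
print(joined)
```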

---
### 6. Convert to lowercase
To keep the word counts consistent, we are going to convert everything to lowercase. If we don't do this, the words **the**, **The** and **THE** would be considered distinct.

In [None]:
text = text.lower()

---
### 7. Get a list of all the words in the text.

In [None]:
words = text.split(' ')
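
Splitting on a single space is a simple form of tokenization, and it is worth knowing how it behaves on edge cases. The sketch below uses a made-up string (`sample_text`) to compare `split(' ')` with `split()`; it is only an illustration, and the counts reported in this notebook come from `split(' ')`.

```python
# Hypothetical text with two consecutive spaces and a trailing comma.
sample_text = 'it was  holmes, the detective'

# Splitting on exactly one space produces an empty string
# wherever two spaces appear next to each other.
by_space = sample_text.split(' ')

# Splitting with no argument treats any run of whitespace as one separator.
by_whitespace = sample_text.split()

print(by_space)
print(by_whitespace)
```

In both cases punctuation stays attached to the word, so `holmes,` and `holmes` would be counted as different words.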

---
### 8. How many total words are there in the text?

In [None]:
len(words)

104541

---
### 9. How many unique words are there in the text?

In [None]:
len(set(words))

14085
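
Converting a list to a `set` keeps one copy of each distinct item, which is why `len(set(words))` counts unique words. A minimal sketch on a made-up word list (`sample_words` is hypothetical):

```python
# Hypothetical word list in which 'the' appears twice.
sample_words = ['the', 'dog', 'and', 'the', 'cat']

# A set stores each distinct value once, so duplicates collapse.
unique_words = set(sample_words)

print(len(sample_words))   # total number of words
print(len(unique_words))   # number of distinct words
```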

---
### 10. What are the 25 most frequent words?

In [None]:
word_counts = dict()    # create an empty dictionary for word counts
for word in words:
  if word in word_counts:
    word_counts[word] += 1
  else:
    word_counts[word] = 1

word_counts = list(word_counts.items())  # convert dict to a list of tuples for word counts
sorted_by_word_counts = sorted(word_counts, key=lambda x: x[1], reverse=True)
sorted_by_word_counts[:25]

[('the', 5519),
 ('and', 2812),
 ('to', 2643),
 ('of', 2633),
 ('a', 2592),
 ('i', 2533),
 ('in', 1701),
 ('that', 1590),
 ('was', 1369),
 ('he', 1276),
 ('it', 1255),
 ('his', 1146),
 ('you', 1103),
 ('is', 1057),
 ('my', 955),
 ('have', 897),
 ('as', 839),
 ('with', 822),
 ('had', 813),
 ('at', 755),
 ('which', 747),
 ('for', 701),
 ('be', 596),
 ('not', 582),
 ('but', 536)]

#### Alternate Solutions:

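1. Since Python 3.7, dictionaries are guaranteed to preserve insertion order (in CPython 3.6 this was an implementation detail), so the sorted `(word, count)` pairs can be stored back in a dictionary instead of being kept as a list of tuples.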
2. Look up the `Counter` container in the `collections` module in the [Python docs](https://docs.python.org/3/library/collections.html#collections.Counter).
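
Both alternatives can be sketched on a small made-up word list (`sample_words` is hypothetical). This is one possible way to write them, not the only one.

```python
from collections import Counter

# Hypothetical word list standing in for the words of the book.
sample_words = ['the', 'dog', 'and', 'the', 'cat', 'and', 'the', 'bird']

# Alternative 1: count with a plain dict, then rebuild a dict from the
# items sorted by count. The new dict keeps the sorted insertion order.
counts = {}
for word in sample_words:
    counts[word] = counts.get(word, 0) + 1
sorted_counts = dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))

# Alternative 2: Counter does the counting, and most_common(n)
# returns the n most frequent items as (word, count) tuples.
top_two = Counter(sample_words).most_common(2)

print(sorted_counts)
print(top_two)
```

For the book, `Counter(words).most_common(25)` would take the place of the counting loop and the sort in step 10.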