# Lab 1: Jupyter and Basic Python

In this lab, you will load a text and perform a basic analysis of it using Python in a Jupyter interactive notebook.

Follow the instructions and run the code cells. Make sure you understand what happens at every stage.

## Getting the data
The following code snippet downloads a file and saves it locally. We will download the poem "HOWL" by Allen Ginsberg and save it to a file named "howl.txt".

In [None]:
DOWNLOAD_URL = "http://www.everyday-beat.org/ginsberg/poems/howl.txt"
SAVE_URL = "howl.txt"

In [None]:
from urllib.request import urlopen

downloadedText = urlopen(DOWNLOAD_URL).read().decode('utf-8')
with open(SAVE_URL, "w") as file:
    file.write(downloadedText)
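
The cell above downloads the poem every time it runs and stops with a traceback if the server is unreachable. As a minimal sketch, assuming we want to reuse a local copy when one exists, the same steps can be wrapped in a helper. The name `download_text` and its `timeout` parameter are hypothetical and not part of the lab:

```python
from pathlib import Path
from urllib.error import URLError
from urllib.request import urlopen


def download_text(url, save_path, timeout=10):
    """Hypothetical helper: return the text at `url`, caching it in `save_path`."""
    target = Path(save_path)
    # Reuse the local copy if the file was already downloaded
    if target.exists():
        return target.read_text(encoding="utf-8")
    try:
        with urlopen(url, timeout=timeout) as response:
            text = response.read().decode("utf-8")
    except URLError as error:
        # Re-raise with a message that names the URL that failed
        raise RuntimeError(f"Could not download {url}: {error}") from error
    target.write_text(text, encoding="utf-8")
    return text


# Usage in the notebook (commented out so the sketch does not touch the network):
# downloadedText = download_text(DOWNLOAD_URL, SAVE_URL)
```

Passing `encoding="utf-8"` explicitly keeps the saved file independent of the operating system's default encoding.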

-----------
## Loading the data
* We open the file using *open()* and read its whole content with *read()*
* **Make sure the file path is correct**

In [None]:
with open("howl.txt", "r") as poemFile:
    poemText = poemFile.read()

----------
## Review the data
Let's print the first 500 characters of the text:

In [None]:
print(poemText[0:500])
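
Slicing works the same way on any string. The following small sketch uses a made-up `sample` string (not the downloaded file) to show how the start and end indices behave:

```python
sample = "I saw the best minds of my generation"

first_five = sample[0:5]    # characters at indices 0 to 4 (the end index is excluded)
same_slice = sample[:5]     # the start index defaults to 0
last_ten = sample[-10:]     # negative indices count from the end
too_long = sample[0:500]    # slicing past the end is safe and returns the whole string

print(first_five)  # I saw
print(last_ten)    # generation
print(too_long == sample)  # True
```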

----------
Now, let's print the beginning of the second section. It starts at character 16692:

In [None]:
print(poemText[16692:17500])

---------
## Counting the number of lines in the poem
We start by splitting the text into a list of lines using *splitlines()*

In [None]:
poemLines = poemText.splitlines()

---------
Now we print the first 10 lines:

In [None]:
poemLines[0:10]

----------
Finally, we count the lines using the built-in *len()* function:

In [None]:
print("The number of lines in the text is:", len(poemLines))
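
Note that *splitlines()* keeps empty lines, so the count above includes blank lines between stanzas. A short sketch on a made-up `sample_text` shows the difference between counting all lines and counting only non-empty ones:

```python
sample_text = "first line\n\nsecond line\r\nthird line\n"

# splitlines() handles both \n and \r\n and drops the line endings
lines = sample_text.splitlines()

# Keep only lines that contain something other than whitespace
non_empty_lines = [line for line in lines if line.strip()]

print(lines)                 # ['first line', '', 'second line', 'third line']
print(len(lines))            # 4
print(len(non_empty_lines))  # 3
```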

---------
## Counting the number of words in the poem
We start by splitting the text into a list of words using *split()*

In [None]:
poemWords = poemText.split()

---------
Now we print the first 10 words:

In [None]:
poemWords[:10]

----------
Finally, we count the words using *len()* on the list of words:

In [None]:
print("The number of words in the text is:", len(poemWords))
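
Called without arguments, *split()* separates the text on any run of whitespace (spaces, tabs, newlines) and leaves punctuation attached to the words. A small sketch on a made-up `sample` string:

```python
sample = "who  poverty and tatters,\nand hollow-eyed"

# Consecutive spaces and the newline are all treated as one separator
words = sample.split()

print(words)       # ['who', 'poverty', 'and', 'tatters,', 'and', 'hollow-eyed']
print(len(words))  # 6
```

The trailing comma in `'tatters,'` is why the following sections strip punctuation before comparing words.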

## Counting the number of occurrences of a word

In [None]:
# first, initialize a word counter to 0
wordsCounter = 0

# for every word in the list poemWords
for word in poemWords:
    # strip leading and trailing punctuation, and convert to lower case
    if word.strip(""" ,.*()[]!@#$%^&*{}?'`"-""").lower() == "who":
        wordsCounter += 1

In [None]:
print("The number of occurrences of 'Who' is:", wordsCounter)
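
The same loop can be written as a reusable function. This is a sketch only: the helpers `clean_word` and `count_word` are hypothetical names, and `string.punctuation` is an assumption that strips a slightly larger set of characters (for example `:` and `;`) than the literal used above.

```python
import string


def clean_word(word):
    """Strip leading/trailing punctuation and whitespace, then lower-case."""
    return word.strip(string.punctuation + string.whitespace).lower()


def count_word(words, target):
    """Count how many cleaned words are equal to `target`."""
    return sum(1 for word in words if clean_word(word) == target)


sample_words = ["Who", "who,", "(who)", "whom", "whole"]
print(count_word(sample_words, "who"))  # 3
```

Because the comparison uses `==` on the cleaned word, "whom" and "whole" are not counted.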

## Counting the number of lines starting with a word

In [None]:
linesCounter = 0
for line in poemLines:
    if line.lower().strip().startswith("who"):
        linesCounter += 1

In [None]:
print("The number of lines that start with the word 'Who' is:", linesCounter)
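
Keep in mind that *startswith("who")* tests a prefix, so a line that begins with "whole" or "whose" is counted as well. If only the whole word should match, one possible variant is to compare the first word of the line instead. The helper name `starts_with_word` and the set of stripped punctuation marks are assumptions made for this sketch:

```python
def starts_with_word(line, target):
    """Return True if the first word of `line` equals `target`, ignoring case."""
    words = line.lower().split()
    if not words:
        return False  # empty or whitespace-only line
    # Drop punctuation that may be attached to the first word
    return words[0].strip(".,;:!?\"'()") == target


print(starts_with_word("   who poverty and tatters", "who"))  # True
print(starts_with_word("Who, then", "who"))                   # True
print(starts_with_word("whole intellects disgorged", "who"))  # False
print("whole intellects".startswith("who"))                   # True
```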

## Counting the number of different words

In [None]:
# strip leading and trailing punctuation and convert to lower case
cleanWords = [word.strip(""" ,.*()[]!@#$%^&*{}?'`"-""").lower()
              for word in poemWords]

In [None]:
print("The number of different words is:", len(set(cleanWords)))
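
A *set* keeps one copy of each distinct item, which is why its length gives the number of different words. A token made only of punctuation (such as "--") becomes an empty string after stripping and would be counted as a word too. The sketch below, on made-up tokens and assuming `string.punctuation` as the characters to strip, shows how to discard it:

```python
import string

tokens = ["Moloch!", "moloch", "--", "who", "Who,"]

# Strip punctuation from both ends and lower-case each token
clean_tokens = [token.strip(string.punctuation).lower() for token in tokens]

unique_tokens = set(clean_tokens)
unique_tokens.discard("")  # remove the empty string left by "--", if present

print(clean_tokens)         # ['moloch', 'moloch', '', 'who', 'who']
print(len(unique_tokens))   # 2
```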

## Using the *Counter* collection
We can use *Counter* from the *collections* module to count the frequency of every item in a list.

In [None]:
from collections import Counter
wordsCounter = Counter(cleanWords)

---------
What is the number of occurrences of the word 'who'?

In [None]:
wordsCounter["who"]

---------
What are the 10 most common words?

In [None]:
wordsCounter.most_common(10)
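
To see how *Counter* behaves on a small input, here is a sketch with a made-up word list. It also shows that looking up a word that never occurs returns 0 instead of raising an error:

```python
from collections import Counter

sample_counts = Counter(["who", "the", "who", "and", "the", "who"])

print(sample_counts["who"])          # 3
print(sample_counts["absent"])       # 0 (missing keys count as zero)
print(sample_counts.most_common(2))  # [('who', 3), ('the', 2)]
```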

-------
## Finding a phrase in the text

Let's find the first occurrence of the word "Canada" in the text.

In [None]:
poemText.find("Canada")

-------
Let's examine the text around this position:

In [None]:
print(poemText[1800:1900])
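
Instead of typing the slice boundaries by hand, the position returned by *find()* can be used directly. Note that *find()* returns -1 when the phrase does not occur. The helper `show_context` and its `width` parameter are hypothetical names used only for this sketch:

```python
def show_context(text, phrase, width=50):
    """Return `phrase` with up to `width` characters on each side, or None if absent."""
    position = text.find(phrase)
    if position == -1:
        return None  # find() signals "not found" with -1
    start = max(0, position - width)  # do not go below index 0
    end = position + len(phrase) + width
    return text[start:end]


print(show_context("aaa Canada bbb", "Canada", width=2))  # a Canada b
print(show_context("aaa Canada bbb", "Mexico"))           # None
```

In the notebook, this could be called as `show_context(poemText, "Canada")`.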