# Lab 1: Jupyter and Basic Python

In this lab, you will load a text and perform a basic analysis of it using Python in a Jupyter interactive notebook.

Follow the instructions and run the code cells. Make sure you understand what happens at every stage.

## Getting the data
The following code snippet downloads a file and saves it locally. We will download the poem "HOWL" by Allen Ginsberg and save it to a file named "howl.txt".

In [12]:
DOWNLOAD_URL = "http://www.everyday-beat.org/ginsberg/poems/howl.txt"
SAVE_URL = "howl.txt"

In [14]:
from urllib.request import urlopen

downloadedText = urlopen(DOWNLOAD_URL).read().decode('utf-8')
with open(SAVE_URL, "w") as file:
    file.write(downloadedText)
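
Re-running the download cell fetches the file again every time. The sketch below shows one way to skip the download when a local copy already exists. The helper name `download_if_missing` and the `timeout` default are assumptions made for illustration; they are not part of the lab.

```python
from pathlib import Path
from urllib.request import urlopen


def download_if_missing(url, path, timeout=30):
    """Download url to path unless the file already exists.

    Returns True if a download happened, False if a local copy was reused.
    (Hypothetical helper, not part of the original lab.)
    """
    target = Path(path)
    if target.exists():
        return False  # reuse the local copy, no network access needed
    with urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    target.write_text(text, encoding="utf-8")
    return True


# In the notebook this could be called as:
# download_if_missing(DOWNLOAD_URL, SAVE_URL)
```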

-----------
## Loading the data
* First we open the file using *open()*
* **Make sure the file path is correct**

In [16]:
with open("howl.txt", "r") as poemFile:
    poemText = poemFile.read()
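
The `with` statement closes the file automatically when the block ends. The following sketch illustrates this on a small temporary file (the file name and contents are made up for the example) and passes `encoding` explicitly, which is optional but makes the read behave the same on every platform.

```python
import os
import tempfile

# Write a small sample file so the example does not depend on howl.txt.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8") as sample_file:
    sample_file.write("first line\nsecond line\n")

# Read it back; the encoding argument is an explicit choice, not a requirement.
with open(sample_path, "r", encoding="utf-8") as sample_file:
    sample_text = sample_file.read()

print(sample_file.closed)  # True: the file is closed once the with block ends
print(sample_text)
```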

----------
## Review the data
Let's print the first 500 characters of the text:

In [18]:
print(poemText[0:500])


                          HOWL

                    For Carl Solomon 

                           I 

       I saw the best minds of my generation destroyed by 
              madness, starving hysterical naked, 
       dragging themselves through the negro streets at dawn 
              looking for an angry fix, 
       angelheaded hipsters burning for the ancient heavenly 
              connection to the starry dynamo in the machin- 
              ery of night, 
       who poverty and tatters 


----------
Now, let's print the beginning of the second section. It starts at character 16692:

In [26]:
print(poemText[16692:17500])

                           II 

       What sphinx of cement and aluminum bashed open 
              their skulls and ate up their brains and imagi- 
              nation? 
       Moloch! Solitude! Filth! Ugliness! Ashcans and unob 
              tainable dollars! Children screaming under the 
              stairways! Boys sobbing in armies! Old men 
              weeping in the parks! 
       Moloch! Moloch! Nightmare of Moloch! Moloch the 
              loveless! Mental Moloch! Moloch the heavy 
              judger of men! 
       Moloch the incomprehensible prison! Moloch the 
              crossbone soulless jailhouse and Congress of 
              sorrows! Moloch whose buildings are judgment! 
              Moloch the vast stone of war! Moloch the stun- 
              ned governments! 
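
The offset 16692 is tied to this exact copy of the file. As a hedged alternative, an offset can be located by searching for a marker string and then slicing from there. The sample text and the marker `"Part II"` below are invented for illustration.

```python
sample = "Part I\nfirst section text\nPart II\nsecond section text\n"

# str.find returns the index of the first match, or -1 if there is none.
start = sample.find("Part II")
print(start)                     # 26

# A slice text[a:b] includes index a and stops just before index b.
print(sample[start:start + 7])   # Part II
print(sample[:6])                # Part I
```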
     


---------
## Counting the number of lines in the poem
We start by splitting the text into a list of lines using *splitlines()*

In [27]:
poemLines = poemText.splitlines()

---------
Now we print the first 10 lines:

In [28]:
poemLines[0:10]

['',
 '                          HOWL',
 '',
 '                    For Carl Solomon ',
 '',
 '                           I ',
 '',
 '       I saw the best minds of my generation destroyed by ',
 '              madness, starving hysterical naked, ',
 '       dragging themselves through the negro streets at dawn ']

----------
Finally, we count the lines using the *len()* function:

In [29]:
print("The number of lines in the text is:", len(poemLines))

The number of lines in the text is: 445
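
Note that this count includes blank lines. The short sketch below, on an invented sample string, shows how *splitlines()* differs from `split("\n")` and how to count only the non-empty lines.

```python
sample = "line one\n\nline three\n"

lines = sample.splitlines()
print(lines)                       # ['line one', '', 'line three']

# split("\n") keeps an extra empty string after the final newline.
print(sample.split("\n"))          # ['line one', '', 'line three', '']

# Keep only lines that contain something other than whitespace.
non_empty = [line for line in lines if line.strip()]
print(len(lines), len(non_empty))  # 3 2
```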


---------
## Counting the number of words in the poem
We start by splitting the text into a list of words using *split()*

In [30]:
poemWords = poemText.split()

---------
Now we print the first 10 words:

In [32]:
poemWords[:10]

['HOWL', 'For', 'Carl', 'Solomon', 'I', 'I', 'saw', 'the', 'best', 'minds']

----------
Finally, we count the words by calling *len()* on the list of words:

In [33]:
print("The number of words in the text is:", len(poemWords))

The number of words in the text is: 2957
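
Calling *split()* with no argument splits on any run of whitespace (spaces, tabs, newlines) and drops empty strings, which is why the indentation of the poem does not produce empty words. A small sketch on an invented sample:

```python
sample = "  HOWL\n\n   For Carl   Solomon \n"

# No argument: runs of whitespace act as a single separator.
words = sample.split()
print(words)  # ['HOWL', 'For', 'Carl', 'Solomon']

# Splitting on a single space keeps empty strings and embedded newlines.
pieces = sample.split(" ")
print("" in pieces)  # True
```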


## Counting the number of occurrences of a word

In [45]:
# first, initialize a word counter to 0
wordsCounter = 0

# for every word in the list poemWords
for word in poemWords:
    # strip leading and trailing punctuation, and convert to lower case
    if word.strip(""" ,.*()[]!@#$%^&*{}?'`"-""").lower() == "who":
        wordsCounter += 1

In [46]:
print("The number of occurrences of 'Who' is:", wordsCounter)

The number of occurrences of 'Who' is: 68
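
The same idea can be wrapped in a reusable function. This is a minimal sketch: the name `count_word` is hypothetical, and it swaps the hand-written character set for `string.punctuation`, which is an assumption and not identical to the set used in the cell above. Note that *strip()* removes characters only from the two ends of a word, so an apostrophe inside a word is kept.

```python
import string


def count_word(words, target):
    """Count tokens equal to target, ignoring case and surrounding punctuation."""
    target = target.lower()
    return sum(
        1 for word in words
        if word.strip(string.punctuation).lower() == target
    )


tokens = "Who, who! whose (WHO) don't".split()
print(count_word(tokens, "who"))           # 3
print("don't".strip(string.punctuation))   # don't
```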


## Counting the number of lines starting with a word

In [50]:
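linesCounter = 0
for line in poemLines:
    # note: startswith("who") is a prefix test, so it would also match
    # lines beginning with longer words such as "whose" or "whom"
    if line.lower().strip().startswith("who"):
        linesCounter += 1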

In [51]:
print("The number of lines that start with the word 'Who' is:", linesCounter)

The number of lines that start with the word 'Who' is: 64
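
Because *startswith()* tests a prefix, it does not distinguish "who" from longer words that begin with the same letters. The sketch below compares the prefix test with a test on the first whole word of each line. The sample lines are invented for illustration, and `string.punctuation` is an assumed stand-in for the character set used earlier.

```python
import string

sample_lines = [
    "       who went out",
    "       whose idea was it",
    "       Who, then?",
    "       whole days",
]

prefix_matches = 0
word_matches = 0
for line in sample_lines:
    # Prefix test, as in the cell above.
    if line.lower().strip().startswith("who"):
        prefix_matches += 1
    # Whole-word test: compare only the first token of the line.
    tokens = line.split()
    if tokens and tokens[0].strip(string.punctuation).lower() == "who":
        word_matches += 1

print(prefix_matches, word_matches)  # 4 2
```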


## Counting the number of different words

In [52]:
# strip leading and trailing punctuation and convert to lower case
cleanWords = [word.strip(""" ,.*()[]!@#$%^&*{}?'`"-""").lower()
               for word in poemWords]

In [53]:
print("number of different words: ", len(set(cleanWords)))

number of different words:  1318
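
A *set* keeps one copy of each distinct item, so its length is the number of different words. One detail to be aware of: a token made only of punctuation (such as "&") becomes an empty string after stripping, and that empty string counts as a distinct item unless it is removed. A minimal sketch on an invented word list:

```python
words = ["the", "The", "moloch!", "&", "the", "Moloch"]
strip_chars = """ ,.*()[]!@#$%^&*{}?'`"-"""

clean = [word.strip(strip_chars).lower() for word in words]
print(clean)            # ['the', 'the', 'moloch', '', 'the', 'moloch']

unique = set(clean)
unique.discard("")      # drop the empty string left by punctuation-only tokens
print(len(unique))      # 2
```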


## Using the *Counter* collection
We can use *Counter* to count the frequency of every item in a list.

In [54]:
from collections import Counter
wordsCounter = Counter(cleanWords)

---------
What is the number of occurrences of the word 'who'?

In [57]:
wordsCounter["who"]

68

---------
What are the 10 most common words?

In [58]:
wordsCounter.most_common(10)

[('the', 220),
 ('of', 139),
 ('and', 120),
 ('in', 111),
 ('who', 68),
 ('to', 58),
 ('with', 48),
 ('moloch', 39),
 ('a', 38),
 ('you', 32)]
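
A few more *Counter* behaviours are worth knowing. The sketch below uses a small invented list so the results are easy to verify by hand.

```python
from collections import Counter

counts = Counter(["the", "who", "the", "moloch", "the", "who"])

print(counts["the"])          # 3
print(counts["christmas"])    # 0: a missing key returns 0 instead of raising KeyError
print(counts.most_common(2))  # [('the', 3), ('who', 2)]
print(sum(counts.values()))   # 6, the total number of items counted
```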

-------
## Finding a phrase in the text

Let's find the first occurrence of the word "Canada" in the text.

In [60]:
poemText.find("Canada")

1845

-------
Let's examine the text around this position:

In [70]:
print(poemText[1800:1900])

 mind leaping toward poles of 
              Canada & Paterson, illuminating all the mo- 
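
*find()* returns only the first match, and returns -1 when the phrase is absent. The following sketch shows one possible way to collect every occurrence and to print a window of text around a position. The helper names `find_all` and `context` and the sample sentence are assumptions made for this example.

```python
def find_all(text, phrase):
    """Return the start index of every non-overlapping occurrence of phrase."""
    positions = []
    if not phrase:
        return positions  # an empty phrase would otherwise loop forever
    start = text.find(phrase)
    while start != -1:
        positions.append(start)
        # Continue searching just past the match we found.
        start = text.find(phrase, start + len(phrase))
    return positions


def context(text, position, width=20):
    """Return the text within width characters on either side of position."""
    return text[max(0, position - width):position + width]


sample = "poles of Canada & Paterson, then Canada again"
print(find_all(sample, "Canada"))   # [9, 33]
print(sample.find("Mexico"))        # -1
print(context(sample, 9, width=5))  # s of Canad
```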
          
