# BDTA Lesson 3: Lists

In this lesson we will learn more ways of handling strings and meet a new datatype, the list.

## A text to process

In [1]:
theBigText = '''To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take Arms against a Sea of troubles,
And by opposing end them: to die, to sleep
No more; and by a sleep, to say we end
'''
theBigText

"To be, or not to be, that is the question:\nWhether 'tis nobler in the mind to suffer\nThe slings and arrows of outrageous fortune,\nOr to take Arms against a Sea of troubles,\nAnd by opposing end them: to die, to sleep\nNo more; and by a sleep, to say we end\n"

Here you can see how we can append more text to the variable using the ```+=``` operator.

In [2]:
theBigText += "the heart-ache, and the thousand natural shocks\n"
theBigText += "that Flesh is heir to?"
theBigText

"To be, or not to be, that is the question:\nWhether 'tis nobler in the mind to suffer\nThe slings and arrows of outrageous fortune,\nOr to take Arms against a Sea of troubles,\nAnd by opposing end them: to die, to sleep\nNo more; and by a sleep, to say we end\nthe heart-ache, and the thousand natural shocks\nthat Flesh is heir to?"

## Doing things to the text

We can do things to the text, like find out how many characters long it is. Note how we built the response by passing several items to the ```print()``` function.

In [3]:
print("This string has", len(theBigText), "characters.")

This string has 325 characters.
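
As a small aside, ```print()``` puts a single space between the items you give it, which is why the sentence above comes out neatly spaced. The sketch below uses a made-up string called `sample` (not the Hamlet text) and shows that behaviour next to an f-string, which is one other common way to build the same message.

```python
# `sample` is a made-up string used only for this illustration.
sample = "To be, or not to be"

# print() separates its arguments with a single space by default.
print("This string has", len(sample), "characters.")

# An f-string builds the same message as one string first.
message = f"This string has {len(sample)} characters."
print(message)
```

Both lines print the same sentence; which style you use is mostly a matter of taste.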


We can use the ```splitlines()``` method to create a list of lines. A list is a different datatype from a string, but lists can be indexed and sliced in the same way: you can get a single item or a slice. Note how the first item has an index of 0.

In [4]:
theLines = theBigText.splitlines()
theLines

['To be, or not to be, that is the question:',
 "Whether 'tis nobler in the mind to suffer",
 'The slings and arrows of outrageous fortune,',
 'Or to take Arms against a Sea of troubles,',
 'And by opposing end them: to die, to sleep',
 'No more; and by a sleep, to say we end',
 'the heart-ache, and the thousand natural shocks',
 'that Flesh is heir to?']

In [5]:
theLines[3]

'Or to take Arms against a Sea of troubles,'

In [6]:
len(theLines[3])

42

In [7]:
print(theLines[4:-1])

['And by opposing end them: to die, to sleep', 'No more; and by a sleep, to say we end', 'the heart-ache, and the thousand natural shocks']
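
To make the comparison between lists and strings concrete, here is a minimal sketch that applies the same index and the same slice to each. The names `words` and `phrase` are made up for this example and are not part of the lesson's code.

```python
# `words` and `phrase` are made-up examples, not variables from the lesson.
words = ["slings", "and", "arrows", "of", "fortune"]
phrase = "slings"

# Indexing starts at 0 for both lists and strings.
first_word = words[0]      # an item from the list
first_letter = phrase[0]   # a character from the string

# A slice stops just before its end index; -1 refers to the last position.
middle_words = words[1:-1]
middle_letters = phrase[1:-1]

print(first_word, first_letter)
print(middle_words, middle_letters)
```

Slicing a list gives back a list, and slicing a string gives back a string.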


## Tokenize

We can also split the original Hamlet text into words using the ```split()``` method. This splits the string on whitespace and returns a list of words.

In [8]:
theWords = theBigText.split()
theWords[:15]

['To',
 'be,',
 'or',
 'not',
 'to',
 'be,',
 'that',
 'is',
 'the',
 'question:',
 'Whether',
 "'tis",
 'nobler',
 'in',
 'the']
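
It may help to see what "splitting on whitespace" means in practice. The sketch below uses a made-up string, `messy`, that contains a double space and a line break, and compares ```split()``` with no argument to ```split(" ")```.

```python
# `messy` is a made-up string with a double space and a line break.
messy = "to die,  to sleep\nNo more"

# With no argument, split() treats any run of whitespace as one separator.
on_whitespace = messy.split()

# With an explicit " ", every single space is a separator, so the double
# space produces an empty string and the line break stays inside an item.
on_single_space = messy.split(" ")

print(on_whitespace)
print(on_single_space)
```

This is why the lesson calls ```split()``` without an argument: spaces and line breaks are all handled in one step.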

There are a couple of problems with this list of words. Some are capitalized and some are not. There is also punctuation attached to some of the words. We can fix the capitalization with the ```.lower()``` method. We will learn how to remove punctuation later.

In [9]:
theLowerBig = theBigText.lower()
theLWords = theLowerBig.split()
theLWords[:15]

['to',
 'be,',
 'or',
 'not',
 'to',
 'be,',
 'that',
 'is',
 'the',
 'question:',
 'whether',
 "'tis",
 'nobler',
 'in',
 'the']

In [10]:
len(theLWords)

65

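This list can be sorted. Note how the punctuation causes trouble: ```'tis``` sorts before everything else, and ```sleep``` and ```sleep,``` are treated as different words.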

In [11]:
theSorted = sorted(theLWords)
theSorted

["'tis",
 'a',
 'a',
 'against',
 'and',
 'and',
 'and',
 'and',
 'arms',
 'arrows',
 'be,',
 'be,',
 'by',
 'by',
 'die,',
 'end',
 'end',
 'flesh',
 'fortune,',
 'heart-ache,',
 'heir',
 'in',
 'is',
 'is',
 'mind',
 'more;',
 'natural',
 'no',
 'nobler',
 'not',
 'of',
 'of',
 'opposing',
 'or',
 'or',
 'outrageous',
 'question:',
 'say',
 'sea',
 'shocks',
 'sleep',
 'sleep,',
 'slings',
 'suffer',
 'take',
 'that',
 'that',
 'the',
 'the',
 'the',
 'the',
 'the',
 'them:',
 'thousand',
 'to',
 'to',
 'to',
 'to',
 'to',
 'to',
 'to',
 'to?',
 'troubles,',
 'we',
 'whether']
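
The order above comes from the way Python compares strings: character by character, using each character's code point. The apostrophe comes before the capital letters, and the capital letters come before the lowercase ones. The sketch below, using a made-up `snippet` of words from the text, shows why lowercasing before sorting matters.

```python
# `snippet` is a made-up string built from a few words in the text.
snippet = "to Arms be, 'tis Sea a"

mixed = snippet.split()            # original capitalization
lowered = snippet.lower().split()  # lowercased first, as in the lesson

# Capitalized words sort ahead of every lowercase word.
print(sorted(mixed))

# After lowercasing, the words fall into plain alphabetical order,
# although the apostrophe still pushes 'tis to the front.
print(sorted(lowered))
```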

### Getting a Random Word

We can also get a random word from a list and use it in a sentence. Note how we have to import the ```random``` module first.

In [13]:
import random
random.choice(theLWords)

'whether'

In [14]:
print("The random word is: \"" + random.choice(theLWords) + "\"")

The random word is: "troubles,"
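
Each time you run these cells you will probably get a different word. If you want a result you can repeat, one option is to set a seed with ```random.seed()``` before choosing. This is only a sketch: the list `words` is made up and the seed value 42 is arbitrary.

```python
import random

# `words` is a made-up list; the seed value 42 is arbitrary.
words = ["slings", "arrows", "fortune", "troubles"]

random.seed(42)
first_pick = random.choice(words)

random.seed(42)  # resetting the seed replays the same choice
second_pick = random.choice(words)

print("The random word is: \"" + first_pick + "\"")
print(first_pick == second_pick)
```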


## Counting words

To count the number of occurrences of each **word type** we will use a module called *collections*. This will give us a new datatype, ```Counter```, that we can use like a *dictionary*.

In [None]:
from collections import Counter
theCounts = Counter(theSorted)
print(theCounts)

In [None]:
type(theCounts)

With this we can look up the count for any word.

In [None]:
theCounts["the"]

In [None]:
theCounts.most_common(3)
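
Since the cells above are shown without output, here is a small self-contained sketch of what a ```Counter``` does. It uses a made-up list, `sample_words`, rather than the full Hamlet word list.

```python
from collections import Counter

# `sample_words` is a made-up list used only for this illustration.
sample_words = ["to", "be,", "or", "not", "to", "be,", "to"]
sample_counts = Counter(sample_words)

# Look up a count by word, as with a dictionary.
print(sample_counts["to"])

# A word that never appears gives 0 instead of an error.
print(sample_counts["hamlet"])

# most_common(2) returns the two most frequent (word, count) pairs.
print(sample_counts.most_common(2))
```

The result of ```most_common()``` is a list of pairs, so you can slice it or loop over it like any other list.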

----
# Getting Help

There are a number of places to get help for manipulating strings and lists. Here are some tutorials that will give you ideas about what you can do:

* [Manipulating Strings in Python](https://programminghistorian.org/lessons/manipulating-strings-in-python)
* [Counting Frequencies](https://programminghistorian.org/lessons/counting-frequencies)

For that matter, there is a whole series of Programming Historian lessons you can check out.

Stéfan Sinclair and I have also written a textbook called the [Art of Literary Text Analysis](https://github.com/sgsinclair/alta/blob/master/ipynb/ArtOfLiteraryTextAnalysis.ipynb).

----
# Homework

### Part 1
Write a notebook that has code that generates random sentences from lists of words. The template should be something like: 

```
All <noun> are <adjective>!
```

To do this you need to:

* Create a list of adjectives.
* Create a list of nouns.
* Use ```random.choice``` to pick one of each.
* Concatenate the sentence and print it.

### Part 2
Write a notebook that opens a text file, tokenizes it, and identifies the top 10 high-frequency words. To do this you need to:
* Put the text file in the folder with the notebook.
* Figure out the code for opening a text file and reading it into a variable. Can you figure out how to find code that will open a file and read it in?
* Then you want to split it into words (tokenize it). 
* Then you need to count the words and find the top ten.

The important thing is to explain what you are doing using markdown cells.