## Python Fundamentals

**Time: 2 hours**

**You will learn to:**

- Use variables and print their values
- Create a list and dictionary and iterate/loop through them
- Use 'if' conditions and select what code to run

**Objectives:**

- Learn Python fundamentals and syntax
- Run and write Python Code



### How to learn using a notebook:

#### - Read through the cells with explanations
#### - Run all the cells that contain code 

#### (To run a selected cell use the keyboard shortcut Shift+Enter or click the >Run button at the top)

### Introduction to Python and Jupyter Notebooks

Python is a programming language. We will use it to interact with text and analyse text documents.

This introduction to Python is important for understanding all the steps that follow in the text mining lesson. If you are new to programming it may not be entirely clear why you need to learn Python first. Bear with us; you will soon understand why it is needed and useful, and it will help to speed things up later on.

We will use Jupyter Notebooks (this file you are looking at) for learning. Notebooks are really useful because you can have your notes cells and your code cells in one place:

- Notes (or Markdown) Cells: like what you are reading now, are cells of written text that explain concepts
- Code Cells: like the `2+3` below, are parts of Python programming code that you can RUN (we'll show you how). This means that the code (like `2+3`) will be interpreted by Python (`oh, I think they are asking me to add two numbers, 2 and 3`) and executed (`that's 5, here you go`). When the code has run, Python will kindly return the result to you, and we will learn later how to show it (print it) in nicer ways.

In [None]:
2 + 3

# Above is your first piece of Python code and this line is a comment. 
# Comments are notes you can leave inside of your code. 
# They start with a # and are ignored when Python code runs.

# To 'RUN' this cell, select this cell and click '> Run' above 
# (or if you're into keyboard shortcuts, use Shift+Enter)

When the code above is **interpreted** (or **run**) you should see the result below the cell you have run.

Do you see it? Spoiler alert, it should be 5, because if you interpret the `2+3` operation, it will result in the number `5`. 

Notice that when you run a cell, the next cell is automatically highlighted.

Run the cells below and see what happens (basically click "Run" a few times, or press Shift+Enter a few times).

What do you expect to see?

In [None]:
# Numbers need no quotes around them. Python will use maths to add them.
20+30

In [None]:
100 + 123

In [None]:
# But words have quotes, and Python will combine them to add them.
"hello" + "my" + "friend"

In [None]:
# Oh no, the result above is too close together. We could try to add spaces, but it starts looking complicated.
"hello" + " " + "my"  + " " + "friend"

### Printing things (i.e., showing them on the screen in a pretty way)

So far we were sort of cheating by "returning" the result of our code. Returning happens when Python reaches the end of a cell and isn't told what to do, so it just panics and returns the most recent thing it thought about. Returning is not the kindest way to show something on the screen. A nicer way is to kindly ask Python to "print" things.

`print()` is Python's **function** to print (or show or display) something on the screen. 

Functions are sort of like skills that Python creators created for you. You can use them to achieve what you're after. We will learn a few of them in a minute.

Like every function, print() has a:

- **name** usually describing the action we'll do, e.g., `print`
- **arguments** in brackets, the things we want to use for that action, e.g., `("Hello")` or `("Hello","my","friend")`

In [None]:
print("Hello!")
print("I can print!")
print("All the things")

You know how you can teach a dog to `fetch`? It would not make much sense to tell a dog to fetch but not tell it what object it needs to fetch. That's why you would normally show the dog a ball or a stick, and then shout `fetch` as you throw it. In programming terms you would do something like `dog.fetch("stick")` or `dog.fetch("ball")`, but this will all start looking familiar soon (Note: the `.fetch()` syntax is a **method**, which is similar to a function except it is called on an object, which in this example is `dog`).

Did you notice that the results print underneath the cell without the ugly **Out[123]:** on the left, like it did in the previous examples?  That's because we printed the text, not just 'returned the result'. Printing is a nicer and often more readable way to get Python to show something.

Also notice that you can give many arguments to the `print()` function. What will happen is they will get separated by spaces, so they look better. It will help us to improve that complicated code from above (with loads of plus signs).

In [None]:
print("Hello","my","friend")

#### Minitask: Printing text

In [None]:
# Add some code in this cell (below this comment) and then RUN this cell:
# write a line that will print your full name, 
# and another line that will print your favourite pizza toppings.
# Do you see your words displayed on the screen? Yay! You are a Python programmer now.

### Variables and using = to assign them

Each line of code is a world of its own. Once it's 'Run', it forgets everything. It might have printed something or returned something but nothing is **stored for later**, unless we use a digital way to store things, called a VARIABLE. 

A Variable is like a locker in which you can store things, and just as a locker has its number and its contents, each variable has: 

- a **name**: like a label on the locker, so we can find it later
- **content**: something we put inside the locker to use later

For example, we can put the string `"Text Mining"` into the locker named `superpower`.

In [None]:
superpower = "Text Mining"
# Run this cell now, but do not worry if you do not see anything returned or printed!

This part will be a bit confusing to some of you: the symbol **=** means something else in programming than what it means in maths (where it means 'equals').

The equals symbol **=** in Python is called an 'assignment operator' because it assigns what's on the right of it to what's on the left. It works a bit like a left-pointing arrow **<---**

You can imagine 

`superpower = "Text Mining"`

as

`superpower <--- "Text Mining"`.

This code will take the text value on the right hand side (`"Text Mining"`) and store it for later (assign it) to a variable on the left hand side (`superpower`). Notice variable names have no quotes around them.

So the code

`contents_of_my_locker = "shoes"`

is a bit like 

`contents_of_my_locker <--- "shoes"`,

putting the word `"shoes"` into the variable `contents_of_my_locker`.
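A variable can also be given a new value later on. The short sketch below reuses the `contents_of_my_locker` name from above (the value `"books"` is just an illustrative assumption) to show that assigning again replaces whatever was stored before.

```python
# Put "shoes" into the locker (variable) called contents_of_my_locker
contents_of_my_locker = "shoes"
print(contents_of_my_locker)  # shoes

# Assigning again replaces the old contents with the new value
contents_of_my_locker = "books"
print(contents_of_my_locker)  # books
```

The arrow picture still works: the second line is like `contents_of_my_locker <--- "books"`, and the shoes are gone.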

In [None]:
# When you ran the previous code cell it did NOT return or print anything,
# but it did something even more useful: 
# it remembered the words "Text Mining" in a variable called superpower (for as long as this notebook keeps running).
# Now in other cells you can use that variable:

print(superpower)
print("My new superpower is "+superpower)

In [None]:
# You can create as many variables as you want, as long as their names are unique 
# and do not contain spaces (Python likes to use _ underscores for multi-word variable names)

student_name = "Nicola Minestrone"
course_title = "Text Mining"
print(student_name, "is taking", course_title, "course")

Notice the difference between using "superpower" as a word, and superpower as a variable name. Python does not like guessing what we mean, so we need to explicitly specify if we mean something as a name of a variable or text. In your notebook, the colour highlighting can often help you identify what is interpreted as what.

#### Minitask: Using variables

Try to explain in your own words what's happening in the code below, then change the name of the variable into `surname` and put your own surname there. You might see some errors, but do not fret.

In [None]:
friend = "Natalie"

print("hello " + "friend")
print("hello " + friend)

### Types of variables: String, Integer, Float

In this lesson we will be concentrating mainly on strings, but this is only one of the 'types' of things variables can hold onto:

In [None]:
superpower = "Text Mining"  # String: holds text, has quotes "" or ''
number = 42                 # Integer: holds whole numbers, no quotes
pi_value = 3.1415           # Float: numbers with decimal places, in some contexts you might also see 'double'

print(superpower, number, pi_value)

Text variables are called **strings** because they are chains (or lists or strings) of characters. You can imagine them as a necklace with a string of beads that have letters on them, like the ones that spell out your name.

For example, the string `"Edinburgh"` is made of 9 letters: `"E", "d", "i", "n", "b", "u", "r", "g" and "h"`. 

Why are types important? Mainly because Python tries to be 'smart' and guess what we want it to do, but it's not always very good at it.
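If you are ever unsure which type Python has given to a value, you can ask it. The sketch below uses Python's built-in `type()` function, which this lesson does not otherwise rely on, with the same three variables as in the cell above.

```python
superpower = "Text Mining"
number = 42
pi_value = 3.1415

# type() tells you what kind of thing a variable is holding
print(type(superpower))  # <class 'str'>   (str is short for string)
print(type(number))      # <class 'int'>   (int is short for integer)
print(type(pi_value))    # <class 'float'>
```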

Consider the two lines of code below. Try to guess what will be printed BEFORE you run them.

In [None]:
# Can you guess what will be printed before you run this cell?

print(123 + 123)
print("123" + "123")

This happened because when we add two words, we sort of glue them together in order, but when we add two numbers we use maths.

That was quite straightforward. Python knows how to add two numbers or two strings together. What about when the below scenario happens?

In [None]:
print("The meaning of life is " + 42)

# BRACE BRACE! This code is not correct and Python will freak out,
# but don't worry. Run this cell and see what happens!

Errors are your best friends and they really do their best to explain:

- Where something went wrong (green arrow): `---->`
- What is wrong & how you can fix it (last line): `TypeError: can only concatenate str (not "int") to str`

By the way: concatenate is a fancy word for joining two things together, one after the other.

The above line "Errored out" - instead of returning the expected output you got a nice explanation of what can and can't be done.

#### Minitask: Combining two items 

Can you fix the above line of code so that it prints "The meaning of life is 42" and run it again? Go on, I believe in you! There are a few ways to achieve that. Try to use the fact it knows how to add two strings - how do you make something a string?

# Lists are what we use to store a collection of things:  
# days = ["Monday", "Tuesday", "Wednesday"] 

(They're used for storing a collection of items in one variable.)

Data can be grouped together in an ordered way using a list. Lists are very common data structures used in Python, for example to represent text when it is split into smaller units.

We can represent a sentence as one long string:

`sentence = "Just think happy thoughts and you’ll smile."`

Or we can store the same content in a List named `sentence` that contains all the words and punctuation of the sentence "Just think happy thoughts and you’ll smile."

`sentence = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']`

Lists are created by typing comma-separated values inside square brackets. eg. `["one","two","three"]` or `[4,5,6]`

You can print out all elements in the list at once using `print(your_list_variable)`.

In [None]:
tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']

print(tokens) # Print all elements

Notice some really creative (but correct) use of punctuation above. It is very important to get your quotes correct when working with strings (text) and lists (collections).

Colours in your code editor will help you.

#### Minitask: Errors in a List

The code below is not correct; there are at least three things wrong with it. Can you fix it?

Try running the cell below and reading the error. Errors are your friends.

I have a game for you: run the cell, read the error and ONLY FIX WHAT the ERROR TELLS YOU TO FIX, one step at a time. Then run the cell again, see the new error and fix only that. It's very good practice and you will learn a lot!

Hint: single quotes and double quotes are interchangeable, so you can use 'hello' or "hello".

In [None]:
# Run this cell, and only fix what the error tells you. Resist the urge to fix other mistakes.
# Notice a little mini arrow pointing up  ^  to the exact point something is wrong.

reaction = ["It"s", "my" "absolutely", favourite, "song", !]
print(reaction

## Accessing items of a List

The list holds an **ordered** sequence of elements (in this case words or punctuation) and each element can be accessed using its index (position in the list).

It means that if you want to get the n'th item in a list (where n is the item's index), you need to say something like `list_name[n]`. 

For example: `tokens[3]` - using the name of the list, e.g., `tokens`, and the desired index in square brackets, e.g., `[3]` will return the item in tokens list stored at index (place) 3. 

In [None]:
# For example, to print first item in the below list...  
tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']

# ...you would use:
print(tokens[1]) # print the first element

WAIT WHAT? That should have printed `'Just'`

Note: Python indexes start with 0 instead of 1. So the first element in the list has index 0, and is accessed with `tokens[0]`.  `tokens[1]` actually returns the second item in the list (which is `'think'`).
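To see the counting-from-zero idea on a smaller example, here is a sketch using a made-up `days` list (it is not part of the `tokens` example):

```python
days = ["Monday", "Tuesday", "Wednesday"]
# index:    0          1           2

print(days[0])  # Monday  (the first item lives at index 0)
print(days[1])  # Tuesday (the second item lives at index 1)
```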

#### Minitask: accessing items in a list using indexes

Fix the code below to actually print the first (not the second) item in the list.

How would you access the third item, or the last item? Add lines of code that do that.

How about accessing an item at an index that looks like it should not exist, like 100 or -1? Try it.

In [None]:
print(tokens[1])

## Accessing sub-sections of a List (some people call them slices)

You can also get a slice (a section) of the list by specifying the start and end of your slice (note that the end is not inclusive).

The syntax is `some_list_variable[ beginning_index : end_index ]`, so to get the second half of our `tokens` list we could use `tokens[4:9]`. This slice starts with the item at index 4 (`'and'`) and stops just before index 9, so the last item it includes is the one at index 8 (`'.'`).

In [None]:
tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']

print(tokens[4:9])

In [None]:
# The first two elements of the list:
print(tokens[0:2])

# From the third item (at index 2) up to the sixth item (at index 5):
print(tokens[2:6])

It takes a bit of time to get used to counting from 0 and remembering the end index of a slice is not included, e.g. `tokens[0:2]` is not ` ['Just', 'think', 'happy']` but ` ['Just', 'think']` because the item at end_index (`2`) is not included.
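One handy consequence of the end index being left out: the number of items in a slice is simply `end - start`. A small sketch with a made-up `letters` list:

```python
letters = ["a", "b", "c", "d", "e"]
start = 1
end = 4

piece = letters[start:end]  # items at indexes 1, 2 and 3 (index 4 is not included)
print(piece)                # ['b', 'c', 'd']
print(len(piece))           # 3
print(end - start)          # 3, the same number
```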

#### Minitask:

Write code below that will:

- Print first 5 items of tokens list
- Print last 5 items
- Print all items apart from the first one and the last one
- Print just `['happy','thoughts']`

In [None]:
tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']
# Write your answers here:

### Additional ways to address indexes: default value and counting from the end

In [None]:
# If you skip the number before or after the : it will assume 'all the way from the beginning' or 'all the way to the end',
# so these two lines do the same thing, but the second one uses the 'default value' of 'from the beginning':
print(tokens[0:3])
print(tokens[:3]) 

In [None]:
# Here the second one uses the 'default value' of 'to the end' :
print(tokens[3:9])
print(tokens[3:])

In [None]:
# You can use the minus sign to count from the end using negative numbers,
# e.g., the second line will get the last 3 items starting with "the third-to-last" and going to the end:
print(tokens[6:])
print(tokens[-3:])

In [None]:
# To get all the words apart from the last two, we can use negative values as the end index:
print(tokens[:7])
print(tokens[:-2])

#### Minitask:

Re-do the previous minitask using the new ways to use indices:

- Print first 5 items of tokens list
- Print last 5 items
- Print all items apart from the first one and the last one
- Print just `['happy','thoughts']`

In [None]:
tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']
# Write your answers here:

### Removing things from a List

You can **delete** an item in the list by using the `pop(n)` method, specifying the index of the element, like `some_list.pop( index_to_remove )`.

In our example above `tokens.pop(2)` would remove the word `'happy'` from the list (because it's at index `2`).

If you do not specify the index, the last item gets removed: `tokens.pop()` removes the last item in `tokens`, which would be `'.'`.
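One detail worth knowing: `pop()` also hands back the item it removed, so you can store it in a variable if you still need it. A small sketch, using a made-up `colours` list so that it does not interfere with `tokens`:

```python
colours = ["red", "green", "blue"]

removed = colours.pop(1)  # remove the item at index 1 and keep hold of it
print(removed)            # green
print(colours)            # ['red', 'blue']
```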

In [None]:
tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']
print(tokens)

tokens.pop(2)
print(tokens)

In [None]:
tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']
print(tokens)

tokens.pop()
print(tokens)

In [None]:
# Notice what happens if you run pop() again
tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']
print(tokens)

tokens.pop()
print(tokens)

tokens.pop()
print(tokens)

In [None]:
# If you do not create the tokens list again in a cell, it will just keep the value from the previous cell,
# so here the value of tokens is the same as it was at the end of the previous cell.

# What will happen when you run this cell again? And again?  Do it a few times. Notice the output.
# (You'll need to re-select the cell with your mouse and then press Shift+Enter again.)
tokens.pop()
print(tokens)

Did you notice that when we change the List it gets changed in all the cells? On some level your notebook is like a computer: If you delete a file, go for a cup of tea and come back... the file is still deleted.

Values stored in variables persist across the times you run a cell and they are shared between all cells. That's why you can create a variable in one cell and then use it in another cell.

To reset your List, you need to run again the cell where you have assigned it with `tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']`

### Adding things to a List

Items can be added to a list either by inserting them at a specific position (index) or by appending them to the end of the list. 

The syntax you would use is: `collection.insert( where, what )`

E.g., to insert "Hello" at index 3 you'll use: `tokens.insert( 3, "Hello")`

In [None]:
# Note: this would replace the ending punctuation '.' with the word 'widely', but that's not what we want to do
tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']
print(tokens)

tokens[8] = 'widely'
print(tokens)

In [None]:
# We want to insert something at a place in a list and push along everything else 
tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']
print(tokens)

tokens.insert(8,'widely')
print(tokens)

Notice that as you add things to a list, you push everything else along, rather than replacing list items.

To simply add something to the end of a list, you can use: `collection.append( what )`

E.g., to add "Bye" at the end you'll use: `tokens.append("Bye")`

In [None]:
tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']
print(tokens)

tokens.append('[J.M. Barrie]')
print(tokens)

#### Minitask:

Can you adjust the `tokens` list so that you replace `'smile'` with `'fly'` ?
First remove the item you don't want using `your_list.pop(n)`, and then add the new item at the same place using `your_list.insert(n, item)`.

### Checking the Length of a List

Finally you might want to check how long a list is. For this you can pass the list into Python's length function `len( something )` where `something` is the collection you want to measure.

In [None]:
sentence = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']
print(len(sentence))

sentence.pop()
print(len(sentence))

### FUN FACT - Strings are basically lists of characters

Because each string is sort of like a list of characters, you can use the list functions on the strings as well:

- To get the length of a word you can use `len(a_word)` like `len("thoughts")`
- To get the nth character of a word you can use `a_word[n]` like `"thoughts"[3]`
- To get a slice of a word you can use `a_word[start:end]` like `"thoughts"[3:-2]`

In [None]:
print(len("thoughts"))
print("thoughts"[3])
print("thoughts"[3:-2])

In [None]:
# Or even
word = "thoughts"
print(len(word))
print(word[3])
print(word[3:-2])

## 'For' Loop: used to iterate through the elements in a list, one at a time

It's best if we start by seeing a loop with our own eyes:

In [None]:
# This will loop through each word in the list and print "I say:" before printing each word

tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']
for word in tokens:
    print("I say:", word) # This will happen 9 times, once for each item in the list

The syntax of a for loop starts with `for a_thing in many_things:` where:

- `many_things` is the collection you want to go through (in our case, a list)
- `a_thing` is a temporary variable name that will represent the exact element of `many_things` that we are dealing with. 

The code above has run 9 times: the first time `word` was equal to `"Just"`, the second time it was equal to `"think"`, and so on.

When we're looping through (going through) the list, the `a_thing` variable keeps changing its value until we've looped through every value in the list.

The indented (moved to the side) pieces of code after the `:` specify code to be performed with `a_thing`. This needs to be indented using a tab (or four spaces) so that Python knows the instructions that follow need to be executed for each element.

When the indentation ends, Python will know to stop looping.

In [None]:
tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']
print("about to loop ->")

for word in tokens:
    print("- about to say it")
    print("the word is", word)   
    print("- just said it")

print("-> done looping")

You can specify the collection to loop through 'by hand' rather than using a variable.

In [None]:
for day in ['Monday', 'Tuesday', 'Wednesday']:
    print("The day is", day)   

#### Minitask:

- Can you print each item in `tokens` but surrounded by `*`, as in `*Just*`?
- *Tricky:* Can you print only the first 5 elements of the sentence? (Remember the slice syntax above: `your_list[ start : end]`.)


In [None]:
tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']
# Write your answer below


### Compare two values with == and you'll get True or False

Note: because the equals sign **=** already has a meaning in programming (it's used for assigning right to left), Python has to use another symbol for **EQUALS**.

In Python, we use two equals signs **==** to check if two things are equal to each other.

In [None]:
print("Hello" == "Bye")
print("Hello" == "Hello")

In [None]:
# Or even
word = "Hello"
print(word == "Bye")
print(word == "Hello")

You can also combine conditions with logic operators: `and`, `or`, `not`.

In [None]:
username = "Jackie"
password = "secret"
print(username == "Nicola" and password == "secret") # False, because 'and' needs both sides to be true
print(username == "Nicola" or password == "secret")  # True, because 'or' is fine with just one side being true
print(not username == "Nicola")                      # True, because 'username == "Nicola"' is false, and 'not false' is true

# Conditionals  if / elif / else : used to pick which lines of code to skip

Most of the time, all of your lines of code are run from top to bottom. Sometimes, though, you might want to choose to run some lines and not others.  For example, you might want to only print words if they are longer than 5 letters.

In Python we can create conditions (simple rules) that will direct the 'flow' of our code, a little bit like how semaphores direct train tracks, or blinking roadwork signs direct car traffic, or that person with shining lollipops directs airplanes at the landing strip.

Conditionals either say: *come this way (true)* or *don't come this way (false)*.

If the condition (a sort of test) passes, the indented code underneath it is executed. If the condition doesn't pass, then the indented code underneath it is skipped.
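As a rough sketch of the "only print words longer than 5 letters" idea mentioned above, a condition can sit inside a loop. The list `some_words` and the name `long_words` are made up for this illustration.

```python
some_words = ["Just", "think", "happy", "thoughts", "in", "Edinburgh"]
long_words = []

for word in some_words:
    if len(word) > 5:            # the test: is this word longer than 5 letters?
        print(word)              # only runs when the test passes
        long_words.append(word)  # keep the long words for later

print(long_words)  # ['thoughts', 'Edinburgh']
```

Words that fail the test (like `"Just"` or `"in"`) are simply skipped, and the loop moves on to the next word.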

In [None]:
password = "secret"
if password == "secret":
    print("your password is too easy") # This happens if password is "secret"
else:
    print("what a safe password!")     # This happens otherwise

Change the value of the variable password in the cell above to be something else and run it again.

You can go further, providing more tests (conditions) to pass, using `elif`.

In [None]:
password = "secret"
if password == "":
    print("your password cannot be empty")                    # this would happen if password was "" (which it is not)
elif password == "secret":
    print("that's way too easy")                              # this would happen if password is equal to "secret"
elif len(password) < 8:
    print("your password needs to have at least 8 characters") # this happens if len(password) < 8
else:
    print("what a safe password!")                            # this happens otherwise

Note that the conditions will be checked from top to bottom, so even though the length of 'secret' is shorter than 8, the condition above it will evaluate to true first, so it will 'steal the show'.
 
Only one branch of an `if` / `elif` / `else` chain is ever executed: the first one whose condition is true, or the `else` branch if none of the conditions are true.
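Here is a small sketch of that "first match wins" behaviour. The word `"kiwi"` and the messages are made up; notice that both of the first two conditions are true for it, yet only the first branch runs.

```python
word = "kiwi"

if len(word) > 2:
    message = "longer than 2 letters"  # this branch runs
elif len(word) > 3:
    message = "longer than 3 letters"  # also true, but never reached
else:
    message = "a short word"

print(message)  # longer than 2 letters
```

If the order of your conditions matters, put the most specific test first.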

#### Minitask: Combining everything we learned so far

Read the code below and deduce what it does. Then, change the values (edit, add, or remove) of the variable `words` to reach every branch (condition) of this if-else tree.

In [None]:
words = ["apples", "pears", "bananas", "kiwi"] 


if words[1] == "plum":
    print("there's a plum where I expected it") 
elif len(words) > 4:
    print("there are more than 4 fruits") 
elif len(words[0]) > 7:
    print("first fruit has a long name")
elif words[0] == words[1] or words[-2] == words[-1]:
    print("first two or last two fruits are the same") 
else:
    print("I don't know what to say!")

## Final two collection types: Tuples (unchangeable lists), Dictionaries (key -> value pairs)

# Tuples - like lists, but can't be changed 
# my_tuple = ('blue', 'green', 'red')

A tuple is similar to a list as it is an ordered sequence of elements. The difference is that tuples can’t be changed once created. You recognise them by their `( )` parentheses. As with lists, you separate items in a tuple with a comma, as in `('blue', 'green', 'red')`.
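To see what "can't be changed" means in practice, the sketch below tries the same change on a list and on a tuple. It uses `try` / `except`, which this lesson does not cover; treat it here simply as a way to catch the error so the code keeps running.

```python
colour_list = ['blue', 'green', 'red']
colour_list[0] = 'yellow'        # fine: lists can be changed
print(colour_list)               # ['yellow', 'green', 'red']

colour_tuple = ('blue', 'green', 'red')
try:
    colour_tuple[0] = 'yellow'   # not allowed: tuples cannot be changed
except TypeError as error:
    print("Python refused:", error)

print(colour_tuple)              # ('blue', 'green', 'red'), still the same
```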

In [None]:
# Tuples are basically lists that can't be changed, so there are no pop() or insert() methods.
colour_tuple = ('blue', 'green', 'red')
print(colour_tuple[0]) 
print(colour_tuple[2]) 
print(len(colour_tuple)) 

# Dictionaries - like a list, but instead of indexes pointing to values, we can create our own 'keys' that point to values
# student = {"firstname":"Nicola", "surname":"Paczkova", "course":"Politics"}


#### Back when all we knew were lists:

Remember, in a list we used **NUMBER KEYS (indices)**, so numbers 0, 1, 2, 3 and so on were pointing to values, e.g., `tokens[2]` points to the value at the third place in the list. The `2` is the INDEX, which is sort of like an address pointing to a particular place in a list.

Lists are like a room full of lockers, and each locker has a number by which we can find them. In the example below, the first locker has number 0 and to find the owner I need to use that number (index) to find out who owns it. E.g., `lockers_list[0]` will return the value `"Nicola's"`.

`lockers_list = ["Nicola's", "Kat's", "Jill's", "Pat's"]`

As long as these lockers do not move too much (we can trust that Nicola's will always be first) this works fine.


But things get more interesting if I want to store information about something more real-life, for example, about a student. If I want to know their firstname, surname and course, I could just store them in a list like `["Nicola", "Paczkova", "Politics"]` but then we need to remember that the first item is firstname, the second, surname, and the third, course, which can lead to all sorts of trouble.

#### Dictionaries: when order does not matter or is likely to change

In a dictionary we have **WORD KEYS** pointing to values, e.g., `"firstname"` will point to `"Nicola"`, `"surname"` will point to `"Paczkova"` and `"course"` will point to a value `"Politics"`.  `student[word_key]` points to the value at the place in the dictionary specified by whatever you put in as the `word_key`.

`student = {"firstname":"Nicola", "surname":"Paczkova", "course":"Politics"}`

We use curly brackets `{ }` to define a Dictionary, a colon `:` to separate each key from its value, and a comma `,` to separate key-value pairs from each other.

If we want to get that student's firstname we would use the **KEY** (word) the same way as we would have used the **INDEX** (number) in lists:

`student['firstname']`

Keys are nicer than indices, because we can make them more meaningful than just 0,1,2,3, etc. Our code becomes more readable and easier to maintain.
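A side-by-side sketch of the two approaches (the name `student_list` is made up for this comparison):

```python
# As a list: we have to remember that index 1 means "surname"
student_list = ["Nicola", "Paczkova", "Politics"]
print(student_list[1])     # Paczkova

# As a dictionary: the key says what the value means
student = {"firstname": "Nicola", "surname": "Paczkova", "course": "Politics"}
print(student["surname"])  # Paczkova
```

Both lines print the same surname, but the second one is much easier to understand when you come back to the code later.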

#### Dictionaries are sort of like a bunch of variables in a wrapper

Each key acts as a reference (or a name) for a single value in the dictionary, almost like a bunch of variables bundled up into a mega-variable.

Instead of:

`firstname = "Nicola"
surname = "Paczkova"
course = "Politics"`

we combine these values and variables into one meaningful unit and give it a name:

`student = {"firstname":"Nicola", "surname":"Paczkova", "course":"Politics"}`

In [None]:
student = {"firstname":"Nicola", "surname":"Paczkova", "course":"Politics"}

print(student)           # Print the whole dictionary
print(student['course']) # Print just the value we stored at the 'course' key

You will very often see dictionaries when analysing text, usually with keys as individual words and values as integers (e.g., counts of how many times the words appear in a text). 
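As a rough idea of where such a dictionary might come from, here is a minimal sketch that counts words with a loop. The `words` list is made up, and `.get(word, 0)` is a dictionary method this lesson does not otherwise cover: it returns the count stored so far for `word`, or `0` if the word has not been seen yet.

```python
words = ["the", "cat", "saw", "the", "other", "cat"]
word_counts = {}  # an empty dictionary to start with

for word in words:
    # look up the count so far (0 if the word is new) and add one
    word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts)  # {'the': 2, 'cat': 2, 'saw': 1, 'other': 1}
```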

In [None]:
word_frequencies = {'Mary':5,  'lamb':5 , 'the': 5, 'a': 2, 'had':1,  'little':1}

print(word_frequencies['Mary']) # Prints the value of the key 'Mary'

#### Minitask: 

- Given the dictionary below, which is a result of analysing a text document, print the number of verbs


In [None]:
types_of_speech = {'verbs':45, 'nouns':24, 'adjectives':33}
# type your answer here


### Looping through a dictionary:

You can also request all the keys, all the values, or all the key-value pairs as tuples from a dictionary. This will come in handy later: 

In [None]:
student = {"firstname":"Nicola", "surname":"Paczkova", "course":"Politics"}
print(student.keys())   # Method for getting all the keys
print(student.values()) # Method for getting all the values
print(student.items())  # Method for getting all the (key, value) tuples

This is useful when we want to loop through items in a dictionary. Unlike in a list, instead of having only one `value` for each item in a list, we will have a `key` and a `value` for each item in a dictionary.

To make sure that our dictionary is separated into keys and values when looping, we need to use `my_dictionary.items()`.

```
for key, value in my_dictionary.items():
    # you can do something here with key and value, like
    print(key, value) 
```

In [None]:
types_of_speech = {'verbs':45, 'nouns':24, 'adjectives':33, 'pronouns': 12}

for type_of_speech, count in types_of_speech.items(): # .items() splits dictionary into (key, value) pairs
    print(count, "is the number of", type_of_speech)
    

#### Minitask:

How would you change the code from the above example to only print information for types of speech where their count was larger than 30? You are given a partial answer and you need to fill in the gaps marked with `?????`.

In [None]:
types_of_speech = {'verbs':45, 'nouns':24, 'adjectives':33, 'pronouns': 12}

for type_of_speech, count in types_of_speech.items(): # .items() splits our dictionary into (key, value) pairs
    if ????? > 30 :
        print(count, "is the number of", type_of_speech)

Notice that I am giving the key and value variables meaningful names, `type_of_speech` and `count`, rather than just writing `key` and `value`, so that my code is easier to read.

# List Comprehension: another very useful loop  (that we will use a lot)

`lowercase_tokens = [ word.lower() for word in tokens ]`

### Modify a list on the fly

You can use this simplified loop syntax when what you want is to take a list of items and change each item in that list into something else. 

Think about it like a conveyor belt: things go in on one side, and slightly changed things come out on the other side.

The syntax for this is:


`result = [ my_output for one_item in all_items]`

For readability, it's best to split it across two lines (Python ignores line breaks inside square brackets) and write it like this:


`result = [  my_output
            for one_item in all_items]`

Example: For each word in `tokens`, replace it with a lowercased version of itself using `word.lower()`, e.g., change `Just` into `just`, `HAPPY` into `happy`, etc.

In [None]:
# Note: some_word.lower() returns a lowercase copy of that word

tokens = ['Just', 'think', 'HAPPY','thoughts', 'and', 'you', '’ll', 'smile', '.']

lower_case_words = [ word.lower() 
                     for word in tokens ]

print(lower_case_words)
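
If the compact syntax looks unfamiliar, it may help to see that it does the same job as an ordinary `for` loop that appends to an empty list. A sketch of that longer equivalent (the name `lower_case_words_loop` is just for illustration):

```python
tokens = ['Just', 'think', 'HAPPY', 'thoughts', 'and', 'you', '’ll', 'smile', '.']

lower_case_words_loop = []                      # Start with an empty list
for word in tokens:                             # Go through the words one by one
    lower_case_words_loop.append(word.lower())  # Add the lowercased word to the result

print(lower_case_words_loop)
```

The list comprehension produces the same list in a single expression.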

Example: For each word in `tokens`, represent it with its length, e.g., change `Just` into `4`, `think` into `5`, etc.

In [None]:
tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']

lengths_of_words = [ len(word) 
                     for word in tokens ]

print(lengths_of_words)

Notice that we are taking a list of some things and returning a list of some other things, but both lists have the same length.

### Further filtering a list to keep only some of its elements:

Optionally, you can add a third line with a condition that needs to be true for the item to be kept in the final result.

The syntax becomes: 


```
result = [  output
            for one_item in all_items
            if condition]
```

In [None]:
tokens = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']

lengths_of_long_words = [ len(word)          # Turn a word into its length
                          for word in tokens # For each word in tokens list
                          if len(word) > 4 ] # But only keep those where the word's length was over 4

print(lengths_of_long_words)
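
Unlike the earlier examples, a comprehension with a condition can return fewer items than it was given. A quick check with `len()`, offered as a sketch on the same tokens:

```python
tokens = ['Just', 'think', 'happy', 'thoughts', 'and', 'you', '’ll', 'smile', '.']

lengths_of_long_words = [ len(word)
                          for word in tokens
                          if len(word) > 4 ]

print(len(tokens))                 # 9 items went in
print(len(lengths_of_long_words))  # 4 items came out
```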

We can even output the words unchanged and still use the filtering part of our code:

In [None]:
sentence = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']

long_words = [ word                 # Output each word unchanged; we only want to filter
               for word in sentence
               if len(word) > 4 ]

print(long_words)

Example: Only keep words starting with the letter 't'. Since strings behave much like lists of characters, we can get the first letter of a word by asking for its first item, just as we do with lists: `word[0]`.

In [None]:
sentence = ['Just', 'think', 'happy','thoughts', 'and', 'you', '’ll', 'smile', '.']

words_starting_with_t = [ word 
                          for word in sentence
                          if word[0] == "t" ]

print(words_starting_with_t)
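
Strings also have a built-in `startswith()` method, which is a common alternative to `word[0] == "t"`. Unlike `word[0]`, it does not raise an error when a string is empty. A sketch, where the empty string in the list is added purely to illustrate that point:

```python
sentence = ['Just', 'think', 'happy', 'thoughts', '', 'smile', '.']  # Note the empty string ''

words_starting_with_t = [ word
                          for word in sentence
                          if word.startswith("t") ]  # Safe even for the empty string

print(words_starting_with_t)  # ['think', 'thoughts']
```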

#### Minitask:

- Given a list of names, create another list with just these names but in lowercase (you can use `your_word.lower()`)
- Now filter this list to only include those names that are longer than 5 letters (you can use `len(word)` and a `>`)

In [None]:
names = ['Matilda', 'Jess', 'Pat', 'Yu', 'Prianka', 'Hermenegilda', 'Makiko']

# For the Curious: combining it all together

Let's write a piece of code that will take the counts of words in the "Mary had a little lamb" rhyme and:
- Print the number of times each word occurs
- Print the number of times each word occurs, but only for words that occur more often than 4 times

Then, we'll change the code to:
- Print words that occur more often than 4 times and do NOT start with 't'

Then, we'll change the code again to:
- Store these popular words in a separate list

In [None]:
word_counts = {'mary': 5, 'had': 1, 'a': 2, 'little': 1, 'lamb': 5, 
               'whose': 1, 'fleece': 1, 'was': 3, 'white': 1, 'as': 1, 
               'snow': 1, 'and': 4, 'everywhere': 1, 'that': 1, 'went': 1,
               'the': 8, 'sure': 1, 'to': 3, 'go': 1, 'it': 4, 'followed': 1,
               'her': 1, 'school': 2, 'one': 1, 'day': 1, 'which': 1,
               'against': 1, 'rules': 1, 'made': 1, 'children': 2, 'laugh': 1,
               'play': 1, 'see': 1, 'at': 1, 'so': 2, 'teacher': 2, 'turned': 1,
               'out': 1, 'but': 1, 'still': 1, 'lingered': 1, 'near': 1, 
               'waited': 1, 'patiently': 1, 'about': 1, 'till': 1, 'did': 2,
               'appear': 1, 'why': 2, 'does': 1, 'love': 1, 'eager': 1, 'cry': 1,
               'loves': 1, 'you': 1, 'know': 1, 'reply': 1}
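
The first step in the list above, printing the count for every word, needs no condition at all. A minimal sketch, using a shortened copy of the dictionary (`sample_counts` is an illustrative name) so that the output stays brief:

```python
# A shortened, illustrative copy of word_counts
sample_counts = {'mary': 5, 'had': 1, 'a': 2, 'little': 1, 'lamb': 5}

for word, count in sample_counts.items():
    print(word, "appears", count, "times")
```

Swapping `sample_counts` for the full `word_counts` dictionary would print one line per word in the rhyme.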

In [None]:
# To only print words that occur more often than 4 times
for word, count in word_counts.items():
    if count > 4:
        print(word, "appears", count, "times")

In [None]:
# Only print words that occur more often than 4 times and do not start with 't'
# Note: you can combine conditions with logic operators: and, or, not
for word, count in word_counts.items():
    if count > 4 and not word[0] == 't':
        print(word, "appears", count, "times")

In [None]:
# Store these popular words in a separate list
popular_words = [] # Notice: I am creating a list but putting nothing in it yet
for word, count in word_counts.items():
    if count > 4 and not word[0] == 't':
        popular_words.append(word)

print(popular_words)
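
The same kind of list can also be built with the list comprehension syntax from earlier in this notebook, since `.items()` works inside a comprehension too. A sketch on a shortened, illustrative dictionary (`sample_counts` and `popular_not_t` are made-up names):

```python
# A shortened, illustrative copy of word_counts
sample_counts = {'mary': 5, 'lamb': 5, 'the': 8, 'and': 4, 'was': 3}

popular_not_t = [ word
                  for word, count in sample_counts.items()
                  if count > 4 and not word[0] == 't' ]

print(popular_not_t)  # ['mary', 'lamb']
```

Whether to use the loop or the comprehension is mostly a matter of readability; both give the same result.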

# A final note:

We will download some files and libraries on the first day. To speed this up, please run the cell below now; it contains the imports and downloads, so you will not have to wait for them on the day. Thanks!

In [None]:
# Run this cell now. It's the usual imports of text mining libraries.

import nltk
import numpy
import string
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
nltk.download('punkt')