# Week 1: Coding with data in Python

We start out with the basics. The exercises in this session cover:

* Writing Python code and Markdown in Jupyter notebooks
* Introductory Python
* Getting some data from Reddit

## Exercises

### Part 1: Know thy notebook

This document is what we call a *Jupyter notebook*. We will be using these extensively throughout the course so **READ THIS CLOSELY**. If you understand how notebooks work, you will save yourself lots of time and frustration throughout this course!

There are two basic things you need to know about Jupyter notebooks:

1. A notebook is nothing but a list of cells. A cell can either be a **code cell** or a **Markdown cell**. Code cells are for writing executable code, and Markdown cells (like this one) are for explaining things in text and making your notebook more readable. A typical workflow that you will soon get used to, is something like: solving a problem with some code in a *code cell* and explaining your reasoning or the results you obtained in a *Markdown cell*. You can toggle cell type when you are in *command mode* by pressing <kbd>y</kbd> for code and <kbd>m</kbd> for Markdown. **Try to do that**. Change this *Markdown* cell to a *code* cell, and change it back again. What happens if you execute (<kbd>shift</kbd>+<kbd>enter</kbd>) when this cell is a code cell, compared to when it is a Markdown cell?

2. The notebook has two *modes*: **edit mode** and **command mode**. You enter command mode by pressing <kbd>esc</kbd> or clicking outside a cell, and edit mode by clicking a cell and pressing <kbd>enter</kbd> or double clicking a cell. When you're in edit mode, the left border of the current cell turns green (not with `jupyter lab`, though, there the bar is always blue) and whatever you type into your keyboard goes into that cell, whether it is a code or Markdown cell. [Here](http://maxmelnick.com/2016/04/19/python-beginner-tips-and-tricks.html)'s a nice rundown of the different commands you can use. **Beware of <kbd>x</kbd> and <kbd>dd</kbd>**. Read the full list of hotkeys by pressing <kbd>h</kbd> in command mode to figure out why.

>*Heads up:* Because we'll be using Jupyter notebooks so much in this course, I strongly recommend investing 5 more minutes playing around with cell types, modes and hotkeys. It will save you heaps of time down the road. Above all, make sure you have read and understood these ^ two points!

When you run a code cell by pressing <kbd>shift</kbd> + <kbd>enter</kbd>, the code gets evaluated by the Python interpreter installed on your computer. The value of the last expression in the cell is displayed below it, unless you store it in a variable (or the value is `None`); anything you `print` is shown as well. In general, you will use code cells for doing analysis and working with data.

*Markdown* is a simple markup language for formatting text (like *HTML* or $\LaTeX$). You will typically use it for writing explanations about how you solve the exercises and the results you get, and styling your notebook with sections and subsections. It can do **bold**, *italics* and $\LaTeX$ formatting (for equations), and much much more. You can read about the Markdown language [here](http://daringfireball.net/projects/markdown/).

Below is your first exercise. The exercises are numbered by the convention `[session]`.`[section]`.`[problem]`.`[subproblem]`. For example, exercise 4.2.3.1 is in week 4, section 2, problem 3, and subproblem 1.

>**Ex. 1.0.1**: In the Markdown cell below, write a short text that shows that you can:
>* Create sections
>* Write words in bold and italics
>* Write an equation in LaTeX formatting
>* Create bullet lists
>* Create [hyperlinks](https://en.wikipedia.org/wiki/Hyperlink)

>*Hint: Remember to execute the cell (<kbd>shift</kbd>+<kbd>enter</kbd>) so the Markdown gets rendered.*

[Answer to Ex. 1.0.1]

### Part 2: Essential Python ([DSFS](https://www.oreilly.com/library/view/data-science-from/9781492041122/) Chapter 2)

These exercises take you through some very basic Python. Use them to calibrate your expectations: If you find them hard, you must spend some more time getting up to speed (see the preparation goals for today's session on Canvas).

>**Ex. 1.1.1**: Create a list `a` that contains the numbers from $0$ to $1110$ (including $0$ and $1110$), incremented by one, using the `range` function.

In [5]:
# range excludes its stop value, so stop at 1111 to include 1110
a = list(range(0, 1111))

>**Ex. 1.1.2**: Show that you understand [slicing](http://stackoverflow.com/questions/509211/explain-pythons-slice-notation) in Python by extracting a list `b` with the numbers from $760$ to $769$ (including both) from the list created above.

In [9]:
b = a[760:770]  # the stop index (770) is excluded
print(b)

[760, 761, 762, 763, 764, 765, 766, 767, 768, 769]
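
As a quick refresher on slice notation, here is a minimal sketch. The list `nums` is a made-up example, not part of the exercise. It shows that the start index is included, the stop index is excluded, and that omitted or negative indices are allowed.

```python
# `nums` is a made-up example list, not part of the exercise.
nums = list(range(10))      # [0, 1, 2, ..., 9]

first_three = nums[:3]      # start defaults to 0; the stop index (3) is excluded
middle = nums[3:6]          # elements at indices 3, 4 and 5
last_two = nums[-2:]        # negative indices count from the end
every_other = nums[::2]     # a third value sets the step

print(first_three)   # [0, 1, 2]
print(middle)        # [3, 4, 5]
print(last_two)      # [8, 9]
print(every_other)   # [0, 2, 4, 6, 8]
```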

>**Ex. 1.1.3**: Define a function that takes as input a number $x$ and outputs the number multiplied by itself plus three, that is, $f(x) = x(x+3)$.

In [11]:
def func(x):
    return x * (x + 3)  # f(x) = x(x + 3)

func(1)

4

>**Ex. 1.1.4**: Apply this function to every element of the list `b` using a `for` loop and append the results to a new list `c`. Print `c`.

In [13]:
c = []
for i in b:
    c.append(func(i))
print(c)

[579880, 581404, 582930, 584458, 585988, 587520, 589054, 590590, 592128, 593668]


>**Ex. 1.1.5**: Do the exact same thing using a *list comprehension*.

In [14]:
c = [func(i) for i in b]
print(c)

[579880, 581404, 582930, 584458, 585988, 587520, 589054, 590590, 592128, 593668]
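
A list comprehension can also filter while it maps. The sketch below reuses `func` from Ex. 1.1.3 on a small made-up list `values`, so the numbers are easy to check by hand. It compares the loop form with the comprehension form.

```python
def func(x):
    # Same function as in Ex. 1.1.3: f(x) = x(x + 3)
    return x * (x + 3)

values = [1, 2, 3, 4]   # made-up input, chosen to keep the numbers small

# Loop version
looped = []
for v in values:
    if v % 2 == 0:            # keep even inputs only
        looped.append(func(v))

# Equivalent list comprehension: expression, loop, optional filter
comprehended = [func(v) for v in values if v % 2 == 0]

print(looped)         # [10, 28]
print(comprehended)   # [10, 28]
```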


>**Ex. 1.1.6**: Write the numbers in `c` to a text file with one number per line.

In [19]:
with open("numbers.txt", "w") as f:
    for number in c:
        f.write(f"{number}\n")  # one number per line
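
One way to check a file like this is to read it back. The sketch below is only an illustration: it uses a made-up list `numbers` in place of `c`, and writes to a temporary directory so that it does not touch any existing file.

```python
import os
import tempfile

numbers = [10, 28, 54]   # made-up values standing in for `c`

# Write to a temporary directory so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
with open(path, "w") as f:
    for n in numbers:
        f.write(f"{n}\n")

# Read the file back and convert each line to an int again.
with open(path) as f:
    read_back = [int(line) for line in f]

print(read_back)   # [10, 28, 54]
```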

>There are three ways to format strings in Python.
> 1) The oldest is %-formatting, which has more or less gone out of style.
> 2) The next is `str.format()`, a more modern approach. Read more [here](https://realpython.com/python-f-strings/#option-2-strformat).
> 3) Finally, formatting with f-strings is the newest and, in most cases, the best method. Read more [here](https://realpython.com/python-f-strings/#f-strings-a-new-and-improved-way-to-format-strings-in-python).
> 
>**Ex. 1.1.7**: Show that you understand how strings work in Python. You should:
>
>1. Add a comment above each line of code that explains it.
>2. Find all the lines where **a string** is put into a string. How many are there?
>3. Rewrite the last example, with owners and rabbits, so that it uses f-strings instead.
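
To make the three formatting styles concrete, here is a small side-by-side sketch. The variables `name` and `rabbit_count` are made-up values for illustration.

```python
name = "Alice"        # made-up example values
rabbit_count = 5

percent_style = "%s has %d rabbits" % (name, rabbit_count)
format_style = "{} has {} rabbits".format(name, rabbit_count)
f_string_style = f"{name} has {rabbit_count} rabbits"

print(percent_style)    # Alice has 5 rabbits
print(format_style)     # Alice has 5 rabbits
print(f_string_style)   # Alice has 5 rabbits

# f-strings accept arbitrary expressions and format specifiers
per_hutch = f"{rabbit_count / 3:.2f} rabbits per hutch"
print(per_hutch)        # 1.67 rabbits per hutch
```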

In [3]:
# This is an example of a comment

# Create some variables
# Examples using f-strings
# The literal 10 is inserted into the string
x = f"There are {10} types of people."
binary = "binary"
do_not = "don't"
# The values of binary and do_not are inserted when this line runs
y = f"Those who know {binary} and those who {do_not}."

# Print x and y, which are ordinary strings by now
print(x)
print(y)


# Examples using str.format()
speaker = "I"
# Positional placeholders: insert speaker, then x, in order
print("{} said: {}.".format(speaker, x))
# Named placeholders: insert speaker and y by keyword
print("{speaker} also said: '{str_to_insert}'.".format(speaker=speaker, str_to_insert=y))

hilarious = False
joke_evaluation = "Isn't that joke so funny?! {joke_is_funny}"

# Print "Isn't that joke so funny?!" followed by the value of hilarious, which is False
print(joke_evaluation.format(joke_is_funny=hilarious))

w = "This is the left side of..."
e = "a string with a right side."

# Print the concatenation of the strings w and e
print(w + e)

owner_to_rabbit_count = {'Alice': 5, 'Bob': 10, 'Cinderella': 2}
rabbit_stats_template = '{name} has {rabbit_count} rabbits'

# Loop over the (owner, rabbit count) pairs in owner_to_rabbit_count and print how many rabbits each owner has
for owner_name, rabbits in owner_to_rabbit_count.items():
    print(rabbit_stats_template.format(name=owner_name, rabbit_count=rabbits))

There are 10 types of people.
Those who know binary and those who don't.
I said: There are 10 types of people..
I also said: 'Those who know binary and those who don't.'.
Isn't that joke so funny?! False
This is the left side of...a string with a right side.
Alice has 5 rabbits
Bob has 10 rabbits
Cinderella has 2 rabbits


5

In [5]:
for owner_name, rabbits in owner_to_rabbit_count.items():
    output = f"{owner_name} has {rabbits} rabbits"
    print(output)
    

Alice has 5 rabbits
Bob has 10 rabbits
Cinderella has 2 rabbits


>**Ex. 1.1.8**: Why does `5 // 2 == 2` in Python 3? What does `5 / 2` give?

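`5 // 2 == 2` because `//` performs floor division: it divides and rounds the result down to the nearest integer. `5 / 2` performs true division, which always returns a float in Python 3, so it gives `2.5`. (In general, avoid comparing floats with `==`, since many decimal values cannot be represented exactly.)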
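
A few more cases make the difference between the two operators visible, including what happens with negative numbers. This is a minimal sketch; the variable names are made up for illustration.

```python
floor_quotient = 5 // 2      # floor division rounds down
true_quotient = 5 / 2        # true division always returns a float

print(floor_quotient)   # 2
print(true_quotient)    # 2.5
print(4 / 2)            # 2.0, still a float even when the result is whole
print(-5 // 2)          # -3, floors toward negative infinity, not toward zero
print(5 % 2)            # 1, the remainder that goes with floor division
print(divmod(5, 2))     # (2, 1), quotient and remainder in one call
```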

>**Ex. 1.1.9**: Explain the point of using `try` and `except` statements. Write some code that shows how to use these.
>
> *Hint: You will do a lot of Googling in this course. If you don't already know how to use `try` and `except`, start Googling now.*

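In [22]:
# try and except statements exist to catch exceptions and allow the program
# to continue or do something else when an exception occurs
try:
    print(XYZ)  # XYZ is undefined, so this raises a NameError
except NameError:
    # Catch only the expected exception rather than using a bare except
    print("Exception")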

Exception
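
A slightly fuller sketch of the same idea is shown below. It catches specific exception types and also uses the optional `else` and `finally` clauses. The helper `safe_divide` is a made-up name for illustration.

```python
def safe_divide(numerator, denominator):
    """Hypothetical helper: return the quotient, or None if it cannot be computed."""
    try:
        result = numerator / denominator
    except ZeroDivisionError:
        # Raised for division by zero
        print("Cannot divide by zero")
        return None
    except TypeError as err:
        # Raised when the operands do not support division
        print(f"Bad input: {err}")
        return None
    else:
        # Runs only if the try block raised nothing
        return result
    finally:
        # Runs in every case, even after a return
        print("Done")

print(safe_divide(6, 3))     # 2.0
print(safe_divide(1, 0))     # None
print(safe_divide("a", 2))   # None
```

Catching specific exceptions keeps unrelated bugs visible, whereas a bare `except:` would hide them.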


>**Ex. 1.1.10**: `dict`s and `defaultdict`s.
>1. What is a `defaultdict`? How would you say it is different from a normal Python `dict`?
>2. Write some code that takes a list of tuples:
>
>        l = [("a", 1), ("b", 3), ("a", None), ("c", False), ("b", True), ("a", None)]
>
>     And produces a `defaultdict` object
>
>        defaultdict(<class 'list'>, {'a': [1, None, None], 'c': [False], 'b': [3, True]})
>
>*Hint: you can import `defaultdict` from `collections`. Your code should be a for loop that loops over the tuples in `l` and updates an initially empty defaultdict, iteration after iteration.*

1. A `defaultdict` is a subclass of `dict` that takes a `default_factory` callable when it is created. When you look up a key that does not exist yet, it calls that factory (through the `__missing__` method), stores the result under the key, and returns it. A normal `dict` raises a `KeyError` instead.


In [33]:
from collections import defaultdict
l = [("a", 1), ("b", 3), ("a", None), ("c", False), ("b", True), ("a", None)]
my_dict = defaultdict(list)  # missing keys start out as an empty list
for key, value in l:
    my_dict[key].append(value)
print(my_dict)

defaultdict(<class 'list'>, {'a': [1, None, None], 'b': [3, True], 'c': [False]})
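
The default factory does not have to be `list`. The sketch below uses `defaultdict(int)` on a made-up word list to count word lengths, and shows the side effect that sets it apart from a plain `dict`: reading a missing key inserts it.

```python
from collections import defaultdict

# Made-up example: count word lengths without checking for missing keys.
words = ["winter", "is", "coming", "to", "us"]

length_counts = defaultdict(int)     # int() returns 0, the default for new keys
for word in words:
    length_counts[len(word)] += 1

print(dict(length_counts))           # {6: 2, 2: 3}

# Reading a missing key inserts it, which a plain dict would not do.
plain = {}
print(plain.get(99))                 # None, and `plain` stays empty
print(length_counts[99])             # 0, and the key 99 now exists
print(99 in length_counts)           # True
```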



>**Ex. 1.1.11**: Take a list `a = list("justreadtheinstructions")` and
>1. count the number of times each element occurs using `Counter`,
>2. report the two most common elements.
>
>*Hint: you can import `Counter` from `collections`. `Counter` has a method called `most_common` that you can use.*

In [36]:
from collections import Counter
a = list("justreadtheinstructions")
print(Counter(a))
print(Counter(a).most_common(2))

Counter({'t': 4, 's': 3, 'u': 2, 'r': 2, 'e': 2, 'i': 2, 'n': 2, 'j': 1, 'a': 1, 'd': 1, 'h': 1, 'c': 1, 'o': 1})
[('t', 4), ('s', 3)]
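
`Counter` has a few other conveniences worth knowing. The sketch below uses a made-up string to show that missing elements count as zero, and that a counter can be updated in place.

```python
from collections import Counter

letters = Counter("banana")          # made-up example string

print(letters.most_common(2))        # [('a', 3), ('n', 2)]
print(letters["z"])                  # 0, missing elements count as zero
print(sum(letters.values()))         # 6, the total number of characters

letters.update("nab")                # add more observations in place
print(letters["n"])                  # 3
```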


>**Ex. 1.1.12**: Take another list `b = list("ofcourseistillloveyou")` and
>1. get the `set` of characters that exist in both `a` and `b` (intersection),
>2. get the `set` of characters that exist in either `a` or `b` (union), and
>3. compute the [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) between the distinct elements in `a` and `b`.
>
>*Hint: use the `set` function to get a `set`-type object of distinct elements from a list. Sets support a [number of different operations](https://snakify.org/en/lessons/sets/#section_4).*

In [39]:
b = list("ofcourseistillloveyou")
inter = set(a) & set(b)
union = set(a) | set(b)  # `|` is set union; `or` would just return set(a)
print(sorted(inter))
print(sorted(union))
print(round(len(inter) / len(union), 3))

['c', 'e', 'i', 'o', 'r', 's', 't', 'u']
['a', 'c', 'd', 'e', 'f', 'h', 'i', 'j', 'l', 'n', 'o', 'r', 's', 't', 'u', 'v', 'y']
0.471
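
The Jaccard similarity is the size of the intersection divided by the size of the union. Below is a minimal sketch of a reusable function. The name `jaccard` and the handling of two empty inputs are assumptions made for illustration. The last two lines show why the `|` operator is needed for a union: `or` simply returns its first truthy operand.

```python
def jaccard(first, second):
    """Jaccard similarity of the distinct elements in two iterables."""
    set_first, set_second = set(first), set(second)
    union = set_first | set_second
    if not union:
        # Assumption: two empty inputs are treated as identical.
        return 1.0
    return len(set_first & set_second) / len(union)

print(jaccard("abc", "bcd"))   # 0.5, since 2 shared letters out of 4 distinct ones
print(jaccard("abc", "abc"))   # 1.0
print(jaccard("abc", "xyz"))   # 0.0

print({1, 2} or {3})           # {1, 2}, `or` is not a set operation
print({1, 2} | {3})            # {1, 2, 3}
```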


### Part 3: A little bit of real data

>**Ex. 1.2.1**: Learn about JSON by reading the **[wikipedia page](https://en.wikipedia.org/wiki/JSON)**. Then answer the following questions in the cell below. 
>
>1. What do the letters stand for?
>2. What is JSON?
>3. Why is JSON superior to XML? (... or why not?)

1. JavaScript Object Notation
2. JSON is a language-independent, text-based format for storing and exchanging data.
3. JSON and XML are both formats for storing and exchanging data, but they are structured differently: JSON represents data as nested objects (key-value pairs) and arrays, while XML wraps each element in opening and closing tags. JSON is therefore usually more compact and is often considered easier for humans to read. XML, on the other hand, supports features that JSON lacks, such as attributes, namespaces, comments and schema validation, which makes it a better fit for some document-oriented applications. The price is that XML is more verbose and more complex to parse.
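
To see how JSON maps onto Python objects, here is a small round trip with the standard `json` module. The `record` dictionary is made up for illustration.

```python
import json

# Made-up record; the keys and values are illustrative only.
record = {"title": "Winter is coming", "score": 1999, "spoiler": False,
          "flair": None, "tags": ["got", "s01"]}

text = json.dumps(record)        # Python object -> JSON string
print(text)
# {"title": "Winter is coming", "score": 1999, "spoiler": false, "flair": null, "tags": ["got", "s01"]}

restored = json.loads(text)      # JSON string -> Python object
print(restored == record)        # True
print(type(restored["tags"]))    # <class 'list'>
```

Note how Python's `False` and `None` become JSON's `false` and `null`, and back again.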

>**Ex. 1.2.2**: Working with JSON files
>1. Use [`requests`](https://www.google.dk/search?q=python+requests+get+json&gws_rd=cr&ei=M5OdWaewD8Ti6AS54J24Bg), or another Python module, to store **[this data](https://www.reddit.com/r/gameofthrones/.json)** in a new variable `data`. You may have to pass a User-agent argument in the header of your request to avoid HTTP 429. In the requests module, this can be done by including <code>headers = {'User-agent': 'whatever-you-like'}</code> as a keyword argument in the function call.
>2. Show that `data` is a `dict` type object.

In [2]:
import requests
headers = {'User-agent': 'test'}
# Keep the response and the parsed JSON in separate variables
response = requests.get('https://www.reddit.com/r/gameofthrones/.json', headers=headers)
data = response.json()
type(data)

dict
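
If `requests` is not installed, the standard library can do a similar job. The sketch below is one possible alternative, not the method used above: the helper names `build_request` and `fetch_json` are made up, and the User-agent string is a placeholder. Calling `fetch_json` needs network access, and Reddit may still rate-limit or reject the request, so that call is left commented out.

```python
import json
import urllib.request

def build_request(url, user_agent="whatever-you-like"):
    """Hypothetical helper: attach a User-agent header, as the exercise suggests."""
    return urllib.request.Request(url, headers={"User-agent": user_agent})

def fetch_json(url, user_agent="whatever-you-like"):
    """Hypothetical helper: download `url` and parse the body as JSON (needs network)."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=10) as response:
        return json.load(response)

request = build_request("https://www.reddit.com/r/gameofthrones/.json")
print(request.get_header("User-agent"))   # whatever-you-like

# Uncomment to actually download the data:
# data = fetch_json("https://www.reddit.com/r/gameofthrones/.json")
# print(type(data))
```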

>**Ex. 1.2.3**: Let's try to inspect the data you retrieved. 
>
>1. Use the `json` module to print your data variable as a string with `indent=4`.
>2. Print the keys of `data`.
>
>*Hint: 1. Use the `json` function `dumps`. 2. Call `.keys()` on the variable.*

In [3]:
import json
print(json.dumps(data, indent=4))

{
    "kind": "Listing",
    "data": {
        "after": "t3_114a6bu",
        "dist": 25,
        "modhash": "",
        "geo_filter": null,
        "children": [
            {
                "kind": "t3",
                "data": {
                    "approved_at_utc": null,
                    "subreddit": "gameofthrones",
                    "selftext": "",
                    "author_fullname": "t2_un3wjsct",
                    "saved": false,
                    "mod_reason_title": null,
                    "gilded": 0,
                    "clicked": false,
                    "title": "Team \"I don't want it\"",
                    "link_flair_richtext": [],
                    "subreddit_name_prefixed": "r/gameofthrones",
                    "hidden": false,
                    "pwls": 6,
                    "link_flair_css_class": null,
                    "downs": 0,
                    "thumbnail_height": 140,
                    "top_awarded_type": null,
                    "hide_score": false,
                    "name": "t3_113sqab",
                    "quarantine": false,
                    "link_flair_text_color": "dark",
                    "upvote_ratio": 0.96,
                    "author_flair_background_color": "",
                    "subreddit_type": "public",
                    "ups": 1999,
                    "total_awards_received": 0,
                    "media_embed": {},
                    "

In [4]:
data.keys()

dict_keys(['kind', 'data'])

>**Ex. 1.2.4**: The URL reveals that the data is from reddit/r/gameofthrones, but can you recover that information from the data? Give your answer by 'keying' into the dictionary using square brackets.
>
>*Hint: 'Keying' is a word I just made up. By it, I mean the following. Consider a nested dictionary like:*
>
>        my_json_obj = {
>            'cats': {
>                'awesome': ['Missy'],
>                'useless': ['Kim', 'Frank', 'Sandy']
>            },
>            'dogs': {
>                'awesome': ['Finn', 'Dolores', 'Fido', 'Casper'],
>                'useless': []
>            }
>        }
>
>*I can get the list of useless cats by keying into `my_json_obj` like so:*
>
>        >>> my_json_obj['cats']['useless']
>        Out [ ]: ['Kim', 'Frank', 'Sandy']
>
>*`my_json_obj['cats']` returns the dictionary `{'awesome': ['Missy'], 'useless': ['Kim', 'Frank', 'Sandy']}`, and getting `'useless'` from that gives us `['Kim', 'Frank', 'Sandy']`. If any of those list items were a list or a dictionary themselves, we could have kept keying deeper into the structure.*
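
Keying raises a `KeyError` (or `IndexError`) as soon as one level is missing, which happens often with real-world JSON. Below is a minimal sketch of a tolerant lookup. The helper `dig` is a made-up name, not a built-in.

```python
my_json_obj = {
    'cats': {'awesome': ['Missy'], 'useless': ['Kim', 'Frank', 'Sandy']},
    'dogs': {'awesome': ['Finn', 'Dolores', 'Fido', 'Casper'], 'useless': []},
}

def dig(obj, *keys, default=None):
    """Hypothetical helper: follow `keys` into nested dicts/lists, or return `default`."""
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj

print(my_json_obj['cats']['useless'][0])       # Kim
print(dig(my_json_obj, 'dogs', 'awesome', 1))  # Dolores
print(dig(my_json_obj, 'birds', 'awesome'))    # None
```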

In [5]:
data['data']['children'][0]["data"]['subreddit']

'gameofthrones'

>**Ex. 1.2.5**: Write two `for` loops (or list comprehensions) which:
>1. Count the number of spoilers.
>2. Print only the headlines that aren't spoilers.

In [6]:
spoiler_count = 0
for post in data['data']['children']:
    if post['data']['spoiler']:
        spoiler_count+=1
print("Number of spoilers:", spoiler_count)

Number of spoilers: 4
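
The same tasks can be written with comprehensions. The sketch below runs on a made-up list `posts` that only mimics the shape of `data['data']['children']`, so it works without downloading anything.

```python
# Made-up posts that mimic the shape of data['data']['children'].
posts = [
    {'data': {'title': 'Favorite castle?', 'spoiler': False}},
    {'data': {'title': 'That finale twist', 'spoiler': True}},
    {'data': {'title': 'Best house words', 'spoiler': False}},
]

# True counts as 1 and False as 0, so sum() counts the spoilers.
spoiler_count = sum(post['data']['spoiler'] for post in posts)

# A comprehension with a filter collects the remaining titles.
safe_titles = [post['data']['title'] for post in posts if not post['data']['spoiler']]

print(spoiler_count)   # 1
print(safe_titles)     # ['Favorite castle?', 'Best house words']
```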


In [7]:
for post in data['data']['children']:
    if not post['data']['spoiler']:
        print(post['data']['title'])

Team "I don't want it"
Ai writes new game of thrones ending since the one we got sucked
Did a quick sketch of how Tyrion is supposed to look like in the books
[NO SPOILERS] Currently rewatching, so I had to paint my favourite character.
I found the best wallpaper
What is your favorite Castle in GOT?
[No Spoilers] An Anime of Ice and Fire - Jaime & Cersei Lannister
Robb and Daemon
An interesting similarity found
What was Little Finger’s ultimate goal?
Watching Season 7 Ep 7 - Bronn gets all the best lines
16 year olds in Westeros
What is your favorite throne from the show?
Favorite House Words and Sigil?
Who do you think was a better king - Robert Baratheon or Viserys Targaryen?
Lady Mormont had the biggest balls in a hall filled with leaders of minor houses and Jon Snow
season 4 could have been the last and I would have been fine with it.
Do You Think Grey Worm Could Be the Antagonist of the Jon Snow Series?
in storm's end when Borros Baratheon asked Lucerys if him or Jacaerys 