> **DO NOT EDIT IF INSIDE `computational_analysis_of_big_data_2017` folder** 

# Week 2

*Thursday, August 31, 2017*

## Outline

This week's exercises build on what you have read (or will read) for today in the book we use for this course. We cover:
* The introduction to the book
* Useful Python functionality
* Some more advanced Reddit scraping and visualization of that data

This is the last week where we will be doing Python for the sake of learning Python. If you find it very difficult, read the chapter in the book carefully. Alternatively, you can go through [Codecademy's Python course](https://www.codecademy.com/learn/learn-python).

**A word of advice**: Some of you may be new to solving problems using code. At this point you may be wondering what level of detail I expect from your solutions. This is the guideline: Solve the exercises in a manner that allows you to—later in life—use them as examples. This also means that you should add code comments when the code isn't self-explanatory or if you're afraid it won't make sense when you look at it with fresh eyes. You may also want to comment on your output in plain text to capture the conclusions you arrive at throughout your analysis. But express yourself succinctly... Or to quote good old Einstein: *"Make everything as simple as possible, but not simpler"*. When you optimize for your own future comprehension, you also optimize for mine (and your peers').

## Material

*Data Science from Scratch*, Chapters 1, 2, and 3.
* *Chapter 1 - Introduction:* Read this. It introduces data science very nicely and sets the stage for the book.
* *Chapter 2 - A Crash Course in Python:* Study pages 15-26, and maybe take a dip into "The Not-So-Basics" if you feel like challenging yourself. If you are a skilled Python programmer at this point you can dash through a lot of this and take a closer look when things look foreign.
* *Chapter 3 - Visualizing Data:* This is a very short chapter that gives a few examples of how to use `matplotlib`, which is the Python library we'll be using for data visualization. Read the introduction, take note of what's in there, and use it as a reference when you need it.

## Exercises

### Part 1: Introduction (DSFS Chapter 1)

>**Ex. 2.1.1**: The VP of Networking tasks you with finding the key connectors in the company.
1. Which of your colleagues has the greatest degree centrality? What is the value?
2. Who has the lowest? What is that value?
3. Skip ahead to chapter 21. Who has the highest betweenness centrality, and why?

>**Ex. 2.1.2**: The VP of Public Relations asks you to produce some fun fact about how much data scientists earn. She gives you a datasheet which pairs tenure with yearly salary.
1. Why is it useless to aggregate salary for each tenure? How does bucketing (also: *histogramming*) help?
2. Joel hints at a fundamental problem with the bucketing approach when he writes "...we chose the buckets in a pretty arbitrary way." What could that problem be?
3. Can you give an example of a method that could be used for predicting the salary effect of having an additional year of experience?
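
To make the idea of bucketing concrete, here is a minimal sketch of one hypothetical way to group salaries by tenure range and average them. The data, the `tenure_bucket` helper, and the cut points are all made up for illustration; they are not taken from the book's datasheet.

```python
from collections import defaultdict

# Made-up (salary, tenure in years) pairs, for illustration only.
salaries_and_tenures = [(48000, 0.7), (60000, 2.5), (63000, 4.2),
                        (83000, 6.5), (88000, 8.1)]

def tenure_bucket(tenure):
    """Map a tenure to a coarse label; the cut points are arbitrary."""
    if tenure < 2:
        return "less than two"
    elif tenure < 5:
        return "between two and five"
    return "more than five"

# Group the salaries by bucket label.
salaries_by_bucket = defaultdict(list)
for salary, tenure in salaries_and_tenures:
    salaries_by_bucket[tenure_bucket(tenure)].append(salary)

# Average salary per bucket (float() keeps this correct in Python 2.7 too).
average_by_bucket = {
    bucket: sum(values) / float(len(values))
    for bucket, values in salaries_by_bucket.items()
}
```

Moving the cut points (say, from 2 and 5 to 3 and 6) changes which salaries are averaged together, which is worth keeping in mind for question 2.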


### Part 2: A Crash Course in Python (DSFS Chapter 2)

>**Ex. 2.2.1**: Which is better:
1. Simple or complex?
2. Flat or nested?
3. Sparse or dense?

>*Hint: find the Zen within*

>**Ex. 2.2.2**: Why does `5 / 2` give `2` in Python 2.7?

>**Ex. 2.2.3**: What is the point of using `try` and `except`? Write some code that shows how to use these.

>**Ex. 2.2.4**: About `defaultdict`s:
1. What is a `defaultdict`? How would you say it is different from a normal Python `dict`?
2. Write some code that takes a list of tuples:

>        l = [("a", 1), ("b", 3), ("a", None), ("c", False), ("b", True), ("a", None)]

>     And produces a `defaultdict` object

>        defaultdict(<type 'list'>, {'a': [1, None, None], 'c': [False], 'b': [3, True]})

>*Hint: you can import `defaultdict` from `collections`*

**Ans. 2.2.4.1**: `defaultdict`s can be used just like normal Python `dict`s. The important difference is that you instantiate one with a default factory, a callable such as `list` or `int`. From the documentation (read with `help(defaultdict)`) I understand that when you look up a key which does not yet exist, the default factory is called with no arguments, and its return value is stored under that key and returned. This allows you to do things like `my_dict[new_key].append(some_value)`, which would have raised a `KeyError` had `my_dict` been a plain `dict`.
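
The contrast described above can be sketched in a few lines. The variable names (`plain`, `grouped`, `missing_key`) are only illustrative.

```python
from collections import defaultdict

plain = {}
try:
    plain["a"].append(1)       # "a" does not exist yet, so the lookup fails
except KeyError as err:
    missing_key = err.args[0]  # the key that could not be found

grouped = defaultdict(list)    # list is the default factory
grouped["a"].append(1)         # missing key: list() is called, stored, then appended to
```

After this runs, `plain` is still empty, while `grouped` holds the key `"a"` with the value `[1]`.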

**Ans. 2.2.4.2**:

    from collections import defaultdict

    l = [("a", 1), ("b", 3), ("a", None), ("c", False), ("b", True), ("a", None)]

    my_dict = defaultdict(list)     # Create the defaultdict
    for key, value in l:            # Loop over the pairs inside the list
        my_dict[key].append(value)  # Append the value to the object that the key creates/returns

    # Print the result
    my_dict

Output:

    defaultdict(list, {'a': [1, None, None], 'b': [3, True], 'c': [False]})

>**Ex. 2.2.5**: Take a list `a = list("justreadtheinstructions")` and
1. count the number of times each element occurs using `Counter`,
2. report the two most common elements

>*Hint: you can import `Counter` from `collections`*
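
As a rough illustration of the `Counter` interface, here is a sketch on a different, made-up input (the word `"banana"` rather than the exercise's list).

```python
from collections import Counter

letters = list("banana")         # hypothetical input, not the exercise's list
counts = Counter(letters)        # maps each element to its number of occurrences
top_two = counts.most_common(2)  # list of (element, count) pairs, most frequent first
```

Here `counts["a"]` is `3`, and `top_two` holds the pairs for `"a"` and `"n"`.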

>**Ex. 2.2.6**: Take another list `b = list("ofcourseistillloveyou")` and
1. get the `set` of characters that exist in both `a` and `b` (intersection),
2. get the `set` of characters that exist in either `a` or `b` (union), and
3. compute the [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) between the distinct elements in `a` and `b`.

>*Hint: use the `set` function to get a `set`-type object of distinct elements from a list*
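
The set operations involved can be sketched on two made-up words. The `jaccard` helper is a hypothetical name, and returning `0.0` for two empty inputs is an assumption of this sketch, since the similarity is undefined in that case.

```python
def jaccard(first, second):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(first), set(second)
    union = a | b
    if not union:  # both inputs empty: undefined, so return 0.0 by convention (assumption)
        return 0.0
    return len(a & b) / float(len(union))  # float() avoids integer division in Python 2.7

shared = set("night") & set("nacht")    # characters present in both words
either = set("night") | set("nacht")    # characters present in at least one word
similarity = jaccard("night", "nacht")  # 3 shared characters out of 7 distinct ones
```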

### Part 3: Visualization (DSFS Chapter 3)

>**Ex. 2.3.1**: Create two lists, `x` and `y`, that each contain 10 numbers of your liking. Using `matplotlib`'s `scatter` function, plot these two lists against each other. Give your figure x and y axis labels and a title.

>*Hint: To get figures to display inside the notebook, use the Jupyter magic `%matplotlib inline`* <br>
>***Info:*** *From now on, unless otherwise stated, you should always label your axes and title your figure appropriately.*

>**Ex. 2.3.2**: Plot the score versus number of comments for posts on the `gameofthrones` and `news` subreddits.
1. The coding part
    * Write a function that takes as input the name of a subreddit and returns the data on the subreddit as a json object.
    * Write another function that takes as input some Reddit data, extracts the scores and number of comments into separate lists and returns both lists.
    * Using these functions, get a set of x and y variables for each subreddit.
    * In two separate figures, floating side by side, scatter plot each set of x and y variables against each other. Choose a different color for the points in each plot.
2. The reflecting part
    * The News and GOT trends look distinctly different. Explain how they look different. Why might this be?

>My figure looks like [this](http://ulfaslak.com/computational_analysis_of_big_data/exer_figures/example_2.2b.png).
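
For the extraction step, a minimal sketch might look as follows. It assumes the listing is shaped like the JSON Reddit returns for a subreddit, with posts under `data` → `children` and each post carrying `score` and `num_comments`. The function name and the sample values are made up, and the sample is hand-written so the code runs without a network request.

```python
def extract_scores_and_comments(listing):
    """Return (scores, comment_counts) from a Reddit-style listing dict."""
    posts = [child["data"] for child in listing["data"]["children"]]
    scores = [post["score"] for post in posts]
    comment_counts = [post["num_comments"] for post in posts]
    return scores, comment_counts

# Hypothetical sample shaped like a subreddit listing; the values are invented.
sample = {
    "data": {
        "after": "t3_example",
        "children": [
            {"data": {"score": 120, "num_comments": 14}},
            {"data": {"score": 7, "num_comments": 0}},
        ],
    }
}

scores, comment_counts = extract_scores_and_comments(sample)
```

Returning two parallel lists keeps the result ready to pass straight to a scatter plot as x and y.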

>**Ex. 2.3.3**: Looking at the scatter plots there appears to be some unevenness in the number of comments and upvotes that different posts receive.
1. Plot the distributions of the x variable from Ex. 2.3.2 for GOT and News as histograms, side by side.
2. What do these distributions say about how people comment on Reddit?

>My figure looks like [this](http://ulfaslak.com/computational_analysis_of_big_data/exer_figures/example_2.2c.png).

>**Ex. 2.3.4**: You may have noticed that the `data['data']` object has a key called `'after'`.
1. What do you think this is?
2. Write a function that takes an integer `N` and the name of a subreddit, and returns a JSON with all posts on the first `N` pages of that subreddit. Use it to retrieve a large number of posts.
3. Make an updated version of the figures you produced in Ex. 2.3.2-3 with this larger dataset.
4. Visualize the number of posts over time.
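
One possible shape for the paging logic is sketched below, under the assumption that `'after'` is a token identifying where the next page starts and that it is `None` on the last page. The `fetch_page` argument is a hypothetical callable standing in for the actual HTTP request, and `fake_fetch` is a made-up stub so the sketch runs offline. Note that it returns a plain list of posts rather than a single JSON object.

```python
def collect_pages(fetch_page, subreddit, n_pages):
    """Collect posts from up to n_pages pages by following the 'after' token."""
    posts, after = [], None
    for _ in range(n_pages):
        page = fetch_page(subreddit, after)  # hypothetical callable returning one listing dict
        posts.extend(page["data"]["children"])
        after = page["data"]["after"]
        if after is None:                    # no further pages to request
            break
    return posts

def fake_fetch(subreddit, after):
    """Stand-in for a real request: serves two invented pages keyed by the 'after' token."""
    pages = {
        None: {"data": {"children": [{"data": {"id": "p1"}}], "after": "t3_p1"}},
        "t3_p1": {"data": {"children": [{"data": {"id": "p2"}}], "after": None}},
    }
    return pages[after]

all_posts = collect_pages(fake_fetch, "news", 5)
```

In a real version, `fetch_page` would presumably pass the token along with the request for the next page; passing the fetcher in as an argument simply makes the loop easy to test.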