# Python Fundamentals for Data Science: Midterm Exam
### Summer 2016

## Instructions
The midterm exam is designed to evaluate your grasp of Python theory as well as Python coding.

- This is an individual exam.
- You have 24 hours to complete the exam, starting from the point at which you first access it.

For the coding questions, follow best practices.  Partial credit may be given for submissions that are clearly commented and well organized.

# Short Questions (10 Points)

- Python's dynamic typing allows one variable to refer to multiple types of objects within the same program execution.  This can speed program development.  Name one disadvantage of dynamic typing.

**[Type Answer Here]**

You may accidentally overwrite a variable's value and type, because Python gives no error or warning when you do so. For example, you first assign x = 6; later, forgetting that x is already in use, you assign another object to it with x = input("Please enter your name: "). When you then use x in a numeric calculation later in the program, it fails at run time with an error like 'TypeError: Can't convert 'int' object to str implicitly'.
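
As a small illustration of this pitfall, the sketch below rebinds x from an int to a str and then attempts arithmetic on it. A fixed string stands in for the call to input() (an assumption made here so the snippet runs without keyboard input).

```python
# x starts out as a number
x = 6

# Later, x is reused for a name; a fixed string stands in for input()
x = "Alice"

try:
    total = x + 10  # arithmetic on what we still believe is a number
    failed = False
except TypeError as err:
    # Python only reports the problem at run time, on the line above
    failed = True
    print("TypeError:", err)
```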

- Compiled languages typically offer faster performance than interpreted languages.  Why would you choose an interpreted language like Python for the purpose of analyzing a data set?

**[Type Answer Here]**

Compiled languages such as Java or C/C++ are relatively low-level. Although they run faster, the developer has to write more code and worry about the underlying hardware. Interpreted languages, on the other hand, are high-level and hide those details from developers, which means that we don't need to care about where objects are physically stored and we can develop faster. By choosing Python or another interpreted language for analyzing a data set, we can focus on the analysis and modeling instead of worrying about hardware details.

* We have gone over FOR and WHILE loops.  Discuss one reason to use a for loop over a while loop and one reason to do the opposite. Please elaborate beyond a single word.

**[Type Answer Here]**
1. A for loop is easier to read and does not require incrementing a counter by hand. It is the better choice when the number of iterations is known in advance and the loop condition is simple, such as looping through the items in a list.

    myList=['Mon','Tue','Wed','Thurs','Fri','Sat','Sun']
    
    for items in myList:
        
        print(items)
    
2. On the other hand, if you don't know the number of iterations in advance or the loop condition is complex, a while loop gives you more flexibility. Finding a square root is one example: we know the stopping condition but not how many iterations it will take. In that case, we use a while loop in the following form.

    ans = 0
    
    step =1
    
    while ans * ans <= x:
    
        ans += step
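
A runnable version of the loop above might look like the following sketch. The function name integer_sqrt, and the choice to return the largest integer whose square does not exceed x, are assumptions made for illustration.

```python
def integer_sqrt(x):
    """Return the largest integer ans such that ans * ans <= x (x >= 0)."""
    ans = 0
    step = 1
    # We know when to stop, but not how many iterations it will take
    while (ans + step) * (ans + step) <= x:
        ans += step
    return ans

print(integer_sqrt(16))
print(integer_sqrt(20))
```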
 

# Programming Styles (10 Points)

We have taught you a number of ways to program in Python. These have included using scripts versus functions and using jupyter notebooks versus calling .py files from the command line. Describe a scenario in which you would use a script versus a function and the opposite. Then describe a scenario in which you would use .py files from the command line over a jupyter notebook and the opposite. There are four cases and each answer should be about 1-2 sentences.

 * I would use a jupyter notebook over .py files:



* I would use .py files over a notebook:




* I would use a script over a function:





* I would use a function over a script:





**[Type Answer Here]**
* I would use a jupyter notebook over .py files:
  
  If I want to test some code or run short, simple scripts, I would prefer to do it in a notebook. For example, I could use it to try out examples of list mutability and the difference between copy and deepcopy.


* I would use .py files over a notebook:

  On the other hand, if I am developing a complex project with long scripts, I would use .py files for the final project. I can then run the whole project quickly from clean scripts in .py files instead of from multiple jupyter cells or one messy cell.


* I would use a script over a function:
  If I don't need to reuse the functionality, or the code is easy to update, then I don't need to create a function and call it. For example, if I only need to read the raw data once, I would just use a script rather than a function.

* I would use a function over a script:

  If I need to reuse some code several times, I would create a function and call it from different parts of my main script. This way, I can hide the details and avoid repetition in the main script. For example, when building revenue forecast models, I need to pull historical data several times: for training the model, for the ensemble model, and for predicting different years. So I created a function that pulls the historical data, with Year and other attributes (such as Product) as parameters.
  
  Another 

# Dictionaries vs Lists (10 Points)

Suppose we have the following problem.  We have a dictionary of names and ages and we would like to find the oldest person.

```
ages_dict = {'Bill':34, 'Fred':45, 'Alice':14, 'Betty':17}
```

### Dictionary approach
Here is a loop that shows how to iterate through the dictionary:

In [4]:
ages_dict = {'Bill':34, 'Fred':45, 'Alice':14, 'Betty':17}

max_age = 0
max_name = ''
for name,age in ages_dict.items():
    if age > max_age:
        max_age = age
        max_name = name
        
print(max_name, "is the oldest")    

Fred is the oldest


### List approach 

Your friend comes to you and says that this dictionary is hard to deal with and instead offers a different plan with 2 lists.

```
names = ['Bill', 'Fred', 'Alice', 'Betty']
ages = [34, 45, 14, 17]
```

Instead of using a loop, your friend writes this code to find the oldest person.

In [5]:
names = ['Bill', 'Fred', 'Alice', 'Betty']
ages = [34, 45, 14, 17]

max_age = max(ages)
index_max = ages.index(max_age)
max_name = names[index_max]

print(max_name, 'is the oldest')

Fred is the oldest


### Discussion

Discuss the advantages and disadvantages of each of the approaches.  
* Is one more efficient (i.e. faster) than the other?
* Why would you prefer the dictionary?
* Why would you prefer the 2 lists?

**[Type Answer Here]**

I don't think one is more efficient than the other. Ultimately, both methods visit every item in the list or dictionary and compare each one with a running maximum, so both take time proportional to the number of people.

I would prefer the dictionary. The two lists do not guarantee that names and ages stay matched. For example, if the order changes in one list only, 34 could become Fred's age, which is incorrect. A dictionary keeps each name paired with its age, and because keys must be immutable objects, a name cannot be modified in place by accident.
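
For comparison, the dictionary can also be searched without an explicit loop. This sketch relies on the built-in max with a key function, and on zip for the two-list case; it is one possible alternative, not part of the original exam code.

```python
ages_dict = {'Bill': 34, 'Fred': 45, 'Alice': 14, 'Betty': 17}

# max iterates over the keys and ranks each key by its age
max_name = max(ages_dict, key=ages_dict.get)
print(max_name, "is the oldest")

# With two parallel lists, zip pairs each age with its name first
names = ['Bill', 'Fred', 'Alice', 'Betty']
ages = [34, 45, 14, 17]
max_age, max_name_from_lists = max(zip(ages, names))
print(max_name_from_lists, "is the oldest")
```

Both variants still make a single pass over the data, so neither changes the efficiency argument.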

# Mutability Surprises (20 Points)

In the asynchronous sessions, we discussed mutability. Please describe, in your own words, why mutability is a useful feature in Python lists and dictionaries.

**[Type Answer Here]**

Mutability means that you can change an object in place. A list is mutable: you can change the value of its items, append new items, or pop items off. You can therefore create a single list variable and change its contents as the program runs. Dictionaries are similar to lists in this respect; although keys must be immutable, values can be of any type, including lists.

A list does not contain the objects themselves but holds references to them. When several variables refer to the same list, you can change an item in one place and the other variables see the updated value as well.

Mutability can also lead to unexpected behavior - specifically when multiple variables point to the same object or when mutable objects are nested inside each other. 

Please write some code demonstrating a situation where mutability could lead to unexpected behavior. Specifically, show how updating just one object (list_a) can change the value when you print a second variable (list_b).

In [17]:
# list_a contains my name before marriage
list_a = ['Cynthia', 'Hu']
print('Before updating list_b, list_a has value: ', list_a)

# after marriage, I still use the same name
list_b = list_a

# one day, my husband asked me to change my last name to his.
list_b[1] = 'Gu'

# however, I didn't mean to change the list for my old name
print('After updating list_b, list_a has value:', list_a)
print('After updating list_b, list_b has value:', list_b)

Before updating list_b, list_a has value:  ['Cynthia', 'Hu']
After updating list_b, list_a has value: ['Cynthia', 'Gu']
After updating list_b, list_b has value: ['Cynthia', 'Gu']


Show how "copy" or "deepcopy" could be used to prevent the unexpected problem you described above.

In [16]:
list_a = ['Cynthia', 'Hu']
print('Before updating list_b, list_a has value: ', list_a)

# use copy: list_b and list_a are different lists now
list_b = list_a.copy()

# one day, my husband asked me to change my last name to his.
list_b[1] = 'Gu'

# however, I didn't mean to change the list for my old name
print('After updating list_b, list_a has value:', list_a)
print('After updating list_b, list_b has value:', list_b)

Before updating list_b, list_a has value:  ['Cynthia', 'Hu']
After updating list_b, list_a has value: ['Cynthia', 'Hu']
After updating list_b, list_b has value: ['Cynthia', 'Gu']


Now, show the same problem using two dictionaries. Again show how "copy" or "deepcopy" can fix the issue.

In [18]:
dict_a = {'Cynthia': 35, 'Roger': 36, 'Nolan': 2}
print('Before updating dict_b, dict_a has value: ', dict_a)
dict_b = dict_a
dict_b['Nolan'] = 1.5
print('After updating dict_b, dict_a has value: ', dict_a)
print('After updating dict_b, dict_b has value: ', dict_b)

Before updating dict_b, dict_a has value:  {'Cynthia': 35, 'Nolan': 2, 'Roger': 36}
After updating dict_b, dict_a has value:  {'Cynthia': 35, 'Nolan': 1.5, 'Roger': 36}
After updating dict_b, dict_b has value:  {'Cynthia': 35, 'Nolan': 1.5, 'Roger': 36}


In [19]:
# using copy to fix the issues with the dictionaries above
# as I don't want to change dict_a
dict_a = {'Cynthia': 35, 'Roger': 36, 'Nolan': 2}
print('Before updating dict_b, dict_a has value: ', dict_a)

dict_b = dict_a.copy()
dict_b['Nolan'] = 1.5

print('After updating dict_b, dict_a has value: ', dict_a)
print('After updating dict_b, dict_b has value: ', dict_b)

Before updating dict_b, dict_a has value:  {'Cynthia': 35, 'Nolan': 2, 'Roger': 36}
After updating dict_b, dict_a has value:  {'Cynthia': 35, 'Nolan': 2, 'Roger': 36}
After updating dict_b, dict_b has value:  {'Cynthia': 35, 'Nolan': 1.5, 'Roger': 36}
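
The copy method used above makes a shallow copy, which is enough when the values are numbers or strings. When the values are themselves mutable, the copies still share them. The sketch below uses a hypothetical dictionary of lists (not from the exam data) to show the difference between copy.copy and copy.deepcopy.

```python
import copy

# hypothetical data: each value is a mutable list
scores_a = {'Cynthia': [90, 85], 'Roger': [70, 75]}

shallow = copy.copy(scores_a)      # new dict, same inner lists
deep = copy.deepcopy(scores_a)     # new dict and new inner lists

shallow['Cynthia'].append(100)     # also changes scores_a['Cynthia']
deep['Roger'].append(0)            # leaves scores_a['Roger'] untouched

print(scores_a)
print(shallow)
print(deep)
```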


Can this unexpected behavior problem occur with tuples? Why, or why not?

**[Type Answer Here]**

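This unexpected behavior cannot occur with a tuple itself, because tuples are immutable. A tuple object does not support item assignment, which means that once it is created you cannot change which items it holds. One caveat: if a tuple contains a mutable object such as a list, that inner object can still be changed through any variable that refers to it.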
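
The sketch below shows that item assignment on a tuple raises a TypeError. It also shows a related case worth knowing: a list stored inside a tuple is still mutable and can be changed through any variable that refers to the tuple. The variable names are chosen here for illustration.

```python
name_a = ('Cynthia', 'Hu')
name_b = name_a

try:
    name_b[1] = 'Gu'   # tuples do not support item assignment
    assignment_failed = False
except TypeError:
    assignment_failed = True
print("assignment failed:", assignment_failed)

# A tuple holding a mutable list: the tuple is fixed, the list is not
record_a = ('Cynthia', ['Hu'])
record_b = record_a
record_b[1].append('Gu')
print(record_a)
```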

# Tweet Analysis (20 Points)

A tweet is a string that is between 1 and 140 characters long (inclusive). A username is a string of letters and/or digits that is between 1 and 14 characters long (inclusive). A username is mentioned in a tweet by including @username in the tweet. A retweet is a way to share another user's tweet, and can be identified by the string RT, followed by the username of the user who originally tweeted it.

Your job is to fill in the function count_retweets_by_username so that it returns a frequency dictionary that indicates how many times each username was retweeted.

In [20]:
tweets = ["This is great! RT @fake_user: Can you believe this http://some-link.com",
         "It's the refs! RT @dubsfan: Boo the refs and stuff wargarbal",
         "That's right RT @ladodgers: The dodgers are destined to win the west!",
         "RT @sportball: That sporting event was not cool",
         "This is just a tweet about things @person, how could you",
         "RT @ladodgers: The season is looking great!",
         "RT @dubsfan: I can't believe it!",
         "I can't believe it either! RT @dubsfan: I can't believe it"]

In [50]:
def count_retweets_by_username(tweet_list):
    """ (list of tweets) -> dict of {username: int}
    Returns a dictionary in which each key is a username that was 
    retweeted in tweet_list and each value is the total number of times this 
    username was retweeted.
    """
    
    # write code here and update return statement with your dictionary
    user_list = []
    user_count = {}
    for item in tweet_list:  
        # find the starting position of the 'RT @' marker (-1 if absent)
        start = item.find('RT @')
        if start >= 0:
            # a username is at most 14 characters, so look at no more than
            # 14 characters after 'RT @' and keep the part before the ':'
            end = min(start + 18, len(item))
            user = item[start+4 : end].split(':')
            user_list.append(user[0])
    # get a unique list of username
    user_set = set(user_list)

    for user in user_set:
        user_count[user] = user_list.count(user)
    return user_count

In [51]:
# allow this code to work by implementing count_retweets_by_username function above
print(count_retweets_by_username(tweets))

{'ladodgers': 2, 'sportball': 1, 'fake_user': 1, 'dubsfan': 3}
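
An alternative sketch uses a regular expression and collections.Counter. It assumes, as the sample tweets suggest, that usernames may contain underscores in addition to letters and digits; the function name count_retweets_regex and the shortened tweet list are illustrative only.

```python
import re
from collections import Counter

# 'RT @' followed by 1 to 14 word characters (letters, digits, underscore)
RETWEET_PATTERN = re.compile(r'RT @(\w{1,14})')

def count_retweets_regex(tweet_list):
    """Return {username: number of tweets that retweet that username}."""
    counts = Counter()
    for tweet in tweet_list:
        match = RETWEET_PATTERN.search(tweet)  # first retweet marker only
        if match:
            counts[match.group(1)] += 1
    return dict(counts)

sample_tweets = [
    "This is great! RT @fake_user: Can you believe this http://some-link.com",
    "RT @dubsfan: I can't believe it!",
    "This is just a tweet about things @person, how could you",
    "I can't believe it either! RT @dubsfan: I can't believe it",
]
print(count_retweets_regex(sample_tweets))
```

Counting as the tweets are read avoids building an intermediate list and calling count once per user.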


# Looking for Minerals (30 Points)

A mining company conducts a survey of an n-by-n square grid of land.  Each row of land is numbered from 0 to n-1 where 0 is the top and n-1 is the bottom, and each column is also numbered from 0 to n-1 where 0 is the left and n-1 is the right.  The company wishes to record which squares of this grid contain mineral deposits.

The company decides to use a list of tuples to store the location of each deposit.  The first item in each tuple is the row of the deposit.  The second item is the column.  The third item is a non-negative number representing the size of the deposit, in tons.  For example, the following code defines a sample representation of a set of deposits in an 8-by-8 grid.

In [52]:
deposits = [(0,4,.3), (6, 2, 3), (3, 7, 2.2), (5, 5, .5), (3, 5, .8)]

Given a list of deposits like the one above, write a function to create a string representation for a rectangular subregion of the land.  Your function should take a list of deposits, then a set of parameters denoting the top, bottom, left, and right edges of the subgrid.  It should return a multi-line string in which grid squares without deposits are represented by "-" and grid squares with a deposit are represented by "X".

In [172]:
def display(deposits, top, bottom, left, right):
    """display a subgrid of the land, with rows starting at top and up to 
    but not including bottom, and columns starting at left and up to but
    not including right."""
    # assuming it's a 8*8 grid as n is not parameter of the function
    import numpy as np
    n=8
    o = '-'*(n**2)
    # initialize a string array for displaying
    ans=np.array(list(o),dtype=str).reshape(n,n)
    # initialize a numeric array for storing the deposits
    a = np.zeros((n,n))
    # update numeric array with deposits
    for t in deposits:
        a[t[0]][t[1]]=t[2]
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if a[i][j] >0:
                ans[i][j] = 'X'
    # extract subgrid
    ans_sub=ans[top:bottom,left:right]    
    return ans_sub

In [175]:
print(display(deposits, 0, 8, 0, 8))
print()
print(display(deposits, 5, 8, 5, 8))

[['-' '-' '-' '-' 'X' '-' '-' '-']
 ['-' '-' '-' '-' '-' '-' '-' '-']
 ['-' '-' '-' '-' '-' '-' '-' '-']
 ['-' '-' '-' '-' '-' 'X' '-' 'X']
 ['-' '-' '-' '-' '-' '-' '-' '-']
 ['-' '-' '-' '-' '-' 'X' '-' '-']
 ['-' '-' 'X' '-' '-' '-' '-' '-']
 ['-' '-' '-' '-' '-' '-' '-' '-']]

[['X' '-' '-']
 ['-' '-' '-']
 ['-' '-' '-']]


For example, your function should replicate the following behavior for the example grid:
```
print(display(deposits, 0, 8, 0, 8))
----X---
--------
--------
-----X-X
--------
-----X--
--X-----
-------X

print(display(deposits, 5, 8, 5, 8))
X--
---
--X

```
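
One way to produce the multi-line string that the task asks for, without NumPy, is sketched below. The name display_as_text is hypothetical, and the function works directly from the list of deposits, so it does not need to know the grid size n.

```python
def display_as_text(deposits, top, bottom, left, right):
    """Return the subgrid as text: 'X' for a deposit, '-' otherwise."""
    # collect the (row, column) positions that hold a deposit of positive size
    occupied = {(row, col) for row, col, tons in deposits if tons > 0}
    lines = []
    for row in range(top, bottom):
        line = ''.join('X' if (row, col) in occupied else '-'
                       for col in range(left, right))
        lines.append(line)
    return '\n'.join(lines)

deposits = [(0, 4, .3), (6, 2, 3), (3, 7, 2.2), (5, 5, .5), (3, 5, .8)]
print(display_as_text(deposits, 0, 8, 0, 8))
print()
print(display_as_text(deposits, 5, 8, 5, 8))
```

With the five deposits in this list, row 7 is empty, so the full grid ends with a line of eight dashes.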

Next, complete the following function to compute the total number of tons in a rectangular subregion of the grid.

In [176]:
def tons_inside(deposits, top, bottom, left, right):
    """Returns the total number of tons of deposits for which the row is at least top,
    but strictly less than bottom, and the column is at least left, but strictly
    less than right."""
    # Do not alter the function header.  
    # Just fill in the code so it returns the correct number of tons.
    # assuming it's a 8*8 grid as n is not parameter of the function
    import numpy as np
    n=8
    # initialize the grid with zeros
    a = np.zeros((n,n))
    # replace 0 with deposit number
    for t in deposits:
        a[t[0]][t[1]]=t[2]
    # create a new array for the subregion of the grid
    b=a[top:bottom,left:right]
    # return the total of deposits in the array
    return sum(sum(b))

In [177]:
print(tons_inside(deposits,0,8,0,8))
print(tons_inside(deposits,5,8,5,8))

6.8
0.5
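
Because each deposit already carries its row and column, the total can also be computed straight from the list. The sketch below (the name tons_inside_direct is hypothetical) avoids the fixed 8-by-8 assumption by never building the grid.

```python
def tons_inside_direct(deposits, top, bottom, left, right):
    """Sum the tons of deposits with top <= row < bottom and left <= col < right."""
    return sum(tons for row, col, tons in deposits
               if top <= row < bottom and left <= col < right)

deposits = [(0, 4, .3), (6, 2, 3), (3, 7, 2.2), (5, 5, .5), (3, 5, .8)]
print(round(tons_inside_direct(deposits, 0, 8, 0, 8), 1))
print(round(tons_inside_direct(deposits, 5, 8, 5, 8), 1))
```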


Next, write a function to return the square i-by-i grid subregion with the highest amount of deposits.  This is the subregion of size i-by-i with the most tons.  If there are multiple such regions with the maximum number of tons, your function should return any one of these regions.

In [178]:
def best_square(n, deposits, i):
    """Returns a tuple (top, left) representing the top row and left column of the 
    square i-by-i subgrid with the highest amount of deposits."""
    if i > n:
        raise Exception("Subregion grid is greater than the grid. Please enter i no larger than n.")
    if i <= 0:
        raise Exception("Please enter a positive integer.")
    # initialize the answers to 0
    max_tons = 0
    max_top, max_left = 0, 0
    
    # loop over every possible top-left corner of an i-by-i subgrid
    for top in range(n - i + 1):
        for left in range(n - i + 1):
            # call the tons_inside function to calculate the total deposits within the i-by-i subgrid
            temp = tons_inside(deposits, top, top + i, left, left + i)

            if temp > max_tons:
                max_tons = temp
                max_top = top
                max_left = left
    return (max_top, max_left)

In [179]:
a=best_square(8,deposits,3)
print(a)
print(a[0])

(3, 5)
3
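
The same search can be written more compactly with itertools.product and the built-in max. This is a sketch with hypothetical names (tons_in_window, best_square_compact); because max returns the first maximal item it encounters, ties are resolved in row-by-row order.

```python
from itertools import product

def tons_in_window(deposits, top, left, size):
    """Total tons inside the size-by-size window whose corner is (top, left)."""
    return sum(tons for row, col, tons in deposits
               if top <= row < top + size and left <= col < left + size)

def best_square_compact(n, deposits, i):
    """Return (top, left) of the i-by-i subgrid with the most tons."""
    if not 0 < i <= n:
        raise ValueError("i must be between 1 and n")
    # every valid top-left corner, in row-by-row order
    corners = product(range(n - i + 1), repeat=2)
    return max(corners,
               key=lambda corner: tons_in_window(deposits, corner[0], corner[1], i))

deposits = [(0, 4, .3), (6, 2, 3), (3, 7, 2.2), (5, 5, .5), (3, 5, .8)]
print(best_square_compact(8, deposits, 3))
print(best_square_compact(8, deposits, 4))
```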


Finally, write code to find the 4x4 subgrid of the sample grid above with the highest density of deposits and display that subgrid.

In [180]:
print(best_square(8,deposits,4))
start = best_square(8,deposits,4)
print(display(deposits, start[0], start[0] + 4, start[1], start[1] + 4))

(3, 2)
[['-' '-' '-' 'X']
 ['-' '-' '-' '-']
 ['-' '-' '-' 'X']
 ['X' '-' '-' '-']]


In [181]:
# print the subgrid with numbers below to confirm
import numpy as np
n=8
a = np.zeros((n,n))
for t in deposits:
    a[t[0]][t[1]]=t[2]
start = best_square(8,deposits,4)
b=a[start[0]: start[0] + 4, start[1]: start[1] + 4]
print(b)

[[ 0.   0.   0.   0.8]
 [ 0.   0.   0.   0. ]
 [ 0.   0.   0.   0.5]
 [ 3.   0.   0.   0. ]]
