## Programming for Analytics -- Homework Assignment 1
## "Belgium ATMs"

### SOLUTION KEY - ELABORATE VERSION

#### Name: Yufeng Huang
#### Email: yufeng.huang@simon.rochester.edu
#### Date: 7/30/2024


## 0. General description and preparation steps

This is the first homework assignment for GBA464 Programming for Analytics. Here, we will read and analyze a dataset on the number of ATMs in each market and some withdrawal measures. We have not learned to work with Pandas yet, so we will work with NumPy and a dataset in the form of a dictionary, where each element is a column name:column value pair. See Exercise 1 in the NumPy lecture notes for a similar data structure. 

To complete the assignment, you should:
1. Make sure Pandas and NumPy are installed correctly. We installed NumPy in the first week. In the same manner, install Pandas using `pip install pandas` (or `conda install pandas`) in the Terminal/Anaconda Prompt.
2. Follow the notebook and complete each question. Do not alter my code in Section 0 ("General description and preparation steps"). 
3. When your code is finished, clear all outputs and select "run all" to do a full run. Then check and confirm there is no error, and all outputs are as intended. 
4. Submit the finished notebook on Blackboard. Do this before the submission deadline. 

Grading is based on completion. So make a genuine programming attempt in completing each question. Do not "hard code" your answer. For example, the number of rows in the data is 659, and the number of nonmissing rows is 310. Do **not** code `nrows = 659` and `nrows_nonmissing = 310` and use them as answers. 

#### Import libraries

In [2]:
import pandas as pd
import numpy as np

#### Read data

We will first read the data. The data comes from a CSV file at a URL. We will use Pandas to read it, but we will not use Pandas after the conversion. So do not modify the first few lines of code, where I read the data, convert it into a dictionary, and delete the original data. 

In [23]:
# read df from Google Drive
df = pd.read_csv('https://drive.google.com/uc?id=1ouWC0rdnuKevZGiyT6w2YJgT25Wb2_oj&export=download')

# Note: if cannot access this link or it is too slow, download file "belgium_atm.csv" from Blackboard and replace the link with the file's location

# print df for view
df

Unnamed: 0,population,numATMs,ATMwithdr,withdrvalue,unemprate
0,3722,1,0.25542593,79.13402557,0.072868
1,7006,2,1.837865114,102.6663437,0.022695
2,4234,0,missing,missing,0.027397
3,6229,0,missing,missing,0.024402
4,10303,1,0.606253982,98.93833923,0.028438
...,...,...,...,...,...
654,601,0,missing,missing,0.021766
655,1028,0,missing,missing,0.021766
656,2033,0,missing,missing,0.021766
657,15521,2,0.698489904,110.1268387,0.023195


Then we convert the Pandas object `df` into a dictionary, which has the format column_name:column_value. Specifically, we make each column a NumPy `ndarray`. We call this dictionary `data`. 

In [24]:
# convert the data into a dictionary, storing each column as a NumPy array
data = {}
for col in df.columns:
    data[col] = np.array(df[col].tolist())

# we can also just run below, but we'll need to separately convert the columns into NumPy Arrays
# data = df.to_dict()

Next, we confirm `data` is what we wanted:
- It is a dictionary. 
- It has columns "population," "numATMs," "ATMwithdr," "withdrvalue," "unemprate."
- Each column is a NumPy `ndarray`.

In [25]:
# we can show the structure of data
print(type(data))       # it's a dict.
print(data.keys())      # it has the data columns population, numATMs, ATMwithdr, withdrvalue, and unemprate.
print(type(data["population"]))     # columns are ndarray.

<class 'dict'>
dict_keys(['population', 'numATMs', 'ATMwithdr', 'withdrvalue', 'unemprate'])
<class 'numpy.ndarray'>


In [6]:
# now, our data reading job is done, so we delete df and never touch it
del df

## 1. Understanding the data's structure
We first try to understand the dataset. Using what we have learned (basic Python functions, NumPy, and flow control if necessary), find out the following:
- [1a] How many rows does the dataset have? Keep in mind the number of rows is the length of each column. We often call each row in a dataset "an observation."
- [1b] How many columns are numeric -- meaning that their values are either integer or float? How many columns are strings? Which columns are strings? How many columns are booleans? Keep in mind that with NumPy each `ndarray` has a fixed type, and that is captured in the attribute `.dtype`.
- [1c] How many rows contain a string "missing" in any column? If we think "missing" represents missing values, how many rows have any missing values?

Keep in mind we will not use Pandas functions to answer these and the subsequent questions. Also, although we are working with dictionaries, we will still call each element a "column." 
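To illustrate the fixed-type point in [1b], here is a minimal sketch on made-up values (the arrays `numeric` and `mixed` are hypothetical and not part of the assignment data). It suggests why a column that contains the string "missing" ends up as a string column: NumPy stores one type per array, so numbers mixed with strings are all stored as strings.

```python
import numpy as np

# hypothetical toy columns, not the assignment data
numeric = np.array([0.25, 1.84])        # all floats
mixed = np.array([0.25, "missing"])     # a float mixed with a string

print(numeric.dtype)        # a float dtype
print(mixed.dtype)          # a Unicode string dtype; 0.25 is stored as the string '0.25'
print(mixed.dtype.kind)     # 'U' for Unicode strings
```

This is only an illustration of the `.dtype` attribute; the actual columns in `data` should be inspected the same way.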


In [26]:
# Code for 1a
nrows = len(data["population"])
print(f"the number of rows is {nrows}")

# just say len(data["population"]) or something similar is fine.

the number of rows is 659


#### Comments
There are several alternatives to get shapes, size, and dimensions. E.g. 

In [18]:
# 1a alternatives
(nrows,) = data["population"].shape
nrows

659

In [None]:
# 1a alternatives (if we were to use df)
# len(df)

In [27]:
# Code for 1b
ncol = 0
ncol_num = 0
ncol_bool = 0
ncol_str = 0
set_string_col = []
for col in data.keys():
    ncol += 1   # count every column, whatever its type
    if data[col].dtype in ["float", "int"]:
        ncol_num += 1
    elif data[col].dtype == "bool":
        ncol_bool += 1
    else:
        ncol_str += 1
        set_string_col.append(col)

print(f"number of columns = {ncol}, number of boolean columns = {ncol_bool}, number of string columns = {ncol_str}, and the string columns are {set_string_col}")

number of columns = 5, number of boolean columns = 0, number of string columns = 2, and the string columns are ['ATMwithdr', 'withdrvalue']


#### Comments
This was a really tricky question. The intent was simple: check types. When I wrote the question, I didn't realize that `.dtype == "str"` doesn't work as intended. I did realize this before I released the assignment, but I chose to keep the question because we can always find workarounds. See above, where I avoided checking for strings directly and treated every column that is neither numeric nor boolean as a string column. 

Now, the real question from some of you is "why couldn't we say 'np.array.dtype == str'?" 

A quick-ish answer is that strings have many specific types, which depend on their encoding and length. If we see that the dtype is `<U11`, that means we have Unicode strings of up to 11 characters. But there are other string types, such as byte strings, which can hold ASCII text.
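As a rough sketch of the point above, the following uses two made-up arrays (`unicode_col` and `bytes_col` are hypothetical names, not columns in `data`) to show how the specific string dtype can be inspected. Checking the one-letter `.dtype.kind` code is one possible workaround, offered here as an assumption rather than as the intended solution.

```python
import numpy as np

# hypothetical toy columns, not the assignment data
unicode_col = np.array(["missing", "0.25542593"])
bytes_col = np.array([b"missing", b"0.25"])

print(unicode_col.dtype)            # the width depends on the longest string
print(unicode_col.dtype.kind)       # 'U' = Unicode string
print(bytes_col.dtype.kind)         # 'S' = byte string
print(unicode_col.dtype == "str")   # not a reliable "is this a string column" check

# one workaround: test the kind code instead of the exact dtype
is_string = unicode_col.dtype.kind in ("U", "S")
print(is_string)
```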

One way to tackle this head-on is to use `np.issubdtype()` to get True's for something that is a subtype of the broader family of NumPy strings, `np.str_`. We can also change the "int" and "bool" to be NumPy versions of these types.  


In [30]:
ncol = 0
ncol_num = 0
ncol_bool = 0
ncol_str = 0
set_string_col = []
for col in data.keys():
    ncol += 1   # count every column, whatever its type
    if np.issubdtype(data[col].dtype, np.number):   # any NumPy numeric type (integer or float); np.float_ no longer exists in NumPy 2.0
        ncol_num += 1
    elif data[col].dtype == np.bool_:
        ncol_bool += 1
    elif np.issubdtype(data[col].dtype, np.str_):
        ncol_str += 1
        set_string_col.append(col)

print(f"number of columns = {ncol}, number of boolean columns = {ncol_bool}, number of string columns = {ncol_str}, and the string columns are {set_string_col}")

number of columns = 5, number of boolean columns = 0, number of string columns = 2, and the string columns are ['ATMwithdr', 'withdrvalue']


In [10]:
# Code for 1c
is_missing_row = (data['ATMwithdr'] == "missing") | (data['withdrvalue'] == "missing")  # we need is_missing_row later
missing_rows = is_missing_row.sum()     # True's are 1's, so they sum up to the number of missing rows
print(f"the number of rows with a 'missing' is {missing_rows}, which is also the number of rows with missing values")

the number of rows with a 'missing' is 349, which is also the number of rows with missing values


#### Comments
It is interesting to see that people have very different responses to this question and also question 2.

One might be tempted to loop across rows. After all, wouldn't the way to count "missing" be just to add one to a counter every time we see "missing"? 

But (remember the drunk bar guy) with np.arrays, we can avoid loops with universal functions (vectorized functions). In this case, "==" is vectorized, so we can just feed the entire column to the operation at once. 

In general, because we have pd.Series or np.ndarrays to represent columns (a column is a pd.Series or an np.array of elements), we do not have to loop over rows most of the time. The exception is when we do not have a vectorized function and cannot write one ourselves. 

Writing loops is not only time consuming but can also lead to slower code. To see this point, let's use `%timeit` to benchmark the two versions. We will return to this point around the end of this class.

In [33]:
# benchmark, sum way
def vectorized_approach():
    is_missing_row = (data['ATMwithdr'] == "missing") | (data['withdrvalue'] == "missing")  # we need is_missing_row later
    missing_rows = is_missing_row.sum()

%timeit vectorized_approach()

12.6 µs ± 276 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [35]:
# benchmark, for loop
def loop_approach():
    missing_rows = 0
    for r in range(nrows):
        if (data['ATMwithdr'][r] == "missing") | (data['withdrvalue'][r] == "missing"): 
            missing_rows += 1

%timeit loop_approach()

552 µs ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


## 2. What explains the rows with "missing"?

We think that the string "missing" was an intention to represent missing values. Let's get a smaller dataset that focuses on rows without any missings. 
- [2a] First, create a **copy** of the original dictionary `data`, call it `data_sub`. 
- [2b] Then, for each column in `data_sub`, remove rows where **any column** contains "missing." For example: 
    - for row 0, we have population = 3722, numATMs = 1, ATMwithdr = 0.255, withdrvalue = 79.134, unemprate = 0.073. None of them is "missing," so this row stays in the new dictionary `data_sub`. 
    - for row 2, population = 4234, numATMs = 0, ATMwithdr = "missing", withdrvalue = "missing", and unemprate = 0.027. Some of the columns in this row have the value "missing," so we'll drop the entire row. 
    - the new dictionary should still have five elements, each being a NumPy `ndarray` with equal length. Print that length. 
- [2c] For this smaller data, compute the average of population, numATMs, and unemprate. Compare these averages against the full sample's average. Organize your results in a convenient format (a format that you like). What variable explains the missing rows? 

In [11]:
# Code for 2a
data_sub = data.copy()      # we don't actually need this step. We can just create an empty dictionary and add the shrunk columns in. But the instruction says copy, which is okay.
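
A side note on `dict.copy()`: it makes a shallow copy, so the new dictionary initially points to the same arrays as the original. The sketch below uses a made-up dictionary (`toy` is hypothetical, not the assignment data) to suggest why reassigning a key in the copy, as we do in 2b, leaves the original untouched.

```python
import numpy as np

# hypothetical toy dictionary, not the assignment data
toy = {"x": np.array([1, 2, 3])}
toy_copy = toy.copy()                       # shallow copy: new dict, same arrays

shares_array = toy_copy["x"] is toy["x"]    # both keys point to one array object
print(shares_array)

toy_copy["x"] = toy_copy["x"][:2]           # reassigning the key only changes toy_copy
original_len = len(toy["x"])
print(original_len, len(toy_copy["x"]))
```

An in-place change such as `toy_copy["x"][0] = 99`, made before the reassignment, would show up in `toy` as well. If fully independent arrays are needed, `copy.deepcopy` from the standard library is one option.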

In [12]:
# Code for 2b
#   now we use is_missing_row to do indexing, or masking. You can convert it to indices using np.where and use the numeric indices
for col in data.keys():
    data_sub[col] = data_sub[col][~is_missing_row]
    if col in set_string_col:                               # use the set of string columns, derived before
        data_sub[col] = data_sub[col].astype("float")       # if we're getting the wrong rows, i.e. if the rows still contain missing, there will be an error here

nrows_sub = len(data_sub["population"])
print(f"data_sub now has {nrows_sub} rows")

data_sub now has 310 rows


#### Comments
Similar comment as before: we often loop over columns, but we do not often loop over rows. For rows, we have access to boolean masking, so we can directly use `[~is_missing_row]` (which finds the non-missing rows) to select rows within each column.

Remember to select all columns for the same set of rows!
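
The code comment in 2b mentions that the mask can be converted to numeric indices with `np.where`. Here is a minimal sketch of that alternative on a made-up two-column dictionary (`toy`, `is_missing`, and `keep_idx` are hypothetical names). Both ways of indexing should select the same rows.

```python
import numpy as np

# hypothetical toy data, not the assignment data
toy = {
    "pop": np.array([3722, 4234, 10303]),
    "withdr": np.array(["0.25", "missing", "0.61"]),
}

is_missing = toy["withdr"] == "missing"
keep_idx = np.where(~is_missing)[0]     # integer positions of the rows to keep

# apply the same row selection to every column
by_mask = {col: arr[~is_missing] for col, arr in toy.items()}
by_index = {col: arr[keep_idx] for col, arr in toy.items()}

print(keep_idx)
print(by_mask["pop"], by_index["pop"])
```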

Same comment for the last question.

In [13]:
# Code for 2c (Just compare the averages. The last question is for thinking.)
avg_full = {}       # use dictionary because they have keys
avg_sub = {}

for col in ["population", "numATMs", "unemprate"]:
    avg_full[col] = data[col].mean().round(3)   # round to make the display nicer. Note the method chaining.
    avg_sub[col] = data_sub[col].mean().round(3)

print(f"the full sample averages are {avg_full}")
print(f"the sub sample averages are {avg_sub}")
print("markets with non-missing observations are larger (higher population) and have more ATMs")

the full sample averages are {'population': 8738.464, 'numATMs': 0.737, 'unemprate': 0.031}
the sub sample averages are {'population': 13445.374, 'numATMs': 1.568, 'unemprate': 0.032}
markets with non-missing observations are larger (higher population) and have more ATMs


## 3. Fill in missing with zeros

We now see that the "missing" values come from the fact that there are no ATMs in certain markets. That is, when ATMwithdr and withdrvalue are "missing," numATMs is 0. So one way to keep all rows is to replace the value "missing" with some other value. 

- [3a] Convert the observations in ATMwithdr and withdrvalue, whose current value is "missing," to "0.0" (string 0.0). Then, convert the entire columns to float. Do so directly on the dictionary `data`.



In [14]:
# Code for 3a
types = []
for col in ["ATMwithdr", "withdrvalue"]:
    data[col][is_missing_row] = "0.0"
    data[col] = data[col].astype("float")
    types.append(data[col].dtype)

print(f"to confirm, the two columns have types {types}")

to confirm, the two columns have types [dtype('float64'), dtype('float64')]
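
As a possible alternative to the in-place assignment in 3a, the sketch below uses `np.where` to build a new array and convert it in one expression. It runs on a made-up array (`raw` is hypothetical, not a column from `data`) and is offered only as one way to do the same replacement.

```python
import numpy as np

# hypothetical toy column, not the assignment data
raw = np.array(["0.25542593", "missing", "1.837865114"])

# where the value is "missing", use "0.0"; otherwise keep the original string
filled = np.where(raw == "missing", "0.0", raw).astype(float)

print(filled.dtype)
print(filled)
```

One design consideration: assigning a string into an existing fixed-width string array truncates it if the array's width is too small, whereas `np.where` builds a new array wide enough for both inputs. With the widths in this dataset, "0.0" fits either way.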


#### Final comment

I talked to many people and saw many of you working very hard. Not everyone does this in the most ideal way on the first try. That's okay! We learn by accumulating experiences. Good job! 