## Programming for Analytics -- Homework Assignment 1
## "Belgium ATMs"

#### Name: Liuzhiying
#### Email: zliu124@ur.rochester.edu
#### Date: Aug 6th, 2024


## 0. General description and preparation steps

This is the first homework assignment for GBA464 Programming for Analytics. Here, we will read and analyze a dataset on the number of ATMs in each market and some withdrawal measures. We have not learned to work with Pandas yet, so we will work with NumPy and a dataset in the form of a dictionary, where each element is a column name:column value pair. See Exercise 1 in the NumPy lecture notes for a similar data structure. 

To complete the assignment, you should:
1. Make sure Pandas and NumPy are installed correctly. We installed NumPy in the first week. In the same manner, install Pandas using `pip install pandas` (or `conda install pandas`) in the Terminal/Anaconda Prompt.
2. Follow the notebook and complete each question. Do not alter my code in Section 0 ("General description and preparation steps"). 
3. When your code is finished, clear all outputs and select "run all" to do a full run. Then check and confirm that there are no errors and that all outputs are as intended. 
4. Submit the finished notebook on Blackboard. Do this before the submission deadline. 

Grading is based on completion, so make a genuine programming attempt at each question. Do not "hard code" your answer. For example, the number of rows in the data is 659, and the number of nonmissing rows is 310. Do **not** code `nrows = 659` and `nrows_nonmissing = 310` and use them as answers. 

#### Import libraries

In [3]:
import pandas as pd
import numpy as np

#### Read data

We will first read the data. The data comes from a CSV file at a URL, and we will use Pandas to read it. We will not use Pandas after the conversion, so do not modify the first few lines of code, where I read the data, convert it into a dictionary, and delete the original data. 

In [4]:
# read df from Google Drive
df = pd.read_csv('https://drive.google.com/uc?id=1ouWC0rdnuKevZGiyT6w2YJgT25Wb2_oj&export=download')

# Note: if you cannot access this link or it is too slow, download the file "belgium_atm.csv" from Blackboard and replace the link with the file's location

# print df for view
df

| | population | numATMs | ATMwithdr | withdrvalue | unemprate |
|---|---|---|---|---|---|
| 0 | 3722 | 1 | 0.25542593 | 79.13402557 | 0.072868 |
| 1 | 7006 | 2 | 1.837865114 | 102.6663437 | 0.022695 |
| 2 | 4234 | 0 | missing | missing | 0.027397 |
| 3 | 6229 | 0 | missing | missing | 0.024402 |
| 4 | 10303 | 1 | 0.606253982 | 98.93833923 | 0.028438 |
| ... | ... | ... | ... | ... | ... |
| 654 | 601 | 0 | missing | missing | 0.021766 |
| 655 | 1028 | 0 | missing | missing | 0.021766 |
| 656 | 2033 | 0 | missing | missing | 0.021766 |
| 657 | 15521 | 2 | 0.698489904 | 110.1268387 | 0.023195 |


Then we convert the Pandas object `df` into a dictionary, which has the format column_name:column_value. Specifically, we make each column a NumPy `ndarray`. We call this dictionary `data`. 

In [5]:
# convert the data into a dictionary, storing each column as a NumPy array
data = {}
for col in df.columns:
    data[col] = np.array(df[col].tolist())

# we can also just run below, but we'll need to separately convert the columns into NumPy Arrays
# data = df.to_dict()
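
To see why a column that mixes numbers with the text "missing" ends up as a string array after this conversion, here is a minimal sketch on a hypothetical toy dictionary. The name `toy` and its values are made up for illustration and are not the assignment data.

```python
import numpy as np

# Hypothetical toy data in the same column_name: ndarray layout as `data`.
toy = {
    "numATMs": np.array([1, 0, 2]),
    "ATMwithdr": np.array(["0.25", "missing", "1.83"]),
}

for name, column in toy.items():
    # dtype.kind is 'i' for signed integers and 'U' for Unicode strings
    print(name, column.dtype.kind, len(column))
```

Because every `ndarray` has a single fixed type, one non-numeric entry is enough to make the whole column a string column.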

Next, we confirm `data` is what we wanted:
- It is a dictionary. 
- It has columns "population," "numATMs," "ATMwithdr," "withdrvalue," "unemprate."
- Each column is a NumPy `ndarray`.

In [7]:
# we can show the structure of data
print(type(data))       # it's a dict.
print(data.keys())      # it has columns population, numATMs, ATMwithdr, withdrvalue, and unemprate.
print(type(data["population"]))     # columns are ndarray.

<class 'dict'>
dict_keys(['population', 'numATMs', 'ATMwithdr', 'withdrvalue', 'unemprate'])
<class 'numpy.ndarray'>


In [8]:
# now, our data reading job is done, so we delete df and never touch it
del df

## 1. Understanding the data's structure
We first try to gain an understanding of the dataset. Using what we have learned (basic Python functions, NumPy, and flow control if necessary), find out the following:
- [1a] How many rows does the dataset have? Keep in mind the number of rows is the length of each column. We often call each row in a dataset "an observation."
- [1b] How many columns are numeric -- meaning that their values are either integer or float? How many columns are strings? How many columns are booleans? Keep in mind that with NumPy each `ndarray` has a fixed type, and that is captured in the attribute `.dtype`.
- [1c] How many rows contain a string "missing" in any column? If we think "missing" represents missing values, how many rows have any missing values?

Keep in mind we will not use Pandas functions to answer these and the subsequent questions. Also, although we are working with dictionaries, we will still call each element a "column." 


In [9]:
# Code for 1a

num_rows = len(data[list(data.keys())[0]])
print("Number of rows (observations):", num_rows)

Number of rows (observations): 659


In [10]:
# Code for 1b

num_numeric_cols = 0
num_string_cols = 0
num_boolean_cols = 0

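for col in data:
    # each ndarray has one fixed type, stored in its .dtype attribute
    col_dtype = data[col].dtype
    # check booleans first so they are never counted as numeric
    if np.issubdtype(col_dtype, np.bool_):
        num_boolean_cols += 1
    elif np.issubdtype(col_dtype, np.number):
        num_numeric_cols += 1
    elif np.issubdtype(col_dtype, np.str_):
        num_string_cols += 1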

print("Number of numeric columns:", num_numeric_cols)
print("Number of string columns:", num_string_cols)
print("Number of boolean columns:", num_boolean_cols)

Number of numeric columns: 3
Number of string columns: 2
Number of boolean columns: 0


In [12]:
# Code for 1c

num_rows_missing_string = 0
num_rows_missing_values = 0

for i in range(num_rows):
    row_missing_string = False
    row_missing_value = False
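    for col in data.values():
        value = col[i]
        if value == 'missing':
            row_missing_string = True
        # NaN check with NumPy instead of pd.isnull; np.isnan only accepts floats
        is_nan = isinstance(value, (float, np.floating)) and np.isnan(value)
        if is_nan or value == 'missing':
            row_missing_value = True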
    if row_missing_string:
        num_rows_missing_string += 1
    if row_missing_value:
        num_rows_missing_values += 1

print("Number of rows with 'missing' string in any column:", num_rows_missing_string)
print("Number of rows with missing values:", num_rows_missing_values)

Number of rows with 'missing' string in any column: 349
Number of rows with missing values: 349
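
The same count can also be obtained without a row-by-row loop. The sketch below is one possible vectorized approach, shown on a hypothetical toy dictionary (`toy` and its values are illustrative assumptions): it builds one boolean mask per string column and combines the masks with a logical OR.

```python
import numpy as np

# Hypothetical toy data; the real `data` dictionary has the same layout.
toy = {
    "numATMs": np.array([1, 0, 2, 0]),
    "ATMwithdr": np.array(["0.25", "missing", "1.83", "missing"]),
    "withdrvalue": np.array(["79.1", "missing", "102.7", "missing"]),
}

n_rows = len(next(iter(toy.values())))
row_has_missing = np.zeros(n_rows, dtype=bool)

for column in toy.values():
    # only string columns can hold the text "missing"
    if column.dtype.kind == "U":
        row_has_missing |= (column == "missing")

n_missing_rows = int(row_has_missing.sum())
print("Rows with 'missing' in any column:", n_missing_rows)
```

Summing a boolean array counts its `True` entries, so no hard-coded row numbers are needed.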


## 2. What explains the rows with "missing"?

We think that the string "missing" was intended to represent missing values. Let's get a smaller dataset that focuses on rows without any missing values. 
- [2a] First, create a **copy** of the original dictionary `data`, call it `data_sub`. 
- [2b] Then, for each column in `data_sub`, remove rows where **any column** contains "missing." For example: 
    - for row 0, we have population = 3722, numATMs = 1, ATMwithdr = 0.255, withdrvalue = 79.134, unemprate = 0.073. None of them is "missing," so this row stays in the new dictionary `data_sub`. 
    - for row 2, population = 4234, numATMs = 0, ATMwithdr = "missing", withdrvalue = "missing", and unemprate = 0.027. Some of the columns in this row have the value "missing," so we'll drop the entire row. 
    - the new dictionary should still have five elements, each being a NumPy `ndarray` with equal length. Print that length. 
- [2c] For this smaller data, compute the average of population, numATMs, and unemprate. Compare these averages against the full sample's average. Organize your results in a convenient format (a format that you like). What variable explains the missing rows? 

In [14]:
# Code for 2a
url = 'https://drive.google.com/uc?id=1ouWC0rdnuKevZGiyT6w2YJgT25Wb2_oj&export=download'
df = pd.read_csv(url)

data_sub = df.copy()

In [16]:
# Code for 2b

data_clean = data_sub[~data_sub.apply(lambda row: row.astype(str).str.contains('missing').any(), axis=1)]

print("Number of rows in cleaned data:", len(data_clean))

Number of rows in cleaned data: 310
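
Steps 2a and 2b can also be carried out on the dictionary itself, staying within NumPy. The following is a minimal sketch under that assumption, using a hypothetical toy dictionary (`toy`, `toy_sub`, and `keep` are illustrative names, not part of the assignment code).

```python
import numpy as np

# Hypothetical toy data in the same layout as `data`.
toy = {
    "numATMs": np.array([1, 0, 2, 0]),
    "ATMwithdr": np.array(["0.25", "missing", "1.83", "missing"]),
    "unemprate": np.array([0.07, 0.03, 0.02, 0.02]),
}

# 2a: copy every array so that changes to toy_sub never touch toy
toy_sub = {name: column.copy() for name, column in toy.items()}

# 2b: mark the rows where no string column equals "missing"
n_rows = len(next(iter(toy_sub.values())))
keep = np.ones(n_rows, dtype=bool)
for column in toy_sub.values():
    if column.dtype.kind == "U":
        keep &= (column != "missing")

# apply the same mask to every column so all lengths stay equal
for name in toy_sub:
    toy_sub[name] = toy_sub[name][keep]

print({name: len(column) for name, column in toy_sub.items()})
```

Applying one shared mask to all columns is what keeps the rows aligned across the dictionary.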


In [19]:
# Code for 2c (Just compare the averages. The last question is for thinking.)

original_averages = df[['population', 'numATMs', 'unemprate']].mean()

cleaned_averages = data_clean[['population', 'numATMs', 'unemprate']].mean()

print(original_averages)
print(cleaned_averages)

missings = df.apply(lambda col: (col.astype(str) == 'missing').sum())
print(missings)

### As we can observe, 'ATMwithdr' and 'withdrvalue' are the main variables that contain the missing values. Belgian authorities may need to pay attention to this problem and improve the related data collection (or protection, etc.).


population    8738.464340
numATMs          0.737481
unemprate        0.030951
dtype: float64
population    13445.374194
numATMs           1.567742
unemprate         0.031677
dtype: float64
population       0
numATMs          0
ATMwithdr      349
withdrvalue    349
unemprate        0
dtype: int64
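
In the output above, the average of numATMs roughly doubles (0.74 to 1.57) once the rows with "missing" are dropped, while unemprate barely moves. A NumPy-only version of this comparison might look like the sketch below. The toy values are loosely based on the first four displayed rows, and the names `toy`, `keep`, and `summary` are assumptions for illustration.

```python
import numpy as np

# Hypothetical toy data, loosely based on the first four displayed rows.
toy = {
    "population": np.array([3722, 7006, 4234, 6229]),
    "numATMs": np.array([1, 2, 0, 0]),
    "ATMwithdr": np.array(["0.25", "1.83", "missing", "missing"]),
}

# rows that survive the "missing" filter
keep = toy["ATMwithdr"] != "missing"

summary = {}
print(f"{'column':<12}{'full':>12}{'subsample':>12}")
for name in ("population", "numATMs"):
    full_mean = toy[name].mean()
    sub_mean = toy[name][keep].mean()
    summary[name] = (full_mean, sub_mean)
    print(f"{name:<12}{full_mean:>12.3f}{sub_mean:>12.3f}")
```

Placing the two averages side by side makes it easier to spot which variable differs most between the full sample and the subsample.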


## 3. Fill in missing with zeros

We now see that the "missing" values come from the fact that there are no ATMs in certain markets. That is, when ATMwithdr and withdrvalue are "missing," numATMs is 0. So one way to keep all rows is to replace the value "missing" with some other value. 

- [3a] Convert the observations in ATMwithdr and withdrvalue, whose current value is "missing," to "0.0" (string 0.0). Then, convert the entire columns to float. Do so directly on the dictionary `data`.
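
Before replacing anything, it can be worth checking the claim that "missing" only occurs where numATMs is 0. The sketch below does that check and then performs the conversion with boolean-mask assignment, one possible alternative to `np.where`. It runs on a hypothetical toy dictionary; `toy`, `is_missing`, and `withdr` are illustrative names.

```python
import numpy as np

# Hypothetical toy data in the same layout as `data`.
toy = {
    "numATMs": np.array([1, 2, 0, 0]),
    "ATMwithdr": np.array(["0.25", "1.83", "missing", "missing"]),
}

is_missing = toy["ATMwithdr"] == "missing"

# check: every row flagged "missing" should have zero ATMs
all_missing_have_no_atm = bool(np.all(toy["numATMs"][is_missing] == 0))
print("missing implies numATMs == 0:", all_missing_have_no_atm)

# replace "missing" with the string "0.0" on a copy, then convert to float
withdr = toy["ATMwithdr"].copy()
withdr[is_missing] = "0.0"
withdr = withdr.astype(float)
print(withdr.dtype, withdr)
```

The string-to-float conversion only succeeds once no entry contains non-numeric text, which is why the replacement has to come first.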



In [22]:
# Code for 3a

data['ATMwithdr'] = np.where(data['ATMwithdr'] == 'missing', '0.0', data['ATMwithdr'])
data['withdrvalue'] = np.where(data['withdrvalue'] == 'missing', '0.0', data['withdrvalue'])

data['ATMwithdr'] = data['ATMwithdr'].astype(float)
data['withdrvalue'] = data['withdrvalue'].astype(float)

print(data)

{'population': array([ 3722,  7006,  4234,  6229, 10303,  7424,  2868,  7129,  5422,
       11564, 13634, 11864, 21657,  4105,  8933,  6014,  9143,  6471,
        5540,  4992,  7871, 22233,  8782,  9144, 22148,  6220,  5064,
        1982,  9839,  7796, 21307,  3278, 17900,  4059,  5257,  2559,
        4110,  5056, 18196,  5951,  2689, 11831,  8615,  8844,  8560,
        2047, 14254, 11651, 10535, 17166, 15444,  3873, 11096,  7294,
        8732, 10829,  6942,  8884,  4508,  7335, 11980,  9608, 25582,
       18007,  5985,  8712,  4680, 13649,  8690,  6830,  2645,  2668,
       10920, 22024,  8315, 15098,  7181,  3299,  7352, 38713, 10482,
        6127,  3652,  2888,  1746,  3456,  8058, 15846, 12444, 11883,
        5060,  4520,  3558,  2156, 13949, 31763,  9463,  5338, 33879,
        9204, 17633,  9797,  8489, 16222,  3188, 32386, 17615,  9590,
        2363, 20497, 15938, 15389, 10169,  2019,  6336,  5099, 14237,
       14549, 13931,  5176, 15780, 19841,  7536, 16481, 17342, 17253,
     