<h1><center> PPOL564 - Data Science I: Foundations<br><br><font color='grey'> Working with Nested Lists </font> </center></h1>

## Learning Goals

In this notebook, we will cover:

- Opening a CSV file as a nested list
- Working with rectangular data as nested lists

This is mostly code I will go over during the lecture. 

In [22]:
# Batteries-included modules (part of the Python standard library)
import csv # read a .csv file row by row
import os  # interface to the operating system


# Read in the gapminder data
# (newline="" is what the csv module documentation recommends when opening files)
with open("gapminder.csv", mode="rt", newline="") as file:
    data = [row for row in csv.reader(file)]

What does the data look like?

In [14]:
# it is a nested list. 
data

[['country', 'lifeExp', 'gdpPercap'],
 ['Guinea_Bissau', '39.21', '652.157'],
 ['Bolivia', '52.505', '2961.229'],
 ['Austria', '73.103', '20411.916'],
 ['Malawi', '43.352', '575.447'],
 ['Finland', '72.992', '17473.723'],
 ['North_Korea', '63.607', '2591.853'],
 ['Malaysia', '64.28', '5406.038'],
 ['Hungary', '69.393', '10888.176'],
 ['Congo', '52.502', '3312.788'],
 ['Morocco', '57.609', '2447.909'],
 ['Germany', '73.444', '20556.684'],
 ['Ecuador', '62.817', '5733.625'],
 ['Kuwait', '68.922', '65332.91'],
 ['New_Zealand', '73.989', '17262.623'],
 ['Mauritania', '52.302', '1356.671'],
 ['Uganda', '47.619', '810.384'],
 ['Equatorial Guinea', '42.96', '2469.167'],
 ['Croatia', '70.056', '9331.712'],
 ['Indonesia', '54.336', '1741.365'],
 ['Canada', '74.903', '22410.746'],
 ['Comoros', '52.382', '1314.38'],
 ['Montenegro', '70.299', '7208.065'],
 ['Slovenia', '71.601', '14074.582'],
 ['Trinidad and Tobago', '66.828', '7866.872'],
 ['Poland', '70.177', '8416.554'],
 ['Lesotho', '50.007', 
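
Note that every value in the output above, including the numbers, is a string: `csv.reader` does no type conversion. A minimal sketch of this, assuming a small in-memory CSV (the rows are copied from the output above, and `io.StringIO` stands in for the file on disk so the snippet runs anywhere; `toy_csv` and `toy_data` are hypothetical names):

```python
import csv
import io

# A tiny in-memory stand-in for gapminder.csv (rows copied from the output above).
toy_csv = io.StringIO(
    "country,lifeExp,gdpPercap\n"
    "Bolivia,52.505,2961.229\n"
    "Austria,73.103,20411.916\n"
)

# csv.reader yields one list of strings per line of the file.
toy_data = [row for row in csv.reader(toy_csv)]

print(toy_data[1])           # a data row: every field is a string
print(type(toy_data[1][1]))  # <class 'str'>

# Convert explicitly whenever a number is needed.
life_exp_value = float(toy_data[1][1])
print(life_exp_value + 1)
```

This is why the later cells wrap values in `float(...)` before treating them as numbers.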

## Indexing Nested Lists

Notice something important here: because `csv.reader` simply iterates over the rows of the file, the code doesn't know that the first row is the header of the CSV. The header is stored as just another row, at index 0.

In [15]:
# accessing the header
print(data[0])

# %% -----------------------------------------
# Indexing Rows


# Row 0 holds the column names; any row > 0 is a data record.
print(data[100])

['country', 'lifeExp', 'gdpPercap']
['Burundi', '44.817', '471.663']


### Indexing by columns

In [23]:
# Indexing Columns - Remember this is a nested list

# Referencing a column data value
d = data[100] # First select the row
d[1] # Then reference the column

# doing the above all in one step
data[100][1]

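# The key is to keep in mind the column names.
# pop(0) removes the header from data, so every row shifts up by one
# (the old data[100] is now data[99]). Run this only once.
cnames = data.pop(0)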

cnames

# We can now reference this column name list to pull out the columns we're interested in.
ind = cnames.index("lifeExp") # .index() lets us look up the position of a value in a list.
data[99][ind]

'44.817'
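
Because `data.pop(0)` modifies `data` in place, running that cell a second time would remove a real data row. A minimal sketch of a non-mutating alternative, assuming a toy nested list (`toy`, `header`, `rows`, and `col` are hypothetical names, not part of the lecture code):

```python
# Toy nested list with the header in row 0 (values copied from the output above).
toy = [
    ["country", "lifeExp", "gdpPercap"],
    ["Bolivia", "52.505", "2961.229"],
    ["Austria", "73.103", "20411.916"],
]

# Indexing and slicing leave `toy` untouched, so this can be re-run safely.
header = toy[0]
rows = toy[1:]

col = header.index("lifeExp")  # position of the column we want
print(rows[0][col])            # first data row, lifeExp column: 52.505
```

The trade-off is that `rows` is a new list, so the original `toy` still carries the header in row 0.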

## Accessing an entire column

If I want to extract all the values of a particular column, I need to loop over every row (sublist) and pull out its *j*-th element.

In [1]:
# Looping through each row pulling out the relevant data value
life_exp = []
for row in data:
    life_exp.append(float(row[ind]))

# Same idea, but as a list comprehension 
life_exp = [float(row[ind]) for row in data]
print(life_exp)

# Make this code more flexible: look up the column by name
var_name = "gdpPercap"
var_ind = cnames.index(var_name)  # look up the position once, not once per row
out = [row[var_ind] for row in data]

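
The column-extraction pattern above can be wrapped in a small helper. This is only a sketch: `get_column` and its `cast` parameter are hypothetical names introduced for illustration, and the toy data stands in for `data` and `cnames`.

```python
def get_column(rows, names, var_name, cast=str):
    """Return every value of the column `var_name`, converted with `cast`."""
    j = names.index(var_name)  # look up the column position once
    return [cast(row[j]) for row in rows]


# Toy stand-ins for cnames and data (values copied from the output above).
toy_names = ["country", "lifeExp", "gdpPercap"]
toy_rows = [
    ["Bolivia", "52.505", "2961.229"],
    ["Austria", "73.103", "20411.916"],
]

countries = get_column(toy_rows, toy_names, "country")
gdp = get_column(toy_rows, toy_names, "gdpPercap", cast=float)
print(countries)
print(gdp)
```

Passing `cast=float` keeps the string-to-number conversion in one place instead of repeating it in every loop.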

## Motivating Numpy

All of the above seems like a bit too much work for rectangular data in Python. And it is. So, of course, there are more modern and easier-to-use strategies for working with data frames in Python.

A first step toward working with data frames in Python is to use `numpy` to convert nested lists into arrays.

**If you are coming from R, think of numpy arrays as matrices.**

We will see more of numpy soon. For now, let's look briefly at how numpy works and at the speed boost it gives when accessing data in Python.
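
As a quick preview, here is a minimal sketch of the conversion on a toy nested list (values copied from the output above; `toy_rows` and `toy_np` are hypothetical names). One thing to watch for: because the nested list holds strings, the resulting array holds strings too.

```python
import numpy as np

toy_rows = [
    ["Bolivia", "52.505", "2961.229"],
    ["Austria", "73.103", "20411.916"],
]

# Every sublist has the same length, so the result is a 2D array.
toy_np = np.array(toy_rows)

print(toy_np.shape)       # (number of rows, number of columns)
print(toy_np.dtype.kind)  # 'U' means unicode strings
```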


In [25]:
# %% -----------------------------------------
# Numpy offers an efficiency boost, especially when indexing
import numpy as np


# Convert to a numpy array
data_np = np.array(data)
data_np

array([['Guinea_Bissau', '39.21', '652.157'],
       ['Bolivia', '52.505', '2961.229'],
       ['Austria', '73.103', '20411.916'],
       ['Malawi', '43.352', '575.447'],
       ['Finland', '72.992', '17473.723'],
       ['North_Korea', '63.607', '2591.853'],
       ['Malaysia', '64.28', '5406.038'],
       ['Hungary', '69.393', '10888.176'],
       ['Congo', '52.502', '3312.788'],
       ['Morocco', '57.609', '2447.909'],
       ['Germany', '73.444', '20556.684'],
       ['Ecuador', '62.817', '5733.625'],
       ['Kuwait', '68.922', '65332.91'],
       ['New_Zealand', '73.989', '17262.623'],
       ['Mauritania', '52.302', '1356.671'],
       ['Uganda', '47.619', '810.384'],
       ['Equatorial Guinea', '42.96', '2469.167'],
       ['Croatia', '70.056', '9331.712'],
       ['Indonesia', '54.336', '1741.365'],
       ['Canada', '74.903', '22410.746'],
       ['Comoros', '52.382', '1314.38'],
       ['Montenegro', '70.299', '7208.065'],
       ['Slovenia', '71.601', '14074.582'],
      

### Slicing data with numpy

In [26]:
# simple slicing of rows and columns of your 2d array
# array[rows, columns]
data_np[:,2]


array(['652.157', '2961.229', '20411.916', '575.447', '17473.723',
       '2591.853', '5406.038', '10888.176', '3312.788', '2447.909',
       '20556.684', '5733.625', '65332.91', '17262.623', '1356.671',
       '810.384', '2469.167', '9331.712', '1741.365', '22410.746',
       '1314.38', '7208.065', '14074.582', '7866.872', '8416.554',
       '780.553', '16245.209', '3477.21', '1200.416', '680.133',
       '3484.779', '12013.579', '13969.037', '1044.582', '5613.844',
       '4469.453', '4898.398', '1854.731', '675.368', '6384.055',
       '7269.216', '1153.82', '1569.275', '6197.645', '3163.352',
       '6703.289', '14160.936', '4426.026', '13920.011', '2697.833',
       '17425.382', '1488.309', '817.559', '648.343', '6283.259',
       '3675.582', '1835.01', '3009.288', '675.669', '10863.164',
       '3255.367', '1017.713', '542.278', '673.093', '20261.744',
       '604.814', '1335.595', '1165.454', '11529.865', '4768.942',
       '1358.199', '7300.17', '2844.856', '3074.031', '1533.12
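
The `array[rows, columns]` pattern covers single values, whole rows, and whole columns. The sketch below assumes a toy array (values copied from the output above; `toy_np` and `life_exp` are illustrative names) and also converts a sliced column to floats with `astype`, since the sliced values are still strings.

```python
import numpy as np

toy_np = np.array([
    ["Bolivia", "52.505", "2961.229"],
    ["Austria", "73.103", "20411.916"],
    ["Malawi", "43.352", "575.447"],
])

print(toy_np[0, 1])    # one value: row 0, column 1
print(toy_np[1, :])    # an entire row
print(toy_np[:, 0])    # an entire column
print(toy_np[:2, 1:])  # first two rows, last two columns

# The values are strings, so convert before doing arithmetic.
life_exp = toy_np[:, 1].astype(float)
print(life_exp.mean())
```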

In [49]:
%%timeit
# loop approach
out1 = []
for row in data:
    out1.append(row[var_ind])

8.48 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [50]:
%%timeit
# list comprehension
out2 = [row[var_ind] for row in data]

4.99 µs ± 32.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [51]:
%%timeit
# numpy
out3 = data_np[:,var_ind]

144 ns ± 0.417 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
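
Outside of a notebook, where the `%%timeit` magic is unavailable, the same comparison can be sketched with the standard-library `timeit` module. The data here are synthetic (1,000 made-up rows), so the timings will not match the numbers above; only the relative ordering is of interest. Part of numpy's advantage is that basic slicing returns a view of the existing array rather than building a new list element by element. The function names are hypothetical.

```python
import timeit

import numpy as np

# Synthetic stand-in for the gapminder rows (made-up values).
toy_rows = [[f"country_{i}", str(40 + i % 40), str(500 + i)] for i in range(1000)]
toy_np = np.array(toy_rows)
var_ind = 2  # position of the column to extract


def with_loop():
    out = []
    for row in toy_rows:
        out.append(row[var_ind])
    return out


def with_comprehension():
    return [row[var_ind] for row in toy_rows]


def with_numpy():
    return toy_np[:, var_ind]


# All three approaches return the same values.
assert with_loop() == with_comprehension() == with_numpy().tolist()

# Time 1,000 calls of each approach and report the average per call.
for fn in (with_loop, with_comprehension, with_numpy):
    seconds = timeit.timeit(fn, number=1000)
    print(f"{fn.__name__}: {seconds / 1000 * 1e6:.2f} µs per call")
```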


