# Denison CS181/DA210

---

## Tabular Representations Exercises

*Execute the prolog cell*

In [70]:
import os
import io
import sys
from contextlib import redirect_stdout
from IPython.core.debugger import set_trace

def add_modules():
    """
    Starting at the current directory and proceeding up the file system
    tree, search for a directory named `modules`.  If found, and if not
    already there, add to the Python module search path.
    
    Params: None
    
    Return: None
    """
    directory = "."
    levels = 0
    while not os.path.isdir(os.path.join(directory, "modules")) and \
          levels < 5:
        directory = os.path.join(directory, "..")
        levels += 1
    module_path = os.path.abspath(os.path.join(directory, "modules"))
    if os.path.isdir(module_path):
        if module_path not in sys.path:
            sys.path.append(module_path)

add_modules()
import util

datadir = util.resolve_dir("tabulardata")

### Input File for Exercises

In the tabular data directory (as determined in the final line of the prolog cell) is a file named `namespop10.csv`.  This file conforms to the "simple CSV format" as described in section 3.4.5 of the textbook.  A prefix of the file contents is displayed below:
```
year,sex,name,count,population
2010,Female,Isabella,22913,309330000
2010,Male,Jacob,22127,309330000
2011,Female,Sophia,21842,311580000
2011,Male,Jacob,20371,311580000

    ...
```
Since this file was exported as a CSV from Google Sheets, each line ends with two invisible characters: a carriage return followed by a newline.  The latest year in the dataset is 2018.

Note that, in terms of data types for any in-memory data structure created, the year, count, and population columns should be **integers** so that summary information (like min, max, average, etc.) can be calculated from their values.
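
As a minimal sketch of the two points above (the invisible line ending and the integer columns), the snippet below parses a single data line. The helper name `parse_line` is hypothetical and is not part of the course modules; it assumes the five-column layout shown in the file prefix.

```python
def parse_line(line):
    """Hypothetical helper: turn one CSV data line into a typed row list."""
    # strip() removes the trailing carriage return and newline
    fields = line.strip().split(',')
    # Columns 0, 3, and 4 (year, count, population) hold integers
    for index in (0, 3, 4):
        fields[index] = int(fields[index])
    return fields

row = parse_line("2010,Female,Isabella,22913,309330000\r\n")
print(row)   # [2010, 'Female', 'Isabella', 22913, 309330000]
```

Without the conversion step, every field would remain a string, and operations such as `min`, `max`, or an average over the year, count, or population values would either fail or compare text rather than numbers.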

**Q** Write a function

    readNamesPopLoL(path)
    
that reads from the CSV file at location `path` and returns two items:

- a list of strings of the column variable names obtained from the first line,
- a list of row lists (LoL) data structure containing the data obtained from the file.  The field values in the row lists should be the correct data type (with integers for the first, fourth, and fifth elements of the row).

You can assume that the data is formatted as described above, but your function should work correctly even if the file directory, the file name, or the number of lines in the file were different from the example `namespop10.csv` file.

You should return `None, None` if no file is found at the specified location.

In [71]:
def readNamesPopLoL(path):
    """
    This function organizes data from csv files and puts 
    the data in a list of lists. It also puts the headers
    for each column in another list.
    
    Params: path: this is the path of the file that will
    have its data organized.
    
    Return: columns, LoL: returns two lists; the first
            has the names of each column and the second 
            is the data from the file.
            None, None: this is returned when the path 
            does not lead to a file
    """
    if os.path.isfile(path):
        with open(path) as f:
            line1 = f.readline()
            trimheaders = line1.strip()
            columns = trimheaders.split(',')
            
            LoL = []
            for line in f:
                linetrim = line.strip()
                fields = linetrim.split(',')
                fields[0] = int(fields[0])
                fields[3] = int(fields[3])
                fields[4] = int(fields[4])
                LoL.append(fields)
        return columns, LoL
    return None, None

In [72]:
# Experimentation cell for students to debug

path = os.path.join(datadir, "namespop10.csv")
assert os.path.isfile(path)

columns, dataset = readNamesPopLoL(path)
print(columns)
print(dataset)

['year', 'sex', 'name', 'count', 'population']
[[2010, 'Female', 'Isabella', 22913, 309330000], [2010, 'Male', 'Jacob', 22127, 309330000], [2011, 'Female', 'Sophia', 21842, 311580000], [2011, 'Male', 'Jacob', 20371, 311580000], [2012, 'Female', 'Sophia', 22313, 313870000], [2012, 'Male', 'Jacob', 19074, 313870000], [2013, 'Female', 'Sophia', 21223, 316060000], [2013, 'Male', 'Noah', 18257, 316060000], [2014, 'Female', 'Emma', 20936, 318390000], [2014, 'Male', 'Noah', 19305, 318390000], [2015, 'Female', 'Emma', 20455, 320740000], [2015, 'Male', 'Noah', 19635, 320740000], [2016, 'Female', 'Emma', 19496, 323070000], [2016, 'Male', 'Noah', 19117, 323070000], [2017, 'Female', 'Emma', 19800, 325150000], [2017, 'Male', 'Liam', 18798, 325150000], [2018, 'Female', 'Emma', 18688, 327170000], [2018, 'Male', 'Liam', 19837, 327170000]]


In [73]:
# Testing cell

columns, dataset = readNamesPopLoL(os.path.join(datadir, "namespop10.csv"))
assert columns == ['year', 'sex', 'name', 'count', 'population']
assert len(dataset) == 18
assert isinstance(dataset, list)
assert isinstance(dataset[0], list)
assert len(dataset[0]) == 5


**Q** Create a function

    columnAverage(datasetLoL, column_index)
    
that computes the average of the numbers in a column in a list of lists data set, where the column is given by the integer parameter `column_index`.  For the `namespop10` dataset, we could compute the average of the values in column 0 (the year), column 3 (the count), or column 4 (the population).  The function should return the numeric average, and need not handle the case where the specified column values do **not** contain numeric data types.
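
As a small worked example of the arithmetic involved, the sketch below averages column 3 (the count) over only the first four data rows shown in the file prefix earlier. The variable names are illustrative assumptions, and this is just one way to express the computation.

```python
# First four data rows from the file prefix shown earlier
rows = [
    [2010, 'Female', 'Isabella', 22913, 309330000],
    [2010, 'Male', 'Jacob', 22127, 309330000],
    [2011, 'Female', 'Sophia', 21842, 311580000],
    [2011, 'Male', 'Jacob', 20371, 311580000],
]

# Pull the count out of each row, then divide the sum by the number of rows
count_values = [row[3] for row in rows]
count_average = sum(count_values) / len(count_values)
print(count_average)   # 21813.25
```

The sum of the four counts is 87253, and dividing by 4 gives 21813.25.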

In [74]:
def columnAverage(datasetLoL, column_index):
    """
    This returns the average of the numbers in one
    column of a dataset arranged as a list of lists.
    
    Params: datasetLoL: the list of lists that will be
            used to find the column average.
            column_index: the index of the column that
            needs to be averaged.
    
    Return: average of the integers in the column of the
            dataset
    """
    # Accumulate the column total, then divide by the number of rows
    total = 0
    for row in datasetLoL:
        total = total + row[column_index]
    return total / len(datasetLoL)

In [75]:
# Experimentation cell for students to debug

LoL0 = [
    [5, 3, 7, 2]
]

LoL1 = [
    [5, 3, 7, 2],
    [4, 4, 8, 3]
]

LoL2 = [
    [5, 3, 7, 2],
    [4, 4, 8, 3],
    [9, 3, 0, 4]
]

print(columnAverage(LoL0, 0))
print(columnAverage(LoL1, 0))
print(columnAverage(LoL2, 0))

5.0
4.5
6.0


In [76]:
LoL2 = [
    [5, 3, 7, 2],
    [4, 4, 8, 3],
    [9, 5, 0, 4]
]
assert columnAverage(LoL2, 0) == 6.0
assert columnAverage(LoL2, 1) == 4.0
assert columnAverage(LoL2, 2) == 5.0


**Q** Write a function

    readNamesPopDoL(path)
    
that reads from the CSV file at location `path` and returns a dictionary of column lists (DoL) representation of the data.  

You can assume that the data is formatted as described above, but your function should work correctly even if the file directory, the file name, or the number of lines in the file were different from the example `namespop10.csv` file.

You should return `None` if no file is found at the specified location.
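
To illustrate how the DoL representation relates to the LoL representation, the sketch below transposes a small list of row lists into a dictionary of column lists. The function name `lol_to_dol` is hypothetical, and the approach (using `zip` to regroup values by column) is only one possibility; it assumes every row has one value per column name.

```python
def lol_to_dol(columns, rows):
    """Hypothetical helper: transpose row lists into a dict of column lists."""
    # zip(*rows) regroups the values so each tuple holds one column
    return {name: list(values) for name, values in zip(columns, zip(*rows))}

columns = ['year', 'sex', 'name', 'count', 'population']
rows = [
    [2010, 'Female', 'Isabella', 22913, 309330000],
    [2010, 'Male', 'Jacob', 22127, 309330000],
]

dol = lol_to_dol(columns, rows)
print(dol['name'])    # ['Isabella', 'Jacob']
print(dol['count'])   # [22913, 22127]
```

Each key in the result is a column name and each value is the list of that column's entries in row order, so position `i` across all of the lists corresponds to row `i`.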

In [77]:
def readNamesPopDoL(path):
    """
    This function returns a dictionary of lists that
    contains data from a csv file.
    
    Params: path: the path of the file that will get its
            data organized.
    
    Return: DoL: the dictionary of lists with the 
            organized data.
            None: if the path does not lead to a file.
    """
    if os.path.isfile(path):
        DoL = {}
        with open(path) as fObj:
            line1 = fObj.readline()
            keys = line1.strip().split(',')
            for item in keys:
                DoL[item] = []
            for line in fObj:
                fields = line.strip().split(',')
                fields[0] = int(fields[0])
                fields[3] = int(fields[3])
                fields[4] = int(fields[4])
                for i in range(len(keys)):
                    DoL[keys[i]].append(fields[i])
        return DoL        
    return None

In [78]:
# Experimentation cell for students to debug

path = os.path.join(datadir, "namespop10.csv")
assert os.path.isfile(path)

dataset = readNamesPopDoL(path)
print(dataset)

{'year': [2010, 2010, 2011, 2011, 2012, 2012, 2013, 2013, 2014, 2014, 2015, 2015, 2016, 2016, 2017, 2017, 2018, 2018], 'sex': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'], 'name': ['Isabella', 'Jacob', 'Sophia', 'Jacob', 'Sophia', 'Jacob', 'Sophia', 'Noah', 'Emma', 'Noah', 'Emma', 'Noah', 'Emma', 'Noah', 'Emma', 'Liam', 'Emma', 'Liam'], 'count': [22913, 22127, 21842, 20371, 22313, 19074, 21223, 18257, 20936, 19305, 20455, 19635, 19496, 19117, 19800, 18798, 18688, 19837], 'population': [309330000, 309330000, 311580000, 311580000, 313870000, 313870000, 316060000, 316060000, 318390000, 318390000, 320740000, 320740000, 323070000, 323070000, 325150000, 325150000, 327170000, 327170000]}


In [79]:
# Testing cell

dataset = readNamesPopDoL(os.path.join(datadir, "namespop10.csv"))
assert isinstance(dataset, dict)
for column in ['year', 'sex', 'name', 'count', 'population']:
    assert column in dataset
assert len(dataset['year']) == 18
assert isinstance(dataset['year'], list)


**Q** Write a function

    getNamesPopDoLRow(dataset, row_index)

that obtains a single row from a dictionary of lists representation of the `namespop` dataset, where the DoL is given by `dataset` and the row index is given by `row_index`.  The row should be represented as a list of the field values.

For example, if we wanted to obtain row 17 (the last row) from the example data set, then `getNamesPopDoLRow(dataset, 17)` would yield the list

    [2018, 'Male', 'Liam', 19837, 327170000]
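
The order of the values in the returned row depends on the order in which the columns are visited. Dictionaries in Python 3.7 and later preserve insertion order, so iterating over a DoL built from the header line visits the columns in file order. As a hedged sketch, the hypothetical variant below takes the column order as an explicit parameter, which keeps the result predictable even when the dictionary was built in a different order.

```python
def get_row_in_order(dataset, row_index, column_order):
    """Hypothetical variant: build the row using an explicit column order."""
    return [dataset[name][row_index] for name in column_order]

# A DoL whose keys were deliberately inserted out of file order
shuffled = {'name': ['Emma', 'Liam'],
            'year': [2018, 2018],
            'population': [327170000, 327170000],
            'sex': ['Female', 'Male'],
            'count': [18688, 19837]}

order = ['year', 'sex', 'name', 'count', 'population']
print(get_row_in_order(shuffled, 1, order))
# [2018, 'Male', 'Liam', 19837, 327170000]
```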
    
    

In [80]:
def getNamesPopDoLRow(dataset, row_index):
    """
    This function makes a list of the data in a row of a
    dictionary of lists.
    
    Params: dataset: the dictionary of lists that the row
            is needed from.
            row_index: the index of the row that the data
            is located in.
    
    Return: rowList: the list that has the data from a row
            of a dictionary of lists.
    """
    rowList = []
    for item in dataset:
        rowList.append(dataset[item][row_index])
    return rowList

In [81]:
# Experimentation cell for students to debug

dataset = {'year': [2016, 2016, 2017, 2017, 2018, 2018], 
           'sex': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male'], 
           'name': ['Emma', 'Noah', 'Emma', 'Liam', 'Emma', 'Liam'], 
           'count': [19496, 19117, 19800, 18798, 18688, 19837], 
           'population': [323070000, 323070000, 325150000, 
                          325150000, 327170000, 327170000]}

print(getNamesPopDoLRow(dataset, 0))

[2016, 'Female', 'Emma', 19496, 323070000]


In [82]:
dataset = {'year': [2016, 2016, 2017, 2017, 2018, 2018], 
           'sex': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male'], 
           'name': ['Emma', 'Noah', 'Emma', 'Liam', 'Emma', 'Liam'], 
           'count': [19496, 19117, 19800, 18798, 18688, 19837], 
           'population': [323070000, 323070000, 325150000, 
                          325150000, 327170000, 327170000]}
assert getNamesPopDoLRow(dataset, 0) == [2016, 'Female', 'Emma', 19496, 323070000]
assert getNamesPopDoLRow(dataset, 5) == [2018, 'Male', 'Liam', 19837, 327170000]