# Design Decisions

***

### Duck Typing and Flexibility

I'm not fond of the idea, but it seems prevalent in Python, so I gave it a go.

  -  List inputs to MySeries and MyDataFrame can be any kind of iterable, and elements can be any type of object. 
  
  -  A single value where an iterable is expected is treated as an iterable with that single value. 

  -  Functions like "mean" will try to run regardless of the element data type, and fail if the data is not suitable. 

  -  MySeries will accept an index and an iterable of values that are of different lengths; it will ignore the extra elements on the longer one.
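
The two normalisation rules above (wrap a lone value, drop surplus elements) can be sketched in a few lines. `as_iterable` is a hypothetical helper name used only for illustration; the MySeries class further down inlines the same idea rather than calling a helper.

```python
def as_iterable(value):
    """Return value unchanged if it is iterable, else wrap it in a 1-tuple."""
    try:
        iter(value)
    except TypeError:
        return (value,)
    return value

# A lone value behaves like a one-element iterable.
print(list(as_iterable(5)))        # [5]
print(list(as_iterable([1, 2])))   # [1, 2]

# zip() stops at the shorter input, which is what discards the extra elements.
pairs = dict(zip("123", as_iterable("abcdef")))
print(pairs)                       # {'1': 'a', '2': 'b', '3': 'c'}
```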

There are some exceptions:

  -  Dictionaries are expected to be instances of collections.abc.Mapping. Simply expecting an object with an items() function returning key-value pairs seemed obscure to me.
  -  MyDataFrame will raise an error if the columns and the index aren't all the same length. This is because the assignment asked us to check that the input data represents a valid data frame.
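
To illustrate the first exception, here is a small sketch of what an `isinstance(..., Mapping)` check accepts and rejects. `PairBag` is a made-up class for this example only.

```python
from collections import OrderedDict
from collections.abc import Mapping
from types import MappingProxyType

class PairBag:
    """Hypothetical class: has items() but is not a Mapping."""
    def items(self):
        return [("a", 1), ("b", 2)]

print(isinstance({}, Mapping))                    # True
print(isinstance(OrderedDict(), Mapping))         # True
print(isinstance(MappingProxyType({}), Mapping))  # True
print(isinstance(PairBag(), Mapping))             # False
```

So an object that merely quacks like a dictionary is rejected, while the standard mapping types all pass.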

### Error Handling

MySeries and MyDataFrame instances will raise errors if something goes wrong, rather than print warning information or fail silently.
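
A rough sketch of why raising is convenient for the caller: the calling code decides what to do about the failure. `checked_mean` is a hypothetical stand-in for a method that raises, not part of the classes below.

```python
def checked_mean(values):
    """Hypothetical stand-in for a method that raises instead of printing."""
    values = list(values)
    if not values:
        raise ValueError("Cannot find the mean of an empty series")
    return sum(values) / len(values)

try:
    result = checked_mean([])
except ValueError as error:
    # The caller picks the fallback; nothing was printed or silently swallowed.
    result = None
    reason = str(error)

print(result, "-", reason)
```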


### Return values

Methods like mean() and max() return values rather than print them. I'm not sure if that is what the assignment was looking for. In the case of the MyDataFrame, the mean() method returns a MySeries, meaning that the invocation to print the mean looks like:

```
means = df.mean()
means.print()
```

Or just:

```
df.mean().print()
```

for short.

### Treatment of strings as iterable

If a string is found where an iterable is expected, it is treated as an iterable of characters. This was accidental and, while not incorrect, isn't very intuitive to the user. However, fixing it would complicate the code and it is convenient for testing, so I'm leaving it be.
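
For reference, one possible fix would be an extra `isinstance` check ahead of the iterability test. This is only a sketch of that alternative, not what the class below does; `as_values` is a hypothetical name.

```python
def as_values(source):
    """Hypothetical normaliser that keeps strings whole."""
    if isinstance(source, (str, bytes)):
        return (source,)          # treat the string as one value
    try:
        iter(source)
    except TypeError:
        return (source,)          # lone non-iterable value
    return source

print(as_values("abc"))           # ('abc',)
print(list(as_values([1, 2])))    # [1, 2]
print(as_values(7))               # (7,)
```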

### \_\_getitem\_\_() is implemented on MySeries

This is purely for convenience later on - it means that the print_table function can assume it's operating on a list-like object and not worry about the particulars of MySeries.
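
A minimal sketch of the idea, assuming nothing beyond square-bracket indexing: a function written against `column[key]` works equally on a plain list and on a wrapper that forwards `__getitem__`. `Boxed` and `widest_cell` are hypothetical names for this example.

```python
class Boxed:
    """Hypothetical minimal wrapper that forwards indexing to a dict."""
    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

def widest_cell(column, keys):
    """Works on anything that supports column[key]."""
    return max(len(str(column[key])) for key in keys)

as_list = widest_cell(["a", "bbb", "cc"], [0, 1, 2])
as_boxed = widest_cell(Boxed({0: "a", 1: "bbb", 2: "cc"}), [0, 1, 2])
print(as_list, as_boxed)   # 3 3
```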

MySeries
=

In [1]:
from collections.abc import Mapping

class MySeries:
    
    def __init__(self, source, index=None):
        
        if isinstance(source, Mapping):
            self.s_dict = dict(source.items())

        else:
            
            try:
                iter(source)
            except TypeError:
                source = (source,)

            if index is None:
                pairs = enumerate(source)
            
            else:
                
                try:
                    iter(index)
                except TypeError:
                    index = (index,)
                
                pairs = zip(index, source)

            self.s_dict = dict(pairs)
    
    def __getitem__(self, key):
        return self.s_dict[key]
    
    def print(self, to_str=str, separation=2):

        def width(item):
            return len(to_str(item))

        # The "+ 1" here is for the colon
        keys_width = max(map(width, self.s_dict.keys())) + 1
        values_width = max(map(width, self.s_dict.values()))

        for k, v in self.s_dict.items():

            text = (to_str(k) + ":").ljust(keys_width)
            text += " "*separation
            text += to_str(v).rjust(values_width)
            print(text)
        
    def min(self):
        
        if len(self.s_dict) == 0:
            msg = "Cannot find the min of an empty series"
            raise ValueError(msg)
        
        try:
            return min(self.s_dict.values())
        
        except TypeError:
            
            message = ("Failed to find min of MySeries."
                " Are the order operators defined"
                " for all pairs of values?")
            
            raise TypeError(message)
    
    def max(self):
        
        if len(self.s_dict) == 0:
            msg = "Cannot find the max of an empty series"
            raise ValueError(msg)
        
        try:
            return max(self.s_dict.values())
        
        except TypeError:
            
            message = ("Failed to find max of MySeries."
            " Are the order operators defined"
            " for all pairs of values?")
                
            raise TypeError(message)
    
    def mean(self):
        
        if len(self.s_dict) == 0:
            msg = "Cannot find the mean of an empty series"
            raise ValueError(msg)
        
        total = 0
        for value in self.s_dict.values():
            
            try:
                number = float(value)
            
            except ValueError:
                
                message = ("Non-numeric value found while"
                " calculating the mean: ") + repr(value)
                
                raise ValueError(message)
            
            total += number

        return total/len(self.s_dict)

Testing MySeries
-

Creating an instance from a list and an index

In [2]:
A = MySeries(["a", "b", "c"], index=[1,2,3])
A.print()

1:  a
2:  b
3:  c


Creating an instance from a list without an index

In [3]:
A = MySeries(["a", "b", "c"])
A.print()

0:  a
1:  b
2:  c


Creating an instance directly from a dictionary

In [4]:
A = {
    1:"a", 
    2:"b", 
    3:"c"
}

B = MySeries(A)
B.print()

1:  a
2:  b
3:  c


**Note:** since a string is iterable, it can be used as input.<br>
This is used often in the remaining tests!

In [5]:
A = MySeries("abc", "123")
A.print()

1:  a
2:  b
3:  c


Input values and index can be of different lengths<br>
(Extra elements on the longer iterable will be ignored)

In [6]:
A = MySeries("abcdef", "123")
A.print()

1:  a
2:  b
3:  c


Input values can be single (non-iterable) values

In [7]:
A = MySeries(0, "abc")
A.print()

a:  0


In [8]:
A = MySeries("abc", 0)
A.print()

0:  a


Printed keys and values are left and right justified respectively

In [9]:
values = [
    "a-123",
    "b-12345",
    "c-12"
]

index = [
    "A-1235",
    "B-12",
    "C-1234567"
]

MySeries(values, index).print()

A-1235:       a-123
B-12:       b-12345
C-1234567:     c-12
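
The padding arithmetic behind that output can be worked through on its own. This sketch reuses the same sample data and mirrors the widths computed in `print`, with plain strings so no `to_str` conversion is needed.

```python
rows = {"A-1235": "a-123", "B-12": "b-12345", "C-1234567": "c-12"}

key_width = max(len(k) for k in rows) + 1          # + 1 for the colon
value_width = max(len(v) for v in rows.values())

# Keys are padded on the right, values on the left, with 2 spaces between.
lines = [(k + ":").ljust(key_width) + "  " + v.rjust(value_width)
         for k, v in rows.items()]
print("\n".join(lines))
```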


Min, max and mean work with numeric values

In [10]:
A = MySeries([1,2,3])

print("Min:", A.min())
print("Max:", A.max())
print("Mean:", A.mean())

Min: 1
Max: 3
Mean: 2.0


Min and max work for any type with an order

In [11]:
A = MySeries("abc")
print("Min:", A.min())
print("Max:", A.max())

Min: a
Max: c


An error will be raised if the necessary order operators are not implemented

In [12]:
class A:
    pass # No definition of order operators

B = MySeries([A(), A()])
print("Max:", B.max())

TypeError: Failed to find max of MySeries. Are the order operators defined for all pairs of values?

One exception to the above rule:<br>
min and max still work in the trivial case of a single-element MySeries, since no comparison is needed.

In [13]:
class A:
    pass

B = MySeries(A())
print("Min:", B.min())
print("Max:", B.max())

Min: <__main__.A object at 0x000001E816D05670>
Max: <__main__.A object at 0x000001E816D05670>
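
The same behaviour can be seen with the built-in `min` directly: with one element there is nothing to compare, so the ordering operators are never called. `Unordered` is a hypothetical class for this sketch.

```python
class Unordered:
    """Hypothetical class with no ordering operators."""

only = Unordered()
print(min([only]) is only)   # True: nothing to compare against

try:
    min([Unordered(), Unordered()])
    raised = False
except TypeError:
    raised = True            # '<' is not supported between two instances

print(raised)                # True
```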


Mean works with values that can be parsed as a float

In [14]:
A = MySeries("123")
A.print()
print("Mean:", A.mean())

0:  1
1:  2
2:  3
Mean: 2.0


An error will be raised if a value cannot be converted to a float

In [15]:
A = MySeries("abc")
A.print()
print("Mean:", A.mean())

0:  a
1:  b
2:  c


ValueError: Non-numeric value found while calculating the mean: 'a'
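
It may help to see which exception `float()` raises for which input, since `mean` as written only rewraps `ValueError`; a `TypeError` from `float()` would propagate with Python's own message. `conversion_error` is a hypothetical helper for this sketch.

```python
def conversion_error(value):
    """Return the name of the exception float(value) raises, or None."""
    try:
        float(value)
    except (TypeError, ValueError) as error:
        return type(error).__name__
    return None

print(conversion_error("1"))       # None
print(conversion_error("a"))       # ValueError
print(conversion_error(object()))  # TypeError
print(conversion_error(None))      # TypeError
```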

Min, max and mean will raise an error if the series is empty

In [17]:
MySeries([]).min()

ValueError: Cannot find the min of an empty series

In [18]:
MySeries([]).max()

ValueError: Cannot find the max of an empty series

In [19]:
MySeries([]).mean()

ValueError: Cannot find the mean of an empty series

Indexing is possible.<br>
This isn't needed for the assignment, but it makes things simpler for part 2.

In [20]:
A = MySeries("abc")
A.print()

print("Element at index 0:", A[0])

0:  a
1:  b
2:  c
Element at index 0: a


There is no special error handling for indexing.<br> It is just a wrapper around the publicly available "s_dict[]" indexer

In [21]:
A = MySeries("abc")
A["invalid name"]

KeyError: 'invalid name'

MyDataFrame
=

One thing first: printing a table...

In [22]:
# This is a mess. The general strategy is to
# calculate the width of each column (as the width of the widest cell)
# and then print row by row with the appropriate padding.

def print_table(record_ids, field_ids, values):
    
    # We could use str or repr here
    to_str = str
    
    # Space between each column
    spacing = 3
    
    # Width of an object when printed
    def width(x):
        return len(to_str(x))
    
    # Get the width of the record name column
    records_width = max(map(width, record_ids))
        
    # Get the width of each field column
    field_widths = list()
    for field_id in field_ids:
        
        def item_width(record_id):
            return width(values[field_id][record_id])
        
        field_width = max(map(item_width, record_ids))
        field_width = max(field_width, width(field_id))
        
        field_widths.append(field_width)

    # Print a cell in the records column
    def printR(text):
        print(to_str(text).ljust(records_width), end=" "*spacing)
        
    # Print a cell in a field column
    def printF(text, field_index):
        print(to_str(text).ljust(field_widths[field_index]), end=" "*spacing)
       
    # Top left (empty) cell
    printR("")
        
    # Header field names
    for i, field_id in enumerate(field_ids):
        printF(field_id, i)
    
    print()
    
    # Remaining rows
    for record_id in record_ids:
        
        # Record name cell
        printR(record_id)
        
        # Field cells
        for i, field_id in enumerate(field_ids):
            printF(values[field_id][record_id], i)
            
        print()    
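
For comparison, here is a more compact sketch of the same width-then-pad strategy. It is an alternative under stated assumptions, not the function used in the rest of this notebook: `format_table` is a hypothetical name, it returns lines instead of printing, and it strips the trailing spacing that `print_table` leaves at the end of each row.

```python
def format_table(record_ids, field_ids, values, spacing=3):
    """Hypothetical compact variant: returns the table as a list of lines."""
    field_ids = list(field_ids)
    gap = " " * spacing

    # Column 0 holds the record labels; its header cell is empty.
    columns = [[""] + [str(r) for r in record_ids]]
    for f in field_ids:
        columns.append([str(f)] + [str(values[f][r]) for r in record_ids])

    # Pad every cell to the width of the widest cell in its column.
    padded = []
    for col in columns:
        w = max(len(cell) for cell in col)
        padded.append([cell.ljust(w) for cell in col])

    # Transpose columns into rows and join the cells.
    return [gap.join(row).rstrip() for row in zip(*padded)]

data = {"a": ["x", "yyy"], "bb": ["1", "22"]}
lines = format_table([0, 1], ["a", "bb"], data)
print("\n".join(lines))
```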

Testing, testing

In [23]:
char_data = {
    "a-1":["a0-1234","a1-12","a2-12"],
    "b-123456":["b0-1","b1-12","b2-1"],
    "c-12":["c0-123456","c1-12345","c2-123"]
}

field_ids = ["a-1", "b-123456", "c-12"]
record_ids = [0, 1, 2]

print_table(record_ids, field_ids, char_data)

    a-1       b-123456   c-12        
0   a0-1234   b0-1       c0-123456   
1   a1-12     b1-12      c1-12345    
2   a2-12     b2-1       c2-123      


In [47]:
# It's a notebook, so we don't need to reimport. But I'd prefer to
# make it explicit that this is being used here too.
from collections.abc import Mapping

class MyDataFrame:
    
    def __init__(self, data, index=None):
        
        if not isinstance(data, Mapping):
            
            message = ("Input data must be a Mapping type,"
            " such as a dictionary")
            
            raise TypeError(message)
        
        # Similar to len(), but handles more cases.
        def guarded_length(item):
            try:
                iter(item)
                try:
                    return len(item)
                except TypeError:
                    return len(list(item))
            except TypeError:
                return 1
        
        if len(data) == 0:
            # No columns means no records; 0 keeps range() below valid
            n_records = 0
            
        else:
            column = next(iter(data.values()))
            n_records = guarded_length(column)
            
            for value in data.values():
                if guarded_length(value) != n_records:
                    
                    message = ("All columns must have"
                    " the same number of items")
            
                    raise ValueError(message)
        
        if index is None:
            index = range(n_records)
        
        else:
            try:
                iter(index)
            except TypeError:
                index = (index,)
                
        index = list(index)
        if len(index) != n_records:
            
            message = ("Count of index must match that"
            " of the columns")
            
            raise ValueError(message)
        
        columns = dict()
        for k,v in data.items():
            columns[k] = MySeries(v, index)
        
        self.index = index
        self.columns = columns
            
    def print(self):
        print_table(self.index, self.columns.keys(), self.columns)
            
    def sort_values(self, field_id):
        
        if field_id not in self.columns:
            raise ValueError("Unrecognized field id: " + repr(field_id))
            
        def key(record_id):
            return self.columns[field_id][record_id]
        
        self.index.sort(key=key)
        
    def __reduce(self, action, action_name):
        index = list()
        values = list()
        
        for field_id, column_series in self.columns.items():
            
            try:
                # Format inside the try so that non-numeric results
                # (e.g. strings from min/max) are skipped too.
                value = "%.02f" % action(column_series)
                
            # Ignore these types of errors, it probably
            # means that the column isn't of a suitable data type.
            except ValueError: pass
            except TypeError: pass
            
            else:
                index.append(field_id)
                values.append(value)
        
        return MySeries(values, index)
        
    def min(self):

        def action(series):
            return series.min()
        
        return self.__reduce(action, "min")
        
    def max(self):

        def action(series):
            return series.max()
        
        return self.__reduce(action, "max")
        
    def mean(self):

        def action(series):
            return series.mean()
        
        return self.__reduce(action, "mean")
        
            

Testing MyDataFrame
-

Test data

In [48]:
weather_data = {
    "Sun Hours": [4.5,4.0,5.1,5],
    "Max Temp": [19.6,19.1,19.6,20.0],
    "Min Temp": [12.7,12.5,13.3,12.1],
    "Rain (mm)": [82,109,65,76],
    "Rain Days": [13,20,10,9.7]
}

weather_data_index = ["Clare", "Galway", "Dublin", "Wexford"]

Creating an instance

In [49]:
df = MyDataFrame(weather_data)
df.print()

    Sun Hours   Max Temp   Min Temp   Rain (mm)   Rain Days   
0   4.5         19.6       12.7       82          13          
1   4.0         19.1       12.5       109         20          
2   5.1         19.6       13.3       65          10          
3   5           20.0       12.1       76          9.7         


Specifying an index

In [50]:
df = MyDataFrame(weather_data, index=weather_data_index)
df.print()

          Sun Hours   Max Temp   Min Temp   Rain (mm)   Rain Days   
Clare     4.5         19.6       12.7       82          13          
Galway    4.0         19.1       12.5       109         20          
Dublin    5.1         19.6       13.3       65          10          
Wexford   5           20.0       12.1       76          9.7         


An error will be raised if the input is not a mapping type

In [51]:
df = MyDataFrame(["This", "ain't", "a", "map"])
df.print()

TypeError: Input data must be a Mapping type, such as a dictionary

The index can be a non-iterable value

In [52]:
df = MyDataFrame({"field1": 20, "field2": 15}, 0)
df.print()

    field1   field2   
0   20       15       


As before, strings act as lists of characters

In [53]:
df = MyDataFrame({"a": "13", "b":"24"}, "xy")
df.print()

    a   b   
x   1   2   
y   3   4   


An error will be raised if columns are not of the same length

In [54]:
df = MyDataFrame({"a": "123456789", "b":"123"}, "wxyz")
df.print()

ValueError: All columns must have the same number of items

An error will be raised if the index is of a different length to the columns

In [55]:
df = MyDataFrame({"a": "12", "b":"12"}, "xyz")
df.print()

ValueError: Count of index must match that of the columns
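
A caveat that seems to follow from the length check: counting a one-shot iterable such as a generator consumes it, so nothing is left for the column afterwards. The sketch below copies the `guarded_length` logic from `MyDataFrame.__init__` into a standalone function to show the effect; materialising such inputs with `list()` beforehand avoids it.

```python
def guarded_length(item):
    """Same logic as the helper inside MyDataFrame.__init__."""
    try:
        iter(item)
        try:
            return len(item)
        except TypeError:
            return len(list(item))
    except TypeError:
        return 1

gen = (n for n in range(3))
counted = guarded_length(gen)
leftover = list(gen)
print(counted, leftover)     # 3 []: counting consumed the generator

# Materialising first keeps the values available afterwards.
column = list(n for n in range(3))
print(guarded_length(column), column)   # 3 [0, 1, 2]
```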

Sorting by column name

In [56]:
df = MyDataFrame(weather_data, weather_data_index)
df.print()

print()
print(" - Sorting by 'Rain Days'")
print()

df.sort_values("Rain Days")
df.print()

          Sun Hours   Max Temp   Min Temp   Rain (mm)   Rain Days   
Clare     4.5         19.6       12.7       82          13          
Galway    4.0         19.1       12.5       109         20          
Dublin    5.1         19.6       13.3       65          10          
Wexford   5           20.0       12.1       76          9.7         

 - Sorting by 'Rain Days'

          Sun Hours   Max Temp   Min Temp   Rain (mm)   Rain Days   
Wexford   5           20.0       12.1       76          9.7         
Dublin    5.1         19.6       13.3       65          10          
Clare     4.5         19.6       12.7       82          13          
Galway    4.0         19.1       12.5       109         20          
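
Note that `sort_values` only reorders the index list; the column data is left alone and looked up by label when printing. A small sketch of that key-based sort using plain Python objects (the dictionary stands in for one column):

```python
index = ["Clare", "Galway", "Dublin", "Wexford"]
rain_days = {"Clare": 13, "Galway": 20, "Dublin": 10, "Wexford": 9.7}

# Non-mutating variant: sorted() returns a new list.
ordered = sorted(index, key=rain_days.__getitem__)
print(ordered)   # ['Wexford', 'Dublin', 'Clare', 'Galway']
print(index)     # ['Clare', 'Galway', 'Dublin', 'Wexford']

# In-place variant, which is what sort_values does with self.index.
index.sort(key=rain_days.__getitem__)
print(index)     # ['Wexford', 'Dublin', 'Clare', 'Galway']
```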


Min

In [57]:
df = MyDataFrame(weather_data, weather_data_index)
df.min().print()

Sun Hours:   4.00
Max Temp:   19.10
Min Temp:   12.10
Rain (mm):  65.00
Rain Days:   9.70


Max

In [58]:
df = MyDataFrame(weather_data, weather_data_index)
df.max().print()

Sun Hours:    5.10
Max Temp:    20.00
Min Temp:    13.30
Rain (mm):  109.00
Rain Days:   20.00


Mean

In [59]:
df = MyDataFrame(weather_data, weather_data_index)
df.mean().print()

Sun Hours:   4.65
Max Temp:   19.58
Min Temp:   12.65
Rain (mm):  83.00
Rain Days:  13.18
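
The per-column reduction skips any column it cannot process rather than failing the whole call. Here is a standalone sketch of that skip-on-error pattern for the mean; `column_means` is a hypothetical function working on plain dictionaries, not the `__reduce` method itself.

```python
def column_means(columns):
    """Hypothetical stand-alone version of the skip-on-error reduction."""
    result = {}
    for name, values in columns.items():
        try:
            numbers = [float(v) for v in values]
        except (TypeError, ValueError):
            continue  # not a numeric column, leave it out
        result[name] = "%.02f" % (sum(numbers) / len(numbers))
    return result

means = column_means({
    "Time": [129, 146, 109],
    "Genre": ["Comedy", "Horror", "Horror"],
})
print(means)   # {'Time': '128.00'}
```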


Final test case

In [None]:
films = {
    "Rank": [112,62,41,172,230,176],
    "Release Year": [1973,1980,1960,2015,1976,1996],
    "IMDB Rating": [8.3,8.4,8.5,8.1,8.1,8.1],
    "Time (minutes)": [129,146,109,118,120,98],
    "Main Genre": ["Comedy","Horror","Horror","Drama","Drama","Drama"]
}

f_names = ["Sting","Shining", "Psycho","Room","Rocky", "Fargo"]
                   
df = MyDataFrame(films, f_names)
df.print()

print()

df.mean().print()

print()

df.sort_values("Release Year")
df.print()