# Task A: DataFrames in plain Python 

At Marshall Wace, we make decisions based on data, so tools for analysing it are something we spend a lot of time with. One of the building blocks of data engineering is the DataFrame, a tabular structure that organises data into rows and columns. Many datasets fit naturally into this 2d structure, which makes DataFrames incredibly useful for manipulation, analysis, and visualisation.

Many libraries implement this functionality, and one of the most widely used is Pandas. For today's task, we will ask you to reimplement DataFrames in plain Python code, thinking about correctness, elegant design, and performance. We'll aim for a minimal implementation which supports the basic operations required for data manipulation, but there is plenty of scope for extensions and optimisations!

Don't worry if you don't know Pandas or feel unsure about programming in general! 

## Requirements 
A `DataFrame` is a 2d tabular data structure. It can be thought of as a collection of named columns, which can be naturally represented by a dictionary of names to `Series`.

A `Series` is a 1d collection which can be likened to a list or a vector of elements. Comparison and logical operators on `Series` are what make `DataFrame`s so powerful. One thing that frustrates many data engineers using Pandas is how loose it can be with type safety and with its treatment of `None` values. We therefore encourage you to build your solution with both in mind.
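To make the column-oriented idea concrete, here is a minimal sketch that uses plain lists in place of `Series`. The helper names `n_rows` and `row` are hypothetical and chosen only for illustration; they are not part of the task's required interface.

```python
from typing import Optional, Union

Value = Optional[Union[str, bool, int, float]]

# A table stored column by column: each name maps to a list of equal length.
# None marks a missing value.
table: dict[str, list[Value]] = {
    "name": ["Ann", "Joe", None],
    "age": [34, 19, 52],
}


def n_rows(columns: dict[str, list[Value]]) -> int:
    """Return the shared column length, rejecting ragged tables."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


def row(columns: dict[str, list[Value]], i: int) -> dict[str, Value]:
    """Rebuild row i by taking the i-th element of every column."""
    return {name: values[i] for name, values in columns.items()}


print(n_rows(table))  # 3
print(row(table, 1))  # {'name': 'Joe', 'age': 19}
```

Storing data by column keeps every value of one type together, which is what makes per-column type checks and element-wise operators straightforward.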

There is nothing more frustrating than wanting to quickly analyse some data and iterate on an approach, only to wait a couple of minutes every time you run your script. Hence, efficiency is another aspect you'd ideally consider, either through explicit optimisations or through comments marking 'hot spots', the parts of the code where the program spends most of its time.
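One way to find a hot spot is to time candidate implementations with the standard-library `timeit` module. The sketch below compares two hypothetical ways of building a boolean mask; the function names and data size are illustrative assumptions, and the timings will vary by machine and Python version.

```python
import timeit


def mask_with_append(values, threshold):
    """Build the mask with an explicit loop and repeated list.append calls."""
    out = []
    for v in values:
        out.append(v < threshold)
    return out


def mask_with_comprehension(values, threshold):
    """Build the same mask with a list comprehension."""
    return [v < threshold for v in values]


data = list(range(100_000))

# Both variants must agree before their timings are worth comparing.
assert mask_with_append(data, 500) == mask_with_comprehension(data, 500)

for fn in (mask_with_append, mask_with_comprehension):
    # HOT SPOT: the per-element loop inside fn is where the time goes.
    seconds = timeit.timeit(lambda: fn(data, 500), number=10)
    print(f"{fn.__name__}: {seconds:.3f}s for 10 runs")
```

Measuring before optimising keeps the effort focused on code that actually dominates the run time.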

## Spec
### Series
We want you to implement `Series` for `string`s, `bool`s, `int`s, and `float`s. This should give us a good range of functionality while keeping the implementation reasonably simple. Feel free to do everything in one class, create a separate class for each, use inheritance, or whatever you think is best. In terms of functionality of each `Series`:
- you'll want to be able to initialise it with a list of elements, each of which can be of the given type or `None`
- each `Series` should be immutable and operations should return a new `Series` object
- you should be able to read the element at a given index, as well as the length of a `Series`
- you should be able to use equality operators (`==`, `!=`) which return a boolean `Series` with elements equal to the element-wise operator results
- for the `Series` pairs where it makes sense, you should implement element-wise boolean operators (`|`, `&`, `~`, `^`) and comparison operators (`<`, `>`, `<=`, `>=`), which also result in a boolean `Series`. Think carefully about how you want to handle `None`. Where appropriate, also add operators which work between a `Series` and a single scalar value
    - for instance `[1, 2, 3, 4] < 3` should return something like `[True, True, False, False]`
- for convenience, it would be nice to be able to print the `Series` nicely formatted
- (bonus) you can implement some aggregation methods which are commonly found in data analysis like `sum()`, `count()`, `mean()` or filtering capabilities 
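
The element-wise behaviour described above can be sketched with plain lists. This is only one possible design, assuming that `None` is "contagious" (any comparison involving `None` yields `None`) and that aggregations skip `None`; the function names `elementwise` and `mean` are hypothetical.

```python
import operator
from typing import Callable, Optional


def elementwise(op: Callable, left: list, right) -> list[Optional[bool]]:
    """Apply op pairwise; a scalar right operand is repeated for every element.

    Assumption: None is 'contagious', so any comparison involving None
    yields None rather than True or False.
    """
    if isinstance(right, list):
        if len(left) != len(right):
            raise ValueError("operands must have the same length")
        pairs = zip(left, right)
    else:
        pairs = ((x, right) for x in left)
    return [None if x is None or y is None else op(x, y) for x, y in pairs]


def mean(values: list) -> Optional[float]:
    """Average the non-None elements; return None if there are none."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


print(elementwise(operator.lt, [1, 2, 3, 4], 3))         # [True, True, False, False]
print(elementwise(operator.lt, [1, None, 3], 2))         # [True, None, False]
print(elementwise(operator.eq, ["a", "b"], ["a", "c"]))  # [True, False]
print(mean([1, None, 3]))                                # 2.0
```

Raising an error on `None` instead of propagating it is an equally defensible choice; what matters is that the behaviour is deliberate and consistent across operators.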

### DataFrame
With a solid `Series` implementation, we can start working on the `DataFrame`:
- you should be able to initialise a `DataFrame` with a dictionary of names and `Series`
- `DataFrame`s should be immutable and operators should return a new `DataFrame` as appropriate
- they should be indexable by a boolean `Series`, which allows you to write code such as `df[(df["name"] != "Joe") & (df["age"] > 21)]`
- for convenience, it would be nice to be able to pretty print the `DataFrame`s in a 2d form with middle rows and columns redacted for readability
- (bonus) you can implement some `DataFrame`-wide aggregation, filtering, pivoting, or any of the common operations which make sense for a 2d table. Get creative!
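
As a rough illustration of boolean-mask indexing and row elision when printing, the sketch below works on a plain dictionary of lists. The helpers `filter_rows` and `preview` are hypothetical, the sample data is made up, and only rows (not columns) are elided here.

```python
def filter_rows(columns: dict[str, list], mask: list) -> dict[str, list]:
    """Keep the rows whose mask entry is True; False and None both drop the row."""
    for name, values in columns.items():
        if len(values) != len(mask):
            raise ValueError(f"mask length does not match column {name!r}")
    return {
        name: [v for v, keep in zip(values, mask) if keep is True]
        for name, values in columns.items()
    }


def preview(columns: dict[str, list], head: int = 2, tail: int = 2) -> str:
    """Render rows as tab-separated text, replacing the middle rows with '...'."""
    names = list(columns)
    n = len(columns[names[0]]) if names else 0
    if n <= head + tail:
        shown = list(range(n))
    else:
        # None is a placeholder for the elided middle rows
        shown = list(range(head)) + [None] + list(range(n - tail, n))
    lines = ["\t".join(names)]
    for i in shown:
        lines.append("..." if i is None else "\t".join(str(columns[c][i]) for c in names))
    return "\n".join(lines)


people = {"name": ["Ann", "Joe", "Bob", "Eve", "Sam"], "age": [34, 45, 19, 52, 28]}

# Equivalent in spirit to df[(df["name"] != "Joe") & (df["age"] > 21)]
mask = [n != "Joe" and a > 21 for n, a in zip(people["name"], people["age"])]
print(filter_rows(people, mask))
# {'name': ['Ann', 'Eve', 'Sam'], 'age': [34, 52, 28]}
print(preview(people, head=1, tail=1))
```

Both helpers return new objects and leave their inputs untouched, which fits the immutability requirement above.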


# Some basic implementation

Below you can find a partial implementation of two classes, `BooleanSeries` and `StringSeries`, which represent series of boolean and string values, respectively. These classes offer basic functionality for creating and comparing series of data, just to give you a taste of what we're looking for. What we have now:

- `BooleanSeries`: Initialises a series of boolean values, with input validation
- `StringSeries`: Initialises a series of string values, with input validation
- Equality comparison (`__eq__`) between two `StringSeries` objects, returning a `BooleanSeries`
- Basic indexing for `StringSeries` using `__getitem__`
- String representation for both classes
- `DataFrame`: Initialises a dataframe of series, with input validation
- `from_csv`: Loads a CSV file into a `DataFrame`
- Basic column manipulation, such as indexing with `__getitem__` and adding columns with `add_column`
- Pretty printing of column-wise data with `__str__`
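
For orientation, CSV loading into column-wise data can be sketched with the standard-library `csv` module. The helper name `read_columns` and the sample rows are assumptions made for illustration, and this is not the `from_csv` referred to above. Values are kept as strings, so type conversion remains a separate design decision.

```python
import csv
import io


def read_columns(text_stream) -> dict[str, list]:
    """Read CSV text into a dict of column name -> list of raw string values.

    Empty cells become None. Values stay as strings: converting "True" or
    "42" to bool or int is deliberately left out of this sketch.
    """
    reader = csv.reader(text_stream)
    header = next(reader, None)
    if header is None:  # empty input
        return {}
    columns: dict[str, list] = {name: [] for name in header}
    for record in reader:
        if len(record) != len(header):
            raise ValueError(f"line {reader.line_num}: expected {len(header)} fields")
        for name, cell in zip(header, record):
            columns[name].append(cell if cell != "" else None)
    return columns


# Made-up sample rows, read from memory instead of from a file
sample = io.StringIO("Country,City,Is Capital\nItaly,Turin,False\nJapan,,True\n")
print(read_columns(sample))
# {'Country': ['Italy', 'Japan'], 'City': ['Turin', None], 'Is Capital': ['False', 'True']}
```

To read from disk instead, the `csv` documentation recommends opening the file with `newline=""` and passing the file object in place of the `StringIO`.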

In [None]:
from typing import Union
from types import NoneType  # requires Python 3.10+
import base64
import csv
import os
from array import array


In [None]:
class Series:
    __slots__ = ('_data', 'dtype')

    # How binary operations treat None:
    #   'contagious' - a binary operation involving None returns None
    #   'strict'     - a binary operation involving None raises an error
    _noneHandling = 'contagious'
    # Operations for which None propagates in contagious mode
    _contagiousOperations = ('==', '!=', '&', '|', '^')
    # Element types accepted by each operation
    _operationTypeSupport = {
        '==' : {bool, int, float, str},
        '!=' : {bool, int, float, str},
        '&' : {int, bool},
        '|' : {int, bool},
        '^' : {int, bool}
    }
    
    @staticmethod
    def setTypeCheckingMode(mode: str):
        mode = mode.lower()
        if mode not in ("strict", "contagious"):
            raise ValueError(f"{mode} is not a recognised type checking mode, expected one of (strict, contagious)")
        # Store the mode in the attribute that _binary_op actually reads
        Series._noneHandling = mode
    
    def __init__(self, items: list[Union[bool, int, float, str, None]], dtype : type = NoneType, _checkType : bool = True):
        if not _checkType:
            # Trusted fast path: the caller guarantees every item is of dtype or None
            if dtype == NoneType:
                raise ValueError("Must provide dtype when _checkType is False")
            self.dtype = dtype
            self._data = items
            return

        # Start from the declared dtype, or infer it from the first non-None item
        self.dtype = dtype
        self._data = []
        for item in items:
            if item is not None:
                if self.dtype == NoneType:
                    self.dtype = type(item)
                elif type(item) != self.dtype:
                    raise ValueError(f"Provided Series data contains multiple datatypes: found {self.dtype} and {type(item)}")
            self._data.append(item)
    
    def __getitem__(self, i):
        if isinstance(i, slice):
            return Series(self._data[i], dtype=self.dtype, _checkType=False)
        return self._data[i]
    
    def __iter__(self):
        return iter(self._data)
    
    def __len__(self):
        return len(self._data)
    
    def __str__(self):
        return str(self._data)
    
    def __repr__(self):
        return f"Series({self.__str__()}, dtype={self.dtype})"

    def _binary_op(self, b, op, op_name: str):
        def handle(x, y):
            # Contagious mode
            if Series._noneHandling == 'contagious':
                if x is None or y is None:
                    return None if op_name in Series._contagiousOperations else op(x, y)
                return op(x, y)
            
            # Strict mode
            if x is None or y is None:
                raise ValueError(f"Attempted {op_name} with None type while in strict typing mode")
            return op(x, y)
        
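        if isinstance(b, Series):
            # zip() would silently truncate to the shorter operand, so check lengths first
            if len(b) != len(self):
                raise ValueError(f"Series lengths differ: {len(self)} and {len(b)}")
            b_type = b.dtype
            b_data = b._data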
        else:
            b_type = type(b)
            b_data = (b for _ in range(len(self)))
        
        allowed_types = Series._operationTypeSupport[op_name]
        if b_type not in allowed_types or self.dtype not in allowed_types:
            raise ValueError(f"types {b_type}, {self.dtype} not supported for {op_name}")


        new_data = [handle(x, y) for x, y in zip(self._data, b_data)]
        return Series(new_data)

    def __eq__(self, b):  return self._binary_op(b, lambda x, y: x == y, "==")
    def __ne__(self, b):  return self._binary_op(b, lambda x, y: x != y, "!=")
    def __and__(self, b): return self._binary_op(b, lambda x, y: x & y, "&")
    def __or__(self, b):  return self._binary_op(b, lambda x, y: x | y, "|")
    def __xor__(self, b): return self._binary_op(b, lambda x, y: x ^ y, "^")

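    def __invert__(self):
        # Logical NOT for bool data (~True is the integer -2, not False); bitwise NOT otherwise.
        # None stays None.
        negate = (lambda x: not x) if self.dtype == bool else (lambda x: ~x)
        return Series([None if x is None else negate(x) for x in self._data], dtype=self.dtype, _checkType=False)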


Creating example data

In [None]:
CSV_ENCODED_DATA = "Q291bnRyeSxDaXR5LElzIENhcGl0YWwKSXRhbHksVHVyaW4sRmFsc2UKSmFwYW4sS3lvdG8sVHJ1ZQpDYW5hZGEsVG9yb250byxUcnVlCkNhbmFkYSxNb250cmVhbCxUcnVlClNwYWluLFNldmlsbGUsVHJ1ZQpGcmFuY2UsUGFyaXMsVHJ1ZQpKYXBhbixLeW90byxGYWxzZQpTcGFpbixaYXJhZ296YSxGYWxzZQpJdGFseSxSb21lLFRydWUKQXVzdHJhbGlhLFBlcnRoLEZhbHNlCkF1c3RyYWxpYSxCcmlzYmFuZSxUcnVlCkZyYW5jZSxOaWNlLFRydWUKQ2FuYWRhLFZhbmNvdXZlcixGYWxzZQpKYXBhbixOYWdveWEsRmFsc2UKSmFwYW4sS3lvdG8sRmFsc2UKQ2FuYWRhLE1vbnRyZWFsLFRydWUKQ2FuYWRhLE90dGF3YSxGYWxzZQpCcmF6aWwsU8OjbyBQYXVsbyxUcnVlCkF1c3RyYWxpYSxNZWxib3VybmUsVHJ1ZQpBdXN0cmFsaWEsU3lkbmV5LFRydWUK"
CSV_FILE_NAME = "countries_and_cities.csv"
if not os.path.isfile(CSV_FILE_NAME):
    csv_data = base64.b64decode(CSV_ENCODED_DATA)
    with open(CSV_FILE_NAME, 'wb') as f:
        f.write(csv_data)

In [None]:
# your code goes here