# Task A: DataFrames in plain Python 

At Marshall Wace, we make decisions based on data, so we spend a lot of time with the tools for analysing it. One of the building blocks of data engineering is the DataFrame, a tabular structure that organises data into rows and columns. Many datasets fit naturally into this 2D structure, which makes DataFrames incredibly useful for manipulation, analysis, and visualisation.

Many libraries implement this functionality, and Pandas is one of the most commonly used. For today's task, we ask you to reimplement DataFrames in plain Python, thinking about correctness, elegant design, and performance. We'll aim for a minimal implementation that supports the basic operations required for data manipulation, but there is plenty of scope for extensions and optimisations!

Don't worry if you don't know Pandas or feel unsure about programming in general! 

## Requirements 
A `DataFrame` is a 2D tabular data structure. It can be thought of as a collection of named columns, naturally represented by a dictionary mapping names to `Series`.

A `Series` is a 1D collection, comparable to a list or a vector of elements. Comparison and logical operators on `Series` are what make `DataFrame`s so powerful. One thing that frustrates many data engineers using Pandas is how loosely it handles type safety and `None` values, so we encourage you to build your solution with both in mind.
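
As a rough illustration of the dictionary-of-columns idea, the sketch below uses plain lists as stand-ins for `Series`. The helper names `n_rows` and `row` are hypothetical and exist only for this example; a real solution would wrap the same idea in classes.

```python
# Plain lists stand in for Series here; None marks a missing value.
columns = {
    "name": ["Ann", "Joe", None],
    "age": [34, 19, 52],
}


def n_rows(cols):
    """Return the shared column length, rejecting ragged columns."""
    lengths = {len(col) for col in cols.values()}
    if len(lengths) > 1:
        raise ValueError(f"Columns have differing lengths: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


def row(cols, i):
    """Assemble row i as a dict by reading position i of every column."""
    return {name: col[i] for name, col in cols.items()}


print(n_rows(columns))  # 3
print(row(columns, 1))  # {'name': 'Joe', 'age': 19}
```

Storing data by column keeps each column homogeneous, which is what makes per-type `Series` classes and element-wise operators straightforward to write.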

Nothing is more frustrating than wanting to quickly analyse some data and iterate on an approach, but having to wait a couple of minutes every time you run your script. Efficiency is therefore another aspect you should ideally consider, either through explicit optimisations or through comments marking 'hot spots', the parts of the code where the program spends the most time.
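
One possible way to locate hot spots is to time candidate implementations with the standard library's `timeit` module. The sketch below compares two hypothetical ways of writing an element-wise `<`; the function names are made up for the example, and the timings will vary by machine, so no particular winner is assumed.

```python
import timeit

data = list(range(1_000))


def less_than_loop(values, bound):
    """Element-wise `<` using an explicit loop and append."""
    out = []
    for v in values:
        out.append(v < bound)
    return out


def less_than_comprehension(values, bound):
    """Element-wise `<` using a list comprehension."""
    return [v < bound for v in values]


# Both variants must agree before their timings are worth comparing.
assert less_than_loop(data, 500) == less_than_comprehension(data, 500)

for fn in (less_than_loop, less_than_comprehension):
    seconds = timeit.timeit(lambda: fn(data, 500), number=200)
    print(f"{fn.__name__}: {seconds:.4f}s for 200 runs")
```

Measuring before optimising helps keep the effort focused on the code paths that actually dominate the run time.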

## Spec
### Series
We want you to implement `Series` for `str`s, `bool`s, `int`s, and `float`s. This should give us a good range of functionality while keeping the implementation reasonably simple. Feel free to do everything in one class, create a separate class for each type, use inheritance, or whatever you think is best. In terms of functionality of each `Series`:
- you'll want to be able to initialise it with a list of elements, each of which can be of the given type or `None`
- each `Series` should be immutable and operations should return a new `Series` object
- you should be able to read the element at a given index, as well as the length of a `Series`
- you should be able to use equality operators (`==`, `!=`) which return a boolean `Series` with elements equal to the element-wise operator results
- for the pairs of `Series` types where it makes sense, you should implement element-wise boolean operators (`|`, `&`, `~`, `^`) and comparison operators (`<`, `>`, `<=`, `>=`), which also result in a boolean `Series`. Think carefully about how you want to handle `None`. Where appropriate, also add operators which work between a `Series` and a scalar value
    - for instance `[1, 2, 3, 4] < 3` should return something like `[True, True, False, False]`
- for convenience, it would be nice to be able to print the `Series` nicely formatted
- (bonus) you can implement some aggregation methods which are commonly found in data analysis like `sum()`, `count()`, `mean()` or filtering capabilities 
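
To make the list above more concrete, here is a minimal sketch of one possible integer series. The class name `IntSeries`, the helper `_compare`, and the choice to return a plain list rather than a boolean `Series` are all assumptions made to keep the example short. It also assumes that a `None` on either side of a comparison yields `None`.

```python
import operator


class IntSeries:
    """Hypothetical minimal series of ints, with None as the missing value."""

    __slots__ = ("_data",)

    def __init__(self, items):
        items = tuple(items)  # a tuple keeps the stored data immutable
        for item in items:
            # bool is a subclass of int, so it has to be rejected explicitly.
            if item is not None and (isinstance(item, bool) or not isinstance(item, int)):
                raise TypeError(f"Expected int or None, got {type(item).__name__}")
        self._data = items

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        return self._data[i]

    def _compare(self, other, op):
        if isinstance(other, IntSeries):
            if len(other) != len(self):
                raise ValueError("Series lengths differ")
            pairs = zip(self._data, other._data)
        else:
            pairs = ((x, other) for x in self._data)
        # Assumption: a None on either side makes the result unknown (None).
        return [None if x is None or y is None else op(x, y) for x, y in pairs]

    def __lt__(self, other):
        return self._compare(other, operator.lt)

    def __eq__(self, other):
        return self._compare(other, operator.eq)


print(IntSeries([1, 2, 3, 4]) < 3)   # [True, True, False, False]
print(IntSeries([1, None, 3]) == 3)  # [False, None, True]
```

Routing every operator through one helper keeps the `None` policy in a single place, so changing it later touches one line rather than every operator.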

### DataFrame
With a solid `Series` implementation, we can start working on the `DataFrame`:
- you should be able to initialise a `DataFrame` with a dictionary of names and `Series`
- `DataFrame`s should be immutable and operators should return a new `DataFrame` as appropriate
- they should be indexable by a boolean `Series`, which allows you to write code such as `df[(df["name"] != "Joe") & (df["age"] > 21)]`
- for convenience, it would be nice to be able to pretty print the `DataFrame`s in a 2d form with middle rows and columns redacted for readability
- (bonus) you can implement some `DataFrame`-wide aggregation, filtering, pivoting, or any of the common operations which make sense for a 2d table. Get creative!
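
The boolean indexing described above can be sketched with plain dictionaries and lists. The function `filter_rows` and the sample data are hypothetical, and dropping rows whose mask entry is `None` is an assumption rather than a requirement of the task.

```python
def filter_rows(columns, mask):
    """Return new columns keeping only the rows where mask is True."""
    for name, col in columns.items():
        if len(col) != len(mask):
            raise ValueError(
                f"Mask length {len(mask)} != column '{name}' length {len(col)}"
            )
    # `keep is True` drops both False and None (assumption: unknown means excluded).
    return {
        name: [value for value, keep in zip(col, mask) if keep is True]
        for name, col in columns.items()
    }


people = {"name": ["Ann", "Joe", "Bob"], "age": [34, 40, None]}

not_joe = [n != "Joe" for n in people["name"]]
over_21 = [None if a is None else a > 21 for a in people["age"]]
# Element-wise AND that propagates None from either side.
mask = [None if a is None or b is None else a and b for a, b in zip(not_joe, over_21)]

adults = filter_rows(people, mask)
print(adults)  # {'name': ['Ann'], 'age': [34]}
```

Because `filter_rows` builds new lists, the input columns are left untouched, which matches the immutability requirement.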


# Some basic implementation

Below you can find a partial implementation of two classes, `BooleanSeries` and `StringSeries`, which represent series of boolean and string values, respectively. These classes offer basic functionality for creating and comparing series of data, just to give you a taste of what we're looking for. What we have now:

- `BooleanSeries`: Initialises a series of boolean values, with input validation
- `StringSeries`: Initialises a series of string values, with input validation
- Equality comparison (`__eq__`) between two `StringSeries` objects, returning a `BooleanSeries`
- Basic indexing for `StringSeries` using `__getitem__`
- String representation for both classes
- `DataFrame`: Initialises a dataframe of series, with input validation
- `from_csv`: Loads a CSV file into a `DataFrame`
- Basic column manipulation, such as indexing with `__getitem__` and adding columns with `add_column`
- Pretty printing of column-wise data with `__str__`

In [1]:
from typing import Union
import base64
import csv
import os
from array import array

In [97]:
from array import array
from typing import Union

class BooleanSeries:
    __slots__ = ('_data',)
    NONE = 2

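    def __init__(self, items: list[Union[bool, None]], _checkType=True):
        # Values are stored compactly as signed bytes: 0 (False), 1 (True), 2 (None).
        if not _checkType:
            # Trusted internal path: items are already encoded as 0/1/2.
            self._data = array('b', items)
            return

        self._data = array('b', [])
        for item in items:
            # Explicit check instead of isinstance(item, Union[...]),
            # which only works on Python 3.10 and later.
            if item is not None and not isinstance(item, bool):
                raise ValueError(
                    f"Item in Series is not of type Boolean or None, and instead is `{type(item)}`."
                )
            self._data.append(
                BooleanSeries.NONE if item is None else (1 if item else 0)
            )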

    def __getitem__(self, i: Union[int, slice]):
        if isinstance(i, slice):
            # Slicing the underlying array keeps the 0/1/2 encoding intact.
            return BooleanSeries(self._data[i], _checkType=False)
        x = self._data[i]
        return None if x == BooleanSeries.NONE else bool(x)

    def __iter__(self):
        return (None if x == BooleanSeries.NONE else bool(x) for x in self._data)

    def __len__(self):
        return len(self._data)

    def __str__(self):
        return str([None if x == BooleanSeries.NONE else bool(x) for x in self._data])

    def __repr__(self):
        return f"BooleanSeries({self.__str__()})"

    def __eq__(self, b):
        if isinstance(b, BooleanSeries):
            return BooleanSeries(
                [x == y for x, y in zip(self._data, b._data)],
                _checkType=False
            )

        b = BooleanSeries.NONE if b is None else b
        return BooleanSeries(
            [x == b if x != BooleanSeries.NONE else BooleanSeries.NONE for x in self._data],
            _checkType=False
        )

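    def __ne__(self, b):
        if isinstance(b, BooleanSeries):
            return BooleanSeries(
                [x != y for x, y in zip(self._data, b._data)],
                _checkType=False
            )

        b = BooleanSeries.NONE if b is None else b
        # Mirror __eq__: a missing element stays missing instead of comparing as unequal.
        return BooleanSeries(
            [x != b if x != BooleanSeries.NONE else BooleanSeries.NONE for x in self._data],
            _checkType=False
        )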

    def __and__(self, b: Union['BooleanSeries', None, bool]):
        if isinstance(b, BooleanSeries):
            return BooleanSeries(
                [
                    x & y if x != BooleanSeries.NONE and y != BooleanSeries.NONE
                    else BooleanSeries.NONE
                    for x, y in zip(self._data, b._data)
                ],
                _checkType=False
            )

        if isinstance(b, bool):
            return BooleanSeries(
                [x & b if x != BooleanSeries.NONE else BooleanSeries.NONE for x in self._data],
                _checkType=False
            )

        if b is None:
            return BooleanSeries(
                [BooleanSeries.NONE] * len(self._data),
                _checkType=False
            )

        raise ValueError(
            f"Not accepted type for & operation. Expected one of (BooleanSeries, bool, None), got {type(b)}"
        )

    def __or__(self, b: Union['BooleanSeries', None, bool]):
        if isinstance(b, BooleanSeries):
            return BooleanSeries(
                [
                    x | y if x != BooleanSeries.NONE and y != BooleanSeries.NONE
                    else (1 if x == 1 or y == 1 else BooleanSeries.NONE)
                    for x, y in zip(self._data, b._data)
                ],
                _checkType=False
            )

        if isinstance(b, bool):
            return BooleanSeries(
                [
                    x | b if x != BooleanSeries.NONE
                    else (1 if b else BooleanSeries.NONE)
                    for x in self._data
                ],
                _checkType=False
            )

        if b is None:
            return BooleanSeries(
                [x if x else BooleanSeries.NONE for x in self._data],
                _checkType=False
            )

        raise ValueError(
            f"Not accepted type for | operation. Expected one of (BooleanSeries, bool, None), got {type(b)}"
        )

    def __xor__(self, b: Union['BooleanSeries', None, bool]):
        # Unlike |, XOR cannot be decided from one operand alone,
        # so a None on either side always yields None.
        if isinstance(b, BooleanSeries):
            return BooleanSeries(
                [
                    x ^ y if x != BooleanSeries.NONE and y != BooleanSeries.NONE
                    else BooleanSeries.NONE
                    for x, y in zip(self._data, b._data)
                ],
                _checkType=False
            )

        if isinstance(b, bool):
            return BooleanSeries(
                [x ^ b if x != BooleanSeries.NONE else BooleanSeries.NONE for x in self._data],
                _checkType=False
            )

        if b is None:
            return BooleanSeries(
                [BooleanSeries.NONE] * len(self._data),
                _checkType=False
            )

        raise ValueError(
            f"Not accepted type for ^ operation. Expected one of (BooleanSeries, bool, None), got {type(b)}"
        )

    def __invert__(self):
        return BooleanSeries(
            [not x if x != BooleanSeries.NONE else BooleanSeries.NONE for x in self._data],
            _checkType=False
        )
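
A class like the one above has to decide what expressions such as `True | None` or `False & None` should mean. One widely used convention is Kleene's three-valued logic, where `None` reads as "unknown" and a result is definite only when the known operand already settles it. The sketch below writes that convention as plain functions; the `kleene_*` names are hypothetical, and whether to adopt this convention is a design choice.

```python
from itertools import product


def kleene_not(a):
    return None if a is None else not a


def kleene_and(a, b):
    # A single False settles the result, even if the other side is unknown.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def kleene_or(a, b):
    # A single True settles the result, even if the other side is unknown.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def kleene_xor(a, b):
    # XOR always needs both operands, so any unknown makes the result unknown.
    if a is None or b is None:
        return None
    return a != b


# Print the full truth table over True, False and None.
for a, b in product((True, False, None), repeat=2):
    print(
        f"{a!s:>5} {b!s:>5} | and={kleene_and(a, b)!s:>5} "
        f"or={kleene_or(a, b)!s:>5} xor={kleene_xor(a, b)!s:>5}"
    )
```

Writing the truth table out once and testing the operators against it is a cheap way to catch copy-paste slips between `|`, `&`, and `^`.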

Creating example data

In [None]:
CSV_ENCODED_DATA = "Q291bnRyeSxDaXR5LElzIENhcGl0YWwKSXRhbHksVHVyaW4sRmFsc2UKSmFwYW4sS3lvdG8sVHJ1ZQpDYW5hZGEsVG9yb250byxUcnVlCkNhbmFkYSxNb250cmVhbCxUcnVlClNwYWluLFNldmlsbGUsVHJ1ZQpGcmFuY2UsUGFyaXMsVHJ1ZQpKYXBhbixLeW90byxGYWxzZQpTcGFpbixaYXJhZ296YSxGYWxzZQpJdGFseSxSb21lLFRydWUKQXVzdHJhbGlhLFBlcnRoLEZhbHNlCkF1c3RyYWxpYSxCcmlzYmFuZSxUcnVlCkZyYW5jZSxOaWNlLFRydWUKQ2FuYWRhLFZhbmNvdXZlcixGYWxzZQpKYXBhbixOYWdveWEsRmFsc2UKSmFwYW4sS3lvdG8sRmFsc2UKQ2FuYWRhLE1vbnRyZWFsLFRydWUKQ2FuYWRhLE90dGF3YSxGYWxzZQpCcmF6aWwsU8OjbyBQYXVsbyxUcnVlCkF1c3RyYWxpYSxNZWxib3VybmUsVHJ1ZQpBdXN0cmFsaWEsU3lkbmV5LFRydWUK"
CSV_FILE_NAME = "countries_and_cities.csv"
if not os.path.isfile(CSV_FILE_NAME):
    csv_data = base64.b64decode(CSV_ENCODED_DATA)
    with open(CSV_FILE_NAME, 'wb') as f:
        f.write(csv_data)
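
As a starting point for loading the file written above, the following sketch reads CSV text into a dictionary of columns using the standard `csv` module. The names `parse_cell` and `read_columns` are hypothetical, and the inline `SAMPLE` only mimics the file's three-column shape (the row with an empty city is made up to show `None` handling). Treating empty cells as `None` and the literals `True`/`False` as booleans are assumptions.

```python
import csv
import io

SAMPLE = "Country,City,Is Capital\nItaly,Turin,False\nItaly,Rome,True\nSpain,,False\n"


def parse_cell(text):
    """Map '' to None and 'True'/'False' to bool; leave other text as str."""
    if text == "":
        return None
    if text in ("True", "False"):
        return text == "True"
    return text


def read_columns(handle):
    """Read CSV rows from a file-like object into a dict of column lists."""
    reader = csv.reader(handle)
    header = next(reader)
    columns = {name: [] for name in header}
    for record in reader:
        for name, cell in zip(header, record):
            columns[name].append(parse_cell(cell))
    return columns


columns = read_columns(io.StringIO(SAMPLE))
print(columns["City"])        # ['Turin', 'Rome', None]
print(columns["Is Capital"])  # [False, True, False]
```

For the real file, the same function could be given `open(CSV_FILE_NAME, newline="", encoding="utf-8")` instead of the `io.StringIO` wrapper.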

In [None]:
# your code goes here