# Task A: DataFrames in plain Python 

At Marshall Wace, we make decisions based on data, so tools for analysing it are something we spend a lot of time with. One of the building blocks of data engineering is the DataFrame, a tabular structure that organises data into rows and columns. Many datasets fit naturally into this 2d structure, which makes DataFrames incredibly useful for manipulation, analysis, and visualisation.

Many libraries implement this functionality, and one of the most widely used is Pandas. For today's task, we will ask you to reimplement DataFrames in plain Python code, thinking about correctness, elegant design, and performance. We'll aim for a minimal implementation which can achieve the basic operations required for data manipulation, but there is plenty of scope for extensions and optimisations!

Don't worry if you don't know Pandas or feel unsure about programming in general! 

## Requirements 
A `DataFrame` is a 2d tabular data structure. It can be thought of as a collection of named columns, which can be naturally represented by a dictionary of names to `Series`.

A `Series` is a 1d collection which can be likened to a list or a vector of elements. Comparison and logical operators on `Series` are what make `DataFrame`s so powerful. One thing that frustrates many data engineers using Pandas is how loose it can be with type safety and the treatment of `None` values. Thus, we encourage you to build your solution with those in mind.
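
As a minimal sketch of what "type safety" could mean here, the hypothetical helper `validate_items` below (not part of the task's API) checks every element against one expected type and treats `None` as the only permitted missing value. It uses an exact type check because `bool` is a subclass of `int` in Python, so `isinstance(True, int)` is `True`.

```python
from typing import Sequence, Tuple


def validate_items(items: Sequence[object], dtype: type) -> Tuple[object, ...]:
    """Return items as an immutable tuple, or raise if an element is neither dtype nor None."""
    for position, item in enumerate(items):
        if item is None:
            continue  # None is the only accepted marker for a missing value
        # Exact type check (an assumption of this sketch): isinstance would let
        # True/False slip into an int column, because bool subclasses int.
        if type(item) is not dtype:
            raise TypeError(
                f"element {position} is {type(item).__name__}, "
                f"expected {dtype.__name__} or None"
            )
    return tuple(items)


print(validate_items([1, None, 3], int))
```

Whether `True` should be rejected from an integer column is a design choice; this sketch assumes it should.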

Few things are more frustrating than wanting to quickly analyse some data and iterate on an approach, only to wait a couple of minutes every time you run your script. Hence, efficiency is another aspect you'd ideally consider, either through explicit optimisations or through comments flagging 'hot spots', the parts of the code where the program spends most of its time.
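
One way to find a hot spot is simply to measure it. The sketch below uses the standard-library `timeit` module to time two equivalent ways of writing an element-wise comparison; the function names and the data size are illustrative assumptions, and the timings will differ from machine to machine.

```python
import timeit


def less_than_loop(items, threshold):
    """Pre-allocate the result list and fill it by index."""
    result = [None] * len(items)
    for index, item in enumerate(items):
        result[index] = None if item is None else item < threshold
    return result


def less_than_comprehension(items, threshold):
    """Build the same result with a list comprehension."""
    return [None if item is None else item < threshold for item in items]


data = list(range(50_000))

# Both variants must agree before their timings are worth comparing.
assert less_than_loop(data, 25_000) == less_than_comprehension(data, 25_000)

loop_seconds = timeit.timeit(lambda: less_than_loop(data, 25_000), number=5)
comprehension_seconds = timeit.timeit(
    lambda: less_than_comprehension(data, 25_000), number=5
)
print(f"loop: {loop_seconds:.4f}s  comprehension: {comprehension_seconds:.4f}s")
```

Measuring before optimising keeps the effort focused on the code paths that actually dominate the run time.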

## Spec
### Series
We want you to implement `Series` for `string`s, `bool`s, `int`s, and `float`s. This should give us a good range of functionality while keeping the implementation reasonably simple. Feel free to do everything in one class, create a separate class for each, use inheritance, or whatever you think is best. In terms of functionality of each `Series`:
- you'll want to be able to initialise it with a list of elements, each of which can be of the given type or `None`
- each `Series` should be immutable and operations should return a new `Series` object
- you should be able to read the element at a given index, as well as the length of a `Series`
- you should be able to use equality operators (`==`, `!=`) which return a boolean `Series` with elements equal to the element-wise operator results
- for the `Series` pairs which make sense, you should implement element-wise boolean operators (`|`, `&`,  `~`, `^`) and comparison operators (`<`, `>`, `<=`, `>=`) which also result in a boolean `Series`. Think carefully about how you want to handle `None`. Where appropriate, also add operators which work between a `Series` and a variable
    - for instance `[1, 2, 3, 4] < 3` should return something like `[True, True, False, False]`
- for convenience, it would be nice to be able to print the `Series` nicely formatted
- (bonus) you can implement some aggregation methods which are commonly found in data analysis like `sum()`, `count()`, `mean()` or filtering capabilities 
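
For the bonus aggregations, one detail is easy to get wrong: a zero is a real value, while `None` is a missing one. The sketch below works on plain lists with hypothetical helper names (`count`, `total`, `mean`) and assumes missing values are skipped, which is also what pandas does by default.

```python
from typing import Optional, Sequence, Union

Number = Union[int, float]


def count(items: Sequence[Optional[Number]]) -> int:
    """Number of non-missing elements; 0 and 0.0 count, None does not."""
    return sum(1 for item in items if item is not None)


def total(items: Sequence[Optional[Number]]) -> Number:
    """Sum of the non-missing elements."""
    return sum(item for item in items if item is not None)


def mean(items: Sequence[Optional[Number]]) -> Optional[float]:
    """Mean of the non-missing elements, or None when there are none."""
    n = count(items)
    return total(items) / n if n else None


values = [0, None, 4]
print(count(values), total(values), mean(values))  # 2 4 2.0
```

Testing `if item:` instead of `if item is not None:` would drop the zero from the count and report a mean of 4.0 here.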

### DataFrame
With a solid `Series` implementation, we can start working on the `DataFrame`:
- you should be able to initialise a `DataFrame` with a dictionary of names and `Series`
- `DataFrame`s should be immutable and operators should return a new `DataFrame` as appropriate
- they should be indexed by boolean `Series` which allows you to write code such as `df[(df["name"] != "Joe") & (df["age"] > 21)]` 
- for convenience, it would be nice to be able to pretty print the `DataFrame`s in a 2d form with middle rows and columns redacted for readability
- (bonus) you can implement some `DataFrame`-wide aggregation, filtering, pivoting, or any of the common operations which make sense for a 2d table. Get creative!
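
The row-redaction part of the pretty-printing requirement could look roughly like the sketch below. It works on a plain dictionary of lists rather than on `Series`, and the function name `format_table` and its `max_rows` parameter are assumptions made for illustration. Only rows are redacted; redacting middle columns would follow the same pattern.

```python
from typing import Dict, List, Optional


def format_table(columns: Dict[str, List[object]], max_rows: int = 6) -> str:
    """Render columns as text, replacing the middle rows with '...' past max_rows."""
    names = list(columns)
    n_rows = len(columns[names[0]]) if names else 0
    if n_rows <= max_rows:
        shown: List[Optional[int]] = list(range(n_rows))
    else:
        head = max_rows // 2
        tail = max_rows - head
        # None marks the position of the redacted block of rows.
        shown = list(range(head)) + [None] + list(range(n_rows - tail, n_rows))

    def cell(name: str, row: Optional[int]) -> str:
        return "..." if row is None else str(columns[name][row])

    # Widths are computed from the displayed cells only, not the whole column.
    widths = {
        name: max([len(name)] + [len(cell(name, row)) for row in shown])
        for name in names
    }
    lines = ["  ".join(name.ljust(widths[name]) for name in names)]
    for row in shown:
        lines.append("  ".join(cell(name, row).ljust(widths[name]) for name in names))
    return "\n".join(line.rstrip() for line in lines)


table = {"id": list(range(10)), "name": [f"row{i}" for i in range(10)]}
print(format_table(table, max_rows=4))
```

Computing widths from the visible cells only keeps the cost of printing independent of the table size.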


# Some basic implementation

Below you can find a partial implementation of two classes, `BooleanSeries` and `StringSeries`, which represent series of boolean and string values, respectively. These classes offer basic functionality for creating and comparing series of data, just to give you a taste of what we're looking for. What we have now:

- `BooleanSeries`: Initialises a series of boolean values, with input validation
- `StringSeries`: Initialises a series of string values, with input validation
- Equality comparison (`__eq__`) between two `StringSeries` objects, returning a `BooleanSeries`
- Basic indexing for `StringSeries` using `__getitem__`
- String representation for both classes
- `DataFrame`: Initialises a dataframe of series, with input validation
- `from_csv`: Loads a CSV file into a `DataFrame`
- Basic column manipulation, such as indexing with `__getitem__` and adding columns with `add_column`
- Pretty printing of column-wise data with `__str__`

In [99]:
from typing import Type, Union, Callable
import base64
import csv
import os

In [111]:
class BooleanSeries:
    pass

class StringSeries:
    pass

class Series:
    def __init__(self, items: list, dtype: Type):
        self._items = list(items)  # copy, so later changes to the caller's list cannot alter this Series
        self._count = None
        self.dtype = Union[dtype, None]
        
    def __getitem__(self, item):
        return self._items[item]
    
    def __str__(self):
        return f"{self.__class__.__name__}({self._items})"
    
    def __len__(self):
        return len(self._items)
    
    def get_data(self, indices) -> 'Series':
        data_buffer = [None] * len(indices)
        for i, idx in enumerate(indices):
            if idx >= len(self):
                raise IndexError(f"{self.__class__.__name__} has {len(self)} items but attempt to access index {idx}.")
            data_buffer[i] = self._items[idx]
        return type(self)(data_buffer)
    
    def count(self) -> int:
        return self._count
    
    def where(self, cond: BooleanSeries, other=None) -> 'Series':
        if len(self) != len(cond):
            raise ValueError(f"Condition length ({len(cond)}) must match Series length ({len(self)}).")
        
        if isinstance(other, Series):
            if len(self) != len(other):
                raise ValueError(f"Other Series length ({len(other)}) must match self length ({len(self)}).")
            target_type = type(self)
        elif other is not None and not isinstance(other, self.dtype):
            target_type = StringSeries
        else:
            target_type = type(self)
        
        new_items = [None] * len(self)
        for i in range(len(self)):
            if cond[i] is None or not cond[i]:
                value = other[i] if isinstance(other, Series) else other
                new_items[i] = str(value) if target_type == StringSeries and value is not None else value
            else:
                new_items[i] = str(self._items[i]) if target_type == StringSeries and self._items[i] is not None else self._items[i]
        
        return target_type(new_items)

    def mask(self, cond: BooleanSeries, other=None) -> 'Series':
        return self.where(~cond, other)
    
    def __eq__(self, other) -> 'BooleanSeries':
        if isinstance(other, type(self)):
            if len(self) != len(other):
                raise ValueError("Cannot perform comparison (==) on Series of different lengths.")
            
            new_series = [None] * len(self._items)

            for index, item in enumerate(self._items):
                new_series[index] = (item == other[index])
                
            return BooleanSeries(items=new_series)
        elif isinstance(other, self.dtype):
            new_series = [None] * len(self._items)

            for index, item in enumerate(self._items):
                new_series[index] = (item == other)

            return BooleanSeries(items=new_series)
        
        raise TypeError(f"Comparison (==) not supported between `{self.__class__.__name__}` and `{type(other).__name__}`.")

    def __ne__(self, other) -> 'BooleanSeries':
        if isinstance(other, type(self)):
            if len(self) != len(other):
                raise ValueError("Cannot perform comparison (!=) on Series of different lengths.")
            
            new_series = [None] * len(self._items)

            for index, item in enumerate(self._items):
                new_series[index] = (item != other[index])
                
            return BooleanSeries(items=new_series)
        elif isinstance(other, self.dtype):
            new_series = [None] * len(self._items)

            for index, item in enumerate(self._items):
                new_series[index] = (item != other)

            return BooleanSeries(items=new_series)
        
        raise TypeError(f"Comparison (!=) not supported between `{self.__class__.__name__}` and `{type(other).__name__}`.")
    

In [112]:
class BooleanSeries(Series):

    def __init__(self, items):
        count = 0
        for item in items:
            if not isinstance(item, (bool, type(None))):
                raise ValueError(f"Item in BooleanSeries must be a bool or None, not `{type(item).__name__}`.")
            if item is not None:  # count non-None entries (False still counts)
                count += 1
        
        super().__init__(items, bool)
        self._count = count

    def __len__(self):
        return len(self._items)
    
    def __getitem__(self, item):
        return self._items[item]

    def __str__(self):
        return self._items.__str__()
    
    # --- Boolean Operator Implementations ---
    def __invert__(self) -> 'BooleanSeries':
        new_series = [None] * len(self._items)

        for index, item in enumerate(self._items):
            new_series[index] = not item

        return BooleanSeries(new_series)
    
    def __or__(self, other) -> 'BooleanSeries':
        if not isinstance(other, BooleanSeries):
            raise TypeError("Element-wise boolean OR operation is only supported between two BooleanSeries.")
        if len(self) != len(other):
            raise ValueError("Cannot perform Element-wise boolean OR on Series of different lengths.")
            
        new_series = [None] * len(self._items)

        for index, item in enumerate(self._items):
            new_series[index] = (item or other[index])

        return BooleanSeries(new_series)

    def __and__(self, other) -> 'BooleanSeries':
        if not isinstance(other, BooleanSeries):
            raise TypeError("Element-wise boolean AND operation is only supported between two BooleanSeries.")
        if len(self) != len(other):
            raise ValueError("Cannot perform Element-wise boolean AND on Series of different lengths.")
            
        new_series = [None] * len(self._items)

        for index, item in enumerate(self._items):
            new_series[index] = (item and other[index])

        return BooleanSeries(new_series)

    def __xor__(self, other) -> 'BooleanSeries':
        if not isinstance(other, BooleanSeries):
            raise TypeError("Element-wise XOR operation is only supported between two BooleanSeries.")
        if len(self) != len(other):
            raise ValueError("Cannot perform Element-wise XOR on Series of different lengths.")

        new_series = [None] * len(self._items)

        for index, item in enumerate(self._items):
            new_series[index] = (item ^ other[index]) if (item is not None) and (other[index] is not None) else None

        return BooleanSeries(new_series)
    
    def identical(self, other) -> bool:
        if not isinstance(other, BooleanSeries):
            raise TypeError("Identity check is only supported between two BooleanSeries.")
        if len(self) != len(other):
            raise ValueError("Cannot perform Identity check on Series of different lengths.")
        
        for i in range(len(self)):
            if self[i] != other[i]:
                return False
            
        return True
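
The spec asks for careful thought about `None` in boolean operators. One possible convention, shown here as a sketch and not what the class above implements, is the three-valued (Kleene) logic that SQL uses for `NULL`: the result is `None` only when the known operand cannot decide the answer on its own. The function names are hypothetical.

```python
from typing import Optional


def kleene_and(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """False wins over unknown; otherwise an unknown operand makes the result unknown."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def kleene_or(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """True wins over unknown; otherwise an unknown operand makes the result unknown."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def kleene_not(a: Optional[bool]) -> Optional[bool]:
    """Negating an unknown value leaves it unknown."""
    return None if a is None else not a


left = [True, False, None, None]
right = [None, None, True, False]
print([kleene_and(a, b) for a, b in zip(left, right)])  # [None, False, None, False]
print([kleene_or(a, b) for a, b in zip(left, right)])   # [True, None, True, None]
```

Unlike Python's `and`/`or` applied to `None`, these functions give the same result regardless of operand order.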

    


In [113]:
class StringSeries(Series):
    def __init__(self, items):
        count = 0
        for item in items:
            if not isinstance(item, (str, type(None))):
                raise ValueError(f"Item in StringSeries must be a string or None, not `{type(item).__name__}`.")
            if item is not None:  # an empty string is still a value; only None is missing
                count += 1

        super().__init__(items, str)
        self._count = count

In [114]:
class NumericSeries(Series):
    def __init__(self, items, dtype):
        super().__init__(items, dtype)
        self._sum = None
        self._mean = None

    def __lt__(self, other) -> BooleanSeries:
        if isinstance(other, NumericSeries):
            if len(self) != len(other):
                raise ValueError("Cannot perform comparison (<) on Series of different lengths.")
            
            new_series = [None] * len(self._items)

            for index, item in enumerate(self._items):
                new_series[index] = (item < other[index]) if (item is not None) and (other[index] is not None) else None

            return BooleanSeries(items=new_series)
        
        elif isinstance(other, (int, float)):
            new_series = [None] * len(self._items)

            for index, item in enumerate(self._items):
                new_series[index] = (item < other) if (item is not None) else None

            return BooleanSeries(items=new_series)
        
        raise TypeError(f"Comparison (<) not supported between `{self.__class__.__name__}` and `{type(other).__name__}`.")
    
    def __gt__(self, other) -> BooleanSeries:
        if isinstance(other, NumericSeries):
            if len(self) != len(other):
                raise ValueError("Cannot perform comparison (>) on Series of different lengths.")
            
            new_series = [None] * len(self._items)

            for index, item in enumerate(self._items):
                new_series[index] = (item > other[index]) if (item is not None) and (other[index] is not None) else None

            return BooleanSeries(items=new_series)
        
        elif isinstance(other, (int, float)):
            new_series = [None] * len(self._items)

            for index, item in enumerate(self._items):
                new_series[index] = (item > other) if (item is not None) else None

            return BooleanSeries(items=new_series)
        
        raise TypeError(f"Comparison (>) not supported between `{self.__class__.__name__}` and `{type(other).__name__}`.")
    
    def __le__(self, other) -> BooleanSeries:
        if isinstance(other, NumericSeries):
            if len(self) != len(other):
                raise ValueError("Cannot perform comparison (<=) on Series of different lengths.")
            
            new_series = [None] * len(self._items)

            for index, item in enumerate(self._items):
                new_series[index] = (item <= other[index]) if (item is not None) and (other[index] is not None) else None

            return BooleanSeries(items=new_series)
        
        elif isinstance(other, (int, float)):
            new_series = [None] * len(self._items)

            for index, item in enumerate(self._items):
                new_series[index] = (item <= other) if (item is not None) else None

            return BooleanSeries(items=new_series)
        
        raise TypeError(f"Comparison (<=) not supported between `{self.__class__.__name__}` and `{type(other).__name__}`.")
    
    def __ge__(self, other) -> BooleanSeries:
        if isinstance(other, NumericSeries):
            if len(self) != len(other):
                raise ValueError("Cannot perform comparison (>=) on Series of different lengths.")
            
            new_series = [None] * len(self._items)

            for index, item in enumerate(self._items):
                new_series[index] = (item >= other[index]) if (item is not None) and (other[index] is not None) else None

            return BooleanSeries(items=new_series)
        
        elif isinstance(other, (int, float)):
            new_series = [None] * len(self._items)

            for index, item in enumerate(self._items):
                new_series[index] = (item >= other) if (item is not None) else None

            return BooleanSeries(items=new_series)
        
        raise TypeError(f"Comparison (>=) not supported between `{self.__class__.__name__}` and `{type(other).__name__}`.")
    
    def sum(self):
        return self._sum
    def mean(self):
        return self._mean

class IntegerSeries(NumericSeries):
    def __init__(self, items):
        sum_buffer = 0
        count = 0
        for item in items:
            if not isinstance(item, (int, type(None))):
                raise ValueError(f"Item in IntegerSeries must be an integer or None, not `{type(item).__name__}`.")
            if item is not None:  # zeros must be counted; only None is missing
                sum_buffer += item
                count += 1
        super().__init__(items, int)
        self._count = count
        if count:  # sum and mean stay None when every item is None
            self._sum = sum_buffer
            self._mean = self._sum / count

class FloatSeries(NumericSeries):
    def __init__(self, items):
        sum_buffer = 0
        count = 0
        for item in items:
            if not isinstance(item, (float, type(None))):
                raise ValueError(f"Item in FloatSeries must be a float or None, not `{type(item).__name__}`.")
            if item is not None:  # zeros must be counted; only None is missing
                sum_buffer += item
                count += 1
        super().__init__(items, float)
        self._count = count
        if count:  # sum and mean stay None when every item is None
            self._sum = sum_buffer
            self._mean = self._sum / count
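
The four comparison methods above differ only in the operator they apply. As a sketch of one way to remove the repetition, the hypothetical helper `compare_elementwise` below takes the operator as an argument from the standard-library `operator` module; it works on plain lists so that it stands alone.

```python
import operator
from typing import Callable, List, Optional, Sequence, Union

Number = Union[int, float]


def compare_elementwise(
    left: Sequence[Optional[Number]],
    right: Union[Number, Sequence[Optional[Number]]],
    op: Callable[[Number, Number], bool],
) -> List[Optional[bool]]:
    """Apply op element-wise; right is either a scalar or a sequence of equal length."""
    if isinstance(right, (int, float)):
        pairs = ((item, right) for item in left)  # broadcast the scalar
    else:
        if len(left) != len(right):
            raise ValueError("Cannot compare sequences of different lengths.")
        pairs = zip(left, right)
    # A missing value on either side makes the result missing as well.
    return [None if a is None or b is None else op(a, b) for a, b in pairs]


print(compare_elementwise([10, None, 30], 25, operator.lt))  # [True, None, False]
print(compare_elementwise([1, 2], [2, None], operator.ge))   # [False, None]
```

With such a helper, each of `__lt__`, `__gt__`, `__le__`, and `__ge__` could reduce to a single call that passes `operator.lt`, `operator.gt`, `operator.le`, or `operator.ge`.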


In [115]:
import operator
# --- Test Cases ---
print("Running tests...")

# --- 1. Series vs. Scalar Comparisons ---
int_s = IntegerSeries([10, 25, 30, 40])
ops = ['<', '>', '<=', '>=']
tests = [(int_s < 25), (int_s > 25), (int_s <= 10), (int_s >= 10.5)]
expecteds = [BooleanSeries([True, False, False, False]), 
             BooleanSeries([False, False, True, True]), 
             BooleanSeries([True, False, False, False]), 
             BooleanSeries([False, True, True, True])]
passed = 0
for op, test, expected in zip(ops, tests, expecteds):  # `op` avoids shadowing the `ops` list
    if test.identical(expected):
        print(f"Comparison ({op}) succeed")
        passed += 1
    else:
        print(f"Comparison ({op}) failed! Expect: {expected}; got {test}.")

print(f"Passed {passed}/{len(tests)} Series vs. Scalar Comparison tests.")


# --- 2. Series vs. Series Comparisons ---
print("--- Testing Series vs. Series ---")
int_s1 = IntegerSeries([10, 20, 30, 40])
int_s2 = IntegerSeries([10, 21, 29, 40])
float_s = FloatSeries([9.5, 20.0, 30.5, 41.0])
ops = ['<', '>', '<=', '>=']
tests = [(int_s1 < int_s2), (int_s1 > int_s2), (int_s1 <= int_s2), (int_s1 >= float_s)]
expecteds = [
    BooleanSeries([False, True, False, False]),
    BooleanSeries([False, False, True, False]),
    BooleanSeries([True, True, False, True]),
    BooleanSeries([True, True, False, False])
]
passed = 0
for op, test, expected in zip(ops, tests, expecteds):
    if test.identical(expected):
        print(f"Comparison ({op}) succeed")
        passed += 1
    else:
        print(f"Comparison ({op}) failed! Expect: {expected}; got {test}.")
print(f"Passed {passed}/{len(tests)} Series vs. Series Comparison tests.\n")

# --- 3. Handling of None Values ---
print("--- Testing None Handling ---")
s1 = IntegerSeries([10, None, 30, 40, None])
s2 = IntegerSeries([15, 20, None, 40, None])
ops = ['< (scalar)', '>= (series)', '< (series)']
tests = [(s1 < 25), (s1 >= s2), (s2 < s1)]
expecteds = [
    BooleanSeries([True, None, False, False, None]),
    BooleanSeries([False, None, None, True, None]),
    BooleanSeries([False, None, None, False, None])
]
passed = 0
for op, test, expected in zip(ops, tests, expecteds):
    if test.identical(expected):
        print(f"Comparison ({op}) succeed")
        passed += 1
    else:
        print(f"Comparison ({op}) failed! Expect: {expected}; got {test}.")
print(f"Passed {passed}/{len(tests)} None Handling tests.\n")

# --- 4. Error Handling Tests ---
print("--- Testing Error Handling ---")
def test_raises(error, func, *args, **kwargs):
    try:
        func(*args, **kwargs)
        assert False, f"Expected {error.__name__} but no exception was raised."
    except error as e:
        print(f"Correctly raised {error.__name__}: {e}")

# ValueError for different lengths
s_short = IntegerSeries([1, 2])
s_long = IntegerSeries([1, 2, 3])
test_raises(ValueError, operator.lt, s_short, s_long)


print("All tests finished.")

Running tests...
Comparison (<) succeed
Comparison (>) succeed
Comparison (<=) succeed
Comparison (>=) succeed
Passed 4/4 Series vs. Scalar Comparison tests.
--- Testing Series vs. Series ---
Comparison (<) succeed
Comparison (>) succeed
Comparison (<=) succeed
Comparison (>=) succeed
Passed 4/4 Series vs. Series Comparison tests.

--- Testing None Handling ---
Comparison (< (scalar)) succeed
Comparison (>= (series)) succeed
Comparison (< (series)) succeed
Passed 3/3 None Handling tests.

--- Testing Error Handling ---
Correctly raised ValueError: Cannot perform comparison (<) on Series of different lengths.
All tests finished.


In [116]:
string1 = StringSeries(items=["a", "b", "c", None])
string2 = StringSeries(items=["a", "b", "d", None])
result = (string1 == string2)
print(result)
result2 = (string1 == 'b')
print(result2)
int_s1_filtered = int_s1.get_data([0,2,3])
print(int_s1_filtered)

[True, True, False, True]
[False, True, False, False]
IntegerSeries([10, 30, 40])


In [133]:
def check_int(value: str) -> bool:
    if not value: return True
    try:
        int(value)
        return True
    except (ValueError, TypeError):
        return False

def check_float(value: str) -> bool:
    if not value: return True
    try:
        float(value)
        return True
    except (ValueError, TypeError):
        return False
            
class DataFrame:
    def __init__(self, data: dict):
        self._columns = {}
        self._length = 0
        for key, value in data.items():
            if isinstance(value, Series):
                if self._length == 0 or len(value) == self._length:
                    self._length = len(value)
                    self._columns[key] = value
                else:
                    raise ValueError(f"Dimension mismatch at Column {key}; Expecting {self._length}, got {len(value)}")
            else:
                raise ValueError(f"Column '{key}' must be a Series")

    def __getitem__(self, key):
        if isinstance(key, BooleanSeries):
            if len(key) != self._length:
                raise ValueError(f"Dimension mismatch! Key must have the same length as the Dataframe; Expecting {self._length}, got {len(key)}")
            result_index = []
            for index in range(self._length):
                if key[index]:
                    result_index.append(index)
            result_data = {}
            for column, value in self._columns.items():
                result_data[column] = value.get_data(result_index)
            return DataFrame(result_data)
        else:
            return self._columns[key]

    def __setitem__(self, key, value):
        if isinstance(value, Series):
            self._columns[key] = value
        else:
            raise ValueError(f"Column '{key}' must be a Series")
        

    def __str__(self):
        if not self._columns:
            return "Empty DataFrame"
        col_names = list(self._columns.keys())
        # default=0 keeps max() from raising on columns with no rows
        col_widths = {col: max(len(col), max((len(str(item)) for item in self._columns[col]._items), default=0)) for col in col_names}
        header = "  ".join(col.ljust(col_widths[col]) for col in col_names)
        separator = "-" * len(header)
        rows = []
        for i in range(len(next(iter(self._columns.values()))._items)):
            row = "  ".join(str(self._columns[col]._items[i]).ljust(col_widths[col]) for col in col_names)
            rows.append(row)
        return "\n".join([header, separator] + rows)

    def add_column(self, name: str, series):
        if isinstance(series, Series):
            self._columns[name] = series
        else:
            raise ValueError(f"Column '{name}' must be a Series")

    def remove_column(self, name: str):
        if name in self._columns: 
            del self._columns[name]
        else: 
            raise KeyError(f"Column '{name}' not found in DataFrame")

    def get_column_names(self):
        return list(self._columns.keys())

    def get_column(self, name: str):
        return self._columns.get(name)
    
    def agg(self, func: Union[str, Callable], axis: int = 0) -> Union['DataFrame', 'Series']:
        if not self._columns:
            return DataFrame({}) if axis == 0 else IntegerSeries([])

        func_map = {
            'count': lambda s: s.count(),
            'sum': lambda s: s.sum() if hasattr(s, 'sum') else None,
            'mean': lambda s: s.mean() if hasattr(s, 'mean') else None
        }
        
        if isinstance(func, str):
            if func not in func_map:
                raise ValueError(f"Unknown aggregation function: {func}. Supported: {list(func_map.keys())}")
            func_callable = func_map[func]
        elif callable(func):
            func_callable = func
        else:
            raise TypeError(f"func must be a string or callable, not {type(func).__name__}")

        if axis == 0:
            result = {}
            for col, series in self._columns.items():
                agg_result = func_callable(series)
                if agg_result is None:
                    continue
                
                # Create Series type based on result type
                if isinstance(agg_result, int):
                    result_series = IntegerSeries([agg_result])
                elif isinstance(agg_result, float):
                    result_series = FloatSeries([agg_result])
                else:
                    try:
                        result_series = StringSeries([str(agg_result)])
                    except Exception:
                        continue
                
                result[col] = result_series
            return DataFrame(result)
        
        elif axis == 1:
            result = []
            col_names = list(self._columns.keys())
            for i in range(self._length):
                row_values = []
                for col in col_names:
                    val = self._columns[col][i]
                    if func in ('sum', 'mean') and val is not None and isinstance(self._columns[col], NumericSeries):
                        row_values.append(val)
                    elif func == 'count':
                        if val is not None:
                            row_values.append(1) 
                
                if row_values:
                    if func == 'count':
                        agg_result = len(row_values)
                    elif func == 'sum':
                        agg_result = sum(row_values)
                    elif func == 'mean':
                        agg_result = sum(row_values) / len(row_values)
                    else:
                        agg_result = func_callable(row_values)
                else:
                    agg_result = None
                result.append(agg_result)
            
            # Return appropriate Series type
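            if func in ('sum', 'mean') and any(x is not None for x in result):
                # FloatSeries accepts only floats, so integer sums are coerced first
                return FloatSeries([None if x is None else float(x) for x in result])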
            else:
                return IntegerSeries(result)
        
        else:
            raise ValueError("axis must be 0 (columns) or 1 (rows)")
    
    @classmethod
    def from_csv(cls, file_path: str, delimiter: str = ',') -> 'DataFrame':            
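        # Explicit UTF-8 so non-ASCII values (e.g. 'São Paulo') load the same on every platform
        with open(file_path, 'r', newline='', encoding='utf-8') as csvfile: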
            reader = csv.reader(csvfile, delimiter=delimiter)
            header = next(reader)
            # Read all data into a temporary dictionary
            columns = {col: [] for col in header}
            for row in reader:
                for i, value in enumerate(row):
                    columns[header[i]].append(value)
            
            data = {}
            for col, values in columns.items():
                non_empty_vals = [val for val in values if val not in ('', 'none', None)]

                if all(check_int(val) for val in non_empty_vals):
                    int_values = [None if val in ('', 'none', None) else int(val) for val in values]
                    data[col] = IntegerSeries(int_values)
                
                elif all(check_float(val) for val in non_empty_vals):
                    float_values = [None if val in ('', 'none', None) else float(val) for val in values]
                    data[col] = FloatSeries(float_values)

                elif all(val.lower() in ('true', 'false') for val in non_empty_vals):
                    bool_values = [None if val in ('', 'none', None) else val.lower() == 'true' for val in values]
                    data[col] = BooleanSeries(bool_values)

                else:
                    str_values = [None if val in ('', 'none', None) else val for val in values]
                    data[col] = StringSeries(str_values)
            
            return cls(data)
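
The row selection in `__getitem__` first collects the matching indices and then copies each column. A sketch of an alternative, assuming plain lists rather than `Series`, uses `itertools.compress` from the standard library, which keeps the elements whose selector is truthy. The function name `filter_rows` is hypothetical.

```python
from itertools import compress
from typing import Dict, List, Optional


def filter_rows(
    columns: Dict[str, List[object]], mask: List[Optional[bool]]
) -> Dict[str, List[object]]:
    """Return new column lists holding only the rows where mask is True."""
    for name, values in columns.items():
        if len(values) != len(mask):
            raise ValueError(
                f"Column '{name}' has {len(values)} rows but the mask has {len(mask)}."
            )
    # compress drops rows whose selector is falsy, so None is treated like False.
    return {name: list(compress(values, mask)) for name, values in columns.items()}


people = {"name": ["Joe", "Ann", "Bob"], "age": [30, 25, 19]}
print(filter_rows(people, [False, True, None]))  # {'name': ['Ann'], 'age': [25]}
```

Treating a `None` selector as "do not keep the row" matches the behaviour of the `if key[index]:` test in the class above.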

Creating example data

In [134]:
CSV_ENCODED_DATA = "Q291bnRyeSxDaXR5LElzIENhcGl0YWwKSXRhbHksVHVyaW4sRmFsc2UKSmFwYW4sS3lvdG8sVHJ1ZQpDYW5hZGEsVG9yb250byxUcnVlCkNhbmFkYSxNb250cmVhbCxUcnVlClNwYWluLFNldmlsbGUsVHJ1ZQpGcmFuY2UsUGFyaXMsVHJ1ZQpKYXBhbixLeW90byxGYWxzZQpTcGFpbixaYXJhZ296YSxGYWxzZQpJdGFseSxSb21lLFRydWUKQXVzdHJhbGlhLFBlcnRoLEZhbHNlCkF1c3RyYWxpYSxCcmlzYmFuZSxUcnVlCkZyYW5jZSxOaWNlLFRydWUKQ2FuYWRhLFZhbmNvdXZlcixGYWxzZQpKYXBhbixOYWdveWEsRmFsc2UKSmFwYW4sS3lvdG8sRmFsc2UKQ2FuYWRhLE1vbnRyZWFsLFRydWUKQ2FuYWRhLE90dGF3YSxGYWxzZQpCcmF6aWwsU8OjbyBQYXVsbyxUcnVlCkF1c3RyYWxpYSxNZWxib3VybmUsVHJ1ZQpBdXN0cmFsaWEsU3lkbmV5LFRydWUK"
CSV_FILE_NAME = "countries_and_cities.csv"
if not os.path.isfile(CSV_FILE_NAME):
    csv_data = base64.b64decode(CSV_ENCODED_DATA)
    with open(CSV_FILE_NAME, 'wb') as f:
        f.write(csv_data)

In [135]:
df = DataFrame.from_csv(CSV_FILE_NAME)
print("Initial DataFrame:")
print(df)
print("\n")

print("Column names:")
print(df.get_column_names())
print("\n")

print("Removing 'Country' column:")
df.remove_column('Country')
print(df)
print("\n")

print("Accessing 'City' column:")
city_column = df['City']
print(city_column)
print("\n")

print("Cities equal to 'Paris':")
cities_equal_paris = df['City'] == 'Paris'  # scalar comparison; no need to build a full Series
print(cities_equal_paris)
print("\n")


print("Capital Cities:")
capital = df[df['Is Capital'] == True]
print(capital)
print("\n")

Initial DataFrame:
Country    City       Is Capital
--------------------------------
Italy      Turin      False     
Japan      Kyoto      True      
Canada     Toronto    True      
Canada     Montreal   True      
Spain      Seville    True      
France     Paris      True      
Japan      Kyoto      False     
Spain      Zaragoza   False     
Italy      Rome       True      
Australia  Perth      False     
Australia  Brisbane   True      
France     Nice       True      
Canada     Vancouver  False     
Japan      Nagoya     False     
Japan      Kyoto      False     
Canada     Montreal   True      
Canada     Ottawa     False     
Brazil     São Paulo  True      
Australia  Melbourne  True      
Australia  Sydney     True      


Column names:
['Country', 'City', 'Is Capital']


Removing 'Country' column:
City       Is Capital
---------------------
Turin      False     
Kyoto      True      
Toronto    True      
Montreal   True      
Seville    True      
Paris      True    

In [136]:
adult_df = DataFrame({
    'salary': IntegerSeries([60000, 40000, 75000]),
    'name': StringSeries(['Alice', 'Bob', 'Charlie'])
})
print(adult_df)
print('\n')
adult_df['salary'] = adult_df['salary'].mask(adult_df['salary'] < 50000, "Low").mask(adult_df['salary'] >= 50000, "High")
print(adult_df)


salary  name   
---------------
60000   Alice  
40000   Bob    
75000   Charlie


salary  name   
---------------
High    Alice  
Low     Bob    
High    Charlie


In [142]:
CSV_FILE_NAME = "sales.csv"
df = DataFrame.from_csv(CSV_FILE_NAME)
# Are there any missing data?
print(df.agg('count'))
print("\n")
print(df.agg('mean'))
print("\n")
# What is the mean price of red vehicles?
print(df[df['Color'] == 'Red']['Price_USD'].mean())

# What is the total sales volume of 5 Series in Asia?
print(df[(df['Model'] == '5 Series') & (df['Region'] == 'Asia')]['Sales_Volume'].sum())



Model  Year   Region  Color  Fuel_Type  Transmission  Engine_Size_L  Mileage_KM  Price_USD  Sales_Volume  Sales_Classification
------------------------------------------------------------------------------------------------------------------------------
50000  50000  50000   50000  50000      50000         50000          50000       50000      50000         50000               


Year       Engine_Size_L      Mileage_KM    Price_USD   Sales_Volume
--------------------------------------------------------------------
2017.0157  3.247179999999999  100307.20314  75034.6009  5067.51468  


74896.44511402576
3935629
