
# 2020 ML Engineer Home Assignment

This is a Jupyter notebook containing exercises from three fields:
1. Algorithms,
2. Python,
3. Spark.

We think it can be completed within a couple of hours, and we expect you to send back the results within a week (7 days).  
Your submission can take the form of this notebook with your code and any necessary remarks added to it, but we are also fine with other forms, as well as partial solutions.

## Python requirements:

You can use the library versions below to run this notebook.

If you are using Conda to manage your Python environments, you can use a command like the one below:
```
conda create --name vl-home-assignment --file requirements.txt
```

Requirements:
```
pyspark=2.3.0
python=3.6.10
jupyter
```

## Ex. 1: Algorithms

You are given two singly-linked lists (precisely: the references to their heads).
Implement an optimal (both in terms of memory and time) function that returns the intersection node.  

We define intersection based on reference, not value. 
That is, if the k-th element of the first list is exactly the same node reference as the j-th element of the second list, then the lists intersect. 
We expect that your solution will provide your own implementation of the singly-linked lists. 
Please provide a couple of examples testing your solution.

## Solution Ex.1

In [None]:
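# Space for your solution
class Solution(object):
    def getIntersectionNode(self, headA, headB):
        # Two-pointer walk: each pointer traverses its own list and then the
        # other one, so both cover len(A) + len(B) nodes and meet at the
        # intersection node (or at None if the lists never intersect).
        # O(len(A) + len(B)) time, O(1) extra memory.
        curA, curB = headA, headB
        while curA is not curB:  # compare by reference, not by value
            curA = headB if curA is None else curA.next
            curB = headA if curB is None else curB.next

        return curA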
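
The exercise also asks for an own linked-list implementation and a few test examples. A minimal sketch of both follows; `ListNode`, `build_list` and `get_intersection_node` are hypothetical names chosen for illustration, and the function repeats the same two-pointer walk as a standalone function so the block runs on its own.

```python
class ListNode:
    """Minimal singly-linked list node (hypothetical helper)."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def build_list(values, tail=None):
    """Build a list from `values` and attach `tail` (an existing node) at its end."""
    head = tail
    for value in reversed(values):
        head = ListNode(value, head)
    return head


def get_intersection_node(head_a, head_b):
    """Return the first node shared by both lists, or None if there is none."""
    cur_a, cur_b = head_a, head_b
    while cur_a is not cur_b:
        cur_a = head_b if cur_a is None else cur_a.next
        cur_b = head_a if cur_b is None else cur_b.next
    return cur_a


# Example 1: both lists share the tail 8 -> 9.
shared = build_list([8, 9])
list_a = build_list([1, 2, 3], tail=shared)
list_b = build_list([7], tail=shared)
assert get_intersection_node(list_a, list_b) is shared

# Example 2: equal values but distinct nodes are not an intersection.
assert get_intersection_node(build_list([1, 2]), build_list([1, 2])) is None

# Example 3: an empty list never intersects anything.
assert get_intersection_node(None, list_a) is None

# Example 4: one list is a suffix of the other.
assert get_intersection_node(list_a, shared) is shared
```

Example 2 is the case that separates reference equality from value equality: the two lists hold the same values, yet no node object is shared, so the result is `None`.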



## Ex. 2: Python

Given two files, `file.json` and `file.csv`, containing the same mapping, write a small library with loaders for these file formats. 
A loader `l` should have two public members, `l.keys` and `l.data`.
There should be basic filtering functionality built around the `keys` attribute, as in the example below.

Use Python Standard Library for parsing json and csv files.

### Sample files
`file.json`:
```
{
  "alfa": 1,
  "beta": 2
}
```
`file.csv`:
```
alfa,1
beta,2
```
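
To try the loaders locally, the two sample files have to exist first. The following is a minimal sketch that writes them with the standard library and reads them back. The temporary directory is an assumption made so that nothing in the working directory is overwritten, and the `int` conversion assumes integer values as in the samples above.

```python
import csv
import json
import os
import tempfile

mapping = {"alfa": 1, "beta": 2}

# Hypothetical location: a fresh temporary directory.
workdir = tempfile.mkdtemp()
json_path = os.path.join(workdir, "file.json")
csv_path = os.path.join(workdir, "file.csv")

with open(json_path, "w") as f:
    json.dump(mapping, f, indent=2)

# newline="" is what the csv module documentation recommends for file objects.
with open(csv_path, "w", newline="") as f:
    csv.writer(f).writerows(mapping.items())

# Read both files back to confirm that they hold the same mapping.
with open(json_path) as f:
    from_json = json.load(f)
with open(csv_path, newline="") as f:
    # csv.reader yields strings, so values are converted to int here.
    from_csv = {key: int(value) for key, value in csv.reader(f)}

print(from_json == from_csv)  # True
```

Note that `csv.reader` always returns strings, so a CSV loader has to convert values explicitly if its output should equal the JSON loader's.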

### Sample usage
```
from file_loaders import JsonLoader, CsvLoader
loaders: Tuple[Loader] = (
    JsonLoader('file.json'),
    CsvLoader('file.csv')
)
for l in loaders:
    print(f"======== {type(l).__name__} =======")
    print(f"data: {l.data}") # => {'alfa': 1, 'beta': 2}
    print(f"keys: {l.keys}") # => ('alfa', 'beta')
    l.keys = ('alfa')
    print(f"data: {l.data}") # => {'alfa': 1}
    print(f"keys: {l.keys}") # => alfa
```
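
One detail of the sample usage is easy to miss: `('alfa')` is a plain string, not a one-element tuple, which is why the last line prints `alfa`. A short sketch of the difference, with a hypothetical `normalize_keys` helper showing one way a loader could accept both forms:

```python
single = ('alfa')       # just a parenthesised string
one_tuple = ('alfa',)   # the trailing comma makes it a tuple

print(type(single).__name__)     # str
print(type(one_tuple).__name__)  # tuple


def normalize_keys(keys):
    """Hypothetical helper: always return a tuple of key names."""
    return (keys,) if isinstance(keys, str) else tuple(keys)


data = {'alfa': 1, 'beta': 2}
filtered = {k: data[k] for k in normalize_keys(single) if k in data}
print(filtered)  # {'alfa': 1}
```

Without the string check, iterating over `'alfa'` would yield the characters `'a'`, `'l'`, `'f'`, `'a'` rather than the key itself.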

## Solution Ex.2

In [None]:
# file_loaders.py
import csv
import json


class BaseLoader:
    """Shared key-filtering logic; subclasses implement _load()."""

    def __init__(self, path):
        self.path = path
        self._all_data = self._load()        # full mapping, never mutated
        self._keys = tuple(self._all_data)   # default: every key

    def _load(self):
        raise NotImplementedError

    @property
    def keys(self):
        return self._keys

    @keys.setter
    def keys(self, value):
        self._keys = value

    @property
    def data(self):
        # A bare string such as 'alfa' is treated as a single key,
        # not as an iterable of characters.
        wanted = (self._keys,) if isinstance(self._keys, str) else self._keys
        return {k: self._all_data[k] for k in wanted if k in self._all_data}


class JsonLoader(BaseLoader):
    """
    Parse JSON - Convert data from JSON file to Python
    """

    def _load(self):
        with open(self.path, 'r') as f:
            return json.load(f)


class CsvLoader(BaseLoader):
    """
    Parse CSV - Convert two-column `key,value` rows from csv file to Python
    """

    def _load(self):
        with open(self.path, 'r', newline='') as f:
            # Assumption: values are integers, as in the sample file, so they
            # are converted to match the JSON loader's output.
            return {row[0]: int(row[1]) for row in csv.reader(f) if row}

In [None]:
# test.py
from typing import Tuple

# JsonLoader and CsvLoader come from the previous cell; in a separate script use
# `from file_loaders import JsonLoader, CsvLoader`.
# file.json and file.csv must exist in the working directory.
loaders: Tuple[JsonLoader, CsvLoader] = (
    JsonLoader('file.json'),
    CsvLoader('file.csv')
)
for l in loaders:
    print(f"======== {type(l).__name__} =======")
    print(f"data: {l.data}")  # => {'alfa': 1, 'beta': 2}
    print(f"keys: {l.keys}")  # => ('alfa', 'beta')
    l.keys = 'alfa'
    print(f"data: {l.data}")  # => {'alfa': 1}
    print(f"keys: {l.keys}")  # => alfa


## Ex. 3: Spark

Below are two Spark dataframes representing 1) people and 2) hobbies.
You are given two tasks.
1. Show how many hobbies each person has.
2. Show all hobbies that no one has.
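
The two tasks amount to a grouped count and an anti-join. As a Spark-independent cross-check, here is a plain-Python sketch on hypothetical rows. The `(id, name)` and `(id, hobby)` layouts are inferred from the result tables in this notebook, and every hobby name except `Soccer` is invented for illustration.

```python
from collections import Counter

# Hypothetical rows: only the names, the counts and 'Soccer' appear in this notebook.
persons = [(1, 'Mary'), (2, 'John')]
hobbies = [(1, 'Chess'), (1, 'Hiking'), (1, 'Reading'), (2, 'Cycling'), (3, 'Soccer')]

# Task 1: hobbies per person (a person without hobbies would get 0).
hobby_counts = Counter(person_id for person_id, _ in hobbies)
per_person = {name: hobby_counts[person_id] for person_id, name in persons}
print(per_person)  # {'Mary': 3, 'John': 1}

# Task 2: hobbies whose id matches no person.
person_ids = {person_id for person_id, _ in persons}
unclaimed = [hobby for person_id, hobby in hobbies if person_id not in person_ids]
print(unclaimed)  # ['Soccer']
```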




In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

# global spark session
spark = (SparkSession
         .builder
         .appName('MLEngineerSpark')
         .master('local[2]')
         .getOrCreate())



## Solution Ex.3
 1. Show how many hobbies each person has.

In [None]:
# Space for your solution
new_df = persons_df.join(hobbies_df, on=['id'], how='left_outer')
# Drop persons without any hobby, so they are not counted as having one.
new_df = new_df.filter(col("hobby").isNotNull())
(new_df.groupBy("id", "name")
    .agg(count(col("id")).alias("count_hobby"))
    .orderBy("id", "name")
    .show())

+---+----+-----------+
| id|name|count_hobby|
+---+----+-----------+
|  1|Mary|          3|
|  2|John|          1|
+---+----+-----------+



2. Show all hobbies that no one has.

In [None]:
# Space for your solution
another_df = hobbies_df.join(persons_df, on=['id'], how='left_outer')
another_df.filter(col("name").isNull()).show()

+---+------+----+
| id| hobby|name|
+---+------+----+
|  3|Soccer|null|
+---+------+----+

