# The Relational Model in Python

Copyright Jens Dittrich & Marcel Maltry, [Big Data Analytics Group](https://bigdata.uni-saarland.de/), [CC-BY-SA](https://creativecommons.org/licenses/by-sa/4.0/legalcode)

In [1]:
from ra.utils import load_csv
from ra.relation import Relation

## The `Relation` class

The class `Relation` lives in `ra.relation` and provides the following methods:

* `add_tuple(tup)`: Adds the tuple `tup` if the tuple's schema is valid.
* Several print methods that are showcased below for the IMDb dataset.

**Remember:** Neither the order of rows nor the order of columns carries any meaning in a relation!
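
The reminder above can be made concrete with plain Python sets. The following is only a minimal sketch: it does not use the `Relation` class, and the helper `as_named` is a hypothetical name introduced for illustration.

```python
# Hypothetical stand-in for illustration; not the ra.relation.Relation class.

# A relation instance as a set of tuples: duplicates collapse, order is irrelevant.
r1 = {(2, 'Hello'), (7, 'World'), (1, '!')}
r2 = {(1, '!'), (2, 'Hello'), (7, 'World'), (2, 'Hello')}
same_rows = (r1 == r2)  # same tuples, different insertion order

# Column order carries no meaning either once values are addressed by attribute name.
def as_named(rows, attrs):
    """Represent each tuple as a frozenset of (attribute, value) pairs."""
    return {frozenset(zip(attrs, row)) for row in rows}

# The same data with the two columns swapped.
swapped = {(name, id_) for (id_, name) in r1}
same_columns = as_named(r1, ('id', 'name')) == as_named(swapped, ('name', 'id'))

print(same_rows, same_columns)
```

Both comparisons evaluate to `True`: the sets contain the same tuples regardless of insertion order, and naming the attributes makes the column order irrelevant.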

In [2]:
foo = Relation('foo', [('id', int), ('name', str)])
foo.add_tuple( (2,'Hello') )
foo.add_tuple( (7,'World') )
foo.add_tuple( (1,'!') )

foo

---
foo
--------------
id     name
--------------
7      World  
1      !      
2      Hello  

## IMDb

In the following, we introduce `Relation`'s print methods using the IMDb dataset from the lecture. We first have to import the data from CSV files.
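
As a rough idea of what importing a tab-delimited file involves, here is a minimal sketch using Python's standard `csv` module. It is not the implementation of `ra.utils.load_csv`; the function `parse_rows`, the sample data, and the assumption that the first line is a header are all hypothetical.

```python
import csv
import io

# Hypothetical sample data; the real files live in data/IMDb_sample.
sample = "id\tfirst_name\tlast_name\n1\tAda\tLovelace\n2\tAlan\tTuring\n"

def parse_rows(text, types, delimiter='\t'):
    """Parse delimited text with a header line into (header, typed tuples)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)  # assumption: the first line names the attributes
    # convert every field with the type given for its column
    rows = [tuple(t(v) for t, v in zip(types, row)) for row in reader]
    return header, rows

header, rows = parse_rows(sample, (int, str, str))
print(header)
print(rows)
```

Converting each field with its column type is what lets a schema check such as the one in `add_tuple` succeed afterwards, since a CSV reader yields only strings.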

In [3]:
from os import listdir

# Data source: https://relational.fit.cvut.cz/dataset/IMDb
# Information courtesy of IMDb (http://www.imdb.com). Used with permission.
#
# Notice: The data can only be used for personal and non-commercial use and must not
# be altered/republished/resold/repurposed to create any kind of online/offline
# database of movie information (except for individual personal use).

path = 'data/IMDb_sample'  
# create a list of all files in that directory that end with "*.csv":
files = [file for file in listdir(path) if file.endswith('.csv')]
print(f'List of csv files: {files}')

List of csv files: ['movies_directors.csv', 'actors.csv', 'directors.csv', 'movies_genres.csv', 'directors_genres.csv', 'movies.csv', 'roles.csv']


In [4]:
# load all relations from csv files
relations = list()
for file in files:
    print("Reading {} ...".format(file))  # print currently parsed file
    filepath = path + '/' + file  # prepend the path to the file name
    name = file[:-4]  # removes .csv file ending and takes filename as relation name
    relation = load_csv(filepath, name, delimiter='\t')
    relations.append(relation)

# add relations into a dictionary such that they can be accessed by their name
relations_dict = {}
for rel in relations:
    relations_dict[rel.name] = rel
    # in addition, create a separate variable name for each relation:
    globals()[rel.name] = rel

Reading movies_directors.csv ...
Reading actors.csv ...
Reading directors.csv ...
Reading movies_genres.csv ...
Reading directors_genres.csv ...
Reading movies.csv ...
Reading roles.csv ...


In [5]:
# now each relation has become a global variable:
actors

------
actors
--------------------------------------------------------------------------------------------
id                     first_name             last_name              gender
--------------------------------------------------------------------------------------------
717432                 Lana                   McKissack              F                      
621910                 Bridget                Fonda                  F                      
490228                 David (I)              Vaughan                M                      
209799                 Adolf                  Hitler                 M                      
661430                 Elizabeth              Inglis                 F                      
510502                 Shane                  Wilder                 M                      
773658                 Rochelle               Rose                   F                      
259066                 Stanley        

In [6]:
directors

---------
directors
------------------------------------
id          first_name  last_name
------------------------------------
43095       Stanley     Kubrick     
11652       James (I)   Cameron     
78273       Quentin     Tarantino   

Evaluating a relation like this prints its schema and the first ten rows.

Alternatively, you may call `print_table()` directly. Optionally, we can limit the number of rows.

In [7]:
maxRowsLimit = 10
actors.print_table(maxRowsLimit)

------
actors
--------------------------------------------------------------------------------------------
id                     first_name             last_name              gender
--------------------------------------------------------------------------------------------
717432                 Lana                   McKissack              F                      
621910                 Bridget                Fonda                  F                      
490228                 David (I)              Vaughan                M                      
209799                 Adolf                  Hitler                 M                      
661430                 Elizabeth              Inglis                 F                      
510502                 Shane                  Wilder                 M                      
773658                 Rochelle               Rose                   F                      
259066                 Stanley        

As an alternative, a set representation of the relation can be printed using the `print_set()` method.

In [8]:
directors.print_set(3)

[directors] : {[id:int, first_name:str, last_name:str]}
{
	(43095, Stanley, Kubrick),
	(11652, James (I), Cameron),
	(78273, Quentin, Tarantino)
}


As a third alternative, a LaTeX representation of the relation can be printed using the `print_latex()` method.

In [9]:
directors.print_latex()

\definecolor{tableheadercolor}{rgb}{0.8,0.8,0.8}\begin{tabular}{|l|l|l|}\hline
\multicolumn{3}{|l|}{\cellcolor{tableheadercolor}{\textbf{directors}}}\\\hline
	\cellcolor{tableheadercolor}{\textbf{id}} & \cellcolor{tableheadercolor}{\textbf{first\textunderscore name}} & \cellcolor{tableheadercolor}{\textbf{last\textunderscore name}} \\
	\hline\hline
	43095 & Stanley & Kubrick \\
	11652 & James (I) & Cameron \\
	78273 & Quentin & Tarantino \\
\hline
\end{tabular}


Here is an overview of all relations in the IMDb dataset.

In [10]:
for rel in relations:
    rel.print_set(3)

[movies_directors] : {[director_id:int, movie_id:int]}
{
	(43095, 121538),
	(78273, 223710),
	(78273, 118367)
}
[actors] : {[id:int, first_name:str, last_name:str, gender:str]}
{
	(717432, Lana, McKissack, F),
	(621910, Bridget, Fonda, F),
	(490228, David (I), Vaughan, M)
}
[directors] : {[id:int, first_name:str, last_name:str]}
{
	(43095, Stanley, Kubrick),
	(11652, James (I), Cameron),
	(78273, Quentin, Tarantino)
}
[movies_genres] : {[movie_id:int, genre:str]}
{
	(10934, Documentary),
	(328277, Drama),
	(65764, Drama)
}
[directors_genres] : {[director_id:int, genre:str, prob:float]}
{
	(43095, Drama, 0.625),
	(78273, Action, 0.5),
	(11652, Adventure, 0.166667)
}
[movies] : {[id:int, name:str, year:int, rank:float]}
{
	(177019, Killing, The, 1956, 8.1),
	(328277, Terminator 2: Judgment Day, 1991, 8.1),
	(1711, 2001: A Space Odyssey, 1968, 8.3)
}
[roles] : {[actor_id:int, movie_id:int, role:str]}
{
	(128257, 105938, Karl Kuhn),
	(257870, 322652, Cyberdyne Video Host),
	(363245, 176711

# Exercise

Extend class `Relation` to support keys and check for duplicates of keys when adding tuples:

In [11]:
# upload the contents of this cell to our CMS as a text file

# a relation subclass respecting key constraints:
class KeyRelation(Relation):
    # keys: names of the key attributes as a list
    def __init__(self, name, schema, keys):
        super().__init__(name, schema)
        
        # assert that the key attributes are a subset of the relation's attributes:
        assert set(keys) <= set(self.attributes)
        # make sure that at least one key attribute is defined:
        assert len(keys) >= 1
        
        # add your code here!
        # ...
        # initialize data structures that are required
        # to check the key constraint for new tuples
        pass
        
    def add_tuple(self, tup):
        # add your code here!
        # ...
        # check if there is a tuple with the same key in the relation
        # only insert it using super().add_tuple(tup) if there is not.
        # raise a ValueError if the key is already present.
        # Make sure to perform your check in O(1) time!
        pass
        
    def print_schema(self):
        super().print_schema()
        # add your code here!
        # ...
        # should also print the key attributes
        pass
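
The O(1) requirement points toward a hash-based lookup on the key attributes. The following is a minimal sketch of that general technique on a toy container, not a solution for `KeyRelation`: the class `KeyedRows` and its attributes are hypothetical and deliberately independent of `Relation`.

```python
class KeyedRows:
    """Hypothetical toy container; not derived from ra.relation.Relation."""

    def __init__(self, attributes, keys):
        self.attributes = list(attributes)
        # positions of the key attributes within each tuple
        self.key_positions = [self.attributes.index(k) for k in keys]
        self.rows = set()
        self.seen_keys = set()  # hash set of key projections

    def add(self, tup):
        # project the tuple onto its key attributes
        key = tuple(tup[i] for i in self.key_positions)
        if key in self.seen_keys:  # expected O(1) membership test
            raise ValueError(f'duplicate key {key}')
        self.seen_keys.add(key)
        self.rows.add(tup)

bar = KeyedRows(['a', 'b', 'c', 'd'], ['a', 'c'])
bar.add((1, 2, 1, 3))
bar.add((1, 3, 2, 1))
try:
    bar.add((1, 9, 1, 9))  # same (a, c) = (1, 1) as the first tuple
    rejected = False
except ValueError:
    rejected = True

print(len(bar.rows), rejected)
```

Keeping the key projections in a separate set avoids scanning all stored tuples on every insert; the membership test takes expected constant time because tuples of hashable values are themselves hashable.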

### Unit Test for Relation

Note that test cases are by no means exhaustive!

In [12]:
import unittest

class RelationTest(unittest.TestCase):

    def setUp(self):
        self.foo = Relation('foo', [('id', int), ('name', str)])
        self.foo.add_tuple( (2,'Hello') )
        self.foo.add_tuple( (7,'World') )
        self.foo.add_tuple( (1,'!') )

        self.bar = Relation('bar', [('a', int), ('b', int), ('c', int), ('d', int)])
        self.bar.add_tuple( (1, 2, 3, 4) )
        self.bar.add_tuple( (2, 2, 3, 4) )
        self.bar.add_tuple( (3, 2, 3, 4) )
        self.bar.add_tuple( (4, 2, 3, 4) )
        self.bar.add_tuple( (5, 2, 3, 4) )
        
    def test_size(self):
        # foo should contain 3 tuples
        self.assertEqual(len(self.foo), 3)
        # check valid insert
        self.assertTrue(self.foo.add_tuple( (3, '?') ))
        self.assertEqual(len(self.foo), 4)
        # check duplicate insert
        self.assertFalse(self.foo.add_tuple( (1,'!') ))
        self.assertEqual(len(self.foo), 4)
        
        # bar should contain 5 tuples
        self.assertEqual(len(self.bar), 5)
        # check valid insert
        self.assertTrue(self.bar.add_tuple( (6, 2, 3, 4) ))
        self.assertEqual(len(self.bar), 6)
        # check duplicate insert
        self.assertFalse(self.bar.add_tuple( (5, 2, 3, 4) ))
        self.assertEqual(len(self.bar), 6)
    
    def test_schema(self):
        # incorrectly typed tuple
        with self.assertRaises(AssertionError):
            self.foo.add_tuple( ('wrong order', 42) )
        with self.assertRaises(AssertionError):
            self.foo.add_tuple( (0.1, 'wrong type') )
        # incorrectly sized tuples
        with self.assertRaises(AssertionError):
            self.foo.add_tuple( (6, 'wrong size', 12) )
        with self.assertRaises(AssertionError):
            self.foo.add_tuple( (42,) )
        
        # incorrectly typed tuple
        with self.assertRaises(AssertionError):
            self.bar.add_tuple( (0.1, 0.2, 0.3, 0.4) )
        with self.assertRaises(AssertionError):
            self.bar.add_tuple( ('1', '3', '2', '4') )
        # incorrectly sized
        with self.assertRaises(AssertionError):
            self.bar.add_tuple( (1, 2, 4, 5, 6) )
        with self.assertRaises(AssertionError):
            self.bar.add_tuple( (1, 2, 4) )

### Unit Test for KeyRelation

Note that test cases are by no means exhaustive!

In [13]:
class KeyRelationTest(unittest.TestCase):
    
    def setUp(self):
        keys = ['id']
        self.foo = KeyRelation('foo', [('id', int), ('name', str)], keys)
        self.foo.add_tuple( (1, 'first') )
        self.foo.add_tuple( (2, 'second') )
        self.foo.add_tuple( (3, 'third') )
        
        keys = ['a', 'c']
        self.bar = KeyRelation('bar', [('a', int), ('b', int), ('c', int), ('d', int)], keys)
        self.bar.add_tuple( (1, 2, 1, 3) )
        self.bar.add_tuple( (1, 3, 2, 1) )
        self.bar.add_tuple( (2, 3, 2, 1) )
        self.bar.add_tuple( (2, 3, 1, 2) )
        
    def test_size(self):
        # foo should contain 3 tuples
        self.assertEqual(len(self.foo), 3)
        # check valid insert
        self.foo.add_tuple( (4, 'fourth') )
        self.assertEqual(len(self.foo), 4)
        # check duplicate key insert
        with self.assertRaises(ValueError):
            self.foo.add_tuple( (1, 'one') ) # should raise ValueError  
        self.assertEqual(len(self.foo), 4)  # should not add tuple
        # check duplicate tuple insert
        with self.assertRaises(ValueError):
            self.foo.add_tuple( (1,'first') )  # should raise ValueError
        self.assertEqual(len(self.foo), 4)  # should not add tuple
        
        # bar should contain 4 tuples
        self.assertEqual(len(self.bar), 4)
        # check valid insert
        self.bar.add_tuple( (3, 1, 2, 3) )
        self.assertEqual(len(self.bar), 5)
        # check duplicate key insert
        with self.assertRaises(ValueError):
            self.bar.add_tuple( (1, 3, 1, 2) )  # should raise ValueError
        self.assertEqual(len(self.bar), 5)  # should not add tuple
        # check duplicate tuple insert
        with self.assertRaises(ValueError):
            self.bar.add_tuple( (2, 3, 1, 2) )  # should raise ValueError
        self.assertEqual(len(self.bar), 5)  # should not add tuple
    
    def test_schema(self):
        # incorrectly typed tuple
        with self.assertRaises(AssertionError):
            self.foo.add_tuple( ('seventh', 7) )
        with self.assertRaises(AssertionError):
            self.foo.add_tuple( (0.1, 'zero point first') )
        # incorrectly sized tuples
        with self.assertRaises(AssertionError):
            self.foo.add_tuple( (42, 'oops', 12) )
        with self.assertRaises(AssertionError):
            self.foo.add_tuple( (43,) )
        
        # incorrectly typed tuple
        with self.assertRaises(AssertionError):
            self.bar.add_tuple( (0.1, 0.2, 0.3, 0.4) )
        with self.assertRaises(AssertionError):
            self.bar.add_tuple( ('1', '3', '2', '4') )
        # incorrectly sized
        with self.assertRaises(AssertionError):
            self.bar.add_tuple( (1, 2, 4, 5, 6) )
        with self.assertRaises(AssertionError):
            self.bar.add_tuple( (1, 2, 4) )

In [14]:
# Run the unit test without shutting down the jupyter kernel
unittest.main(argv=['ignored', '-v'], verbosity=2, exit=False)

test_schema (__main__.KeyRelationTest) ... FAIL
test_size (__main__.KeyRelationTest) ... FAIL
test_schema (__main__.RelationTest) ... ok
test_size (__main__.RelationTest) ... ok

FAIL: test_schema (__main__.KeyRelationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython-input-13-fc8a6f390d61>", line 49, in test_schema
    self.foo.add_tuple( ('seventh', 7) )
AssertionError: AssertionError not raised

FAIL: test_size (__main__.KeyRelationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython-input-13-fc8a6f390d61>", line 19, in test_size
    self.assertEqual(len(self.foo), 3)
AssertionError: 0 != 3

----------------------------------------------------------------------
Ran 4 tests in 0.003s

FAILED (failures=2)


<unittest.main.TestProgram at 0x12db11460>
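
While working on the exercise, it can be convenient to run a single test class instead of every test defined in the notebook. One way to do this, sketched here with a hypothetical `DemoTest` class, is to load a suite explicitly and pass it to a `TextTestRunner`; like the call above, this does not shut down the kernel.

```python
import io
import unittest

class DemoTest(unittest.TestCase):  # hypothetical test case for illustration
    def test_addition(self):
        self.assertEqual(1 + 1, 2)

# Load only the tests of DemoTest and run them; nothing calls sys.exit().
suite = unittest.TestLoader().loadTestsFromTestCase(DemoTest)
stream = io.StringIO()  # collect the runner's report instead of writing to stderr
result = unittest.TextTestRunner(stream=stream, verbosity=2).run(suite)

print(result.testsRun, result.wasSuccessful())
```

Replacing `DemoTest` with `KeyRelationTest` would restrict the run to the tests for the exercise; omitting the `stream` argument sends the report to standard error, as in the cell above.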