#### Introduction to relations and relational algebra
CS 236 <br>
Fall 2023

Michael A. Goodrich <br>
Brigham Young University <br>
March 2023
***

Consider the following example from the class slides:
    _Relational Algebra Part 1_

* $R\subset A\times B$
* $R=\{(a,1),(b,2),(c,3),(d,4)\}$

The relation $R$ is represented as a set of mathematical tuples, which uses the textbook's notation. 

In relational databases and in class, the same relation is represented as a table:

| char | int | 
| :-: | :-: | 
| $a$ | $1$ |
| $b$ | $2$ | 
| $c$ | $3$ | 
| $d$ | $4$ | 

The first row of the table is the __header__. You can think of the elements of the header as assigning a name to each column. For example, the first column contains elements from the set $A$. The header contains information about the sets from which the Cartesian product is formed, $R\subset A\times B$.

Below the header are rows containing __tuples__. The tuples are the elements of the __set__ that defines the relation, $R=\{(a,1),(b,2),(c,3),(d,4)\}$.

Just as the order of elements in a set doesn't matter,

* $\{(a,1),(b,2),(c,3),(d,4)\} = \{(d,4),(b,2),(a,1),(c,3)\}$

the order of the rows in the table doesn't matter (except for the header). The following table represents the same relation as the table above.

| char | int | 
| :-: | :-: | 
| $d$ | $4$ | 
| $a$ | $1$ |
| $b$ | $2$ | 
| $c$ | $3$ | 
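
The same check can be made directly in Python, whose built-in set type also ignores ordering. This is only a small sketch; the variable names `rows_first` and `rows_second` are made up for illustration.

```python
# Two sets built from the same tuples, written in a different order
rows_first = {('a', 1), ('b', 2), ('c', 3), ('d', 4)}
rows_second = {('d', 4), ('b', 2), ('a', 1), ('c', 3)}

# Set equality ignores the order in which the elements were written
same_relation = rows_first == rows_second
print(same_relation)  # True
```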


***

Let's construct a class that represents a relation.  The instance variables of this class need to match the parts of the table. Thus, there needs to be a variable representing the header and another variable representing the set of tuples.

In [13]:
from tabulate import tabulate # requires the tabulate module
### I like using the tabulate package, but I had to install it
### to make this work, so I've included a for loop that does
### the same thing with just a little less formatting.
### Within the vscode terminal, the command to install is "pip3 install tabulate"

class Relation:
    def __init__(self, relation_name, relation_header, set_of_tuples):
        self.name = relation_name # I threw in a string for the relation name for fun
        self.header = relation_header
        self.set_of_tuples = set_of_tuples

    def toString(self):
        print("The relation name is ", self.name)
        print(tabulate(self.set_of_tuples,headers = self.header,tablefmt = 'fancy_grid'))
        # Uncomment the lines below if you don't want to install tabulate
        #print(self.header)
        #for my_tuple in self.set_of_tuples:
        #    print(my_tuple)
    
R = Relation(relation_name = "R",relation_header = ('char','int'), set_of_tuples = {('a',1),('b',2),('c',3),('d',4)})
R.toString()

Q = Relation('Q',('C','D'),set()) # set() is the empty set; a bare {} would create an empty dict instead
Q.toString()


The relation name is  R
╒════════╤═══════╕
│ char   │   int │
╞════════╪═══════╡
│ b      │     2 │
├────────┼───────┤
│ d      │     4 │
├────────┼───────┤
│ a      │     1 │
├────────┼───────┤
│ c      │     3 │
╘════════╧═══════╛
The relation name is  Q
╒═════╤═════╕
│ C   │ D   │
╞═════╪═════╡
╘═════╧═════╛


The top row in each of the tables above consists of the relation header. The remaining rows represent the set of tuples contained in the relation. Each line contains a unique tuple.

The order of the tuples in the set of tuples doesn't matter since sets are not ordered, so the relation R2 defined below is the same as R defined above.

In [14]:
R2 = Relation(relation_name = "R2",relation_header = ('char','int'), set_of_tuples = {('b',2),('a',1),('c',3),('d',4)})
R2.toString()


The relation name is  R2
╒════════╤═══════╕
│ char   │   int │
╞════════╪═══════╡
│ b      │     2 │
├────────┼───────┤
│ d      │     4 │
├────────┼───────┤
│ a      │     1 │
├────────┼───────┤
│ c      │     3 │
╘════════╧═══════╛


Interestingly, my machine prints the tuples of R and the tuples of R2 in the same row order, even though they were written in different orders.

Note that equality of the two relations is not defined the way we think it should be in Python.

In [15]:
print(R==R2)

False


The reason is that R and R2 are two different instances of the class. By default, Python's == operator on objects checks whether they are the very same instance, so even though the variables within the two instances hold the same values, the instances are not considered equal. When we print out the information about each instance, we see that the objects have different addresses:

In [16]:
print(R)
print(R2)

<__main__.Relation object at 0x109641750>
<__main__.Relation object at 0x1094b2cd0>
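
As a quick sketch of what is going on: when a class doesn't define `__eq__`, Python falls back to comparing object identity, so `==` behaves like `is`. The class `Plain` below is a made-up stand-in for the Relation class, not part of the class notes.

```python
class Plain:
    """Hypothetical class with no __eq__ method defined."""
    def __init__(self, value):
        self.value = value

a = Plain(1)
b = Plain(1)

same_contents = a.value == b.value   # the stored data matches
equal_objects = a == b               # default == compares identity
same_object = a is b                 # a and b are two distinct instances
equal_to_self = a == a               # an object is always equal to itself

print(same_contents, equal_objects, same_object, equal_to_self)
# True False False True
```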


But we can define an equality operator for the class within the class definition. See

https://stackoverflow.com/questions/1227121/compare-object-instances-for-equality-by-their-attributes

Let's redefine the class

In [17]:
class Relation:
    def __init__(self, relation_name, relation_header, set_of_tuples):
        self.name = relation_name # I threw in a string for the relation name for fun
        self.header = relation_header
        self.set_of_tuples = set_of_tuples

    def toString(self):
        print("The relation name is ", self.name)
        print(tabulate(self.set_of_tuples,headers = self.header,tablefmt = 'fancy_grid'))
        #print(self.header)
        #for my_tuple in self.set_of_tuples:
        #    print(my_tuple)

    def __eq__(self,other):
        # First, check whether the thing passed to the equality method is the same type
        if not isinstance(other, Relation):
            # don't attempt to compare against unrelated types
            raise ValueError
        # Second, return true only if the header and sets all match. I don't really care if the names match
        return self.header == other.header and self.set_of_tuples == other.set_of_tuples
    
R = Relation(relation_name = "R",relation_header = ('char','int'), set_of_tuples = {('a',1),('b',2),('c',3),('d',4)})
R2 = Relation(relation_name = "R2",relation_header = ('char','int'), set_of_tuples = {('b',2),('a',1),('c',3),('d',4)})
print(R==R2)

True


Defining the `__eq__` method within the class allows us to use the == operator to check whether the contents of the two relations are the same.
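
To see which parts of a relation this kind of equality actually looks at, here is a hedged sketch using a stripped-down, hypothetical `MiniRelation` class. One deliberate difference from the class above: for unrelated types it returns `NotImplemented` (a common Python convention) instead of raising an error. That is a choice made for this sketch only.

```python
class MiniRelation:
    """Hypothetical, stripped-down version of the Relation class."""
    def __init__(self, name, header, set_of_tuples):
        self.name = name
        self.header = header
        self.set_of_tuples = set_of_tuples

    def __eq__(self, other):
        if not isinstance(other, MiniRelation):
            return NotImplemented  # let Python handle unrelated types
        # Names are ignored; only the header and the tuples are compared
        return (self.header == other.header
                and self.set_of_tuples == other.set_of_tuples)

A = MiniRelation('A', ('char', 'int'), {('a', 1), ('b', 2)})
B = MiniRelation('B', ('char', 'int'), {('b', 2), ('a', 1)})
C = MiniRelation('C', ('letter', 'int'), {('a', 1), ('b', 2)})

names_ignored = A == B     # same header and tuples, different names
headers_matter = A == C    # same tuples, different header
print(names_ignored, headers_matter)  # True False
```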

***

A tuple can contain just a single element, what we jokingly call a "one-ple" in class. There is a tricky problem that comes up when we try to create a tuple with only a single element. Consider the following:

In [18]:

R = Relation('R',('char'),{('a')}) # No trailing commas here, so Python treats ('char') and ('a') as plain strings, not as tuples.
R.toString()


The relation name is  R
╒═════╕
│ c   │
╞═════╡
│ a   │
╘═════╛


Notice how only the "c" from the attribute name "char" is printed out. This can be a difficult bug to deal with. It arises because parentheses alone don't create a tuple in Python: ('char') is just the string 'char', and since we don't explicitly declare variable types in Python, nothing warns us about the mix-up. (Remember what a pain that was to learn in C++?)

It gets worse if we try to create a relation with no tuples in the set

In [19]:
Q = Relation('Q',('char'),set()) # Still no trailing comma, so the header is the string 'char' rather than a tuple. (set() is the empty set; a bare {} would be an empty dict.)
Q.toString()

The relation name is  Q
╒═════╤═════╤═════╤═════╕
│ c   │ h   │ a   │ r   │
╞═════╪═════╪═════╪═════╡
╘═════╧═════╧═════╧═════╛


That's not what we intended at all. We can get around this by adding a comma after the single element of the tuple (after both the 'char' and the 'a')...

In [20]:

P = Relation('P',('char',),{('a',)}) #To create a tuple with only one item, you have to add a comma after the item, otherwise Python will not recognize the value as a tuple.
P.toString()



The relation name is  P
╒════════╕
│ char   │
╞════════╡
│ a      │
╘════════╛


... and by adding a single comma after the header when we create a relation with the set of tuples empty

In [21]:
Q = Relation('Q',('char',),set()) #The comma after 'char' makes the header a one-element tuple. (set() is the empty set; a bare {} would be an empty dict.)
Q.toString()


The relation name is  Q
╒════════╕
│ char   │
╞════════╡
╘════════╛
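
A short sketch that checks the types involved may help when debugging this. It relies only on built-in Python behavior; the variable names are made up for illustration.

```python
not_a_tuple = ('char')     # parentheses only group; this is the string 'char'
one_ple = ('char',)        # the trailing comma makes a one-element tuple
empty_braces = {}          # an empty dict, not an empty set
empty_set = set()          # the empty set

print(type(not_a_tuple).__name__)   # str
print(type(one_ple).__name__)       # tuple
print(type(empty_braces).__name__)  # dict
print(type(empty_set).__name__)     # set

# Iterating over a string yields its characters, which is why a header
# of ('char') can end up being treated as the four columns c, h, a, r
print(list(not_a_tuple))            # ['c', 'h', 'a', 'r']
```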


***
We are now in a position to start filling in the rest of the Relation class.

I want to be able to apply the _relational operators_ to any relation or pair of relations. Using good object-oriented programming style, I'll add the relational operators as methods to the class.

Let's begin with the union operator

### Union ###
Consider the relation R defined as before
| char | int | 
| :-: | :-: | 
| $a$ | $1$ |
| $b$ | $2$ | 
| $c$ | $3$ | 
| $d$ | $4$ | 

and consider a new relation Q defined as
| char | int | 
| :-: | :-: | 
| $f$ | $3$ |

The union $R\cup Q$ is possible since the headers match, and is
| char | int | 
| :-: | :-: | 
| $a$ | $1$ |
| $b$ | $2$ | 
| $c$ | $3$ | 
| $d$ | $4$ | 
| $f$ | $3$ |  

Let's check in the code.

In [22]:
class Relation:
    def __init__(self, relation_name, relation_header, set_of_tuples):
        self.name = relation_name # I threw in a string for the relation name for fun
        self.header = relation_header
        self.set_of_tuples = set_of_tuples

    def toString(self):
        print("The relation name is ", self.name)
        print(tabulate(self.set_of_tuples,headers = self.header,tablefmt = 'fancy_grid'))
        #print(self.header)
        #for my_tuple in self.set_of_tuples:
        #    print(my_tuple)
    
    ########################
    # Relational Operators #
    ########################
    def Union(self,other):
        if not isinstance(other, Relation):
            raise ValueError # don't attempt to union with something not a relation
        # First, check the precondition to see if the headers are the same
        if self.getHeader() != other.getHeader():
            raise ValueError
        
        # Second, create a new relation whose set of tuples is the union of the two sets of tuples
        name = self.getName() + "\u222A" + other.getName()
        header = self.getHeader()
        set_of_tuples = self.getTuples()
        set_of_tuples = set_of_tuples.union(other.getTuples())    # This is the union operator defined for set objects
        return Relation(name,header,set_of_tuples)
        
    #######################
    # Getters and Setters #
    #######################
    def getName(self): return self.name
    def getHeader(self): return self.header
    def getTuples(self): return self.set_of_tuples
    
    ################################################
    # Define how the == operator acts on relations #
    ################################################
    def __eq__(self,other):
        # First, check whether the thing passed to the equality method is the same type
        if not isinstance(other, Relation):
            # don't attempt to compare against unrelated types
            raise ValueError
        # Second, return true only if the header and sets all match. I don't really care if the names match
        return self.header == other.header and self.set_of_tuples == other.set_of_tuples

R = Relation(relation_name = 'R',relation_header = ('char','int'), set_of_tuples = {('a',1),('b',2),('c',3),('d',4)})
Q = Relation('Q',('char','int'),{('f',3),}) # Notice the comma after the ('f',3) tuple. It is optional here: the braces already create a set whose only element is that tuple

P = R.Union(Q)
P.toString()

The relation name is  R∪Q
╒════════╤═══════╕
│ char   │   int │
╞════════╪═══════╡
│ b      │     2 │
├────────┼───────┤
│ d      │     4 │
├────────┼───────┤
│ f      │     3 │
├────────┼───────┤
│ c      │     3 │
├────────┼───────┤
│ a      │     1 │
╘════════╧═══════╛


***

Let's now look at the projection operator

### Project ###
 
Consider the relation $P = R\cup Q$ defined as before
| char | int | 
| :-: | :-: | 
| $a$ | $1$ |
| $b$ | $2$ | 
| $c$ | $3$ | 
| $d$ | $4$ |
| $f$ | $3$ | 

We want to compute $\pi_{char}(P)$, which does two things. 
* First, it creates a new relation.
* Second, it populates the new relation with the char column.

The result is the relation $\pi_{char}(P)$
| char | 
| :-: |
| $a$ |
| $b$ |
| $c$ |
| $d$ |
| $f$ |

***
Here's the code

In [23]:
class Relation:
    def __init__(self, relation_name, relation_header, set_of_tuples):
        self.name = relation_name # I threw in a string for the relation name for fun
        self.relation_header = relation_header
        self.tuples = set_of_tuples

    ########################
    # Relational Operators #
    ########################
    def Union(self,relation2):
        if type(relation2) != Relation:
            raise TypeError
        if self.getHeader() != relation2.getHeader():
            raise ValueError
        name = self.getName() + "\u222A" + relation2.getName()
        header = self.getHeader()
        tuples = self.getTuples()
        tuples = tuples.union(relation2.getTuples())
        return Relation(name,header,tuples)
    def Project(self,column_header):
        # Only a portion of this method is implemented. 
        # Specifically, only the portion that projects onto a single column.
        # You have to implement the rest of this method in Project 3
        
        # First, check the precondition
        # The precondition for the project operator is that the column attribute
        # must exist in the set of attributes
        if column_header not in self.relation_header:
            raise ValueError
        
        # Second, create a new relation that is the output of the projection operator.
        # The relation needs a name, a header, and the set of tuples
        name = "\u03C0" + "_{" + column_header + "}(" + self.name + ")" # The \u03C0 is the Unicode escape for the Greek letter pi
        header = (column_header,) # Notice the comma after "column_header", which forces the header to be a tuple
        header_index = self.relation_header.index(column_header) # Get the index of the header that matches the column you want
        tuples = set()
        # A good "pythonic" way to implement this is using list comprehensions,
        # https://www.geeksforgeeks.org/python-list-comprehension/
        # but I'll implement it using a for loop here because it might be easier to see
        for my_tuple in self.tuples:
            tuples.add((my_tuple[header_index],)) # Notice the comma after the "... index]"
        new_relation = Relation(name,header,tuples) # Create the relation
        return new_relation 

    
    #######################
    # Getters and Setters #
    #######################
    def getName(self): return self.name
    def getHeader(self): return self.relation_header
    def getTuples(self): return self.tuples

    def toString(self):
        ### Prints the name, header, and contents of the relation
        ### I like using the tabulate package, but I had to install it
        ### to make this work, so I've included a for loop that does
        ### the same thing with just a little less formatting
        print("The relation name is ", self.name)
        print(tabulate(self.tuples,headers = self.relation_header,tablefmt = 'fancy_grid'))
        #print(self.relation_header)
        #for my_tuple in self.tuples:
        #    print(my_tuple)

    ################################################
    # Define how the == operator acts on relations #
    ################################################
    def __eq__(self,other):
        # First, check whether the thing passed to the equality method is the same type
        if not isinstance(other, Relation):
            # don't attempt to compare against unrelated types
            raise ValueError
        # Second, return true only if the header and sets all match. I don't really care if the names match
        return self.relation_header == other.relation_header and self.tuples == other.tuples


Let's now look at the output of the project operator

In [24]:
R = Relation(relation_name = 'R',relation_header = ('char','int'), set_of_tuples = {('a',1),('b',2),('c',3),('d',4)})
Q = Relation('Q',('char','int'),{('f',3),}) # Observe the comma after the ('f',3) tuple. It is optional here: the braces already make the tuple the lone element of a set

P = R.Union(Q)
M = P.Project('char')
M.toString()


The relation name is  π_{char}(R∪Q)
╒════════╕
│ char   │
╞════════╡
│ c      │
├────────┤
│ a      │
├────────┤
│ f      │
├────────┤
│ d      │
├────────┤
│ b      │
╘════════╛
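
The comments in the Project method mention that a comprehension would be a more "pythonic" alternative to the for loop. Here is a hedged sketch of that idea for the single-column case only, written as a standalone function. The helper name `project_column` and its parameters are hypothetical and not part of the class above.

```python
def project_column(header, tuples, column):
    """Hypothetical helper: project a set of tuples onto one named column."""
    # Precondition: the column must be one of the attributes in the header
    if column not in header:
        raise ValueError(f"{column!r} is not an attribute of the header")
    index = header.index(column)
    # Set comprehension; the trailing comma keeps each result a one-element tuple
    return {(row[index],) for row in tuples}

header = ('char', 'int')
tuples = {('a', 1), ('b', 2), ('c', 3), ('d', 4), ('f', 3)}

chars = project_column(header, tuples, 'char')
print(sorted(chars))  # [('a',), ('b',), ('c',), ('d',), ('f',)]
```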


Let's project onto the 'int' column instead. Why does projecting onto the 'char' column produce five tuples but projecting onto the 'int' column only produce four tuples?

In [25]:
M = P.Project('int')
M.toString()

The relation name is  π_{int}(R∪Q)
╒═══════╕
│   int │
╞═══════╡
│     1 │
├───────┤
│     2 │
├───────┤
│     3 │
├───────┤
│     4 │
╘═══════╛


The answer is that the set of tuples is a set, and sets don't have repeats. 

If we didn't remove repeats, we would have the relation $P$ defined as
| char | int | 
| :-: | :-: | 
| $a$ | $1$ |
| $b$ | $2$ | 
| $c$ | $3$ | 
| $d$ | $4$ |
| $f$ | $3$ | 

yielding $\pi_{int}(P)$ 
| int | 
| :-: | 
| $1$ |
| $2$ | 
| $3$ | 
| $4$ |
| $3$ |

but the "3" would appear twice, which a set does not allow, so the correct answer is
$\pi_{int}(P)$ 
| int | 
| :-: | 
| $1$ |
| $2$ | 
| $3$ | 
| $4$ |
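
The difference can be sketched in a few lines of Python by collecting the projected values once in a list (which keeps repeats) and once in a set (which drops them). The variable names here are illustrative only.

```python
p_tuples = {('a', 1), ('b', 2), ('c', 3), ('d', 4), ('f', 3)}

# Keeping every projected value in a list preserves the repeated 3
as_list = sorted([(row[1],) for row in p_tuples])
# Collecting the same values in a set drops the repeat automatically
as_set = {(row[1],) for row in p_tuples}

print(as_list)          # [(1,), (2,), (3,), (3,), (4,)]
print(sorted(as_set))   # [(1,), (2,), (3,), (4,)]
```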


***
### Something to Keep In Mind: Deep vs Shallow Copies ###
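In Python, assigning one variable to another (as in r2 = r1) does not copy the object at all; both names end up referring to the same object in memory. Even an explicit shallow copy (made with copy.copy()) only creates a new outer object and still shares any nested mutable objects, such as lists, with the original. This can cause some nasty bugs that are hard to find. See this in the example below: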


In [26]:
class relation:
     def __init__(self):
        self.my_name = "relation"
        self.my_list = [1,2,3,4]
     def change_name(self,new_name): self.my_name = new_name
     def get_name(self): return self.my_name
     def get_list(self): return self.my_list
r1 = relation()
print("r1's name: " + r1.get_name())
r2 = r1
r2.change_name("different relation")
print("r2's name: " + r2.get_name())
print("r1's name: " + r1.get_name())
print("r1's address: " + str(r1))
print("r2's address: " + str(r2))

r1's name: relation
r2's name: different relation
r1's name: different relation
r1's address: <__main__.relation object at 0x10966ae90>
r2's address: <__main__.relation object at 0x10966ae90>


Notice that the memory addresses of r1 and r2 are the same, meaning that only the memory address was copied. Also notice that when r2's name was changed, r1's name was changed as well.
This happened because both r1 and r2 contain the same memory address, which means that they are both pointing to the same object in memory. So, when r2 changes the name of the object, it is changed for r1 as well, which is probably not what we want to happen.

This is where we need deep copies. A deep copy doesn't just copy the memory address, it copies the data at the memory address (in this case, a relation object) to a new memory address. When this happens, we will have two variables pointing to two separate objects in memory, so when you change one you won't change the other. Luckily, Python has a very easy way to do this with the "copy.deepcopy()" function. Notice what changes in our output: 

In [11]:
import copy # We need to import the copy module 
class relation:
     def __init__(self):
        self.my_name = "relation"
        self.my_list = [1,2,3,4]
     def change_name(self,new_name): self.my_name = new_name
     def get_name(self): return self.my_name
     def get_list(self): return self.my_list
r1 = relation()
print("r1's name: " + r1.get_name())
r2 = copy.deepcopy(r1)
r2.change_name("different relation")
print("r2's name: " + r2.get_name())
print("r1's name: " + r1.get_name())
print("r1's address: " + str(r1))
print("r2's address: " + str(r2))

r1's name: relation
r2's name: different relation
r1's name: relation
r1's address: <__main__.relation object at 0x10964c850>
r2's address: <__main__.relation object at 0x108805250>


Now when we run this code, r1 and r2 have two different memory addresses, and r1's name was not changed. 
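
Between plain assignment and a deep copy sits the shallow copy made by copy.copy(). The sketch below compares all three on a mutable list attribute. The class `Box` is a hypothetical stand-in that mirrors the small relation class above.

```python
import copy

class Box:
    """Hypothetical class holding a name and a mutable list."""
    def __init__(self):
        self.my_name = "relation"
        self.my_list = [1, 2, 3, 4]

original = Box()
alias = original                  # plain assignment: no copy at all
shallow = copy.copy(original)     # new object, but my_list is still shared
deep = copy.deepcopy(original)    # new object with its own copy of my_list

original.my_list.append(5)        # mutate the list in place

print(alias is original)          # True
print(shallow is original)        # False
print(shallow.my_list)            # [1, 2, 3, 4, 5]  (shared list changed)
print(deep.my_list)               # [1, 2, 3, 4]     (independent list)
```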

Now although copy.deepcopy() is useful, it is expensive. So only use deepcopy when you have to! You should only need to use it if you are editing a copy of an object and you don't want the original object to be changed. If you don't care about the original being changed, or you are not changing the copied object, then don't use copy.deepcopy().
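
As one hedged illustration of avoiding deepcopy: if the tuples in a relation hold only immutable values such as strings and integers (an assumption of this sketch), then making a new set from the old one is enough to keep changes to the copy from reaching the original. The variable names are made up for illustration.

```python
original_tuples = {('a', 1), ('b', 2)}

# Tuples of strings and ints are immutable, so a new set holding the
# same tuples is enough to protect the original from changes to the copy
working_copy = set(original_tuples)
working_copy.add(('c', 3))

print(sorted(original_tuples))  # [('a', 1), ('b', 2)]
print(sorted(working_copy))     # [('a', 1), ('b', 2), ('c', 3)]
```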