# TP1 - Finding Keys using Functional Dependencies 
--------------------------

In this lab, we are going to learn:

- how to use Jupyter
- how to use SQLite
- how to discover keys in relations

## How to use Jupyter

To execute a cell, click on it, and then use SHIFT+ENTER (this will execute the contents of the cell, and move down to the next one!)

To edit a cell, click on it. If the cell uses markdown code, then use ENTER to edit it.

See the Help menu for other useful keyboard commands. You can always use the menu bar instead as well.


In [1]:
print("Hello world!")

Hello world!


Another example:

In [2]:
for i in range(1,10):
    print(i)

1
2
3
4
5
6
7
8
9


#### Exercise 1

Print numbers 1 to 9 in reverse order

In [3]:
for i in range(9,0,-1):
    print(i)

9
8
7
6
5
4
3
2
1


## How to use SQLite

To work with SQL easily in a notebook, we'll load the ipython-sql extension as follows:

In [4]:
%load_ext sql
%sql sqlite://

'Connected: @None'

We are going to create a table:

In [5]:
%%sql DROP TABLE IF EXISTS T;
CREATE TABLE T(course VARCHAR, classroom INT, time INT);
INSERT INTO T VALUES ('CS 364', 132, 900);
INSERT INTO T VALUES ('CS 245', 140, 1000);
INSERT INTO T VALUES ('EE 101', 210, 900);

 * sqlite://
Done.
Done.
1 rows affected.
1 rows affected.
1 rows affected.


[]

Let's run our first SQL query:

In [6]:
%sql SELECT * FROM T;

 * sqlite://
Done.


course,classroom,time
CS 364,132,900
CS 245,140,1000
EE 101,210,900


#### Exercise 2

List the names of the courses with time less than 950.

In [7]:
%sql SELECT COURSE FROM T WHERE TIME<950;

 * sqlite://
Done.


course
CS 364
EE 101
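
For reference, the same query can also be run outside the notebook magic with Python's standard `sqlite3` module. This is a minimal sketch, assuming a fresh in-memory database; the names `conn` and `early_courses` are illustrative and not part of the lab, and the `ORDER BY` clause is added only to make the row order explicit.

```python
import sqlite3

# In-memory database, independent of the one used by the %sql magic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T(course VARCHAR, classroom INT, time INT)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [("CS 364", 132, 900), ("CS 245", 140, 1000), ("EE 101", 210, 900)],
)

# Parameterized version of the Exercise 2 query.
cursor = conn.execute(
    "SELECT course FROM T WHERE time < ? ORDER BY course", (950,)
)
early_courses = [row[0] for row in cursor]
print(early_courses)
conn.close()
```

Passing the threshold as a parameter (`?`) rather than formatting it into the SQL string keeps the query safe if the value ever comes from user input.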


## How to discover keys in relations

Now, we are going to work with functional dependencies, keys and closures. Our final goal is to build a method to find keys in a relation.

### Functional Dependencies

Recall that, given a set of attributes $\{A_1, \dots, A_n\}$ and a set of FDs $\Gamma$, the closure, denoted $\{A_1, \dots, A_n\}^+$, is defined to be the largest set of attributes $B$ such that $$A_1,\dots,A_n \rightarrow B \text{ using } \Gamma.$$

We're going to use some functions to compute the closure of a set of attributes and other such operations (_from CS145 Stanford_)

In [8]:
# Source code

def to_set(x):
  """Convert input int, string, list, tuple, set -> set"""
  if type(x) == set:
    return x
  elif type(x) in [list, tuple]:
    return set(x)
  elif type(x) in [str, int]:
    return set([x])
  else:
    raise Exception("Unrecognized type.")

def fd_to_str(lr_tuple):
    lhs = lr_tuple[0]
    rhs = lr_tuple[1]
    return ",".join(to_set(lhs)) + " -> " + ",".join(to_set(rhs))

def fds_to_str(fds): return "\n\t".join(map(fd_to_str, fds))

def set_to_str(x): return "{" + ",".join(x) + "}"

def fd_applies_to(fd, x): 
  lhs, rhs = map(to_set, fd)
  return lhs.issubset(x)

def print_setup(A, fds):
  print("Attributes = " + set_to_str(A))
  print("FDs = \t" + fds_to_str(fds))

def print_fds(fds):
  print("FDs = \t" + fds_to_str(fds))    


Now, let's look at a concrete example. The code for

attributes = { name, category, color, department, price}

and functional dependencies:

name $\rightarrow$ color

category $\rightarrow$ department

color, category $\rightarrow$ price

is the following:

In [9]:
attributes  = set(["name", "category", "color", "department", "price"]) # This is the attribute set.
fds = [(set(["name"]),"color"),
         (set(["category"]), "department"),
         (set(["color", "category"]), "price")]

print_setup(attributes, fds)

Attributes = {category,name,price,color,department}
FDs = 	name -> color
	category -> department
	category,color -> price


### Closure of a set of Attributes

Let's implement the algorithm for obtaining the closure of a set of attributes:

In [10]:
def compute_closure(x, fds, verbose=False):
    bChanged = True        # We will repeat until there are no changes.
    x_ret    = x.copy()    # Make a copy of the input to hold x^{+}
    while bChanged:
        bChanged = False   # Reset; set back to True below if an FD adds attributes
        for fd in fds:     # loop through all the FDs.
            (lhs, rhs) = map(to_set, fd) # recall: lhs -> rhs
            if fd_applies_to(fd, x_ret) and not rhs.issubset(x_ret):
                x_ret = x_ret.union(rhs)
                if verbose:
                    print("Using FD " + fd_to_str(fd))
                    print("\t Updated x to " + set_to_str(x_ret))
                bChanged = True
    return x_ret

As an example, let's compute the closure for the attribute "name":

In [11]:
A = set(["name"])
compute_closure(A,fds, True)

Using FD name -> color
	 Updated x to {name,color}


{'color', 'name'}

#### Exercise 3

Is the attribute "name" a superkey for this relation? Why?

The attribute "name" is not a superkey for this relation. To be a superkey, a set of attributes $A_1,..., A_n$ must be such that for any other attribute $B$ in the relation, we have $\{A_1, ..., A_n\} \rightarrow B$. In this relation, the closure of "name" is the set {"color", "name"}, so that the attributes category, department and price are not functionally determined by "name" (they do not belong to its closure). Thus name is not a superkey.
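
A natural follow-up is to ask what must be added to "name" to obtain a superkey. The sketch below is self-contained: `demo_closure`, `demo_attributes` and `demo_fds` are hypothetical names, chosen so as not to overwrite the notebook's own variables, and the loop is a compact restatement of the same fixed-point idea rather than the lab's `compute_closure`.

```python
def demo_closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply lhs -> rhs when lhs is covered and rhs adds something new.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

demo_attributes = {"name", "category", "color", "department", "price"}
demo_fds = [({"name"}, {"color"}),
            ({"category"}, {"department"}),
            ({"color", "category"}, {"price"})]

# Attributes that "name" alone does not determine.
missing = demo_attributes - demo_closure({"name"}, demo_fds)
print(sorted(missing))

# Adding "category" closes the gap.
print(demo_closure({"name", "category"}, demo_fds) == demo_attributes)
```

Under these FDs, neither "name" nor "category" appears on any right-hand side, so no other attribute can ever derive them: every superkey has to contain both.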

### Keys and Superkeys

Next, we'll add some new functions for finding superkeys and keys.  Recall:
* A _superkey_ for a relation $R(B_1,\dots,B_m)$ is a set of attributes $\{A_1,\dots,A_n\}$ s.t.
$$ \{A_1,\dots,A_n\} \rightarrow B_{j} \text{ for all } j=1,\dots m$$
* A _key_ is a minimal (setwise) _superkey_

The algorithm to determine if a set of attributes $A$ is a superkey for $X$ is actually very simple (check if $A^+=X$):

In [12]:
def is_superkey_for(A, X, fds, verbose=False): 
    return X.issubset(compute_closure(A, fds, verbose=verbose))

Is "name" a superkey of the relation?

In [13]:
is_superkey_for(A, attributes, fds)

False

Then, to check if $A$ is a key for $X$, we just confirm that:
* (a) it is a superkey
* (b) there are no smaller superkeys (_Note that we only need to check for superkeys of one size smaller_)

In [14]:
import itertools
def is_key_for(A, X, fds, verbose=False):
    subsets = set(itertools.combinations(A, len(A)-1))
    return is_superkey_for(A, X, fds) and \
        all([not is_superkey_for(set(SA), X, fds) for SA in subsets])
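
The remark that only subsets one size smaller need checking rests on the fact that every superset of a superkey is itself a superkey: if some smaller subset were a superkey, a subset with exactly one attribute removed would be one too. The sketch below checks this empirically by comparing the shortcut with a brute-force test over all proper subsets. All function and variable names are hypothetical and independent of the notebook's definitions; the FDs are an arbitrary small example.

```python
from itertools import combinations

def demo_closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def demo_is_superkey(attrs, all_attrs, fds):
    return all_attrs <= demo_closure(attrs, fds)

def is_key_one_smaller(attrs, all_attrs, fds):
    """Minimality test using only the subsets with one attribute removed."""
    return demo_is_superkey(attrs, all_attrs, fds) and not any(
        demo_is_superkey(attrs - {a}, all_attrs, fds) for a in attrs)

def is_key_brute_force(attrs, all_attrs, fds):
    """Minimality test against every proper subset, including the empty set."""
    proper = (set(c) for r in range(len(attrs)) for c in combinations(attrs, r))
    return demo_is_superkey(attrs, all_attrs, fds) and not any(
        demo_is_superkey(s, all_attrs, fds) for s in proper)

demo_attrs = {"A", "B", "C", "D"}
demo_fds = [({"A", "B"}, {"C"}), ({"A", "D"}, {"B"}), ({"B"}, {"D"})]

# Every non-empty candidate set of attributes.
candidates = [set(c) for r in range(1, len(demo_attrs) + 1)
              for c in combinations(sorted(demo_attrs), r)]
agree = all(is_key_one_smaller(c, demo_attrs, demo_fds) ==
            is_key_brute_force(c, demo_attrs, demo_fds) for c in candidates)
print(agree)
```

The shortcut tests `len(A)` subsets instead of `2**len(A) - 1`, which is why `is_key_for` can afford to ignore the smaller ones.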

Now, let's look at another example:

attributes = { cru, type, client, remise}

and functional dependencies:

cru $\rightarrow$ type

type, client $\rightarrow$ remise

#### Exercise 4

Is the set {"cru", "client"} a key of the relation? Why?

We can first use the function is_key_for() to determine whether {"cru", "client"} is a key for the relation.


In [15]:
# define the attributes
attributes = set(["cru", "type", "client", "remise"])
# define the functional dependencies
fds = [(set(["cru"]),"type"),
         (set(["type", "client"]), "remise")]
# define the candidate key
A = set(["cru", "client"])
# check whether A is a key
print("Is A a key for the relation: " + str(is_key_for(A, attributes, fds, verbose=False)) + ".")

Is A a key for the relation: True.


This reports "True", so $A=\{"cru", "client"\}$ is a key. Let us now see why. To be a key, $A$ needs to be a superkey, and it needs to be minimal: no proper subset of $A$ may itself be a superkey.

Is $A$ a superkey? It is one if, for every attribute $B$ in the relation, we have $A \rightarrow B$. So let's build the closure of $A$.
1. $\{"cru", "client"\} \rightarrow \{"cru", "client"\}$
2. $\{"cru", "client"\} \rightarrow \{"cru", "client", "type"\}$ (from $\{"cru"\} \rightarrow \{"type"\}$)
3. $\{"cru", "client"\} \rightarrow \{"cru", "client", "type", "remise"\}$ (from $\{"type", "client"\} \rightarrow \{"remise"\}$)

The closure of $A$ is the set of all attributes in the relation, and thus $A$ is a superkey. To be a key, it must also be minimal. Since any superset of a superkey is itself a superkey, it suffices to test the subsets with one attribute fewer. There are two candidates: $A_1=\{"cru"\}$ and $A_2=\{"client"\}$. Their closures are $A_1^+=\{"cru", "type"\}$ and $A_2^+=\{"client"\}$. Neither closure contains all the attributes of the relation, so neither $A_1$ nor $A_2$ is a superkey. Thus $A$ is a key.

### Closure of a set of functional dependencies

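The following function enumerates the closure of a set of functional dependencies: for every non-empty proper subset of the attributes, it computes the attribute closure and records the non-trivial attributes that the subset determines: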

In [16]:
import itertools
def findsubsets(S,m):
    return set(itertools.combinations(S, m))
def closure(X, fds, verbose=False):
    c = []
    for size in range(1, len(X)):
        subsets = findsubsets(X, size) 
        for SA in subsets:      # loop through all the subsets.
            cl = compute_closure(set(SA), fds, verbose)
            if len(cl.difference(SA)) > 0: 
                c.extend([(set(SA), cl.difference(SA))])
    return c

Let's see some examples of how to use it:

In [17]:
attributes1 = set(['A', 'B', 'C', 'D'])
fds1 = [(set(['A', 'B']), 'C'),
     (set(['A', 'D']), 'B'),
     (set(['B']), 'D')]
print_fds(closure(attributes1, fds1))


FDs = 	B -> D
	C,B -> D
	B,A -> C,D
	D,A -> C,B
	D,B,A -> C
	C,A,B -> D
	D,C,A -> B


In [18]:
attributes2 = set (['CRU', 'TYPE', 'CLIENT', 'REMISE'])
fds2 = [(set(['CRU']), 'TYPE'),
     (set(['TYPE', 'CLIENT']), 'REMISE')]
print_fds(closure(attributes2, fds2))

FDs = 	CRU -> TYPE
	REMISE,CRU -> TYPE
	CLIENT,CRU -> REMISE,TYPE
	TYPE,CLIENT -> REMISE
	TYPE,CLIENT,CRU -> REMISE
	CLIENT,REMISE,CRU -> TYPE


In [19]:
attributes3 = set( ['N VEH', 'TYPE', 'COULEUR', 'MARQUE', 'PUISSANCE'])
fds3 = [(set(['N VEH']), 'TYPE'), 
      (set(['N VEH']), 'COULEUR'),
      (set(['TYPE']), 'MARQUE'),
      (set(['TYPE']), 'PUISSANCE')]
print_fds(closure(attributes3,fds3))

FDs = 	TYPE -> PUISSANCE,MARQUE
	N VEH -> PUISSANCE,TYPE,COULEUR,MARQUE
	COULEUR,N VEH -> PUISSANCE,TYPE,MARQUE
	N VEH,PUISSANCE -> TYPE,COULEUR,MARQUE
	TYPE,PUISSANCE -> MARQUE
	TYPE,N VEH -> PUISSANCE,COULEUR,MARQUE
	TYPE,MARQUE -> PUISSANCE
	TYPE,COULEUR -> PUISSANCE,MARQUE
	N VEH,MARQUE -> PUISSANCE,TYPE,COULEUR
	TYPE,COULEUR,PUISSANCE -> MARQUE
	TYPE,N VEH,PUISSANCE -> COULEUR,MARQUE
	TYPE,COULEUR,MARQUE -> PUISSANCE
	TYPE,COULEUR,N VEH -> PUISSANCE,MARQUE
	COULEUR,N VEH,MARQUE -> PUISSANCE,TYPE
	PUISSANCE,N VEH,MARQUE -> TYPE,COULEUR
	COULEUR,N VEH,PUISSANCE -> TYPE,MARQUE
	TYPE,N VEH,MARQUE -> PUISSANCE,COULEUR
	PUISSANCE,N VEH,MARQUE,TYPE -> COULEUR
	TYPE,COULEUR,N VEH,PUISSANCE -> MARQUE
	TYPE,COULEUR,N VEH,MARQUE -> PUISSANCE
	PUISSANCE,COULEUR,N VEH,MARQUE -> TYPE
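
Enumerating every subset grows exponentially with the number of attributes. When the question is only whether one particular FD $X \rightarrow Y$ follows from $\Gamma$, it is enough to check that $Y \subseteq X^+$. A minimal sketch, using hypothetical helper names (`demo_closure`, `is_implied`, `vehicle_fds`) and the FDs of the vehicle example:

```python
def demo_closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_implied(lhs, rhs, fds):
    """True if lhs -> rhs follows from fds, i.e. rhs lies inside the closure of lhs."""
    return set(rhs) <= demo_closure(lhs, fds)

vehicle_fds = [({"N VEH"}, {"TYPE"}), ({"N VEH"}, {"COULEUR"}),
               ({"TYPE"}, {"MARQUE"}), ({"TYPE"}, {"PUISSANCE"})]

# Follows by transitivity: N VEH -> TYPE and TYPE -> MARQUE.
print(is_implied({"N VEH"}, {"MARQUE"}, vehicle_fds))
# Does not follow: nothing with TYPE on the left produces COULEUR.
print(is_implied({"TYPE"}, {"COULEUR"}, vehicle_fds))
```

This costs one closure computation per question, at the price of not producing the full list of derived dependencies.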


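Now, let's write a method to find the superkeys of a relation. It only examines proper subsets of the attributes, so the full attribute set, which is trivially a superkey, is not listed: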


In [20]:
def superkeys(X, fds, verbose=False):
    c = []
    for size in range(1, len(X)):
        subsets = findsubsets(X, size)
        for SA in subsets:
            cl = compute_closure(set(SA), fds, verbose)
            if cl == X and len(cl.difference(SA)) > 0: ## cl = X
                c.extend([SA])
    return c

Let's see some examples:

In [21]:
superkeys(attributes1, fds1)


[('B', 'A'), ('D', 'A'), ('D', 'B', 'A'), ('C', 'B', 'A'), ('C', 'D', 'A')]

In [22]:
superkeys(attributes2, fds2)


[('CLIENT', 'CRU'), ('TYPE', 'CLIENT', 'CRU'), ('CLIENT', 'REMISE', 'CRU')]

In [23]:
superkeys(attributes3, fds3)

[('N VEH',),
 ('N VEH', 'COULEUR'),
 ('N VEH', 'PUISSANCE'),
 ('N VEH', 'TYPE'),
 ('MARQUE', 'N VEH'),
 ('N VEH', 'PUISSANCE', 'TYPE'),
 ('N VEH', 'TYPE', 'COULEUR'),
 ('MARQUE', 'N VEH', 'COULEUR'),
 ('MARQUE', 'N VEH', 'PUISSANCE'),
 ('N VEH', 'PUISSANCE', 'COULEUR'),
 ('MARQUE', 'N VEH', 'TYPE'),
 ('MARQUE', 'N VEH', 'PUISSANCE', 'TYPE'),
 ('N VEH', 'PUISSANCE', 'TYPE', 'COULEUR'),
 ('MARQUE', 'N VEH', 'TYPE', 'COULEUR'),
 ('MARQUE', 'N VEH', 'PUISSANCE', 'COULEUR')]

#### Exercise 5

Implement a `keys` method to find keys of a relation.

**Note**: If there exist multiple keys of a relation, the `keys` method should return at least one of them.

In [52]:
# a minimal function: superkeys of minimum size are necessarily keys,
# since every proper subset of one is smaller and therefore not a superkey
def keys(X, fds, verbose=False):
    # identify all the superkeys that are proper subsets of X
    spks = superkeys(X, fds)
    # if there are none, X itself is the only key
    if not spks:
        return {tuple(X)}
    # get the minimum length among all superkeys
    ml = min([len(spk) for spk in spks])
    # return the superkeys of minimum length: they are keys, but keys of
    # larger size (if any) are not returned; keys2 below finds those as well
    return to_set([spk for spk in spks if len(spk) == ml])

# a more sophisticated one, relying on the explicit algorithm to find a key
def keys2(X, fds, verbose=False):
    allkeys=[]
    # identify all the superkeys
    spks = superkeys(X, fds)
    # check over superkeys individually to check if they are keys
    for spk in spks:
        if is_key_for(set(spk), X, fds, verbose=False):
            allkeys.extend([spk])
    return to_set(allkeys)



In [53]:
# my tests
print(keys(attributes1, fds1))

print(keys(attributes2, fds2))

print(keys(attributes3, fds3))

{('D', 'A'), ('B', 'A')}
{('CLIENT', 'CRU')}
{('N VEH',)}


Test it:

In [26]:
keys(attributes1, fds1)

{('B', 'A'), ('D', 'A')}

In [27]:
keys(attributes2, fds2)

{('CLIENT', 'CRU')}

In [28]:
keys(attributes3, fds3)

{('N VEH',)}
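
Enumerating all superkeys takes time exponential in the number of attributes. If a single key is enough, as the note in Exercise 5 allows, one alternative (not the approach used above) is to start from the full attribute set and drop attributes one at a time for as long as the remainder is still a superkey. The sketch is self-contained; `demo_closure` and `one_key` are hypothetical names, and the two FD sets repeat the first two examples of this lab.

```python
def demo_closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def one_key(all_attrs, fds):
    """Shrink the full attribute set to one key by dropping redundant attributes."""
    key = set(all_attrs)
    for attr in sorted(all_attrs):      # fixed order, for reproducible results
        candidate = key - {attr}
        if all_attrs <= demo_closure(candidate, fds):
            key = candidate             # attr is not needed to determine everything
    return key

abcd = {"A", "B", "C", "D"}
abcd_fds = [({"A", "B"}, {"C"}), ({"A", "D"}, {"B"}), ({"B"}, {"D"})]
print(sorted(one_key(abcd, abcd_fds)))

wine = {"CRU", "TYPE", "CLIENT", "REMISE"}
wine_fds = [({"CRU"}, {"TYPE"}), ({"TYPE", "CLIENT"}, {"REMISE"})]
print(sorted(one_key(wine, wine_fds)))
```

This needs only one closure computation per attribute, but it returns a single key, and which key it returns depends on the order in which attributes are dropped.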