# Data Structures and Processing

## Week 2: Introduction to Python (Cont.)


## Note

Although custom (user-defined) functions are covered only at the end of Week 2's content, we would like you to practice them right from the beginning, for two reasons:

* Defining functions should become natural to you, so that you are able to extract from your code blocks a utility that can be reused as many times as you please.

* We would like to run small tests against your functions to see whether your utilities work as expected in several different scenarios.  This is called unit testing the functions, code, or utility.
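
As a minimal sketch of what such a test can look like, the following defines a hypothetical utility `add_two` (it is not part of the task and is used only for illustration) and checks it with a few `assert` statements.  An `assert` passes silently when its condition holds and raises an `AssertionError` with the given message otherwise.

```python
def add_two(number):
    """Hypothetical utility, used only to illustrate testing."""
    return number + 2


# Each assert is one small test covering a different scenario.
assert add_two(0) == 2, "0 + 2 should be 2"
assert add_two(-2) == 0, "-2 + 2 should be 0"
assert add_two(1.5) == 3.5, "floats should work as well"
```

The same pattern is used in the test cells of this notebook.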

## Revisiting Personal Database

Recall that in the previous Jupyter notebook we asked you to make a small database for your personal use.  We now want to make it more functional and closer to a real use case.

Let us describe the expectations and provide a framework.

* The database file `biblio.txt` is stored on disk.
* Each line of the file `biblio.txt` is one record, written as a dictionary entry.  Here is what a sample line in the file looks like.

    `{"id": "skiena2017data", "title": "The data science design manual", "authors": ["Steven Skiena"], "year": 2017, "publisher": "Springer"}`

We expect

* a function named `biblio_from_file` that reads the lines of the file `biblio.txt` and returns a list of `dict` objects in Python.  The function should take the name of the file as an argument.

* a function named `biblio_to_file` that takes a `list` of `dict`s and writes its content to a given file name.  We use this function to ship the data to a file on the computer.

* a function named `biblio_add_record` that takes a list as a first argument (the list of records) and a `dict` element as a second argument (a new record) and adds the record to the list.  This function should return the list with the additional record.

* a function named `biblio_record_present` that takes a list as a first argument (the list of records), and a `dict` element as a second argument and determines if a record is present in the list.

* a function named `biblio_query_record_by_author` that takes a list as a first argument (the list of records), and a `str` element as a second argument (the name of an author for an entry), and returns the list of all the matching records.

* a function named `biblio_remove_record` that takes a list as a first argument (the list of records), and a `dict` element as a second argument and removes it from the database. The function should return the biblio list without the record.

## Hints for a Solution to the Task on Personal Database

In the following sections, we provide some hints that might be useful in writing the utilities needed for a functional database.  We suggest that you first think about the task on your own and ignore the hints, but you are very welcome to look at our suggestions and then provide your own solution.

## Hints: Reading the database file


### `biblio_from_file`

We have suggested writing a function that reads the contents of a file and returns a list of records.

We suggest breaking the task down into even smaller sub-tasks.  For example, we could use the `readlines()` method to read all the lines of the file into a `list` object.  This is the easy part.  If you look at an element of this list, you may notice that it is a `str` that contains a record together with some characters that we do not need, for example the spaces, the braces `{` and `}`, or the newline character `\n`.  Therefore, at this point, we should concentrate on a helper utility that parses the line and returns a `dict` element.  If we succeed in this, then we can use an iterative structure, the `for` loop, together with the newly defined utility to obtain the list of records as we need it.
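
To see what `readlines()` returns, here is a small self-contained sketch.  It assumes nothing about your own `biblio.txt`: it writes one shortened sample record to a temporary file (the variable names are ours) and reads it back.

```python
import os
import tempfile

# Write one shortened sample record to a temporary file.
sample = '{"id": "skiena2017data", "year": 2017}\n'
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# Read all lines of the file into a list of strings.
with open(path, "r") as file:
    lines = file.readlines()

os.remove(path)  # clean up the temporary file

print(type(lines))     # <class 'list'>
print(repr(lines[0]))  # the raw string, including the trailing '\n'
```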

Let us recall what a line in the file looks like after it is read in.

        '{"id": "skiena2017data", "title": "The data science design manual", "authors": ["Steven Skiena"], "year": 2017, "publisher": "Springer"}\n'

Notice that the line is read as a single string, but it contains several inner "strings", signified by the double quotation marks `"`.

Let us write down our observations and write utilities to act on them.

* The line contains the characters `{`, `}` and `\n`, which we do not need.  This means that we need to write down a function that would remove them.  The utility should return the cleaned string.

* After we have removed the undesired characters, the new string should look like the following

        '"id": "skiena2017data", "title": "The data science design manual", "authors": ["Steven Skiena"], "year": 2017, "publisher": "Springer"'
    
* In the cleaned line, we have different parts that are meant to be the `key-value` pairs of the `dict` object, our record.  We see that these are separated by a `,`.  Therefore, we need to find a way to break the string into several parts and obtain a list of strings containing these `key-value` strings.  A helpful `str` method for this purpose is `split()`, which takes a separator string as an argument, splits the string at every occurrence of that separator, and returns a list.  In our case, we want to split at `,`; therefore, we would use `split(",")` on the new line.  Our record after applying the utility would look like the following.

        ['"id": "skiena2017data"',
         ' "title": "The data science design manual"',
         ' "authors": ["Steven Skiena"]',
         ' "year": 2017',
         ' "publisher": "Springer"']
  
  Notice that the second element in the above list is `' "title": "The data science design manual"'`.  It has an extra white space at the beginning of the string.  Our utility function should take care of it by removing leading and trailing spaces.  We do not know how many spaces there are, but they need to be removed.  A useful method for this is `strip()`.  Called without an argument, it removes the leading and trailing whitespace from a string, regardless of how many characters there are.

  If we have defined our function correctly, then the output should look like the following.
  
        ['"id": "skiena2017data"',
         '"title": "The data science design manual"',
         '"authors": ["Steven Skiena"]',
         '"year": 2017',
         '"publisher": "Springer"']
         
* Now we need to take the members of the list and write a utility that returns a dictionary.  We could divide this task into sub-tasks, where we first concentrate on one member of the list, which should provide a `key-value` pair.  For example, let us concentrate on the string `'"title": "The data science design manual"'`.  There might be several ways to go forward from here.  One interesting way is to reuse our previously defined utility, which split the given string at `,`, removed the leading and trailing spaces, and returned a list.  Notice that we need the same utility with one small change: the character at which we want to split is now `:` instead of `,`.  This takes us one step back and motivates us to make the splitting character one of the input parameters of our utility, because that gives us more control.

  If the utility is redefined, or another utility is defined instead, and the surrounding double quotation marks are also removed, we should have the following list as our result

        ['title', 'The data science design manual']
        
   We can use the first element (at index 0) as the key and the second element (at index 1) as the corresponding value to build a dictionary or to add the pair to an existing dictionary.
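
The observations above can be sketched as three small utilities.  This is one possible reading of the hints, not the only solution; the names `clean_line`, `split_and_strip`, and `pair_to_key_value` are our own choices for illustration.

```python
def clean_line(line: str) -> str:
    """Remove the trailing newline and the outer braces of one line."""
    return line.strip().strip("{}")


def split_and_strip(text: str, separator: str) -> list:
    """Split text at separator and strip the spaces around each part."""
    return [part.strip() for part in text.split(separator)]


def pair_to_key_value(pair: str) -> tuple:
    """Turn '"title": "Some title"' into ('title', 'Some title')."""
    key, value = split_and_strip(pair, ":")
    return key.strip('"'), value.strip('"')


line = '{"id": "skiena2017data", "title": "The data science design manual", "authors": ["Steven Skiena"], "year": 2017, "publisher": "Springer"}\n'

pairs = split_and_strip(clean_line(line), ",")
record = dict(pair_to_key_value(pair) for pair in pairs)

print(record["title"])  # The data science design manual
print(record["year"])   # 2017 (still a str, not an int)
```

Note the limits of this sketch: every value stays a string (the list of authors is the text `'["Steven Skiena"]'`), and a record with several authors, or with a `,` or `:` inside its title, would be split in the wrong places.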
  

In [1]:
import json

def biblio_from_file(filename: str) -> list:
    """Read one JSON record per line and return a list of dicts."""
    file_contents = []
    try:
        with open(filename, "r") as file:
            for line in file.readlines():
                if line.strip():  # skip blank lines
                    file_contents.append(json.loads(line))
    except FileNotFoundError:
        print(f"Error: File not found {filename}")
    return file_contents
    

In [9]:
assert type(biblio_from_file("biblio.txt")) == list, "The function does not return a list, as expected."
assert type(biblio_from_file("biblio.txt")[0]) == dict, "The list does not contain dictionaries."

In [8]:
biblio = biblio_from_file("biblio.txt")

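By this point, you should be able to put all the utilities defined above together with a `for` loop to build the function `biblio_from_file`.  Note that the cell above takes a shortcut: it parses each line with `json.loads` from the standard library instead of the hand-written parsing utilities.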
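
To illustrate that shortcut, here is a short sketch of what `json.loads` does with the sample line.  It works because each line of the file happens to be valid JSON, which we assume holds for the whole of `biblio.txt`.

```python
import json

line = '{"id": "skiena2017data", "title": "The data science design manual", "authors": ["Steven Skiena"], "year": 2017, "publisher": "Springer"}\n'

# Surrounding whitespace, including the trailing "\n", is tolerated.
record = json.loads(line)

print(type(record))          # <class 'dict'>
print(type(record["year"]))  # <class 'int'>
print(record["authors"])     # ['Steven Skiena']
```

Unlike splitting by hand, this keeps the year as a number and the authors as a list.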

### `biblio_to_file`

This function might be easier to write than the previous one, `biblio_from_file`.

The situation is that we have our database `biblio`, which is a list of `dict` objects (the records).  We should be able to use the `for` statement to write each element of the list `biblio` as a line in the output file.  We leave the details for you to think about and to provide the solution.
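
As a small sketch of the one-record-per-line idea, assume that each record is turned into text with `json.dumps` from the standard library.  Here `io.StringIO` stands in for a real file, and the shortened records are for illustration only.

```python
import io
import json

records = [
    {"id": "skiena2017data", "year": 2017},
    {"id": "tufte1983visual", "year": 1983},
]

# io.StringIO behaves like a text file but lives in memory.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")  # one record per line

# Read the text back line by line and rebuild the records.
restored = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(restored == records)  # True
```

The explicit `"\n"` matters: without it, all records would end up on a single line, and a line-by-line reader could no longer separate them.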

In [11]:
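import json

def biblio_to_file(biblio: list, filename: str):
    """Write each record to the file as one JSON line."""
    try:
        with open(filename, "w") as outfile:
            for record in biblio:
                # json.dump() adds no newline, so write one per record
                outfile.write(json.dumps(record) + "\n")
    except OSError as error:
        print(f"Error: could not write {filename}: {error}")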

# this part is for testing
biblio = [
    {"id": "skiena2017data", "title": "The data science design manual", "authors": ["Steven Skiena"], "year": 2017, "publisher": "Springer"},
    {"id": "tufte1983visual", "title": "The visual display of quantitative information", "authors": ["Edward Tufte"], "year": 1983, "publisher": "Graphics Press"},
    {"id": "knuth1986art", "title": "The Art of Computer Programming", "authors": ["Donald E. Knuth"], "year": 1986, "publisher": "Addison-Wesley"}
]

filename = "output.txt"  

biblio_to_file(biblio, filename)       


### `biblio_add_record`

This function is also easy to write.

The situation is that we have a list and we want to add an element to it.  A better version would first check whether the record is already available in `biblio` before actually adding it.  There is no need to add a record if it is already a part of the database.  Therefore, before defining this function, we should define the function `biblio_record_present`, described in the next section, which checks whether a record is already present.
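
One detail is easy to miss here: the `list` method `append()` changes the list in place and returns `None`, so its result should not be returned or assigned.  A minimal sketch, with shortened records and variable names of our own:

```python
records = [{"id": "skiena2017data"}]
new_record = {"id": "tufte1983visual"}

result = records.append(new_record)  # append() changes the list in place
print(result)        # None
print(len(records))  # 2

# Guarded version: add the record only if it is not already there.
if new_record not in records:
    records.append(new_record)
print(len(records))  # 2
```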

In [None]:
def biblio_add_record(biblio: list, record: dict) -> list:
    """Return the list of records with `record` added if it is new."""
    if record not in biblio:
        biblio.append(record)  # append() works in place and returns None
    return biblio

In [28]:
xrecord = {"id": "skiena2017dt", "title": "The data science design mMnual", "authors": ["Steven SSkiena"], "year": 2017, "publisher": "SSpringer"}
# pass the list of records, not the name of the file
assert xrecord in biblio_add_record(biblio, xrecord)

### `biblio_record_present`

This function is one of the simplest.  But we would like to have it for the completeness of our system.

The situation is that we have a list and we want to check whether an object is in the list.  We use the infix operator `in`, which returns one of the boolean values: `True` or `False`.
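
For a list of `dict` objects, `in` compares the given record with each element using `==`, and two dictionaries are equal when they hold the same keys with the same values, whatever the order of the keys.  A short sketch with shortened records:

```python
records = [{"id": "skiena2017data", "year": 2017}]

same_content = {"year": 2017, "id": "skiena2017data"}  # keys in another order
different = {"id": "skiena2017data", "year": 2018}

print(same_content in records)  # True
print(different in records)     # False
```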

In [17]:
def biblio_record_present(biblio: list, record: dict) -> bool:
    """Return True if `record` is one of the records in `biblio`."""
    # `in` compares the record with every element of the list
    return record in biblio
    

In [27]:
xrecord = {"id": "skiena2017data", "title": "The data science design manual", "authors": ["Steven Skiena"], "year": 2017, "publisher": "Springer"}
# pass the list of records, not the name of the file
assert biblio_record_present(biblio, xrecord)

### `biblio_query_record_by_author`

This function filters for matching records and returns a list.

To define this function, we rely on the `for` statement, which takes each record (a `dict` object), let us call it `record`, and we get the value of the authors by `record["authors"]`, which is a list of authors for a particular bibliographic entry.  We can then use our given string together with the `in` operator to obtain the corresponding boolean value.  `True` means that the `record` matches, and we should append it to a temporary list that we are going to return.  We append using the `list` method `append()`.  If `False` is returned, we ignore the record.
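
The membership test at the heart of this function can be tried on its own.  The sketch below uses shortened versions of two records from this notebook and a list comprehension in place of the `for` loop; note that `in` on a list looks for an exact element, so a partial name does not match.

```python
records = [
    {"id": "skiena2017data", "authors": ["Steven Skiena"]},
    {"id": "tufte1983visual", "authors": ["Edward Tufte"]},
]

# The full name is an element of the first record's list of authors.
matches = [r for r in records if "Steven Skiena" in r["authors"]]
print([r["id"] for r in matches])  # ['skiena2017data']

# A partial name is not an element of either list, so nothing matches.
partial = [r for r in records if "Skiena" in r["authors"]]
print(partial)  # []
```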

In [22]:
def biblio_query_record_by_author(biblio: list, author: str) -> list:
    """Return all records whose list of authors contains `author`."""
    matches = []
    for record in biblio:
        # "authors" is a list of names; a missing key counts as no authors
        if author in record.get("authors", []):
            matches.append(record)
    return matches  # return after the loop so that every record is checked

In [26]:
xrecord = {"id": "skiena2017data", "title": "The data science design manual", "authors": ["Steven Skiena"], "year": 2017, "publisher": "Springer"}
# pass the list of records, not the name of the file
assert xrecord in biblio_query_record_by_author(biblio, "Steven Skiena")

### `biblio_query_record_by_key` (optional)

We have not asked for this function, but it is worth considering that we might want to retrieve a list of records on the basis of criteria other than just the `author`.  One way is to define a function for every possible key that we could include; the second option is to define a general function that takes a `key-value` pair as arguments and returns all the matching records.

After having `biblio_query_record_by_author` defined, it is not hard to define the abstracted version.  We leave it as a small exercise to think about how best to do it.

In [None]:
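def biblio_query_record_by_key(biblio: list, key: str, value) -> list:
    """Return all records whose field `key` equals or contains `value`."""
    matches = []
    for record in biblio:
        field = record.get(key)
        # list-valued fields such as "authors" match if they contain the value
        if field == value or (isinstance(field, list) and value in field):
            matches.append(record)
    return matches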

### `biblio_remove_record`

The simplest way of thinking here is that we have a list of records and we want to remove a record if it is present.  Our function `biblio_record_present`, already defined, might be helpful to check whether a record is present.

Another way could be to use the `list` method `remove()`, which takes as an argument the object that we want to remove.  There is a need to be careful here.  If the object that we are trying to remove is not present in the first place, then the method `remove()` raises a `ValueError`.  At this point, we can either use our utility function or use `try...except` to avoid the error.
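
Both options can be sketched in a few lines.  The records are shortened for illustration, and the variable names are ours.

```python
records = [{"id": "skiena2017data"}]
missing = {"id": "knuth1986art"}

# Option 1: attempt the removal and handle the error.
try:
    records.remove(missing)
except ValueError:
    print("record not found, nothing removed")

# Option 2: check first, so that no exception is raised at all.
if missing in records:
    records.remove(missing)

print(len(records))  # 1
```

Note that `remove()` deletes only the first matching element, which is enough as long as the database holds no duplicates.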

In [25]:
def biblio_remove_record(biblio: list, record: dict) -> list:
    """Return the list of records without `record`."""
    try:
        biblio.remove(record)  # raises ValueError if the record is absent
    except ValueError:
        print("Does not exist")
    return biblio

In [None]:
xrecord = {"id": "skiena2017data", "title": "The data science design manual", "authors": ["Steven Skiena"], "year": 2017, "publisher": "Springer"}
# pass the list of records, not the name of the file
assert xrecord not in biblio_remove_record(biblio, xrecord)