# SQLAlchemy – NoSQL (optional)

This last course is optional and intended for those of you who have fully understood the entire SQL section. The topics are not covered in depth : the goal is simply to make you aware that they exist, and to give you some basic notions in case you meet these subjects in the future or wish to explore them on your own to perfect your mastery of databases with Python.

## SQLAlchemy

Unlike SQLite, SQLAlchemy is not a database engine or a DBMS, but a database access toolkit. More precisely, SQLAlchemy is best known for its most widely used component, the ORM (Object Relational Mapper). What is it ? Python is an object-oriented language, and SQLAlchemy’s ORM maps the results of SQL queries to objects, which are easier and more powerful to manipulate in Python. SQLAlchemy is an interface between the database engine and Python : it makes the handling of SQL queries transparent while preserving their efficiency.
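
To make the idea of "mapping query results to objects" concrete, here is a minimal hand-written sketch of what an ORM automates. It does not use SQLAlchemy at all : it relies only on the standard `sqlite3` module, and the `Country` dataclass and the three-row in-memory table are assumptions made for the illustration.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Country:
    """Plain Python object standing in for one row of a Country table."""
    id: int
    name: str


# Hypothetical in-memory table, only so that the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Country (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO Country (id, name) VALUES (?, ?)",
    [(1, "Belgium"), (1729, "England"), (4769, "France")],
)

# The "mapping" step : each result tuple becomes a Country object.
rows = conn.execute("SELECT id, name FROM Country ORDER BY id").fetchall()
countries = [Country(*row) for row in rows]
conn.close()

for country in countries:
    # attribute access (country.name) instead of positional access (row[1])
    print(country.id, country.name)
```

An ORM such as SQLAlchemy’s generates both the SQL and this conversion step from a class declaration, which is what the models presented later in this course are for.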

Some of the classes used by SQLAlchemy are listed below (you will intuitively recognize the SQL entities they correspond to) :

* Base
* Table
* Column
* ForeignKey
* Integer
* String
* etc.

### Simple : SQLAlchemy to connect to a database

The simplest use of SQLAlchemy is to declare an engine that points to an existing database, then open a connection from it. We can then send queries through this connection with functions like `pd.read_sql_query()` :


In [10]:
from sqlalchemy import create_engine, text

# SQLite URL : 'sqlite:///' followed by the path to the database file
engine = create_engine('sqlite:///data/european-soccer.sqlite')

In [11]:
import pandas as pd

def request(query, engine=engine):
    with engine.begin() as conn:
        return pd.read_sql_query(text(query), conn)

In [12]:
print(request('SELECT * FROM Country'))

       id         name
0       1      Belgium
1    1729      England
2    4769       France
3    7809      Germany
4   10257        Italy
5   13274  Netherlands
6   15722       Poland
7   17642     Portugal
8   19694     Scotland
9   21518        Spain
10  24558  Switzerland
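
The `request` helper above receives a complete SQL string. When part of the query comes from a variable, `text()` also accepts named bind parameters (`:name`), which avoids building the SQL by string concatenation. The sketch below is only an illustration : it uses an in-memory SQLite database and a hypothetical three-row copy of the `Country` table so that it runs on its own, and `find_country` is a name invented for the example.

```python
from sqlalchemy import create_engine, text

# In-memory SQLite database, used only to keep this sketch self-contained.
memory_engine = create_engine("sqlite://")

with memory_engine.begin() as conn:
    # Hypothetical miniature copy of the Country table.
    conn.execute(text("CREATE TABLE Country (id INTEGER PRIMARY KEY, name TEXT)"))
    conn.execute(
        text("INSERT INTO Country (id, name) VALUES (:id, :name)"),
        [
            {"id": 1, "name": "Belgium"},
            {"id": 1729, "name": "England"},
            {"id": 4769, "name": "France"},
        ],
    )


def find_country(name, engine=memory_engine):
    """Return the (id, name) rows matching `name`, using a bound parameter."""
    with engine.connect() as conn:
        result = conn.execute(
            text("SELECT id, name FROM Country WHERE name = :name"),
            {"name": name},
        )
        return [tuple(row) for row in result]


rows = find_country("France")
print(rows)
```

The value of `name` is sent to the database separately from the SQL text, so it is never interpreted as SQL.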


With SQLAlchemy, you can create an engine for most DBMSs, so you get the same interface whichever DBMS you connect to. For example, if MySQL runs on your computer, you can connect to a MySQL database as follows :

```Python
##################################################################
# To connect to a MySQL database running locally with SQLAlchemy #
##################################################################

from sqlalchemy import create_engine

user = "<user_name>"
password = "<user_password>"
db_name = "<database_name>"
port = 3306 # default port for MySQL
host = "127.0.0.1" # you can also try "localhost" ; if MySQL runs on a remote machine,
                   # host will be the server address
# the "mysql+mysqldb" prefix selects the mysqlclient (MySQLdb) driver, which must be installed
connection_infos = f"mysql+mysqldb://{user}:{password}@{host}:{port}/{db_name}"
engine = create_engine(connection_infos)
```
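
One practical detail when building such a URL by hand : characters like `@`, `:` or `/` in the user name or password have a special meaning in a URL and need to be percent-encoded. The sketch below shows one way to do it with the standard library’s `urllib.parse.quote_plus` ; the credentials, the database name and the `mysql_url` helper are hypothetical values chosen for the illustration.

```python
from urllib.parse import quote_plus

user = "report_user"       # hypothetical credentials, for illustration only
password = "p@ss:word/1"   # contains characters that are special in a URL
host = "127.0.0.1"
port = 3306
db_name = "shop"           # hypothetical database name


def mysql_url(user, password, host, port, db_name):
    """Build a MySQL connection URL with the credentials percent-encoded."""
    return (
        f"mysql+mysqldb://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db_name}"
    )


connection_infos = mysql_url(user, password, host, port, db_name)
print(connection_infos)
# The resulting string can then be passed to create_engine(connection_infos).
```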

### Advanced : SQLAlchemy models

If we want to use SQLAlchemy not only to connect to a database but also to query it without writing SQL, we have to create a model. The model defines the mapping between the database tables (and query results) and the objects defined and used in Python.

A model is a class that inherits from the `Base` class and which, as its name suggests, models the columns that make up a table, the relationships between tables, etc.

Remember our `movies.db` ?

Let’s list its tables and columns, as a refresher :

In [16]:
import sqlite3

######################################################
# List tables and columns of a database (arg = path) #
######################################################

def explore_db(path: str):
    conn = sqlite3.connect(path)
    c = conn.cursor()
    
    list_table = '''
    PRAGMA table_list;
    '''
    c.execute(list_table)
    for row in c.fetchall():
        table = row[1] # to improve readability
    
        # then get columns names
        column_list = 'PRAGMA table_info(' + table +')'
        print('\n----- table ' + row[1] + ' columns -----\n')
        c.execute(column_list)
        
        # print columns names
        for row in  c.fetchall():
            column = row[1] # to improve readability
            print(column)
        
        # finally print first lines of each table   
        print('\n table ' + table + ' first lines :\n')
        select_all = 'SELECT * FROM ' + table + ' LIMIT 5'
        c.execute(select_all)
        for row in  c.fetchall():
            print(row)

    conn.close() # release the database file once everything is printed

In [17]:
explore_db('data/movies.db')


----- table Credits columns -----

Id
Movie_id
Direction
Producer
Studio
Playscreen
Cast
Country

 table Credits first lines :

(1, 3, '"Big director"', '"Big producer"', '"Big studio"', '"Big screenwriter"', '"Big Actor 1, Big Actor 2, Other big actors"', '"Big country"')
(2, 1, '"Unknown director"', '"Unknown producteur"', '"Unknown studio"', '"Unknown screenwriter"', '"Unknown actor 1, Unknown acteur 2, Other unknown actors"', '"Unknown country"')
(3, 2, '"Small director"', '"Small producer"', '"Small studio"', '"Small screenwriter"', '"Small actor 1, Small actor 2, Small other actors"', '"Small country"')
(4, 5, '"Acceptable director"', '"Acceptable producer"', '"Acceptable studio"', '"Acceptable screenwriter"', '"Acceptable actor 1, Acceptable actor 2, Other acceptable actors"', '"Acceptable country"')
(5, 4, '"Incompetent director"', '"Incompetent producer"', '"Incompetent studio"', ' "Incompetent screenwriter"', '"Incompetente actor 1, Incompetent actor 2, Other incompetent act

So we have two tables :

* Credits (8 columns, listed above)
* Movies (7 columns : Id, Title, Date, Duration, Budget, First_week_viewers, Votes)

Now that we have refreshed our memory, let’s connect to the database :

In [1]:
from sqlalchemy import create_engine, text

# SQLite URL : 'sqlite:///' followed by the path to the database file
engine = create_engine('sqlite:///data/movies.db')

And let’s create the following model :

In [2]:
from sqlalchemy import Table, Column, ForeignKey, Integer, String, Float
from sqlalchemy.orm import declarative_base

Base = declarative_base() # base class that the model classes
                          # defined below will inherit from

class Movies(Base):    # model class mapped to the Movies table
    __tablename__ = 'Movies' 
    Id = Column(Integer, primary_key=True) # each attribute is a column of the table
    Title = Column(String)
    Date = Column(String)
    Duration = Column(Integer)
    Budget = Column(Integer)
    First_week_viewers = Column(Integer)
    Votes = Column(Float)
    
class Credits(Base):
    __tablename__ = 'Credits'
    Id = Column(Integer, primary_key=True)
    Movie_id = Column(Integer, ForeignKey('Movies.Id')) # relationship between the tables
    Direction = Column(String)
    Producer = Column(String)
    Studio = Column(String)
    Playscreen = Column(String)
    Cast = Column(String)
    Country = Column(String)
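
A model can also be used in the other direction, to create the tables it describes. The sketch below illustrates this on an in-memory SQLite database with a reduced copy of the model : the `DemoBase`, `DemoMovie` and `DemoCredit` names, the subset of columns and the inserted rows are assumptions made for the example, not part of `movies.db`.

```python
from sqlalchemy import Column, Float, ForeignKey, Integer, String, create_engine, inspect
from sqlalchemy.orm import Session, declarative_base

DemoBase = declarative_base()


class DemoMovie(DemoBase):
    """Reduced, hypothetical copy of the Movies model."""
    __tablename__ = "Movies"
    Id = Column(Integer, primary_key=True)
    Title = Column(String)
    Votes = Column(Float)


class DemoCredit(DemoBase):
    """Reduced, hypothetical copy of the Credits model."""
    __tablename__ = "Credits"
    Id = Column(Integer, primary_key=True)
    Movie_id = Column(Integer, ForeignKey("Movies.Id"))
    Direction = Column(String)


demo_engine = create_engine("sqlite://")   # empty in-memory database
DemoBase.metadata.create_all(demo_engine)  # emits CREATE TABLE for each model

table_names = sorted(inspect(demo_engine).get_table_names())
print(table_names)

with Session(demo_engine) as session:
    # Rows are created as Python objects, then written on commit.
    session.add(DemoMovie(Id=1, Title="A good movie", Votes=4.36))
    session.add(DemoCredit(Id=1, Movie_id=1, Direction="Unknown director"))
    session.commit()
    stored_title = session.get(DemoMovie, 1).Title  # lookup by primary key

print(stored_title)
```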

Now that the model is declared and the connection to the database is established, let’s see how to make a request.

To do that, we have to open a `Session` : the SQLAlchemy documentation describes it as a "holding zone" for the objects we have loaded or associated with it, and states that "it provides the interface where SELECT and other queries are made that will return and modify ORM-mapped objects".

Once the `session` is instantiated, we can call its `.query` method, which we chain with other methods to build a query.

Here is the query that selects all the records of the `Movies` table :

In [3]:
from sqlalchemy.orm import Session

with Session(engine) as session:
    results = session.query(Movies).all()

To display the results, we simply iterate over them and read the attributes (columns) we want :

In [4]:
for result in results: 
    print(result.Title, result.Votes) 

A good movie 4.36
Another good movie, slightly better 4.63
A bad movie, but with some success 4.26
A very bad movie 2.86
A not so bad movie 3.86


The `.filter` method is used to add selection conditions (the equivalent of a `WHERE` clause) :

In [6]:
with Session(engine) as session:
    results = session.query(Movies).filter(Movies.Votes > 4.0).all()
    for result in results: 
        print(result.Title, result.Votes) 

A good movie 4.36
Another good movie, slightly better 4.63
A bad movie, but with some success 4.26


Other methods used to build queries :
* `.group_by()`
* `.count()`
* `.order_by()`
* `.join()`
* etc.

You can read the [query guide](https://docs.sqlalchemy.org/en/14/orm/queryguide.html) in the SQLAlchemy documentation.
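
As an illustration of how the methods listed above chain together, here is a small self-contained sketch. It runs on an in-memory SQLite database with a reduced copy of the model ; the `SketchBase`, `Film` and `FilmCredit` names, the movie ids and the `Country` values are assumptions made for the example (only the titles and votes are taken from the output shown earlier).

```python
from sqlalchemy import Column, Float, ForeignKey, Integer, String, create_engine, func
from sqlalchemy.orm import Session, declarative_base

SketchBase = declarative_base()


class Film(SketchBase):
    __tablename__ = "Movies"
    Id = Column(Integer, primary_key=True)
    Title = Column(String)
    Votes = Column(Float)


class FilmCredit(SketchBase):
    __tablename__ = "Credits"
    Id = Column(Integer, primary_key=True)
    Movie_id = Column(Integer, ForeignKey("Movies.Id"))
    Country = Column(String)


sketch_engine = create_engine("sqlite://")
SketchBase.metadata.create_all(sketch_engine)

with Session(sketch_engine) as session:
    session.add_all([
        Film(Id=1, Title="A good movie", Votes=4.36),
        Film(Id=2, Title="Another good movie, slightly better", Votes=4.63),
        Film(Id=3, Title="A bad movie, but with some success", Votes=4.26),
        Film(Id=4, Title="A very bad movie", Votes=2.86),
        Film(Id=5, Title="A not so bad movie", Votes=3.86),
        FilmCredit(Id=1, Movie_id=1, Country="France"),
        FilmCredit(Id=2, Movie_id=2, Country="France"),
        FilmCredit(Id=3, Movie_id=3, Country="Italy"),
        FilmCredit(Id=4, Movie_id=4, Country="Italy"),
        FilmCredit(Id=5, Movie_id=5, Country="Italy"),
    ])
    session.commit()

    # .order_by() : ORDER BY Votes DESC
    best_first = [film.Title for film in session.query(Film).order_by(Film.Votes.desc())]

    # .filter() + .count() : number of rows matching the condition
    n_good = session.query(Film).filter(Film.Votes > 4.0).count()

    # .join() + .group_by() : number of movies per country
    per_country = [
        tuple(row)
        for row in session.query(FilmCredit.Country, func.count(Film.Id))
        .select_from(FilmCredit)
        .join(Film, Film.Id == FilmCredit.Movie_id)
        .group_by(FilmCredit.Country)
        .order_by(FilmCredit.Country)
    ]

print(best_first[0], n_good, per_country)
```

Each chained call returns a new query object, and the SQL is only sent to the database when the results are requested (by iterating, or with `.all()` or `.count()`).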

You may ask : isn’t it a bit complicated just to write a query ? It is a high-level approach that answers specific needs (avoiding raw SQL, working with objects in an OOP style, staying agnostic of the underlying DBMS, etc.).

As previously said, this is not an extensive lecture about SQLAlchemy, but the idea is to mention the existence of SQLAlchemy in case you are working on a project that uses it or requires it.

## NoSQL

### Limits of the relational model

As we have learned, SQL was designed to query relational databases. This kind of database has numerous advantages :

* the relational model makes it easy to query relationships between data belonging to multiple tables
* data is stored in a well-structured manner : the structure of the data model and the data types are defined before the data is actually manipulated
* this structure brings constraints that guarantee that storage is safe and robust (very low risk of error). You can’t delete or add features or data that would make the dataset incoherent (delete a column that is referenced by a foreign key in another table, create a column whose data type or default value conflicts with existing data or with the structure definition, etc.)
* [ACID](https://en.wikipedia.org/wiki/ACID) compliant :
    * Atomicity : a transaction (one query or a group of queries) is executed entirely, or not at all (if an error occurs during execution). For example, an UPDATE cannot stop halfway, with some data modified and the rest not. This prevents inconsistencies from appearing.
    * Consistency : the dataset is valid before a query is processed, and still valid once it has been processed. If you write new data to the dataset, it must be valid according to all the rules defined in the model
    * Isolation : when several users access the database at the same time, the DBMS has to deal with concurrent queries (that read and write the same data at the same time). The isolation principle ensures that, in the end, the database is in the same state as if the queries had been executed sequentially (one after the other)
    * Durability : long-term storage is secure : committed data can’t be lost
  Thanks to the ACID guarantees, complex operations such as joins can be performed reliably in a single query
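
Atomicity can be observed directly with the standard `sqlite3` module. In the sketch below, the `accounts` table, its `CHECK` constraint and the `transfer` helper are assumptions made for the illustration : a transfer that would leave a negative balance fails halfway, and the part that had already been applied is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move `amount` between two accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # violates the CHECK constraint
print(ok, failed, balances(conn))
```

In the failed transfer, the credit to `bob` is executed first and succeeds ; it is the rollback of the whole transaction that prevents this partial change from being kept.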

But these qualities can become defects in some cases :

* the emphasis on (and necessity of) the model and the predefined structure lengthens the development time of such databases
* another drawback of this dependence on structure is that it can’t manage unstructured data or data whose characteristics are not known in advance
* the implementation, management and administration of relational databases is monolithic : they are hard to scale horizontally, and easier to scale up (scale vertically)

  Some terminology is needed here : 
    * horizontal scaling is the process where the processing capacity of a database is increased by adding other servers or nodes that manage the data
    * scaling up, or vertical scaling, is the process where the processing capacity of a database is increased by upgrading the server that manages the database (more memory, more computing power, more storage capacity, etc.), which is more expensive and more demanding, and you can’t scale up forever (there is a physical limit : maximum memory, CPU, etc.)

  When a database gets bigger, we may want to divide the data into smaller entities :
    * horizontal partitioning : the records (rows) belonging to the same table are distributed between several tables. For example, a customers table could be divided into several tables, each one gathering the customers of a specific city (one table for Marseille, one for Paris, etc.), all of them holding records with the same columns. Horizontal scaling is easier in this case : you can add servers that each manage given tables. The problem is that the schema becomes confusing (several tables holding records with the same structure), especially when you have to write data and enforce constraints (reading is far simpler to deal with)
    * vertical partitioning : a table is split along its columns (each row is split into several narrower rows). For example, a customers table containing the customers’ identity (first name, last name, birthday…), address and orders could be divided into an identity table, an address table and an orders table
    * sharding : it is similar to horizontal partitioning, but goes further. In horizontal partitioning, there is only one schema holding the several tables ; with sharding, the rows are distributed between several instances of the schema. Each shard (partition) is totally independent and can be hosted on a different server, datacenter, etc.
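
The sharding idea can be sketched with a few independent in-memory SQLite databases playing the role of separate servers. Everything here is an assumption made for the illustration : the number of shards, the `customers` schema, the sample rows, and the choice of `zlib.crc32` as a stable hash to route a customer id to a shard.

```python
import sqlite3
import zlib

N_SHARDS = 3  # hypothetical number of independent databases


def make_shard():
    """Each shard is its own database holding the same schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    return conn


shards = [make_shard() for _ in range(N_SHARDS)]


def shard_for(customer_id):
    """Route a key to a shard with a stable hash of the key."""
    return shards[zlib.crc32(str(customer_id).encode()) % N_SHARDS]


def insert_customer(customer_id, name, city):
    conn = shard_for(customer_id)
    with conn:
        conn.execute("INSERT INTO customers VALUES (?, ?, ?)", (customer_id, name, city))


def get_customer(customer_id):
    """A lookup by key only touches the shard that owns the key."""
    cursor = shard_for(customer_id).execute(
        "SELECT id, name, city FROM customers WHERE id = ?", (customer_id,)
    )
    return cursor.fetchone()


def count_by_city(city):
    """A query that does not use the shard key has to ask every shard."""
    return sum(
        conn.execute("SELECT COUNT(*) FROM customers WHERE city = ?", (city,)).fetchone()[0]
        for conn in shards
    )


for customer in [
    (1, "Ana", "Paris"), (2, "Ben", "Marseille"), (3, "Cleo", "Paris"),
    (4, "Dan", "Lyon"), (5, "Eve", "Marseille"), (6, "Fay", "Paris"),
]:
    insert_customer(*customer)

print(get_customer(3), count_by_city("Paris"))
```

The sketch also hints at the cost : lookups by key are cheap, but any query across shards (here `count_by_city`) must be assembled by the application, and constraints spanning several shards are no longer enforced by a single DBMS.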

This leads to the idea that when data gets big, really big, the database schema needs flexibility. The relational model reaches its limits.

### NoSQL 

NoSQL means "Not only SQL", not "Not SQL"! It rather refers to "non-relational". Examples of NoSQL solutions (disclaimer : the categories are not as strict as presented here, these are just examples to fix ideas) : 
* MongoDB, Couchbase (documents, i.e. more or less complex JSON, rather than rows)
* Redis, Amazon DynamoDB (key-value)
* Cassandra, Bigtable, Accumulo (columns)
* Amazon Neptune, Neo4j (graphs)

NoSQL addresses the limits described just above :

* NoSQL is flexible because *it just does not support* relationships between tables (simple!). So it can deal with unstructured data, or data whose type is not well known (or totally unknown) before we build the database. The structures in NoSQL can be :
    * an unstructured document (JSON, BSON, XML, etc.)
    * a key-value pair (particularly efficient for unique but complex values)
    * a table (records stored by column rather than by row : efficient when queries only use a few columns)
    * graphs
    * time series
    * etc.
* NoSQL does not follow a single concept or schema : it covers different types of non-relational databases that correspond to different use cases
* a NoSQL database can be built dynamically : the schema does not have to be defined before we begin to manage data. Moreover, documents belonging to the same collection can have different structures (for example, key-value documents do not need to have the same keys). Sometimes the schema is static, for example when we deal with many column-oriented documents (tables).
* dealing with unstructured data makes sharding easier, and therefore makes building distributed databases and scaling horizontally far easier. NoSQL engines are optimized to operate in highly distributed environments (datacenter scale)
* in fact, NoSQL databases are generally used to build distributed databases holding large amounts of data. One part of this process is the replication of shards of data from one node to others (replicas). That is really useful to balance the load when a lot of clients want to access the data at the same time. But this operation takes time, which leads to a high risk of inconsistency if a client accesses data in a replica that has not been updated yet (that’s more a problem of distributed databases than of NoSQL specifically, but NoSQL databases are generally distributed).
* NoSQL is not specified to be ACID compliant, and there is no guaranteed simple way of performing a JOIN operation in a single request, for example (as there are no relations…)
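
The points above about dynamic schemas and the absence of a built-in JOIN can be illustrated without any NoSQL engine, using plain Python dictionaries as documents. This is only a sketch of the idea : the two collections, their fields and the `find`, `with_votes_above` and `movies_with_direction` helpers are invented for the example and do not reproduce the API of any particular database.

```python
import json

# A "collection" is a list of documents ; documents of the same collection
# do not need to share the same keys.
movies = [
    {"_id": 1, "title": "A good movie", "votes": 4.36, "tags": ["drama"]},
    {"_id": 2, "title": "A very bad movie", "votes": 2.86},
    {"_id": 3, "title": "A not so bad movie", "budget": {"amount": 10, "currency": "EUR"}},
]
credits = [
    {"movie_id": 1, "direction": "Big director"},
    {"movie_id": 3, "direction": "Small director", "cast": ["Small actor 1", "Small actor 2"]},
]


def find(collection, **criteria):
    """Return the documents whose fields equal every given criterion.

    A document that lacks one of the fields simply does not match.
    """
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


def with_votes_above(collection, threshold):
    """Documents without a 'votes' field are treated as not matching."""
    return [doc for doc in collection if doc.get("votes", 0) > threshold]


def movies_with_direction(movies, credits):
    """Application-side 'join' : the lookup is done in code, not by the database."""
    by_movie = {credit["movie_id"]: credit for credit in credits}
    return [
        {**movie, "direction": by_movie[movie["_id"]]["direction"]}
        for movie in movies
        if movie["_id"] in by_movie
    ]


joined = movies_with_direction(movies, credits)
print(json.dumps(joined, indent=2))
```

The flexibility has a price : every function that reads the documents must cope with missing fields, a check that a relational schema would have enforced once and for all.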

Moreover, be careful. Know what you’re doing. Relational databases can also be dynamic, distributed, and deal with JSON or exotic data structures. It’s just easier with NoSQL tools. Conversely, as MongoDB is very easy to use, some people reach for a MongoDB database by reflex, even in situations where they end up building a NoSQL database that follows a relational model… that’s nonsense.

### UnQLite

UnQLite is an embedded NoSQL component comparable to SQLite (no need to install a third-party server, lightweight : less than 1.8 MB). It is written in C, and a Python binding can be installed with `pip`, along with Cython. 

Cython is a language very close to Python, to which it adds support for some C/C++ constructs. It simplifies the coding of extensions for Python.

In [11]:
!pip install Cython unqlite



In [9]:
import unqlite