#### SQL Cheat Sheet
http://www.sqltutorial.org/wp-content/uploads/2016/04/SQL-cheat-sheet.pdf


## SQLite3

### Working with SQLite3 DBs in Jupyter Notebook

    #First we import the sqlite3 module
    ```python
    import sqlite3
    ```

    #Next we create our connection to the sqlite database file `pet_database.db` by using the method `.connect()` and the file name we would like for our database.
    ```python
    connection = sqlite3.connect('pet_database.db')
    ```

    #Then we create the *cursor* which we will use to execute SQL statements
    ```python
    cursor = connection.cursor()
    ```

    #Finally, when we want to execute our SQL statements we reference our SQL cursor object and call the method `.execute()` with our SQL statement as the argument
    ```python
    sql_return = cursor.execute('''SQL statement GOES here;''')
    ```
    To see a list of the information we retrieved from our SQL statement, we can take our `sql_return` variable and call the method `.fetchall()`, which will return a list of records (if we are executing a `SELECT` statement).
    
    #creating a table CATS with ID column as primary key using integers, a NAME column using text and an AGE column using integers
    cursor.execute('''
                CREATE TABLE cats (
                id INTEGER PRIMARY KEY,
                name TEXT, 
                age INTEGER
                );'''
               ) 
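
To make the steps above concrete, here is a minimal end-to-end sketch. It assumes an in-memory database (`':memory:'`) instead of `pet_database.db` so it can be rerun freely, and the two cat rows are made-up sample data.

```python
import sqlite3

# Assumption: an in-memory database, so nothing is written to disk
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

cursor.execute('''
    CREATE TABLE cats (
        id INTEGER PRIMARY KEY,
        name TEXT,
        age INTEGER
    );''')

# Hypothetical sample rows; the ? placeholders let sqlite3 handle quoting
cursor.executemany('INSERT INTO cats (name, age) VALUES (?, ?);',
                   [('Maru', 3), ('Hana', 1)])
connection.commit()

# .fetchall() returns a list of tuples, one per record
rows = cursor.execute('SELECT id, name, age FROM cats ORDER BY id;').fetchall()
print(rows)

connection.close()
```

Because `id` is an `INTEGER PRIMARY KEY`, SQLite fills it in automatically when the insert leaves it out.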

### Data Types
TEXT, INTEGER, REAL (floating point), BLOB (binary data), NULL
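
As a quick check, SQLite's built-in `typeof()` function reports the storage class of a value. The literals below are arbitrary examples chosen for illustration.

```python
import sqlite3

connection = sqlite3.connect(':memory:')

# typeof() returns the storage class SQLite chose for each literal
storage_classes = connection.execute(
    "SELECT typeof('cat'), typeof(7), typeof(2.5), typeof(x'00ff'), typeof(NULL);"
).fetchone()
print(storage_classes)

connection.close()
```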

### Primary and Foreign Keys
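Primary Key
- A primary key uniquely identifies a record in the table.
- A primary key can't accept null values in standard SQL. (SQLite is a documented exception: in an ordinary table, a primary key column that is neither `INTEGER PRIMARY KEY` nor declared `NOT NULL` will accept NULLs.)
- In some systems, such as SQL Server, the primary key is a clustered index by default, and the table's data is physically organized in the order of that index.
- A table can have only one primary key (which may span several columns).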
    
Foreign Key
- A foreign key is a field in one table that refers to the primary key of another table.
- A foreign key column can hold NULL, and more than one row can have a NULL there.
- In many systems (SQLite and SQL Server among them), a foreign key does not automatically create an index, clustered or non-clustered. You can manually create an index on a foreign key.
- A table can have more than one foreign key.
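
The foreign key points above can be sketched with `sqlite3`. The `owners` table, the index name, and the rows are hypothetical. One SQLite-specific detail: foreign keys are only enforced after `PRAGMA foreign_keys = ON` has been run on the connection.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
# SQLite only enforces foreign keys when this pragma is on for the connection
connection.execute('PRAGMA foreign_keys = ON;')

connection.execute('CREATE TABLE owners (id INTEGER PRIMARY KEY, name TEXT);')
connection.execute('''
    CREATE TABLE cats (
        id INTEGER PRIMARY KEY,
        name TEXT,
        owner_id INTEGER REFERENCES owners (id)
    );''')
# Foreign keys are not indexed automatically here, so add an index by hand
connection.execute('CREATE INDEX idx_cats_owner_id ON cats (owner_id);')

connection.execute("INSERT INTO owners (id, name) VALUES (1, 'Ana');")
connection.execute("INSERT INTO cats (name, owner_id) VALUES ('Maru', 1);")
# NULL is accepted in a foreign key column, and more than once
connection.execute("INSERT INTO cats (name, owner_id) VALUES ('Stray A', NULL);")
connection.execute("INSERT INTO cats (name, owner_id) VALUES ('Stray B', NULL);")

try:
    # There is no owner with id 99, so this insert is rejected
    connection.execute("INSERT INTO cats (name, owner_id) VALUES ('Hana', 99);")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

cat_count = connection.execute('SELECT COUNT(*) FROM cats;').fetchone()[0]
print(rejected, cat_count)
connection.close()
```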



## Using SQL syntax with `pandasql`

Since SQL is such a powerful, comfortable tool for Data Scientists, some people had the bright idea of creating a library that lets users query DataFrames using SQL-style syntax. This library is called [pandasql](https://pypi.org/project/pandasql/).

We can install `pandasql` using the shell command `pip install pandasql`.

#### Importing pandasql

In order to use `pandasql`, we need to start by importing the `sqldf` function from `pandasql`.

    ```python
    from pandasql import sqldf
    ```

Next, we'll write a lambda function that will make it quicker and easier to write queries.  Normally, we would have to pass in the global variables every time we use an object.  In order to avoid doing this every time, we'll write a lambda that does this for us. 

    ```python
    pysqldf = lambda q: sqldf(q, globals())
    ```

Now, when we pass a query into `pysqldf`, the lambda will also pass along the globals for us, saving us that repetitive task. 

#### Writing Queries

To write a query, we just format it as a multi-line string!

    ```python
    q = """SELECT
            m.date, m.beef, b.births
         FROM
            meats m
         INNER JOIN
            births b
               ON m.date = b.date;"""
    ```

In order to query DataFrames, we can just pass the query string we've created to the `pysqldf` lambda, which wraps `sqldf`. This will return a DataFrame.

    ```python
    results = pysqldf(q)
    ```

## Normalization in Databases
You've probably noticed by now that tables in databases are essentially just spreadsheets. However, there's a reason that companies spend millions of dollars on building and maintaining performant relational databases, instead of just keeping a massive spreadsheet in Excel. The main advantage that databases provide is a robust way to organize our tables. These organization strategies are referred to as normalization. There are different levels of normalization, ranging from 0th normal form to 3rd normal form (there are more normalized versions than 3rd normal form, but they're rare enough that you likely won't have to worry about them).

#### What is Database Normalization?
Database normalization refers to the practice of storing data across one or more tables based on the information that data contains. You might have noticed that although we typically jam everything into a single DataFrame when exploring or modeling our data, things like a customer's name, address, and order history are typically stored in separate tables, and that each of those corresponding tables only contains data pertaining to a specific topic (e.g. all addresses stored in the "addresses" table, orders stored in the "orders" table, etc). The information is linked to other relevant records by the use of keys.

#### Benefits of Database Normalization
You've probably wondered at some point--why bother? Storing data in separate tables and then joining with foreign keys when that information is needed can be a bit time consuming. However, Database Normalization provides some great benefits:

- 1. Minimize Duplicate Data
By storing data in separate tables, we can avoid the need to store duplicate data in our database. Duplicating records is a waste of space, and space used to be quite expensive! By storing a customer's information in a "Customer" table, we can just reference the appropriate key for that customer in an "Orders" table every time that customer places an order. In a denormalized database to track orders (technically called 0th normal form), every row in the spreadsheet would be an order, and if a customer has placed 20 orders, then you'll have that customer's name, shipping address, and other information repeated across 20 different rows! With a normalized database, we can save on memory by just pointing to that customer in the customer table every time they make an order.

- 2. Minimize Data Modification Issues
Another benefit of a normalized database is that it makes it much easier to avoid issues that arise from modifying information. Consider the example we used above of a single spreadsheet storing information about orders and the customers that placed them. Let's assume that our customer changes their address. The simplest solution here would be to put the new address only on new orders, and leave the old address alone in the old orders. But what happens when you try to query for that customer's address? That query will return two different addresses for that customer, with no obvious way for you to tell which address is current. If we instead decide to change all instances of that customer's address to match the current one, that leads to a performance problem: one change of address means updating every single row for that customer in our spreadsheet, which would slow down our database.

By normalizing a database, we can plan ahead for issues like this. With an "Address" table, perhaps we can add datestamps to each address to tell when it is changed, or a column allowing the customer to name the different addresses they ship items to. Best of all, if we decide to change a value, we only need to change it in one place--whether that customer has placed one order or one million, we're only changing a single cell in a single table--no performance hit!
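
A small `sqlite3` sketch of this "change it in one place" idea follows. The table layout, names, and addresses are assumptions made up for illustration.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.executescript('''
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers (id),
        item TEXT
    );
    INSERT INTO customers (id, name, address) VALUES (1, 'Jane', '1 Old Road');
    INSERT INTO orders (customer_id, item)
        VALUES (1, 'widget'), (1, 'gadget'), (1, 'gizmo');
''')

# The address lives in exactly one row, so one UPDATE fixes it everywhere
updated = connection.execute(
    "UPDATE customers SET address = '2 New Street' WHERE id = 1;").rowcount

# Every order picks up the current address through the join
addresses = connection.execute('''
    SELECT DISTINCT c.address
    FROM orders o
    INNER JOIN customers c ON o.customer_id = c.id;''').fetchall()
print(updated, addresses)
connection.close()
```

Only one row is touched by the update, no matter how many orders the customer has placed.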

- 3. Simplifying Queries
This was alluded to in the previous paragraphs, but one of the main benefits of database normalization is that it simplifies the structure of our queries when we need to get information.

For instance, let's assume we have a spreadsheet containing information on sales associates in our company. Each row represents a different member of the sales team. Each customer that the associate has dealt with is stored as a different column in the spreadsheet under headings such as "Customer_1", "Customer_2", etc. Some companies have been with us a long time, and have dealt with multiple sales associates. What if we wanted to query our data to get all the sales associates that have ever sold anything to IBM?

Our query would be horrible, and would look something like:

    SELECT SalesAssociate FROM SalesTeam
    WHERE Customer_1 = 'IBM' OR
    Customer_2 = 'IBM' OR
    Customer_3 = 'IBM' OR ... -- continues on like this for every customer column :-(

This becomes much, much simpler when we use a normalized database. We can just store all of our sales associate data in one table, all of our customer information in another table, and link them together with a join table (since this is a many-to-many relationship). This greatly simplifies our query.
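
Here is one way the normalized version might look, sketched with `sqlite3`. The table and column names (`sales_associates`, `customers`, `associate_customers`) and the rows are hypothetical.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.executescript('''
    CREATE TABLE sales_associates (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    -- Join table: one row per (associate, customer) pairing
    CREATE TABLE associate_customers (
        associate_id INTEGER REFERENCES sales_associates (id),
        customer_id INTEGER REFERENCES customers (id),
        PRIMARY KEY (associate_id, customer_id)
    );
    INSERT INTO sales_associates VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO customers VALUES (1, 'IBM'), (2, 'Acme');
    INSERT INTO associate_customers VALUES (1, 1), (2, 2), (3, 1), (3, 2);
''')

# One WHERE clause, however many customers each associate has
ibm_associates = connection.execute('''
    SELECT s.name
    FROM sales_associates s
    INNER JOIN associate_customers ac ON ac.associate_id = s.id
    INNER JOIN customers c ON c.id = ac.customer_id
    WHERE c.name = ?
    ORDER BY s.name;''', ('IBM',)).fetchall()
print(ibm_associates)
connection.close()
```

The query no longer grows with the number of customers, because each pairing is a row rather than a column.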

#### Types of Normal Forms
Although there are stricter types of normalization such as 4th and 5th normal form, in practice, you'll rarely run into a database stored in versions other than 1st, 2nd, or 3rd normal form. Since you're a data scientist, not a database administrator, you don't need to spend too much time worrying about the differences between the three. However, you should have a basic understanding of what each means.

1st Normal Form: Each column holds a single value per row, all rows have the same columns, and no column is repeated (no `Customer_1`, `Customer_2`, ... style groups).

2nd Normal Form: Meets the specifications of 1st normal form, plus all column data depends on the entire primary key, and not just part of it (remember, primary keys can be a composite of 2 or more columns in a table!).

3rd Normal Form: Meets the specifications of 2nd normal form, plus no non-key column depends on another non-key column. Each column in the table depends on the primary key, the whole primary key, and nothing but the primary key.
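
As a rough illustration of the step to 3rd normal form, the sketch below starts from a hypothetical `employees_flat` table in which `department_name` depends on `department_id` rather than on the primary key, and moves that dependency into its own table. All names and rows are assumptions.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.executescript('''
    -- Not in 3rd normal form: department_name depends on department_id,
    -- which is not the primary key
    CREATE TABLE employees_flat (
        id INTEGER PRIMARY KEY, name TEXT,
        department_id INTEGER, department_name TEXT
    );
    INSERT INTO employees_flat VALUES
        (1, 'Ana', 10, 'Sales'), (2, 'Ben', 10, 'Sales'), (3, 'Cy', 20, 'Support');

    -- 3rd normal form: the dependent column moves to its own table
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO departments
        SELECT DISTINCT department_id, department_name FROM employees_flat;
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY, name TEXT,
        department_id INTEGER REFERENCES departments (id)
    );
    INSERT INTO employees SELECT id, name, department_id FROM employees_flat;
''')

departments = connection.execute(
    'SELECT id, name FROM departments ORDER BY id;').fetchall()
print(departments)
connection.close()
```

Each department name is now stored once, instead of once per employee.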

*Often, you'll need to get data out of databases where it is stored in 3rd normal form, and then denormalize the data you need so that it fits in a single DataFrame we can use for all our data science-y purposes.*

#### Table Relationships
Recall that entities stored in our database tables can be related to one or more entities in other tables. Before we move onto reading Entity-Relationship diagrams in the next lesson, we'll quickly review the different types of relationships, and provide an example of each.

##### One-to-One Relationships
In one-to-one relationships, an entity in a table is connected to exactly one entity in a corresponding table through a foreign key.

Example: Employee and Compensation. Each employee will only have one row in the compensation table related to them.
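
One common way to enforce this in SQL, shown here as a sketch with made-up table names and values, is to put a `UNIQUE` constraint on the foreign key column.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.executescript('''
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE compensation (
        id INTEGER PRIMARY KEY,
        -- UNIQUE means at most one compensation row per employee
        employee_id INTEGER UNIQUE REFERENCES employees (id),
        salary INTEGER
    );
    INSERT INTO employees VALUES (1, 'Ana');
    INSERT INTO compensation (employee_id, salary) VALUES (1, 50000);
''')

try:
    # A second compensation row for the same employee violates UNIQUE
    connection.execute(
        'INSERT INTO compensation (employee_id, salary) VALUES (1, 60000);')
    second_row_allowed = True
except sqlite3.IntegrityError:
    second_row_allowed = False
print(second_row_allowed)
connection.close()
```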

##### One-to-Many Relationships
In one-to-many relationships, an entity in a table can be connected to one or more entities in a corresponding table through a foreign key.

Example: City to Zip Code. A city can contain multiple zip codes, but each zip code is only in one city.
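
In table form, the foreign key sits on the "many" side. The sketch below uses placeholder cities and zip codes.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.executescript('''
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT);
    -- The foreign key lives on the "many" side: each zip names its one city
    CREATE TABLE zip_codes (
        zip TEXT PRIMARY KEY,
        city_id INTEGER REFERENCES cities (id)
    );
    INSERT INTO cities VALUES (1, 'Springfield'), (2, 'Shelbyville');
    INSERT INTO zip_codes VALUES ('00001', 1), ('00002', 1), ('00003', 2);
''')

zips_per_city = connection.execute('''
    SELECT c.name, COUNT(z.zip)
    FROM cities c
    INNER JOIN zip_codes z ON z.city_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.name;''').fetchall()
print(zips_per_city)
connection.close()
```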

##### Many-to-Many Relationships
In many-to-many relationships, multiple entities in a table can be connected to one or more of the same entities in a corresponding table. These connections are queried through an intermediate table called a Join Table (more on this in a future lesson!).

Example: Sales Associates and Customers. In our previous normalization example, each sales associate deals with multiple companies, and each company deals with multiple sales associates.

## Entity Relationship Diagram

https://yintingchou.com/posts/2017-09-01-learning-microsoft-sql-server/ERD.png

#### ERD relationship notation

https://d2slcw3kip6qmk.cloudfront.net/marketing/pages/chart/ER-diagram-symbols-and-meaning/ERD_notation-416x315.PNG

## SQLalchemy and Object Relation Mappers (ORM)

#### Defining Our Mappings
We'll begin by importing everything we need to create our database and structure our mappings so that they look like the tables in the ERD.

    #import packages and declare a base
    from sqlalchemy import *
    from sqlalchemy.orm import relationship #to create relationships
    from sqlalchemy.ext.declarative import declarative_base #to declare a base
    Base = declarative_base()

#### Creating Class Mappings

https://www.sqlalchemy.org/

In order to set up our classes, define:

- The `__tablename__` for each class
- The attributes of each class, which will be Column objects
- The relationship that each class has to other classes


        #Complete the Customer, ShoppingCart, and Item classes.
    
        class Customer(Base):
            __tablename__ = 'customer'

            id = Column(Integer, primary_key=True)
            name = Column(String)
            cart_id = Column(Integer, ForeignKey('shoppingCart.id'))

            # Create 1-to-1 relationship with ShoppingCart, as shown in the SQLAlchemy documentation
            shoppingCart = relationship('ShoppingCart', uselist=False, back_populates='customer')
        class ShoppingCart(Base):
            __tablename__ = "shoppingCart"

            id = Column(Integer, primary_key=True)
            item_id = Column(Integer, ForeignKey('item.id'))
            # Create 1-to-1 relationship with Customer
            customer = relationship('Customer', uselist=False, back_populates='shoppingCart')
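            # Relationship with Item. The foreign key (item_id) lives on this table, so SQLAlchemy treats it as many-to-one and `items` holds a single Item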
            items = relationship('Item')
        class Item(Base):
            __tablename__ = 'item'

            id = Column(Integer, primary_key=True)
            description = Column(String)
            price = Column(Float)

        #Creating Our Database
        engine = create_engine('sqlite:///shopping_cart.db', echo=True)
        Base.metadata.create_all(engine)

        #create some objects, and then populate the database with them.

        customer1 = Customer(name="Jane")
        item1 = Item(description="widget", price=9.99)
        cart1 = ShoppingCart(customer=customer1, items = item1)
        customer1.shoppingCart = cart1
        
        #add our new data to our database tables by creating a session object.
        from sqlalchemy.orm import sessionmaker
        Session = sessionmaker(bind=engine)
        session = Session()

        #add items to our database one at a time by passing them in as a parameter to session.add(). 
        #add multiple items by passing them as a list into the add_all() method.
        session.add_all([customer1, cart1, item1])

        #see all the pending objects that have been added by checking the session's .new attribute.
        print(session.new)


        #commit our objects to push them to the database.
        session.commit()
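
For comparison, here is a minimal sketch of a list-valued one-to-many mapping, where the foreign key is placed on the "many" side. It assumes SQLAlchemy 1.4 or newer (for `sqlalchemy.orm.declarative_base`) and an in-memory SQLite database; the `Cart` and `CartItem` classes are hypothetical and separate from the classes above.

```python
from sqlalchemy import Column, Float, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

class Cart(Base):
    __tablename__ = 'cart'
    id = Column(Integer, primary_key=True)
    # A list-valued relationship: one cart, many items
    items = relationship('CartItem', back_populates='cart')

class CartItem(Base):
    __tablename__ = 'cart_item'
    id = Column(Integer, primary_key=True)
    description = Column(String)
    price = Column(Float)
    # The foreign key sits on the "many" side
    cart_id = Column(Integer, ForeignKey('cart.id'))
    cart = relationship('Cart', back_populates='items')

# Assumption: in-memory database, so no file is created
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

cart = Cart(items=[CartItem(description='widget', price=9.99),
                   CartItem(description='gadget', price=19.99)])
session.add(cart)  # the items are added too, through the relationship cascade
session.commit()

descriptions = sorted(item.description for item in session.query(Cart).one().items)
print(descriptions)
session.close()
```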


## Querying with SQLalchemy

    #Connecting to the Database
    import sqlalchemy
    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker
    engine = create_engine("sqlite:///Northwind_small.sqlite", echo=True)
    Session = sessionmaker(bind=engine)
    session = Session()

    #Get Table Names and Table Information
    from sqlalchemy import inspect
    inspector = inspect(engine)
    print(inspector.get_table_names())

    #function to print out the name and type of each column of a table in a well-formatted way.
    def get_columns_info(table_name):
        cols_list = inspector.get_columns(table_name)

        print("Table Name: {}".format(table_name))
        print("")

        for column in cols_list:
            print("Name: {} \t Type: {}".format(column['name'], column['type']))
    get_columns_info('Employee')

    #Connecting and Executing Raw SQL Statements
    #wrapping raw SQL in text() is required by SQLAlchemy 2.0 and also works in 1.x
    from sqlalchemy import text
    con = engine.connect()
    rs = con.execute(text("SELECT * FROM Customer LIMIT 5"))
    print(rs.fetchall())

    #Storing data in Pandas DataFrame
    import pandas as pd
    rs = con.execute(text("SELECT firstname, lastname, title from Employee"))
    df = pd.DataFrame(rs.fetchall())
    df.head()

Nice! We can now read our results. However, the columns of our DataFrame aren't labeled. Luckily, pandas plays nicely with the sqlalchemy library and can execute SQL queries itself, labeling the columns for us along the way:

    #query to select all orders from customer VINET
    df = pd.read_sql_query("SELECT * FROM [Order] WHERE CUSTOMERId = 'VINET'", engine)
    df.head()


    #Executing JOIN Statements
    df = pd.read_sql_query("""SELECT o.ID, c.CompanyName, Count(*) num_orders FROM [Order] \
    o INNER JOIN Customer c on o.CustomerID = c.ID GROUP BY c.CompanyName ORDER BY num_orders DESC""", engine)
    df.head()

    #JOINs with Many-To-Many Relationships
    q = """SELECT LastName, FirstName, COUNT(*) as TerritoriesAssigned from \
    Employee \
    JOIN EmployeeTerritory et on Employee.Id = et.employeeId \
    GROUP BY Employee.lastname \
    ORDER BY TerritoriesAssigned DESC"""
    df2 = pd.read_sql_query(q, engine)
    df2.head()

    #create mappings of tables to objects in python to use ORM
    from sqlalchemy import MetaData
    from sqlalchemy.ext.automap import automap_base
    metadata = MetaData()
    metadata.reflect(engine)
    Base = automap_base(metadata=metadata)
    Base.prepare()
    Employee, Customer = Base.classes.Employee, Base.classes.Customer

#### Writing Basic Queries

    #for loop that iterates through the results returned by a session.query() of the Employee table and orders the results by the Employee's .HireDate attribute.
    for instance in session.query(Employee).order_by(Employee.HireDate):
        print("Name: {}, {}  Hired: {}".format(instance.LastName, instance.FirstName, instance.HireDate))

#### Implicit JOINs using .filter()
One great benefit of using `session.query()` to query our data is that we can easily execute implicit joins by making use of the `.filter()` method.

So far we've only explicitly specified mappings for the Employee and Customer classes. We'll need to do this now for the Product and Category classes before we can use them with `session.query()`.

    #set the mappings for Product and Category.
    Product, Category = Base.classes.Product, Base.classes.Category

    #for loop that iterates through all results returned from a query of Products and Categories and use the .filter() method to only include cases where the Product's .CategoryID matches the Category's .Id attribute.
    for p, c in session.query(Product, Category).filter(Product.CategoryId==Category.Id).all():
        print("Product Name: {}  Category Name: {}".format(p.ProductName, c.CategoryName))

## NoSQL Database Types

#### Key-Value Databases
Key-value databases are among the simplest database systems: they store data as key-value pairs, just like Python dictionaries. The most common implementation is Redis.

- Redis
- Initial release: 2009

Redis has been used by large companies including GitHub and Instagram. It is by far the most common key-value database.
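
To illustrate the access pattern only, the sketch below uses a plain Python dictionary as a stand-in for a key-value store. It is not the Redis API; the function names and keys are made up.

```python
# A plain dict standing in for a key-value store (illustration only, not Redis)
store = {}

def set_value(key, value):
    """Store a value under a key, replacing any previous value."""
    store[key] = value

def get_value(key, default=None):
    """Look a value up by its key; the store cannot search inside values."""
    return store.get(key, default)

set_value('session:42', 'user=jane;cart=3')
set_value('session:43', 'user=ben;cart=0')

print(get_value('session:42'))
print(get_value('session:99'))
```

Values are opaque to the store: the only way to reach one is through its key.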

#### Document Model Databases
Document model databases are a subclass of key-value databases. The initial concept is of documents such as JSON or XML. The database stores these documents using key-value pairs. However, unlike key-value databases, document model databases have the additional ability to access information within these documents directly.

- MongoDB
- Initial release: 2009

MongoDB is one of the most popular SQL alternatives. It represents data in a way very similar to the JSON format we have been investigating today. It also supports a distributed model where data can be stored across multiple computers.
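
The difference from a pure key-value store can be sketched in plain Python: documents are still looked up by key, but a query can also reach inside them. This is an illustration with made-up documents, not the MongoDB API, and a real document database would typically use indexes rather than scanning every document.

```python
import json

# Hypothetical documents, stored as JSON strings keyed by an id
documents = {
    'c1': json.dumps({'name': 'Jane', 'address': {'city': 'Oslo'}, 'orders': [101, 102]}),
    'c2': json.dumps({'name': 'Ben', 'address': {'city': 'Lima'}}),
}

def find_by_city(city):
    """Return the ids of documents whose nested address.city matches."""
    matches = []
    for doc_id, raw in documents.items():
        doc = json.loads(raw)
        if doc.get('address', {}).get('city') == city:
            matches.append(doc_id)
    return matches

print(find_by_city('Oslo'))
```

Note that the two documents do not share the same fields; only one of them has `orders`.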

#### Wide Column Databases
Wide column databases can be thought of as tables where the data in each column can vary from row to row.

- Cassandra
- Initial release: 2008

Cassandra was initially developed internally at Facebook and was later released as open source software, eventually being picked up and maintained by the Apache Software Foundation. It was developed for handling large amounts of data distributed across multiple servers. It is notable for being particularly reliable and not having a single point of failure.
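
The "columns can vary from row to row" idea can be pictured with nested dictionaries. This is a conceptual sketch with invented row keys and columns, not how Cassandra is queried.

```python
# Each row key maps to its own set of columns; rows need not share columns
rows = {
    'user:1': {'name': 'Jane', 'email': 'jane@example.com'},
    'user:2': {'name': 'Ben', 'last_login': '2020-01-01', 'plan': 'pro'},
}

def columns_used(table):
    """Collect every column name that appears in at least one row."""
    return sorted({column for row in table.values() for column in row})

print(columns_used(rows))
# A column that a row never set is simply absent for that row
print(rows['user:1'].get('plan'))
```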

#### Graph Databases
Graph databases expand upon the idea of document databases, adding in the concept of relations between documents. This makes certain operations and mappings such as connectivity of the graph of data very easy. However, individual data nodes may not be indexed which can mean that they are not directly accessible on their own but must be accessed via their relationship to more central objects.

- Neo4j
- Initial release: 2007

Neo4j is probably the most popular graph database. It stores all its data as nodes, edges or attributes.
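
The node, edge, and attribute model can be sketched in plain Python. The people, the company, and the relationship names below are invented, and the traversal is an ordinary breadth-first search rather than anything Neo4j-specific.

```python
from collections import deque

# Nodes carry attributes; edges are (source, relationship, target) triples
nodes = {
    'ana': {'label': 'Person'},
    'ben': {'label': 'Person'},
    'cy': {'label': 'Person'},
    'ibm': {'label': 'Company'},
}
edges = [('ana', 'KNOWS', 'ben'), ('ben', 'KNOWS', 'cy'), ('cy', 'WORKS_AT', 'ibm')]

def reachable(start):
    """Breadth-first traversal following edges in their stored direction."""
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        for source, _, target in edges:
            if source == current and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen - {start}

print(sorted(reachable('ana')))
```

Connectivity questions like this one are answered by walking relationships from a starting node.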

- GraphQL
- Initial release: 2015

GraphQL was developed internally at Facebook. Despite its name, it is a query language for APIs rather than a graph database: it allows clients to specify the exact structure of the data they want when requesting it from a server.

#### Choosing an Appropriate Database
There are many considerations when choosing a database, including the size of the project, anticipated use cases, and development costs. One obvious and straightforward consideration is training and familiarity, which contributes to the popularity of SQL. Size and use cases are also incredibly important considerations. For personal projects or small businesses, you may not even need a database and can perhaps simply use a CSV or JSON file. As data grows, a database management system is often needed, and at moderate scale any of these choices could meet your needs. One of the biggest drawbacks of relational (SQL) databases is that they traditionally don't scale well horizontally, that is, by spreading data across additional servers. In such scenarios, some of the alternative models provide more computationally effective solutions at scale.

#### Additional Resources
Check out https://db-engines.com/en/ranking for a ranking of various databases as well as much more information about them!