#### Opening

Consider the scenario:
Given a database, how would you investigate which schemas and tables (the metadata) would be useful?

Now suppose your project is migrated to a new SQL dialect, say Snowflake. Could you rely on code designed solely for Oracle? ...



### Why SQLAlchemy?

SQLAlchemy and other ORM packages are very popular with Python developers working on software and web apps outside data science. I think one of the main reasons is protection against SQL injection attacks, but portability matters too: a project written with [`cx_oracle`](https://cx-oracle.readthedocs.io/en/latest/) can be hard to update when the project owners decide to migrate to a PostgreSQL database, for example (migrations happen, and not always to the cloud). Imagine how much harder it would be if all the queries were just long multiline strings!

<ol>
    <li>Easy access to table metadata</li>
    <li>Consistent cursor/engine methods</li>
    <li>Better control over table architecture</li>
    <li>Refactoring is made EASY</li>
</ol>

Notes:
<ol>
    <li>
        SQLAlchemy comes with methods and objects that let you quickly derive table metadata. This saves you from switching back and forth between your Python IDE and Oracle SQL Developer, for example.
    </li>
    <li>
        Interfacing with databases is not always straightforward in Python. Although the [DBAPI](https://www.python.org/dev/peps/pep-0249/) spec provides a standard, different developers are free to structure/name their methods as they see fit.
    </li>
    <li>
        There is nothing like having great control over the structure of the table you are responsible for. You must make sure that all primary keys are properly identified, that all columns that cannot be null are distinguishable and that database defaults are set.
    </li>
    <li>
        If you use an IDE like PyCharm, you probably love how easy it is to refactor objects and variables. Because SQLAlchemy provides an object representation of tables and columns, refactoring is often (not always) a piece of cake.
    </li>
    
</ol>
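
To make the dialect point concrete, here is a minimal sketch that compiles one statement for several dialects without connecting to any database. The `category` table mirrors the one reflected later in this notebook; the choice of dialects and the `LIMIT 5` are assumptions made purely for illustration.

```python
import sqlalchemy as sa
from sqlalchemy.dialects import mysql, oracle, postgresql

# Hypothetical table definition, used only to build a statement (no database needed)
metadata = sa.MetaData()
category = sa.Table(
    "category",
    metadata,
    sa.Column("category_id", sa.Integer, primary_key=True),
    sa.Column("name", sa.String(255), nullable=False),
)

# One statement, written once
stmt = category.select().where(category.c.name == "Math").limit(5)

# The same statement rendered for three different dialects
compiled = {
    "mysql": str(stmt.compile(dialect=mysql.dialect())),
    "postgresql": str(stmt.compile(dialect=postgresql.dialect())),
    "oracle": str(stmt.compile(dialect=oracle.dialect())),
}

for dialect_name, sql_text in compiled.items():
    print(f"-- {dialect_name}\n{sql_text}\n")
```

The Python code stays the same for every backend; only the dialect object passed to `compile` changes, and the row-limiting and parameter syntax in the output follow from it.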

### Con
The main complaint is the efficiency/speed of SQLAlchemy compared to raw SQL paired with a dedicated package for each SQL dialect.

#### Setup

In [58]:
import pandas as pd
import sqlalchemy as sa
import sqlalchemy.orm as orm
from sqlalchemy.engine import reflection
from sqlalchemy.ext.hybrid import hybrid_property
from sqlalchemy.ext.declarative import declarative_base

In [19]:
engine = sa.create_engine("mysql+pymysql://tester:password@localhost:3306/dstest")
ny_sat = r"https://data.cityofnewyork.us/api/views/zt9s-n5aj/rows.csv?accessType=DOWNLOAD"
data_docs = r"https://catalog.data.gov/dataset/sat-college-board-2010-school-level-results-5c6d6"

In [21]:
df = pd.read_csv(ny_sat)
df.columns = [column.lower().replace(' ', '_') for column in df.columns]
df.head()

Unnamed: 0,dbn,school_name,number_of_test_takers,critical_reading_mean,mathematics_mean,writing_mean
0,01M292,Henry Street School for International Studies,31.0,391.0,425.0,385.0
1,01M448,University Neighborhood High School,60.0,394.0,419.0,387.0
2,01M450,East Side Community High School,69.0,418.0,431.0,402.0
3,01M458,SATELLITE ACADEMY FORSYTH ST,26.0,385.0,370.0,378.0
4,01M509,CMSP HIGH SCHOOL,,,,


#### Metadata - Inspecting Data Available in a Database

In [32]:
inspector = reflection.Inspector.from_engine(engine)

# If we want all the schemas we have access to we can do this
inspector.get_schema_names()

['dstest', 'information_schema']

In [33]:
# To get all the tables in a schema we can do the following
inspector.get_table_names(schema='dstest')

['category', 'user']

In [34]:
# Reflecting tables
metadata = sa.MetaData(engine)
category_table = sa.Table('category', metadata, autoload_with=engine)
category_table

Table('category', MetaData(bind=Engine(mysql+pymysql://tester:***@localhost:3306/dstest)), Column('category_id', INTEGER(display_width=11), table=<category>, primary_key=True, nullable=False), Column('name', VARCHAR(length=255), table=<category>, nullable=False), schema=None)

#### Consistent Cursor/Engine Methods

Inserting or updating data in a table is usually done via SQL queries. The specific syntax can differ from one dialect to another, columns with special characters (spaces, etc.) need to be handled properly, and on top of that, turning data from a pandas DataFrame into a SQL query is not always straightforward.

Consider the following dataframe. How would you insert the data into the table?

In [35]:
data_to_insert_into_category = pd.DataFrame({
    'category_id': [3, 4],
    'name': ['Math', 'Stats']
})

data_to_update_in_category = pd.DataFrame({
    'category_id': [1, 2],
    'name': ['Updated Category #1', 'Updated Category #2']
})

In [36]:
# Answer here
sql = """
"""

In [37]:
# closer look at native properties of SQLAlchemy Table objects
print(category_table.insert())

INSERT INTO category (category_id, name) VALUES (%(category_id)s, %(name)s)


In [38]:
# update properties
print(category_table.update())

UPDATE category SET category_id=%(category_id)s, name=%(name)s


Dictionaries are key to inserting and updating with these methods, and pandas has great support for [turning the data into dictionaries](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_dict.html#pandas.DataFrame.to_dict).

In [39]:
data_to_insert_into_category.to_dict(orient='records') # 'records' gives one dict per row, much like a record in a database table

[{'category_id': 3, 'name': 'Math'}, {'category_id': 4, 'name': 'Stats'}]

Benefits of the approach above

<ul>
    <li>Database Agnostic</li>
    <li>Auto-generated with all escaping taken care of!</li>
    <li>Quickly and Easily turn Pandas Dataframes into Actual Records on Database</li>
    ...
</ul>

Keep in mind that columns of certain types, datetime for example, can cause problems, but that usually just requires further data processing.
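
Putting the pieces together, here is a minimal sketch of a bulk insert followed by a bulk update. It assumes an in-memory SQLite database in place of MySQL, starts from record dictionaries shaped like the `to_dict(orient='records')` output, and uses the hypothetical bind parameter names `b_category_id` and `b_name`, chosen so they do not clash with the column names.

```python
import sqlalchemy as sa

# Assumption: in-memory SQLite stands in for the MySQL database used in this notebook
engine = sa.create_engine("sqlite://")
metadata = sa.MetaData()
category = sa.Table(
    "category",
    metadata,
    sa.Column("category_id", sa.Integer, primary_key=True),
    sa.Column("name", sa.String(255), nullable=False),
)
metadata.create_all(engine)

# Hypothetical seed rows so that the update has something to change
seed_records = [
    {"category_id": 1, "name": "Category #1"},
    {"category_id": 2, "name": "Category #2"},
]
# Same shape as data_to_insert_into_category.to_dict(orient='records')
insert_records = [
    {"category_id": 3, "name": "Math"},
    {"category_id": 4, "name": "Stats"},
]
# data_to_update_in_category with its columns renamed to the bind parameter names
update_records = [
    {"b_category_id": 1, "b_name": "Updated Category #1"},
    {"b_category_id": 2, "b_name": "Updated Category #2"},
]

# One UPDATE statement, executed once per dictionary in update_records
update_stmt = (
    category.update()
    .where(category.c.category_id == sa.bindparam("b_category_id"))
    .values(name=sa.bindparam("b_name"))
)

# engine.begin() commits on success and rolls back on error
with engine.begin() as conn:
    conn.execute(category.insert(), seed_records + insert_records)
    conn.execute(update_stmt, update_records)
    rows = conn.execute(
        category.select().order_by(category.c.category_id)
    ).fetchall()

result = [tuple(row) for row in rows]
print(result)
```

Starting from a DataFrame, `df.rename(columns={'category_id': 'b_category_id', 'name': 'b_name'}).to_dict(orient='records')` would produce a list shaped like `update_records`.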

#### Table Structure

When you have to create/maintain tables, you MUST make sure constraints are not violated. PRIMARY KEYS are sometimes necessary for the integrity of the data. Let's say we want to create a new table for the New York City SAT data we obtained above.

In [40]:
# Answer here - SQL Query
sql = """
"""

We could create a `Table` object directly, but instead we will take the ORM approach. Just a heads up: `np.nan` does not play well with these methods (although I believe you could configure SQLAlchemy to treat it as `None`), so we will replace it with `None`, since `None` is the real equivalent of `NULL`.
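
One way to do that replacement is on the record dictionaries themselves. The sketch below checks each value with `pd.isna` and assumes a small hand-built frame holding two rows from the `df.head()` output above; other approaches exist.

```python
import numpy as np
import pandas as pd

# Two rows copied from the df.head() output above; the second has missing scores
sample = pd.DataFrame({
    "dbn": ["01M292", "01M509"],
    "school_name": [
        "Henry Street School for International Studies",
        "CMSP HIGH SCHOOL",
    ],
    "number_of_test_takers": [31.0, np.nan],
    "critical_reading_mean": [391.0, np.nan],
    "mathematics_mean": [425.0, np.nan],
    "writing_mean": [385.0, np.nan],
})

# Turn each row into a dict, swapping every missing value (NaN) for None
records = [
    {key: (None if pd.isna(value) else value) for key, value in row.items()}
    for row in sample.to_dict(orient="records")
]

print(records[1])
```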

In [59]:
Base = declarative_base()

class NewYorkCitySat(Base):
    __tablename__ = 'new_york_sat'
    __table_args__ = {'schema' : 'dstest'} # this is if you want to specify the schema
    
    dbn = sa.Column(sa.String(15), primary_key=True)
    school_name = sa.Column(sa.String(100), primary_key=True)
    number_of_test_takers = sa.Column(sa.Integer, nullable=True)
    critical_reading_mean = sa.Column(sa.Float, nullable=True)
    mathematics_mean = sa.Column(sa.Float, nullable=True)
    writing_mean = sa.Column(sa.Float, nullable=True)
    
    # since this is a class, we could provide an __init__ just because
    def __init__(self, dbn, school_name, number_of_test_takers, critical_reading_mean, mathematics_mean,
                writing_mean):
        self.dbn = dbn
        self.school_name = school_name
        self.number_of_test_takers = number_of_test_takers
        self.critical_reading_mean = critical_reading_mean
        self.mathematics_mean = mathematics_mean
        self.writing_mean = writing_mean
    
    # say we want to have some sort of computation that tells us when a school is a "good" school
    # the criteria for this is all made up and could be complicated but we focus on an easy one
    @hybrid_property
    def good_school(self):
        return self.critical_reading_mean >= 466 and self.mathematics_mean >= 489 and self.writing_mean >= 464
    
    # We obviously could not use that property directly as a filter in a database query,
    # so we also give it a SQL expression
    @good_school.expression
    def good_school(cls):
        return sa.and_(
            cls.critical_reading_mean >= 466,
            cls.mathematics_mean >= 489,
            cls.writing_mean >= 464
        )
        
# there are other types for the columns. There are also ways to set server default values!
# you could add other methods to the class. You could even add custom validators (only for instances of the class)
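
As a minimal sketch of those two extras, the model below adds a server default and a validator. The `School` class, the default of `0` and the six-character rule for `dbn` are all hypothetical choices made for illustration, and an in-memory SQLite database is assumed.

```python
import sqlalchemy as sa
import sqlalchemy.orm as orm
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class School(Base):
    """Hypothetical model, separate from NewYorkCitySat, used only for illustration."""
    __tablename__ = "school"

    dbn = sa.Column(sa.String(15), primary_key=True)
    school_name = sa.Column(sa.String(100), nullable=False)
    # server_default becomes part of the CREATE TABLE statement (DEFAULT 0)
    number_of_test_takers = sa.Column(
        sa.Integer, nullable=False, server_default=sa.text("0")
    )

    # Runs whenever `dbn` is set on an instance; assumed rule: exactly six characters
    @orm.validates("dbn")
    def validate_dbn(self, key, value):
        if len(value) != 6:
            raise ValueError(f"{key} must be 6 characters long, got {value!r}")
        return value


engine = sa.create_engine("sqlite://")
Base.metadata.create_all(engine)

session = orm.Session(bind=engine)
try:
    school = School(
        dbn="01M292",
        school_name="Henry Street School for International Studies",
    )
    session.add(school)
    session.commit()
    # The value was never set in Python, so it is loaded back from the database
    stored_count = school.number_of_test_takers
finally:
    session.close()

# The validator rejects a malformed dbn before anything reaches the database
try:
    School(dbn="bad", school_name="Too short")
    validator_rejected = False
except ValueError:
    validator_rejected = True

print(stored_count, validator_rejected)
```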

In [60]:
# That might have been tedious, yes. But we not only get an easy-to-understand object, we also get the
# easy-to-use Table object, accessible like this

NewYorkCitySat.__table__

Table('new_york_sat', MetaData(bind=None), Column('dbn', String(length=15), table=<new_york_sat>, primary_key=True, nullable=False), Column('school_name', String(length=100), table=<new_york_sat>, primary_key=True, nullable=False), Column('number_of_test_takers', Integer(), table=<new_york_sat>), Column('critical_reading_mean', Float(), table=<new_york_sat>), Column('mathematics_mean', Float(), table=<new_york_sat>), Column('writing_mean', Float(), table=<new_york_sat>), schema='dstest')

In [None]:
# one way of creating the table on the database
NewYorkCitySat.__table__.create(bind = engine, checkfirst = True)

# equivalent to (the key is schema-qualified, 'dstest.new_york_sat', because a schema is set)
Base.metadata.tables[NewYorkCitySat.__table__.key].create(bind = engine, checkfirst = True)

# another way
Base.metadata.create_all(bind = engine, tables = [NewYorkCitySat.__table__], checkfirst = True)

# preference is left to the user

In [64]:
# Creating instances of the class


In [66]:
# Creating instances by relying on the __init__ method since we have it


In [67]:
# Inserting the data using orm sessions - create instances from the dataframe


#### Refactoring and Easy Maintenance

My personal experience with refactoring and maintenance has mostly been about changing queries. When you build a complex query that spans multiple lines, changing even one column in that string can cause major problems (or at least make debugging a nightmare). Building queries with the metadata classes, on the other hand, makes it easy to change the query.

In [69]:
# Create a query using the good_school filter - as mentioned, the criteria can change and could even involve 
# case statements.

session = orm.Session()
query = session.query(NewYorkCitySat).filter(NewYorkCitySat.good_school)
print(query.statement)

# If I had a models.py with the class defined, I could update my good_school property
# and all the code that depends on it would remain the same

SELECT dstest.new_york_sat.dbn, dstest.new_york_sat.school_name, dstest.new_york_sat.number_of_test_takers, dstest.new_york_sat.critical_reading_mean, dstest.new_york_sat.mathematics_mean, dstest.new_york_sat.writing_mean 
FROM dstest.new_york_sat 
WHERE dstest.new_york_sat.critical_reading_mean >= :critical_reading_mean_1 AND dstest.new_york_sat.mathematics_mean >= :mathematics_mean_1 AND dstest.new_york_sat.writing_mean >= :writing_mean_1


In [None]:
# Using subqueries - oftentimes we have to create queries that group our data on specific keys and then merge
# the result back with the original table.


# to be honest compound queries could be made simple with f-strings
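
A minimal sketch of that group-then-merge pattern is shown below. It assumes the SQLAlchemy 1.4+ calling style (`sa.select(col, ...)` and `.subquery()`), an in-memory SQLite database, a two-column stand-in for the SAT table, and that the first two characters of `dbn` identify a district; the rows are the five from the `df.head()` output above.

```python
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
metadata = sa.MetaData()
# Hypothetical two-column stand-in for the new_york_sat table
sat = sa.Table(
    "new_york_sat",
    metadata,
    sa.Column("dbn", sa.String(15), primary_key=True),
    sa.Column("mathematics_mean", sa.Float, nullable=True),
)
metadata.create_all(engine)

# Assumption: the first two characters of dbn identify a district
district = sa.func.substr(sat.c.dbn, 1, 2)

# Step 1: group on the key and aggregate
district_avg = (
    sa.select(
        district.label("district"),
        sa.func.avg(sat.c.mathematics_mean).label("avg_math"),
    )
    .group_by(district)
    .subquery("district_avg")
)

# Step 2: join the aggregate back to the original rows
query = (
    sa.select(sat.c.dbn, sat.c.mathematics_mean, district_avg.c.avg_math)
    .select_from(sat.join(district_avg, district == district_avg.c.district))
    .order_by(sat.c.dbn)
)

with engine.begin() as conn:
    conn.execute(
        sat.insert(),
        [
            {"dbn": "01M292", "mathematics_mean": 425.0},
            {"dbn": "01M448", "mathematics_mean": 419.0},
            {"dbn": "01M450", "mathematics_mean": 431.0},
            {"dbn": "01M458", "mathematics_mean": 370.0},
            {"dbn": "01M509", "mathematics_mean": None},
        ],
    )
    results = [tuple(row) for row in conn.execute(query)]

for row in results:
    print(row)
```

Because the subquery is an ordinary Python object, changing the grouping key or the aggregate means editing one expression rather than hunting through a multiline string.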

These opinions are my own ...