# Integrating our Model
Here we use Python as a middleman between SQL and Tableau: the pymysql library creates a connection to a MySQL database, and we execute queries through that connection to manipulate the data stored there.

Note that there are many alternatives to this method, such as using TabPy to communicate directly between Python and Tableau, cutting out the SQL component. Each approach has its pros and cons, but keeping the data in a SQL database is a robust way to store and access it, and is common business practice.

### Loading our Modules
Steps:
* Load the class module containing methods to load the data, pre-process it (using our scaler) and model it (using our logistic regression model).
* Instantiate the model class from the module, use it to pre-process the data (including standardizing the inputs with the pre-created scaler object), then use the pre-created model object to predict outputs for the new inputs.

In [12]:
# load class module
from absenteeism_module import *

# instantiate model class using the pre-created model and scaler
model = absenteeism_model('model', 'scaler')

# load and pre-process data
model.load_and_clean_data('Absenteeism_new_data.csv')

# store predicted outputs in df
df_new_obs = model.predicted_outputs()

# show predicted outputs of model
model.predicted_outputs().head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pet,Probability,Prediction
0,0,0.0,0,1,6,179,30,19,1,0,0,0.122469,0
1,1,0.0,0,0,6,361,28,27,0,1,4,0.873365,1
2,0,0.0,0,1,6,155,34,25,0,2,0,0.266253,0
3,0,0.0,0,1,6,179,40,22,1,2,0,0.19857,0
4,1,0.0,0,0,6,155,34,25,0,2,0,0.720861,1


### Connecting to SQL
Steps:
* Use pymysql to write and run SQL queries from Python.
* Create a connection to a MySQL database (one I've already created in this case).
* Create a cursor to interact directly with the DB.

In [27]:
# load library to connect Python and SQL (specifically Jupyter and MySQL)
import pymysql

# create connection to SQL DB
# this is the equivalent of opening a query in SQL and typing/running queries in there
conn = pymysql.connect(host = 'localhost', database = 'predicted_outputs', user = 'root', password = 'Gafro010')

# create cursor to directly interact with your DB
cursor = conn.cursor()
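
The connection settings above are written directly into the notebook. As a minimal sketch of one alternative, the same keyword arguments could be collected from environment variables so that the password never appears in the code. The helper `connection_settings` and the variable names `DB_HOST`, `DB_NAME`, `DB_USER` and `DB_PASSWORD` are hypothetical and not part of the original notebook.

```python
import os


def connection_settings(env=os.environ):
    """Collect keyword arguments for pymysql.connect from environment variables.

    The variable names (DB_HOST, DB_NAME, DB_USER, DB_PASSWORD) are
    hypothetical; use whatever naming suits your own setup.
    """
    return {
        "host": env.get("DB_HOST", "localhost"),
        "database": env.get("DB_NAME", "predicted_outputs"),
        "user": env.get("DB_USER", "root"),
        # no default for the password: a missing variable raises KeyError
        "password": env["DB_PASSWORD"],
    }


# Usage (needs a running MySQL server, so it is left commented out):
# import pymysql
# conn = pymysql.connect(**connection_settings())
```

The dictionary keys match the keyword arguments already used in the `pymysql.connect` call above, so the rest of the notebook would not need to change.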

### Data to SQL (loops, slow)
Source: https://www.dataquest.io/blog/sql-insert-tutorial/

A few key points on the below:
* Loops can be quite slow, so this code is instructional only; a batch method should be used instead (see the direct dataframe-to-SQL section further down).
* There is an important distinction between single quotes \' and backticks \` when writing SQL queries in Python:
    * Quotes delimit strings (e.g. the SQL query itself can be written inside either single or double quotes).
    * Backticks reference tables and columns (i.e. identifiers). If you put quotes around these names instead, MySQL will typically raise error 1064 (syntax error) or error 1054 (unknown column).
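
To make the quoting rules concrete, here is a small sketch that builds a parameterised INSERT statement as a plain string. The helper name `build_insert` is hypothetical; it wraps identifiers in backticks and emits one `%s` placeholder per column, which is the placeholder style pymysql expects.

```python
def build_insert(table, columns):
    """Return a parameterised INSERT statement using %s placeholders."""
    # backticks mark identifiers (table and column names)
    col_list = ", ".join(f"`{c}`" for c in columns)
    # one %s per column; the values themselves are passed to cursor.execute
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO `{table}` ({col_list}) VALUES ({placeholders})"


sql = build_insert("predicted_outputs", ["age", "probability", "prediction"])
print(sql)
# INSERT INTO `predicted_outputs` (`age`, `probability`, `prediction`) VALUES (%s, %s, %s)
```

Note that the `%s` placeholders are never wrapped in quotes: the driver escapes and quotes the values when the statement is executed.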

In [15]:
# rename cols to match SQL table names
df_new_obs.columns = ['reason_1', 'reason_2', 'reason_3', 'reason_4', 'month_value', 'transportation_expense', 'age', 'body_mass_index',
                      'education', 'children', 'pet', 'probability', 'prediction']

# join column names into one backtick-delimited string, e.g. reason_1`,`reason_2
cols = "`,`".join([str(i) for i in df_new_obs.columns.tolist()])

# create SQL query to insert data into table columns
# one %s placeholder per column (number of columns - 1, with the final %s afterwards)
sql = "INSERT INTO `predicted_outputs` (`" + cols + "`) VALUES (" + "%s," * (len(df_new_obs.columns) - 1) + "%s)"

# iterate through df
for i, row in df_new_obs.iterrows():
    # execute query, substituting the values of each row in place of %s
    # (tuple converts the pandas Series to a plain sequence of values)
    cursor.execute(sql, tuple(row))

# commit changes to DB once all rows are inserted
conn.commit()
    
# check above commits using select query
# select data from table
sql = 'SELECT * FROM predicted_outputs;'
cursor.execute(sql)

# iterate through fetched data (show first row only for check)
# NOTE: BIT values are returned as bytes: b'\x00' for 0 and b'\x01' for 1
result = cursor.fetchall()
for i in result:
    print(i)
    break

(b'\x00', b'\x00', b'\x00', b'\x01', 6, 179, 30, 19, b'\x01', 0, 0, 0.122469, b'\x00')
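
Since the BIT columns come back as single bytes, a fetched row is awkward to compare against the dataframe directly. A minimal sketch of a conversion helper follows; the name `decode_bits` is hypothetical, and the sample row is the one printed above.

```python
def decode_bits(row):
    """Convert BIT values (returned as bytes) in a fetched row to plain ints."""
    return tuple(
        int.from_bytes(value, byteorder="big") if isinstance(value, bytes) else value
        for value in row
    )


row = (b'\x00', b'\x00', b'\x00', b'\x01', 6, 179, 30, 19, b'\x01', 0, 0, 0.122469, b'\x00')
print(decode_bits(row))
# (0, 0, 0, 1, 6, 179, 30, 19, 1, 0, 0, 0.122469, 0)
```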


Note below that the number of rows in our dataframe and the number of rows now in our SQL DB are identical (40 in each).

This is a good check that all of the data we intended to write has reached the DB, and that none of it has been duplicated.

In [16]:
df_new_obs.shape

(40, 13)

In [17]:
len(result)

40
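
The same check can be done without fetching every row, by letting the database count them. The sketch below uses an in-memory SQLite database purely as a stand-in so that it runs anywhere; that substitution, the two-column table and the three sample rows are assumptions for illustration. With pymysql the placeholder would be `%s` rather than `?`.

```python
import sqlite3

# in-memory SQLite stands in for the MySQL database (an assumption)
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE predicted_outputs (age INTEGER, prediction INTEGER)")

rows = [(30, 0), (28, 1), (34, 0)]  # illustrative rows only
cursor.executemany("INSERT INTO predicted_outputs VALUES (?, ?)", rows)
conn.commit()

# let the database count the rows instead of fetching them all
cursor.execute("SELECT COUNT(*) FROM predicted_outputs")
db_count = cursor.fetchone()[0]

# fail loudly if rows are missing or duplicated
assert db_count == len(rows), f"expected {len(rows)} rows, found {db_count}"
conn.close()
```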

### Reset Database
Steps:
* Re-create the database for a fresh start.
* Re-create the table for a fresh start (including the definition of columns and datatypes).

This clears the data loaded in the steps above so that the code below can run from scratch.

In [28]:
# drop and re-create DB
cursor.execute('DROP DATABASE IF EXISTS predicted_outputs;')
cursor.execute('CREATE DATABASE IF NOT EXISTS predicted_outputs;')

# select DB
cursor.execute('USE predicted_outputs;')

# drop and re-create table (specifying data types and var names)
cursor.execute('DROP TABLE IF EXISTS predicted_outputs;')
cursor.execute('''
    CREATE TABLE predicted_outputs(
        Reason_1 BIT NOT NULL, Reason_2 BIT NOT NULL, Reason_3 BIT NOT NULL, Reason_4 BIT NOT NULL,
        month_value INT NOT NULL, transportation_expense INT NOT NULL, age INT NOT NULL,
        body_mass_index INT NOT NULL, education BIT NOT NULL, children INT NOT NULL, pet INT NOT NULL,
        probability FLOAT NOT NULL, prediction BIT NOT NULL
    );
''')

0

### Dataframe to SQL (directly/fast, no loops)
Source: https://www.dataquest.io/blog/sql-insert-tutorial/

This method is far easier to write than the above loop and also lets you write the data in batches, which avoids out-of-memory (OOM) errors. It's also quicker to run, as you're not iterating through every row of a potentially huge dataframe in Python.

**NOTE:** It's important to specify 'index=False' in the 'to_sql' method here. By default, 'to_sql' also writes the dataframe's index as an extra column (named 'index' unless 'index_label' is set). Our table has no such column, so appending with the default setting fails with an unknown-column error; 'index=False' writes only the data columns.

The 'if_exists' argument is useful too: with 'append', the rows are added to the existing table if there is one, or a new table is created if not, saving you the hassle of that extra step. The other options are 'fail' (the default, which raises an error if the table exists) and 'replace' (which drops and re-creates the table).
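
To make the batching idea concrete, here is a stdlib-only sketch that splits rows into chunks and sends each chunk with a single `executemany` call. It illustrates the general idea behind a chunk size and is not a description of how pandas implements 'to_sql' internally. The helper `chunked`, the synthetic rows and the SQLite stand-in database are all assumptions for illustration; pymysql cursors also provide `executemany`, with `%s` placeholders instead of `?`.

```python
import sqlite3


def chunked(rows, chunksize):
    """Yield successive slices of at most `chunksize` rows."""
    for start in range(0, len(rows), chunksize):
        yield rows[start:start + chunksize]


conn = sqlite3.connect(":memory:")  # stand-in for MySQL (assumption)
cursor = conn.cursor()
cursor.execute("CREATE TABLE predicted_outputs (age INTEGER, prediction INTEGER)")

rows = [(20 + i, i % 2) for i in range(2500)]  # synthetic rows for illustration

batches = 0
for batch in chunked(rows, chunksize=1000):
    # one executemany call per batch instead of one execute call per row
    cursor.executemany("INSERT INTO predicted_outputs VALUES (?, ?)", batch)
    batches += 1
conn.commit()  # a single commit once all batches are written

cursor.execute("SELECT COUNT(*) FROM predicted_outputs")
total = cursor.fetchone()[0]
print(batches, total)  # 3 2500
conn.close()
```

With 2,500 rows and a chunk size of 1,000, the rows are written in three batches (1,000, 1,000 and 500), so only one chunk has to be prepared at a time.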

In [29]:
# load library to store connection settings
from sqlalchemy import create_engine

# create sqlalchemy engine (i.e. connection settings string)
engine = create_engine("mysql+pymysql://{user}:{pw}@localhost/{db}"
                       .format(user='root',
                               pw='Gafro010',
                               db='predicted_outputs'))

# write dataframe to SQL DB
# write to specified table, using engine connection settings above,
# append to existing or create new if not existing, chunk df to avoid OOM
df_new_obs.to_sql('predicted_outputs', con = engine, if_exists = 'append', chunksize = 1000, index=False)

Now we can check that our data has been committed to the database. Run on its own, the 'cursor.execute' statement returns the number of rows selected; the loop then prints the first fetched row, which matches the first row of our dataframe (with the BIT values shown as bytes).

**NOTE:** Close your connection once your code has finished. This is simply best practice: it avoids accumulating open connections and reduces the scope for mistakes while writing queries.
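
One way to make sure the connection is always closed, even if a query raises an error part-way through, is `contextlib.closing` from the standard library. It works with any object that has a `close()` method, which includes pymysql connections. The sketch below uses an in-memory SQLite connection as a stand-in (an assumption, so that it runs without a MySQL server).

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() when the block exits, even if a query raises
with closing(sqlite3.connect(":memory:")) as conn:  # SQLite stand-in (assumption)
    cursor = conn.cursor()
    cursor.execute("SELECT 1")
    value = cursor.fetchone()[0]

# the connection is closed at this point; further use raises an error
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(value, still_open)  # 1 False
```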

In [24]:
df_new_obs.head(1)

Unnamed: 0,reason_1,reason_2,reason_3,reason_4,month_value,transportation_expense,age,body_mass_index,education,children,pet,probability,prediction
0,0,0.0,0,1,6,179,30,19,1,0,0,0.122469,0


In [30]:
# select and show data from table
cursor.execute('SELECT * FROM predicted_outputs;')
results = cursor.fetchall()
for i in results:
    print(i)
    break
    
# close connection
conn.close()

(b'\x00', b'\x00', b'\x00', b'\x01', 6, 179, 30, 19, b'\x01', 0, 0, 0.122469, b'\x00')


### Extracting the Data for Tableau
Once the above steps have been completed, we have our final dataset in SQL. From here, we can use the MySQL Workbench interface to export the data as a CSV file, which we then use as our data source for Tableau.
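
If you would rather script the export than click through Workbench, the sketch below shows one possible approach using the standard `csv` module: column names are read from `cursor.description` (part of the Python DB-API, so also available on pymysql cursors) and the fetched rows are written underneath. This is an alternative to the Workbench export described above, not the method used in this project. The SQLite stand-in database and the three-column table are assumptions; the two sample rows reuse values from the predictions shown earlier.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for MySQL (assumption)
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE predicted_outputs (age INTEGER, probability REAL, prediction INTEGER)"
)
cursor.executemany(
    "INSERT INTO predicted_outputs VALUES (?, ?, ?)",
    [(30, 0.122469, 0), (28, 0.873365, 1)],
)

cursor.execute("SELECT * FROM predicted_outputs")
header = [col[0] for col in cursor.description]  # column names from the cursor

# StringIO keeps the sketch self-contained; to write a real file, use
# open('predicted_outputs.csv', 'w', newline='') instead
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(cursor.fetchall())
conn.close()

csv_text = buffer.getvalue()
print(csv_text)
```

With MySQL, the BIT columns would come back as bytes, so they would need converting to integers before being written out.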