# Setting Up

In this section, we discuss what SQL is and how to set up SQLite in a Python environment. If you are completely unfamiliar with SQL (Structured Query Language), it might be a good idea to first take the other Anaconda course *[Introduction to SQL](https://learning.anaconda.cloud/introduction-to-sql)*.



## What Is SQL (Structured Query Language)?

**SQL** stands for Structured Query Language and is used to retrieve, manipulate, and write data. In this course we will emphasize retrieving and manipulating data for the purpose of analytics. While SQL is traditionally associated with relational databases, it has remained popular enough to be implemented in NoSQL databases ("Not only SQL") as well as "big data" platforms like Apache Spark and Trino. Even though it is roughly 50 years old, SQL continues to be a necessary skill for any data professional and a go-to language for working with data.

## What Are Relational Databases? 

**Relational database management systems (RDBMS)** are repositories containing tables that may have relationships to each other. For example, if you have a table called `EMPLOYEE` and another called `EMPLOYEE_AIR_TRAVEL` that tracks employees' flights for business travel, we can reasonably expect the latter table to have a field (perhaps called `BOOKED_EMPLOYEE_ID`) tying it to the `EMPLOYEE_ID` of the first table.

![](./resource/uXeyKTO9.svg)

Storing data in this manner, where we separate different types of data into their own tables, is called **normalization**. It reduces storage space and minimizes duplicated data. After all, why would we store the `FIRST_NAME` and `LAST_NAME` of each employee for every single `EMPLOYEE_AIR_TRAVEL` booking? Instead, we just use an integer key to refer to the employee information.
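
To make the key relationship concrete, here is a minimal sketch using an in-memory SQLite database. The table names and the `EMPLOYEE_ID`, `FIRST_NAME`, `LAST_NAME`, and `BOOKED_EMPLOYEE_ID` fields follow the example above; the `BOOKING_ID` and `DESTINATION` columns and all sample rows are made up for illustration and are not the course database's actual schema.

```python
import sqlite3

# In-memory database so the sketch does not touch any file on disk
demo = sqlite3.connect(":memory:")

demo.executescript("""
CREATE TABLE EMPLOYEE (
    EMPLOYEE_ID INTEGER PRIMARY KEY,
    FIRST_NAME TEXT,
    LAST_NAME TEXT
);

CREATE TABLE EMPLOYEE_AIR_TRAVEL (
    BOOKING_ID INTEGER PRIMARY KEY,   -- hypothetical column
    BOOKED_EMPLOYEE_ID INTEGER REFERENCES EMPLOYEE (EMPLOYEE_ID),
    DESTINATION TEXT                  -- hypothetical column
);

INSERT INTO EMPLOYEE VALUES (1, 'Ada', 'Lovelace'), (2, 'Alan', 'Turing');
INSERT INTO EMPLOYEE_AIR_TRAVEL VALUES (100, 1, 'DFW'), (101, 1, 'ORD'), (102, 2, 'LHR');
""")

# Names are stored once in EMPLOYEE and looked up through the integer key
rows = demo.execute("""
SELECT t.BOOKING_ID, e.FIRST_NAME, e.LAST_NAME, t.DESTINATION
FROM EMPLOYEE_AIR_TRAVEL t
INNER JOIN EMPLOYEE e ON t.BOOKED_EMPLOYEE_ID = e.EMPLOYEE_ID
ORDER BY t.BOOKING_ID
""").fetchall()

for row in rows:
    print(row)

demo.close()
```

Each booking row carries only the integer key, and the join recovers the employee's name when it is needed.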

In an analytics context, you will frequently interact with a **data warehouse**. Relational databases are often used for live operations, such as managing the baggage and customers flowing through an airport in real time, or capturing and fulfilling orders on a shopping website. We avoid running analytical queries against these operational databases because doing so could slow them down. Instead, data is regularly extracted, transformed, and loaded (ETL) into a data warehouse, which serves analytical users seeking insights about the business without disrupting the operational databases.

There are other types of repositories that store and provide interfaces to data, such as data lakes, data lakehouses, and data fabrics. Generally, you will find SQL can be used to interface with many of these data platforms. For our purposes, we will use a relational database platform (SQLite), which is built right into Python. But you can extend this knowledge to other data platforms.

 

## Why SQLite? 

**SQLite** is a relational database platform just like [PostgreSQL](https://www.postgresql.org), [Oracle](https://www.oracle.com/database/technologies/appdev/sql.html), or [Microsoft SQL Server](https://www.microsoft.com/en-us/sql-server). However, what sets it apart is that it does not require a server. Instead, the database is simply stored as a file on your local machine, and you use a library or user interface to open it. Python includes SQLite support by default, so you do not have to install it. It also complies with the [DB-API 2.0 specified by PEP 249](https://docs.python.org/3/library/sqlite3.html). This means that other database platform packages that comply with this standard (including [Microsoft SQL Server](https://pypi.org/project/pymssql/) and [Oracle](https://pypi.org/project/cx-Oracle/)) can be worked with in much the same way we will use SQLite. Therefore, most of what you learn in this training can apply to most major database platforms, although SQL dialects and built-in functions vary somewhat between them.
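
As a rough illustration of what following the DB-API 2.0 pattern looks like, the sketch below walks through the connect, cursor, execute, and fetch steps with `sqlite3`. The `GREETING` table is hypothetical and exists only for this example.

```python
import sqlite3

# connect() -> cursor() -> execute() -> fetch*() is the general shape PEP 249 describes
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

cursor.execute("CREATE TABLE GREETING (ID INTEGER, MESSAGE TEXT)")  # hypothetical table
cursor.execute("INSERT INTO GREETING VALUES (?, ?)", (1, "hello"))
connection.commit()

cursor.execute("SELECT ID, MESSAGE FROM GREETING")
result = cursor.fetchall()   # list of tuples, one per row
print(result)

cursor.close()
connection.close()
```

With another compliant driver, the `connect()` arguments and the placeholder style (the driver's `paramstyle`) may differ, but the overall sequence of calls should look familiar.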

> If you want to write SQL against a SQLite database with a graphical user interface, there are many tools that provide this. My personal favorites are [SQLiteOnline](https://docs.python.org/3/library/sqlite3.html) and [SQLiteStudio](https://sqlitestudio.pl/). 

## Setting Up 

As stated earlier, SQLite is already built into Python 3. If you use other platforms like [Microsoft SQL Server](https://pypi.org/project/pymssql/) or [Oracle](https://pypi.org/project/cx-Oracle/), you will need to `pip install` the respective packages that comply with the DB-API 2.0 standard.

We do, however, need to get the SQLite file containing the sample database we will use for the examples. For convenience, we can download the file straight [off the GitHub repository](https://github.com/thomasnield/anaconda_intro_to_sql/) and put it in our working Python directory.

In [None]:
import urllib.request
# Download the sample database into the working directory (only needs to run once)
urllib.request.urlretrieve("https://github.com/thomasnield/anaconda_intro_to_sql/blob/main/company_operations.db?raw=true", "company_operations.db")

We are now ready to connect to the database. We will create a connection using the `sqlite3` package. 

In [None]:
import sqlite3

conn = sqlite3.connect('company_operations.db')
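
One thing to be aware of: by default, `sqlite3.connect` creates a new, empty database file if none exists at the given path, so a missing download will not raise an error at this step. A quick sanity check is to list the tables in the database. The sketch below uses an in-memory database as a stand-in, and the `list_tables` helper is a hypothetical name introduced only for this example.

```python
import sqlite3

def list_tables(connection):
    """Return the names of all tables in a SQLite database."""
    cursor = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in cursor.fetchall()]

# An in-memory database stands in for company_operations.db here
check = sqlite3.connect(":memory:")
empty_tables = list_tables(check)   # a brand-new database has no tables

check.execute("CREATE TABLE CUSTOMER (CUSTOMER_ID INTEGER, CUSTOMER_NAME TEXT)")
tables = list_tables(check)

print(empty_tables, tables)
check.close()
```

Calling `list_tables(conn)` on the course connection should return a non-empty list; an empty list suggests the file was not found in the working directory.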

We will also bring in `pandas` which has a very convenient function `read_sql` to execute a SQL query against a connection and package the results in a `DataFrame`. 

In [None]:
import pandas as pd

sql = "SELECT * FROM EMPLOYEE_AIR_TRAVEL"
pd.read_sql(sql, conn)

Notice above that the query results are displayed as a `DataFrame`. We will review how to write queries shortly.

## Why SQL Instead of Pandas? 

As we learn how to do analytics with SQL, you might wonder why we do not just use pandas, since it can do many of these tasks too. SQL and pandas are not competitors, but rather two different tools for two different environments. When you have many terabytes of data stored in a relational database, you will likely be unable to process that data locally on your machine using pandas. It makes sense to let SQL do the heavy computation on the server side (which is optimized to process the data it stores) and have pandas simply receive the results. Conversely, Python and pandas are better equipped for machine learning tasks, merging disparate data sources, and running more elaborate algorithms.

Generally, it is a good practice when working with a relational database to have the database server do the computation work where possible and have the Python environment consume the results. Keep both tools in your back pocket, and use them situationally where they make sense. 
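
The difference can be sketched on a small scale. Below, a hypothetical `SALES` table (not part of the course database) is aggregated two ways: once by pulling every row to the client, and once by letting the database return only the summary.

```python
import sqlite3

work = sqlite3.connect(":memory:")
work.execute("CREATE TABLE SALES (REGION TEXT, AMOUNT INTEGER)")  # hypothetical table
work.executemany(
    "INSERT INTO SALES VALUES (?, ?)",
    [("EAST", 100), ("EAST", 250), ("WEST", 75), ("WEST", 300)],
)

# Option 1: transfer every row to the client, then aggregate in Python
all_rows = work.execute("SELECT REGION, AMOUNT FROM SALES").fetchall()
client_totals = {}
for region, amount in all_rows:
    client_totals[region] = client_totals.get(region, 0) + amount

# Option 2: let the database aggregate and return only the summary rows
db_totals = dict(
    work.execute("SELECT REGION, SUM(AMOUNT) FROM SALES GROUP BY REGION").fetchall()
)

print(len(all_rows), "rows transferred vs", len(db_totals))
print(client_totals == db_totals)
work.close()
```

Both approaches give the same totals, but the second moves far fewer rows across the connection, which matters when the table holds millions of rows rather than four.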

## Running Queries

If you have not worked with SQL before, check out the other Anaconda course *[Introduction to SQL](https://learning.anaconda.cloud/introduction-to-sql)*. Here we will do a basic review of the `SELECT` operation and the common tasks we will perform with it.

`SELECT * FROM CUSTOMER` will select all fields from the `CUSTOMER` table. 

In [None]:
sql = "SELECT * FROM CUSTOMER"
pd.read_sql(sql, conn)

To select only specific columns, list them by name instead of using `*`.

In [None]:
sql = "SELECT CUSTOMER_ID, CUSTOMER_NAME FROM CUSTOMER"
pd.read_sql(sql, conn)

To filter rows based on one or more conditions, use the `WHERE` clause. Use the `AND` and `OR` keywords to specify multiple conditions, using `AND` to require all conditions to be met or `OR` for at least one condition. 

In [None]:
sql = """
SELECT * FROM CUSTOMER 
WHERE STATE = 'TX' AND CATEGORY = 'COMMERCIAL'
"""

pd.read_sql(sql, conn)

Use parentheses to group multiple conditions, such as checking whether snow or sleet occurred. For sleet to occur, there has to be rain and the temperature must be less than or equal to 32 degrees Fahrenheit, so we wrap those two conditions in parentheses and treat them as a single condition.

In [None]:
sql = """
SELECT * FROM WEATHER_MONITOR 
WHERE SNOW > 0 OR (RAIN = 1 AND TEMPERATURE <= 32)
"""

pd.read_sql(sql, conn)
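
To see why the grouping matters, the sketch below runs three variants of the condition against a small in-memory table. The simplified `WEATHER_MONITOR` schema, the `ID` column, and the sample rows are assumptions made for illustration, and `matching_ids` is a hypothetical helper.

```python
import sqlite3

wx = sqlite3.connect(":memory:")
wx.execute(
    "CREATE TABLE WEATHER_MONITOR (ID INTEGER, SNOW REAL, RAIN INTEGER, TEMPERATURE INTEGER)"
)
wx.executemany(
    "INSERT INTO WEATHER_MONITOR VALUES (?, ?, ?, ?)",
    [
        (1, 2.5, 0, 50),   # snow recorded while warm (contrived, to expose the difference)
        (2, 0.0, 1, 30),   # rain at or below freezing: sleet
        (3, 0.0, 1, 45),   # rain but too warm for sleet
        (4, 0.0, 0, 20),   # cold and dry
    ],
)

def matching_ids(where_clause):
    # String formatting is used only because the clauses are fixed demo text
    sql = f"SELECT ID FROM WEATHER_MONITOR WHERE {where_clause} ORDER BY ID"
    return [row[0] for row in wx.execute(sql)]

snow_or_sleet = matching_ids("SNOW > 0 OR (RAIN = 1 AND TEMPERATURE <= 32)")
no_parens = matching_ids("SNOW > 0 OR RAIN = 1 AND TEMPERATURE <= 32")
regrouped = matching_ids("(SNOW > 0 OR RAIN = 1) AND TEMPERATURE <= 32")

print(snow_or_sleet, no_parens, regrouped)
wx.close()
```

Because `AND` binds more tightly than `OR`, the first two variants return the same rows. The parentheses in the first variant make the intent explicit, while moving them as in the third variant changes the result.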

Use functions like `SUM`, `MIN`, `MAX`, `COUNT`, and `AVG` to aggregate a column. Below we get the total rain when tornados occurred. 

In [None]:
sql = """
SELECT SUM(RAIN) FROM WEATHER_MONITOR 
WHERE TORNADO = 1 
"""

pd.read_sql(sql, conn)
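
The aggregate functions can be tried side by side on a toy table. This is a minimal sketch with made-up rows and a simplified `WEATHER_MONITOR` schema; one `RAIN` value is deliberately left as `NULL` to show how aggregates treat missing values.

```python
import sqlite3

agg = sqlite3.connect(":memory:")
agg.execute("CREATE TABLE WEATHER_MONITOR (RAIN REAL, TORNADO INTEGER)")
agg.executemany(
    "INSERT INTO WEATHER_MONITOR VALUES (?, ?)",
    [(0.5, 1), (1.5, 1), (None, 1), (4.0, 0)],   # None is stored as NULL
)

summary = agg.execute("""
SELECT SUM(RAIN), MIN(RAIN), MAX(RAIN), AVG(RAIN), COUNT(RAIN), COUNT(*)
FROM WEATHER_MONITOR
WHERE TORNADO = 1
""").fetchone()

print(summary)
agg.close()
```

`SUM`, `MIN`, `MAX`, `AVG`, and `COUNT(RAIN)` skip the `NULL` value, whereas `COUNT(*)` counts every row that passes the `WHERE` filter.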

Use `GROUP BY` to break aggregate results out by one or more fields or expressions. Below, we get the total rain for each report date.

In [None]:
sql = """
SELECT REPORT_DATE, 
SUM(RAIN) AS TOTAL_RAIN

FROM WEATHER_MONITOR 
WHERE TORNADO = 1 
GROUP BY REPORT_DATE 
"""

pd.read_sql(sql, conn)
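
Here is a small, self-contained sketch of the same query shape, using made-up rows and a simplified `WEATHER_MONITOR` schema, to show that `WHERE` filters rows before `GROUP BY` forms the groups.

```python
import sqlite3

grp = sqlite3.connect(":memory:")
grp.execute("CREATE TABLE WEATHER_MONITOR (REPORT_DATE TEXT, RAIN REAL, TORNADO INTEGER)")
grp.executemany(
    "INSERT INTO WEATHER_MONITOR VALUES (?, ?, ?)",
    [
        ("2024-03-01", 0.5, 1),
        ("2024-03-01", 1.0, 1),
        ("2024-03-01", 3.0, 0),   # removed by WHERE before grouping
        ("2024-03-02", 2.0, 1),
    ],
)

totals = grp.execute("""
SELECT REPORT_DATE, SUM(RAIN) AS TOTAL_RAIN
FROM WEATHER_MONITOR
WHERE TORNADO = 1
GROUP BY REPORT_DATE
ORDER BY REPORT_DATE
""").fetchall()

print(totals)
grp.close()
```

The row with `TORNADO = 0` never reaches the aggregation, so it does not contribute to the total for its date. The `ORDER BY` is added here only to make the output order predictable.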

Another nuance to note is that numeric values (including binary 1/0 values and floating point values) do not need quotes when written as literals. Text, dates/times, and most other data types typically must be wrapped in single quotes, as shown below.

In [None]:
sql = """
SELECT * FROM WEATHER_MONITOR
WHERE REPORT_CODE = '3J3YUUD'
"""

pd.read_sql(sql, conn)
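
When the values come from Python variables rather than being typed into the SQL, an alternative is to use placeholders and let the driver handle the quoting. The sketch below compares both styles on a toy table; the simplified schema and sample rows are assumptions made for illustration.

```python
import sqlite3

lit = sqlite3.connect(":memory:")
lit.execute("CREATE TABLE WEATHER_MONITOR (REPORT_CODE TEXT, TEMPERATURE INTEGER)")
lit.executemany(
    "INSERT INTO WEATHER_MONITOR VALUES (?, ?)",
    [("3J3YUUD", 41), ("ABC1234", 75)],   # made-up rows
)

# Literal style: the text value is quoted, the number is not
literal_rows = lit.execute(
    "SELECT * FROM WEATHER_MONITOR WHERE REPORT_CODE = '3J3YUUD' AND TEMPERATURE < 50"
).fetchall()

# Placeholder style: no quotes in the SQL, values are passed separately
param_rows = lit.execute(
    "SELECT * FROM WEATHER_MONITOR WHERE REPORT_CODE = ? AND TEMPERATURE < ?",
    ("3J3YUUD", 50),
).fetchall()

print(literal_rows, param_rows)
lit.close()
```

`pd.read_sql` accepts the same kind of values through its `params` argument. These `?` placeholders are part of the `sqlite3` parameter style and are unrelated to the question marks used as blanks in the exercise.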


## EXERCISE

Complete the SQL query below (by replacing the question marks "?") to find the minimum and maximum temperature by year since March 1, 2024. 

In [None]:
sql = """
SELECT strftime('%Y', REPORT_DATE) AS YEAR, 
? AS MIN_TEMP, 
? AS MAX_TEMP

FROM WEATHER_MONITOR
WHERE ? >= ?
GROUP BY ?
"""

pd.read_sql(sql, conn)




### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

In [None]:
sql = """
SELECT strftime('%Y', REPORT_DATE) AS YEAR, 
MIN(TEMPERATURE) AS MIN_TEMP, 
MAX(TEMPERATURE) AS MAX_TEMP

FROM WEATHER_MONITOR
WHERE REPORT_DATE >= '2024-03-01'
GROUP BY 1 
"""

pd.read_sql(sql, conn)
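
For reference, the sketch below shows what SQLite's `strftime` format codes return and how `GROUP BY 1` groups on the first selected expression. It assumes `REPORT_DATE` is stored as ISO-formatted text (`YYYY-MM-DD`), and the rows are made up for illustration.

```python
import sqlite3

ans = sqlite3.connect(":memory:")

# strftime returns text: '%Y' is the four-digit year, '%m' the two-digit month
year_part, month_part = ans.execute(
    "SELECT strftime('%Y', '2024-03-01'), strftime('%m', '2024-03-01')"
).fetchone()

ans.execute("CREATE TABLE WEATHER_MONITOR (REPORT_DATE TEXT, TEMPERATURE INTEGER)")
ans.executemany(
    "INSERT INTO WEATHER_MONITOR VALUES (?, ?)",
    [("2023-12-31", 10), ("2024-03-01", 40), ("2024-07-04", 95), ("2025-01-15", 20)],
)

# GROUP BY 1 / ORDER BY 1 refer to the first expression in the SELECT list
by_year = ans.execute("""
SELECT strftime('%Y', REPORT_DATE) AS YEAR,
MIN(TEMPERATURE) AS MIN_TEMP,
MAX(TEMPERATURE) AS MAX_TEMP
FROM WEATHER_MONITOR
WHERE REPORT_DATE >= '2024-03-01'
GROUP BY 1
ORDER BY 1
""").fetchall()

print(year_part, month_part)
print(by_year)
ans.close()
```

Comparing `REPORT_DATE` against `'2024-03-01'` works as a date filter here because ISO-formatted date strings sort in chronological order. Note that `strftime` is SQLite's date function; other platforms typically provide their own equivalents.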