# Databases


Within SQL, we will use various queries to:

- select subsets of the data
- sum over groups
- create new tables
- count distinct values of desired variables
- order data by chosen variables
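As a quick preview, each of these query types can be sketched against a tiny in-memory SQLite database. The `sentences` table, its columns, and its rows are hypothetical and exist only for illustration; they are not part of the `ncdoc.db` data used below.

```python
import sqlite3

# Hypothetical toy table, used only to illustrate the query types listed above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (person_id INTEGER, county TEXT, years REAL)")
conn.executemany(
    "INSERT INTO sentences VALUES (?, ?, ?)",
    [(1, "WAKE", 2.0), (2, "WAKE", 5.0), (3, "DURHAM", 1.5), (3, "DURHAM", 4.0)],
)

# Select a subset of the rows
subset = conn.execute("SELECT * FROM sentences WHERE years > 3").fetchall()

# Sum over groups
totals = conn.execute(
    "SELECT county, SUM(years) FROM sentences GROUP BY county ORDER BY county"
).fetchall()

# Create a new table from the result of a query
conn.execute("CREATE TABLE long_sentences AS SELECT * FROM sentences WHERE years > 3")
n_long = conn.execute("SELECT COUNT(*) FROM long_sentences").fetchone()[0]

# Count distinct values of a variable
n_people = conn.execute("SELECT COUNT(DISTINCT person_id) FROM sentences").fetchone()[0]

# Order the data by a chosen variable
ordered = conn.execute(
    "SELECT person_id, years FROM sentences ORDER BY years DESC"
).fetchall()

print(subset, totals, n_long, n_people, ordered, sep="\n")
```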

### Establish a Connection to the Database

To pull data we need two things: a connection to the database and a query. We start with the connection. To create it we will use Python's built-in `sqlite3` module and tell it which database file we want to open, much as you would choose a database in pgAdmin. Additional details on creating a connection to the database are provided in the [Databases](02_1_Databases.ipynb) notebook.

__Parameter 1: Connection__

In [7]:
# to create a connection to the database,
# we need to pass the path of the database file
import sqlite3

DB = 'ncdoc.db'

conn = sqlite3.connect(DB)
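Note that `sqlite3.connect()` silently creates a new, empty database file if the path does not exist, so a misspelled file name will not raise an error. One way to confirm that the connection points at the intended data is to list the tables it contains. The sketch below uses an in-memory database with a stand-in `inmate` table (an assumption, so that it runs without `ncdoc.db`); the same query works on the real connection.

```python
import sqlite3

# An in-memory database stands in for 'ncdoc.db' so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inmate (INMATE_DOC_NUMBER INTEGER, INMATE_LAST_NAME TEXT)")

# sqlite_master is SQLite's built-in catalog of tables, indexes, and views.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```

An empty list here would suggest that the path is wrong and a fresh database was created instead.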

### Formulate Data Query

Depending on what data we are interested in, we can write different queries to pull different data. In this example, we will pull every column of the first 20 rows returned from the `inmate` table.

__Create a query as a `string` object in Python__

In [8]:
query = '''
SELECT *
FROM inmate
LIMIT 20;
'''

Note:

- the three quotation marks surrounding the query body create a multi-line string. This is handy for writing SQL queries because each newline character becomes part of the string instead of ending it, so the query can be laid out one clause per line
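A small sketch of what the multi-line string actually contains: `repr()` makes the stored newline characters visible, and comparing against a one-line version shows that the line breaks only affect readability, because SQL treats any whitespace between keywords alike. The name `one_line` is introduced here purely for illustration.

```python
query = '''
SELECT *
FROM inmate
LIMIT 20;
'''

# repr() shows the newline characters stored inside the string
print(repr(query))

# The same query written on a single line
one_line = "SELECT * FROM inmate LIMIT 20;"

# Splitting on whitespace yields identical tokens for both versions
same_tokens = query.split() == one_line.split()
print(same_tokens)
```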

In [9]:
# Now that we have defined a variable `query`, we can call it in the code
print(query)


SELECT *
FROM inmate
LIMIT 20;



> Note that the `LIMIT` clause provides one simple way to get a "sample" of data; however, using `LIMIT` does **not provide a _random_** sample. Without an `ORDER BY` clause the database returns whichever rows are fastest for it to retrieve, so you may see different rows than others running the same query.
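If a random sample is what you need, one common approach in SQLite is to shuffle the rows with `ORDER BY RANDOM()` before applying `LIMIT`. The sketch below uses a stand-in `inmate` table with made-up document numbers (an assumption, so that it runs without `ncdoc.db`).

```python
import sqlite3

# Stand-in table with 1000 made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inmate (INMATE_DOC_NUMBER INTEGER)")
conn.executemany("INSERT INTO inmate VALUES (?)", [(i,) for i in range(1000)])

# LIMIT alone: whichever rows the database finds first
first_rows = conn.execute(
    "SELECT INMATE_DOC_NUMBER FROM inmate LIMIT 5"
).fetchall()

# ORDER BY RANDOM() shuffles the rows, so LIMIT then keeps a random subset
random_rows = conn.execute(
    "SELECT INMATE_DOC_NUMBER FROM inmate ORDER BY RANDOM() LIMIT 5"
).fetchall()

print(first_rows)
print(random_rows)
```

Because `ORDER BY RANDOM()` has to assign a random value to every row and sort them all, it can be slow on large tables.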

### Pull Data from the Database

Now that we have the two parameters (database connection and query), we can pass them to the `pd.read_sql()` function, and obtain the data.

In [10]:
# here we pass the query and the connection to the pd.read_sql() function and assign the variable `df`
# to the dataframe returned by the function
import pandas as pd

df = pd.read_sql(query, conn)

In [11]:
# Look at the data
df.head()

Unnamed: 0,INMATE_DOC_NUMBER,INMATE_LAST_NAME,INMATE_FIRST_NAME,INMATE_MIDDLE_INITIAL,INMATE_NAME_SUFFIX,INMATE_NAME_SOUNDEX_CODE,INMATE_GENDER_CODE,INMATE_RACE_CODE,INMATE_BIRTH_DATE,INMATE_ETHNIC_AFFILIATION,...,CURRENT_PENDING_REVIEWS_FLAG,ESCAPE_HISTORY_FLAG,PRIOR_INCARCERATIONS_FLAG,NEXT_PAROLE_REVIEW_TYPE_CODE,TIME_OF_LAST_MOVEMENT,POPULATION/MANAGEMENT_UNIT,INMATE_POSITIVELY_IDENTIFIED,PAROLE_AND_TERMINATE_STATUS,INMATE_LABEL_STATUS_CODE,PRIMARY_OFFENSE_QUALIFIER
0,4,AARON,DAVID,C,,,MALE,WHITE,1961-10-15,UNKNOWN,...,N,N,Y,,00:09:00,,YES,,,
1,6,AARON,GERALD,,,,MALE,WHITE,1951-07-17,UNKNOWN,...,N,N,Y,,00:11:00,,YES,,,
2,8,AARON,JAMES,M,,,MALE,WHITE,1963-12-29,UNKNOWN,...,N,N,Y,,23:59:00,,YES,,FILE JACKET LABEL PRINTED,PRINCIPAL
3,10,AARON,KENNETH,T,,,MALE,BLACK,1953-05-18,UNKNOWN,...,N,N,Y,,00:13:00,,YES,,,
4,14,AARON,MOYER,,,,MALE,WHITE,1921-08-26,UNKNOWN,...,N,N,Y,,00:12:00,,YES,,,
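When a query depends on a value chosen at run time, it is safer to pass that value separately than to paste it into the query string. With a `sqlite3` connection the placeholder is `?`. The sketch below uses only `sqlite3` and a stand-in `inmate` table with three made-up rows (an assumption, so that it runs without `ncdoc.db` or pandas); `pd.read_sql()` accepts the same values through its `params` argument, for example `pd.read_sql(query, conn, params=("MALE",))`.

```python
import sqlite3

# Stand-in table with a few made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inmate ("
    "INMATE_DOC_NUMBER INTEGER, INMATE_LAST_NAME TEXT, INMATE_GENDER_CODE TEXT)"
)
conn.executemany(
    "INSERT INTO inmate VALUES (?, ?, ?)",
    [(4, "AARON", "MALE"), (6, "AARON", "MALE"), (7, "ABBOTT", "FEMALE")],
)

# The ? placeholder is filled in by the database driver, not by string formatting
query = '''
SELECT INMATE_DOC_NUMBER, INMATE_LAST_NAME
FROM inmate
WHERE INMATE_GENDER_CODE = ?
ORDER BY INMATE_DOC_NUMBER;
'''

cursor = conn.execute(query, ("MALE",))

# cursor.description holds one entry per result column; the name comes first
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(columns)
print(rows)
```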


## Conditional Statements

## Joins