# HONR 39900 Fall 2021: Foundations of Geospatial Analytics
## Week 1 Class Notebook
## Reviewing Python and SQL

### Justin A. Gould (gould29@purdue.edu)

# Required Packages

In [49]:
import pandas as pd
from pandasql import sqldf
import sqlite3
from IPython.display import display

#Set up pandasql
pysqldf = lambda q: sqldf(q, globals())

#Load Sample Data -- DO NOT CHANGE
conn = sqlite3.connect("week_1.db")

jobs = pd.read_sql_query("SELECT * FROM jobs", conn)
employees = pd.read_sql_query("SELECT * FROM employees", conn)

# General Familiarity with Python
- https://developers.google.com/edu/python

In [19]:
#Set a variable in the notebook
a = 6 #Run via "SHIFT + ENTER"
a #Entering an expression prints its value

6

In [3]:
a + 2

8

In [20]:
a = "hello" #Let's work with strings. Wrap it in either "" or ''
a

'hello'

In [9]:
len(a) #Determine the length of a

5

In [10]:
a + len(a) #Try something that will fail

TypeError: can only concatenate str (not "int") to str

In [11]:
a + str(len(a)) #Let's try this... turn the length of a into a string
                #"Adding" strings will concat them

'hello5'

In [12]:
hello #This will fail, as hello is not a defined name

NameError: name 'hello' is not defined

In [13]:
a #But, the value of a is "hello"

'hello'

In [21]:
"hello" #This works, too, as it will just print the string value

'hello'

# Strings

In [15]:
s = 'hi'
print(s[1])          ## i
print(len(s))        ## 2
print(s + ' there')  ## hi there

i
2
hi there


Unlike Java, Python's `+` does not automatically convert numbers or other types to string form. The `str()` function converts a value to its string form so it can be combined with other strings.

In [16]:
pi = 3.14
text = 'The value of pi is ' + pi      ## NO, does not work

TypeError: can only concatenate str (not "float") to str

In [18]:
text = 'The value of pi is '  + str(pi)  ## yes
text

'The value of pi is 3.14'
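
As a side note, an f-string is another common way to build the same text without calling `str()` yourself. This is a minimal sketch for comparison; the variable names `text_concat` and `text_fstring` are only for illustration and are not part of the original notebook.

```python
pi = 3.14

# str() converts the float explicitly before concatenation
text_concat = 'The value of pi is ' + str(pi)

# An f-string formats the value inline, so no explicit str() call is needed
text_fstring = f'The value of pi is {pi}'

print(text_concat)
print(text_fstring)
print(text_concat == text_fstring)  # both approaches build the same string
```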

Common string methods:

 - `s.lower()`, `s.upper()` -- returns the lowercase or uppercase version of the string
 - `s.strip()` -- returns a string with whitespace removed from the start and end
 - `s.isalpha()`/`s.isdigit()`/`s.isspace()`... -- tests if all the string chars are in the various character classes
 - `s.startswith('other')`, `s.endswith('other')` -- tests if the string starts or ends with the given other string
 - `s.find('other')` -- searches for the given other string (not a regular expression) within s, and returns the first index where it begins or -1 if not found
 - `s.replace('old', 'new')` -- returns a string where all occurrences of 'old' have been replaced by 'new'
 - `s.split('delim')` -- returns a list of substrings separated by the given delimiter. The delimiter is not a regular expression, it's just text. `'aaa,bbb,ccc'.split(',')` -> `['aaa', 'bbb', 'ccc']`. As a convenient special case, `s.split()` (with no arguments) splits on all whitespace chars.
 - `s.join(list)` -- opposite of `split()`, joins the elements in the given list together using the string as the delimiter. e.g. `'---'.join(['aaa', 'bbb', 'ccc'])` -> `'aaa---bbb---ccc'`
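
The methods above can be chained into a small cleanup pipeline. The sketch below uses a made-up sample string (not the notebook's exercise text), and the variable names are assumptions chosen for readability.

```python
sample = "  The Lazy Brown Dog  "

cleaned = sample.strip()            # remove leading and trailing whitespace
lowered = cleaned.lower()           # lowercase copy; the original is unchanged
words = lowered.split()             # split on whitespace into a list of words
joined = '-'.join(words)            # join the words back with '-' as delimiter
position = lowered.find('brown')    # index where 'brown' begins
missing = lowered.find('cat')       # -1 because 'cat' does not occur
replaced = lowered.replace('dog', 'fox')

print(cleaned)
print(words)
print(joined)
print(position, missing)
print(replaced)
print(lowered.startswith('the'), lowered.endswith('cat'))
```

Strings are immutable, so each method returns a new string rather than changing `sample`.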

In [22]:
text = "the lazy brown dog jumped over the fox"

#Try them out here!


The "slice" syntax is a handy way to refer to sub-parts of sequences -- typically strings and lists. The slice `s[start:end]` is the elements beginning at start and extending up to but not including end. Suppose we have s = "Hello":

- `s[1:4]` is 'ell' -- chars starting at index 1 and extending up to but not including index 4
- `s[1:]` is 'ello' -- omitting either index defaults to the start or end of the string
- `s[:]` is 'Hello' -- omitting both always gives us a copy of the whole thing (this is the pythonic way to copy a sequence like a string or list)
- `s[1:100]` is 'ello' -- an index that is too big is truncated down to the string length 


- `s[-1]` is 'o' -- last char (1st from the end)
- `s[-4]` is 'e' -- 4th from the end
- `s[:-3]` is 'He' -- going up to but not including the last 3 chars.
- `s[-3:]` is 'llo' -- starting with the 3rd char from the end and extending to the end of the string. 
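
The slice rules above can be checked directly. This short sketch reuses `s = "Hello"` from the text and adds a hypothetical list, `numbers`, to show that the same syntax applies to lists.

```python
s = "Hello"

print(s[1:4])    # from index 1 up to, but not including, index 4
print(s[1:])     # from index 1 to the end
print(s[:])      # the whole string
print(s[1:100])  # an oversized end index is truncated to the string length
print(s[-1])     # last char
print(s[:-3])    # everything except the last 3 chars
print(s[-3:])    # the last 3 chars

# The same slice syntax works on lists (hypothetical sample data)
numbers = [10, 20, 30, 40, 50]
middle = numbers[1:3]
tail = numbers[-2:]
print(middle, tail)
```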

In [23]:
text = "welcome to class"

#Try them out here!


# Lists

Python has a great built-in list type named "list". List literals are written within square brackets [ ]. Lists work similarly to strings -- use the len() function and square brackets [ ] to access data, with the first element at index 0. (See http://docs.python.org/tut/node7.html)

In [24]:
colors = ['red', 'blue', 'green']
print(colors[0])    ## red
print(colors[2])    ## green
print(len(colors))  ## 3

red
green
3


In [25]:
squares = [1, 4, 9, 16]
total = 0  #Avoid naming this `sum`, which would shadow Python's built-in sum()

for num in squares:
    total += num

print(total)  ## 30

30


In [26]:
#Print the numbers from 0 through 9
for i in range(10):
    print(i)

0
1
2
3
4
5
6
7
8
9


In [27]:
#Add to a list
empty = []
for i in range(10):
    empty.append(i)#Add the numbers from 0 through 9 to list

empty

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Common list methods:


 - `list.append(elem)` -- adds a single element to the end of the list. Common error: does not return the new list, just modifies the original.
 - `list.insert(index, elem)` -- inserts the element at the given index, shifting elements to the right.
 - `list.extend(list2)` -- adds the elements in list2 to the end of the list. Using `+` or `+=` on a list is similar to using `extend()`.
 - `list.index(elem)` -- searches for the given element from the start of the list and returns its index. Throws a ValueError if the element does not appear (use "in" to check without a `ValueError`).
 - `list.remove(elem)` -- searches for the first instance of the given element and removes it (throws `ValueError` if not present)
 - `list.sort()` -- sorts the list in place (does not return it).
 - `list.reverse()` -- reverses the list in place (does not return it)
 - `list.pop(index)` -- removes and returns the element at the given index. Returns the rightmost element if index is omitted (roughly the opposite of `append()`).
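
The "in place" behavior noted above is a common source of bugs, so here is a minimal sketch of it. The list `names` and the other variable names are assumptions for illustration; the built-in `sorted()` is shown as the alternative that returns a new list.

```python
names = ['moe', 'larry', 'curly']

result = names.sort()       # sorts names in place and returns None
print(result)               # None, so do not write names = names.sort()
print(names)

# The built-in sorted() leaves its input alone and returns a new list
sorted_copy = sorted(names, reverse=True)
print(sorted_copy)

last = names.pop()          # with no index, removes and returns the last element
print(last, names)

# Use "in" to check membership before calling index() to avoid a ValueError
if 'larry' in names:
    print(names.index('larry'))
```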

In [29]:
list = ['larry', 'curly', 'moe']
print("Original list:", list, "\n")

list.append('shemp')         ## append elem at end
print("Appending 'shemp':", list, "\n")

list.insert(0, 'xxx')        ## insert elem at index 0
print("Inserting 'xxx' at 0:", list, "\n")

list.extend(['yyy', 'zzz'])  ## add list of elems at end
print("Extending 'yyy' and 'zzz':", list, "\n")

print("Index of 'curly':", list.index('curly'), "\n")    ## 2
print("Remember, Python index starts at 0, not 1!", "\n")

list.remove('curly')         ## search and remove that element
print("Removing 'curly':", list, "\n")

list.pop(1)                  ## removes and returns 'larry'
print("Pop index 1 (second item, 'larry')", list)  ## ['xxx', 'moe', 'shemp', 'yyy', 'zzz']

Original list: ['larry', 'curly', 'moe'] 

Appending 'shemp': ['larry', 'curly', 'moe', 'shemp'] 

Inserting 'xxx' at 0: ['xxx', 'larry', 'curly', 'moe', 'shemp'] 

Extending 'yyy' and 'zzz': ['xxx', 'larry', 'curly', 'moe', 'shemp', 'yyy', 'zzz'] 

Index of 'curly': 2 

Remember, Python index starts at 0, not 1! 

Removing 'curly': ['xxx', 'larry', 'moe', 'shemp', 'yyy', 'zzz'] 

Pop index 1 (second item, 'larry') ['xxx', 'moe', 'shemp', 'yyy', 'zzz']


In [30]:
list = ['larry', 'curly', 'moe']

#Try some of the methods yourself!


# Using Functions

Functions help you structure your code and generalize it for efficient use.

For example, to multiply a number by 2, you would have to run this every time:

```
a = 4
b = a * 2 #8
```

Instead, define a function once and call it with any numeric input:
```
def double(a):
    return a * 2

b = double(4) #8
c = double(5) #10
```

In [32]:
#Practice!
#Create a function to sort a list


#Create a function to square a number


# Basic SQL Syntax and Functionality
- https://www.w3schools.com/sql/


We will use SQL on Pandas DataFrames, via the `pandasql` package.

# SQL Query Structure

```
SELECT
    COLUMN_1,
    COLUMN_2,
    ...
FROM
    TABLE
WHERE
    CONDITION
```

- `SELECT`: The `SELECT` statement selects data from a database. You can list specific column names (as shown above) or use `*` to select all columns: `SELECT * FROM TABLE`.

- `FROM`: The `FROM` clause names the table to query. For example, if we have a table called `customers`, we would get all of its data with `SELECT * FROM customers`.

- `WHERE`: The `WHERE` clause filters records, returning only those that fulfill a specified condition.
  - The `CONDITION` (`WHERE customer_id = 1`, which would return records where `customer_id` is equal to `1`) is broken into three parts:
    - `COLUMN` (`customer_id`): Column to which condition is applied.
    - `OPERATOR` (`=`)
    - `VALUE` (`1`)
  - You can combine multiple conditions in a single `WHERE` clause, separated by `AND` or `OR`.
    - The `AND` operator displays a record if all the conditions separated by `AND` are TRUE.
    - The `OR` operator displays a record if any of the conditions separated by `OR` is TRUE.

## Example:

```
SELECT
    *
FROM
    Customers
WHERE
    Country='Germany'
    AND City='Berlin'
```

The above query will return all records (`*`) from the table `Customers` if a row's value for the column `Country` is equal to `Germany` and the value for the column `City` is equal to `Berlin`.
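
To see `AND` and `OR` side by side, the sketch below runs both against a tiny in-memory SQLite database using Python's standard `sqlite3` module. The `Customers` rows and the `CustomerName` column are made up for illustration and are not part of the course data; the connection is named `demo_conn` so it does not replace the notebook's `conn`.

```python
import sqlite3

# Hypothetical sample data in a throwaway in-memory database
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Customers (CustomerName TEXT, City TEXT, Country TEXT)")
demo_conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [
        ("Alfreds", "Berlin", "Germany"),
        ("Blauer See", "Mannheim", "Germany"),
        ("Around the Horn", "London", "UK"),
    ],
)

# AND: both conditions must be true for a row to be returned
and_rows = demo_conn.execute(
    "SELECT CustomerName FROM Customers "
    "WHERE Country = 'Germany' AND City = 'Berlin'"
).fetchall()

# OR: a row is returned if either condition is true
or_rows = demo_conn.execute(
    "SELECT CustomerName FROM Customers "
    "WHERE Country = 'UK' OR City = 'Berlin' "
    "ORDER BY CustomerName"
).fetchall()

demo_conn.close()

print(and_rows)
print(or_rows)
```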

# Example `pandasql` Usage

In [47]:
query = "SELECT * FROM employees"
pysqldf(query)[:10] #Showing first 10 rows

Unnamed: 0,employee_id,first_name,last_name,email,phone_number,hire_date,job_id,salary,manager_id,department_id
0,100,Steven,King,steven.king@sqltutorial.org,515.123.4567,1987-06-17,4,24000.0,,9
1,101,Neena,Kochhar,neena.kochhar@sqltutorial.org,515.123.4568,1989-09-21,5,17000.0,100.0,9
2,102,Lex,De Haan,lex.de haan@sqltutorial.org,515.123.4569,1993-01-13,5,17000.0,100.0,9
3,103,Alexander,Hunold,alexander.hunold@sqltutorial.org,590.423.4567,1990-01-03,9,9000.0,102.0,6
4,104,Bruce,Ernst,bruce.ernst@sqltutorial.org,590.423.4568,1991-05-21,9,6000.0,103.0,6
5,105,David,Austin,david.austin@sqltutorial.org,590.423.4569,1997-06-25,9,4800.0,103.0,6
6,106,Valli,Pataballa,valli.pataballa@sqltutorial.org,590.423.4560,1998-02-05,9,4800.0,103.0,6
7,107,Diana,Lorentz,diana.lorentz@sqltutorial.org,590.423.5567,1999-02-07,9,4200.0,103.0,6
8,108,Nancy,Greenberg,nancy.greenberg@sqltutorial.org,515.124.4569,1994-08-17,7,12000.0,101.0,10
9,109,Daniel,Faviet,daniel.faviet@sqltutorial.org,515.124.4169,1994-08-16,6,9000.0,108.0,10


In [48]:
query = "SELECT * FROM jobs"
pysqldf(query)[:10] #Showing first 10 rows

Unnamed: 0,job_id,job_title,min_salary,max_salary
0,1,Public Accountant,4200.0,9000.0
1,2,Accounting Manager,8200.0,16000.0
2,3,Administration Assistant,3000.0,6000.0
3,4,President,20000.0,40000.0
4,5,Administration Vice President,15000.0,30000.0
5,6,Accountant,4200.0,9000.0
6,7,Finance Manager,8200.0,16000.0
7,8,Human Resources Representative,4000.0,9000.0
8,9,Programmer,4000.0,10000.0
9,10,Marketing Manager,9000.0,15000.0


# SELECT Clause

In [None]:
#Try to select any subset of columns from either `jobs` or `employees`



# Aggregators

We can also use aggregator functions to make computations in our query. For example:
- `MIN()`: The MIN() function returns the smallest value of the selected column.
- `MAX()`: The MAX() function returns the largest value of the selected column.
- `COUNT()`: The COUNT() function returns the number of rows that match a specified criterion.
- `SUM()`: The SUM() function returns the total sum of a numeric column.
- `AVG()`: The AVG() function returns the average value of a numeric column.

How to use aggregators:
```
SELECT MIN(column_name)
FROM table_name
WHERE condition;
```

```
SELECT AVG(column_name)
FROM table_name
WHERE condition;
```

```
SELECT COUNT(column_name)
FROM table_name
WHERE condition;
```
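
The templates above can be tried on a small table. The following sketch uses Python's standard `sqlite3` module with a hypothetical `demo_orders` table (not part of the course database) and includes one missing value to show that `COUNT(column_name)`, `SUM()`, and `AVG()` skip NULLs while `COUNT(*)` counts every row.

```python
import sqlite3

# Hypothetical sample data; the last order has no amount (NULL)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_orders (order_id INTEGER, amount REAL)")
demo_conn.executemany(
    "INSERT INTO demo_orders VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (3, 30.0), (4, None)],
)

# Several aggregators can be computed in one SELECT
summary = demo_conn.execute(
    "SELECT MIN(amount), MAX(amount), COUNT(*), COUNT(amount), "
    "SUM(amount), AVG(amount) FROM demo_orders"
).fetchone()

# Aggregators respect the WHERE clause: only rows passing the filter are counted
filtered_count = demo_conn.execute(
    "SELECT COUNT(amount) FROM demo_orders WHERE amount > 15"
).fetchone()[0]

demo_conn.close()

print(summary)
print(filtered_count)
```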

In [60]:
#For example...
query = "SELECT MIN(min_salary) FROM jobs" #Shows the smallest minimum salary
display(pysqldf(query))
print("\n")

query = "SELECT MAX(max_salary) FROM jobs WHERE job_title LIKE '%Manager'" #Shows the largest maximum salary among job titles ending in "Manager"
display(pysqldf(query))

Unnamed: 0,MIN(min_salary)
0,2000.0






Unnamed: 0,MAX(max_salary)
0,20000.0


In [None]:
#Now, you try to use aggregators to compute data!
#What is the average minimum salary?

#How many positions have a minimum salary greater than 4000?

# WHERE Clause

In [68]:
#For example...
display(pysqldf("SELECT job_id, job_title, max_salary FROM jobs WHERE max_salary > 10000"))

#Now, you try to use WHERE clauses to filter data!
#Show the positions with a maximum salary greater than 12000


#Show the positions with a minimum salary less than 7000


Unnamed: 0,job_id,job_title,max_salary
0,2,Accounting Manager,16000.0
1,4,President,40000.0
2,5,Administration Vice President,30000.0
3,7,Finance Manager,16000.0
4,10,Marketing Manager,15000.0
5,12,Public Relations Representative,10500.0
6,14,Purchasing Manager,15000.0
7,15,Sales Manager,20000.0
8,16,Sales Representative,12000.0


# JOINs

A `JOIN` clause is used to combine rows from two or more tables, based on a related column between them.

## Different Types of SQL JOINs
Here are the different types of the JOINs in SQL:
- (INNER) JOIN: Returns records that have matching values in both tables
- LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right table
- RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left table
- FULL (OUTER) JOIN: Returns all records from both tables, matching rows where possible

![JOINs](http://www.datapine.com/blog/wp-content/uploads/2015/08/summary-sql-joins.png)
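
The difference between an inner and a left join is easiest to see when one row has no match. The sketch below uses Python's standard `sqlite3` module with two hypothetical tables, `demo_employees` and `demo_jobs`, that only mimic the shape of the course data. Only `JOIN` and `LEFT JOIN` are shown, since support for `RIGHT` and `FULL OUTER` joins can depend on the SQLite version in use.

```python
import sqlite3

# Hypothetical sample data: "Cy" has a job_id with no matching row in demo_jobs
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_jobs (job_id INTEGER, job_title TEXT)")
demo_conn.execute("CREATE TABLE demo_employees (first_name TEXT, job_id INTEGER)")
demo_conn.executemany(
    "INSERT INTO demo_jobs VALUES (?, ?)",
    [(1, "Analyst"), (2, "Engineer")],
)
demo_conn.executemany(
    "INSERT INTO demo_employees VALUES (?, ?)",
    [("Ana", 1), ("Ben", 2), ("Cy", 99)],
)

# INNER JOIN: only employees whose job_id matches a row in demo_jobs
inner_rows = demo_conn.execute("""
    SELECT e.first_name, j.job_title
    FROM demo_employees AS e
    JOIN demo_jobs AS j ON j.job_id = e.job_id
    ORDER BY e.first_name
""").fetchall()

# LEFT JOIN: every employee; job_title is NULL (None) when there is no match
left_rows = demo_conn.execute("""
    SELECT e.first_name, j.job_title
    FROM demo_employees AS e
    LEFT JOIN demo_jobs AS j ON j.job_id = e.job_id
    ORDER BY e.first_name
""").fetchall()

demo_conn.close()

print(inner_rows)
print(left_rows)
```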


In [55]:
#Example
query = """
SELECT
    employees.*,
    jobs.job_title
FROM
    employees
JOIN
    jobs ON jobs.job_id = employees.job_id
"""

pysqldf(query)[:10] #First 10 rows

#Notice the final column! I was able to reference the `job_title` column, via `job_id` in both the `employees` and `jobs` tables.

Unnamed: 0,employee_id,first_name,last_name,email,phone_number,hire_date,job_id,salary,manager_id,department_id,job_title
0,100,Steven,King,steven.king@sqltutorial.org,515.123.4567,1987-06-17,4,24000.0,,9,President
1,101,Neena,Kochhar,neena.kochhar@sqltutorial.org,515.123.4568,1989-09-21,5,17000.0,100.0,9,Administration Vice President
2,102,Lex,De Haan,lex.de haan@sqltutorial.org,515.123.4569,1993-01-13,5,17000.0,100.0,9,Administration Vice President
3,103,Alexander,Hunold,alexander.hunold@sqltutorial.org,590.423.4567,1990-01-03,9,9000.0,102.0,6,Programmer
4,104,Bruce,Ernst,bruce.ernst@sqltutorial.org,590.423.4568,1991-05-21,9,6000.0,103.0,6,Programmer
5,105,David,Austin,david.austin@sqltutorial.org,590.423.4569,1997-06-25,9,4800.0,103.0,6,Programmer
6,106,Valli,Pataballa,valli.pataballa@sqltutorial.org,590.423.4560,1998-02-05,9,4800.0,103.0,6,Programmer
7,107,Diana,Lorentz,diana.lorentz@sqltutorial.org,590.423.5567,1999-02-07,9,4200.0,103.0,6,Programmer
8,108,Nancy,Greenberg,nancy.greenberg@sqltutorial.org,515.124.4569,1994-08-17,7,12000.0,101.0,10,Finance Manager
9,109,Daniel,Faviet,daniel.faviet@sqltutorial.org,515.124.4169,1994-08-16,6,9000.0,108.0,10,Accountant


# Subqueries

A subquery is, essentially, a query within a query. Subqueries let you specify the results of one query as an argument in another query. When you're writing an SQL query, you may want to specify a parameter based on the result of another query.

For example, if I want to know which employees have a salary greater than or equal to the company's average salary, I need to compute the average salary for all employees and use that as a parameter in my query:

```
SELECT
    *
FROM
    employees
WHERE
    salary >= (SELECT AVG(salary) FROM employees)
```

In [67]:
#Here's the example from above
print("The company's average salary is:", pysqldf("SELECT AVG(salary) FROM employees")["AVG(salary)"][0], "\n")

query = """
SELECT
    *
FROM
    employees
WHERE
    salary >= (SELECT AVG(salary) FROM employees)
"""
pysqldf(query)

The company's average salary is: 8060.0 



Unnamed: 0,employee_id,first_name,last_name,email,phone_number,hire_date,job_id,salary,manager_id,department_id
0,100,Steven,King,steven.king@sqltutorial.org,515.123.4567,1987-06-17,4,24000.0,,9
1,101,Neena,Kochhar,neena.kochhar@sqltutorial.org,515.123.4568,1989-09-21,5,17000.0,100.0,9
2,102,Lex,De Haan,lex.de haan@sqltutorial.org,515.123.4569,1993-01-13,5,17000.0,100.0,9
3,103,Alexander,Hunold,alexander.hunold@sqltutorial.org,590.423.4567,1990-01-03,9,9000.0,102.0,6
4,108,Nancy,Greenberg,nancy.greenberg@sqltutorial.org,515.124.4569,1994-08-17,7,12000.0,101.0,10
5,109,Daniel,Faviet,daniel.faviet@sqltutorial.org,515.124.4169,1994-08-16,6,9000.0,108.0,10
6,110,John,Chen,john.chen@sqltutorial.org,515.124.4269,1997-09-28,6,8200.0,108.0,10
7,114,Den,Raphaely,den.raphaely@sqltutorial.org,515.127.4561,1994-12-07,14,11000.0,100.0,3
8,121,Adam,Fripp,adam.fripp@sqltutorial.org,650.123.2234,1997-04-10,19,8200.0,100.0,5
9,145,John,Russell,john.russell@sqltutorial.org,,1996-10-01,15,14000.0,100.0,8


In [None]:
#Now, it's your turn. Write a query to determine which employees have a salary higher than the average Programmer salary (HINT: job_id = 9)
