# HW4 - SQL

This homework has you working with a new database of ticket sales for various types of events.  Your job is to do some initial exploring and then demonstrate that you can write all the different types of SQL queries we learned over the last week.  You'll also need to write one function that makes looking at the tables easier.

These questions are written the way someone would actually ask them.  In other words, they are 'plain English' questions rather than ones that spell out exactly which columns and tables to use.  Your exploration of the database, and the function you write to ease that process, will come in handy here!  

The database has been created using a set of data from Amazon. You can read more about what each table contains here: https://docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html.  

**Submission Instructions**

1- Replace the blank with your name (e.g. DE_HW4-Sara_Riazi)

2- Run your notebook (all the outputs must be visible).

3- Download .ipynb  

4- Submit on Gradescope

## Libraries and import functions

In [1]:
%pip install mysql-connector-python

Note: you may need to restart the kernel to use updated packages.


First, bring in the libraries we'll need!

In [2]:
import mysql.connector
import pandas as pd

Now bring in the `get_conn_cur` and `run_query` functions, along with the connection information, as we used them in the SQL practice notebook (Lab 3)!  

In [3]:
#connection
mysql_address = '131.193.32.85'
mysql_username='de_student'
mysql_password='DE_Student_PaSS'

mysql_database = 'my_dataengineering_dbs'

def get_conn_cur():
    # Open a connection to the course MySQL server and return it with a cursor
    cnx = mysql.connector.connect(user=mysql_username, password=mysql_password,
                                  host=mysql_address,
                                  database=mysql_database, port=3306)
    return (cnx, cnx.cursor())

In [4]:
#run_query
def run_query(query_string):
    conn, cur = get_conn_cur() 
    cur.execute(query_string) 

    my_data = cur.fetchall()
    result_df = pd.DataFrame(my_data, columns=cur.column_names)

    cur.close()
    conn.close()

    return result_df

## Make a SQL head function - 5 points

Make a function that gives you the SQL equivalent of pandas' `.head()`.

This function should be called `sql_head` and take a single argument, `table_name`, which specifies the table you want the head information from.  It should return the column names along with the first five rows of the table.  

**For full points, return a pandas dataframe with this information so it displays nicely :)**

In [7]:
# make sql_head function
def sql_head(table_name):
    qs = """
    SELECT *
    FROM %s
    LIMIT 5
    """ % table_name
    return run_query(qs)


In [8]:
# Check that it works!
sql_head(table_name = 'ticketsdb_sales')

Unnamed: 0,salesid,listid,sellerid,buyerid,eventid,dateid,qtysold,pricepaid,commission,saletime
0,1,1,36861,21191,7872,1875,4,728.0,109.2,2008-02-17 20:36:48
1,2,4,8117,11498,4337,1983,2,76.0,11.4,2008-06-06 00:00:16
2,3,5,1616,17433,8647,1983,2,350.0,52.5,2008-06-06 03:26:17
3,4,5,1616,19715,8647,1986,1,175.0,26.25,2008-06-09 03:38:52
4,5,6,47402,14115,8240,2069,2,154.0,23.1,2008-08-31 04:17:02
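
A table name cannot be passed as a bound query parameter, so `sql_head` has to build the query string itself. The sketch below shows one way to guard that step by checking the name against a simple identifier pattern first. It is an illustration only: it uses an in-memory SQLite table as a stand-in for the course MySQL server, and the name `safe_sql_head`, the `n` parameter, and the toy rows are assumptions rather than part of the assignment.

```python
import re
import sqlite3

# Hypothetical stand-in for the course database: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticketsdb_sales (salesid INTEGER, qtysold INTEGER)")
conn.executemany(
    "INSERT INTO ticketsdb_sales VALUES (?, ?)",
    [(i, i % 4 + 1) for i in range(1, 11)],
)

def safe_sql_head(conn, table_name, n=5):
    """Return (column_names, rows) for the first n rows of table_name."""
    # Identifiers cannot be bound as parameters, so validate the name first.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")
    cur = conn.execute(f"SELECT * FROM {table_name} LIMIT {int(n)}")
    columns = [d[0] for d in cur.description]
    return columns, cur.fetchall()

columns, rows = safe_sql_head(conn, "ticketsdb_sales")
print(columns, len(rows))
```

The same check could be placed at the top of `sql_head` before the query string is formatted.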


## Explore and SELECT - 5 points

Let's start this homework with some basic queries to get a look at what's in the various tables. Remember that we are using one database for all schemas in this course, so running "show tables" will also list tables from previous schemas.
* Use `run_query` first to run "show tables".
* Look at the column name. We only want the tables that start with 'ticketsdb', which is the schema for this notebook.
* Run "show tables where Tables_in_my_dataengineering_dbs like 'ticketsdb_%' " using `run_query` to see all tables in the ticketsdb schema.
* Now use the `sql_head()` function you created to get the first five rows of every table in the ticketsdb schema.

In [16]:
ticketsdb_schema = run_query("show tables where Tables_in_my_dataengineering_dbs like 'ticketsdb_%'")
for table in ticketsdb_schema['Tables_in_my_dataengineering_dbs']:
    print(sql_head(table))

   catid catgroup catname                            catdesc
0      1   Sports     MLB            Major League Baseball\n
1      2   Sports     NHL           National Hockey League\n
2      3   Sports     NFL         National Football League\n
3      4   Sports     NBA  National Basketball Association\n
4      5   Sports     MLS              Major League Soccer\n
   dateid     caldate day  week month qtr  year  holiday
0    1827  2008-01-01  WE     1   JAN   1  2008        0
1    1828  2008-01-02  TH     1   JAN   1  2008        0
2    1829  2008-01-03  FR     1   JAN   1  2008        0
3    1830  2008-01-04  SA     2   JAN   1  2008        0
4    1831  2008-01-05  SU     2   JAN   1  2008        0
   eventid  venueid  catid  dateid                    eventname  \
0        1      305      8    1851              Gotterdammerung   
1        2      306      8    2114                Boris Godunov   
2        3      302      8    1935                       Salome   
3        4      309     

## WHERE - 5 points

Now let's do a bit of filtering with WHERE.  Write and run queries to get the following results.  
**LIMIT all returns to first five rows.**

* Get venues with >= 10000 seats from the venues table
* Get venues in Arizona
* Get users who have a first name that starts with H
* Get **just email addresses** of users who gave a .edu email address




In [20]:
# Get big venues: those with >= 10000 seats
qs = """
    SELECT * FROM ticketsdb_venue
    WHERE venueseats >= 10000
    """
run_query(qs)
# sql_head("ticketsdb_venue")

Unnamed: 0,venueid,venuename,venuecity,venuestate,venueseats
0,5,Gillette Stadium,Foxborough,MA,68756
1,6,New York Giants Stadium,East Rutherford,NJ,80242
2,15,McAfee Coliseum,Oakland,CA,63026
3,18,Madison Square Garden,New York City,NY,20000
4,67,Ralph Wilson Stadium,Orchard Park,NY,73967
5,68,Rogers Centre,Toronto,ON,50516
6,69,Dolphin Stadium,Miami Gardens,FL,74916
7,70,M&T Bank Stadium,Baltimore,MD,70107
8,71,Paul Brown Stadium,Cincinnati,OH,65535
9,72,Cleveland Browns Stadium,Cleveland,OH,73200


In [24]:
# Get venues in AZ
qs = """
    SELECT * FROM ticketsdb_venue
    WHERE venuestate = 'AZ'
    """
run_query(qs)

Unnamed: 0,venueid,venuename,venuecity,venuestate,venueseats
0,38,US Airways Center,Phoenix,AZ,0
1,65,Jobing.com Arena,Glendale,AZ,0
2,92,University of Phoenix Stadium,Glendale,AZ,0
3,117,Chase Field,Phoenix,AZ,0


In [None]:
#Get users who have a first name that starts with H
qs = """
    SELECT * FROM ticketsdb_users
    WHERE firstname like 'H%'
    """
run_query(qs)
# debugging
# run_query("show tables")
# sql_head("ticketsdb_users")

Unnamed: 0,userid,username,firstname,lastname,city,state,email,phone,likesports,liketheatre,likeconcerts,likejazz,likeclassical,likeopera,likerock,likevegas,likebroadway,likemusicals
0,13,QTF33MCG,Henry,Cochran,Bossier City,QC,Aliquam.vulputate.ullamcorper@amalesuada.org,(783) 105-0989,0,0,0,0,0,0,0,0,0,0
1,22,RHT62AGI,Hermione,Trevino,Walnut,WI,non.justo.Proin@ametconsectetuer.edu,(245) 110-6540,0,0,0,0,0,0,0,0,0,0
2,29,HUH27PKK,Helen,Avery,Garland,PE,in.faucibus.orci@ultrices.edu,(385) 925-3875,0,0,0,0,0,0,0,0,0,0
3,56,MHU11LZP,Howard,Wiley,Oklahoma City,NU,accumsan@vulputateullamcorper.ca,(277) 315-5682,0,0,0,0,0,0,0,0,0,0
4,67,TWU10MZT,Herman,Myers,Basin,PE,Mauris@neque.com,(471) 895-6189,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2331,49901,CUC69PVF,Hermione,Mcclain,Del Rio,MB,id@Curabitur.org,(242) 264-7006,0,0,0,0,0,0,0,0,0,0
2332,49936,ZKV32TGE,Hop,Mcclain,Saint Cloud,WV,adipiscing@at.org,(930) 289-0793,0,0,0,0,0,0,0,0,0,0
2333,49966,XXT27FBP,Hayden,Wilkinson,Portland,ON,ullamcorper.Duis@pharetra.com,(945) 884-6008,0,0,0,0,0,0,0,0,0,0
2334,49973,KVB52LOX,Harlan,Murphy,Hot Springs,NM,Sed.nulla@nec.ca,(844) 671-5836,0,0,0,0,0,0,0,0,0,0


In [32]:
# Get all .edu email addresses... just the email addresses
qs = """
    SELECT email FROM ticketsdb_users
    WHERE email like '%.edu'
"""
run_query(qs)

Unnamed: 0,email
0,Etiam.laoreet.libero@sodalesMaurisblandit.edu
1,Suspendisse.tristique@nonnisiAenean.edu
2,ullamcorper.nisl@Cras.edu
3,vel.est@velitegestas.edu
4,justo.nec.ante@quismassa.edu
...,...
12472,nec.orci@adipiscing.edu
12473,Aliquam@sed.edu
12474,velit.Aliquam.nisl@ac.edu
12475,Proin@Class.edu
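
This section asks for every result to be limited to the first five rows. A minimal sketch of combining `WHERE` with `LIMIT 5` follows. It runs against an in-memory SQLite table rather than the course MySQL server, uses a handful of the venue rows shown above as toy data, and adds `ORDER BY venueid` (an assumption, not a stated requirement) so that the five rows returned are deterministic.

```python
import sqlite3

# In-memory SQLite stand-in for the ticketsdb_venue table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ticketsdb_venue "
    "(venueid INTEGER, venuename TEXT, venuestate TEXT, venueseats INTEGER)"
)
venues = [
    (5, "Gillette Stadium", "MA", 68756),
    (6, "New York Giants Stadium", "NJ", 80242),
    (15, "McAfee Coliseum", "CA", 63026),
    (18, "Madison Square Garden", "NY", 20000),
    (38, "US Airways Center", "AZ", 0),
    (67, "Ralph Wilson Stadium", "NY", 73967),
    (68, "Rogers Centre", "ON", 50516),
    (69, "Dolphin Stadium", "FL", 74916),
]
conn.executemany("INSERT INTO ticketsdb_venue VALUES (?, ?, ?, ?)", venues)

# Filter first, then cap the result at five rows.
big_venues = conn.execute("""
    SELECT venueid, venuename, venueseats
    FROM ticketsdb_venue
    WHERE venueseats >= 10000
    ORDER BY venueid
    LIMIT 5
""").fetchall()

for row in big_venues:
    print(row)
```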


## GROUP BY and HAVING - 5 points

Time to practice some GROUP BY and HAVING operations. Please write and run queries that do the following:

GROUP BY application
* Find the top five venues that hosted the most events: Alias the count of events as 'events_hosted'. Also return the venue ID
* Get the number of events hosted in each month. You'll need to use `month()` in your select to select just the months. Alias this as 'month' and then the count of the number of events hosted as 'events_hosted'.
* Get the top five sellers who made the most commission. Alias their total commission made as 'total_com'. Also get their average commission made and alias as 'avg_com'.  Be sure to also display the seller_id.  

HAVING application
* Using the same query as the last one, instead of getting the top five sellers get all sellers who have made a total commission greater than $4000.
* Using the same query as the first GROUP BY, instead of returning the top five venues, return just the IDs of venues that have hosted more than 60 events.

In [41]:
### GROUP BY application
# Find the top five venues that hosted the most events: Alias the count of events as 'events_hosted'. Also return the venue ID
qs = """
    SELECT venueid, COUNT(venueid) as events_hosted
    FROM ticketsdb_event
    GROUP BY venueid
    LIMIT 5
"""

run_query(qs)

# run_query("show tables")
# sql_head("ticketsdb_event")

Unnamed: 0,venueid,events_hosted
0,1,49
1,2,39
2,3,35
3,4,28
4,5,32
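
A "top five" question needs the groups sorted before they are limited. `LIMIT 5` on its own returns whichever five groups come first, not the five largest. The sketch below shows the `ORDER BY ... DESC LIMIT 5` pattern on an in-memory SQLite table with made-up data (venue `v` hosts `v` events), so the table contents are assumptions for illustration only. The same pattern would apply to the seller question by sorting on `total_com`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ticketsdb_event (eventid INTEGER PRIMARY KEY, venueid INTEGER)"
)
# Hypothetical data: venue v hosts v events, for v = 1..7.
events = [(v,) for v in range(1, 8) for _ in range(v)]
conn.executemany("INSERT INTO ticketsdb_event (venueid) VALUES (?)", events)

# Count events per venue, sort by the count, then keep the five largest.
top_venues = conn.execute("""
    SELECT venueid, COUNT(*) AS events_hosted
    FROM ticketsdb_event
    GROUP BY venueid
    ORDER BY events_hosted DESC
    LIMIT 5
""").fetchall()

print(top_venues)
```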


In [43]:
# Get the number of events hosted in each month. You'll need to use `month()` in your select to select just the months.
# Alias this as 'month' and then the count of the number of events hosted as 'events_hosted'
qs = """
    SELECT month(starttime) as month, COUNT(venueid) as events_hosted
    FROM ticketsdb_event
    GROUP BY month
"""

run_query(qs)

# run_query("show tables")
# sql_head("ticketsdb_event")

Unnamed: 0,month,events_hosted
0,1,778
1,2,711
2,3,753
3,4,725
4,5,727
5,6,709
6,7,729
7,8,737
8,9,746
9,10,735


In [50]:
# Get the top five sellers who made the most commission. Alias their total commission made as 'total_com'.
# Also get their average commission made and alias as 'avg_com'. Be sure to also display the seller_id
qs = """
    SELECT sellerid, SUM(commission) as total_com, AVG(commission) as avg_com
    FROM ticketsdb_sales
    GROUP BY sellerid
    LIMIT 5
"""

run_query(qs)

# run_query("show tables")
# sql_head("ticketsdb_sales")

Unnamed: 0,sellerid,total_com,avg_com
0,1,102.45,51.225
1,2,277.35,34.66875
2,3,267.75,89.25
3,4,671.1,223.7
4,5,43.95,14.65


In [51]:
### HAVING application
# Using the same query as the last groupby, instead of getting the top five sellers get all sellers who have made a total commission greater than $4000
qs = """
    SELECT sellerid, SUM(commission) as total_com, AVG(commission) as avg_com
    FROM ticketsdb_sales
    GROUP BY sellerid
    HAVING total_com > 4000
"""

run_query(qs)

# run_query("show tables")
# sql_head("ticketsdb_sales")

Unnamed: 0,sellerid,total_com,avg_com
0,1140,4859.85,347.132143
1,2372,4073.85,678.975
2,13385,4274.25,388.568182
3,25433,4147.95,518.49375
4,43551,4704.75,470.475


In [52]:
# Using the same query as the first groupby, instead of returning the top five venues, return just the ID's of venues that have had greater than 60 events
qs = """
    SELECT venueid, COUNT(venueid) as events_hosted
    FROM ticketsdb_event
    GROUP BY venueid
    HAVING events_hosted > 60
"""

run_query(qs)

# run_query("show tables")
# sql_head("ticketsdb_event")

Unnamed: 0,venueid,events_hosted
0,201,62
1,203,80
2,205,70
3,207,67
4,208,69
5,209,66
6,215,62
7,216,72
8,217,81
9,218,70
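
The task asks for just the venue IDs. One way to do that, sketched here under assumptions, is to move the aggregate into the `HAVING` clause so the count does not need to appear in the `SELECT` list. The example uses an in-memory SQLite table with invented event counts instead of the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ticketsdb_event (eventid INTEGER PRIMARY KEY, venueid INTEGER)"
)
# Hypothetical counts: venue 201 hosts 62 events, 202 hosts 10, 203 hosts 80.
for venueid, n_events in [(201, 62), (202, 10), (203, 80)]:
    conn.executemany(
        "INSERT INTO ticketsdb_event (venueid) VALUES (?)",
        [(venueid,)] * n_events,
    )

# HAVING filters on the per-group count; only the ID is selected.
busy_venue_ids = [
    row[0]
    for row in conn.execute("""
        SELECT venueid
        FROM ticketsdb_event
        GROUP BY venueid
        HAVING COUNT(*) > 60
        ORDER BY venueid
    """)
]

print(busy_venue_ids)
```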


## JOIN - 5 points

Time for some joins. You've probably noticed by now that each table has at least one relational key, and some have more.  For example, sales has a unique sale id as well as a listing id, seller id, buyer id, and date id.  These keys let you link each sale to relevant information in other tables.  

Please write queries to do the following items:

* Join information of users to each sale made (using seller id).  
* Join information about each venue to each event.

In [None]:
# Join users information to each sale using seller id (correct solution has 172456 rows)
qs = """
    SELECT s.*, u.*
    FROM ticketsdb_sales s
    JOIN ticketsdb_users u
    ON s.sellerid = u.userid
"""

run_query(qs)

# run_query("show tables")
# sql_head("ticketsdb_users")

In [None]:
# For each event attach the venue information (correct solution has 8659 rows)
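
As a rough illustration of the shape this join could take, the sketch below attaches venue columns to each event through the shared `venueid` key. It runs on in-memory SQLite tables. The venue names and cities are invented placeholders, and only a few columns are modelled, so treat it as a pattern rather than the graded solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ticketsdb_venue (venueid INTEGER, venuename TEXT, venuecity TEXT);
    CREATE TABLE ticketsdb_event (eventid INTEGER, venueid INTEGER, eventname TEXT);
    -- Venue names and cities below are hypothetical placeholders.
    INSERT INTO ticketsdb_venue VALUES
        (302, 'Placeholder Theatre', 'City A'),
        (305, 'Placeholder Opera House', 'City B'),
        (306, 'Placeholder Hall', 'City C');
    INSERT INTO ticketsdb_event VALUES
        (1, 305, 'Gotterdammerung'),
        (2, 306, 'Boris Godunov'),
        (3, 302, 'Salome');
""")

# One output row per event, with the matching venue columns attached.
events_with_venues = conn.execute("""
    SELECT e.eventid, e.eventname, v.venuename, v.venuecity
    FROM ticketsdb_event e
    JOIN ticketsdb_venue v ON e.venueid = v.venueid
    ORDER BY e.eventid
""").fetchall()

for row in events_with_venues:
    print(row)
```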


## Subqueries - 5 points

To wrap up let's do several subqueries. Please do the following:

* Get all purchases made by users who live in Arizona
* Get event information for all events that took place in a venue where the venue name ends with 'Stadium'.
* Get event information for all events where the total ticket sales were greater than $50,000.  

In [None]:
# Get all purchases from users who live in Arizona (correct solution has 1855 rows)
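
A subquery with `IN` is one way to express this kind of filter: the inner query collects the matching user IDs and the outer query keeps the sales whose buyer is in that set. The sketch below assumes purchases are matched on `buyerid` and that Arizona is stored as `'AZ'` in a `state` column. It runs on toy in-memory SQLite tables, not the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ticketsdb_users (userid INTEGER, state TEXT);
    CREATE TABLE ticketsdb_sales (salesid INTEGER, buyerid INTEGER, pricepaid REAL);
    -- Hypothetical rows for illustration.
    INSERT INTO ticketsdb_users VALUES (1, 'AZ'), (2, 'NM'), (3, 'AZ');
    INSERT INTO ticketsdb_sales VALUES
        (10, 1, 728.0), (11, 2, 76.0), (12, 3, 350.0), (13, 2, 175.0);
""")

# Inner query: IDs of Arizona users. Outer query: sales bought by those users.
az_purchases = conn.execute("""
    SELECT *
    FROM ticketsdb_sales
    WHERE buyerid IN (SELECT userid FROM ticketsdb_users WHERE state = 'AZ')
    ORDER BY salesid
""").fetchall()

print(az_purchases)
```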


In [None]:
# Get event information for all events that took place in a venue where the name ended in 'Stadium' (correct solution has 1029 rows)
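
The same `IN` pattern can be combined with `LIKE '%Stadium'`, where the leading `%` matches any prefix so only names ending in 'Stadium' qualify. This is a sketch on toy in-memory SQLite tables. The event names are invented, and case sensitivity of `LIKE` may differ between SQLite and the MySQL server depending on collation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ticketsdb_venue (venueid INTEGER, venuename TEXT);
    CREATE TABLE ticketsdb_event (eventid INTEGER, venueid INTEGER, eventname TEXT);
    INSERT INTO ticketsdb_venue VALUES
        (5, 'Gillette Stadium'), (18, 'Madison Square Garden'), (69, 'Dolphin Stadium');
    -- Hypothetical events for illustration.
    INSERT INTO ticketsdb_event VALUES
        (1, 5, 'Event A'), (2, 18, 'Event B'), (3, 69, 'Event C');
""")

# Inner query: venues whose name ends in 'Stadium'. Outer query: their events.
stadium_events = conn.execute("""
    SELECT *
    FROM ticketsdb_event
    WHERE venueid IN (
        SELECT venueid FROM ticketsdb_venue WHERE venuename LIKE '%Stadium'
    )
    ORDER BY eventid
""").fetchall()

print(stadium_events)
```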


In [None]:
# Get event name where the total sales for that event were greater than $50000 (correct solution has three rows: Adriana Lecouvreur, Phantom of the Opera, and Janet Jackson)
# Note that we are looking for the event name!
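
Here the subquery itself needs a `GROUP BY` and `HAVING`, because the filter applies to a per-event total rather than to individual sales. The sketch below assumes "total sales" means the sum of `pricepaid` per event, which the assignment does not state explicitly. It uses toy in-memory SQLite tables with invented sale amounts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ticketsdb_event (eventid INTEGER, eventname TEXT);
    CREATE TABLE ticketsdb_sales (salesid INTEGER, eventid INTEGER, pricepaid REAL);
    INSERT INTO ticketsdb_event VALUES (1, 'Phantom of the Opera'), (2, 'Salome');
    -- Hypothetical sale amounts for illustration.
    INSERT INTO ticketsdb_sales VALUES
        (1, 1, 30000.0), (2, 1, 25000.0), (3, 2, 400.0);
""")

# Inner query: events whose summed pricepaid exceeds 50000.
# Outer query: look up the names of those events.
big_events = conn.execute("""
    SELECT eventname
    FROM ticketsdb_event
    WHERE eventid IN (
        SELECT eventid
        FROM ticketsdb_sales
        GROUP BY eventid
        HAVING SUM(pricepaid) > 50000
    )
""").fetchall()

print(big_events)
```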
