# Analyzing Online Ticket Sales with Amazon Redshift

## By: Vatsal Vinay Parikh
In this project, we query data stored in Amazon Redshift, a data warehouse product that is part of Amazon Web Services. More specifically, we analyze sales activity from a fictional ticketing website where users both buy and sell tickets online for sporting events, shows, and concerts ([source](https://docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html)).

## Explore events

In [1]:
-- List all the events
SELECT * FROM event

Unnamed: 0,eventid,venueid,catid,dateid,eventname,starttime
0,6649,6,9,1827,Hannah Montana,2008-01-01 19:30:00+00:00
1,1433,248,6,1827,Grease,2008-01-01 19:00:00+00:00
2,4135,16,9,1827,Nas,2008-01-01 14:30:00+00:00
3,5807,45,9,1827,Return To Forever,2008-01-01 15:00:00+00:00
4,1217,238,6,1827,Mamma Mia!,2008-01-01 20:00:00+00:00
...,...,...,...,...,...,...
8793,6034,45,9,2191,War,2008-12-31 14:00:00+00:00
8794,6783,60,9,2191,The Police,2008-12-31 15:00:00+00:00
8795,6857,18,9,2191,Judas Priest,2008-12-31 14:00:00+00:00
8796,7192,54,9,2191,Lindsey Buckingham,2008-12-31 19:30:00+00:00


The `event` table links to several other tables in the warehouse, such as `venue`, `category` and `date`. Let's join them.

In [2]:
SELECT *
FROM event
INNER JOIN venue USING(venueid)
INNER JOIN category USING(catid)
INNER JOIN date USING(dateid)
LIMIT 100

Unnamed: 0,dateid,catid,venueid,eventid,eventname,starttime,venuename,venuecity,venuestate,venueseats,catgroup,catname,catdesc,caldate,day,week,month,qtr,year,holiday
0,1827,9,6,6649,Hannah Montana,2008-01-01 19:30:00+00:00,New York Giants Stadium,East Rutherford,NJ,80242.0,Concerts,Pop,All rock and pop music concerts,2008-01-01 00:00:00+00:00,WE,1,JAN,1,2008,True
1,1827,6,248,1433,Grease,2008-01-01 19:00:00+00:00,Charles Playhouse,Boston,MA,0.0,Shows,Musicals,Musical theatre,2008-01-01 00:00:00+00:00,WE,1,JAN,1,2008,True
2,1827,9,16,4135,Nas,2008-01-01 14:30:00+00:00,TD Banknorth Garden,Boston,MA,0.0,Concerts,Pop,All rock and pop music concerts,2008-01-01 00:00:00+00:00,WE,1,JAN,1,2008,True
3,1827,9,45,5807,Return To Forever,2008-01-01 15:00:00+00:00,Prudential Center,Newark,NJ,0.0,Concerts,Pop,All rock and pop music concerts,2008-01-01 00:00:00+00:00,WE,1,JAN,1,2008,True
4,1827,6,238,1217,Mamma Mia!,2008-01-01 20:00:00+00:00,Winter Garden Theatre,New York City,NY,0.0,Shows,Musicals,Musical theatre,2008-01-01 00:00:00+00:00,WE,1,JAN,1,2008,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,1830,9,98,6270,Rodrigo y Gabriela,2008-01-04 14:00:00+00:00,Yankee Stadium,New York City,NY,52325.0,Concerts,Pop,All rock and pop music concerts,2008-01-04 00:00:00+00:00,SA,2,JAN,1,2008,False
96,1830,8,308,117,Il Trovatore,2008-01-04 19:00:00+00:00,Metropolitan Opera,New York City,NY,0.0,Shows,Opera,All opera and light opera,2008-01-04 00:00:00+00:00,SA,2,JAN,1,2008,False
97,1830,7,230,3035,The Caretaker,2008-01-04 14:00:00+00:00,Richard Rodgers Theatre,New York City,NY,0.0,Shows,Plays,All non-musical theatre,2008-01-04 00:00:00+00:00,SA,2,JAN,1,2008,False
98,1830,9,6,4394,Led Zeppelin,2008-01-04 14:30:00+00:00,New York Giants Stadium,East Rutherford,NJ,80242.0,Concerts,Pop,All rock and pop music concerts,2008-01-04 00:00:00+00:00,SA,2,JAN,1,2008,False


There's a `starttime` column coming from the `event` table and a `caldate` column coming from the `date` table. Let's check how often the two fall on the same calendar date, and how many hours apart they can be.

In [3]:
SELECT
	CASE WHEN DATE(caldate) = DATE(starttime) THEN TRUE ELSE FALSE END AS same_date,
	COUNT(*)
FROM event
INNER JOIN date USING(dateid)
GROUP BY 1

Unnamed: 0,same_date,count
0,True,8095
1,False,703


In [4]:
SELECT MAX(DATEDIFF('hour', caldate, starttime))
FROM event
INNER JOIN date USING(dateid)

Unnamed: 0,max
0,20
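
To make the two checks above concrete, here is a minimal Python sketch. It assumes that Redshift's `DATEDIFF` counts the date-part boundaries crossed between two timestamps rather than the elapsed time. The helper names `hour_boundaries_crossed` and `same_date` are made up for illustration, and the sample values are taken from the first `event` row shown earlier.

```python
from datetime import datetime


def hour_boundaries_crossed(start: datetime, end: datetime) -> int:
    """Count hour boundaries between two timestamps (assumed DATEDIFF('hour', ...) behavior)."""
    # Truncate both timestamps to the hour, then count whole hours between them.
    start_hour = start.replace(minute=0, second=0, microsecond=0)
    end_hour = end.replace(minute=0, second=0, microsecond=0)
    return int((end_hour - start_hour).total_seconds() // 3600)


def same_date(caldate: datetime, starttime: datetime) -> bool:
    """Mirror of DATE(caldate) = DATE(starttime)."""
    return caldate.date() == starttime.date()


caldate = datetime(2008, 1, 1)             # caldate of the Hannah Montana event
starttime = datetime(2008, 1, 1, 19, 30)   # its starttime
print(hour_boundaries_crossed(caldate, starttime), same_date(caldate, starttime))
```

Under this assumption, a pair such as 10:59 and 11:00 counts as one hour apart even though only a minute has passed, so the result is best read as a rough distance, not an exact duration.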


Let's see how many events are happening in different cities.

In [9]:
SELECT 
	venuecity,
    COUNT(*) AS num_event
FROM event
INNER JOIN venue USING(venueid)
GROUP BY 1
ORDER BY 2 DESC

Unnamed: 0,venuecity,num_event
0,New York City,2647
1,Los Angeles,312
2,Las Vegas,300
3,Chicago,209
4,San Francisco,194
...,...,...
74,Newark,27
75,Montreal,27
76,Irving,25
77,Sunrise,24


In [10]:
import plotly.express as px
px.bar(event_per_city, x = 'venuecity', y = 'num_event' )

## Explore listings and sales

In [11]:
-- List all the listing records
SELECT * FROM listing

Unnamed: 0,listid,sellerid,eventid,dateid,numtickets,priceperticket,totalprice,listtime
0,1315,37302,920,1827,9,126.0,1134.0,2008-01-01 04:05:41+00:00
1,724,35016,3468,1827,10,40.0,400.0,2008-01-01 03:32:37+00:00
2,1825,45077,3181,1827,16,118.0,1888.0,2008-01-01 01:16:37+00:00
3,7266,45195,7721,1827,2,91.0,182.0,2008-01-01 07:55:03+00:00
4,4118,40141,5624,1827,16,43.0,688.0,2008-01-01 03:10:06+00:00
...,...,...,...,...,...,...,...,...
192492,39400,26496,1042,2191,10,34.0,340.0,2008-12-31 08:31:07+00:00
192493,34734,2720,7192,2191,8,286.0,2288.0,2008-12-31 12:37:49+00:00
192494,55967,36546,5099,2191,10,231.0,2310.0,2008-12-31 05:48:57+00:00
192495,125197,37690,3305,2191,12,115.0,1380.0,2008-12-31 09:22:29+00:00


In [12]:
-- show 100 sales records
SELECT * FROM sales
LIMIT 100

Unnamed: 0,salesid,listid,sellerid,buyerid,eventid,dateid,qtysold,pricepaid,commission,saletime
0,33095,36572,30047,660,2903,1827,2,234.0,35.10,2008-01-01 09:41:06+00:00
1,88268,100813,45818,698,8649,1827,4,836.0,125.40,2008-01-01 07:26:20+00:00
2,150314,173969,48680,816,8762,1827,2,688.0,103.20,2008-01-01 03:50:02+00:00
3,110917,127048,37631,116,1749,1827,1,337.0,50.55,2008-01-01 07:05:02+00:00
4,157751,206999,3003,157,6605,1827,1,1730.0,259.50,2008-01-01 12:50:55+00:00
...,...,...,...,...,...,...,...,...,...,...
95,40196,44927,32034,477,5931,1831,4,432.0,64.80,2008-01-05 03:41:37+00:00
96,50877,57387,37390,572,8538,1831,1,233.0,34.95,2008-01-05 01:56:15+00:00
97,50084,56553,23982,3808,1104,1831,1,245.0,36.75,2008-01-05 11:32:27+00:00
98,60724,68779,37824,1850,7626,1831,1,487.0,73.05,2008-01-05 03:23:28+00:00


Let's see if multiple sales can happen for the same listing.

In [13]:
WITH listings_with_sales AS (
    SELECT 
        listid,
        COUNT(*) AS number_of_sales
    FROM listing
	INNER JOIN sales USING(listid)
    GROUP BY 1
)
SELECT
	number_of_sales,
    COUNT(*) AS cnt
FROM listings_with_sales
GROUP BY 1
ORDER BY 1

Unnamed: 0,number_of_sales,cnt
0,1,48029
1,2,36570
2,3,14665
3,4,1808
4,5,12


It turns out that a single sale is the most common case (48,029 listings), but more than half of the listings with at least one sale were sold in two or more transactions. Only 12 listings had 5 sales.
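
The query above aggregates twice: first it counts sales per listing, then it counts listings per number of sales. A small Python sketch of the same two-step idea, using made-up `listid` values instead of the warehouse data, might look like this:

```python
from collections import Counter

# Hypothetical toy data: the listid of each sale row (not taken from the warehouse).
sale_listids = [101, 101, 102, 103, 103, 103, 104]

# Step 1: listid -> number_of_sales (the CTE in the query above).
sales_per_listing = Counter(sale_listids)

# Step 2: number_of_sales -> cnt (the outer SELECT).
distribution = Counter(sales_per_listing.values())

for number_of_sales, cnt in sorted(distribution.items()):
    print(number_of_sales, cnt)
```

Like the `INNER JOIN` in the query, this only covers listings that had at least one sale; listings without any sale never show up in `sale_listids`.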

We can find the five users who sold the most tickets in 2008:

In [14]:
SELECT
    sellerid,
    username,
    (firstname || ' ' || lastname) AS name,
    city,
    SUM(qtysold)
FROM sales
INNER JOIN date USING(dateid)
INNER JOIN users ON sales.sellerid = users.userid
WHERE year = 2008
GROUP BY 1, 2, 3, 4
ORDER BY 5 DESC
LIMIT 5

Unnamed: 0,sellerid,username,name,city,sum
0,48950,TUT90BHI,Nayda Hood,Frisco,46
1,19123,DZW00VOQ,Scott Simmons,Carson,41
2,20029,RPM45HGY,Drew Mcguire,Lancaster,41
3,36791,DCE77DOA,Emerson Delacruz,Springfield,40
4,9697,GDM25KSM,Dorian Ray,Vicksburg,39


Similarly, we can find the five most active buyers on the site in 2008:

In [15]:
SELECT
    buyerid,
    username,
    (firstname || ' ' || lastname) AS name,
    city,
    SUM(qtysold)
FROM sales
INNER JOIN date USING(dateid)
INNER JOIN users ON sales.buyerid = users.userid
WHERE year = 2008
GROUP BY 1, 2, 3, 4
ORDER BY 5 DESC
LIMIT 5

Unnamed: 0,buyerid,username,name,city,sum
0,8933,CNF70VPH,Jerry Nichols,Middlebury,67
1,1298,EDB46JXK,Kameko Bowman,Newburyport,64
2,3797,KTV94TWB,Armando Lopez,Pomona,64
3,5002,CBC51API,Kellie Savage,Falls Church,63
4,3881,XJN46RCL,Herrod Sparks,Rome,60


Let's see if there's a big difference in average sales price for different categories of events. We're looking at actual sales here, not listings!

In [16]:
SELECT
	catgroup,
    AVG(pricepaid / qtysold) AS avg_ticket_price,
	MEDIAN(pricepaid / qtysold) AS median_ticket_price
FROM sales
INNER JOIN event USING(eventid)
INNER JOIN category USING(catid)
GROUP BY 1

Unnamed: 0,catgroup,avg_ticket_price,median_ticket_price
0,Concerts,333.755006,229.0
1,Shows,336.982704,232.0


Are there sales that were recorded before their listing was created?

In [17]:
SELECT COUNT(*)
FROM listing
INNER JOIN sales USING(listid)
WHERE listtime > saletime

Unnamed: 0,count
0,2965


A sale cannot happen before its listing exists, so these 2,965 rows are most likely bad data. Let's keep them out when figuring out the shortest and longest time to get a listing sold.

In [18]:
WITH tts AS (
    SELECT DATEDIFF('seconds', listtime, saletime) AS time_to_sell
    FROM sales
    INNER JOIN listing USING(listid)
    WHERE listtime < saletime
)
SELECT 
	MIN(time_to_sell) AS shortest_time_to_sell_seconds,
    MAX(time_to_sell)/3600/24 AS longest_time_to_sell_days 
FROM tts

Unnamed: 0,shortest_time_to_sell_seconds,longest_time_to_sell_days
0,17,60


## Finding users that should advertise

Suppose we, as owners of the ticketing website, want to target certain users with the suggestion to advertise their listings, so they have a higher chance of actually getting sales. Let's build a list of the users with the highest value of unsold tickets, that is, the number of tickets listed but not sold multiplied by the listed price per ticket.

In [20]:
WITH listings_with_sales AS (
    SELECT 
        listid,
 	    listing.sellerid,
        numtickets AS tickets_listed,
        priceperticket,
        SUM(COALESCE(qtysold,0)) AS tickets_sold
    FROM listing
	LEFT JOIN sales USING(listid)
    GROUP BY 1, 2, 3, 4
)
SELECT 
	sellerid,
    (firstname ||' '|| lastname) as name,
	SUM((tickets_listed - tickets_sold) * priceperticket) AS unrealized_sales
FROM listings_with_sales lws
INNER JOIN users ON lws.sellerid = users.userid
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 100

Unnamed: 0,sellerid,name,unrealized_sales
0,25428,Jaime Wagner,58395.0
1,24896,Macey Ortiz,53086.0
2,49322,Dustin Vincent,50914.0
3,36926,Audrey Barber,50345.0
4,45819,Kelly Barrett,49826.0
...,...,...,...
95,35926,Ulysses Kinney,38895.0
96,45372,Lysandra Sanchez,38862.0
97,48188,Caesar Parrish,38847.0
98,25373,Jakeem Byrd,38847.0


Jaime Wagner tops the list with about 58k in unrealized sales!
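
The `LEFT JOIN` combined with `COALESCE(qtysold, 0)` is what keeps listings without any sale in the calculation. The sketch below replays a simplified version of the query on an in-memory SQLite database with three made-up listings. SQLite is only a stand-in for Redshift here, the tables are reduced to the columns needed, and the `users` join is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE listing (listid INTEGER, sellerid INTEGER, numtickets INTEGER, priceperticket REAL);
    CREATE TABLE sales (salesid INTEGER, listid INTEGER, qtysold INTEGER);
    -- Hypothetical rows: listing 2 never sold anything.
    INSERT INTO listing VALUES (1, 10, 4, 50.0), (2, 10, 3, 100.0), (3, 20, 2, 80.0);
    INSERT INTO sales VALUES (1, 1, 1), (2, 1, 2), (3, 3, 2);
""")

query = """
WITH listings_with_sales AS (
    SELECT
        listid,
        sellerid,
        numtickets AS tickets_listed,
        priceperticket,
        SUM(COALESCE(qtysold, 0)) AS tickets_sold
    FROM listing
    LEFT JOIN sales USING (listid)
    GROUP BY listid, sellerid, numtickets, priceperticket
)
SELECT
    sellerid,
    SUM((tickets_listed - tickets_sold) * priceperticket) AS unrealized_sales
FROM listings_with_sales
GROUP BY sellerid
ORDER BY unrealized_sales DESC
"""

rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

With an `INNER JOIN`, listing 2 would disappear from the result and its three unsold tickets would not count toward seller 10.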

## Visualize sales over time

In [28]:
-- Show total pricepaid on a weekly basis
SELECT 
	catgroup,
	DATE_TRUNC('week', saletime) AS sales_week,
	SUM(pricepaid) AS total_sales
FROM sales
INNER JOIN event USING(eventid)
INNER JOIN category USING(catid)
GROUP BY 1,2
ORDER BY 2

Unnamed: 0,catgroup,sales_week,total_sales
0,Concerts,2007-12-31 00:00:00+00:00,146716.0
1,Shows,2007-12-31 00:00:00+00:00,98275.0
2,Concerts,2008-01-07 00:00:00+00:00,420710.0
3,Shows,2008-01-07 00:00:00+00:00,313664.0
4,Concerts,2008-01-14 00:00:00+00:00,674721.0
...,...,...,...
101,Shows,2008-12-15 00:00:00+00:00,419135.0
102,Concerts,2008-12-22 00:00:00+00:00,289723.0
103,Shows,2008-12-22 00:00:00+00:00,245468.0
104,Concerts,2008-12-29 00:00:00+00:00,44918.0
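
The first bucket starts on 2007-12-31 even though the sample sales above start on 2008-01-01. That is consistent with `DATE_TRUNC('week', ...)` returning the Monday of the week a timestamp falls in. A minimal Python sketch of that truncation, with a hypothetical helper name `truncate_to_week`, could be:

```python
from datetime import datetime, timedelta


def truncate_to_week(ts: datetime) -> datetime:
    """Return midnight on the Monday of the week containing ts."""
    monday = ts - timedelta(days=ts.weekday())  # weekday(): Monday == 0
    return monday.replace(hour=0, minute=0, second=0, microsecond=0)


first_sale = datetime(2008, 1, 1, 9, 41, 6)  # a saletime from the sales sample above
sales_week = truncate_to_week(first_sale)
print(sales_week)
```

It also means the first and last buckets probably cover only part of a week, which may explain why their totals are lower than those of the weeks in between.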


In [29]:
import plotly.express as px
px.line(sales_over_week, x = 'sales_week', y = 'total_sales', color = 'catgroup')