# Concert Tickets Price Prediction - Web Scraping
#### By: Sarah Alabdulwahab & Asma Althakafi

>Our goal for this project is to predict ticket prices for upcoming concerts in the United States.

## Data Description 
We build the dataset by scraping a concert ticket selling website, Razorgator, and collect the following features:
- **Artist**: The band or individual performing live.
- **City**: The city where the concert takes place.
- **State**: The state where the concert takes place.
- **Venue**: The venue where the concert takes place.
- **Date**: The calendar date of the concert.
- **Day**: The day of the week of the concert.
- **Time**: The start time of the concert.
- **Level**: The seating section and row (front, middle, or back).
- **Price**: The price of the concert ticket.

In addition, we add two features: the median and the mean (average) salary of the state where the concert takes place.

In [1]:
#imports
from bs4 import BeautifulSoup 
import requests
import os
import time
import pandas as pd
from selenium import webdriver
from tqdm import tqdm

In [2]:
# path to the chromedriver executable
chromedriver = "C:/Users/HP/AppData/Roaming/Microsoft/Windows/Start Menu/Programs/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver

## Collecting Concert Tickets from Razorgator.com

### Scraping Artist Links

In [3]:
response = requests.get('https://www.razorgator.com/concerts-tickets/')
page = response.text
soup = BeautifulSoup(page, "lxml")

artist_links = []
#the page lists 100 artists; their links are the 'contentItem' anchors at indices 136 to 235
for link in soup.find_all('a',class_='contentItem')[136:236]: 
    artist_links.append(['https://www.razorgator.com' + link.get('href'), link.text])
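
The string concatenation above works as long as every `href` is a path that starts with `/`. A minimal sketch of a more forgiving alternative uses `urllib.parse.urljoin` from the standard library, which also copes with hrefs that are already absolute. The helper name `absolute_url` and the constant `BASE_URL` are illustrative and not part of the notebook.

```python
from urllib.parse import urljoin

# Site root used in the cells above; the trailing slash keeps urljoin from dropping a path segment.
BASE_URL = "https://www.razorgator.com/"

def absolute_url(href, base=BASE_URL):
    """Resolve an href from the page against the site root (hypothetical helper)."""
    return urljoin(base, href)

# Root-relative, relative, and already-absolute hrefs all resolve to the same URL.
examples = [
    absolute_url("/j-cole-tickets/"),
    absolute_url("j-cole-tickets/"),
    absolute_url("https://www.razorgator.com/j-cole-tickets/"),
]
```

With this helper, the loop body could append `[absolute_url(link.get('href')), link.text]` instead of concatenating strings.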

In [4]:
#making sure that we got all 100 artists
len(artist_links)

100

In [5]:
artist_links

[['https://www.razorgator.com/machine-gun-kelly-tickets/',
  'Machine Gun Kelly'],
 ['https://www.razorgator.com/run-the-jewels-tickets/', 'Run the Jewels'],
 ['https://www.razorgator.com/j-cole-tickets/', 'J. Cole'],
 ['https://www.razorgator.com/21-savage-tickets/', '21 Savage'],
 ['https://www.razorgator.com/post-malone-tickets/', 'Post Malone'],
 ['https://www.razorgator.com/doja-cat-tickets/', 'Doja Cat'],
 ['https://www.razorgator.com/tyler-the-creator-tickets/',
  'Tyler The Creator'],
 ['https://www.razorgator.com/travis-scott-tickets/', 'Travis Scott'],
 ['https://www.razorgator.com/jack-harlow-tickets/', 'Jack Harlow'],
 ['https://www.razorgator.com/megan-thee-stallion-tickets/',
  'Megan Thee Stallion'],
 ['https://www.razorgator.com/dababy-tickets/', 'DaBaby'],
 ['https://www.razorgator.com/atliens-tickets/', 'ATLiens'],
 ['https://www.razorgator.com/dj-diesel-tickets/', 'DJ Diesel'],
 ['https://www.razorgator.com/z-trip-tickets/', 'Z-Trip'],
 ['https://www.razorgator.com/t

### Scraping Concert Links for Each Artist

In [6]:
def get_concert_link(links):
    #concert_links will store a [concert link, artist name] pair for each concert
    concert_links = []
    #each element of links is an [artist link, artist name] pair
    for artist_url, artist in tqdm(links):
        response = requests.get(artist_url)
        soup = BeautifulSoup(response.text, "lxml")

        #collect the target of every buy button on the artist page
        for link in soup.find_all('a',class_='buyButton adaElement'):
            concert_links.append(['https://www.razorgator.com' + link.get('href'), artist])

    return concert_links

In [7]:
concerts_links = get_concert_link(artist_links)

100%|██████████| 100/100 [03:46<00:00,  2.26s/it]


In [8]:
len(concerts_links)

1938
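
Because links are collected per artist, an event that appears on more than one artist page (for example a festival) could show up in the list several times, and each copy would later cost a full page load. Whether that actually happens in this dataset is not shown here, so the following is only a sketch of how the list could be grouped by URL first. The function name `group_artists_by_url` is hypothetical, and the second artist in the sample is made up for illustration.

```python
from collections import defaultdict

def group_artists_by_url(links):
    """Map each concert URL to the artists it was found under, preserving first-seen order."""
    grouped = defaultdict(list)
    for url, artist in links:
        if artist not in grouped[url]:
            grouped[url].append(artist)
    return dict(grouped)

festival = ("https://www.razorgator.com/buy-austin-city-limits-festival-weekend-two-"
            "3-day-pass-tickets-zilker-park-10-8-21-3am/4495493/")
sample_links = [
    [festival, "Machine Gun Kelly"],   # pair taken from the output of this notebook
    [festival, "Doja Cat"],            # hypothetical duplicate, for illustration only
]
unique_concerts = group_artists_by_url(sample_links)
```

Each key of `unique_concerts` would then need to be scraped only once, with the resulting rows repeated for every artist in its list.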

In [9]:
concerts_links

[['https://www.razorgator.com/buy-machine-gun-kelly-tickets-eagles-ballroom-10-5-21-7pm/4662184/',
  'Machine Gun Kelly'],
 ['https://www.razorgator.com/buy-machine-gun-kelly-tickets-ascend-amphitheater-10-6-21-7pm/4662243/',
  'Machine Gun Kelly'],
 ['https://www.razorgator.com/buy-austin-city-limits-festival-weekend-two-3-day-pass-tickets-zilker-park-10-8-21-3am/4495493/',
  'Machine Gun Kelly'],
 ['https://www.razorgator.com/buy-machine-gun-kelly-tickets-bill-graham-civic-auditorium-10-10-21-8pm/4662316/',
  'Machine Gun Kelly'],
 ['https://www.razorgator.com/buy-machine-gun-kelly-tickets-pavilion-at-riverfront-10-12-21-6pm/4662319/',
  'Machine Gun Kelly'],
 ['https://www.razorgator.com/buy-machine-gun-kelly-tickets-mcmenamins-historic-edgefield-amphitheater-10-13-21-6pm/4661694/',
  'Machine Gun Kelly'],
 ['https://www.razorgator.com/buy-machine-gun-kelly-tickets-greek-theatre-10-15-21-7pm/4665409/',
  'Machine Gun Kelly'],
 ['https://www.razorgator.com/buy-machine-gun-kelly-ticke

### Scraping Details of Each Concert Ticket Using Chromedriver
- Venue
- City
- State
- Day
- Date
- Time
- Level
- Price

In [10]:
def get_info(links):
    info = []
    #each element of links is a [concert link, artist name] pair
    for url, artist in tqdm(links):
        driver = webdriver.Chrome(chromedriver)
        try:
            driver.get(url)

            #wait 10 seconds to let the page load, then scrape the data
            time.sleep(10)

            soup = BeautifulSoup(driver.page_source, 'html.parser')
        finally:
            #always shut the browser down, including for pages that are skipped below
            driver.quit()

        #data will store the venue, city, state, day, date, and time all together
        if soup.find('p') is None and soup.find('h2') is None: #if there is no data, we will skip this link
            continue
        elif soup.find('p') is None:
            data = soup.find('h2').text
        else:
            data = soup.find('p').text

        levels = [level.text for level in soup.find_all('div',class_ = 'sout')]
        prices = [price.text for price in soup.find_all('label',class_ = 'sendE')]

        #pair each level with its price; zip stops at the shorter list instead of raising an IndexError
        for level, price in zip(levels, prices):
            info.append([artist, data, level, price])

    return info

In [11]:
info = get_info(concerts_links)

100%|██████████| 1938/1938 [12:48:57<00:00, 23.81s/it]  


#### As the progress bar above shows, the `get_info` function took nearly 13 hours to run.
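
A good part of that runtime is the fixed `time.sleep(10)`: 1938 pages × 10 s is 19,380 s, or about 5.4 hours, before any browser start-up or page loading is counted. One common way to cut this is to poll until the content is present and stop waiting as soon as it is. The sketch below shows the idea in plain Python; `wait_until` is a hypothetical helper, and Selenium ships its own `WebDriverWait` class for the same purpose.

```python
import time

def wait_until(predicate, timeout=10.0, poll_interval=0.25):
    """Call `predicate` repeatedly until it returns True or `timeout` seconds pass.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)

# Stand-in for "the page has loaded": becomes true on the third check.
checks = {"count": 0}

def page_ready():
    checks["count"] += 1
    return checks["count"] >= 3

loaded = wait_until(page_ready, timeout=2.0, poll_interval=0.01)
```

Inside `get_info`, the predicate could be something like checking that the page source already contains the price labels, with the 10 seconds kept only as the upper bound.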

In [12]:
len(info)

180094

In [13]:
info

[['Machine Gun Kelly',
  'Eagles Ballroom: Milwaukee, WI - Tue, Oct 05 2021 7:30 PM',
  'Section GA • Row GA1',
  '$202'],
 ['Machine Gun Kelly',
  'Eagles Ballroom: Milwaukee, WI - Tue, Oct 05 2021 7:30 PM',
  'VIP BALC • Row GA',
  '$278'],
 ['Machine Gun Kelly',
  'Eagles Ballroom: Milwaukee, WI - Tue, Oct 05 2021 7:30 PM',
  'VIP BALC • Row GA',
  '$291'],
 ['Machine Gun Kelly',
  'Eagles Ballroom: Milwaukee, WI - Tue, Oct 05 2021 7:30 PM',
  'FLOOR GA • Row GA1',
  '$370'],
 ['Machine Gun Kelly',
  'Ascend Amphitheater: Nashville, TN -\nWed, Oct 06 2021 7:30 PM\n',
  'GA Pit • Row GA0',
  '$394'],
 ['Machine Gun Kelly',
  'Ascend Amphitheater: Nashville, TN -\nWed, Oct 06 2021 7:30 PM\n',
  'GA Pit • Row GAO',
  '$407'],
 ['Machine Gun Kelly',
  'Ascend Amphitheater: Nashville, TN -\nWed, Oct 06 2021 7:30 PM\n',
  'GA Pit • Row GA2',
  '$407'],
 ['Machine Gun Kelly',
  'Ascend Amphitheater: Nashville, TN -\nWed, Oct 06 2021 7:30 PM\n',
  'GA Pit • Row GA0',
  '$411'],
 ['Machine G

### Storing the Scraped Data Into A DataFrame

In [14]:
df_tickets = pd.DataFrame(info, columns = ['artist','data','level','price'])
df_tickets

| | artist | data | level | price |
|---|---|---|---|---|
| 0 | Machine Gun Kelly | Eagles Ballroom: Milwaukee, WI - Tue, Oct 05 2... | Section GA • Row GA1 | $202 |
| 1 | Machine Gun Kelly | Eagles Ballroom: Milwaukee, WI - Tue, Oct 05 2... | VIP BALC • Row GA | $278 |
| 2 | Machine Gun Kelly | Eagles Ballroom: Milwaukee, WI - Tue, Oct 05 2... | VIP BALC • Row GA | $291 |
| 3 | Machine Gun Kelly | Eagles Ballroom: Milwaukee, WI - Tue, Oct 05 2... | FLOOR GA • Row GA1 | $370 |
| 4 | Machine Gun Kelly | Ascend Amphitheater: Nashville, TN -\nWed, Oct... | GA Pit • Row GA0 | $394 |
| ... | ... | ... | ... | ... |
| 180089 | Dave Chappelle | Smoothie King Center: New Orleans, LA -\nThu, ... | Section 113 • Row 5 | $438 |
| 180090 | Dave Chappelle | Smoothie King Center: New Orleans, LA -\nThu, ... | Section 302 • Row 8 | $152 |
| 180091 | Dave Chappelle | Smoothie King Center: New Orleans, LA -\nThu, ... | Section 301 • Row 1 | $209 |
| 180092 | Dave Chappelle | Smoothie King Center: New Orleans, LA -\nThu, ... | Section 108 • Row 11 | $250 |


In [15]:
#storing the df as csv
df_tickets.to_csv('tickets.csv',index = False)
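
At this point the `data` column still holds venue, city, state, day, date, and time in a single string, and `price` and `level` are raw text. The following is a minimal sketch of how these could be split into the features listed in the Data Description. The regular expression is inferred only from the sample rows printed above, so other pages may need a different pattern, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Pattern inferred from rows such as
# 'Eagles Ballroom: Milwaukee, WI - Tue, Oct 05 2021 7:30 PM' (an assumption, not a site spec).
EVENT_RE = re.compile(
    r"^(?P<venue>.+?): (?P<city>[^,]+), (?P<state>[A-Z]{2}) - "
    r"(?P<day>[A-Za-z]{3}), (?P<when>.+)$"
)

def parse_event(data):
    """Split the combined string into its parts, or return None if it does not match."""
    text = " ".join(data.split())  # collapse newlines and repeated spaces
    match = EVENT_RE.match(text)
    if match is None:
        return None
    try:
        when = datetime.strptime(match["when"], "%b %d %Y %I:%M %p")
    except ValueError:
        return None
    return {
        "venue": match["venue"], "city": match["city"], "state": match["state"],
        "day": match["day"], "date": when.date().isoformat(),
        "time": when.strftime("%H:%M"),
    }

def parse_price(price):
    """Turn '$202' into 202; assumes whole-dollar prices as in the samples above."""
    return int(price.replace("$", "").replace(",", ""))

def parse_level(level):
    """Split 'Section GA • Row GA1' into ('Section GA', 'Row GA1')."""
    section, _, row = level.partition(" • ")
    return section.strip(), row.strip()

event = parse_event('Ascend Amphitheater: Nashville, TN -\nWed, Oct 06 2021 7:30 PM\n')
```

On the DataFrame, functions like these could be applied column by column, for example with `df_tickets['data'].apply(parse_event)`.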

## Collecting Median and Average Salaries per State from Wikipedia.org

In [36]:
response = requests.get('https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_median_wage_and_mean_wage')
soup = BeautifulSoup(response.text, "lxml")

In [37]:
#data will store the rank, state name, median salary, and average salary of each state as one flat list
data = []
for x in soup.find_all('table')[1].find_all('td'):
    data.append(x.text.strip())

In [38]:
data

['1',
 'District of Columbia',
 '$71,690',
 '$115,923',
 '2',
 'Massachusetts',
 '$48,680',
 '$76,437',
 '3',
 'Alaska',
 '$48,020',
 '$69,789',
 '4',
 'Connecticut',
 '$46,920',
 '$74,405',
 '5',
 'Washington',
 '$46,100',
 '$74,016',
 '6',
 'New York',
 '$44,990',
 '$80,640',
 '7',
 'Maryland',
 '$44,690',
 '$69,893',
 '8',
 'New Jersey',
 '$43,600',
 '$71,959',
 '9',
 'Minnesota',
 '$42,630',
 '$62,156',
 '10',
 'Hawaii',
 '$42,480',
 '$59,231',
 '11',
 'California',
 '$42,430',
 '$75,400',
 '12',
 'Colorado',
 '$42,310',
 '$62,375',
 '13',
 'Rhode Island',
 '$42,040',
 '$59,055',
 '14',
 'North Dakota',
 '$41,340',
 '$55,447',
 '15',
 'Virginia',
 '$40,820',
 '$64,517',
 '16',
 'Wyoming',
 '$40,240',
 '$55,018',
 '17',
 'Illinois',
 '$39,950',
 '$66,600',
 '18',
 'Delaware',
 '$39,900',
 '$62,817',
 '19',
 'New Hampshire',
 '$39,870',
 '$62,427',
 '20',
 'Vermont',
 '$39,720',
 '$50,826',
 '21',
 'Oregon',
 '$39,580',
 '$60,306',
 '22',
 'Pennsylvania',
 '$38,450',
 '$64,706',
 '23

In [39]:
#drop the rank column (every fourth element, starting from index 0)
del data[::4]
#the list now repeats state, median, mean; keep only the first 52 entries of each
#every third element starting from index 0 is the state name
states = data[::3][:52] 
#every third element starting from index 1 is the median salary
median = data[1::3][:52]
#every third element starting from index 2 is the average salary
mean = data[2::3][:52]
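
The stride slicing above depends on every table row contributing exactly four cells. An equivalent way to express the same step, sketched below, is to cut the flat list into rows of four first and then unpack each row, which makes the column layout explicit and lets the dollar strings be converted to integers at the same time. The helper names `rows_of` and `parse_salary` are hypothetical; the sample cells are the first two rows of the output shown earlier.

```python
def rows_of(flat, width):
    """Group a flat list of cell texts into complete rows of `width` cells."""
    return [flat[i:i + width] for i in range(0, len(flat) - width + 1, width)]

def parse_salary(text):
    """Turn '$71,690' into 71690."""
    return int(text.replace("$", "").replace(",", ""))

cells = ['1', 'District of Columbia', '$71,690', '$115,923',
         '2', 'Massachusetts', '$48,680', '$76,437']

# Each row is (rank, state, median, mean); the rank is discarded.
table = [(state, parse_salary(median), parse_salary(mean))
         for _rank, state, median, mean in rows_of(cells, 4)]
```

Any trailing cells that do not fill a whole row are ignored by `rows_of`, so a stray cell at the end of the table would not shift the columns.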

### Storing the Scraped Data Into A DataFrame

In [40]:
df_salaries = pd.DataFrame(list(zip(states, median,mean)), columns = ['States', 'Median_Salary', 'Mean_Salary'])
df_salaries

| | States | Median_Salary | Mean_Salary |
|---|---|---|---|
| 0 | District of Columbia | $71,690 | $115,923 |
| 1 | Massachusetts | $48,680 | $76,437 |
| 2 | Alaska | $48,020 | $69,789 |
| 3 | Connecticut | $46,920 | $74,405 |
| 4 | Washington | $46,100 | $74,016 |
| 5 | New York | $44,990 | $80,640 |
| 6 | Maryland | $44,690 | $69,893 |
| 7 | New Jersey | $43,600 | $71,959 |
| 8 | Minnesota | $42,630 | $62,156 |
| 9 | Hawaii | $42,480 | $59,231 |


In [41]:
#storing the df as csv
df_salaries.to_csv('salaries.csv',index = False)
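
To attach these salaries to the tickets, the two tables need a common key: the ticket pages give two-letter state codes (such as `WI`), while the salary table uses full state names. A minimal sketch of the lookup is shown below in plain Python. The mapping `STATE_NAMES` is deliberately partial and illustrative, the function name `add_salary` is hypothetical, and the New York ticket is a made-up example row.

```python
# Partial, illustrative mapping; a complete one would cover every state and DC.
STATE_NAMES = {"NY": "New York", "CA": "California", "WI": "Wisconsin"}

# Two rows copied from the salary table above, with the dollar strings converted to integers.
salaries = {
    "New York": {"median_salary": 44990, "mean_salary": 80640},
    "California": {"median_salary": 42430, "mean_salary": 75400},
}

def add_salary(ticket, salaries=salaries, state_names=STATE_NAMES):
    """Return a copy of `ticket` with salary fields added (None when the state is unknown)."""
    name = state_names.get(ticket["state"])
    salary = salaries.get(name, {})
    return {**ticket,
            "median_salary": salary.get("median_salary"),
            "mean_salary": salary.get("mean_salary")}

matched = add_salary({"state": "NY", "price": 120})    # hypothetical ticket row
unmatched = add_salary({"state": "WI", "price": 202})  # Wisconsin is not in the two sample rows
```

With DataFrames, the same idea could be expressed by mapping the state codes to names with `Series.map` and then joining on the name with `DataFrame.merge`.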