# **Auction Hunters**
## Exploratory Data Analysis
#### Project started: March, 2023

## What is *Auction Hunters*?
* *Auction Hunters* is a TV show that aired in the USA from 2010 to 2015. Hosts Allen and Ton travel and bid on abandoned storage units. Each episode shows where they go, which storage units they bid on, how much they pay for the units they win, and the most exciting items found in each unit.
* In each episode, they find buyers for the most profitable items in the storage units they bid on and won.
* The show reports how much each item was sold for, allowing the viewer to keep a tally of their profit for each unit.
* In addition to the financial data, we know where each episode takes place, so we can also track location.
* For more information, check out: https://www.imdb.com/title/tt1742340/

<img src="https://m.media-amazon.com/images/M/MV5BNTc4OTE0MzcxOF5BMl5BanBnXkFtZTcwMjQ0NTM0Ng@@._V1_FMjpg_UX558_.jpg" alt="Hosts Allen and Ton of Auction Hunters">

## Objectives
* A
* B
* C

## Methodology

1. Scrape episode data from Wikipedia
2. Organise and clean the data
3. Dive into the data to answer the objectives

# 1. Imports and Globals

In [1]:
import requests
from bs4 import BeautifulSoup
import pprint
import pandas as pd
import re
import datetime
import os

In [5]:
URL = "https://en.wikipedia.org/wiki/List_of_Auction_Hunters_episodes"
DATA_DIR = "data"
FILENAME = 'auction_hunters_webpage_content.html'
WEBPAGE = f'{DATA_DIR}/{FILENAME}'

# 2. Scrape the Data

## 2.1. Download the webpage to avoid multiple requests

In [6]:
# Send an HTTP request to the URL
response = requests.get(URL)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the content with Beautiful Soup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Make sure the data directory exists, then save the parsed page to a file
    os.makedirs(DATA_DIR, exist_ok=True)
    with open(WEBPAGE, 'w', encoding='utf-8') as website_data:
        website_data.write(str(soup))
else:
    print(f">> Status Code: {response.status_code}. Please check settings and try again.")

## 2.2. Load the local copy of the webpage for parsing

In [8]:
with open(WEBPAGE, 'r', encoding='utf-8') as auction_hunters_wiki_website:
    website_read = auction_hunters_wiki_website.read()
    soup = BeautifulSoup(website_read, 'html.parser')
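
The download and load steps can also be combined so that a request is only sent when no local copy exists yet. The sketch below is one possible way to do that and is not part of the original notebook: `load_or_fetch` and its `fetch` parameter are hypothetical names, and the fetcher is passed in as a callable so the caching logic can be tried without network access.

```python
from pathlib import Path
from typing import Callable


def load_or_fetch(path: str, fetch: Callable[[], str]) -> str:
    """Return the cached text at `path`, calling `fetch` only if the file is missing.

    Hypothetical helper: `fetch` is any zero-argument callable returning the page HTML.
    """
    cache_file = Path(path)
    if cache_file.exists():
        # A local copy already exists, so no request is needed.
        return cache_file.read_text(encoding="utf-8")

    # No local copy yet: fetch once, create the folder if needed, and cache the result.
    html = fetch()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(html, encoding="utf-8")
    return html
```

In this notebook the fetcher could be something like `lambda: requests.get(URL).text`, with `WEBPAGE` as the path. That wiring is an assumption and skips the status-code check used in section 2.1.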

## 2.3. Inspecting the parsed data to find our table of episode information

* The season-episode information is found in tables under the class *"wikitable plainrowheaders wikiepisodetable"*.
* We can see that 5 seasons are identified.

In [14]:
all_seasons_scrape = soup.find_all(class_=r"wikitable plainrowheaders wikiepisodetable")
number_of_seasons = len(all_seasons_scrape)
print(f">> There are {number_of_seasons} season(s) in the soup! If this is the number you're expecting, continue!")

>> There are 5 season(s) in the soup! If this is the number you're expecting, continue!
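
One detail worth knowing about the search above: when `class_` is given a string containing several class names, Beautiful Soup matches it against the exact value of the `class` attribute, so the order of the names matters. The sketch below illustrates this on a small made-up HTML snippet (not the real Wikipedia markup) and compares it with a CSS selector, which matches the classes in any order. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration: the same three classes in two different orders,
# plus one table that only has the "wikitable" class.
html = """
<table class="wikitable plainrowheaders wikiepisodetable"><tr><td>a</td></tr></table>
<table class="wikiepisodetable wikitable plainrowheaders"><tr><td>b</td></tr></table>
<table class="wikitable"><tr><td>c</td></tr></table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: the class order must be identical.
exact_matches = demo_soup.find_all(class_="wikitable plainrowheaders wikiepisodetable")

# CSS selector: any table carrying all three classes, regardless of order.
css_matches = demo_soup.select("table.wikitable.plainrowheaders.wikiepisodetable")

print(len(exact_matches), len(css_matches))
```

If the page markup ever lists the classes in a different order, the selector form would still find the season tables, whereas the exact string match would not.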


## 2.4. Saving the parsed data for each season individually

In [None]:
season_1_data = all_seasons_scrape[0]
season_2_data = all_seasons_scrape[1]
season_3_data = all_seasons_scrape[2]
season_4_data = all_seasons_scrape[3]
season_5_data = all_seasons_scrape[4]

all_seasons_data = [(season_1_data, 1), (season_2_data, 2), (season_3_data, 3), (season_4_data, 4), (season_5_data, 5)]

In [None]:
# Within each season's data, two HTML classes contain the data I want to scrape:
# class vevent (episode number, title, location, air date), and
# class expand-child (episode description with paid, sold and profit text).

# print(type(season_1_data))
# There is one vevent row per episode, so the number of vevent instances equals the number of episodes.
# Each expand-child holds one episode description, so the number of vevent instances MUST match the number of expand-child instances.
def check_season_X_count_matches(season_X_data):
    episode_row_count = len(season_X_data.find_all(class_='vevent'))
    description_count = len(season_X_data.find_all(class_='expand-child'))

    print(f"There are: >> {episode_row_count} << episode, date and location information cells in this season.")
    # For season 1, find_all returns 8 vevent matches, which agrees with season 1 having 8 episodes.
    print(f"There are: >> {description_count} << episode descriptions with cost, sold, profit text in this season.")

    if episode_row_count == description_count:
        print("AMAZING! The episode data and episode description counts match! We can proceed to extract the information.")
    else:
        print("NO! Something is wrong. There are unequal numbers of episode data rows and episode descriptions.")

check_season_X_count_matches(season_1_data)

In [None]:
def get_season_information(season_X_data):
    """Process the episode data and descriptions for a single season.

    The data is separated into three parts: general data (episode number, title, location, air date),
    the episode description, and the monetary data. Returns a list with one entry per episode, each
    holding those three parts. season_X_data is the season to extract information from, i.e. one of
    the season_1_data, season_2_data, ... variables created earlier."""

    # This first section separates out the episode number, title, location and air date from the vevent class.
    # It builds a list of episode data, where each element is itself a list of one episode's information.
    season_X_episode_data = []
    for episode_data in season_X_data.find_all(class_=r'vevent'):
        episode_data_list = []
        for data_tag in episode_data.select("td"):
            episode_data_list.append(data_tag.text)
        season_X_episode_data.append(episode_data_list)
    
    # This second section separates out the episode descriptions from the expand-child class.
    # It builds a list of episode descriptions and a list of monetary values, where each element is itself a list of one episode's information.
    season_X_episode_descriptions = []
    season_X_episode_monetary = []
    for episode_description in season_X_data.find_all(class_=r'expand-child'):
        episode_description_list = []
        for description in episode_description.select("p"):
            episode_description_list.append(description.text)
        season_X_episode_descriptions.append(episode_description_list)

        # Episode monetary covers the paid, sold and profit values from each episode. 
        # Remember, these are going to be lists of lists.
        episode_monetary = []
        for money_values in episode_description.find_all('dd'):
            episode_monetary.append(money_values.text)
        season_X_episode_monetary.append(episode_monetary)

    # Now, we're going to join the information for each episode together and save it in a list. 
    season_X_full_list_data_combined = []
    for i in range(0, len(season_X_episode_data)):
        season_X_full_list_data_combined.append([season_X_episode_data[i], season_X_episode_descriptions[i], season_X_episode_monetary[i]])
    return season_X_full_list_data_combined

In [None]:
season_X_list_info_combined = get_season_information(season_4_data)
pprint.pprint(season_X_list_info_combined)

In [None]:
def generate_unique_season_episode_number(season_num, episode_num):
    """I want to generate an episode ID, such as season 1 episode 1 = 101, season 1 episode 12 = 112. We will generate as a string data type."""
    s_num = str(season_num)
    if episode_num < 10:
        ep_num = "0" + str(episode_num)
    else:
        ep_num = str(episode_num)
    
    return s_num + ep_num  
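
The same zero-padding can be expressed with a format specifier. The sketch below is a hypothetical alternative (`make_episode_id` is not a name from the notebook) and assumes both arguments are integers. It raises an error for non-integer input rather than padding it.

```python
def make_episode_id(season_num: int, episode_num: int) -> str:
    """Hypothetical alternative: season number followed by a two-digit episode number."""
    # ":02d" pads the episode number with a leading zero when it is below 10.
    return f"{season_num}{episode_num:02d}"


print(make_episode_id(1, 1))   # 101
print(make_episode_id(1, 12))  # 112
```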

In [None]:
def convert_episode_data_to_dict(season_X_episode_prepared_data, season_num):
    """Convert the combined episode data for one season into a list of dictionaries, one per episode.

    season_X_episode_prepared_data is the list returned by get_season_information. season_num is a single
    integer, e.g. 1 for season 1, or 12 for season 12. If extracting a value raises an exception, the
    matching default from except_value_defaults is used instead, so it can be detected easily in post-processing."""

    # Season 4 episode 1 had no description, so its description list had length 0, and indexing it raises an IndexError.
    # The default values table below means that a failed lookup is assigned a placeholder instead of stopping the run.
    # The placeholders can then be found and replaced later in pandas, or identified in a CSV in Excel.
    
    except_value_defaults = {
            'season_num': "0",
            'episode_num': "99999",
            'unique_ep_ID': "88888",
            'episode_name': "NoNameEpisode", 
            'location': "NoLocationEpisode", 
            'air_date': "NoAirDateEpisode", 
            'description': "NoDescriptionEpisode", 
            'paid_$': "NoPaidEpisode", 
            'sold_$': "NoSoldEpisode", 
            'profit_$': "NoProfitEpisode"
        }
    
    season_X_episode_info_list_of_dicts = []
    for episode in season_X_episode_prepared_data:    
        # Each field is extracted separately so that one missing value does not discard the whole episode.
        # "except Exception" is used instead of a bare "except" so that KeyboardInterrupt still stops the run.
        try:
            ep_num = int(episode[0][0])
        except Exception:
            ep_num = except_value_defaults['episode_num']

        try:
            ep_name = episode[0][1].replace('"', "")
        except Exception:
            ep_name = except_value_defaults['episode_name']

        try:
            ep_loc = episode[0][2]
        except Exception:
            ep_loc = except_value_defaults['location']

        try:
            # The date cell contains extra text, so pull out the YYYY-MM-DD part before parsing.
            ep_date_initial = episode[0][3]
            ep_date_extract = re.findall(r'\d{4}-\d{2}-\d{2}', ep_date_initial)
            ep_date = datetime.datetime.strptime(ep_date_extract[0], '%Y-%m-%d').date()
        except Exception:
            ep_date = except_value_defaults['air_date']

        try:
            ep_desc = episode[1][0].replace("\n", "").replace("  ", " ")
        except Exception:
            ep_desc = except_value_defaults['description']

        try:
            ep_paid = episode[2][0].replace(",", "").replace("$", "")
        except Exception:
            ep_paid = except_value_defaults['paid_$']

        try:
            ep_sold = episode[2][1].replace(",", "").replace("$", "")
        except Exception:
            ep_sold = except_value_defaults['sold_$']

        try:
            ep_profit = episode[2][2].replace(",", "").replace("$", "")
        except Exception:
            ep_profit = except_value_defaults['profit_$']

        try:
            unique_ep_ID = generate_unique_season_episode_number(season_num, ep_num)
        except Exception:
            unique_ep_ID = except_value_defaults['unique_ep_ID']
            
            
        ep_dict = {
            'season_num': season_num,
            'episode_num': ep_num,
            'unique_ep_ID': unique_ep_ID,
            'episode_name': ep_name, 
            'location': ep_loc, 
            'air_date': ep_date, 
            'description': ep_desc, 
            'paid_$': ep_paid, 
            'sold_$': ep_sold, 
            'profit_$': ep_profit
        }

        season_X_episode_info_list_of_dicts.append(ep_dict)

    return season_X_episode_info_list_of_dicts
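
The function above stores the paid, sold and profit values as strings, alongside string placeholders such as `NoPaidEpisode`. A minimal sketch of converting them to numbers afterwards is shown below. `parse_dollars` is a hypothetical helper, and it assumes the amounts are whole dollars, which has not been checked against every episode.

```python
from typing import Optional


def parse_dollars(raw: str) -> Optional[int]:
    """Hypothetical helper: turn text like "$1,200" into 1200, or None if it is not a whole number."""
    # Same clean-up as in convert_episode_data_to_dict, so already-cleaned strings also work.
    cleaned = raw.replace(",", "").replace("$", "").strip()
    try:
        return int(cleaned)
    except ValueError:
        # Placeholders such as "NoPaidEpisode" (or any other non-numeric text) end up here.
        return None


print(parse_dollars("$1,200"))
print(parse_dollars("NoPaidEpisode"))
```

Returning `None` for placeholders keeps missing values distinguishable from a genuine amount of 0.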

In [None]:
# Here, we build the list of dictionaries from the combined season data and the season number.
# season_X_list_info_combined was created from season_4_data above, so the season number is 4.
season_X_episodes_as_list_of_dictionaries = convert_episode_data_to_dict(season_X_list_info_combined, 4)
pprint.pprint(season_X_episodes_as_list_of_dictionaries)

In [None]:
# Now, we will export the results to a pandas dataframe.
season_X_as_df = pd.DataFrame.from_records(season_X_episodes_as_list_of_dictionaries)
print(season_X_as_df)
season_X_as_df.to_csv('season_X_auction_hunters.csv', index=False)

In [None]:
# With exist_ok=True, an existing directory is accepted without raising an error.
os.makedirs(DATA_DIR, exist_ok=True)

for season_data, season_num in all_seasons_data:
    check_season_X_count_matches(season_data)
    season_X_list_info_combined = get_season_information(season_data)
    season_X_episodes_as_list_of_dictionaries = convert_episode_data_to_dict(season_X_list_info_combined, season_num)
    season_X_as_df = pd.DataFrame.from_records(season_X_episodes_as_list_of_dictionaries)

    # Save one CSV per season, e.g. data/season_1_auction_hunters.csv
    season_X_as_df.to_csv(f"{DATA_DIR}/season_{season_num}_auction_hunters.csv", index=False)