# **Auction Hunters**
## Exploratory Data Analysis
#### Joshua Higgins | March 2023

## What is *Auction Hunters*?
* *Auction Hunters* is a TV show that aired in the USA from 2010 to 2015. Hosts Allen and Ton travel and bid on abandoned storage units. Each episode shows where they go, which storage units they bid on, how much they pay for the units they win, and the most exciting items found in each one.
* In each episode, they find buyers for the most profitable items in the storage units they won.
* The show reports how much each item was sold for, allowing the viewer to keep a tally of their profit for each unit.
* In addition to the financial data, each episode's location is known, so results can also be tracked by location.
* For more information, check out: https://www.imdb.com/title/tt1742340/

<img src="https://m.media-amazon.com/images/M/MV5BNTc4OTE0MzcxOF5BMl5BanBnXkFtZTcwMjQ0NTM0Ng@@._V1_FMjpg_UX558_.jpg" alt="Hosts Allen and Ton of Auction Hunters">

## Objectives
* A
* B
* C

## Methodology

1. Import necessary libraries and set global variables.
2. Scrape episode data from Wikipedia.
3. Organise and clean the data.
4. Dive into the data to answer the objectives.

# 1. Imports and Globals

In [1]:
import requests
from bs4 import BeautifulSoup
import pprint
import pandas as pd
import re
import datetime
import os

In [2]:
URL = "https://en.wikipedia.org/wiki/List_of_Auction_Hunters_episodes"
DATA_DIR = "data"
WEB_DATA_FILENAME = 'auction_hunters_webpage_content.html'
WEBPAGE = f'{DATA_DIR}/{WEB_DATA_FILENAME}'

# 2. Scrape the Data

## 2.1. Download the webpage to avoid multiple requests

In [3]:
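# Make sure the data directory exists before writing the webpage to it
os.makedirs(DATA_DIR, exist_ok=True)

# Send an HTTP request to the URL
response = requests.get(URL)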

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the content with Beautiful Soup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Save the parsed data to a file
    with open(WEBPAGE, 'w', encoding='utf-8') as website_data:
        website_data.write(str(soup))
else:
    print(f">> Status Code: {response.status_code}. Please check settings and try again.")
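The cell above sends a request every time it runs. A minimal sketch of a download-once pattern is shown below, assuming the goal is to reuse the local copy whenever it exists. The helper name `load_or_fetch` and its `fetch` parameter are hypothetical and not part of the original notebook.

```python
import os
from typing import Callable


def load_or_fetch(path: str, fetch: Callable[[], str]) -> str:
    """Return the cached text at `path`, calling `fetch` only if the file is missing."""
    if os.path.isfile(path):
        # Cache hit: read the local copy and skip the download entirely
        with open(path, "r", encoding="utf-8") as cached:
            return cached.read()

    # Cache miss: fetch once, then store the result for later runs
    text = fetch()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as cached:
        cached.write(text)
    return text
```

In this notebook it could be called as `load_or_fetch(WEBPAGE, lambda: requests.get(URL).text)`, so the request is only sent when the local file does not exist yet.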

## 2.2. Load the local copy of the webpage for parsing

In [4]:
with open(WEBPAGE, 'r', encoding='utf-8') as auction_hunters_wiki_website:
    website_read = auction_hunters_wiki_website.read()
    soup = BeautifulSoup(website_read, 'html.parser')

## 2.3. Inspecting the parsed data to find our table of episode information

* Each season's episode information is stored in a table with the class `wikitable plainrowheaders wikiepisodetable`.
* Counting the results of `find_all` for this class shows that there are 5 seasons.

In [6]:
all_seasons_scrape = soup.find_all(class_=r"wikitable plainrowheaders wikiepisodetable")
number_of_seasons = len(all_seasons_scrape)
print(f">> There are {number_of_seasons} season(s) in the soup! If this is the number you're expecting, continue!")

>> There are 5 season(s) in the soup! If this is the number you're expecting, continue!


## 2.4. Saving the parsed data for each season individually

In [7]:
season_1_data = all_seasons_scrape[0]
season_2_data = all_seasons_scrape[1]
season_3_data = all_seasons_scrape[2]
season_4_data = all_seasons_scrape[3]
season_5_data = all_seasons_scrape[4]
all_seasons_data = [(season_1_data, 1), 
                    (season_2_data, 2), 
                    (season_3_data, 3), 
                    (season_4_data, 4), 
                    (season_5_data, 5)]
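The (season data, season number) pairs above are written out by hand. As a sketch, the same pairing can be generated for any number of seasons with `enumerate`. The function name `number_seasons` and the stand-in values are hypothetical; in the notebook the input would be `all_seasons_scrape`.

```python
def number_seasons(season_tables):
    """Pair each season's table with its 1-based season number."""
    return [(table, season_num) for season_num, table in enumerate(season_tables, start=1)]


# Stand-in values used only for illustration
example_tables = ["table_1", "table_2", "table_3"]
numbered = number_seasons(example_tables)
```

This keeps the list in step with the number of tables found in the soup, at the cost of no longer having a named variable per season.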

## 2.5. Confirm that each season's table contains a full list of episodes

* In each season's HTML data, two classes contain the data I want to extract:
   * class `vevent`, which contains the episode, date and location information, and
   * class `expand-child`, which contains the episode description with the cost, sold and profit text.
* Each episode has exactly one `vevent` element and one `expand-child` element.
   * This means we can verify that each episode has full information by counting these classes and checking that the counts match.

In [8]:
def validate_episode_count(season_X):
    """
    This function is designed to be used in a loop. It will take the season data and extract the class counts for vevent and expand_child.
    If this function is successful, you will confirm all the table data is available for each season.
    This function doesn't confirm the total episode counts, so we trust that the Wikipedia page is complete.
    """
    season_X_data = season_X[0]
    season_X_num = season_X[1]
    
    vevent_count = len(season_X_data.find_all(class_=r'vevent'))
    expand_child_count = len(season_X_data.find_all(class_=r'expand-child'))
    
    print(f">> Season {season_X_num}: Counted {vevent_count} vevent classes and {expand_child_count} expand-child classes.")
    
    if vevent_count == expand_child_count:
        print(f">>>> Class count matches! Season {season_X_num} has {vevent_count} episodes!\n")
    else:
        print(f">>>> Class count mismatch in Season {season_X_num}. Please check your scrape data / class names.\n")


for season in all_seasons_data:
    validate_episode_count(season)

>> Season 1: Counted 8 vevent classes and 8 expand-child classes.
>>>> Class count matches! Season 1 has 8 episodes!

>> Season 2: Counted 27 vevent classes and 27 expand-child classes.
>>>> Class count matches! Season 2 has 27 episodes!

>> Season 3: Counted 26 vevent classes and 26 expand-child classes.
>>>> Class count matches! Season 3 has 26 episodes!

>> Season 4: Counted 26 vevent classes and 26 expand-child classes.
>>>> Class count matches! Season 4 has 26 episodes!

>> Season 5: Counted 20 vevent classes and 20 expand-child classes.
>>>> Class count matches! Season 5 has 20 episodes!



# 3. Extracting Data

## 3.1. Initial Data Extraction

In [9]:
def get_seasons_episodes_data(season_X_data):
    """
    This function extracts the critical data from the html. Namely:
    
    Data:         [episode number, episode title, location, date]
    Descriptions: [episode description]
    Monetary:     [paid, value, profit]
    
    episode_data = [[Data], [Descriptions], [Monetary]]
                 = [[episode number, episode title, location, date], [episode description], [paid, value, profit]]
    
    
    These 3 lists are captured in a single list for each episode. 
    Thus, the return value for each season is one list per episode containing 3 lists, which will need to go through some data cleaning later.
    """
    
    episode_data = []
    episode_descriptions = []
    episode_monetary = []

    # Extract the [episode number, episode title, location, date] data:
    episode_data_tags = season_X_data.find_all(class_=r'vevent')
    for episode_data_tag in episode_data_tags:
        episode_data_list = [data_tag.text for data_tag in episode_data_tag.select("td")]
        episode_data.append(episode_data_list)
        
    # Extract the [episode description] data:
    episode_description_tags = season_X_data.find_all(class_=r'expand-child')
    for episode_description_tag in episode_description_tags:
        episode_description_list = [description.text for description in episode_description_tag.select("p")]
        episode_descriptions.append(episode_description_list)
        
        # In the same class, look for the tag 'dd' containing the monetary [paid, value, profit] data: 
        monetary = [money_values.text for money_values in episode_description_tag.find_all('dd')]
        episode_monetary.append(monetary)

    # Make each episode 1 list of 3 lists (data, description, monetary) and return the entire season's data:
    season_X_full_data_combined = [
        [data, desc, monetary]
        for data, desc, monetary in zip(episode_data, episode_descriptions, episode_monetary)
    ]

    return season_X_full_data_combined
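One caveat with combining the three lists through `zip` is that it stops at the shortest input, so a missing description or monetary entry would silently drop episodes. The sketch below shows one way to fail loudly instead, assuming a length mismatch should be treated as an error. The function name `combine_episode_parts` is hypothetical.

```python
def combine_episode_parts(data, descriptions, monetary):
    """Combine the three per-episode lists, raising if their lengths differ."""
    lengths = {len(data), len(descriptions), len(monetary)}
    if len(lengths) != 1:
        # A plain zip would truncate to the shortest list without any warning
        raise ValueError(
            f"Length mismatch: {len(data)} data rows, "
            f"{len(descriptions)} descriptions, {len(monetary)} monetary rows"
        )
    return [[d, desc, money] for d, desc, money in zip(data, descriptions, monetary)]
```

On Python 3.10 or later, `zip(..., strict=True)` gives the same protection without the explicit length check.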

In [10]:
season_1_data_extracted = get_seasons_episodes_data(season_1_data)
season_2_data_extracted = get_seasons_episodes_data(season_2_data)
season_3_data_extracted = get_seasons_episodes_data(season_3_data)
season_4_data_extracted = get_seasons_episodes_data(season_4_data)
season_5_data_extracted = get_seasons_episodes_data(season_5_data)
all_seasons_data_extracted = [(season_1_data_extracted, 1), 
                              (season_2_data_extracted, 2), 
                              (season_3_data_extracted, 3), 
                              (season_4_data_extracted, 4), 
                              (season_5_data_extracted, 5)]

### Validate data extraction:

In [11]:
for season_data_extracted in all_seasons_data_extracted:
    print(f"Season {season_data_extracted[1]}:")
    pprint.pprint(season_data_extracted[0])
    print("")

Season 1:
[[['1',
   '"The Wild West"',
   'San Bernardino, California',
   'November\xa09,\xa02010\xa0(2010-11-09)'],
  ['Ton and Allen head to auctions in the desert town of San Bernardino. Ton '
   'scores a deadly 19th century British Pepper-box handgun and tests it at '
   'the gun range. Allen wins a unit for $1 and finds a fully functional '
   'pre-WWI train set.\n'],
  ['$376', '$1,190', '$814']],
 [['2',
   '"The Big Score"',
   'Downtown Los Angeles, California',
   'November\xa09,\xa02010\xa0(2010-11-09)'],
  ['Ton and Allen bid on units in downtown LA and uncover a ‘70s German H&K P7 '
   'pistol and a rare copper cash register.\n'],
  ['$2,025', '$5,850', '$3,825']],
 [['3',
   '"Ton\'s Got a Gun"',
   'Mission Hills, California',
   'November\xa016,\xa02010\xa0(2010-11-16)'],
  ['Allen encounters some old rivals in the Valley.  Ton and Allen uncover a '
   'Depression-era “Art Case” slot machine, a custom minibike and a Wild West '
   '1880s Colt Peacemaker.\n'],
  ['$1,

## 3.2. Cleaning Data and Converting Season Episode List Data to Dictionary

### 3.2.1. Function: Set Data Field Value - Cleaned or Default

In [12]:
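def get_field_value(data, default_value):
    """
    This code replaces the repetitive try and except clauses from the original code.
    It returns the cleaned data value, or the default value if the cleaned value is missing (None or an empty string).
    This function is run inside a larger function where the default_value is passed in.

    Note: `data` is evaluated before this function is called, so an error raised while cleaning
    (for example an IndexError) surfaces at the call site and cannot be caught here.
    """

    if data is None or data == "":
        return default_value
    return data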
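Because Python evaluates arguments before calling a function, an expression such as `episode[2][0].replace("$", "")` raises at the call site if the field is missing, before `get_field_value` ever runs. A minimal sketch of one alternative is to pass the extraction as a callable so it runs inside the `try` block. The name `safe_extract` and the sample `episode` list are hypothetical.

```python
from typing import Any, Callable


def safe_extract(getter: Callable[[], Any], default_value: Any) -> Any:
    """Run `getter` and return its result, or `default_value` if extraction fails."""
    try:
        return getter()
    except (IndexError, KeyError, AttributeError, TypeError):
        # The field is missing or has an unexpected type
        return default_value


# Illustrative episode with a title but no monetary data
episode = [["1", '"The Wild West"'], [], []]
name = safe_extract(lambda: episode[0][1].replace('"', ""), "No Episode Name")
paid = safe_extract(lambda: episode[2][0].replace("$", ""), 0)
```

Here `name` is the cleaned title, while `paid` falls back to `0` because `episode[2]` is empty.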

### 3.2.2. Function: Generate a Unique Season-Episode Number (as a string)

In [13]:
def generate_unique_season_episode_number(season_num, episode_num):
    """
    This code will generate an episode ID, such as season 1 episode 1 = 101, season 1 episode 12 = 112. 
    It will generate this as a string data type.
    """
    
    s_num = str(season_num)
    ep_num = str(episode_num).zfill(2)
    return s_num + ep_num

### 3.2.3. Function: Generate Episode Data Dictionary with Clean or Default Data

In [14]:
def convert_season_episode_data_to_dict(season_X_data, season_num):
    """
    This function is designed to be run in a loop and depends on the 2 other functions above.
    This function takes season data and season number, then processes each field value, cleans it,
    then stores it in a dictionary. 
    Each episode's data ends up in a dictionary.
    In this way, the entire Auction Hunters show data can be compiled into a large data frame.
    """
    
    field_defaults = {
        'episode_num': 99,
        'episode_name': "No Episode Name",
        'location': "No Location",
        'air_date': "No Air Date",
        'description': "No Description",
        'paid_$': 0,
        'sold_$': 0,
        'profit_$': 0
    }

    season_X_episode_info_list_of_dicts = []
    for episode in season_X_data:
        ep_num = get_field_value(episode[0][0], field_defaults['episode_num'])
        ep_name = get_field_value(episode[0][1].replace('"', ""), field_defaults['episode_name'])
        ep_loc = get_field_value(episode[0][2], field_defaults['location'])
        
        # Try to get the date information
        try:
            ep_date_initial = episode[0][3]
            ep_date_extract = re.findall(r'\d{4}-\d{2}-\d{2}', ep_date_initial)
            ep_date = datetime.datetime.strptime(ep_date_extract[0], '%Y-%m-%d').date()
        except (IndexError, ValueError, TypeError):
            # Missing field, no ISO date in the text, or an invalid date
            ep_date = field_defaults['air_date']
        
        # Check if the description list is empty
        if episode[1]:
            ep_desc = get_field_value(episode[1][0].replace("\n", "").replace("  ", " "), field_defaults['description'])
        else:
            ep_desc = field_defaults['description']
            
        ep_paid = get_field_value(episode[2][0].replace(",", "").replace("$", ""), field_defaults['paid_$'])
        ep_sold = get_field_value(episode[2][1].replace(",", "").replace("$", ""), field_defaults['sold_$'])
        ep_profit = get_field_value(episode[2][2].replace(",", "").replace("$", ""), field_defaults['profit_$'])
        
        unique_ep_ID = generate_unique_season_episode_number(season_num, ep_num)
            
        ep_dict = {
            'season_num': season_num,
            'episode_num': ep_num,
            'unique_ep_ID': unique_ep_ID,
            'episode_name': ep_name, 
            'location': ep_loc, 
            'air_date': ep_date, 
            'description': ep_desc, 
            'paid_$': ep_paid, 
            'sold_$': ep_sold, 
            'profit_$': ep_profit
        }

        season_X_episode_info_list_of_dicts.append(ep_dict)

    return season_X_episode_info_list_of_dicts
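The date and money handling above can also be expressed as two small helpers. This is a sketch under stated assumptions: the raw air date contains an ISO date in parentheses, as in the scraped output shown earlier, and monetary text looks like `$1,190`. The names `parse_air_date` and `parse_dollars` are hypothetical, and returning `None` for unparseable input is a design choice that differs from the default values used in the notebook.

```python
import datetime
import re
from typing import Optional

# Matches the machine-readable date in parentheses, e.g. "(2010-11-09)"
ISO_DATE = re.compile(r"\((\d{4}-\d{2}-\d{2})\)")


def parse_air_date(raw: str) -> Optional[datetime.date]:
    """Return the ISO date embedded in the raw air-date text, or None."""
    match = ISO_DATE.search(raw)
    if match is None:
        return None
    try:
        return datetime.date.fromisoformat(match.group(1))
    except ValueError:
        # Digits matched the pattern but do not form a real calendar date
        return None


def parse_dollars(raw: str) -> Optional[int]:
    """Convert text such as '$1,190' to the integer 1190, or None if not numeric."""
    # Assumption: losses may be written with a Unicode minus sign, which int() rejects
    cleaned = raw.replace("\u2212", "-").replace("$", "").replace(",", "").strip()
    try:
        return int(cleaned)
    except ValueError:
        return None
```

Converting the monetary fields to integers at this stage would allow sums and averages later without a separate type conversion.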

### 3.2.4. Generate each season's cleaned episode information as a list of dictionaries

In [15]:
season_1_data_dict = convert_season_episode_data_to_dict(all_seasons_data_extracted[0][0], all_seasons_data_extracted[0][1])
season_2_data_dict = convert_season_episode_data_to_dict(all_seasons_data_extracted[1][0], all_seasons_data_extracted[1][1])
season_3_data_dict = convert_season_episode_data_to_dict(all_seasons_data_extracted[2][0], all_seasons_data_extracted[2][1])
season_4_data_dict = convert_season_episode_data_to_dict(all_seasons_data_extracted[3][0], all_seasons_data_extracted[3][1])
season_5_data_dict = convert_season_episode_data_to_dict(all_seasons_data_extracted[4][0], all_seasons_data_extracted[4][1])
all_seasons_data_dict = [(season_1_data_dict, 1), 
                         (season_2_data_dict, 2), 
                         (season_3_data_dict, 3), 
                         (season_4_data_dict, 4), 
                         (season_5_data_dict, 5)]

In [16]:
for season in all_seasons_data_dict:
    pprint.pprint(season[0])

[{'air_date': datetime.date(2010, 11, 9),
  'description': 'Ton and Allen head to auctions in the desert town of San '
                 'Bernardino. Ton scores a deadly 19th century British '
                 'Pepper-box handgun and tests it at the gun range. Allen wins '
                 'a unit for $1 and finds a fully functional pre-WWI train '
                 'set.',
  'episode_name': 'The Wild West',
  'episode_num': '1',
  'location': 'San Bernardino, California',
  'paid_$': '376',
  'profit_$': '814',
  'season_num': 1,
  'sold_$': '1190',
  'unique_ep_ID': '101'},
 {'air_date': datetime.date(2010, 11, 9),
  'description': 'Ton and Allen bid on units in downtown LA and uncover a ‘70s '
                 'German H&K P7 pistol and a rare copper cash register.',
  'episode_name': 'The Big Score',
  'episode_num': '2',
  'location': 'Downtown Los Angeles, California',
  'paid_$': '2025',
  'profit_$': '3825',
  'season_num': 1,
  'sold_$': '5850',
  'unique_ep_ID': '102'},
 {'air_d

## 3.3. Exporting Dictionaries to CSV by Season

In [17]:
# Now, we will export each season's results to a CSV file via a pandas dataframe.
def all_season_dict_to_csv(all_seasons_dicts):
    """
    This function takes the all_seasons_data_dict list of (cleaned season data dictionaries, season number) tuples
      and writes each season to its own csv file. Existing files are skipped, never overwritten.
    
    """
    for season_dict in all_seasons_dicts:
        season_data = season_dict[0]
        season_df = pd.DataFrame.from_records(season_data)

        season_num = season_dict[1]
        filename = f"{DATA_DIR}/auction_hunters_season_{season_num}_data.csv"

        # Check if the csv file data already exists:
        if os.path.isfile(filename):
            print(f">> '{filename}' already exists. Skipping save operation. Handle existing data, then try again.")
        else:
            season_df.to_csv(filename, index=False)


In [18]:
all_season_dict_to_csv(all_seasons_data_dict)
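The function above writes one file per season. A single file covering the entire series could be produced along the following lines. This sketch uses the standard library's `csv.DictWriter` rather than pandas to stay dependency-free, and both the function name `write_combined_csv` and any output filename passed to it are hypothetical.

```python
import csv
import os


def write_combined_csv(all_seasons_dicts, filename):
    """Write every season's episode dictionaries into one CSV file and return the row count."""
    # Flatten [(season_rows, season_num), ...] into a single list of episode rows
    rows = [row for season_rows, _season_num in all_seasons_dicts for row in season_rows]
    if not rows:
        return 0

    os.makedirs(os.path.dirname(filename) or ".", exist_ok=True)
    with open(filename, "w", newline="", encoding="utf-8") as csv_file:
        # Assumes every episode dictionary has the same keys as the first one
        writer = csv.DictWriter(csv_file, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Unlike the per-season export, this sketch overwrites an existing file, so an `os.path.isfile` check would be needed to match the skip-if-exists behaviour.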

## 3.4. Cleaning the data