# Limitations

1. We were unable to collect the winner information for two competitions ("2TB Rocket 4.0 NVMe" and "PXL_20220202_193923763") as the data was not available on the website.


2. Our methodology could not determine whether a winning name refers to one person or to several, because the entry lists provide no unique identifier (only first name, last name and order ID, and the order ID is not present in every entry list). It is therefore possible that some of the names treated as individuals (e.g. Rob Davies) cover multiple people. Manual investigation suggests this is unlikely, however, as Jordan references his activity in Discord and the individual in question responded in kind on the Facebook live chat.

3. Although some individuals, such as Rob Davies, won a total of 18 times, we could not calculate the total probability of him winning every draw that he did, as we could only correlate 12 entry lists with the prizes listed. The same applies to several of the individuals whose win count was more than two standard deviations above the mean. The total probability of these individuals winning all of the prizes listed is therefore likely much lower than presented here.

# EDA of Gigahertz Giveaways

## Requirements

In [75]:
import os
import sys
import subprocess
import importlib.util

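# Map each pip package name to the module it installs, since the two can differ
# (e.g. 'beautifulsoup4' is imported as 'bs4' and 'tabula-py' as 'tabula')
packages = {'beautifulsoup4': 'bs4', 'pandas': 'pandas', 'numpy': 'numpy', 'seaborn': 'seaborn', 'matplotlib': 'matplotlib', 'plotly': 'plotly', 'requests': 'requests', 'tabula-py': 'tabula', 'fuzzywuzzy': 'fuzzywuzzy'}

def install(package):
  subprocess.check_call([sys.executable, "-m", "pip", "install", package])

def is_installed(module):
  return importlib.util.find_spec(module) is not None

for package, module in packages.items():
  if not is_installed(module):
    install(package)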

In [76]:
import re
import functools
import requests
import tabula
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
from urllib.parse import urlparse
from fuzzywuzzy import fuzz
from bs4 import BeautifulSoup

## URL patterns of interest
The following URLs will be our source(s) of data:

In [77]:
URL_PATTERNS = {
  'ENTRIES': ["https://gigahertzgiveaways.co.uk/entry-lists/", "https://gigahertzgiveaways.co.uk/entry-lists-archive/?swcfpc=1"],
  'WINNERS': ["https://gigahertzgiveaways.co.uk/previous-winners/"],
}

## Data collection

Our first goal is to collect the name of each winner from the 'previous-winners' page, and to gather information regarding their prize / the draw itself.

#### Utilities

In [78]:
def get_page(url):
  html = requests.get(url=url).text
  return BeautifulSoup(html, 'html.parser')


#### Collecting winner data
As the output of the second cell below shows, we were able to collect data for 826 draws but failed to retrieve data for 2 draws, as their details were not included on the page.

In [79]:
soup = get_page(URL_PATTERNS['WINNERS'][0])

In [80]:
winner_df = pd.DataFrame(columns=('Item', 'Draw date', 'Winner', 'Ticket number', 'Maximum Tickets', 'Image'))
elements = soup.find_all('div', class_='e-gallery-item elementor-gallery-item elementor-animated-content')
for i, element in enumerate(elements):
  try:
    item = element.find('div', class_='elementor-gallery-item__title').get_text().strip()
    desc = element.find('div', class_='elementor-gallery-item__description').get_text().strip()

    obj = [item]
    pattern = re.compile(r'(.*?)\:\s*\n*(.*?)(?:\n|$)')
    for m in re.finditer(pattern, desc):
      match = m.group(2)
      if m.group(1) == 'Draw date':
        match = re.search(r'([\d+\/]*)', match).group(0)
      obj.append(match)
    
    img = element.find('div', class_='e-gallery-image elementor-gallery-item__image').get('data-thumbnail')
    obj.append(img)
    
    winner_df.loc[i] = obj
  except Exception:
    item = element.get_text().strip()
    print(f"Failed to retrieve data for '{item}' @ {i}")

print(f"[Data]: Collected data for {len(winner_df)} competitions")

Failed to retrieve data for '2TB Rocket 4.0 NVMe' @ 180
Failed to retrieve data for 'PXL_20220202_193923763' @ 375
[Data]: Collected data for 826 competitions
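
The collection cell appends the parsed fields to a list in the order they appear, so it relies on every description listing its fields in the same order. As a minimal sketch of an alternative, the same "Key: value" pattern can fill a dict keyed by field name. The `parse_description` helper and the sample description text are hypothetical and only illustrate the idea; the real page text may differ.

```python
import re

# Same "Key: value" pattern as the collection cell above
FIELD_PATTERN = re.compile(r'(.*?)\:\s*\n*(.*?)(?:\n|$)')

def parse_description(desc):
  """Return a dict of the 'Key: value' fields found in a gallery description."""
  fields = {}
  for m in FIELD_PATTERN.finditer(desc):
    key, value = m.group(1).strip(), m.group(2).strip()
    if key == 'Draw date':
      # Keep only the leading date characters (digits and slashes)
      value = re.search(r'[\d/]*', value).group(0)
    fields[key] = value
  return fields

# Hypothetical description text, for illustration only
sample = "Draw date: 01/02/2022 8pm\nWinner: Jane Doe\nTicket number: 42\nMaximum Tickets: 100"
print(parse_description(sample))
```

Looking fields up by name means a description with a missing or reordered field would surface as a missing key rather than as a value landing in the wrong column.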


## Frequency of winners


In [92]:
!pip install --upgrade nbformat
px.histogram(winner_df, x="Winner").update_xaxes(categoryorder='total descending')

## Selecting outliers more than two standard deviations above the mean for analysis

Let's select the 'luckiest' people by selecting those whose win count is more than two standard deviations above the mean win count...

In [82]:
outliers = { }
for i, row in winner_df.iterrows():
  name = row['Winner']
  if name not in outliers:
    outliers[name] = 0
  outliers[name] += 1

elements = np.array(list(outliers.values()))
mean = np.mean(elements, axis=0)
sd = np.std(elements, axis=0)

outliers = {k: x for k, x in outliers.items() if (x > mean + 2 * sd)}
outlier_names = list(outliers.keys())

outlier_df = winner_df[winner_df.Winner.isin(outlier_names)]

pretty = [(k, x) for k, x in outliers.items()]
pretty = sorted(pretty, key=lambda x: x[1], reverse=True)

longest_name = len(max(pretty, key=lambda x: len(x[0]))[0])
for winner in pretty:
  diff = (longest_name - len(winner[0])) + 1
  diff = ' ' * diff
  print(winner[0] + diff + "won a total of " + str(winner[1]) + " times")

px.histogram(outlier_df, x="Winner")

Rob Davies        won a total of 18 times
Cameron Vince     won a total of 17 times
Daniel Ollett     won a total of 13 times
Vinh Cam          won a total of 9 times
Janet Quinney     won a total of 8 times
Sami Ali          won a total of 7 times
Chris Heggie      won a total of 7 times
Isaac Poulson     won a total of 7 times
Rajan Bhasin      won a total of 6 times
Sami ALI          won a total of 6 times
Anthony Thorne    won a total of 5 times
David Phillips    won a total of 5 times
Ryan Pennington   won a total of 5 times
Liam Powell       won a total of 5 times
Tam Shaw          won a total of 5 times
Gavin Chisholm    won a total of 5 times
Mariusz Martyszko won a total of 5 times
Alex Mcclune      won a total of 5 times
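
The output above lists "Sami Ali" (7 wins) and "Sami ALI" (6 wins) as separate winners, because the counting cell compares names as exact strings. The following is a minimal sketch of counting wins with case and whitespace normalised before applying the same mean + 2 SD threshold. The `normalise` and `find_outliers` helpers and the sample winner list are hypothetical, and whether two spellings belong to the same person remains subject to the limitation on unique identifiers noted earlier.

```python
from collections import Counter
from statistics import mean, pstdev

def normalise(name):
  """Collapse case and whitespace differences in a name."""
  return ' '.join(name.split()).casefold()

def find_outliers(winners, k=2):
  """Return {name: wins} for names whose win count exceeds mean + k * SD."""
  counts = Counter(normalise(w) for w in winners)
  values = list(counts.values())
  # pstdev is the population standard deviation, matching np.std's default (ddof=0)
  threshold = mean(values) + k * pstdev(values)
  return {name: n for name, n in counts.items() if n > threshold}

# Hypothetical winner list, for illustration only
winners = ['Sami Ali'] * 4 + ['Sami ALI'] * 3 + ['A B', 'C D', 'E F', 'G H', 'I J', 'K L', 'M N', 'O P', 'Q R']
print(find_outliers(winners))
```

In this sample the two spellings are merged into a single name with 7 wins, which clears the threshold of 1.6 + 2 × 1.8 = 5.2.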


## Collecting draw entry data
Every entry for each individual draw is listed at the aforementioned 'ENTRIES' URLs; we can collect that data to determine the probability of the win(s).

In [83]:
pages = []
for url in URL_PATTERNS['ENTRIES']:
  pages.append(get_page(url))

In [84]:
entries_df = pd.DataFrame(columns=('Item', 'Draw date', 'PDF'))

for soup in pages:
  elements = soup.find_all('div', class_='elementor-widget-wrap elementor-element-populated')
  for element in elements:
    try:
      item = element.find('h3', class_='elementor-heading-title elementor-size-default').get_text().strip()
      desc = element.find('div', class_='elementor-widget-text-editor').encode_contents().decode('utf8')
      btn = element.find('span', class_='elementor-button-text')
      pdf = btn.parent.parent.get('href')
      date = re.search(r'Draw date\:\s*([\d+\/]*)', desc, re.MULTILINE).group(1)
            
      entries_df.loc[len(entries_df)] = [item, date, pdf]
    except Exception:
      # Skip page elements that are not draw entries (no title, PDF button or draw date)
      pass

#### Linking outliers to entry data

In [85]:
linked_entries = pd.DataFrame(columns=('Item', 'Winner', 'PDF'))
for i, row in outlier_df.iterrows():
  items = entries_df.loc[(entries_df['Item'] == row['Item']) & (entries_df['Draw date'] == row['Draw date'])]
  if len(items) <= 0:
    def fuzzy_match(val):
      return fuzz.ratio(val.lower(), row['Item'].lower())
        
    items = entries_df.loc[entries_df['Item'].apply(fuzzy_match) >= 80]
    items = items.loc[items['Draw date'] == row['Draw date']]

  if len(items) > 0:
    items = items.iloc[0]
    linked_entries.loc[len(linked_entries)] = [row['Item'], row['Winner'], items['PDF']]

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
  print(linked_entries)

                         Item             Winner  \
0             Noir 3090 FE PC         Rob Davies   
1                    LG C1 #5         Rob Davies   
2         5800x3d x570 Bundle       Rajan Bhasin   
3             72hr RTX3090 FE         Rob Davies   
4       72hr Xbox Series 2 #4      Daniel Ollett   
5               IT Duster #10      Daniel Ollett   
6           Govee Hexa Panels      Isaac Poulson   
7            Govee Light Bars       Rajan Bhasin   
8              NZXT Flow Case     David Phillips   
9         3080Ti AORUS Master      Cameron Vince   
10        Sage Coffee Machine    Ryan Pennington   
11               IT Duster #8        Liam Powell   
12     PACMAN Light Bundle #2        Liam Powell   
13            iFixit Manta #6      Daniel Ollett   
14         Logitech G Pro TKL           Tam Shaw   
15          48hr RTX3070Ti FE         Rob Davies   
16         Razer Huntsman TKL     Gavin Chisholm   
17                   LG C1 #3         Rob Davies   
18        Ra
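
The linking cell falls back to fuzzy matching when no exact title match exists, and takes the first candidate scoring at least 80. As a rough sketch of that idea, the following uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio` (the scores may not be identical to fuzzywuzzy's) and picks the highest-scoring candidate on the same draw date rather than the first. The `similarity` and `best_match` helpers and the sample entries are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
  """Case-insensitive similarity score between two strings, from 0 to 100."""
  return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def best_match(item, draw_date, entries, threshold=80):
  """Return the best-scoring entry on the same draw date at or above threshold, or None."""
  candidates = [e for e in entries if e['Draw date'] == draw_date]
  scored = [(similarity(e['Item'], item), e) for e in candidates]
  scored = [s for s in scored if s[0] >= threshold]
  if not scored:
    return None
  return max(scored, key=lambda s: s[0])[1]

# Hypothetical entry-list rows, for illustration only
entries = [
  {'Item': '72hr RTX 3090 FE', 'Draw date': '01/01/2022', 'PDF': 'a.pdf'},
  {'Item': 'IT Duster #10', 'Draw date': '01/01/2022', 'PDF': 'b.pdf'},
]
print(best_match('72hr RTX3090 FE', '01/01/2022', entries))
```

Filtering on the draw date before scoring matters because numbered draws with near-identical titles (for example 'LG C1 #5' and 'LG C1 #3') differ by a single character and would likely score above the threshold against each other.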

#### Retrieving PDF entry data

In [86]:
chunk_size = 2000

# Create the data directory if needed, then download only the PDFs that are not already cached
os.makedirs('./data/', exist_ok=True)

for i, row in linked_entries.iterrows():
  url = row['PDF']
  filename = os.path.basename(urlparse(url).path)
  target = f'./data/{filename}'
  if os.path.exists(target):
    continue
  r = requests.get(url, stream=True)
  with open(target, 'wb') as f:
    for chunk in r.iter_content(chunk_size):
      f.write(chunk)

## Probability of each outlier winning their respective prize(s)
Let's first determine the number of tickets purchased vs. the number of tickets available.

In [87]:
def transform(item):
  return item.lower()

win_probability = pd.DataFrame(columns=('Winner', 'Item', 'Max Tickets', 'Purchased Tickets'))
for i, row in linked_entries.iterrows():
  filename = os.path.basename(urlparse(row['PDF']).path)
  tables = tabula.read_pdf(f'./data/{filename}', pages='all')
  firstname, lastname = row['Winner'].split(' ')

  entries = pd.DataFrame(columns=tables[0].columns)
  for tab in tables:
    for j, entry in tab.iterrows():
      entries.loc[len(entries)] = [entry.iloc[0], entry.iloc[1], entry.iloc[2], entry.iloc[3]]
  
  purchased_tickets = len(entries.loc[(entries['First Name'].apply(transform) == transform(firstname)) & (entries['Last Name'].apply(transform) == transform(lastname))])
  max_tickets = outlier_df.loc[(outlier_df['Winner'] == row['Winner']) & (outlier_df['Item'] == row['Item'])].iloc[0]['Maximum Tickets']

  """
   The following is done for the case of 'Cameron Vince', who entered 1 time for the "275 Ticket RTX3070" (PDF: 223_LianLi215_5800x_3070-072021-Final.pdf)
   but due to issues with the PDF parser, we were unable to detect his entry.

   Upon manual inspection of the PDF, I have found that he entered a single time.
  """
  purchased_tickets = purchased_tickets if purchased_tickets > 0 else purchased_tickets + 1


  win_probability.loc[len(win_probability)] = [row['Winner'], row['Item'], max_tickets, purchased_tickets]

Now let's calculate the probability of each individual winning their respective draw, i.e. the number of tickets purchased divided by the maximum number of tickets.

In [88]:
probabilities = []
for i, row in win_probability.iterrows():
  available = row['Max Tickets']
  purchased = row['Purchased Tickets']
  probabilities.append(float(purchased) / float(available))

win_probability['Probability'] = probabilities

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
  print(win_probability)

               Winner                       Item Max Tickets  \
0          Rob Davies            Noir 3090 FE PC         800   
1          Rob Davies                   LG C1 #5         100   
2        Rajan Bhasin        5800x3d x570 Bundle         175   
3          Rob Davies            72hr RTX3090 FE         110   
4       Daniel Ollett      72hr Xbox Series 2 #4          15   
5       Daniel Ollett              IT Duster #10          70   
6       Isaac Poulson          Govee Hexa Panels          40   
7        Rajan Bhasin           Govee Light Bars          10   
8      David Phillips             NZXT Flow Case          40   
9       Cameron Vince        3080Ti AORUS Master        1250   
10    Ryan Pennington        Sage Coffee Machine          75   
11        Liam Powell               IT Duster #8          70   
12        Liam Powell     PACMAN Light Bundle #2         135   
13      Daniel Ollett            iFixit Manta #6          55   
14           Tam Shaw         Logitech G
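
To make the per-draw calculation concrete, here is a minimal sketch of the two steps: counting a winner's rows in an entry list, then dividing by the maximum number of tickets. The `count_tickets` and `win_chance` helpers and the sample rows are hypothetical. The sketch also assumes a single winning ticket drawn uniformly from the full ticket range; if a draw were made only among sold tickets, the true chance would be higher.

```python
from fractions import Fraction

def count_tickets(entries, first_name, last_name):
  """Count entry rows whose first and last name match, ignoring case and surrounding whitespace."""
  target = (first_name.strip().lower(), last_name.strip().lower())
  return sum(1 for first, last in entries if (first.strip().lower(), last.strip().lower()) == target)

def win_chance(purchased, max_tickets):
  """Probability of winning one draw, assuming one winning ticket drawn uniformly from max_tickets."""
  return Fraction(purchased, max_tickets)

# Hypothetical entry list of (first name, last name) rows, for illustration only
entries = [('Jane', 'Doe'), ('jane', 'DOE '), ('John', 'Smith'), ('Jane', 'Doe')]
purchased = count_tickets(entries, 'Jane', 'Doe')
print(purchased, win_chance(purchased, 40), float(win_chance(purchased, 40)))
```

Using `Fraction` keeps the per-draw value exact, which is convenient when several of these values are multiplied together later.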

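Now let's calculate the probability of each individual more than two standard deviations above the mean winning all of their linked draws, by multiplying their per-draw probabilities together (treating the draws as independent events).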

P.S. For comparison, the probability of a single ticket winning the UK Lotto jackpot (matching 6 numbers from 59) is 1 in 45,057,474, i.e. approx. 2.22 × 10^-8 (0.0000022%).

In [103]:
successive_event_probability = { }
for i, row in win_probability.iterrows():
  if row['Winner'] not in successive_event_probability:
    successive_event_probability[row['Winner']] = [row['Probability'], 1]
  else:
    successive_event_probability[row['Winner']][0] *= row['Probability']
    successive_event_probability[row['Winner']][1] += 1

successive_event_probability = {k: ["{:.20f}% ({})".format(x[0] * 100, x[0]), x[1]] for k, x in successive_event_probability.items()}

for winner, group in successive_event_probability.items():
  print(f"{winner} had a probability of {group[0]} to win the {group[1]} times that he/she did")


Rob Davies had a probability of 0.00000000000027001656% (2.7001655960931986e-15) to win the 12 times that he/she did
Rajan Bhasin had a probability of 0.00190476190476190498% (1.904761904761905e-05) to win the 4 times that he/she did
Daniel Ollett had a probability of 0.00002915154430305945% (2.915154430305945e-07) to win the 6 times that he/she did
Isaac Poulson had a probability of 0.00005787037037037037% (5.787037037037037e-07) to win the 5 times that he/she did
David Phillips had a probability of 0.00133333333333333329% (1.3333333333333333e-05) to win the 3 times that he/she did
Cameron Vince had a probability of 0.00000073881673881674% (7.3881673881673874e-09) to win the 5 times that he/she did
Ryan Pennington had a probability of 6.66666666666666696273% (0.06666666666666667) to win the 1 times that he/she did
Liam Powell had a probability of 0.04232804232804233263% (0.0004232804232804233) to win the 2 times that he/she did
Tam Shaw had a probability of 0.00000005019607843137% (5.
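
The combined figures above are the product of each winner's per-draw probabilities. The following sketch shows that calculation with exact fractions, plus a log-space variant that avoids floating-point underflow when many small probabilities are multiplied. The helper names are hypothetical, and the purchased ticket counts are assumed for illustration; only the maximum ticket counts (70 and 135) come from Liam Powell's rows in the table above.

```python
import math
from fractions import Fraction

def combined_probability(draws):
  """Multiply per-draw win probabilities, treating the draws as independent events."""
  result = Fraction(1)
  for purchased, max_tickets in draws:
    result *= Fraction(purchased, max_tickets)
  return result

def log10_probability(draws):
  """Same product computed as a sum of base-10 logs."""
  return sum(math.log10(p / n) for p, n in draws)

# (purchased, max tickets) pairs; the purchased counts are assumed, not taken from the entry lists
draws = [(2, 70), (2, 135)]
p = combined_probability(draws)
print(p, float(p), log10_probability(draws))
```

The product is only the probability of winning exactly these draws under the independence assumption; it does not account for the other draws a person entered and lost.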