Web Scraping experiments

In [None]:
%pip install requests beautifulsoup4 pandas pyarrow

Let's explore a fundamental example of web scraping by extracting lottery numbers from a lottery website and then saving this data in a Parquet file. We can use this dataset for various purposes, such as training a machine learning model to predict winning numbers.

## Step 1: Web Scraping

To begin, we'll utilize web scraping techniques to retrieve the latest lottery numbers from a specific lottery website. Web scraping involves extracting information from web pages programmatically. In this case, we are interested in the lottery numbers, which are typically displayed on the website.

## Step 2: Data Extraction

Once we've collected the lottery numbers from the website, we need to extract and structure this data. This involves parsing the HTML content of the webpage to locate the relevant information and then storing it in a structured format, such as a Python dictionary or a pandas DataFrame.
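
As a minimal sketch of this parsing step, the snippet below pulls draw rows out of an HTML fragment with BeautifulSoup and loads them into a pandas DataFrame. The markup, the CSS class names (`draw`, `date`, `ball`, `bonus`), and the helper `parse_draws` are hypothetical. The real page layout and the implementation of `scrape_lottery_data` in `src.utils` may well differ.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup; a real lottery page will use different tags and classes.
SAMPLE_HTML = """
<table>
  <tr class="draw">
    <td class="date">Sat, Dec 28, 2019</td>
    <td class="ball">20</td><td class="ball">23</td><td class="ball">39</td>
    <td class="ball">59</td><td class="ball">60</td>
    <td class="bonus">18</td>
  </tr>
  <tr class="draw">
    <td class="date">Wed, Dec 25, 2019</td>
    <td class="ball">02</td><td class="ball">04</td><td class="ball">16</td>
    <td class="ball">30</td><td class="ball">46</td>
    <td class="bonus">20</td>
  </tr>
</table>
"""


def parse_draws(html: str, date_format: str = "%a, %b %d, %Y") -> pd.DataFrame:
    """Turn each <tr class="draw"> row into one DataFrame row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.select("tr.draw"):
        # Collect the text of the main balls in the order they appear.
        balls = [td.get_text(strip=True) for td in row.select("td.ball")]
        record = {"Date": row.select_one("td.date").get_text(strip=True)}
        record.update({f"Ball_{i}": ball for i, ball in enumerate(balls, start=1)})
        record["Ball_Bonus"] = row.select_one("td.bonus").get_text(strip=True)
        records.append(record)
    frame = pd.DataFrame(records)
    # Convert the scraped date strings to real timestamps.
    frame["Date"] = pd.to_datetime(frame["Date"], format=date_format)
    return frame


draws = parse_draws(SAMPLE_HTML)
print(draws)
```

The ball values are kept as strings in this sketch, so a later type conversion is still needed before any numeric analysis.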


In [7]:
# imports
import pandas as pd
import numpy
from src.game import Game
from src.utils import scrape_lottery_data
from src.utils import read_all_parquet_files

In [8]:
# declare constants

# extracted date format
DATE_FORMAT = "%a, %b %d, %Y"

# years to process
YEARS_TO_PROCESS = [2023, 2022, 2021, 2020, 2019]

# local dataset directory
DATASET_LOCAL_COPY = './static/data/'
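
To illustrate what `DATE_FORMAT` matches, the following sketch parses one date string in that layout with the standard library. The sample string is an assumed example of how the site prints dates. Note that `%a` and `%b` are locale-dependent, so this assumes an English locale.

```python
from datetime import datetime

DATE_FORMAT = "%a, %b %d, %Y"

# Assumed example of a date as it might appear on the page.
raw_date = "Sat, Dec 28, 2019"

# Abbreviated weekday, abbreviated month, day of month, four-digit year.
parsed = datetime.strptime(raw_date, DATE_FORMAT)
print(parsed.date().isoformat())
```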

In [9]:
# for all years we want to process
for year in YEARS_TO_PROCESS:
    # get the data for this year
    scrape_lottery_data(year, DATASET_LOCAL_COPY, DATE_FORMAT)

In [10]:
# get all the files 
all_games = read_all_parquet_files(DATASET_LOCAL_COPY)

if all_games is not None:
    # Let us examine what we have
    print(all_games.head(24))

         Date Ball_1 Ball_2 Ball_3 Ball_4 Ball_5 Ball_Bonus
0  2019-12-28     20     23     39     59     60         18
1  2019-12-25     02     04     16     30     46         20
2  2019-12-21     19     31     35     50     67         14
3  2019-12-18     14     18     26     39     68         09
4  2019-12-14     03     06     12     32     64         19
5  2019-12-11     24     29     42     44     63         10
6  2019-12-07     18     42     53     62     66         25
7  2019-12-04     08     27     44     51     61         14
8  2019-11-30     15     35     42     63     68         18
9  2019-11-27     15     26     37     53     55         21
10 2019-11-23     28     35     38     61     66         23
11 2019-11-20     07     15     39     40     57         12
12 2019-11-16     14     22     26     55     63         26
13 2019-11-13     23     26     27     28     66         11
14 2019-11-09     14     17     35     38     60         25
15 2019-11-06     15     28     46     6

In [11]:
# export a CSV copy for feature engineering or for exploring the data in Excel and similar tools
all_games.to_csv(f'{DATASET_LOCAL_COPY}all_Games.csv')

In [13]:
# let us "crack" the dataframe and see what's there
all_games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599 entries, 0 to 598
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Date        599 non-null    datetime64[ns]
 1   Ball_1      599 non-null    object        
 2   Ball_2      599 non-null    object        
 3   Ball_3      599 non-null    object        
 4   Ball_4      599 non-null    object        
 5   Ball_5      599 non-null    object        
 6   Ball_Bonus  599 non-null    object        
dtypes: datetime64[ns](1), object(6)
memory usage: 32.9+ KB


The ball columns are stored with the object dtype (strings), which is not suitable for numeric analysis.
The next step is to convert them to integers.

In [15]:
# change the data type of every ball column from object to int32
ball_columns = ['Ball_1', 'Ball_2', 'Ball_3', 'Ball_4', 'Ball_5', 'Ball_Bonus']
all_games = all_games.astype({column: 'int32' for column in ball_columns})

In [16]:
# so let's see if it worked
all_games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599 entries, 0 to 598
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Date        599 non-null    datetime64[ns]
 1   Ball_1      599 non-null    int32         
 2   Ball_2      599 non-null    int32         
 3   Ball_3      599 non-null    int32         
 4   Ball_4      599 non-null    int32         
 5   Ball_5      599 non-null    int32         
 6   Ball_Bonus  599 non-null    int32         
dtypes: datetime64[ns](1), int32(6)
memory usage: 18.8 KB


In [17]:
# display the dataframe's summary statistics, including all columns
all_games.describe(include='all')

                                Date     Ball_1     Ball_2     Ball_3     Ball_4     Ball_5  Ball_Bonus
count                            599      599.0      599.0      599.0      599.0      599.0       599.0
mean   2021-08-03 00:28:50.884807936   12.15025  23.277129  35.277129  47.193656  58.564274   13.310518
min              2019-01-02 00:00:00        1.0        2.0        3.0        7.0       22.0         1.0
25%              2020-06-08 00:00:00        4.0       15.0       26.0       39.0       54.0         6.0
50%              2021-10-16 00:00:00       10.0       22.0       36.0       48.0       61.0        13.0
75%              2022-09-29 12:00:00       18.0       31.0       44.0       57.0       66.0        20.0
max              2023-09-13 00:00:00       52.0       58.0       64.0       68.0       69.0        26.0
std                              NaN   9.807723  11.853058  12.430543  11.949555   9.377023    7.712675


The minimum and maximum of the Date column (2019-01-02 to 2023-09-13) confirm that the downloaded data spans 2019 through 2023.

## Step 3: Data Cleaning and Preprocessing

Data obtained through web scraping may require cleaning and preprocessing. This can involve handling missing values, removing duplicates, and converting data types to ensure it's ready for analysis and model training.
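
The cleaning steps named above can be sketched on a tiny, made-up frame as follows. The sample values and the choice to drop incomplete rows (rather than fill them) are assumptions for illustration, not a description of what the scraped data actually needs.

```python
import pandas as pd

# Tiny made-up frame with one duplicate row and one missing value.
raw = pd.DataFrame(
    {
        "Date": ["2019-12-28", "2019-12-28", "2019-12-25", "2019-12-21"],
        "Ball_1": ["20", "20", "02", None],
        "Ball_Bonus": ["18", "18", "20", "14"],
    }
)

clean = (
    raw.drop_duplicates()  # remove exact duplicate draws
    .dropna()  # drop rows with a missing ball
    .assign(Date=lambda d: pd.to_datetime(d["Date"]))  # strings to timestamps
    .astype({"Ball_1": "int32", "Ball_Bonus": "int32"})  # strings to integers
    .reset_index(drop=True)
)

print(clean)
print(clean.dtypes)
```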

In [None]:
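# use the draw date as the index
all_games.reset_index(drop=True, inplace=True)
all_games.set_index('Date', inplace=True)

# keep an independent copy of the full dataset
original = all_games.copy()

# let us keep the bonus ball for now; uncomment to drop it
#all_games = all_games.drop('Ball_Bonus', axis=1)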

## Step 4: Data Storage

Now that we have our clean and structured dataset, we'll save it in a Parquet file format. Parquet is a columnar storage format that's highly efficient for analytics and machine learning. It supports compression and works well with various data processing tools and frameworks.

In [None]:
# store prepared dataset
all_games.to_parquet(f'{DATASET_LOCAL_COPY}_all_games.parquet')

## Step 5: Exploratory Data Analysis (EDA)

Before diving into model training, it's a good practice to perform exploratory data analysis. This step involves visualizing the data, calculating summary statistics, and gaining insights into the distribution and patterns of lottery numbers.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(15, 12))
plt.subplots_adjust(hspace=0.5)
fig.suptitle("Winning numbers preserving order", fontsize=18, y=0.95)

# loop through the ball columns and the axes
ball_columns = ['Ball_1', 'Ball_2', 'Ball_3', 'Ball_4', 'Ball_5', 'Ball_Bonus']
for column, ax in zip(ball_columns, axs.ravel()):
    # plot this column's histogram on its own axes, with as many bins as distinct values
    all_games[column].hist(ax=ax, bins=all_games[column].nunique(), color='skyblue', ec='blue')
    ax.set_title(column)
    ax.set_xlabel("Value of " + column)
    ax.set_ylabel("Count")
    ax.set_xlim([0, 69])
plt.show()
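
Another summary that may be useful at this stage is how often each number appears across the five main balls, regardless of position. The sketch below computes that on three draws taken from the sample output earlier in this notebook; the name `main_balls` is introduced here only for illustration.

```python
import pandas as pd

# Three draws with the same column layout as all_games.
sample = pd.DataFrame(
    {
        "Ball_1": [24, 15, 15],
        "Ball_2": [29, 35, 26],
        "Ball_3": [42, 42, 37],
        "Ball_4": [44, 63, 53],
        "Ball_5": [63, 68, 55],
    }
)

main_balls = ["Ball_1", "Ball_2", "Ball_3", "Ball_4", "Ball_5"]

# Flatten the five ball columns into one Series and count each number.
frequency = sample[main_balls].stack().value_counts()
print(frequency.head())
```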

## Step 6: Model Development

With the dataset prepared and understood, we can move on to developing a machine learning model. In this case, our goal is to predict future winning numbers based on historical data. We can use various algorithms, such as regression, time series forecasting, or neural networks, depending on the nature of the lottery game and the dataset.

Note: NEXT POST

## Step 7: Model Evaluation and Fine-Tuning

After training the model, we need to evaluate its performance using appropriate metrics. Depending on the results, we may need to fine-tune the model, adjust hyperparameters, or try different algorithms to improve its accuracy and reliability.

Note: NEXT POST



## Step 8: Deployment and Prediction

Once we have a well-performing model, we can deploy it to make predictions on upcoming lottery draws. The model will take historical data as input and provide predictions for future winning numbers.

Note: NEXT POST

## Step 9: Continuous Updates

Lottery numbers change with each draw, so it's essential to continuously update the dataset and retrain the model to ensure its accuracy over time.
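
The dataset-update half of this step could look roughly like the sketch below, which appends freshly scraped draws to the stored ones and keeps a single row per draw date. The helper `merge_new_draws` and the sample frames are hypothetical; retraining is not shown.

```python
import pandas as pd


def merge_new_draws(existing: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Append newly scraped draws, keeping one row per draw date (hypothetical helper)."""
    combined = pd.concat([existing, new], ignore_index=True)
    # If a date appears twice, keep the most recently scraped row.
    combined = combined.drop_duplicates(subset="Date", keep="last")
    # Newest draw first, matching the order of the scraped data.
    return combined.sort_values("Date", ascending=False).reset_index(drop=True)


existing = pd.DataFrame(
    {"Date": pd.to_datetime(["2019-12-25", "2019-12-21"]), "Ball_1": [2, 19]}
)
# A fresh scrape usually overlaps with what is already stored.
new = pd.DataFrame(
    {"Date": pd.to_datetime(["2019-12-28", "2019-12-25"]), "Ball_1": [20, 2]}
)

updated = merge_new_draws(existing, new)
print(updated)
```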

In summary, web scraping, data extraction, cleaning, and model development are essential steps in creating a predictive model for lottery numbers. This process allows us to harness historical data to improve our chances of predicting winning numbers in future draws. Storing the data in a Parquet file ensures efficient data handling and analysis.

Note: NEXT POST