Gilbertly/oddspiders

Introduction

This project scrapes content from Oddsportal and matches it as closely as possible with content from Soccerstats.

Configuration

// install dependencies
$ npm install

// start the Browserless API
$ npm run browserless

Workflow

The main aim of the project is to produce clean output that aggregates data from both Oddsportal and Soccerstats. To analyze the statistics for a particular day, follow this workflow:

  1. Gather inference statistics - This step starts on AWS by triggering a separate Step Functions workflow, which saves the inference data in the database table fixture_pregame_analysis. The process also writes a file containing the same data, which can be retrieved from S3.

  2. Start monitoring select fixtures - Before starting this step, filter the fixtures of interest from the previous step, since those are the ones monitored here. This step fetches the fixtures for a particular day and saves them in the database table oddsportal_fixtures. The data comes mainly from Oddsportal, so several steps are needed to link it to the Soccerstats data obtained in step 1. The key element is the gameHomeTeamID or gameAwayTeamID from step 1, which is entered manually to filter down to the fixtures of interest. The workflow command npm run monitor:start handles this, writes to the file data/oddsportal/aggregated_metrics.json, and saves the data to the database table agg_sstats_oddsportal. Once started, the workflow is scheduled to run roughly every 5 minutes.

  3. Visualize aggregated statistics - Read the output file from step 2, data/oddsportal/aggregated_metrics.json, into a Pandas dataframe within a Jupyter notebook. Because the file is updated roughly every 5 minutes, re-run the notebook regularly to pick up the latest data.

  4. Profit

Scripts

src/match_proxies.js

This script gets the teams of a selected league from the table sstats_seasons, then navigates to each team's Oddsportal standings page at https://www.oddsportal.com/standings/#soccer. Puppeteer then compares the team name queried from sstats_seasons with the name in the league standings, and fixes it where appropriate.

The purpose is to match each team to its equivalent name on Oddsportal so that both share the same teamID. With a shared teamID, data from the sstats_fixtures table can be cross-linked with data from Oddsportal.
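The name comparison described above could be sketched roughly as follows. This is only an illustration, not the logic in match_proxies.js: the helper names normalizeTeamName and matchTeams, the teamName field, and the list of stripped club suffixes are all assumptions.

```javascript
// Hypothetical helper: reduce a team name to a comparable form.
function normalizeTeamName(name) {
  return name
    .normalize('NFD')                    // split accented letters into base + mark
    .replace(/[\u0300-\u036f]/g, '')     // drop the combining marks
    .toLowerCase()
    .replace(/\b(fc|cf|sc|afc)\b/g, '')  // strip common club suffixes (illustrative list)
    .replace(/[^a-z0-9]+/g, ' ')         // collapse punctuation and whitespace
    .trim();
}

// Hypothetical matcher: pair each sstats team with an Oddsportal name.
// `sstatsTeams` is assumed to be [{ teamID, teamName }, ...].
function matchTeams(sstatsTeams, oddsportalNames) {
  const byNormalized = new Map(
    oddsportalNames.map((name) => [normalizeTeamName(name), name])
  );
  const matched = [];
  const unmatched = [];
  for (const team of sstatsTeams) {
    const oddsportalName = byNormalized.get(normalizeTeamName(team.teamName));
    if (oddsportalName) {
      matched.push({ teamID: team.teamID, teamName: team.teamName, oddsportalName });
    } else {
      unmatched.push(team);
    }
  }
  return { matched, unmatched };
}
```

Exact matching on a normalized name is deliberately conservative: anything it misses ends up in the unmatched list, where it can be resolved by hand.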

To run this script, use the command below.

$ npm run oddsportal:proxies

This command will create/populate several files in the data/oddsportal folder, namely:

  • /leagues_unmatched.json - Leagues that did not match between sstats and oddsportal. It is fine to leave some of these league divisions unmatched, since not every league and division will have a match.
  • /leagues_matched.json - Leagues that matched between sstats and oddsportal.
  • /teams_matched.json - Teams that matched between sstats and oddsportal.
  • /teams_unmatched.json - Teams that did not match between sstats and oddsportal.

Whenever possible, try to recover unmatched teams. This can be achieved by running the script src/scripts/teams_unmatched.js.

Some known issues when running this script include:

  1. An error querying teams from sstats_seasons. This can happen when the script runs while the table has another open connection from the sstats project. The only available workaround right now is to run the script again after a few seconds.

  2. The Paraguay league has a unique standings table, so the script can't match any team from it. Possible solutions are to ignore the league completely or to add custom conditionals that handle its layout.

  3. Leagues with more than 5 league divisions can sometimes cause the script to crash with a browser disconnected error. An example is the England league, which has 12 league divisions (or Germany). Since the script tries to match teams from all league divisions, one at a time, the load can become too much and, perhaps due to the memory constraints of the browserless Docker container, the puppeteer browser disconnects. The solution is to un-comment the extra line of code below, which limits the number of league divisions the container handles at a time. Re-run the script with successive slice ranges until every division has been processed, at most four at a time, e.g. .slice(0, 4), .slice(4, 8), etc. Note that the end index of slice is exclusive, so each range must start where the previous one ended.

...
proxyMatchedOrdered = proxyMatchedOrdered.slice(0,4);
...
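Rather than editing the slice range by hand for each run, the batches could be computed up front. The sketch below is only an illustration of that idea; the chunk helper is hypothetical and not part of the repository, and how each batch is fed back into the script is left open.

```javascript
// Hypothetical helper: split a list into consecutive batches of at most `size` items.
// Array.prototype.slice excludes its end index, so batch boundaries are
// [0, size), [size, 2 * size), and so on, with no gaps or overlaps.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches = [];
  for (let start = 0; start < items.length; start += size) {
    batches.push(items.slice(start, start + size));
  }
  return batches;
}

// Example: 12 league divisions processed four at a time gives three batches.
const divisions = Array.from({ length: 12 }, (_, i) => `division-${i + 1}`);
const divisionBatches = chunk(divisions, 4);
```

Processing one batch per browser session keeps the number of pages open in the container bounded, which is the same effect the manual slice achieves.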

src/scripts/teams_unmatched.js

This script uses the prompts npm module to let the developer easily add an alias name to a series of unmatched teams. The script reads the file data/oddsportal/teams_unmatched.json and iteratively generates CLI questions where the developer can manually add a team name alias by navigating to the team's equivalent Oddsportal page. After completion, the file data/oddsportal/whitelist_script.json will be created/populated with the team aliases. The developer should confirm that the file has the correct team aliases, and then manually copy/paste the JSON content into the file src/oddsportal/models/whitelist_teams.js.

You can run this script via the command below:

$ npm run unmatched:teams

There are no known issues at the moment with the script.

src/oddsportal/scripts/monitor_fixtures.js

This script takes user input, gameHomeTeamIDs or gameAwayTeamIDs separated by commas, and matches it against the fixtures in the file fixtures_example.json. That file contains the fixtures of a particular day, as fetched from Oddsportal, with the metadata that links them to Soccerstats, in this case the gameHomeTeamID and gameAwayTeamID. The filtered fixtures are then saved to the output file fixtures_monitor.json for further processing by match_bookies.js.
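The filtering step can be pictured with a short sketch. It assumes each fixture object carries gameHomeTeamID and gameAwayTeamID properties, as the description suggests; the function name filterFixtures is hypothetical, and the real script may differ.

```javascript
// Hypothetical helper: keep fixtures where either team ID is in the requested list.
// IDs are compared as strings so that 12 and "12" are treated as the same ID.
function filterFixtures(fixtures, teamIDs) {
  const wanted = new Set(teamIDs.map(String));
  return fixtures.filter(
    (fixture) =>
      wanted.has(String(fixture.gameHomeTeamID)) ||
      wanted.has(String(fixture.gameAwayTeamID))
  );
}
```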

To run this script, use the command below:

$ node src/oddsportal/scripts/monitor_fixtures.js

NOTE: The above command can be run as part of a workflow that executes several scripts. To start the workflow, use the command npm run monitor:start. See the scripts section of the package.json file for further reference.

Some known issues with the script include:

  1. User input is only split on commas; the script does not otherwise validate the IDs. Input that does not follow the comma-separated form may therefore produce unexpected results.
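One way to tighten this would be to validate each token after splitting. The sketch below assumes team IDs are purely numeric, which the source does not confirm; the function name parseTeamIDs is hypothetical, and the pattern would need adjusting if IDs have another shape.

```javascript
// Hypothetical stricter parser for comma-separated team IDs.
// ASSUMPTION: IDs consist of digits only; adjust the pattern if they do not.
function parseTeamIDs(input) {
  const tokens = input
    .split(',')
    .map((token) => token.trim())
    .filter((token) => token.length > 0); // ignore empty entries such as a trailing comma
  const invalid = tokens.filter((token) => !/^\d+$/.test(token));
  if (invalid.length > 0) {
    throw new Error(`Invalid team IDs: ${invalid.join(', ')}`);
  }
  return [...new Set(tokens)]; // drop duplicates, keep first-seen order
}
```

Failing loudly on malformed input avoids silently monitoring the wrong fixtures.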

src/oddsportal/match_bookies.js

This script is responsible for getting bookmakers' odds for a specific fixture/match. It is meant to be run as frequently as possible before a match kicks off, so as to get the freshest odds offered by the various bookmakers. To do this, the script depends on the file fixtures_monitor.json, which contains the fixtures of interest, and is scheduled to fetch bookmaker odds for them every 5 minutes. After the odds are fetched successfully, they are saved to the file odds_all.json.

To run this script, use the command below:

$ npm run oddsportal:bookies

There are no known issues at the moment with the script.

src/oddsportal/scripts/aggregate_metrics.js

This script is called as part of match_bookies.js to combine bookmakers' odds with inference data. The purpose is to provide a more readable file that also maps cleanly onto a database table. To achieve this, the script takes in data from match_bookies.js and fetches data from both the oddsportal_fixtures and fixtures_pregame_analysis tables to create a uniform record. These records are then saved to the output file aggregated_metrics.json.
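As a rough picture of what such a uniform record might look like, the sketch below joins three in-memory arrays. Everything except gameHomeTeamID and gameAwayTeamID is an assumption made for illustration: the fixtureID key, the shape of the odds and analysis objects, and the function name aggregateMetrics are not taken from the repository.

```javascript
// Hypothetical aggregation: one flat record per odds entry, enriched with
// its fixture row and the pregame analysis for the same pair of teams.
function aggregateMetrics(odds, fixtures, analyses) {
  // Index fixtures by an assumed fixtureID key.
  const fixtureByID = new Map(fixtures.map((f) => [f.fixtureID, f]));
  // Index analyses by the home/away team ID pair.
  const analysisByTeams = new Map(
    analyses.map((a) => [`${a.gameHomeTeamID}:${a.gameAwayTeamID}`, a])
  );

  const records = [];
  for (const entry of odds) {
    const fixture = fixtureByID.get(entry.fixtureID);
    if (!fixture) continue; // skip odds that have no known fixture
    const teamsKey = `${fixture.gameHomeTeamID}:${fixture.gameAwayTeamID}`;
    records.push({
      fixtureID: entry.fixtureID,
      gameHomeTeamID: fixture.gameHomeTeamID,
      gameAwayTeamID: fixture.gameAwayTeamID,
      odds: entry.odds,
      analysis: analysisByTeams.get(teamsKey) || null, // null when no analysis exists
    });
  }
  return records;
}
```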

To run this script, use the command below (it runs as part of match_bookies.js):

$ npm run oddsportal:bookies

There are no known issues at the moment with the script.

src/oddsportal/match_fixtures.js

This script fetches the scheduled fixtures for a particular date, as well as the number of bookmakers available for each match. The output is saved to fixtures_oddsportal.json for further processing and also to the database table oddsportal_fixtures.

To run this script, use the command below:

$ npm run oddsportal:fixtures

There are no known issues at the moment with the script.

About

Browserless scrapers hosted on AWS Fargate, scheduled via Cloudwatch (should change name to ecs-scrapers).
