ACM40960/project-leasauer

Project in Maths Modelling: Predict the 5000m Race Winner of the 2024 Paris Olympic Games

Authors: Lea Sauer & Agathe Vianey-Liaud

Basic Overview

Sports predictions significantly influence areas like betting, sponsorship, and athlete training, with much of the research traditionally focused on team sports. This project shifts the focus to an individual sport by attempting to predict the results of the men's 5000m race at the 2024 Paris Olympics using two distinct approaches:

  1. Sentiment-Based Ranking: Using Tweets that mention the athletes' names and were posted before the race.
  2. AI Models: Employing race-related features to predict athletes' scores (time and position).

Installation and Setup

Installation

  1. Clone the repository:

    git clone https://github.com/ACM40960/project-leasauer.git
  2. Create and activate a virtual environment:

    python3 -m venv venv
    source venv/bin/activate
  3. Install the required dependencies:

    pip install -r requirements.txt

Prerequisites

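  • Python 3.x and the following libraries:
    • Third-party (install with pip): keras, requests, BeautifulSoup (pip package beautifulsoup4), pandas, numpy, matplotlib, seaborn, scikit-learn, tweepy, transformers, sentencepiece, twikit
    • Standard library (bundled with Python, nothing to install): json, csv, re, os, time, subprocess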
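Before running the notebooks, it can help to check which of the third-party packages are importable in the active environment. The following is a minimal sketch, not part of the repository: the helper name `missing_modules` is hypothetical, and the list uses import names, which differ from the pip names for BeautifulSoup (`bs4`) and scikit-learn (`sklearn`).

```python
from importlib.util import find_spec

# Import names of the third-party packages listed above.
# Assumption: BeautifulSoup is imported as bs4 and scikit-learn as sklearn.
THIRD_PARTY = [
    "keras", "requests", "bs4", "pandas", "numpy", "matplotlib",
    "seaborn", "sklearn", "tweepy", "transformers", "sentencepiece", "twikit",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(THIRD_PARTY)
    print("Missing:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can then be installed with pip inside the virtual environment.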

Project Structure

The repository is organized as follows:

/data/              # Datasets used in the project.
/Part1/             # Jupyter notebook for the sentiment analysis and the corresponding visualizations.
/Part2/             # Jupyter notebooks for the feature analysis.
/Web scrapping/     # Jupyter notebooks for web scraping.
README.md           # The file you’re currently reading.
poster.pdf          # Project poster.

Usage Instructions

Running the Project

Part 1: Sentiment analysis of the Tweets

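Web Scraping: The scrapers are Jupyter notebooks, so they cannot be run with the python interpreter directly. Open them in Jupyter, or execute them from the command line, for example with nbconvert (quote the paths, since they contain spaces):

jupyter nbconvert --to notebook --execute "Web scrapping/Web Scrape Tweets.ipynb"
jupyter nbconvert --to notebook --execute "Web scrapping/Web Scrape race results.ipynb"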

Jupyter Notebook: Load the Jupyter notebook in /Part1/Sentiment analysis.ipynb to extract the sentiment of the Tweets.

📈 Part 2: Feature analysis - past performances + simulated features

Jupyter Notebook: Load the Jupyter notebook in /Part2/code/UCD_Project-part2.ipynb to run the web scraping, train the AI models, and analyse the data and model outcomes.

Datasets

Tweets

  • Source: Tweets collected from X by web scraping with the Twikit Python package, which does not require API keys. Searches were filtered to recent and top-performing Tweets.
  • Content: 625 Tweets mentioning the athletes, with fields for the tweet content, username, and posting date, plus the athlete's actual race time and rank.
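The record layout described above can be sketched as a small typed loader. This is an illustration only: the column names (`athlete`, `username`, `posted_at`, `text`, `race_time_s`, `rank`) and the sample rows are placeholders, not the actual schema or contents of the files in /data/.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class TweetRecord:
    athlete: str
    username: str
    posted_at: str
    text: str
    race_time_s: float  # athlete's actual race time, in seconds
    rank: int           # athlete's actual finishing position


def read_tweets(handle):
    """Parse CSV rows (hypothetical column names) into TweetRecord objects."""
    return [
        TweetRecord(
            athlete=row["athlete"],
            username=row["username"],
            posted_at=row["posted_at"],
            text=row["text"],
            race_time_s=float(row["race_time_s"]),
            rank=int(row["rank"]),
        )
        for row in csv.DictReader(handle)
    ]


# Placeholder rows for illustration only; these are not real data.
SAMPLE = """athlete,username,posted_at,text,race_time_s,rank
Athlete A,user_1,2024-08-01,Placeholder tweet text,800.0,1
Athlete B,user_2,2024-08-02,Another placeholder,805.5,2
"""

records = read_tweets(io.StringIO(SAMPLE))
print(len(records), "records loaded")
```

Converting the time and rank to numeric types at load time keeps the later correlation step free of string handling.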

Performance Data

  • Source: Web scraped from World Athletics.
  • Content: Features for the top 100 athletes worldwide in the 5000m, including past performances.

Competition results

  • Source: Web scraped from World Athletics.
  • Content: The athletes' names, times, and ranks from the men's Olympic 5000m race.

Model and Analysis

Part 1: Sentiment analysis of the Tweets

  • Data Collection: Tweets were collected before the race using the athletes' names as keywords.
  • Data Processing: Sentiment analysis was conducted with three models: nlptown/bert-base-multilingual-uncased-sentiment, which classifies tweets into five sentiment categories, and cardiffnlp/twitter-roberta-base-sentiment and xlm-roberta-base, which classify tweets into three sentiment categories.
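As a rough sketch of the five-category step, the nlptown model can be loaded through the Hugging Face `pipeline` API, which returns labels of the form "1 star" to "5 stars". The helper names below (`stars_from_label`, `classify`, `mean_stars_by_athlete`) are hypothetical, and averaging the stars per athlete is an assumption about how a ranking could be built, not necessarily what the notebook does.

```python
from collections import defaultdict
from statistics import mean

MODEL_NAME = "nlptown/bert-base-multilingual-uncased-sentiment"


def stars_from_label(label):
    """Convert a label such as '4 stars' or '1 star' to the integer 4 or 1."""
    return int(label.split()[0])


def classify(texts):
    """Score a list of texts with the five-class model (downloads it on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    classifier = pipeline("sentiment-analysis", model=MODEL_NAME)
    return [stars_from_label(result["label"]) for result in classifier(texts)]


def mean_stars_by_athlete(athletes, stars):
    """Average the star ratings of all tweets that mention each athlete."""
    grouped = defaultdict(list)
    for athlete, score in zip(athletes, stars):
        grouped[athlete].append(score)
    return {athlete: mean(scores) for athlete, scores in grouped.items()}
```

Sorting the resulting dictionary by mean stars, highest first, would give one possible sentiment-based ranking to compare against the actual finishing order.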

📈 Part 2: Feature analysis - past performances + simulated features

  • Data Collection: Historical data of top athletes' performances was combined with simulated race features.
  • Data Simulation: Monte Carlo techniques were employed to simulate variables like crowd cheering effect.
  • Data Analysis and Visualization: Features from the performance dataset were modelled with Linear Regression, Neural Networks, and Random Forest, and the models were evaluated using MAE, MSE, and RMSE.
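The simulation and evaluation steps above can be illustrated with a short standard-library sketch. The normal distribution, its parameters (`mean_effect_s`, `spread_s`), and the function names are assumptions made for illustration; the README does not state how the crowd cheering effect was actually modelled.

```python
import math
import random


def simulate_crowd_effect(n_draws, mean_effect_s=-1.0, spread_s=0.5, seed=0):
    """Monte Carlo draws of a hypothetical crowd effect on race time, in seconds.

    Assumption: the effect is normally distributed; negative values mean a faster time.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    return [rng.gauss(mean_effect_s, spread_s) for _ in range(n_draws)]


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, and RMSE for paired true and predicted values."""
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


draws = simulate_crowd_effect(1000)
print("mean simulated effect (s):", sum(draws) / len(draws))
print(regression_metrics([800.0, 810.0], [802.0, 807.0]))
```

RMSE is reported alongside MSE because it is in the same unit as the target (seconds for race time), which makes it easier to compare with MAE.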

Results

Key Findings

  • Sentiment Analysis: The nlptown/bert-base-multilingual-uncased-sentiment model performed best on the Tweets. Tweet sentiment showed moderate correlations with race results, but sentiment alone was not a reliable predictor.
  • AI Models: The Random Forest model outperformed the Linear Regression and Neural Network models, demonstrating better predictive accuracy.
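One way to quantify the relationship between sentiment and finishing position is a rank correlation. The sketch below computes Spearman's coefficient with the standard library only; it is an illustration of the technique, and the README does not specify which correlation measure the notebooks use.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of 1-based positions in the block
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

With mean sentiment as `x` and finishing position as `y`, a coefficient near -1 would mean that athletes with more positive tweets tended to finish higher up, since rank 1 is the best position.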

Visualizations

  • Sentiment distribution and correlation matrices are available in the /Part1/Visualizations folder.
  • Model performance metrics and residual plots are also included.

Contribution Guidelines

To contribute to this project, follow these steps:

git checkout -b {your-name/feature}
git add .
git commit -m "New Feature"
git push --set-upstream origin '{your-name/feature}'
git checkout main
git pull  # After PR gets merged into main branch

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Contact Information

For inquiries or collaboration, please reach out to:

Acknowledgements

We would like to thank University College Dublin and specifically our professor Dr Sarp Akcay for his support and guidance throughout this project.
