Project in Maths Modelling: Predict the 5000m Race Winner of the 2024 Paris Olympic Games

Authors: Lea Sauer & Agathe Vianey-Liaud

Basic Overview

Sports predictions significantly influence areas like betting, sponsorship, and athlete training, with much of the research traditionally focused on team sports. This project shifts the focus to an individual sport by attempting to predict the results of the men's 5000m race at the 2024 Paris Olympics using two distinct approaches:

Sentiment-Based Ranking: Using Tweets posted before the race that mention the athletes' names.
AI Models: Employing race-related features to predict athletes' scores (time and position).

Installation and Setup

Installation

Clone the repository:

git clone https://github.com/ACM40960/project-leasauer.git

Create and activate a virtual environment:

python3 -m venv venv
source venv/bin/activate

Install the required dependencies:
```
pip install -r requirements.txt
```

Prerequisites

Python 3.x and the following libraries:
- Third-party: keras, requests, BeautifulSoup, pandas, numpy, matplotlib, seaborn, scikit-learn, tweepy, transformers, sentencepiece, twikit
- Standard library (bundled with Python, no installation needed): json, csv, re, os, time, subprocess

Project Structure

The repository is organized as follows:

/data/              # Datasets used in the project.
/Part1/             # Jupyter notebook for the sentiment analysis and corresponding visualizations.
/Part2/             # Jupyter notebooks for the feature analysis.
/Web scrapping/     # Notebooks for web scraping.
README.md           # The file you're currently reading.
Poster.pdf          # Project poster.

Usage Instructions

Running the Project

Part 1: Sentiment analysis of the Tweets

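Web Scraping Tweets and Race Results: These are Jupyter notebooks, so they cannot be run with the python command directly. Open them in Jupyter, or execute them from the command line (the paths contain spaces and must be quoted):

jupyter nbconvert --to notebook --execute "Web scrapping/Web Scrape Tweets.ipynb"
jupyter nbconvert --to notebook --execute "Web scrapping/Web Scrape race results.ipynb"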

Jupyter Notebook: Load the Jupyter notebook in /Part1/Sentiment analysis.ipynb to extract the sentiment of the Tweets.

📈 Part 2: Feature analysis - past performances + simulated features

Jupyter Notebook: Load the Jupyter notebook in /Part2/code/UCD_Project-part2.ipynb to run the web scraping, train the AI models, and analyse the data and model outcomes.

Datasets

Tweets

Source: Tweets collected from X via web scraping with the Twikit Python package, which removes the need for API keys. The search was filtered to recent and top-performing Tweets.
Content: 625 Tweets mentioning the athletes, with fields such as Tweet content, username, posting date, and the athlete's actual race time and rank.

Performance Data

Source: Web scraped from World Athletics.
Content: The dataset comprises features of the top 100 athletes worldwide in the 5000m, including past performances.

Competition results

Source: Web scraped from World Athletics.
Content: The dataset comprises each athlete's name, time, and rank from the men's Olympic 5000m race.

Model and Analysis

Part 1: Sentiment analysis of the Tweets

Data Collection: Tweets were collected before the race using the athletes' names as keywords.
Data Processing: Sentiment analysis was conducted with three models: nlptown/bert-base-multilingual-uncased-sentiment, which classifies Tweets into five sentiment categories, and cardiffnlp/twitter-roberta-base-sentiment and xlm-roberta-base, which classify Tweets into three sentiment categories.
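
As a minimal sketch of how the five-category output could be turned into a per-athlete ranking: the nlptown model reports labels of the form "1 star" to "5 stars". The helper names (stars_to_score, rank_athletes, classify), the mean-score aggregation, and the sample labels are illustrative assumptions and not necessarily what the notebook does.

```python
from collections import defaultdict
from statistics import mean


def stars_to_score(label: str) -> int:
    """Convert a label such as '4 stars' or '1 star' into the integer 4 or 1."""
    return int(label.split()[0])


def rank_athletes(labelled_tweets):
    """Rank athletes by mean sentiment score, highest first.

    labelled_tweets: iterable of (athlete_name, label) pairs.
    Returns a list of (athlete_name, mean_score) tuples.
    """
    scores = defaultdict(list)
    for athlete, label in labelled_tweets:
        scores[athlete].append(stars_to_score(label))
    means = {athlete: mean(values) for athlete, values in scores.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


def classify(texts):
    """Label Tweets with the Hugging Face pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily; only needed for real inference

    clf = pipeline(
        "sentiment-analysis",
        model="nlptown/bert-base-multilingual-uncased-sentiment",
    )
    return [result["label"] for result in clf(texts)]


# Made-up labels for two hypothetical athletes, standing in for classify() output.
sample = [
    ("Athlete A", "5 stars"), ("Athlete A", "3 stars"),
    ("Athlete B", "2 stars"), ("Athlete B", "4 stars"), ("Athlete B", "2 stars"),
]
ranking = rank_athletes(sample)
```

Averaging star ratings is only one possible aggregation; a median or a share of positive Tweets would be equally plausible choices.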

📈 Part 2: Feature analysis - past performances + simulated features

Data Collection: Historical data of top athletes' performances was combined with simulated race features.
Data Simulation: Monte Carlo techniques were employed to simulate variables like crowd cheering effect.
Data Analysis and Visualization: Features from the performance dataset were modelled with Linear Regression, Neural Networks, and Random Forest. The models were evaluated using MAE, MSE, and RMSE.
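
The Data Simulation step can be illustrated with a small Monte Carlo sketch. Everything specific here is an assumption made for illustration: the crowd effect is modelled as a normally distributed shift in seconds, and the parameter values, the baseline time, and the function names are hypothetical rather than taken from the project notebooks.

```python
import random
from statistics import mean


def simulate_crowd_effect(n_draws=10_000, mean_effect=0.0, spread=1.0, seed=42):
    """Draw n_draws samples of a hypothetical crowd effect (seconds added to a finish time).

    The normal distribution and its parameters are assumptions for illustration.
    """
    rng = random.Random(seed)  # seeded so the simulation is reproducible
    return [rng.gauss(mean_effect, spread) for _ in range(n_draws)]


def simulated_times(base_time_s, effects):
    """Apply each sampled effect to a baseline 5000m time in seconds."""
    return [base_time_s + effect for effect in effects]


effects = simulate_crowd_effect(mean_effect=-0.5, spread=1.5)
times = simulated_times(base_time_s=780.0, effects=effects)  # 780 s = 13:00, a made-up baseline
expected_time = mean(times)  # Monte Carlo estimate of the expected finish time
```

With more draws the estimate settles closer to the baseline plus the assumed mean effect, which is the property that makes repeated random sampling usable as a simulated feature.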

Results

Key Findings

Sentiment Analysis: The nlptown/bert-base-multilingual-uncased-sentiment model performed best on the Tweets. It showed moderate correlations between Tweet sentiment and race results, but sentiment alone was not a reliable predictor.
AI Models: The Random Forest model outperformed the other models, demonstrating better predictive accuracy.
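
To make "better predictive accuracy" concrete, the error metrics named under Model and Analysis (MAE, MSE, RMSE) can be computed per model and compared. The sketch below uses plain Python and made-up numbers; the model names and values are hypothetical and are not the project's results.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target."""
    return sqrt(mse(y_true, y_pred))


# Made-up finish times in seconds and predictions from two hypothetical models.
actual = [780.0, 785.0, 790.0, 800.0]
predictions = {
    "model_a": [782.0, 784.0, 793.0, 796.0],
    "model_b": [781.0, 785.0, 791.0, 799.0],
}

report = {
    name: {"MAE": mae(actual, pred), "MSE": mse(actual, pred), "RMSE": rmse(actual, pred)}
    for name, pred in predictions.items()
}
best = min(report, key=lambda name: report[name]["RMSE"])  # lowest RMSE wins
```

scikit-learn, which is already among the prerequisites, offers equivalent functions such as mean_absolute_error and mean_squared_error in sklearn.metrics.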

Visualizations

Sentiment distribution and correlation matrices are available in the /Part1/Visualizations folder.
Model performance metrics and residual plots are also included.
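
For readers who want to reproduce a sentiment-versus-result correlation of the kind shown in the correlation matrices, a rank correlation is one natural choice. The sketch below implements Spearman's coefficient in plain Python; the choice of Spearman, the function names, and the data are assumptions for illustration, and the README does not state which coefficient the notebooks use.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up mean sentiment scores and finishing positions for five hypothetical athletes.
mean_sentiment = [4.2, 3.9, 3.1, 3.5, 2.8]
finish_rank = [2, 1, 4, 5, 3]
rho = spearman(mean_sentiment, finish_rank)
```

A negative coefficient here means that higher sentiment goes with a smaller (better) finishing position. scipy.stats.spearmanr computes the same quantity if SciPy is available.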

Contribution Guidelines

To contribute to this project, follow these steps:

git checkout -b {your-name/feature}
git add .
git commit -m "New Feature"
git push --set-upstream origin '{your-name/feature}'
git checkout main
git pull  # After PR gets merged into main branch

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Contact Information

For inquiries or collaboration, please reach out to:

Lea Sauer: lea.sauer@ucdconnect.ie
Agathe Vianey-Liaud: agathe.vianey-liaud@ucdconnect.ie

Acknowledgements

We would like to thank University College Dublin and specifically our professor Dr Sarp Akcay for his support and guidance throughout this project.
