Authors: Lea Sauer & Agathe Vianey-Liaud
Sports predictions significantly influence areas like betting, sponsorship, and athlete training, with much of the research traditionally focused on team sports. This project shifts the focus to an individual sport by attempting to predict the results of the men's 5000m race at the 2024 Paris Olympics using two distinct approaches:
- Sentiment-Based Ranking: Using tweets that mention the athletes' names, posted before the race.
- AI Models: Employing race-related features to predict athletes' scores (time and position).
- Installation and Setup
- Project Structure
- Usage Instructions
- Datasets
- Model and Analysis
- Results
- Contribution Guidelines
- License
- Contact Information
- Acknowledgements
- Clone the repository:
  git clone https://github.com/project-leasauer
- Create and activate a virtual environment:
  python3 -m venv venv
  source venv/bin/activate
- Install the required dependencies:
  pip install -r requirements.txt
- Python 3.x and the following libraries: keras, json, csv, requests, BeautifulSoup, pandas, numpy, matplotlib, seaborn, scikit-learn, tweepy, re, transformers, sentencepiece, os, time, twikit, subprocess
The repository is organized as follows:
/data/ # Contains the datasets used in the project.
/Part1/ # Jupyter notebook for the sentiment analysis and corresponding visualizations.
/Part2/ # Jupyter notebooks for the feature analysis.
/Web scrapping/ # Scripts for web scraping
README.md # The file you’re currently reading.
poster.pdf # Project poster.
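Web scraping (tweets and race results): the scrapers are Jupyter notebooks, so open them with Jupyter rather than running them with python. Quote the paths because they contain spaces.
jupyter notebook "Web scrapping/Web Scrape Tweets.ipynb"
jupyter notebook "Web scrapping/Web Scrape race results.ipynb"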
Jupyter Notebook:
Load the Jupyter notebook /Part1/Sentiment analysis.ipynb to extract the sentiment of the tweets.
Jupyter Notebook:
Load the Jupyter notebook /Part2/code/UCD_Project-part2.ipynb to run the web scraping, train the AI models, and analyse the data and model outcomes.
- Source: Tweets collected from X via web scraping using the Twikit Python package, which circumvents the need for API keys. Filters were set to recent and top-performing tweets.
- Content: Includes 625 tweets mentioning the athletes, with fields such as tweet content, username, posting date, and the athlete's actual race time and rank.
- Source: Web-scraped from World Athletics.
- Content: The dataset comprises features of the top 100 athletes worldwide in the 5000m, including past performances.
- Source: Web-scraped from World Athletics.
- Content: The dataset comprises each athlete's name, time, and rank from the men's Olympic 5000m race.
- Data Collection: Tweets were collected before the race using the athletes' names as keywords.
- Data Processing: Sentiment analysis was conducted using three models: nlptown/bert-base-multilingual-uncased-sentiment, which classifies tweets into five sentiment categories, and cardiffnlp/twitter-roberta-base-sentiment and xlm-roberta-base, which classify tweets into three sentiment categories.
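The step from per-tweet labels to an athlete ranking can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes each tweet has already been labelled with the five-class output of nlptown/bert-base-multilingual-uncased-sentiment ("1 star" to "5 stars"), and it assumes that athletes are ranked by their mean star rating. The function name `rank_by_sentiment` and the sample data are hypothetical.

```python
from collections import defaultdict

# Label strings as emitted by nlptown/bert-base-multilingual-uncased-sentiment.
STAR_LABELS = {"1 star": 1, "2 stars": 2, "3 stars": 3, "4 stars": 4, "5 stars": 5}


def rank_by_sentiment(labelled_tweets):
    """Rank athletes by mean star rating, highest first.

    `labelled_tweets` is an iterable of (athlete, label) pairs. Averaging the
    stars is an assumed aggregation rule chosen for illustration only.
    """
    stars = defaultdict(list)
    for athlete, label in labelled_tweets:
        stars[athlete].append(STAR_LABELS[label])
    means = {athlete: sum(v) / len(v) for athlete, v in stars.items()}
    # Sort by descending mean; break ties alphabetically so output is stable.
    return sorted(means.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical labels for placeholder athletes, not data from the project.
sample = [
    ("Athlete A", "5 stars"),
    ("Athlete A", "4 stars"),
    ("Athlete B", "4 stars"),
    ("Athlete C", "1 star"),
    ("Athlete C", "2 stars"),
]
ranking = rank_by_sentiment(sample)
for position, (athlete, mean_stars) in enumerate(ranking, start=1):
    print(position, athlete, round(mean_stars, 2))
```

Other aggregation rules (for example the share of 4 and 5 star tweets, or a weighting by tweet engagement) would fit the same structure; only the line computing `means` would change.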
- Data Collection: Historical data of top athletes' performances was combined with simulated race features.
- Data Simulation: Monte Carlo techniques were employed to simulate variables such as the crowd cheering effect.
- Data Analysis and Visualization: Features from the performance dataset were modelled using Linear Regression, Neural Networks, and Random Forest, and the models were evaluated using metrics such as mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE).
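For reference, the three evaluation metrics named above can be computed as in the sketch below. It uses only the standard library so that the definitions are explicit; the project notebooks may well rely on scikit-learn instead. The function name `regression_errors` and the sample times are hypothetical.

```python
import math


def regression_errors(actual, predicted):
    """Return MAE, MSE and RMSE for two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n   # mean absolute error
    mse = sum(r * r for r in residuals) / n    # mean squared error
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


# Hypothetical finishing times in seconds, not results from the project.
actual_times = [780.0, 785.0, 790.0]
predicted_times = [782.0, 784.0, 793.0]
errors = regression_errors(actual_times, predicted_times)
print(errors)
```

MAE and RMSE are in the same unit as the target (seconds for race time, places for rank), which makes them easier to interpret than MSE. RMSE penalises large misses more heavily than MAE.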
- Sentiment Analysis: The nlptown/bert-base-multilingual-uncased-sentiment model performed best on the tweets. Tweet sentiment showed moderate correlations with race results; however, sentiment alone was not a reliable predictor.
- AI Models: The Random Forest model outperformed the others, demonstrating better predictive accuracy.
- Sentiment distribution and correlation matrices are available in the /Part1/Visualizations folder.
- Model performance metrics and residual plots are also included.
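One way to quantify how closely a sentiment-based ranking matches the actual finishing order is a rank correlation. The sketch below computes Spearman's rho for two tie-free rankings. The README does not state which correlation coefficient the notebooks use, so Spearman is an assumption here, and the function name `spearman_rho` and the sample rankings are hypothetical.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two tie-free rankings of equal length.

    Uses rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the
    difference between the two ranks of item i.
    """
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("need two rankings of equal length, at least 2 items")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical rankings for five athletes, not results from the project.
sentiment_rank = [1, 2, 3, 4, 5]
race_rank = [2, 1, 3, 5, 4]
rho = spearman_rho(sentiment_rank, race_rank)
print(round(rho, 3))
```

A value near 1 means the two orderings largely agree, a value near 0 means no monotonic relationship, and a value near -1 means the orderings are reversed. The closed-form formula is only valid when neither ranking contains ties.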
To contribute to this project, follow these steps:
git checkout -b {your-name/feature}
git add .
git commit -m "New Feature"
git push --set-upstream origin '{your-name/feature}'
git checkout main
git pull # After PR gets merged into main branch
This project is licensed under the MIT License. See the LICENSE file for more details.
For inquiries or collaboration, please reach out to:
- Lea Sauer: lea.sauer@ucdconnect.ie
- Agathe Vianey-Liaud: agathe.vianey-liaud@ucdconnect.ie
We would like to thank University College Dublin and specifically our professor Dr Sarp Akcay for his support and guidance throughout this project.