This project implements a complete pipeline that extracts football data from Wikipedia with Apache Airflow, stores it in Azure Data Lake Storage Gen2, transforms it with Azure Databricks, queries it through Azure Synapse Analytics, and visualizes the results in Tableau. It is designed to provide comprehensive football analytics for enthusiasts and analysts.
- Apache Airflow
- Azure Data Lake Storage Gen2 account
- Azure Databricks workspace
- Azure Synapse Analytics workspace
- Tableau Desktop or Tableau Public account
- Clone the Repository: `git clone https://github.com/AnishmMore/Football-Data-Analytics.git`
- Azure Setup:
- Set up Azure Data Lake Storage Gen2.
- Configure Azure Databricks workspace.
- Initialize Azure Synapse Analytics workspace.
- Airflow Setup: Install Apache Airflow (e.g. `pip install apache-airflow`), initialize its metadata database (`airflow db init`), and copy the DAG file into your Airflow `dags` directory before starting the scheduler and webserver (e.g. with `airflow standalone`).
- File: `wikipedia_azure.py`, located in the `dags` directory.
- Description: This is the primary DAG file containing the Apache Airflow code.
- Execution:
- Run Airflow on localhost.
- Initiate the DAG to begin data extraction from Wikipedia.
- Data is subsequently stored in Azure Data Lake Storage Gen2.
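The core of the extraction step is parsing the tables on a Wikipedia page into rows. As a rough illustration of that idea (the actual logic lives in `wikipedia_azure.py` and may instead use a library such as BeautifulSoup or `pandas.read_html`), a minimal standard-library table parser could look like this:

```python
from html.parser import HTMLParser

class WikiTableParser(HTMLParser):
    """Collect the cell text of every HTML table row into a list of rows.

    A simplified stand-in for the parsing step of the extraction DAG;
    not the project's actual implementation.
    """
    def __init__(self):
        super().__init__()
        self.rows = []     # parsed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

def parse_tables(html: str) -> list[list[str]]:
    """Return every table row found in the given HTML as a list of cell strings."""
    parser = WikiTableParser()
    parser.feed(html)
    return parser.rows
```

The rows returned by `parse_tables` would then be serialized (e.g. to CSV) and uploaded to Azure Data Lake Storage Gen2 by a downstream task in the DAG.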
- File: `Football Analytics.ipynb`
- Process:
  - Data is retrieved from Azure Data Lake Storage Gen2.
  - Transformation is executed using the Azure Databricks compute engine.
  - Transformed data is then stored back in Azure Data Lake Storage Gen2.
- Usage: Execute the notebook on Azure Databricks to transform the `raw_data`.
- File: `Synapse.sql`
- Functionality: Contains a collection of SQL queries used for data analysis.
- Utility: Use these queries in Azure Synapse to derive insights and prepare data for visualization.
- File: `Football_Analytics.twb`
- Tool: Tableau is employed for creating visual representations of the data.
- Visualization: The dashboard within the Tableau workbook provides an interactive view of the football data.