# 1. Introduction
## Objective

The dataset appears to contain information about various sports competitions. Our aim is to explore this dataset to uncover insights into the different types of competitions, their geographic distribution, and other characteristics that might be present. We'll look into aspects such as competition types, countries involved, and any other unique attributes of these competitions.

## Dataset Overview

At first glance, the dataset includes columns like `competition_id`, `competition_code`, `name`, `sub_type`, `type`, `country_id`, `country_name`, `domestic_league_code`, `confederation`, and `url`.
Together, these columns describe each competition's identity, its type, and its geographic affiliation (country and confederation).

# 2. Data Loading and Preliminary Analysis


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
competitions_df = pd.read_csv("../data/competitions.csv")

# Display the first few rows of the dataset for initial inspection
competitions_df.head()



## Initial Observations

* Competition Details: Columns like `competition_id`, `competition_code`, and `name` provide basic identification details for each competition.

* Competition Types: The `sub_type` and `type` columns indicate the nature of each competition, for example `domestic_league` or `domestic_cup`.

* Geographical Data: The `country_id`, `country_name`, and `confederation` columns suggest geographical categorization, likely offering insights into the global spread of these competitions.

* Related Links: The `url` column contains links to more detailed information, possibly useful for further exploration.


In [None]:
# Assessing the dataset for missing values and data types
missing_data = competitions_df.isnull().sum()
data_types = competitions_df.dtypes

# Summarizing the findings
missing_data_summary = pd.DataFrame({'Missing Values': missing_data, 'Data Type': data_types})
missing_data_summary


# 3. Data Cleaning and Preprocessing
### Missing Data
Country Name and Domestic League Code: There are 7 missing values each in the `country_name` and `domestic_league_code` columns. Since these are categorical data, we need to decide whether to fill these missing values with a placeholder (like "Unknown") or to drop them. The decision depends on the significance of these columns for our analysis.

Inspecting the rows with missing values shows that they mostly belong to competitions that are not country-specific. For example, both `country_name` and `domestic_league_code` are missing for the `UEFA Champions League` and `UEFA Europa League`, which are continental rather than domestic competitions. We therefore treat these rows as their own category, filling `country_name` with "International" and `domestic_league_code` with "INT".
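
The inspection described above can be sketched as follows. This is a minimal example on a toy DataFrame, not the real data: `toy_df`, the placeholder league names, and the placeholder country and code values are assumptions made for illustration; only the column names and the two UEFA competition names come from the dataset.

```python
import pandas as pd

# Toy stand-in for competitions_df; all values below are illustrative only.
toy_df = pd.DataFrame({
    "name": ["Hypothetical League A", "UEFA Champions League",
             "Hypothetical League B", "UEFA Europa League"],
    "country_name": ["Country A", None, "Country B", None],
    "domestic_league_code": ["A1", None, "B1", None],
})

# Flag rows where either of the two columns is missing
check_cols = ["country_name", "domestic_league_code"]
missing_mask = toy_df[check_cols].isnull().any(axis=1)

# Show the affected rows next to their competition names
missing_rows = toy_df.loc[missing_mask, ["name"] + check_cols]
print(missing_rows)
```

Running the same mask against `competitions_df` should list the rows behind the 7 missing values, which makes it easier to judge whether a single placeholder category is appropriate.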

In [None]:
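# Filling missing values with "International" / "INT"
# (assign the result back; calling fillna(inplace=True) on a selected column is
# deprecated chained assignment and may not update competitions_df)
competitions_df['country_name'] = competitions_df['country_name'].fillna('International')
competitions_df['domestic_league_code'] = competitions_df['domestic_league_code'].fillna('INT')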

# Verify if the missing values are filled
competitions_df.isnull().sum()


### Data Type Conversions
The dataset mostly contains object (string) types and integers. The data types appear appropriate for their respective columns.
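
No conversion is required here, but if explicit categorical semantics are wanted later, low-cardinality string columns could optionally be cast to pandas' `category` dtype. The sketch below uses a toy DataFrame; `toy_df`, its values, and the choice of `type` as the column to convert are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for competitions_df; values are illustrative only.
toy_df = pd.DataFrame({
    "type": ["domestic_league", "domestic_league", "domestic_cup"],
    "country_id": [1, 2, -1],
})

# Cast selected string columns to the category dtype; other columns are untouched
categorical_cols = ["type"]
converted = toy_df.astype({col: "category" for col in categorical_cols})

print(converted.dtypes)
```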

# 4. Exploratory Data Analysis

Key observations from the descriptive statistics (computed with `describe(include='all')` further below):

**Competition Details:**
- There are 43 entries (competitions) in the dataset.
- The dataset includes 42 unique competition names and codes, with one competition appearing twice.

**Competition Types:**
- There are 11 unique sub-types of competitions, with `first_tier` being the most frequent.
- The dataset consists of 4 distinct types of competitions, with `domestic_league` being the most common.

**Geographical Distribution:**
- The `country_id` column has values ranging from -1 to 190. (The presence of -1 might require further investigation.)
- There are 15 unique countries represented in the dataset. `International` is listed for 7 entries, reflecting the nature of those competitions as international.

**Confederation:**
- All entries belong to the `europa` confederation.
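
Two of the observations above invite a follow-up check: the `-1` value in `country_id` and the competition name that appears twice. The sketch below shows one way to examine both on a toy DataFrame. `toy_df` and all of its values are hypothetical, and the idea that `-1` marks international competitions is an assumption to be tested, not a confirmed fact about the dataset.

```python
import pandas as pd

# Toy stand-in for competitions_df (after the fillna step); values are hypothetical.
toy_df = pd.DataFrame({
    "name": ["league-a", "cup-x", "league-b", "cup-x"],
    "country_id": [10, -1, 20, -1],
    "country_name": ["Country A", "International", "Country B", "International"],
})

# Check 1: does country_id == -1 coincide exactly with the "International" rows?
sentinel_rows = toy_df["country_id"] == -1
international_rows = toy_df["country_name"] == "International"
sentinel_matches_international = bool((sentinel_rows == international_rows).all())

# Check 2: which competition names occur more than once?
is_duplicated = toy_df["name"].duplicated(keep=False)
duplicated_names = toy_df.loc[is_duplicated, "name"].unique().tolist()

print(sentinel_matches_international, duplicated_names)
```

If the first check holds on the real data, `-1` can be read as a sentinel for "no single country" rather than as a data error.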


## Visualisations


In [None]:
# Descriptive statistics for the dataset
descriptive_stats = competitions_df.describe(include='all')

# Display the descriptive statistics
descriptive_stats


**Distribution of Competitions by Type**:
- The bar chart shows how many competitions fall under each `type`. `domestic_league` is the most common type, followed by `domestic_cup` and the remaining types.

**Distribution of Competitions by Country**:
- This bar chart shows the number of competitions per country. `International`, the placeholder assigned to competitions without a single country, accounts for 7 entries. The remaining countries differ in how many competitions they contribute.

In [None]:
# Setting the aesthetic style of the plots
sns.set_style("whitegrid")

# Creating visualizations
plt.figure(figsize=(14, 6))

# Distribution of competitions by type
plt.subplot(1, 2, 1)
sns.countplot(x='type', data=competitions_df)
plt.title('Distribution of Competitions by Type')
plt.xlabel('Type')
plt.ylabel('Count')

# Distribution of competitions by country
plt.subplot(1, 2, 2)
country_counts = competitions_df['country_name'].value_counts()
country_counts.plot(kind='bar')
plt.title('Distribution of Competitions by Country')
plt.xlabel('Country')
plt.ylabel('Count')

plt.tight_layout()
plt.show()


## In-depth Analysis

In [None]:
# Analysis 1: Comparison of Domestic vs. International Competitions
domestic_vs_international = competitions_df['country_name'].apply(lambda x: 'International' if x == 'International' else 'Domestic').value_counts()

# Analysis 2: Distribution of Sub-types of Competitions
sub_type_distribution = competitions_df['sub_type'].value_counts()

# Plotting the results
plt.figure(figsize=(14, 6))

# Plot for Domestic vs. International Competitions
plt.subplot(1, 2, 1)
domestic_vs_international.plot(kind='bar')
plt.title('Domestic vs. International Competitions')
plt.xlabel('Competition Type')
plt.ylabel('Count')

# Plot for Sub-types of Competitions
plt.subplot(1, 2, 2)
sub_type_distribution.plot(kind='bar')
plt.title('Distribution of Sub-types of Competitions')
plt.xlabel('Sub-type')
plt.ylabel('Count')

plt.tight_layout()
plt.show()


**Domestic vs. International Competitions**:
- The first bar chart compares domestic and international competitions. Domestic competitions clearly outnumber international ones: of the 43 entries, 7 are labelled `International` and the remaining 36 are domestic.

**Distribution of Sub-types of Competitions**:
- The second bar chart shows the distribution of competition sub-types. `first_tier` is the most common sub-type, followed by others such as `domestic_cup` and `domestic_super_cup`.
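
To complement the charts with numbers, the two views can be combined: the share of domestic versus international competitions, and a cross-tabulation of sub-type against that split. This is a minimal sketch on toy data; `toy_df`, the `scope` label, and the `hypothetical_cup` sub-type are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for competitions_df; values are illustrative only.
toy_df = pd.DataFrame({
    "country_name": ["Country A", "Country B", "International", "Country A"],
    "sub_type": ["first_tier", "first_tier", "hypothetical_cup", "domestic_cup"],
})

# Label each row as Domestic or International (vectorised alternative to apply)
scope = pd.Series(
    np.where(toy_df["country_name"] == "International", "International", "Domestic"),
    index=toy_df.index,
    name="scope",
)

# Relative share of each scope, and sub-type counts broken down by scope
scope_share = scope.value_counts(normalize=True)
scope_by_sub_type = pd.crosstab(toy_df["sub_type"], scope)

print(scope_share)
print(scope_by_sub_type)
```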

# 5. Insights and Conclusions

## Key Findings
- The majority of competitions in the dataset are domestic, with a smaller portion being international.

- `first_tier` competitions (likely top-division leagues) dominate the dataset, followed by various types of cups and super cups.

- All competitions are under the `europa` confederation, indicating a focus on European sports competitions.

## Limitations
- The dataset is limited to 43 entries, which may not fully represent the global diversity of sports competitions.

- The scope is confined to the `europa` confederation, excluding significant competitions from other parts of the world.

## Recommendations

- For a more comprehensive analysis, expanding the dataset to include competitions from other confederations and a larger variety of countries could provide broader insights.

- Further investigation into specific types of competitions (like 'first_tier' leagues) could reveal more detailed trends and patterns relevant to sports analytics and management.

# 6. Saving the Cleaned Dataset

In [None]:
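import os

# Save the cleaned DataFrame to a new CSV file, creating the output directory if needed
cleaned_data_path = '../data/cleaned/competitions.csv'
os.makedirs(os.path.dirname(cleaned_data_path), exist_ok=True)
competitions_df.to_csv(cleaned_data_path, index=False)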