# Getting Started with Football Data ETL

This notebook demonstrates the basic setup and usage of the Football Data ETL project.

## Objectives
- Load sample football data
- Perform basic data cleaning
- Explore and visualize the data
- Save processed data

## 1. Setup and Imports

In [None]:
import sys
import os

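# Add the project's src directory to the import path.
# This assumes the notebook is run from a folder one level below the project root.
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))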

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import custom utilities
from etl_utils import load_csv_data, clean_dataframe, save_data, get_data_summary

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ All imports successful!")
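
The `sys.path` line above only works when the working directory sits exactly one level below the project root. A minimal sketch of a less fragile lookup follows. It assumes the root is the first ancestor directory that contains a `src` folder; `find_project_root` and `add_src_to_path` are hypothetical helpers, not part of `etl_utils`.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Return the first directory at or above `start` that contains `marker`.

    Hypothetical helper for illustration; not part of etl_utils.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")


def add_src_to_path(start: Path) -> Path:
    """Put <root>/src at the front of sys.path (once) and return it."""
    src_dir = find_project_root(start) / "src"
    if str(src_dir) not in sys.path:
        sys.path.insert(0, str(src_dir))
    return src_dir
```

In a notebook this would be called as `add_src_to_path(Path.cwd())` before importing from `etl_utils`.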

## 2. Load Data

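Place your football data in the `data/raw/` directory and load it from there. This walkthrough builds a small in-memory sample instead, so it runs without any data files.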

In [None]:
# Example: Create a sample dataset for demonstration
# In practice, you would load your actual data file

# Sample data creation
sample_data = {
    'player_name': ['Lionel Messi', 'Cristiano Ronaldo', 'Neymar Jr', 'Kylian Mbappé', 'Kevin De Bruyne'],
    'team': ['Inter Miami', 'Al Nassr', 'Al Hilal', 'PSG', 'Manchester City'],
    'position': ['Forward', 'Forward', 'Forward', 'Forward', 'Midfielder'],
    'goals': [30, 28, 25, 35, 15],
    'assists': [25, 10, 20, 18, 30],
    'matches_played': [35, 38, 32, 40, 38]
}

df = pd.DataFrame(sample_data)

print(f"Dataset loaded with {len(df)} rows and {len(df.columns)} columns")
df.head()
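
To swap the in-memory sample for a real file, something along these lines may help. The sketch uses plain `pandas.read_csv` rather than `load_csv_data`, whose signature is not shown in this notebook. The helper name `load_players` and the column check are assumptions made for illustration.

```python
from pathlib import Path

import pandas as pd

# Columns the rest of this notebook relies on.
REQUIRED_COLUMNS = {
    "player_name", "team", "position",
    "goals", "assists", "matches_played",
}


def load_players(csv_path) -> pd.DataFrame:
    """Read a player CSV and fail early if expected columns are missing.

    Hypothetical helper; not the project's load_csv_data.
    """
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"No such file: {csv_path}")
    df = pd.read_csv(csv_path)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    return df
```

A call might look like `df = load_players('../data/raw/players.csv')`, where `players.csv` is a placeholder file name.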

## 3. Data Exploration

Explore the structure and characteristics of the data.

In [None]:
# Display basic information
print("Data Info:")
df.info()  # info() prints its report directly and returns None
print("\nData Description:")
df.describe()

In [None]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())
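
On the five-row sample every count above is zero. With real data it often helps to see the share of missing values per column as well. A small sketch, where `missing_report` is a hypothetical helper and `demo` is made-up data:

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    is_missing = df.isna()
    report = pd.DataFrame({
        "missing": is_missing.sum(),
        "percent": (is_missing.mean() * 100).round(1),
    })
    return report.sort_values("missing", ascending=False)


# Tiny made-up frame with gaps, only to show the shape of the report.
demo = pd.DataFrame({
    "goals": [3, np.nan, 5, 1],
    "assists": [1, 2, np.nan, np.nan],
    "team": ["A", "B", "C", "D"],
})
print(missing_report(demo))
```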

## 4. Data Cleaning

Clean the data using our utility functions.

In [None]:
# Clean the dataframe
df_clean = clean_dataframe(df)

# Get data summary
summary = get_data_summary(df_clean)
print("Data Summary:")
for key, value in summary.items():
    print(f"{key}: {value}")
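
The body of `clean_dataframe` is not shown in this notebook. Purely as an illustration of what a basic cleaning pass often involves, and not as the project's implementation, a hypothetical `basic_clean` might look like this:

```python
import pandas as pd


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pass; NOT the project's clean_dataframe."""
    out = df.copy()  # leave the caller's frame untouched
    # Normalise column names: trim, lower-case, spaces to underscores.
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )
    # Trim stray whitespace in string cells; other values pass through.
    for col in out.columns:
        out[col] = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    # Drop exact duplicate rows and rows that are entirely empty.
    out = out.drop_duplicates().dropna(how="all")
    return out.reset_index(drop=True)
```

Working on a copy keeps the raw frame available for comparison after cleaning.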

## 5. Data Analysis and Visualization

Perform basic analysis and create visualizations.

In [None]:
# Calculate goals and assists per match
df_clean['goals_per_match'] = df_clean['goals'] / df_clean['matches_played']
df_clean['assists_per_match'] = df_clean['assists'] / df_clean['matches_played']

print("Performance Metrics:")
df_clean[['player_name', 'goals_per_match', 'assists_per_match']].sort_values('goals_per_match', ascending=False)
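
The divisions above assume every player has at least one match. In a real dataset `matches_played` can be zero or missing, which produces `inf` or `NaN` in the result. One way to guard against that, sketched with a hypothetical `per_match` helper and made-up numbers:

```python
import pandas as pd


def per_match(totals: pd.Series, matches: pd.Series) -> pd.Series:
    """Divide totals by matches, giving NaN where matches is missing or <= 0."""
    safe_matches = matches.where(matches > 0)  # non-positive values become NaN
    return totals / safe_matches


# Made-up rows: the second player has no matches recorded.
demo = pd.DataFrame({
    "goals": [30, 4, 0],
    "matches_played": [35, 0, 10],
})
demo["goals_per_match"] = per_match(demo["goals"], demo["matches_played"])
print(demo)
```

Returning `NaN` rather than `0` keeps "no matches played" distinguishable from "played but did not score".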

In [None]:
# Visualize goals vs assists
plt.figure(figsize=(10, 6))
plt.scatter(df_clean['goals'], df_clean['assists'], s=200, alpha=0.6, c='blue')

# Label each point with the player's name
for player, goals, assists in zip(df_clean['player_name'], df_clean['goals'], df_clean['assists']):
    plt.annotate(player, (goals, assists),
                 xytext=(5, 5), textcoords='offset points', fontsize=9)

plt.xlabel('Goals', fontsize=12)
plt.ylabel('Assists', fontsize=12)
plt.title('Player Performance: Goals vs Assists', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Bar chart of goals per player
plt.figure(figsize=(10, 6))
plt.bar(df_clean['player_name'], df_clean['goals'], color='steelblue', alpha=0.7)
plt.xlabel('Player', fontsize=12)
plt.ylabel('Goals', fontsize=12)
plt.title('Total Goals by Player', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
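
The charts above are only displayed inline. If a copy on disk is wanted, a sketch like the following could be used. The helper name `save_goals_bar_chart` and the output location in the usage note are assumptions, not part of the project.

```python
from pathlib import Path

import matplotlib.pyplot as plt


def save_goals_bar_chart(names, goals, out_path) -> Path:
    """Draw a goals-per-player bar chart and write it to out_path."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.bar(names, goals, color="steelblue", alpha=0.7)
    ax.set_xlabel("Player")
    ax.set_ylabel("Goals")
    ax.set_title("Total Goals by Player")
    ax.tick_params(axis="x", labelrotation=45)
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)  # release the figure once it is on disk
    return out_path
```

For example, `save_goals_bar_chart(df_clean['player_name'], df_clean['goals'], '../reports/goals.png')`, where the `reports` folder is a placeholder.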

## 6. Save Processed Data

Save the cleaned and processed data.

In [None]:
# Save processed data
output_path = '../data/processed/football_data_processed.csv'
save_data(df_clean, output_path, format='csv')
print(f"✓ Data saved to {output_path}")
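
Whether `save_data` creates missing directories is not shown in this notebook. If it does not, the call above fails when `data/processed/` is absent. A defensive sketch using plain pandas, with `save_csv` as a hypothetical stand-in rather than the project's function:

```python
from pathlib import Path

import pandas as pd


def save_csv(df: pd.DataFrame, output_path) -> Path:
    """Write df as UTF-8 CSV, creating parent directories as needed."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the row index out of the file.
    df.to_csv(output_path, index=False, encoding="utf-8")
    return output_path
```

Writing with an explicit UTF-8 encoding keeps names such as "Kylian Mbappé" intact when the file is read back.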

## Next Steps

Now that you have completed this getting started notebook, you can:

1. Load your own football datasets
2. Customize the ETL pipeline for your specific needs
3. Add more advanced analysis and visualizations
4. Integrate with databases or APIs for data extraction
5. Create additional notebooks for specific analyses