# 🏅 Exploratory Data Analysis - Olympic Games Data

## Projekt_OS - Olympic Games Data Analysis

This notebook contains a thorough exploration of Olympic data from the Kaggle dataset "120 years of Olympic history: athletes and results".

### Dataset Information
- **Source**: [120 years of Olympic history: athletes and results](https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results)
- **File**: `athlete_events.csv`
- **Size**: ~200MB, 271,116 rows, 15 columns
- **Time period**: 1896-2016
- **Focus**: Canada (CAN)

### Contents
1. Imports and Data Loading
2. Basic Statistics
3. Visualizations
4. Canada-specific Analysis
5. Summary

## 1. Imports and Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Set the style for visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Load data
data_path = os.path.join('..', 'data', 'athlete_events.csv')
df = pd.read_csv(data_path)

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print("\nFirst rows:")
df.head()

## 2. Basic Statistics

### 2.1 Dataset Overview

In [None]:
# Dataset information
print("="*60)
print("DATASET OVERVIEW")
print("="*60)
print(f"Number of rows: {len(df):,}")
print(f"Number of columns: {len(df.columns)}")
print("\nData types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
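
Raw missing-value counts are hard to compare across columns, so a percentage view can help. The following is a minimal sketch on a small made-up sample, not the real dataset; `missing_summary` is a hypothetical helper name. Note that a missing `Medal` value in this dataset means "no medal won" rather than a data gap, so it should be read differently from a missing `Age` or `Height`.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical sample with the same column names as athlete_events.csv
sample = pd.DataFrame({
    "Age": [24.0, None, 31.0, 19.0],
    "Height": [180.0, None, None, 170.0],
    "Medal": [None, None, "Gold", None],
    "NOC": ["CAN", "USA", "SWE", "CAN"],
})

report = missing_summary(sample)
print(report)
```

In the notebook the same helper could be called as `missing_summary(df)`.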

### 2.2 Countries (NOC)

In [None]:
# a) Number of countries (NOC codes)
num_countries = df['NOC'].nunique()
print(f"a) Number of countries: {num_countries}")

# b) List of countries
countries = sorted(df['NOC'].unique())
print(f"\nb) Countries (first 20): {countries[:20]}")
print(f"\nTotal number of unique countries: {len(countries)}")


### 2.3 Sports


In [None]:
sports = sorted(df['Sport'].unique())
print(f"c) Total number of sports: {len(sports)}")
print(f"\nSports (first 20): {sports[:20]}")


### 2.4 Medal Types


In [None]:
medal_types = df['Medal'].dropna().unique()
print(f"d) Medal types: {medal_types}")
print(f"\nTotal number of medal entries: {df['Medal'].notna().sum()}")


### 2.5 Age Statistics


In [None]:
ages = df['Age'].dropna()
print(f"e) Age - Mean: {ages.mean():.1f}, Median: {ages.median():.1f}, "
      f"Min: {ages.min()}, Max: {ages.max()}, Std: {ages.std():.1f}")


## 3. Visualizations - Overview

### 3.1 Gender Distribution


In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# g) Gender distribution
gender_counts = df['Sex'].value_counts()
axes[0,0].pie(gender_counts.values, labels=gender_counts.index, 
              autopct='%1.1f%%', colors=['#FF6B6B', '#4ECDC4'])
axes[0,0].set_title('Gender distribution - All athletes')


### 3.2 Top 10 Countries - Medals


In [None]:
# h) Top 10 countries by medals
medal_counts = df[df['Medal'].notna()]['NOC'].value_counts().head(10)
axes[0,1].barh(medal_counts.index, medal_counts.values, color='#2A9D8F')
axes[0,1].invert_yaxis()
axes[0,1].set_title('Top 10 countries - Medals')
axes[0,1].set_xlabel('Number of medals')
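
The count above has one row per medal-winning athlete, so a team event such as ice hockey contributes one medal for every player on the roster. A sketch of counting one medal per event instead is shown below on a made-up sample. It assumes that the combination of `NOC`, `Games`, `Event` and `Medal` identifies a single medal, which would undercount the rare case where one country wins two medals of the same type in the same event.

```python
import pandas as pd

# Hypothetical sample with the same column names as athlete_events.csv
sample = pd.DataFrame({
    "NOC": ["CAN", "CAN", "CAN", "USA", "USA", "SWE"],
    "Games": ["2010 Winter"] * 6,
    "Event": [
        "Ice Hockey Men's Ice Hockey",
        "Ice Hockey Men's Ice Hockey",
        "Speed Skating Men's 500 metres",
        "Ice Hockey Men's Ice Hockey",
        "Ice Hockey Men's Ice Hockey",
        "Ice Hockey Men's Ice Hockey",
    ],
    "Medal": ["Gold", "Gold", "Bronze", "Silver", "Silver", None],
})

medal_rows = sample[sample["Medal"].notna()]

# One count per athlete row (what the notebook plots)
per_athlete = medal_rows["NOC"].value_counts()

# One count per event: collapse team members into a single medal
per_event = (
    medal_rows
    .drop_duplicates(subset=["NOC", "Games", "Event", "Medal"])["NOC"]
    .value_counts()
)

print(per_athlete)
print(per_event)
```

Which of the two counts is appropriate depends on the question: athlete rows measure how many people took home a medal, while event-level counts are closer to an official medal table.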


### 3.3 Medals over Time


In [None]:
# i) Medals over time
medals_over_time = df[df['Medal'].notna()].groupby('Year').size()
axes[1,0].plot(medals_over_time.index, medals_over_time.values, 
               marker='o', color='#E76F51', linewidth=2, markersize=4)
axes[1,0].set_title('Medals over time')
axes[1,0].set_xlabel('Year')
axes[1,0].set_ylabel('Number of medals')
axes[1,0].grid(True, alpha=0.3)

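
### 3.4 Age Distribution


In [None]:
# j) Age distribution
axes[1,1].hist(df['Age'].dropna(), bins=30, color='#264653', alpha=0.8, edgecolor='black')
axes[1,1].set_title('Age distribution - All athletes')
axes[1,1].set_xlabel('Age')
axes[1,1].set_ylabel('Frequency')
axes[1,1].axvline(ages.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {ages.mean():.1f}')
axes[1,1].legend()

# Work on the figure object directly: with the default inline backend the
# figure created in section 3.1 is closed after its cell, so plt.savefig()
# here would save a new, empty figure instead of the 2x2 overview.
fig.tight_layout()
fig.savefig('../figures/eda_overview.png', dpi=300, bbox_inches='tight')
fig  # display the completed overview as the cell output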


## 4. Canada-specific Analysis

### 4.1 Overview - Canada's Performance

In [None]:
# Filter for Canada
canada_df = df[df['NOC'] == 'CAN']

# Each row is one athlete in one event, so len() counts entries, not people
print(f"Total athlete-event entries for Canada: {len(canada_df)}")
print(f"Unique athletes from Canada: {canada_df['ID'].nunique()}")
print(f"Number of medals for Canada: {canada_df['Medal'].notna().sum()}")

# Medal distribution
canada_medals = canada_df[canada_df['Medal'].notna()]['Medal'].value_counts()
print("\nMedal distribution for Canada:")
print(canada_medals)
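
`value_counts()` sorts by frequency, so the medal types may print in any order. If a fixed Gold, Silver, Bronze order is preferred, the counts can be reindexed. This is a small sketch on made-up values; `MEDAL_ORDER` is a name introduced here for illustration.

```python
import pandas as pd

MEDAL_ORDER = ["Gold", "Silver", "Bronze"]

# Hypothetical medal column; None stands for "no medal"
sample = pd.Series(
    ["Bronze", "Gold", "Bronze", None, "Silver", "Bronze", "Gold"],
    name="Medal",
)

# value_counts() drops missing values by default; reindex fixes the order
# and fills in 0 for any medal type that never occurs
counts = sample.value_counts().reindex(MEDAL_ORDER, fill_value=0)
print(counts)
```

Applied to the notebook, the same pattern would be `canada_df['Medal'].value_counts().reindex(MEDAL_ORDER, fill_value=0)`.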


### 4.2 Canada's Top Sports

In [None]:
# Top sports for Canada
canada_top_sports = canada_df[canada_df['Medal'].notna()]['Sport'].value_counts().head(10)

fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(canada_top_sports.index, canada_top_sports.values, color='#E63946')
ax.invert_yaxis()
ax.set_title('Canada - Top 10 sports by number of medals', fontsize=14, fontweight='bold')
ax.set_xlabel('Number of medals')
plt.tight_layout()
plt.savefig('../figures/canada_top_sports.png', dpi=300, bbox_inches='tight')
plt.show()
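
The `savefig` calls in this notebook write to `../figures`, and matplotlib does not create missing directories, so saving fails on a fresh checkout if that folder is absent. A minimal sketch of a guard is shown below; `ensure_figures_dir` is a hypothetical helper, and the example uses a temporary directory so it can run anywhere.

```python
import tempfile
from pathlib import Path


def ensure_figures_dir(path) -> Path:
    """Create the output directory (and parents) if it does not exist yet."""
    figures_dir = Path(path)
    figures_dir.mkdir(parents=True, exist_ok=True)
    return figures_dir


# Demonstrate on a temporary location instead of the real ../figures
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_figures_dir(Path(tmp) / "figures")
    created = target.is_dir()
    ensure_figures_dir(target)  # calling it again is harmless

print(created)
```

In the notebook this could be called once near the imports, for example `ensure_figures_dir('../figures')`.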


### 4.3 Canada's Medals per Olympic Games

In [None]:
# Canada medals per Olympics
canada_medals_year = canada_df[canada_df['Medal'].notna()].groupby('Year').size()

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(canada_medals_year.index, canada_medals_year.values, 
        marker='o', color='#1D3557', linewidth=2, markersize=6)
ax.set_title('Canada - Medals per Olympic Games', fontsize=14, fontweight='bold')
ax.set_xlabel('Year')
ax.set_ylabel('Number of medals')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('../figures/canada_medals_per_year.png', dpi=300, bbox_inches='tight')
plt.show()
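
Grouping by `Year` alone mixes Summer and Winter Games. Through 1992 both were held in the same year and are summed into one point, while from 1994 onward they alternate, which makes the line zigzag. One possible way to separate them is to group by the dataset's `Season` column as well. The sketch below uses a small made-up sample rather than the real Canadian rows.

```python
import pandas as pd

# Hypothetical sample with the same column names as athlete_events.csv
sample = pd.DataFrame({
    "Year": [1992, 1992, 1994, 1996, 1996, 1998],
    "Season": ["Summer", "Winter", "Winter", "Summer", "Summer", "Winter"],
    "Medal": ["Gold", "Silver", None, "Bronze", "Gold", "Gold"],
})

# Count medal rows per (Year, Season), then spread seasons into columns
medals_by_season = (
    sample[sample["Medal"].notna()]
    .groupby(["Year", "Season"])
    .size()
    .unstack("Season", fill_value=0)
)
print(medals_by_season)
```

Each column can then be drawn as its own line. The zeros produced by `fill_value=0` mark years in which that season's Games were not held (or no medal was won), so it may be cleaner to drop them before plotting.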


### 4.4 Canada's Age Distribution

In [None]:
# Canada age distribution
canada_ages = canada_df['Age'].dropna()

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(canada_ages, bins=30, color='#457B9D', alpha=0.8, edgecolor='black')
ax.axvline(canada_ages.mean(), color='red', linestyle='--', linewidth=2, 
           label=f'Mean: {canada_ages.mean():.1f} years')
ax.set_title('Canada - Age distribution', fontsize=14, fontweight='bold')
ax.set_xlabel('Age')
ax.set_ylabel('Frequency')
ax.legend()
plt.tight_layout()
plt.savefig('../figures/canada_age_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print("Canada - Age statistics:")
print(f"Mean: {canada_ages.mean():.1f} years")
print(f"Median: {canada_ages.median():.1f} years")
print(f"Min: {canada_ages.min()} years")
print(f"Max: {canada_ages.max()} years")


## 5. Summary

This notebook carried out a thorough exploratory data analysis of Olympic data, covering:

- ✅ Dataset overview and basic statistics
- ✅ Visualizations of gender distribution, top countries, medals over time, and age distribution
- ✅ In-depth analysis of Canada's performance at the Olympic Games

**Next step**: Use this analysis as the basis for the dashboard application in Task 1-3.