# Pandas application: Group project workshop (Introduction)

_This week is dedicated to applying the [Pandas](https://pandas.pydata.org/) skills you've developed in Weeks 04 and 05 to your Olympics group project. This application week bridges the gap between learning data wrangling techniques and implementing them in your assessment._

Note: This Jupyter Notebook was originally compiled by Alex Reppel (AR) based on conversations with [ClaudeAI](https://claude.ai/) *(version 3.5 Sonnet)*. For this year's materials, further revisions were made using [Claude Code](https://www.anthropic.com/claude-code) *(Sonnet 4.5)*, including updated documentation and git commit messages.

## Overview

### The 3-week learning block

This week completes the Pandas learning block:

- **Week 04**: Basic Pandas principles (DataFrames, filtering, grouping, merging)
- **Week 05**: Advanced Pandas techniques (reshaping, pivot tables, MultiIndex)
- **Week 06**: Application to your Olympics group project ← **You are here**

### Week 06 structure

Unlike previous teaching weeks, this week focuses on **application and practice**:

- **No new concepts** – consolidate what you've learnt
- **No exercises notebook** – you'll work on your actual project
- **Demonstration as reference** – worked example you can adapt
- **Group project time** – apply techniques to Olympics dataset

## Your group project (Week 03 reminder)

In Week 03, you were introduced to the Olympics group assessment. Let's recap the key requirements:

### Dataset

**[120 years of Olympic history: athletes and results](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results)**

- 271,116 rows (athlete-event combinations)
- 15 columns including: ID, Name, Sex, Age, Height, Weight, Team, NOC, Games, Year, Season, City, Sport, Event, Medal
- Data from Athens 1896 to Rio 2016

### Core requirements

Your group must complete these essential tasks:

1. **Data loading**: Load athlete_events.csv into a Pandas DataFrame
2. **Data cleaning**: Handle missing values, convert data types, remove duplicates
3. **Data wrangling**: Create derived columns (Age_Group, Full_Name, Century)
4. **Data analysis**: Calculate statistics (average ages, top countries, medal leaders)
5. **Data visualisation**: Create meaningful visualisations with interpretations
6. **Export results**: Save cleaned data and visualisations

Review the [Week 03 Introduction notebook](../../Week03/Introduction.ipynb) for complete assessment details, marking criteria, and stretch goals.

## How to use Week 06 materials

### 1. Use Week03/Template.ipynb for your project

The [Week03/Template.ipynb](../../Week03/Template.ipynb) provides the scaffolding for your group submission:

- Pre-structured sections matching assessment requirements
- Placeholder cells for your code and analysis
- Markdown sections for introduction, literature review, conclusion
- Reference list template

**Action**: Copy Template.ipynb and rename it for your group (e.g., `Group_A_Olympics_Analysis.ipynb`)

### 2. Use this week's Demonstration as a worked example

The [Week06 Demonstration](Demonstration.ipynb) shows you how to apply Week 04 and Week 05 techniques to a dataset that parallels the Olympics structure:

- **Part 1**: Review basic Pandas on familiar data (Setup week sales data)
- **Part 2**: Complete worked example on Olympics-parallel dataset (company sales data)
  - Shows basic techniques (Week 04) with full working code
  - Suggests advanced extensions (Week 05) for stretch goals
- **Part 3**: Explicit guidance on transferring techniques to your Olympics dataset

**Why a parallel dataset?**

Rather than solving the Olympics assignment for you, the Demonstration uses employee sales data with the same structure:

| Olympics Dataset | Demonstration Dataset |
|------------------|------------------------|
| athlete (Name) | employee (name) |
| NOC/Team | region |
| Games, Year, Season | quarter, year, half |
| Sport, Event | product_category, product |
| Medal (Gold/Silver/Bronze/NA) | award (Gold/Silver/Bronze/NA) |
| Age, Height, Weight | age, height_cm, weight_kg |

This allows you to:
- See complete solutions to similar problems
- Understand the logic and approach
- Adapt the code to your Olympics dataset
- Develop your own analytical skills rather than copying

## Recommended workflow

Follow this workflow for your group project session:

### Before the session (individual preparation)

1. **Review Week 04 Demonstration** – refresh basic Pandas operations
2. **Review Week 05 Demonstration** – understand data reshaping techniques
3. **Read Week 03 Introduction** – understand project requirements clearly
4. **Download Olympics dataset** – have athlete_events.csv ready

### During the session (group work)

#### Phase 1: Setup (15 minutes)

1. **Designate roles**:
   - Code writer (shares screen, types code)
   - Navigator (guides approach, references Demonstration)
   - Documenter (writes explanations in markdown cells)
   - Rotate roles throughout session

2. **Setup project structure**:
   - Copy Week03/Template.ipynb
   - Add all group members' names and IDs
   - Place athlete_events.csv in same directory

#### Phase 2: Data loading and exploration (20 minutes)

1. Load the dataset into Pandas
2. Use `.head()`, `.info()`, `.describe()` to explore
3. Discuss initial observations as a group
4. Reference **Week06 Demonstration Part 2, Section 1** for loading examples
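
The loading and exploration steps above can be sketched as follows. This is a minimal illustration, not the Demonstration's code: the three rows are made up and use only a subset of the real column names, so the example runs without the full file. In your project you would read `athlete_events.csv` instead.

```python
import io

import pandas as pd

# Made-up sample rows (illustration only) with a few of the real column names.
# In your project, replace this with: df = pd.read_csv("athlete_events.csv")
sample_csv = io.StringIO(
    "ID,Name,Sex,Age,Team,NOC,Year,Sport,Medal\n"
    "1,Athlete A,M,24,Team X,XXX,1992,Basketball,\n"
    "2,Athlete B,F,21,Team Y,YYY,2012,Judo,Gold\n"
    "3,Athlete C,M,,Team X,XXX,2016,Rowing,Bronze\n"
)
df = pd.read_csv(sample_csv)

print(df.head())      # first rows of the DataFrame
df.info()             # dtypes and non-null counts (prints directly)
print(df.describe())  # summary statistics for numeric columns
print(df.shape)       # (number of rows, number of columns)
```

By default, `pd.read_csv()` reads empty fields and the literal string `NA` as `NaN`, which is why columns such as Age and Medal show missing values straight after loading.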

#### Phase 3: Data cleaning (30 minutes)

1. **Identify missing values**:
   - Use `.isnull().sum()` to count NaN values per column
   - Discuss strategy: drop, fill, or leave as-is?
   - Document your decisions with markdown explanations

2. **Fix data types**:
   - Convert Year to datetime (requirement)
   - Check other columns for type issues

3. **Remove duplicates**:
   - Check for and remove duplicate rows
   - Document how many were removed

**Reference**: Week06 Demonstration Part 2, Section 2 (Data Cleaning)
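
As a rough sketch of the three cleaning steps above, assuming a small made-up DataFrame (the rows and values are illustrative, not taken from the Olympics data):

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the last row repeats the first exactly.
df = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B", "Athlete C", "Athlete A"],
    "Age": [24.0, np.nan, 30.0, 24.0],
    "Year": [1992, 2012, 2016, 1992],
    "Medal": [np.nan, "Gold", np.nan, np.nan],
})

# 1. Count missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# 2. Convert the integer Year into a datetime column
df["Year_dt"] = pd.to_datetime(df["Year"], format="%Y")

# 3. Remove rows identical in every column and record how many were dropped
rows_before = len(df)
df = df.drop_duplicates().reset_index(drop=True)
duplicates_removed = rows_before - len(df)
print(f"Removed {duplicates_removed} duplicate row(s)")
```

Recording `duplicates_removed` in a variable makes it easy to quote the figure in the markdown cell where you document the decision.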

#### Phase 4: Data wrangling (30 minutes)

Create the required new columns:

1. **Age_Group**: Categorise athletes by age
   - Decide on age bands (e.g., 0-18, 19-25, 26-35, 36+)
   - Use conditional logic or `pd.cut()`

2. **Full_Name**: Combine name components
   - Note: the Olympics data already has full names in the 'Name' column
   - Consider if you need to parse or clean names

3. **Century**: Extract century from Year
   - Calculate from Year column
   - Format appropriately (e.g., '19th', '20th', '21st')

**Reference**: Week06 Demonstration Part 2, Section 3 (Data Wrangling)
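
One possible way to build the three derived columns is sketched below. The sample rows are made up, the age bands are the example bands listed above, and treating `Full_Name` as a whitespace-trimmed copy of `Name` is an assumption; your group may decide on something different.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
df = pd.DataFrame({
    "Name": [" Athlete A", "Athlete B ", "Athlete C", "Athlete D"],
    "Age": [17.0, 23.0, 40.0, np.nan],
    "Year": [1896, 1900, 2000, 2016],
})

# Age_Group: bins are right-inclusive, so (0, 18], (18, 25], (25, 35], (35, inf)
age_bins = [0, 18, 25, 35, np.inf]
age_labels = ["0-18", "19-25", "26-35", "36+"]
df["Age_Group"] = pd.cut(df["Age"], bins=age_bins, labels=age_labels)

# Full_Name: assumed here to be Name with surrounding whitespace removed
df["Full_Name"] = df["Name"].str.strip()

# Century: 1801-1900 is the 19th, 1901-2000 the 20th, 2001-2100 the 21st
century_number = (df["Year"] - 1) // 100 + 1
df["Century"] = century_number.map({19: "19th", 20: "20th", 21: "21st"})

print(df)
```

Rows with a missing Age stay missing in `Age_Group`. The century formula follows the strict convention in which 1900 belongs to the 19th century and 2000 to the 20th; if you prefer to count 2000 as the 21st century, use a different formula and document why.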

#### Phase 5: Initial analysis (30 minutes)

Begin the core analytical tasks:

1. **Average age by event**:
   - Use `.groupby()` with `.mean()`
   - Consider handling NaN ages

2. **Top 10 countries by gold medals**:
   - Filter for Gold medals only
   - Group by NOC, count medals
   - Sort and select top 10

3. **Most decorated athletes by sport**:
   - Group by Sport and athlete Name
   - Count medals per athlete
   - Find maximum in each sport

**Reference**: Week06 Demonstration Part 2, Section 4 (Data Analysis)
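
The three analyses above might look like this on a small made-up DataFrame. The names, codes, and events are placeholders, and the tie-breaking behaviour noted in the comments is a property of this particular approach rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
df = pd.DataFrame({
    "Name":  ["A", "A", "B", "C", "D", "E"],
    "Age":   [22.0, 26.0, 30.0, np.nan, 28.0, 19.0],
    "NOC":   ["XXX", "XXX", "YYY", "YYY", "XXX", "ZZZ"],
    "Sport": ["Judo", "Judo", "Judo", "Rowing", "Rowing", "Rowing"],
    "Event": ["Event 1", "Event 1", "Event 1", "Event 2", "Event 2", "Event 2"],
    "Medal": ["Gold", "Gold", "Silver", "Gold", np.nan, "Bronze"],
})

# 1. Average age by event (.mean() skips NaN ages by default)
avg_age_by_event = df.groupby("Event")["Age"].mean()

# 2. Top 10 countries by gold medals
gold_by_noc = (
    df[df["Medal"] == "Gold"]
    .groupby("NOC")
    .size()
    .sort_values(ascending=False)
    .head(10)
)

# 3. Most decorated athlete in each sport
medal_counts = (
    df.dropna(subset=["Medal"])
    .groupby(["Sport", "Name"])
    .size()
    .reset_index(name="Medals")
)
# idxmax returns the first row with the maximum, so ties keep only one athlete
top_per_sport = medal_counts.loc[medal_counts.groupby("Sport")["Medals"].idxmax()]

print(avg_age_by_event, gold_by_noc, top_per_sport, sep="\n\n")
```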

#### Phase 6: Plan next steps (15 minutes)

1. Review what you've completed
2. Identify remaining essential requirements
3. Assign tasks for group members to complete outside session
4. Schedule your next group meeting
5. Discuss which stretch goals to attempt

### After the session (distributed work)

- Complete remaining data analysis tasks
- Create visualisations (Weeks 07-09 will help with this)
- Write introduction, literature review, conclusion
- Format references properly
- Review and polish as a group

## Common challenges and solutions

Based on previous cohorts, here are challenges you might encounter:

### Challenge 1: Missing values in Medal column

**Issue**: Most athletes don't win medals, so the Medal column is mostly NaN

**Solution**: This is expected! For medal counts:
```python
# Example code (adapt for your Olympics dataset):
# df[df['Medal'].notna()].groupby('NOC')['Medal'].count()

# Or count specific medal types:
# df[df['Medal'] == 'Gold'].groupby('NOC').size()
```

### Challenge 2: Converting Year to datetime

**Issue**: Year is an integer, not a full date

**Solution**: Create a datetime from the year alone; Pandas fills in 1 January as the month and day:
```python
# Example code (adapt for your Olympics dataset):
# df['Year_dt'] = pd.to_datetime(df['Year'], format='%Y')
```

### Challenge 3: Duplicate athlete entries

**Issue**: Athletes appear multiple times (different events, different years)

**Solution**: This is expected! Each row is an athlete-event combination, not a unique athlete. Only remove true duplicates:
```python
# Example code (adapt for your Olympics dataset):
# df.drop_duplicates()  # Removes rows that are identical in ALL columns
```

### Challenge 4: Team vs Individual sports

**Issue**: In team sports, multiple athletes get the same medal

**Solution**: Decide on your analysis approach and document it:
- Count medals per athlete (includes team medals)
- Or count medals per event (one medal = one event win)
- Explain your choice in your report
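
To see how much the two counting approaches can differ, here is a small sketch with made-up rows: a four-person relay team sharing one gold, and one individual gold. The column combination used to identify "one medal" (`Games`, `Event`, `NOC`, `Medal`) is an assumption; it would undercount in the rare cases where one country wins the same medal twice in a single event.

```python
import pandas as pd

# Made-up rows: a four-person relay team shares one gold; one individual gold.
df = pd.DataFrame({
    "Name":  ["A", "B", "C", "D", "E"],
    "NOC":   ["XXX", "XXX", "XXX", "XXX", "YYY"],
    "Games": ["2016 Summer"] * 5,
    "Event": ["Relay", "Relay", "Relay", "Relay", "Sprint"],
    "Medal": ["Gold"] * 5,
})

medals = df.dropna(subset=["Medal"])

# Approach 1: one count per medal-winning athlete (team medals counted per member)
medals_per_athlete = medals.groupby("NOC").size()

# Approach 2: one count per medal-winning event (assumed key: Games, Event, NOC, Medal)
medals_per_event = (
    medals.drop_duplicates(subset=["Games", "Event", "NOC", "Medal"])
    .groupby("NOC")
    .size()
)

print(medals_per_athlete)
print(medals_per_event)
```

In this toy example the relay team's country has four medals under the first approach and one under the second, which is why the choice needs explaining in your report.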

### Challenge 5: Historical country name changes

**Issue**: Some NOC codes represent countries that no longer exist (e.g., USSR, East Germany)

**Solution**: 
- Keep historical NOC codes as-is (recommended for basic requirement)
- Or combine historical/modern codes (good stretch goal)
- Document your approach

**Reference**: Week06 Demonstration Part 3 discusses these patterns
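
If you attempt the stretch goal of combining codes, one simple pattern is a mapping dictionary applied with `.replace()`. The mapping below is only an illustrative choice (for example, folding the Soviet Union into Russia ignores the other former Soviet republics), and the rows are made up; keep the original `NOC` column so that your basic analysis is unaffected.

```python
import pandas as pd

# Made-up rows for illustration only
df = pd.DataFrame({
    "NOC":   ["URS", "RUS", "GDR", "FRG", "GER", "USA"],
    "Medal": ["Gold"] * 6,
})

# Illustrative mapping of historical codes to a modern code (an analytical choice)
noc_mapping = {"URS": "RUS", "GDR": "GER", "FRG": "GER"}

# Codes not listed in the mapping are left unchanged
df["NOC_combined"] = df["NOC"].replace(noc_mapping)

combined_counts = df.groupby("NOC_combined").size()
print(combined_counts)
```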

## Stretch goals: Using advanced Pandas (Week 05)

Once you've completed the essential requirements, consider these advanced techniques from Week 05:

### Reshaping for analysis

- Use `pivot_table()` to create medal count tables by country and year
- Use `melt()` to reshape wide-format data for visualisation
- Create cross-tabulations of sports and seasons
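
A minimal sketch of these three reshaping ideas, using made-up rows rather than the real dataset:

```python
import pandas as pd

# Made-up medal-winning rows for illustration only
df = pd.DataFrame({
    "NOC":    ["XXX", "XXX", "YYY", "YYY", "XXX"],
    "Year":   [2012, 2016, 2012, 2014, 2016],
    "Sport":  ["Judo", "Judo", "Rowing", "Biathlon", "Rowing"],
    "Season": ["Summer", "Summer", "Summer", "Winter", "Summer"],
    "Medal":  ["Gold", "Silver", "Gold", "Bronze", "Gold"],
})

# Wide table: one row per country, one column per year, cells are medal counts
medal_table = df.pivot_table(
    index="NOC", columns="Year", values="Medal", aggfunc="count", fill_value=0
)

# Back to long format (one row per country-year), often easier to plot
long_format = medal_table.reset_index().melt(
    id_vars="NOC", var_name="Year", value_name="Medals"
)

# Cross-tabulation: number of rows for each Sport and Season combination
sport_by_season = pd.crosstab(df["Sport"], df["Season"])

print(medal_table, long_format, sport_by_season, sep="\n\n")
```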

### Multi-dimensional analysis

- Create MultiIndex DataFrames for hierarchical analysis (Country → Sport → Event)
- Use `.xs()` for cross-sectional analysis
- Apply `stack()` and `unstack()` for complex reshaping
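
The hierarchical ideas above can be tried out on a tiny made-up example. Grouping by several columns is one common way to obtain a MultiIndex; it is used here as an assumption about how you might structure the analysis.

```python
import pandas as pd

# Made-up medal-winning rows for illustration only
df = pd.DataFrame({
    "NOC":   ["XXX", "XXX", "XXX", "YYY", "YYY"],
    "Sport": ["Judo", "Judo", "Rowing", "Judo", "Rowing"],
    "Event": ["Event 1", "Event 2", "Event 3", "Event 1", "Event 3"],
    "Medal": ["Gold", "Silver", "Gold", "Bronze", "Gold"],
})

# Series with a three-level MultiIndex: Country -> Sport -> Event
medal_counts = df.groupby(["NOC", "Sport", "Event"])["Medal"].count()

# Cross-section: all Judo entries, across every country and event
judo_only = medal_counts.xs("Judo", level="Sport")

# unstack() moves the Sport level into columns (countries as rows)
by_sport = (
    medal_counts.groupby(level=["NOC", "Sport"]).sum().unstack("Sport", fill_value=0)
)

# stack() reverses this, returning to one row per (NOC, Sport) pair
stacked_again = by_sport.stack()

print(judo_only, by_sport, stacked_again, sep="\n\n")
```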

### Trend analysis

- Calculate rolling averages of participation over time
- Analyse medal count trends by country across decades
- Compare athlete characteristics (age, height, weight) across eras
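
As a sketch of the trend ideas above, assuming participation is measured as the number of unique athlete IDs per year (one possible definition) and using made-up rows:

```python
import pandas as pd

# Made-up rows for illustration; athlete 4 enters two events in 2000
df = pd.DataFrame({
    "ID":   [1, 2, 1, 3, 4, 4, 2, 5, 6, 7],
    "Year": [1996, 1996, 2000, 2000, 2000, 2000, 2004, 2004, 2004, 2004],
    "Age":  [20, 30, 24, 26, 28, 28, 34, 22, 24, 26],
})

# Participation: unique athletes per year, so multi-event athletes count once
participation = df.groupby("Year")["ID"].nunique()

# Rolling average over two consecutive Games (the first value has no full window)
rolling_participation = participation.rolling(window=2).mean()

# Era comparison: mean age of entries by decade
df["Decade"] = (df["Year"] // 10) * 10
mean_age_by_decade = df.groupby("Decade")["Age"].mean()

print(participation, rolling_participation, mean_age_by_decade, sep="\n\n")
```

Since 1994 the Summer and Winter Games have been held in different years, so on the real dataset you would probably filter by `Season` before computing a rolling average.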

### Advanced aggregations

- Apply multiple aggregation functions simultaneously
- Create sophisticated summary statistics
- Use method chaining for efficient data pipelines
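
Several of these ideas can be combined in one chained pipeline. The sketch below uses made-up rows and hypothetical output column names (`entries`, `mean_age`, `max_height`, `medals`); the named-aggregation syntax `new_name=(column, function)` applies a different function to each column in a single `.agg()` call.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
df = pd.DataFrame({
    "Sport":  ["Judo", "Judo", "Rowing", "Rowing", "Rowing"],
    "Age":    [22.0, 28.0, 30.0, np.nan, 24.0],
    "Height": [170.0, 180.0, 190.0, 185.0, np.nan],
    "Medal":  ["Gold", np.nan, np.nan, "Silver", "Gold"],
})

summary = (
    df
    .assign(Won_Medal=lambda d: d["Medal"].notna())  # True where a medal was won
    .groupby("Sport")
    .agg(
        entries=("Age", "size"),        # number of rows, including missing ages
        mean_age=("Age", "mean"),       # NaN ages are skipped
        max_height=("Height", "max"),
        medals=("Won_Medal", "sum"),    # True counts as 1
    )
    .sort_values("medals", ascending=False)
)

print(summary)
```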

**Reference**: Week06 Demonstration Part 2 includes "Advanced placeholder" sections suggesting how to extend basic analyses

## Further resources

### Project-specific resources

- **Week 03 Introduction** – Complete assessment brief, marking criteria, deliverables
- **Week 03 Template** – Jupyter Notebook structure for your submission
- **Week 06 Demonstration** – Worked example with Olympics-parallel dataset

### Pandas technique resources

- **Week 04 Materials** – Basic Pandas operations reference
- **Week 05 Materials** – Advanced reshaping and aggregation techniques
- [Pandas Documentation: Working with missing data](https://pandas.pydata.org/docs/user_guide/missing_data.html)
- [Pandas Documentation: Group by operations](https://pandas.pydata.org/docs/user_guide/groupby.html)

### Olympics dataset resources

- [Kaggle Dataset Page](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results) – Dataset description and context
- [Kaggle Notebooks](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results/code) – Community analyses (for inspiration, not copying!)

### Academic resources

- Lozano, S., Villa, G., Guerrero, F. and Cortés, P. (2002). "[Measuring the performance of nations at the Summer Olympics using data envelopment analysis](https://doi.org/10.1057/palgrave.jors.2601327)". *Journal of the Operational Research Society*, 53(5), 501-511.
  - Example of academic analysis of Olympics data

### Writing and presentation

- Bowden, J. (2011). *[Writing a Report: How to Prepare, Write and Present Really Effective Reports](https://ebookcentral-proquest-com.royalholloway.idm.oclc.org/lib/rhul/detail.action?docID=471307)*. How To Books.
  - Guidance on formulating aims and objectives

### Getting help

- **Moodle Q&A Forum** – Ask questions, share challenges
- **In-session support** – Use session time to ask questions
- **Office hours** – For more detailed guidance
- **Group peers** – Discuss approaches (but write your own code!)

## Summary

This week is your opportunity to:

- **Apply** the Pandas skills from Weeks 04 and 05 to a real project
- **Work collaboratively** with your group on the Olympics dataset
- **Practice** data wrangling in a realistic analytical context
- **Build** the foundation for your group assessment

Remember:
- Use **Week03/Template.ipynb** for your actual project work
- Use **Week06/Demonstration.ipynb** as a worked example to guide your approach
- **Document your decisions** – explain why you chose each approach
- **Collaborate actively** – discuss, debate, and decide together
- **Ask questions** – use this session to get support

## Next steps

1. Review the [Demonstration](Demonstration.ipynb) notebook to see a complete worked example
2. Open [Week03/Template.ipynb](../../Week03/Template.ipynb) and start your group project
3. Follow the recommended workflow outlined above
4. Reference the Demonstration as you work through each section

Good luck with your Olympics data analysis!