<a href="https://colab.research.google.com/github/TCU-DCDA/WRIT20833-2025/blob/main/notebooks/codeAlongs/WRIT20833_Data_Cleaning_Analysis_Pandas_F25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Cleaning & Analysis with Pandas
## From Messy Cultural Data to Meaningful Insights

Welcome to the world of **real-world cultural data** - which is almost always messy and inconsistent, and which requires significant cleaning before analysis. Today we'll learn advanced pandas techniques that transform chaotic datasets into clean, analyzable information.

In actual cultural research, data rarely comes pre-packaged and ready for analysis. Historical records have spelling variations, contemporary datasets have missing information, and scraped data contains formatting inconsistencies. This lesson focuses on the essential skills of **data cleaning** and **advanced analysis** that turn messy cultural materials into research-ready datasets.

### 📚 How This Lesson Works:
This is an **advanced demonstration notebook** that builds on pandas fundamentals. We'll work through real examples of messy cultural data and cleaning techniques.

**🎯 Ready to practice with your own messy data?** After this lesson, use the companion **Student Practice Notebook**:
- **📝 `WRIT20833_Data_Cleaning_Student_Practice_F25.ipynb`** (in the homework folder)
- Apply these cleaning techniques to your own cultural dataset
- Work through guided exercises with real data challenges
- Submit your cleaned analysis for assessment

### What We'll Learn Today:
- **Handling Missing Data**: Strategies for incomplete cultural records
- **Text Cleaning**: Standardizing names, places, and categories using pandas string methods
- **Data Transformation**: Reshaping and reorganizing cultural datasets
- **Grouping & Aggregation**: Comparing patterns across categories and time periods
- **Advanced Filtering**: Complex queries for sophisticated cultural analysis

Think of today's lesson as training to become cultural data archaeologists: carefully cleaning and reconstructing fragmented information to reveal hidden patterns in cultural history.

### ⚠️ Important: Dataset Requirements for This Lesson
**This advanced lesson works best with datasets that have:**
- **Rich string/text data**: Author names, titles, genres, locations, descriptions
- **Categorical columns**: Classifications that can be standardized and grouped
- **Numeric data**: Values for calculations, aggregations, and mathematical analysis
- **Mixed data types**: A healthy combination of text and numbers

**⚠️ Limitations to Consider:**
- **Text-only datasets**: Grouping and aggregation will be limited without numeric columns
- **Pure numeric datasets**: String cleaning methods won't be applicable
- **Very small datasets**: Statistical patterns may not be meaningful
- **Highly structured data**: May not need the extensive cleaning we'll practice

**Before using your own data**, check that it includes a mix of messy text and numeric values. If your dataset is primarily text-based, focus on the string cleaning sections. If it's primarily numeric, focus on the grouping and aggregation techniques.
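
One quick way to run that check is to split the columns by broad type. The sketch below uses a small made-up DataFrame (`sample_df` is a hypothetical stand-in for your own data, not part of this lesson's dataset) and only standard pandas calls.

```python
import pandas as pd

# Hypothetical stand-in for your own dataset; in practice you would load it
# with something like pd.read_csv("your_file.csv").
sample_df = pd.DataFrame({
    "title": ["Emma", "Dracula", "Kidnapped"],
    "genre": ["romance", "Horror", "adventure"],
    "pages": [474, 418, None],
})

# Numeric columns support aggregation; the remaining columns are
# candidates for string cleaning.
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()
non_numeric_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print(f"Rows: {len(sample_df)}")
print(f"Numeric columns: {numeric_cols}")
print(f"Non-numeric columns: {non_numeric_cols}")
```

If either list comes back empty, that is a hint to focus on the matching half of this lesson.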

## Part 1: Setting Up Our Messy Cultural Dataset

Let's start with a realistic scenario: you've found a dataset of historical literary publications, but it's messy and inconsistent - just like real cultural data!

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Set display options
pd.options.display.max_rows = 100
pd.options.display.max_columns = 20

In [None]:
# Create a realistically messy cultural dataset
# This simulates the kind of inconsistent data you'd find in historical records

messy_cultural_data = {
    'title': ['pride and prejudice', 'JANE EYRE', 'Wuthering Heights', 'emma', 'Frankenstein', 
              'dracula', 'The Picture of Dorian Gray', 'the importance of being earnest', 
              'The Strange Case of Dr. Jekyll and Mr. Hyde', 'TREASURE ISLAND',
              'Alice\'s Adventures in Wonderland', 'through the looking glass', 
              'The Time Machine', 'the war of the worlds', 'The Invisible Man'],
    'author_name': ['Jane Austen', 'charlotte bronte', 'Emily Brontë', 'jane austen', 'mary shelley',
                   'Bram Stoker', 'Oscar Wilde', 'oscar wilde', 'Robert Louis Stevenson', 'r.l. stevenson',
                   'Lewis Carroll', 'lewis carroll', 'H.G. Wells', 'h.g. wells', 'H.G. Wells'],
    'publication_year': [1813, 1847, 1847, 1815, 1818, 1897, 1890, 1895, 1886, 1883, 
                        1865, 1871, 1895, 1898, 1897],
    'genre': ['Romance', 'gothic', 'Gothic', 'romance', 'Science Fiction', 'Horror', 
             'Philosophical Fiction', 'Comedy', 'Horror', 'adventure', 'Fantasy', 
             'fantasy', 'science fiction', 'Science Fiction', 'sci-fi'],
    'pages': [432, None, 416, 474, 280, 418, 254, None, 144, 292, 200, 228, 104, 192, 153],
    'setting_country': ['England', 'england', 'England', 'England', 'Switzerland/Germany', 
                       'Romania/England', 'England', 'England', 'Scotland', 'Treasure Island',
                       'Wonderland', 'Wonderland', 'England', 'England', 'England'],
    'female_protagonist': ['Yes', 'yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No',
                          'Yes', 'Yes', 'No', 'No', 'No'],
    'modern_adaptations': [15, 12, 8, 6, 25, 35, 5, 3, 18, 12, 20, 4, 15, 8, 10]
}

# Convert to DataFrame
books_df = pd.DataFrame(messy_cultural_data)

print("üìö Created messy literary dataset!")
print(f"Dataset contains {len(books_df)} classic literary works")
books_df

### 🔍 Identifying Data Problems

Let's examine what makes this dataset "messy" - these are common issues in real cultural data:

In [None]:
# Check for missing data
print("Missing data summary:")
print(books_df.isnull().sum())
print("\n" + "="*50 + "\n")

# Look at inconsistent formatting
print("Unique genres (notice inconsistencies):")
print(books_df['genre'].unique())
print("\n" + "="*50 + "\n")

print("Unique author names (notice duplicates):")
print(books_df['author_name'].unique())
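
Eyeballing `.unique()` works for a small table. For larger data, a rough numeric signal is to compare how many distinct values a column has before and after a simple normalization. This is a minimal sketch on a made-up `genres` Series; the normalization (strip plus lowercase) is an assumption about which inconsistencies matter.

```python
import pandas as pd

# Made-up sample that mirrors the kind of inconsistency seen above
genres = pd.Series(["Romance", "gothic", "Gothic", "romance", "Science Fiction", "sci-fi"])

raw_unique = genres.nunique()
normalized_unique = genres.str.strip().str.lower().nunique()

print(f"Distinct raw values: {raw_unique}")
print(f"Distinct values after strip + lowercase: {normalized_unique}")
```

A gap between the two counts means some categories differ only by case or whitespace. Synonyms such as "sci-fi" and "science fiction" still count as separate values, so they need a manual mapping.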

## Part 2: Handling Missing Data

Missing data is common in cultural datasets. Let's learn strategies for dealing with it responsibly.

In [None]:
# Identify rows with missing data
print("Rows with missing page data:")
missing_pages = books_df[books_df['pages'].isnull()]
print(missing_pages[['title', 'author_name', 'pages']])

In [None]:
# Strategy 1: Fill missing values with meaningful estimates
# For pages, we could use the median page count for similar genres

# Calculate median pages by genre (median() skips missing values).
# Genre labels are still uncleaned here, so 'gothic' and 'Gothic' form separate groups.
genre_medians = books_df.groupby('genre')['pages'].median()
print("Median pages by genre:")
print(genre_medians)

# Create a copy for our cleaned data
books_cleaned = books_df.copy()

In [None]:
# Fill missing pages with overall median (simple approach)
overall_median_pages = books_df['pages'].median()
books_cleaned['pages'] = books_cleaned['pages'].fillna(overall_median_pages)

print(f"Filled missing page counts with median: {overall_median_pages}")
print("\nChecking our work:")
print(books_cleaned[['title', 'pages']].loc[books_df['pages'].isnull()])
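
The cell above uses the overall median. The genre-based idea mentioned earlier can be sketched with `groupby(...).transform('median')`, which returns one value per row so it lines up with `fillna`. The example uses a small made-up DataFrame and assumes the genre labels have already been standardized; the column name `pages_filled` is illustrative.

```python
import pandas as pd

# Made-up data with already-standardized genre labels
df = pd.DataFrame({
    "genre": ["Gothic", "Gothic", "Gothic", "Comedy", "Romance", "Romance"],
    "pages": [400.0, None, 420.0, None, 430.0, 470.0],
})

# Median of each row's own genre, aligned to the original rows
genre_median = df.groupby("genre")["pages"].transform("median")

# Fallback for genres where every page count is missing (here: Comedy)
overall_median = df["pages"].median()

df["pages_filled"] = df["pages"].fillna(genre_median).fillna(overall_median)
print(df)
```

The second `fillna` matters: a genre with no known page counts has no median of its own, so it falls back to the overall value.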

### 🤔 Discussion: Ethics of Filling Missing Data

When we "fill" missing cultural data, we're making assumptions. Consider:
- Is it better to estimate missing values or exclude incomplete records?
- How might our filling strategy bias our cultural analysis?
- What does missing data itself tell us about historical record-keeping?
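
One way to keep these choices visible is to record which values were estimated and to compare the filled data with the alternative of dropping incomplete rows. The following is a minimal sketch on made-up numbers; the flag column name `pages_was_missing` is an assumption, not something used elsewhere in this notebook.

```python
import pandas as pd

df = pd.DataFrame({"title": ["A", "B", "C", "D"],
                   "pages": [100.0, None, 300.0, 500.0]})

# Keep a record of which rows were originally incomplete
df["pages_was_missing"] = df["pages"].isna()

# Option 1: exclude incomplete records
dropped = df.dropna(subset=["pages"])

# Option 2: fill with the median of the known values
filled = df.assign(pages=df["pages"].fillna(df["pages"].median()))

print("Mean pages   (dropped vs filled):", dropped["pages"].mean(), filled["pages"].mean())
print("Std of pages (dropped vs filled):", dropped["pages"].std(), filled["pages"].std())
```

In this toy case the mean is unchanged, but the spread shrinks after filling, because every estimated value sits exactly at the center. That is one concrete form the bias can take.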

## Part 3: Text Cleaning with Pandas String Methods

Cultural data often involves text that needs standardization. Pandas string methods are powerful tools for cleaning textual cultural information.

### Standardizing Book Titles

In [None]:
# Standardize title capitalization using .str.title()
print("Original titles:")
print(books_cleaned['title'].tolist())

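books_cleaned['title'] = books_cleaned['title'].str.title()
# .str.title() also capitalizes the letter after an apostrophe ("Alice'S"), so undo that
books_cleaned['title'] = books_cleaned['title'].str.replace("'S ", "'s ", regex=False)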

print("\nStandardized titles:")
print(books_cleaned['title'].tolist())
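
`.str.title()` capitalizes every word, which gives "Pride And Prejudice" rather than the more conventional "Pride and Prejudice". If that matters for your project, a small helper can leave short function words in lowercase. This is only a sketch: `smart_title` and the `SMALL_WORDS` list are illustrative assumptions, not a style-guide standard, and the later cells in this notebook still expect the plain `.str.title()` output.

```python
import pandas as pd

# Illustrative list of words to keep lowercase when they are not the first word
SMALL_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def smart_title(text: str) -> str:
    """Title-case a string, keeping small words lowercase and collapsing extra spaces."""
    words = text.strip().lower().split()
    result = []
    for position, word in enumerate(words):
        if position > 0 and word in SMALL_WORDS:
            result.append(word)
        else:
            # Capitalize only the first character, so "alice's" becomes "Alice's"
            result.append(word[0].upper() + word[1:])
    return " ".join(result)

titles = pd.Series(["pride and prejudice", "  TREASURE   ISLAND ", "alice's adventures in wonderland"])
print(titles.apply(smart_title).tolist())
```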

### Cleaning Author Names

In [None]:
# Standardize author names - this is trickier!
print("Original author names:")
print(books_cleaned['author_name'].unique())

# First, standardize capitalization
books_cleaned['author_name'] = books_cleaned['author_name'].str.title()

print("\nAfter title case:")
print(books_cleaned['author_name'].unique())

In [None]:
# Handle specific author name variations using .str.replace()
# This requires domain knowledge about the authors

author_corrections = {
    'R.L. Stevenson': 'Robert Louis Stevenson',
    'H.G. Wells': 'H.G. Wells',  # Already consistent after title-casing; listed to document the chosen form
    'Charlotte Bronte': 'Charlotte Brontë',  # Add proper accent
}

for old_name, new_name in author_corrections.items():
    # regex=False treats the dots in initials as literal characters, not regex wildcards
    books_cleaned['author_name'] = books_cleaned['author_name'].str.replace(old_name, new_name, regex=False)

print("After manual corrections:")
print(books_cleaned['author_name'].unique())
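
`.str.replace()` swaps text anywhere inside a string. When each correction should apply only to a complete value, `Series.replace()` with a dictionary is a possible alternative, since it matches whole cell values by default. A minimal sketch on a made-up `authors` Series:

```python
import pandas as pd

authors = pd.Series(["R.L. Stevenson", "Robert Louis Stevenson", "H.G. Wells", "Charlotte Bronte"])

# Keys are complete values to look for; values are the preferred forms
corrections = {
    "R.L. Stevenson": "Robert Louis Stevenson",
    "Charlotte Bronte": "Charlotte Brontë",
}

# Series.replace with a dict matches whole values, not substrings
standardized = authors.replace(corrections)
print(standardized.tolist())
print(f"Distinct authors: {standardized.nunique()}")
```

Whole-value matching avoids accidental edits inside longer strings, at the cost of needing one dictionary entry per variant spelling.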

### Standardizing Genres

In [None]:
# Clean up genre categories
print("Original genres:")
print(books_cleaned['genre'].unique())

# Create a mapping for genre standardization
genre_mapping = {
    'romance': 'Romance',
    'gothic': 'Gothic',
    'science fiction': 'Science Fiction',
    'sci-fi': 'Science Fiction',
    'adventure': 'Adventure',
    'fantasy': 'Fantasy'
}

# Apply the mapping using .str.lower() first, then .replace()
books_cleaned['genre_clean'] = books_cleaned['genre'].str.lower()
books_cleaned['genre_clean'] = books_cleaned['genre_clean'].replace(genre_mapping)
books_cleaned['genre_clean'] = books_cleaned['genre_clean'].str.title()

print("\nCleaned genres:")
print(books_cleaned['genre_clean'].unique())
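
With a hand-written mapping, it helps to know which values the mapping did not cover. The sketch below uses `.map()`, which returns a missing value for anything not in the dictionary, so unmapped labels can be listed before deciding what to do with them. The sample data and the `canonical` dictionary are made up for illustration.

```python
import pandas as pd

genres = pd.Series(["Romance", "gothic", "sci-fi", "science fiction", "Horror", "steampunk"])

canonical = {
    "romance": "Romance",
    "gothic": "Gothic",
    "sci-fi": "Science Fiction",
    "science fiction": "Science Fiction",
    "horror": "Horror",
}

normalized = genres.str.strip().str.lower()
mapped = normalized.map(canonical)  # values missing from the dict become NaN

# Audit step: which labels still need a decision?
unmapped = sorted(normalized[mapped.isna()].unique())
print(f"Labels not covered by the mapping: {unmapped}")

# One possible fallback: keep the unmapped label, title-cased
genre_clean = mapped.fillna(normalized.str.title())
print(genre_clean.tolist())
```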

### Standardizing Yes/No Categories

In [None]:
# Standardize the female_protagonist column
print("Original female_protagonist values:")
print(books_cleaned['female_protagonist'].unique())

# Convert to consistent True/False values
books_cleaned['has_female_protagonist'] = books_cleaned['female_protagonist'].str.lower() == 'yes'

print("\nCleaned to boolean:")
print(books_cleaned['has_female_protagonist'].unique())
print("\nCounts:")
print(books_cleaned['has_female_protagonist'].value_counts())
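
Comparing against `'yes'` treats every other value, including blanks and typos, as `False`. If your data has unknown or irregular entries, a mapping combined with pandas' nullable `boolean` dtype keeps "unknown" separate from "no". This is a sketch on made-up responses; which spellings count as yes or no is an assumption you would adjust for your own data.

```python
import pandas as pd

responses = pd.Series(["Yes", "yes", "No", "Y", None, "unknown"])

# Assumed spellings; anything not listed becomes a missing value
yes_no_map = {"yes": True, "y": True, "no": False, "n": False}

flags = responses.str.strip().str.lower().map(yes_no_map).astype("boolean")
print(flags)

# For comparison: the simple equality test counts only exact 'yes' matches
naive_yes_count = int((responses.str.lower() == "yes").sum())
print(f"Simple comparison counts {naive_yes_count} yes; mapping counts {int(flags.sum())}")
print(f"Entries left as unknown: {int(flags.isna().sum())}")
```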

## Part 4: Checking for and Handling Duplicates

Cultural datasets often contain duplicate entries or near-duplicates that need identification and handling.

In [None]:
# Check for duplicate titles
duplicate_titles = books_cleaned['title'].duplicated()
print(f"Number of duplicate titles: {duplicate_titles.sum()}")

if duplicate_titles.any():
    print("\nDuplicate titles:")
    print(books_cleaned[duplicate_titles][['title', 'author_name']])

In [None]:
# Check for books by the same author (after cleaning)
print("Books per author:")
author_counts = books_cleaned['author_name'].value_counts()
print(author_counts)

In [None]:
# Find potential duplicates based on multiple columns
potential_duplicates = books_cleaned.duplicated(subset=['title', 'author_name'], keep=False)
print(f"Potential duplicates based on title + author: {potential_duplicates.sum()}")

if potential_duplicates.any():
    print("\nPotential duplicate entries:")
    print(books_cleaned[potential_duplicates][['title', 'author_name', 'publication_year']])
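
`duplicated()` only catches exact matches, so near-duplicates that differ in case or spacing slip through. One possible approach is to build a normalized comparison key and check duplicates on that instead. The records below are made up to show the difference.

```python
import pandas as pd

records = pd.DataFrame({
    "title": ["Dracula", "dracula ", "Emma", "Emma"],
    "author_name": ["Bram Stoker", "BRAM STOKER", "Jane Austen", "Jane Austen"],
    "publication_year": [1897, 1897, 1815, 1815],
})

# Exact matching finds only the literal repeat of "Emma"
exact_repeats = int(records.duplicated(subset=["title", "author_name"]).sum())

# Normalized key: trimmed, lowercased title and author joined with a separator
key = (records["title"].str.strip().str.lower()
       + "|"
       + records["author_name"].str.strip().str.lower())
is_repeat = key.duplicated(keep="first")

# Keep the first occurrence of each normalized key
deduplicated = records[~is_repeat].reset_index(drop=True)

print(f"Exact repeats: {exact_repeats}, repeats after normalizing: {int(is_repeat.sum())}")
print(deduplicated)
```

Review flagged rows before dropping them: two entries with the same title and author may be different editions rather than mistakes.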

## Part 5: Advanced Data Analysis - Grouping and Aggregation

Now that our data is clean, let's perform sophisticated cultural analysis using pandas grouping capabilities.

### Analyzing by Author

In [None]:
# Group by author and analyze their work
author_analysis = books_cleaned.groupby('author_name').agg({
    'title': 'count',  # Number of books
    'publication_year': ['min', 'max'],  # Career span
    'pages': 'mean',  # Average book length
    'modern_adaptations': 'sum',  # Total adaptations
    'has_female_protagonist': 'mean'  # Proportion with female protagonists
})

# Flatten column names
author_analysis.columns = ['book_count', 'first_publication', 'last_publication', 
                          'avg_pages', 'total_adaptations', 'female_protagonist_rate']

# Calculate career span
author_analysis['career_span'] = author_analysis['last_publication'] - author_analysis['first_publication']

print("Author Analysis:")
author_analysis
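
Renaming columns by position works, but it breaks silently if the aggregation dictionary changes order. pandas also supports named aggregation, where each output column is named as it is defined. A minimal sketch on a made-up `books` table (the column `publication_span` is an illustrative name):

```python
import pandas as pd

books = pd.DataFrame({
    "author_name": ["Jane Austen", "Jane Austen", "Bram Stoker"],
    "title": ["Pride And Prejudice", "Emma", "Dracula"],
    "publication_year": [1813, 1815, 1897],
    "pages": [432, 474, 418],
})

# Each keyword is an output column: name=(source column, aggregation)
summary = books.groupby("author_name").agg(
    book_count=("title", "count"),
    first_publication=("publication_year", "min"),
    last_publication=("publication_year", "max"),
    avg_pages=("pages", "mean"),
)

# Span between the earliest and latest book that appear in this table
summary["publication_span"] = summary["last_publication"] - summary["first_publication"]
print(summary)
```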

### Analyzing by Genre

In [None]:
# Group by cleaned genre
genre_analysis = books_cleaned.groupby('genre_clean').agg({
    'title': 'count',
    'publication_year': 'mean',
    'pages': 'mean',
    'modern_adaptations': 'mean',
    'has_female_protagonist': 'mean'
})

genre_analysis.columns = ['book_count', 'avg_publication_year', 'avg_pages', 
                         'avg_adaptations', 'female_protagonist_rate']

print("Genre Analysis:")
genre_analysis.round(2)

### Time Period Analysis

In [None]:
# Create time period categories
def categorize_period(year):
    if year < 1850:
        return 'Early 19th Century (pre-1850)'
    elif year < 1880:
        return 'Mid 19th Century (1850-1879)'
    else:
        return 'Late 19th Century (1880+)'

books_cleaned['time_period'] = books_cleaned['publication_year'].apply(categorize_period)

# Analyze by time period
period_analysis = books_cleaned.groupby('time_period').agg({
    'title': 'count',
    'pages': 'mean',
    'modern_adaptations': 'mean',
    'has_female_protagonist': 'mean'
})

period_analysis.columns = ['book_count', 'avg_pages', 'avg_adaptations', 'female_protagonist_rate']

print("Time Period Analysis:")
period_analysis.round(2)
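
The same binning can be sketched with `pd.cut`, which assigns values to labeled intervals without a custom function. The bin edges below are chosen to match the three periods used above; the outer edges (1800 and 1900) are an assumption, and years outside them would become missing rather than falling into the first or last period.

```python
import pandas as pd

years = pd.Series([1813, 1849, 1850, 1879, 1880, 1898])

periods = pd.cut(
    years,
    bins=[1800, 1850, 1880, 1900],
    right=False,  # intervals include the left edge: [1800, 1850), [1850, 1880), [1880, 1900)
    labels=[
        "Early 19th Century (pre-1850)",
        "Mid 19th Century (1850-1879)",
        "Late 19th Century (1880+)",
    ],
)

print(pd.DataFrame({"year": years, "time_period": periods}))
```

The result is an ordered categorical, so grouped output and bar charts follow chronological order rather than alphabetical order.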

## Part 6: Advanced Filtering and Queries

Let's practice complex filtering for sophisticated cultural analysis questions.

In [None]:
# Complex query 1: Highly adapted works with female protagonists
popular_female_led = books_cleaned[
    (books_cleaned['modern_adaptations'] > 10) & 
    books_cleaned['has_female_protagonist']
]

print("Highly adapted books with female protagonists:")
print(popular_female_led[['title', 'author_name', 'modern_adaptations', 'genre_clean']])

In [None]:
# Complex query 2: Short books from prolific authors
prolific_authors = author_analysis[author_analysis['book_count'] > 1].index
short_books_prolific_authors = books_cleaned[
    (books_cleaned['author_name'].isin(prolific_authors)) & 
    (books_cleaned['pages'] < 200)
]

print("Short books by prolific authors:")
print(short_books_prolific_authors[['title', 'author_name', 'pages', 'genre_clean']])

In [None]:
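# Complex query 3: Using string methods for advanced filtering
# Find books with the word 'the' in the title; the \b word boundaries keep
# substrings such as the "the" inside "Wuthering" from matching
titles_with_the = books_cleaned[books_cleaned['title'].str.contains(r'\bthe\b', case=False, regex=True)]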

print("Books with 'The' in the title:")
print(titles_with_the[['title', 'author_name', 'publication_year']])

In [None]:
# Complex query 4: Books from authors with specific patterns in their names
authors_with_initials = books_cleaned[books_cleaned['author_name'].str.contains(r'\b[A-Z]\.[A-Z]\.')]

print("Books by authors with initials (e.g., H.G. Wells):")
print(authors_with_initials[['title', 'author_name', 'genre_clean']])
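
For filters with several conditions, `DataFrame.query()` is an alternative that some people find easier to read than chained boolean masks. The sketch below repeats the idea of the first query on a made-up table; `min_adaptations` is an illustrative variable, referenced inside the query string with `@`.

```python
import pandas as pd

books = pd.DataFrame({
    "title": ["Jane Eyre", "Emma", "Dracula", "Pride And Prejudice"],
    "modern_adaptations": [12, 6, 35, 15],
    "has_female_protagonist": [True, True, False, True],
})

min_adaptations = 10

# Column names are used directly; @ pulls in a Python variable
result = books.query("modern_adaptations > @min_adaptations and has_female_protagonist")
print(result[["title", "modern_adaptations"]])
```

Both styles select the same rows; bracket-style masks remain necessary for conditions that rely on `.str` methods.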

## Part 7: Creating Meaningful Visualizations from Clean Data

Clean data enables sophisticated visualizations that reveal cultural patterns.

In [None]:
# Visualization 1: Genre popularity over time
plt.figure(figsize=(12, 6))
for genre in books_cleaned['genre_clean'].unique():
    genre_data = books_cleaned[books_cleaned['genre_clean'] == genre]
    plt.scatter(genre_data['publication_year'], [genre] * len(genre_data), 
               s=genre_data['modern_adaptations'] * 3, alpha=0.7, label=genre)

plt.xlabel('Publication Year')
plt.ylabel('Genre')
plt.title('Literary Genres Over Time\n(bubble size = modern adaptations)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
# Visualization 2: Author productivity and adaptation success
plt.figure(figsize=(10, 6))
plt.scatter(author_analysis['book_count'], author_analysis['total_adaptations'], 
           s=author_analysis['avg_pages'], alpha=0.7, c=author_analysis['female_protagonist_rate'], 
           cmap='RdYlBu')
plt.xlabel('Number of Books in Dataset')
plt.ylabel('Total Modern Adaptations')
plt.title('Author Productivity vs. Adaptation Success\n(bubble size = avg pages, color = female protagonist rate)')
plt.colorbar(label='Female Protagonist Rate')

# Add author labels
for author, data in author_analysis.iterrows():
    plt.annotate(author, (data['book_count'], data['total_adaptations']), 
                xytext=(5, 5), textcoords='offset points', fontsize=8)

plt.tight_layout()
plt.show()

In [None]:
# Visualization 3: Time period comparison
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Average pages by period
period_analysis['avg_pages'].plot(kind='bar', ax=axes[0,0], color='skyblue')
axes[0,0].set_title('Average Book Length by Period')
axes[0,0].set_ylabel('Pages')

# Modern adaptations by period
period_analysis['avg_adaptations'].plot(kind='bar', ax=axes[0,1], color='lightcoral')
axes[0,1].set_title('Average Modern Adaptations by Period')
axes[0,1].set_ylabel('Adaptations')

# Female protagonist rate by period
period_analysis['female_protagonist_rate'].plot(kind='bar', ax=axes[1,0], color='lightgreen')
axes[1,0].set_title('Female Protagonist Rate by Period')
axes[1,0].set_ylabel('Rate')

# Book count by period
period_analysis['book_count'].plot(kind='bar', ax=axes[1,1], color='gold')
axes[1,1].set_title('Number of Books by Period')
axes[1,1].set_ylabel('Count')

plt.tight_layout()
plt.show()

## Part 8: Cultural Insights from Clean Data

Now that we've cleaned our literary dataset, let's explore what cultural patterns and insights emerge from proper data cleaning practices.

### 🎯 Key Questions Clean Data Can Help Answer:
- **Which literary genres were most popular in different time periods?**
- **How do publication patterns reflect cultural and historical trends?**
- **What role does data standardization play in cultural analysis?**

### 📊 The Impact of Data Cleaning on Analysis
Notice how our cleaning process has transformed messy, inconsistent data into a reliable foundation for cultural research. This is exactly the workflow you'll practice in the homework assignment with your own chosen cultural dataset!

## Part 9: Advanced String Methods for Cultural Data

Let's explore more sophisticated text processing techniques that are particularly useful for cultural datasets.

### Extracting Information from Text

In [None]:
# Extract publication decade
books_cleaned['decade'] = (books_cleaned['publication_year'] // 10) * 10
books_cleaned['decade_label'] = books_cleaned['decade'].astype(str) + 's'

print("Decade analysis:")
decade_counts = books_cleaned['decade_label'].value_counts().sort_index()
print(decade_counts)

In [None]:
# Extract word count from titles
books_cleaned['title_word_count'] = books_cleaned['title'].str.split().str.len()

print("Title length analysis:")
print(f"Average words in title: {books_cleaned['title_word_count'].mean():.2f}")
print(f"Shortest title: {books_cleaned['title_word_count'].min()} words")
print(f"Longest title: {books_cleaned['title_word_count'].max()} words")

# Find the longest title
longest_title = books_cleaned[books_cleaned['title_word_count'] == books_cleaned['title_word_count'].max()]
print(f"\nLongest title: '{longest_title['title'].iloc[0]}'")

In [None]:
# Check for common title patterns
print("Title patterns:")
print(f"Titles containing 'The': {books_cleaned['title'].str.contains('The').sum()}")
print(f"Titles containing 'of': {books_cleaned['title'].str.contains(' Of ').sum()}")
print(f"Titles containing 'and': {books_cleaned['title'].str.contains(' And ').sum()}")

# Create a category for title types
def categorize_title(title):
    if title.startswith('The '):
        return 'Starts with "The"'
    elif ' Of ' in title or ' And ' in title:
        return 'Contains "Of" or "And"'
    else:
        return 'Simple title'

books_cleaned['title_type'] = books_cleaned['title'].apply(categorize_title)
print("\nTitle type distribution:")
print(books_cleaned['title_type'].value_counts())
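
When the information you need is buried inside a longer text field, `.str.extract()` with a regular expression can pull it out. Our dataset has no such column, so the sketch below uses made-up imprint strings; the pattern assumes four-digit years between 1500 and 1999.

```python
import pandas as pd

# Hypothetical free-text imprint field, as it might appear in a catalog export
imprints = pd.Series([
    "London: Example Press, 1865",
    "Edinburgh, 1886",
    "place and date unknown",
])

# The capture group returns the first matching year, or NaN if there is none
year_text = imprints.str.extract(r"\b(1[5-9]\d{2})\b", expand=False)

# Convert to a nullable integer column so missing years stay missing
year_numeric = pd.to_numeric(year_text, errors="coerce").astype("Int64")
print(pd.DataFrame({"imprint": imprints, "year": year_numeric}))
```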

### Working with Complex Text Categories

In [None]:
# Assign each author to a hand-coded category (this relies on outside cultural knowledge, not on the name text itself)
def guess_author_origin(name):
    """This is a simplified example - real cultural data would require more sophisticated approaches"""
    if name in ['Jane Austen', 'Charlotte Brontë', 'Emily Brontë']:
        return 'English (Women Writers)'
    elif name in ['Oscar Wilde', 'Robert Louis Stevenson']:
        return 'British Isles (Male Writers)'
    elif name in ['H.G. Wells', 'Lewis Carroll']:
        return 'English (Male Writers)'
    else:
        return 'Other'

books_cleaned['author_category'] = books_cleaned['author_name'].apply(guess_author_origin)

print("Author categories:")
print(books_cleaned['author_category'].value_counts())

## Part 10: Final Analysis and Insights

Let's bring together all our cleaning and analysis techniques to answer sophisticated cultural questions.

In [None]:
# Final comprehensive analysis
final_analysis = books_cleaned.groupby(['decade_label', 'genre_clean']).agg({
    'title': 'count',
    'modern_adaptations': 'mean',
    'has_female_protagonist': 'mean',
    'pages': 'mean'
}).round(2)

final_analysis.columns = ['book_count', 'avg_adaptations', 'female_protagonist_rate', 'avg_pages']

print("Comprehensive analysis by decade and genre:")
print(final_analysis)
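
A two-level grouped table like this can be hard to scan. Reshaping it so that one grouping variable becomes the columns often makes comparisons easier. The sketch below uses `pivot_table` on a made-up subset with the same column names; it is one possible layout, not the only one.

```python
import pandas as pd

books = pd.DataFrame({
    "decade_label": ["1810s", "1810s", "1890s", "1890s", "1890s"],
    "genre_clean": ["Romance", "Romance", "Horror", "Science Fiction", "Science Fiction"],
    "modern_adaptations": [15, 6, 35, 15, 8],
})

# Rows = decades, columns = genres, cells = mean adaptations
wide = books.pivot_table(
    index="decade_label",
    columns="genre_clean",
    values="modern_adaptations",
    aggfunc="mean",
)
print(wide)
```

Empty cells (NaN) mean the dataset has no books for that decade and genre combination, which is itself worth noting in a small sample.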

In [None]:
# Create a summary of our cleaning work
print("DATA CLEANING SUMMARY")
print("=" * 50)
print(f"Original dataset: {len(books_df)} rows")
print(f"Final cleaned dataset: {len(books_cleaned)} rows")
print(f"Columns added during cleaning: {len(books_cleaned.columns) - len(books_df.columns)}")
print("\nCleaning actions performed:")
print("✅ Filled missing page data")
print("✅ Standardized title capitalization")
print("✅ Cleaned author names and handled variations")
print("✅ Standardized genre categories")
print("✅ Converted Yes/No to boolean values")
print("✅ Created time period categories")
print("✅ Added title analysis features")
print("✅ Added author categorization")

In [None]:
# Save our cleaned dataset
books_cleaned.to_csv('cleaned_literary_dataset.csv', index=False)
print("✅ Saved cleaned dataset to CSV file")

# Save our analysis results
author_analysis.to_csv('author_analysis_results.csv')
genre_analysis.to_csv('genre_analysis_results.csv')
period_analysis.to_csv('period_analysis_results.csv')
print("✅ Saved analysis results to CSV files")
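
After saving, it is worth reading a file back to confirm it round-trips as expected. The sketch below writes to an in-memory buffer instead of disk so it can run anywhere; with a real file you would pass the filename to `pd.read_csv` instead. The tiny `cleaned` table is made up.

```python
import io
import pandas as pd

cleaned = pd.DataFrame({
    "title": ["Emma", "Dracula"],
    "pages": [474.0, 418.0],
    "has_female_protagonist": [True, False],
})

# Write to an in-memory text buffer, then read it back
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.dtypes)
print(f"Same shape after reload: {reloaded.shape == cleaned.shape}")
```

CSV stores plain text, so simple columns like these come back intact, while special types such as categoricals would need to be recreated after loading.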

## Cultural Insights and Discussion

Based on our cleaned data analysis, let's discuss what we've learned:

### Key Findings:
1. **Gender Representation**: What patterns do we see in female protagonists across different time periods and genres?
2. **Adaptation Patterns**: Which types of books tend to get more modern adaptations?
3. **Historical Trends**: How did literary production change across the 19th century?
4. **Genre Evolution**: What can we learn about how literary genres developed?

### Questions for Further Research:
- How might our cleaning choices have influenced our findings?
- What other cultural factors might explain the patterns we observe?
- How could we expand this analysis with additional data sources?
- What ethical considerations arise when quantifying cultural production?

*Use the space below to record your insights and discussion points.*

### My Cultural Analysis Insights:

*Record your observations and insights here...*

**Patterns I noticed:**

**Questions this raises:**

**Limitations of this analysis:**

**Ideas for future research:**

## 🎓 Summary: From Messy to Meaningful Cultural Data

Congratulations! You've learned the essential techniques for transforming messy cultural data into clean, analysis-ready datasets.

### ✅ Key Skills Demonstrated:
- **Identifying data problems**: Missing values, inconsistent formatting, duplicate entries
- **Handling missing data**: Strategies for filling gaps, and awareness of the bias each strategy can introduce
- **Text standardization**: Using pandas string methods for consistent cultural categories
- **Advanced analysis**: Grouping, aggregation, and trend identification
- **Data ethics**: Understanding the cultural implications of cleaning choices

### 🔧 Essential Pandas Techniques Mastered:
- `df.isnull()` and `df.fillna()` for missing data handling
- `str.title()`, `str.replace()`, and string methods for text cleaning
- `groupby()` and aggregation functions for cultural pattern analysis
- Creating derived variables (decades, categories, flags) for deeper insights
- Combining multiple cleaning operations into systematic workflows

### 🎯 Next Steps: Apply These Skills
**Now it's your turn!** In the companion homework assignment, you'll:
1. **Choose your own cultural dataset** from areas that interest you
2. **Apply these exact techniques** to clean and analyze real-world cultural data
3. **Practice ethical data collection** principles including robots.txt compliance
4. **Generate cultural insights** through systematic data cleaning and analysis

### 🚀 Your Cultural Data Journey Continues:
1. **Practice**: Use the homework to cement these technical skills with your chosen cultural domain
2. **Expand**: Learn advanced pandas techniques like merging datasets and time series analysis
3. **Specialize**: Develop expertise with tools specific to your cultural research interests
4. **Share**: Present your cultural data findings to academic and public audiences

Remember: **Clean data is the foundation of trustworthy cultural analysis.** The time you invest in careful data cleaning pays dividends in the reliability and credibility of your cultural insights!

### 📚 Ready for Your Own Analysis?
Head to the **HW3-2 homework assignment** to apply these skills to your own chosen cultural dataset. You'll practice the complete workflow from data collection ethics to final cultural insights!