<a href="https://colab.research.google.com/github/TCU-DCDA/WRIT20833-2025/blob/main/notebooks/exercises/WRIT20833_Data_Cleaning_Student_Practice_F25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Cleaning & Analysis: Student Practice Notebook
## Apply Your Skills to Real Cultural Data

**Name:** ________________________________  
**Date:** ________________________________

This notebook is your space to practice the data cleaning and analysis techniques from the main lesson. Upload your own cultural dataset and work through the cleaning process step by step.

### 📋 Before You Begin: Dataset Requirements Checklist
**✅ Check that your dataset includes:**
- [ ] **Rich text data** (names, titles, categories, descriptions)
- [ ] **Numeric columns** (counts, ratings, years, measurements)
- [ ] **Mixed data types** for comprehensive practice
- [ ] **At least 15-20 rows** for meaningful patterns

**⚠️ If your dataset is missing certain types:**
- **Text-only datasets**: Focus on Part 4 (text cleaning and standardization)
- **Mostly numeric datasets**: Focus on Parts 5 and 7 (numeric cleaning, grouping, and aggregation)
- **Very small datasets**: Practice the techniques but note that patterns may not be statistically meaningful

### 📂 Recommended Dataset Sources:
- **Kaggle**: Movies, books, music, museums, historical records
- **Government data**: Cultural statistics, census data, arts funding
- **Academic repositories**: Digital humanities collections
- **Cultural institutions**: Museum collections, library catalogs

## Part 1: Loading and Initial Exploration

In [None]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Set display options for better readability
pd.options.display.max_rows = 100
pd.options.display.max_columns = 20

In [None]:
# Load your cultural dataset
# Replace 'your_filename.csv' with your actual file name
# For other file types, use: pd.read_excel(), pd.read_json(), etc.

my_data = pd.read_csv('your_filename.csv')

print(f"✅ Loaded dataset with {len(my_data)} rows and {len(my_data.columns)} columns")
print(f"Dataset shape: {my_data.shape}")

In [None]:
# First look at your data
print("First 5 rows:")
my_data.head()

In [None]:
# Get basic information about your dataset
print("Dataset info:")
my_data.info()

print("\n" + "="*50 + "\n")

print("Column names:")
print(list(my_data.columns))

### 🤔 Reflection: Initial Observations
**Write your observations about your dataset:**
1. **What cultural topic does this data represent?**

2. **What are the main columns and what do they represent?**

3. **What data types do you see? (text, numbers, dates, etc.)**

4. **What potential data quality issues do you notice?**

## Part 2: Identifying Data Problems

In [None]:
# Check for missing data
print("Missing data summary:")
missing_data = my_data.isnull().sum()
print(missing_data[missing_data > 0])  # Only show columns with missing data

print(f"\nTotal missing values: {my_data.isnull().sum().sum()}")
print(f"Percentage of missing data: {(my_data.isnull().sum().sum() / (len(my_data) * len(my_data.columns))) * 100:.2f}%")
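
The totals above do not show which columns are most affected. A minimal sketch of a per-column breakdown follows, assuming a small made-up table (`toy`, with hypothetical column names) in place of `my_data`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset standing in for my_data
toy = pd.DataFrame({
    "title": ["Beloved", "Dracula", None, "Emma"],
    "year": [1987, 1897, 1925, np.nan],
    "rating": [4.5, np.nan, np.nan, 4.0],
})

# isnull().mean() is the fraction of missing values in each column;
# multiplying by 100 turns it into a percentage
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```

Columns with a high percentage may call for a different strategy (dropping the column, or treating "missing" as its own category) than columns with only a few gaps.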

In [None]:
# Examine unique values in text/categorical columns
print("Unique values in categorical columns:")
print("(Look for inconsistencies, duplicates, and formatting issues)\n")

for col in my_data.select_dtypes(include=['object']).columns:
    # Sort as strings so columns mixing text and numbers don't raise a TypeError
    unique_values = sorted(my_data[col].dropna().unique(), key=str)
    unique_count = len(unique_values)
    print(f"{col} ({unique_count} unique values):")
    if unique_count <= 20:  # Show all if 20 or fewer
        print(f"  {unique_values}")
    else:  # Show first 10 if more than 20
        print(f"  First 10: {unique_values[:10]}")
        print(f"  ... and {unique_count - 10} more")
    print()

In [None]:
# Check for potential duplicates
duplicate_rows = my_data.duplicated().sum()
print(f"Complete duplicate rows: {duplicate_rows}")

if duplicate_rows > 0:
    print("\nDuplicate rows:")
    print(my_data[my_data.duplicated()])
else:
    print("✅ No complete duplicate rows found")
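
`duplicated()` only flags the later copies of rows that match exactly, so rows that differ by capitalization or stray spaces slip through. A rough sketch of one way to surface them follows, using a made-up table (`toy`) with hypothetical columns; whether near-matches are true duplicates is still a judgment call for your data:

```python
import pandas as pd

# Hypothetical toy dataset: row 3 differs from rows 0-1 only by case and a trailing space
toy = pd.DataFrame({
    "title": ["Dracula", "Dracula", "Emma", "dracula "],
    "year": [1897, 1897, 1815, 1897],
})

# keep=False flags every copy of an exact duplicate, not just the later ones
all_copies = toy[toy.duplicated(keep=False)]

# Normalize the text on a temporary copy, then check again for near-duplicates
normalized = toy.assign(title=toy["title"].str.strip().str.lower())
near_dupes = normalized.duplicated(keep=False)

print(all_copies)
print(f"Exact copies: {len(all_copies)}, after normalizing: {near_dupes.sum()}")
```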

### 📝 Data Problems Identified:
**List the specific issues you found in your data:**

**Missing Data:**
- 

**Text/Formatting Issues:**
- 

**Duplicate Issues:**
- 

**Other Problems:**
- 

## Part 3: Creating Your Data Cleaning Plan

Before cleaning, create a systematic plan based on your data problems.

In [None]:
# Create a copy for cleaning (preserve original)
my_data_cleaned = my_data.copy()

print("✅ Created copy for cleaning")
print(f"Original data: {len(my_data)} rows")
print(f"Working copy: {len(my_data_cleaned)} rows")

### 🎯 My Cleaning Strategy:
**Write your step-by-step plan:**

1. **Missing Data Strategy:**
   - 

2. **Text Standardization Tasks:**
   - 

3. **Category Cleaning:**
   - 

4. **Duplicate Handling:**
   - 

## Part 4: Text Cleaning and Standardization

Apply pandas string methods to clean and standardize your text data.

In [None]:
# Example: Standardize capitalization for a text column
# Replace 'column_name' with your actual column name

# Before cleaning:
# print("Before standardization:")
# print(my_data_cleaned['column_name'].unique())

# Apply title case:
# my_data_cleaned['column_name'] = my_data_cleaned['column_name'].str.title()

# After cleaning:
# print("\nAfter standardization:")
# print(my_data_cleaned['column_name'].unique())

print("Add your text cleaning code here")
print("Use the examples from the main lesson as templates")
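
To see what the template above does before applying it to your own column, here is a small sketch on a made-up Series (`genres` is hypothetical). It adds `.str.strip()` before `.str.title()`, because title case alone does not remove leading or trailing spaces:

```python
import pandas as pd

# Hypothetical messy values: same genres written with different case and spacing
genres = pd.Series(["  jazz", "JAZZ", "Jazz ", "hip hop", "Hip Hop"])

# Strip whitespace first, then apply title case
cleaned = genres.str.strip().str.title()

print(f"Unique values before: {genres.nunique()}")
print(f"Unique values after:  {cleaned.nunique()}")
print(sorted(cleaned.unique()))
```

One caution: title case capitalizes the letter after an apostrophe (for example, `"don't"` becomes `"Don'T"`), so check the results for names and titles that contain apostrophes.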

In [None]:
# Example: Create category mappings to standardize similar values
# Replace with your actual column and categories

# category_mapping = {
#     'old_value_1': 'Standard_Value_1',
#     'old_value_2': 'Standard_Value_1',  # Multiple old values can map to same new value
#     'old_value_3': 'Standard_Value_2'
# }

# Apply mapping:
# my_data_cleaned['column_name_clean'] = my_data_cleaned['column_name'].str.lower()
# my_data_cleaned['column_name_clean'] = my_data_cleaned['column_name_clean'].replace(category_mapping)
# my_data_cleaned['column_name_clean'] = my_data_cleaned['column_name_clean'].str.title()

print("Add your category mapping code here")
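
A worked version of the mapping template may help. This sketch uses a made-up `genre` column and mapping (both hypothetical). Because the values are lowercased before the mapping is applied, the dictionary keys must be lowercase too:

```python
import pandas as pd

# Hypothetical toy dataset with several spellings of the same category
toy = pd.DataFrame({
    "genre": ["Sci-Fi", "science fiction", "SF", "Romance", "romantic"]
})

# Keys are lowercase because the column is lowercased before replace() runs
category_mapping = {
    "sci-fi": "science fiction",
    "sf": "science fiction",
    "romantic": "romance",
}

toy["genre_clean"] = (
    toy["genre"]
    .str.strip()
    .str.lower()
    .replace(category_mapping)  # exact-match replacement of whole values
    .str.title()
)

print(toy["genre_clean"].value_counts())
```

Writing the result to a new column (`genre_clean`) keeps the original values available, so you can check each mapping decision later.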

In [None]:
# Handle missing data for text columns
# Choose appropriate strategy based on your data

# Option 1: Fill with placeholder
# my_data_cleaned['column_name'] = my_data_cleaned['column_name'].fillna('Unknown')

# Option 2: Drop rows with missing values in critical columns
# my_data_cleaned = my_data_cleaned.dropna(subset=['critical_column'])

print("Add your missing data handling code here")
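
The two options have different costs: filling keeps every row but adds a placeholder category, while dropping shrinks the dataset. A brief sketch on a made-up table (`toy` and its columns are hypothetical) shows both side by side:

```python
import pandas as pd

# Hypothetical toy dataset with one missing title and one missing author
toy = pd.DataFrame({
    "title": ["Beloved", None, "Emma", "Dracula"],
    "author": ["Toni Morrison", "F. Scott Fitzgerald", None, "Bram Stoker"],
})

# Option 1: keep every row, label the gap explicitly
filled = toy.assign(author=toy["author"].fillna("Unknown"))

# Option 2: drop rows that lack a value in a column you cannot do without
dropped = toy.dropna(subset=["title"])

print(f"Original rows: {len(toy)}")
print(f"After fillna:  {len(filled)} rows, {filled['author'].isnull().sum()} missing authors")
print(f"After dropna:  {len(dropped)} rows")
```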

## Part 5: Handling Numeric Data and Missing Values

In [None]:
# Handle missing numeric data
# Choose strategy based on your data and research questions

# Option 1: Fill with median (robust to outliers)
# numeric_column = 'your_numeric_column'
# median_value = my_data_cleaned[numeric_column].median()
# my_data_cleaned[numeric_column] = my_data_cleaned[numeric_column].fillna(median_value)
# print(f"Filled missing {numeric_column} with median: {median_value}")

# Option 2: Fill with mean
# mean_value = my_data_cleaned[numeric_column].mean()
# my_data_cleaned[numeric_column] = my_data_cleaned[numeric_column].fillna(mean_value)

# Option 3: Fill based on category groups
# my_data_cleaned[numeric_column] = my_data_cleaned.groupby('category_column')[numeric_column].transform(lambda x: x.fillna(x.median()))

print("Add your numeric data cleaning code here")
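
Options 1 and 3 can give quite different results when categories differ in scale. The sketch below compares them on a made-up table (`toy`, `medium`, and `height_cm` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: sculptures are much taller than paintings
toy = pd.DataFrame({
    "medium": ["painting", "painting", "painting",
               "sculpture", "sculpture", "sculpture"],
    "height_cm": [60.0, 80.0, np.nan, 150.0, 170.0, np.nan],
})

# Option 1: one median for the whole column
overall = toy["height_cm"].fillna(toy["height_cm"].median())

# Option 3: a separate median within each category
by_group = toy.groupby("medium")["height_cm"].transform(
    lambda x: x.fillna(x.median())
)

print(pd.DataFrame({"medium": toy["medium"], "overall": overall, "by_group": by_group}))
```

Here the overall median sits between the two groups and fits neither, while the group medians stay close to typical values for each medium. Either way, filled values are estimates, so note in your reflection which method you chose and why.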

In [None]:
# Convert data types if needed
# Example: Convert string numbers to numeric
# my_data_cleaned['numeric_column'] = pd.to_numeric(my_data_cleaned['numeric_column'], errors='coerce')

# Example: Convert string dates to datetime
# my_data_cleaned['date_column'] = pd.to_datetime(my_data_cleaned['date_column'], errors='coerce')

print("Add data type conversion code here if needed")
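
With `errors='coerce'`, any value that cannot be converted silently becomes missing, so it is worth checking what was lost. A minimal sketch on a made-up Series (`raw_years` is hypothetical):

```python
import pandas as pd

# Hypothetical year column stored as text, with two entries that are not plain numbers
raw_years = pd.Series(["1987", "1897", "c. 1925", "unknown"])

# Values that cannot be parsed become NaN instead of raising an error
years = pd.to_numeric(raw_years, errors="coerce")

# Entries that were present in the original but failed to convert
failed = raw_years[years.isnull() & raw_years.notnull()]

print(years)
print(f"Could not convert: {list(failed)}")
```

Reviewing the failed entries before moving on lets you decide whether to fix them by hand (for example, `"c. 1925"` might reasonably become `1925`) or leave them missing.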

## Part 6: Checking Your Cleaning Work

In [None]:
# Compare before and after cleaning
print("CLEANING SUMMARY")
print("=" * 50)
print(f"Original dataset: {len(my_data)} rows, {len(my_data.columns)} columns")
print(f"Cleaned dataset: {len(my_data_cleaned)} rows, {len(my_data_cleaned.columns)} columns")
print(f"Rows removed: {len(my_data) - len(my_data_cleaned)}")
print(f"Columns added: {len(my_data_cleaned.columns) - len(my_data.columns)}")

print("\nMissing data comparison:")
print(f"Original missing values: {my_data.isnull().sum().sum()}")
print(f"Cleaned missing values: {my_data_cleaned.isnull().sum().sum()}")

In [None]:
# Final check of cleaned data
print("Cleaned data sample:")
my_data_cleaned.head()

In [None]:
# Check remaining data issues
print("Remaining missing data:")
remaining_missing = my_data_cleaned.isnull().sum()
print(remaining_missing[remaining_missing > 0])

print("\nData types:")
print(my_data_cleaned.dtypes)

## Part 7: Data Analysis and Exploration

Now that your data is clean, perform some exploratory analysis!

In [None]:
# Basic descriptive statistics for numeric columns
print("Descriptive statistics:")
my_data_cleaned.describe()

In [None]:
# Value counts for categorical columns
# Replace 'category_column' with your actual column name

# print("Distribution of categories:")
# category_counts = my_data_cleaned['category_column'].value_counts()
# print(category_counts)

print("Add your categorical analysis code here")

In [None]:
# Groupby analysis (if you have both categorical and numeric data)
# Replace column names with your actual columns

# grouped_analysis = my_data_cleaned.groupby('category_column').agg({
#     'numeric_column_1': ['mean', 'count'],
#     'numeric_column_2': ['sum', 'median']
# })
# print("Analysis by category:")
# print(grouped_analysis)

print("Add your groupby analysis code here")
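
As an alternative to the dictionary form in the template, pandas also supports named aggregation, which produces flat, readable column names. A short sketch on a made-up film table (all column names are hypothetical):

```python
import pandas as pd

# Hypothetical toy dataset
toy = pd.DataFrame({
    "genre": ["Drama", "Drama", "Comedy", "Comedy", "Comedy"],
    "rating": [8.0, 6.0, 7.0, 5.0, 9.0],
    "runtime_min": [120, 100, 90, 95, 85],
})

# Each keyword is an output column: new_name=(source_column, aggregation)
summary = toy.groupby("genre").agg(
    mean_rating=("rating", "mean"),
    film_count=("rating", "count"),
    median_runtime=("runtime_min", "median"),
)

print(summary)
```

Including a count alongside each mean is a useful habit: an average based on two or three rows deserves more caution than one based on hundreds.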

## Part 8: Data Visualization

In [None]:
# Create visualizations appropriate for your data
# Examples:

# Bar chart for categorical data:
# category_counts = my_data_cleaned['category_column'].value_counts()
# category_counts.plot(kind='bar', title='Distribution of Categories', figsize=(10, 6))
# plt.xlabel('Category')
# plt.ylabel('Count')
# plt.xticks(rotation=45)
# plt.tight_layout()
# plt.show()

print("Add your visualization code here")
print("Consider: bar charts, histograms, scatter plots, or time series plots")

In [None]:
# Second visualization
# Example: Histogram for numeric data:
# my_data_cleaned['numeric_column'].plot(kind='hist', bins=20, title='Distribution of Numeric Values', figsize=(8, 6))
# plt.xlabel('Value')
# plt.ylabel('Frequency')
# plt.show()

print("Add a second visualization here")

## Part 9: Save Your Work

In [None]:
# Save your cleaned dataset
output_filename = 'my_cleaned_cultural_data.csv'
my_data_cleaned.to_csv(output_filename, index=False)
print(f"✅ Saved cleaned dataset as: {output_filename}")

# Optional: Save analysis results
# if 'grouped_analysis' in locals():
#     grouped_analysis.to_csv('my_analysis_results.csv')
#     print("✅ Saved analysis results")
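
The `index=False` argument matters when the file is read back in. The sketch below illustrates the difference using an in-memory buffer instead of a real file (the `toy` table is made up for illustration):

```python
import io

import pandas as pd

# Hypothetical toy dataset
toy = pd.DataFrame({"title": ["Beloved", "Emma"], "year": [1987, 1815]})

# Save without the index, then reload: the columns round-trip unchanged
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))

# Save with the default index=True: the row index comes back as an extra column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)
print(list(reloaded_with_index.columns))
```

Reloading your saved file and comparing its shape and columns to `my_data_cleaned` is a quick way to confirm the save worked as intended.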

## Part 10: Reflection and Cultural Insights

### 🎯 Data Cleaning Reflection:
**What did you learn about your data through the cleaning process?**


**What were the biggest challenges in cleaning your dataset?**


**How did cleaning change your understanding of the data quality?**


### 📊 Cultural Analysis Insights:
**What patterns or trends did you discover in your clean data?**


**What cultural questions does your analysis raise?**


**How might data quality issues have affected historical or cultural research using similar datasets?**


### 🔍 Next Steps:
**What additional analysis would you like to perform?**


**What other datasets could you combine with this one?**


**How could this cleaned data be useful for cultural research or digital humanities projects?**


## 📚 Summary: Your Data Cleaning Journey

Congratulations! You've successfully:

### ✅ Technical Skills Practiced:
- Identified and documented data quality issues
- Applied pandas string methods for text standardization
- Handled missing data with appropriate strategies
- Created clean, analysis-ready datasets
- Performed exploratory data analysis
- Created meaningful visualizations

### ✅ Cultural Research Skills Developed:
- Critical evaluation of data quality and bias
- Understanding the impact of data cleaning choices on analysis
- Recognition of patterns in cultural datasets
- Appreciation for the complexity of real-world cultural data

### 🎓 Next Steps in Your Cultural Data Journey:
1. **Practice with different datasets** to encounter various cleaning challenges
2. **Learn advanced pandas techniques** like merging multiple datasets
3. **Explore specialized tools** for your specific cultural research interests
4. **Share your findings** with classmates and the broader academic community

Remember: **Clean data is the foundation of trustworthy cultural analysis.** The skills you've practiced here will serve you well in any data-driven cultural research project!