<a href="https://colab.research.google.com/github/TCU-DCDA/WRIT20833-2025/blob/main/notebooks/homework/WRIT20833_HW3-2_Pandas_Data_Cleaning_F25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Cleaning & Analysis: Student Practice Notebook
## Apply Your Skills to Real Cultural Data

**Name:** ________________________________  
**Date:** ________________________________

Welcome to your hands-on practice with data cleaning! This notebook is your space to apply the data cleaning and analysis techniques from **Pandas_02** to your own cultural dataset. You'll practice the complete workflow from ethical data collection to meaningful cultural insights.

### 🎯 Learning Goals:
By completing this practice, you will:
- **Apply ethical data collection principles** including robots.txt compliance and AI-era considerations
- **Systematically identify and document data quality issues** in real cultural datasets
- **Use pandas string methods and data handling techniques** to clean messy cultural data
- **Create analysis-ready datasets** through systematic cleaning workflows
- **Generate cultural insights** from properly cleaned data
- **Reflect critically** on how data cleaning choices affect cultural analysis

### 🔄 **Continuing Your Cultural Data Journey from HW3-1**
**Default Approach: Use Your HW3-1 Dataset**
The best learning experience comes from deepening your analysis of the cultural dataset you explored in HW3-1. This continuity allows you to:
- **Build expertise** in your chosen cultural domain
- **Apply progressive skills** from basic exploration to advanced cleaning
- **Reference discoveries** from HW3-1 as you clean and refine the data
- **Create comprehensive analysis** that spans multiple skill levels

### 📋 Dataset Requirements for Effective Cleaning Practice
**✅ Your dataset should have opportunities for:**
- [ ] **Text standardization** (inconsistent capitalization, formatting variations)
- [ ] **Missing data handling** (empty cells, null values, incomplete records)
- [ ] **Category consolidation** (similar values with different names/spellings)
- [ ] **Data type issues** (numbers stored as text, date formatting problems)
- [ ] **Duplicate detection** (exact or near-duplicate entries)

### 🔀 **When to Consider Switching Datasets**
**You may choose a different dataset from HW3-1 if:**
- [ ] **Too clean**: Your HW3-1 data has no missing values, perfect formatting, and no inconsistencies
- [ ] **Too small**: Fewer than 15 rows limits meaningful cleaning practice
- [ ] **Wrong focus**: Your research interests have shifted to a different cultural domain
- [ ] **Limited complexity**: Only text OR only numbers (no mix of data types)

### 📂 **If Switching: Recommended Dataset Sources**
- **Kaggle**: Movies, books, music, museums, historical records
- **Government data**: Cultural statistics, census data, arts funding  
- **Academic repositories**: Digital humanities collections
- **Cultural institutions**: Museum collections, library catalogs

**⚠️ Important**: If you switch datasets, make sure the new dataset meets the data complexity requirements for meaningful cleaning practice!

## Part 0: Understanding Data Collection Ethics & Cultural Responsibility

Before diving into data cleaning, let's establish the ethical foundation for responsible cultural data analysis in the contemporary digital landscape.

### 🌐 Found Data Ethics & robots.txt
When working with cultural datasets, especially those scraped from websites, it's crucial to understand **robots.txt**: the plain-text file a website publishes to tell automated crawlers which parts of the site they may access.

**📝 How to Check robots.txt Compliance:**
- Add `/robots.txt` to a website's root URL (e.g., `https://example.com/robots.txt`)
- This file specifies what automated data collection is permitted
- **Always respect these guidelines** in your research
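
The check described above can also be done in code. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the robots.txt content, the `is_allowed` helper, and the `student-research-bot` user agent are all hypothetical and exist only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, pasted in as text so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /collection/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "student-research-bot") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/about"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/collection/item/42"))
```

For a live site, `RobotFileParser` can fetch the file itself via `set_url()` followed by `read()`. A passing check only tells you what the site permits for crawlers; it does not settle copyright or attribution questions.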

**💡 Example robots.txt concerns for cultural sites:**
- Museum websites often restrict scraping of collection databases
- Literary archives may limit automated access to protect copyright
- Social media platforms restrict bulk downloading of cultural content

### 🤖 AI Era Considerations: LLMs and Data Ethics
The rise of AI training has created new ethical debates around data collection and use:

**Key Contemporary Issues:**
- **Training Data**: Many websites now explicitly restrict their content from AI training
- **Attribution**: Cultural data creators expect proper credit and context
- **Scale**: Large-scale scraping can burden cultural institutions' servers
- **Purpose**: Research goals should respect the intentions of data creators

**📋 Ethical Checklist for Cultural Data:**
- [ ] **Source attribution**: Credit your data sources appropriately
- [ ] **Robots.txt compliance**: Verify data collection followed website guidelines
- [ ] **Scale appropriateness**: Use only the data you need for your research questions
- [ ] **Purpose transparency**: Be clear about your research goals
- [ ] **Cultural sensitivity**: Consider the communities and cultures represented in your data

### 🎭 Cultural Data & Representation
Cultural datasets carry special responsibilities:
- **Whose voices are included?** Consider who created the original cultural artifacts
- **What biases exist?** Historical data often reflects the perspectives of dominant groups
- **How might cleaning affect meaning?** Standardization can erase important cultural nuances
- **What's missing?** Absence of data is often as significant as what's present
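
To make the point about standardization concrete, here is a small sketch that applies `.str.title()` to a few example names. The list is chosen purely for illustration; the same pattern may or may not appear in your own data.

```python
import pandas as pd

# Example names where capitalization carries meaning.
names = pd.Series(["bell hooks", "Ludwig van Beethoven", "Toni Morrison", "Hugh MacDiarmid"])

title_cased = names.str.title()

# Keep only the entries that title-casing actually changed.
changed = names[names != title_cased]

print(title_cased.tolist())
print(f"{len(changed)} of {len(names)} names were altered")
```

Title case rewrites an author's chosen lowercase styling, capitalizes the particle "van", and flattens the internal capital in "MacDiarmid". Reviewing the `changed` entries before overwriting a column is one way to catch this kind of loss.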

### 🤔 Reflection: Your Dataset's Ethics & Cultural Context
**Answer these questions about your chosen dataset before beginning analysis:**

1. **Are you continuing with your HW3-1 dataset or switching? Why?**

2. **What cultural domain does your dataset represent? (literature, art, music, film, etc.)**

3. **Where did this data originate? Who collected it and for what purpose?**

4. **If web-scraped, were ethical guidelines (robots.txt) followed during collection?**

5. **What biases might exist in how this cultural data was categorized or collected?**

6. **Whose voices or perspectives might be missing from this dataset?**

7. **How should you properly attribute this data source in your research?**

8. **What cultural sensitivities should you consider during analysis?**

9. **How might your data cleaning choices affect the cultural meaning of the information?**

10. **If continuing from HW3-1: What data quality issues did you notice during your initial exploration?**

## Part 1: Loading and Initial Exploration

In [None]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Set display options for better readability
pd.options.display.max_rows = 100
pd.options.display.max_columns = 20

In [None]:
# Load your cultural dataset
# Replace 'your_filename.csv' with your actual file name
# For other file types, use: pd.read_excel(), pd.read_json(), etc.

my_data = pd.read_csv('your_filename.csv')

print(f"✅ Loaded dataset with {len(my_data)} rows and {len(my_data.columns)} columns")
print(f"Dataset shape: {my_data.shape}")

In [None]:
# First look at your data
print("First 5 rows:")
my_data.head()

In [None]:
# Get basic information about your dataset
print("Dataset info:")
my_data.info()

print("\n" + "="*50 + "\n")

print("Column names:")
print(list(my_data.columns))

### 🤔 Reflection: Initial Observations
**Write your observations about your dataset:**
1. **What cultural topic does this data represent?**

2. **What are the main columns and what do they represent?**

3. **What data types do you see? (text, numbers, dates, etc.)**

4. **What potential data quality issues do you notice?**

## Part 2: Identifying Data Problems

In [None]:
# Check for missing data
print("Missing data summary:")
missing_data = my_data.isnull().sum()
print(missing_data[missing_data > 0])  # Only show columns with missing data

print(f"\nTotal missing values: {my_data.isnull().sum().sum()}")
print(f"Percentage of missing data: {(my_data.isnull().sum().sum() / (len(my_data) * len(my_data.columns))) * 100:.2f}%")

In [None]:
# Examine unique values in text/categorical columns
print("Unique values in categorical columns:")
print("(Look for inconsistencies, duplicates, and formatting issues)\n")

for col in my_data.select_dtypes(include=['object']).columns:
    # Cast to str so sorting works even when a column mixes text and numbers
    values = sorted(my_data[col].dropna().astype(str).unique())
    unique_count = len(values)
    print(f"{col} ({unique_count} unique values):")
    if unique_count <= 20:  # Show all if 20 or fewer
        print(f"  {values}")
    else:  # Show first 10 if more than 20
        print(f"  First 10: {values[:10]}")
        print(f"  ... and {unique_count - 10} more")
    print()

In [None]:
# Check for potential duplicates
duplicate_rows = my_data.duplicated().sum()
print(f"Complete duplicate rows: {duplicate_rows}")

if duplicate_rows > 0:
    print("\nDuplicate rows:")
    print(my_data[my_data.duplicated()])
else:
    print("✅ No complete duplicate rows found")
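
`duplicated()` compares rows exactly, so near-duplicates that differ only in spacing or capitalization go undetected. The following sketch shows one possible way to flag them, using hypothetical catalog rows and a normalized comparison key; real near-duplicates may need more than trimming and lowercasing.

```python
import pandas as pd

# Hypothetical catalog rows: the first two describe the same work with different formatting.
df = pd.DataFrame({
    "title": ["Moby-Dick", "  moby-dick ", "Beloved"],
    "author": ["Herman Melville", "HERMAN MELVILLE", "Toni Morrison"],
})

# Exact comparison finds no duplicates.
print(df.duplicated().sum())

# Build a trimmed, lowercased key used only for comparison; the original text is untouched.
key = df[["title", "author"]].apply(lambda col: col.str.strip().str.lower())
near_duplicates = key.duplicated(keep=False)

print(df[near_duplicates])
```

With `keep=False`, every member of a duplicate group is flagged, which makes it easier to review the rows side by side before deciding which one to keep.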

### 📝 Data Problems Identified:
**List the specific issues you found in your data:**

**Missing Data:**
- 

**Text/Formatting Issues:**
- 

**Duplicate Issues:**
- 

**Other Problems:**
- 

## Part 3: Creating Your Data Cleaning Plan

Before cleaning, create a systematic plan based on your data problems.

In [None]:
# Create a copy for cleaning (preserve original)
my_data_cleaned = my_data.copy()

print("✅ Created copy for cleaning")
print(f"Original data: {len(my_data)} rows")
print(f"Working copy: {len(my_data_cleaned)} rows")

### 🎯 My Cleaning Strategy:
**Write your step-by-step plan:**

1. **Missing Data Strategy:**
   - 

2. **Text Standardization Tasks:**
   - 

3. **Category Cleaning:**
   - 

4. **Duplicate Handling:**
   - 

## Part 4: Text Cleaning and Standardization

Apply pandas string methods to clean and standardize your text data.

In [None]:
# Example: Standardize capitalization for a text column
# Replace 'column_name' with your actual column name

# Before cleaning:
# print("Before standardization:")
# print(my_data_cleaned['column_name'].unique())

# Apply title case:
# my_data_cleaned['column_name'] = my_data_cleaned['column_name'].str.title()

# After cleaning:
# print("\nAfter standardization:")
# print(my_data_cleaned['column_name'].unique())

print("Add your text cleaning code here")
print("Use the examples from the main lesson as templates")
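
As a worked instance of the template above, this sketch chains several pandas string methods on a hypothetical genre column. The values are invented; adapt the steps to the problems you actually found.

```python
import pandas as pd

# Hypothetical values showing stray spaces and inconsistent capitalization.
genres = pd.Series(["  gothic fiction", "Gothic Fiction ", "GOTHIC   FICTION", "satire"])

cleaned = (
    genres.str.strip()                            # remove leading/trailing spaces
          .str.replace(r"\s+", " ", regex=True)   # collapse runs of internal whitespace
          .str.title()                            # apply consistent capitalization
)

print(genres.nunique(), "unique values before cleaning")
print(cleaned.unique())
```

Comparing the number of unique values before and after is a quick check that the cleaning merged the variants you intended and nothing else.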

In [None]:
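# Example: Create category mappings to standardize similar values
# Replace with your actual column and categories
# Note: mapping keys must be lowercase, because the column is lowercased before the mapping is applied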

# category_mapping = {
#     'old_value_1': 'Standard_Value_1',
#     'old_value_2': 'Standard_Value_1',  # Multiple old values can map to same new value
#     'old_value_3': 'Standard_Value_2'
# }

# Apply mapping:
# my_data_cleaned['column_name_clean'] = my_data_cleaned['column_name'].str.lower()
# my_data_cleaned['column_name_clean'] = my_data_cleaned['column_name_clean'].replace(category_mapping)
# my_data_cleaned['column_name_clean'] = my_data_cleaned['column_name_clean'].str.title()

print("Add your category mapping code here")
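
Here is a small worked version of the mapping template, assuming a hypothetical `genre` column. The categories and the decision about which labels to merge are illustrative only; in real cultural data, merging categories is an interpretive choice worth documenting.

```python
import pandas as pd

# Hypothetical genre labels with several spellings of the same category.
df = pd.DataFrame({"genre": ["Sci-Fi", "science fiction", "SF", "Romance", "romantic"]})

# Keys are lowercase because the column is lowercased before mapping.
category_mapping = {
    "sci-fi": "science fiction",
    "sf": "science fiction",
    "romantic": "romance",
}

lowered = df["genre"].str.strip().str.lower()
df["genre_clean"] = lowered.replace(category_mapping).str.title()

print(df["genre_clean"].value_counts())
```

Writing the result to a new column (`genre_clean`) keeps the original labels available, so the consolidation can be audited or reversed later.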

In [None]:
# Handle missing data for text columns
# Choose appropriate strategy based on your data

# Option 1: Fill with placeholder
# my_data_cleaned['column_name'] = my_data_cleaned['column_name'].fillna('Unknown')

# Option 2: Drop rows with missing values in critical columns
# my_data_cleaned = my_data_cleaned.dropna(subset=['critical_column'])

print("Add your missing data handling code here")
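
The two options above lead to different datasets. This sketch contrasts them on a hypothetical table in which `title` is treated as a critical column and `translator` as optional; which columns are critical in your data is your call.

```python
import pandas as pd

# Hypothetical records with gaps in both columns.
df = pd.DataFrame({
    "title": ["Beloved", "Ulysses", None],
    "translator": [None, None, "Edith Grossman"],
})

# Option 1: keep every row, but make the gap explicit.
filled = df.assign(translator=df["translator"].fillna("Unknown"))

# Option 2: drop rows missing a field the analysis cannot do without.
required = df.dropna(subset=["title"])

print(len(filled), "rows after filling;", len(required), "rows after dropping")
```

Filling preserves the row count but introduces a new category ("Unknown") that will show up in later counts; dropping shrinks the dataset and may remove records from underrepresented groups.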

## Part 5: Handling Numeric Data and Missing Values

In [None]:
# Handle missing numeric data
# Choose strategy based on your data and research questions

# Option 1: Fill with median (robust to outliers)
# numeric_column = 'your_numeric_column'
# median_value = my_data_cleaned[numeric_column].median()
# my_data_cleaned[numeric_column] = my_data_cleaned[numeric_column].fillna(median_value)
# print(f"Filled missing {numeric_column} with median: {median_value}")

# Option 2: Fill with mean
# mean_value = my_data_cleaned[numeric_column].mean()
# my_data_cleaned[numeric_column] = my_data_cleaned[numeric_column].fillna(mean_value)

# Option 3: Fill based on category groups
# my_data_cleaned[numeric_column] = my_data_cleaned.groupby('category_column')[numeric_column].transform(lambda x: x.fillna(x.median()))

print("Add your numeric data cleaning code here")
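
To see how Option 1 and Option 3 differ, the sketch below fills the same hypothetical `pages` column both ways. The data and column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical works: novels are much longer than poetry collections.
df = pd.DataFrame({
    "genre": ["novel", "novel", "novel", "poetry", "poetry"],
    "pages": [300.0, np.nan, 500.0, 80.0, np.nan],
})

# Overall median: one value for every gap, regardless of genre.
overall = df["pages"].fillna(df["pages"].median())

# Group median: each gap is filled from its own genre.
by_genre = df["pages"].fillna(df.groupby("genre")["pages"].transform("median"))

print(overall.tolist())
print(by_genre.tolist())
```

Here the overall median would assign a poetry collection a novel-like page count, while the group median keeps the fill value in line with its category. Either way, filled values are estimates and should be reported as such.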

In [None]:
# Convert data types if needed
# Example: Convert string numbers to numeric
# my_data_cleaned['numeric_column'] = pd.to_numeric(my_data_cleaned['numeric_column'], errors='coerce')

# Example: Convert string dates to datetime
# my_data_cleaned['date_column'] = pd.to_datetime(my_data_cleaned['date_column'], errors='coerce')

print("Add data type conversion code here if needed")
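
With `errors='coerce'`, any value that cannot be parsed becomes missing (`NaN` or `NaT`) without raising an error, so it is worth counting the new gaps after converting. A minimal sketch, assuming hypothetical `price` and `published` columns stored as text:

```python
import pandas as pd

# Hypothetical text columns that should be numeric and datetime.
df = pd.DataFrame({
    "price": ["12.50", "$8", "n/a"],
    "published": ["1922-02-02", "1987-09-16", "not recorded"],
})

# Strip the currency symbol first; anything still unparseable becomes NaN.
df["price_num"] = pd.to_numeric(df["price"].str.replace("$", "", regex=False), errors="coerce")

# Unparseable dates become NaT.
df["published_date"] = pd.to_datetime(df["published"], errors="coerce")

print(df.dtypes)
print(df[["price_num", "published_date"]].isnull().sum())
```

Writing the converted values to new columns makes it easy to compare them against the original text and see exactly which entries were coerced to missing.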

## Part 6: Checking Your Cleaning Work

In [None]:
# Compare before and after cleaning
print("CLEANING SUMMARY")
print("=" * 50)
print(f"Original dataset: {len(my_data)} rows, {len(my_data.columns)} columns")
print(f"Cleaned dataset: {len(my_data_cleaned)} rows, {len(my_data_cleaned.columns)} columns")
print(f"Rows removed: {len(my_data) - len(my_data_cleaned)}")
print(f"Columns added: {len(my_data_cleaned.columns) - len(my_data.columns)}")

print("\nMissing data comparison:")
print(f"Original missing values: {my_data.isnull().sum().sum()}")
print(f"Cleaned missing values: {my_data_cleaned.isnull().sum().sum()}")

In [None]:
# Final check of cleaned data
print("Cleaned data sample:")
my_data_cleaned.head()

In [None]:
# Check remaining data issues
print("Remaining missing data:")
remaining_missing = my_data_cleaned.isnull().sum()
print(remaining_missing[remaining_missing > 0])

print("\nData types:")
print(my_data_cleaned.dtypes)

## Part 7: Data Analysis and Exploration

Now that your data is clean, perform some exploratory analysis!

In [None]:
# Basic descriptive statistics for numeric columns
print("Descriptive statistics:")
my_data_cleaned.describe()

In [None]:
# Value counts for categorical columns
# Replace 'category_column' with your actual column name

# print("Distribution of categories:")
# category_counts = my_data_cleaned['category_column'].value_counts()
# print(category_counts)

print("Add your categorical analysis code here")

In [None]:
# Groupby analysis (if you have both categorical and numeric data)
# Replace column names with your actual columns

# grouped_analysis = my_data_cleaned.groupby('category_column').agg({
#     'numeric_column_1': ['mean', 'count'],
#     'numeric_column_2': ['sum', 'median']
# })
# print("Analysis by category:")
# print(grouped_analysis)

print("Add your groupby analysis code here")
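
As an alternative to the dictionary form above, pandas also supports named aggregation, which gives the output columns readable names. A minimal sketch on hypothetical data (the column names and values are invented):

```python
import pandas as pd

# Hypothetical cleaned dataset with one categorical and two numeric columns.
df = pd.DataFrame({
    "genre": ["Novel", "Novel", "Poetry", "Poetry", "Poetry"],
    "pages": [300, 500, 80, 120, 100],
    "rating": [4.0, 3.0, 5.0, 4.0, 3.0],
})

# Each keyword is an output column: (source column, aggregation).
summary = df.groupby("genre").agg(
    works=("pages", "count"),
    median_pages=("pages", "median"),
    mean_rating=("rating", "mean"),
)

print(summary)
```

Including a count alongside each average helps readers judge how much weight a group's statistics can bear; a mean based on two works says less than one based on two hundred.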

## Part 8: Data Visualization

In [None]:
# Create visualizations appropriate for your data
# Examples:

# Bar chart for categorical data:
# category_counts = my_data_cleaned['category_column'].value_counts()
# category_counts.plot(kind='bar', title='Distribution of Categories', figsize=(10, 6))
# plt.xlabel('Category')
# plt.ylabel('Count')
# plt.xticks(rotation=45)
# plt.tight_layout()
# plt.show()

print("Add your visualization code here")
print("Consider: bar charts, histograms, scatter plots, or time series plots")

In [None]:
# Second visualization
# Example: Histogram for numeric data:
# my_data_cleaned['numeric_column'].plot(kind='hist', bins=20, title='Distribution of Numeric Values', figsize=(8, 6))
# plt.xlabel('Value')
# plt.ylabel('Frequency')
# plt.show()

print("Add a second visualization here")
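
For reference, here is a small self-contained histogram sketch. The publication years are hypothetical, and the bin count is arbitrary; try a few values to see how the picture changes.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical publication years for a handful of works.
years = pd.Series([1813, 1818, 1847, 1851, 1922, 1925, 1987], name="year")

# .plot(kind="hist") returns a matplotlib Axes, which is used to label the chart.
ax = years.plot(kind="hist", bins=5, title="Publication Years (hypothetical data)", figsize=(8, 4))
ax.set_xlabel("Year")
ax.set_ylabel("Number of works")

plt.tight_layout()
plt.show()
```

Labeling the axes and stating in the title what the data covers makes the chart readable on its own, which matters when figures are shared outside the notebook.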

## Part 9: Save Your Work

In [None]:
# Save your cleaned dataset
output_filename = 'my_cleaned_cultural_data.csv'
my_data_cleaned.to_csv(output_filename, index=False)
print(f"✅ Saved cleaned dataset as: {output_filename}")

# Optional: Save analysis results
# if 'grouped_analysis' in locals():
#     grouped_analysis.to_csv('my_analysis_results.csv')
#     print("✅ Saved analysis results")

## Part 10: Reflection and Cultural Insights

### 🎯 Data Cleaning Reflection (Connect to Pandas_02 Lesson):
**How did the techniques from the Pandas_02 lesson work with your specific dataset?**


**What challenges did you encounter that weren't covered in the lesson examples?**


**Which pandas string methods (.str.title(), .str.replace(), etc.) were most useful for your data?**


**How did your missing data strategy compare to the approaches demonstrated in the lesson?**


### 🔄 Dataset Continuity Experience:
**If you continued with your HW3-1 dataset: How did working with the same data across both assignments enhance your learning?**


**What new insights about your cultural domain emerged through the cleaning process?**


**How did your understanding of the data's limitations and biases evolve from HW3-1 to HW3-2?**


**If you switched datasets: What did you learn from comparing the data quality issues in both datasets?**


### 📊 Cultural Analysis Insights:
**What patterns or trends did you discover in your cleaned data?**


**How do your findings compare to the literary dataset patterns shown in Pandas_02?**


**What cultural questions does your analysis raise that require further investigation?**


**How might data quality issues have affected historical or cultural research using similar datasets?**


### 🤖 Ethics and AI-Era Considerations:
**How did considering robots.txt and data collection ethics affect your analysis approach?**


**What biases or limitations did you discover in your dataset during the cleaning process?**


**How might AI training considerations influence how cultural institutions share their data in the future?**


**What responsibilities do cultural data analysts have in the current AI landscape?**


### 🔍 Next Steps and Future Research:
**What additional data cleaning techniques would improve your analysis?**


**What other cultural datasets could you combine with this one for richer insights?**


**How could this cleaned data contribute to broader digital humanities or cultural research projects?**


**What questions would you ask the original data collectors about their methodology?**

## 📚 Summary: Your Data Cleaning Journey

Congratulations! You've successfully applied the data cleaning techniques from **Pandas_02** to your own cultural dataset!

### ✅ Technical Skills Practiced from Pandas_02:
- **Identified and documented data quality issues** using systematic exploration methods
- **Applied pandas string methods** (.str.title(), .str.replace()) for text standardization, along with .fillna() for filling gaps
- **Handled missing data** with appropriate strategies for cultural research contexts
- **Created clean, analysis-ready datasets** through systematic workflows
- **Performed exploratory data analysis** to reveal cultural patterns
- **Created meaningful visualizations** that communicate cultural insights effectively

### ✅ Cultural Research Skills Developed:
- **Applied ethical data collection principles** including robots.txt compliance and AI-era considerations
- **Critical evaluation of data quality and bias** in cultural contexts
- **Understanding the impact of data cleaning choices** on cultural analysis outcomes
- **Recognition of patterns in cultural datasets** and their broader significance
- **Appreciation for the complexity** of real-world cultural data and its limitations

### ✅ Contemporary Digital Humanities Skills:
- **Ethical data stewardship** in an era of AI training and large-scale data collection
- **Cultural sensitivity** in data standardization and categorization processes
- **Critical assessment** of whose voices are represented (and missing) in cultural datasets
- **Responsible attribution** and source documentation practices

### 🎓 Next Steps in Your Cultural Data Journey:
1. **Expand your toolkit**: Learn advanced pandas techniques like merging datasets and time series analysis
2. **Deepen domain expertise**: Explore specialized tools for your specific cultural research interests
3. **Build ethical practices**: Develop frameworks for responsible cultural data collection and analysis
4. **Share responsibly**: Present findings with appropriate context about limitations and biases
5. **Contribute to the field**: Engage with digital humanities communities and contemporary debates about data ethics

### 🔗 Connection to Course Goals:
The data cleaning skills you've practiced here directly support:
- **Critical digital literacy**: Understanding how data collection and processing choices shape cultural narratives
- **Ethical research practices**: Applying contemporary standards for responsible data use
- **Technical proficiency**: Building computational skills for cultural analysis
- **Cultural insight**: Using quantitative methods to deepen understanding of cultural phenomena

Remember: **Clean data is the foundation of trustworthy cultural analysis, but the choices we make in cleaning reflect our values and responsibilities as cultural researchers.** The skills you've practiced here, both technical and ethical, will serve you well in any data-driven cultural research project!

### 📋 Submit Your Work:
Ensure your notebook includes:
- [ ] **Completed ethics reflection** with thoughtful consideration of data sources and biases
- [ ] **Working code** applied to your actual cultural dataset
- [ ] **Clear documentation** of your data cleaning decisions and their rationale
- [ ] **Meaningful visualizations** that reveal cultural patterns
- [ ] **Critical analysis** connecting your findings to broader cultural questions
- [ ] **Reflection on limitations** and future research directions
- [ ] **Proper attribution** of data sources and acknowledgment of ethical considerations