<a href="https://colab.research.google.com/github/TCU-DCDA/WRIT20833-2025/blob/main/notebooks/exercises/Review_09_Integration_Final_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WRIT 20833 Review 09: Integration

Integrate the Python skills from Reviews 01-08 into a single project.

**Make a copy:** File > Save a copy in Drive

## Project Overview: Cultural Data Analysis Portfolio

Create a complete cultural data analysis project that demonstrates mastery of:
- Data collection and ethics (Review 05)
- Data processing with Pandas (Review 06) 
- Text analysis and sentiment (Review 07)
- Data visualization (Review 08)
- All foundational Python skills (Reviews 01-04)

## Exercise 1: Project Setup and Data Collection
Choose your cultural domain and set up your analysis framework.

In [None]:
# Import all necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter
import re

# VADER for sentiment analysis
# If the import fails in Colab, run `!pip install vaderSentiment` in a cell, then re-run this one
try:
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()
    print("VADER sentiment analysis ready")
except ImportError:
    print("VADER not available, using basic sentiment analysis")
    analyzer = None

# PROJECT CHOICE: Select your cultural domain
# Options: books, movies, music, art, theater, digital_culture, etc.

project_domain = "books"  # TODO: Change to your chosen domain

print(f" CULTURAL ANALYSIS PROJECT: {project_domain.upper()}")
print("=" * 50)

# Sample dataset - replace with your own data!
cultural_data = {
    'title': ['1984', 'Pride and Prejudice', 'The Handmaid\'s Tale', 'Beloved', 'The Great Gatsby'],
    'creator': ['George Orwell', 'Jane Austen', 'Margaret Atwood', 'Toni Morrison', 'F. Scott Fitzgerald'],
    'year': [1949, 1813, 1985, 1987, 1925],
    'genre': ['Dystopian', 'Romance', 'Dystopian', 'Historical Fiction', 'Modernist'],
    'description': [
        'A totalitarian society under constant surveillance where independent thinking is a crime.',
        'A witty exploration of love, marriage, and social class in Regency England.',
        'A dystopian tale of women\'s rights and reproductive freedom in a theocratic society.',
        'A powerful story of slavery, trauma, and the lasting effects of historical injustice.',
        'The decline of the American Dream through the eyes of the mysterious Jay Gatsby.'
    ],
    'themes': [
        'surveillance, totalitarianism, truth, freedom, oppression',
        'love, marriage, social class, wit, independence', 
        'feminism, reproductive rights, religious extremism, resistance',
        'slavery, trauma, memory, motherhood, healing',
        'American Dream, wealth, love, illusion, moral decay'
    ]
}

df = pd.DataFrame(cultural_data)
print(f"Dataset loaded: {len(df)} items")
print("\nFirst few items:")
print(df[['title', 'creator', 'year', 'genre']].head())

# Data validation and cleaning
print("\n DATA QUALITY CHECK:")
print(f"Missing values: {df.isnull().sum().sum()}")
print(f"Duplicates: {df.duplicated().sum()}")
print(f"Year range: {df['year'].min()} - {df['year'].max()}")
print(f"Unique creators: {df['creator'].nunique()}")
print(f"Genres: {', '.join(df['genre'].unique())}")
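
To swap in your own data, one option is to load a CSV and check that it has the columns the later cells expect. The sketch below uses only the standard library. The CSV text, the `REQUIRED_COLUMNS` set, and the `load_records` helper are illustrative assumptions, not part of the course materials.

```python
import csv
import io

# Columns the later exercises read from the DataFrame
REQUIRED_COLUMNS = {"title", "creator", "year", "genre", "description", "themes"}

# Hypothetical CSV content standing in for a file you would upload
SAMPLE_CSV = """title,creator,year,genre,description,themes
Frankenstein,Mary Shelley,1818,Gothic,A scientist creates a living being and abandons it.,"ambition, isolation, responsibility"
Dracula,Bram Stoker,1897,Gothic,A vampire travels from Transylvania to England.,"fear, desire, modernity"
"""

def load_records(csv_text):
    """Parse CSV text into a list of dicts, checking columns and year values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    records = []
    for row in reader:
        row["year"] = int(row["year"])  # raises ValueError on a non-numeric year
        records.append(row)
    return records

records = load_records(SAMPLE_CSV)
print(f"Loaded {len(records)} records")
```

Calling `pd.DataFrame(records)` on the result should give a DataFrame with the same column names as the sample dataset, so the remaining cells can run unchanged.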

## Exercise 2: Text Analysis and Theme Extraction
Analyze textual content using string methods and sentiment analysis.
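
One detail worth checking in any word-list sentiment scorer is whether it matches whole words or substrings. The following is a minimal sketch of the difference; `count_hits` and the example sentence are hypothetical and exist only for illustration.

```python
import re

def count_hits(text, word_list, whole_words=True):
    """Count how many entries of word_list occur in text."""
    text_lower = text.lower()
    if whole_words:
        # Split into word tokens first, then test membership
        tokens = set(re.findall(r"[a-z']+", text_lower))
        return sum(1 for word in word_list if word in tokens)
    # Substring matching: 'war' also matches inside 'toward'
    return sum(1 for word in word_list if word in text_lower)

negative_words = ["death", "war", "tragic"]
sentence = "She walked toward the garden as a reward."

print(count_hits(sentence, negative_words, whole_words=False))  # 1
print(count_hits(sentence, negative_words, whole_words=True))   # 0
```

The sentence contains no negative word, yet substring matching reports one hit because `war` sits inside `toward` and `reward`.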

In [None]:
# Text preprocessing function (Review 02 skills)
def clean_text(text):
    """Clean and standardize text data"""
    if pd.isna(text):
        return ""
    text = str(text).lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.strip()

# Sentiment analysis function (Review 07 skills)
def analyze_sentiment(text):
    """Analyze sentiment of text descriptions"""
    if analyzer:
        scores = analyzer.polarity_scores(text)
        return scores['compound']
    else:
        # Basic sentiment using word lists
        positive_words = ['love', 'beautiful', 'amazing', 'wonderful', 'great', 'excellent', 'brilliant']
        negative_words = ['death', 'war', 'tragic', 'terrible', 'awful', 'horrible', 'dystopian', 'oppression']
        
        # Match whole words only, so 'war' is not counted inside 'toward'
        words = set(re.findall(r"[a-z']+", text.lower()))
        pos_count = sum(1 for word in positive_words if word in words)
        neg_count = sum(1 for word in negative_words if word in words)
        
        if pos_count > neg_count:
            return 0.5
        elif neg_count > pos_count:
            return -0.5
        else:
            return 0.0

# Theme extraction function (Review 03 & 04 skills)
def extract_themes(themes_string):
    """Split a comma-separated theme string into a list of clean theme names"""
    if pd.isna(themes_string):
        return []
    # Skip empty entries left by stray or trailing commas
    return [theme.strip() for theme in themes_string.split(',') if theme.strip()]

# Apply text analysis
print(" TEXT ANALYSIS RESULTS:")
print("=" * 30)

# Clean text data
df['description_clean'] = df['description'].apply(clean_text)

# Sentiment analysis (scored on the raw description, since VADER uses punctuation and capitalization as cues)
df['sentiment_score'] = df['description'].apply(analyze_sentiment)
df['sentiment_category'] = df['sentiment_score'].apply(
    lambda x: 'Positive' if x > 0.1 else 'Negative' if x < -0.1 else 'Neutral'
)

# Theme extraction
df['theme_list'] = df['themes'].apply(extract_themes)
df['theme_count'] = df['theme_list'].apply(len)

# Display results
print("Sentiment Analysis Results:")
for idx, row in df.iterrows():
    print(f"{row['title'][:25]:<25} | Sentiment: {row['sentiment_category']:<8} ({row['sentiment_score']:.2f})")

print(f"\nOverall sentiment distribution:")
sentiment_counts = df['sentiment_category'].value_counts()
for category, count in sentiment_counts.items():
    print(f"{category}: {count} items ({count/len(df)*100:.1f}%)")

# Theme frequency analysis
all_themes = []
for theme_list in df['theme_list']:
    all_themes.extend(theme_list)

theme_frequency = Counter(all_themes)
print(f"\nMost common themes:")
for theme, count in theme_frequency.most_common(5):
    print(f"{theme}: {count} occurrences")
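
Theme counts depend on labels matching exactly, so `American Dream` and `american dream` would be tallied separately. A possible way to normalize labels before counting is sketched below; `normalize_theme`, `count_themes`, and the sample strings are assumptions made for illustration.

```python
from collections import Counter

def normalize_theme(theme):
    """Lowercase a theme label and collapse internal whitespace."""
    return " ".join(theme.lower().split())

def count_themes(theme_strings):
    """Count normalized themes across comma-separated strings."""
    counts = Counter()
    for themes_string in theme_strings:
        for raw in themes_string.split(","):
            theme = normalize_theme(raw)
            if theme:  # skip empty entries left by trailing commas
                counts[theme] += 1
    return counts

# Hypothetical input with inconsistent casing and spacing
sample = [
    "love, marriage, social class",
    "American Dream, wealth, Love ",
    "american  dream, illusion,",
]
counts = count_themes(sample)
print(counts.most_common(2))
```

Whether to lowercase is a judgment call: it merges spelling variants, but it also discards capitalization that may matter for proper nouns.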

## Exercise 3: Statistical Analysis and Patterns
Use Pandas for data analysis and pattern discovery.

In [None]:
# Statistical analysis (Review 06 skills)
print(" STATISTICAL ANALYSIS:")
print("=" * 25)

# Time period analysis
df['century'] = (df['year'] - 1) // 100 + 1  # ordinal century: 1949 -> 20, 1813 -> 19
df['decade'] = (df['year'] // 10) * 10
df['era'] = df['year'].apply(
    lambda x: 'Pre-1900' if x < 1900 else '20th Century' if x < 2000 else '21st Century'
)

# Genre analysis
genre_stats = df.groupby('genre').agg({
    'year': ['min', 'max', 'mean'],
    'sentiment_score': 'mean',
    'theme_count': 'mean',
    'title': 'count'
}).round(2)

print("Genre Statistics:")
print(genre_stats)

# Era analysis 
era_stats = df.groupby('era').agg({
    'sentiment_score': 'mean',
    'theme_count': 'mean',
    'title': 'count'
}).round(2)

print("\nEra Analysis:")
print(era_stats)

# Creator productivity
creator_analysis = df['creator'].value_counts()
print(f"\nMost prolific creators:")
for creator, count in creator_analysis.head(3).items():
    avg_sentiment = df[df['creator'] == creator]['sentiment_score'].mean()
    print(f"{creator}: {count} work(s), avg sentiment: {avg_sentiment:.2f}")

# Correlation analysis
numeric_cols = ['year', 'sentiment_score', 'theme_count']
correlations = df[numeric_cols].corr()
print(f"\nCorrelations:")
print(correlations.round(3))

# Advanced filtering and analysis
print(f"\n ADVANCED INSIGHTS:")

# Most thematically complex works
complex_works = df.nlargest(2, 'theme_count')
print(f"Most thematically complex works:")
for _, work in complex_works.iterrows():
    print(f"- {work['title']} ({work['theme_count']} themes)")

# Sentiment by era
print(f"\nSentiment trends by era:")
for era in df['era'].unique():
    era_sentiment = df[df['era'] == era]['sentiment_score'].mean()
    era_count = len(df[df['era'] == era])
    print(f"{era}: {era_sentiment:.2f} average sentiment ({era_count} works)")
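
The time buckets are plain integer arithmetic, so they can be checked on single years before being applied to a whole column. A small sketch follows; the helper names are illustrative, and `century_of` returns an ordinal such as 20 rather than a year value.

```python
def century_of(year):
    """Return the ordinal century for a year, e.g. 1900 -> 19, 1901 -> 20."""
    return (year - 1) // 100 + 1

def decade_of(year):
    """Return the first year of the decade, e.g. 1987 -> 1980."""
    return (year // 10) * 10

def era_of(year):
    """Label a year using the same cutoffs as the era column (1900 and 2000)."""
    if year < 1900:
        return "Pre-1900"
    if year < 2000:
        return "20th Century"
    return "21st Century"

for year in [1813, 1900, 1949, 2000]:
    print(year, century_of(year), decade_of(year), era_of(year))
```

Note that the era labels use round-number cutoffs, so 1900 is labelled "20th Century" even though it is formally the last year of the 19th. Either convention is defensible as long as it is stated in the write-up.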

## Exercise 4: Data Visualization Dashboard
Create compelling visualizations to communicate your findings.

In [None]:
# Comprehensive visualization dashboard (Review 08 skills)
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle(f'{project_domain.title()} Cultural Analysis Dashboard', fontsize=16, fontweight='bold')

# 1. Timeline visualization
ax1.scatter(df['year'], df['sentiment_score'], c=df['theme_count'], 
           cmap='viridis', s=100, alpha=0.7, edgecolors='black')
ax1.set_xlabel('Year')
ax1.set_ylabel('Sentiment Score')
ax1.set_title('Sentiment Over Time (color = theme count)')
ax1.grid(True, alpha=0.3)
ax1.axhline(y=0, color='red', linestyle='--', alpha=0.5)

# 2. Genre distribution
genre_counts = df['genre'].value_counts()
colors = plt.cm.Set3(np.linspace(0, 1, len(genre_counts)))
ax2.pie(genre_counts.values, labels=genre_counts.index, autopct='%1.1f%%', 
        colors=colors, startangle=90)
ax2.set_title('Genre Distribution')

# 3. Sentiment by genre
sentiment_by_genre = df.groupby('genre')['sentiment_score'].mean().sort_values()
bars = ax3.barh(sentiment_by_genre.index, sentiment_by_genre.values, 
                color=['red' if x < 0 else 'green' if x > 0 else 'gray' 
                       for x in sentiment_by_genre.values], alpha=0.7)
ax3.set_xlabel('Average Sentiment Score')
ax3.set_title('Sentiment by Genre')
ax3.axvline(x=0, color='black', linestyle='-', alpha=0.5)

# 4. Theme complexity over time
decade_themes = df.groupby('decade')['theme_count'].mean()
ax4.plot(decade_themes.index, decade_themes.values, 'o-', linewidth=2, markersize=8)
ax4.set_xlabel('Decade')
ax4.set_ylabel('Average Theme Count')
ax4.set_title('Thematic Complexity Trends')
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Additional focused visualizations
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Theme frequency bar chart
top_themes = theme_frequency.most_common(8)
themes, counts = zip(*top_themes)
ax1.bar(themes, counts, color='skyblue', alpha=0.8)
ax1.set_xlabel('Themes')
ax1.set_ylabel('Frequency')
ax1.set_title('Most Common Themes')
ax1.tick_params(axis='x', rotation=45)

# Era comparison
era_sentiment = df.groupby('era')['sentiment_score'].mean()
ax2.bar(era_sentiment.index, era_sentiment.values, 
        color=['lightcoral', 'lightgreen', 'lightblue'][:len(era_sentiment)], alpha=0.8)
ax2.set_xlabel('Era')
ax2.set_ylabel('Average Sentiment')
ax2.set_title('Sentiment Across Eras')
ax2.axhline(y=0, color='black', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

print(" VISUALIZATION INSIGHTS:")
print(f" Most positive genre: {sentiment_by_genre.index[-1]} ({sentiment_by_genre.iloc[-1]:.2f})")
print(f" Most negative genre: {sentiment_by_genre.index[0]} ({sentiment_by_genre.iloc[0]:.2f})")
print(f" Most complex themes appear in: {df.loc[df['theme_count'].idxmax(), 'title']}")
print(f" Era with highest sentiment: {era_sentiment.idxmax()} ({era_sentiment.max():.2f})")
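
In the first dashboard panel, theme count is passed as `c=`, so it drives marker color. If you also want marker size to carry that value, one approach is to compute a size per point and pass the list as `s=` in the scatter call. The sketch below only computes the sizes; `marker_sizes`, `base`, and `step` are hypothetical names and values.

```python
def marker_sizes(theme_counts, base=40, step=30):
    """Map theme counts to marker areas (points squared), growing linearly."""
    smallest = min(theme_counts)
    return [base + step * (count - smallest) for count in theme_counts]

theme_counts = [5, 5, 4, 5, 5]  # theme counts from the sample dataset
sizes = marker_sizes(theme_counts)
print(sizes)  # [70, 70, 40, 70, 70]
```

Encoding one variable as both color and size is redundant, but it can make a small scatter plot easier to read in grayscale.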

## Exercise 5: Final Project Presentation
Synthesize your analysis into key findings and conclusions.

In [None]:
# Generate comprehensive project summary
print(" FINAL PROJECT SUMMARY")
print("=" * 40)
print(f"Cultural Domain: {project_domain.title()}")
print(f"Dataset Size: {len(df)} items")
print(f"Time Span: {df['year'].max() - df['year'].min()} years ({df['year'].min()}-{df['year'].max()})")
print(f"Unique Creators: {df['creator'].nunique()}")
print(f"Genres Analyzed: {df['genre'].nunique()}")

print("\n KEY FINDINGS:")
print("-" * 20)

# Finding 1: Sentiment trends
overall_sentiment = df['sentiment_score'].mean()
sentiment_trend = "positive" if overall_sentiment > 0.1 else "negative" if overall_sentiment < -0.1 else "neutral"
print(f"1. Overall sentiment is {sentiment_trend} (avg: {overall_sentiment:.2f})")

# Finding 2: Genre insights
dominant_genre = df['genre'].value_counts().index[0]
genre_percentage = (df['genre'].value_counts().iloc[0] / len(df)) * 100
print(f"2. {dominant_genre} is the dominant genre ({genre_percentage:.1f}% of works)")

# Finding 3: Temporal patterns
modern_works = len(df[df['year'] >= 1950])
historical_works = len(df[df['year'] < 1950])
if modern_works > historical_works:
    temporal_focus = "modern"
else:
    temporal_focus = "historical"
print(f"3. Dataset focuses on {temporal_focus} works ({modern_works} modern vs {historical_works} historical)")

# Finding 4: Thematic complexity
avg_themes = df['theme_count'].mean()
most_complex = df.loc[df['theme_count'].idxmax(), 'title']
print(f"4. Average thematic complexity: {avg_themes:.1f} themes per work")
print(f"   Most complex: '{most_complex}' ({df['theme_count'].max()} themes)")

# Finding 5: Creator patterns
creator_counts = df['creator'].value_counts()
if creator_counts.iloc[0] > 1:
    print(f"5. Most prolific creator: {creator_counts.index[0]} ({creator_counts.iloc[0]} works)")
else:
    print("5. All creators represented equally (1 work each)")

print("\n RESEARCH QUESTIONS RAISED:")
print("-" * 30)
print("1. How do cultural and historical contexts influence thematic content?")
print("2. What factors contribute to sentiment patterns in cultural works?")
print("3. How has thematic complexity evolved over time?")
print("4. What role does genre play in cultural expression and reception?")
print("5. How might computational analysis complement traditional cultural criticism?")

print("\n SKILLS DEMONSTRATED:")
print("-" * 25)
skills_checklist = [
    "Data collection and validation",
    "Text processing and cleaning",
    "Sentiment analysis implementation",
    "Statistical analysis with Pandas",
    "Data visualization and dashboards",
    "Ethical consideration of cultural data",
    "Pattern recognition and interpretation",
    "Research question formulation"
]

for skill in skills_checklist:
    print(skill)

print("\n" + "=" * 40)
print(" CONGRATULATIONS!")
print("You've completed a comprehensive cultural data analysis project!")
print("This demonstrates mastery of Python for digital humanities research.")
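
The same 0.1 cutoff is used twice in this notebook: once per work in Exercise 2 and once for the dataset average in Finding 1. A single helper is one way to keep the two in step. The sketch below is an assumption about how that could look; `label_sentiment` and its `threshold` parameter are not defined elsewhere in the notebook.

```python
def label_sentiment(score, threshold=0.1):
    """Return 'positive', 'negative', or 'neutral' for a compound score."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

# Hypothetical scores, including one exactly on the cutoff
for score in [0.62, 0.1, -0.05, -0.48]:
    print(f"{score:+.2f} -> {label_sentiment(score)}")
```

A score exactly at the threshold counts as neutral here, matching the strict comparisons used in the earlier cells. Calling `.title()` on the result would reproduce the capitalized labels from Exercise 2.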

## Summary: Python for Digital Humanities

**Skills Mastered Across All Reviews:**

**Reviews 01-04: Python Foundations**
- Variables, data types, and basic operations
- String methods and text processing
- Conditional logic and loops
- Lists, dictionaries, and data structures
- Functions and modular programming

**Review 05: Research Ethics**
- Ethical data collection principles
- Bias recognition and mitigation
- Cultural sensitivity in computational analysis
- Responsible research practices

**Review 06: Data Analysis with Pandas**
- DataFrame creation and manipulation
- Data cleaning and validation
- Statistical analysis and aggregation
- Pattern recognition in cultural datasets

**Review 07: Text Analysis**
- Computational text processing
- Sentiment analysis implementation
- Thematic analysis and categorization
- Cultural text interpretation

**Review 08: Data Visualization**
- Chart selection and design principles
- Multi-panel dashboard creation
- Visual storytelling with cultural data
- Accessibility and ethical visualization

**Review 09: Integration Project**
- End-to-end cultural analysis workflow
- Research question formulation
- Comprehensive project documentation
- Ethical reflection and methodology

**Applications in Digital Humanities:**
- Literary analysis and distant reading
- Historical trend identification
- Cross-cultural comparative studies
- Cultural heritage digitization projects
- Social media and digital culture analysis
- Museum and archive data analysis

**Congratulations on completing the WRIT 20833 Python Review Series!**

---
 