<a href="https://colab.research.google.com/github/TCU-DCDA/WRIT20833-2025/blob/main/notebooks/exercises/Review_09_Integration_Final_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WRIT 20833 Review 09: Integration

Integrate the Python skills from Reviews 01-08 into one end-to-end cultural data analysis project.

**Make a copy:** File > Save a copy in Drive

## Project Overview: Cultural Data Analysis Portfolio

Create a complete cultural data analysis project that demonstrates mastery of:
- Data collection and ethics (Review 05)
- Data processing with Pandas (Review 06) 
- Text analysis and sentiment (Review 07)
- Data visualization (Review 08)
- All foundational Python skills (Reviews 01-04)

## Exercise 1: Project Setup and Data Collection
Choose your cultural domain and set up your analysis framework.
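
If your own data lives in a spreadsheet export rather than a hand-typed dictionary, one possible starting point is Python's built-in `csv` module. The sketch below is only an illustration: `csv_text` is a hypothetical stand-in for a file you collected, and its two rows reuse entries from the sample dataset in this notebook.

```python
import csv
import io

# Hypothetical CSV text standing in for a file you collected yourself.
# The two rows reuse entries from the sample dataset in this notebook.
csv_text = """title,creator,year,genre
1984,George Orwell,1949,Dystopian
Beloved,Toni Morrison,1987,Historical Fiction
"""

# DictReader turns each row into a dictionary keyed by the header line
reader = csv.DictReader(io.StringIO(csv_text))

records = []
for row in reader:
    row['year'] = int(row['year'])  # CSV values arrive as strings
    records.append(row)

print("Rows loaded: " + str(len(records)))
print(records[0]['title'] + " (" + str(records[0]['year']) + ")")
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)`. For a file on disk, `io.StringIO(csv_text)` would be replaced by an opened file object.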

In [None]:
# Import libraries (only what was covered in CodeAlongs)
import pandas as pd
import matplotlib.pyplot as plt

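# VADER for sentiment analysis (from CodeAlongs)
# If this import raises ModuleNotFoundError, run `!pip install vaderSentiment` in its own cell first
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer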
analyzer = SentimentIntensityAnalyzer()
print("VADER sentiment analysis ready")

# PROJECT CHOICE: Select your cultural domain
# Options: books, movies, music, art, theater, digital_culture, etc.

project_domain = "books"  # TODO: Change to your chosen domain

print("CULTURAL ANALYSIS PROJECT: " + project_domain.upper())
print("=" * 50)

# Sample dataset - replace with your own data!
cultural_data = {
    'title': ['1984', 'Pride and Prejudice', 'The Handmaid\'s Tale', 'Beloved', 'The Great Gatsby'],
    'creator': ['George Orwell', 'Jane Austen', 'Margaret Atwood', 'Toni Morrison', 'F. Scott Fitzgerald'],
    'year': [1949, 1813, 1985, 1987, 1925],
    'genre': ['Dystopian', 'Romance', 'Dystopian', 'Historical Fiction', 'Modernist'],
    'description': [
        'A totalitarian society under constant surveillance where independent thinking is a crime.',
        'A witty exploration of love, marriage, and social class in Regency England.',
        'A dystopian tale of women\'s rights and reproductive freedom in a theocratic society.',
        'A powerful story of slavery, trauma, and the lasting effects of historical injustice.',
        'The decline of the American Dream through the eyes of the mysterious Jay Gatsby.'
    ],
    'themes': [
        'surveillance, totalitarianism, truth, freedom, oppression',
        'love, marriage, social class, wit, independence', 
        'feminism, reproductive rights, religious extremism, resistance',
        'slavery, trauma, memory, motherhood, healing',
        'American Dream, wealth, love, illusion, moral decay'
    ]
}

df = pd.DataFrame(cultural_data)
print("Dataset loaded: " + str(len(df)) + " items")
print()
print("First few items:")
print(df[['title', 'creator', 'year', 'genre']].head())

# Data validation and cleaning
print()
print("DATA QUALITY CHECK:")
print("Missing values: " + str(df.isnull().sum().sum()))
print("Duplicates: " + str(df.duplicated().sum()))
print("Year range: " + str(df['year'].min()) + " - " + str(df['year'].max()))
print("Unique creators: " + str(df['creator'].nunique()))
print("Genres: " + str(list(df['genre'].unique())))

## Exercise 2: Text Analysis and Theme Extraction
Analyze textual content using string methods and sentiment analysis.
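
The theme-frequency step in this exercise counts themes with a dictionary and an `if`/`else`. As an optional alternative, the standard library's `collections.Counter` does the same tallying in one call. This is a minimal sketch using two theme strings copied from the sample dataset; it is not a replacement for the manual pattern practiced in the CodeAlongs.

```python
from collections import Counter

# Theme strings copied from two rows of the sample dataset
theme_strings = [
    'love, marriage, social class, wit, independence',
    'American Dream, wealth, love, illusion, moral decay',
]

# Split each string on commas and strip the surrounding spaces
all_themes = []
for themes_string in theme_strings:
    for theme in themes_string.split(','):
        all_themes.append(theme.strip())

# Counter builds the theme -> count dictionary for us
theme_frequency = Counter(all_themes)

# most_common(n) returns the n highest counts as (theme, count) pairs
for theme, count in theme_frequency.most_common(3):
    print(theme + ": " + str(count) + " occurrences")
```

Because a `Counter` behaves like a dictionary, `theme_frequency['love']` still works the same way as in the manual version.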

In [None]:
# Text processing functions (using patterns from CodeAlongs)

# Simple text cleaning function
def clean_text(text):
    """Clean text by removing common punctuation"""
    text = str(text).lower()
    # Remove common punctuation
    text = text.replace(',', '').replace('.', '').replace('!', '').replace('?', '').replace(';', '').replace(':', '')
    return text.strip()

# Sentiment analysis function (using VADER from CodeAlongs)
def analyze_sentiment(text):
    """Analyze sentiment of text descriptions using VADER"""
    scores = analyzer.polarity_scores(text)
    return scores['compound']

# Theme extraction function (using basic string methods from CodeAlongs)
def extract_themes(themes_string):
    """Convert theme string to list"""
    if pd.isna(themes_string):
        return []
    themes = themes_string.split(',')
    clean_themes = []
    for theme in themes:
        clean_themes.append(theme.strip())
    return clean_themes

# Apply text analysis
print("TEXT ANALYSIS RESULTS:")
print("=" * 30)

# Clean text data
df['description_clean'] = df['description'].apply(clean_text)

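# Sentiment analysis (run on the original description, not the cleaned one,
# because VADER uses punctuation and capitalization as intensity cues)
df['sentiment_score'] = df['description'].apply(analyze_sentiment)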

# Categorize sentiment
sentiment_categories = []
for score in df['sentiment_score']:
    if score > 0.1:
        sentiment_categories.append('Positive')
    elif score < -0.1:
        sentiment_categories.append('Negative')
    else:
        sentiment_categories.append('Neutral')

df['sentiment_category'] = sentiment_categories

# Theme extraction
df['theme_list'] = df['themes'].apply(extract_themes)

# Count themes
theme_counts = []
for theme_list in df['theme_list']:
    theme_counts.append(len(theme_list))
df['theme_count'] = theme_counts

# Display results
print("Sentiment Analysis Results:")
for idx, row in df.iterrows():
    title_short = row['title'][:25]
    if len(row['title']) > 25:
        title_short = title_short + "..."
    print(title_short + " | Sentiment: " + row['sentiment_category'] + " (" + str(round(row['sentiment_score'], 2)) + ")")

print()
print("Overall sentiment distribution:")
sentiment_counts = df['sentiment_category'].value_counts()
for category in sentiment_counts.index:
    count = sentiment_counts[category]
    percentage = count / len(df) * 100
    print(category + ": " + str(count) + " items (" + str(round(percentage, 1)) + "%)")

# Theme frequency analysis (simplified)
all_themes = []
for theme_list in df['theme_list']:
    for theme in theme_list:
        all_themes.append(theme)

# Count themes manually
theme_frequency = {}
for theme in all_themes:
    if theme in theme_frequency:
        theme_frequency[theme] = theme_frequency[theme] + 1
    else:
        theme_frequency[theme] = 1

print()
print("Most common themes:")
# Get top 5 themes
sorted_themes = sorted(theme_frequency.items(), key=lambda x: x[1], reverse=True)
for i in range(min(5, len(sorted_themes))):
    theme, count = sorted_themes[i]
    print(theme + ": " + str(count) + " occurrences")

## Exercise 3: Statistical Analysis and Patterns
Use Pandas for data analysis and pattern discovery.
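
The "Simple Correlations" step relies on the pandas `.corr()` method, which computes a Pearson correlation by default. To make that number less of a black box, here is a hand-written sketch of the same calculation. The function name `pearson` is illustrative, and the inputs are the `year` values and the theme counts you get by splitting each `themes` string in the sample dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers.

    Assumes neither list is constant (otherwise the division fails).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    cross_sum = 0   # how the two lists vary together
    spread_x = 0    # how much xs varies on its own
    spread_y = 0    # how much ys varies on its own
    for x, y in zip(xs, ys):
        cross_sum += (x - mean_x) * (y - mean_y)
        spread_x += (x - mean_x) ** 2
        spread_y += (y - mean_y) ** 2

    return cross_sum / math.sqrt(spread_x * spread_y)

# Values taken from the sample dataset
years = [1949, 1813, 1985, 1987, 1925]
theme_counts = [5, 5, 4, 5, 5]

r = pearson(years, theme_counts)
print("Year vs Theme Count: " + str(round(r, 3)))
```

With only five works, a single row can swing the result, so treat any correlation from the sample data as a prompt for questions rather than a finding.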

In [None]:
# Statistical analysis (using pandas patterns from CodeAlongs)
print("STATISTICAL ANALYSIS:")
print("=" * 25)

# Basic statistics
print("Basic Dataset Statistics:")
print("Average year: " + str(round(df['year'].mean(), 1)))
print("Average sentiment score: " + str(round(df['sentiment_score'].mean(), 3)))
print("Average theme count: " + str(round(df['theme_count'].mean(), 1)))
print()

# Genre analysis
print("Genre Analysis:")
genre_counts = df['genre'].value_counts()
print("Most common genre: " + genre_counts.index[0] + " (" + str(genre_counts.iloc[0]) + " items)")
print("Total genres: " + str(len(genre_counts)))
print()

# Year range analysis
print("Time Period Analysis:")
print("Earliest work: " + str(df['year'].min()))
print("Latest work: " + str(df['year'].max()))
print("Time span: " + str(df['year'].max() - df['year'].min()) + " years")
print()

# Sentiment analysis by genre
print("Sentiment by Genre:")
for genre in df['genre'].unique():
    genre_data = df[df['genre'] == genre]
    avg_sentiment = genre_data['sentiment_score'].mean()
    print(genre + ": " + str(round(avg_sentiment, 3)))
print()

# Creator analysis
print("Creator Analysis:")
creator_counts = df['creator'].value_counts()
print("Total creators: " + str(len(creator_counts)))
print("Most works by one creator: " + str(creator_counts.max()))
print()

# Simple correlations
correlation_year_sentiment = df['year'].corr(df['sentiment_score'])
correlation_year_themes = df['year'].corr(df['theme_count'])
print("Simple Correlations:")
print("Year vs Sentiment: " + str(round(correlation_year_sentiment, 3)))
print("Year vs Theme Count: " + str(round(correlation_year_themes, 3)))

## Exercise 4: Data Visualization Dashboard
Create compelling visualizations to communicate your findings.
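
The sentiment-by-genre chart needs one average per genre, and the insights step then looks for the highest and lowest of those averages. The sketch below shows that group-then-average pattern in plain Python, with the built-in `max` and `min` replacing the index loops. It uses the `year` column from the sample data, since real sentiment scores only exist once VADER has run. The dictionary names are illustrative.

```python
# Genre and year columns copied from the sample dataset
genres = ['Dystopian', 'Romance', 'Dystopian', 'Historical Fiction', 'Modernist']
years = [1949, 1813, 1985, 1987, 1925]

# Step 1: collect the values that belong to each genre
values_by_genre = {}
for genre, year in zip(genres, years):
    if genre not in values_by_genre:
        values_by_genre[genre] = []
    values_by_genre[genre].append(year)

# Step 2: reduce each group to a single average
average_by_genre = {}
for genre, values in values_by_genre.items():
    average_by_genre[genre] = sum(values) / len(values)

# Step 3: max/min with key= compare genres by their average value
highest = max(average_by_genre, key=average_by_genre.get)
lowest = min(average_by_genre, key=average_by_genre.get)

print("Highest average: " + highest + " (" + str(average_by_genre[highest]) + ")")
print("Lowest average: " + lowest + " (" + str(average_by_genre[lowest]) + ")")
```

The same dictionary can feed a bar chart directly: `plt.bar(list(average_by_genre.keys()), list(average_by_genre.values()))`.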

In [None]:
# Visualization (using basic matplotlib patterns from CodeAlongs)
print("DATA VISUALIZATION:")
print("=" * 20)

# 1. Timeline scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df['year'], df['sentiment_score'])
plt.xlabel('Year')
plt.ylabel('Sentiment Score')
plt.title('Sentiment Over Time')
plt.show()

# 2. Genre distribution pie chart
plt.figure(figsize=(8, 8))
genre_counts = df['genre'].value_counts()
plt.pie(genre_counts.values, labels=genre_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Genre Distribution')
plt.show()

# 3. Sentiment by genre bar chart
plt.figure(figsize=(10, 6))
unique_genres = df['genre'].unique()
genre_sentiments = []

for genre in unique_genres:
    genre_data = df[df['genre'] == genre]
    avg_sentiment = genre_data['sentiment_score'].mean()
    genre_sentiments.append(avg_sentiment)

plt.bar(unique_genres, genre_sentiments)
plt.xlabel('Genre')
plt.ylabel('Average Sentiment Score')
plt.title('Sentiment by Genre')
plt.show()

# 4. Theme count histogram
plt.figure(figsize=(10, 6))
plt.hist(df['theme_count'], bins=5)
plt.xlabel('Number of Themes')
plt.ylabel('Frequency')
plt.title('Distribution of Theme Counts')
plt.show()

# Insights
print("VISUALIZATION INSIGHTS:")

# Find most positive genre
max_sentiment_idx = 0
for i in range(len(genre_sentiments)):
    if genre_sentiments[i] > genre_sentiments[max_sentiment_idx]:
        max_sentiment_idx = i

# Find most negative genre
min_sentiment_idx = 0
for i in range(len(genre_sentiments)):
    if genre_sentiments[i] < genre_sentiments[min_sentiment_idx]:
        min_sentiment_idx = i

print("Most positive genre: " + unique_genres[max_sentiment_idx] + " (" + str(round(genre_sentiments[max_sentiment_idx], 2)) + ")")
print("Most negative genre: " + unique_genres[min_sentiment_idx] + " (" + str(round(genre_sentiments[min_sentiment_idx], 2)) + ")")
print("Most common genre: " + genre_counts.index[0])
print("Average theme count: " + str(round(df['theme_count'].mean(), 1)))

## Exercise 5: Final Project Presentation
Synthesize your analysis into key findings and conclusions.
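
The ±0.1 sentiment cutoff appears twice in this notebook: once per work in Exercise 2 and once for the overall average in Finding 1. One way to keep the two in sync is a small helper function. This is a sketch, and both the name `categorize_sentiment` and the `threshold` parameter are assumptions introduced here, not part of the CodeAlongs.

```python
def categorize_sentiment(score, threshold=0.1):
    """Label a VADER compound score as Positive, Negative, or Neutral.

    Scores strictly above +threshold are Positive, strictly below
    -threshold are Negative, and everything in between is Neutral.
    """
    if score > threshold:
        return 'Positive'
    elif score < -threshold:
        return 'Negative'
    else:
        return 'Neutral'

# Illustrative inputs only (not real VADER output)
for example_score in [0.5, 0.0, -0.5]:
    print(str(example_score) + " -> " + categorize_sentiment(example_score))
```

With this helper, the per-work labels could come from `df['sentiment_score'].apply(categorize_sentiment)`, and Finding 1 could use `categorize_sentiment(overall_sentiment).lower()`. The VADER README suggests ±0.05 as typical cutoffs, so the ±0.1 used in this notebook is a slightly stricter choice that you can adjust through `threshold`.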

In [None]:
# Final Analysis and Insights
print("COMPREHENSIVE ANALYSIS SUMMARY:")
print("=" * 40)

print("DATASET OVERVIEW:")
print("Total Items: " + str(len(df)))
print("Time Span: " + str(df['year'].max() - df['year'].min()) + " years (" + str(df['year'].min()) + "-" + str(df['year'].max()) + ")")
print("Unique Creators: " + str(df['creator'].nunique()))
print("Genres Analyzed: " + str(df['genre'].nunique()))

print()
print("KEY FINDINGS:")
print("-" * 20)

# Finding 1: Sentiment trends
overall_sentiment = df['sentiment_score'].mean()
if overall_sentiment > 0.1:
    sentiment_trend = "positive"
elif overall_sentiment < -0.1:
    sentiment_trend = "negative"
else:
    sentiment_trend = "neutral"
print("1. Overall sentiment is " + sentiment_trend + " (avg: " + str(round(overall_sentiment, 2)) + ")")

# Finding 2: Genre insights
dominant_genre = df['genre'].value_counts().index[0]
genre_percentage = (df['genre'].value_counts().iloc[0] / len(df)) * 100
print("2. " + dominant_genre + " is the dominant genre (" + str(round(genre_percentage, 1)) + "% of works)")

# Finding 3: Temporal patterns
modern_works = 0
historical_works = 0
for year in df['year']:
    if year >= 1950:
        modern_works = modern_works + 1
    else:
        historical_works = historical_works + 1

if modern_works > historical_works:
    temporal_focus = "modern"
else:
    temporal_focus = "historical"
print("3. Dataset focuses on " + temporal_focus + " works (" + str(modern_works) + " modern vs " + str(historical_works) + " historical)")

# Finding 4: Thematic complexity
avg_themes = df['theme_count'].mean()
max_themes_idx = df['theme_count'].idxmax()
most_complex = df.loc[max_themes_idx, 'title']
print("4. Average thematic complexity: " + str(round(avg_themes, 1)) + " themes per work")
print("   Most complex: '" + most_complex + "' (" + str(df['theme_count'].max()) + " themes)")

# Finding 5: Creator patterns
creator_counts = df['creator'].value_counts()
if creator_counts.iloc[0] > 1:
    prolific_creator = creator_counts.index[0]
    creator_count = creator_counts.iloc[0]
    print("5. Most prolific creator: " + prolific_creator + " (" + str(creator_count) + " works)")
else:
    print("5. All creators represented equally (1 work each)")

print()
print("RESEARCH QUESTIONS RAISED:")
print("-" * 30)
print("1. How do cultural and historical contexts influence thematic content?")
print("2. What factors contribute to sentiment patterns in cultural works?")
print("3. How has thematic complexity evolved over time?")
print("4. What role does genre play in cultural expression and reception?")
print("5. How might computational analysis complement traditional cultural criticism?")

print()
print("SKILLS DEMONSTRATED:")
print("-" * 25)
print("✓ Data collection and validation")
print("✓ Text processing and cleaning") 
print("✓ Sentiment analysis implementation")
print("✓ Statistical analysis with Pandas")
print("✓ Data visualization")
print("✓ Pattern recognition and interpretation")
print("✓ Research question formulation")

print()
print("=" * 40)
print("CONGRATULATIONS!")
print("You've completed a comprehensive cultural data analysis project!")
print("This demonstrates mastery of Python for digital humanities research.")

## Summary: Python for Digital Humanities

**Skills Mastered Across All Reviews:**

**Reviews 01-04: Python Foundations**
- Variables, data types, and basic operations
- String methods and text processing
- Conditional logic and loops
- Lists, dictionaries, and data structures
- Functions and modular programming

**Review 05: Research Ethics**
- Ethical data collection principles
- Bias recognition and mitigation
- Cultural sensitivity in computational analysis
- Responsible research practices

**Review 06: Data Analysis with Pandas**
- DataFrame creation and manipulation
- Data cleaning and validation
- Statistical analysis and aggregation
- Pattern recognition in cultural datasets

**Review 07: Text Analysis**
- Computational text processing
- Sentiment analysis implementation
- Thematic analysis and categorization
- Cultural text interpretation

**Review 08: Data Visualization**
- Chart selection and design principles
- Multi-panel dashboard creation
- Visual storytelling with cultural data
- Accessibility and ethical visualization

**Review 09: Integration Project**
- End-to-end cultural analysis workflow
- Research question formulation
- Comprehensive project documentation
- Ethical reflection and methodology

**Applications in Digital Humanities:**
- Literary analysis and distant reading
- Historical trend identification
- Cross-cultural comparative studies
- Cultural heritage digitization projects
- Social media and digital culture analysis
- Museum and archive data analysis

**Congratulations on completing the WRIT 20833 Python Review Series!**
