# Google Form Response Summarizer with HDBSCAN Clustering

This notebook demonstrates how to:
1. Connect to the Google Forms API to fetch form responses
2. Preprocess the text responses for analysis
3. Use HDBSCAN clustering to group similar responses
4. Generate summaries for each cluster to understand common themes

## Prerequisites
- Google Cloud Project with Forms API enabled
- OAuth 2.0 client credentials JSON file (`credentials.json`), which is what the authentication cell below expects
- Form ID of the Google Form you want to analyze

## 1. Install and Import Required Libraries

First, we'll install the packages needed to connect to the Google Forms API and run the clustering analysis.

In [None]:
# Install required packages
!pip install google-auth google-auth-oauthlib google-api-python-client
!pip install hdbscan pandas scikit-learn matplotlib seaborn plotly
!pip install nltk textblob wordcloud

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Google API imports
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

# Text processing and clustering imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import hdbscan
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from textblob import TextBlob
from wordcloud import WordCloud

# Other utilities
import json
import os
import re
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")

In [None]:
# Download required NLTK data
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # Tokenizer data used by word_tokenize in recent NLTK releases
nltk.download('wordnet')

print("NLTK data downloaded successfully!")

## 2. Authenticate and Connect to Google Forms API

Set up authentication credentials and connect to the Google Forms API. 
You'll need to:
1. Create a Google Cloud Project
2. Enable the Google Forms API
3. Create an OAuth 2.0 client ID (Desktop app) and download it as `credentials.json`
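Before running the authentication cell, it can help to confirm which kind of credentials file was downloaded, since an OAuth client file and a service account key are not interchangeable. The sketch below is a minimal check, assuming the usual file layouts: an OAuth client file keeps its settings under a top-level `installed` or `web` key, while a service account key carries `"type": "service_account"`. The helper names `describe_credentials` and `describe_credentials_file` are hypothetical and not part of any Google library.

```python
import json

def describe_credentials(data):
    """Classify an already-parsed credentials JSON object (hypothetical helper)."""
    if "installed" in data:
        return "oauth-client (desktop app)"
    if "web" in data:
        return "oauth-client (web app)"
    if data.get("type") == "service_account":
        return "service-account key"
    return "unknown"

def describe_credentials_file(path="credentials.json"):
    """Load a credentials file from disk and classify it."""
    with open(path, "r", encoding="utf-8") as fh:
        return describe_credentials(json.load(fh))

# Example with in-memory placeholder data (not real credentials)
print(describe_credentials({"installed": {"client_id": "example"}}))
print(describe_credentials({"type": "service_account"}))
```

The `InstalledAppFlow.from_client_secrets_file` call used in this notebook needs the OAuth client variant.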

In [None]:
# Configuration
SCOPES = ['https://www.googleapis.com/auth/forms.responses.readonly',
          'https://www.googleapis.com/auth/forms.body.readonly']

# Update this with your form ID
FORM_ID = 'YOUR_FORM_ID_HERE'  # Replace with actual form ID

def authenticate_google_api():
    """
    Authenticate with the Google API using the OAuth 2.0 installed-app flow
    """
    creds = None
    
    # The file token.json stores the user's access and refresh tokens.
    if os.path.exists('token.json'):
        creds = Credentials.from_authorized_user_file('token.json', SCOPES)
    
    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            # Update path to your credentials file
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', SCOPES)  # Download from Google Cloud Console
            creds = flow.run_local_server(port=0)
        
        # Save the credentials for the next run
        with open('token.json', 'w') as token:
            token.write(creds.to_json())
    
    return creds

# Authenticate and build service
try:
    creds = authenticate_google_api()
    service = build('forms', 'v1', credentials=creds)
    print("Successfully authenticated with Google Forms API!")
except Exception as e:
    print(f"Authentication error: {e}")
    print("Please ensure you have credentials.json file in the current directory")

## 3. Fetch Google Form Responses

Retrieve responses from the specified Google Form and convert them into a pandas DataFrame for analysis.
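To make the nesting easier to picture, here is a small sketch that flattens one response into a row keyed by question title. The sample payload is hand-written in the general shape the cell below reads (`responseId`, `answers`, `textAnswers`), and all IDs, titles, and values are made up. The helper `flatten_response` is hypothetical.

```python
# Hand-written sample in the general shape of a Forms API response (values are made up)
sample_response = {
    "responseId": "resp-001",
    "createTime": "2024-01-01T10:00:00Z",
    "lastSubmittedTime": "2024-01-01T10:00:00Z",
    "answers": {
        "q1": {"questionId": "q1",
               "textAnswers": {"answers": [{"value": "Great course"}]}},
        "q2": {"questionId": "q2",
               "textAnswers": {"answers": [{"value": "Slides"}, {"value": "Labs"}]}},
    },
}
# Mapping from question ID to title, as built from the form structure
question_titles = {"q1": "Overall feedback", "q2": "Most useful parts"}

def flatten_response(response, titles):
    """Turn one nested response into a flat dict keyed by question title."""
    row = {"response_id": response.get("responseId", "")}
    for question_id, answer in response.get("answers", {}).items():
        title = titles.get(question_id, f"Question_{question_id}")
        values = [a.get("value", "") for a in
                  answer.get("textAnswers", {}).get("answers", [])]
        # Multiple values (for example checkbox answers) are joined into one string
        row[title] = "; ".join(values)
    return row

row = flatten_response(sample_response, question_titles)
print(row)
```

A list of such rows can be passed directly to `pd.DataFrame`, which is what the notebook does.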

In [None]:
def get_form_structure(service, form_id):
    """
    Get the form structure to understand question types and IDs
    """
    try:
        form = service.forms().get(formId=form_id).execute()
        return form
    except HttpError as e:
        print(f"Error getting form structure: {e}")
        return None

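def get_form_responses(service, form_id):
    """
    Fetch all responses from the Google Form, following pagination
    """
    all_responses = []
    page_token = None
    try:
        while True:
            # Each call returns one page; nextPageToken is set while more pages remain
            result = service.forms().responses().list(
                formId=form_id, pageToken=page_token).execute()
            all_responses.extend(result.get('responses', []))
            page_token = result.get('nextPageToken')
            if not page_token:
                break
    except HttpError as e:
        print(f"Error fetching responses: {e}")
    return all_responses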

def process_responses_to_dataframe(form_structure, responses):
    """
    Convert form responses to a pandas DataFrame
    """
    if not responses:
        print("No responses found")
        return pd.DataFrame()
    
    # Extract question titles and IDs
    questions = {}
    if 'items' in form_structure:
        for item in form_structure['items']:
            if 'questionItem' in item:
                question_id = item['questionItem']['question']['questionId']
                # Fall back to a generated name if the item has no title
                title = item.get('title', f'Question_{question_id}')
                questions[question_id] = title
    
    # Process responses
    processed_data = []
    
    for response in responses:
        response_data = {
            'response_id': response.get('responseId', ''),
            'create_time': response.get('createTime', ''),
            'last_submitted_time': response.get('lastSubmittedTime', '')
        }
        
        # Extract answers
        if 'answers' in response:
            for question_id, answer in response['answers'].items():
                question_title = questions.get(question_id, f'Question_{question_id}')
                
                # Handle different answer types
                if 'textAnswers' in answer:
                    text_values = [ta.get('value', '') for ta in answer['textAnswers']['answers']]
                    response_data[question_title] = '; '.join(text_values)
                elif 'fileUploadAnswers' in answer:
                    response_data[question_title] = 'File uploaded'
                else:
                    response_data[question_title] = str(answer)
        
        processed_data.append(response_data)
    
    return pd.DataFrame(processed_data)

# Fetch form data
print("Fetching form structure...")
form_structure = get_form_structure(service, FORM_ID)

if form_structure:
    print(f"Form Title: {form_structure.get('info', {}).get('title', 'Unknown')}")
    print(f"Form Description: {form_structure.get('info', {}).get('description', 'No description')}")
    
    print("\nFetching responses...")
    responses = get_form_responses(service, FORM_ID)
    
    print(f"Found {len(responses)} responses")
    
    # Convert to DataFrame
    df = process_responses_to_dataframe(form_structure, responses)
    print(f"DataFrame shape: {df.shape}")
    
    # Display first few responses
    print("\nFirst 5 responses:")
    display(df.head())
else:
    print("Could not fetch form structure. Please check your FORM_ID and permissions.")

## 4. Preprocess Responses for Clustering

Clean and preprocess text responses to prepare them for clustering analysis.
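As a rough illustration of what the cleaning steps do to a single response, the sketch below applies lowercasing, character filtering, and stopword removal using only the standard library. It is a simplified stand-in: the stopword set is a tiny made-up list (the notebook uses NLTK's English list), tokenization is a plain whitespace split, and stemming is left out.

```python
import re

# Tiny illustrative stopword list; the notebook uses NLTK's English list instead
STOPWORDS = {"the", "was", "and", "but", "very", "too"}

def simple_preprocess(text):
    """Lowercase, strip non-letters, then drop stopwords and tokens of 2 characters or fewer."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # removes digits and punctuation
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

tokens = simple_preprocess("The instructor was GREAT, but Module 3 was too fast!")
print(tokens)  # ['instructor', 'great', 'module', 'fast']
```

Note that the digit in "Module 3" disappears entirely, so numeric answers carry no signal after this step.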

In [None]:
def preprocess_text(text):
    """
    Clean and preprocess text data
    """
    if pd.isna(text) or text == '':
        return ''
    
    # Convert to lowercase
    text = str(text).lower()
    
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words and len(token) > 2]
    
    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    
    return ' '.join(tokens)

def combine_text_responses(df, text_columns=None):
    """
    Combine multiple text columns into a single text for analysis
    """
    if text_columns is None:
        # Identify text columns (exclude metadata columns)
        text_columns = [col for col in df.columns 
                       if col not in ['response_id', 'create_time', 'last_submitted_time']]
    
    # Combine text from all specified columns
    combined_texts = []
    for idx, row in df.iterrows():
        text_parts = []
        for col in text_columns:
            if pd.notna(row[col]) and str(row[col]).strip():
                text_parts.append(str(row[col]))
        
        combined_text = ' '.join(text_parts)
        combined_texts.append(combined_text)
    
    return combined_texts

# Analyze the DataFrame structure
print("DataFrame columns:")
for i, col in enumerate(df.columns):
    print(f"{i+1}. {col}")

# Identify text columns for clustering
text_columns = [col for col in df.columns 
                if col not in ['response_id', 'create_time', 'last_submitted_time']]

print(f"\nText columns identified for analysis: {text_columns}")

# Combine and preprocess text responses
if len(df) > 0:
    combined_texts = combine_text_responses(df, text_columns)
    
    print(f"\nPreprocessing {len(combined_texts)} responses...")
    processed_texts = [preprocess_text(text) for text in combined_texts]
    
    # Filter out empty responses
    non_empty_indices = [i for i, text in enumerate(processed_texts) if text.strip()]
    processed_texts = [processed_texts[i] for i in non_empty_indices]
    filtered_df = df.iloc[non_empty_indices].copy()
    
    print(f"After filtering empty responses: {len(processed_texts)} responses remain")
    
    # Add processed text to DataFrame
    filtered_df['processed_text'] = processed_texts
    
    print("\nSample processed texts:")
    for i in range(min(3, len(processed_texts))):
        print(f"{i+1}. Original: {combined_texts[non_empty_indices[i]][:100]}...")
        print(f"   Processed: {processed_texts[i][:100]}...")
        print()
else:
    print("No data available for processing.")

In [None]:
# Vectorize text using TF-IDF
if len(processed_texts) > 0:
    print("Vectorizing text using TF-IDF...")
    
    # Create TF-IDF vectorizer
    vectorizer = TfidfVectorizer(
        max_features=1000,  # Limit to top 1000 features
        min_df=2,          # Ignore terms that appear in less than 2 documents
        max_df=0.8,        # Ignore terms that appear in more than 80% of documents
        ngram_range=(1, 2) # Include both unigrams and bigrams
    )
    
    # Fit and transform the text
    tfidf_matrix = vectorizer.fit_transform(processed_texts)
    
    print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
    print(f"Number of features: {len(vectorizer.get_feature_names_out())}")
    
    # Convert to dense array for clustering (if not too large)
    if tfidf_matrix.shape[0] * tfidf_matrix.shape[1] < 100000:
        X = tfidf_matrix.toarray()
    else:
        X = tfidf_matrix  # Keep sparse for large datasets
    
    print("Text vectorization completed!")
    
    # Show most important features
    feature_names = vectorizer.get_feature_names_out()
    print(f"\nSample features: {feature_names[:20]}")
    
else:
    print("No processed texts available for vectorization.")

## 5. Apply HDBSCAN Clustering

Use HDBSCAN to cluster the preprocessed responses and identify groups of similar feedback.
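HDBSCAN returns one integer label per response and uses `-1` for points it treats as noise. The sketch below shows how cluster and noise counts can be derived from such a label list. The labels are made up and the helper `summarize_labels` is hypothetical.

```python
from collections import Counter

def summarize_labels(labels):
    """Count clusters and noise points in an HDBSCAN-style label list."""
    counts = Counter(labels)
    n_noise = counts.get(-1, 0)
    # The noise label is not a cluster, so exclude it from the cluster count
    n_clusters = len(counts) - (1 if -1 in counts else 0)
    sizes = {label: n for label, n in sorted(counts.items()) if label != -1}
    return {"n_clusters": n_clusters, "n_noise": n_noise, "sizes": sizes}

# Made-up labels for eight responses
labels = [0, 0, 1, -1, 1, 1, -1, 0]
summary = summarize_labels(labels)
print(summary)  # {'n_clusters': 2, 'n_noise': 2, 'sizes': {0: 3, 1: 3}}
```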

In [None]:
# Apply HDBSCAN clustering
if len(processed_texts) > 0:
    print("Applying HDBSCAN clustering...")
    
    # Convert sparse matrix to dense if needed
    if hasattr(X, 'toarray'):
        X_dense = X.toarray()
    else:
        X_dense = X
    
    # Configure HDBSCAN parameters
    min_cluster_size = max(2, len(processed_texts) // 10)  # Adaptive minimum cluster size
    min_samples = max(1, min_cluster_size // 2)           # Minimum samples
    
    print(f"HDBSCAN parameters:")
    print(f"- min_cluster_size: {min_cluster_size}")
    print(f"- min_samples: {min_samples}")
    
    # Create and fit HDBSCAN clusterer
    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=min_cluster_size,
        min_samples=min_samples,
        metric='euclidean',
        cluster_selection_method='eom'
    )
    
    # Fit the model
    cluster_labels = clusterer.fit_predict(X_dense)
    
    # Add cluster labels to DataFrame
    filtered_df['cluster'] = cluster_labels
    
    # Analyze clustering results
    n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
    n_noise = list(cluster_labels).count(-1)
    
    print(f"\nClustering Results:")
    print(f"- Number of clusters: {n_clusters}")
    print(f"- Number of noise points: {n_noise}")
    print(f"- Percentage of noise: {n_noise/len(cluster_labels)*100:.1f}%")
    
    # Cluster size distribution
    cluster_counts = pd.Series(cluster_labels).value_counts().sort_index()
    print(f"\nCluster size distribution:")
    for cluster_id, count in cluster_counts.items():
        if cluster_id == -1:
            print(f"  Noise: {count} responses")
        else:
            print(f"  Cluster {cluster_id}: {count} responses")
    
    # Calculate clustering quality metrics
    if n_clusters > 1:
        # Silhouette score (excluding noise points)
        non_noise_mask = cluster_labels != -1
        if np.sum(non_noise_mask) > 1:
            silhouette_avg = silhouette_score(X_dense[non_noise_mask], 
                                            cluster_labels[non_noise_mask])
            print(f"\nSilhouette Score: {silhouette_avg:.3f}")
    
else:
    print("No data available for clustering.")

In [None]:
# Visualize clustering results
if len(processed_texts) > 0 and n_clusters > 0:
    # Create visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # 1. Cluster size distribution
    cluster_counts_viz = cluster_counts[cluster_counts.index != -1]  # Exclude noise
    axes[0, 0].bar(range(len(cluster_counts_viz)), cluster_counts_viz.values)
    axes[0, 0].set_title('Cluster Size Distribution')
    axes[0, 0].set_xlabel('Cluster ID')
    axes[0, 0].set_ylabel('Number of Responses')
    axes[0, 0].set_xticks(range(len(cluster_counts_viz)))
    axes[0, 0].set_xticklabels(cluster_counts_viz.index)
    
    # 2. Cluster membership overview
    cluster_data = pd.DataFrame({'cluster': cluster_labels})
    cluster_summary = cluster_data['cluster'].value_counts().sort_index()
    
    # Pie chart for cluster distribution
    colors = plt.cm.Set3(np.linspace(0, 1, len(cluster_summary)))
    labels = [f'Cluster {i}' if i != -1 else 'Noise' for i in cluster_summary.index]
    
    axes[0, 1].pie(cluster_summary.values, labels=labels, autopct='%1.1f%%', colors=colors)
    axes[0, 1].set_title('Cluster Distribution')

    # 3. Cluster confidence scores (HDBSCAN probabilities)
    if hasattr(clusterer, 'probabilities_'):
        axes[1, 0].hist(clusterer.probabilities_, bins=30, alpha=0.7)
        axes[1, 0].set_title('Cluster Membership Probabilities')
        axes[1, 0].set_xlabel('Probability')
        axes[1, 0].set_ylabel('Frequency')
    else:
        axes[1, 0].text(0.5, 0.5, 'Probabilities not available', 
                       ha='center', va='center', transform=axes[1, 0].transAxes)
        axes[1, 0].set_title('Cluster Probabilities')
    
    # 4. Response length by cluster
    response_lengths = [len(text.split()) for text in processed_texts]
    cluster_df_viz = pd.DataFrame({
        'cluster': cluster_labels,
        'length': response_lengths
    })
    
    for cluster_id in sorted(set(cluster_labels)):
        if cluster_id != -1:
            cluster_lengths = cluster_df_viz[cluster_df_viz['cluster'] == cluster_id]['length']
            axes[1, 1].hist(cluster_lengths, alpha=0.6, label=f'Cluster {cluster_id}', bins=15)
    
    axes[1, 1].set_title('Response Length Distribution by Cluster')
    axes[1, 1].set_xlabel('Number of Words')
    axes[1, 1].set_ylabel('Frequency')
    axes[1, 1].legend()
    
    plt.tight_layout()
    plt.show()
    
else:
    print("No clusters found or insufficient data for visualization.")

## 6. Summarize Clustered Responses

Generate summaries and insights for each cluster to understand common themes and patterns.
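One simple way to characterize a cluster is to average the TF-IDF weights of its responses and keep the highest-scoring terms. The sketch below shows that idea with plain Python lists. The feature names and weights are made up, and the helper `top_terms` is hypothetical.

```python
def top_terms(vectors, feature_names, k=3):
    """Rank features by their mean weight across a cluster's rows."""
    n_rows = len(vectors)
    # zip(*vectors) iterates column by column, one column per feature
    means = [sum(column) / n_rows for column in zip(*vectors)]
    ranked = sorted(zip(feature_names, means), key=lambda pair: pair[1], reverse=True)
    # Terms with zero mean weight never occur in the cluster, so skip them
    return [(name, round(score, 3)) for name, score in ranked[:k] if score > 0]

# Made-up TF-IDF-like weights for three responses in one cluster
feature_names = ["price", "support", "shipping", "quality"]
cluster_vectors = [
    [0.0, 0.8, 0.1, 0.0],
    [0.0, 0.6, 0.3, 0.0],
    [0.2, 0.7, 0.2, 0.0],
]
print(top_terms(cluster_vectors, feature_names))
```

Here "support" ranks first because it has the largest average weight, which suggests the cluster is mostly about support.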

In [None]:
def analyze_cluster(cluster_df, cluster_id, vectorizer, X_dense):
    """
    Analyze a specific cluster and generate insights
    """
    if cluster_id == -1:
        return {"cluster_id": -1, "name": "Noise", "size": len(cluster_df)}
    
    cluster_texts = cluster_df['processed_text'].tolist()
    original_texts = []
    
    # Get original combined texts (cluster_df keeps df's index labels)
    for idx in cluster_df.index:
        original_row = df.loc[idx]
        original_parts = []
        for col in text_columns:
            if pd.notna(original_row[col]) and str(original_row[col]).strip():
                original_parts.append(str(original_row[col]))
        original_texts.append(' '.join(original_parts))
    
    # Get most important terms for this cluster.
    # X_dense rows follow filtered_df's row order, so map index labels to positions.
    cluster_indices = filtered_df.index.get_indexer(cluster_df.index)
    cluster_vectors = X_dense[cluster_indices]
    
    # Calculate mean TF-IDF values for cluster
    mean_tfidf = np.mean(cluster_vectors, axis=0)
    feature_names = vectorizer.get_feature_names_out()
    
    # Get top terms
    top_indices = np.argsort(mean_tfidf)[-10:][::-1]
    top_terms = [(feature_names[i], mean_tfidf[i]) for i in top_indices if mean_tfidf[i] > 0]
    
    # Sentiment analysis
    sentiments = [TextBlob(text).sentiment.polarity for text in original_texts]
    avg_sentiment = np.mean(sentiments)
    
    # Response length statistics
    lengths = [len(text.split()) for text in original_texts]
    
    return {
        "cluster_id": cluster_id,
        "size": len(cluster_df),
        "top_terms": top_terms,
        "avg_sentiment": avg_sentiment,
        "avg_length": np.mean(lengths),
        "sample_responses": original_texts[:3],  # First 3 responses as examples
        "processed_sample": cluster_texts[:3]
    }

# Analyze each cluster
if len(processed_texts) > 0 and n_clusters > 0:
    print("Analyzing clusters...")
    
    cluster_analyses = []
    
    for cluster_id in sorted(set(cluster_labels)):
        cluster_mask = filtered_df['cluster'] == cluster_id
        cluster_data = filtered_df[cluster_mask]
        
        analysis = analyze_cluster(cluster_data, cluster_id, vectorizer, X_dense)
        cluster_analyses.append(analysis)
        
        print(f"\n{'='*50}")
        print(f"CLUSTER {cluster_id} ANALYSIS")
        print(f"{'='*50}")
        
        if cluster_id == -1:
            print(f"Type: Noise/Outliers")
            print(f"Size: {analysis['size']} responses")
        else:
            print(f"Size: {analysis['size']} responses")
            print(f"Average Sentiment: {analysis['avg_sentiment']:.3f} ({'Positive' if analysis['avg_sentiment'] > 0.1 else 'Negative' if analysis['avg_sentiment'] < -0.1 else 'Neutral'})")
            print(f"Average Response Length: {analysis['avg_length']:.1f} words")
            
            print(f"\nTop Terms:")
            for term, score in analysis['top_terms'][:5]:
                print(f"  - {term}: {score:.3f}")
            
            print(f"\nSample Responses:")
            for i, response in enumerate(analysis['sample_responses']):
                print(f"  {i+1}. {response[:150]}{'...' if len(response) > 150 else ''}")
    
    # Create summary DataFrame
    summary_data = []
    for analysis in cluster_analyses:
        if analysis['cluster_id'] != -1:
            summary_data.append({
                'Cluster': analysis['cluster_id'],
                'Size': analysis['size'],
                'Avg_Sentiment': round(analysis['avg_sentiment'], 3),
                'Avg_Length': round(analysis['avg_length'], 1),
                'Top_Terms': ', '.join([term for term, _ in analysis['top_terms'][:3]])
            })
    
    summary_df = pd.DataFrame(summary_data)
    
    print(f"\n{'='*50}")
    print("CLUSTER SUMMARY TABLE")
    print(f"{'='*50}")
    display(summary_df)
    
else:
    print("No clusters available for analysis.")

In [None]:
# Generate word clouds for each cluster
if len(processed_texts) > 0 and n_clusters > 0:
    print("Generating word clouds for clusters...")
    
    # Calculate number of rows needed for subplots
    n_clusters_to_show = min(n_clusters, 6)  # Show up to 6 clusters
    n_cols = min(3, n_clusters_to_show)
    n_rows = (n_clusters_to_show + n_cols - 1) // n_cols
    
    if n_clusters_to_show > 0:
        fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5*n_rows))
        
        # Handle single subplot case
        if n_clusters_to_show == 1:
            axes = [axes]
        elif n_rows == 1:
            axes = [axes] if n_cols == 1 else axes
        else:
            axes = axes.flatten()
        
        cluster_counter = 0
        
        for analysis in cluster_analyses:
            if analysis['cluster_id'] != -1 and cluster_counter < n_clusters_to_show:
                cluster_id = analysis['cluster_id']
                cluster_mask = filtered_df['cluster'] == cluster_id
                cluster_texts = filtered_df[cluster_mask]['processed_text'].tolist()
                
                # Combine all text in cluster
                cluster_text = ' '.join(cluster_texts)
                
                if cluster_text.strip():  # Only create wordcloud if there's text
                    try:
                        wordcloud = WordCloud(
                            width=400, 
                            height=300, 
                            background_color='white',
                            max_words=50,
                            colormap='viridis'
                        ).generate(cluster_text)
                        
                        axes[cluster_counter].imshow(wordcloud, interpolation='bilinear')
                        axes[cluster_counter].set_title(f'Cluster {cluster_id} Word Cloud\n({analysis["size"]} responses)')
                        axes[cluster_counter].axis('off')
                    except ValueError:
                        axes[cluster_counter].text(0.5, 0.5, f'Cluster {cluster_id}\nInsufficient text for word cloud', 
                                                 ha='center', va='center', transform=axes[cluster_counter].transAxes)
                        axes[cluster_counter].set_title(f'Cluster {cluster_id}')
                
                cluster_counter += 1
        
        # Hide unused subplots
        for i in range(cluster_counter, len(axes)):
            axes[i].axis('off')
        
        plt.tight_layout()
        plt.show()
    
    else:
        print("No clusters available for word cloud generation.")

else:
    print("No data available for word cloud generation.")

In [None]:
# Export results to CSV
if len(processed_texts) > 0:
    print("Exporting results...")
    
    # Create detailed results DataFrame
    export_df = filtered_df.copy()
    
    # Add cluster information
    cluster_names = {}
    for analysis in cluster_analyses:
        if analysis['cluster_id'] != -1:
            top_terms = [term for term, _ in analysis['top_terms'][:2]]
            cluster_names[analysis['cluster_id']] = f"Cluster_{analysis['cluster_id']}_{'_'.join(top_terms)}"
        else:
            cluster_names[analysis['cluster_id']] = "Noise"
    
    export_df['cluster_name'] = export_df['cluster'].map(cluster_names)
    
    # Add sentiment scores
    original_combined_texts = combine_text_responses(export_df, text_columns)
    export_df['sentiment_score'] = [TextBlob(text).sentiment.polarity for text in original_combined_texts]
    export_df['response_length'] = [len(text.split()) for text in original_combined_texts]
    
    # Export to CSV
    output_filename = 'google_form_clustering_results.csv'
    export_df.to_csv(output_filename, index=False)
    print(f"Results exported to: {output_filename}")
    
    # Create summary report
    report = f"""
GOOGLE FORM CLUSTERING ANALYSIS REPORT
=====================================

Dataset Overview:
- Total responses analyzed: {len(export_df)}
- Number of clusters found: {n_clusters}
- Noise points: {n_noise} ({n_noise/len(cluster_labels)*100:.1f}%)

Cluster Summary:
"""
    
    for analysis in cluster_analyses:
        if analysis['cluster_id'] != -1:
            report += f"""

Cluster {analysis['cluster_id']}:
- Size: {analysis['size']} responses ({analysis['size']/len(export_df)*100:.1f}%)
- Average sentiment: {analysis['avg_sentiment']:.3f}
- Average length: {analysis['avg_length']:.1f} words
- Key themes: {', '.join([term for term, _ in analysis['top_terms'][:3]])}
"""
    
    # Save report
    report_filename = 'clustering_analysis_report.txt'
    with open(report_filename, 'w') as f:
        f.write(report)
    
    print(f"Analysis report saved to: {report_filename}")
    print("\nAnalysis completed successfully!")
    
else:
    print("No data available for export.")

## Usage Instructions and Next Steps

### How to Use This Notebook:

1. **Setup Google API Credentials:**
   - Go to [Google Cloud Console](https://console.cloud.google.com/)
   - Create a new project or select an existing one
   - Enable the Google Forms API
   - Create OAuth 2.0 client credentials (Desktop app), which is the flow this notebook uses
   - Download the credentials file as `credentials.json`

2. **Configure the Form ID:**
   - Replace `YOUR_FORM_ID_HERE` with your actual Google Form ID
   - The Form ID can be found in the form URL: `https://docs.google.com/forms/d/{FORM_ID}/edit`

3. **Run the Analysis:**
   - Execute all cells in order
   - The first run will prompt for Google authentication
   - Results will be saved as CSV and text report files
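If you only have the form's edit URL, the ID can be pulled out with a regular expression. This is a minimal sketch based on the URL format shown in step 2. The allowed character set is an assumption. Responder links of the form `/forms/d/e/...` are deliberately rejected, on the assumption that they carry a different identifier. The helper `extract_form_id` is hypothetical.

```python
import re

# Pattern based on the edit URL format https://docs.google.com/forms/d/{FORM_ID}/edit
# The (?!e/) lookahead skips responder links such as /forms/d/e/.../viewform
FORM_URL_PATTERN = re.compile(r"/forms/d/(?!e/)([A-Za-z0-9_-]+)")

def extract_form_id(url):
    """Return the form ID from a Google Forms edit URL, or None if absent."""
    match = FORM_URL_PATTERN.search(url)
    return match.group(1) if match else None

# Placeholder ID, not a real form
print(extract_form_id("https://docs.google.com/forms/d/abc123_XYZ-9/edit"))  # abc123_XYZ-9
```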

### Interpreting Results:

- **Clusters**: Groups of similar responses based on content
- **Noise Points**: Responses that don't fit well into any cluster
- **Sentiment Scores**: Range from -1 (negative) to +1 (positive)
- **Top Terms**: Most characteristic words for each cluster
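The cluster analysis cell labels an average sentiment as positive above 0.1, negative below -0.1, and neutral otherwise. The sketch below restates that rule as a small function. The name `label_sentiment` and the adjustable `threshold` parameter are hypothetical additions for illustration.

```python
def label_sentiment(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to a coarse label."""
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"

for score in (0.45, 0.05, -0.3):
    print(score, label_sentiment(score))
```

Scores close to zero are treated as neutral, so a cluster with mixed opinions can also come out as "Neutral" when positive and negative responses cancel out in the average.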

### Customization Options:

- Adjust `min_cluster_size` in HDBSCAN for different cluster granularity
- Modify TF-IDF parameters for different text processing
- Change the number of top terms displayed
- Add custom text preprocessing steps
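The notebook derives `min_cluster_size` and `min_samples` from the number of responses rather than hard-coding them. The sketch below reproduces that heuristic so you can see what values it yields for different dataset sizes. The `divisor` parameter is a hypothetical knob added here for experimentation. In the notebook it is fixed at 10.

```python
def hdbscan_params(n_responses, divisor=10):
    """Return (min_cluster_size, min_samples) using the notebook's heuristic."""
    min_cluster_size = max(2, n_responses // divisor)
    min_samples = max(1, min_cluster_size // 2)
    return min_cluster_size, min_samples

for n in (15, 100, 500):
    print(n, hdbscan_params(n))
# 15 (2, 1)
# 100 (10, 5)
# 500 (50, 25)
```

A larger divisor gives a smaller `min_cluster_size`, which generally lets HDBSCAN report more and smaller clusters. For example, `hdbscan_params(100, divisor=20)` returns `(5, 2)`.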

### Potential Extensions:

- Topic modeling with LDA or BERTopic
- Sentiment analysis with advanced models
- Interactive visualizations with Plotly
- Automatic cluster labeling with GPT
- Real-time analysis with form webhooks