# AUDIO ANALYSIS PROJECT - ASR TRANSCRIPTION & INITIAL NLP PIPELINE
**Author:** S.Akil | **Mentor:** Resma Rani Nimalpuri  

## Overview
This notebook implements ASR (Automatic Speech Recognition) transcription with OpenAI's Whisper model, followed by an initial NLP pipeline for text cleaning, tokenization, stopword removal, and lemmatization.  

**Pipeline Flow:**  
1. Fetch and load audio from an online URL (integrates with project Step 3: Preprocessing).  
2. Transcribe audio to raw text.  
3. Process text into clean, lemmatized tokens (ready for project Step 5: Topic Segmentation).  

**Sample Input:** A public-domain audiobook clip from LibriVox (Sherlock Holmes, Chapter 1).  
**Expected Output:** `transcript.txt` (raw text) and `processed_tokens.txt` (NLP-ready tokens).  

**Run Instructions:** Execute cells top-to-bottom. First-time NLTK/Whisper downloads may take 1-2 minutes.  

# Step 1: Import Required Libraries

In [None]:

# Explanation: Whisper for ASR, NLTK for NLP, requests/librosa for audio handling from URLs.
# Run this cell first to verify installs; the only expected output is the confirmation message (NLTK downloads run quietly).

import whisper  # pip install openai-whisper (for transcription)
import nltk    # pip install nltk (for tokenization, stopwords, lemmatization)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import requests  # For fetching audio from URLs
import librosa   # For loading/resampling audio (pip install librosa)
import io        # For in-memory byte handling
import os        # For file paths/saving outputs

# Download NLTK data (runs once; quiet mode for cleanliness)
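nltk.download('punkt', quiet=True)       # For tokenization
nltk.download('punkt_tab', quiet=True)   # Tokenizer tables needed by word_tokenize in newer NLTK releases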
nltk.download('stopwords', quiet=True)   # For stopwords
nltk.download('wordnet', quiet=True)     # For lemmatization
nltk.download('omw-1.4', quiet=True)     # For multilingual lemmatization support

print("✅ Libraries loaded successfully! NLTK data ready. Proceed to next cell.")
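
Before running the import cell, it can help to confirm that the third-party packages are importable at all. The following is a minimal sketch using only the standard library. `missing_packages` is a hypothetical helper, not part of the notebook, and the `required` list simply mirrors the imports above.

```python
import importlib.util

def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Import names used in this notebook (the pip name can differ, e.g. openai-whisper -> whisper)
required = ["whisper", "nltk", "requests", "librosa"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Checking with `find_spec` avoids actually importing heavy packages, so the check stays fast even when everything is installed.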

# Step 2: Function to Fetch and Load Audio from Online URL

In [None]:

# Explanation: Downloads audio from URL and loads as a NumPy array for Whisper.
#              Assumes pre-cleaning from project Step 3 (Preprocessing), but still resamples to 16 kHz mono.
# Usage: Call this in the execution cell; raises ValueError if the download fails.

def load_audio_from_url(audio_url):
    """
    Fetches audio from URL and loads as a numpy array for Whisper.
    Args: audio_url (str) - URL to MP3/WAV file.
    Returns: audio (np.array) - Loaded audio data at 16kHz mono.
    """
    # Download the audio file
    response = requests.get(audio_url, timeout=60)  # Timeout so a stalled connection cannot hang the notebook
    if response.status_code != 200:
        raise ValueError(f"❌ Failed to fetch audio from {audio_url} (Status: {response.status_code}). Check URL or network.")
    
    # Load into librosa (decodes the file and resamples to 16 kHz mono).
    # Note: decoding MP3 from an in-memory buffer depends on the installed soundfile/libsndfile build.
    audio_data, sr = librosa.load(io.BytesIO(response.content), sr=16000, mono=True)
    
    print(f"✅ Audio loaded: Duration ~{len(audio_data)/sr:.1f} seconds, Sample Rate: {sr} Hz")
    return audio_data

# Quick Test (Optional: Uncomment to test alone)
# test_url = "https://archive.org/download/adventures_sherlock_holmes_rg_librivox/adventuresholmes_01_doyle.mp3"
# audio_test = load_audio_from_url(test_url)
# print("Test complete; no errors!")
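
Because `load_audio_from_url` decodes the whole file into memory, it is worth estimating the array size before loading long recordings. The sketch below assumes mono audio at 16 kHz stored as float32 (4 bytes per sample, which is what `librosa.load` returns by default). `estimate_audio_footprint` is a hypothetical helper for illustration only.

```python
def estimate_audio_footprint(duration_seconds, sample_rate=16000, bytes_per_sample=4):
    """Estimate the sample count and memory use (in bytes) of a mono audio array."""
    n_samples = int(duration_seconds * sample_rate)
    n_bytes = n_samples * bytes_per_sample
    return n_samples, n_bytes

# Example: a 28-minute clip, roughly the length quoted for the sample chapter
n_samples, n_bytes = estimate_audio_footprint(28 * 60)
print(f"{n_samples:,} samples, about {n_bytes / 1e6:.1f} MB")
# prints: 26,880,000 samples, about 107.5 MB
```

The raw HTTP response bytes are held in memory as well until the function returns, so peak usage is somewhat higher than this figure.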

# Step 3: ASR Transcription Function Using Whisper

In [None]:

# Explanation: Loads a Whisper model and transcribes the audio array to text.
#              Whisper auto-detects the spoken language; returns plain text (no timestamps here).
# Usage: Pass audio_data; tweak model_name for trade-offs (base=fast, medium=accurate).

def transcribe_audio(audio_data, model_name='base'):
    """
    Transcribes audio using Whisper ASR.
    Args: audio_data (np.array) - Preprocessed audio array.
          model_name (str) - Whisper model size (base/small/medium/large).
    Returns: transcript (str) - Raw transcribed text.
    """
    # Load Whisper model (downloads ~142MB for 'base' on first run; cached after)
    print(f"🔄 Loading Whisper model '{model_name}'... (First run may take 1-2 min.)")
    model = whisper.load_model(model_name)
    
    # Transcribe (fp16=False for CPU; add language='en' if needed)
    result = model.transcribe(audio_data, fp16=False)
    
    transcript = result['text'].strip()
    print(f"✅ Transcription complete! Model: {model_name}")
    print(f"📝 Raw Transcript Preview (first 200 chars): {transcript[:200]}...")
    return transcript

# Quick Test (Optional: Uncomment after running the Step 2 quick test, which defines audio_test)
# transcript_test = transcribe_audio(audio_test)
# print("Test complete; transcription ready!")
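
The function above keeps only `result['text']`. Whisper's `transcribe` result also carries a `segments` list whose entries include `start` and `end` times (in seconds) and the segment `text`, which may be useful later for topic segmentation. A minimal sketch of formatting those segments follows. `format_timestamp` and `format_segments` are hypothetical helpers, and `mock_result` is made-up data in the same shape, not real model output.

```python
def format_timestamp(seconds):
    """Convert a time in seconds to an MM:SS.ss string."""
    minutes, secs = divmod(float(seconds), 60)
    return f"{int(minutes):02d}:{secs:05.2f}"

def format_segments(result):
    """Build one '[start - end] text' line per segment of a Whisper-style result dict."""
    lines = []
    for seg in result.get("segments", []):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} - {end}] {seg['text'].strip()}")
    return lines

# Made-up stand-in for the dict returned by model.transcribe(...)
mock_result = {
    "text": " Hello there. How are you?",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Hello there."},
        {"start": 2.5, "end": 65.25, "text": " How are you?"},
    ],
}

for line in format_segments(mock_result):
    print(line)
# prints:
# [00:00.00 - 00:02.50] Hello there.
# [00:02.50 - 01:05.25] How are you?
```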

# Step 4: Initial NLP Pipeline Function

In [None]:

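# Explanation: Cleans transcript, then: Tokenize → Remove Stopwords → Lemmatize.
#              Outputs list of lemmatized tokens (e.g., 'dogs' → 'dog'). Note that lemmatize()
#              defaults to noun POS, so verb forms such as 'running' stay unchanged unless pos='v' is passed.
#              English-focused; extend for multilingual later.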

def process_text_nlp(transcript):
    """
    Applies NLP pipeline: Clean → Tokenize → Remove Stopwords → Lemmatize.
    Args: transcript (str) - Raw text from ASR.
    Returns: processed_tokens (list) - List of lemmatized tokens.
    """
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))  # English stopwords; customize as needed
    
    # Sub-step 4.1: Text Cleaning (lowercase, normalize whitespace; punctuation is dropped later by the isalpha() filter)
    cleaned_text = ' '.join(transcript.lower().split())  # Simple but effective
    print(f"🧹 Cleaned Text Preview: {cleaned_text[:200]}...")
    
    # Sub-step 4.2: Tokenization
    tokens = word_tokenize(cleaned_text)
    print(f"🔤 Tokens (first 10): {tokens[:10]} | Total: {len(tokens)}")
    
    # Sub-step 4.3: Stopword Removal (keep alphabetic tokens only)
    filtered_tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    print(f"🗑️ After Stopwords Removal (first 10): {filtered_tokens[:10]} | Total: {len(filtered_tokens)}")
    
    # Sub-step 4.4: Lemmatization
    processed_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    print(f"📦 Lemmatized Tokens (first 10): {processed_tokens[:10]} | Final Total: {len(processed_tokens)}")
    
    return processed_tokens

# Quick Test (Optional: Uncomment after running the Step 3 quick test, which defines transcript_test)
# tokens_test = process_text_nlp(transcript_test)
# print("Test complete; NLP ready!")
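
To see what the cleaning and filtering stages each contribute without downloading NLTK data, the two steps can be imitated with the standard library. This is only an illustration: the regex tokenizer and the tiny `DEMO_STOPWORDS` set are stand-ins (assumptions) for `word_tokenize` and NLTK's English stopword list, the sample sentence is invented, and lemmatization is omitted.

```python
import re

DEMO_STOPWORDS = {"the", "a", "is", "was", "to", "and", "of", "in"}  # hypothetical mini list

def clean_text(text):
    """Lowercase and collapse runs of whitespace, as in sub-step 4.1."""
    return " ".join(text.lower().split())

def simple_tokenize(text):
    """Split into words and single punctuation marks (rough stand-in for word_tokenize)."""
    return re.findall(r"\w+|[^\w\s]", text)

def filter_tokens(tokens, stop_words):
    """Keep alphabetic, non-stopword tokens, as in sub-step 4.3."""
    return [t for t in tokens if t.isalpha() and t not in stop_words]

raw = "  The  Quick brown fox,\n was RUNNING to the river.  "
cleaned = clean_text(raw)
tokens = simple_tokenize(cleaned)
kept = filter_tokens(tokens, DEMO_STOPWORDS)

print(cleaned)  # the quick brown fox, was running to the river.
print(tokens)   # still contains ',' and '.'
print(kept)     # ['quick', 'brown', 'fox', 'running', 'river']
```

Punctuation survives the cleaning step and is only removed by the `isalpha()` check, which also discards numbers and any token containing an apostrophe.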

# Step 5: End-to-End Pipeline Execution

In [None]:

# Explanation: Runs the full chain: Load → Transcribe → NLP Process.
#              Uses sample URL; swap for your own (e.g., from preprocessing outputs).
#              Saves files for review and for project Step 5 (Topic Segmentation).

if __name__ == "__main__":
    # Sample online audio URL (verified working: Sherlock Holmes Ch. 1, ~28 min)
    sample_url = "https://archive.org/download/adventures_sherlock_holmes_rg_librivox/adventuresholmes_01_doyle.mp3"
    
    print("🚀 === Starting Full ASR + NLP Pipeline ===")
    
    # Load audio from URL
    audio_data = load_audio_from_url(sample_url)
    
    # Transcribe
    transcript = transcribe_audio(audio_data, model_name='base')
    
    # Process with NLP
    processed_tokens = process_text_nlp(transcript)
    
    # Final Results
    print("\n🏁 === Pipeline Complete! ===")
    print(f"📊 Original Transcript: {len(transcript.split())} words")
    print(f"📊 Processed Tokens: {len(processed_tokens)}")
    print(f"📝 Sample Processed Output (first 20): {' '.join(processed_tokens[:20])}...")
    
    # Save Outputs (for next steps, e.g., topic modeling)
    with open('transcript.txt', 'w', encoding='utf-8') as f:
        f.write(transcript)
    print("💾 Saved: transcript.txt (raw ASR text)")
    
    with open('processed_tokens.txt', 'w', encoding='utf-8') as f:
        f.write(' '.join(processed_tokens))
    print("💾 Saved: processed_tokens.txt (NLP tokens)")

print("🎉 All done! Check saved files and outputs above.")
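
Since `processed_tokens.txt` stores the tokens joined by single spaces, a downstream step can recover the list with `str.split()`. The sketch below shows that round trip and a quick frequency check with `collections.Counter`. The helper names, the token list, and the temporary directory are illustrative assumptions, not output from the pipeline.

```python
import os
import tempfile
from collections import Counter

def save_tokens(tokens, path):
    """Write tokens as one space-separated line, matching the format used above."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(" ".join(tokens))

def load_tokens(path):
    """Read a space-separated token file back into a list."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read().split()

# Made-up tokens standing in for real pipeline output
demo_tokens = ["holmes", "woman", "holmes", "watson", "case", "holmes", "watson"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed_tokens.txt")
    save_tokens(demo_tokens, path)
    reloaded = load_tokens(path)

top_two = Counter(reloaded).most_common(2)
print(reloaded == demo_tokens)  # True
print(top_two)                  # [('holmes', 3), ('watson', 2)]
```

The round trip is lossless here only because the tokens themselves contain no whitespace, which holds for the output of the `isalpha()` filter.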