# Cleaning Youtube Data

Notebook summary:  

To process raw poker video transcripts, extract meaningful content related to poker rules, strategies, player advice, and hand analyses, and generate structured prompt-completion pairs suitable for fine-tuning a language model.

### Imports 

In [43]:
import pandas as pd
import json
import re
import spacy
import logging
from typing import List, Dict
from collections import defaultdict
import random

In [44]:
def inspect_transcript(csv_path: str):
    """Debug function to inspect raw transcript format"""
    df = pd.read_csv(csv_path)
    print("Sample transcript raw data:")
    print(df['Transcript'].iloc[0][:200])  # Print first 200 chars of first transcript
    return df

inspect_transcript('../data/raw/transcript1.csv')

Sample transcript raw data:
"[{""text"": ""hi guys and welcome to how to deal poker"", ""start"": 0.0, ""duration"": 5.21}, {""text"": ""like a professional step one is the wash"", ""start"": 1.77, ""duration"": 6.029}, {""text"


Unnamed: 0,VideoID,Title,Transcript,URL
0,nWYSfXreH6M,How To Deal Texas Holdem Poker Professionally ...,"""[{""""text"""": """"hi guys and welcome to how to d...",https://www.youtube.com/watch?v=nWYSfXreH6M
1,S_h9EEzBoYU,"""Risking $10,000 For The Ultimate Texas Hold E...","""[{""""text"""": """"welcome everybody we're here at...",https://www.youtube.com/watch?v=S_h9EEzBoYU
2,B1JjmsekmqI,Mastering The Fundamentals: Postflop Strategy,"""[{""""text"""": """"let's discuss post-flop strateg...",https://www.youtube.com/watch?v=B1JjmsekmqI
3,OHFDITnIo7Y,#1 PLO Tip for No Limit Holdem Players (NLH ▶️...,"""[{""""text"""": """"my number one PLO tip for Nolan...",https://www.youtube.com/watch?v=OHFDITnIo7Y
4,noKz7aCfwNM,9 TEXAS HOLD&#39;EM Poker Tips For Beginners (...,"""[{""""text"""": """"what's going on guys Nathan her...",https://www.youtube.com/watch?v=noKz7aCfwNM
...,...,...,...,...
95,y2W8-AhrmMU,How to Deal Poker Like a Pro in 7 Easy Steps,"""[{""""text"""": """"let's get it on welcome to card...",https://www.youtube.com/watch?v=y2W8-AhrmMU
96,Eimf0ExhSjM,5 Step Guide to CRUSH Low Stakes Poker (2024),"""[{""""text"""": """"here's the Ultimate five-step G...",https://www.youtube.com/watch?v=Eimf0ExhSjM
97,pSRGErzzIo4,How to Play Poker for Beginners | PokerStars L...,"""[{""""text"""": """"Hello, welcome to PokerStars Le...",https://www.youtube.com/watch?v=pSRGErzzIo4
98,su61vuClbBI,3-Betting in Poker: A Complete Guide for Begin...,"""[{""""text"""": """"have you ever heard the term th...",https://www.youtube.com/watch?v=su61vuClbBI


In [45]:
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("processing.log"),
        logging.StreamHandler()
    ]
)

### Load spaCy Model

In [46]:
# Load spaCy English model for NLP tasks
nlp = spacy.load("en_core_web_sm")

### Define the PokerTranscriptProcessor Class

In [47]:
class PokerTranscriptProcessor:
    def __init__(self):
        """
        Initializes the PokerTranscriptProcessor with poker-related terminologies and templates.
        """
        # Comprehensive poker terminology categorized into dictionaries
        self.poker_terms = {
            'game_stages': ['preflop', 'flop', 'turn', 'river', 'showdown'],
            'actions': ['fold', 'call', 'raise', 'bet', 'check', 'all-in', 'bluff', 'limp'],
            'positions': ['button', 'cutoff', 'hijack', 'early', 'middle', 'late', 'blinds', 'utg'],
            'player_types': ['tight', 'loose', 'aggressive', 'passive', 'fish', 'shark', 'nit', 'maniac'],
            'hand_types': ['pocket', 'suited', 'offsuit', 'connector', 'broadway', 'ace-high'],
            'concepts': ['pot odds', 'equity', 'range', 'value bet', 'bluff', 'semi-bluff', 'implied odds']
        }
        
        # Templates for generating prompt-completion pairs
        self.templates = {
            'rules': ["What is {concept} in poker?"],  # Single template per category
            'strategy': ["What's the optimal strategy when {situation}?"],
            'player_advice': ["How should I play against {player_type} players?"],
            'hand_analysis': ["How should I play {hand} in this situation?"]
        }
        
        # Keywords indicating instructional content
        self.instruction_markers = [
            'should', 'must', 'need to', 'optimal', 'better', 'strategy',
            'recommend', 'important', 'key', 'best', 'avoid', 'always', 'never'
        ]
        
        # Optional: Initialize summarization pipeline for rewriting completions
        # self.summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    
    def process_raw_csv(self, file_path: str) -> pd.DataFrame:
        """
        Loads and cleans the raw CSV file containing transcripts.
        
        Parameters:
            file_path (str): Path to the raw CSV file.
        
        Returns:
            pd.DataFrame: Cleaned DataFrame containing transcript data.
        """
        try:
            df = pd.read_csv(file_path)
            df.columns = df.columns.str.strip()
            return df
        except Exception as e:
            print(f"Error processing CSV {file_path}: {e}")
            return pd.DataFrame()
    
    def clean_transcript(self, raw_transcript: str) -> List[str]:
        """
        Cleans and parses the raw transcript from CSV format containing double-quoted JSON.
        
        Parameters:
            raw_transcript (str): Raw transcript string from the CSV.
        
        Returns:
            List[str]: List of cleaned text segments extracted from the transcript.
        """
        try:
            # Remove outer quotes and normalize inner quotes
            cleaned = raw_transcript.strip('"')
            cleaned = cleaned.replace('""', '"')
            
            # Parse JSON content
            transcript_data = json.loads(cleaned)
            
            # Extract text segments
            return [segment['text'] for segment in transcript_data if 'text' in segment]
                
        except Exception as e:
            print(f"Error cleaning transcript: {e}")
            print(f"Problematic transcript part: {raw_transcript[:100]}")
            return []
    
    def combine_segments(self, segments: List[str]) -> str:
        """
        Combines individual transcript segments into a coherent block of text.
        
        Parameters:
            segments (List[str]): List of text segments from the transcript.
        
        Returns:
            str: Combined and cleaned text.
        """
        text = ' '.join(segments)
        text = re.sub(r'\[Music\]|\[Applause\]', '', text)  # Remove non-verbal cues
        text = re.sub(r'\s+', ' ', text)  # Normalize whitespace
        return text.strip()
    
    def extract_poker_content(self, text: str) -> Dict[str, List[str]]:
        """
        Extracts and categorizes poker-related content from the combined transcript text.
        
        Parameters:
            text (str): Combined transcript text.
        
        Returns:
            Dict[str, List[str]]: Dictionary categorizing extracted content into 'rules', 'strategy', 'player_advice', and 'hand_analysis'.
        """
        doc = nlp(text)
        content = {
            'rules': [],
            'strategy': [],
            'player_advice': [],
            'hand_analysis': []
        }
        
        # Process each sentence in the transcript
        for sent in doc.sents:
            sent_text = sent.text.strip()
            
            # Basic filtering based on sentence validity
            if not self._is_valid_sentence(sent_text):
                continue
                
            # Check for instructional content indicators
            if not self._has_instructional_content(sent):
                continue
                
            # Categorize the sentence
            category = self._categorize_sentence(sent)
            if category:
                # Clean and rewrite the sentence for coherence
                cleaned = self._clean_and_rewrite_sentence(sent_text, category)
                if cleaned:
                    content[category].append(cleaned)
        
        return content
    
    def create_training_pairs(self, content: Dict[str, List[str]]) -> List[Dict[str, str]]:
        """
        Creates prompt-completion pairs from the categorized poker content.
        """
        pairs = []
        seen_completions = set()
        used_concepts = set()
        
        for category, items in content.items():
            if category in self.templates:
                templates = self.templates[category]
                
                for item in items:
                    if not self._is_meaningful_content(item):
                        continue
                    
                    concept = self._get_unique_concept(item, category, used_concepts)
                    if not concept:
                        continue
                        
                    template = random.choice(templates)
                    prompt = self._create_prompt(template, category, concept)
                    completion = self._create_completion(item, category)
                    
                    if (completion and 
                        completion not in seen_completions and 
                        self._is_coherent_completion(prompt, completion)):
                        
                        pairs.append({
                            "prompt": prompt,
                            "completion": completion
                        })
                        seen_completions.add(completion)
        
        return pairs
    
    def _is_meaningful_content(self, text: str) -> bool:
        """
        Checks if the content is meaningful by filtering out clichés and ensuring sentence completeness.
        
        Parameters:
            text (str): The sentence to evaluate.
        
        Returns:
            bool: True if the content is meaningful, False otherwise.
        """
        # Must contain poker terms
        has_poker_terms = any(term in text.lower() for terms in self.poker_terms.values() for term in terms)
                
        # Relaxed sentence structure check
        is_proper_sentence = text[0].isupper() and text[-1] in '.!?'
                
        # Relaxed length check
        good_length = 5 <= len(text.split()) <= 100
                
        # Relaxed instruction check - remove this constraint if too restrictive
        # has_instruction = any(word in text.lower() for word in self.instruction_markers)
                
        return has_poker_terms and is_proper_sentence and good_length
    
    def _calculate_content_density(self, text: str) -> float:
        """
        Calculates the ratio of meaningful words to total words in a sentence.
        
        Parameters:
            text (str): The sentence to evaluate.
        
        Returns:
            float: Content density ratio.
        """
        doc = nlp(text)
        
        # Count meaningful tokens (excluding stopwords and punctuation)
        meaningful_tokens = [token for token in doc 
                             if not token.is_stop 
                             and not token.is_punct
                             and token.pos_ in ['NOUN', 'VERB', 'ADJ']]
        
        if len(doc) == 0:
            return 0
        
        return len(meaningful_tokens) / len(doc)
    
    def _is_content_rich(self, text: str) -> bool:
        """
        Determines if the content has sufficient density of meaningful information.
        
        Parameters:
            text (str): The sentence to evaluate.
        
        Returns:
            bool: True if content density is above the threshold, False otherwise.
        """
        density = self._calculate_content_density(text)
        return density > 0.3 
    
    def _is_contextually_relevant(self, text: str) -> bool:
        """
        Checks if the content is contextually relevant to poker instruction.
        
        Parameters:
            text (str): The sentence to evaluate.
        
        Returns:
            bool: True if relevant, False otherwise.
        """
        doc = nlp(text)
        
        # Define key poker-related verbs and nouns
        poker_actions = set(['bet', 'raise', 'fold', 'call', 'check'])
        poker_concepts = set(['hand', 'position', 'pot', 'stack', 'range'])
        
        # Check for presence of poker-specific actions or concepts
        has_poker_action = any(token.lemma_ in poker_actions for token in doc)
        has_poker_concept = any(token.text.lower() in poker_concepts for token in doc)
        
        return has_poker_action or has_poker_concept
    
    def _has_valid_structure(self, sent) -> bool:
        """
        Checks if the sentence has a valid instructional structure, including conditional statements and imperative moods.
        
        Parameters:
            sent (spacy.tokens.Span): The sentence to evaluate.
        
        Returns:
            bool: True if the structure is valid, False otherwise.
        """
        # Check for conditional markers
        has_condition = any(token.dep_ == 'mark' for token in sent)
        
        # Check for imperative mood or modal verbs
        has_instruction = (sent[0].pos_ == 'VERB' or 
                           any(token.tag_ == 'MD' for token in sent))
        
        # Check for a main clause structure
        has_main_clause = any(token.dep_ == 'ROOT' for token in sent)
        
        return (has_condition or has_instruction) and has_main_clause
    
    def _is_valid_sentence(self, text: str) -> bool:
        """
        Comprehensive validation of a sentence to determine its suitability for inclusion.
        
        Parameters:
            text (str): The sentence to evaluate.
        
        Returns:
            bool: True if the sentence passes all validation checks, False otherwise.
        """
        # Must start with a capital letter and end with proper punctuation
        if not (text and text[0].isupper() and text[-1] in '.!?'):
            return False
                
        # Must not start with conjunctions or filler words
        if any(text.startswith(word) for word in ['But', 'And', 'However', 'So', 'Yeah', 'Um']):
            return False
            
        # Parse the sentence with spaCy for further checks
        doc = nlp(text)
        
        # Enhanced validation checks
        if not self._is_meaningful_content(text):
            return False
                
        if not self._is_content_rich(text):
            return False
                
        if not self._is_contextually_relevant(text):
            return False
                
        if not self._has_valid_structure(doc):
            return False
                
        # Must contain poker-related terms
        if not any(term in text.lower() for terms in self.poker_terms.values() for term in terms):
            return False
                
        return True
    
    def _has_instructional_content(self, sent) -> bool:
        """
        Determines if a sentence contains instructional content based on markers and sentence structure.
        
        Parameters:
            sent (spacy.tokens.Span): The sentence to evaluate.
        
        Returns:
            bool: True if instructional, False otherwise.
        """
        text = sent.text.lower()
        
        # Check for instructional markers
        has_marker = any(marker in text for marker in self.instruction_markers)
        
        # Check for modal verbs indicating advice
        has_modal = any(token.lemma_ in ['should', 'must', 'can', 'need'] for token in sent)
        
        # Check for imperative mood
        starts_with_verb = sent[0].pos_ == 'VERB'
        
        return has_marker or has_modal or starts_with_verb
    
    def _categorize_sentence(self, sent) -> str:
        """
        Categorizes a sentence into one of the predefined categories based on its content.
        
        Parameters:
            sent (spacy.tokens.Span): The sentence to categorize.
        
        Returns:
            str: The category of the sentence ('rules', 'strategy', 'player_advice', 'hand_analysis') or None if uncategorized.
        """
        text = sent.text.lower()
        
        # Prioritize categorization based on specificity
        if any(term in text for term in self.poker_terms['hand_types']):
            return 'hand_analysis'
        elif any(term in text for term in self.poker_terms['player_types']):
            return 'player_advice'
        elif any(term in text for term in self.poker_terms['game_stages']):
            return 'rules'
        elif any(term in text for term in self.poker_terms['actions']):
            return 'strategy'
        
        return None
    
    def _clean_and_rewrite_sentence(self, text: str, category: str) -> str:
        """
        Cleans and rewrites a sentence to ensure coherence and relevance to the category.
        
        Parameters:
            text (str): The original sentence.
            category (str): The category of the sentence.
        
        Returns:
            str: The cleaned and rewritten sentence, or None if rewriting fails.
        """
        # Remove filler words and normalize
        text = self._clean_sentence(text)
        
        # Rewriting based on category
        if category == 'rules':
            return self._rewrite_as_rule(text)
        elif category == 'strategy':
            return self._rewrite_as_strategy(text)
        elif category == 'player_advice':
            return self._rewrite_as_player_advice(text)
        elif category == 'hand_analysis':
            return self._rewrite_as_hand_analysis(text)
        
        return None
    
    def _rewrite_as_rule(self, text: str) -> str:
        """
        Rewrites a sentence as a clear rule definition.
        
        Parameters:
            text (str): The original sentence.
        
        Returns:
            str: Rewritten rule definition.
        """
        doc = nlp(text)
        
        # Identify the main concept being explained
        concepts = [term for term in self.poker_terms['concepts'] if term in text.lower()]
        if not concepts:
            return None
        concept = concepts[0]
        
        # Extract the full sentence containing the concept
        sentences = text.split('.')
        concept_sentence = next((s for s in sentences if concept in s.lower()), None)
        
        if concept_sentence:
            # Clean up the sentence
            cleaned = concept_sentence.strip()
            if not cleaned.lower().startswith(concept.lower()):
                cleaned = f"{concept} is {cleaned}"
            return cleaned + "."
        
        return None
        
    def _rewrite_as_strategy(self, text: str) -> str:
        """
        Rewrites a sentence as clear strategic advice.
        
        Parameters:
            text (str): The original sentence.
        
        Returns:
            str: Rewritten strategic advice.
        """
        doc = nlp(text)
        
        # Identify the main verb and object for strategy action
        main_verb = None
        main_obj = None
        
        for token in doc:
            if token.dep_ == 'ROOT' and token.pos_ == 'VERB':
                main_verb = token.text
                for child in token.children:
                    if child.dep_ in ['dobj', 'pobj']:
                        main_obj = child.text
                        break
        
        if main_verb and main_obj:
            return f"{main_verb.capitalize()} {main_obj} when facing this situation."
        
        return None
    
    def _rewrite_as_player_advice(self, text: str) -> str:
        """
        Rewrites a sentence as clear advice for playing against specific player types.
        
        Parameters:
            text (str): The original sentence.
        
        Returns:
            str: Rewritten player advice.
        """
        doc = nlp(text)
        
        # Identify the player type being discussed
        player_types = [token.text for token in doc if token.text.lower() in self.poker_terms['player_types']]
        if not player_types:
            return None
        
        # Extract the advice portion following the player type
        player_type = player_types[0]
        advice_start = text.lower().find(player_type.lower())
        if advice_start == -1:
            return None
        
        advice = text[advice_start + len(player_type):].strip().strip('.!?')
        if not advice:
            return None
        
        # Construct clear player advice
        return f"You should {advice.lower()} when playing against {player_type} players."
    
    def _rewrite_as_hand_analysis(self, text: str) -> str:
        """
        Rewrites a sentence as clear hand analysis advice.
        
        Parameters:
            text (str): The original sentence.
        
        Returns:
            str: Rewritten hand analysis advice.
        """
        doc = nlp(text)
        
        # Identify the hand type being discussed
        hand_types = [token.text for token in doc if any(term in token.text.lower() for term in self.poker_terms['hand_types'])]
        if not hand_types:
            return None
        
        hand_type = hand_types[0]
        
        # Extract the situation and advice
        hand_start = text.lower().find(hand_type.lower())
        if hand_start == -1:
            return None
        
        # Attempt to split the sentence into situation and advice
        advice_parts = text[hand_start + len(hand_type):].split(',')
        if len(advice_parts) > 1:
            situation = advice_parts[0].strip()
            advice = ' '.join(advice_parts[1:]).strip().strip('.!?')
            return f"In {situation}, with {hand_type}, you should {advice.lower()}."
        
        return None
    
    def _is_grammatically_correct(self, prompt: str) -> bool:
        """
        Checks if the prompt is grammatically correct.
        """
        return prompt[0].isupper() and prompt.endswith('?')

    def _is_coherent_completion(self, prompt: str, completion: str) -> bool:
        """
        Checks if the completion is coherent and directly answers the prompt.
        """
        prompt_terms = set(re.findall(r'\w+', prompt.lower()))
        completion_terms = set(re.findall(r'\w+', completion.lower()))
        
        common_terms = prompt_terms.intersection(completion_terms)
        return len(common_terms) > 2


    def _extract_key_phrase(self, text: str, category: str) -> str:
        """
        Extracts diverse key phrases from the text based on the category.
        """
        doc = nlp(text)
        
        if category == 'rules':
            # Look for different poker concepts
            concepts = []
            for term in self.poker_terms['concepts']:
                if term in text.lower():
                    # Include some context around the concept
                    concept_idx = text.lower().find(term)
                    context = text[max(0, concept_idx-20):min(len(text), concept_idx+len(term)+20)]
                    concepts.append((term, context))
            
            if concepts:
                # Choose a random concept-context pair
                term, context = random.choice(concepts)
                return term.capitalize()
        
        elif category == 'strategy':
            # Look for action-situation pairs
            actions = [term for term in self.poker_terms['actions'] if term in text.lower()]
            if actions:
                action = random.choice(actions)
                # Try to find associated situation
                action_idx = text.lower().find(action)
                situation = text[max(0, action_idx-30):min(len(text), action_idx+30)]
                return f"{action} in {situation}".capitalize()
        
        elif category == 'player_advice':
            # Look for player types and their characteristics
            player_types = [term for term in self.poker_terms['player_types'] if term in text.lower()]
            if player_types:
                player_type = random.choice(player_types)
                # Try to find associated advice
                type_idx = text.lower().find(player_type)
                advice = text[max(0, type_idx-30):min(len(text), type_idx+30)]
                return player_type.capitalize()
        
        elif category == 'hand_analysis':
            # Look for hand types and positions
            hand_types = [term for term in self.poker_terms['hand_types'] if term in text.lower()]
            positions = [term for term in self.poker_terms['positions'] if term in text.lower()]
            
            if hand_types:
                hand_type = random.choice(hand_types)
                if positions:
                    # If we have both hand type and position
                    position = random.choice(positions)
                    return f"{hand_type} from {position}".capitalize()
                return hand_type.capitalize()
            elif positions:
                # If we only have position
                position = random.choice(positions)
                return f"Playing from {position}".capitalize()
        
        # If no specific category matches or no key phrases found
        # Look for any poker term as a fallback
        all_terms = []
        for term_list in self.poker_terms.values():
            all_terms.extend([term for term in term_list if term in text.lower()])
        
        if all_terms:
            return random.choice(all_terms).capitalize()
        
        return "poker strategy"  # final fallback

    def _has_repetition(self, text: str) -> bool:
        """
        Detects if the text contains repetitive phrases.
        """
        words = text.lower().split()
        for i in range(len(words)-1):
            if words[i] == words[i+1]:
                return True
        return False
    
    def _create_prompt(self, template: str, category: str, text: str) -> str:
        """
        Creates a prompt by filling in the template with relevant key phrases.
        
        Parameters:
            template (str): The template string with placeholders.
            category (str): The category of the content.
            text (str): The source text to extract key phrases from.
        
        Returns:
            str: The filled-in prompt string.
        """
        prompt = template.format(
            concept=concept if '{concept}' in template else '',
            situation=concept if '{situation}' in template else '',
            player_type=concept if '{player_type}' in template else '',
            hand=concept if '{hand}' in template else ''
        ).strip()
        
        if not prompt.endswith('?'):
            prompt += '?'
        
        return prompt
        
    def _create_completion(self, text: str, category: str) -> str:
        """
        Creates a formal, neutral completion from the rewritten text.
        
        Parameters:
            text (str): The rewritten sentence.
            category (str): The category of the content.
        
        Returns:
            str: The formatted completion string.
        """
        
        text = text.strip()
        if not text:
            return None

        # Remove any existing prefixes we might add
        text = re.sub(r'^(In poker,|The optimal strategy is to|When facing|With)\s*', '', text, flags=re.IGNORECASE)
        
        if category == 'rules':
            completion = f"In poker, {text}"
        elif category == 'strategy':
            completion = f"The optimal strategy is to {text.lower()}"
        elif category == 'player_advice':
            completion = f"When facing this situation, {text.lower()}"
        elif category == 'hand_analysis':
            completion = f"With this hand, {text.lower()}"
        else:
            completion = text
        
        # Ensure proper sentence structure
        completion = completion.strip()
        if not completion.endswith('.'):
            completion += '.'
        
        # Ensure first letter is capitalized
        completion = completion[0].upper() + completion[1:]
        
        return completion
    
    def _is_valid_completion(self, completion: str) -> bool:
        """
        Validates that a completion is well-formed and meaningful.
        """
        if not completion:
            return False
            
        # Must be a complete sentence
        if not (completion[0].isupper() and completion.endswith('.')):
            return False
            
        # Must contain meaningful content
        words = completion.split()
        if len(words) < 5:
            return False
            
        # Must not have repetitive phrases
        if self._has_repetition(completion):
            return False
            
        return True

    def _is_valid_pair(self, prompt: str, completion: str) -> bool:
        """
        Validates the prompt-completion pair to ensure alignment and proper formatting.
        
        Parameters:
            prompt (str): The prompt string.
            completion (str): The completion string.
        
        Returns:
            bool: True if the pair is valid, False otherwise.
        """
        # Check prompt and completion lengths
        if len(prompt.split()) < 5 or len(completion.split()) < 8:
            return False
                
        # Check for presence of poker terms
        if not any(term in (prompt + completion).lower() for terms in self.poker_terms.values() for term in terms):
            return False
                
        # Check for proper formatting based on category
        if 'what is' in prompt.lower() and not completion.startswith('In poker'):
            return False
        if 'strategy' in prompt.lower() and not completion.startswith('The optimal strategy'):
            return False
        if 'against' in prompt.lower() and not completion.startswith('When facing'):
            return False
        if 'play' in prompt.lower() and not completion.startswith('With'):
            return False
                
        return True
    
    def process_transcript(self, transcript_text: str) -> List[Dict[str, str]]:
        """
        Processes a single transcript to generate prompt-completion pairs.
        
        Parameters:
            transcript_text (str): Raw transcript text from the CSV.
        
        Returns:
            List[Dict[str, str]]: List of prompt-completion pairs generated from the transcript.
        """
        segments = self.clean_transcript(transcript_text)
        if not segments:
            return []
                
        full_text = self.combine_segments(segments)
        content = self.extract_poker_content(full_text)
        
        pairs = []
        used_concepts = set()  # Track used concepts
        
        for category, items in content.items():
            if category in self.templates:
                template = random.choice(self.templates[category])
                
                for item in items:
                    # Get a unique concept for this item
                    concept = self._get_unique_concept(item, category, used_concepts)
                    if not concept:
                        continue
                    
                    # Create prompt using the unique concept
                    prompt = self._create_prompt(template, category, concept)
                    completion = self._create_completion(item, category)
                    
                    if completion and self._is_valid_completion(completion):
                        pairs.append({
                            "prompt": prompt,
                            "completion": completion
                        })
        
        return pairs
    
    def post_process_pairs(self, pairs: List[Dict[str, str]]) -> List[Dict[str, str]]:
        """
        Post-processes and filters the list of training pairs to ensure quality and uniqueness.
        
        Parameters:
            pairs (List[Dict[str, str]]): List of generated prompt-completion pairs.
        
        Returns:
            List[Dict[str, str]]: Filtered list of high-quality prompt-completion pairs.
        """
        processed_pairs = []
        seen_completions = set()
        
        for pair in pairs:
            prompt, completion = pair['prompt'], pair['completion']
                
            # Skip if either is empty or too short
            if not prompt or not completion:
                continue
                    
            # Skip if completion is duplicate
            if completion in seen_completions:
                continue
                    
            # Skip if completion doesn't match prompt category
            if not self._is_matching_pair(prompt, completion):
                continue
                    
            # Add to processed pairs
            processed_pairs.append({
                'prompt': prompt,
                'completion': completion
            })
            seen_completions.add(completion)
            
        return processed_pairs
    
    def _is_matching_pair(self, prompt: str, completion: str) -> bool:
        """
        Checks if the prompt and completion pair matches in terms of category and content.
        
        Parameters:
            prompt (str): The prompt string.
            completion (str): The completion string.
        
        Returns:
            bool: True if the pair matches appropriately, False otherwise.
        """
        # Check if rule prompt has rule completion
        if 'what is' in prompt.lower() and not completion.startswith('In poker'):
            return False
                
        # Check if strategy prompt has strategy completion
        if 'strategy' in prompt.lower() and not completion.startswith('The optimal strategy'):
            return False
                
        # Check if player advice prompt matches
        if 'against' in prompt.lower() and not completion.startswith('When facing'):
            return False
                
        # Check if hand analysis prompt matches
        if 'play' in prompt.lower() and not completion.startswith('With'):
            return False
                
        return True
    
    def _clean_sentence(self, text: str) -> str:
        """
        Cleans and normalizes sentence text by removing filler words, timestamps, brackets, and normalizing poker terms.
        
        Parameters:
            text (str): The sentence to clean.
        
        Returns:
            str: Cleaned and normalized sentence.
        """
        # Remove specific video references and fillers
        text = re.sub(r'like and subscribe|check out|visit|website|um|uh|yeah|you know|laughter|subscribe', '', text, flags=re.IGNORECASE)
        
        # Remove timestamps and brackets
        text = re.sub(r'\[\w+\]|\(\d+:\d+\)', '', text)
        
        # Normalize poker terms
        text = re.sub(r'hold\s*em', "Hold'em", text, flags=re.IGNORECASE)
        
        # Clean up whitespace
        text = ' '.join(text.split())
        
        # Ensure proper sentence termination
        text = text.strip()
        if text and not text[-1] in '.!?':
            text += '.'
            
        return text

    def _get_unique_concept(self, text: str, category: str, used_concepts: set) -> str:
        """
        Gets a unique concept that hasn't been used before for this category.
        """
        if category == 'rules':
            concepts = [term for term in self.poker_terms['concepts'] 
                    if term in text.lower() and term not in used_concepts]
        elif category == 'strategy':
            concepts = [term for term in self.poker_terms['actions'] 
                    if term in text.lower() and term not in used_concepts]
        elif category == 'player_advice':
            concepts = [term for term in self.poker_terms['player_types'] 
                    if term in text.lower() and term not in used_concepts]
        elif category == 'hand_analysis':
            concepts = [term for term in self.poker_terms['hand_types'] 
                    if term in text.lower() and term not in used_concepts]
        else:
            return None
            
        if concepts:
            concept = random.choice(concepts)
            used_concepts.add(concept)
            return concept
        return None

    
    def process_all_transcripts(self, csv_path: str, output_path: str) -> List[Dict]:
        """
        Processes all transcripts from a CSV file and saves the generated prompt-completion pairs to a JSONL file.
        
        Parameters:
            csv_path (str): Path to the raw CSV file containing transcripts.
            output_path (str): Path to save the processed JSONL file.
        
        Returns:
            List[Dict]: List of final prompt-completion pairs.
        """
        # Load and clean the raw CSV
        df = self.process_raw_csv(csv_path)
        
        all_pairs = []
        seen_completions = set()
        
        logging.info(f"Processing {len(df)} transcripts...")
        
        for idx, row in df.iterrows():
            try:
                # Process individual transcript
                pairs = self.process_transcript(row['Transcript'])
                    
                # Filter and deduplicate
                for pair in pairs:
                    completion = pair['completion'].strip()
                    if completion and completion not in seen_completions:
                        all_pairs.append(pair)
                        seen_completions.add(completion)
                            
            except Exception as e:
                logging.error(f"Error processing video {row.get('VideoID', 'Unknown')}: {e}")
                continue
                    
            # Progress update every 10 transcripts
            if (idx + 1) % 10 == 0:
                logging.info(f"Processed {idx + 1} transcripts...")
        
        # Post-process all pairs for final filtering
        final_pairs = self.post_process_pairs(all_pairs)
        
        # Save the final pairs to a JSONL file in append mode
        if final_pairs:
            try:
                with open(output_path, 'a') as f:  # Changed 'w' to 'a' for append mode
                    for pair in final_pairs:
                        json.dump(pair, f)
                        f.write('\n')
                logging.info(f"Saved {len(final_pairs)} training pairs to {output_path}")
            except Exception as e:
                logging.error(f"Error saving to {output_path}: {e}")
        
        return final_pairs

### Initialize the Processor and Define Transcript Files

In [48]:
# Initialize the PokerTranscriptProcessor
processor = PokerTranscriptProcessor()

# Define the list of transcript CSV files to process
transcript_files = [
    '../data/raw/transcript1.csv',
    '../data/raw/transcript2.csv',
    '../data/raw/transcript3.csv',
    '../data/raw/transcript4.csv'
]

### Process All Transcripts and Generate Training Pairs

In [49]:
# Clear the output file before processing
with open('../data/processed/transcript_poker_training.jsonl', 'w') as f:
    pass

# Initialize a list to hold all prompt-completion pairs
all_pairs = []

# Iterate over each transcript file and process
for csv_path in transcript_files:
    try:
        print(f"\nProcessing {csv_path}...")
        # Process all transcripts in the current CSV file
        pairs = processor.process_all_transcripts(
            csv_path, 
            '../data/processed/transcript_poker_training.jsonl'
        )
        all_pairs.extend(pairs)
    except Exception as e:
        print(f"Error processing file {csv_path}: {e}")
        continue


2024-12-15 17:36:54,034 - INFO - Processing 100 transcripts...



Processing ../data/raw/transcript1.csv...


2024-12-15 17:36:55,758 - INFO - Processed 10 transcripts...
2024-12-15 17:36:57,520 - INFO - Processed 20 transcripts...
2024-12-15 17:36:58,974 - INFO - Processed 30 transcripts...
2024-12-15 17:37:00,192 - INFO - Processed 40 transcripts...
2024-12-15 17:37:01,322 - ERROR - Error processing video l_tBnX8tNp8: name 'concept' is not defined
2024-12-15 17:37:02,173 - INFO - Processed 50 transcripts...
2024-12-15 17:37:03,276 - INFO - Processed 60 transcripts...
2024-12-15 17:37:04,384 - INFO - Processed 70 transcripts...
2024-12-15 17:37:06,318 - INFO - Processed 80 transcripts...
2024-12-15 17:37:07,316 - ERROR - Error processing video ibkt_SLQe4E: name 'concept' is not defined
2024-12-15 17:37:07,660 - INFO - Processed 90 transcripts...
2024-12-15 17:37:09,504 - INFO - Processed 100 transcripts...
2024-12-15 17:37:09,542 - INFO - Processing 188 transcripts...



Processing ../data/raw/transcript2.csv...


2024-12-15 17:37:11,179 - INFO - Processed 10 transcripts...
2024-12-15 17:37:12,719 - INFO - Processed 20 transcripts...
2024-12-15 17:37:14,060 - INFO - Processed 30 transcripts...
2024-12-15 17:37:15,057 - INFO - Processed 40 transcripts...
2024-12-15 17:37:15,967 - ERROR - Error processing video l_tBnX8tNp8: name 'concept' is not defined
2024-12-15 17:37:16,749 - INFO - Processed 50 transcripts...
2024-12-15 17:37:17,906 - INFO - Processed 60 transcripts...
2024-12-15 17:37:18,996 - INFO - Processed 70 transcripts...
2024-12-15 17:37:20,971 - INFO - Processed 80 transcripts...
2024-12-15 17:37:22,019 - ERROR - Error processing video ibkt_SLQe4E: name 'concept' is not defined
2024-12-15 17:37:22,353 - INFO - Processed 90 transcripts...
2024-12-15 17:37:24,251 - INFO - Processed 100 transcripts...
2024-12-15 17:37:24,754 - ERROR - Error processing video _Bu-c_S8d-8: name 'concept' is not defined
2024-12-15 17:37:25,592 - INFO - Processed 110 transcripts...
2024-12-15 17:37:29,546 - I


Processing ../data/raw/transcript3.csv...


2024-12-15 17:37:43,058 - INFO - Processed 10 transcripts...
2024-12-15 17:37:44,636 - INFO - Processed 20 transcripts...
2024-12-15 17:37:45,784 - INFO - Processed 30 transcripts...
2024-12-15 17:37:46,823 - INFO - Processed 40 transcripts...
2024-12-15 17:37:47,816 - ERROR - Error processing video l_tBnX8tNp8: name 'concept' is not defined
2024-12-15 17:37:48,783 - INFO - Processed 50 transcripts...
2024-12-15 17:37:49,951 - INFO - Processed 60 transcripts...
2024-12-15 17:37:51,131 - INFO - Processed 70 transcripts...
2024-12-15 17:37:53,197 - INFO - Processed 80 transcripts...
2024-12-15 17:37:54,208 - ERROR - Error processing video ibkt_SLQe4E: name 'concept' is not defined
2024-12-15 17:37:54,536 - INFO - Processed 90 transcripts...
2024-12-15 17:37:56,434 - INFO - Processed 100 transcripts...
2024-12-15 17:37:56,465 - INFO - Processing 150 transcripts...



Processing ../data/raw/transcript4.csv...


2024-12-15 17:37:58,115 - INFO - Processed 10 transcripts...


### Display Example Training Pairs

In [37]:
# Print the total number of training pairs processed
print(f"\nProcessed {len(all_pairs)} training pairs total")

# Display the first 5 example training pairs
print("\nExample training pairs:")
for pair in all_pairs[:5]:
    print("\nPrompt:", pair['prompt'])
    print("Completion:", pair['completion'])



Processed 56 training pairs total

Example training pairs:

Prompt: What is Pot odds in poker?
Completion: In poker, pot odds is Play to the math pot odds equities counting combos in real time to bet sizing for both preflop and post flop play.

Prompt: What is Range in poker?
Completion: In poker, range is This makes intuitive sense, a player cannot have a hand on the river that he cannot also have preflop second hand that's eliminated from your opponent's range is gone forever.

Prompt: What is Range in poker?
Completion: In poker, range is For example, our opponent called a $10 Raise preflop in the cut off right there, we can narrow down his range quite substantially eliminating hands on both ends of the spectr.

Prompt: What is Range in poker?
Completion: In poker, range is In short, the hands we include in his flop range must also be in his preflop range.

Prompt: What is Range in poker?
Completion: In poker, range is Before we jp in, you'll remember from the principles of hand ra