# Cleaning Youtube Data

Notebook summary:  

To process raw poker video transcripts, extract meaningful content related to poker rules, strategies, player advice, and hand analyses, and generate structured prompt-completion pairs suitable for fine-tuning a language model.

### Imports 

In [14]:
import pandas as pd
import json
import re
import spacy
import logging
from typing import List, Dict
from collections import defaultdict
import random

In [15]:
def inspect_transcript(csv_path: str):
    """Debug function to inspect raw transcript format"""
    df = pd.read_csv(csv_path)
    print("Sample transcript raw data:")
    print(df['Transcript'].iloc[0][:200])  # Print first 200 chars of first transcript
    return df

inspect_transcript('../data/raw/transcript1.csv')

Sample transcript raw data:
"[{""text"": ""hi guys and welcome to how to deal poker"", ""start"": 0.0, ""duration"": 5.21}, {""text"": ""like a professional step one is the wash"", ""start"": 1.77, ""duration"": 6.029}, {""text"


Unnamed: 0,VideoID,Title,Transcript,URL
0,nWYSfXreH6M,How To Deal Texas Holdem Poker Professionally ...,"""[{""""text"""": """"hi guys and welcome to how to d...",https://www.youtube.com/watch?v=nWYSfXreH6M
1,S_h9EEzBoYU,"""Risking $10,000 For The Ultimate Texas Hold E...","""[{""""text"""": """"welcome everybody we're here at...",https://www.youtube.com/watch?v=S_h9EEzBoYU
2,B1JjmsekmqI,Mastering The Fundamentals: Postflop Strategy,"""[{""""text"""": """"let's discuss post-flop strateg...",https://www.youtube.com/watch?v=B1JjmsekmqI
3,OHFDITnIo7Y,#1 PLO Tip for No Limit Holdem Players (NLH ▶️...,"""[{""""text"""": """"my number one PLO tip for Nolan...",https://www.youtube.com/watch?v=OHFDITnIo7Y
4,noKz7aCfwNM,9 TEXAS HOLD&#39;EM Poker Tips For Beginners (...,"""[{""""text"""": """"what's going on guys Nathan her...",https://www.youtube.com/watch?v=noKz7aCfwNM
...,...,...,...,...
95,y2W8-AhrmMU,How to Deal Poker Like a Pro in 7 Easy Steps,"""[{""""text"""": """"let's get it on welcome to card...",https://www.youtube.com/watch?v=y2W8-AhrmMU
96,Eimf0ExhSjM,5 Step Guide to CRUSH Low Stakes Poker (2024),"""[{""""text"""": """"here's the Ultimate five-step G...",https://www.youtube.com/watch?v=Eimf0ExhSjM
97,pSRGErzzIo4,How to Play Poker for Beginners | PokerStars L...,"""[{""""text"""": """"Hello, welcome to PokerStars Le...",https://www.youtube.com/watch?v=pSRGErzzIo4
98,su61vuClbBI,3-Betting in Poker: A Complete Guide for Begin...,"""[{""""text"""": """"have you ever heard the term th...",https://www.youtube.com/watch?v=su61vuClbBI


In [16]:
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("processing.log"),
        logging.StreamHandler()
    ]
)

### Load spaCy Model

In [17]:
# Load spaCy English model for NLP tasks
nlp = spacy.load("en_core_web_sm")

### Define the PokerTranscriptProcessor Class

In [18]:
# Import necessary libraries
import pandas as pd
import json
import re
import spacy
import logging
from typing import List, Dict
from collections import defaultdict
import random

# Configure logging
logging.basicConfig(
    level=logging.INFO,  # Set to INFO to reduce verbosity
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("processing.log"),
        logging.StreamHandler()
    ]
)

# Load spaCy English model for NLP tasks
nlp = spacy.load("en_core_web_sm")

class PokerTranscriptProcessor:
    def __init__(self):
        """
        Initializes the PokerTranscriptProcessor with poker-related terminologies and templates.
        """
        # Comprehensive poker terminology categorized into dictionaries
        self.poker_terms = {
            'game_stages': ['preflop', 'flop', 'turn', 'river', 'showdown'],
            'actions': ['fold', 'call', 'raise', 'bet', 'check', 'all-in', 'bluff', 'limp'],
            'positions': ['button', 'cutoff', 'hijack', 'early', 'middle', 'late', 'blinds', 'utg'],
            'player_types': ['tight', 'loose', 'aggressive', 'passive', 'fish', 'shark', 'nit', 'maniac'],
            'hand_types': ['pocket', 'suited', 'offsuit', 'connector', 'broadway', 'ace-high'],
            'concepts': ['pot odds', 'equity', 'range', 'value bet', 'bluff', 'semi-bluff', 'implied odds']
        }
        
        # Templates for generating prompt-completion pairs
        self.templates = {
            'rules': [
                "What is {concept} in poker?",
                "Can you explain {concept} in Texas Hold'em?",
                "Define {concept} within the context of poker."
            ],
            'strategy': [
                "What's the optimal strategy when {situation}?",
                "How should I approach {situation}?",
                "What's the best strategy if {situation}?"
            ],
            'player_advice': [
                "How should I play against {player_type} players?",
                "What's the best approach versus {player_type} opponents?",
                "How can I exploit {player_type} players?"
            ],
            'hand_analysis': [
                "How should I play {hand} in this situation?",
                "What's the correct play with {hand} when {situation}?",
                "With {hand}, what's the optimal move in this scenario?"
            ]
        }
        
        # Keywords indicating instructional content
        self.instruction_markers = [
            'should', 'must', 'need to', 'optimal', 'better', 'strategy',
            'recommend', 'important', 'key', 'best', 'avoid', 'always', 'never'
        ]
    
    def process_raw_csv(self, file_path: str) -> pd.DataFrame:
        """
        Loads and cleans the raw CSV file containing transcripts.
        
        Parameters:
            file_path (str): Path to the raw CSV file.
        
        Returns:
            pd.DataFrame: Cleaned DataFrame containing transcript data.
        """
        try:
            df = pd.read_csv(file_path)
            df.columns = df.columns.str.strip()
            logging.info(f"Successfully loaded CSV: {file_path}")
            return df
        except Exception as e:
            logging.error(f"Error processing CSV {file_path}: {e}")
            return pd.DataFrame()
    
    def clean_transcript(self, raw_transcript: str) -> List[str]:
        """
        Cleans and parses the raw transcript from CSV format containing double-quoted JSON.
        
        Parameters:
            raw_transcript (str): Raw transcript string from the CSV.
        
        Returns:
            List[str]: List of cleaned text segments extracted from the transcript.
        """
        try:
            # Remove outer quotes and normalize inner quotes
            cleaned = raw_transcript.strip('"')
            cleaned = cleaned.replace('""', '"')
            
            # Parse JSON content
            transcript_data = json.loads(cleaned)
            
            # Extract text segments
            segments = [segment['text'] for segment in transcript_data if 'text' in segment]
            logging.debug("Successfully cleaned transcript.")
            return segments
        except Exception as e:
            logging.error(f"Error cleaning transcript: {e}")
            logging.debug(f"Problematic transcript part: {raw_transcript[:100]}")
            return []
    
    def combine_segments(self, segments: List[str]) -> str:
        """
        Combines individual transcript segments into a coherent block of text.
        
        Parameters:
            segments (List[str]): List of text segments from the transcript.
        
        Returns:
            str: Combined and cleaned text.
        """
        text = ' '.join(segments)
        text = re.sub(r'\[Music\]|\[Applause\]', '', text)  # Remove non-verbal cues
        text = re.sub(r'\s+', ' ', text)  # Normalize whitespace
        return text.strip()
    
    def extract_poker_content(self, text: str) -> Dict[str, List[str]]:
        """
        Extracts and categorizes poker-related content from the combined transcript text.
        
        Parameters:
            text (str): Combined transcript text.
        
        Returns:
            Dict[str, List[str]]: Dictionary categorizing extracted content into 'rules', 'strategy', 'player_advice', and 'hand_analysis'.
        """
        doc = nlp(text)
        content = {
            'rules': [],
            'strategy': [],
            'player_advice': [],
            'hand_analysis': []
        }
        
        # Process each sentence in the transcript
        for sent in doc.sents:
            sent_text = sent.text.strip()
            
            # Basic filtering based on sentence validity
            if not self._is_valid_sentence(sent_text):
                continue
                
            # Check for instructional content indicators
            if not self._has_instructional_content(sent):
                continue
                
            # Categorize the sentence
            category = self._categorize_sentence(sent)
            if category:
                # Clean and rewrite the sentence for coherence
                cleaned = self._clean_and_rewrite_sentence(sent_text, category)
                if cleaned:
                    content[category].append(cleaned)
        
        return content
    
    def create_training_pairs(self, content: Dict[str, List[str]]) -> List[Dict[str, str]]:
        """
        Creates prompt-completion pairs from the categorized poker content.
        
        Parameters:
            content (Dict[str, List[str]]): Categorized poker content.
        
        Returns:
            List[Dict[str, str]]: List of prompt-completion dictionaries.
        """
        pairs = []
        seen_completions = set()
        used_concepts = set()
        
        for category, items in content.items():
            if category in self.templates:
                templates = self.templates[category]
                
                for item in items:
                    if not self._is_meaningful_content(item):
                        continue
                    
                    concept = self._get_unique_concept(item, category, used_concepts)
                    if not concept:
                        continue
                        
                    template = random.choice(templates)
                    prompt = self._create_prompt(template, category, concept)
                    if not prompt:
                        continue  # Skip if prompt creation failed
                    completion = self._create_completion(item, category)
                    
                    if (completion and 
                        completion not in seen_completions and 
                        self._is_coherent_completion(prompt, completion)):
                        
                        pairs.append({
                            "prompt": prompt,
                            "completion": completion
                        })
                        seen_completions.add(completion)
        
        return pairs
    
    def _is_meaningful_content(self, text: str) -> bool:
        """
        Checks if the content is meaningful by filtering out clichés and ensuring sentence completeness.
        
        Parameters:
            text (str): The sentence to evaluate.
        
        Returns:
            bool: True if the content is meaningful, False otherwise.
        """
        # Must contain poker terms
        has_poker_terms = any(term in text.lower() for terms in self.poker_terms.values() for term in terms)
                
        # Relaxed sentence structure check
        is_proper_sentence = text[0].isupper() and text[-1] in '.!?'
                
        # Relaxed length check
        good_length = 5 <= len(text.split()) <= 100
                
        return has_poker_terms and is_proper_sentence and good_length
    
    def _get_unique_concept(self, text: str, category: str, used_concepts: set) -> str:
        """
        Gets a unique concept that hasn't been used before for this category.
        
        Parameters:
            text (str): The sentence text.
            category (str): The category of the content.
            used_concepts (set): Set of already used concepts.
        
        Returns:
            str: A unique concept or None if none available.
        """
        if category == 'rules':
            concepts = [term for term in self.poker_terms['concepts'] 
                    if term in text.lower() and term not in used_concepts]
        elif category == 'strategy':
            concepts = [term for term in self.poker_terms['actions'] 
                    if term in text.lower() and term not in used_concepts]
        elif category == 'player_advice':
            concepts = [term for term in self.poker_terms['player_types'] 
                    if term in text.lower() and term not in used_concepts]
        elif category == 'hand_analysis':
            concepts = [term for term in self.poker_terms['hand_types'] 
                    if term in text.lower() and term not in used_concepts]
        else:
            return None
            
        if concepts:
            concept = random.choice(concepts)
            used_concepts.add(concept)
            return concept
        return None
    
    def _create_prompt(self, template: str, category: str, concept: str) -> str:
        """
        Creates a prompt by filling in the template with relevant key phrases.
        
        Parameters:
            template (str): The template string with placeholders.
            category (str): The category of the content.
            concept (str): The concept to fill into the template.
        
        Returns:
            str: The filled-in prompt string.
        """
        # Fill in the template with the appropriate concept
        prompt = template.format(
            concept=concept if '{concept}' in template else '',
            situation=concept if '{situation}' in template else '',
            player_type=concept if '{player_type}' in template else '',
            hand=concept if '{hand}' in template else ''
        ).strip()
        
        # Ensure the prompt is properly formatted
        if not prompt.endswith('?'):
            prompt += '?'
        
        return prompt
    
    def _create_completion(self, text: str, category: str) -> str:
        """
        Creates a formal, neutral completion from the rewritten text.
        
        Parameters:
            text (str): The rewritten sentence.
            category (str): The category of the content.
        
        Returns:
            str: The formatted completion string.
        """
        
        text = text.strip()
        if not text:
            return None

        # Remove any existing prefixes we might add
        text = re.sub(r'^(In poker,|The optimal strategy is to|When facing|With)\s*', '', text, flags=re.IGNORECASE)
        
        if category == 'rules':
            completion = f"In poker, {text}"
        elif category == 'strategy':
            completion = f"The optimal strategy is to {text.lower()}"
        elif category == 'player_advice':
            completion = f"When facing {text.lower()}"
        elif category == 'hand_analysis':
            completion = f"With {text.lower()}"
        else:
            completion = text
        
        # Ensure proper sentence structure
        completion = completion.strip()
        if not completion.endswith('.'):
            completion += '.'
        
        # Ensure first letter is capitalized
        completion = completion[0].upper() + completion[1:]
        
        return completion
    
    def _is_coherent_completion(self, prompt: str, completion: str) -> bool:
        """
        Checks if the completion is coherent and directly answers the prompt.
        
        Parameters:
            prompt (str): The prompt string.
            completion (str): The completion string.
        
        Returns:
            bool: True if coherent, False otherwise.
        """
        # Simple heuristic: Check if key terms from the prompt appear in the completion
        prompt_terms = set(re.findall(r'\w+', prompt.lower()))
        completion_terms = set(re.findall(r'\w+', completion.lower()))
        
        common_terms = prompt_terms.intersection(completion_terms)
        return len(common_terms) > 2  # Threshold can be adjusted
    
    def _is_valid_sentence(self, text: str) -> bool:
        """
        Comprehensive validation of a sentence to determine its suitability for inclusion.
        
        Parameters:
            text (str): The sentence to evaluate.
        
        Returns:
            bool: True if the sentence passes all validation checks, False otherwise.
        """
        # Must start with a capital letter and end with proper punctuation
        if not (text and text[0].isupper() and text[-1] in '.!?'):
            return False
                
        # Must not start with conjunctions or filler words
        if any(text.startswith(word) for word in ['But', 'And', 'However', 'So', 'Yeah', 'Um']):
            return False
            
        # Must contain poker-related terms
        if not any(term in text.lower() for terms in self.poker_terms.values() for term in terms):
            return False
                
        return True
    
    def _has_instructional_content(self, sent) -> bool:
        """
        Determines if a sentence contains instructional content based on markers and sentence structure.
        
        Parameters:
            sent (spacy.tokens.Span): The sentence to evaluate.
        
        Returns:
            bool: True if instructional, False otherwise.
        """
        text = sent.text.lower()
        
        # Check for instructional markers
        has_marker = any(marker in text for marker in self.instruction_markers)
        
        # Check for modal verbs indicating advice
        has_modal = any(token.lemma_ in ['should', 'must', 'can', 'need'] for token in sent)
        
        # Check for imperative mood
        starts_with_verb = sent[0].pos_ == 'VERB'
        
        return has_marker or has_modal or starts_with_verb
    
    def _categorize_sentence(self, sent) -> str:
        """
        Categorizes a sentence into one of the predefined categories based on its content.
        
        Parameters:
            sent (spacy.tokens.Span): The sentence to categorize.
        
        Returns:
            str: The category of the sentence ('rules', 'strategy', 'player_advice', 'hand_analysis') or None if uncategorized.
        """
        text = sent.text.lower()
        
        # Prioritize categorization based on specificity
        if any(term in text for term in self.poker_terms['hand_types']):
            return 'hand_analysis'
        elif any(term in text for term in self.poker_terms['player_types']):
            return 'player_advice'
        elif any(term in text for term in self.poker_terms['game_stages']):
            return 'rules'
        elif any(term in text for term in self.poker_terms['actions']):
            return 'strategy'
        
        return None
    
    def _clean_and_rewrite_sentence(self, text: str, category: str) -> str:
        """
        Cleans and rewrites a sentence to ensure coherence and relevance to the category.
        
        Parameters:
            text (str): The original sentence.
            category (str): The category of the sentence.
        
        Returns:
            str: The cleaned and rewritten sentence, or None if rewriting fails.
        """
        # Remove filler words and normalize
        text = self._clean_sentence(text)
        
        # Rewriting based on category
        if category == 'rules':
            return self._rewrite_as_rule(text)
        elif category == 'strategy':
            return self._rewrite_as_strategy(text)
        elif category == 'player_advice':
            return self._rewrite_as_player_advice(text)
        elif category == 'hand_analysis':
            return self._rewrite_as_hand_analysis(text)
        
        return None
    
    def _rewrite_as_rule(self, text: str) -> str:
        """
        Rewrites a sentence as a clear rule definition.
        
        Parameters:
            text (str): The original sentence.
        
        Returns:
            str: Rewritten rule definition.
        """
        # Attempt to extract the rule directly
        return text
    
    def _rewrite_as_strategy(self, text: str) -> str:
        """
        Rewrites a sentence as clear strategic advice.
        
        Parameters:
            text (str): The original sentence.
        
        Returns:
            str: Rewritten strategic advice.
        """
        # Attempt to extract the strategy directly
        return text
    
    def _rewrite_as_player_advice(self, text: str) -> str:
        """
        Rewrites a sentence as clear advice for playing against specific player types.
        
        Parameters:
            text (str): The original sentence.
        
        Returns:
            str: Rewritten player advice.
        """
        # Attempt to extract the player advice directly
        return text
    
    def _rewrite_as_hand_analysis(self, text: str) -> str:
        """
        Rewrites a sentence as clear hand analysis advice.
        
        Parameters:
            text (str): The original sentence.
        
        Returns:
            str: Rewritten hand analysis advice.
        """
        # Attempt to extract the hand analysis directly
        return text
    
    def _has_repetition(self, text: str) -> bool:
        """
        Detects if the text contains repetitive phrases.
        
        Parameters:
            text (str): The text to evaluate.
        
        Returns:
            bool: True if repetition is detected, False otherwise.
        """
        words = text.lower().split()
        for i in range(len(words)-1):
            if words[i] == words[i+1]:
                return True
        return False
        
    def _is_matching_pair(self, prompt: str, completion: str) -> bool:
        """
        Checks if the prompt and completion pair matches in terms of category and content.
        
        Parameters:
            prompt (str): The prompt string.
            completion (str): The completion string.
        
        Returns:
            bool: True if the pair matches appropriately, False otherwise.
        """
        # Check if rule prompt has rule completion
        if 'what is' in prompt.lower() and not completion.startswith('In poker'):
            return False
                
        # Check if strategy prompt has strategy completion
        if 'strategy' in prompt.lower() and not completion.startswith('The optimal strategy'):
            return False
                
        # Check if player advice prompt matches
        if 'against' in prompt.lower() and not completion.startswith('When facing'):
            return False
                
        # Check if hand analysis prompt matches
        if 'play' in prompt.lower() and not completion.startswith('With'):
            return False
                
        return True
    
    def _clean_sentence(self, text: str) -> str:
        """
        Cleans and normalizes sentence text by removing filler words, timestamps, brackets, and normalizing poker terms.
        
        Parameters:
            text (str): The sentence to clean.
        
        Returns:
            str: Cleaned and normalized sentence.
        """
        # Remove specific video references and fillers
        text = re.sub(r'like and subscribe|check out|visit|website|um|uh|yeah|you know|laughter|subscribe', '', text, flags=re.IGNORECASE)
        
        # Remove timestamps and brackets
        text = re.sub(r'\[\w+\]|\(\d+:\d+\)', '', text)
        
        # Normalize poker terms
        text = re.sub(r'hold\s*em', "Hold'em", text, flags=re.IGNORECASE)
        
        # Clean up whitespace
        text = ' '.join(text.split())
        
        # Ensure proper sentence termination
        text = text.strip()
        if text and not text[-1] in '.!?':
            text += '.'
            
        return text

    def _is_valid_completion(self, completion: str) -> bool:
        """
        Validates that a completion is well-formed and meaningful.
        
        Parameters:
            completion (str): The completion text to validate.
        
        Returns:
            bool: True if the completion is valid, False otherwise.
        """
        if not completion:
            return False
            
        # Must be a complete sentence
        if not (completion[0].isupper() and completion[-1] in '.!?'):
            return False
            
        # Must contain meaningful content (at least 5 words)
        words = completion.split()
        if len(words) < 5:
            return False
            
        # Must contain poker-related terms
        has_poker_terms = any(term in completion.lower() 
                            for terms in self.poker_terms.values() 
                            for term in terms)
        if not has_poker_terms:
            return False
            
        return True

    def process_transcript(self, transcript_text: str) -> List[Dict[str, str]]:
        """
        Processes a single transcript to generate prompt-completion pairs.
        """
        segments = self.clean_transcript(transcript_text)
        if not segments:
            return []
                
        full_text = self.combine_segments(segments)
        content = self.extract_poker_content(full_text)
        
        pairs = []
        for category, items in content.items():
            if category in self.templates:
                template = self.templates[category][0]  # Use first template
                
                for item in items:
                    # Create prompt
                    prompt = template.format(
                        concept=item,
                        situation=item,
                        player_type=item,
                        hand=item
                    )
                    
                    # Format completion based on category
                    if category == 'rules':
                        completion = f"In poker, {item}"
                    elif category == 'strategy':
                        completion = f"The optimal strategy is to {item}"
                    elif category == 'player_advice':
                        completion = f"When facing {item}"
                    elif category == 'hand_analysis':
                        completion = f"With {item}"
                    else:
                        completion = item
                    
                    # Ensure proper sentence ending
                    if completion and not completion.endswith('.'):
                        completion += '.'
                    
                    # Validate completion
                    if completion and self._is_valid_completion(completion):
                        pairs.append({
                            "prompt": prompt,
                            "completion": completion
                        })
        
        return pairs
    
    def post_process_pairs(self, pairs: List[Dict[str, str]]) -> List[Dict[str, str]]:
        """
        Post-processes and filters the list of training pairs to ensure quality and uniqueness.
        
        Parameters:
            pairs (List[Dict[str, str]]): List of generated prompt-completion pairs.
        
        Returns:
            List[Dict[str, str]]: Filtered list of high-quality prompt-completion pairs.
        """
        processed_pairs = []
        seen_completions = set()
        
        for pair in pairs:
            prompt, completion = pair['prompt'], pair['completion']
                
            # Skip if either is empty or too short
            if not prompt or not completion:
                continue
                    
            # Skip if completion is duplicate
            if completion in seen_completions:
                continue
                    
            # Skip if completion doesn't match prompt category
            if not self._is_matching_pair(prompt, completion):
                continue
                    
            # Add to processed pairs
            processed_pairs.append({
                'prompt': prompt,
                'completion': completion
            })
            seen_completions.add(completion)
            
        return processed_pairs
    
    def process_all_transcripts(self, csv_path: str, output_path: str) -> List[Dict]:
        """
        Processes all transcripts from a CSV file and saves the generated prompt-completion pairs to a JSONL file.
        
        Parameters:
            csv_path (str): Path to the raw CSV file containing transcripts.
            output_path (str): Path to save the processed JSONL file.
        
        Returns:
            List[Dict]: List of final prompt-completion pairs.
        """
        # Load and clean the raw CSV
        df = self.process_raw_csv(csv_path)
        logging.info(f"Processing {len(df)} transcripts from {csv_path}...")
        
        all_pairs = []
        seen_completions = set()
        
        for idx, row in df.iterrows():
            try:
                # Process individual transcript
                pairs = self.process_transcript(row['Transcript'])
                    
                # Filter and deduplicate
                for pair in pairs:
                    completion = pair['completion'].strip()
                    if completion and completion not in seen_completions:
                        all_pairs.append(pair)
                        seen_completions.add(completion)
                            
            except Exception as e:
                logging.error(f"Error processing video {row.get('VideoID', 'Unknown')}: {e}")
                continue
                    
            # Progress update every 10 transcripts
            if (idx + 1) % 10 == 0:
                logging.info(f"Processed {idx + 1} transcripts...")
        
        # Post-process all pairs for final filtering
        final_pairs = self.post_process_pairs(all_pairs)
        logging.info(f"Post-processed {len(final_pairs)} unique training pairs.")
        
        # Save the final pairs to a JSONL file in append mode
        if final_pairs:
            try:
                with open(output_path, 'a') as f:  # Changed 'w' to 'a' for append mode
                    for pair in final_pairs:
                        json.dump(pair, f)
                        f.write('\n')
                logging.info(f"Saved {len(final_pairs)} training pairs to {output_path}")
            except Exception as e:
                logging.error(f"Error saving to {output_path}: {e}")
        
        return final_pairs


### Initialize the Processor and Define Transcript Files

In [19]:
# Initialize the PokerTranscriptProcessor
processor = PokerTranscriptProcessor()

# Define the list of transcript CSV files to process
transcript_files = [
    '../data/raw/transcript1.csv',
    '../data/raw/transcript2.csv',
    '../data/raw/transcript3.csv',
    '../data/raw/transcript4.csv'
]

### Process All Transcripts and Generate Training Pairs

In [20]:
# Clear the output file before processing
with open('../data/processed/transcript_poker_training.jsonl', 'w') as f:
    pass

# Initialize a list to hold all prompt-completion pairs
all_pairs = []

# Iterate over each transcript file and process
for csv_path in transcript_files:
    try:
        print(f"\nProcessing {csv_path}...")
        # Process all transcripts in the current CSV file
        pairs = processor.process_all_transcripts(
            csv_path, 
            '../data/processed/transcript_poker_training.jsonl'
        )
        all_pairs.extend(pairs)
    except Exception as e:
        print(f"Error processing file {csv_path}: {e}")
        continue


2024-12-15 21:28:52,922 - INFO - Successfully loaded CSV: ../data/raw/transcript1.csv
2024-12-15 21:28:52,922 - INFO - Processing 100 transcripts from ../data/raw/transcript1.csv...



Processing ../data/raw/transcript1.csv...


2024-12-15 21:28:54,697 - INFO - Processed 10 transcripts...
2024-12-15 21:28:56,413 - INFO - Processed 20 transcripts...
2024-12-15 21:28:57,512 - INFO - Processed 30 transcripts...
2024-12-15 21:28:58,618 - INFO - Processed 40 transcripts...
2024-12-15 21:28:59,662 - INFO - Processed 50 transcripts...
2024-12-15 21:29:00,769 - INFO - Processed 60 transcripts...
2024-12-15 21:29:01,932 - INFO - Processed 70 transcripts...
2024-12-15 21:29:03,623 - INFO - Processed 80 transcripts...
2024-12-15 21:29:04,521 - INFO - Processed 90 transcripts...
2024-12-15 21:29:06,105 - INFO - Processed 100 transcripts...
2024-12-15 21:29:06,106 - INFO - Post-processed 56 unique training pairs.
2024-12-15 21:29:06,107 - INFO - Saved 56 training pairs to ../data/processed/transcript_poker_training.jsonl
2024-12-15 21:29:06,148 - INFO - Successfully loaded CSV: ../data/raw/transcript2.csv
2024-12-15 21:29:06,148 - INFO - Processing 188 transcripts from ../data/raw/transcript2.csv...



Processing ../data/raw/transcript2.csv...


2024-12-15 21:29:07,661 - INFO - Processed 10 transcripts...
2024-12-15 21:29:09,236 - INFO - Processed 20 transcripts...
2024-12-15 21:29:10,355 - INFO - Processed 30 transcripts...
2024-12-15 21:29:11,363 - INFO - Processed 40 transcripts...
2024-12-15 21:29:12,377 - INFO - Processed 50 transcripts...
2024-12-15 21:29:13,461 - INFO - Processed 60 transcripts...
2024-12-15 21:29:14,565 - INFO - Processed 70 transcripts...
2024-12-15 21:29:16,103 - INFO - Processed 80 transcripts...
2024-12-15 21:29:16,975 - INFO - Processed 90 transcripts...
2024-12-15 21:29:18,565 - INFO - Processed 100 transcripts...
2024-12-15 21:29:19,365 - INFO - Processed 110 transcripts...
2024-12-15 21:29:22,841 - INFO - Processed 120 transcripts...
2024-12-15 21:29:23,704 - INFO - Processed 130 transcripts...
2024-12-15 21:29:24,662 - INFO - Processed 140 transcripts...
2024-12-15 21:29:25,422 - INFO - Processed 150 transcripts...
2024-12-15 21:29:27,286 - INFO - Processed 160 transcripts...
2024-12-15 21:29:


Processing ../data/raw/transcript3.csv...


2024-12-15 21:29:33,704 - INFO - Processed 10 transcripts...
2024-12-15 21:29:35,236 - INFO - Processed 20 transcripts...
2024-12-15 21:29:36,285 - INFO - Processed 30 transcripts...
2024-12-15 21:29:37,271 - INFO - Processed 40 transcripts...
2024-12-15 21:29:38,259 - INFO - Processed 50 transcripts...
2024-12-15 21:29:39,325 - INFO - Processed 60 transcripts...
2024-12-15 21:29:40,382 - INFO - Processed 70 transcripts...
2024-12-15 21:29:41,902 - INFO - Processed 80 transcripts...
2024-12-15 21:29:42,765 - INFO - Processed 90 transcripts...
2024-12-15 21:29:44,300 - INFO - Processed 100 transcripts...
2024-12-15 21:29:44,301 - INFO - Post-processed 56 unique training pairs.
2024-12-15 21:29:44,302 - INFO - Saved 56 training pairs to ../data/processed/transcript_poker_training.jsonl
2024-12-15 21:29:44,331 - INFO - Successfully loaded CSV: ../data/raw/transcript4.csv
2024-12-15 21:29:44,331 - INFO - Processing 150 transcripts from ../data/raw/transcript4.csv...



Processing ../data/raw/transcript4.csv...


2024-12-15 21:29:45,818 - INFO - Processed 10 transcripts...
2024-12-15 21:29:47,351 - INFO - Processed 20 transcripts...
2024-12-15 21:29:48,416 - INFO - Processed 30 transcripts...
2024-12-15 21:29:49,445 - INFO - Processed 40 transcripts...
2024-12-15 21:29:50,437 - INFO - Processed 50 transcripts...
2024-12-15 21:29:51,506 - INFO - Processed 60 transcripts...
2024-12-15 21:29:52,570 - INFO - Processed 70 transcripts...
2024-12-15 21:29:54,097 - INFO - Processed 80 transcripts...
2024-12-15 21:29:54,959 - INFO - Processed 90 transcripts...
2024-12-15 21:29:56,493 - INFO - Processed 100 transcripts...
2024-12-15 21:29:57,229 - INFO - Processed 110 transcripts...
2024-12-15 21:30:00,524 - INFO - Processed 120 transcripts...
2024-12-15 21:30:01,369 - INFO - Processed 130 transcripts...
2024-12-15 21:30:02,304 - INFO - Processed 140 transcripts...
2024-12-15 21:30:03,039 - INFO - Processed 150 transcripts...
2024-12-15 21:30:03,040 - INFO - Post-processed 96 unique training pairs.
2024-

### Display Example Training Pairs

In [21]:
# Print the total number of training pairs processed
print(f"\nProcessed {len(all_pairs)} training pairs total")

# Display the first 5 example training pairs
print("\nExample training pairs:")
for pair in all_pairs[:5]:
    print("\nPrompt:", pair['prompt'])
    print("Completion:", pair['completion'])



Processed 340 training pairs total

Example training pairs:

Prompt: What's the optimal strategy when They would then wager on whether or not they had the best hand— either adding more money to the pot or folding and forfeiting their chance to win it.?
Completion: The optimal strategy is to They would then wager on whether or not they had the best hand— either adding more money to the pot or folding and forfeiting their chance to win it.

Prompt: What's the optimal strategy when Cold decks were pre-sequenced to give victims powerful hands that encouraged them to bet big, while giving the cheater an even better one.?
Completion: The optimal strategy is to Cold decks were pre-sequenced to give victims powerful hands that encouraged them to bet big, while giving the cheater an even better one.

Prompt: What's the optimal strategy when His work became the foundation for a whole new branch of mathematics called game theory, which has grown to be vitally important not only in high-stakes po