## Off menu easy wins analysis

This notebook is a development environment for the logic that analyses the results of the first sweep. The first sweep targets 'easy wins': partial matches against the full name of the restaurant with scores over 90. The final production code is located in off_menu/data_processing.py.
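To make the 'easy win' criterion concrete, here is a minimal sketch of a partial match scored against a threshold of 90. It is an illustration only, not the production logic in off_menu/data_processing.py: it uses `difflib` from the standard library as a stand-in for the fuzzy-matching library, and the helper names `partial_match_score` and `is_easy_win`, as well as the sample sentences, are hypothetical.

```python
from difflib import SequenceMatcher

EASY_WIN_THRESHOLD = 90  # "scores over 90", as described above


def partial_match_score(name: str, text: str) -> int:
    """Best 0-100 similarity between `name` and any same-length window of `text`.

    Stand-in for a fuzzy partial-ratio score; hypothetical helper.
    """
    name, text = name.lower(), text.lower()
    if not name or not text:
        return 0
    if len(text) <= len(name):
        return round(100 * SequenceMatcher(None, name, text).ratio())
    best = 0.0
    # Slide a window the length of the restaurant name across the text
    for start in range(len(text) - len(name) + 1):
        window = text[start:start + len(name)]
        best = max(best, SequenceMatcher(None, name, window).ratio())
    return round(100 * best)


def is_easy_win(name: str, text: str) -> bool:
    """True when the full restaurant name scores over the threshold."""
    return partial_match_score(name, text) > EASY_WIN_THRESHOLD


# Invented sample sentences, not taken from any transcript
print(is_easy_win("little owl", "we went to little owl for dinner"))
print(is_easy_win("oli babas kerb camden", "hello, listeners of the off menu podcast"))
```

The sliding window is what makes the match 'partial': the restaurant name only has to line up with some stretch of the mention text, not with the whole of it.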

In [None]:
import sys
import os

# Ensure imports can find the project's utils:

# Get the current working directory of the notebook so we can locate the root relative to the notebook
notebook_dir = os.getcwd()
project_root = os.path.abspath(os.path.join(notebook_dir, '..'))

# Insert the project root to the beginning of sys.path
if project_root not in sys.path:
    sys.path.insert(0, project_root)

print(f"Project root added to sys.path: {project_root}")
print(f"Current sys.path: {sys.path}")

# Import pandas (data storage), requests (accessing links), Beautiful Soup (parsing HTML), re (RegEx), List and Dict (type hints)
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re
from typing import List, Dict

# Import libraries for fuzzy matching
from fuzzywuzzy import fuzz, process

from off_menu.utils import try_read_parquet

Project root added to sys.path: c:\Users\jbara\Data science projects (store here not desktop on onedrive)\Off Menu project
Current sys.path: ['c:\\Users\\jbara\\Data science projects (store here not desktop on onedrive)\\Off Menu project', 'C:\\Users\\jbara\\miniconda3\\python312.zip', 'C:\\Users\\jbara\\miniconda3\\DLLs', 'C:\\Users\\jbara\\miniconda3\\Lib', 'C:\\Users\\jbara\\miniconda3', 'c:\\Users\\jbara\\OneDrive\\Desktop\\Data_science\\Python projects\\Off Menu project\\.venv', '', 'c:\\Users\\jbara\\OneDrive\\Desktop\\Data_science\\Python projects\\Off Menu project\\.venv\\Lib\\site-packages', 'c:\\Users\\jbara\\OneDrive\\Desktop\\Data_science\\Python projects\\Off Menu project\\.venv\\Lib\\site-packages\\win32', 'c:\\Users\\jbara\\OneDrive\\Desktop\\Data_science\\Python projects\\Off Menu project\\.venv\\Lib\\site-packages\\win32\\lib', 'c:\\Users\\jbara\\OneDrive\\Desktop\\Data_science\\Python projects\\Off Menu project\\.venv\\Lib\\site-packages\\Pythonwin']


## Examine the number of matches in the easy wins sweep and, where there is no match, whether a missing transcript is the cause

### Load the easy wins dataframe and examine head

In [24]:
easy_wins_path = os.path.join(project_root, 'data', 'processed', 'easy_win_mention_search_df.parquet')
easy_wins_scan_df = try_read_parquet(easy_wins_path)

print(easy_wins_scan_df.head(5))

   Episode ID             Restaurant  \
0           1  oli babas kerb camden   
1           2             little owl   
2           2                 trullo   
3           3              five guys   
4           3             cora pearl   

                                        Mention text  Match Score  \
0                                               None            0   
1  it would be, the side dish would be from littl...          100   
2  and it's the beef shin ragu with probably it's...          100   
3  what if every single thing that i'm going to m...          100   
4  this was a difficult question for me until lit...          100   

       Match Type Timestamp                                  transcript_sample  
0  No match found      None  starting point is 00:00:00 hello, listeners of...  
1   full, over 90  00:34:52  starting point is 00:00:00 hello, listeners of...  
2   full, over 90  00:20:26  starting point is 00:00:00 hello, listeners of...  
3   full, over 90  0

### Explore the number of successful matches and total matches

In [25]:
match_type_counts = easy_wins_scan_df['Match Type'].value_counts()

print("Breakdown of Match Types:")
print(match_type_counts)

no_match_count = match_type_counts.get('No match found', 0)
print(f"\nNumber of 'No match found' entries: {no_match_count}")

Breakdown of Match Types:
Match Type
full, over 90     275
No match found    212
Name: count, dtype: int64

Number of 'No match found' entries: 212
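The raw counts above are easier to compare as percentages. The sketch below rebuilds the label counts from the printed breakdown (275 and 212) and converts them to shares; `match_type_shares` is a hypothetical helper, written with the standard library so it runs without the parquet file. On the real dataframe, `easy_wins_scan_df['Match Type'].value_counts(normalize=True)` should give the same proportions.

```python
from collections import Counter


def match_type_shares(match_types):
    """Return each label's share of the total as a percentage (1 d.p.)."""
    counts = Counter(match_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Labels rebuilt from the counts printed above
labels = ['full, over 90'] * 275 + ['No match found'] * 212
shares = match_type_shares(labels)
print(shares)  # {'full, over 90': 56.5, 'No match found': 43.5}
```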


### Identify how many 'No match found's are due to missing transcripts

In [None]:
# Flag rows with a usable transcript: not null, not empty, and not the placeholder string
easy_wins_scan_df['transcript_available'] = (
    easy_wins_scan_df['transcript_sample'].notna() &  # Check that it's not None/NaN
    (easy_wins_scan_df['transcript_sample'] != '') &  # Check that it's not an empty string
    (easy_wins_scan_df['transcript_sample'] != 'No Transcript Found')  # Check that it's not the 'No Transcript Found' placeholder
)

# Create a cross-tabulation of 'Match Type' and 'transcript_available'
match_breakdown_by_transcript = pd.crosstab(
    easy_wins_scan_df['Match Type'],
    easy_wins_scan_df['transcript_available'],
    margins=True, 
    margins_name="Total"
)

print("\nBreakdown of Match Types by Transcript Availability:")
print(match_breakdown_by_transcript)

# Pull out the number of unmatched rows that have no transcript available
no_match_found_no_transcript_count = match_breakdown_by_transcript.loc[
    'No match found',
    False
]


Breakdown of Match Types by Transcript Availability:
transcript_available  False  True  Total
Match Type                              
No match found           93   119    212
full, over 90             0   275    275
Total                    93   394    487
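A short worked calculation helps when reading the table: it separates unmatched rows that could never have matched (no transcript) from those that had a transcript but still missed. The sketch below hard-codes the cell values from the cross-tabulation above; the `pct` helper and the variable names are hypothetical.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of `whole` made up by `part`, rounded to 1 d.p."""
    return round(100 * part / whole, 1) if whole else 0.0


# Cell values copied from the cross-tabulation above
no_match_no_transcript = 93
no_match_with_transcript = 119
matched_with_transcript = 275

no_match_total = no_match_no_transcript + no_match_with_transcript        # 212
with_transcript_total = no_match_with_transcript + matched_with_transcript  # 394

# Share of 'No match found' rows explained by a missing transcript
share_missing_transcript = pct(no_match_no_transcript, no_match_total)

# Match rate restricted to rows where a transcript was available
match_rate_with_transcript = pct(matched_with_transcript, with_transcript_total)

print(f"No-match rows lacking a transcript: {share_missing_transcript}%")    # 43.9%
print(f"Match rate where a transcript exists: {match_rate_with_transcript}%")  # 69.8%
```

The remaining 119 unmatched rows do have a transcript, so those are the ones a later sweep would need to address.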
