### 1. Check Output Dataset

Load the full dataset: the model outputs on the inference split, plus the labels of the finetuning and validation splits.

In [1]:
import pandas as pd
import json 
import jsonschema
data_path = "../data"

# read in model outputs and join with finetuning and valid labels to get complete dataset

# model outputs
inf_file_name = "inf_llama3_ft_v4_q8_0_llamacpp_guided"
inf_path = f"{data_path}/inference_results/inf_runs/{inf_file_name}.csv"
# finetuning labels
ft_file_name = "FT_transcript_chunks_nvids45968_chunksize2048_overlap50_tokMistral_with_metadata_for_prompt_with_labels"
ft_path = f"{data_path}/transcript_chunks/splits/{ft_file_name}.csv"
# valid labels
val_file_name = "VAL_transcript_chunks_nvids45968_chunksize2048_overlap50_tokMistral_with_metadata_for_prompt_with_labels"
val_path = f"{data_path}/transcript_chunks/splits/{val_file_name}.csv"

# read in data
inf_df = pd.read_csv(inf_path, sep=";")
ft_df = pd.read_csv(ft_path, sep=";")
val_df = pd.read_csv(val_path, sep=";")

### adjust columns

# add split and error? cols
inf_df["split"] = "inf"

ft_df["error?"] = False 
ft_df["split"] = "ft"

val_df["error?"] = False
val_df["split"] = "val"

# rename
inf_df = inf_df.rename(columns={"output": "label"})
# keep only relevant columns
cols = ["video_id", "chunk_number", "label", "error?", "split"]
ft_df = ft_df[cols]
val_df = val_df[cols]

In [2]:
# join
df = pd.concat([inf_df, ft_df, val_df])
print(df.shape)

(80176, 5)


Check for issues.

In [3]:
# show missing labels and/or error rows
problem_rows = df[(df["label"].isna()) | (df["error?"])]
if problem_rows.shape[0] > 0:
    print(problem_rows)
else:
    print("No missing labels and/or errors")

No missing labels and/or errors


In [4]:
from LLM_utils import output_json_schema_string

if False: # runs ~ 2 min
    # parse the schema once instead of once per row
    output_schema = json.loads(output_json_schema_string)

    def matches_schema(string, schema):
        """Return True if string is valid JSON that conforms to schema."""
        try:
            jsonschema.validate(json.loads(string), schema)
            return True
        except (json.JSONDecodeError, jsonschema.ValidationError):
            # unparsable JSON counts as a mismatch instead of raising
            return False

    # check model outputs against schema defined in LLM_utils
    schema_match = df["label"].apply(lambda x: matches_schema(x, output_schema))
    if not schema_match.all():
        print("Some outputs do not match schema")
        print(schema_match.value_counts())
    else:
        print("All outputs are valid JSON and match the schema.")


In [5]:
# total number of extracted assets
n_assets = df["label"].apply(lambda x: len(json.loads(x))).sum()
print(f"Total number of extractions: {n_assets}")

Total number of extractions: 74551


### 2. Asset Name Matching

Using the matching processes defined in ``name_matching_utils.py``, we try to match the appropriate ticker to every ``asset_name`` in the outputs. We add the entire ``match_info`` dict returned by the matching function to every extracted asset as a fourth key, since it may be useful for computing matching statistics later.

Note that because the candidate dicts are pre-processed, the matching functions are fast enough that we can simply iterate over every extracted asset and match it directly. Matching is also deterministic. If it weren't, or if efficiency were a bigger concern, we could first collect the set of unique extracted names and match only those, avoiding repeated work.
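
The unique-names alternative mentioned above could look roughly like the sketch below. It is not what this notebook does. ``match_unique_names`` and ``toy_match`` are hypothetical names, and ``toy_match`` with its lookup table only stands in for the real matching functions. The cache is keyed on ``(asset_type, asset_name)`` because the matcher depends on the asset type.

```python
import json

def match_unique_names(labels, match_fn):
    """Match every distinct (asset_type, asset_name) pair once and reuse the result."""
    parsed = [json.loads(label) for label in labels]
    unique_keys = {
        (asset["asset_type"], asset["asset_name"])
        for assets in parsed
        for asset in assets
    }
    # exactly one call to match_fn per distinct pair
    cache = {key: match_fn(*key) for key in unique_keys}
    for assets in parsed:
        for asset in assets:
            asset["match_info"] = cache[(asset["asset_type"], asset["asset_name"])]
    return [json.dumps(assets) for assets in parsed], len(cache)

# Hypothetical stand-in for match_stock / match_etf / ... (assumption, not the real API)
def toy_match(asset_type, asset_name):
    table = {("stock", "apple"): "AAPL", ("stock", "tesla"): "TSLA"}
    return {"matched_ticker": table.get((asset_type, asset_name.lower()))}

labels = [
    json.dumps([
        {"asset_name": "Apple", "asset_type": "stock"},
        {"asset_name": "Tesla", "asset_type": "stock"},
    ]),
    json.dumps([{"asset_name": "Apple", "asset_type": "stock"}]),
]
matched_labels, n_calls = match_unique_names(labels, toy_match)
```

Here three extractions trigger only two calls to the matcher, and every repeated name receives the same ``match_info``, which is what would guarantee consistency if matching were not deterministic.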

In [6]:
from name_matching_utils import load_candidate_dicts, match_stock, match_etf, match_crypto, match_commodity

candidate_dicts = load_candidate_dicts()

In [7]:
def add_matches_to_label(label_str):
    label_json = json.loads(label_str)
    for asset in label_json:
        if asset["asset_type"] == "stock":
            asset["match_info"] = match_stock(asset["asset_name"], **candidate_dicts["stocks"])
        elif asset["asset_type"] == "etf":
            asset["match_info"] = match_etf(asset["asset_name"], **candidate_dicts["etfs"])
        elif asset["asset_type"] == "crypto":
            asset["match_info"] = match_crypto(asset["asset_name"], **candidate_dicts["cryptos"])
        elif asset["asset_type"] == "commodity":
            asset["match_info"] = match_commodity(asset["asset_name"], **candidate_dicts["commodities"])
        elif asset["asset_type"] == "other":
            # catch-all for asset types outside our scope -> no matching (could also enter empty match_info dict)
            pass
    return json.dumps(label_json)

# add matches to all labels
df["label"] = df["label"].apply(add_matches_to_label)

In [8]:
# save matched version of chunk labels df
df.to_csv(f"{data_path}/matched/CHUNKS_{inf_file_name}.csv", sep=";", index=False)

### 3. Recombining Chunks (dealing with duplicate assets)

Now that we have a matched chunk-level dataset, we can recombine the chunks into video-level data. When the same asset appears in multiple chunks of a video (or even multiple times in the same chunk), we apply the following rules:

1. Disagreements in sentiment:
     - for three-way disagreements: keep first neutral one
     - for buy/sell disagreements: create neutral one from first
     - for buy/neutral or sell/neutral disagreements: keep first non-neutral one

2. (Remaining) disagreements in asset_type:
     - can only happen with stocks/commodities or etfs/commodities
     - keep only the one with type ``commodity``

3. (Remaining) disagreements in asset_name:
     - just keep the first one
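
Rule 1 can be illustrated with a small sketch. This is not the code in ``chunks_to_video_utils``: ``resolve_sentiment`` is a hypothetical helper, it assumes the sentiment values are exactly ``buy``, ``sell`` and ``neutral``, and it expects the extractions of a single asset in order of appearance. Rules 2 and 3 are not covered.

```python
def resolve_sentiment(assets):
    """Collapse all extractions of one asset into a single extraction (rule 1 only)."""
    sentiments = {a["sentiment"] for a in assets}
    if len(sentiments) == 1:
        # no disagreement: keep the first extraction
        return dict(assets[0])
    if sentiments == {"buy", "sell"}:
        # buy/sell disagreement: copy the first extraction and mark it neutral
        merged = dict(assets[0])
        merged["sentiment"] = "neutral"
        return merged
    if sentiments == {"buy", "sell", "neutral"}:
        # three-way disagreement: keep the first neutral extraction
        return dict(next(a for a in assets if a["sentiment"] == "neutral"))
    # buy/neutral or sell/neutral: keep the first non-neutral extraction
    return dict(next(a for a in assets if a["sentiment"] != "neutral"))

buy_sell = [
    {"asset_name": "Apple", "sentiment": "buy"},
    {"asset_name": "Apple Inc", "sentiment": "sell"},
]
three_way = [
    {"asset_name": "A1", "sentiment": "buy"},
    {"asset_name": "A2", "sentiment": "neutral"},
    {"asset_name": "A3", "sentiment": "sell"},
]
sell_neutral = [
    {"asset_name": "B1", "sentiment": "neutral"},
    {"asset_name": "B2", "sentiment": "sell"},
]
resolved = [resolve_sentiment(group) for group in (buy_sell, three_way, sell_neutral)]
```

The helper returns copies, so the original extraction lists stay untouched and can still be stored in full.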

Importantly, the rules above must be applied separately to cryptos and non-crypto assets, because cryptos can share tickers with stocks/etfs/commodities while referring to different underlying assets. We implement this by adding a prefix to crypto tickers before processing and removing it afterwards.
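
The effect of the crypto prefix can be sketched as follows. This is a simplified variation, not the actual implementation: it uses the prefix only as a grouping key, so nothing has to be removed afterwards. ``CRYPTO_PREFIX``, ``group_by_asset`` and the ticker ``ABC`` are made up for illustration, and all assets are assumed to be matched.

```python
from collections import defaultdict

CRYPTO_PREFIX = "CRYPTO:"  # hypothetical prefix, the real one may differ

def group_by_asset(assets):
    """Group extractions by ticker, keeping cryptos apart from all other asset types."""
    groups = defaultdict(list)
    for asset in assets:
        ticker = asset["match_info"]["matched_ticker"]
        if asset["asset_type"] == "crypto":
            key = CRYPTO_PREFIX + ticker
        else:
            key = ticker
        groups[key].append(asset)
    return dict(groups)

assets = [
    {"asset_type": "stock", "sentiment": "buy", "match_info": {"matched_ticker": "ABC"}},
    {"asset_type": "crypto", "sentiment": "sell", "match_info": {"matched_ticker": "ABC"}},
    {"asset_type": "stock", "sentiment": "neutral", "match_info": {"matched_ticker": "ABC"}},
]
groups = group_by_asset(assets)
```

Without the prefix, all three extractions would land in one group and the crypto's sell would be resolved against the stock's buy.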

To preserve all information, we keep the full list of extractions in one column alongside the versions freed of duplicates via the rules above (once retaining and once removing unmatched assets).

In [9]:
# load matched data
chunks_df = pd.read_csv(f"{data_path}/matched/CHUNKS_{inf_file_name}.csv", sep=";")

In [10]:
from chunks_to_video_utils import deduplicate_asset_list

# obtain video-level dataframe: concatenate the asset lists of all chunks of a video, in chunk order
def concat_chunk_labels(labels):
    return json.dumps([asset for chunk_list in labels for asset in json.loads(chunk_list)])

video_df = (
    chunks_df.sort_values(by=["video_id", "chunk_number"])
    .groupby(["video_id"])
    .agg({"label": concat_chunk_labels})
)
video_df = video_df.reset_index().rename(columns={"label": "extractions_all"})

# deduplicated column, retaining unmatched
video_df["extractions_dedup_retain_unmatched"] = video_df["extractions_all"].apply(lambda x: json.dumps(deduplicate_asset_list(json.loads(x), retain_unmatched=True)))
# deduplicated column, removing unmatched assets
video_df["extractions_dedup"] = video_df["extractions_all"].apply(lambda x: json.dumps(deduplicate_asset_list(json.loads(x), retain_unmatched=False)))


Finally, we add two more columns with stripped-down versions of the extractions (without ``match_info`` etc.), i.e. lists of asset dicts with only three keys: ``asset_type``, ``ticker``, ``sentiment``.
- ``trade_info_incl_neutrals``
- ``trade_info_no_neutrals`` (not including extractions with neutral sentiment)

In [11]:
# helper
def filter_for_trade_info(asset):
    return {"asset_type": asset["asset_type"], 
             "ticker": asset["match_info"]["matched_ticker"],
             "sentiment": asset["sentiment"]}

video_df["trade_info_incl_neutrals"] = video_df["extractions_dedup"].apply(lambda x: json.dumps([filter_for_trade_info(a) for a in json.loads(x)]))
video_df["trade_info_no_neutrals"] = video_df["extractions_dedup"].apply(lambda x: json.dumps([filter_for_trade_info(a) for a in json.loads(x) if a["sentiment"] != "neutral"]))

In [12]:
# save video-level df
video_df.to_csv(f"{data_path}/matched/VIDEOS_{inf_file_name}.csv", sep=";", index=False)

Later we will join this dataset with the video upload dates (necessary for building portfolios) and other metadata which could be interesting for further analysis. 
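
A rough sketch of what such a join might look like is given below. The metadata frame and its ``upload_date`` column are assumptions made for illustration, as are the toy values; only ``video_id`` and ``trade_info_no_neutrals`` appear in this notebook.

```python
import pandas as pd

# Toy stand-ins: the real frames would be read from disk
video_df = pd.DataFrame({
    "video_id": ["a", "b"],
    "trade_info_no_neutrals": ["[]", "[]"],
})
metadata_df = pd.DataFrame({
    "video_id": ["a", "b", "c"],
    "upload_date": ["2023-01-05", "2023-02-10", "2023-03-01"],  # hypothetical column
})
metadata_df["upload_date"] = pd.to_datetime(metadata_df["upload_date"])

# left join keeps every video; validate raises if video_id is duplicated on either side
merged = video_df.merge(metadata_df, on="video_id", how="left", validate="one_to_one")

# videos without an upload date could not be placed in a portfolio
n_missing_dates = int(merged["upload_date"].isna().sum())
```

Using ``validate="one_to_one"`` makes the merge fail loudly if a ``video_id`` appears more than once, which would otherwise silently duplicate rows.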