# What

As established in this [notebook](./safey_themes_from_safety_issues.ipynb), BERTopic seems to be the most promising method for generating safety themes from safety issues.

There are a few problems that need to be addressed:
- Lots of outliers
- Only 3 topics being generated

## Modules

In [1]:
# local

# third parties

import yaml
import pandas as pd
import numpy as np

from dotenv import load_dotenv

import voyageai
import openai

from bertopic import BERTopic
from bertopic.representation import OpenAI
from umap import UMAP


# builtin
import os

# Load variables such as OPENAI_API_KEY from a local .env file, if one exists
load_dotenv()

openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Getting safety issue data

In [5]:
safety_issues_df = pd.read_csv('safety_issues.csv')

# Confirm it has the required columns: report_id, si and mode

required_columns = {'report_id', 'si', 'mode'}
if not required_columns.issubset(safety_issues_df.columns):
    print("Safety issues dataset is missing columns")
    del safety_issues_df

# Getting embeddings to be used for clustering

In [11]:
voyageai_embeddings = pd.read_pickle('voyageai_embeddings.pkl')

openai_embeddings = pd.read_pickle('openai_embeddings.pkl')
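
BERTopic's `fit_transform` takes precomputed embeddings as a single 2-D array with one row per document. A minimal sketch of that conversion, assuming each row of the pickled frame stores its vector as a list in the `si_embedding` column (the toy values below are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the pickled embeddings frame (values are illustrative only).
toy_embeddings = pd.DataFrame({
    "si": ["issue a", "issue b", "issue c"],
    "si_embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
})

# Stack the per-row vectors into one (n_documents, n_dimensions) array.
embedding_matrix = np.array(toy_embeddings["si_embedding"].tolist())
print(embedding_matrix.shape)  # (3, 2)
```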

# BERTopic models

There are two things that I can play with:
- Which embeddings are used
- How the topic representations are generated (keywords, OpenAI prompts, etc.)

In [8]:
openai_base_representation_model = OpenAI(
    openai_client,
    model="gpt-4-turbo",
    chat=True,
    nr_docs = 50)

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

In [7]:
def column_to_2darray(column):
    """Stack a column of per-row vectors into one (n_rows, n_dims) array."""
    return np.array([np.array(x) for x in column.to_numpy()])

def runBERTopic(df, docs_name, embeddings_name, representation_model, umap_model, reduce_outliers=True):
    docs = df[docs_name].to_list()

    topic_model = BERTopic(
        representation_model=representation_model,
        umap_model=umap_model,
        calculate_probabilities=True)

    if embeddings_name is not None:
        # Use the precomputed embeddings instead of BERTopic's default embedding model
        topics, probs = topic_model.fit_transform(
            docs,
            column_to_2darray(df[embeddings_name]))
    else:
        topics, probs = topic_model.fit_transform(docs)

    if reduce_outliers:
        # Reassign outlier documents (topic -1) using the topic probabilities
        topics = topic_model.reduce_outliers(
            documents=docs,
            topics=topics,
            probabilities=probs,
            strategy="probabilities")

        # Refresh the topic representations to reflect the new assignments
        topic_model.update_topics(
            docs,
            topics=topics,
            representation_model=representation_model)

    # Work on a copy so the caller's frame is not modified
    df = df.copy()
    df['topic'] = topics

    # Reuse df's index so the probability columns line up row for row
    df = pd.concat([df, pd.DataFrame(probs, index=df.index)], axis=1)

    return topic_model, df
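
The probability columns have to be joined to `df` on matching index labels, because `pd.concat(..., axis=1)` aligns on labels rather than on position. A sketch with made-up data showing what goes wrong when a filtered subset meets a default index, and how reusing the original index avoids it:

```python
import pandas as pd

# A filtered frame keeps its original labels (here 5 and 7), as a per-mode subset would.
subset = pd.DataFrame({"si": ["issue a", "issue b"]}, index=[5, 7])
toy_probs = [[0.9, 0.1], [0.2, 0.8]]  # made-up topic probabilities

# A default index 0..1 does not match labels 5 and 7, so rows are not paired up.
misaligned = pd.concat([subset, pd.DataFrame(toy_probs)], axis=1)

# Reusing the subset's index keeps one row per document.
aligned = pd.concat([subset, pd.DataFrame(toy_probs, index=subset.index)], axis=1)

print(len(misaligned), len(aligned))  # 4 2
```

The misaligned result has rows where `si` is NaN, which is the kind of row a later `dropna(subset=['si'])` would remove.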

## Running it on all safety issues


I want to generate the safety themes from all of the safety issues I have available.

In [6]:
def check_mode_cluster_distribution(df):
    safety_issues_df_topic_mode = df.pivot_table(index='topic', columns='mode', values='report_id', aggfunc='count').fillna(0)
    return safety_issues_df_topic_mode
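
To illustrate what this table contains, here is a small self-contained sketch that applies the same pivot to made-up rows (the report ids, topics and modes are invented for the example):

```python
import pandas as pd

# Made-up assignments: each row is one safety issue with its topic and transport mode.
toy = pd.DataFrame({
    "report_id": ["r1", "r2", "r3", "r4", "r5"],
    "topic": [0, 0, 1, 1, 1],
    "mode": [0, 0, 1, 1, 2],
})

# Same pivot as check_mode_cluster_distribution: rows are topics, columns are modes.
distribution = toy.pivot_table(
    index="topic", columns="mode", values="report_id", aggfunc="count").fillna(0)

print(distribution)
```

A topic whose counts sit almost entirely in one column is effectively a transport-mode cluster rather than a theme that cuts across modes.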

### Simple minilm embeddings

This seems to have failed. I believe this is mainly because each document is really short.

In [83]:

topic_model, _ = runBERTopic(
    safety_issues_df, 'si', None, openai_base_representation_model, umap_model, reduce_outliers=False)

topic_model.get_topic_info()


One problem is that the number of outliers is quite large.

I will try to merge the outliers into the existing topics.
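
BERTopic labels documents it cannot place in any cluster with topic `-1`. One way to put a number on the outlier problem is the share of documents carrying that label. The sketch below uses a made-up topic list, and `outlier_share` is a hypothetical helper name, not part of the notebook:

```python
def outlier_share(topics):
    """Return the fraction of documents assigned to the outlier topic (-1)."""
    if not topics:
        return 0.0
    return sum(1 for t in topics if t == -1) / len(topics)

# Made-up assignments for eight documents.
toy_topics = [-1, 0, 0, -1, 1, -1, 2, -1]
print(outlier_share(toy_topics))  # 0.5
```

The same helper could be applied to the `topics` list returned by `fit_transform`, before and after outlier reduction.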

In [86]:
topic_model, _ = runBERTopic(
    safety_issues_df, 'si', None, openai_base_representation_model, umap_model, reduce_outliers=True)

topic_model.get_topic_info()



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,166,0_Rail Safety and Operational Issues in New Ze...,[Rail Safety and Operational Issues in New Zea...,[The training that drivers received for transi...
1,1,64,1_Maritime Safety and Navigation Management Is...,[Maritime Safety and Navigation Management Iss...,[The voyage planning for the time in the Snare...
2,2,36,2_Maritime Safety and Regulations Compliance I...,[Maritime Safety and Regulations Compliance Is...,[The skipper did not have the requisite knowle...
3,3,53,3_Safety and Maintenance Issues in Engineering...,[Safety and Maintenance Issues in Engineering ...,[There was a lack of clear communication and a...
4,4,53,4_Maritime and Aviation Safety Management and ...,[Maritime and Aviation Safety Management and E...,[It could not be established why the chief off...
5,5,50,5_Aviation Safety and Compliance Issues,[Aviation Safety and Compliance Issues],[Had the controllers realised that the low clo...
6,6,27,6_Robinson Helicopter Safety and Accident Anal...,[Robinson Helicopter Safety and Accident Analy...,"[Due to their unique main rotor design, during..."
7,7,62,7_Aviation Safety and Regulatory Compliance Is...,[Aviation Safety and Regulatory Compliance Iss...,[The standard of pilot training and the superv...
8,8,26,8_Aircraft Landing Gear and Door Lock Failures,[Aircraft Landing Gear and Door Lock Failures],[Had the pilots known that the nose landing ge...
9,9,23,9_Deficiencies in Safety and Regulatory Compli...,[Deficiencies in Safety and Regulatory Complia...,[There were no established procedures for ente...


The main problem here is that the distribution is not great. Most of the rail issues seem to be in the first topic, while maritime and aviation take up the rest.

### VoyageAI embeddings

In [13]:
topic_model, voyageai_clusters_df = runBERTopic(
    voyageai_embeddings, 'si', 'si_embedding', openai_base_representation_model, umap_model, reduce_outliers=True)

topic_model.get_topic_info()



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,379,0_Aviation and Maritime Safety Management and ...,[Aviation and Maritime Safety Management and C...,[The bilge pumping system on the Jubilee was n...
1,1,181,1_Railway Safety and Operational Issues,[Railway Safety and Operational Issues],[The training that drivers received for transi...


In [101]:


check_mode_cluster_distribution(voyageai_clusters_df)

topic,mode 0,mode 1,mode 2
0,204,8,167
1,2,177,2


This has created two topics, one covering aviation and maritime and the other covering rail.

### OpenAI embeddings

In [8]:
topic_model, openai_clusters_df = runBERTopic(
    openai_embeddings, 'si', 'si_embedding', openai_base_representation_model, umap_model, reduce_outliers=True)

topic_model.get_topic_info()[['Count', 'Name']]



Unnamed: 0,Count,Name
0,203,0_Aviation Safety and Compliance Issues
1,189,1_Rail Safety and Operational Issues in New Ze...
2,142,2_Maritime Safety and Navigation Management Flaws
3,26,3_Maritime Safety and Compliance Issues of the...


In [11]:
check_mode_cluster_distribution(openai_clusters_df)

topic,mode 0,mode 1,mode 2
0,191.0,4.0,8.0
1,5.0,181.0,3.0
2,4.0,0.0,138.0
3,6.0,0.0,20.0


This has also made a cleanish split between modes of transport. I can either try to force it not to do this, run the model on each mode and then merge the models, or both.
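
One way to quantify how clean the split is: for each topic, take the share of its issues that come from its dominant mode. The counts below are copied from the distribution table above; `mode_purity` is a hypothetical helper, not part of the notebook.

```python
# Topic -> counts per mode (0, 1, 2), copied from the distribution table above.
counts_by_topic = {
    0: [191, 4, 8],
    1: [5, 181, 3],
    2: [4, 0, 138],
    3: [6, 0, 20],
}

def mode_purity(counts):
    """Share of a topic's documents that belong to its most common mode."""
    return max(counts) / sum(counts)

purity = {topic: round(mode_purity(c), 3) for topic, c in counts_by_topic.items()}
print(purity)
```

Values close to 1 mean the topic is almost entirely one transport mode, which is the pattern the notebook is trying to get away from.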

In [16]:
umap_model_tweaked = UMAP(n_neighbors=4, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

topic_model, openai_clusters_tweaked_df = runBERTopic(
    openai_embeddings, 'si', 'si_embedding', openai_base_representation_model, umap_model_tweaked, reduce_outliers=True)

display(topic_model.get_topic_info()[['Count', 'Name']])

check_mode_cluster_distribution(openai_clusters_tweaked_df)



Unnamed: 0,Count,Name
0,167,0_Rail Safety and Operational Failures
1,115,1_Maritime Safety and Resource Management Defi...
2,50,2_Safety and Compliance in Transport and Marit...
3,51,3_Aviation Safety and Regulatory Compliance Is...
4,41,4_Helicopter Safety and Operational Issues
5,52,5_Aviation Safety and Air Traffic Control Issues
6,27,6_Safety Issues in Rail Operations
7,30,7_Aircraft Landing Gear and Maintenance Issues
8,11,8_Aviation Safety Issues Related to Door Locki...
9,16,9_Safety and Maintenance Challenges in Maritim...


topic,mode 0,mode 1,mode 2
0,3.0,157.0,7.0
1,7.0,0.0,108.0
2,15.0,2.0,33.0
3,47.0,0.0,4.0
4,40.0,1.0,0.0
5,51.0,0.0,1.0
6,2.0,25.0,0.0
7,30.0,0.0,0.0
8,11.0,0.0,0.0
9,0.0,0.0,16.0


I will try to tune the hyperparameters and see if I can get the right sort of safety themes.

In [116]:
topic_model, openai_clusters_tuned_df = runBERTopic(
    openai_embeddings,
    'si',
    'si_embedding',
    openai_base_representation_model,
    UMAP(n_neighbors=4, n_components=5, min_dist=0.0, metric='cosine', random_state=42),
    reduce_outliers=True)

topic_model.get_topic_info()[['Count', "Name","Representative_Docs"]]



Unnamed: 0,Count,Name,Representative_Docs
0,167,0_Rail Safety and Operational Issues in New Ze...,[The training that drivers received for transi...
1,115,1_Maritime Safety and Resource Management Defi...,[The standard of passage planning on board the...
2,50,2_Maritime and Aviation Safety Regulations and...,[The absence of a visual indicator in the whee...
3,51,3_Aviation Safety and Regulatory Compliance Is...,[The operator's system for training its pilots...
4,41,4_Helicopter Safety and Maintenance Issues,"[Due to their unique main rotor design, during..."
5,52,5_Aviation Safety and Operational Procedures a...,[While ATC sequences an IFR aeroplane to land ...
6,27,6_Safety Issues and Management Deficiencies in...,[The train controller made an assumption about...
7,30,7_Aircraft Landing Gear and Maintenance Issues,[Had the pilots known that the nose landing ge...
8,11,8_Aviation Safety and Equipment Malfunction,"[The use of ""threat and error management"" (TEM..."
9,16,9_Maintenance and Risk Management in Marine Sa...,[A clear placard should be placed at the contr...


## Run cluster on just one mode

If the clustering is mostly recovering the transport modes, then splitting the data by mode first might help find the themes within each mode.

In [12]:
def printout_each_modes_topics(results):
    for res in results:
        print("Cluster names: ")
        for i, count in zip(res[0].get_topic_info()['Name'], res[0].get_topic_info()['Count']):
            print(f"{count}, {i}")

### OpenAI

In [20]:
openai_modes_dfs = [openai_embeddings[openai_embeddings['mode'] == i] for i in range(3)]

In [106]:
results = [runBERTopic(df, 'si', 'si_embedding', openai_base_representation_model, umap_model) for df in openai_modes_dfs]

printout_each_modes_topics(results)



Cluster names: 
46, 0_Aviation Safety and Operational Procedures Issues
42, 1_Aircraft Maintenance and Safety Issues
37, 2_Challenges and Safety Issues in Robinson Helicopter Operations
51, 3_Aviation Safety and Regulatory Oversight in New Zealand
30, 4_Aviation Safety and Emergency Response
Cluster names: 
49, 0_KiwiRail Safety and Compliance Issues
28, 1_Rail Safety and Inspection Inefficiencies
42, 2_Rail Safety and Communication Issues
27, 3_Safety and Oversight Concerns in Train Operations
21, 4_Road and Rail Safety at Level Crossings
18, 5_Risk Management and Safety Issues in Wellington Station Train Operations
Cluster names: 
150, 0_Maritime Safety and Crew Management Deficiencies
19, 1_Maritime Safety and Compliance Issues


Instead, I will try with no dimensionality reduction, or at least less of it.

In [111]:
from bertopic.dimensionality import BaseDimensionalityReduction

results = [runBERTopic(df,
                       'si',
                       'si_embedding',
                       openai_base_representation_model,
                       BaseDimensionalityReduction()
                       ) for df in openai_modes_dfs]

printout_each_modes_topics(results)



Cluster names: 
46, 0_Aviation Safety and Air Traffic Management Issues
47, 1_Aircraft Maintenance and Safety Issues
41, 2_Safety and Training Issues in Robinson Helicopter Operations
53, 3_Aviation Safety and Compliance Issues
19, 4_Safety and Regulatory Oversight in Aviation and Parachuting Operations
Cluster names: 
47, 0_Issues in KiwiRail's Safety and Operational Procedures
62, 1_Rail Safety and Incident Analysis
29, 2_Rail Safety and Signal Management Issues in Wellington Station Approaches
20, 3_Safety Issues at Rail Level Crossings
27, 4_Safety and Risk Management in Rail Operations
Cluster names: 
149, 0_Maritime Safety and Resource Management Issues
20, 1_Maritime Safety Violations and the Sinking of the Easy Rider


This results in just one cluster for each mode, most likely because of the curse of dimensionality. I will instead try to tune the UMAP hyperparameters on the OpenAI embeddings.
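
The curse of dimensionality can be seen on synthetic data: as the number of dimensions grows, the nearest and farthest neighbours of a point end up at nearly the same distance, which leaves a density-based clusterer little to separate. A minimal sketch on random Gaussian points (not the safety-issue embeddings); `relative_contrast` is a hypothetical helper:

```python
import math
import random

def relative_contrast(n_points, dim, seed=0):
    """(farthest - nearest) / nearest distance from one query point to the rest."""
    rng = random.Random(seed)
    points = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_points)]
    query = points[0]
    dists = [math.dist(query, p) for p in points[1:]]
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the dimension grows.
for dim in (2, 32, 512):
    print(dim, round(relative_contrast(200, dim), 2))
```

This is only an illustration of the general effect; how strongly it applies to the actual embeddings is not measured here.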

In [26]:
from bertopic.dimensionality import BaseDimensionalityReduction

results = [runBERTopic(df,
                       'si',
                       'si_embedding',
                       None,
                       UMAP(n_neighbors=6, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
                       ) for df in openai_modes_dfs]

printout_each_modes_topics(results)



Cluster names: 
52, 0_the_to_of_and
47, 1_the_to_of_and
41, 2_the_of_to_and
33, 3_the_gear_landing_to
17, 4_the_to_for_water
16, 5_zealand_new_of_the
Cluster names: 
44, 0_the_to_work_of
40, 1_the_train_to_and
25, 2_train_of_the_and
20, 3_road_level_crossings_the
20, 4_the_brake_braking_conditions
15, 5_the_in_of_wellington
21, 6_the_rail_of_to
Cluster names: 
120, 0_the_of_and_to
27, 1_the_to_of_easy
11, 2_co2_the_could_be
11, 3_the_fish_crew_of


In [1]:
merged_moode_models.get_topic_info()[['Count', "Name"]]

NameError: name 'merged_moode_models' is not defined

## Hyperparameter tuning

I have had a look at both the single-run and the individual-mode models.

I think the next step is to do some hyperparameter tuning.

As there are no noticeable differences between VoyageAI and OpenAI, I will go with the OpenAI embedding model.

In [9]:
def perform_umap_parameter_search(modes_dfs, n_range = range(4,5), n_components = 5):
    overall_results = []
    for n in n_range:
        print(" Looking at n_neighbors = ", n)
        results = [runBERTopic(df,
                        'si',
                        'si_embedding',
                        None,
                        UMAP(n_neighbors=n, n_components=n_components, min_dist=0.0, metric='cosine', random_state=42)
                        ) for df in modes_dfs]
        
        group_clusters = runBERTopic(
            pd.concat(modes_dfs),
            'si',
            'si_embedding',
            None,
            UMAP(n_neighbors=n, n_components=n_components, min_dist=0.0, metric='cosine', random_state=42))
        
        overall_results.append({
            'n_neighbors': n,
            'individual_models': [result[0] for result in results],
            'individual_df': [result[1] for result in results],
            'group_model': group_clusters[0],
            'group_df': group_clusters[1]

        })
    
    return overall_results

In [None]:
hyper_parameter_search_results = []

for n_components in [2,3,4,5,6,7,8]:
    print("Looking at n_components = ", n_components)
    results =  perform_umap_parameter_search(openai_modes_dfs, range(3,10), n_components)
    for res in results:
        hyper_parameter_search_results.append({'n_components': n_components} | res)
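
The two nested loops amount to a grid over `n_components` and `n_neighbors`. A sketch of enumerating the same grid up front with `itertools.product`, which makes the number of fits explicit before starting the search (each combination fits three per-mode models plus one group model in the code above):

```python
from itertools import product

n_components_values = [2, 3, 4, 5, 6, 7, 8]
n_neighbors_values = range(3, 10)

# Every (n_components, n_neighbors) pair the search will visit.
grid = list(product(n_components_values, n_neighbors_values))

models_per_combination = 3 + 1  # three per-mode models plus one group model
print(len(grid), len(grid) * models_per_combination)  # 49 196
```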

In [None]:
hyper_parameter_search_df = pd.DataFrame(hyper_parameter_search_results)

In [None]:
hyper_parameter_search_df['individual_topic_counts'] = hyper_parameter_search_df['individual_models'].apply(lambda list_of_models: [(len(x.get_topic_info()['Name']) > 3) * 1 for x in list_of_models])

hyper_parameter_search_df['group_topic_membership_counts'] = hyper_parameter_search_df.apply(
    lambda x: 
    [(c < 100) * 1 for c in x['group_model'].get_topic_info()['Count'].to_list()], axis=1)

hyper_parameter_search_df[['n_components', 'n_neighbors', 'individual_topic_counts', 'group_topic_membership_counts']]



Unnamed: 0,n_components,n_neighbors,individual_topic_counts,group_topic_membership_counts
0,2,3,"[1, 1, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
1,2,4,"[1, 0, 0]","[0, 0, 0, 1, 1, 1, 1]"
2,2,5,"[1, 1, 0]","[0, 0]"
3,2,6,"[1, 1, 0]","[0, 1, 1, 1, 1, 1, 1, 1, 1]"
4,2,7,"[0, 0, 1]","[0, 0, 0]"
5,2,8,"[1, 1, 0]","[0, 0, 0, 1]"
6,2,9,"[1, 1, 0]","[0, 0, 1, 1, 1, 1, 1, 1, 1]"
7,3,3,"[1, 1, 1]","[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
8,3,4,"[1, 0, 0]","[0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
9,3,5,"[0, 1, 0]","[0, 0, 1, 1, 1, 1, 1, 1]"


Quite a few parameter combinations look reasonable.

I will have to choose one for demo purposes.

In [None]:
appropriate_counts_df = hyper_parameter_search_df[hyper_parameter_search_df.apply(
    lambda row: 
    np.nanmean(row['individual_topic_counts']) > 0.6 and
    np.nanmean(row['group_topic_membership_counts']) > 0.9
    , axis=1)
]

appropriate_counts_df[['n_components', 'n_neighbors', 'individual_topic_counts', 'group_topic_membership_counts']]

Unnamed: 0,n_components,n_neighbors,individual_topic_counts,group_topic_membership_counts
0,2,3,"[1, 1, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
7,3,3,"[1, 1, 1]","[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
14,4,3,"[1, 1, 1]","[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
17,4,6,"[1, 1, 0]","[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
21,5,3,"[1, 1, 1]","[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
23,5,5,"[0, 1, 1]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
28,6,3,"[1, 1, 0]","[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
29,6,4,"[0, 1, 1]","[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
35,7,3,"[1, 1, 1]","[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
36,7,4,"[1, 1, 0]","[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
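
The two indicator columns are lists of 0/1 flags: a per-mode model scores 1 when `get_topic_info()` has more than three rows, and a group topic scores 1 when it holds fewer than 100 issues. The filter then keeps rows where the mean of each list clears a threshold. A sketch of that rule on flags copied from the tables above; `passes_filter` is a hypothetical helper:

```python
from statistics import mean

def passes_filter(individual_flags, group_flags):
    """Apply the same thresholds as the filter above to one row of flags."""
    return mean(individual_flags) > 0.6 and mean(group_flags) > 0.9

# Row 0 (n_components=2, n_neighbors=3): 2 of 3 modes pass, all 15 group topics are small.
print(passes_filter([1, 1, 0], [1] * 15))  # True
# Row 1 (n_components=2, n_neighbors=4): only 1 of 3 modes passes.
print(passes_filter([1, 0, 0], [0, 0, 0, 1, 1, 1, 1]))  # False
# Row 7 (n_components=3, n_neighbors=3): 12 of 13 group topics are small.
print(passes_filter([1, 1, 1], [0] + [1] * 12))  # True
```

With three modes, the 0.6 threshold means at least two of the three per-mode models must have more than three rows.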


#### Group model

I will look at the best candidates among the group models.

In [None]:
potential_group_model = appropriate_counts_df.loc[[0,23,42],]

print(potential_group_model.columns)

potential_group_model.apply(
    lambda row:
    row['group_model'].update_topics(
        row['group_df']['si'].to_list(),
        representation_model = openai_base_representation_model
    ),
    axis=1
)


Index(['n_components', 'n_neighbors', 'individual_models', 'individual_df',
       'group_model', 'group_df', 'individual_topic_counts',
       'group_topic_membership_counts'],
      dtype='object')


KeyboardInterrupt: 

In [None]:

potential_group_model['model_summary'] = potential_group_model['group_model'].apply(lambda model: model.get_topic_info())

for i, row in potential_group_model.iterrows():

    display(row['model_summary'])

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,70,0_Rail and Road Safety and Standards Compliance,[Rail and Road Safety and Standards Compliance],[The Beach Road/ State Highway 1 intersection ...
1,1,60,1_Aviation and Maritime Safety and Compliance ...,[Aviation and Maritime Safety and Compliance I...,[The procedure for circling below the minimum ...
2,2,63,2_Maritime Safety and Navigation Management De...,[Maritime Safety and Navigation Management Def...,[The standard of bridge resource management on...
3,3,54,3_Safety and Regulatory Oversight in New Zeala...,[Safety and Regulatory Oversight in New Zealan...,[There was a low likelihood of the weather con...
4,4,50,4_Maritime Safety and Risk Management Deficien...,[Maritime Safety and Risk Management Deficienc...,[The plastic sheathing that had been placed ar...
5,5,36,5_Issues in KiwiRail's Work and Safety Managem...,[Issues in KiwiRail's Work and Safety Manageme...,[The New Zealand Rail Operating Rules and Proc...
6,6,51,6_Helicopter Flight Safety and Operating Chall...,[Helicopter Flight Safety and Operating Challe...,"[Due to their unique main rotor design, during..."
7,7,27,7_Challenges and Risks in Train Control Safety...,[Challenges and Risks in Train Control Safety ...,[The train controller made an assumption about...
8,8,30,8_Emergency Preparedness and Response in Trans...,[Emergency Preparedness and Response in Transp...,[There were as few as 4 approved lifejackets o...
9,9,28,9_Train Operation and Communication Safety Issues,[Train Operation and Communication Safety Issues],[Lack of a suitable communication system betwe...


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,95,0_Maritime Safety and Navigation Management Is...,[Maritime Safety and Navigation Management Iss...,[The standard of bridge resource management on...
1,1,61,1_Aviation Safety and Compliance Issues in New...,[Aviation Safety and Compliance Issues in New ...,[The operator's system for training its pilots...
2,2,49,2_KiwiRail Operational and Safety Challenges,[KiwiRail Operational and Safety Challenges],[The New Zealand Rail Operating Rules and Proc...
3,3,49,3_Maritime Safety and Emergency Response Regul...,[Maritime Safety and Emergency Response Regula...,[The absence of a visual indicator in the whee...
4,4,47,4_Aviation Safety and Operational Miscommunica...,[Aviation Safety and Operational Miscommunicat...,[While ATC sequences an IFR aeroplane to land ...
5,5,43,5_Aircraft Maintenance and Safety Compliance I...,[Aircraft Maintenance and Safety Compliance Is...,[Had the pilots known that the nose landing ge...
6,6,38,6_Helicopter Safety and Operational Challenges,[Helicopter Safety and Operational Challenges],"[Due to their unique main rotor design, during..."
7,7,22,7_Train Operational Safety and Communication I...,[Train Operational Safety and Communication Is...,[The safety issue arising from this incident w...
8,8,26,8_Safety and Risk Management Issues in Train C...,[Safety and Risk Management Issues in Train Co...,[Poor planning and co-ordination of track infr...
9,9,18,9_Rail System Safety and Performance in Low-Ad...,[Rail System Safety and Performance in Low-Adh...,[A key safety issue was that the National Rail...


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,80,0_Maritime Safety and Navigation Management Is...,[Maritime Safety and Navigation Management Iss...,[Neither the owner nor the skipper sought or w...
1,1,54,1_Rail Safety and Training Gaps,[Rail Safety and Training Gaps],[The Matangi braking and wheel-slide protectio...
2,2,57,2_Helicopter Safety and Maintenance Issues,[Helicopter Safety and Maintenance Issues],"[Due to their unique main rotor design, during..."
3,3,46,3_Train Collision Risks and Communication Fail...,[Train Collision Risks and Communication Failu...,[There are a number of reasonable measures tha...
4,4,47,4_Maritime Safety and Compliance Issues,[Maritime Safety and Compliance Issues],[Neither the ship's planned-maintenance system...
5,5,43,5_Emergency Preparedness and Response in Trans...,[Emergency Preparedness and Response in Transp...,[There were as few as 4 approved lifejackets o...
6,6,41,6_KiwiRail Operational and Safety Compliance I...,[KiwiRail Operational and Safety Compliance Is...,[The New Zealand Rail Operating Rules and Proc...
7,7,39,7_Aviation Safety and Air Traffic Management C...,[Aviation Safety and Air Traffic Management Co...,[When an IFR aeroplane is approved to conduct ...
8,8,39,8_Aviation Safety and Operational Compliance I...,[Aviation Safety and Operational Compliance Is...,[The operator's system for training its pilots...
9,9,27,9_Safety Challenges and Risks in Train Control...,[Safety Challenges and Risks in Train Control ...,[The train controller made an assumption about...


I need to choose just one for a demo.

This will be the last one as it looks the most reasonable.

In [202]:
check_mode_cluster_distribution(appropriate_counts_df.loc[1, 'group_df'])

KeyError: 1

In [215]:
demo_group_model = potential_group_model.loc[0,]

demo_group_model['group_model'].save('demo_group_model', serialization='pytorch')



TypeError: cannot pickle '_thread.RLock' object

#### Merged models

There are several sets of individual models that have good counts.
I can merge each set into a single model and end up with quite a few topics.

In [None]:
potential_individual_models = appropriate_counts_df[appropriate_counts_df['individual_topic_counts'].apply(
    lambda counts: 
    sum(counts) == 3
    )
].reset_index(drop=True)


In [None]:

potential_individual_models['merged_model'] = potential_individual_models.apply(
    lambda row: 
    BERTopic.merge_models(row['individual_models'])
    , axis=1
)

potential_individual_models['merged_df'] = potential_individual_models['individual_df'].apply(
    lambda dfs: pd.concat([df.dropna(subset=['si']) for df in dfs], axis = 0)
)    


In [None]:
potential_individual_models.apply(
    lambda row: 
    row['merged_model'].update_topics(
        row['merged_df']['si'].tolist(),
        representation_model = openai_base_representation_model
    ), axis = 1
)


In [None]:
potential_individual_models['model_summary'] = potential_individual_models['merged_model'].apply(lambda model: model.get_topic_info())

for i, row in potential_individual_models.iterrows():

    display(row['model_summary'])

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,42,0_Helicopter Safety and Pilot Error,[Helicopter Safety and Pilot Error],
1,1,261,1_Transportation Safety and Management Issues,[Transportation Safety and Management Issues],
2,2,18,2_Aviation Safety and Regulatory Oversight,[Aviation Safety and Regulatory Oversight],
3,3,18,3_Aircraft Landing Gear and Maintenance Issues,[Aircraft Landing Gear and Maintenance Issues],
4,4,35,4_Aviation Safety and Compliance Issues,[Aviation Safety and Compliance Issues],
5,5,16,5_Aviation Safety and Regulatory Concerns in N...,[Aviation Safety and Regulatory Concerns in Ne...,
6,6,12,6_Safety and Compliance Issues in Aircraft Com...,[Safety and Compliance Issues in Aircraft Comp...,
7,7,28,7_Aerodrome and Air Traffic Control Operations...,[Aerodrome and Air Traffic Control Operations ...,
8,8,26,8_Safety and Performance Risks in Rail Control...,[Safety and Performance Risks in Rail Control ...,
9,9,23,9_Risk Factors and Challenges in Train Operati...,[Risk Factors and Challenges in Train Operatio...,


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,44,0_Helicopter Safety and Accident Factors,[Helicopter Safety and Accident Factors],
1,1,144,1_Maritime Safety and Management Failures,[Maritime Safety and Management Failures],
2,2,18,2_Aircraft Nose Landing Gear Failures and Main...,[Aircraft Nose Landing Gear Failures and Maint...,
3,3,18,3_Safety and Regulatory Compliance in Aviation...,[Safety and Regulatory Compliance in Aviation ...,
4,4,18,4_Air Traffic Control and Pilot Miscommunication,[Air Traffic Control and Pilot Miscommunication],
5,5,20,5_Aviation Safety and Regulatory Compliance Co...,[Aviation Safety and Regulatory Compliance Con...,
6,6,60,6_Safety and Compliance Issues in Transportati...,[Safety and Compliance Issues in Transportatio...,
7,7,23,7_Safety and Regulatory Issues in New Zealand ...,[Safety and Regulatory Issues in New Zealand A...,
8,8,42,8_Transportation Safety and Communication Issues,[Transportation Safety and Communication Issues],
9,9,42,9_KiwiRail Safety and Procedure Compliance Issues,[KiwiRail Safety and Procedure Compliance Issues],


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,57,0_Helicopter Safety and Operational Hazards,[Helicopter Safety and Operational Hazards],
1,1,44,1_Aviation Safety and Compliance Issues,[Aviation Safety and Compliance Issues],
2,2,186,2_Maritime Safety and Management System Defici...,[Maritime Safety and Management System Deficie...,
3,3,22,3_Aircraft Nose Landing Gear Maintenance and F...,[Aircraft Nose Landing Gear Maintenance and Fa...,
4,4,18,4_Aviation Safety and Regulatory Oversight,[Aviation Safety and Regulatory Oversight],
5,5,12,5_Safety and Regulatory Issues in New Zealand ...,[Safety and Regulatory Issues in New Zealand A...,
6,6,44,6_Train Safety and Communication Issues,[Train Safety and Communication Issues],
7,7,39,7_KiwiRail Safety and Compliance Issues,[KiwiRail Safety and Compliance Issues],
8,8,35,8_Rail and Train Safety Issues,[Rail and Train Safety Issues],
9,9,26,9_Safety and Performance Issues in Train Control,[Safety and Performance Issues in Train Control],


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,46,0_Aviation Safety and Air Traffic Control Chal...,[Aviation Safety and Air Traffic Control Chall...,
1,1,49,1_Helicopter Safety and Operational Challenges,[Helicopter Safety and Operational Challenges],
2,2,165,2_Maritime Safety and Management Deficiencies,[Maritime Safety and Management Deficiencies],
3,3,20,3_Aircraft Maintenance and Safety Issues,[Aircraft Maintenance and Safety Issues],
4,4,18,4_Aviation Safety and Regulatory Oversight,[Aviation Safety and Regulatory Oversight],
5,5,24,5_Aircraft Landing Gear and Maintenance Issues,[Aircraft Landing Gear and Maintenance Issues],
6,6,13,6_Safety and Regulatory Issues in New Zealand ...,[Safety and Regulatory Issues in New Zealand A...,
7,7,42,7_Issues in KiwiRail's Safety and Communicatio...,[Issues in KiwiRail's Safety and Communication...,
8,8,26,8_Train Controller Performance and Safety Issues,[Train Controller Performance and Safety Issues],
9,9,27,9_Rail Safety and Signal Management Issues,[Rail Safety and Signal Management Issues],


I have decided to go with the fourth set of hyperparameters, as these seem to give the best results.


**Merged model**

In [211]:
demo_merged_model = potential_individual_models.loc[3,]

demo_merged_model['merged_model'].save("demo_merged_model", serialization="pytorch")

**Individual model**

In [None]:


for model, df in zip(
    potential_individual_models.loc[3,'individual_models'],
    potential_individual_models.loc[3,'individual_df']):

    model.update_topics(
        df.dropna(subset=['si'])['si'].tolist(),
        representation_model=openai_base_representation_model)



In [213]:
for i, model in enumerate(potential_individual_models.loc[3,'individual_models']):
    display(model.get_topic_info())

    model.save(f"demo_individual_model_mode_{i}", serialization="pytorch")

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,46,0_Aviation Safety and Communication Issues,[Aviation Safety and Communication Issues],[While ATC sequences an IFR aeroplane to land ...
1,1,49,1_Helicopter Safety and Operational Challenges,[Helicopter Safety and Operational Challenges],"[Due to their unique main rotor design, during..."
2,2,36,2_Aviation Safety and Regulatory Compliance Co...,[Aviation Safety and Regulatory Compliance Con...,[The operator's system for training its pilots...
3,3,20,3_Aircraft Safety and Maintenance Issues,[Aircraft Safety and Maintenance Issues],[There was a lack of clear communication and a...
4,4,18,4_Safety and Regulatory Oversight in Parachuti...,[Safety and Regulatory Oversight in Parachutin...,[The risk to people involved in helicopter ope...
5,5,24,5_Aircraft Landing Gear and Maintenance Issues,[Aircraft Landing Gear and Maintenance Issues],[Had the pilots known that the nose landing ge...
6,6,13,6_Safety and Regulatory Issues in New Zealand ...,[Safety and Regulatory Issues in New Zealand A...,[The regulatory oversight of commercial balloo...


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,42,0_Issues in KiwiRail's Operational Procedures ...,[Issues in KiwiRail's Operational Procedures a...,[The New Zealand Rail Operating Rules and Proc...
1,1,26,1_Safety and Risk Management in Rail Operations,[Safety and Risk Management in Rail Operations],[Poor planning and co-ordination of track infr...
2,2,27,2_Rail Safety and Signal Management Issues,[Rail Safety and Signal Management Issues],[The lever in the signal box that was used to ...
3,3,23,3_Train Safety and Communication Failures,[Train Safety and Communication Failures],[Nor could the system rely on visually sightin...
4,4,21,4_Train Collision Risks at Wellington Station,[Train Collision Risks at Wellington Station],[There is a heightened risk of trains collidin...
5,5,18,5_Rail Safety and Standards Compliance Concerns,[Rail Safety and Standards Compliance Concerns],[A key safety issue was that the National Rail...
6,6,15,6_Rail System Failures and Inspection Limitations,[Rail System Failures and Inspection Limitations],[The rail fracture examination revealed that t...
7,7,13,7_Safety and Compatibility Issues at Road and ...,[Safety and Compatibility Issues at Road and R...,[Level crossing assessments do not require the...


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,83,0_Maritime Safety and Resource Management Prac...,[Maritime Safety and Resource Management Pract...,[The voyage planning for the time in the Snare...
1,1,46,1_Maritime Safety and Risk Management Deficien...,[Maritime Safety and Risk Management Deficienc...,[The plastic sheathing that had been placed ar...
2,2,26,2_Maritime Safety and Emergency Response Failures,[Maritime Safety and Emergency Response Failures],[The owner of the Easy Rider was not meeting i...
3,3,14,3_Maintenance and Regulation Issues in Maritim...,[Maintenance and Regulation Issues in Maritime...,"[The CO2 system's pilot cylinder leaked, but t..."


### VoyageAI

In [None]:
modes_dfs = [voyageai_embeddings[voyageai_embeddings['mode'] == i] for i in range(3)]

results = [runBERTopic(df, 'si', 'si_embedding', openai_base_representation_model, umap_model) for df in modes_dfs]

printout_each_modes_topics(results)



Cluster names: 
54, 0_Aviation Safety and Operational Procedures
34, 1_Safety Challenges and Risks in Robinson Helicopter Operations
62, 2_Aviation Safety and Regulatory Compliance Issues
25, 3_Aircraft Landing Gear and Door System Failures
31, 4_Aircraft Maintenance and Safety Concerns
Cluster names: 
166, 0_Rail Safety and Management Issues
19, 1_Safety and Regulatory Issues at Rail Level Crossings
Cluster names: 
28, 0_Maritime Safety and Bridge Resource Management Deficiencies
33, 1_Maritime Safety and Management Failures
43, 2_Maritime Safety and Navigation Standards Compliance
30, 3_Maritime Safety and Emergency Response Deficiencies
22, 4_Maritime Safety Violations and Consequences aboard the Easy Rider
13, 5_Propulsion System Failures and Maintenance Issues in Marine Operations
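
The cluster sizes printed above can be turned into a quick balance check. This is a minimal sketch, not part of the original pipeline: `largest_cluster_share` is a hypothetical helper, and the counts are copied by hand from the printout, keyed by mode index in the order printed.

```python
def largest_cluster_share(counts):
    """Return the fraction of documents that fall in the biggest cluster."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no documents in any cluster")
    return max(counts) / total


# Counts copied by hand from the printout above, keyed by mode index (assumed order).
cluster_counts = {
    0: [54, 34, 62, 25, 31],
    1: [166, 19],
    2: [28, 33, 43, 30, 22, 13],
}

shares = {mode: largest_cluster_share(counts) for mode, counts in cluster_counts.items()}

for mode, share in shares.items():
    print(f"mode {mode}: largest cluster holds {share:.0%} of the safety issues")
```

For the second printout the largest cluster holds roughly 90% of the documents (166 of 185), so that split is much coarser than the other two.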


In [None]:
checking = results[2][1]

In [None]:
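# Merge the per-mode models. A topic from a later model is added as a new topic
# only if its similarity to every topic already in the merge is below min_similarity.
merged_moode_models = BERTopic.merge_models([result[0] for result in results], min_similarity=0.9)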


merged_moode_models.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,52,0_Aviation Safety and Air Traffic Control Proc...,[Aviation Safety and Air Traffic Control Proce...,
1,1,47,1_Helicopter Safety and Accident Analysis,[Helicopter Safety and Accident Analysis],
2,2,41,2_Aviation Safety and Compliance Issues,[Aviation Safety and Compliance Issues],
3,3,33,3_Aircraft Safety and Maintenance Issues,[Aircraft Safety and Maintenance Issues],
4,4,17,4_Aviation Safety and Regulatory Compliance in...,[Aviation Safety and Regulatory Compliance in ...,
5,5,16,5_Safety and Regulatory Issues in New Zealand ...,[Safety and Regulatory Issues in New Zealand A...,
6,6,44,0_KiwiRail Safety and Compliance Issues,[KiwiRail Safety and Compliance Issues],
7,7,40,1_Rail Safety and Communication Failures,[Rail Safety and Communication Failures],
8,8,25,2_Safety and Management Issues in Rail Operations,[Safety and Management Issues in Rail Operations],
9,9,20,3_Safety and Regulatory Issues at Road-Rail Le...,[Safety and Regulatory Issues at Road-Rail Lev...,
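
In the merged table, topics 6 to 9 keep the `Name` prefix of the model they came from (merged topic 6 is still named `0_KiwiRail ...`), so the prefix no longer matches the `Topic` column. Below is a minimal sketch for spotting such rows, assuming names follow the `<index>_<label>` pattern seen above. `split_topic_name` is a hypothetical helper, and `merged_names` is a hand-copied subset of the table rather than the output of `get_topic_info()`.

```python
def split_topic_name(name):
    """Split a name such as '0_KiwiRail Safety' into (0, 'KiwiRail Safety')."""
    prefix, sep, label = name.partition("_")
    if not sep or not prefix.lstrip("-").isdigit():
        raise ValueError(f"unexpected topic name: {name!r}")
    return int(prefix), label


# Hand-copied subset of the merged table: merged topic id -> Name (assumed sample).
merged_names = {
    3: "3_Aircraft Safety and Maintenance Issues",
    6: "0_KiwiRail Safety and Compliance Issues",
    7: "1_Rail Safety and Communication Failures",
}

# Topics whose name prefix differs from their merged id came from a later model.
renumbered = {
    topic_id: name
    for topic_id, name in merged_names.items()
    if split_topic_name(name)[0] != topic_id
}

for topic_id, name in renumbered.items():
    print(topic_id, "->", split_topic_name(name)[1])
```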


# Visualization of themes and safety issues

Now that we have some models that seem reasonable, it is time to create a user-friendly representation.

In [63]:
modes_dfs = [openai_embeddings[openai_embeddings['mode'] == i].reset_index(drop=True) for i in range(3)]

pd.concat(modes_dfs)

Unnamed: 0,report_id,si,mode,si_embedding
0,2011_003,The New Zealand regulatory system has not prov...,0,"[0.0187440924346447, -0.000433413457358256, -0..."
1,2011_003,The format of the Robinson R22 helicopter flig...,0,"[0.01013844646513462, -0.03145159035921097, -0..."
2,2011_003,The rate of R22 in-flight break-up accidents i...,0,"[0.005347656551748514, -0.022685393691062927, ..."
3,2011_003,"The crashworthiness of the ELT, which was desi...",0,"[0.014976576901972294, 0.015324870124459267, -..."
4,2010_010,The failure of the nose landing gear to extend...,0,"[-0.0042054359801113605, 0.04125332459807396, ..."
...,...,...,...,...
164,2017_203,Technicians who are authorised to conduct mand...,2,"[0.002318679355084896, 0.015887508168816566, -..."
165,2013_201,The firefighting drills held on board the Taok...,2,"[0.006056208163499832, 0.01051066443324089, -0..."
166,2014_201,crew awareness of the operating limitations of...,2,"[-0.029451534152030945, 0.026009364053606987, ..."
167,2014_201,crew operating knowledge of on-board emergency...,2,"[-0.021512825042009354, 0.029569942504167557, ..."


In [85]:
topic_model = BERTopic.load("demo_merged_model")

all_data = pd.concat(modes_dfs)

array_embeddings = column_to_2darray(all_data['si_embedding'])

reduced_array_embeddings = UMAP(n_neighbors=7, n_components=2, min_dist=0.0, metric='cosine', random_state=42).fit_transform(array_embeddings)

visualization = topic_model.visualize_documents(all_data['si'].to_list(), embeddings=array_embeddings, reduced_embeddings=reduced_array_embeddings)

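# Pass the path directly so Plotly opens the file itself and writes it as UTF-8,
# instead of relying on the platform's default text encoding.
visualization.write_html(os.path.join('topic_visuals', 'demo_merged_model_visual.html'))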

visualization



In [80]:
demo_individual_models = [BERTopic.load(f"demo_individual_model_mode_{i}") for i in range(3)]

for i, (model, df) in enumerate(zip(demo_individual_models, modes_dfs)):
    array_embeddings = column_to_2darray(df['si_embedding'])

    reduced_array_embeddings = UMAP(n_neighbors=3, n_components=2, min_dist=0.0, metric='cosine', random_state=42).fit_transform(array_embeddings)

    visualization = model.visualize_documents(df['si'].to_list(), embeddings=array_embeddings, reduced_embeddings=reduced_array_embeddings)

    # Pass the path directly so Plotly opens the file itself and writes it as UTF-8.
    visualization.write_html(os.path.join('topic_visuals', f'demo_individual_model_mode_{i}_visual.html'))

    display(visualization)



In [87]:
topic_model = BERTopic.load("demo_group_model")

all_data = pd.concat(modes_dfs)

array_embeddings = column_to_2darray(all_data['si_embedding'])

reduced_array_embeddings = UMAP(n_neighbors=5, n_components=2, min_dist=0.0, metric='cosine', random_state=42).fit_transform(array_embeddings)

visualization = topic_model.visualize_documents(all_data['si'].to_list(), embeddings=array_embeddings, reduced_embeddings=reduced_array_embeddings)

# Pass the path directly so Plotly opens the file itself and writes it as UTF-8.
visualization.write_html(os.path.join('topic_visuals', 'demo_group_model_visual.html'))

visualization
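
The three visualization cells above repeat the same save step and assume the `topic_visuals` directory already exists. A minimal sketch of a shared helper follows. `save_visual` is a hypothetical name, and the only assumption about `figure` is that it exposes `write_html(path)`, as the Plotly figures returned by `visualize_documents` do.

```python
import os


def save_visual(figure, name, out_dir="topic_visuals"):
    """Write figure to <out_dir>/<name>_visual.html and return the path."""
    # Create the output directory on first use instead of assuming it exists.
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{name}_visual.html")
    figure.write_html(path)
    return path


# Example call in this notebook (not executed here):
# save_visual(visualization, "demo_group_model")
```

With this helper the file names stay in the `<model name>_visual.html` pattern used above, and a fresh checkout without the output folder would not fail at the save step.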

