
# BERTopic Model Analysis
In this last step, we answer our questions using two previously trained and saved topic models: one for work plans and one for reports.

BERTopic is unsupervised, and its results vary every time a model is fit, so the topics can change from run to run. The analysis here is therefore done on a fixed, finished set of saved models.

In [1]:
from IPython.display import clear_output

In [2]:
#Install missing packages
!pip install pyLDAvis
!pip install bertopic
!pip install bertopic[visualization]
!python -m spacy download en_core_web_md -qq
clear_output()

In [3]:
###########################
# Import Packages         #
###########################
import pandas as pd
import sklearn
import nltk
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import pyLDAvis.gensim_models
import en_core_web_md
import gensim
import random
import os
import pyLDAvis
from nltk.tokenize import RegexpTokenizer
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaMulticore
from gensim import corpora, models
from gensim.test.utils import datapath
import warnings
import textwrap

warnings.filterwarnings('ignore')

# Set options for specific packages
nltk.download(['punkt', 
               'stopwords',
               'wordnet',
               'omw-1.4'])

# Visualise inside a notebook
pyLDAvis.enable_notebook()

sns.set()

clear_output()

In [None]:
# LemmaTokenizer was used when the models were built; it is defined again here
# so the saved models can reference it when they are loaded.

# CountVectorizer removes stopwords after processing (passed to BERTopic as vectorizer_model)
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize

from bertopic import BERTopic

#Lemmatizer for cleaning text
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

In [4]:
#Change to the models folder
os.chdir("/content/drive/MyDrive/DATA_606/Models")

# 1. Load Models
- Load reports_model, workplan_model from GDrive
- Show the models that were used

In [6]:
#########################
# Load Models
#########################
from bertopic import BERTopic
try:
    reports_model = BERTopic.load("reports_model")
    workplan_model = BERTopic.load("workplan_model")

except Exception as e:
    print(f"Models did not load: {e}")
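
The `try`/`except` above prints a message and carries on, so a missing model would only surface later as a `NameError`. A minimal sketch of a fail-fast alternative follows. The helper `load_models` and the stand-in `fake_loader` are hypothetical names introduced for illustration; in the notebook the loader argument would be `BERTopic.load`.

```python
from typing import Any, Callable, Dict, Iterable


def load_models(names: Iterable[str], loader: Callable[[str], Any]) -> Dict[str, Any]:
    """Load every named model and raise one error listing all failures."""
    models: Dict[str, Any] = {}
    failures: Dict[str, Exception] = {}
    for name in names:
        try:
            models[name] = loader(name)
        except Exception as exc:  # collect every failure, report them together
            failures[name] = exc
    if failures:
        details = "; ".join(f"{name}: {exc!r}" for name, exc in failures.items())
        raise RuntimeError(f"Models did not load: {details}")
    return models


# Stand-in loader so the sketch runs without the saved files (assumption);
# in the notebook, pass BERTopic.load instead.
def fake_loader(path: str) -> str:
    if path == "missing_model":
        raise FileNotFoundError(path)
    return f"<model {path}>"


loaded = load_models(["reports_model", "workplan_model"], fake_loader)
print(sorted(loaded))
```

Raising instead of printing stops the notebook at the cell that actually failed, which keeps the later analysis cells from running against undefined variables.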

In [20]:
workplan_model.get_params()

{'calculate_probabilities': False,
 'diversity': None,
 'embedding_model': <bertopic.backend._sentencetransformers.SentenceTransformerBackend at 0x7ff0d90f8950>,
 'hdbscan_model': HDBSCAN(min_cluster_size=10, prediction_data=True),
 'language': 'english',
 'low_memory': False,
 'min_topic_size': 10,
 'n_gram_range': (1, 2),
 'nr_topics': None,
 'seed_topic_list': None,
 'top_n_words': 10,
 'umap_model': UMAP(angular_rp_forest=True, low_memory=False, metric='cosine', min_dist=0.0, n_components=5, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True}),
 'vectorizer_model': CountVectorizer(ngram_range=(1, 2),
                 stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                             'ourselves', 'you', "you're", "you've", "you'll",
                             "you'd", 'your', 'yours', 'yourself', 'yourselves',
                             'he', 'him', 'his', 'himself', 'she

In [7]:
#########################
# Model Info
#########################
print(reports_model.get_params())
print(workplan_model.get_params())

{'calculate_probabilities': False, 'diversity': None, 'embedding_model': <bertopic.backend._sentencetransformers.SentenceTransformerBackend object at 0x7ff0da0af610>, 'hdbscan_model': HDBSCAN(min_cluster_size=10, prediction_data=True), 'language': 'english', 'low_memory': False, 'min_topic_size': 10, 'n_gram_range': (1, 2), 'nr_topics': None, 'seed_topic_list': None, 'top_n_words': 10, 'umap_model': UMAP(angular_rp_forest=True, low_memory=False, metric='cosine', min_dist=0.0, n_components=5, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True}), 'vectorizer_model': CountVectorizer(ngram_range=(1, 2),
                stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's

In [8]:
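#########################
# Basic Topics
#########################
#Show all topics as sentences:
def show_all_topics(model):
    """Print each topic's top words on one line, skipping the outlier topic (-1)."""
    topics = model.get_topics()  # fetch once: {topic_id: [(word, score), ...]}
    for topic_id in sorted(topics):
        if topic_id == -1:
            continue
        words = ' '.join(word for word, _ in topics[topic_id])
        print(f"{topic_id}.{words} ")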

In [17]:
#Show the topics from workplans
show_all_topics(workplan_model)

0.medicare payment hospital service claim inpatient physician beneficiary part overpayment 
1.drug part medicare manufacturer price rebate beneficiary hospice amount medicaid 
2.state medicaid care service federal managed care managed payment mcos provider 
3.financial audit agency card federal financial statement statement act improper charge 
4.preparedness emergency covid19 home nursing home nursing response disease infection emergency preparedness 
5.child state family program check foster childcare care foster care background 
6.health pepfar fund cdc provider pepfar fund relief covid19 control health center 
7.mfcus medicaid fraud state mfcu data abuse state medicaid report provides 
8.cybersecurity security information technology gda data device federal opdivs cloud 
9.nih institute research national institute grant grantee institute health national foreign health 
10.child orr unaccompanied refugee unaccompanied child facility resettlement refugee resettlement resettlement orr 

In [18]:
#Show the topics from reports
show_all_topics(reports_model)

0.medicaid federal state payment new million new york york service beneficiary 
1.medicare claim payment hospital service requirement beneficiary cm overpayment inpatient 
2.cost medicare pension prb cys claimed icps 2016 prb cost asset 
3.drug rebate part physicianadministered manufacturer physicianadministered drug formulary medicaid federal share federal 
4.grant nih fund head start head start federal grantee audit acf 
5.incident abuse neglect abuse neglect critical incident state critical potential abuse group home home 
6.opioid opioid use use opioids treatment otps crisis beneficiary opioid crisis overdose 
7.cm mao diagnosis data payment provider diagnosis code code risk record 
8.check background check background child child care provider criminal background criminal care state 
9.nursing home telehealth nursing home covid19 pandemic cm state staffing resident 
10.price drug quarter substitution price substitution cm previous 5 percent 5 data 
11.ihs patient policy health faci

In [11]:
workplan_model.get_topic_info()[1:]

| | Topic | Count | Name |
|---|---|---|---|
| 1 | 0 | 91 | 0_medicare_payment_hospital_service |
| 2 | 1 | 74 | 1_drug_part_medicare_manufacturer |
| 3 | 2 | 71 | 2_state_medicaid_care_service |
| 4 | 3 | 40 | 3_financial_audit_agency_card |
| 5 | 4 | 38 | 4_preparedness_emergency_covid19_home |
| 6 | 5 | 32 | 5_child_state_family_program |
| 7 | 6 | 25 | 6_health_pepfar_fund_cdc |
| 8 | 7 | 23 | 7_mfcus_medicaid_fraud_state |
| 9 | 8 | 22 | 8_cybersecurity_security_information_technology |
| 10 | 9 | 22 | 9_nih_institute_research_national institute |


In [12]:
reports_model.get_topic_info()[1:]

| | Topic | Count | Name |
|---|---|---|---|
| 1 | 0 | 85 | 0_medicaid_federal_state_payment |
| 2 | 1 | 59 | 1_medicare_claim_payment_hospital |
| 3 | 2 | 39 | 2_cost_medicare_pension_prb |
| 4 | 3 | 38 | 3_drug_rebate_part_physicianadministered |
| 5 | 4 | 36 | 4_grant_nih_fund_head start |
| 6 | 5 | 31 | 5_incident_abuse_neglect_abuse neglect |
| 7 | 6 | 26 | 6_opioid_opioid use_use_opioids |
| 8 | 7 | 23 | 7_cm_mao_diagnosis_data |
| 9 | 8 | 22 | 8_check_background check_background_child |
| 10 | 9 | 21 | 9_nursing home_telehealth_nursing_home |


## Visualize Top-5 Topics
- Compare the five largest topics of the work plan and reports models, showing the top five words for each

In [93]:
workplan_model.visualize_barchart(n_words = 5,
                                  width=300, 
                                  height=300,
                                  top_n_topics = 5)

In [92]:
reports_model.visualize_barchart(n_words = 5,
                                 width=300, 
                                 height=300,
                                 top_n_topics = 5)
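
The bar charts support a visual comparison; a rough numeric one is also possible. The sketch below scores topic pairs across the two models by the Jaccard overlap of their top terms. The helpers `jaccard` and `best_matches` are hypothetical names, and the term lists are copied from the printed topics above (workplan topics 0 and 1, reports topics 1 and 3), with the space-joined output split into unigrams and bigrams on a best-effort basis.

```python
from typing import Dict, List, Tuple


def jaccard(a: List[str], b: List[str]) -> float:
    """Share of distinct terms that two term lists have in common."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def best_matches(
    left: Dict[int, List[str]], right: Dict[int, List[str]]
) -> Dict[int, Tuple[int, float]]:
    """For each left topic, return the (right topic id, score) with the highest overlap.

    Assumes `right` is not empty.
    """
    matches = {}
    for left_id, left_words in left.items():
        score, right_id = max(
            (jaccard(left_words, right_words), right_id)
            for right_id, right_words in right.items()
        )
        matches[left_id] = (right_id, score)
    return matches


# Terms copied from the show_all_topics output above.
# In the notebook: {t: [w for w, _ in ws] for t, ws in model.get_topics().items()}
workplan_words = {
    0: ["medicare", "payment", "hospital", "service", "claim",
        "inpatient", "physician", "beneficiary", "part", "overpayment"],
    1: ["drug", "part", "medicare", "manufacturer", "price",
        "rebate", "beneficiary", "hospice", "amount", "medicaid"],
}
reports_words = {
    1: ["medicare", "claim", "payment", "hospital", "service",
        "requirement", "beneficiary", "cm", "overpayment", "inpatient"],
    3: ["drug", "rebate", "part", "physicianadministered", "manufacturer",
        "physicianadministered drug", "formulary", "medicaid",
        "federal share", "federal"],
}

matches = best_matches(workplan_words, reports_words)
for workplan_id, (reports_id, score) in matches.items():
    print(f"workplan {workplan_id} -> reports {reports_id} (Jaccard {score:.2f})")
```

Jaccard overlap looks only at shared terms, not at their c-TF-IDF weights, so it is a coarse signal for deciding which topic pairs are worth reading side by side.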

## Simple function for topic lookup to connect topic --> Document

In [88]:
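#############################
# Show Representative Documents
#############################

def show_rep_topics(modl):
    """List all topics, prompt for one, and print its representative documents."""
    show_all_topics(modl)

    topic_select = int(input("Select a topic for representative documents: "))

    clear_output()

    # Look up the topic name by id rather than by row position, then drop
    # the leading id (e.g. "6_opioid_opioid use" -> "opioid,opioid use")
    topic_info = modl.get_topic_info()
    topic_name = topic_info.loc[topic_info['Topic'] == topic_select, 'Name'].iloc[0]
    topic_label = ','.join(topic_name.split("_")[1:])

    for idx, doc in enumerate(modl.get_representative_docs(topic_select)):
        print(f"Document {idx+1}: {topic_label}")
        print(textwrap.fill(str(doc), width = 80))
        print("")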
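
`show_rep_topics` depends on `input()`, which blocks when the notebook runs non-interactively. A minimal sketch of a keyword lookup that could stand in for the prompt is shown below. The helper `topics_matching` is a hypothetical name, and the term lists are copied from the reports output above (topics 3 and 6), split into unigrams and bigrams on a best-effort basis.

```python
from typing import Dict, List


def topics_matching(topic_words: Dict[int, List[str]], keyword: str) -> List[int]:
    """Return ids of topics where any top term contains the keyword."""
    keyword = keyword.lower()
    return sorted(
        topic_id
        for topic_id, words in topic_words.items()
        if topic_id != -1 and any(keyword in word.lower() for word in words)
    )


# Terms copied from the reports_model output above.
# In the notebook: {t: [w for w, _ in ws] for t, ws in model.get_topics().items()}
reports_words = {
    3: ["drug", "rebate", "part", "physicianadministered", "manufacturer",
        "physicianadministered drug", "formulary", "medicaid",
        "federal share", "federal"],
    6: ["opioid", "opioid use", "use", "opioids", "treatment",
        "otps", "crisis", "beneficiary", "opioid crisis", "overdose"],
}

print(topics_matching(reports_words, "opioid"))  # prints [6]
```

Each returned id can then be passed to `get_representative_docs`, as the function above does with the typed selection.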

## Lookup topics: Displays the topics, then shows the top-3 representative documents for the selected topic

In [90]:
#Show for opioids
show_rep_topics(workplan_model)

Document 1: opioid,overdose,opioids,death
In 2017, there were an estimated 49,000 opioid-related overdose deaths in the
United States.  In a recent data brief, Opioid Use in Medicare Part D Remains
Concerning, OIG found that about 71,000 Part D beneficiaries were at serious
risk of misuse or overdose in 2017.  Gaining a deeper understanding of the
beneficiaries OIG identified as at serious risk of misuse or overdose is an
important next step in addressing the crisis.  This study will provide needed
information about: (1) the characteristics of these beneficiaries, including
their demographics and diagnoses; (2) the opioid utilization of these
beneficiaries; and (3) the extent to which these beneficiaries have had adverse
health effects related to opioids and any overdose incidents.

Document 2: opioid,overdose,opioids,death
The opioid crisis remains a public health emergency.  In 2018, there were nearly
47,000 opioid-related overdose deaths in the United States.  Identifying
patients w

In [89]:
#Show for opioids
show_rep_topics(reports_model)

Document 1: opioid,opioid use,use,opioids
<a href="/newsroom/media-materials/2017/2017-takedown.asp">Media Materials</a>,
Opioid abuse and overdose deaths are at epidemic levels in the United States.
This data brief is part of a larger strategy by the OIG to fight the opioid
crisis and address one of its top priority outcomes-to protect beneficiaries
from prescription drug abuse.  It provides baseline data on the extent to which
beneficiaries receive extreme amounts of opioids and appear to be "doctor
shopping."  It also identifies prescribers who have questionable opioid
prescribing patterns.  , We based this data brief on an analysis of prescription
drug event records of opioids received in 2016.  We determined beneficiaries'
morphine equivalent dose (MED), which is a measure that equates all of the
various opioids and strengths into one standard value. , Ensuring the
appropriate use and prescribing of opioids is essential to protecting the health
and safety of beneficiaries and the 