Code adapted from the article below on building a knowledge graph with KeyBERT, HDBSCAN, and Zephyr-7B-Beta:

https://towardsdatascience.com/leverage-keybert-hdbscan-and-zephyr-7b-beta-to-build-a-knowledge-graph-33d7534ee01b

In [2]:
import torch

In [3]:
# Required imports
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the model and the tokenizer
model_name_or_path = "TheBloke/zephyr-7B-beta-GPTQ"

llm = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                           device_map="cuda",
                                           trust_remote_code=False,
                                           revision="main")  # change revision for a different branch
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,
                                          use_fast=True)

In [4]:
import umap.umap_ as umap

import hdbscan

import pandas as pd
import numpy as np
import re
import pickle

from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('all-MiniLM-L6-v2')

from keybert.llm import TextGeneration
from keybert import KeyLLM, KeyBERT

import random

from tqdm import tqdm
tqdm.pandas()

# Load the news articles (a one-week dump) into a dataframe
df = pd.read_parquet(r"12JAN24 - One week news dump.parquet")

In [6]:
kw_model = KeyBERT()

In [7]:
doc = """The Bronx (/brɒŋks/) is a borough of New York City, coextensive with Bronx County, in the U.S. state of New York. It is south of Westchester County; north and east of the New York City borough of Manhattan, across the Harlem River; and north of the New York City borough of Queens, across the East River. The Bronx has a land area of 42 square miles (109 km2) and a population of 1,472,654 in the 2020 census.[1] If each borough were ranked as a city, the Bronx would rank as the ninth-most-populous in the U.S. Of the five boroughs, it has the fourth-largest area, fourth-highest population, and third-highest population density.[4] The population density of the Bronx was 32,718.7 inhabitants per square mile (12,632.8/km2) in 2022, the third-highest population density of any county in the United States, behind Manhattan and Brooklyn.[5] It is the only borough of New York City not primarily on an island. With a population that is 54.8% Hispanic as of 2020, it is the only majority-Hispanic county in the Northeastern United States and the fourth-most-populous nationwide.[6]

The Bronx is divided by the Bronx River into a hillier section in the west, and a flatter eastern section. East and west street names are divided by Jerome Avenue. The West Bronx was annexed to New York City in 1874, and the areas east of the Bronx River in 1895.[7] Bronx County was separated from New York County (modern-day Manhattan) in 1914.[8] About a quarter of the Bronx's area is open space,[9] including Woodlawn Cemetery, Van Cortlandt Park, Pelham Bay Park, the New York Botanical Garden, and the Bronx Zoo in the borough's north and center. The Thain Family Forest at the New York Botanical Garden is thousands of years old and is New York City's largest remaining tract of the original forest that once covered the city.[10] These open spaces are primarily on land reserved in the late 19th century as urban development progressed north and east from Manhattan.

The word "Bronx" originated with Swedish-born (or Faroese-born) Jonas Bronck, who established the first European settlement in the area as part of the New Netherland colony in 1639.[11][12][13] European settlers displaced the native Lenape after 1643. In the 19th and 20th centuries, the Bronx received many immigrant and migrant groups as it was transformed into an urban community, first from European countries particularly Ireland, Germany, Italy, and Eastern Europe, and later from the Caribbean region (particularly Puerto Rico, Trinidad, Haiti, Guyana, Jamaica, Barbados, and the Dominican Republic), and immigrants from West Africa (particularly from Ghana and Nigeria), African American migrants from the Southern United States, Panamanians, Hondurans, and South Asians.[14]

The Bronx contains the poorest congressional district in the United States, New York's 15th. There are, however, some upper-income, as well as middle-income neighborhoods such as Riverdale, Fieldston, Spuyten Duyvil, Schuylerville, Pelham Bay, Pelham Gardens, Morris Park, and Country Club.[15][16][17] Parts of the Bronx saw a steep decline in population, livable housing, and quality of life starting from the mid-to-late 1960s, continuing throughout the 1970s and into the 1980s, ultimately culminating in a wave of arson in the late 1970s, a period when hip hop music evolved.[18] The South Bronx, in particular, experienced severe urban decay. The borough began experiencing new population growth starting in the late 1990s and continuing to the present day.[19]"""

In [8]:
keywords = kw_model.extract_keywords(doc)

In [9]:
keywords

[('bronx', 0.6409),
 ('borough', 0.5789),
 ('westchester', 0.4554),
 ('brooklyn', 0.4442),
 ('york', 0.4363)]

In [16]:
# Retain the articles titles only for analysis
titles_list = df.title.tolist()

# Process the documents and collect the results
titles_kws = kw_model.extract_keywords(titles_list, keyphrase_ngram_range=(1, 3))

# Add the results to df
df["titles_keywords"] = titles_kws

In [17]:
df["titles_keywords"]

0        [(policing 2024 debt, 0.7905), (stuff policing...
1        [(inflation surge end, 0.731), (great inflatio...
2        [(barn dodge viper, 0.8066), (dodge viper nigh...
3        [(review ford bronco, 0.7184), (ford bronco sp...
4        [(flesh eating bacteria, 0.6753), (eating bact...
                               ...                        
24098    [(obama wins creative, 0.6589), (creative arts...
24099    [(2023 care court, 0.6707), (court bankrolling...
24100    [(electric vehicle tax, 0.7099), (vehicle tax ...
24101    [(starbucks wants 55, 0.5929), (stores 2030 ex...
24102    [(2024 election guide, 0.8422), (election guid...
Name: titles_keywords, Length: 24103, dtype: object

In [18]:
df_kws = df.titles_keywords.tolist()

flat_keys = [item[0] for sublist in df_kws for item in sublist]

flat_keys = list(set(flat_keys))
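
The flattening step above can be tried on toy data without loading KeyBERT. This is a minimal sketch: the phrases and scores are made up, and `toy_keywords` is a hypothetical stand-in for `df["titles_keywords"]`, assuming each entry is a list of `(phrase, score)` tuples as in the output shown earlier.

```python
# Toy stand-in for df["titles_keywords"]: one list of (phrase, score) tuples per title.
# All phrases and scores here are invented for illustration.
toy_keywords = [
    [("election guide", 0.84), ("2024 election", 0.80)],
    [("electric vehicle tax", 0.71), ("election guide", 0.55)],
    [],  # an empty entry, to show the comprehension tolerates it
]

# Keep only the phrase and drop the score.
flat = [phrase for per_title in toy_keywords for phrase, _score in per_title]

# set() removes duplicates but does not preserve order;
# sorted() makes the result reproducible between runs.
unique_keys = sorted(set(flat))
print(unique_keys)
```

Sorting is optional. It only matters if the row order of `keys_df` should be the same between runs.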

In [19]:
keys_df = pd.DataFrame(flat_keys, columns = ['key'])

In [20]:
torch.cuda.is_available()

True

In [21]:
torch.cuda.device_count()

1

In [23]:
torch.cuda.get_device_name(0)

'NVIDIA GeForce GTX 1080 Ti'

In [24]:
torch.cuda.current_device()

0

In [25]:
# Instantiate the embedding model on the GPU
model = SentenceTransformer('all-mpnet-base-v2', device='cuda')

# Embed the keywords and keyphrases into a 768-dimensional vector space.
# Row-wise alternative (much slower):
# keys_df['key_bert'] = keys_df['key'].progress_apply(lambda x: model.encode(x))

# Encoding the whole list in batches is much faster than a row-wise apply
keys_df['key_bert'] = list(model.encode(keys_df['key'].tolist(), convert_to_tensor=False, show_progress_bar=True))

In [26]:
# Reduce to 15-dimensional vectors using a local neighborhood of 25 points
embeddings = umap.UMAP(n_neighbors=25,   # balances local vs. global structure
                       n_components=15,  # dimension of the reduced vectors
                       metric='cosine', verbose=True).fit_transform(list(keys_df.key_bert))

UMAP(angular_rp_forest=True, metric='cosine', n_components=15, n_neighbors=25, verbose=True)
Sun Jan 14 21:41:47 2024 Construct fuzzy simplicial set
Sun Jan 14 21:41:47 2024 Finding Nearest Neighbors
Sun Jan 14 21:41:47 2024 Building RP forest with 18 trees
Sun Jan 14 21:41:54 2024 NN descent for 16 iterations
	 1  /  16
	 2  /  16
	 3  /  16
	 4  /  16
	 5  /  16
	Stopping threshold met -- exiting after 5 iterations
Sun Jan 14 21:42:14 2024 Finished Nearest Neighbor Search
Sun Jan 14 21:42:17 2024 Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Sun Jan 14 21:42:48 2024 Finished embedding


In [27]:
# Add the reduced embedding vectors to the dataframe
keys_df['key_umap'] = embeddings.tolist()

In [28]:
# Minimum cluster size: 0.2% of the number of unique keyphrases
min_df = len(keys_df) // 500

In [29]:
min_df

138

In [30]:
import time

In [34]:
start_time = time.time()

# Initialize the clustering model
clusterer = hdbscan.HDBSCAN(algorithm='best',
                            prediction_data=True,
                            approx_min_span_tree=True,
                            gen_min_span_tree=True,
                            min_cluster_size=min_df,
                            cluster_selection_epsilon = .1,
                            min_samples=1,
                            p=None,
                            metric='euclidean',
                            cluster_selection_method='leaf')

# Fit the data
clusterer.fit(embeddings)

# Create soft clusters
soft_clusters = hdbscan.all_points_membership_vectors(clusterer)

# Add the soft cluster information to the data
closest_clusters = [np.argmax(x) for x in soft_clusters]
keys_df['cluster'] = closest_clusters

print(f">> Ran in {(time.time()-start_time)/60} minutes")

>> Ran in 0.9396579265594482 minutes
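
To make the hard-assignment step above concrete, here is a small sketch in plain Python. The membership values are invented, and `argmax` is a hypothetical helper standing in for `np.argmax`. It assumes each row holds one membership strength per cluster, which is how the notebook uses the result of `all_points_membership_vectors`.

```python
# Made-up membership vectors: one row per keyphrase, one column per cluster.
membership = [
    [0.05, 0.70, 0.10],
    [0.40, 0.35, 0.05],
    [0.01, 0.02, 0.03],  # weak everywhere, but still assigned to a cluster
]

def argmax(row):
    """Index of the largest value; the first index wins on ties."""
    return max(range(len(row)), key=row.__getitem__)

# Every row gets the id of its strongest cluster, so ids run from 0 upward
# and no row is marked as noise.
closest = [argmax(row) for row in membership]
print(closest)
```

One consequence of this choice is that even keyphrases with very weak membership everywhere end up in some cluster, whereas `clusterer.labels_` would mark noise points with -1.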


In [35]:
max(closest_clusters)

120

In [37]:
i = random.choice(closest_clusters)
keys_df[keys_df['cluster']==i][['key', 'cluster']]

                             key  cluster
35            jennifer lopez ben      110
40     jennifer lawrence happier      110
78          natalie portman goes      110
153               cate blanchett      110
158     christina aguilera shows      110
...                          ...      ...
68685        meghan markle suits      110
68769    amanda abbington claims      110
68791                megyn kelly      110
68822       worst dressed selena      110


In [None]:
generator = pipeline(
    model=llm,
    tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1,
)

def extract_description(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """
    Prompt the LLM for a label and a short description of cluster `n`,
    and return that cluster's rows with `label` and `description` columns added.
    """

    one_cluster = df[df['cluster']==n]
    one_cluster_copy = one_cluster.copy()
    sample = one_cluster_copy.key.tolist()

    prompt_clusters= f"""
    <|system|>
    I have the following list of keywords and keyphrases:
    ['encryption','attribute','firewall','security properties',
    'network security','reliability','surveillance','distributed risk factors',
    'still vulnerable','cryptographic','protocol','signaling','safe',
    'adversary','message passing','input-determined guards','secure communication',
    'vulnerabilities','value-at-risk','anti-spam','intellectual property rights',
    'countermeasures','security implications','privacy','protection',
    'mitigation strategies','vulnerability','secure networks','guards']

    Based on the information above, first name the domain these keywords or keyphrases belong to, secondly
    give a brief description of the domain.
    Do not use more than 30 words for the description!
    Do not provide details!
    Do not give examples of the contexts, do not say 'such as' and do not list the keywords or the keyphrases!
    Do not start with a statement of the form 'These keywords belong to the domain of' or with 'The domain'.

    Cybersecurity: Cybersecurity, emphasizing methods and strategies for safeguarding digital information
    and networks against unauthorized access and threats.
    </s>

    <|user|>
    I have the following list of keywords and keyphrases:
    {sample}
    Based on the information above, first name the domain these keywords or keyphrases belong to, secondly
    give a brief description of the domain.
    Do not use more than 30 words for the description!
    Do not provide details!
    Do not give examples of the contexts, do not say 'such as' and do not list the keywords or the keyphrases!
    Do not start with a statement of the form 'These keywords belong to the domain of' or with 'The domain'.
    <|assistant|>
    """

    # Generate the outputs
    outputs = generator(prompt_clusters,
                    max_new_tokens=120,
                    do_sample=True,
                    temperature=0.1,
                    top_k=10,
                    top_p=0.95)

    text = outputs[0]["generated_text"]

    # Marker that separates the prompt from the generated answer
    pattern = "<|assistant|>\n"

    # Keep only the text generated after the marker
    response = text.split(pattern, 1)[1].strip(" ")
    # Check if the output has the desired format
    if len(response.split(":", 1)) == 2:
        label  = response.split(":", 1)[0].strip(" ")
        description = response.split(":", 1)[1].strip(" ")
    else:
        label = description = response

    # Add the description and the labels to the dataframe
    one_cluster_copy.loc[:, 'description'] = description
    one_cluster_copy.loc[:, 'label'] = label

    return one_cluster_copy
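
The label/description parsing inside `extract_description` can be checked in isolation, without running the model. The helper below is a hypothetical sketch (`split_label_description` is not part of the notebook) that follows the same rule: split on the first colon, and fall back to using the whole reply for both fields when there is none. Unlike the notebook code, it strips all surrounding whitespace, not only spaces.

```python
from typing import Tuple

def split_label_description(response: str) -> Tuple[str, str]:
    """Hypothetical helper mirroring the label/description parsing above."""
    label, sep, description = response.partition(":")
    if not sep:
        # No colon in the reply: reuse the whole text for both fields.
        cleaned = response.strip()
        return cleaned, cleaned
    # Only the first colon separates label from description.
    return label.strip(), description.strip()

print(split_label_description("Celebrities: News about famous people."))
print(split_label_description("A reply without the expected separator"))
```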

In [39]:
from tqdm import tqdm
tqdm.pandas()

In [None]:
# Initialize an empty list to store the cluster dataframes
dataframes = []
# Cluster ids come from the soft-clustering argmax, so there is no -1 noise label to skip
cluster_ids = sorted(keys_df.cluster.unique())

# Label every cluster that actually received keyphrases
for n in tqdm(cluster_ids):
    df_result = extract_description(keys_df, n)
    dataframes.append(df_result)

# Concatenate the individual dataframes
final_df = pd.concat(dataframes, ignore_index=True)

  2%|█▉                                                                             | 3/120 [07:41<4:56:54, 152.26s/it]