# Example Usage of GPTopic: 20 Newsgroups Dataset

In this notebook, we use the 20 Newsgroups dataset to demonstrate the use of the gptopic package.

In [138]:
from gptopic.GPTopic import GPTopic
from sklearn.datasets import fetch_20newsgroups 

In [139]:
# Provide your own API key here. (Note: this code only works if the OPENAI_API_KEY environment variable is set.)
import os
api_key_openai = os.environ.get('OPENAI_API_KEY') 
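
Because `os.environ.get` silently returns `None` when the variable is missing, it can help to fail early with a clear message. The following is a minimal sketch; the helper name `require_api_key` is hypothetical and not part of gptopic.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing or empty.

    Hypothetical helper for illustration; not part of the gptopic package.
    """
    key = env.get(name)
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return key


# Demonstration with a plain dict standing in for the real environment.
demo_key = require_api_key({"OPENAI_API_KEY": "sk-demo"})
```

In the notebook itself, `api_key_openai = require_api_key()` would read the real environment instead of the demo dictionary.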

### Load Data

In [154]:
data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes')) #download the 20 Newsgroups dataset
corpus = data['data']
corpus = [doc for doc in corpus if doc != ""]
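
The filter above drops only documents that are exactly the empty string. After headers, footers, and quotes are removed, some posts may consist solely of whitespace; a slightly stricter filter, sketched here with a hypothetical helper name, would drop those as well.

```python
def drop_empty_documents(docs):
    """Keep only documents that contain at least one non-whitespace character.

    Hypothetical helper for illustration; not part of gptopic or scikit-learn.
    """
    return [doc for doc in docs if doc.strip()]


# Toy input: one empty and one whitespace-only document are removed.
sample = ["First post.", "", "   \n\t", "Second post."]
cleaned = drop_empty_documents(sample)
```

Whether whitespace-only documents should be removed is a judgment call; the original filter is kept unchanged in the cell above.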


## Initialize and fit the model 

In [None]:
tm = GPTopic(
    openai_api_key = api_key_openai,
    n_topics = 20 # select 20 topics since the true number of topics is 20 
)

In [None]:
tm.fit(corpus)

Computing vocabulary...


Processing corpus: 100%|██████████| 980/980 [00:04<00:00, 208.30it/s]


Most common words:
n't: 813
would: 557
one: 481
people: 353
know: 312
like: 307
think: 291
get: 259
use: 258
also: 254
Computing embeddings...


100%|██████████| 980/980 [00:00<00:00, 1301.57it/s]
100%|██████████| 980/980 [09:41<00:00,  1.69it/s]
100%|██████████| 2609/2609 [00:00<00:00, 4873.34it/s]
100%|██████████| 2609/2609 [24:19<00:00,  1.79it/s] 


Extracting topics...
UMAP(angular_rp_forest=True, metric='cosine', min_dist=0, n_components=5, random_state=42, verbose=True)
Mon Sep  4 19:29:23 2023 Construct fuzzy simplicial set
Mon Sep  4 19:29:38 2023 Finding Nearest Neighbors
Mon Sep  4 19:29:49 2023 Finished Nearest Neighbor Search
Mon Sep  4 19:29:58 2023 Construct embedding


Epochs completed: 100%| ██████████ 500/500 [00:08]


Mon Sep  4 19:30:07 2023 Finished embedding


Epochs completed: 100%| ██████████ 100/100 [00:02]
Computing word-topic matrix: 100%|██████████| 12/12 [00:04<00:00,  2.48it/s]


Shape of tfidf:  (2609, 11)
shape fo word_topic_mat:  (2609, 11)


Epochs completed: 100%| ██████████ 100/100 [00:03]


Describing topics...


100%|██████████| 11/11 [01:00<00:00,  5.53s/it]


In [None]:
# If the pickle file for this example exists, you can also load the fitted model directly.
import pickle

with open("../Data/SavedTopicRepresentations/GPTopic_20ng.pkl", "rb") as f:
    tm = pickle.load(f)
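
Since fitting takes a long time (see the progress bars above), it may be worth saving the fitted model so that it can be reloaded later. The sketch below uses only the standard `pickle` module; the helper names are hypothetical, a plain dictionary stands in for the fitted model, and it assumes the model object is picklable, which the loading cell above suggests.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize `model` to `path` using pickle."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously pickled object from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip with a stand-in object instead of a real fitted model.
stand_in = {"n_topics": 20}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "model.pkl")
    save_model(stand_in, demo_path)
    restored = load_model(demo_path)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.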

## Get an overview of the identified topics

In [1]:
tm.topic_lis


In [144]:
tm.print_topics()

Topic 0: Electronics Equipment Sales

Topic_description: The common topic of the given words appears to be "electronics and technology". 

Various aspects and sub-topics of this topic include:
1. Buying and selling: "offer", "sale", "sell", "price", "buy"
2. Device usage and features: "use", "get", "new", "used", "condition"
3. Technical specifications: "wire", "ground", "power", "circuit", "voltage"
4. Communication and connectivity: "phone", "email", "modem", "wireless", "connection"
5. Accessories and peripherals: "battery", "cable", "manuals", "disk", "monitor"
Top words: ["n't", 'one', 'would', 'use', 'like', 'get', 'new', 'used', 'offer', 'sale']

------------------------------------------------------------------------------------------------------------------------------------------------------
Topic 1: Image Processing

Topic_description: The common topic of the given words is "Image Processing and Graphics". 

Aspects and sub-topics of this topic include:
1. Image Manipulation

In [None]:
tm.visualize_clusters()

### Obtain more detailed information about the topics

In [145]:
tm.pprompt("Which information on the keyword 'moon landing' does topic 13 have?")

GPT wants to the call the function:  {
  "name": "knn_search",
  "arguments": "{\n  \"topic_index\": 13,\n  \"query\": \"moon landing\"\n}"
}
Topic 13, which is related to the keyword "moon landing," contains information about various aspects of space exploration and missions to the Moon. Here are some key points:

1. The United States has sent automated spacecraft and human-crewed expeditions to explore the Moon. These missions have provided significant knowledge and understanding of the lunar surface.
   - Document index: 258

2. NASA's automated spacecraft for solar system exploration come in various shapes and sizes. Each spacecraft consists of scientific instruments selected for specific missions and is supported by basic subsystems for electrical power, trajectory control, and data communication with Earth.
   - Document index: 535

3. The Mariner missions, conducted between 1962 and 1975, played a crucial role in the early planetary reconnaissance of the Moon and other terrestri

(['This file and other text and image files from JPL missions are available from the JPL Info public access computer site, reachable by Internet via anonymous ftp to pubinfo.jpl.nasa.gov (128.149.6.2); or by dialup modem to +1 (818) 354-1333, up to 9600 bits per second, parameters N-8-1. -----------------------------------------------------------------  Our Solar System at a Glance  Information Summary  PMS 010-A (JPL) June 1991  JPL 410-34-1  6/91  NASA National Aeronautics and Space Administration  Jet Propulsion Laboratory California Institue of Technology Pasadena, California   For a printed copy of this publication contact the public mail office at the NASA center in your geographic region.    INTRODUCTION       From our small world we have gazed upon the cosmic ocean for untold thousands of years. Ancient astronomers observed points of light that appeared to move among the stars. They called these objects planets, meaning wanderers, and named them after Roman deities -- Jupiter, 
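
The `knn_search` call shown above presumably retrieves the documents of a topic whose embeddings lie closest to the embedding of the query. How gptopic implements this internally is not shown here, so the following is only a minimal sketch of a cosine-similarity nearest-neighbour search on toy vectors; the function names and the data are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_search_sketch(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 2-D "embeddings": documents 0 and 2 point in roughly the query's direction.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
nearest = knn_search_sketch([1.0, 0.0], toy_docs, k=2)
```

In a real setting the vectors would be high-dimensional text embeddings, and the returned indices would correspond to the document indices cited in the answer.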

In [146]:
print(tm.topic_lis[13].documents[102])

 Their Hiten engineering-test mission spent a while in a highly eccentric Earth orbit doing lunar flybys, and then was inserted into lunar orbit using some very tricky gravity-assist-like maneuvering.  This meant that it would crash on the Moon eventually, since there is no such thing as a stable lunar orbit (as far as anyone knows), and I believe I recall hearing recently that it was about to happen.


In [147]:
tm.pprompt("What are 5 potential subtopics of topic 6")

GPT wants to the call the function:  {
  "name": "split_topic_kmeans",
  "arguments": "{\n  \"topic_idx\": 6,\n  \"n_clusters\": 5\n}"
}


Epochs completed: 100%| ██████████ 100/100 [00:01]
Computing word-topic matrix: 100%|██████████| 1/1 [00:01<00:00,  1.30s/it]
Epochs completed: 100%| ██████████ 100/100 [00:03]
Epochs completed: 100%| ██████████ 100/100 [00:01]
100%|██████████| 1/1 [00:04<00:00,  4.34s/it]
Epochs completed: 100%| ██████████ 100/100 [00:01]
Computing word-topic matrix: 100%|██████████| 1/1 [00:00<00:00,  2.93it/s]
Epochs completed: 100%| ██████████ 100/100 [00:05]
Epochs completed: 100%| ██████████ 100/100 [00:01]
100%|██████████| 1/1 [00:05<00:00,  5.68s/it]
Epochs completed: 100%| ██████████ 100/100 [00:00]
Computing word-topic matrix: 100%|██████████| 1/1 [00:00<00:00,  4.42it/s]
Epochs completed: 100%| ██████████ 100/100 [00:01]
Epochs completed: 100%| ██████████ 100/100 [00:01]
100%|██████████| 1/1 [00:03<00:00,  3.94s/it]
Epochs completed: 100%| ██████████ 100/100 [00:00]
Computing word-topic matrix: 100%|██████████| 1/1 [00:00<00:00,  8.14it/s]
Epochs completed: 100%| ██████████ 100/100 [00:01]
E

Here are five potential subtopics of topic 6:

1. Existence of God: This subtopic involves questioning the existence of God and examining the evidence for and against it.

2. Sexual Orientation: This subtopic relates to homosexuality and encompasses aspects such as sexual orientation, rights and discrimination, social attitudes, relationships and partners, and public perception.

3. Ethics and Morality: This subtopic focuses on moral and ethical principles, including moral philosophy, moral reasoning, moral standards, moral dilemmas, and moral relativism.

4. Religion and Law: This subtopic explores the intersection of religion and law, including beliefs, practices, interpretation, controversies, and the role of religion in society and politics.

5. Argumentation and Atheism: This subtopic revolves around debates and arguments, involving communication, logical reasoning, disagreements, the intersection of religion and atheism, and the criticism and analysis of arguments.

Please note t

[Topic 0: Electronics Equipment Sales,
 Topic 1: Image Processing,
 Topic 2: Gun control,
 Topic 3: Online Privacy and Anonymity,
 Topic 4: Conflict and Violence.,
 Topic 5: Computer Hardware,
 Topic 6: Online Discussions,
 Topic 7: Computer Software,
 Topic 8: Car Features and Performance,
 Topic 9: Encryption and Government,
 Topic 10: Technology and Computing.,
 Topic 11: Technology and Computing,
 Topic 12: Space Exploration,
 Topic 13: Motorcycle Riding Techniques,
 Topic 14: Technology,
 Topic 15: Hockey Games,
 Topic 16: Health and Medicine.,
 Topic 17: Baseball games and teams.,
 Topic 18: Beliefs about Homosexuality.,
 Topic 19: Existence of God,
 Topic 20: Sexual Orientation,
 Topic 21: Ethics and Morality,
 Topic 22: Religion and Law,
 Topic 23: Argumentation and Atheism.]
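
The function-call output above carries the function name as plain text and the arguments as a JSON-encoded string. As a rough illustration of how such a payload can be turned into an actual Python call, the sketch below decodes the arguments with `json.loads` and dispatches to a handler. The dispatcher and the stand-in handler are hypothetical and do not reproduce gptopic's internal code.

```python
import json

# Payload copied from the output above; "arguments" is a JSON string, not a dict.
function_call = {
    "name": "split_topic_kmeans",
    "arguments": "{\n  \"topic_idx\": 6,\n  \"n_clusters\": 5\n}",
}


def dispatch(call, handlers):
    """Decode the JSON arguments and invoke the handler registered under call['name']."""
    kwargs = json.loads(call["arguments"])
    handler = handlers[call["name"]]
    return handler(**kwargs)


def fake_split_topic_kmeans(topic_idx, n_clusters):
    """Stand-in handler that only reports what it would do."""
    return f"would split topic {topic_idx} into {n_clusters} clusters"


result = dispatch(function_call, {"split_topic_kmeans": fake_split_topic_kmeans})
```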

### Topic splitting

Based on the previously identified topics, we decide to split topic 6 into three subtopics rather than five, using the keywords 'religious faith', 'atheism' and 'ethics and philosophy'.

In [148]:
tm.pprompt("Please split topic 6 into subtopics based on the keywords 'religious faith', 'atheism' and 'ethics and philosophy'. Do this inplace.")

GPT wants to the call the function:  {
  "name": "split_topic_keywords",
  "arguments": "{\n  \"topic_idx\": 6,\n  \"keywords\": [\"religious faith\", \"atheism\", \"ethics and philosophy\"],\n  \"inplace\": true\n}"
}


Epochs completed: 100%| ██████████ 100/100 [00:00]
Computing word-topic matrix: 100%|██████████| 1/1 [00:00<00:00, 12.43it/s]
Epochs completed: 100%| ██████████ 100/100 [00:01]
Epochs completed: 100%| ██████████ 100/100 [00:01]
100%|██████████| 1/1 [00:04<00:00,  4.88s/it]
Epochs completed: 100%| ██████████ 100/100 [00:00]
Computing word-topic matrix: 100%|██████████| 1/1 [00:00<00:00,  1.42it/s]
Epochs completed: 100%| ██████████ 100/100 [00:03]
Epochs completed: 100%| ██████████ 100/100 [00:01]
100%|██████████| 1/1 [00:06<00:00,  6.45s/it]
Epochs completed: 100%| ██████████ 100/100 [00:00]
Computing word-topic matrix: 100%|██████████| 1/1 [00:00<00:00,  2.04it/s]
Epochs completed: 100%| ██████████ 100/100 [00:02]
Epochs completed: 100%| ██████████ 100/100 [00:01]
100%|██████████| 1/1 [00:04<00:00,  4.19s/it]


[Topic 0: Electronics Equipment Sales
, Topic 1: Image Processing
, Topic 2: Gun control
, Topic 3: Online Privacy and Anonymity
, Topic 4: Conflict and Violence.
, Topic 5: Computer Hardware
, Topic 6: Online Discussions
, Topic 7: Computer Software
, Topic 8: Car Features and Performance
, Topic 9: Encryption and Government
, Topic 10: Technology and Computing.
, Topic 11: Technology and Computing
, Topic 12: Space Exploration
, Topic 13: Motorcycle Riding Techniques
, Topic 14: Technology
, Topic 15: Hockey Games
, Topic 16: Health and Medicine.
, Topic 17: Baseball games and teams.
, Topic 18: Beliefs about Homosexuality.
, Topic 19: Religious Beliefs
, Topic 20: Existence of God
, Topic 21: Ethics and Morality
]
Topic 6 has been split into the following subtopics based on the keywords 'religious faith', 'atheism', and 'ethics and philosophy':

1. Subtopic: Religious Beliefs
   - Description: The common topic of these words is "Religion and Beliefs". Aspects and sub-topics of this 

[Topic 0: Electronics Equipment Sales,
 Topic 1: Image Processing,
 Topic 2: Gun control,
 Topic 3: Online Privacy and Anonymity,
 Topic 4: Conflict and Violence.,
 Topic 5: Computer Hardware,
 Topic 6: Online Discussions,
 Topic 7: Computer Software,
 Topic 8: Car Features and Performance,
 Topic 9: Encryption and Government,
 Topic 10: Technology and Computing.,
 Topic 11: Technology and Computing,
 Topic 12: Space Exploration,
 Topic 13: Motorcycle Riding Techniques,
 Topic 14: Technology,
 Topic 15: Hockey Games,
 Topic 16: Health and Medicine.,
 Topic 17: Baseball games and teams.,
 Topic 18: Beliefs about Homosexuality.,
 Topic 19: Religious Beliefs,
 Topic 20: Existence of God,
 Topic 21: Ethics and Morality]

In [150]:
tm.topic_lis

[Topic 0: Electronics Equipment Sales,
 Topic 1: Image Processing,
 Topic 2: Gun control,
 Topic 3: Online Privacy and Anonymity,
 Topic 4: Conflict and Violence.,
 Topic 5: Computer Hardware,
 Topic 6: Online Discussions,
 Topic 7: Computer Software,
 Topic 8: Car Features and Performance,
 Topic 9: Encryption and Government,
 Topic 10: Technology and Computing.,
 Topic 11: Technology and Computing,
 Topic 12: Space Exploration,
 Topic 13: Motorcycle Riding Techniques,
 Topic 14: Technology,
 Topic 15: Hockey Games,
 Topic 16: Health and Medicine.,
 Topic 17: Baseball games and teams.,
 Topic 18: Beliefs about Homosexuality.,
 Topic 19: Religious Beliefs,
 Topic 20: Existence of God,
 Topic 21: Ethics and Morality]

### Combine Topics

Topics 15 and 17 both seem to be about sports, so let's merge them into one topic.

In [153]:
tm.pprompt("Please combine topics 15 and 17. Do this inplace.")

GPT wants to the call the function:  {
  "name": "combine_topics",
  "arguments": "{\n  \"topic_idx_lis\": [15, 17],\n  \"inplace\": true\n}"
}


Epochs completed: 100%| ██████████ 100/100 [00:01]
Computing word-topic matrix: 100%|██████████| 1/1 [00:07<00:00,  7.16s/it]
Epochs completed: 100%| ██████████ 30/30 [00:09]
Epochs completed: 100%| ██████████ 100/100 [00:02]
100%|██████████| 1/1 [00:06<00:00,  6.62s/it]


The topics 15 and 17 have been combined into a new topic called "Sports". This topic includes aspects and sub-topics related to sports such as team and players, games and seasons, performance and skills, fans and audience, and statistics and records. Some of the common words found in this topic include "team," "players," "hockey," "baseball," "game," "games," "season," "playoffs," "good," "better," "win," "hit," "score," "fans," "series," "watch," "fan," "stats," "record," "pts," and "career".


[Topic 0: Electronics Equipment Sales,
 Topic 1: Image Processing,
 Topic 2: Gun control,
 Topic 3: Online Privacy and Anonymity,
 Topic 4: Conflict and Violence.,
 Topic 5: Computer Hardware,
 Topic 6: Online Discussions,
 Topic 7: Computer Software,
 Topic 8: Car Features and Performance,
 Topic 9: Encryption and Government,
 Topic 10: Technology and Computing.,
 Topic 11: Technology and Computing,
 Topic 12: Space Exploration,
 Topic 13: Motorcycle Riding Techniques,
 Topic 14: Technology,
 Topic 15: Health and Medicine.,
 Topic 16: Beliefs about Homosexuality.,
 Topic 17: Religious Beliefs,
 Topic 18: Existence of God,
 Topic 19: Ethics and Morality,
 Topic 20: Sports]
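
Judging from the listings before and after the merge, the combined topics are removed, the topics after them shift down, and the new topic is appended at the end (here, "Sports" at index 20). The sketch below reproduces that renumbering on plain topic names; it is a hypothetical illustration and does not operate on gptopic's topic objects.

```python
def combine_topic_names(names, idx_to_merge, merged_name):
    """Drop the topics at `idx_to_merge` and append the merged topic at the end."""
    merge_set = set(idx_to_merge)
    kept = [name for i, name in enumerate(names) if i not in merge_set]
    return kept + [merged_name]


# Shortened toy list; indices 1 and 3 play the role of topics 15 and 17.
before = ["Technology", "Hockey Games", "Health and Medicine", "Baseball games and teams", "Ethics and Morality"]
after = combine_topic_names(before, [1, 3], "Sports")
```

Because indices change after every in-place operation, it is safest to re-inspect `tm.topic_lis` before referring to a topic by number in the next prompt.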

### Delete Topics

Since topics 10 and 11 have the same title, we can delete one of them. Note that this doesn't delete the documents of the deleted topic, but rather redistributes them over the remaining topics.

In [159]:
tm.pprompt("Please delete topic 10. Do this inplace.")

GPT wants to the call the function:  {
  "name": "delete_topic",
  "arguments": "{\n  \"topic_idx\": 10,\n  \"inplace\": true\n}"
}
Tue Sep  5 13:50:48 2023 Building and compiling search function


Epochs completed: 100%| ██████████ 100/100 [00:02]
Computing word-topic matrix: 100%|██████████| 20/20 [02:01<00:00,  6.06s/it]


Shape of tfidf:  (31365, 20)
shape fo word_topic_mat:  (31365, 20)


Epochs completed: 100%| ██████████ 30/30 [00:05]
100%|██████████| 20/20 [01:43<00:00,  5.17s/it]


The topic with index 10 has been successfully deleted. 

After removing topic 10, the new topic we have is topic with index 19, which is related to "Sports". The various aspects and sub-topics of this topic include:

1. Games: "game", "games", "play", "team", "players"
2. Seasons: "year", "season", "last", "years", "playoffs"
3. Performance: "good", "better", "well", "great", "average"
4. Strategies: "think", "strategy", "tactics", "coach", "plan"
5. Results: "win", "score", "goal", "points", "victory"


[Topic 0: Electronics equipment sales,
 Topic 1: Image Processing,
 Topic 2: Gun control,
 Topic 3: Online Privacy,
 Topic 4: Conflict and violence.,
 Topic 5: Computer Hardware,
 Topic 6: Anonymity in online discussions.,
 Topic 7: Computer Software,
 Topic 8: Car Features and Performance,
 Topic 9: Encryption,
 Topic 10: Technology and Computing,
 Topic 11: Space Exploration,
 Topic 12: Motorcycle Riding Tips,
 Topic 13: Technology and Computing,
 Topic 14: Healthcare and Medicine.,
 Topic 15: Biblical interpretation,
 Topic 16: Religious Beliefs,
 Topic 17: Existence of God,
 Topic 18: Morality in Health Insurance,
 Topic 19: Sports]
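
The text above states that the documents of a deleted topic are distributed over the other topics. One plausible way to do this, shown here purely as an assumption and not as gptopic's actual procedure, is to assign each orphaned document to the remaining topic whose centroid embedding is most similar. All names and vectors below are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def reassign_documents(doc_vecs, centroids):
    """Map each document vector to the index of the most similar remaining centroid."""
    assignments = []
    for vec in doc_vecs:
        best_topic = max(centroids, key=lambda t: cosine_similarity(vec, centroids[t]))
        assignments.append(best_topic)
    return assignments


# Toy 2-D centroids of two remaining topics and three documents from the deleted topic.
remaining_centroids = {0: [1.0, 0.0], 1: [0.0, 1.0]}
orphaned_docs = [[0.9, 0.2], [0.1, 0.8], [0.7, 0.6]]
new_topics = reassign_documents(orphaned_docs, remaining_centroids)
```

Note that several topic names in the listing above changed after the deletion, which suggests that the topic descriptions are recomputed once the documents have been reassigned.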

### Compare Topics 

In [160]:
tm.pprompt("Please compare topics 5 and 7.")



GPT wants to the call the function:  {
  "name": "get_topic_information",
  "arguments": "{\n  \"topic_idx_lis\": [5, 7]\n}"
}
Topic 5 is about computer hardware, while topic 7 is about computer software.

In topic 5, the common topic is "computer hardware," and it covers various aspects and sub-topics such as storage, components, performance, connectivity, and display. Some of the top words in this topic include "drive," "card," "disk," "memory," and "video." It seems to be discussing various hardware-related issues, including problems, compatibility, performance, and configuration. It also mentions specific components like hard drives, cards, motherboards, and monitors. Overall, this topic focuses on the physical components of a computer system.

On the other hand, topic 7 is about computer software and usage. It covers aspects such as operating systems, software programs, computer hardware, user interface, and troubleshooting. Some of the top words in this topic include "file," "pro

{5: '\n            Topic index: 5\n            Topic name: Computer Hardware\n            Topic description: The common topic of the given words is "computer hardware". \n\nThe various aspects and sub-topics of this topic include:\n1. Storage: disk, hard drive, floppy, drives, disks.\n2. Components: card, controller, board, chip, motherboard.\n3. Performance: speed, memory, clock, mhz, faster.\n4. Connectivity: bus, port, connector, cable, serial.\n5. Display: monitor, video, screen, color, graphics.\n            Topic topwords: ["n\'t", \'drive\', \'card\', \'one\', \'would\', \'use\', \'know\', \'get\', \'like\', \'disk\', \'system\', \'problem\', \'drives\', \'work\', \'also\', \'controller\', \'hard\', \'anyone\', \'using\', \'drivers\', \'need\', \'two\', \'monitor\', \'bus\', \'new\', \'used\', \'software\', \'speed\', \'data\', \'could\', \'think\', \'driver\', \'memory\', \'board\', \'time\', \'problems\', \'mode\', \'video\', \'port\', \'good\', \'much\', \'cards\', \'computer

### Add a completely new topic

We can also add a completely new topic based on the keyword "Politics and the government".

In [168]:
tm.pprompt("Please add a completely new topic based on the keyword 'Politics and the government'.")

GPT wants to the call the function:  {
  "name": "add_new_topic_keyword",
  "arguments": "{\n  \"keyword\": \"Politics and the government\"\n}"
}


Epochs completed: 100%| ██████████ 100/100 [00:01]
Epochs completed: 100%| ██████████ 100/100 [00:03]
Computing word-topic matrix: 100%|██████████| 21/21 [01:18<00:00,  3.72s/it]
Epochs completed: 100%| ██████████ 30/30 [00:04]
100%|██████████| 21/21 [01:58<00:00,  5.64s/it]


Sure! I have added a new topic based on the keyword "Politics and the government". The new topic is called "Government and Policy".

The common theme of the given words is "government and public policy". This topic encompasses various aspects and sub-topics, including:

1. Government actions and decisions: This sub-topic focuses on the actions and decisions made by the government. It includes words like "make", "said", "believe", "question", and "decisions".

2. Healthcare system: This sub-topic is centered around the healthcare system, including topics such as health, care, insurance, private healthcare, and drugs.

3. Economy and jobs: This sub-topic explores the economy and job-related issues. It includes words like jobs, work, economic, business, and spending.

4. Education and schools: This sub-topic pertains to education and schools. It includes words like education, school, students, learning, and teachers.

5. Law and order: This sub-topic focuses on the legal system and mainta

[Topic 0: Electronics equipment sales,
 Topic 1: Image Processing,
 Topic 2: Gun control,
 Topic 3: Internet Privacy,
 Topic 4: Conflict and Violence,
 Topic 5: Computer Hardware,
 Topic 6: Anonymous Posting,
 Topic 7: Computer Software,
 Topic 8: Car features and performance.,
 Topic 9: Encryption,
 Topic 10: Technology,
 Topic 11: Space Exploration,
 Topic 12: Motorcycle Riding Tips,
 Topic 13: Technology and Computing,
 Topic 14: Healthcare and Medicine,
 Topic 15: Biblical interpretation,
 Topic 16: Beliefs and Religion,
 Topic 17: Existence of God,
 Topic 18: Sexual Morality,
 Topic 19: Sports,
 Topic 20: Government and Policy]