### Final Project

<h1><center> Artificial Intelligence & the press: <br>
    new(s) meanings with the emergence of generative AI systems</center></h1>
<h2><center>Part 3 | Text Analysis: Topic Modelling Before ChatGPT</center></h2>


<h3 align="right">by Silvia DalBen Furtado</h3> 

I created a separate Jupyter notebook just for the topic modelling of the titles from before ChatGPT.

For this I will use BERTopic. Some resources about it:

https://www.pinecone.io/learn/bertopic/

https://maartengr.github.io/BERTopic/index.html

Another common approach to topic modelling is Latent Dirichlet Allocation (LDA): https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0

In [1]:
import os
import re
import unicodedata

import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel
from bertopic import BERTopic
from umap import UMAP

# WordNet data is needed by the lemmatizer used further down.
nltk.download('omw-1.4')
nltk.download('wordnet')
wn = nltk.WordNetLemmatizer()

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/silvinhad/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/silvinhad/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Topic Modelling before ChatGPT

In [2]:
data = pd.read_csv('titles_before_chatgpt.csv').astype("string")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5548 entries, 0 to 5547
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   5548 non-null   string
dtypes: string(1)
memory usage: 43.5 KB


In [3]:
data.shape

(5548, 1)

In [4]:
data =  data[['title']] 

In [5]:
# Use a separate name so the imported `stopwords` module is not overwritten.
default_stopwords = nltk.corpus.stopwords.words('english')
print(f'There are {len(default_stopwords)} default stopwords. They are {default_stopwords}')

There are 179 default stopwords. They are ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'no

In [6]:
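def clean_text(x):
    '''Lowercase a title and replace hashtags, mentions, punctuation, digits and line breaks with spaces.'''
    x = str(x)
    x = x.lower()
    x = re.sub(r'#[A-Za-z0-9]*', ' ', x)   # hashtags
    x = re.sub(r'@[A-Za-z0-9]+', ' ', x)   # @mentions
    # Raw string avoids the invalid '\(' escape. The straight apostrophe is not in the set, so it is kept.
    x = re.sub(r'[%s]' % re.escape(r'!"#$%&\()*+,-./:;<=>?@[\]^_`{|}~“…”’'), ' ', x)
    x = re.sub(r'\d+', ' ', x)             # digits
    x = re.sub(r'\n+', ' ', x)             # line breaks
    return x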

data['title'] = data.title.apply(clean_text).astype("string")
data['title']

0       three companies that place a high priority on ...
1       africa s and america s forgotten need access t...
2       how semi automated offside technology will cha...
3       semi automatic offside will be in place at the...
4       fifa to use new high tech for offside calls at...
                              ...                        
5543    elon musk says   months to first human implant...
5544    elon musk's neuralink shows brain implant prot...
5545    elon musk expects neuralink to begin human tri...
5546    florida children getting hit with a tsunami of...
5547             sci fi fantasy film ‘life cycle  out now
Name: title, Length: 5548, dtype: string
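
To see what each substitution in `clean_text` does, the function can be run on a single headline. This is only a minimal sketch: the sample string is invented for illustration, and the punctuation set is written as a raw string, which should match the same characters as the cell above.

```python
import re

# Same punctuation characters as in clean_text, written as a raw string.
PUNCTUATION = r'!"#$%&\()*+,-./:;<=>?@[\]^_`{|}~“…”’'

def clean_text(x):
    x = str(x).lower()
    x = re.sub(r'#[A-Za-z0-9]*', ' ', x)                   # hashtags
    x = re.sub(r'@[A-Za-z0-9]+', ' ', x)                   # @mentions
    x = re.sub(r'[%s]' % re.escape(PUNCTUATION), ' ', x)   # punctuation
    x = re.sub(r'\d+', ' ', x)                             # digits
    x = re.sub(r'\n+', ' ', x)                             # line breaks
    return x

sample = "Elon Musk's Neuralink: 6 months to #AI trials?"  # hypothetical headline
cleaned = clean_text(sample)
print(repr(cleaned))
```

The straight apostrophe is not part of the punctuation set, which is why `musk's` still appears in row 5544 of the output above. Removed characters are replaced by spaces, so runs of several spaces remain until the tokenizing step.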

I will now clean and prepare my text data using code I found here:

https://github.com/m3redithw/data-science-visualizations/blob/main/WordClouds/prepare.py

In [7]:
def basic_clean(string):
    '''
    This function takes in a string and
    returns the string normalized.
    '''
    string = unicodedata.normalize('NFKD', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')
    string = re.sub(r'[^\w\s]', '', string).lower()
    return string

data['basic_clean'] = data.title.apply(basic_clean).astype("string")
data['basic_clean']

0       three companies that place a high priority on ...
1       africa s and america s forgotten need access t...
2       how semi automated offside technology will cha...
3       semi automatic offside will be in place at the...
4       fifa to use new high tech for offside calls at...
                              ...                        
5543    elon musk says   months to first human implant...
5544    elon musks neuralink shows brain implant proto...
5545    elon musk expects neuralink to begin human tri...
5546    florida children getting hit with a tsunami of...
5547              sci fi fantasy film life cycle  out now
Name: basic_clean, Length: 5548, dtype: string
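
The normalization step can be checked on one string. A minimal sketch with an invented sample title: NFKD splits an accented letter into its base letter plus a combining mark, and encoding to ASCII with `'ignore'` then drops every character that has no ASCII form, such as the combining mark and curly quotes.

```python
import re
import unicodedata

def basic_clean(string):
    # Decompose characters, drop everything non-ASCII, then strip punctuation.
    string = (unicodedata.normalize('NFKD', string)
              .encode('ascii', 'ignore')
              .decode('utf-8', 'ignore'))
    return re.sub(r'[^\w\s]', '', string).lower()

sample = "Café ‘Life Cycle’ out now!"  # hypothetical title
result = basic_clean(sample)
print(repr(result))
```

This is the step that removes the curly quote still visible in row 5547 after `clean_text`, and the apostrophe in row 5544.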

In [8]:
def tokenize(string):
    '''
    This function takes in a string and
    returns a tokenized string.
    '''
    # Create tokenizer.
    tokenizer = nltk.tokenize.ToktokTokenizer()

    # Use tokenizer
    string = tokenizer.tokenize(string, return_str = True)

    return string

data['tokenize'] = data.basic_clean.apply(tokenize).astype("string")
data['tokenize']

0       three companies that place a high priority on ...
1       africa s and america s forgotten need access t...
2       how semi automated offside technology will cha...
3       semi automatic offside will be in place at the...
4       fifa to use new high tech for offside calls at...
                              ...                        
5543    elon musk says months to first human implant o...
5544    elon musks neuralink shows brain implant proto...
5545    elon musk expects neuralink to begin human tri...
5546    florida children getting hit with a tsunami of...
5547               sci fi fantasy film life cycle out now
Name: tokenize, Length: 5548, dtype: string

In [9]:
def lemmatize(string):
    '''
    This function takes in a string and
    returns a string with words lemmatized.
    '''
    # Create the lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()

    # Lemmatize each word from split(). With no POS tag given, WordNet treats
    # every word as a noun, so some verb forms (e.g. "expects") stay unchanged.
    lemmas = [wnl.lemmatize(word) for word in string.split()]

    # Join our list of words into a string again and assign to a variable.
    string = ' '.join(lemmas)
    
    return string

data['lemmatize'] = data.tokenize.apply(lemmatize).astype("string")
data['lemmatize']

0       three company that place a high priority on pe...
1       africa s and america s forgotten need access t...
2       how semi automated offside technology will cha...
3       semi automatic offside will be in place at the...
4       fifa to use new high tech for offside call at ...
                              ...                        
5543    elon musk say month to first human implant of ...
5544    elon musk neuralink show brain implant prototy...
5545    elon musk expects neuralink to begin human tri...
5546    florida child getting hit with a tsunami of vi...
5547               sci fi fantasy film life cycle out now
Name: lemmatize, Length: 5548, dtype: string

In [10]:
def remove_stopwords(string, extra_words=None, exclude_words=None):
    '''
    This function takes in a string and optional extra_words and exclude_words lists
    (both default to no words) and returns the string with stopwords removed.
    '''
    # Create stopword_list.
    stopword_list = nltk.corpus.stopwords.words('english')
    
    # Remove 'exclude_words' from stopword_list to keep these in my text.
    stopword_list = set(stopword_list) - set(exclude_words or [])
    
    # Add in 'extra_words' to stopword_list.
    stopword_list = stopword_list.union(set(extra_words or []))

    # Split words in string.
    words = string.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords

data['titles_without_stopwords'] = data.lemmatize.apply(remove_stopwords).astype("string")
data['titles_without_stopwords']

0                three company place high priority people
1            africa america forgotten need access capital
2       semi automated offside technology change qatar...
3            semi automatic offside place qatar world cup
4           fifa use new high tech offside call world cup
                              ...                        
5543    elon musk say month first human implant brain ...
5544    elon musk neuralink show brain implant prototy...
5545    elon musk expects neuralink begin human trial ...
5546      florida child getting hit tsunami virus bad get
5547                       sci fi fantasy film life cycle
Name: titles_without_stopwords, Length: 5548, dtype: string
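
The call above uses the default arguments, so only the standard English list is removed. The two optional parameters could be used as sketched below. To keep the example self-contained, a tiny hand-written stopword set stands in for the NLTK list, and both the sample title and the choice of `say` and `no` are illustrations only, not choices made in this notebook.

```python
# Tiny stand-in for nltk.corpus.stopwords.words('english') (assumption for this demo).
BASE_STOPWORDS = {'to', 'of', 'the', 'a', 'no', 'not', 'with'}

def remove_stopwords(string, extra_words=None, exclude_words=None):
    # Words in exclude_words are kept in the text; words in extra_words are also removed.
    stopword_set = set(BASE_STOPWORDS) - set(exclude_words or [])
    stopword_set |= set(extra_words or [])
    return ' '.join(word for word in string.split() if word not in stopword_set)

title = 'company say no to the robot tax'  # hypothetical lemmatized title
default = remove_stopwords(title)
custom = remove_stopwords(title, extra_words=['say'], exclude_words=['no'])
print(default)
print(custom)
```

With the defaults, `no` is dropped along with the other stopwords; with `exclude_words=['no']` the negation survives, and `extra_words=['say']` removes a frequent reporting verb instead.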

In [11]:
data

Unnamed: 0,title,basic_clean,tokenize,lemmatize,titles_without_stopwords
0,three companies that place a high priority on ...,three companies that place a high priority on ...,three companies that place a high priority on ...,three company that place a high priority on pe...,three company place high priority people
1,africa s and america s forgotten need access t...,africa s and america s forgotten need access t...,africa s and america s forgotten need access t...,africa s and america s forgotten need access t...,africa america forgotten need access capital
2,how semi automated offside technology will cha...,how semi automated offside technology will cha...,how semi automated offside technology will cha...,how semi automated offside technology will cha...,semi automated offside technology change qatar...
3,semi automatic offside will be in place at the...,semi automatic offside will be in place at the...,semi automatic offside will be in place at the...,semi automatic offside will be in place at the...,semi automatic offside place qatar world cup
4,fifa to use new high tech for offside calls at...,fifa to use new high tech for offside calls at...,fifa to use new high tech for offside calls at...,fifa to use new high tech for offside call at ...,fifa use new high tech offside call world cup
...,...,...,...,...,...
5543,elon musk says months to first human implant...,elon musk says months to first human implant...,elon musk says months to first human implant o...,elon musk say month to first human implant of ...,elon musk say month first human implant brain ...
5544,elon musk's neuralink shows brain implant prot...,elon musks neuralink shows brain implant proto...,elon musks neuralink shows brain implant proto...,elon musk neuralink show brain implant prototy...,elon musk neuralink show brain implant prototy...
5545,elon musk expects neuralink to begin human tri...,elon musk expects neuralink to begin human tri...,elon musk expects neuralink to begin human tri...,elon musk expects neuralink to begin human tri...,elon musk expects neuralink begin human trial ...
5546,florida children getting hit with a tsunami of...,florida children getting hit with a tsunami of...,florida children getting hit with a tsunami of...,florida child getting hit with a tsunami of vi...,florida child getting hit tsunami virus bad get


In [12]:
# exporting the cleaned titles to generate a new wordcloud
data.titles_without_stopwords.to_csv(r'titles_before_chatgpt_cleaned.csv', index=False)

In [22]:
# Reduce the document embeddings to 10 dimensions before clustering.
umap_model = UMAP(n_neighbors=15, n_components=10, metric='cosine', low_memory=False)

# BERTopic expects a list of strings.
docs = data['titles_without_stopwords'].tolist()

topic_model = BERTopic(umap_model=umap_model, language="english", calculate_probabilities=True,
                       nr_topics=13, verbose=True, n_gram_range=(1, 3))
topics, _ = topic_model.fit_transform(docs)

# Join all titles assigned to the same topic into one long document per topic.
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topics})
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

# Extract vectorizer and analyzer from BERTopic
vectorizer = topic_model.vectorizer_model
analyzer = vectorizer.build_analyzer()

# Extract features for Topic Coherence evaluation
tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
# Top words of every topic except the outlier topic -1.
topic_words = [[word for word, _ in topic_model.get_topic(topic) if word != '']
               for topic in range(len(set(topics)) - 1)]

# Evaluate with the c_v coherence measure
coherence_model = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_v')
coherence = coherence_model.get_coherence()
coherence

2024-03-30 11:02:12,185 - BERTopic - Embedding - Transforming documents to embeddings.



2024-03-30 11:02:21,869 - BERTopic - Embedding - Completed ✓
2024-03-30 11:02:21,869 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-30 11:02:25,763 - BERTopic - Dimensionality - Completed ✓
2024-03-30 11:02:25,764 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-30 11:02:28,984 - BERTopic - Cluster - Completed ✓
2024-03-30 11:02:28,984 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-30 11:02:29,317 - BERTopic - Representation - Completed ✓
2024-03-30 11:02:29,318 - BERTopic - Topic reduction - Reducing number of topics
2024-03-30 11:02:29,604 - BERTopic - Topic reduction - Reduced number of topics from 116 to 13


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

0.5800052811474617
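
The `range(len(set(topics))-1)` expression in the cell above relies on BERTopic labelling outlier documents as topic `-1`: subtracting one from the number of distinct labels leaves the ids of the real topics, `0` to `n-1`. A minimal sketch with made-up labels (not the labels from this run) shows the arithmetic, next to a variant that does not assume an outlier topic exists.

```python
topics = [-1, 0, 1, 1, 0, -1, 2, 0]  # hypothetical topic assignments

# Notebook's approach: number of distinct labels minus one for the outlier label.
ids_notebook = list(range(len(set(topics)) - 1))

# Variant that filters out the outlier label explicitly.
ids_explicit = sorted(t for t in set(topics) if t != -1)

# If no document were labelled -1, the first approach would drop the last real topic.
no_outliers = [0, 1, 2, 2]
ids_notebook_no_outliers = list(range(len(set(no_outliers)) - 1))
ids_explicit_no_outliers = sorted(t for t in set(no_outliers) if t != -1)

print(ids_notebook, ids_explicit)
print(ids_notebook_no_outliers, ids_explicit_no_outliers)
```

In this run the table of topic counts includes a `-1` row, so both approaches select the same topics; the explicit filter is only a safeguard for runs without outliers.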

In [23]:
topic_words

[['ai',
  'robot',
  'intelligence',
  'artificial',
  'artificial intelligence',
  'art',
  'tesla',
  'human',
  'ethic',
  'new'],
 ['stock',
  'tech',
  'earnings',
  'chip',
  'nvidia',
  'new',
  'amazon',
  'digital',
  'data',
  'cloud'],
 ['musk',
  'elon',
  'elon musk',
  'twitter',
  'election',
  'facebook',
  'tiktok',
  'misinformation',
  'cybersecurity',
  'social'],
 ['china',
  'biden',
  'drone',
  'tech',
  'war',
  'israel',
  'chip',
  'ukraine',
  'say',
  'russia'],
 ['climate',
  'emission',
  'animal',
  'climate change',
  'energy',
  'oil',
  'food',
  'help',
  'change',
  'gas'],
 ['effective altruism',
  'altruism',
  'effective',
  'woman',
  'fried',
  'bankman',
  'sam bankman',
  'bankman fried',
  'sam bankman fried',
  'sam'],
 ['westworld',
  'hbo',
  'sci fi',
  'sci',
  'fi',
  'movie',
  'season',
  'prime',
  'amazon prime',
  'steven'],
 ['world cup',
  'cup',
  'chess',
  'world',
  'offside',
  'qatar world cup',
  'qatar world',
  'qatar',

In [37]:
freq = topic_model.get_topic_info()
freq.to_csv('freq.csv', index=False)
freq

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,2001,-1_ai_new_tech_technology,"[ai, new, tech, technology, future, company, c...","[japan pay u company chip production, computer..."
1,0,1124,0_ai_robot_intelligence_artificial,"[ai, robot, intelligence, artificial, artifici...",[ai ethic ai law clarifying fact trustworthy a...
2,1,864,1_stock_tech_earnings_chip,"[stock, tech, earnings, chip, nvidia, new, ama...",[senate pass bill boost computer chip producti...
3,2,522,2_musk_elon_elon musk_twitter,"[musk, elon, elon musk, twitter, election, fac...",[get healthcare coverage need digital health b...
4,3,498,3_china_biden_drone_tech,"[china, biden, drone, tech, war, israel, chip,...",[pentagon despite russia war china still top t...
5,4,188,4_climate_emission_animal_climate change,"[climate, emission, animal, climate change, en...",[gore announces fossil fuel emission inventory...
6,5,141,5_effective altruism_altruism_effective_woman,"[effective altruism, altruism, effective, woma...",[sam bankman fried support democrat massively ...
7,6,84,6_westworld_hbo_sci fi_sci,"[westworld, hbo, sci fi, sci, fi, movie, seaso...","[sci fi drama westworld canceled hbo season, s..."
8,7,67,7_world cup_cup_chess_world,"[world cup, cup, chess, world, offside, qatar ...","[semi automatic offside place qatar world cup,..."
9,8,19,8_facial_facial recognition_recognition_seal,"[facial, facial recognition, recognition, seal...",[facial recognition help conserve seal scienti...
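
The counts in this table can be turned into shares of the corpus. A minimal sketch, using the counts of the first five rows shown above and the 5,548 titles reported by `data.info()`; the dictionary is typed in by hand here rather than read from `freq`.

```python
TOTAL_TITLES = 5548  # number of titles reported by data.info()

# Counts copied from the first rows of the table (topic id -> Count).
counts = {-1: 2001, 0: 1124, 1: 864, 2: 522, 3: 498}

shares = {topic: count / TOTAL_TITLES for topic, count in counts.items()}
for topic, share in shares.items():
    print(f"topic {topic:>2}: {share:.1%}")
```

Roughly 36% of the titles end up in the outlier topic `-1`, which is worth keeping in mind when reading the remaining topics.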


In [25]:
topic_model.visualize_barchart(top_n_topics=4)