<br>Dora Li
<br>CS315 
<br>April 23rd, 2024

# Topic Modeling for Advertisements in TikTok Video History

<a id="sec1"></a>
## 1. Load the data

In [2]:
import pandas as pd
import numpy as np

In [3]:
# Import data and get advertisements
df = pd.read_csv("results_26301.csv")
ads = df[df["video_is_ad"] == True]
ads = ads.fillna("") # make sure there are no NaN values
ads.head()

Unnamed: 0,video_id,video_timestamp,video_duration,video_locationcreated,suggested_words,video_diggcount,video_sharecount,video_commentcount,video_playcount,video_description,video_is_ad,video_stickers,author_username,author_name,author_followercount,author_followingcount,author_heartcount,author_videocount,author_diggcount,author_verified
112,7342923622752767278,2024-03-05T12:44:54,15.0,US,"Skating Board, tubi movie, tubi movies 2023, t...",1031.0,18.0,12.0,1300000.0,They're bringing new vibes to old traditions. ...,True,,tubi,Tubi,,,,,,True
120,7340530115241053486,2024-02-28T01:06:58,59.0,US,"Karaoke Machine, Karaoke, Playing Karaoke, car...",14200.0,706.0,65.0,249000.0,You cant even tell on video how loud this gets...,True,,kaitttttnicole,Kait,,,,,,False
125,7302183353237703967,2023-11-16T17:01:05,35.0,US,"heated round brush, wavytalk, wavy talk 5 in 1...",40800.0,3345.0,355.0,5400000.0,Replying to @lily TikTok shop black friday sa...,True,,julissa_guillen,Julissa Guillen,,,,,,False
131,7313262803266161962,2023-12-16T13:35:03,17.0,US,"wavytalk brush, wavy talk, wavy thermal brush,...",157900.0,680.0,315.0,8000000.0,im not even kidding like this guy is comign w ...,True,,jigglyjulia,julia huynh,,,,,,False
135,7336608256384585002,2024-02-17T11:27:29,49.0,US,"wavy thermal brush, Thermal Brush, wavytalk br...",10100.0,363.0,66.0,1600000.0,feeling like this viral @wavytalkofficial is w...,True,,zoeyburger,zoeyburger,,,,,,False


In [16]:
ads.shape

(1079, 20)

<a id="sec2"></a>
## 2. Convert to document-term matrix

We will apply the `CountVectorizer` to convert our corpus into a document-term matrix. LDA models each document as a bag of word counts, so raw counts are the natural input for it. (It is possible to use the Tf-idf vectorizer too.)

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

This process always has two steps:

1. initialize the vectorizer constructor
2. apply `fit_transform` to perform the transformation.

In [6]:
# Initialize the vectorizer
vectorizer = CountVectorizer(
    strip_accents='unicode',
    stop_words='english',
    lowercase=True,
    token_pattern=r'\b[a-zA-Z]{3,}\b', # we want only words that contain letters and are 3 or more characters long
)

# Transform our data into the document-term matrix
dtm = vectorizer.fit_transform(ads['video_description'])
dtm

<1079x3404 sparse matrix of type '<class 'numpy.int64'>'
	with 9065 stored elements in Compressed Sparse Row format>

### Going back to the dataframe

We can create a function that takes the representation of each document as a row of numbers in the matrix and converts it back to a list of words.

In [12]:
def matrix2Doc(dtMatrix, features, index):
    """Turns each row of the document-term matrix into a list of terms"""
    row = dtMatrix.getrow(index).toarray()
    non_zero_indices = row.nonzero()[1]
    words = [features[idx] for idx in non_zero_indices]
    return words

In [13]:
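feature_names = vectorizer.get_feature_names_out() # vocabulary: column j of dtm counts feature_names[j]
allAdsAsTerms = [matrix2Doc(dtm, feature_names, i) for i in range(dtm.shape[0])]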

Check that we have all of them:

In [15]:
len(allAdsAsTerms)

1079

Add a column to the dataframe:

In [17]:
ads['terms'] = allAdsAsTerms
ads.head()

Unnamed: 0,video_id,video_timestamp,video_duration,video_locationcreated,suggested_words,video_diggcount,video_sharecount,video_commentcount,video_playcount,video_description,...,video_stickers,author_username,author_name,author_followercount,author_followingcount,author_heartcount,author_videocount,author_diggcount,author_verified,terms
112,7342923622752767278,2024-03-05T12:44:54,15.0,US,"Skating Board, tubi movie, tubi movies 2023, t...",1031.0,18.0,12.0,1300000.0,They're bringing new vibes to old traditions. ...,...,,tubi,Tubi,,,,,,True,"[boarders, bringing, march, new, old, starting..."
120,7340530115241053486,2024-02-28T01:06:58,59.0,US,"Karaoke Machine, Karaoke, Playing Karaoke, car...",14200.0,706.0,65.0,249000.0,You cant even tell on video how loud this gets...,...,,kaitttttnicole,Kait,,,,,,False,"[definitely, fun, fyp, gets, hype, loud, tell,..."
125,7302183353237703967,2023-11-16T17:01:05,35.0,US,"heated round brush, wavytalk, wavy talk 5 in 1...",40800.0,3345.0,355.0,5400000.0,Replying to @lily TikTok shop black friday sa...,...,,julissa_guillen,Julissa Guillen,,,,,,False,"[amikablowoutbabe, amikablowoutbabedupe, amika..."
131,7313262803266161962,2023-12-16T13:35:03,17.0,US,"wavytalk brush, wavy talk, wavy thermal brush,...",157900.0,680.0,315.0,8000000.0,im not even kidding like this guy is comign w ...,...,,jigglyjulia,julia huynh,,,,,,False,"[comign, guy, kidding, like, trips]"
135,7336608256384585002,2024-02-17T11:27:29,49.0,US,"wavy thermal brush, Thermal Brush, wavytalk br...",10100.0,363.0,66.0,1600000.0,feeling like this viral @wavytalkofficial is w...,...,,zoeyburger,zoeyburger,,,,,,False,"[blowout, feeling, hairhack, hype, like, order..."


<a id="sec3"></a>
## 3. Fit the LDA model

Now that the data is ready and we understand well how it is represented (and how sparse it is), let us fit the LDA model:

In [18]:
from sklearn.decomposition import LatentDirichletAllocation

# Step 1: Initialize the model

lda = LatentDirichletAllocation(n_components=15, # we are picking the number of topics arbitrarily at the moment
                                random_state=0)

# Step 2: Fit the model
lda.fit(dtm)

Find top words associated with each topic:

In [19]:
def display_topics(model, features, no_top_words):
    """Helper function to show the top words of a model"""
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        print(" ".join([features[i]
                        for i in topic.argsort()[:-no_top_words-1:-1]])) # syntax for reversing a list [::-1]

display_topics(lda, feature_names, 15)

Topic 0:
skin new moment daily glow video essence girl replying defense invisible watch let primevideo product
Topic 1:
skin new make chewy pet happy shopping hydration wallet march denim instantly sephora pro smooth
Topic 2:
sour theaters beans just kids match patch jelly youth new lip text place game best
Topic 3:
chewy need pets new care start rubs belly minus naturally shop ingredients press sets point
Topic 4:
new today collection streaming hulu shogun discover right delivered doorstep chewy disney free glow spring
Topic 5:
new delta terms apply app card like don beach shoe comfort look pay adventure ootd
Topic 6:
love maxx want fashion easter partner new sale best available ends family march looking spring
Topic 7:
free unlimited buy line nuuly bbq try like cream pulled pork life customers existing intro
Topic 8:
limited time new sale good lip balm apply skin success follow steps sensor shop vitamin
Topic 9:
world discover shop disney day skin nuuly warm easy turn way need walmar

### The document-topic matrix and dominant topics

In the prior step, by fitting the LDA model, we found the topics that are present in our corpus. Now, we will use these topics to describe each document as a mixture of topics. For that, we will use the method `transform`, which turns our document-term matrix into a new matrix, the document-topic matrix. This is where the **dimensionality reduction** happens: we go from the wide document-term matrix (3404 columns, one per term) to a narrow document-topic matrix (15 columns, one per topic).

In [20]:
doc_topic_dist = lda.transform(dtm)
doc_topic_dist 

array([[0.91515145, 0.00606062, 0.00606061, ..., 0.00606062, 0.00606061,
        0.00606061],
       [0.00666667, 0.00666667, 0.00666667, ..., 0.00666667, 0.00666667,
        0.00666667],
       [0.00317461, 0.0031746 , 0.0031746 , ..., 0.0031746 , 0.00317461,
        0.0031746 ],
       ...,
       [0.96888888, 0.00222222, 0.00222222, ..., 0.00222222, 0.00222222,
        0.00222222],
       [0.00740741, 0.22962934, 0.00740741, ..., 0.00740741, 0.00740741,
        0.00740741],
       [0.00392157, 0.00392157, 0.00392157, ..., 0.00392157, 0.94509798,
        0.00392157]])

Verify the shape:

In [21]:
doc_topic_dist.shape

(1079, 15)

**Meaning of the matrix values:** The entries in this matrix represent the proportion of the document's content that is attributed to each topic. This means each row of the output matrix is a distribution over topics for the corresponding document and should sum to one. We can easily test that by getting the sum of a row, for example `doc_topic_dist[0].sum()`, which should equal 1 up to floating-point rounding.
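
Here is a self-contained sketch of that check on a toy corpus. The documents and the `toy_` names are invented for illustration; with the real data, the same check would be applied to `doc_topic_dist`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, invented for this illustration
toy_docs = [
    "new skin glow serum",
    "new movie streaming today",
    "pet food delivered today",
    "skin care glow routine",
]
toy_dtm = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=3, random_state=0)
toy_lda.fit(toy_dtm)
toy_dist = toy_lda.transform(toy_dtm)  # shape: (n_documents, n_topics)

# Each row is a probability distribution over topics
row_sums = toy_dist.sum(axis=1)
print(toy_dist.shape)
print(row_sums)
print(np.allclose(row_sums, 1.0))
```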

**Better representing the document-topic matrix**

The document-topic matrix above is not very legible, so we will create a dataframe that has a better representation. First, I'll modify the function `display_topics` to return a short header with a few terms for each topic:

In [22]:
def displayHeader(model, features, no_top_words):
    """Helper function to build a column header with the top words of each topic"""
    topicNames = []
    for topic_idx, topic in enumerate(model.components_):
        topicNames.append(f"Topic {topic_idx}: " + (", ".join([features[i]
                             for i in topic.argsort()[:-no_top_words-1:-1]])))
    return topicNames

In [25]:
# column names
topicnames = displayHeader(lda, feature_names, 5)

# index names
docnames = ads.index.tolist() # We will use the original names of the documents

# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(doc_topic_dist, 3), 
                                 columns=topicnames, 
                                 index=docnames)

# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1) # index of the largest value in each row
df_document_topic['dominant_topic'] = dominant_topic

df_document_topic.head()

Unnamed: 0,"Topic 0: skin, new, moment, daily, glow","Topic 1: skin, new, make, chewy, pet","Topic 2: sour, theaters, beans, just, kids","Topic 3: chewy, need, pets, new, care","Topic 4: new, today, collection, streaming, hulu","Topic 5: new, delta, terms, apply, app","Topic 6: love, maxx, want, fashion, easter","Topic 7: free, unlimited, buy, line, nuuly","Topic 8: limited, time, new, sale, good","Topic 9: world, discover, shop, disney, day","Topic 10: save, covergirl, new, time, tiktok","Topic 11: shop, amazon, finds, home, dress","Topic 12: new, theaters, march, fragrance, american","Topic 13: dune, theaters, tickets, dunemovie, introducing","Topic 14: streaming, don, miss, bacon, prime",dominant_topic
112,0.915,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0
120,0.007,0.007,0.007,0.007,0.007,0.007,0.007,0.007,0.007,0.007,0.907,0.007,0.007,0.007,0.007,10
125,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.956,0.003,0.003,0.003,0.003,10
131,0.011,0.011,0.011,0.011,0.011,0.011,0.844,0.011,0.011,0.011,0.011,0.011,0.011,0.011,0.011,6
135,0.005,0.005,0.005,0.005,0.005,0.005,0.005,0.005,0.005,0.005,0.005,0.928,0.005,0.005,0.005,11


### Topic distribution across documents

Now that we have the document-topic matrix, we can see which topics show up most frequently:

In [26]:
df_topic_distribution = df_document_topic['dominant_topic'].value_counts().reset_index(name="Num Documents")
df_topic_distribution.columns = ['TopicNum', 'NumDocuments']
df_topic_distribution

Unnamed: 0,TopicNum,NumDocuments
0,6,96
1,11,94
2,8,93
3,1,93
4,12,81
5,4,81
6,7,71
7,2,67
8,0,62
9,5,61


### Add two more columns

1. a column with the top 10 words of the corresponding topic (see `TopicNum` for the topic number)
2. a column that lists the document names associated with the topic (here, document names are the row indices of the `ads` dataframe, such as 112, 120, etc.)

#### Column 1

In [27]:
def display_top10_word_in_topics(row, model, features, no_top_words):
    """Helper function to show the top 10 words in a topic"""
    for topic_idx, topic in enumerate(model.components_):
        if row["TopicNum"] == topic_idx:
            return " ".join([features[i] for i in topic.argsort()[:-no_top_words-1:-1]]) # syntax for reversing a list [::-1]

In [28]:
df_topic_distribution['top10'] = df_topic_distribution.apply(display_top10_word_in_topics, args=(lda, feature_names, 10), axis=1)

In [29]:
df_topic_distribution

Unnamed: 0,TopicNum,NumDocuments,top10
0,6,96,love maxx want fashion easter partner new sale...
1,11,94,shop amazon finds home dress play free summer ...
2,8,93,limited time new sale good lip balm apply skin...
3,1,93,skin new make chewy pet happy shopping hydrati...
4,12,81,new theaters march fragrance american discover...
5,4,81,new today collection streaming hulu shogun dis...
6,7,71,free unlimited buy line nuuly bbq try like cre...
7,2,67,sour theaters beans just kids match patch jell...
8,0,62,skin new moment daily glow video essence girl ...
9,5,61,new delta terms apply app card like don beach ...


#### Column 2

In [30]:
def display_doc(row):
    """Helper function to display the document names associated with the topic"""
    docs = df_document_topic[df_document_topic['dominant_topic'] == row['TopicNum']]
    return docs.index.tolist()

In [31]:
df_topic_distribution['docs'] = df_topic_distribution.apply(display_doc, axis=1)

In [32]:
df_topic_distribution

Unnamed: 0,TopicNum,NumDocuments,top10,docs
0,6,96,love maxx want fashion easter partner new sale...,"[131, 164, 229, 266, 432, 458, 554, 580, 623, ..."
1,11,94,shop amazon finds home dress play free summer ...,"[135, 307, 311, 321, 699, 742, 765, 1153, 1455..."
2,8,93,limited time new sale good lip balm apply skin...,"[139, 254, 385, 406, 527, 819, 849, 855, 920, ..."
3,1,93,skin new make chewy pet happy shopping hydrati...,"[208, 240, 260, 467, 591, 682, 754, 860, 936, ..."
4,12,81,new theaters march fragrance american discover...,"[378, 442, 453, 564, 575, 642, 869, 873, 1340,..."
5,4,81,new today collection streaming hulu shogun dis...,"[655, 808, 984, 1006, 1132, 1169, 1359, 1478, ..."
6,7,71,free unlimited buy line nuuly bbq try like cre...,"[394, 471, 487, 595, 613, 725, 1064, 1198, 125..."
7,2,67,sour theaters beans just kids match patch jell...,"[143, 170, 201, 222, 273, 301, 315, 366, 462, ..."
8,0,62,skin new moment daily glow video essence girl ...,"[112, 289, 360, 399, 436, 558, 633, 651, 760, ..."
9,5,61,new delta terms apply app card like don beach ...,"[295, 662, 748, 777, 783, 989, 1068, 1164, 171..."


<a id="sec4"></a>
## 4. Grid Search: Find number of topics

In the example so far, we arbitrarily chose the number of topics to be 15. However, that is not the right way to go about it. We should use methods for selecting the optimal number of topics. This can be done through a mechanism known as grid search with cross-validation (`GridSearchCV`), which builds multiple models and then compares them to see which one performs best.
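
To make the mechanism concrete, here is a minimal sketch on a toy corpus (the documents, the `toy_` names, and the choice of `cv=3` are mine, for illustration only). For every candidate value, `GridSearchCV` fits the model on part of the documents and scores it on the held-out part. Since no `scoring` argument is given, it uses the estimator's own `score` method, which for LDA is an approximate log-likelihood. When `cv` is not set, scikit-learn defaults to 5 folds.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical documents, invented for this illustration
toy_docs = [
    "new skin glow serum", "skin care glow routine", "new lip balm sale",
    "new movie streaming today", "movie tickets in theaters", "streaming series tonight",
    "pet food delivered today", "pet toys on sale", "dog treats delivered",
]
toy_dtm = CountVectorizer().fit_transform(toy_docs)

toy_grid = GridSearchCV(
    LatentDirichletAllocation(random_state=0),
    param_grid={'n_components': [2, 3]},
    cv=3,  # 3 folds only because the toy corpus is tiny
)
toy_grid.fit(toy_dtm)

# cv_results_ is a dictionary of arrays, so it loads directly into a dataframe
toy_results = pd.DataFrame(toy_grid.cv_results_)[
    ['param_n_components', 'mean_test_score', 'rank_test_score']
]
print(toy_results)
print(toy_grid.best_params_)
```

Looking at `mean_test_score` per candidate in a small table like this is often easier than reading the raw dictionary.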

In [33]:
from sklearn.model_selection import GridSearchCV

# We are going to test multiple values for the number of topics
search_params = {'n_components': [5, 10, 15, 20, 25, 30, 35]}

# Initialize the LDA model
lda = LatentDirichletAllocation()

# Initialize a Grid Search with cross-validation instance
grid = GridSearchCV(lda, param_grid=search_params)

# Do the Grid Search
grid.fit(dtm)

Let us look at the results:

In [34]:
grid.cv_results_

{'mean_fit_time': array([0.31691394, 0.29247179, 0.27941689, 0.27942915, 0.28690066,
        0.29683456, 0.30405126]),
 'std_fit_time': array([0.0165243 , 0.02506067, 0.00244721, 0.00238839, 0.00381297,
        0.00141829, 0.00326779]),
 'mean_score_time': array([0.01058421, 0.0105722 , 0.01031303, 0.0103229 , 0.01064243,
        0.00983582, 0.00972252]),
 'std_score_time': array([0.00056811, 0.00092465, 0.00050051, 0.00034589, 0.0001279 ,
        0.00033557, 0.00017553]),
 'param_n_components': masked_array(data=[5, 10, 15, 20, 25, 30, 35],
              mask=[False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'n_components': 5},
  {'n_components': 10},
  {'n_components': 15},
  {'n_components': 20},
  {'n_components': 25},
  {'n_components': 30},
  {'n_components': 35}],
 'split0_test_score': array([-22661.59831765, -27675.88185918, -31937.94972332, -35674.48672861,
        -39389.55672467, -43773.18169067, -47375.337870

Since this representation is a bit overwhelming, let's access a few features of the grid instance:

In [35]:
# Best Model
best_lda_model = grid.best_estimator_

# Model Parameters
print("Best Model's Params: ", grid.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", grid.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(dtm))

Best Model's Params:  {'n_components': 5}
Best Log Likelihood Score:  -23295.924462537158
Model Perplexity:  2965.9465658789786


The results show that the best LDA model has 5 topics, the smallest number we tried. This raises the question of whether we should try other small numbers, which I'm doing below:

In [36]:
search_params = {'n_components': [1,2,3,4,5,6]}

lda = LatentDirichletAllocation()
grid = GridSearchCV(lda, param_grid=search_params)

grid.fit(dtm)

# Best Model
best_lda_model = grid.best_estimator_

# Model Parameters
print("Best Model's Params: ", grid.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", grid.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(dtm))

Best Model's Params:  {'n_components': 1}
Best Log Likelihood Score:  -17346.95820992014
Model Perplexity:  2702.786354244111


This result shows that, by this score, the best number of topics for this corpus is actually 1.

**Meaning of Log Likelihood**. 

Log Likelihood is the logarithm of the probability of observing the given data under the model with specific parameters. Essentially, it measures how well the model explains the observed data. (It is a conditional probability.)

**Meaning of perplexity**

Perplexity is a common metric used to evaluate the quality of probabilistic models. It reflects how well the model describes or predicts the documents in the dataset.

A lower perplexity score suggests that the model is less surprised by the documents it is evaluated on (i.e., the probability distributions it assigns to them are more accurate). Note that here we compute perplexity on `dtm`, the same data the model was fitted on, so it measures how well the learned topic distributions fit the observed data rather than how well they predict unseen documents.
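
The two metrics are closely related. To my understanding of scikit-learn's implementation, perplexity is the exponential of the negative log-likelihood score divided by the total number of words. The sketch below checks this on a toy corpus (the documents and the `toy_` names are invented for illustration).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, invented for this illustration
toy_docs = [
    "new skin glow serum",
    "new movie streaming today",
    "pet food delivered today",
    "skin care glow routine",
]
toy_dtm = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_dtm)

log_likelihood = toy_lda.score(toy_dtm)  # approximate log-likelihood of the data
n_words = toy_dtm.sum()                  # total word count in the corpus

# Perplexity as exp(-log-likelihood per word)
manual_perplexity = np.exp(-log_likelihood / n_words)
print(manual_perplexity)
print(toy_lda.perplexity(toy_dtm))
```

This is why a higher (less negative) log-likelihood goes together with a lower perplexity when both are computed on the same documents.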

**Words for the best model with one topic**

Let's see what the top words are for the best model with one topic:

In [37]:
display_topics(best_lda_model, feature_names, 40)

Topic 0:
new skin shop free time march love make today don chewy theaters amazon like sale collection limited streaming discover just try glow best finds dress fashion spring miss want good lip home cream body shopping hulu start need care maxx


As we can see, it is a mix of skincare and beauty, shopping and fashion, pet supplies, and movies and streaming. If we had more documents, and documents of a more distinct nature, we might have seen something else.

However, the point of this tutorial was to show the mechanics of building LDA models. 

Now it's time to take what you saw here and apply it to your projects.

Have fun exploring!