## Install Libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from gensim import corpora, models
from gensim.models import CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.corpora import Dictionary

from sklearn.model_selection import train_test_split

import os
import json
import lzma
from imgix import UrlBuilder
from google.cloud import vision_v1

## Task A: Google Cloud LDA

### 1. Report the top 25 words for each topic, and decide on suitable names for each topic.

##### Perform topic modeling (LDA) on the image labels from Google Vision

In [4]:
# Specify the path to the service account key file
Application_Credentials = 'summer-sun-420716-7ba0fd05a0b9.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = Application_Credentials

# Initialize Google Vision client
client = vision_v1.ImageAnnotatorClient()
#image = vision_v1.Image()

In [5]:
def extract_image_labels(image_path):
    """Extract image labels using Google Vision API."""
    with open(image_path, 'rb') as image_file:
        content = image_file.read()
    image = vision_v1.Image(content=content)
    response = client.label_detection(image=image)
    labels = [label.description for label in response.label_annotations]
    return labels

In [6]:
data_dictionary = "/Users/zy/Downloads/Data"

In [7]:
post_info = []

for subdir, dirs, files in os.walk(data_dictionary):
    folder_name = os.path.basename(subdir)
    folder_info = folder_name.rsplit('_', 4)  # Split from the right to capture the last four elements distinctly
    
    if len(folder_info) < 5:
        continue  # Skip folders that do not have the expected number of parts
    else:
        username = folder_info[0]
        post_id, likes, comments = folder_info[2:]

    labels = []

    for file in files:
        file_path = os.path.join(subdir, file)
        if file.lower().endswith(('.png', '.jpg', '.jpeg')):
            label = extract_image_labels(file_path)
            labels.extend(label)
    
    if labels:
        post_info.append({
            'profile_name': username,
            'post_id': post_id,
            'likes_count': likes,
            'comments_count': comments,
            'labels': ' '.join(labels),
        })

post_info = pd.DataFrame(post_info)

In [8]:
post_info

Unnamed: 0,profile_name,post_id,likes_count,comments_count,labels
0,kayaancontractor,2868439159916464863,205,2,Nose Glasses Skin Lip Vision care Eyebrow Mout...
1,debasreee,3066886135067171352,7650,20,Cloud Sky Water Hat People in nature Wood Sun ...
2,shereenlovebug,3029832820659598377,456,27,Forehead Nose Cheek Skin Lip Chin Hairstyle Ey...
3,mandirabedi,2889470489057495466,9167,86,Lip Hand Arm Finger Gesture Happy Grass Eyewea...
4,rahulkl,3053766391959056433,4057,18,Clothing Outerwear Shirt Photograph Cap Sports...
...,...,...,...,...,...
1963,akanksharedhu,3006774447044259315,-1,23,Food Tableware Ingredient Recipe Ice cream bar...
1964,bonappetitmag,3066434777540625472,10850,48,Food Ingredient Recipe Baked goods Cuisine Dis...
1965,vehiclevirgins,2791858705941864416,6366,103,Wheel Tire Car Vehicle Automotive lighting Aut...
1966,virat.kohli,3033512035523241475,4464459,28359,Sports uniform Shorts Jersey Sports equipment ...


In [9]:
image_labels = post_info['labels']

# Tokenize the image labels
tokenized_labels = [label.split() for label in image_labels]

# Create dictionary and corpus for topic modeling
dictionary = corpora.Dictionary(tokenized_labels)
corpus = [dictionary.doc2bow(label) for label in tokenized_labels]

# Split the corpus into training and testing sets
corpus_train, corpus_test = train_test_split(corpus, test_size=0.2, random_state=42)

##### Choose an appropriate number of topics

In [10]:
num_topics_list = range(1, 11)  # Loop through 1 to 10

perplexity_scores = []
coherence_scores = []

for num_topics in num_topics_list:
    # Train LDA model
    lda_model = LdaModel(corpus_train, id2word=dictionary, num_topics=num_topics)

    # Compute perplexity
    perplexity = lda_model.log_perplexity(corpus_test)
    perplexity_scores.append(perplexity)

    # Compute coherence
    coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_labels, dictionary=dictionary, coherence='c_v')
    coherence = coherence_model_lda.get_coherence()
    coherence_scores.append(coherence)

# Choose the number of topics based on the scores
best_num_topics_perplexity = num_topics_list[perplexity_scores.index(min(perplexity_scores))]
best_num_topics_coherence = num_topics_list[coherence_scores.index(max(coherence_scores))]

print("Best number of topics based on perplexity:", best_num_topics_perplexity)
print("Best number of topics based on coherence:", best_num_topics_coherence)

Best number of topics based on perplexity: 10
Best number of topics based on coherence: 10


##### Describe the process of finding the best number of topics in detail

Initially, a range of topics, from one to ten, is defined. Each number of topics within this range is sequentially iterated over. For every iteration, the LDA model is trained on a designated training corpus. Subsequently, two key metrics, perplexity and coherence, are computed to evaluate the performance of the model. 

1. **Perplexity**: reflecting the model's ability to predict the sample, is calculated with lower values indicating superior performance.
2. **Coherence**: assessing the interpretability and quality of the generated topics, with higher scores signifying more coherent topics.

Once perplexity and coherence scores are obtained for each number of topics, the best number of topics is determined based on these metrics. Specifically, the number of topics with the lowest perplexity score and the highest coherence score is identified. This methodical approach enables the selection of an optimal number of topics that strikes a balance between model complexity and performance, ensuring the generation of meaningful and coherent topics in the LDA model.

In this case, the best number of topics based on both metrics is 10, indicating that the model with 10 topics performs relatively well in predicting the sample and produces the most coherent and semantically meaningful topics when configured with this number.

##### Report the top 25 words for each topic, and decide on suitable names for each topic

In [11]:
# Perform LDA topic modeling
lda_model = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=15, random_state=42)
topics = lda_model.show_topics(num_topics=10, num_words=25, formatted=False)
topic_name = ['Automotive Design and Parts', 'Home and Interior Design', 'Fashion and Style Photography', 'Culinary Arts and Cuisine', 
              'Nature and Travel', 'Fashion Design and Accessories', 'Sports Gear and Apparel', 'Body and Fashion Photography',
              'Wedding and Bridal Fashion', 'Graphic Design and Branding']

topics_data = []

for idx, topic in topics:
    # Remove duplicates from the list of words
    words = list(set([word for word, prob in topic]))
    words = ", ".join(words)
    topics_data.append({'Topic': topic_name[idx], 'Top Words': words})

# Create a DataFrame from the list of topic data
topics_df = pd.DataFrame(topics_data)
topics_df

Unnamed: 0,Topic,Top Words
0,Automotive Design and Parts,"Steering, Land, Alloy, Plant, exterior, Hood, ..."
1,Home and Interior Design,"Flooring, Plant, Leisure, Building, Floor, Cha..."
2,Fashion and Style Photography,"Facial, Beard, Human, Gesture, Happy, Event, L..."
3,Culinary Arts and Cuisine,"Plant, Natural, Drinkware, Plate, Table, Stapl..."
4,Nature and Travel,"Travel, Plant, Natural, Leisure, Building, Clo..."
5,Fashion Design and Accessories,"Magenta, Plant, Flower, Leisure, Purple, Texti..."
6,Sports Gear and Apparel,"Hat, Baseball, jersey, Gesture, Competition, g..."
7,Body and Fashion Photography,"Neck, Flooring, Joint, Waist, Human, Knee, Thi..."
8,Wedding and Bridal Fashion,"One-piece, Neck, dress, Textile, Waist, Happy,..."
9,Graphic Design and Branding,"Magenta, Graphic, Baseball, Gesture, Event, Sp..."


In [12]:
topics_data

[{'Topic': 'Automotive Design and Parts',
  'Top Words': 'Steering, Land, Alloy, Plant, exterior, Hood, Tire, plate, Car, Vehicle, wheel, Grille, Motor, Wheel, Automotive, car, vehicle, tire, lighting, light, part, Personal, design, Sky, registration'},
 {'Topic': 'Home and Interior Design',
  'Top Words': 'Flooring, Plant, Leisure, Building, Floor, Chair, Happy, Event, Recreation, Table, Wood, Smile, Crowd, Furniture, Window, Sharing, Interior, Fun, Comfort, T-shirt, Tableware, design, Couch, Houseplant, Lighting'},
 {'Topic': 'Fashion and Style Photography',
  'Top Words': 'Facial, Beard, Human, Gesture, Happy, Event, Lip, Smile, expression, Hairstyle, Eyewear, Flash, Fun, Entertainment, hair, Forehead, Eyebrow, Chin, Fashion, Skin, photography, Cool, Font, Eyelash, Sleeve'},
 {'Topic': 'Culinary Arts and Cuisine',
  'Top Words': 'Plant, Natural, Drinkware, Plate, Table, Staple, Fruit, Cup, vegetable, Food, Recipe, Leaf, Baked, Ingredient, foods, Produce, Tableware, Kitchen, Cuisine,

### 2. What are the main differences in the average topic weights of images across the two quartiles?

##### Sort the data from high to low engagement (by # of comments)

In [29]:
post_info['comments_count'] = pd.to_numeric(post_info['comments_count'], errors='coerce')

# Sort the data based on the number of comments
df_sorted = post_info.sort_values(by='comments_count', ascending=False)
df_sorted

Unnamed: 0,profile_name,post_id,likes_count,comments_count,labels
1045,halfbakedharvest,2996407846187385570,247631,693966,Kitchen appliance Christmas tree Home applianc...
511,halfbakedharvest,2990612658156456674,212513,515973,Liquid Plant Flash photography Gas Tints and s...
1834,halfbakedharvest,2986269265263400579,112425,331474,Automotive lighting Input device Audio equipme...
10,halfbakedharvest,3000750743145862289,160070,239765,Orange Tableware Serveware Drinkware Event Toy...
578,halfbakedharvest,2869549391481578933,94986,184733,Flower Dishware Drinkware Flowerpot Purple Pla...
...,...,...,...,...,...
330,eventplannerlife,456753512781395346,17,0,Water Building Water resources Boat Watercraft...
437,eventplanneracademy,2283360642806948755,99,0,Purple Font Rectangle Line Electric blue Magen...
941,kunalgir,2261132305343390536,888,0,Smile Temple Happy Cool Fashion design Event F...
1210,kunalgir,2138354707539057422,556,0,Muscle Blue Publication Advertising Knee Water...


In [30]:
# Initialize an empty list to store assigned topics
assigned_topics = []

# Iterate through each post in the sorted DataFrame
for idx, post in df_sorted.iterrows():
    # Extract the image labels for the post
    labels = post['labels'].split()
    # Tokenize the labels
    tokenized_labels = [label for label in labels]
    # Create a bag of words representation for the post
    bow = dictionary.doc2bow(tokenized_labels)
    # Get the topic distribution for the post
    topic_distribution = lda_model.get_document_topics(bow)
    # Sort the topic distribution by probability in descending order
    sorted_topics = sorted(topic_distribution, key=lambda x: x[1], reverse=True)
    # Assign the topic with the highest probability
    assigned_topic = sorted_topics[0][0]  # Get the index of the topic
    assigned_topic_name = topic_name[assigned_topic]  # Get the name of the assigned topic
    assigned_topics.append(assigned_topic_name)

# Add the assigned topics as a new column to the sorted DataFrame
df_sorted['assigned_topic_name'] = assigned_topics
df_sorted

Unnamed: 0,profile_name,post_id,likes_count,comments_count,labels,assigned_topic_name
1045,halfbakedharvest,2996407846187385570,247631,693966,Kitchen appliance Christmas tree Home applianc...,Fashion Design and Accessories
511,halfbakedharvest,2990612658156456674,212513,515973,Liquid Plant Flash photography Gas Tints and s...,Home and Interior Design
1834,halfbakedharvest,2986269265263400579,112425,331474,Automotive lighting Input device Audio equipme...,Automotive Design and Parts
10,halfbakedharvest,3000750743145862289,160070,239765,Orange Tableware Serveware Drinkware Event Toy...,Culinary Arts and Cuisine
578,halfbakedharvest,2869549391481578933,94986,184733,Flower Dishware Drinkware Flowerpot Purple Pla...,Fashion Design and Accessories
...,...,...,...,...,...,...
330,eventplannerlife,456753512781395346,17,0,Water Building Water resources Boat Watercraft...,Nature and Travel
437,eventplanneracademy,2283360642806948755,99,0,Purple Font Rectangle Line Electric blue Magen...,Graphic Design and Branding
941,kunalgir,2261132305343390536,888,0,Smile Temple Happy Cool Fashion design Event F...,Fashion Design and Accessories
1210,kunalgir,2138354707539057422,556,0,Muscle Blue Publication Advertising Knee Water...,Graphic Design and Branding


In [38]:
# Assign topics to each image in the dataset
df_sorted['assigned_topic'] = [lda_model.get_document_topics(corpus[i])[0][0] for i in range(len(corpus))]
df_sorted

Unnamed: 0,profile_name,post_id,likes_count,comments_count,labels,assigned_topic_name,assigned_topic
1045,halfbakedharvest,2996407846187385570,247631,693966,Kitchen appliance Christmas tree Home applianc...,Fashion Design and Accessories,2
511,halfbakedharvest,2990612658156456674,212513,515973,Liquid Plant Flash photography Gas Tints and s...,Home and Interior Design,2
1834,halfbakedharvest,2986269265263400579,112425,331474,Automotive lighting Input device Audio equipme...,Automotive Design and Parts,2
10,halfbakedharvest,3000750743145862289,160070,239765,Orange Tableware Serveware Drinkware Event Toy...,Culinary Arts and Cuisine,2
578,halfbakedharvest,2869549391481578933,94986,184733,Flower Dishware Drinkware Flowerpot Purple Pla...,Fashion Design and Accessories,2
...,...,...,...,...,...,...,...
330,eventplannerlife,456753512781395346,17,0,Water Building Water resources Boat Watercraft...,Nature and Travel,1
437,eventplanneracademy,2283360642806948755,99,0,Purple Font Rectangle Line Electric blue Magen...,Graphic Design and Branding,3
941,kunalgir,2261132305343390536,888,0,Smile Temple Happy Cool Fashion design Event F...,Fashion Design and Accessories,0
1210,kunalgir,2138354707539057422,556,0,Muscle Blue Publication Advertising Knee Water...,Graphic Design and Branding,6


##### Take the highest and the lowest quartiles

In [39]:
# Calculate quartiles
highest_quartile = df_sorted['comments_count'].quantile(0.75)
lowest_quartile = df_sorted['comments_count'].quantile(0.25)

# Extract rows corresponding to the highest and lowest quartiles
highest_quartile_data = df_sorted[df_sorted['comments_count'] >= highest_quartile]
lowest_quartile_data = df_sorted[df_sorted['comments_count'] <= lowest_quartile]

In [40]:
highest_quartile_data

Unnamed: 0,profile_name,post_id,likes_count,comments_count,labels,assigned_topic_name,assigned_topic
1045,halfbakedharvest,2996407846187385570,247631,693966,Kitchen appliance Christmas tree Home applianc...,Fashion Design and Accessories,2
511,halfbakedharvest,2990612658156456674,212513,515973,Liquid Plant Flash photography Gas Tints and s...,Home and Interior Design,2
1834,halfbakedharvest,2986269265263400579,112425,331474,Automotive lighting Input device Audio equipme...,Automotive Design and Parts,2
10,halfbakedharvest,3000750743145862289,160070,239765,Orange Tableware Serveware Drinkware Event Toy...,Culinary Arts and Cuisine,2
578,halfbakedharvest,2869549391481578933,94986,184733,Flower Dishware Drinkware Flowerpot Purple Pla...,Fashion Design and Accessories,2
...,...,...,...,...,...,...,...
1100,yuzi_chahal23,3034998328627543899,269408,420,Glasses Smile Vision care Sunglasses Chair Hap...,Fashion Design and Accessories,0
1312,malaikaaroraofficial,3016069893396863505,46953,419,Dog Carnivore Dog breed German spitz Spitz Hap...,Sports Gear and Apparel,2
426,ishant.sharma29,3030065365716527817,1365764,416,Hairstyle Facial expression Vision care Goggle...,Fashion Design and Accessories,0
1503,bonappetitmag,3065800668426056503,50658,416,Glasses Vision care Smile Customer Food Eyewea...,Fashion Design and Accessories,2


In [41]:
lowest_quartile_data

Unnamed: 0,profile_name,post_id,likes_count,comments_count,labels,assigned_topic_name,assigned_topic
1911,michaelhyatt,3035127162126388068,337,30,Font Rectangle Parallel Screenshot Number Art ...,Graphic Design and Branding,2
165,minimalistbaker,3019891519067395807,2740,30,Food Ingredient Recipe Staple food Dish Cuisin...,Culinary Arts and Cuisine,2
40,mandirabedi,2863850507363824801,2254,30,Handwriting Rectangle Font Tints and shades Wr...,Graphic Design and Branding,3
551,yogeshfitness,2934191407661249747,-1,30,Automotive tire Sunglasses Automotive design K...,Body and Fashion Photography,2
635,hulu,3065712398225426484,2822,30,Gesture Human leg Entertainment Event Darkness...,Fashion and Style Photography,6
...,...,...,...,...,...,...,...
330,eventplannerlife,456753512781395346,17,0,Water Building Water resources Boat Watercraft...,Nature and Travel,1
437,eventplanneracademy,2283360642806948755,99,0,Purple Font Rectangle Line Electric blue Magen...,Graphic Design and Branding,3
941,kunalgir,2261132305343390536,888,0,Smile Temple Happy Cool Fashion design Event F...,Fashion Design and Accessories,0
1210,kunalgir,2138354707539057422,556,0,Muscle Blue Publication Advertising Knee Water...,Graphic Design and Branding,6


##### Main differences in the average topic weights of images across the two quartiles (e.g., greater weight of some topics in the highest versus lowest engagement quartiles).

In [42]:
# Calculate the average topic weights for each quartile
highest_quartile_topic_weights = highest_quartile_data.groupby('assigned_topic')['comments_count'].mean()
lowest_quartile_topic_weights = lowest_quartile_data.groupby('assigned_topic')['comments_count'].mean()

In [46]:
# Create a table to show the main differences in average topic weights
topic_weights_comparison = pd.DataFrame({
    'Topic': topic_name,
    'Avg. Weight - Highest Quartile': highest_quartile_topic_weights.round(2).values,
    'Avg. Weight - Lowest Quartile': lowest_quartile_topic_weights.round(2).values
})
topic_weights_comparison.sort_values(by='Avg. Weight - Highest Quartile', ascending=False)

Unnamed: 0,Topic,Avg. Weight - Highest Quartile,Avg. Weight - Lowest Quartile
2,Fashion and Style Photography,19251.44,15.43
5,Fashion Design and Accessories,5586.33,14.35
7,Body and Fashion Photography,5472.85,13.36
8,Wedding and Bridal Fashion,4875.86,9.0
3,Culinary Arts and Cuisine,4874.53,16.28
9,Graphic Design and Branding,4650.69,17.43
1,Home and Interior Design,3073.8,13.54
6,Sports Gear and Apparel,3004.11,12.66
0,Automotive Design and Parts,2643.92,14.12
4,Nature and Travel,1946.18,14.88


## Task B: Business Interpretation and Client Advise  

Based on my findings above, I would be able to identify topics that generate higher engagement by noticeable differences in the average topic weights between the highest and lowest engagement quartiles. 

- **Fashion and Style Photography**: This topic exhibits significantly higher average weights in the highest quartile compared to the lowest quartile, indicating that posts related to fashion and style photography tend to generate more engagement among the audience. Followed by **Fashion Design and Accessories** and **Body and Fashion Photography** with also relatively high average weights in the highest quartile.

- **Nature and Travel**: Conversely, the topic of nature and travel exhibits the lowest average weights in the highest quartile, suggesting that this type of content may not resonate as strongly with the audience or may have less appeal for engagement.

Based on these findings, the client can tailor their content strategy to focus more on fashion and style-related topics, which have shown to drive higher engagement. Here are some more general advice for the client:

1. **Visual Appeal**: Ensure that the posts are visually appealing and align with the topics that generate higher engagement. Use high-quality images and captivating visuals to grab the audience's attention.

2. **Engage with Followers**: Actively engage with the audience by responding to comments, liking posts, and reposting user-generated content. Building a sense of community fosters loyalty and encourages further engagement.

3. **Consistent Posting**: Maintain a consistent posting schedule to keep the audience engaged and interested. Consistency helps in staying relevant in the minds of the followers and encourages them to interact regularly.

4. **Utilize Stories and Reels**: Take advantage of Instagram Stories and Reels to share behind-the-scenes content, tutorials, or quick snippets related to the high-engagement topics. These formats offer a more casual and interactive way to connect with the audience.