<h1>Image Analytics + Topic Modelling (LDA)</h1>

<h2>Scrape data from the @thailand account on Instagram</h2>

In [61]:
import instaloader
import pandas as pd

# Initialize an Instaloader instance
L = instaloader.Instaloader()

# Target Instagram profile (replace 'thailand' with another username to scrape a different account)
profile = instaloader.Profile.from_username(L.context, 'thailand')
print(profile)
count = 0

url_list = []
post_list = []
like_list = []
caption_list = []

for post in profile.get_posts():
    post_url = post.url
    post_type = post.typename  # Type of the post (e.g., 'GraphImage', 'GraphVideo', 'GraphSidecar')
    num_pictures = len(list(post.get_sidecar_nodes()))  # Number of pictures in a post (works for multiple-image posts)
    num_likes = post.likes
    str_caption = post.caption 

    url_list.append(post_url)
    post_list.append(post_type)
    like_list.append(num_likes)
    caption_list.append(str_caption)

    #print("Post URL:", post_url)
    #print("Post Type:", post_type)
    #print("Like:", num_likes)
    #print("Caption:", str_caption)
    #print("\n")

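    # Stop once 550 image posts (single images or carousels) have been counted;
    # videos are still recorded in the lists above but do not count toward the limit
    if count >= 550:
        break
    if post_type in ('GraphImage', 'GraphSidecar'):
        count += 1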
print(count)

<Profile thailand (1780827587)>


In [62]:
instagram_data = pd.DataFrame()
instagram_data['image_url'] =  url_list
instagram_data['post_type'] =  post_list
instagram_data['like'] =  like_list
instagram_data['caption'] =  caption_list
instagram_data.to_csv("postURL.csv", index= False)
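
The loop above records every post, including videos, and only uses the counter to decide when to stop. If only image posts are wanted in the CSV, one possible variant is to filter before collecting. The sketch below is not what the cell above does: `collect_image_posts` and `fake_posts` are hypothetical names, and the stand-in objects only mimic the attributes the loop reads (`url`, `typename`, `likes`, `caption`), so it runs without a network connection.

```python
from itertools import islice
from types import SimpleNamespace

IMAGE_TYPES = {"GraphImage", "GraphSidecar"}

def collect_image_posts(posts, limit):
    """Return up to `limit` rows for image posts, skipping videos (hypothetical helper)."""
    image_posts = (p for p in posts if p.typename in IMAGE_TYPES)
    return [
        {"image_url": p.url, "post_type": p.typename, "like": p.likes, "caption": p.caption}
        for p in islice(image_posts, limit)
    ]

# Stand-in objects with the same attribute names the scraping loop reads
fake_posts = [
    SimpleNamespace(url="u1", typename="GraphImage", likes=10, caption="a"),
    SimpleNamespace(url="u2", typename="GraphVideo", likes=99, caption="b"),
    SimpleNamespace(url="u3", typename="GraphSidecar", likes=7, caption="c"),
    SimpleNamespace(url="u4", typename="GraphImage", likes=3, caption="d"),
]

rows = collect_image_posts(fake_posts, limit=2)
print([r["image_url"] for r in rows])
```

With real data, `profile.get_posts()` would take the place of `fake_posts`, and the resulting list of dictionaries can be passed straight to `pd.DataFrame`.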

<h2>Analyze images with Google Vision API</h2>

In [None]:
from google.cloud import vision_v1 as vision

import os
import pandas as pd

# Path to the Google Cloud service-account credential file (.json); intentionally left blank here
Application_Credentials = "" 
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = Application_Credentials
client = vision.ImageAnnotatorClient()
image = vision.Image()

def detect_labels_uri(uri):
    """Detects labels in the file located in Google Cloud Storage or on the Web."""
    image.source.image_uri = uri

    response = client.label_detection(image=image)
    labels = response.label_annotations
    #print("Labels:")

    label_list = [] 

    for label in labels:
        #print(label.description)
        label_list.append(label.description)

    #print(' '.join(label_list))

    return label_list

df = pd.read_csv("postURL.csv")
label_column = []

for index, row in df.iterrows():
    #print(index)
    results = detect_labels_uri(row['image_url'])
    label_column.append(results)

df['label'] = label_column
df.to_csv("image_label.csv", index= False)
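
Because each label list is written to CSV as text, it comes back as a string such as `"['Sky', 'Water']"` rather than a list. That is why later cells compare against the string `"[]"` and call `ast.literal_eval`. The sketch below illustrates the round trip with the standard `csv` module as a stand-in for `to_csv`/`read_csv`; the two rows are made up for illustration.

```python
import ast
import csv
import io

# Made-up rows: one post with labels, one where no labels were returned
rows = [
    {"image_url": "u1", "label": ["Sky", "Water"]},
    {"image_url": "u2", "label": []},
]

# Write to an in-memory CSV; non-string values are stored as their str() form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["image_url", "label"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the label column is now plain text
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(repr(loaded[0]["label"]))

# Keep only rows that have labels, then turn the text back into lists
non_empty = [r for r in loaded if r["label"] != "[]"]
parsed = [ast.literal_eval(r["label"]) for r in non_empty]
print(parsed)
```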

<h2>Clean Data</h2>
Create an engagement column (high or low) based on the number of likes.

In [64]:
import pandas as pd

#read files
df = pd.read_csv("image_label.csv")
print(df.head(3))
print("rows: ", len(df))

#Drop rows where the number of likes is missing (stored as -1)
df = df[df['like'] != -1]
print("rows: ", len(df))

#Drop rows with no labels (an empty list is stored as the string "[]")
df = df[df['label'] != "[]"]
print("rows: ", len(df))
#df.to_csv("check.csv")

#Threshold - Median Value
#Posts with more likes than the median are labeled 1,
#otherwise 0
median_value = df['like'].median()
#print(median_value)
df['engagement'] = df['like'].apply(lambda x: 1 if x > median_value else 0)
print(df.shape)
print(df.head(3))


                                           image_url     post_type    like  \
0  https://scontent-dfw5-1.cdninstagram.com/v/t51...  GraphSidecar  514160   
1  https://scontent-dfw5-2.cdninstagram.com/v/t51...  GraphSidecar   96360   
2  https://scontent-dfw5-1.cdninstagram.com/v/t51...    GraphImage   74962   

                                             caption  \
0  WE HAVE ALWAYS HELD TO THE HOPE, THE BELIEF, T...   
1  Realize deeply that the present moment is all ...   
2  Paradise is where I am. When will you begin th...   

                                               label  
0  ['Sky', 'Water', 'Boat', 'Building', 'Cloud', ...  
1  ['Cloud', 'Sky', 'Building', 'Car', 'Daytime',...  
2  ['Water', 'Sky', 'Mountain', 'Ecoregion', 'Mam...  
rows:  542
rows:  542
rows:  539
(539, 6)
                                           image_url     post_type    like  \
0  https://scontent-dfw5-1.cdninstagram.com/v/t51...  GraphSidecar  514160   
1  https://scontent-dfw5-2.cdninstagram.com/
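
The median split can be checked by hand on a tiny example. The sketch below uses made-up like counts and the same rule as the cell above (strictly greater than the median is high engagement). One thing it shows is that posts exactly at the median fall into the low group, so the two classes are not guaranteed to be the same size.

```python
from statistics import median

# Made-up like counts, for illustration only
likes = [120, 300, 300, 450, 900]

threshold = median(likes)

# Same rule as the notebook: 1 if strictly above the median, otherwise 0
engagement = [1 if x > threshold else 0 for x in likes]

print(threshold)
print(engagement)
```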

<h2>Using Image Labels and Captions to predict high-low engagement</h2>

<h3>Part 1: Label</h3>

In [2]:
import ast
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

#prepare text_documents
text_documents = []
for label in df['label']:
    concatenated_text = " ".join(ast.literal_eval(label))
    text_documents.append(concatenated_text)

#create TF-IDF matrix
#print(text_documents)
select_stopwords = ["of", "and"]
tfidf_vectorizer = TfidfVectorizer(stop_words=select_stopwords)
tfidf_matrix = tfidf_vectorizer.fit_transform(text_documents)
tfidf_array = tfidf_matrix.toarray()
terms = tfidf_vectorizer.get_feature_names_out()

# Create a DataFrame
df_label_tfidf = pd.DataFrame(tfidf_array, columns=terms)

# Display the DataFrame
#df_label_tfidf['Y'] = df['engagement']
#df_label_tfidf.to_csv("label.csv")
print(df_label_tfidf)

     adaptation  african  afterglow  agriculture  algae  amber    animal  \
0           0.0      0.0   0.000000          0.0    0.0    0.0  0.000000   
1           0.0      0.0   0.000000          0.0    0.0    0.0  0.000000   
2           0.0      0.0   0.000000          0.0    0.0    0.0  0.345095   
3           0.0      0.0   0.000000          0.0    0.0    0.0  0.000000   
4           0.0      0.0   0.000000          0.0    0.0    0.0  0.000000   
..          ...      ...        ...          ...    ...    ...       ...   
534         0.0      0.0   0.316672          0.0    0.0    0.0  0.000000   
535         0.0      0.0   0.000000          0.0    0.0    0.0  0.000000   
536         0.0      0.0   0.000000          0.0    0.0    0.0  0.000000   
537         0.0      0.0   0.331162          0.0    0.0    0.0  0.000000   
538         0.0      0.0   0.000000          0.0    0.0    0.0  0.000000   

     aqua  archaeological  architecture  ...  wheel  white  wind  window  \
0     0.0  
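
To make the numbers in this matrix less opaque, here is a minimal sketch of how one TF-IDF row is computed, assuming the default `TfidfVectorizer` settings (smoothed IDF and L2 normalization). The three tiny documents are invented, and the function `tfidf` is a hypothetical helper written only for this illustration.

```python
import math
from collections import Counter

# Invented label documents, already lowercased and split into words
docs = [["sky", "water", "boat"], ["sky", "cloud"], ["temple", "sky"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df_counts = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """TF-IDF weights for one document, normalized to unit length."""
    tf = Counter(doc)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {
        t: count * (math.log((1 + n_docs) / (1 + df_counts[t])) + 1)
        for t, count in tf.items()
    }
    # L2 normalization so the squared weights of the row sum to 1
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

A term that appears in every document ("sky") ends up with a smaller weight than terms unique to one document ("water", "boat"), which is why very common labels contribute little to the features.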

In [3]:
#feature selection
from sklearn.feature_selection import SelectKBest, chi2

# Select the top 100 features based on chi-squared score
df_label_tfidf.fillna(0, inplace=True)
k = 100
tfidf_matrix_selected = SelectKBest(chi2, k=k).fit_transform(df_label_tfidf , df['engagement'].to_list())
print(tfidf_matrix_selected.shape)

(539, 100)


In [4]:
# Importance of each word
test = SelectKBest(score_func=chi2, k=k)
fit = test.fit(df_label_tfidf , df['engagement'].to_list())
#print(fit.pvalues_.shape)

df_column_score = pd.DataFrame({'Column_name': df_label_tfidf.columns, 'P-value': fit.pvalues_})
# P-value of each word
print(df_column_score.sort_values(by='P-value',ascending = True).head(100))

    Column_name   P-value
17   atmosphere  0.003306
21        azure  0.023746
93       flower  0.044280
2     afterglow  0.069438
12          art  0.081926
..          ...       ...
257   thatching  0.461442
286        wave  0.461465
289        wind  0.461465
288       white  0.464057
22         bank  0.466615

[100 rows x 2 columns]


In [9]:
#Logistic Regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix_selected, df['engagement'].to_list(), test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy:", accuracy)

Confusion Matrix:
[[32 18]
 [28 30]]

Accuracy: 0.5740740740740741
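
The accuracy can be reproduced from the confusion matrix, which also gives precision, recall, and a baseline to compare against. The sketch below takes the matrix printed above as input, with rows as true classes and columns as predicted classes. The majority-class baseline is an addition for context, not something the notebook computes.

```python
# Confusion matrix from the label model: rows = true class, columns = predicted class
conf_matrix = [[32, 18], [28, 30]]
(tn, fp), (fn, tp) = conf_matrix

total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all test posts classified correctly
precision = tp / (tp + fp)     # of posts predicted high, how many really were high
recall = tp / (tp + fn)        # of truly high posts, how many were found

# Accuracy of always predicting the larger true class (added for comparison)
majority_baseline = max(tn + fp, fn + tp) / total

print(total, accuracy, precision, recall, majority_baseline)
```

On this test split the baseline is about 0.537, so the label model's 0.574 is only a small improvement over always guessing the larger class.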


<h3>Part 2: Caption</h3>

In [10]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Tokenization and lowercase conversion
    tokens = nltk.word_tokenize(text.lower())
    # Remove stopwords; punctuation tokens are dropped later by TfidfVectorizer's default tokenizer
    cleaned_tokens = [token for token in tokens if token not in stop_words]
    # Rejoin tokens into a clean text string
    cleaned_text = " ".join(cleaned_tokens)
    
    return cleaned_text


#prepare text_documents

text_documents = [clean_text(document) for document in df['caption']]

#create TF-IDF matrix
#print(text_documents)

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(text_documents)
tfidf_array = tfidf_matrix.toarray()
terms = tfidf_vectorizer.get_feature_names_out()

# Create a DataFrame
df_caption_tfidf = pd.DataFrame(tfidf_array, columns=terms)
# Display the DataFrame
#df_caption_tfidf['Y'] = df['engagement']
#df_caption_tfidf.to_csv("caption.csv")
print(df_caption_tfidf)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Poonnawit\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


     007   10   19  199s  1pae  1puinun  2022      2023   26  2ontours  ...  \
0    0.0  0.0  0.0   0.0   0.0      0.0   0.0  0.305949  0.0       0.0  ...   
1    0.0  0.0  0.0   0.0   0.0      0.0   0.0  0.000000  0.0       0.0  ...   
2    0.0  0.0  0.0   0.0   0.0      0.0   0.0  0.000000  0.0       0.0  ...   
3    0.0  0.0  0.0   0.0   0.0      0.0   0.0  0.000000  0.0       0.0  ...   
4    0.0  0.0  0.0   0.0   0.0      0.0   0.0  0.000000  0.0       0.0  ...   
..   ...  ...  ...   ...   ...      ...   ...       ...  ...       ...  ...   
534  0.0  0.0  0.0   0.0   0.0      0.0   0.0  0.000000  0.0       0.0  ...   
535  0.0  0.0  0.0   0.0   0.0      0.0   0.0  0.000000  0.0       0.0  ...   
536  0.0  0.0  0.0   0.0   0.0      0.0   0.0  0.000000  0.0       0.0  ...   
537  0.0  0.0  0.0   0.0   0.0      0.0   0.0  0.000000  0.0       0.0  ...   
538  0.0  0.0  0.0   0.0   0.0      0.0   0.0  0.000000  0.0       0.0  ...   

     ยวเม  ยวไทยเท  องไทยทางig  องไทยสวยท  องไทยไม 

In [11]:
#feature selection

from sklearn.feature_selection import SelectKBest, chi2

# Select the top 100 features based on chi-squared score
df_caption_tfidf.fillna(0, inplace=True)
k = 100
tfidf_matrix_selected2 = SelectKBest(chi2, k=k).fit_transform(df_caption_tfidf , df['engagement'].to_list())
print(tfidf_matrix_selected2.shape)

(539, 100)


In [12]:
# Importance of each word
test = SelectKBest(score_func=chi2, k=k)
fit = test.fit(df_caption_tfidf , df['engagement'])
#print(fit.pvalues_.shape)

df_column_score = pd.DataFrame({'Column_name': df_caption_tfidf.columns, 'P-value': fit.pvalues_})
# P-value of each word
print(df_column_score.sort_values(by='P-value',ascending = True).head(100))

        Column_name   P-value
757           krabi  0.013938
759   krabithailand  0.049988
1223         railay  0.092785
1445        sunrise  0.104103
721    kohlaolading  0.106165
...             ...       ...
483           found  0.392118
439          family  0.392275
1630            use  0.392295
60    alysha_huxley  0.392803
365     dodgerjones  0.393716

[100 rows x 2 columns]


In [13]:
#Logistic Regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix_selected2, df['engagement'].to_list(), test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy:", accuracy)

Confusion Matrix:
[[39 11]
 [28 30]]

Accuracy: 0.6388888888888888
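
In the cells above, the TF-IDF vocabulary and the chi-squared feature selection are fitted on all 539 posts before the train/test split, so the test posts influence which words are kept. A possible alternative, sketched below, is to put the three steps in a scikit-learn `Pipeline` and fit it on the training rows only. The captions and engagement values here are invented toy data and `k=5` is an arbitrary choice for the small example; this is not the notebook's own procedure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy captions and engagement labels, invented for illustration only
captions = [
    "sunrise at krabi beach", "railay beach krabi sunset",
    "krabi island boat trip", "beach sunrise long tail boat",
    "bangkok temple at night", "temple statue in bangkok",
    "city street food market", "night market in the city",
]
engagement = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so nothing is learned from the test rows
X_train, X_test, y_train, y_test = train_test_split(
    captions, engagement, test_size=0.25, random_state=42, stratify=engagement
)

# Vectorizer, feature selection and classifier are all fitted on training data only
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(chi2, k=5)),
    ("clf", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(list(y_pred))
```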


<h3>Part 3: Label + Caption</h3>

In [22]:
tfidf_label = pd.DataFrame(tfidf_matrix_selected)
tfidf_caption = pd.DataFrame(tfidf_matrix_selected2)

# Concatenate horizontally
combined_df = pd.concat([tfidf_label , tfidf_caption], axis=1)

combined_df.columns = range(len(combined_df.columns))

#print(combined_df)

#Logistic Regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

X_train, X_test, y_train, y_test = train_test_split(combined_df, df['engagement'].to_list(), test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy:", accuracy)

     0         1    2    3    4         5    6         7         8    \
0    0.0  0.000000  0.0  0.0  0.0  0.000000  0.0  0.000000  0.000000   
1    0.0  0.000000  0.0  0.0  0.0  0.000000  0.0  0.000000  0.000000   
2    0.0  0.000000  0.0  0.0  0.0  0.000000  0.0  0.000000  0.000000   
3    0.0  0.000000  0.0  0.0  0.0  0.000000  0.0  0.000000  0.000000   
4    0.0  0.000000  0.0  0.0  0.0  0.000000  0.0  0.000000  0.000000   
..   ...       ...  ...  ...  ...       ...  ...       ...       ...   
534  0.0  0.316672  0.0  0.0  0.0  0.000000  0.0  0.418186  0.000000   
535  0.0  0.000000  0.0  0.0  0.0  0.000000  0.0  0.000000  0.000000   
536  0.0  0.000000  0.0  0.0  0.0  0.295882  0.0  0.000000  0.000000   
537  0.0  0.331162  0.0  0.0  0.0  0.000000  0.0  0.437321  0.000000   
538  0.0  0.000000  0.0  0.0  0.0  0.000000  0.0  0.000000  0.301087   

          9    ...  190  191  192  193  194  195  196  197  198       199  
0    0.000000  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 

**Observation**

Accuracy was lowest when using only the image labels (0.574) and highest when using only the image captions (0.639). When we merged the caption and label features, we observed multicollinearity in the data, because the image labels and captions sometimes contain similar or even identical words. As a result, the combined model had lower accuracy than the model using only the image captions.

<h2> Use LDA (Latent Dirichlet Allocation) to define the topics in each document </h2>

In [25]:
from gensim import corpora
from gensim.models import LdaModel
import ast
from nltk.corpus import stopwords

# nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

convert_to_list = []
for label in df['label'].to_list():
    convert_to_list.append(ast.literal_eval(label))

#print(convert_to_list)
#print(convert_to_list[0][0])

# Create a document-term matrix
dictionary = corpora.Dictionary(convert_to_list)
corpus = [dictionary.doc2bow(text) for text in convert_to_list]

# The number of topics
num_topics = 3

# Train the LDA model
lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)

# Visualize topics or inspect them
topics = lda.print_topics(num_words=10)
for topic in topics:
    print(topic)

# Display all words 
topics_data = {}
for topic_id, topic in lda.show_topics(num_topics=num_topics, num_words=276, formatted=False):
    words = [word for word, score in topic]
    scores = [score for word, score in topic]
    
    for word, score in zip(words, scores):
        if word in topics_data:
            topics_data[word][f'Topic {topic_id}'] = score
        else:
            topics_data[word] = {f'Topic {topic_id}': score}

print(topics_data)
df_topic = pd.DataFrame.from_dict(topics_data, orient='index')

# Visualize topics or inspect all the words
df_topic.to_csv("topic3.csv")
# Display the dataframe
#print(df_topic)

(0, '0.054*"Temple" + 0.052*"Sky" + 0.047*"Plant" + 0.045*"Building" + 0.045*"Tree" + 0.022*"Leisure" + 0.021*"Landscape" + 0.020*"Sculpture" + 0.020*"City" + 0.020*"Grass"')
(1, '0.098*"Water" + 0.089*"Sky" + 0.069*"Natural landscape" + 0.059*"Cloud" + 0.047*"Mountain" + 0.046*"Coastal and oceanic landforms" + 0.045*"Tree" + 0.042*"Water resources" + 0.041*"Plant" + 0.041*"Azure"')
(2, '0.089*"Sky" + 0.061*"Cloud" + 0.058*"Afterglow" + 0.053*"Water" + 0.050*"Dusk" + 0.043*"Plant" + 0.042*"Nature" + 0.039*"Tree" + 0.031*"Light" + 0.028*"Building"')
{'Temple': {'Topic 0': 0.054132134, 'Topic 1': 0.000117946925, 'Topic 2': 0.00033121943}, 'Sky': {'Topic 0': 0.051891055, 'Topic 1': 0.088922516, 'Topic 2': 0.089084454}, 'Plant': {'Topic 0': 0.046811845, 'Topic 1': 0.041242354, 'Topic 2': 0.042520583}, 'Building': {'Topic 0': 0.045281008, 'Topic 1': 0.0002871806, 'Topic 2': 0.027538897}, 'Tree': {'Topic 0': 0.044910103, 'Topic 1': 0.04538644, 'Topic 2': 0.039055184}, 'Leisure': {'Topic 0': 

<li>Topic 0: **Cultural Landscape** (temple, building, leisure, landscape, Chinese architecture, city, statue, world) </li>
<li>Topic 1: **Natural Landscape** (water, natural landscape, mountain, coastal and oceanic landforms) </li>
<li>Topic 2: **Atmospheric Beauty** (sky, cloud, afterglow, dusk, nature, light, sunlight, atmosphere, daytime, sunset, shade) </li>

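Note: the topic numbering, and to some extent the topic contents, can change each time the LDA model is retrained, because training is randomized. Passing a fixed `random_state` to `LdaModel` makes runs repeatable.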

In [26]:
#fill nan if any
df_topic.fillna(0, inplace = True)
print(df_topic)

                Topic 0   Topic 1   Topic 2
Temple         0.054132  0.000118  0.000331
Sky            0.051891  0.088923  0.089084
Plant          0.046812  0.041242  0.042521
Building       0.045281  0.000287  0.027539
Tree           0.044910  0.045386  0.039055
...                 ...       ...       ...
Jewellery      0.000334  0.000101  0.000972
Headgear       0.000334  0.000101  0.000972
Comfort        0.000334  0.000101  0.000972
Textile        0.000334  0.000101  0.000972
Outdoor table  0.000334  0.000101  0.000972

[276 rows x 3 columns]


In [None]:
#calculate the score for each topic of each document
import ast

column_names = df_topic.columns
score_list_list = []

for i in df['label']:
    label = ast.literal_eval(i)
    score_list = []
    for k in column_names:
        # Sum the topic-k weights of every label of this post that is in the LDA vocabulary
        score = 0
        for j in label:
            if j in df_topic.index:
                score += df_topic.loc[j, k]
        score_list.append(score)
    # Normalize so that the topic scores of each post sum to 1
    row_score = sum(score_list)
    results = [number / row_score for number in score_list]
    score_list_list.append(results)

df_topic_score = pd.DataFrame(score_list_list, columns=column_names)
#print(df['label'].shape)
df_topic_score['label'] = df['label'].to_list()
#print(df['like'].shape)
df_topic_score['like'] = df['like'].to_list()

df_topic_score.to_csv("final_score_results.csv")
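
As a worked example of the scoring step, the sketch below applies the same sum-then-normalize rule to a single hypothetical post labeled `['Temple', 'Sky', 'Building']`. The word weights are copied from the first rows of the `df_topic` table printed above; the post itself is invented.

```python
# Word weights per topic, taken from the df_topic output above
topic_weights = {
    "Temple":   {"Topic 0": 0.054132, "Topic 1": 0.000118, "Topic 2": 0.000331},
    "Sky":      {"Topic 0": 0.051891, "Topic 1": 0.088923, "Topic 2": 0.089084},
    "Building": {"Topic 0": 0.045281, "Topic 1": 0.000287, "Topic 2": 0.027539},
}
topics = ["Topic 0", "Topic 1", "Topic 2"]

# Hypothetical post; labels outside the vocabulary would simply be skipped
labels = ["Temple", "Sky", "Building"]

# Step 1: add up the weights of the post's labels, separately for each topic
raw_scores = {
    t: sum(topic_weights[w][t] for w in labels if w in topic_weights)
    for t in topics
}

# Step 2: divide by the total so the three shares sum to 1
total = sum(raw_scores.values())
shares = {t: raw_scores[t] / total for t in topics}

print(raw_scores)
print(shares)
```

For this post the largest share goes to Topic 0, which matches the intuition that a temple and a building point to the cultural landscape topic.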

In [5]:
import pandas as pd
df_topic_score = pd.read_csv("final_score_results.csv")

# Sort the DataFrame by the 'like' column
df_sorted = df_topic_score.sort_values(by='like',ascending = False)

# Cutoffs for the top 25% (75th percentile) and bottom 25% (25th percentile) of posts by likes
top_quartile_cutoff = df_sorted['like'].quantile(0.75)
bottom_quartile_cutoff = df_sorted['like'].quantile(0.25)

print(top_quartile_cutoff)
print(bottom_quartile_cutoff)

# Filter the DataFrame to get the most-liked and least-liked quartiles
first_quartile_rows = df_sorted[df_sorted['like'] >= top_quartile_cutoff]
last_quartile_rows = df_sorted[df_sorted['like'] <= bottom_quartile_cutoff]

# Display the results
print("First Quartile Rows:")
print(first_quartile_rows)

print("\nLast Quartile Rows:")
print(last_quartile_rows)

10576.0
5204.0
First Quartile Rows:
     Unnamed: 0   Topic 0   Topic 1   Topic 2  \
0             0  0.182602  0.355289  0.462109   
1             1  0.255729  0.298162  0.446110   
2             2  0.204300  0.476428  0.319272   
3             3  0.182602  0.355289  0.462109   
4             4  0.071758  0.707330  0.220912   
..          ...       ...       ...       ...   
130         130  0.183582  0.545836  0.270582   
131         131  0.190909  0.459441  0.349649   
132         132  0.199303  0.434243  0.366454   
133         133  0.139553  0.575271  0.285175   
134         134  0.176362  0.366094  0.457544   

                                                 label    like  
0    ['Sky', 'Water', 'Boat', 'Building', 'Cloud', ...  514160  
1    ['Cloud', 'Sky', 'Building', 'Car', 'Daytime',...   96360  
2    ['Water', 'Sky', 'Mountain', 'Ecoregion', 'Mam...   74962  
3    ['Water', 'Sky', 'Boat', 'Cloud', 'Building', ...   59274  
4    ['Water', 'Water resources', 'Mountain', 'Nat
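
The quartile split can be illustrated on a small list. The sketch below uses made-up like counts and the standard library's `statistics.quantiles` with `method="inclusive"`, which should correspond to the linear interpolation that `quantile` uses by default in pandas. The variable names are chosen for this example only.

```python
from statistics import quantiles

# Made-up like counts, for illustration only
likes = [100, 200, 300, 400, 500, 600, 700, 800, 900]

# Three cut points: 25th, 50th and 75th percentiles
q25, q50, q75 = quantiles(likes, n=4, method="inclusive")

# Most-liked quarter: at or above the 75th percentile
top_rows = [x for x in likes if x >= q75]
# Least-liked quarter: at or below the 25th percentile
bottom_rows = [x for x in likes if x <= q25]

print(q25, q75)
print(top_rows, bottom_rows)
```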

In [6]:
high_engagement_topic0 = first_quartile_rows['Topic 0'].mean()
high_engagement_topic1 = first_quartile_rows['Topic 1'].mean()
high_engagement_topic2 = first_quartile_rows['Topic 2'].mean()

low_engagement_topic0 = last_quartile_rows['Topic 0'].mean()
low_engagement_topic1 = last_quartile_rows['Topic 1'].mean()
low_engagement_topic2 = last_quartile_rows['Topic 2'].mean()


import pandas as pd

# Mean topic share within the most-liked and least-liked quartiles
first_quartile_data = {
    'Topic 0': [high_engagement_topic0],
    'Topic 1': [high_engagement_topic1],
    'Topic 2': [high_engagement_topic2]
}

last_quartile_data = {
    'Topic 0': [low_engagement_topic0],
    'Topic 1': [low_engagement_topic1],
    'Topic 2': [low_engagement_topic2]
}

# Create DataFrames for high and low averages
high_average_df = pd.DataFrame(first_quartile_data, index=['High Average'])
low_average_df = pd.DataFrame(last_quartile_data, index=['Low Average'])

In [9]:
#Display a Table
# to_string() renders each one-row DataFrame as text; the last three
# double-space-separated fields are the three topic means
low_values = low_average_df.to_string().split('  ')[-3:]
high_values = high_average_df.to_string().split('  ')[-3:]

low_dict={'Average Type':'Lower Quartile','Cultural Landscape':low_values[0],'Natural Landscape':low_values[1],'Atmospheric Beauty':low_values[2]}

high_dict={'Average Type':'Higher Quartile','Cultural Landscape':high_values[0],'Natural Landscape':high_values[1],'Atmospheric Beauty':high_values[2]}

matrix=pd.DataFrame([low_dict, high_dict], columns=['Average Type','Cultural Landscape','Natural Landscape','Atmospheric Beauty'])

print(matrix)
matrix

      Average Type Cultural Landscape Natural Landscape Atmospheric Beauty
0   Lower Quartile           0.248485           0.40951           0.342005
1  Higher Quartile           0.199573          0.480028           0.320399


Unnamed: 0,Average Type,Cultural Landscape,Natural Landscape,Atmospheric Beauty
0,Lower Quartile,0.248485,0.40951,0.342005
1,Higher Quartile,0.199573,0.480028,0.320399
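
The table is easier to read as differences between the two quartiles. The sketch below subtracts the lower-quartile mean from the higher-quartile mean for each topic, using the values from the table above; the dictionary names are chosen for this example.

```python
# Mean topic shares copied from the table above
low = {"Cultural Landscape": 0.248485, "Natural Landscape": 0.40951, "Atmospheric Beauty": 0.342005}
high = {"Cultural Landscape": 0.199573, "Natural Landscape": 0.480028, "Atmospheric Beauty": 0.320399}

# Positive value: the topic is more prominent in the most-liked posts
diff = {topic: round(high[topic] - low[topic], 6) for topic in low}

for topic, value in diff.items():
    print(f"{topic}: {value:+.6f}")
```

Only the natural landscape topic has a larger share among the most-liked posts; the cultural landscape and atmospheric beauty shares are both smaller there.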


<h2> Summary </h2>


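Based on these results, the @thailand Instagram account posts a mix of all three topics. Natural landscape has the largest average share in both quartiles (about 0.41 to 0.48), followed by atmospheric beauty (about 0.32 to 0.34) and cultural landscape (about 0.20 to 0.25).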

Our first recommendation is for the account to lean more heavily on the natural landscape topic: posts in the top quartile by likes have a higher average natural landscape share (0.480) than posts in the bottom quartile (0.410). We would not recommend leaning heavily on the cultural landscape topic, whose average share is lower among top-quartile posts (0.200) than among bottom-quartile posts (0.248).

For our second recommendation, we suggest that the account experiment with increasing the share of natural landscape posts further to see whether engagement keeps rising, and that it differentiate its cultural posts to raise their engagement.

Our third recommendation is based on the chi-squared tests we ran on the words from the image labels and the image captions. In the captions, the word most strongly associated with engagement level is "krabi" (p ≈ 0.014), the province of Krabi, which has beautiful beaches. In the image labels, the strongest associations are with "atmosphere" (p ≈ 0.003), "azure" (p ≈ 0.024), and "flower" (p ≈ 0.044), with "afterglow" close behind (p ≈ 0.069). The chi-squared test shows that an association exists but not its direction, so we treat these as promising themes to test rather than guaranteed drivers of engagement.

In conclusion, we recommend keeping a balance across all three topics while leaning slightly toward the natural landscape and holding back slightly on the cultural landscape. Within that mix, we recommend focusing on the province of Krabi and nearby provinces, and taking wide-angle pictures that display the beautiful atmosphere or colorful pictures that highlight the local flowers.

