### TikTok analysis methods ###

Prepared for manuscript "*Engineering the communication of science through social media*" by Oshinowo et al.

Analysis is divided into the following sections:
 1. Correlation analysis of interest signals
 2. K-means clustering analysis
 3. Linear regression analysis
 4. Sentiment analysis with word cloud analysis

For an accessible introduction to machine learning, please feel free to read the review "A guide to machine learning for biologists" by Greener et al., 2021, published in journal [Nature Reviews Molecular Cell Biology](https://www.nature.com/articles/s41580-021-00407-0).

In [None]:
# Import statements

# File management
import os  # For directory management
import glob
import shutil
from pathlib import Path
from tkinter import filedialog

# Number and file management
import numpy as np  # For array management
import pandas as pd  # For database management
import datetime
import matplotlib.pyplot as plt  # For plotting result data
import seaborn as sns

# Mathematical methods
from scipy.ndimage import label
from scipy import stats
from sklearn.cluster import KMeans
from skimage import measure, util
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# NLP methods
import re  # For emoji removal
import string  # For punctuation removal
import sys
!{sys.executable} -m textblob.download_corpora
from PIL import Image
from wordcloud import WordCloud
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from textblob import TextBlob
import textblob

#### 1. Correlation analysis of interest signals ####

A correlation matrix is a statistical tool that summarizes the pairwise relationships between the variables in a data set. In the produced table, every cell contains the Pearson correlation coefficient for one pair of variables. A value of +1 indicates a perfect positive linear association, 0 indicates no linear association, and -1 indicates a perfect inverse linear relationship.
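
As a minimal sketch of what each cell of the matrix holds, the example below builds a small made-up table and computes its Pearson coefficients with `DataFrame.corr`. The column names `Views`, `Likes`, and `Shares` and their values are illustrative assumptions, not the study data.

```python
import pandas as pd

# Hypothetical toy data; not taken from the manuscript
toy = pd.DataFrame({
    "Views": [100, 200, 300, 400, 500],
    "Likes": [10, 20, 30, 40, 50],    # Rises in step with Views
    "Shares": [50, 40, 30, 20, 10],   # Falls as Views rises
})

# Pairwise Pearson coefficients; rows and columns are the variable names
toy_corr = toy.corr(method="pearson")
print(toy_corr.round(2))
```

In this toy table, `Likes` is perfectly positively correlated with `Views`, and `Shares` is perfectly inversely correlated with it; real engagement data will generally fall between these extremes.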

Python programming methods are based upon library [Pandas](https://pandas.pydata.org/) version 2.2.0.

In [None]:
file = filedialog.askopenfilename()  # Select the Excel sheet of data; returns the file path
df = pd.read_excel(file, engine='openpyxl')

In [None]:
# Correlation matrix
corr_matrix = df.corr(numeric_only=True)  # Pearson by default; any non-numeric columns are skipped
sns.heatmap(corr_matrix, annot=True)
plt.tight_layout()
plt.savefig('Correlation_matrix.png', dpi=300)

#### 2. K-means clustering analysis ####

Clustering analyses are label-free machine learning methods designed to find natural groupings within data sets. Here, the k-means algorithm, understood to be a robust general approach to clustering, groups data points into a specified number (*k*) of clusters, in which each observation belongs to the cluster with the nearest mean.

Python programming methods are based upon freely available, open source library [scikit-learn](https://scikit-learn.org/stable/) version 1.4. Algorithms are described in manuscript "*Algorithm AS 136: A K-means clustering algorithm*" by J. A. Hartigan and M. A. Wong, 1979, published in journal [Journal of the Royal Statistical Society](https://www.jstor.org/stable/2346830).
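
The cells below cluster a feature table named `df_X`, which is not defined in this notebook. One plausible way to prepare such a table, shown here only as a sketch, is to standardize the numeric columns so that no single metric dominates the distance calculation. The toy columns, the use of `StandardScaler`, and the names `toy`, `toy_X`, and `toy_labels` are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two obvious groups; not the study data
toy = pd.DataFrame({
    "Views": [100, 120, 110, 5000, 5200, 5100],
    "Likes": [10, 12, 11, 900, 950, 920],
})

# Standardize each column to zero mean and unit variance
toy_X = pd.DataFrame(
    StandardScaler().fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# Assign each row to the cluster with the nearest mean
toy_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy_X)
print(toy_labels)
```

The integer assigned to each group is arbitrary; only the grouping itself is meaningful.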

In [None]:
# Create scree plot
# Note: df_X (the table of numeric features to cluster) must be defined before running this cell
sse = {}
for k in range(1, 12):  # 1 to 11 clusters; edit the upper bound as needed
    # Fit on the features only; cluster labels must not be added back as a feature
    kmeans = KMeans(n_clusters=k, max_iter=1000, random_state=100).fit(df_X)
    sse[k] = kmeans.inertia_  # Inertia: sum of squared distances of samples to their closest cluster center
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.tight_layout()
plt.savefig('Selected_features_scree.png', dpi=300)

In [None]:
# After consulting the scree plot, input the number of clusters (k) to partition all data points into
# e.g. k = 2
k = 4  # Edit

In [None]:
# Partition clusters
kmeans = KMeans(n_clusters=k, random_state=100)
kmeans_labels = kmeans.fit_predict(df_X)  # Note each label will be an integer, starting at 0 (0, 1, 2.. etc.)

In [None]:
# Create new dataframe that includes sample name, all feature metric values, and labels
df_with_labels = df.copy()
df_with_labels['Label'] = kmeans_labels
print(df_with_labels.head())

In [None]:
# Silhouette score to assess goodness of clustering
# Values near -1 indicate poor clustering, near 0 overlapping clusters, and near 1 well-separated clusters
silhouette_coefficient = metrics.silhouette_score(df_X, kmeans_labels)
print(silhouette_coefficient)
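
The silhouette score can also complement the scree plot when choosing *k*. The sketch below, which uses synthetic points rather than the study data, computes the score for several candidate values and keeps the one with the highest score. The variable names (`toy_points`, `scores`, `best_k`) and the candidate range are assumptions.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three well-separated synthetic groups of 30 points each
centers = np.array([[0, 0], [10, 10], [-10, 10]])
toy_points = np.vstack(
    [center + rng.normal(scale=0.5, size=(30, 2)) for center in centers]
)

scores = {}
for n_clusters in range(2, 6):  # The silhouette score needs at least 2 clusters
    toy_labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=0
    ).fit_predict(toy_points)
    scores[n_clusters] = metrics.silhouette_score(toy_points, toy_labels)

best_k = max(scores, key=scores.get)  # Candidate with the highest score
print(scores, best_k)
```

On real data the peak is usually less pronounced, so the score is best read alongside the scree plot rather than as a sole criterion.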

#### 3. Linear regression analysis ####

Regression analyses are used to mathematically characterize the value of a dependent variable (y-axis) based upon the value of an independent variable (x-axis). Here, we performed paired value analysis with views as the dependent variable.

Python programming methods are based upon freely available, open source library [scikit-learn](https://scikit-learn.org/stable/) version 1.4. 
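
With several independent variables, the fitted model has the form Views ≈ b0 + b1·x1 + b2·x2 + ..., and the weights b1, b2, ... are what `model.coef_` returns. As a sketch under made-up assumptions (the columns `Likes` and `Shares` and the coefficients 3, -2, and 5 are invented for illustration), the example below shows `LinearRegression` recovering known weights from noise-free data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical independent variables; not the study data
toy = pd.DataFrame({
    "Likes": rng.normal(size=50),
    "Shares": rng.normal(size=50),
})

# Hypothetical noise-free relationship: Views = 3*Likes - 2*Shares + 5
toy["Views"] = 3 * toy["Likes"] - 2 * toy["Shares"] + 5

toy_model = LinearRegression().fit(toy[["Likes", "Shares"]], toy["Views"])
print(toy_model.coef_, toy_model.intercept_)
```

Note that weights are only directly comparable across features when the features share a common scale, for example after standardization.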

In [None]:
# Create a linear regression model and fit it
model = LinearRegression()
# For multivariate analysis: several x variables, one y variable
y_block = 'Views'  # Column name for y block, all remaining variables are x block variables
Y = df[y_block]
X = df.drop(columns=[y_block])
model.fit(X, Y)

In [None]:
# Plot weights assigned to each feature
plt.figure(figsize=(10, 4))
plt.bar(X.columns, model.coef_)
plt.ylabel('weight')
plt.tight_layout()
plt.savefig('Regression_weights.png', dpi=300)
plt.show()

#### 4. Sentiment analysis ####

Sentiment analysis, a subset of the field of natural language processing, classifies individual words or groups of words (here, a comment) as having a positive (value greater than zero, with a maximum of 1), neutral (value of zero), or negative (value less than zero, with a minimum of -1) polarity.

To organize .csv files of individual comments, we used [TTCommentExporter](https://chromewebstore.google.com/detail/ttcommentexporter-export/epjbmmchkjlgmogfoamcleeikmfaffjm?pli=1).

Python methods are based upon freely available, open source library [Natural Language Toolkit](https://www.nltk.org/) version 1.8. Algorithms are described in manuscript "*Sentiment analysis: capturing favorability using natural language processing*" by T. Nasukawa and J. Yi, published by the [Association for Computing Machinery](https://dl.acm.org/doi/abs/10.1145/945645.945658).
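
The polarity ranges described above can be turned into labels with a small helper. This is only a sketch: the function name `label_polarity` and the example scores are hypothetical and are not part of the analysis pipeline.

```python
def label_polarity(score):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("polarity must lie between -1 and 1")
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


example_scores = [0.8, 0.0, -0.35]  # Hypothetical polarity values
example_labels = [label_polarity(score) for score in example_scores]
print(example_labels)  # ['positive', 'neutral', 'negative']
```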

In [None]:
# Choose directory of .csv files
directory = filedialog.askdirectory()

In [None]:
# Create .csv list
csv_list = sorted(glob.glob(directory + "/*.csv"))

In [None]:
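# Create one row per .csv file, holding the file ID and its list of comments

rows = []
for csv_path in csv_list:
    comment_df = pd.read_csv(csv_path)
    # Drop empty comments and coerce to text so the comments can be joined later
    comment_list = comment_df['Comment'].dropna().astype(str).tolist()
    rows.append({'ID': Path(csv_path).stem,
                 'Comments': comment_list})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df_words = pd.DataFrame(rows, columns=['ID', 'Comments'])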

In [None]:
# Combine all comments into one large string (later used for the word cloud)
comment_list_all = df_words['Comments'].tolist()
flat_list = [item for sublist in comment_list_all for item in sublist]
one_string = " ".join(flat_list)

In [None]:
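# Remove emojis
def remove_emojis(data):
    # Note: the broad ranges below (e.g. U+24C2 to U+1F251) also strip many
    # non-emoji characters, including CJK text
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing, shapes, and miscellaneous symbols
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # enclosed characters onward (broad range)
        u"\U0001f926-\U0001f937"  # gesture emojis
        u"\U00010000-\U0010ffff"  # all supplementary planes
        u"\u2640-\u2642"  # gender symbols
        u"\u2600-\u2B55"  # miscellaneous symbols
        u"\u200d"  # zero-width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector
        u"\u3030"
                      "]+", re.UNICODE)
    return emoj.sub('', data)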

In [None]:
# Remove extra characters
def remove_all(one_string_updated_nopunc):
    # Ellipses, straight quotes, and curly quotes/apostrophes
    for chars in ('...', '\u2026', '\u0022', '\u0027', '\u201C', '\u201D', '\u2019'):
        one_string_updated_nopunc = one_string_updated_nopunc.replace(chars, '')

    return one_string_updated_nopunc

In [None]:
# Create sentiment analyzer
sia = SentimentIntensityAnalyzer()
df_sia = pd.DataFrame()

In [None]:
# Analyze individual words or comments
df_all_polarity = pd.DataFrame()
    
word_list = []
polarity_list = []
cat_list = []
sub_list = []
    
# Clean words
comment_list_all = df_words['Comments'].tolist()
flat_list = [item for sublist in comment_list_all for item in sublist]
one_string = " ".join(flat_list)
# Remove emojis
one_string_updated = remove_emojis(one_string)
    
# Remove punctuation
one_string_updated_nopunc = one_string_updated.translate(str.maketrans('', '', string.punctuation))
one_string_updated_nopunc = remove_all(one_string_updated_nopunc)
list_words = one_string_updated_nopunc.split()
    
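# Assumption: 'cat' (the category label) is not defined anywhere in this notebook;
# 'All' is a placeholder value so that the cell runs. Edit as needed.
cat = 'All'
for word in list_words:
    cat_list.append(cat)
    word_list.append(word)
    polarity_list.append(TextBlob(word).sentiment.polarity)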

# for comment in flat_list:
#     print(comment)
#     cat_list.append(cat)
#     word_list.append(comment)
#     polarity_list.append(TextBlob(comment).sentiment.polarity)
        
df_all_polarity['Category'] = cat_list
df_all_polarity['Word'] = word_list

# df_all_polarity['Comment'] = word_list

df_all_polarity['Polarity'] = polarity_list
    
with pd.ExcelWriter('Word_polarity.xlsx') as writer:
    df_all_polarity.to_excel(writer, sheet_name='Words')
    
# with pd.ExcelWriter('Comment_polarity.xlsx') as writer:
#     df_all_polarity.to_excel(writer, sheet_name='Comments')

Word clouds are groupings of words in which the size of each word indicates its relative frequency in the text.

Python methods are based upon freely available, open source library [WordCloud](https://github.com/amueller/word_cloud) version 1.8.1.
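
To make "relative frequency" concrete, the sketch below counts words in a short made-up string with the standard library. It only illustrates the quantity that text size encodes; WordCloud applies its own tokenization and stop-word filtering, so its counts may differ. The names `toy_text`, `word_counts`, and `relative_freq` are assumptions.

```python
from collections import Counter

# Hypothetical cleaned text; not taken from the comment data
toy_text = "science is fun and science is for everyone"

word_counts = Counter(toy_text.lower().split())  # Occurrences of each word
total = sum(word_counts.values())

# Share of all words accounted for by each word
relative_freq = {word: count / total for word, count in word_counts.items()}
print(word_counts.most_common(2))
```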

In [None]:
# Create and generate a word cloud image:
wordcloud = WordCloud(background_color='white', colormap='binary', width=800, height=500).generate(one_string_updated_nopunc)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

plt.savefig('Word_cloud.png', dpi=300)