<a href="https://colab.research.google.com/github/DoritoClod95/Text-Metric-Analyzer/blob/main/Differences_of_Social_Media_Comments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Title: Difference of Social Media Comment Sections
Name: Carla Parinas
Student ID: 300653631

### 1. INTRODUCTION AND RESEARCH QUESTION(S)

For this project, I want to analyse the differences in behaviour and mannerisms across social media platforms in terms of comments. I saw a video of a dog and noticed that, despite it being the same video, the comments were drastically different depending on the platform. As a person who is relatively active on social media apps, I find that things such as social etiquette seem to be present in the online world as well. I tend to notice a difference in comments depending on the app, and although there are no rules for commenting, there seems to be a standard for how comments should be. For example, there is a common perception online that Instagram has more vulgar and negative comments than TikTok. If the internet is described as a digital world, then the drastic change in comment section styles could be comparable to something like culture shock, and I found that very intriguing. My research questions for this project are:

- What is the difference in emotion between the comment sections of the two platforms?
- Which comment section is more sophisticated in terms of its wording?

To answer my research questions, I will be calculating things such as:

- Lexical Diversity, to see how much the wording of the comments varies (although there could be some issues with this)
- Most common words and ngrams, to find any trends in wording
- Sentiment Value (Using VADER), to see the differences in emotion
- Readability formulas, to measure the sophistication and see if it matches the video (maybe try different ones?)
- Age of Acquisition, to get a rough estimate of the age demographic of a group
- Noun dependencies (using spaCy), to see how detailed the wording around nouns is
- Basic word information, in case I need it

### DATA AND DESCRIPTION OF DATA

I have decided to have 2 categories:

1. TikTok
2. Instagram

Because comments naturally have a short word count, I will gather around 7 videos that are available on both platforms and collect comments until they reach around 250-500 words for each video. If I find that the count is insufficient for one of the platforms, I will make an exception and add a bit more, but I will take note of this in the program. The number of comments will also be used as data.

The video must be available on both platforms. I will use videos that are likely to have the most insightful comment sections, and I will also use videos that are relatively mainstream -- this is to avoid incredibly niche topics while still staying within the category. The comments have to be different from one another, meaning that comments that simply repeat each other for humour will not count. I will try to keep these videos as neutral as possible. I will not include emojis or any sort of username tagging.

### Program explanation

My program is mostly recycled code from my previous assignment, with some taken from the course. It uses the NLTK and spaCy libraries as the main tools for the linguistic statistics. Before doing the text analysis I made a corpus to work with, and categorized each file as either Instagram (ig) or TikTok (tt). I feed this corpus into a main function that runs the other functions to get the metrics. The only preprocessing I did was removing punctuation. I didn't remove stopwords because I feel they are key when it comes to comments, and social media comments are already short enough as it is. I retained the full stops so that I could still split the text into sentences.

Note that most, if not all, of the metrics were averaged to keep the comparison fair. The first metric function the program runs is text info, which gathers the basic information of a text: word count, sentence count, comment count, and syllable count. The syllables were counted using the CMU Pronouncing Dictionary from the NLTK library; words not in the dictionary were counted with a simple vowel-group heuristic. These pieces of information are then sent to the other functions to get the other metrics. The sentiment score, which is the next metric, was calculated using the VADER tool from the NLTK library. The next metric is the Flesch Reading Ease calculator, which I took from the course; it takes the counts for the text and uses a helper function to calculate the readability. Then the program calculates the Age of Acquisition using the AoA word list loaded from the course resources. With the help of spaCy and the code from the course notebooks, I got the average number of noun dependents for the overall comment section, along with the standard deviation. Lastly, the program gets the lexical diversity. It gathers the lexical diversity of the entire text, then the average over all the sentences, and then the average over all the comments.

I analyzed each of the comparison results individually and then manually counted the number of times a category was "higher" than the other on a specific metric.


In [None]:
# SETUP
import nltk
import requests
import spacy
from nltk.corpus import stopwords
from nltk.corpus import cmudict
from nltk import FreqDist
from google.colab import drive
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from collections import defaultdict
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader
from nltk.util import ngrams

drive.mount('/content/drive')
corpus_location = '/content/drive/MyDrive/comments'

resources =  ['book', 'stopwords', 'averaged_perceptron_tagger', 'vader_lexicon', 'punkt']
nltk.download(resources)

updated_stopwords = stopwords.words('english')
updated_stopwords.append("i'm")
sid = SentimentIntensityAnalyzer()
cmu = cmudict.dict()

noun_tags = ['NN', 'NNS']
verb_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']

nlp = spacy.load('en_core_web_sm')


Mounted at /content/drive


[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/chat80.zip.
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package conll2000 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2000.zip.
[nltk_data]    | Downloading package conll2002 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2002.zip.
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/dependency_treebank.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    

In [None]:
# FUNCTIONS

def preprocess(text, stopwords):
  # print(text)
  text = text.lower()
  new_text = cleanPunc(text)
  if not stopwords:
    # strip stopwords from the punctuation-cleaned text, not from the original text
    new_text = remove_stopwords(new_text)

  text_fdist = FreqDist(new_text)
  #print(text_fdist)

  return new_text

def cleanPunc(text):
  punct = ',;"!\'[]{}:><-_?``()#$'
  #print(text)
  text = ''.join([x for x in text if x not in punct and x != ''])

  return text

def removeFullStop(text):
  return ' '.join([x for x in text.split() if x != "."])

def remove_stopwords(text):
    text = ' '.join([x for x in text.split() if x.lower() not in updated_stopwords])

    # needed to do it one more time, this time on NLTK tokens
    text = ' '.join([x for x in nltk.word_tokenize(text) if x.lower() not in updated_stopwords])

    # return the filtered text so the caller receives it
    return text


In [None]:
def get_word_rating_resource(url):
  """helper function to get lexical resources for LING226 students
  resources are hosted on github as .txt in the form of Word\tValue\n
  """
  # read the raw text and split on newlines
  raw = requests.get(url).text.split('\n')

  # split each pair and convert value to rounded float
  # the if statement is there to avoid indexing errors when a row in a resource doesn't have complete data
  raw_list = [(pair.split('\t')[0], round(float(pair.split('\t')[1]), 3)) for pair in raw if len(pair.split('\t')) == 2]

  # create a dictionary and return it
  return dict(raw_list)

aoa_url = 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/lexical-resources/AoA_Brysbart.txt'
aoa_dict = get_word_rating_resource(aoa_url)

In [None]:
def find_LD(tokens):
  if len(tokens) != 0:
    return len(set(tokens))/len(tokens)
  else:
    return 0
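
The snippet below is a small illustration (not part of the original program) of how `find_LD` behaves depending on what it is given. It assumes the same type-token ratio definition as above, and the sample sentence is made up. Passing a list of tokens measures word-level diversity, while passing a raw string makes Python iterate over characters, so the ratio then describes unique characters rather than unique words.

```python
def find_LD(tokens):
    """Type-token ratio: unique items divided by total items (0 for empty input)."""
    if len(tokens) != 0:
        return len(set(tokens)) / len(tokens)
    return 0


sample = "the cat sat on the mat"  # made-up example text

# Word level: split into tokens first, so each item is a word.
word_level = find_LD(sample.split())

# Character level: a string is iterated character by character,
# so this counts unique characters (including spaces), not unique words.
char_level = find_LD(sample)

print(f"word-level LD: {word_level:.3f}")
print(f"character-level LD: {char_level:.3f}")
```

On a longer text the character-level value shrinks further, because the number of distinct characters stays small while the text keeps growing. Type-token ratios in general also tend to fall as texts get longer, which is worth keeping in mind when comparing samples of different sizes.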

In [None]:
def syllable_counter(word):
  syllables = []
  if word in cmu.keys():
    #print("scenario 1")
    phones = cmu[word][0]
    #print(phones)

    vowel_sounds = [sound for sound in phones if sound[-1].isdigit()]
    #print(vowel_sounds)
    syllables = len(vowel_sounds)

  else:
    #print("scenario 2")
    vowels = "aeiouy"
    syllables = 0
    prev_char_is_vowel = False
    for char in word:
        if char in vowels:
            if not prev_char_is_vowel:
                syllables += 1
            prev_char_is_vowel = True
        else:
             prev_char_is_vowel = False

    #print(syllables)
  return syllables
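
To show what the fallback branch does for words missing from the CMU dictionary, here is a minimal standalone sketch of the vowel-group heuristic. The function name `count_vowel_groups` and the test words are hypothetical, and the sketch lowercases its input, which the original leaves to the caller.

```python
def count_vowel_groups(word):
    """Estimate syllables by counting runs of consecutive vowels (y counts as a vowel)."""
    vowels = "aeiouy"
    count = 0
    prev_char_is_vowel = False
    for char in word.lower():
        is_vowel = char in vowels
        # a new syllable starts only when a vowel follows a non-vowel
        if is_vowel and not prev_char_is_vowel:
            count += 1
        prev_char_is_vowel = is_vowel
    return count


for word in ["banana", "queue", "make", "lol"]:
    print(word, count_vowel_groups(word))
```

The heuristic counts a silent final "e" as its own syllable (it gives 2 for "make"), so slang or misspelled words that fall through to this branch may be slightly overcounted.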

In [None]:
def sentimentFinder2(corpus, title):
  output = []

  raw_file = corpus.raw(title)
  text = raw_file.replace('\r\n', '.')
  # keep only pieces that contain something other than whitespace
  sentences = [sent for sent in text.split(".") if sent.strip()]

  whole_comment = sid.polarity_scores(text)['compound']

  for sent in sentences:
      output.append(sid.polarity_scores(sent)['compound'])


  if output:
      sent_avg = sum(output)/len(output)
      print(f"Total Polarity Score: { whole_comment } \nAverage Polarity Score per Sentence: { sent_avg }")

In [None]:
def text_info(text):
  comments_whole = '\n'.join(text)
  sents_whole = '. '.join(text)
  #print(sents_whole)

  #print(type(sents_whole))

  # lowercase the text
  comment = comments_whole.lower()
  sents = sents_whole.lower()

  # get number of comments
  comments = [word for word in comments_whole.split('\n') if word != '']

  #print('comment done')

  # extract tokens, removing any that are just punctuation
  tokens = [token.lower() for token in nltk.word_tokenize(comment) if token.isalpha()]

  #print('token done')

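  # extract sentences
  # NOTE: the `or` makes this condition true for every string, so blank pieces
  # are kept and counted as sentences in the totals reported below
  sentences = [word for word in sents_whole.split('.') if word != '' or word != ' ']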
  #print(sentences)

  #print('sents done')

  # extract syllables
  syllables = 0


  for token in tokens:
    syllables += syllable_counter(token)

  print('syllable done')
  return tokens, sentences, syllables, comments
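
The sentence filter in `text_info` uses the condition `word != '' or word != ' '`. The sketch below, on a made-up input, shows how that condition behaves next to a whitespace-stripping alternative. It is only an illustration of the two filters, not a change to the program.

```python
text = "a. b.. "  # made-up input that yields an empty and a whitespace-only piece
pieces = text.split(".")

# `p != '' or p != ' '` is true for every string (no string equals both at once),
# so this filter keeps all pieces, blank ones included.
kept_with_or = [p for p in pieces if p != '' or p != ' ']

# Stripping whitespace and testing for truthiness drops the blank pieces.
kept_with_strip = [p.strip() for p in pieces if p.strip()]

print(pieces)
print(len(kept_with_or), len(kept_with_strip))
```

Because the sentence count feeds the Flesch formula, switching filters may change the sentence totals and readability scores, so the numbers reported in this notebook should be read with the original filter in mind.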

In [None]:
def sentimentFinder(sentences, comments_whole):
  output = []

  print(type(sentences))
  #raw_file = corpus.raw(title)
  #text = raw_file.replace('\r\n', '.')
  #sentences = [sent for sent in text.split(".") if sent != '' or sent != ' ']
  sentence_joined = '. '.join(sentences)

  whole_comment = sid.polarity_scores(sentence_joined)['compound']

  for sent in sentences:
      output.append(sid.polarity_scores(sent)['compound'])

  if output:
      sent_avg = sum(output)/len(output)
      print(f"Total Polarity Score: { round(whole_comment, 4) } \nAverage Polarity Score per Sentence: { round(sent_avg, 4) }")

In [None]:
def flesch_reading_ease(words, sents, sylls, comments):
  output = []
  overall = calculate(len(words), len(sents), sylls)

  for comment in comments:
    syll_c = 0
    for word in comment.split():
      syll_c += syllable_counter(word)

    if syll_c > 0 and len(comment.split()) > 0:
      output.append(calculate(len(comment.split()), len(comment.split('.')), syll_c))

  avg = sum(output)/len(output)
  print(f'Overall Score: {round( overall, 3)}, Average Score per Comment: { round(avg, 3)}')

In [None]:
def calculate(words, sents, sylls):

  word_sents = words/sents
  syll_words = sylls/words

  reading_ease_score = 206.835 - (1.015 * word_sents) - (84.6 * syll_words)

  #print(f'Flesch Reading Ease Score: {reading_ease_score}')
  return reading_ease_score
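
As a worked check of the formula, the sketch below recomputes the overall score for `dog_ig.txt` from the counts printed in that file's output further down (240 words, 30 sentences, 322 syllables). The function name `flesch_reading_ease_score` is a stand-in for `calculate` so the example runs on its own.

```python
def flesch_reading_ease_score(words, sents, sylls):
    """Flesch Reading Ease from raw counts (same formula as `calculate`)."""
    return 206.835 - 1.015 * (words / sents) - 84.6 * (sylls / words)


# Counts printed for dog_ig.txt in this notebook.
score = flesch_reading_ease_score(words=240, sents=30, sylls=322)
print(round(score, 3))  # prints 85.21
```

With 8 words per sentence and about 1.34 syllables per word, the two penalty terms are 8.12 and 113.505, which leaves 85.21 and matches the "Overall Score" reported for that file. Short sentences and short words both push the score up, which is why comment sections tend to score high on this measure.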

In [None]:
def aoa(words, comments):
  comment_output = []
  whole_avg = get_aoa(words)

  for comment in comments:
    # look up each comment once and skip comments with no rated words
    comment_aoa = get_aoa(comment.split())
    if comment_aoa is not None:
      #print(comment)
      comment_output.append(comment_aoa)

  #print(comment_output)
  comment_avg = sum(comment_output)/len(comment_output)
  print(f'Whole Average Comment Section AOA: { round(whole_avg, 2)} \nAverage AOA per Comment: { round(comment_avg, 2) }')

  return round(whole_avg, 1), round(comment_avg, 1)


In [None]:

def get_aoa(words):
  #print(words)
  output = []
  for w in words:
      if w in aoa_dict.keys():
        #print(w)
        output.append(aoa_dict[w])

  if len(output) != 0:
    avg = sum(output)/len(output)
    return round(avg, 1)
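
The lookup-and-average idea behind `get_aoa` can be sketched with a tiny dictionary. The ratings in `toy_aoa` are made up for illustration and are not values from the real AoA norms; `mean_rating` is a hypothetical name for a standalone version of the same logic.

```python
# Made-up ratings for illustration only; these are NOT values from the real AoA norms.
toy_aoa = {"dog": 3.0, "happy": 4.0, "adorable": 8.0}


def mean_rating(words, ratings):
    """Average the ratings of the words found in `ratings`; None if none are found."""
    found = [ratings[w] for w in words if w in ratings]
    if not found:
        return None
    return round(sum(found) / len(found), 1)


print(mean_rating("such a happy dog".split(), toy_aoa))  # unrated words are skipped
print(mean_rating("lol omg".split(), toy_aoa))           # no rated words at all
```

Words that are missing from the list are simply ignored, so for comments full of slang or abbreviations the average likely reflects only the more standard vocabulary.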

In [None]:
# an updated version of our function
import statistics

def avg_sd_noun_deps(text_input):
  """return average and sd of dependents per head noun in a text"""
  # create spacy doc
  tokens = nlp(text_input)

  # list to store noun children
  n_deps = []

  for token in tokens:
    # use simple pos tag to find the nouns
    if token.pos_ == "NOUN":
      n_childs = [c for c in token.children]
      n_deps.append(len(n_childs))

  # safety first
  if n_deps:
    # in case you want to check what's happening with the numbers
    # print(n_deps)
    avg = statistics.mean(n_deps)
    sd = statistics.stdev(n_deps)
    return avg, sd # first number is the average, second is the standard deviation
  else:
    print('Sorry, no nouns found')

In [None]:
def text_metrics_individual(corpus, title, tri):
  raw = corpus.raw(title)
  raw = raw.replace('\r\n', '')
  raw_text = preprocess(raw, True)
  raw_nostop = preprocess(raw, False)

  comments_list = [comment for comment in raw_text.split('/t')]
  comments_whole = '\n'.join(comments_list)
  #print(comments_list)
  #print(comments_whole)

  print("\n=======TEXT INFO========")
  #print(type(comments_list))
  words, sentences, syllables, comments = text_info(comments_list)
  print(f'this text has {len(words)} words, {len(sentences)} sentences, and {syllables} syllables, and {len(comments)} comments.')

  print("\n=======SENTIMENT FINDER========")
  sentiment = sentimentFinder(sentences, raw_text)

  print("\n=======FLESCH READING EASE========")
  overall = flesch_reading_ease(words, sentences, syllables, comments)

  print("\n=======AGE OF ACQUISITION========")
  aoa_avg, aoa_comment_avg = aoa(words, comments)

  print("\n=======NOUN DEPENDENCIES========")
  whole_dep, whole_dep_sd = avg_sd_noun_deps(comments_whole)

  print(f'Average Noun Dependancy: {round(whole_dep, 4)} \nStandard Deviation: { round(whole_dep_sd, 4)}')

  print("\n=======FIND LD========")
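  # NOTE: comments_whole is a string, so find_LD iterates over characters here;
  # "Overall LD" is a character-level ratio, unlike the token-level averages below
  overall_ld = find_LD(comments_whole)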
  sent_ld = []
  comments_ld = []

  for sent in sentences:
    if sent != '':
      add = find_LD(nltk.word_tokenize(sent))
      sent_ld.append(add)

  for comment in comments:
     comments_ld.append(find_LD(nltk.word_tokenize(comment)))

  sent_avg = sum(sent_ld)/len(sent_ld)
  comment_avg = sum(comments_ld)/len(comments_ld)
  print(f'Overall LD: {round(overall_ld, 4)} \nSentence Average LD: {round(sent_avg, 4)} \nComment Average LD: {round(comment_avg, 4)}')


In [None]:
comments_corpus = CategorizedPlaintextCorpusReader(root = corpus_location, fileids = '.*', cat_pattern = '.*(..).txt')

type(comments_corpus.raw('dog_ig.txt'))

str

# DOG

In [None]:
text_metrics_individual(comments_corpus, 'dog_ig.txt', 5)


syllable done
this text has 240 words, 30 sentences, and 322 syllables, and 27 comments.

<class 'list'>
Total Polarity Score: 0.993 
Average Polarity Score per Sentence: 0.1776

Overall Score: 85.21, Average Score per Comment: 81.997

Whole Average Comment Section AOA: 5.0 
Average AOA per Comment: 5.01

Average Noun Dependancy: 1.2195 
Standard Deviation: 0.9357

Overall LD: 0.02 
Sentence Average LD: 0.9211 
Comment Average LD: 0.9864


In [None]:
text_metrics_individual(comments_corpus, 'dog_tt.txt', 5)


syllable done
this text has 219 words, 41 sentences, and 276 syllables, and 36 comments.

<class 'list'>
Total Polarity Score: 0.9617 
Average Polarity Score per Sentence: 0.139

Overall Score: 94.794, Average Score per Comment: 92.943

Whole Average Comment Section AOA: 4.7 
Average AOA per Comment: 4.69

Average Noun Dependancy: 1.9211 
Standard Deviation: 1.851

Overall LD: 0.0249 
Sentence Average LD: 0.9475 
Comment Average LD: 0.9903


For the dog video, TT had more comments despite having fewer words.
Surprisingly, it is the Instagram comment section that has the higher sentiment score, both overall and per sentence, meaning that the IG comments seemingly have a more positive vibe. When it comes to readability, the TikTok comments score higher by around 10 points.
The AoA is higher for the Instagram comments, at age 5, while for TikTok it is between 4 and 5 (4.7). Coming to noun dependencies, the TikTok comments have a higher average number of noun dependents -- this implies that the TT comment section has more variation and a more detailed way of commenting, while the Instagram comments are more consistent and straightforward. In line with this result, the lexical diversity of TikTok is slightly higher than Instagram's in all aspects.

# BATHROOM

In [None]:
text_metrics_individual(comments_corpus, 'bathroom_ig.txt', 5)


syllable done
this text has 374 words, 35 sentences, and 480 syllables, and 21 comments.

<class 'list'>
Total Polarity Score: -0.9208 
Average Polarity Score per Sentence: -0.0814

Overall Score: 87.411, Average Score per Comment: 79.854

Whole Average Comment Section AOA: 5.0 
Average AOA per Comment: 5.07

Average Noun Dependancy: 1.48 
Standard Deviation: 1.3692

Overall LD: 0.0153 
Sentence Average LD: 0.8335 
Comment Average LD: 0.9148


In [None]:
text_metrics_individual(comments_corpus, 'bathroom_tt.txt', 5)


syllable done
this text has 380 words, 28 sentences, and 504 syllables, and 22 comments.

<class 'list'>
Total Polarity Score: -0.0212 
Average Polarity Score per Sentence: -0.1015

Overall Score: 80.854, Average Score per Comment: 78.273

Whole Average Comment Section AOA: 4.8 
Average AOA per Comment: 4.8

Average Noun Dependancy: 1.2239 
Standard Deviation: 0.9345

Overall LD: 0.0163 
Sentence Average LD: 0.9249 
Comment Average LD: 0.9431


For the transgender bathroom video, both comment sections have a similar number of comments and words. Both comment sections have negative sentiment. The overall sentiment is much lower for the IG comments, while the average sentence score is lower for the TT comments. This time, the IG comments have the higher readability score, with a decent difference overall (87.4 vs 80.9) but only a slight difference for the average per comment. The age of acquisition for IG comments remains consistent at age 5, with TT comments being somewhat lower. The noun dependency average and standard deviation for the IG comments are also higher than for the TT comments. The lexical diversity of the TT comments is higher than that of the IG comments on every measure, although there is not much of a difference when it comes to overall LD and comment average LD.

# DRUMMER

In [None]:
text_metrics_individual(comments_corpus, 'drummer_ig.txt', 5)


syllable done
this text has 359 words, 34 sentences, and 485 syllables, and 25 comments.

<class 'list'>
Total Polarity Score: 0.1485 
Average Polarity Score per Sentence: -0.0355

Overall Score: 81.825, Average Score per Comment: 88.166

Whole Average Comment Section AOA: 5.2 
Average AOA per Comment: 4.97

Average Noun Dependancy: 1.4267 
Standard Deviation: 1.3772

Overall LD: 0.016 
Sentence Average LD: 0.8533 
Comment Average LD: 0.9548


In [None]:
text_metrics_individual(comments_corpus, 'drummer_tt.txt', 5)


syllable done
this text has 323 words, 52 sentences, and 436 syllables, and 34 comments.

<class 'list'>
Total Polarity Score: 0.995 
Average Polarity Score per Sentence: 0.0936

Overall Score: 86.333, Average Score per Comment: 80.032

Whole Average Comment Section AOA: 4.9 
Average AOA per Comment: 4.97

Average Noun Dependancy: 1.3676 
Standard Deviation: 0.9911

Overall LD: 0.0168 
Sentence Average LD: 0.8584 
Comment Average LD: 0.9345


For the contemporary drummer performance art video, we see quite a big difference in sentiment values. The IG comments have a more negative sentiment than the TT comments. The overall reading score is higher for the TT comments, but the average reading score per comment is higher for the IG comments. The AoA for IG remains around 5, while TT has risen to around age 5 as well. The IG comments have a higher average noun dependency and standard deviation, with a big difference in the latter metric. The lexical diversity of both platforms is almost identical, with the only slightly notable difference being in the average LD per comment.

# HAIRCUT

In [None]:
text_metrics_individual(comments_corpus, 'haircut_ig.txt', 5)


syllable done
this text has 262 words, 38 sentences, and 332 syllables, and 23 comments.

<class 'list'>
Total Polarity Score: -0.9404 
Average Polarity Score per Sentence: -0.0477

Overall Score: 92.634, Average Score per Comment: 89.265

Whole Average Comment Section AOA: 5.0 
Average AOA per Comment: 5.0

Average Noun Dependancy: 1.2203 
Standard Deviation: 1.0516

Overall LD: 0.0198 
Sentence Average LD: 0.9241 
Comment Average LD: 0.9637


In [None]:
text_metrics_individual(comments_corpus, 'haircut_tt.txt', 5)


syllable done
this text has 254 words, 40 sentences, and 301 syllables, and 28 comments.

<class 'list'>
Total Polarity Score: 0.9984 
Average Polarity Score per Sentence: 0.184

Overall Score: 100.135, Average Score per Comment: 96.462

Whole Average Comment Section AOA: 4.8 
Average AOA per Comment: 4.96

Average Noun Dependancy: 1.6818 
Standard Deviation: 1.3078

Overall LD: 0.0222 
Sentence Average LD: 0.9048 
Comment Average LD: 0.9871


The comment count is higher on TT despite it having fewer words than IG. There is a big difference in sentiment: the IG comments have a generally negative sentiment while the TT comments have a generally positive sentiment. There is a notable difference in readability, as the score for the TT comments is higher than for the IG comments. The AoA is quite similar, but the IG comments still come out slightly higher, as usual. The TT comments have a higher noun dependency and a higher standard deviation this time; this also applies to the lexical diversity, except for the sentence-average LD.

# HOUSEBOAT

In [None]:
text_metrics_individual(comments_corpus, 'houseboat_ig.txt', 5)


syllable done
this text has 266 words, 32 sentences, and 364 syllables, and 16 comments.

<class 'list'>
Total Polarity Score: 0.9808 
Average Polarity Score per Sentence: 0.1267

Overall Score: 82.629, Average Score per Comment: 84.618

Whole Average Comment Section AOA: 4.8 
Average AOA per Comment: 4.66

Average Noun Dependancy: 1.1833 
Standard Deviation: 1.0167

Overall LD: 0.0223 
Sentence Average LD: 0.7591 
Comment Average LD: 0.9441


In [None]:
text_metrics_individual(comments_corpus, 'houseboat_tt.txt', 5)


syllable done
this text has 244 words, 27 sentences, and 305 syllables, and 20 comments.

<class 'list'>
Total Polarity Score: 0.1969 
Average Polarity Score per Sentence: -0.0016

Overall Score: 91.912, Average Score per Comment: 85.087

Whole Average Comment Section AOA: 4.8 
Average AOA per Comment: 5.31

Average Noun Dependancy: 1.2321 
Standard Deviation: 0.934

Overall LD: 0.0228 
Sentence Average LD: 0.8929 
Comment Average LD: 0.9617


The comment count is higher on TT despite it having fewer words. This time the TT comment section has a less positive sentiment than IG, and the difference is rather large. The overall readability is much higher on TT than on IG. This time, the per-comment AoA is higher for the TT comments than it usually is, surpassing the IG comments. The TT comments also have a higher noun dependency, but the IG comments have a slightly higher standard deviation. Although the overall and comment-average LD values are similar, the sentence-average LD of the TT comments is notably higher.

# MOM

In [None]:
text_metrics_individual(comments_corpus, 'mom_ig.txt', 5)


syllable done
this text has 368 words, 35 sentences, and 472 syllables, and 17 comments.

<class 'list'>
Total Polarity Score: 0.9611 
Average Polarity Score per Sentence: 0.0726

Overall Score: 87.654, Average Score per Comment: 84.654

Whole Average Comment Section AOA: 4.8 
Average AOA per Comment: 4.71

Average Noun Dependancy: 1.6111 
Standard Deviation: 1.1229

Overall LD: 0.017 
Sentence Average LD: 0.8044 
Comment Average LD: 0.9269


In [None]:
text_metrics_individual(comments_corpus, 'mom_tt.txt', 5)


syllable done
this text has 298 words, 49 sentences, and 364 syllables, and 45 comments.

<class 'list'>
Total Polarity Score: 0.9991 
Average Polarity Score per Sentence: 0.3199

Overall Score: 97.325, Average Score per Comment: 91.383

Whole Average Comment Section AOA: 4.4 
Average AOA per Comment: 4.49

Average Noun Dependancy: 1.8478 
Standard Deviation: 1.4601

Overall LD: 0.0226 
Sentence Average LD: 0.943 
Comment Average LD: 0.9616


The comment count on TT is much higher (45 vs 17) despite it having far fewer words than the IG comment section. Choosing this video may have been an error on my part. The total polarity scores for the IG and TT comment sections are quite similar and positive, but the average per sentence differs considerably, with the TT comments having the higher score. The readability, as in most videos, is higher for TT's comment section. The AoA is higher for the IG comments, but this time the two are quite similar, since both are between ages 4 and 5. The noun dependency is higher for the TT comments, which also have a higher standard deviation. The lexical diversity is higher in the TT comment section in all aspects, with a fairly large difference in the sentence-average LD.

# VIEW

In [None]:
text_metrics_individual(comments_corpus, 'view_ig.txt', 5)


syllable done
this text has 302 words, 31 sentences, and 421 syllables, and 25 comments.

<class 'list'>
Total Polarity Score: 0.9972 
Average Polarity Score per Sentence: 0.2612

Overall Score: 79.011, Average Score per Comment: 74.906

Whole Average Comment Section AOA: 4.9 
Average AOA per Comment: 4.9

Average Noun Dependancy: 1.619 
Standard Deviation: 1.1972

Overall LD: 0.0176 
Sentence Average LD: 0.9073 
Comment Average LD: 0.9601


In [None]:
text_metrics_individual(comments_corpus, 'view_tt.txt', 5)


syllable done
this text has 312 words, 34 sentences, and 407 syllables, and 26 comments.

<class 'list'>
Total Polarity Score: 0.9943 
Average Polarity Score per Sentence: 0.1254

Overall Score: 87.161, Average Score per Comment: 86.752

Whole Average Comment Section AOA: 5.0 
Average AOA per Comment: 5.09

Average Noun Dependancy: 1.383 
Standard Deviation: 1.0332

Overall LD: 0.0174 
Sentence Average LD: 0.8643 
Comment Average LD: 0.9661


There is only a 1-comment difference between the two, and they have a similar number of words. The total polarity scores of both comment sections are nearly identical, but the polarity score per sentence is higher for the IG comments. As in most of the other videos, the readability of the TT comments is higher than that of IG's comment section. This time, the AoA is higher in the TT comment section, at age 5, compared to IG's comment section at just under 5. The noun dependency is higher for IG this time, along with the standard deviation. The lexical diversity values of both comment sections are quite similar, except for the sentence average, where the IG comments score higher.

# RESULTS AND REPORT

### Overall Results

These are the counts of how many times each comment section had the higher result for each metric:

Comment Count

IG: 0     
TT: 7

Polarity score

IG: 3            
TT: 4

Readability

IG: 1.5    
TT: 5.5

AoA

IG: 5    
TT: 2

Noun Dependency (and sd)

IG: 3 (4)  
TT: 4 (3)

LD (Generally)

IG: 1   
TT: 6
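
The manual tally could be automated along the following lines. This is a sketch rather than the method actually used: the values are copied from the cell outputs above as (IG, TT) pairs, the function name `tally_wins` is hypothetical, and the rule of giving half a point each on an exact tie is an assumption.

```python
# Values copied from the outputs above, as (IG, TT) pairs per video.
comment_counts = {
    "dog": (27, 36), "bathroom": (21, 22), "drummer": (25, 34),
    "haircut": (23, 28), "houseboat": (16, 20), "mom": (17, 45), "view": (25, 26),
}
total_polarity = {
    "dog": (0.993, 0.9617), "bathroom": (-0.9208, -0.0212), "drummer": (0.1485, 0.995),
    "haircut": (-0.9404, 0.9984), "houseboat": (0.9808, 0.1969),
    "mom": (0.9611, 0.9991), "view": (0.9972, 0.9943),
}


def tally_wins(pairs):
    """Count how often each platform has the higher value; a tie gives half a point each."""
    wins = {"IG": 0.0, "TT": 0.0}
    for ig_value, tt_value in pairs.values():
        if ig_value > tt_value:
            wins["IG"] += 1
        elif tt_value > ig_value:
            wins["TT"] += 1
        else:
            wins["IG"] += 0.5
            wins["TT"] += 0.5
    return wins


print("Comment count:", tally_wins(comment_counts))
print("Total polarity:", tally_wins(total_polarity))
```

For these two metrics the function reproduces the hand counts listed above (0 to 7 for comment count and 3 to 4 for total polarity). A tally like this only records which side is higher, not by how much, so a gap of 0.003 counts the same as a gap of 1.9.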

Judging from the calculations of the program, we can gather several things. From the comment count alone, TikTok tends to have shorter comments, since it needed more comments to reach a similar word count, while IG allows for wordier and longer comments. The polarity score count tells us that the difference isn't as stark as I thought it would be, but it still indicates that TikTok has a somewhat higher chance of having the more positive comment section. The readability count shows that TikTok tends to have a more easily digestible comment section than Instagram, which could be paired with the Age of Acquisition for TikTok generally being lower than Instagram's. Although the noun dependency seems to be on even ground for the most part, the TikTok comment section is slightly more likely to be higher, which is rather odd considering that the other statistics imply a more simplistic nature when it comes to TikTok. However, the Instagram comments tend to be more consistent, probably implying that they tend to repeat words more. This matches up with the lexical diversity results, where TikTok's higher count implies that there is less repetition.
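
The point about comment length can be checked directly by dividing each word count by its comment count. The sketch below uses the (words, comments) pairs from the "this text has ..." lines in the outputs above; the helper name `words_per_comment` is hypothetical.

```python
# (words, comments) per platform, copied from the "this text has ..." lines above.
counts = {
    "dog":       {"IG": (240, 27), "TT": (219, 36)},
    "bathroom":  {"IG": (374, 21), "TT": (380, 22)},
    "drummer":   {"IG": (359, 25), "TT": (323, 34)},
    "haircut":   {"IG": (262, 23), "TT": (254, 28)},
    "houseboat": {"IG": (266, 16), "TT": (244, 20)},
    "mom":       {"IG": (368, 17), "TT": (298, 45)},
    "view":      {"IG": (302, 25), "TT": (312, 26)},
}


def words_per_comment(words, comments):
    """Average number of words in one comment."""
    return words / comments


avg_lengths = {
    video: {platform: words_per_comment(*wc) for platform, wc in platforms.items()}
    for video, platforms in counts.items()
}

for video, lengths in avg_lengths.items():
    print(f"{video:10s} IG {lengths['IG']:.2f}  TT {lengths['TT']:.2f}")
```

On these counts the TikTok average comes out lower for every video, although for the view video the gap is very small. The word counts only include alphabetic tokens, so these averages are approximate.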

The differences in the polarity scores usually aren't that large, so in regard to my first research question, it appears that there isn't really a big difference when it comes to emotions, but it is very situational. For example, in the drummer contemporary art video the difference between the two platforms was big, but large gaps like this appeared in only some of the seven videos. I do, however, hear many remarks on the internet about how Instagram has ruthless, blunt comments. Perhaps the comment sections with negative sentiment are influenced by a pre-established expectation: people who comment on Instagram tend to be meaner because everyone assumes that they are, and they want to fulfil that expectation for reasons such as humour. That, or the VADER sentiment tool may not be capturing the tone of these comments correctly.

In regard to my second question, I can say that the intricacy depends on whether you are referring to the entire comment section or to each comment, and it also depends on the kind of video. From the results, TikTok comments generally seem to have more variety. The fact that the lexical diversity and the noun dependency are generally higher (together with the standard deviation being higher less often) implies that TikTok is more complex in terms of the entire comment section, and that each comment is likely to be different from the others.

Instagram, on the other hand, is intricate in a very niche sense. From the results, Instagram tends to have longer comments than TikTok. It also has a lower readability score, and its age of acquisition tends to be higher (though not by much). This probably suggests that the individual comments on Instagram are perhaps more meticulous than the individual ones on TikTok. Although Instagram has a lower noun dependency, it generally has a somewhat higher standard deviation. A higher standard deviation means that the comments vary more in how elaborate their wording around nouns is, which fits with what I saw while browsing. I mentioned earlier that expectations, and people on the internet wanting to fulfil them, may contribute to the nature of the comments. As I was browsing through the comment sections, I noticed that despite the trend of repeating phrases, there are comments with a completely different vibe from most. These comments take the shape of relatively sizeable paragraphs typed in a formal manner. They are not in every Instagram comment section, but they can occasionally be seen. This led me to conclude that Instagram comments are intricate in the sense that specific comments are meticulous, but not in a sense that generalizes to the whole comment section.

The video that highlighted the mean, aggressive reputation of Instagram comments was the dog video. I included it to see the differences in polarity score, but to my surprise the contrast wasn't as stark as expected. In fact, the Instagram comments had a higher polarity score for this specific video. However, thinking about it now, this is probably because the true intent of comments can't really be captured by a computer without the comments being carefully examined by hand. The Instagram comments on the dog video probably contained words with high sentiment values that were used in a sarcastic or threatening way. I suppose this is one of the flaws present in the project.

Overall, the project was done rather roughly. One thing that prevented me from working on the project smoothly was that I went away on holiday and fatigue was getting to me. Python being a language that I'm not accustomed to was also an issue. I wasn't able to tweak the program as much as I'd have liked, since it would have taken a while to figure out the syntax and I was already short on time from being on holiday. As a result, I copied and pasted from my previous assignment, added the new features that I thought would fit the project, and made slight changes to fit the kind of data that I used. Because of this, I think my program could definitely be improved, and there are more metrics that I could have touched upon and linked together had I had the time. I would also have made a function that gathers up the results of both comment sections being compared for a specific video.

As for the data, I'm not sure if it was enough, since I wasn't used to working with such a small group of texts. It was originally supposed to be 5 videos, but I decided to include 2 more for the data's sake. I feel that more videos would have been good so that the counts could measure the differences more accurately. I also tried my best to include all sides of the internet in 7 videos, and I had some difficulty finding the right videos since I had to check whether the comments fit my criteria. At worst, I would find a video that would be interesting to analyse, only for it to have little to no comments. The data gathering could definitely be improved. Again, I suppose the timing of my holiday likely affected this.
