Fake Comment Generator Using Reddit Corpus

All the work for this program was done by Ozan Demircan.


In [14]:
# Downloading and reading the r/NeutralPolitics subreddit corpus using the ConvoKit toolkit

!pip install convokit --quiet


from convokit import Corpus, download

corpus = Corpus(filename=download("subreddit-NeutralPolitics"))
corpus.print_summary_stats()


Dataset already exists at /root/.convokit/downloads/subreddit-NeutralPolitics
Number of Speakers: 41204
Number of Utterances: 434685
Number of Conversations: 14676


In [15]:
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# Cleaning and Tokenizing the Corpus

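MAX_UTTERANCES = 30000  # only the first 30,000 utterances are read
PLACEHOLDERS = {"[deleted]", "[removed]"}  # Reddit's markers for missing comments

tokens = []
for count, utt in enumerate(corpus.iter_utterances(), start=1):
    # Skip placeholder comments (compared case-insensitively)
    if utt.text.strip().lower() not in PLACEHOLDERS:
        for s in sent_tokenize(utt.text):
            # Wrap every sentence in boundary markers for the trigram model
            tokens += ["<s>"] + word_tokenize(s) + ["</s>"]
    # Stop after the limit even if the last utterance was a placeholder
    if count == MAX_UTTERANCES:
        break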

# Lowercase every token so that the keywords returned by Gensim later match the corpus tokens
tokens = [t.lower() for t in tokens]
print(tokens[:100])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['<s>', 'i', 'was', 'reading', 'this', 'article', '[', 'here', ']', '(', 'http', ':', '//rt.com/usa/obama-syria-chemical-weapons-634/', ')', ',', 'where', 'president', 'obama', 'announced', 'that', 'there', 'is', 'evidence', 'that', 'a', 'chemical', 'agent', 'has', 'been', 'used', 'in', 'syria', ',', 'and', 'then', 'i', 'remembered', 'reading', 'an', 'article', 'earlier', 'that', 'said', 'officials', 'from', 'germany', 'were', 'asking', 'to', 'see', 'the', 'proof', 'that', 'the', 'americans', 'had', '.', '</s>', '<s>', 'i', "'m", 'not', 'trying', 'to', 'directly', 'compare', 'the', "'proofs", "'", 'stated', ',', 'be', 'more', 'that', 'maybe', 'this', 'proof', 'of', 'chemical', 'warfare', 'is', 'just', 'a', 'ruse', 'to', 'justify', 'a', 'more', 'predominant', 'role', 'of', 'intervention', 'in', 'the', 'syrian', 'situation', ',', 'much', 'like', 'the']


In [16]:
# Defining a function to generate the comments

import re
from scipy import stats

# Build the trigram model once, instead of on every call:
# each pair of consecutive tokens is the condition for the token that follows it.
trigrams = [((tokens[i], tokens[i+1]), tokens[i+2]) for i in range(len(tokens)-2)]
trigram_cfd = nltk.ConditionalFreqDist(trigrams)
trigram_pbs = nltk.ConditionalProbDist(trigram_cfd, nltk.MLEProbDist)

def CommentGenerator(max_token_count=200, starter=""):
    comment = ""
    # Every comment starts at a sentence boundary followed by the starter word
    current_phrase = ("<s>", starter)
    for _ in range(max_token_count+1):
        comment += current_phrase[1] + " "
        # Candidate next words and their maximum-likelihood probabilities
        probable_words = list(trigram_pbs[current_phrase].samples())
        word_probabilities = [trigram_pbs[current_phrase].prob(word) for word in probable_words]
        if not word_probabilities:
            # The current word pair never occurs in the corpus
            return "Possible trigram not found!"
        # Draw one next word according to those probabilities
        result = stats.multinomial.rvs(1, word_probabilities)
        index_of_probable_word = list(result).index(1)
        current_phrase = (current_phrase[1], probable_words[index_of_probable_word])

    # Strip the sentence boundary markers from the finished comment
    comment = re.sub("</?s>", "", comment)
    return comment
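
The sampling step inside the generator can be illustrated on a toy corpus. The following is a minimal sketch, not the notebook's implementation: it swaps NLTK and SciPy for the standard library (`collections.Counter` and `random.choices`), and the helper names `build_trigram_counts` and `sample_next` as well as the toy sentences are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def build_trigram_counts(tokens):
    """Map each pair of consecutive tokens to a Counter of the tokens that follow it."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def sample_next(counts, context, rng=random):
    """Draw the next token in proportion to its count, or return None for an unseen context."""
    followers = counts.get(context)  # .get avoids creating an empty entry
    if not followers:
        return None
    words = list(followers)
    weights = [followers[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Toy corpus (made up for this sketch), already wrapped in sentence markers
toy_tokens = "<s> the senate votes today </s> <s> the senate meets today </s>".split()
toy_counts = build_trigram_counts(toy_tokens)

print(sample_next(toy_counts, ("the", "senate")))  # either "votes" or "meets"
print(sample_next(toy_counts, ("foo", "bar")))     # unseen word pair
```

Weighting by raw counts gives the same distribution as the maximum-likelihood probabilities, since those are the counts divided by their total. Returning `None` for an unseen pair lets the caller decide whether to stop, back off to a shorter context, or restart from `<s>`.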

In [17]:
# Importing Gensim and downloading a ready dataset

import gensim.downloader
glove_vectors = gensim.downloader.load('glove-twitter-25')
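
Note that `most_similar` raises a `KeyError` when the query word is not in the vocabulary of the loaded vectors. A minimal sketch of a guard follows, assuming a hypothetical helper named `related_keywords`; the `StubVectors` class and its similarity scores are made up so that the snippet runs without downloading the GloVe model.

```python
def related_keywords(vectors, topic, topn=5):
    """Return up to `topn` neighbours of `topic`, or [] if the word is out of vocabulary."""
    try:
        # Normalize the query the same way the corpus tokens were normalized
        neighbours = vectors.most_similar(topic.strip().lower(), topn=topn)
    except KeyError:
        return []
    return [word for word, _score in neighbours]

class StubVectors:
    """Hypothetical stand-in that mimics most_similar; the scores are invented."""
    _table = {"senate": [("congress", 0.9), ("parliament", 0.8)]}

    def most_similar(self, word, topn=10):
        # Like the real object, an unknown word raises KeyError
        return self._table[word][:topn]

stub = StubVectors()
print(related_keywords(stub, " Senate "))
print(related_keywords(stub, "zzz"))
```

With the real model, `related_keywords(glove_vectors, input_text)` would be called in place of the stub.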

In [21]:
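# Getting related keywords using Gensim and calling the function to generate the comments
input_text = input("Enter the topic you want comments on:")
# Normalize the input to match the lowercased corpus tokens;
# most_similar raises KeyError if the word is not in the GloVe vocabulary
most_related = glove_vectors.most_similar(input_text.strip().lower(), topn=5)
keywords = [keyword for keyword, _ in most_related]
print(keywords)
# Generate one comment per related keyword, using the keyword as the first word
example_comments = [CommentGenerator(max_token_count=50, starter=keyword) for keyword in keywords]
print(example_comments)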


Enter the topic you want comments on:senate
['labour', 'congressional', 'parliament', 'congress', 'immigration']
["labour holds .   the government in regulation of the problems fairly easily solved , but so far does the current healthcare controversy in america .   the fcc 's recent actions ?   it 's about offering causal explanations for the government take a few months ? ", "congressional spending on programs like this simply a modernization of politicking , not the school system , if i concede in your criticism .   i 'm not familiar with its state of mind for as little as he resigned it to washington after the event .   what ", 'Possible trigram not found!', "congress has created a huge push across the board of directors funded people that have a limited number of agencies have to get firearms if they did n't understand what we see populists elected in over 70 years and a license to have opinions and opinions into the healthcare problem , ", 'immigration policy be ?   no one will be

In [22]:
# Grammar Correction

!pip install --quiet GingerIt

from gingerit.gingerit import GingerIt

# Creating a GingerIt instance
parser = GingerIt()

# Defining a function to check the grammar of a sentence and return the corrected sentence
def check_grammar(sentence):
    result = parser.parse(sentence)
    corrected_sentence = result['result']
    return corrected_sentence

corrected_comments = []
for comment in example_comments:
    corrected_sentence = check_grammar(comment)
    print("Corrected sentence:", corrected_sentence)
    corrected_comments.append(corrected_sentence)

example_comments = corrected_comments

Corrected sentence: Labor holds.   The government in the regulation of the problems fairly easily solved, but so far does the current health care controversy in America.   The FCC 's recent actions?   It's about offering causal explanations for the government takes a few months? 
Corrected sentence: Congressional spending on programs like this simply a modernization of politicking, not the school system, if I concede in your criticism.   I 'm not familiar with its state of mind for as little as he resigned it to Washington after the event.   What 
Corrected sentence: Possible trigram not found!
Corrected sentence: congress has created a huge push across the board of directors funded people that have a limited number of agencies have to get firearms if they didn't understand what we see populists elections in over 70 years and a license to have opinions and opinions into the healthcare problem, 
Corrected sentence: Immigration policy is?   No one will be.   & get; the actual measured va

In [10]:
# Downloading the necessary resources for sentiment analysis
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


In [11]:
# Defining a function to analyze the positivity of a comment
def analyze_positivity(text):
    scores = sia.polarity_scores(text)
    return scores
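
The dictionary returned by `polarity_scores` contains a `compound` value between -1 and 1. A minimal sketch of turning that value into a label follows, assuming the commonly cited VADER convention of a ±0.05 neutral band; the helper name `label_from_compound` and the example scores are illustrative and not part of the notebook.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a label, treating scores near zero as neutral."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Illustrative compound scores (assumed values, not computed here)
example_scores = [0.7236, -0.2815, 0.0, 0.03]
labels = [label_from_compound(score) for score in example_scores]
print(labels)
```

A neutral band keeps weakly scored comments from being labeled positive or negative on the sign of the score alone; setting `threshold=0` makes any strictly positive or negative score decisive.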

In [23]:
# Evaluating the comments using a Sentiment analyzer in NLTK

sentiment_scores = [analyze_positivity(text) for text in example_comments]
print(sentiment_scores)
for i, (scores, comment) in enumerate(zip(sentiment_scores, example_comments)):
    # VADER's compound score lies in [-1, 1]; its sign decides the label
    if scores["compound"] > 0:
        print(i, ":", comment, ":Comment is positive.")
    elif scores["compound"] < 0:
        print(i, ":", comment, ":Comment is negative.")
    else:
        print(i, ":", comment, ":Comment is neutral.")

[{'neg': 0.04, 'neu': 0.787, 'pos': 0.173, 'compound': 0.7236}, {'neg': 0.105, 'neu': 0.838, 'pos': 0.057, 'compound': -0.2815}, {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}, {'neg': 0.09, 'neu': 0.801, 'pos': 0.109, 'compound': 0.0}, {'neg': 0.078, 'neu': 0.675, 'pos': 0.246, 'compound': 0.8271}]
0 : Labor holds.   The government in the regulation of the problems fairly easily solved, but so far does the current health care controversy in America.   The FCC 's recent actions?   It's about offering causal explanations for the government takes a few months?  :Comment is positive.
1 : Congressional spending on programs like this simply a modernization of politicking, not the school system, if I concede in your criticism.   I 'm not familiar with its state of mind for as little as he resigned it to Washington after the event.   What  :Comment is negative.
2 : Possible trigram not found! :Comment is neutral.
3 : congress has created a huge push across the board of directors funded