# Generate posts

Generate artificial posts and matching images. Results are written to `output > csv_files > posts.csv` and `output > post_pictures`.

# Setting Variables

## Verify model name

Make sure that the `model_name` parameter is the same as what you downloaded in step 2. For example, if you downloaded the 1.3B model, make sure `model_name="EleutherAI/gpt-neo-1.3B"` in the second line of code below.

In [None]:
from happytransformer import HappyGeneration, GENSettings
gpt_neo = HappyGeneration(model_type="GPT-NEO", model_name="EleutherAI/gpt-neo-1.3B", load_path="internal/model_gpt_neo/")

## Setting the Pexels API Key

Go to https://www.pexels.com/api/ and select "Get Started." Create an account, and you will receive an API key consisting of about 56 numbers and letters. Copy the key and paste it (surrounded by quotes) into the `API_KEY` variable below. It should read `API_KEY = '123...890'`, where `123...890` is your API key. As of August 2022, the free Pexels API allows 200 requests per hour and 20,000 requests per month. **This means that this program can generate at most 200 posts per hour unless multiple API keys are used.**
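
To make the rate limit concrete, here is a minimal sketch of the arithmetic. The function names `max_posts_per_hour` and `hours_needed` are hypothetical and are not part of this notebook. The sketch assumes that only image search requests count toward the limit, and that a post costs one search request, or two when the fallback `default_query` search is needed.

```python
import math

# Free-tier limits quoted above (as of August 2022).
REQUESTS_PER_HOUR = 200
REQUESTS_PER_MONTH = 20_000


def max_posts_per_hour(num_keys=1, requests_per_post=1):
    """Upper bound on posts per hour for a given number of API keys."""
    return (REQUESTS_PER_HOUR * num_keys) // requests_per_post


def hours_needed(num_posts, num_keys=1, requests_per_post=1):
    """Whole hours required to generate num_posts within the hourly limit."""
    return math.ceil(num_posts / max_posts_per_hour(num_keys, requests_per_post))


print(max_posts_per_hour())      # one key, one search per post
print(max_posts_per_hour(1, 2))  # one key, every post needs the fallback search
print(hours_needed(500))         # 500 posts with one key
```

If many posts fall back to the default query, the practical ceiling sits somewhere between the first two figures.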

In [None]:
API_KEY = '563492ad6f91700001000001c12998557b2d40488b7bc8d328bccd80'

## Basic variables

You will likely only need to adjust these variables to get the results you want. If you are getting poor results, or want to change something specific, see 'Advanced variables' below.
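
As an illustration of how `smallest_hour` and `largest_hour` are used, the sketch below reproduces the time format the notebook writes to the CSV file: a random hour from the inclusive range and a random minute, each zero-padded to two characters. The helper name `random_post_time` and the seeded generator are assumptions added for the example.

```python
import random


def random_post_time(smallest_hour, largest_hour, rng=random):
    """Return an 'HH:MM' string with the hour drawn from the inclusive range."""
    hour = rng.randint(smallest_hour, largest_hour)
    minute = rng.randint(0, 59)
    return str(hour).zfill(2) + ":" + str(minute).zfill(2)


rng = random.Random(0)  # seeded only so the sketch is repeatable
samples = [random_post_time(-20, 60, rng) for _ in range(5)]
print(samples)
```

Negative hours keep their minus sign, so a value such as `-7:05` is a valid result.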

In [None]:
# From Reddit
subreddit = 'Science'  # the name of the subreddit you wish to replicate
listing = ''  # controversial, best, hot, new, random, rising, top; leave as listing = '' for default

# Generating content
num_results = 5  # The number of posts you wish to generate.
                 # It may be best to start small with about 5, and then adjust any variables if necessary. 
                 # As noted in the API section above, this program can only generate 200 posts per hour, 
                 # so this variable can be set to a maximum of 200. 

# Setting a range of hours (for example, if post hours should range from -20 to 60, enter -20 and 60)
smallest_hour = -20
largest_hour = 60

## Advanced variables

In [None]:
# From Reddit
limit = 10 # number of posts to retrieve from the request; must be >= number_of_inputs

# Prompting the model to generate titles
number_of_inputs = 10   # the number of inputs to prompt the model with. A good range for this is 4-10. 
                        # The model will take these inputs (e.g. post titles), and generate something similar.
                        # The larger the number, the longer it may take to generate results.
                        # Unfortunately, the ideal number is usually dependent on the length and format of the prompts.

# Getting image search words
num_search_words = 2    # include this many words in each image search. Setting this to 1 may result in 
                        # empty search results and a random image being placed instead.
                        # A number that is too large may also not generate any results. 
                        # 2-3 is probably best.

# Getting the images from Pexels
photos_per_request = 2  # the higher the photos_per_request, the more variation there will be in images, 
                        # but the less relevance the images might have (max 80)
image_size = 'large'  # 'original', 'large2x', 'large', 'medium', 'small', 'portrait', 'landscape', or 'tiny' 

default_query = subreddit  # a search term that is entered when no images could be found using the keyword(s) generated by the algorithm.
# change this to any string. For example, default_query = 'science'
# NOTE: The default query is here in case no results are found in the image search with the search keyword(s).
# It may work well to enter a single word that is similar to the topics of other posts/images.
# If there is no such theme, a good default may be 'nature' or 'food' or 'people', for example. 
# Any post that requires this 'placeholder' image will have 'PLACEHOLDER' at the end of its filename.
# If the titles are very short, but there is a theme among them such as 'food', make 'food' the 
# default_query and increase the photos_per_request to a high number such as 50-80

# Running the code

Run the program by selecting *Kernel > Restart Kernel and Run All Cells...* in the menu bar at the top left of the screen. Results will be in `output > csv_files > posts.csv` and `output > post_pictures`. When the code finishes running, the words "Posts generated!!" appear at the bottom of the notebook.

## Importing dependencies

In [None]:
from tqdm import tqdm
import requests
import re
import nltk
import random
import pandas as pd
from collections import Counter
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('brown')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordfreq import zipf_frequency
from transformers import pipeline

# Have the model generate text given examples

## Get text from source

### Reddit
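
The request in the next cell uses Reddit's public JSON listing URL. The sketch below shows how that URL is assembled and how titles are read from the response structure. The helper names `build_listing_url` and `extract_titles` are hypothetical, and the sample payload is made up to mirror only the fields this notebook reads.

```python
def build_listing_url(subreddit, listing="", limit=10):
    """Build the public JSON URL for a subreddit listing."""
    path = "https://www.reddit.com/r/" + subreddit + "/" + listing
    return path + ".json?limit=" + str(limit)


def extract_titles(payload):
    """Collect post titles from a listing-shaped dictionary."""
    return [child["data"]["title"] for child in payload["data"]["children"]]


# Made-up payload that mirrors only the keys used in this notebook.
sample_payload = {
    "data": {
        "children": [
            {"data": {"title": "Example title one", "selftext": ""}},
            {"data": {"title": "Example title two", "selftext": "Body text"}},
        ]
    }
}

print(build_listing_url("Science"))
print(build_listing_url("Science", "top", 25))
print(extract_titles(sample_payload))
```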

In [None]:
# get posts

# subreddit = 'Science'  # the name of the subreddit you wish to replicate
# listing = ''  # controversial, best, hot, new, random, rising, top; leave as listing = '' for default
# limit = 10 # number of posts to retrieve from the request; must be >= number_of_inputs

url = 'https://www.reddit.com/r/' + subreddit + '/' + listing + '.json?limit=' + str(limit)
header = {'User-agent': 'Useragentname'}
res = requests.get(url, headers=header)
res.raise_for_status()  # stop early if Reddit rejected the request
posts = res.json()['data']['children']
#print("Connection status: ", res.status_code)
#print(len(posts), "posts fetched")

In [None]:
# collect desired post attributes
titles = []
content = []
for post in posts:
    titles.append(post['data']['title'])
    content.append(post['data']['selftext'])

## Generate text
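
The cells below build a few-shot prompt from existing titles and keep only the first line of each generated result. A minimal sketch of those two steps, without loading the model, follows. The function names `build_prompt` and `first_line` and the demo titles are made up for illustration, and taking the first `number_of_inputs` titles is an assumption about the intended slice.

```python
def build_prompt(titles, number_of_inputs):
    """Join example titles into a few-shot prompt ending in an open 'Prompt:'."""
    examples = titles[:number_of_inputs]
    parts = ["Prompt: " + title + "\n###\n" for title in examples]
    return "".join(parts) + "Prompt:"


def first_line(generated_text):
    """Keep only the text before the first newline."""
    return generated_text.partition("\n")[0]


demo_titles = ["Title A", "Title B", "Title C"]  # made-up placeholders
print(build_prompt(demo_titles, 2))
print(first_line(" Title D\n###\nPrompt: Title E"))
```

Ending the prompt with an open `Prompt:` invites the model to continue with one more title, and the `###` separator makes it easy to cut the output at the first line.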

In [None]:
# number_of_inputs = 10   # the number of inputs to prompt the model with. A good range for this is 4-10. 
#                         # The larger the number, the longer it may take to generate results.
#                         # The model will take these inputs (e.g. post titles), and generate something similar.
#                         # Unfortunately, the ideal number is usually dependent on the length and format of the prompts.

input_str = ""
my_titles = titles[:number_of_inputs]  # the first number_of_inputs titles
for title in my_titles:
    input_str += "Prompt: " + title + "\n###\n"
input_str += "Prompt:"
#print(input_str)

In [None]:
# generate titles

args = GENSettings(max_length=50, no_repeat_ngram_size=2, do_sample=True, early_stopping=False, top_k=50, temperature=0.7)

# num_results = 5
text_results = []
for i in tqdm(range(num_results)):
    result = gpt_neo.generate_text(input_str, args=args)
    text_results.append(result.text.partition('\n')[0]) # take only the first line of each result
# for i in range(num_results):
#     print(text_results[i] + "\n")

# Get keyword(s) for image search
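
Two of the ranking ideas used in this section can be sketched in isolation: nouns that appear more than once get priority, and otherwise the rarest words are preferred. The helper names and the frequency scores below are made up for the example. The notebook itself takes its scores from `zipf_frequency` and its nouns from NLTK tagging.

```python
from collections import Counter


def repeated_words_first(words, num_search_words):
    """Return words that occur more than once, most frequent first."""
    counts = Counter(words)
    repeated = [word for word, count in counts.most_common() if count > 1]
    return repeated[:num_search_words]


def rarest_words(words, frequency, n):
    """Return the n distinct words with the lowest frequency score."""
    distinct = list(dict.fromkeys(words))  # drop duplicates, keep order
    return sorted(distinct, key=lambda w: frequency.get(w, 0.0))[:n]


nouns = ["planet", "star", "planet", "orbit", "star", "planet"]  # made-up example
scores = {"planet": 4.8, "orbit": 4.2, "star": 5.0}  # made-up scores

print(repeated_words_first(nouns, 2))
print(rarest_words(nouns, scores, 2))
```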

In [None]:
# filter out stop words and punctuation for better image searches

# First, if there is at least one end punctuation (.!?) in the sentence, delete everything after the last one to avoid incomplete sentences.
# But check for abbreviations and edge cases
for idx, x in enumerate(text_results):
    temp_str = text_results[idx]
    while True:
        # if there are no ., !, or ? symbols in temp_str; or if the last . is absent or at index 0 or 1; or if the original string ends with a ., !, or ?
        if max(temp_str.rfind(i) for i in ".!?") == -1 or temp_str.rfind('.') - 2 < 0 or max(text_results[idx].rfind(i) for i in ".!?") == len(text_results[idx]) - 1:
            # return the original string
            temp_str = text_results[idx]
            break
        # if the period is for an abbreviation
        if (temp_str[temp_str.rfind('.') - 2] == '.') or (temp_str[temp_str.rfind('.') - 2] == ' ' and temp_str[temp_str.rfind('.') - 1] != 'I'):
            # trim the string back until a space character is reached;
            # stop if the string runs out so that indexing never fails
            temp_str = temp_str[:-1]
            while temp_str and temp_str[-1] != ' ':
                temp_str = temp_str[:-1]
        else:
            # the string has been sufficiently trimmed. Return the trimmed string
            temp_str = temp_str[0:temp_str.rfind('.') + 1]
            break
    text_results[idx] = temp_str

text_res_tokens = []
text_res_tokens_punct = []
text_results_punct = text_results.copy()

# build the stop word set once instead of on every iteration
stop_words = set(stopwords.words('english'))

for idx, x in enumerate(text_results):
    # remove punctuation
    text_results[idx] = re.sub(r'[^\w\s]', '', text_results[idx])
    # tokenize the versions without and with punctuation
    word_tokens = word_tokenize(text_results[idx])
    word_tokens_punct = word_tokenize(text_results_punct[idx])
    # remove stop words (case-insensitive) and lowercase what remains
    filtered_sentence = [w.lower() for w in word_tokens if w.lower() not in stop_words]
    filtered_sentence_punct = [w.lower() for w in word_tokens_punct if w.lower() not in stop_words]
    #print(word_tokens)
    #print(filtered_sentence)
    text_res_tokens.append(filtered_sentence)
    #print(word_tokens_punct)
    #print(filtered_sentence_punct)
    text_res_tokens_punct.append(filtered_sentence_punct)

In [None]:
# find the least commonly used word for an image search
num_words_for_search = 5    # take the top n least common words from each
language = 'en'

pos_search_words = []
for tokens in text_res_tokens:
    # map each token to its Zipf frequency (a lower value means a rarer word)
    search_words = {str(word): zipf_frequency(word, language) for word in tokens}
    # keep the num_words_for_search rarest words, rarest first
    rarest_words = sorted(search_words, key=search_words.get)[:num_words_for_search]
    # print(rarest_words)
    pos_search_words.append(rarest_words)

In [None]:
# nltk nouns
nltk_search_words = []
for idx, x in enumerate(text_results_punct):
    lines = text_results_punct[idx]
    # function to test if something is a noun
    is_noun = lambda pos: pos[:2] == 'NN'
    # do the nlp stuff
    tokenized = nltk.word_tokenize(lines)
    nouns = [word.lower() for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos) and len(word) > 1] 
    # print(nouns)
    nltk_search_words.append(nouns)
# combine these search_words with those from current pos_search_words list above
for i, x in enumerate(pos_search_words):
    for j, y in enumerate(nltk_search_words[i]):
        if nltk_search_words[i][j] not in pos_search_words[i]:
            pos_search_words[i].append(nltk_search_words[i][j])

In [None]:
# load the classification model
classifier = pipeline("zero-shot-classification", 'internal/model_classifier')

In [None]:
# Getting image search words
# num_search_words = 2    # include this many words in each image search. Setting this to 1 may result in 
#                         # empty search results and a random image being placed instead.
#                         # A number that is too large may also not generate any results. 
#                         # 2-3 is probably best.

# For "disallowed_words", be sure to list words that are very common among responses.
# For example, if the topics involve scientific research, include words like 'researchers' or 'study' and 'studies'.
# Check which keywords result from the algorithm, and if there are many repeats that might clutter an image search,
# be sure to include them below.
disallowed_words = ['question', 'phrase', 'questions', 'phrases', 'sentence', 'sentences', 'article', 'articles',
                    'paragraph', 'paragraphs', 'why', 'how', 'did', 'what', 'who', 'where', 'when', 'do',
                    'researchers', 'research', 'study']

# initialize var
final_search_words = []

# function for filtering everything but nouns
is_noun = lambda pos: pos[:2] == 'NN'  # matches the tags NN, NNS, NNP, and NNPS

# get keyword
for i, x in enumerate(text_results):
    # exclude everything except nouns (including plural and proper); remove disallowed words
    nltk_search_words[i] = [word for (word, pos) in nltk.pos_tag(nltk_search_words[i]) if is_noun(pos) and word not in disallowed_words]
    pos_search_words[i] = [word for (word, pos) in nltk.pos_tag(pos_search_words[i]) if is_noun(pos) and word not in disallowed_words]  
    
    
    # words that show up the most frequently and more than once in the nltk nouns get priority

    # sort on basis of frequency of elements
    temp_nltk_s_w = [item for items, c in Counter(nltk_search_words[i]).most_common() for item in [items] * c]
    # remove unique values
    for index in range(len(temp_nltk_s_w) - 1, -1, -1):
        if temp_nltk_s_w.count(temp_nltk_s_w[index]) == 1:
            del temp_nltk_s_w[index]
    # temp_nltk_s_w now contains only words that occur more than once in nltk_search_words, sorted 
    # by those with the most duplicates first.
    # temp_nltk_s_w does not contain any unique words. For example: ['word1', 'word1', 'word1', 'word2', 'word2']
    # now, remove duplicates... 
    temp_nltk_s_w = list(dict.fromkeys(temp_nltk_s_w))
    # ...and append this list to the final_search_words (first truncate list if there are too many)
    # truncate in place; reusing i as an inner loop variable here would corrupt the outer index
    del temp_nltk_s_w[num_search_words:]
    final_search_words.append(temp_nltk_s_w)
    
    
    # if the list of search words is not completely filled, use classification to find which words in the sentence have the best fit
    # (skip this when there are no candidate words, because the classifier needs at least one label)
    if len(final_search_words[i]) < num_search_words and pos_search_words[i]:
        # labels are returned sorted from best fit to worst fit
        cl_res = classifier(x, pos_search_words[i], multi_label=True)['labels']
        for label in cl_res:
            if len(final_search_words[i]) >= num_search_words:
                break
            if label not in final_search_words[i]:
                final_search_words[i].append(label)
        # for reference: classifier(sequence, candidate_labels, multi_label=True)
    #print("iteration", i, ":", final_search_words)
    
# print(final_search_words)

# Get picture and save to csv file

## Using Pexels
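
Search keywords have to be URL-encoded before they are sent to the Pexels search endpoint. The sketch below shows one way to build such a URL with the standard library, without sending a request. The function name `build_search_url` is hypothetical; the endpoint and the `query` and `per_page` parameters are the ones used in the cell below.

```python
from urllib.parse import quote, urlencode

SEARCH_URL = "https://api.pexels.com/v1/search"


def build_search_url(keywords, per_page):
    """Encode keywords and page size as a query string for the search endpoint."""
    params = {"query": " ".join(keywords), "per_page": per_page}
    # quote_via=quote encodes spaces as %20 instead of '+'
    return SEARCH_URL + "?" + urlencode(params, quote_via=quote)


print(build_search_url(["black", "hole"], 2))
```

Keeping options such as `allow_redirects` out of the URL string matters: they are arguments to the request call, not query parameters.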

In [None]:
# # the higher the photos_per_request, the more variation there will be in images, but the less relevance the images might have (max 80)
# photos_per_request = 2
# image_size = 'large'  # 'original', 'large2x', 'large', 'medium', 'small', 'portrait', 'landscape', or 'tiny'

# API_KEY = '563492ad6f91700001000001c12998557b2d40488b7bc8d328bccd80'
# # The API key is relatively easy to get by creating an account at https://www.pexels.com/api/
# # It offers 200 requests per hour and 20,000 requests per month for free. 

# default_query = 'science'
# # NOTE: The default query is here in case no results are found in the image search with the keyword(s) found above.
# # It may work well to enter a single word that is similar to the topics of other posts/images.
# # If there is no such theme, a good default may be 'nature' or 'food' or 'people', for example. 
# # Any post that requires this 'placeholder' image will have 'PLACEHOLDER' at the end of its filename.
# # If the titles are very short, but there is a theme among them such as 'food', make 'food' the 
# # default_query and increase the photos_per_request to a high number such as 50-80


# # range of hours (for example, if post hours should range from -20 to 60, enter -20 and 60)
# smallest_hour = -20
# largest_hour = 60


# for csv file
post_data = {'id': [],
            'body': [],
            'picture': [],
            'actor': [],
            'time': [],
            'class': [],
            'experiment_group': []
}
# get list of actors
actor_names = list(pd.read_csv('output/csv_files/actors.csv')['username'])

pexels_url = 'https://api.pexels.com/v1/search'
pexels_headers = {'Authorization': API_KEY}

for i, x in enumerate(final_search_words):
    image_found = 1
    # construct search query; requests URL-encodes the spaces between keywords
    my_query = ' '.join(final_search_words[i])
    print(text_results_punct[i])
    photos = []
    if my_query:  # skip the first search when no keywords were found
        response = requests.get(pexels_url, params={'query': my_query, 'per_page': photos_per_request}, headers=pexels_headers, allow_redirects=True)
        photos = response.json()['photos']
    if len(photos) == 0: # if nothing found in original query, fall back to default_query
        response = requests.get(pexels_url, params={'query': default_query, 'per_page': photos_per_request}, headers=pexels_headers, allow_redirects=True)
        photos = response.json()['photos']
        image_found = 0
    this_image = random.choice(photos)
    print(this_image['url'])
    # download image
    image_data = requests.get(str(this_image['src'][str(image_size)])).content
    image_name = 'post_img_' + str(i) + '.png'
    if not image_found: image_name = 'post_img_' + str(i) + '_PLACEHOLDER' + '.png'
    image_path = 'output/post_pictures/' + image_name
    with open(image_path, 'wb') as handler:
        handler.write(image_data)
    
    # make into csv
    post_data['id'].append(i)
    post_data['body'].append(text_results_punct[i])
    post_data['picture'].append(image_name)
    post_data['actor'].append(random.choice(actor_names))
    post_data['time'].append(str(random.randint(smallest_hour, largest_hour)).zfill(2) + ":" + str(random.randint(0,59)).zfill(2))
    post_data['class'].append(random.choice(['normal', 'cohort']))
    post_data['experiment_group'].append(random.choice(['var1', 'var2', 'var3', 'var4']))

posts_df = pd.DataFrame(post_data)
posts_df.to_csv('output/csv_files/posts.csv', index=False)

print("\n\n\nPosts generated!!")