# Notes about this notebook

This notebook was written by me, Ryan Schmid, who started this project. It is where I began experimenting with text generation and pairing that text with images. It is a very rough, older version of the `4_generate_posts.ipynb` notebook, so its code is less recent than what is in that notebook. However, it keeps leftover code from my experiments with other techniques, such as alternative image and keyword-finding APIs. I am including it in the project as a reference and a possible starting point for a future developer who wishes to expand on any of these ideas.

**Note: a few file references may need editing**

A few thoughts about expanding the project:
1. Keyword-finding methods
    1. Some APIs/libraries were able to pull out noun phrases from sentences. I had limited success using these, but it could be one approach to consider. 
1. Image searching
    1. The Pexels library consists of stock photos, and many of them look like stock photos. So if you'd like photos to look more authentic, see what Instagram and Facebook might offer for APIs. 
    1. A Google Image API might have the widest range and could produce images that fit nicely with the text. It also has a very flexible search feature. I tried using it for this reason, but unfortunately, the API I used came back with limited and often very unrelated images after I limited the images to only those that could be used commercially. 
1. Overall approach to generating posts
    1. Finding an image first and then generating text from it might be more promising than what I did (the exact opposite). There are numerous examples and tutorials online for generating image captions with computer vision libraries. Even just extracting keywords for the subject(s) of an image and generating text from them would likely make the images relate much more closely to the text. Prompting the GPT-Neo model with keywords plus a textual example from a post could also work well. For example, you could collect Reddit posts, or pass in a few examples from an existing Truman project's posts.csv file, and have the model build a similar sentence using the keywords found from the image. 
1. Prompting GPT-Neo
    1. As you can see from the printed result below, the format for the GPT-Neo prompt is:
    ```
    Prompt: ...
    ###
    Prompt: ...
    ###
    ...
    ###
    Prompt:
    ```
        This was very useful. I got the idea from an online source that prompted the model in a similar way (not the best example, but you get the idea):
    ```
    Word: person
    Sentence: The person went for a walk.
    ###
    Word: car
    Sentence: The car drove 10 miles.
    ###
    Word: airplane
    Sentence:
    ```
        ... and it would generate something like `The airplane flew to the airport.` Giving the model a few examples of this "prompt and response" type of input can be a fairly powerful way to have it generate something similar to what you want. 
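
The "prompt and response" format described above can be assembled with a small helper. This is only a minimal sketch: `build_few_shot_prompt` and its parameters are hypothetical names that are not used elsewhere in this notebook, and the `###` separator is simply the one shown in the examples.

```python
def build_few_shot_prompt(examples, query, input_label="Word",
                          output_label="Sentence", separator="###"):
    """Format (input, output) example pairs followed by an open-ended query."""
    blocks = [
        f"{input_label}: {inp}\n{output_label}: {out}"
        for inp, out in examples
    ]
    # The last block leaves the output label empty so the model completes it.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return f"\n{separator}\n".join(blocks)


examples = [
    ("person", "The person went for a walk."),
    ("car", "The car drove 10 miles."),
]
prompt = build_few_shot_prompt(examples, "airplane")
print(prompt)
```

The resulting string could then be passed to `gpt_neo.generate_text(prompt, args=args)` in the same way as the other prompts in this notebook.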

Beginnings for this project were sourced from https://medium.com/mlearning-ai/text-generation-using-gpt-neo-41877ef586c7
and https://www.vennify.ai/gpt-neo-made-easy/

# Getting Started

## Installing dependencies

In a terminal, `pip install happytransformer` or `pip install --upgrade --force-reinstall happytransformer`

In [None]:
from happytransformer import HappyGeneration, GENSettings, GENTrainArgs
from tqdm import tqdm
import requests
import re
import nltk
import random
from collections import Counter
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('brown')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordfreq import zipf_frequency
from textblob import TextBlob
from transformers import pipeline

## Download and Save the Model

Note: The model used here is GPT-Neo, which is available in sizes of up to 2.7 billion parameters (~10 GB). It is a free, open-source model built by EleutherAI (not OpenAI) as an open alternative to GPT-3. For comparison, GPT-3, OpenAI's largest model when this notebook was written, uses 175 billion parameters and drew its training set from about 45 TB of raw text data.
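
The ~10 GB figure is roughly what you get from multiplying the parameter count by the size of one weight. A quick back-of-the-envelope check, assuming the weights are stored as 32-bit floats (4 bytes each) and ignoring any other files in the download; `approx_weights_gb` is a hypothetical helper name:

```python
BYTES_PER_FLOAT32 = 4  # assumption: weights are stored as 32-bit floats


def approx_weights_gb(num_parameters, bytes_per_parameter=BYTES_PER_FLOAT32):
    """Rough size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


for name, params in [("gpt-neo-125M", 125e6),
                     ("gpt-neo-1.3B", 1.3e9),
                     ("gpt-neo-2.7B", 2.7e9)]:
    print(f"{name}: ~{approx_weights_gb(params):.1f} GB")
```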

In [None]:
# this part takes a while and may download ~5GB of data with the 1.3B model or ~10GB with the 2.7B model
gpt_neo = HappyGeneration(model_type="GPT-NEO", model_name="EleutherAI/gpt-neo-1.3B") 
# to use a different-sized model, replace the model_name with
# "EleutherAI/gpt-neo-125M" or "EleutherAI/gpt-neo-1.3B" or "EleutherAI/gpt-neo-2.7B"

gpt_neo.save("internal/model_gpt_neo/")

print (gpt_neo)

## Load a previously downloaded model

In [None]:
gpt_neo = HappyGeneration(model_type="GPT-NEO", model_name="EleutherAI/gpt-neo-1.3B", load_path="internal/model_gpt_neo/")

## Test the model with a string as input

For documentation, see https://happytransformer.com/text-generation/settings/

In [None]:
args = GENSettings(max_length=100, no_repeat_ngram_size=2, do_sample=True, early_stopping=False, top_k=50, temperature=0.7)

result = gpt_neo.generate_text("write a Reddit post about food.", args=args)

print(result.text)

# Have the model generate text given examples

## Get data for training

`pip install requests`

### Reddit

In [None]:
# get posts
subreddit = 'Science'
limit = 50
listing = '' # controversial, best, hot, new, random, rising, top

url = 'https://www.reddit.com/r/' + subreddit + '/' + listing + '.json?limit=' + str(limit)
header = {'User-agent': 'Useragentname'}
res = requests.get(url,headers=header)
print("Connection status: ", res.status_code)
posts = res.json()['data']['children']
print(len(posts), "posts fetched")

In [None]:
# collect desired post attributes
titles = [post['data']['title'] for post in posts]
content = [post['data']['selftext'] for post in posts]

## Generate text

In [None]:
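number_of_inputs = 10

# NOTE: the slice starts at index 1, so the first title is skipped and
# number_of_inputs - 1 titles are used as examples
my_titles = titles[1:number_of_inputs]
input_str = "".join("Prompt: " + title + "\n###\n" for title in my_titles)
input_str += "Prompt:"  # left open so the model writes the next title
print(input_str)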

In [None]:
args = GENSettings(max_length=50, no_repeat_ngram_size=2, do_sample=True, early_stopping=False, top_k=50, temperature=0.7)

num_results = 5
text_results = []
for i in tqdm(range(num_results)):
    result = gpt_neo.generate_text(input_str, args=args)
    text_results.append(result.text.partition('\n')[0]) # take only the first line of each result
for i in range(num_results):
    print(text_results[i] + "\n")

# remove partial sentences at the end
# downside is time
# gpt neo api
# gpt 3 api
# run with 2.7B model
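
One of the notes above is to remove partial sentences at the end of a generated result. A minimal sketch of one way to do that with a regular expression follows; `trim_partial_sentence` is a hypothetical helper that is not part of this notebook, and it treats every `.`, `!` or `?` as a sentence end, so text ending in something like `1.5 miles and` would be cut at the decimal point.

```python
import re


def trim_partial_sentence(text):
    """Drop any trailing fragment after the last sentence-ending punctuation mark."""
    # Greedy match up to the last '.', '!' or '?', optionally followed by a
    # closing quote or bracket.
    match = re.match(r'^(.*[.!?]["\')\]]?)', text, flags=re.DOTALL)
    # If no sentence end is found, keep the text rather than returning nothing.
    return match.group(1).strip() if match else text.strip()


print(trim_partial_sentence("The sky is blue. The grass is"))
```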

# Take the generated text, and find pictures to go along with each instance

`pip install wordfreq`
`pip install textblob`
`pip install simplejson`

In [None]:
# import spacy
# !python3 -m spacy download en_core_web_sm

In [None]:
# filter out stop words and punctuation for better image searches
stop_words = set(stopwords.words('english'))
text_res_tokens = []        # tokens from the punctuation-free text
text_res_tokens_punct = []  # tokens from the original text (punctuation kept)
text_results_punct = text_results.copy()  # keep a copy of the original text
for idx, text in enumerate(text_results_punct):
    # remove punctuation (text_results is overwritten with the cleaned text)
    text_results[idx] = re.sub(r'[^\w\s]', '', text)
    word_tokens = word_tokenize(text_results[idx])
    word_tokens_punct = word_tokenize(text)
    # remove stop words case-insensitively and lowercase what remains
    # (the earlier version compared un-lowercased tokens, so capitalized
    # stop words such as 'The' were not filtered out)
    text_res_tokens.append([w.lower() for w in word_tokens if w.lower() not in stop_words])
    text_res_tokens_punct.append([w.lower() for w in word_tokens_punct if w.lower() not in stop_words])

In [None]:
# find the least commonly used words for an image search
num_words_for_search = 5    # take the n least common words from each result
language = 'en'

pos_search_words = []
for tokens in text_res_tokens:
    # map each token to its Zipf frequency (lower value = rarer word)
    search_words = {word: zipf_frequency(word, language) for word in tokens}
    # keep only the n rarest words, rarest first
    search_words = dict(sorted(search_words.items(), key=lambda item: item[1])[:num_words_for_search])
    print(search_words)
    pos_search_words.append(list(search_words.keys()))

In [None]:
# nltk nouns
# function to test if a part-of-speech tag marks a noun
is_noun = lambda pos: pos[:2] == 'NN'
nltk_search_words = []
for lines in text_results_punct:
    # tokenize, tag parts of speech, and keep the nouns longer than one character
    tokenized = nltk.word_tokenize(lines)
    nouns = [word.lower() for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos) and len(word) > 1]
    print(nouns)
    nltk_search_words.append(nouns)
# combine these search words with those from the pos_search_words list above
for i, words in enumerate(pos_search_words):
    for noun in nltk_search_words[i]:
        if noun not in words:
            words.append(noun)

In [None]:
# search for overlap in textblob nouns and nltk nouns?
# get verbs too? adjectives?
# search for first (max 10) words in google image search api?

# put sentiment in search?

In [None]:
# 1. least commonly used word before first punctuation (except ' and ")
# 2. repeated in nltk nouns
# 3. in both nltk nouns and least common words
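
The ranking ideas noted above could be combined roughly as follows. This is only a sketch of heuristics 2 and 3 (repeated nouns first, then words found in both lists); heuristic 1 is left out, and `rank_candidates` is a hypothetical name rather than something the rest of the notebook uses.

```python
from collections import Counter


def rank_candidates(nouns, rare_words):
    """Order candidate search words: repeated nouns first, then words in both lists."""
    counts = Counter(nouns)
    rare = set(rare_words)
    # De-duplicate while keeping first-seen order.
    candidates = list(dict.fromkeys(list(nouns) + list(rare_words)))

    def score(word):
        repeated = counts[word] > 1                 # noun occurs more than once
        in_both = word in counts and word in rare   # a noun that is also a rare word
        return (repeated, in_both)

    # sorted() is stable, so words with equal scores keep their original order.
    return sorted(candidates, key=score, reverse=True)


print(rank_candidates(['ice', 'study', 'ice', 'greenland'], ['greenland', 'icefree']))
```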

### From classification

#### Download and save the model

In [None]:
classifier = pipeline("zero-shot-classification")
classifier.save_pretrained('internal/model_classifier')

#### Load the model

In [None]:
classifier = pipeline("zero-shot-classification", 'internal/model_classifier')

#### Example

In [None]:
sequence = "A study shows that the world could be facing a ’climate-change-induced, ice-free‘ year around the end of this century. One theory suggests that if the Earth keeps warming at the current rate, the ice sheets in Greenland"
candidate_labels = ['climatechangeinduced',
  'icefree',
  'greenland',
  'warming',
  'sheets',
  'study',
  'world',
  'year',
  'end',
  'century',
  'theory',
  'Earth',
  'rate',
  'ice',
  'Greenland']

cl_res = classifier(sequence, candidate_labels, multi_label=True)
#cl_res['labels'][0]
cl_res
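
The classifier result is a dictionary with `sequence`, `labels` and `scores` keys, where `labels` and `scores` line up by position. A small sketch of pulling out the best-fitting labels from a result of that shape; `top_labels` is a hypothetical helper, and the scores in the example are made up purely to show the structure.

```python
def top_labels(result, n=2, min_score=0.0):
    """Return up to n labels whose score is at least min_score, best first."""
    pairs = sorted(zip(result["labels"], result["scores"]),
                   key=lambda pair: pair[1], reverse=True)
    return [label for label, score in pairs if score >= min_score][:n]


# Hypothetical result with invented scores, only to illustrate the shape
example = {"sequence": "...",
           "labels": ["ice", "warming", "year"],
           "scores": [0.97, 0.91, 0.12]}
print(top_labels(example, n=2, min_score=0.5))
```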

#### Find keyword

In [None]:
num_search_words = 2  # include this many words in each search. Setting this to 1 may result in 
    # empty search results and a random image being placed instead

# For "disallowed_words", be sure to list words that are very common among responses.
# For example, if the topics involve scientific research, include words like 'researchers' or 'study' and 'studies'. 
# Check which keywords the algorithm produces, and if there are many repeats that might clutter an image search, 
# be sure to include them below.
disallowed_words = ['question', 'phrase', 'questions', 'phrases', 'sentence', 'sentences', 'article', 'articles',
                              'paragraph', 'paragraphs', 'why', 'how', 'did', 'what', 'who', 'where', 'when', 'do',
                              'researchers', 'study']

# initialize var
final_search_words = []

# function for filtering everything but nouns
# (the Penn Treebank noun tags NN, NNS, NNP and NNPS all start with 'NN')
is_noun = lambda pos: pos[:2] == 'NN'

# get keyword
for i, x in enumerate(text_results):
    # exclude everything except nouns (including plural and proper); remove disallowed words
    nltk_search_words[i] = [word for (word, pos) in nltk.pos_tag(nltk_search_words[i]) if is_noun(pos) and word not in disallowed_words]
    pos_search_words[i] = [word for (word, pos) in nltk.pos_tag(pos_search_words[i]) if is_noun(pos) and word not in disallowed_words]

    # words that show up more than once in the nltk nouns get priority,
    # ordered by those with the most duplicates first (each word listed once)
    temp_nltk_s_w = [word for word, c in Counter(nltk_search_words[i]).most_common() if c > 1]
    # truncate the list if there are too many words, then add it to final_search_words
    # (slicing avoids reusing `i` as an inner loop variable, which overwrote the outer index)
    temp_nltk_s_w = temp_nltk_s_w[:num_search_words]
    final_search_words.append(temp_nltk_s_w)

    # if the list of search words is not completely filled, use classification to find which words in the sentence have the best fit
    if len(final_search_words[i]) < num_search_words and pos_search_words[i]:
        # the labels come back sorted from best to worst fit
        cl_res = classifier(x, pos_search_words[i], multi_label=True)['labels']
        for label in cl_res:
            if len(final_search_words[i]) >= num_search_words:
                break
            if label not in final_search_words[i]:
                final_search_words[i].append(label)
        # for reference: classifier(sequence, candidate_labels, multi_label=True)
    #print("iteration", i, ":", final_search_words)

print(final_search_words)

## Get picture

### Pexels

In [None]:
# the higher the photos_per_request, the more variation there will be in images, but the less relevance the images might have
photos_per_request = 2
image_size = 'large'  # 'original', 'large2x', 'large', 'medium', 'small', 'portrait', 'landscape', or 'tiny'

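import os
# Read the key from an environment variable instead of hardcoding it in the notebook.
# The variable name PEXELS_API_KEY is an assumption; use whatever name you set in your environment.
API_KEY = os.environ.get('PEXELS_API_KEY', '')
# The API key is relatively easy to get by creating an account at https://www.pexels.com/api/
# It offers 200 requests per hour and 20,000 requests per month for free. 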

default_query = 'science'
# NOTE: The default query is here in case no results are found in the image search with the keyword(s) found above.
# It may work well to enter a single word that is similar to the topics of other posts/images.
# If there is no such theme, a good default may be 'nature' or 'food' or 'people', for example. 
# Any post that requires this 'placeholder' image will have 'PLACEHOLDER' at the end of its filename.

search_url = 'https://api.pexels.com/v1/search'
headers = {'Authorization': API_KEY}

for i, words in enumerate(final_search_words):
    image_found = True
    print(text_results_punct[i])
    # construct search query (requests URL-encodes the spaces between the words)
    my_query = ' '.join(words)
    photos = []
    if my_query:
        # allow_redirects is passed as an argument; it used to sit inside the URL string by mistake
        response = requests.get(search_url, params={'query': my_query, 'per_page': photos_per_request}, headers=headers, allow_redirects=True)
        photos = response.json().get('photos', [])
    if len(photos) == 0: # if nothing is found for the original query, fall back to the default query
        response = requests.get(search_url, params={'query': default_query, 'per_page': photos_per_request}, headers=headers, allow_redirects=True)
        photos = response.json()['photos']
        image_found = False
    this_image = random.choice(photos)
    print(this_image['url'])
    # download image
    image_data = requests.get(this_image['src'][image_size]).content
    image_name = 'images/image_' + str(i) + '.png'
    if not image_found:
        image_name = 'images/image_' + str(i) + '_PLACEHOLDER' + '.png'
    with open(image_name, 'wb') as handler:
        handler.write(image_data)

## Other image APIs (not in use)

### Unsplash

In [None]:
import requests

# Download an image off unsplash without the api using python
# https://www.codegrepper.com/code-examples/python/download+unsplash+images+python+without+api

# assume no photos to begin with inside 'images' folder

def downloadimage(search_term, num_imgs): # Define the function to download images
    url = "https://source.unsplash.com/random/?" + str(search_term)
    print(url)                                          # State the URL
    num_imgs += 1
    # allow_redirects is passed as an argument; it used to sit inside the URL string by mistake
    response = requests.get(url, allow_redirects=True)  # Download the photo
    filename = "images/image_" + str(num_imgs) + ".png"
    print("Saving to: " + filename)                     # State the filename
    with open(filename, 'wb') as handler:               # Write image file
        handler.write(response.content)


# get all pictures
for i, x in enumerate(final_search_words):
    downloadimage(",".join(x), i)  # join each keyword list into one comma-separated search term

In [None]:
response = requests.get("https://source.unsplash.com/search/photos?query=office", allow_redirects=True)
with open("images/image_test.png", 'wb') as handler:
    handler.write(response.content)
if response.headers.get('X-Imgix-ID') == '104702ca07cd7ae5eeae32a67f307203f17a8128': # X-Imgix-ID for when image is not found
    print('Image not found')
else:
    print('Image found')

In [None]:
# example search URL for the official Unsplash API (kept as a comment so the cell runs):
# https://api.unsplash.com/search/photos?page=1&query=office

In [None]:
response = requests.get("https://source.unsplash.com/random?women&order_by=relevant", allow_redirects=True)

### Wikimedia Commons - consider a google image search api instead for a more flexible image search

In [None]:
# example request

url = "https://commons.wikimedia.org/w/api.php?action=query&generator=images&prop=imageinfo&gimlimit=500&redirects=1&titles=Cat&iiprop=timestamp|user|userid|comment|canonicaltitle|url|size|dimensions|sha1|mime|thumbmime|mediatype|bitdepth"

params = {'format':'json'}

res_img = requests.get(url, params)

In [None]:
search_words

In [None]:
params = {'format':'json'}
image_results = []

for i, x in enumerate(text_results):
    # join ONLY the first two keywords with '|'
    # (final_search_words is used here because search_words is a dict left over from an earlier cell)
    url_query = '|'.join(final_search_words[i][:2])
    url_query = 'food' # hardcoded test override (food works, foods doesn't); remove this line to use the keywords above
    print(url_query)
    url = "https://commons.wikimedia.org/w/api.php?action=query&generator=images&prop=imageinfo&gimlimit=500&redirects=1&titles=" + url_query + "&iiprop=url"
    res_img = requests.get(url, params)
    print(res_img.status_code)
    #print(res_img.json())
    pages = res_img.json()['query']['pages']
    res_img_keys = list(pages)
    #### TODO: randomize the res_img_keys index so that different pictures show up for the same keyword
    image_results.append(pages[res_img_keys[0]]['imageinfo'][0]['url'])
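
The TODO above (showing different pictures for the same keyword) could be handled by choosing a random page instead of always taking the first one. A minimal sketch, assuming the response has the same `query` -> `pages` -> `imageinfo` shape that the cell above reads; `pick_random_image_url` is a hypothetical helper name.

```python
import random


def pick_random_image_url(response_json, rng=random):
    """Pick a random image URL from a Commons 'query' response, or None if there is none."""
    pages = response_json.get("query", {}).get("pages", {})
    # Skip pages that carry no imageinfo entry.
    urls = [
        page["imageinfo"][0]["url"]
        for page in pages.values()
        if page.get("imageinfo")
    ]
    return rng.choice(urls) if urls else None


# Stand-in response containing one usable page, only to show the expected shape
fake_response = {"query": {"pages": {
    "-1": {"imageinfo": [{"url": "https://example.org/a.jpg"}]},
    "-2": {"title": "File:No_info.jpg"},
}}}
print(pick_random_image_url(fake_response))
```

In the loop above, this would replace the `res_img_keys[0]` lookup with `pick_random_image_url(res_img.json())`.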

In [None]:
image_results

### Google images

In [None]:
from icrawler.builtin import GoogleImageCrawler
import os

# get images
google_Crawler = GoogleImageCrawler(storage = {'root_dir': r'images'})
filters = dict(size='large', license='commercial,modify')
google_Crawler.crawl(keyword = 'positive phrase', filters=filters, max_num = 5)
# https://icrawler.readthedocs.io/en/latest/builtin.html

In [None]:
os.rename('images/000001.png','images/100001.png')

In [None]:
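# possible google image search api
# NOTE: Google has deprecated this legacy AJAX Image Search endpoint, so the request
# may no longer work; the code is kept for reference only.

from urllib.request import build_opener
from urllib.parse import quote
import simplejson

fetcher = build_opener()  # build_opener is imported directly; `urllib` itself was never imported
searchTerm = 'parrot'
startIndex = 0
# quote() URL-encodes the search term; startIndex must be converted to a string before concatenating
searchUrl = "http://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=" + quote(searchTerm) + "&start=" + str(startIndex)
f = fetcher.open(searchUrl)
deserialized_output = simplejson.load(f)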