<a href="https://colab.research.google.com/github/Josh-Been/Mining-Reddit-Instagram/blob/master/Mine_Reddit_Subgroups.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Baylor Libraries: Digital Scholarship](https://cpb-us-w2.wpmucdn.com/blogs.baylor.edu/dist/7/7192/files/2019/08/cropped-DigitalScholarshipblog_header-2019-08-30-1.jpg)

# Mine Reddit Subgroups

Primary Libraries Used

*   Python Reddit Wrapper [PRAW: The Python Reddit API Wrapper](https://praw.readthedocs.io/en/latest/)
*   OCR Text from Images [Github: tesseract-ocr](https://github.com/tesseract-ocr/tesseract)
*   Recognize Objects in Images [Github: ImageAI](https://github.com/OlafenwaMoses/ImageAI)

Thank you to the following helpful guides

*   [Felippe Rodrigues: How to scrape Reddit with Python](https://www.storybench.org/how-to-scrape-reddit-with-python/)
*   [Bhadresh Savani: OCR from Image using Pytesseract in Python on Colab Notebook?](https://medium.com/@bhadreshpsavani/how-to-use-tesseract-library-for-ocr-in-google-colab-notebook-5da5470e4fe0)
*   [Object Detection with 10 lines of code](https://towardsdatascience.com/object-detection-with-10-lines-of-code-d6cb4d86f606)

02/2020, Josh Been

# Step 1 - Install PRAW

Run the following snippet. Wait until it reports that it is complete before moving to the next snippet.

In [0]:
cursor='  >> '
print(cursor,'Installing PRAW')
from google.colab import output

# Install PRAW only if it is not already available in this runtime
try:
  import praw
except ImportError:
  !pip install praw
  import praw

# Clear the pip log, then re-print the status lines
output.clear()
print(cursor,'Installing PRAW')
print(cursor, 'PRAW Installation Complete')

import pandas as pd
import datetime as dt

print(cursor, 'Complete. Move on to the next snippet.')

# Step 2 - Enter 5 Reddit API Strings

Specify the following 5 strings in the snippet below: `client_id`, `client_secret`, `user_agent`, `username`, and `password`.

1.   Head to [Reddit Apps](https://www.reddit.com/prefs/apps)
2.   Sign in
3.   Click **edit** on an existing app **OR** click the **create app** (or **create another app**) button at the bottom left
   * Name the application
   * Select the **script** option
   * Enter http://localhost:8080 in the redirect uri field

<img src="https://i.ibb.co/cLTkV7b/reddit-api.png" width=75%>

Alternatively, see [this Baylor research guide](https://researchguides.baylor.edu/c.php?g=980986).

In [0]:
# Paste your own five strings between the quotes
reddit = praw.Reddit(client_id='', client_secret='', user_agent='', username='', password='')
print(cursor, 'Complete. Move on to the next snippet.')
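
Typing a password directly into a shared notebook makes it easy to leak. As one possible alternative, the sketch below gathers the same five strings from environment variables. The `REDDIT_*` variable names and the helper `load_reddit_credentials` are hypothetical, chosen only for this illustration.

```python
import os

# The five keyword arguments passed to praw.Reddit(...) in the snippet above.
REQUIRED_KEYS = ("client_id", "client_secret", "user_agent", "username", "password")


def load_reddit_credentials(env=None):
    """Return a dict holding the five credential strings.

    The REDDIT_* variable names are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    creds = {key: env.get("REDDIT_" + key.upper(), "") for key in REQUIRED_KEYS}
    missing = [key for key, value in creds.items() if not value]
    if missing:
        raise ValueError("Missing Reddit credentials: " + ", ".join(missing))
    return creds


# Usage (assumes the variables were set earlier in the session):
# reddit = praw.Reddit(**load_reddit_credentials())
```

Failing early with the list of missing keys tends to be easier to debug than an authentication error raised later by the first request.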

# Step 3 - Specify Subreddit

Specify the name of the subreddit to search.

Then run the snippet.

In [0]:
subreddit = reddit.subreddit('')   # Enter subreddit between the ''
print(cursor, 'Complete. Move on to the next snippet.')

# Step 4 - Type and Amount of Posts to Return

Specify the type and amount of posts to return. Each subreddit offers five ways of listing the posts created by redditors: **.hot**, **.new**, **.controversial**, **.top**, and **.gilded**. You can also use **.search**("SEARCH_KEYWORDS") to return only posts that match a keyword search.

Then run the snippet. This snippet will test whether your authentication and search were successful by attempting to request 1 post.

In [0]:
top_subreddit = subreddit.top(limit=100)

for submission in subreddit.top(limit=1):
    print(cursor, submission.title, submission.id)

print(cursor, 'Complete. Move on to the next snippet.')
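
To switch between the listing types named above without editing the method name by hand, the choice can be passed as a string. This is a minimal sketch; `fetch_listing` is a hypothetical helper, and it assumes each listing method accepts a `limit` keyword the way `subreddit.top(limit=100)` does in the snippet above.

```python
def fetch_listing(subreddit, kind="top", limit=100, query=None):
    """Return an iterator of posts from one of the subreddit listings."""
    allowed = ("hot", "new", "controversial", "top", "gilded")
    if query is not None:
        # .search takes the keywords as its first argument
        return subreddit.search(query, limit=limit)
    if kind not in allowed:
        raise ValueError("kind must be one of " + ", ".join(allowed))
    # Look up the method by name, e.g. subreddit.hot, then call it
    return getattr(subreddit, kind)(limit=limit)


# Usage (requires the authenticated `subreddit` object from Step 3):
# top_subreddit = fetch_listing(subreddit, kind="new", limit=50)
```

Restricting `kind` to a fixed tuple keeps a typo from silently calling some unrelated attribute of the subreddit object.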

# Step 5 - Write Search Results Metadata to REDDIT.csv

If the last snippet successfully returned 1 post, this snippet should successfully extract the full search request and save to a comma delimited .csv spreadsheet file.

In [0]:
import datetime as dt

# One list per output column; each post appends one value to every list
topics_dict = { "title":[], "score":[], "id":[], "url":[], "comms_num": [], "created": [], "body":[], "img":[]}

for submission in top_subreddit:
    topics_dict["title"].append(submission.title)
    topics_dict["score"].append(submission.score)
    topics_dict["id"].append(submission.id)
    topics_dict["url"].append(submission.url)
    topics_dict["comms_num"].append(submission.num_comments)
    topics_dict["created"].append(submission.created)
    topics_dict["body"].append(submission.selftext.replace('\n',' '))
    # Last URL segment; used later to match downloaded image files to posts
    topics_dict["img"].append(submission.url.split('/')[-1])

topics_data = pd.DataFrame(topics_dict)

# Convert the Unix epoch seconds in "created" to a readable datetime
def get_date(created):
    return dt.datetime.fromtimestamp(created)

_timestamp = topics_data["created"].apply(get_date)

topics_data = topics_data.assign(timestamp = _timestamp)

topics_data.to_csv('REDDIT.csv', index=False)
print(cursor, 'Created REDDIT.csv with', len(topics_data), 'records.')
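
`dt.datetime.fromtimestamp` interprets the epoch value in the local time zone of whatever machine runs the notebook. If a fixed reference is preferable, a timezone-aware UTC conversion is one option, sketched below. The helper name `to_utc` is hypothetical.

```python
import datetime as dt


def to_utc(created):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(created, tz=dt.timezone.utc)


print(to_utc(0).isoformat())  # 1970-01-01T00:00:00+00:00
```

It could stand in for `get_date`, for example `topics_data["created"].apply(to_utc)`.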

# Step 6 - Download Accompanying Images

Run the following code snippet to download all images linked in the subreddit extract created above.

The images are saved in a new folder named with the current date/time stamp. A zip archive with the same name is also created for easy downloading.

In [0]:
import os, shutil, urllib.request
from google.colab import files

imagetypes=['jpg','png','tif','bmp','gif']

print(cursor, 'Downloading images')
import datetime
# NOTE: dt now holds the folder name (Step 7 reads it), replacing the datetime alias imported earlier
dt = str(datetime.datetime.now())
os.mkdir(dt)

for img in topics_dict['url']:
  # Keep only URLs that end in a known image extension
  if img[-3:].lower() in imagetypes:
    filename = img.split('/')[-1].split('?')[0]
    try:
      urllib.request.urlretrieve(img, dt+'/'+filename)
    except Exception:
      print('Skipping image due to error', img)

# Zip the folder so all images can be downloaded as one file
shutil.make_archive(dt, 'zip', dt)
print(cursor, 'Directory with images created.')
print(cursor, 'Zipped file created.')
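
Checking only the last three characters of a URL skips links that carry a query string or use a four-letter extension. The sketch below parses the URL first and then compares the extension. `image_filename` is a hypothetical helper, and adding `.jpeg` to the list is an assumption beyond the five types used above.

```python
import posixpath
from urllib.parse import urlparse

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".bmp", ".gif"}


def image_filename(url):
    """Return the file name if the URL points at an image, else None."""
    path = urlparse(url).path            # drops any ?query or #fragment
    name = posixpath.basename(path)      # last path segment
    ext = posixpath.splitext(name)[1].lower()
    return name if ext in IMAGE_EXTENSIONS else None


# The URLs below are illustrative only.
print(image_filename("https://example.com/pic.PNG?width=640"))   # pic.PNG
print(image_filename("https://www.reddit.com/r/x/comments/1/"))  # None
```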

# Step 7 - OCR Text From Images

Extract text from the downloaded images with Tesseract OCR, clean the text, and write it to a new CSV spreadsheet.

In [0]:
print(cursor, 'Install Tesseract OCR')
!sudo apt install tesseract-ocr
!pip install pytesseract
output.clear()

print(cursor, 'Install Tesseract OCR')
print(cursor, 'Installation completed')
print(cursor, 'Processing images')
import pytesseract, shutil, os, random, glob, spacy
try:
    from PIL import Image
except ImportError:
    import Image
nlp = spacy.load("en_core_web_sm")

location = dt
fileset = [file for file in glob.glob(location + "**/*", recursive=True)]
i=1
imgtxt=[]
urls=[]
for img in fileset:
    word_list=[]
    try:
        obtained_txt=pytesseract.image_to_string(Image.open(img))
    except Exception:
        print(cursor, 'Sorry, problem occurred with', img, 'Skipping...')
        obtained_txt=''   # record an empty string for unreadable files
    # Flatten line breaks, tokenize with spaCy, and lowercase each token
    doc=nlp(obtained_txt.replace('\n',' ').replace('\r',' '))
    for token in doc:
        word_list.append(token.text.lower())
    extractedInformation=' '.join(word_list)
    print(cursor, i,'/',len(fileset))
    imgtxt.append(extractedInformation)
    # File name only, so it matches the "img" column written in Step 5
    urls.append(os.path.basename(img))
    i+=1

df_img = pd.DataFrame(imgtxt, columns =['img2txt'])
df_img['img']=urls
df_merged=pd.merge(topics_data, df_img, on='img', how='left')
df_merged.to_csv('REDDIT_OCR.csv', index=False)
print(cursor, 'Completed. Written to REDDIT_OCR.csv')
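
The cleaning step above amounts to flattening line breaks and lowercasing the text. For a quick check without spaCy, a rough standard-library approximation is sketched below. It splits on whitespace only, so unlike spaCy's tokenizer it leaves punctuation attached to words. `normalize_ocr_text` is a hypothetical name.

```python
def normalize_ocr_text(raw):
    """Collapse all runs of whitespace to single spaces and lowercase the text."""
    return " ".join(raw.split()).lower()


print(normalize_ocr_text("Hello,\nWORLD\r\n  again"))  # hello, world again
```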

# Step 8 - Install ImageAI

In [0]:
print(cursor,'Installing imageai')
!pip install imageai --upgrade
output.clear()
print(cursor,'Installing imageai')
import urllib.request
urllib.request.urlretrieve ("https://github.com/OlafenwaMoses/ImageAI/releases/download/1.0/resnet50_coco_best_v2.0.1.h5", "resnet50_coco_best_v2.0.1.h5")
print(cursor, 'Completed')
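
The model file is large, so re-running this step downloads it again each time. One way to avoid that is to skip the download when the file is already present. This is a sketch only; `download_if_missing` is a hypothetical helper, and its `fetch` parameter exists just so the download function can be swapped out.

```python
import os
import urllib.request


def download_if_missing(url, path, fetch=urllib.request.urlretrieve):
    """Download url to path unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.isfile(path):
        return False
    fetch(url, path)
    return True


# Usage with the model URL from the snippet above:
# download_if_missing(
#     "https://github.com/OlafenwaMoses/ImageAI/releases/download/1.0/resnet50_coco_best_v2.0.1.h5",
#     "resnet50_coco_best_v2.0.1.h5")
```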

# Step 9 - Recognize Objects

[AI Details](https://github.com/OlafenwaMoses/ImageAI/blob/master/imageai/Detection/README.md)

In [0]:
%tensorflow_version 1.x
from imageai.Detection import ObjectDetection
import os
from IPython.display import Image

recognized_items={}

fileset = [file for file in glob.glob(location + "**/*", recursive=True)]
i=0

# Load the RetinaNet model downloaded in Step 8 once, instead of once per image
print(cursor, 'Loading model')
detector = ObjectDetection()
detector.setModelTypeAsRetinaNet()
detector.setModelPath("resnet50_coco_best_v2.0.1.h5")
detector.loadModel()
output.clear()

# [:10] returns the first 10 images for speed

print(cursor, 'Processing Images')
for img in fileset[:10]:
    i+=1
    print(cursor, 'processing image', i)
    # Write the annotated copy next to the original, e.g. cat.jpg -> cat_new.jpg
    root, ext = os.path.splitext(img)
    annotated = root + '_new' + ext
    detections = detector.detectObjectsFromImage(input_image=img, output_image_path=annotated)
    if len(detections)>0:
        display(Image(annotated))
        for eachObject in detections:
            print(eachObject["name"] , " : " , eachObject["percentage_probability"] )
            # Tally how many times each object class was detected
            recognized_items[eachObject["name"]] = recognized_items.get(eachObject["name"], 0) + 1
    else:
        print(cursor, 'No objects recognized')
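
The tally kept in `recognized_items` can also be expressed with `collections.Counter`, which makes it simple to ignore low-confidence detections. The sketch below assumes each detection is a dictionary with the `name` and `percentage_probability` keys printed above. `count_objects` is a hypothetical helper and the sample values are made up for illustration.

```python
from collections import Counter


def count_objects(detections, min_probability=0.0):
    """Count detected class names, skipping detections below min_probability (percent)."""
    return Counter(
        d["name"] for d in detections if d["percentage_probability"] >= min_probability
    )


# Made-up detections in the same shape as the dictionaries used above
sample = [
    {"name": "person", "percentage_probability": 91.2},
    {"name": "person", "percentage_probability": 55.0},
    {"name": "dog", "percentage_probability": 40.5},
]
print(count_objects(sample, min_probability=50))  # Counter({'person': 2})
```

Counters from several images can be combined with `+`, which gives the same running total as the dictionary updates in the loop.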

# Step 10 - Column Chart of Recognized Objects

In [0]:
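import pylab as pl
import numpy as np

# Convert the dict views to lists so matplotlib receives plain sequences
labels = list(recognized_items.keys())
counts = list(recognized_items.values())

X = np.arange(len(labels))
pl.bar(X, counts, align='center', width=0.5)
pl.xticks(X, labels)
ymax = max(counts, default=0) + 1   # default avoids an error when nothing was recognized
pl.ylim(0, ymax)
pl.show()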
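
With many object classes the chart is easier to read when the bars are ordered by count. One way to prepare the data is sketched below; `sorted_counts` is a hypothetical helper and the input dictionary is made up for illustration.

```python
def sorted_counts(recognized_items, top_n=None):
    """Return (labels, counts) ordered from most to least frequent."""
    ranked = sorted(recognized_items.items(), key=lambda kv: kv[1], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]   # keep only the most frequent classes
    labels = [name for name, _ in ranked]
    counts = [count for _, count in ranked]
    return labels, counts


labels, counts = sorted_counts({"dog": 1, "person": 4, "car": 2})
print(labels, counts)  # ['person', 'car', 'dog'] [4, 2, 1]
```

The two returned lists can be passed to `pl.xticks` and `pl.bar` in place of the dictionary keys and values.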