# Natural Observer

## A project to collect the thousands of observations of the natural world from Reddit (and maybe eventually other social media). Photos, identification, and any location information are collated to create a usable dataset for citizen science networks such as eBird and iNaturalist. We hope... eventually...

### Authors: Lindsey Parkinson, Thomas Oliver, and Roman Grisch 

This notebook uses PRAW, the Python Reddit API Wrapper. You must have a Reddit account in order to use the notebook. I keep my credentials in a separate JSON file for anonymity. You can add your own credentials below. 

In [20]:
import os
import praw
import json
import pandas as pd
import numpy as np
from collections import defaultdict
import re
import datetime

#redditkeys.json contains all the information necessary to use the Reddit API
working_directory = os.getcwd()
file_path = os.path.join(working_directory, 'redditkeys.json')

with open(file_path) as infile:
    credentials = json.load(infile)
reddit = praw.Reddit(client_id = credentials["client_id"],
                     client_secret = credentials["client_secret"],
                     user_agent=credentials["user_agent"],
                     username=credentials["username"],
                     password=credentials["password"])
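
The cell above expects `redditkeys.json` to provide five fields. As a minimal sketch, the hypothetical helper `load_credentials` below (not part of the original notebook or of PRAW) checks that all five are present before they are passed to `praw.Reddit`, so a missing field fails with a readable message rather than a `KeyError` in the middle of the call.

```python
import json

# Field names taken from the praw.Reddit(...) call above
REQUIRED_KEYS = ("client_id", "client_secret", "user_agent", "username", "password")

def load_credentials(path):
    """Read a credentials file and check it has every field the notebook uses.

    Hypothetical helper for illustration only.
    """
    with open(path) as infile:
        credentials = json.load(infile)
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise ValueError("credentials file is missing: " + ", ".join(missing))
    return credentials
```

It could replace the `with open(...)` block, for example `credentials = load_credentials(file_path)`.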

In [21]:
#check to ensure it is associated with your Reddit account:
#print(reddit.user.me())

Parker09


### Scraping

There are many r/whatisthis___ or r/whatsthis___ subreddits used for plant, animal, and fungus identification. Here we use r/whatsthisfish as an example, though multiple subreddits can be added to `subreddit_list` below. 

Other subreddits include:
r/whatisthisfish,
r/whatsthisbug,
r/whatsthisbird,
r/whatsthissnake

The subreddits above follow a standard protocol enforced by the moderators, which makes scraping novel observations easier. However, the following subreddits may also be worth considering:
r/slimemolds,
r/whatsthisplant,
r/animalid,
r/PlantIdentification,
r/treeidentification

In [22]:
date_list = []
#author_list = []
id_list = []
link_flair_text_list = []
title_list = []
url_list = []
top_comment_list = []


#subreddits we want to scrape information from
subreddit_list= ['whatsthisfish']

#What information we want from each subreddit post
for subred in subreddit_list:
    subreddit = reddit.subreddit(subred)
    top_post = subreddit.top(limit = 100)  #how many posts from the subreddit we want to pull
    
    for sub in top_post:        
        date_list.append(datetime.datetime.fromtimestamp(sub.created_utc))
        #author_list.append(sub.author)
        id_list.append(sub.id)        
        link_flair_text_list.append(sub.link_flair_text)
        title_list.append(sub.title)
        url_list.append(sub.url)
        
    print(subred, 'completed; ', end='')
    print('total', len(title_list), 'posts scraped')

whatsthisfish completed; total 100 posts scraped
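
The parallel lists above work, but every field has to be kept in step by hand. One alternative, sketched here under the assumption that each post object exposes the same attributes used in the loop (`created_utc`, `id`, `link_flair_text`, `title`, `url`), is to build one dictionary per post. `collect_posts` is a hypothetical helper name, and the stand-in post built with `SimpleNamespace` only exists so the example runs without a Reddit connection. The sketch also interprets `created_utc` explicitly as UTC, whereas `datetime.datetime.fromtimestamp` without a timezone converts to the local clock.

```python
import datetime
from types import SimpleNamespace

def collect_posts(posts, subreddit_name):
    """Turn an iterable of post-like objects into a list of row dictionaries."""
    rows = []
    for post in posts:
        rows.append({
            "Subreddit": subreddit_name,
            # created_utc is a Unix timestamp in UTC, so interpret it as UTC
            "Date": datetime.datetime.fromtimestamp(
                post.created_utc, tz=datetime.timezone.utc),
            "ID": post.id,
            "Title": post.title,
            "Flair": post.link_flair_text,
            "URL": post.url,
        })
    return rows

# Stand-in for a PRAW submission (made-up values, for illustration only)
fake_post = SimpleNamespace(created_utc=0, id="abc123", link_flair_text=None,
                            title="Example title",
                            url="https://example.com/fish.jpg")
rows = collect_posts([fake_post], "whatsthisfish")
print(rows[0]["Date"].isoformat())  # 1970-01-01T00:00:00+00:00
```

With a live connection, `collect_posts(reddit.subreddit(subred).top(limit=100), subred)` could be called once per entry in `subreddit_list`, and `pd.DataFrame(rows)` would then build the table in one step.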


In [24]:
df = pd.DataFrame({'Date': date_list,
                   'ID':id_list, 
                   #'Author':author_list, 
                   'Title':title_list,
                   'Flair':link_flair_text_list,
                   'URL':url_list
                  })

### Formatting URLs

In [23]:
def convert(row, col = "URL"):
    """
    This function will convert strings into hyperlinks readable when exported into csv or pdf. 
    Should make it easier to pull images
    """
    return "<a href='{}'>{}</a>".format(row[col], row.name)

In [25]:
df['URL'] = df.apply(convert, axis = 1)
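
Not every URL points straight at an image: posts can link to direct image files, to `v.redd.it` videos, or to Reddit galleries. As a rough sketch, the hypothetical helper `classify_url` sorts raw URLs into those groups so that only direct images are downloaded. It is meant to run on the raw URL, before `convert` wraps it in an anchor tag, and the extension list is an assumption that may need extending.

```python
from urllib.parse import urlparse

# Assumed set of image extensions; extend as needed
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")

def classify_url(url):
    """Return 'image', 'video', 'gallery', or 'other' for a post URL."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path.lower()
    if path.endswith(IMAGE_EXTENSIONS):
        return "image"
    if host == "v.redd.it":
        return "video"
    if host.endswith("reddit.com") and path.startswith("/gallery/"):
        return "gallery"
    return "other"

print(classify_url("https://i.imgur.com/CVqCZsx.jpg"))  # image
print(classify_url("https://v.redd.it/lvximb7tr5y41"))  # video
```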

### Formatting top comment
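This code extracts the top-level comments of each post and keeps the first one. Our hope was that this comment will contain the correct identification because participants are supposed to upvote the answers they agree with. 

I originally added an if/else statement inside the comment loop to deal with posts that don't seem to have comments, but it did not work. Looping over `submission.comments` does nothing when a post has no comments, so the `else` branch never gets the chance to add "NA" and the comment column ends up 1 or 2 rows shorter than the DataFrame. Keying the dictionary by title has a similar effect when two posts share a title. The cells below therefore key the comments by post ID and build the column by walking through `id_list`, so every post gets exactly one entry.  

In [26]:
#Map each post ID to the bodies of its top-level comments
comments = defaultdict(list)
            
for ID in id_list:
    submission = reddit.submission(str(ID))
    #Remove "load more comments" placeholders, which have no body
    submission.comments.replace_more(limit=0)
    for top_level_comment in submission.comments:
        comments[ID].append(top_level_comment.body)

In [27]:
#One entry per post, in the same order as id_list; "NA" when there are no comments
top_comment = []
    
for ID in id_list:
    bodies = comments.get(ID)
    top_comment.append(bodies[0] if bodies else "NA")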
        

In [28]:
df["Top Comment"] = top_comment
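
The first comment in the tree is not guaranteed to be the most upvoted one, since the order depends on the comment sort and a stickied moderator comment can sit on top. A possible refinement, sketched below, picks the highest-scoring non-stickied top-level comment. It assumes comment objects with `body`, `score`, and `stickied` attributes, which PRAW comments provide. `best_comment` is a hypothetical helper, and the `SimpleNamespace` objects are made-up stand-ins so the example runs offline.

```python
from types import SimpleNamespace

def best_comment(comments, default="NA"):
    """Return the body of the highest-scoring non-stickied comment, or default."""
    candidates = [c for c in comments if not getattr(c, "stickied", False)]
    if not candidates:
        return default
    return max(candidates, key=lambda c: c.score).body

# Made-up comments for illustration only
fake_comments = [
    SimpleNamespace(body="Please read the rules.", score=1, stickied=True),
    SimpleNamespace(body="Looks like a sunfish.", score=12, stickied=False),
    SimpleNamespace(body="Florida pompano.", score=40, stickied=False),
]
print(best_comment(fake_comments))  # Florida pompano.
print(best_comment([]))             # NA
```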

### Extracting location
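The location of the observation should be written within the title of the post. In the following code chunks we use the nltk package to tokenize the post titles and attempt to extract location words. Because the chunker runs with `binary=True`, it tags every named entity as `NE` without distinguishing places from other names, so the results are candidate locations rather than confirmed ones.  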

In [29]:
#Import NLP tools; uncomment the downloads below the first time you run this

import nltk
#nltk.download('punkt')
#nltk.download('maxent_ne_chunker')
#nltk.download('words')
#nltk.download('averaged_perceptron_tagger')

In [33]:
#A function to pull named entities (candidate locations) from Reddit post titles
def extract_entity_names(t):
    """Recursively collect the text of every subtree labelled 'NE'."""
    entity_names = []

    #Only nltk Tree nodes have a label; leaves are (word, tag) tuples
    if hasattr(t, 'label'):
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))
    return entity_names

titles_list = df["Title"].tolist()
location = []

for item in titles_list:
    sentences = nltk.sent_tokenize(item)
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
    chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)
    
    entities = []  
    for tree in chunked_sentences:
        entities.extend(extract_entity_names(tree))
    location.append(entities)
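
Because the chunker returns every named entity it finds, capitalised words that are not places can slip into the results. Two inexpensive heuristics may help, sketched here with hypothetical helper names: reading a leading `[Location]` tag when the title has one, and dropping candidates that appear in a small hand-made list of non-location words. The stop list is an assumption and would need tuning for each subreddit.

```python
import re

# Assumed stop list of capitalised title words that are not places
NON_LOCATION_WORDS = {"Anybody", "Caught", "Picture", "Found", "Need ID"}

# Matches a bracketed tag at the very start of a title, e.g. "[Town, ST] ..."
LEADING_TAG = re.compile(r"^\s*\[([^\]]+)\]")

def bracketed_location(title):
    """Return the text inside a leading [...] tag, or None if there is none."""
    match = LEADING_TAG.match(title)
    return match.group(1).strip() if match else None

def filter_locations(entities):
    """Drop candidate entities that appear in the stop list."""
    return [e for e in entities if e not in NON_LOCATION_WORDS]

print(bracketed_location("[North Myrtle Beach, SC] I was fishing"))
# North Myrtle Beach, SC
print(filter_locations(["Caught", "Florida"]))
# ['Florida']
```

A bracketed tag, when present, is probably more reliable than the chunker output, so it could be tried first with the filtered entities as a fallback.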

In [31]:
df["Location"] = location
df

| | Date | ID | Title | Flair | URL | Top Comment | Location |
|---|---|---|---|---|---|---|---|
| 0 | 2020-10-26 04:40:05 | ji7jvh | Anybody know what species this is? Found in Ja... | Identified, probably | `<a href='https://i.redd.it/bybemwp61dv51.jpg'>...` | Florida pompano, *Trachinotus carolinus* . Th... | [Anybody, Jacksonville, Inshore] |
| 1 | 2020-06-15 00:58:58 | h93riw | Caught in key west, Florida. Never seen anythi... | Identified, high confidence | `<a href='https://i.redd.it/j86d87vnhy451.jpg'>...` | Took me awhile, but its a swallow-tailed bass ... | [Caught, Florida] |
| 2 | 2020-05-11 17:59:31 | ghq9mn | I found this video and the fish is hella cute ... | | `<a href='https://v.redd.it/lvximb7tr5y41'>2</a>` | Looks like spotted porcupinefish also known as... | [] |
| 3 | 2021-01-21 17:49:37 | l2299m | Picture taken in Central Florida. Freshwater p... | Family known, species unidentified | `<a href='https://i.redd.it/ow7f7ozctpc61.jpg'>...` | Common pleco. Non-native species in Florida. | [Picture, Central Florida, Freshwater] |
| 4 | 2020-11-20 22:01:29 | jxxfdr | What's this crustacean? Has a short lobster li... | Identified, high confidence | `<a href='https://v.redd.it/c8cfmpcrlg061'>4</a>` | Pretty sure its a Slipper Lobster. | [Florida] |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 2020-11-27 07:00:16 | k1vk0j | What is this "rare elongated fish" caught off ... | Identified, high confidence | `<a href='https://i.redd.it/8nxx9mfb3q161.jpg'>...` | red cornetfish (*Fistularia petimba*) | [Japan] |
| 96 | 2020-11-13 06:34:11 | jtbhpj | Thought it was a baby lingcod but it doesn’t h... | Identified, high confidence | `<a href='https://www.reddit.com/gallery/jtbhpj...` | Pacific staghorn sculpin *Leptocottus armatus*... | [Caught, Newport Beach Ca, Need ID] |
| 97 | 2020-09-28 02:08:47 | j11sg7 | Most beautiful sunfish I caught | Identified, high confidence | `<a href='https://www.reddit.com/gallery/j11sg7...` | I think the first one is a Readbreast Sunfish.... | [] |
| 98 | 2020-08-30 18:24:23 | ijffix | [North Myrtle Beach, SC] I was fishing along t... | | `<a href='https://i.imgur.com/CVqCZsx.jpg'>98</a>` | Looks like Barracuda jaws | [North Myrtle Beach, SC, Intracoastal Waterway] |
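
Several of the top comments above give the scientific name in Markdown italics, for example `*Fistularia petimba*`. As a rough sketch, a regular expression can pull out such binomials so they can be checked against a taxonomy later. `extract_scientific_names` is a hypothetical helper, and the pattern assumes a capitalised genus followed by a lower-case species epithet between asterisks, so it will miss names written without italics.

```python
import re

# *Genus species*: capitalised genus, lower-case epithet, wrapped in asterisks
BINOMIAL = re.compile(r"\*([A-Z][a-z]+ [a-z]+)\*")

def extract_scientific_names(comment):
    """Return every italicised 'Genus species' pair found in a comment."""
    return BINOMIAL.findall(comment)

print(extract_scientific_names("red cornetfish (*Fistularia petimba*)"))
# ['Fistularia petimba']
print(extract_scientific_names("Common pleco. Non-native species in Florida."))
# []
```

Applied to the table, `df["Top Comment"].apply(extract_scientific_names)` would give one list of candidate names per post.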
