# Natural Observer

## A project to collect the thousands of observations of the natural world posted on Reddit (and maybe eventually other social media). Photos, identifications, and any location information are collated into a dataset usable by citizen science networks such as eBird and iNaturalist. We hope... eventually...

### Authors: Lindsey Parkinson, Thomas Oliver, and Roman Grisch 

This notebook uses PRAW, the Python Reddit API Wrapper. You must have a Reddit account to use the notebook. I keep my credentials in a separate JSON file for anonymity. You can add your own credentials below. 

In [1]:
import os
import praw
import json
import pandas as pd
import numpy as np
from collections import defaultdict
import re
import datetime

#redditkeys.json contains all the information necessary to use the Reddit API
working_directory = os.getcwd()
file_path = os.path.join(working_directory, 'redditkeys.json')

with open(file_path) as infile:
    credentials = json.load(infile)
reddit = praw.Reddit(client_id = credentials["client_id"],
                     client_secret = credentials["client_secret"],
                     user_agent=credentials["user_agent"],
                     username=credentials["username"],
                     password=credentials["password"])

Version 7.1.0 of praw is outdated. Version 7.3.0 was released Thursday June 17, 2021.


In [2]:
#check to ensure it is associated with your Reddit account:
#print(reddit.user.me())
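
The first cell reads five keys from redditkeys.json. A minimal sketch of the layout that file is assumed to have, with a small check that fails early when a key is missing. The helper name `check_credentials` and the placeholder values are hypothetical, not part of PRAW.

```python
import json

# The five keys that the praw.Reddit(...) call in this notebook reads
REQUIRED_KEYS = ("client_id", "client_secret", "user_agent", "username", "password")


def check_credentials(credentials):
    """Return the credentials unchanged, or raise if a required key is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise ValueError("redditkeys.json is missing: " + ", ".join(missing))
    return credentials


# Placeholder values only; swap in the details of your own Reddit app and account
example = {key: "YOUR_" + key.upper() for key in REQUIRED_KEYS}
example_json = json.dumps(example, indent=4)
print(example_json)  # the layout redditkeys.json is assumed to have
```

Calling `check_credentials(credentials)` right after `json.load(infile)` would surface a typo in the file before PRAW tries to log in.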

### Scraping

There are many r/whatisthis___ or r/whatsthis___ subreddits used for plant, animal, and fungus identification. Here we use r/whatsthisfish as an example, though multiple subreddits can be added to subreddit_list below. 

Other subreddits include:
r/whatisthisfish,
r/whatsthisbug,
r/whatsthisbird,
r/whatsthissnake

The subreddits above follow a standard protocol enforced by the moderators, which makes scraping novel observations easier. However, the following subreddits may also be worth considering:
r/slimemolds,
r/whatsthisplant,
r/animalid,
r/PlantIdentification,
r/treeidentification

In [3]:
date_list = []
#author_list = []
id_list = []
link_flair_text_list = []
title_list = []
url_list = []
top_comment_list = []


#subreddits we want to scrape information from
subreddit_list= ['whatsthisfish']

#What information we want from each subreddit post
for subred in subreddit_list:
    subreddit = reddit.subreddit(subred)
    top_post = subreddit.top(limit = 100)  #how many posts from the subreddit we want to pull
    
    for sub in top_post:        
        date_list.append(datetime.datetime.fromtimestamp(sub.created_utc))
        #author_list.append(sub.author)
        id_list.append(sub.id)        
        link_flair_text_list.append(sub.link_flair_text)
        title_list.append(sub.title)
        url_list.append(sub.url)
        
    print(subred, 'completed; ', end='')
    print('total', len(title_list), 'posts scraped')

whatsthisfish completed; total 100 posts scraped
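
A note on the Date column: `created_utc` is a Unix timestamp, and `datetime.datetime.fromtimestamp` without a `tz` argument converts it to the local time of whatever machine runs the notebook. If UTC dates are wanted instead, a sketch along these lines might do (the helper name `to_utc_datetime` is hypothetical):

```python
from datetime import datetime, timezone


def to_utc_datetime(created_utc):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


stamp = to_utc_datetime(0)
print(stamp.isoformat())  # 1970-01-01T00:00:00+00:00
```

Timezone-aware values also make it harder to mix up local and UTC times later when comparing posts.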


In [4]:
df = pd.DataFrame({'Date': date_list,
                   'ID':id_list, 
                   #'Author':author_list, 
                   'Title':title_list,
                   'Flair':link_flair_text_list,
                   'URL':url_list
                  })

### Formatting URLs

In [5]:
def convert(row, col = "URL"):
    """
    This function will convert strings into hyperlinks readable when exported into csv or pdf. 
    Should make it easier to pull images
    """
    return "<a href='{}'>{}</a>".format(row[col], row.name)

In [6]:
df['URL'] = df.apply(convert, axis = 1)
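
Note that `convert` overwrites the raw URL with markup, and a URL containing `&` or quotes could break the resulting tag. A small sketch of an escaped variant, using only the standard library; `make_link` is a hypothetical helper, not part of pandas:

```python
from html import escape


def make_link(url, label):
    """Build an HTML anchor, escaping characters that would break the markup."""
    return "<a href='{}'>{}</a>".format(escape(url, quote=True), escape(str(label)))


link = make_link("https://i.redd.it/bybemwp61dv51.jpg", 0)
print(link)  # <a href='https://i.redd.it/bybemwp61dv51.jpg'>0</a>
```

One option would be to write the result to a separate column (say, a hypothetical `df['Link']`) so the plain URL stays available for downloading images.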

### Formatting top comment
This code collects the top-level comments of each post and then keeps the first one. Our hope is that this comment contains the correct identification, because participants are supposed to upvote the answers they agree with. 

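I added the if/else statement below to try to deal with posts that don't seem to have comments. Honestly, as first written it didn't work: for some subreddits the comment column ended up 1 or 2 rows shorter than the DataFrame. The most likely cause is that a post with no comments never enters the inner loop, so it gets no entry at all, and that keying by title merges posts that share a title. The version below keys by post ID and records "NA" explicitly.

In [7]:
comments = defaultdict(list)

for ID in id_list:
    submission = reddit.submission(str(ID))
    # Keep only objects that have a body (skips "load more comments" placeholders)
    bodies = [c.body for c in submission.comments if hasattr(c, "body")]
    # A post without comments still gets an entry, so the rows stay aligned
    comments[ID] = bodies if bodies else ["NA"]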

In [8]:
top_comment = []
    
for key, val in comments.items():
    # An empty list means no comments were collected for this post
    if val:
        top_comment.append(val[0])
    else:
        top_comment.append("NA")
        

In [9]:
df["Top Comment"] = top_comment
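
Assigning `top_comment` as a column only works when the list has exactly one entry per row, in row order. A minimal sketch of one way to guarantee that, assuming the comments are stored in a dict keyed by post ID. The helper name, IDs, and comment text here are made up for illustration.

```python
def first_comments(post_ids, comments_by_id, placeholder="NA"):
    """Return one entry per post ID, in order, using a placeholder for posts without comments."""
    top = []
    for post_id in post_ids:
        bodies = comments_by_id.get(post_id, [])
        top.append(bodies[0] if bodies else placeholder)
    return top


ids = ["a1", "b2", "c3"]  # made-up post IDs
scraped = {"a1": ["Florida pompano", "Nice catch"], "c3": ["Common pleco"]}
print(first_comments(ids, scraped))  # ['Florida pompano', 'NA', 'Common pleco']
```

Because the loop walks the ID list rather than the dict, the output always has the same length as the ID list, whatever the dict happens to contain.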

### Extracting location
The location of the observation should be written within the title of the post. In the following code chunks we use the nltk package to tokenize the post titles and attempt to extract location words. 

If this is your first time using nltk or this notebook, you may need to remove the # and download the packages below.  

In [10]:
#Import and download NLP tools

import nltk
#nltk.download('punkt')
#nltk.download('maxent_ne_chunker')
#nltk.download('words')
#nltk.download('averaged_perceptron_tagger')

In [11]:
#A function to pull location information from sentence chunks
def extract_entity_names(t):
    entity_names = []

    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))
    return entity_names

In [12]:
titles_list = df["Title"].tolist()
location = []

for item in titles_list:
    sentences = nltk.sent_tokenize(item)
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
    chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)
    
    entities = []  
    for tree in chunked_sentences:
        entities.extend(extract_entity_names(tree))
    location.append(entities)

In [13]:
df["Location"] = location
df

Unnamed: 0,Date,ID,Title,Flair,URL,Top Comment,Location
0,2020-10-26 04:40:05,ji7jvh,Anybody know what species this is? Found in Ja...,"Identified, probably",<a href='https://i.redd.it/bybemwp61dv51.jpg'>...,"Florida pompano, *Trachinotus carolinus* . Th...","[Anybody, Jacksonville, Inshore]"
1,2021-01-21 17:49:37,l2299m,Picture taken in Central Florida. Freshwater p...,"Family known, species unidentified",<a href='https://i.redd.it/ow7f7ozctpc61.jpg'>...,Common pleco. Non-native species in Florida.,"[Picture, Central Florida, Freshwater]"
2,2021-05-13 21:57:12,nbptao,Off-shore Palm Beach at around 70ft. Photograp...,"Identified, probably",<a href='https://i.redd.it/1crpkkcx0yy61.jpg'>...,It would appear to be a [juvenile louvar](http...,"[Palm Beach, Photographer, Michael Patrick]"
3,2020-06-15 00:58:58,h93riw,"Caught in key west, Florida. Never seen anythi...","Identified, high confidence",<a href='https://i.redd.it/j86d87vnhy451.jpg'>...,"Took me awhile, but its a swallow-tailed bass ...","[Caught, Florida]"
4,2020-05-11 17:59:31,ghq9mn,I found this video and the fish is hella cute ...,,<a href='https://v.redd.it/lvximb7tr5y41'>4</a>,Looks like spotted porcupinefish also known as...,[]
...,...,...,...,...,...,...,...
95,2021-02-20 23:35:57,loiigf,"Found: French Beach, Vancouver Island","Family known, species maybe IDed",<a href='https://i.redd.it/x53ciq2impi61.jpg'>...,Some kind of sculpin I believe.\n\nI misread t...,"[French Beach, Vancouver Island]"
96,2020-12-27 23:06:35,klclk8,Twirling fish in the FL Keys,"Identified, high confidence",<a href='https://v.redd.it/ivfqdbqzys761'>96</a>,Looks like a [Tripletail](https://www.fishbase...,[]
97,2020-12-20 15:22:49,kguscz,Found this tiny fish stranded on the ice in ea...,"Family known, species maybe IDed",<a href='https://i.redd.it/ww2sedbzpc661.jpg'>...,Looks like a kind of sculpin (family Cottidae)...,[Norway]
98,2020-12-16 18:00:12,keczcj,I found this tiny blue thing dead on the subst...,"Identified, probably",<a href='https://i.redd.it/9q66hwf9xk561.jpg'>...,I think its some kind of insect larvae. Maybe ...,[]
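
The Location column above picks up capitalized sentence openers such as "Anybody", "Picture", and "Caught" alongside real places. One rough way to trim these, sketched under the assumption that a hand-made stoplist is acceptable; the list and the helper name `filter_locations` are hypothetical and only cover words seen in the rows above:

```python
# Words tagged as entities in the rows above that are clearly not places
NON_PLACE_WORDS = {"anybody", "picture", "caught", "inshore", "freshwater", "photographer"}


def filter_locations(entities, stopwords=NON_PLACE_WORDS):
    """Drop entity strings whose lowercase form is in the stoplist."""
    return [entity for entity in entities if entity.lower() not in stopwords]


print(filter_locations(["Anybody", "Jacksonville", "Inshore"]))  # ['Jacksonville']
print(filter_locations(["Palm Beach", "Photographer", "Michael Patrick"]))  # ['Palm Beach', 'Michael Patrick']
```

A stoplist cannot tell people from places, which is why "Michael Patrick" survives. Running the chunker with `binary=False` returns typed labels such as GPE and PERSON instead of a single NE label, which might be a better basis for filtering.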


In [29]:
import geograpy
#url='https://en.wikipedia.org/wiki/2012_Summer_Olympics_torch_relay'
for title in df['Title']:
    places = geograpy.get_geoPlace_context(text = title)
    print(places)

Downloading /home/lindsey/anaconda3/lib/python3.8/site-packages/geograpy/locs.db.gz from http://wiki.bitplan.com/images/confident/locs.db.gz ... this might take a few seconds
unzipping /home/lindsey/anaconda3/lib/python3.8/site-packages/geograpy/locs.db from /home/lindsey/anaconda3/lib/python3.8/site-packages/geograpy/locs.db.gz
countries=['United States']
regions=[]
cities=['Jacksonville']
other=['Anybody']
countries=[]
regions=[]
cities=[]
other=['Picture']
countries=[]
regions=[]
cities=[]
other=[]
countries=['Argentina', 'Uruguay', 'Brazil', 'Trinidad and Tobago', 'Honduras', 'Costa Rica', 'Colombia', 'Puerto Rico', 'United States']
regions=[]
cities=['Florida']
other=['Caught']
countries=[]
regions=[]
cities=[]
other=[]
countries=[]
regions=[]
cities=[]
other=[]
countries=['Argentina', 'Uruguay', 'Brazil', 'Trinidad and Tobago', 'Honduras', 'Costa Rica', 'Colombia', 'Puerto Rico', 'United States']
regions=[]
cities=['Florida']
other=[]
countries=[]
regions=[]
cities=[]
other=[]
co

In [30]:
from geograpy import extraction

for title in df['Title']:
    e = extraction.Extractor(text = title)
    e.find_geoEntities()
    # You can now access all of the places found by the Extractor
    print(e.places)

['Anybody', 'Jacksonville']
['Picture']
[]
['Caught', 'Florida']
[]
[]
['Florida']
[]
[]
['Singapore']
['Came', 'Brunei']
['Help']
[]
['Etowah']
[]
['Caught', 'Looe']
[]
['Ontario', 'Canada']
[]
['Mytilene', 'Greece']
[]
['Caught', 'Chinook']
['Seems']
[]
['Caught', 'Florida']
['Caught', 'Vermont']
[]
[]
[]
['South']
[]
['Caught', 'Boca Raton']
[]
['Islamorada']
['Louisiana']
[]
['Found']
[]
['Caught']
['Petsmart']
[]
['Caught', 'Florida']
['Hong Kong']
['Crosspost', 'Instagram', 'French']
['Greenland']
[]
['Florida']
['Portugal']
['Caught', 'Tampa']
[]
[]
['South Florida']
['Hi']
[]
['Front']
[]
[]
['Caught']
['France']
['Puget Sound', 'Washington']
[]
['Caught', 'California']
['Florida', 'Mexico']
[]
[]
['China']
[]
[]
['Stuff']
['Fish']
[]
['Cape Town', 'South Africa']
['US']
['North']
['Tokyo']
['Nova']
['North']
[]
[]
[]
['Belgrade', 'Serbia']
['Hi']
['Oskaloosa', 'Florida']
[]
['Caught', 'Sydney', 'Australia']
['Saw']
['Aquarium']
[]
[]
['Hi', 'La Paz', 'Mexico']
[]
['Took', 'Cal