# Exercise: Reddit Data Ingestion

For this exercise, you are going to ingest Reddit RSS feeds into a PostgreSQL database structure of your design.

This exercise draws on the reading in this module, but relies more heavily on the prior module and on earlier coursework from the Python and DB/SQL boot camps.


### From the site:

reddit: the front page of the internet  
https://www.reddit.com/  
Reddit gives you the best of the internet in one place. Get a constantly updating feed of breaking news, fun stories, pics, memes, and videos just for you.


### From Wikipedia:
Reddit is an American social news aggregation, web content rating, and discussion website. 
Registered members submit content to the site such as links, text posts, and images, 
which are then voted up or down by other members. 
Posts are organized by subject into user-created boards called "subreddits", 
which cover a variety of topics including news, science, movies, video games, music, books, fitness, food, and image-sharing. 
Submissions with more up-votes appear towards the top of their subreddit and, if they receive enough votes, ultimately on the site's front page. 



#### Sample Posting:

The link below is an example discussion started by someone asking for opinions on MySQL vs. NoSQL.  

**Spoiler:** The conclusion was PostgreSQL, leveled with a healthy dose of sarcasm that was lost on the original poster.

https://www.reddit.com/r/nosql/comments/8ckkzg/should_i_use_nosql/



### From: https://www.redditinc.com/
![REDDIT_About.png MISSING](../images/REDDIT_About.png)

---

### Really Simple Syndication (RSS)

AKA: Rich Site Summary; [RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework) Site Summary

**From Wikipedia**  
RSS is a type of web feed which allows users to access updates to online content in a **standardized, computer-readable format**. 
These feeds can, for example, allow a user to keep track of many different websites in a single news aggregator. 
The news aggregator will automatically check the RSS feed for new content, allowing the content to be automatically passed from website to website or from website to user. 
This passing of content is called web syndication. 
Websites usually use RSS feeds to publish frequently updated information, such as blog entries, news headlines, audio, video. An RSS document (called "feed", "web feed", or "channel") includes full or summarized text, and metadata, like publishing date and author's name.

#### Reddit supports RSS access to their community of posts.

Click the link below to see the new Reddit posts feed in raw form (XML)
 * https://www.reddit.com/new/.rss?sort=new
 
This is an example of a sub-reddit RSS feed:
 * https://www.reddit.com/r/datascience/.rss?sort=new
 
**In both cases the pattern is the same: after the URL's last slash, "/", we add**  
  
`.rss?sort=new`
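
The pattern can be captured in a small helper. This is a minimal sketch: the function name `subreddit_rss_url` is hypothetical and not part of any Reddit or FeedParser API.

```python
def subreddit_rss_url(subreddit=None, sort="new"):
    """Build a Reddit RSS URL following the pattern described above.

    With no subreddit, return the site-wide "new" feed.
    """
    base = "https://www.reddit.com"
    if subreddit is None:
        return f"{base}/new/.rss?sort={sort}"
    # Accept "datascience", "r/datascience", or "/r/datascience/"
    name = subreddit.strip("/")
    if name.startswith("r/"):
        name = name[2:]
    return f"{base}/r/{name}/.rss?sort={sort}"


print(subreddit_rss_url())               # https://www.reddit.com/new/.rss?sort=new
print(subreddit_rss_url("datascience"))  # https://www.reddit.com/r/datascience/.rss?sort=new
```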


**The wall of character data you see should scream: _Parse Me with Python!_**

#### RSS Feed Content

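The RSS Feed is structured as a set of items, which can roughly be expected to follow the structure below. 
Note: the XML has been parsed (here by FeedParser) and rendered as a Python dictionary, so it is JSON-like rather than strict JSON (note `True`, `None`, and `time.struct_time`).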


---

```JSON
{
	'guidislink': True, 
	'author_detail': {
		'href': 'https://www.reddit.com/user/cryoskyd', 
		'name': '/u/cryoskyd'
		}, 
	'links': [
		{
			'rel': 'alternate', 
			'type': 'text/html', 
			'href': 'https://www.reddit.com/r/SkydTech/comments/8r47kj/dealmaster_get_a_15inch_dell_laptop_with_an/'
		}], 
	'href': 'https://www.reddit.com/user/cryoskyd', 
	'updated_parsed': time.struct_time(tm_year=2018, tm_mon=6, tm_mday=14, tm_hour=18, tm_min=38, tm_sec=25, tm_wday=3, tm_yday=165, tm_isdst=0), 
	'authors': [
		{'href': 'https://www.reddit.com/user/cryoskyd', 
		'name': '/u/cryoskyd'
		}
		], 
	'tags': [
		{'label': 'r/SkydTech', 
		'scheme': None, 
		'term': 'SkydTech'
		}], 
	'title_detail': {
		'type': 'text/plain', 
		'base': 'https://www.reddit.com/new/.rss?sort=new', 
		'value': 'Dealmaster: Get a 15-inch Dell laptop with an 8th-gen Core i7 for $580', 
		'language': None
		}, 
	'summary': '<table> <tr><td> <a href="https://www.reddit.com/r/SkydTech/comments/8r47kj/dealmaster_get_a_15inch_dell_laptop_with_an/"> <img src="https://b.thumbs.redditmedia.com/ioyXj08RjCyhRbNWiPQfcjrQMlHTG4Ec-LrYJ6MB0kI.jpg" alt="Dealmaster: Get a 15-inch Dell laptop with an 8th-gen Core i7 for $580" title="Dealmaster: Get a 15-inch Dell laptop with an 8th-gen Core i7 for $580" /> </a> </td><td> &#32; submitted by &#32; <a href="https://www.reddit.com/user/cryoskyd"> /u/cryoskyd </a> &#32; to &#32; <a href="https://www.reddit.com/r/SkydTech/"> r/SkydTech </a> <br/> <span><a href="https://arstechnica.com/staff/2018/06/dealmaster-get-a-15-inch-dell-laptop-with-an-8th-gen-core-i7-for-580/">[link]</a></span> &#32; <span><a href="https://www.reddit.com/r/SkydTech/comments/8r47kj/dealmaster_get_a_15inch_dell_laptop_with_an/">[comments]</a></span> </td></tr></table>', 
	'id': 'https://www.reddit.com/new/t3_8r47kj', 
	'updated': '2018-06-14T18:38:25+00:00', 
	'content': [
		{
			'type': 'text/html', 
			'base': 'https://www.reddit.com/new/.rss?sort=new', 
			'value': '<table> <tr><td> <a href="https://www.reddit.com/r/SkydTech/comments/8r47kj/dealmaster_get_a_15inch_dell_laptop_with_an/"> <img src="https://b.thumbs.redditmedia.com/ioyXj08RjCyhRbNWiPQfcjrQMlHTG4Ec-LrYJ6MB0kI.jpg" alt="Dealmaster: Get a 15-inch Dell laptop with an 8th-gen Core i7 for $580" title="Dealmaster: Get a 15-inch Dell laptop with an 8th-gen Core i7 for $580" /> </a> </td><td> &#32; submitted by &#32; <a href="https://www.reddit.com/user/cryoskyd"> /u/cryoskyd </a> &#32; to &#32; <a href="https://www.reddit.com/r/SkydTech/"> r/SkydTech </a> <br/> <span><a href="https://arstechnica.com/staff/2018/06/dealmaster-get-a-15-inch-dell-laptop-with-an-8th-gen-core-i7-for-580/">[link]</a></span> &#32; <span><a href="https://www.reddit.com/r/SkydTech/comments/8r47kj/dealmaster_get_a_15inch_dell_laptop_with_an/">[comments]</a></span> </td></tr></table>', 
			'language': None
		}], 
	'link': 'https://www.reddit.com/r/SkydTech/comments/8r47kj/dealmaster_get_a_15inch_dell_laptop_with_an/', 
	'author': '/u/cryoskyd', 
	'title': 'Dealmaster: Get a 15-inch Dell laptop with an 8th-gen Core i7 for $580'
}
```
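
To make the structure concrete, the sketch below pulls the fields most useful for a database row out of an item shaped like the one above. The `item` dictionary is a trimmed copy of that sample, and `item_to_row` is a hypothetical helper name chosen for illustration.

```python
def item_to_row(item):
    """Flatten one feed item into the fields a database row might need."""
    return {
        "id": item["id"],
        "title": item["title"],
        "link": item["link"],
        "author": item["author"],
        "updated": item["updated"],  # ISO 8601 timestamp string
        # An item may carry zero or more tags; keep only the short term
        "tags": [tag["term"] for tag in item.get("tags", [])],
    }


# Trimmed copy of the sample item shown above
item = {
    "id": "https://www.reddit.com/new/t3_8r47kj",
    "title": "Dealmaster: Get a 15-inch Dell laptop with an 8th-gen Core i7 for $580",
    "link": "https://www.reddit.com/r/SkydTech/comments/8r47kj/dealmaster_get_a_15inch_dell_laptop_with_an/",
    "author": "/u/cryoskyd",
    "updated": "2018-06-14T18:38:25+00:00",
    "tags": [{"label": "r/SkydTech", "scheme": None, "term": "SkydTech"}],
}

row = item_to_row(item)
print(row["author"], row["tags"])
```

Note that the `id` in the sample embeds the feed path (`/new/`), so `link` may be the safer choice when you need a key that identifies a post regardless of which feed it came from.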

---

To process this data we are going to use some key Python Libraries:
 * FeedParser
   * https://pypi.org/project/feedparser/
   * http://www.pythonforbeginners.com/feedparser/using-feedparser-in-python
 * BeautifulSoup
   * https://www.crummy.com/software/BeautifulSoup/


### Example Code:

The example code below grabs the newest Reddit posts from the RSS feed, then prints the first one.

In [1]:
import feedparser

# Define URL of the RSS Feed I want
a_reddit_rss_url = 'http://www.reddit.com/new/.rss?sort=new'

feed = feedparser.parse( a_reddit_rss_url )

if (feed['bozo'] == 1):
    print("Error Reading/Parsing Feed XML Data")    
else:
    for item in feed[ "items" ]:
        print(item)
        break

{'content': [{'base': 'https://www.reddit.com/new/.rss?sort=new', 'value': '<!-- SC_OFF --><div class="md"><p>Clean, fit, polite, tall, bearded, goofy are some words that I think describe me.</p> <p>I&#39;m in town on business (Burbank now, DTLA until Sunday) and can&#39;t get the idea of a casual meetup out of my head.</p> <p>Heading out for a nightcap but I&#39;ll most likely be up for a couple of hours...</p> </div><!-- SC_ON --> &#32; submitted by &#32; <a href="https://www.reddit.com/user/travelplay_throwaway"> /u/travelplay_throwaway </a> &#32; to &#32; <a href="https://www.reddit.com/r/RandomActsOfBlowJob/"> r/RandomActsOfBlowJob </a> <br/> <span><a href="https://www.reddit.com/r/RandomActsOfBlowJob/comments/9wx2p0/burbank_m4f_in_town_on_business_hoping_for_a_cure/">[link]</a></span> &#32; <span><a href="https://www.reddit.com/r/RandomActsOfBlowJob/comments/9wx2p0/burbank_m4f_in_town_on_business_hoping_for_a_cure/">[comments]</a></span>', 'language': None, 'type': 'text/html'}],

### Sub-Reddits

As described above, sub-reddits are communities organized around particular topics.

Some example sub-reddits:
 * https://www.reddit.com/r/steak/
 * https://www.reddit.com/r/datascience/
 * https://www.reddit.com/r/MachineLearning/
 * https://www.reddit.com/r/deeplearning/
 * https://www.reddit.com/r/Python/
 * https://www.reddit.com/r/Databases/
 * https://www.reddit.com/r/NoSQL/


# Exercise Tasks
 1. Review the data in an RSS item.
 1. Conceptualize a database design that can collect the data.
    * Make sure you capture at least Author, Tags, Title, Link
    * Make sure your items (posts) are unique and not duplicated!
 1. Implement the database in your PostgreSQL schema.
 1. Implement a cell of Python code that collects the latest posts from the front page (/new/) and 5-10 sub-reddits (r/.../), then inserts the data into your database.
 1. After you have loaded a few hundred posts (items) from the RSS feeds, write an **interesting query** that requires a join across two or more of your tables.
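
Collecting from the front page and several sub-reddits means the same post can arrive more than once. The sketch below shows one possible way to loop over several feeds and keep each post only once, keyed on its `link`. The name `collect_items` is hypothetical, and `fake_fetch` is a stand-in with made-up items so the logic can run without network access; in the notebook you would pass `feedparser.parse` as `fetch` instead.

```python
def collect_items(urls, fetch):
    """Fetch every feed in `urls` and return items de-duplicated by link.

    `fetch` is any callable that takes a URL and returns a parsed feed
    (for example feedparser.parse).
    """
    seen = {}
    for url in urls:
        feed = fetch(url)
        if feed.get("bozo"):
            print(f"Skipping unreadable feed: {url}")
            continue
        for item in feed.get("items", []):
            # Keep the first copy of each post; later duplicates are ignored
            seen.setdefault(item["link"], item)
    return list(seen.values())


# Stand-in for feedparser.parse, returning made-up items for illustration
def fake_fetch(url):
    if "broken" in url:
        return {"bozo": 1, "items": []}
    return {"bozo": 0, "items": [{"link": "post-a", "title": "A"},
                                 {"link": "post-b", "title": "B"}]}


items = collect_items(["feed-1", "feed-2", "broken-feed"], fake_fetch)
print(len(items))  # 2
```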
 
 

#### Sample code to extract some data from the feed items

In [2]:
import feedparser
from bs4 import BeautifulSoup
from bs4.element import Comment

# Functions from: https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)  # find_all/string replace the older findAll/text spelling
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

# Define URL of the RSS Feed I want
a_reddit_rss_url = 'http://www.reddit.com/new/.rss?sort=new'

feed = feedparser.parse( a_reddit_rss_url )

if (feed['bozo'] == 1):
    print("Error Reading/Parsing Feed XML Data")    
else:
    for item in feed[ "items" ]:
        dttm = item[ "date" ]
        title = item[ "title" ]
        summary_text = text_from_html(item[ "summary" ])
        link = item[ "link" ]

        
        print("====================")
        print("Title: {} ({})\nTimestamp: {}".format(title,link,dttm))
        print("--------------------\nSummary:\n{}".format(summary_text))
        

Title: DMT Level Up [TF] 1:15 | [CS] 1:15 | [PUBG] 1:14 | [GEM] 1:350 (https://www.reddit.com/r/SteamTradingCards/comments/9wx2qo/dmt_level_up_tf_115_cs_115_pubg_114_gem_1350/)
Timestamp: 2018-11-14T05:15:57+00:00
--------------------
Summary:
https://steamcommunity.com/id/DMTlevelup/  /u/Bus_Driver_359 r/SteamTradingCards [link] [comments]
Title: Poodseipie (https://www.reddit.com/r/PewdiepieSubmissions/comments/9wx2qj/poodseipie/)
Timestamp: 2018-11-14T05:15:57+00:00
--------------------
Summary:
     submitted by /u/legendaryfreaks to r/PewdiepieSubmissions   [link]  [comments] 
Title: Looking for simpler split-screen multiplayer/couch co-op games to play with my dad. (PC/Switch) (https://www.reddit.com/r/gamingsuggestions/comments/9wx2qh/looking_for_simpler_splitscreen_multiplayercouch/)
Timestamp: 2018-11-14T05:15:56+00:00
--------------------
Summary:
Hi all. As the title says, I'm interested in multiplayer games that can be played couch co-op or split screen multiplayer. My dad 

## Task 1: Annotate your Entities and their attributes


## Task 2: Implement the database in your PostgreSQL schema


In [3]:
import getpass
# This collects a masked password from the user
mypasswd = getpass.getpass()

········


In [4]:
import psycopg2
import numpy as np
from psycopg2.extensions import adapt, register_adapter, AsIs

# Then connects to the DB
connection = psycopg2.connect(database = 'dsa_student', 
                              user = 'garwoode', 
                              host = 'dbase.dsa.missouri.edu',
                              password = mypasswd)

In [5]:
# Then delete the variable holding the password
# (this drops the name; it does not guarantee the bytes are wiped from memory)
del mypasswd

In [6]:
CREATE_TABLES = """
DROP TABLE IF EXISTS garwoode.articles cascade;
CREATE TABLE garwoode.articles (
title VARCHAR(10000) PRIMARY KEY,
link VARCHAR(255),
author VARCHAR(255)
);


DROP TABLE IF EXISTS garwoode.tags cascade;
CREATE TABLE garwoode.tags (
tag VARCHAR(255) PRIMARY KEY,
tag_link VARCHAR(255)
);

DROP TABLE IF EXISTS garwoode.article_tag cascade;
CREATE TABLE garwoode.article_tag (
title VARCHAR(10000),
tag VARCHAR(255),
PRIMARY KEY (title, tag),
FOREIGN KEY (title)
    REFERENCES garwoode.articles(title),
FOREIGN KEY (tag)
    REFERENCES garwoode.tags(tag)
);

"""
with connection, connection.cursor() as cursor:
    cursor.execute(CREATE_TABLES)

## Task 3: Python Code to collect RSS data from Reddit


In [7]:

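## Your answer in this cell
## ------------------------

import feedparser
import pandas as pd

a_reddit_rss_url = 'https://www.reddit.com/new/.rss?sort=new'

feed = feedparser.parse( a_reddit_rss_url )

titles = []
links = []
authors = []
article_tag_rows = []   # one (title, tag) pair per tag on each post
tag_rows = []           # one (tag, tag_link) pair per tag seen

if (feed['bozo'] == 1):
    print("Error Reading/Parsing Feed XML Data")    
else:
    for item in feed[ "items" ]:
        titles.append(item[ "title" ])
        links.append(item[ "link" ])
        # Strip the leading "/u/" from the author name
        authors.append(item.get("author", "")[3:])
        # A post may carry zero or more tags; record every one of them
        for tag in item.get("tags", []):
            article_tag_rows.append((item[ "title" ], tag['term']))
            tag_rows.append((tag['term'], 'https://www.reddit.com/r/' + tag['term']))

# Column names match the tables created in Task 2
articles = pd.DataFrame({'title' : titles, 'author' : authors, 'link' : links})
articles = articles.drop_duplicates(subset='title')   # title is the primary key
article_tag = pd.DataFrame(article_tag_rows, columns=['title', 'tag'])
article_tag = article_tag.drop_duplicates()
tagss = pd.DataFrame(tag_rows, columns=['tag', 'tag_link'])
tagss = tagss.drop_duplicates()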


In [8]:
INSERT_SQL = 'INSERT INTO garwoode.articles '
INSERT_SQL += ' (author,link,title) VALUES '
# this is a parameterized string for SQL, the %s are placeholders
# this prevents SQL-Injection attacks on the code
# https://en.wikipedia.org/wiki/SQL_injection
INSERT_SQL += '(%s,%s, %s)'

# Note: The Commit Will Be Automatic after this with clause
with connection, connection.cursor() as cursor:
    # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html
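    # Select columns explicitly so each tuple matches the (author,link,title) order in INSERT_SQL
    for row in articles[['author', 'link', 'title']].itertuples(index=False, name=None):  # pull each row as a tuple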

        # Insert the row
        #print(row)
        cursor.execute(INSERT_SQL,row)


In [9]:
INSERT_SQL = 'INSERT INTO garwoode.tags '
INSERT_SQL += ' (tag,tag_link) VALUES '
# this is a parameterized string for SQL, the %s are placeholders
# this prevents SQL-Injection attacks on the code
# https://en.wikipedia.org/wiki/SQL_injection
INSERT_SQL += '(%s,%s)'

# Note: The Commit Will Be Automatic after this with clause
with connection, connection.cursor() as cursor:
    # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html
    for row in tagss.itertuples(index=False, name=None):  # pull each row as a tuple

        # Insert the row
        #print(row)
        cursor.execute(INSERT_SQL,row)


In [22]:
INSERT_SQL = 'INSERT INTO garwoode.article_tag '
INSERT_SQL += ' (tag, title) VALUES '
# this is a parameterized string for SQL, the %s are placeholders
# this prevents SQL-Injection attacks on the code
# https://en.wikipedia.org/wiki/SQL_injection
INSERT_SQL += '(%s,%s)'

# Note: The Commit Will Be Automatic after this with clause
with connection, connection.cursor() as cursor:
    # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html
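    # Select columns explicitly so each tuple matches the (tag, title) order in INSERT_SQL
    for row in article_tag[['tag', 'title']].itertuples(index=False, name=None):  # pull each row as a tuple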

        # Insert the row
        #print(row)
        cursor.execute(INSERT_SQL,row)
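
Re-running the insert cells above raises a duplicate-key error as soon as a post or tag is already stored, because each table has a primary key. One common way to make the load repeatable in PostgreSQL is `ON CONFLICT DO NOTHING`. The sketch below only builds the statement text; `build_insert` is a hypothetical helper name. Table and column names must be trusted constants, never user input, because identifiers cannot be passed as `%s` parameters.

```python
import re

# Allow plain or schema-qualified identifiers such as "garwoode.articles"
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_insert(table, columns):
    """Return a parameterized INSERT that silently skips duplicate keys."""
    for name in [table, *columns]:
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return (f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
            "ON CONFLICT DO NOTHING")


sql = build_insert("garwoode.articles", ["author", "link", "title"])
print(sql)
```

The resulting string can be passed to `cursor.execute(sql, row)` exactly like `INSERT_SQL` in the cells above; the values still travel as parameters.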

## Task 4: Your interesting query of your data
**Feel free to add additional cells and write extra queries**
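
As a starting point, here is one possible shape for a join across all three tables: posts and distinct authors per tag. This is a sketch under stated assumptions. It uses an in-memory SQLite database purely as a stand-in for the PostgreSQL schema so it runs without credentials, and the rows are made up for illustration. Against PostgreSQL, the same `SELECT` should work with the table names prefixed by your schema.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL so the sketch runs anywhere
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (title TEXT PRIMARY KEY, link TEXT, author TEXT);
CREATE TABLE tags (tag TEXT PRIMARY KEY, tag_link TEXT);
CREATE TABLE article_tag (
    title TEXT REFERENCES articles(title),
    tag   TEXT REFERENCES tags(tag),
    PRIMARY KEY (title, tag)
);
""")

# Made-up rows, for illustration only
conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", [
    ("Post A", "https://example.com/a", "alice"),
    ("Post B", "https://example.com/b", "bob"),
    ("Post C", "https://example.com/c", "alice"),
])
conn.executemany("INSERT INTO tags VALUES (?, ?)", [
    ("Python", "https://www.reddit.com/r/Python"),
    ("NoSQL", "https://www.reddit.com/r/NoSQL"),
])
conn.executemany("INSERT INTO article_tag VALUES (?, ?)", [
    ("Post A", "Python"), ("Post B", "Python"), ("Post C", "NoSQL"),
])

# Posts and distinct authors per tag, joining all three tables
QUERY = """
SELECT t.tag, COUNT(*) AS posts, COUNT(DISTINCT a.author) AS authors
FROM article_tag AS atag
JOIN articles AS a ON a.title = atag.title
JOIN tags AS t ON t.tag = atag.tag
GROUP BY t.tag
ORDER BY posts DESC, t.tag
"""
results = conn.execute(QUERY).fetchall()
print(results)  # [('Python', 2, 2), ('NoSQL', 1, 1)]
```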



# Save your notebook, then `File > Close and Halt`

---