# Exercise: Reddit Data Ingestion

For this exercise, you are going to ingest Reddit RSS feeds into a PostgreSQL database structure of your design.

This exercise draws on some of the reading in this module, but relies more heavily on the prior modules and on prior coursework: Python, earlier parts of this course, and the DB/SQL boot camps.

***Remember to break this down into manageable tasks and not attempt to build the entire database before knowing one of the entities and its attributes.***


### From the site:

reddit: the front page of the internet  
https://www.reddit.com/  
Reddit gives you the best of the internet in one place. Get a constantly updating feed of breaking news, fun stories, pics, memes, and videos just for you.


### From Wikipedia:
Reddit is an American social news aggregation, web content rating, and discussion website. 
Registered members submit content to the site such as links, text posts, and images, 
which are then voted up or down by other members. 
Posts are organized by subject into user-created boards called "subreddits", 
which cover a variety of topics including news, science, movies, video games, music, books, fitness, food, and image-sharing. 
Submissions with more up-votes appear towards the top of their subreddit and, if they receive enough votes, ultimately on the site's front page. 



#### Sample Posting:

The below link is an example discussion that was started based on someone asking for opinions on MySQL vs NoSQL.  

**Spoiler:** The conclusion was PostgreSQL, leveled with a healthy dose of sarcasm that was lost on the original poster.

https://www.reddit.com/r/nosql/comments/8ckkzg/should_i_use_nosql/



### From: https://www.redditinc.com/
![REDDIT_About.png MISSING](../images/REDDIT_About.png)

---

### Really Simple Syndication (RSS)

AKA: Rich Site Summary; [RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework) Site Summary

**From Wikipedia**  
RSS is a type of web feed which allows users to access updates to online content in a **standardized, computer-readable format**. 
These feeds can, for example, allow a user to keep track of many different websites in a single news aggregator. 
The news aggregator will automatically check the RSS feed for new content, allowing the content to be automatically passed from website to website or from website to user. 
This passing of content is called web syndication. 
Websites usually use RSS feeds to publish frequently updated information, such as blog entries, news headlines, audio, video. An RSS document (called "feed", "web feed", or "channel") includes full or summarized text, and metadata, like publishing date and author's name.
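
To make the idea of a "standardized, computer-readable format" concrete, here is a minimal sketch that reads a tiny feed with Python's standard library. The XML below is hand-written for illustration (an Atom-style document with made-up values), not real Reddit output, and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written Atom-style feed (made up for illustration)
FEED_XML = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>example feed</title>
  <entry>
    <title>First post</title>
    <author><name>/u/example_user</name></author>
    <updated>2018-06-14T18:38:25+00:00</updated>
    <link href="https://example.com/first"/>
  </entry>
</feed>"""

# Atom elements live in this XML namespace, so lookups must be namespace-aware
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

root = ET.fromstring(FEED_XML)
entries = [
    {
        "title": entry.findtext("atom:title", namespaces=ATOM),
        "author": entry.findtext("atom:author/atom:name", namespaces=ATOM),
        "updated": entry.findtext("atom:updated", namespaces=ATOM),
        "link": entry.find("atom:link", ATOM).get("href"),
    }
    for entry in root.findall("atom:entry", ATOM)
]
print(entries)
```

Each entry carries the same metadata fields in the same places, which is what lets an aggregator process many feeds with one piece of code.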

#### Reddit supports RSS access to their community of posts.

Click the link below to see the new Reddit posts feed in raw form (XML)
 * https://www.reddit.com/new/.rss?sort=new
 
This is an example of a sub-reddit RSS feed:
 * https://www.reddit.com/r/datascience/.rss?sort=new
 
**In both cases the pattern is the same: after the URL's last slash, "/", we add**  
  
`.rss?sort=new`
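
The URL pattern can be captured in a small helper. This is a minimal sketch: `subreddit_rss_url` is a hypothetical function name, not part of any library, and it only covers the two URL shapes shown above.

```python
def subreddit_rss_url(subreddit=None, sort="new"):
    """Build a Reddit RSS URL following the pattern described above.

    `subreddit_rss_url` is a hypothetical helper name. With no subreddit,
    it targets the site-wide /new/ listing.
    """
    base = "https://www.reddit.com"
    path = f"/r/{subreddit}/" if subreddit else "/new/"
    # Append the RSS suffix after the final slash
    return f"{base}{path}.rss?sort={sort}"


print(subreddit_rss_url())               # site-wide new posts
print(subreddit_rss_url("datascience"))  # a single sub-reddit
```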


**The wall of character data you see should scream: _Parse Me with Python!_**

#### RSS Feed Content

The RSS Feed is structured as a set of items, each of which can roughly be expected to follow the structure below. 
Note: the XML has been parsed into a DOM, then rendered here in a JSON-like form.


---

```JSON
{
	'guidislink': True, 
	'author_detail': {
		'href': 'https://www.reddit.com/user/cryoskyd', 
		'name': '/u/cryoskyd'
		}, 
	'links': [
		{
			'rel': 'alternate', 
			'type': 'text/html', 
			'href': 'https://www.reddit.com/r/SkydTech/comments/8r47kj/dealmaster_get_a_15inch_dell_laptop_with_an/'
		}], 
	'href': 'https://www.reddit.com/user/cryoskyd', 
	'updated_parsed': time.struct_time(tm_year=2018, tm_mon=6, tm_mday=14, tm_hour=18, tm_min=38, tm_sec=25, tm_wday=3, tm_yday=165, tm_isdst=0), 
	'authors': [
		{'href': 'https://www.reddit.com/user/cryoskyd', 
		'name': '/u/cryoskyd'
		}
		], 
	'tags': [
		{'label': 'r/SkydTech', 
		'scheme': None, 
		'term': 'SkydTech'
		}], 
	'title_detail': {
		'type': 'text/plain', 
		'base': 'https://www.reddit.com/new/.rss?sort=new', 
		'value': 'Dealmaster: Get a 15-inch Dell laptop with an 8th-gen Core i7 for $580', 
		'language': None
		}, 
	'summary': '<table> <tr><td> <a href="https://www.reddit.com/r/SkydTech/comments/8r47kj/dealmaster_get_a_15inch_dell_laptop_with_an/"> <img src="https://b.thumbs.redditmedia.com/ioyXj08RjCyhRbNWiPQfcjrQMlHTG4Ec-LrYJ6MB0kI.jpg" alt="Dealmaster: Get a 15-inch Dell laptop with an 8th-gen Core i7 for $580" title="Dealmaster: Get a 15-inch Dell laptop with an 8th-gen Core i7 for $580" /> </a> </td><td> &#32; submitted by &#32; <a href="https://www.reddit.com/user/cryoskyd"> /u/cryoskyd </a> &#32; to &#32; <a href="https://www.reddit.com/r/SkydTech/"> r/SkydTech </a> <br/> <span><a href="https://arstechnica.com/staff/2018/06/dealmaster-get-a-15-inch-dell-laptop-with-an-8th-gen-core-i7-for-580/">[link]</a></span> &#32; <span><a href="https://www.reddit.com/r/SkydTech/comments/8r47kj/dealmaster_get_a_15inch_dell_laptop_with_an/">[comments]</a></span> </td></tr></table>', 
	'id': 'https://www.reddit.com/new/t3_8r47kj', 
	'updated': '2018-06-14T18:38:25+00:00', 
	'content': [
		{
			'type': 'text/html', 
			'base': 'https://www.reddit.com/new/.rss?sort=new', 
			'value': '<table> <tr><td> <a href="https://www.reddit.com/r/SkydTech/comments/8r47kj/dealmaster_get_a_15inch_dell_laptop_with_an/"> <img src="https://b.thumbs.redditmedia.com/ioyXj08RjCyhRbNWiPQfcjrQMlHTG4Ec-LrYJ6MB0kI.jpg" alt="Dealmaster: Get a 15-inch Dell laptop with an 8th-gen Core i7 for $580" title="Dealmaster: Get a 15-inch Dell laptop with an 8th-gen Core i7 for $580" /> </a> </td><td> &#32; submitted by &#32; <a href="https://www.reddit.com/user/cryoskyd"> /u/cryoskyd </a> &#32; to &#32; <a href="https://www.reddit.com/r/SkydTech/"> r/SkydTech </a> <br/> <span><a href="https://arstechnica.com/staff/2018/06/dealmaster-get-a-15-inch-dell-laptop-with-an-8th-gen-core-i7-for-580/">[link]</a></span> &#32; <span><a href="https://www.reddit.com/r/SkydTech/comments/8r47kj/dealmaster_get_a_15inch_dell_laptop_with_an/">[comments]</a></span> </td></tr></table>', 
			'language': None
		}], 
	'link': 'https://www.reddit.com/r/SkydTech/comments/8r47kj/dealmaster_get_a_15inch_dell_laptop_with_an/', 
	'author': '/u/cryoskyd', 
	'title': 'Dealmaster: Get a 15-inch Dell laptop with an 8th-gen Core i7 for $580'
}
```

---
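
Before designing tables, it can help to reduce an item to just the fields you plan to store. The sketch below works on a plain dictionary trimmed from the sample item above; `flatten_item` is a hypothetical helper name, and treating the tail of `id` as a unique post key is an assumption based on that one sample.

```python
from datetime import datetime


def flatten_item(item):
    """Pull the fields of interest out of one parsed feed item (a dict-like object).

    `flatten_item` is a hypothetical helper; the keys mirror the sample item above.
    """
    # The sample 'id' ends in 't3_8r47kj', which looks like a per-post identifier
    post_id = item["id"].rsplit("/", 1)[-1]
    return {
        "post_id": post_id,
        "author": item["author"],
        "tags": [tag["term"] for tag in item.get("tags", [])],
        "title": item["title"],
        "link": item["link"],
        # ISO 8601 string -> timezone-aware datetime
        "updated": datetime.fromisoformat(item["updated"]),
    }


# Trimmed copy of the sample item shown above
sample = {
    "id": "https://www.reddit.com/new/t3_8r47kj",
    "author": "/u/cryoskyd",
    "tags": [{"label": "r/SkydTech", "scheme": None, "term": "SkydTech"}],
    "title": "Dealmaster: Get a 15-inch Dell laptop with an 8th-gen Core i7 for $580",
    "link": "https://www.reddit.com/r/SkydTech/comments/8r47kj/dealmaster_get_a_15inch_dell_laptop_with_an/",
    "updated": "2018-06-14T18:38:25+00:00",
}

flat = flatten_item(sample)
print(flat)
```

Note that `tags` is a list, which hints that tags may belong in their own table rather than in a single column of the post table.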

To process this data we are going to use some key Python Libraries:
 * FeedParser
   * https://pypi.org/project/feedparser/
   * http://www.pythonforbeginners.com/feedparser/using-feedparser-in-python
 * BeautifulSoup
   * https://www.crummy.com/software/BeautifulSoup/


### Example Code:

The example code below grabs the new reddit posts from the RSS feed, then prints the first one.

In [1]:
import feedparser

# Define URL of the RSS Feed I want
a_reddit_rss_url = 'https://www.reddit.com/new/.rss?sort=new'

feed = feedparser.parse( a_reddit_rss_url )

if (feed['bozo'] == 1):
    print("Error Reading/Parsing Feed XML Data")    
else:
    for item in feed[ "items" ]:
        print(item)
        break

{'authors': [{'href': 'https://www.reddit.com/user/iptvglobal', 'name': '/u/iptvglobal'}], 'title': 'BUYIP-TV.COM Offer iptv trial', 'tags': [{'label': 'r/myIPTV', 'term': 'myIPTV', 'scheme': None}], 'guidislink': True, 'summary': '<!-- SC_OFF --><div class="md"><h2>1-Day Risk <a href="https://www.buyip-tv.com/free-trial/">Free trial</a>. No obligation, no credit card required. Simply Request <a href="https://www.buyip-tv.com/free-trial/">free</a> <a href="https://www.buyip-tv.com/free-trial/">IPTV Trial</a> and you’ll be up and running within few min.</h2> <p>Are you still confused about getting an IPTV subscription? Do you want an <a href="https://www.buyip-tv.com/free-trial/"><strong>iptv test</strong></a> of the service before paying a huge amount?<br/> <a href="https://www.buyip-tv.com/free-trial/"><strong>IPTV 24-h test</strong></a> Subscription is your solution to get a full test of the <a href="https://www.buyip-tv.com/free-trial/"><strong>iptv</strong></a> serivce and to be su

### Sub-Reddits

As described above, sub-reddits are communities organized around particular topics.

Some example sub-reddits:
 * https://www.reddit.com/r/steak/
 * https://www.reddit.com/r/datascience/
 * https://www.reddit.com/r/MachineLearning/
 * https://www.reddit.com/r/deeplearning/
 * https://www.reddit.com/r/Python/
 * https://www.reddit.com/r/Databases/
 * https://www.reddit.com/r/NoSQL/


# Exercise Tasks
 1. Review the data in an RSS item.
 1. Conceptualize a database design that can collect the data.
    * Make sure you capture at least Author, Tags, Title, Link
    * Make sure your items (posts) are unique and not duplicated!
 1. Implement the database in your PostgreSQL schema
 1. Implement a cell of Python code that collects the latest posts from the front page (/new/) and 5-10 sub-reddits (r/.../), then inserts the data into your database.
 1. After you have loaded a few hundred posts (items) from the RSS feeds, write an **interesting query** that requires a join across two or more of your tables.
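
One way to think about the uniqueness requirement in the tasks above is to let the database enforce it. The sketch below uses Python's built-in `sqlite3` purely as a stand-in so it runs without a server (it assumes a bundled SQLite of version 3.24 or newer); the table names, column names, and `save_post` helper are illustrative assumptions, not a required design. PostgreSQL accepts the same `ON CONFLICT DO NOTHING` clause, though psycopg2 uses `%s` placeholders instead of `?`.

```python
import sqlite3

# sqlite3 is only a stand-in here; names below are illustrative assumptions
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (
    link   TEXT PRIMARY KEY,   -- the post URL doubles as a natural unique key
    author TEXT NOT NULL,
    title  TEXT NOT NULL
);
CREATE TABLE post_tag (
    link TEXT NOT NULL REFERENCES post(link),
    tag  TEXT NOT NULL,
    PRIMARY KEY (link, tag)
);
""")


def save_post(conn, link, author, title, tags):
    """Insert one post and its tags; re-inserting the same link is a no-op."""
    conn.execute(
        "INSERT INTO post (link, author, title) VALUES (?, ?, ?) ON CONFLICT DO NOTHING",
        (link, author, title),
    )
    conn.executemany(
        "INSERT INTO post_tag (link, tag) VALUES (?, ?) ON CONFLICT DO NOTHING",
        [(link, tag) for tag in tags],
    )


# Saving the same post twice should leave a single row in each table
for _ in range(2):
    save_post(
        conn,
        "https://www.reddit.com/r/SkydTech/comments/8r47kj/dealmaster_get_a_15inch_dell_laptop_with_an/",
        "/u/cryoskyd",
        "Dealmaster: Get a 15-inch Dell laptop with an 8th-gen Core i7 for $580",
        ["SkydTech"],
    )

post_count = conn.execute("SELECT COUNT(*) FROM post").fetchone()[0]
tag_count = conn.execute("SELECT COUNT(*) FROM post_tag").fetchone()[0]
print(post_count, tag_count)
```

Pushing uniqueness into a primary key means a repeated feed fetch cannot create duplicate posts, whatever the Python side does.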
 
 

#### Sample code to extract some data from the feed items

In [2]:
import feedparser
from bs4 import BeautifulSoup
from bs4.element import Comment

# Functions from: https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # find_all(string=True) is the current spelling of the older findAll(text=True)
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)  
    return " ".join(t.strip() for t in visible_texts)

# Define URL of the RSS Feed I want
a_reddit_rss_url = 'https://www.reddit.com/new/.rss?sort=new'

feed = feedparser.parse( a_reddit_rss_url )

if (feed['bozo'] == 1):
    print("Error Reading/Parsing Feed XML Data")    
else:
    for item in feed[ "items" ]:
        dttm = item[ "date" ]
        title = item[ "title" ]
        summary_text = text_from_html(item[ "summary" ])
        link = item[ "link" ]
        
        print("====================")
        print("Title: {} ({})\nTimestamp: {}".format(title,link,dttm))
        print("--------------------\nSummary:\n{}".format(summary_text))
        

Title: ASP.NET MVC with Angular or Vue? (https://www.reddit.com/r/webdev/comments/9vjfm5/aspnet_mvc_with_angular_or_vue/)
Timestamp: 2018-11-09T10:42:41+00:00
--------------------
Summary:
Hello there.  Been working on a C# Asp.NET backend for a while. Any recommendations for a frontend technology? Doesn't have to be either of the two.  Thanks everyone in advance.  /u/ppallo r/webdev [link] [comments]
Title: [CN] The Great Ruler - Chapters 782 - 785 (https://www.reddit.com/r/noveltranslations/comments/9vjfm1/cn_the_great_ruler_chapters_782_785/)
Timestamp: 2018-11-09T10:42:39+00:00
--------------------
Summary:
The Great Ruler | Da Zhu Zai | 大主宰  Author: Tian Can Tu Dou | 天蚕土豆   Chapter 782  Chapter 783  Chapter 784  Chapter 785   Official Synopsis  The Great Thousand World. It is a place where numerous planes intersect, a place where many clans live and a place where a group of lords assemble. The Heavenly Sovereigns appear one by one from the Lower Planes and they will all display a 

## M3:E2:Q1 - Task 1: Annotate your Entities and their attributes


## M3:E2:Q2 -  Task 2: Implement the database in your PostgreSQL schema


## M3:E2:Q3 - Task 3: Python Code to collect RSS data from Reddit


In [220]:
## implement your answer in this cell
## ------------------------
import pandas as pd
a_reddit_rss_url = 'https://www.reddit.com/.rss?sort=new&limit=100'

feed = feedparser.parse( a_reddit_rss_url )
reddit_author = pd.DataFrame(columns=['author','tag'])
reddit_posts = pd.DataFrame(columns=['author','title','link','time','summary'])


if (feed['bozo'] == 1):
    print("Error Reading/Parsing Feed XML Data")    
else:
    for item in feed[ "items" ]:
        author = item["author"]
        # One (author, tag) row per tag; pd.concat replaces DataFrame.append, which pandas 2.0 removed
        tag_rows = pd.DataFrame([{'author': author, 'tag': a[ "label" ]} for a in item["tags"]])
        reddit_author = pd.concat([reddit_author, tag_rows], ignore_index=True)

        post_row = pd.DataFrame([{'author': author, 'title': item[ "title" ], 'link': item[ "link" ], 'time': item[ "date" ], 'summary': text_from_html(item[ "summary" ])}])
        reddit_posts = pd.concat([reddit_posts, post_row], ignore_index=True)
        
        

In [221]:
reddit_author.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
author    100 non-null object
tag       100 non-null object
dtypes: object(2)
memory usage: 1.6+ KB


In [222]:
datascience_url = 'https://www.reddit.com/r/datascience/.rss?sort=new&limit=100'

feed = feedparser.parse( datascience_url )


if (feed['bozo'] == 1):
    print("Error Reading/Parsing Feed XML Data")    
else:
    for item in feed[ "items" ]:
        author = item["author"]
        # One (author, tag) row per tag; pd.concat replaces DataFrame.append, which pandas 2.0 removed
        tag_rows = pd.DataFrame([{'author': author, 'tag': a[ "label" ]} for a in item["tags"]])
        reddit_author = pd.concat([reddit_author, tag_rows], ignore_index=True)

        post_row = pd.DataFrame([{'author': author, 'title': item[ "title" ], 'link': item[ "link" ], 'time': item[ "date" ], 'summary': text_from_html(item[ "summary" ])}])
        reddit_posts = pd.concat([reddit_posts, post_row], ignore_index=True)
        
        
        

In [223]:
reddit_author.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202 entries, 0 to 201
Data columns (total 2 columns):
author    202 non-null object
tag       202 non-null object
dtypes: object(2)
memory usage: 3.2+ KB


In [224]:
machinelearning_url = 'https://www.reddit.com/r/MachineLearning/.rss?sort=new&limit=100'

feed = feedparser.parse( machinelearning_url )


if (feed['bozo'] == 1):
    print("Error Reading/Parsing Feed XML Data")    
else:
    for item in feed[ "items" ]:
        author = item["author"]
        # One (author, tag) row per tag; pd.concat replaces DataFrame.append, which pandas 2.0 removed
        tag_rows = pd.DataFrame([{'author': author, 'tag': a[ "label" ]} for a in item["tags"]])
        reddit_author = pd.concat([reddit_author, tag_rows], ignore_index=True)

        post_row = pd.DataFrame([{'author': author, 'title': item[ "title" ], 'link': item[ "link" ], 'time': item[ "date" ], 'summary': text_from_html(item[ "summary" ])}])
        reddit_posts = pd.concat([reddit_posts, post_row], ignore_index=True)
        
        

In [225]:
reddit_author.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304 entries, 0 to 303
Data columns (total 2 columns):
author    304 non-null object
tag       304 non-null object
dtypes: object(2)
memory usage: 4.8+ KB


In [226]:
deeplearning_url = 'https://www.reddit.com/r/deeplearning/.rss?sort=new&limit=100'

feed = feedparser.parse( deeplearning_url )


if (feed['bozo'] == 1):
    print("Error Reading/Parsing Feed XML Data")    
else:
    for item in feed[ "items" ]:
        author = item["author"]
        # One (author, tag) row per tag; pd.concat replaces DataFrame.append, which pandas 2.0 removed
        tag_rows = pd.DataFrame([{'author': author, 'tag': a[ "label" ]} for a in item["tags"]])
        reddit_author = pd.concat([reddit_author, tag_rows], ignore_index=True)

        post_row = pd.DataFrame([{'author': author, 'title': item[ "title" ], 'link': item[ "link" ], 'time': item[ "date" ], 'summary': text_from_html(item[ "summary" ])}])
        reddit_posts = pd.concat([reddit_posts, post_row], ignore_index=True)
        
        

In [227]:
python_url = 'https://www.reddit.com/r/Python/.rss?sort=new&limit=100'

feed = feedparser.parse( python_url )


if (feed['bozo'] == 1):
    print("Error Reading/Parsing Feed XML Data")    
else:
    for item in feed[ "items" ]:
        author = item["author"]
        # One (author, tag) row per tag; pd.concat replaces DataFrame.append, which pandas 2.0 removed
        tag_rows = pd.DataFrame([{'author': author, 'tag': a[ "label" ]} for a in item["tags"]])
        reddit_author = pd.concat([reddit_author, tag_rows], ignore_index=True)

        post_row = pd.DataFrame([{'author': author, 'title': item[ "title" ], 'link': item[ "link" ], 'time': item[ "date" ], 'summary': text_from_html(item[ "summary" ])}])
        reddit_posts = pd.concat([reddit_posts, post_row], ignore_index=True)
        
        
       

In [228]:
sql_url = 'https://www.reddit.com/r/NoSQL/.rss?sort=new&limit=100'

feed = feedparser.parse( sql_url )


if (feed['bozo'] == 1):
    print("Error Reading/Parsing Feed XML Data")    
else:
    for item in feed[ "items" ]:
        author = item["author"]
        # One (author, tag) row per tag; pd.concat replaces DataFrame.append, which pandas 2.0 removed
        tag_rows = pd.DataFrame([{'author': author, 'tag': a[ "label" ]} for a in item["tags"]])
        reddit_author = pd.concat([reddit_author, tag_rows], ignore_index=True)

        post_row = pd.DataFrame([{'author': author, 'title': item[ "title" ], 'link': item[ "link" ], 'time': item[ "date" ], 'summary': text_from_html(item[ "summary" ])}])
        reddit_posts = pd.concat([reddit_posts, post_row], ignore_index=True)
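
The fetch cells above repeat the same logic once per feed. As a hedged alternative, the sketch below loops over a list of URLs and keeps one item per link. `collect_items` and `fake_parse` are hypothetical names; `fake_parse` is an offline stand-in so the example runs without network access, and the assumption is that the real parser (for example `feedparser.parse`) returns a mapping with `bozo` and `items`, as the cells above rely on.

```python
def collect_items(urls, parse):
    """Fetch every feed in `urls` with `parse` and return unique items keyed by link.

    `collect_items` is a hypothetical helper. `parse` is any callable that
    returns a mapping with 'bozo' and 'items', as feedparser.parse does above.
    """
    seen = {}
    for url in urls:
        feed = parse(url)
        if feed["bozo"] == 1:
            print("Error Reading/Parsing Feed XML Data:", url)
            continue
        for item in feed["items"]:
            # Keep the first item seen for each link; later duplicates are skipped
            seen.setdefault(item["link"], item)
    return list(seen.values())


def fake_parse(url):
    """Offline stand-in for a feed parser (made-up data, for illustration only)."""
    if "broken" in url:
        return {"bozo": 1, "items": []}
    return {
        "bozo": 0,
        "items": [
            {"link": "https://example.com/shared", "title": "Appears in every feed"},
            {"link": url, "title": "Appears in one feed"},
        ],
    }


items = collect_items(
    ["https://example.com/x", "https://example.com/y", "https://example.com/broken"],
    fake_parse,
)
print(len(items))
```

With real feeds, the call would look like `collect_items(list_of_rss_urls, feedparser.parse)`, and the resulting list can be turned into a DataFrame in one step.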
        
        

In [229]:
reddit_author.head()

Unnamed: 0,author,tag
0,/u/clgmae104,r/whatisthisthing
1,/u/amco3008,r/AskReddit
2,/u/adviseme3737,r/legaladvice
3,/u/_Caketaco_,r/copypasta
4,/u/Pixelcitizen98,r/OutOfTheLoop


In [230]:
reddit_author.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 606 entries, 0 to 605
Data columns (total 2 columns):
author    606 non-null object
tag       606 non-null object
dtypes: object(2)
memory usage: 9.5+ KB


In [231]:
reddit_author.nunique()

author    535
tag       100
dtype: int64

In [232]:
#drop duplicates
reddit_author=reddit_author.drop_duplicates(['author'],keep='last')

In [233]:
reddit_author.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 535 entries, 0 to 605
Data columns (total 2 columns):
author    535 non-null object
tag       535 non-null object
dtypes: object(2)
memory usage: 12.5+ KB


In [241]:
reddit_author=reddit_author.dropna(axis=0, how='any')

In [242]:
reddit_author.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 535 entries, 0 to 605
Data columns (total 2 columns):
author    535 non-null object
tag       535 non-null object
dtypes: object(2)
memory usage: 12.5+ KB


In [243]:
reddit_author.isnull().sum()

author    0
tag       0
dtype: int64

In [244]:
reddit_author.nunique()

author    535
tag        97
dtype: int64

In [143]:
reddit_posts.head()

Unnamed: 0,author,title,link,time,summary
0,/u/GeneLatifah,Gillum responds to Scott lawsuit: ‘Counting vo...,https://www.reddit.com/r/politics/comments/9vk...,2018-11-09T14:14:55+00:00,submitted by /u/GeneLatifah to r/politics...
1,/u/mvea,US cigarette smoking rate reaches new low - Ci...,https://www.reddit.com/r/science/comments/9vkc...,2018-11-09T13:18:23+00:00,submitted by /u/mvea to r/science [link...
2,/u/clgmae104,What is this rodent that just climbed out of m...,https://www.reddit.com/r/whatisthisthing/comme...,2018-11-09T01:52:21+00:00,submitted by /u/clgmae104 to r/whatisthis...
3,/u/EverythingTittysBoii,Gave him a forever home yesterday and thought ...,https://www.reddit.com/r/aww/comments/9vkcqy/g...,2018-11-09T13:20:19+00:00,submitted by /u/EverythingTittysBoii to r...
4,/u/evgat2,The frosting on my car window was melted by th...,https://www.reddit.com/r/mildlyinteresting/com...,2018-11-09T12:48:22+00:00,submitted by /u/evgat2 to r/mildlyinteres...


In [247]:
reddit_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 606 entries, 0 to 605
Data columns (total 5 columns):
author     606 non-null object
title      606 non-null object
link       606 non-null object
time       606 non-null object
summary    606 non-null object
dtypes: object(5)
memory usage: 23.8+ KB


In [249]:
# Assign back to reddit_posts so the de-duplicated frame is the one used below
reddit_posts = reddit_posts.drop_duplicates()

In [250]:
reddit_posts=reddit_posts.dropna()

In [251]:
reddit_posts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 606 entries, 0 to 605
Data columns (total 5 columns):
author     606 non-null object
title      606 non-null object
link       606 non-null object
time       606 non-null object
summary    606 non-null object
dtypes: object(5)
memory usage: 28.4+ KB


In [24]:
import getpass
# This collects a masked password from the user
mypasswd = getpass.getpass()

········


In [25]:
import psycopg2
import numpy as np
from psycopg2.extensions import adapt, register_adapter, AsIs

# Then connects to the DB
connection = psycopg2.connect(database = 'dsa_student', 
                              user = 'dlfy6', 
                              host = 'dbase.dsa.missouri.edu',
                              password = mypasswd)
cursor = connection.cursor()

# Then remove the password from computer memory
del mypasswd

In [245]:
#reddit_author = reddit_author.where(pd.notnull(data), None)


register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)

for row in reddit_author.itertuples(index=False, name=None):
    #print(row)
    # name=None yields plain tuples; no rollback inside the loop, since it would discard the rows inserted so far
    cursor.execute('INSERT INTO dlfy6.reddit_author VALUES(%s,%s)',row)
    
    
# Save (commit) the changes
connection.commit()


In [246]:

#reddit_posts = reddit_posts.where(pd.notnull(data), None)

register_adapter(np.int64,AsIs)
register_adapter(np.float64,AsIs)

    
for row in reddit_posts.itertuples(index=False, name=None):
    #print(row)
    # name=None yields plain tuples; no rollback inside the loop, since it would discard the rows inserted so far
    cursor.execute('INSERT INTO dlfy6.reddit_posts VALUES(%s,%s,%s,%s,%s)',row)
    
    
# Save (commit) the changes
connection.commit()


## M3:E2:Q4 - Task 4: Your interesting query of your data
**Feel free to add additional cells and write extra queries**


In [256]:
SQL ="""

SELECT a.author, a.tag, p.link
FROM dlfy6.reddit_posts p
JOIN dlfy6.reddit_author a
USING (author);

"""

with connection, connection.cursor() as cursor:
    cursor.execute(SQL)
    df = cursor.fetchall()
    


In [258]:
df=pd.DataFrame(df,columns=['author','tag','link'])
df


Unnamed: 0,author,tag,link
0,/u/clgmae104,r/whatisthisthing,https://www.reddit.com/r/whatisthisthing/comme...
1,/u/amco3008,r/AskReddit,https://www.reddit.com/r/AskReddit/comments/9v...
2,/u/adviseme3737,r/legaladvice,https://www.reddit.com/r/legaladvice/comments/...
3,/u/_Caketaco_,r/copypasta,https://www.reddit.com/r/copypasta/comments/9v...
4,/u/Pixelcitizen98,r/OutOfTheLoop,https://www.reddit.com/r/OutOfTheLoop/comments...
5,/u/GeneLatifah,r/politics,https://www.reddit.com/r/politics/comments/9vk...
6,/u/screaming_librarian,r/news,https://www.reddit.com/r/news/comments/9vhpmv/...
7,/u/syd430,r/TopMindsOfReddit,https://www.reddit.com/r/TopMindsOfReddit/comm...
8,/u/mvea,r/science,https://www.reddit.com/r/science/comments/9vkc...
9,/u/EverythingTittysBoii,r/aww,https://www.reddit.com/r/aww/comments/9vkcqy/g...


# Save your notebook, then `File > Close and Halt`

---