# Demo 14 - Reddit API

In [None]:
import numpy as np
import pandas as pd

## PSAW

`psaw` is a Python wrapper for the Pushshift API. It provides access to publicly available Reddit submissions and comments.

Code and examples can be found on [GitHub](https://github.com/dmarx/psaw), and documentation is available [online](https://psaw.readthedocs.io/en/latest/#).

**Question:** How do we install psaw?

<details>
<summary>Solution</summary>
   !pip install psaw

</details>

In [None]:
from psaw import PushshiftAPI

In [None]:
api = PushshiftAPI()
api

In [None]:
api_request_generator = api.search_submissions(subreddit='AmITheAsshole')
api_request_generator

### Python Generators Explained

#### Iterable

**Question** What is an `Iterable`?

<details>
<summary>Solution</summary>
    An iterable is any object in Python that defines an __iter__ method, which returns an iterator, or a __getitem__ method, which accepts indexes. In short, an iterable is any object that can provide us with an iterator.
    
    <br>
    <br>
    https://book.pythontips.com/en/latest/generators.html

</details>

##### Examples of Iterable objects

In [None]:
np_array = np.arange(10)
np_array

In [None]:
"__iter__" in dir(np_array)

In [None]:
"__getitem__" in dir(np_array)

In [None]:
dictionary = {}
for key in 'abcdefhi':
    dictionary[key] = np.random.rand()
dictionary

In [None]:
"__iter__" in dir(dictionary)

In [None]:
"__getitem__" in dir(dictionary)
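
To make the two protocols concrete, here is a minimal sketch with two hypothetical classes (`Countdown` and `Letters` are illustrative names, not part of any library). The first is iterable through `__iter__`, the second only through `__getitem__`.

```python
class Countdown:
    """Iterable via __iter__: returns a fresh iterator each time."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # Delegate to a built-in iterator over the numbers start..1
        return iter(range(self.start, 0, -1))


class Letters:
    """Iterable via __getitem__ only: Python tries indexes 0, 1, 2, ... until IndexError."""

    def __init__(self, text):
        self.text = text

    def __getitem__(self, index):
        # Raises IndexError past the end, which ends the iteration
        return self.text[index]


countdown_values = list(Countdown(3))
letter_values = list(Letters("abc"))
print(countdown_values)
print(letter_values)
```

Both objects work in a `for` loop, even though `Letters` never defines `__iter__`.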

#### Iterator

An iterator is any object in Python that defines a `__next__` method (`next` in Python 2). That’s it. That’s an iterator.

In [None]:
"__next__" in dir(np_array)

In [None]:
"__next__" in dir(iter(np_array))

In [None]:
iterator = iter(np_array)
iterator

In [None]:
next(iterator)

In [None]:
next(iterator)

In [None]:
iterator = iter(np_array[:1])
iterator

In [None]:
next(iterator)

In [None]:
next(iterator)

The previous line raises `StopIteration` because the iterator has no items left.
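
A minimal sketch of two common ways to deal with an exhausted iterator, using a plain list so the example is self-contained: catch `StopIteration`, or pass a default as the second argument to `next`.

```python
values = iter([42])

first = next(values)  # the only item

try:
    next(values)  # nothing left, so this raises StopIteration
    exhausted = False
except StopIteration:
    exhausted = True

# With a default, next() returns it instead of raising
fallback = next(values, None)
print(first, exhausted, fallback)
```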

#### Iteration

Iteration is the process of taking items from something, e.g. a list, one at a time. When we use a loop to go over something, that is iteration. It is the name given to the process itself. Now that we have a basic understanding of these terms, let’s understand generators.

In [None]:
for num in np_array:
    print(num)

In [None]:
# Prints nothing: this iterator was already exhausted in the cells above
for num in iterator:
    print(num)

In [None]:
iterator = iter(np_array)
iterator

for num in iterator:
    print(num)
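
Under the hood, a `for` loop does roughly the following: it calls `iter` on the object, calls `next` repeatedly, and stops when `StopIteration` is raised. A minimal sketch, where `manual_for` is a hypothetical helper written only for illustration:

```python
def manual_for(iterable):
    """Collect items the way a for loop would, using iter() and next() explicitly."""
    collected = []
    iterator = iter(iterable)      # step 1: ask the iterable for an iterator
    while True:
        try:
            item = next(iterator)  # step 2: fetch the next item
        except StopIteration:      # step 3: stop quietly when exhausted
            break
        collected.append(item)
    return collected


result = manual_for([10, 20, 30])
print(result)
```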

#### Generators

Generators are iterators, but you can only iterate over them once. It’s because they do not store all the values in memory, they generate the values on the fly. You use them by iterating over them, either with a ‘for’ loop or by passing them to any function or construct that iterates. Most of the time generators are implemented as functions. However, they do not return a value, they yield it. Here is a simple example of a generator function:

In [None]:
def generator_function():
    for i in range(3):
        yield i

gen = generator_function()
print(next(gen))
# Output: 0
print(next(gen))
# Output: 1
print(next(gen))
# Output: 2
print(next(gen))
# Output: Traceback (most recent call last):
#            File "<stdin>", line 1, in <module>
#         StopIteration

Writing a function with `yield` is therefore a compact way to build an iterator: calling the function returns a generator object that supports `next`.
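
A short sketch of the "only once" behaviour described above. The `squares` function is a hypothetical example; the second pass over the same generator yields nothing because its values were never stored.

```python
def squares(n):
    """Yield the squares of 0..n-1 one at a time."""
    for i in range(n):
        yield i * i


gen = squares(4)
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing left: a generator can only be iterated once

# A generator expression builds the same kind of lazy object inline
lazy_total = sum(i * i for i in range(4))
print(first_pass, second_pass, lazy_total)
```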

### Search Submissions

Potential subreddits for class:

- https://www.reddit.com/r/columbia/
- https://www.reddit.com/r/wallstreetbets/
- https://www.reddit.com/r/datasets/
- https://www.reddit.com/r/AskReddit/

In [None]:
api_request_generator = api.search_submissions(subreddit='AmITheAsshole', score = ">2000")
api_request_generator

In [None]:
aita_subs_df = pd.DataFrame([submission.d_ for submission in api_request_generator])
aita_subs_df.shape
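
Because the search returns a generator, the list comprehension above keeps pulling results until the generator is exhausted. A minimal sketch of capping how many items are consumed with `itertools.islice`. It uses a hypothetical stand-in generator (`fake_submissions` is not part of psaw) so it runs without network access:

```python
from itertools import count, islice


def fake_submissions():
    """Hypothetical stand-in for an API generator: yields dicts forever."""
    for i in count():
        yield {"id": i, "score": 2000 + i}


# Take only the first 5 results; the rest are never generated
first_five = list(islice(fake_submissions(), 5))
print(len(first_five), first_five[-1])
```

The resulting list of dicts can be passed to `pd.DataFrame` in the same way as the comprehension above.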

In [None]:
aita_subs_df.keys()

Useful columns:

- author
- title
- selftext
- created_utc
- score
- num_comments
- subreddit
- url


In [None]:
from datetime import datetime

In [None]:
aita_subs_df.head(5)['created_utc'].apply(datetime.fromtimestamp)
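
`datetime.fromtimestamp` converts to the local time zone of the machine running the notebook, so results can differ between machines. A minimal sketch of a time-zone-explicit conversion using only the standard library. The timestamp below is an assumed example value, not taken from the data:

```python
from datetime import datetime, timezone

created_utc = 1_600_000_000  # assumed example Unix timestamp, in seconds

# Passing tz=timezone.utc gives the same result on every machine
as_utc = datetime.fromtimestamp(created_utc, tz=timezone.utc)
print(as_utc.isoformat())
```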

#### Search submission by keyword

In [None]:
api_request_generator = api.search_submissions(q='Missy Elliott', score=">2000")
missy_elliott_df = pd.DataFrame([submission.d_ for submission in api_request_generator])


### Search Comments

In [None]:
search_generator = api.search_comments(size=25)
search_generator

In [None]:
comments_df = pd.DataFrame([submission.d_ for submission in search_generator])
comments_df.shape

In [None]:
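# search_generator was consumed above, so request a fresh one before looping
search_generator = api.search_comments(size=25)
for submission in search_generator:
    pass  # after the loop, `submission` holds the last item returned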

In [None]:
type(submission)

In [None]:
dir(submission)

In [None]:
submission.d_['stickied']

#### Search Comments by Multiple Keywords

**Question:** Can someone explain the next two lines?

In [None]:
api_request_generator = api.search_comments(q='(George Orwell)|(J. R. R. Tolkien)')

In [None]:
api_request_generator = api.search_comments(q='(Shakespeare)&(Beyonce)')

## PRAW

PRAW is a popular Python wrapper for accessing Reddit data. Unlike PSAW, it uses the Reddit API directly rather than Pushshift's collection of Reddit data.

To use PRAW, you need to create a Reddit account and register a Reddit app.
We'll skip this during today's demo.

### Making a Reddit App


Go to https://www.reddit.com/prefs/apps/ and click the button labeled `are you a developer? create an app...`.

[API Access overview](https://www.reddit.com/wiki/api)


## Cleaning text

In [None]:
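import pandas as pd
# Search submissions again: the previous generator held comments, which have no selftext
api_request_generator = api.search_submissions(subreddit='AmITheAsshole', score=">2000")
aita_subs_df = pd.DataFrame([submission.d_ for submission in api_request_generator])


In [None]:
# The list comprehension above exhausted the generator, so request a fresh one
api_request_generator = api.search_submissions(subreddit='AmITheAsshole', score=">2000")
submission = next(api_request_generator)
submission.selftext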

In [None]:
!pip install redditcleaner

In [None]:
import redditcleaner

In [None]:
redditcleaner.clean(submission.selftext)
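
To give a feel for what this kind of cleanup involves, here is a rough sketch using only the standard library. It is not redditcleaner's actual implementation. `simple_clean` and the sample text are hypothetical, and the symbol-stripping rule is deliberately crude (it would also remove underscores inside words).

```python
import html
import re


def simple_clean(text):
    """Hypothetical helper: a rough approximation of Reddit markdown cleanup."""
    text = html.unescape(text)                            # &amp; -> &, &gt; -> >
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)  # [label](url) -> label
    text = re.sub(r"[*_~>#]+", "", text)                  # drop common markdown symbols
    text = re.sub(r"\s+", " ", text)                      # collapse newlines and repeated spaces
    return text.strip()


raw = "**AITA** for this?\n\n&gt; quoted line\n\nSee [the post](https://example.com) &amp; reply."
cleaned = simple_clean(raw)
print(cleaned)
```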