# Data Retrieval Example

## requirements:
- install praw using the following command `!pip install praw`. Don't use conda for this specific model.
- install beautifulsoup using the following command `!pip install bs4`
- install requests if you don't have in your development environment 

## References:
- https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c
- https://www.reddit.com/dev/api
- https://praw.readthedocs.io/en/stable/index.html


In [1]:
# %pip install praw
# %pip install pandas
import praw
import pandas as pd
import requests
import bs4

## Connecting to the reddit API using `requests`

### Steps:
- create an app on reddit to acquire the access credentials:
    - `PASSWORD` password of the reddit account used to create the app
    - `USERNAME` username of the reddit account used to create the app
    - `CLIENT_ID` acquired after creating the app
    - `SECRET_TOKEN` acquired after creating the app


In [2]:
# Replace <password> with the password for the CMPS287_project user. Replace myFirstDatabase with the name of the database that connections will use by default. Ensure any option params are URL encoded.
# connection_string = "mongodb+srv://CMPS287_project:<password>@cluster0.unwq2.mongodb.net/myFirstDatabase?retryWrites=true&w=majority"
# modogoDB_password = 'uHg8D8xpWwz71zRz'
PASSWORD     = 'j=HUZ`6S8B'
USERNAME     = 'CMPS287_project'
CLIENT_ID    = 'd5w9jc7mmyeLEL2DG1wtxg'
SECRET_TOKEN = 'HIOuTew4HunOVSJeFT47Yi4sCkdBCA'
# note that CLIENT_ID refers to 'personal use script' and SECRET_TOKEN to 'token'
auth = requests.auth.HTTPBasicAuth(CLIENT_ID, SECRET_TOKEN)

# here we pass our login method (password), username, and password
data = {'grant_type': 'password',
        'username': USERNAME,
        'password': PASSWORD }

# setup our header info, which gives reddit a brief description of our app
headers = {'User-Agent': 'MyAPI/0.0.1'}

# send our request for an OAuth token
res = requests.post('https://www.reddit.com/api/v1/access_token',
                    auth=auth, data=data, headers=headers)

# convert response to JSON and pull access_token value
TOKEN = res.json()['access_token']

# add authorization to our headers dictionary
headers = {**headers, **{'Authorization': f"bearer {TOKEN}"}}

# while the token is valid (~2 hours) we just add headers=headers to our requests
response_josn = requests.get('https://oauth.reddit.com/api/v1/me', headers=headers).json()

# To see the response just print response_json
# print(response_josn)

In [3]:
hot = "https://oauth.reddit.com/r/python/hot"


# make a request for the trending posts in /r/Python
res = requests.get("https://oauth.reddit.com/r/python/hot",
                   headers=headers)

df = pd.DataFrame()  # initialize dataframe

# loop through each post retrieved from GET request
for post in res.json()['data']['children']:
    # append relevant data to dataframe
    df = df.append({
        'subreddit': post['data']['subreddit'],
        'title': post['data']['title'],
        'selftext': post['data']['selftext'],
        'upvote_ratio': post['data']['upvote_ratio'],
        'ups': post['data']['ups'],
        'downs': post['data']['downs'],
        'score': post['data']['score']
    }, ignore_index=True)


  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({
  df = df.append({


In [4]:
df.head()

Unnamed: 0,subreddit,title,selftext,upvote_ratio,ups,downs,score
0,Python,Sunday Daily Thread: What's everyone working o...,Tell /r/python what you're working on this wee...,0.72,9,0,9
1,Python,Monday Daily Thread: Project ideas!,Comment any project ideas beginner or advanced...,0.5,0,0,0
2,Python,I'm too lazy to turn off my PC at nights. so m...,"\n""Laziness"" is a common problem in programmer...",0.94,725,0,725
3,Python,Essential Books to Improve Your Python Program...,**it's absolutely possible to learn Python on...,0.84,29,0,29
4,Python,I used a new dataframe library (polars) to wra...,,0.88,12,0,12


In [5]:
# getting usernames of the hot posts
authors_contributors = "https://oauth.reddit.com/r/explainlikeimfive/hot"

# make a request for the trending posts in /r/Python
res = requests.get(authors_contributors, headers=headers)

respons_json = res.json()['data']['children']

for i in range(len(respons_json)):
    print(respons_json[i]['data']['author'])

kingtut2003
NovemberGoat
Palkenstein
Hewo111
Lemoontree3412
SecureDonkey
UnblackMetalist
dafreshprints
bababooey0608
eskaden
Cool-Boy57
soundsystxm
FB_emeenem
The_Dying
thatstupidthing
tidytuna
NostraThomas1
uziau
CumDogMillionare93
Edwunclerthe3rd
BrilliantIdiot99
someonee404
msleo90
Nostalgia_Red


## Example using praw library
- praw library is the python Reddit API wrapper
- We use the same access credentials we used in the previous example

In [6]:
reddit = praw.Reddit(client_id     = CLIENT_ID,
                     client_secret = SECRET_TOKEN,
                     user_agent    = 'MyAPI/0.0.1')

# To test if your instance is working use:
print(reddit.read_only) # Output: True

for submission in reddit.subreddit("learnpython").hot(limit=10):
    print(submission.title)

True
Ask Anything Monday - Weekly Thread
where to start to learn python
Oh can i make that in less lines
Pytest is failing on GitHub Actions but succeeds locally
What are the basic elements of OOP for Python?
problem using .split() when taking two inputs from a user
Weather Station project
Implementing a simple math captcha in flask ( x+y=z)?
I learn in a really specific way…
Looking for a mentor please.


In [7]:
#  getting the wikipages of the r/autowikibot subreddit
#  link to the subreddit http://www.reddit.com/user/autowikibot

for wikipage in reddit.subreddit("autowikibot").wiki:
    print(wikipage)

autowikibot/botlist
autowikibot/commandlist
autowikibot/config/description
autowikibot/config/sidebar
autowikibot/config/stylesheet
autowikibot/config/submit_text
autowikibot/css
autowikibot/excludedsubs
autowikibot/index
autowikibot/livelists
autowikibot/modfaqs
autowikibot/nsfwtag
autowikibot/planned
autowikibot/redditbots
autowikibot/rootonlysubs
autowikibot/statistics
autowikibot/summon
autowikibot/summononlysubs
autowikibot/userblacklist


In [8]:
# getting the content of a specific wikipage in the r/autowikibot subreddit
# link to the wikipage we are requesting https://www.reddit.com/r/autowikibot/wiki/redditbots
wikipage = reddit.subreddit("autowikibot").wiki["redditbots"]

# print the content of the page
# print(wikipage.content_md)

## Using BeautifulSoup

In [9]:
url = "https://www.reddit.com/r/autowikibot/wiki/redditbots"
# Headers to mimic a browser visit
headers = {'User-Agent': 'Mozilla/5.0'}

# Returns a requests.models.Response object
page = requests.get(url, headers=headers)

soup = bs4.BeautifulSoup(page.text, 'html.parser')
a_tags = soup.select("table tbody td a")

bots_usrnames = []
for i in range(len(a_tags)):
    href = a_tags[i].attrs['href']
    if href.startswith('/u/'):
        # print(href)
        bots_usrnames.append(href)
len(bots_usrnames)

NameError: name 'bs4' is not defined

## Using BeautifulSoup and Pandas

- Pandas provide a `read_html()` function that takes a webpage url and returns a list of dataframes created form the tables that exist in the said webpage.
- I am using here bs4 to parse the webpage and passing the soup object which is the parsed html page I am doing this instead of passing the url to the function because the parser that is integrated with the function is bad.

In [None]:
full_list_bot_usrnames = []

# Parsing a webpage that has one table of bot accounts.
page = requests.get('https://www.reddit.com/r/autowikibot/wiki/redditbots', headers=headers)
soup = bs4.BeautifulSoup(page.text, 'html.parser')

# Passing the parsed page as text to the function
dfs = pd.read_html(page.text)


bots_table = dfs[0]
bots_table.loc[:,'Username']

for username in bots_table.loc[:,'Username']:
    full_list_bot_usrnames.append(username)
    # usrname_url = "https://www.reddit.com" + username
    # print(usrname_url) 
    
df.head()

Unnamed: 0,subreddit,title,selftext,upvote_ratio,ups,downs,score
0,0.0,14.0,Tell /r/python what you're working on this wee...,Python,Sunday Daily Thread: What's everyone working o...,14.0,1.0
1,0.0,0.0,Use this thread to talk about anything Python ...,Python,Friday Daily Thread: Free chat Friday! Daily T...,0.0,0.5
2,0.0,73.0,"hey, yesterday I was making an auto reply bot ...",Python,copilot getting creepy,73.0,0.87
3,0.0,26.0,This was built as my first year university pro...,Python,Boihut bookstore(Ecommerce ) website made in D...,26.0,0.86
4,0.0,11.0,,Python,Python 3.11 Preview: Task and Exception Groups...,11.0,0.93


In [None]:
# Parsing a webpage that has two tables of bot accounts.
page = requests.get('https://www.reddit.com/r/botwatch/comments/1wg6f6/bot_list_i_built_a_bot_to_find_other_bots_so_far/cf1nu8p/', headers=headers)
soup = bs4.BeautifulSoup(page.text, 'html.parser')

dfs = pd.read_html(page.text)

# Appending the first dataframe of the firt table to the second dataframe of the second table.
total= dfs[0].append(dfs[1]) 

for username in total.loc[:,'User']:
    # usrname_url = "https://www.reddit.com" + username
    # print(usrname_url) 

    # Checking if the bot account username is already in the list (from the table of bots in the previous list)
    if username not in full_list_bot_usrnames:
        full_list_bot_usrnames.append(username)

  total= dfs[0].append(dfs[1])


In [None]:
pd.DataFrame(full_list_bot_usrnames).to_csv("test.csv")

In [None]:
trolls_list = []
trolls_list_url = "https://www.reddit.com/wiki/suspiciousaccounts"
page = requests.get(trolls_list_url, headers=headers)

df_trolls = pd.read_html(page.text)[0]

for username in df_trolls.loc[:,'Username']:
    # usrname_url = "https://www.reddit.com" + username
    # print(usrname_url) 

    trolls_list.append(username)

In [None]:
len(trolls_list) # 939

939