# Web Scraping

In this part, I wanted to scrape more fake news to enlarge my data set. I used the pushshift.io Reddit API to extract thousands of Onion news posts from Reddit. However, after analysing the data, I found that its poor quality, caused by missing content, would be a barrier to using it, so I did not include this part in my final project. I still list the process here.

In [3]:
# API scrape 
from psaw import PushshiftAPI
import pandas as pd
import numpy as np

In [18]:
def scrape_data(subreddit):
    
    # Instantiate 
    api = PushshiftAPI()

    # Create list of scraped data
    scrape_list = list(api.search_submissions(
        subreddit=subreddit,
        filter=['title', 'selftext', 'subreddit', 'num_comments', 'author',
                'subreddit_subscribers', 'score', 'domain', 'created_utc'],
        limit=15000))

    # Keep only the fields of interest, read by name so that a record
    # with a missing field does not shift the remaining values
    clean_scrape_lst = []
    for post in scrape_list:
        scrape_dict = {}
        scrape_dict['author'] = getattr(post, 'author', None)
        scrape_dict['timestamp'] = getattr(post, 'created_utc', None)
        scrape_dict['domain'] = getattr(post, 'domain', None)
        scrape_dict['num_comments'] = getattr(post, 'num_comments', None)
        scrape_dict['score'] = getattr(post, 'score', None)
        scrape_dict['selftext'] = getattr(post, 'selftext', None)
        scrape_dict['subreddit'] = getattr(post, 'subreddit', None)
        scrape_dict['title'] = getattr(post, 'title', None)
        clean_scrape_lst.append(scrape_dict)

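    # Intended to show the number of subscribers. Note that index 6 is the
    # 'subreddit' field, so this prints the subreddit name (see the output
    # below); 'subreddit_subscribers' is the field that holds the count.
    print(subreddit, 'subscribers:',scrape_list[1][6])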
    
    # Return list of scraped data
    return clean_scrape_lst
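
Indexing the returned records by position only works when every record exposes the same fields in the same order. The sketch below uses stand-in `namedtuple` types (`Full`, `NoSelftext`, and all of their values are made up for illustration, not real Pushshift output) to show how a single missing field shifts every later position, and how reading by field name avoids that.

```python
from collections import namedtuple

# Hypothetical record shapes, standing in for what the API wrapper returns.
Full = namedtuple('Full', ['author', 'created_utc', 'selftext', 'subreddit', 'title'])
NoSelftext = namedtuple('NoSelftext', ['author', 'created_utc', 'subreddit', 'title'])

records = [
    Full('user_a', 1567026670, '', 'TheOnion', 'First headline'),
    NoSelftext('user_b', 1567023428, 'TheOnion', 'Second headline'),
]

FIELDS = ['author', 'created_utc', 'selftext', 'subreddit', 'title']

def by_position(record):
    # Index 2 is selftext only when the record actually has that field.
    return record[2]

def by_name(record):
    # Missing fields become None instead of borrowing a neighbour's value.
    return {field: getattr(record, field, None) for field in FIELDS}

positional = [by_position(r) for r in records]
named = [by_name(r)['selftext'] for r in records]
print(positional)
print(named)
```

In this toy case the positional lookup returns the subreddit name for the second record. That would be one possible explanation for a value such as `'TheOnion'` turning up in the `selftext` column, although the scraped data alone does not confirm it.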

In [19]:

# Call function and create DataFrame
df_onion = pd.DataFrame(scrape_data('theonion'))

# Save data to csv
df_onion.to_csv('./the_onion.csv')

# Shape of DataFrame
print(f'df_onion shape: {df_onion.shape}')

# Show head
df_onion.head()

theonion subscribers: TheOnion
df_onion shape: (15000, 8)


Unnamed: 0,author,domain,num_comments,score,selftext,subreddit,timestamp,title
0,aresef,sports.theonion.com,0,1,,TheOnion,1567026670,Case Keenum Wins Redskins Starting Job With He...
1,aresef,politics.theonion.com,0,1,,TheOnion,1567023428,House Wayans And Means Committee Approves $50 ...
2,ProbablyDrDre,youtube.com,0,1,,TheOnion,1567019827,Expert On Anteaters Wasted Entire Life Studyin...
3,dwaxe,clickhole.com,2,1,,TheOnion,1567018342,Celebrating An Icon: President Trump Has Invit...
4,dwaxe,lifestyle.clickhole.com,0,1,,TheOnion,1567016551,Inspiring: This Man Just Became The Oldest Per...


In [33]:
df_onion = pd.read_csv('./the_onion.csv',index_col='Unnamed: 0')

In [34]:
df_onion.head()

Unnamed: 0,author,domain,num_comments,score,selftext,subreddit,timestamp,title
0,aresef,sports.theonion.com,0,1,,TheOnion,1567026670,Case Keenum Wins Redskins Starting Job With He...
1,aresef,politics.theonion.com,0,1,,TheOnion,1567023428,House Wayans And Means Committee Approves $50 ...
2,ProbablyDrDre,youtube.com,0,1,,TheOnion,1567019827,Expert On Anteaters Wasted Entire Life Studyin...
3,dwaxe,clickhole.com,2,1,,TheOnion,1567018342,Celebrating An Icon: President Trump Has Invit...
4,dwaxe,lifestyle.clickhole.com,0,1,,TheOnion,1567016551,Inspiring: This Man Just Became The Oldest Per...


In [35]:

def clean_data(dataframe,column):

    # Drop duplicate rows
    dataframe.drop_duplicates(subset=column, inplace=True)
    
    # Remove punctuation
    dataframe['title'] = dataframe['title'].str.replace(r'[^\w\s]', ' ', regex=True)

    # Remove numbers (and any other character that is not a letter)
    dataframe['title'] = dataframe['title'].str.replace(r'[^A-Za-z]', ' ', regex=True)

    # Reduce double spaces to single spaces; two passes fully collapse
    # runs of up to four spaces
    dataframe['title'] = dataframe['title'].str.replace('  ',' ')
    dataframe['title'] = dataframe['title'].str.replace('  ',' ')

    # Transform all text to lowercase
    dataframe['title'] = dataframe['title'].str.lower()
    
    print("New shape:", dataframe.shape)
    return dataframe.head()
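
Replacing two spaces with one a fixed number of times leaves longer runs only partly collapsed. As an alternative sketch (not what the function above does), a single regular expression can collapse a run of any length. `collapse_spaces` is a hypothetical helper name and the sample string is made up.

```python
import re

def two_pass(text):
    # Mirrors the double-space replacement applied twice.
    return text.replace('  ', ' ').replace('  ', ' ')

def collapse_spaces(text):
    # Collapse any run of whitespace to one space and trim the ends.
    return re.sub(r'\s+', ' ', text).strip()

sample = 'approves     million'  # five spaces between the words
print(repr(two_pass(sample)))
print(repr(collapse_spaces(sample)))
```

With pandas, the same pattern could be applied to the whole column as `dataframe['title'].str.replace(r'\s+', ' ', regex=True).str.strip()`.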

In [36]:
clean_data(df_onion,'title')

New shape: (14257, 8)


Unnamed: 0,author,domain,num_comments,score,selftext,subreddit,timestamp,title
0,aresef,sports.theonion.com,0,1,,TheOnion,1567026670,case keenum wins redskins starting job with he...
1,aresef,politics.theonion.com,0,1,,TheOnion,1567023428,house wayans and means committee approves mil...
2,ProbablyDrDre,youtube.com,0,1,,TheOnion,1567019827,expert on anteaters wasted entire life studyin...
3,dwaxe,clickhole.com,2,1,,TheOnion,1567018342,celebrating an icon president trump has invite...
4,dwaxe,lifestyle.clickhole.com,0,1,,TheOnion,1567016551,inspiring this man just became the oldest pers...


In [39]:
df_onion.selftext.unique()

array([nan, '[deleted]', 'TheOnion',
       "As all of you who have tried to visit in the last couple days saw, the subreddit has been private. We closed it to show solidarity with the other subs who closed. While you may not agree with the strike or the fact that we closed, we appreciate having all of you here. Hopefully some good can come of the strike and everything won't be in vain.\n\nBut once again, thank you for sticking with us through this. Sorry that petty reddit politics has affected your ability to browse here. If you have any suggestions or opinions express them here or message us. \n\nThanks,\n\n/r/TheOnion mods"],
      dtype=object)
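
The unique values above suggest that `selftext` carries almost no article content. One way to put a number on that is to flag empty or placeholder values and take their share. This is a sketch under assumptions: `is_missing` is a hypothetical helper, the placeholder set (including `'[removed]'`, which does not appear in the output above) is a guess, and the sample values are made up.

```python
import math

# Assumed set of strings that stand for "no content".
PLACEHOLDERS = {'', '[deleted]', '[removed]'}

def is_missing(value):
    # None and NaN (how pandas represents empty cells) count as missing.
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() in PLACEHOLDERS

sample = [float('nan'), '[deleted]', 'Some real post body', None]
share = sum(is_missing(v) for v in sample) / len(sample)
print(f'missing share: {share:.2f}')
```

Applied to the scraped data, `df_onion['selftext'].map(is_missing).mean()` would give the corresponding share for the whole column.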

In [None]:
df_onion

In [40]:
# Convert Unix Timestamp to Datetime
df_onion['timestamp'] = pd.to_datetime(df_onion['timestamp'], unit='s')

# Show date range of posts scraped from r/TheOnion
print("TheOnion start date:", df_onion['timestamp'].min())
print("TheOnion end date:", df_onion['timestamp'].max())

TheOnion start date: 2015-06-09 20:17:33
TheOnion end date: 2019-08-28 21:11:10
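
The `created_utc` values are Unix timestamps in seconds, so the converted datetimes are in UTC even though pandas displays them without a timezone. A quick cross-check with the standard library, using the timestamp of the first row in the tables above:

```python
from datetime import datetime, timezone

ts = 1567026670  # timestamp of the first row in the tables above
as_utc = datetime.fromtimestamp(ts, tz=timezone.utc)
print(as_utc.isoformat())  # 2019-08-28T21:11:10+00:00
```

This matches the end date printed above. Passing `utc=True` to `pd.to_datetime` would keep the timezone attached to the column.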
