# Notebook 1

## Problem Statement

Fake news is a prevalent and harmful problem in modern society, often misleading the general public on important topics and policies, such as healthcare or taxes.

Such misinformation can erode trust in government institutions or news agencies and result in deep-seated, persistent societal issues that have a negative impact on public safety and well-being.

Our team aims to develop a model using Natural Language Processing and classification models that can accurately classify whether an article contains real news or fake news based on its headline.


## Background

To create a model, we collected the headlines of posts from two subreddits, The Onion and Not The Onion. The Onion subreddit is a collection of posts containing the headlines of, and links to, articles produced by the satire website "The Onion", which publishes fictitious articles written to emulate real news articles.

Not The Onion, on the other hand, is a subreddit that contains posts about stories that are true but hard for readers to believe.

By collecting the post data from these two subreddits, we can build one collection of headlines from fake news articles and another containing only real news.

### Part 1 - Scraping Data

#### Imports

In [None]:
#importing packages required for extraction and to organize the extracted posts
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
import numpy as np

#### Scraping data

The Pushshift API allows us to extract 100 posts per request. For our project we plan to extract ~5,000 posts per subreddit. To pull the necessary number of posts, we will have to make at least 50 requests of 100 posts each (more if duplicates are dropped along the way). We will do this by creating a function that makes an initial request of 100 posts and then keeps running the following loop until the post count reaches 5,000:

(a) make a request for 100 posts, or for the number of posts still needed to reach 5,000, whichever is less,
(b) drop any duplicates,
(c) if fewer than 5,000 posts have been pulled, repeat.

We're pulling 5,000 posts as it allows us to create a dataset with sufficient data points that will remain sizeable as we clean the data.
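
The loop described above can be tried out offline before any real requests are made. The sketch below is only an illustration under stated assumptions: `fake_fetch` is a hypothetical stand-in for one API request (it is not part of Pushshift), and `collect` is a hypothetical helper that mirrors steps (a) to (c), with one extra check that stops the loop when the source has no older posts left.

```python
import pandas as pd

def fake_fetch(before, size):
    """Hypothetical stand-in for one API request (no network involved)."""
    # Pretend the subreddit holds 350 posts with timestamps 1..350, newest first.
    stamps = [t for t in range(350, 0, -1) if t < before][:size]
    # Every 10th post reuses the title of its newer neighbour, to mimic reposts.
    return [{"created_utc": t, "title": f"headline {t + 1 if t % 10 == 0 else t}"}
            for t in stamps]

def collect(target, before, page_size=100):
    """Page backwards in time until `target` unique titles are held or the source runs dry."""
    rows = []        # deduplicated posts gathered so far
    cursor = before  # only posts older than this timestamp are requested
    while len(rows) < target:
        # (a) ask for a full page, or only for what is still missing
        batch = fake_fetch(cursor, min(page_size, target - len(rows)))
        if not batch:  # nothing older is left, so stop instead of looping forever
            break
        cursor = batch[-1]["created_utc"]
        # (b) drop duplicate titles, keeping the most recently fetched copy
        merged = pd.DataFrame(rows + batch).drop_duplicates(subset="title", keep="last")
        rows = merged.to_dict("records")
        # (c) the while condition repeats the request if we are still short
    return pd.DataFrame(rows, columns=["created_utc", "title"])

posts = collect(target=300, before=10**6)        # stops once 300 unique titles are held
everything = collect(target=1000, before=10**6)  # stops early when the fake source is exhausted
```

Because duplicates are removed after every page, reaching the target can take more requests than the target divided by the page size.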

In [70]:
#creating function for reddit scraping. utc = function will only seek posts that were created before the stated utc, url = url of api, number = number of posts to extract
def reddit_scrape (subreddit, utc = 1640966400,
                   url = 'https://api.pushshift.io/reddit/search/submission',
                   number = 5000):
    #setting parameters to extract 1st 100 posts
    params = {'subreddit': subreddit,
              'size':100, #max number of posts for pushshift api
              'before': utc}
    res = requests.get(url,params)
    data = res.json()
    posts = data['data']
    # creating a dataframe from the posts scraped
    df = pd.DataFrame(posts)
    # creating a loop where as long as the dataframe does not have 5000 posts we will keep pulling 100 posts or the number required to hit 5000 posts, whichever is less
    while len(df)< number:
        remainder = number-len(df)
        # setting the size to either 100 posts or the number required to hit 5k posts
        size = np.min((remainder, 100))
        # parameters for requests, similar to the initial parameters however cut off date will be based on the date of the last item in the current batch of posts.
        params_for_additions = {'subreddit': subreddit,
                                'size': size,
                                'before': df.created_utc.iloc[-1]}

        additional_requests = requests.get(url, params_for_additions)
        additional_data = additional_requests.json()
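        additional_posts = additional_data['data']
        #stop if the api has no older posts left, otherwise the loop would never end
        if len(additional_posts) == 0:
            break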
        #Adding newly extracted posts to the created df
        df = pd.concat([df, pd.DataFrame(additional_posts)], axis=0)
        #dropping any duplicates
        df.drop_duplicates(subset = ['title'], keep = 'last', inplace = True)
        #resetting index
        df.reset_index(inplace = True, drop = True)
    #Once the loop has finished, save the collected posts into a csv
    df.to_csv(f'../Datasets/{subreddit}.csv')


In [14]:
#running the function to extract posts from "TheOnion"
reddit_scrape('theonion')

In [26]:
#running the function to extract posts from "Nottheonion"
reddit_scrape('nottheonion')

Next we'll read in the csv files with the post details from "TheOnion" and "nottheonion" so that we can review the contents and ensure that the function extracted the post details as intended.

In [3]:
#reading the datasets
df_notonion = pd.read_csv('../Datasets/nottheonion.csv', index_col = [0])
df_onion = pd.read_csv('../Datasets/theonion.csv', index_col = [0])
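
The `index_col = [0]` argument is needed here because the scraping function saved each file with `to_csv` defaults, which write the row index as an unnamed first column. The toy round trip below is a minimal sketch of that behaviour; it uses an in-memory buffer and made-up titles rather than the real files.

```python
import io
import pandas as pd

frame = pd.DataFrame({"title": ["first headline", "second headline"]})

# With default settings, to_csv writes the index as an unnamed first column.
buffer = io.StringIO()
frame.to_csv(buffer)

# Reading without index_col turns the saved index into an extra data column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reading with index_col=[0] restores it as the index instead.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=[0])

print(list(naive.columns))
print(list(restored.columns))
```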

In [5]:
#review contents
df_notonion[['title']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   5000 non-null   object
dtypes: object(1)
memory usage: 78.1+ KB


In [6]:
df_onion[['title']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   5000 non-null   object
dtypes: object(1)
memory usage: 78.1+ KB


In [7]:
# checking for duplicates
df_onion['title'].duplicated().sum()

0

In [8]:
df_notonion['title'].duplicated().sum()

0

At first glance there don't seem to be any duplicates in our data, and there are no nulls in the title field.
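
The checks above compare titles exactly, so two headlines that differ only in letter case or surrounding whitespace count as distinct. The sketch below shows one possible way to look a little further; `title_quality` is a hypothetical helper, the sample titles are made up, and lowercasing plus stripping is an assumed normalization rather than a step taken in this notebook.

```python
import pandas as pd

def title_quality(df):
    """Summarize nulls, exact duplicates and case/whitespace-insensitive duplicates."""
    titles = df["title"]
    # Assumed normalization: trim surrounding whitespace and lowercase.
    normalized = titles.str.strip().str.lower()
    return {
        "nulls": int(titles.isna().sum()),
        "exact_duplicates": int(titles.duplicated().sum()),
        "near_duplicates": int(normalized.duplicated().sum()),
    }

# Made-up sample: the first two titles differ only in case and a trailing space.
sample = pd.DataFrame({"title": ["Area Man Wins", "area man wins ", "Local Dog Elected Mayor"]})
report = title_quality(sample)
print(report)
```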

In [9]:
#appending the 2 datasets together
df = pd.concat([df_onion, df_notonion], ignore_index = True)

In [10]:
df.reset_index(inplace = True, drop = True)

For the purpose of our analysis we will be using just the title of the post to determine which subreddit it belongs to. As such, we will only need to keep the 'title' and 'subreddit' columns.
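
The combine-and-select step can be sketched on toy data to confirm that the result has a clean index, only the two columns of interest, and rows from both subreddits. The frames, titles, and the extra `score` column below are invented for illustration.

```python
import pandas as pd

# Invented stand-ins for the two scraped datasets, with one extra column.
toy_onion = pd.DataFrame({"subreddit": ["TheOnion"] * 2,
                          "title": ["a", "b"],
                          "score": [1, 2]})
toy_not = pd.DataFrame({"subreddit": ["nottheonion"] * 3,
                        "title": ["c", "d", "e"],
                        "score": [3, 4, 5]})

# Stack the frames with a fresh 0..n-1 index, then keep only the needed columns.
combined = pd.concat([toy_onion, toy_not], ignore_index=True)[["subreddit", "title"]]

# Count rows per subreddit to see how balanced the two classes are.
counts = combined["subreddit"].value_counts()
print(counts)
```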

#### Saving selected features into a csv

In [11]:
#selecting only title and subreddit columns
df = df[['subreddit', 'title']]

In [12]:
#saving the combined items into a csv
df.to_csv('../Datasets/combined.csv', index = False)

In [59]:
#end of book 1