<a id = 'content'></a>
### Content page
___

<a id = 'section_0'></a>
# 0.0 Function Creation
___

In [1]:
import time
from time import sleep
import requests

import pandas as pd
import numpy as np
import random

pd.set_option('display.max_colwidth' , 300)

In [2]:
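def red_scrap(title):
    """Download up to 14 pages (100 posts each) of a subreddit's submissions via Pushshift."""
    url = 'https://api.pushshift.io/reddit/search/submission'
    df_load = []
    df = pd.DataFrame()
    params = {
        'subreddit': title,
        'size' : 100,
        'before': None
    }
    for i in range(14):
        # Query the Pushshift API (an archive of Reddit) for the next page of posts
        res = requests.get(url, params=params)
        res.raise_for_status()
        posts = res.json()['data']

        # Stop early once there are no older posts left
        if not posts:
            break

        df_new = pd.DataFrame(posts)
        df_load.append(df_new)

        # Use the timestamp of the last post on this page as the next `before` cursor
        # (.iloc[-1] also works when a page holds fewer than 100 posts)
        params['before'] = int(df_new['created_utc'].iloc[-1])

        # Save everything collected so far to CSV after each page
        df = pd.concat(df_load, ignore_index = True)
        df.to_csv(f'{title}.csv')
        time.sleep(20)
        print(f'{i+1} Iterations completed')

    return df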
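
The `before` cursor logic in `red_scrap` can be tried without any network access. The sketch below is an illustration only: `paginate_before`, `fake_fetch` and `FAKE_POSTS` are hypothetical names that are not part of the notebook, and it assumes that each page is returned newest first, as the scraper above does.

```python
from typing import Callable, Iterator, Optional


def paginate_before(fetch_page: Callable[[Optional[int]], list],
                    max_pages: int = 14) -> Iterator[dict]:
    """Yield posts page by page, moving the `before` cursor backwards in time."""
    before = None  # None asks for the newest posts first
    for _ in range(max_pages):
        posts = fetch_page(before)
        if not posts:
            break  # nothing older is left
        yield from posts
        # The oldest post of this page becomes the cursor for the next page
        before = min(post['created_utc'] for post in posts)


# Hypothetical in-memory stand-in for the real endpoint: ten posts, newest first
FAKE_POSTS = [{'title': f'post {ts}', 'created_utc': ts} for ts in range(10, 0, -1)]


def fake_fetch(before, size=4):
    """Return up to `size` posts strictly older than `before`."""
    older = [p for p in FAKE_POSTS if before is None or p['created_utc'] < before]
    return older[:size]


collected = list(paginate_before(fake_fetch))
print(len(collected))
```

Keeping the cursor handling separate from the HTTP call makes it easier to check that no post is skipped or fetched twice before running the slow, rate-limited download.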

In [3]:
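def date_conversion(df , column):
    """Convert a column of Unix timestamps (seconds) to datetime64 values in local time."""
    # time.localtime uses the machine's time zone, so the result depends on where this runs
    time_value = [
        time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(value))
        for value in df[column]
    ]
    df[column] = pd.to_datetime(time_value)
    return df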
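
Because `date_conversion` goes through `time.localtime`, the same CSV can give different timestamps on machines in different time zones. A possible alternative, shown here only as a sketch, is to let pandas parse the epoch seconds directly as UTC. The name `date_conversion_utc` is hypothetical and not part of the notebook, and the function returns a copy instead of modifying its input.

```python
import pandas as pd


def date_conversion_utc(df, column):
    """Return a copy of df with `column` (epoch seconds) parsed as UTC datetimes."""
    out = df.copy()
    out[column] = pd.to_datetime(out[column], unit='s', utc=True)
    return out


# Tiny made-up example: the epoch itself and exactly one day later
example = pd.DataFrame({'created_utc': [0, 86400]})
converted = date_conversion_utc(example, 'created_utc')
print(converted)
```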

In [4]:
# df_fakenews = red_scrap('fakenews')

In [5]:
# df_politcal_humor = red_scrap('PoliticalHumor')

<a id = 'section_1'></a>
# 1.0 Data Exploration
___
[(back to top)](#content)

In [63]:
df_fakenews = pd.read_csv('fakenews.csv')
df_fakenews = df_fakenews[['title' , 'subreddit' , 'created_utc']]

# Changing datetime format
df_fakenews = date_conversion(df_fakenews , 'created_utc')

df_politcal_humor = pd.read_csv('PoliticalHumor.csv')
df_politcal_humor = df_politcal_humor[['title' , 'subreddit' , 'created_utc']]

# Changing datetime format
df_politcal_humor = date_conversion(df_politcal_humor , 'created_utc')

In [64]:
print(f'No. of Fakenews posts : {len(df_fakenews)}')
print(f'Shape of Fakenews dataset : {df_fakenews.shape}')

print(f'No. of PoliticalHumor posts : {len(df_politcal_humor)}')
print(f'Shape of PoliticalHumor dataset : {df_politcal_humor.shape}')

No. of Fakenews posts : 1400
Shape of Fakenews dataset : (1400, 3)
No. of PoliticalHumor posts : 1400
Shape of PoliticalHumor dataset : (1400, 3)


### 1.0 Checking for Duplicates and Nulls
___

In [65]:
# duplicated() flags whole rows that repeat an earlier row; isnull() counts individual cells
print(f'No. of Duplicate Rows : {df_fakenews.duplicated().sum()}')
print(f'No. of Null Cells : {df_fakenews.isnull().sum().sum()}')

No. of Duplicate Rows : 0
No. of Null Cells : 0
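
The duplicate check above compares entire rows, including `created_utc`, so the same title posted at two different times would not be counted. The sketch below shows one way to check titles alone. The `sample` frame is made-up data for illustration, and whether reposts should be dropped at all is a modelling choice this notebook does not state.

```python
import pandas as pd

# Made-up sample: the first two rows share a title but differ in timestamp
sample = pd.DataFrame({
    'title': ['A', 'A', 'B'],
    'subreddit': ['fakenews'] * 3,
    'created_utc': [1, 2, 3],
})

full_row_dupes = sample.duplicated().sum()                # compares every column
title_dupes = sample.duplicated(subset='title').sum()     # compares titles only
deduped = sample.drop_duplicates(subset='title', keep='first')

print(full_row_dupes, title_dupes, len(deduped))
```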


### 1.1 Checking for Data Leakage
___

In [66]:
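# Note: str.contains treats the pattern as a regular expression, where '%' is a literal
# character rather than a wildcard, so this only matches titles containing the text 'fake%'
df_fakenews['title'].str.contains('fake%').value_counts()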

False    1400
Name: title, dtype: int64

In [67]:
df_politcal_humor['title'].str.contains('humor|politic|fun|laugh').value_counts()

False    1378
True       22
Name: title, dtype: int64

In [68]:
df_fakenews = df_fakenews.loc[~df_fakenews['title'].str.contains('fake')]
df_fakenews.head()

|   | title | subreddit | created_utc |
|---|---|---|---|
| 0 | Magic in a LIVE Broadcast ABC have to see! | fakenews | 2021-08-28 20:52:06 |
| 1 | ABC anchor nominated for a Pulitzer | fakenews | 2021-08-26 23:00:36 |
| 2 | Never forget MK Ultra | fakenews | 2021-08-25 22:16:04 |
| 3 | Actual Story Behind the Men Who Stare at Goats | fakenews | 2021-08-24 23:37:24 |
| 5 | Taliban “declaration of Emirate” | fakenews | 2021-08-21 21:54:16 |
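
The filter above is case-sensitive, so a title containing "Fake" or "Fakenews" with a capital F still passes it (one such title shows up in the random sample printed in section 1.2). A minimal sketch of a case-insensitive variant follows. The `titles` frame is made-up illustration data, and applying this to the real data would change the row counts reported in this notebook.

```python
import pandas as pd

# Made-up titles: two mention the label word in different cases, one does not
titles = pd.DataFrame({'title': [
    'The Fakenews Machine now FULLY Exposed',
    'Never forget MK Ultra',
    'this is fake',
]})

# case=False ignores capitalisation; regex=False treats 'fake' as plain text
mask = titles['title'].str.contains('fake', case=False, regex=False)
kept = titles.loc[~mask]

print(kept)
```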


In [69]:
df_politcal_humor = df_politcal_humor.loc[~df_politcal_humor['title'].str.contains('humor|politic|fun|laugh')]
df_politcal_humor.head()

|   | title | subreddit | created_utc |
|---|---|---|---|
| 0 | Next National Building Project. | PoliticalHumor | 2021-09-02 20:52:36 |
| 1 | I can't believe people accuse the GOP of being do as I say not as I do hypocrites who only care about taking away the rights of others | PoliticalHumor | 2021-09-02 20:40:02 |
| 2 | I’m starting to think someone is just making stuff now up to see how many dumb conservatives they can get rid of | PoliticalHumor | 2021-09-02 20:27:48 |
| 3 | Let's all send wire coat hangers to the Texas State capital: 1100 Congress Ave, Austin, TX 78701. | PoliticalHumor | 2021-09-02 20:27:19 |
| 4 | I would tell them it's ironic but they wouldn't know what that means | PoliticalHumor | 2021-09-02 20:25:19 |


### 1.2 Visualizing Some Random Text
___

In [70]:
len(df_fakenews)

1218

In [71]:
random_sentences = random.sample(df_fakenews['title'].to_list() , 10)
for index , sentence in enumerate(random_sentences):
    print(index , sentence)

0 The Fakenews Machine now FULLY Exposed: Anti-Trump Media Collusion to Endorse &amp; Encouraged Riots, Calling Them "Peaceful", BACKFIRES, They Now Try Place The Blame ON Trump!
1 Reasons Millennials Think News Media is Dividing Our Country!
2 New vid
3 Groundhog Day
4 Do you know the truth about the history of propaganda? (Please see the comments section for more.)
5 Unable to cross reference the author name against other credible news sources, outlets, or history
6 Washington Post reporter misleads on alcohol price. Note that she cropped out most (but not all) of the green drum
7 Did she really say this?
8 CNN Tells Viewers to Take Trump’s ‘Hoax’ Comment ‘How You Wish’
9 Anti-vaxxers and Russia behind viral 5G COVID conspiracy theory


In [72]:
random_sentences = random.sample(df_politcal_humor['title'].to_list() , 10)
for sentence in random_sentences:
    print(sentence)

Our chances look grim
Trump bad
Accurate
Neigh
😶
Democrats have been cleaning up Republican messes for 100 years.
All black drivers will be shot before they can get their phones out
WAWAWEWA
Vanilla ISIS
This is getting disgusting.


# 2.0 Data Cleaning and Preprocessing
___

In [73]:
import nltk
import string
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [20]:
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
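
Since the NLTK stop words are all lowercase and the titles contain HTML entities such as `&amp;`, the cleaning steps could be combined roughly as sketched below. This is an assumption-laden illustration, not the notebook's own method: `clean_title`, `ngrams` and the small `STOP_WORDS` set are hypothetical stand-ins (the notebook uses the full NLTK list), and `string.punctuation` covers ASCII punctuation only, so curly quotes would remain.

```python
import html
import string

# Small illustrative stop-word set (assumption); the notebook uses NLTK's English list
STOP_WORDS = {'the', 'this', 'to', 'now', 'they', 'them', 'on'}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def clean_title(title, stop_words=STOP_WORDS):
    """Unescape HTML entities, lowercase, strip punctuation, split, drop stop words."""
    text = html.unescape(title)        # e.g. '&amp;' becomes '&'
    text = text.lower()                # so 'The' matches the lowercase stop word 'the'
    text = text.translate(PUNCT_TABLE)
    tokens = text.split()              # split on any run of whitespace
    return [token for token in tokens if token not in stop_words]


def ngrams(tokens, n=2):
    """Recombine consecutive tokens into space-joined n-grams."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = clean_title('The snowstorm backlog debunks this article')
print(tokens)
print(ngrams(tokens))
```

Lowercasing before the stop-word lookup is the step that matters most here: without it, capitalised words such as "The" never match the lowercase list.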

In [81]:
'''
1. Standardize each example (usually lowercasing + punctuation stripping)
2. Split each example into substrings (usually words)
3. Recombine substrings into tokens (usually ngrams)
'''


random_sentences = random.sample(df_fakenews['title'].to_list() , 5)


for sentence in random_sentences:
    
    # Print the original title
    print(sentence)

    # Strip punctuation character by character; join with '' so the words stay intact
    no_punct = ''.join([char for char in sentence if char not in string.punctuation])
    print(no_punct)
    
    # Split the original title on spaces
    sent_split = sentence.split(sep = ' ')
    print(sent_split)
    
    # Join back without stopwords
    # (stop_words is all lowercase, so capitalised words such as 'The' are kept)
    joint = ' '.join([word for word in sent_split if word not in (stop_words)])
    print(joint)
    print('-----------------------------------------')
    

The snowstorm backlog debunks this article
The snowstorm backlog debunks this article
['The', 'snowstorm', 'backlog', 'debunks', 'this', 'article']
The snowstorm backlog debunks article
-----------------------------------------
The Fakenews Machine now FULLY Exposed: Anti-Trump Media Collusion to Endorse &amp; Encouraged Riots, Calling Them "Peaceful", BACKFIRES, They Now Try Place The Blame ON Trump!
The Fakenews Machine now FULLY Exposed AntiTrump Media Collusion to Endorse amp Encouraged Riots Calling Them Peaceful BACKFIRES They Now Try Place The Blame ON Trump
['The', 'Fakenews', 'Machine', 'now', 'FULLY', 'Exposed:', 'Anti-Trump', 'Media', 'Collusion', 'to', 'Endorse', '&amp;', 'Encouraged', 'Riots,', 'Calling', 'Them', '"Peaceful",', 'BACKFIRES,', 'They', 'Now', 'Try', 'Place'