# Imports

In [1]:
import pandas as pd

df_praw = pd.read_csv('../data/PRAW.csv')
df_reddit_api = pd.read_csv('../data/Reddit_Api.csv')

# 1. Merge the two CSV files by joining PRAW to Reddit Api (deduplicating repeated posts)

## Inner join the two data frames based on their creation time and title

In [2]:
pd.merge(df_praw, df_reddit_api, how='inner', left_on=['created','title'], right_on = ['create_time','title'])

Unnamed: 0,Unnamed: 0_x,title,score_x,id_x,url,comms_num,created,body,Unnamed: 0_y,create_time,downs,id_y,kind,score_y,selftext,subreddit,ups,upvote_ratio,image


The inner join returns no rows: no post in `df_praw` matches a post in `df_reddit_api` on both creation time and title.
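
To see which frame the unmatched rows come from, one option is an outer merge with `indicator=True`, which adds a `_merge` column labelling each row as `left_only`, `right_only`, or `both`. This is only a sketch: `left` and `right` are small hypothetical stand-ins for `df_praw` and `df_reddit_api`, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for df_praw and df_reddit_api (not the real data).
left = pd.DataFrame({"created": [1.0, 2.0], "title": ["a", "b"]})
right = pd.DataFrame({"create_time": [2.0, 3.0], "title": ["b", "c"]})

# An outer merge keeps every row; the `_merge` column records where it came from.
merged = pd.merge(
    left,
    right,
    how="outer",
    left_on=["created", "title"],
    right_on=["create_time", "title"],
    indicator=True,
)

# Count rows found only on the left, only on the right, or in both frames.
origin_counts = merged["_merge"].value_counts().to_dict()
print(origin_counts)
```

On the real frames, a `both` count of zero would agree with the empty inner join above.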

## Deduplicate repetitive posts

In [3]:
len(df_praw)

950

In [5]:
len(df_praw.drop_duplicates())

950

`df_praw` has no fully duplicated rows: the length is still 950 after `drop_duplicates()`.

In [6]:
len(df_reddit_api)

998

In [7]:
len(df_reddit_api.drop_duplicates())

998

`df_reddit_api` also has no fully duplicated rows: the length is still 998.
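
One caveat: `drop_duplicates()` with no arguments compares every column, including `Unnamed: 0`, which appears to be a saved row counter. If that column is unique per row, a repeated post can never be flagged. A minimal sketch of the difference, assuming the post `id` is the right key to deduplicate on; `posts` is a hypothetical stand-in, not the real data:

```python
import pandas as pd

# Hypothetical stand-in: the same post saved twice under different row numbers.
posts = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "id": ["a1", "b2", "a1"],
    "title": ["first", "second", "first"],
})

# Comparing all columns: the row counter makes every row look unique.
all_columns = len(posts.drop_duplicates())

# Comparing only the post id: the repeated post is dropped.
by_post_id = len(posts.drop_duplicates(subset="id"))

print(all_columns, by_post_id)
```

Running `drop_duplicates(subset='id')` on `df_praw` and `df_reddit_api` would be a stricter version of the check above.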

# 2. Use datetime to convert the Unix (UTC) timestamps to readable timestamps

In [11]:
from datetime import datetime, timezone

def convertTime(row):
    # Use whichever creation-time column this DataFrame has.
    if 'created' in row:
        ts = int(row['created'])
    elif 'create_time' in row:
        ts = int(row['create_time'])
    else:
        return None

    # If you encounter a "year is out of range" error, the timestamp
    # may be in milliseconds; try `ts /= 1000` in that case.
    # Interpret ts as UTC (datetime.utcfromtimestamp is deprecated since Python 3.12).
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')

In [14]:
df_praw['readable timestamp'] = df_praw.apply(convertTime, axis=1)

In [15]:
df_reddit_api['readable timestamp'] = df_reddit_api.apply(convertTime, axis=1)
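
As an alternative to the row-wise `apply`, pandas can convert the whole column at once with `pd.to_datetime(..., unit='s')`. The sketch below assumes the column holds Unix seconds; `sample` is a hypothetical one-row stand-in, not the real data.

```python
import pandas as pd

# Hypothetical stand-in with a single `created` value in Unix seconds.
sample = pd.DataFrame({"created": [1615437587.0]})

# Convert the whole column at once, then format it as a string.
sample["readable timestamp"] = (
    pd.to_datetime(sample["created"], unit="s", utc=True)
    .dt.strftime("%Y-%m-%d %H:%M:%S")
)

print(sample["readable timestamp"].iloc[0])  # 2021-03-11 04:39:47
```

The vectorized form avoids calling a Python function once per row. Missing values become `NaT` during conversion rather than raising, as `int()` would on a `NaN`.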

In [16]:
df_praw

Unnamed: 0.1,Unnamed: 0,title,score,id,url,comms_num,created,body,readable timestamp
0,0,Is there a subreddit to liberal memes?,1,m27vku,https://www.reddit.com/r/Liberal/comments/m27v...,4,1.615438e+09,I tried to follow leftist memes but they trash...,2021-03-11 04:39:47
1,1,Over 700 Complaints About NYPD Officers Abusin...,39,m232se,https://www.propublica.org/article/over-700-co...,1,1.615426e+09,,2021-03-11 01:31:05
2,2,Biden’s signature won’t appear on stimulus che...,188,m229ns,https://www.independent.co.uk/news/world/ameri...,33,1.615424e+09,,2021-03-11 01:00:46
3,3,Senate revs its confirmation engine to fill Bi...,5,m1zkb4,https://www.politico.com/news/2021/03/10/biden...,2,1.615418e+09,,2021-03-10 23:08:17
4,4,Arkansas Governor Signs Bill Banning All Elect...,13,m1ogqh,https://lawandcrime.com/abortion/arkansas-gove...,3,1.615376e+09,,2021-03-10 11:39:26
...,...,...,...,...,...,...,...,...,...
945,945,Ex-CIA Director Spoke to Mueller About Flynn's...,2,ii504f,https://www.nbcnews.com/news/amp/ncna815176?__...,1,1.598641e+09,,2020-08-28 18:59:30
946,946,What would your mom say if you wanted to go a ...,46,ihrkb5,https://www.reddit.com/r/Liberal/comments/ihrk...,34,1.598585e+09,I asked my RW mother would she of let me go to...,2020-08-28 03:21:30
947,947,I wish I could sit in on the phone call betwee...,187,ihn6mj,https://www.reddit.com/r/Liberal/comments/ihn6...,41,1.598571e+09,Jared Kushner On NBA Strike: They're 'Fortunat...,2020-08-27 23:33:30
948,948,How do you guys think protests will affect pre...,3,ihb90y,https://www.reddit.com/r/Liberal/comments/ihb9...,15,1.598518e+09,"On a national level, there is little doubt in...",2020-08-27 08:47:22


In [17]:
df_reddit_api

Unnamed: 0.1,Unnamed: 0,create_time,downs,id,kind,score,selftext,subreddit,title,ups,upvote_ratio,image,readable timestamp
0,0,1.615450e+09,0.0,m2cto5,t3,1.0,Discussion of using Python in a professional e...,Python,Thursday Daily Thread: Python careers!,1.0,1.00,,2021-03-11 08:00:15
1,1,1.615449e+09,0.0,m2cgm5,t3,1.0,"Hey everyone, wanted to share a program that I...",Python,Auto Export Program For Thinkorswim,1.0,1.00,https://external-preview.redd.it/2INSsxyCpFUCR...,2021-03-11 07:42:58
2,2,1.615447e+09,0.0,m2bqbm,t3,2.0,There was a post on here last week (pardon me ...,Python,Where do python users tend to fall on the OOP ...,2.0,1.00,,2021-03-11 07:10:51
3,3,1.615446e+09,0.0,m2bmgz,t3,1.0,,Python,R Programming For Absolute Beginners,1.0,0.56,https://external-preview.redd.it/fEfeDmwPioL2b...,2021-03-11 07:06:09
4,4,1.615445e+09,0.0,m2b5co,t3,1.0,Our team has been working to make a self-servi...,Python,"An affordable, self-service data discovery tool",1.0,0.67,,2021-03-11 06:45:17
...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,993,1.612924e+09,0.0,lg9o7s,t3,2.0,,Python,"Bulk read multiple CSVs in Python - Pandas, Da...",2.0,0.75,https://external-preview.redd.it/77sSz9HSFKFc4...,2021-02-10 02:20:52
994,994,1.612923e+09,0.0,lg9g9h,t3,9.0,,Python,Python Inner Functions: What Are They Good For...,9.0,0.72,https://external-preview.redd.it/mnHU8G30_BXQg...,2021-02-10 02:11:29
995,995,1.612920e+09,0.0,lg8d5y,t3,3.0,,Python,How to Setup PyCharm for Maya Scripting with A...,3.0,1.00,https://external-preview.redd.it/fEPxjJFJ0WgNI...,2021-02-10 01:26:03
996,996,1.612920e+09,0.0,lg84v3,t3,66.0,[https://github.com/umutseven92/apyr](https://...,Python,I've built a Python + FastAPI project to mock ...,66.0,0.95,https://external-preview.redd.it/sw3__0VhrYUNQ...,2021-02-10 01:16:04
