# r/politics Data Cleaning

The goal of this notebook is to clean up the datasets "politics_comments.csv" and "politics_posts.csv" and merge them into a single dataset for NLP processing and exploratory data analysis.  The datasets were obtained using SQL in Google BigQuery.

# Set-up

We start by importing the usual tools for cleaning.

In [1]:
# import libraries
import pandas as pd
import numpy as np
from datetime import datetime

# Datasets

Next, we import the datasets to be cleaned: "politics_comments.csv" and "politics_posts.csv".  This might take a while considering that "politics_comments.csv" contains over 2 million rows of data, much of it unstructured.

In [92]:
# import "politics_comments.csv"
# dtype is specified for certain columns as they contain mixed data
comments = pd.read_csv("politics_comments.csv", dtype = {"author_flair_text":str, "distinguished":str,
                                                        "author_flair_css_class":str})

comments.head()

Unnamed: 0,body,score_hidden,archived,name,author,author_flair_text,downs,created_utc,subreddit_id,link_id,parent_id,score,retrieved_on,controversiality,gilded,id,subreddit,ups,distinguished,author_flair_css_class
0,If I were on the national teevee talking stupi...,,False,,zoso4evr,:flag-al: Alabama,,1539649137,t5_2cneq,t3_9ohusq,t1_e7u9cye,24,1541093805,0,0,e7uabgp,politics,,,
1,I guess the real question was why it was red i...,,False,,Nano_Burger,:flag-va: Virginia,,1539734759,t5_2cneq,t3_9ot1eb,t3_9ot1eb,64,1541131825,0,0,e7wjmyn,politics,,,virginia-flag
2,Well when Bush Sr. finally dies I'll log in he...,,False,,occupybostonfriend,:flag-ms: Mississippi,,1538786272,t5_2cneq,t3_9lrfkd,t1_e78vi65,38,1540733575,0,0,e78yfkf,politics,,,mississippi-flag
3,Republicans: Bah. We'll just make shit up and ...,,False,,thekozmicpig,:flag-ct: Connecticut,,1538618402,t5_2cneq,t3_9l7giq,t1_e74mg5c,20,1540661004,0,0,e74n20g,politics,,,
4,I'd blame the Democrats &amp; social media. Th...,,False,,voter45,:flag-mi: Michigan,,1539735974,t5_2cneq,t3_9ot02w,t3_9ot02w,-7,1541132466,0,0,e7wl0b2,politics,,,michigan-flag


In [29]:
# import "politics_posts.csv"
# dtypes are again specified because these columns have mixed data types
posts = pd.read_csv("politics_posts.csv", dtype = {"link_flair_css_class":str, "link_flair_text":str})

posts.head()

Unnamed: 0,created_utc,subreddit,author,domain,url,num_comments,score,ups,downs,title,...,author_flair_css_class,archived,is_self,from_id,permalink,name,author_flair_text,quarantine,link_flair_text,distinguished
0,1538377404,politics,1Os,yahoo.com,https://www.yahoo.com/news/millionaire-sen-chu...,747,17484,,,Millionaire Sen. Chuck Grassley Applying For T...,...,,False,False,,/r/politics/comments/9ke4ye/millionaire_sen_ch...,,,False,,
1,1538390265,politics,1Os,yahoo.com,https://www.yahoo.com/news/senate-gop-apos-out...,44,0,,,Senate GOP's Outside Counsel Says ‘Reasonable ...,...,,False,False,,/r/politics/comments/9kf7lt/senate_gops_outsid...,,,False,,
2,1540117790,politics,1Os,yahoo.com,https://www.yahoo.com/news/paul-manafort-appea...,34,6,,,Paul Manafort Appears In Wheelchair At Court H...,...,,False,False,,/r/politics/comments/9q2jpy/paul_manafort_appe...,,,False,,
3,1540946192,politics,1Os,yahoo.com,https://www.yahoo.com/news/mueller-refers-plot...,19,87,,,Mueller refers plot to make false claims about...,...,,False,False,,/r/politics/comments/9stv7b/mueller_refers_plo...,,,False,,
4,1540115953,politics,1Os,yahoo.com,https://www.yahoo.com/news/trump-says-us-put-i...,12,37,,,Trump says US will pull out of intermediate ra...,...,,False,False,,/r/politics/comments/9q2eom/trump_says_us_will...,,,False,,


# Cleaning Comments

Now we clean up the data in the comments dataframe.  We need to remove columns that don't contribute to our analysis, remove comments that are useless (e.g. "[deleted]" which gives us nothing to work with as it is essentially a null comment) and clean up actual null data in the dataframe.

We start by dropping the columns that are of no interest to us.

In [93]:
# create a list containing everything to be dropped
drop_list_comments = ["archived", "name", "author", "author_flair_text", "downs", "retrieved_on", "id", "subreddit",
                     "subreddit_id", "author_flair_css_class", "ups", "score_hidden"]

# drop the columns
comments.drop(drop_list_comments, axis = 1, inplace = True)

comments.head()  # now we have a nice and compact dataset

Unnamed: 0,body,created_utc,link_id,parent_id,score,controversiality,gilded,distinguished
0,If I were on the national teevee talking stupi...,1539649137,t3_9ohusq,t1_e7u9cye,24,0,0,
1,I guess the real question was why it was red i...,1539734759,t3_9ot1eb,t3_9ot1eb,64,0,0,
2,Well when Bush Sr. finally dies I'll log in he...,1538786272,t3_9lrfkd,t1_e78vi65,38,0,0,
3,Republicans: Bah. We'll just make shit up and ...,1538618402,t3_9l7giq,t1_e74mg5c,20,0,0,
4,I'd blame the Democrats &amp; social media. Th...,1539735974,t3_9ot02w,t3_9ot02w,-7,0,0,


As a reddit user, we know that some comments become "[deleted]" as someone might post a comment and then in regret, choose to delete it. "[removed]" occurs when a moderator doesn't like your comment and chooses to remove it. In both cases, these comments are useless to us because not only is the comment itself missing, but other data related to the comment, such as the score, is also suspect.  Hence, we remove the whole row.

In [95]:
# remove colums containing [deleted] and/or [removed]
comments = comments[comments.body != "[deleted]"]
comments = comments[comments.body != "[removed]"]

# check to see if it worked
comments.loc[comments["body"] == "[deleted]"]

Unnamed: 0,body,created_utc,link_id,parent_id,score,controversiality,gilded,distinguished


In [96]:
# check for [removed] as well
comments.loc[comments["body"] == "[removed]"]

Unnamed: 0,body,created_utc,link_id,parent_id,score,controversiality,gilded,distinguished


In [97]:
# reset the index for ease of use later
comments.reset_index();

Next, we take a look at the "distinguished" column to see what is in it and whether we can simplify it or not.

In [98]:
comments["distinguished"].unique()

array([nan, 'moderator'], dtype=object)

We can see that the column basically shows whether a comment was made by a moderator or not. We can analyze this variable later but for now, we need to replace the null values, the value that shows that the comment was not made by a moderator, with a value such as 0 and replace the "moderator" with another value such as 1. This effectively means that we dummy code our data.

In [99]:
# change all the values with the map method in pandas
comments["distinguished"] = comments["distinguished"].map({"moderator": 1, None: 0})

In [100]:
# check if this was done correctly
comments["distinguished"].unique()

array([0, 1], dtype=int64)

We also rename the column to "is_moderator_comments" to make it easier to remember what the values in the column represent and whether it came from the comments dataframe or the posts dataframe.

In [101]:
# rename the column
comments.rename(columns = {"distinguished":"is_moderator_comments"}, inplace = True)

comments.head()

Unnamed: 0,body,created_utc,link_id,parent_id,score,controversiality,gilded,is_moderator_comments
0,If I were on the national teevee talking stupi...,1539649137,t3_9ohusq,t1_e7u9cye,24,0,0,0
1,I guess the real question was why it was red i...,1539734759,t3_9ot1eb,t3_9ot1eb,64,0,0,0
2,Well when Bush Sr. finally dies I'll log in he...,1538786272,t3_9lrfkd,t1_e78vi65,38,0,0,0
3,Republicans: Bah. We'll just make shit up and ...,1538618402,t3_9l7giq,t1_e74mg5c,20,0,0,0
4,I'd blame the Democrats &amp; social media. Th...,1539735974,t3_9ot02w,t3_9ot02w,-7,0,0,0


Next, we have to convert the values in the "created_utc" from a POSIX timestamp to something human readable.

In [103]:
# convert the timestamps using the datetime module
comments["created_utc"] = comments["created_utc"].apply(datetime.utcfromtimestamp)

# check to see if it worked
comments.head()

Unnamed: 0,body,created_utc,link_id,parent_id,score,controversiality,gilded,is_moderator_comments
0,If I were on the national teevee talking stupi...,2018-10-16 00:18:57,t3_9ohusq,t1_e7u9cye,24,0,0,0
1,I guess the real question was why it was red i...,2018-10-17 00:05:59,t3_9ot1eb,t3_9ot1eb,64,0,0,0
2,Well when Bush Sr. finally dies I'll log in he...,2018-10-06 00:37:52,t3_9lrfkd,t1_e78vi65,38,0,0,0
3,Republicans: Bah. We'll just make shit up and ...,2018-10-04 02:00:02,t3_9l7giq,t1_e74mg5c,20,0,0,0
4,I'd blame the Democrats &amp; social media. Th...,2018-10-17 00:26:14,t3_9ot02w,t3_9ot02w,-7,0,0,0


And with that, we have a readable set of timestamps in the "created_utc" column.

Lastly, we have to check the dataset for missing data. Missing data with hinder our analysis later on.

In [104]:
# missing data check
comments.apply(lambda x: sum(x.isnull()), axis=0)

body                     30
created_utc               0
link_id                   0
parent_id                 0
score                     0
controversiality          0
gilded                    0
is_moderator_comments     0
dtype: int64

Null data in the body is again kind of pointless to us so we remove those rows.

In [105]:
# remove the rows from the dataset
comments = comments[comments.body.notnull()]

In [106]:
# 2nd missing data check to make sure that there is no more
comments.apply(lambda x: sum(x.isnull()), axis=0)

body                     0
created_utc              0
link_id                  0
parent_id                0
score                    0
controversiality         0
gilded                   0
is_moderator_comments    0
dtype: int64

And now this is our final dataset.

In [107]:
# couple rows of the finalized dataset
comments.head()

Unnamed: 0,body,created_utc,link_id,parent_id,score,controversiality,gilded,is_moderator_comments
0,If I were on the national teevee talking stupi...,2018-10-16 00:18:57,t3_9ohusq,t1_e7u9cye,24,0,0,0
1,I guess the real question was why it was red i...,2018-10-17 00:05:59,t3_9ot1eb,t3_9ot1eb,64,0,0,0
2,Well when Bush Sr. finally dies I'll log in he...,2018-10-06 00:37:52,t3_9lrfkd,t1_e78vi65,38,0,0,0
3,Republicans: Bah. We'll just make shit up and ...,2018-10-04 02:00:02,t3_9l7giq,t1_e74mg5c,20,0,0,0
4,I'd blame the Democrats &amp; social media. Th...,2018-10-17 00:26:14,t3_9ot02w,t3_9ot02w,-7,0,0,0


# Cleaning Posts

We clean up posts so that when we combine the datasets, there won't be an absurd number of columns to look through. Thankfully, there isn't as much to clean up here.

In [30]:
# construct a list containing all columns to be dropped
drop_list_posts = ["author", "url", "ups", "downs", "selftext", "from_kind", "from", "thumbnail", "subreddit", "hide_score",
                  "link_flair_css_class", "author_flair_css_class", "archived", "is_self", "from_id", "permalink",
                  "author_flair_text", "quarantine", "link_flair_text", "retrieved_on", "subreddit_id", "saved", "name"]

# drop the columns
posts.drop(drop_list_posts, axis = 1, inplace = True)

# view the remainder
posts.head()

Unnamed: 0,created_utc,domain,num_comments,score,title,id,gilded,stickied,over_18,distinguished
0,1538377404,yahoo.com,747,17484,Millionaire Sen. Chuck Grassley Applying For T...,9ke4ye,1,False,False,
1,1538390265,yahoo.com,44,0,Senate GOP's Outside Counsel Says ‘Reasonable ...,9kf7lt,0,False,False,
2,1540117790,yahoo.com,34,6,Paul Manafort Appears In Wheelchair At Court H...,9q2jpy,0,False,False,
3,1540946192,yahoo.com,19,87,Mueller refers plot to make false claims about...,9stv7b,0,False,False,
4,1540115953,yahoo.com,12,37,Trump says US will pull out of intermediate ra...,9q2eom,0,False,False,


Next, we check the "distinguished" column again for anything interesting and clean up.

In [31]:
# check the present values
posts["distinguished"].unique()

array([nan, 'moderator'], dtype=object)

In [32]:
# change all the values with the map method in pandas
posts["distinguished"] = posts["distinguished"].map({"moderator": 1, None: 0})

In [33]:
# check the make sure the change went through
posts["distinguished"].unique()

array([0, 1], dtype=int64)

In [34]:
# rename the column, we use "is_moderator_posts" so that when we merge the datasets, we know which one
# corresponds to which dataset
posts.rename(columns = {"distinguished":"is_moderator_posts"}, inplace = True)

posts.head()

Unnamed: 0,created_utc,domain,num_comments,score,title,id,gilded,stickied,over_18,is_moderator_posts
0,1538377404,yahoo.com,747,17484,Millionaire Sen. Chuck Grassley Applying For T...,9ke4ye,1,False,False,0
1,1538390265,yahoo.com,44,0,Senate GOP's Outside Counsel Says ‘Reasonable ...,9kf7lt,0,False,False,0
2,1540117790,yahoo.com,34,6,Paul Manafort Appears In Wheelchair At Court H...,9q2jpy,0,False,False,0
3,1540946192,yahoo.com,19,87,Mueller refers plot to make false claims about...,9stv7b,0,False,False,0
4,1540115953,yahoo.com,12,37,Trump says US will pull out of intermediate ra...,9q2eom,0,False,False,0


Next, have to make the "created_utc" column human readable too.

In [35]:
# convert the timestamps using the datetime module
posts["created_utc"] = posts["created_utc"].apply(datetime.utcfromtimestamp)

In [36]:
# check that it went through
posts.head()

Unnamed: 0,created_utc,domain,num_comments,score,title,id,gilded,stickied,over_18,is_moderator_posts
0,2018-10-01 07:03:24,yahoo.com,747,17484,Millionaire Sen. Chuck Grassley Applying For T...,9ke4ye,1,False,False,0
1,2018-10-01 10:37:45,yahoo.com,44,0,Senate GOP's Outside Counsel Says ‘Reasonable ...,9kf7lt,0,False,False,0
2,2018-10-21 10:29:50,yahoo.com,34,6,Paul Manafort Appears In Wheelchair At Court H...,9q2jpy,0,False,False,0
3,2018-10-31 00:36:32,yahoo.com,19,87,Mueller refers plot to make false claims about...,9stv7b,0,False,False,0
4,2018-10-21 09:59:13,yahoo.com,12,37,Trump says US will pull out of intermediate ra...,9q2eom,0,False,False,0


Finally, a missing data check for the posts dataframe.

In [37]:
# missing data check
posts.apply(lambda x: sum(x.isnull()), axis=0)

created_utc           0
domain                0
num_comments          0
score                 0
title                 0
id                    0
gilded                0
stickied              0
over_18               0
is_moderator_posts    0
dtype: int64

Our final posts dataset.

In [108]:
# a couple columns of our finalized dataset
posts.head()

Unnamed: 0,created_utc,domain,num_comments,score,title,id,gilded,stickied,over_18,is_moderator_posts
0,2018-10-01 07:03:24,yahoo.com,747,17484,Millionaire Sen. Chuck Grassley Applying For T...,9ke4ye,1,False,False,0
1,2018-10-01 10:37:45,yahoo.com,44,0,Senate GOP's Outside Counsel Says ‘Reasonable ...,9kf7lt,0,False,False,0
2,2018-10-21 10:29:50,yahoo.com,34,6,Paul Manafort Appears In Wheelchair At Court H...,9q2jpy,0,False,False,0
3,2018-10-31 00:36:32,yahoo.com,19,87,Mueller refers plot to make false claims about...,9stv7b,0,False,False,0
4,2018-10-21 09:59:13,yahoo.com,12,37,Trump says US will pull out of intermediate ra...,9q2eom,0,False,False,0


# Combining the datasets

Comments and posts share "link_id" from comments and "id" from posts.  We can make use of this to combine the two datasets together into one for our analysis later on.

One problem, however, is that there is a few extra characters in front of the "link_id" (e.g. "t1_") that make them different from "id".  These are of consistent sizes, so we can remove the first three characters of each of the strings in "link_id" to make them the same as the "id" from posts.

In [109]:
# remove the first 3 characters of every string in "parent_id"
comments["link_id"] = comments["link_id"].str[3:]

# check please
comments.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Unnamed: 0,body,created_utc,link_id,parent_id,score,controversiality,gilded,is_moderator_comments
0,If I were on the national teevee talking stupi...,2018-10-16 00:18:57,9ohusq,t1_e7u9cye,24,0,0,0
1,I guess the real question was why it was red i...,2018-10-17 00:05:59,9ot1eb,t3_9ot1eb,64,0,0,0
2,Well when Bush Sr. finally dies I'll log in he...,2018-10-06 00:37:52,9lrfkd,t1_e78vi65,38,0,0,0
3,Republicans: Bah. We'll just make shit up and ...,2018-10-04 02:00:02,9l7giq,t1_e74mg5c,20,0,0,0
4,I'd blame the Democrats &amp; social media. Th...,2018-10-17 00:26:14,9ot02w,t3_9ot02w,-7,0,0,0


Now, we can double check to make sure that the new values in "link_id" exist in "id" or else we will not be able to actually merge the dataframes.

In [110]:
# "9ke4ye" comes from the "id" in the first row of posts
comments.loc[comments["link_id"] == "9ke4ye"]

Unnamed: 0,body,created_utc,link_id,parent_id,score,controversiality,gilded,is_moderator_comments
1341,Vote for people who aren’t traitors.,2018-10-01 11:17:31,9ke4ye,t1_e6yg2ep,30,0,0,0
1775,[J. D. Scholten](https://www.scholten4iowa.com...,2018-10-01 12:25:01,9ke4ye,t1_e6yicp3,144,0,0,0
5109,It's almost like the government is not working...,2018-10-01 13:11:32,9ke4ye,t1_e6yr5d2,224,0,0,0
8585,Both are not seemingly shitty IMO. Not only do...,2018-10-01 13:28:45,9ke4ye,t1_e6yr4hv,32,0,0,0
21238,"Yep, that's my number one objective. We're tot...",2018-10-01 17:48:53,9ke4ye,t1_e6z9icq,1,0,0,0
21609,you will find you are at the extreme low end o...,2018-10-02 15:59:07,9ke4ye,t1_e70cvyr,1,0,0,0
21625,Sounds like you answered your own question.,2018-10-01 13:36:23,9ke4ye,t1_e6yt3av,1,0,0,0
22249,But you're not a poor farmer! Jesus christ!,2018-10-01 18:19:46,9ke4ye,t3_9ke4ye,1,0,0,0
22286,"You can campaign all you want, just without sp...",2018-10-01 15:15:54,9ke4ye,t1_e6yz8n4,1,0,0,0
24175,Should people in public office have to divest ...,2018-10-01 15:33:30,9ke4ye,t3_9ke4ye,1,0,0,0


Since we can see that they do, in fact, exist, we can move onto actually merging the dataframes.

In [111]:
# merging dataframess comments and posts
merged_df = pd.merge(comments, posts, left_on = "link_id", right_on = "id", how = "inner")

merged_df.head()

Unnamed: 0,body,created_utc_x,link_id,parent_id,score_x,controversiality,gilded_x,is_moderator_comments,created_utc_y,domain,num_comments,score_y,title,id,gilded_y,stickied,over_18,is_moderator_posts
0,If I were on the national teevee talking stupi...,2018-10-16 00:18:57,9ohusq,t1_e7u9cye,24,0,0,0,2018-10-15 23:02:36,thedailybeast.com,308,821,Fox News Host Tucker Carlson ‘Can’t Really Go’...,9ohusq,0,False,False,0
1,I only worry this was a setup by Tucker. I mea...,2018-10-16 00:01:48,9ohusq,t1_e7u6zec,25,0,0,0,2018-10-15 23:02:36,thedailybeast.com,308,821,Fox News Host Tucker Carlson ‘Can’t Really Go’...,9ohusq,0,False,False,0
2,When you prove yourself unworthy of civil soci...,2018-10-15 23:04:48,9ohusq,t3_9ohusq,395,0,0,0,2018-10-15 23:02:36,thedailybeast.com,308,821,Fox News Host Tucker Carlson ‘Can’t Really Go’...,9ohusq,0,False,False,0
3,No one needs to actually express that sentimen...,2018-10-15 23:58:40,9ohusq,t1_e7u859g,32,0,0,0,2018-10-15 23:02:36,thedailybeast.com,308,821,Fox News Host Tucker Carlson ‘Can’t Really Go’...,9ohusq,0,False,False,0
4,The fringe right is going to see this as an op...,2018-10-15 23:03:34,9ohusq,t3_9ohusq,-7,0,0,0,2018-10-15 23:02:36,thedailybeast.com,308,821,Fox News Host Tucker Carlson ‘Can’t Really Go’...,9ohusq,0,False,False,0


# Housekeeping

Now that we have merged the datasets, we can do some house keeping to make the dataframe smaller and better labeled.

We know that any column with "x" at the end belongs to the comments dataset and anything with "y" at the end belongs to the posts dataset.  But to make sure, we rename them.

In [115]:
# construct a dictionary with all the name changes we want
name_change = {"created_utc_x":"created_utc_comments", "score_x":"score_comments", "gilded_x":"gilded_comments",
              "created_utc_y":"created_utc_posts", "score_y":"score_posts", "gilded_y":"gilded_posts"}

# make the changes
merged_df.rename(columns = name_change, inplace = True)

# check the dataset
merged_df.head()

Unnamed: 0,body,created_utc_comments,link_id,parent_id,score_comments,controversiality,gilded_comments,is_moderator_comments,created_utc_posts,domain,num_comments,score_posts,title,id,gilded_posts,stickied,over_18,is_moderator_posts
0,If I were on the national teevee talking stupi...,2018-10-16 00:18:57,9ohusq,t1_e7u9cye,24,0,0,0,2018-10-15 23:02:36,thedailybeast.com,308,821,Fox News Host Tucker Carlson ‘Can’t Really Go’...,9ohusq,0,False,False,0
1,I only worry this was a setup by Tucker. I mea...,2018-10-16 00:01:48,9ohusq,t1_e7u6zec,25,0,0,0,2018-10-15 23:02:36,thedailybeast.com,308,821,Fox News Host Tucker Carlson ‘Can’t Really Go’...,9ohusq,0,False,False,0
2,When you prove yourself unworthy of civil soci...,2018-10-15 23:04:48,9ohusq,t3_9ohusq,395,0,0,0,2018-10-15 23:02:36,thedailybeast.com,308,821,Fox News Host Tucker Carlson ‘Can’t Really Go’...,9ohusq,0,False,False,0
3,No one needs to actually express that sentimen...,2018-10-15 23:58:40,9ohusq,t1_e7u859g,32,0,0,0,2018-10-15 23:02:36,thedailybeast.com,308,821,Fox News Host Tucker Carlson ‘Can’t Really Go’...,9ohusq,0,False,False,0
4,The fringe right is going to see this as an op...,2018-10-15 23:03:34,9ohusq,t3_9ohusq,-7,0,0,0,2018-10-15 23:02:36,thedailybeast.com,308,821,Fox News Host Tucker Carlson ‘Can’t Really Go’...,9ohusq,0,False,False,0


Additionally, we kept columns such as "id" and "link_id" in order to merge the dataframes because they are the common columns in each of them.  We can get rid of them now as they will tell us nothing in the analysis.

In [116]:
# list of columns to remove
drop_list_merged = ["link_id", "parent_id", "id"]

# remove the columns
merged_df.drop(drop_list_merged, axis = 1, inplace = True)

# our final merged_df
merged_df.head()

Unnamed: 0,body,created_utc_comments,score_comments,controversiality,gilded_comments,is_moderator_comments,created_utc_posts,domain,num_comments,score_posts,title,gilded_posts,stickied,over_18,is_moderator_posts
0,If I were on the national teevee talking stupi...,2018-10-16 00:18:57,24,0,0,0,2018-10-15 23:02:36,thedailybeast.com,308,821,Fox News Host Tucker Carlson ‘Can’t Really Go’...,0,False,False,0
1,I only worry this was a setup by Tucker. I mea...,2018-10-16 00:01:48,25,0,0,0,2018-10-15 23:02:36,thedailybeast.com,308,821,Fox News Host Tucker Carlson ‘Can’t Really Go’...,0,False,False,0
2,When you prove yourself unworthy of civil soci...,2018-10-15 23:04:48,395,0,0,0,2018-10-15 23:02:36,thedailybeast.com,308,821,Fox News Host Tucker Carlson ‘Can’t Really Go’...,0,False,False,0
3,No one needs to actually express that sentimen...,2018-10-15 23:58:40,32,0,0,0,2018-10-15 23:02:36,thedailybeast.com,308,821,Fox News Host Tucker Carlson ‘Can’t Really Go’...,0,False,False,0
4,The fringe right is going to see this as an op...,2018-10-15 23:03:34,-7,0,0,0,2018-10-15 23:02:36,thedailybeast.com,308,821,Fox News Host Tucker Carlson ‘Can’t Really Go’...,0,False,False,0


# Write to .csv

Finally, we write a new .csv file with the cleaned data.  This lets us import the cleaned data into a new notebook so that we will not have to continuously rerun the cleaning code if we need to reset the dataset later on.

In [117]:
# export the new merged_df dataset
merged_df.to_csv("cleaned_r_politics.csv", encoding = "utf-8")

And with that, we have cleaned 2 seperate datasets and merged them together into a single one!