# Data Wrangling for subreddit post data

1. Use data from a subreddit group with these attributes: title, score, id, subreddit, url, number of comments, body, number of subscribers, and created date.
2. Create a post type column by looking at the body and url. If the body has text, the post type is **text**. If the url contains png or jpg, the type is **photo**. Otherwise, the type is **video** (there is one from TikTok!).
3. Make sure a folder named Datasets exists in this project directory.
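
Step 3 can also be handled in code rather than by hand. The following is a minimal sketch, assuming the notebook runs from the project directory; the helper name `ensure_datasets_dir` is introduced here for illustration and is not part of the original notebook.

```python
from pathlib import Path


def ensure_datasets_dir(project_dir="."):
    """Create <project_dir>/Datasets if it is missing and return its path.

    Assumption: project_dir is the directory the notebook runs from.
    """
    datasets_dir = Path(project_dir) / "Datasets"
    # exist_ok=True makes the call safe to repeat when the folder already exists.
    datasets_dir.mkdir(parents=True, exist_ok=True)
    return datasets_dir
```

Calling `ensure_datasets_dir()` once before the first `read_csv` or `to_csv` is enough; repeated calls do nothing if the folder is already there.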

In [5]:
import pandas as pd

In [2]:
data = pd.read_csv("./Datasets/group_posts.csv",index_col=0)
data.head()

Unnamed: 0,title,score,id,subreddit,url,num_comments,body,num_subscribers,created_date
0,Our most-broken and least-understood rules is ...,1593,doqwow,depression,https://www.reddit.com/r/depression/comments/d...,109,We understand that most people who reply immed...,603300,2019-10-29
1,Regular Check-In Post,151,exo6f1,depression,https://www.reddit.com/r/depression/comments/e...,862,Welcome to /r/depression's check-in post - a p...,603300,2020-02-02
2,Does anyone else just wanna start new,595,f4re4h,depression,https://www.reddit.com/r/depression/comments/f...,109,I just want to move to a town where no one kno...,603300,2020-02-16
3,I wanna get sick for a few weeks to catch a break,230,f4u8i2,depression,https://www.reddit.com/r/depression/comments/f...,30,This is prob really messed up but i kind of ju...,603300,2020-02-17
4,I don't want you to ask me if I am feeling bet...,39,f4uix8,depression,https://www.reddit.com/r/depression/comments/f...,2,"The moment you ask me that, I automatically fe...",603300,2020-02-17


## Check your dataframe
Uncomment the lines below to explore the data, for example to see which posts have the most comments or the highest score.

In [1]:
#data.sort_values(by=['num_comments'], ascending = False )

In [2]:
#data.sort_values(by=['score'], ascending = False )
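
Besides sorting, counting missing values per column is a quick check that matters for the next step, because posts without body text are the ones classified as photo or video. A small sketch on hypothetical rows (the real notebook would run the same calls on `data`):

```python
import pandas as pd

# Hypothetical rows standing in for the real `data` frame.
sample = pd.DataFrame({
    "score": [39, 1593, 230],
    "num_comments": [2, 109, 30],
    "body": ["text a", None, "text c"],
})

# Highest-scoring posts first.
top_by_score = sample.sort_values(by="score", ascending=False)

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

print(top_by_score)
print(missing_per_column)
```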

## Create the post type
Create a post type column by looking at the body and url. If the body has text, the post type is **text**. If the url contains png or jpg, the type is **photo**. Otherwise, the type is **video** (there is one from TikTok!).

In [5]:
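# Mark posts without body text; assigning back avoids relying on inplace
# modification of a column accessed as an attribute
data['body'] = data['body'].fillna('No text')

def condition(row):
    # Posts with body text are text posts
    if row['body'] != 'No text':
        return 'text'
    # Image links are photos; anything else is treated as video
    if 'png' in row['url'] or 'jpg' in row['url']:
        return 'photo'
    return 'video'

data['post_type'] = data.apply(condition, axis=1)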
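
The same rule can be written without a row-wise `apply`. The sketch below is an alternative, not the notebook's method: it uses `numpy.select` on hypothetical sample rows and tests for a missing body directly instead of using the 'No text' placeholder. The sample URLs are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for group_posts.csv.
sample = pd.DataFrame({
    "body": ["Some text", None, None],
    "url": [
        "https://www.reddit.com/r/example/comments/abc",
        "https://i.redd.it/example.jpg",
        "https://www.tiktok.com/@user/video/123",
    ],
})

# Conditions are checked in order, so "text" wins over "photo".
has_text = sample["body"].notna()
is_image = sample["url"].str.contains("png|jpg", na=False)

sample["post_type"] = np.select(
    [has_text, is_image],
    ["text", "photo"],
    default="video",
)
print(sample["post_type"].tolist())
```

On large frames this usually runs faster than `apply(..., axis=1)`, because the conditions are evaluated on whole columns at once.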

## Create the engagement data
The engagement of each post is the sum of its score and its number of comments.

In [9]:
data['engagement'] = data['score'] + data['num_comments']
data.head()

Unnamed: 0,title,score,id,subreddit,url,num_comments,body,num_subscribers,created_date,post_type,engagement
0,Our most-broken and least-understood rules is ...,1593,doqwow,depression,https://www.reddit.com/r/depression/comments/d...,109,We understand that most people who reply immed...,603300,2019-10-29,text,1702
1,Regular Check-In Post,151,exo6f1,depression,https://www.reddit.com/r/depression/comments/e...,862,Welcome to /r/depression's check-in post - a p...,603300,2020-02-02,text,1013
2,Does anyone else just wanna start new,595,f4re4h,depression,https://www.reddit.com/r/depression/comments/f...,109,I just want to move to a town where no one kno...,603300,2020-02-16,text,704
3,I wanna get sick for a few weeks to catch a break,230,f4u8i2,depression,https://www.reddit.com/r/depression/comments/f...,30,This is prob really messed up but i kind of ju...,603300,2020-02-17,text,260
4,I don't want you to ask me if I am feeling bet...,39,f4uix8,depression,https://www.reddit.com/r/depression/comments/f...,2,"The moment you ask me that, I automatically fe...",603300,2020-02-17,text,41


## Create the engagement rate 
The engagement rate (ER) is the engagement divided by the number of subscribers, expressed as a percentage.

In [11]:
data['ER'] = data['engagement'] / data['num_subscribers'] * 100
data.head()

Unnamed: 0,title,score,id,subreddit,url,num_comments,body,num_subscribers,created_date,post_type,engagement,ER
0,Our most-broken and least-understood rules is ...,1593,doqwow,depression,https://www.reddit.com/r/depression/comments/d...,109,We understand that most people who reply immed...,603300,2019-10-29,text,1702,0.282115
1,Regular Check-In Post,151,exo6f1,depression,https://www.reddit.com/r/depression/comments/e...,862,Welcome to /r/depression's check-in post - a p...,603300,2020-02-02,text,1013,0.16791
2,Does anyone else just wanna start new,595,f4re4h,depression,https://www.reddit.com/r/depression/comments/f...,109,I just want to move to a town where no one kno...,603300,2020-02-16,text,704,0.116692
3,I wanna get sick for a few weeks to catch a break,230,f4u8i2,depression,https://www.reddit.com/r/depression/comments/f...,30,This is prob really messed up but i kind of ju...,603300,2020-02-17,text,260,0.043096
4,I don't want you to ask me if I am feeling bet...,39,f4uix8,depression,https://www.reddit.com/r/depression/comments/f...,2,"The moment you ask me that, I automatically fe...",603300,2020-02-17,text,41,0.006796
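
As a sanity check, the first row of the output can be recomputed by hand: 1593 + 109 = 1702, and 1702 / 603300 * 100 is about 0.282115. A short sketch that repeats the two formulas on that single row (values copied from the output above):

```python
import pandas as pd

# Values copied from the first row of the output above.
row = pd.DataFrame({
    "score": [1593],
    "num_comments": [109],
    "num_subscribers": [603300],
})

# Same formulas as in the notebook cells.
row["engagement"] = row["score"] + row["num_comments"]
row["ER"] = row["engagement"] / row["num_subscribers"] * 100

print(row[["engagement", "ER"]])
```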


## Save the data above to a CSV file

In [12]:
# Note: this overwrites the input file that was read at the top of the notebook
data.to_csv('./Datasets/group_posts.csv')
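
The `Unnamed: 0` column in the outputs above is typical of a CSV that was written with its index and then read back, which is likely what happens when this notebook is run more than once. The sketch below shows the effect on a small hypothetical frame, using in-memory strings instead of files.

```python
import io

import pandas as pd

# Hypothetical frame; the real notebook would use `data`.
df = pd.DataFrame({"score": [1593, 151], "num_comments": [109, 862]})

# By default, to_csv writes the index as an unnamed first column.
with_index = df.to_csv()
cols_with_index = pd.read_csv(io.StringIO(with_index)).columns.tolist()

# index=False omits it, so a plain read_csv returns only the data columns.
without_index = df.to_csv(index=False)
cols_without_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

If the file is saved with `index=False`, it should then be read back without `index_col=0`; otherwise the first data column would be used as the index.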

## Prepare the data for dashboards in Tableau
We drop the title, id, url, and body columns because they are not needed for the Tableau dashboards.

In [16]:
new_data = data.copy()
new_data = new_data.drop(['title','id','url','body'],axis=1)
new_data.head()

Unnamed: 0,score,subreddit,num_comments,num_subscribers,created_date,post_type,engagement,ER
0,1593,depression,109,603300,2019-10-29,text,1702,0.282115
1,151,depression,862,603300,2020-02-02,text,1013,0.16791
2,595,depression,109,603300,2020-02-16,text,704,0.116692
3,230,depression,30,603300,2020-02-17,text,260,0.043096
4,39,depression,2,603300,2020-02-17,text,41,0.006796


## Save the data for Tableau dashboards

In [18]:
new_data.to_csv("./Datasets/Engagment_data.csv")