
# **Data Collection from Reddit pages**
The code below uses the PRAW library to extract Reddit submissions and their comments, reshapes them into a tabular format, and saves the result as a .csv file.

For more details about PRAW, check its [documentation](https://praw.readthedocs.io/en/stable/).

**Importing the required libraries**

In [None]:
!pip install praw
!pip install arrow
import praw
from datetime import datetime
import arrow
import pandas as pd

Collecting praw
  Downloading praw-7.5.0-py3-none-any.whl (176 kB)
Collecting prawcore<3,>=2.1
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Collecting websocket-client>=0.54.0
  Downloading websocket_client-1.2.3-py3-none-any.whl (53 kB)
Collecting update-checker>=0.18
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: websocket-client, update-checker, prawcore, praw
Successfully installed praw-7.5.0 prawcore-2.3.0 update-checker-0.18.0 websocket-client-1.2.3
Collecting arrow
  Downloading arrow-1.2.1-py3-none-any.whl (63 kB)
Installing collected packages: arrow
Successfully installed arrow-1.2.1


**Creating an instance of the Reddit class**

Enter your client ID, client secret, and user agent below.

Create your own tokens from: [https://www.reddit.com/prefs/apps](https://www.reddit.com/prefs/apps)


In [None]:
reddit = praw.Reddit(
    # Client ID
    client_id = "",
    # Client Secret Key
    client_secret = "",
    # User Agent
    user_agent = "",
    check_for_async = False
)
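
Typing credentials directly into a shared notebook makes them easy to leak. As a minimal sketch, assuming the credentials are stored in environment variables, the values could be read at run time instead. The variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` and the helper `load_reddit_credentials` are hypothetical and not part of PRAW:

```python
import os

# Hypothetical environment variable names; PRAW does not require these names.
REQUIRED_VARS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
}


def load_reddit_credentials(env=None):
    """Return keyword arguments for praw.Reddit, read from a mapping of variables."""
    env = os.environ if env is None else env
    # Collect every missing or empty variable so the error names all of them
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in REQUIRED_VARS.items()}
```

The returned dictionary could then be unpacked into the constructor, for example `praw.Reddit(**load_reddit_credentials(), check_for_async=False)`.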

**Function to convert the given data to CSV format**

In [None]:
def data_csv(data, name):
  df = pd.DataFrame(data)
  # Fixed submission fields; every other column is a "comment N" column
  id_columns = ['date', 'time', 'subreddit', 'search', 'id', 'author', 'title', 'message', 'url']
  # Reshape from one row per submission to one row per comment
  df = df.melt(id_columns, var_name='Comment No', value_name='Comment Message')
  # "comment 12" -> 12, so that comments sort numerically rather than as text
  df['Comment No'] = pd.to_numeric(df['Comment No'].str.split(" ").str[1])
  df = df.sort_values(['date', 'id', 'Comment No'], ascending=[False, True, True])
  # Treat empty strings as missing, then drop rows that have no comment text
  # (submissions without any comments are therefore not written to the file)
  df = df.replace("", float("NaN"))
  df = df.dropna(subset=["Comment Message"])
  df.to_csv(name, index=False)
  print("The data has been converted to CSV format successfully.\nHere is a sample of the dataframe:\n")
  print(df.head())
  print("\nDownload the CSV file.")
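
To make the reshaping step easier to follow, here is a small sketch of the same wide-to-long `melt` on toy data. The rows and the reduced column set (`id`, `title`) are made up for illustration; the real function uses the full list of submission fields.

```python
import pandas as pd

# Toy rows shaped like the dictionaries the extraction functions build:
# fixed submission fields plus one "comment N" key per comment.
rows = [
    {"id": "a1", "title": "First post", "comment 1": "hello", "comment 2": "world"},
    {"id": "b2", "title": "Second post", "comment 1": "only one"},
]

# Submissions with fewer comments get NaN in the unused comment columns
wide_df = pd.DataFrame(rows)

# melt keeps the id columns and stacks every remaining column into rows
long_df = wide_df.melt(["id", "title"], var_name="Comment No", value_name="Comment Message")

# "comment 2" -> 2, so that comments sort numerically
long_df["Comment No"] = pd.to_numeric(long_df["Comment No"].str.split(" ").str[1])

# Drop the padding rows, then order comments within each submission
long_df = long_df.dropna(subset=["Comment Message"])
long_df = long_df.sort_values(["id", "Comment No"]).reset_index(drop=True)
print(long_df)
```

Each submission ends up with one row per comment, which is the layout written to the CSV file.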

**Function for extracting the data from a given subreddit**

In [None]:
# Search by subreddits only
def subreddit_extraction (subreddit, data):
  for submission in reddit.subreddit(subreddit).hot(limit=None):
    sub = {}
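    # Format date and time separately; splitting the default string on "+"
    # leaves the offset attached when the local UTC offset is negative
    dt = arrow.get(submission.created_utc).to('local')
    sub['date'] = dt.format('YYYY-MM-DD')
    sub['time'] = dt.format('HH:mm:ss')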
    sub['subreddit'] = submission.subreddit.display_name
    sub['search'] = ""
    sub['id'] = submission.id
    if(submission.author is not None):
      sub['author'] = submission.author.name
    else:
      sub['author'] = ""
    sub['title'] = submission.title
    sub['message'] = submission.selftext
    sub['url'] = submission.url
    i = 1
    submission.comments.replace_more(limit=None)
    for comment in submission.comments.list():
      sub['comment '+str(i)] = comment.body
      i = i + 1
    data.append(sub)
  print("The data from the given subreddit has been extracted successfully.\n")
  data_csv(data,subreddit+"_subreddit.csv")
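
The per-submission logic in the function above could also be factored into a helper so that it can be checked without calling the Reddit API. The sketch below is an illustration, not the notebook's method: `submission_to_row` is a hypothetical helper, it uses the standard library's `datetime` in UTC instead of `arrow` and local time, and it assumes `replace_more` has already been called on the real submission. A stand-in object built with `SimpleNamespace` plays the role of a PRAW submission.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def submission_to_row(submission, search=""):
    """Build one row dictionary from a submission-like object (hypothetical helper)."""
    # Assumption: UTC timestamps via the standard library, not arrow's local time
    created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
    row = {
        "date": created.strftime("%Y-%m-%d"),
        "time": created.strftime("%H:%M:%S"),
        "subreddit": submission.subreddit.display_name,
        "search": search,
        "id": submission.id,
        # Same None check as the original: a missing author becomes ""
        "author": submission.author.name if submission.author is not None else "",
        "title": submission.title,
        "message": submission.selftext,
        "url": submission.url,
    }
    # enumerate replaces the manual counter; comments are numbered from 1
    for i, comment in enumerate(submission.comments.list(), start=1):
        row["comment " + str(i)] = comment.body
    return row


# Stand-in for a PRAW submission, with made-up values
fake = SimpleNamespace(
    created_utc=90061,  # 1 day, 1 hour, 1 minute, 1 second after the epoch
    subreddit=SimpleNamespace(display_name="technology"),
    id="abc123",
    author=None,
    title="Example title",
    selftext="",
    url="https://example.com",
    comments=SimpleNamespace(
        list=lambda: [SimpleNamespace(body="first"), SimpleNamespace(body="second")]
    ),
)
row = submission_to_row(fake)
print(row)
```

The same helper would serve the search variant by passing the keyword as `search`.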

**Function for extracting the data from a given subreddit and search keyword**

In [None]:
# Search by subreddits and by a given search element
def subreddit_search_extraction (subreddit, data, search):
  for submission in reddit.subreddit(subreddit).search(search):
    sub = {}
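    # Format date and time separately; splitting the default string on "+"
    # leaves the offset attached when the local UTC offset is negative
    dt = arrow.get(submission.created_utc).to('local')
    sub['date'] = dt.format('YYYY-MM-DD')
    sub['time'] = dt.format('HH:mm:ss')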
    sub['subreddit'] = submission.subreddit.display_name
    sub['search'] = search
    sub['id'] = submission.id
    if(submission.author is not None):
      sub['author'] = submission.author.name
    else:
      sub['author'] = ""
    sub['title'] = submission.title
    sub['message'] = submission.selftext
    sub['url'] = submission.url
    i = 1
    submission.comments.replace_more(limit=None)
    for comment in submission.comments.list():
      sub['comment '+str(i)] = comment.body
      i = i + 1
    data.append(sub)
  print("The data from the given subreddit has been extracted successfully.\n")
  data_csv(data,subreddit+"_"+search+"_subreddit.csv")
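
Because the output file name is built directly from the search keyword, a keyword containing spaces or a slash ends up in the file name as-is. One possible way to guard against that, sketched here with a hypothetical helper `csv_filename` that keeps the same naming pattern as the two functions above:

```python
import re


def csv_filename(subreddit, search=""):
    """Hypothetical helper: build the output file name from user-supplied parts."""
    # Same pattern as the notebook: <subreddit>[_<search>]_subreddit.csv
    parts = [subreddit] + ([search] if search else []) + ["subreddit"]
    # Replace any run of characters other than letters, digits, "_" or "-" with "_"
    safe = [re.sub(r"[^\w-]+", "_", part).strip("_") for part in parts]
    return "_".join(safe) + ".csv"


print(csv_filename("technology"))
print(csv_filename("technology", "cloud computing/AI"))
```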

**Calling the extraction functions**

Enter your subreddit and search keyword here.

In [None]:
# Enter the subreddit from which you want to extract the data
subreddit = "technology"
# Enter the search keyword for keyword-specific extraction
search = "cloud"
# Create an empty list to collect one dictionary per submission
data = []

# Extract from the whole subreddit without a search keyword
# (comment this line out if you use the search-based call below)
subreddit_extraction(subreddit, data)

# Uncomment the line below to extract only submissions matching the search keyword
# subreddit_search_extraction(subreddit, data, search)