# Reddit Comment Scraper Documentation

## Introduction
The Reddit Comment Scraper is a Python script that extracts comments from Reddit posts listed, one URL per line, in a text file.
It uses PRAW (the Python Reddit API Wrapper) to query the Reddit API and collect the comments on each post.
The comments are then labeled by post type, cleaned, and exported to a CSV file for further analysis.

## Required Libraries
- pandas: A library for data manipulation and analysis.
- praw: The Python Reddit API Wrapper for accessing Reddit's API.
- PySimpleGUI: A simple graphical user interface (GUI) library.
- re: Python's built-in regular expressions module, used for pattern matching (part of the standard library, so no installation is needed).


## Functions
1. reddit_txt(): This function is responsible for opening a GUI window that allows the user to select a text file containing a list of Reddit URLs. It returns the selected file's path.

2. getUrls(txtfile): This function takes the path of a text file as input and reads the URLs from the file, storing them in a list. It then returns this list of URLs.

3. getPraw(): This function initializes and configures the Reddit API using the provided credentials. It returns a Reddit API instance.

4. trialtype(url): This function uses regular expressions to determine the type of a Reddit post based on the keywords present in the URL. It returns the type of the post ("cancer", "covid", or "other").

5. getComments(reddit, urls): The main function responsible for extracting comments from Reddit posts. It takes the Reddit API instance and the list of URLs as inputs. It iterates through each URL, extracts comments, and organizes them into a nested list based on the post type. The nested list is then returned.

6. CreateDataframe(all_comments): This function creates a Pandas DataFrame to hold the comments extracted from Reddit. It takes the nested list of comments as input, processes them, and organizes them into the DataFrame with two columns: "comments" and "type".

7. cleanDataFrame(df): This function removes comments that were deleted or removed on Reddit. It takes the DataFrame as input, drops every row whose comment text is "[deleted]" or "[removed]", and returns the cleaned DataFrame.

8. export_to_csv(df): This function prompts the user to select an output folder using a GUI window and exports the cleaned DataFrame to a CSV file named "reddit.csv" within the selected folder.

9. main(): The main function that orchestrates the entire process. It calls the above functions in sequence to scrape Reddit comments, process them, clean the data, and export it to a CSV file.


## How to Use
- Make sure you have installed the required libraries: pandas, praw, PySimpleGUI.

- Replace the client_id and client_secret values in the getPraw() function with your own Reddit API credentials.

- Run the script. It will open a GUI window for you to select a text file containing Reddit URLs.

- The script will then extract comments from the provided Reddit URLs, process the data, and clean it.

- A GUI window will prompt you to select an output folder for the CSV file.

- Once the process is complete, the cleaned data will be exported to a CSV file named "reddit.csv" within the selected folder.

## Required libraries

In [1]:
import pandas as pd
import praw
import PySimpleGUI as sg
import re

### Function to locate the text file listing the Reddit post URLs

In [3]:
def reddit_txt():
    # Keep prompting until a file is chosen; a cancelled dialog returns None
    while True:
        file = sg.popup_get_file("Select the text file containing the Reddit URLs",
                                 title='My File Browser',
                                 file_types=(("ALL txt Files", "*.txt"),))
        if file:
            break
        sg.popup_error(' Please enter a file path ')
    return file

### Function to get all the urls from the text file into a list

In [4]:

def getUrls(txtfile):
    urls = []
    with open(txtfile) as fileobj:
        for line in fileobj:
            # Strip the trailing newline and skip blank lines
            url = line.strip()
            if url:
                urls.append(url)
    return urls
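
The input file is expected to hold one Reddit post URL per line. The sketch below shows what such a file might look like and how it reads back as a clean list. The URLs and the helper name `read_urls` are hypothetical and serve only as illustration.

```python
import os
import tempfile


def read_urls(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    with open(path, encoding="utf-8") as fileobj:
        return [line.strip() for line in fileobj if line.strip()]


# Hypothetical file content: one post URL per line, with a stray blank line.
sample = (
    "https://www.reddit.com/r/example/comments/abc123/cancer_trial_results/\n"
    "\n"
    "https://www.reddit.com/r/example/comments/def456/covid_vaccine_study/\n"
)

# Write the sample to a temporary file so the example runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(sample)

urls = read_urls(tmp.name)
os.remove(tmp.name)
print(urls)
```

Dropping blank lines matters because an empty string is not a valid post URL and should not be handed to the API.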


### Function to set the Reddit API credentials needed to use the API

In [5]:
def getPraw():
    # Replace the placeholders with your own Reddit API credentials
    return praw.Reddit(user_agent="Comment Extraction (by /u/rddit_scrapper)",
                       client_id="YOUR_CLIENT_ID",
                       client_secret="YOUR_CLIENT_SECRET",
                       check_for_async=False)
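
Keeping credentials out of the source file avoids sharing them by accident. One possible approach, sketched here under assumptions, is to read them from environment variables. The variable names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` and the helper `reddit_kwargs` are hypothetical and not part of the original script.

```python
import os


def reddit_kwargs(env=os.environ):
    """Build keyword arguments for praw.Reddit from environment variables."""
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "user_agent": "Comment Extraction (by /u/rddit_scrapper)",
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "check_for_async": False,
    }


# Usage, once praw is imported and both variables are set:
#     reddit = praw.Reddit(**reddit_kwargs())
```

With this in place, getPraw() could return `praw.Reddit(**reddit_kwargs())` and the script would fail early with a clear message when a credential is missing.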

### Function uses regex to look for keywords within the url to indicate the type of post and returns its corresponding type

In [6]:
def trialtype(url):
    if re.search('(?i)cancer', url):
        return 'cancer'
    elif re.search('(?i)(corona)|(covid)|(novavax)', url):
        return 'covid'
    else:
        return 'other'
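
To illustrate how the rules behave, the sketch below repeats the function and applies it to a few hypothetical URLs. The check for "cancer" runs first, so a URL containing both a cancer and a covid keyword is labeled "cancer".

```python
import re


def trialtype(url):
    # Same rules as the function above, repeated so the example runs on its own.
    if re.search('(?i)cancer', url):
        return 'cancer'
    elif re.search('(?i)(corona)|(covid)|(novavax)', url):
        return 'covid'
    else:
        return 'other'


# Hypothetical URLs, used only to illustrate the matching rules.
examples = [
    "https://www.reddit.com/r/example/comments/abc123/lung_cancer_trial/",
    "https://www.reddit.com/r/example/comments/def456/Novavax_phase_3/",
    "https://www.reddit.com/r/example/comments/ghi789/cancer_and_covid/",
    "https://www.reddit.com/r/example/comments/jkl012/weekly_discussion/",
]
labels = [trialtype(url) for url in examples]
print(labels)  # ['cancer', 'covid', 'cancer', 'other']
```

Because the pattern is searched across the whole URL, a keyword in the subreddit name counts as a match just like a keyword in the post title.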

### Function to get all the comments from every Reddit post in the list

In [7]:
def getComments(reddit,urls):
    
    #This will be a nested list holding each url's comments
    allcommentslist = []
    
    #Loops through the urls list
    for url in urls:
        
        #This list will hold all the comments of the current url only
        url_list= []
        
        #The first element of the list is the type of post
        url_list.append(trialtype(url))
       
        #Looking up the submission by its url
        submission = reddit.submission(url=url)
        
        #Replaces every 'More Comments' object with the comments it stands for;
        #limit=None keeps going until there are none left
        submission.comments.replace_more(limit=None)
        
        #Iterating through the flattened comment tree and appending each comment's text
        for comment in submission.comments.list():
            url_list.append(comment.body)
        
        #Appending the url list to the main list that holds all urls' comments
        allcommentslist.append(url_list)
        
    return allcommentslist



### Function to build a DataFrame of all comments labeled with their post type

In [8]:
def CreateDataframe(all_comments):
    #Collects one dataframe per url; all of them are concatenated at the end
    frames = []
    
    for comments in all_comments:
        #The first value holds the post's type, the rest are the comment texts
        comment_type, bodies = comments[0], comments[1:]
        #Creating a dataframe to hold this url's comments with the column head as comments
        x = pd.DataFrame(bodies, columns=['comments'])
        #Creates a column type to label all the comments with the post's type
        x['type'] = comment_type
        frames.append(x)
    
    #With no urls there is nothing to concatenate, so return an empty dataframe
    if not frames:
        return pd.DataFrame(columns=['comments', 'type'])
    #ignore_index gives every row a unique index label
    return pd.concat(frames, ignore_index=True)
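
To make the reshaping concrete, here is a library-free sketch of the same transformation on made-up data. Each inner list starts with the post type, followed by that post's comment texts, and each comment becomes one (comment, type) row. The helper name `to_rows` and the sample comments are illustrative and not part of the script.

```python
def to_rows(all_comments):
    """Flatten [[type, comment, ...], ...] into a list of (comment, type) tuples."""
    rows = []
    for post in all_comments:
        # The first element is the label; the remaining ones are comment texts.
        post_type, bodies = post[0], post[1:]
        rows.extend((body, post_type) for body in bodies)
    return rows


# Made-up data in the shape returned by getComments().
all_comments = [
    ["cancer", "Thanks for sharing.", "[deleted]"],
    ["other", "Interesting read."],
]
rows = to_rows(all_comments)
for row in rows:
    print(row)
```

The "[deleted]" row is still present at this stage; it is filtered out afterwards by cleanDataFrame().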



### Function to remove any comments that were deleted or removed on the site

In [9]:

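def cleanDataFrame(df):
    #Filtering by value rather than dropping by index label, because index labels
    #can repeat after concatenation and a label-based drop would remove extra rows
    placeholder = df.comments.isin(['[removed]', '[deleted]'])
    
    return df[~placeholder].reset_index(drop=True)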

### Function to export the DataFrame to a CSV

In [10]:
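def export_to_csv(df):
    #Imported here so this cell does not depend on the imports cell being changed
    import os
    #Call a GUI to select the output folder
    destfolder = sg.popup_get_folder('Please select destination folder for reddit.csv')
    if not destfolder:
        #The dialog was cancelled, so there is nowhere to write the file
        return
    #os.path.join builds a path that is valid on any operating system
    df.to_csv(os.path.join(destfolder, 'reddit.csv'), index=False)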
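
An alternative way to build the output path is the standard library's `pathlib`, which joins folder and file name with the correct separator for the current operating system. This is a minimal sketch; the helper name `csv_destination` and the example folder are hypothetical.

```python
from pathlib import Path


def csv_destination(folder, filename="reddit.csv"):
    """Return the full output path for the CSV inside the chosen folder."""
    return Path(folder) / filename


# Hypothetical folder, standing in for the value returned by the folder dialog.
target = csv_destination("exports/reddit")
print(target.name)    # reddit.csv
print(target.parent)
```

pandas accepts path objects, so the result could be passed directly, as in `df.to_csv(csv_destination(destfolder), index=False)`.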

### Main function 

In [11]:
def main():
    txtfile = reddit_txt()
    reddit = getPraw()
    urls = getUrls(txtfile)
    all_comments = getComments(reddit,urls)    
    df = CreateDataframe(all_comments)
    df = cleanDataFrame(df)
    export_to_csv(df)



In [12]:
# Calling main function 
if __name__=="__main__": 
    main()
