**REDDIT SCRAPING: GET DATA VIA THE PYTHON REDDIT API WRAPPER**<BR>
This code collects the titles, comments, and URLs of Reddit posts in order to build a training set.

The limit has been set to None so that PRAW fetches as many training samples as it can; Reddit listings generally return at most about 1,000 items. <br>
A throwaway Reddit account, u/reddit_scraper_, has been created with a blank slate: no user information and only one subscribed subreddit, r/India. This account can be used to experiment with the code. It will be removed after the screening task, after which users must supply their own username and password, or client ID and secret, in place of the ones provided in the code.<br><br>

Separate code blocks have been created for each segment so that each one can be run and modified in isolation in a Jupyter notebook.
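
Because the bundled throwaway credentials are temporary, one option is to read your own from environment variables instead of editing the code. The sketch below is only an illustration: the helper reddit_kwargs_from_env and the variable names REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET and REDDIT_USER_AGENT are hypothetical and are not part of the original notebook.

```python
import os

# Hypothetical environment variable names; choose whatever suits your setup.
REQUIRED_VARS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
}

def reddit_kwargs_from_env(env=None):
    """Build the keyword arguments for praw.Reddit from environment variables."""
    env = os.environ if env is None else env
    # Collect every variable that is unset or empty so the error names them all
    missing = [name for name in REQUIRED_VARS.values() if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in REQUIRED_VARS.items()}

# Usage (requires praw and the variables above to be set):
# import praw
# reddit = praw.Reddit(**reddit_kwargs_from_env())
```

Keeping the secret out of the notebook also means the file can be shared without leaking it.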

**GET POST TITLE**

In [0]:
# Get Python's Reddit API Wrapper
import praw
# Pickle will store generated data for EDA
import pickle

# All the flairs that were consistent with current affairs on r/India
link_flair_tags = {
					"Politics":0, 
					"Non-Political":0, 
					"AskIndia":0, 
					"Policy/Economy":0, 
					"Business/Finance":0, 
					"Science/Technology":0, 
					"Scheduled":0, 
					"Sports":0, 
					"Food":0,
					"Photography":0,
					"CAA-NRC-NPR":0,
					"Coronavirus":0,
				  }

# We need to store both the collected data and the statistics, hence the two file handles
dataset_file = open("hot/flair_dataset.csv", "w")
data_analysis = open("hot/data_analysis.pickle", "wb")

# Generate a Reddit API Wrapper object: PRAW object to scrape reddit data
reddit = praw.Reddit(
						client_id="RImabBQtiUpnDw",
						client_secret="VOhkZ1p8g215x4hY354QEKdUEn0",
						user_agent="python-linux:text_classifier:v0.1a (by u/reddit_scraper_)"
					)

# Counter to keep track of number of posts that were scraped
count = 0

# Get posts from the required sub with appropriate sorting bias
for submission in reddit.subreddit('india').hot(limit=None):
	# Update the per-post state
	count += 1
	title = ""
	flair = ""
	flag = False

	# Check whether the received request is valid or not
	if submission.link_flair_text:
		# If the received flair is not part of the shortlisted flairs, Ignore it
		try:
			# Coronavirus posts are overflowing; to control bias we keep at most 150 of them.
			if submission.link_flair_text == "Coronavirus" and link_flair_tags[submission.link_flair_text] >= 150:
				print('Corona Exceeded')
			# Counting number of posts under the flair
			else:
				link_flair_tags[submission.link_flair_text] += 1
				flair = submission.link_flair_text
				flag = True
		except KeyError:
			print("New Flair Found: " + submission.link_flair_text + ". Ignoring..........")
			flag = False

	# Get the title of received post
	if submission.title:
		title = submission.title
		# Quote titles containing commas, quotes or line breaks, doubling embedded quotes as CSV requires
		if any(ch in title for ch in ',"\n\r'):
			title = '"' + title.replace('"', '""') + '"'
	# Only posts with a shortlisted flair are written to the dataset
	if flag:
		dataset_file.write(title + ',' + flair + '\n')

# Validation of code working correctly: get the statistics scraped by the bot
print(link_flair_tags)
print(count)

# Store the per-flair counts for EDA
pickle.dump(link_flair_tags, data_analysis)

# Close all file pointers
dataset_file.close()
data_analysis.close()
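
The cell above assembles each CSV line by hand. As an alternative, Python's standard csv module can take care of quoting and escaping. The following is a minimal sketch, not the notebook's own method: the helper write_rows and the sample rows are made up for illustration.

```python
import csv
import io

def write_rows(handle, rows):
    """Write (text, flair) pairs as CSV, letting the csv module handle quoting."""
    writer = csv.writer(handle)
    for text, flair in rows:
        writer.writerow([text, flair])

# Demonstration on an in-memory buffer with invented sample rows
buffer = io.StringIO()
write_rows(buffer, [
    ('Budget 2020, explained', 'Policy/Economy'),
    ('He said "no" to the offer', 'Politics'),
])
print(buffer.getvalue())

# Reading the rows back recovers the original text unchanged
buffer.seek(0)
recovered = list(csv.reader(buffer))
```

When writing to a real file instead of a buffer, the csv documentation recommends opening it with newline='' so that line endings are not translated twice.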

**GET POST COMMENTS**

In [0]:
import praw
import pickle
import re
from nltk.corpus import stopwords

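# Characters that are replaced by a space (slashes, brackets, pipes, commas, ...)
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
# Anything other than digits, lowercase letters and a few symbols is dropped
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+,_"]')
# Requires the NLTK stopwords corpus; run nltk.download('stopwords') once if it is missing
STOPWORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercase, strip punctuation and symbols, then remove English stopwords
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = BAD_SYMBOLS_RE.sub('', text)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text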

link_flair_tags = {
					"Politics":0, 
					"Non-Political":0, 
					"AskIndia":0, 
					"Policy/Economy":0, 
					"Business/Finance":0, 
					"Science/Technology":0, 
					"Scheduled":0, 
					"Sports":0, 
					"Food":0,
					"Photography":0,
					"CAA-NRC-NPR":0,
					"Coronavirus":0,
				  }

dataset_file = open("new/flair_dataset_comments.csv", "w")
data_analysis = open("new/data_analysis_comments.pickle", "wb")

reddit = praw.Reddit(
						client_id="RImabBQtiUpnDw",
						client_secret="VOhkZ1p8g215x4hY354QEKdUEn0",
						user_agent="python-linux:text_classifier:v0.1a (by u/reddit_scraper_)"
					)

count = 0

for submission in reddit.subreddit('india').new(limit=None):
	count += 1
	comments = ""
	flair = ""
	flag = False
	if submission.link_flair_text:
		try:
			# Keep at most 150 Coronavirus posts to control bias
			if submission.link_flair_text == "Coronavirus" and link_flair_tags[submission.link_flair_text] >= 150:
				print('Corona Exceeded')
			else:
				link_flair_tags[submission.link_flair_text] += 1
				flair = submission.link_flair_text
				flag = True
		except KeyError:
			print("New Flair Found: " + submission.link_flair_text + ". Ignoring..........")
			#link_flair_tags[submission.link_flair_text] = 1
			#flair = submission.link_flair_text
			flag = False

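	if submission.comments:
		# Concatenate the bodies of all top-level comments, skipping "load more" placeholders
		for top_level_comment in submission.comments:
			if isinstance(top_level_comment, praw.models.MoreComments):
				continue
			comments = comments + ' ' + top_level_comment.body
		# Clean the combined text once, after all comments have been collected
		comments = preprocess_text(comments)
	# Only posts with a shortlisted flair are written to the dataset
	if flag:
		dataset_file.write(comments + ',' + flair + '\n')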

# Store the per-flair counts for EDA
pickle.dump(link_flair_tags, data_analysis)
print(link_flair_tags)
print(count)

dataset_file.close()
data_analysis.close()
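
The cell above opens a pickle file for the per-flair statistics. A minimal sketch of how such counts could be saved and later reloaded for EDA is shown below. The helpers save_stats and load_stats and the example counts are assumptions made for illustration; they do not come from the original notebook.

```python
import os
import pickle
import tempfile

def save_stats(stats, path):
    """Serialize the flair counts to disk; the file is closed automatically."""
    with open(path, "wb") as handle:
        pickle.dump(stats, handle)

def load_stats(path):
    """Read the flair counts back, for example in a separate EDA notebook."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

# Invented counts, purely for illustration
example_stats = {"Politics": 12, "Sports": 3, "Coronavirus": 150}

# A temporary directory keeps the demonstration from touching real data
with tempfile.TemporaryDirectory() as tmp_dir:
    stats_path = os.path.join(tmp_dir, "data_analysis_comments.pickle")
    save_stats(example_stats, stats_path)
    restored = load_stats(stats_path)
```

Using with blocks ensures the files are closed even if an error interrupts the run, which the explicit close() calls would not guarantee.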