### **MIDAS SCREENING TASK 1: R/INDIA FLAIR DETECTOR**

This is a manual documenting the various attempts at succeeding in the screening task provided by IIITD for MIDAS research labs, written by Anant Raina, a student of Nirma University.

## Task 1: REDDIT DATA COLLECTION

The first task is to get the data from the r/India Subreddit.<br>

We can use PRAW, the Python Reddit API Wrapper. Docs: [PRAW documentation](https://praw.readthedocs.io/en/latest/)<br>

PRAW is a free-to-use Python Reddit API wrapper that will help us get the data from subreddits. As per the documentation, however, the following needs to be done before proceeding further:

Step 1: Go to [reddit.com/prefs/apps](https://www.reddit.com/prefs/apps) and create a new app<br>
Step 2: Fill out the form like so:


*   Name: "Name Of App"
*   App Type: Script
*   description: You can leave this blank
*   about url: You can leave this blank
*   redirect url: http://www."Name Of App".com/unused/redirect/uri

Step 3: Hit create app. Make note of the client ID and the client secret.
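
Since the client ID and client secret are credentials, one option is to keep them out of the notebook and read them from environment variables. The sketch below is only an illustration: the variable names (`REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT`), the fallback user agent, and the helper `reddit_credentials` are all assumptions made here, not something PRAW defines.

```python
import os


def reddit_credentials(env=None):
    """Build the keyword arguments for praw.Reddit from environment variables.

    The variable names are assumptions made for this sketch, not a PRAW convention.
    """
    env = os.environ if env is None else env
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        # Fall back to a generic (hypothetical) user agent if none is configured.
        "user_agent": env.get("REDDIT_USER_AGENT", "python:flair_detector:v0.1"),
    }


# Usage, once praw is installed:
#   import praw
#   reddit = praw.Reddit(**reddit_credentials())
```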

After this we may need to install praw. A simple way would be to use pip.

In [0]:
!pip install praw

Let us try to get a single reddit post from r/India

In [3]:
import praw

reddit = praw.Reddit(
    client_id="RImabBQtiUpnDw",
    client_secret="VOhkZ1p8g215x4hY354QEKdUEn0",
    user_agent="python-linux:text_classifier:v0.1a (by u/reddit_scraper_)",
)

for submission in reddit.subreddit('india').hot(limit=1):
  print(submission.title, submission.link_flair_text)

Coronavirus (COVID-19) Megathread - News and Updates - 4 Coronavirus


If we go to r/India and look at the first post, this same post shows up with the flair obtained above. It comes first because it is pinned to the top of the subreddit. Let us try to get more submissions.

In [0]:
for submission in reddit.subreddit('india').hot(limit=50):
  print(submission.title, submission.link_flair_text)

It is clear that the reddit API is working just fine. It is now time to decide on what to scrape, how much to scrape, and how to preprocess it. Let us now get to task 2. For more on reddit_scraping, see reddit_scraping.ipynb in the repository under jupyter notebooks.

## **Task 2: EDA**

The next task is to perform an analysis on what data to get, and how much to get.<br>
# What to get

We first need to decide what is required and what is not. The following can be scraped from a reddit post:

*   URL
*   Post Title
*   Post Body
*   Comments
*   Post Flair
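
As a rough sketch of how most of these fields map onto a PRAW submission, the hypothetical helper `submission_to_row` below copies the URL, title, body, and flair into a plain dictionary. The attribute names (`url`, `title`, `selftext`, `link_flair_text`) are PRAW's; the helper, the dictionary keys, and the stand-in object are assumptions made here so that the example runs without network access. Comments are left out because they live in a separate `submission.comments` forest that has to be fetched per post.

```python
from types import SimpleNamespace


def submission_to_row(submission):
    """Pull the candidate fields out of one PRAW submission-like object."""
    return {
        "url": submission.url,
        "title": submission.title,
        "body": submission.selftext,          # empty string for link posts
        "flair": submission.link_flair_text,  # None when the post has no flair
    }


# Stand-in object with the same attribute names, so the sketch runs offline.
fake = SimpleNamespace(
    url="https://www.reddit.com/r/india/comments/abc123/",
    title="Example title",
    selftext="",
    link_flair_text="AskIndia",
)
row = submission_to_row(fake)
print(row)
```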

We need a train_dataset and a matching train_labelset. The train_labels can be obtained by scraping the post flair, so the flair needs to be collected. What is left is the train_data, for which the URL and the title are the easiest fields to obtain and train on. <br>

From a Machine Learning perspective, the title intuitively seems like the best input for training a classifier. The following reasons can be cited for this:


1.   It is a short string, so it does not require a lot of preprocessing
2.   It is uniform across all posts. A title contains only Unicode text, and no images, sounds, or links can find their way into it

Let us look at the trends in the three sorting types: New, Hot, Top

# HOT

![picture](https://drive.google.com/uc?id=14RmlK8b3d3RtdRfPFUfrUtrBJlmOUyzV)

It is clear that Coronavirus is trending heavily at the moment. We must keep this in mind when scraping data, as it may bias the dataset. The only other prevalent flairs are Politics and Non-Political. AskIndia also becomes prevalent when limit is set to None, so that PRAW is free to fetch as many submissions as possible.
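
A distribution like the one plotted above can be tallied with `collections.Counter`. This is a minimal sketch, not the plotting code used for the figure: the helper name `flair_shares` and the sample list are made up for illustration.

```python
from collections import Counter


def flair_shares(flairs):
    """Return (flair, count, share) tuples, most common first."""
    counts = Counter(f if f is not None else "(no flair)" for f in flairs)
    total = sum(counts.values())
    return [(flair, n, n / total) for flair, n in counts.most_common()]


# Made-up sample, only to show the shape of the output.
sample = ["Coronavirus", "Coronavirus", "Politics", None, "Coronavirus", "Non-Political"]
shares = flair_shares(sample)
for flair, n, share in shares:
    print(f"{flair:15s} {n:3d} {share:.0%}")
```

With a live connection, the same helper could be fed `(s.link_flair_text for s in reddit.subreddit('india').hot(limit=None))`.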

# NEW

![picture](https://drive.google.com/uc?id=1Mflw2dV5QW5SYT-brwKlZ0el-xCymAuk)

Coronavirus breaks the plot here. This shows how popular the talking points about the virus are. This can cause problems for us. There are similar observations to HOT.

# TOP

Sorting by top will not help, for the following reasons:

*   The Coronavirus flair is dominant on r/India right now, whereas the TOP posts of all time mostly carry flairs that were prevalent earlier, such as demonetization and GST, and are not prevalent now. It is therefore not advisable to run with TOP
*   r/India does not seem to have an AutoModerator or moderators keeping track of flairs, so there are often flairs with few or no posts, such as "OC", "Unverified", "| REPOST |" etc., which should be avoided. TOP has a lot of such flairs that are either no longer canon or simply too rarely used to be included as a label.

So what criteria make a flair useful? Here are the key ones:

*   It needs to be relevant in the current times
*   It needs to have enough posts and some form of validation for its selection.
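
The "enough posts" criterion can be sketched as a simple threshold on the flair counts. The helper name `usable_flairs`, the threshold value, and the sample data below are assumptions for illustration; relevance to the current times is handled separately, by sampling from HOT and NEW rather than TOP.

```python
from collections import Counter


def usable_flairs(flairs, min_posts):
    """Keep the flairs that occur at least min_posts times (posts without a flair are ignored)."""
    counts = Counter(f for f in flairs if f)
    return sorted(f for f, n in counts.items() if n >= min_posts)


# Made-up sample: "OC" appears only once and is dropped.
sample = ["Coronavirus", "Politics", "Coronavirus", "OC", "Politics", None]
kept = usable_flairs(sample, min_posts=2)
print(kept)
```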

A good way would be to start a new post on r/India and select a flair for the post. Here are the listed flairs:

![alt text](https://drive.google.com/uc?id=1exsXh3X7YV9r7pAju3ab-mgdCeFZDOLd)

Interesting.<br>

This can be used as our train_label flairs perhaps?




## **Task 3: MAKE A FLAIR DETECTOR**

Let us now try to build a classifier, now that we have train_data and train_labels. For the code that converts the EDA findings into a training dataset, see reddit_scraping.ipynb under jupyter notebooks in the repository.

## Make the model.

Because the data is made of strings, Machine Learning is a natural fit here. The first attempt at a classifier was a OneVsRest classifier. The model looks like so:



In [0]:
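from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Turn the flair labels into a binary indicator matrix, one column per flair
lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)

# Word counts -> TF-IDF weights -> one linear SVM per flair
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)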

There were, of course, some issues with using the OneVsRest classifier.
It had high accuracy (74%), but it often produced no output at all, that is, no classification, and days of debugging yielded no explanation. Also, at this point the data was used without any preprocessing, so it was often broken by inverted commas, quotes, and commas. Hence there was a need for cleaning the data. The following code, which depends on the NLP library "nltk" and on "re", helps us here:

In [0]:
import re
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

# Get all the symbols that are undesired and unwanted in the training/testing set
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+,_"]')
STOPWORDS = set(stopwords.words('english'))
TAG_RE = re.compile(r'<[^>]+>')

# Remove HTML tags if any from the samples
def remove_tags(text):
    return TAG_RE.sub('', text)

# Perform removal of unwanted symbols from the samples
def preprocess_text(text):
    text = text.lower()
    text = remove_tags(text)
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = BAD_SYMBOLS_RE.sub('', text)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text

This replaces every symbol matched by REPLACE_BY_SPACE_RE with a space, deletes every character matched by BAD_SYMBOLS_RE, and filters out any HTML tags and English stopwords that find their way into the title.
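
To see what the cleaning does to a single title, here is a self-contained sketch of the same steps. It reuses the regular expressions from the cell above but swaps NLTK's stopword list for a tiny hard-coded set (an assumption made so the example runs without downloading the NLTK corpus), so the real function will drop more words. The input title is made up.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+,_"]')
TAG_RE = re.compile(r'<[^>]+>')
# Stand-in for set(stopwords.words('english')); hypothetical and far from complete.
DEMO_STOPWORDS = {"why", "is", "so", "the", "a"}


def preprocess_demo(text):
    text = text.lower()                        # normalise case first
    text = TAG_RE.sub('', text)                # strip HTML tags
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # brackets, commas etc. become spaces
    text = BAD_SYMBOLS_RE.sub('', text)        # drop everything else that is not allowed
    return ' '.join(w for w in text.split() if w not in DEMO_STOPWORDS)


cleaned = preprocess_demo('Why is <b>GST</b> so high?! [Serious]')
print(cleaned)  # gst high serious
```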

The classifier still failed, with the accuracy reaching only 44% on 1722 new posts. The problem persisted, with blank predictions repeatedly finding their way into the output.
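
One possible explanation for the blank predictions, offered here as an assumption rather than something the original debugging confirmed: with `MultiLabelBinarizer` targets, `OneVsRestClassifier` fits one independent yes/no classifier per flair, and nothing forces at least one of them to say yes. A row where every classifier says no decodes to an empty label set. The sketch below shows that decoding step, plus a hypothetical fallback (`pick_best_label`) that takes the highest-scoring flair instead. The flair names and scores are made up.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

lb = MultiLabelBinarizer()
# Each sample's labels must be a list; a bare string would be split into characters.
lb.fit([["Politics"], ["Coronavirus"], ["AskIndia"]])

# An all-zero indicator row (every per-flair classifier said "no") decodes to no label.
all_no = np.zeros((1, len(lb.classes_)), dtype=int)
decoded = lb.inverse_transform(all_no)
print(decoded)  # [()]


def pick_best_label(scores, classes):
    """Hypothetical fallback: choose the flair with the highest decision score per row."""
    return [classes[i] for i in np.argmax(scores, axis=1)]


# Made-up decision scores for one post; all are negative, so no flair would be predicted.
scores = np.array([[-0.5, -0.2, -0.9]])
best = pick_best_label(scores, lb.classes_)
print(best)
```

On a fitted pipeline the scores could come from `classifier.decision_function(X_test)`. Training on plain single-label targets instead of an indicator matrix is another way to guarantee exactly one flair per post.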

We needed new classifiers, perhaps trying out all the candidates by trial and error and then choosing the best one.

So I developed new models:

In [0]:
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def fit_and_save(clf, filename):
    # Shared pipeline: word counts -> TF-IDF weights -> the given classifier
    classifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', clf)])
    classifier.fit(X_train, y_train_vectored)
    # Persist the fitted pipeline so it can be reloaded without retraining
    joblib.dump(classifier, filename)


def ml_train(model_type):
    if model_type == "baye":
        fit_and_save(MultinomialNB(), 'native_baye.sav')

    elif model_type == "sgd":
        # Hinge loss makes this a linear SVM trained with stochastic gradient descent
        fit_and_save(
            SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                          random_state=42, max_iter=5, tol=None),
            'sgd.sav')

    elif model_type == "regression":
        fit_and_save(LogisticRegression(n_jobs=1, C=1e5, max_iter=10000), 'regression.sav')

    elif model_type == "random_forrest":
        fit_and_save(RandomForestClassifier(n_estimators=1000, random_state=42),
                     'random_forrest.sav')

    elif model_type == "all":
        # Train and save every model in turn
        for name in ("baye", "sgd", "regression", "random_forrest"):
            ml_train(name)

This seemed to work: the problem of blank prediction was solved and we had the opportunity to choose the best from a plethora of models. The accuracy reached now was as follows:

*On Title*<br>
NB: 53%<br>
LogRegression: 49%<br>
RandomForest: 46%<br>
LinearSVM: 49%<br>

This seems to make NB the best. However, we are missing one thing: there are other inputs to train the classifier on, such as the comments and the URL itself. The scraping for these datasets can be seen in reddit_scraping.ipynb. Here are the results:

*On Comments*<br>
NB: 26%<br>
LogRegression: 22%<br>
RandomForest: 24%<br>
LinearSVM: 22%<br>

*On URL*<br>
NB: 10%<br>
LogRegression: 8%<br>
RandomForest: 7%<br>
LinearSVM: 8%<br>
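
Accuracy figures like the ones listed above can be obtained by fitting every candidate on the same split and scoring it with `accuracy_score`. The sketch below is not the evaluation code behind those numbers: the helper `compare_models`, the reduced set of two models, and the toy titles are assumptions chosen to keep the example small and fast.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_models(X_train, y_train, X_test, y_test):
    """Fit each candidate on the same split and return {name: accuracy}."""
    candidates = {
        "NB": MultinomialNB(),
        "LogRegression": LogisticRegression(max_iter=1000),
    }
    results = {}
    for name, clf in candidates.items():
        # TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer.
        model = make_pipeline(TfidfVectorizer(), clf)
        model.fit(X_train, y_train)
        results[name] = accuracy_score(y_test, model.predict(X_test))
    return results


# Made-up titles and flairs, only to exercise the function.
train_titles = ["lockdown extended again", "new virus cases today",
                "election rally in delhi", "parliament passes bill"]
train_flairs = ["Coronavirus", "Coronavirus", "Politics", "Politics"]
test_titles = ["virus lockdown update", "bill debated in parliament"]
test_flairs = ["Coronavirus", "Politics"]
results = compare_models(train_titles, train_flairs, test_titles, test_flairs)
print(results)
```

Swapping the title column for the comment text or the URL string in `X_train` and `X_test` gives the other two comparisons.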

So the comment and URL runs were a trainwreck, but we need to figure out why. For the URL we can answer intuitively, but for comments we need to delve deeper. Here are some noticeable things:

1) Comments are subjective and may go off on tangents.
2) Comments on r/India are often Hindi words written in English (Latin) script, which may throw off the classifier.

Let us now look at why the title fails. Consider this title that was found while training:

"I just got fired"

Without context, it appears to be Non-Political. But it is flaired as Coronavirus, because the person making the post lost their job due to the outbreak. Hence the confusion of the classifier.
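
Titles like this one can be surfaced by listing every test post where the predicted flair differs from the true one. This is a minimal sketch with a hypothetical helper `misclassified`. The first row uses the title and flair from the example above, with "Non-Political" as an illustrative prediction; the second row is made up.

```python
def misclassified(titles, true_flairs, predicted_flairs):
    """Return (title, true, predicted) for every wrong prediction."""
    return [
        (t, y, p)
        for t, y, p in zip(titles, true_flairs, predicted_flairs)
        if y != p
    ]


errors = misclassified(
    ["I just got fired", "Lockdown extended till May"],
    ["Coronavirus", "Coronavirus"],
    ["Non-Political", "Coronavirus"],  # illustrative predictions, not real model output
)
print(errors)
```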

That was my journey through this problem set. Thank you.