In [0]:
!pip install -q praw pandas numpy scikit-learn matplotlib
!pip install -q nltk

In [0]:
import praw
import pandas as pd

The first approach was to use PRAW (Python Reddit API Wrapper), one of the most widely used Python libraries for extracting data from Reddit. The code block below was meant to scrape 500 submissions for each flair, but it returned only around 250 per flair. In addition, Reddit caps search results at 1,000 per query. I had in mind roughly 6,000 submissions as an appropriate dataset size for this task, given that transfer learning would be used, so I had to find ways around the low number of submissions scraped from Reddit.


In [0]:
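import os

# Credentials are read from environment variables instead of being hardcoded;
# the variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are placeholders.
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'], client_secret=os.environ['REDDIT_CLIENT_SECRET'], user_agent='Reddit WebScraping')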

subreddit = reddit.subreddit('india')
topics_dict = {"flair":[], "title":[], "url":[], "comms_num": [], "body":[], "author":[]}

for flair in flairs:
  
  get_subreddits = subreddit.search(flair, sort='new', syntax='cloudsearch', limit=500)
  
  for submission in get_subreddits:
    
    topics_dict["flair"].append(flair)
    topics_dict["title"].append(submission.title)
    topics_dict["url"].append(submission.url)
    topics_dict["comms_num"].append(submission.num_comments)
    topics_dict["body"].append(submission.selftext)
    topics_dict["author"].append(submission.author)

topics_data = pd.DataFrame(topics_dict)
topics_data.shape
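To make it easier to see how many submissions each flair query really returns, the collection loop can be separated from the Reddit call. The following is only a sketch under assumptions: `collect_submissions`, `search_fn` and `fake_search` are hypothetical names that are not part of the original notebook, and `fake_search` stands in for a PRAW search so the example runs offline. Submissions returned for more than one flair are kept only once, under the first flair that returned them.

```python
from types import SimpleNamespace


def collect_submissions(search_fn, flairs, limit=500):
    """Collect one row per unique submission for every flair.

    search_fn(query, limit) is a hypothetical callable that returns an
    iterable of objects exposing id, title, url, num_comments, selftext
    and author (as PRAW submissions do).
    """
    rows, seen = [], set()
    for flair in flairs:
        for s in search_fn(flair, limit):
            if s.id in seen:  # already returned by an earlier flair query
                continue
            seen.add(s.id)
            rows.append({
                "flair": flair,
                "title": s.title,
                "url": s.url,
                "comms_num": s.num_comments,
                "body": s.selftext,
                "author": str(s.author),  # keep the name, not the object
            })
    return rows


def fake_search(query, limit):
    """Offline stand-in for a Reddit search; returns three fake posts."""
    posts = [
        SimpleNamespace(id=f"{query}-{i}", title=f"{query} post {i}",
                        url="https://example.com", num_comments=i,
                        selftext="", author="someone")
        for i in range(3)
    ]
    return posts[:limit]


rows = collect_submissions(fake_search, ["Politics", "Food"], limit=2)
print(len(rows))
```

With PRAW, `search_fn` could presumably be a small wrapper around `subreddit.search`, and the resulting list of rows can be passed straight to `pd.DataFrame`.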

After a detailed study of the PRAW documentation and Reddit's scraping limitations, it was evident that PRAW alone could only get me to a dataset of about 2,700 submissions, which might eventually lead to overfitting. The following code block is another attempt, this time using cloudsearch to scrape submissions for each flair individually. After some research, however, I found that Reddit has dropped support for cloudsearch, so this was not a viable option. Another approach could have been the submissions function in PRAW, but it has been deprecated due to changes made by Reddit.


In [0]:
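import os

# Credentials are read from environment variables instead of being hardcoded;
# the variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are placeholders.
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'], client_secret=os.environ['REDDIT_CLIENT_SECRET'], user_agent='Reddit WebScraping')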

subreddit = reddit.subreddit('india')
topics_dict = {"flair":[], "title":[], "url":[], "comms_num": [], "body":[], "author":[]}

import datetime
params = {'sort':'new', 'limit':None, 'syntax':'cloudsearch'}
time_now = datetime.datetime.now()

get_subreddits = subreddit.search('timestamp:{0}..{1}'.format(
    int((time_now - datetime.timedelta(days=365)).timestamp()),
    int(time_now.timestamp())), 
    **params)

for submission in get_subreddits:
    
    topics_dict["flair"].append(flairs[0])
    topics_dict["title"].append(submission.title)
    topics_dict["url"].append(submission.url)
    topics_dict["comms_num"].append(submission.num_comments)
    topics_dict["body"].append(submission.selftext)
    topics_dict["author"].append(submission.author)

metadata = pd.DataFrame(topics_dict)
metadata.shape

To overcome the above shortcomings, my chosen approach was to combine recent data from Reddit India, acquired through PRAW, with the 2019 data available on Google BigQuery. This means that the Coronavirus and CAA-NRC-NPR flairs would be capped at about 250 submissions each because of the PRAW limitations on Reddit. The following code block was then used to analyse the data acquired through BigQuery and see how many months of data, going back from August 2019, I would need to reach at least 600 cases for each of the remaining 10 flairs.

In [0]:
import numpy as np

# Count the number of submissions available for each of the 12 flairs
length = np.zeros((12, 1))
for i in range(12):
  flair_data = data[data['flair'] == flairs[i]]
  length[i] = flair_data.shape[0]
length
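The same per-flair count can also be obtained directly with pandas. The snippet below is a small sketch on a toy frame; the toy `data` frame and the `minimum_needed` threshold are assumptions made for illustration, not the BigQuery export itself.

```python
import pandas as pd

# Toy stand-in for the BigQuery export: one row per submission
data = pd.DataFrame(
    {"flair": ["Politics", "Politics", "AskIndia", "Food", "AskIndia", "Politics"]}
)

# Number of submissions per flair, largest first
counts = data["flair"].value_counts()
print(counts)

# Flairs that still fall short of the required number of cases
minimum_needed = 2  # illustrative threshold; the notebook aims for 600
short = counts[counts < minimum_needed]
print(short.index.tolist())
```

Listing the flairs that fall short makes it easy to decide whether another month of data has to be added.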

The approach of combining the PRAW data and the BigQuery data worked out, as shown in the Reddit Web Scraping (Part I) Jupyter notebook. The most recent data available on BigQuery, from June, July and August 2019, were taken. Based on the code block above, the data were analysed and the size of each flair category was decided from the distribution of the PRAW and BigQuery data. A total of 6,000 cases was taken: 7 of the flairs have 500 cases each, whereas the Coronavirus and CAA-NRC-NPR categories were capped at 250 and 109 cases respectively due to the shortcomings of PRAW. Politics, AskIndia and Non-Political had a larger share of cases, so their sizes are 714, 714 and 713 respectively. After a preliminary study of the final dataset, I feel that the small number of cases for Coronavirus and CAA-NRC-NPR shouldn't be a difficulty for the model, as the title of a submission mostly gives away the flair in these cases.
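The per-flair capping described above could be done along the following lines. This is a hedged sketch rather than the code from the Part I notebook: `cap_per_flair`, its parameters and the toy frame are hypothetical, and random sampling is assumed as the way of picking which rows to keep. The last two lines simply check that the category sizes quoted above add up to 6,000.

```python
import pandas as pd


def cap_per_flair(df, caps, default_cap, seed=0):
    """Randomly keep at most caps[flair] (or default_cap) rows per flair."""
    parts = []
    for flair, group in df.groupby("flair"):
        n = min(len(group), caps.get(flair, default_cap))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts, ignore_index=True)


# Toy data: 5 Politics, 2 Food and 4 Sports submissions
toy = pd.DataFrame({
    "flair": ["Politics"] * 5 + ["Food"] * 2 + ["Sports"] * 4,
    "title": [f"title {i}" for i in range(11)],
})

balanced = cap_per_flair(toy, caps={"Politics": 4}, default_cap=3)
print(balanced["flair"].value_counts().to_dict())

# Sizes quoted in the text: 7 flairs of 500, then 250, 109, 714, 714, 713
sizes = [500] * 7 + [250, 109] + [714, 714, 713]
print(sum(sizes))
```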

**Part - 2**

The following code blocks were used to plot the frequency of words appearing in the title of submissions of a particular flair (category) using Pandas, Seaborn and Matplotlib. As expected, stopwords had the highest frequency in the output. To improve this, the titles need to be refined by removing English stopwords for a better word frequency analysis.

In [0]:
import csv
import os
import random
import json
import numpy as np
import pandas as pd
import nltk 
nltk.download('stopwords')
from nltk.corpus import stopwords
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import xgboost as xgb
import gc
import seaborn as sns
color = sns.color_palette()

eng_stopwords = set(stopwords.words("english"))

import matplotlib.pyplot as plt

In [0]:
Flair_0 = Data[Data['flair'] == flairs[0]]
Word_frequency = pd.Series(' '.join(Flair_0.title).split()).value_counts()[:10]

plt.figure(figsize=(18,8))
sns.barplot(x=Word_frequency.index, y=Word_frequency.values, alpha=0.8)
plt.ylabel('Frequency', fontsize=12)
plt.xlabel('Top 10 Words appearing in Title of '+str(flairs[0])+' Category Submissions', fontsize=12)
plt.show()

The following code successfully removed the stopwords, and the frequency distribution became far more informative. Two punctuation marks still showed up in the distribution, so a further iteration is to remove punctuation marks from the analysis data for an even cleaner word frequency distribution.

In [0]:
Analysis_Data = Data.copy()  # work on a copy so the raw titles in Data stay untouched
Analysis_Data['Analysis_Title'] = Analysis_Data['title'].apply(lambda x: ' '.join([word for word in x.split() if word not in (eng_stopwords)]))

Flair_0 = Analysis_Data[Analysis_Data['flair'] == flairs[0]]
Word_frequency = pd.Series(' '.join(Flair_0.Analysis_Title).split()).value_counts()[:10]

plt.figure(figsize=(18,8))
sns.barplot(x=Word_frequency.index, y=Word_frequency.values, alpha=0.8)
plt.ylabel('Frequency', fontsize=12)
plt.xlabel('Top 10 Words appearing in Title of '+str(flairs[0])+' Category Submissions', fontsize=12)
plt.show()

The following line of code was used to remove punctuation marks for the final word frequency distribution.

In [0]:
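# regex=True is required in recent pandas versions for the pattern to be treated as a regular expression
Analysis_Data['Analysis_Title'] = Analysis_Data['Analysis_Title'].str.replace(r'[^\w\s]', '', regex=True)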
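The stopword and punctuation steps can also be combined into one small helper. The sketch below is an illustration under assumptions: `top_words` is a hypothetical function, the titles and the tiny stopword set are made up, and lowercasing before the stopword check is an extra step that the cells above do not perform (without it, capitalised stopwords such as "The" are not removed).

```python
import re
from collections import Counter


def top_words(titles, stopwords, n=10):
    """Return the n most frequent words after cleaning the titles."""
    counter = Counter()
    for title in titles:
        # Lowercase first so that "The" and "the" are treated alike,
        # then strip everything that is not a word character or whitespace
        cleaned = re.sub(r"[^\w\s]", "", title.lower())
        counter.update(w for w in cleaned.split() if w not in stopwords)
    return counter.most_common(n)


titles = ["The budget is out!", "Is the budget good?", "Budget: what changed"]
stop = {"the", "is", "what", "out"}  # tiny stand-in for the NLTK stopword list
print(top_words(titles, stop, n=3))
```

In the notebook, `stop` would be `eng_stopwords` and `titles` the `title` column of a single flair.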

The following code was used to clean and preprocess the text before meta-feature extraction and the separation of the dataset into train, validation and test sets.

In [0]:
import unicodedata
import re

def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )
def normalizeString(s):
    if not isinstance(s, float):
      s = unicodeToAscii(s.lower().strip())
      s = re.sub(r"([.!?])", r" \1", s)
      s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s

Analysis_Data['Text'] = Analysis_Data['Text'].apply(lambda x: normalizeString(x))

Data['Text'] = Data['Text'].apply(lambda x: normalizeString(x))

On inspecting the data, it was evident that some submissions with a deleted, removed or NaN body had also been scraped from Reddit. To prevent these from hindering the performance of the model, the following code was used to count such cases and replace them with an empty string. After the replacement, the count was checked again to validate the removal.

In [0]:
empties = ['nan', '[deleted]', '[removed]']
for empty in empties:
  print(empty, len(Data[Data['body'] == empty]))

Data['body'] = Data['body'].apply(lambda x: '' if x in empties else x)

empties = ['nan', '[deleted]', '[removed]']
for empty in empties:
  print(empty, len(Data[Data['body'] == empty]))
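One caveat: a string comparison with 'nan' only catches bodies that were stored as the literal text "nan". If the column holds real missing values (float NaN), a check such as the following may be needed. This is a sketch on a toy Series; `bodies`, `placeholders`, `is_empty` and `cleaned` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy body column with a real NaN, the two Reddit placeholders and a "nan" string
bodies = pd.Series(["some text", np.nan, "[deleted]", "[removed]", "nan"])
placeholders = ["nan", "[deleted]", "[removed]"]

# True for real missing values as well as for the placeholder strings
is_empty = bodies.isna() | bodies.isin(placeholders)
print(int(is_empty.sum()))

# Replace every empty body with an empty string
cleaned = bodies.mask(is_empty, "")
print(cleaned.tolist())
```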

The following block of code was used to split the data into three subsets: the train, validation and test sets. The split is 60 %, 20 % and 20 % respectively, which gives 3,600, 1,200 and 1,200 submissions in the three subsets.

In [0]:
x, x_test, y, y_test = train_test_split(Data.Text,Data.flair,test_size=0.2,train_size=0.8)
x_train, x_val, y_train, y_val = train_test_split(x,y,test_size = 0.25,train_size =0.75)

Train_Data = pd.merge(x_train, y_train, left_index=True, right_index=True)
Val_Data = pd.merge(x_val, y_val, left_index=True, right_index=True)
Test_Data = pd.merge(x_test, y_test, left_index=True, right_index=True)

Train_Data.to_csv('Train_Data.csv', index=False)
Val_Data.to_csv('Val_Data.csv', index=False)
Test_Data.to_csv('Test_Data.csv', index=False)
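Because some flairs have far fewer cases than others, a stratified and seeded variant of the same 60/20/20 split may be worth considering. The sketch below is not what the notebook ran: the toy frame, the `stratify` arguments and `random_state=42` are assumptions added for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset: 100 posts spread evenly over two flairs
toy = pd.DataFrame({
    "Text": [f"post {i}" for i in range(100)],
    "flair": ["A", "B"] * 50,
})

# First hold out 20 % as the test set, keeping the flair proportions
rest, test_df = train_test_split(
    toy, test_size=0.2, stratify=toy["flair"], random_state=42
)
# Then take 25 % of the remaining 80 % as validation (20 % of the total)
train_df, val_df = train_test_split(
    rest, test_size=0.25, stratify=rest["flair"], random_state=42
)

print(len(train_df), len(val_df), len(test_df))
```

Fixing the seed makes the three CSV files reproducible, and stratifying keeps small categories such as CAA-NRC-NPR represented in every subset.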

**Part - 3**

The Part-3 code is posted below. There were no major problems during the development of Part-3. A major emphasis was laid on reading the Ktrain documentation and using it to its fullest potential. The XLNet model was trained and studied extensively; it reached a validation accuracy of 64.25 % and a test accuracy of 67 %, and in several of the cases studied its predictions matched what a human would choose.

In [0]:
# Installing the required modules
# The version specifier is quoted so that the shell does not treat ">" as a redirect
!pip install -q "tensorflow_gpu>=2.0" pandas ktrain

# Importing necessary modules
import tensorflow as tf
print(tf.__version__)

tf.test.gpu_device_name()

import pandas as pd
import numpy as np

# Loading Train, Validation, and Test Data as Pandas DataFrame
Train_data = pd.read_csv('/content/Train_Data.csv')
Val_data = pd.read_csv('/content/Val_Data.csv')
Test_data = pd.read_csv('/content/Test_Data.csv')

# Print the 12 Flair Categories
flairs = list(set(Train_data['flair']))
flairs

# Print size of training, validation and testing set
print('size of training set: %s' % (len(Train_data['Text'])))
print('size of validation set: %s' % (len(Val_data['Text'])))
print('size of testing set: %s' % (len(Test_data['Text'])))

# Splitting the datasets into input and target labels for deep learning model
x_train = Train_data['Text']
y_train = Train_data['flair']
x_val = Val_data['Text']
y_val = Val_data['flair']
x_test = Test_data['Text']
y_test = Test_data['flair']

"""The Deep Learning (DL) model I chose for the task of Reddit flair prediction is XLNet, introduced by Google Brain and CMU. To understand the intuition behind choosing XLNet, let's dive into some Natural Language Processing (NLP) history. Reddit flair prediction falls under text classification, as we are trying to classify the collection of words in a post. The generic approach to text classification three years back was to use Machine Learning (ML) algorithms such as Logistic Regression and Random Forest. For these methods, the data was first pre-processed to remove stopwords and punctuation and then converted to word-vector embeddings so that the ML algorithm could be applied. NLP studies made it quite evident that stopwords and punctuation carry useful information for the model, but stopwords are so frequent in text that keeping them increases the sequence length. As the sequence length increases, the performance of the standard ML algorithms decreases.

So, as Deep Learning models improved and started to be applied to NLP, Google introduced Transformers (attention-based encoder-decoder models) for NLP applications. In 2018, a wide range of Language Models were introduced, with the main focus being on Google's BERT. Language Models are based on Deep Learning models and overcome the drawbacks of the classic ML algorithms, which gives them large potential in various fields of NLP. Their application is based on transfer learning. For Deep Learning models, the size of the dataset plays an important role in overfitting and generalization capability. Language Models have two phases: a pretrain phase and a finetune phase. In the pretrain phase, the Language Model is trained on large corpora of text, allowing it to gain contextual information about the English language. The pretrain phase is where XLNet (based on Transformer-XL) differs from BERT. XLNet uses Permutation Language Modeling (PLM), which gives it greater generalization capability and language understanding than BERT and enables it to outperform BERT in text classification. In the finetune phase, Language Models require only a small amount of data from the application domain to achieve great performance. So using transfer learning with XLNet on our dataset allows for strong classification ability even with a small dataset. Another advantage of Language Models is that they use tokenization as pre-processing, thereby preserving the information in stopwords and punctuation and leading to better language understanding. These factors compelled me to use XLNet for Reddit Flair Detection.

Google Brain and CMU have open-sourced the pre-trained XLNet model for application in different domains. The Hugging Face Transformers Python package contains most of the pre-trained Language Models. In Pytorch, FASTAI makes it easy to deploy language models. For this project I decided to use Tensorflow 2.1.0. For training (finetuning) and validation of the pretrained XLNet model I used Ktrain, a lightweight wrapper for Keras, which provides the pretrained Language Models from the Hugging Face Transformers package. First, Ktrain tokenizes (pre-processes) the data into the format required by the language model, here XLNet. XLNet comes in two sizes; the XLNet Base version (the smaller one) is used here. A Keras model is then created with the pre-trained weights loaded. The following code is the procedure used to train XLNet for Reddit Flair Detection.
"""

# Import ktrain module, preprocess the training and validation data
# and initialize XLNet model
import ktrain
from ktrain import text
MODEL_NAME = 'xlnet-base-cased'
t = text.Transformer(MODEL_NAME, maxlen=500, class_names=flairs)
train = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_val, y_val)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=train, val_data=val, batch_size=6)

"""Before training starts, it is good practice to find the optimal learning rate for the model. Advanced APIs such as FASTAI and Ktrain enable us to find it by simulating training for a few epochs at different learning rates and plotting a loss vs. learning rate graph. The optimal learning rate can then be chosen by observing the loss trend."""

learner.lr_find(show_plot=True, max_epochs=2)

"""On observing the plot, it can be seen that 5e-5 is an optimal learning rate. Now we start training the model based on onecycle policy initially for 4 epochs to see if there is improvement of model performance as training continues."""

learner.fit_onecycle(5e-5, 4)
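To give an intuition for what a one-cycle policy does with the learning rate, here is a simplified triangular schedule. It is only a sketch and not Ktrain's actual implementation: `one_cycle_lr` is a hypothetical helper, and the starting rate of one tenth of the maximum is an assumption chosen for illustration.

```python
def one_cycle_lr(step, total_steps, max_lr, min_lr=None):
    """Triangular one-cycle schedule: ramp up for half the steps, then down."""
    if min_lr is None:
        min_lr = max_lr / 10  # assumed starting point, not a Ktrain default
    half = total_steps / 2
    if step <= half:
        fraction = step / half                   # warm-up phase
    else:
        fraction = (total_steps - step) / half   # cool-down phase
    return min_lr + (max_lr - min_lr) * fraction


# Learning rate at the start, middle and end of a 100-step cycle
for step in (0, 25, 50, 75, 100):
    print(step, one_cycle_lr(step, 100, max_lr=5e-5))
```

The rate climbs to the chosen maximum of 5e-5 in the middle of training and falls back to its starting value at the end.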

"""It can be seen that the model is learning and the validation accuracy is improving."""

learner.save_model('Model-1')

learner.load_model('/content/drive/My Drive/Model-XLNet', t)

"""Continue training the model for 5 more epochs based on onecycle policy."""

learner.fit_onecycle(5e-5, 5, checkpoint_folder='/content/drive/My Drive/Model-XLNet')

"""It can be observed that the validation accuracy is reaching a plateau. So now to finish the model training, the approach is to use early stopping and reducing learning rate on reaching a plateau. These features are inbuilt into the autofit function of Ktrain. Early stopping prevents the model from overfitting on the training dataset."""

learner.autofit(2e-5)

"""Now the model training has been successfully finished and XLNet model with Validation accuracy of 64.25 % has been obtained. Since the validation and test data set are of similar distribution, it is expected that the Test accuracy is also around 65 %. The following code is used to obtain precision, recall, f1-score for each flair category in the dataset. Also the confusion matrix is obtained and plotted for detailed analysis."""

cm_val = learner.validate(val_data=val, class_names=flairs)

"""Looking at the F1-scores for all 12 flair categories, it can be inferred that the XLNet model performs well on almost all of them, with some hindrance in the AskIndia and Coronavirus categories. The model does exceptionally well at classifying the 'Science/Technology', 'Politics', 'Business/Finance' and 'Food' flairs. It also performs well on 'Non-political' and 'Sports'.

Next step will be to plot the Confusion Matrix for the validation set for further analysis. A Confusion Matrix is a performance measurement for Machine Learning and also DL Methods. Confusion Matrix is widely used for describing performance of Classification models and it allows the visualization of performance of classifier. The confusion matrix shows the ways in which your classification model is confused when it makes predictions. It gives us insight not only into the mistakes being made by a classifier but more importantly the types of mistakes that are being made.
"""

import seaborn as sn
import pandas as pd
import matplotlib.pyplot as plt

df_cm = pd.DataFrame(cm_val, index = [i for i in flairs],
                  columns = [i for i in flairs])
plt.figure(figsize = (10,7))
sn.heatmap(df_cm, annot=True)

"""In the confusion matrix the row index indicates the predicted class and the column index indicates the actual class. On analysing the heatmap of the confusion matrix, we can see that the XLNet model shows good performance across all 12 flairs, as depicted by the diagonal of the matrix. In line with the statistical report, the model has almost no confusion in predicting the 'Science/Technology', 'Politics', 'Business/Finance' and 'Food' flairs. An unexpected trend is the confusion between Photography and Coronavirus, which shows up as a square formed by those blocks in the heatmap. The model performs relatively worse on Coronavirus, possibly because the domains of the posts intersect; this is clearly evident from the Coronavirus row. Another confusion is between Photography and AskIndia, where most AskIndia posts are detected as Photography. There also seems to be slight confusion between Non-Political and AskIndia posts. The confusion matrix also suggests a reason for the relatively low performance on AskIndia: the AskIndia column shows confusion spread across most of its entries, indicating intersecting domains or topics as a possible cause. For AskIndia posts this is entirely plausible, because they are questions on a wide variety of topics.

Thus, this analysis gives good insight into where the model works and where its performance might be hindered. The next step is to verify the effectiveness of the model by validating it on the test set and checking whether the same insights hold true.

Next, the test data is preprocessed so that we can find out the test accuracy. Since the XLNet model has not been exposed to the test data prior to testing, the test accuracy can be an indicator of real-world accuracy on the 12 flair categories.
"""

test = t.preprocess_test(x_test, y_test)

# Validating model on Test data
cm_test = learner.validate(val_data=test, class_names=flairs)

"""We can observe that the XLNet model has achieved a final test accuracy of 67 %. It performs well across all 12 flairs except AskIndia, which is in line with the performance observed on the validation dataset. As on the validation set, the model works exceptionally well on 'Science/Technology', 'Politics', 'Business/Finance', 'Food' and 'Non-Political'. The model's performance is consistent across the validation and test sets.

Next, the confusion matrix is also plotted to see if the same trend can be observed.
"""

df_cm_test = pd.DataFrame(cm_test, index = [i for i in flairs],
                  columns = [i for i in flairs])
plt.figure(figsize = (10,7))
sn.heatmap(df_cm_test, annot=True)

"""As expected, the same trend is visible in the heatmap of the confusion matrix on the test set. As with the validation set, confusion can be detected between Coronavirus and Photography, and between AskIndia and Non-Political. The same AskIndia trend can also be seen: the confusion for the AskIndia category is spread across all 12 flair categories, indicating that topic intersection might be the main reason, since the questions span various domains.

Finally, it can be concluded that the XLNet model has shown very good generalization capability in classifying the flairs. Even with a relatively small dataset we were able to obtain good results, where only the slight confusion between Coronavirus and Photography seemed unreasonable, bearing in mind that the Coronavirus flair had a limited number of cases due to the Reddit limitations. Overall, the XLNet model achieved good results and showed great potential in classifying the flairs of Reddit posts.
"""

# Saving XLNet model along with weights
learner.save_model('/content/drive/My Drive/Model-XLNet/model_XLNet')

# Loading XLNet model and weights
learner.load_model('/content/drive/My Drive/Models/model_XLNet', t)

"""To wrap things up, I wanted to conduct a small study of the inference capability and language understanding of the model and compare it with a human approach. For this purpose Ktrain has very good explainable AI functions which help in understanding the reasons behind the model's predictions. Explainable AI involves methods and techniques that help us understand how an AI model reaches a particular conclusion. Explainable AI is an area of active research, but there are some established practices for determining the behaviour of a model."""

# Configuring Pandas options to display full content of columns in DataFrame
pd.set_option('display.max_colwidth', None)  # None (instead of the deprecated -1) disables truncation

"""To understand the behaviour of the model, some interesting cases from the top losses in validation and test set are taken and studied."""

learner.view_top_losses(n=10, preproc=t, val_data=val)

Val_data.iloc[[506, 564, 803, 396, 1040, 113, 596, 732, 239, 445],:]

"""On careful observation we can see that for Ids 506, 564, 803 and 596 the model predictions are in line with human predictions. The text in these posts is related to the flair category predicted. For instance, the text of Id 596 seems to be a question asking which skills to learn, and we would also classify it as AskIndia. So we can clearly see that the model has very good language understanding capabilities.

Furthermore, let us study the case with Id 506 with the help of the explain( ) function of Ktrain. In this case the name of the famous cricketer M.S. Dhoni is present and the context of the text is retirement in sports. The model predicts the flair as Sports, which aligns with the human approach. Let us confirm this with the Ktrain function. The explain function is based on the probabilities with which various words in the sentence impact the final classification. The input is randomly perturbed to examine how the prediction changes, and this is used to infer the relative importance of different words to the final prediction.
"""

predictor = ktrain.get_predictor(learner.model, preproc=t)

predictor.predict(x_val[506])
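The perturbation idea described above can be illustrated with a much simpler, deterministic stand-in: remove one word at a time and measure how much the score for a class drops. This leave-one-out sketch is not what Ktrain's explain function does internally; `word_importance`, `toy_sports_score`, the word list and the example sentence are all hypothetical and only serve to show the principle.

```python
def word_importance(text, score_fn):
    """Drop in score when each word is removed (leave-one-out perturbation)."""
    words = text.split()
    base = score_fn(text)
    scores = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores.append((word, base - score_fn(reduced)))
    return scores


SPORT_WORDS = {"cricket", "retirement"}  # toy vocabulary, purely illustrative


def toy_sports_score(text):
    """Toy 'probability of Sports': 0.4 per sport-related word, capped at 1."""
    hits = sum(word in SPORT_WORDS for word in text.lower().split())
    return min(1.0, 0.4 * hits)


sentence = "captain announces retirement from cricket"
for word, delta in word_importance(sentence, toy_sports_score):
    print(f"{word:12s} {delta:+.2f}")
```

Words whose removal lowers the score the most are the ones the toy classifier relies on; in the notebook, the predictor's class probability would take the place of `toy_sports_score`.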

# Required module for explainable AI
!pip install -q git+https://github.com/amaiya/eli5@tfkeras_0_10_1

predictor.explain(x_val[506])

"""As expected, the model's inference is similar to the human approach and way of thinking. I think this demonstrates the great language understanding potential of the XLNet model and shows us why Language Models outperform generic algorithms by such a wide margin.

Now, let us see if the same can be observed in the case of test set.
"""

learner.view_top_losses(n=10, preproc=t, val_data=test)

# id:849 is excluded because it is very big in size so not suitable for 
# studying generalization capability
Test_data.iloc[[532, 1154, 14, 687, 802, 733, 825, 906, 1199],:]

"""The same situation as in the validation set can be observed, indicating consistent performance by the model. Here Id 1154 also presents a scenario where the post spans multiple domains: it is about economics and it is also a question (AskIndia). As with the validation set, we can see that for Ids 14, 802, 825 and 906 the model takes a human-like approach to classification."""

# Save Predictor Model
predictor.save('/content/drive/My Drive/Models/Predictor')

"""The Deep Learning model, XLNet, has been successfully trained for Reddit Flair Prediction, and a detailed analysis of the results has been given. The XLNet model has performed well across the 12 flair categories and has shown a language understanding capability comparable to that of humans in the cases studied. In some cases the model even predicted the appropriate flair when the true label was wrong or contradicted the context of the post. The XLNet model exceeded expectations by showing human-level performance in certain cases. This was also verified by observing the probabilities through the explain function in Ktrain."""


**Part - 4**

Part-4 and Part-5 of the given task were very interesting and challenging for me. This was my first ever experience with web development, both frontend and backend. It was a fun roller-coaster ride figuring out the tricks of designing a website using HTML, CSS and JavaScript, and of designing the backend API for the website. I did not want to compromise on the quality of the website, so I didn't settle for a simple layout; an appealing website with a good user interface and good performance was designed.

Flask was used to create the web application because it is Python-based and reliable, with very good functionality. The website has a two-page design: a Home page and a Predicted Flair page. As part of the task we were also asked to design an API endpoint called '/automated_testing' that can be tested by sending a POST request. Flask was used for this as well, and the code was designed to filter out invalid inputs to the web page. The endpoint was initially designed to take a .txt file as input and output a JSON object. As of 23 April 2020, however, we were asked to accept an opened binary file as input, so the corresponding changes were made with the help of Flask's request handling. The output was also changed to a JSON file response.
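An endpoint of this kind could look roughly like the sketch below. It is not the deployed application's code: the form field name `upload_file` is taken from the test request used later in this notebook, while the placeholder `predict_flair`, the URL-to-flair response format and the example URL are assumptions made for illustration. The last lines exercise the route with Flask's test client instead of a running server.

```python
import io

from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_flair(url):
    """Placeholder for the real predictor; always returns one label."""
    return ["AskIndia"]


@app.route("/automated_testing", methods=["POST"])
def automated_testing():
    uploaded = request.files.get("upload_file")
    if uploaded is None:  # reject requests that carry no file
        return jsonify({"error": "no file uploaded"}), 400
    # The file is expected to contain one submission URL per line
    lines = uploaded.read().decode("utf-8").splitlines()
    urls = [line.strip() for line in lines if line.strip()]
    return jsonify({url: predict_flair(url)[0] for url in urls})


# Exercise the endpoint offline with Flask's built-in test client
client = app.test_client()
payload = {
    "upload_file": (
        io.BytesIO(b"https://www.reddit.com/r/india/comments/abc/\n"),
        "test.txt",
    )
}
response = client.post(
    "/automated_testing", data=payload, content_type="multipart/form-data"
)
print(response.get_json())
```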

The initial approach to designing the HTML templates was to use the request functions from Flask. For designing the website, the text editor at W3.css was very helpful. After implementing the Flask requests in the HTML file, I still did not have working routing between the two HTML pages, or between the app and the website. After quite a bit of research, my approach was to use the FlaskForm class (from the Flask-WTF extension) to develop the routing between the backend and the website frontend. Along the way I also discovered various aspects of HTML site design, ranging from buttons to hyperlinks. Web page backgrounds available in the W3 image gallery were used for the website design.

One problem encountered during development of the web app was a bug in Ktrain related to loading the predictor model. Ktrain has two kinds of objects, a learner and a predictor: the learner is used for model training and the predictor for making predictions. After model validation, the learner's model was wrapped in a predictor and saved successfully. The problem arose when reloading the predictor: due to a bug on the Ktrain side, loading the predictor failed. As a workaround, the developer of Ktrain posted a solution on GitHub based on saving and loading the model as JSON. That approach, however, was based on Tensorflow 1.15 and relied on the deprecated graph and session mechanism, whereas Ktrain now operates on Tensorflow 2.1.0, so it was not viable. The workaround I used instead was to define the preprocessor and learner, load the learner model, and convert it to a predictor. This worked well and didn't cause any drop in performance.



Below is the Ktrain predictor code that was expected to be used.

In [0]:
predictor = ktrain.load_predictor('path_to_predictor_model')
def predict_flair(url):
    s = reddit.submission(url=url)
    Data = {"title":[], "body":[]}
    Data['title'].append(s.title)
    Data['body'].append(s.selftext)
    Data = pd.DataFrame(Data)
    Data['Text'] = Data['title'] + ' ' + Data['body']
    Data['Text'] = Data['Text'].apply(lambda x: normalizeString(x))
    data= Data['Text'].values.tolist()
    return predictor.predict(data)

Below is the workaround for Tensorflow 1.15.

In [0]:
import pickle

import tensorflow as tf
from tensorflow.keras.models import model_from_json
from keras import backend as K

def cat_init():
  # Load model json file
  json_file = open('/content/drive/My Drive/Models/model_XLNet/config.json','r')

  # Load Ktrain preproc file
  features = pickle.load(open('/content/Reddit-Flair-Detector/Predictor.preproc', 'rb'))
  
  # Session
  sess = K.get_session()
  graph = tf.get_default_graph()

  loaded_model_json = json_file.read()
  json_file.close()
  loaded_model = model_from_json(loaded_model_json)

  loaded_model.load_weights("/content/Reddit-Flair-Detector/Models/model_XLNet/tf_model.h5")
  print("Model Loaded from disk")

  #compile and evaluate loaded model
  loaded_model.compile(optimizer='adadelta', loss='categorical_crossentropy', metrics=['acc'])
  return loaded_model,graph,sess,features

The following code is used to test the /automated_testing endpoint API of Web Application.

In [0]:
import requests

url = 'https://reddit-flair-predictor.uc.r.appspot.com/automated_testing'
# Send the text file of submission URLs as an opened binary file
with open('/content/test.txt', 'rb') as f:
  r = requests.post(url, files={'upload_file': f})
data = r.json()
print(data)

The code below is from the Jupyter notebook used to work out parts of the final web app file.

In [0]:
!git clone https://github.com/NikV-JS/Reddit-Flair-Detector.git

!pip install -q -r /content/Reddit-Flair-Detector/requirements.txt

import ktrain
import pandas as pd
import numpy as np

Train_data = pd.read_csv('/content/Train_Data.csv')
Val_data = pd.read_csv('/content/Val_Data.csv')
Test_data = pd.read_csv('/content/Test_Data.csv')

flairs = list(set(Train_data['flair']))

x_train = Train_data['Text']
y_train = Train_data['flair']
x_val = Val_data['Text']
y_val = Val_data['flair']
x_test = Test_data['Text']
y_test = Test_data['flair']

from ktrain import text
MODEL_NAME = 'xlnet-base-cased'
t = text.Transformer(MODEL_NAME, maxlen=500, class_names=flairs)
train = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_val, y_val)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=train, val_data=val, batch_size=6)

learner.load_model('/content/drive/My Drive/Models/model_XLNet', preproc=t)

predictor = ktrain.get_predictor(learner.model, preproc=t)

import praw
import unicodedata
import re

def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )
def normalizeString(s):
    if not isinstance(s, float):
      s = unicodeToAscii(s.lower().strip())
      s = re.sub(r"([.!?])", r" \1", s)
      s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
reddit = praw.Reddit(client_id='EPeQ4_tZaSnieQ', client_secret="o8wiYMDri2RMiF1um14L1rGHXEs", user_agent='Reddit WebScraping')

def predict_flair(url):
    s = reddit.submission(url=url)
    Data = {"title":[], "body":[]}
    Data['title'].append(s.title)
    Data['body'].append(s.selftext)
    Data = pd.DataFrame(Data)
    Data['Text'] = Data['title'] + ' ' + Data['body']
    Data['Text'] = Data['Text'].apply(lambda x: normalizeString(x))
    data= Data['Text'].values.tolist()
    return predictor.predict(data)

Results = pd.DataFrame(columns = ['URL','Flair'])

for url in urls:
  flair = predict_flair(url)
  Results = Results.append([{'URL': url, 'Flair': flair[0]}], ignore_index=True)

url

import json

output = json.dumps(Results.to_dict('records'))
json_data = json.loads(output)  # parse the JSON string back; a str has no .json() method
print(json_data)

output

text_file = '/content/test.txt'

with open(text_file) as f:
  urls = f.readlines()

urls = ([s.strip('\n') for s in urls ])

Results = pd.DataFrame(columns = ['URL','Flair'])

for url in urls:
  flair = predict_flair(url)
  Results = Results.append([{'URL': str(url), 'Flair': flair[0]}], ignore_index=True)

output = Results.to_json(orient = 'records')

Results.append([{'URL': 'url', 'Flair': flair[0]}], ignore_index=True)

output

files = {'upload_file': open('/content/test.txt','rb')}

urls = files['upload_file'].readlines()

urls[0]

# urls = ([s.strip('\n') for s in urls ])
urls = [x.decode('utf8').strip('\n') for x in urls]

args = {'files':[]}
args['files'] = files

files = args['files']
files

urls

import json
import requests
url = 'https://reddit-flair-predictor.uc.r.appspot.com/automated_testing'
files = {'upload_file': open('/content/test.txt','rb')}
r = requests.post(url, files=files)
data = r.json()
print(data)
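
To make the text cleaning concrete, here is the normalizeString helper from the notebook above applied to a made-up sample string. The sample text is an assumption for illustration and is not taken from the dataset.

```python
import re
import unicodedata

def unicodeToAscii(s):
    # Decompose accented characters and drop the combining marks (category 'Mn')
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def normalizeString(s):
    # Floats (e.g. NaN from missing text) are passed through unchanged
    if not isinstance(s, float):
        s = unicodeToAscii(s.lower().strip())
        s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
        s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # collapse everything else to one space
    return s

sample = "Héllo, World! How's it going?"
cleaned = normalizeString(sample)
print(cleaned)  # hello world ! how s it going ?
```

Accents, commas, and apostrophes are removed, while sentence punctuation is kept as separate tokens.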

The following is the final iteration of the Flask code for the web application.

In [0]:
# Code for main.py (app.py for AWS & Heroku)
from flask import Flask, request, jsonify, make_response
from flask import render_template, url_for, flash, redirect
from forms import FlairForm
from flask_restful import reqparse, abort, Api, Resource
from werkzeug.datastructures import FileStorage
import tensorflow as tf
import ktrain
import pandas as pd 
import numpy as np 
import praw
import unicodedata
import re
import json
import argparse

app = Flask(__name__)
app.config['SECRET_KEY'] = '5791628bb0b13ce0c676dfde280ba049'
api = Api(app)

Train_data = pd.read_csv('static/data/Train_Data.csv')

flairs = list(set(Train_data['flair']))

x_train = Train_data['Text']
y_train = Train_data['flair']

from ktrain import text
MODEL_NAME = 'xlnet-base-cased'
t = text.Transformer(MODEL_NAME, maxlen=500, class_names=flairs)
train = t.preprocess_train(x_train, y_train)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=train, batch_size=6)

learner.load_model('static/Models/model_XLNet', preproc=t)

predictor = ktrain.get_predictor(learner.model, preproc=t)
# Reddit Credentials Below for Web Scraping using praw
reddit = praw.Reddit(client_id='EPeQ4_tZaSnieQ', client_secret="o8wiYMDri2RMiF1um14L1rGHXEs", user_agent='Reddit WebScraping')

post = []

def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )
def normalizeString(s):
    if not isinstance(s, float):
      s = unicodeToAscii(s.lower().strip())
      s = re.sub(r"([.!?])", r" \1", s)
      s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s

def predict_flair(url):
    s = reddit.submission(url=url)
    Data = {"title":[], "body":[]}
    Data['title'].append(s.title)
    Data['body'].append(s.selftext)
    Data = pd.DataFrame(Data)
    Data['Text'] = Data['title'] + ' ' + Data['body']
    Data['Text'] = Data['Text'].apply(lambda x: normalizeString(x))
    data= Data['Text'].values.tolist()
    return predictor.predict(data)

@app.route("/",methods=['GET', 'POST'])
@app.route("/home", methods=['GET', 'POST'])
def home():
    global post
    form = FlairForm()
    if form.validate_on_submit():
        # validate_on_submit() is only true for a submitted form that passes validation
        flair = predict_flair(form.URL.data)
        pred = {"flair": [flair[0]], "URL": [form.URL.data]}
        post = pd.DataFrame(pred)
        return redirect(url_for('Flair_Detected'))
    return render_template('Home.html', form=form)


@app.route("/Flair_Detected", methods=['GET', 'POST'])
def Flair_Detected():
    return render_template('Flair_detected.html',output = post)

# argument parsing
parser = reqparse.RequestParser()
parser.add_argument('files')


class PredictionTest(Resource):
    def post(self):
       
        urls = request.files['upload_file'].readlines()

        # Decode each line, drop surrounding whitespace (including '\r'), and skip blank lines
        urls = [x.decode('utf8').strip() for x in urls if x.strip()]

        Results = pd.DataFrame(columns = ['URL','Flair'])

        for url in urls:
            flair = predict_flair(url)
            Results = Results.append([{'URL': url, 'Flair': flair[0]}], ignore_index=True)

        # jsonify serializes the records itself; calling json.dumps first would double-encode them
        res = make_response(jsonify(Results.to_dict('records')), 200)

        return res

# Setup the Api resource routing here
# Route the URL to the resource
api.add_resource(PredictionTest, '/automated_testing', methods=['GET', 'POST'])


if __name__ == '__main__':
    app.run(debug=True)
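
The /automated_testing handler above can be summarized without Flask or the model: decode the uploaded bytes line by line, predict a flair for each URL, and collect URL/flair records. The sketch below is only an illustration under stated assumptions: `fake_predict_flair` is a hypothetical stub standing in for predict_flair, 'Politics' is a placeholder label, `build_records` is a helper name not present in the original code, and an in-memory io.BytesIO replaces the uploaded file.

```python
import io
import json

def fake_predict_flair(url):
    # Hypothetical stand-in for predict_flair(); returns a one-element list,
    # matching how the original code indexes flair[0]
    return ['Politics']

def build_records(upload, predict):
    """Return one {'URL', 'Flair'} record per non-empty line of the upload."""
    records = []
    for raw in upload.readlines():
        url = raw.decode('utf8').strip()
        if not url:
            continue  # skip blank lines
        records.append({'URL': url, 'Flair': predict(url)[0]})
    return records

# Placeholder URL followed by a blank line, as a file-like upload
upload = io.BytesIO(b'https://www.reddit.com/r/india/comments/aaaaaa/example/\n\n')
records = build_records(upload, fake_predict_flair)
print(json.dumps(records))
```

Keeping this logic in a plain function like this would also make it possible to test the parsing separately from the model and the web framework.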

In [0]:
# Code for forms.py
from flask_wtf import FlaskForm
from wtforms import StringField,SubmitField
from wtforms.validators import DataRequired


class FlairForm(FlaskForm):
    URL = StringField('URL',validators=[DataRequired()])
    submit = SubmitField('Submit')

The website was built successfully with a clean user interface. It was tested on the local machine and validated for deployment. The next step was to deploy it on a hosting service such as Heroku.

**Part-5**

As part of the assignment, we were asked to deploy the website on Heroku. While deploying the app there, I ran into problems with the slug size: Heroku allows a slug of only 500 MB, which was not enough to hold the model because Python dependencies such as tensorflow are large. After consulting Hitkul sir, the approach changed to deploying on a cloud platform such as AWS or GCP. The model was initially deployed on AWS, but the limitations of the AWS Free Tier made it unsuitable. I had prior experience using GCP (Google Cloud Platform) for a research project, so I finally decided to deploy the model on GCP. For deployment, I created a Dockerfile, which recreates the development environment on any machine, and an app.yaml. The commands below were used to deploy.

In [0]:
cd 'path_to_project_folder'
gcloud init
gcloud app deploy app.yaml --project project_name

The Reddit Flair Model was deployed successfully and everything worked well. It was a fun experience working on the task, and a good recap of various data science packages! Everything has also been explained and documented. Well then, it's a wrap! Time to upload to GitHub and submit.