# Introduction

This is a problem from HackerRank.

Stack Exchange is an information powerhouse, built on the power of crowdsourcing. It has 105 different topics, and each topic has a library of questions that have been asked and answered by knowledgeable members of the Stack Exchange community. The topics are as diverse as travel, cooking, programming, engineering and photography.

We have hand-picked ten different topics (such as Electronics, Mathematics, Photography etc.) from Stack Exchange, and we provide you with a set of questions from these topics.

Given a question and an excerpt, your task is to identify which among the 10 topics it belongs to.

## Getting started with text classification

For those getting started with text classification, Professor Dan Jurafsky of Stanford has a YouTube video explaining the Naive Bayes classification algorithm, which you could consider using as a starting point.
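
As a rough illustration of that starting point, here is a minimal sketch of a bag-of-words Naive Bayes classifier in scikit-learn. The four training sentences and their labels are invented for this example and are not taken from the contest data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy questions and topics, invented for illustration only.
toy_questions = [
    "how to root my android phone",
    "install apk on android tablet",
    "best lens for portrait photo",
    "camera aperture and shutter speed",
]
toy_topics = ["android", "android", "photo", "photo"]

# Word counts feed a multinomial Naive Bayes classifier.
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(toy_questions, toy_topics)

# Words never seen in training (e.g. "battery") are simply ignored.
toy_predictions = toy_model.predict(["android phone battery", "camera lens"])
print(list(toy_predictions))
```

The notebook below follows the same idea but swaps raw counts for TF-IDF weights.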

## Input Format

The first line contains an integer N. N lines follow, each a valid JSON object. The raw data has the following fields:

- question (string): the text in the title of the question.
- excerpt (string): an excerpt of the question body.
- topic (string): the topic under which the question was posted.

The input to your program has every field except topic, which you have to predict as the answer.

## Constraints

- 1 <= N <= 22000
- topic is ASCII text
- question is UTF-8 text
- excerpt is UTF-8 text

## Output Format

For each question given as a JSON object, output the topic predicted by your model, one topic per line.

The training file is available here. It is also present in the current directory in which your code is executed.
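
The input format described above can be parsed along these lines. This is only a sketch: `read_questions` is a hypothetical helper name, and the two sample records are made up. In a real submission the stream would presumably be `sys.stdin` rather than an in-memory string.

```python
import io
import json


def read_questions(stream):
    """Read the count N from the first line, then N JSON objects, one per line."""
    n = int(stream.readline())
    return [json.loads(stream.readline()) for _ in range(n)]


# Hypothetical two-record input; note that the "topic" field is absent.
sample = io.StringIO(
    '2\n'
    '{"question": "How do I flash a ROM?", "excerpt": "My phone is stuck..."}\n'
    '{"question": "Which lens for portraits?", "excerpt": "I own a 50mm..."}\n'
)
records = read_questions(sample)
print([r["question"] for r in records])
```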

## Sample Input

12345
json_object
json_object
json_object
.
.
.
json_object

## Sample Output

electronics
security
photo
.
.
.
mathematica

Sample test cases can be downloaded here for offline training. When you submit your solution to us, you can assume that the training file can be accessed by reading "training.json", which will be placed in the same folder as the one in which your program is being executed.

## Scoring

While the contest is running, the score shown to you is based on the Sample Test file. The final score is based on the Hidden Testcase only; your score on the Sample Test carries no weight.

Score = MaxScore for the test case * (C/T)
Where C = Number of topics identified correctly and
T = total number of test JSONs in the input file.
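
The scoring rule is plain accuracy scaled by the maximum score. A small sketch follows; `contest_score` is a hypothetical helper, and the default `max_score` of 100 is an assumption, since the statement does not give the value of MaxScore.

```python
def contest_score(predicted, actual, max_score=100.0):
    """Return max_score * (C / T), with C correct predictions out of T inputs."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one test case is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return max_score * correct / len(actual)


# Three of the four made-up predictions match, so C/T = 0.75.
example_score = contest_score(
    ["electronics", "security", "photo", "unix"],
    ["electronics", "security", "photo", "mathematica"],
)
print(example_score)  # 75.0
```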

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json

# Data Preparation

In [48]:
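data = {'topic': [], 'question': []}
# The constraints state that the text fields are UTF-8, so set the encoding explicitly.
with open('training.json', 'r', encoding='utf-8') as file:
    # The first line holds the record count; every later line is one JSON object.
    n_data = int(next(file))
    for line in file:
        d = json.loads(line)
        data['topic'].append(d['topic'])
        data['question'].append(d['question'])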

In [14]:
labels, counts = np.unique(data['topic'], return_counts=True)

In [16]:
print(dict(zip(labels, counts)))

{'android': 2239, 'apple': 2064, 'electronics': 2079, 'gis': 2383, 'mathematica': 1369, 'photo': 1945, 'scifi': 2333, 'security': 1899, 'unix': 1965, 'wordpress': 1943}


In [21]:
from sklearn.preprocessing import OrdinalEncoder

In [72]:
topic_enc = OrdinalEncoder().fit(np.array(data['topic']).reshape(-1, 1))
y = topic_enc.transform(np.array(data['topic']).reshape(-1, 1)).ravel()
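
The encoded labels come out as floats from 0.0 to 9.0, which is how they appear in the classification reports further down. A minimal sketch of recovering the code-to-topic mapping, assuming the ten topic names from the counts printed above and using a separate `demo_enc` so the notebook's own `topic_enc` is left untouched:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# The ten topic names from the label counts above.
topics = np.array(['android', 'apple', 'electronics', 'gis', 'mathematica',
                   'photo', 'scifi', 'security', 'unix', 'wordpress'])

demo_enc = OrdinalEncoder().fit(topics.reshape(-1, 1))

# By default OrdinalEncoder sorts the categories, so code i is the i-th name
# in sorted order.
code_to_topic = {float(i): str(name)
                 for i, name in enumerate(demo_enc.categories_[0])}
print(code_to_topic)
```

With this mapping, row `4.0` of a report corresponds to `mathematica` and row `8.0` to `unix`.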

In [53]:
X = np.array(data['question'])

In [54]:
from sklearn.model_selection import train_test_split

In [73]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [74]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [166]:
vectorizer = TfidfVectorizer(stop_words='english', max_features=3500, decode_error='ignore')

In [167]:
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Modeling

## MLP

In [127]:
from sklearn.neural_network import MLPClassifier

In [128]:
clf = MLPClassifier(
    hidden_layer_sizes=(200, 20),
    batch_size=20,
    learning_rate='constant',
    learning_rate_init=0.001,
    early_stopping=True
).fit(X_train_vec, y_train)

In [129]:
y_pred = clf.predict(X_test_vec)

In [130]:
from sklearn.metrics import classification_report

In [131]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.72      0.82      0.77       445
         1.0       0.78      0.63      0.70       426
         2.0       0.63      0.72      0.67       399
         3.0       0.85      0.80      0.83       474
         4.0       0.72      0.56      0.63       278
         5.0       0.78      0.81      0.80       380
         6.0       0.69      0.87      0.77       500
         7.0       0.71      0.65      0.68       384
         8.0       0.56      0.54      0.55       357
         9.0       0.85      0.77      0.81       401

    accuracy                           0.73      4044
   macro avg       0.73      0.72      0.72      4044
weighted avg       0.73      0.73      0.73      4044



## Naive Bayes

In [132]:
from sklearn.naive_bayes import MultinomialNB

In [168]:
clf_nb = MultinomialNB().fit(X_train_vec, y_train)

In [169]:
y_pred = clf_nb.predict(X_test_vec)

In [170]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.74      0.84      0.79       445
         1.0       0.78      0.72      0.75       426
         2.0       0.81      0.80      0.81       399
         3.0       0.74      0.91      0.81       474
         4.0       0.91      0.54      0.68       278
         5.0       0.82      0.89      0.86       380
         6.0       0.94      0.85      0.89       500
         7.0       0.80      0.74      0.77       384
         8.0       0.65      0.66      0.66       357
         9.0       0.83      0.86      0.85       401

    accuracy                           0.80      4044
   macro avg       0.80      0.78      0.79      4044
weighted avg       0.80      0.80      0.79      4044



In [172]:
y_pred_str = topic_enc.inverse_transform(y_pred.reshape(-1, 1))

In [173]:
y_pred_str

array([['scifi'],
       ['scifi'],
       ['mathematica'],
       ...,
       ['android'],
       ['photo'],
       ['scifi']], dtype='<U11')