# Import Modules and Prerequisites
Use this section to import any modules required later in this notebook, as in the example below.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import json
import re

%matplotlib inline

# NLP task

# Note: There is no expectation of coding a highly sophisticated solution in a small time window. Each question can be answered with a short code example and, where helpful, a written explanation of the more elaborate approach.

## A common task for an AdSquirrel microservice is to predict whether an article is Brand Safe or not (appropriate for a general audience, similar to SFW vs. NSFW).

## Load and examine dataset

In [None]:
df = pd.read_csv('assignment.csv')
df.head()

## 1. Train a classifier on this dataset that will reach an acceptable accuracy score.

#### Feel free to follow any design choices you feel fit the problem best.

#### Briefly describe your approach in markdown cells, along with any necessary comments on your choices.

#### Support your choices with appropriate evaluation plots and analysis.
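
As a starting point for the classifier task, a TF-IDF plus logistic regression pipeline is a common lightweight baseline. The sketch below is only illustrative: the toy `texts` and `labels` lists stand in for the real dataset, whose column names are not given here, and the hyperparameters are untuned assumptions rather than recommended values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical). In the notebook, these would be taken
# from the text and label columns of df, whatever they are actually called.
texts = [
    "local bakery wins award for best bread",
    "city opens new public library downtown",
    "school choir performs at spring festival",
    "gardeners share tips for growing tomatoes",
    "museum unveils exhibit on ancient pottery",
    "volunteers clean up the riverside park",
    "graphic footage shows violent street attack",
    "police report brutal assault and shooting",
    "explicit adult content leaked online",
    "gang violence leaves several people dead",
    "terror attack kills dozens in bombing",
    "drug cartel murders shock the region",
]
labels = [1] * 6 + [0] * 6  # 1 = brand safe, 0 = not brand safe (assumed encoding)

# Hold out a stratified validation split so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Word uni- and bigrams with sublinear TF scaling feed a linear classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))
```

A linear model over sparse features keeps both the saved model size and the prediction time small, which matters for the latency question later in the notebook. On the real data, per-class precision and recall are usually more informative than accuracy alone if the classes are imbalanced.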

## 2. Load a real-world, unseen (test) dataset and evaluate your best model

### The goal is to approach the classification accuracy achieved on the training dataset when evaluating on the test dataset, without using the latter for training. Describe any challenges (if they exist) and code your solution below, following the same guidelines.

In [None]:
df_unseen = pd.read_csv('assignment_unseen.csv')
df_unseen.head()

## 3a. Another important task is to perform topic classification on the same datasets, but there are no available labels. You can use all of the data you have at your disposal. Describe possible approaches to this problem and code the most robust solution of your choice.

## 3b. Supposing these are last week's news and we wish to extract the latest circulating trends in order to present them to our clients, how could we approach the problem of trend extraction via NLP? Either code or describe at a high level the best possible approach that comes to mind. (This question may be connected to the previous one depending on your solution, which is acceptable.)

##### Hint: Proper text preprocessing may assist in this task, if not done already.
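
One simple way to frame trend extraction is to compare how often a term appears in recent articles against an older baseline window. The sketch below is a minimal illustration of that idea using a smoothed log-ratio of document frequencies. The function names, the toy documents, and the thresholds are all assumptions made for the example, not part of the assignment data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of 3+ letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def doc_frequencies(docs):
    """Count in how many documents each term appears at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return counts


def trending_terms(recent_docs, baseline_docs, top_k=5, min_recent=2):
    """Rank terms by smoothed log-ratio of recent vs. baseline document rate."""
    recent = doc_frequencies(recent_docs)
    baseline = doc_frequencies(baseline_docs)
    n_recent, n_base = len(recent_docs), len(baseline_docs)
    scores = {}
    for term, count in recent.items():
        if count < min_recent:  # ignore terms seen in too few recent docs
            continue
        # Add-one smoothing avoids division by zero for unseen baseline terms.
        p_recent = (count + 1) / (n_recent + 2)
        p_base = (baseline.get(term, 0) + 1) / (n_base + 2)
        scores[term] = math.log(p_recent / p_base)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy corpora standing in for older and last-week articles.
baseline_docs = [
    "the council approved the budget",
    "budget talks continue in council",
    "team wins the match",
]
recent_docs = [
    "wildfire spreads near the city",
    "city evacuates as wildfire grows",
    "council debates wildfire response",
]

trending = trending_terms(recent_docs, baseline_docs)
print(trending)
```

The same scoring could be applied to named entities or to topic assignments from question 3a instead of raw tokens, which tends to give cleaner trends once the text has been properly preprocessed.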

## 4. (Optional) Discussion. Feel free to answer some or all of the following questions to the best of your knowledge.

#### Coding is optional and not expected below, unless you wish to accompany your answer with example code.

#### 4a. Suppose there is a complete labeled topic classification dataset available (hundreds of thousands of articles). Are there any changes in your design choices that would lead to a more accurate model?

#### 4b. In order to meet production requirements, the model needs to have as close to 20 ms latency as possible. Latency is measured as the time of a single live prediction on CPU, added to the time needed to load a saved model into a microservice (directly impacted by the saved model's total size in MB). Are the models created in this assignment production-ready? If not, are there any ways to improve prediction latency that come to mind? What about a model initially trained on the complete dataset, which would probably be larger?
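
Before discussing optimizations, it helps to measure both components of the latency defined above. The sketch below is one possible way to do so with the standard library only. The `toy_model` dictionary and `toy_predict` function are hypothetical placeholders for whatever model was trained earlier, and the load time is measured from memory, so it excludes disk or network reads.

```python
import pickle
import statistics
import time


def measure_latency(predict, sample, n_runs=200, warmup=10):
    """Time single-sample predictions and report median and p95 in ms."""
    for _ in range(warmup):  # warm caches before timing
        predict(sample)
    timings_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1000)
    timings_ms.sort()
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[int(0.95 * (n_runs - 1))],
    }


def measure_load(model):
    """Report serialized size in MB and in-memory deserialization time in ms."""
    blob = pickle.dumps(model)
    start = time.perf_counter()
    pickle.loads(blob)
    load_ms = (time.perf_counter() - start) * 1000
    return {"size_mb": len(blob) / 1e6, "load_ms": load_ms}


# Hypothetical stand-in for a trained model: a keyword weight table.
toy_model = {"weights": {"violent": -2.0, "attack": -1.5, "festival": 1.0}}


def toy_predict(text):
    """Sum keyword weights; a non-negative score counts as brand safe."""
    score = sum(toy_model["weights"].get(tok, 0.0) for tok in text.lower().split())
    return int(score >= 0)


latency = measure_latency(toy_predict, "violent attack at the festival")
load = measure_load(toy_model)
print(latency, load)
```

Reporting a high percentile alongside the median is a deliberate choice here, since a latency budget is typically broken by slow outliers rather than by the typical request.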

#### 4c. Would any of the solutions presented above impact model accuracy? How could we make sure that accuracy does not fall below a desired threshold?  

#### 4d. If our model from question 2 was well within latency constraints and you had sufficient time, equivalent to a small independent project, how could you potentially improve the model's accuracy on task 2 without a heavy latency sacrifice? (Simply importing a huge Transformer model is not a viable solution.)

#### 4e. Our deployed models handle ~50,000 predictions on unique articles per day. What are the challenges of evaluating an ML model at such a scale? What could we do to approximately evaluate our model's performance every week while unable to hand-label the entirety of our predictions?
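
One possible direction for the weekly evaluation is to hand-label only a random sample and report accuracy with a confidence interval. The sketch below assumes simple random sampling, a 95% confidence level (z = 1.96), and a ±3% margin of error. These values, the function names, and the example count of correct labels are illustrative assumptions, not requirements from the brief.

```python
import math
import random


def required_sample_size(margin, z=1.96, expected_accuracy=0.5):
    """Sample size for estimating a proportion within +/- margin.

    expected_accuracy=0.5 is the worst case and gives the largest sample.
    """
    variance = expected_accuracy * (1 - expected_accuracy)
    return math.ceil(z ** 2 * variance / margin ** 2)


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for the accuracy observed on n labeled items."""
    p = correct / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Roughly 50,000 predictions per day gives about 350,000 per week.
weekly_prediction_ids = range(50_000 * 7)

n_to_label = required_sample_size(margin=0.03)
rng = random.Random(0)  # fixed seed so the sample is reproducible
sample_ids = rng.sample(weekly_prediction_ids, n_to_label)

# Hypothetical outcome: annotators agree with 950 of the sampled predictions.
low, high = wilson_interval(correct=950, n=n_to_label)
print(n_to_label, round(low, 3), round(high, 3))
```

The required sample size depends on the desired margin rather than on the weekly volume, which is why labeling around a thousand articles can be enough. Sampling separately within each predicted class is a reasonable refinement when one class is rare.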

# Analytics Task

## You have received a dataframe that contains some ML models' predictions on recent articles (text and title are irrelevant in this case), as well as a JSON file containing the most trending named entities during the same time period.

## Suppose we have to create a dashboard regarding recent trends in news articles. Use the visualization technique of your choice to extract some relevant patterns in our predictions, taking into consideration the separation due to Brand Safety in your plots.

In [None]:
df = pd.read_csv("assignment_analytics.csv")

with open("trending_entities.json") as f:
    trending_entities = json.load(f)

df.head()

# Thank you in advance