We publish a lot of articles. Some of their headlines could be classified as “Clickbait” by
distribution partners like Facebook or Google, which hurts the performance of our articles.
We want to give the business an API that accepts the title of an article and returns
the probability that the article is “News” or “Clickbait”. Thankfully, someone has already
labeled articles as “News” or “Clickbait” for us to train on; that data, and an initial
Python Flask app for the classifier, can be found HERE.
Build a model to predict how likely an article is to be clickbait. Provide some standard
classification metrics for your work. Once you’ve done that, expose your model through an API
endpoint for use by the business.

Build a classifier (any classifier)
Create an API for Text Inputs
Create an API response with a clickbait probability
Containerize Your Work

Acceptance Criteria:
- The API Endpoint accepts strings between 1 and 150 characters
- The API returns responses in .json format
- The API returns the clickbait likelihood response as a float
- There is a functional and reproducible API endpoint that accepts text string inputs
and returns the probability that the text entered is “clickbait” or “news”
- Your classifier & API work is in a containerized environment (Docker, AWS Container
Service, etc.)
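
The first three criteria can be checked in plain Python before any model is involved. The following is a minimal sketch, not the task's reference solution: the helper names `validate_title` and `build_response` and the response key `clickbait_probability` are assumptions made for illustration.

```python
import json

MIN_LEN, MAX_LEN = 1, 150  # limits taken from the acceptance criteria


def validate_title(title):
    """Return the title if it is a string of 1 to 150 characters, else raise ValueError."""
    if not isinstance(title, str):
        raise ValueError("title must be a string")
    if not MIN_LEN <= len(title) <= MAX_LEN:
        raise ValueError(f"title must be between {MIN_LEN} and {MAX_LEN} characters")
    return title


def build_response(title, probability):
    """Serialize the result as JSON with the likelihood stored as a float."""
    # "clickbait_probability" is an assumed key name, not one given in the task.
    return json.dumps({"title": title, "clickbait_probability": float(probability)})


print(build_response(validate_title("You won't believe what happened next"), 0.93))
```

Casting with `float()` before serializing keeps the response type stable even if the model hands back a NumPy scalar or an integer.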
Considerations:
- What tests would you want to perform on the endpoint to ensure it can handle
exceptions?
- Extra Points: Deploying this on AWS or GCP on a live endpoint
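
One way to approach the exception-handling question is to drive the endpoint with deliberately bad requests through Flask's built-in test client. The sketch below is an assumption-laden illustration: the `/predict` route, the `title` field, the `clickbait_probability` key, and the stub `score` function are all hypothetical and stand in for the real app.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def score(title):
    # Stand-in for the real classifier so the sketch runs on its own.
    return 0.5


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True)  # None when the body is not valid JSON
    title = payload.get("title") if isinstance(payload, dict) else None
    if not isinstance(title, str) or not 1 <= len(title) <= 150:
        return jsonify(error="title must be a string of 1 to 150 characters"), 400
    return jsonify(clickbait_probability=float(score(title)))


client = app.test_client()

# Each case is (keyword arguments for client.post, expected HTTP status).
cases = [
    ({"json": {"title": "Ten things you did not know about cats"}}, 200),
    ({"json": {"title": ""}}, 400),         # empty string
    ({"json": {"title": "x" * 151}}, 400),  # longer than 150 characters
    ({"json": {"title": 42}}, 400),         # wrong type
    ({"json": {}}, 400),                    # missing field
    ({"data": "not json", "content_type": "application/json"}, 400),  # malformed body
    ({"data": "title=abc"}, 400),           # body is not JSON at all
]
for kwargs, expected in cases:
    resp = client.post("/predict", **kwargs)
    assert resp.status_code == expected, (kwargs, resp.status_code)
```

Beyond these cases, it may also be worth checking non-ASCII titles, whitespace-only titles, and unsupported HTTP methods, depending on how strict the endpoint should be.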

** Detailed steps **

- Load the data and transform the JSON into a dataframe or list of titles
- Use a text vectorizer from sklearn to turn the titles into a feature matrix
- Split the data into train and test sets
- Choose a model and train it
- Save the model
- Prepare a Flask API
- Prepare the endpoints and functions
- Load the model
- Set up a prediction method called by the endpoint function
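
The training, saving, loading, and prediction steps above can be sketched end to end on toy data. This is an illustrative variant rather than the notebook's exact approach: it wraps the vectorizer and classifier in a scikit-learn `Pipeline` so that both are saved together, and the titles, the file name `cb_pipeline.pkl`, and the helper `predict_clickbait` are invented for the example.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy titles invented for illustration; 1 = clickbait, 0 = news.
titles = [
    "You won't believe what this dog did next",
    "21 photos that will restore your faith in humanity",
    "This one weird trick will change your life",
    "Senate passes budget bill after lengthy debate",
    "Central bank holds interest rates steady",
    "Storm causes flooding across the northern coast",
]
labels = [1, 1, 1, 0, 0, 0]

# Vectorizer and classifier travel together, so raw strings can be scored later.
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", sublinear_tf=True)),
    ("clf", RandomForestClassifier(n_estimators=30, random_state=0)),
])
model.fit(titles, labels)

# Save and reload, as the API process would do at startup.
path = os.path.join(tempfile.mkdtemp(), "cb_pipeline.pkl")  # hypothetical file name
joblib.dump(model, path)
loaded = joblib.load(path)


def predict_clickbait(title):
    """Return P(clickbait) for one raw title string."""
    clickbait_col = list(loaded.classes_).index(1)  # column of the clickbait class
    return float(loaded.predict_proba([title])[0, clickbait_col])


print(predict_clickbait("You won't believe these photos"))
```

Using `predict_proba` rather than `predict` is what yields the float likelihood the acceptance criteria ask for.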


In [81]:
import json
import os, sys
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

In [98]:
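def vectorize_matrix(X):
    """Fit a TF-IDF vectorizer on the titles in X and return the sparse matrix."""
    # The fitted vectorizer is discarded here; it must be kept (and saved with the
    # model) if new titles are to be mapped into the same feature space.
    vec = TfidfVectorizer(stop_words='english', analyzer='word', lowercase=True, max_df=0.5, sublinear_tf=True)
    return vec.fit_transform(X)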


In [95]:
def load_data():
    """Load titles and clickbait labels from the JSON files in ./data, shuffled."""
    data_dir = "./data/"
    frames = []
    fnames = ['buzzfeed.json', 'dose.json', 'clickhole.json', 'nytimes.json']
    for fname in os.listdir(data_dir):
        if fname in fnames:
            with open(os.path.join(data_dir, fname)) as f:
                # Keep only the title and label columns of each source
                frames.append(pd.DataFrame(json.load(f))[['article_title', 'clickbait']])

    df = pd.concat(frames)
    # Shuffle rows so the sources are interleaved, then rebuild the index
    df = df.sample(frac=1).reset_index(drop=True)
    return df.article_title, df.clickbait

In [96]:
X, y = load_data()

In [99]:
X = vectorize_matrix(X)

In [100]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=100)
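
In the cells above, the TF-IDF matrix is built from all titles before the split, so the vocabulary and IDF weights have already seen the test titles. A common alternative, sketched here on invented toy titles, is to split the raw text first and fit the vectorizer on the training portion only. The `stratify` argument is an addition not used in the notebook; it keeps the class balance equal in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy titles invented for illustration; 1 = clickbait, 0 = news.
titles = [
    "You won't believe what this dog did next",
    "21 photos that will restore your faith in humanity",
    "This one weird trick will change your life",
    "Only true fans can pass this movie quiz",
    "What your coffee order says about you",
    "Senate passes budget bill after lengthy debate",
    "Central bank holds interest rates steady",
    "Storm causes flooding across the northern coast",
    "City council approves new transit plan",
    "Researchers publish study on ocean temperatures",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Split the raw strings first, keeping both classes represented in each part.
raw_train, raw_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.2, random_state=100, stratify=labels
)

vec = TfidfVectorizer(stop_words="english", sublinear_tf=True)
X_train = vec.fit_transform(raw_train)  # vocabulary and IDF learned from training titles only
X_test = vec.transform(raw_test)        # test titles reuse the fitted vocabulary

print(X_train.shape, X_test.shape)
```

Words that appear only in the test titles are simply ignored by `transform`, which mirrors what happens when the deployed model meets an unseen headline.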


In [131]:
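# Random forest with 30 trees; each split considers only 2 randomly chosen features
clf = RandomForestClassifier(n_estimators=30, max_features=2)
clf.fit(Xtrain, ytrain)
pred_ytest = clf.predict(Xtest)  # hard class labels, not probabilities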

In [132]:
class_names = ['news', 'clickbait']

In [133]:
print(classification_report(y_true=ytest, y_pred=pred_ytest, target_names=class_names))

             precision    recall  f1-score   support

       news       0.92      0.97      0.95       619
  clickbait       0.97      0.92      0.94       621

avg / total       0.95      0.95      0.95      1240
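
Because the API is meant to return a probability rather than a label, it can also be useful to report a metric computed from the predicted probabilities, alongside a confusion matrix at a chosen threshold. The sketch below uses hypothetical labels and scores invented for illustration, not results from this notebook, and the 0.5 threshold is an assumption.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical true labels (1 = clickbait) and predicted P(clickbait).
y_true = [0, 0, 0, 1, 1, 1]
p_clickbait = [0.10, 0.40, 0.35, 0.80, 0.30, 0.90]

# ROC AUC is computed from the scores directly, so it needs no threshold.
auc = roc_auc_score(y_true, p_clickbait)

# The confusion matrix needs hard labels; 0.5 is an assumed cut-off.
y_pred = [int(p >= 0.5) for p in p_clickbait]
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(auc)
print(cm)
```

With a fitted classifier, the scores would come from the clickbait column of `predict_proba` on the test set.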



In [134]:
joblib.dump(clf, filename="cb_model_rf.pkl")

['cb_model_rf.pkl']

In [88]:
clf2 = joblib.load("cb_model_rf.pkl")

In [89]:
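# Check the reloaded model on the training set; near-perfect scores here show the
# model has memorized its training data and say nothing about generalization
pr2 = clf2.predict(Xtrain)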

In [91]:
print(classification_report(y_true=ytrain, y_pred=pr2, target_names=class_names))

             precision    recall  f1-score   support

       news       1.00      1.00      1.00      2489
  clickbait       1.00      1.00      1.00      2471

avg / total       1.00      1.00      1.00      4960

