We publish a lot of articles. Some of their headlines could be classified as “Clickbait” by
distribution partners like Facebook or Google, which hurts the performance of our articles.
We want to give the business an API where they can enter the title of an article and get back
the probability that the article is “News” or “Clickbait”. Thankfully, someone has already
labeled examples of “News” and “Clickbait” for us to train on. That data, along with an initial
Python Flask app for the classifier, can be found HERE.
Build a model to predict how likely an article is to be clickbait. Provide some standard
classification metrics for your work. Once you’ve done that, expose your model through an API
endpoint for use by the business.

Build a classifier (any classifier)
Create an API for Text Inputs
Create an API response with a clickbait probability
Containerize Your Work

Acceptance Criteria:
- The API Endpoint accepts strings between 1 and 150 characters
- The API returns responses in .json format
- The API returns the clickbait likelihood response as a float
- There is a functional and reproducible API endpoint that accepts text string inputs and
returns the probability that the text entered into the API is “clickbait” or “news”
- Your classifier & API work is in a containerized environment (Docker, AWS Container
Service, etc.)
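
The input and output rules in the acceptance criteria can be sketched as two small helpers. This is a minimal sketch under assumptions: the request field name `article_title` is borrowed from the training data, while the response key `clickbait` and the helper names `validate_title` and `build_response` are hypothetical.

```python
import json

MIN_LEN, MAX_LEN = 1, 150  # limits taken from the acceptance criteria


def validate_title(payload):
    """Return (title, None) if the payload is acceptable, else (None, error message)."""
    if not isinstance(payload, dict) or "article_title" not in payload:
        return None, "request body must be a JSON object with an 'article_title' field"
    title = payload["article_title"]
    if not isinstance(title, str):
        return None, "'article_title' must be a string"
    if not MIN_LEN <= len(title) <= MAX_LEN:
        return None, "'article_title' must be between 1 and 150 characters"
    return title, None


def build_response(probability):
    """Serialize the clickbait likelihood as a JSON document holding a float."""
    return json.dumps({"clickbait": float(probability)})
```

Keeping validation separate from the web framework makes the length and type rules easy to unit test without starting a server.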
Considerations:
- What tests would you want to perform on the endpoint to ensure it can handle
exceptions?
- Extra Points: Deploying this on AWS or GCP on a live endpoint
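
One way to approach the exception-handling question is to drive the endpoint with Flask's built-in test client and send it deliberately bad inputs. The sketch below is an assumption-laden illustration, not the app shipped with the assignment: the `/predict` route, the `article_title` field, the `clickbait` response key, and the `predict_clickbait` stub (which returns a fixed score in place of the trained model) are all hypothetical.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_clickbait(title):
    # Stand-in for the trained model: a fixed score keeps the sketch self-contained.
    return 0.5


@app.route("/predict", methods=["POST"])
def predict():
    # silent=True yields None instead of raising on a missing or malformed JSON body.
    payload = request.get_json(silent=True)
    title = payload.get("article_title") if isinstance(payload, dict) else None
    if not isinstance(title, str) or not 1 <= len(title) <= 150:
        return jsonify(error="article_title must be a string of 1 to 150 characters"), 400
    return jsonify(clickbait=float(predict_clickbait(title)))


def run_checks():
    """Post one valid and several invalid requests; return the status code of each."""
    client = app.test_client()
    cases = {
        "valid": client.post("/predict", json={"article_title": "Some headline"}),
        "empty": client.post("/predict", json={"article_title": ""}),
        "too_long": client.post("/predict", json={"article_title": "x" * 151}),
        "wrong_type": client.post("/predict", json={"article_title": 42}),
        "missing_field": client.post("/predict", json={}),
        "not_json": client.post("/predict", data="plain text"),
    }
    return {name: resp.status_code for name, resp in cases.items()}
```

Calling `run_checks()` should report 200 for the valid case and 400 for every malformed one, so a bad request never reaches the model.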

** Detailed steps **

- Load the data and transform the JSON into a dataframe or list of titles
- Use a text vectorizer from sklearn to tokenize the titles and build a feature matrix
- Split the data into train and test sets
- Choose a model and train it
- Save the model
- Prepare a Flask API
- Prepare the endpoints and functions
- Load the model
- Set up a prediction method called by the endpoint function


In [4]:
import json
import os, sys
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

In [1]:
def load_data():
    data_dir = "../data/"
    files = []
    fnames = ['buzzfeed.json', 'dose.json', 'clickhole.json', 'nytimes.json']
    for fname in os.listdir(data_dir):
        if fname in fnames:
            with open(os.path.join(data_dir, fname)) as f:
                files += [pd.DataFrame(json.loads(f.read()))[['article_title', 'clickbait']]]
            
    df = pd.concat(files)
    df = df.sample(frac=1).reset_index(drop=True)
    return df.article_title.values, df.clickbait.values

In [3]:
X, y = load_data()


vect = TfidfVectorizer(stop_words='english')
X = vect.fit_transform(X)

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=100)
joblib.dump(vect, filename="tfidfvec.pkl")



['tfidfvec.pkl']
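
The cell above fits the vectorizer on every title before splitting, so the test set influences the vocabulary and IDF weights. A possible alternative, sketched here on a handful of made-up titles (not the assignment data), is to split the raw titles first and wrap the vectorizer and classifier in one `Pipeline`, which also leaves a single artifact to save and load. The file name `cb_pipeline.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up titles for illustration only; 1 = clickbait, 0 = news.
titles = [
    "You won't believe what this cat did next",
    "10 tricks doctors don't want you to know",
    "This one weird trick will change your life",
    "What happened next will shock you",
    "Senate passes budget bill after long debate",
    "Central bank raises interest rates by a quarter point",
    "Court upholds ruling in patent dispute",
    "City council approves new transit funding",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first so the vectorizer only ever sees training titles.
train_titles, test_titles, ytrain, ytest = train_test_split(
    titles, labels, test_size=0.25, random_state=100, stratify=labels
)

text_clf = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression()),
])
text_clf.fit(train_titles, ytrain)

# One file now holds both the fitted vectorizer and the classifier.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cb_pipeline.pkl")
    joblib.dump(text_clf, path)
    restored = joblib.load(path)

# The restored pipeline takes raw strings directly.
proba = restored.predict_proba(["You won't believe this one weird trick"])
```

With this layout the API only has to load one pickle and pass the incoming title straight to `predict_proba`.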

In [265]:
# df.ix[3,:].to_json()

# text_clf = Pipeline([('vect', TfidfVectorizer(stop_words='english')),
#                     ('clf_rf', RandomForestClassifier(n_estimators=30, max_features=2))])
# text_clf
# text_clf.fit(Xtrain, ytrain)

# pred_ytest = text_clf.predict(Xtest)

In [5]:
# clf = RandomForestClassifier(n_estimators=30, max_features=2)
clf = LogisticRegression()
clf.fit(Xtrain, ytrain)

pred_ytest = clf.predict(Xtest)
class_names = ['news', 'clickbait']

In [6]:
print(classification_report(y_true=ytest, y_pred=pred_ytest, target_names=class_names))

             precision    recall  f1-score   support

       news       0.88      0.97      0.92       615
  clickbait       0.97      0.87      0.92       625

avg / total       0.93      0.92      0.92      1240



In [134]:
joblib.dump(clf, filename="cb_model_rf.pkl")

['cb_model_rf.pkl']

In [88]:
clf2 = joblib.load("cb_model_rf.pkl")

In [173]:
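pr2 = clf2.predict_proba(Xtrain)
# Columns follow clf2.classes_; assuming labels 0 = news and 1 = clickbait,
# column 0 is P(news) and pr2[:, 1] would be P(clickbait).
pr2[:,0]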

array([ 0.04,  1.  ,  0.08, ...,  0.  ,  0.08,  0.96])
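
Since the API has to report the clickbait likelihood, it helps to pick the `predict_proba` column by class label rather than by position. The sketch below uses a tiny made-up feature matrix and assumes the labels are 0 for news and 1 for clickbait, as in the training data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up feature matrix: one feature, labels 0 = news, 1 = clickbait.
X = np.array([[0.0], [0.2], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)                  # shape (n_samples, n_classes)
clickbait_col = list(clf.classes_).index(1)   # columns follow clf.classes_
p_clickbait = proba[:, clickbait_col]         # P(clickbait) for each row
p_news = proba[:, 1 - clickbait_col]          # P(news) for each row
```

Looking the column up through `classes_` keeps the code correct even if the label encoding changes.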

In [254]:
# classification_report needs hard class labels, not the probability matrix from
# predict_proba; passing pr2 raises "ValueError: Mix type of y not allowed".
print(classification_report(y_true=ytrain, y_pred=clf2.predict(Xtrain), target_names=class_names))

In [272]:
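# Reload the fitted vectorizer saved earlier; calling transform on an unfitted
# TfidfVectorizer raises "NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted."
vect = joblib.load("tfidfvec.pkl")
i = ["Trump is signing executive order for muslim ban"]
ti = vect.transform(i)
clf.predict_proba(ti)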

In [None]:
print(json.dumps(json.loads(df.to_json(orient='records')), indent=2))

In [284]:
d = {
    "article_title": " This Motorcyclist Rear-Ended A Car...You'll Never Believe Your Eyes When You See What Happened",
    "clickbait": 1
  }
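# Every value in d is a scalar, so wrap the dict in a list to build a one-row frame;
# pd.DataFrame(d) raises "ValueError: If using all scalar values, you must pass an index".
nd = pd.DataFrame([d])
nd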

In [299]:
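# classification_report returns a plain string by default, and calling dict() on a string
# raises "ValueError: dictionary update sequence element #0 has length 1; 2 is required".
# output_dict=True (scikit-learn >= 0.20) returns the report as a nested dict instead.
a = classification_report(y_true=ytest, y_pred=pred_ytest, target_names=class_names, output_dict=True)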

In [306]:
from sklearn.metrics import f1_score, precision_score, accuracy_score, recall_score
# No trailing commas here: a trailing comma would wrap each score in a one-element tuple.
f1score = f1_score(y_true=ytest, y_pred=pred_ytest)
accuracy = accuracy_score(y_true=ytest, y_pred=pred_ytest)
precision = precision_score(y_true=ytest, y_pred=pred_ytest)
recall = recall_score(y_true=ytest, y_pred=pred_ytest)
print(f1score, accuracy, precision, recall)


0.931419457735 0.93064516129 0.966887417219 0.898461538462
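
As a sanity check, the four scores printed above (F1, accuracy, precision, recall, in that order) can be reproduced from a single confusion matrix. The counts below are inferred from those scores and the 1240-row test set; the notebook never prints them, so treat them as an assumption.

```python
# Counts inferred from the printed scores (an assumption, not notebook output).
tp, fp, fn, tn = 584, 20, 66, 570

precision = tp / (tp + fp)                           # share of predicted clickbait that is clickbait
recall = tp / (tp + fn)                              # share of true clickbait that was caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall
accuracy = (tp + tn) / (tp + fp + fn + tn)           # share of all titles classified correctly

print(f1, accuracy, precision, recall)
```

Under these counts the model misses 66 clickbait titles and flags 20 news titles, which is why precision is higher than recall.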
