In [2]:
import pandas as pd
import numpy as np

# produce vector inline graphics
# from IPython.display import set_matplotlib_formats, display, Markdown, HTML

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF

# set_matplotlib_formats('pdf', 'svg')

# import matplotlib.pyplot as plt
# plt.rcParams['figure.figsize'] = [10, 4]
# display(HTML("<style>.container { width:70% !important; }</style>"))

# Preprocess data for [Voilà](https://github.com/voila-dashboards/voila) dashboard

In the previous example, we turned our exploratory notebook directly into a Voila app. When we started that app, it needed to run every cell in the notebook to produce results. 

This is inefficient: our end users shouldn't have to wait for this code to run every time they open the dashboard. It matters more as we add machine learning, or any other computation that takes a long time to run.

In this notebook, we'll reproduce the pre-processing steps needed for the dashboard. We will then save the processed data and export the parts of our topic model (the NMF components and the feature names) that the dashboard displays.

We need to run this notebook before we run the new, faster Voila app. In a production setting, we could split these two *services* into separate applications, e.g. two Heroku apps connected to the same database.

## Data
We'll be using an [example dataset](https://raw.githubusercontent.com/shubham13p/Ad-Click-Prediction/master/advertising.csv) found on GitHub. We'll use Pandas to load in this data and display the first few rows.

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/shubham13p/Ad-Click-Prediction/master/advertising.csv')
df['Click_labeled'] = df['Clicked on Ad'].apply(lambda x: "Click" if x == 1 else "No Click")
df.head()

Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad,Click_labeled
0,68.95,35,61833.9,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0,Tunisia,2016-03-27 00:53:11,0,No Click
1,80.23,31,68441.85,193.77,Monitored national standardization,West Jodi,1,Nauru,2016-04-04 01:39:02,0,No Click
2,69.47,26,59785.94,236.5,Organic bottom-line service-desk,Davidton,0,San Marino,2016-03-13 20:35:42,0,No Click
3,74.15,29,54806.18,245.89,Triple-buffered reciprocal time-frame,West Terrifurt,1,Italy,2016-01-10 02:31:19,0,No Click
4,68.37,35,73889.99,225.58,Robust logistical utilization,South Manuel,0,Iceland,2016-06-03 03:36:18,0,No Click


## Matplotlib / Altair plots

If we wanted to, we could generate and export our plots as standalone files ahead of time. With that approach, we could embed them in *static* HTML, and we might not even need Voila. This is true even for the interactive plots!

We won't go into too much detail about this, but Altair charts render entirely in the browser with JavaScript, so they don't need a backend to run. We could have rich, interactive charts even on something like GitHub Pages, or any static HTML server.

So, again, **why use Voila?**

**Pros:**
* Integrates directly with Jupyter workflow
* Less overhead - we don't need to build a separate app or any other infrastructure

**Cons:**
* Slower to run
* Less flexibility than other frameworks

So the main use case for Voila is in **prototyping** a report, app, or dashboard.

## Topic Modeling

Here, we'll handle the ML part of the app and add the resulting topic labels to the DataFrame. If we wanted to, we could save the topic model and load it in our Voila app, but we don't really need to.

**Q: When would we want to include the model in the Voila app?** A: If we needed to generate predictions interactively. We're just using the model for descriptive analysis, rather than predictive, so we just need to save the results. 
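
If we did need interactive predictions, one possible approach (an assumption, not what this notebook does) is to persist the fitted vectorizer and NMF model with `joblib` and reload them in the app. The sketch below uses a made-up toy corpus and a hypothetical file name, `topic-model.joblib`.

```python
import os
import tempfile

import joblib
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the 'Ad Topic Line' column
ads = [
    "robust cloud service",
    "robust mobile service",
    "scalable cloud platform",
    "scalable mobile platform",
]

vect = TfidfVectorizer()
nmf = NMF(n_components=2, random_state=42, max_iter=500)
nmf.fit(vect.fit_transform(ads))

with tempfile.TemporaryDirectory() as tmp:
    # Preprocessing side: save both fitted objects in one file
    path = os.path.join(tmp, "topic-model.joblib")
    joblib.dump({"vectorizer": vect, "nmf": nmf}, path)

    # App side: load them back without refitting
    loaded = joblib.load(path)

# Score a new ad line typed in by the user
new_ad = ["robust cloud platform"]
weights = loaded["nmf"].transform(loaded["vectorizer"].transform(new_ad))
topic = int(weights.argmax(axis=1)[0]) + 1  # same 1-based labels as below
```

Both objects are needed: the vectorizer maps new text onto the vocabulary the NMF model was trained on.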

In [4]:
tfidf_vect = TfidfVectorizer(max_df=0.9, min_df=10, stop_words='english')

# Create matrix of TFIDF features
doc_term_matrix = tfidf_vect.fit_transform(df['Ad Topic Line'].values.astype('U'))
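
To see what `max_df` and `min_df` do, here is a small sketch on a made-up corpus (not our ad data). A float `max_df` is a share of documents, an integer `min_df` is a document count, and terms outside those bounds are dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
docs = [
    "robust cloud service desk",
    "robust mobile service",
    "robust cloud platform",
    "robust mobile platform",
]

# max_df=0.9 drops terms found in more than 90% of documents ("robust")
# min_df=2 drops terms found in fewer than 2 documents ("desk")
vect = TfidfVectorizer(max_df=0.9, min_df=2)
matrix = vect.fit_transform(docs)

vocab = sorted(vect.vocabulary_)
print(vocab)         # ['cloud', 'mobile', 'platform', 'service']
print(matrix.shape)  # (4, 4)
```

In our notebook `min_df=10` plays the same role at a larger scale: a word has to appear in at least ten ad lines to become a feature.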

In [5]:
# Create NMF model
nmf = NMF(n_components=3, random_state=42)
nmf.fit(doc_term_matrix)

# Create matrix of NMF outputs
topic_values = nmf.transform(doc_term_matrix)

# Assign the most relevant topic to each row in the DataFrame
df['Topic'] = topic_values.argmax(axis=1) + 1

df.head()

Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad,Click_labeled,Topic
0,68.95,35,61833.9,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0,Tunisia,2016-03-27 00:53:11,0,No Click,1
1,80.23,31,68441.85,193.77,Monitored national standardization,West Jodi,1,Nauru,2016-04-04 01:39:02,0,No Click,1
2,69.47,26,59785.94,236.5,Organic bottom-line service-desk,Davidton,0,San Marino,2016-03-13 20:35:42,0,No Click,3
3,74.15,29,54806.18,245.89,Triple-buffered reciprocal time-frame,West Terrifurt,1,Italy,2016-01-10 02:31:19,0,No Click,1
4,68.37,35,73889.99,225.58,Robust logistical utilization,South Manuel,0,Iceland,2016-06-03 03:36:18,0,No Click,3
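
The topic assignment above is a row-wise `argmax` shifted to 1-based labels. A minimal sketch with made-up weights shows the mechanics, including one edge case worth knowing about: a row of all zeros still gets a label.

```python
import numpy as np

# Hypothetical document-topic weights: 5 documents x 3 topics
topic_values = np.array([
    [0.8, 0.1, 0.0],
    [0.0, 0.2, 0.7],
    [0.1, 0.9, 0.3],
    [0.5, 0.0, 0.4],
    [0.0, 0.0, 0.0],  # no weight on any topic
])

# Index of the largest weight per row, shifted so topics are numbered from 1
labels = topic_values.argmax(axis=1) + 1
print(labels)  # [1 3 2 1 1]
```

The last row lands in topic 1 only because `argmax` returns the first index on ties. If some ad lines contain no words from the pruned vocabulary (an assumption about how such rows could arise), it may be worth checking for all-zero rows before trusting their label.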


In [6]:
df.to_csv('data/df.csv')
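
One detail to keep in mind when the app reads this file back: `to_csv` writes the index as an unnamed first column by default. The sketch below, on a made-up two-row frame and a temporary path, shows the difference between a plain `read_csv` and one that passes `index_col=0`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of our DataFrame
small = pd.DataFrame({"Age": [35, 31], "Topic": [1, 3]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df.csv")
    small.to_csv(path)  # the index is written as the first column

    naive = pd.read_csv(path)                 # index comes back as a data column
    restored = pd.read_csv(path, index_col=0) # index is restored as the index

print(list(naive.columns))     # ['Unnamed: 0', 'Age', 'Topic']
print(list(restored.columns))  # ['Age', 'Topic']
```

Writing with `index=False` would be an alternative, as long as the app does not rely on the index column.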

In [7]:
nmf_components = nmf.components_

np.save('data/nmf-components.npy', nmf_components)

In [8]:
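# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from scikit-learn 1.0 onwards
feature_names = tfidf_vect.get_feature_names_out()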

with open('data/feature-names.txt', 'w') as f:
    f.write('\n'.join(feature_names))


# Sanity check: reload the saved artifacts the same way the Voila app will
nmf_components = np.load('data/nmf-components.npy')

with open('data/feature-names.txt', 'r') as f:
    feature_names = [line.strip() for line in f.readlines()]


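'''
# Disabled: needs `display` and `Markdown` from IPython.display (import commented out above)
# Topic labels in df start at 1 (argmax + 1), so enumerate from 1 as well
for topic_id, topic in enumerate(nmf_components, start=1):

    display(Markdown(f'### Topic #{topic_id}'))

    num_ads = len(df[df['Topic'] == topic_id])
    display(Markdown(f'**Number of Ads in Topic:** {num_ads}'))

    display(Markdown('**Top 5 words:**'))

    # argsort is ascending, so reverse the last five indices to list the strongest word first
    top = ''.join(f'* {feature_names[j]} \n' for j in topic.argsort()[-5:][::-1])
    display(Markdown(top))
'''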
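
The disabled cell above depends on IPython's display helpers. As a sketch of the same idea that runs anywhere, the hypothetical helper `top_words` below picks the highest-weighted words per topic from a components matrix. The small arrays are made up; in the app they would be the reloaded `nmf_components` and `feature_names`.

```python
import numpy as np


def top_words(components, feature_names, n=5):
    """Return the n highest-weighted words for each topic, strongest first."""
    return [
        [feature_names[j] for j in row.argsort()[::-1][:n]]
        for row in components
    ]


# Hypothetical stand-ins for the saved artifacts: 2 topics x 4 words
feature_names = ["cloud", "mobile", "platform", "service"]
components = np.array([
    [0.9, 0.0, 0.4, 0.1],
    [0.0, 0.8, 0.2, 0.5],
])

# Number topics from 1 to match the 'Topic' column in the DataFrame
for topic_id, words in enumerate(top_words(components, feature_names, n=2), start=1):
    print(f"Topic #{topic_id}: {', '.join(words)}")
# Topic #1: cloud, platform
# Topic #2: mobile, service
```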