# ML Powered Applications

# Chapter 4

Data exploration notebook to better understand the data.

Objective
- To label and identify trends

Process
- Generate summary statistics
- Identify differences in class distributions

In [None]:
# run script on a 3.6 environment - base36
!pip install -U spacy
!pip install -U umap-learn
!python -m spacy download en_core_web_sm
!pip install --upgrade gensim

In [None]:
# library dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# initialization
PATH_data = r"C:\Users\nrosh\Desktop\Personal Coding Projects\Python\ml-powered-applications\neel\data"

## Clustering

### Definitions

1. Clustering
    - Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)
    - Many clustering algorithms group data points by measuring the distance between points and assigning ones that are close to each other to the same cluster.
    - The vast majority of datasets can be separated into clusters based on their features, labels, or a combination of both. Examining each cluster individually and the similarities and differences between clusters is a great way to identify structure in a dataset.
    - Clustering algorithms work on vectors, so we can’t simply pass a set of sentences to a clustering algorithm. To get our data ready to be clustered, we will first need to vectorize it.

2. Vectorization
    - The process of converting raw data (such as text) into a numeric vector of one or more dimensions.
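To make the distance-based grouping described above concrete, here is a minimal k-means sketch in plain Python. The helper names (`euclidean`, `assign_clusters`, `update_centroids`, `k_means`), the toy points, and the choice of initial centroids are all assumptions made for illustration; they are not part of this notebook's pipeline, and k-means is only one of many clustering algorithms.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
        for p in points
    ]


def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centroids.append(old)  # keep an empty cluster's centroid in place
            continue
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign_clusters(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update_centroids(points, labels, centroids)
    return labels, centroids


# made-up 2-D vectors forming two visually obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Points that are close to each other end up with the same label, which is the behavior the definition describes. Real text data would first need to be vectorized before it could be passed to a routine like this.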


### Vectorization techniques

Clustering relies on distances between points, so every feature should be measured on a comparable scale.

The approach for vectorizing/normalizing data depends on the structure and type of data being analysed.
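For tabular data, the effect of scale on distances can be sketched in a few lines of plain Python. The rows, the column meanings, and the helper names (`euclidean`, `min_max_scale`, `one_hot`) are made up for illustration, and min-max scaling is only one possible way to bring features onto a common scale.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_scale(rows):
    """Rescale every column of `rows` to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]


def one_hot(value, categories):
    """Encode `value` as a vector with a single 1.0 at its category's index."""
    return tuple(1.0 if value == c else 0.0 for c in categories)


# made-up rows: (age in years, income in dollars)
rows = [(25, 40000), (60, 41000), (26, 90000)]
scaled = min_max_scale(rows)

# raw distances are dominated by income, the feature with the larger range
print(euclidean(rows[0], rows[1]), euclidean(rows[0], rows[2]))
# after scaling, a large age gap counts as much as a large income gap
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))

# one-hot vectors of distinct categories are always the same distance apart
colors = ["red", "green", "blue"]
print(euclidean(one_hot("red", colors), one_hot("green", colors)))
print(euclidean(one_hot("red", colors), one_hot("blue", colors)))
```

Before scaling, the 35-year age gap between the first two rows barely registers next to the income differences. After scaling, both gaps contribute on the same footing.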

1. Tabular data
    - Continuous features should be normalized to a common scale
    - Categorical features such as colors can be converted to a one-hot encoding (one binary column per category). This keeps every pair of distinct categories the same distance apart.
    
2. Text data

    - Bag of Words (tokenize sentences and count word occurrences by row)
        - The simplest way to vectorize text is to use a count vector, which is the word equivalent of one-hot encoding.
        - For each sentence, the number at each index represents the count of occurrences of the associated word in the given sentence.
        - This method ignores the order of the words in a sentence.
        - scikit-learn TfidfVectorizer
            - Produces one vector per row from its tokenized words, with each word's count weighted by TF-IDF so that words appearing in many rows are down-weighted
            - https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

    - Word2vec and fastText
        - These vectorization techniques produce word vectors that attempt to learn a representation that captures similarities between concepts better than a TF-IDF encoding. 
        - They do this by learning which words tend to appear in similar contexts in large bodies of text such as Wikipedia.
        - This approach is based on the distributional hypothesis, which claims that linguistic items with similar distributions have similar meanings.
            - This is done by learning a vector for each word and training a model to predict a missing word in a sentence using the word vectors of words around it. 
            - The number of neighboring words to take into account is called the window size.
    - Dimensionality Reduction
        - Vectorized data usually have many dimensions and can't be plotted directly
        - The goal is to project the data into a two- or three-dimensional space that can be visualized, whilst minimizing the information lost in the reduction
        - Techniques
            - t-SNE
            - UMAP
        - These techniques are useful for noticing patterns in data at a very high level
        - The goal is to use these methods to see whether there are regions of the data that can easily be separated by a production model
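Returning to the bag-of-words representation described above, the sketch below builds count vectors using only the standard library. The whitespace tokenizer, the helper names (`tokenize`, `count_vectors`), and the two sentences are assumptions for illustration. A library vectorizer such as the scikit-learn one linked above does considerably more (for example, TF-IDF weighting).

```python
from collections import Counter


def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return sentence.lower().split()


def count_vectors(sentences):
    """Return the sorted vocabulary and one count vector per sentence."""
    tokenized = [tokenize(s) for s in sentences]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # the number at each index is the count of the associated vocabulary word
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


sentences = ["the cat sat on the mat", "the mat sat on the cat"]
vocabulary, vectors = count_vectors(sentences)
print(vocabulary)  # ['cat', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 1, 1, 1, 2], [1, 1, 1, 1, 2]]
```

The two sentences differ only in word order, so they receive identical vectors. This is the order-blindness noted in the list above.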

UMAP
    - General purpose manifold learning and dimension reduction algorithm
    
    
Once we have a vectorized representation of our unstructured data, we can use it for the purpose of data inspection/exploration or outcome predictions.

1. Inspection
        - Dimensionality Reduction
             - Vectors produced from unstructured data usually have more than two dimensions. They need to be reduced in some way before we can visualize them on a two-dimensional plane.

2. Prediction





### Labelling Strategy
Feel free to update your vectorization strategy by adding any features you discover to help make your data representation as informative as possible, and go back to labeling.


In [None]:
from IPython.display import Image
Image(url="vectorization_strategy.png")

## Statistics Observed

1. Are there distinct regions of post titles that can be classified into one or multiple labels?
    
Sources:
- stackexchange: 
    - Data Schema: https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede
    - Score: https://meta.stackexchange.com/questions/229255/what-is-the-score-of-a-post
- UMAP: https://umap-learn.readthedocs.io/en/latest/
    
- word2vec: https://code.google.com/archive/p/word2vec/

## Functions

In [None]:
import os
from datetime import datetime


def export_df(df, cols, **kwargs):
    """Select `cols` from `df`, optionally rename them and export the result to CSV.

    Optional keyword arguments:
        rename_dict: mapping of old column names to new ones
        export_loc:  directory to write the CSV to (nothing is exported if omitted)
        export_name: file name without extension (defaults to adhoc_<mmddyy>)
    """

    # copy so that renaming never touches (or warns about) the original dataframe
    _df = df.loc[:, cols].copy()

    # rename dict
    rename_dict = kwargs.get("rename_dict")
    if rename_dict:
        _df = _df.rename(columns=rename_dict)
        print('- Columns renamed.')

    # export data
    export_loc = kwargs.get("export_loc")
    if export_loc:

        if not isinstance(export_loc, str):
            raise TypeError(f"export_loc must be of type str. Given: {type(export_loc)}")

        # fall back to a dated ad hoc file name when none is given
        export_name = kwargs.get("export_name") or "adhoc_{}".format(datetime.today().strftime("%m%d%y"))
        _location = os.path.join(export_loc, export_name + ".csv")
        _df.to_csv(_location)
        print(f"- File exported to: {_location}")

    print('\n')
    return _df

## Ingestion


In [None]:
# original
df_orig = pd.read_csv(
    PATH_data + "\\labelling\\questions_by_cluster.csv",
)

# copies (every column except the first)
df = df_orig.iloc[:, 1:].copy()
df.info()

## Labeling Strategy

Goal: 
1. To produce features that describe each cluster uniquely
2. To label the data with the results you would like the model to produce

Strategies:

1. Update your vectorization strategy by adding any features you discover to help make your data representation as informative as possible, and go back to labeling.
2. To speed up your labeling, leverage your prior analysis by labeling a few data points in each cluster and for each common value in your feature distribution.
3. Use cluster visualizations to infer characteristics about each data point
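The second strategy, labeling a few data points in each cluster, can be sketched as follows. The snippet works on a plain list of dictionaries rather than the notebook's dataframe so that it stays self-contained. The helper name `sample_per_cluster`, the seed, and the generated rows are assumptions for illustration; only the `post_title` and `cluster` column names are taken from this notebook.

```python
import random
from collections import defaultdict


def sample_per_cluster(rows, n_per_cluster, seed=0):
    """Pick up to `n_per_cluster` rows from every cluster for manual labeling."""
    rng = random.Random(seed)  # fixed seed so the same rows are picked on every run
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["cluster"]].append(row)

    picked = []
    for cluster in sorted(by_cluster):
        members = by_cluster[cluster]
        # small clusters contribute all of their rows
        picked.extend(rng.sample(members, min(n_per_cluster, len(members))))
    return picked


# made-up rows standing in for the notebook's dataframe
rows = [{"post_title": f"title {i}", "cluster": i % 3} for i in range(10)]
to_label = sample_per_cluster(rows, n_per_cluster=2)

for row in to_label:
    print(row["cluster"], row["post_title"])
```

With the data in a pandas dataframe, grouping by the `cluster` column and sampling within each group would be the analogous operation.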


In [None]:
# Bokeh setup and sources

import bokeh.models as bmo
import bokeh.plotting as bpl

from bokeh.plotting import figure, output_file, show
from bokeh.palettes import d3, all_palettes
from bokeh.layouts import row, layout
from bokeh.models import CheckboxGroup, CustomJS, Panel, CDSView
from bokeh.models.tools import HoverTool
from bokeh.models.widgets import Tabs
# from bokeh.io import show, output_notebook

# Last mile conversions
df_bokeh                      = df.copy()
df_bokeh["question_answered"] = df_bokeh.question_answered.apply(lambda row: "answered" if row==True else "unanswered")
df_bokeh["cluster"]           = df_bokeh.cluster.apply(lambda row: str(row))

# define UI Tools
TOOLS="hover,crosshair,pan,wheel_zoom,zoom_in,zoom_out,box_zoom,undo,redo,reset,tap,save,box_select,poly_select,lasso_select,"
LABELS = ["Cluster {}".format(cluster) for cluster in np.sort(df_bokeh.cluster.unique())]

# sources
df_ans   = df_bokeh.loc[df_bokeh.question_answered == 'answered', :]
df_nans  = df_bokeh.loc[df_bokeh.question_answered == 'unanswered', :]

src_ans  = bpl.ColumnDataSource(
    df_ans.to_dict('list')
)

src_nans = bpl.ColumnDataSource(
    df_nans.to_dict('list')
)


In [None]:
# bokeh function

def update(attr, old, new):
    
    # server-side callback to update the sources; it only runs if attached with
    # checkbox_group.on_change, which requires a Bokeh server
    try:
        # Get the list of selected clusters for the graph; checkbox indices are
        # assumed to match the cluster ids, which df_bokeh stores as strings
        clusters_selected = [str(i) for i in checkbox_group.active]
        print(clusters_selected)
        
        # Update the sources used by the scatter glyphs, replacing all columns at once
        src_ans.data = df_ans.loc[df_ans.cluster.isin(clusters_selected), :].to_dict('list')
        src_nans.data = df_nans.loc[df_nans.cluster.isin(clusters_selected), :].to_dict('list')
        
    except Exception as e:
        print(e)

        
## callback functions
filter_cluster = CustomJS(
    args=dict(source=src_ans), 
    code="""
        
        // active checkbox indices (assumed to match the integer cluster ids)
        const selected_categories = this.active
        
        // start with an empty array for every column in the source
        const data = source.data
        const filtered = {}
        for (const key of Object.keys(data)) {
            filtered[key] = []
        }
        
        // iterate through rows of the data source and keep those in a selected cluster
        for (let i = 0; i < source.get_length(); i++){
            if (selected_categories.includes(parseInt(data['cluster'][i]))){
                for (const key of Object.keys(data)) {
                    filtered[key].push(data[key][i])
                }
            }
        }
        
        console.log(this.active)
        console.log(filtered['x'].length)
        
        // replace all columns in one assignment so BokehJS registers the change;
        // rows of unchecked clusters are dropped and are not restored by re-checking
        source.data = filtered
        source.change.emit();
        
""")

In [None]:
## Bokeh workflow

# output_notebook()

#
# Figures and colormaps
#

p_answered = figure(tools=TOOLS)
p_nanswered = figure(tools=TOOLS)

# one shared, sorted list of clusters so that a cluster gets the same color in both plots
cluster_factors = sorted(df_bokeh.cluster.unique())
# Set1 only provides palettes for 3 to 9 colors
cluster_palette = all_palettes['Set1'][max(3, len(cluster_factors))]

ans_cmap = bmo.CategoricalColorMapper(
    factors=cluster_factors,
    palette=cluster_palette
)

nans_cmap = bmo.CategoricalColorMapper(
    factors=cluster_factors,
    palette=cluster_palette
)

#
# Visualizations
#

p_answered.scatter(
    x='x', 
    y='y',
    source=src_ans,
    color={'field': 'cluster', 'transform': ans_cmap},
    fill_alpha=0.9,
    line_color=None,
    legend_group='cluster'
)

# reuse ans_cmap (rather than nans_cmap) so each cluster keeps the same color in both plots
p_nanswered.scatter(
    x='x', 
    y='y',
    source=src_nans,
    color={'field': 'cluster', 'transform': ans_cmap},
    fill_alpha=0.9,
    line_color=None,
    legend_group='cluster'
)

# Decorations
p_answered.title.text = 'Answered questions'
p_answered.legend.title = 'Cluster'
p_answered.xaxis.axis_label = 'Dimensional x'
p_answered.yaxis.axis_label = 'Dimensional y'


p_nanswered.title.text = 'Unanswered questions'
p_nanswered.legend.title = 'Cluster'
p_nanswered.xaxis.axis_label = 'Dimensional x'
p_nanswered.yaxis.axis_label = 'Dimensional y'

fig_dim = 550
p_answered.width = p_nanswered.width = fig_dim
p_answered.height = p_nanswered.height = fig_dim

# hover parameters
hover_answered = HoverTool()
hover_answered.tooltips=[
    ('Question ', '@question_answered'),
    ('Post Question Title', '@post_title'),
    ('Cluster', '@cluster'),
]

hover_nanswered = HoverTool()
hover_nanswered.tooltips=[
    ('Question ', '@question_answered'),
    ('Post Question Title', '@post_title'),
    ('Cluster', '@cluster')
]

p_answered.add_tools(hover_answered)
p_nanswered.add_tools(hover_nanswered)

#
# Widgets
#

checkbox_group = CheckboxGroup(labels=LABELS, active=[i for i in range(len(LABELS))])
checkbox_group.js_on_change('active', filter_cluster)

# checkbox_group.on_change('active', update)

# output_file("color_scatter.html", title="color_scatter.py example")

# layout specifications
lay = layout(
        children = [
            [p_answered, p_nanswered],
            [checkbox_group],
        ]
)

tab = Panel(child=lay, title = 'Clusters')
tabs = Tabs(tabs=[tab])

show(tabs)  # open a browser or notebook depending on configuration