# ML Powered Applications

# Chapter 4

Data exploration notebook to better understand the data.

Objective
- To label and identify trends

Process
- Generate summary statistics
- Identify differences in class distributions

In [None]:
# run script on a 3.6 environment - base36
!pip install -U spacy
!pip install -U umap-learn
!python -m spacy download en_core_web_sm
!pip install --upgrade gensim

In [1]:
# library dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# initialization
PATH_data = r"C:\Users\nrosh\Desktop\Personal Coding Projects\Python\ml-powered-applications\neel\data"

## Clustering

### Definitions

1. Clustering
    - Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)
    - Many clustering algorithms group data points by measuring the distance between points and assigning ones that are close to each other to the same cluster.
    - The vast majority of datasets can be separated into clusters based on their features, labels, or a combination of both. Examining each cluster individually and the similarities and differences between clusters is a great way to identify structure in a dataset.
    - Clustering algorithms work on vectors, so we can’t simply pass a set of sentences to a clustering algorithm. To get our data ready to be clustered, we will first need to vectorize it.

2. Vectorization
    - The process of converting raw data, such as sentences or tabular rows, into numeric vectors of one or more dimensions
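
To make the distance-based grouping concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in NumPy. The toy points, the `kmeans` helper, and its parameters are illustrative assumptions, not code from this notebook; in practice a library implementation such as scikit-learn's `KMeans` would normally be used.

```python
import numpy as np


def kmeans(points, k, n_iter=20, seed=0):
    """Minimal k-means (hypothetical helper): returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # start from k distinct points picked at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # assign each point to its closest centroid
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids


# two well-separated toy groups in 2-D
points = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                   [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
labels, centroids = kmeans(points, k=2)
print(labels)
```

The input is already a numeric array, which is why text has to be vectorized before it can be clustered this way.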


### Vectorization techniques

Distance-based clustering needs every feature to be measured on a comparable scale; otherwise features with large numeric ranges dominate the distance.

The approach for vectorizing/normalizing data depends on the structure and type of data being analysed.

1. Tabular data
    - Continuous features should be normalized to a common scale
    - Categorical features such as colors can be converted to a one-hot encoding (one binary column per category). This keeps the distance between any two different categories the same.
    
2. Text data

    - Bag of words (tokenize each sentence and count token occurrences per row)
        - The simplest way to vectorize text is to use a count vector, which is the word equivalent of one-hot encoding.
        - For each sentence, the number at each index represents the count of occurrences of the associated word in the given sentence.
        - This method ignores the order of the words in a sentence.
        - scikit-learn TfidfVectorizer
            - Tokenizes each document and produces one vector per row; the counts are reweighted by TF-IDF, so words that appear in many documents contribute less
            - https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

    - Word2vec and fastText
        - These vectorization techniques produce word vectors that attempt to learn a representation that captures similarities between concepts better than a TF-IDF encoding. 
        - They do this by learning which words tend to appear in similar contexts in large bodies of text such as Wikipedia.
        - This approach is based on the distributional hypothesis, which claims that linguistic items with similar distributions have similar meanings.
            - This is done by learning a vector for each word and training a model to predict a missing word in a sentence using the word vectors of words around it. 
            - The number of neighboring words to take into account is called the window size.
    - Dimensionality Reduction
        - Vectorized data usually has far more than two or three dimensions, so it can't be plotted directly
        - The goal is to project the high-dimensional data into a space we can visualize while losing as little information as possible
        - Techniques
            - t-SNE
            - UMAP
        - These techniques are useful for noticing patterns in the data at a very high level
        - The goal is to use these methods to see whether there are regions of the data that a production model could easily separate
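
The count-vector idea from the bag-of-words bullets above can be sketched in plain Python. The sentences, the whitespace tokenizer, and the `bag_of_words` helper are illustrative assumptions; scikit-learn's vectorizers apply their own tokenization and, in the case of `TfidfVectorizer`, reweight the counts.

```python
from collections import Counter


def bag_of_words(sentences):
    """Hypothetical helper: return (vocabulary, count vectors), one vector per sentence."""
    # naive tokenizer: lowercase and split on whitespace
    tokenized = [s.lower().split() for s in sentences]
    # the vocabulary fixes which word each vector index stands for
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


sentences = ["the cat sat on the mat", "the mat sat on the cat"]
vocabulary, vectors = bag_of_words(sentences)
print(vocabulary)
print(vectors)
```

Both sentences map to the same vector, which is the loss of word order noted above.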

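To illustrate what the window size controls in the word2vec-style training described above, the sketch below builds the (context, target) pairs that a predict-the-missing-word setup would be trained on. It only prepares the pairs and does not train any vectors; the sentence and the `context_pairs` helper are assumptions made for illustration.

```python
def context_pairs(tokens, window_size):
    """Hypothetical helper: pair each word with its neighbors inside the window."""
    pairs = []
    for i, target in enumerate(tokens):
        # up to window_size words on each side of the target
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        pairs.append((left + right, target))
    return pairs


tokens = "how do i write a good first sentence".split()
pairs = context_pairs(tokens, window_size=2)
for context, target in pairs:
    print(context, "->", target)
```

A larger window gives each target word more neighbors, at the cost of including words that are less closely related to it.
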
UMAP
- General-purpose manifold learning and dimension reduction algorithm
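
As a rough illustration of reducing vectors to two dimensions for plotting, the sketch below projects random toy vectors onto their first two principal components with NumPy. PCA is used here only as a simple linear stand-in; it is not t-SNE or UMAP, which are nonlinear. The data and the `project_to_2d` helper are assumptions.

```python
import numpy as np


def project_to_2d(vectors):
    """Hypothetical helper: linear projection onto the top two principal components."""
    # center the data so the components describe variance around the mean
    centered = vectors - vectors.mean(axis=0)
    # rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 50))  # 100 toy items, 50 dimensions each
embedding = project_to_2d(vectors)
print(embedding.shape)
```

With umap-learn installed, `umap.UMAP(n_components=2).fit_transform(vectors)` takes the same kind of matrix and returns an array of the same shape.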
    
    
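For the tabular case listed under the vectorization techniques, a minimal pandas sketch might look as follows. The toy columns (`score`, `color`) are hypothetical, and z-score standardization is just one common way to bring continuous features onto a common scale.

```python
import pandas as pd

toy = pd.DataFrame({
    "score": [1.0, 5.0, 9.0],
    "color": ["red", "green", "red"],
})

# continuous feature: rescale to zero mean and unit variance (z-score)
score_scaled = (toy["score"] - toy["score"].mean()) / toy["score"].std(ddof=0)

# categorical feature: one binary column per category
color_onehot = pd.get_dummies(toy["color"], prefix="color", dtype=float)

features = pd.concat([score_scaled.rename("score_scaled"), color_onehot], axis=1)
print(features)
```
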
Once we have a vectorized representation of our unstructured data, we can use it to inspect and explore the data or to predict outcomes.

1. Inspection
    - Dimensionality Reduction
        - Vectors produced from unstructured data usually have far more than two dimensions, so the dataset needs to be reduced before we can visualize it on a two-dimensional plane.

2. Prediction





### Labeling Strategy
Feel free to update your vectorization strategy by adding any features you discover to help make your data representation as informative as possible, and go back to labeling.


In [2]:
from IPython.display import Image
Image(url="vectorization_strategy.png")

## Statistics Observed

1. Are there distinct regions in the embedded post titles that can be assigned one or more labels?
    
Sources:
- stackexchange: 
    - Data Schema: https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede
    - Score: https://meta.stackexchange.com/questions/229255/what-is-the-score-of-a-post
- UMAP: https://umap-learn.readthedocs.io/en/latest/
    
- word2vec: https://code.google.com/archive/p/word2vec/

## Functions

In [3]:
import os
from datetime import datetime


def export_df(df, cols, **kwargs):
    """Return a copy of the selected columns, optionally renamed and exported to CSV.

    Optional kwargs: rename_dict (dict), export_loc (str), export_name (str).
    """
    # copy so the rename below never touches the caller's dataframe
    _df = df.loc[:, cols].copy()

    # rename columns if a mapping was supplied
    if "rename_dict" in kwargs:
        _df = _df.rename(columns=kwargs["rename_dict"])
        print("- Columns renamed.")

    # export only when a location was supplied
    export_loc = kwargs.get("export_loc")
    if export_loc:
        if not isinstance(export_loc, str):
            raise TypeError(f"export_loc must be of type str. Given: {type(export_loc)}")

        # default to a date-stamped file name when export_name is missing
        default_name = "adhoc_{}".format(datetime.today().strftime("%m%d%y"))
        _location = os.path.join(export_loc, kwargs.get("export_name", default_name) + ".csv")
        _df.to_csv(_location)
        print(f"- File exported to: {_location}")

    print("\n")
    return _df

## Ingestion


In [4]:
# original
df_orig = pd.read_csv(
    PATH_data + "\\labelling\\kmean_clusteral_predictions.csv",
)

# working copy, dropping the first column
df = df_orig.iloc[:, 1:].copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9935 entries, 0 to 9934
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   PostTypeId            9935 non-null   int64  
 1   AcceptedAnswerId      5081 non-null   float64
 2   ParentId              0 non-null      float64
 3   AnswerCount           9935 non-null   float64
 4   CommentCount          9935 non-null   int64  
 5   FavoriteCount         4052 non-null   float64
 6   LastActivityDate      9935 non-null   object 
 7   CreationDate          9935 non-null   object 
 8   ClosedDate            1294 non-null   object 
 9   LastEditDate          6222 non-null   object 
 10  Score                 9935 non-null   int64  
 11  Title                 9935 non-null   object 
 12  body_text             9935 non-null   object 
 13  fe_tenure             9935 non-null   float64
 14  fe_isclosed           9935 non-null   int64  
 15  fe_isquestion        

## Labeling Strategy

Goal: 
1. To produce features that describe each cluster uniquely
2. To produce labels that lead to the results you want

Strategies:

1. Update your vectorization strategy by adding any features you discover to help make your data representation as informative as possible, and go back to labeling.
2. To speed up your labeling, leverage your prior analysis by labeling a few data points in each cluster and for each common value in your feature distribution.
3. Use cluster visualizations to infer characteristics about each data point
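
Strategy 2 can be sketched with pandas by pulling a few rows from every cluster for manual labeling. The toy frame below stands in for `df`; the column names `Title` and `kmeans_cluster_Title` match the dataset in this notebook, while the titles themselves and the sample size of two per cluster are arbitrary assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "Title": ["a", "b", "c", "d", "e", "f"],
    "kmeans_cluster_Title": [0, 0, 0, 1, 1, 2],
})

to_label = (
    toy.sample(frac=1, random_state=0)   # shuffle the rows
       .groupby("kmeans_cluster_Title")
       .head(2)                          # keep up to two rows per cluster
)
print(to_label)
```

Shuffling before `head` means the rows picked from each cluster are not simply the first ones in file order.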


In [5]:
# preview the first rows
df.head()

Unnamed: 0,PostTypeId,AcceptedAnswerId,ParentId,AnswerCount,CommentCount,FavoriteCount,LastActivityDate,CreationDate,ClosedDate,LastEditDate,...,fe_tenure,fe_isclosed,fe_isquestion,fe_isanswer,fe_isfavorited,fe_wasedited,fe_question_answered,umap_x_Title,umap_y_Title,kmeans_cluster_Title
0,1,15.0,,10.0,7,19.0,2019-03-31 20:10:59.657,2010-11-18 20:40:32.857,2019-09-09 15:44:30.727,2019-02-10 04:06:33.283,...,77184.0,1,1,0,1,0,1,9.863544,2.54322,0
1,1,16.0,,7.0,0,5.0,2018-04-29 19:35:55.850,2010-11-18 20:42:31.513,,2018-04-29 19:35:55.850,...,inf,0,1,0,1,0,1,6.362819,2.405579,0
2,1,31.0,,5.0,1,10.0,2018-05-04 11:04:09.610,2010-11-18 20:43:28.903,,2018-05-04 11:04:09.610,...,inf,0,1,0,1,0,1,9.492914,-0.090201,0
3,1,,,8.0,1,4.0,2019-05-23 20:32:12.107,2010-11-18 20:43:59.693,,2019-04-10 07:29:49.297,...,inf,0,1,0,1,0,0,6.696061,2.830864,0
4,1,85.0,,10.0,1,6.0,2018-04-29 19:26:50.553,2010-11-18 20:45:44.067,,2010-11-18 21:16:41.767,...,inf,0,1,0,1,0,1,5.606722,3.328136,1
