## Text Classification

1. First, automatically identify the topics within a corpus of textual data by using unsupervised machine learning.
2. Then, apply a supervised classification algorithm to assign a topic label to each document, using the result of the previous step as target labels.

Steps:
1. Import Texts [Retrieve Data from Database]
2. Clean Texts [Remove Fixed Patterns]
3. Translate [If needed]
4. Preprocess Texts [(a) Tokenization (b) Stop Words Removal]
5. Bigrams [Identify word pairs within data]
6. Document Term Matrix [(a) Bag of Words (b) Vectorization]

In more detail:
1. Remove repeated parts of the text.
2. Preprocess the texts using tokenization, lemmatisation, and removal of stop words and digits.
3. Add n-grams to the dataset to guide the topic model, on the assumption that short sequences of words treated as single entities (tokens) usually carry useful information for identifying the topics of a sentence. Only bigrams turned out to be meaningful in this context; they significantly improved topic model performance.
4. Use count vectorization to transform the data into a numeric document-term matrix, with documents as rows, single tokens as columns, and the corresponding frequencies as values (the frequency of the selected token in a given chat). The Bag of Words approach does not take word order into account, which should not play a crucial role in topic identification.
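
The preprocessing and vectorization steps above can be sketched in plain Python. This is a minimal illustration, not the pipeline actually used: the stop-word list and example chats are hypothetical, lemmatisation is left out, and every adjacent word pair is added as a bigram, whereas a real pipeline would probably keep only frequent pairs.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only
STOP_WORDS = {"the", "a", "is", "my", "to", "and", "i", "for"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens only (drops digits), remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def add_bigrams(tokens):
    """Append every adjacent word pair, joined by '_', as an extra token."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def document_term_matrix(docs):
    """Return (vocabulary, matrix): documents as rows, tokens as columns."""
    counts = [Counter(add_bigrams(preprocess(d))) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

# Hypothetical example chats
docs = [
    "My credit card is blocked",
    "I need a new credit card",
]
vocab, dtm = document_term_matrix(docs)
for term, column in zip(vocab, zip(*dtm)):
    print(term, column)
```

Each row of `dtm` is one document and each column one token or bigram, so the bigram `credit_card` gets its own column alongside `credit` and `card`.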

## LDA - Latent Dirichlet Allocation

- Takes the document-term matrix as input to infer probability distributions over:
    1. A set of latent (i.e. unknown) topics across the documents.
    2. The words in the corpus vocabulary (the set of all words used in the dataset), by looking at the topics in the documents that contain each word and at the other topic assignments for that word across the corpus.

LDA outputs the k topics (where k is given to the model as a parameter) in the form of high-dimensional vectors, where each component represents the weight of a particular word in the vocabulary. By looking at the terms with the highest weights, it is possible to manually name the k topics, which improves the human interpretability of the output.

LDA also provides a topic distribution for each document in the dataset as a sparse vector (a few components with high weights, all the rest with zero or near-zero weight), which makes it easier to interpret the high-dimensional topic vectors and extract the relevant topics for each text.

So, the LDA model provides topic weights for each document it is trained on. The transition to a supervised approach is then straightforward: the vector component with the highest weight is picked, and the corresponding topic is used as the target label for the given document.
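
As a minimal sketch of this labelling step, the dominant topic of each document is simply the index of its largest weight. The topic weights below are hypothetical values chosen for illustration, not output from a real model.

```python
def dominant_topic(topic_weights):
    """Return the index of the topic with the highest weight."""
    return max(range(len(topic_weights)), key=lambda k: topic_weights[k])

# Hypothetical per-document topic distributions (one row per document)
doc_topic_weights = [
    [0.91, 0.05, 0.04],
    [0.02, 0.10, 0.88],
    [0.15, 0.80, 0.05],
]

# Target labels for the supervised classifier
labels = [dominant_topic(weights) for weights in doc_topic_weights]
print(labels)
```

The resulting `labels` list can then be paired with the document-term matrix rows to train any supervised classifier.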

- LDA is a generative probabilistic model of a collection of composites made up of parts. In NLP it is used for topic modelling.

In terms of topic modelling, the composites are documents and the parts are words or phrases.

LDA consists of 2 tables (matrices):
1. The first table describes the probability of selecting a particular part when sampling from a particular topic (category).
2. The second table describes the probability of selecting a particular topic when sampling from a particular document (composite).

LDA algorithm assumes composites were generated like so:
1. Pick your unique set of parts. 
2. Pick how many composites you want. 
3. Pick how many parts you want per composite (sample from a Poisson distribution).
4. Pick how many categories/topics you want.
5. Pick a positive number (greater than zero) and call it alpha.
6. Pick a positive number (greater than zero) and call it beta.
7. Build the Parts VS Topics table. For each column (topic), draw a sample from a Dirichlet distribution (which is a distribution over distributions) using beta as input. Each sample fills one column of the table, sums to one, and gives the probability of each part for that topic.
8. Build the Composites VS Topics table. For each row (composite), draw a sample from a Dirichlet distribution using alpha as input. Each sample fills one row of the table, sums to one, and gives the probability of each topic (column) for that composite.
9. Build actual composites. For each composite:
    - Look up its row in the Composites VS Topics table.
    - Sample a topic based on the probabilities in that row.
    - Go to the Parts VS Topics table.
    - Look up the column of the sampled topic.
    - Sample a part based on the probabilities in that column.
    - Repeat from the topic-sampling step until the composite has the number of parts it was set to have.
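
The generative process above can be simulated with the standard library alone. The sketch below is illustrative rather than a reference implementation: the function names, the example vocabulary, and the parameter values are all assumptions. Dirichlet samples are drawn by normalizing Gamma draws, and the Poisson sampler uses Knuth's method, which is only suitable for small means.

```python
import math
import random

def sample_dirichlet(rng, concentration, size):
    """Draw one symmetric Dirichlet sample by normalizing Gamma draws."""
    while True:
        draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
        total = sum(draws)
        if total > 0:  # guard against numerical underflow
            return [d / total for d in draws]

def sample_poisson(rng, mean):
    """Knuth's method for Poisson sampling (fine for small means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def generate_corpus(parts, n_composites, mean_length, n_topics, alpha, beta, seed=0):
    rng = random.Random(seed)
    # Parts VS Topics: one distribution over parts per topic (step 7)
    parts_by_topic = [sample_dirichlet(rng, beta, len(parts)) for _ in range(n_topics)]
    # Composites VS Topics: one distribution over topics per composite (step 8)
    topics_by_composite = [sample_dirichlet(rng, alpha, n_topics) for _ in range(n_composites)]
    corpus = []
    for topic_probs in topics_by_composite:  # step 9
        length = sample_poisson(rng, mean_length)  # parts in this composite
        composite = []
        for _ in range(length):
            topic = rng.choices(range(n_topics), weights=topic_probs)[0]
            part = rng.choices(parts, weights=parts_by_topic[topic])[0]
            composite.append(part)
        corpus.append(composite)
    return corpus, parts_by_topic, topics_by_composite

# Hypothetical vocabulary and settings
parts = ["card", "loan", "login", "password", "refund"]
corpus, parts_by_topic, topics_by_composite = generate_corpus(
    parts, n_composites=4, mean_length=6, n_topics=2, alpha=0.5, beta=0.5
)
for composite in corpus:
    print(composite)
```

Here each topic's distribution is stored as a list, so a "column" of the Parts VS Topics table corresponds to one entry of `parts_by_topic`. Fitting LDA amounts to running this process in reverse: given only the composites, infer the two tables.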

## Dirichlet Distribution

The Dirichlet distribution takes a number (called alpha) for each topic/category.

With three possible topics, the space of topic mixtures is a triangle whose corners are the pure topics. At low alpha values (less than one), most of the sampled topic distributions lie in the corners (near a single topic). For really low alpha values, you will likely end up sampling something close to (1,0,0), (0,1,0), or (0,0,1). This would mean that a document would essentially have only one topic.

At an alpha value equal to one, samples are uniformly distributed over the surface of the triangle. You are equally likely to end up with a sample favoring only one topic, a sample that gives an even mixture of all topics, or something in between.

For alpha values greater than one, samples start to congregate in the center of the triangle. This means that as alpha gets bigger, samples are more likely to be uniform, that is, to represent an even mixture of all topics.

Alpha controls the mixture of topics for any given document. Turn it down and documents will likely have less of a mixture of topics. Turn it up and documents will have more of a mixture of topics.

The beta hyperparameter controls the distribution of words per topic. Turn it down and topics will likely have fewer words. Turn it up and topics will likely have more words.

Ideally, each composite should be made up of only a few topics, and each part should belong to only some of the topics. For this, alpha and beta should both be set below 1.
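
The effect of the concentration parameter can be checked numerically. The sketch below assumes a symmetric Dirichlet over three topics (the same value for every topic) and draws samples by normalizing Gamma draws. The helper names are illustrative. It reports the average weight of the largest component: values near 1 mean samples sit in a corner (one dominant topic), and values near 1/3 mean an even mixture.

```python
import random

def sample_dirichlet(rng, alpha, n_topics=3):
    """Draw one symmetric Dirichlet sample by normalizing Gamma draws."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
        total = sum(draws)
        if total > 0:  # guard against numerical underflow at small alpha
            return [d / total for d in draws]

def mean_largest_weight(alpha, n_samples=2000, seed=0):
    """Average weight of the dominant topic over many samples."""
    rng = random.Random(seed)
    largest = (max(sample_dirichlet(rng, alpha)) for _ in range(n_samples))
    return sum(largest) / n_samples

for alpha in (0.1, 1.0, 10.0):
    print(f"alpha={alpha}: mean largest weight = {mean_largest_weight(alpha):.2f}")
```

The same reasoning applies to beta, which plays the role of the concentration parameter for the distribution of parts within each topic.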

## Why is LDA better?

If you view the number of topics as the number of clusters, and the probabilities as proportions of cluster membership, then using LDA is a way of soft clustering your composites and parts.

In contrast, in K-Means each entity can belong to only one cluster (hard clustering). LDA allows "fuzzy" membership. This provides a more nuanced way of recommending similar items, finding duplicates, or discovering user profiles.