# `sklearn` Explanations
- This notebook explains the various tools available in `sklearn` and some of the theory behind how they work

## Vectorisers
- **Vectoriser**
    - Converts an `iterable` of text documents into a matrix of real numbers, the **term matrix**, which can be used for further analysis such as clustering
    - e.g., `TfidfVectorizer` and `CountVectorizer`
    - **NOTE**: after producing a term matrix, the vectoriser stores the feature names (i.e. the terms it analysed). Use `get_feature_names_out()` to see the features
- **Term matrix**
    - Each column represents a unique feature (term) that appeared in the corpus
    - Each row represents one document in the corpus


---

### TF-IDF - Term frequency-Inverse Document Frequency
- Vectoriser that takes in documents and produces a matrix containing a weight for each word
- The weight reflects how *important* a word is to a *document* in a collection
- For each word, the weight is the product of the two quantities below:
    - **Term frequency**: how often the term occurs, calculated for *each* document
    $$\text{tf}(t, d) = \frac{\text{raw count of term }t\text{ in document }d}{\text{total number of terms in document }d}$$
    - **Inverse document frequency**: the number of documents divided by the number of documents the word occurred in, scaled logarithmically
    $$\text{idf}(t, D) = \log \frac{\text{total number of documents in collection }D}{\text{number of documents that term }t \text{ occurred in}}$$
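
The two formulas above can be sketched directly in plain Python. This is an illustration of the textbook definition only: the helper names `tf` and `idf` and the two-document corpus are hypothetical, tokenisation is a plain whitespace split, and `sklearn`'s default settings apply smoothing and normalisation, so its numbers will generally differ from these.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Raw count of `term` divided by the total number of terms in the document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, all_docs_tokens):
    """Log of (number of documents / number of documents containing `term`).

    Assumes the term occurs in at least one document; otherwise this divides by zero.
    """
    n_containing = sum(1 for tokens in all_docs_tokens if term in tokens)
    return math.log(len(all_docs_tokens) / n_containing)


# Hypothetical corpus, tokenised by whitespace for simplicity
docs = [d.split() for d in ["the cat sat", "the dog barked loudly"]]

w_cat = tf("cat", docs[0]) * idf("cat", docs)  # "cat" occurs in 1 of 2 documents
w_the = tf("the", docs[0]) * idf("the", docs)  # "the" occurs in both, so idf = log(1) = 0

print(w_cat, w_the)
```

Under this definition a word that appears in every document gets a weight of exactly zero, which is the intended effect of the IDF term.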

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['the mitochondria is the power house of the cell', 'the factory of the sky is old']
tfidf_vect = TfidfVectorizer()
tfidf_matrix = tfidf_vect.fit_transform(corpus)

In [4]:
print(tfidf_vect.get_feature_names_out())

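# Words appearing in only one document, such as "cell", "mitochondria" and
# "factory", have a higher weight than "is" and "of", which appear in both
# Note that "the" still has a high weight: it occurs 2-3 times per document, and in
# a corpus this small the IDF of the rarer words is not large enough to outweigh that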
print(tfidf_matrix.toarray())

['cell' 'factory' 'house' 'is' 'mitochondria' 'of' 'old' 'power' 'sky'
 'the']
[[0.32327633 0.         0.32327633 0.23001377 0.32327633 0.23001377
  0.         0.32327633 0.         0.6900413 ]
 [0.         0.40697968 0.         0.2895694  0.         0.2895694
  0.40697968 0.         0.40697968 0.57913879]]
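
The numbers above do not follow the textbook formula exactly (under it, "the" would get a weight of zero). With its default settings, `TfidfVectorizer` uses raw counts as the term frequency, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit Euclidean length. The sketch below reproduces the matrix by hand under those defaults; it is meant as a check of the arithmetic, not as a replacement for the vectoriser.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['the mitochondria is the power house of the cell',
          'the factory of the sky is old']

# Raw term counts per document (rows) and term (columns)
counts = CountVectorizer().fit_transform(corpus).toarray().astype(float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)              # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (smooth_idf=True default)

weights = counts * idf
manual = weights / np.linalg.norm(weights, axis=1, keepdims=True)  # L2-normalise rows

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
```

Because the smoothed IDF never drops below 1, a word that appears in every document keeps a non-zero weight, which is why "the" scores highest here.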


---

### Count Vectoriser (Bag-of-Words)
- Vectoriser that counts up and returns the frequency of each term in a matrix

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
 
corpus = ['Cats and dogs are not allowed', 'Cats and dogs are antagonistic dogs']
count_vect = CountVectorizer()
bag_of_words = count_vect.fit_transform(corpus)

In [2]:
print(count_vect.get_feature_names_out())

# There are 2 rows (one per document in our corpus)
# Each row has 7 entries, one per unique word across all documents
print(bag_of_words.toarray())

['allowed' 'and' 'antagonistic' 'are' 'cats' 'dogs' 'not']
[[1 1 0 1 1 1 1]
 [0 1 1 1 1 2 0]]


## Clusters

### Optimal Number of Clusters
- `KMeans` provides the `inertia_` attribute: the sum of squared distances from each sample to its cluster's centre, i.e. the **SSE**
    - SSE measures how tightly the model fits the data; a low SSE means the samples lie close to their cluster centres
    - SSE keeps falling as clusters are added, so the optimal cluster count is where the SSE starts to level off (pushing it lower risks overfitting)
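
To make the definition concrete, the sketch below recomputes the SSE by hand and compares it with `inertia_`. The four 2-D points are made up for illustration, and `n_init=10` and `random_state=0` are arbitrary choices, not recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two tight pairs of points, far apart from each other
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Centre assigned to each sample, then the sum of squared distances to it
assigned_centres = km.cluster_centers_[km.labels_]
manual_sse = ((X - assigned_centres) ** 2).sum()

print(manual_sse, km.inertia_)

# With one cluster per sample the SSE collapses to (almost) zero,
# which is why the lowest SSE is not automatically the best model
sse_overfit = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X).inertia_
print(sse_overfit)
```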

In [None]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans


def find_optimal_clusters(data, max_k):
    """
    Fit KMeans for every even cluster count from 2 to max_k and plot the SSE of each fit
    """
    k_vals = range(2, max_k + 1, 2)
    
    sse = []
    for k in k_vals:
        sse.append(
            KMeans(n_clusters=k, random_state=520).fit(data).inertia_
            
            #MiniBatchKMeans(
            #    n_clusters=k, init_size=1024, batch_size=2040, random_state=20
            #).fit(data).inertia_
        )
        print(f"Fitted {k} clusters!")
    
    
    # Plot the graph
    f, ax = plt.subplots(1, 1)
    ax.plot(k_vals, sse, marker='o')
    ax.set_xlabel('Cluster Centres')
    ax.set_xticks(k_vals)
    ax.set_xticklabels(k_vals)
    ax.set_ylabel('SSE')
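
The levelling-off behaviour can also be seen numerically, without a plot. The sketch below is an illustration under stated assumptions: the data is synthetic (`make_blobs` with three hand-picked, well-separated centres), and every cluster count from 1 to 6 is tried rather than only the even ones.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (centres chosen for illustration)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

k_vals = range(1, 7)
sse = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in k_vals
]

for k, s in zip(k_vals, sse):
    print(f"k={k}: SSE={s:.1f}")
```

On data like this the SSE should drop sharply up to `k=3` and then shrink only slightly, and that bend is the "elbow" to look for in the plot.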