# scikit-learn
- Machine learning toolkit  
- Built on NumPy, SciPy, matplotlib
> [Main Table of Contents](../../../README.md)

## In This Notebook
- Stop words
- Bag of Words
	- Limitations of BOW, TF-IDF, and cosine similarity
	- Commonly used parameters in Vectorizer init
	- CountVectorizer
	- TfidfVectorizer
- Models
	- Model Selection
		- Train Test Splitter
	- Model Accuracy
	- Linear Regression
	- Logistic Regression
	- Naive Bayes MultinomialNB
- Cosine Similarity



## Stop words

In [81]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
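
As a minimal sketch of how the imported set might be used: `ENGLISH_STOP_WORDS` is a frozenset of lowercase strings, so it supports fast membership tests. The sample sentence and the whitespace split are illustrative assumptions, not part of the original notebook.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical sample sentence, split on whitespace for simplicity
tokens = "the movie was surprisingly boring".split()

# Keep only the tokens that are not in the built-in stop word set
kept = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
print(kept)
```

The same set can be passed to a vectorizer through `stop_words`, as shown in the parameter table further down.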

## Bag of Words
Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.  

In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:  

- tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.

- counting the occurrences of tokens in each document.

- normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.  

- BOW is part of the text preprocessing step
- BOW shortcomings: it ignores word order and context. Setting `ngram_range` to include bigrams or longer n-grams recovers some local context, but the gain is often marginal and may not justify the added cost in time and dimensionality.
	- e.g. "The movie was good and not boring" and "The movie was not good and boring" have the same BOW but opposite sentiment.  
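
The shortcoming above can be checked directly. This is a small sketch using the two example sentences from the bullet; the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The movie was good and not boring",
        "The movie was not good and boring"]

# Unigrams only: word order is lost, so both sentences get the same counts
unigram = CountVectorizer().fit_transform(docs).toarray()
# Unigrams + bigrams: pairs such as "not boring" vs "not good" now differ
bigram = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs).toarray()

same_unigram = bool((unigram[0] == unigram[1]).all())
same_bigram = bool((bigram[0] == bigram[1]).all())
print(same_unigram, same_bigram)  # True False
print(unigram.shape[1], bigram.shape[1])  # number of columns in each matrix
```

The second print shows the trade-off: the bigram matrix separates the two sentences but has more columns.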

Vectorization is the general process of turning a collection of text documents into numerical feature vectors.



### Limitations of BOW, TF-IDF, and cosine similarity
- Doesn't capture complex relationships, synonyms and antonyms
	- Capturing these requires word embeddings, which are typically learned by deep learning models from very large amounts of training data
	- [GH question, can I look up synonyms with spacy](https://github.com/explosion/spaCy/issues/276)

In [82]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
corpus = ['I am happy', 'I am joyous', 'I am sad']
corpus_two = ['happy', 'joyous', 'sad']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)
sim = cosine_similarity(matrix)
print(sim)
vectorizer_two = TfidfVectorizer()
matrix_two = vectorizer_two.fit_transform(corpus_two)
sim_two = cosine_similarity(matrix_two)
print(sim_two)

[[1.         0.25861529 0.25861529]
 [0.25861529 1.         0.25861529]
 [0.25861529 0.25861529 1.        ]]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


## Commonly used parameters in vectorizer classes

Parameter | Description
--- | ---
token_pattern=regex | Regex that defines what counts as a token
tokenizer | Custom callable<br>Overrides the string tokenization step while preserving the preprocessing and n-gram generation steps
lowercase=bool | Convert all characters to lowercase before tokenizing
strip_accents='ascii', 'unicode', None | Remove accents and perform other character normalizations
stop_words='english', list, None | Remove common, uninformative words<br>e.g. stop_words=ENGLISH_STOP_WORDS
ngram_range=(int, int) | Lower and upper bounds (inclusive) of the n-gram sizes to extract<br>e.g. unigrams and bigrams<br>ngram_range=(1, 2)
max_features=int | Keep only the top max_features terms, ordered by frequency across the corpus (caps the # of columns in the df)
max_df=int\|float | Ignore terms whose document frequency is higher than the given threshold<br>A float in [0.0, 1.0] is a proportion of documents; an int is an absolute document count
min_df=int\|float | Ignore terms whose document frequency is lower than the given threshold<br>AKA 'cut-off' in the literature<br>A float in [0.0, 1.0] is a proportion of documents; an int is an absolute document count

### CountVectorizer
- Tokenization and occurrence counting of every token
- Convert a collection of text documents to a matrix of token counts
- Use CountVectorizer to preprocess data

In [83]:
from sklearn.feature_extraction.text import CountVectorizer
# import pandas as pd

# Every token in this corpus is a column, this is why text pre-processing is important, to reduce the number of dimensions by combining similar words together and eliminating unimportant words.  Most of the dimensions (columns) have value 0 since most words don't occur in a particular sentence
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = CountVectorizer()
# Learn a vocabulary dictionary of all tokens in the raw documents
trained = vectorizer.fit(corpus)
# Transform documents to document-term matrix
transformed = vectorizer.transform(corpus) 
arr = transformed.toarray()  # use to build pd.df
# Get list of feature names (can be used as col names in a pd.df)
names = vectorizer.get_feature_names_out()

# Alt convenience 2-in-1 method for above
# when using train/test sets only transform the test set, do not fit
alt = vectorizer.fit_transform(corpus)  
# Map the column names to the vocabulary
# pd.DataFrame(alt.toarray(), columns=vectorizer.get_feature_names_out())

### TfidfVectorizer
- Convert a collection of raw documents to a matrix of TF-IDF features 
- tf-idf(t, d) = tf(t, d) x idf(t) 
- The higher the tf-idf weight, the more important the word is in characterizing a document
	- May imply the word is highly exclusive to that document
- Applications:
	- Automatically detecting stop words: tf-idf highlights the words that characterize a particular document, so words that never score highly are stop word candidates
	- Sometimes better performance in predictive modeling
	- Combining TfidfVectorizer with the `cosine_similarity` or `linear_kernel` functions is useful in recommendation systems
	- e.g. If a person liked the movie "Godfather", they might like other movies with similar plot lines

In a large text corpus, some words will be very frequent (e.g. “the”, “a”, “is” in English) and therefore carry very little meaningful information about the actual contents of the document. If we were to feed the raw count data directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms.  

In order to re-weight the count features into floating point values suitable for use by a classifier, it is very common to use the tf–idf transform.
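
To make the re-weighting concrete, here is a sketch that rebuilds the weights by hand and compares them with `TfidfVectorizer`. It assumes the vectorizer's default settings as I understand them (smoothed idf, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row); other settings would need a different formula.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# tf(t, d): raw token counts per document
counts = CountVectorizer().fit_transform(corpus).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)             # df(t): documents containing t
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf (assumed default)

tfidf = counts * idf                                           # tf(t, d) x idf(t)
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)   # L2-normalize rows

expected = TfidfVectorizer().fit_transform(corpus).toarray()
matches = bool(np.allclose(tfidf, expected))
print(matches)  # True
```

Terms that occur in every document ("is", "the", "this") get the minimum idf of 1, so after normalization they carry less weight than rarer terms such as "second".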

In [84]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = TfidfVectorizer()
# Learn a vocabulary dictionary of all tokens in the raw documents
trained = vectorizer.fit(corpus)
# Transform documents to document-term matrix
transformed = vectorizer.transform(corpus)
arr = transformed.toarray()  # use to build pd.df
# Get list of feature names (can be used as col names in a pd.df)
names = vectorizer.get_feature_names_out()

# Alt convenience 2-in-1 method for above
# when using train/test sets only transform the test set, do not fit
alt = vectorizer.fit_transform(corpus).toarray()
print(alt)

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]


## Models

### Model Selection
- Train Test Splitter

### Model Accuracy
- Use `sklearn.metrics.accuracy_score` instead of the built-in model `score` methods, whose meaning depends on the estimator (mean accuracy for classifiers, R² for regressors)
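
A minimal sketch of `accuracy_score` on hand-written labels. The label lists are made up for illustration.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions: 4 of 5 match
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)                       # fraction correct
n_correct = accuracy_score(y_true, y_pred, normalize=False)  # count correct
print(acc, n_correct)
```

Because the function only needs the two label sequences, it works the same way for any classifier or pipeline.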

### Train Test Splitter
- `sklearn.model_selection.train_test_split()`
- Randomly splits a given dataset (lists, NumPy arrays, pandas DataFrames, SciPy sparse matrices) into train and test subsets. Passing `X` and `y` returns `X_train, X_test, y_train, y_test` in that order, which can then be used in models' `fit`, `transform`, `predict`, etc. methods

	Parameter | Description
	--- | ---
	test_size=int\|float | If float in [0.0, 1.0], the proportion of the dataset to put in the test split<br>If int, the absolute # of test samples
	train_size=int\|float | If float in [0.0, 1.0], the proportion of the dataset to put in the train split<br>If int, the absolute # of train samples
	random_state=int | Seed for the shuffle; pass an int for reproducible output
	shuffle=bool | Shuffle data before splitting
	stratify=array-like | Split in a stratified fashion, using this array as the class labels
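
A small sketch of these parameters in use, on a made-up balanced dataset of 10 samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features, two balanced classes
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% of the samples go to the test split
    random_state=42,   # fixed seed for a reproducible shuffle
    stratify=y,        # preserve the class balance in both splits
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(sorted(y_test.tolist()))      # [0, 1]
```

With `stratify=y`, the two test samples contain one example of each class instead of whatever the shuffle happens to pick.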

### Linear Regression
- Best fit line
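
A minimal sketch of fitting a best-fit line with `LinearRegression`. The data is synthetic and lies exactly on y = 2x + 1, so the recovered slope and intercept are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic points on the line y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one feature per row
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_
pred = model.predict(np.array([[5.0]]))
print(slope, intercept, pred)
```

On noisy data the fitted line minimizes the sum of squared residuals rather than passing through every point.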

### Logistic Regression
 - Best fit sigmoid function
 - Typically used in classification problems (discrete categories)
	- Classification is a form of pattern recognition
		- Similar number sequences, words or sentiments
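
A sketch of the "best fit sigmoid" idea on a made-up one-feature dataset. The last lines recompute the class-1 probability by hand from the fitted coefficient and intercept, on the assumption that the binary model applies a sigmoid to the linear score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: small values belong to class 0, large values to class 1
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

X_new = np.array([[1.0], [4.0]])
labels = model.predict(X_new)              # discrete class labels
probs = model.predict_proba(X_new)[:, 1]   # probability of class 1

# Sigmoid of the linear score w*x + b, computed by hand
z = X_new @ model.coef_.T + model.intercept_
manual = (1.0 / (1.0 + np.exp(-z))).ravel()
print(labels, probs, manual)
```

`predict` simply thresholds the class-1 probability at 0.5, which is how the continuous sigmoid output becomes a discrete category.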

### Naive Bayes MultinomialNB
- Naive Bayes *classifier* for multinomial models.
- The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work

In [11]:
# Build a text classifier EXAMPLE

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
# note TfidfTransformer not TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer  
from sklearn.model_selection import train_test_split
df = pd.read_csv('../../../data/customer_call_transcriptions.csv', header=0)

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.3)
# Create text classifier pipeline
text_classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
# Fit the classifier pipeline on the training data
text_classifier.fit(X_train, y_train)
# Make predictions and compare them to test labels
predictions = text_classifier.predict(X_test)
accuracy = 100 * np.mean(predictions == y_test)
print(f'The model is {accuracy:.2f}% accurate')

The model is 100.00% accurate


## Cosine Similarity
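- Value lies between -1 and 1 in general; for non-negative vectors such as token counts or TF-IDF weights it lies between 0 and 1
- Combining TfidfVectorizer with the `cosine_similarity` or `linear_kernel` functions is useful in recommendation systems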
	- e.g. If a person liked the movie "Godfather" then might like other movies with similar plot lines

In [85]:
from sklearn.metrics.pairwise import cosine_similarity
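
A toy sketch of the recommendation idea. The three "plot" strings are invented for illustration; a real system would use actual plot summaries. It also checks that `linear_kernel` gives the same result as `cosine_similarity` here, since TfidfVectorizer L2-normalizes each row by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical plot keywords for three movies
plots = ['mafia family crime boss',
         'crime family mafia war',
         'space robots explore planet']

tfidf = TfidfVectorizer().fit_transform(plots)
sim = cosine_similarity(tfidf)   # pairwise similarity, shape (3, 3)
lin = linear_kernel(tfidf)       # plain dot products of the rows

# Manual check: cos(a, b) = a.b / (|a| * |b|)
dense = tfidf.toarray()
norms = np.linalg.norm(dense, axis=1)
manual = (dense @ dense.T) / np.outer(norms, norms)

# Recommend the movie most similar to movie 0, excluding itself
liked = 0
scores = sim[liked].copy()
scores[liked] = -1.0
best = int(np.argmax(scores))
print(best)  # 1
```

On unit-length rows the dot product already equals the cosine, so `linear_kernel` skips the normalization step that `cosine_similarity` performs.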