# scikit-learn
- Machine learning toolkit  
- Built on NumPy, SciPy, matplotlib
> [Main Table of Contents](../../../README.md)

## In This Notebook
- Stop words
- Bag of Words
	- Commonly used parameters in Vectorizer init
	- CountVectorizer
	- TfidfVectorizer
- Models
	- Model Selection
		- Train Test Splitter
	- Linear Regression
	- Logistic Regression


## Stop words

In [1]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
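
As a minimal sketch of how the imported set might be used: `ENGLISH_STOP_WORDS` is a frozenset of lowercase words, so filtering is a plain membership test. The token list below is a made-up example, not data from this notebook.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical, already-lowercased token list used only for illustration
tokens = ["the", "quick", "brown", "fox", "and", "the", "dog"]

# Keep only the tokens that are not in the built-in stop word set
content_tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
print(content_tokens)
```

The set is lowercase only, so text should be lowercased before the membership test.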

## Bag of Words
Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents with variable length.  

In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:  

- tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.

- counting the occurrences of tokens in each document.

- normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.  

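Vectorization is the general process of turning a collection of text documents into numerical feature vectors. The specific strategy above (tokenization, counting and normalization) is called the Bag of Words representation: documents are described by word occurrences, while the relative position of the words in the document is ignored.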
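
The first two steps (tokenizing and counting) can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's implementation; the token rule (lowercased runs of two or more word characters) is an assumption chosen to resemble the vectorizers' default behaviour, and `tokenize` and `count_vector` are hypothetical helper names.

```python
import re
from collections import Counter

docs = ["This is the first document.",
        "This document is the second document."]

# Assumed token rule: runs of 2+ word characters, after lowercasing
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

# Step 1: give every distinct token an integer id (sorted for stable ids)
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

# Step 2: count token occurrences per document -> fixed-size vectors
def count_vector(text):
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]

vectors = [count_vector(doc) for doc in docs]
print(vocabulary)
print(vectors)
```

Every document maps to a vector of the same length (the vocabulary size), regardless of how long the original text was.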

### Commonly used parameters in vectorizer classes

Parameter | Description
--- | ---
token_pattern=regex | Regex that defines what counts as a token
ngram_range=(int, int) | Lower and upper bound (inclusive) of the n-gram sizes to extract<br>e.g. unigrams and bigrams<br>ngram_range=(1, 2)
max_features=int | Keep only the top max_features terms, ordered by term frequency across the corpus (columns in df)
max_df=int\|float | Ignore terms that have a document frequency higher than the given threshold<br>If float in range [0.0, 1.0], a proportion of documents; if int, an absolute document count
min_df=int\|float | Ignore terms that have a document frequency lower than the given threshold<br>AKA 'cut-off' in the literature<br>If float in range [0.0, 1.0], a proportion of documents; if int, an absolute document count
stop_words='english'<br>stop_words=ENGLISH_STOP_WORDS | Filter out common and uninformative words

### CountVectorizer
- Tokenization and occurrence counting
- Convert a collection of text documents to a matrix of token counts

TODO: Add commonly used parameters here

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
# import pandas as pd

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = CountVectorizer()
# Learn a vocabulary dictionary of all tokens in the raw documents
trained = vectorizer.fit(corpus)
# Transform documents to document-term matrix
transformed = vectorizer.transform(corpus)
arr = transformed.toarray()
# Get list of feature names (can be used as col names in a pd.df)
names = vectorizer.get_feature_names_out()

# Alt convenience 2-in-1 method for above
alt = vectorizer.fit_transform(corpus)
# pd.DataFrame(alt.toarray(), columns=vectorizer.get_feature_names_out())
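
As a hedged follow-up sketch: once fitted, the vectorizer can transform documents it has not seen before. Words that were not in the training corpus are simply ignored. The new sentence below is a made-up example.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Column index assigned to a given token
doc_col = vectorizer.vocabulary_["document"]

# Hypothetical unseen document: 'completely' and 'new' are not in the
# vocabulary, so only 'second' and 'document' are counted
new_counts = vectorizer.transform(["A completely new second document."]).toarray()
print(X.shape, doc_col)
print(new_counts)
```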

### TfidfVectorizer
- Convert a collection of raw documents to a matrix of TF-IDF features 
- tf-idf(t, d) = tf(t, d) x idf(t) 

In a large text corpus, some words are very common (e.g. “the”, “a”, “is” in English) and therefore carry very little meaningful information about the actual contents of a document. If we fed the raw count data directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms.  

To re-weight the count features into floating point values suitable for use by a classifier, it is very common to use the tf–idf transform.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = TfidfVectorizer()
# Learn a vocabulary dictionary of all tokens in the raw documents
trained = vectorizer.fit(corpus)
# Transform documents to document-term matrix
transformed = vectorizer.transform(corpus)
arr = transformed.toarray()
# Get list of feature names (can be used as col names in a pd.df)
names = vectorizer.get_feature_names_out()

# Alt convenience 2-in-1 method for above
alt = vectorizer.fit_transform(corpus).toarray()
print(alt)

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
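
The first row of this output can be reproduced by hand. The sketch below assumes the default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row is scaled to unit Euclidean length. The counts and document frequencies are read off the four-document corpus above.

```python
import math

# Token counts for 'This is the first document.'
counts = {"document": 1, "first": 1, "is": 1, "the": 1, "this": 1}
# Number of documents (out of 4) that contain each term
doc_freq = {"document": 3, "first": 2, "is": 4, "the": 4, "this": 4}
n_docs = 4

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# tf-idf(t, d) = tf(t, d) x idf(t), then L2-normalize the row
raw = {t: counts[t] * idf[t] for t in counts}
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {t: v / norm for t, v in raw.items()}

for term, weight in tfidf.items():
    print(f"{term}: {weight:.8f}")
```

Terms that appear in every document ("is", "the", "this") end up with the lowest weight, while the rarer "first" gets the highest.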


## Models

### Model Selection
- Train Test Splitter

#### Train Test Splitter
- sklearn.model_selection.train_test_split()
- Randomly splits the given datasets (lists, numpy arrays, pd.df, scipy-sparse matrices) into train and test subsets, returned in the order `X_train, X_test, y_train, y_test`, which can then be used in models' `fit`, `transform`, `predict`, etc. methods

Parameter | Description
--- | ---
test_size=int\|float | If float in range [0.0, 1.0], proportion of the dataset to put in the test split<br>If int, absolute # of test samples
train_size=int\|float | If float in range [0.0, 1.0], proportion of the dataset to put in the train split<br>If int, absolute # of train samples
random_state=int | Seed for the shuffling applied before the split; pass an int for reproducible output
shuffle=bool | Shuffle data before splitting
stratify=array-like | Data is split in a stratified fashion, using the input as class labels
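
A minimal sketch of a stratified split. The ten-sample dataset is invented for illustration; the point is the order of the returned values and the effect of `test_size` and `stratify`.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, two balanced classes
X = [[i] for i in range(10)]
y = [0] * 5 + [1] * 5

# Note the return order: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(len(X_train), len(X_test))
print(sorted(y_test))
```

With `stratify=y`, the class proportions of `y` are preserved in both subsets, so the two test samples contain one sample of each class.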

### Linear Regression
- Best fit line
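
As a rough sketch, fitting `LinearRegression` to points that lie exactly on a line recovers the slope and intercept of that line. The data below (y = 2x + 1) is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data on the line y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1

model = LinearRegression().fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_
pred = model.predict([[4.0]])
print(slope, intercept, pred)
```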

### Logistic Regression
- Best fit sigmoid function
- Typically used in classification problems (discrete categories)
	- Classification is a form of pattern recognition
		- Similar number sequences, words or sentiments
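
A small classification sketch with `LogisticRegression`, assuming a made-up one-feature dataset in which low values belong to class 0 and high values to class 1. `predict` returns the discrete class, while `predict_proba` exposes the sigmoid-based class probabilities.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, two well-separated classes
X = [[0], [1], [2], [3], [7], [8], [9], [10]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression().fit(X, y)

labels = model.predict([[1.5], [8.5]])        # discrete class per sample
probs = model.predict_proba([[1.5], [8.5]])   # columns: P(class 0), P(class 1)
print(labels)
print(probs)
```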