In [1]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Import CountVectorizer and TfidfVectorizer from sklearn.feature_extraction.text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Source material attributed to General Assembly course notes and online documentation centers

# Count Vectorizer Section

## Pre-Processing

Let's review some of the pre-processing steps for text data:

- Remove special characters
- Tokenizing
- Lemmatizing/Stemming
- Stop word removal

`CountVectorizer` actually can do a lot of this for us! It is important to keep these steps in mind in case you want to change the default methods used for each of these. Full documentation may be found here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
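
As a small sketch of what the defaults cover, the example below fits two vectorizers on a made-up sentence: one with default settings (lowercasing, punctuation removal, tokenizing) and one that also removes scikit-learn's built-in English stop words. The sentence and variable names are illustrative assumptions. Lemmatizing and stemming are not built in; they would need a custom `tokenizer` or `preprocessor` function.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy document, made up for illustration.
docs = ["The cats are RUNNING, and the dog sleeps!"]

# Defaults: lowercase the text, strip punctuation, keep tokens of 2+ characters.
default_cvec = CountVectorizer()
default_cvec.fit(docs)
default_tokens = sorted(default_cvec.vocabulary_)

# Same as above, but also drop scikit-learn's built-in English stop words.
stop_cvec = CountVectorizer(stop_words="english")
stop_cvec.fit(docs)
stop_tokens = sorted(stop_cvec.vocabulary_)

print("default:", default_tokens)
print("stop words removed:", stop_tokens)
```

Comparing the two printed vocabularies shows which pre-processing steps happen automatically and which have to be requested.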

## `CountVectorizer`
---

The easiest way for us to convert text data into a structured, numeric feature matrix `X` is to use `CountVectorizer`. Its output is a sparse matrix, which can be converted to a dataframe for inspection.

- **Count**: Count up how many times a token is observed in a given document.
- **Vectorizer**: Create a column (also known as a vector) that stores those counts.

![Count Vectorizer In Action](../images_FunctionDocumentation/countvectorizer2.png)

In [2]:
# Instantiate a CountVectorizer.
cvec = CountVectorizer()

### Transformation step: applying `CountVectorizer` to the chosen features of the training and testing data

![Count Vectorizer during the Transformation](../images_FunctionDocumentation/countvectorizer.png)

[Source](https://towardsdatascience.com/nlp-learning-series-part-2-conventional-methods-for-text-classification-40f2839dd061).
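
A minimal sketch of the transformation step, assuming made-up training and testing documents: the vocabulary is learned from the training data only (`fit_transform`), and the testing data is then mapped onto that same vocabulary (`transform`). Words that appear only in the testing data are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, made up for illustration.
train_docs = ["free prize now", "meeting on monday"]
test_docs = ["free meeting with lunch"]  # "with" and "lunch" are unseen in training

cvec = CountVectorizer()

# Learn the vocabulary from the training data and transform it in one step.
X_train_cv = cvec.fit_transform(train_docs)

# Reuse the learned vocabulary on the testing data; do NOT refit here.
X_test_cv = cvec.transform(test_docs)

print(X_train_cv.shape, X_test_cv.shape)
print(sorted(cvec.vocabulary_))
```

Both matrices have the same number of columns, which is what allows a model trained on one to make predictions on the other.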

### Fitting step: training the chosen model on the `CountVectorizer` output

#### Please Note: The parameters of both `CountVectorizer` and the chosen model also need to be optimized!

### Then, verify accuracy scores and other classification metrics. 
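
The fit-and-verify steps might look like the sketch below. The eight labeled messages are invented for illustration, and `MultinomialNB` is only one possible model choice. With so little data the scores themselves are not meaningful; the point is the sequence of calls.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy labeled data, made up for illustration (1 = spam, 0 = not spam).
texts = [
    "win a free prize now", "free cash offer click now",
    "claim your free reward today", "limited offer win cash",
    "meeting moved to monday morning", "please review the attached report",
    "lunch with the team tomorrow", "notes from the project meeting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Transform: fit the vocabulary on training data, reuse it on testing data.
cvec = CountVectorizer()
X_train_cv = cvec.fit_transform(X_train)
X_test_cv = cvec.transform(X_test)

# Fit the model on the vectorized training data.
nb = MultinomialNB()
nb.fit(X_train_cv, y_train)

# Assess: accuracy plus a confusion matrix (rows = actual, columns = predicted).
preds = nb.predict(X_test_cv)
acc = accuracy_score(y_test, preds)
cm = confusion_matrix(y_test, preds, labels=[0, 1])
print("accuracy:", acc)
print(cm)
```

The confusion matrix can also be passed to `ConfusionMatrixDisplay` for plotting.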

## These are the steps that MUST be followed for a general Natural Language Processing model to be successful. Next, `TfidfVectorizer` is introduced.

#### Before moving on, note that `GridSearchCV` should be used with the chosen model to search for the best parameters given the input data. The documentation for this function may be found here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
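
One common way to tune the vectorizer and the model together is to place both in a `Pipeline` and pass it to `GridSearchCV`. The sketch below is an assumption-laden illustration: the data is made up, the step names `cvec` and `nb` are arbitrary, and the grid values are placeholders rather than recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy labeled data, made up for illustration (1 = spam, 0 = not spam).
texts = [
    "win a free prize now", "free cash offer click now",
    "claim your free reward today", "limited offer win cash",
    "meeting moved to monday morning", "please review the attached report",
    "lunch with the team tomorrow", "notes from the project meeting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline refits the vectorizer inside each cross-validation fold.
pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Parameter names are "<step name>__<parameter name>".
param_grid = {
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "cvec__stop_words": [None, "english"],
    "nb__alpha": [0.1, 1.0],
}

gs = GridSearchCV(pipe, param_grid, cv=2)  # cv=2 only because the data is tiny
gs.fit(texts, labels)

print(gs.best_params_)
print(gs.best_score_)
```

Because the vectorizer sits inside the pipeline, its vocabulary is learned only from the training portion of each fold, which avoids leaking information from the validation portion.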


So far, `CountVectorizer` has been explored as a way to transform text data into a numeric form that can be passed to a model.

But what if something more than just counting the occurrences of each token is required?

# Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer Section

---

When modeling, which words tend to be the most helpful?
- Words that are common across all documents.
- Words that are rare across all documents.
- Words that are rare across some documents, and common across some documents.

The answer to this question is that words that are common in certain documents but rare in other documents tend to be more informative than words that are common in all documents or rare in all documents. For example: If one were to examine poetry over time, the word "thine" might be common in some documents but rare in most documents. The word "thine" is probably pretty informative in this case.

TF-IDF is a score that tells us which words are important to one document, relative to all other documents. Words that occur often in one document but don't occur in many documents contain more predictive power. Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

***Breaking it down:***

*Term Frequency* is the number of times a word appears in the document (same as values in CountVectorizer).

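*Document Frequency* is the number of documents that a particular word appears in.

*Inverse Document Frequency* down-weights words that appear in many documents. With scikit-learn's default settings (`smooth_idf=True`), it is computed with the natural logarithm as:

$$ \text{idf}(t) = \log \frac{1+n}{1+df(t)}  + 1 $$

The TF-IDF score of a term in a document is its term frequency multiplied by this value.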

> where $n$ is the total number of documents in the document set, and *df(t)* is the number of documents in the document set that contain term *t*. The resulting tf-idf vectors are then normalized by the Euclidean norm... - the [docs](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting)

Variations of the TF-IDF score are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

![Tfidf Vectorizer In Action](../images_FunctionDocumentation/tfidfvectorizer.png)

[Source](https://towardsdatascience.com/nlp-learning-series-part-2-conventional-methods-for-text-classification-40f2839dd061).

### From there, the process is the same as for `CountVectorizer`: there is a transformation, a fit, and an assessment of model performance.
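
A compact sketch of that same transform, fit, and assess sequence with `TfidfVectorizer`, here wrapped in a `Pipeline` with `LogisticRegression`. The data is made up, and the model choice is only an example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy labeled data, made up for illustration (1 = spam, 0 = not spam).
texts = [
    "win a free prize now", "free cash offer click now",
    "claim your free reward today", "limited offer win cash",
    "meeting moved to monday morning", "please review the attached report",
    "lunch with the team tomorrow", "notes from the project meeting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

tfidf_pipe = Pipeline([
    ("tvec", TfidfVectorizer()),       # transformation
    ("logreg", LogisticRegression()),  # model
])

tfidf_pipe.fit(X_train, y_train)                 # fit on training data only
train_acc = tfidf_pipe.score(X_train, y_train)   # assess on training data
test_acc = tfidf_pipe.score(X_test, y_test)      # assess on held-out data
print(train_acc, test_acc)
```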

#### Please Note: The parameters of both `CountVectorizer` and `TfidfVectorizer` also need to be optimized.
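
One possible way to tune both vectorizers in a single search is to treat the vectorizer step itself as a grid parameter. This is a sketch under assumptions: the data is made up, the step names `vec` and `clf` are arbitrary, and the grid values are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy labeled data, made up for illustration (1 = spam, 0 = not spam).
texts = [
    "win a free prize now", "free cash offer click now",
    "claim your free reward today", "limited offer win cash",
    "meeting moved to monday morning", "please review the attached report",
    "lunch with the team tomorrow", "notes from the project meeting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("vec", CountVectorizer()),  # placeholder; the grid swaps this step
    ("clf", LogisticRegression()),
])

# One dictionary per vectorizer, so each gets its own parameter options.
param_grid = [
    {
        "vec": [CountVectorizer()],
        "vec__ngram_range": [(1, 1), (1, 2)],
        "clf__C": [0.1, 1.0],
    },
    {
        "vec": [TfidfVectorizer()],
        "vec__ngram_range": [(1, 1), (1, 2)],
        "vec__use_idf": [True, False],
        "clf__C": [0.1, 1.0],
    },
]

gs = GridSearchCV(pipe, param_grid, cv=2)  # cv=2 only because the data is tiny
gs.fit(texts, labels)

best_vectorizer = type(gs.best_estimator_.named_steps["vec"]).__name__
print(best_vectorizer, gs.best_score_)
```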

# Logistic Regression Documentation: 
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


# Support Vector Machine Documentation: 
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html


# SMOTE Documentation: 

https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html
