## Credit

These notes are taken from section 1.4, "Statistical Approaches and Text Classification with N-grams", of the NLPlanet course Practical NLP with Python.
* https://www.nlplanet.org/course-practical-nlp/01-intro-to-nlp/04-n-grams

Authored by Fabio Chiusano
* https://medium.com/@chiusanofabio94

**All quotes '' are sourced from the NLPlanet course.**

**Expert System vs Statistical Approach**

<u>Expert System:</u>
* Manually building sets of rules in collaboration with experts in the field. Requires extensive trial and error.

<u>Statistical Approach:</u>
* Machine learning (ML): collect labeled examples (typically two groups, what IS wanted and what is NOT wanted) and let an algorithm derive the rules that distinguish the groups.
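
As a rough illustration of the expert-system side, the sketch below hard-codes keyword rules for telling programming texts from writing texts. The keyword set and the `rule_based_is_programming` name are invented for this example and do not come from the course; a statistical approach would instead learn such cues from labeled examples.

```python
# Hypothetical hand-written rules: a person decides which words signal "programming".
PROGRAMMING_KEYWORDS = {"programming", "machine", "logic"}

def rule_based_is_programming(text: str) -> bool:
    """Return True if any hand-picked keyword appears in the text."""
    words = {w.strip(".,!?") for w in text.lower().split()}
    return bool(words & PROGRAMMING_KEYWORDS)

print(rule_based_is_programming("Communicating with a machine."))   # True
print(rule_based_is_programming("Using a pen to convey knowledge."))  # False
```

Every new cue (for example "compiler") has to be added by hand, which is the trial and error the notes mention.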

**Text Classification with N-grams**

<u>Vectorization:</u>
* The process of converting text into a numerical vector that most machine learning models can understand.
* Can be done in multiple ways, the simplest being to count each word's occurrences.
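
A minimal sketch of count-based vectorization using only the standard library, assuming a deliberately naive whitespace tokenizer (the example sentence is made up for illustration):

```python
from collections import Counter

text = "the process of writing is the process of revising"
# Lowercase and split on whitespace (a deliberately naive tokenizer).
tokens = text.lower().split()
counts = Counter(tokens)

# Fix an ordering for the words so the counts form a vector.
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]

print(vocabulary)  # ['is', 'of', 'process', 'revising', 'the', 'writing']
print(vector)      # [1, 2, 2, 1, 2, 1]
```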
    
<u>N-Grams:</u>
* Made of N consecutive words. Serve as the tokens to be vectorized.
* Unigrams (1 word), bigrams (2 words), trigrams (3 words), etc.
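
To make the definition concrete, here is a small sketch that slides a window of size N over a token list. The helper name `make_ngrams` is hypothetical and is not part of any library used in these notes.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "programming is the process".split()
print(make_ngrams(tokens, 1))  # ['programming', 'is', 'the', 'process']
print(make_ngrams(tokens, 2))  # ['programming is', 'is the', 'the process']
print(make_ngrams(tokens, 3))  # ['programming is the', 'is the process']
```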
    
<u>Bag of Words:</u>
* A way of representing each text as a vector of token counts, ignoring word order.
* Can be used as input into ML models.
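
One consequence of a unigram bag of words is that word order is discarded. The sketch below, with a hypothetical `bag_of_words` helper and a made-up three-word vocabulary, shows two different sentences mapping to the same vector, which is part of the motivation for adding bigrams later.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["bites", "dog", "man"]
a = bag_of_words("dog bites man", vocabulary)
b = bag_of_words("man bites dog", vocabulary)

print(a, b)    # [1, 1, 1] [1, 1, 1]
print(a == b)  # True
```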

<u>Sparse Matrix:</u>
* A matrix in which most elements are zero.
    
<u>Dense Matrix:</u>
* A matrix where most elements are non-zero.
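
The difference matters for storage: a sparse format keeps only the non-zero cells. The sketch below uses a plain dictionary keyed by (row, column) as a stand-in. This is an assumption for illustration only; the SciPy sparse matrix that `CountVectorizer.transform()` returns uses a more compact layout.

```python
dense = [
    [1, 0, 0, 0, 2],
    [0, 0, 3, 0, 0],
]

# Sparse (dictionary-of-keys) form: keep only the non-zero cells.
sparse = {
    (r, c): value
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
}
print(sparse)  # {(0, 0): 1, (0, 4): 2, (1, 2): 3}

total_cells = len(dense) * len(dense[0])
print(f"{len(sparse)} of {total_cells} cells stored")  # 3 of 10 cells stored
```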

**Making a logistic regression model:**
* A model that estimates the probability of an outcome from a set of independent variables (here, the n-gram counts).
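
As a sketch of what such a model computes, the function below takes a weighted sum of the features and squashes it into a probability with the logistic (sigmoid) function. The weights and bias are hypothetical numbers chosen for illustration, not values learned from the dataset in these notes.

```python
import math

def predict_probability(features, weights, bias):
    """Logistic regression: squash a weighted sum into a probability."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for three features; positive weights push toward class 1.
weights = [0.8, -0.5, 0.0]
bias = 0.0

print(predict_probability([0, 0, 0], weights, bias))        # 0.5
print(predict_probability([2, 0, 1], weights, bias) > 0.5)  # True
print(predict_probability([0, 2, 1], weights, bias) < 0.5)  # True
```

Training consists of finding the weights and bias that make these probabilities agree with the labels as closely as possible.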

In [1]:
import pandas as pd
# Used to show tables with dataframes
from sklearn.feature_extraction.text import CountVectorizer
# CountVectorizer class is used to vectorize texts by counting the occurrences of each word
from sklearn.linear_model import LogisticRegression
# Class can be used for logistic regression machine learning tasks

In [4]:
# Creating a small dataset of texts (a realistic dataset would need hundreds or thousands of examples)

# Dataset
texts = [
    "Programming is the process of creating virtual logic systems by communicating with a machine.",
    "Writing is the process of using a writing device or utensil to convey knowledge or expression."
]

labels = [1, 0] 
# 1 = Programming
# 0 = Writing

# Fit vectorizer on texts
vectorizer = CountVectorizer(ngram_range=(1, 1))
# ngram_range=(min_n, max_n) considers a specified range of grams (unigrams in this line)
vectorizer.fit(texts)
# .fit() method learns the vocabulary (the set of n-grams) from the texts; the counting happens in .transform()

# Vectorize texts into bag of words
ngrams = vectorizer.transform(texts)
# .transform() method converts texts into a matrix of token counts
# ngrams is a sparse matrix
ngrams.todense()
# .todense() method converts a sparse matrix into a dense matrix

matrix([[1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1,
         0],
        [0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 2, 1, 0, 0, 1, 1, 1, 1, 0, 0,
         2]])

In [9]:
# Vocabulary dictionary from fitting
vectorizer.vocabulary_
# Contains a mapping of terms (ngrams) to their indices in the matrix

{'programming': 13,
 'is': 6,
 'the': 15,
 'process': 12,
 'of': 10,
 'creating': 3,
 'virtual': 19,
 'logic': 8,
 'systems': 14,
 'by': 0,
 'communicating': 1,
 'with': 20,
 'machine': 9,
 'writing': 21,
 'using': 17,
 'device': 4,
 'or': 11,
 'utensil': 18,
 'to': 16,
 'convey': 2,
 'knowledge': 7,
 'expression': 5}

In [29]:
# Create a pandas dataframe that shows the unigrams in each text

vocab_dict = vectorizer.vocabulary_.items()
# Retrieve the (n-gram, matrix index) pairs from the vocabulary dictionary
print(f"vocab_dict:\n{vocab_dict}\n")

dict_sorted = sorted(list(vocab_dict))
# Sort the pairs alphabetically by n-gram; CountVectorizer assigns indices in alphabetical order,
# so this is also the column order of the matrix
print(f"dict_sorted:\n{dict_sorted}\n")

keys_values_sorted = list(zip(*dict_sorted))
# Unzip into a list of 2 tuples: the n-grams and their corresponding matrix indices
print(f"keys_values_sorted:\n{keys_values_sorted}\n")

keys = keys_values_sorted[0]
# Retrieve the n-grams, which become the column names
print(f"keys:\n{keys}\n")

ngrams_matrix = ngrams.todense()
df = pd.DataFrame(ngrams_matrix, columns=keys)
# pd.DataFrame(data, index, columns) constructor creates a pandas DataFrame
# data as dictionary: keys represent column names, values represent data in columns
# data as list or NumPy array: represents the values in the DataFrame. Rows and columns will be indexed by default
# data as another DataFrame: can be used to create a new DataFrame based on the existing DataFrame's data
# index: row labels. Default = integer index (0, 1, 2,...)
# columns: defines column labels. Default = inferred from input data or integer index
df

vocab_dict:
dict_items([('programming', 13), ('is', 6), ('the', 15), ('process', 12), ('of', 10), ('creating', 3), ('virtual', 19), ('logic', 8), ('systems', 14), ('by', 0), ('communicating', 1), ('with', 20), ('machine', 9), ('writing', 21), ('using', 17), ('device', 4), ('or', 11), ('utensil', 18), ('to', 16), ('convey', 2), ('knowledge', 7), ('expression', 5)])

dict_sorted:
[('by', 0), ('communicating', 1), ('convey', 2), ('creating', 3), ('device', 4), ('expression', 5), ('is', 6), ('knowledge', 7), ('logic', 8), ('machine', 9), ('of', 10), ('or', 11), ('process', 12), ('programming', 13), ('systems', 14), ('the', 15), ('to', 16), ('using', 17), ('utensil', 18), ('virtual', 19), ('with', 20), ('writing', 21)]

keys_values_sorted:
[('by', 'communicating', 'convey', 'creating', 'device', 'expression', 'is', 'knowledge', 'logic', 'machine', 'of', 'or', 'process', 'programming', 'systems', 'the', 'to', 'using', 'utensil', 'virtual', 'with', 'writing'), (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1

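| | by | communicating | convey | creating | device | expression | is | knowledge | logic | machine | ... | process | programming | systems | the | to | using | utensil | virtual | with | writing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 2 |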
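
The sort-and-unzip steps used to build the column names can be traced on a tiny, made-up vocabulary. The sketch below assumes a dictionary with the same shape as `vectorizer.vocabulary_`; the three terms and their indices are hypothetical.

```python
# Hypothetical three-term vocabulary in the same shape as vectorizer.vocabulary_.
vocabulary = {"writing": 2, "is": 0, "process": 1}

# Tuples compare by their first element first, so this sorts by term (alphabetically).
pairs = sorted(vocabulary.items())
print(pairs)  # [('is', 0), ('process', 1), ('writing', 2)]

# zip(*pairs) "unzips" the pairs into one tuple of terms and one tuple of indices.
terms, indices = zip(*pairs)
print(terms)    # ('is', 'process', 'writing')
print(indices)  # (0, 1, 2)

# Sorting explicitly by index is safer if the two orders could ever differ.
terms_by_index = [term for term, _ in sorted(vocabulary.items(), key=lambda kv: kv[1])]
print(terms_by_index)  # ['is', 'process', 'writing']
```

Recent scikit-learn versions also provide `vectorizer.get_feature_names_out()`, which should return the terms in column order without the manual sorting.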


**Model Training and Feature Weights** 

In [44]:
# Train logistic regression model using gathered ngrams (unigrams)

model = LogisticRegression()
model.fit(ngrams, labels)
# ngrams is the sparse count matrix produced by CountVectorizer's .transform() method
# row: corresponds to one text document
# column: count of a particular unigram in that document
# labels contains the corresponding target label for each text document

# Show logistic regression weights

unigram_weight = dict(zip(keys, model.coef_[0]))
# model.coef_[0] retrieves the weights learned by the logistic regression model after training
# a positive weight pushes the prediction toward class 1 (Programming), a negative weight toward class 0 (Writing);
# the magnitude indicates how strongly the feature (unigram) influences the prediction
unigram_weight

{'by': 0.0998966450672437,
 'communicating': 0.0998966450672437,
 'convey': 0.0998966450672437,
 'creating': 0.0998966450672437,
 'device': -0.09993530868610438,
 'expression': -0.09993530868610438,
 'is': 0.0998966450672437,
 'knowledge': 0.0998966450672437,
 'logic': -0.09993530868610438,
 'machine': -0.09993530868610438,
 'of': -0.09993530868610438,
 'or': -3.866361886065603e-05,
 'process': -3.866361886065603e-05,
 'programming': -0.09993530868610438,
 'systems': -0.09993530868610438,
 'the': 0.0998966450672437,
 'to': 0.0998966450672437,
 'using': 0.0998966450672437,
 'utensil': -3.866361886065603e-05,
 'virtual': 0.0998966450672437,
 'with': -0.09993530868610438,
 'writing': -0.19987061737220876}
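
A weight dictionary like this is easier to read once it is ranked. The sketch below sorts a small set of hypothetical weights (rounded, made-up values in the same `{term: weight}` shape as `unigram_weight`) and picks out terms that carry almost no signal.

```python
# Hypothetical weights in the same {term: weight} shape as unigram_weight.
weights = {"machine": 0.10, "the": 0.00, "utensil": -0.10, "writing": -0.20}

# Sort from most negative (class 0) to most positive (class 1).
ranked = sorted(weights.items(), key=lambda kv: kv[1])

print(ranked[0])   # ('writing', -0.2)
print(ranked[-1])  # ('machine', 0.1)

# Terms whose weight is close to zero carry little signal for either class.
neutral = [term for term, w in weights.items() if abs(w) < 0.01]
print(neutral)  # ['the']
```

Words that appear equally often in both classes, such as "the" in the two example texts, are the ones expected to end up near zero.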

**Bigrams**
* Made of two consecutive words

In [45]:
# Fit vectorizer on texts
vectorizer = CountVectorizer(ngram_range=(1, 2))
# vectorizer object considers a minimum of unigram and maximum of bigram

vectorizer.fit(texts) 
# build ngram dictionary
# dict stored in vocabulary_

# vectorize texts into bag of words
ngrams = vectorizer.transform(texts)
ngrams.todense()

matrix([[1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
         0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
         0, 0, 0],
        [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1,
         2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
         2, 1, 1]])

In [46]:
# Vocabulary dictionary from fitting
vectorizer.vocabulary_
# Contains a mapping of terms (ngrams) to their indices in the matrix

{'programming': 26,
 'is': 11,
 'the': 30,
 'process': 24,
 'of': 18,
 'creating': 6,
 'virtual': 38,
 'logic': 15,
 'systems': 28,
 'by': 0,
 'communicating': 2,
 'with': 40,
 'machine': 17,
 'programming is': 27,
 'is the': 12,
 'the process': 31,
 'process of': 25,
 'of creating': 19,
 'creating virtual': 7,
 'virtual logic': 39,
 'logic systems': 16,
 'systems by': 29,
 'by communicating': 1,
 'communicating with': 3,
 'with machine': 41,
 'writing': 42,
 'using': 34,
 'device': 8,
 'or': 21,
 'utensil': 36,
 'to': 32,
 'convey': 4,
 'knowledge': 13,
 'expression': 10,
 'writing is': 44,
 'of using': 20,
 'using writing': 35,
 'writing device': 43,
 'device or': 9,
 'or utensil': 23,
 'utensil to': 37,
 'to convey': 33,
 'convey knowledge': 5,
 'knowledge or': 14,
 'or expression': 22}
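
The vocabulary contains 'with machine' and 'using writing' even though the texts say "with a machine" and "using a writing device". This is because the default `token_pattern` of `CountVectorizer` keeps only tokens of two or more word characters, so "a" is dropped before bigrams are formed. The sketch below reproduces that step with the standard `re` module; it approximates the tokenization and is not the library's actual code path.

```python
import re

# Default token_pattern of CountVectorizer: words of two or more characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

text = "Programming is the process of creating virtual logic systems by communicating with a machine."
tokens = re.findall(TOKEN_PATTERN, text.lower())
# Pair each token with its successor to form bigrams.
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print("a" in tokens)              # False
print(bigrams[-1])                # with machine
print(len(tokens), len(bigrams))  # 13 12
```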

In [57]:
# Train logistic regression model using gathered ngrams (unigrams and bigrams)
model = LogisticRegression()
model.fit(ngrams, labels)

# Show ngram weights after training
vocab_dict = vectorizer.vocabulary_.items()
dict_sorted = sorted(list(vocab_dict))
keys_values_sorted = list(zip(*dict_sorted))
keys = keys_values_sorted[0]
ngram_weight = dict(zip(keys, model.coef_[0]))
ngram_weight

{'by': 0.0998966450672437,
 'by communicating': 0.0998966450672437,
 'communicating': 0.0998966450672437,
 'communicating with': 0.0998966450672437,
 'convey': -0.09993530868610438,
 'convey knowledge': -0.09993530868610438,
 'creating': 0.0998966450672437,
 'creating virtual': 0.0998966450672437,
 'device': -0.09993530868610438,
 'device or': -0.09993530868610438,
 'expression': -0.09993530868610438,
 'is': -3.866361886065603e-05,
 'is the': -3.866361886065603e-05,
 'knowledge': -0.09993530868610438,
 'knowledge or': -0.09993530868610438,
 'logic': 0.0998966450672437,
 'logic systems': 0.0998966450672437,
 'machine': 0.0998966450672437,
 'of': -3.866361886065603e-05,
 'of creating': 0.0998966450672437,
 'of using': -0.09993530868610438,
 'or': -0.19987061737220876,
 'or expression': -0.09993530868610438,
 'or utensil': -0.09993530868610438,
 'process': -3.866361886065603e-05,
 'process of': -3.866361886065603e-05,
 'programming': 0.0998966450672437,
 'programming is': 0.09989664506724