## 1.4 Statistical approaches and text classification with N-grams

### Uses:
* Pandas
* Scikit-learn

### Topics Covered:
* Difference between Expert Systems and Statistical Approaches.
* Text Classification with N-grams, including:
  * Vectorization
  * N-Grams
  * Bag of Words
  * Sparse Matrix
  * Dense Matrix
* Making a logistic regression model.
* Model training with feature weights (unigrams and bigrams)

### Syntax Segments Summary:

In [None]:
import pandas as pd
# Used to show tables with dataframes

from sklearn.feature_extraction.text import CountVectorizer
# CountVectorizer class is used to vectorize texts by counting the occurrences of each word

from sklearn.linear_model import LogisticRegression
# Class can be used for logistic regression machine learning tasks

In [None]:
vectorizer = CountVectorizer(ngram_range=(1, 1))
# ngram_range=(min_n, max_n) considers a specified range of grams (unigrams in this line)

In [None]:
vectorizer.fit(texts)
# .fit() learns the vocabulary: it tokenizes the texts and assigns each distinct n-gram a column index

In [None]:
vectorizer.transform(texts)
# .transform() method converts texts into a sparse matrix of token counts (ngrams)

In [None]:
ngrams.todense()
# .todense() method converts a sparse matrix into a dense matrix
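
The fit, transform, and todense steps above can be sketched end to end as follows. The three toy sentences are made up for illustration; any list of strings would do.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not from the course data
texts = ["the cat sat", "the dog sat", "the cat ran"]

vectorizer = CountVectorizer(ngram_range=(1, 1))  # unigrams only
vectorizer.fit(texts)                  # learn the vocabulary
ngrams = vectorizer.transform(texts)   # sparse matrix: one row per text

dense = ngrams.todense()               # same counts, stored densely
print(ngrams.shape)                    # (3, 5): 3 texts, 5 distinct words
print(dense)
```

The sparse form stores only the non-zero counts, which matters once the vocabulary grows to thousands of columns; the dense form is mainly useful for inspecting small examples.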

In [None]:
vectorizer.vocabulary_
# Dictionary mapping each term (n-gram) to its column index in the matrix

In [None]:
vectorizer.vocabulary_.items()
# Returns the (term, index) pairs of the vocabulary dictionary

In [None]:
pd.DataFrame(ngrams_matrix, columns=keys)
# pd.DataFrame(data, index, columns) constructs a Pandas DataFrame
# data as dictionary: keys represent column names, values represent data in columns
# data as list or NumPy array: represents the values in the DataFrame. Rows and columns get a default integer index
# data as another DataFrame: can be used to create a new DataFrame based on the existing DataFrame's data
# index: row labels. Default = integer index (0, 1, 2,...)
# columns: defines column labels. Default = inferred from input data or integer index
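
A minimal sketch of how `ngrams_matrix` and `keys` might be built before the `pd.DataFrame` call. The toy texts are hypothetical, and sorting the vocabulary by index is one way (an assumption, not necessarily the original notebook's) to get column names in matrix order.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
texts = ["the cat sat", "the dog sat"]

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
vectorizer.fit(texts)
ngrams = vectorizer.transform(texts)

# Order the terms by their column index so the labels line up with the matrix
keys = [term for term, idx in
        sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1])]

ngrams_matrix = ngrams.toarray()  # plain 2D NumPy array of counts
df = pd.DataFrame(ngrams_matrix, columns=keys)
print(df)
```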

In [None]:
model = LogisticRegression()
# LogisticRegression model

In [None]:
model.fit(ngrams, labels)
# .fit() trains the LogisticRegression model on the feature matrix and its labels
# ngrams is typically the matrix obtained by applying CountVectorizer to the text data
# row: corresponds to a text document
# column: count of a particular n-gram (e.g., unigram) in the document
# labels contains the corresponding target labels for each text document

In [None]:
model.coef_[0]
# model.coef_[0] retrieves the weights learned by the logistic regression model after training (one weight per n-gram column)
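
A small sketch of training and then reading the feature weights, assuming a made-up four-document sentiment set with labels 1 (positive) and 0 (negative). The exact weight values depend on the solver and regularization, so only their signs are worth reading here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 1 = positive, 0 = negative
texts = ["great movie", "great film", "awful movie", "awful film"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer(ngram_range=(1, 1))
vectorizer.fit(texts)
ngrams = vectorizer.transform(texts)

model = LogisticRegression()
model.fit(ngrams, labels)

# Pair each term with its learned weight, in column order
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
weights = dict(zip(terms, model.coef_[0]))
for term, weight in weights.items():
    print(f"{term:>6}: {weight:+.3f}")

prediction = model.predict(vectorizer.transform(["great film"]))[0]
print(prediction)
```

Terms that appear only in positive texts get positive weights, terms that appear only in negative texts get negative weights, and terms shared evenly between both classes stay near zero.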

## 1.5 Stemming, Lemmatization, Stopwords, and POS Tagging

### Uses:
* NLTK

### Topics Covered:
* Inflected Language
* Stemming
* Lemmatization
* Stopwords
* POS Tagging

### Syntax Segments Summary:

In [None]:
import nltk
nltk.download('punkt')
# Allows you to tokenize
nltk.download('wordnet')
# WordNet is a lexical database that tracks words and their relations
nltk.download('omw-1.4')
# Open Multilingual WordNet provides translations and word senses in multiple languages
nltk.download('averaged_perceptron_tagger')
# Model used for POS tagging
nltk.download('stopwords')
# A corpus of stopwords

from nltk.tokenize import word_tokenize
# Tokenizes based on words and punctuation

from nltk.stem import PorterStemmer
# Performs suffix stripping to produce stems
# Applies algorithmic rules to generate stems

from nltk.stem.snowball import SnowballStemmer
# Contains a family of stemmers for different languages

from nltk.stem import WordNetLemmatizer
# Class used to reduce words to their lemma

from nltk.corpus import stopwords
# Imports stopwords from the downloaded corpus

In [None]:
stemmer = PorterStemmer()
# PorterStemmer object used to stem

In [None]:
stemmer.stem(word)
# .stem() reduces a given word to its stem
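
A short sketch of stemming a few words. The word list is arbitrary; `PorterStemmer` is rule-based, so it does not need any of the downloads above.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Arbitrary example words
words = ["running", "cats", "connected"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
# running -> run
# cats -> cat
# connected -> connect
```

`.stem()` works on one word at a time, so a full sentence needs to be tokenized first and stemmed token by token.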

In [None]:
nltk.word_tokenize(text)
# .word_tokenize() tokenizes the given text and returns a list of tokens

In [None]:
SnowballStemmer.languages
# Tuple of the languages the class is capable of stemming

In [None]:
SnowballStemmer(avail_language)
# Creates a stemmer object for the specified language
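
A sketch of picking a language from `SnowballStemmer.languages` and stemming with it. English is used here only because its output is easy to check.

```python
from nltk.stem.snowball import SnowballStemmer

# Check that the language is supported before creating the stemmer
language = "english"
supported = language in SnowballStemmer.languages
print(supported)

stemmer = SnowballStemmer(language)
stem = stemmer.stem("running")
print(stem)  # run
```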

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# WordNetLemmatizer object

In [None]:
stopwords.words(language)
# Retrieves stopwords specific to the given language

In [None]:
nltk.pos_tag(tokens)
# Takes a list of tokens and returns the POS (part-of-speech) tag for each token (word or punctuation)
# Returns as a list of tuples with 2 items each (token, POS)

## 1.9 Representing Texts as Vectors TF-IDF

### Uses:
* HuggingFace Hub
* NLTK
* Pandas
* Plotly
* Collections
* Regex

### Topics Covered:
* Zipf's Law
* Use Case: Medium Articles
* HuggingFace Datasets
* DataFrames
* Plotly Visualizations
* Use Case: Brown Corpus
* Use Case: Stop Words
* TF-IDF

### Syntax Segments Summary:

In [None]:
from huggingface_hub import hf_hub_download
# hf_hub_download downloads a single file from a Hugging Face Hub repository (here, a dataset file)

import nltk
nltk.download('punkt')
# Allows you to tokenize
nltk.download('gutenberg')
# Downloads the Gutenberg corpus
nltk.download('stopwords')
# A corpus of stopwords

from nltk.tokenize import word_tokenize
# Tokenizer

import pandas as pd
# Used for data manipulation and analysis

import plotly.express as px
# Used for visualizations

from collections import Counter
# Counts occurrences of items in an iterable

import re
# Regular Expressions used for manipulating text

In [None]:
dataframe = pd.read_csv(
    hf_hub_download(
        "fabiochiu/medium-articles", # Repository
        repo_type="dataset", # Type of content being fetched
        filename="medium_articles.csv" # File to be downloaded
    )
    # Downloads dataset from the Hugging Face model hub
)
# Reads a csv file into a Pandas DataFrame

In [None]:
dataframe.shape
# Contains a tuple with the size of a given dataframe (rows, columns)

In [None]:
dataframe.sample(n=30000)
# .sample() randomly selects a subset of n items from the given DataFrame
# Used to speed up computation
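
A minimal sketch of `.sample()` on a made-up DataFrame. `random_state` is an optional pandas parameter, added here so the same rows are picked on every run.

```python
import pandas as pd

# Hypothetical stand-in for the articles DataFrame
dataframe = pd.DataFrame({
    "id": range(100),
    "text": [f"article {i}" for i in range(100)],
})

# .sample() returns a new DataFrame; the original is left unchanged
subset = dataframe.sample(n=10, random_state=42)
print(subset.shape)     # (10, 2)
print(dataframe.shape)  # (100, 2)
```

Because the result is a new object, it has to be assigned (for example back to `dataframe`) for the smaller sample to be used in later cells.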

In [None]:
dataframe.head()
# Displays first few rows of the given DataFrame

In [None]:
dataframe[column_name_key].values
# .values returns a NumPy array holding the values of the selected column

In [None]:
Counter_obj.most_common(top_n)
# .most_common(n) returns a list of the n most frequent tokens as tuples, ordered from highest to lowest count
# tuples -> (token, count)
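
A sketch of counting tokens with `Counter` using only the standard library. The sentence and the simple `[a-z]+` regex tokenizer are assumptions made for this example; the notebook itself may tokenize differently (for instance with `word_tokenize`).

```python
import re
from collections import Counter

# Made-up text for illustration
text = "The cat and the dog and the bird"

# Lowercase, then keep runs of letters as tokens
tokens = re.findall(r"[a-z]+", text.lower())

counts = Counter(tokens)
top = counts.most_common(2)
print(top)  # [('the', 3), ('and', 2)]
```

On real text, a handful of function words like these dominate the top of the list, which is the pattern Zipf's Law describes.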

In [None]:
nltk.corpus.gutenberg.fileids()
# Returns the file IDs of the texts available in NLTK's Gutenberg corpus

In [None]:
nltk.corpus.gutenberg.words(corpus_name)
# Returns the words (tokens) of the specified file in the Gutenberg corpus

In [None]:
stopwords.words('english')
# Retrieves the list of English stopwords from NLTK
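
TF-IDF is listed in the topics for this section but has no syntax entry, so here is a hedged sketch of one common textbook variant: tf = term count / document length, idf = log(N / document frequency). Library implementations (for example scikit-learn's `TfidfVectorizer`) use smoothed and normalized variants, so their numbers will differ. The documents and the `tf_idf` helper are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized documents
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """Score each term in one document (illustrative helper)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

scores = [tf_idf(doc) for doc in docs]
for doc_scores in scores:
    print({term: round(score, 3) for term, score in doc_scores.items()})
```

A word that appears in every document, such as "the", scores 0, which is why TF-IDF pushes stopword-like terms down without needing an explicit stopword list.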

## 1.12 Representing Text as Vectors with Word Embeddings

### Uses:
* Sentence Transformers
* HuggingFace Hub
* Pandas
* Numpy
* Scikit-learn
* Plotly

### Topics Covered:
* Word Embeddings
* Word Embedding models
* Context-independent Embedding
* Context-dependent Embedding
* Pre-trained Models
* Finetuning
* Sentence Embeddings
* Sentence Transformers
* Cosine Similarity
* MTEB
* Analyzing Similarity
* Plotly Visualizations
* Datasets
* DataFrames
* Bag of Words vs Embeddings comparison

### Syntax Segments Summary:

In [None]:
from sentence_transformers import SentenceTransformer, util
# SentenceTransformer class used for sentence embedding
# util contains utility functions to support the SentenceTransformer

from huggingface_hub import hf_hub_download
# Used for downloading files from the Hugging Face model hub

import pandas as pd
# Used for data manipulation and analysis

import numpy as np
# Used for numerical computing and working with arrays and matrices

from sklearn.decomposition import PCA
# PCA (Principal Component Analysis) class
# A technique for dimensionality reduction
# Used to reduce high-dimensional data to its most important information

from sklearn.manifold import TSNE 
# TSNE (t-Distributed Stochastic Neighbor Embedding) is a class
# Used for visualizing high-dimensional data

# Visualization

import plotly.express as px
# High-level interface used for plots and charts

import plotly.io as pio
# Used for handling input and output for Plotly visualizations (including saving or displaying)

In [None]:
model = SentenceTransformer(model_name)
# Loads the specified pretrained sentence embedding model

In [None]:
model.encode(sentences, convert_to_tensor=True)
# .encode() takes in a list of sentences and represents the given text numerically (embeddings)
# convert_to_tensor=True returns the embeddings as a PyTorch tensor instead of a NumPy array

In [None]:
embeddings.shape
# Returns the number of embeddings and the number of dimensions of each vector
# shape -> (num_embeddings, dimensions)

In [None]:
util.cos_sim(embedding1, embedding2)[0].item()
# util.cos_sim(a, b) calculates the cosine similarity between every vector in a and every vector in b
# Returns a 2D tensor of shape (len(a), len(b))
# [0] selects the first row (the similarities of the first vector in a)
# .item() extracts the score as a Python float; it only works when that row holds a single value
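
To show what the similarity score means without loading a model, this sketch computes cosine similarity by hand with NumPy. The `cosine_similarity` helper and the 2D vectors are made up for illustration; it is not the `util.cos_sim` implementation, only the same formula (dot product divided by the product of the vector lengths).

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1D vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_similarity([1, 0], [2, 0])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 3])      # perpendicular vectors
opposite = cosine_similarity([1, 0], [-1, 0])       # opposite vectors

print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not length: 1 means the vectors point the same way, 0 means they are unrelated, and -1 means they point in opposite directions.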

In [None]:
pd.concat([
    df_articles[df_articles["tags"].apply(lambda taglist: "Data Science" in taglist)][:200],
    df_articles[df_articles["tags"].apply(lambda taglist: "Business" in taglist)][:200]
]).reset_index(drop=True)
# pd.concat concatenates subsets into a single DataFrame
# .reset_index(drop=True) resets the index of the resulting DataFrame to start from 0 and drops the previous index
# df_articles["tags"] refers to the "tags" column of the DataFrame
# .apply applies a function to each element in the specified column
# lambda function takes the article's taglist and returns True if the string is in the taglist

In [None]:
PCA_obj = PCA(n_components=n)
# PCA() creates a PCA object that reduces the data to a specified number of dimensions
# n_components=n reduces the data to n dimensions

In [None]:
embeddings_pca = PCA_obj.fit_transform(embeddings)
# .fit_transform() fits the PCA model to the embeddings 
# and transforms it into a new set of data with reduced dimensions
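
A sketch of the PCA step on random data standing in for real embeddings. The sizes (100 vectors of 384 dimensions, reduced to 50) are assumptions chosen for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for sentence embeddings: 100 vectors, 384 dimensions each
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 384))

PCA_obj = PCA(n_components=50)
embeddings_pca = PCA_obj.fit_transform(embeddings)

print(embeddings.shape)      # (100, 384)
print(embeddings_pca.shape)  # (100, 50)
```

`n_components` cannot exceed the smaller of the number of samples and the number of features, so 100 rows allow at most 100 components.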

In [None]:
embeddings_tsne = TSNE(n_components=n, perplexity=p).fit_transform(embeddings_pca)
# TSNE() creates a t-SNE object that reduces the dimensions for visualization
# n_components=n reduces the data to n dimensions
# perplexity determines the number of neighbors to consider when reducing dimensionality
# low perplexity values lead to a more local view of data
# higher perplexity values lead to a more global view of data
# .fit_transform() fits the t-SNE model to the input data

In [None]:
pd.DataFrame(data={
    "x": x_dimension, # x coords
    "y": y_dimension, # y coords
    "title": titles, # title attached to each point
    "color": ["color" for _ in x_dimension] # List with the string "color" repeated once per point
    # Arbitrary placeholder
})
# pd.DataFrame(data={}) creates a pandas DataFrame with the given information
# key -> column name
# value -> each item has a row in its column