# Test 2 Wrapup

## Key Themes

- Text processing (tokenization, and considerations such as token definitions, vocab size, and frequency thresholds)
- Document-term matrix (DTM)
- All machine learning still applies once we have a DTM or a similar representation!
  - dimensionality reduction (DR)
  - clustering/unsupervised machine learning (UML) tasks
  - supervised machine learning (SML) tasks
- Sentiment Analysis
  - dictionary/lookup approaches
  - some approaches augment the lookup with rules-based modifiers
  - Data annotation and hand-labeling is generally best for domain-specific needs
- NER (named entity recognition)
  - extract named entities from a corpus
  - spaCy has a generalized model, but it's a model, so it is not always accurate for our specific domain needs
  - when tuned for a specific problem, it can help extract knowledge quickly compared to keeping humans in the loop
- embeddings
  - Instead of a sparse count representation, we can start to encode contextual meaning into static, dense word/token vectors
  - Directionally similar to PCA, we create a new feature space to represent each row/observation/document
  - These representations can be used downstream in DR/UML/SML
- beyond embeddings
  - deep learning neural net architectures expanded on Word2Vec and ushered in language modeling (multiple ML tasks learned/trained at once)
  - These models are pretrained on large corpora and, like the models above, are general, but they might help improve our outcomes
- Putting it all Together
  - conversational AI combines intent classification and NER
  - we can fine-tune deep learning models to leverage the generalized fits and tune them on our data to help fit our task
  - Newer techniques for topic modeling draw upon learning embeddings, reducing those embeddings, and calculating similarity/distance to identify the semantic relationships in clusters
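
The dictionary/lookup approach to sentiment listed above, with one rules-based modifier, can be sketched in a few lines. This is only an illustration: the lexicon, the negator list, and the helper names `score_sentiment` and `label` are hypothetical and not taken from any course library, and real lexicons hold thousands of scored terms.

```python
# Hypothetical mini-lexicon (assumption for illustration only).
LEXICON = {"great": 1.0, "thanks": 0.5, "late": -0.5, "terrible": -1.0}
# Hypothetical negators that flip the sign of the next term.
NEGATORS = {"not", "never", "no"}
PUNCT = ".,!?"


def score_sentiment(text):
    """Sum lexicon scores; a term directly after a negator has its sign flipped."""
    tokens = [token.strip(PUNCT) for token in text.lower().split()]
    total = 0.0
    for i, word in enumerate(tokens):
        value = LEXICON.get(word, 0.0)  # words outside the lexicon score 0
        if i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        total += value
    return total


def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label(score_sentiment("not great, flight was late")))
```

The sketch also shows why hand-labeled data tends to win for domain-specific needs: any term missing from the lexicon contributes nothing to the score.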

## Additional Considerations for after BA820 

- Data annotation in a notebook - https://github.com/dennisbakhuis/pigeonXT
- Bulk labeling - https://github.com/RasaHQ/rasalit/blob/main/notebooks/bulk-labelling/bulk-labelling.ipynb

In [None]:
# installs
! pip install -U spacy

In [None]:
# imports
import pandas as pd
import seaborn as sns

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

import spacy
from spacy import cli 


In [None]:
# spacy setup
model = "en_core_web_md"
cli.download(model)

nlp = spacy.load(model)

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [None]:
# a dataset
# note: pd.read_gbq is deprecated as of pandas 2.2; pandas_gbq.read_gbq is the replacement
SQL = "SELECT tweet_id, text, airline_sentiment FROM `questrom.datasets.airlines-tweets`"
df = pd.read_gbq(SQL, "questrom")

In [None]:
# quick review
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   tweet_id           14640 non-null  int64 
 1   text               14640 non-null  object
 2   airline_sentiment  14640 non-null  object
dtypes: int64(1), object(2)
memory usage: 343.2+ KB


In [None]:
df.sample(3)

|       | tweet_id           | text                                               | airline_sentiment |
|-------|--------------------|----------------------------------------------------|-------------------|
| 12331 | 570044681670696960 | @united thank you.                                 | positive          |
| 3590  | 570110504254857217 | @USAirways : You Make the Reservation; We'll M...  | negative          |
| 6968  | 569859036360908801 | @AmericanAir @MallowFairy And how many times, ...  | negative          |


In [None]:
# split up the docs, stratifying so each split keeps the same sentiment mix
X_train, X_test, y_train, y_test = train_test_split(
    df.text,
    df.airline_sentiment,
    test_size=.3,
    random_state=820,
    stratify=df.airline_sentiment,
)

In [None]:
# fit the Bag of Words via sklearn
cv = CountVectorizer(max_features=15000)
cv.fit(X_train)

# get the dtms
dtm_train = cv.transform(X_train)
dtm_test = cv.transform(X_test)

# transform() returns sparse matrices, so convert them to dense arrays
dtm_train = dtm_train.toarray()
dtm_test = dtm_test.toarray()
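
To make the fit/transform pattern above concrete, here is a simplified pure-Python sketch of building a document-term matrix. It is not how `CountVectorizer` is implemented (its tokenizer and tie-breaking differ); the helper names `fit_vocabulary` and `transform` and the toy documents are assumptions for illustration. The point is that the vocabulary is learned from the training docs only, so unseen test terms are dropped.

```python
from collections import Counter


def fit_vocabulary(docs, max_features=None):
    """Learn a term -> column index mapping from the training docs only."""
    counts = Counter(token for doc in docs for token in doc.lower().split())
    # Keep the most frequent terms, then sort so the column order is stable.
    terms = sorted(term for term, _ in counts.most_common(max_features))
    return {term: idx for idx, term in enumerate(terms)}


def transform(docs, vocab):
    """Count vocabulary terms per document; terms outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in doc.lower().split():
            if token in vocab:
                row[vocab[token]] += 1
        rows.append(row)
    return rows


# Toy corpus (hypothetical), with the vocabulary capped at two terms.
train_docs = ["late flight late bag", "great flight"]
test_docs = ["late late great", "new flight"]

vocab = fit_vocabulary(train_docs, max_features=2)
dtm_toy_train = transform(train_docs, vocab)
dtm_toy_test = transform(test_docs, vocab)
print(vocab)
print(dtm_toy_test)
```

Capping the vocabulary plays the same role as `max_features=15000` in the cell above: rarer terms never get a column.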

In [None]:
# which components are in the spaCy pipeline?
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [None]:
# get the document vector representation with spacy

# we can use nlp.pipe and disable the pipeline components we don't need
DISABLE = ['tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
docs_train = list(nlp.pipe(X_train, disable=DISABLE))
docs_test = list(nlp.pipe(X_test, disable=DISABLE))


# doc = nlp("Brock likes python")

In [None]:
# extract above into doc vector representations
# use pre-trained spacy word vectors

dvm_train = [doc.vector for doc in docs_train]
dvm_test = [doc.vector for doc in docs_test]
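
By default, spaCy's `doc.vector` is the average of the document's token vectors. The sketch below mimics that idea in pure Python with a made-up three-dimensional vector table; `TOY_VECTORS` and `doc_vector` are hypothetical names, and treating unknown tokens as zero vectors is an assumption of this sketch.

```python
# Hypothetical 3-dimensional word vectors (real pretrained vectors have hundreds of dimensions).
TOY_VECTORS = {
    "late": [0.0, 1.0, 0.0],
    "flight": [1.0, 0.0, 0.0],
    "great": [0.0, 0.0, 1.0],
}
DIM = 3


def doc_vector(text):
    """Average the token vectors; unknown tokens count as zero vectors."""
    tokens = text.lower().split()
    if not tokens:
        return [0.0] * DIM
    total = [0.0] * DIM
    for token in tokens:
        vec = TOY_VECTORS.get(token, [0.0] * DIM)
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(tokens) for t in total]


print(doc_vector("late flight"))
```

Because every document collapses to one fixed-length dense vector, word order is lost, which is one reason an averaged representation may not beat a bag of words on short texts like tweets.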

In [None]:
# quick review, what is distro of sentiment
df.airline_sentiment.value_counts(dropna=False)

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64
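
With classes this imbalanced, it helps to know what a trivial model would score. The short calculation below uses the counts printed above to get the accuracy of always predicting the majority class; the variable names are illustrative. The accuracies reported at the end of this notebook (0.71 for the bag of words, 0.68 for the vectors) should be read against this baseline.

```python
# Class counts copied from the value_counts() output above.
counts = {"negative": 9178, "neutral": 3099, "positive": 2363}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 3))
```

Since the split is stratified, the test set keeps roughly the same class mix, so the baseline carries over to the test reports.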

In [None]:
# fit a tree on the bag-of-words DTM (takes about 30 seconds)
tree_bow = DecisionTreeClassifier(min_samples_split=150, random_state=820)
tree_bow.fit(dtm_train, y_train)

# fit the same tree configuration on the spaCy document vectors
tree_vec = DecisionTreeClassifier(min_samples_split=150, random_state=820)
tree_vec.fit(dvm_train, y_train)

DecisionTreeClassifier(min_samples_split=150, random_state=820)

In [None]:
# apply to get the predictions
preds_dtm = tree_bow.predict(dtm_test)
preds_dvm = tree_vec.predict(dvm_test)

In [None]:
# build a classification report for each feature set
cr_bow = metrics.classification_report(y_test, preds_dtm)
cr_vec = metrics.classification_report(y_test, preds_dvm)

In [None]:
# report for bow/dtm
print(cr_bow)

              precision    recall  f1-score   support

    negative       0.78      0.84      0.80      2753
     neutral       0.50      0.45      0.47       930
    positive       0.65      0.54      0.59       709

    accuracy                           0.71      4392
   macro avg       0.64      0.61      0.62      4392
weighted avg       0.70      0.71      0.70      4392



In [None]:
# report for vectors
print(cr_vec)

              precision    recall  f1-score   support

    negative       0.76      0.81      0.79      2753
     neutral       0.45      0.42      0.43       930
    positive       0.58      0.48      0.52       709

    accuracy                           0.68      4392
   macro avg       0.60      0.57      0.58      4392
weighted avg       0.67      0.68      0.67      4392
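
The `macro avg` and `weighted avg` rows in the two reports can be reproduced from the per-class rows. The sketch below recomputes the f1-score averages from the rounded values printed above, so results can differ slightly from what scikit-learn computes on unrounded scores; `macro_and_weighted` is an illustrative helper name.

```python
# (f1-score, support) per class, copied from the two reports above.
bow_f1 = {"negative": (0.80, 2753), "neutral": (0.47, 930), "positive": (0.59, 709)}
vec_f1 = {"negative": (0.79, 2753), "neutral": (0.43, 930), "positive": (0.52, 709)}


def macro_and_weighted(per_class):
    """Return (macro, weighted) averages of per-class scores."""
    scores = [score for score, _ in per_class.values()]
    total_support = sum(support for _, support in per_class.values())
    # Macro: every class counts equally, regardless of size.
    macro = sum(scores) / len(scores)
    # Weighted: each class counts in proportion to its support.
    weighted = sum(score * support for score, support in per_class.values()) / total_support
    return macro, weighted


bow_macro, bow_weighted = macro_and_weighted(bow_f1)
vec_macro, vec_weighted = macro_and_weighted(vec_f1)
print(round(bow_macro, 2), round(bow_weighted, 2))
print(round(vec_macro, 2), round(vec_weighted, 2))
```

The macro average sits well below the weighted average in both reports because the smaller neutral and positive classes are predicted less well than the dominant negative class.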

