## <center>Elements Of Data Science - F2023</center>
# <center>Week 10: NLP, Sentiment Analysis and Topic Modeling</center>
### <center>11/27/2023</center>

# TODOs

- Readings:
  -  PML Chapter 11: Working with Unlabeled Data - Clustering Analysis, Sections 11.1 and 11.2
  - [Optional] [PDSH 5.11 k-Means](https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html)
  - [Optional] [Data Science From Scratch Chap 22: Recommender Systems](https://ezproxy.cul.columbia.edu/login?qurl=https%3a%2f%2fsearch.ebscohost.com%2flogin.aspx%3fdirect%3dtrue%26db%3dnlebk%26AN%3d979529%26site%3dehost-live%26scope%3dsite%26ebv%3DEB%26ppid%3Dpp_267)
<br>
<br>

- Quiz 10, Due **Mon Dec 4th, 11:59pm ET**
- HW3, Due **Fri Dec 1st 11:59pm**



# Today

- **Pipelines**
- **NLP**
- **Sentiment Analysis**
- **Topic Modeling**
<br>
<br>

# <center>Questions?</center>
<br>
<br>

# Environment Setup

In [None]:
import numpy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')


sns.set_style('darkgrid')
%matplotlib inline

# Pipelines in sklearn

- Pipelines are wrappers used to string together transformers and estimators
 - sequentially apply a series of transforms, e.g., `.fit_transform()` and `.transform()`
 - followed by a prediction, e.g., `.fit()` and `.predict()`
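
A rough sketch of the mechanics described above, using hypothetical toy classes (`Center`, `ThresholdClassifier`, `TinyPipeline`) rather than sklearn's actual implementation: every step but the last is fit and applied in order, and the last step is fit on the transformed data.

```python
# Toy illustration only: these classes are hypothetical, not sklearn's implementation.

class Center:
    """Transformer: subtract the training mean from each value."""
    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class ThresholdClassifier:
    """Estimator: predict 1 when the (transformed) value is positive."""
    def fit(self, X, y):
        return self

    def predict(self, X):
        return [int(x > 0) for x in X]


class TinyPipeline:
    def __init__(self, steps):
        self.steps = steps                      # list of (name, object) pairs

    def fit(self, X, y):
        for _, step in self.steps[:-1]:         # transformers: fit, then transform
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)             # final estimator: fit only
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:         # reuse what was learned during fit
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


pipe = TinyPipeline([('center', Center()), ('clf', ThresholdClassifier())])
pipe.fit([1.0, 2.0, 3.0, 6.0], [0, 0, 0, 1])    # training mean is 3.0
preds = pipe.predict([2.0, 5.0])
print(preds)
```

Note that `predict` only calls `transform`, never `fit`, on the earlier steps, so new data is centered with the mean learned from the training set.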

# Pipelines in sklearn
<br>
<br>

<div align="center"><img src="images/pipelines.png" width="800px"></div>

<font size=6>From PML</font>

# Binary Classification With All Numeric Features Setup

In [None]:
# Example from PML - scaling > feature extraction > classification
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
bc = load_breast_cancer()
X_bc,y_bc = bc['data'],bc['target']
X_bc_train,X_bc_test,y_bc_train,y_bc_test = train_test_split(X_bc,
                                                             y_bc,
                                                             test_size=0.3,
                                                             stratify=y_bc,
                                                             random_state=123)

# print without scientific notation
numpy.set_printoptions(suppress = True)

print("training set has rows: {} columns: {}".format(*X_bc_train.shape))

# all real valued features
print('Feature names: ',bc.feature_names[:3], ' ...')
print('Corresponding Feature values:', X_bc_train[:1,:3][0].round(2), ' ...')
print('Target names: ', bc.target_names)

# Pipelines in sklearn

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Pipeline: list of (name,object) pairs
pipe1 = Pipeline([('scale',StandardScaler()),                   # scale
                  ('pca',PCA(n_components=15)),                  # reduce dimensions
                  ('lr',LogisticRegression(solver='saga',
                                           max_iter=1000,
                                           random_state=12)),  # classifier
                 ])

pipe1.fit(X_bc_train,y_bc_train)

print(f'train set accuracy: {pipe1.score(X_bc_train,y_bc_train).round(3)}')
print(f'test set accuracy : {pipe1.score(X_bc_test,y_bc_test).round(3)}')

In [None]:
# access pipeline components by name like a dictionary
pipe1['lr'].coef_.round(2)

In [None]:
pipe1['pca'].components_[0].round(2)

# Pipelines in sklearn: GridSearch with Pipelines

- specify grid points using 'step name' + '__' (double-underscore) + 'argument'
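
The naming rule can be illustrated with a small sketch. `split_param` is a hypothetical helper written for this slide, not an sklearn function: the text before the first `__` names a step, and the remainder is handed down to that step, which is why nested names can address parameters deep inside a pipeline.

```python
def split_param(name):
    """Split 'step__argument' at the first double underscore (hypothetical helper)."""
    step, _, rest = name.partition('__')
    return step, rest

simple = split_param('pca__n_components')
nested = split_param('preprocessor__num__imputer__strategy')

print(simple)   # step name and the argument set on that step
print(nested)   # the remainder is itself a 'step__argument' name for the nested object
```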

In [None]:
from sklearn.exceptions import ConvergenceWarning # needed to suppress warnings
from sklearn.utils import parallel_backend        # needed to suppress warnings

from sklearn.model_selection import GridSearchCV

# separate step-names and argument-names with double-underscore '__'
params1 = {'pca__n_components':[2,10,15,20],
           'lr__penalty':['none','l1','l2'],
           'lr__C':[.001,.01,1,10,100]}             # C must be strictly positive

with parallel_backend("multiprocessing"):         # needed to suppress warnings
    with warnings.catch_warnings():               # needed to suppress warnings
        warnings.filterwarnings("ignore")         # needed to suppress warnings
        # run the search inside the blocks so the settings above apply to it
        gscv = GridSearchCV(pipe1, params1, cv=3, n_jobs=-1).fit(X_bc_train,y_bc_train)

gscv.best_params_

In [None]:
score = gscv.score(X_bc_test,y_bc_test)
print(f'test set accuracy: {score:0.3f}')

# Displaying Pipelines

In [None]:
gscv

In [None]:
print(gscv)

# Displaying Pipelines Cont.

In [None]:
gscv.best_estimator_

In [None]:
print(gscv.best_estimator_)

# Pipelines in sklearn with `make_pipeline`

- shorthand for Pipeline
- step names are lowercase of class names

In [None]:
from sklearn.pipeline import make_pipeline

# make_pipeline: arguments in order of how they should be applied
pipe2 = make_pipeline(StandardScaler(),                    # center and scale data
                      PCA(n_components=2),                 # extract 2 dimensions
                      LogisticRegression(random_state=123) # classify using logistic regression
                     )
pipe2.fit(X_bc_train,y_bc_train) 

pipe2

In [None]:
pipe2['logisticregression'].coef_.round(2)

# ColumnTransformer

- Transform sets of columns differently as part of a pipeline
- For example: makes it possible to transform categorical and numeric columns differently

# Binary Classification With Mixed Features, Missing Data 

In [None]:
# from https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
df_titanic = pd.read_csv(titanic_url)[['age','fare','embarked','sex','pclass','survived']]
# Numeric Features:
# - age: float.
# - fare: float.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.
df_titanic.head(1)

In [None]:
df_titanic.info()

# ColumnTransformer Cont.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# specify columns subset
numeric_features = ['age', 'fare']
# specify pipeline to apply to those columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')), # fill missing values with median
    ('scaler', StandardScaler())])                 # scale features

In [None]:
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')), # fill missing value with 'missing'
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])                   # one hot encode

In [None]:
# combine column pipelines
preprocessor = ColumnTransformer(
    transformers=[('num', numeric_transformer, numeric_features),
                  ('cat', categorical_transformer, categorical_features)
                 ])

In [None]:
# add a final prediction step
pipe3 = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', LogisticRegression(solver='lbfgs', random_state=42))
                       ])

# ColumnTransformer Cont.

In [None]:
pipe3

# ColumnTransformer Cont.

In [None]:
X_titanic = df_titanic.drop('survived', axis=1)
y_titanic = df_titanic['survived']

X_titanic_train, X_titanic_test, y_titanic_train, y_titanic_test = train_test_split(X_titanic, 
                                                                                    y_titanic, 
                                                                                    test_size=0.2, 
                                                                                    random_state=142)
pipe3.fit(X_titanic_train, y_titanic_train)
print(f"train set score: {pipe3.score(X_titanic_train, y_titanic_train).round(3)}")
print(f"test set score : {pipe3.score(X_titanic_test, y_titanic_test).round(3)}")

In [None]:
from sklearn.model_selection import GridSearchCV

# grid search deep inside the pipeline
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
}

gs_pipeline = GridSearchCV(pipe3, param_grid, cv=3)
gs_pipeline.fit(X_titanic_train, y_titanic_train)
print(f"test set score of best estimator from grid search: {gs_pipeline.score(X_titanic_test, y_titanic_test).round(3)}")
print(f"best parameter settings: {gs_pipeline.best_params_}")

# ColumnTransformer Cont.

In [None]:
gs_pipeline

<br>
<br>

# <center>Questions re Pipelines?</center>
<br>
<br>

# Natural Language Processing (NLP)
<br>

- Analyzing and interacting with natural language
- Python Libraries
  - **sklearn**
  - nltk
  - **spaCy**
  - gensim
  - ...

# Natural Language Processing (NLP)
<br>

- Many NLP Tasks

  - **sentiment analysis**
  - **topic modeling**
  - entity detection
  - machine translation
  - natural language generation
  - question answering
  - relationship extraction
  - automatic summarization
  - ...

# Recall: Python Builtin String Functions

In [None]:
doc = "D.S. is fun!"
doc

In [None]:
doc.lower(),doc.upper()       # change capitalization

In [None]:
doc.split() , doc.split('.')  # split a string into parts (default is whitespace)

In [None]:
' | '.join(['ab','c','d'])      # join items in a list together

In [None]:
'|'.join(doc[:5])             # a string itself is treated like a list of characters

In [None]:
'  tes t   '.strip()           # remove whitespace from the beginning and end of a string

- and many more, see [https://docs.python.org/3.10/library/string.html](https://docs.python.org/3.10/library/string.html)

# NLP: The Corpus
<br>

- **corpus:** collection of documents
  - books
  - articles
  - reviews
  - tweets
  - resumes
  - sentences?
  - ...

# NLP: Doc Representation
<br>

- Documents usually represented as strings
  - string: a sequence (list) of unicode characters

In [None]:
sample_doc = "D.S. is fun!\nIt's  true."
print(sample_doc)

In [None]:
'|'.join(sample_doc)

- Need to split this up into parts (**tokens**)
- Good job for **Regular Expressions**

# Aside: Regular Expressions
<br>

- Strings that define search patterns over text
- Useful for finding/replacing/grouping
- python `re` library (others available)

In [None]:
print(sample_doc)

In [None]:
import re
# Find all of the whitespaces in doc
# '\s+' means "one or more whitespace characters"
re.findall(r'\s+',sample_doc)

# Aside: Regular Expressions

Just some of the special character definitions:
    
- `.` : any single character except newline (r'.' matches 'x')
- `*` : match 0 or more repetitions (r'x*' matches 'x','xx','')
- `+` : match 1 or more repetitions (r'x+' matches 'x','xx')
- `?` : match 0 or 1 repetitions (r'x?' matches 'x' or '')
<br>
    
- `^` : beginning of string (r'^D' matches 'D.S.')
- `$` : end of string (r'fun!$' matches 'DS is fun!')
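
A few quick checks of the special characters listed above, using only the standard `re` module. The test strings are made up for illustration.

```python
import re

# quantifiers
star_matches_empty = re.fullmatch(r'x*', '') is not None   # '*' allows zero repetitions
plus_matches_empty = re.fullmatch(r'x+', '') is not None   # '+' needs at least one 'x'
optional_x = re.findall(r'x?y', 'xy y')                    # '?' makes the 'x' optional

# '.' matches any single character except a newline
dot_matches = re.findall(r'f.n', 'fun fan f\nn')

# anchors
starts_with_D = re.search(r'^D', 'D.S. is fun!') is not None
ends_with_fun = re.search(r'fun!$', 'D.S. is fun!') is not None

print(star_matches_empty, plus_matches_empty, optional_x)
print(dot_matches, starts_with_D, ends_with_fun)
```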

# Aside: Regular Expression Cont.
<br>

- `[]` : a set of characters (^ as first element = not)
- `\s` : whitespace character (Ex: [ \t\n\r\f\v])
- `\S` : non-whitespace character (Ex: [^ \t\n\r\f\v])
- `\w` : word character (Ex: [a-zA-Z0-9_])
- `\W` : non-word character
- `\b` : boundary between \w and \W
- and many more!
<br>

- See [regex101.com](https://regex101.com) for examples and testing
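
The character classes above can be tried out directly. This is a small sketch on made-up strings, again using only the standard `re` module.

```python
import re

text = "D.S. is fun!\nIt's  true."

non_space = re.findall(r'\S+', text)                        # runs of non-whitespace
words = re.findall(r'\w+', text)                            # runs of word characters
not_vowels = re.findall(r'[^aeiou\s]+', 'data science')     # '^' inside [] negates the set
starts_with_t = re.findall(r'\bt\w*', 'the cat sat there')  # '\b' requires a word boundary

print(non_space)
print(words)
print(not_vowels)
print(starts_with_t)
```

The last pattern skips the `t` in 'cat' and 'sat' because those letters sit between two word characters, so there is no boundary in front of them.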

# Aside: Regex Python Functions

In [None]:
r'\w*u\w*' # a string of word characters containing the letter 'u'

In [None]:
re.findall(r'\w*u\w*',sample_doc) # return all substrings that match a pattern

In [None]:
re.sub(r'\w*u\w*','XXXX',sample_doc) # substitute all substrings that match a pattern

In [None]:
re.split(r'\w*u\w*',sample_doc) # split substrings on a pattern

# NLP: Tokenization

- **tokens:** strings that make up a document ('the', 'cat',...)
- **tokenization:** convert a document into tokens
- **vocabulary:** set of unique tokens (terms) in corpus

In [None]:
# split on whitespace
re.split(r'\s+', sample_doc)

In [None]:
# find tokens of length 2+ word characters
re.findall(r'\b\w\w+\b',sample_doc)

In [None]:
# find tokens of length 2+ non-space characters
re.findall(r"\b\S\S+\b", sample_doc)

In [None]:
# example vocabulary
set(re.findall(r"\b\S\S+\b", sample_doc))

# NLP: Tokenization in spaCy
<br>
<br>

<div align="center"><img src="images/spacy_tokenization.svg" width="500px"></div>

<font size=5>From [https://spacy.io/usage/linguistic-features](https://spacy.io/usage/linguistic-features)</font>

First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:
- Does the substring match a tokenizer exception rule? For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.
- Can a prefix, suffix or infix be split off? For example punctuation like commas, periods, hyphens or quotes.
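
The two checks can be sketched with a toy tokenizer. Everything here (`EXCEPTIONS`, `PREFIXES`, `SUFFIXES`, `toy_tokenize`) is a hypothetical, heavily simplified stand-in: spaCy's real rule sets are much richer and also handle infixes.

```python
# Hypothetical, tiny rule set for illustration; not spaCy's actual rules.
EXCEPTIONS = {"don't": ["do", "n't"], "isn't": ["is", "n't"], "U.K.": ["U.K."]}
PREFIXES = '("\''
SUFFIXES = '.,!?)"\''

def toy_tokenize(text):
    tokens = []
    for chunk in text.split():                     # 1. split on whitespace
        prefix, suffix = [], []
        while chunk and chunk not in EXCEPTIONS:   # 2. stop as soon as an exception matches
            if chunk[0] in PREFIXES:               # 3a. peel off leading punctuation
                prefix.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in SUFFIXES:            # 3b. peel off trailing punctuation
                suffix.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:
                break
        middle = EXCEPTIONS.get(chunk, [chunk] if chunk else [])
        tokens.extend(prefix + middle + suffix)
    return tokens

tokens = toy_tokenize("I don't live in the U.K., sadly!")
print('|'.join(tokens))
```

Checking the exception table before peeling punctuation is what keeps the final period attached to "U.K." while the comma after it is still split off.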


# NLP: Other Options for Preprocessing
<br>

- lowercase
- remove special characters
- add `<START>`, `<END>` tags

- **lemmatization:** perform morphological analysis
  - 'studies' becomes 'study'
  - 'studying' becomes 'study'
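
A minimal sketch of these preprocessing options chained together. `TOY_LEMMAS` is a hypothetical lookup table standing in for real morphological analysis (which a library such as spaCy would provide), and `preprocess` is a made-up helper name.

```python
import re

# Hypothetical lookup table standing in for real morphological analysis.
TOY_LEMMAS = {'studies': 'study', 'studying': 'study', 'studied': 'study'}

def preprocess(doc):
    doc = doc.lower()                                      # lowercase
    doc = re.sub(r'[^a-z0-9\s]', '', doc)                  # remove special characters
    tokens = [TOY_LEMMAS.get(t, t) for t in doc.split()]   # "lemmatize" via lookup
    return ['<START>'] + tokens + ['<END>']                # add boundary tags

processed = preprocess('She studies NLP; he is studying, too!')
print(processed)
```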

# NLP: Bag of Words
    
- **Bag of Words** (BOW) representation: ignore token order

In [None]:
sample_doc

In [None]:
sample_doc.lower()

In [None]:
sorted(re.findall(r'\b\S\S+\b', sample_doc.lower()))

# NLP: n-Grams

- **Unigram:** single token
- **Bigram:** combination of two ordered tokens
- **n-Gram:** combination of n ordered tokens
- The larger *n* is, the larger the vocabulary

In [None]:
# Bigram example:
tokens = '<start> ds is fun ds is great <end>'.split()
print("bigrams     : ",    [tokens[i]+'_'+tokens[i+1] for i in range(len(tokens)-1)])
print("bigram vocab: ",set([tokens[i]+'_'+tokens[i+1] for i in range(len(tokens)-1)]))

In [None]:
# Trigrams example:
tokens = '<start> ds is fun ds is great <end>'.split()
['_'.join(tokens[i:i+3]) for i in range(len(tokens)-2)]

# NLP: TF and DF

- **Term Frequency:** number of times a term is seen per document
- $\text{tf}(t, d) = \text{count of term } t \text{ in document } d$

In [None]:
example_corpus = ['red green blue', 'red blue blue']

#Vocabulary
example_vocab = sorted(set(' '.join(example_corpus).split()))
example_vocab

In [None]:
#TF
from collections import Counter
example_tf = np.zeros((len(example_corpus),len(example_vocab)))
for i,doc in enumerate(example_corpus):
    for j,term in enumerate(example_vocab):
        example_tf[i,j] = Counter(doc.split())[term]
example_tf = pd.DataFrame(example_tf,index=['doc1','doc2'],columns=example_vocab)
example_tf

# NLP: TF and DF

- **Document Frequency:** number of documents containing each term
- $\text{df}(t) = \text{count of documents containing term } t$


In [None]:
example_tf

In [None]:
#DF
example_df = example_tf.astype(bool).sum(axis=0) # how many documents contain each term (column) 
example_df

# NLP: Stopwords

- terms that have high (or very low) DF and aren't informative
  - common English terms (ex: a, the, in,...)
  - domain specific (ex, in class slides: 'data_science')
  - often removed prior to analysis
  - in sklearn
    - `min_df` : integer > 0 : keep terms that occur in at least `min_df` documents
    - `max_df` : float in (0,1] : keep terms that occur in at most that fraction of the documents
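
To make the two thresholds concrete, here is a sketch of document-frequency filtering in plain Python. `filter_vocab` and `toy_corpus` are hypothetical names, and the function only mirrors the integer `min_df` / float `max_df` case described above, not every option sklearn accepts.

```python
from collections import Counter

def filter_vocab(corpus, min_df=1, max_df=1.0):
    """Keep terms with min_df <= df(t) <= max_df * n_docs (toy sketch)."""
    n_docs = len(corpus)
    # count each term once per document
    df = Counter(term for doc in corpus for term in set(doc.split()))
    return sorted(t for t, c in df.items() if min_df <= c <= max_df * n_docs)

toy_corpus = ['the cat sat', 'the dog sat', 'the bird flew', 'the cat flew']
kept = filter_vocab(toy_corpus, min_df=2, max_df=0.75)
print(kept)
```

Here 'the' is dropped for appearing in every document, while 'dog' and 'bird' are dropped for appearing in only one.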

# NLP: CountVectorizer in sklearn

In [None]:
example_corpus = ['blue green red', 'blue green green']

from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer(lowercase=True,    # default, transform all docs to lowercase
                        ngram_range=(1,1), # default, only unigrams
                        min_df=1,          # default, keep all terms
                        max_df=1.0,        # default, keep all terms
                       )
X_cv = cvect.fit_transform(example_corpus)
X_cv.shape

In [None]:
cvect.vocabulary_ # learned vocabulary, term:index pairs

In [None]:
cvect.get_feature_names_out() # vocabulary, sorted by index

In [None]:
X_cv.todense() # term frequencies

In [None]:
cvect.inverse_transform(X_cv) # mapping back to terms via vocabulary mapping

# NLP: TfIdf

- What if some terms are still uninformative?
- Can we downweight terms that occur in many documents?
- **Term Frequency * Inverse Document Frequency (tf-idf)**
  - $\text{tf-idf}(t,d) = \text{tf}(t, d) \times \text{idf}(t)$
  - $\text{idf}(t) = \log \frac{1+n}{1+\text{df}(t)} + 1$
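
The formulas above can be worked through by hand on a two-document corpus. This sketch uses only the standard library, takes the natural logarithm for $\log$, and finishes with l2 normalization of each row; those last two choices are assumptions made here to resemble common defaults, not something the formulas themselves require.

```python
import math
from collections import Counter

corpus = ['blue green red', 'blue green green']
vocab = sorted(set(' '.join(corpus).split()))
n = len(corpus)

# df(t): number of documents containing term t
df = {t: sum(t in doc.split() for doc in corpus) for t in vocab}
# idf(t) = log((1 + n) / (1 + df(t))) + 1
idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    tf = Counter(doc.split())
    raw = [tf[t] * idf[t] for t in vocab]          # tf-idf(t, d) = tf(t, d) * idf(t)
    norm = math.sqrt(sum(v * v for v in raw))      # l2 normalization
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in corpus]
print(vocab)
for row in rows:
    print([round(v, 2) for v in row])
```

Terms that appear in every document ('blue', 'green') get an idf of exactly 1, while 'red', which appears in only one document, is weighted up.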

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvect = TfidfVectorizer(norm='l2') # by default, also doing l2 normalization

X_tfidf = tfidfvect.fit_transform(example_corpus)
sorted(tfidfvect.vocabulary_.items(),key=lambda x: x[1])

In [None]:
X_tfidf.todense().round(2)

In [None]:
# can also be used to get term frequencies by setting use_idf to False and norm to None
TfidfVectorizer(use_idf=False, norm=None).fit_transform(example_corpus).todense()

# NLP: Classification Example

In [None]:
from sklearn.datasets import fetch_20newsgroups

ngs = fetch_20newsgroups(categories=['rec.sport.baseball','rec.sport.hockey']) # dataset has 20 categories, only get two

docs_ngs = ngs['data']                         # get documents (emails)
y_ngs = ngs['target']                          # get targets ([0,1])
target_names_ngs = ngs['target_names']         # get target names (['rec.sport.baseball','rec.sport.hockey'])

print(y_ngs[0], target_names_ngs[y_ngs[0]])    # print target int and target name of first doc
print('-'*50)                                  # print a string of 50 dashes
print(docs_ngs[0].strip()[:600])               # print beginning characters of first doc, after stripping whitespace

# NLP Example: Transform Docs

In [None]:
from sklearn.model_selection import train_test_split
docs_ngs_train,docs_ngs_test,y_ngs_train,y_ngs_test = train_test_split(docs_ngs,y_ngs, random_state = 123)

vect = TfidfVectorizer(lowercase=True,
                       min_df=5,           # occur in at least 5 documents
                       max_df=0.8,         # occur in at most 80% of documents
                       token_pattern=r'\b\S\S+\b',  # tokens of at least 2 non-space characters
                       ngram_range=(1,1),  # only unigrams
                       use_idf=False,      # term frequency counts instead of tf-idf
                       norm=None           # do not normalize
                      )
X_ngs_train = vect.fit_transform(docs_ngs_train)
X_ngs_train.shape

In [None]:
# first few terms in learned vocabulary
list(vect.vocabulary_.items())[:5]

In [None]:
# first few terms in learned stopword list
list(vect.stop_words_)[:5]

In [None]:
# first few terms in BOW representation of first document
vect.inverse_transform(X_ngs_train[0])[0][:10]

# NLP Example: Train and Evaluate Classifier

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier

scores_dummy = cross_val_score(DummyClassifier(strategy='most_frequent'),X_ngs_train,y_ngs_train)
scores_lr    = cross_val_score(LogisticRegression(),X_ngs_train,y_ngs_train)

print(f'dummy cv accuracy: {scores_dummy.mean().round(2):0.2f} +- {scores_dummy.std().round(2):0.2f}')
print(f'lr    cv accuracy: {scores_lr.mean().round(2):0.2f} +- {scores_lr.std().round(2):0.2f}')

# NLP Example: Using Pipeline

In [None]:
from sklearn.pipeline import Pipeline

# Recall: use Pipeline instead of make_pipeline to add names to the steps
#  (name,object) tuple pairs for each step
pipe_ngs1 = Pipeline([('vect',TfidfVectorizer(lowercase=True,
                                              min_df=5,
                                              max_df=0.8,
                                              token_pattern=r'\b\S\S+\b',
                                              ngram_range=(1,1),
                                              use_idf=False,
                                              norm=None )
                      ),   
                      ('lr',LogisticRegression())
                     ])

pipe_ngs1.fit(docs_ngs_train,y_ngs_train) # pass in docs, not transformed X

score_ngs1 = pipe_ngs1.score(docs_ngs_train,y_ngs_train)
print(f'lr pipeline accuracy on training set: {score_ngs1:0.3f}')

In [None]:
scores_ngs1 = cross_val_score(pipe_ngs1,docs_ngs_train,y_ngs_train) 
print(f'lr pipeline cv accuracy: {scores_ngs1.mean().round(2):0.2f} +- {scores_ngs1.std().round(2):0.2f}')

In [None]:
list(pipe_ngs1['vect'].get_feature_names_out())[-5:]

# NLP Example: Add Feature Selection

In [None]:
from sklearn.feature_selection import SelectFromModel,SelectPercentile

pipe_ngs2 = Pipeline([('vect',TfidfVectorizer(lowercase=True,
                                              min_df=5,
                                              max_df=0.8,
                                              token_pattern='\\b\\S\\S+\\b',
                                              ngram_range=(1,1),
                                              use_idf=False,
                                              norm=None )
                      ),   
                      ('fs',SelectFromModel(estimator=LogisticRegression(C=1.0,
                                                                         penalty='l1',
                                                                         solver='liblinear',
                                                                         max_iter=1000,
                                                                         random_state=123
                                                                        ))),
                      ('lr',LogisticRegression(max_iter=10000))
                     ])

pipe_ngs2.fit(docs_ngs_train,y_ngs_train)
print(f'pipeline accuracy on training set: {pipe_ngs2.score(docs_ngs_train,y_ngs_train).round(2):0.2f}')

scores_ngs2 = cross_val_score(pipe_ngs2,docs_ngs_train,y_ngs_train) 
print(f'pipeline cv accuracy             : {scores_ngs2.mean().round(2):0.2f} +- {scores_ngs2.std().round(2):0.2f}')

# NLP Example: Grid Search with Feature Selection

In [None]:
%%time
# NOTE: this may take a minute or so
params_ngs2 = {'vect__use_idf':[True,False],
              'vect__ngram_range':[(1,1),(2,2)],
              'fs__estimator__C':[10,1000],
              'lr__C':[.01,1,100]}

gscv_ngs = GridSearchCV(pipe_ngs2, params_ngs2, cv=2, n_jobs=-1).fit(docs_ngs_train,y_ngs_train)

print(f'gscv_ngs best parameters  : {gscv_ngs.best_params_}')
print(f'gscv_ngs best cv accuracy : {gscv_ngs.best_score_.round(2):0.2f}')
print(f'gscv_ngs test set accuracy: {gscv_ngs.score(docs_ngs_test,y_ngs_test).round(2):0.2f}')

# Sentiment Analysis and sklearn
<br>

- determine sentiment/opinion from unstructured text
- usually positive/negative, but is domain specific
- can be treated as a classification task (with a target, using all of the tools we know)
- can also be treated as a linguistic task (sentence parsing)
<br>

- Example: determine sentiment of movie reviews
- see [sentiment_analysis_example.ipynb](sentiment_analysis_example.ipynb)

# Topic Modeling

- What topics are our documents composed of?
- How much of each topic does each document contain?
- Can we represent documents using topic weights? (dimensionality reduction!)

- What is topic modeling?
- How does **Latent Dirichlet Allocation (LDA)** work?
- How to train and use LDA with sklearn?

# What is Topic Modeling?
<br>

- **topic:** a collection of related words
- A document can be composed of several topics
<br>

- Given a collection of documents, we can ask:
  - **What terms make up each topic?** (per topic term distribution)
  - **What topics make up each document?** (per document topic distribution)

# Topic Modeling with Latent Dirichlet Allocation (LDA)

- Unsupervised method for determining topics and topic assignments
<br>
<br>

<div align="center"><img src="images/lda_blei.jpg" width="1100px"></div>

<font size=5>From David Blei</font>

# Two Important Matrices Learned by LDA

- the **per topic term distributions** aka $\varphi$ (phi)


In [None]:
topics = ['topic1','topic2']
vocab = ['cat','baseball','play']
phi = pd.DataFrame([[0.4,.2,.4],[0.2,.4,.4]],columns=vocab,index=topics)
phi

- the **per document topic distributions** aka $\theta$ (theta)

In [None]:
topics = ['topic1','topic2']
docs = ['doc1','doc2']
theta = pd.DataFrame([[0.1,.9],[.5,.5]],columns=topics,index=docs)
theta
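
One way to see how the two matrices fit together: weighting each topic's term distribution by a document's topic proportions gives that document's overall term distribution. This is a sketch in plain Python that hard-codes the same toy numbers for $\varphi$ and $\theta$; `doc_term` is a name introduced here.

```python
vocab = ['cat', 'baseball', 'play']
phi = [[0.4, 0.2, 0.4],      # topic1: P(term | topic1)
       [0.2, 0.4, 0.4]]      # topic2: P(term | topic2)
theta = [[0.1, 0.9],         # doc1: P(topic | doc1)
         [0.5, 0.5]]         # doc2: P(topic | doc2)

# P(term | doc) = sum over topics of P(topic | doc) * P(term | topic)
doc_term = [[sum(theta[d][k] * phi[k][v] for k in range(len(phi)))
             for v in range(len(vocab))]
            for d in range(len(theta))]

for name, row in zip(['doc1', 'doc2'], doc_term):
    print(name, [round(p, 2) for p in row])
```

Because doc1 is mostly topic2, 'baseball' ends up more likely than 'cat' in doc1, and each row still sums to 1.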

# Topic Modeling: Example

- Given the data and the number of topics we want

In [None]:
corpus = ['the dog and cat played tennis',
          'tennis and baseball are sports',
          'a dog or a cat can be a pet']

M = 3 # the number of documents

vocab = ['baseball','cat','dog','pet','played','tennis']

V = len(vocab) # size of vocabulary

K = 2 # our guess about the number of topics

print(f'{M = :}\n{V = :}\n{K = :}')

# Topic Modeling: Example

- Guessing some **per topic term distributions** ($\varphi$) given the documents and vocab

In [None]:
print(vocab)

In [None]:
# the probability of each term given topic 1 (high for sports terms)
topic_1 = [.33,   0,   0,   0, .33, .33]

# the probability of each term given topic 2 (high for pet terms)
topic_2 = [  0, .25, .25, .25, .25,   0]

# per topic term distributions
phi = pd.DataFrame([topic_1, topic_2],columns=vocab,
                   index=['topic_'+str(x) for x in range(1,K+1)])

phi

# Topic Modeling: Example

- Guessing the **per document topic distributions** $\theta$ given the **topics**

In [None]:
# Given our guess about phi
display(phi)
# And the corpus
corpus

In [None]:
# generate a guess about per document topic distributions
theta = pd.DataFrame([[.50, .50],
                      [.99, .01],
                      [.01, .99]],
                     columns=['topic_'+str(x) for x in range(1,K+1)],
                     index=['doc_'+str(x) for x in range(1,M+1)])
theta

# Topic Modeling With LDA

- Given
  - a set of documents
  - a number of topics $K$
<br>

- Learn
  - the **per topic term distributions $\varphi$ (phi)**, size: $K \times V$
  - the **per document topic distributions $\theta$ (theta)**, size: $M \times K$
<br>

- How to learn $\varphi$ and $\theta$:
  - Latent Dirichlet Allocation (LDA)
  - generative statistical model
  - Blei, D., Ng, A., Jordan, M. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (Jan 2003)
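
"Generative" here means the model describes how a document could be produced from $\theta$ and $\varphi$. The sketch below shows only that sampling direction for a single document, with made-up values for `phi` and `theta_doc` and a hypothetical helper `generate_doc`. Fitting LDA goes the other way and infers the two matrices from observed documents, which this sketch does not attempt.

```python
import random

random.seed(0)                                   # arbitrary seed, for repeatability

vocab = ['baseball', 'cat', 'dog', 'pet', 'played', 'tennis']
phi = [[0.34, 0.00, 0.00, 0.00, 0.33, 0.33],     # topic 1: sports-like terms
       [0.00, 0.25, 0.25, 0.25, 0.25, 0.00]]     # topic 2: pet-like terms
theta_doc = [0.9, 0.1]                           # this document is mostly topic 1

def generate_doc(theta_d, phi, vocab, n_tokens):
    tokens = []
    for _ in range(n_tokens):
        z = random.choices(range(len(phi)), weights=theta_d)[0]   # draw a topic index
        w = random.choices(vocab, weights=phi[z])[0]              # draw a term from that topic
        tokens.append(w)
    return tokens

generated = generate_doc(theta_doc, phi, vocab, n_tokens=8)
print(generated)
```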

# Topic Modeling With LDA

- Uses for $\varphi$ (phi), the per topic term distributions:
  - inferring labels for topics
  - word clouds
<br>
<br>

- Uses for $\theta$ (theta), the per document topic distributions:
  - dimensionality reduction
  - clustering
  - similarity
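
A small sketch of the similarity and clustering uses, treating each row of $\theta$ as a low-dimensional document vector. The topic proportions below are made up, and cosine similarity is just one reasonable choice of measure.

```python
import math

# Hypothetical per document topic distributions (rows of theta), K = 3 topics
theta = {'doc_a': [0.80, 0.15, 0.05],
         'doc_b': [0.70, 0.20, 0.10],
         'doc_c': [0.05, 0.05, 0.90]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(name):
    """Return the other document whose topic vector is closest to `name`."""
    others = [d for d in theta if d != name]
    return max(others, key=lambda d: cosine(theta[name], theta[d]))

# clustering in its simplest form: assign each document to its dominant topic
dominant = {d: t.index(max(t)) for d, t in theta.items()}

print(most_similar('doc_a'), dominant)
```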

# LDA with sklearn

In [None]:
# load data from all 20 newsgroups
newsgroups = fetch_20newsgroups()
ngs_all = newsgroups.data
len(ngs_all)

In [None]:
# transform documents using tf-idf
tfidf = TfidfVectorizer(token_pattern=r'\b[a-zA-Z0-9-][a-zA-Z0-9-]+\b',min_df=50, max_df=.2)
X_tfidf = tfidf.fit_transform(ngs_all)
X_tfidf.shape

tf_idf_array = X_tfidf.toarray()

In [None]:
tf_idf_array[90:100,-10:]

In [None]:
feature_names = tfidf.get_feature_names_out()
print(feature_names[:10])
print(feature_names[-10:])

# LDA with sklearn Cont.

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

# create model with 20 topics
lda = LatentDirichletAllocation(n_components=20,  # the number of topics
                                n_jobs=-1,        # use all cpus
                                random_state=123) # for reproducibility

# learn phi (lda.components_) and theta (X_lda)
# this will take a while!
X_lda = lda.fit_transform(X_tfidf)

In [None]:
ngs_all[100][:100]

In [None]:
X_lda[100].round(2) # lda representation of document_100

In [None]:
# Note: since this is unsupervised, these numbers may change
np.argsort(X_lda[100])[::-1]#[:3] # the top topics of document_100

# LDA: Per Topic Term Distributions

In [None]:
# a utility function to print out the most likely terms for each topic
# taken from https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic {:#2d}: ".format(topic_idx)
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)

In [None]:
print_top_words(lda,feature_names,20)

# LDA Review

- What did we learn?
  - per document topic distributions
  - per topic term distributions
<br>
<br>

- What can we use this for?
  - Dimensionality Reduction/Feature Extraction!
  - investigate topics (much like PCA components)

# Other NLP Features

- Part of Speech tags
- Dependency Parsing
- Entity Detection
- Word Vectors
- See spaCy!

# Using spaCy for NLP

In [None]:
import spacy

# uncomment the line below the first time you run this cell
#%run -m spacy download en_core_web_sm
try:
    
    nlp = spacy.load("en_core_web_sm")
    
except OSError as e:
    print('Need to run the following line in a new cell:')
    print('%run -m spacy download en_core_web_sm')
    print('or the following line from the commandline with eods-f20 activated:')
    print('python -m spacy download en_core_web_sm')
    
parsed = nlp("N.Y.C. isn't in New Jersey.")
'|'.join([token.text for token in parsed])

# spaCy: Part of Speech Tagging

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $one billion.")

print(f"{'text':7s} {'lemma':7s} {'pos':5s} {'is_stop'}")
print('-'*30)
for token in doc:
    print(f'{token.text:7s} {token.lemma_:7s} {token.pos_:5s} {token.is_stop}')

In [None]:
from spacy import displacy
displacy.render(doc, style="dep")


# spaCy: Entity Detection

In [None]:
[(ent.text,ent.label_) for ent in doc.ents]

In [None]:
displacy.render(doc, style="ent")

# spaCy: Word Vectors

- word2vec
- shallow neural net
- predict a word given its surrounding context (CBOW), or the context given a word (SkipGram)
- words used in similar context should have similar vectors
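
What "similar vectors" means can be sketched with cosine similarity. The 3-dimensional vectors below are made up for illustration (real models use hundreds of dimensions), and `doc_vector` is a hypothetical helper that averages token vectors to represent a whole document.

```python
import math

# Hypothetical 3-d vectors, invented for illustration only
vectors = {'baseball': [0.9, 0.1, 0.0],
           'hockey':   [0.8, 0.2, 0.1],
           'diamond':  [0.3, 0.1, 0.9],
           'ice':      [0.2, 0.7, 0.6]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def doc_vector(tokens):
    """Average the vectors of the tokens that have one."""
    known = [vectors[t] for t in tokens if t in vectors]
    return [sum(dim) / len(known) for dim in zip(*known)]

sim_sports = cosine(vectors['baseball'], vectors['hockey'])
sim_mixed = cosine(vectors['baseball'], vectors['diamond'])
sim_docs = cosine(doc_vector(['baseball', 'diamond']), doc_vector(['hockey', 'ice']))

print(round(sim_sports, 2), round(sim_mixed, 2), round(sim_docs, 2))
```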

In [None]:
# Need either the _md or _lg models to get vector information
# Note: this takes a while!
# %run -m spacy download en_core_web_md

In [None]:
nlp = spacy.load('en_core_web_md') # _lg has a larger vocabulary

doc = nlp('Baseball is played on a diamond.')
doc[0].text, doc[0].vector.shape, list(doc[0].vector[:3])

# spaCy: Multiple Documents


In [None]:
# Use nlp.pipe to transform multiple docs at once
docs = list(nlp.pipe(['Baseball is played on a diamond.',
                      'Hockey is played on ice.',
                      'Diamonds are clear as ice.']))

In [None]:
# using average of token vectors for each document.
np.array([['{:.2f}'.format(docs[i].similarity(docs[j])) for j in range(3)]
          for i in range(3)])

# Learning Sequences

- Hidden Markov Models
- Conditional Random Fields
- Recurrent Neural Networks
- LSTM
- GPT3
- [BERT](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270)
- Transformers (MLP 16.4)

# NLP Review

- corpus, tokens, vocabulary, terms, n-grams, stopwords
- tokenization
- term frequency (TF), document frequency (DF)
- TF vs TF-IDF
- sentiment analysis
- topic modeling
<br>

- POS
- Dependency Parsing
- Entity Extraction
- Word Vectors

<br>
<br>

# <center>Questions?</center>
<br>
<br>

# Appendix: LDA Plate Diagram

<div align="center"><img src="images/Smoothed_LDA.png" width="400px"></div>

<font size=5>
    
**K** :  number of topics

**$\varphi$** : per topic term distributions

**$\beta$**  : parameters for word distribution die factory, length = V (size of vocab)

**M**     : number of documents

**N**     : number of words/tokens in each document

**$\theta$** : per document topic distributions

**$\alpha$** : parameters for topic die factory, length = K (number of topics)

**z** : topic indexes

**w** : observed tokens

</font>
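
The "die factory" idea can be sketched by sampling from a Dirichlet distribution: each draw is itself a probability vector (a loaded die) over the $K$ topics. This uses the standard trick of normalizing independent gamma draws. `sample_dirichlet` is a helper written here, and the $\alpha$ values are arbitrary choices for illustration.

```python
import random

random.seed(1)                                    # arbitrary seed, for repeatability

def sample_dirichlet(alpha):
    """Draw one probability vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

K = 3
sparse_theta = sample_dirichlet([0.1] * K)        # small alpha: mass tends to pile onto few topics
even_theta = sample_dirichlet([50.0] * K)         # large alpha: tends toward uniform over topics

print([round(p, 2) for p in sparse_theta])
print([round(p, 2) for p in even_theta])
```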