<a href="https://colab.research.google.com/github/AishaEvering/NLP_Techniques/blob/main/NLP_techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [22]:
!pip -qqq install spacy

!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


# Bag of Words Model

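Basically, count the words in each text and ignore their order. With `binary=True`, as in the cell below, each feature only records whether a word appears at all (1) or not (0).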

In [3]:
class Category:
  BOOKS = "BOOKS"
  CLOTHING = "CLOTHING"

train_x = ['I love the book', 'This is a great book.', 'The fit is great.', 'I love the shoes.']
train_y = [Category.BOOKS, Category.BOOKS, Category.CLOTHING, Category.CLOTHING]

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

#vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
vectorizer = CountVectorizer(binary=True)
train_x_vectors = vectorizer.fit_transform(train_x)

print(vectorizer.get_feature_names_out())
print(train_x_vectors.toarray())

['book' 'fit' 'great' 'is' 'love' 'shoes' 'the' 'this']
[[1 0 0 0 1 0 1 0]
 [1 0 1 1 0 0 0 1]
 [0 1 1 1 0 0 1 0]
 [0 0 0 0 1 1 1 0]]
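
To make the matrix above less magical, here is a rough pure-Python sketch of what a binary bag of words does. It is not how scikit-learn implements `CountVectorizer`; `tokenize` and `to_binary_vector` are made-up helper names, and the token rule (lowercase, tokens of two or more word characters) is meant to mirror the vectorizer's default settings.

```python
import re

train_x = ['I love the book', 'This is a great book.', 'The fit is great.', 'I love the shoes.']

def tokenize(text):
    # Lowercase and keep only tokens of 2+ word characters; this is why "I" and "a" are dropped
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token seen in training, sorted alphabetically
vocabulary = sorted({token for text in train_x for token in tokenize(text)})

def to_binary_vector(text):
    # 1 if the vocabulary word occurs in the text, else 0
    tokens = set(tokenize(text))
    return [1 if word in tokens else 0 for word in vocabulary]

train_vectors = [to_binary_vector(text) for text in train_x]

print(vocabulary)
for row in train_vectors:
    print(row)
```

Each row has one slot per vocabulary word, so two sentences look similar to the classifier only when they share exact words.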


In [12]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)

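Bag of words works well on words it saw during training, but not so much on new words. In the cell below, "books" is not in the vocabulary, so "I love the books" ends up classified as CLOTHING.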

In [21]:

test_x = vectorizer.transform(['I like the book.','Shoes are alright.', 'I love the books'])

clf_svm.predict(test_x)

array(['BOOKS', 'CLOTHING', 'CLOTHING'], dtype='<U8')
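
A small sketch of why the last prediction goes wrong: any token that is not in the training vocabulary is silently dropped before the classifier sees it. The vocabulary list is copied from the vectorizer output earlier in the notebook; `tokenize` and `split_known_unknown` are hypothetical helpers, not part of scikit-learn.

```python
import re

# Vocabulary learned from the four training sentences
vocabulary = ['book', 'fit', 'great', 'is', 'love', 'shoes', 'the', 'this']

def tokenize(text):
    # Lowercase, tokens of 2+ word characters (assumed to match the vectorizer defaults)
    return re.findall(r"\b\w\w+\b", text.lower())

def split_known_unknown(text):
    # Separate the tokens the model can use from the ones it has never seen
    tokens = tokenize(text)
    known = [t for t in tokens if t in vocabulary]
    unknown = [t for t in tokens if t not in vocabulary]
    return known, unknown

for text in ['I like the book.', 'Shoes are alright.', 'I love the books']:
    known, unknown = split_known_unknown(text)
    print(f"{text!r}: kept {known}, dropped {unknown}")
```

For "I love the books", only "love" and "the" survive, and neither word says anything about books.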

# Word Vectors

Capture the semantic meaning of a word in a vector

In [2]:
import spacy

nlp = spacy.load("en_core_web_md")

In [4]:
print(train_x)

['I love the book', 'This is a great book.', 'The fit is great.', 'I love the shoes.']


In [10]:
docs = [nlp(text) for text in train_x]
train_x_word_vectors = [x.vector for x in docs]

In [12]:
from sklearn import svm

clf_svm_wv = svm.SVC(kernel='linear')
clf_svm_wv.fit(train_x_word_vectors, train_y)

In [18]:
test_x = ["I love the book", "I love the story", "I love the hat", "These earrings hurt", "I love the books"]
test_docs = [nlp(text) for text in test_x]
test_x_word_vectors = [x.vector for x in test_docs]

clf_svm_wv.predict(test_x_word_vectors)

array(['BOOKS', 'BOOKS', 'CLOTHING', 'CLOTHING', 'BOOKS'], dtype='<U8')

Word vectors are not really good at context. In "I went to the bank and wrote a check" and "Let me check that out", the word "check" means two different things, but it gets the same word vector in both sentences.
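
The point about context can be sketched with a toy lookup table. The vectors below are invented 3-dimensional values for illustration only (`en_core_web_md` uses 300 dimensions), and `word_vector` and `sentence_vector` are hypothetical helpers. The averaging step mirrors how spaCy's `doc.vector` defaults to the mean of the token vectors.

```python
# Hypothetical 3-dimensional vectors; the numbers are made up for this sketch
static_vectors = {
    "wrote": [0.1, 0.7, 0.2],
    "a": [0.0, 0.1, 0.0],
    "check": [0.9, 0.3, 0.4],
    "let": [0.2, 0.2, 0.6],
    "me": [0.1, 0.0, 0.3],
    "that": [0.0, 0.2, 0.1],
    "out": [0.3, 0.1, 0.5],
}

def word_vector(word):
    # A static model is a lookup table: the surrounding sentence never enters the lookup
    return static_vectors[word]

def sentence_vector(words):
    # Average the word vectors component by component
    vectors = [word_vector(w) for w in words]
    return [sum(component) / len(vectors) for component in zip(*vectors)]

bank_sentence = ["wrote", "a", "check"]
verb_sentence = ["let", "me", "check", "that", "out"]

# Same word, different meaning, identical vector
print(word_vector(bank_sentence[2]) == word_vector(verb_sentence[2]))
print(sentence_vector(bank_sentence))
print(sentence_vector(verb_sentence))
```

The sentence vectors differ only because the neighbouring words differ; the contribution of "check" is the same in both.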

# NLP Techniques

## Regexes

Pattern matching on strings in Python.

Useful for password checkers, phone numbers, emails, and more!

In [25]:
import re

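# "ab", then any run of non-whitespace characters, then "cd"
regexp = re.compile(r"ab[^\s]*cd")

phrases = ['abcd', 'xxx', 'aaa abxxxcd ccc', 'ab cd']

matches = []

for phrase in phrases:
  # call search() on the compiled pattern directly
  if regexp.search(phrase):
    matches.append(phrase)

print(matches)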

['abcd', 'aaa abxxxcd ccc']


In [29]:
# \b marks a word boundary, so "read" and "story" must stand alone; "book" may match inside a longer word
regexp = re.compile(r"\bread\b|\bstory\b|book")

phrases = ['I liked that story.', "The car treaded up the hill", "this hat is nice"]

matches = []

for phrase in phrases:
  if regexp.search(phrase):
    matches.append(phrase)

print(matches)

['I liked that story.']
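
As a rough illustration of the phone-number use case mentioned at the top of this section, here is a minimal sketch. The format (3 digits, 3 digits, 4 digits, separated by "-", "." or a space) is an assumption made for this example, not a general phone-number standard, and `phone_re` is just an illustrative name.

```python
import re

# Assumed format for this sketch: ddd, ddd, dddd separated by "-", "." or a space
phone_re = re.compile(r"\d{3}[-. ]\d{3}[-. ]\d{4}")

candidates = ["555-123-4567", "555.123.4567", "5551234567", "call 555-123-4567 now", "555-12-34567"]

# fullmatch() requires the whole string to fit the pattern; search() also accepts
# strings that merely contain a phone number somewhere inside
valid = [c for c in candidates if phone_re.fullmatch(c)]
contains = [c for c in candidates if phone_re.search(c)]

print(valid)
print(contains)
```

For validation (is this string a phone number?) `fullmatch` is usually the safer choice; `search` fits extraction (is there a phone number in this text?).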


## Stemming/Lemmatization

Techniques to normalize text

- reading -> read
- books -> book
- stories ->
  - stori for stemming (stemming is an algorithm that may not give you an actual word)
  - story for lemmatizing (uses a dictionary to make sure it gives an actual word)
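
The difference between the two can be sketched with toy versions of each. This is not the Porter algorithm or WordNet: `SUFFIX_RULES`, `LEMMA_DICTIONARY`, `toy_stem`, and `toy_lemmatize` are all invented for this example and only cover the three words in the list above.

```python
# Toy suffix rules, applied in order; a real stemmer such as Porter's has many more steps
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("s", "")]

# Tiny hand-written dictionary standing in for a real lexical resource
LEMMA_DICTIONARY = {"reading": "read", "books": "book", "stories": "story"}

def toy_stem(word):
    # Strip the first matching suffix without checking that the result is a real word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    # Look the word up; fall back to the word itself when it is not in the dictionary
    return LEMMA_DICTIONARY.get(word, word)

for word in ["reading", "books", "stories"]:
    print(word, "->", toy_stem(word), "(stem),", toy_lemmatize(word), "(lemma)")
```

The stemmer is rule-driven and happily produces "stori"; the lemmatizer can only return words that exist in its dictionary.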


In [30]:
import nltk

nltk.download("wordnet")
nltk.download("stopwords")
nltk.download("punkt")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [35]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

phrase = "reading the stories in books"
words = word_tokenize(phrase)

stemmed_words = []

for word in words:
  stemmed_words.append(stemmer.stem(word))

" ".join(stemmed_words)

'read the stori in book'

In [37]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

phrase = "reading the stories in books"
words = word_tokenize(phrase)

lemmatized_words = []

for word in words:
  # pos='v' treats every token as a verb, so the noun "stories" is left as-is
  lemmatized_words.append(lemmatizer.lemmatize(word, pos='v'))

" ".join(lemmatized_words)


'read the stories in book'

## Stopwords Removal
Stop words are a set of the most common words in the English language.

e.g. this, they, he, it

In [39]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = stopwords.words("english")
print(len(stop_words), stop_words)

179 ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than

In [40]:
phrase = "Here is an example sentence demonstrating the removal of stopwords"

words = word_tokenize(phrase)

stripped_phrase = []

for word in words:
  # the stop-word list is all lowercase and this check is case-sensitive,
  # so the capitalized "Here" is kept
  if word not in stop_words:
    stripped_phrase.append(word)

" ".join(stripped_phrase)

'Here example sentence demonstrating removal stopwords'
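
Note that "Here" survives in the output above even though "here" is a stop word. A minimal sketch of a case-insensitive variant follows; the five-word `stop_words` set is a hand-picked subset for this example, and `str.split()` stands in for a real tokenizer.

```python
# Small hand-picked subset of stop words, just for this sketch
stop_words = {"here", "is", "an", "the", "of"}

phrase = "Here is an example sentence demonstrating the removal of stopwords"

# str.split() is a stand-in for a real tokenizer and only works for simple text
words = phrase.split()

# Comparing the raw token keeps "Here"; lowercasing it first removes it
case_sensitive = [w for w in words if w not in stop_words]
case_insensitive = [w for w in words if w.lower() not in stop_words]

print(" ".join(case_sensitive))
print(" ".join(case_insensitive))
```

Using a set instead of a list also makes each membership check a constant-time lookup, which matters on longer texts.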

# Various other techniques (spell correction, sentiment, & POS (part-of-speech) tagging)

In [48]:
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


In [44]:
from textblob import TextBlob

phrase = "thiss is an eexampel"

tb_phrase = TextBlob(phrase)

tb_phrase.correct()

TextBlob("this is an example")

In [55]:
phrase = "The book was great"

tb_phrase = TextBlob(phrase)

print(tb_phrase.tags, tb_phrase.sentiment, sep="\n")

[('The', 'DT'), ('book', 'NN'), ('was', 'VBD'), ('great', 'JJ')]
Sentiment(polarity=0.8, subjectivity=0.75)


# Transformer Architecture

In [2]:
!pip install spacy-transformers
!python -m spacy download en_core_web_trf

Collecting en-core-web-trf==3.7.3
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3-py3-none-any.whl (457.4 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [8]:
import spacy
import torch

nlp = spacy.load("en_core_web_trf")
doc = nlp("Apple shares rose on the news. Apple pie is delicious.")

In [9]:
class Category:
  BOOKS = "BOOKS"
  BANK = "BANK"

train_x = ["good characters and plot progression",
           "check out the book",
           "good story, would recommend",
           "novel recommendation",
           "need to make a deposit to the bank",
           "balance inquiry savings",
           "save money"]

train_y = [Category.BOOKS, Category.BOOKS, Category.BOOKS, Category.BOOKS, Category.BANK, Category.BANK, Category.BANK]

In [10]:
from sklearn import svm

docs = [nlp(text) for text in train_x]
train_x_vectors = [doc.vector for doc in docs]
clf_svm = svm.SVC(kernel="linear")

clf_svm.fit(train_x_vectors, train_y)

test_x = ['book']
docs = [nlp(text) for text in test_x]
test_x_vectors = [doc.vector for doc in docs]

clf_svm.predict(test_x_vectors)

ValueError: Found array with 0 feature(s) (shape=(7, 0)) while a minimum of 1 is required by SVC.
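
The shape `(7, 0)` in the error means the SVM received 7 training rows with 0 features each, so every `doc.vector` was empty. That is what one would expect when the loaded pipeline has no static word vectors, which, as far as I know, is the case for `en_core_web_trf`; treat that as an assumption to verify (for example by printing `doc.vector.shape`). Below is a minimal sketch of a guard that fails with a clearer message before fitting. `check_feature_vectors` is a hypothetical helper, and plain lists stand in for the real vectors.

```python
def check_feature_vectors(vectors):
    """Raise a readable error if any vector is empty (hypothetical helper)."""
    for index, vector in enumerate(vectors):
        if len(vector) == 0:
            raise ValueError(
                f"vector {index} is empty; the loaded pipeline may not provide doc.vector"
            )
    return vectors

# Stand-ins for [doc.vector for doc in docs]
good_vectors = [[0.1, 0.2], [0.3, 0.4]]
empty_vectors = [[] for _ in range(7)]  # mirrors the (7, 0) shape in the error

check_feature_vectors(good_vectors)  # passes silently

try:
    check_feature_vectors(empty_vectors)
except ValueError as err:
    print(err)
```

Because `len()` also works on NumPy arrays, the same check could be applied to `train_x_vectors` right before `clf_svm.fit(...)`.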