# Natural Language Processing (NLP)
![what-is-text-mining-definition-scaled.webp](attachment:what-is-text-mining-definition-scaled.webp)

## Introduction

Additional content: [NLP Crash Course](http://files.meetup.com/7616132/DC-NLP-2013-09%20Charlie%20Greenbacker.pdf) by Charlie Greenbacker and [Introduction to NLP](http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf) by Dan Jurafsky

In [None]:
### Text blob
# you may need to install
# pip install textblob
# pip install -U textblob
# python -m textblob.download_corpora


In [None]:
#pip install spacy
# python -m spacy download en_core_web_sm

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
pip install keras

In [None]:
pip install textblob

In [None]:
pip install tensorflow

### What is NLP?

- Using computers to process (analyze, understand, generate) natural human languages
- Most knowledge created by humans is unstructured text, and we need a way to make sense of it
- Build probabilistic models using data about a language

### What are some of the higher level task areas?

- **Information retrieval**: Find relevant results and similar results
    - [Google](https://www.google.com/)
- **Information extraction**: Structured information from unstructured documents
    - [Events from Gmail](https://support.google.com/calendar/answer/6084018?hl=en)
- **Machine translation**: One language to another
    - [Google Translate](https://translate.google.com/)
- **Text simplification**: Preserve the meaning of text, but simplify the grammar and vocabulary
    - [Rewordify](https://rewordify.com/)
    - [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page)
- **Predictive text input**: Faster or easier typing
    - [My application](https://justmarkham.shinyapps.io/textprediction/)
    - [A much better application](https://farsite.shinyapps.io/swiftkey-cap/)
- **Sentiment analysis**: Attitude of speaker
    - [Hater News](http://haternews.herokuapp.com/)
- **Automatic summarization**: Extractive or abstractive summarization
    - [autotldr](https://www.reddit.com/r/technology/comments/35brc8/21_million_people_still_use_aol_dialup/cr2zzj0)
- **Natural Language Generation**: Generate text from data
    - [How a computer describes a sports match](http://www.bbc.com/news/technology-34204052)
    - [Publishers withdraw more than 120 gibberish papers](http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763)
- **Speech recognition and generation**: Speech-to-text, text-to-speech
    - [Google's Web Speech API demo](https://www.google.com/intl/en/chrome/demos/speech.html)
    - [Vocalware Text-to-Speech demo](https://www.vocalware.com/index/demo)
- **Question answering**: Determine the intent of the question, match query with knowledge base, evaluate hypotheses
    - [How did supercomputer Watson beat Jeopardy champion Ken Jennings?](http://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/)
    - [IBM's Watson Trivia Challenge](http://www.nytimes.com/interactive/2010/06/16/magazine/watson-trivia-game.html)
    - [The AI Behind Watson](http://www.aaai.org/Magazine/Watson/watson.php)

### What are some of the lower level components?

- **Tokenization**: breaking text into tokens (words, sentences, n-grams)
- **Stopword removal**: a/an/the
- **Stemming and lemmatization**: root word
- **TF-IDF(term frequency-inverse document frequency)**: word importance
- **Part-of-speech tagging**: noun/verb/adjective
- **Named entity recognition**: person/organization/location
- **Spelling correction**: "New Yrok City"
- **Word sense disambiguation**: "buy a mouse"
- **Segmentation**: "New York City subway"
- **Language detection**: "translate this page"
- **Machine learning**

### Why is NLP hard?

- **Ambiguity**:
    - Hospitals are Sued by 7 Foot Doctors
    - Juvenile Court to Try Shooting Defendant
    - Local High School Dropouts Cut in Half
- **Non-standard English**: text messages
- **Idioms**: "throw in the towel"
- **Newly coined words**: "retweet"
- **Tricky entity names**: "Where is A Bug's Life playing?"
- **World knowledge**: "Mary and Sue are sisters", "Mary and Sue are mothers"

NLP requires an understanding of the **language** and the **world**.

#### A corpus is a collection of text
The dictionary definition is a large or complete collection of writings. The plural form is corpora.

Some examples are:
- The entire works of Oscar Wilde
- My personal library could be the Matthew Morris library corpus
- A classroom's written assignments that were due on Friday are a corpus
- All books available in Google Books
- Every Yelp review or tweet you have made, or every Yelp review or tweet made by a political party or demographic group
- Every Star Trek review written by Star Wars fans whose median rating for Star Wars films is 4 stars or higher


Libraries for NLP to be aware of

- NLTK
https://www.nltk.org/


NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

- TextBlob
https://textblob.readthedocs.io/en/dev/


TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

- spaCy
https://spacy.io/

- Support for 75+ languages
- 84 trained pipelines for 25 languages
- Multi-task learning with pretrained transformers like BERT
- Pretrained word vectors
- State-of-the-art speed
- Production-ready training system
- Linguistically-motivated tokenization
- Components for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more
- Easily extensible with custom components and attributes
- Support for custom models in PyTorch, TensorFlow and other frameworks
- Built in visualizers for syntax and NER
- Easy model packaging, deployment and workflow management
- Robust, rigorously evaluated accuracy


- Gensim
https://pypi.org/project/gensim/


Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.


- Keras

Keras is an open-source library that provides a Python interface for artificial neural networks. Keras acts as an interface for the TensorFlow library. 



There are many more libraries, and the right one depends on the use case:

https://sunscrapers.com/blog/9-best-python-natural-language-processing-nlp/

In short:
- NLTK - good for preprocessing, essentially cleaning data and getting it ready for processing.
- TextBlob - good for preprocessing and processing, e.g. noun phrase extraction, translation and sentiment analysis.
- spaCy - robust and aimed more at production. NLTK has similar abilities but a larger toolbox. spaCy is to Python as NLTK is to R.
- Gensim - handles topic modelling and similarity retrieval on large corpora, but lacks the nuance of other libraries' models.
- Keras - a high-level neural network API that is not as detailed as TensorFlow.

TensorFlow and PyTorch begin getting you into AI and AI modeling. PyTorch is the R to TensorFlow's Python.


https://medium.com/activewizards-machine-learning-company/comparison-of-top-6-python-nlp-libraries-c4ce160237eb

![1%20CApRhyf6pmFJY0nLKsRdCg.webp](attachment:1%20CApRhyf6pmFJY0nLKsRdCg.webp)

##### What are some common tokenizers available in NLTK?
- sent_tokenize
- word_tokenize
- TreebankWordTokenizer
- wordpunct_tokenize
- TweetTokenizer
- MWETokenizer


Let's take a look at tokenization.

This is the process of converting sequences of text into smaller parts known as tokens:
- sentences
- words
- whitespace
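
As a rough illustration of what a tokenizer does under the hood, the sketch below splits text into sentence and word tokens using only Python's `re` module. The function names `simple_sent_tokenize` and `simple_word_tokenize` are made up for this example, and the rules are far cruder than what NLTK or spaCy apply (abbreviations such as "U.S." would break the sentence splitter, for instance).

```python
import re

def simple_sent_tokenize(text):
    # split on whitespace that follows ., ! or ? (naive: abbreviations will confuse it)
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    # a word is a run of letters, digits or apostrophes;
    # any other non-space character becomes its own token
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

sample = "Motivation is what gets you started. Habit is what keeps you going."
sample_sentences = simple_sent_tokenize(sample)
sample_words = simple_word_tokenize(sample_sentences[0])
print(sample_sentences)
print(sample_words)
```

Note that this sketch keeps "Don't" as one token, whereas the Treebank-style tokenizers shown later split contractions.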

Use Cases


- Search engines. When you type a query into a search engine like Google, it employs tokenization to dissect your input. This breakdown helps the engine sift through billions of documents to present you with the most relevant results.
- Machine translation. Tools such as Google Translate utilize tokenization to segment sentences in the source language. Once tokenized, these segments can be translated and then reconstructed in the target language, ensuring the translation retains the original context.
- Speech recognition. Voice-activated assistants like Siri or Alexa rely heavily on tokenization. When you pose a question or command, your spoken words are first converted into text. This text is then tokenized, allowing the system to process and act upon your request.





In [None]:
#### Tokenization
quotes = ("A person who never made a mistake never tried anything new. Don't worry be happy :):D. I can't change the direction of the wind but I can adjust my sails.  Procrastination makes easy things hard and hard things harder. I find the harder I work, the more luck I seem to have. Motivation is what gets you started. Habit is what keeps you going.")
from nltk.tokenize import sent_tokenize
sent_tokenize(quotes)

In [None]:
from nltk.tokenize import word_tokenize
word_tokenize(quotes)

In [None]:
# uses the whitespace as the delimiter
quotes.split()

In [None]:
from nltk.tokenize import wordpunct_tokenize
wordpunct_tokenize(quotes)

In [None]:
'''
Dealing with Contractions ie "Don't"
The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank. This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().

This tokenizer performs the following steps:

    split standard contractions, e.g. don't -> do n't and they'll -> they 'll

    treat most punctuation characters as separate tokens

    split off commas and single quotes, when followed by whitespace

    separate periods that appear at the end of line
'''

from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(quotes)

In [None]:
import nltk
# let's deal with that emoticon
tokenizer = nltk.TweetTokenizer()
print(tokenizer.tokenize(quotes))

In [None]:
from textblob import TextBlob
#Load the words using TextBlob
blob = TextBlob(quotes)
#Tokenization of words
words = blob.words
print(words)
print(len(words))

In [None]:
# spaCy tokenizer
'''
The spaCy tokenizer supports special cases for tokens that should not be segmented:
"U.S.A." or "U.N." should be treated as one label and not split into U, S, A or U, N.
It requires the downloaded data and models for the English language.
'''
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(quotes)
# print each token with its character offset in the original text
for token in doc:
    print(token, token.idx)

In [None]:
# Gensim is mainly used for finding similarities between documents
from gensim import corpora

# first let's create a dictionary from the text
# tokenizer = nltk.TweetTokenizer() <- try later
# Dictionary expects a list of documents, each given as a list of tokens
text = [quotes.split()]
dictionary = corpora.Dictionary(text)
# word ids
print(dictionary.token2id)

'''
IDs will replace the words. So if the ID for apples is 1, any docs or text strings with apples will be
given the ID of 1. Mapping tokens to IDs is not limited to gensim; this is an introduction to the new
concept of IDs.
'''


In [None]:
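# Keras tokenization
# note: keras.preprocessing.text belongs to the older Keras API and may be missing from recent Keras releases
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import text_to_word_sequence
ntoken = Tokenizer(num_words = 15)
# fit_on_texts expects a list of texts; a bare string would be read one character at a time
ntoken.fit_on_texts([quotes])
list_words = text_to_word_sequence(quotes)
print(list_words)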

## Part 1: Reading in the Yelp Reviews

- "corpus" = collection of documents
- "corpora" = plural form of corpus

In [None]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline

In [None]:
# read yelp.csv into a DataFrame
url = 'C://Users//Matth//OneDrive//Desktop/DATA//yelp.csv'
yelp = pd.read_csv(url)

# create a new DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

yelp_best_worst = yelp_best_worst.head(50)


# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars

# split the new DataFrame into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
X.head()

In [None]:
yelp_best_worst.tail()

In [None]:
X.shape

In [None]:
y.shape

In [None]:
y.head()

## Part 2: Tokenization

- **What:** Separate text into units such as sentences or words
- **Why:** Gives structure to previously unstructured text
- **Notes:** Relatively easy with English language text, not easy with some languages

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# use CountVectorizer to create document-term matrices from X_train and X_test
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [None]:
vect

In [None]:
# rows are documents, columns are terms (aka "tokens" or "features")
X_train_dtm.shape

In [None]:
# last 50 features
print(vect.get_feature_names_out()[-50:])

In [None]:
# show vectorizer options
# check lowercase
vect

[CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

`CountVectorizer` is a text preprocessing technique commonly used in natural language processing (NLP) tasks to convert a collection of text documents into a numerical representation.

- Parameter **lowercase:** boolean, True by default
    - If True, convert all characters to lowercase before tokenizing.
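
To make the idea of a document-term matrix concrete, here is a minimal pure-Python sketch. It is not how `CountVectorizer` is implemented: the helper `build_dtm` and the toy documents are made up for illustration, and the token pattern (words of two or more characters) only mimics scikit-learn's default.

```python
import re

def build_dtm(docs, lowercase=True):
    # tokenize each document into words of two or more characters
    tokenized = []
    for doc in docs:
        if lowercase:
            doc = doc.lower()
        tokenized.append(re.findall(r"\b\w\w+\b", doc))
    # vocabulary: sorted unique tokens, one column per token
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # matrix: one row per document, one count per vocabulary term
    matrix = [[toks.count(term) for term in vocab] for toks in tokenized]
    return vocab, matrix

toy_docs = ["Great food", "great service, great food"]
vocab, matrix = build_dtm(toy_docs)
vocab_cased, _ = build_dtm(toy_docs, lowercase=False)
print(vocab, matrix)
print(vocab_cased)
```

With `lowercase=False`, "Great" and "great" become separate columns, which is why the feature count grows when lowercasing is turned off.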

In [None]:
# We will not convert to lowercase this time
vect = CountVectorizer(lowercase=False)
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

In [None]:
# first 50 features
print(vect.get_feature_names_out()[:50])

- Parameter **ngram_range:** tuple (min_n, max_n)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
    
https://en.wikipedia.org/wiki/N-gram
    
![LARGER_FONT_VERSION_Six_n-grams_frequently_found_in_titles_of_publications_about_Coronavirus_disease_2019,_as_of_7_May_2020.svg.png](attachment:LARGER_FONT_VERSION_Six_n-grams_frequently_found_in_titles_of_publications_about_Coronavirus_disease_2019,_as_of_7_May_2020.svg.png)

In short, it controls how many consecutive letters or words are combined into a single feature.
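
A small sketch of what `ngram_range=(min_n, max_n)` means for word tokens. The helper `word_ngrams` is hypothetical and only illustrates the idea; it is not scikit-learn code.

```python
def word_ngrams(tokens, min_n, max_n):
    # collect every run of n consecutive tokens for min_n <= n <= max_n
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

ngram_tokens = ["new", "york", "city", "subway"]
print(word_ngrams(ngram_tokens, 1, 2))
```

With the range (1, 2) the four tokens yield four 1-grams plus three 2-grams, which is why the number of features jumps when 2-grams are included.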

In [None]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

In [None]:
# last 50 features
print(vect.get_feature_names_out()[-50:])

**Predicting the star rating:**

In [None]:



# use default options for CountVectorizer
vect = CountVectorizer()

# create document-term matrices
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# use Naive Bayes to predict the star rating
'''
The Naïve Bayes classifier is a supervised machine learning algorithm, which is used for classification tasks, 
like text classification. It is also part of a family of generative learning algorithms, meaning that it seeks 
to model the distribution of inputs of a given class or category.
'''



nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

# calculate accuracy
print(metrics.accuracy_score(y_test, y_pred_class))
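
To show roughly what a multinomial Naive Bayes classifier computes, here is a from-scratch sketch on made-up toy reviews. The functions `train_nb` and `predict_nb` are hypothetical names, whitespace splitting stands in for real tokenization, and `alpha=1.0` is assumed as the additive smoothing value. It is an illustration of the idea, not a substitute for `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    # count documents per class and word occurrences per class
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab, alpha

def predict_nb(model, doc):
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # log prior: share of training documents in this class
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in doc.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # log likelihood of the word given the class, with additive smoothing
            numerator = word_counts[label][word] + alpha
            denominator = total_words + alpha * len(vocab)
            score += math.log(numerator / denominator)
        scores[label] = score
    # predict the class with the highest log score
    return max(scores, key=scores.get)

nb_docs = ["great food great service", "terrible food", "great place", "terrible rude service"]
nb_labels = [5, 1, 5, 1]
nb_model = train_nb(nb_docs, nb_labels)
print(predict_nb(nb_model, "great service"))
print(predict_nb(nb_model, "rude and terrible"))
```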

In [None]:
# calculate null accuracy
y_test_category = pd.DataFrame(np.where(y_test==5, 'best', 'worst'), columns = ['rating'])
y_test_category.rating.value_counts() / len(y_test_category.rating)

In [None]:
# alternate way to calculate null accuracy
y_test_binary = np.where(y_test==5, 1, 0)
max(y_test_binary.mean(), 1 - y_test_binary.mean())
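
Null accuracy is the accuracy you would get by always predicting the most frequent class, so it is the baseline any model has to beat. A dependency-free sketch, using a hypothetical helper `null_accuracy` and made-up labels:

```python
from collections import Counter

def null_accuracy(labels):
    # accuracy achieved by always predicting the most frequent class
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

baseline_labels = [5, 5, 5, 1]
print(null_accuracy(baseline_labels))
```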

In [None]:
# define a function that accepts a vectorizer object and calculates the accuracy
def tokenize_test(vect):
    X_train_dtm = vect.fit_transform(X_train)
    print('Features: ', X_train_dtm.shape[1])
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [None]:
vect = CountVectorizer()
tokenize_test(vect)

In [None]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)

## Part 3: Stopword Removal

- **What:** Remove common words that will likely appear in any text
- **Why:** They don't tell you much about your text

In [None]:
# show vectorizer options
vect

- **stop_words:** string {'english'}, list, or None (default)
    - If 'english', a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    - If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
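
The two ways of handling stop words described above can be sketched in plain Python. Both helpers (`remove_stop_words` and `corpus_stop_words`) and the tiny stop word list are made up for illustration; scikit-learn's built-in English list is much longer.

```python
def remove_stop_words(tokens, stop_words):
    # keep only the tokens that are not in the stop word collection
    stop_set = set(stop_words)
    return [tok for tok in tokens if tok not in stop_set]

def corpus_stop_words(tokenized_docs, max_df=0.7):
    # terms whose document frequency is strictly above max_df
    # are treated as corpus-specific stop words
    n_docs = len(tokenized_docs)
    vocab = {tok for doc in tokenized_docs for tok in doc}
    return sorted(
        term for term in vocab
        if sum(term in doc for doc in tokenized_docs) / n_docs > max_df
    )

toy_stop_words = ["a", "an", "the", "is", "was"]
review_tokens = "the food was great and the service is fast".split()
print(remove_stop_words(review_tokens, toy_stop_words))

toy_corpus = [["the", "food"], ["the", "service"], ["the", "food", "price"]]
print(corpus_stop_words(toy_corpus, max_df=0.7))
```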

In [None]:
# remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)

In [None]:
# set of stop words
print(vect.get_stop_words())

## Part 4: Other CountVectorizer Options

- **max_features:** int or None, default=None
    - If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
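
A rough sketch of what limiting the vocabulary by term frequency looks like. The helper `top_terms` and the toy corpus are hypothetical; this only illustrates the selection rule described above.

```python
from collections import Counter

def top_terms(tokenized_docs, max_features):
    # total term frequency across the whole corpus
    totals = Counter(tok for doc in tokenized_docs for tok in doc)
    # keep the max_features most frequent terms, then sort them alphabetically
    kept = [term for term, _ in totals.most_common(max_features)]
    return sorted(kept)

feature_corpus = [["great", "food", "great"], ["great", "service"], ["food", "price"]]
print(top_terms(feature_corpus, 2))
```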

In [None]:
# remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)

In [None]:
# all features
print(vect.get_feature_names_out())

In [None]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)

In [None]:
# include 1-grams and 2-grams, and limit the number of features
vect = CountVectorizer(ngram_range=(1, 2), max_features=1000)
tokenize_test(vect)

- **min_df:** float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts.
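
The difference between an integer and a float `min_df` can be sketched as follows. The helper `filter_by_min_df` is a made-up name and the corpus is a toy one; note that document frequency counts documents, not total occurrences.

```python
def filter_by_min_df(tokenized_docs, min_df):
    n_docs = len(tokenized_docs)
    # document frequency: number of documents that contain each term
    doc_freq = {}
    for doc in tokenized_docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # a float threshold is a proportion of documents, an int is an absolute count
    threshold = min_df * n_docs if isinstance(min_df, float) else min_df
    return sorted(term for term, freq in doc_freq.items() if freq >= threshold)

df_corpus = [["great", "food", "great"], ["great", "service"], ["food", "price"]]
print(filter_by_min_df(df_corpus, 2))    # terms in at least 2 documents
print(filter_by_min_df(df_corpus, 0.3))  # terms in at least 30% of documents
```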

In [None]:
# include 1-grams and 2-grams, keep at most 300 features, and only include terms that appear in at least 2 documents
vect = CountVectorizer(ngram_range=(1, 2),  max_features=300, min_df=2)
tokenize_test(vect)

In [None]:
# include 1-grams and 2-grams, keep at most 30000 features, and only include terms that appear in at least 3 documents
vect = CountVectorizer(ngram_range=(1, 2),  max_features=30000, min_df=3)
tokenize_test(vect)

## Part 5: Introduction to TextBlob

TextBlob: "Simplified Text Processing"

In [None]:
# print the first review
print(yelp_best_worst.text.iloc[0])

In [None]:
# save it as a TextBlob object
review = TextBlob(yelp_best_worst.text.iloc[0])

In [None]:
type(review)

In [None]:
# download the NLTK data that TextBlob relies on (large download, only needed once)
import nltk
nltk.download('all')
# list the words
review.words

In [None]:
# list the sentences
review.sentences

In [None]:
# some string methods are available
review.lower()

## Part 6: Stemming and Lemmatization

**Stemming:**

- **What:** Reduce a word to its base/stem/root form
- **Why:** Often makes sense to treat related words the same way
- **Notes:**
    - Uses a "simple" and fast rule-based approach
    - Stemmed words are usually not shown to users (used for analysis/indexing)
    - Some search engines treat words with the same stem as synonyms
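
To give a feel for the rule-based approach, here is a deliberately tiny suffix-stripping stemmer. It is a toy written for this example (`toy_stem` is a hypothetical name), not the Snowball algorithm, which applies many more rules and conditions.

```python
def toy_stem(word):
    # strip the first matching suffix, keeping at least three characters of stem
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ["walking", "walked", "walks", "walk", "bus"]])
```

Crude rules like these also produce non-words: `toy_stem("running")` gives "runn", which is one reason stemmed forms are usually kept internal rather than shown to users.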

In [None]:
# initialize stemmer
stemmer = SnowballStemmer('english')

# stem each word
print([stemmer.stem(word) for word in review.words])

**Lemmatization**

- **What:** Derive the canonical form ('lemma') of a word
- **Why:** Can be better than stemming
- **Notes:** Uses a dictionary-based approach (slower than stemming)

In [None]:
# assume every word is a noun
print([word.lemmatize() for word in review.words])

In [None]:
# assume every word is a verb
print([word.lemmatize(pos='v') for word in review.words])

In [None]:
# define a function that accepts text and returns a list of lemmas
def split_into_lemmas(text):
    #text = str(text, 'utf-8').lower()
    words = TextBlob(text).words
    return [word.lemmatize() for word in words]

In [None]:
# use split_into_lemmas as the feature extraction function (WARNING: SLOW!)
vect = CountVectorizer(analyzer=split_into_lemmas)
tokenize_test(vect)

In [None]:
# last 50 features
print(vect.get_feature_names_out()[-50:]) # used to be get_feature_names

## Part 7: Term Frequency-Inverse Document Frequency (TF-IDF)

- **What:** Computes "relative frequency" that a word appears in a document compared to its frequency across all documents
- **Why:** More useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents)
- **Notes:** Used for search engine scoring, text summarization, document clustering
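
A minimal sketch of the classic textbook form of TF-IDF, where the score is the term count multiplied by log(N / document frequency). The function `tf_idf` is a made-up helper. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math

def tf_idf(tokenized_docs):
    n_docs = len(tokenized_docs)
    # document frequency: number of documents that contain each term
    doc_freq = {}
    for doc in tokenized_docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    scores = []
    for doc in tokenized_docs:
        # term frequency times log-scaled inverse document frequency
        scores.append({t: doc.count(t) * math.log(n_docs / doc_freq[t]) for t in set(doc)})
    return scores

tfidf_docs = [["call", "you", "tonight"], ["call", "me", "cab"], ["please", "call", "me", "please"]]
tfidf_scores = tf_idf(tfidf_docs)
for row in tfidf_scores:
    print(row)
```

"call" appears in every document, so its score is zero everywhere, while "please" is frequent in one document and absent from the others, so it scores highest there.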

In [None]:
# example documents
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [None]:
# Term Frequency
vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names_out())
tf

In [None]:
# Document Frequency
vect = CountVectorizer(binary=True)
df = vect.fit_transform(simple_train).toarray().sum(axis=0)
pd.DataFrame(df.reshape(1, 6), columns=vect.get_feature_names_out())

In [None]:
# Calculate a simple form of Term Frequency-Inverse Document Frequency
tf/df

In [None]:
# TfidfVectorizer
vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names_out())

**More details:** [TF-IDF is about what matters](http://planspace.org/20150524-tfidf_is_about_what_matters/)

## Part 8: Using TF-IDF to Summarize a Yelp Review

Reddit's autotldr uses the [SMMRY](http://smmry.com/about) algorithm, which is based on TF-IDF!

In [None]:
# create a document-term matrix using TF-IDF
vect = TfidfVectorizer(stop_words='english')
dtm = vect.fit_transform(yelp.text)
features = vect.get_feature_names_out()
dtm.shape

In [None]:
type(dtm)

In [None]:
def summarize():

    # map each feature name to its column index in the document-term matrix
    # (get_feature_names_out returns an array, which has no .index method)
    feature_index = {term: i for i, term in enumerate(features)}

    # choose a random review that is at least 300 characters
    review_length = 0
    while review_length < 300:
        review_id = np.random.randint(0, len(yelp))
        review_text = str(yelp.text[review_id])
        review_length = len(review_text)

    # create a dictionary of words and their TF-IDF scores
    word_scores = {}
    for word in TextBlob(review_text).words:
        word = word.lower()
        if word in feature_index:
            word_scores[word] = dtm[review_id, feature_index[word]]

    # print words with the top 5 TF-IDF scores
    print('TOP SCORING WORDS:')
    top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:5]
    for word, score in top_scores:
        print(word)

    # print 5 random words (fewer if the review has fewer scored words)
    print('\n' + 'RANDOM WORDS:')
    n_random = min(5, len(word_scores))
    random_words = np.random.choice(list(word_scores.keys()), size=n_random, replace=False)
    for word in random_words:
        print(word)

    # print the review
    print('\n' + review_text)

In [None]:
tokenize_test(vect)

In [None]:
# create a document-term matrix using TF-IDF
vect = TfidfVectorizer(stop_words='english',max_features=10000)
dtm = vect.fit_transform(yelp.text)
tokenize_test(vect)

## Part 9: Sentiment Analysis

In [None]:
print(review)

In [None]:
# polarity ranges from -1 (most negative) to 1 (most positive)
review.sentiment.polarity

In [None]:
# understanding the apply method
yelp['length'] = yelp.text.apply(len)
yelp.head(1)

In [None]:
# define a function that accepts text and returns the polarity
def detect_sentiment(text):
    #return TextBlob(text.decode('utf-8')).sentiment.polarity
    return TextBlob(text).sentiment.polarity

In [None]:
# create a new DataFrame column for sentiment (WARNING: SLOW!)
yelp['sentiment'] = yelp.text.apply(detect_sentiment)

In [None]:
yelp.head()

In [None]:
# box plot of sentiment grouped by stars
yelp.boxplot(column='sentiment', by='stars')

In [None]:
# reviews with most positive sentiment
yelp[yelp.sentiment == 1].text.head()

In [None]:
# reviews with most negative sentiment
yelp[yelp.sentiment == -1].text.head()

In [None]:
# widen the column display
pd.set_option('display.max_colwidth', 500)

In [None]:
# negative sentiment in a 5-star review
yelp[(yelp.stars == 5) & (yelp.sentiment < -0.3)].head(1)

In [None]:
# positive sentiment in a 1-star review
yelp[(yelp.stars == 1) & (yelp.sentiment > 0.5)].head(1)

In [None]:
# reset the column display width
pd.reset_option('display.max_colwidth')

## Bonus: Adding Features to a Document-Term Matrix

In [None]:
# create a DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# define X and y
feature_cols = ['text', 'sentiment', 'cool', 'useful', 'funny']
X = yelp_best_worst[feature_cols]
y = yelp_best_worst.stars

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# use CountVectorizer with text column only
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train.text)
X_test_dtm = vect.transform(X_test.text)
print(X_train_dtm.shape)
print(X_test_dtm.shape)

In [None]:
# shape of other four feature columns
X_train.drop('text', axis=1).shape

In [None]:
# cast other feature columns to float and convert to a sparse matrix
extra = sp.sparse.csr_matrix(X_train.drop('text', axis=1).astype(float))
extra.shape

In [None]:
# combine sparse matrices
X_train_dtm_extra = sp.sparse.hstack((X_train_dtm, extra))
X_train_dtm_extra.shape

In [None]:
# repeat for testing set
extra = sp.sparse.csr_matrix(X_test.drop('text', axis=1).astype(float))
X_test_dtm_extra = sp.sparse.hstack((X_test_dtm, extra))
X_test_dtm_extra.shape

In [None]:
# use logistic regression with text column only
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm, y_train)
y_pred_class = logreg.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred_class))

In [None]:
# use logistic regression with all features
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm_extra, y_train)
y_pred_class = logreg.predict(X_test_dtm_extra)
print(metrics.accuracy_score(y_test, y_pred_class))

## Bonus: Fun TextBlob Features

In [None]:
# spelling correction
TextBlob('15 minuets late').correct()

In [None]:
# spellcheck
Word('parot').spellcheck()

In [None]:
# definitions
Word('bank').define('v')

In [None]:
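# language identification
# note: detect_language() was deprecated and later removed from TextBlob,
# so this call may fail on recent versions (langdetect is used as an alternative further down)
TextBlob('Hola Amigos').detect_language()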

In [None]:
#https://stackoverflow.com/questions/61479063/how-to-efficiently-detect-language-for-a-string-on-python-list
#pip install langdetect

In [None]:
from langdetect import detect
# https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes
# language identification
detect('Hola Amigos')

## Conclusion

- Understanding the basics broadens the types of data you can work with
- Simple techniques go a long way
- Use scikit-learn for NLP whenever possible