<font face="serif" size="6" color="scarlet">Natural Language Processing</font>

Natural language processing (NLP) is a field of machine learning and deep learning concerned with understanding, analyzing, manipulating and generating language. Humans communicate through language across many mediums, and meaning depends on context, intonation, inflection and body language, which makes language hard for machines to process. The first major advance in machine language processing came in 1950, when Alan Turing published "Computing Machinery and Intelligence". This paper established the Turing Test, a criterion for how well a computer can impersonate a human. In 1957, Noam Chomsky's book Syntactic Structures revolutionized our understanding of linguistics. A few decades then passed with little real progress. It wasn't until the late 1980s, when ML algorithms were introduced, that NLP showed real promise.

 <font face="script" size="4">"Learn a language and you'll avoid a war"-Arab proverb</font>
        

_NLP here does not mean neuro-linguistic programming (a pseudoscience; think changing behavior through hypnosis). Natural Language Understanding (NLU) is similar to NLP but not identical: NLP focuses on turning unstructured data into structured data, while NLU focuses on content or sentiment analysis._

<font face="script" size="6" color="scarlet">NLP in the Real World</font>
 
 Many everyday tools we take for granted rely on NLP to function: spell check and auto-complete, voice recognition and voice-to-text, spam filters, search engines, Siri and Alexa, and Google Translate.
 
 - [AI having a convo](https://youtu.be/WnzlbyTZsQY)
 - [Summarize text](https://smmry.com/)
 - [Jennings vs. Watson](https://www.ted.com/talks/ken_jennings_watson_jeopardy_and_me_the_obsolete_know_it_all)

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
%matplotlib inline 

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize 
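# Download the NLTK resources used later in this notebook (tokenizer, stop words,
# WordNet for lemmatization, POS tagger). Newer NLTK releases may also ask for
# 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
for resource in ['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger']:
    nltk.download(resource)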

In [None]:
df = pd.read_csv('job_scrape6.csv')
df.info()

In [None]:
df.head()

<font face="script" size="6" color="scarlet">Preprocessing, Feature Engineering and EDA</font>
* Casing 
* Punctuation 
* Stop word removal 
* Tokenization 

* Stemming 
* Lemmatization 
* POS tagging 

All of these help normalize our data and reduce its randomness and dimensionality.

→ Removal of duplicate whitespaces and punctuation.<br/>
→ Accent removal (if your data includes diacritical marks from other languages, this helps reduce errors related to encoding type).<br/>
→ Capital letter removal (working with lowercase words often delivers better results; in some cases, however, capital letters are important for extracting information such as names and locations). <br/>
→ Removal or substitution of special characters/emojis (e.g.: remove hashtags). <br/>
→ Substitution of contractions (very common in English; e.g.: ‘I’m’→‘I am’). <br/>
→ Transform word numerals into numbers (e.g.: ‘twenty three’→‘23’). <br/>
→ Substitution of values for their type (e.g.: ‘$50’→‘MONEY’). <br/>
→ Acronym normalization (e.g.: ‘US’→‘United States’/‘U.S.A’) and abbreviation normalization (e.g.: ‘btw’→‘by the way’). <br/>
→ Normalize date formats, social security numbers or other data that have a standard format. <br/>
→ Spell correction (a word can be misspelled in countless ways, so correcting spelling reduces vocabulary variation); this is very important when dealing with open user input such as tweets, IMs and emails. <br/>
→ Removal of gender/time/grade variation with Stemming or Lemmatization. <br/>
→ Substitution of rare words for more common synonyms. <br/>
→ Stop word removal (more a dimensionality reduction technique than a normalization technique, but listed here for completeness).
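
A few of the steps above can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the `normalize` function and the three-entry `CONTRACTIONS` table are hypothetical names made up for this example, and the order of the steps is one reasonable choice among several.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny contraction table; a real one would be much larger
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}

def normalize(text):
    """Apply a handful of the normalization steps listed above to one string."""
    text = text.lower()                                   # casing
    text = text.replace("\u2019", "'")                    # curly apostrophe -> straight
    # accent removal: decompose characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # contraction substitution, word by word
    text = " ".join(CONTRACTIONS.get(word, word) for word in text.split())
    text = re.sub(r"\$\d+(?:\.\d+)?", "MONEY", text)      # value -> its type
    text = re.sub(r"[^\w\s]", "", text)                   # punctuation removal
    return re.sub(r"\s+", " ", text).strip()              # collapse whitespace

print(normalize("I\u2019m   paid $50 at the Caf\u00e9!"))
```

Contractions are expanded before punctuation is stripped; in the other order, "i'm" would already have become "im" and the lookup would miss it.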

In [None]:
# Lowercase every word so that "Python" and "python" are not counted as two different words
# (the split/join also collapses repeated whitespace)
df['lower_desc'] = df['description'].apply(lambda text: " ".join(word.lower() for word in text.split()))
df['lower_desc'].head()

<font face="script" size="6" color="scarlet">Regular Expressions</font>

Regular expressions are specially encoded text strings used as patterns for matching sets of strings.
![](regex_cheat_sheet.png)
<a href="https://www.debuggex.com/cheatsheet/regex/python">Regex Cheatsheet</a>
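
As a small illustration of pattern matching, the sketch below applies two patterns to a made-up job title string with Python's built-in `re` module. The sample string is invented for this example.

```python
import re

sample = "Sr. Data Scientist (NLP) - $120k/yr, remote!"

# [^\w\s] matches any single character that is neither a word character
# (letter, digit, underscore) nor whitespace, i.e. punctuation and symbols
no_punct = re.sub(r"[^\w\s]", "", sample)
print(no_punct)

# \d+ matches one or more consecutive digits
numbers = re.findall(r"\d+", sample)
print(numbers)
```

The raw-string prefix `r"..."` keeps Python from interpreting backslashes before the regex engine sees them.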

In [None]:
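# Removing punctuation: drop every character that is not a word character or whitespace.
# regex=True is required; recent pandas versions treat the pattern as a literal string by default.
df['lower_desc'] = df['lower_desc'].str.replace(r'[^\w\s]', '', regex=True)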
df['lower_desc'].head()

<font face="serif" size="4">**Stop Words Removal** - words that don't contribute to the significance or meaning of a document </font>

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
print(stop)


In [None]:
df['char_count'] = df['description'].str.len() #how many characters do we have in description? 
print(df[['description','char_count']].head())
print(df['char_count'].mean())

In [None]:
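# How many stop words does each description contain?
# Lowercase each word before the lookup, since the stop word list is all lowercase
df['stopwords'] = df['description'].apply(lambda text: len([word for word in text.split() if word.lower() in stop]))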
df[['description','stopwords']].head(10)

In [None]:
# Removing stop words: keep only the words that are not in the stop word list
df['lower_desc'] = df['lower_desc'].apply(lambda text: " ".join(word for word in text.split() if word not in stop))
df['lower_desc'].head()

In [None]:
# The 20 most frequent words across all descriptions
freq = pd.Series(' '.join(df['lower_desc']).split()).value_counts()[:20]
freq

In [None]:
df.head()

<font face="script" size="6" color="scarlet">Tokenization</font>

Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
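
To make the idea concrete, here is a rough sketch of sentence and word tokenization using only regular expressions. It is a simplified stand-in, not the algorithm NLTK uses; `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical helper names for this example.

```python
import re

def simple_word_tokenize(text):
    # \w+ grabs runs of word characters; [^\w\s] grabs each punctuation mark on its own
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split after ., ! or ? when followed by whitespace
    # (naive: it would wrongly split after abbreviations such as "Dr.")
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "We need a data scientist. Python is required!"
print(simple_sent_tokenize(text))
print(simple_word_tokenize(text))
```

Trained tokenizers handle abbreviations, contractions and similar edge cases that this sketch ignores.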

In [None]:
desc_str = ' '.join(df['lower_desc'].tolist())
print(desc_str[:500]) # preview only; the full string holds every description

In [None]:
tokens = nltk.word_tokenize(desc_str) #tokenizing 
print(len(tokens))

<font face="script" size="6" color="scarlet">Stemming</font>
- A technique that removes affixes from a word, leaving only the stem. For example, 'play' is the stem of 'playing' and 'ing' is the affix. Stemming maps related word forms to the same token, so the algorithm only has to learn the stem of the word instead of the stem and all its variants.
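
The idea can be sketched with a toy suffix stripper. This is not the Porter algorithm, whose rule set is far richer; `toy_stem` and its four-entry suffix list are assumptions made up for this illustration.

```python
# Hypothetical suffix list, checked in order; Porter's real rules are much more detailed
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # keep at least three characters so short words like "is" survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["playing", "played", "plays", "play"]])
```

All four forms collapse to the same stem, which is the effect stemming aims for. A stripper this crude also produces non-words for many inputs.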

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer 
porter = PorterStemmer() #instantiate
lemma = WordNetLemmatizer() #instantiate 

In [None]:
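# stem() works on one word at a time, so split the sentence and stem each word
print([porter.stem(word) for word in "I studied physics".split()])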

<font face="script" size="6" color="scarlet">Lemmatization</font>
- Similar to stemming, but it uses morphological analysis (how a word is built from its base form plus inflections) together with a vocabulary, so the result is a real dictionary word. A lemma is the base form shared by all of a word's inflected forms; for example, 'study' is the lemma of 'studies' and 'studied'.
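
A lookup table is enough to show why the part of speech matters for lemmatization. The sketch below is a toy: the `LEMMAS` dictionary and `toy_lemmatize` are hypothetical, and a real lemmatizer consults a full lexicon rather than five hand-written entries.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech)
LEMMAS = {
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("better", "a"): "good",
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
}

def toy_lemmatize(word, pos="n"):
    # fall back to the lowercased word itself when the dictionary has no entry
    return LEMMAS.get((word.lower(), pos), word.lower())

print(toy_lemmatize("saw", pos="v"))     # the verb form
print(toy_lemmatize("saw", pos="n"))     # the noun form
print(toy_lemmatize("better", pos="a"))  # an irregular adjective
```

A suffix-stripping stemmer cannot map 'better' to 'good' or tell the two senses of 'saw' apart; that takes a dictionary and a part-of-speech hint.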

In [None]:
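print(lemma.lemmatize("physics"))
# lemmatize() treats the word as a noun unless told otherwise; pass pos="v" for verbs
print(lemma.lemmatize("studied", pos="v"))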

<font face="script" size="6" color="scarlet">POS Tagging</font>

In [None]:
tokens_pos = nltk.pos_tag(tokens)
pos_df = pd.DataFrame(tokens_pos, columns = ('word','POS'))
pos_sum = pos_df.groupby('POS', as_index=False).count() # group by POS tags
pos_sum.sort_values(['word'], ascending=[False]) # in descending order of number of words per tag

In [None]:
#getting just the nouns
noun_tags = {'NN', 'NNS', 'NNP', 'NNPS'}
filtered_pos = [pair for pair in tokens_pos if pair[1] in noun_tags]
print(len(filtered_pos))

In [None]:
#the 100 most common nouns
fdist_pos = nltk.FreqDist(filtered_pos)
top_100_words = fdist_pos.most_common(100)
print(top_100_words)

In [None]:
top_words_df = pd.DataFrame(top_100_words, columns = ('pos','count'))
top_words_df['Word'] = top_words_df['pos'].apply(lambda x: x[0]) # split the tuple of POS
top_words_df = top_words_df.drop(columns='pos') # drop the previous column (the positional axis argument was removed in pandas 2.0)
top_words_df.head(10)

In [None]:
fig, ax = plt.subplots(figsize=(15,18))
top_words_df.sort_values(by='count').plot.barh(x='Word',
                      y='count',
                      ax=ax,
                      color="purple")

ax.set_title("Most Common Nouns Found in DS Job Descriptions (Without Stop Words)")

plt.show()

In [None]:
from textblob import TextBlob, Word
from wordcloud import WordCloud

In [None]:
word_counts = ' '.join(top_words_df['Word'].tolist())
print(type(word_counts))

In [None]:
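# Size each word by its count. Calling generate() on the joined string above would
# weight every word about equally, because each word appears in it only once or twice.
word_freqs = top_words_df.groupby('Word')['count'].sum().to_dict()
wordcloud = WordCloud().generate_from_frequencies(word_freqs)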
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

<font face="script" size="6" color="scarlet">Modeling</font>

### Naive Bayes Modeling

Naive Bayes models lend themselves well to NLP problems. Consider the task of trying to predict genre from text. My subjective probability that a text belongs to a certain genre would be a function of the words in the text. So e.g. the (prior) probability that a text is science-fiction may be relatively small. But the probability that a text is science-fiction *given that it uses the word 'cyclotron'* may be quite high.
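
The genre example can be worked through with Bayes' rule. All four probabilities below are invented purely for illustration; only the arithmetic is the point.

```python
# All numbers below are made up for illustration
p_scifi = 0.05                 # prior: P(science-fiction)
p_word_given_scifi = 0.20      # P('cyclotron' appears | science-fiction)
p_word_given_other = 0.001     # P('cyclotron' appears | any other genre)

# Total probability of seeing the word: sum over genres of P(word | genre) * P(genre)
p_word = p_word_given_scifi * p_scifi + p_word_given_other * (1 - p_scifi)

# Bayes' rule: P(genre | word) = P(word | genre) * P(genre) / P(word)
p_scifi_given_word = p_word_given_scifi * p_scifi / p_word
print(round(p_scifi_given_word, 3))
```

With these assumed numbers, a 5% prior rises to roughly 91% once the word is observed. A Naive Bayes classifier repeats this update for every word in the text, treating the words as independent given the genre.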

### TF-IDF 



<center><img src="tfidf.png" height=600 width=600></center>

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import TfidfVectorizer
import string

In [None]:
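# Placeholder target: random 0/1 labels stand in for a real label column,
# so the classifiers below have no true signal to learn and accuracy should sit near chance
df['target'] = np.random.randint(0, 2, df.shape[0])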
df.head()

In [None]:
#setting our target & features 
X = df['lower_desc']
y = df['target'] 

# generate a list of stopwords for TfidfVectorizer to ignore
stopwords_list = stopwords.words('english') + list(string.punctuation)

Let's create a function that takes in our various texts along with their respective labels and uses TF-IDF to vectorize the texts. Recall that TF-IDF helps us "vectorize" text (turn text into numbers) so we can do "math" with it. It is used to reflect how relevant a term is in a given document in a numerical way.

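TF-IDF weights a word more heavily when it is frequent within one document but rare across the corpus, and less heavily when it appears in most documents. This rescaling makes word scores comparable between the texts in the corpus.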
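
To see where the numbers come from, here is a hand-rolled TF-IDF on three made-up "documents", using the plain definition tf × log(N / df). It is a sketch for intuition only: scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its values will differ from these.

```python
import math
from collections import Counter

# Three made-up documents, already lowercased and stripped of punctuation
docs = [
    "python sql python",
    "sql excel",
    "python statistics",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    tf = tokens.count(term) / len(tokens)      # term frequency within one document
    idf = math.log(n_docs / doc_freq[term])    # plain log(N / df), no smoothing
    return tf * idf

for tokens in tokenized:
    print({term: round(tf_idf(term, tokens), 3) for term in sorted(set(tokens))})
```

In the second document, "excel" scores higher than "sql" even though each appears once, because "excel" occurs in only one document of the three.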

In [None]:
# generate tf-idf vectorization (use sklearn's TfidfVectorizer) for our data
def tfidf(X, y,  stopwords_list): 
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    vectorizer = TfidfVectorizer(stop_words=stopwords_list)
    tf_idf_train = vectorizer.fit_transform(X_train)
    tf_idf_test = vectorizer.transform(X_test)
    return tf_idf_train, tf_idf_test, y_train, y_test, vectorizer

In [None]:
tf_idf_train, tf_idf_test, y_train, y_test, vectorizer = tfidf(X, y, stopwords_list)

Now that we have vectorized training data, we can train a classifier to label a text based on its vectorized form. The `classify_text` function defined below accepts a classifier object, a vectorized training set, a vectorized test set and the training labels, and returns one list of predictions for the training set and a separate list for the test set.

In [None]:
nb_classifier = MultinomialNB()
rf_classifier = RandomForestClassifier(n_estimators=100)

In [None]:
# a function that takes in a classifier, trains it on our tf-idf vectors and generates train and test predictions
def classify_text(classifier, tf_idf_train, tf_idf_test, y_train):
    classifier.fit(tf_idf_train, y_train)
    train_preds = classifier.predict(tf_idf_train)
    test_preds = classifier.predict(tf_idf_test)
    return train_preds, test_preds

In [None]:
# generate predictions with Naive Bayes Classifier
nb_train_preds, nb_test_preds = classify_text(nb_classifier, tf_idf_train, tf_idf_test, y_train)

# evaluate performance of Naive Bayes Classifier
print(confusion_matrix(y_test, nb_test_preds))
print(accuracy_score(y_test, nb_test_preds))

In [None]:
# generate predictions with Random Forest Classifier
rf_train_preds, rf_test_preds = classify_text(rf_classifier, tf_idf_train, tf_idf_test, y_train)

# evaluate performance of Random Forest Classifier
print(confusion_matrix(y_test, rf_test_preds))
print(accuracy_score(y_test, rf_test_preds))

### Inverse Document Frequency (IDF)

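$\begin{align}
idf(t) = \log \dfrac{N}{df_t}
\end{align} $

where $N$ is the number of documents in the collection and $df_t$ is the number of documents that contain the term $t$.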

Let's figure out which words matter most within each class of texts! Recall that inverse document frequency measures how rare a word is across a group of documents: a low IDF means the word appears in most of the documents, and a high IDF means it appears in only a few.



In [None]:
# Calculates the inverse document frequency (IDF) of each word within one class of documents
def get_idf(class_, df, stopwords_list):
    docs = df[df.target==class_].lower_desc
    class_dict = {}  # word -> number of documents in this class that contain it
    for doc in docs:
        words = set(doc.split())  # a set, so each word counts once per document
        for word in words:
            if word.lower() not in stopwords_list: 
                class_dict[word.lower()] = class_dict.get(word.lower(), 0) + 1
    idf_df = pd.DataFrame.from_dict(class_dict, orient='index')
    idf_df.columns = ['IDF']
    # idf = log(N / df): N documents in the class, df documents containing the word
    idf_df.IDF = np.log(len(docs)/idf_df.IDF)
    # ascending sort puts the lowest IDF first, i.e. the words found in the most documents
    idf_df = idf_df.sort_values(by="IDF", ascending=True)
    return idf_df.head(10)

In [None]:
get_idf(0 , df, stopwords_list)

<font face="script" size="6" color="scarlet">Resources</font>
* [TextBlob library](https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.blob.Word)

* [Google's Ngram Viewer](https://books.google.com/ngrams/graph?content=API&year_start=1800&year_end=2010&corpus=0&smoothing=3&direct_url=t1%3B%2CAPI%3B%2Cc0)

* [Tweepy - python library for accessing the Twitter API](https://www.tweepy.org/)

* [Step by step guide for NLP](https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e)

