<font face="serif" size="6" color="scarlet">Natural Language Processing</font>

Natural language processing (NLP) is a field of machine learning and deep learning that deals with understanding, analyzing, manipulating and generating language. Humans communicate through language across many media, and that makes things complicated: meaning depends on context, intonation, inflection and body language. The first major advance in machine language processing came in 1950, when Alan Turing published "Computing Machinery and Intelligence". This paper established the Turing Test, a criterion for how well a computer can impersonate a human. In 1957, Noam Chomsky's book Syntactic Structures revolutionized our understanding of linguistics. A few decades then passed with little progress. It wasn't until the late 1980s, when ML algorithms were introduced, that NLP showed real promise.

 <font face="script" size="4">"Learn a language and you'll avoid a war"-Arab proverb</font>
        

_NLP here does not mean neuro-linguistic programming (a pseudoscience; think changing behavior through hypnosis). Natural Language Understanding (NLU) is closely related to NLP but not identical: NLP focuses on turning unstructured data into structured data, while NLU focuses on content or sentiment analysis._

<font face="script" size="6" color="scarlet">NLP in the Real World</font>
 
 Many everyday tools we take for granted rely on NLP to function: spell check and auto-complete, voice recognition and voice-to-text, spam filters, search engines, Siri and Alexa, and Google Translate.
 
 - [AI having a convo](https://youtu.be/WnzlbyTZsQY)
 - [Summarize text](https://smmry.com/)
 - [Jennings vs. Watson](https://www.ted.com/talks/ken_jennings_watson_jeopardy_and_me_the_obsolete_know_it_all)

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
%matplotlib inline 

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize 
nltk.download('punkt')

In [None]:
df = pd.read_csv('job_scrape6.csv')
df.info()

In [None]:
df.head()

<font face="script" size="6" color="scarlet">Preprocessing, Feature Engineering and EDA</font>
* Casing 
* Punctuation 
* Stop word removal 
* Tokenization 

* Stemming 
* Lemmatization 
* POS tagging 


<font face="script" size="6" color="scarlet">Regular Expressions</font>
![](regex_cheat_sheet.png)
<a href="https://www.debuggex.com/cheatsheet/regex/python">Regex Cheatsheet</a>
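
As a small illustration of how a character class works, here is a sketch using Python's built-in `re` module. The sample sentence is made up for this example and is not taken from the dataset. The pattern `[^\w\s]` matches any character that is neither a word character nor whitespace, which is the same pattern used to strip punctuation later in this notebook.

```python
import re

sample = "Senior Data Scientist (NLP), $120k+, apply now!"  # hypothetical text

# [^\w\s] = any single character that is NOT a word character or whitespace
no_punct = re.sub(r"[^\w\s]", "", sample)

# \d+ = one or more consecutive digits
numbers = re.findall(r"\d+", sample)

print(no_punct)
print(numbers)
```

Writing the pattern as a raw string (`r"..."`) keeps Python from interpreting the backslashes before the regex engine sees them.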

In [None]:
#Lowercasing everything so that "Python" and "python" are counted as the same word
df['lower_desc'] = df['description'].apply(lambda x: " ".join(w.lower() for w in x.split()))
df['lower_desc'].head()

In [None]:
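#Removing punctuation. It helps us reduce the size of the data 
#regex=True is required in recent pandas versions, where str.replace treats the pattern as a literal string by default
df['lower_desc'] = df['lower_desc'].str.replace(r'[^\w\s]', '', regex=True)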
df['lower_desc'].head()

<font face="serif" size="4">**Stop Words Removal** - words that don't contribute to the significance or meaning of a document </font>
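
To make the idea concrete, here is a minimal sketch that filters a made-up sentence against a tiny hand-picked stop list. Both `toy_stop` and the sentence are assumptions for illustration only; NLTK's English list, used in the cells that follow, is much longer.

```python
# A tiny, hand-picked stop list for illustration only
toy_stop = {"the", "a", "is", "of", "and", "to", "in"}

sentence = "the role of a data scientist is to find patterns in data"  # made-up example

# words that carry meaning are kept, stop words are set aside
kept = [w for w in sentence.split() if w not in toy_stop]
removed = [w for w in sentence.split() if w in toy_stop]

print(kept)
print(removed)
```

Half of the words in this sentence are stop words, which is why removing them shrinks the data so much without losing the topic of the text.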

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords') #the stop word lists must be downloaded once
stop = stopwords.words('english')


In [None]:
df['char_count'] = df['description'].str.len() #how many characters do we have in description? 
df[['description','char_count']].head()

In [None]:
#how many stop words do we have? 
#the stop list is lowercase, so each word is lowercased before the lookup
df['stopwords'] = df['description'].apply(lambda x: len([w for w in x.split() if w.lower() in stop]))
df[['description','stopwords']].head()

In [None]:
#removing stopwords 
df['lower_desc'] = df['lower_desc'].apply(lambda x: " ".join(w for w in x.split() if w not in stop))
df['lower_desc'].head()

In [None]:
#the 20 most frequent words 
freq = pd.Series(' '.join(df['lower_desc']).split()).value_counts()[:20]
freq

In [None]:
df.head()

<font face="script" size="6" color="scarlet">Tokenization</font>

In [None]:
desc_str = ' '.join(df['lower_desc'].tolist())
print(desc_str)

In [None]:
tokens = nltk.word_tokenize(desc_str) #tokenizing 
print(len(tokens))

<font face="script" size="6" color="scarlet">Stemming</font>
- a technique that removes affixes from a word, leaving its stem. For example, "play" is the stem of "playing" and "ing" is the affix. Stemming maps related word forms to the same token, so the algorithm only has to learn the stem of the word instead of the stem and all its variants.
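
The sketch below shows the idea with a deliberately naive suffix stripper. `naive_stem` is a hypothetical helper written for this example; it is not the Porter algorithm used in the next cells, which applies many more rules.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix; a toy rule, NOT the Porter algorithm."""
    for suffix in suffixes:
        # keep at least three characters so short words like "is" survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["play", "plays", "played", "playing"]
stems = [naive_stem(w) for w in words]

print(stems)
print(len(set(words)), "distinct words ->", len(set(stems)), "distinct stem")
```

All four variants collapse to one token, so a model sees one vocabulary entry instead of four.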

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer 
nltk.download('wordnet') #corpus used by the lemmatizer
porter = PorterStemmer() #instantiate
lemma = WordNetLemmatizer() #instantiate 

In [None]:
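#stem() expects a single word, so tokenize the sentence and stem each token
print([porter.stem(word) for word in word_tokenize("I studied physics")])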

<font face="script" size="6" color="scarlet">Lemmatization</font>
- similar to stemming, but it uses morphological analysis (how a word's form relates to its dictionary entry and part of speech) to return a real word. A lemma is the base, dictionary form shared by all of a word's inflected forms. Inflections are endings added to the stem of a word, such as the "s" in "plays".
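
A minimal sketch of the difference from stemming: instead of chopping endings, a lemmatizer looks the word up. The mini-dictionary `toy_lemmas` and the helper `toy_lemmatize` are hypothetical and exist only for this example; a real lemmatizer such as `WordNetLemmatizer` relies on a full lexical database.

```python
# Hypothetical mini-dictionary mapping (word, part of speech) to its lemma
toy_lemmas = {
    ("studies", "n"): "study",
    ("studied", "v"): "study",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return toy_lemmas.get((word, pos), word)

print(toy_lemmatize("mice"))             # irregular plural, no suffix rule would find it
print(toy_lemmatize("better", pos="a"))  # looked up as an adjective
print(toy_lemmatize("better"))           # looked up as a noun, so left unchanged
```

The last two calls show why the part of speech matters: the same surface form can map to different lemmas depending on how it is used.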

In [None]:
print(lemma.lemmatize("physics"))

<font face="script" size="6" color="scarlet">POS Tagging</font>

In [None]:
tokens_pos = nltk.pos_tag(tokens)
pos_df = pd.DataFrame(tokens_pos, columns = ('word','POS'))
pos_sum = pos_df.groupby('POS', as_index=False).count() # group by POS tags
pos_sum.sort_values(['word'], ascending=[False]) # in descending order of number of words per tag

In [None]:
#getting just the nouns (singular, plural, proper singular, proper plural)
noun_tags = ('NN', 'NNS', 'NNP', 'NNPS')
filtered_pos = [pair for pair in tokens_pos if pair[1] in noun_tags]
print(len(filtered_pos))

In [None]:
#the 100 most common nouns
fdist_pos = nltk.FreqDist(filtered_pos)
top_100_words = fdist_pos.most_common(100)
print(top_100_words)

In [None]:
top_words_df = pd.DataFrame(top_100_words, columns = ('pos','count'))
top_words_df['Word'] = top_words_df['pos'].apply(lambda x: x[0]) # split the tuple of POS
top_words_df = top_words_df.drop(columns='pos') # drop the previous column
top_words_df.head(10)

In [None]:
fig, ax = plt.subplots(figsize=(15,18))
top_words_df.sort_values(by='count').plot.barh(x='Word',
                      y='count',
                      ax=ax,
                      color="purple")

ax.set_title("Common Nouns Found in DS Job Descriptions (Without Stop Words)")

plt.show()

In [None]:
from textblob import TextBlob, Word
from wordcloud import WordCloud

In [None]:
word_counts = ' '.join(top_words_df['Word'].tolist())
print(type(word_counts))

In [None]:
wordcloud = WordCloud().generate(word_counts)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

<center><img src="tfidf.png" height=600 width=600></center>

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
df['lower_desc'].head()

Transform the text into a bag of words. `CountVectorizer` learns the vocabulary from the text with the `fit` method, and `transform` then turns each document into a row of word counts (a sparse document-term matrix).
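
The two steps can be sketched by hand with the standard library. The two-document corpus is made up, and this sketch only mirrors the idea; it does not reproduce scikit-learn's tokenization or its sparse storage.

```python
from collections import Counter

docs = ["python sql python", "sql statistics"]  # made-up mini corpus

# "fit": learn the vocabulary (sorted so each word gets a stable column index)
vocab = sorted({word for doc in docs for word in doc.split()})

# "transform": one row per document, one count per vocabulary word
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
print(matrix)
```

Word order is discarded: each document is represented only by how often each vocabulary word occurs in it, which is where the name "bag of words" comes from.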

In [None]:
from sklearn.feature_extraction import text
vectorizer = text.CountVectorizer().fit(df.lower_desc)
vectorized_text = vectorizer.transform(df.lower_desc)
print(vectorized_text.todense()[-1])

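We must normalize for text length. With `norm='l1'`, the weights in each document are scaled to sum to 1, so long and short descriptions become comparable.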

In [None]:
TfidF = text.TfidfTransformer(norm='l1')
tfidf = TfidF.fit_transform(vectorized_text)

phrase = 3 #row index of the document to inspect
total = 0 

The TF-IDF model rescales the word counts so that words that are frequent in one document but rare across the corpus get higher weights, which makes the values comparable between the texts in the corpus.
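
A worked sketch of the arithmetic on a made-up count matrix, assuming scikit-learn's default smoothed formula, idf = ln((1 + n) / (1 + df)) + 1, followed by L1 normalization. All names here (`toy_vocab`, `toy_counts`, `doc_freq`, `toy_tfidf`) are hypothetical and chosen so they do not clash with the notebook's variables.

```python
import math

toy_vocab = ["python", "sql", "statistics"]
toy_counts = [[2, 1, 0],   # document 0
              [0, 1, 1]]   # document 1
n_docs = len(toy_counts)

# document frequency: in how many documents each word appears
doc_freq = [sum(1 for row in toy_counts if row[j] > 0) for j in range(len(toy_vocab))]

# smoothed inverse document frequency: rarer words get a larger factor
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

toy_tfidf = []
for row in toy_counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    row_sum = sum(weighted)
    toy_tfidf.append([v / row_sum for v in weighted])  # L1 norm: each row sums to 1

for row in toy_tfidf:
    print(["%0.3f" % v for v in row])
```

In document 1 both words occur once, yet "statistics" ends up with the larger weight because it appears in only one document while "sql" appears in both.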

In [None]:
print(vectorizer.vocabulary_)

In [None]:
row = tfidf[phrase].toarray().ravel() #dense TF-IDF weights of the chosen document, computed once
for word, pos in vectorizer.vocabulary_.items():
    value = row[pos]
    if value != 0:
        print("%10s: %0.3f" % (word, value))
        total += value
print('\nSummed values of a phrase: %0.1f' % total)

<font face="script" size="6" color="scarlet">N-Grams</font>

In [None]:
TextBlob(desc_str).ngrams(3)
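
An n-gram is a run of n consecutive tokens. As a rough sketch of what a word-level n-gram extractor does, the hypothetical helper `word_ngrams` below slides a window over a made-up sentence. It splits on whitespace only, so it is not a drop-in replacement for TextBlob's tokenization.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words as a tuple."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sample = "machine learning engineer with python experience"  # made-up text
trigrams = word_ngrams(sample, 3)

print(trigrams)
```

A text of k words yields k - n + 1 n-grams, so the six-word sample produces four trigrams.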

<font face="script" size="6" color="scarlet">Resources</font>
* [TextBlob library](https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.blob.Word)

* [Google's n-gram viewer](https://books.google.com/ngrams/graph?content=API&year_start=1800&year_end=2010&corpus=0&smoothing=3&direct_url=t1%3B%2CAPI%3B%2Cc0)

* [Tweepy - Python library for accessing the Twitter API](https://www.tweepy.org/)

* [Step by step guide for NLP](https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e)

