# `NLP WorkFlow and Problems That Arise`:

# <font color=red>Mr Fugu Data Science</font>

# (◕‿◕✿)

# Purpose & Outcome:

+ General Workflow of an actual example

+ Show what problems come up and how you need to think about this depending on your analysis

`*********************************`

Typically, data is unstructured and needs to be converted or interpreted into a certain format for analysis. Textual data is no exception; evaluating social media posts or short messages such as text messages (SMS) can be difficult due to a lack of structure, slang, etc.

`This dataset came from https://newsapi.org/ on Sept. 19, 2020`

# Downsides:

+ There are situations where `stemming` and `lemmatizing` are not useful and can affect your analysis. You need to consider when this may be a problem:

For example, suppose we want the gist of a news article, summarized in just a few words. Then the part of speech (POS) is very important, and changing the ending of a word or converting it to lowercase can be a problem.

+ Having spelling errors: 

+ Words not separated by space `supposebut`, `wearehungry`

+ Contractions: `don't`, `can't`

+ Abbreviations/Acronyms: `gov't`,`FBI`,`NASA`

These are all concerns that need to be handled on a case-by-case basis.
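
To make the concerns above concrete, here is a minimal sketch of what lowercasing plus punctuation stripping does to a few tokens. The helper name `normalize` and the sample tokens `U.S.`, `IT` and `Apple` are my own illustration (assumptions, not part of the dataset); only the standard library is used.

```python
import string

def normalize(token):
    """Lowercase a token and drop ASCII punctuation (hypothetical helper)."""
    return "".join(ch for ch in token.lower() if ch not in string.punctuation)

# contractions and abbreviations lose their apostrophes,
# and acronyms can collide with ordinary words
examples = ["don't", "can't", "gov't", "U.S.", "IT", "Apple"]
for tok in examples:
    print(f"{tok!r:10} -> {normalize(tok)!r}")
```

After cleaning, `U.S.` and the pronoun `us` are indistinguishable, and so are `IT` and `it`. Whether that matters depends on the analysis you have in mind.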

In [10]:
import pandas as pd
import nltk                              # text processing
from nltk import word_tokenize           # split sentence into list of words
from nltk.corpus import stopwords        # remove: and,it,i,etc
import string                            # punctuation removal
from nltk.stem import WordNetLemmatizer  # reduce words to their dictionary form
from collections import defaultdict      # dict with values as lists
from textblob import TextBlob            # fix spelling 
from nltk.corpus import words            # find words as comparison to see if valid
from collections import Counter


In [2]:
# news file to parse:
news_=pd.read_csv('goog_api_techstuff.csv')

# one entry: which appears to be a small article
print(news_['content'][13])
print('----------------------')
print(news_['description'][13])

news_.head(2)

NVIDIA declined to comment to Engadget. We’ve also asked SoftBank and ARM for comment.
If a deal goes through, it could represents one of the largest semiconductor buyouts in history. It potentially… [+694 chars]
----------------------
SoftBank might be close to finding a buyer for ARM, and it won’t surprise you who the bidder might be. Wall Street Journal sources say SoftBank is close to a deal to sell ARM Holdings to NVIDIA for “more than” $40 billion. The two have reportedly been in excl…


| Unnamed: 0 | source | author | title | description | url | urlToImage | publishedAt | content |
|---|---|---|---|---|---|---|---|---|
| 0 | Lifehacker.com | Brendan Hesse | Everything You Missed From Today's Facebook Co... | Today’s Facebook Connect kicked off with a two... | https://lifehacker.com/everything-you-missed-f... | https://i.kinja-img.com/gawker-media/image/upl... | 2020-09-16 20:30:00 | Todays Facebook Connect kicked off with a two-... |
| 1 | Lifehacker.com | Joel Cunningham | How to Turn Off Alexa's Creepy 'Whisper Mode' | I love my smart speaker—as much as one can eve... | https://lifehacker.com/how-to-turn-off-alexas-... | https://i.kinja-img.com/gawker-media/image/upl... | 2020-08-31 20:30:00 | I love my smart speakeras much as one can ever... |


# `Convert to lowercase, remove punctuation`:

In [48]:

wrd_lst_tokens=[]
for ikl in news_['content']:
    punct=word_tokenize(''.join(j for j in ikl.lower() if j not in string.punctuation))
    wrd_lst_tokens.append([punct])

pd.DataFrame(wrd_lst_tokens).head()

pd.DataFrame(wrd_lst_tokens)[0][0]

['todays',
 'facebook',
 'connect',
 'kicked',
 'off',
 'with',
 'a',
 'twohour',
 'keynote',
 'that',
 'detailed',
 'the',
 'companys',
 'latest',
 'virtual',
 'and',
 'augmented',
 'reality',
 'developments',
 'there',
 'were',
 'product',
 'announcements',
 'new',
 'apps',
 'and',
 'games',
 'on',
 'displa…',
 '5586',
 'chars']
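
Notice that `displa…` kept its ellipsis in the output above. `string.punctuation` only lists ASCII punctuation, so characters such as `…` and the curly apostrophe `’` slip through. A minimal sketch of one way to catch them, assuming you want to treat every character in a Unicode punctuation category as punctuation (the helper name `strip_punct` and the sample strings are my own):

```python
import string
import unicodedata

def strip_punct(text):
    """Drop ASCII punctuation plus any character in a Unicode 'P*' category."""
    return "".join(
        ch for ch in text
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith("P")
    )

print("…" in string.punctuation)                  # the ellipsis is not ASCII punctuation
print(strip_punct("on displa… [+5586 chars]"))
print(strip_punct("the company’s latest"))
```

The ASCII check is kept on purpose: symbols such as `+` and `$` are in `string.punctuation` but belong to Unicode symbol categories, not punctuation ones.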

# Same above code but modified: `remove digits and replace hyphens`

+ If hyphens are simply stripped along with the other punctuation, the two halves of a hyphenated word fuse together (`two-hour` becomes `twohour`), which I don't want. Replacing each hyphen with a space first keeps them as separate words.

+ The digits are removed because:

    1. The end of each string is not article text; it is a marker giving the number of characters that were cut off (e.g. `[+5586 chars]`).
    
    2. I do not think I need them for this exercise, and this shows how to deal with them.
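
Here is a small, self-contained sketch of the two choices above on a made-up sample string. It uses `str.split` instead of `word_tokenize` so it runs without any NLTK data; the function name `clean` and its `hyphen_to_space` flag are hypothetical.

```python
import string

def clean(text, hyphen_to_space=True):
    """Lowercase, optionally turn hyphens into spaces, drop punctuation and digits."""
    if hyphen_to_space:
        text = text.replace("-", " ")
    kept = (ch for ch in text.lower()
            if ch not in string.punctuation and not ch.isdigit())
    return "".join(kept).split()

sample = "a two-hour keynote [+5586 chars]"
print(clean(sample, hyphen_to_space=False))  # hyphenated word fuses into 'twohour'
print(clean(sample))                         # 'two' and 'hour' stay separate
```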

In [49]:
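stopwrds = stopwords.words('english')    # load once; used in the stopword-removal step below

wrd_lst_tokens_updt=[]
for ikl in news_['content']:
    # replace hyphens with spaces, lowercase, then drop punctuation and digits
    punct=word_tokenize(''.join(j for j in ikl.replace('-',' ').lower()
                                if j not in string.punctuation and not j.isdigit()))
    
    wrd_lst_tokens_updt.append([punct])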


pd.DataFrame(wrd_lst_tokens_updt)[0][0]
# .head()

['todays',
 'facebook',
 'connect',
 'kicked',
 'off',
 'with',
 'a',
 'two',
 'hour',
 'keynote',
 'that',
 'detailed',
 'the',
 'companys',
 'latest',
 'virtual',
 'and',
 'augmented',
 'reality',
 'developments',
 'there',
 'were',
 'product',
 'announcements',
 'new',
 'apps',
 'and',
 'games',
 'on',
 'displa…',
 'chars']

# `Remove Stopwords, single characters, useless words as well`

In [52]:
d=[]
for i in wrd_lst_tokens_updt:
# drop single letters; [:-2] drops the last 2 tokens of each list (the truncated word and 'chars') bc useless
    line = [j for j in i[0][:-2] if len(j) > 1]
    
# remove our stopwords like: ('i','it','etc')
    d.append([[ii for ii in line if ii not in stopwrds]])


# rough size check only: token count of row 1 divided by the character count of row 0
print('Example of amount of data after: ',len(pd.DataFrame(d)[0][1])/len(news_['content'][0]))
pd.DataFrame(d).head()
pd.DataFrame(d)[0][8]
# pd.DataFrame(d)[0][0]

Example of amount of data after:  0.09813084112149532


['evo',
 'promise',
 'hours',
 'battery',
 'life',
 'plus',
 'rapid',
 'charging',
 'technology',
 'delivers',
 'four',
 'hour',
 'boost',
 'minute',
 'charge',
 'touch',
 'controls',
 'let',
 'users',
 'take',
 'calls',
 'change',
 'tracks',
 'adjust',
 'volume']
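
Slicing with `[:-2]` works because every entry shown here ends in a truncated word plus `chars`. If an article happened to be short enough not to be truncated, the slice would throw away two real tokens. As an alternative sketch (my own assumption, not what the cell above does), you could strip the marker from the raw string with a regular expression before tokenizing; the pattern assumes the `… [+N chars]` shape seen in the printed content, and the helper name is hypothetical.

```python
import re

# a truncated word, the ellipsis, and the "[+N chars]" marker at the end of the string
TAIL = re.compile(r"\s*\S*…\s*\[\+\d+ chars\]\s*$")

def drop_truncation_tail(text):
    """Remove the cut-off word and the '[+N chars]' marker, if present."""
    return TAIL.sub("", text)

print(drop_truncation_tail("new apps and games on displa… [+5586 chars]"))
print(drop_truncation_tail("a short complete sentence"))  # no marker: left unchanged
```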

# `Lemmatization`: looking for root or bases of words

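You can use `Stemming` if you are pressed for time or memory, because it just chops off word endings such as 's' or 'ies'.

`Lemmatization` instead looks each word up in a vocabulary (here WordNet) and returns its dictionary form, so `feet` becomes `foot` in the example below. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless you pass a `pos` argument, which is why `hanging` and `are` come back unchanged.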
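
To see the difference without downloading anything, here is a toy sketch. It is not the Porter stemmer and not WordNet: `toy_stem` chops a few hard-coded suffixes and `toy_lemma` looks words up in a tiny hand-made dictionary, both invented for illustration.

```python
SUFFIXES = ("ies", "ing", "es", "s")   # checked in this order

def toy_stem(word):
    """Chop the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# hand-made lookup table standing in for a real lemma dictionary
TOY_LEMMAS = {"parties": "party", "feet": "foot", "bats": "bat", "hanging": "hang"}

def toy_lemma(word):
    """Return the dictionary form if we know it, else the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["parties", "feet", "bats", "hanging"]:
    print(f"{w:8} stem: {toy_stem(w):6} lemma: {toy_lemma(w)}")
```

The stem of `parties` is the non-word `part`, and `feet` is not touched at all, while the lookup handles both. That is the trade-off: chopping is cheap, a lookup needs a vocabulary.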

In [6]:
lemmatizer = WordNetLemmatizer()
sentence = "The striped bats are hanging on their feet for best"
tokens = word_tokenize(sentence)   # not named `words`: that would shadow nltk.corpus.words imported above
for w in tokens:
    print(w, " : ", lemmatizer.lemmatize(w))

# This example came from Stack Overflow

The  :  The
striped  :  striped
bats  :  bat
are  :  are
hanging  :  hanging
on  :  on
their  :  their
feet  :  foot
for  :  for
best  :  best


In [7]:
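# lemmatize every token, remembering which article (row index) it came from
h=[]
for i in range(len(d)):
    for j in d[i][0]:
        h.append([i,lemmatizer.lemmatize(j)])

# group the lemmas back into one list per article
dg=defaultdict(list)
for i in h:
    dg[i[0]].append(i[1])

# assumes every article kept at least one token, so dg has one entry per row
news_['updated_strings']=list(dg.values())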
news_.head()

| Unnamed: 0 | source | author | title | description | url | urlToImage | publishedAt | content | updated_strings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Lifehacker.com | Brendan Hesse | Everything You Missed From Today's Facebook Co... | Today’s Facebook Connect kicked off with a two... | https://lifehacker.com/everything-you-missed-f... | https://i.kinja-img.com/gawker-media/image/upl... | 2020-09-16 20:30:00 | Todays Facebook Connect kicked off with a two-... | [today, facebook, connect, kicked, two, hour, ... |
| 1 | Lifehacker.com | Joel Cunningham | How to Turn Off Alexa's Creepy 'Whisper Mode' | I love my smart speaker—as much as one can eve... | https://lifehacker.com/how-to-turn-off-alexas-... | https://i.kinja-img.com/gawker-media/image/upl... | 2020-08-31 20:30:00 | I love my smart speakeras much as one can ever... | [love, smart, speakeras, much, one, ever, love... |
| 2 | Lifehacker.com | Elizabeth Yuko | How to Avoid Getting a Last-Minute Booking Blo... | Airbnb is cracking down on parties. They are n... | https://lifehacker.com/how-to-avoid-getting-a-... | https://i.kinja-img.com/gawker-media/image/upl... | 2020-09-06 14:00:00 | Airbnb is cracking down on parties. They are n... | [airbnb, cracking, party, basically, equivalen... |
| 3 | Lifehacker.com | Beth Skwarecki on Vitals, shared by Beth Skwar... | Tackle a Hill Head-On | Did you find a new trail to run or hike for la... | https://vitals.lifehacker.com/tackle-a-hill-he... | https://i.kinja-img.com/gawker-media/image/upl... | 2020-09-11 16:30:00 | Did you find a new trail to run or hike for la... | [find, new, trail, run, hike, last, week, chal... |
| 4 | Lifehacker.com | Meghan Moravcik Walbert on Offspring, shared b... | How to Communicate With Kids When You're Weari... | “Mask-muffle” is a term I just made up, but it... | https://offspring.lifehacker.com/how-to-commun... | https://i.kinja-img.com/gawker-media/image/upl... | 2020-09-17 13:00:00 | Mask-muffle is a term I just made up, but its ... | [mask, muffle, term, made, real, thing, cloth,... |
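
The grouping step in the cell above is easier to see on a handful of pairs. This is a minimal sketch with made-up `[article_index, token]` pairs; the variable names are my own. It also shows one way to keep an empty list for an article that lost all of its tokens, which the plain `dg.values()` approach would silently skip.

```python
from collections import defaultdict

# made-up [article_index, token] pairs, in the same shape as the list `h`
pairs = [[0, "today"], [0, "facebook"], [1, "love"], [1, "smart"], [0, "connect"]]

grouped = defaultdict(list)
for article_idx, token in pairs:
    grouped[article_idx].append(token)   # a missing key starts as an empty list

print(dict(grouped))

# pretend there are 3 articles and the third one kept no tokens
n_articles = 3
per_article = [grouped.get(i, []) for i in range(n_articles)]
print(per_article)
```

Building the column from `range(n_articles)` keeps it the same length as the DataFrame even when some article ends up empty.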


# Correct Misspellings with TextBlob:

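+ This is not perfect; let's look. In the output below, `speakeras` is "corrected" to `speakers`, `doesnt` becomes `doesn`, and `supposebut` is left unchanged.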

In [8]:
from textblob import TextBlob

misspelled=["hapenning", "mornin", "windoow", "jaket"]  # toy examples (not used below)
miss_=dg[1]                                             # lemmatized tokens of article 1
print('Corrected Version: ',[str(TextBlob(word).correct()) for word in miss_])
print('-----------------')
print('Original: ',dg[1])


Corrected Version:  ['love', 'smart', 'speakers', 'much', 'one', 'ever', 'love', 'piece', 'privacy', 'stealing', 'technology', 'exists', 'gather', 'information', 'supposebut', 'doesn', 'mean', 'dont', 'find', 'many', 'thing']
-----------------
Original:  ['love', 'smart', 'speakeras', 'much', 'one', 'ever', 'love', 'piece', 'privacy', 'stealing', 'technology', 'exists', 'gather', 'information', 'supposebut', 'doesnt', 'mean', 'dont', 'find', 'many', 'thing']
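
Why does `speakeras` turn into `speakers` while `supposebut` is left alone? TextBlob's documentation describes `correct()` as based on Peter Norvig's spelling corrector, which looks for known words a small number of edits away. The sketch below is not TextBlob's code; it is a stripped-down version of that idea with a made-up five-word vocabulary, limited to a single edit.

```python
import string

def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

TOY_VOCAB = {"speaker", "speakers", "suppose", "but", "does"}  # made-up vocabulary

def candidates(word, vocab):
    """Known words one edit away, or the word itself if there are none."""
    if word in vocab:
        return {word}
    return (edits1(word) & vocab) or {word}

print(candidates("speakeras", TOY_VOCAB))   # deleting one 'a' gives a known word
print(candidates("supposebut", TOY_VOCAB))  # too many edits from any known word
```

A missing space is not a spelling error in this sense: no single edit turns `supposebut` into a known word, so it has to be handled separately.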


# This is to take care of Words not separated by a space:

+ `This is Not Perfect`

Code for the cell below: https://stackoverflow.com/questions/38125281/split-sentence-without-space-in-python-nltk

In [12]:
# from __future__ import division
# from collections import Counter
# import re, nltk

WORDS = nltk.corpus.brown.words()   # needs the Brown corpus: nltk.download('brown')
COUNTS = Counter(WORDS)

def pdist(counter):
    "Make a probability distribution, given evidence from a Counter."
    N = sum(counter.values())
    return lambda x: counter[x]/N

P = pdist(COUNTS)

def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return product(P(w) for w in words)

def product(nums):
    "Multiply the numbers together.  (Like `sum`, but with multiplication.)"
    result = 1
    for x in nums:
        result *= x
    return result

def splits(text, start=0, L=20):
    "Return a list of all (first, rest) pairs; start <= len(first) <= L."
    return [(text[:i], text[i:]) 
            for i in range(start, min(len(text), L)+1)]

def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text: 
        return []
    else:
        candidates = ([first] + segment(rest) 
                      for (first, rest) in splits(text, 1))
        return max(candidates, key=Pwords)

print(segment('acquirecustomerdata'))
print(segment('supposebut'))
print(segment('speakeras'))




['acquire', 'customer', 'data']
['suppose', 'but']
['speaker', 'as']


# Iterate through the list and see if any words are combined and need to be separated

+ Notice how this fails for some words

In [29]:
b=[]
for i in h:
    
    b.append([i[0],segment(i[1])])

print(b[:2]) # company name that was split
print(b[17]) # shorthand version of applications 

[[0, ['today']], [0, ['face', 'book']]]
[0, ['a', 'p', 'p', 's']]
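
The failures above follow from how the probabilities are computed: a word that never appears in the reference corpus gets probability 0, so any segmentation made only of known pieces beats the unsplit word. The sketch below reproduces that behaviour with made-up counts (not Brown corpus values). It also memoizes `segment` with `functools.lru_cache`, an addition of mine to avoid recomputing the same suffixes.

```python
from collections import Counter
from functools import lru_cache
from math import prod

# made-up counts: 'facebook' and 'apps' are deliberately missing
TOY_COUNTS = Counter({"face": 50, "book": 40, "a": 200, "p": 2, "s": 3,
                      "suppose": 10, "but": 100})
TOTAL = sum(TOY_COUNTS.values())

def seq_prob(words):
    """Product of unigram probabilities; unseen words contribute 0."""
    return prod(TOY_COUNTS[w] / TOTAL for w in words)

@lru_cache(maxsize=None)
def segment(text):
    """Most probable segmentation of `text`, returned as a tuple of words."""
    if not text:
        return ()
    candidates = ((text[:i],) + segment(text[i:]) for i in range(1, len(text) + 1))
    return max(candidates, key=seq_prob)

print(segment("supposebut"))
print(segment("facebook"))   # unknown as a whole, so the known pieces win
print(segment("apps"))       # falls apart into single letters
```

So the method can only be as good as the word counts behind it: company names and newer slang are simply not in an older corpus.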


# Only take lists of two words:

From these data, it looks like some words are combined because someone forgot to use a space. So keep only the segmentations that produced exactly two words and mark everything else with a `0` placeholder.

In [14]:

y=[]
for i in b:
#     print(i)
    if len(i[1])==2:
        y.append(i)
    else:
        y.append([i[0],0])
y[:5]

[[0, 0], [0, ['face', 'book']], [0, 0], [0, 0], [0, 0]]

# This Code Block has a few interesting bits:

For each two-word split, join the pieces back together and check whether the result is a legitimate word in the NLTK `words` corpus.

+ If it isn't, keep the split

+ Otherwise, keep the joined word, since the split was probably wrong
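
The rule above in isolation, as a minimal sketch: the helper name `rejoin_if_word` and the tiny vocabulary are made up, standing in for the NLTK `words` corpus.

```python
TOY_VOCAB = {"dont", "beta", "speaker", "as", "suppose", "but"}  # stand-in vocabulary

def rejoin_if_word(parts, vocab):
    """Return the joined string if it is a known word, else the original split."""
    joined = "".join(parts)
    return joined if joined in vocab else parts

print(rejoin_if_word(["don", "t"], TOY_VOCAB))        # joined form is known: undo the split
print(rejoin_if_word(["speaker", "as"], TOY_VOCAB))   # joined form is unknown: keep the split
```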

In [47]:
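from nltk.corpus import words
english_vocab=set(words.words())   # build once; set lookups are much faster than scanning the list

ll=[]
for i in y:
    if i[1]==0:
        ll.append([i[0],0])            # placeholder: nothing to rejoin
    else:
        f=''.join(i[1])
        if f not in english_vocab:
            ll.append(i)               # joined string is not a word: keep the split
        else:
            ll.append([i[0],f])        # joined string is a word: undo the split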
            
print('Correct: ',ll[21])
print('Correct: ',ll[33:37])
print('Fails: ',ll[189:196]) # CS related and Elon Musk: neuralink
print('Fails: ',ll[245]) # name of Siri voice recognition
print('Fails: ',ll[254:258]) # companies

Correct:  [1, ['speaker', 'as']]
Correct:  [[1, ['suppose', 'but']], [1, 0], [1, 0], [1, 'dont']]
Fails:  [[9, 'io'], [9, 0], [9, 'io'], [9, 'beta'], [10, ['neural', 'ink']], [10, 0], [10, 0]]
Fails:  [12, ['sir', 'i']]
Fails:  [[13, ['en', 'gadget']], [13, 0], [13, 0], [13, ['soft', 'bank']]]


# Let's see what happens with some of our words and notice how this can fail: 

+ It can only find correctly spelled words that are in the corpus

In [21]:
# actual words from our file

lst=['supposeas','neuarlink','facebook','suppose','supose','dont',"don't"]
english_vocab=set(words.words())   # build the lookup set once instead of reloading the list per word
m=[]
for i in lst:
    if i not in english_vocab:
        m.append([i,'F'])
    else:
        m.append([i,'T'])
m
m

[['supposeas', 'F'],
 ['neuarlink', 'F'],
 ['facebook', 'F'],
 ['suppose', 'T'],
 ['supose', 'F'],
 ['dont', 'T'],
 ["don't", 'F']]
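
One possible way to soften these failures, offered as an assumption rather than something done in this notebook, is to check tokens against a small hand-made list of domain terms in addition to the dictionary. Both sets below are invented for illustration; `BASE_VOCAB` stands in for `set(words.words())`.

```python
BASE_VOCAB = {"suppose", "dont", "face", "book", "neural", "ink"}          # stand-in dictionary
DOMAIN_TERMS = {"facebook", "neuralink", "siri", "engadget", "softbank"}   # hypothetical extras

def is_known(token, extra=frozenset()):
    """True if the token is in the dictionary or in the extra term list."""
    return token in BASE_VOCAB or token in extra

for tok in ["facebook", "suppose", "supose"]:
    print(tok, is_known(tok), is_known(tok, DOMAIN_TERMS))
```

Company names now pass, while a real typo like `supose` is still flagged. The list has to be maintained by hand, so this only helps for terms you already expect.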

`------------------`

# Citations & Help:

# ◔̯◔

https://www.kaggle.com/matleonard/text-classification

https://www.kaggle.com/matleonard/word-vectors

https://github.com/graykode/nlp-tutorial

https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/

https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

https://medium.com/machine-learning-in-practice/over-200-of-the-best-machine-learning-nlp-and-python-tutorials-2018-edition-dd8cf53cb7dc

https://www.kdnuggets.com/2019/01/solve-90-nlp-problems-step-by-step-guide.html

https://www.kaggle.com/itratrahman/nlp-tutorial-using-python

https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

https://medium.com/@Intellica.AI/comparison-of-different-word-embeddings-on-text-similarity-a-use-case-in-nlp-e83e08469c1c

https://www.kaggle.com/adamschroeder/countvectorizer-tfidfvectorizer-predict-comments

https://www.analyticsvidhya.com/blog/2020/08/top-4-sentence-embedding-techniques-using-python/

https://stackabuse.com/text-summarization-with-nltk-in-python/

https://opendatagroup.github.io/data%20science/2019/03/21/preprocessing-text.html

https://nbviewer.jupyter.org/url/norvig.com/ipython/How%20to%20Do%20Things%20with%20Words.ipynb

https://stackoverflow.com/questions/38125281/split-sentence-without-space-in-python-nltk

https://towardsdatascience.com/how-i-used-natural-language-processing-to-extract-context-from-news-headlines-df2cf5181ca6