# Preprocessing text data for NLP analysis
## Author: Karina Lopez
### Last updated: 04/20/2021

**Purpose:** Clean raw text data for NLP analysis. Steps include tokenization, lowercasing, stop-word removal, stemming, and lemmatization.

Source: https://towardsdatascience.com/text-preprocessing-in-natural-language-processing-using-python-6113ff5decd8



# Load in your packages and default styles

In [40]:
import pandas as pd
import glob
import os

import seaborn as sns
sns.set_style('darkgrid')

import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import twitter_samples, stopwords
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk import FreqDist, classify, NaiveBayesClassifier
nltk.download('punkt')

import re, string, random

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib

#setting pandas display options
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', None)  # None shows full cell text; -1 is deprecated

BASE_DIR = "/Users/karinalopez/Desktop/ds_projects/nlp/data/"

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/karinalopez/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
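def remove_noise(tweet_tokens, stop_words = ()):
    """Strip URLs and @mentions, lemmatize, and drop punctuation and stop words."""

    cleaned_tokens = []
    # Create the lemmatizer once instead of once per token
    lemmatizer = WordNetLemmatizer()

    for token, tag in pos_tag(tweet_tokens):

        # Remove URLs and @mentions (raw strings avoid invalid-escape warnings)
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        token = re.sub(r"(@[A-Za-z0-9_]+)", "", token)

        # Map the part-of-speech tag to the WordNet categories the lemmatizer expects
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        token = lemmatizer.lemmatize(token, pos)

        # Keep non-empty tokens that are neither punctuation nor stop words
        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())

    return cleaned_tokens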
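
To see what the two regular expressions in `remove_noise` do on their own, here is a minimal standard-library sketch. The helper name `strip_links_and_mentions` and the sample tokens (including the made-up handle `@some_user`) are assumptions for illustration and are not part of the notebook.

```python
import re

# The same two patterns used in remove_noise, compiled once as raw strings.
URL_PATTERN = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|"
    r"(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)
MENTION_PATTERN = re.compile(r"(@[A-Za-z0-9_]+)")


def strip_links_and_mentions(token):
    """Hypothetical helper: blank out URLs and @mentions inside one token."""
    token = URL_PATTERN.sub("", token)
    return MENTION_PATTERN.sub("", token)


# Illustrative tokens; the URL is one that appears in the comment data.
sample_tokens = [
    "thanks",
    "@some_user",
    "see",
    "https://www.beaybl.com/collections/leggings",
    "now",
]
cleaned = [strip_links_and_mentions(t) for t in sample_tokens]
print(cleaned)  # ['thanks', '', 'see', '', 'now']
```

Tokens that consisted only of a link or a mention become empty strings, which is why `remove_noise` checks `len(token) > 0` before keeping a token.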


In [3]:
def get_all_words(cleaned_tokens_list):
    
    for tokens in cleaned_tokens_list:
        
        for token in tokens:
            
            yield token


In [4]:
def get_tweets_for_model(cleaned_tokens_list):
    
    for tweet_tokens in cleaned_tokens_list:
        
        yield dict([token, True] for token in tweet_tokens)
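
As a rough illustration of what the two generators above produce, the sketch below runs them on a made-up list of token lists. `collections.Counter` stands in for NLTK's `FreqDist` so the example needs no downloads; the toy data is an assumption.

```python
from collections import Counter


def get_all_words(cleaned_tokens_list):
    # Flatten a list of token lists into one stream of tokens.
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token


def get_tweets_for_model(cleaned_tokens_list):
    # Turn each token list into a {token: True} feature dictionary.
    for tweet_tokens in cleaned_tokens_list:
        yield {token: True for token in tweet_tokens}


# Made-up cleaned tokens for two comments.
toy_tokens = [["love", "leggings"], ["leggings", "pill", "easily"]]

word_counts = Counter(get_all_words(toy_tokens))
features = list(get_tweets_for_model(toy_tokens))

print(word_counts.most_common(1))  # [('leggings', 2)]
print(features[0])                 # {'love': True, 'leggings': True}
```

The flattened stream is the shape a frequency count expects, while the list of dictionaries is the feature format typically passed to a classifier such as `NaiveBayesClassifier`.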

In [5]:
os.chdir(BASE_DIR + 'raw/ffa/')
posts_df = pd.read_csv('MEGA_ffa_posts_company.csv')
comments_df = pd.read_csv('MEGA_ffa_comments_company.csv')

In [6]:
print(posts_df.shape)
posts_df.head(n = 1)

(247, 8)


Unnamed: 0,title,score,id,url,comms_num,created,body,keyword
0,"A review of all the leggings I own - Alo Yoga, Nike, Outdoor Voices, Girlfriend Collective, Lululemon, Uniqlo",1546,gbm76k,https://www.reddit.com/r/femalefashionadvice/comments/gbm76k/a_review_of_all_the_leggings_i_own_alo_yoga_nike/,355,1588353444,t3_gbm76k,Girlfriend Collective


In [7]:
posts_df['keyword'].value_counts()

Nike                     100
Lululemon                52 
Adidas                   47 
Athleta                  37 
Girlfriend Collective    9  
Alo Yoga                 2  
Name: keyword, dtype: int64

In [8]:
print(comments_df.shape)
comments_df.head(n = 1)

(17960, 5)


Unnamed: 0,comment_id,comment_parent_id,comment_body,comment_link_id,keyword
0,fp6qtho,t3_gbm76k,"This is amazing! Thank you for putting it together!\n\nHonestly I’ve tried so many brands: GF Collective, Lululemon, Aerie, Nike, Athleta etc and the ones I always come back to are Old Navy. They are so comfortable and wash well and very affordable. \n\nThe Old Navy compression leggings retain their compression even with multiple washes. I also like the yoga leggings which are less compressive but still have a bit of compression. Those are a little softer than the compressive ones but I love both. I tried a bunch of fancier brands and ended up just going back to old navy.",t3_gbm76k,Girlfriend Collective


In [9]:
comments_df['keyword'].value_counts()

Nike                     6626
Lululemon                4014
Adidas                   3406
Athleta                  2906
Alo Yoga                 519 
Girlfriend Collective    489 
Name: keyword, dtype: int64

# Problems solved with this script:
- Tokenization
- Lowercasing
- Stop-word removal
- Stemming
- Lemmatization
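
The first three steps in this list can be sketched with the standard library alone. This is a minimal sketch, not the notebook's pipeline: the function name `preprocess` and the tiny `TOY_STOP_WORDS` set are assumptions (NLTK's `stopwords` corpus is much larger), and stemming and lemmatization are left out because they need NLTK resources.

```python
import re

# Tiny hand-picked stop-word set, for illustration only.
TOY_STOP_WORDS = {"i", "the", "a", "are", "is", "and", "to", "so"}


def preprocess(text, stop_words=TOY_STOP_WORDS):
    """Hypothetical helper: lowercase, tokenize on word characters, drop stop words."""
    lowered = text.lower()
    tokens = re.findall(r"\w+", lowered)
    return [token for token in tokens if token not in stop_words]


result = preprocess("The leggings ARE so comfortable, and I love them!")
print(result)  # ['leggings', 'comfortable', 'love', 'them']
```

Lowercasing happens before the stop-word check so that "The" and "ARE" match the lowercase entries in the set.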

## Lowercasing

A very simple step that facilitates text analysis is lowercasing all string characters in a comment or text body. Skipping this step would cause word frequencies and other text analyses to treat words like "USA", "UsA", and "usa" as separate tokens.

In [30]:
comments_df['comment_body'] = comments_df['comment_body'].str.lower()
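
The effect on word frequencies can be checked with a small standard-library sketch. The list of mentions below is made up for illustration.

```python
from collections import Counter

# Made-up tokens with inconsistent casing.
mentions = ["USA", "UsA", "usa", "Nike", "nike"]

raw_counts = Counter(mentions)
lowered_counts = Counter(word.lower() for word in mentions)

print(len(raw_counts))      # 5 distinct entries before lowercasing
print(len(lowered_counts))  # 2 distinct entries after lowercasing
print(lowered_counts)       # Counter({'usa': 3, 'nike': 2})
```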

In [47]:
comments_df.head()

Unnamed: 0,comment_id,comment_parent_id,comment_body,comment_link_id,keyword,test,test2
0,fp6qtho,t3_gbm76k,"this is amazing! thank you for putting it together!\n\nhonestly i’ve tried so many brands: gf collective, lululemon, aerie, nike, athleta etc and the ones i always come back to are old navy. they are so comfortable and wash well and very affordable. \n\nthe old navy compression leggings retain their compression even with multiple washes. i also like the yoga leggings which are less compressive but still have a bit of compression. those are a little softer than the compressive ones but i love both. i tried a bunch of fancier brands and ended up just going back to old navy.",t3_gbm76k,Girlfriend Collective,"this is amazing! thank you for putting it together!\n\nhonestly i’ve tried so many brands: keyword, keyword, keyword, keyword, keyword etc and the ones i always come back to are old navy. they are so comfortable and wash well and very affordable. \n\nthe old navy compression leggings retain their compression even with multiple washes. i also like the yoga leggings which are less compressive but still have a bit of compression. those are a little softer than the compressive ones but i love both. i tried a bunch of fancier brands and ended up just going back to old navy.","[this, is, amazing, thank, you, for, putting, it, together, honestly, i, ve, tried, so, many, brands, keyword, keyword, keyword, keyword, keyword, etc, and, the, ones, i, always, come, back, to, are, old, navy, they, are, so, comfortable, and, wash, well, and, very, affordable, the, old, navy, compression, leggings, retain, their, compression, even, with, multiple, washes, i, also, like, the, yoga, leggings, which, are, less, compressive, but, still, have, a, bit, of, compression, those, are, a, little, softer, than, the, compressive, ones, but, i, love, both, i, tried, a, bunch, of, fancier, brands, and, ended, up, just, going, back, to, old, ...]"
1,fp6pu3o,t3_gbm76k,"thanks for taking all these pictures, it’s super helpful! i agree that nike quality has been subpar compared to what i remember in the past. i’m also in your general size range (5’9, 170ish) and i really like [aybl leggings](https://www.beaybl.com/collections/leggings) for a more “pillowy” feeling fabric, medium compression and the high waist actually stays up. good to know on the ov ones, i’ve been eyeing them but i have the same problem with most leggings rolling/sliding down from my waist.",t3_gbm76k,Girlfriend Collective,"thanks for taking all these pictures, it’s super helpful! i agree that keyword quality has been subpar compared to what i remember in the past. i’m also in your general size range (5’9, 170ish) and i really like [aybl leggings](https://www.beaybl.com/collections/leggings) for a more “pillowy” feeling fabric, medium compression and the high waist actually stays up. good to know on the ov ones, i’ve been eyeing them but i have the same problem with most leggings rolling/sliding down from my waist.","[thanks, for, taking, all, these, pictures, it, s, super, helpful, i, agree, that, keyword, quality, has, been, subpar, compared, to, what, i, remember, in, the, past, i, m, also, in, your, general, size, range, 5, 9, 170ish, and, i, really, like, aybl, leggings, https, www, beaybl, com, collections, leggings, for, a, more, pillowy, feeling, fabric, medium, compression, and, the, high, waist, actually, stays, up, good, to, know, on, the, ov, ones, i, ve, been, eyeing, them, but, i, have, the, same, problem, with, most, leggings, rolling, sliding, down, from, my, waist]"
2,fp6r2am,t3_gbm76k,"interesting, thanks for sharing! if anyone has a more updated review of girlfriend collective leggings/sports bras, please share! cause i'm interested in them based on the advertising alone lol",t3_gbm76k,Girlfriend Collective,"interesting, thanks for sharing! if anyone has a more updated review of keyword leggings/sports bras, please share! cause i'm interested in them based on the advertising alone lol","[interesting, thanks, for, sharing, if, anyone, has, a, more, updated, review, of, keyword, leggings, sports, bras, please, share, cause, i, m, interested, in, them, based, on, the, advertising, alone, lol]"
3,fp6svaw,t3_gbm76k,"this is super interesting! i used to be a die-hard athleta fan (high rise chatarunga) but recently, i've been leaning towards my lululemons. aligns are amazing and probably the most comfortable leggings i've ever tried but in my experience, they pill super easily. wunder unders hold up much better for everyday use. \n\nedit: also wanted to say that when i was being cheap, i got a few pairs of 90 degree by reflex leggings which are surprisingly decent for being ~$20. they have the compression i like from lulu but can be see through if you have some junk in the trunk or are actually working out and can also cause camel toe. i was also very surprised by the quality of my gym shark leggings, i love how their long length is actually super long (i’m 5’8) and i feel like they hit the mark 100% in terms of trendy gym wear.",t3_gbm76k,Girlfriend Collective,"this is super interesting! i used to be a die-hard keyword fan (high rise chatarunga) but recently, i've been leaning towards my keywords. aligns are amazing and probably the most comfortable leggings i've ever tried but in my experience, they pill super easily. wunder unders hold up much better for everyday use. \n\nedit: also wanted to say that when i was being cheap, i got a few pairs of 90 degree by reflex leggings which are surprisingly decent for being ~$20. they have the compression i like from lulu but can be see through if you have some junk in the trunk or are actually working out and can also cause camel toe. 
i was also very surprised by the quality of my gym shark leggings, i love how their long length is actually super long (i’m 5’8) and i feel like they hit the mark 100% in terms of trendy gym wear.","[this, is, super, interesting, i, used, to, be, a, die, hard, keyword, fan, high, rise, chatarunga, but, recently, i, ve, been, leaning, towards, my, keywords, aligns, are, amazing, and, probably, the, most, comfortable, leggings, i, ve, ever, tried, but, in, my, experience, they, pill, super, easily, wunder, unders, hold, up, much, better, for, everyday, use, edit, also, wanted, to, say, that, when, i, was, being, cheap, i, got, a, few, pairs, of, 90, degree, by, reflex, leggings, which, are, surprisingly, decent, for, being, 20, they, have, the, compression, i, like, from, lulu, but, can, be, see, through, if, you, have, ...]"
4,fp6tz4u,t3_gbm76k,"i've slowly reached the conclusion that lululemon align 25"" are my go to forever and i should stop wasting time and money on others. i'm 5'9"", in the ballpark of 175, and a powerlifter, and frankly i can't find any other leggings that accommodate huge legs and ass and tighter waist without falling down constantly lol. the wunder under have a seam that lands basically in the middle of my butt??\n\ni also don't love a ton of compression. fleo bounce leggings are okay but ride down somewhat. any of their other fabrics have too much ""squeeze"" for me. same for athleta, lulu fast and frees, and nike.\n\nalso thank you for investing the time to do all angles!! that takes work.",t3_gbm76k,Girlfriend Collective,"i've slowly reached the conclusion that keyword align 25"" are my go to forever and i should stop wasting time and money on others. i'm 5'9"", in the ballpark of 175, and a powerlifter, and frankly i can't find any other leggings that accommodate huge legs and ass and tighter waist without falling down constantly lol. the wunder under have a seam that lands basically in the middle of my butt??\n\ni also don't love a ton of compression. fleo bounce leggings are okay but ride down somewhat. any of their other fabrics have too much ""squeeze"" for me. same for keyword, lulu fast and frees, and keyword.\n\nalso thank you for investing the time to do all angles!! 
that takes work.","[i, ve, slowly, reached, the, conclusion, that, keyword, align, 25, are, my, go, to, forever, and, i, should, stop, wasting, time, and, money, on, others, i, m, 5, 9, in, the, ballpark, of, 175, and, a, powerlifter, and, frankly, i, can, t, find, any, other, leggings, that, accommodate, huge, legs, and, ass, and, tighter, waist, without, falling, down, constantly, lol, the, wunder, under, have, a, seam, that, lands, basically, in, the, middle, of, my, butt, i, also, don, t, love, a, ton, of, compression, fleo, bounce, leggings, are, okay, but, ride, down, somewhat, any, of, their, other, fabrics, have, too, ...]"


## Mapping keywords

In [20]:
comments_df['keyword'].unique()

array(['Girlfriend Collective', 'Lululemon', 'Adidas', 'Nike', 'Alo Yoga',
       'Athleta'], dtype=object)

In [None]:
# maybe we should only remove the keyword for that text and not all keywords at once...

In [28]:
keywords = ['girlfriend collective', 'aerie', 'lululemon', 'adidas', 'nike', 'alo yoga', 'athleta', 'gf collective']


In [29]:
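# regex=True is required for the '|' alternation; newer pandas versions default to literal matching
comments_df['test'] = comments_df['comment_body'].str.replace('|'.join(keywords), 'keyword', regex=True)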
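
The note above about removing only the keyword that belongs to each text could look roughly like the sketch below. It uses plain dictionaries instead of the DataFrame so it runs on its own; the helper name `mask_own_keyword` and the sample rows are assumptions, and this is only one possible reading of that note.

```python
import re

# Made-up rows mirroring the 'keyword' and (already lowercased) 'comment_body' columns.
rows = [
    {"keyword": "Nike", "comment_body": "nike quality has been subpar, unlike lululemon"},
    {"keyword": "Lululemon", "comment_body": "my lululemons pill easily"},
]


def mask_own_keyword(row, placeholder="keyword"):
    """Hypothetical helper: replace only this row's own keyword with a placeholder."""
    # re.escape guards against keywords that contain regex metacharacters.
    pattern = re.escape(row["keyword"].lower())
    return re.sub(pattern, placeholder, row["comment_body"])


masked = [mask_own_keyword(row) for row in rows]
print(masked[0])  # keyword quality has been subpar, unlike lululemon
print(masked[1])  # my keywords pill easily
```

With the DataFrame, the same function could presumably be applied row by row with `comments_df.apply(mask_own_keyword, axis=1)`, at the cost of being slower than the vectorized `str.replace` call.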

## Removing punctuation and noise

In [None]:
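# fix contractions so they are not removed with punctuation removal
# (requires the third-party `contractions` package)
import contractions

sentence_here = "i can't find any other leggings"  # placeholder sentence for illustration
contractions.fix(sentence_here)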

In [48]:
# remove unnecessary punctuation
tokenizer = nltk.RegexpTokenizer(r"\w+")
comments_df['test2'] = comments_df['test'].apply(tokenizer.tokenize)

In [49]:
print(comments_df['test2'].iloc[0])

['this', 'is', 'amazing', 'thank', 'you', 'for', 'putting', 'it', 'together', 'honestly', 'i', 've', 'tried', 'so', 'many', 'brands', 'keyword', 'keyword', 'keyword', 'keyword', 'keyword', 'etc', 'and', 'the', 'ones', 'i', 'always', 'come', 'back', 'to', 'are', 'old', 'navy', 'they', 'are', 'so', 'comfortable', 'and', 'wash', 'well', 'and', 'very', 'affordable', 'the', 'old', 'navy', 'compression', 'leggings', 'retain', 'their', 'compression', 'even', 'with', 'multiple', 'washes', 'i', 'also', 'like', 'the', 'yoga', 'leggings', 'which', 'are', 'less', 'compressive', 'but', 'still', 'have', 'a', 'bit', 'of', 'compression', 'those', 'are', 'a', 'little', 'softer', 'than', 'the', 'compressive', 'ones', 'but', 'i', 'love', 'both', 'i', 'tried', 'a', 'bunch', 'of', 'fancier', 'brands', 'and', 'ended', 'up', 'just', 'going', 'back', 'to', 'old', 'navy']
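
The output above shows contractions being split ("i", "ve"), because `\w+` does not match apostrophes. The sketch below reproduces this with `re.findall`, on the assumption that it behaves like `RegexpTokenizer` with the same pattern. The alternative pattern that keeps apostrophes inside words is a suggestion, not something the notebook uses.

```python
import re

text = "i've tried so many brands and i can't pick"

# Same pattern as the tokenizer above: apostrophes act as separators.
split_tokens = re.findall(r"\w+", text)

# Assumed alternative: allow a straight or curly apostrophe inside a word.
kept_tokens = re.findall(r"\w+(?:['’]\w+)*", text)

print(split_tokens)  # ['i', 've', 'tried', 'so', 'many', 'brands', 'and', 'i', 'can', 't', 'pick']
print(kept_tokens)   # ["i've", 'tried', 'so', 'many', 'brands', 'and', 'i', "can't", 'pick']
```

Keeping contractions whole matters for stop-word removal, since fragments such as "ve" and "t" may not appear in a stop-word list.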


In [36]:
remove_noise(comments_df['test'].iloc[0])

LookupError: 
**********************************************************************
  Resource averaged_perceptron_tagger not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger')

  For more information see: https://www.nltk.org/data.html

  Attempted to load taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle

  Searched in:
    - '/Users/karinalopez/nltk_data'
    - '/Users/karinalopez/opt/anaconda3/nltk_data'
    - '/Users/karinalopez/opt/anaconda3/share/nltk_data'
    - '/Users/karinalopez/opt/anaconda3/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
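
The traceback above names the missing resource and the `nltk.download` call that provides it. A separate point that may be worth checking, as a reading of the code rather than something the notebook states: the call passes `comments_df['test'].iloc[0]`, which is a string, while `remove_noise` loops over `tweet_tokens`. Iterating over a string yields single characters, so the tokenized `test2` column is probably the intended input. The stand-in function below only illustrates the iteration behavior and does not call NLTK.

```python
def show_units(tweet_tokens):
    """Stand-in for the loop in remove_noise: list the items iteration yields."""
    return [unit for unit in tweet_tokens]


raw_text = "this is amazing"

as_string = show_units(raw_text)          # iterating a str gives characters
as_tokens = show_units(raw_text.split())  # iterating a list gives whole tokens

print(as_string[:4])  # ['t', 'h', 'i', 's']
print(as_tokens)      # ['this', 'is', 'amazing']
```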


## Tokenization
Splitting sentences into words

In [6]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

sentence = "Books are on the table"

words = word_tokenize(sentence)

print(words)


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/karinalopez/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Books', 'are', 'on', 'the', 'table']


## References
- https://www.nltk.org/install.html
- https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk
- https://stackoverflow.com/questions/55934510/fastest-way-to-replace-part-of-a-string-in-pandas-series-if-it-contains-a-word-i