# NLP Prepare Exercises
The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.


In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire

1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:
    - Lowercase everything
    - Normalize unicode characters
    - Remove anything that is not a letter, number, whitespace, or a single quote.

In [6]:
# get data 

df = acquire.get_blog_articles()

In [7]:
df

Unnamed: 0,title,content
0,Codeup’s Data Science Career Accelerator is He...,The rumors are true! The time has arrived. Cod...
1,Data Science Myths - Codeup,By Dimitri Antoniou and Maggie GiustData Scien...
2,Data Science VS Data Analytics: What’s The Dif...,"By Dimitri AntoniouA week ago, Codeuplaunched ..."
3,10 Tips to Crush It at the SA Tech Job Fair - ...,SA Tech Job FairThe third bi-annualSan Antonio...
4,Competitor Bootcamps Are Closing. Is the Model...,Competitor Bootcamps Are Closing. Is the Model...


In [20]:
for col in list(df):
    print(df[col].str.lower())

0    codeup’s data science career accelerator is he...
1                          data science myths - codeup
2    data science vs data analytics: what’s the dif...
3    10 tips to crush it at the sa tech job fair - ...
4    competitor bootcamps are closing. is the model...
Name: title, dtype: object
0    the rumors are true! the time has arrived. cod...
1    by dimitri antoniou and maggie giustdata scien...
2    by dimitri antonioua week ago, codeuplaunched ...
3    sa tech job fairthe third bi-annualsan antonio...
4    competitor bootcamps are closing. is the model...
Name: content, dtype: object


In [28]:
for col in list(df):
    for i in range(df.shape[0]):
        print(unicodedata.normalize('NFKD', df[col][i])\
        .encode('ascii', 'ignore')\
        .decode('utf-8', 'ignore'))

Codeups Data Science Career Accelerator is Here! - Codeup
Data Science Myths - Codeup
Data Science VS Data Analytics: Whats The Difference? - Codeup
10 Tips to Crush It at the SA Tech Job Fair - Codeup
Competitor Bootcamps Are Closing. Is the Model in Danger? - Codeup
The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job inGlassdoors #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.The data revolution has hit San Antonio,resulting in an explosion in Data Scientist positionsacross companies like USAA, Accenture, Booz Allen Hamilton, and HEB. Weve even seenUTSA invest $70 M for a Cybersecurity Center and School of Data Science.We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on

In [62]:
test_string = df.title[2].lower()
re.sub(r"[^a-z0-9'\s]", '', test_string)

'data science vs data analytics whats the difference  codeup'
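
The three cleaning steps explored above can be traced on a single literal string. This is only a small sketch: the `sample` text is made up for illustration and does not come from the acquired articles.

```python
import re
import unicodedata

# made-up sample with an accent, an en dash, a curly and a straight apostrophe
sample = "Café – What’s New? Don't Panic!"

# 1. lowercase
lowered = sample.lower()

# 2. normalize unicode: NFKD splits 'é' into 'e' plus a combining accent,
#    then the ASCII round trip drops every character that is not ASCII
normalized = (
    unicodedata.normalize('NFKD', lowered)
    .encode('ascii', 'ignore')
    .decode('utf-8', 'ignore')
)

# 3. drop whatever is left that is not a letter, number, ' or whitespace
cleaned = re.sub(r"[^a-z0-9'\s]", '', normalized)

print(repr(normalized))  # "cafe  whats new? don't panic!"
print(repr(cleaned))     # "cafe  whats new don't panic"
```

Note that the curly apostrophe is discarded in step 2 because it is not ASCII, while the straight apostrophe survives both steps. The removed dash leaves two spaces behind, which is the same double space visible in the output above.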

In [81]:
def basic_clean_whole_df(df):
    '''
    This function takes in a dataframe (with the columns you need cleaned)
    Lowercases everything
    Normalizes the unicode
    And removes everything that's not a number, letter, ', or whitespace
    returns a dataframe
    '''
    
    # loop through columns turn everything into lower case
    # works with series
    for col in list(df):
        df[col] = df[col].str.lower()
        
        # loop through each element in column for encoding and replacement
        for i in range(df.shape[0]):
            
            # normalize unicode 
            # use .loc so the value is written to df itself (no chained assignment)
            df.loc[i, col] = unicodedata.normalize('NFKD', df.loc[i, col])\
            .encode('ascii', 'ignore')\
            .decode('utf-8', 'ignore')
            
            # remove everything that's not a number, letter, ' or whitespace
            df.loc[i, col] = re.sub(r"[^a-z0-9'\s]", '', df.loc[i, col])
            
    # return dataframe                           
    return df
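
The element-by-element loop can also be written with pandas' vectorized string methods. The following is a sketch, assuming every listed column holds strings; `basic_clean_columns` and its `columns` parameter are hypothetical names that are not part of the original notebook.

```python
import pandas as pd

def basic_clean_columns(df, columns):
    '''
    Hypothetical variant of basic_clean_whole_df
    Cleans only the listed columns and returns a new dataframe
    '''
    # work on a copy so the caller's dataframe is left untouched
    cleaned = df.copy()

    for col in columns:
        cleaned[col] = (
            cleaned[col]
            .str.lower()                             # lowercase everything
            .str.normalize('NFKD')                   # normalize unicode
            .str.encode('ascii', errors='ignore')    # drop non-ASCII characters
            .str.decode('utf-8', errors='ignore')    # bytes back to str
            .str.replace(r"[^a-z0-9'\s]", '', regex=True)
        )

    return cleaned

# made-up example row, not taken from the acquired data
sample_df = pd.DataFrame({'title': ["Café – What’s New?"]})
print(basic_clean_columns(sample_df, ['title']))
```

Passing the column names explicitly avoids calling string methods on columns that are not text, and returning a copy keeps the original text available for later steps.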

In [82]:
# rewrite function but to take in string and return string
# then use vectorized operation to go through df aka use .apply

def basic_clean(string):
    '''
    This function takes in a string
    Lowercases everything
    Normalizes the unicode
    And removes everything that's not a number, letter, ', or whitespace
    Returns a string
    can be used with .apply for a dataframe
    '''
    # turn everything into lower case
    # works on a single string
    new_string = string.lower()
            
    # normalize unicode 
    new_string =  unicodedata.normalize('NFKD', new_string)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')
    
    # remove everything that's not a number, letter, ' or whitespace
    new_string = re.sub(r"[^a-z0-9'\s]", '', new_string)
    
    return new_string

In [84]:
df['title'].apply(basic_clean)

0    codeups data science career accelerator is her...
1                            data science myths codeup
2    data science vs data analytics whats the diffe...
3    10 tips to crush it at the sa tech job fair co...
4    competitor bootcamps are closing is the model ...
Name: title, dtype: object

In [85]:
df['title_basic_clean'] = df['title'].apply(basic_clean)

df['content_basic_clean'] = df['content'].apply(basic_clean)

In [86]:
df.head()

Unnamed: 0,title,content,title_basic_clean,content_basic_clean
0,codeups data science career accelerator is her...,the rumors are true the time has arrived codeu...,codeups data science career accelerator is her...,the rumors are true the time has arrived codeu...
1,data science myths codeup,by dimitri antoniou and maggie giustdata scien...,data science myths codeup,by dimitri antoniou and maggie giustdata scien...
2,data science vs data analytics whats the diffe...,by dimitri antonioua week ago codeuplaunched o...,data science vs data analytics whats the diffe...,by dimitri antonioua week ago codeuplaunched o...
3,10 tips to crush it at the sa tech job fair co...,sa tech job fairthe third biannualsan antonio ...,10 tips to crush it at the sa tech job fair co...,sa tech job fairthe third biannualsan antonio ...
4,competitor bootcamps are closing is the model ...,competitor bootcamps are closing is the model ...,competitor bootcamps are closing is the model ...,competitor bootcamps are closing is the model ...


2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.
```python
def tokenize(string):
    
    # do something
    
    return tokenized_string
```

In [87]:
tokenizer = nltk.tokenize.ToktokTokenizer()

print(tokenizer.tokenize(df.title[1], return_str=True))

data science myths codeup
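
The output above is indistinguishable from a cleaned title because there is no punctuation left to split off. To see what tokenizing does to raw text, here is a rough sketch that uses a plain regular expression as a stand-in. It is not how ToktokTokenizer works internally, and `simple_tokenize` is a hypothetical name.

```python
import re

def simple_tokenize(text):
    '''
    Hypothetical stand-in for a tokenizer
    Separates punctuation marks from words and returns a string
    '''
    # a token is either a word (apostrophes inside a word are kept)
    # or a single character that is neither whitespace nor alphanumeric
    tokens = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", text)

    # join the tokens with a space, like return_str=True does
    return ' '.join(tokens)

print(simple_tokenize("Is the model in danger? It's not, they say."))
# Is the model in danger ? It's not , they say .

print(simple_tokenize("data science myths codeup"))
# data science myths codeup
```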


In [88]:
def tokenize_me(string):
    '''
    This function takes in a string 
    Returns the tokenized string
    Can be used with .apply to apply to dataframe
    '''
    # make tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    # tokenize string and assign to new_string
    new_string = tokenizer.tokenize(string, return_str=True)
    
    # return new_string
    return new_string

In [89]:
# test
df['title_basic_clean'].apply(tokenize_me)

0    codeups data science career accelerator is her...
1                            data science myths codeup
2    data science vs data analytics whats the diffe...
3    10 tips to crush it at the sa tech job fair co...
4    competitor bootcamps are closing is the model ...
Name: title_basic_clean, dtype: object

In [90]:
# add columns to df

df['title_tokenized'] = df['title_basic_clean'].apply(tokenize_me)

df['content_tokenized'] = df['content_basic_clean'].apply(tokenize_me)

3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [95]:
def stem(string):
    '''
    Function takes in a string
    Returns string stems 
    Uses Porter Stemmer
    can be used with .apply for dataframes
    '''
    # create stemmer
    ps = nltk.porter.PorterStemmer()
    
    # get the stems from string in list
    stems = [ps.stem(word) for word in string.split()]
    
    # join all words in string with a space
    string_stemmed = ' '.join(stems)
    
    return string_stemmed

In [98]:
# test 
df['title_tokenized'].apply(stem)

0     codeup data scienc career acceler is here codeup
1                              data scienc myth codeup
2    data scienc vs data analyt what the differ codeup
3    10 tip to crush it at the sa tech job fair codeup
4    competitor bootcamp are close is the model in ...
Name: title_tokenized, dtype: object

In [97]:
# apply to new columns
df['title_stemmed'] = df['title_tokenized'].apply(stem)

df['content_stemmed'] = df['content_tokenized'].apply(stem)

4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [103]:
# run once to fetch the WordNet corpus
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/Heather/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [104]:
def lemmatize(string):
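    '''
    Function takes in string
    Returns lemmatized string
    Uses WordNet Lemmatizer
    can be used with .apply for dataframes
    '''
    # create lemmatizer
    wnl = nltk.stem.WordNetLemmatizer()

    # get the lemmas from string in list
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # join all words in string with a space
    string_lemmatized = ' '.join(lemmas)
    
    return string_lemmatized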

In [105]:
df['title_tokenized'].apply(lemmatize)

0    codeups data science career accelerator is her...
1                             data science myth codeup
2    data science v data analytics whats the differ...
3    10 tip to crush it at the sa tech job fair codeup
4    competitor bootcamps are closing is the model ...
Name: title_tokenized, dtype: object

5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

    - This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [None]:
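def remove_stopwords(string, extra_words = None, exclude_words = None):
    '''
    This function takes in a string
    Removes the stopwords (NLTK English list) from it
    extra_words: optional list of additional words to treat as stopwords
    exclude_words: optional list of stopwords that should be kept
    Returns a string
    '''
    # define stopwords list (needs the NLTK stopwords corpus downloaded)
    stopwords_list = stopwords.words('english')

    # add or remove words based on arguments
    stopwords_set = set(stopwords_list)
    stopwords_set |= set(extra_words or [])
    stopwords_set -= set(exclude_words or [])

    # remove stopwords from string
    words = [word for word in string.split() if word not in stopwords_set]

    # join the remaining words with a space
    return ' '.join(words)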

In [106]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

6. Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

8. For each dataframe, produce the following columns:
    - title to hold the title
    - original to hold the original article/post content
    - clean to hold the normalized and tokenized original with the stopwords removed.
    - stemmed to hold the stemmed version of the cleaned data.
    - lemmatized to hold the lemmatized version of the cleaned data.
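
One possible way to build the five columns listed above is sketched here. `prep_article_data` and its parameters are hypothetical names, and the `toy_*` functions are placeholders so the example runs without NLTK; in practice they would be replaced by the cleaning, stemming and lemmatizing functions defined earlier.

```python
import pandas as pd

def prep_article_data(df, title_col, content_col, clean, stem, lemmatize):
    '''
    Hypothetical helper: builds the five requested columns from one dataframe
    clean, stem and lemmatize each take a string and return a string
    '''
    prepared = pd.DataFrame({
        'title': df[title_col],
        'original': df[content_col],
    })

    # clean is expected to normalize, tokenize and remove stopwords
    prepared['clean'] = prepared['original'].apply(clean)

    # stemmed and lemmatized are both derived from the cleaned text
    prepared['stemmed'] = prepared['clean'].apply(stem)
    prepared['lemmatized'] = prepared['clean'].apply(lemmatize)

    return prepared

# toy stand-ins (NOT the real functions) so the sketch is self-contained
toy_clean = lambda s: ' '.join(w for w in s.lower().split() if w != 'the')
toy_stem = lambda s: ' '.join(w[:4] for w in s.split())
toy_lemmatize = lambda s: s

# made-up article, not taken from the acquired data
articles = pd.DataFrame({
    'title': ['Data Science Myths'],
    'content': ['The Rumors Are True'],
})

demo_df = prep_article_data(articles, 'title', 'content',
                            toy_clean, toy_stem, toy_lemmatize)
print(demo_df)
```

Taking the column names as arguments lets the same helper serve both the news articles and the Codeup blog posts, whose content columns may be named differently.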

9. Ask yourself:

    - If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
    - If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
    - If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?