# NLP Data Preparation Exercises

****

### 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Replace anything that is not a letter, number, whitespace, or a single quote with an empty string.

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire



In [2]:
df = acquire.get_news_articles(cached=True)
df.head()

| | topic | title | author | content |
|---|---|---|---|---|
| 0 | business | Moderna's early data shows its COVID-19 vaccin... | Pragya Swastik | American biotechnology company Moderna on Mond... |
| 1 | business | 15 countries sign world's biggest free-trade p... | Pragya Swastik | Fifteen Asia-Pacific countries signed the Regi... |
| 2 | business | How does Moderna's COVID-19 vaccine candidate ... | Pragya Swastik | Moderna's initial results of late-stage trial ... |
| 3 | business | Reliance Retail buys 96% stake in Urban Ladder... | Rishabh Bhatnagar | Reliance Industries' retail arm Reliance Retai... |
| 4 | business | Reduce foreign funding to 26% by Oct 15, 2021:... | Pragya Swastik | The I&B Ministry on Monday asked digital media... |


In [3]:
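def basic_clean(string):
    '''
    This function takes in a string and returns it lowercased,
    ASCII-normalized, and stripped of punctuation.
    '''
    # NFKD decomposes accented characters so the ASCII step keeps the base letter.
    string = unicodedata.normalize('NFKD', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')
    # Note: [^\w\s] also strips single quotes and keeps underscores.
    string = re.sub(r'[^\w\s]', '', string).lower()
    return string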

In [7]:
articles = df.content
article = articles[0]

In [8]:
basic_clean(article)

'american biotechnology company moderna on monday announced its experimental vaccine was 945 effective in preventing covid19 based on interim data from a latestage clinical trial modernas interim analysis was based on 95 infections among trial participants who received either a placebo or the vaccine among those only five infections occurred in those who received the vaccine'
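
The exercise asks to keep single quotes, while `[^\w\s]` strips them (hence `modernas` in the output above). A minimal sketch of a variant that keeps them, assuming lowercasing first and NFKD normalization; the name `basic_clean_keep_quotes` and the sample string are hypothetical and not part of the original solution:

```python
import re
import unicodedata


def basic_clean_keep_quotes(string):
    """Lowercase, fold to ASCII, and keep only letters, digits, whitespace, and single quotes."""
    string = string.lower()
    # NFKD splits accented characters into base letter + combining mark,
    # so the ASCII encode step drops only the mark and keeps the letter.
    string = (
        unicodedata.normalize("NFKD", string)
        .encode("ascii", "ignore")
        .decode("utf-8", "ignore")
    )
    # Hypothetical variant of the pattern: the character class lists ' explicitly.
    return re.sub(r"[^a-z0-9'\s]", "", string)


print(basic_clean_keep_quotes("Moderna's Café: 94.5% effective!"))
# moderna's cafe 945 effective
```

Curly apostrophes, such as the one in `Pfizer’s`, are not ASCII, so the encode step still drops them even with this pattern.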

### 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [5]:
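def tokenize(string):
    # First pass: instantiate the tokenizer, then call .tokenize(); returns a list of tokens.
    return nltk.tokenize.ToktokTokenizer().tokenize(string)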

In [9]:
def tokenize(string):
    '''
    This function takes in a string and
    returns a tokenized string.
    '''
    # Create tokenizer.
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    # Use tokenizer
    string = tokenizer.tokenize(string, return_str=True)
    
    return string

In [10]:
tokenize(article)

"American biotechnology company Moderna on Monday announced its experimental vaccine was 94.5 % effective in preventing COVID-19 based on interim data from a late-stage clinical trial. Moderna ' s interim analysis was based on 95 infections among trial participants who received either a placebo or the vaccine. Among those , only five infections occurred in those who received the vaccine ."

### 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [11]:
def stem(string):
    '''
    This function takes in a string and
    returns a string with words stemmed.
    '''
    # Create porter stemmer.
    ps = nltk.stem.PorterStemmer()
    
    # Use the stemmer to stem each word in the list of words we created by using split.
    stems = [ps.stem(word) for word in string.split()]
    
    # Join our lists of words into a string again and assign to a variable.
    string = ' '.join(stems)
    
    return string

In [12]:
stem(article)

"american biotechnolog compani moderna on monday announc it experiment vaccin wa 94.5% effect in prevent covid-19 base on interim data from a late-stag clinic trial. moderna' interim analysi wa base on 95 infect among trial particip who receiv either a placebo or the vaccine. among those, onli five infect occur in those who receiv the vaccine."

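In the output above, `vaccine.` keeps its final `e` while `vaccine` elsewhere becomes `vaccin`: `str.split()` splits on whitespace only, so trailing punctuation stays attached to the word handed to the stemmer. A small sketch of the difference using only the standard library; the sample phrase is taken from the article, and stripping punctuation with `re.sub` before splitting is one possible remedy, not necessarily the intended one:

```python
import re

raw = "received the vaccine. Among those, only five"

# Splitting the raw text leaves punctuation glued to the neighbouring word.
raw_words = raw.split()
print(raw_words)
# ['received', 'the', 'vaccine.', 'Among', 'those,', 'only', 'five']

# Lowercasing and stripping punctuation first gives the stemmer bare words.
clean_words = re.sub(r"[^\w\s]", "", raw.lower()).split()
print(clean_words)
# ['received', 'the', 'vaccine', 'among', 'those', 'only', 'five']
```
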
### 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [14]:
def lemmatize(string):
    '''
    This function takes in a string and
    returns a string with words lemmatized.
    '''
    # Create the lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Use the lemmatizer on each word in the list of words we created by using split.
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # Join our list of words into a string again and assign to a variable.
    string = ' '.join(lemmas)
    
    return string

In [15]:
lemmatize(article)

"American biotechnology company Moderna on Monday announced it experimental vaccine wa 94.5% effective in preventing COVID-19 based on interim data from a late-stage clinical trial. Moderna's interim analysis wa based on 95 infection among trial participant who received either a placebo or the vaccine. Among those, only five infection occurred in those who received the vaccine."

### 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

- This function should define two optional parameters, extra_words and exclude_words. 
- These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [16]:
def remove_stopwords(string, extra_words=[], exclude_words=[]):
    '''
    This function takes in a string, plus optional extra_words and exclude_words
    lists (empty by default), and returns the string with stopwords removed.
    '''
    # Create stopword_list.
    stopword_list = stopwords.words('english')
    
    # Remove 'exclude_words' from stopword_list to keep these in my text.
    stopword_list = set(stopword_list) - set(exclude_words)

    # Add in 'extra_words' to stopword_list.
    stopword_list = stopword_list.union(set(extra_words))
    
    # Split words in string.
    words = string.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords

In [17]:
remove_stopwords(article)

"American biotechnology company Moderna Monday announced experimental vaccine 94.5% effective preventing COVID-19 based interim data late-stage clinical trial. Moderna's interim analysis based 95 infections among trial participants received either placebo vaccine. Among those, five infections occurred received vaccine."

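The output above still contains `those,` because the trailing comma keeps it from matching the stopword `those`, so stopword removal works best on text that has already been cleaned. The set arithmetic behind `extra_words` and `exclude_words` can be sketched as follows; `BASE_STOPWORDS` is a tiny hypothetical stand-in for the NLTK list so the example runs without NLTK data, and `filter_words` is an illustrative name:

```python
# Tiny stand-in stopword set so the sketch runs without NLTK data (hypothetical).
BASE_STOPWORDS = {"the", "in", "those", "who", "only", "on"}


def filter_words(string, extra_words=None, exclude_words=None):
    """Drop stopwords from an already cleaned, lowercase string."""
    # Start from the base set, remove the words to keep, then add the extra ones.
    stopword_set = (BASE_STOPWORDS - set(exclude_words or [])) | set(extra_words or [])
    return " ".join(word for word in string.split() if word not in stopword_set)


text = "among those only five infections occurred in those who received the vaccine"

print(filter_words(text))
# among five infections occurred received vaccine
print(filter_words(text, extra_words=["among"], exclude_words=["only"]))
# only five infections occurred received vaccine
```
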
### 6. Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [18]:
news_df = acquire.get_news_articles(cached=False)

In [19]:
def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function takes in a df and the string name of a text column, with
    the option to pass lists for extra_words and exclude_words, and
    returns a df with the article title, original text, stemmed text,
    lemmatized text, and cleaned text (tokenized, stopwords removed, lemmatized).
    '''
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)\
                            .apply(lemmatize)
    
    df['stemmed'] = df[column].apply(basic_clean).apply(stem)
    
    df['lemmatized'] = df[column].apply(basic_clean).apply(lemmatize)
    
    return df[['title', column, 'stemmed', 'lemmatized', 'clean']]

In [23]:
news_df = prep_article_data(news_df, 'content')
news_df

| | title | content | stemmed | lemmatized | clean |
|---|---|---|---|---|---|
| 0 | Lakshmi Vilas Bank withdrawals capped at ₹25,0... | The Centre has imposed a 30-day moratorium on ... | the centr ha impos a 30day moratorium on laksh... | the centre ha imposed a 30day moratorium on la... | centre imposed 30day moratorium lakshmi vila b... |
| 1 | Shutting Delhi markets may prove counterproduc... | Traders' body CAIT on Tuesday said a proposal ... | trader bodi cait on tuesday said a propos to i... | trader body cait on tuesday said a proposal to... | trader body cait tuesday said proposal impose ... |
| 2 | Pfizer shares drop 4.5% as Moderna says its va... | Pfizer’s shares fell as much as 4.5% on Monday... | pfizer share fell as much as 45 on monday afte... | pfizers share fell a much a 45 on monday after... | pfizers share fell much 45 monday rival modern... |
| 3 | Musk gets $15bn richer in 2 hours, becomes wor... | Billionaire Elon Musk added $15 billion to his... | billionair elon musk ad 15 billion to hi wealt... | billionaire elon musk added 15 billion to his ... | billionaire elon musk added 15 billion wealth ... |
| 4 | What have I done to deserve this: Tharoor resp... | Responding to a joke on him by RPG Group's bil... | respond to a joke on him by rpg group billiona... | responding to a joke on him by rpg group billi... | responding joke rpg group billionaire chairman... |
| ... | ... | ... | ... | ... | ... |
| 94 | Aditya to start shooting for upcoming action f... | Actor Aditya Roy Kapur will start shooting for... | actor aditya roy kapur will start shoot for hi... | actor aditya roy kapur will start shooting for... | actor aditya roy kapur start shooting upcoming... |
| 95 | Wonder if people will take the risk: Dilijt on... | Commenting on the theatrical release of his fi... | comment on the theatric releas of hi film sura... | commenting on the theatrical release of his fi... | commenting theatrical release film suraj pe ma... |
| 96 | It's not easy for big stars to do OTT, but the... | Actress Aahana S Kumra, who has featured in we... | actress aahana s kumra who ha featur in web se... | actress aahana s kumra who ha featured in web ... | actress aahana kumra featured web series insid... |
| 97 | My husband didn't pressure me to lose weight p... | Actress Neha Dhupia said that her husband, Ang... | actress neha dhupia said that her husband anga... | actress neha dhupia said that her husband anga... | actress neha dhupia said husband angad bedi ne... |


### 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [29]:
urls = acquire.get_all_urls()
codeup_df = acquire.get_blog_articles(urls=urls, cached = True)
codeup_df.head()

| | title | content |
|---|---|---|
| 0 | What is Machine Learning? | There’s a lot we can learn about machines, and... |
| 1 | Introducing Our Salary Refund Guarantee | Here at Codeup, we believe it’s time to revolu... |
| 2 | Codeup Launches Houston! | Houston, we have a problem: there aren’t enoug... |
| 3 | Codeup Grads Win CivTech Datathon | Many Codeup alumni enjoy competing in hackatho... |
| 4 | How Codeup Alumni are Helping to Make Water | Imagine having a kit mailed to you with all th... |


In [30]:
prep_article_data(codeup_df, 'content')

| | title | content | stemmed | lemmatized | clean |
|---|---|---|---|---|---|
| 0 | What is Machine Learning? | There’s a lot we can learn about machines, and... | there a lot we can learn about machin and ther... | there a lot we can learn about machine and the... | there lot learn machine there lot machine lear... |
| 1 | Introducing Our Salary Refund Guarantee | Here at Codeup, we believe it’s time to revolu... | here at codeup we believ it time to revolution... | here at codeup we believe it time to revolutio... | codeup believe time revolutionize hiring launc... |
| 2 | Codeup Launches Houston! | Houston, we have a problem: there aren’t enoug... | houston we have a problem there arent enough s... | houston we have a problem there arent enough s... | houston problem arent enough software develope... |
| 3 | Codeup Grads Win CivTech Datathon | Many Codeup alumni enjoy competing in hackatho... | mani codeup alumni enjoy compet in hackathon a... | many codeup alumnus enjoy competing in hackath... | many codeup alumnus enjoy competing hackathons... |
| 4 | How Codeup Alumni are Helping to Make Water | Imagine having a kit mailed to you with all th... | imagin have a kit mail to you with all the nec... | imagine having a kit mailed to you with all th... | imagine kit mailed necessary component make co... |
| 5 | What is the Transition into Data Science Like? | Alumni Katy Salts and Brandi Reger joined us a... | alumni kati salt and brandi reger join us at a... | alumnus katy salt and brandi reger joined u at... | alumnus katy salt brandi reger joined u public... |
| 6 | What Data Science Career is For You? | If you’re struggling to see yourself as a data... | if your struggl to see yourself as a data scie... | if youre struggling to see yourself a a data s... | youre struggling see data science professional... |
| 7 | What to Expect at Codeup | Setting Expectations for Life Before, During, ... | set expect for life befor dure and after codeu... | setting expectation for life before during and... | setting expectation life codeup wondering whet... |
| 8 | What is Python? | If you’ve been digging around our website or r... | if youv been dig around our websit or research... | if youve been digging around our website or re... | youve digging around website researching tech ... |
| 9 | Your Education is an Investment | You have many options regarding educational ro... | you have mani option regard educ rout to your ... | you have many option regarding educational rou... | many option regarding educational route desire... |


### 8. For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.
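
The column layout listed above can be sketched in plain Python, without pandas. The function name `build_columns`, the stand-in `toy_clean` and `toy_stem` helpers, and the sample record are all hypothetical; in the notebook, the real cleaning, stemming, and lemmatizing functions would be passed in instead:

```python
def build_columns(records, clean, stem, lemmatize, column="content"):
    """Return one dict per record with the five columns listed above."""
    rows = []
    for record in records:
        cleaned = clean(record[column])
        rows.append({
            "title": record["title"],
            "original": record[column],
            "clean": cleaned,
            # Both derived columns start from the cleaned text, not the raw text.
            "stemmed": stem(cleaned),
            "lemmatized": lemmatize(cleaned),
        })
    return rows


def toy_clean(text):
    # Stand-in for the real cleaning steps: lowercase and drop periods.
    return text.lower().replace(".", "")


def toy_stem(text):
    # Stand-in for a real stemmer or lemmatizer: strip one trailing "s" per word.
    return " ".join(w[:-1] if w.endswith("s") else w for w in text.split())


rows = build_columns(
    [{"title": "Sample", "content": "Five infections occurred."}],
    clean=toy_clean, stem=toy_stem, lemmatize=toy_stem,
)
print(rows[0]["clean"])    # five infections occurred
print(rows[0]["stemmed"])  # five infection occurred
```

Passing the resulting list to `pd.DataFrame(rows)` would turn it into a dataframe with those five columns.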

In [None]:
# Completed above by prep_article_data; the original text stays in the `content` column.

### 9. Ask yourself:

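- A. If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
    - **lemmatized**: the corpus is small, so the extra time lemmatization takes is negligible and the output consists of real words.
- B. If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
    - **either**: lemmatizing is still practical at this size; stem if speed matters more than readable output.
- C. If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
    - **stemmed**: stemming is cheaper to compute, which matters when every megabyte processed is billed.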