In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

from acquire import get_all_blogs, get_all_news

### The end result of this exercise should be a file named prepare.py that defines the requested functions.

### In this exercise we will be defining some functions to prepare textual data. These functions should apply equally well to both the codeup blog articles and the news articles that were previously acquired.

### 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:  
* Lowercase everything
* Normalize unicode characters
* Remove anything that is not a letter, number, whitespace, or a single quote.

In [2]:
blogs = get_all_blogs()
news = get_all_news()

In [4]:
blog_article = blogs[0]['article']
news_article = news[0]['article']

In [5]:
def basic_clean(s):
    s = s.lower()
    s = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    s = re.sub(r"[^a-z0-9'\s]", '', s)
    return s
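
To see what each step of `basic_clean` contributes, the sketch below applies the same three operations one at a time and returns every intermediate result. The helper name `basic_clean_steps` and the sample string are made up for illustration; they are not part of the exercise.

```python
import re
import unicodedata


def basic_clean_steps(s):
    """Return the intermediate result of each cleaning step, in order."""
    lowered = s.lower()
    # NFKD splits an accented character into a base letter plus a combining
    # mark; encoding to ASCII with errors='ignore' then drops the mark.
    normalized = (
        unicodedata.normalize('NFKD', lowered)
        .encode('ascii', 'ignore')
        .decode('utf-8', 'ignore')
    )
    # Remove everything except lowercase letters, digits, single quotes
    # and whitespace.
    stripped = re.sub(r"[^a-z0-9'\s]", '', normalized)
    return lowered, normalized, stripped


# Hypothetical sample text; \u00e9 is the accented letter in "Café".
sample = "Caf\u00e9's #1 Data-Science program!"
for step in basic_clean_steps(sample):
    print(step)
```

Because the hyphen is removed rather than replaced with a space, `data-science` becomes `datascience`. The same effect is visible in the article output below, where "full-time" and "hands-on" appear as `fulltime` and `handson`.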

In [7]:
blog_article = basic_clean(blog_article)
blog_article

'the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america\ndata science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industry\nour program will be 18 weeks long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problem

In [8]:
news_article = basic_clean(news_article)
news_article

'australia defeated india in the second odi in sydney by 51 runs to take an unassailable 20 lead in the threematch series india have now lost seven international matches in a row and two consecutive odi series the match witnessed australia register their highest odi total against india 3894 the dead rubber will take place on december 2 wednesday'

### 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [9]:
def tokenize(s):
    tokenizer = ToktokTokenizer()
    return tokenizer.tokenize(s, return_str=True)

In [10]:
blog_article = tokenize(blog_article)
blog_article

'the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america\ndata science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industry\nour program will be 18 weeks long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problem

In [11]:
news_article = tokenize(news_article)
news_article

'australia defeated india in the second odi in sydney by 51 runs to take an unassailable 20 lead in the threematch series india have now lost seven international matches in a row and two consecutive odi series the match witnessed australia register their highest odi total against india 3894 the dead rubber will take place on december 2 wednesday'

### 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [14]:
def stem(s):
    ps = nltk.stem.PorterStemmer()
    stems = [ps.stem(word) for word in s.split()]
    article_stemmed = ' '.join(stems)
    return article_stemmed

In [15]:
blog_article_stem = stem(blog_article)
blog_article_stem

'the rumor are true the time ha arriv codeup ha offici open applic to our new data scienc career acceler with onli 25 seat avail thi immers program is one of a kind in san antonio and will help you land a job in glassdoor 1 best job in america data scienc is a method of provid action intellig from data the data revolut ha hit san antonio result in an explos in data scientist posit across compani like usaa accentur booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecur center and school of data scienc we built a program to specif meet the grow demand of thi industri our program will be 18 week long fulltim handson and projectbas our curriculum develop and instruct is led by senior data scientist maggi giust who ha work at heb capit group and rackspac along with input from dozen of practition and hire partner student will work with real data set realist problem and the entir data scienc pipelin from collect to deploy they will receiv profession develop train in resu

In [16]:
news_article_stem = stem(news_article)
news_article_stem

'australia defeat india in the second odi in sydney by 51 run to take an unassail 20 lead in the threematch seri india have now lost seven intern match in a row and two consecut odi seri the match wit australia regist their highest odi total against india 3894 the dead rubber will take place on decemb 2 wednesday'

### 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [17]:
def lemmatize(s):
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in s.split()]
    article_lemmatized = ' '.join(lemmas)
    return article_lemmatized

In [18]:
blog_lemmatized = lemmatize(blog_article)
blog_lemmatized

'the rumor are true the time ha arrived codeup ha officially opened application to our new data science career accelerator with only 25 seat available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america data science is a method of providing actionable intelligence from data the data revolution ha hit san antonio resulting in an explosion in data scientist position across company like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demand of this industry our program will be 18 week long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who ha worked at heb capital group and rackspace along with input from dozen of practitioner and hiring partner student will work with real data set realistic problem and the entire dat

In [19]:
news_lemmatized = lemmatize(news_article)
news_lemmatized

'australia defeated india in the second odi in sydney by 51 run to take an unassailable 20 lead in the threematch series india have now lost seven international match in a row and two consecutive odi series the match witnessed australia register their highest odi total against india 3894 the dead rubber will take place on december 2 wednesday'

### 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

In [38]:
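def remove_stopwords(s):
    # Start from NLTK's English stopword list.
    stopword_list = stopwords.words('english')
    # Take 'no' and 'not' out of the list so they are kept in the text.
    stopword_list.remove('no')
    stopword_list.remove('not')
    # Split on whitespace and drop every word that is still in the list.
    words = s.split()
    filtered_words = [w for w in words if w not in stopword_list]
    article_substopwords = ' '.join(filtered_words)
    return article_substopwords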
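
The function above hard-codes which words are kept. One possible variation, sketched here as an assumption rather than as part of the exercise, takes the additions and exceptions as parameters. The names `remove_stopwords_custom`, `extra_words`, `exclude_words` and `DEMO_STOPWORDS` are hypothetical, and the tiny stopword set stands in for `stopwords.words('english')` so that the sketch runs without the NLTK corpus.

```python
# Tiny illustrative stopword set so the sketch runs without NLTK data;
# in the notebook this would be stopwords.words('english').
DEMO_STOPWORDS = {'the', 'is', 'a', 'of', 'no', 'not', 'in', 'and'}


def remove_stopwords_custom(s, extra_words=(), exclude_words=(),
                            base_words=DEMO_STOPWORDS):
    """Remove stopwords from s.

    extra_words: additional words to treat as stopwords.
    exclude_words: words to keep even though they are in the base list.
    """
    # A set gives constant-time membership checks; a list is scanned
    # from the start for every word.
    stopword_set = (set(base_words) | set(extra_words)) - set(exclude_words)
    return ' '.join(w for w in s.split() if w not in stopword_set)


text = 'the program is not a bootcamp of the usual kind'
print(remove_stopwords_custom(text, exclude_words=['no', 'not']))
print(remove_stopwords_custom(text, extra_words=['program'],
                              exclude_words=['not']))
```

Using set difference also avoids the `ValueError` that `list.remove` raises when the word to remove is not in the list.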

In [40]:
blog_lemmatized = remove_stopwords(blog_lemmatized)
blog_lemmatized

'rumor true time arrived codeup officially opened application new data science career accelerator 25 seat available immersive program one kind san antonio help land job glassdoors 1 best job america data science method providing actionable intelligence data data revolution hit san antonio resulting explosion data scientist position across company like usaa accenture booz allen hamilton heb weve even seen utsa invest 70 cybersecurity center school data science built program specifically meet growing demand industry program 18 week long fulltime handson projectbased curriculum development instruction led senior data scientist maggie giust worked heb capital group rackspace along input dozen practitioner hiring partner student work real data set realistic problem entire data science pipeline collection deployment receive professional development training resume writing interviewing continuing education prepare smooth transition workforce focus applied data science immediate impact roi bus

In [41]:
news_lemmatized = remove_stopwords(news_lemmatized)
news_lemmatized

'australia defeated india second odi sydney 51 run take unassailable 20 lead threematch series india lost seven international match row two consecutive odi series match witnessed australia register highest odi total india 3894 dead rubber take place december 2 wednesday'

### 6. Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe news_df.

In [33]:
news_df = pd.DataFrame(get_all_news())
news_df.head()

| | title | article | category |
|---|---|---|---|
| 0 | India lose 7th consecutive international match... | Australia defeated India in the second ODI in ... | sports |
| 1 | Virat Kohli becomes fastest batsman to reach 2... | During his 87-ball 89-run knock against Austra... | sports |
| 2 | Referee saves MMA fighter from nearly flashing... | A video showing referee Jason Herzog saving MM... | sports |
| 3 | Didn’t ask for paternity leave, my wife backed... | Ex-India captain Sunil Gavaskar said he hadn't... | sports |
| 4 | Warner ruled out of 3rd ODI & T20I series due ... | Opener David Warner was ruled out of the third... | sports |


### 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [35]:
codeup_df = pd.DataFrame(get_all_blogs())
codeup_df.head()

| | title | article |
|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths | By Dimitri Antoniou and Maggie Giust\nData Sci... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri Antoniou\nA week ago, Codeup launch... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | SA Tech Job Fair\nThe third bi-annual San Anto... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... |


### 8. For each dataframe, produce the following columns:  
* title to hold the title
* original to hold the original article/post content
* clean to hold the normalized and tokenized original with the stopwords removed.
* stemmed to hold the stemmed version of the cleaned data.
* lemmatized to hold the lemmatized version of the cleaned data.

In [36]:
news_df = news_df.rename(columns={'article': 'original'})
codeup_df = codeup_df.rename(columns={'article': 'original'})

In [42]:
news_df['clean'] = news_df.original.apply(basic_clean)
news_df['clean'] = news_df.clean.apply(tokenize)
news_df['clean'] = news_df.clean.apply(remove_stopwords)

news_df['stemmed'] = news_df.clean.apply(stem)
news_df['lemmatized'] = news_df.clean.apply(lemmatize)

news_df.head()

| | title | original | category | clean | stemmed | lemmatized |
|---|---|---|---|---|---|---|
| 0 | India lose 7th consecutive international match... | Australia defeated India in the second ODI in ... | sports | australia defeated india second odi sydney 51 ... | australia defeat india second odi sydney 51 ru... | australia defeated india second odi sydney 51 ... |
| 1 | Virat Kohli becomes fastest batsman to reach 2... | During his 87-ball 89-run knock against Austra... | sports | 87ball 89run knock australia second odi team i... | 87ball 89run knock australia second odi team i... | 87ball 89run knock australia second odi team i... |
| 2 | Referee saves MMA fighter from nearly flashing... | A video showing referee Jason Herzog saving MM... | sports | video showing referee jason herzog saving mma ... | video show refere jason herzog save mma fighte... | video showing referee jason herzog saving mma ... |
| 3 | Didn’t ask for paternity leave, my wife backed... | Ex-India captain Sunil Gavaskar said he hadn't... | sports | exindia captain sunil gavaskar said ' asked bc... | exindia captain sunil gavaskar said ' ask bcci... | exindia captain sunil gavaskar said ' asked bc... |
| 4 | Warner ruled out of 3rd ODI & T20I series due ... | Opener David Warner was ruled out of the third... | sports | opener david warner ruled third final odi thre... | open david warner rule third final odi threema... | opener david warner ruled third final odi thre... |


In [43]:
codeup_df['clean'] = codeup_df.original.apply(basic_clean)
codeup_df['clean'] = codeup_df.clean.apply(tokenize)
codeup_df['clean'] = codeup_df.clean.apply(remove_stopwords)

codeup_df['stemmed'] = codeup_df.clean.apply(stem)
codeup_df['lemmatized'] = codeup_df.clean.apply(lemmatize)

codeup_df.head()

| | title | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | The rumors are true! The time has arrived. Cod... | rumors true time arrived codeup officially ope... | rumor true time arriv codeup offici open appli... | rumor true time arrived codeup officially open... |
| 1 | Data Science Myths | By Dimitri Antoniou and Maggie Giust\nData Sci... | dimitri antoniou maggie giust data science big... | dimitri antoni maggi giust data scienc big dat... | dimitri antoniou maggie giust data science big... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri Antoniou\nA week ago, Codeup launch... | dimitri antoniou week ago codeup launched imme... | dimitri antoni week ago codeup launch immers d... | dimitri antoniou week ago codeup launched imme... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | SA Tech Job Fair\nThe third bi-annual San Anto... | sa tech job fair third biannual san antonio te... | sa tech job fair third biannual san antonio te... | sa tech job fair third biannual san antonio te... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... | competitor bootcamps closing model danger prog... | competitor bootcamp close model danger program... | competitor bootcamps closing model danger prog... |
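
The same five assignments are written out once for each dataframe. A minimal sketch of one way to share that logic follows, assuming each acquired record is a dict with `title` and `article` keys, as the indexing `blogs[0]['article']` earlier suggests. The helper `prepare_records` is hypothetical, and the stand-in functions passed to it exist only so the sketch runs on its own.

```python
def prepare_records(records, clean, stem, lemmatize, content_key='article'):
    """Build title/original/clean/stemmed/lemmatized fields for each record."""
    prepared = []
    for record in records:
        # Clean once, then derive both the stemmed and lemmatized versions
        # from the same cleaned text.
        cleaned = clean(record[content_key])
        prepared.append({
            'title': record['title'],
            'original': record[content_key],
            'clean': cleaned,
            'stemmed': stem(cleaned),
            'lemmatized': lemmatize(cleaned),
        })
    return prepared


# Stand-in functions for illustration only. In the notebook, `clean` would
# chain basic_clean, tokenize and remove_stopwords, and `stem` and
# `lemmatize` would be the functions defined earlier.
demo = prepare_records(
    [{'title': 'Demo', 'article': 'Hello World'}],
    clean=str.lower,
    stem=lambda s: s,
    lemmatize=lambda s: s,
)
print(demo[0]['clean'])
```

The returned list of dicts can be passed to `pd.DataFrame` in the same way as the output of `get_all_news()`. Note that this sketch keeps only the five requested fields, so an extra column such as `category` would need to be copied across separately.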


### 9. Ask yourself:  
* If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?

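With a corpus this small there isn't much of a difference in speed, so I would prefer the lemmatized text. It keeps readable words (for example `arrived` rather than `arriv` in the output above), which makes it the preferred method when cost is not a concern.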

* If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?

If the difference in speed isn't significant with 25MB of data, then I would still prefer lemmatizing over stemming the text.

* If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

With 200TB of data I would stem the text to save time and cost, since stemming is generally the cheaper of the two operations.
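
These answers depend on how much slower one method is than the other, which can be measured instead of guessed. The sketch below is one way to do that; the helper `seconds_per_megabyte` is hypothetical, and `str.upper` is only a stand-in so the block runs without NLTK. In the notebook, `stem` or `lemmatize` would be passed in its place.

```python
import time


def seconds_per_megabyte(func, text, repeats=3):
    """Time func(text) and scale the best run to seconds per megabyte."""
    # Size of the text in megabytes (10**6 bytes) when encoded as UTF-8.
    size_mb = len(text.encode('utf-8')) / 1_000_000
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        func(text)
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    return best / size_mb


sample = 'the rumors are true the time has arrived ' * 5000
# str.upper is a stand-in; pass stem or lemmatize from the notebook instead.
rate = seconds_per_megabyte(str.upper, sample)
print(f'{rate:.4f} s/MB')
```

Multiplying the measured rate for each function by the corpus size in megabytes (200TB is 200,000,000MB in decimal units) gives a rough estimate of total processing time to compare.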