### Goals for Data Prep

1. Convert text to all lower case for normalcy
1. Remove any accented characters
1. Remove special characters
1. Stem or lemmatize the words
1. Remove stopwords
1. Store the clean text and the original text for use in future notebooks

Start with imports

In [1]:
import pandas as pd

import acquire

import unicodedata
import re

In [2]:
article = acquire.get_blog_articles()[0]['content']

In [3]:
article[:255]

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoo'

Just by looking at the first chunk of characters, we can see a stray whitespace character (\xa0, a non-breaking space). We can get rid of it and any similar ones by splitting the text and then rejoining it with single spaces. We can also add a .lower() to the end to take care of part 1 of our task.

In [4]:
article = ' '.join(article.split()).lower()

In [5]:
article[:255]

'the rumors are true! the time has arrived. codeup has officially opened applications to our new data science career accelerator, with only 25 seats available! this immersive program is one of a kind in san antonio, and will help you land a job in glassdoo'
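
To see why the split-and-rejoin trick works, here is a minimal sketch on a made-up string (the sample text is an assumption, not taken from the article). Calling split() with no arguments splits on any run of whitespace, which includes the non-breaking space \xa0 as well as tabs and newlines.

```python
# Hypothetical sample text containing a non-breaking space, a tab, and a newline
sample = 'A job in\xa0Glassdoor\t and \n more'

# split() with no arguments breaks on any run of whitespace, including \xa0
pieces = sample.split()

# Rejoin with single spaces, then lowercase
normalized = ' '.join(pieces).lower()
print(normalized)
```

Every kind of whitespace ends up as a single regular space, which is why the \xa0 is gone from the output above.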

Awesome, let's go ahead and cross out part 1 of our to-do list.

1. ~~Convert text to all lower case for normalcy~~
1. Remove any accented characters
1. Remove special characters
1. Stem or lemmatize the words
1. Remove stopwords
1. Store the clean text and the original text for use in future notebooks

In [6]:
unicodedata.normalize

<function unicodedata.normalize(form, unistr, /)>

In [12]:
article = unicodedata.normalize('NFKD',article).encode('ASCII','ignore').decode('utf-8', 'ignore')

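So there is a lot going on in the above code. Let's break it down into 3 pieces.

unicodedata.normalize with the 'NFKD' form decomposes each accented character into its base character plus a separate combining accent mark.

We then encode the result as ASCII with 'ignore', which drops everything that is not ASCII (including those combining marks), and follow it up by decoding the bytes back into a string.

This whole process ends up removing the accents by turning each accented character into the plain version of itself (ã -> a). Non-ASCII characters that have no ASCII decomposition, such as the curly apostrophe ’, are dropped entirely.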
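
The three pieces can be run one at a time to see what each contributes. This is a small sketch on a made-up phrase (the sample string and variable names are assumptions for illustration); the accented characters are written as escapes so the input is known to be in composed form.

```python
import unicodedata

# Hypothetical sample phrase with two accented characters (ã and é)
word = 'S\u00e3o Paulo caf\u00e9'

# Step 1: NFKD splits each accented character into base letter + combining mark
decomposed = unicodedata.normalize('NFKD', word)

# Step 2: encoding to ASCII with 'ignore' drops the non-ASCII combining marks
ascii_bytes = decomposed.encode('ascii', 'ignore')

# Step 3: decode the bytes back into a regular str
cleaned = ascii_bytes.decode('utf-8', 'ignore')

# The decomposed string is two characters longer: one combining mark per accent
print(len(word), len(decomposed))
print(cleaned)
```

Note that without the normalize step, encoding straight to ASCII would delete the accented letters outright instead of keeping their base letters.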

1. ~~Convert text to all lower case for normalcy~~
1. ~~Remove any accented characters~~
1. Remove special characters
1. Stem or lemmatize the words
1. Remove stopwords
1. Store the clean text and the original text for use in future notebooks

The next step is to go through and remove any special characters from our article

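Luckily this can be done with a simple regular expression that keeps only lowercase letters, digits, apostrophes, and whitespace, and removes everything else.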

In [17]:
article = re.sub(r"[^a-z0-9'\s]", '', article)

In [18]:
article

'the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america data science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industry our program will be 18 weeks long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problems 
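
Here is a short sketch of what that pattern keeps and removes, using made-up sample strings (assumptions, not text from the article). It also shows why the lowercasing has to happen first: capital letters are not in a-z, so the pattern would delete them.

```python
import re

# Same pattern as above: anything that is NOT a-z, 0-9, an apostrophe, or whitespace
pattern = r"[^a-z0-9'\s]"

# Hypothetical sample sentence, already lowercased
sample = "we've seen full-time, hands-on work: $70 m!"
stripped = re.sub(pattern, '', sample)
print(stripped)

# Uppercase letters are not in a-z, so skipping .lower() would delete them
not_lowered = re.sub(pattern, '', 'Data Science')
print(not_lowered)
```

Hyphens are removed without leaving a space behind, which is why the article output contains words like fulltime and handson.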

1. ~~Convert text to all lower case for normalcy~~
1. ~~Remove any accented characters~~
1. ~~Remove special characters~~
1. Stem or lemmatize the words
1. Remove stopwords
1. Store the clean text and the original text for use in future notebooks

Awesome! Now we can combine these 3 steps into a simple function that takes in an article and returns the same article "cleaned" (lowercased, with accents and special characters removed).

In [19]:
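def basic_clean(article):
    # Collapse every run of whitespace to a single space and lowercase the text
    article = ' '.join(article.split()).lower()
    # Decompose accented characters, then drop anything that is not ASCII
    article = unicodedata.normalize('NFKD', article).encode('ASCII', 'ignore').decode('utf-8', 'ignore')
    # Keep only lowercase letters, digits, apostrophes, and whitespace
    article = re.sub(r"[^a-z0-9'\s]", '', article)
    return article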

Let's test it out

In [20]:
article = acquire.get_blog_articles()[0]['content']

In [21]:
basic_clean(article)

'the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america data science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industry our program will be 18 weeks long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problems 
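
For a check that does not depend on the acquire module, the function can also be run on a short hand-written string. This is only a sketch: the input below is a made-up example, and the function body is copied from the cell above so the snippet runs on its own.

```python
import re
import unicodedata


def basic_clean(article):
    article = ' '.join(article.split()).lower()
    article = unicodedata.normalize('NFKD', article).encode('ASCII', 'ignore').decode('utf-8', 'ignore')
    article = re.sub(r"[^a-z0-9'\s]", '', article)
    return article


# Hypothetical input: extra whitespace, accents, capitals, and punctuation
raw = "  Caf\u00e9\xa0Ol\u00e9!  It's  100% DONE.\n"
result = basic_clean(raw)
print(result)
```

The straight apostrophe in "it's" survives because the pattern explicitly allows it, while the punctuation and the percent sign are removed.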

Awesome! That looks pretty good, so we'll call the basic function done.
Let's start on the next section: stem or lemmatize the words.

In [28]:
article = basic_clean(article)

In [3]:
import nltk

In [4]:
tokenizer = nltk.tokenize.ToktokTokenizer()

In [5]:
tokenizer.tokenize(article, return_str=True)

'The rumors are true ! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator , with only 25 seats available ! This immersive program is one of a kind in San Antonio , and will help you land a job in Glassdoor ’ s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data. The data revolution has hit San Antonio , resulting in an explosion in Data Scientist positions across companies like USAA , Accenture , Booz Allen Hamilton , and HEB. We ’ ve even seen UTSA invest $ 70 M for a Cybersecurity Center and School of Data Science. We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long , full-time , hands-on , and project-based. Our curriculum development and instruction is led by Senior Data Scientist , Maggie Giust , who has worked at HEB , Capital Group , and Rackspace , along with input from dozens of practitioners and hiring partners. Stud
