## The end result of this exercise should be a file named `prepare.py` that defines the requested functions.

**In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.**

1) Define a function named `basic_clean`. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

2) Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

3) Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

4) Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

5) Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, `extra_words` and `exclude_words`. These parameters should define any additional stop words to include, and any words that we don't want to remove.

6) Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe `news_df`.

7) Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.

8) For each dataframe, produce the following columns:

- `title` to hold the title
- `original` to hold the original article/post content
- `clean` to hold the normalized and tokenized original with the stopwords removed.
- `stemmed` to hold the stemmed version of the cleaned data.
- `lemmatized` to hold the lemmatized version of the cleaned data.

9) Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire


In [2]:
# We don't need to install nltk, it should come with anaconda, but nltk
# does need to download some data.
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/albertcontreras/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/albertcontreras/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
# examples 
original = """What are the Math and Stats Principles You Need for Data Science?
Oct 21, 2020 | Data Science


Coming into our Data Science program, you will need to know some math and stats. However, many of our applicants actually learn in the application process – you don’t need to be an expert before applying! Data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. But what “skills” do we mean, e
"""
print(original[0:500])


What are the Math and Stats Principles You Need for Data Science?
Oct 21, 2020 | Data Science


Coming into our Data Science program, you will need to know some math and stats. However, many of our applicants actually learn in the application process – you don’t need to be an expert before applying! Data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. But what “skills” do we mean, e



In [5]:
article = original.lower()
print(article[0:500])


what are the math and stats principles you need for data science?
oct 21, 2020 | data science


coming into our data science program, you will need to know some math and stats. however, many of our applicants actually learn in the application process – you don’t need to be an expert before applying! data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. but what “skills” do we mean, e



# REMOVING ACCENTED CHARACTERS
1) `unicodedata.normalize` removes any inconsistencies in unicode character encoding.

2) `.encode` to convert the resulting string to the ASCII character set. We'll ignore any errors in conversion, meaning we'll drop anything that isn't an ASCII character.

3) `.decode` to turn the resulting bytes object back into a string.

In [6]:
article = unicodedata.normalize('NFKD', article)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')

print(article[0:500])


what are the math and stats principles you need for data science?
oct 21, 2020 | data science


coming into our data science program, you will need to know some math and stats. however, many of our applicants actually learn in the application process  you dont need to be an expert before applying! data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. but what skills do we mean, e
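The article above contains no accented letters, so the effect of the three steps is easier to see on a short string. A minimal sketch using only the standard library; the helper name `to_ascii` and the sample text are illustrative and not part of the exercise:

```python
import unicodedata

def to_ascii(text):
    """Decompose characters (NFKD), then drop whatever has no ASCII form."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

sample = 'café naïve résumé'
print(to_ascii(sample))        # accents are stripped, base letters remain
print(to_ascii('don\u2019t'))  # the curly apostrophe has no ASCII form and is dropped
```

The second call shows why `don’t` turns into `dont` in the output above: the curly apostrophe is removed at this step, before any regular expression runs.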



# REMOVING SPECIAL CHARACTERS

In [7]:
# remove anything that is not a through z, a number, a single quote, or whitespace
article = re.sub(r"[^a-z0-9'\s]", '', article)
print(article)


what are the math and stats principles you need for data science
oct 21 2020  data science


coming into our data science program you will need to know some math and stats however many of our applicants actually learn in the application process  you dont need to be an expert before applying data science is a very accessible field to anyone dedicated to learning new skills and we can work with any applicant to help them learn what they need to know but what skills do we mean e
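The pattern keeps single quotes, yet `dont` still appears without one, because the article's curly apostrophe (U+2019) was already dropped by the ASCII encoding step. If contractions matter downstream, one option is to map curly quotes to straight ones before encoding. This is an assumption about what you might want, not something the exercise asks for, and the helper name `basic_clean_keep_apostrophes` is hypothetical:

```python
import re
import unicodedata

# Map typographic single quotes to the ASCII apostrophe.
CURLY_TO_STRAIGHT = str.maketrans({'\u2018': "'", '\u2019': "'"})

def basic_clean_keep_apostrophes(text):
    # Swap curly quotes first so the ASCII step does not discard them.
    text = text.translate(CURLY_TO_STRAIGHT).lower()
    text = unicodedata.normalize('NFKD', text)
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Same pattern as above: keep letters, digits, single quotes, whitespace.
    return re.sub(r"[^a-z0-9'\s]", '', text)

print(basic_clean_keep_apostrophes('You don\u2019t need to be an expert!'))
```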



# TOKENIZATION

In [8]:
tokenizer = nltk.tokenize.ToktokTokenizer()

print(tokenizer.tokenize(article, return_str=True)[0:500])


what are the math and stats principles you need for data science
oct 21 2020 data science


coming into our data science program you will need to know some math and stats however many of our applicants actually learn in the application process you dont need to be an expert before applying data science is a very accessible field to anyone dedicated to learning new skills and we can work with any applicant to help them learn what they need to know but what skills do we mean e


# STEMMING

In [9]:
# Create the nltk stemmer object, then use it
ps = nltk.porter.PorterStemmer()

ps.stem('call'), ps.stem('called'), ps.stem('calling')


('call', 'call', 'call')

In [10]:
stems = [ps.stem(word) for word in article.split()]
article_stemmed = ' '.join(stems)
print(article_stemmed)


what are the math and stat principl you need for data scienc oct 21 2020 data scienc come into our data scienc program you will need to know some math and stat howev mani of our applic actual learn in the applic process you dont need to be an expert befor appli data scienc is a veri access field to anyon dedic to learn new skill and we can work with ani applic to help them learn what they need to know but what skill do we mean e


In [11]:
pd.Series(stems).value_counts().head(10)


to        6
need      4
data      4
scienc    4
what      3
learn     3
applic    3
and       3
you       3
skill     2
dtype: int64
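
The count for `applic` above combines three different words from the article. A small sketch that groups a few of those words by their Porter stem makes the merging visible. The word list is taken from the sample text, the variable names are illustrative, and the block assumes `nltk` is installed (the Porter stemmer needs no extra data download):

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ['applicants', 'application', 'applicant', 'learn', 'learning', 'skills']

# Collect the original words under the stem they reduce to.
groups = defaultdict(list)
for word in words:
    groups[ps.stem(word)].append(word)

for stemmed, originals in groups.items():
    print(stemmed, '<-', originals)
```

Words with different meanings (`applicant`, `application`) end up under one stem, which is the usual trade-off of stemming: it is fast, but it discards distinctions a lemmatizer would keep.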

# [SEE CLASS LINK FOR MORE HELP](https://ds.codeup.com/nlp/prepare/)

In [12]:
# string we will be testing on.
test_code = """What are the Math and Stats Principles You Need for Data Science?
Oct 21, 2020 | Data Science
Coming into our Data Science program, you will need to know some math and stats. However, many of our applicants actually learn in the application process – you don’t need to be an expert before applying! Data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. But what “skills” do we mean, e
"""

1) Define a function named `basic_clean`. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

In [13]:
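def basic_clean(string):
    """Lowercase, normalize unicode to ASCII, and strip special characters."""
    # normalize, drop non-ASCII characters, and lowercase
    string = unicodedata.normalize('NFKD', string)\
        .encode('ascii', 'ignore')\
        .decode('utf-8', 'ignore')\
        .lower()
    # remove anything that is not a letter, number, single quote, or whitespace
    string = re.sub(r"[^a-z0-9'\s]", '', string)
    return string

basic_clean(test_code)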

'what are the math and stats principles you need for data science\noct 21 2020  data science\ncoming into our data science program you will need to know some math and stats however many of our applicants actually learn in the application process  you dont need to be an expert before applying data science is a very accessible field to anyone dedicated to learning new skills and we can work with any applicant to help them learn what they need to know but what skills do we mean e\n'

In [14]:
updated = basic_clean(test_code)

2) Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

In [15]:
def tokenize(string):
    #define tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()
    # apply tokenization to the string.
    string = tokenizer.tokenize(string, return_str = True)
    #return tokenized string.
    return string
tokenize(updated)

'what are the math and stats principles you need for data science\noct 21 2020 data science\ncoming into our data science program you will need to know some math and stats however many of our applicants actually learn in the application process you dont need to be an expert before applying data science is a very accessible field to anyone dedicated to learning new skills and we can work with any applicant to help them learn what they need to know but what skills do we mean e'

3) Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.


In [16]:
def stem(string):
    """This function returns a string in stemmed format."""
    # create the stemmer
    ps = nltk.porter.PorterStemmer()
    # split on whitespace and stem each word
    stems = [ps.stem(word) for word in string.split()]
    # join the stems back into a single string
    string = ' '.join(stems)
    return string

In [17]:
a = tokenize(updated)
stem(a)

'what are the math and stat principl you need for data scienc oct 21 2020 data scienc come into our data scienc program you will need to know some math and stat howev mani of our applic actual learn in the applic process you dont need to be an expert befor appli data scienc is a veri access field to anyon dedic to learn new skill and we can work with ani applic to help them learn what they need to know but what skill do we mean e'

4) Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

In [18]:
def lemmatize(string):
    """This function returns a string with words lemmatized"""
    # create our lemmatizer
    wnl = nltk.stem.WordNetLemmatizer()
    # use a list comprehension to lemmatize each word
    # string.split() => a list of every token in the document
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    # join the lemmas back together with single spaces
    string = ' '.join(lemmas)
    # return the altered document
    return string

In [19]:
lemmatize(updated)

'what are the math and stats principle you need for data science oct 21 2020 data science coming into our data science program you will need to know some math and stats however many of our applicant actually learn in the application process you dont need to be an expert before applying data science is a very accessible field to anyone dedicated to learning new skill and we can work with any applicant to help them learn what they need to know but what skill do we mean e'

5) Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords.


In [20]:
def remove_stopwords(string, extra_words=None, exclude_words=None):
    """Remove English stopwords from a string.

    extra_words: additional words to treat as stopwords.
    exclude_words: stopwords that should be kept in the text.
    """
    # start from the NLTK English stopword list
    stopword_list = set(stopwords.words('english'))
    # drop any stopwords the caller wants to keep
    stopword_list -= set(exclude_words or [])
    # add any extra words the caller wants removed
    stopword_list |= set(extra_words or [])
    # split the document on whitespace
    words = string.split()
    # keep every word that is not in the stopword set
    filtered_words = [word for word in words if word not in stopword_list]
    # join the remaining words back together with single spaces
    return ' '.join(filtered_words)

In [21]:
remove_stopwords(updated)

'math stats principles need data science oct 21 2020 data science coming data science program need know math stats however many applicants actually learn application process dont need expert applying data science accessible field anyone dedicated learning new skills work applicant help learn need know skills mean e'
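
The two optional parameters are easiest to understand as set operations on the stopword list. The sketch below uses a tiny hard-coded stopword set instead of `stopwords.words('english')`, an assumption made only so the example runs without downloading NLTK data; the function name `filter_words` is hypothetical:

```python
# Stand-in for the NLTK list; the real list is much longer.
BASE_STOPWORDS = {'the', 'and', 'you', 'for', 'what', 'are'}

def filter_words(text, extra_words=None, exclude_words=None):
    # Set difference keeps excluded words in the text; union adds extra stopwords.
    stop = (BASE_STOPWORDS - set(exclude_words or [])) | set(extra_words or [])
    return ' '.join(word for word in text.split() if word not in stop)

text = 'what are the math and stats principles you need for data science'
print(filter_words(text))
print(filter_words(text, extra_words=['data'], exclude_words=['what']))
```

In the second call `what` survives because it was excluded from the stopword set, and `data` disappears because it was added to it.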

6) Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe `news_df`.


In [24]:
import acquire
categories = ['world', 'science', 'technology', 'entertainment']
news_df = acquire.get_shorts_articles(categories, refresh = False)
news_df

| Unnamed: 0 | title | contents | category |
|---|---|---|---|
| 0 | Situation is exceptional but comparing it with... | The government hosted an all-party meeting on ... | world |
| 1 | Fires erupt across UK as it records its hottes... | The south of England and Wales saw multiple fi... | world |
| 2 | Hottest day recorded in history of UK as tempe... | The hottest day has been recorded in the histo... | world |
| 3 | Russia seeks explanation from India over detai... | Russian Embassy in India has said it's aware o... | world |
| 4 | India & China to maintain stability along LAC ... | India and China agreed to maintain stability o... | world |
| ... | ... | ... | ... |
| 19 | Nick Jonas shares pics from Priyanka's b'day, ... | Singer-actor Nick Jonas shared pictures of act... | entertainment |
| 20 | Had to deliver, so they come for more talent f... | Actor Dhanush has said he wasn't nervous to wo... | entertainment |
| 21 | Still here, in our hearts: Twinkle on dad Raje... | Actress-turned-writer Twinkle Khanna penned a ... | entertainment |
| 22 | List of debtors after Sanjeev Kumar's death mi... | Actor Sanjeev Kumar's sister-in-law Jyoti Niku... | entertainment |


7) Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.


In [25]:
codeup_df = acquire.blog_articles()
codeup_df





| Unnamed: 0 | title | content |
|---|---|---|
| 0 | What Jobs Can You Get After a Coding Bootcamp?... | Have you been considering a career in Cloud Ad... |
| 1 | What Jobs Can You Get After a Coding Bootcamp?... | If you are interested in embarking on a career... |
| 2 | Is Our Cloud Administration Program Right for ... | Changing careers can be scary. The first thing... |
| 3 | 5 Reasons To Attend Our New Cloud Administrati... | Come Work In The Cloud\nWhen your Monday rolls... |
| 4 | What Jobs Can You Get After a Coding Bootcamp?... | Have you been considering a career in Cloud Ad... |
| 5 | What Jobs Can You Get After a Coding Bootcamp?... | If you are interested in embarking on a career... |
| 6 | In-Person Workshop: Learn to Code – JavaScript... | Join us for our live in-person JavaScript cras... |
| 7 | In-Person Workshop: Learn to Code – Python on ... | According to LinkedIn, the “#1 Most Promising ... |
| 8 | Free JavaScript Workshop at Codeup Dallas on 6/28 | Event Info: \nLocation – Codeup Dallas\nTime –... |
| 9 | Is Our Cloud Administration Program Right for ... | Changing careers can be scary. The first thing... |


8) For each dataframe, produce the following columns:

- `title` to hold the title
- `original` to hold the original article/post content
- `clean` to hold the normalized and tokenized original with the stopwords removed.
- `stemmed` to hold the stemmed version of the cleaned data.
- `lemmatized` to hold the lemmatized version of the cleaned data.


In [None]:
# news_df stores the article text in `contents`; codeup_df stores it in `content`
news_df.rename(columns={'contents': 'original'}, inplace=True)
codeup_df.rename(columns={'content': 'original'}, inplace=True)
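
Renaming only produces the `original` column; the `clean`, `stemmed`, and `lemmatized` columns still have to be derived from it. One possible way is to chain `.apply` calls, sketched here with a hypothetical helper `add_prepared_columns` and with toy stand-in functions so the block runs on its own:

```python
import pandas as pd

def add_prepared_columns(df, clean_fn, stem_fn, lemmatize_fn):
    """Return title/original plus clean, stemmed, and lemmatized columns."""
    out = df[['title', 'original']].copy()
    out['clean'] = out['original'].apply(clean_fn)
    # Both derived columns are built from the cleaned text, not the original.
    out['stemmed'] = out['clean'].apply(stem_fn)
    out['lemmatized'] = out['clean'].apply(lemmatize_fn)
    return out

# Toy frame and placeholder functions (not a real stemmer or lemmatizer).
toy = pd.DataFrame({'title': ['Demo'], 'original': ['Learning NEW Skills!']})
result = add_prepared_columns(
    toy,
    clean_fn=lambda s: s.lower().strip('!'),
    stem_fn=lambda s: s.upper(),
    lemmatize_fn=lambda s: s.title(),
)
print(result.columns.tolist())
```

With the functions defined earlier in this notebook, `clean_fn` could be `lambda s: remove_stopwords(tokenize(basic_clean(s)))`, with `stem` and `lemmatize` passed as the other two arguments.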

9) Ask yourself:

`a) If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?`

`b) If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?`

`c) If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?`


    a) Either stemmed or lemmatized text would work, because the corpus is small.

    b) Lemmatized, because 25MB is still a small corpus.

    c) Stemmed, because the corpus is so large that stemming will save processing time and space (and therefore money).
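
These answers rest on how processing time grows with corpus size. One hedged way to put numbers on it is to time each function on a small sample and extrapolate. The helper names below are hypothetical, linear scaling is an assumption, and the measured rate depends on your machine:

```python
import time

def seconds_per_mb(fn, sample_text, repeats=3):
    """Best-of-N wall-clock time for fn(sample_text), scaled to one megabyte."""
    size_mb = len(sample_text.encode('utf-8')) / 1_000_000
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn(sample_text)
        best = min(best, time.perf_counter() - start)
    return best / size_mb

def estimate_hours(rate_seconds_per_mb, corpus_mb):
    """Linear extrapolation from a per-megabyte rate to a full corpus."""
    return rate_seconds_per_mb * corpus_mb / 3600

sample = 'data science is a very accessible field ' * 5000
# str.upper is only a stand-in; pass stem or lemmatize to compare them.
rate = seconds_per_mb(str.upper, sample)
print(f'{estimate_hours(rate, 25):.6f} hours for a 25MB corpus')
```

Running it once with `stem` and once with `lemmatize` gives two rates that can be compared directly for each of the three corpus sizes.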