# Exercises

The end result of this exercise should be a file named `prepare.py` that defines the requested functions.

In this exercise we will be defining some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

1. Define a function named `basic_clean`. It should take in a string and apply some basic cleaning to it:
    - Lowercase everything
    - Normalize unicode characters
    - Remove anything that is not a letter, number, whitespace, or a single quote.

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire

In [2]:
original_df = acquire.get_blog_articles()
original_df

| Unnamed: 0 | titles | content |
|---|---|---|
| 0 | A Quest Through Codeup | Codeup isn’t a cheap program – we know that. A... |
| 1 | Quickpath: Solving the Data Science Talent Sho... | After the graduation of our first Codeup Data ... |
| 2 | Codeup Dallas 2020 | 11/20/2019 UPDATE: Codeup Dallas is approved a... |
| 3 | From Bootcamp to Bootcamp; How I found purpose... | When I was 17 years old an Army recruiter came... |
| 4 | 5 Common Excuses Keeping You From Breaking Int... | Just a few months before starting at Codeup in... |
| 5 | Why San Antonio Has More Than Tacos To Offer | Before moving to San Antonio, I was slightly a... |
| 6 | Everyday Encounters with Data Science | You come home from work, tired to the bone and... |
| 7 | Finding the Perfect Coding Bootcamp Fit for Me | \nBy Marcella Munter I looked at Codeup for ... |
| 8 | Codeup Student Check In: Month 3 | Photo by Jon Garcia Codeup welcomed the Wrange... |
| 9 | From Styling Hair to Stying Interfaces | By Sukari Schutzman I grew up loving technolo... |


In [3]:
# Look at a content block
original_df.content[7]

'\xa0 \nBy Marcella Munter I looked at Codeup for 2 years before I finally made the decision to apply. As a requirement for my Math degree, I took an Intro to Java class. I enjoyed it so much and was tempted to switch majors. However, I was a semester away from graduation and decided not to. Coding was still something I wanted to do so I tried some online courses.  The online classes were satisfactory, but it made me realize two things: 1) I didn’t know what I needed to be learning and 2) whatever I did learn, I needed to be in a classroom setting to learn it. I heard about Codeup and other coding bootcamps and added myself to Codeup’s mailing list. However, both the price and the thought of having to quit my job scared me from ever applying. I found a cheaper coding school and attended one of their coding workshops, but didn’t like the way the workshop was run. The instructors only gave us lines of code to write with little theory behind it. That combined with poor organization in gen

In [4]:
# Create a copy of original_df so that it remains unchanged. Make changes on prepped_df.
prepped_df = original_df.copy()
prepped_df

| Unnamed: 0 | titles | content |
|---|---|---|
| 0 | A Quest Through Codeup | Codeup isn’t a cheap program – we know that. A... |
| 1 | Quickpath: Solving the Data Science Talent Sho... | After the graduation of our first Codeup Data ... |
| 2 | Codeup Dallas 2020 | 11/20/2019 UPDATE: Codeup Dallas is approved a... |
| 3 | From Bootcamp to Bootcamp; How I found purpose... | When I was 17 years old an Army recruiter came... |
| 4 | 5 Common Excuses Keeping You From Breaking Int... | Just a few months before starting at Codeup in... |
| 5 | Why San Antonio Has More Than Tacos To Offer | Before moving to San Antonio, I was slightly a... |
| 6 | Everyday Encounters with Data Science | You come home from work, tired to the bone and... |
| 7 | Finding the Perfect Coding Bootcamp Fit for Me | \nBy Marcella Munter I looked at Codeup for ... |
| 8 | Codeup Student Check In: Month 3 | Photo by Jon Garcia Codeup welcomed the Wrange... |
| 9 | From Styling Hair to Stying Interfaces | By Sukari Schutzman I grew up loving technolo... |


In [5]:
def lower_string(text):
    # `text` rather than `str`, so the built-in str type is not shadowed
    return text.lower()

In [6]:
# Assign through .loc to avoid pandas' chained-assignment warning
prepped_df.loc[0, 'content'] = lower_string(prepped_df.loc[0, 'content'])
prepped_df.content[0]

'codeup isn’t a cheap program – we know that. as one of the longest accelerators in the country, we recognize that cost is a concern for many prospective students. that’s why this week we are excited to highlight one of our long-standing financial aid partners – project quest – to show you how many pathways are available.\xa0 first of all, we view your tuition as an investment in your future, and we put our best people, ideas, and efforts to make that investment fruitful. but we also want to make that investment more accessible by providing as many financial pathways as possible for tuition funding. as we approach their annual fundraising luncheon this week, we want to highlight project quest as one of the most effective such pathways you can explore. “the experience of attending codeup has radically altered my life. now, instead of financially treading water in a service industry job with a college degree that felt wasted, i have a fulfilling professional career with exciting possibil

In [7]:
def normalize_str(text):
    # Decompose unicode characters, then drop anything with no ASCII equivalent
    return unicodedata.normalize('NFKD', text)\
        .encode('ascii', 'ignore')\
        .decode('utf-8', 'ignore')

In [8]:
# Assign through .loc to avoid pandas' chained-assignment warning
prepped_df.loc[0, 'content'] = normalize_str(prepped_df.loc[0, 'content'])
prepped_df.content[0]

'codeup isnt a cheap program  we know that. as one of the longest accelerators in the country, we recognize that cost is a concern for many prospective students. thats why this week we are excited to highlight one of our long-standing financial aid partners  project quest  to show you how many pathways are available.  first of all, we view your tuition as an investment in your future, and we put our best people, ideas, and efforts to make that investment fruitful. but we also want to make that investment more accessible by providing as many financial pathways as possible for tuition funding. as we approach their annual fundraising luncheon this week, we want to highlight project quest as one of the most effective such pathways you can explore. the experience of attending codeup has radically altered my life. now, instead of financially treading water in a service industry job with a college degree that felt wasted, i have a fulfilling professional career with exciting possibilities for
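To see the normalization step in isolation, here is a minimal sketch on a short made-up string (the sample text and the name `normalize_to_ascii` are assumptions for illustration, not part of the exercise). NFKD decomposes an accented letter into a base letter plus a combining mark, and encoding to ASCII with `'ignore'` then drops every character that has no ASCII form. That includes curly apostrophes and dashes, which is why `isn’t` turns into `isnt` in the output above.

```python
import unicodedata

def normalize_to_ascii(text):
    """Decompose unicode characters, then drop anything that is not ASCII."""
    return (
        unicodedata.normalize('NFKD', text)
        .encode('ascii', 'ignore')
        .decode('utf-8', 'ignore')
    )

# \u00e9 is an accented e, \u2019 a curly apostrophe, \u2013 an en dash
sample = 'caf\u00e9 isn\u2019t cheap \u2013 we know'
result = normalize_to_ascii(sample)
print(repr(result))  # 'cafe isnt cheap  we know'

# A non-breaking space (\xa0) decomposes to a regular space and is kept
print(repr(normalize_to_ascii('available.\xa0 first')))  # 'available.  first'
```

Note the double spaces left behind where a dash or non-breaking space used to sit; later whitespace-based splitting tolerates them.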

In [9]:
def remove_special_character(text):
    # Keep only lowercase letters, digits, single quotes, and whitespace
    return re.sub(r"[^a-z0-9'\s]", '', text)

In [10]:
remove_special_character(prepped_df.content[0])

'codeup isnt a cheap program  we know that as one of the longest accelerators in the country we recognize that cost is a concern for many prospective students thats why this week we are excited to highlight one of our longstanding financial aid partners  project quest  to show you how many pathways are available  first of all we view your tuition as an investment in your future and we put our best people ideas and efforts to make that investment fruitful but we also want to make that investment more accessible by providing as many financial pathways as possible for tuition funding as we approach their annual fundraising luncheon this week we want to highlight project quest as one of the most effective such pathways you can explore the experience of attending codeup has radically altered my life now instead of financially treading water in a service industry job with a college degree that felt wasted i have a fulfilling professional career with exciting possibilities for infield advance
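Removing a character outright joins the words on either side of it, which is how `long-standing` became `longstanding` in the output above. One possible variation, sketched here as an assumption and not the approach this notebook uses, substitutes a space instead and then collapses repeated whitespace. The function name `replace_special_characters` and the sample string are hypothetical.

```python
import re

def replace_special_characters(text):
    """Swap disallowed characters for a space, then collapse repeated whitespace."""
    text = re.sub(r"[^a-z0-9'\s]", ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "our long-standing partners, it's true"

# Same pattern as the notebook's function: delete the disallowed characters
removed = re.sub(r"[^a-z0-9'\s]", '', sample)
# Variation: replace them with a space
replaced = replace_special_characters(sample)

print(removed)   # our longstanding partners it's true
print(replaced)  # our long standing partners it's true
```

Which result is preferable depends on the later analysis: deleting keeps hyphenated compounds as one token, while replacing keeps the component words separate.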

In [11]:
def basic_clean(text):
    # Lowercase, normalize unicode to ASCII, then strip special characters
    text = lower_string(text)
    text = normalize_str(text)
    text = remove_special_character(text)
    return text

In [12]:
basic_clean(prepped_df.content[1])

'after the graduation of our first codeup data science cohort one company was eagerly waiting in the wings to scoop up a lot of our new grads quickpath in business since 2004 quickpath has helped deliver amazing customer experiences and optimize business decisions in over a hundred implementations for fortune 500 and fortune 1000 companies but as the demand for their services increasedmore than doubling each of the past three yearsquickpath found itself in a talent shortage  we found ourselves leaving business on the table where customers were asking us for additional resources alex fly ceo of quickpath and board member of the codeup data science program said and we werent able to fill that request fortunately the codeup data science curriculum was able to provide a muchneeded pipeline of highly skilled individuals over the course of an intense 20week handson program students learn a variety of data science tools and skills including statistics sql python and regression analysis this i

In [13]:
prepped_df.content.apply(basic_clean)

0     codeup isnt a cheap program  we know that as o...
1     after the graduation of our first codeup data ...
2     11202019 update codeup dallas is approved and ...
3     when i was 17 years old an army recruiter came...
4     just a few months before starting at codeup in...
5     before moving to san antonio i was slightly ap...
6     you come home from work tired to the bone and ...
7       \nby marcella munter i looked at codeup for ...
8     photo by jon garcia codeup welcomed the wrange...
9     by sukari schutzman  i grew up loving technolo...
10    by alexander bous  growing up it would be an u...
11      when youre scared to run your code and it wo...
12    by jennifer walker  i first encountered rubber...
13    by randi mays  for many teenagers the path to ...
14       codeup welcomed our newest cohort the wrang...
15    the rumors are true the time has arrived codeu...
16       i take pride in my bachelors degree startin...
17    i remember during my first day of codeup i

In [14]:
prepped_df.content

0     codeup isnt a cheap program  we know that. as ...
1     After the graduation of our first Codeup Data ...
2     11/20/2019 UPDATE: Codeup Dallas is approved a...
3     When I was 17 years old an Army recruiter came...
4     Just a few months before starting at Codeup in...
5     Before moving to San Antonio, I was slightly a...
6     You come home from work, tired to the bone and...
7       \nBy Marcella Munter I looked at Codeup for ...
8     Photo by Jon Garcia Codeup welcomed the Wrange...
9     By Sukari Schutzman  I grew up loving technolo...
10    by Alexander Bous  Growing up It would be an u...
11      When you’re scared to run your code, and it ...
12    By Jennifer Walker  I first encountered rubber...
13    By Randi Mays  For many teenagers, the path to...
14       Codeup welcomed our newest cohort, the Wran...
15    The rumors are true! The time has arrived. Cod...
16       I take pride in my bachelor’s degree. Start...
17    I remember during my first day of Codeup I

2. Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

In [15]:
tokenizer = nltk.tokenize.ToktokTokenizer()
tokenizer

<nltk.tokenize.toktok.ToktokTokenizer at 0x1a19712240>

In [16]:
tokenizer.tokenize(prepped_df.content[0],return_str = True)

'codeup isnt a cheap program we know that. as one of the longest accelerators in the country , we recognize that cost is a concern for many prospective students. thats why this week we are excited to highlight one of our long-standing financial aid partners project quest to show you how many pathways are available. first of all , we view your tuition as an investment in your future , and we put our best people , ideas , and efforts to make that investment fruitful. but we also want to make that investment more accessible by providing as many financial pathways as possible for tuition funding. as we approach their annual fundraising luncheon this week , we want to highlight project quest as one of the most effective such pathways you can explore. the experience of attending codeup has radically altered my life. now , instead of financially treading water in a service industry job with a college degree that felt wasted , i have a fulfilling professional career with exciting possibilities

In [17]:
def tokenize(text):
    tokenizer = nltk.tokenize.ToktokTokenizer()
    # return_str=True gives back one space-joined string instead of a list of tokens
    return tokenizer.tokenize(text, return_str=True)

In [18]:
prepped_df.content.apply(tokenize)

0     codeup isnt a cheap program we know that. as o...
1     After the graduation of our first Codeup Data ...
2     11/20/2019 UPDATE : Codeup Dallas is approved ...
3     When I was 17 years old an Army recruiter came...
4     Just a few months before starting at Codeup in...
5     Before moving to San Antonio , I was slightly ...
6     You come home from work , tired to the bone an...
7     By Marcella Munter I looked at Codeup for 2 ye...
8     Photo by Jon Garcia Codeup welcomed the Wrange...
9     By Sukari Schutzman I grew up loving technolog...
10    by Alexander Bous Growing up It would be an un...
11    When you ’ re scared to run your code , and it...
12    By Jennifer Walker I first encountered rubber ...
13    By Randi Mays For many teenagers , the path to...
14    Codeup welcomed our newest cohort , the Wrange...
15    The rumors are true ! The time has arrived. Co...
16    I take pride in my bachelor ’ s degree. Starti...
17    I remember during my first day of Codeup I

3. Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

In [19]:
ps = nltk.porter.PorterStemmer()
ps

<PorterStemmer>

In [20]:
# .split() breaks a string into a list of whitespace-separated words
prepped_df.content[3].split()

['When',
 'I',
 'was',
 '17',
 'years',
 'old',
 'an',
 'Army',
 'recruiter',
 'came',
 'along',
 'and',
 'convinced',
 'me',
 'that',
 'the',
 'military',
 'was',
 'a',
 'good',
 'choice',
 'and',
 'could',
 'afford',
 'me',
 'many',
 'great',
 'opportunities',
 'in',
 'life.',
 'At',
 'the',
 'time',
 'I',
 'was',
 'living',
 'a',
 'carefree',
 'life',
 'with',
 'a',
 'close',
 'friend.',
 'I',
 'hadn’t',
 'considered',
 'what',
 'the',
 'next',
 'year',
 'of',
 'my',
 'life',
 'would',
 'look',
 'like,',
 'let',
 'alone',
 'the',
 'next',
 'decade.',
 'I',
 'didn’t',
 'really',
 'see',
 'myself',
 'as',
 'a',
 'soldier,',
 'but',
 'I',
 'was',
 'strongly',
 'encouraged',
 'to',
 'enlist',
 'and',
 'so',
 'I',
 'did.',
 'It',
 'was',
 'during',
 'this',
 'time',
 'that',
 'I',
 'began',
 'playing',
 'around',
 'with',
 'building',
 'websites.',
 'I',
 'was',
 'far',
 'from',
 'good',
 'at',
 'it,',
 'but',
 'I',
 'enjoyed',
 'it,',
 'a',
 'lot.',
 'In',
 'April',
 'of',
 '2018',
 'I'

In [21]:
stems = [ps.stem(word) for word in prepped_df.content[3].split()]
stems

['when',
 'I',
 'wa',
 '17',
 'year',
 'old',
 'an',
 'armi',
 'recruit',
 'came',
 'along',
 'and',
 'convinc',
 'me',
 'that',
 'the',
 'militari',
 'wa',
 'a',
 'good',
 'choic',
 'and',
 'could',
 'afford',
 'me',
 'mani',
 'great',
 'opportun',
 'in',
 'life.',
 'At',
 'the',
 'time',
 'I',
 'wa',
 'live',
 'a',
 'carefre',
 'life',
 'with',
 'a',
 'close',
 'friend.',
 'I',
 'hadn’t',
 'consid',
 'what',
 'the',
 'next',
 'year',
 'of',
 'my',
 'life',
 'would',
 'look',
 'like,',
 'let',
 'alon',
 'the',
 'next',
 'decade.',
 'I',
 'didn’t',
 'realli',
 'see',
 'myself',
 'as',
 'a',
 'soldier,',
 'but',
 'I',
 'wa',
 'strongli',
 'encourag',
 'to',
 'enlist',
 'and',
 'so',
 'I',
 'did.',
 'It',
 'wa',
 'dure',
 'thi',
 'time',
 'that',
 'I',
 'began',
 'play',
 'around',
 'with',
 'build',
 'websites.',
 'I',
 'wa',
 'far',
 'from',
 'good',
 'at',
 'it,',
 'but',
 'I',
 'enjoy',
 'it,',
 'a',
 'lot.',
 'In',
 'april',
 'of',
 '2018',
 'I',
 'found',
 'out',
 'about',
 'code',

In [22]:
article_stemmed = ' '.join(stems)
article_stemmed

'when I wa 17 year old an armi recruit came along and convinc me that the militari wa a good choic and could afford me mani great opportun in life. At the time I wa live a carefre life with a close friend. I hadn’t consid what the next year of my life would look like, let alon the next decade. I didn’t realli see myself as a soldier, but I wa strongli encourag to enlist and so I did. It wa dure thi time that I began play around with build websites. I wa far from good at it, but I enjoy it, a lot. In april of 2018 I found out about code bootcamps. I had never consid that I could make a live as a web developer. I didn’t even know what a web develop did. but it turn out that develop do the thing that I enjoy so much that I stay up all night do them. At the time I wa manag a small cafe, make $11 an hour. I wa also a part-tim uber driver to help make end meet. It sucked, big time. It wa at thi time that I made the best decis i’v ever made. I quit that job and decid to focu on teach myself h

In [23]:
def stem(text):
    ps = nltk.stem.PorterStemmer()
    # Stem each whitespace-separated word, then join the stems back into one string
    stems = [ps.stem(word) for word in text.split()]
    article_stemmed = ' '.join(stems)
    return article_stemmed

In [24]:
stem(prepped_df.content[1])

'after the graduat of our first codeup data scienc cohort, one compani wa eagerli wait in the wing to scoop up a lot of our new grads: quickpath. In busi sinc 2004, quickpath ha help deliv amaz custom experi and optim busi decis in over a hundr implement for fortun 500 and fortun 1,000 companies. but as the demand for their servic increased—mor than doubl each of the past three years—quickpath found itself in a talent shortage. “we found ourselv leav busi on the tabl where custom were ask us for addit resources,” alex fly, ceo of quickpath and board member of the codeup data scienc program, said, “and we weren’t abl to fill that request.” fortunately, the codeup data scienc curriculum wa abl to provid a much-need pipelin of highli skill individuals. over the cours of an intens 20-week, hands-on program, student learn a varieti of data scienc tool and skills, includ statistics, sql, python and regress analysis. thi intens preparation, culmin in masteri with a capston project, is what le

In [25]:
prepped_df.content.apply(stem)

0     codeup isnt a cheap program we know that. as o...
1     after the graduat of our first codeup data sci...
2     11/20/2019 update: codeup dalla is approv and ...
3     when I wa 17 year old an armi recruit came alo...
4     just a few month befor start at codeup in the ...
5     befor move to san antonio, I wa slightli appre...
6     you come home from work, tire to the bone and ...
7     By marcella munter I look at codeup for 2 year...
8     photo by jon garcia codeup welcom the wrangel ...
9     By sukari schutzman I grew up love technology....
10    by alexand bou grow up It would be an understa...
11    when you’r scare to run your code, and it work...
12    By jennif walker I first encount rubber duck d...
13    By randi may for mani teenagers, the path to s...
14    codeup welcom our newest cohort, the wrangel c...
15    the rumor are true! the time ha arrived. codeu...
16    I take pride in my bachelor’ degree. start my ...
17    I rememb dure my first day of codeup I beg

4. Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

In [26]:
wnl = nltk.stem.WordNetLemmatizer()

In [27]:
for word in 'study studies'.split():
    print('stem:', ps.stem(word), '--lemma:', wnl.lemmatize(word))

stem: studi --lemma: study
stem: studi --lemma: study


In [28]:
lemmas = [wnl.lemmatize(word) for word in prepped_df.content[0].split()]
lemmas

['codeup',
 'isnt',
 'a',
 'cheap',
 'program',
 'we',
 'know',
 'that.',
 'a',
 'one',
 'of',
 'the',
 'longest',
 'accelerator',
 'in',
 'the',
 'country,',
 'we',
 'recognize',
 'that',
 'cost',
 'is',
 'a',
 'concern',
 'for',
 'many',
 'prospective',
 'students.',
 'thats',
 'why',
 'this',
 'week',
 'we',
 'are',
 'excited',
 'to',
 'highlight',
 'one',
 'of',
 'our',
 'long-standing',
 'financial',
 'aid',
 'partner',
 'project',
 'quest',
 'to',
 'show',
 'you',
 'how',
 'many',
 'pathway',
 'are',
 'available.',
 'first',
 'of',
 'all,',
 'we',
 'view',
 'your',
 'tuition',
 'a',
 'an',
 'investment',
 'in',
 'your',
 'future,',
 'and',
 'we',
 'put',
 'our',
 'best',
 'people,',
 'ideas,',
 'and',
 'effort',
 'to',
 'make',
 'that',
 'investment',
 'fruitful.',
 'but',
 'we',
 'also',
 'want',
 'to',
 'make',
 'that',
 'investment',
 'more',
 'accessible',
 'by',
 'providing',
 'a',
 'many',
 'financial',
 'pathway',
 'a',
 'possible',
 'for',
 'tuition',
 'funding.',
 'a',
 

In [29]:
article_lemmatized = ' '.join(lemmas)
article_lemmatized

'codeup isnt a cheap program we know that. a one of the longest accelerator in the country, we recognize that cost is a concern for many prospective students. thats why this week we are excited to highlight one of our long-standing financial aid partner project quest to show you how many pathway are available. first of all, we view your tuition a an investment in your future, and we put our best people, ideas, and effort to make that investment fruitful. but we also want to make that investment more accessible by providing a many financial pathway a possible for tuition funding. a we approach their annual fundraising luncheon this week, we want to highlight project quest a one of the most effective such pathway you can explore. the experience of attending codeup ha radically altered my life. now, instead of financially treading water in a service industry job with a college degree that felt wasted, i have a fulfilling professional career with exciting possibility for in-field advanceme

In [30]:
pd.Series(lemmas).value_counts()

a                   19
and                 14
quest               12
of                  10
we                  10
to                  10
project             10
the                  7
for                  7
that                 6
in                   6
our                  5
codeup               4
support              4
career               4
many                 4
this                 4
help                 4
investment           3
change,              3
on                   3
by                   3
your                 3
financial            3
student              3
from                 3
pathway              3
have                 3
tuition              3
one                  3
                    ..
4                    1
long-standing        1
combined             1
providing            1
95%                  1
say                  1
impact               1
come                 1
accessible           1
scientists.          1
possible             1
when                 1
degree     

In [31]:
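def lemmatize(text):
    wnl = nltk.stem.WordNetLemmatizer()
    # Lemmatize each whitespace-separated word, then join the lemmas back together
    lemmas = [wnl.lemmatize(word) for word in text.split()]
    article_lemmatized = ' '.join(lemmas)
    return article_lemmatized

# Keep the earlier misspelled name working for the cells that call it
lemmitize = lemmatize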

In [32]:
lemmitize(prepped_df.content[2])

'11/20/2019 UPDATE: Codeup Dallas is approved and now accepting applications! Learn more here. — Codeup is San Antonio’s premier career accelerator, and we’ve been proud to build the tech workforce here over the last five years. Now, we’re excited to bring our mission of creating pathway to Software Development and Data Science career to more people. Dallas, we have our eye set on you! After five year exclusively focused on San Antonio, why expand now? As the bootcamp model ha grown, we’ve watched our competitor expand nationally, open dozen of campuses, buy and sell to and from each other, launch coworking spaces, and more. Meanwhile, we believe the outcome of our graduate far outweigh the number of campus we have. We have refined our curriculum, built our team, expanded our partnership network, and built a brand people can trust. Where did we all end up? Many of our competitor have closed or sold. We’re still Texas-based, owned by our founders, and our investment in quality ha brough

5. Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords.

    This function should define two optional parameters, `extra_words` and `exclude_words`. These parameters should define any additional stop words to include, and any words that we ***don't*** want to remove.

In [33]:
# Before removing stopwords, we want to segment text into linguistic units such as words or numbers.
# This process is called tokenization.

stopword_list = stopwords.words('english')
stopword_list

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [34]:
words_list = prepped_df.content[0].split()
filtered_words = [word for word in words_list if word not in stopword_list]
filtered_words

['codeup',
 'isnt',
 'cheap',
 'program',
 'know',
 'that.',
 'one',
 'longest',
 'accelerators',
 'country,',
 'recognize',
 'cost',
 'concern',
 'many',
 'prospective',
 'students.',
 'thats',
 'week',
 'excited',
 'highlight',
 'one',
 'long-standing',
 'financial',
 'aid',
 'partners',
 'project',
 'quest',
 'show',
 'many',
 'pathways',
 'available.',
 'first',
 'all,',
 'view',
 'tuition',
 'investment',
 'future,',
 'put',
 'best',
 'people,',
 'ideas,',
 'efforts',
 'make',
 'investment',
 'fruitful.',
 'also',
 'want',
 'make',
 'investment',
 'accessible',
 'providing',
 'many',
 'financial',
 'pathways',
 'possible',
 'tuition',
 'funding.',
 'approach',
 'annual',
 'fundraising',
 'luncheon',
 'week,',
 'want',
 'highlight',
 'project',
 'quest',
 'one',
 'effective',
 'pathways',
 'explore.',
 'experience',
 'attending',
 'codeup',
 'radically',
 'altered',
 'life.',
 'now,',
 'instead',
 'financially',
 'treading',
 'water',
 'service',
 'industry',
 'job',
 'college',
 '

In [35]:
# How many words were removed? Compare the lengths of words_list and filtered_words.
len(words_list) - len(filtered_words)



163

In [36]:
article_without_stopwords = ' '.join(filtered_words)
article_without_stopwords

'codeup isnt cheap program know that. one longest accelerators country, recognize cost concern many prospective students. thats week excited highlight one long-standing financial aid partners project quest show many pathways available. first all, view tuition investment future, put best people, ideas, efforts make investment fruitful. also want make investment accessible providing many financial pathways possible tuition funding. approach annual fundraising luncheon week, want highlight project quest one effective pathways explore. experience attending codeup radically altered life. now, instead financially treading water service industry job college degree felt wasted, fulfilling professional career exciting possibilities in-field advancement. matthew capper project quest san antonio based non-profit dedicated workforce development building citys future economy. provide grant funding educational training programs, well career coaching support resources (like utilities, childcare, tran

In [37]:
def remove_stopwords(text, extra_words=None, exclude_words=None):
    # Accept either a list of words or a single whitespace-separated string
    extra_words = extra_words or []
    exclude_words = exclude_words or []
    if isinstance(extra_words, str):
        extra_words = extra_words.split()
    if isinstance(exclude_words, str):
        exclude_words = exclude_words.split()

    # Set arithmetic avoids the ValueError that list.remove() raises
    # when an excluded word is not in the stopword list
    stopword_set = (set(stopwords.words('english')) - set(exclude_words)) | set(extra_words)

    filtered_words = [word for word in text.split() if word not in stopword_set]

    article_without_stopwords = ' '.join(filtered_words)

    return article_without_stopwords

In [38]:
remove_stopwords(prepped_df.content[0])

'codeup isnt cheap program know that. one longest accelerators country, recognize cost concern many prospective students. thats week excited highlight one long-standing financial aid partners project quest show many pathways available. first all, view tuition investment future, put best people, ideas, efforts make investment fruitful. also want make investment accessible providing many financial pathways possible tuition funding. approach annual fundraising luncheon week, want highlight project quest one effective pathways explore. experience attending codeup radically altered life. now, instead financially treading water service industry job college degree felt wasted, fulfilling professional career exciting possibilities in-field advancement. matthew capper project quest san antonio based non-profit dedicated workforce development building citys future economy. provide grant funding educational training programs, well career coaching support resources (like utilities, childcare, tran
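The effect of the two optional parameters is easier to see on a small input. The sketch below is self-contained and uses a tiny made-up stopword set in place of `stopwords.words('english')`; the name `filter_words` and the sample data are assumptions for illustration. Exclusions are applied before additions, matching the order used in the function above.

```python
def filter_words(text, stopword_list, extra_words=(), exclude_words=()):
    """Drop stopwords from whitespace-separated text.

    extra_words are added to the stopwords; exclude_words are taken out of them.
    """
    active = (set(stopword_list) - set(exclude_words)) | set(extra_words)
    return ' '.join(word for word in text.split() if word not in active)

# A tiny hypothetical stopword set, standing in for NLTK's English list
demo_stopwords = {'a', 'is', 'not', 'the', 'we'}
text = 'codeup is not a cheap program we know'

default_result = filter_words(text, demo_stopwords)
extra_result = filter_words(text, demo_stopwords, extra_words=['codeup'])
exclude_result = filter_words(text, demo_stopwords, exclude_words=['not'])

print(default_result)  # codeup cheap program know
print(extra_result)    # cheap program know
print(exclude_result)  # codeup not cheap program know
```

Keeping a word such as `not` can matter for sentiment-style analysis, where dropping it would reverse the meaning of a phrase.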

6. Define a function named `prep_article` that takes in the dictionary representing an article and returns a dictionary that looks like this:

In [49]:
{
    'title': 'the original title',
    'original': 'original',
    'stemmed': 'article_stemmed',
    'lemmatized': 'article_lemmatized',
    'clean': 'article_without_stopwords'
}

{'title': 'the original title',
 'original': 'original',
 'stemmed': 'article_stemmed',
 'lemmatized': 'article_lemmatized',
 'clean': 'article_without_stopwords'}

Note that if the original dictionary has a `title` property, it should remain unchanged (the same goes for the `category` property).

In [39]:
# Build the list of article dictionaries and check with len() that it has the expected number of entries.
blog_dictionary = acquire.make_blog_dictionary()
len(blog_dictionary)

67

In [40]:
# Display the full list of article dictionaries (titles and bodies).
blog_dictionary

[{'title': 'A Quest Through Codeup',
  'body': 'Codeup isn’t a cheap program – we know that. As one of the longest accelerators in the country, we recognize that cost is a concern for many prospective students. That’s why this week we are excited to highlight one of our long-standing financial aid partners – Project QUEST – to show you how many pathways are available.\xa0 First of all, we view your tuition as an investment in your future, and we put our best people, ideas, and efforts to make that investment fruitful. But we also want to make that investment more accessible by providing as many financial pathways as possible for tuition funding. As we approach their annual fundraising luncheon this week, we want to highlight Project QUEST as one of the most effective such pathways you can explore. “The experience of attending Codeup has radically altered my life. Now, instead of financially treading water in a service industry job with a college degree that felt wasted, I have a fulfil

In [41]:
# Look at the first article dictionary.
blog_dictionary[0]

{'title': 'A Quest Through Codeup',
 'body': 'Codeup isn’t a cheap program – we know that. As one of the longest accelerators in the country, we recognize that cost is a concern for many prospective students. That’s why this week we are excited to highlight one of our long-standing financial aid partners – Project QUEST – to show you how many pathways are available.\xa0 First of all, we view your tuition as an investment in your future, and we put our best people, ideas, and efforts to make that investment fruitful. But we also want to make that investment more accessible by providing as many financial pathways as possible for tuition funding. As we approach their annual fundraising luncheon this week, we want to highlight Project QUEST as one of the most effective such pathways you can explore. “The experience of attending Codeup has radically altered my life. Now, instead of financially treading water in a service industry job with a college degree that felt wasted, I have a fulfilli

In [52]:
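# Build the prepared dictionary by hand for the first article.
# Note: the name prep_article is reused for the function defined in a later cell.
prep_article = {}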
prep_article['title'] = blog_dictionary[0]['title']
prep_article['original'] = blog_dictionary[0]['body']
prep_article['stemmed'] = stem(blog_dictionary[0]['body'])
prep_article['lemmatized'] = lemmitize(blog_dictionary[0]['body'])
prep_article['clean'] = remove_stopwords(blog_dictionary[0]['body'])

In [53]:
prep_article

{'title': 'A Quest Through Codeup',
 'original': 'Codeup isn’t a cheap program – we know that. As one of the longest accelerators in the country, we recognize that cost is a concern for many prospective students. That’s why this week we are excited to highlight one of our long-standing financial aid partners – Project QUEST – to show you how many pathways are available.\xa0 First of all, we view your tuition as an investment in your future, and we put our best people, ideas, and efforts to make that investment fruitful. But we also want to make that investment more accessible by providing as many financial pathways as possible for tuition funding. As we approach their annual fundraising luncheon this week, we want to highlight Project QUEST as one of the most effective such pathways you can explore. “The experience of attending Codeup has radically altered my life. Now, instead of financially treading water in a service industry job with a college degree that felt wasted, I have a fulf

In [56]:
def prep_article(dictionary):
    """Return a dict holding the title, the original body, and the stemmed,
    lemmatized, and stopword-free versions of one article's body."""
    p_article = {}
    p_article['title'] = dictionary['title']
    p_article['original'] = dictionary['body']
    p_article['stemmed'] = stem(dictionary['body'])
    p_article['lemmatized'] = lemmitize(dictionary['body'])
    p_article['clean'] = remove_stopwords(dictionary['body'])
    return p_article

In [57]:
prep_article(blog_dictionary[5])

{'title': 'Why San Antonio Has More Than Tacos To Offer',
 'original': 'Before moving to San Antonio, I was slightly apprehensive. Knowing little to nothing about the city, I didn’t know what to expect. All that had filtered down from the collective consciousness to my perception of the city was the obvious tourist attractions such as the Riverwalk, Alamo, and amazing Tex Mex. After spending a little under a year there, I came to fall in love with this city for reasons beyond what tourists can see in a weekend. The experience of its vibrant and proud Latino culture, flourishing art scene, and emphasis on community have left an indelible mark upon my life.  All this aside, a huge part of my experience in San Antonio was attending Codeup, a coding boot camp that allowed me to have the skills needed to make a complete career shift and also an insider’s look into San Antonio’s flourishing tech community. For those of you that have never had a local’s perspective into the technology layer o
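
The full article text makes these outputs hard to scan. As a minimal sketch for inspecting results, the helper below shortens every string value in a prepared dictionary. The name `preview` and the `width` default are assumptions for illustration and are not part of the exercise; `textwrap.shorten` comes from the standard library and also collapses repeated whitespace.

```python
import textwrap


def preview(article, width=60):
    """Return a copy of `article` with each string value shortened to at most `width` characters.

    `preview` is a hypothetical helper for eyeballing output, not one of the
    functions requested by the exercise.
    """
    return {
        key: textwrap.shorten(value, width=width, placeholder="...")
        for key, value in article.items()
    }


# Small stand-in for a prepared article dictionary.
sample = {
    "title": "Why San Antonio Has More Than Tacos To Offer",
    "original": "Before moving to San Antonio, I was slightly apprehensive. " * 5,
}

short = preview(sample)
for key, value in short.items():
    print(f"{key}: {value}")
```

In the notebook, something like `preview(prep_article(blog_dictionary[5]))` would show one short line per key instead of the whole article.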

7. Define a function named `prepare_article_data` that takes in the list of article dictionaries, applies the `prep_article` function to each one, and returns the transformed data.

In [60]:
def prepare_article_data(list_of_dictionaries):
    """Apply prep_article to every article dictionary and return the results as a list."""
    transformed_articles = []
    for article in list_of_dictionaries:
        transformed_articles.append(prep_article(article))
    return transformed_articles
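
The same loop can also be written as a list comprehension. The sketch below is self-contained so it runs outside the notebook: the `_sketch` function names and the placeholder transforms (`str.lower`, `str.upper`, `str.strip`) are assumptions that stand in for the notebook's own `stem`, `lemmitize`, and `remove_stopwords`, and only illustrate the shape of the data going in and out.

```python
def prep_article_sketch(article, transforms):
    """Build one prepared dict from a {'title', 'body'} article dict."""
    prepped = {"title": article["title"], "original": article["body"]}
    # Each transform receives the raw body, mirroring prep_article above.
    for key, transform in transforms.items():
        prepped[key] = transform(article["body"])
    return prepped


def prepare_article_data_sketch(articles, transforms):
    """List-comprehension equivalent of the append loop."""
    return [prep_article_sketch(article, transforms) for article in articles]


# Placeholder transforms (assumptions): swap in stem, lemmitize, and
# remove_stopwords when running inside the notebook.
placeholder_transforms = {
    "stemmed": str.lower,
    "lemmatized": str.upper,
    "clean": str.strip,
}

articles = [
    {"title": "First", "body": " Hello World "},
    {"title": "Second", "body": "Another Body"},
]

prepared = prepare_article_data_sketch(articles, placeholder_transforms)
print(prepared[0])
```

Each input dictionary yields exactly one output dictionary with the keys `title`, `original`, `stemmed`, `lemmatized`, and `clean`, so the returned list has the same length and order as the input.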

In [61]:
prepare_article_data(blog_dictionary)

[{'title': 'A Quest Through Codeup',
  'original': 'Codeup isn’t a cheap program – we know that. As one of the longest accelerators in the country, we recognize that cost is a concern for many prospective students. That’s why this week we are excited to highlight one of our long-standing financial aid partners – Project QUEST – to show you how many pathways are available.\xa0 First of all, we view your tuition as an investment in your future, and we put our best people, ideas, and efforts to make that investment fruitful. But we also want to make that investment more accessible by providing as many financial pathways as possible for tuition funding. As we approach their annual fundraising luncheon this week, we want to highlight Project QUEST as one of the most effective such pathways you can explore. “The experience of attending Codeup has radically altered my life. Now, instead of financially treading water in a service industry job with a college degree that felt wasted, I have a fu