In [58]:
import pandas as pd
import unicodedata
import re
import nltk
import acquire as a
import os

from requests import get
from bs4 import BeautifulSoup

from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

# The end result of this exercise should be a file named prepare.py that defines the requested functions.

# In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

# 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

* Lowercase everything
* Normalize unicode characters
* Remove anything that is not a letter, number, whitespace, or a single quote.
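
A minimal sketch of the three steps above, run one at a time on a short made-up string (the sample text and variable names are assumptions for illustration, not part of the exercise). It shows why curly apostrophes and hyphens disappear from cleaned text: the ASCII encode drops characters that have no ASCII equivalent, and the regex then removes the remaining punctuation.

```python
import re
import unicodedata

sample = "Codeup’s Café: Desi-American, 1988!"  # hypothetical sample text

# Step 1: lowercase everything
lowered = sample.lower()

# Step 2: decompose accented characters (NFKD), then drop anything non-ASCII
normalized = (
    unicodedata.normalize("NFKD", lowered)
    .encode("ascii", "ignore")
    .decode("utf-8")
)

# Step 3: remove everything except letters, digits, whitespace and single quotes
cleaned = re.sub(r"[^a-z0-9'\s]", "", normalized)

print(cleaned)  # codeups cafe desiamerican 1988
```

Note that the curly apostrophe in "Codeup’s" is not a plain single quote, so it is dropped in step 2 rather than kept by the regex in step 3.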

In [2]:
string = a.get_blog_articles_data()
string = string.content[0]
string

'May is traditionally known as Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contributions made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunities to better understand the AAPI community.\n\nIn an effort to address real concerns and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers.\nArbeena identifies as Nepali American and Desi. Arbeena’s parents immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena’s older sister was five when they made the move to the US. Arbeena was born later, becoming the first in her family to be a US citizen.\nAt Codeup we take our efforts at inclusivity very seriously. After speaking with Arbeena, we were taught that the term AAPI excludes Desi-American individuals. Hence, we will now use the term Asian Pacific Islander Desi American (APIDA).\nHere is how the rest

In [3]:
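def basic_clean(text):
    
    # Lowercase everything
    text = text.lower()
    
    # Normalize unicode: decompose characters, then drop anything non-ASCII
    text = (
        unicodedata.normalize('NFKD', text)
        .encode('ascii', 'ignore')
        .decode('utf-8', 'ignore')
    )
    
    # Remove anything that is not a letter, number, whitespace or single quote
    text = re.sub(r"[^a-z0-9'\s]", '', text)
    
    return text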

In [4]:
string = basic_clean(string)
string

'may is traditionally known as asian american and pacific islander aapi heritage month this month we celebrate the history and contributions made possible by our aapi friends family and community we also examine our level of support and seek opportunities to better understand the aapi community\n\nin an effort to address real concerns and experiences we sat down with arbeena thapa one of codeups financial aid and enrollment managers\narbeena identifies as nepali american and desi arbeenas parents immigrated to texas in 1988 for better employment and educational opportunities arbeenas older sister was five when they made the move to the us arbeena was born later becoming the first in her family to be a us citizen\nat codeup we take our efforts at inclusivity very seriously after speaking with arbeena we were taught that the term aapi excludes desiamerican individuals hence we will now use the term asian pacific islander desi american apida\nhere is how the rest of our conversation with 

# 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [5]:
def tokenize(text):
    
    tokenizer = nltk.tokenize.ToktokTokenizer()

    text = tokenizer.tokenize(text, return_str=True)
    
    return text


In [6]:
string = tokenize(string)
string

'may is traditionally known as asian american and pacific islander aapi heritage month this month we celebrate the history and contributions made possible by our aapi friends family and community we also examine our level of support and seek opportunities to better understand the aapi community\n\nin an effort to address real concerns and experiences we sat down with arbeena thapa one of codeups financial aid and enrollment managers\narbeena identifies as nepali american and desi arbeenas parents immigrated to texas in 1988 for better employment and educational opportunities arbeenas older sister was five when they made the move to the us arbeena was born later becoming the first in her family to be a us citizen\nat codeup we take our efforts at inclusivity very seriously after speaking with arbeena we were taught that the term aapi excludes desiamerican individuals hence we will now use the term asian pacific islander desi american apida\nhere is how the rest of our conversation with 

# 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [7]:
def stem(text):
    
    ps = nltk.stem.PorterStemmer()
    
    stems = [ps.stem(word) for word in text.split()]
    
    text_stemmed = ' '.join(stems)
    
    return text_stemmed


In [8]:
stem(string)

'may is tradit known as asian american and pacif island aapi heritag month thi month we celebr the histori and contribut made possibl by our aapi friend famili and commun we also examin our level of support and seek opportun to better understand the aapi commun in an effort to address real concern and experi we sat down with arbeena thapa one of codeup financi aid and enrol manag arbeena identifi as nepali american and desi arbeena parent immigr to texa in 1988 for better employ and educ opportun arbeena older sister wa five when they made the move to the us arbeena wa born later becom the first in her famili to be a us citizen at codeup we take our effort at inclus veri serious after speak with arbeena we were taught that the term aapi exclud desiamerican individu henc we will now use the term asian pacif island desi american apida here is how the rest of our convers with arbeena went how do you celebr or connect with your heritag and cultur tradit i celebr nepal version of christma o

# 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [9]:
def lemmatize(text):
    
    wnl = nltk.stem.WordNetLemmatizer()
    
    lemmas = [wnl.lemmatize(word) for word in text.split()]
    
    text_lemmatized = ' '.join(lemmas)

    return text_lemmatized

In [10]:
lemmatize(string)

'may is traditionally known a asian american and pacific islander aapi heritage month this month we celebrate the history and contribution made possible by our aapi friend family and community we also examine our level of support and seek opportunity to better understand the aapi community in an effort to address real concern and experience we sat down with arbeena thapa one of codeups financial aid and enrollment manager arbeena identifies a nepali american and desi arbeenas parent immigrated to texas in 1988 for better employment and educational opportunity arbeenas older sister wa five when they made the move to the u arbeena wa born later becoming the first in her family to be a u citizen at codeup we take our effort at inclusivity very seriously after speaking with arbeena we were taught that the term aapi excludes desiamerican individual hence we will now use the term asian pacific islander desi american apida here is how the rest of our conversation with arbeena went how do you 

# 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

# This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.
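
One way to think about the two optional parameters is as set operations on the stopword list. The sketch below uses a tiny stand-in stopword set so it runs without NLTK; `BASE_STOPWORDS`, `build_stopwords`, and `drop_stopwords` are hypothetical names for illustration only.

```python
# Tiny stand-in list (an assumption); the real function would use NLTK's English stopwords
BASE_STOPWORDS = {"i", "it", "the", "a", "and", "is"}


def build_stopwords(extra_words=None, exclude_words=None):
    """Return the base stopwords plus extra_words, minus exclude_words."""
    stopword_set = set(BASE_STOPWORDS)
    stopword_set |= set(extra_words or [])    # additional words to remove
    stopword_set -= set(exclude_words or [])  # words we want to keep in the text
    return stopword_set


def drop_stopwords(text, extra_words=None, exclude_words=None):
    """Remove every whitespace-separated word that is in the stopword set."""
    stopword_set = build_stopwords(extra_words, exclude_words)
    return " ".join(w for w in text.split() if w not in stopword_set)


result = drop_stopwords(
    "i think it is the best codeup blog",
    extra_words=["codeup"],
    exclude_words=["i", "it"],
)
print(result)  # i think it best blog
```

Pass the extra and excluded words as lists or tuples; a bare string would be split into individual characters by `set()`. Subtracting a word that is not in the set is harmless, whereas `list.remove` raises a `ValueError` in that case.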

In [39]:
stopword_list = stopwords.words('english')
stopword_list

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [40]:
stopword_list.remove('not')
stopword_list

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [42]:
# Add custom stopwords to the list
extra_words = ['marcelino', 'salazar']
stopword_list.extend(extra_words)
stopword_list

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [43]:
def remove_stopwords(text, extra_words=None, exclude_words=None):
    
    # Use a set for fast membership tests
    stopword_set = set(stopwords.words('english'))
    
    # Keep the words the caller does not want removed; discard() ignores
    # words that are not stopwords instead of raising an error
    if exclude_words is not None:
        for w in exclude_words:
            stopword_set.discard(w)
    
    # Add any extra stopwords supplied by the caller
    if extra_words is not None:
        stopword_set.update(extra_words)
    
    words = text.split()
    
    filtered_words = [w for w in words if w not in stopword_set]

    print('Removed {} stopwords'.format(len(words) - len(filtered_words)))
    print('---')

    text_without_stopwords = ' '.join(filtered_words)

    return text_without_stopwords

In [47]:
remove_stopwords(string, exclude_words = ('i', 'it'))

Removed 379 stopwords
---


'may traditionally known asian american pacific islander aapi heritage month month celebrate history contributions made possible aapi friends family community also examine level support seek opportunities better understand aapi community effort address real concerns experiences sat arbeena thapa one codeups financial aid enrollment managers arbeena identifies nepali american desi arbeenas parents immigrated texas 1988 better employment educational opportunities arbeenas older sister five made move us arbeena born later becoming first family us citizen codeup take efforts inclusivity seriously speaking arbeena taught term aapi excludes desiamerican individuals hence use term asian pacific islander desi american apida rest conversation arbeena went celebrate connect heritage cultural traditions i celebrate nepals version christmas dashain nineday celebration also known dussehra i grew hindu i identify hindu large part heritage ways i connect culture include sharing food momos south asian

# 6. Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe news_df.

# 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [85]:
def codeup_df(refresh=False):
    
    if not os.path.isfile('codeup_df.csv') or refresh:
        
        url = 'https://codeup.com/blog/'
        headers = {'User-Agent': 'Codeup Data Science'}
        response = get(url, headers=headers)

        soup = BeautifulSoup(response.content, 'html.parser')

        links = [link['href'] for link in soup.select('h2 a[href]')]

        articles = []

        for url in links:

            url_response = get(url, headers=headers)
            soup = BeautifulSoup(url_response.text, 'html.parser')

            title = soup.find('h1', class_='entry-title').text
            content = soup.find('div', class_='entry-content').text.strip()
            clean = basic_clean(content)
            tokenized = tokenize(clean)
            final = remove_stopwords(tokenized)
            stemmed = stem(final)
            lemmatized = lemmatize(final)

            article_dict = {
                'title': title,
                'original': content,
                'clean': final,
                'stemmed': stemmed,
                'lemmatized': lemmatized,
                
            }

            articles.append(article_dict)
        
        blog_article_df = pd.DataFrame(articles)
        
        blog_article_df.to_csv('codeup_df.csv', index=False)
        
    return pd.read_csv('codeup_df.csv')

In [86]:
string = codeup_df()
string

Removed 396 stopwords
---
Removed 75 stopwords
---
Removed 167 stopwords
---
Removed 98 stopwords
---
Removed 87 stopwords
---
Removed 70 stopwords
---


Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Spotlight on APIDA Voices: Celebrating Heritag...,May is traditionally known as Asian American a...,may traditionally known asian american pacific...,may tradit known asian american pacif island a...,may traditionally known asian american pacific...
1,Women in tech: Panelist Spotlight – Magdalena ...,Women in tech: Panelist Spotlight – Magdalena ...,women tech panelist spotlight magdalena rahn c...,women tech panelist spotlight magdalena rahn c...,woman tech panelist spotlight magdalena rahn c...
2,Women in tech: Panelist Spotlight – Rachel Rob...,Women in tech: Panelist Spotlight – Rachel Rob...,women tech panelist spotlight rachel robbinsma...,women tech panelist spotlight rachel robbinsma...,woman tech panelist spotlight rachel robbinsma...
3,Women in Tech: Panelist Spotlight – Sarah Mellor,Women in tech: Panelist Spotlight – Sarah Mell...,women tech panelist spotlight sarah mellor cod...,women tech panelist spotlight sarah mellor cod...,woman tech panelist spotlight sarah mellor cod...
4,Women in Tech: Panelist Spotlight – Madeleine ...,Women in tech: Panelist Spotlight – Madeleine ...,women tech panelist spotlight madeleine capper...,women tech panelist spotlight madelein capper ...,woman tech panelist spotlight madeleine capper...
5,Black Excellence in Tech: Panelist Spotlight –...,Black excellence in tech: Panelist Spotlight –...,black excellence tech panelist spotlight wilma...,black excel tech panelist spotlight wilmari de...,black excellence tech panelist spotlight wilma...


In [87]:
words = string.original.str.split()
len(words.sum())

2068

In [88]:
words = string.clean.str.split()
len(words.sum())

1164

In [89]:
words = string.stemmed.str.split()
len(words.sum())

1164

In [90]:
words = string.lemmatized.str.split()
len(words.sum())

1164

# 8. For each dataframe, produce the following columns:

* title to hold the title
* original to hold the original article/post content
* clean to hold the normalized and tokenized original with the stopwords removed.
* stemmed to hold the stemmed version of the cleaned data.
* lemmatized to hold the lemmatized version of the cleaned data.

# 9. Ask yourself:

* If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
  Answer: lemmatized text. The corpus is small, so the extra processing that lemmatization needs is negligible, and the output keeps real words.
* If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
  Answer: lemmatized text. This is still a relatively small corpus, so lemmatization remains affordable.
* If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
  Answer: stemmed text. Stemming is the computationally cheaper operation, which matters at this scale when compute is billed by the megabyte.
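
To make the trade-off concrete, here is a toy sketch, not the Porter algorithm or WordNet. The suffix rules, the lookup table, and the function names are all hypothetical. It only illustrates the general idea: a stemmer applies a few string rules and may return non-words, while a lemmatizer needs a dictionary lookup to return real base forms.

```python
# Toy illustration only: these rules and this table are made up for the example.
SUFFIXES = ("ing", "ies", "es", "s")  # hypothetical suffix-stripping rules
LEMMA_TABLE = {                       # hypothetical dictionary of base forms
    "families": "family",
    "was": "be",
    "celebrating": "celebrate",
}


def toy_stem(word):
    """Strip the first matching suffix; no dictionary is needed."""
    for suffix in SUFFIXES:
        # Leave very short words alone so something like 'was' is not cut to 'wa'
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words pass through unchanged."""
    return LEMMA_TABLE.get(word, word)


words = ["families", "was", "celebrating"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)   # ['famil', 'was', 'celebrat']
print(lemmas)  # ['family', 'be', 'celebrate']
```

The stems are shorter and cheap to compute but are not always real words; the lemmas are readable but depend on having, and searching, a dictionary.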