# Prepare

In [1]:
import pandas as pd

import unicodedata
import re

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
from nltk import PorterStemmer, word_tokenize, WordNetLemmatizer
from acquire import get_new_links, get_news_article, get_article_data

### 1) 
Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:  

Lowercase everything  
Normalize unicode characters  
Remove anything that is not a letter, number, whitespace, or a single quote.

In [31]:
string = "Dark energy is a mysterious and hypothetical form of energy that is believed to make up a large portion of the total mass-energy content of the universe. It is one of the most significant and perplexing discoveries in modern cosmology and was first proposed to explain the observed accelerated expansion of the universe. Acceleration of the Universe: In the late 1990s, astronomers observed that the expansion of the universe is not slowing down due to gravity, as previously thought, but instead accelerating. "

In [3]:
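def basic_clean(string):
    # Lowercase first so the character filter below only needs a-z
    string = string.lower()
    # Decompose accented characters, then drop anything that is not ASCII
    string = unicodedata.normalize('NFKD', string)\
            .encode('ascii', 'ignore')\
            .decode('utf-8', 'ignore')
    # Remove everything except letters, digits, single quotes, and whitespace
    string = re.sub(r"[^a-z0-9'\s]", "", string)
    # Trim leading and trailing whitespace
    string = string.strip()
    return string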

In [4]:
basic_clean(string)

'dark energy is a mysterious and hypothetical form of energy that is believed to make up a large portion of the total massenergy content of the universe it is one of the most significant and perplexing discoveries in modern cosmology and was first proposed to explain the observed accelerated expansion of the universe cceleration of the universe in the late 1990s astronomers observed that the expansion of the universe is not slowing down due to gravity as previously thought but instead accelerating'
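
To see what each step of basic_clean contributes, the sketch below reruns the same three steps on two short strings. The name `basic_clean_demo` and the sample inputs are made up for illustration, and only the standard library is assumed.

```python
import re
import unicodedata


def basic_clean_demo(text):
    """Illustrative copy of the cleaning steps: lowercase, ASCII-fold, filter."""
    text = text.lower()
    # NFKD splits "é" into "e" plus a combining accent; the ASCII
    # round-trip then drops the accent and any other non-ASCII character.
    text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Keep only letters, digits, single quotes, and whitespace.
    return re.sub(r"[^a-z0-9'\s]", "", text).strip()


accented = basic_clean_demo("Café Naïve")        # accents are folded away
curly = basic_clean_demo("India’s mass-energy")  # curly quote and hyphen vanish
print(accented)
print(curly)
```

The second call shows two side effects worth knowing about: a curly apostrophe is not ASCII, so it is dropped entirely ("indias"), and hyphens are removed without inserting a space ("massenergy").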

### 2) 
Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [5]:
def tokenize(string):
    tokenizer = ToktokTokenizer()
    # return_str=True returns one space-joined string instead of a list of tokens
    string = tokenizer.tokenize(string, return_str=True)
    return string

In [6]:
tokenize(string)

'Dark energy is a mysterious and hypothetical form of energy that is believed to make up a large portion of the total mass-energy content of the universe. It is one of the most significant and perplexing discoveries in modern cosmology and was first proposed to explain the observed accelerated expansion of the universe. cceleration of the Universe : In the late 1990s , astronomers observed that the expansion of the universe is not slowing down due to gravity , as previously thought , but instead accelerating.'
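
For intuition about what "tokenizing" means, here is a minimal regex-based stand-in. It is not ToktokTokenizer and will not reproduce its output exactly: this sketch separates every punctuation mark, whereas the output above leaves periods attached to words. The name `simple_tokenize` is hypothetical.

```python
import re

# A word is a run of letters/digits, optionally joined by hyphens or
# apostrophes; any other non-space character becomes its own token.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Return the text with tokens separated by single spaces."""
    return " ".join(TOKEN_PATTERN.findall(text))


print(simple_tokenize("gravity, as previously thought."))
print(simple_tokenize("mass-energy content in the 1990s,"))
```

Returning a space-joined string mirrors the `return_str=True` choice in the notebook's tokenize function, so later steps can simply call `.split()`.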

### 3) 
Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [7]:
def stem(string):
    # Drop English stopwords before stemming (the match is case-sensitive)
    stopwords_list = stopwords.words('english')
    words = [word for word in string.split() if word not in stopwords_list]
    # Apply the Porter stemmer to each remaining word
    ps = PorterStemmer()
    stems = [ps.stem(word) for word in words]
    return ' '.join(stems)

In [8]:
stem(string)

'dark energi mysteri hypothet form energi believ make larg portion total mass-energi content universe. it one signific perplex discoveri modern cosmolog first propos explain observ acceler expans universe. cceler universe: in late 1990s, astronom observ expans univers slow due gravity, previous thought, instead accelerating.'

### 4) 
Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [9]:
def lemmatize(string):
    # Drop English stopwords before lemmatizing (the match is case-sensitive)
    stopwords_list = stopwords.words('english')
    words = [word for word in string.split() if word not in stopwords_list]
    # Lemmatize each remaining word; without a POS tag, WordNet treats it as a noun
    wnl = WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in words]
    return ' '.join(lemmas)

In [10]:
lemmatize(string)

'Dark energy mysterious hypothetical form energy believed make large portion total mass-energy content universe. It one significant perplexing discovery modern cosmology first proposed explain observed accelerated expansion universe. cceleration Universe: In late 1990s, astronomer observed expansion universe slowing due gravity, previously thought, instead accelerating.'
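
The difference between the two outputs above ("discoveri" from stemming, "discovery" from lemmatization) comes from how each technique works. The toy sketch below illustrates the idea only: the suffix rules and the tiny lemma table are invented for this example and are far simpler than the Porter algorithm or WordNet.

```python
# Toy stemmer: chop suffixes by rule, without checking that the result is a word.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Toy lemmatizer: look the word up in a (here, tiny and hypothetical) dictionary.
TOY_LEMMAS = {"discoveries": "discovery", "astronomers": "astronomer"}


def toy_stem(word):
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least three remaining characters so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word


def toy_lemmatize(word):
    # Unknown words are returned unchanged.
    return TOY_LEMMAS.get(word.lower(), word)


for w in ["discoveries", "observed"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

Rule-based stripping is cheap but can produce non-words; dictionary lookup yields real words but leaves anything it does not recognize untouched.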

### 5)
Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.  

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [42]:
def remove_stopwords(string, extra_words=None, exclude_words=None):
    # If extra_words and exclude_words are not provided, use empty sets
    extra_words = extra_words or set()
    exclude_words = exclude_words or set()

    # Add the custom extra words to the stopwords list
    stopwords_list = stopwords.words('english')
    stopwords_list.extend(extra_words)

    # Remove words that are in the stopwords list, but not in the exclude_words list
    words = [word for word in string.split() if word not in stopwords_list or word in exclude_words]
    
    new_data = ' '.join(words)
    return new_data


In [45]:
extra_words = {'mysterious', 'perplexing', 'accelerated'}
exclude_words = {'dark', 'energy'}
remove_stopwords(string, extra_words, exclude_words)

'Dark energy hypothetical form energy believed make large portion total mass-energy content universe. It one significant discoveries modern cosmology first proposed explain observed expansion universe. Acceleration Universe: In late 1990s, astronomers observed expansion universe slowing due gravity, previously thought, instead accelerating.'
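
Note that "It" and "In" survive in the output above: the comparison is case-sensitive and the NLTK stopword list is lowercase. One possible variant, sketched here with a small made-up stopword set so it runs without NLTK data, compares lowercased words instead. The name `remove_stopwords_ci` and `DEMO_STOPWORDS` are assumptions for illustration.

```python
# Hypothetical stand-in for stopwords.words('english'), kept tiny on purpose.
DEMO_STOPWORDS = {"it", "is", "in", "of", "the", "a"}


def remove_stopwords_ci(text, stopword_set=DEMO_STOPWORDS,
                        extra_words=None, exclude_words=None):
    """Remove stopwords, ignoring case; extra/exclude adjust the stop set."""
    stop = set(stopword_set) | set(extra_words or ())
    stop -= set(exclude_words or ())
    # A set gives constant-time membership tests, unlike a list.
    return " ".join(w for w in text.split() if w.lower() not in stop)


print(remove_stopwords_ci("It is one of the discoveries"))
print(remove_stopwords_ci("It is one of the discoveries", exclude_words={"it"}))
```

If the text has already been through basic_clean it is lowercase anyway, so this only matters when the function is called on raw text as in the cell above.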

### 6) 
Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [13]:
# Define the URL to the webpage you want to scrape
url = 'https://inshorts.com/'
# Define the User-Agent header to be used in the HTTP request
headers = {'User-Agent': 'Codeup Data Science'}

In [14]:
links = get_new_links(url, headers, 'div', 'a', 'href')

https://blog.inshorts.com/
https://blog.inshorts.com/
https://blog.inshorts.com/
https://blog.inshorts.com/
https://blog.inshorts.com/
/en/read
https://itunes.apple.com/us/app/news-in-shorts/id892146527
https://itunes.apple.com/us/app/news-in-shorts/id892146527
https://itunes.apple.com/us/app/news-in-shorts/id892146527
https://itunes.apple.com/us/app/news-in-shorts/id892146527
https://itunes.apple.com/us/app/news-in-shorts/id892146527
https://itunes.apple.com/us/app/news-in-shorts/id892146527
/tnc
/tnc
https://facebook.com/inshortsapp
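
The printed list contains repeated URLs and relative paths such as /tnc. Whether get_new_links or get_news_article already handles this is not shown here, so the following is only a sketch of one way to tidy such a list before requesting each page; `normalize_links` is a hypothetical helper, not part of the acquire module.

```python
from urllib.parse import urljoin


def normalize_links(base_url, links):
    """Make links absolute and drop duplicates while keeping first-seen order."""
    absolute = [urljoin(base_url, link) for link in links]
    # dict.fromkeys preserves insertion order, so this de-duplicates stably.
    return list(dict.fromkeys(absolute))


sample = ["/tnc", "/tnc", "https://blog.inshorts.com/", "/en/read"]
print(normalize_links("https://inshorts.com/", sample))
```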


In [15]:
news_df = get_news_article(links, headers)

### 7)
Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [16]:
# Define the URL to the webpage you want to scrape
url = 'https://codeup.edu/blog'
# Define the User-Agent header to be used in the HTTP request
headers = {'User-Agent': 'Codeup Data Science'}

In [17]:
links = get_new_links(url, headers, 'h2', 'a', 'href')

https://codeup.edu/featured/apida-heritage-month/
https://codeup.edu/featured/women-in-tech-panelist-spotlight/
https://codeup.edu/featured/women-in-tech-rachel-robbins-mayhill/
https://codeup.edu/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/
https://codeup.edu/events/women-in-tech-madeleine/
https://codeup.edu/codeup-news/panelist-spotlight-4/


In [18]:
codeup_df = get_article_data(links, headers)

### 8)
For each dataframe, produce the following columns:

title to hold the title  
original to hold the original article/post content  
clean to hold the normalized and tokenized original with the stopwords removed.  
stemmed to hold the stemmed version of the cleaned data.  
lemmatized to hold the lemmatized version of the cleaned data.  

### news_df

In [19]:
#title to hold the title 
title = news_df['title']
title

'86% Indians feel voting should be made compulsory: Public App Survey'

In [20]:
#original to hold the original article/post content 
original = news_df['content']
original

'\nOn the occasion of the 12th National Voters’ Day, Public App, India’s largest location-based social network, conducted a survey to understand how seriously Indians take their voting rights and on what factors they evaluate for which candidate to choose. This pan-India poll was conducted with a sizable data pool of over 4 lakh people. As…More \n\nAs 2021 draws to a close end, users of location-based social network Public weighed in their views on the age-old tradition of making New Year’s resolutions. A New Year’s resolution is a common practice in which a person resolves to continue good habits, change an undesired trait or behavior, accomplish a personal goal, or otherwise…More \n\nWhile addressing the nation on November 19, Prime Minister Narendra Modi announced that the three contentious farm laws -The Farmers’ Produce Trade and Commerce (Promotion and Facilitation) Bill, 2020, The Farmers (Empowerment and Protection) Agreement of Price Assurance and Farm Services Bill, 2020 & Th

In [21]:
#clean to hold the normalized and tokenized original with the stopwords removed.  
clean = basic_clean(original)
clean = tokenize(clean)
clean = remove_stopwords(clean)
clean

'occasion 12th national voters day public app indias largest locationbased social network conducted survey understand seriously indians take voting rights factors evaluate candidate choose panindia poll conducted sizable data pool 4 lakh people asmore 2021 draws close end users locationbased social network public weighed views ageold tradition making new years resolutions new years resolution common practice person resolves continue good habits change undesired trait behavior accomplish personal goal otherwisemore addressing nation november 19 prime minister narendra modi announced three contentious farm laws farmers produce trade commerce promotion facilitation bill 2020 farmers empowerment protection agreement price assurance farm services bill 2020 essential commodities amendment bill 2020 would repealed inmore latest round opinion polls conducted locationbased social network public revealed 646 cricket fans strongly agree mentor ms dhoni help improve indias prospects t20 poll gauge

In [22]:
#stemmed to hold the stemmed version of the cleaned data. 
stemmed = stem(clean)
stemmed

'occas 12th nation voter day public app india largest locationbas social network conduct survey understand serious indian take vote right factor evalu candid choos panindia poll conduct sizabl data pool 4 lakh peopl asmor 2021 draw close end user locationbas social network public weigh view ageold tradit make new year resolut new year resolut common practic person resolv continu good habit chang undesir trait behavior accomplish person goal otherwisemor address nation novemb 19 prime minist narendra modi announc three contenti farm law farmer produc trade commerc promot facilit bill 2020 farmer empower protect agreement price assur farm servic bill 2020 essenti commod amend bill 2020 would repeal inmor latest round opinion poll conduct locationbas social network public reveal 646 cricket fan strongli agre mentor ms dhoni help improv india prospect t20 poll gaug opinion bharat user ahead icc men t20 world cup 2021 year bcci host ofmor 5th octob 2021 festiv season around corner recent co

In [23]:
#lemmatized to hold the lemmatized version of the cleaned data.  
lemmatized = lemmatize(clean)
lemmatized

'occasion 12th national voter day public app india largest locationbased social network conducted survey understand seriously indian take voting right factor evaluate candidate choose panindia poll conducted sizable data pool 4 lakh people asmore 2021 draw close end user locationbased social network public weighed view ageold tradition making new year resolution new year resolution common practice person resolve continue good habit change undesired trait behavior accomplish personal goal otherwisemore addressing nation november 19 prime minister narendra modi announced three contentious farm law farmer produce trade commerce promotion facilitation bill 2020 farmer empowerment protection agreement price assurance farm service bill 2020 essential commodity amendment bill 2020 would repealed inmore latest round opinion poll conducted locationbased social network public revealed 646 cricket fan strongly agree mentor m dhoni help improve india prospect t20 poll gauged opinion bharat user ah

### codeup_df

In [24]:
#title to hold the title 
title = [item['title'] for item in codeup_df]
title

['Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa',
 'Women in tech: Panelist Spotlight – Magdalena Rahn',
 'Women in tech: Panelist Spotlight – Rachel Robbins-Mayhill',
 'Women in Tech: Panelist Spotlight – Sarah Mellor',
 'Women in Tech: Panelist Spotlight – Madeleine Capper',
 'Black Excellence in Tech: Panelist Spotlight – Wilmarie De La Cruz Mejia']

In [25]:
#original to hold the original article/post content 
original = [item['content'] for item in codeup_df]
original

['May is traditionally known as Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contributions made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunities to better understand the AAPI community.In an effort to address real concerns and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers.Arbeena identifies as Nepali American and Desi. Arbeena’s parents immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena’s older sister was five when they made the move to the US. Arbeena was born later, becoming the first in her family to be a US citizen.At Codeup we take our efforts at inclusivity very seriously. After speaking with Arbeena, we were taught that the term AAPI excludes Desi-American individuals. Hence, we will now use the term Asian Pacific Islander Desi American (APIDA).Here is how the rest of our c

In [26]:
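#clean to hold the normalized and tokenized original with the stopwords removed. 
# Note: clean is a list at this point; tokenize() converts the whole list to a
# single string, which is why the output below starts with "[ '" instead of
# holding one cleaned entry per post.
clean = [basic_clean(text) for text in original]
clean = tokenize(clean)
clean = remove_stopwords(clean)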
clean

"[ ' may traditionally known asian american pacific islander aapi heritage month month celebrate history contributions made possible aapi friends family community also examine level support seek opportunities better understand aapi communityin effort address real concerns experiences sat arbeena thapa one codeups financial aid enrollment managersarbeena identifies nepali american desi arbeenas parents immigrated texas 1988 better employment educational opportunities arbeenas older sister five made move us arbeena born later becoming first family us citizenat codeup take efforts inclusivity seriously speaking arbeena taught term aapi excludes desiamerican individuals hence use term asian pacific islander desi american apidahere rest conversation arbeena wenthow celebrate connect heritage cultural traditionsi celebrate nepals version christmas dashain nineday celebration also known dussehra grew hindu identify hindu large part heritage ways connect culture include sharing food momos sout

In [27]:
#stemmed to hold the stemmed version of the cleaned data. 
stemmed = stem(clean)
stemmed

"[ ' may tradit known asian american pacif island aapi heritag month month celebr histori contribut made possibl aapi friend famili commun also examin level support seek opportun better understand aapi communityin effort address real concern experi sat arbeena thapa one codeup financi aid enrol managersarbeena identifi nepali american desi arbeena parent immigr texa 1988 better employ educ opportun arbeena older sister five made move us arbeena born later becom first famili us citizenat codeup take effort inclus serious speak arbeena taught term aapi exclud desiamerican individu henc use term asian pacif island desi american apidaher rest convers arbeena wenthow celebr connect heritag cultur traditionsi celebr nepal version christma dashain nineday celebr also known dussehra grew hindu identifi hindu larg part heritag way connect cultur includ share food momo south asian dumpl theyr favorit make shareon asian american side advoc immigr justic erasur within apida social polit movement p

In [28]:
#lemmatized to hold the lemmatized version of the cleaned data.
lemmatized = lemmatize(clean)
lemmatized

"[ ' may traditionally known asian american pacific islander aapi heritage month month celebrate history contribution made possible aapi friend family community also examine level support seek opportunity better understand aapi communityin effort address real concern experience sat arbeena thapa one codeups financial aid enrollment managersarbeena identifies nepali american desi arbeenas parent immigrated texas 1988 better employment educational opportunity arbeenas older sister five made move u arbeena born later becoming first family u citizenat codeup take effort inclusivity seriously speaking arbeena taught term aapi excludes desiamerican individual hence use term asian pacific islander desi american apidahere rest conversation arbeena wenthow celebrate connect heritage cultural traditionsi celebrate nepal version christmas dashain nineday celebration also known dussehra grew hindu identify hindu large part heritage way connect culture include sharing food momos south asian dumplin

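The exercise asks for one row per article with title, original, clean, stemmed, and lemmatized columns. A minimal sketch of that shape is below. It assumes each post is a dict with 'title' and 'content' keys, as the list comprehensions above suggest, and takes the processing functions as parameters so it runs without NLTK; `build_rows` is a hypothetical helper name.

```python
def build_rows(posts, clean_fn, stem_fn, lemmatize_fn):
    """Apply the prepare steps to each post separately and collect one row per post."""
    rows = []
    for post in posts:
        cleaned = clean_fn(post["content"])
        rows.append({
            "title": post["title"],
            "original": post["content"],
            "clean": cleaned,
            "stemmed": stem_fn(cleaned),
            "lemmatized": lemmatize_fn(cleaned),
        })
    return rows


# Stand-in functions, used only to show the resulting structure.
demo_posts = [{"title": "A", "content": "Hello World"}]
print(build_rows(demo_posts, str.lower, str.upper, lambda s: s[::-1]))
```

With the notebook's own functions, `clean_fn` could be `lambda t: remove_stopwords(tokenize(basic_clean(t)))`, with `stem` and `lemmatize` passed for the other two, and the returned list handed to `pd.DataFrame(...)` to get the requested columns. Processing each post on its own keeps the posts separate instead of merging them into one string.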
### 9)
Ask yourself:

### a) If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?  
Lemmatized. Stemming and lemmatization both reduce words to a base form, but lemmatization returns real dictionary words (for example, "discovery" rather than "discoveri"). At 493KB, its extra processing time is negligible.

### b) If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?  
Lemmatized. A 25MB corpus is larger than the previous one but still small enough that lemmatization's extra processing time is affordable, so its more readable output remains the better choice.

### c) If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?  
Stemmed. With 200TB of text and per-megabyte charges for compute, I would prefer stemming: it applies simple suffix rules without dictionary lookups, so it generally runs faster and costs less than lemmatization.