# Exercises
- The end result of this exercise should be a file named prepare.py that defines the requested functions.

- In this exercise we will be defining some functions to prepare textual data. These functions should apply equally well to both the codeup blog articles and the news articles that were previously acquired.

#1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Replace anything that is not a letter, number, whitespace or a single quote.

#2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

#3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

#4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

#5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.
    - This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.
    
#6. Use your data from the acquire to produce a dataframe of the news articles. Name the dataframe news_df.

#7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

#8. For each dataframe, produce the following columns:

    - title to hold the title
    - original to hold the original article/post content
    - clean to hold the normalized and tokenized original with the stopwords removed.
    - stemmed to hold the stemmed version of the cleaned data.
    - lemmatized to hold the lemmatized version of the cleaned data.
    
#9. Ask yourself:
    - If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
    - If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
    - If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

# Imports

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
from requests import get
from bs4 import BeautifulSoup

# We don't need to install nltk, it should come with anaconda, but nltk
# does need to download some data.
nltk.download('stopwords')

import pandas as pd

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/linhquach/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Replace anything that is not a letter, number, whitespace or a single quote.


In [2]:
original = "Paul Erdős and George Pólya are influential Hungarian mathematicians who contributed \
a lot to the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), \
but is often incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"
original

"Paul Erdős and George Pólya are influential Hungarian mathematicians who contributed a lot to the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), but is often incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"

In [3]:
def basic_clean(string_value):
    #lowercase all letters in the text

    article = original.lower()

    # Normalizaton: Remove inconsistencies in unicode charater encoding.
    # encode the strings into ASCII byte-strings (ignore non-ASCII characters)
    # decode the byte-string back into a string

    unicodedata.normalize('NFKD', article)\
    .encode('ascii', 'ignore')\
    .decode('utf-8')
    # remove anything that is not a through z, a number, a single quote, or whitespace
    article = re.sub(r"[^a-z0-9\s]", '', article)
    
    return article

In [4]:
article= basic_clean(original)

# 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.


In [5]:
def tokenize(article):
    # Create the tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()

    # Use the tokenizer
    article = tokenizer.tokenize(article, return_str= True)
    return article

In [6]:
article= tokenize(article)

# 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.


In [7]:
def stem(article):
    # Create porter stemmer.
    ps = nltk.porter.PorterStemmer()
    # Check stemmer. It works.
    ps.stem('Calling')
    # Apply the stemmer to each word in our string.
    stems =[ps.stem(word) for word in article.split()]
    # Join our lists of words into a string again; assign to a variable to save changes
    article_stemmed = ' '.join(stems)
    return article_stemmed


In [8]:
article_stemmed= stem(article)

In [9]:
article_stemmed

'paul erd and georg plya are influenti hungarian mathematician who contribut a lot to the field erdss name contain the hungarian letter o with doubl acut accent but is often incorrectli written as erdo or erd either by mistak or out of typograph necess'

# 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.


In [10]:
def lemmatize(article):
    # Create the Lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    # Use the lemmatizer on each word in the list of words we created by using split.
    lemmas = [wnl.lemmatize(word) for word in article.split()]
    # Join our list of words into a string again; assign to a variable to save changes.
    article_lemmatized = ' '.join(lemmas)
    return article_lemmatized

In [11]:
article_lemmatized= lemmatize(article)

In [12]:
article_lemmatized

'paul erds and george plya are influential hungarian mathematician who contributed a lot to the field erdss name contains the hungarian letter o with double acute accent but is often incorrectly written a erdos or erds either by mistake or out of typographical necessity'

# 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.
    - This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.


In [19]:
def remove_stopwords(article_lemmatized, extra_words= [], exclude_words=[]):
    # standard English language stopwords list from nltk
    stopword_list = stopwords.words('english')
    # remove 'exlude_words' from stopword_list to keep these
    stopword_list= set(stopword_list) - set(exclude_words)
    #Add in 'extra_words' to stopword_list
    stopword_list= stopword_list.union(set(extra_words))
    #split words
    words= article_lemmatized.split()
#     # you can add or remove from stopword list 
#     stopword_list.append('o')
#     stopword_list.remove('not')
#     stopword_list.append("'")
#     # Split words in lemmatized article.
#     words = article_lemmatized.split()
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    # Join words in the list back into strings; assign to a variable to keep changes.
    article_without_stopwords = ' '.join(filtered_words)
    return article_without_stopwords

    

In [20]:
article_without_stopwords= remove_stopwords(article_lemmatized)

In [21]:
article_without_stopwords

'paul erds george plya influential hungarian mathematician contributed lot field erdss name contains hungarian letter double acute accent often incorrectly written erdos erds either mistake typographical necessity'

# 6. Use your data from the acquire to produce a dataframe of the news articles. Name the dataframe news_df.

In [22]:
def get_article(article, category):
    # Attribute selector
    title = article.select("[itemprop='headline']")[0].text
    
    # article body
    content = article.select("[itemprop='articleBody']")[0].text
    
    output = {}
    output["title"] = title
    output["content"] = content
    output["category"] = category
    
    return output

In [23]:
def get_articles(category, base ="https://inshorts.com/en/read/"):
    """
    This function takes in a category as a string. Category must be an available category in inshorts
    Returns a list of dictionaries where each dictionary represents a single inshort article
    """
    
    # We concatenate our base_url with the category
    url = base + category
    
    # Set the headers
    headers = {"User-Agent": "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"}

    # Get the http response object from the server
    response = get(url, headers=headers)

    # Make soup out of the raw html
    soup = BeautifulSoup(response.text)
    
    # Ignore everything, focusing only on the news cards
    articles = soup.select(".news-card")
    
    output = []
    
    # Iterate through every article tag/soup 
    for article in articles:
        
        # Returns a dictionary of the article's title, body, and category
        article_data = get_article(article, category) 
        
        # Append the dictionary to the list
        output.append(article_data)
    
    # Return the list of dictionaries
    return output

In [24]:
def get_all_news_articles(categories):
    """
    Takes in a list of categories where the category is part of the URL pattern on inshorts
    Returns a dataframe of every article from every category listed
    Each row in the dataframe is a single article
    """
    all_inshorts = []

    for category in categories:
        all_category_articles = get_articles(category)
        all_inshorts = all_inshorts + all_category_articles

    df = pd.DataFrame(all_inshorts)
    return df

In [25]:
categories = ["business", "sports", "technology", "entertainment", "science", "world"]
news_df = get_all_news_articles(categories)

In [26]:
news_df

Unnamed: 0,title,content,category
0,"Reliance Industries vaccinates 98% of workers,...",Reliance Industries has said in a statement th...,business
1,"Musk criticises Apple's 'walled garden', cobal...",Tesla's billionaire CEO Elon Musk criticised A...,business
2,I will most likely not be on future earnings c...,Tesla CEO and the world's second-richest perso...,business
3,Speculation around our plans for crypto not tr...,Amazon on Monday denied speculations that it w...,business
4,Factually incorrect: INOX on report of Amazon ...,INOX Leisure denied a report that claimed Amaz...,business
...,...,...,...
143,UN panel uses $600mn in Iraqi funds to pay Kuw...,A UN commission on Tuesday used $600 million i...,world
144,Samoa's first female PM takes office after con...,Samoa's first female PM Fiame Naomi Mata'afa t...,world
145,1st person charged under Hong Kong national se...,The first person to be tried under the nationa...,world
146,6 killed after rains trigger landslides in ref...,Bangladeshi officials on Tuesday said that at ...,world


# 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [27]:
def get_codeup_blog(url):
    
    # Set the headers to show as Netscape Navigator on Windows 98, b/c I feel like creating an anomaly in the logs
    headers = {"User-Agent": "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"}

    # Get the http response object from the server
    response = get(url, headers=headers)
    
    soup = BeautifulSoup(response.text)
    
    title = soup.find("h1").text
    published_date = soup.time.text
    
    if len(soup.select(".jupiterx-post-image")) > 0:
        blog_image = soup.select(".jupiterx-post-image")[0].picture.img["data-src"]
    else:
        blog_image = None
        
    content = soup.select(".jupiterx-post-content")[0].text
    
    output = {}
    output["title"] = title
    output["published_date"] = published_date
    output["blog_image"] = blog_image
    output["content"] = content
    
    return output

In [28]:
def get_blog_articles(urls):
    # List of dictionaries
    posts = [get_codeup_blog(url) for url in urls]
    
    return pd.DataFrame(posts)

In [29]:
urls = [
    "https://codeup.com/codeups-data-science-career-accelerator-is-here/",
    "https://codeup.com/data-science-myths/",
    "https://codeup.com/data-science-vs-data-analytics-whats-the-difference/",
    "https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/",
    "https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/"
]

In [30]:
Codeup_df = get_blog_articles(urls)

In [31]:
Codeup_df.head()

Unnamed: 0,title,published_date,blog_image,content
0,Codeup’s Data Science Career Accelerator is Here!,"September 30, 2018",https://codeup.com/wp-content/uploads/2018/10/...,The rumors are true! The time has arrived. Cod...
1,Data Science Myths,"October 31, 2018",https://codeup.com/wp-content/uploads/2018/10/...,By Dimitri Antoniou and Maggie Giust\nData Sci...
2,Data Science VS Data Analytics: What’s The Dif...,"October 17, 2018",https://codeup.com/wp-content/uploads/2018/10/...,"By Dimitri Antoniou\nA week ago, Codeup launch..."
3,10 Tips to Crush It at the SA Tech Job Fair,"August 14, 2018",,SA Tech Job Fair\nThe third bi-annual San Anto...
4,Competitor Bootcamps Are Closing. Is the Model...,"August 14, 2018",,Competitor Bootcamps Are Closing. Is the Model...


# 8. For each dataframe, produce the following columns:

    - title to hold the title
    - original to hold the original article/post content
    - clean to hold the normalized and tokenized original with the stopwords removed.
    - stemmed to hold the stemmed version of the cleaned data.
    - lemmatized to hold the lemmatized version of the cleaned data.

In [36]:
def dataframes(df): 
    df.rename(columns={'content': 'original'}, inplace=True)
    df['clean']= df['original'].apply(basic_clean)
    df['clean']= df['original'].apply(tokenize)
    df['stemmed']= df['original'].apply(stem)
    df['lemmatized']= df['original'].apply(lemmatize)
    return df
    
    

In [38]:
#Using function on news_df
news_edited_df= dataframes(news_df)

In [39]:
#Looking at values
news_edited_df.head()

Unnamed: 0,title,original,category,clean,stemmed,lemmatized
0,"Reliance Industries vaccinates 98% of workers,...",Reliance Industries has said in a statement th...,business,Reliance Industries has said in a statement th...,relianc industri ha said in a statement that o...,Reliance Industries ha said in a statement tha...
1,"Musk criticises Apple's 'walled garden', cobal...",Tesla's billionaire CEO Elon Musk criticised A...,business,Tesla ' s billionaire CEO Elon Musk criticised...,tesla' billionair ceo elon musk criticis appl ...,Tesla's billionaire CEO Elon Musk criticised A...
2,I will most likely not be on future earnings c...,Tesla CEO and the world's second-richest perso...,business,Tesla CEO and the world ' s second-richest per...,tesla ceo and the world' second-richest person...,Tesla CEO and the world's second-richest perso...
3,Speculation around our plans for crypto not tr...,Amazon on Monday denied speculations that it w...,business,Amazon on Monday denied speculations that it w...,amazon on monday deni specul that it wa look t...,Amazon on Monday denied speculation that it wa...
4,Factually incorrect: INOX on report of Amazon ...,INOX Leisure denied a report that claimed Amaz...,business,INOX Leisure denied a report that claimed Amaz...,inox leisur deni a report that claim amazon in...,INOX Leisure denied a report that claimed Amaz...


In [40]:
#Using function on Codeup_df
Codeup_edited_df= dataframes(Codeup_df)

In [41]:
Codeup_edited_df.head()

Unnamed: 0,title,published_date,blog_image,original,clean,stemmed,lemmatized
0,Codeup’s Data Science Career Accelerator is Here!,"September 30, 2018",https://codeup.com/wp-content/uploads/2018/10/...,The rumors are true! The time has arrived. Cod...,The rumors are true ! The time has arrived. Co...,the rumor are true! the time ha arrived. codeu...,The rumor are true! The time ha arrived. Codeu...
1,Data Science Myths,"October 31, 2018",https://codeup.com/wp-content/uploads/2018/10/...,By Dimitri Antoniou and Maggie Giust\nData Sci...,By Dimitri Antoniou and Maggie Giust\nData Sci...,By dimitri antoni and maggi giust data science...,By Dimitri Antoniou and Maggie Giust Data Scie...
2,Data Science VS Data Analytics: What’s The Dif...,"October 17, 2018",https://codeup.com/wp-content/uploads/2018/10/...,"By Dimitri Antoniou\nA week ago, Codeup launch...","By Dimitri Antoniou\nA week ago , Codeup launc...","By dimitri antoni A week ago, codeup launch ou...","By Dimitri Antoniou A week ago, Codeup launche..."
3,10 Tips to Crush It at the SA Tech Job Fair,"August 14, 2018",,SA Tech Job Fair\nThe third bi-annual San Anto...,SA Tech Job Fair\nThe third bi-annual San Anto...,SA tech job fair the third bi-annu san antonio...,SA Tech Job Fair The third bi-annual San Anton...
4,Competitor Bootcamps Are Closing. Is the Model...,"August 14, 2018",,Competitor Bootcamps Are Closing. Is the Model...,Competitor Bootcamps Are Closing. Is the Model...,competitor bootcamp are closing. Is the model ...,Competitor Bootcamps Are Closing. Is the Model...


# 9. Ask yourself:
   - If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
        - answer: Due to size I would use lemmatized
        
        
   - If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
        - answer: Due to size I would use lemmatized
 
 
   - If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
        - answer: Due to size I would use stemmed