## Prepare Exercise: NLP
### Corey Solitaire
`11.16.2020`

Imports:

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/coreysolitaire/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Gameplan for Parsing Text Data:

    1. Convert text to all lower case for consistency.
    2. Normalize accented characters and remove non-ASCII characters.
    3. Remove special characters.
    4. Stem or lemmatize the words.
    5. Remove stopwords.
    6. Store the clean text and the original text for use in future notebooks.


#### Import Dataframe:

In [2]:
news = pd.read_json('news_articles.json')
news.head()

|   | title | author | category | content |
|---|---|---|---|---|
| 0 | In face of Joe Biden's win, 'Trump Nation' is ... | Mike Kelly | News | In face of Joe Biden's win, 'Trump Nation' is... |
| 1 | President-elect Joe Biden's hometown of Wilmin... | Christopher Maag | News | President-elect Joe Biden's hometown of Wilmi... |
| 2 | Vermont's only pharmacy school is shutting dow... | NEWS | News | NEWSVermont's only pharmacy school is shuttin... |
| 3 | Ho-Hum Motel on Williston Road to become site ... | NEWS | News | NEWSHo-Hum Motel on Williston Road to become ... |
| 4 | Church Street adjusts Black Friday shopping, h... | NEWS | News | NEWSChurch Street adjusts Black Friday shoppi... |


In [3]:
original = news.content[0]
original

' In face of Joe Biden\'s win, \'Trump Nation\' is confused, angry — but resigned | KellyMike KellyNorthJersey.comThey’re confused. They’re angry. They’re resigned and wondering how America can\xa0heal its political wounds as a pandemic rages and the economy flounders.\xa0Such is the state of “Trump Nation” after Donald Trump.Since the spring of 2017 — just 100 days after Trump stepped into the White House — I’ve chronicled, along with photojournalist Chris Pedota, the hopes, fears and ordinary lives of Trump voters in a series of communities stretching from New Jersey to Ohio.\xa0Our goal was simple — to listen to those who voted for Trump. Why did they look to an untested reality TV star, billionaire and failed casino mogul — who proclaimed himself a master dealmaker — to be their political savior?I’ve walked with a dejected high school football coach along the rusty train line tracks that run by an empty steel mill on the Ohio River. I’ve stood next to mounds of bituminous coal in t

## Exercises:

### 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

    - Lowercase everything
    - Normalize unicode characters
    - Replace anything that is not a letter, number, whitespace or a single quote.
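
A small standard-library sketch of why the normalization form matters for the second bullet. The sample string and the helper name `ascii_fold` are made up for illustration and are not part of the exercise. NFKD decomposes an accented letter into a base letter plus a combining mark, so the ASCII step keeps the base letter; NFKC keeps the letter composed, so the ASCII step drops it entirely.

```python
import re
import unicodedata

def ascii_fold(text, form):
    # Normalize with the given form, then drop anything that is not ASCII.
    normalized = unicodedata.normalize(form, text)
    return normalized.encode('ascii', 'ignore').decode('ascii')

# Made-up sample: "Café déjà vu, it's 100% naïve!" written with escapes.
sample = "Caf\u00e9 d\u00e9j\u00e0 vu, it's 100% na\u00efve!"

nfkd = ascii_fold(sample.lower(), 'NFKD')
nfkc = ascii_fold(sample.lower(), 'NFKC')

# Keep only letters, digits, whitespace, and single quotes.
cleaned = re.sub(r"[^a-z0-9'\s]", '', nfkd)

print(nfkd)     # cafe deja vu, it's 100% naive!
print(nfkc)     # caf dj vu, it's 100% nave!
print(cleaned)  # cafe deja vu it's 100 naive
```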

In [4]:
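def basic_clean(string):
    # Lowercase, then decompose accented characters and drop non-ASCII characters.
    string = string.lower()
    string = unicodedata.normalize('NFKD', string)\
        .encode('ascii', 'ignore')\
        .decode('utf-8', 'ignore')
    # Remove anything that is not a letter, digit, single quote, or whitespace.
    cleaned = re.sub(r"[^a-z0-9'\s]", '', string)
    return cleaned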

In [5]:
basic_clean = basic_clean(original)
basic_clean

" in face of joe biden's win 'trump nation' is confused angry  but resigned  kellymike kellynorthjerseycomtheyre confused theyre angry theyre resigned and wondering how america can heal its political wounds as a pandemic rages and the economy flounders such is the state of trump nation after donald trumpsince the spring of 2017  just 100 days after trump stepped into the white house  ive chronicled along with photojournalist chris pedota the hopes fears and ordinary lives of trump voters in a series of communities stretching from new jersey to ohio our goal was simple  to listen to those who voted for trump why did they look to an untested reality tv star billionaire and failed casino mogul  who proclaimed himself a master dealmaker  to be their political saviorive walked with a dejected high school football coach along the rusty train line tracks that run by an empty steel mill on the ohio river ive stood next to mounds of bituminous coal in the bucolic hills of southwest pennsylvania

### 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.
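
To give a feel for what tokenizing does to the cleaned text, here is a rough regex stand-in. It is not the rule set that `ToktokTokenizer` uses, and `simple_tokenize` is a hypothetical name; it only illustrates the idea of separating punctuation from words and collapsing extra whitespace.

```python
import re

def simple_tokenize(text):
    # Runs of word characters become tokens; every other non-space
    # character (such as a single quote) becomes its own token.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Return a single string with tokens separated by one space.
    return ' '.join(tokens)

print(simple_tokenize(" in face of joe biden's win  'trump nation'"))
# in face of joe biden ' s win ' trump nation '
```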

In [6]:
def tokenize(df):
    tokenizer = nltk.tokenize.ToktokTokenizer()
    return tokenizer.tokenize(df, return_str=True)

In [7]:
token = tokenize(basic_clean)
token

"in face of joe biden ' s win ' trump nation ' is confused angry but resigned kellymike kellynorthjerseycomtheyre confused theyre angry theyre resigned and wondering how america can heal its political wounds as a pandemic rages and the economy flounders such is the state of trump nation after donald trumpsince the spring of 2017 just 100 days after trump stepped into the white house ive chronicled along with photojournalist chris pedota the hopes fears and ordinary lives of trump voters in a series of communities stretching from new jersey to ohio our goal was simple to listen to those who voted for trump why did they look to an untested reality tv star billionaire and failed casino mogul who proclaimed himself a master dealmaker to be their political saviorive walked with a dejected high school football coach along the rusty train line tracks that run by an empty steel mill on the ohio river ive stood next to mounds of bituminous coal in the bucolic hills of southwest pennsylvania as 

### 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [8]:
def stem(df):
    ps = nltk.porter.PorterStemmer()
    stems = [ps.stem(word) for word in df.split()]
    article_stemmed = ' '.join(stems)
    return article_stemmed

In [9]:
article_stemmed = stem(token)
article_stemmed

"in face of joe biden ' s win ' trump nation ' is confus angri but resign kellymik kellynorthjerseycomtheyr confus theyr angri theyr resign and wonder how america can heal it polit wound as a pandem rage and the economi flounder such is the state of trump nation after donald trumpsinc the spring of 2017 just 100 day after trump step into the white hous ive chronicl along with photojournalist chri pedota the hope fear and ordinari live of trump voter in a seri of commun stretch from new jersey to ohio our goal wa simpl to listen to those who vote for trump whi did they look to an untest realiti tv star billionair and fail casino mogul who proclaim himself a master dealmak to be their polit savior walk with a deject high school footbal coach along the rusti train line track that run by an empti steel mill on the ohio river ive stood next to mound of bitumin coal in the bucol hill of southwest pennsylvania as thickmuscl miner wonder about their futur ive sat with an opioid addict in a poo

#### Sanity Check:

In [10]:
ps = nltk.porter.PorterStemmer()
stems = [ps.stem(word) for word in token.split()]
pd.Series(stems).value_counts().head(10)

the      93
of       58
trump    56
a        50
in       37
and      34
that     31
to       29
wa       29
he       28
dtype: int64

### 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [11]:
def lemmatize(df):
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in df.split()]
    article_lemmatized = ' '.join(lemmas)
    return article_lemmatized

In [12]:
article_lemmatized = lemmatize(token)
article_lemmatized

"in face of joe biden ' s win ' trump nation ' is confused angry but resigned kellymike kellynorthjerseycomtheyre confused theyre angry theyre resigned and wondering how america can heal it political wound a a pandemic rage and the economy flounder such is the state of trump nation after donald trumpsince the spring of 2017 just 100 day after trump stepped into the white house ive chronicled along with photojournalist chris pedota the hope fear and ordinary life of trump voter in a series of community stretching from new jersey to ohio our goal wa simple to listen to those who voted for trump why did they look to an untested reality tv star billionaire and failed casino mogul who proclaimed himself a master dealmaker to be their political saviorive walked with a dejected high school football coach along the rusty train line track that run by an empty steel mill on the ohio river ive stood next to mound of bituminous coal in the bucolic hill of southwest pennsylvania a thickmuscled mine

#### Sanity Check:

In [13]:
wnl = nltk.stem.WordNetLemmatizer()
lemmas = [wnl.lemmatize(word) for word in token.split()]
pd.Series(lemmas).value_counts()[:10]

the      93
of       58
a        57
trump    56
in       37
and      34
that     31
wa       29
to       29
he       28
dtype: int64

### 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.
    - This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.
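
One way to read the two optional parameters: the effective stopword set is the base list plus `extra_words`, minus `exclude_words`. The sketch below uses a tiny made-up base list instead of the NLTK corpus so it runs on its own; the function name and the `base_stopwords` parameter are assumptions for illustration.

```python
def remove_stopwords_sketch(text, extra_words=None, exclude_words=None,
                            base_stopwords=("the", "of", "a", "not", "is")):
    # Effective stopwords = base + extra - exclude.
    stopword_set = set(base_stopwords)
    stopword_set |= set(extra_words or [])
    stopword_set -= set(exclude_words or [])
    # Keep every word that is not in the effective stopword set.
    kept = [word for word in text.split() if word not in stopword_set]
    return ' '.join(kept)

text = "the state of trump nation is not a surprise"
print(remove_stopwords_sketch(text))
# state trump nation surprise
print(remove_stopwords_sketch(text, extra_words=["trump"]))
# state nation surprise
print(remove_stopwords_sketch(text, exclude_words=["not"]))
# state trump nation not surprise
```

Using `None` as the default avoids sharing one mutable list between calls.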

In [14]:
def remove_stopwords(df, extra_words=None, exclude_words=None):
    stopword_list = stopwords.words('english')
    # Add any extra stopwords, then drop the words we want to keep (e.g. 'no', 'not').
    stopword_list.extend(extra_words or [])
    stopword_list = [w for w in stopword_list if w not in (exclude_words or [])]
    words = df.split()
    filtered_words = [w for w in words if w not in stopword_list]

    print('Removed {} stopwords'.format(len(words) - len(filtered_words)))
    print('---')

    article_without_stopwords = ' '.join(filtered_words)

    return article_without_stopwords

In [15]:
article_without_stopwords = remove_stopwords(article_lemmatized)
article_without_stopwords

Removed 725 stopwords
---


"face joe biden ' win ' trump nation ' confused angry resigned kellymike kellynorthjerseycomtheyre confused theyre angry theyre resigned wondering america heal political wound pandemic rage economy flounder state trump nation donald trumpsince spring 2017 100 day trump stepped white house ive chronicled along photojournalist chris pedota hope fear ordinary life trump voter series community stretching new jersey ohio goal wa simple listen voted trump look untested reality tv star billionaire failed casino mogul proclaimed master dealmaker political saviorive walked dejected high school football coach along rusty train line track run empty steel mill ohio river ive stood next mound bituminous coal bucolic hill southwest pennsylvania thickmuscled miner wondered future ive sat opioid addict poor hamlet west virginia clockmaker vermont widow westchester opposes abortion ive stood windswept rainy beach real estate agent jersey shore barrier island likely swamped coming decade rising ocean le

### 6. Define a function named prep_article that takes in the dictionary representing an article and returns a dictionary that looks like this:   
   
{   
    'title': 'the original title',   
    'original': original,   
    'stemmed': article_stemmed,   
    'lemmatized': article_lemmatized,   
    'clean': article_without_stopwords   
}   
   
`Note that if the original dictionary has a title property, it should remain unchanged (same goes for the category property).`
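
A dictionary-in, dictionary-out sketch of the shape described above. It assumes the article text lives under a `'content'` key (as in the `news` dataframe) and takes the cleaning steps as arguments, so the toy stand-ins in the usage example can be swapped for the real functions. The name `prep_article_sketch` and the toy lambdas are hypothetical.

```python
def prep_article_sketch(article, clean, stem, lemmatize, remove_stopwords):
    # Copy so the caller's dictionary (title, category, ...) is left untouched.
    prepped = dict(article)
    original = prepped.pop('content')  # assumed key for the article text
    cleaned = clean(original)
    lemmatized = lemmatize(cleaned)
    prepped['original'] = original
    prepped['stemmed'] = stem(cleaned)
    prepped['lemmatized'] = lemmatized
    prepped['clean'] = remove_stopwords(lemmatized)
    return prepped

article = {'title': 'A Title', 'category': 'News',
           'content': 'The Cats Were Running'}

result = prep_article_sketch(
    article,
    clean=str.lower,
    # Toy stand-ins only; these are not a real stemmer or lemmatizer.
    stem=lambda s: ' '.join(w.rstrip('s') for w in s.split()),
    lemmatize=lambda s: s,
    remove_stopwords=lambda s: ' '.join(w for w in s.split() if w != 'the'),
)
print(result)
```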

In [16]:
def basic_clean(string):
    '''
    This function takes in a string and
    returns the string normalized.
    '''
    # NFKD decomposes accented letters so the ASCII step keeps the base letter.
    string = unicodedata.normalize('NFKD', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')
    string = re.sub(r'[^\w\s]', '', string).lower()
    return string

In [17]:
def tokenize(string):
    '''
    This function takes in a string and
    returns a tokenized string.
    '''
    # Create tokenizer.
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    # Use tokenizer
    string = tokenizer.tokenize(string, return_str=True)
    
    return string

In [18]:
def stem(string):
    '''
    This function takes in a string and
    returns a string with words stemmed.
    '''
    # Create porter stemmer.
    ps = nltk.porter.PorterStemmer()
    
    # Use the stemmer to stem each word in the list of words we created by using split.
    stems = [ps.stem(word) for word in string.split()]
    
    # Join our lists of words into a string again and assign to a variable.
    string = ' '.join(stems)
    
    return string


In [19]:
def lemmatize(string):
    '''
    This function takes in a string and
    returns a string with words lemmatized.
    '''
    # Create the lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Use the lemmatizer on each word in the list of words we created by using split.
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # Join our list of words into a string again and assign to a variable.
    string = ' '.join(lemmas)
    
    return string

In [20]:
def remove_stopwords(string, extra_words=None, exclude_words=None):
    '''
    This function takes in a string plus optional extra_words (additional
    stopwords to remove) and exclude_words (stopwords to keep), and returns
    the string with stopwords removed.
    '''
    # Create stopword_list.
    stopword_list = stopwords.words('english')
    
    # Add extra_words to the stopword list.
    stopword_list.extend(extra_words or [])
    
    # Drop exclude_words from the stopword list so they stay in the text.
    stopword_list = [word for word in stopword_list if word not in (exclude_words or [])]
    
    # Split words in string.
    words = string.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords

In [21]:
news['content'].apply(basic_clean)\
             .apply(tokenize)\
             .apply(remove_stopwords)\
             .apply(lemmatize)

0    face joe bidens win trump nation confused angr...
1    presidentelect joe bidens hometown wilmington ...
2    newsvermonts pharmacy school shutting junedan ...
3    newshohum motel williston road become site pan...
4    newschurch street adjusts black friday shoppin...
Name: content, dtype: object

### 7. Define a function named prepare_article_data that takes in the list of article dictionaries, applies the prep_article function to each one, and returns the transformed data.
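
Read literally, this exercise is a map over a list of dictionaries. A minimal sketch, assuming a per-article function is passed in; `toy_prep` is a hypothetical stand-in for `prep_article` so the block runs on its own.

```python
def prepare_article_data_sketch(articles, prep_fn):
    # Apply the per-article function to every dictionary and collect the results.
    return [prep_fn(article) for article in articles]

def toy_prep(article):
    # Hypothetical stand-in for prep_article: only lowercases the content.
    return {'title': article['title'],
            'original': article['content'],
            'clean': article['content'].lower()}

articles = [
    {'title': 'First', 'content': 'Hello World'},
    {'title': 'Second', 'content': 'Goodbye World'},
]
prepared = prepare_article_data_sketch(articles, toy_prep)
print(prepared)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` if a dataframe is more convenient for later notebooks.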

In [22]:
# import pandas as pd
# import numpy as np

# import os
# import unicodedata
# import re
# import json

# import nltk
# from nltk.tokenize.toktok import ToktokTokenizer
# from nltk.corpus import stopwords

In [23]:
def basic_clean(string):
    '''
    This function takes in a string and
    returns the string normalized.
    '''
    # NFKD decomposes accented letters so the ASCII step keeps the base letter.
    string = unicodedata.normalize('NFKD', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')
    string = re.sub(r'[^\w\s]', '', string).lower()
    return string

##############################

def tokenize(string):
    '''
    This function takes in a string and
    returns a tokenized string.
    '''
    # Create tokenizer.
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    # Use tokenizer
    string = tokenizer.tokenize(string, return_str=True)
    
    return string

#############################

def stem(string):
    '''
    This function takes in a string and
    returns a string with words stemmed.
    '''
    # Create porter stemmer.
    ps = nltk.porter.PorterStemmer()
    
    # Use the stemmer to stem each word in the list of words we created by using split.
    stems = [ps.stem(word) for word in string.split()]
    
    # Join our lists of words into a string again and assign to a variable.
    string = ' '.join(stems)
    
    return string

#############################


def lemmatize(string):
    '''
    This function takes in a string and
    returns a string with words lemmatized.
    '''
    # Create the lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Use the lemmatizer on each word in the list of words we created by using split.
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # Join our list of words into a string again and assign to a variable.
    string = ' '.join(lemmas)
    
    return string

#############################


def remove_stopwords(string, extra_words=None, exclude_words=None):
    '''
    This function takes in a string plus optional extra_words (additional
    stopwords to remove) and exclude_words (stopwords to keep), and returns
    the string with stopwords removed.
    '''
    # Create stopword_list.
    stopword_list = stopwords.words('english')
    
    # Add extra_words to the stopword list.
    stopword_list.extend(extra_words or [])
    
    # Drop exclude_words from the stopword list so they stay in the text.
    stopword_list = [word for word in stopword_list if word not in (exclude_words or [])]
    
    # Split words in string.
    words = string.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords

###############################


def prep_article_data(df, column):
    '''
    This function takes in a df and the string name for a text column and
    returns a df with the article title, original text, stemmed text,
    lemmatized text, and clean text (cleaned, tokenized, stopwords removed,
    and lemmatized).
    '''
    df['clean'] = df[column].apply(basic_clean)\
                 .apply(tokenize)\
                 .apply(remove_stopwords)\
                 .apply(lemmatize)
    df['stemmed'] = df[column].apply(basic_clean).apply(stem)
    df['lemmatized'] = df[column].apply(basic_clean).apply(lemmatize)
    
    return df[['title', 'content', 'stemmed', 'lemmatized', 'clean']]

In [24]:
df = prep_article_data(news, 'content')
df

|   | title | content | stemmed | lemmatized | clean |
|---|---|---|---|---|---|
| 0 | In face of Joe Biden's win, 'Trump Nation' is ... | In face of Joe Biden's win, 'Trump Nation' is... | in face of joe biden win trump nation is confu... | in face of joe bidens win trump nation is conf... | face joe bidens win trump nation confused angr... |
| 1 | President-elect Joe Biden's hometown of Wilmin... | President-elect Joe Biden's hometown of Wilmi... | presidentelect joe biden hometown of wilmingto... | presidentelect joe bidens hometown of wilmingt... | presidentelect joe bidens hometown wilmington ... |
| 2 | Vermont's only pharmacy school is shutting dow... | NEWSVermont's only pharmacy school is shuttin... | newsvermont onli pharmaci school is shut down ... | newsvermonts only pharmacy school is shutting ... | newsvermonts pharmacy school shutting junedan ... |
| 3 | Ho-Hum Motel on Williston Road to become site ... | NEWSHo-Hum Motel on Williston Road to become ... | newshohum motel on williston road to becom sit... | newshohum motel on williston road to become si... | newshohum motel williston road become site pan... |
| 4 | Church Street adjusts Black Friday shopping, h... | NEWSChurch Street adjusts Black Friday shoppi... | newschurch street adjust black friday shop hol... | newschurch street adjusts black friday shoppin... | newschurch street adjusts black friday shoppin... |
