## Prepare Exercise: NLP
### Corey Solitaire
`11.16.2020`

Imports:

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire

# Fetch the stopword corpus used later by remove_stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/coreysolitaire/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Gameplan for Parsing Text Data:

    1. Convert text to all lower case for consistency.
    2. Remove accented and other non-ASCII characters.
    3. Remove special characters.
    4. Stem or lemmatize the words.
    5. Remove stopwords.
    6. Store the clean text and the original text for use in future notebooks.


#### Import Dataframe:

In [2]:
news = pd.read_json('news_articles.json')
original = news.content[0]
original

' In face of Joe Biden\'s win, \'Trump Nation\' is confused, angry — but resigned | KellyMike KellyNorthJersey.comThey’re confused. They’re angry. They’re resigned and wondering how America can\xa0heal its political wounds as a pandemic rages and the economy flounders.\xa0Such is the state of “Trump Nation” after Donald Trump.Since the spring of 2017 — just 100 days after Trump stepped into the White House — I’ve chronicled, along with photojournalist Chris Pedota, the hopes, fears and ordinary lives of Trump voters in a series of communities stretching from New Jersey to Ohio.\xa0Our goal was simple — to listen to those who voted for Trump. Why did they look to an untested reality TV star, billionaire and failed casino mogul — who proclaimed himself a master dealmaker — to be their political savior?I’ve walked with a dejected high school football coach along the rusty train line tracks that run by an empty steel mill on the Ohio River. I’ve stood next to mounds of bituminous coal in t

## Exercises:

### 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

    - Lowercase everything
    - Normalize unicode characters
    - Replace anything that is not a letter, number, whitespace or a single quote.

In [3]:
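def basic_clean(text):
    # Lowercase, then strip accents and any other non-ASCII characters
    text = text.lower()
    text = unicodedata.normalize('NFKD', text)\
        .encode('ascii', 'ignore')\
        .decode('utf-8', 'ignore')
    # Drop everything except letters, digits, single quotes, and whitespace
    return re.sub(r"[^a-z0-9'\s]", '', text)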
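
To see what each step of `basic_clean` contributes, the following minimal sketch runs the same three operations one at a time on a short made-up sentence. The sample string is an assumption for illustration, not text from the article, and the sketch uses only the standard library.

```python
import re
import unicodedata

# Hypothetical sample: accented letters, a curly apostrophe, and a non-breaking space
sample = "They’re résumé-ready and can\xa0heal!"

lowered = sample.lower()

# NFKD splits 'é' into 'e' plus a combining accent and turns \xa0 into a plain space;
# encoding to ASCII with errors='ignore' then drops the accent and the curly apostrophe
ascii_only = (
    unicodedata.normalize('NFKD', lowered)
    .encode('ascii', 'ignore')
    .decode('utf-8')
)

# Keep only lowercase letters, digits, single quotes, and whitespace
cleaned = re.sub(r"[^a-z0-9'\s]", '', ascii_only)

print(ascii_only)  # theyre resume-ready and can heal!
print(cleaned)     # theyre resumeready and can heal
```

Because the curly apostrophe is not ASCII, it is dropped rather than converted. That is why the cleaned article contains `theyre` and `ive`, while the straight apostrophe in `biden's` survives.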

In [4]:
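# Note: this assignment rebinds the name basic_clean from the function to the
# cleaned string, so basic_clean() cannot be called again without re-running In [3].
basic_clean = basic_clean(original)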
basic_clean

" in face of joe biden's win 'trump nation' is confused angry  but resigned  kellymike kellynorthjerseycomtheyre confused theyre angry theyre resigned and wondering how america can heal its political wounds as a pandemic rages and the economy flounders such is the state of trump nation after donald trumpsince the spring of 2017  just 100 days after trump stepped into the white house  ive chronicled along with photojournalist chris pedota the hopes fears and ordinary lives of trump voters in a series of communities stretching from new jersey to ohio our goal was simple  to listen to those who voted for trump why did they look to an untested reality tv star billionaire and failed casino mogul  who proclaimed himself a master dealmaker  to be their political saviorive walked with a dejected high school football coach along the rusty train line tracks that run by an empty steel mill on the ohio river ive stood next to mounds of bituminous coal in the bucolic hills of southwest pennsylvania

### 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [5]:
def tokenize(text):
    tokenizer = ToktokTokenizer()
    # return_str=True gives back one space-separated string instead of a list of tokens
    return tokenizer.tokenize(text, return_str=True)
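
As a small illustration of the `return_str` flag, this sketch tokenizes a phrase from the cleaned article both ways. It assumes NLTK is installed, and the variable names are illustrative only.

```python
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()
phrase = "joe biden's win"

# Default behavior: a list of tokens
as_list = tokenizer.tokenize(phrase)

# return_str=True: the same tokens joined by single spaces
as_string = tokenizer.tokenize(phrase, return_str=True)

print(as_list)
print(as_string)  # joe biden ' s win
```

Keeping the result as a string lets the later steps call `.split()` on it, at the cost of splitting again in every function.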

In [6]:
token = tokenize(basic_clean)
token

"in face of joe biden ' s win ' trump nation ' is confused angry but resigned kellymike kellynorthjerseycomtheyre confused theyre angry theyre resigned and wondering how america can heal its political wounds as a pandemic rages and the economy flounders such is the state of trump nation after donald trumpsince the spring of 2017 just 100 days after trump stepped into the white house ive chronicled along with photojournalist chris pedota the hopes fears and ordinary lives of trump voters in a series of communities stretching from new jersey to ohio our goal was simple to listen to those who voted for trump why did they look to an untested reality tv star billionaire and failed casino mogul who proclaimed himself a master dealmaker to be their political saviorive walked with a dejected high school football coach along the rusty train line tracks that run by an empty steel mill on the ohio river ive stood next to mounds of bituminous coal in the bucolic hills of southwest pennsylvania as 

### 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [7]:
def stem(text):
    ps = nltk.porter.PorterStemmer()
    # Stem each whitespace-separated word, then rebuild a single string
    stems = [ps.stem(word) for word in text.split()]
    return ' '.join(stems)
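
Stems are not always dictionary words. The sketch below, which assumes NLTK is installed, stems a few words taken from the article so the effect is easy to see in isolation.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Words that appear in the article text
words = ['confused', 'angry', 'was', 'economy']
stems = [ps.stem(word) for word in words]

for word, stemmed in zip(words, stems):
    print(f'{word} -> {stemmed}')
# confused -> confus
# angry -> angri
# was -> wa
# economy -> economi
```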

In [8]:
article_stemmed = stem(token)
article_stemmed

"in face of joe biden ' s win ' trump nation ' is confus angri but resign kellymik kellynorthjerseycomtheyr confus theyr angri theyr resign and wonder how america can heal it polit wound as a pandem rage and the economi flounder such is the state of trump nation after donald trumpsinc the spring of 2017 just 100 day after trump step into the white hous ive chronicl along with photojournalist chri pedota the hope fear and ordinari live of trump voter in a seri of commun stretch from new jersey to ohio our goal wa simpl to listen to those who vote for trump whi did they look to an untest realiti tv star billionair and fail casino mogul who proclaim himself a master dealmak to be their polit savior walk with a deject high school footbal coach along the rusti train line track that run by an empti steel mill on the ohio river ive stood next to mound of bitumin coal in the bucol hill of southwest pennsylvania as thickmuscl miner wonder about their futur ive sat with an opioid addict in a poo

#### Sanity Check:

In [9]:
ps = nltk.porter.PorterStemmer()
stems = [ps.stem(word) for word in token.split()]
pd.Series(stems).value_counts().head(10)

the      93
of       58
trump    56
a        50
in       37
and      34
that     31
wa       29
to       29
he       28
dtype: int64

### 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [10]:
def lemmatize(text):
    # Requires the WordNet corpus: nltk.download('wordnet')
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in text.split()]
    return ' '.join(lemmas)

In [11]:
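# This lemmatizes the already-stemmed text; lemmatize(token) would work on the unstemmed words instead
article_lemmatized = lemmatize(article_stemmed)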
article_lemmatized

"in face of joe biden ' s win ' trump nation ' is confus angri but resign kellymik kellynorthjerseycomtheyr confus theyr angri theyr resign and wonder how america can heal it polit wound a a pandem rage and the economi flounder such is the state of trump nation after donald trumpsinc the spring of 2017 just 100 day after trump step into the white hous ive chronicl along with photojournalist chri pedota the hope fear and ordinari live of trump voter in a seri of commun stretch from new jersey to ohio our goal wa simpl to listen to those who vote for trump whi did they look to an untest realiti tv star billionair and fail casino mogul who proclaim himself a master dealmak to be their polit savior walk with a deject high school footbal coach along the rusti train line track that run by an empti steel mill on the ohio river ive stood next to mound of bitumin coal in the bucol hill of southwest pennsylvania a thickmuscl miner wonder about their futur ive sat with an opioid addict in a poor 

#### Sanity Check:

In [12]:
wnl = nltk.stem.WordNetLemmatizer()
lemmas = [wnl.lemmatize(word) for word in article_stemmed.split()]
pd.Series(lemmas).value_counts()[:10]

the      93
of       58
a        57
trump    56
in       37
and      34
that     31
to       29
wa       29
he       28
dtype: int64

### 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.
    - This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [13]:
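def remove_stopwords(text, extra_words=None, exclude_words=None):
    # Start from NLTK's English stopwords, then apply the optional adjustments
    stopword_list = set(stopwords.words('english'))
    stopword_list |= set(extra_words or [])    # additional words to remove
    stopword_list -= set(exclude_words or [])  # words to keep, e.g. 'no' or 'not'
    words = text.split()
    filtered_words = [w for w in words if w not in stopword_list]

    print('Removed {} stopwords'.format(len(words) - len(filtered_words)))
    print('---')

    article_without_stopwords = ' '.join(filtered_words)

    return article_without_stopwords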
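
The interaction between `extra_words` and `exclude_words` can be sketched with plain set arithmetic. The helper `filter_stopwords` and the tiny `BASE_STOPWORDS` set are hypothetical stand-ins so that the example runs without the NLTK corpus; they are not part of the notebook.

```python
# Tiny stand-in for stopwords.words('english'); an assumption for illustration only
BASE_STOPWORDS = {'the', 'of', 'a', 'in', 'and', 'to', 'not'}

def filter_stopwords(text, extra_words=None, exclude_words=None):
    """Hypothetical helper: drop stopwords, with optional additions and exceptions."""
    stopword_set = set(BASE_STOPWORDS)
    stopword_set |= set(extra_words or [])    # also remove these words
    stopword_set -= set(exclude_words or [])  # but never remove these words
    return ' '.join(w for w in text.split() if w not in stopword_set)

text = 'the miner did not vote in the election'

default_result = filter_stopwords(text)
custom_result = filter_stopwords(text, extra_words=['did'], exclude_words=['not'])

print(default_result)  # miner did vote election
print(custom_result)   # miner not vote election
```

Keeping negations such as `not` through `exclude_words` can matter for later sentiment work, since removing them changes the meaning of a sentence.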

In [14]:
article_without_stopwords = remove_stopwords(article_lemmatized)
article_without_stopwords

Removed 688 stopwords
---


"face joe biden ' win ' trump nation ' confus angri resign kellymik kellynorthjerseycomtheyr confus theyr angri theyr resign wonder america heal polit wound pandem rage economi flounder state trump nation donald trumpsinc spring 2017 100 day trump step white hous ive chronicl along photojournalist chri pedota hope fear ordinari live trump voter seri commun stretch new jersey ohio goal wa simpl listen vote trump whi look untest realiti tv star billionair fail casino mogul proclaim master dealmak polit savior walk deject high school footbal coach along rusti train line track run empti steel mill ohio river ive stood next mound bitumin coal bucol hill southwest pennsylvania thickmuscl miner wonder futur ive sat opioid addict poor hamlet west virginia clockmak vermont widow westchest oppos abort ive stood windswept raini beach real estat agent jersey shore barrier island like swamp come decad rise ocean levelsal ardent trump support variou stage shock tri make sen trump defeat nov 3 electi

### 6. Define a function named prep_article that takes in the dictionary representing an article and returns a dictionary that looks like this:

### 7. Define a function named prepare_article_data that takes in the list of articles dictionaries, applies the prep_article function to each one, and returns the transformed data.
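
The expected dictionary layout is not shown above, so the following is only a rough sketch of how exercises 6 and 7 might fit together. The key names `original` and `clean` are assumptions based on step 6 of the gameplan, `content` matches the column read earlier, and `simple_clean` is a hypothetical standard-library stand-in for the cleaning functions defined in this notebook.

```python
import re
import unicodedata

def simple_clean(text):
    """Hypothetical stand-in for the cleaning steps above (stdlib only)."""
    text = unicodedata.normalize('NFKD', text.lower())
    text = text.encode('ascii', 'ignore').decode('utf-8')
    return re.sub(r"[^a-z0-9'\s]", '', text)

def prep_article(article):
    # 'original' and 'clean' are assumed key names, not the exercise's specification
    return {
        'original': article['content'],
        'clean': simple_clean(article['content']),
    }

def prepare_article_data(articles):
    # Apply prep_article to every article dictionary in the list
    return [prep_article(article) for article in articles]

# Made-up articles for illustration
articles = [{'content': 'They’re Confused.'}, {'content': 'Trump Nation!'}]
prepared = prepare_article_data(articles)
print(prepared)
```

In the notebook itself, `prep_article` would presumably also store stemmed, lemmatized, and stopword-free versions by calling the functions from exercises 3 to 5.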