# Exercises
The end result of this exercise should be a file named `prepare.py` that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

In [1]:
import pandas as pd
import numpy as np
import unicodedata
import re
import nltk
from nltk.corpus import stopwords

### 1. Define a function named `basic_clean`. It should take in a string and apply some basic text cleaning to it:

In [2]:
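def basic_clean(string):
    # Lowercase everything
    clean = string.lower()
    # Normalize unicode, then drop any characters with no ASCII equivalent
    clean = unicodedata.normalize('NFKD', clean)\
        .encode('ascii', 'ignore')\
        .decode('utf-8')
    # Remove anything that is not a letter, number, whitespace, or single quote
    clean = re.sub(r'[^a-z0-9\'\s]', '', clean)

    return clean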

In [3]:
test_string = 'This is a testing string. There\'s a ; numBER "of" THings in > here that should be % Removed'

In [4]:
clean_string = basic_clean(test_string)
clean_string

"this is a testing string there's a  number of things in  here that should be  removed"

#### - Lowercase everything

#### - Normalize unicode characters

#### - Remove anything that is not a letter, number, whitespace, or a single quote.
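
The three steps listed above can be traced one at a time. The following is a minimal sketch that uses only the standard library; the sample sentence is made up for illustration and is not part of the exercise data.

```python
import re
import unicodedata

# Made-up sample text with accents, capitals, and punctuation
text = "Café DÉJÀ vu: 100% naïve, isn't it?"

# Step 1: lowercase everything
lowered = text.lower()

# Step 2: NFKD splits accented letters into a base letter plus a combining mark;
# encoding to ASCII with errors ignored then drops the marks
normalized = unicodedata.normalize('NFKD', lowered).encode('ascii', 'ignore').decode('utf-8')

# Step 3: remove anything that is not a letter, number, whitespace, or single quote
stripped = re.sub(r"[^a-z0-9'\s]", '', normalized)

print(lowered)     # café déjà vu: 100% naïve, isn't it?
print(normalized)  # cafe deja vu: 100% naive, isn't it?
print(stripped)    # cafe deja vu 100 naive isn't it
```

Note that the single quote survives so that contractions such as `isn't` stay intact until tokenization.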

### 2. Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

In [5]:
def tokenize(string):
    toke = nltk.tokenize.ToktokTokenizer()
    toked = toke.tokenize(string, return_str=True)
    return toked

In [6]:
token_string = tokenize(clean_string)
token_string

"this is a testing string there ' s a number of things in here that should be removed"

### 3. Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

In [7]:
def stem(string):
    # Create the Porter stemmer
    ps = nltk.stem.PorterStemmer()
    # Stem each whitespace-separated word, then rejoin into one string
    stemmed = [ps.stem(word) for word in string.split()]
    stemmed = ' '.join(stemmed)
    return stemmed

In [8]:
stemmed_string = stem(token_string)
stemmed_string

"thi is a test string there ' s a number of thing in here that should be remov"

### 4. Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

In [9]:
def lemmatize(string):
    #create the lemmatizer
    wnl = nltk.stem.WordNetLemmatizer()
    
    lemman = [wnl.lemmatize(word) for word in string.split()]
    
    output_lemman = ' '.join(lemman)
    
    return output_lemman

In [10]:
lemmed_string = lemmatize(token_string)
lemmed_string

"this is a testing string there ' s a number of thing in here that should be removed"

### 5. Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, `extra_words` and `exclude_words`. These parameters should define any additional stop words to include, and any words that we *don't* want to remove.

In [11]:
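def remove_stopwords(string, extra_words=None, exclude_words=None):
    # Start from NLTK's English stop words plus the stray apostrophe left by tokenizing
    stopper = set(stopwords.words('english')) | {"'"}
    # Add any extra stop words, then drop the words we want to keep
    stopper = (stopper | set(extra_words or [])) - set(exclude_words or [])
    my_words = string.split()
    dont_stop = [word for word in my_words if word not in stopper]
    unstopped = ' '.join(dont_stop)
    return unstopped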

In [12]:
does_this_work = remove_stopwords(lemmed_string)
does_this_work

'testing string number thing removed'

In [13]:
what_about_this = remove_stopwords(stemmed_string)
what_about_this

'thi test string number thing remov'
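
The optional `extra_words` and `exclude_words` parameters described in the exercise can be thought of as set operations on the stop word list. The following is a minimal sketch of that idea; `remove_stopwords_sketch` and `demo_stops` are hypothetical names, and the tiny stop word list is a stand-in for `stopwords.words('english')` so that the example runs without NLTK data.

```python
def remove_stopwords_sketch(text, stop_words, extra_words=None, exclude_words=None):
    # Union in the extra stop words, then subtract the words we want to keep
    stops = (set(stop_words) | set(extra_words or [])) - set(exclude_words or [])
    return ' '.join(word for word in text.split() if word not in stops)

# Tiny stand-in list (an assumption for illustration, not NLTK's real list)
demo_stops = ['this', 'is', 'a', 'of', 'in', 'that', 'should', 'be']
text = 'this is a testing string that should be removed'

default_result = remove_stopwords_sketch(text, demo_stops)
extra_result = remove_stopwords_sketch(text, demo_stops, extra_words=['testing'])
exclude_result = remove_stopwords_sketch(text, demo_stops, exclude_words=['this'])

print(default_result)  # testing string removed
print(extra_result)    # string removed
print(exclude_result)  # this testing string removed
```

Defaulting both parameters to `None` rather than to an empty list avoids sharing one mutable default between calls.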

### 6. Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe `news_df`.

In [14]:
import aquire

In [15]:
news_df = pd.DataFrame(aquire.get_all_shorts())

In [16]:
news_df.head(2)

| | title | category | body |
|---|---|---|---|
| 0 | Bharti Airtel rakes in 61% profit | india | Bharti Airtel, India's top telecommunications ... |
| 1 | Infosys Gifts Sikka Shares Worth Rs 8.2cr | india | In a regulatory filing to the BSE on Friday, I... |


### 7. Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.

In [17]:
codeup_df = pd.DataFrame(aquire.get_blog_articles())

In [18]:
codeup_df.head(2)

| | title | content |
|---|---|---|
| 0 | Diversity Equity and Inclusion Report | Codeup is excited to launch our first Diversit... |
| 1 | Codeup Honored as SABJ Diversity and Inclusion... | Codeup has been named the 2022 Diversity and I... |


### 8. For each dataframe, produce the following columns:

#### - `title` to hold the title

Both dataframes already have a `title` column, so nothing needs to be done here.

#### - `original` to hold the original article/post content

In [19]:
def make_original(dataframe):
    # The news articles store their text in 'body'; the blog posts use 'content'
    for column in ('body', 'content'):
        if column in dataframe.columns:
            dataframe['original'] = dataframe[column].copy()
    return dataframe

In [20]:
codeup_df = make_original(codeup_df)

In [21]:
news_df = make_original(news_df)

#### - `clean` to hold the normalized and tokenized original with the stopwords removed.

In [22]:
codeup_df['content'].apply(basic_clean).apply(tokenize).apply(remove_stopwords)

0    codeup excited launch first diversity equity i...
1    codeup named 2022 diversity inclusion award wi...
2    deciding transition tech career big step signi...
3    codeup strongly values diversity inclusion hon...
4    many companies switching cloud services implem...
5    codeups chief operating officer stephen notebo...
Name: content, dtype: object

In [23]:
def make_clean(dataframe):
    dataframe['clean'] = dataframe['original'].apply(basic_clean).apply(tokenize).apply(remove_stopwords)
    return dataframe

In [24]:
codeup_df = make_clean(codeup_df)
codeup_df.head(1)

| | title | content | original | clean |
|---|---|---|---|---|
| 0 | Diversity Equity and Inclusion Report | Codeup is excited to launch our first Diversit... | Codeup is excited to launch our first Diversit... | codeup excited launch first diversity equity i... |


In [25]:
news_df = make_clean(news_df)
news_df.head(1)

| | title | category | body | original | clean |
|---|---|---|---|---|---|
| 0 | Bharti Airtel rakes in 61% profit | india | Bharti Airtel, India's top telecommunications ... | Bharti Airtel, India's top telecommunications ... | bharti airtel india top telecommunications com... |


#### - `stemmed` to hold the stemmed version of the cleaned data.

In [26]:
def make_stemmed(dataframe):
    dataframe['stemmed'] = dataframe['clean'].apply(stem)
    return dataframe

In [27]:
news_df = make_stemmed(news_df)
news_df.head(1)

| | title | category | body | original | clean | stemmed |
|---|---|---|---|---|---|---|
| 0 | Bharti Airtel rakes in 61% profit | india | Bharti Airtel, India's top telecommunications ... | Bharti Airtel, India's top telecommunications ... | bharti airtel india top telecommunications com... | bharti airtel india top telecommun compani rep... |


In [28]:
codeup_df = make_stemmed(codeup_df)
codeup_df.head(1)

| | title | content | original | clean | stemmed |
|---|---|---|---|---|---|
| 0 | Diversity Equity and Inclusion Report | Codeup is excited to launch our first Diversit... | Codeup is excited to launch our first Diversit... | codeup excited launch first diversity equity i... | codeup excit launch first divers equiti inclus... |


#### - `lemmatized` to hold the lemmatized version of the cleaned data.

In [29]:
def make_lemmatized(dataframe):
    dataframe['lemmatized'] = dataframe['clean'].apply(lemmatize)
    return dataframe

In [30]:
news_df = make_lemmatized(news_df)
news_df.head(1)

| | title | category | body | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|---|---|
| 0 | Bharti Airtel rakes in 61% profit | india | Bharti Airtel, India's top telecommunications ... | Bharti Airtel, India's top telecommunications ... | bharti airtel india top telecommunications com... | bharti airtel india top telecommun compani rep... | bharti airtel india top telecommunication comp... |


In [31]:
codeup_df = make_lemmatized(codeup_df)
codeup_df.head(1)

| | title | content | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|---|
| 0 | Diversity Equity and Inclusion Report | Codeup is excited to launch our first Diversit... | Codeup is excited to launch our first Diversit... | codeup excited launch first diversity equity i... | codeup excit launch first divers equiti inclus... | codeup excited launch first diversity equity i... |


In [32]:
codeup_df.columns.to_list()

['title', 'content', 'original', 'clean', 'stemmed', 'lemmatized']

### 9. Ask yourself:

- **If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?**

- **If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?**

- **If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?**
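
One way to ground these answers is to measure how long each transformation takes per megabyte of text and scale that figure to the corpus size. The following is a minimal sketch of such a measurement; `seconds_per_megabyte` is a hypothetical helper, the sample text is made up, and `str.split` is only a placeholder for the `stem` and `lemmatize` functions defined earlier.

```python
import time

def seconds_per_megabyte(transform, texts):
    """Time transform over texts and scale the result to seconds per MB of input."""
    total_bytes = sum(len(text.encode('utf-8')) for text in texts)
    start = time.perf_counter()
    for text in texts:
        transform(text)
    elapsed = time.perf_counter() - start
    return elapsed / (total_bytes / 1_000_000)

# Stand-in sample; in the notebook this could be codeup_df['clean']
sample = ['codeup excited launch first diversity equity inclusion report'] * 1000

# str.split is only a placeholder; pass stem or lemmatize to compare them
rate = seconds_per_megabyte(str.split, sample)
print(f'{rate:.4f} seconds per MB')
```

Multiplying the measured rate by the corpus size in megabytes gives a rough estimate of the compute time each option would cost.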

In [33]:
from prepare import make_nlp_df

In [34]:
one, two = make_nlp_df()
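
`make_nlp_df` is imported from `prepare.py`, which is not shown in this notebook. The following is a minimal sketch of how a helper of that kind might assemble the columns for one dataframe, assuming the text column is renamed to `original` rather than copied. The name `prep_article_data` and its parameters are hypothetical, and the string methods passed in below are stand-ins for `basic_clean`, `stem`, and `lemmatize` so that the example runs without NLTK.

```python
import pandas as pd

def prep_article_data(df, text_column, clean_fn, stem_fn, lemmatize_fn):
    """Return a new dataframe with original, clean, stemmed, and lemmatized columns."""
    # rename returns a new dataframe, so the caller's dataframe is left untouched
    out = df.rename(columns={text_column: 'original'})
    out['clean'] = out['original'].apply(clean_fn)
    out['stemmed'] = out['clean'].apply(stem_fn)
    out['lemmatized'] = out['clean'].apply(lemmatize_fn)
    return out

# Made-up one-row dataframe; str.lower, str.upper, and str.title are placeholders
demo = pd.DataFrame({'title': ['A Post'], 'content': ['Some TEXT here']})
result = prep_article_data(demo, 'content', str.lower, str.upper, str.title)
print(result.columns.to_list())
# ['title', 'original', 'clean', 'stemmed', 'lemmatized']
```

A full version would call this once for the blog posts (text in `content`) and once for the news articles (text in `body`) and return both dataframes.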

In [35]:
one.head()

| | title | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Diversity Equity and Inclusion Report | Codeup is excited to launch our first Diversit... | codeup excited launch first diversity equity i... | codeup excit launch first divers equiti inclus... | codeup excited launch first diversity equity i... |
| 1 | Codeup Honored as SABJ Diversity and Inclusion... | Codeup has been named the 2022 Diversity and I... | codeup named 2022 diversity inclusion award wi... | codeup name 2022 divers inclus award winner sa... | codeup named 2022 diversity inclusion award wi... |
| 2 | How Can I Finance My Career Transition? | Deciding to transition into a tech career is a... | deciding transition tech career big step signi... | decid transit tech career big step signific co... | deciding transition tech career big step signi... |
| 3 | Tips for Women Beginning a Career in Tech | Codeup strongly values diversity, and inclusio... | codeup strongly values diversity inclusion hon... | codeup strongli valu divers inclus honor ameri... | codeup strongly value diversity inclusion hono... |
| 4 | What is Cloud Computing and AWS? | With many companies switching to cloud service... | many companies switching cloud services implem... | mani compani switch cloud servic implement clo... | many company switching cloud service implement... |


In [36]:
two.head()

| | title | category | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|---|
| 0 | Bharti Airtel rakes in 61% profit | india | Bharti Airtel, India's top telecommunications ... | bharti airtel india top telecommunications com... | bharti airtel india top telecommun compani rep... | bharti airtel india top telecommunication comp... |
| 1 | Infosys Gifts Sikka Shares Worth Rs 8.2cr | india | In a regulatory filing to the BSE on Friday, I... | regulatory filing bse friday infosys ltd decid... | regulatori file bse friday infosi ltd decid gi... | regulatory filing bse friday infosys ltd decid... |
| 2 | India beat NZ 3-2 to enter CWG hockey finals | india | In the CWG men's hockey semi-final against New... | cwg men hockey semifinal new zealand saturday ... | cwg men hockey semifin new zealand saturday in... | cwg men hockey semifinal new zealand saturday ... |
| 3 | India's first Billiards Premier League | india | The Billiards and Snooker Association of Mahar... | billiards snooker association maharashtrabsam ... | billiard snooker associ maharashtrabsam decid ... | billiards snooker association maharashtrabsam ... |
| 4 | Kashmir's famous Dal Lake freezes | india | After the recent snowfall in upper reaches of ... | recent snowfall upper reaches kashmir himalaya... | recent snowfal upper reach kashmir himalayan p... | recent snowfall upper reach kashmir himalayan ... |
