# Data Preparation Exercises

### **The end result of this exercise should be a file named `prepare.py` that defines the requested functions.**

### In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

In [1]:
# imported libraries 
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

# acquire imports
import acquire as a

### 1. Define a function named `basic_clean`. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything

- Normalize unicode characters  
  
- Replace anything that is not a letter, number, whitespace or a single quote.  

### Using adams's text data

In [2]:
data = "Advanced: String theory is a mathematical framework that proposes to be a theory of quantum gravity, seeking to reconcile general relativity (which describes gravity on a large scale) and quantum mechanics (which describes the behavior of particles at a microscopic level). It introduces the idea that the fundamental building blocks of the universe are not particles, but rather one-dimensional strings of energy. These strings can vibrate at different frequencies, giving rise to different types of particles and forces. String theory also requires the existence of additional dimensions beyond the three spatial dimensions we are familiar with, which are compactified or curled up into tiny sizes."

In [3]:
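# defined function to accomplish basic clean actions on text data.
def basic_clean(text_data):
    
    # lowercase everything
    text_data = text_data.lower()
    
    # normalize unicode, dropping characters with no ASCII equivalent
    text_data = unicodedata.normalize('NFKD', text_data)\
        .encode('ascii', 'ignore')\
        .decode('utf-8', 'ignore')

    # remove anything that is not a letter, number, whitespace or single quote
    text_data = re.sub(r"[^a-z0-9'\s]", '', text_data)

    return text_data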

In [4]:
basic_clean(data)

'advanced string theory is a mathematical framework that proposes to be a theory of quantum gravity seeking to reconcile general relativity which describes gravity on a large scale and quantum mechanics which describes the behavior of particles at a microscopic level it introduces the idea that the fundamental building blocks of the universe are not particles but rather onedimensional strings of energy these strings can vibrate at different frequencies giving rise to different types of particles and forces string theory also requires the existence of additional dimensions beyond the three spatial dimensions we are familiar with which are compactified or curled up into tiny sizes'
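The sample text above has no accents or apostrophes, so it doesn't show every cleaning step. Here is a small sketch on a made-up sentence that does. It assumes the single quote is kept, as the exercise asks; `basic_clean_sketch` and the sample sentence are hypothetical names and data used only for illustration.

```python
import re
import unicodedata


def basic_clean_sketch(text):
    # lowercase everything
    text = text.lower()
    # decompose accented characters, then drop the non-ASCII pieces
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
    # keep only letters, digits, whitespace and single quotes
    return re.sub(r"[^a-z0-9'\s]", '', text)


sample = "Café owners don't sell crème brûlée for $5!"
cleaned = basic_clean_sketch(sample)
print(cleaned)
```

The accents are folded to plain letters, the `$` and `!` are dropped, and the apostrophe in "don't" survives.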

### 2. Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

In [50]:
# defined function to apply tokenizer object onto text data and return the result as a single string.
def tokenize(text_data):
    
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    text_data = tokenizer.tokenize(text_data, return_str=True)

    return text_data

In [51]:
tokenize(data)

'Advanced : String theory is a mathematical framework that proposes to be a theory of quantum gravity , seeking to reconcile general relativity ( which describes gravity on a large scale ) and quantum mechanics ( which describes the behavior of particles at a microscopic level ) . It introduces the idea that the fundamental building blocks of the universe are not particles , but rather one-dimensional strings of energy. These strings can vibrate at different frequencies , giving rise to different types of particles and forces. String theory also requires the existence of additional dimensions beyond the three spatial dimensions we are familiar with , which are compactified or curled up into tiny sizes .'

### 3. Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

In [7]:
# defined function to stem each word in the text and join the stems back into a single space-separated string
def stem(text_data):
    
    ps = nltk.porter.PorterStemmer()

    stems = [ps.stem(word) for word in text_data.split()]
    
    text_data_stemmed = ' '.join(stems)
    
    return text_data_stemmed 

In [8]:
stem(data)

'advanced: string theori is a mathemat framework that propos to be a theori of quantum gravity, seek to reconcil gener rel (which describ graviti on a larg scale) and quantum mechan (which describ the behavior of particl at a microscop level). it introduc the idea that the fundament build block of the univers are not particles, but rather one-dimension string of energy. these string can vibrat at differ frequencies, give rise to differ type of particl and forces. string theori also requir the exist of addit dimens beyond the three spatial dimens we are familiar with, which are compactifi or curl up into tini sizes.'
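In the output above, words with punctuation attached (`gravity,`, `particles,`, `forces.`) come back unstemmed, because `split()` leaves the punctuation on the word. The sketch below shows the difference on a tiny made-up phrase; it assumes NLTK is installed, and the phrase itself is only for illustration.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

raw = "strings and particles,"      # punctuation still attached
cleaned = "strings and particles"   # what the text looks like after basic cleaning

# stem word by word, the same way the stem function above does
stem_raw = ' '.join(ps.stem(word) for word in raw.split())
stem_clean = ' '.join(ps.stem(word) for word in cleaned.split())

print(stem_raw)
print(stem_clean)
```

This is one reason to run `basic_clean` before `stem` rather than stemming the raw text.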

### 4. Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

In [9]:
# defined function to lemmatize text in data and return the text as a string in a sentence with "lemmas"
def lemmatize(text_data):

    wnl = nltk.stem.WordNetLemmatizer()

    lemmas = [wnl.lemmatize(word) for word in text_data.split()]
    
    text_data_lemmatized = ' '.join(lemmas)

    return text_data_lemmatized

In [10]:
lemmatize(data)

'Advanced: String theory is a mathematical framework that proposes to be a theory of quantum gravity, seeking to reconcile general relativity (which describes gravity on a large scale) and quantum mechanic (which describes the behavior of particle at a microscopic level). It introduces the idea that the fundamental building block of the universe are not particles, but rather one-dimensional string of energy. These string can vibrate at different frequencies, giving rise to different type of particle and forces. String theory also requires the existence of additional dimension beyond the three spatial dimension we are familiar with, which are compactified or curled up into tiny sizes.'

### 5. Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords.

#### **This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.**

In [52]:
def remove_stopwords(text_data, extra_words=None, exclude_words=None):
    # stopwords list
    stopwords_list = stopwords.words('english')

    # If extra_words are provided, add them to the stopwords_list
    if extra_words:
        stopwords_list.extend(extra_words)

    # If exclude_words are provided, remove them from the stopwords_list
    if exclude_words:
        stopwords_list = [word for word in stopwords_list if word not in exclude_words]

    # Tokenize the text data and remove stopwords
    words = [word for word in text_data.split() if word not in stopwords_list]

    # Join the words back 
    new_text_data = ' '.join(words)

    return new_text_data    

In [49]:
# Setting values for the additional parameters.
extra_words = ["framework", "with"]
exclude_words = ["can"]

result = remove_stopwords(data, extra_words, exclude_words)
print(result)

Advanced: String theory mathematical proposes theory quantum gravity, seeking reconcile general relativity (which describes gravity large scale) quantum mechanics (which describes behavior particles microscopic level). It introduces idea fundamental building blocks universe particles, rather one-dimensional strings energy. These strings can vibrate different frequencies, giving rise different types particles forces. String theory also requires existence additional dimensions beyond three spatial dimensions familiar with, compactified curled tiny sizes.
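The way `extra_words` and `exclude_words` adjust the stopword list can also be written as set arithmetic. This is only a sketch: `BASE_STOPWORDS` is a tiny hypothetical stand-in for `stopwords.words('english')` so the example runs without downloading the NLTK corpus, and `remove_stopwords_sketch` is an illustrative name.

```python
# hypothetical stand-in for stopwords.words('english')
BASE_STOPWORDS = {"a", "is", "the", "of", "to", "can", "with"}


def remove_stopwords_sketch(text, extra_words=None, exclude_words=None):
    # add the extra words, then take out the ones we want to keep
    stopword_set = (BASE_STOPWORDS | set(extra_words or [])) - set(exclude_words or [])
    # keep every word that is not in the final stopword set
    return ' '.join(word for word in text.split() if word not in stopword_set)


text = "string theory is a framework the strings can vibrate"
result = remove_stopwords_sketch(text, extra_words=["framework"], exclude_words=["can"])
print(result)
```

Here "framework" is removed because it was added as an extra stopword, and "can" is kept because it was excluded. Membership checks against a set are also faster than against a list, which can matter on long articles.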


### 6. Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe `news_df`.

In [19]:
news_df = a.get_news_articles_data()
news_df

| Unnamed: 0 | title | content | category |
|---|---|---|---|
| 0 | What is the Nithari serial killings case? ... | The Nithari killings came to light in December... | national |
| 1 | 100-140 soldiers died by suicide every year si... | After 21-year-old Agniveer Amritpal Singh died... | national |
| 2 | Thank India for its shoulder-to-shoulder suppo... | Israeli Minister of Diaspora Affairs, Amichai ... | national |
| 3 | Why are single & unmarried women excluded from... | Delhi HC asked the Centre to explain why singl... | national |
| 4 | Signs of possible cyclonic storm in Arabian Sea | Meteorologists have picked up signs of a possi... | national |
| 5 | ISRO chief meets Tamil Nadu CM, gifts him Chan... | ISRO Chairman S Somanath met with Tamil Nadu C... | national |
| 6 | Indiscriminate bombing in Gaza amounts to geno... | Around 15 prominent opposition leaders, includ... | national |
| 7 | Portion of flyover on Mumbai-Goa 4-lane highwa... | A portion of an under-construction flyover on ... | national |
| 8 | Ambulance carrying live heart covers 14 kms in... | The Bengaluru Police on Sunday facilitated a g... | national |
| 9 | Supreme Court verdict on same-sex marriage lik... | The Supreme Court on Tuesday will likely deliv... | national |


### 7. Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.

In [20]:
codeup_df = a.get_blog_articles_data()
codeup_df

| Unnamed: 0 | title | content |
|---|---|---|
| 0 | Spotlight on APIDA Voices: Celebrating Heritag... | May is traditionally known as Asian American a... |
| 1 | Women in tech: Panelist Spotlight – Magdalena ... | Women in tech: Panelist Spotlight – Magdalena ... |
| 2 | Women in tech: Panelist Spotlight – Rachel Rob... | Women in tech: Panelist Spotlight – Rachel Rob... |
| 3 | Women in Tech: Panelist Spotlight – Sarah Mellor | Women in tech: Panelist Spotlight – Sarah Mell... |
| 4 | Women in Tech: Panelist Spotlight – Madeleine ... | Women in tech: Panelist Spotlight – Madeleine ... |
| 5 | Black Excellence in Tech: Panelist Spotlight –... | Black excellence in tech: Panelist Spotlight –... |


### 8. For each dataframe, produce the following columns:

- `title` to hold the title
  
- `original` to hold the original article/post content
  
- `clean` to hold the normalized and tokenized original with the stopwords removed.
  
- `stemmed` to hold the stemmed version of the cleaned data.
  
- `lemmatized` to hold the lemmatized version of the cleaned data.
  

#### Ask yourself:  

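- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
  - Lemmatized. At this size the extra time lemmatization takes is negligible, and it returns real dictionary words.

- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
  - Lemmatized. The corpus is still small enough that the slower lemmatizer should finish in reasonable time.

- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
  - Stemmed. Stemming is rule-based and cheaper to compute than lemmatization, which matters when the processing is billed.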

## News Dataframe

In [24]:
# news dataframe columns with applied functions.
news_df['title'] = news_df['title']  # Keep the title column as is

# Clean the 'content' column
news_df['clean'] = news_df['content'].apply(basic_clean)

# Tokenize the cleaned text
news_df['tokenized'] = news_df['clean'].apply(tokenize)

# Remove stopwords from tokenized text
news_df['clean'] = news_df['tokenized'].apply(remove_stopwords)

# Stem the cleaned text
news_df['stemmed'] = news_df['clean'].apply(stem)

# Lemmatize the cleaned text
news_df['lemmatized'] = news_df['clean'].apply(lemmatize)

In [25]:
news_df

| Unnamed: 0 | title | content | category | clean | tokenized | stemmed | lemmatized |
|---|---|---|---|---|---|---|---|
| 0 | What is the Nithari serial killings case? ... | The Nithari killings came to light in December... | national | Advanced: String theory mathematical framework... | the nithari killings came to light in december... | advanced: string theori mathemat framework pro... | Advanced: String theory mathematical framework... |
| 1 | 100-140 soldiers died by suicide every year si... | After 21-year-old Agniveer Amritpal Singh died... | national | Advanced: String theory mathematical framework... | after 21yearold agniveer amritpal singh died o... | advanced: string theori mathemat framework pro... | Advanced: String theory mathematical framework... |
| 2 | Thank India for its shoulder-to-shoulder suppo... | Israeli Minister of Diaspora Affairs, Amichai ... | national | Advanced: String theory mathematical framework... | israeli minister of diaspora affairs amichai c... | advanced: string theori mathemat framework pro... | Advanced: String theory mathematical framework... |
| 3 | Why are single & unmarried women excluded from... | Delhi HC asked the Centre to explain why singl... | national | Advanced: String theory mathematical framework... | delhi hc asked the centre to explain why singl... | advanced: string theori mathemat framework pro... | Advanced: String theory mathematical framework... |
| 4 | Signs of possible cyclonic storm in Arabian Sea | Meteorologists have picked up signs of a possi... | national | Advanced: String theory mathematical framework... | meteorologists have picked up signs of a possi... | advanced: string theori mathemat framework pro... | Advanced: String theory mathematical framework... |
| 5 | ISRO chief meets Tamil Nadu CM, gifts him Chan... | ISRO Chairman S Somanath met with Tamil Nadu C... | national | Advanced: String theory mathematical framework... | isro chairman s somanath met with tamil nadu c... | advanced: string theori mathemat framework pro... | Advanced: String theory mathematical framework... |
| 6 | Indiscriminate bombing in Gaza amounts to geno... | Around 15 prominent opposition leaders, includ... | national | Advanced: String theory mathematical framework... | around 15 prominent opposition leaders includi... | advanced: string theori mathemat framework pro... | Advanced: String theory mathematical framework... |
| 7 | Portion of flyover on Mumbai-Goa 4-lane highwa... | A portion of an under-construction flyover on ... | national | Advanced: String theory mathematical framework... | a portion of an underconstruction flyover on t... | advanced: string theori mathemat framework pro... | Advanced: String theory mathematical framework... |
| 8 | Ambulance carrying live heart covers 14 kms in... | The Bengaluru Police on Sunday facilitated a g... | national | Advanced: String theory mathematical framework... | the bengaluru police on sunday facilitated a g... | advanced: string theori mathemat framework pro... | Advanced: String theory mathematical framework... |
| 9 | Supreme Court verdict on same-sex marriage lik... | The Supreme Court on Tuesday will likely deliv... | national | Advanced: String theory mathematical framework... | the supreme court on tuesday will likely deliv... | advanced: string theori mathemat framework pro... | Advanced: String theory mathematical framework... |
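In the output above, every row of `clean`, `stemmed` and `lemmatized` starts with the same string-theory text, while `tokenized` varies by article. That pattern is what you would expect if a function read the global `data` instead of its argument when the cell ran (an assumption about the cause, not something the notebook confirms). A quick check like the sketch below can catch this; `constant_columns` and the toy dataframe are hypothetical, made up for illustration.

```python
import pandas as pd


def constant_columns(df, columns):
    """Return the listed columns whose value is identical in every row."""
    return [
        col for col in columns
        if len(df) > 1 and df[col].nunique(dropna=False) == 1
    ]


# toy data: 'clean' is wrongly identical for both rows
demo = pd.DataFrame({
    'content': ['first article', 'second article'],
    'clean': ['same text', 'same text'],
    'tokenized': ['first article', 'second article'],
})

suspect = constant_columns(demo, ['clean', 'tokenized'])
print(suspect)
```

The columns to check are passed in explicitly because some columns, like `category` here, can legitimately hold one value.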


## Codeup Dataframe

In [26]:
# codeup dataframe columns with applied functions.
codeup_df['title'] = codeup_df['title']  # Keep the title column as is

# Clean the 'content' column
codeup_df['clean'] = codeup_df['content'].apply(basic_clean)

# Tokenize the cleaned text
codeup_df['tokenized'] = codeup_df['clean'].apply(tokenize)

# Remove stopwords from tokenized text
codeup_df['clean'] = codeup_df['tokenized'].apply(remove_stopwords)

# Stem the cleaned text
codeup_df['stemmed'] = codeup_df['clean'].apply(stem)

# Lemmatize the cleaned text
codeup_df['lemmatized'] = codeup_df['clean'].apply(lemmatize)

In [27]:
codeup_df

| Unnamed: 0 | title | content | clean | tokenized | stemmed | lemmatized |
|---|---|---|---|---|---|---|
| 0 | Spotlight on APIDA Voices: Celebrating Heritag... | May is traditionally known as Asian American a... | Advanced: String theory mathematical framework... | may is traditionally known as asian american a... | advanced: string theori mathemat framework pro... | Advanced: String theory mathematical framework... |
| 1 | Women in tech: Panelist Spotlight – Magdalena ... | Women in tech: Panelist Spotlight – Magdalena ... | Advanced: String theory mathematical framework... | women in tech panelist spotlight magdalena ra... | advanced: string theori mathemat framework pro... | Advanced: String theory mathematical framework... |
| 2 | Women in tech: Panelist Spotlight – Rachel Rob... | Women in tech: Panelist Spotlight – Rachel Rob... | Advanced: String theory mathematical framework... | women in tech panelist spotlight rachel robbi... | advanced: string theori mathemat framework pro... | Advanced: String theory mathematical framework... |
| 3 | Women in Tech: Panelist Spotlight – Sarah Mellor | Women in tech: Panelist Spotlight – Sarah Mell... | Advanced: String theory mathematical framework... | women in tech panelist spotlight sarah mellor... | advanced: string theori mathemat framework pro... | Advanced: String theory mathematical framework... |
| 4 | Women in Tech: Panelist Spotlight – Madeleine ... | Women in tech: Panelist Spotlight – Madeleine ... | Advanced: String theory mathematical framework... | women in tech panelist spotlight madeleine ca... | advanced: string theori mathemat framework pro... | Advanced: String theory mathematical framework... |
| 5 | Black Excellence in Tech: Panelist Spotlight –... | Black excellence in tech: Panelist Spotlight –... | Advanced: String theory mathematical framework... | black excellence in tech panelist spotlight w... | advanced: string theori mathemat framework pro... | Advanced: String theory mathematical framework... |
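Both dataframes go through the same steps, so for `prepare.py` the two cells could be folded into one helper that also produces the `original` column the exercise asks for. The sketch below is one possible shape, not the required one: `prepare_article_df` and its parameters are hypothetical names, and the processing functions are passed in so the example runs with simple stand-ins instead of NLTK.

```python
import pandas as pd


def prepare_article_df(df, clean_fn, stem_fn, lemmatize_fn, content_col='content'):
    """Build title / original / clean / stemmed / lemmatized from one dataframe."""
    prepared = pd.DataFrame({
        'title': df['title'],
        'original': df[content_col],
    })
    # clean each row's own text, then derive the other columns from 'clean'
    prepared['clean'] = prepared['original'].apply(clean_fn)
    prepared['stemmed'] = prepared['clean'].apply(stem_fn)
    prepared['lemmatized'] = prepared['clean'].apply(lemmatize_fn)
    return prepared


# toy stand-ins so the sketch runs on its own
demo = pd.DataFrame({'title': ['T1', 'T2'], 'content': ['Hello World', 'Second Post']})
out = prepare_article_df(
    demo,
    clean_fn=str.lower,
    stem_fn=lambda text: text,
    lemmatize_fn=lambda text: text,
)
print(out.columns.tolist())
```

With the functions defined earlier, `clean_fn` could be something like `lambda s: remove_stopwords(tokenize(basic_clean(s)))`, with `stem` and `lemmatize` passed as the other two.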
