# Raw Data Processing
Download the WikiHow raw data in CSV format and preprocess it.

We only need the summary and text pairs. We also need to do some preprocessing before training: remove empty entries, convert to lower case, remove stop words, and so on.

After this processing, we save the processed data to a CSV file so that we can reuse it later.

In [1]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords

In [2]:
content = pd.read_csv("data/wikihowSep.csv")

In [3]:
print(content.shape)
content.head()

(1585695, 5)


Unnamed: 0,overview,headline,text,sectionLabel,title
0,So you're a new or aspiring artist and your c...,\nSell yourself first.,"Before doing anything else, stop and sum up y...",Steps,How to Sell Fine Art Online
1,"If you want to be well-read, then, in the wor...",\nRead the classics before 1600.,Reading the classics is the very first thing ...,Reading the Classics,How to Be Well Read
2,So you're a new or aspiring artist and your c...,\nJoin online artist communities.,Depending on what scale you intend to sell yo...,Steps,How to Sell Fine Art Online
3,So you're a new or aspiring artist and your c...,\nMake yourself public.,Get yourself out there as best as you can by ...,Steps,How to Sell Fine Art Online
4,So you're a new or aspiring artist and your c...,\nBlog about your artwork.,"Given the hundreds of free blogging websites,...",Steps,How to Sell Fine Art Online


In [4]:
content.isnull().sum()

overview          2508
headline             0
text            198405
sectionLabel      1904
title                1
dtype: int64

In [5]:
# Drop rows with any missing value, then drop the columns we do not need
content = content.dropna()
content = content.drop(columns=['overview', 'sectionLabel', 'title'])
content = content.reset_index(drop=True)
# Strip newline characters (each headline starts with one)
wikihow = content.replace('\\n', '', regex=True)
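
The effect of this cell is easier to see on a tiny frame. The sketch below uses made-up rows with the same column names as `wikihowSep.csv`; the name `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with the same column names as wikihowSep.csv (values are made up).
toy = pd.DataFrame({
    "overview": ["intro", "intro", None],
    "headline": ["\nSell yourself first.", "\nMake yourself public.", "\nBlog."],
    "text": ["Some body text.", None, "More text."],
    "sectionLabel": ["Steps", "Steps", "Steps"],
    "title": ["How to A", "How to A", "How to B"],
})

kept = (
    toy.dropna()                        # drops row 1 (missing text) and row 2 (missing overview)
       .drop(columns=["overview", "sectionLabel", "title"])
       .reset_index(drop=True)
       .replace(r"\n", "", regex=True)  # the regex \n matches a newline character
)
print(kept)
```

Note that `dropna()` runs before the columns are dropped, so a row with a missing `overview` is removed as well, even though `overview` is discarded afterwards.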

In [6]:
wikihow.head()

Unnamed: 0,headline,text
0,Sell yourself first.,"Before doing anything else, stop and sum up y..."
1,Read the classics before 1600.,Reading the classics is the very first thing ...
2,Join online artist communities.,Depending on what scale you intend to sell yo...
3,Make yourself public.,Get yourself out there as best as you can by ...
4,Blog about your artwork.,"Given the hundreds of free blogging websites,..."


In [7]:
for i in range(2):
    print('SUMMARY:', wikihow.headline[i])
    print('TEXT:', wikihow.text[i])
    print()

SUMMARY: Sell yourself first.
TEXT:  Before doing anything else, stop and sum up yourself as an artist. Now, think about how to translate that to an online profile. Be it the few words, Twitter allows you or an entire page of indulgence that your own website would allow you. Bring out the most salient features of your creativity, your experience, your passion, and your reasons for painting. Make it clear to readers why you are an artist who loves art, produces high quality art, and is a true champion of art. If you're not great with words, find a friend who can help you with this really important aspect of selling online – the establishment of your credibility and reliability.;

SUMMARY: Read the classics before 1600.
TEXT:  Reading the classics is the very first thing you have to do to be well-read. If you want to build a solid foundation for your understanding of the books you read, then you can't avoid some of the earliest plays, poems, and oral tales ever written down. Remember tha

## Save a Version with Only Summaries and Texts

In [8]:
# Save a version that includes only headlines and texts
wikihow.to_csv('wikihow.csv', index=False)
# Back up the file by copying it to data/wikihow.csv

In [9]:
content = pd.read_csv("wikihow.csv")

In [10]:
content.head()

Unnamed: 0,headline,text
0,Sell yourself first.,"Before doing anything else, stop and sum up y..."
1,Read the classics before 1600.,Reading the classics is the very first thing ...
2,Join online artist communities.,Depending on what scale you intend to sell yo...
3,Make yourself public.,Get yourself out there as best as you can by ...
4,Blog about your artwork.,"Given the hundreds of free blogging websites,..."


## Data Cleaning
Apply the following operations:

1. Convert words to lower case.
2. Expand contractions: English has many contractions, for instance you're -> you are. These can cause problems in NLP because the same meaning appears under several surface forms.
3. Format words and remove unwanted characters.
4. Remove stop words.

In [11]:
# A list of contractions from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are"
}

def clean_text(text, remove_stopwords=True):
    '''Lower-case the text, expand contractions, strip unwanted characters,
    and optionally remove stop words, so that fewer words end up without a word embedding.'''

    # Convert words to lower case
    text = text.lower()

    # Replace contractions with their longer forms.
    # Tokens are split on whitespace, so a contraction followed directly
    # by punctuation (e.g. "don't,") is not expanded.
    text = " ".join(contractions.get(word, word) for word in text.split())

    # Format words and remove unwanted characters
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\<a href', ' ', text)
    text = re.sub(r'&amp;', '', text)
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'\'', ' ', text)

    # Optionally, remove stop words
    if remove_stopwords:
        # Note: the stop word set is rebuilt on every call
        stops = set(stopwords.words("english"))
        text = " ".join(w for w in text.split() if w not in stops)

    return text
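
Because `clean_text` looks up whitespace-separated tokens, a contraction that touches punctuation (such as `can't,`) is not found in the table. One possible alternative, sketched here under assumptions, is to match contractions with a regular expression instead. The names `sample_contractions` and `expand_contractions` are hypothetical, and the three-entry table is only a subset chosen for illustration; this is not the method used in the notebook.

```python
import re

# A small, hypothetical subset of the contractions table above.
sample_contractions = {"can't": "cannot", "don't": "do not", "you're": "you are"}

# Longest keys first so that longer contractions win over shorter ones.
# \b only works for keys that start and end with a word character,
# so an entry like "'cause" would need separate handling.
_keys = sorted(sample_contractions, key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in _keys) + r")\b")

def expand_contractions(text):
    """Lower-case text and expand contractions, even next to punctuation."""
    text = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return _pattern.sub(lambda m: sample_contractions[m.group(1)], text)

sentence = "Don't worry, you're fine; you can't, really."
expanded = expand_contractions(sentence)
print(expanded)
```

With the whitespace-based lookup, the token `can't,` in this sentence would be left unexpanded, since the key with the trailing comma is not in the table.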

In [12]:
nltk.download('stopwords')
  
# Clean the summaries and texts, skipping pairs that are unusable
clean_summaries = []
clean_texts = []

for summary, text in zip(wikihow.headline, wikihow.text):
    # pandas represents missing values as float NaN, so skip them
    if isinstance(summary, float) or isinstance(text, float):
        continue
    len_sum = len(summary.split())
    len_text = len(text.split())
    # Skip pairs whose text is not longer than the summary,
    # or whose summary has fewer than 3 words
    if len_text <= len_sum or len_sum < 3:
        continue
    clean_summaries.append(clean_text(summary, remove_stopwords=False))
    clean_texts.append(clean_text(text))
print('Complete.', len(clean_summaries), len(clean_texts))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/junwenbu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Complete. 1212030 1212030
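
The length filter in the loop above can also be written with vectorized pandas string methods. The following is a minimal sketch on made-up rows, assuming the same `headline` and `text` column names; `toy`, `pairs`, and `kept` are illustrative names, and the cleaning step itself is left out.

```python
import pandas as pd

# Made-up rows covering each case handled by the loop.
toy = pd.DataFrame({
    "headline": [
        "Sell yourself first.",            # 3 words, longer text -> kept
        "Blog.",                           # fewer than 3 words   -> dropped
        "Read the classics before 1600.",  # text shorter         -> dropped
        "Make yourself public.",           # missing text         -> dropped
    ],
    "text": [
        "Before doing anything else, stop and sum up yourself.",
        "Write often about your work.",
        "Read them.",
        None,
    ],
})

pairs = toy.dropna(subset=["headline", "text"])
len_sum = pairs["headline"].str.split().str.len()   # words per summary
len_text = pairs["text"].str.split().str.len()      # words per text

# Keep a pair only if the text is longer than the summary
# and the summary has at least 3 words.
kept = pairs[(len_text > len_sum) & (len_sum >= 3)].reset_index(drop=True)
print(kept)
```

The boolean mask is the negation of the loop's skip condition `len_text <= len_sum or len_sum < 3`.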


In [13]:
clean_dict = {'summary':clean_summaries, 'text':clean_texts}
df = pd.DataFrame(clean_dict)
df.to_csv('clean_wikihow.csv', index=False)
# Copy it to the data folder so that it can be used later in the project.

## Save as CSV
The cleaned pairs are saved to clean_wikihow.csv so that they can be reused later. Reload the file to check the result.

In [14]:
wikihow = pd.read_csv("clean_wikihow.csv")
print(wikihow.shape)
wikihow.head()

(1212030, 2)


Unnamed: 0,summary,text
0,sell yourself first,anything else stop sum artist think translate ...
1,read the classics before 1600,reading classics first thing well read want bu...
2,join online artist communities,depending scale intend sell art pieces may wan...
3,make yourself public,get best advertising publish example pieces ar...
4,blog about your artwork,given hundreds free blogging websites lot choi...
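
One thing worth checking after the reload: a string that is empty after cleaning (for example, a text made up only of stop words) is written as an empty CSV field, and `pd.read_csv` reads such a field back as NaN by default. Whether this occurs in the WikiHow data is not shown above, so the sketch below only demonstrates the behavior on made-up rows, using an in-memory buffer in place of a file.

```python
import io
import pandas as pd

# Made-up cleaned pairs; the second text is empty after cleaning.
df = pd.DataFrame({
    "summary": ["make yourself public", "do it now"],
    "text": ["get best advertising", ""],
})

buffer = io.StringIO()          # stands in for clean_wikihow.csv
df.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
n_missing = int(reloaded["text"].isna().sum())  # empty field comes back as NaN
print("missing texts after reload:", n_missing)

# Drop such rows before using the data for training.
reloaded = reloaded.dropna().reset_index(drop=True)
```

Running `wikihow.isnull().sum()` on the reloaded frame would show whether any such rows exist in the real data.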
