### This notebook cleans the wikiHow data of unwanted characters introduced during scraping. The process consists of the following steps:
1. Remove rows with an empty summary or an empty text.
2. Remove the commas, semicolons, and colons that directly follow a full stop.
3. Use spaCy to tokenize the data and reconstruct the sentences.
4. Only keep articles whose summary and text are both longer than 20 tokens (the code measures spaCy tokens, not characters).
5. Identify the sentences, and only keep those longer than 3 tokens.
6. Tokenize every sentence and keep only the tokens that are neither whitespace nor punctuation.
7. Add the normalized form of each token to a list (e.g. n't --> not).
8. After the token loop, join the words to rebuild the sentence, and append the sentence's last token to its end (e.g. '.', '?').
9. Remove the empty rows again.
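
Step 2 can be sketched in plain Python with a single regular expression. The helper name `strip_stray_punctuation` is hypothetical and not part of the notebook, which chains `str.replace` calls instead; the two approaches agree on simple cases but may differ when several stray marks follow one full stop.

```python
import re

# Hypothetical helper, not part of the notebook: lowercase the text and drop a
# comma, semicolon, or colon that directly follows a full stop.
_STRAY_AFTER_STOP = re.compile(r'\.[,;:]')

def strip_stray_punctuation(text):
    return _STRAY_AFTER_STOP.sub('.', text.lower())

cleaned = strip_stray_punctuation('Mix the batter.; Bake it., Serve.: Enjoy.')
print(cleaned)  # mix the batter. bake it. serve. enjoy.
```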

Now it's time to decide which tuples to keep:
1. Compute the length (in characters) of every summary and article and add it to the dataframe.
2. Get descriptive statistics of the lengths.
3. Keep the tuples whose lengths fall within the central 90% of the distribution (above the 5th and below the 95th percentile).
4. Of those, keep the tuples whose text is more than twice as long as the summary.
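
The selection steps above can be sketched on a toy dataframe. This is only an illustration under stated assumptions: the `toy` data and the `within_band` helper are made up, and the thresholds are computed with `Series.quantile` rather than copied by hand from `describe()` as the notebook does.

```python
import pandas as pd

# Toy stand-in for the cleaned data; the real notebook uses the wikiHow columns.
toy = pd.DataFrame({
    'headline': ['a' * n for n in (5, 20, 30, 40, 50, 60, 400)],
    'text': ['b' * n for n in (10, 30, 90, 100, 200, 300, 5000)],
})
toy['summaryLen'] = toy['headline'].str.len()
toy['textLen'] = toy['text'].str.len()

def within_band(lengths, lower=0.05, upper=0.95):
    """Hypothetical helper: True where a length lies strictly inside the band."""
    lo, hi = lengths.quantile([lower, upper])
    return (lengths > lo) & (lengths < hi)

# Step 3: both lengths must fall inside their own 5th-95th percentile band.
band = toy[within_band(toy['summaryLen']) & within_band(toy['textLen'])]
# Step 4: the text must be more than twice as long as the summary.
kept = band[band['textLen'] > 2 * band['summaryLen']]
print(kept[['summaryLen', 'textLen']])
```

Deriving the cut-offs from the data keeps them in sync if the cleaning steps change, at the cost of making the filter depend on whatever rows happen to be present.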

In [1]:
import numpy as np
import pandas as pd

In [2]:
how = pd.read_csv('wikihowAll.csv')
how.shape

(215365, 3)

In [3]:
how = how[['headline','text']]
data = how[~(how['headline'].isnull() | how['text'].isnull())]
data.shape

(214294, 2)

In [4]:
data['headline'] = data['headline'].apply(lambda x: x.lower().replace('.,','.'))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
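
The warning above appears because `data` was produced by filtering `how`, so pandas cannot tell whether the assignment is meant to reach the original frame. One common way to avoid it, sketched here on a toy frame that only mimics the notebook's column names, is to take an explicit `.copy()` of the filtered rows before assigning to them.

```python
import pandas as pd

# Toy frame standing in for `how`; the values are invented for illustration.
how = pd.DataFrame({
    'headline': ['Do This., Then That.', None],
    'text': ['Some Text.', 'Orphan text'],
})

# .copy() makes `data` an independent DataFrame rather than a possible view of `how`.
data = how[~(how['headline'].isnull() | how['text'].isnull())].copy()
data['headline'] = data['headline'].apply(lambda x: x.lower().replace('.,', '.'))
print(data['headline'].tolist())  # ['do this. then that.']
```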


In [5]:
# Lowercase once, then drop the comma, semicolon, or colon that follows a full stop.
data['text'] = data['text'].apply(
    lambda x: x.lower().replace('.,', '.').replace('.;', '.').replace('.:', '.'))


In [6]:
from spacy.lang.en import English
nlp = English()
# Rule-based sentence splitter (spaCy v2 API; in spaCy v3 use nlp.add_pipe('sentencizer')).
nlp.add_pipe(nlp.create_pipe('sentencizer'))

In [13]:
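def regenerateText(txt):
    """Rebuild a spaCy Doc from its normalized, non-punctuation tokens, sentence by sentence."""
    clean_text = ''
    if len(txt) > 20:  # len() of a Doc counts tokens, not characters
        for sentence in txt.sents:
            if len(sentence) > 3:  # skip sentences of three tokens or fewer
                build_sent = []
                for token in sentence:
                    # keep only tokens that are neither punctuation nor whitespace
                    if not (token.is_punct or token.is_space):
                        build_sent.append(str(token.norm_))
                # re-attach the sentence's final token (usually '.', '?' or '!')
                clean_text += ' '.join(build_sent) + str(sentence[-1]) + ' '
    return clean_text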

In [15]:
data['text'] = data['text'].apply(lambda x: regenerateText(nlp(x)))


In [16]:
data['headline'] = data['headline'].apply(lambda x: regenerateText(nlp(x)))


In [42]:
# Drop rows whose headline or text came back empty from regenerateText.
dataCL = data[(data['headline'] != '') & (data['text'] != '')].copy()
dataCL.shape

(194184, 2)

In [43]:
# Lengths are measured in characters.
dataCL['summaryLen'] = dataCL['headline'].str.len()
dataCL['textLen'] = dataCL['text'].str.len()

In [44]:
dataCL['summaryLen'].describe(percentiles=[.05,.1,.25,.5,.75,.9,.95])

count    194184.000000
mean        362.351579
std         320.100462
min           9.000000
5%          103.000000
10%         119.000000
25%         164.000000
50%         267.000000
75%         459.000000
90%         706.000000
95%         916.000000
max       23173.000000
Name: summaryLen, dtype: float64

In [45]:
dataCL['textLen'].describe(percentiles=[.05,.1,.25,.5,.75,.9,.95])

count    194184.000000
mean       2670.422342
std        2837.005673
min           2.000000
5%          196.000000
10%         352.000000
25%         898.000000
50%        1820.000000
75%        3253.000000
90%        6196.000000
95%        8666.000000
max       74177.000000
Name: textLen, dtype: float64

In [46]:
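# Keep rows strictly inside the 5th-95th percentile band reported by describe() above.
tr = dataCL[(dataCL['textLen'] > 196) & (dataCL['summaryLen'] > 103) &
            (dataCL['textLen'] < 8666) & (dataCL['summaryLen'] < 916)]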
tr.shape

(157587, 4)

In [47]:
# Keep rows whose text is more than twice as long as its summary.
tr2 = tr[tr['textLen'] > 2 * tr['summaryLen']]
tr2.shape

(135446, 4)

In [48]:
tr2.to_csv('2018_03_06_wikihow_preservingSentences_truncated.csv', index=False)