## Using the RAKE algorithm for keyword extraction
### Source: https://github.com/zelandiya/RAKE-tutorial

RAKE (Rapid Automatic Keyword Extraction) is the first keyword extraction algorithm tested for the NLP-Unilever project. Here we adapt it to the format of the survey input data.

In [7]:
import pandas as pd

In [5]:
#RAKE Imports
from __future__ import absolute_import
from __future__ import print_function
import six

import rake
import operator
import io

### Data preprocessing
Load the survey data and concatenate the answers to each question into a single string.

In [63]:
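# Load the first sheet of the survey workbook
srv_raw = pd.ExcelFile("../data/survey_unilever.xlsx")
srv = srv_raw.parse()
# Drop the first column so that only the question columns remain
del srv[list(srv)[0]]
# Full column headers, one per question
qst = list(srv)
# Keep only the first line of each header as a short question label
qst_clean = [str(q).split("\n")[0] for q in qst]
qst_clean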

['What is Healthy Skin?',
 'How do you know your skin is healthy?',
 'How do you know your skin is getting healthier with every shower?',
 'How do you get healthy skin?',
 'How do you maintain healthy skin?',
 'How Bar Soap or Body Wash gives you Healthy Skin?',
 'How concerned']

In [64]:
# Join all answers to each question into one string, separated by '. '.
# str.join replaces reduce, which is not a builtin on Python 3.
srv_concat = ['. '.join(str(a) for a in srv[q]) for q in qst]
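
If some respondents skipped a question, pandas loads the empty cells as NaN, and `str()` turns each one into the literal text `nan` inside the concatenated string. Whether this survey has such gaps is not shown here, so the following is only a minimal sketch of one way to skip them. The helper name `join_answers` and the sample answers are hypothetical, and the example uses a plain Python list rather than the real data.

```python
import math


def join_answers(answers, sep=". "):
    """Join one question's answers into a single string, skipping missing values.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    kept = []
    for answer in answers:
        # Missing cells arrive as None or as float NaN
        if answer is None:
            continue
        if isinstance(answer, float) and math.isnan(answer):
            continue
        kept.append(str(answer).strip())
    return sep.join(kept)


# Made-up answers with one missing value in the middle
answers = ["Soft and smooth", float("nan"), "No dryness"]
text = join_answers(answers)
print(text)
```

Under these assumptions, the same helper could be applied to each column, for example `[join_answers(srv[q]) for q in qst]`.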

### RAKE Algorithm

`keywords` maps each survey question to its 10 highest-scoring keyphrases.

In [67]:
# Path to the stop word list
stoppath = "SmartStoplist.txt"

# Initialize RAKE with the stop word file and three filters.
# Arguments:
#     stoppath: path to the list of stop words
#     min_char_length (3): minimum number of characters per word
#     max_words_length (3): maximum number of words per keyphrase
#     min_keyword_frequency (3): minimum number of occurrences of the keyword in the text
rake_object = rake.Rake(stoppath, 3, 3, 3)

# Run RAKE on each question's concatenated answers and keep the first 10 results
keywords = {q: rake_object.run(text)[:10] for q, text in zip(qst_clean, srv_concat)}
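
To make the scores in the output easier to read, here is a simplified sketch of the scoring idea RAKE is usually described with: the text is split into candidate phrases at stop words and punctuation, each word gets the score degree/frequency, and a phrase scores the sum of its words. This is not the code of the `rake` module. The function names, the tiny stop list and the sample sentence are all made up for illustration, and the three filters passed to `rake.Rake` are left out.

```python
import re
from collections import defaultdict

# Tiny illustrative stop list; the notebook uses SmartStoplist.txt instead
STOP_WORDS = {"a", "and", "for", "i", "is", "my", "use"}


def candidate_phrases(text, stop_words):
    """Split text into runs of consecutive non-stop words."""
    phrases = []
    # Punctuation always ends a candidate phrase
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z']+", fragment):
            if word in stop_words:
                # A stop word closes the phrase collected so far
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def score_phrases(phrases):
    """Score each phrase as the sum of degree/frequency over its words."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts the length of every phrase the word occurs in
            degree[word] += len(phrase)
    word_score = {w: float(degree[w]) / freq[w] for w in freq}
    return {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}


sample = ("I use body wash and a moisturizing body wash for dry skin. "
          "Body wash is good for my skin.")
scores = score_phrases(candidate_phrases(sample, STOP_WORDS))
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for phrase, score in ranked:
    print(phrase, round(score, 3))
```

In this toy text, words that co-occur in longer phrases pull the phrase score up, which is consistent with the notebook output, where 'moisturizing body wash' scores above 'body wash'.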

In [68]:
keywords

{'How Bar Soap or Body Wash gives you Healthy Skin?': [('body wash',
   4.166666666666667),
  ('dead skin', 3.8821138211382116),
  ('bar soap', 3.8),
  ('adding moisture', 3.5142857142857142),
  ('harsh chemicals', 3.3714285714285714),
  ('moisturizing soap', 3.311111111111111),
  ('healthy skin', 3.2410881801125706),
  ('dry skin', 3.198780487804878),
  ('skin type', 3.1043360433604335),
  ('skin smooth', 3.048780487804878)],
 'How concerned': [('slightly concerned', 3.677631578947368),
  ('extremely concerned', 3.677631578947368),
  ('moderately concerned', 3.677631578947368),
  ('concerned', 1.6776315789473684)],
 'How do you get healthy skin?': [('moisturizing body wash',
   5.711764705882353),
  ('quality products', 4.322545846817691),
  ('processed foods', 4.3076923076923075),
  ('good diet', 4.216666666666667),
  ('body wash', 4.1517647058823535),
  ('healthy diet', 4.0773809523809526),
  ('good soap', 4.072727272727272),
  ('good products', 4.066990291262136),
  ('drink plenty'