# RAKE

**Task 1**: compare two keyword lists extracted from the same text, one marked manually and one produced by the RAKE algorithm.  
**Task 2**: improve the RAKE algorithm.  
**Task 3**: test the RAKE algorithm on Russian texts.

The text is taken from: http://www.bbc.com/news/business-43032542

In [2]:
import RAKE
import operator
import io
import pandas as pd

In [3]:
stoppath = "SmartStoplist.txt"

In [4]:
#initialize RAKE by providing a path to a stopwords file
rake_object = RAKE.Rake(stoppath)

In [5]:
# read the text that RAKE will be run on; the file is closed automatically
with io.open("Fish_farming.txt", 'r', encoding="utf-8") as sample_file:
    text = sample_file.read()

15 manually marked keywords (person 1):

In [69]:
keywords1 = pd.read_csv("keywords_person1_top15.csv", encoding='utf-8')

In [70]:
keywords1['len1'] = pd.Series(len(i.split()) for i in keywords1['KeyWord_Nastya'])

In [71]:
keywords1

Unnamed: 0,KeyWord_Nastya,len1
0,tech firm CageEye,3
1,Norwegian fish farms,3
2,automated feeding systems,3
3,artificial intelligence technologies,3
4,monitor the salmon,3
5,detect sea lice,3
6,fish farming,2
7,farmed salmon,2
8,Mr Sovegjarto,2
9,fish monitoring,2


15 manually marked keywords (person 2):

In [79]:
keywords2 = pd.read_csv("keywords_person2_top15.csv", encoding='utf-8')

In [80]:
keywords2['len2'] = pd.Series(len(i.split()) for i in keywords2['KeyWord_Masha'])

In [81]:
key_words = pd.concat([keywords1, keywords2], axis=1)
key_words

Unnamed: 0,KeyWord_Nastya,len1,KeyWord_Masha,len2
0,tech firm CageEye,3,fish farming,2
1,Norwegian fish farms,3,hydro-acoustic system,2
2,automated feeding systems,3,tech firm CageEye,3
3,artificial intelligence technologies,3,caged salmon,2
4,monitor the salmon,3,Norwegian fish farms,3
5,detect sea lice,3,salmon feeding patterns,3
6,fish farming,2,automated fish monitoring,3
7,farmed salmon,2,monitor the salmon,3
8,Mr Sovegjarto,2,feeding fish visually,3
9,fish monitoring,2,the parasitic sea lice,4


Thus, the two independently marked keyword lists overlap by about 27% (4 of 15 expressions).
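
The overlap can also be counted programmatically rather than by eye. The sketch below is a minimal illustration, assuming that phrases are compared after lowercasing and trimming; `keyword_overlap` is a hypothetical helper, and only the ten rows displayed above are used (the CSV files themselves are described as holding 15 rows each).

```python
def keyword_overlap(list_a, list_b):
    """Return the shared phrases and the Jaccard index of two keyword lists."""
    # Lowercase and trim so that spelling variants in case or spacing still match.
    set_a = {phrase.strip().lower() for phrase in list_a}
    set_b = {phrase.strip().lower() for phrase in list_b}
    shared = set_a & set_b
    union = set_a | set_b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# The ten rows shown in the table above, one list per annotator.
nastya = [
    "tech firm CageEye", "Norwegian fish farms", "automated feeding systems",
    "artificial intelligence technologies", "monitor the salmon",
    "detect sea lice", "fish farming", "farmed salmon", "Mr Sovegjarto",
    "fish monitoring",
]
masha = [
    "fish farming", "hydro-acoustic system", "tech firm CageEye",
    "caged salmon", "Norwegian fish farms", "salmon feeding patterns",
    "automated fish monitoring", "monitor the salmon",
    "feeding fish visually", "the parasitic sea lice",
]

shared, jaccard = keyword_overlap(nastya, masha)
print(sorted(shared))
print(len(shared), jaccard)
```

Exact string matching is strict: 'detect sea lice' and 'the parasitic sea lice' describe the same thing but do not count as a match here.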

Let's also measure inter-annotator agreement. To do that, we first need to encode the strings as numbers.

In [82]:
from sklearn import preprocessing
# map every distinct phrase from both annotators to an integer label
le = preprocessing.LabelEncoder()
le.fit(pd.Series(list(key_words['KeyWord_Nastya']) + list(key_words['KeyWord_Masha'])))
# encode each annotator's column with the shared label mapping
KeyWords_Nastya_le = le.transform(key_words['KeyWord_Nastya'])
KeyWords_Masha_le = le.transform(key_words['KeyWord_Masha'])

In [83]:
from sklearn.metrics import cohen_kappa_score
cohen_kappa_score(KeyWords_Nastya_le, KeyWords_Masha_le)

-0.018099547511312153

The kappa score is about -0.02, which means no agreement. This is not surprising: the kappa score compares the two lists position by position, so even shared keywords count as disagreement unless both annotators happened to put them in the same row.
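
To make the order dependence concrete, here is a small pure-Python sketch of Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The function `cohen_kappa` and the three-phrase lists are illustrative assumptions, not part of the notebook; the point is only that the same set of phrases in a different order yields a negative score.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences (pure-Python sketch)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of positions where both labels match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, computed from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


phrases = ["fish farming", "farmed salmon", "sea lice"]
# Same phrases, same order: every position agrees.
same_order = cohen_kappa(phrases, phrases)
# Same phrases, rotated by one: no position agrees.
shuffled = cohen_kappa(phrases, ["farmed salmon", "sea lice", "fish farming"])
print(same_order, shuffled)
```

With identical order the score is 1, while the rotated list scores roughly -0.5 even though both annotators chose exactly the same phrases. A set-based measure is therefore a better fit for unordered keyword lists.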

In [146]:
# find keywords with at most 3 words
keywords_rake_1 = rake_object.run(text, maxWords = 3)

In [135]:
len(keywords_rake_1)

155

In [37]:
print("Keywords:")
for word in keywords_rake_1:
    print(word)

Keywords:
('7m-10m krone', 9.0)
('hi-tech approaches', 9.0)
('perennially problematic visitor', 9.0)
('self-guided tool', 9.0)
('hi-tech solutions', 9.0)
('mr shang hopes', 8.5)
('salmon prices soaring', 8.1)
('salmon sloshing loudly', 8.1)
('de-loused salmon', 8.1)
('believes mr sovegjarto', 8.0)
('humans simply overseeing', 8.0)
('artificial intelligence technologies', 8.0)
('computers carefully monitor', 8.0)
('parasitic sea louse', 8.0)
('tech firm cageeye', 7.75)
('hydro-acoustic system', 7.6)
('un-gobbled feed', 7.6)
('lingalaks fish farms', 7.25)
('modern fish farms', 7.25)
('automated fish monitoring', 7.0)
('feeding fish visually', 7.0)
('artificial intelligence', 5.0)
('mr sovegjarto', 5.0)
('mr folkedal', 4.5)
('fish farms', 4.25)
('farmed salmon', 4.1)
('million salmon', 4.1)
('caged salmon', 4.1)
('salmon farming', 4.1)
('fish farming', 4.0)
('big business', 4.0)
('100 million tonnes', 4.0)
('boost production', 4.0)
('cut costs', 4.0)
('feeding frenzy', 4.0)
('improve [exp

Not really keywords, in my judgment:  

7m-10m krone;  
perennially problematic visitor;  
salmon prices soaring;  
humans simply overseeing;  
mr shang hopes;  
believes mr sovegjarto;  
million salmon;  
big business;  
100 million tonnes;  
boost production;  
feeding frenzy;  
improve [expenditure];  
pellet detector;  
wrong place;  
giving scientists;  
huge variations;  
factors influencing;  
recently raised $3;  
causing damage;  
removed manually;  
fish make;  
firm lots;  
make decisions;  
make big;  
water currents;  
water temperature;  
fires lasers;  
fatal result;  
monitor;  
humans;  
make;  
water;  
system;  
fatal;  
data;  
produces;  
year;  
producers;  
turning;  
eat;  
produce;  
installed;  
cluster;  
swim;  
knowledge;  
save;  
money;  
wasted;  
developed;  
years;  
develop;  
observes;  
number;  
pellets;  
full;  
accurate;  
institute;  
bergen;  
insights;  
day;  
adds;  
process;  
rise;  
start;  
set;  
start-;  
5m;  
makes;  
pens;  
operators;  
determining;  
explains;  
approach;  
group;  
designed;  
stingray;  
hit;  
coagulates;  
milliseconds;  
boasts;  
website;  
mirror-;  
skin;  
reflects;  
swims;  
surprise;  
wealthy;  
5%;  
£2;  
];  

**Results**:  
    
The text is about information technology in fish farming.  
RAKE found 155 key expressions of at most 3 words in a text of 624 words.

Almost all manually marked keywords are found (although I'd add 'salmon feeding patterns', 'Norway', 'automated feeding systems', 'fish monitoring' and 'computer vision algorithms').

There is still quite a lot of noise: precision is about 58% (90 of the 155 expressions are real keywords). Most of the noise comes from single words that may be specific (like 'monitor') but do not convey the sense of the text on their own, whereas the same words inside n-grams do (compare 'monitor' with 'fish monitoring').
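
One simple way to act on this observation is to post-filter the RAKE output and keep only multi-word expressions. The sketch below assumes the `(phrase, score)` tuple shape shown in the output above; `multiword_only` is a hypothetical helper, and the scores given to the two single words are illustrative rather than taken from the run.

```python
def multiword_only(scored_phrases, min_words=2):
    """Keep (phrase, score) pairs whose phrase has at least `min_words` words."""
    return [
        (phrase, score)
        for phrase, score in scored_phrases
        if len(phrase.split()) >= min_words
    ]


# Two expressions from the output above plus two single words;
# the single-word scores are made up for illustration.
sample = [
    ("automated fish monitoring", 7.0),
    ("fish farming", 4.0),
    ("monitor", 1.0),
    ("humans", 1.0),
]

filtered = multiword_only(sample)
print(filtered)
```

A blanket filter like this would also drop single words that do matter, such as 'salmon', so it is better treated as a rough baseline than as a final rule.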

If we take the top 15 RAKE keywords (still with a maximum length of 3), the intersection with the gold standard (GS) is only 4 expressions (27%): 'hi-tech approaches', 'artificial intelligence technologies', 'tech firm cageeye' and 'parasitic sea louse'.  
Apart from a few expressions that are clearly not key ('7m-10m krone' and 'perennially problematic visitor'), the rest are close to the topic but not exact. For example, people mentioned in the article should be included as keywords ('mr shang'), but not together with a verb (compare 'mr shang' with 'mr shang hopes', and 'mr sovegjarto' with 'believes mr sovegjarto'). Similarly, 'salmon' on its own should be a keyword because the whole text is about salmon, while 'salmon sloshing loudly' is not really a key expression (the adverb is definitely unnecessary, and sloshing is not the main topic of the text either).
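
The top-15 figure can be reproduced with a small precision-at-k helper. This is a sketch under stated assumptions: `precision_at_k` is a hypothetical function, and `reported_matches` contains only the four expressions named above as matching the manual lists, not the complete manual lists.

```python
def precision_at_k(ranked_phrases, gold, k):
    """Share of the first k ranked phrases that also appear in `gold`."""
    top_k = ranked_phrases[:k]
    if not top_k:
        return 0.0
    hits = [phrase for phrase in top_k if phrase in gold]
    return len(hits) / len(top_k)


# The 15 highest-scoring RAKE expressions, in the order printed above.
top15 = [
    "7m-10m krone", "hi-tech approaches", "perennially problematic visitor",
    "self-guided tool", "hi-tech solutions", "mr shang hopes",
    "salmon prices soaring", "salmon sloshing loudly", "de-loused salmon",
    "believes mr sovegjarto", "humans simply overseeing",
    "artificial intelligence technologies", "computers carefully monitor",
    "parasitic sea louse", "tech firm cageeye",
]
# Only the four matches reported in the text, not the full manual lists.
reported_matches = {
    "hi-tech approaches", "artificial intelligence technologies",
    "tech firm cageeye", "parasitic sea louse",
}

p15 = precision_at_k(top15, reported_matches, k=15)
print(round(p15, 2))
```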

I don't think any crucial expressions are missing from the gold standard.

In [None]:
('7m-10m krone', 9.0)
('hi-tech approaches', 9.0)
('perennially problematic visitor', 9.0)
('self-guided tool', 9.0)
('hi-tech solutions', 9.0)
('mr shang hopes', 8.5)
('salmon prices soaring', 8.1)
('salmon sloshing loudly', 8.1)
('de-loused salmon', 8.1)
('believes mr sovegjarto', 8.0)
('humans simply overseeing', 8.0)
('artificial intelligence technologies', 8.0)
('computers carefully monitor', 8.0)
('parasitic sea louse', 8.0)
('tech firm cageeye', 7.75)

### Rake 2.0

1. Lemmatization can be added to the algorithm.
2. The minimum word length can be set to 3, which removes the junk (5% / £2 / ] / 5m).

P.S. Raising the minimum frequency threshold leaves very few key expressions, and all of them are single words.
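
The effect of both thresholds can be sketched in plain Python. This is not the library's implementation: `filter_candidates` is a hypothetical helper, the candidate list is a toy example, and the length check is applied to the whole candidate string. It only illustrates why a length threshold removes the junk while a frequency threshold tends to leave single words, since longer expressions rarely repeat verbatim.

```python
from collections import Counter


def filter_candidates(phrases, min_chars=1, min_frequency=1):
    """Drop candidate phrases that are too short or occur too rarely."""
    counts = Counter(phrases)
    # Counter keeps first-seen order, so the result order is stable.
    return [
        phrase
        for phrase, count in counts.items()
        if len(phrase) >= min_chars and count >= min_frequency
    ]


# Toy candidate list: a few repeated single words, some junk, two longer phrases.
candidates = [
    "salmon", "fish farming", "5%", "salmon", "£2", "]", "5m",
    "tech firm cageeye", "fish", "fish",
]

by_length = filter_candidates(candidates, min_chars=3)
by_length_and_frequency = filter_candidates(candidates, min_chars=3, min_frequency=2)
print(by_length)
print(by_length_and_frequency)
```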

In [147]:
# find keywords with at most 3 words, at least 3 characters and frequency of at least 2
keywords = rake_object.run(text, maxWords = 3, minCharacters=3, minFrequency=2)

In [148]:
keywords

[('salmon', 2.1),
 ('fish', 2.0),
 ('firm', 1.75),
 ('feed', 1.6),
 ('system', 1.6),
 ('lice', 1.5),
 ('industry', 1.3333333333333333),
 ('data', 1.3333333333333333),
 ('year', 1.0),
 ('turning', 1.0),
 ('eat', 1.0),
 ('developed', 1.0),
 ('technology', 1.0)]

In [102]:
import nltk
from nltk.stem import WordNetLemmatizer

In [116]:
wordnet_lemmatizer = WordNetLemmatizer()
words = RAKE.RAKE.separate_words(text)
lemmas = [wordnet_lemmatizer.lemmatize(word) for word in words]
lemmatized_text = ' '.join(lemmas)

In [149]:
# find keywords with at most 3 words in the lemmatized text
keywords_rake_2 = rake_object.run(lemmatized_text, maxWords = 3)

In [150]:
len(keywords_rake_2)

127

In [162]:
'video camera' in [i[0] for i in keywords_rake_2]

True

In [140]:
keywords_rake_2

[('7m 10m krone', 9.0),
 ('human simply overseeing', 9.0),
 ('computer vision algorithm', 9.0),
 ('perennially problematic visitor', 9.0),
 ('future stingray ha', 8.333333333333334),
 ('salmon sloshing loudly', 8.0),
 ('de loused salmon', 8.0),
 ('hydro acoustic system', 7.8),
 ('lingalaks fish farm', 7.25),
 ('make big change', 7.25),
 ('modern fish farm', 7.25),
 ('artificial intelligence technology', 7.0),
 ('automated fish monitoring', 7.0),
 ('feeding fish visually', 7.0),
 ('artificial intelligence', 5.0),
 ('big business', 4.5),
 ('ha installed', 4.333333333333334),
 ('technology ha', 4.333333333333334),
 ('fish farm', 4.25),
 ('fish farming', 4.0),
 ('million tonne', 4.0),
 ('laser automation', 4.0),
 ('boost production', 4.0),
 ('cut cost', 4.0),
 ('farmed salmon', 4.0),
 ('million salmon', 4.0),
 ('feeding frenzy', 4.0),
 ('improve expenditure', 4.0),
 ('wrong place', 4.0),
 ('caged salmon', 4.0),
 ('ole folkedal', 4.0),
 ('marine research', 4.0),
 ('oxygen level', 4.0),
 ('g

Expressions selected by standard RAKE but not by the improved version:

In [152]:
set([i[0] for i in keywords_rake_1]) - set([i[0] for i in keywords_rake_2])

{'100 million tonnes',
 '5%',
 '5m',
 '7m-10m krone',
 ']',
 'adds',
 'approach',
 'aquabyte',
 'artificial intelligence technologies',
 'automation',
 'believes mr sovegjarto',
 'boasts',
 'cages',
 'computers carefully monitor',
 'cut costs',
 'de-loused salmon',
 'factors influencing',
 'farmers',
 'farms',
 'fires lasers',
 'firm lots',
 'fish farms',
 'fish make',
 'future',
 'giving scientists',
 'hi-tech approaches',
 'hi-tech solutions',
 'huge variations',
 'humans',
 'humans simply overseeing',
 'hydro-acoustic system',
 'images',
 'improve [expenditure]',
 'insights',
 'installed',
 'laser',
 'lasers',
 'lice',
 'lice attach',
 'lice removal',
 'lingalaks fish farms',
 'make big',
 'make decisions',
 'makes',
 'milliseconds',
 'mirror-',
 'modern fish farms',
 'mr shang hopes',
 'mr sovegjarto',
 'noise',
 'noise lessens',
 'operators',
 'oxygen levels',
 'parasitic sea louse',
 'pellet detector',
 'pellets',
 'pens',
 'producers',
 'produces',
 'recently raised $3',
 'salmo

The NLTK lemmatizer truncates some words that it should not (for example, less -> le). Overall, though, the new algorithm keeps only the lemmatized combinations and so removes redundant variants:
- 'artificial intelligence technology' instead of both 'artificial intelligence technology' and 'artificial intelligence technologies'
- 'video camera' instead of both 'video camera' and 'video cameras'
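
The merging effect can be shown without running the full pipeline. In the sketch below, `naive_singular` is a toy stand-in for a lemmatizer (it only strips common English plural endings and is not the WordNet lemmatizer), and `merge_variants` is a hypothetical helper that groups phrases by their normalized form. The guards against short words and a final 'ss' are there to avoid truncations of the 'less -> le' kind.

```python
def naive_singular(word):
    """Toy stand-in for a lemmatizer: strips common English plural endings."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    # Leave short words and words ending in 'ss' alone ('less', 'has').
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def merge_variants(phrases, normalize=naive_singular):
    """Group phrases that become identical after word-level normalization."""
    groups = {}
    for phrase in phrases:
        key = " ".join(normalize(word) for word in phrase.split())
        groups.setdefault(key, []).append(phrase)
    return groups


groups = merge_variants([
    "artificial intelligence technologies",
    "artificial intelligence technology",
    "video cameras",
    "video camera",
    "less",
])
for key, variants in groups.items():
    print(key, "<-", variants)
```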

### Rake Russian

In [3]:
#initialize RAKE by providing a path to a stopwords file
rake_object_russian = RAKE.Rake("stoplist_ru.txt")

In [11]:
# read the Russian text that RAKE will be run on (the file is assumed to be UTF-8)
with io.open("технологии_рыбалка.txt", 'r', encoding="utf-8") as sample_file:
    text_ru = sample_file.read()

In [12]:
# find keywords with at most 3 words
keywords_ru = rake_object_russian.run(text_ru, maxWords = 3)

In [13]:
keywords_ru

[('первыми занятиями человека', 9.0),
 ('собирания всяких корешков', 9.0),
 ('остаются самыми консервативными', 9.0),
 ('делают орудия труда', 9.0),
 ('большей частью увлечения', 9.0),
 ('пошло развитие гаджета', 9.0),
 ('вооружение рыбаков поступили', 9.0),
 ('высокая разрешающая способность', 9.0),
 ('проводить поиск рыбы', 9.0),
 ('предмет наличия рыбы', 9.0),
 ('оповещают рыболова писком', 9.0),
 ('дорогие модели идут', 9.0),
 ('кучу рыбацких наворотов', 9.0),
 ('способны воспроизводить звуки', 9.0),
 ('издаваемые мелкой рыбкой', 9.0),
 ('электронные донные приманки', 9.0),
 ('имитация работы жабр', 8.5),
 ('предлагают беспроводные эхолоты', 8.25),
 ('эхолоты позволяют работать', 8.25),
 ('показа технологии ebs', 8.25),
 ('установить бесплатное приложение', 8.0),
 ('предметов пользования карпятников', 8.0),
 ('по-мощью рыбак', 7.666666666666667),
 ('берегу появятся рыбаки', 7.166666666666667),
 ('рыбаки карпятники', 4.5),
 ('имитация рачка', 4.5),
 ('ebs original', 4.25),
 ('ebs™ c

The text is on roughly the same topic as the English one: innovative technologies in fishing.

RAKE finds key bigrams and trigrams in the Russian text much less reliably. It seems that three-word collocations in Russian, unlike in English, are generally less stable and co-occur less often, so the maximum number of words could be limited to two.  

Unwanted combinations of verbs with arbitrary direct objects (for example, "изучать водоём", 'to study a body of water') and prepositional phrases ("вместо удочек", 'instead of fishing rods'; "вместо улова", 'instead of the catch') stand out in particular, so it would be good to take part of speech into account. There are suitable expressions too: "навигационный прибор" ('navigation device'), "крупная рыба" ('large fish').  

Single words in Russian still reflect the meaning of the text better. Adding lemmatization would be an improvement, though, so that word forms of the same lexeme are not repeated. It would also help exclude stop words in any form, not only in the forms listed in the stop list (for example, the pronouns "каких" and "любое").
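
The idea of matching stop words by lemma rather than by surface form can be sketched as follows. Everything here is illustrative: `LEMMAS` is a tiny hand-written table standing in for a real morphological analyzer, and `STOP_LEMMAS`, `is_stop_word` and `drop_stop_words` are hypothetical names, not part of RAKE.

```python
# Hypothetical hand-written lemma table; a real pipeline would query
# a morphological analyzer for Russian instead.
LEMMAS = {
    "каких": "какой",
    "какие": "какой",
    "любое": "любой",
    "любую": "любой",
}
# Stop list stored as lemmas, so that every inflected form is covered.
STOP_LEMMAS = {"какой", "любой"}


def is_stop_word(word, lemmas=LEMMAS, stop_lemmas=STOP_LEMMAS):
    """True if the word's lemma (or the word itself) is in the stop list."""
    lowered = word.lower()
    return lemmas.get(lowered, lowered) in stop_lemmas


def drop_stop_words(words):
    """Remove every word whose lemma is a stop word."""
    return [word for word in words if not is_stop_word(word)]


remaining = drop_stop_words(["любое", "рыба", "каких", "эхолоты"])
print(remaining)
```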