# Automatic Keyword Extraction From Any Text Document Using N-gram Rigid Collocation

This notebook implements the method described in the [reference article](https://www.researchgate.net/publication/272674061_Automatic_Keyword_Extraction_From_Any_Text_Document_Using_N-gram_Rigid_Collocation).

We use this [Wikipedia article](https://en.wikipedia.org/wiki/Mobile_phone) to test the algorithm.

In [176]:
import nltk
import spacy
import re
import statistics
from scipy import stats
from nltk import *
from collections import Counter
import itertools
import math
import json
from spacy import Language

### Prepare the document

In [177]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(' '.join(open("mobile_phone_wiki_article.txt", 'r').readlines()).lower())

### Pre-process the document

In [178]:
def remove_digits(text: str) -> str:
    return re.sub(r"\d+", "", text)


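# re-read the article, lowercase it and strip the digits
# (the spaCy pipeline `nlp` is already loaded in the previous cell)
with open("mobile_phone_wiki_article.txt", 'r') as f:
    text = ' '.join(f.readlines()).lower()
text = remove_digits(text)
# tokenize text
doc = nlp(text)
# replace punctuation with a '[PUNCT]' marker, drop stopwords and whitespace, keep lemmas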


word_list =  []

for token in doc:
    if token.is_punct:
        word_list.append('[PUNCT]')
    elif not token.is_stop and not token.is_space:
        word_list.append(token.lemma_)

print(f'the document has {len(word_list)} words')
print(word_list[:100])

the document has 4415 words
['mobile', 'phone', '[PUNCT]', 'cellular', 'phone', '[PUNCT]', 'cell', 'phone', '[PUNCT]', 'cellphone', '[PUNCT]', 'handphone', '[PUNCT]', 'hand', 'phone', '[PUNCT]', 'shorten', 'simply', 'mobile', '[PUNCT]', 'cell', 'phone', '[PUNCT]', 'portable', 'telephone', 'receive', 'call', 'radio', 'frequency', 'link', 'user', 'move', 'telephone', 'service', 'area', '[PUNCT]', 'radio', 'frequency', 'link', 'establish', 'connection', 'switching', 'system', 'mobile', 'phone', 'operator', '[PUNCT]', 'provide', 'access', 'public', 'switch', 'telephone', 'network', '[PUNCT]', 'pstn', '[PUNCT]', '[PUNCT]', 'modern', 'mobile', 'telephone', 'service', 'use', 'cellular', 'network', 'architecture', '[PUNCT]', '[PUNCT]', 'mobile', 'telephone', 'call', 'cellular', 'telephone', 'cell', 'phone', 'north', 'america', '[PUNCT]', 'addition', 'telephony', '[PUNCT]', 'digital', 'mobile', 'phone', '[PUNCT]', 'g', '[PUNCT]', 'support', 'variety', 'service', '[PUNCT]', 'text', 'messaging', 

## Definition
* A collocation is a set of words that occur together more often than would be expected by chance.
* A collocation is a syntactically and semantically independent unit.
* There are two types of collocation: rigid (the word order is fixed) and flexible (the word order is not fixed).
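
To make the rigid/flexible distinction concrete, here is a minimal sketch of how candidate pairs could be collected in each case. The function names `rigid_pairs` and `flexible_pairs` and the `max_distance` parameter are illustrative assumptions, not part of the reference article.

```python
from collections import Counter


def rigid_pairs(tokens):
    """Contiguous, ordered word pairs: candidates for rigid collocations."""
    return list(zip(tokens, tokens[1:]))


def flexible_pairs(tokens, max_distance=1):
    """Unordered word pairs at most `max_distance` tokens apart (an assumed
    reading of 'flexible': order is ignored)."""
    pairs = []
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + 1 + max_distance]:
            pairs.append(frozenset((left, right)))
    return pairs


tokens = ["mobile", "phone", "call", "phone", "mobile"]
rigid_counts = Counter(rigid_pairs(tokens))
flexible_counts = Counter(flexible_pairs(tokens))

# ('mobile', 'phone') and ('phone', 'mobile') are different rigid candidates,
# but they count as the same flexible candidate.
print(rigid_counts[("mobile", "phone")])
print(flexible_counts[frozenset(("mobile", "phone"))])
```

The rest of this notebook only deals with the rigid case, which is why it works on contiguous bigrams.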

## Motivation of the paper
The goal is to find top-ranked N-gram rigid collocations that can be used as keywords.

# Popular Methods to Extract Collocations

### Point-Wise Mutual Information

Goal: measure how closely two words are associated.

# $\log_{2}\left[\frac{P(w_{1},w_{2})}{P(w_{1}) \cdot P(w_{2})}\right]$

In [179]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_list)
print(f'here is the top 10 bigram with PMI measure : {finder.nbest(bigram_measures.pmi, 10)}')

here is the top 10 bigram with PMI measure : [('accord', 'federal'), ('activism', 'citizen'), ('advent', 'widespread'), ('affair', 'clandestine'), ('affect', 'overall'), ('airtime', 'street'), ('alongside', 'coffee'), ('article', 'reflect'), ('asahi', 'kasei'), ('assist', 'victim')]
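
As a worked illustration of the formula, the sketch below computes PMI directly from raw counts on a toy word list. The helper name `pmi` and the toy data are assumptions made for this example, and the probability estimates used here are not necessarily the ones NLTK uses internally.

```python
import math
from collections import Counter


def pmi(bigram, unigram_counts, bigram_counts):
    """log2 of P(w1, w2) / (P(w1) * P(w2)), estimated from raw counts."""
    n_words = sum(unigram_counts.values())
    n_bigrams = sum(bigram_counts.values())
    p_joint = bigram_counts[bigram] / n_bigrams
    p_w1 = unigram_counts[bigram[0]] / n_words
    p_w2 = unigram_counts[bigram[1]] / n_words
    return math.log2(p_joint / (p_w1 * p_w2))


toy_words = ["asahi", "kasei", "mobile", "phone",
             "mobile", "phone", "mobile", "phone"]
toy_unigrams = Counter(toy_words)
toy_bigrams = Counter(zip(toy_words, toy_words[1:]))

pmi_rare = pmi(("asahi", "kasei"), toy_unigrams, toy_bigrams)
pmi_frequent = pmi(("mobile", "phone"), toy_unigrams, toy_bigrams)

# A pair seen once, made of words seen once, scores higher than a pair seen
# three times made of frequent words.
print(pmi_rare > pmi_frequent)
```

On this toy data the pair that occurs only once gets the higher score, which illustrates the known tendency of PMI to favour rare pairs.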


### T-score
The t-score is used to test the plausibility of a hypothesis. 
We want to test the following hypothesis (the null hypothesis $H_{0}$): $P(w_{1}, w_{2}) = P(w_{1}) \cdot P(w_{2})$. 
In other words, we would like to see whether the two words occur independently of each other. 
## $t = \frac{\bar{x} - \mu}{\sqrt{\frac{s^{2}}{N}}}$

where:

$N$ is the total number of words / bigrams in the corpus

$H_{0}: P(w_{1}, w_{2}) = P(w_{1}) \cdot P(w_{2})$

$P(w_{1}) = f(w_{1})/N$

$P(w_{2}) = f(w_{2})/N$

$P(w_{1}, w_{2}) = f(w_{1}w_{2})/N$

$\mu = P(w_{1}) \cdot P(w_{2})$ (the probability expected under $H_{0}$)

$s^{2} = P(w_{1}, w_{2})(1 - P(w_{1}, w_{2}))$

$\bar{x} = P(w_{1}, w_{2})$

The t-score is then compared with a threshold, and a bigram whose score exceeds it is considered a collocation. The usual threshold is a critical value of the t distribution (for example 2.576 at a significance level of 0.005); the cell below compares the score with the probability expected under the null hypothesis instead.

In [180]:
bigrams = list(nltk.bigrams(word_list))
bigram_frequency = Counter([i for i in bigrams])
monogram_frequency = Counter(word_list)
def compute_t_score(bigram, total_frequency):
    x_mean = bigram[1]/total_frequency
    s_2 = x_mean * (1 - x_mean)
    mu = monogram_frequency[bigram[0][0]]/total_frequency
    null_hypothesis = mu * (monogram_frequency[bigram[0][1]]/total_frequency)
    t_score = x_mean - (mu/math.sqrt(s_2/total_frequency))
    return [bigram[0], True if t_score >= null_hypothesis else False]


t_score = [compute_t_score(bigram, sum(bigram_frequency.values())) for bigram in bigram_frequency.items()]
[t for t in t_score if t[1]]

[]
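
For comparison, here is a minimal sketch of the textbook form of the test, $t = (\bar{x} - \mu) / \sqrt{s^{2}/N}$ with $\mu = P(w_{1}) \cdot P(w_{2})$, checked against a critical value. The function name `t_score`, the toy counts and the choice of 2.576 as the threshold are assumptions for illustration; this is not the code that produced the output above.

```python
import math

CRITICAL_VALUE = 2.576  # assumed threshold: significance level 0.005


def t_score(f_w1w2, f_w1, f_w2, n):
    """t statistic for a bigram, assuming 0 < f_w1w2 < n.

    f_w1w2: bigram frequency, f_w1 / f_w2: word frequencies, n: corpus size.
    """
    x_bar = f_w1w2 / n                # observed bigram probability
    mu = (f_w1 / n) * (f_w2 / n)      # probability expected under H0
    s2 = x_bar * (1 - x_bar)          # Bernoulli variance
    return (x_bar - mu) / math.sqrt(s2 / n)


# Toy counts (hypothetical): one strongly associated pair, one weak pair.
t_strong = t_score(f_w1w2=60, f_w1=100, f_w2=120, n=1000)
t_weak = t_score(f_w1w2=3, f_w1=120, f_w2=80, n=1000)

print(t_strong >= CRITICAL_VALUE)  # independence is rejected
print(t_weak >= CRITICAL_VALUE)    # independence is not rejected
```

With these toy counts only the first pair exceeds the critical value, so only that pair would be kept as a collocation.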

## Proposed Methodology

There are three main steps:
1. extract the list of bigrams and arrange them in descending order of their FBI score
2. extract the n-grams and arrange them in descending order of their FNI score
3. check whether it is necessary to include monogram keywords

Before these three steps, we need to generate the bigrams and compute the monogram and bigram frequencies.

In [181]:
bigrams = list(nltk.bigrams(word_list))
monogram_frequency = Counter(word_list)
bigram_frequency = Counter([i for i in bigrams])
list(bigram_frequency.items())

[(('mobile', 'phone'), 103),
 (('phone', '[PUNCT]'), 38),
 (('[PUNCT]', 'cellular'), 2),
 (('cellular', 'phone'), 3),
 (('[PUNCT]', 'cell'), 12),
 (('cell', 'phone'), 15),
 (('[PUNCT]', 'cellphone'), 2),
 (('cellphone', '[PUNCT]'), 2),
 (('[PUNCT]', 'handphone'), 1),
 (('handphone', '[PUNCT]'), 1),
 (('[PUNCT]', 'hand'), 4),
 (('hand', 'phone'), 1),
 (('[PUNCT]', 'shorten'), 1),
 (('shorten', 'simply'), 1),
 (('simply', 'mobile'), 1),
 (('mobile', '[PUNCT]'), 4),
 (('[PUNCT]', 'portable'), 1),
 (('portable', 'telephone'), 2),
 (('telephone', 'receive'), 1),
 (('receive', 'call'), 1),
 (('call', 'radio'), 1),
 (('radio', 'frequency'), 2),
 (('frequency', 'link'), 2),
 (('link', 'user'), 1),
 (('user', 'move'), 1),
 (('move', 'telephone'), 1),
 (('telephone', 'service'), 6),
 (('service', 'area'), 2),
 (('area', '[PUNCT]'), 3),
 (('[PUNCT]', 'radio'), 1),
 (('link', 'establish'), 1),
 (('establish', 'connection'), 1),
 (('connection', 'switching'), 1),
 (('switching', 'system'), 1),
 (('

Usually, $F_{iw}(i)$ is the number of words that appear $i$ times. 
In our case, however, $F_{iw}(i)$ is computed differently for bigrams and monograms.

For monograms, every $F_{iw}(i)$ is replaced by the mean of the values at the other frequencies.

For bigrams, only $F_{iw}(1)$ is replaced by the mean of the values at the other frequencies.
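
The two variants can be illustrated on a toy word list. This is a minimal sketch that counts with a plain `Counter`; the variable names are assumptions, and it is not meant to reproduce the exact numbers printed in this notebook.

```python
import statistics
from collections import Counter

words = ["mobile", "phone", "mobile", "phone", "mobile",
         "call", "network", "radio"]

word_counts = Counter(words)                  # mobile: 3, phone: 2, others: 1
freq_of_freq = Counter(word_counts.values())  # F(1) = 3, F(2) = 1, F(3) = 1

# Bigram-style variant: only F(1) is replaced by the mean of the other values.
only_first_replaced = dict(freq_of_freq)
only_first_replaced[1] = statistics.mean(
    [v for k, v in freq_of_freq.items() if k != 1]
)

# Monogram-style variant: every F(i) is replaced by the mean of the others.
all_replaced = {
    k: statistics.mean([v for j, v in freq_of_freq.items() if j != k])
    for k in freq_of_freq
}

print(dict(freq_of_freq))
print(only_first_replaced)
print(all_replaced)
```

Replacing $F_{iw}(1)$ dampens the weight of items seen only once, which are usually by far the most numerous.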

In [182]:
def compute_cumul_freq_bigram(ngram_frequency):
    
    ngram_cumulative_freq = {}
    for i in ngram_frequency.values():
        if i not in ngram_cumulative_freq.keys():
            ngram_cumulative_freq[i] = 1
        ngram_cumulative_freq[i] += 1
    ngram_cumulative_freq[1] = statistics.mean([v for k,v in ngram_cumulative_freq.items() if k != 1]) 
    key_list = sorted(list(ngram_cumulative_freq.keys()), key=lambda x : x)
    cumul = 0
    for key in key_list :
        cumul += ngram_cumulative_freq[key]
        ngram_cumulative_freq[key] = cumul
    return ngram_cumulative_freq

def compute_cumul_freq_monogram(ngram_frequency):
    
    ngram_cumulative_freq = {}
    for i in ngram_frequency.values():
        if i not in ngram_cumulative_freq.keys():
            ngram_cumulative_freq[i] = 1
        ngram_cumulative_freq[i] += 1
    ngram_cumulative_freq = {k: statistics.mean([m_v for m_k, m_v in ngram_cumulative_freq.items() if m_k != k]) 
                             for k in ngram_cumulative_freq.keys()}
    key_list = sorted(list(ngram_cumulative_freq.keys()), key=lambda x : x)
    cumul = 0
    for key in key_list:
        cumul += ngram_cumulative_freq[key]
        ngram_cumulative_freq[key] = cumul

    
    return ngram_cumulative_freq


cumul_monogram = compute_cumul_freq_monogram(monogram_frequency)
cumul_bigram = compute_cumul_freq_bigram(bigram_frequency)
cumul_monogram

{152: 1199.84,
 187: 1251.9199999999998,
 1119: 1303.9999999999998,
 20: 939.4800000000001,
 32: 1095.68,
 5: 213.95999999999998,
 1: 21.44,
 14: 679.28,
 4: 162.64,
 3: 113.28,
 17: 783.24,
 19: 887.4000000000001,
 6: 264.96,
 12: 575.36,
 2: 64.92,
 21: 991.5200000000001,
 11: 523.4,
 9: 419.79999999999995,
 18: 835.32,
 45: 1147.76,
 10: 471.59999999999997,
 8: 368.03999999999996,
 7: 316.15999999999997,
 13: 627.36,
 26: 1043.6000000000001,
 16: 731.24}

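## High Word Occurrence (*HWO*)

HWO is defined as $HWO(w) = C_{iw}(n) / C_{iw}(N_{max})$, where $n$ is the frequency of the word $w$ and $N_{max}$ is the highest frequency observed in the document.

$C_{iw}$ is the cumulative frequency of an individual word.

Here, $C_{iw}(n) = \sum_{i=1}^{n}F_{iw}(i)$ 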
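
As a small worked example of the cumulative sum and the resulting ratio, the sketch below starts from a toy $F_{iw}$ table. The table values and the helper name `hwo_from_count` are assumptions for illustration only.

```python
from itertools import accumulate

# Toy F_iw: 3 words appear once, 1 word appears twice, 1 word appears 3 times.
freq_of_freq = {1: 3, 2: 1, 3: 1}

# C_iw(n): running sum of F_iw(i) over the frequencies in increasing order.
keys = sorted(freq_of_freq)
cumulative = dict(zip(keys, accumulate(freq_of_freq[k] for k in keys)))


def hwo_from_count(count, cumulative):
    """HWO of a word that appears `count` times: C(count) / C(N_max)."""
    return cumulative[count] / cumulative[max(cumulative)]


print(cumulative)
print(hwo_from_count(1, cumulative), hwo_from_count(3, cumulative))
```

The most frequent word always gets an HWO of 1, and the score decreases towards rarer words.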

In [183]:
def HWO(word, monogram_frequency, cumul_monogram):
    key_max = max(list(cumul_monogram.keys()))
    return cumul_monogram[monogram_frequency[word]]/cumul_monogram[key_max]

In [184]:
hwo = {word: HWO(word, monogram_frequency, cumul_monogram) for word in word_list}
hwo

{'mobile': 0.9201226993865032,
 'phone': 0.9600613496932516,
 '[PUNCT]': 1.0,
 'cellular': 0.7204601226993868,
 'cell': 0.8402453987730063,
 'cellphone': 0.164079754601227,
 'handphone': 0.016441717791411046,
 'hand': 0.5209202453987731,
 'shorten': 0.016441717791411046,
 'simply': 0.12472392638036811,
 'portable': 0.0868711656441718,
 'telephone': 0.6006441717791412,
 'receive': 0.016441717791411046,
 'call': 0.6805214723926383,
 'radio': 0.20319018404907976,
 'frequency': 0.4412269938650308,
 'link': 0.04978527607361964,
 'user': 0.5209202453987731,
 'move': 0.04978527607361964,
 'service': 0.7603680981595095,
 'area': 0.20319018404907976,
 'establish': 0.016441717791411046,
 'connection': 0.0868711656441718,
 'switching': 0.016441717791411046,
 'system': 0.4013803680981596,
 'operator': 0.4013803680981596,
 'provide': 0.5209202453987731,
 'access': 0.32193251533742334,
 'public': 0.12472392638036811,
 'switch': 0.0868711656441718,
 'network': 0.6405828220858897,
 'pstn': 0.016441717

## High Word Pair Occurrence (HWPO)

HWPO is the same computation as HWO, except that it uses the cumulative bigram frequencies.

In [185]:
def HWPO(bigram, bigram_frequency, cumul_bigram):
    key_max = max(list(cumul_bigram.keys()))
    return cumul_bigram[bigram_frequency[bigram]]/cumul_bigram[key_max]

In [186]:
hwpo = {bigram: HWPO(bigram, bigram_frequency, cumul_bigram) for bigram in list(bigrams)}
hwpo

{('mobile', 'phone'): 0.9959981167608286,
 ('phone', '[PUNCT]'): 0.9919962335216572,
 ('[PUNCT]', 'cellular'): 0.6398305084745762,
 ('cellular', 'phone'): 0.807909604519774,
 ('[PUNCT]', 'cell'): 0.973987758945386,
 ('cell', 'phone'): 0.9839924670433146,
 ('[PUNCT]', 'cellphone'): 0.6398305084745762,
 ('cellphone', '[PUNCT]'): 0.6398305084745762,
 ('[PUNCT]', 'handphone'): 0.05555555555555556,
 ('handphone', '[PUNCT]'): 0.05555555555555556,
 ('[PUNCT]', 'hand'): 0.8579331450094162,
 ('hand', 'phone'): 0.05555555555555556,
 ('[PUNCT]', 'shorten'): 0.05555555555555556,
 ('shorten', 'simply'): 0.05555555555555556,
 ('simply', 'mobile'): 0.05555555555555556,
 ('mobile', '[PUNCT]'): 0.8579331450094162,
 ('[PUNCT]', 'portable'): 0.05555555555555556,
 ('portable', 'telephone'): 0.6398305084745762,
 ('telephone', 'receive'): 0.05555555555555556,
 ('receive', 'call'): 0.05555555555555556,
 ('call', 'radio'): 0.05555555555555556,
 ('radio', 'frequency'): 0.6398305084745762,
 ('frequency', 'link'

# FBI Score
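We calculate the FBI score for each bigram in the corpus. As implemented in the cell below, it is defined as:

$FBI(w_{1}, w_{2}) = HWPO(w_{1}, w_{2}) \cdot (1 - \alpha \cdot (HWO(w_{1}) + HWO(w_{2})))$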

In [187]:
# alpha is chosen between 0 and 0.5
def FBI(bigram, alpha, hwpo, hwo):
    return hwpo[bigram]*(1 - (alpha * (hwo[bigram[0]] + hwo[bigram[1]])))
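
As a worked example of this arithmetic, the sketch below plugs in the HWO and HWPO values printed earlier in this notebook for `('mobile', 'phone')`, with `alpha = 0.01`. The helper name `fbi_score` is an assumption made for the example.

```python
def fbi_score(hwpo_pair, hwo_first, hwo_second, alpha):
    """HWPO of the pair, discounted by the HWO of its two words."""
    return hwpo_pair * (1 - alpha * (hwo_first + hwo_second))


# Values copied from the HWO and HWPO outputs above.
hwo_mobile = 0.9201226993865032
hwo_phone = 0.9600613496932516
hwpo_mobile_phone = 0.9959981167608286

fbi_mobile_phone = fbi_score(hwpo_mobile_phone, hwo_mobile, hwo_phone, alpha=0.01)
print(fbi_mobile_phone)  # about 0.9773
```

A larger `alpha` penalises bigrams made of very frequent words more strongly; with `alpha = 0` the FBI score reduces to HWPO.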

In [188]:
fbi_bigrams = {bigram:FBI(bigram, 0.01, hwpo, hwo) for bigram in bigrams if FBI(bigram, 0.01, hwpo, hwo) > 0.5}
fbi_bigrams

{('mobile', 'phone'): 0.9772715190403567,
 ('phone', '[PUNCT]'): 0.9725524987579864,
 ('[PUNCT]', 'cellular'): 0.6288224797234064,
 ('cellular', 'phone'): 0.7943325101382968,
 ('[PUNCT]', 'cell'): 0.9560639940267812,
 ('cell', 'phone'): 0.9662775842547341,
 ('[PUNCT]', 'cellphone'): 0.6323823710616616,
 ('cellphone', '[PUNCT]'): 0.6323823710616616,
 ('[PUNCT]', 'hand'): 0.8448846661149816,
 ('mobile', '[PUNCT]'): 0.8414597759465299,
 ('portable', 'telephone'): 0.6354315755953,
 ('radio', 'frequency'): 0.6357073307684309,
 ('frequency', 'link'): 0.6366888621711552,
 ('telephone', 'service'): 0.905467826396832,
 ('service', 'area'): 0.6336653686180721,
 ('area', '[PUNCT]'): 0.7981889154622024,
 ('system', 'mobile'): 0.797233054313542,
 ('phone', 'operator'): 0.7969103861217982,
 ('operator', '[PUNCT]'): 0.7965877179300545,
 ('[PUNCT]', 'provide'): 0.919764004063984,
 ('provide', 'access'): 0.6344376793698658,
 ('network', '[PUNCT]'): 0.9029014766602544,
 ('[PUNCT]', '[PUNCT]'): 0.98,
 ('

# FNI Score

We then calculate the FNI score for each n-gram present in the corpus.
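
The core idea of this step, as implemented in this notebook, is that two consecutive bigrams sharing a word are merged into a longer n-gram whose score is the mean of the two bigram scores. The sketch below shows that idea in a simplified form on toy data; the function name `merge_adjacent_bigrams` and the scores are assumptions, and the traversal is simpler than the one in the cell that follows.

```python
import statistics


def merge_adjacent_bigrams(scored_bigrams):
    """Merge consecutive bigrams (a, b), (b, c) into (a, b, c).

    The merged trigram is scored with the mean of the two bigram scores.
    """
    trigrams = []
    for (left, left_score), (right, right_score) in zip(scored_bigrams,
                                                        scored_bigrams[1:]):
        if left[1] == right[0]:
            merged = left + (right[1],)
            trigrams.append((merged, statistics.mean([left_score, right_score])))
    return trigrams


# Toy scored bigrams in document order (scores are made up).
scored = [(("radio", "frequency"), 0.6),
          (("frequency", "link"), 0.8),
          (("sim", "card"), 0.9)]

trigrams = merge_adjacent_bigrams(scored)
print(trigrams)
```

Only the first two bigrams overlap, so a single trigram is produced and `('sim', 'card')` stays a bigram.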

In [189]:
def FNI(ngrams):
    
    last_ngram = ngrams[-1]
    conserved_ngram = ngrams[-1]
    trigram = []
    it = iter(last_ngram)
    for (bigram, score) in it:
        next_bigram, next_score = next(it, (None, None))
        if next_bigram is None:
            break
        if bigram[1] == next_bigram[0]:
            trigram.append((tuple(list(bigram) + [next_bigram[-1]]), statistics.mean([score, next_score])))
            conserved_ngram.remove((bigram, score))
            conserved_ngram.remove((next_bigram, next_score))
        else:
            it = itertools.chain(it, [(next_bigram, next_score)])
        
    ngrams[-1] = [ngram for ngram in conserved_ngram if '[PUNCT]' not in ngram[0]]
    ngrams.append(trigram)
    if trigram != []:
        FNI(ngrams)
    return ngrams[:-1]
        
fni_ngrams = FNI([list(fbi_bigrams.items())])
fni_ngrams

[[(('cellular', 'phone'), 0.7943325101382968),
  (('portable', 'telephone'), 0.6354315755953),
  (('radio', 'frequency'), 0.6357073307684309),
  (('frequency', 'link'), 0.6366888621711552),
  (('telephone', 'service'), 0.905467826396832),
  (('system', 'mobile'), 0.797233054313542),
  (('phone', 'operator'), 0.7969103861217982),
  (('provide', 'access'), 0.6344376793698658),
  (('modern', 'mobile'), 0.6328934504003327),
  (('mobile', 'telephone'), 0.9040013408330734),
  (('service', 'use'), 0.6293337553291047),
  (('cellular', 'network'), 0.8778356335857799),
  (('north', 'america'), 0.6391934257044816),
  (('wireless', 'communication'), 0.635965225642092),
  (('video', 'game'), 0.8055661710166026),
  (('phone', 'offer'), 0.7985115836539461),
  (('feature', 'phone'), 0.9587866782130602),
  (('advanced', 'computing'), 0.63846213476136),
  (('development', 'metal'), 0.6374806033586358),
  (('cellular', 'networking'), 0.6349022434231049),
  (('networking', 'lead'), 0.6379606725070187),
  

# Extract Highly Relevant Monograms
Finally, we extract the highly relevant monograms and n-grams.

In [190]:
def extract_high_relevant_monogram(monograms, highest_freq_bigram):
    high_relevant_monograms = []
    for gram in monograms.items():
        if gram[1] > highest_freq_bigram * 8 and gram[0] != '[PUNCT]':
            high_relevant_monograms.append(gram)
    return high_relevant_monograms

highest_frequency = sorted(bigram_frequency.items(), key=lambda x : x[1], reverse=True)[0][1]
highest_monogram = extract_high_relevant_monogram(monogram_frequency, highest_frequency)

## Extract Highly Relevant N-grams

In [191]:
top_ranked_bigram = sorted(fni_ngrams[0], key=lambda x: x[1], reverse=True)[:11]
top_bigram_mean = statistics.mean(bigram[1] for bigram in top_ranked_bigram)
highest_ngrams = [ngram for ngrams in fni_ngrams[1:] for ngram in ngrams if ngram[1] >= top_bigram_mean]

## Visualize the Result

We extract the top-ranked bigrams and n-grams.

In [192]:
top_ranked_bigram = sorted(fni_ngrams[0], key=lambda x: x[1], reverse=True)[:11]
top_bigram_mean = statistics.mean(bigram[1] for bigram in top_ranked_bigram)

ngrams = [ngram for ngrams in fni_ngrams[1:] for ngram in ngrams if ngram[1]]
top_ranked_ngram = sorted(ngrams, key=lambda x: x[1], reverse=True)[:11]

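The final output lists, in order: the highly relevant monograms, the highly relevant n-grams, the top-ranked bigrams and the top-ranked n-grams (at most 11 each for the last two, given the slices above).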

In [193]:
print('-'* 30)
print('Top Monograms :')
print(highest_monogram)
print('-'* 30)
print('Top Highest Ngrams :')
print(highest_ngrams)
print('-'* 30)
print('Top Bigram :')
print('-'* 30)
print('\n'.join([' '.join(bigram[0]) for bigram in top_ranked_bigram]))
print('-' * 30)
print('Top Ngram :')
print('-'* 30)
print('\n'.join([' '.join(ngram[0]) for ngram in top_ranked_ngram]))

------------------------------
Top Monograms :
[]
------------------------------
Top Highest Ngrams :
[]
------------------------------
Top Bigram :
------------------------------
phone use
sim card
feature phone
use mobile
telephone service
mobile telephone
cellular network
phone subscriber
screen size
mobile app
mobile operator
------------------------------
Top Ngram :
------------------------------
content mobile banking
commercially available handheld
martin cooper motorola
john f. mitchell
lead development affordable
nippon telegraph telephone
voice calling text
information theory cellular
greatly increase capacity
state ban texting


# Using the implementation on several texts

In [194]:
def automatic_keyword_extraction(word_list:list, alpha:float, threshold:float):
    bigrams = list(nltk.bigrams(word_list))
    monogram_frequency = Counter(word_list)
    bigram_frequency = Counter([i for i in bigrams])
    
    cumul_monogram = compute_cumul_freq_monogram(monogram_frequency)
    cumul_bigram = compute_cumul_freq_bigram(bigram_frequency)
    
    hwo = {word: HWO(word, monogram_frequency, cumul_monogram) for word in word_list}
    hwpo = {bigram: HWPO(bigram, bigram_frequency, cumul_bigram) for bigram in list(bigrams)}
    
    # Step 1
    fbi_bigrams = {bigram:FBI(bigram, alpha, hwpo, hwo) for bigram in bigrams if FBI(bigram, alpha, hwpo, hwo) > threshold}
    
    # Step 2
    fni_ngrams = FNI([list(fbi_bigrams.items())])
    
    # Step 3
    highest_frequency = sorted(bigram_frequency.items(), key=lambda x : x[1], reverse=True)[0][1]
    highest_monogram = extract_high_relevant_monogram(monogram_frequency, highest_frequency)
    
    top_ranked_bigram = sorted(fni_ngrams[0], key=lambda x: x[1], reverse=True)[:11]
    top_bigram_mean = statistics.mean(bigram[1] for bigram in top_ranked_bigram)
    
    ngrams = [ngram for ngrams in fni_ngrams[1:] for ngram in ngrams if ngram[1]]
    highest_ngrams = [ngram for ngrams in fni_ngrams[1:] for ngram in ngrams if ngram[1] >= top_bigram_mean]
    top_ranked_ngram = sorted(ngrams, key=lambda x: x[1], reverse=True)[:11]
    
    return {'monograms': highest_monogram, 
            'highest n-grams': highest_ngrams, 
            'top ranked bigrams': top_ranked_bigram,
            'top ranked n-grams': top_ranked_ngram}

automatic_keyword_extraction(word_list, 0.02, 0.5)

{'monograms': [],
 'highest n-grams': [],
 'top ranked bigrams': [(('sim', 'card'), 0.9521405562198884),
  (('phone', 'use'), 0.9477767148452394),
  (('feature', 'phone'), 0.9435855974807342),
  (('use', 'mobile'), 0.9292737470105022),
  (('telephone', 'service'), 0.8929742591966772),
  (('mobile', 'telephone'), 0.8900412880691599),
  (('cellular', 'network'), 0.8657230562487723),
  (('phone', 'subscriber'), 0.865715958502305),
  (('screen', 'size'), 0.8475674177960325),
  (('mobile', 'app'), 0.8359395288147147),
  (('mobile', 'operator'), 0.8352579193528821)],
 'top ranked n-grams': [(('content', 'mobile', 'banking'), 0.8295394360969579),
  (('commercially', 'available', 'handheld'), 0.8226015495924462),
  (('martin', 'cooper', 'motorola'), 0.8041624900350075),
  (('john', 'f.', 'mitchell'), 0.638556342934387),
  (('lead', 'development', 'affordable'), 0.6353605269314755),
  (('nippon', 'telegraph', 'telephone'), 0.635031779661017),
  (('voice', 'calling', 'text'), 0.6348682892274098)

# Applying the keyword extraction to the HAS corpus

### Load the French spaCy model

In [195]:
nlp = spacy.load('fr_core_news_sm')

In [196]:
with open('processed-HAS.json', 'r') as f:
    has_json = json.load(f)

In [207]:
data = {}
url_c = 0
for url, article in has_json.items():
    text = remove_digits(article)
    # tokenize text
    doc = nlp(text)
    # remove stopwords
    word_list =  []

    for token in doc:
        if token.is_punct:
            word_list.append('[PUNCT]')
        elif not token.is_stop and not token.is_space:
            word_list.append(token.lemma_)

    if len(word_list) > 70:
        data[url] = [article, automatic_keyword_extraction(word_list, 0.02, 0.3)]
        url_c += 1
print(f'url : {len(has_json)}')
print(f'url processed : {url_c}')
with open('has_keywords_result.json', 'w') as f:
    json.dump(data, f, indent=4, sort_keys=True)

url : 377
url processed : 175


In [211]:
def join_ngrams(scored_ngrams):
    # turn [(('a', 'b'), score), ...] into the string 'a b; ...'
    return '; '.join(' '.join(ngram) for ngram, _ in scored_ngrams)

# keep only the articles for which all three n-gram lists are non-empty
list_items = [site for site in data.items()
              if site[1][1]['highest n-grams'] and site[1][1]['top ranked bigrams'] and site[1][1]['top ranked n-grams']]
url_list = [site[0] for site in list_items]
highest_ngram_list = [join_ngrams(site[1][1]['highest n-grams']) for site in list_items]
bigram_list = [join_ngrams(site[1][1]['top ranked bigrams']) for site in list_items]
ngram_list = [join_ngrams(site[1][1]['top ranked n-grams']) for site in list_items]

import pandas as pd
df = pd.DataFrame([element for element in zip(url_list, highest_ngram_list, bigram_list, ngram_list)],columns=['url','mots-clés très pertinent', 'mots-clés avec 2 mots', 'mots-clés avec plusieurs mots'])
df.to_csv('has_mot_cle.csv')
df

Unnamed: 0,url,mots-clés très pertinent,mots-clés avec 2 mots,mots-clés avec plusieurs mots
0,https://www.has-sante.fr/jcms/c_1009982/fr/bio...,recherche mutation ga,ga gène; gène facteur; epreuve duke; duke test...,recherche mutation ga; titrage inhibiteur cont...
1,https://www.has-sante.fr/jcms/c_1164340/fr/eva...,condition réalisation acte,prise charge; groupe travail; sécurité acte; o...,condition réalisation acte; traiter balance bé...
2,https://www.has-sante.fr/jcms/c_1173766/fr/l-h...,interrogation commission transparence; amélior...,répondre interrogation; transparence porter; p...,interrogation commission transparence; toléran...
3,https://www.has-sante.fr/jcms/c_1237001/fr/syn...,protocole national diagnostic; professionnel c...,explicite professionnel; prise charge; charge ...,protocole national diagnostic; professionnel c...
4,https://www.has-sante.fr/jcms/c_1261551/fr/del...,évaluation information donner; issu loi numéro,information délivrer; régir règle; santé publi...,issu loi numéro; évaluation information donner...
...,...,...,...,...
113,https://www.has-sante.fr/jcms/r_1439925/fr/les...,traitement reflux gastro-oesophagien; traiteme...,gastroduodénale devoir; devoir anti-inflammato...,anti-inflammatoire non stéroïdien; traitement ...
114,https://www.has-sante.fr/jcms/r_1440019/fr/que...,patient symptomatique dépit,stratégie thérapeutique\xa; l\'amélioration se...,patient symptomatique dépit
115,https://www.has-sante.fr/jcms/r_1495740/fr/pri...,deprévention secondaire aspirine,immédiat accident; accident ischémique; patien...,deprévention secondaire aspirine; diagnostique...
116,https://www.has-sante.fr/jcms/r_1497412/fr/rea...,note cadrage précise; réaliser outil aide; for...,suite réadaptation; aide décision; décision fo...,patient soin suite; forme grille pertinence; n...
