# Automatic Keyword Extraction From Any Text Document Using N-gram Rigid Collocation

Here is the [reference article](https://www.researchgate.net/publication/272674061_Automatic_Keyword_Extraction_From_Any_Text_Document_Using_N-gram_Rigid_Collocation).

We use this [Wikipedia article](https://en.wikipedia.org/wiki/Mobile_phone) to test the algorithm.

In [99]:
import nltk
import spacy
import re
import statistics
from scipy import stats
from nltk.collocations import BigramCollocationFinder
from collections import Counter
import itertools
import math
import json

### prepare document

In [100]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(' '.join(open("mobile_phone_wiki_article.txt", 'r').readlines()).lower())

### pre-process the document

In [101]:
def remove_digits(text: str) -> str:
    return re.sub(r"\d+|[^\w\s]", "", text)

nlp = spacy.load('en_core_web_sm')

text = ' '.join(open("mobile_phone_wiki_article.txt", 'r').readlines()).lower()
text = remove_digits(text)
# tokenize text
doc = nlp(text)
# remove stopwords
word_list =  [token.text for token in doc if not token.is_stop and not token.is_space]
print(f'the document has {len(word_list)} words')
print(word_list[:100])

the document has 3211 words
['mobile', 'phone', 'cellular', 'phone', 'cell', 'phone', 'cellphone', 'handphone', 'hand', 'phone', 'shortened', 'simply', 'mobile', 'cell', 'phone', 'portable', 'telephone', 'receive', 'calls', 'radio', 'frequency', 'link', 'user', 'moving', 'telephone', 'service', 'area', 'radio', 'frequency', 'link', 'establishes', 'connection', 'switching', 'systems', 'mobile', 'phone', 'operator', 'provides', 'access', 'public', 'switched', 'telephone', 'network', 'pstn', 'modern', 'mobile', 'telephone', 'services', 'use', 'cellular', 'network', 'architecture', 'mobile', 'telephones', 'called', 'cellular', 'telephones', 'cell', 'phones', 'north', 'america', 'addition', 'telephony', 'digital', 'mobile', 'phones', 'g', 'support', 'variety', 'services', 'text', 'messaging', 'mms', 'email', 'internet', 'access', 'shortrange', 'wireless', 'communications', 'infrared', 'bluetooth', 'business', 'applications', 'video', 'games', 'digital', 'photography', 'mobile', 'phones', 'o
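Tokens such as `'g'` and `'shortrange'` in the output above presumably come from the cleaning step, which deletes digits and punctuation without inserting a space. A minimal sketch of that effect, using the same pattern as `remove_digits` on a hypothetical sample string:

```python
import re

def remove_digits(text: str) -> str:
    # Same pattern as the notebook: drop digit runs and any character
    # that is neither a word character nor whitespace.
    return re.sub(r"\d+|[^\w\s]", "", text)

sample = "4g short-range wi-fi, 2019!"  # hypothetical sample, not taken from the article
cleaned = remove_digits(sample)
print(repr(cleaned))
```

Hyphenated words are glued together and a term like "4g" loses its digit, so it may be worth replacing punctuation with a space instead if such tokens matter.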

## Definition
* A collocation is a set of words that occur together more often than would be expected by chance.
* A collocation is a syntactically and semantically independent unit.
* There are two types of collocation: rigid (the word order is fixed) and flexible (the word order is not fixed).
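The difference between the two types can be illustrated on a toy sentence. This is only a sketch: the window-based counting used for the flexible case is an assumption made for illustration, and the token list is invented.

```python
from collections import Counter

tokens = "strong tea very strong tea strong black tea".split()  # toy text

# Rigid: the two words must be adjacent and appear in this exact order.
rigid = Counter(zip(tokens, tokens[1:]))

# Flexible (assumed definition): the two words co-occur within a small
# window, in any order, so the pair is stored as an unordered frozenset.
window = 3
flexible = Counter()
for i, left in enumerate(tokens):
    for right in tokens[i + 1 : i + window]:
        flexible[frozenset((left, right))] += 1

rigid_count = rigid[("strong", "tea")]
flexible_count = flexible[frozenset(("strong", "tea"))]
print(rigid_count, flexible_count)
```

The rigid count only sees "strong tea" when the words touch, while the flexible count also picks up "tea ... strong" and "strong black tea".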

## Motivation of the paper
Find the top-ranked n-gram rigid collocations, which could be used as keywords.

# Popular Methods to Extract Collocations

### Point-Wise Mutual Information

Goal: measure how closely two words are associated.

# $\log_{2}\left[\frac{P(w_{1},w_{2})}{P(w_{1}) \cdot P(w_{2})}\right]$

In [102]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_list)
print(f'here is the top 10 bigram with PMI measure : {finder.nbest(bigram_measures.pmi, 10)}')

here is the top 10 bigram with PMI measure : [('according', 'federal'), ('activate', 'microphones'), ('activism', 'citizen'), ('address', 'gave'), ('administration', 'nhtsa'), ('adoption', 'mosfetbased'), ('adults', 'prefer'), ('advent', 'widespread'), ('affairs', 'clandestine'), ('affects', 'overall')]
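The ten pairs above are in alphabetical order, which suggests they are tied at the maximum score. PMI is known to favour rare pairs, and a small hand computation shows why. The sketch below is not NLTK's implementation; the `pmi` helper and the toy tokens are assumptions for illustration.

```python
import math
from collections import Counter

def pmi(pair, unigram_counts, bigram_counts):
    """PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), estimated from raw counts."""
    n_words = sum(unigram_counts.values())
    n_pairs = sum(bigram_counts.values())
    p_pair = bigram_counts[pair] / n_pairs
    p_w1 = unigram_counts[pair[0]] / n_words
    p_w2 = unigram_counts[pair[1]] / n_words
    return math.log2(p_pair / (p_w1 * p_w2))

tokens = ("mobile phone mobile phone mobile phone mobile network "
          "phone operator north america").split()  # toy text
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

frequent = pmi(("mobile", "phone"), unigrams, bigrams)  # occurs 3 times
rare = pmi(("north", "america"), unigrams, bigrams)     # occurs once
print(frequent, rare)
```

The pair that occurs only once scores higher than the pair that occurs three times, because both of its words are themselves rare.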


### T-score
The t-score is used to test the plausibility of a hypothesis. 
We want to test the following hypothesis (the null hypothesis $H_{0}$) : $P(w_{1}, w_{2}) = P(w_{1}) \cdot P(w_{2})$. 
In other words, we would like to see whether the two words occur independently or not. 
## $t = \frac{\bar{x} - \mu}{\sqrt{\frac{s^{2}}{N}}}$

where :

$N$ is the total number of words / bigrams in the corpus

$H_{0} = P(w_{1}) \cdot P(w_{2})$

$P(w_{1}) = f(w_{1})/N$

$P(w_{2}) = f(w_{2})/N$

$P(w_{1}, w_{2}) = f(w_{1}w_{2})/N$

$\mu = P(w_{1})$

$s^{2} = P(w_{1}, w_{2})(1 - P(w_{1}, w_{2}))$

$\bar{x} = P(w_{1}, w_{2})$

We then compare the t-score with the null hypothesis value $P(w_{1}) \cdot P(w_{2})$. If the t-score is greater than this value, the bigram is considered a collocation.

In [103]:
bigrams = list(nltk.bigrams(word_list))
bigram_frequency = Counter(bigrams)
monogram_frequency = Counter(word_list)

def compute_t_score(bigram, total_frequency):
    # bigram is a ((w1, w2), count) item taken from bigram_frequency
    x_mean = bigram[1] / total_frequency                      # observed P(w1, w2)
    s_2 = x_mean * (1 - x_mean)                               # sample variance
    mu = monogram_frequency[bigram[0][0]] / total_frequency   # P(w1)
    # value expected under H0: P(w1) * P(w2)
    null_hypothesis = mu * (monogram_frequency[bigram[0][1]] / total_frequency)
    # t = (x_mean - mu) / sqrt(s^2 / N)
    t_score = (x_mean - mu) / math.sqrt(s_2 / total_frequency)
    return [bigram[0], t_score >= null_hypothesis]


t_score = [compute_t_score(bigram, sum(bigram_frequency.values())) for bigram in bigram_frequency.items()]
[t for t in t_score if t[1]]

[]
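The empty list is expected with this choice of $\mu$: a bigram cannot occur more often than its first word, so $\bar{x}$ never exceeds $\mu = P(w_{1})$ and the score cannot reach the positive threshold. For comparison, the conventional collocation t-test (as presented by Manning and Schütze) sets $\mu = P(w_{1}) \cdot P(w_{2})$ and compares $t$ with a critical value. The sketch below follows that convention, which is an assumption and may differ from the referenced article; `t_score`, `CRITICAL_VALUE` and the toy tokens are illustrative names.

```python
import math
from collections import Counter

CRITICAL_VALUE = 2.576  # a common choice (significance level 0.005); an assumption here

def t_score(pair, unigram_counts, bigram_counts):
    """Textbook form: t = (x_bar - mu) / sqrt(s^2 / N)."""
    n = sum(bigram_counts.values())
    x_bar = bigram_counts[pair] / n                                     # observed P(w1, w2)
    mu = (unigram_counts[pair[0]] / n) * (unigram_counts[pair[1]] / n)  # P(w1) * P(w2) under H0
    s2 = x_bar * (1 - x_bar)                                            # Bernoulli variance
    return (x_bar - mu) / math.sqrt(s2 / n)

# Toy text: "mobile phone" repeated 20 times, separated by unique filler words.
tokens = []
for i in range(20):
    tokens += ["mobile", "phone", f"x{i}", f"y{i}"]
unigrams = Counter(tokens)
pairs = Counter(zip(tokens, tokens[1:]))

strong = t_score(("mobile", "phone"), unigrams, pairs)
weak = t_score(("phone", "x0"), unigrams, pairs)
print(strong > CRITICAL_VALUE, weak > CRITICAL_VALUE)
```

On this toy text only the repeated pair clears the critical value; the pair seen once does not.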

## Proposed Methodology

There are three main steps:
1. Extract the list of bigrams and arrange them in descending order of their FBI score.
2. Extract the n-grams and arrange them in descending order of their FNI score.
3. Check whether it is necessary to include monogram keywords.

Before these three steps, we need to generate the bigrams and compute the monogram and bigram frequencies.

In [104]:
bigrams = list(nltk.bigrams(word_list))
monogram_frequency = Counter(word_list)
bigram_frequency = Counter([i for i in bigrams])
list(bigram_frequency.items())

[(('mobile', 'phone'), 59),
 (('phone', 'cellular'), 1),
 (('cellular', 'phone'), 1),
 (('phone', 'cell'), 2),
 (('cell', 'phone'), 10),
 (('phone', 'cellphone'), 1),
 (('cellphone', 'handphone'), 1),
 (('handphone', 'hand'), 1),
 (('hand', 'phone'), 1),
 (('phone', 'shortened'), 1),
 (('shortened', 'simply'), 1),
 (('simply', 'mobile'), 1),
 (('mobile', 'cell'), 1),
 (('phone', 'portable'), 1),
 (('portable', 'telephone'), 2),
 (('telephone', 'receive'), 1),
 (('receive', 'calls'), 1),
 (('calls', 'radio'), 1),
 (('radio', 'frequency'), 2),
 (('frequency', 'link'), 2),
 (('link', 'user'), 1),
 (('user', 'moving'), 1),
 (('moving', 'telephone'), 1),
 (('telephone', 'service'), 5),
 (('service', 'area'), 2),
 (('area', 'radio'), 1),
 (('link', 'establishes'), 1),
 (('establishes', 'connection'), 1),
 (('connection', 'switching'), 1),
 (('switching', 'systems'), 1),
 (('systems', 'mobile'), 3),
 (('phone', 'operator'), 3),
 (('operator', 'provides'), 1),
 (('provides', 'access'), 1),
 ((

Usually, $F_{iw}(i)$ is the number of words that appear $i$ times. 
In our case, however, $F_{iw}(i)$ is computed differently for bigrams and monograms.

For monograms, every $F_{iw}(i)$ is replaced by the mean of the values for the other frequencies.

For bigrams, only $F_{iw}(1)$ is replaced by the mean of the values for the other frequencies.

In [105]:
def compute_cumul_freq_bigram(ngram_frequency):
    # Count how many bigrams share each frequency value.
    # Note: a value seen for the first time is set to 1 and then incremented,
    # so every count ends up one higher than the number of bigrams.
    ngram_cumulative_freq = {}
    for i in ngram_frequency.values():
        if i not in ngram_cumulative_freq.keys():
            ngram_cumulative_freq[i] = 1
        ngram_cumulative_freq[i] += 1
    # Replace the count for frequency 1 by the mean of the other counts.
    ngram_cumulative_freq[1] = statistics.mean([v for k, v in ngram_cumulative_freq.items() if k != 1])
    # Accumulate the counts in ascending order of frequency.
    key_list = sorted(ngram_cumulative_freq)
    cumul = 0
    for key in key_list:
        cumul += ngram_cumulative_freq[key]
        ngram_cumulative_freq[key] = cumul
    return ngram_cumulative_freq

def compute_cumul_freq_monogram(ngram_frequency):
    # Count how many words share each frequency value (same counting as above,
    # so every count is again one higher than the number of words).
    ngram_cumulative_freq = {}
    for i in ngram_frequency.values():
        if i not in ngram_cumulative_freq.keys():
            ngram_cumulative_freq[i] = 1
        ngram_cumulative_freq[i] += 1
    # Replace every count by the mean of the counts of the other frequencies.
    ngram_cumulative_freq = {k: statistics.mean([m_v for m_k, m_v in ngram_cumulative_freq.items() if m_k != k])
                             for k in ngram_cumulative_freq.keys()}
    # Accumulate the values in ascending order of frequency.
    key_list = sorted(ngram_cumulative_freq)
    cumul = 0
    for key in key_list:
        cumul += ngram_cumulative_freq[key]
        ngram_cumulative_freq[key] = cumul
    return ngram_cumulative_freq


cumul_monogram = compute_cumul_freq_monogram(monogram_frequency)
cumul_bigram = compute_cumul_freq_bigram(bigram_frequency)
cumul_monogram

{150: 1519.9999999999995,
 104: 1447.7142857142853,
 20: 1086.2857142857142,
 27: 1230.8571428571427,
 2: 86.28571428571428,
 1: 24.61904761904762,
 5: 294.42857142857144,
 4: 223.52380952380952,
 3: 153.8095238095238,
 15: 941.7619047619047,
 13: 869.5714285714286,
 7: 437.0,
 8: 508.76190476190476,
 10: 652.9047619047619,
 38: 1303.1428571428569,
 6: 365.61904761904765,
 82: 1375.428571428571,
 23: 1158.5714285714284,
 9: 580.7619047619048,
 11: 725.0952380952381,
 12: 797.3809523809524,
 16: 1014.0}

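## High Word Occurrence (*HWO*)

HWO is defined as $HWO(w) = C_{iw}(n) / C_{iw}(N_{max})$, where $n$ is the frequency of the word $w$ and $N_{max}$ is the highest word frequency in the document.

$C_{iw}$ is the cumulative frequency of an individual word.

Here, $C_{iw}(n) = \sum_{i=1}^{n}F_{iw}(i)$ 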
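A toy example may make the definition easier to follow. The sketch below uses the usual $F_{iw}(i)$ (the number of words occurring exactly $i$ times), not the mean-based substitution used in this notebook, so its numbers are not comparable with the notebook's outputs; the token list and the `hwo` helper are hypothetical.

```python
from collections import Counter
from itertools import accumulate

tokens = "a a a b b c d e".split()  # toy corpus
word_freq = Counter(tokens)         # a: 3, b: 2, c: 1, d: 1, e: 1

# F(i): how many distinct words occur exactly i times.
freq_of_freq = Counter(word_freq.values())

# C(n): running sum of F(i) over the observed frequencies i <= n.
levels = sorted(freq_of_freq)
cumulative = dict(zip(levels, accumulate(freq_of_freq[i] for i in levels)))

def hwo(word):
    # C(frequency of the word) divided by C(highest frequency)
    return cumulative[word_freq[word]] / cumulative[levels[-1]]

print(cumulative)
print(hwo("a"), hwo("b"), hwo("c"))
```

The most frequent word always gets an HWO of 1.0, and words that share a frequency share a score.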

In [106]:
def HWO(word, monogram_frequency, cumul_monogram):
    key_max = max(list(cumul_monogram.keys()))
    return cumul_monogram[monogram_frequency[word]]/cumul_monogram[key_max]

In [107]:
hwo = {word: HWO(word, monogram_frequency, cumul_monogram) for word in word_list}
hwo

{'mobile': 1.0,
 'phone': 0.9524436090225564,
 'cellular': 0.7146616541353386,
 'cell': 0.8097744360902257,
 'cellphone': 0.0567669172932331,
 'handphone': 0.016196741854636598,
 'hand': 0.19370300751879707,
 'shortened': 0.016196741854636598,
 'simply': 0.14705513784461158,
 'portable': 0.1011904761904762,
 'telephone': 0.6195802005012533,
 'receive': 0.016196741854636598,
 'calls': 0.5720864661654137,
 'radio': 0.19370300751879707,
 'frequency': 0.2875000000000001,
 'link': 0.0567669172932331,
 'user': 0.3347117794486216,
 'moving': 0.016196741854636598,
 'service': 0.5720864661654137,
 'area': 0.14705513784461158,
 'establishes': 0.016196741854636598,
 'connection': 0.0567669172932331,
 'switching': 0.016196741854636598,
 'systems': 0.2875000000000001,
 'operator': 0.19370300751879707,
 'provides': 0.0567669172932331,
 'access': 0.2875000000000001,
 'public': 0.14705513784461158,
 'switched': 0.016196741854636598,
 'network': 0.4295426065162909,
 'pstn': 0.016196741854636598,
 'mode

## High Word Pair Occurrence (HWPO)

HWPO is the same computation as HWO, except that it uses the cumulative bigram frequencies.

In [108]:
def HWPO(bigram, bigram_frequency, cumul_bigram):
    key_max = max(list(cumul_bigram.keys()))
    return cumul_bigram[bigram_frequency[bigram]]/cumul_bigram[key_max]

In [109]:
hwpo = {bigram: HWPO(bigram, bigram_frequency, cumul_bigram) for bigram in list(bigrams)}
hwpo

{('mobile', 'phone'): 1.0,
 ('phone', 'cellular'): 0.09090909090909091,
 ('cellular', 'phone'): 0.09090909090909091,
 ('phone', 'cell'): 0.7002997002997002,
 ('cell', 'phone'): 0.97002997002997,
 ('phone', 'cellphone'): 0.09090909090909091,
 ('cellphone', 'handphone'): 0.09090909090909091,
 ('handphone', 'hand'): 0.09090909090909091,
 ('hand', 'phone'): 0.09090909090909091,
 ('phone', 'shortened'): 0.09090909090909091,
 ('shortened', 'simply'): 0.09090909090909091,
 ('simply', 'mobile'): 0.09090909090909091,
 ('mobile', 'cell'): 0.09090909090909091,
 ('phone', 'portable'): 0.09090909090909091,
 ('portable', 'telephone'): 0.7002997002997002,
 ('telephone', 'receive'): 0.09090909090909091,
 ('receive', 'calls'): 0.09090909090909091,
 ('calls', 'radio'): 0.09090909090909091,
 ('radio', 'frequency'): 0.7002997002997002,
 ('frequency', 'link'): 0.7002997002997002,
 ('link', 'user'): 0.09090909090909091,
 ('user', 'moving'): 0.09090909090909091,
 ('moving', 'telephone'): 0.09090909090909091,

# FBI Score
We calculate the FBI score for each bigram in the corpus, defined as:

$FBI(w_{1}, w_{2}) = HWPO(w_{1}, w_{2}) \cdot (1 - \alpha \cdot (HWO(w_{1}) + HWO(w_{2})))$

In [110]:
# alpha is defined between 0 and 0.5
def FBI(bigram, alpha, hwpo, hwo):
    return hwpo[bigram]*(1 - (alpha * (hwo[bigram[0]] + hwo[bigram[1]])))

In [111]:
fbi_bigrams = {bigram:FBI(bigram, 0.05, hwpo, hwo) for bigram in bigrams if FBI(bigram, 0.05, hwpo, hwo) > 0.5}
fbi_bigrams

{('mobile', 'phone'): 0.9023778195488722,
 ('phone', 'cell'): 0.63859566185694,
 ('cell', 'phone'): 0.8845597541556188,
 ('portable', 'telephone'): 0.6750619258560987,
 ('radio', 'frequency'): 0.6834503842022638,
 ('frequency', 'link'): 0.6882451993495227,
 ('telephone', 'service'): 0.8746532634032634,
 ('service', 'area'): 0.6751189678116369,
 ('systems', 'mobile'): 0.8094418081918082,
 ('phone', 'operator'): 0.8155562952085509,
 ('modern', 'mobile'): 0.6585022073790869,
 ('mobile', 'telephone'): 0.8547537878787879,
 ('services', 'use'): 0.6602134660452329,
 ('cellular', 'network'): 0.8438957533694376,
 ('cell', 'phones'): 0.8503321678321678,
 ('north', 'america'): 0.6963243147829613,
 ('digital', 'mobile'): 0.6502442873542497,
 ('mobile', 'phones'): 0.8957171212246401,
 ('text', 'messaging'): 0.8989321095571096,
 ('wireless', 'communications'): 0.6818104279805407,
 ('video', 'games'): 0.8523787334219666,
 ('feature', 'phones'): 0.8845161417529838,
 ('phones', 'mobile'): 0.80985120143
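As a sanity check, the first score above can be reproduced by hand from the HWO and HWPO values printed earlier for `('mobile', 'phone')`. The `fbi` helper below is just the expression from `FBI` applied to plain numbers; its name and signature are illustrative.

```python
def fbi(hwpo_pair, hwo_first, hwo_second, alpha):
    # Same expression as FBI() in the notebook, on plain numbers.
    return hwpo_pair * (1 - alpha * (hwo_first + hwo_second))

# HWPO('mobile', 'phone') = 1.0, HWO('mobile') = 1.0, HWO('phone') = 0.9524...
score = fbi(hwpo_pair=1.0, hwo_first=1.0, hwo_second=0.9524436090225564, alpha=0.05)
print(round(score, 6))  # 0.902378

# A larger alpha penalises pairs made of individually frequent words more.
penalised = fbi(hwpo_pair=1.0, hwo_first=1.0, hwo_second=0.9524436090225564, alpha=0.5)
```

With `alpha` at its upper bound of 0.5, the same pair drops close to zero, since both of its words are among the most frequent in the document.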

# FNI Score

Then we calculate the FNI score for each n-gram present in the corpus.

In [112]:
def FNI(ngrams):
    
    last_ngram = ngrams[-1]
    conserved_ngram = ngrams[-1]
    trigram = []
    it = iter(last_ngram)
    for (bigram, score) in it:
        next_bigram, next_score = next(it, (None, None))
        if next_bigram is None:
            break
        if bigram[1] == next_bigram[0]:
            trigram.append((tuple(list(bigram) + [next_bigram[-1]]), statistics.mean([score, next_score])))
            conserved_ngram.remove((bigram, score))
            conserved_ngram.remove((next_bigram, next_score))
        else:
            it = itertools.chain(it, [(next_bigram, next_score)])
        
    ngrams[-1] = conserved_ngram
    ngrams.append(trigram)
    if trigram != []:
        FNI(ngrams)
    return ngrams[:-1]
        
fni_ngrams = FNI([list(fbi_bigrams.items())])
fni_ngrams

[[(('cell', 'phone'), 0.8845597541556188),
  (('portable', 'telephone'), 0.6750619258560987),
  (('telephone', 'service'), 0.8746532634032634),
  (('service', 'area'), 0.6751189678116369),
  (('systems', 'mobile'), 0.8094418081918082),
  (('phone', 'operator'), 0.8155562952085509),
  (('services', 'use'), 0.6602134660452329),
  (('cellular', 'network'), 0.8438957533694376),
  (('cell', 'phones'), 0.8503321678321678),
  (('north', 'america'), 0.6963243147829613),
  (('text', 'messaging'), 0.8989321095571096),
  (('wireless', 'communications'), 0.6818104279805407),
  (('video', 'games'), 0.8523787334219666),
  (('feature', 'phones'), 0.8845161417529838),
  (('phones', 'mobile'), 0.8098512014301488),
  (('phones', 'offer'), 0.6666273951612297),
  (('advanced', 'computing'), 0.6931628740933251),
  (('development', 'metaloxidesemiconductor'), 0.6899740093866034),
  (('largescale', 'integration'), 0.6963243147829613),
  (('integration', 'lsi'), 0.6963243147829613),
  (('information', 'theory
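The merging idea behind this step can be illustrated in isolation. The sketch below is a simplified, non-recursive variant and not a drop-in replacement for `FNI`: it walks the list once, merges two neighbouring entries when the last word of the first equals the first word of the second, and scores the result with the mean of the two scores. The `merge_overlapping` name and the scores are hypothetical.

```python
from statistics import mean

def merge_overlapping(scored_ngrams):
    """Greedy left-to-right merge of neighbours that share a boundary word."""
    kept, merged = [], []
    i = 0
    while i < len(scored_ngrams):
        gram, score = scored_ngrams[i]
        has_next = i + 1 < len(scored_ngrams)
        if has_next and gram[-1] == scored_ngrams[i + 1][0][0]:
            next_gram, next_score = scored_ngrams[i + 1]
            # ('a', 'b') + ('b', 'c') -> ('a', 'b', 'c'), scored by the mean
            merged.append((gram + next_gram[1:], mean([score, next_score])))
            i += 2  # both neighbours are consumed by the merge
        else:
            kept.append((gram, score))
            i += 1
    return kept, merged

scored = [(("mobile", "phone"), 0.9),
          (("phone", "cell"), 0.6),
          (("text", "messaging"), 0.8)]  # hypothetical scores
kept, merged = merge_overlapping(scored)
print(kept)
print(merged)
```

Calling the function again on `merged` would build longer n-grams in the same way, which is roughly what the recursion in `FNI` aims for.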

# Extract Highly Relevant Monograms
Finally, we extract the highly relevant monograms and n-grams.

In [113]:
def extract_high_relevant_monogram(monograms, highest_freq_bigram):
    high_relevant_monograms = []
    for gram in monograms.items():
        if gram[1] > highest_freq_bigram * 8:
            high_relevant_monograms.append(gram)
    return high_relevant_monograms

highest_frequency = sorted(bigram_frequency.items(), key=lambda x : x[1], reverse=True)[0][1]
highest_monogram = extract_high_relevant_monogram(monogram_frequency, highest_frequency)
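The rule keeps a monogram only when its frequency exceeds eight times the highest bigram frequency. Plugging in the counts shown earlier (59 for `('mobile', 'phone')`, and 150 as the largest monogram frequency) gives a quick check of what to expect on this document; the variable names are illustrative.

```python
highest_bigram_frequency = 59     # ('mobile', 'phone') in the bigram counts
highest_monogram_frequency = 150  # largest frequency key in cumul_monogram
threshold = highest_bigram_frequency * 8

print(threshold)
print(highest_monogram_frequency > threshold)
```

Since even the most frequent word stays far below the threshold, no monogram qualifies for this article.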

## Extract Highly Relevant N-grams

In [114]:
top_ranked_bigram = sorted(fni_ngrams[0], key=lambda x: x[1], reverse=True)[:11]
top_bigram_mean = statistics.mean(bigram[1] for bigram in top_ranked_bigram)
highest_ngrams = [ngram for ngrams in fni_ngrams[1:] for ngram in ngrams if ngram[1] >= top_bigram_mean]

## Visualize the Result

We extract the top-ranked bigrams and n-grams.

In [115]:
top_ranked_bigram = sorted(fni_ngrams[0], key=lambda x: x[1], reverse=True)[:11]
top_bigram_mean = statistics.mean(bigram[1] for bigram in top_ranked_bigram)

ngrams = [ngram for ngrams in fni_ngrams[1:] for ngram in ngrams if ngram[1]]
top_ranked_ngram = sorted(ngrams, key=lambda x: x[1], reverse=True)[:11]

The final list is ordered as: top monograms, top bigrams and top n-grams. Note that the `[:11]` slices keep up to 11 bigrams and 11 n-grams.

In [116]:
print('-'* 30)
print('Top Monograms :')
print(highest_monogram)
print('-'* 30)
print('Top Highest Ngrams :')
print(highest_ngrams)
print('-'* 30)
print('Top Bigram :')
print('-'* 30)
print('\n'.join([' '.join(bigram[0]) for bigram in top_ranked_bigram]))
print('-' * 30)
print('Top Ngram :')
print('-'* 30)
print('\n'.join([' '.join(ngram[0]) for ngram in top_ranked_ngram]))

------------------------------
Top Monograms :
[]
------------------------------
Top Highest Ngrams :
[]
------------------------------
Top Bigram :
------------------------------
sim card
text messaging
cell phone
feature phones
use mobile
screen sizes
telephone service
mobile banking
phone subscribers
handheld mobile
sim cards
------------------------------
Top Ngram :
------------------------------
phone use certain
martin cooper motorola
commercially available handheld
metaloxidesemiconductor mos largescale
digital mobile phones
mobile phone cell
identity module sim
worldwide samsung apple
phones central processing
modern mobile telephone
cell towers placed


# Using the Implementation on Several Texts

In [117]:
def automatic_keyword_extraction(word_list:list, alpha:float, threshold:float):
    bigrams = list(nltk.bigrams(word_list))
    monogram_frequency = Counter(word_list)
    bigram_frequency = Counter([i for i in bigrams])
    
    cumul_monogram = compute_cumul_freq_monogram(monogram_frequency)
    cumul_bigram = compute_cumul_freq_bigram(bigram_frequency)
    
    hwo = {word: HWO(word, monogram_frequency, cumul_monogram) for word in word_list}
    hwpo = {bigram: HWPO(bigram, bigram_frequency, cumul_bigram) for bigram in list(bigrams)}
    
    # Step 1
    fbi_bigrams = {bigram:FBI(bigram, alpha, hwpo, hwo) for bigram in bigrams if FBI(bigram, alpha, hwpo, hwo) > threshold}
    
    # Step 2
    fni_ngrams = FNI([list(fbi_bigrams.items())])
    
    # Step 3
    highest_frequency = sorted(bigram_frequency.items(), key=lambda x : x[1], reverse=True)[0][1]
    highest_monogram = extract_high_relevant_monogram(monogram_frequency, highest_frequency)
    
    top_ranked_bigram = sorted(fni_ngrams[0], key=lambda x: x[1], reverse=True)[:11]
    top_bigram_mean = statistics.mean(bigram[1] for bigram in top_ranked_bigram)
    
    ngrams = [ngram for ngrams in fni_ngrams[1:] for ngram in ngrams if ngram[1]]
    highest_ngrams = [ngram for ngrams in fni_ngrams[1:] for ngram in ngrams if ngram[1] >= top_bigram_mean]
    top_ranked_ngram = sorted(ngrams, key=lambda x: x[1], reverse=True)[:11]
    
    return {'monograms': highest_monogram, 
            'highest n-grams': highest_ngrams, 
            'top ranked bigrams': top_ranked_bigram,
            'top ranked n-grams': top_ranked_ngram}

automatic_keyword_extraction(word_list, 0.02, 0.5)

{'monograms': [],
 'highest n-grams': [],
 'top ranked bigrams': [(('cell', 'phone'), 0.9358418836802295),
  (('sim', 'card'), 0.9340979947370925),
  (('use', 'mobile'), 0.9339966386996462),
  (('feature', 'phones'), 0.9268334297281665),
  (('text', 'messaging'), 0.9176148018648019),
  (('handheld', 'mobile'), 0.9096098907358307),
  (('telephone', 'service'), 0.9079032634032634),
  (('mobile', 'banking'), 0.9069941724941726),
  (('phone', 'subscribers'), 0.9061270396270397),
  (('cell', 'phones'), 0.8981748251748252),
  (('screen', 'sizes'), 0.8873254464833412)],
 'top ranked n-grams': [(('phone', 'use', 'certain'), 0.8965487695262131),
  (('commercially', 'available', 'handheld'), 0.8667178504452941),
  (('martin', 'cooper', 'motorola'), 0.8604275774350962),
  (('mobile', 'phone', 'cell'), 0.8182846063710725),
  (('digital', 'mobile', 'phones'), 0.8162851888086851),
  (('modern', 'mobile', 'telephone'), 0.7917620881624641),
  (('metaloxidesemiconductor', 'mos', 'largescale'), 0.779860

# Applying the Keyword Extraction to the HAS Corpus

### Load the French spaCy model

In [119]:
nlp = spacy.load('fr_core_news_sm')

In [121]:
has_json = json.load(open('processed-HAS.json', 'r'))

In [125]:
data = {}
url_c = 0
for url, article in has_json.items():
    text = remove_digits(article)
    # tokenize text
    doc = nlp(text)
    # remove stopwords
    word_list =  [token.text for token in doc if not token.is_stop and not token.is_space]
    if len(word_list) > 80:
        data[url] = [article, automatic_keyword_extraction(word_list, 0.02, 0.5)]
        url_c += 1
print(f'url : {len(has_json)}')
print(f'url processed : {url_c}')
json.dump(data, open('has_keywords_result.json', 'w'), indent=4, sort_keys=True)

url : 377
url processed : 120


{'https://www.has-sante.fr/jcms/c_1009982/fr/biologie-des-anomalies-de-l-hemostase': ['Ce travail a eu pour but d’évaluer dix actes de biologie mesurant les anomalies de l’hémostase, qui avaient été signalés par le demandeur de cette évaluation soit comme obsolètes mais figurant encore à la Nomenclature des actes de biologie médicale (NABM), soit comme pertinents mais ne figurant pourtant pas dans cette nomenclature. Cette évaluation pourra donc permettre l’actualisation du sous chapitre 5-02 « Hémostase et coagulation » de la NABM. Cette évaluation a donné lieu à la rédaction de sept documents : Tome I : Evaluation du Temps de saignement (Epreuve de DUKE et tests d’IVY) ;Tome II : Temps de Thrombine et correction du Temps de Thrombine ;Tome III : Test photométrique d’agrégation plaquettaire ;Tome IV : Recherche d’anticorps antifacteur 4 plaquettaire dans le cadre d’une thrombopénie induite par l’héparine ;Tome V : Recherche et titrage d’inhibiteur contre les facteurs antihémophiliques

In [128]:
def join_ngrams(scored_ngrams):
    # turn [(('a', 'b'), score), ...] into 'a b; ...'
    return '; '.join(' '.join(ngram) for ngram, _ in scored_ngrams)

# data maps url -> [article, keywords]; iterate over items() so that
# site[0] is the url and site[1][1] is the keyword dictionary
# (iterating over data directly yields url strings and raises an IndexError)
list_items = [site for site in data.items() if site[1][1]['highest n-grams'] != [] and site[1][1]['top ranked bigrams'] != [] and site[1][1]['top ranked n-grams'] != []]
url_list = [site[0] for site in list_items]
highest_ngram_list = [join_ngrams(site[1][1]['highest n-grams']) for site in list_items]
bigram_list = [join_ngrams(site[1][1]['top ranked bigrams']) for site in list_items]
ngram_list = [join_ngrams(site[1][1]['top ranked n-grams']) for site in list_items]

import pandas as pd
df = pd.DataFrame(list(zip(url_list, highest_ngram_list, bigram_list, ngram_list)), columns=['url','mots-clés très pertinent', 'mots-clés avec 2 mots', 'mots-clés avec plusieurs mots'])
df.to_csv('has_mot_cle.csv')
df