<h1><center>Semester Project</center></h1>


Over the past decades, the development of online scientific platforms like 
arXiv.org or JSTOR radically changed the way researchers access, browse, or read 
scientific articles. Yet, while observing the behavior of researchers in libraries and 
laboratories has become commonplace in the humanities, the computational study 
of digital research practices is only in its early days. This project aims at documenting 
the behavior of scientists on online platforms by making sense of the digital traces 
they generate while navigating.

In this project we use the browsing logs of Gallica (https://gallica.bnf.fr/) to understand how users navigate the platform and to identify recurring patterns.

Project goals:

* Parsing and sessionizing the log files in order to extract relevant sessions.
* Building an ontology based on the descriptors provided by the ARKs (Archival Resource Key, i.e. the metadata) of the documents (type of document, discipline, year, etc.)
* Using off-the-shelf Python libraries for topological data analysis (TDA) to identify the geometrical shapes of users’ paths through the ontology
* Building a typology of users’ behaviour on the platform by clustering the previously identified users’ paths

## Table of Contents:
[I. Exploring Gallica Logs](#first-bullet)

[II. Creating Sessions](#second-bullet)

[III. Word2vec Representation](#third-bullet)

[IV. Path Representation](#fourth-bullet)

[V. Path Clustering](#fifth-bullet)

[VI. Analysis](#sixth-bullet)

<a class="anchor" id="first-bullet"></a><h2><center>I. Exploring Gallica Logs</center></h2> 

## 1. Importing logs

The logs were provided by TODO and span a period of TODO.

In [6]:
import os
import numpy as np


# for now a single manually extracted file is used for testing
with open('res296.log','r',encoding="utf8") as file:
    lines = file.read().splitlines()
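
Once more than one log file is available, the same loading step could be applied to a whole directory. The following is a minimal sketch under that assumption; the helper name `read_log_lines` and the `*.log` pattern are illustrative and not part of the original notebook.

```python
from pathlib import Path


def read_log_lines(log_dir, pattern="*.log"):
    """Return the lines of every file in log_dir matching pattern, in file-name order."""
    lines = []
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps reading even if a file contains invalid UTF-8 bytes
        with open(path, "r", encoding="utf8", errors="replace") as handle:
            lines.extend(handle.read().splitlines())
    return lines
```

The returned list has the same shape as `lines` above, so it could be passed to `pd.DataFrame` unchanged.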

## 2. Exploring logs

In [7]:
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

# convert lines from logs into pandas dataframe
lines_df = pd.DataFrame(lines)

In [8]:
lines_df.head()

Unnamed: 0,0
0,##320022e99796ca35dab7e63d48fd5e7##null##null#...
1,##e7fdec50f50253f6796d61b5382155f8##null##null...
2,##320022e99796ca35dab7e63d48fd5e7##null##null#...
3,##e7fdec50f50253f6796d61b5382155f8##null##null...
4,##320022e99796ca35dab7e63d48fd5e7##null##null#...


In [9]:
# we need to split each line into relevant metadata according to this example from Nouvellet et al.

![caption](log_example.png)

In [10]:
# first split: following the example, we split on '##' to get the IP, country and city, leaving date/request/protocol/code/size/referrer together in the last column
lines_df=lines_df[0].str.split('##', expand=True)

In [11]:
lines_df.head()

Unnamed: 0,0,1,2,3,4
0,,320022e99796ca35dab7e63d48fd5e7,,,"- - [03/Mar/2017:10:58:15 +0100] ""GET /ark:/12..."
1,,e7fdec50f50253f6796d61b5382155f8,,,"- - [03/Mar/2017:10:58:41 +0100] ""GET /ark:/12..."
2,,320022e99796ca35dab7e63d48fd5e7,,,"- - [03/Mar/2017:11:01:15 +0100] ""GET /ark:/12..."
3,,e7fdec50f50253f6796d61b5382155f8,,,"- - [03/Mar/2017:11:01:41 +0100] ""GET /ark:/12..."
4,,320022e99796ca35dab7e63d48fd5e7,,,"- - [03/Mar/2017:11:04:15 +0100] ""GET /ark:/12..."


### 2.1 Parsing HTTP requests

#### Experimenting with extracting data

In [12]:
# This is just experimental, TODO change with lazy eval regex when working on bigger data.
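
One possible shape for the regex mentioned in the TODO above is a single compiled pattern with named groups, applied to the last `##`-separated field. This is only a sketch: the sample line is made up (its protocol, referrer and user-agent parts are assumptions, since the real log lines are truncated in the outputs above), and the names `REQUEST_PATTERN` and `parse_request` are illustrative.

```python
import re

# Assumption: the last field follows the usual Apache "combined" layout:
# [date] "METHOD path protocol" code length "referrer" ...
REQUEST_PATTERN = re.compile(
    r'\[(?P<date>[^\]]+)\]\s+'
    r'"(?P<method>\S+)\s+(?P<path>\S+)(?:\s+(?P<protocol>[^"]*))?"\s+'
    r'(?P<code>\d{3}|-)\s+(?P<length>\d+|-)'
    r'(?:\s+"(?P<referrer>[^"]*)")?'
)


def parse_request(tail):
    """Return a dict of the named fields, or None if the line does not match."""
    match = REQUEST_PATTERN.search(tail)
    return match.groupdict() if match else None


# Hypothetical sample line, for illustration only
SAMPLE_TAIL = (
    '- - [03/Mar/2017:18:04:59 +0100] '
    '"GET /ark:/12148/bpt6k1263204k.thumbnail HTTP/1.1" 200 2291 '
    '"http://gallica.bnf.fr/" "Mozilla/5.0"'
)

print(parse_request(SAMPLE_TAIL))
```

Returning `None` for non-matching lines makes malformed entries explicit instead of raising an `IndexError` halfway through a chain of `split` calls.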

In [13]:
import re
import shutil
import requests
import xmltodict
from bs4 import BeautifulSoup
from xml.etree import ElementTree as ET

In [14]:
'''
Function to query the Gallica API and get metadata about a document from its ARK.
inputs: 
    'id': here id is the ARK of the document, that we extract from the request
Outputs:
     this function outputs the result of the API as a dictionary.

'''
def OAI(id):

    OAI_BASEURL = 'https://gallica.bnf.fr/services/OAIRecord?ark='

    url = "".join([OAI_BASEURL, id])
    #print(url)

    s = requests.get(url, stream=True)
    soup = BeautifulSoup(s.content,"lxml-xml")
    #print(soup)
    # write the prettified XML to disk, then read it back and parse it into a dictionary
    with open('oai.xml', 'wb') as file:
        file.write(soup.prettify().encode('UTF-8'))
    with open('oai.xml',encoding='UTF-8') as xml:
        doc = xmltodict.parse(xml.read())
        return doc


In [15]:
temp = pd.DataFrame()

In [16]:
# extracting the date of query
temp['Date']=lines_df.apply(lambda x: x[4].split("]")[0].split("[")[1] ,axis = 1)

In [17]:
# extracting request
temp['Request'] = lines_df.apply(lambda x: ' '.join(x[4].split("\"")[1].split(' ')[:2]),axis=1)

In [18]:
# extracting protocol, still need to look into it further
# do i really need to extract protocol at this point?
# commenting this for now until further research

# temp['protocol'] = lines_df.apply(lambda x: x[4].split("\"")[1].split(' ')[2], axis=1)

In [19]:
# extracting the status code
# try/except avoids index errors on malformed lines; still need to look into what exactly happens there
def trycode(value, default):
    try:
        return value[4].split("\"")[2].split(' ')[1]
    except IndexError:
        # return the caller's default instead of a hard-coded value
        return default


temp['Code'] = lines_df.apply(lambda x: trycode(x,'-'), axis=1)

In [20]:
# extracting the response length
# try/except avoids index errors on malformed lines; still need to look into what exactly happens there
def trylength(value, default):
    try:
        return value[4].split("\"")[2].split(' ')[2] 
    except IndexError:
        # return the caller's default instead of a hard-coded value
        return default

temp['Length'] = lines_df.apply(lambda x:  trylength(x,'-') , axis=1)

In [21]:
# extracting the referrer (stored in the 'Referant' column)
temp['Referant'] = lines_df.apply(lambda x: x[4].split("\"")[3], axis=1)

In [22]:
# extracting ark name

# function to extract the ARK name(s) from a request, if any
def extract_ark(request):
    # capture everything between "12148/" and the next "/" or "." using regex;
    # returns a (possibly empty) list of matches
    ark = re.findall(r'(?<=12148/).+?(?=/)|(?<=12148/).+?(?=\.)', request)
    return ark


temp['Ark'] = temp.apply(lambda x: extract_ark(x['Request']), axis=1)
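
The pattern above only matches when the ARK name is followed by `/` or `.`, so a request such as `GET /ark:/12148/bpt6k70211m` yields an empty list (as the first rows of the output further down show). A possible alternative is sketched here, assuming that an ARK name consists only of letters and digits; `ARK_PATTERN` and `first_ark` are hypothetical names, and this variant returns a string or `None` rather than a list.

```python
import re

# Assumption: an ARK name is a run of letters and digits right after "ark:/12148/"
ARK_PATTERN = re.compile(r"ark:/12148/([0-9A-Za-z]+)")


def first_ark(text):
    """Return the first ARK name found in text, or None if there is none."""
    match = ARK_PATTERN.search(text)
    return match.group(1) if match else None


print(first_ark("GET /ark:/12148/bpt6k1263204k.thumbnail"))
print(first_ark("GET /ark:/12148/bpt6k70211m"))
```

Switching to this helper would change the type stored in the `Ark` column, so the downstream code that expects lists (`ark[0]`, `!= []`) would need to be adapted accordingly.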
    

In [23]:
lines_df = lines_df.rename(columns={1:"IPAdress",2:"Country",3:"City",4:"Full_request"})

In [24]:
final_df = pd.concat([lines_df, temp],axis=1)

In [25]:
final_df[1195:1200]

Unnamed: 0,0,IPAdress,Country,City,Full_request,Date,Request,Code,Length,Referant,Ark
1195,,dd7e7fe210e8c18c7652e3f0e48700b2,France,Villemoustaussou,"- - [03/Mar/2017:18:04:59 +0100] ""GET /ark:/12...",03/Mar/2017:18:04:59 +0100,GET /ark:/12148/bpt6k1263204k.thumbnail,200,2291,http://gallica.bnf.fr/services/engine/search/s...,[bpt6k1263204k]
1196,,dd7e7fe210e8c18c7652e3f0e48700b2,France,Villemoustaussou,"- - [03/Mar/2017:18:04:59 +0100] ""GET /ark:/12...",03/Mar/2017:18:04:59 +0100,GET /ark:/12148/bpt6k5761916m.thumbnail,200,11323,http://gallica.bnf.fr/services/engine/search/s...,[bpt6k5761916m]
1197,,dd7e7fe210e8c18c7652e3f0e48700b2,France,Villemoustaussou,"- - [03/Mar/2017:18:04:59 +0100] ""GET /service...",03/Mar/2017:18:04:59 +0100,GET /services/ajax/extract/ark:/12148/bpt6k623...,200,1047,http://gallica.bnf.fr/services/engine/search/s...,[bpt6k6237463k]
1198,,dd7e7fe210e8c18c7652e3f0e48700b2,France,Villemoustaussou,"- - [03/Mar/2017:18:04:59 +0100] ""GET /ark:/12...",03/Mar/2017:18:04:59 +0100,GET /ark:/12148/bpt6k5495010c.thumbnail,200,9595,http://gallica.bnf.fr/services/engine/search/s...,[bpt6k5495010c]
1199,,dd7e7fe210e8c18c7652e3f0e48700b2,France,Villemoustaussou,"- - [03/Mar/2017:18:04:59 +0100] ""GET /service...",03/Mar/2017:18:04:59 +0100,GET /services/ajax/extract/ark:/12148/bpt6k553...,200,1065,http://gallica.bnf.fr/services/engine/search/s...,[bpt6k5535441w]


<a class="anchor" id="second-bullet"></a><h2><center>II. Creating Sessions</center></h2> 

#### 1. Creating a dataframe with the IP and the time difference between each connection and the previous one

In [26]:
# Session: a sequence of requests
# Group requests by IP address => a session ends when the interval between two requests exceeds 60 minutes
# (note: create_session below currently uses a 30-minute threshold).
sessions_df = final_df.groupby('IPAdress').agg({'Ark':list,'Date':list})
sessions_df.head()

Unnamed: 0_level_0,Ark,Date
IPAdress,Unnamed: 1_level_1,Unnamed: 2_level_1
103e44bc19d6aac58db9a149c73e505b,[[]],[03/Mar/2017:18:12:04 +0100]
105781f3101367c473a91d52b6d4fd67,"[[bpt6k54673247], [bpt6k54673247], [bpt6k62397...","[03/Mar/2017:18:27:36 +0100, 03/Mar/2017:18:27..."
10907c8edc0b2702015e04f49a8204a2,"[[], [], [], [], [bpt6k6308044k], [bpt6k759364...","[03/Mar/2017:18:05:21 +0100, 03/Mar/2017:18:05..."
10915f6650d7b3ab000aafb953615c4e,"[[bpt6k33258628], [bpt6k3321225p], [bpt6k62553...","[03/Mar/2017:19:40:11 +0100, 03/Mar/2017:19:41..."
10dfc529d2b8f1a7ae6f94229848fbf,"[[bpt6k4453214], [], [], [], [], [], [], [], [...","[03/Mar/2017:18:38:47 +0100, 03/Mar/2017:18:39..."


In [27]:
from datetime import datetime
'''
Function to calculate absolute value of minutes between two dates
inputs: 
    d1: first date
    d2: second date
Outputs:
    absolute value of minutes between d1 and d2

'''
def minutes_between(d1, d2):
    d1 = datetime.strptime(d1, "%d/%b/%Y:%H:%M:%S")
    d2 = datetime.strptime(d2, "%d/%b/%Y:%H:%M:%S")
    return abs(((d2 - d1)).total_seconds() // 60.0)
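
The function above expects the timezone offset to have been stripped beforehand. As a sketch of an alternative, `strptime` can parse the offset directly with `%z`, so that two timestamps with different offsets are compared as instants. The name `minutes_between_tz` is illustrative; note also that this version takes the absolute value before the floor division, which makes the result independent of argument order.

```python
from datetime import datetime

# Log timestamps look like "03/Mar/2017:10:58:15 +0100"
LOG_DATE_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def minutes_between_tz(d1, d2):
    """Whole minutes between two log timestamps, taking their UTC offsets into account."""
    t1 = datetime.strptime(d1, LOG_DATE_FORMAT)
    t2 = datetime.strptime(d2, LOG_DATE_FORMAT)
    return abs((t2 - t1).total_seconds()) // 60.0


print(minutes_between_tz("03/Mar/2017:10:58:15 +0100", "03/Mar/2017:11:01:15 +0100"))
```

`%b` relies on English month abbreviations, which is what Python uses unless the locale has been changed.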

In [28]:
time_beginning = "01/Jan/0001:01:01:01 +0100"
time_end = "01/Jan/3000:01:01:01 +0100"
sessions_df['date_1'] = sessions_df.apply(lambda x: [time_beginning]+x['Date'], axis = 1)
sessions_df['date_2'] = sessions_df.apply(lambda x: x['Date']+[time_end],axis=1)

In [29]:
'''
Function to calculate the difference between two zipped lists
'''
def calculate_difference_zipped_list(lst):
    new_lst = []
    for e in lst:
        if (e[0]==time_beginning):
            new_lst.append(999)
        elif (e[1]==time_end):
            new_lst.append(999)
        else:
            new_lst.append(minutes_between(e[0][:-6], e[1][:-6]))
    return new_lst
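
The same gaps can also be computed without the `time_beginning` / `time_end` sentinel dates, by pairing each timestamp with the next one. The sketch below assumes the dates of one IP are already in chronological order; `gaps_in_minutes` is a hypothetical helper, and unlike the function above it returns exactly one value per request (no trailing 999).

```python
from datetime import datetime

FIRST_REQUEST_GAP = 999  # same marker value as in the notebook


def gaps_in_minutes(dates, fmt="%d/%b/%Y:%H:%M:%S %z"):
    """Minutes elapsed since the previous request, one value per request."""
    parsed = [datetime.strptime(d, fmt) for d in dates]
    gaps = [FIRST_REQUEST_GAP] if parsed else []
    # zip(parsed, parsed[1:]) yields (previous, current) pairs
    for previous, current in zip(parsed, parsed[1:]):
        gaps.append(abs((current - previous).total_seconds()) // 60.0)
    return gaps
```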
        
    

In [30]:
# for each IP address, this contains a queue of the time differences computed from the zipped date_1/date_2 lists
from collections import deque
IP_and_sessions = sessions_df.apply(lambda x: deque(calculate_difference_zipped_list(list(zip(x['date_1'],x['date_2'])))),axis=1)

In [31]:
# IP and the difference in time between each connection and the last
IP_and_sessions[:2]

IPAdress
103e44bc19d6aac58db9a149c73e505b                                           [999, 999]
105781f3101367c473a91d52b6d4fd67    [999, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
dtype: object

#### 2. Enriching the original dataframe with a previous-connection column, which is useful when setting a timeout to define a session

In [32]:
# IP_and_sessions is indexed by IP address
# each value is a queue of time differences, consumed in request order
final_df['previous_connexion_date']=final_df.apply(lambda x: IP_and_sessions[x['IPAdress']].popleft(),axis=1)

In [33]:
final_df.head(2)

Unnamed: 0,0,IPAdress,Country,City,Full_request,Date,Request,Code,Length,Referant,Ark,previous_connexion_date
0,,320022e99796ca35dab7e63d48fd5e7,,,"- - [03/Mar/2017:10:58:15 +0100] ""GET /ark:/12...",03/Mar/2017:10:58:15 +0100,GET /ark:/12148/bpt6k70211m,503,-,-,[],999.0
1,,e7fdec50f50253f6796d61b5382155f8,,,"- - [03/Mar/2017:10:58:41 +0100] ""GET /ark:/12...",03/Mar/2017:10:58:41 +0100,GET /ark:/12148/bpt6k70211m,503,-,-,[],999.0


#### 3. Assigning a session ID to each connection according to the rules set above

In [34]:
'''
Function to assign a session ID to a request.
inputs: 
    'period': minutes elapsed since the previous request from the same IP (999 for the first request of an IP)
Outputs:
     the current session ID, which is incremented whenever the period exceeds 30 minutes.

'''
session_id=0
def create_session(period):
    global session_id
    if(period>30):
        session_id += 1
    return session_id

In [35]:
final_df=final_df.sort_values(by=['IPAdress','Date'])
final_df['session_id'] = final_df.apply(lambda x: create_session(x['previous_connexion_date']),axis=1)
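
The session rule can also be written without a global counter. The sketch below is one possible formulation, not the notebook's implementation: it takes `(ip, datetime)` pairs, sorts them on the parsed timestamps (sorting the raw date strings is only chronological within a single day), and starts a new session whenever the IP changes or the gap exceeds the timeout. `assign_sessions` and its parameters are illustrative names.

```python
from datetime import datetime, timedelta


def assign_sessions(hits, timeout_minutes=30):
    """hits: list of (ip, datetime) pairs. Returns one session id per hit, in input order."""
    timeout = timedelta(minutes=timeout_minutes)
    # visit the hits grouped by IP and in chronological order within each IP
    order = sorted(range(len(hits)), key=lambda i: hits[i])
    session_ids = [0] * len(hits)
    current_id = 0
    previous_ip, previous_time = None, None
    for i in order:
        ip, when = hits[i]
        if ip != previous_ip or when - previous_time > timeout:
            current_id += 1
        session_ids[i] = current_id
        previous_ip, previous_time = ip, when
    return session_ids


hits = [
    ("b", datetime(2017, 3, 3, 10, 0)),
    ("a", datetime(2017, 3, 3, 10, 0)),
    ("a", datetime(2017, 3, 3, 10, 10)),
    ("a", datetime(2017, 3, 3, 11, 0)),
    ("b", datetime(2017, 3, 3, 10, 5)),
]
print(assign_sessions(hits))
```

Because the result is aligned with the input order, it could be assigned directly as a new column without re-sorting the dataframe.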

#### 4. Creating sessions by grouping by session ID and collecting all ARKs


In [36]:
sessions = final_df.groupby('session_id').agg({'Ark':list})

#### 5. Converting arks from sessions to document titles

In [37]:
'''
get titles from ARKs

Input: list of arks

Output: list of document titles

'''
def get_title_from_ark(l):
    temp = []
    for ark in l:  
        title = ''
        if(len(ark)>0):
            oai_result = OAI(ark[0]) 
            if(oai_result != None ):
                try:
                    title = oai_result.get('results').get('notice').get('record').get('metadata').get('oai_dc:dc').get('dc:title')
                except:
                    title = ''
                
        temp.append(title)
    return temp

In [38]:
'''
get titles from ARKs using caching, if a user opens the same document twice we do not do any API call

Input: list of arks

Output: list of document titles

'''
def get_title_from_ark_cache(l):
    temp = []
    cache = {}
    for ark in l:  
        title = ''
        # remembering that l is a list of list [[ark1],[ark2],[ark3]]
        # if ark is not empty
        if(len(ark)>0):
            # check if ark in cache
            if ark[0] in cache:
                temp.append(cache.get(ark[0]))
                continue
            else:            
                oai_result = OAI(ark[0]) 
                if(oai_result is not None):
                    try:
                        title = oai_result.get('results').get('notice').get('record').get('metadata').get('oai_dc:dc').get('dc:title')
                        # store the title under the ARK itself so that later lookups hit the cache
                        cache[ark[0]]=title
                    except Exception:
                        title = ''

        temp.append(title)
    return temp

In [39]:
# testing on 5 sessions, API calls are taking a bit too much time so far.
sessions_5 = sessions[:5]

In [40]:
sessions_5['document_titles_path'] = sessions_5.apply(lambda x: get_title_from_ark_cache(x['Ark']),axis = 1)

In [41]:

# TODO: contact Gallica ?
sessions_5

Unnamed: 0_level_0,Ark,document_titles_path
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,[[]],[]
2,"[[bpt6k54673247], [bpt6k54673247], [bpt6k62397...","[Le Progrès, Le Progrès, L'Avenir de Souk-Ahra..."
3,"[[], [], [], [], [bpt6k6308044k], [bpt6k759364...","[, , , , Le Rire : journal humoristique, L'Act..."
4,"[[bpt6k33258628], [bpt6k3321225p], [bpt6k62553...",[La pratique administrative dans la fonction p...
5,"[[bpt6k4453214], [], [], [], [], [], [], [], [...",[Matériaux pour l'histoire primitive et nature...


#### 6. Converting arks to themes

First we build a dictionary of Gallica themes. How these themes are obtained is explained at https://api.bnf.fr/fr/api-gallica-de-recherche, where each index has the following meaning:

            0 = Généralités
            1 = Philosophie et psychologie
            2 = Religion
            3 = Économie et société
            4 = Langues
            5 = Sciences
            6 = Techniques
            7 = Arts et loisirs
            8 = Littérature
            9 = Histoire et géographie
            
            
TODO: implement caching for both solutions, one query for both title and themes
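
One way to approach this TODO is to cache the result of a single lookup per ARK and read both the title and the theme from it. The sketch below keeps the network call abstract: `fetch` stands for any function that takes an ARK and returns a `(title, theme)` pair, which is an assumption made here for illustration (in the notebook it would wrap `OAI`). The names `make_cached_lookup` and `describe_path` are hypothetical.

```python
def make_cached_lookup(fetch):
    """Wrap fetch(ark) so that each distinct ARK is fetched at most once."""
    cache = {}

    def lookup(ark):
        if ark not in cache:
            cache[ark] = fetch(ark)
        return cache[ark]

    return lookup


def describe_path(arks, lookup):
    """arks uses the notebook's list-of-lists shape, e.g. [['a'], [], ['b']]."""
    path = []
    for ark in arks:
        # empty entries (requests without an ARK) map to an empty title and theme
        path.append(lookup(ark[0]) if ark else ('', ''))
    return path


# Stand-in for the API call, for illustration only
def fake_fetch(ark):
    return ("title of " + ark, "theme of " + ark)


lookup = make_cached_lookup(fake_fetch)
print(describe_path([['a'], [], ['a'], ['b']], lookup))
```

Since the cache lives in the closure rather than inside the per-session function, it is shared across sessions, which should reduce the number of API calls further when several users open the same document.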



In [86]:
# dictionary of gallica themes, useful to interpret results from Gallica API queries
index_to_themes = {'0':'Généralités','1':'Philosophie et psychologie','2':'Religion','3':'Économie et société','4':'Langues','5':'Sciences','6':'Techniques','7':'Arts et loisirs','8':'Littérature','9':'Histoire et géographie'}

In [115]:
'''
get themes from ARKs using caching, if a user opens the same document twice we do not do any API call

Input: list of arks

Output: list of document themes

'''
def get_theme_from_ark_cache(l):
    temp = []
    cache = {}
    # regular expression to only catch fields containing theme
    r = re.compile(".*theme")
    for ark in l:  
        theme = ''
        # remembering that l is a list of list [[ark1],[ark2],[ark3]]
        # if ark is not empty
        if(len(ark)>0):
            # check if ark in cache
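            if ark[0] in cache:
                temp.append(cache.get(ark[0]))
                continue
            else:            
                oai_result = OAI(ark[0]) 
                if(oai_result is not None):
                    try:
                        theme = oai_result.get('results').get('notice').get('record').get('header').get('setSpec')
                        # keep the first setSpec entry mentioning "theme"; its third ':'-separated field is the theme index
                        theme = list(filter(r.match, theme))[0].split(':')[2]
                        theme = index_to_themes.get(theme)
                        # store the theme under the ARK itself so that later lookups hit the cache
                        cache[ark[0]]=theme
                    except Exception:
                        theme = ''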

        temp.append(theme)
    return temp
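
The theme extraction step can be isolated into a small helper, which also makes one edge case explicit: `xmltodict` returns a plain string instead of a list when an element occurs only once, so a record with a single `setSpec` would not be handled by the list filter above. This is a sketch only; the `setSpec` values used below are hypothetical examples of the `prefix:theme:index:...` shape that the code above assumes.

```python
def theme_index_from_setspec(set_spec):
    """Return the theme index (third ':'-separated field) or None if no theme entry exists."""
    if set_spec is None:
        return None
    # a single setSpec arrives as a str, several as a list
    entries = [set_spec] if isinstance(set_spec, str) else list(set_spec)
    for entry in entries:
        parts = entry.split(':')
        if 'theme' in entry and len(parts) > 2:
            return parts[2]
    return None


# Hypothetical setSpec values, for illustration only
print(theme_index_from_setspec(['gallica:typedoc:periodiques', 'gallica:theme:9:94']))
```

The returned index can then be mapped to a label with `index_to_themes.get(...)` as before.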

In [118]:
# visualising one path with themes
sessions_5['document_theme_path'] = sessions_5.apply(lambda x: get_theme_from_ark_cache(x['Ark']),axis = 1)

In [207]:
sessions_5

Unnamed: 0_level_0,Ark,document_titles_path,document_theme_path
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,[[]],[],[]
2,"[[bpt6k54673247], [bpt6k54673247], [bpt6k62397...","[Le Progrès, Le Progrès, L'Avenir de Souk-Ahra...","[Généralités, Généralités, Histoire et géograp..."
3,"[[], [], [], [], [bpt6k6308044k], [bpt6k759364...","[, , , , Le Rire : journal humoristique, L'Act...","[, , , , Généralités, Généralités, Généralités..."
4,"[[bpt6k33258628], [bpt6k3321225p], [bpt6k62553...",[La pratique administrative dans la fonction p...,"[Économie et société, Économie et société, His..."
5,"[[bpt6k4453214], [], [], [], [], [], [], [], [...",[Matériaux pour l'histoire primitive et nature...,"[Histoire et géographie, , , , , , , , , , ]"


#### 7. TODO Adding sub themes 

<a class="anchor" id="third-bullet"></a><h2><center>III. Word2vec Representation</center></h2> 

In this part I try to represent the ARKs as vectors using word2vec: each ARK is assigned an embedding vector, which can then be projected to 2D (e.g. with t-SNE) for visualisation.

The word2vec model is trained directly on the requests.

Choice between 2 models:
* Skip-gram: takes one word as input and tries to predict the surrounding context words within the window.
* CBOW (Continuous Bag of Words): takes the context of each word as input and tries to predict the word corresponding to that context.
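
To make the skip-gram setting concrete, the sketch below enumerates the (target, context) pairs that a fixed window produces for one sequence of ARKs. It is an illustration of the idea only, not of how gensim builds its training examples internally; `skipgram_pairs` is a hypothetical helper.

```python
def skipgram_pairs(tokens, window):
    """List every (target, context) pair where context lies within `window` positions of target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


print(skipgram_pairs(['ark1', 'ark2', 'ark3'], window=1))
```

With two-element sentences such as [ ark , referant_ark ], every window size of at least 1 produces the same two pairs, which is one reason longer per-session sequences may be a more informative corpus.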


#### 1. Extracting the referrer ARK

First we aim to create, for each request, a list containing [ ark , referant_ark ]

In [169]:
#first we initialize the dataframe we would be working on for this part
repr_df = final_df.copy()
repr_df['Referant_Ark'] = repr_df.apply(lambda x: extract_ark(x['Referant']), axis=1)
repr_df['Ark_list'] = repr_df.apply(lambda x: [x['Ark'],x['Referant_Ark']],axis = 1)
repr_df.head(2)

Unnamed: 0,0,IPAdress,Country,City,Full_request,Date,Request,Code,Length,Referant,Ark,previous_connexion_date,session_id,Referant_Ark,Ark_list
5235,,103e44bc19d6aac58db9a149c73e505b,United States,Menlo Park,"- - [03/Mar/2017:18:12:04 +0100] ""GET /resize?...",03/Mar/2017:18:12:04 +0100,GET /resize?w=90&url=http%3A%2F%2Fgallica.bnf....,404,380,-,[],999.0,1,[],"[[], []]"
12352,,105781f3101367c473a91d52b6d4fd67,France,Rubelles,"- - [03/Mar/2017:18:27:36 +0100] ""GET /iiif/ar...",03/Mar/2017:18:27:36 +0100,"GET /iiif/ark:/12148/bpt6k54673247/f1/0,0,3819...",200,19377,http://gallica.bnf.fr/ark:/12148/bpt6k54673247...,[bpt6k54673247],999.0,2,[bpt6k54673247],"[[bpt6k54673247], [bpt6k54673247]]"


#### 2. Creating corpus

Our corpus is represented as [ [ ark1 , referant_ark1 ] , [ ark2 , referant_ark2 ], [ ark3 , referant_ark3 ] ... ] 

TODO: change repr to [ark1,ark2,ark3,ark4]
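
For the representation mentioned in this TODO, each session could be turned into one "sentence" of ARKs. The sketch below assumes the list-of-lists shape stored in the `Ark` column of `sessions`; dropping empty entries and collapsing immediate repetitions are design choices made here for illustration, and `session_to_sentence` is a hypothetical name.

```python
def session_to_sentence(arks):
    """Flatten [[ark1], [], [ark2], ...] into [ark1, ark2, ...].

    Empty entries are skipped, and an ARK equal to the last one kept is not repeated.
    """
    sentence = []
    for ark in arks:
        if not ark:
            continue
        name = ark[0]
        if not sentence or sentence[-1] != name:
            sentence.append(name)
    return sentence


print(session_to_sentence([['a'], ['a'], [], ['b'], ['a']]))
```

A corpus for gensim would then be the list of these sentences, one per session.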

In [170]:
# we keep only requests whose ARK differs from the referrer's ARK; requests with an empty ARK or an empty referrer ARK are also dropped.
repr_df = repr_df.loc[repr_df.apply(lambda x: x['Referant_Ark']!= [] and x['Ark']!= [] and x['Referant_Ark']!= x['Ark'] ,axis = 1 )]['Ark_list']


In [173]:
repr_df

28289        [[bpt6k404856c], [cb327877302]]
11051       [[bpt6k6394101d], [cb34378481r]]
12701       [[bpt6k6467719h], [cb343492000]]
12703       [[bpt6k6467717p], [cb343492000]]
12704       [[bpt6k6467717p], [cb343492000]]
                         ...                
92837     [[bpt6k62459525], [bpt6k6244617c]]
93201       [[cb34432899t], [bpt6k62459525]]
102185      [[bpt6k5744051b], [cb34458709m]]
102277      [[bpt6k5744051b], [cb34458709m]]
152476      [[bpt6k5564700h], [cb32731184v]]
Name: Ark_list, Length: 431, dtype: object

In [194]:
# Create the list of list format of the custom corpus for gensim modeling 
corpus = [[row[0][0],row[1][0]] for row in repr_df]

In [196]:
# view corpus
corpus[:5]

[['bpt6k404856c', 'cb327877302'],
 ['bpt6k6394101d', 'cb34378481r'],
 ['bpt6k6467719h', 'cb343492000'],
 ['bpt6k6467717p', 'cb343492000'],
 ['bpt6k6467717p', 'cb343492000']]

In [220]:
len(corpus)

431

#### 3. Feeding the corpus to the model

In [199]:
import gensim
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

* size: The number of dimensions of the embeddings; the default is 100 (this parameter was renamed `vector_size` in gensim 4.0).
* window: The maximum distance between a target word and the words around it. The default window is 5.
* min_count: The minimum number of occurrences for a word to be considered during training; words occurring fewer times are ignored. The default for min_count is 5.
* workers: The number of worker threads used for training; the default is 3.
* sg: The training algorithm, either CBOW (0) or skip-gram (1). The default training algorithm is CBOW.

In [204]:
model = Word2Vec(corpus, min_count=1,size= 50,workers=3, window =3, sg = 1)

#### 4. Model results + t-SNE representation

In [208]:
# trying the model with one word
model.wv['bpt6k6467717p']

array([ 0.00323837,  0.00182838, -0.00300966, -0.00372714,  0.00717202,
       -0.00104797, -0.00180374,  0.00332456,  0.00211593, -0.00875297,
       -0.00114901, -0.00804507, -0.00054302,  0.00587437, -0.00979225,
        0.00932064,  0.0079716 , -0.00557393,  0.00434826,  0.0042955 ,
       -0.00337432,  0.00915539,  0.00057827, -0.00081348,  0.00871266,
        0.00507352,  0.0004839 ,  0.00160007,  0.00704157,  0.00333393,
        0.00526501,  0.00107506, -0.00678855, -0.00051837, -0.00112427,
        0.00736201, -0.00170719,  0.00274247,  0.00454962, -0.00963529,
       -0.00895235,  0.00110883,  0.00303389, -0.00037494, -0.00762413,
       -0.00344611, -0.00426454,  0.00486654,  0.00157237, -0.0054455 ],
      dtype=float32)

TODO do this representation on all data. cluster? run everything on my laptop?

<a class="anchor" id="fourth-bullet"></a><h2><center>IV. Path Representation</center></h2> 

#### 1. Example of one single path from a title/theme point of view

In [219]:
list(sessions_5.iloc[1]['document_titles_path'])

['Le Progrès',
 'Le Progrès',
 "L'Avenir de Souk-Ahras : journal hebdomadaire indépendant, paraissant le dimanche : organe de défense des intérêts généraux de Souk-Ahras, Tébessa, Ain-Beida, Sedrata et leurs régions",
 'Le Temps',
 'Le Temps',
 'Le Temps',
 'Le Temps',
 'Le Temps',
 'Le Temps',
 'Le Temps',
 'Le Temps',
 'Le Temps',
 '',
 '',
 'Le Journal',
 '',
 '',
 "L'Afrique du Nord illustrée : journal hebdomadaire d'actualités nord-africaines : Algérie, Tunisie, Maroc",
 "Le Mutilé de l'Algérie : journal des mutilés, réformés et blessés de guerre de l'Afrique du Nord",
 "Le Mutilé de l'Algérie : journal des mutilés, réformés et blessés de guerre de l'Afrique du Nord",
 "Le Mutilé de l'Algérie : journal des mutilés, réformés et blessés de guerre de l'Afrique du Nord",
 "Le Mutilé de l'Algérie : journal des mutilés, réformés et blessés de guerre de l'Afrique du Nord",
 "Le Mutilé de l'Algérie : journal des mutilés, réformés et blessés de guerre de l'Afrique du Nord",
 "Le Mutilé de 

In [216]:
list(sessions_5.iloc[1]['document_theme_path'])

['Généralités',
 'Généralités',
 'Histoire et géographie',
 'Généralités',
 'Généralités',
 'Généralités',
 'Généralités',
 'Généralités',
 'Généralités',
 'Généralités',
 'Généralités',
 'Généralités',
 '',
 '',
 'Généralités',
 '',
 '',
 'Histoire et géographie',
 'Généralités',
 'Généralités',
 'Généralités',
 'Généralités',
 'Généralités',
 'Généralités',
 'Histoire et géographie',
 '',
 'Économie et société',
 'Histoire et géographie',
 'Histoire et géographie',
 'Histoire et géographie',
 'Histoire et géographie',
 'Techniques',
 'Littérature',
 'Économie et société',
 '',
 'Techniques',
 'Techniques',
 'Techniques',
 'Économie et société',
 'Économie et société',
 'Économie et société',
 'Économie et société',
 'Économie et société',
 'Économie et société']
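
Paths like the one above contain long runs of the same theme. One compact way to look at them, sketched here as a suggestion rather than as the method used in this project, is run-length encoding: each run of identical consecutive labels becomes a `(label, count)` pair. `run_length_encode` is a hypothetical helper.

```python
from itertools import groupby


def run_length_encode(path):
    """Collapse consecutive identical labels into (label, run_length) pairs."""
    return [(label, sum(1 for _ in run)) for label, run in groupby(path)]


print(run_length_encode(['Généralités', 'Généralités', 'Histoire et géographie', 'Généralités', '', '']))
```

The encoded form keeps the order of theme changes while making the length of each stay within a theme explicit.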
