# **<span style="color:#250389" > Cross-referencing HAL and REVCONF data (JSON - file: valid_conf) </span>**


### **<span style="color:#250389" > Objectives and definitions </span>**

The goal of this cross-referencing is to obtain the number of "COMM" records (communications) per congress/conference from 2018 to 2023.
To do this we use two files and one reference table:

1. 'hal_conf2.csv': data extracted from the HAL API, previously filtered and cleaned (for the extraction and cleaning notebook, see http://localhost:8889/lab/tree/nb_python/Publis_conferences_Inria2rteam.ipynb). It is the pivot table built from the API results, with Team IDs replaced by acronyms for Inria teams (other IDs are removed).

2. 'revconf.csv': extract from the Revconf platform, which lists all the conferences/congresses entered by the "authors" on the HAL site (http://localhost:8889/lab/tree/nb_python/Croisement%20des%20donn%C3%A9es%20HAL%20et%20REVCONF.ipynb)

3. "df2": comes from the notebook (http://localhost:8888/files/nb_python/ref_publi_merged.ipynb?_xsrf=2%7Cb5ee617c%7Ce4a5f4de3c3b122afac26bddd7fc7fad%7C1718098371) and lets us filter on themes and teams.

Points to note for "hal_conf":

- The 'Conference' column holds all conference names together: both the controlled titles ("title" in the revconf file) and the variant titles attached to a controlled title ("form_title" in the revconf file).
- The 'Publication Date' column is the publication date (changes may be made in HAL later, but the Revconf platform does not take them into account).
- The 'Publication Count' column is the computed number of communications per conference.
- The 'clean_form_title' and 'clean_title' columns are copies of the 'Conference' column without special characters. Some conference names contain spaces that prevent the fields from being matched.
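
As an illustration of how a 'Publication Count' column of this kind could be produced, here is a minimal sketch. It assumes a raw extract with one row per communication, which is not shown in this notebook; the sample rows are invented, and grouping by team as well as by conference and year is inferred from the shape of the table displayed further down.

```python
import pandas as pd

# Hypothetical raw extract: one row per communication (not the actual HAL export).
raw = pd.DataFrame(
    {
        "Conference": ["CDC 2019", "CDC 2019", "CDC 2019", "ICML 2020"],
        "Publication Date": [2019, 2019, 2019, 2020],
        "Team Acronym": ["BIOCORE", "BIOCORE", "MCTAO", "PARIETAL"],
    }
)

# Count communications per conference, year and team.
counts = (
    raw.groupby(["Conference", "Publication Date", "Team Acronym"])
    .size()
    .reset_index(name="Publication Count")
)
print(counts)
```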



In [4]:
import json
import re
!pip install unidecode
from unidecode import unidecode
import pandas as pd

# Read the input files

hal_conf = pd.read_csv('hal_conf2.csv', sep=',', index_col=0)
revconf = pd.read_csv('revconf.csv', sep=',', index_col=0)
df2 = pd.read_csv('sigles_et_domaines_pour_Test_nbre_de_revues_.csv')

# Drop the 'clean_form_title' column so that it can be recomputed below
revconf = revconf.drop(columns=['clean_form_title'])





In [5]:
# 1st cleaning step: remove special characters and spaces
def nettoyer_chaine(chaine):
    # Remove spaces
    chaine_sans_espaces = chaine.replace(" ", "")
    # Remove special characters, keeping only word characters (letters, digits, underscore)
    chaine_nettoyee = re.sub(r'\W+', '', chaine_sans_espaces)
    return chaine_nettoyee

# 2nd cleaning step: remove accents
def supprimer_accents(chaine):
    return unidecode(chaine)



# 3rd cleaning step: lowercase the text, remove any remaining accents, collapse repeated whitespace, trim leading and trailing spaces
def nettoyer_texte(text):
    # Convert to lowercase
    text = text.lower()

    # Replace accented letters with their unaccented equivalents
    text = unidecode(text)

    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)

    # Remove leading and trailing spaces
    text = text.strip()

    return text

# Create the cleaned columns used as join keys


hal_conf['clean_form_title'] = hal_conf['Conference'].apply(nettoyer_chaine)
hal_conf['clean_title'] = hal_conf['Conference'].apply(nettoyer_chaine)

revconf['clean_title'] = revconf['title'].apply(nettoyer_chaine)
revconf['clean_form_title'] = revconf['form_title'].apply(nettoyer_chaine)

# Note: the results of the apply() calls below are not assigned back, so the
# stored columns keep their case and accents. Only the value of the last
# expression is displayed as the cell output.
revconf['clean_title'].apply(supprimer_accents)
revconf['clean_form_title'].apply(supprimer_accents)
hal_conf['clean_form_title'].apply(supprimer_accents)
hal_conf['clean_title'].apply(supprimer_accents)

revconf['clean_title'].apply(nettoyer_texte)
revconf['clean_form_title'].apply(nettoyer_texte)
hal_conf['clean_form_title'].apply(nettoyer_texte)
hal_conf['clean_title'].apply(nettoyer_texte)

0            cdc201958thieeeconferenceondecisionandcontrol
1            cdc201958thieeeconferenceondecisionandcontrol
2            cdc201958thieeeconferenceondecisionandcontrol
3            cdc201958thieeeconferenceondecisionandcontrol
4            cdc201958thieeeconferenceondecisionandcontrol
                               ...                        
12211                        ecoledeprintempsdelachairemmb
12212                               ecoledetetempsreel2021
12213    itsc201821stieeeinternationalconferenceonintel...
12214                           bioimageprocessingworkshop
12215         randomwalksandintracellulartransportworkshop
Name: clean_title, Length: 12216, dtype: object
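
The output above is lowercased and unaccented, but the tables displayed in the following cells still show mixed case and accents (for example `ÉcoledÉtéTempsRéel2021`), which suggests the accent and lowercase steps were not stored in the DataFrames. A minimal sketch of a single normalisation function whose result is assigned back to the column is given below. It is an assumption, not the notebook's method: `normalize_title` is a hypothetical helper, and the standard-library `unicodedata` module is swapped in for `unidecode` so the example has no third-party dependency beyond pandas.

```python
import re
import unicodedata

import pandas as pd


def normalize_title(title: str) -> str:
    """Lowercase, strip accents, and keep only word characters."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove everything that is not a word character (spaces, punctuation, quotes).
    return re.sub(r"\W+", "", without_accents.lower())


demo = pd.DataFrame(
    {"Conference": ["École d’Été Temps Réel 2021", "“Bioimage Processing” workshop"]}
)
# Series.apply returns a new Series, so the result has to be assigned back.
demo["clean_title"] = demo["Conference"].apply(normalize_title)
print(demo["clean_title"].tolist())
```

Applying the same function to both the HAL and the Revconf key columns would make the join insensitive to case and accents.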

In [6]:
revconf

Unnamed: 0,form_id,form_title,form_year,title,revconf_siid,clean_title,clean_form_title
0,1,"I3E 2008 : 8th IFIP Conference on e-Business, ...",2008.0,"IFIP Conference on e-Business, e-Services, and...",RC2040646750,IFIPConferenceoneBusinesseServicesandeSociety,I3E20088thIFIPConferenceoneBusinesseServicesan...
1,2,"8th IFIP Conference on e-Business, e-Services ...",2008.0,"IFIP Conference on e-Business, e-Services, and...",RC2040646750,IFIPConferenceoneBusinesseServicesandeSociety,8thIFIPConferenceoneBusinesseServicesandeSocie...
2,3,"The 8th IFIP Conference on e-Business, e-Servi...",2008.0,"IFIP Conference on e-Business, e-Services, and...",RC2040646750,IFIPConferenceoneBusinesseServicesandeSociety,The8thIFIPConferenceoneBusinesseServicesandeSo...
3,4,"The 8th IFIP Conference on e-Business, e-Servi...",2008.0,"IFIP Conference on e-Business, e-Services, and...",RC2040646750,IFIPConferenceoneBusinesseServicesandeSociety,The8thIFIPConferenceoneBusinesseServicesandeSo...
4,5,"I3E 2009 : 9th IFIP Conference on e-Business, ...",2009.0,"IFIP Conference on e-Business, e-Services, and...",RC2040646750,IFIPConferenceoneBusinesseServicesandeSociety,I3E20099thIFIPConferenceoneBusinesseServicesan...
...,...,...,...,...,...,...,...
64147,89998,Ibero-Latin-American Congress on Computational...,,Ibero-Latin-American Congress on Computational...,RC2450539386,IberoLatinAmericanCongressonComputationalMetho...,IberoLatinAmericanCongressonComputationalMetho...
64148,84862,CLAIB 2022 / CBEB 2022 - IX Latin American Con...,2022.0,Latin American Congress on Biomedical Engineer...,RC5965967315,LatinAmericanCongressonBiomedicalEngineeringBr...,CLAIB2022CBEB2022IXLatinAmericanCongressonBiom...
64149,89999,Latin American Congress on Biomedical Engineer...,,Latin American Congress on Biomedical Engineer...,RC5965967315,LatinAmericanCongressonBiomedicalEngineeringBr...,LatinAmericanCongressonBiomedicalEngineeringBr...
64150,82213,CLPsyc 2019 - Sixth Workshop on Computational ...,2019.0,Workshop on Computational Linguistics and Clin...,RC2020474300,WorkshoponComputationalLinguisticsandClinicalP...,CLPsyc2019SixthWorkshoponComputationalLinguist...


In [7]:
hal_conf['Team Acronym'] = hal_conf['Team Acronym'].str.upper().str.replace(' ', '')
hal_conf

Unnamed: 0,Conference,Publication Date,Team Acronym,Publication Count,clean_form_title,clean_title
0,CDC 2019 - 58th IEEE Conference on Decision an...,2019,NECS-POST,6,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl
1,CDC 2019 - 58th IEEE Conference on Decision an...,2019,BIOCORE,4,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl
2,CDC 2019 - 58th IEEE Conference on Decision an...,2019,MCTAO,3,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl
3,CDC 2019 - 58th IEEE Conference on Decision an...,2019,VALSE,3,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl
4,CDC 2019 - 58th IEEE Conference on Decision an...,2019,ACUMES,2,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl
...,...,...,...,...,...,...
12211,École de Printemps de la chaire MMB,2023,MERGE,1,ÉcoledePrintempsdelachaireMMB,ÉcoledePrintempsdelachaireMMB
12212,École d’Été Temps Réel 2021,2021,KAIROS,1,ÉcoledÉtéTempsRéel2021,ÉcoledÉtéTempsRéel2021
12213,​ITSC 2018 - 21st IEEE International Conferen...,2018,RITS,1,ITSC201821stIEEEInternationalConferenceonIntel...,ITSC201821stIEEEInternationalConferenceonIntel...
12214,“Bioimage Processing” workshop,2019,SERPICO,1,BioimageProcessingworkshop,BioimageProcessingworkshop


In [8]:
# Data used to filter on teams
df2['SigleBastri'] = df2['SigleBastri'].str.upper().str.replace(' ', '')
df2['sigle'] = df2['sigle'].str.upper().str.replace(' ', '')
df2

Unnamed: 0,siidRNSR,Thème anglais,SigleBastri,sigle,NonPertinentDansHal,siidEquipeGroupe,idStructureHal
0,200818987H,Computational Biology,ABS,ABS,,GS0184O,51872
1,202324373X,Computational Neuroscience and Medicine,AISTROSIGHT,,,,1152689
2,202324373X,Computational Neuroscience and Medicine,AISTROSIGHT,,,,1152689
3,201120772K,Computational Biology,AMIB,AMIB,,GS0318Q,103784
4,201622254Z,Computational Biology,AMIBIO,AMIBIO,,GS0781D,485707
...,...,...,...,...,...,...,...
96,202224249S,Computational Neuroscience and Medicine,SODA,SODA,,GS0918E,1093353
97,201622153P,Computational Biology,TAPDANCE,TAPDANCE,,GS0753A,461293
98,200718385H,Computational Biology,VIRTUALPLANTS,VIRTUALPLANTS,,GS0152C,21925
99,200518339S,Computational Neuroscience and Medicine,VISAGES,VISAGES,,GS0103J,11869;491318


In [9]:
df2.describe(include='all')

Unnamed: 0,siidRNSR,Thème anglais,SigleBastri,sigle,NonPertinentDansHal,siidEquipeGroupe,idStructureHal
count,101,101,101,89,0.0,89,101
unique,74,3,73,65,,65,74
top,201421109N,Computational Neuroscience and Medicine,MAMBA,MAMBA,,GS0604V,240018;454654;542023;1005056
freq,4,42,4,4,,4,4
mean,,,,,,,
std,,,,,,,
min,,,,,,,
25%,,,,,,,
50%,,,,,,,
75%,,,,,,,


### **<span style="color:#250389" >First test: join on "clean_form_title" </span>**

In [11]:
tab_merged=pd.merge(hal_conf, revconf, on=['clean_form_title'] , how='left', indicator=True)
tab_merged

Unnamed: 0,Conference,Publication Date,Team Acronym,Publication Count,clean_form_title,clean_title_x,form_id,form_title,form_year,title,revconf_siid,clean_title_y,_merge
0,CDC 2019 - 58th IEEE Conference on Decision an...,2019,NECS-POST,6,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl,40718.0,CDC 2019 - 58th IEEE Conference on Decision an...,2019.0,IEEE Conference on Decision and Control,RC8095363186,IEEEConferenceonDecisionandControl,both
1,CDC 2019 - 58th IEEE Conference on Decision an...,2019,NECS-POST,6,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl,40724.0,CDC 2019 - 58thIEEE Conference on Decision and...,2019.0,IEEE Conference on Decision and Control,RC8095363186,IEEEConferenceonDecisionandControl,both
2,CDC 2019 - 58th IEEE Conference on Decision an...,2019,BIOCORE,4,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl,40718.0,CDC 2019 - 58th IEEE Conference on Decision an...,2019.0,IEEE Conference on Decision and Control,RC8095363186,IEEEConferenceonDecisionandControl,both
3,CDC 2019 - 58th IEEE Conference on Decision an...,2019,BIOCORE,4,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl,40724.0,CDC 2019 - 58thIEEE Conference on Decision and...,2019.0,IEEE Conference on Decision and Control,RC8095363186,IEEEConferenceonDecisionandControl,both
4,CDC 2019 - 58th IEEE Conference on Decision an...,2019,MCTAO,3,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl,40718.0,CDC 2019 - 58th IEEE Conference on Decision an...,2019.0,IEEE Conference on Decision and Control,RC8095363186,IEEEConferenceonDecisionandControl,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...
13804,​ITSC 2018 - 21st IEEE International Conferen...,2018,RITS,1,ITSC201821stIEEEInternationalConferenceonIntel...,ITSC201821stIEEEInternationalConferenceonIntel...,13546.0,ITSC 2018 – 21st IEEE International Conference...,2018.0,International IEEE Conference on Intelligent T...,RC1769683265,InternationalIEEEConferenceonIntelligentTransp...,both
13805,“Bioimage Processing” workshop,2019,SERPICO,1,BioimageProcessingworkshop,BioimageProcessingworkshop,54206.0,"""Bioimage Processing"" workshop",2019.0,Bio-Image Processing Workshop,RC9065248414,BioImageProcessingWorkshop,both
13806,“Bioimage Processing” workshop,2019,SERPICO,1,BioimageProcessingworkshop,BioimageProcessingworkshop,54207.0,“Bioimage Processing” workshop,,Bio-Image Processing Workshop,RC9065248414,BioImageProcessingWorkshop,both
13807,“Random Walks and Intracellular Transport” wor...,2019,SERPICO,1,RandomWalksandIntracellularTransportworkshop,RandomWalksandIntracellularTransportworkshop,54208.0,"""Random Walks and Intracellular Transport"" wor...",2019.0,Random Walks and Intracellular Transport Workshop,RC5072176599,RandomWalksandIntracellularTransportWorkshop,both
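
The merged table has 13,808 rows (last index 13807) whereas `hal_conf` has 12,216: each HAL row is repeated once per Revconf entry sharing the same cleaned key, as with `form_id` 40718 and 40724 above. If 'Publication Count' is summed afterwards, these repeats could inflate the totals. The sketch below shows one possible way to guard against this on invented toy data; the choice of de-duplicating on the key and `revconf_siid` is an assumption, not something the notebook does.

```python
import pandas as pd

hal = pd.DataFrame(
    {"clean_form_title": ["CDC2019", "SeminaireLIRIMA"], "Publication Count": [6, 1]}
)
# Two Revconf form entries that normalise to the same key (toy data).
rev = pd.DataFrame(
    {
        "clean_form_title": ["CDC2019", "CDC2019"],
        "form_id": [40718, 40724],
        "revconf_siid": ["RC8095363186", "RC8095363186"],
    }
)

# Naive left join: the CDC row appears twice.
naive = pd.merge(hal, rev, on="clean_form_title", how="left")

# Keep one row per (key, controlled conference id) before merging.
rev_unique = rev.drop_duplicates(subset=["clean_form_title", "revconf_siid"])
# validate raises MergeError if a key still maps to several right-hand rows.
deduped = pd.merge(
    hal, rev_unique, on="clean_form_title", how="left", validate="many_to_one"
)

print(len(naive), naive["Publication Count"].sum())
print(len(deduped), deduped["Publication Count"].sum())
```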


In [12]:
tab_merged['_merge'].unique()

['both', 'left_only']
Categories (3, object): ['left_only', 'right_only', 'both']

In [13]:
tab_merged_bo=tab_merged.loc[tab_merged['_merge']=='both']
tab_merged_bo

Unnamed: 0,Conference,Publication Date,Team Acronym,Publication Count,clean_form_title,clean_title_x,form_id,form_title,form_year,title,revconf_siid,clean_title_y,_merge
0,CDC 2019 - 58th IEEE Conference on Decision an...,2019,NECS-POST,6,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl,40718.0,CDC 2019 - 58th IEEE Conference on Decision an...,2019.0,IEEE Conference on Decision and Control,RC8095363186,IEEEConferenceonDecisionandControl,both
1,CDC 2019 - 58th IEEE Conference on Decision an...,2019,NECS-POST,6,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl,40724.0,CDC 2019 - 58thIEEE Conference on Decision and...,2019.0,IEEE Conference on Decision and Control,RC8095363186,IEEEConferenceonDecisionandControl,both
2,CDC 2019 - 58th IEEE Conference on Decision an...,2019,BIOCORE,4,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl,40718.0,CDC 2019 - 58th IEEE Conference on Decision an...,2019.0,IEEE Conference on Decision and Control,RC8095363186,IEEEConferenceonDecisionandControl,both
3,CDC 2019 - 58th IEEE Conference on Decision an...,2019,BIOCORE,4,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl,40724.0,CDC 2019 - 58thIEEE Conference on Decision and...,2019.0,IEEE Conference on Decision and Control,RC8095363186,IEEEConferenceonDecisionandControl,both
4,CDC 2019 - 58th IEEE Conference on Decision an...,2019,MCTAO,3,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl,40718.0,CDC 2019 - 58th IEEE Conference on Decision an...,2019.0,IEEE Conference on Decision and Control,RC8095363186,IEEEConferenceonDecisionandControl,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...
13804,​ITSC 2018 - 21st IEEE International Conferen...,2018,RITS,1,ITSC201821stIEEEInternationalConferenceonIntel...,ITSC201821stIEEEInternationalConferenceonIntel...,13546.0,ITSC 2018 – 21st IEEE International Conference...,2018.0,International IEEE Conference on Intelligent T...,RC1769683265,InternationalIEEEConferenceonIntelligentTransp...,both
13805,“Bioimage Processing” workshop,2019,SERPICO,1,BioimageProcessingworkshop,BioimageProcessingworkshop,54206.0,"""Bioimage Processing"" workshop",2019.0,Bio-Image Processing Workshop,RC9065248414,BioImageProcessingWorkshop,both
13806,“Bioimage Processing” workshop,2019,SERPICO,1,BioimageProcessingworkshop,BioimageProcessingworkshop,54207.0,“Bioimage Processing” workshop,,Bio-Image Processing Workshop,RC9065248414,BioImageProcessingWorkshop,both
13807,“Random Walks and Intracellular Transport” wor...,2019,SERPICO,1,RandomWalksandIntracellularTransportworkshop,RandomWalksandIntracellularTransportworkshop,54208.0,"""Random Walks and Intracellular Transport"" wor...",2019.0,Random Walks and Intracellular Transport Workshop,RC5072176599,RandomWalksandIntracellularTransportWorkshop,both


In [14]:
tab_merged_lo=tab_merged.loc[tab_merged['_merge']=='left_only']
tab_merged_lo

Unnamed: 0,Conference,Publication Date,Team Acronym,Publication Count,clean_form_title,clean_title_x,form_id,form_title,form_year,title,revconf_siid,clean_title_y,_merge
1138,Séminaire du LIRIMA,2019,ZENITH,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,,left_only
1139,Séminaire du LIRIMA,2019,CIDRE,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,,left_only
1140,Séminaire du LIRIMA,2019,DEDUCTEAM,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,,left_only
1141,Séminaire du LIRIMA,2020,MIMESIS,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,,left_only
1142,Séminaire du LIRIMA,2020,KERDATA,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...
13777,iMIMIC 2022 - Workshop on Interpretability of ...,2022,STATIFY,1,iMIMIC2022WorkshoponInterpretabilityofMachineI...,iMIMIC2022WorkshoponInterpretabilityofMachineI...,,,,,,,left_only
13783,international Conference on 'Future is Urban' ...,2021,STEEP,1,internationalConferenceonFutureisUrban2021,internationalConferenceonFutureisUrban2021,,,,,,,left_only
13785,jspyrene2020 : Journée Scientifique autour du ...,2020,CAGIRE,1,jspyrene2020JournéeScientifiqueautourducluster...,jspyrene2020JournéeScientifiqueautourducluster...,,,,,,,left_only
13799,"École Doctorale Sociétés, Communication, Arts,...",2021,MULTISPEECH,1,ÉcoleDoctoraleSociétésCommunicationArtsLettres...,ÉcoleDoctoraleSociétésCommunicationArtsLettres...,,,,,,,left_only


In [15]:
tab_merged_lo.describe(include='all')


Unnamed: 0,Conference,Publication Date,Team Acronym,Publication Count,clean_form_title,clean_title_x,form_id,form_title,form_year,title,revconf_siid,clean_title_y,_merge
count,892,892.0,892,892.0,892,892,0.0,0.0,0.0,0.0,0.0,0.0,892
unique,790,,215,,785,785,,0.0,,0.0,0.0,0.0,1
top,Séminaire du LIRIMA,,ALMANACH,,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,,left_only
freq,6,,36,,6,6,,,,,,,892
mean,,2020.908072,,1.070628,,,,,,,,,
std,,1.533927,,0.332576,,,,,,,,,
min,,2018.0,,1.0,,,,,,,,,
25%,,2020.0,,1.0,,,,,,,,,
50%,,2021.0,,1.0,,,,,,,,,
75%,,2022.0,,1.0,,,,,,,,,
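
The summary above counts 892 unmatched rows but only 785 distinct `clean_form_title` values, because the same conference name appears for several teams and years. A small sketch of the difference between counting rows and counting distinct keys after a merge with `indicator=True`, on invented toy data:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "b", "c"], "team": ["T1", "T1", "T2", "T3"]})
right = pd.DataFrame({"key": ["a", "c"], "rid": [1, 3]})

merged = pd.merge(left, right, on="key", how="left", indicator=True)

# Row-level view: one count per merged row.
row_counts = merged["_merge"].value_counts()

# Key-level view: each distinct left-hand key is counted once.
unmatched = merged.loc[merged["_merge"] == "left_only"]
unmatched_rows = len(unmatched)
unmatched_keys = unmatched["key"].nunique()

print(row_counts)
print(unmatched_rows, unmatched_keys)
```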


### **<span style="color:#250389" >Second join on "clean_title" and "clean_form_title" </span>**

In [17]:
merged_df = pd.merge(hal_conf, revconf, left_on='clean_title', right_on='clean_form_title', how='inner')
merged_df 

Unnamed: 0,Conference,Publication Date,Team Acronym,Publication Count,clean_form_title_x,clean_title_x,form_id,form_title,form_year,title,revconf_siid,clean_title_y,clean_form_title_y
0,CDC 2019 - 58th IEEE Conference on Decision an...,2019,NECS-POST,6,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl,40718,CDC 2019 - 58th IEEE Conference on Decision an...,2019.0,IEEE Conference on Decision and Control,RC8095363186,IEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl
1,CDC 2019 - 58th IEEE Conference on Decision an...,2019,NECS-POST,6,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl,40724,CDC 2019 - 58thIEEE Conference on Decision and...,2019.0,IEEE Conference on Decision and Control,RC8095363186,IEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl
2,CDC 2019 - 58th IEEE Conference on Decision an...,2019,BIOCORE,4,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl,40718,CDC 2019 - 58th IEEE Conference on Decision an...,2019.0,IEEE Conference on Decision and Control,RC8095363186,IEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl
3,CDC 2019 - 58th IEEE Conference on Decision an...,2019,BIOCORE,4,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl,40724,CDC 2019 - 58thIEEE Conference on Decision and...,2019.0,IEEE Conference on Decision and Control,RC8095363186,IEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl
4,CDC 2019 - 58th IEEE Conference on Decision an...,2019,MCTAO,3,CDC201958thIEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl,40718,CDC 2019 - 58th IEEE Conference on Decision an...,2019.0,IEEE Conference on Decision and Control,RC8095363186,IEEEConferenceonDecisionandControl,CDC201958thIEEEConferenceonDecisionandControl
...,...,...,...,...,...,...,...,...,...,...,...,...,...
12912,École d’Été Temps Réel 2021,2021,KAIROS,1,ÉcoledÉtéTempsRéel2021,ÉcoledÉtéTempsRéel2021,83727,École d’Été Temps Réel 2021,2021.0,École d'Été Temps Réel,RC5242169440,ÉcoledÉtéTempsRéel,ÉcoledÉtéTempsRéel2021
12913,“Bioimage Processing” workshop,2019,SERPICO,1,BioimageProcessingworkshop,BioimageProcessingworkshop,54206,"""Bioimage Processing"" workshop",2019.0,Bio-Image Processing Workshop,RC9065248414,BioImageProcessingWorkshop,BioimageProcessingworkshop
12914,“Bioimage Processing” workshop,2019,SERPICO,1,BioimageProcessingworkshop,BioimageProcessingworkshop,54207,“Bioimage Processing” workshop,,Bio-Image Processing Workshop,RC9065248414,BioImageProcessingWorkshop,BioimageProcessingworkshop
12915,“Random Walks and Intracellular Transport” wor...,2019,SERPICO,1,RandomWalksandIntracellularTransportworkshop,RandomWalksandIntracellularTransportworkshop,54208,"""Random Walks and Intracellular Transport"" wor...",2019.0,Random Walks and Intracellular Transport Workshop,RC5072176599,RandomWalksandIntracellularTransportWorkshop,RandomWalksandIntracellularTransportworkshop
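
Since `hal_conf['clean_title']` and `hal_conf['clean_form_title']` are both computed from 'Conference' in the same way, this inner join would be expected to return the same matches as the `both` rows of the first join. The row counts are consistent with that reading: 13,808 merged rows minus 892 `left_only` rows gives 12,916, which is the length of `merged_df` (last index 12915). The sketch below reproduces the equivalence on invented toy data.

```python
import pandas as pd

hal = pd.DataFrame({"Conference": ["CDC 2019", "Séminaire du LIRIMA"]})
# Both key columns are derived from 'Conference' identically (toy normalisation).
hal["clean_form_title"] = hal["Conference"].str.replace(" ", "", regex=False)
hal["clean_title"] = hal["Conference"].str.replace(" ", "", regex=False)

rev = pd.DataFrame({"clean_form_title": ["CDC2019"], "revconf_siid": ["RC8095363186"]})

# First approach: left join, then keep the matched rows.
first = pd.merge(hal, rev, on="clean_form_title", how="left", indicator=True)
first_both = first.loc[first["_merge"] == "both"]

# Second approach: inner join using the other (identical) key column.
second = pd.merge(
    hal, rev, left_on="clean_title", right_on="clean_form_title", how="inner"
)

same_rows = first_both["Conference"].tolist() == second["Conference"].tolist()
print(same_rows)
```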


In [18]:
tab_merged_lo

Unnamed: 0,Conference,Publication Date,Team Acronym,Publication Count,clean_form_title,clean_title_x,form_id,form_title,form_year,title,revconf_siid,clean_title_y,_merge
1138,Séminaire du LIRIMA,2019,ZENITH,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,,left_only
1139,Séminaire du LIRIMA,2019,CIDRE,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,,left_only
1140,Séminaire du LIRIMA,2019,DEDUCTEAM,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,,left_only
1141,Séminaire du LIRIMA,2020,MIMESIS,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,,left_only
1142,Séminaire du LIRIMA,2020,KERDATA,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...
13777,iMIMIC 2022 - Workshop on Interpretability of ...,2022,STATIFY,1,iMIMIC2022WorkshoponInterpretabilityofMachineI...,iMIMIC2022WorkshoponInterpretabilityofMachineI...,,,,,,,left_only
13783,international Conference on 'Future is Urban' ...,2021,STEEP,1,internationalConferenceonFutureisUrban2021,internationalConferenceonFutureisUrban2021,,,,,,,left_only
13785,jspyrene2020 : Journée Scientifique autour du ...,2020,CAGIRE,1,jspyrene2020JournéeScientifiqueautourducluster...,jspyrene2020JournéeScientifiqueautourducluster...,,,,,,,left_only
13799,"École Doctorale Sociétés, Communication, Arts,...",2021,MULTISPEECH,1,ÉcoleDoctoraleSociétésCommunicationArtsLettres...,ÉcoleDoctoraleSociétésCommunicationArtsLettres...,,,,,,,left_only


In [19]:
revconf

Unnamed: 0,form_id,form_title,form_year,title,revconf_siid,clean_title,clean_form_title
0,1,"I3E 2008 : 8th IFIP Conference on e-Business, ...",2008.0,"IFIP Conference on e-Business, e-Services, and...",RC2040646750,IFIPConferenceoneBusinesseServicesandeSociety,I3E20088thIFIPConferenceoneBusinesseServicesan...
1,2,"8th IFIP Conference on e-Business, e-Services ...",2008.0,"IFIP Conference on e-Business, e-Services, and...",RC2040646750,IFIPConferenceoneBusinesseServicesandeSociety,8thIFIPConferenceoneBusinesseServicesandeSocie...
2,3,"The 8th IFIP Conference on e-Business, e-Servi...",2008.0,"IFIP Conference on e-Business, e-Services, and...",RC2040646750,IFIPConferenceoneBusinesseServicesandeSociety,The8thIFIPConferenceoneBusinesseServicesandeSo...
3,4,"The 8th IFIP Conference on e-Business, e-Servi...",2008.0,"IFIP Conference on e-Business, e-Services, and...",RC2040646750,IFIPConferenceoneBusinesseServicesandeSociety,The8thIFIPConferenceoneBusinesseServicesandeSo...
4,5,"I3E 2009 : 9th IFIP Conference on e-Business, ...",2009.0,"IFIP Conference on e-Business, e-Services, and...",RC2040646750,IFIPConferenceoneBusinesseServicesandeSociety,I3E20099thIFIPConferenceoneBusinesseServicesan...
...,...,...,...,...,...,...,...
64147,89998,Ibero-Latin-American Congress on Computational...,,Ibero-Latin-American Congress on Computational...,RC2450539386,IberoLatinAmericanCongressonComputationalMetho...,IberoLatinAmericanCongressonComputationalMetho...
64148,84862,CLAIB 2022 / CBEB 2022 - IX Latin American Con...,2022.0,Latin American Congress on Biomedical Engineer...,RC5965967315,LatinAmericanCongressonBiomedicalEngineeringBr...,CLAIB2022CBEB2022IXLatinAmericanCongressonBiom...
64149,89999,Latin American Congress on Biomedical Engineer...,,Latin American Congress on Biomedical Engineer...,RC5965967315,LatinAmericanCongressonBiomedicalEngineeringBr...,LatinAmericanCongressonBiomedicalEngineeringBr...
64150,82213,CLPsyc 2019 - Sixth Workshop on Computational ...,2019.0,Workshop on Computational Linguistics and Clin...,RC2020474300,WorkshoponComputationalLinguisticsandClinicalP...,CLPsyc2019SixthWorkshoponComputationalLinguist...


### **<span style="color:#250389" > Third join on the "left_only" rows (merging clean_form_title against clean_title) </span>**

In [21]:
# tab_merged_lo_test = tab_merged_lo.rename(columns={'clean_form_title' : 'clean_title2'})
tab_merged_lo = tab_merged_lo.drop(columns=['_merge'])
# revconf_test = revconf.rename(columns={'clean_title' : 'clean_title2'})

In [22]:
tab_merged_lo

Unnamed: 0,Conference,Publication Date,Team Acronym,Publication Count,clean_form_title,clean_title_x,form_id,form_title,form_year,title,revconf_siid,clean_title_y
1138,Séminaire du LIRIMA,2019,ZENITH,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,
1139,Séminaire du LIRIMA,2019,CIDRE,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,
1140,Séminaire du LIRIMA,2019,DEDUCTEAM,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,
1141,Séminaire du LIRIMA,2020,MIMESIS,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,
1142,Séminaire du LIRIMA,2020,KERDATA,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
13777,iMIMIC 2022 - Workshop on Interpretability of ...,2022,STATIFY,1,iMIMIC2022WorkshoponInterpretabilityofMachineI...,iMIMIC2022WorkshoponInterpretabilityofMachineI...,,,,,,
13783,international Conference on 'Future is Urban' ...,2021,STEEP,1,internationalConferenceonFutureisUrban2021,internationalConferenceonFutureisUrban2021,,,,,,
13785,jspyrene2020 : Journée Scientifique autour du ...,2020,CAGIRE,1,jspyrene2020JournéeScientifiqueautourducluster...,jspyrene2020JournéeScientifiqueautourducluster...,,,,,,
13799,"École Doctorale Sociétés, Communication, Arts,...",2021,MULTISPEECH,1,ÉcoleDoctoraleSociétésCommunicationArtsLettres...,ÉcoleDoctoraleSociétésCommunicationArtsLettres...,,,,,,


In [23]:
revconf

Unnamed: 0,form_id,form_title,form_year,title,revconf_siid,clean_title,clean_form_title
0,1,"I3E 2008 : 8th IFIP Conference on e-Business, ...",2008.0,"IFIP Conference on e-Business, e-Services, and...",RC2040646750,IFIPConferenceoneBusinesseServicesandeSociety,I3E20088thIFIPConferenceoneBusinesseServicesan...
1,2,"8th IFIP Conference on e-Business, e-Services ...",2008.0,"IFIP Conference on e-Business, e-Services, and...",RC2040646750,IFIPConferenceoneBusinesseServicesandeSociety,8thIFIPConferenceoneBusinesseServicesandeSocie...
2,3,"The 8th IFIP Conference on e-Business, e-Servi...",2008.0,"IFIP Conference on e-Business, e-Services, and...",RC2040646750,IFIPConferenceoneBusinesseServicesandeSociety,The8thIFIPConferenceoneBusinesseServicesandeSo...
3,4,"The 8th IFIP Conference on e-Business, e-Servi...",2008.0,"IFIP Conference on e-Business, e-Services, and...",RC2040646750,IFIPConferenceoneBusinesseServicesandeSociety,The8thIFIPConferenceoneBusinesseServicesandeSo...
4,5,"I3E 2009 : 9th IFIP Conference on e-Business, ...",2009.0,"IFIP Conference on e-Business, e-Services, and...",RC2040646750,IFIPConferenceoneBusinesseServicesandeSociety,I3E20099thIFIPConferenceoneBusinesseServicesan...
...,...,...,...,...,...,...,...
64147,89998,Ibero-Latin-American Congress on Computational...,,Ibero-Latin-American Congress on Computational...,RC2450539386,IberoLatinAmericanCongressonComputationalMetho...,IberoLatinAmericanCongressonComputationalMetho...
64148,84862,CLAIB 2022 / CBEB 2022 - IX Latin American Con...,2022.0,Latin American Congress on Biomedical Engineer...,RC5965967315,LatinAmericanCongressonBiomedicalEngineeringBr...,CLAIB2022CBEB2022IXLatinAmericanCongressonBiom...
64149,89999,Latin American Congress on Biomedical Engineer...,,Latin American Congress on Biomedical Engineer...,RC5965967315,LatinAmericanCongressonBiomedicalEngineeringBr...,LatinAmericanCongressonBiomedicalEngineeringBr...
64150,82213,CLPsyc 2019 - Sixth Workshop on Computational ...,2019.0,Workshop on Computational Linguistics and Clin...,RC2020474300,WorkshoponComputationalLinguisticsandClinicalP...,CLPsyc2019SixthWorkshoponComputationalLinguist...


In [24]:
tab_merged3 = pd.merge(tab_merged_lo, revconf, left_on='clean_form_title', right_on='clean_title', how='inner', indicator=True)
tab_merged3 

Unnamed: 0,Conference,Publication Date,Team Acronym,Publication Count,clean_form_title_x,clean_title_x,form_id_x,form_title_x,form_year_x,title_x,revconf_siid_x,clean_title_y,form_id_y,form_title_y,form_year_y,title_y,revconf_siid_y,clean_title,clean_form_title_y,_merge
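
This join returns no rows. The data shown do not establish why, but two things are visible in the tables above: controlled titles usually omit the year and acronym that HAL names carry, and the keys are compared case-sensitively. The sketch below, on invented data with hypothetical names and identifiers, illustrates the second point and also keeps only the HAL-side columns before re-merging, which avoids the `_x`/`_y` suffixes seen in the header above.

```python
import pandas as pd

# Unmatched rows from a first pass, still carrying empty Revconf columns (toy data).
left_only = pd.DataFrame(
    {
        "Conference": ["Workshop on foo modelling"],
        "clean_form_title": ["Workshoponfoomodelling"],
        "title": [None],
        "revconf_siid": [None],
    }
)
rev = pd.DataFrame(
    {
        "title": ["Workshop on Foo Modelling"],
        "revconf_siid": ["RC0000000000"],  # hypothetical identifier
        "clean_title": ["WorkshoponFooModelling"],
    }
)

# Keep only the HAL-side columns so the new merge adds no _x/_y suffixes.
hal_cols = left_only[["Conference", "clean_form_title"]]

# Exact, case-sensitive comparison: no match.
exact = pd.merge(
    hal_cols, rev, left_on="clean_form_title", right_on="clean_title", how="inner"
)

# Case-insensitive variant: compare lowercased keys instead.
hal_ci = hal_cols.assign(key=hal_cols["clean_form_title"].str.lower())
rev_ci = rev.assign(key=rev["clean_title"].str.lower())
relaxed = pd.merge(hal_ci, rev_ci, on="key", how="inner")

print(len(exact), len(relaxed))
```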


In [25]:
# tab_merged3=pd.merge(tab_merged_lo_test, revconf_test, on=['clean_title2'] , how='left', indicator=True)
# tab_merged3

In [26]:
tab_merged3['_merge'].unique()

[], Categories (3, object): ['left_only', 'right_only', 'both']

In [27]:
tab_merged_lo=tab_merged.loc[tab_merged['_merge']=='left_only']
tab_merged_lo

Unnamed: 0,Conference,Publication Date,Team Acronym,Publication Count,clean_form_title,clean_title_x,form_id,form_title,form_year,title,revconf_siid,clean_title_y,_merge
1138,Séminaire du LIRIMA,2019,ZENITH,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,,left_only
1139,Séminaire du LIRIMA,2019,CIDRE,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,,left_only
1140,Séminaire du LIRIMA,2019,DEDUCTEAM,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,,left_only
1141,Séminaire du LIRIMA,2020,MIMESIS,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,,left_only
1142,Séminaire du LIRIMA,2020,KERDATA,1,SéminaireduLIRIMA,SéminaireduLIRIMA,,,,,,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...
13777,iMIMIC 2022 - Workshop on Interpretability of ...,2022,STATIFY,1,iMIMIC2022WorkshoponInterpretabilityofMachineI...,iMIMIC2022WorkshoponInterpretabilityofMachineI...,,,,,,,left_only
13783,international Conference on 'Future is Urban' ...,2021,STEEP,1,internationalConferenceonFutureisUrban2021,internationalConferenceonFutureisUrban2021,,,,,,,left_only
13785,jspyrene2020 : Journée Scientifique autour du ...,2020,CAGIRE,1,jspyrene2020JournéeScientifiqueautourducluster...,jspyrene2020JournéeScientifiqueautourducluster...,,,,,,,left_only
13799,"École Doctorale Sociétés, Communication, Arts,...",2021,MULTISPEECH,1,ÉcoleDoctoraleSociétésCommunicationArtsLettres...,ÉcoleDoctoraleSociétésCommunicationArtsLettres...,,,,,,,left_only


### **<span style="color:#250389" > FILTERING ON NEURIPS, ICML, AAAI, AISTATS AND THE TEAMS </span>**

In [29]:
print('List of "SigleBastri" values to keep for filtering:','\n')
print(df2['SigleBastri'].unique())

List of "SigleBastri" values to keep for filtering: 

['ABS' 'AISTROSIGHT' 'AMIB' 'AMIBIO' 'ARAMIS' 'ASCLEPIOS' 'ATHENA'
 'BAMBOO' 'BEAGLE' 'BIGS' 'BIOCORE' 'BIOVISION' 'BONSAI' 'CAMIN' 'CAPSID'
 'CARMEN' 'COMMEDIA' 'COMPO' 'CRONOS' 'DEMAR' 'DRACULA' 'DYLISS' 'EMPENN'
 'EPIMETHEE' 'EPIONE' 'ERABLE' 'GALEN' 'GENSCALE' 'HEKA' 'IBIS' 'INBIO'
 'LIFEWARE' 'M3DISIM' 'MACBES' 'MAGNOME' 'MAMBA' 'MASAIE' 'MATHNEURO'
 'MERGE' 'MICROCOSME' 'MIMESIS' 'MIND' 'MNEMOSYNE' 'MODEMIC' 'MONC'
 'MORPHEME' 'MOSAIC' 'MUSCA' 'MUSCLEES' 'MYCENAE' 'NERV' 'NEUROMATHCOMP'
 'NEUROSYS' 'NUMED' 'OPIS' 'ORCHESTRON' 'PARIETAL' 'PLEIADE' 'POPIX'
 'PREMEDICAL' 'REO' 'SAIRPICO' 'SERPICO' 'SHACRA' 'SIMBA' 'SIMBIOTX'
 'SISTM' 'SISYPHE' 'SODA' 'TAPDANCE' 'VIRTUALPLANTS' 'VISAGES' 'XPOP']


In [30]:
print('List of "sigle" values to keep for filtering:','\n')
print(df2['sigle'].unique())


List of "sigle" values to keep for filtering: 

['ABS' nan 'AMIB' 'AMIBIO' 'ARAMIS' 'ASCLEPIOS' 'ATHENA' 'BAMBOO' 'BEAGLE'
 'BIGS' 'BIOCORE' 'BIOVISION' 'BONSAI' 'CAMIN' 'CAPSID' 'CARMEN'
 'COMMEDIA' 'COMPO' 'DEMAR' 'DRACULA' 'DYLISS' 'VISAGES' 'EPIONE' 'ERABLE'
 'GALEN' 'GALEN-POST' 'GENSCALE' 'HEKA' 'IBIS' 'INBIO' 'LIFEWARE'
 'M3DISIM' 'MAGNOME' 'MAMBA' 'MASAIE' 'MATHNEURO' 'MICROCOSME' 'MIMESIS'
 'MIND' 'MNEMOSYNE' 'MODEMIC' 'MONC' 'MORPHEME' 'MOSAIC' 'MUSCA' 'MYCENAE'
 'NEUROMATHCOMP' 'NEUROSYS' 'NUMED' 'OPIS' 'ORCHESTRON' 'PARIETAL'
 'PLEIADE' 'POPIX' 'PREMEDICAL' 'REO' 'REO-POST' 'SERPICO' 'SHACRA'
 'SIMBIOTX' 'SISTM' 'SISYPHE' 'SODA' 'TAPDANCE' 'VIRTUALPLANTS' 'XPOP']
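
The two printed lists are not identical: some acronyms appear only in `SigleBastri` (for example 'AISTROSIGHT'), and others only in `sigle` (for example 'GALEN-POST' and 'REO-POST'). Here is a minimal sketch of how the difference could be listed. It uses a toy frame whose row pairing is purely illustrative and is not taken from the real `df2`:

```python
import pandas as pd

# Toy stand-in for df2: the row pairing below is an assumption for illustration.
demo = pd.DataFrame({
    "SigleBastri": ["ABS", "AISTROSIGHT", "GALEN", "GALEN", "REO", "REO"],
    "sigle": ["ABS", None, "GALEN", "GALEN-POST", "REO", "REO-POST"],
})

# dropna() removes the missing values that show up as nan in unique().
bastri = set(demo["SigleBastri"].dropna())
sigle = set(demo["sigle"].dropna())

only_bastri = sorted(bastri - sigle)  # acronyms missing from 'sigle'
only_sigle = sorted(sigle - bastri)   # acronyms missing from 'SigleBastri'

print("Only in SigleBastri:", only_bastri)
print("Only in sigle:", only_sigle)
```

The column used to build the filter list determines which teams survive the filtering step.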


In [31]:
df2.describe(include='all')

Unnamed: 0,siidRNSR,Thème anglais,SigleBastri,sigle,NonPertinentDansHal,siidEquipeGroupe,idStructureHal
count,101,101,101,89,0.0,89,101
unique,74,3,73,65,,65,74
top,201421109N,Computational Neuroscience and Medicine,MAMBA,MAMBA,,GS0604V,240018;454654;542023;1005056
freq,4,42,4,4,,4,4
mean,,,,,,,
std,,,,,,,
min,,,,,,,
25%,,,,,,,
50%,,,,,,,
75%,,,,,,,


In [32]:
# title_to_keep = ['Annual Conference on Neural Information Processing Systems','International Conference on Machine Learning', 
                   # 'International Conference on Artificial Intelligence and Statistics', 'National Conference on Artificial Intelligence']

# Filter on the controlled titles (via their Revconf identifiers) and on the teams
id_to_keep = ['RC9798641609','RC6493598945','RC8357518104','RC7612516265']

# team_acronym_to_keep = ['ABS', 'AISTROSIGHT', 'AMIB', 'AMIBIO', 'ARAMIS', 'ASCLEPIOS',
#        'ATHENA', 'BAMBOO', 'BEAGLE', 'BIGS', 'BIOCORE', 'BIOVISION',
#        'BONSAI', 'CAMIN', 'CAPSID', 'CARMEN', 'COMMEDIA', 'COMPO',
#        'CRONOS', 'DEMAR', 'DRACULA', 'DYLISS', 'EMPENN', 'EPIMETHEE',
#        'EPIONE', 'ERABLE', 'GALEN', 'GENSCALE', 'HEKA', 'IBIS', 'INBIO',
#        'LIFEWARE', 'M3DISIM', 'MACBES', 'MAGNOME', 'MAMBA', 'MASAIE',
#        'MATHNEURO', 'MERGE', 'MICROCOSME', 'MIMESIS', 'MIND', 'MNEMOSYNE',
#        'MODEMIC', 'MONC', 'MORPHEME', 'MOSAIC', 'MUSCA', 'MUSCLEES',
#        'MYCENAE', 'NERV', 'NEUROMATHCOMP', 'NEUROSYS', 'NUMED', 'OPIS',
#        'ORCHESTRON', 'PARIETAL', 'PLEIADE', 'POPIX', 'PREMEDICAL', 'REO',
#        'SAIRPICO', 'SERPICO', 'SHACRA', 'SIMBA', 'SIMBIOTX', 'SISTM',
#        'SISYPHE', 'SODA', 'TAPDANCE', 'VIRTUAL PLANTS', 'VISAGES', 'XPOP']

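# Note: df2['sigle'] prints 'VIRTUALPLANTS' (no space) while this list uses 'VIRTUAL PLANTS';
# check which spelling the 'Team Acronym' column actually contains.
team_acronym_to_keep =['ABS' , 'AMIB' ,'AMIBIO', 'ARAMIS', 'ASCLEPIOS' ,'ATHENA' ,'BAMBOO', 'BEAGLE',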
 'BIGS' ,'BIOCORE' ,'BIOVISION', 'BONSAI', 'CAMIN' ,'CAPSID' ,'CARMEN',
 'COMMEDIA' ,'COMPO', 'DEMAR' ,'DRACULA', 'DYLISS' ,'VISAGES', 'EPIONE', 'ERABLE',
 'GALEN' ,'GALEN-POST' ,'GENSCALE' ,'HEKA' ,'IBIS' ,'INBIO', 'LIFEWARE',
 'M3DISIM' ,'MAGNOME' ,'MAMBA' ,'MASAIE', 'MATHNEURO' ,'MICROCOSME' ,'MIMESIS',
 'MIND' , 'MNEMOSYNE', 'MODEMIC' ,'MONC' ,'MORPHEME', 'MOSAIC' ,'MUSCA', 'MYCENAE',
 'NEUROMATHCOMP', 'NEUROSYS', 'NUMED', 'OPIS' ,'ORCHESTRON', 'PARIETAL',
 'PLEIADE' ,'POPIX' ,'PREMEDICAL', 'REO', 'REO-POST', 'SERPICO', 'SHACRA',
 'SIMBIOTX', 'SISTM' ,'SISYPHE', 'SODA', 'TAPDANCE' ,'VIRTUAL PLANTS' ,'XPOP']


ml_ia_publications = merged_df[merged_df['revconf_siid'].isin(id_to_keep)]
ml_ia_publications = ml_ia_publications[ml_ia_publications['Team Acronym'].isin(team_acronym_to_keep)]
# Drop the helper columns that are no longer needed
ml_ia_publications = ml_ia_publications.drop(columns=['clean_title_x', 'clean_title_y', 'clean_form_title_x', 'clean_form_title_y', 'form_id', 'form_year', 'form_title'])

ml_ia_publications

Unnamed: 0,Conference,Publication Date,Team Acronym,Publication Count,title,revconf_siid
46,NeurIPS 2020 - 34th Conference on Neural Infor...,2020,PARIETAL,4,Annual Conference on Neural Information Proces...,RC9798641609
47,NeurIPS 2020 - 34th Conference on Neural Infor...,2020,XPOP,3,Annual Conference on Neural Information Proces...,RC9798641609
89,NeurIPS 2023 - 37th Conference on Neural Infor...,2023,MIND,3,Annual Conference on Neural Information Proces...,RC9798641609
90,NeurIPS 2023 - 37th Conference on Neural Infor...,2023,MIND,3,Annual Conference on Neural Information Proces...,RC9798641609
91,NeurIPS 2023 - 37th Conference on Neural Infor...,2023,MIND,3,Annual Conference on Neural Information Proces...,RC9798641609
...,...,...,...,...,...,...
10815,NeurIPS 2023 Workshop on Diffusion Models,2023,EPIONE,1,Annual Conference on Neural Information Proces...,RC9798641609
10824,Neural Information Processing Systems (NeurIPS...,2023,MONC,1,Annual Conference on Neural Information Proces...,RC9798641609
11067,PMLR 2022 - Proceedings of Machine Learning Re...,2022,PREMEDICAL,1,International Conference on Machine Learning,RC6493598945
11279,Proceedings of the Neural Information Processi...,2022,HEKA,1,Annual Conference on Neural Information Proces...,RC9798641609
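
The hard-coded acronym list in the cell above mirrors the non-null values of `df2['sigle']`. Here is a minimal sketch of how such a list could be derived from the reference table and applied together with the identifier filter. The frames `df2_demo` and `merged_demo` are hypothetical stand-ins with made-up rows, not the notebook's data:

```python
import pandas as pd

# Toy stand-ins for df2 and merged_df (illustrative values only).
df2_demo = pd.DataFrame({"sigle": ["ABS", None, "MIND", "PARIETAL", "MIND"]})
merged_demo = pd.DataFrame({
    "Team Acronym": ["MIND", "ZENITH", "PARIETAL", "MIND"],
    "revconf_siid": ["RC9798641609", "RC9798641609", "RC6493598945", None],
    "Publication Count": [3, 1, 4, 1],
})

id_to_keep = ["RC9798641609", "RC6493598945"]

# Build the acronym list from the reference table instead of typing it by hand.
team_acronym_to_keep = sorted(df2_demo["sigle"].dropna().unique())

# Combine both conditions in a single boolean mask.
mask = (
    merged_demo["revconf_siid"].isin(id_to_keep)
    & merged_demo["Team Acronym"].isin(team_acronym_to_keep)
)
filtered = merged_demo[mask]
print(filtered)
```

`isin` compares strings exactly, so the acronyms in the reference table must be spelled the same way as in 'Team Acronym' (case and spaces included).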


In [33]:
ml_ia_publications.describe(include='all')

Unnamed: 0,Conference,Publication Date,Team Acronym,Publication Count,title,revconf_siid
count,126,126.0,126,126.0,126,126
unique,80,,19,,4,4
top,NeurIPS 2023 – 37th Conference on Neural Infor...,,PARIETAL,,Annual Conference on Neural Information Proces...,RC9798641609
freq,9,,40,,73,73
mean,,2021.031746,,1.134921,,
std,,1.609653,,0.479219,,
min,,2018.0,,1.0,,
25%,,2020.0,,1.0,,
50%,,2021.0,,1.0,,
75%,,2022.0,,1.0,,


In [34]:
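# Note: the comparison is case-sensitive and the kept acronyms are uppercase (e.g. 'MNEMOSYNE'),
# so 'Mnemosyne' matches no row and the result is empty.
ml_ia_publicationsta = ml_ia_publications[ml_ia_publications['Team Acronym'] == 'Mnemosyne']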
ml_ia_publicationsta

Unnamed: 0,Conference,Publication Date,Team Acronym,Publication Count,title,revconf_siid


As expected, there are 4 distinct "title" values and 4 distinct "revconf_siid" values.
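
The `describe` output only shows that both columns have the same number of distinct values. A short sketch of a stricter check, that each title maps to exactly one identifier, is given below. The frame `demo` and its values are hypothetical:

```python
import pandas as pd

# Toy data: three conferences, each with its own identifier.
demo = pd.DataFrame({
    "title": ["Conf A", "Conf A", "Conf B", "Conf C"],
    "revconf_siid": ["RC1", "RC1", "RC2", "RC3"],
})

n_titles = demo["title"].nunique()
n_ids = demo["revconf_siid"].nunique()

# Number of distinct identifiers attached to each title.
ids_per_title = demo.groupby("title")["revconf_siid"].nunique()

# One-to-one pairing: every title has a single id and the counts agree.
one_to_one = bool((ids_per_title == 1).all()) and n_titles == n_ids
print(n_titles, n_ids, one_to_one)
```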

In [36]:
ml_ia_publications.columns

Index(['Conference', 'Publication Date', 'Team Acronym', 'Publication Count',
       'title', 'revconf_siid'],
      dtype='object')

In [37]:


# # 1. Duplicated index values
# index_duplicated = ml_ia_publications.index.duplicated()
# if index_duplicated.any():
#     print("Some index values are duplicated. Duplicated rows:")
#     print(ml_ia_publications[index_duplicated])

# # 2. Fully duplicated rows
# rows_duplicated = ml_ia_publications.duplicated()
# if rows_duplicated.any():
#     print("Some rows are duplicated. Duplicated rows:")
#     print( ml_ia_publications[rows_duplicated])

# # 3. Duplicates based on specific columns
# cols_to_check = ['Conference', 'Publication Date', 'Publication Count', 'title', 'revconf_siid']  # Replace with the relevant columns
# cols_duplicated = ml_ia_publications.duplicated(subset=cols_to_check)
# if cols_duplicated.any():
#     print(f"There are duplicates based on columns {cols_to_check}. Duplicated rows:")
#     print(ml_ia_publications[cols_duplicated])

# # Optional: drop the duplicates (keep the first occurrence)
# ml_ia_publications = ml_ia_publications.drop_duplicates()
# print("DataFrame without duplicates:")
# print(ml_ia_publications)

# # Optional: reset the index if needed
# ml_ia_publications= ml_ia_publications.reset_index(drop=True)
# print("DataFrame without duplicates, index reset:")
# ml_ia_publications


In [38]:
# Remove duplicates (same conference, count, controlled title and team)
ml_ia_publications = ml_ia_publications.drop_duplicates(subset=['Conference', 'Publication Count','title', 'Team Acronym'])

# Reorder the columns ('Team Acronym' is not kept from here on)
new_order = ['title','revconf_siid','Conference', 'Publication Date', 'Publication Count']
ml_ia_publications = ml_ia_publications[new_order]

ml_ia_publications

Unnamed: 0,title,revconf_siid,Conference,Publication Date,Publication Count
46,Annual Conference on Neural Information Proces...,RC9798641609,NeurIPS 2020 - 34th Conference on Neural Infor...,2020,4
47,Annual Conference on Neural Information Proces...,RC9798641609,NeurIPS 2020 - 34th Conference on Neural Infor...,2020,3
89,Annual Conference on Neural Information Proces...,RC9798641609,NeurIPS 2023 - 37th Conference on Neural Infor...,2023,3
98,Annual Conference on Neural Information Proces...,RC9798641609,NeurIPS 2023 - 37th Conference on Neural Infor...,2023,1
101,Annual Conference on Neural Information Proces...,RC9798641609,NeurIPS 2023 - 37th Conference on Neural Infor...,2023,1
...,...,...,...,...,...
10815,Annual Conference on Neural Information Proces...,RC9798641609,NeurIPS 2023 Workshop on Diffusion Models,2023,1
10824,Annual Conference on Neural Information Proces...,RC9798641609,Neural Information Processing Systems (NeurIPS...,2023,1
11067,International Conference on Machine Learning,RC6493598945,PMLR 2022 - Proceedings of Machine Learning Re...,2022,1
11279,Annual Conference on Neural Information Proces...,RC9798641609,Proceedings of the Neural Information Processi...,2022,1
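
The deduplication step matters for the totals: repeated rows, such as the three identical MIND rows at index 89 to 91 in the earlier output, would otherwise be summed several times. A small sketch of the effect on a toy frame follows. The EPIONE row and the shortened names are illustrative assumptions:

```python
import pandas as pd

# Toy frame: one (conference, team) pair repeated three times, plus one distinct row.
demo = pd.DataFrame(
    {
        "Conference": ["NeurIPS 2023"] * 4,
        "Team Acronym": ["MIND", "MIND", "MIND", "EPIONE"],
        "Publication Count": [3, 3, 3, 1],
        "title": ["NeurIPS"] * 4,
    },
    index=[89, 90, 91, 98],
)

subset = ["Conference", "Publication Count", "title", "Team Acronym"]

# duplicated() flags every repeat after the first occurrence.
dup_mask = demo.duplicated(subset=subset)
deduped = demo.drop_duplicates(subset=subset)

total_before = int(demo["Publication Count"].sum())
total_after = int(deduped["Publication Count"].sum())
print(total_before, total_after)
```

On this toy data the summed count drops from 10 to 4 once the repeats are removed.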


In [39]:
# Check
neurips = ml_ia_publications[ml_ia_publications['title'] == 'Annual Conference on Neural Information Processing Systems']
neurips

Unnamed: 0,title,revconf_siid,Conference,Publication Date,Publication Count
46,Annual Conference on Neural Information Proces...,RC9798641609,NeurIPS 2020 - 34th Conference on Neural Infor...,2020,4
47,Annual Conference on Neural Information Proces...,RC9798641609,NeurIPS 2020 - 34th Conference on Neural Infor...,2020,3
89,Annual Conference on Neural Information Proces...,RC9798641609,NeurIPS 2023 - 37th Conference on Neural Infor...,2023,3
98,Annual Conference on Neural Information Proces...,RC9798641609,NeurIPS 2023 - 37th Conference on Neural Infor...,2023,1
101,Annual Conference on Neural Information Proces...,RC9798641609,NeurIPS 2023 - 37th Conference on Neural Infor...,2023,1
128,Annual Conference on Neural Information Proces...,RC9798641609,NeurIPS 2023 – 37th Conference on Neural Infor...,2023,1
131,Annual Conference on Neural Information Proces...,RC9798641609,NeurIPS 2023 – 37th Conference on Neural Infor...,2023,1
134,Annual Conference on Neural Information Proces...,RC9798641609,NeurIPS 2023 – 37th Conference on Neural Infor...,2023,1
300,Annual Conference on Neural Information Proces...,RC9798641609,NeurIPS 2019 - Thirty-third Conference on Neur...,2019,1
309,Annual Conference on Neural Information Proces...,RC9798641609,NeurIPS 2022 - Thirty-sixth Conference on Neur...,2022,1


In [40]:
# CHECK
# Keep only the rows whose 'Publication Date' is 2023
neurips2023 = neurips[neurips['Publication Date'] == 2023]
neurips2023
# Sum the 'Publication Count' column over the filtered rows
publication_count_sum = neurips2023['Publication Count'].sum()
publication_count_sum

13

In [41]:

pd.set_option('display.max_rows', None)  # Show all rows

table = pd.pivot_table(ml_ia_publications, values='Publication Count', index=['title'],
                       columns=['Publication Date'], aggfunc="sum")
table.fillna(0, inplace = True)

table['Total'] = table.sum(axis=1)
table

Publication Date,2018,2019,2020,2021,2022,2023,Total
title,,,,,,,
Annual Conference on Neural Information Processing Systems,6.0,8.0,15.0,6.0,18.0,13.0,66.0
International Conference on Artificial Intelligence and Statistics,1.0,2.0,5.0,2.0,1.0,0.0,11.0
International Conference on Machine Learning,5.0,4.0,5.0,5.0,3.0,5.0,27.0
National Conference on Artificial Intelligence,0.0,1.0,1.0,5.0,1.0,1.0,9.0
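
The same kind of table can be obtained in a single call. This is a sketch under the assumption that `fill_value` and `margins` are acceptable replacements for the separate `fillna` and `sum` steps. It uses hypothetical toy data rather than `ml_ia_publications`:

```python
import pandas as pd

# Toy data: conference "B" has no publication in 2018.
demo = pd.DataFrame({
    "title": ["A", "A", "A", "B"],
    "Publication Date": [2018, 2018, 2019, 2019],
    "Publication Count": [1, 2, 4, 5],
})

table = pd.pivot_table(
    demo,
    values="Publication Count",
    index="title",
    columns="Publication Date",
    aggfunc="sum",
    fill_value=0,          # replaces missing (title, year) cells with 0
    margins=True,          # adds totals
    margins_name="Total",
)
print(table)
```

Unlike `table['Total'] = table.sum(axis=1)`, `margins=True` adds a total row as well as a total column.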


In [79]:
table.to_excel('tableau_confs.xlsx', index=True)

In [43]:
# API check:
#  https://api.archives-ouvertes.fr/search/INRIA2?q=neurips&fq=publicationDateY_i:2023%20AND%20docType_s:COMM&fl=uri_s,conferenceTitle_s&rows=70
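
The verification URL above can also be assembled from its parts, which makes it easier to change the keyword or the year. A minimal sketch follows. The helper `build_hal_query` is hypothetical, it sends no request, and the parameter values are copied from the URL in the comment:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://api.archives-ouvertes.fr/search/INRIA2"


def build_hal_query(keyword: str, year: int, rows: int = 70) -> str:
    """Return a search URL for COMM documents of a given year (no request is sent)."""
    params = {
        "q": keyword,
        "fq": f"publicationDateY_i:{year} AND docType_s:COMM",
        "fl": "uri_s,conferenceTitle_s",
        "rows": rows,
    }
    # urlencode percent-encodes the values (spaces become '+').
    return f"{BASE}?{urlencode(params)}"


url = build_hal_query("neurips", 2023)
print(url)

# Round trip: decode the query string to confirm the parameters.
parsed = parse_qs(urlsplit(url).query)
print(parsed["fq"])
```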