In [None]:
# Corrections

# 1. Text preprocessing

Text preprocessing improves the quality and reliability of NLP results by cleaning and normalizing the text data. It can also speed up analysis by reducing the amount of data to process.

Text preprocessing is an important stage of natural language processing (NLP). It typically involves the following steps:
1. **Case conversion**: Converting all text to a single case keeps the data consistent and avoids case-related mismatches in NLP algorithms.


2. **Data cleaning**: Raw text can contain errors, abbreviations, symbols and other features that degrade the quality of NLP results. Preprocessing cleans the data by removing unwanted characters, correcting errors and normalizing the text.

3. **Tokenization**: Tokenization splits the text into smaller units for later analysis. It can split text into words, sentences, symbols or other relevant units.

4. **Stop word removal**: Stop words can hurt the performance of NLP algorithms by adding noise to the data. Removing them filters the data and can improve the results of the analysis.

5. **Stemming and lemmatization**: Stemming and lemmatization normalize words by reducing them to a base form, which makes the analysis more consistent.
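The first four steps of the list above can be sketched without any NLP library. This is a deliberately naive illustration: the stop word set `STOP_WORDS_DEMO` and the helper `preprocess` are hypothetical names introduced here, and stemming/lemmatization is left out because it needs language resources (it is handled with spaCy later in this notebook).

```python
import re

# Hypothetical, deliberately tiny stop word set used only for this illustration.
STOP_WORDS_DEMO = {"le", "la", "les", "ses", "de", "du", "un", "une"}

def preprocess(text: str) -> list[str]:
    """Lowercase, clean, tokenize and filter a raw string."""
    text = text.lower()                      # 1. case conversion
    text = re.sub(r"[^\w\s']", " ", text)    # 2. cleaning: drop punctuation and symbols
    tokens = text.split()                    # 3. naive whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS_DEMO]  # 4. stop word removal

print(preprocess("Le chien mange ses croquettes !"))
# ['chien', 'mange', 'croquettes']
```

A real pipeline would replace the whitespace split and the toy stop word set with a proper tokenizer and a maintained list.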

We will use the following vocabulary from here on:
- **Corpus**: A corpus is a collection of text documents gathered for processing.
- **Document**: A document is a distinct unit of text, such as a book, a newspaper article or a web page.
- **Text**: Text is a set of words and sentences used to communicate information and ideas.
- **Token**: A token is a unit of information in a text: a word, a symbol, a punctuation mark or any other relevant element.
- **Vocabulary**: The set of distinct tokens found across the whole corpus.
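To make these terms concrete, here is a minimal sketch on a made-up two-document corpus. The variable names (`corpus_demo`, `tokens_per_doc`, `vocabulary`) are illustrative, and the whitespace split stands in for a real tokenizer.

```python
# A corpus is a list of documents; each document is a plain string here.
corpus_demo = [
    "le chien mange ses croquettes",
    "le chat mange ses croquettes",
]

# Tokens: the units of each document (naive whitespace split for illustration).
tokens_per_doc = [doc.split() for doc in corpus_demo]

# Vocabulary: the set of distinct tokens over the whole corpus.
vocabulary = sorted({tok for tokens in tokens_per_doc for tok in tokens})

print(tokens_per_doc[0])
# ['le', 'chien', 'mange', 'ses', 'croquettes']
print(vocabulary)
# ['chat', 'chien', 'croquettes', 'le', 'mange', 'ses']
```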

### 1.1 Tokenization

<img src='https://miro.medium.com/max/1050/0*EKgminT7W-0R4Iae.png'>

Tokenization is the natural language processing (NLP) step that splits a text into smaller units called tokens.

Tokens can be words, sentences, symbols or characters. Tokenization is often the first step when processing text data, because it prepares the text for later analyses such as meaning extraction, classification or summary generation.

In [1]:
import warnings

# Suppress warnings
warnings.filterwarnings("ignore")

In [4]:
%pip install spacy



In [2]:
import spacy
from spacy.lang.fr import French



# Create a blank French spaCy pipeline (tokenizer only, no trained model)
nlp = French()


# Keep a handle on the pipeline's tokenizer
tokenizer = nlp.tokenizer

In [4]:
tokenizer("Hello")[0]

Hello

In [5]:
# Turn a text into tokens
doc = tokenizer("Le chien mange ses croquettes")
doc

Le chien mange ses croquettes

In [6]:
tokens = [token.text for token in doc]
tokens

['Le', 'chien', 'mange', 'ses', 'croquettes']

### 1.2 Stemming & Lemmatization

<img src='https://d2mk45aasx86xg.cloudfront.net/difference_between_Stemming_and_lemmatization_8_11zon_452539721d.webp'>

**Stemming** and **lemmatization** are two techniques used in natural language processing to reduce words to a base form.

**Stemming**: Stemming extracts the stem of a word by stripping suffixes (and sometimes prefixes) with simple rules. The goal is to map related word forms to a common base so the analysis is more consistent. For example, "running" and "runs" are both reduced to "run". Because it relies on rules rather than a dictionary, a stemmer generally leaves irregular forms such as "ran" unchanged and may produce stems that are not real words.

**Lemmatization**: Lemmatization is similar to stemming, but it aims to produce a lemma, i.e. a valid dictionary form of the word. It relies on a more advanced morphological analysis that takes the word's context and part of speech into account. For example, "running" can be lemmatized to "run", and "better" to "good".

In general, lemmatization is considered more accurate than stemming, but it is also slower and more complex to implement. Both techniques help normalize words and can improve NLP results; the choice between them depends on the needs of the project.
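The contrast between the two techniques can be illustrated with a toy sketch. Both functions are hypothetical simplifications written for this note: `naive_stem` applies a hand-picked suffix list (real stemmers such as Porter's use far richer rules), and `naive_lemma` replaces morphological analysis with a three-entry lookup table.

```python
# Toy English stemmer: strips a few common suffixes, with no linguistic knowledge.
SUFFIXES = ("ning", "ing", "s")   # hypothetical rule list, checked in order

def naive_stem(word: str) -> str:
    for suffix in SUFFIXES:
        # Only strip when at least 3 characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a dictionary lookup standing in for real morphological analysis.
LEMMA_TABLE = {"running": "run", "ran": "run", "better": "good"}

def naive_lemma(word: str) -> str:
    return LEMMA_TABLE.get(word, word)

for w in ["running", "runs", "ran", "better"]:
    print(w, naive_stem(w), naive_lemma(w))
# running run run
# runs run runs
# ran ran run
# better better good
```

The rule-based stemmer handles regular suffixes but misses the irregular forms "ran" and "better", while the lookup only knows the words it was given. Real lemmatizers combine rules, dictionaries and context.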

Note that string comparison in Python is case-sensitive, so a capitalized word such as `And` is not matched by a stop word list that only contains `and`:

`'And' != 'and'`

In [5]:
!python -m spacy download fr_core_news_sm

Collecting fr-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.8.0/fr_core_news_sm-3.8.0-py3-none-any.whl (16.3 MB)
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [6]:
import spacy
# Load the pre-trained French model:
nlp = spacy.load("fr_core_news_sm")

# Run the pipeline on the sentence (tokenization and lemmatization):
doc = nlp("Le chien mange ses croquettes")

# Collect the lemmas (`doc` now holds a list of strings, reused in the cells below):
doc = [token.lemma_ for token in doc]
doc

['le', 'chien', 'manger', 'son', 'croquette']

### 1.3 Stop word handling

Stop words are words that are often ignored or filtered out when analyzing text data. They are considered to carry little information and are generally left out when analyzing the meaning of a text.

They can be prepositions, conjunctions and other common words that do not contribute significantly to the meaning of a sentence or document.

The list of stop words can vary with the language, the domain and the goals of the analysis.
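As a small sketch of how the list can be adapted, the snippet below adjusts a base set for two different situations. The three sets are hypothetical examples built for this note; in this notebook the base role is played by spaCy's `STOP_WORDS`.

```python
# Hypothetical base list used only for this illustration.
base_stop_words = {"le", "la", "de", "ne", "pas", "et"}

# Goal-dependent: for sentiment analysis, negations carry meaning, so keep them.
sentiment_stop_words = base_stop_words - {"ne", "pas"}

# Domain-dependent: in a corpus about NLP, "nlp" appears everywhere and adds little.
domain_stop_words = base_stop_words | {"nlp"}

tokens = ["le", "nlp", "ne", "fonctionne", "pas"]
print([t for t in tokens if t not in base_stop_words])       # ['nlp', 'fonctionne']
print([t for t in tokens if t not in sentiment_stop_words])  # ['nlp', 'ne', 'fonctionne', 'pas']
print([t for t in tokens if t not in domain_stop_words])     # ['fonctionne']
```

Whether to keep negations is a design choice that depends on the downstream task.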

In [9]:
from spacy.lang.fr import stop_words as stop_words

stop_words.STOP_WORDS

{'a',
 'abord',
 'afin',
 'ah',
 'ai',
 'aie',
 'ainsi',
 'ait',
 'allaient',
 'allons',
 'alors',
 'anterieur',
 'anterieure',
 'anterieures',
 'antérieur',
 'antérieure',
 'antérieures',
 'apres',
 'après',
 'as',
 'assez',
 'attendu',
 'au',
 'aupres',
 'auquel',
 'aura',
 'auraient',
 'aurait',
 'auront',
 'aussi',
 'autre',
 'autrement',
 'autres',
 'autrui',
 'aux',
 'auxquelles',
 'auxquels',
 'avaient',
 'avais',
 'avait',
 'avant',
 'avec',
 'avoir',
 'avons',
 'ayant',
 'bas',
 'basee',
 'bat',
 "c'",
 'car',
 'ce',
 'ceci',
 'cela',
 'celle',
 'celle-ci',
 'celle-la',
 'celle-là',
 'celles',
 'celles-ci',
 'celles-la',
 'celles-là',
 'celui',
 'celui-ci',
 'celui-la',
 'celui-là',
 'cent',
 'cependant',
 'certain',
 'certaine',
 'certaines',
 'certains',
 'certes',
 'ces',
 'cet',
 'cette',
 'ceux',
 'ceux-ci',
 'ceux-là',
 'chacun',
 'chacune',
 'chaque',
 'chez',
 'ci',
 'cinq',
 'cinquantaine',
 'cinquante',
 'cinquantième',
 'cinquième',
 'combien',
 'comme',
 

In [10]:
len(stop_words.STOP_WORDS)

507

➡️ Let's remove these words from our tokens using spaCy's stop word list:

In [11]:
# Filter the tokens against the stop word list:
tokens_filtres = []
for token in doc:
  if token not in stop_words.STOP_WORDS:
    tokens_filtres.append(token)
tokens_filtres

['chien', 'manger', 'croquette']

In [12]:
[token for token in doc if token not in stop_words.STOP_WORDS]

['chien', 'manger', 'croquette']

---

# Project 1 - Text preprocessing

1. Create a `Processing` class with a `tokenization` method that turns a document into a list of tokens.
This method takes 2 arguments:
- `document : str`  --> the document as a str,
- `stem:bool=False` --> if True, the method applies stemming.

The method stores all previous processing (the document and its list of tokens) in a `data` attribute.

The method returns the list of tokens.

In [9]:
import spacy
from spacy.lang.fr import stop_words as stop_words

class Processing:
    # Load the pre-trained French model once, shared by all instances:
    nlp = spacy.load("fr_core_news_sm")
    stop_words = stop_words.STOP_WORDS

    def tokenizer(self, text: str, lemma_: bool = True) -> str:
        """Lowercase `text`, drop stop words and return the tokens joined by spaces.

        If `lemma_` is True, each token is replaced by its lemma.
        """
        text = text.lower()
        doc = self.nlp(text)

        if lemma_:
            tokens = [token.lemma_ for token in doc if token.lemma_ not in self.stop_words]
        else:
            tokens = [token.text for token in doc if token.text not in self.stop_words]
        return " ".join(tokens)

Processing().tokenizer("Bonjour tout le monde !", lemma_=False)

'bonjour monde !'
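The statement asks for a `tokenization` method that returns a list and keeps a `data` attribute, while the solution above returns a string. As a hedged illustration of the stated interface only, here is a dependency-free sketch. `ProcessingSketch`, its regex tokenizer and its one-rule stemmer are hypothetical stand-ins, not the spaCy-based solution.

```python
import re

class ProcessingSketch:
    """Hypothetical variant following the exercise statement."""

    def __init__(self):
        # History of processed documents: one dict per call.
        self.data = []

    def tokenization(self, document: str, stem: bool = False) -> list[str]:
        tokens = re.findall(r"\w+", document.lower())
        if stem:
            # Placeholder stemming rule: strip a final 's' from words longer than 3 letters.
            tokens = [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]
        self.data.append({"document": document, "tokens": tokens})
        return tokens

p = ProcessingSketch()
print(p.tokenization("Le chien mange ses croquettes", stem=True))
# ['le', 'chien', 'mange', 'ses', 'croquette']
print(len(p.data))
# 1
```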

# 2. Word Vectorization - Introduction & TF-IDF

<img src='https://www.gstatic.com/aihub/tfhub/universal-sentence-encoder/example-similarity.png'>

Text similarity analysis measures how close two or more texts are, using mathematical algorithms and dedicated metrics. It enables:
- **document classification**: grouping documents by resemblance, which helps organize and sort them by topic or content.

- **duplicate detection**: spotting duplicate or near-duplicate documents, which is useful for cleaning databases or document archives.

- **content recommendation**: recommending similar content to a user, based on their search or reading history.

- **content detection**: finding and identifying texts or passages that are identical or very similar to each other. This is useful for many tasks, such as removing duplicates from databases, detecting plagiarism in academic work, checking the originality of news articles, and setting up quality controls for websites and applications.
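The text does not name a specific metric. One common choice is cosine similarity between word-count vectors, sketched below under that assumption. The function `cosine_similarity` is written for this note with the standard library only and is not a library API.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("le chien mange", "le chien mange"))    # identical texts: close to 1
print(cosine_similarity("le chien mange", "le chat dort"))      # one shared word: about 1/3
print(cosine_similarity("le chien mange", "un oiseau chante"))  # nothing in common: 0.0
```

The same function applies unchanged to TF-IDF weights instead of raw counts, which is how the vectors built later in this section are typically compared.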

### 2.1 Bag of Words

A **Bag of Words** (BoW) is a frequency-based representation of the words in a document. It is a simple model that focuses on how often each word appears in the document, without taking word order into account.

To build it, count the occurrences of each word in the document and store the counts in a vector. Each element of the vector is the number of occurrences of a given word in the document.

**BoW** representations are widely used in NLP to vectorize text, that is, to turn it into numerical vectors that classification and clustering algorithms can consume.
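Before using a library, the counting procedure just described can be sketched by hand on a made-up three-document corpus. The names `docs`, `vocab` and `bow_rows` are illustrative.

```python
from collections import Counter

docs = ["chien mange croquette", "chat mange souris", "chien chien dort"]

# Vocabulary: sorted list of distinct words, one column per word.
vocab = sorted({word for doc in docs for word in doc.split()})

# One count vector per document, aligned on the vocabulary order.
bow_rows = []
for doc in docs:
    counts = Counter(doc.split())
    bow_rows.append([counts[word] for word in vocab])

print(vocab)
# ['chat', 'chien', 'croquette', 'dort', 'mange', 'souris']
for row in bow_rows:
    print(row)
# [0, 1, 1, 0, 1, 0]
# [1, 0, 0, 0, 1, 1]
# [0, 2, 0, 1, 0, 0]
```

Word order is lost: "chien mange croquette" and "croquette mange chien" would produce the same row.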

In [14]:
import pandas as pd
import numpy as np
# Import CountVectorizer from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

corpus = """Le traitement du langage naturel (NLP) est un domaine de l'informatique, de l'intelligence artificielle et de la linguistique qui s'intéresse aux interactions entre les ordinateurs et les langues humaines (naturelles). L'objectif du NLP est de permettre aux ordinateurs de comprendre, d'interpréter et de générer le langage humain.
Les applications du NLP incluent la classification de texte, l'analyse de sentiment, la traduction automatique, la reconnaissance d'entités nommées, la reconnaissance vocale et les chatbots. Les techniques de NLP reposent sur des algorithmes d'apprentissage automatique, tels que les arbres de décision, les forêts aléatoires, les réseaux neuronaux et l'apprentissage profond, pour analyser et modéliser la structure et la signification du langage.
Le NLP est un domaine complexe, car le langage humain est très ambigu et dépend fortement du contexte. Pour surmonter ces défis, les modèles de NLP s'appuient souvent sur de grandes quantités de données annotées et des algorithmes sophistiqués pour apprendre les schémas et les relations dans le langage.
Malgré ces défis, le NLP a le potentiel de révolutionner notre manière d’interagir avec les ordinateurs et d’ouvrir de nouvelles possibilités en matière de communication, d’éducation et bien plus encore.""".split("\n")


corpus

["Le traitement du langage naturel (NLP) est un domaine de l'informatique, de l'intelligence artificielle et de la linguistique qui s'intéresse aux interactions entre les ordinateurs et les langues humaines (naturelles). L'objectif du NLP est de permettre aux ordinateurs de comprendre, d'interpréter et de générer le langage humain.",
 "Les applications du NLP incluent la classification de texte, l'analyse de sentiment, la traduction automatique, la reconnaissance d'entités nommées, la reconnaissance vocale et les chatbots. Les techniques de NLP reposent sur des algorithmes d'apprentissage automatique, tels que les arbres de décision, les forêts aléatoires, les réseaux neuronaux et l'apprentissage profond, pour analyser et modéliser la structure et la signification du langage.",
 "Le NLP est un domaine complexe, car le langage humain est très ambigu et dépend fortement du contexte. Pour surmonter ces défis, les modèles de NLP s'appuient souvent sur de grandes quantités d

In [15]:
# Document 1
corpus[0]

"Le traitement du langage naturel (NLP) est un domaine de l'informatique, de l'intelligence artificielle et de la linguistique qui s'intéresse aux interactions entre les ordinateurs et les langues humaines (naturelles). L'objectif du NLP est de permettre aux ordinateurs de comprendre, d'interpréter et de générer le langage humain."

In [16]:
for doc in corpus:
  print(Processing().tokenizer(doc,False))

le traitement du langage naturel ( nlp ) est un domaine de l' informatique , de l' intelligence artificielle et de la linguistique qui s' intéresse aux interactions entre les ordinateurs et les langues humaines ( naturelles ) . l' objectif du nlp est de permettre aux ordinateurs de comprendre , d' interpréter et de générer le langage humain .
les applications du nlp incluent la classification de texte , l' analyse de sentiment , la traduction automatique , la reconnaissance d' entités nommées , la reconnaissance vocale et les chatbots . les techniques de nlp reposent sur des algorithmes d' apprentissage automatique , tels que les arbres de décision , les forêts aléatoires , les réseaux neuronaux et l' apprentissage profond , pour analyser et modéliser la structure et la signification du langage .
le nlp est un domaine complexe , car le langage humain est très ambigu et dépend fortement du contexte . pour surmonter ces défis , les modèles de nlp s' appuient souvent sur de

In [17]:
corpus = [Processing().tokenizer(doc) for doc in corpus]
corpus[0]

'traitement langage naturel ( nlp ) domaine informatique , intelligence artificiel linguistique intéresser interaction entrer ordinateur langue humain ( naturel ) . objectif nlp permettre ordinateur comprendre , interpréter générer langage humain .'

In [18]:
# Create a CountVectorizer object
vectorizer = CountVectorizer()
BOW = vectorizer.fit_transform(corpus).toarray()

# Convert the BOW array to a DataFrame
BOW = pd.DataFrame(data=BOW, columns=vectorizer.get_feature_names_out())
BOW

Unnamed: 0,algorithme,aléatoire,ambigu,analyse,analyser,annoter,application,apprendre,apprentissage,appuyer,...,signification,sophistiqué,structure,surmonter,technique,texte,traduction,traitement,vocal,éducation
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,1,1,0,1,1,0,1,0,2,0,...,1,0,1,0,1,1,1,0,1,0
2,1,0,1,0,0,1,0,1,0,1,...,0,1,0,1,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [19]:
# Word counts of the first document (row 0 of the BoW matrix)
BOW.iloc[0].values

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 2, 0, 1, 1, 1, 0, 1, 1, 2, 1, 1, 0, 0,
       0, 0, 2, 0, 2, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0])


### 2.2 Term Frequency (TF)

Term frequency is the number of occurrences of a term (a word, for example) in a text sample, normalized by the number of words in that sample. It is very close to a Bag of Words (BoW): the main difference is the normalization.

👉🏻 Let's see why normalization is needed. Suppose we search for the query "amour" and want to find the most relevant of 3 different quotes: a long quote can contain the word more often simply because it is longer, so raw counts favor long texts. Dividing each count by the document length removes this bias. The cell below applies this normalization to our BoW matrix:

In [20]:
TF = BOW.divide(BOW.sum(axis=1), axis=0)
TF

Unnamed: 0,algorithme,aléatoire,ambigu,analyse,analyser,annoter,application,apprendre,apprentissage,appuyer,...,signification,sophistiqué,structure,surmonter,technique,texte,traduction,traitement,vocal,éducation
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0
1,0.029412,0.029412,0.0,0.029412,0.029412,0.0,0.029412,0.0,0.058824,0.0,...,0.029412,0.0,0.029412,0.0,0.029412,0.029412,0.029412,0.0,0.029412,0.0
2,0.041667,0.0,0.041667,0.0,0.0,0.041667,0.0,0.041667,0.0,0.041667,...,0.0,0.041667,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923


In [21]:
BOW.sum(axis=1)

Unnamed: 0,0
0,25
1,34
2,24
3,13
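To make the normalization argument of this section concrete, here is a small sketch around the "amour" query. The two quotes are made up for the illustration (the long one is padded with the filler word "mot"), and `term_frequency` is a helper written for this note.

```python
from collections import Counter

def term_frequency(tokens: list[str], term: str) -> float:
    """Occurrences of `term` divided by the number of tokens in the document."""
    return Counter(tokens)[term] / len(tokens) if tokens else 0.0

short_quote = "amour toujours".split()
long_quote = ("amour " + "mot " * 18 + "amour").split()   # 20 tokens

# Raw counts favor the long quote...
print(short_quote.count("amour"), long_quote.count("amour"))   # 1 2
# ...while TF shows that "amour" is far more central to the short one.
print(term_frequency(short_quote, "amour"))   # 0.5
print(term_frequency(long_quote, "amour"))    # 0.1
```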



### 2.3 Inverse Document Frequency (IDF)

**Inverse Document Frequency (IDF)** is the inverse of how frequently a term appears across our documents. IDF therefore gives a higher weight to words that appear in few documents and lowers the weight of words that appear in many.

<img src='https://wikimedia.org/api/rest_v1/media/math/render/svg/0f1b67e328e503d7dd2d10fdfff9ee75df88032a'>

Here D is the total number of documents in the corpus, and the denominator is the number of documents in which the term t appears.
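Written out, and assuming the image shows the standard definition, the formula matching this description is:

$$
\mathrm{idf}(t) = \log\frac{D}{\left|\{d : t \in d\}\right|}
$$

The manual computation in the cells below uses a slightly different, smoothed variant, $\log\left(\frac{D}{\mathrm{df}(t)} + 1\right)$, where $\mathrm{df}(t)$ is the denominator above. With that variant a term present in every document gets $\log 2$ rather than $0$, and with $D = 4$ a term found in a single document gets $\log 5 \approx 1.609$.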

In [22]:
BOW.sum(axis=0)

Unnamed: 0,0
algorithme,2
aléatoire,1
ambigu,1
analyse,1
analyser,1
...,...
texte,1
traduction,1
traitement,1
vocal,1


In [23]:
np.log(1)

np.float64(0.0)

In [24]:
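# Binarize a copy of the counts (1 if the word occurs in the document, else 0),
# so that summing over rows gives the document frequency of each word.
bow = BOW.copy()
bow[bow > 1] = 1
# Smoothed variant: log(N / df + 1), with N the number of documents
IDF = np.log(len(bow) / bow.sum(axis=0) + 1)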
IDF

Unnamed: 0,0
algorithme,1.098612
aléatoire,1.609438
ambigu,1.609438
analyse,1.609438
analyser,1.609438
...,...
texte,1.609438
traduction,1.609438
traitement,1.609438
vocal,1.609438



### 2.4 **TF-IDF** (Term Frequency - Inverse Document Frequency)

**`TF-IDF = TF × IDF`**


Why is this a good feature?

On the one hand, when you search for a word in a corpus, the more often the word appears in a document, the more likely that document is to be relevant: this is captured by the Term Frequency.

On the other hand, if that word appears in every document of the corpus, it is of little use for telling documents apart: this is captured by the Inverse Document Frequency.

➡️ Combining TF and IDF is therefore a good compromise, and it is a widely used feature in natural language processing.

👉🏻 Continuing our previous example, we can compute TF-IDF manually:

In [25]:
TF*IDF

Unnamed: 0,algorithme,aléatoire,ambigu,analyse,analyser,annoter,application,apprendre,apprentissage,appuyer,...,signification,sophistiqué,structure,surmonter,technique,texte,traduction,traitement,vocal,éducation
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064378,0.0,0.0
1,0.032312,0.047336,0.0,0.047336,0.047336,0.0,0.047336,0.0,0.094673,0.0,...,0.047336,0.0,0.047336,0.0,0.047336,0.047336,0.047336,0.0,0.047336,0.0
2,0.045776,0.0,0.06706,0.0,0.0,0.06706,0.0,0.06706,0.0,0.06706,...,0.0,0.06706,0.0,0.06706,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.123803
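As a self-contained check of the reasoning in this section, the sketch below computes TF × IDF on a made-up three-document corpus. It uses the standard `log(N / df)` form of IDF, which is an assumption of this example: the manual cells above use `log(N / df + 1)` instead. The helpers `tf` and `idf` are written for this note.

```python
import math
from collections import Counter

docs = [
    "nlp chien mange".split(),
    "nlp chat dort".split(),
    "nlp chien dort".split(),
]

def tf(term, tokens):
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    # Standard form log(N / df); assumes the term occurs in at least one document.
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df)

for term in ["nlp", "chien", "mange"]:
    print(term, round(tf(term, docs[0]) * idf(term, docs), 4))
# nlp 0.0
# chien 0.1352
# mange 0.3662
```

All three words have the same term frequency in the first document, yet "nlp" (present everywhere) scores 0 while "mange" (unique to this document) scores highest.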


### 2.5 TF-IDF matrix with scikit-learn

With **scikit-learn (sklearn)**, this matrix can be generated with [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). It turns a collection of text documents into a sparse matrix in which each row is a document and each column a word, weighted by its TF-IDF score. `TfidfVectorizer` can also automate part of the preprocessing through its parameters: removing common words (`stop_words`), lowercasing (`lowercase`) and filtering rare or overly frequent words (`min_df`, `max_df`). By default it uses a smoothed IDF and L2-normalizes each row, which is why its values differ from the manual `TF*IDF` computed above. This approach works well for **text classification, information retrieval and clustering**, because it dampens the impact of frequent words while highlighting the most distinctive terms.
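To see where scikit-learn's numbers come from, the sketch below recomputes by hand the weight of "éducation" in the last document of the corpus. It assumes the defaults documented for `TfidfVectorizer` (raw counts as TF, `smooth_idf=True`, `norm="l2"`); the document frequencies are read off the lemmatized corpus, and the helper `smooth_idf` is written for this note.

```python
import math

n_docs = 4
# Document frequencies for words of the last document, read off the lemmatized corpus:
# 'nlp' occurs in all 4 documents, 'défi' and 'ordinateur' in 2, every other word in 1.
doc_freqs = {"nlp": 4, "défi": 2, "ordinateur": 2, "éducation": 1}
other_words_df1 = 9   # the 9 remaining words of that document, each with df = 1

def smooth_idf(df: int, n: int) -> float:
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1

# Every word occurs once in this document, so tf = 1 and the raw weight equals the IDF.
weights = {w: smooth_idf(df, n_docs) for w, df in doc_freqs.items()}
squared_norm = (sum(v * v for v in weights.values())
                + other_words_df1 * smooth_idf(1, n_docs) ** 2)

# L2 normalization of the row.
print(round(weights["éducation"] / math.sqrt(squared_norm), 6))
# 0.294685
```

The result agrees with the `éducation` value in the last row of the `TfidfVectorizer` output below.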



In [26]:
corpus

['traitement langage naturel ( nlp ) domaine informatique , intelligence artificiel linguistique intéresser interaction entrer ordinateur langue humain ( naturel ) . objectif nlp permettre ordinateur comprendre , interpréter générer langage humain .',
 'application nlp inclure classification texte , analyse sentiment , traduction automatique , reconnaissance entité nommer , reconnaissance vocal chatbot . technique nlp reposer algorithme apprentissage automatique , arbre décision , forêt aléatoire , réseau neuronal apprentissage profond , analyser modéliser structure signification langage .',
 'nlp domaine complexe , langage humain ambigu dépendre fortement contexte . surmonter défi , modèle nlp appuyer grand quantité donnée annoter algorithme sophistiqué apprendre schéma relation langage .',
 'défi , nlp potentiel révolutionner manière interagir ordinateur ouvrir possibilité matière communication , éducation bien .']

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Turn the corpus into a TF-IDF matrix
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

# Convert to a DataFrame for display
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
df_tfidf

Unnamed: 0,algorithme,aléatoire,ambigu,analyse,analyser,annoter,application,apprendre,apprentissage,appuyer,...,signification,sophistiqué,structure,surmonter,technique,texte,traduction,traitement,vocal,éducation
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.194945,0.0,0.0
1,0.127699,0.16197,0.0,0.16197,0.16197,0.0,0.16197,0.0,0.323939,0.0,...,0.16197,0.0,0.16197,0.0,0.16197,0.16197,0.16197,0.0,0.16197,0.0
2,0.171211,0.0,0.217159,0.0,0.0,0.217159,0.0,0.217159,0.0,0.217159,...,0.0,0.217159,0.0,0.217159,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.294685


# Project 2: Sentiment Analysis with TF-IDF and Machine Learning

In this project, we analyze texts to predict their sentiment (positive, negative, neutral) with a machine learning approach. The goal is to turn a text corpus into a TF-IDF matrix, then train a classification model to detect the sentiment of the texts.

In [32]:
import pandas as pd

# Create a dataset of 20 example texts with their sentiment label
data = {
    "texte": [
        "J'adore cette journée ensoleillée, elle me remplit de joie !",  # Positive
        "Ce film était vraiment génial, je le recommande à tout le monde.",  # Positive
        "Je suis tellement déçu par ce produit, il ne fonctionne pas du tout.",  # Negative
        "Rien de spécial aujourd'hui, juste une journée comme les autres.",  # Neutral
        "Ce restaurant offre un excellent service et des plats délicieux !",  # Positive
        "Le service client était horrible, je ne reviendrai jamais ici.",  # Negative
        "Un jour normal, rien de particulier à signaler.",  # Neutral
        "J'ai adoré ma visite au musée, c'était une expérience inoubliable.",  # Positive
        "Je suis frustré par la lenteur du service, c'était une perte de temps.",  # Negative
        "Cette application est utile mais pas révolutionnaire.",  # Neutral
        "Le livre était captivant, je ne pouvais pas m'arrêter de lire.",  # Positive
        "L'hôtel était bruyant et inconfortable, je ne le recommande pas.",  # Negative
        "Une journée ordinaire avec du travail et quelques tâches ménagères.",  # Neutral
        "Super concert hier soir ! Une ambiance incroyable.",  # Positive
        "La connexion Internet coupe sans arrêt, c'est très agaçant.",  # Negative
        "J'ai passé un bon moment avec mes amis, c'était agréable.",  # Positive
        "Ce café est correct mais il y a de meilleurs endroits en ville.",  # Neutral
        "L'attente était interminable et le personnel peu aimable.",  # Negative
        "Un bon repas en famille, ça fait toujours plaisir.",  # Positive
        "Je suis ni satisfait ni déçu, c'était juste moyen.",  # Neutral
    ],
    "sentiment": [
        1, 1, -1, 0, 1,
        -1, 0, 1, -1, 0,
        1, -1, 0, 1, -1,
        1, 0, -1, 1, 0
    ]
}

# Build the DataFrame
df = pd.DataFrame(data)
df

Unnamed: 0,texte,sentiment
0,"J'adore cette journée ensoleillée, elle me rem...",1
1,"Ce film était vraiment génial, je le recommand...",1
2,"Je suis tellement déçu par ce produit, il ne f...",-1
3,"Rien de spécial aujourd'hui, juste une journée...",0
4,Ce restaurant offre un excellent service et de...,1
5,"Le service client était horrible, je ne revien...",-1
6,"Un jour normal, rien de particulier à signaler.",0
7,"J'ai adoré ma visite au musée, c'était une exp...",1
8,"Je suis frustré par la lenteur du service, c'é...",-1
9,Cette application est utile mais pas révolutio...,0


In [33]:
processing = Processing()

In [34]:
# 0. Text cleaning (stop words, tokenization, lemmatization):

df.texte = df.texte.apply(lambda x: processing.tokenizer(x, lemma_=True))
df

Unnamed: 0,texte,sentiment
0,"adorer journée ensoleillée , remplir joie !",1
1,"film vraiment génial , recommande monde .",1
2,"décevoir produit , fonctionner .",-1
3,"rien spécial aujourd'hui , journée .",0
4,restaurant offrir excellent service plat délic...,1
5,"service client horrible , revenir jamais ici .",-1
6,"jour normal , rien particulier signaler .",0
7,"adorer visite musée , expérience inoubliable .",1
8,"frustrer lenteur service , perte temps .",-1
9,application utile révolutionnaire .,0


In [35]:
# 1. Turn the text into a TF-IDF matrix
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd


# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()


# Turn the corpus into a TF-IDF matrix
tfidf = tfidf_vectorizer.fit_transform(df.texte)


# Convert to a DataFrame for display
tfidf = pd.DataFrame(tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf

Unnamed: 0,adorer,agaçant,agréable,aimable,ambiance,ami,application,arrêt,arrêter,attente,...,soir,spécial,super,temps,travail,tâche,utile,ville,visite,vraiment
0,0.418969,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.447214
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.476634,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.402361,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.457741,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.464783,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,0.0


In [36]:
df.sentiment

Unnamed: 0,sentiment
0,1
1,1
2,-1
3,0
4,1
5,-1
6,0
7,1
8,-1
9,0


In [37]:
!pip install xgboost




In [38]:
# 2. Train a machine learning algorithm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Classification model:
model = LogisticRegression(
    solver="liblinear",
    penalty="l2",
    C=1.0,
    random_state=42
)

# Train/test split (stratified so each sentiment appears in both sets):
X_train, X_test, y_train, y_test = train_test_split(
    tfidf,
    df.sentiment,
    stratify=df.sentiment,
    test_size=0.2,
    random_state=42
)

# Fit the model:
model.fit(X_train, y_train)

In [39]:
model.score(X_test, y_test)

0.5

In [40]:
# 3. Model prediction
doc = "adorer journée ensoleillée"

# Preprocess the text
doc = Processing().tokenizer(doc)
print(doc)


# Convert to a TF-IDF vector
x = tfidf_vectorizer.transform([doc]).toarray()
x

adorer journée ensoleillée


array([[0.56718988, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.64525598,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.5118011 , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  

In [41]:

# Model prediction (1 -> Positive, 0 -> Neutral, -1 -> Negative)
model.predict(x)

array([1])

In [43]:
phrase = "agaçant"

# Vectorize and predict
X_new = tfidf_vectorizer.transform([phrase])
pred = model.predict(X_new)[0]

# Display the result
label_map = {-1: "Négatif", 0: "Neutre", 1: "Positif"}
print(label_map[pred])


Négatif


# Project 3: Article classification

This project automatically classifies article titles into three distinct categories (Technology, Health and Legal), using machine learning techniques to analyze each title and predict its topic.

In [11]:
import pandas as pd

data = {
    "titre": [
        # Technology ==> 0
        "Les dernières avancées en intelligence artificielle",
        "L'impact de la 5G sur l'industrie des télécommunications",
        "Les nouvelles générations de processeurs pour le gaming",
        "La montée en puissance du cloud computing",
        "L'importance de la cybersécurité dans les entreprises",
        "Les véhicules autonomes et leur technologie embarquée",
        "L'essor des cryptomonnaies et la technologie blockchain",

        # Health ==> 1
        "Les effets du sommeil sur la santé cognitive",
        "Comment l'alimentation influence le système immunitaire",
        "Les progrès récents dans le traitement du diabète",
        "L'impact de l'exercice physique sur la longévité",
        "Les bienfaits de la méditation sur le stress",
        "La recherche sur les vaccins contre les maladies émergentes",
        "Le rôle des probiotiques dans la digestion",

        # Legal ==> 2
        "Les défis du droit numérique dans un monde connecté",
        "Les implications du RGPD pour les entreprises européennes",
        "Le rôle des avocats face à l'automatisation juridique",
        "Les nouvelles réglementations sur la propriété intellectuelle",
        "L'impact des contrats intelligents sur le droit des affaires",
        "La responsabilité légale des entreprises face aux cyberattaques"
    ],
    "theme": [
        0, 0, 0, 0, 0,
        0, 0,
        1, 1, 1, 1, 1,
        1, 1,
        2, 2, 2, 2, 2, 2
    ]
}

# Build the DataFrame
df = pd.DataFrame(data)

df

Unnamed: 0,titre,theme
0,Les dernières avancées en intelligence artific...,0
1,L'impact de la 5G sur l'industrie des télécomm...,0
2,Les nouvelles générations de processeurs pour ...,0
3,La montée en puissance du cloud computing,0
4,L'importance de la cybersécurité dans les entr...,0
5,Les véhicules autonomes et leur technologie em...,0
6,L'essor des cryptomonnaies et la technologie b...,0
7,Les effets du sommeil sur la santé cognitive,1
8,Comment l'alimentation influence le système im...,1
9,Les progrès récents dans le traitement du diabète,1


In [13]:
# Stop-word removal and tokenization
processing = Processing()
df.titre = df.titre.apply(lambda x: processing.tokenizer(x, True))
df

Unnamed: 0,titre,theme
0,dernier avancée intelligence artificiel,0
1,impact 5 gramme industrie télécommunications,0
2,génération processeur gaming,0
3,montée puissance cloud computing,0
4,importance cybersécurité entreprise,0
5,véhicule autonome technologie embarquer,0
6,essor cryptomonnaie technologie blockchain,0
7,sommeil santé cognitif,1
8,alimentation influenc système immunitaire,1
9,progrès récent traitement diabète,1


In [14]:
# Build the TF-IDF matrix
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
vectoriser = TfidfVectorizer()

# Fit on the corpus and transform it into a TF-IDF matrix
tfidf = vectoriser.fit_transform(df.titre)

# Convert to a DataFrame for inspection
tfidf_df = pd.DataFrame(tfidf.toarray(), columns=vectoriser.get_feature_names_out())
tfidf_df

Unnamed: 0,affaire,alimentation,artificiel,automatisation,autonome,avancée,avocat,bienfait,blockchain,cloud,cognitif,computing,connecter,contrat,contre,cryptomonnaie,cyberattaque,cybersécurité,dernier,diabète,digestion,droit,défi,embarquer,entreprise,essor,européen,exercice,face,gaming,gramme,génération,immunitaire,impact,implication,importance,industrie,influenc,intellectuel,intelligence,intelligent,juridique,longévité,légal,maladie,monde,montée,méditation,numérique,physique,probiotique,processeur,progrès,propriété,puissance,recherche,responsabilité,rgpd,récent,réglementation,rôle,santé,sommeil,stress,système,technologie,traitement,télécommunications,vaccin,véhicule,émergent
0,0.0,0.0,0.5,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.524927,0.0,0.0,0.416359,0.0,0.0,0.524927,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.524927,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.616729,0.0,0.0,0.0,0.0,0.0,0.0,0.489174,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.616729,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.514844,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.514844,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.452556,0.0,0.0,0.0,0.514844,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.514844,0.0,0.0,0.0,0.0,0.0,0.0,0.514844,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.514844,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.452556,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0
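The repeated values 0.5 and 0.57735 in the matrix above follow from the default L2 normalization of `TfidfVectorizer`: when every term of a title appears in no other title, all its terms share the same IDF, so each weight becomes 1/sqrt(n) for a title of n terms. Terms shared across titles, such as `impact` or `technologie`, get a lower IDF and therefore a smaller weight. A minimal sketch on three of the preprocessed titles, assuming default vectorizer settings:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three preprocessed titles with no term in common.
titles = [
    "dernier avancée intelligence artificiel",  # 4 terms
    "génération processeur gaming",             # 3 terms
    "montée puissance cloud computing",         # 4 terms
]

vectoriser = TfidfVectorizer()  # defaults: norm="l2", smooth_idf=True
tfidf = vectoriser.fit_transform(titles).toarray()

# Every row has unit Euclidean length after L2 normalization.
row_norms = np.linalg.norm(tfidf, axis=1)
print(row_norms)

# Non-zero weights: 1/sqrt(4) = 0.5 and 1/sqrt(3) ≈ 0.57735.
print(tfidf[0][tfidf[0] > 0])
print(tfidf[1][tfidf[1] > 0])
```

Because each row has unit length, dot products between rows are cosine similarities, which is why title length alone does not inflate the features.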


In [16]:
df.theme

Unnamed: 0,theme
0,0
1,0
2,0
3,0
4,0
5,0
6,0
7,1
8,1
9,1


In [48]:
# Train a machine learning model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# Stratified 80/20 split so that each theme is represented in the test set
X_train, X_test, y_train, y_test = train_test_split(
    tfidf_df,
    df["theme"],
    test_size=0.2,
    random_state=42,
    stratify=df["theme"],
)
model.fit(X_train, y_train)

In [49]:
model.score(X_test, y_test)

0.25
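With 20 titles and `test_size=0.2`, the test set holds only 4 titles, so a score of 0.25 corresponds to a single correct prediction and each title shifts the score by 0.25. Cross-validation may give a less noisy estimate. The sketch below is one possible setup, not the notebook's method: it uses the 17 preprocessed titles that are fully visible in this notebook's outputs, and wraps the vectorizer and classifier in a pipeline so the TF-IDF vocabulary is refitted on the training part of each fold.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Preprocessed titles copied from the notebook outputs (17 of the 20 rows).
titles = [
    "dernier avancée intelligence artificiel",
    "impact 5 gramme industrie télécommunications",
    "génération processeur gaming",
    "montée puissance cloud computing",
    "importance cybersécurité entreprise",
    "véhicule autonome technologie embarquer",
    "essor cryptomonnaie technologie blockchain",
    "sommeil santé cognitif",
    "alimentation influenc système immunitaire",
    "progrès récent traitement diabète",
    "impact exercice physique longévité",
    "bienfait méditation stress",
    "recherche vaccin contre maladie émergent",
    "rôle probiotique digestion",
    "défi droit numérique monde connecter",
    "implication rgpd entreprise européen",
    "rôle avocat face automatisation juridique",
]
themes = [0] * 7 + [1] * 7 + [2] * 3

# The vectorizer is refitted inside each fold, so held-out titles
# never contribute to the vocabulary or the IDF weights.
pipeline = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42),
)

# 3 folds, because the smallest class here has only 3 examples.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, titles, themes, cv=cv)
print(scores, scores.mean())
```

Even so, a corpus this small with almost no vocabulary shared between titles leaves the classifier little to generalize from, so low scores are to be expected until more data is added.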

In [23]:
# Model prediction: preprocess the new title first
titre = "La santé au coeur du projet"

titre = Processing().tokenizer(titre)
titre

'santé coeur projet'

In [18]:
print(df["titre"].value_counts())


titre
dernier avancée intelligence artificiel              1
impact 5 gramme industrie télécommunications         1
génération processeur gaming                         1
montée puissance cloud computing                     1
importance cybersécurité entreprise                  1
véhicule autonome technologie embarquer              1
essor cryptomonnaie technologie blockchain           1
sommeil santé cognitif                               1
alimentation influenc système immunitaire            1
progrès récent traitement diabète                    1
impact exercice physique longévité                   1
bienfait méditation stress                           1
recherche vaccin contre maladie émergent             1
rôle probiotique digestion                           1
défi droit numérique monde connecter                 1
implication rgpd entreprise européen                 1
rôle avocat face automatisation juridique            1
réglementation propriété intellec