# Loading the data

In [4]:
import pandas as pd
# Load the dataset
df = pd.read_csv("dataset/livres_bruts.csv")  # Bib-Readers-project-main\dataset\livres_bruts.csv

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Title             1000 non-null   object 
 1   Description       1000 non-null   object 
 2   Price             1000 non-null   float64
 3   Availability      1000 non-null   object 
 4   Image_URL         1000 non-null   object 
 5   Rating            1000 non-null   int64  
 6   availability_num  1000 non-null   int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 54.8+ KB


In [6]:
df.describe()

| | Price | Rating | availability_num |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 35.07035 | 2.923 | 8.585 |
| std | 14.44669 | 1.434967 | 5.654622 |
| min | 10.0 | 1.0 | 1.0 |
| 25% | 22.1075 | 2.0 | 3.0 |
| 50% | 35.98 | 3.0 | 7.0 |
| 75% | 47.4575 | 4.0 | 14.0 |
| max | 59.99 | 5.0 | 22.0 |


1000 rows, so the dataset contains 1000 books.
7 columns:

- Title (str, stored as object) → book title
- Description (str, stored as object) → summary
- Price (float64) → book price
- Availability (str, stored as object) → text indicating whether the book is in stock
- Image_URL (str, stored as object) → link to the cover image
- Rating (int64) → rating (from 1 to 5)
- availability_num (int64) → number of copies available

No missing values in this dataset.

📌 Price
- 25% of the books cost ≤ £22.11 (first quartile, 25%)
- 25% of the books cost ≥ £47.46 (third quartile, 75%)

📌 Rating
- At least 25% of the books have ≤ 2 stars (first quartile, 25%)
- At least 25% of the books have ≥ 4 stars (third quartile, 75%)

📌 availability_num
- At least 25% of the books have ≤ 3 copies (first quartile, 25%)
- At least 25% of the books have ≥ 14 copies (third quartile, 75%)

**Complete dataset** → no missing values.<br>
**Price**: fairly wide distribution (£10 to £60).<br>
**Ratings**: centered around 3 stars (mean 2.92, median 3), with at least a quarter of the books rated 4 or 5.<br>
**Stock**: highly variable; some books are in limited supply (1 or 2 copies), others are well stocked (up to 22).
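
The observations above can be checked directly rather than read off `describe()`. The sketch below uses a small toy DataFrame (illustrative values only, not the real dataset) to show the three checks: missing values per column, quartiles, and the share of each rating.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Price": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Rating": [1, 2, 3, 4, 5],
    "availability_num": [1, 3, 7, 14, 22],
})

# Number of missing values per column (all zeros means a complete dataset)
missing_per_column = toy.isna().sum()

# First and third quartiles, the values describe() reports as 25% and 75%
quartiles = toy.quantile([0.25, 0.75])

# Share of each rating value, useful to quantify "a good proportion of 4s and 5s"
rating_share = toy["Rating"].value_counts(normalize=True).sort_index()

print(missing_per_column)
print(quartiles)
print(rating_share)
```

On the real data, the same calls would be applied to `df` instead of `toy`.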

In [7]:
df.head()

| | Title | Description | Price | Availability | Image_URL | Rating | availability_num |
|---|---|---|---|---|---|---|---|
| 0 | A Light in the Attic | Its hard to imagine a world without A Light in... | 51.77 | In stock (22 available) | https://books.toscrape.com/media/cache/fe/72/f... | 3 | 22 |
| 1 | Tipping the Velvet | Erotic and absorbing...Written with starling p... | 53.74 | In stock (20 available) | https://books.toscrape.com/media/cache/08/e9/0... | 1 | 20 |
| 2 | Soumission | Dans une France assez proche de la nôtre, un h... | 50.1 | In stock (20 available) | https://books.toscrape.com/media/cache/ee/cf/e... | 1 | 20 |
| 3 | Sharp Objects | WICKED above her hipbone, GIRL across her hear... | 47.82 | In stock (20 available) | https://books.toscrape.com/media/cache/c0/59/c... | 4 | 20 |
| 4 | Sapiens: A Brief History of Humankind | From a renowned historian comes a groundbreaki... | 54.23 | In stock (20 available) | https://books.toscrape.com/media/cache/ce/5f/c... | 5 | 20 |
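
In the rows shown, `availability_num` matches the number inside the `Availability` text. The notebook does not show how that column was built; the sketch below is one plausible way to derive it with a regular expression. The function name `parse_availability` and the "Out of stock" row are hypothetical.

```python
import re

import pandas as pd

toy = pd.DataFrame({
    "Availability": [
        "In stock (22 available)",
        "In stock (20 available)",
        "Out of stock",  # hypothetical value, not seen in the rows above
    ],
})

def parse_availability(text: str) -> int:
    """Return the number inside '(N available)', or 0 when no number is present."""
    match = re.search(r"\((\d+) available\)", text)
    return int(match.group(1)) if match else 0

toy["availability_num"] = toy["Availability"].apply(parse_availability)
print(toy)
```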


# Preparing the recommendation model (cosine similarity)

Load the book descriptions from the dataset (they are cleaned in the steps below)

In [8]:
# 2. Load the descriptions
df.Description

0      Its hard to imagine a world without A Light in...
1      Erotic and absorbing...Written with starling p...
2      Dans une France assez proche de la nôtre, un h...
3      WICKED above her hipbone, GIRL across her hear...
4      From a renowned historian comes a groundbreaki...
                             ...                        
995                                   Pas de description
996    High school student Kei Nagai is struck dead i...
997    In Englands Regency era, manners and elegance ...
998    James Patterson, bestselling author of the Ale...
999    Around the World, continent by continent, here...
Name: Description, Length: 1000, dtype: object

**1. Convert all text to lowercase: text.lower().**

In [9]:
df['Description'] = df['Description'].str.lower()
df['Description'].head(10)

0    its hard to imagine a world without a light in...
1    erotic and absorbing...written with starling p...
2    dans une france assez proche de la nôtre, un h...
3    wicked above her hipbone, girl across her hear...
4    from a renowned historian comes a groundbreaki...
5    patient twenty-nine.a monster roams the halls ...
6    drawing on his extensive experience evaluating...
7    if you have a heart, if you have a soul, karen...
8    for readers of laura hillenbrands seabiscuit a...
9    praise for aracelis girmay:girmays every losss...
Name: Description, dtype: object

# Apply tokenization: nltk.word_tokenize(text)

- Removal of punctuation, special characters and digits
- Tokenization
- Stemming
- Vectorization with TF-IDF
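
The first step in this list (removing punctuation, special characters and digits) has no dedicated cell in the notebook; later on it is handled implicitly by keeping only alphabetic tokens. As a minimal sketch, assuming a regex-based cleanup is acceptable, it could look like this. The helper name `clean_text` is hypothetical.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase the text and keep only letters (accented ones included) and spaces."""
    text = text.lower()
    # Replace punctuation/symbols ([^\w\s]) as well as digits and underscores with a space
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)
    # Collapse runs of whitespace left behind by the substitution
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Patient Twenty-Nine.A monster roams the halls, 24/7!"))
print(clean_text("proche de la nôtre, 2022"))
```

Replacing with a space rather than deleting keeps words apart when punctuation is the only separator, as in "twenty-nine.a".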

**4. Remove stopwords using nltk.corpus.stopwords.words('english') and nltk.corpus.stopwords.words('french').**

In [10]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stopwords_en = set(stopwords.words('english'))
stopwords_fr = set(stopwords.words('french'))
stopwords_combined = stopwords_en.union(stopwords_fr)

def remove_stopwords(text):
    tokens = word_tokenize(text) # Tokenize the text
    filtered_tokens = [word for word in tokens if word not in stopwords_combined] # Filter out stopwords
    return " ".join(filtered_tokens) # Rebuild the sentence

# Apply the stopword removal function to the 'Description' column
df['Description'] = df['Description'].apply(remove_stopwords)

# Display the first rows to check the result
df['Description'].head(20)

0     hard imagine world without light attic . now-c...
1     erotic absorbing ... written starling power. -...
2     france assez proche nôtre , homme sengage carr...
3     wicked hipbone , girl across heart words like ...
4     renowned historian comes groundbreaking narrat...
5     patient twenty-nine.a monster roams halls soot...
6     drawing extensive experience evaluating applic...
7     heart , soul , karen hicks coming woman make f...
8     readers laura hillenbrands seabiscuit unbroken...
9     praise aracelis girmay : girmays every lossshe...
10    since assault , miss annette chetwynd plagued ...
11    book important complete collection sonnets wil...
12    aaron ledbetters future planned since born . y...
13    scott pilgrims life totally sweet . hes 23 yea...
14    punks raw power rejuvenated rock , summer 1977...
15    never-before-told story musical revolution hap...
16    part fact , part fiction , tyehimba jesss much...
17    andrew barger , award-winning author engin

**5. Apply NLTK lemmatization to reduce words to their base form.**

In [13]:
import nltk

nltk.download('wordnet')                 # For the lemmatizer
nltk.download('omw-1.4')                 # Multilingual data for WordNet
nltk.download('punkt')                   # Tokenizer
nltk.download('averaged_perceptron_tagger')  # POS tagger required by pos_tag
nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\naoui\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\naoui\AppData\Roaming\nltk_data...
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\naoui\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\naoui\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\naoui\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.


True

In [14]:
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet

lemmatizer_en = WordNetLemmatizer()
stemmer_fr = SnowballStemmer('french')

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def preprocess_text(text):
    # Tokenize
    tokens = word_tokenize(text.lower())

    # Keep alphabetic tokens only and remove stopwords
    tokens = [t for t in tokens if t.isalpha() and t not in stopwords_combined]

    # English POS tagging (needed for English lemmatization)
    pos_tags = pos_tag(tokens)

    processed_tokens = []
    for token, tag in pos_tags:
        # Note: stopwords were already removed above, so neither check below can
        # match; every remaining token goes through the English lemmatizer
        if token in stopwords_en:
            continue
        if token in stopwords_fr:
            # French: simple stemmer (NLTK has no reliable French lemmatizer)
            processed_tokens.append(stemmer_fr.stem(token))
        else:
            # English: lemmatization using the POS tag
            wn_tag = get_wordnet_pos(tag)
            processed_tokens.append(lemmatizer_en.lemmatize(token, wn_tag))

    return processed_tokens

# Apply to the column
df['Description_processed'] = df['Description'].apply(preprocess_text)

df[['Description', 'Description_processed']].head(20)


| | Description | Description_processed |
|---|---|---|
| 0 | hard imagine world without light attic . now-c... | [hard, imagine, world, without, light, attic, ... |
| 1 | erotic absorbing ... written starling power. -... | [erotic, absorb, write, starling, power, new, ... |
| 2 | france assez proche nôtre , homme sengage carr... | [france, assez, proche, nôtre, homme, sengage,... |
| 3 | wicked hipbone , girl across heart words like ... | [wicked, hipbone, girl, across, heart, word, l... |
| 4 | renowned historian comes groundbreaking narrat... | [renowned, historian, come, groundbreaking, na... |
| 5 | patient twenty-nine.a monster roams halls soot... | [patient, monster, roams, hall, soothe, hill, ... |
| 6 | drawing extensive experience evaluating applic... | [draw, extensive, experience, evaluate, applic... |
| 7 | heart , soul , karen hicks coming woman make f... | [heart, soul, karen, hicks, come, woman, make,... |
| 8 | readers laura hillenbrands seabiscuit unbroken... | [reader, laura, hillenbrands, seabiscuit, unbr... |
| 9 | praise aracelis girmay : girmays every lossshe... | [praise, aracelis, girmay, girmays, every, los... |
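
Row 2 shows that the French description is left unstemmed: once stopwords have been filtered out, testing individual tokens against the stopword lists can no longer reveal the language. One possible alternative, sketched here under assumptions, is to guess the language of the whole description before any filtering by counting stopword hits. The tiny word sets and the name `guess_language` are hypothetical; the notebook would use NLTK's full lists.

```python
# Tiny hand-picked stopword samples (hypothetical; NLTK's lists are far larger)
EN_SAMPLE = {"the", "a", "of", "and", "in", "to", "is", "with"}
FR_SAMPLE = {"le", "la", "de", "et", "un", "une", "dans", "est"}

def guess_language(raw_text: str) -> str:
    """Guess 'fr' or 'en' by counting stopword hits in the raw, unfiltered text."""
    tokens = raw_text.lower().split()
    en_hits = sum(t in EN_SAMPLE for t in tokens)
    fr_hits = sum(t in FR_SAMPLE for t in tokens)
    # Ties (including zero hits on both sides) default to English
    return "fr" if fr_hits > en_hits else "en"

print(guess_language("Dans une France assez proche de la nôtre"))
print(guess_language("its hard to imagine a world without a light in"))
```

The result could then decide whether a description goes through the French stemmer or the English lemmatizer as a whole.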


# Feature extraction

**1. Vectorize the text with TfidfVectorizer()**

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib
import pandas as pd
import os

# The preprocessed column 'Description_processed' holds lists of tokens,
# so join each list back into a single string
df['Description_processed_text'] = df['Description_processed'].apply(lambda tokens: ' '.join(tokens))

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

# Fit and transform the preprocessed descriptions (text)
X = tfidf_vectorizer.fit_transform(df['Description_processed_text'])

# Get the feature names
feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert the TF-IDF matrix to a DataFrame for exploration if needed
tfidf_df = pd.DataFrame(X.toarray(), columns=feature_names)

# Display summary information
print(f"TF-IDF matrix shape (number of documents, number of features): {X.shape}")
print(f"TF-IDF matrix type: {type(X)}")
print("TF-IDF DataFrame preview:")
print(tfidf_df.head())

# Save the vectorizer for future reuse
if not os.path.exists('models'):
    os.makedirs('models')

joblib.dump(tfidf_vectorizer, 'models/tfidf_vectorizer.pkl')
print("TF-IDF vectorizer saved to 'models/tfidf_vectorizer.pkl'")

print("\n--- TF-IDF vectorization complete ---")


TF-IDF matrix shape (number of documents, number of features): (1000, 5000)
TF-IDF matrix type: <class 'scipy.sparse._csr.csr_matrix'>
TF-IDF DataFrame preview:
   aaron  abandon  abbot  abby  abduction  abigail   ability  able  abound  \
0    0.0      0.0    0.0   0.0        0.0      0.0  0.000000   0.0     0.0   
1    0.0      0.0    0.0   0.0        0.0      0.0  0.000000   0.0     0.0   
2    0.0      0.0    0.0   0.0        0.0      0.0  0.000000   0.0     0.0   
3    0.0      0.0    0.0   0.0        0.0      0.0  0.000000   0.0     0.0   
4    0.0      0.0    0.0   0.0        0.0      0.0  0.069156   0.0     0.0   

   abraham  ...  yuki  zeal  zero  zeus  zimbardo  zodiac  zombie  zone  \
0      0.0  ...   0.0   0.0   0.0   0.0       0.0     0.0     0.0   0.0   
1      0.0  ...   0.0   0.0   0.0   0.0       0.0     0.0     0.0   0.0   
2      0.0  ...   0.0   0.0   0.0   0.0       0.0     0.0     0.0   0.0   
3      0.0  ...   0.0   0.0   0.0   0.0       0.0  

In [16]:
tfidf_df.head(20).round(30)


Unnamed: 0,aaron,abandon,abbot,abby,abduction,abigail,ability,able,abound,abraham,...,yuki,zeal,zero,zeus,zimbardo,zodiac,zombie,zone,zorin,zuko
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.069156,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


 --- Text vectorization with TfidfVectorizer ---

- Instantiate the TfidfVectorizer
- max_features limits the vocabulary size and keeps the model manageable

- Fit the vectorizer on the descriptions and transform the text
- 'Description_processed_text' is the column that contains the cleaned, tokenized and lemmatized text


- **Term Frequency (TF)**: the number of times a word appears in a document (here, a book description). A word that appears often in a document has a high TF value.

- **Inverse Document Frequency (IDF)**: measures how informative a word is across the whole corpus of descriptions. A very common word that appears in many documents (such as "le", "la", "un", which are stopwords) gets a low IDF value because it discriminates poorly. Conversely, a rare word that appears in only a few descriptions gets a high IDF value.

- **The TF-IDF score** is the product of these two values (TF×IDF). It gives a high weight to words that are frequent in one description but rare in the rest of the corpus. As a result, words specific to a book weigh more in the similarity computation than generic words shared by most descriptions.
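
To make the TF×IDF product concrete, the sketch below recomputes one weight by hand on a toy corpus and compares it with scikit-learn. It assumes the `TfidfVectorizer` defaults: a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. The three toy documents are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of already-preprocessed "descriptions" (illustrative only)
docs = ["magic school magic", "school murder", "murder mystery detective"]

vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
X_toy = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_

n_docs = len(docs)

def idf(term: str) -> float:
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    df_t = sum(term in d.split() for d in docs)
    return float(np.log((1 + n_docs) / (1 + df_t)) + 1)

# First document: "magic" appears twice, "school" once
raw = {"magic": 2 * idf("magic"), "school": 1 * idf("school")}
l2_norm = np.sqrt(sum(v ** 2 for v in raw.values()))

manual_magic = raw["magic"] / l2_norm
sklearn_magic = X_toy[0, vocab["magic"]]

print(manual_magic, sklearn_magic)
```

"magic" ends up with a higher weight than "school" in the first document: it occurs more often there and appears in fewer documents overall.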


#  Similarity between books (basic recommendation)

We will use the **book descriptions** to compute a similarity measure:

1. **TF-IDF**: turn the descriptions into numeric vectors.
2. **Cosine similarity**: measure how closely two books resemble each other.

This will serve as the basis for a **recommendation system**.
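
As a reminder of what the measure computes, here is a minimal NumPy sketch of cosine similarity between two vectors. The helper `cosine` and the three toy vectors are illustrative, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """cos(u, v) = (u . v) / (||u|| * ||v||); 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, different length
c = np.array([0.0, 0.0, 3.0])  # shares no term with a

print(cosine(a, b))  # close to 1: same direction
print(cosine(a, c))  # 0: no term in common
```

Because only the direction matters, a long description and a short one that use the same vocabulary in the same proportions are treated as very similar.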


In [18]:
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns

cosine_sim = cosine_similarity(X)
cosine_sim_df = pd.DataFrame(cosine_sim, index=df.index, columns=df.index)
print("Cosine similarity matrix:")
cosine_sim_df.head(20)

Cosine similarity matrix:


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
0,1.0,0.009794,0.0,0.013485,0.002102,0.002186,0.015258,0.014283,0.022232,0.040587,...,0.021126,0.027753,0.004372,0.0,0.022581,0.0,0.0,0.002485,0.008397,0.017007
1,0.009794,1.0,0.0,0.024254,0.017584,0.088155,0.013077,0.025107,0.015388,0.027401,...,0.029493,0.009629,0.01043,0.006486,0.010622,0.0,0.039042,0.055556,0.039841,0.033141
2,0.0,0.0,1.0,0.019266,0.0,0.0,0.0,0.015412,0.0,0.011638,...,0.0,0.0,0.0,0.0,0.014425,0.0,0.0,0.0,0.0,0.0
3,0.013485,0.024254,0.019266,1.0,0.053206,0.026619,0.026422,0.043533,0.015041,0.029409,...,0.01549,0.029058,0.015973,0.009564,0.014108,0.0,0.019227,0.01701,0.057315,0.003335
4,0.002102,0.017584,0.0,0.053206,1.0,0.000979,0.023519,0.041918,0.017537,0.030196,...,0.011916,0.020792,0.014494,0.025659,0.010768,0.0,0.057197,0.007341,0.041775,0.028889
5,0.002186,0.088155,0.0,0.026619,0.000979,1.0,0.005837,0.061088,0.005372,0.042171,...,0.005515,0.009331,0.021473,0.065527,0.00477,0.0,0.058852,0.011575,0.024447,0.012554
6,0.015258,0.013077,0.0,0.026422,0.023519,0.005837,1.0,0.022688,0.036283,0.030947,...,0.00592,0.043139,0.024283,0.034666,0.024898,0.0,0.045545,0.004767,0.056862,0.033813
7,0.014283,0.025107,0.015412,0.043533,0.041918,0.061088,0.022688,1.0,0.040819,0.037584,...,0.025221,0.089851,0.046822,0.055882,0.009593,0.0,0.034537,0.035226,0.11107,0.023747
8,0.022232,0.015388,0.0,0.015041,0.017537,0.005372,0.036283,0.040819,1.0,0.040318,...,0.020398,0.01776,0.046726,0.010593,0.02021,0.0,0.0,0.033262,0.063938,0.037572
9,0.040587,0.027401,0.011638,0.029409,0.030196,0.042171,0.030947,0.037584,0.040318,1.0,...,0.019665,0.046675,0.022303,0.018146,0.018262,0.0,0.046527,0.001044,0.021789,0.046214


In [19]:


# Recommendation function
def recommend_books(book_index, top_n=5):
    similarities = cosine_sim[book_index]
    # Sort by decreasing similarity and drop the book itself explicitly,
    # since it is not guaranteed to come first when several scores tie at 1.0
    ranked = [i for i in similarities.argsort()[::-1] if i != book_index]
    return df.iloc[ranked[:top_n]][['Description']]

# Example: recommend 5 books for the first book
recommend_books(0, top_n=5)

| | Description |
|---|---|
| 220 | kind creative . everyone creative core . every... |
| 39 | poetry harrowing , angry , achingly beautiful ... |
| 909 | drawn intimate personal associations , pablo n... |
| 444 | consistently entertaining courtesy rabins humo... |
| 269 | salt journey warmth sharpness . collection poe... |
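
The output above lists preprocessed descriptions, which are hard to read as recommendations. A possible variant, sketched here on toy data, looks a book up by title and returns the most similar titles with their scores. The function `recommend_by_title`, the toy titles and the toy similarity matrix are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy catalogue and similarity matrix (illustrative values only)
toy_books = pd.DataFrame({"Title": ["Book A", "Book B", "Book C", "Book D"]})
toy_sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.9],
    [0.3, 0.4, 0.9, 1.0],
])

def recommend_by_title(title, books, sim, top_n=2):
    """Return the top_n most similar titles together with their similarity scores."""
    matches = books.index[books["Title"] == title]
    if len(matches) == 0:
        raise KeyError(f"Unknown title: {title!r}")
    label = matches[0]
    pos = books.index.get_loc(label)
    # Similarity of the chosen book to every other book, itself excluded
    scores = pd.Series(sim[pos], index=books.index).drop(label)
    top = scores.nlargest(top_n)
    return books.loc[top.index, ["Title"]].assign(score=top.values)

result = recommend_by_title("Book A", toy_books, toy_sim)
print(result)
```

With the real data, `books` would be `df` and `sim` the `cosine_sim` matrix computed earlier.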


In [20]:
# Save the cosine similarity matrix for future reuse
if not os.path.exists('models'):
    os.makedirs('models')

joblib.dump(cosine_sim, 'models/cosine_sim.pkl')

['models/cosine_sim.pkl']
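
Once saved, the vectorizer can be reloaded to score new text against the catalogue without refitting. The sketch below shows the round trip on a toy corpus written to a temporary directory; the corpus, the query and the file location are illustrative, and a real query would first need the same preprocessing as the descriptions.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the preprocessed descriptions
corpus = ["magic school wizard", "murder mystery detective", "space war alien"]

# Save a fitted vectorizer, then load it back (temporary directory for the demo)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tfidf_vectorizer.pkl")
    joblib.dump(TfidfVectorizer().fit(corpus), path)
    loaded = joblib.load(path)

# Vectorize the corpus and a new query with the reloaded vectorizer
corpus_matrix = loaded.transform(corpus)
query_vec = loaded.transform(["young wizard magic"])  # "young" is out of vocabulary

# Similarity of the query to each document, and the best match
scores = cosine_similarity(query_vec, corpus_matrix).ravel()
best = int(scores.argmax())
print(best, scores)
```

Words unseen at fit time are simply ignored by `transform`, so the query is matched only on the vocabulary the vectorizer already knows.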