This code uses the Jaccard index, a simple but powerful measure of the similarity between two sets: the size of their intersection divided by the size of their union.

Example (each word is lowercased first, so that "Le" and "le" count as the same element):
Set A: {"le", "chat", "mange", "la", "souris"}
Set B: {"la", "souris", "mange", "le", "fromage"}
The intersection of the two sets is: {"la", "souris", "mange", "le"}
The union of the two sets is: {"le", "chat", "mange", "la", "souris", "fromage"}
So the Jaccard similarity between the two sentences is 4/6 ≈ 0.67.

Interpretation, writing J(A,B) for the Jaccard index of sets A and B:
If J(A,B) = 1, the two sets are identical.
If J(A,B) = 0, the two sets have no element in common.
Values between 0 and 1 indicate the degree of similarity between the two sets: the higher the value, the more similar they are.
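
The worked example and the interpretation above can be checked with a short sketch. The helper name `jaccard_similarity` is hypothetical (it does not appear in the original code). The sketch assumes that words are compared case-insensitively and that two empty sets score 0.0 by convention.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard index of two token collections, compared case-insensitively."""
    set_a = {token.lower() for token in tokens_a}
    set_b = {token.lower() for token in tokens_b}
    union = set_a | set_b
    if not union:
        # Assumption: two empty sets get a score of 0.0 to avoid dividing by zero
        return 0.0
    return len(set_a & set_b) / len(union)


set_a = ["Le", "chat", "mange", "la", "souris"]
set_b = ["La", "souris", "mange", "le", "fromage"]

print(round(jaccard_similarity(set_a, set_b), 2))  # 4 shared words out of 6 distinct words
print(jaccard_similarity(set_a, set_a))            # identical sets
print(jaccard_similarity(["chat"], ["fromage"]))   # no element in common
```

Lowercasing inside the helper mirrors the `.lower().split()` step used on product names in the cells below.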



In [4]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load data from csv files
df_results = pd.read_csv("ml_results.csv")
df_training = pd.read_csv("ml_training_dataset.csv")


# Concatenate product names from both dataframes to build a vocabulary for TF-IDF
all_product_names = pd.concat([df_results['product_name'], df_training['product_name']])

# Create and fit TF-IDF model
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(all_product_names)

# Separate the vectors of results and training data
X_results = X[:df_results.shape[0]]
X_training = X[df_results.shape[0]:]

# Calculate cosine similarity
similarity_scores = cosine_similarity(X_results, X_training)

# Get most similar instances in the training dataset for each instance in results dataset
most_similar_indexes = np.argmax(similarity_scores, axis=1)

# Display most similar product names from training dataset
most_similar_product_names = df_training.iloc[most_similar_indexes]['product_name']
print(most_similar_product_names)


In [None]:
import pandas as pd

# Load data from csv files
df_results = pd.read_csv("/Users/oume3001/Downloads/ml_results.csv")
df_training = pd.read_csv("/Users/oume3001/Downloads/ml_training_dataset.csv")

# Loop through each product_name in df_results
for i, row in df_results.iterrows():
    result_product_name = set(row['product_name'].lower().split())
    # Loop through each product_name in df_training
    for j, row_train in df_training.iterrows():
        train_product_name = set(row_train['product_name'].lower().split())
        # Calculate the Jaccard similarity (0 when both token sets are empty)
        union = result_product_name | train_product_name
        similarity = len(result_product_name & train_product_name) / len(union) if union else 0.0
        # If the similarity is above a threshold
        if similarity > 0.5:  # You may adjust this threshold
            print(f"Similar products: {row['product_name']} (results) and {row_train['product_name']} (training)")


In [None]:
import pandas as pd

# Load data from csv files
df_results = pd.read_csv("/Users/oume3001/Downloads/ml_results.csv")
df_training = pd.read_csv("/Users/oume3001/Downloads/ml_training_dataset.csv")

# Collect the similar pairs as rows, then build the DataFrame once at the end
similar_rows = []

# Loop through each product_name in df_results
for i, row in df_results.iterrows():
    result_product_name = set(row['product_name'].lower().split())
    # Loop through each product_name in df_training
    for j, row_train in df_training.iterrows():
        train_product_name = set(row_train['product_name'].lower().split())
        # Calculate the Jaccard similarity (0 when both token sets are empty)
        union = result_product_name | train_product_name
        similarity = len(result_product_name & train_product_name) / len(union) if union else 0.0
        # If the similarity is above 0.5
        if similarity > 0.5:
            # Record the pair and its score
            similar_rows.append([row['product_name'], row_train['product_name'], similarity])

# Build and print the DataFrame of similar products
df_similar_products = pd.DataFrame(similar_rows, columns=['product_name_from_results', 'product_name_from_training', 'similarity'])
print(df_similar_products)
