## Exploring and analyzing recipe ratings and interactions

This study explores recipe popularity in the public Kaggle dataset `Food.com recipes and user interactions` (`https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions`). We aim to characterize what makes a recipe "popular" in terms of the ratings it receives and the activity of users.

### Objectives
- **Understand the characteristics of popular recipes**: high mean ratings, number of ratings, etc.
- **Assess the role of the rating**: is the mean rating enough on its own, or should the volume of ratings and other interactions be taken into account?

### Analysis plan
1. **Preliminary exploration**
   - Number of recipes and interactions
   - Variable types and summary table
   - Missing data and consistency (date conversion)
2. **Univariate analysis of the `rating` variable**
   - Global statistics (mean, median, quantiles)
   - Distribution and potential outliers (incl. `rating=0` as "not rated")
3. **Visualizations**
   - Histograms and boxplots of the ratings
   - Scatter plot and subplots: relationship between the **number of ratings** and the **mean rating**
4. **Critical analysis**
   - Is `rating` alone sufficient?
   - Importance of rating volume and dispersion
   - Toward a multidimensional definition of popularity

The following sections implement this plan using `pandas`, `matplotlib` and `seaborn`.


In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option("display.max_colwidth", 200)
pd.set_option("display.max_columns", None)
sns.set_theme(style="whitegrid")


In [3]:
# Import the data
recipes = pd.read_csv("../data/RAW_recipes.csv")
interactions = pd.read_csv("../data/RAW_interactions.csv")

In [4]:
recipes.head()

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'cuisine', 'preparation', 'occasion', 'north-american', 'side-dishes', 'vegetables', 'mexican', 'easy', 'fall', 'holiday-event',...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'depending on size of squash , cut into half or fourths', 'remove seeds', 'for spicy squash , drizzle olive oil or melted butter over each cut squash piec...","autumn is my favorite time of year to cook! this recipe \r\ncan be prepared either spicy or sweet, your choice!\r\ntwo of my posted mexican-inspired seasoning mix recipes are offered as suggestions.","['winter squash', 'mexican seasoning', 'mixed spice', 'honey', 'butter', 'olive oil', 'salt']",7
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'cuisine', 'preparation', 'occasion', 'north-american', 'breakfast', 'main-dish', 'pork', 'american', 'oven', 'easy', 'kid-frien...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough into the bottom and sides of a 12 inch pizza pan', 'bake for 5 minutes until set but not browned', 'cut sausage into small pieces', 'whisk eggs and m...",this recipe calls for the crust to be prebaked a bit before adding ingredients. feel free to change sausage to ham or bacon. this warms well in the microwave for those late risers.,"['prepared pizza crust', 'sausage patty', 'eggs', 'milk', 'salt and pepper', 'cheese']",6
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'main-dish', 'chili', 'crock-pot-slow-cooker', 'dietary', 'equipment', '4-hours-or-less']","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add chopped onions to ground beef when almost brown and sautee until wilted', 'add all other ingredients', 'add kidney beans if you like beans in your chili', '...",this modified version of 'mom's' chili was a hit at our 2004 christmas party. we made an extra large pot to have some left to freeze but it never made it to the freezer. it was a favorite by all. ...,"['ground beef', 'yellow onions', 'diced tomatoes', 'tomato paste', 'tomato soup', 'rotel tomatoes', 'kidney beans', 'water', 'chili powder', 'ground cumin', 'salt', 'lettuce', 'cheddar cheese']",13
3,alouette potatoes,59389,45,68585,2003-04-14,"['60-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'preparation', 'occasion', 'side-dishes', 'eggs-dairy', 'potatoes', 'vegetables', 'oven', 'easy', 'dinner-party', 'holiday-event...","[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,"['place potatoes in a large pot of lightly salted water and bring to a gentle boil', 'cook until potatoes are just tender', 'drain', 'place potatoes in a large bowl and add all ingredients except ...","this is a super easy, great tasting, make ahead side dish that looks like you spent a lot more time preparing than you actually do. plus, most everything is done in advance. the times do not refle...","['spreadable cheese with garlic and herbs', 'new potatoes', 'shallots', 'parsley', 'tarragon', 'olive oil', 'red wine vinegar', 'salt', 'pepper', 'red bell pepper', 'yellow bell pepper']",11
4,amish tomato ketchup for canning,44061,190,41706,2002-10-25,"['weeknight', 'time-to-make', 'course', 'main-ingredient', 'cuisine', 'preparation', 'occasion', 'north-american', 'canning', 'condiments-etc', 'vegetables', 'american', 'heirloom-historical', 'ho...","[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,"['mix all ingredients& boil for 2 1 / 2 hours , or until thick', 'pour into jars', ""i use'old' glass ketchup bottles"", ""it is not necessary for these to'seal"", ""'my amish mother-in-law has been ma...","my dh's amish mother raised him on this recipe. he much prefers it over store-bought ketchup. it was a taste i had to acquire, but now my ds's also prefer this type of ketchup. enjoy!","['tomato juice', 'apple cider vinegar', 'sugar', 'salt', 'pepper', 'clove oil', 'cinnamon oil', 'dry mustard']",8


In [5]:
interactions.head()

Unnamed: 0,user_id,recipe_id,date,rating,review
0,38094,40893,2003-02-17,4,Great with a salad. Cooked on top of stove for 15 minutes.Added a shake of cayenne and a pinch of salt. Used low fat sour cream. Thanks.
1,1293707,40893,2011-12-21,5,"So simple, so delicious! Great for chilly fall evening. Should have doubled it ;)<br/><br/>Second time around, forgot the remaining cumin. We usually love cumin, but didn't notice the missing 1/2 ..."
2,8937,44394,2002-12-01,4,This worked very well and is EASY. I used not quite a whole package (10oz) of white chips. Great!
3,126440,85009,2010-02-27,5,I made the Mexican topping and took it to bunko. Everyone loved it.
4,57222,85009,2011-10-01,5,"Made the cheddar bacon topping, adding a sprinkling of black pepper. Yum!"


In [6]:
interactions.shape

(1132367, 5)

In [7]:
interactions.head(20)

Unnamed: 0,user_id,recipe_id,date,rating,review
0,38094,40893,2003-02-17,4,Great with a salad. Cooked on top of stove for 15 minutes.Added a shake of cayenne and a pinch of salt. Used low fat sour cream. Thanks.
1,1293707,40893,2011-12-21,5,"So simple, so delicious! Great for chilly fall evening. Should have doubled it ;)<br/><br/>Second time around, forgot the remaining cumin. We usually love cumin, but didn't notice the missing 1/2 ..."
2,8937,44394,2002-12-01,4,This worked very well and is EASY. I used not quite a whole package (10oz) of white chips. Great!
3,126440,85009,2010-02-27,5,I made the Mexican topping and took it to bunko. Everyone loved it.
4,57222,85009,2011-10-01,5,"Made the cheddar bacon topping, adding a sprinkling of black pepper. Yum!"
5,52282,120345,2005-05-21,4,very very sweet. after i waited the 2 days i bought 2 more pints of raspberries and added them to the mix. i'm going to add some as a cake filling today and will take a photo.
6,124416,120345,2011-08-06,0,"Just an observation, so I will not rate. I followed this procedure with strawberries instead of raspberries. Perhaps this is the reason it did not work well. Sorry to report that the strawberri..."
7,2000192946,120345,2015-05-10,2,This recipe was OVERLY too sweet. I would start out with 1/3 or 1/4 cup of sugar and jsut add on from there. Just 2 cups was way too much and I had to go back to the grocery store to buy more ra...
8,76535,134728,2005-09-02,4,Very good!
9,273745,134728,2005-12-22,5,Better than the real!!


In [8]:
# Preparation and preliminary exploration
# - Convert the date columns
# - Summary tables (size, dtypes, missing values, cardinalities)

# Convert dates (invalid values are coerced to NaT)
recipes['submitted'] = pd.to_datetime(recipes['submitted'], errors='coerce')
interactions['date'] = pd.to_datetime(interactions['date'], errors='coerce')

# Shapes of the two datasets
recipes_shape = recipes.shape
interactions_shape = interactions.shape
print({
    'recipes_shape': recipes_shape,
    'interactions_shape': interactions_shape
})

# Per-column summary: dtype, null counts and number of distinct values
def summarize_dataframe(df: pd.DataFrame, dataset_name: str) -> pd.DataFrame:
    columns = []
    for col in df.columns:
        series = df[col]
        columns.append({
            'dataset': dataset_name,
            'column': col,
            'dtype': str(series.dtype),
            'non_null': int(series.notna().sum()),
            'nulls': int(series.isna().sum()),
            'null_pct': float(series.isna().mean() * 100.0),
            'n_unique': int(series.nunique(dropna=True))
        })
    return pd.DataFrame(columns)

recipes_summary = summarize_dataframe(recipes, 'recipes')
interactions_summary = summarize_dataframe(interactions, 'interactions')

# Overview of the variables (one summary row per column)
display(recipes_summary)
display(interactions_summary)

# Count of each observed rating value (including 0)
rating_counts = interactions['rating'].value_counts(dropna=False).sort_index()
display(rating_counts)



{'recipes_shape': (231637, 12), 'interactions_shape': (1132367, 5)}


Unnamed: 0,dataset,column,dtype,non_null,nulls,null_pct,n_unique
0,recipes,name,object,231636,1,0.000432,230185
1,recipes,id,int64,231637,0,0.0,231637
2,recipes,minutes,int64,231637,0,0.0,888
3,recipes,contributor_id,int64,231637,0,0.0,27926
4,recipes,submitted,datetime64[ns],231637,0,0.0,5090
5,recipes,tags,object,231637,0,0.0,209115
6,recipes,nutrition,object,231637,0,0.0,229318
7,recipes,n_steps,int64,231637,0,0.0,94
8,recipes,steps,object,231637,0,0.0,231074
9,recipes,description,object,226658,4979,2.149484,222668


Unnamed: 0,dataset,column,dtype,non_null,nulls,null_pct,n_unique
0,interactions,user_id,int64,1132367,0,0.0,226570
1,interactions,recipe_id,int64,1132367,0,0.0,231637
2,interactions,date,datetime64[ns],1132367,0,0.0,6396
3,interactions,rating,int64,1132367,0,0.0,6
4,interactions,review,object,1132198,169,0.014924,1125282


rating
0     60847
1     12818
2     14123
3     40855
4    187360
5    816364
Name: count, dtype: int64

### Univariate analysis of `rating`

We examine the distribution of the ratings. Note: `rating = 0` often corresponds to an interaction without a rating (a comment left without an evaluation). We report statistics both with and without the zeros.
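
As a quick sanity check on the effect of the zeros, the sketch below recomputes the mean rating directly from the `value_counts()` output shown above. The variable names (`rating_counts_demo`, `mean_all`, `mean_pos`, `share_zero`) are introduced here for illustration only and are not part of the notebook.

```python
import pandas as pd

# Counts copied from the value_counts() output printed earlier in the notebook
rating_counts_demo = pd.Series(
    {0: 60847, 1: 12818, 2: 14123, 3: 40855, 4: 187360, 5: 816364}
)

total = rating_counts_demo.sum()

# Mean over all interactions (zeros counted as a rating of 0)
mean_all = (rating_counts_demo.index.to_numpy() * rating_counts_demo).sum() / total

# Mean over rated interactions only (rating > 0)
pos = rating_counts_demo[rating_counts_demo.index > 0]
mean_pos = (pos.index.to_numpy() * pos).sum() / pos.sum()

# Share of interactions that carry no rating
share_zero = rating_counts_demo.loc[0] / total

print(f"mean (all): {mean_all:.3f}")
print(f"mean (> 0): {mean_pos:.3f}")
print(f"share of zeros: {share_zero:.2%}")
```

Zeros make up roughly 5% of the interactions, so treating them as genuine ratings pulls the global mean down noticeably. This is why the statistics are reported both ways.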


In [8]:
# Rating statistics (with and without zeros)
ratings_all = interactions['rating'].dropna()
ratings_pos = ratings_all[ratings_all > 0]

summary_all = ratings_all.describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9])
summary_pos = ratings_pos.describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9])

print('Size (all ratings):', ratings_all.shape[0])
print('Size (ratings > 0):', ratings_pos.shape[0])
print('\nStatistics - all ratings:')
display(summary_all)
print('\nStatistics - ratings > 0:')
display(summary_pos)

# Aggregate at recipe level for the next analysis (means/medians ignore rating = 0)
agg = (interactions
       .assign(has_rating=lambda d: d['rating'] > 0)
       .groupby('recipe_id')
       .agg(
           n_interactions=('user_id', 'count'),
           n_rated=('has_rating', 'sum'),
           mean_rating=('rating', lambda s: s[s > 0].mean()),
           median_rating=('rating', lambda s: s[s > 0].median()),
       )
       .reset_index()
      )
agg['share_rated'] = np.where(agg['n_interactions'] > 0, agg['n_rated'] / agg['n_interactions'], np.nan)

display(agg.head())



### Best- and worst-rated recipes (with a volume threshold)

We identify the extreme recipes after applying a **minimum threshold** on `n_rated`, to avoid artifacts caused by very low volumes (e.g., a single perfect rating).
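
To illustrate why the threshold matters, here is a minimal sketch on made-up per-recipe aggregates. The values, the `toy` frame and the `min_rated` variable are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical per-recipe aggregates (illustrative values only)
toy = pd.DataFrame({
    "recipe_id": [101, 102, 103],
    "n_rated": [1, 250, 40],
    "mean_rating": [5.0, 4.8, 4.6],
})

# Without a threshold, the recipe with a single perfect rating ranks first
naive_top = int(toy.sort_values("mean_rating", ascending=False)["recipe_id"].iloc[0])

# With a minimum-volume threshold, it is excluded before ranking
min_rated = 20  # hypothetical threshold, mirroring the idea of MIN_RATED
filtered = toy[toy["n_rated"] >= min_rated]
robust_top = int(filtered.sort_values("mean_rating", ascending=False)["recipe_id"].iloc[0])

print("top without threshold:", naive_top)
print("top with threshold:", robust_top)
```

The choice of threshold is a trade-off: a higher value gives more stable means but discards more recipes.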



In [8]:
# Top / bottom recipes by mean rating (with a volume threshold)
MIN_RATED = 20  # adjustable
agg_valid = agg[agg['n_rated'] >= MIN_RATED].copy()

# Join to retrieve the recipe names
recipes_min = recipes[['id', 'name']].rename(columns={'id': 'recipe_id'})
agg_named = agg_valid.merge(recipes_min, on='recipe_id', how='left')

# Top 10
top10 = agg_named.sort_values(['mean_rating', 'n_rated'], ascending=[False, False]).head(10)
# Bottom 10 (excluding recipes whose mean_rating is NaN)
bot10 = agg_named.dropna(subset=['mean_rating']).sort_values(['mean_rating', 'n_rated'], ascending=[True, False]).head(10)

display(top10[['recipe_id', 'name', 'mean_rating', 'n_rated', 'n_interactions']])
display(bot10[['recipe_id', 'name', 'mean_rating', 'n_rated', 'n_interactions']])



### Visualizations: distributions and the volume-rating relationship

We plot:
- A **histogram** and a **boxplot** of the ratings (with and without zeros)
- A **scatter plot** of the relationship between `n_rated` and `mean_rating` (with transparency), and **subplots** by volume bin


In [None]:
# Histograms and boxplots of the ratings
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Histogram, all ratings
sns.histplot(ratings_all, bins=20, kde=False, ax=axes[0, 0], color='#4C78A8')
axes[0, 0].set_title('Histogram of ratings (all)')
axes[0, 0].set_xlabel('rating')

# Histogram, ratings > 0
sns.histplot(ratings_pos, bins=20, kde=False, ax=axes[0, 1], color='#F58518')
axes[0, 1].set_title('Histogram of ratings (> 0)')
axes[0, 1].set_xlabel('rating')

# Boxplot, all ratings
sns.boxplot(x=ratings_all, ax=axes[1, 0], color='#4C78A8')
axes[1, 0].set_title('Boxplot (all)')
axes[1, 0].set_xlabel('rating')

# Boxplot, ratings > 0
sns.boxplot(x=ratings_pos, ax=axes[1, 1], color='#F58518')
axes[1, 1].set_title('Boxplot (> 0)')
axes[1, 1].set_xlabel('rating')

plt.tight_layout()
plt.show()

# Scatter of n_rated vs. mean_rating (with transparency)
# Uses the unfiltered `agg` so that recipes below the threshold are visible too
fig, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(data=agg, x='n_rated', y='mean_rating', alpha=0.2, s=20, ax=ax)
ax.axvline(MIN_RATED, color='red', linestyle='--', alpha=0.6, label=f'Threshold n_rated={MIN_RATED}')
ax.set_title('Volume (n_rated) vs. mean rating')
ax.legend()
plt.show()

# Subplots by volume bin (computed on all recipes, otherwise the low bins are empty)
bins = [0, 5, 10, 20, 50, 100, np.inf]
labels = ['<=5', '6-10', '11-20', '21-50', '51-100', '>100']
agg['volume_bin'] = pd.cut(agg['n_rated'], bins=bins, labels=labels, right=True, include_lowest=True)

fig, axes = plt.subplots(2, 3, figsize=(14, 8), sharey=True)
axes = axes.ravel()
for i, lab in enumerate(labels):
    subset = agg[agg['volume_bin'] == lab]
    sns.boxplot(data=subset, y='mean_rating', ax=axes[i], color='#72B7B2')
    axes[i].set_title(f'Bin: {lab}\n(n={subset.shape[0]})')
    axes[i].set_xlabel('')
    axes[i].set_ylabel('mean_rating')

plt.tight_layout()
plt.show()


### Critical analysis: is the mean rating sufficient?

- **Volume vs. mean**: very high ratings backed by few evaluations do not guarantee popularity. The mean should be weighted by the **number of ratings**.
- **Distribution and dispersion**: the median and the dispersion complement the mean (robustness to extremes).
- **Interactions without a rating**: the `share_rated` ratio reflects engagement (many interactions but few ratings suggest a different kind of behavior).
- **Composite indicator (idea)**: `popularity ~ f(mean_rating, n_rated, share_rated)`, or a **Wilson score** to rank under uncertainty.
- **Next steps**:
  - Define a combined **popularity score** (e.g., mean weighted by log(n_rated), Wilson interval)
  - Explore other factors: `minutes`, `n_ingredients`, `tags` (cuisine, occasion), seasonality (date)
  - Segment by category and compare distributions
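
The two scoring ideas above can be sketched as follows. This is only one possible formulation, under assumptions that are not in the notebook: a rating of 4 or 5 is counted as "liked" (column `n_liked`), the log weighting uses `log1p` so that recipes with zero ratings do not break, and the toy values are invented. `wilson_lower_bound` is a helper written for this example, not a library function.

```python
import numpy as np
import pandas as pd


def wilson_lower_bound(n_pos, n_total, z=1.96):
    """Lower bound of the Wilson score interval for a proportion.

    Returns 0 where n_total is 0 (no evidence at all).
    """
    n_pos = np.atleast_1d(np.asarray(n_pos, dtype=float))
    n_total = np.atleast_1d(np.asarray(n_total, dtype=float))
    lower = np.zeros_like(n_total)
    has_votes = n_total > 0
    n = n_total[has_votes]
    p = n_pos[has_votes] / n
    centre = p + z**2 / (2.0 * n)
    margin = z * np.sqrt(p * (1.0 - p) / n + z**2 / (4.0 * n**2))
    lower[has_votes] = (centre - margin) / (1.0 + z**2 / n)
    return lower


# Hypothetical per-recipe aggregates (illustrative values only)
toy_scores = pd.DataFrame({
    "recipe_id": [1, 2, 3],
    "n_rated": [1, 100, 0],
    "n_liked": [1, 90, 0],       # assumption: "liked" means rating >= 4
    "mean_rating": [5.0, 4.7, np.nan],
})

# Wilson lower bound on the share of "liked" ratings
toy_scores["wilson_lb"] = wilson_lower_bound(toy_scores["n_liked"], toy_scores["n_rated"])

# Mean rating weighted by log(1 + n_rated); NaN when there is no rating
toy_scores["log_weighted"] = toy_scores["mean_rating"] * np.log1p(toy_scores["n_rated"])

print(toy_scores.sort_values("wilson_lb", ascending=False))
```

Both scores rank the recipe with 100 ratings above the one with a single perfect rating, which is the behavior a raw mean cannot provide. The threshold defining "liked" and the choice of `z` would need to be tuned on the real data.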
