In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Read the clean datasets
recipes_df = pd.read_pickle('clean_recipes.pkl')
users_df = pd.read_pickle('clean_interactions.pkl')
food_df = pd.read_pickle('food.pkl')

In [None]:
# Shape of the data
print('The shape of the recipe dataset is:', recipes_df.shape)
print('The shape of the user interactions dataset is:', users_df.shape)
print('The shape of the merged dataset is:', food_df.shape)

In [None]:
# View recipes dataset
recipes_df.head()

In [None]:
# Numerical summary
recipes_df.describe()

In [None]:
# Select nutritional columns for analysis
nutritional_columns = ['calories', 'total_fat', 'sugar', 'sodium', 'protein', 'saturated_fat', 'carbohydrates']

# Plot boxplots of nutritional features
sns.boxplot(data=recipes_df[nutritional_columns])
plt.xlabel('Nutritional Feature')
plt.ylabel('Value')
plt.title('Nutritional Analysis')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Merge recipes with user interactions (recipes_df 'id' matches users_df 'recipe_id')
merged_df = pd.merge(recipes_df, users_df, left_on='id', right_on='recipe_id')

In [None]:
# View column types and non-null counts of merged df
merged_df.info()

In [None]:
# Correlation heatmap; numeric_only=True skips text and list columns
plt.figure(figsize=(15,6))
sns.heatmap(merged_df.corr(numeric_only=True), annot=True)
plt.show()

The correlation coefficient between 'vegan' and 'vegetarian' is 0.49, a moderate positive correlation that matches the relationship between the two labels. Every vegan dish is also vegetarian, since it excludes all animal products, whereas vegetarian dishes may include dairy and eggs. Recipes labeled vegan are therefore likely to be labeled vegetarian as well, but the reverse does not hold, which is consistent with a correlation that is positive but well below 1.
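A cross-tabulation makes the subset relationship easier to check than the coefficient alone. The sketch below uses a small made-up frame (the `toy` data is hypothetical and not drawn from `clean_recipes.pkl`) with binary `vegan` and `vegetarian` flags. The same calls could be run on `merged_df`, assuming both columns are stored as 0/1 values.

```python
import pandas as pd

# Hypothetical toy data: every vegan recipe is also vegetarian,
# but most vegetarian recipes are not vegan.
toy = pd.DataFrame({
    "vegan":      [1, 1, 0, 0, 0, 0, 0, 0],
    "vegetarian": [1, 1, 1, 1, 1, 0, 0, 0],
})

# Rows are the vegan flag, columns are the vegetarian flag.
table = pd.crosstab(toy["vegan"], toy["vegetarian"])
print(table)

# Recipes labeled vegan but not vegetarian; a non-zero count
# would point to inconsistent labels.
violations = int(((toy["vegan"] == 1) & (toy["vegetarian"] == 0)).sum())
print("vegan but not vegetarian:", violations)

# Pearson correlation of two binary flags (the phi coefficient).
phi = toy["vegan"].corr(toy["vegetarian"])
print("correlation:", round(phi, 2))
```

Even with a perfect subset relationship, the coefficient in this toy example stays well below 1, because the vegetarian group is larger than the vegan group.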

Next, the correlation coefficient between 'n_steps' and 'n_ingredients' is 0.39, a moderate positive correlation: recipes with more ingredients tend to have more steps. This matches intuition, since each additional ingredient usually needs some preparation or cooking step to incorporate it. More complex recipes, with a greater variety of ingredients, therefore tend to involve more steps.

The correlation coefficient between 'submitted_year' and 'review_year' is 0.4, a moderate positive correlation: recipes submitted in earlier years tend to be reviewed in earlier years. Part of this is structural, because a recipe cannot be reviewed before it is submitted, so later submissions can only receive later reviews. Correlation does not imply causation, however, and other factors, such as the popularity of certain recipes during specific time periods or the availability of ingredients, may also contribute. Further analysis with other variables would be needed to understand the underlying reasons.
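One way to start that further analysis is to look at the lag between submission and review. The following is a minimal sketch on made-up rows: the `toy` frame and the `review_lag` column are hypothetical, while `submitted_year` and `review_year` are the column names used in the merged data.

```python
import pandas as pd

# Hypothetical rows, one per review.
toy = pd.DataFrame({
    "submitted_year": [2002, 2002, 2005, 2008, 2008, 2012],
    "review_year":    [2003, 2009, 2005, 2008, 2015, 2013],
})

# Years elapsed between a recipe's submission and each review.
toy["review_lag"] = toy["review_year"] - toy["submitted_year"]

# A review dated before the recipe's submission would signal bad data.
n_negative = int((toy["review_lag"] < 0).sum())
print("reviews dated before submission:", n_negative)

# Summary of how long after submission reviews arrive.
print(toy["review_lag"].describe())
```

If most lags are short, the correlation mainly reflects recipes being reviewed soon after they are posted; a wide spread of lags would weaken it.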

In [None]:
correlation_matrix = merged_df[['calories', 'total_fat', 'sugar', 'sodium', 'protein', 'saturated_fat', 'carbohydrates']].corr()
sns.heatmap(correlation_matrix, annot=True)
plt.title('Correlation Matrix')
plt.show()

The strong positive correlations between 'calories' and 'total_fat' (0.91) and between 'calories' and 'saturated_fat' (0.86) show that calorie content rises with fat content. This is expected: fat is a concentrated source of energy, containing more calories per gram than protein or carbohydrates. Recipes with higher calorie counts are therefore more likely to contain more total fat and saturated fat. These relationships matter when assessing the nutritional composition and energy content of recipes.

Similarly, the correlation coefficient of 0.73 between 'calories' and 'carbohydrates' suggests a moderate positive correlation: carbohydrate content also tends to rise with calorie content. Carbohydrates are a macronutrient that contributes to the caloric content of food; they are a primary source of energy for the body and are common in grains, fruits, and vegetables. Recipes with higher calorie counts would therefore be expected to contain more carbohydrates. However, correlation does not imply causation, and other factors may also influence this relationship.
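The energy argument can be made concrete with the commonly used Atwater general factors of roughly 9 kcal per gram of fat and 4 kcal per gram of protein or carbohydrate. The sketch below is illustrative arithmetic only: the function name and gram values are hypothetical, and it assumes nutrient amounts in grams, which may not match the units of the nutritional columns in this dataset.

```python
# Approximate energy densities in kcal per gram (Atwater general factors).
KCAL_PER_GRAM = {"fat": 9, "protein": 4, "carbohydrates": 4}

def estimate_calories(fat_g, protein_g, carbohydrates_g):
    """Estimate kcal from macronutrient grams (alcohol and fiber ignored)."""
    return (KCAL_PER_GRAM["fat"] * fat_g
            + KCAL_PER_GRAM["protein"] * protein_g
            + KCAL_PER_GRAM["carbohydrates"] * carbohydrates_g)

# Same total macronutrient mass (50 g), different composition (hypothetical values).
high_fat = estimate_calories(fat_g=30, protein_g=10, carbohydrates_g=10)
high_carb = estimate_calories(fat_g=10, protein_g=10, carbohydrates_g=30)
print(high_fat, high_carb)  # 350 250
```

Swapping 20 g of carbohydrate for 20 g of fat adds 100 kcal in this example, which is why fat content tracks calories more tightly than carbohydrate content does.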

In [None]:
# Count for each rating score
users_df['rating'].value_counts()

In [None]:
# Bar graph for user rating
users_df['rating'].value_counts().sort_index().plot(kind="bar")
plt.xlabel('Rating')
plt.ylabel('Count')
plt.title('User Rating Distribution')
plt.xticks(rotation=0)
plt.show()

In [None]:

# Line plot of the number of user interactions per review year
yearly_interaction_counts = users_df['review_year'].value_counts().sort_index()
yearly_interaction_counts.plot(kind='line', marker='o')
plt.xlabel('Year')
plt.ylabel('Number of Interactions')
plt.title('User Interactions Over Time')
plt.xlim(2000, 2018)
plt.show()


In [None]:
# Perform time-based analysis
time_analysis = recipes_df.groupby('submitted_year').size()
time_analysis.plot(kind='line', marker='o')
plt.xlim(2000, 2018)
plt.xlabel('Year')
plt.ylabel('Number of Recipes')
plt.title('Number of Recipes Over Time')
plt.show()

In [None]:
# Calculate the count of recipes for each dietary restriction
dietary_restrictions = ['dairy-free', 'gluten-free', 'low-carb', 'vegan', 'vegetarian']
restriction_counts = recipes_df[dietary_restrictions].sum()

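# Add a 'none' category for recipes with no dietary label (all flags are 0).
# Counting rows avoids double-counting recipes with several labels (e.g. vegan and vegetarian).
restriction_counts['none'] = (recipes_df[dietary_restrictions].sum(axis=1) == 0).sum()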

# Plot bar chart of dietary restriction counts with log scale
sns.barplot(x=restriction_counts.index, y=restriction_counts.values)
plt.xlabel('Dietary Restriction')
plt.ylabel('Count')
plt.yscale('log')
plt.title('Count of Dietary Restrictions in Recipes (Log Scale)')
plt.show()

In [None]:
# Top 10 most-voted recipes among those with a perfect average rating
filtered_recipes = food_df[food_df['average_rating'] == 5]
top_recipes = filtered_recipes.nlargest(10, 'votes')

plt.figure(figsize=(10, 6))
plt.barh(top_recipes['name'], top_recipes['votes'], color='lightgreen')
plt.xlabel('Votes')
plt.ylabel('Recipe Name')
plt.title('Top 10 Recipes with Average Rating of 5 Based on Votes')
plt.gca().invert_yaxis()
plt.show()


In [None]:
from wordcloud import WordCloud

# Concatenate all ingredients into a single string
ingredients_text = ' '.join(recipes_df['ingredients'].explode().str.replace("'", ""))

# Create a WordCloud object with an off-white background color
wordcloud = WordCloud(width=800, height=400, background_color='ivory').generate(ingredients_text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Ingredient Word Cloud')
plt.show()

# **CONCLUSION**

During the EDA step of the recipe recommendation system, the following findings were uncovered:
- Some nutritional value columns may have outliers, but we keep them due to the unknown serving size.
- A moderate positive correlation (0.49) between 'vegan' and 'vegetarian', as expected.
- Moderate positive correlation (0.39) between 'n_steps' and 'n_ingredients', indicating that recipes with more ingredients tend to have more steps.
- Moderate positive correlation (0.4) between 'submitted_year' and 'review_year'; further investigation is needed.
- Strong positive correlations between 'calories' & 'total_fat' (0.91) and 'calories' & 'saturated_fat' (0.86).
- Moderate positive correlation (0.73) between 'calories' and 'carbohydrates'.
- User ratings are concentrated at the high end of the scale, suggesting a potential rating bias.
- Peak user interactions in 2008, gradual decline afterward.
- Steady increase in recipe submissions until peak in 2017, followed by a gradual decline.
- Varying popularity of dietary preferences: 'none', 'low-carb', 'vegetarian', 'vegan', 'gluten-free', and 'dairy-free'.
- Word cloud highlights common ingredients like "garlic clove" and "olive oil".

Overall, the EDA provides valuable insights into the dataset, offering information about nutritional aspects, dietary preferences, user interactions, and ingredient usage. These findings will guide further analysis and model development in the recipe recommendation system. With this, we will now proceed to the modeling phase.