# Data Mining Project 2023 - Vegetarian Menu
# Rodrigue Nasr and Nikola Nenovsky
# Autumn Semester 2023

##  I: Introduction

The growing availability of large text databases, driven by the vast amounts of data collected from the internet through web scraping, has sparked interest in text-processing techniques that can exploit this data: extracting valuable commercial and market information, improving the user experience of online platforms and services, summarizing and translating text, and many other applications. Modern methods often rely on machine learning to capture the semantic and grammatical links between the words of a sentence, to identify and discard linking words, and to correct misspellings that could hinder the analysis. It goes without saying that the recent release of ChatGPT to the general public has greatly increased interest in such models. Recall that GPT stands for Generative Pre-trained Transformer: the model is trained on a large volume of text found on the internet to predict the next words in a sequence, uses what it has learned to interpret the prompts entered by the user, and then generates a coherent and useful response.



##  II: Data collection and data cleaning

### Data collection: Web scraping

To carry out our analysis, we first need to collect textual data from the internet on different dishes, with their names and ingredients, so that we can classify them according to the types of diet they satisfy: all-inclusive, pescatarian, vegetarian, and vegan. Note that these diets are nested, each one being contained in the next:

$\text{Vegan} \subset \text{Vegetarian} \subset \text{Pescatarian} \subset \text{All-inclusive}$
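
The inclusion chain above can be illustrated with a minimal sketch. The keyword sets `MEAT`, `FISH` and `ANIMAL_PRODUCTS` and both function names are hypothetical, chosen only for illustration; they are not the classification method used later in this project.

```python
# Hypothetical keyword sets, chosen only for illustration (not taken from the scraped data)
MEAT = {"boeuf", "poulet", "porc", "lardons"}
FISH = {"saumon", "thon", "crevettes", "moules"}
ANIMAL_PRODUCTS = {"oeuf", "lait", "beurre", "fromage", "miel"}

# Diets ordered from most to least restrictive, mirroring the inclusion chain
DIETS = ["vegan", "vegetarian", "pescatarian", "all-inclusive"]


def most_restrictive_diet(ingredients):
    """Return the most restrictive diet that a list of ingredient names satisfies."""
    items = {name.lower() for name in ingredients}
    if items & MEAT:
        return "all-inclusive"
    if items & FISH:
        return "pescatarian"
    if items & ANIMAL_PRODUCTS:
        return "vegetarian"
    return "vegan"


def compatible_diets(ingredients):
    """Return every diet the dish is compatible with, following the inclusion chain."""
    start = DIETS.index(most_restrictive_diet(ingredients))
    return DIETS[start:]


print(compatible_diets(["tomate", "fromage"]))
```

Because the diets are nested, a dish only needs one label (its most restrictive diet); compatibility with the less restrictive diets follows from the ordering of `DIETS`.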

We will extract two kinds of data from the internet: menus with ingredients, used to train our models, and menus without ingredients, used to test the performance of our predictions.

We will collect data from French websites for our analysis. In theory, the approach should be replicable in any language, as long as a large enough database is available.

#### Marmiton 

The data for our analysis comes from Marmiton.org, a popular French cooking website that contains a large number of recipes together with the ingredients needed to cook them.

The website already groups its recipes into categories: viande (meat), poisson (fish), fruits de mer (seafood), oeufs (eggs), plats végétariens (vegetarian dishes) and plats au fromage (cheese dishes).

We will retrieve these recipes and ingredients with Beautiful Soup, a Python library first released by Leonard Richardson in 2004, which is well suited to parsing and extracting information from HTML webpages such as Marmiton's.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
from collections import Counter

url_base = "https://www.marmiton.org/recettes/index/categorie/"
categories = ['viande', 'poisson', 'oeufs', 'fruits-de-mer', 'plat-vegetarien', 'plats-au-fromage']

On this webpage, each recipe is displayed as follows:

<img src="marmiton.png">

Each recipe has its own page on the website, which contains a title with the name of the recipe, followed by a photograph of the dish and then all the ingredients needed for its preparation, each in its own small window.

We will retrieve all this information with the following loop. For each category of main courses on the webpage, it creates a dictionary entry that stores the title, link, and ingredients of every recipe.

The first for loop iterates over the categories of main courses listed above.


The second for loop iterates over the pages of the website that list the recipes. This is fairly straightforward, since the link to each page is simply the base link for Marmiton followed by the page number, for example https://www.marmiton.org/recettes/index/categorie/plat-principal/123 for page 123. The loop stops automatically as soon as a page contains no recipes; in the code below it is additionally limited to the first five pages of each category.


The third for loop records each recipe together with its ingredients. Each recipe is a link (`a`) with the class `recipe-card-link`, and its title is read from the level-4 heading with the class `recipe-card__title`. Its ingredients are read from the divisions (`div`) of the recipe page with the class `card-ingredient`, each of which displays one ingredient in a separate window.
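
The stopping rule of the second loop can be sketched in isolation. In this minimal example, `collect_cards`, `fetch_page`, `max_pages` and the `example.org` URLs are hypothetical names introduced for illustration: `fetch_page` stands in for the download-and-parse step, so the sketch runs without network access.

```python
def collect_cards(fetch_page, url_category, max_pages=None):
    """Collect recipe cards page by page until a page comes back empty.

    fetch_page is a hypothetical callable that takes a URL and returns the
    list of recipe cards found on that page (an empty list if there are none).
    """
    cards = []
    page = 1
    while max_pages is None or page <= max_pages:
        page_cards = fetch_page(url_category + str(page))
        if not page_cards:
            break  # no recipes on this page: we have gone past the last one
        cards.extend(page_cards)
        page += 1
    return cards


# Offline stand-in for the website: three pages of cards, then nothing
fake_site = {
    "https://example.org/viande/1": ["a", "b"],
    "https://example.org/viande/2": ["c"],
    "https://example.org/viande/3": ["d", "e"],
}


def fake_fetch(url):
    """Return the cards of a fake page, or an empty list for unknown pages."""
    return fake_site.get(url, [])


all_cards = collect_cards(fake_fetch, "https://example.org/viande/")
first_two_pages = collect_cards(fake_fetch, "https://example.org/viande/", max_pages=2)
print(all_cards)
print(first_two_pages)
```

Passing the fetching step as an argument keeps the stopping logic testable on fake data; an optional page limit plays the same role as a fixed range of page numbers.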


In [2]:
# Create a dictionary to store recipe titles, links, and ingredient names for each category
all_recipes_by_category = {}

# Iterate through categories
for category in categories:
    url_category = url_base + category + '/'
    
    # Create lists to store recipe titles, links, and ingredient names for the current category
    recipe_titles = []
    recipe_links = []
    ingredient_names_list = []  # New list to store ingredient names
    
    # Iterate through pages from 1 to 5
    for i in range(1, 6):
        url_page = url_category + str(i)
        response = requests.get(url_page)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all anchor tags with the class 'recipe-card-link'
        recipe_cards = soup.find_all('a', class_='recipe-card-link')

        # Check if any recipe cards are found
        if recipe_cards:
            for card in recipe_cards:
                # Within each anchor tag, find the h4 element with the class 'recipe-card__title'
                title_element = card.find('h4', class_='recipe-card__title')
                
                # Check that the title element exists before fetching the recipe page
                if title_element:
                    # Get the link (href) of the recipe and download its page
                    recipe_link = card['href']
                    response1 = requests.get(recipe_link)
                    recipe = BeautifulSoup(response1.text, 'html.parser')

                    recipe_titles.append(title_element.text.strip())
                    recipe_links.append(recipe_link)

                    # Extract ingredient names from the data-name attribute of card-ingredient elements
                    card_ingredient_elements = recipe.find_all('div', class_='card-ingredient')
                    ingredient_names = [element['data-name'] for element in card_ingredient_elements]
                    ingredient_names_list.append(ingredient_names)
        else:
            # If no recipe cards are found, stop iterating over the pages of this category
            break

    # Add the lists of recipe titles, links, and ingredient names to the dictionary with the category as the key
    all_recipes_by_category[category] = {'titles': recipe_titles, 'links': recipe_links, 'ingredients': ingredient_names_list}
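
The dictionary built above stores, for each category, three parallel lists (titles, links, ingredients). As a minimal sketch of how such a structure could be turned into one record per recipe, the example below uses a hand-written `sample` with the same shape; the sample recipes, the `example.org` links and the helper `flatten_recipes` are hypothetical.

```python
# Hand-written sample with the same shape as all_recipes_by_category
# (titles, links and ingredients are made up for illustration)
sample = {
    "viande": {
        "titles": ["Boeuf bourguignon"],
        "links": ["https://example.org/boeuf-bourguignon"],
        "ingredients": [["boeuf", "vin rouge", "carotte"]],
    },
    "plat-vegetarien": {
        "titles": ["Ratatouille", "Gratin de courgettes"],
        "links": ["https://example.org/ratatouille", "https://example.org/gratin"],
        "ingredients": [["aubergine", "courgette", "tomate"], ["courgette", "fromage"]],
    },
}


def flatten_recipes(recipes_by_category):
    """Turn the per-category parallel lists into one dictionary per recipe."""
    rows = []
    for category, content in recipes_by_category.items():
        triples = zip(content["titles"], content["links"], content["ingredients"])
        for title, link, ingredients in triples:
            rows.append({
                "category": category,
                "title": title,
                "link": link,
                "ingredients": ingredients,
            })
    return rows


rows = flatten_recipes(sample)
for row in rows:
    print(row["category"], "|", row["title"], "|", ", ".join(row["ingredients"]))
```

A list of records like `rows` can be passed directly to `pd.DataFrame(rows)` if a tabular view is preferred.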




In [3]:
card_ingredient_elements

[<div class="card-ingredient" data-brandid="163" data-brandname="Amazon" data-name="cognac" data-visibleonfirstdisplay="true">
 <span class="card-ingredient-link af-to-obfuscate-25614" data-encoded-link="aHR0cHM6Ly93d3cuYW1hem9uLmZyL0NvZ25hYy1Db3Vydm9pc2llci1WUy03MC1jbC9kcC9CMDAyOVpFS0pRP3RhZz1tdC1pLTI2MC0yMQ==" target="_blank"> <div class="card-ingredient-image">
 <img alt="" class="lazyload item__icon" data-src="https://assets.afcdn.com/recipe/20170607/67612_origin.jpg" height="150px" src="https://static.afcdn.com/relmrtn/lazyload.png" width="150px"/> </div>
 </span> <div class="card-ingredient-content">
 <div class="card-ingredient-checkbox">
 <input class="checkbox" id="check260" type="checkbox">
 <label for="check260"></label>
 </input></div>
 <span class="card-ingredient-link af-to-obfuscate-25614" data-encoded-link="aHR0cHM6Ly93d3cuYW1hem9uLmZyL0NvZ25hYy1Db3Vydm9pc2llci1WUy03MC1jbC9kcC9CMDAyOVpFS0pRP3RhZz1tdC1pLTI2MC0yMQ==" target="_blank"> <span class="card-ingredient-title">
 

## NB: If categories need to be looked up

In [None]:
url = "https://www.marmiton.org/recettes/index/categorie/plat-principal/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the unordered list with the class 'mrtn-tags-list'
tag_list = soup.find('ul', class_='mrtn-tags-list')

# Extract all the category names from the list items
categories = [li.a.text for li in tag_list.find_all('li', class_='mrtn-tag')]

# Print the list of category names
print(categories)


In [None]:
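# Build a mapping from each category to its list of recipe titles
all_recipe_titles_by_category = {cat: rec['titles'] for cat, rec in all_recipes_by_category.items()}
# Count how often each word appears in the titles of the 'viande' category
all_words = ' '.join(all_recipe_titles_by_category['viande'])
word_counts = Counter(all_words.split())
df = pd.DataFrame(list(word_counts.items()), columns=['Word', 'Occurrence'])
# Drop common French stop words, then sort by decreasing frequency
rm = ["de", "le", "la", "les", "et", "au", "aux", "à", ""]
df = df[~df['Word'].isin(rm)]
df = df.sort_values(by='Occurrence', ascending=False).reset_index(drop=True)
df[:100]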

In [None]:
def process_category(data, category_name):
    # Convert all words to lowercase
    data_lower = [word.lower() for word in data]

    # Flatten the list of strings into a single string
    all_words = ' '.join(data_lower)

    # Split the string into words and count occurrences
    word_counts = Counter(all_words.split())

    # Create a DataFrame from the Counter dictionary
    df = pd.DataFrame(list(word_counts.items()), columns=['Word', f'Occurrence_{category_name}'])

    # Remove specified words
    stop_words = ["de", "la", "et", "au", "aux", "à", "la", "en", "des"]
    df = df[~df['Word'].isin(stop_words)]

    # Sort the DataFrame by Occurrence in descending order
    df = df.sort_values(by=f'Occurrence_{category_name}', ascending=False)

    # Reinitialize the index
    df = df.reset_index(drop=True)

    return df

# Create and process DataFrames for each category
df_viande = process_category(all_recipe_titles_by_category['viande'], 'viande')
df_poisson = process_category(all_recipe_titles_by_category['poisson'], 'poisson')
df_vegetarien = process_category(all_recipe_titles_by_category['plat-vegetarien'], 'vegetarien')

# Find common words between the three DataFrames
common_words = set(df_viande['Word']) & set(df_poisson['Word']) & set(df_vegetarien['Word'])

# Remove common words from each DataFrame
df_viande = df_viande[~df_viande['Word'].isin(common_words)]
df_poisson = df_poisson[~df_poisson['Word'].isin(common_words)]
df_vegetarien = df_vegetarien[~df_vegetarien['Word'].isin(common_words)]

# Display the processed DataFrames
print("viande DataFrame:")
print(df_viande)

print("\npoisson DataFrame:")
print(df_poisson)

print("\nvegetarien DataFrame:")
print(df_vegetarien)

In [None]:
common_words

In [None]:
def process_category(data, category_name):
    # Convert all words to lowercase
    data_lower = [word.lower() for word in data]

    # Flatten the list of strings into a single string
    all_words = ' '.join(data_lower)

    # Split the string into words and count occurrences
    word_counts = Counter(all_words.split())

    # Create a DataFrame from the Counter dictionary
    df = pd.DataFrame(list(word_counts.items()), columns=['Word', f'Occurrence_{category_name}'])

    # Remove specified words
    stop_words = ["de", "la", "et", "au", "aux", "à", "la", "en", "des"]
    df = df[~df['Word'].isin(stop_words)]

    # Sort the DataFrame by Occurrence in descending order
    df = df.sort_values(by=f'Occurrence_{category_name}', ascending=False)

    # Reinitialize the index
    df = df.reset_index(drop=True)

    return df

# Create and process DataFrames for each category
df_viande = process_category(all_recipe_titles_by_category['viande'], 'viande')
df_poisson = process_category(all_recipe_titles_by_category['poisson'], 'poisson')
df_vegetarien = process_category(all_recipe_titles_by_category['plat-vegetarien'], 'vegetarien')

# Find common words between the three DataFrames
common_words = set(df_viande['Word']) & set(df_poisson['Word']) & set(df_vegetarien['Word'])

# Keep common words in the DataFrame with the highest occurrence
for word in common_words:
    # A common word appears exactly once in each DataFrame, so read its count as a scalar
    count_viande = df_viande.loc[df_viande['Word'] == word, 'Occurrence_viande'].iloc[0]
    count_poisson = df_poisson.loc[df_poisson['Word'] == word, 'Occurrence_poisson'].iloc[0]
    count_vegetarien = df_vegetarien.loc[df_vegetarien['Word'] == word, 'Occurrence_vegetarien'].iloc[0]

    max_occurrence = max(count_viande, count_poisson, count_vegetarien)

    # Drop the word from the two DataFrames where it is not the most frequent
    if count_viande == max_occurrence:
        df_poisson = df_poisson[df_poisson['Word'] != word]
        df_vegetarien = df_vegetarien[df_vegetarien['Word'] != word]
    elif count_poisson == max_occurrence:
        df_viande = df_viande[df_viande['Word'] != word]
        df_vegetarien = df_vegetarien[df_vegetarien['Word'] != word]
    else:
        df_viande = df_viande[df_viande['Word'] != word]
        df_poisson = df_poisson[df_poisson['Word'] != word]

# Display the processed DataFrames
print("viande DataFrame:")
print(df_viande)

print("\npoisson DataFrame:")
print(df_poisson)

print("\nvegetarien DataFrame:")
print(df_vegetarien)

In [None]:
from collections import Counter
import pandas as pd
import string

def process_category(data, category_name):
    # Convert all words to lowercase and remove punctuation
    translator = str.maketrans("", "", string.punctuation)
    data_lower = [word.translate(translator).lower() for word in data]

    # Flatten the list of strings into a single string
    all_words = ' '.join(data_lower)

    # Split the string into words and count occurrences
    word_counts = Counter(all_words.split())

    # Create a DataFrame from the Counter dictionary
    df = pd.DataFrame(list(word_counts.items()), columns=['Word', f'Occurrence_{category_name}'])

    # Remove specified words
    stop_words = ["de", "la", "et", "au", "aux", "à", "la", "en", "des"]
    df = df[~df['Word'].isin(stop_words)]

    # Remove words that occur only once
    df = df[df[f'Occurrence_{category_name}'] > 1]

    # Sort the DataFrame by Occurrence in descending order
    df = df.sort_values(by=f'Occurrence_{category_name}', ascending=False)

    # Reinitialize the index
    df = df.reset_index(drop=True)

    return df

# Create and process DataFrames for each category
df_viande = process_category(all_recipe_titles_by_category['viande'], 'viande')
df_poisson = process_category(all_recipe_titles_by_category['poisson'], 'poisson')
df_vegetarien = process_category(all_recipe_titles_by_category['plat-vegetarien'], 'vegetarien')

# Find common words between the three DataFrames
common_words = set(df_viande['Word']) & set(df_poisson['Word']) & set(df_vegetarien['Word'])

# Keep common words in the DataFrame with the highest occurrence
for word in common_words:
    # A common word appears exactly once in each DataFrame, so read its count as a scalar
    count_viande = df_viande.loc[df_viande['Word'] == word, 'Occurrence_viande'].iloc[0]
    count_poisson = df_poisson.loc[df_poisson['Word'] == word, 'Occurrence_poisson'].iloc[0]
    count_vegetarien = df_vegetarien.loc[df_vegetarien['Word'] == word, 'Occurrence_vegetarien'].iloc[0]

    max_occurrence = max(count_viande, count_poisson, count_vegetarien)

    # Drop the word from the two DataFrames where it is not the most frequent
    if count_viande == max_occurrence:
        df_poisson = df_poisson[df_poisson['Word'] != word]
        df_vegetarien = df_vegetarien[df_vegetarien['Word'] != word]
    elif count_poisson == max_occurrence:
        df_viande = df_viande[df_viande['Word'] != word]
        df_vegetarien = df_vegetarien[df_vegetarien['Word'] != word]
    else:
        df_viande = df_viande[df_viande['Word'] != word]
        df_poisson = df_poisson[df_poisson['Word'] != word]

# Display the processed DataFrames
print("viande DataFrame:")
print(df_viande)

print("\npoisson DataFrame:")
print(df_poisson)

print("\nvegetarien DataFrame:")
print(df_vegetarien)
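
The rule applied in the cells above (a word shared by all categories is kept only in the category where it occurs most often) can be checked on toy data. This is a minimal sketch using only `collections.Counter`; the toy titles and the helper `assign_common_words` are hypothetical, and stop words are deliberately left in to keep the example short.

```python
from collections import Counter


def assign_common_words(counts_by_category):
    """Keep each word shared by all categories only where it occurs most often.

    Ties go to the category listed first, as in an if/elif chain that tests
    the categories in order.
    """
    result = {cat: dict(counts) for cat, counts in counts_by_category.items()}
    common = set.intersection(*(set(c) for c in counts_by_category.values()))
    for word in common:
        # max() returns the first category that reaches the highest count
        best = max(counts_by_category, key=lambda cat: counts_by_category[cat][word])
        for cat in result:
            if cat != best:
                del result[cat][word]
    return result


# Toy recipe titles, made up for illustration
titles = {
    "viande": ["poulet au curry", "poulet rôti", "gratin de boeuf"],
    "poisson": ["saumon au curry", "gratin de saumon"],
    "vegetarien": ["curry de légumes", "gratin de courgettes", "gratin de légumes"],
}
counts = {cat: Counter(" ".join(t).lower().split()) for cat, t in titles.items()}
cleaned = assign_common_words(counts)

for cat, words in cleaned.items():
    print(cat, sorted(words))
```

Here "gratin" and "de" occur most often in the vegetarian titles, so they are kept only there, while "curry" is tied across the three categories and stays with the first one listed.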


### Paris Restaurants

In [None]:
url = 'https://bestrestaurantsparis.com/fr/restaurant-paris/41-penthievre-restaurant-paris.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
soup.title.string

In [None]:
dishes = soup.find_all('div', class_='restaurant-menu-item')
dishes

In [None]:
# Extract the dish names and prices
data = []
for dish in dishes:
    price = dish.find('div', class_='restaurant-menu-price').get_text(strip=True)
    description = dish.find('div', class_='restaurant-menu-desc').get_text(strip=True)
    data.append([price, description])

# Create a DataFrame
df = pd.DataFrame(data, columns=['Price', 'Description'])
df


In [None]:
data

In [None]:
data[1]