What qualities contribute to a good recipe?
First, we import the libraries needed for scraping, analysis, and plotting.

In [1]:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup  
import pandas as pd
import scipy as sc
import numpy as np
import requests
import re
import time
import csv
from mpl_toolkits.mplot3d import Axes3D
from sklearn.linear_model import LinearRegression

from numpy import average, shape

from matplotlib.widgets import Lasso
import matplotlib.pyplot as plt
from sklearn.linear_model import Perceptron
from sklearn.metrics  import r2_score
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

import seaborn as sns



#### Scraping ####

We start the scraping process by retrieving every recipe name and its link, and saving them to a file for future reference.

In [None]:
def get_full_page_thespruceeats():  # collects every recipe name and link and saves them to a CSV file
    # URL to scrape
    url = "https://www.thespruceeats.com/search?q=&searchType=recipe"

    # Configure the Selenium webdriver
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # Run in headless mode (no GUI)
    driver = webdriver.Chrome(options=options)
    driver.get(url)

    # Wait for the page to load
    wait = WebDriverWait(driver, 10)

    # Get the page source and parse it with BeautifulSoup
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, "html.parser")

    results_div = soup.find("div", attrs={"class": "results-list__container"})
    recipe_names = []
    recipe_links = []

    # Scrape the first page
    for li in results_div.find_all("li", class_="results__item"):
        if li.find("a") is not None:
            link = li.find("a").get("href")
        else:
            link = ''

        if li.find("h4", class_="card__title") is not None:
            name = li.find("h4", class_="card__title").text.strip()
        else:
            name = ''
        
        recipe_names.append(name)
        recipe_links.append(link)

    # Scrape subsequent pages if the "Next" button exists
    while True:
        try:
            next_button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".pagination__item-link--next")))
            next_button.click()
            time.sleep(5)

            # Get the page source and parse it with BeautifulSoup
            page_source = driver.page_source
            soup = BeautifulSoup(page_source, "html.parser")

            results_div = soup.find("div", attrs={"class": "results-list__container"})

            # Scrape recipe names and links
            for li in results_div.find_all("li", class_="results__item"):
                if li.find("a") is not None:
                    link = li.find("a").get("href")
                else:
                    link = ''

                if li.find("h4", class_="card__title") is not None:
                    name = li.find("h4", class_="card__title").text.strip()
                else:
                    name = ''
                print(f"{name} added")
                recipe_names.append(name)
                recipe_links.append(link)

        except Exception:
            # No clickable "Next" button (or the page failed to load): stop paginating
            break

    # Create a DataFrame with the recipe names and links
    df = pd.DataFrame({"Recipe_name": recipe_names, "Recipe_link": recipe_links})

    # Write DataFrame to a CSV file
    df.to_csv("Recipe_Links_and_Names.csv", index=False)

    # Close the driver
    driver.quit()

    print("Done!")

Let's execute this function! By saving the retrieved data to a CSV file, we can expedite future work, reducing our dependency on the web driver and safeguarding against potential internet interruptions during subsequent tests.
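The caching idea described above can be sketched in a few lines. This is a minimal illustration, not part of the original notebook: `load_or_scrape` and its `scrape_fn` argument are hypothetical names, and the sketch assumes the scraper returns a list of dictionaries with the keys `Recipe_name` and `Recipe_link`.

```python
import csv
from pathlib import Path


def load_or_scrape(csv_path, scrape_fn):
    """Return the rows stored in csv_path, calling scrape_fn only if the file is missing."""
    path = Path(csv_path)
    if not path.exists():
        rows = scrape_fn()  # expensive step: runs at most once per file
        with path.open("w", encoding="utf-8", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=["Recipe_name", "Recipe_link"])
            writer.writeheader()
            writer.writerows(rows)
    # Always read back from disk so both code paths return the same shape
    with path.open(encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))
```

With this pattern, later runs read the saved file and never start the web driver.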

In [None]:
get_full_page_thespruceeats()

Pepper Steak Stir-Fry Recipe added
Grilled Salmon Burgers With Radicchio Slaw and Sambal Mayonnaise added
Kentucky Buck Cocktail Recipe added
Sparkling Borage Cocktail added
Lunch Box-Worthy Falafel Kebabs added
Chipotle Pumpkin Queso Dip added
Flamin' Hot Cheetos Mac and Cheese Bites added
Cherry Vinegar Recipe added
Turkish Ramadan Flat Bread (Pide) added
Carrot, Cabbage, and Kohlrabi Slaw With Miso Dressing Recipe added
Esquites Recipe (Mexican Corn Off the Cob) added
Gillie Fix Cocktail Recipe added
Raspberry Snakebite Recipe added
Stuffed Italian Sourdough Loaf Recipe added
Strawberry Chicken Salad With Champagne Vinaigrette Recipe added
Lasagne All'astice (Lobster Lasagna) Recipe added
Pear and Pomegranate Champagne Shrub Recipe added
Homemade Smoked Maple Bacon added
Baked S’mores Skillet Dip Recipe added
Garlic Chicken Primavera Pasta added
Cold Soba Noodle Salad Recipe added
Vegetarian Tofu Tacos added
Bay Hill Hummer added
A Recipe for Risotto Made With Amarone Wine (Risotto 

Now that we have obtained the recipe links, our next step is to scrape each individual recipe. To streamline the process, we will break it down into manageable steps. The first step involves creating a "soup" object, which will allow us to parse the HTML content of each recipe page.

In [None]:
data = pd.read_csv("Recipe_Links_and_Names.csv")
data.head()


In [None]:
def load_soup_object(url):
    # Download the page at `url` and parse it into a BeautifulSoup object
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    return soup

Next, we will proceed to gather information in the order it is presented on our site. To ensure efficiency, we will encapsulate the gathering process within functions as we will be looping over them shortly. Our initial function will collect details such as cook time, prep time, total cooking time, and the number of servings.

Extracting this information can be challenging because the formatting varies across recipes. Time values appear in many forms, such as "25 min," "2h 15 min," or "2 hours 10 min." To handle these formats, we use a regular expression with optional hour and minute groups.
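To make the idea concrete, here is a small standalone sketch of converting such strings to minutes. It is an illustration only: `to_minutes` is a hypothetical helper, and its pattern is a simplified variant that also accepts a bare "h", which is an assumption and not the exact expression used in the function below.

```python
import re

# Optional hours part followed by an optional minutes part
_TIME_RE = re.compile(
    r"(?:(\d+)\s*(?:hours?|hrs?|h))?\s*(?:(\d+)\s*(?:minutes?|mins?|m))?",
    re.IGNORECASE,
)


def to_minutes(text):
    """Convert a duration such as '2h 15 min' to minutes, or return None if unparseable."""
    match = _TIME_RE.fullmatch(text.strip())
    if not match or not any(match.groups()):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


for sample in ["25 min", "2h 15 min", "2 hours 10 min"]:
    print(sample, "->", to_minutes(sample))
```

Returning `None` for unparseable text keeps bad values visible instead of silently recording them as 0 minutes.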

In [None]:
def get_cook_times(soup_obj):
    lines=[]
    results_items = soup_obj
    results_items = soup_obj.find_all(class_='comp article__decision-block mntl-block')
    if(results_items==[]):
        soup = soup_obj
        results_items = soup.find_all(class_='comp project-meta')        

    for item in results_items:
        # Collect the non-empty text of each direct child of the block
        for sub_item in item:
            text = sub_item.text.strip()
            if text:
                lines.append(text.replace('\n', ''))

    if(len(lines)>1):
        new_string = lines[0] + lines[1]
        lines[0]= new_string

    # Use regular expressions to pull the times out of the combined line, which looks like:
    # ['Prep: 15 minsCook: 20 minsTotal: 35 minsServings: 6 servingsYield: 1 cake', 'ratingsAdd a comment']
    cook_time_str = re.findall(r'Cook:\s*(?:(\d+)\s*(?:hrs?|hours?)\s*)?(?:(\d+)\s*mins?)?', lines[0])[0]
    prep_time_str = re.findall(r'Prep:\s*(?:(\d+)\s*(?:hrs?|hours?)\s*)?(?:(\d+)\s*mins?)?', lines[0])[0]
    total_time_str = re.findall(r'Total:\s*(?:(\d+)\s*(?:hrs?|hours?)\s*)?(?:(\d+)\s*mins?)?', lines[0])[0]

    # Convert each (hours, minutes) pair to minutes
    hours = int(cook_time_str[0]) if cook_time_str[0] else 0
    minutes = int(cook_time_str[1]) if cook_time_str[1] else 0
    cook_time_minutes = hours * 60 + minutes

    hours = int(prep_time_str[0]) if prep_time_str[0] else 0
    minutes = int(prep_time_str[1]) if prep_time_str[1] else 0
    prep_time_minutes = hours * 60 + minutes


    hours = int(total_time_str[0]) if total_time_str[0] else 0
    minutes = int(total_time_str[1]) if total_time_str[1] else 0
    total_minutes = hours * 60 + minutes
    # Default to 1 serving when none is listed or the number cannot be parsed.
    # A range such as "6 to 8 servings" is recorded as its first number (6).
    servings = 1
    match = re.search(r'serv\w*:\D*(\d+)', lines[0], re.IGNORECASE)
    if match:
        servings = int(match.group(1))

    

    # Create a pandas DataFrame from the extracted data
    df = pd.DataFrame({
        'Prep': [prep_time_minutes],
        'Cook': [cook_time_minutes],
        'Total': [total_minutes],
        'Servings': [servings]
    })
    df = df.astype(int)
    return df

Moving on to the second block of information, we will now focus on extracting the dish rating. This part posed its own challenges since the rating is displayed in the form of stars, with increments of 0.5. Instead of a numerical value, we need to determine the count of full stars and half stars to accurately represent the rating.
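The counting idea can be tried on a hand-written snippet first. The markup below is a made-up stand-in for the real page (only the `active` and `half` class names come from our scraper), and `rating_from_markup` is a hypothetical helper used for illustration.

```python
def rating_from_markup(html):
    """Turn star markup into a number: full stars count 1, half stars count 0.5."""
    full_stars = html.count('class="active"')
    half_stars = html.count('class="half"')
    return full_stars + 0.5 * half_stars


# Three full stars, one half star, and one empty star
sample = (
    '<a class="active"></a><a class="active"></a><a class="active"></a>'
    '<a class="half"></a><a></a>'
)
print(rating_from_markup(sample))  # prints 3.5
```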

In [None]:
def get_stars(soup_obj):    
    soup = soup_obj
    results_items = soup.find_all(class_='comp js-feedback-trigger aggregate-star-rating mntl-block')    
    #print(results_items.prettify())
    for item in results_items:##result items size is 1
        text=item.prettify()
        full_stars=text.count('class="active"')
        half_stars=text.count('class="half"')
        return(full_stars+0.5*half_stars)

Next comes the rating count, which is straightforward: we read the number of ratings shown for each recipe.

In [None]:
def get_rating_count(soup_obj):
    soup = soup_obj
    rating_elements = soup.find_all("div", attrs={'class': "comp aggregate-star-rating__count mntl-aggregate-rating mntl-text-block"})
    for rating_element in rating_elements:
        rating_text = rating_element.text.strip()
        try:
            num_ratings = int(rating_text.split()[0])
            return num_ratings
        except ValueError:
            pass
    return 0

Moving on to the third block of information, we will now focus on extracting the nutritional values. However, we encountered a particular issue in this step. Alcoholic beverages listed on the site do not provide nutritional values, and unfortunately, there were numerous such recipes.
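One way to handle recipes without nutritional values is to start from zeros and overwrite only what the page provides. The sketch below is a simplified, dictionary-based illustration and not the DataFrame code that follows; `nutrition_to_dict` is a hypothetical helper, and the `"<value>\n<nutrient>"` row format is an assumption based on how the scraped text is split.

```python
def nutrition_to_dict(rows):
    """Parse rows such as '134g\\nFat' into integers, defaulting every nutrient to 0."""
    values = {"Calories": 0, "Fat": 0, "Carbs": 0, "Protein": 0}
    for row in rows:
        amount, _, nutrient = row.strip().partition("\n")
        nutrient = nutrient.strip()
        # Strip the thousands separator and the gram suffix before converting
        digits = amount.replace(",", "").replace("g", "").strip()
        if nutrient in values and digits.isdigit():
            values[nutrient] = int(digits)
    return values


print(nutrition_to_dict(["934\nCalories", "1,134g\nFat", "99g\nCarbs", "12g\nProtein"]))
print(nutrition_to_dict([]))  # a cocktail page with no nutrition table
```

Keeping the zero defaults in one place means a cocktail and a cake produce rows with the same columns.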

In [None]:
def get_nutritional_values(soup_obj):
    soup = soup_obj
    results_items = soup.find_all(class_='nutrition-info__table--row')

    nutritional_vals=[]    


    for item in results_items:
        nutritional_vals.append(item.text.strip())
    new_list = []
    for s in nutritional_vals:
        # Each row's text is split on '\n' into the value (first) and the nutrient name (second)
        parts = s.split('\n')
        # Store the nutrient name first, then its value: ['Calories', '934', 'Fat', '134g', ...]
        new_list.extend([parts[1], parts[0]])
    df = pd.DataFrame({'nutrient': new_list[::2], 'value': new_list[1::2]})

    # Set 'nutrient' column as index and transpose DataFrame
    df = df.set_index('nutrient').T
    if df.empty:  # recipes such as cocktails list no nutritional values

        df = pd.DataFrame(columns=['nutrient', 'Calories', 'Fat', 'Carbs', 'Protein'])

        # add a row filled with zeros
        df.loc[0] = ['value', 0, '0g', '0g', '0g']
        df['Calories'] = df['Calories'].astype(np.int32)
        return df 

    df['Calories'] = df['Calories'].astype(int)

    return df

Moving on to the fourth block of relevant information, we will now focus on extracting the list of ingredients. Unfortunately, this step proved to be quite challenging. The class names for some recipes were inconsistent and changed over time, leading to numerous hotfixes and extensive debugging. To address this issue, we introduced a condition to check whether it is the old variant of the site or the new one. In some cases, we were able to filter the ingredients from the beginning if it was the new variant, simplifying the process to some extent.
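The "try the new layout, then fall back to the old one" logic can be expressed as a small generic helper. This is a sketch under assumptions: `first_non_empty` is a hypothetical name, and in our scraper each extractor would wrap one `soup.find_all(...)` lookup.

```python
def first_non_empty(extractors):
    """Call each zero-argument extractor in order and return the first non-empty result."""
    for extract in extractors:
        result = extract()
        if result:
            return result
    return []


# Stand-ins for three page layouts: the first two find nothing, the third succeeds
new_layout = lambda: []
old_layout = lambda: []
oldest_layout = lambda: ["2 cups flour", "1 egg"]

print(first_non_empty([new_layout, old_layout, oldest_layout]))
```

Adding support for another site variant then means appending one more extractor to the list.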

In [None]:
def get_ingridients(soup_obj):
    cond=0
    soup = soup_obj
    span_elements = soup.find_all('span', {'data-ingredient-name': 'true'})

    # create an empty list to store the ingredient names
    ingredient_names = []

    # loop over the span elements and extract their text content
    for span in span_elements:
        ingredient_names.append(span.text)
    #print(ingredient_names)
    if(len(ingredient_names)>0):
        return(ingredient_names)
    else:
        cond=0
        soup = soup_obj
        results_items = soup.find_all(class_='structured-ingredients__list text-passage')
                                            #comp ingredient-list simple-list simple-list--bulleted  
        #print(results_items)                                           
        if(results_items==[]): #sometimes they like to change the class name
            soup = soup_obj
            results_items = soup.find_all(class_='simple-list__item js-checkbox-trigger ingredient text-passage')
            cond=1
        nutritional_vals=[]
        if(results_items==[]):
            return []
        
        final_lst=[]

        for item in results_items:    
            nutritional_vals.append(item.text.strip())
            #print(item.text.strip())
        if(cond==1):
            return nutritional_vals
        else:        
            for i in nutritional_vals:
                my_list = [s.strip() for s in i.split('\n\n\n')]
                final_lst.extend(my_list)
                #print(final_lst)
            
            return(final_lst)

We also wanted to derive extra information from the ingredients. For each recipe we check whether its ingredients are dairy, meat, both, or neither (parve, stored in the 'Fur' column). To build the keyword lists of common meat and dairy products, we searched online and took the first few results.
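Plain substring matching has a known weakness: a short keyword such as "ham" also matches "graham". A possible refinement, shown here as a sketch and not as the approach used in this notebook, is to match whole words with a regular expression. The shortened keyword lists and the names `classify`, `DAIRY_RE`, and `MEAT_RE` are assumptions made for the example.

```python
import re

DAIRY = ["milk", "cheese", "butter", "cream"]
MEAT = ["beef", "chicken", "ham", "bacon"]


def _matcher(keywords):
    # \b anchors the keyword at word boundaries, so "ham" no longer matches "graham"
    return re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE)


DAIRY_RE = _matcher(DAIRY)
MEAT_RE = _matcher(MEAT)


def classify(ingredients):
    """Return 0/1 flags for dairy, meat, and parve ('Fur') for one recipe."""
    dairy = any(DAIRY_RE.search(item) for item in ingredients)
    meat = any(MEAT_RE.search(item) for item in ingredients)
    return {"Dairy": int(dairy), "Meat": int(meat), "Fur": int(not dairy and not meat)}


print(classify(["graham crackers", "sugar"]))
print(classify(["2 cups milk", "1 lb chicken breast"]))
```

The trade-off is that plurals such as "cheeses" need their own keyword, and compound names such as "peanut butter" still count as dairy.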

In [None]:
def analyze_recipe(ingredients):
    dairy_keywords = ["milk", "cheese", "yogurt", "cream", "butter", "whey", "casein", "curds"]
    meat_keywords = ["beef", "chicken", "pork", "lamb", "turkey", "venison", "duck", "bacon", "sausage",
                     "ham", "prosciutto", "pepperoni", "salami", "chorizo", "bresaola", "pastrami",
                     "corned beef", "veal", "goose", "game", "elk", "bison", "rabbit", "boar", "guinea fowl", "quail"]
    
    categories = {'Dairy': 0, 'Meat': 0, 'Fur': 1}
    
    for ingredient in ingredients:
        ingredient = ingredient.lower()
        # Check both keyword lists independently so one ingredient can set both flags
        if any(keyword in ingredient for keyword in dairy_keywords):
            categories['Dairy'] = 1
            categories['Fur'] = 0
        if any(keyword in ingredient for keyword in meat_keywords):
            categories['Meat'] = 1
            categories['Fur'] = 0
            
    df = pd.DataFrame(categories, index=[0])
    return df

Now, it's time to merge all the gathered data into a single row of a dataframe. This step will involve looping over the extracted information for each recipe and consolidating it into a unified format. By doing so, we will have a comprehensive dataframe that encapsulates all the relevant details for further analysis.

In [None]:
def merge_fast(url,recepie_name):
    df = pd.DataFrame(columns=['Name','Prep', 'Cook', 'Total', 'Servings', 'Rating','Rating_Count','Dairy','Meat','Fur', 'Calories', 'Fat', 'Carbs', 'Protein', 'Ingredients'])
    #Dairy
    #Meat
    #Fur
    soup_obj= load_soup_object(url)

    recipe_df = get_cook_times(soup_obj)

    ratings_list = get_stars(soup_obj)

    nutrition_df = get_nutritional_values(soup_obj)

    ingredients = get_ingridients(soup_obj)

    rating_num = get_rating_count(soup_obj)

    meatdairy_df = analyze_recipe(ingredients)

    if(ingredients==[]):
        ingredients=['','','','']
    new_row = {
        'Name':recepie_name,
        'Prep': recipe_df['Prep'][0],
        'Cook': recipe_df['Cook'][0],
        'Total': recipe_df['Total'][0],
        'Servings': recipe_df['Servings'][0],
        'Rating': ratings_list,
        'Rating_Count':rating_num,
        'Dairy':meatdairy_df['Dairy'][0],
        'Meat':meatdairy_df['Meat'][0],
        'Fur':meatdairy_df['Fur'][0],
        'Calories': nutrition_df['Calories'][0],
        'Fat': nutrition_df['Fat'][0],
        'Carbs': nutrition_df['Carbs'][0],        
        'Protein': nutrition_df['Protein'][0],
        'Ingredients': [ingredients]
    }
    #print(new_row)

    # add the new row to the DataFrame
    df = pd.concat([df, pd.DataFrame(new_row)], ignore_index=True)
    #df_concat = pd.concat([df1, df2], keys=['df2'])
    df['Prep'] = df['Prep'].astype(float)
    df['Cook'] = df['Cook'].astype(float)
    df['Total'] = df['Total'].astype(float)
    df['Servings'] = df['Servings'].astype(float)
    df['Rating'] = df['Rating'    ].astype(float)
    df['Calories'] = df['Calories'].astype(float)

    # display the updated DataFrame
    #print("another line added")
    return(df)

Now, we will implement another function that performs the above-mentioned process for each recipe found thus far. As this operation can take some time, we aim to optimize future work by saving the results to a file. This way, we can have the data readily accessible for further analysis.

In [None]:
def fast_scrape(csv_file_name):
    recipe_names = []
    recipe_links = []
    count=0

    with open(csv_file_name, encoding='utf-8', newline='') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # Skip the first row
        for row in reader:
            recipe = ','.join(row).strip().replace('\x9c', '')
            last_comma = recipe.rfind(',')
            if last_comma != -1:
                recipe_name = recipe[:last_comma].strip()
                recipe_link = recipe[last_comma + 1:].strip()
                recipe_names.append(recipe_name)
                recipe_links.append(recipe_link)
            else:
                print(f"Invalid row: {row}")

    final_df=pd.DataFrame()
    for name, link in zip(recipe_names, recipe_links):
        # print(f"Recipe name: {name}")
        # print(f"Recipe link: {link}")
        temp_df=merge_fast(link,name)
        temp_df.set_index('Name', inplace=True)  # set the index of temp_df to the recipe name
        if(final_df.empty):
            final_df=temp_df
        else:
            final_df = pd.concat([final_df,temp_df])
        print(final_df)

    #final_df.to_pickle('dataframe.pkl')
    #final_df.to_pickle('dataframe.pkl', protocol=4, encoding='utf-8')
    final_df.to_csv('my_data.csv', index=True, encoding='utf-8')
    print(final_df)
    return final_df


In [None]:
data = fast_scrape('Recipe_Links_and_Names.csv')
data.head()
print("Describe the DataFrame:\n")
data.describe()
df = data
data['Fat'] = data['Fat'].str.replace('g', '').str.replace(',', '').astype(int)
data['Carbs'] = data['Carbs'].str.replace('g', '').str.replace(',', '').astype(int)
data['Protein'] = data['Protein'].str.replace('g', '').str.replace(',', '').astype(int)

MissingSchema: Invalid URL 'Recipe_link': No scheme supplied. Perhaps you meant http://Recipe_link?
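The `MissingSchema` error above shows that the literal text `Recipe_link` reached `requests.get`, so at least one row did not contain a real URL (a repeated header row is one plausible cause). A defensive way to read the file, sketched here with the standard library only, is to keep just the rows whose link has an `http` or `https` scheme. `read_recipe_links` is a hypothetical helper and the sample data is invented.

```python
import csv
import io
from urllib.parse import urlparse


def read_recipe_links(csv_text):
    """Return (name, link) pairs, skipping rows whose link is not an http(s) URL."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        link = (row.get("Recipe_link") or "").strip()
        parsed = urlparse(link)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            rows.append((row["Recipe_name"].strip(), link))
    return rows


sample_csv = (
    "Recipe_name,Recipe_link\n"
    '"Carrot, Cabbage, and Kohlrabi Slaw",https://example.com/slaw\n'
    "Recipe_name,Recipe_link\n"
    "Bay Hill Hummer,\n"
)
print(read_recipe_links(sample_csv))
```

`csv.DictReader` also handles recipe names that contain commas, so no manual splitting on the last comma is needed.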

#### First Data Analysis ####

Now that we have our dataframe, let's explore and visualize the data to gain insights. To facilitate this process, we will create functions for different types of plots. Specifically, we will implement functions for scatter plots, histograms, and single-column pie charts. These functions will allow us to effectively visualize and analyze the data. Later, we can focus on cleaning and refining the visualizations for a more polished presentation :)

In [None]:
def draw_scatter_2_params(df, col_name_1,col_name_2):
    df.plot.scatter(x=col_name_1, y=col_name_2)

    plt.show()

In [None]:
def draw_histo_1_params(df, col_name):
    # read in your dataframe from a csv file
    # choose the column you want to use for the histogram

    # sort the column values into bins
    bin_values, bin_edges = np.histogram(df[col_name], bins='auto')

    # create the histogram using the sorted bins
    plt.hist(df[col_name], bins=bin_edges)

    # add labels and title to the histogram
    plt.xlabel('Values')
    plt.ylabel('Frequency')
    plt.title('Histogram of ' + col_name)

    # display the histogram
    plt.show()

In [None]:
def draw_pie_1_params(df,col_name):
    #df = pd.read_csv('my_data.csv')
    # choose the column you want to use for the pie chart    

    # get the count of unique values in the column
    value_counts = df[col_name].value_counts()

    # create the pie chart
    plt.pie(value_counts.values, labels=value_counts.index, autopct='%1.1f%%')

    # add title to the pie chart
    plt.title('Pie Chart of ' + col_name)

    # display the pie chart
    plt.show()

We also want a pie chart of the ingredient categories: the proportion of recipes that contain meat, dairy, both meat and dairy, or neither (parve, the 'Fur' column). This chart gives a clear overview of the composition of the recipes analyzed.

In [None]:
def draw_pie_meat_dairy_fur(df):
    # Count each recipe in exactly one category so the slices do not overlap
    dairy_meat_count = len(df[(df['Dairy'] == 1) & (df['Meat'] == 1)])
    meat_count = len(df[(df['Meat'] == 1) & (df['Dairy'] == 0)])
    dairy_count = len(df[(df['Dairy'] == 1) & (df['Meat'] == 0)])
    fur_count = len(df[df['Fur'] == 1])

    # Create a list of category counts and labels
    counts = [meat_count, dairy_count, fur_count, dairy_meat_count]
    labels = ['Meat', 'Dairy', 'Fur', 'Dairy&Meat']

    # Create the pie chart
    plt.pie(counts, labels=labels, autopct='%1.1f%%')

    # Add a title to the chart
    plt.title('Recipe Categories')

    # Show the chart
    plt.show()

Great! We can now proceed to execute the functions for the different types of plots to visualize the data.

In [None]:
def draw_all_histo(df):
    numeric_cols = df.select_dtypes(include=['int', 'float']).columns.to_list()
    # print(df.describe())
    for col in numeric_cols:
        draw_histo_1_params(df,col)

In [None]:
draw_all_histo(df)

The histogram of calories peaks at around 180 calories, which is also the average across approximately 700 recipes. A large share of the recipes falls between 0 and 500 calories.


The histogram of carbohydrates shows a similar shape: the average recipe contains between 0 and 50 grams of carbohydrates.


The histogram of the number of ingredients is roughly bell-shaped. The average recipe contains around 10 ingredients, and most recipes use between 3 and 25, so most can be prepared with a moderate number of ingredients. A few outliers with more than 27 ingredients suggest that some recipes are more complex and need a wider variety of components.


In [None]:
draw_scatter_2_params(df, 'Cook','Total')

The scatter plot of Cook time against Total time shows a roughly linear relationship: recipes with a longer total time tend to have a correspondingly longer cook time.

To address the presence of outliers in our data, we will employ a data cleaning approach using the IQR (Interquartile Range) method. By applying this method to the entire dataframe, we can systematically identify and handle outliers across multiple columns. Additionally, we can perform careful adjustments to certain columns, removing unnecessary letters or characters, to convert them into numeric columns (e.g., fat, carbs). This process will help us clean and prepare the data for further analysis, ensuring more accurate and reliable results.

#### Cleaning ####

In [None]:
def clean(df,column_name):
    
    print(df[column_name].describe())
    # Calculate the IQR of the selected column
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)
    IQR = Q3 - Q1

    # The lower bound uses the usual 1.5 * IQR; the upper bound is deliberately wider (10 * IQR)
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 10 * IQR

    # Replace any values outside of the bounds with NaN, then drop the rows missing this column
    df.loc[(df[column_name] < lower_bound) | (df[column_name] > upper_bound), column_name] = float('NaN')
    df.dropna(subset=[column_name], inplace=True)
    print(df[column_name].describe())
    return df

In our data cleaning process, we keep values between Q1 - 1.5 * IQR and Q3 + 10 * IQR, where IQR is the interquartile range (Q3 - Q1). The upper bound is deliberately loose so that legitimate recipes with exceptionally long cooking times, such as those exceeding 8 hours, stay in the analysis. These bounds remove extreme outliers while preserving the diversity of cooking times present in the recipes.
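A small worked example shows what the asymmetric bounds do. The sketch below uses only the standard library and made-up cooking times; `quantile` and `iqr_bounds` are hypothetical helpers, and `quantile` applies the linear-interpolation rule that pandas' `quantile` uses by default.

```python
def quantile(values, q):
    """Quantile with linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def iqr_bounds(values, lower_factor=1.5, upper_factor=10):
    """Return (lower_bound, upper_bound) using separate IQR multipliers per side."""
    q1 = quantile(values, 0.25)
    q3 = quantile(values, 0.75)
    iqr = q3 - q1
    return q1 - lower_factor * iqr, q3 + upper_factor * iqr


times = [10, 15, 20, 25, 30, 35, 40, 200, 600]  # minutes, invented for the example
low, high = iqr_bounds(times)
kept = [t for t in times if low <= t <= high]
print(low, high)  # -10.0 240.0
print(kept)       # the 200-minute recipe stays, the 600-minute one is removed
```

With the conventional 1.5 multiplier on both sides the upper bound would be 70, so the 200-minute recipe would be discarded as well.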

In [None]:
def clean_df(df):

    
    #print(df)
    numeric_cols = df.select_dtypes(include=['int', 'float']).columns
    #print(numeric_cols)
    # # loop over the selected columns
    for col in numeric_cols:
        clean(df,col)

    return df

In [None]:
df = clean_df(df)
df.head()

After conducting a detailed examination of the data frame, we have identified that certain rows contain incorrect values for "Prep," "Cook," and "Total" time. To rectify this issue, we will implement a simple function to clean the time values for each row. The function will attempt to correct the time values to their appropriate formats whenever possible. However, if a row's time values cannot be corrected, it will be removed from the data frame to ensure data accuracy and consistency. This approach will help us maintain a reliable and clean dataset for further analysis.
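The correction rule is easiest to see on single rows. The following sketch mirrors the logic of the function below on plain numbers; `reconcile_times` is a hypothetical helper written for illustration.

```python
def reconcile_times(prep, cook, total):
    """Make prep + cook equal total when possible; return None if the row should be dropped."""
    if prep + cook != total:
        if prep != 0:
            cook = total - prep      # trust Prep and Total, recompute Cook
        elif cook != 0:
            prep = total - cook      # trust Cook and Total, recompute Prep
        else:
            return None              # both are 0: nothing to rebuild from
    if prep < 0 or cook < 0 or total < 0:
        return None                  # a negative duration means the row is inconsistent
    return prep, cook, total


print(reconcile_times(15, 20, 35))  # already consistent
print(reconcile_times(15, 0, 35))   # Cook is rebuilt as 20
print(reconcile_times(50, 10, 35))  # Cook would be negative, so the row is dropped
```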

In [None]:
def clean_data_time(df):
    #df = pd.read_csv('clean_df.csv')

    # If Prep + Cook != Total, recompute Cook (or Prep) from Total; drop the row if both are 0
    for index, row in df.iterrows():
        if row['Prep'] + row['Cook'] != row['Total']:
            if row['Prep'] != 0:
                df.at[index, 'Cook'] = df.at[index, 'Total'] - df.at[index, 'Prep']
            elif row['Cook'] != 0:
                df.at[index, 'Prep'] = df.at[index, 'Total'] - df.at[index, 'Cook']
            else:
                df.drop(index, inplace=True)

    # Fill missing Total values with Prep + Cook
    df['Total'] = df['Total'].fillna(df['Prep'] + df['Cook'])

    # Fill missing Prep or Cook values if only one is missing
    for index, row in df.iterrows():
        if pd.isna(row['Prep']):
            if pd.notna(row['Cook']) and pd.notna(row['Total']):
                df.at[index, 'Prep'] = df.at[index, 'Total'] - df.at[index, 'Cook']
            else:
                df.drop(index, inplace=True)
        elif pd.isna(row['Cook']):
            if pd.notna(row['Prep']) and pd.notna(row['Total']):
                df.at[index, 'Cook'] = df.at[index, 'Total'] - df.at[index, 'Prep']
            else:
                df.drop(index, inplace=True)
        elif pd.isna(row['Total']):
            if pd.notna(row['Prep']) and pd.notna(row['Cook']):
                df.at[index, 'Total'] = df.at[index, 'Prep'] + df.at[index, 'Cook']
            else:
                df.drop(index, inplace=True)

    df = df[(df['Prep'] >= 0) & (df['Cook'] >= 0) & (df['Total'] >= 0)]

    return df

In [None]:
df = clean_data_time(df)
df.head()

With a clean and corrected data frame, we can now re-evaluate the results and draw meaningful conclusions from the refined data.

In [None]:
draw_all_histo(df)

In [None]:
draw_pie_meat_dairy_fur(df)

After analyzing the distribution of ingredients in the cleaned data frame, we can now observe the proportion of recipes falling into different categories: 'Meat', 'Dairy', 'Fur' (parve), and 'Dairy&Meat' (a combination of dairy and meat).

In [None]:
def scatter_3d(df):

    ax = plt.axes(projection='3d')

    xdata = df['Rating']
    ydata = df['Calories']
    zdata = df['Total']

    ax.set_xlabel('Rating')
    ax.set_ylabel('Calories')
    ax.set_zlabel('Total time')

    ax.scatter3D(xdata, ydata, zdata, c=zdata, depthshade=False)
    plt.show()

scatter_3d(df)

Based on the visualization of recipe ratings, we observe that the ratings range from 0 to 5 stars, with increments of 0.5. Additionally, we can examine the distribution of recipes based on their rating, total cooking time, and calorie content. These visualizations help us understand the distribution and relationships between these parameters in our data frame.


#### Finding Correlations :) ####

Moving forward, to determine the correlations between different parameters in our data frame, we will utilize the following approach:

In [None]:
def get_correlation(df, col1, col2):
    return df[col1].corr(df[col2])
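By default, `Series.corr` computes the Pearson correlation coefficient. For readers who want to see what that number means, here is a from-scratch sketch using only the standard library; `pearson` is a hypothetical helper and the sample times are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


cook = [10, 20, 30, 40]
total = [25, 45, 65, 85]  # exactly 2 * cook + 5, so the correlation is 1 up to rounding
print(pearson(cook, total))
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate none.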

Let's run this on every pair of numeric columns.

In [None]:
def get_corr_all_columns(df):
    # Work on the numeric columns of the DataFrame that was passed in,
    # ignoring any leftover CSV index column such as 'Unnamed: 0'
    numeric_cols = [col for col in df.select_dtypes(include=['int', 'float']).columns
                    if not str(col).startswith('Unnamed')]

    correlation = {}
    for i, col1 in enumerate(numeric_cols):
        for col2 in numeric_cols[i + 1:]:  # visit each pair of columns once
            corr_value = get_correlation(df, col1, col2)
            if abs(corr_value) > 0.5:
                correlation[(col1, col2)] = corr_value

    # Sort the correlations by their absolute value, in descending order
    sorted_correlations = sorted(correlation.items(), key=lambda x: abs(x[1]), reverse=True)

    # Print the correlations above 0.5 or below -0.5
    for (col1, col2), value in sorted_correlations:
        print(f"{col1} and {col2}: {value}")

To enhance the visualization and facilitate better analysis, we will create two heatmaps based on the correlation matrix:

Heatmap with all data: This heatmap will display the correlations between all pairs of columns in the data frame. It will provide a comprehensive overview of the relationships among all variables.

Heatmap with strong correlations: This heatmap will focus on displaying only the strong correlations between variables. By setting a threshold for correlation strength (e.g., 0.5 or higher), we can highlight the significant relationships and identify the most influential factors within the dataset.

In [None]:


def draw_heatmap_view(df):
    heat_map_view = df[['Prep', 'Cook', 'Total', 'Servings', 'Rating', 'Rating_Count', 'Dairy', 'Meat', 'Fur', 'Calories','Fat','Carbs','Protein']]
    sns.heatmap(heat_map_view.corr(), annot=True)
    plt.show()

def draw_heatmap_view_important(df):
    heat_map_view = df[['Prep', 'Cook', 'Total', 'Servings', 'Rating', 'Rating_Count', 'Dairy', 'Meat', 'Fur', 'Calories','Fat','Carbs','Protein']]
    corr = heat_map_view.corr()
    mask = (corr > 0.5) | (corr < -0.5)  # set the threshold for correlation values
    sns.heatmap(corr, annot=True, mask=~mask, cmap='coolwarm')
    
    # print the correlations above 0.5 or below -0.5
    print("Correlations above 0.5 or below -0.5:")
    for i in range(len(corr.columns)):
        for j in range(i):
            if mask.iloc[i, j]:
                print(f"{corr.index[i]} - {corr.columns[j]}: {corr.iloc[i, j]}")

    plt.show()

In [None]:
draw_heatmap_view(df)
draw_heatmap_view_important(df)

It appears that none of the initial parameters we examined in our analysis have a significant correlation with the 'Rating' column. This suggests that the factors we initially considered may not have a direct impact on the recipe rating. While this outcome may be unexpected, let's continue and find out what might affect the rating.

To determine the factors that contribute to the popularity of a recipe and evaluate the predictability of a recipe's popularity based on these factors, we will begin by filtering the ingredients. Our assumption is that certain ingredients may have a positive or negative impact on the recipe's rating.

With this in mind, we will initiate the cleaning process by applying filters to the ingredients. By categorizing the ingredients based on their potential influence on the recipe's popularity, we can isolate and analyze the impact of specific ingredients or ingredient groups. This approach will help us understand the relationship between ingredients and recipe popularity more effectively.

Let's proceed with the cleaning process and investigate how different ingredients may affect the overall popularity of recipes.

#### Cleaning the Ingredients ####

Let's see what we are working with by using a simple plot of the most frequent ingredients in the data frame.

In [None]:
def initial_data_review(df):
    import ast  # literal_eval parses the stored list strings without executing code
    #Create a list of all the ingredients
    #df=pd.read_csv('my_data.csv')
    all_ingredients = []

    for i in df['Ingredients']:
        all_ingredients += ast.literal_eval(i)
    # Define a list of terms to exclude from the ingredients list
    exclude_terms = ['salt', 'pepper', 'garlic', 'onion', 'paprika', 'cumin', 'chili', 'oregano', 'basil', 'thyme', 'rosemary']
    # Create a list of all the ingredients
    clean_ingredients = []
    for ingredient in all_ingredients:
        ingredient = re.sub(r'\d+(\.\d+)?', '', ingredient) # Remove any quantity
        ingredient = re.sub(r'(\s+\d+)?\s*(large|medium|small)?\s*(cup|teaspoon|tablespoon)s?', '', ingredient, flags=re.IGNORECASE) # Remove any volume/capacity description
        ingredient = ingredient.strip()
        if ingredient and not any(term in ingredient.lower() for term in exclude_terms):
            clean_ingredients.append(ingredient)

    # Count the frequency of each ingredient
    ingredient_counts = pd.Series(clean_ingredients).value_counts()

    # Create a pie chart for the top 10 ingredients
    top_10_ingredients = ingredient_counts.head(10)
    plt.pie(top_10_ingredients, labels=top_10_ingredients.index, autopct='%1.1f%%')
    plt.title('Top 10 Ingredients in Recipes (Excluding Spices)')
    plt.show()

In [None]:
initial_data_review(df)

Although the initial plot of the most frequent ingredients did not provide much insight, it was important to clean the data as many ingredients had additional information such as volume or weight that we did not need for our analysis. Additionally, there were discrepancies in the ingredient names, such as 'Wheat Flour' and 'Corn Flour' that should both be considered as 'Flour'. This type of data cleaning is crucial to ensure the accuracy and reliability of our analysis.

Our idea was to build a dictionary of the top X ingredients and manually filter the junk out of it. First, let's build the dictionary:

In [None]:
import collections


def find_most_common_ingridients(df):
    final_lst = []
    #df = pd.read_csv('clean.csv')
    ingredients = df['Ingredients'].tolist()

    for item in ingredients:
        # Chain the replacements so each one builds on the previous result
        temp = item.replace("'", "")
        temp = temp.replace("\\\\xa0", " ")
        temp = temp.replace("\\\\u200b", " ")
        temp = temp.strip("[]").split(", ")
        temp_tokens = []
        for entry in temp:
            temp_tokens.extend(entry.split())
        final_lst.extend(temp_tokens)

    #print(final_lst)
    word_count = collections.Counter(final_lst)

    # Filter out items that start with a digit
    sorted_dict = {k: v for k, v in word_count.items() if not k[0].isdigit()}
    sorted_dict = dict(sorted(sorted_dict.items(), key=lambda x: x[1], reverse=True))
    print('\n\n\n\n')
    # Create a DataFrame from the sorted dictionary and save it as a CSV file
    counts_df = pd.DataFrame.from_dict(sorted_dict, orient='index', columns=['count'])
    counts_df.to_csv('Ingredients_dirty.csv')

    # Print the 300 most frequent tokens
    for i, (key, value) in enumerate(sorted_dict.items()):
        if i == 300:
            break
        print(f"{key}: {value}")
    return sorted_dict

To streamline the data cleaning process, we utilized the power of text editors like Notepad++ and applied regular expressions (regex) to manipulate the ingredient data. Here are the steps we followed:

Lowercasing: We converted all the ingredient names to lowercase to ensure consistency and eliminate any case-related discrepancies.

Removing cells with numbers: Since we wanted to focus on ingredient names rather than quantities or measurements, we removed any cells that contained numbers.

Removing duplicates: We eliminated duplicate ingredient entries to avoid redundancy in our analysis.

Clustering similar ingredients: We manually reviewed the ingredient list and grouped together ingredients that were essentially the same but had slight variations in their names. For example, 'Wheat Flour' and 'Corn Flour' were clustered as 'Flour' since they both belong to the same category.

These manual steps were necessary due to the complexity and variations in ingredient names. By carefully curating and organizing the ingredient data, we can now proceed with a cleaner and more consistent dataset for further analysis.
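
For readers who prefer to stay inside the notebook, the four steps above can be approximated in pandas. This is only a sketch: we did the real work by hand in a text editor, and the `CANONICAL` mapping and the function name `tidy_ingredient_names` are hypothetical.

```python
import pandas as pd

# Hypothetical mapping from name variants to one canonical ingredient;
# the real grouping was curated manually.
CANONICAL = {'wheat flour': 'flour', 'corn flour': 'flour'}

def tidy_ingredient_names(names):
    s = pd.Series(names, dtype='object').str.lower().str.strip()  # lowercase
    s = s[~s.str.contains(r'\d', regex=True)]   # drop entries containing numbers
    s = s.replace(CANONICAL)                    # cluster similar names
    return s.drop_duplicates().tolist()         # remove duplicates

print(tidy_ingredient_names(['Wheat Flour', 'corn flour', '2 eggs', 'Butter', 'butter']))
# prints ['flour', 'butter']
```

Clustering runs before de-duplication here so that variants which collapse to the same name are counted once.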

(Note: the curated ingredient list is stored in ingridients.csv, which needs to be uploaded alongside the notebook for the following cells to run.)

Now that we have defined our set of relevant ingredients, it is time to filter the original ingredient data. Our goal is to replace any ingredient that appears as a substring in the original ingredients with the corresponding ingredient name from the new CSV file.

In [None]:
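def read_ingridients():
    # Collect every comma-separated value from every line of the file
    data = []
    with open('ingridients.csv', 'r') as file:
        for line in file:
            line = line.strip()
            # Skip empty values: an empty string is a substring of every ingredient
            data.extend(value for value in line.split(',') if value)
    return data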

In [None]:
def delete_index_columns(df):
    # Get a list of all column names that start with 'Unnamed'
    index_cols = [col for col in df.columns if col.startswith('Unnamed')]
    df.drop(columns=index_cols, inplace=True)
    
    return df

During the data cleaning process, we observed that the word 'tea' appeared frequently in the ingredient lists. This was a side effect of the word 'Teaspoon' being reduced to 'tea' by our clean_ingridients function. To address this issue, we decided to remove the term 'Teaspoon' from the ingredient lists before cleaning them.

In [None]:
def remove_teaspoon(df, column):
    df = delete_index_columns(df)
    # Match 'teaspoon' and 'teaspoons' in any letter case
    df[column] = df[column].str.replace(r'teaspoons?', '', case=False, regex=True)
    return df

In [None]:
def clean_ingridients(df):
    df = delete_index_columns(df)
    df2 = df.copy()
    pure_ingredients = read_ingridients()
    print("starting ingredient cleanup, this might take a minute")
    for idx, ingredient in df2['Ingredients'].items():
        line = ingredient.split(",")
        for j, item in enumerate(line):
            for single_ingredient in pure_ingredients:
                if single_ingredient.lower() in item.lower():
                    #print(f'found {single_ingredient} in {item}')
                    line[j] = single_ingredient.lower()
        # Write back by index label so a non-default index is handled correctly
        df2.at[idx, 'Ingredients'] = ",".join(line)
    # Return the cleaned copy; it must not be overwritten with the uncleaned df
    #df2.to_csv('clean_modified.csv', index=False)
    print("done!")

    return df2

In [None]:
df = remove_teaspoon(df,'Ingredients')# Removing the teaspoon word from the list of ingredients
df = clean_ingridients(df)# Cleaning the data by ingredients

Let's visualise the result and check that we cleaned it properly. We can now see which ingredients are the most frequent:

In [None]:
def get_ingridients_dict(df):
    ingredients = []

    with open('ingridients.csv', newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            ingredients.extend(row)

    #df= pd.read_csv('clean_modified.csv')


    ingredients_dict = {}

    # loop through the ingredients in the DataFrame
    for line in df['Ingredients']:
        for single_item in line.split(','): # split the ingredients by comma if they're in a single string
            for ing_in_lst in ingredients: # loop through the ingredients in the ingredients list
                if ing_in_lst in single_item.strip(): # check if the ingredient in the list is a substring of the ingredient in the dataset
                    # if the ingredient exists in the dictionary, increment its count
                    if ing_in_lst in ingredients_dict:
                        ingredients_dict[ing_in_lst] += 1
                    # otherwise, add the ingredient to the dictionary with a count of 1
                    else:
                        ingredients_dict[ing_in_lst] = 1

    # sort the dictionary by value in descending order
    sorted_dict = dict(sorted(ingredients_dict.items(), key=lambda x: x[1], reverse=True))

    #print the dictionary of ingredients and their counts
    print("Printing the ingredient Counts:")
    for ing, count in sorted_dict.items():
        print(f'{ing}: {count}')   
    return sorted_dict

In [None]:
Sorted_dict = get_ingridients_dict(df)  # the function already prints the counts



In [None]:
def draw_ingridient_pie_chart(df,top_n):
    # print()
    ingredient_counts = get_ingridients_dict(df)

    # Extract the top N ingredients by frequency
    ingredient_names = collections.Counter(ingredient_counts).most_common(top_n)
    ingredient_names = [x[0] for x in ingredient_names]
    ingredient_frequencies = [ingredient_counts[name] for name in ingredient_names]
    total_receipe_count = sum(ingredient_frequencies)
    ingredient_percentages = [(count/total_receipe_count)*100 for count in ingredient_frequencies]
    ingredient_labels = ['{} ({:.1f}%)'.format(name, percentage) for name, percentage in zip(ingredient_names, ingredient_percentages)]

    # Create the pie chart
    plt.pie(ingredient_percentages, labels=ingredient_labels)
    plt.title('Top {} Ingredients'.format(top_n))
    plt.show()

Looks like it works properly :)

To explore the potential correlation between ingredients and recipe ratings, we employed a technique called "ingredient explosion." This technique involves expanding the ingredients into binary columns, where a value of 1 indicates the presence of an ingredient in a recipe, and 0 indicates its absence.

By utilizing this approach, we can transform the ingredient data into a more structured format that allows us to analyze the relationship between individual ingredients and recipe ratings. The binary columns provide a way to quantify the presence or absence of specific ingredients for each recipe in the dataset.

With the exploded ingredient columns in place, we can proceed with investigating the correlation between these ingredient indicators and the recipe ratings to gain insights into which ingredients may have an impact on the overall rating.
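
To make the idea concrete, here is a minimal sketch on a made-up three-recipe table. The recipe names, ratings, and the short `ingredients` list are illustrative assumptions; the real list comes from our curated CSV.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Name': ['Pancakes', 'Omelette', 'Toast'],
    'Ingredients': ['flour,egg,milk', 'egg,butter', 'bread,butter'],
    'Rating': [4.5, 4.0, 3.5],
})
ingredients = ['flour', 'egg', 'butter']  # hypothetical subset of the curated list

for ingredient in ingredients:
    # 1 if the ingredient name occurs in the recipe's ingredient string, else 0
    toy[ingredient] = toy['Ingredients'].str.contains(ingredient, regex=False).astype(int)

print(toy[['Name'] + ingredients])
# Correlation of each indicator column with the rating
print(toy[['Rating'] + ingredients].corr().loc['Rating'].drop('Rating'))
```

Note that this is plain substring matching, so an entry such as 'egg' would also flag a recipe that contains 'eggplant'.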

In [None]:
def explode_ingridients_and_get_corr(df):#this function expands the dataframe to each ingridient and tries to find a correlation
    #df = pd.read_csv('clean_modified.csv')

    # extract the ingredients column
    df_ing = df['Ingredients']

    # define a list of ingredients you care about
    important_ingredients = []

    with open('ingridients.csv', newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            important_ingredients.extend(row)

    # loop through the ingredients you care about
    for ingredient in important_ingredients:
        # create a new column with a value of 1 if the recipe contains the ingredient and 0 otherwise
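        # regex=False matches the name literally; na=False treats missing ingredient lists as "absent"
        df[ingredient] = df_ing.str.contains(ingredient, regex=False, na=False).astype(int)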

    correlations = df[['Rating'] + [ingredient for ingredient in important_ingredients]].corr()

    # print the correlation coefficients for each ingredient
    # print(correlations.loc['Rating'])
    top_20 = correlations.loc['Rating'].sort_values(ascending=False)[1:21]
    print("Top 20 ingredients with the highest correlation with the 'Rating' column:")
    print(top_20)
    return df

In [None]:
df= explode_ingridients_and_get_corr(df)

After conducting an in-depth analysis, we discovered that there are no significant correlations between individual ingredients and recipe ratings. This finding provides an answer to one of our research questions, indicating that no specific ingredient can directly predict the rating of a recipe.

This outcome emphasizes the complexity of determining what makes a recipe highly rated and underscores the importance of considering various factors such as preparation methods, cooking techniques, flavor combinations, and presentation when evaluating recipe quality.

The natural next step is to try machine learning and see whether it can predict the rating.

In [None]:
def predict_rating_linear(df):
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Prepare X and y
    X = df.drop(['Ingredients', 'Name', 'Rating'], axis=1)
    y = df['Rating']

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Create and fit the model
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Round predicted ratings to nearest half-integer value
    def round_half_up(x):
        # Clamp ratings to 0-5 range
        x = np.clip(x, 0, 5)
        # Round to nearest half-integer value
        return np.floor(x * 2 + 0.5) / 2

    # Make predictions on the training set
    y_train_pred = model.predict(X_train)
    y_train_pred_rounded = round_half_up(y_train_pred)

    # Count correct and incorrect predictions in the training set
    num_train_correct = np.sum(np.abs(y_train_pred_rounded - y_train) <= 0.25)
    num_train_incorrect = len(y_train) - num_train_correct
    percent_train_correct = num_train_correct / len(y_train) * 100

    # Make predictions on the testing set
    y_test_pred = model.predict(X_test)
    y_test_pred_rounded = round_half_up(y_test_pred)

    # Count correct and incorrect predictions in the testing set
    num_test_correct = np.sum(np.abs(y_test_pred_rounded - y_test) <= 0.25)
    num_test_incorrect = len(y_test) - num_test_correct
    percent_test_correct = num_test_correct / len(y_test) * 100

    # Print results
    print(f"Number of recipes in the dataset: {len(df)}")
    print(f"Number of recipes in the training set: {len(X_train)}")
    print(f"Number of recipes in the testing set: {len(X_test)}")
    print(f"Number of correctly predicted recipes in the training set: {num_train_correct}")
    print(f"Number of incorrectly predicted recipes in the training set: {num_train_incorrect}")
    print(f"Percent of correctly predicted recipes in the training set: {percent_train_correct}%")
    print(f"Number of correctly predicted recipes in the testing set: {num_test_correct}")
    print(f"Number of incorrectly predicted recipes in the testing set: {num_test_incorrect}")
    print(f"Percent of correctly predicted recipes in the testing set: {percent_test_correct}%")

    test_results = np.concatenate((y_test.to_numpy().reshape(-1, 1), y_test_pred_rounded.reshape(-1, 1)), axis=1)
    plt.scatter(range(len(test_results)), test_results[:, 0], label='Actual Ratings')
    plt.scatter(range(len(test_results)), test_results[:, 1], label='Predicted Ratings')
    plt.xlabel('Recipe Number')
    plt.ylabel('Rating')
    plt.ylim(0, 5)  # Set y-axis limits
    plt.title('Actual vs. Predicted Ratings')
    plt.legend()
    plt.show()

    return model, num_train_correct, num_train_incorrect, percent_train_correct, num_test_correct, num_test_incorrect, percent_test_correct

In [None]:
predict_rating_linear(df)
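
The "correct prediction" criterion used in predict_rating_linear can be hard to read inside the larger function, so here is a small standalone sketch of it. The helper names and the sample numbers are made up for illustration. Because ratings come in steps of 0.5, a tolerance of 0.25 after rounding means the rounded prediction must hit the actual rating exactly.

```python
import numpy as np

def round_to_half_star(x):
    # Clamp to the 0-5 range, then round to the nearest 0.5 (halves round up)
    x = np.clip(x, 0, 5)
    return np.floor(x * 2 + 0.5) / 2

def half_star_accuracy(y_true, y_pred):
    rounded = round_to_half_star(np.asarray(y_pred, dtype=float))
    hits = np.abs(rounded - np.asarray(y_true, dtype=float)) <= 0.25
    return hits.mean() * 100

actual = [4.5, 4.0, 5.0, 3.5]   # illustrative ratings
raw = [4.6, 4.3, 5.4, 2.9]      # illustrative raw model outputs

print(round_to_half_star(np.array(raw)))   # rounded to 4.5, 4.5, 5.0, 3.0
print(half_star_accuracy(actual, raw))     # 50.0
```

Two of the four rounded predictions match the actual rating, which gives 50%.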

#### ADD THE NUMBER IN ACCURACY####
Those results weren't looking that great. We will try different machine learning algorithms in order to improve our predictions. The next algorithm we will try is KNN. Let's quickly create a new function and run it:

In [None]:
def predict_rating_knn(df):
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.model_selection import train_test_split
    
    # Prepare X and y
    X = df.drop(['Ingredients', 'Name', 'Rating'], axis=1)
    y = df['Rating']
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Create and fit the model
    best_k = 0
    best_accuracy = 0
    for k in range(1, 31):
        model = KNeighborsRegressor(n_neighbors=k)
        model.fit(X_train, y_train)
    
        # Make predictions on the training set
        y_train_pred = model.predict(X_train)
        y_train_pred_rounded = np.round(np.clip(y_train_pred, 0, 5) * 2) / 2
    
        # Count correct and incorrect predictions in the training set
        num_train_correct = np.sum(np.abs(y_train_pred_rounded - y_train) <= 0.25)
        percent_train_correct = num_train_correct / len(y_train) * 100
    
        # Make predictions on the testing set
        y_test_pred = model.predict(X_test)
        y_test_pred_rounded = np.round(np.clip(y_test_pred, 0, 5) * 2) / 2
    
        # Count correct and incorrect predictions in the testing set
        num_test_correct = np.sum(np.abs(y_test_pred_rounded - y_test) <= 0.25)
        percent_test_correct = num_test_correct / len(y_test) * 100
        
        # Print results
        print(f"For k={k}:")
        print(f"Number of correctly predicted recipes in the training set: {num_train_correct}")
        print(f"Percent of correctly predicted recipes in the training set: {percent_train_correct}%")
        print(f"Number of correctly predicted recipes in the testing set: {num_test_correct}")
        print(f"Percent of correctly predicted recipes in the testing set: {percent_test_correct}%")
        
        # Check if the current model is better than the previous best model
        if percent_test_correct > best_accuracy:
            best_k = k
            best_accuracy = percent_test_correct
    
    print(f"\nBest k: {best_k}")
    print(f"Best percent of correctly predicted recipes in the testing set: {best_accuracy}%")
    
    # Train the best model
    model = KNeighborsRegressor(n_neighbors=best_k)
    model.fit(X_train, y_train)
    
    # Make predictions on the testing set
    y_test_pred = model.predict(X_test)
    y_test_pred_rounded = np.round(np.clip(y_test_pred, 0, 5) * 2) / 2
    
    # Count correct and incorrect predictions in the testing set
    num_test_correct = np.sum(np.abs(y_test_pred_rounded - y_test) <= 0.25)
    num_test_incorrect = len(y_test) - num_test_correct
    percent_test_correct = num_test_correct / len(y_test) * 100
    
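    # Print final results
    print("\nFinal results:")
    print(f"Number of recipes in the dataset: {len(df)}")
    print(f"Number of recipes in the training set: {len(X_train)}")
    print(f"Number of recipes in the testing set: {len(X_test)}")
    print(f"Number of correctly predicted recipes in the testing set: {num_test_correct}")
    print(f"Number of incorrectly predicted recipes in the testing set: {num_test_incorrect}")
    print(f"Percent of correctly predicted recipes in the testing set: {percent_test_correct}%")

    return model, best_k, percent_test_correct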

In [None]:
predict_rating_knn(df)

##### ADD THE NUMBER IN ACCURACY #####

This didn't give us any better results. As a last resort, we will try to implement a working neural network and see whether a layered model can achieve better results.

In [None]:
def predict_rating_mlp(df):
    import numpy as np
    import pandas as pd
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers  # same package as the keras import above
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.metrics import r2_score

    print("Preprocessing data...")
    # select features and target variable
    X = df.drop(['Ingredients', 'Name', 'Rating'], axis=1)
    y = df['Rating']

    # scale data
    scaler = MinMaxScaler()
    X = scaler.fit_transform(X)
    X_scraped = scaler.transform(df.drop(['Ingredients', 'Rating', 'Name'], axis=1))
    # split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # build model
    print("Building model...")
    model = keras.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)))
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(32, activation='relu'))
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(1))

    # compile model
    model.compile(loss='mse', optimizer='adam')

    # fit model
    model.fit(X_train, y_train, epochs=50, verbose=0)

    # Evaluate the model
    mse = model.evaluate(X_test, y_test, verbose=0)
    rmse = np.sqrt(mse)
    print(f'Root Mean Squared Error: {rmse:.3f}')

    # Predict the ratings for the scraped data
    y_pred = model.predict(X_scraped)

    # Round the predicted ratings to the nearest 0.5
    y_pred_rounded = (np.round(y_pred * 2) / 2)

    # Print the predicted ratings and the R^2 score on the test data
    print('Predicted Ratings:')
    print(y_pred_rounded.flatten())
    r2 = r2_score(y_test, model.predict(X_test))
    print(f'R^2 score on test data: {r2:.3f}')
    
    return y_pred_rounded.flatten()

In [None]:
predict_rating_mlp(df)

Our score (the R² value computed by the network above) hovers around 0.36 (to be confirmed), which isn't that good, so we experimented with feature engineering to try to improve our predictions. First, we will scrape two new fields from our sites: the number of steps needed to complete a recipe, named 'Steps', and the number of ingredients the recipe needs, named 'Num_ing'.

We also decided to add a popularity column that sums the frequency of each ingredient in the recipe, to try to get a better score for each recipe. In addition, we added a difficulty column that takes into account the number of steps, the cooking time, and the number of ingredients, to better capture how recipes differ from one another.

This means we need to change our scraping functions and run the cleaning functions on the new data as well :)

In [None]:
def get_steps_to_cook(soup_obj):
    class_name = "comp mntl-sc-block-group--LI mntl-sc-block mntl-sc-block-startgroup"
    elements = soup_obj.find_all(class_=class_name)

    # Print the number of elements found
    #print(f"Number of instances of {class_name}: {len(elements)}")
    return len(elements)

In [None]:
def merge_fast(url,recepie_name):
    df = pd.DataFrame(columns=['Name','Prep', 'Cook', 'Total', 'Servings', 'Rating','Rating_Count','Dairy','Meat','Fur', 'Calories', 'Fat', 'Carbs', 'Protein','Num_ing','Steps', 'Ingredients'])
    #Dairy
    #Meat
    #Fur
    soup_obj = load_soup_object(url)

    recipe_df = get_cook_times(soup_obj)

    ratings_list= get_stars(soup_obj)

    nutrition_df = get_nutritional_values(soup_obj)

    ingredients = get_ingridients(soup_obj)

    rating_num = get_rating_count(soup_obj)

    meatdaity_df = analyze_recipe(ingredients)

    num_ing = len(ingredients)

    steps = get_steps_to_cook(soup_obj)

    if ingredients == []:
        # Placeholder so the row can still be built when no ingredients were found
        ingredients = ['', '', '', '']
    new_row = {
        'Name':recepie_name,
        'Prep': recipe_df['Prep'][0],
        'Cook': recipe_df['Cook'][0],
        'Total': recipe_df['Total'][0],
        'Servings': recipe_df['Servings'][0],
        'Rating': ratings_list,
        'Rating_Count':rating_num,
        'Dairy':meatdaity_df['Dairy'][0],
        'Meat':meatdaity_df['Meat'][0],
        'Fur':meatdaity_df['Fur'][0],
        'Calories': nutrition_df['Calories'][0],
        'Fat': nutrition_df['Fat'][0],
        'Carbs': nutrition_df['Carbs'][0],        
        'Protein': nutrition_df['Protein'][0],
        'Num_ing' : num_ing,
        'Steps' : steps,      
        'Ingredients': [ingredients]
    }
    #print(new_row)

    # add the new row to the DataFrame
    df = pd.concat([df, pd.DataFrame(new_row)], ignore_index=True)
    # convert the numeric columns to float
    for col in ['Prep', 'Cook', 'Total', 'Servings', 'Rating', 'Calories']:
        df[col] = df[col].astype(float)

    return df

In [None]:
def fast_scrape(csv_file_name):
    recipe_names = []
    recipe_links = []
    count=0

    with open(csv_file_name, encoding='utf-8', newline='') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # Skip the first row
        for row in reader:
            recipe = ','.join(row).strip().replace('\x9c', '')
            last_comma = recipe.rfind(',')
            if last_comma != -1:
                recipe_name = recipe[:last_comma].strip()
                recipe_link = recipe[last_comma + 1:].strip()
                recipe_names.append(recipe_name)
                recipe_links.append(recipe_link)
            else:
                print(f"Invalid row: {row}")

    final_df=pd.DataFrame()
    for name, link in zip(recipe_names, recipe_links):
        # print(f"Recipe name: {name}")
        # print(f"Recipe link: {link}")
        temp_df=merge_fast(link,name)
        temp_df.set_index('Name', inplace=True)  # set the index of temp_df to the recipe name
        if(final_df.empty):
            final_df=temp_df
        else:
            final_df = pd.concat([final_df,temp_df])
        print(final_df)

    #final_df.to_pickle('dataframe.pkl')
    #final_df.to_pickle('dataframe.pkl', protocol=4, encoding='utf-8')
    final_df.to_csv('my_data.csv', index=True, encoding='utf-8')
    print(final_df)
    return final_df  # the caller assigns the result, so return the DataFrame

Now we have to run the scraping all over again and clean the new data the same way we cleaned the initial data. It's a long process, but it can't be avoided...

In [None]:
Updated_Data = fast_scrape('Recipe_Links_and_Names.csv')
Updated_Data['Fat'] = Updated_Data['Fat'].str.replace('g', '').str.replace(',', '').astype(int)
Updated_Data['Carbs'] = Updated_Data['Carbs'].str.replace('g', '').str.replace(',', '').astype(int)
Updated_Data['Protein'] = Updated_Data['Protein'].str.replace('g', '').str.replace(',', '').astype(int)
Updated_Data = clean_data_time(Updated_Data)
Updated_Data = clean_df(Updated_Data)
Updated_Data = remove_teaspoon(Updated_Data,'Ingredients')
Updated_Data = clean_ingridients(Updated_Data)


print("DONE!")

In [None]:
def add_popularity_score_to_df(df):
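    # Everything that is not one of these known columns is treated as a 0/1 ingredient column
    non_ingredient_columns = ['Name', 'Prep', 'Cook', 'Total', 'Servings', 'Rating', 'Rating_Count', 'Dairy', 'Meat', 'Fur', 'Calories', 'Fat', 'Carbs', 'Protein', 'Num_ing', 'Steps', 'Ingredients', 'Combined', 'Z_score', 'Difficulty', 'Popularity Score']
    binary_columns = [col for col in df.columns if col not in non_ingredient_columns]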

    # Calculate the frequencies of each binary value
    ingredient_frequencies = df[binary_columns].sum() / len(df) * 30

    # Calculate the popularity score for each recipe
    popularity_scores = []
    for _, row in df.iterrows():
        score = 0
        for column in binary_columns:
            if row[column] == 1:
                score += ingredient_frequencies[column]
        popularity_scores.append(score)

    # Add the popularity scores as a new column to the DataFrame
    df['Popularity Score'] = popularity_scores

    # Print the updated DataFrame
    #print(df)
    #scraping_functions.draw_histo_1_params(df,'Popularity Score')
    return df
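
As a side note, the row-by-row loop above can also be written as a single matrix-vector product, which tends to be much faster on large frames. The sketch below is an assumed alternative, not the code we ran; `popularity_scores` is a hypothetical helper, and it matches the loop only if every ingredient column holds strictly 0 or 1.

```python
import pandas as pd

def popularity_scores(df, binary_columns, scale=30):
    flags = df[binary_columns]
    # Share of recipes containing each ingredient, scaled as in the loop version
    weights = flags.sum() / len(df) * scale
    # For each recipe, add up the weights of the ingredients it contains
    return flags.dot(weights)

# Toy data, invented for illustration
toy = pd.DataFrame({'egg': [1, 1, 0], 'flour': [1, 0, 0]})
toy['Popularity Score'] = popularity_scores(toy, ['egg', 'flour'])
print(toy)
```

Here 'egg' appears in two of three recipes (weight of about 20) and 'flour' in one (weight of about 10), so the three recipes score roughly 30, 20, and 0.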

In [None]:
def add_difficulty_column(df):
    # Read the CSV file into a pandas dataframe
    #df = pd.read_csv(csv_file_path)
    
    # Calculate the combined score for each recipe
    df['Combined'] = df['Num_ing'] + df['Steps'] + df['Total']
    
    # Calculate the mean and standard deviation of the combined score
    mean = df['Combined'].mean()
    std_dev = df['Combined'].std()
    
    # Calculate the z-score for each recipe
    df['Z_score'] = (df['Combined'] - mean) / std_dev
    
    # Divide the z-scores into 5 equal-sized groups
    df['Difficulty'] = pd.qcut(df['Z_score'], q=5, labels=[1, 2, 3, 4, 5])
    
    # Write the updated dataframe to a new CSV file
    # new_csv_file_path = csv_file_path.split('.csv')[0] + '_with_difficulty.csv'
    # df.to_csv(new_csv_file_path, index=False)
    #df.to_csv(csv_file_path, index=False, mode='w')
    #save_df(df,"csv_Wtih_diff.csv")
    
    # Print the number of recipes in each difficulty level
    for i in range(1, 6):
        count = df['Difficulty'][df['Difficulty'] == i].count()
        print(f"Difficulty level {i}: {count}")
    
    return df
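
One property of this construction is worth spelling out: pd.qcut bins by rank, and the z-score is an order-preserving rescaling (as long as the standard deviation is positive), so the difficulty levels come out the same whether we bin 'Z_score' or 'Combined' directly. The following sketch checks this on made-up numbers.

```python
import pandas as pd

# Made-up combined scores for ten recipes
combined = pd.Series([12, 35, 48, 60, 75, 90, 120, 150, 200, 310], dtype=float)
z_score = (combined - combined.mean()) / combined.std()

levels = [1, 2, 3, 4, 5]
from_z = pd.qcut(z_score, q=5, labels=levels)
from_raw = pd.qcut(combined, q=5, labels=levels)

print(from_z.tolist())                        # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
print(from_z.tolist() == from_raw.tolist())   # True
```

The 'Z_score' column is still useful as a feature in its own right, but it is not required for assigning the difficulty level.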

In [None]:
Updated_Data = add_difficulty_column(Updated_Data)
Updated_Data = explode_ingridients_and_get_corr(Updated_Data)
Updated_Data = add_popularity_score_to_df(Updated_Data)


Let's save the newly acquired data!

In [None]:
def save_df(df,name):
    df.to_csv(name, index=True, encoding='utf-8')

save_df(Updated_Data, 'FinalCSVFile.csv')

Let's see if anything new appeared with regard to the correlations.


In [None]:
def draw_heatmap_view(df):
    heat_map_view = df[['Prep', 'Cook', 'Total', 'Servings', 'Rating', 'Rating_Count', 'Dairy', 'Meat', 'Fur', 'Calories','Fat','Carbs','Protein','Num_ing','Steps','Z_score','Difficulty']]
    sns.heatmap(heat_map_view.corr(), annot=True)
    plt.show()

def draw_heatmap_view_important(df):
    heat_map_view = df[['Prep', 'Cook', 'Total', 'Servings', 'Rating', 'Rating_Count', 'Dairy', 'Meat', 'Fur', 'Calories','Fat','Carbs','Protein','Num_ing','Steps','Z_score','Difficulty']]
    corr = heat_map_view.corr()
    mask = (corr > 0.5) | (corr < -0.5)  # set the threshold for correlation values
    sns.heatmap(corr, annot=True, mask=~mask, cmap='coolwarm')
    
    # print the correlations above 0.5 or below -0.5
    print("Correlations above 0.5 or below -0.5:")
    for i in range(len(corr.columns)):
        for j in range(i):
            if mask.iloc[i, j]:
                print(f"{corr.index[i]} - {corr.columns[j]}: {corr.iloc[i, j]}")

    plt.show()



In [None]:
draw_heatmap_view_important(Updated_Data)

As we can see, nothing outstanding appeared from the extra scraping and feature engineering, but let's see if our algorithm can be improved with this new data.

In [None]:
predict_rating_linear(Updated_Data)

Surprisingly, our latest machine learning model achieved a score of 0.47, an improvement over our previous attempt. Although the score is still not high, it is roughly 30% better in relative terms than the earlier score of about 0.36, since (0.47 - 0.36) / 0.36 ≈ 0.31.

During our exploration, we experimented with various feature engineering techniques to extract more meaningful information from the data. We carefully evaluated the impact of each feature and excluded columns and feature engineering attempts that did not yield significant results. This approach allowed us to focus on the most influential factors that could contribute to predicting recipe ratings accurately.

#### Changing our Approach ####

After realizing that our previous attempts to predict star ratings were not yielding satisfactory results, we decided to take a different approach. Instead of predicting star ratings directly, we focused on a simpler question: can we determine whether a recipe is "good" or not? We defined "good" as a 5-star recipe, which in the code means a rating above 4.5. To answer this question, we used a Perceptron, a binary classification model, to decide whether a recipe falls into that top group. By training the model on factors like cooking time, ingredients, and nutritional values, we aimed to identify what makes these top-rated recipes stand out.

We acknowledge that there is ongoing debate about what constitutes a "good" recipe, but we chose to focus on the highest rating of 5 stars. Our goal was to uncover patterns and insights that distinguish these exceptional recipes. Through the Perceptron model, we aimed to discover the key factors that contribute to a recipe's top rating and use this knowledge to develop new recipes with a higher likelihood of achieving the coveted 5-star status.

While our initial attempts to predict star ratings fell short, we remain hopeful that this new approach will shed light on the secrets of highly rated recipes. By analyzing the factors that set apart 5-star recipes, we aim to gain a deeper understanding of what makes a recipe truly outstanding. Let's delve into the Perceptron model and explore the world of exceptional recipes!
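
The labelling rule described above can be sketched on a toy ratings column. The ratings below are made up. The only thing taken from the notebook is the rule itself, `ratings > 4.5`, which means a rating of exactly 4.5 is labelled "not good" and anything from 4.6 upward counts as a 5-star recipe.

```python
import pandas as pd

# Toy ratings (made up); the notebook uses the scraped 'Rating' column
ratings = pd.Series([5.0, 4.6, 4.5, 3.8, 4.9])

# Same rule as in the model cell: strictly above 4.5 counts as "good"
THRESHOLD = 4.5
y = (ratings > THRESHOLD).astype(int)

print(y.tolist())
# Share of each class, useful for seeing how balanced the labels are
print(y.value_counts(normalize=True))
```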

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split


def build_perceptron_model(df):
    #df = pd.read_csv('test123.csv')

    # Save the 'Rating' column separately; it becomes the target
    ratings = df['Rating']
    
    # Drop the target and 'Rating_Count', then keep only the numeric feature columns
    df = df.drop(['Rating', 'Rating_Count'], axis=1)
    df = df.select_dtypes(include=['int', 'float'])
    # Get the column names of the dataframe
    col_names = df.columns

    # Convert the dataframe to a numpy array
    X = df.values

    # Define the target variable: 1 if the rating is above 4.5, else 0
    y = (ratings > 4.5).astype(int).values

    # Verify that X and y have the same length
    assert len(X) == len(y), "X and y must have the same length"

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Fit the Perceptron model to the training data
    model = Perceptron()
    model.fit(X_train, y_train)

    # Compute the predictions on the test data
    y_pred = model.predict(X_test)

    # Compute the accuracy of the model
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.2f}")

    # Get the weights learned by the perceptron
    weights = model.coef_[0]

    # Create a new dataframe with the column names and weights
    weights_df = pd.DataFrame({'column': col_names, 'weight': weights})

    # Print the weights dataframe
    print(weights_df)

    #save_df(weights_df,'FML.csv')
    return weights_df

In [None]:
weights_data = build_perceptron_model(Updated_Data)

We were thrilled with the remarkable result we achieved using the Perceptron model: an impressive accuracy score of 0.95. It seemed almost too good to be true, so we decided to put it to the test. We selected 20 recipes from a different website that were not part of our original dataset, and to our delight, the model accurately predicted whether these recipes would be rated as 5 stars or not. This outcome reinforced our confidence in the model's ability to identify exceptional recipes and opened up exciting possibilities for future applications and enhancements.
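
One further check that can put an accuracy figure in context is comparing it with a classifier that always predicts the most frequent class. The notebook does not report how many recipes fall above the 4.5 threshold, so the labels below are made up purely to show the mechanics. This is a sketch, not a result from our data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 95% of recipes belong to the positive class
y_train = np.array([1] * 76 + [0] * 4)
y_test = np.array([1] * 19 + [0])

# The dummy model ignores the features, so placeholder columns are enough
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

baseline_acc = accuracy_score(y_test, baseline_pred)
print(f"Majority-class accuracy: {baseline_acc:.2f}")

# Rows are true classes, columns are predicted classes (order: 0, 1)
cm = confusion_matrix(y_test, baseline_pred, labels=[0, 1])
print(cm)
```

If the real labels turned out to be this skewed, the model's accuracy would need to be read against that baseline rather than against 0.5.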

In [None]:
sorted_df = weights_data.sort_values(by='weight')

# print the 10 largest (most positive) weights
print(sorted_df.tail(10)[['column', 'weight']])

# print the 10 smallest (most negative) weights
print(sorted_df.head(10)[['column', 'weight']])

Based on the weights obtained from our model, we identified the factors that contribute the most to a recipe being rated 5 stars: prep time, cook time, servings, calories, number of ingredients, and number of steps.

These variables had the highest weights in our weights dataframe, indicating their significant influence on the rating. Conversely, we also identified the five factors with the most negative weights: popularity score, prep, cook, total time, and servings. Negative weights suggest that recipes with longer preparation and cooking times, higher total time, and larger servings are less likely to receive a 5-star rating.

These findings provide valuable insights into user preferences when it comes to recipe ratings. Recipes with shorter cooking times, fewer servings (indicating higher calories per serving), and unique ingredients tend to have a higher likelihood of receiving a 5-star rating. Additionally, the negative weight associated with the popularity score suggests that recipes with lower popularity scores may be more likely to receive higher ratings. This supports our hypothesis that incorporating rare ingredients such as caviar or lobster could improve the chances of a recipe being rated highly. Overall, these findings open up new possibilities for creating recipes that align with user preferences and have a higher probability of receiving favorable ratings.
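
When reading weights this way, it helps to remember that raw Perceptron weights depend on the scale of each column: a feature measured in hundreds (calories) and one measured in single digits (steps) are not directly comparable. A minimal sketch of one way to put features on a common scale before comparing weights follows. The data and the labelling rule are synthetic, and the `StandardScaler` pipeline is an assumption, not what the notebook's `build_perceptron_model` does.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Perceptron
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic features on very different scales (values are made up)
toy = pd.DataFrame({
    'Calories': rng.normal(500, 150, n),
    'Steps': rng.integers(2, 15, n).astype(float),
})
# Made-up rule for illustration: fewer steps means a "good" recipe
y = (toy['Steps'] < 8).astype(int).values

# Standardize each column to zero mean and unit variance, then fit the Perceptron
pipe = make_pipeline(StandardScaler(), Perceptron(random_state=0))
pipe.fit(toy.values, y)

# Weights now refer to standardized features, so their sizes are comparable
weights = pipe.named_steps['perceptron'].coef_[0]
weights_df = pd.DataFrame({'column': toy.columns, 'weight': weights})
print(weights_df.sort_values(by='weight'))
```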
