# Machine Learning: Regression Analysis with Flavors of Cacao Dataset


## Introduction
In this notebook, we'll perform a linear regression analysis on the Flavors of Cacao dataset. We'll explore the relationship between cocoa percentage and chocolate ratings.

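The analysis includes:
1. Importing libraries and data
2. Data cleaning and preparation
3. Exploratory data analysis
4. Hypothesis formulation
5. Data preparation for regression analysis
6. Linear regression modeling
7. Interpretation of the regression line
8. Model performance evaluation
9. Interpretation of model performance
10. Reflection on data bias and model limitations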


## 1. Importing Libraries and Data

In [None]:

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Set plot style
plt.style.use('seaborn-v0_8')


In [None]:

# Load the flavors of cacao dataset
try:
    df = pd.read_excel('flavors_of_cacao_dataframe.xlsx')
    print("File loaded successfully!")
except FileNotFoundError:
    print("File not found. Checking current directory...")
    import os
    print("Current directory:", os.getcwd())
    print("Files in directory:", os.listdir())
    
    # Try to find any Excel files
    excel_files = [f for f in os.listdir() if f.endswith('.xlsx')]
    if excel_files:
        print("Found Excel files:", excel_files)
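        # Try to load the first Excel file found. It may not be the cacao dataset,
        # so confirm it has the 'Cocoa_Percent' and 'Rating' columns used below.
        df = pd.read_excel(excel_files[0])
        print(f"Loaded alternative file: {excel_files[0]}")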
    else:
        # Create a sample dataframe for demonstration
        print("No Excel files found. Creating a sample dataframe for demonstration.")
        # Create synthetic data (seeded so the demonstration is reproducible)
        np.random.seed(42)
        n_samples = 100
        cocoa_percent = np.random.uniform(0.5, 1.0, n_samples)
        # Rating is somewhat related to cocoa percent with some noise
        rating = 2 + 2 * cocoa_percent + np.random.normal(0, 0.5, n_samples)
        # Clip ratings to be between 1 and 5
        rating = np.clip(rating, 1, 5)
        
        # Create a dataframe
        df = pd.DataFrame({
            'Company_Location': np.random.choice(['USA', 'France', 'UK', 'Italy', 'Belgium'], n_samples),
            'Cocoa_Percent': cocoa_percent,
            'Rating': rating,
            'Bean_Type': np.random.choice(['Criollo', 'Forastero', 'Trinitario', None], n_samples),
            'Broad Bean_Origin': np.random.choice(['Ecuador', 'Venezuela', 'Peru', 'Madagascar', 'Ghana'], n_samples)
        })
        print("Created sample dataframe with synthetic data")

# Display the first few rows to understand the data
print("First 5 rows of the dataset:")
df.head()


## 2. Data Cleaning and Preparation

In [None]:

# Check the data types and missing values
try:
    df.info()
except NameError:
    print("DataFrame 'df' is not defined. Please run the previous cell to load the data.")


In [None]:

try:
    # Convert cocoa percentage to a numeric fraction first (e.g. "70%" -> 0.70),
    # so the summary statistics below are computed on numbers, not strings
    if df['Cocoa_Percent'].dtype == 'object':
        df['Cocoa_Percent'] = df['Cocoa_Percent'].str.rstrip('%').astype('float') / 100.0

    # Check for missing values
    print("Missing values in each column:")
    print(df.isnull().sum())

    # Check the distribution of ratings
    print("\nRating statistics:")
    print(df['Rating'].describe())

    # Check the distribution of cocoa percentages
    print("\nCocoa percentage statistics:")
    print(df['Cocoa_Percent'].describe())

    # Remove any rows with missing values in our variables of interest
    df_clean = df.dropna(subset=['Cocoa_Percent', 'Rating'])

    print("\nShape of cleaned dataset:", df_clean.shape)
except NameError:
    print("DataFrame 'df' is not defined. Please run the previous cells to load the data.")


## 3. Exploratory Data Analysis

In [None]:

try:
    # Create a scatter plot to visualize the relationship between cocoa percentage and rating
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x='Cocoa_Percent', y='Rating', data=df_clean)
    plt.title('Relationship Between Cocoa Percentage and Rating')
    plt.xlabel('Cocoa Percentage')
    plt.ylabel('Rating')
    plt.grid(True, alpha=0.3)
    plt.show()

    # Calculate the correlation between cocoa percentage and rating
    correlation = df_clean['Cocoa_Percent'].corr(df_clean['Rating'])
    print(f"Correlation between Cocoa Percentage and Rating: {correlation:.4f}")

    # Additional exploratory analysis: distribution of ratings
    plt.figure(figsize=(10, 6))
    sns.histplot(df_clean['Rating'], bins=20, kde=True)
    plt.title('Distribution of Chocolate Ratings')
    plt.xlabel('Rating')
    plt.ylabel('Frequency')
    plt.grid(True, alpha=0.3)
    plt.show()

    # Distribution of cocoa percentages
    plt.figure(figsize=(10, 6))
    sns.histplot(df_clean['Cocoa_Percent'], bins=20, kde=True)
    plt.title('Distribution of Cocoa Percentages')
    plt.xlabel('Cocoa Percentage')
    plt.ylabel('Frequency')
    plt.grid(True, alpha=0.3)
    plt.show()
except NameError:
    # A missing df_clean means the cleaning cell was skipped; rerun it rather than
    # duplicating the cleaning and plotting logic here
    print("DataFrame 'df_clean' is not defined. Please run the previous cells to create it.")



## 4. Hypothesis

Based on the exploratory data analysis, I formulate the following hypothesis:

**Hypothesis**: There is a linear relationship between the cocoa percentage (independent variable) and the rating (dependent variable). Specifically, I hypothesize that the expected rating changes by a roughly constant amount for each unit change in cocoa percentage, so the relationship can be modeled with simple linear regression. The direction and size of that change (the slope) are left to be estimated from the data.
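
One way to check this hypothesis is to test whether the fitted slope differs from zero. The sketch below is illustrative only: it assumes SciPy is installed (the notebook itself does not import it) and uses synthetic data generated with the same recipe as the notebook's fallback sample, not the real dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (assumption): same recipe as the notebook's fallback sample
rng = np.random.default_rng(42)
cocoa_percent = rng.uniform(0.5, 1.0, 500)
rating = 2 + 2 * cocoa_percent + rng.normal(0, 0.5, 500)

# Ordinary least-squares fit with a two-sided test of H0: slope == 0
result = stats.linregress(cocoa_percent, rating)

print(f"Slope: {result.slope:.3f} (standard error {result.stderr:.3f})")
print(f"p-value for H0 'slope = 0': {result.pvalue:.3g}")
```

On the real data, `cocoa_percent` and `rating` would be replaced by `df_clean['Cocoa_Percent']` and `df_clean['Rating']`. A small p-value suggests the slope is unlikely to be zero; it does not by itself show that the relationship is strong.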


## 5. Data Preparation for Regression Analysis

In [None]:

try:
    # Reshape the variables into NumPy arrays
    X = df_clean['Cocoa_Percent'].values.reshape(-1, 1)  # Independent variable
    y = df_clean['Rating'].values.reshape(-1, 1)  # Dependent variable

    # Split the data into training and test sets (80% training, 20% testing)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    print("Data split into training and test sets:")
    print(f"Training data shape: X_train {X_train.shape}, y_train {y_train.shape}")
    print(f"Test data shape: X_test {X_test.shape}, y_test {y_test.shape}")
except NameError:
    # A missing df_clean means the cleaning cell was skipped; rerun it rather than
    # duplicating the cleaning and splitting logic here
    print("DataFrame 'df_clean' is not defined. Please run the previous cells to create it.")


## 6. Linear Regression Analysis

In [None]:

try:
    # Create and fit the linear regression model
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Print the model coefficients
    print("Model Coefficients:")
    print(f"Intercept: {model.intercept_[0]:.4f}")
    print(f"Slope: {model.coef_[0][0]:.4f}")

    # Create predictions for the test set
    y_pred = model.predict(X_test)

    # Create a plot showing the regression line on the test set
    plt.figure(figsize=(10, 6))
    plt.scatter(X_test, y_test, color='blue', alpha=0.6, label='Actual test data')

    # Sort X_test for a smooth line plot
    X_test_sorted = np.sort(X_test, axis=0)
    y_pred_sorted = model.predict(X_test_sorted)

    plt.plot(X_test_sorted, y_pred_sorted, color='red', linewidth=2, label='Regression line')
    plt.xlabel('Cocoa Percentage')
    plt.ylabel('Rating')
    plt.title('Linear Regression: Cocoa Percentage vs. Rating')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
except NameError:
    # Missing training data means the preparation cell was skipped; rerun it rather
    # than duplicating the split, fit, and plot logic here
    print("Variables X_train, y_train are not defined. Please run the previous cells to prepare the data.")



## 7. Interpretation of Regression Line

Looking at the regression line plotted against the test data:

- The line represents the model's prediction of how rating changes with cocoa percentage.
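- The slope gives the direction of the relationship and the expected change in rating per unit change in cocoa percentage. It does not measure the strength of the relationship; the correlation coefficient and R² do.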
- The scatter of points around the line shows the variability in ratings that isn't explained by cocoa percentage alone.

Based on the visual inspection, the line appears to capture a general trend in the data, but there is considerable scatter around the line. This suggests that while cocoa percentage may have some influence on ratings, other factors not included in this simple model likely play important roles as well.
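
To make the slope concrete, the sketch below converts it into a predicted rating change for a ten-percentage-point increase in cocoa. It assumes `Cocoa_Percent` is stored as a fraction (0.70 for 70%), as produced by the cleaning step, and it uses synthetic stand-in data and `np.polyfit` rather than the fitted `model` from this notebook.

```python
import numpy as np

# Synthetic stand-in data (assumption): same recipe as the notebook's fallback sample
rng = np.random.default_rng(42)
cocoa = rng.uniform(0.5, 1.0, 200)  # fraction, e.g. 0.70 means 70%
rating = 2 + 2 * cocoa + rng.normal(0, 0.5, 200)

# A degree-1 polyfit returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(cocoa, rating, 1)

# The slope is the rating change per 1.0 of Cocoa_Percent (100 percentage points),
# so a 10-point increase corresponds to one tenth of the slope
change_per_10_points = slope * 0.10

# The same quantity obtained from two predictions ten points apart
pred_70 = intercept + slope * 0.70
pred_80 = intercept + slope * 0.80

print(f"Slope per unit of Cocoa_Percent: {slope:.3f}")
print(f"Predicted rating change for +10 points of cocoa: {change_per_10_points:.3f}")
print(f"Prediction at 80% minus prediction at 70%: {pred_80 - pred_70:.3f}")
```

With the notebook's own model, the equivalent slope is `model.coef_[0][0]`.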


## 8. Model Performance Evaluation

In [None]:

try:
    # Calculate performance metrics
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print("Model Performance Statistics:")
    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"R² Score: {r2:.4f}")

    # Compare predicted vs actual values
    comparison_df = pd.DataFrame({
        'Actual Rating': y_test.flatten(),
        'Predicted Rating': y_pred.flatten(),
        'Cocoa Percentage': X_test.flatten()
    })

    # Calculate the absolute error
    comparison_df['Absolute Error'] = abs(comparison_df['Actual Rating'] - comparison_df['Predicted Rating'])

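    # Display the comparison (printed explicitly, because a bare expression
    # inside a try block is not rendered as cell output)
    print("\nComparison of Actual vs. Predicted Ratings (first 10 rows):")
    print(comparison_df.head(10))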
except NameError:
    # Missing predictions mean the modeling cell was skipped; rerun it rather than
    # duplicating the evaluation logic here
    print("Variables y_test, y_pred are not defined. Please run the previous cells to fit the model.")



## 9. Interpretation of Model Performance

The model's performance can be evaluated using the following metrics:

1. **Mean Squared Error (MSE)**: This measures the average squared difference between the actual ratings and the predicted ratings, so it is expressed in squared rating units. A lower MSE indicates better model performance.

2. **R² Score**: This indicates the proportion of variance in the ratings that is explained by the cocoa percentage. An R² close to 1 means the model explains most of the variability in the data. An R² near 0 means it does little better than always predicting the mean rating, and on held-out test data R² can even be negative.

Based on these metrics, we can assess how well cocoa percentage alone predicts chocolate ratings. The R² score gives us insight into whether this single variable is sufficient for prediction or if we need to consider additional factors.
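
Both metrics can be computed by hand, which makes their meaning easier to see. The sketch below uses a handful of made-up ratings and predictions (hypothetical values, not output from this notebook) and also scores a baseline that always predicts the mean rating.

```python
import numpy as np

# Hypothetical actual and predicted ratings, for illustration only
y_true = np.array([3.0, 2.5, 3.5, 4.0, 3.0])
y_hat = np.array([3.1, 2.9, 3.2, 3.6, 3.2])

# MSE: mean of the squared prediction errors
mse = np.mean((y_true - y_hat) ** 2)

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Baseline: always predict the mean rating, which gives R² = 0 by construction
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - baseline) ** 2) / ss_tot

print(f"MSE: {mse:.4f}")                   # 0.0920
print(f"R²: {r2:.4f}")                     # 0.6462
print(f"Baseline R²: {r2_baseline:.4f}")   # 0.0000
```

These hand computations follow the same definitions as `mean_squared_error` and `r2_score`, so they offer a quick sanity check on the values reported above.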



## 10. Reflection on Data Bias and Model Limitations

Several potential biases and limitations should be considered when interpreting this model:

1. **Sampling Bias**: The dataset may not represent all chocolate products equally. Certain regions, manufacturers, or types of chocolate might be overrepresented or underrepresented.

2. **Omitted Variable Bias**: The model only considers cocoa percentage as a predictor, but many other factors likely influence chocolate ratings:
   - Bean origin and type
   - Manufacturing processes
   - Ingredients besides cocoa
   - Reviewer preferences and biases

3. **Measurement Bias**: The ratings are subjective and may reflect the preferences of a specific group of tasters rather than universal quality measures.

4. **Non-linear Relationships**: The relationship between cocoa percentage and rating might not be strictly linear. There could be optimal ranges or threshold effects that a linear model cannot capture.

5. **Data Quality Issues**: Any errors in data entry, inconsistent rating scales, or missing values could affect the model's reliability.
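
The non-linearity concern can be checked by looking for structure in the residuals. The sketch below is a minimal illustration on synthetic data that is deliberately curved (ratings peaking at 70% cocoa, an assumption made only for this example): after a straight-line fit, the mean residual differs systematically between low, middle, and high cocoa ranges.

```python
import numpy as np

# Synthetic data (assumption): ratings peak at 70% cocoa, so the true relationship is curved
rng = np.random.default_rng(0)
cocoa = rng.uniform(0.5, 0.9, 600)
rating = 3.5 - 20 * (cocoa - 0.7) ** 2 + rng.normal(0, 0.3, 600)

# Fit a straight line and compute residuals (actual minus predicted)
slope, intercept = np.polyfit(cocoa, rating, 1)
residuals = rating - (intercept + slope * cocoa)

# Average residual in the low, middle, and high thirds of cocoa percentage
order = np.argsort(cocoa)
bin_means = [chunk.mean() for chunk in np.array_split(residuals[order], 3)]

for label, mean_resid in zip(["low", "middle", "high"], bin_means):
    print(f"{label:>6} cocoa third: mean residual = {mean_resid:+.3f}")
```

If the linear model were adequate, the three means would all be close to zero. A positive middle bin flanked by negative outer bins is the signature of a peak that a straight line cannot capture.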

To improve the model, we could:
- Include additional predictors such as bean origin, company location, or bean type
- Explore non-linear modeling approaches
- Consider interaction effects between variables
- Implement cross-validation to ensure model stability
- Analyze residuals to identify patterns in prediction errors
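
Two of these suggestions, cross-validation and a non-linear alternative, can be sketched together. The example below uses synthetic stand-in data generated with the same recipe as the notebook's fallback sample and compares a linear model with a quadratic one using 5-fold cross-validated R². The `candidates` dictionary and the choice of degree 2 are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): same recipe as the notebook's fallback sample
rng = np.random.default_rng(42)
X = rng.uniform(0.5, 1.0, size=(200, 1))
y = 2 + 2 * X[:, 0] + rng.normal(0, 0.5, 200)

# Shuffled 5-fold split so that each fold covers the full cocoa range
cv = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "linear": LinearRegression(),
    "quadratic": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}

cv_scores = {}
for name, estimator in candidates.items():
    # One R² value per fold, each computed on data the model did not see
    scores = cross_val_score(estimator, X, y, cv=cv, scoring="r2")
    cv_scores[name] = scores
    print(f"{name}: mean R² = {scores.mean():.3f} (std {scores.std():.3f})")
```

A quadratic model that does not improve the cross-validated R² suggests the extra flexibility is not needed. A large spread across folds suggests that a single 80/20 split gives an unstable estimate.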

This analysis serves as a starting point for understanding chocolate ratings, but a more comprehensive model would be needed for accurate predictions.
