# 🌯 Burrito Ratings – Can We Predict a “Great” Burrito?
*Logistic-regression mini-project*

## 🔍 Why this matters  
San Diego’s Burrito Blog rates hundreds of burritos on a 0–5 scale.  
For restaurant owners it would be powerful to **know ahead of time which combinations of ingredients and prep methods are most likely to earn a ≥ 4 ★ review (“Great”).**  
Our goal: build a transparent, baseline logistic-regression model that predicts that binary outcome.


In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

## 🧹 Cleaning & Feature Engineering  
1. Parse `Date` to `pd.DatetimeIndex` so we can time-slice later.  
2. **Target** → create binary `Great` column.  
3. Drop free-text and high-cardinality fields (`Notes`, `Address`, `URL`, …) to avoid leakage.  
4. Convert the `Beef` ingredient flag from strings like `"x"`, `"yes"`, `"no"` into 1/0 ints; the other ingredient flags that are kept (`Pico`, `Guac`, …) stay as strings.  
5. Remove columns that still leak the label (`overall`, `Rec`).  
6. Ensure dtypes are numeric for modeling.  
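
The flag conversion in step 4 can be sketched on a toy frame. This is a minimal illustration, not the notebook's own code: the helper `flags_to_int`, the `TRUTHY` token set, and the sample values are assumptions made for the example.

```python
import pandas as pd

# Toy frame standing in for the real ingredient columns (values are made up).
toy = pd.DataFrame({
    "Beef": ["x", "X ", None, "no"],
    "Pico": ["yes", "", "x", None],
})

TRUTHY = {"x", "yes", "y"}  # assumed set of "ingredient present" tokens


def flags_to_int(frame, columns):
    """Return a copy of `frame` with the given flag columns converted to 1/0 ints."""
    out = frame.copy()
    for col in columns:
        # Normalise case and padding, then mark truthy tokens as 1, all else as 0.
        cleaned = out[col].astype(str).str.strip().str.lower()
        out[col] = cleaned.isin(TRUTHY).astype(int)
    return out


flags = flags_to_int(toy, ["Beef", "Pico"])
print(flags)
```

Using `isin` rather than a replacement dictionary means unexpected tokens fall back to 0 instead of raising an error at the integer cast.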


In [7]:
import pandas as pd

RAW_URL = (
    "https://raw.githubusercontent.com/"
    "buddhika159/Burrito-Exploratory-Analysis/refs/heads/main/"
    "burritos_dataset.csv"
)

def wrangle(filepath=RAW_URL):
    """Load Burrito dataset from a local file or a URL and return a cleaned DataFrame."""
    df = pd.read_csv(filepath, parse_dates=["Date"], index_col="Date")

    # Rows without an `overall` rating cannot be labelled, so drop them.
    df = df.dropna(subset=["overall"])

    # Binary target: 1 when the overall rating is at least 4 stars.
    df["Great"] = (df["overall"] >= 4).astype(int)

    df = df.drop(
        columns=[
            "Notes",
            "Location",
            "Address",
            "URL",
            "Neighborhood",
            "Carrots",
            "Yelp",
            "Google",
            "Chips",
            "Temp",
            "Synergy",
            "Uniformity",
            "Fish",
            "Rice",
            "Beans",
            "Lettuce",
            "Tomato",
            "Cabbage",
            "Sauce",
            "Salsa.1",
            "Cilantro",
            "Bell peper",
            "Onion",
            "Pineapple",
            "Sour cream",
            "Taquito",
            "Chile relleno",
            "Reviewer",
            "Unreliable",
            "Zucchini",
            "Corn",
            "Sushi",
            "Avocado",
            "Bacon",
            "Mushroom",
            "Egg",
            "Queso",
            "Lobster",
            "Nopales",
            "NonSD",
        ]
    )

    # Remove the columns that still leak the label.
    df = df.drop(columns=["Rec", "overall"])

    # Normalise the `Beef` flag: "x"/"yes"/"y" (any case, padded or not) → 1;
    # everything else, including blanks and NaN → 0.
    df["Beef"] = (
        df["Beef"]
        .astype(str)
        .str.strip()
        .str.lower()
        .isin({"x", "yes", "y"})
        .astype(int)
    )

    return df


# Use it (works from any machine with internet access)
df = wrangle()
df.head()





| Date | Burrito | Cost | Hunger | Mass (g) | Density (g/mL) | Length | Circum | Volume | Tortilla | Meat | ... | Beef | Pico | Guac | Cheese | Fries | Pork | Chicken | Shrimp | Ham | Great |
|------|---------|------|--------|----------|----------------|--------|--------|--------|----------|------|-----|------|------|------|--------|-------|------|---------|--------|-----|-------|
| 2016-01-18 | California | 6.49 | 3.0 | NaN | NaN | NaN | NaN | NaN | 3.0 | 3.0 | ... | 1 | x | x | x | x | NaN | NaN | NaN | NaN | 0 |
| 2016-01-24 | California | 5.45 | 3.5 | NaN | NaN | NaN | NaN | NaN | 2.0 | 2.5 | ... | 1 | x | x | x | x | NaN | NaN | NaN | NaN | 0 |
| 2016-01-24 | Carnitas | 4.85 | 1.5 | NaN | NaN | NaN | NaN | NaN | 3.0 | 2.5 | ... | 0 | x | x | NaN | NaN | x | NaN | NaN | NaN | 0 |
| 2016-01-24 | Carne asada | 5.25 | 2.0 | NaN | NaN | NaN | NaN | NaN | 3.0 | 3.5 | ... | 1 | x | x | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 2016-01-27 | California | 6.59 | 4.0 | NaN | NaN | NaN | NaN | NaN | 4.0 | 4.0 | ... | 1 | x | NaN | x | x | NaN | NaN | NaN | NaN | 1 |


## Split the Dataset
Split the data into the feature matrix (`X`) and the target vector (`y`). `X` keeps only the numeric columns, so any column still stored as strings is left out; `y` is the `Great` column.

In [9]:
target = 'Great'
X = df.select_dtypes('number').drop(columns = target)
y = df[target]

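## Split Into Training and Test Sets
Split `X` and `y` into a training set (`X_train`, `y_train`) and a test set (`X_test`, `y_test`) by date: reviews before 2018-01-01 go to training, and later reviews go to testing.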

In [10]:
import datetime

# Reviews dated before the cutoff train the model; the rest are held out.
cutoff = datetime.datetime(2018, 1, 1)

mask = X.index < cutoff
X_train = X.loc[mask]
y_train = y.loc[mask]
X_test = X.loc[~mask]
y_test = y.loc[~mask]


## Establish a Baseline
The baseline is the accuracy we would get by always predicting the majority class in the training set. Any useful model has to beat it.

In [11]:
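# Class counts in the training set; the majority class is 0 (not great).
y_train.value_counts()

# Always predicting the majority class is right about 68% of the time.
baseline_acc = 0.68
print('Baseline Accuracy Score:', baseline_acc)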

Baseline Accuracy Score: 0.68
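
Rather than typing the baseline in by hand, it can be derived directly from the labels. A minimal sketch, using made-up labels (`y_demo`) in place of `y_train`; the variable names are assumptions for the example:

```python
import pandas as pd

# Made-up labels standing in for y_train (1 = Great, 0 = not Great).
y_demo = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0])

class_shares = y_demo.value_counts(normalize=True)  # share of each class
majority_class = class_shares.idxmax()              # most frequent label
baseline_demo = class_shares.max()                  # accuracy of always guessing it

print("Majority class:", majority_class)
print("Baseline accuracy:", round(baseline_demo, 2))
```

Applied to the real `y_train`, the same two lines would keep the baseline in sync if the cutoff date or the cleaning steps change.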


## 🧠 Model 1 – Baseline Logistic Regression  
Why logistic?  
* Simple, fast, interpretable odds ratios.  
* Gives a probability we can threshold for business trade-offs.  
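
To illustrate the "interpretable odds ratios" point, here is a minimal sketch on synthetic data. The column names and the data-generating rule are invented for the example, and the pipeline is a simplified stand-in for the notebook's model, not the model itself.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; column names are illustrative only.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Meat": rng.normal(3.5, 0.8, 200),
    "Cost": rng.normal(7.0, 1.5, 200),
})
# The label depends on Meat only, so Meat should get the larger odds ratio.
y_demo = (X_demo["Meat"] + rng.normal(0, 0.5, 200) > 3.5).astype(int)

demo_model = make_pipeline(StandardScaler(), LogisticRegression())
demo_model.fit(X_demo, y_demo)

# exp(coefficient) is the multiplicative change in the odds of class 1
# per one-standard-deviation increase in the (scaled) feature.
coefs = demo_model.named_steps["logisticregression"].coef_[0]
odds_ratios = pd.Series(np.exp(coefs), index=X_demo.columns).sort_values(ascending=False)
print(odds_ratios)
```

An odds ratio above 1 means the feature pushes a burrito toward the positive class, below 1 away from it, and close to 1 means little effect. Because the features are standardized first, the ratios are per standard deviation, not per raw unit.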

Pipeline:  
1. **OneHotEncoder** to encode any remaining categorical features (salsa heat, tortilla type, …).  
2. **SimpleImputer** to fill in missing values with the column mean.  
3. **StandardScaler** to standardize the features, which often improves performance in a logistic regression.  
4. **LogisticRegression** as the predictor.

In [13]:

!pip install category_encoders

from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model_logr = make_pipeline(
    OneHotEncoder(use_cat_names=True),  # one-hot encode categorical columns
    SimpleImputer(strategy='mean'),     # fill missing values with the column mean
    StandardScaler(),                   # rescale to zero mean, unit variance
    LogisticRegression(),
)

model_logr.fit(X_train, y_train)





### Check Metrics
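Calculate the training and test accuracy for `model_logr`. The prediction cell further down prints a single row for `X_test`, so the test set appears to hold only one review; a test accuracy of 1.0 then rests on one prediction and says little about generalization.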

In [14]:
training_acc = model_logr.score(X_train, y_train)
test_acc = model_logr.score(X_test, y_test)

# For a classifier, `.score` returns accuracy, not mean absolute error.
print('Training Accuracy:', training_acc)
print('Test Accuracy:', test_acc)

Training Accuracy: 0.8769633507853403
Test Accuracy: 1.0


### 🎯 Class Labels vs. Probabilities

In this step we ask the trained **logistic-regression model** for two different
kinds of answers:

| What we call             | Method                             | Shape returned   |   Meaning                                                                |
|--------------------------|------------------------------------|------------------|--------------------------------------------------------------------------|
| **Hard prediction**      | `model_logr.predict(X_test)`       | `(n_samples,)`   | The single class each row is assigned to (`0` = not great, `1` = great). |
| **Probability estimate** | `model_logr.predict_proba(X_test)` | `(n_samples, 2)` | The model’s confidence for *every* class; each row sums to **1.0**.      |

We print the first 10 rows of each so you can see the difference:  
*`y_pred`* shows crisp labels, while *`y_pred_prob`* shows the underlying
likelihoods that drive those labels.  
The latter is what we’ll use later for threshold-tuning and metrics such as
ROC-AUC.
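
As a preview of what the probabilities make possible, the sketch below computes ROC-AUC and compares a few decision thresholds. It runs on synthetic data from `make_classification`, not the burrito data, and the thresholds 0.3, 0.5 and 0.7 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data, used only to illustrate the mechanics.
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
proba_pos = clf.predict_proba(X_te)[:, 1]  # column 1 = P(class 1)

# ROC-AUC is computed from probabilities, not from hard labels.
auc = roc_auc_score(y_te, proba_pos)
print("ROC-AUC:", round(auc, 3))

# A stricter threshold labels fewer (or equally many) rows as positive.
for threshold in (0.3, 0.5, 0.7):
    labels = (proba_pos >= threshold).astype(int)
    print(
        threshold,
        "precision:", round(precision_score(y_te, labels, zero_division=0), 3),
        "recall:", round(recall_score(y_te, labels), 3),
    )
```

Raising the threshold can only keep recall the same or lower it, and it usually raises precision. That is the business trade-off a restaurant owner would tune.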

In [None]:
y_pred = model_logr.predict(X_test)
print("y_pred returns class predictions (0 or 1):", y_pred[:10])

y_pred_prob = model_logr.predict_proba(X_test)
print("y_pred_prob returns probability estimates:", y_pred_prob[:10])

# Summarise the first test row: its predicted label and the probability behind it.
label = "Great" if y_pred[0] == 1 else "not Great"
confidence = y_pred_prob[0].max()
print(f"In summary, the model predicts that the next burrito will be {label}, with a predicted probability of {confidence:.1%}.")

y_pred returns class predictions (0 or 1): [1]
y_pred_prob returns probability estimates: [[0.00109271 0.99890729]]
In summary, the model predicts that the next burrito will be Great, with a predicted probability of 99.9%.
