<font color='darkred'>Unless otherwise noted, **this notebook will not be reviewed or autograded.**</font> You are welcome to use it for scratchwork, but **only the files listed in the exercises will be checked.**

---

# Exercises

For these exercises, you'll need to update the *apputil\.py* file and the *app\.py* file.

## Exercise 1

Recall the [simple streamlit app](https://github.com/leontoddjohnson/simple_streamlit) and the [coffee analysis data](https://raw.githubusercontent.com/leontoddjohnson/datasets/refs/heads/main/data/coffee_analysis.csv) it uses.

Write a Python script called `train.py` that does the following:

- Loads the [coffee analysis data](https://raw.githubusercontent.com/leontoddjohnson/datasets/refs/heads/main/data/coffee_analysis.csv) (from the URL).
- Trains a (Scikit-Learn) linear regression model to predict `rating` based on the single feature `100g_USD`.
- Saves the trained model in this repository as a pickle file called `model_1.pickle`.
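
The steps in the list above can be sketched roughly as follows. This is a minimal sketch, not the expected solution: the helper name `train_model_1` and the toy DataFrame are assumptions made for illustration (in `train.py`, the data would instead come from `pd.read_csv` on the URL above), and the model is written to a temporary directory here so the example does not touch the repository.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.linear_model import LinearRegression


def train_model_1(df, path):
    """Fit rating ~ 100g_USD and pickle the fitted model (hypothetical helper)."""
    # Drop rows where the feature or the target is missing.
    df_clean = df.dropna(subset=["100g_USD", "rating"])
    model = LinearRegression()
    model.fit(df_clean[["100g_USD"]], df_clean["rating"])
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return model


# Toy stand-in for the coffee data (made-up values, one missing price).
df_toy = pd.DataFrame({
    "100g_USD": [5.0, 10.0, None, 20.0],
    "rating": [90.0, 92.0, 93.0, 96.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_1.pickle"
    model = train_model_1(df_toy, path)
    # Reload to confirm that the pickle round-trips.
    with open(path, "rb") as f:
        reloaded = pickle.load(f)
```
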

## Exercise 2

Update the script to train a **Decision Tree Regressor** model that predicts `rating` based on *both* `100g_USD` and `roast`, and saves the trained model as `model_2.pickle`. Notice that the `roast` column is categorical, so you'll need to convert it into a numerical label format:

- Create a dictionary that maps *all* categories to a number (e.g., `roast_cat['Medium-Light'] = 1`).
- Use `.map` or `.apply` (in pandas) to create a numerical column to train your model.
- Save the dictionary as part of this process; you'll need it in the next exercise.

*Note: **Do not worry about model performance**. Interestingly, though, tree-based models like this tend to perform more efficiently with category labels than with one-hot encoded features.*
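
As a small illustration of the mapping step, the sketch below applies a category dictionary with `.map`. The dictionary values mirror the ones produced in the scratch cells of this notebook; the sample `roast` values are made up. One behavior worth knowing for the next exercise: `.map` with a dictionary returns `NaN` for any value that is not a key.

```python
import pandas as pd

# Category-to-label dictionary (same labels as in the scratch cells of this notebook).
roast_cat = {"Dark": 0, "Light": 1, "Medium": 2, "Medium-Dark": 3, "Medium-Light": 4}

# Made-up sample; "Very Light" is deliberately not a known category.
roast = pd.Series(["Medium-Light", "Dark", "Light", "Very Light"])

# Known categories become numbers; unknown ones become NaN.
roast_encoded = roast.map(roast_cat)

# Boolean mask marking the rows with an unseen roast value.
unseen = roast_encoded.isna()
```
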

## Exercise 3

Update the *apputil\.py* file to include a `predict_rating(df_X)` function that takes in a two-column dataframe, `df_X`, with columns `100g_USD` (numerical) and `roast` (in original text form), and returns an array containing corresponding predicted `rating` values. If a `roast` value is not one of the roast values in the training data, the function should only use the `100g_USD` value to make the prediction (recall `model_1.pickle`). Otherwise, it should use both features.
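
One possible shape for this fallback logic is sketched below. It is only a sketch under stated assumptions: the two models are trained inline on a made-up toy DataFrame so the example runs on its own, whereas in *apputil\.py* they would presumably be loaded from `model_1.pickle` and `model_2.pickle`. The names `model_1`, `model_2`, and `df_train` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in for the coffee data (made-up values).
df_train = pd.DataFrame({
    "100g_USD": [4.0, 6.0, 8.0, 10.0, 12.0, 14.0],
    "roast": ["Dark", "Light", "Medium", "Dark", "Light", "Medium"],
    "rating": [90.0, 92.0, 93.0, 91.0, 94.0, 95.0],
})
roast_cat = {"Dark": 0, "Light": 1, "Medium": 2}
df_train["roast_encoded"] = df_train["roast"].map(roast_cat)

# Assumption: in apputil.py these would be unpickled rather than trained here.
model_1 = LinearRegression().fit(df_train[["100g_USD"]], df_train["rating"])
model_2 = DecisionTreeRegressor(random_state=0).fit(
    df_train[["100g_USD", "roast_encoded"]], df_train["rating"]
)


def predict_rating(df_X):
    """Use both features when the roast is known, otherwise only 100g_USD."""
    roast_encoded = df_X["roast"].map(roast_cat)  # NaN for unseen roasts
    known = roast_encoded.notna().to_numpy()
    y_pred = np.empty(len(df_X), dtype=float)

    if known.any():
        X_known = pd.DataFrame({
            "100g_USD": df_X.loc[known, "100g_USD"],
            "roast_encoded": roast_encoded[known].astype(int),
        })
        y_pred[known] = model_2.predict(X_known)

    if (~known).any():
        # Fall back to the price-only model for unseen roast values.
        y_pred[~known] = model_1.predict(df_X.loc[~known, ["100g_USD"]])

    return y_pred
```

Splitting the rows with a boolean mask keeps the output in the same order as the input, even when known and unknown roast values are interleaved.
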

In [13]:
df_coffee_clean_data = df_coffee_data.dropna(subset=["100g_USD", "rating", "roast"])

In [14]:
roast_categories = df_coffee_clean_data["roast"].unique()

In [15]:
roast_categories

array(['Medium-Light', 'Medium', 'Light', 'Medium-Dark', 'Dark'],
      dtype=object)

In [16]:
roast_category = {category: idx for idx, category in enumerate(sorted(roast_categories))}

In [17]:
roast_category

{'Dark': 0, 'Light': 1, 'Medium': 2, 'Medium-Dark': 3, 'Medium-Light': 4}

In [20]:
# .assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# that direct column assignment triggers on the result of dropna.
df_coffee_clean_data = df_coffee_clean_data.assign(
    roast_encoded=df_coffee_clean_data["roast"].map(roast_category)
)


In [22]:
df_coffee_clean_data.columns

Index(['name', 'roaster', 'roast', 'loc_country', 'origin_1', 'origin_2',
       '100g_USD', 'rating', 'review_date', 'desc_1', 'desc_2', 'desc_3',
       'roast_encoded'],
      dtype='object')

In [23]:
X = df_coffee_clean_data[["100g_USD", "roast_encoded"]]
y = df_coffee_clean_data["rating"]

In [26]:
model_tree = DecisionTreeRegressor()
model_tree.fit(X, y)

DecisionTreeRegressor(criterion='squared_error', splitter='best', max_depth=None,
                      min_samples_split=2, min_samples_leaf=1,
                      min_weight_fraction_leaf=0.0, max_features=None,
                      random_state=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0)


In [29]:
with open("model_2.pickle", "wb") as f:
    # Save the fitted tree (model_tree), not the earlier linear model named `model`.
    pickle.dump({"model_tree": model_tree, "roast_cat": roast_category}, f)

In [25]:
from sklearn.tree import DecisionTreeRegressor

In [1]:
import pandas as pd

In [2]:
from sklearn.linear_model import LinearRegression
import pickle

In [4]:
df_coffee_data = pd.read_csv(r'https://raw.githubusercontent.com/leontoddjohnson/datasets/refs/heads/main/data/coffee_analysis.csv')

In [None]:
df_coffee_data

In [6]:
df_coffee_clean_data = df_coffee_data.dropna(subset=["100g_USD", "rating"])

In [7]:
X = df_coffee_clean_data[["100g_USD"]]  # Feature
y = df_coffee_clean_data["rating"]      # Target

In [8]:
model = LinearRegression()
model.fit(X, y)

LinearRegression(fit_intercept=True, copy_X=True, tol=1e-06, n_jobs=None,
                 positive=False)


In [9]:
with open("model_1.pickle", "wb") as f:
    pickle.dump(model, f)

In [None]:
import pandas as pd
from apputil import predict_rating

df_X = pd.DataFrame([
    [10.00, "Dark"],
    [15.00, "Very Light"]], 
    columns=["100g_USD", "roast"])
y_pred = predict_rating(df_X)
y_pred

## (Optional) Bonus Exercise

Vectorize the `desc_3` column in the coffee analysis data using TF-IDF vectorization. Train a linear regression model to predict `rating` based only on the vectorized text data, and save the trained model as `model_3.pickle`.

Adjust your `predict_rating` function so that it accepts a `text` argument, as in `predict_rating(X, text=True)`. Setting `text=True` indicates that `X` is an array of text strings (in the style of the reviews in `desc_3`), in which case the function should return predicted ratings based on the text.

Note: you'll need to figure out what to do when the input text contains words that were not in the training data!
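
One way to approach the unseen-word question, sketched below on made-up reviews and ratings, relies on how scikit-learn's `TfidfVectorizer` behaves: `transform` silently ignores tokens that were not in the vocabulary learned during `fit_transform`. This is a sketch rather than the required solution. The variable names are illustrative, and it assumes the fitted vectorizer would be saved alongside the model (for example, in the same pickle) so that new text can be transformed the same way at prediction time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

# Made-up reviews and ratings standing in for desc_3 and rating.
texts = [
    "Rich chocolate and caramel notes with a sweet finish.",
    "Bright citrus acidity with a floral aroma.",
    "Bold and smoky with a bitter finish.",
    "Delicate floral notes and a clean, sweet aftertaste.",
]
ratings = [94.0, 93.0, 90.0, 95.0]

# Learn the vocabulary and IDF weights from the training text only.
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(texts)
model_3 = LinearRegression().fit(X_text, ratings)

# "delightfull" and "coffee" are not in the vocabulary, so transform drops them.
X_new = vectorizer.transform(["A delightfull coffee with hints of chocolate and caramel."])
y_new = model_3.predict(X_new)

# Text made up entirely of unseen words becomes an all-zero row,
# so the prediction reduces to the model's intercept.
X_unseen = vectorizer.transform(["zzz qqq"])
y_unseen = model_3.predict(X_unseen)
```
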

In [None]:
X = pd.DataFrame([
    "A delightfull coffee with hints of chocolate and caramel.",
    "A strong coffee with a bold flavor and a smoky finish."], 
    columns=["text"])
y = predict_rating(X, text=True)
y