<font color='darkred'> Unless otherwise noted, **this notebook will not be reviewed or autograded.**</font> You are welcome to use it for scratchwork, but **only the files listed in the exercises will be checked.**

---

# Exercises

For these exercises, add your functions to the *app\.py* file and the *apputil\.py* file.

# Building a basic model

We'll build a [Python class](https://pythonbasics.org/class/) called `GroupEstimate` which takes in *categorical* data and corresponding *continuous* values, determines which group a new observation falls into, and "predicts" an estimate value based on the data provided.

## Exercise 1

### Part 1

Define a class `GroupEstimate` that accepts an `estimate` argument, which can be either `"mean"` or `"median"`.
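
As a starting point, the constructor might look like the minimal sketch below. The default of `"mean"`, the `VALID_ESTIMATES` attribute, and the choice to raise a `ValueError` for unsupported values are assumptions for illustration; the exercise only requires that `estimate` be accepted.

```python
class GroupEstimate:
    """Skeleton only: stores and validates the `estimate` choice."""

    # Allowed values, taken from the exercise text.
    VALID_ESTIMATES = ("mean", "median")

    def __init__(self, estimate="mean"):
        # Assumption: reject unsupported values early rather than at fit time.
        if estimate not in self.VALID_ESTIMATES:
            raise ValueError(
                f"estimate must be one of {self.VALID_ESTIMATES}, got {estimate!r}"
            )
        self.estimate = estimate


gm = GroupEstimate(estimate="median")
print(gm.estimate)  # median
```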

### Part 2

Add a `.fit(X, y)` method that takes in a pandas DataFrame of *categorical* data, `X`, and a 1-D array, `y`. There should be no missing values in `y`, and each row of `X` corresponds to the same "row" in `y`, so they should be the same length.

- Combine `X` and `y` into a shared pandas DataFrame.
- Group the DataFrame by the columns in `X`.
- For each group, calculate either the mean or median value of `y`, depending on the `estimate` argument.
- *Note: Your class should not "store" `X` or `y`.* Only "save" the data needed to accomplish Part 3, below.
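
The steps above can be sketched as a standalone function. This is one possible approach, not the required one: the function name `group_estimates` and the temporary column name `_y` are hypothetical, and inside the class the returned Series would be the only thing saved on `self`.

```python
import pandas as pd


def group_estimates(X, y, estimate="mean"):
    """Return one estimate of y per unique combination of the columns in X."""
    target = "_y"  # hypothetical name, assumed not to clash with X's columns
    combined = X.copy()
    combined[target] = list(y)  # align by position rather than by index
    # `estimate` is "mean" or "median"; pandas accepts both as aggregation names.
    return combined.groupby(list(X.columns))[target].agg(estimate)


X = pd.DataFrame({"loc_country": ["Guatemala", "Guatemala", "Mexico"],
                  "roast": ["Light", "Light", "Medium"]})
y = [88, 89, 90]

result = group_estimates(X, y, "mean")
print(result)
```

The result is indexed by the group columns, so it holds only one value per group rather than the original rows.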

### Part 3

Add a `.predict(X_)` method that takes in an array of observations (or a DataFrame) whose columns correspond to the columns in `X`, determines which group each observation falls into, and returns the corresponding estimates for `y`.

If an incoming category or combination of categories was missing in the original data, return `NaN` for that observation and print a message indicating the number of missing groups.
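
One way to do the lookup is a left merge against the table of fitted group estimates, sketched below. The `fitted` table, the `lookup` function, and the `estimate` column name are hypothetical stand-ins for whatever your `.fit` method saves. The sketch counts missing *observations*; counting distinct missing groups would be an equally reasonable reading of the requirement.

```python
import pandas as pd

# Hypothetical fitted table: one row per group seen during fitting.
fitted = pd.DataFrame({
    "loc_country": ["Guatemala", "Mexico"],
    "roast": ["Light", "Medium"],
    "estimate": [88.5, 91.0],
})


def lookup(fitted, X_new, columns):
    """Return an array of estimates, with NaN where the group was never seen."""
    new = pd.DataFrame(X_new, columns=columns)
    # A left merge keeps every incoming row, in order; unmatched rows get NaN.
    merged = new.merge(fitted, on=columns, how="left")
    n_missing = int(merged["estimate"].isna().sum())
    if n_missing:
        print(f"{n_missing} observation(s) belong to groups not seen during fit.")
    return merged["estimate"].to_numpy()


preds = lookup(fitted,
               [["Guatemala", "Light"], ["Canada", "Dark"]],
               ["loc_country", "roast"])
print(preds)
```

Here the first prediction is the stored estimate and the second is `NaN`, because the Canada and Dark combination is absent from `fitted`.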

### Example

For example, suppose we have a dataframe of coffee reviews and `X` includes two columns, *country* and *roast type*. We might want to predict the average *review score* for a new coffee from a given country and roast type, in which case we could run:

```python
X = df_raw[["loc_country", "roast"]]
y = df_raw["rating"]

gm = GroupEstimate(estimate='mean')
gm.fit(X, y)

X_ = [["Guatemala", "Light"],
      ["Mexico", "Medium"],
      ["Canada", "Dark"]]

gm.predict(X_)

>> [88.4, 91. ,  nan]  # say there are no Canadian dark roasts
```

## Bonus Exercise 2

Adjust your `GroupEstimate` class to handle the situation where the combination of categories is missing, but a particular category is not. That is, add to your `.fit` method an optional argument `default_category`. If a combination is missing, the estimate for `y` will be based solely on the group defined by `default_category`.

For example, suppose we have the code in the example above, but we replace the fit line with

```python
# ...
gm.fit(X, y, default_category="loc_country")
# ...

>> [4.5, 3.8, 3.1]
```

In this case, the missing value in that array would be filled with the average review score for all Canadian coffees, regardless of roast.

*Hint: consider the `observed` argument of the `groupby` [method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html), and go from there ...*
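
As an alternative to the `observed`-based route in the hint, the fallback can be sketched with two lookup tables: one per full combination and one per default category alone. The data and the variable names below are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

df = pd.DataFrame({
    "loc_country": ["Guatemala", "Guatemala", "Guatemala", "Mexico"],
    "roast": ["Light", "Light", "Medium", "Medium"],
    "rating": [88, 89, 85, 90],
})
columns = ["loc_country", "roast"]
default_category = "loc_country"

# Two lookup tables: one per full combination, one per default category alone.
by_combo = df.groupby(columns)["rating"].mean().rename("estimate").reset_index()
by_default = df.groupby(default_category)["rating"].mean()

new = pd.DataFrame([["Guatemala", "Dark"], ["Canada", "Dark"]], columns=columns)
merged = new.merge(by_combo, on=columns, how="left")

# Fill missing combinations from the single-category table.
# Categories never seen at all (Canada here) map to NaN and stay NaN.
fallback = merged[default_category].map(by_default)
preds = merged["estimate"].fillna(fallback).to_numpy()
print(preds)
```

The first row falls back to the Guatemala average because that combination is missing while the country is known; the second row stays `NaN` because Canada does not appear in the data at all.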

## Testing the GroupEstimate Class

Let's test the implementation with a simple example:

```python
# Import the class
import pandas as pd
import numpy as np
from apputil import GroupEstimate

# Create sample data - coffee reviews
data = {
    'loc_country': ['Guatemala', 'Guatemala', 'Guatemala', 'Mexico', 'Mexico', 'Brazil', 'Brazil', 'Brazil'],
    'roast': ['Light', 'Light', 'Medium', 'Medium', 'Medium', 'Light', 'Medium', 'Dark'],
    'rating': [88, 89, 85, 90, 92, 87, 86, 84]
}

df_raw = pd.DataFrame(data)
print("Sample data:")
print(df_raw)
print("\nGroup means:")
print(df_raw.groupby(['loc_country', 'roast'])['rating'].mean())
```

```
Sample data:
  loc_country   roast  rating
0   Guatemala   Light      88
1   Guatemala   Light      89
2   Guatemala  Medium      85
3      Mexico  Medium      90
4      Mexico  Medium      92
5      Brazil   Light      87
6      Brazil  Medium      86
7      Brazil    Dark      84

Group means:
loc_country  roast
Brazil       Dark      84.0
             Light     87.0
             Medium    86.0
Guatemala    Light     88.5
             Medium    85.0
Mexico       Medium    91.0
Name: rating, dtype: float64
```


```python
# Test Exercise 1 - Basic functionality with mean
X = df_raw[["loc_country", "roast"]]
y = df_raw["rating"]

gm = GroupEstimate(estimate='mean')
gm.fit(X, y)

# Test predictions with existing and missing combinations
X_test = [["Guatemala", "Light"],
          ["Mexico", "Medium"],
          ["Canada", "Dark"]]  # This combination doesn't exist

predictions = gm.predict(X_test)
print("\nTest predictions (with mean):")
print(f"Guatemala + Light: {predictions[0]}")
print(f"Mexico + Medium: {predictions[1]}")
print(f"Canada + Dark: {predictions[2]}")
```

```
Test predictions (with mean):
Guatemala + Light: 88.5
Mexico + Medium: 91.0
Canada + Dark: nan
```


```python
# Test with median instead
gm_median = GroupEstimate(estimate='median')
gm_median.fit(X, y)

predictions_median = gm_median.predict(X_test)
print("\nTest predictions (with median):")
print(f"Guatemala + Light: {predictions_median[0]}")
print(f"Mexico + Medium: {predictions_median[1]}")
print(f"Canada + Dark: {predictions_median[2]}")
```

```
Test predictions (with median):
Guatemala + Light: 88.5
Mexico + Medium: 91.0
Canada + Dark: nan
```


```python
# Test Bonus Exercise 2 - Using default_category
gm_default = GroupEstimate(estimate='mean')
gm_default.fit(X, y, default_category="loc_country")

# Canada never appears in the training data, so there is no country-level average to fall back on
X_test_bonus = [["Guatemala", "Light"],
                ["Mexico", "Medium"],
                ["Brazil", "Dark"],  # Exists in data
                ["Canada", "Dark"]]  # Country not in the data at all: no fallback is available

predictions_default = gm_default.predict(X_test_bonus)
print("\nTest predictions (with default_category='loc_country'):")
print(f"Guatemala + Light: {predictions_default[0]}")
print(f"Mexico + Medium: {predictions_default[1]}")
print(f"Brazil + Dark: {predictions_default[2]}")
print(f"Canada + Dark (fallback to Canada avg): {predictions_default[3]}")  # Still NaN, since Canada doesn't appear in the data at all
```

```
Test predictions (with default_category='loc_country'):
Guatemala + Light: 88.5
Mexico + Medium: 91.0
Brazil + Dark: 84.0
Canada + Dark (fallback to Canada avg): nan
```


```python
# Better test for default_category - use a missing combination that has a known country
X_test_bonus2 = [["Guatemala", "Light"],    # Exists: should be 88.5
                 ["Guatemala", "Dark"],     # Missing combo, but Guatemala exists - should use Guatemala avg
                 ["Brazil", "Light"]]       # Exists: should be 87

predictions_default2 = gm_default.predict(X_test_bonus2)
print("\nBetter test for default_category:")
print(f"Guatemala + Light (exists): {predictions_default2[0]}")
print(f"Guatemala + Dark (missing, fallback to Guatemala avg): {predictions_default2[1]}")
print(f"Brazil + Light (exists): {predictions_default2[2]}")

print("\nFor reference, country averages:")
print(df_raw.groupby('loc_country')['rating'].mean())
```

```
Better test for default_category:
Guatemala + Light (exists): 88.5
Guatemala + Dark (missing, fallback to Guatemala avg): 87.33333333333333
Brazil + Light (exists): 87.0

For reference, country averages:
loc_country
Brazil       85.666667
Guatemala    87.333333
Mexico       91.000000
Name: rating, dtype: float64
```
