# Student and Problem Set Info
---


## Title: MGSC 310: Problem Set 4

Author:

Ben Labaschin, King of the Notebooks, Destroyer of Worlds.


# Setup

Pre-req:
- Install the `ISLP` package
- Restart your runtime
- Load the `Caravan` dataset found [here](https://islp.readthedocs.io/en/latest/datasets/Caravan.html) (data dictionary included in documentation).
- Assign the `Caravan` data to the variable `caravan`

In [1]:
! pip install ISLP



In [2]:
from ISLP import load_data

caravan = load_data("Caravan")

## Question 1:

1. How many purchased insurance policies were there?
2. If our target variable is `Purchase`, is the dataset imbalanced? If so, by precisely how much?
3. Remap `No` values to 0 and `Yes` values to 1.

In [8]:
caravan['Purchase'].value_counts()

# 1. 348 purchases
# 2. Yes, it's imbalanced: 5474 "No" vs. 348 "Yes", a gap of 5474 - 348 = 5126 policies

No     5474
Yes     348
Name: Purchase, dtype: int64

In [10]:
# 3
mapping = {"No": 0, "Yes": 1}
caravan['Purchase'] = caravan['Purchase'].map(mapping)

In [12]:
caravan['Purchase'].value_counts()

0    5474
1     348
Name: Purchase, dtype: int64
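
To put a number on "precisely how much", the two counts can also be turned into a class share and a ratio. This is a minimal sketch in plain Python; the counts are copied from the `value_counts()` output above and the variable names are illustrative only.

```python
# Counts copied from the value_counts() output above
n_no, n_yes = 5474, 348

total = n_no + n_yes
minority_share = n_yes / total   # fraction of rows that are purchases
imbalance_ratio = n_no / n_yes   # "No" rows per "Yes" row

print(f"total rows: {total}")
print(f"share of purchases: {minority_share:.2%}")
print(f"No-to-Yes ratio: {imbalance_ratio:.1f} to 1")
```

On the real DataFrame, `caravan['Purchase'].value_counts(normalize=True)` should report the same shares directly.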

# Question 2: Balance the Data

1. Using the `sklearn` `resample` function, **downsample** the target `Purchase` such that the 0's and 1's are balanced.
 - This answer will not be counted as correct unless the downsampled features are joined with the downsampled targets into a single DataFrame, as demonstrated in the class code
 - Assign the resulting dataframe to the variable `caravan_downsampled`

In [15]:
from sklearn.utils import resample
from pandas import concat

# Separate the majority and minority classes
caravan_no = caravan[caravan['Purchase'] == 0]
caravan_yes = caravan[caravan['Purchase'] == 1]

# Downsample the majority class
caravan_no_downsampled = resample(caravan_no,
                                  replace=False,
                                  n_samples=len(caravan_yes),
                                  random_state=634)

# Combine the minority class with downsampled majority class
caravan_downsampled = concat([caravan_no_downsampled, caravan_yes])

caravan_downsampled['Purchase'].value_counts()

0    348
1    348
Name: Purchase, dtype: int64

# Question 3: K-Fold and Ridge Regularization

1. Run cross-validation on the `caravan_downsampled` data, fitting the `sklearn` [`RidgeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html) estimator (model) on each fold. Use every variable (except `Purchase`) as a feature, and `Purchase` as the target variable.\
\
Use the class code if you need help running k-fold and ridge regression. To make your life easier for the next problem, though, I would use the `cross_validate` function [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate). Setting its `cv` argument to, say, 3 gives you k-fold cross-validation with 3 folds. You can then pass `RidgeClassifier` as the estimator and request multiple metrics at once (`precision`, `recall`) to record the results. See [cross-validate with multiple metrics](https://scikit-learn.org/stable/modules/cross_validation.html#the-cross-validate-function-and-multiple-metric-evaluation) for more details.\
\
Here is some additional useful documentation:
   - [using kfold on a model examples](https://scikit-learn.org/stable/modules/cross_validation.html)
   - [more examples with kfold and models](https://www.askpython.com/python/examples/k-fold-cross-validation)
   - [official kfold documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)
2. Gather the metric results for [`precision`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html) and [`recall`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html) for each fold of the model, and append each to a list. That is, every per-fold `precision` result should be appended to one list, and every per-fold `recall` result to another.
 - Your result here will be two lists of your metrics.
 - If you use `cross_validate` from above, your life will be easier: simply pass the following predefined `precision` and `recall` "macros" to the `cross_validate` function. Fill in the `_`'s with the proper variables.

 ```python
 from sklearn.model_selection import cross_validate
 scoring = ['precision_macro', 'recall_macro']
 ... <other setup code>...
 scores = cross_validate(_, _, _, cv=3, scoring=scoring)

 ```

3. Average the `precision` and `recall` lists from above. Print your scores using `print`.



In [39]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import RidgeClassifier

scoring = ['precision_macro', 'recall_macro']

X = caravan_downsampled.drop("Purchase", axis=1)
y = caravan_downsampled["Purchase"]

clf = RidgeClassifier()
scores = cross_validate(clf, X, y, cv=3, scoring=scoring)
scores


{'fit_time': array([0.01399279, 0.02087998, 0.01889491]),
 'score_time': array([0.01860929, 0.0153892 , 0.01498532]),
 'test_precision_macro': array([0.6277436 , 0.66006564, 0.63996983]),
 'test_recall_macro': array([0.625     , 0.65948276, 0.63793103])}

In [41]:
print("Precision average: ", sum(scores['test_precision_macro'])/3)

Precision average:  0.642593024289638


In [42]:
print("Recall average: ",sum(scores['test_recall_macro'])/3)

Recall average:  0.6408045977011495
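
The question also allows a manual k-fold loop that appends each fold's metrics to a list. Below is a minimal sketch of that route, not the class code. It runs on a synthetic stand-in from `make_classification` (an assumption, so the numbers will not match the Caravan results), and it uses `StratifiedKFold` because that is the splitter `cross_validate` applies when `cv` is an integer and the estimator is a classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the features and target (assumption, not the Caravan data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

precisions, recalls = [], []
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X_demo, y_demo):
    # Fit a fresh model on the training part of this fold
    model = RidgeClassifier()
    model.fit(X_demo[train_idx], y_demo[train_idx])

    # Score it on the held-out part and record both metrics
    y_pred = model.predict(X_demo[test_idx])
    precisions.append(precision_score(y_demo[test_idx], y_pred, average="macro"))
    recalls.append(recall_score(y_demo[test_idx], y_pred, average="macro"))

print("Precision average:", sum(precisions) / len(precisions))
print("Recall average:", sum(recalls) / len(recalls))
```

With a DataFrame such as `caravan_downsampled`, the row selection inside the loop would need `.iloc[train_idx]` and `.iloc[test_idx]` instead of plain indexing.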


# Question 4: K-Fold and Ridge Regularization (Redux)

1. Run the EXACT same code as above, *except* on the `caravan` variable that hasn't been downsampled, i.e.:
 - Run cross-validation on the **`caravan`** data, fitting the `sklearn` [`RidgeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html) estimator (model) on each fold. Use every variable (except `Purchase`) as a feature, and `Purchase` as the target variable.\
 \
 Use the class code if you need help running k-fold and ridge regression. To make your life easier for the next problem, though, I would use the `cross_validate` function [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate). Setting its `cv` argument to, say, 3 gives you k-fold cross-validation with 3 folds. You can then pass `RidgeClassifier` as the estimator and request multiple metrics at once (`precision`, `recall`) to record the results. See [cross-validate with multiple metrics](https://scikit-learn.org/stable/modules/cross_validation.html#the-cross-validate-function-and-multiple-metric-evaluation) for more details.\
 \
 Here is some additional useful documentation:
    - [using kfold on a model examples](https://scikit-learn.org/stable/modules/cross_validation.html)
    - [more examples with kfold and models](https://www.askpython.com/python/examples/k-fold-cross-validation)
    - [official kfold documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)
 - Gather the metric results for [`precision`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html) and [`recall`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html) for each fold of the model, and append each to a list. That is, every per-fold `precision` result should be appended to one list, and every per-fold `recall` result to another.
  - Your result here will be two lists of your metrics.
  - If you use `cross_validate` from above, your life will be easier: simply pass the following predefined `precision` and `recall` "macros" to the `cross_validate` function. Fill in the `_`'s with the proper variables.

    ```python
    from sklearn.model_selection import cross_validate
    scoring = ['precision_macro', 'recall_macro']
    ... <other setup code>...
    scores = cross_validate(_, _, _, cv=3, scoring=scoring)
    ```

 - Average the `precision` and `recall` lists from above. Print your scores using `print`.



In [43]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import RidgeClassifier

scoring = ['precision_macro', 'recall_macro']

X = caravan.drop("Purchase", axis=1)
y = caravan["Purchase"]

clf = RidgeClassifier()
scores = cross_validate(clf, X, y, cv=3, scoring=scoring)
scores


{'fit_time': array([0.01863623, 0.02966475, 0.0284276 ]),
 'score_time': array([0.01331091, 0.01157928, 0.01239777]),
 'test_precision_macro': array([0.47005679, 0.97036082, 0.47008767]),
 'test_recall_macro': array([0.49890411, 0.50431034, 0.49972588])}

In [44]:
print("Precision average: ", sum(scores['test_precision_macro'])/3)

Precision average:  0.6368350958832655


In [45]:
print("Recall average: ",sum(scores['test_recall_macro'])/3)

Recall average:  0.5009801105365366
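
The 0.97 macro precision in the second fold does not mean the model did well there. The sketch below uses hypothetical confusion counts: they were chosen to be consistent with the fold sizes and the scores reported above, but the notebook never prints the actual confusion matrices, so treat them as an assumption. It shows how one lucky "Yes" prediction, or a handful of wrong ones, swings macro precision while macro recall stays near 0.5. The helper `macro_scores` is illustrative, not part of `sklearn`.

```python
def macro_scores(tn, fp, fn, tp):
    """Return (macro precision, macro recall) for a binary confusion matrix."""
    # Per-class precision: correct predictions of a class / all predictions of it
    prec_pos = tp / (tp + fp) if (tp + fp) else 0.0
    prec_neg = tn / (tn + fn) if (tn + fn) else 0.0
    # Per-class recall: correct predictions of a class / all true members of it
    rec_pos = tp / (tp + fn) if (tp + fn) else 0.0
    rec_neg = tn / (tn + fp) if (tn + fp) else 0.0
    return (prec_pos + prec_neg) / 2, (rec_pos + rec_neg) / 2

# Hypothetical fold A: one "Yes" prediction, and it happens to be correct
lucky_p, lucky_r = macro_scores(tn=1825, fp=0, fn=115, tp=1)

# Hypothetical fold B: four "Yes" predictions, all of them wrong
unlucky_p, unlucky_r = macro_scores(tn=1821, fp=4, fn=116, tp=0)

print(f"fold A: precision={lucky_p:.4f}, recall={lucky_r:.4f}")
print(f"fold B: precision={unlucky_p:.4f}, recall={unlucky_r:.4f}")
```

In both cases the model labels almost every row "No", which is why the macro recall barely moves from 0.5.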


# Question 5: Examining Our Results

1. Look at the average results for the downsampled data and the unaltered data. Which performed better overall?
2. Why do you think this dataset performed better?
3. Let's say we cared about predicting who would purchase insurance, and false positives weren't that costly to us (we're okay with being wrong). Which dataset (the downsampled data vs. the untouched data) and which specific metric (precision or recall) would you use, and why? i.e. downsampled (precision/recall) or untouched (precision/recall)

1. The downsampled data performed better. Both precision and recall were higher on average, and precision varied far less from fold to fold.
2. By randomly reducing the "noise" in the underlying dataset and balancing the target, the model was able to find more signal in the data; it wasn't trained on so much irrelevant data that it started finding patterns we didn't want. Further, when we downsample our data, we reduce the chance that the model fits to the "no purchase" class, which we care less about than the "purchase" class.
3. Downsampled data, recall. Since false positives aren't that important to us, and precision has false positives in its denominator, precision weighs information we care less about. Recall depends on true positives and false negatives. Its focus is on how many positive instances we actually labeled correctly, and that's what we care about.
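
The averages and the fold-to-fold spread can be checked directly from the scores reported above. This is a small sketch using only the standard library; the numbers are copied from the two `cross_validate` outputs, and the dictionary names are illustrative.

```python
from statistics import mean, stdev

# Fold scores copied from the two cross_validate outputs above
downsampled = {
    "precision": [0.6277436, 0.66006564, 0.63996983],
    "recall": [0.625, 0.65948276, 0.63793103],
}
untouched = {
    "precision": [0.47005679, 0.97036082, 0.47008767],
    "recall": [0.49890411, 0.50431034, 0.49972588],
}

for metric in ("precision", "recall"):
    for name, scores in (("downsampled", downsampled), ("untouched", untouched)):
        # Sample standard deviation across the three folds
        print(f"{name:>11} {metric:<9} "
              f"mean={mean(scores[metric]):.4f} stdev={stdev(scores[metric]):.4f}")
```

One thing this makes visible: recall on the untouched data is tightly clustered too, but it is clustered around 0.5, which for macro recall is what a model that almost always predicts "No" would score.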