BloomTech Data Science

*Unit 2, Sprint 1, Module 4*

---

In [None]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [None]:
#Import Block
import pandas as pd
import matplotlib.pyplot as plt
from category_encoders import OneHotEncoder
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Module Project: Logistic Regression

Do you like burritos? 🌯 You're in luck then, because in this project you'll create a model to predict whether a burrito is `'Great'`.

The dataset for this assignment comes from [Scott Cole](https://srcole.github.io/100burritos/), a San Diego-based data scientist and burrito enthusiast. 

## Directions

The tasks for this project are the following:

- **Task 1:** Import the `csv` file using the `wrangle` function.
- **Task 2:** Conduct exploratory data analysis (EDA), and modify the `wrangle` function.
- **Task 3:** Split data into feature matrix `X` and target vector `y`.
- **Task 4:** Split feature matrix `X` and target vector `y` into training and test sets.
- **Task 5:** Establish the baseline accuracy score for your dataset.
- **Task 6:** Build `model_logr` using a pipeline that includes three transformers and a `LogisticRegression` predictor. Train the model on `X_train` and `y_train`.
- **Task 7:** Calculate the training and test accuracy score for your model.
- **Task 8:** Create a horizontal bar chart showing the 10 most influential features for your model.
- **Task 9:** Demonstrate and explain the differences between `model_logr.predict()` and `model_logr.predict_proba()`.

**Note** 

You should limit yourself to the following libraries:

- `category_encoders`
- `matplotlib`
- `pandas`
- `sklearn`

# I. Wrangle Data

In [None]:
def wrangle(filepath):
    # Import w/ DateTimeIndex
    df = pd.read_csv(filepath, parse_dates=['Date'],
                     index_col='Date')
    
    # Drop unrated burritos
    df.dropna(subset=['overall'], inplace=True)
    
    # Derive binary classification target:
    # We define a 'Great' burrito as having an
    # overall rating of 4 or higher, on a 5 point scale
    df['Great'] = (df['overall'] >= 4).astype(int)
    
    # Drop high cardinality categoricals
    df = df.drop(columns=['Notes', 'Location', 'Address', 'URL', 'Neighborhood'])
    
    # Drop columns to prevent "leakage"
    df = df.drop(columns=['Rec', 'overall'])

    # Binary encoding column fixes
    yes_cols = ['Chips']
    bin_cols = ['Unreliable', 'NonSD', 'Beef', 'Pico', 'Guac', 'Cheese','Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp',
                'Fish', 'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots', 'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro', 
                'Onion', 'Taquito', 'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster', 'Queso', 'Egg', 'Mushroom', 'Bacon', 
                'Sushi', 'Avocado', 'Corn', 'Zucchini']

    for col in bin_cols:
        df[col] = [1 if str(i).lower() == 'x' else 0 for i in df[col]]
    for col in yes_cols:
        df[col] = [1 if str(y).lower() == 'yes' else 0 for y in df[col]]

    # Dissect Burrito column into dummy columns
    bur_types = ['california', 'asada', 'surf', 'carnitas']
    for btype in bur_types:
        df[btype] = [1 if btype in str(b).casefold() else 0 for b in df['Burrito']]
    df.drop(columns=['Burrito'], inplace=True)
    
    return df

filepath = DATA_PATH + 'burritos/burritos.csv'

**Task 1:** Use the above `wrangle` function to import the `burritos.csv` file into a DataFrame named `df`.

In [None]:
filepath = DATA_PATH + 'burritos/burritos.csv'
df = wrangle(filepath)

During your exploratory data analysis, note that there are several columns whose data type is `object` but that seem to be a binary encoding. For example, `df['Beef'].head()` returns:

```
0      x
1      x
2    NaN
3      x
4      x
Name: Beef, dtype: object
```

**Task 2:** Change the `wrangle` function so that these columns are properly encoded as `0`s and `1`s. Be sure your code handles both upper- and lowercase `X`s, as well as `NaN`s.
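
As a minimal sketch of one way to do this encoding, the snippet below works on a small hand-made Series rather than the real data. The values and the variable names `beef` and `beef_encoded` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy column that mimics the 'x' / 'X' / NaN pattern (illustrative values only)
beef = pd.Series(['x', 'X', np.nan, 'x', None], name='Beef')

# Lowercase the strings, compare to 'x', and cast the boolean result to int.
# Missing values compare as False, so they become 0.
beef_encoded = beef.str.lower().eq('x').astype(int)

print(beef_encoded.tolist())
```

The same expression can be applied column by column inside `wrangle`; the list-comprehension approach used in this notebook gives the same result.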

In [None]:
# Conduct your exploratory data analysis here
# And modify the `wrangle` function above.
pd.options.display.max_columns = 60
#df.head(15)
#type(df.index)
#df.isna().sum()
# Share of burritos labeled 'Great' (the mean of a 0/1 column)
df['Great'].sum() / len(df['Great'])

0.4323040380047506

If you explore the `'Burrito'` column of `df`, you'll notice that it's a high-cardinality categorical feature. You'll also notice that there's a lot of overlap between the categories. 

**Stretch Goal:** Change the `wrangle` function above so that it engineers four new features: `'california'`, `'asada'`, `'surf'`, and `'carnitas'`. Each row should have a `1` or `0` based on the text information in the `'Burrito'` column. For example, here's how the first 5 rows of the dataset would look.

| **Burrito** | **california** | **asada** | **surf** | **carnitas** |
| :---------- | :------------: | :-------: | :------: | :----------: |
| California  |       1        |     0     |    0     |      0       |
| California  |       1        |     0     |    0     |      0       |
|  Carnitas   |       0        |     0     |    0     |      1       |
| Carne asada |       0        |     1     |    0     |      0       |
| California  |       1        |     0     |    0     |      0       |

**Note:** Be sure to also drop the `'Burrito'` column once you've engineered your new features.
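
The flag columns in the table above can be sketched with pandas string methods. This is only an illustration on five hand-typed values modeled on that table; the names `burrito` and `flags` are assumptions, not part of the assignment.

```python
import pandas as pd

# Toy values modeled on the first rows shown in the table above
burrito = pd.Series(['California', 'California', 'Carnitas', 'Carne asada', 'California'])

# For each keyword, flag rows whose lowercased name contains it.
# na=False treats a missing burrito name as "no match".
flags = pd.DataFrame({
    btype: burrito.str.lower().str.contains(btype, regex=False, na=False).astype(int)
    for btype in ['california', 'asada', 'surf', 'carnitas']
})

print(flags)
```

Because the match is a substring test, a name such as "Surf & Turf" would set `surf` to `1` without needing its own category.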

In [None]:
# Conduct your exploratory data analysis here
# And modify the `wrangle` function above.
#df = pd.get_dummies(df, columns=['Burrito']) <- too many columns
pd.options.display.max_columns=65
df.head(15)

# II. Split Data

**Task 3:** Split your dataset into the feature matrix `X` and the target vector `y`. You want to predict `'Great'`.

In [None]:
X = df.drop(columns=['Great'])
y = df['Great']

**Task 4:** Split `X` and `y` into a training set (`X_train`, `y_train`) and a test set (`X_test`, `y_test`).

- Your training set should include data from 2016 through 2017. 
- Your test set should include data from 2018 and later.
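
A time-based split like this can be sketched with a boolean mask on a `DatetimeIndex`. The frame below is hypothetical (the dates, the `Cost` values, and the name `toy` are made up for illustration) and stands in for the wrangled data.

```python
import pandas as pd

# Hypothetical frame with a DatetimeIndex, standing in for the wrangled data
dates = pd.to_datetime(['2016-01-18', '2017-06-30', '2017-12-31', '2018-01-01', '2019-05-04'])
toy = pd.DataFrame({'Cost': [6.5, 7.0, 8.25, 7.5, 9.0],
                    'Great': [0, 1, 1, 0, 1]}, index=dates)

# Everything strictly before 2018-01-01 goes to training, the rest to test
cutoff = pd.Timestamp('2018-01-01')
train = toy[toy.index < cutoff]
test = toy[toy.index >= cutoff]

print(len(train), len(test))
```

Note that a single cutoff also sends any rows dated before 2016 into the training set; if the data contains such rows and they should be excluded, a lower bound on the mask would be needed as well.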

In [None]:
date_mask = X.index < pd.Timestamp("2018")
X_train, y_train = X.loc[date_mask], y.loc[date_mask]
X_test, y_test = X.loc[~date_mask], y.loc[~date_mask]

# III. Establish Baseline

**Task 5:** Since this is a **classification** problem, you should establish a baseline accuracy score. Determine which class is the majority class in `y_train` and what percentage of your training observations it represents.
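
One way to read off the majority class and its share is `value_counts(normalize=True)`. The sketch below does not load the real data: it builds a stand-in Series (`y_toy`, a hypothetical name) with the same class counts that this notebook's baseline output reports for `y_train`, 223 zeros and 160 ones.

```python
import pandas as pd

# Stand-in labels with the class counts reported for y_train: 223 zeros, 160 ones
y_toy = pd.Series([0] * 223 + [1] * 160, name='Great')

# Relative frequency of each class, largest first
class_share = y_toy.value_counts(normalize=True)
majority_class = class_share.idxmax()
baseline_acc = class_share.max()

print(majority_class, round(baseline_acc, 4))
```

A model that always predicts the majority class would be right this often, so any fitted classifier should beat this number to be useful.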

In [None]:
print(y_train.value_counts())
# Always predict the majority class; its accuracy on the training set is the baseline
model_b = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = model_b.score(X_train, y_train)
print('Baseline Accuracy Score:', baseline_acc)

0    223
1    160
Name: Great, dtype: int64
Baseline Accuracy Score: 0.5822454308093995


# IV. Build Model

**Task 6:** Build a `Pipeline` named `model_logr`, and fit it to your training data. Your pipeline should include:

- a `OneHotEncoder` transformer for categorical features, 
- a `SimpleImputer` transformer to deal with missing values, 
- a [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) transformer (which often improves performance in a logistic regression model), and 
- a `LogisticRegression` predictor.
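
To see what the imputing and scaling steps do before the predictor, here is a small sketch on made-up numeric data. It leaves out the `OneHotEncoder` step so that it depends only on scikit-learn; `X_toy`, `y_toy`, and `pipe` are hypothetical names.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric features with one missing value (illustrative only)
X_toy = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 400.0], [4.0, 500.0]])
y_toy = np.array([0, 0, 1, 1])

pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler(), LogisticRegression())
pipe.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name
print(list(pipe.named_steps))

# Slicing off the last step applies only the transformers (imputer, then scaler)
X_scaled = pipe[:-1].transform(X_toy)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, which puts the logistic regression coefficients on a comparable footing across features.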

In [None]:
model_logr = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression()
).fit(X_train, y_train)

# V. Check Metrics

**Task 7:** Calculate the training and test accuracy score for `model_logr`.

In [None]:
# `Pipeline.score` returns accuracy when the final step is a classifier.
# For 0/1 labels, accuracy equals 1 minus the mean absolute error.
training_acc = model_logr.score(X_train, y_train)
test_acc = model_logr.score(X_test, y_test)

print(f'Training Accuracy: {training_acc:.4f}')
print(f'Test Accuracy: {test_acc:.4f}')

Training Accuracy: 0.9713
Test Accuracy: 0.7895


# VI. Communicate Results

**Task 8:** Create a horizontal bar chart that plots the 10 most important coefficients for `model_logr`, sorted by absolute value.

**Note:** Since you created your model using a `Pipeline`, you'll need to use the [`named_steps`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) attribute to access the coefficients in your `LogisticRegression` predictor. Be sure to look at the shape of the coefficients array before you combine it with the feature names.
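
The shape check mentioned in the note can be sketched on a tiny, made-up dataset. The column names `Cost` and `Hunger` and the values below are assumptions for illustration; the pipeline here has only a scaler and the predictor.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with named feature columns
X_toy = pd.DataFrame({'Cost': [5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
                      'Hunger': [3.0, 1.0, 4.0, 2.0, 5.0, 4.5]})
y_toy = pd.Series([0, 0, 0, 1, 1, 1])

pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_toy, y_toy)

# For a binary target, coef_ is 2-D: one row, one column per feature
coef = pipe.named_steps['logisticregression'].coef_
print(coef.shape)

# Take the single row so its length matches the list of feature names
coef_series = pd.Series(coef[0], index=X_toy.columns)
print(coef_series.sort_values(key=abs, ascending=False))
```

Passing the 2-D array directly to `pd.Series` would fail, which is why the first (and only) row is selected before pairing it with the names.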

In [None]:
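# Create your horizontal barchart here.
cvals = model_logr.named_steps['logisticregression'].coef_[0]  # coef_ has shape (1, n_features); take the single row
encoder = model_logr.named_steps['onehotencoder']
# Newer category_encoders releases expose get_feature_names_out();
# older 2.x releases only have get_feature_names()
if hasattr(encoder, 'get_feature_names_out'):
    feat = encoder.get_feature_names_out()
else:
    feat = encoder.get_feature_names()
top_feat = pd.Series(cvals, index=feat).sort_values(key=abs, ascending=False)
top_feat.head(10).plot(kind="barh")
plt.xlabel('Coefficient')
plt.ylabel('Feature')
plt.title('Coefficients for Logistic Regression');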


There is more than one way to generate predictions with `model_logr`. For instance, you can use [`predict`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression) or [`predict_proba`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression#sklearn.linear_model.LogisticRegression.predict_proba).

**Task 9:** Generate predictions for `X_test` using both `predict` and `predict_proba`. Then below, write a summary of the differences in the output for these two methods. You should answer the following questions:

- What data type do `predict` and `predict_proba` output?
- What are the shapes of their outputs?
- What numerical values are in the output?
- What do those numerical values represent?

In [None]:
# Write code here to explore the differences between `predict` and `predict_proba`.
pred, pred_p = model_logr.predict(X_test), model_logr.predict_proba(X_test)
print(type(pred), type(pred_p))
print(pred, pred_p)


<class 'numpy.ndarray'> <class 'numpy.ndarray'>
[1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 1 0 0 0 0 1 0 1 1 0 0 1 1 1 0 1 1 0 1 1 1
 1] [[1.30978269e-05 9.99986902e-01]
 [1.15928074e-03 9.98840719e-01]
 [9.27050807e-01 7.29491931e-02]
 [5.19430700e-05 9.99948057e-01]
 [9.99468732e-01 5.31267871e-04]
 [9.70650781e-01 2.93492187e-02]
 [1.25907165e-02 9.87409284e-01]
 [5.09193495e-05 9.99949081e-01]
 [5.17410667e-05 9.99948259e-01]
 [1.26379879e-01 8.73620121e-01]
 [7.49522436e-01 2.50477564e-01]
 [9.68595228e-01 3.14047718e-02]
 [3.25365297e-01 6.74634703e-01]
 [2.93828757e-01 7.06171243e-01]
 [1.46123584e-02 9.85387642e-01]
 [5.27026575e-02 9.47297342e-01]
 [3.05465785e-02 9.69453421e-01]
 [9.98692234e-01 1.30776631e-03]
 [9.97489097e-01 2.51090342e-03]
 [9.99552207e-01 4.47793065e-04]
 [9.97058824e-01 2.94117596e-03]
 [4.76062381e-03 9.95239376e-01]
 [9.48880319e-01 5.11196815e-02]
 [3.71845067e-02 9.62815493e-01]
 [4.18149823e-01 5.81850177e-01]
 [9.29305781e-01 7.06942189e-02]
 [9.94801501e-0

**Give your written answer here:**


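The `predict` method returns a one-dimensional NumPy array with one class label (`0` or `1`) per row of the input, here shape `(38,)`. Each value is the class the model assigns to that burrito.

The `predict_proba` method instead returns a two-dimensional NumPy array with one row per input row and one column per class, here shape `(38, 2)`. The values are floats between 0 and 1: the first column is the estimated probability that the burrito is not `'Great'` (class `0`), the second column is the estimated probability that it is (class `1`), and each row sums to 1. `predict` reports the class with the larger probability.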
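
The relationship between the two methods can be checked on a toy problem. This is a sketch with made-up one-feature data (`X_toy`, `y_toy`, and `clf` are hypothetical names), not the burrito model itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature data (illustrative only)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)
labels = clf.predict(X_toy)
probs = clf.predict_proba(X_toy)

# predict gives one label per row; predict_proba gives one column per class
print(labels.shape, probs.shape)

# Each row of probabilities sums to 1
print(np.allclose(probs.sum(axis=1), 1.0))

# predict returns the class whose probability is largest
print(np.array_equal(labels, clf.classes_[probs.argmax(axis=1)]))
```

The columns of `predict_proba` follow the order of `clf.classes_`, so for labels `0` and `1` the second column is the probability of class `1`.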
