
<div class="alert alert-info" role="alert">
  <p>
    <center><b>Usage Guidelines</b></center>
  </p>

  <p>
    This lesson is part of the <b>DS Lab core curriculum</b>. For that reason, this notebook can only be used on your WQU virtual machine.
  </p>

  <p>
    This means:
    <ul>
      <li><span style="color: red">ⓧ</span> No downloading this notebook.</li>
      <li><span style="color: red">ⓧ</span> No re-sharing of this notebook with friends or colleagues.</li>
      <li><span style="color: red">ⓧ</span> No downloading the embedded videos in this notebook.</li>
      <li><span style="color: red">ⓧ</span> No re-sharing embedded videos with friends or colleagues.</li>
      <li><span style="color: red">ⓧ</span> No adding this notebook to public or private repositories.</li>
      <li><span style="color: red">ⓧ</span> No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.</li>
    </ul>

  </p>
</div>


<font size="+3"><strong>5.1. Working with JSON files</strong></font>

In this project, we'll be looking at tracking corporate bankruptcies in Poland. To do that, we'll need to get data that's been stored in a `JSON` file, explore it, and turn it into a DataFrame that we'll use to train our model.

In [None]:
import gzip
import json

import pandas as pd
import wqet_grader
from IPython.display import VimeoVideo

wqet_grader.init("Project 5 Assessment")


In [None]:
VimeoVideo("694158732", h="73c2fb4e4f", width=600)

# Prepare Data

## Open

The first thing we need to do is access the file that contains the data we need. We've done this using multiple strategies before, but this time around, we're going to use the command line.

In [None]:
VimeoVideo("693794546", h="6e1fab0a5e", width=600)

**Task 5.1.1:** Open a terminal window and navigate to the directory where the data for this project is located.

- [What's the Linux command line?](../%40textbook/19-linux-command-line.ipynb)
- [Navigate a file system using the Linux command line.](../%40textbook/19-linux-command-line.ipynb)

As we've seen in our other projects, datasets can be large or small, messy or clean, and complex or easy to understand. Regardless of how the data looks, though, it needs to be saved in a file somewhere, and when that file gets too big, we need to *compress* it. Compressed files are easier to store because they take up less space. If you've ever come across a `ZIP` file, you've worked with compressed data.

The file we're using for this project is compressed, so we'll need to use a file utility called `gzip` to open it up.
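
To get a feel for what compression buys us, here is a minimal sketch that uses Python's built-in `gzip` module rather than the command-line utility. The `records` list is hypothetical toy data, not the project dataset.

```python
import gzip
import json

# Hypothetical toy records standing in for a large JSON dataset
records = [{"company_id": i, "feat_1": 0.5, "bankrupt": False} for i in range(1000)]
raw = json.dumps(records).encode("utf-8")

# Compress the bytes, then decompress them again
compressed = gzip.compress(raw)
restored = gzip.decompress(compressed)

print(len(raw), "bytes uncompressed")
print(len(compressed), "bytes compressed")
print("Round trip is lossless:", restored == raw)
```

Repetitive data like this compresses very well; real datasets will usually shrink less dramatically.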

In [None]:
VimeoVideo("693794604", h="a8c0f15712", width=600)

**Task 5.1.2:** In the terminal window, locate the data file for this project and decompress it.

- [What's gzip?](../%40textbook/19-linux-command-line.ipynb#gzip)
- [What's data compression?](../%40textbook/19-linux-command-line.ipynb#Data-Compressing)
- [Decompress a file using gzip.](../%40textbook/19-linux-command-line.ipynb#gzip)

In [None]:
VimeoVideo("693794641", h="d77bf46d41", width=600)

In [None]:
%%bash
cd data
gzip -dfk ./poland-bankruptcy-data-2009.json.gz

## Explore

Now that we've decompressed the data, let's take a look and see what's there.

In [None]:
VimeoVideo("693794658", h="c8f1bba831", width=600)

**Task 5.1.3:** In the terminal window, examine the first 10 lines of `poland-bankruptcy-data-2009.json`.

- [Print lines from a file in the Linux command line.](../%40textbook/19-linux-command-line.ipynb#Viewing-File-Contents)

In [None]:
%%bash
cd data
head poland-bankruptcy-data-2009.json

Does this look like any of the data structures we've seen in previous projects?

In [None]:
VimeoVideo("693794680", h="7f1302444b", width=600)

**Task 5.1.4:** Open `poland-bankruptcy-data-2009.json` by opening the `data` folder to the left and then double-clicking on the file. 👈

How is the data organized?

Curly brackets? Key-value pairs? It looks similar to a Python dictionary. It's important to note that JSON is not _exactly_ the same as a dictionary, but a lot of the same concepts apply. Let's try reading the file into a DataFrame and see what happens.

In [None]:
VimeoVideo("693794696", h="dd5b5ad116", width=600)

**Task 5.1.5:** Load the data into a DataFrame.

- [Read a JSON file into a DataFrame using pandas.](../%40textbook/03-pandas-getting-started.ipynb#JSON-Files)

In [None]:
df = pd.read_json('./data/poland-bankruptcy-data-2009.json')
df.head()

In [None]:
VimeoVideo("693794711", h="fdb009c4eb", width=600)

Hmmm. It looks like something went wrong, and we're going to have to fix it. Luckily for us, there's an error message to help us figure out what's happening here:

<code style="background-color:#FEDDDE;"><span style="color:#E45E5C">ValueError</span>: Mixing dicts with non-Series may lead to ambiguous ordering.
</code>
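
A likely way to trigger this message is to hand pandas a dictionary whose top-level values mix dictionaries with a plain list. The sketch below assumes a hypothetical `toy` dictionary shaped loosely like our file; it is not the project data.

```python
import pandas as pd

# Hypothetical structure: two top-level values are dicts, one is a list
toy = {
    "schema": {"fields": 2},
    "metadata": {"title": "toy"},
    "data": [{"company_id": 1}, {"company_id": 2}],
}

try:
    pd.DataFrame(toy)
    error_message = None
except ValueError as err:
    # pandas cannot decide how to align dict keys with list positions
    error_message = str(err)

print(error_message)
```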

What should we do? That error sounds serious, but the world is big, and we can't possibly be the first people to encounter this problem. When you come across an error, copy the message into a search engine and see what comes back. You'll get lots of results. The web has lots of places to look for solutions to problems like this one, and [Stack Overflow](https://stackoverflow.com/) is one of the best. [Click here to check out a possible solution to our problem.](https://stackoverflow.com/questions/57018859/valueerror-mixing-dicts-with-non-series-may-lead-to-ambiguous-ordering)

There are three things to look for when you're browsing through solutions on Stack Overflow.

1. **Context:** A good question is specific; if you click through that link, you'll see that the person asks a **specific** question, gives some relevant information about their OS and hardware, and then offers the code that threw the error. That's important, because we need...
2. **Reproducible Code:** A good question also includes enough information for you to reproduce the problem yourself. After all, the only way to make sure the solution actually applies to your situation is to see if the code in the question throws the error you're having trouble with! In this case, the person included not only the code they used to get the error, but the actual error message itself. That would be useful on its own, but since you're looking for an actual solution to your problem, you're really looking for...
3. **An answer:** Not every question on Stack Overflow gets answered. Luckily for us, the one we've been looking at did. There's a big green check mark next to the first solution, which means that the person who asked the question thought that solution was the best one.

Let's try it and see if it works for us too!

In [None]:
VimeoVideo("693794734", h="fecea6a81e", width=600)

**Task 5.1.6:** Using a context manager, open the file `poland-bankruptcy-data-2009.json` and load it as a dictionary with the variable name `poland_data`.

- [What's a context manager?](../%40textbook/02-python-advanced.ipynb#Create-files-using-Context-Manager)
- [Open a file in Python.](../%40textbook/02-python-advanced.ipynb#Create-files-using-Context-Manager)
- [Load a JSON file into a dictionary using Python.](../%40textbook/01-python-getting-started.ipynb#Working-with-Dictionaries)

In [None]:
# Open file and load JSON
with open('./data/poland-bankruptcy-data-2009.json', 'r') as arq:
    poland_data = json.load(arq)

print(type(poland_data))

Okay! Now that we've successfully opened up our dataset, let's take a look and see what's there, starting with the keys. Remember, the **keys** in a dictionary are categories of things in a dataset.

In [None]:
VimeoVideo("693794754", h="18e70f4225", width=600)

**Task 5.1.7:** Print the keys for `poland_data`.

- [List the keys of a dictionary in Python.](../%40textbook/01-python-getting-started.ipynb#Dictionary-Keys)

In [None]:
# Print `poland_data` keys
# for chave in poland_data.keys():
#     print(chave)

poland_data.keys()

`schema` tells us how the data is structured, `metadata` tells us where the data comes from, and `data` is the data itself.

Now let's take a look at the values. Remember, the **values** in a dictionary are ways to describe the variable that belongs to a key.

In [None]:
VimeoVideo("693794768", h="8e5b53b0ca", width=600)

**Task 5.1.8:** Explore the values associated with the keys in `poland_data`. What do each of them represent? How is the information associated with the `"data"` key organized?

In [None]:
# Continue Exploring `poland_data`

# poland_data["schema"].keys()
# poland_data["metadata"]
poland_data["data"][0]

This dataset includes all the information we need to figure out whether or not a Polish company went bankrupt in 2009. There are a bunch of features included in the dataset, each of which corresponds to some element of a company's balance sheet. You can explore the features by looking at the [data dictionary](./056-data-dictionary.ipynb). Most importantly, we also know whether or not the company went bankrupt. That's the last key-value pair.

Now that we know what data we have for each company, let's take a look at how many companies there are.

In [None]:
VimeoVideo("693794783", h="8d333027cc", width=600)

**Task 5.1.9:** Calculate the number of companies included in the dataset.

- [Calculate the length of a list in Python.](../%40textbook/01-python-getting-started.ipynb#Working-with-Lists)
- [List the keys of a dictionary in Python.](../%40textbook/01-python-getting-started.ipynb#Dictionary-Keys)

In [None]:
# Calculate number of companies
len(poland_data["data"])

And then let's see how many features were included for one of the companies.

In [None]:
VimeoVideo("693794797", h="3c1eff82dc", width=600)

**Task 5.1.10:** Calculate the number of features associated with `"company_1"`.

In [None]:
# Calculate number of features
len(poland_data["data"][0])

Since we're dealing with data stored in a JSON file, which is common for semi-structured data, we can't assume that all companies have the same features. So let's check!
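
To see why this check matters, here is a minimal sketch on hypothetical toy records, one of which is missing a feature. The names `companies`, `lengths`, and `incomplete` are illustrative only.

```python
# Hypothetical records; the third one is missing a feature
companies = [
    {"company_id": 1, "feat_1": 0.2, "bankrupt": False},
    {"company_id": 2, "feat_1": 0.7, "bankrupt": True},
    {"company_id": 3, "bankrupt": False},
]

# Collect the distinct record lengths; a consistent dataset yields one value
lengths = {len(company) for company in companies}
print(lengths)

# Compare each record's keys against the keys of the first record
expected_keys = set(companies[0])
incomplete = [c["company_id"] for c in companies if set(c) != expected_keys]
print(incomplete)
```

Comparing keys as well as lengths catches records that have the right number of entries but the wrong names.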

In [None]:
VimeoVideo("693794810", h="80e195944b", width=600)

**Task 5.1.11:** Iterate through the companies in `poland_data["data"]` and check that they all have the same number of features.

- [What's an iterator?](../%40textbook/02-python-advanced.ipynb#Iterators-and-Iterables)
- [Access the items in a dictionary in Python.](../%40textbook/01-python-getting-started.ipynb#Working-with-Lists)
- [Write a for loop in Python.](../%40textbook/01-python-getting-started.ipynb#Working-with-for-Loops)

In [None]:
# Iterate through companies
# for i in range(len(poland_data["data"])):
#     conferes = len(poland_data["data"][i]) == 66
#     if conferes == False:
#         print(poland_data["data"][i])

for company in poland_data["data"]:
    if len(company) != 66:
        print(company)

It looks like they do!

Let's put all this together. First, open up the compressed dataset and load it directly into a dictionary.

In [None]:
VimeoVideo("693794824", h="dbfc9b43ee", width=600)

**Task 5.1.12:** Using a context manager, open the file `poland-bankruptcy-data-2009.json.gz` and load it as a dictionary with the variable name `poland_data_gz`.

- [What's a context manager?](../%40textbook/02-python-advanced.ipynb#Create-files-using-Context-Manager)
- [Open a file in Python.](../%40textbook/02-python-advanced.ipynb#Create-files-using-Context-Manager)
- [Load a JSON file into a dictionary using Python.](../%40textbook/01-python-getting-started.ipynb#Working-with-Dictionaries)

In [None]:
# Open compressed file and load contents
with gzip.open('./data/poland-bankruptcy-data-2009.json.gz','r') as arq:
    poland_data_gz = json.load(arq)

print(type(poland_data_gz))

Since we now have two versions of the dataset — one compressed and one uncompressed — we need to compare them to make sure they're the same.

In [None]:
VimeoVideo("693794837", h="925b5e4e5a", width=600)

**Task 5.1.13:** Explore `poland_data_gz` to confirm that it contains the same data as `poland_data`, in the same format.

In [None]:
# Explore `poland_data_gz`
# poland_data_gz["schema"].keys()
# poland_data_gz["metadata"]
# len(poland_data_gz["data"])
# len(poland_data_gz["data"][0])
# poland_data_gz["data"][0]

Looks good! Now that we have an uncompressed dataset, we can turn it into a DataFrame using `pandas`.

In [None]:
VimeoVideo("693794853", h="b74ef86783", width=600)

**Task 5.1.14:** Create a DataFrame `df` that contains all the companies in the dataset, indexed by `"company_id"`. Remember the principles of *tidy data* that you learned in Project 1, and make sure your DataFrame has shape `(9977, 65)`.

- [Create a DataFrame from a dictionary in pandas.](../%40textbook/03-pandas-getting-started.ipynb#Dictionaries)

In [None]:
df = pd.DataFrame.from_dict(poland_data_gz["data"]).set_index("company_id")
print(df.shape)
df.head()

## Import

Now that we have everything set up the way we need it to be, let's combine all these steps into a single function that will decompress the file, load it into a DataFrame, and return it to us as something we can use.

In [None]:
VimeoVideo("693794879", h="f51a3a342f", width=600)

**Task 5.1.15:** Create a `wrangle` function that takes the name of a compressed file as input and returns a tidy DataFrame. After you confirm that your function is working as intended, submit it to the grader.

In [None]:
def wrangle(filename):

    with gzip.open(filename, 'r') as arq:
        data = json.load(arq)

    df = pd.DataFrame.from_dict(data["data"]).set_index("company_id")

    return df

In [None]:
df = wrangle("data/poland-bankruptcy-data-2009.json.gz")
print(df.shape)
df.head()
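
If you'd like to sanity-check this kind of function without touching the project file, one option is to build a tiny compressed file yourself. The sketch below assumes a hypothetical `payload` that mimics the `schema` / `metadata` / `data` layout we saw earlier; the file name `toy.json.gz` is made up.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Hypothetical payload that mimics the layout described in this lesson
payload = {
    "schema": {"fields": ["company_id", "feat_1", "bankrupt"]},
    "metadata": {"title": "toy example"},
    "data": [
        {"company_id": 1, "feat_1": 0.2, "bankrupt": False},
        {"company_id": 2, "feat_1": 0.7, "bankrupt": True},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.json.gz")

    # Mode "wt" writes text straight into the compressed file
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(payload, f)

    # Mode "rt" reads the compressed file back as text
    with gzip.open(path, "rt", encoding="utf-8") as f:
        loaded = json.load(f)

# Build a DataFrame from the list of records and index it by company
toy_df = pd.DataFrame(loaded["data"]).set_index("company_id")
print(toy_df.shape)
```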

In [None]:

wqet_grader.grade(
    "Project 5 Assessment",
    "Task 5.1.15",
    wrangle("data/poland-bankruptcy-data-2009.json.gz"),
)

---
Copyright 2023 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.





<font size="+3"><strong>5.2. Imbalanced Data</strong></font>

In the last lesson, we prepared the data.

In this lesson, we're going to explore some of the features of the dataset, use visualizations to help us understand those features, and develop a model that solves the problem of imbalanced data by under- and over-sampling.

In [None]:
import gzip
import json
import pickle

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import wqet_grader
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from IPython.display import VimeoVideo
from sklearn.impute import SimpleImputer
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

wqet_grader.init("Project 5 Assessment")


In [None]:
VimeoVideo("694058667", h="44426f200b", width=600)

# Prepare Data

## Import

As always, we need to begin by bringing our data into the project, and the function we developed in the previous module is exactly what we need.

In [None]:
VimeoVideo("694058628", h="00b4cfd027", width=600)

**Task 5.2.1:** Complete the `wrangle` function below using the code you developed in the last lesson. Then use it to import `poland-bankruptcy-data-2009.json.gz` into the DataFrame `df`.

- [<span id='technique'>Write a function in <span id='tool'>Python</span></span>.](../%40textbook/02-python-advanced.ipynb#Functions)

In [None]:
def wrangle(filename):

    # Open compressed file, load into dictionary

    # Load dictionary into DataFrame, set index

    return df

In [None]:
df = ...
print(df.shape)
df.head()

## Explore

Let's take a moment to refresh our memory on what's in this dataset. In the last lesson, we noticed that the data was stored in a JSON file (similar to a Python dictionary), and we explored the key-value pairs. This time, we're going to look at what the values in those pairs actually are.

In [None]:
VimeoVideo("694058591", h="8fc20629aa", width=600)

**Task 5.2.2:** Use the `info` method to explore `df`. What type of features does this dataset have? Which column is the target? Are there columns with missing values that we'll need to address?

- [Inspect a DataFrame using the `shape`, `info`, and `head` in pandas.](../%40textbook/03-pandas-getting-started.ipynb#Inspecting-DataFrames)

In [None]:
# Inspect DataFrame


That's solid information. We know all our features are numerical and that we have missing data. But, as always, it's a good idea to do some visualizations to see if there are any interesting trends or ideas we should keep in mind while we work. First, let's take a look at how many firms are bankrupt, and how many are not.

In [None]:
VimeoVideo("694058537", h="01caf9ae83", width=600)

**Task 5.2.3:** Create a bar chart of the value counts for the `"bankrupt"` column. You want to calculate the relative frequencies of the classes, not the raw count, so be sure to set the `normalize` argument to `True`.

- [What's a <span id='term'>bar chart</span>?](../%40textbook/06-visualization-matplotlib.ipynb#Bar-Charts)
- [What's a <span id='technique'>majority class</span>?](../%40textbook/14-ml-classification.ipynb#Majority-and-Minority-Classes)
- [What's a <span id='technique'>minority class</span>?](../%40textbook/14-ml-classification.ipynb#Majority-and-Minority-Classes)
- [What's a positive class?](../%40textbook/14-ml-classification.ipynb#Positive-and-Negative-Classes)
- [What's a negative class?](../%40textbook/14-ml-classification.ipynb#Positive-and-Negative-Classes)
- [Aggregate data in a Series using `value_counts` in pandas.](../%40textbook/04-pandas-advanced.ipynb#Working-with-value_counts-in-a-Series)
- [<span id='technique'>Create a bar chart using <span id='tool'>pandas</span></span>.](../%40textbook/07-visualization-pandas.ipynb#Bar-Charts)
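
As a quick illustration of the difference between raw counts and relative frequencies, here is a minimal sketch on a hypothetical Series named `toy_target` (not the project data).

```python
import pandas as pd

# Hypothetical target column: 9 solvent companies, 1 bankrupt
toy_target = pd.Series([False] * 9 + [True], name="bankrupt")

# Raw counts per class
raw_counts = toy_target.value_counts()

# Relative frequencies per class; these sum to 1
relative_freq = toy_target.value_counts(normalize=True)

print(raw_counts)
print(relative_freq)
```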

In [None]:
# Plot class balance


That's good news for Poland's economy! It looks like most of the companies in our dataset are doing all right for themselves. However, the chart also shows us that we have an imbalanced dataset, where our majority class is far bigger than our minority class. With that in mind, let's drill down a little further.

In the last lesson, we saw that there were 64 features of each company, each of which had some kind of numerical value. It might be useful to understand where the values for one of these features cluster, so let's make a boxplot to see how the values in `"feat_27"` are distributed.

In [None]:
VimeoVideo("694058487", h="6e066151d9", width=600)

**Task 5.2.4:** Use seaborn to create a boxplot that shows the distributions of the `"feat_27"` column for both groups in the `"bankrupt"` column. Remember to label your axes.

- [What's a <span id='term'>boxplot</span>?](../%40textbook/06-visualization-matplotlib.ipynb#Boxplots)
- [<span id='technique'>Create a boxplot using <span id='tool'>Matplotlib</span></span>.](../%40textbook/06-visualization-matplotlib.ipynb#Boxplots)

In [None]:
# Create boxplot

plt.xlabel("Bankrupt")
plt.ylabel("POA / financial expenses")
plt.title("Distribution of Profit/Expenses Ratio, by Class");

Why does this look so funny? Remember that boxplots exist to help us see the quartiles in a dataset, and this one doesn't really do that. Let's check the distribution of `"feat_27"` to see if we can figure out what's going on here.

In [None]:
VimeoVideo("694058435", h="8f0ae805d6", width=600)

**Task 5.2.5:** Use the `describe` method on the column for `"feat_27"`. What can you tell about the distribution of the data based on the mean and median?

In [None]:
# Summary statistics for `feat_27`


Hmmm. Note that the median is around 1, but the mean is over 1000. That suggests that this feature is skewed to the right. Let's make a histogram to see what the distribution actually looks like.
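
A small sketch can show how a single extreme value pulls the mean far above the median. The `values` list below is hypothetical and only meant to mimic the pattern.

```python
from statistics import mean, median

# Hypothetical values: mostly near 1, plus one extreme outlier
values = [0.8, 0.9, 1.0, 1.1, 1.2, 10_000.0]

toy_mean = mean(values)
toy_median = median(values)

# The outlier drags the mean upward but barely moves the median
print("mean:  ", toy_mean)
print("median:", toy_median)
```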

In [None]:
VimeoVideo("694058398", h="1078bb6d8b", width=600)

**Task 5.2.6:** Create a histogram of `"feat_27"`. Make sure to label the x-axis `"POA / financial expenses"` and the y-axis `"Count"`, and use the title `"Distribution of Profit/Expenses Ratio"`.

- [What's a histogram?](../%40textbook/06-visualization-matplotlib.ipynb#Histograms)
- [Create a histogram using Matplotlib.](../%40textbook/06-visualization-matplotlib.ipynb#Histograms)

In [None]:
# Plot histogram of `feat_27`

plt.xlabel("POA / financial expenses")
plt.ylabel("Count")
plt.title("Distribution of Profit/Expenses Ratio");

Aha! We saw it in the numbers and now we see it in the histogram. The data is very skewed. So, in order to create a helpful boxplot, we need to trim the data.

In [None]:
VimeoVideo("694058328", h="4aecdc442d", width=600)

**Task 5.2.7:** Recreate the boxplot that you made above, this time only using the values for `"feat_27"` that fall between the `0.1` and `0.9` quantiles for the column.

- [What's a boxplot?](../%40textbook/06-visualization-matplotlib.ipynb#Boxplots)
- [What's a quantile?](../%40textbook/05-pandas-summary-statistics.ipynb#Calculate-the-Quantiles-for-a-Series)
- [Calculate the quantiles for a Series in pandas.](../%40textbook/05-pandas-summary-statistics.ipynb#Calculate-the-Quantiles-for-a-Series)
- [Create a boxplot using Matplotlib.](../%40textbook/06-visualization-matplotlib.ipynb#Boxplots)
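
One way to trim a skewed column is to build a boolean mask from two quantiles. The sketch below uses a hypothetical Series named `toy` with one extreme value; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical skewed data: the integers 1 to 99 plus one extreme value
toy = pd.Series(list(range(1, 100)) + [100_000], name="feat")

# Find the 0.1 and 0.9 quantiles of the column
q_low, q_high = toy.quantile([0.1, 0.9])

# Keep only the rows whose value falls between the two quantiles
mask = toy.between(q_low, q_high)
trimmed = toy[mask]

print("Rows before:", len(toy), "max:", toy.max())
print("Rows after: ", len(trimmed), "max:", trimmed.max())
```

The same mask can be applied to a whole DataFrame, so the trimmed rows keep their class labels for plotting.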

In [None]:
# Create clipped boxplot

plt.xlabel("Bankrupt")
plt.ylabel("POA / financial expenses")
plt.title("Distribution of Profit/Expenses Ratio, by Bankruptcy Status");

That makes a lot more sense. Let's take a look at some of the other features in the dataset to see what else is out there.

<div class="alert alert-info" role="alert">
    <p><b>More context on <code>"feat_27"</code>:</b> <em>Profit on operating activities</em> is profit that a company makes through its "normal" operations. For instance, a car company profits from the sale of its cars. However, a company may have other forms of profit, such as financial investments. So a company's <em>total profit</em> may be positive even when its profit on operating activities is negative.</p>
    <p><em>Financial expenses</em> include things like interest due on loans, and do not include "normal" expenses (like the money that a car company spends on raw materials to manufacture cars).</p>
   </div>

**Task 5.2.8:** Repeat the exploration you just did for `"feat_27"` on two other features in the dataset. Do they show the same skewed distribution? Are there large differences between bankrupt and solvent companies?

In [None]:
# Explore another feature

Looking at other features, we can see that they're skewed, too. This will be important to keep in mind when we decide what type of model we want to use.

Another important consideration for model selection is whether there are any issues with multicollinearity among our features. Let's check.

In [None]:
VimeoVideo("694058273", h="85b3be2f63", width=600)

**Task 5.2.9:** Plot a correlation heatmap of features in `df`. Since `"bankrupt"` will be your target, you don't need to include it in your heatmap.

- [What's a <span id='term'>heatmap</span>?](../%40textbook/09-visualization-seaborn.ipynb#Correlation-Heatmaps)
- [<span id='technique'>Create a correlation matrix in <span id='tool'>pandas</span></span>.](../%40textbook/07-visualization-pandas.ipynb#Correlation-Matrices)
- [<span id='technique'>Create a heatmap in <span id='tool'>seaborn</span></span>.](../%40textbook/09-visualization-seaborn.ipynb#Correlation-Heatmaps)
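
To illustrate what multicollinearity looks like in a correlation matrix, here is a minimal sketch on a hypothetical DataFrame `toy_df` in which one feature is an exact multiple of another. Plotting is left out so the example stays small.

```python
import pandas as pd

# Hypothetical data: feat_b is exactly 2 * feat_a
toy_df = pd.DataFrame(
    {
        "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "feat_b": [2.0, 4.0, 6.0, 8.0, 10.0],
        "feat_c": [5.0, 3.0, 4.0, 1.0, 2.0],
        "bankrupt": [False, False, True, False, True],
    }
)

# Drop the target, then compute pairwise correlations between features
toy_corr = toy_df.drop(columns="bankrupt").corr()
print(toy_corr.round(2))
```

An off-diagonal value close to 1 or -1 signals a pair of features that carry nearly the same information.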

In [None]:
corr = ...


So what did we learn from this EDA? First, our data is imbalanced. This is something we need to address in our data preparation. Second, many of our features have missing values that we'll need to impute. And since the features are highly skewed, the best imputation strategy is likely median, not mean. Finally, we have multicollinearity issues, which means that we should steer clear of linear models, and try a tree-based model instead.

## Split

So let's start building that model. If you need a refresher on how and why we split data in these situations, take a look back at the Time Series module.

**Task 5.2.10:** Create your feature matrix `X` and target vector `y`. Your target is `"bankrupt"`.

- [What's a <span id='term'>feature matrix</span>?](../%40textbook/15-ml-regression.ipynb#Linear-Regression)
- [What's a <span id='term'>target vector</span>?](../%40textbook/15-ml-regression.ipynb#Linear-Regression)
- [<span id='technique'>Subset a DataFrame by selecting one or more columns in <span id='tool'>pandas</span></span>.](../%40textbook/04-pandas-advanced.ipynb#Subset-a-DataFrame-by-Selecting-One-or-More-Columns)
- [<span id='technique'>Select a Series from a DataFrame in <span id='tool'>pandas</span></span>.](../%40textbook/04-pandas-advanced.ipynb#Combine-multiple-categories-in-a-Series)

In [None]:
target = "bankrupt"
X = ...
y = ...

print("X shape:", X.shape)
print("y shape:", y.shape)

In order to make sure that our model can generalize, we need to put aside a test set that we'll use to evaluate our model once it's trained.

**Task 5.2.11:** Divide your data (`X` and `y`) into training and test sets using a randomized train-test split. Your test set should be 20% of your total data. And don't forget to set a `random_state` for reproducibility.

- [<span id='technique'>Perform a randomized train-test split using <span id='tool'>scikit-learn</span></span>.](../%40textbook/14-ml-classification.ipynb#Randomized-Train-Test-split)
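
Here is a minimal sketch of a randomized split on hypothetical toy data. The names `X_toy` and `y_toy` and the random state of `42` are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and target vector with 100 rows
X_toy = pd.DataFrame({"feat_1": range(100), "feat_2": range(100, 200)})
y_toy = pd.Series([False] * 90 + [True] * 10, name="bankrupt")

# Hold out 20% of the rows; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print("Train:", X_tr.shape, y_tr.shape)
print("Test: ", X_te.shape, y_te.shape)
```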

In [None]:
X_train, X_test, y_train, y_test = ...

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

Note that if we wanted to tune any hyperparameters for our model, we'd do another split here, further dividing the training set into training and validation sets. However, we're going to leave hyperparameters for the next lesson, so no need to do the extra split now.

## Resample

Now that we've split our data into training and test sets, we can address the class imbalance we saw during our EDA. One strategy is to resample the training data. (This will be different than the resampling we did with time series data in Project 3.) There are many ways to do this, so let's start with under-sampling.

In [None]:
VimeoVideo("694058220", h="00c3a98358", width=600)

**Task 5.2.12:** Create a new feature matrix `X_train_under` and target vector `y_train_under` by performing random under-sampling on your training data.

- [What is under-sampling?](../%40textbook/13-ml-data-pre-processing-and-production.ipynb#Under-sampling)
- [Perform random under-sampling using imbalanced-learn.](../%40textbook/13-ml-data-pre-processing-and-production.ipynb#Under-sampling)
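
To show the idea behind random under-sampling, the sketch below uses plain pandas as a stand-in; it is not the imbalanced-learn implementation. The DataFrame `toy_train` is hypothetical.

```python
import pandas as pd

# Hypothetical training data: 8 majority rows, 2 minority rows
toy_train = pd.DataFrame(
    {"feat_1": range(10), "bankrupt": [False] * 8 + [True] * 2}
)

# Size of the smallest class
n_minority = toy_train["bankrupt"].value_counts().min()

# Randomly keep only n_minority rows from each class
toy_under = toy_train.groupby("bankrupt").sample(n=n_minority, random_state=42)

print(toy_under["bankrupt"].value_counts())
```

Every class ends up as small as the minority class, so information from the majority class is thrown away.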

In [None]:
under_sampler = ...
X_train_under, y_train_under = ...
print(X_train_under.shape)
X_train_under.head()

<div class="alert alert-info" role="alert">
    <b>Note:</b> Depending on the random state you set above, you may get a different shape for <code>X_train_under</code>. Don't worry, it's normal!
</div>

And then we'll over-sample.

In [None]:
VimeoVideo("694058177", h="5cef977f2d", width=600)

**Task 5.2.13:** Create a new feature matrix `X_train_over` and target vector `y_train_over` by performing random over-sampling on your training data.

- [What is over-sampling?](../%40textbook/13-ml-data-pre-processing-and-production.ipynb#Over-sampling)
- [Perform random over-sampling using imbalanced-learn.](../%40textbook/13-ml-data-pre-processing-and-production.ipynb#Over-sampling)
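
Random over-sampling goes the other way: rows are drawn with replacement until every class is as large as the majority class. The sketch below again uses plain pandas as a stand-in for imbalanced-learn, on a hypothetical DataFrame.

```python
import pandas as pd

# Hypothetical training data: 8 majority rows, 2 minority rows
toy_train = pd.DataFrame(
    {"feat_1": range(10), "bankrupt": [False] * 8 + [True] * 2}
)

# Size of the largest class
n_majority = toy_train["bankrupt"].value_counts().max()

# Draw n_majority rows per class, allowing repeats
toy_over = toy_train.groupby("bankrupt").sample(
    n=n_majority, replace=True, random_state=42
)

print(toy_over["bankrupt"].value_counts())
```

Because minority rows are duplicated, a flexible model can memorize them, which is one reason to watch for overfitting.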

In [None]:
over_sampler = ...
X_train_over, y_train_over = ...
print(X_train_over.shape)
X_train_over.head()

# Build Model

## Baseline

As always, we need to establish the baseline for our model. Since this is a classification problem, we'll use accuracy score.

In [None]:
VimeoVideo("694058140", h="7ae111412f", width=600)

**Task 5.2.14:** Calculate the baseline accuracy score for your model.

- [What's <span id='tool'>accuracy score</span>?](../%40textbook/14-ml-classification.ipynb#Calculating-Accuracy-Score)
- [Aggregate data in a Series using `value_counts` in pandas.](../%40textbook/04-pandas-advanced.ipynb#Working-with-value_counts-in-a-Series)

In [None]:
acc_baseline = ...
print("Baseline Accuracy:", round(acc_baseline, 4))

Note here that, because our classes are imbalanced, the baseline accuracy is very high. We should keep this in mind because, even if our trained model gets a high test accuracy score, that doesn't mean it's actually *good.*
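
To make this concrete, here is a minimal sketch on a hypothetical target `y_toy`. It shows that the relative frequency of the majority class equals the accuracy of a "model" that always predicts that class.

```python
import pandas as pd

# Hypothetical target: 95 solvent companies, 5 bankrupt
y_toy = pd.Series([False] * 95 + [True] * 5, name="bankrupt")

# Baseline: relative frequency of the majority class
toy_baseline = y_toy.value_counts(normalize=True).max()

# Accuracy of always predicting the majority class
majority_class = y_toy.mode()[0]
acc_always_majority = (y_toy == majority_class).mean()

print("Baseline accuracy:", round(toy_baseline, 4))
print("Always-majority accuracy:", round(acc_always_majority, 4))
```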

## Iterate

Now that we have a baseline, let's build a model to see if we can beat it.

In [None]:
VimeoVideo("694058110", h="dc751751bf", width=600)

**Task 5.2.15:** Create three identical models: `model_reg`, `model_under` and `model_over`. All of them should use a `SimpleImputer` followed by a `DecisionTreeClassifier`. Train `model_reg` using the unaltered training data. For `model_under`, use the undersampled data. For `model_over`, use the oversampled data.

- [What's a <span id='term'>decision tree</span>?](../%40textbook/14-ml-classification.ipynb#Decision-Trees)
- [What's imputation?](../%40textbook/12-ml-core.ipynb#Imputation)
- [<span id='technique'>Create a pipeline in <span id='tool'>scikit-learn</span></span>.](../%40textbook/13-ml-data-pre-processing-and-production.ipynb#Creating-a-Pipeline-in-scikit-learn)
- [<span id='technique'>Fit a model to training data in <span id='tool'>scikit-learn</span></span>.](../%40textbook/15-ml-regression.ipynb#Fitting-a-Model-to-Training-Data)

In [None]:
# Fit on `X_train`, `y_train`
model_reg = ...
model_reg.fit(..., ...)

# Fit on `X_train_under`, `y_train_under`
model_under = ...
model_under.fit(..., ...)

# Fit on `X_train_over`, `y_train_over`
model_over = ...
model_over.fit(..., ...)

## Evaluate

How did we do?

In [None]:
VimeoVideo("694058076", h="d57fb27d07", width=600)

**Task 5.2.16:** Calculate training and test accuracy for your three models.

- [What's an accuracy score?](../%40textbook/14-ml-classification.ipynb#Evaluation)
- [Calculate the accuracy score for a model in scikit-learn.](../%40textbook/14-ml-classification.ipynb#Evaluation)

In [None]:
for m in [model_reg, model_under, model_over]:
    acc_train = ...
    acc_test = ...

    print("Training Accuracy:", round(acc_train, 4))
    print("Test Accuracy:", round(acc_test, 4))

As we mentioned earlier, "good" accuracy scores don't tell us much about the model's performance when dealing with imbalanced data. So instead of looking at what the model got right or wrong, let's see how its predictions differ for the two classes in the dataset.

In [None]:
VimeoVideo("694058022", h="ce29f57dee", width=600)

**Task 5.2.17:** Plot a confusion matrix that shows how your best model performs on your test set.

- [What's a confusion matrix?](../%40textbook/14-ml-classification.ipynb#Confusion-Matrix)
- [Create a confusion matrix using scikit-learn.](../%40textbook/14-ml-classification.ipynb#Confusion-Matrix)
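
The sketch below uses hypothetical labels to show why a confusion matrix is more revealing than accuracy alone. The "predictions" come from a model that never predicts the positive class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 9 negatives and 1 positive
y_true = [0] * 9 + [1]

# Hypothetical predictions from a model that always predicts the negative class
y_pred = [0] * 10

toy_acc = accuracy_score(y_true, y_pred)
toy_cm = confusion_matrix(y_true, y_pred)

# Rows are true classes, columns are predicted classes
print("Accuracy:", toy_acc)
print(toy_cm)
```

The accuracy looks respectable, yet the second row of the matrix shows that the one positive case was missed.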

In [None]:
# Plot confusion matrix


In this lesson, we didn't do any hyperparameter tuning, but it will be helpful in the next lesson to know the depth of the tree in `model_over`.

In [None]:
VimeoVideo("694057996", h="73882663cf", width=600)

**Task 5.2.18:** Determine the depth of the decision tree in `model_over`.

- [What's a decision tree?](../%40textbook/14-ml-classification.ipynb#Decision-Trees)
- [Access an object in a pipeline in scikit-learn.](../%40textbook/13-ml-data-pre-processing-and-production.ipynb#Accessing-an-Object-in-a-Pipeline)
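
Here is a minimal sketch of building an imputer-plus-tree pipeline and then reaching into it, using hypothetical toy arrays. The step name `"decisiontreeclassifier"` is the lowercase class name that `make_pipeline` assigns automatically.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data with one missing value
X_toy = np.array([[0.0], [1.0], [np.nan], [10.0], [11.0]])
y_toy = np.array([0, 0, 0, 1, 1])

# Median imputation followed by a decision tree
toy_model = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(random_state=42),
)
toy_model.fit(X_toy, y_toy)

# Access the fitted tree by its step name and ask for its depth
toy_tree = toy_model.named_steps["decisiontreeclassifier"]
toy_depth = toy_tree.get_depth()
print("Tree depth:", toy_depth)
```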

In [None]:
depth = ...
print(depth)

# Communicate

Now that we have a reasonable model, let's graph the importance of each feature.

In [None]:
VimeoVideo("694057962", h="f60aa3b614", width=600)

**Task 5.2.19:** Create a horizontal bar chart with the 15 most important features for `model_over`. Be sure to label your x-axis `"Gini Importance"`.

- [What's a <span id='term'>bar chart</span>?](../%40textbook/07-visualization-pandas.ipynb#Bar-Charts)
- [<span id='technique'>Access an object in a pipeline in <span id='tool'>scikit-learn</span></span>.](../%40textbook/13-ml-data-pre-processing-and-production.ipynb#Accessing-an-Object-in-a-Pipeline)
- [<span id='technique'>Create a bar chart using <span id='tool'>pandas</span></span>.](../%40textbook/07-visualization-pandas.ipynb#Bar-Charts)
- [<span id='technique'>Create a Series in <span id='tool'>pandas</span></span>.](../%40textbook/03-pandas-getting-started.ipynb#Adding-Columns)

In [None]:
# Get importances
importances = ...

# Put importances into a Series
feat_imp = ...

# Plot series

plt.xlabel("Gini Importance")
plt.ylabel("Feature")
plt.title("model_over Feature Importance");

There's our old friend `"feat_27"` near the top, along with features 34 and 26. It's time to share our findings.

Sometimes communication means sharing a visualization. Other times, it means sharing the actual model you've made so that colleagues can use it on new data or deploy your model into production. First step towards production: saving your model.

In [None]:
VimeoVideo("694057923", h="85a50bb588", width=600)

**Task 5.2.20:** Using a context manager, save your best-performing model to a file named `"model-5-2.pkl"`.

- [What's serialization?](../%40textbook/03-pandas-getting-started.ipynb#Pickle-Files)
- [Store a Python object as a serialized file using pickle.](../%40textbook/03-pandas-getting-started.ipynb#Pickle-Files)
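
As an illustration of serialization, the sketch below pickles a hypothetical dictionary to a temporary file and loads it back. The object `toy_object` and the file name `toy.pkl` are made up; a fitted model can be saved the same way.

```python
import os
import pickle
import tempfile

# Hypothetical object standing in for a trained model
toy_object = {"name": "toy-model", "params": {"max_depth": 3}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")

    # "wb" opens the file for writing bytes
    with open(path, "wb") as f:
        pickle.dump(toy_object, f)

    # "rb" opens the file for reading bytes
    with open(path, "rb") as f:
        restored_object = pickle.load(f)

print(restored_object == toy_object)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.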

In [None]:
# Save your model as `"model-5-2.pkl"`


In [None]:
VimeoVideo("694057859", h="fecd8f9e54", width=600)

**Task 5.2.21:** Make sure you've saved your model correctly by loading `"model-5-2.pkl"` and assigning it to the variable `loaded_model`. Once you're satisfied with the result, run the last cell to submit your model to the grader.

- [Load a Python object from a serialized file using pickle.](../%40textbook/03-pandas-getting-started.ipynb#Pickle-Files)

In [None]:
# Load `"model-5-2.pkl"`

print(loaded_model)

In [None]:
with open("model-5-2.pkl", "rb") as f:
    loaded_model = pickle.load(f)
    wqet_grader.grade(
        "Project 5 Assessment",
        "Task 5.2.16",
        loaded_model,
    )

