# Code

## 1. Permutation Importance with ELI5

Permutation importance is one of the most reliable ways to see the important features in a model. 

Its advantages:

1. Works on any type of model structure
2. Easy to interpret and implement
3. Consistent and reliable

Permutation importance of a feature is defined as the change in model performance when that feature is randomly shuffled.

PI is available through the eli5 package. Below are PI scores for an XGBoost Regressor model👇

The show_weights function displays the features that hurt the model's performance the most after being shuffled - i.e. the most important features.

![](../images/2022/6_june/eli5_pi.png)
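
The shuffling idea defined above can also be tried without eli5. The sketch below is only an illustration: it swaps in scikit-learn's own permutation_importance function and a random forest on synthetic data (both are my assumptions to keep the example self-contained, not the eli5/XGBoost setup from the screenshot).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, only 2 of which carry signal
X, y = make_regression(n_samples=500, n_features=5, n_informative=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each feature several times and record the drop in the model's score
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)

# Print features from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    mean, std = result.importances_mean[idx], result.importances_std[idx]
    print(f"feature {idx}: {mean:.3f} +/- {std:.3f}")
```

Computing the scores on a held-out set, as done here, shows which features matter for generalization rather than for memorizing the training data.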

## 2. ConfusionMatrixDisplay for a better confusion matrix

If you want much more control over how you display your confusion matrix in Sklearn, use the ConfusionMatrixDisplay class.

With the class, you can control how the X and Y labels look, what text they display, the colormap of the matrix and much more.

Besides, it has a from_estimator method that enables you to plot the matrix without having to generate predictions beforehand.

![](../images/2022/6_june/june-matrix_display.png)

## 3. Text representation of a decision tree

Sklearn allows you to print a text representation of a decision tree. Here is an example👇

After spending a minute reading the output, you can easily trace the prediction path for any sample in your dataset:

![](../images/2022/6_june/june-text_tree.png)
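
Here is a small sketch of the idea using export_text from sklearn.tree, assuming a shallow tree trained on the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Render the fitted tree as indented if/else style rules
report = export_text(tree, feature_names=list(iris.feature_names))
print(report)
```

Each indentation level is one split, and every leaf line ends with the predicted class.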

## 4. Default RMSE in Sklearn

I always found it strange that Root Mean Squared Error wasn't available in Sklearn given that it is such a popular metric.

Later, I found that I hadn't looked hard enough because it was available as a parameter inside mean_squared_error (squared=False)👇

![](../images/2022/6_june/june-rmse.png)
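
As far as I know, the squared parameter was deprecated in scikit-learn 1.4 in favor of a separate root_mean_squared_error function. The sketch below therefore takes the square root of the MSE with NumPy, which should work regardless of the installed version (the sample values are made up):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# RMSE is simply the square root of the mean squared error
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}")
```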

## 5. Plotting decision trees in Sklearn

Decision trees are everywhere. They have many variations and applications - CART boosted trees in XGBoost, regular and extremely randomized trees in Sklearn, trees of IsolationForest for outlier detection, etc.

So, it is crucial that you understand how they work. One way you can do this is by visualizing them via Sklearn:

![](../images/2022/6_june/june-viz_tree.png)
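
A minimal sketch with plot_tree from sklearn.tree, again assuming a small tree trained on Iris:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(12, 6))
# filled=True colors each node by its majority class
annotations = plot_tree(
    tree,
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
    filled=True,
    ax=ax,
)
```

In a notebook the figure renders inline. In a script, drop the backend line and call plt.show(), or save the figure with fig.savefig.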

## 6. Rules of thumb for fit/transform/fit_transform

Rules of thumb to differentiate between the fit/transform/fit_transform functions of Sklearn:

1. All sklearn transformers (e.g. OneHotEncoder, StandardScaler) must be fitted to the training data. When the "fit" function is called, the transformers learn statistical properties of the features like mean, median, variance, quartiles, etc. That's why any function that has "fit" in the name must be called on training data first.

2. The transform function behaves differently based on the estimator's purpose. It is called only after the "fit" function is run because most "transform" functions need the information learned from "fit". "transform" can be used on all sets as long as the "fit" function was called on the training set.

3. "fit_transform" should also be used only on training data. The only difference is that it learns the statistical properties of the training features and transforms those features in a single step.
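
The three rules above can be sketched with StandardScaler on made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50, scale=10, size=(100, 2))
X_test = rng.normal(loc=50, scale=10, size=(20, 2))

scaler = StandardScaler()

# fit_transform: learn mean/std from the training set and scale it in one step
X_train_scaled = scaler.fit_transform(X_train)

# transform: reuse the statistics learned from training, never refit on test data
X_test_scaled = scaler.transform(X_test)

print("learned means:", scaler.mean_)
```

Calling fit or fit_transform on the test set would leak its statistics into preprocessing, which is exactly what these rules guard against.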

## 7. The difference between micro, macro, weighted averages

What are the differences between micro, macro and weighted averages and why should you care?

In multi-class classification problems, metrics are often computed separately for each class. For example, in a 3-class problem, 3 precision scores are returned. We rarely want three numbers - we usually need a single global metric. That's where averaging methods come into play.

1. Macro average

This is a simple arithmetic mean. For example, if precision scores are 0.7, 0.8, 0.9, macro average would be their mean - 0.8. 

2. Weighted average

This method takes into account the class imbalance as metrics for each class are multiplied by the proportion of that class. For example, if there are 100 samples (30, 45, 25 for each class respectively) and the precision scores are .7, .8, .9, the weighted average would be:

0.3 * 0.7 + 0.45 * 0.8 + 0.25 * 0.9 = 0.795

3. Micro average

In single-label multi-class problems, micro average is the same as accuracy - it is calculated by dividing the number of correctly classified samples across all classes (true positives) by the total number of predictions made across all classes (true positives + false positives).

You should avoid micro average when you have an imbalanced problem. Instead, use macro if you don't care much for class contributions or weighted average when you do.
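
To make the three averages concrete, here is a small sketch with precision_score on made-up labels. It also re-checks the weighted-average arithmetic from the example above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 0, 2, 2, 1]

per_class = precision_score(y_true, y_pred, average=None)       # one score per class
macro = precision_score(y_true, y_pred, average="macro")        # plain mean of per-class scores
weighted = precision_score(y_true, y_pred, average="weighted")  # mean weighted by class counts
micro = precision_score(y_true, y_pred, average="micro")        # pooled over all samples
accuracy = accuracy_score(y_true, y_pred)

print(per_class, macro, weighted, micro, accuracy)

# The worked example from the text: class sizes 30/45/25, precisions 0.7/0.8/0.9
text_example = np.average([0.7, 0.8, 0.9], weights=[30, 45, 25])
print(text_example)
```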

## 8. Saving to parquet is much faster

Saving and loading Parquet files is much faster and less painful than working with CSV. Here is a comparison of how long it takes to save an 11GB dataframe to Parquet and CSV👇

![](../images/2022/6_june/june-parquet_vs_csv.png)

## 9. Parallel execution with joblib

Below is an example of how you can send a thousand HTTP requests in just 2.5 minutes with joblib👇

joblib enables you to fully utilize the cores in your CPU by writing parallel code for your large loops. As a result, you can execute a single function across multiple workers (separate processes by default, or threads if you ask for them), without wasting time and idle resources.

The library accepts any picklable function, like functions for image resizing, web scraping, file operations, etc.

![](../images/2022/6_june/june-joblib_parallel.png)
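
The pattern looks roughly like the sketch below. To keep it runnable offline, fake_request is a hypothetical stand-in that sleeps instead of making a real HTTP call, and prefer="threads" is my choice because waiting on the network is I/O-bound:

```python
import time

from joblib import Parallel, delayed


def fake_request(i):
    """Hypothetical stand-in for an HTTP request: wait briefly, then return a value."""
    time.sleep(0.05)
    return i * i


start = time.perf_counter()
# delayed() wraps the call so joblib can schedule it on one of the 4 workers
results = Parallel(n_jobs=4, prefer="threads")(
    delayed(fake_request)(i) for i in range(20)
)
elapsed = time.perf_counter() - start

print(results[:5], f"{elapsed:.2f}s")
```

For CPU-heavy functions, leaving out prefer="threads" lets joblib fall back to its default process-based backend, which sidesteps the GIL.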

## 10. Getting a scorer object from just the name

In a single project, you may evaluate your models using multiple metrics. Instead of importing them one by one from sklearn and polluting your namespace, you can use the "get_scorer" function of the metrics module.

Just pass the name of the metric you want and you get a scorer object ready to use👇

![](../images/2022/6_june/june-get_scorer.png)
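
A minimal sketch, assuming a Ridge model on synthetic data and the "neg_root_mean_squared_error" scorer name:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import get_scorer

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = Ridge().fit(X, y)

# A scorer is called as scorer(estimator, X, y) and predicts internally
scorer = get_scorer("neg_root_mean_squared_error")
score = scorer(model, X, y)
print(score)
```

Scorers follow a "higher is better" convention, which is why error metrics such as RMSE come back negated.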

## 11. Enabling categorical data support in XGBoost

XGBoost has experimental but very powerful support for categorical features. The main requirement is that you convert the features to Pandas' category data type before feeding them to XGBoost and switch the support on with enable_categorical=True👇

![](../images/2022/6_june/june-xgb_cats.png)

## 12. Set numeric display precision in Pandas

It is very annoying when Pandas shows long floats in scientific notation. I usually struggle with approximating close-to-zero floats. 

To prevent this, you can change the display option of Pandas to limit the floating point precision👇
![](../images/2022/6_june/june-pandas_precision.png)
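
One way to do this is the display.float_format option, sketched below on a made-up Series (the screenshot may use a different option, such as display.precision):

```python
import pandas as pd

s = pd.Series([0.000000123456, 3.14159265, 12345.6789])
print(s)  # default rendering

# Render every float with exactly four decimals, without scientific notation
pd.set_option("display.float_format", "{:.4f}".format)
rendered = repr(s)
print(rendered)

# Restore the default so the change does not leak into the rest of the session
pd.reset_option("display.float_format")
```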

## 13. XGBoost built-in encoder vs. OneHotEncoder

In the comparison below, pre-applying Sklearn's OneHotEncoder gave results 7 times worse than using the encoder that comes with XGBoost.

As can be seen, the RMSE score is 7 times worse when OneHotEncoder is pre-applied to the data👇

![](../images/2022/6_june/june-ordinal_vs_xgb.png)

## 14. Get all scorer's names in Sklearn

Sklearn has over 50 metrics to evaluate the performance of its models. To pass those metrics inside pipelines or GridSearch instances, you have to remember their text names. 

If you forget any of them, here is how you can print out the names of all the metrics👇

![](../images/2022/6_june/june-all_scorers.png)

## 15. Best overfitting advice

This is the best advice I read on combatting overfitting:

"To achieve the perfect fit, you must first overfit".

Here are the reasons why:

First, it makes sense - you can't fight overfitting without a model that overfits.

Second, it is a sign of power - if a model is overfitting or perfectly memorizing the training data, it is a sign that the model has enough optimization power to learn the patterns in the training data.

Solving ML problems is all about the tension between optimization (how well the model learns from training data) and generalization (how well the model performs on unseen data). 

Once you can build a model that is able to overfit, you should focus on generalization because too much optimization hurts it. You should try less complex model architectures, apply regularization, or add random dropout (DART trees of XGBoost or Dropout layers in TensorFlow) to rein in optimization and increase generalization.

You won't be able to do any of them unless you have a model that overfits.

## 16. Generate a synthetic dataset with outliers

Anomaly detection is a fascinating unsupervised problem. To practice solving it, you can use the PyOD (Python Outlier Detection) library's generate_data function.

Its features are:

1. Controlling the proportion of outliers in the data (contamination)
2. Choosing the number of features
3. Returning the inlier/outlier labels if desired

Here is an example 2-dimensional dataset generated with the function and visualized with Seaborn:

![](../images/2022/6_june/june-outlier_data.png)

## 17. Switch the APIs in XGBoost

If you use the Scikit-learn API of XGBoost, you might lose some of the advantages that come with its core training API.

For example, the models of the training API enable you to calculate Shapley values on GPUs, a feature that isn't available in XGBRegressor or XGBClassifier.

Here is how you can get around this problem by extracting the booster object👇
![](../images/2022/6_june/june-xgb_api.png)

## 18. Conditionals replaced by dictionaries

You can greatly simplify your conditional statements by using dictionaries. 

Of course this approach has its drawbacks, but I have used it to great effect in a project where I collapsed a nested conditional block over 100 lines into just a dozen.

![](../images/2022/6_june/june-conditional_dict.png)
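
Here is a toy sketch of the pattern. The function names and the file-extension mapping are made up for illustration:

```python
def describe_with_if(extension):
    """The conditional version: one branch per case."""
    if extension == ".csv":
        return "comma-separated values"
    elif extension == ".parquet":
        return "columnar binary format"
    elif extension == ".json":
        return "nested key-value text"
    else:
        return "unknown format"


# The dictionary version: the cases become data instead of control flow
DESCRIPTIONS = {
    ".csv": "comma-separated values",
    ".parquet": "columnar binary format",
    ".json": "nested key-value text",
}


def describe_with_dict(extension):
    # dict.get supplies the fallback that the else branch used to handle
    return DESCRIPTIONS.get(extension, "unknown format")


print(describe_with_dict(".parquet"))
```

The dictionary values can also be functions, which lets you dispatch to different behavior rather than just different constants.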

## 19. DTreeViz package to plot decision trees

Visualizing decision trees can be a very fun way of learning how they work. One of the best packages to perform this is the "dtreeviz" package. Here is a sample visual of a decision tree trained on the Iris dataset:

![](../images/2022/6_june/june-dtreeviz.png)

https://mljar.com/blog/visualize-decision-tree/output_19_0.svg

Credit: mljar.com

## 20. Set the max number of displayed rows and columns in Pandas

Isn't it frustrating when Pandas clips the output of dataframes when there are too many columns or rows? You can get rid of that pesky problem by setting the display options for the max number of columns and rows:

![](../images/2022/6_june/june-max_row_col.png)
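
A minimal sketch with pd.set_option (the value of 200 rows is an arbitrary choice):

```python
import pandas as pd

# None removes the limit entirely, so every column is shown
pd.set_option("display.max_columns", None)

# Show up to 200 rows before Pandas starts truncating
pd.set_option("display.max_rows", 200)

print(pd.get_option("display.max_columns"), pd.get_option("display.max_rows"))
```

Use pd.reset_option with the same option names to go back to the defaults.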

## 21. Caching functions in Python

Since version 3.9, Python has a simple "cache" decorator in the "functools" module (its older sibling "lru_cache" has been around since 3.2).

It is dead useful when working with recursive functions or functions that work with memory-heavy arguments. Here is an example use-case from Python docs:

![](../images/2022/6_june/june-cache.png)
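
A sketch along the lines of the factorial example in the Python docs:

```python
from functools import cache


@cache
def factorial(n):
    # Each distinct n is computed once, and later calls reuse the stored result
    return n * factorial(n - 1) if n else 1


first = factorial(10)   # fills the cache for 0..10
second = factorial(5)   # answered straight from the cache
third = factorial(12)   # only 11 and 12 are new computations

print(first, second, third)
print(factorial.cache_info())
```

Note that the arguments must be hashable, because they are used as dictionary keys in the cache.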

# Resources

## 1. Made With ML

There are so darn many MLOps tools right now. The only thing more numerous is the courses, books and resources on MLOps.

To learn MLOps properly, you only need a few high-quality resources, not dozens. The open-source Made With ML website is one of them.

It teaches everything you need to know to take models from idea to deployment. It has got separate chapters for

✅ Model development
✅ Packaging
✅ Deployment
✅ Testing
✅ Reproducibility

The project has over 30k stars on GitHub, making it one of the most popular repositories in the MLOps world.

Link to the website: https://madewithml.com/

![image.png](../images/2022/6_june/madewithml.png)

## 2. 460 free textbooks on math, science and statistics

If you are one of the people who get a dopamine hit by collecting more books than they need, this list is for you.

FreeCodeCamp put together a list of over 450 free texts on math, science and statistics, each a single click away: https://bit.ly/3ygDI4u

## 3. Yann LeCun's Deep Learning Course at CDS

A free deep learning course by Yann LeCun himself!

Yann LeCun is one of the three people who are considered the "Godfathers of AI and Deep Learning". He has a Turing Award as well, which is considered the Nobel Prize of computing.

If he is teaching a course and for free, you should definitely take it: https://cds.nyu.edu/deep-learning/

The course covers:

✅ Supervised and unsupervised deep learning
✅ Embedding methods
✅ Metric learning
✅ Convnets and RNNs

## 4. Awesome ML books by Awesome

Awesome ML books by Awesome!

Awesome is a GitHub project with over 200k GitHub stars that maintains a list of lists of all interesting things in programming and computing.

It has a separate list for free and open-source machine learning books as well: https://bit.ly/3ynmYZw

The list is awesome!

![](../images/2022/6_june/awesome_books.png)

## 5. Intro to Probability for Data Science by Michigan University (PDF)

Probability is the backbone of data science and machine learning. Let it penetrate your brain using this superb free book by Michigan University: https://probability4datascience.com/

## 6. Papers with Code datasets

Want to get your hands on high-quality datasets used in cutting-edge research?

The Papers with Code datasets page has a curated list of over 6500 such datasets with controls to filter by data type, task and language.

Link to the source: https://bit.ly/3QP50pH

![](../images/2022/6_june/datasets_pwc.gif)

## 7. Writing a scientific article from scratch

If you think writing a scientific article is easy and fun, you and I have very different interpretations of these words. 

Formal and technical writing is probably the most challenging type of writing. 

If done right, a scientific publication can open wide doors for career advancement and academic reputation. Learn to ace it with the guide below:

Download: https://bit.ly/3u2ti5U

Read: https://bit.ly/3njH0xC

![image.png](../images/2022/6_june/scientific_article.png)