# Code

## 1. Encoding categorical features with `pd.factorize`

You don't need to import Sklearn to encode categorical features if you are just data cleaning. Pandas will take care of you, as always!

Using the "factorize" function, you can encode categorical features into integers and get back both the numeric array and the unique values of the series. By default, the codes follow the order in which the categories first appear, so for ordinal features (categories with an ordering) check that the codes match the order you intend.

Missing values get encoded as -1 and won't be considered a new category. However, don't use this function after you've split the data into training and test sets. The encoding happens on a "first-seen" basis, so the same category can be assigned a different number in the test set than in the training set, depending on where it first appears.

![](../images/2022/8_august/august-factorize.png)
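Here is a minimal sketch of that behavior. The series and its values are made up for illustration; only `pd.factorize` and its `sort` parameter come from pandas itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
sizes = pd.Series(["small", "large", np.nan, "medium", "small"])

# Default: codes follow the order of first appearance, NaN becomes -1
codes, uniques = pd.factorize(sizes)
print(codes)          # [ 0  1 -1  2  0]
print(list(uniques))  # ['small', 'large', 'medium']

# sort=True: codes follow the sorted order of the unique values instead
codes_sorted, uniques_sorted = pd.factorize(sizes, sort=True)
print(codes_sorted)          # [ 2  0 -1  1  2]
print(list(uniques_sorted))  # ['large', 'medium', 'small']

# The "first-seen" pitfall: the same category gets different codes
# when two splits are factorized independently
train_codes, train_uniques = pd.factorize(pd.Series(["a", "b"]))
test_codes, test_uniques = pd.factorize(pd.Series(["b", "a"]))
print(list(train_uniques), list(test_uniques))  # ['a', 'b'] ['b', 'a']
```

In the last two lines, "b" is encoded as 1 in the first split and as 0 in the second, which is exactly why encoding should happen before splitting.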

## 2. ForAllPeople - universal metrics library in Python

ForAllPeople is a one-of-a-kind Python library that implements all the units in SI (International System of Units).

All the common and uncommon units in math, physics and chemistry are implemented as variable names and are calculation-aware. In other words, combining different units in a single calculation yields the correct derived unit, just as you would get by working through the physics formula by hand.

![](../images/2022/8_august/forallpeople.gif)

Link to the library: https://github.com/connorferster/forallpeople

## 3. Lovely Matplotlib Plots GitHub - library

How to make Matplotlib default styles unsuck, so you can boldly use the library anywhere? Use the "ipynb" style.

LovelyPlots is a package that loads a new "ipynb" Matplotlib theme into the installation. The purpose of this theme is to convert horrendous default Matplotlib styles to publication-level format for scientific papers, theses and presentations.

Just install it with "pip install LovelyPlots" and write "plt.style.use('ipynb')".

![](../images/2022/8_august/lovely1.png)

## 4. UMAP vs. tSNE vs. PCA

Which one is the fastest - PCA, tSNE or UMAP?

Each dimensionality reduction algorithm preserves the underlying structure of the data differently. But sometimes, you only care about reducing the dimensions of the dataset as fast as possible.

Below is a speed comparison of the three most-common reduction algorithms. As you can see, tSNE is orders of magnitude slower than others and PCA computes almost instantaneously. 

However, I would advise using UMAP for most of your use cases, as it offers a nice middle ground between performance and the quality of the reduction.

![](../images/2022/8_august/august-pca_tsne_umap.png)

## 5. MLOps.org

The official MLOps website - ml-ops.org.

Most of us still don't have a crystal clear idea of the global MLOps landscape. There are so many tools with overlapping features that claim to work for one area of the field but actually end up blurring the distinctions between the sub-fields of MLOps.

This official website will help you navigate the complex world of MLOps by outlining all the terminology, technology and processes that go into it. There are 9 guides on end-to-end ML lifecycles, levels and design principles of MLOps software.

Definitely check it out!

![](../images/2022/8_august/mlopsorg.png)

## 6. More compressed file saving with Joblib

You are wasting precious disk space if you are still using vanilla Joblib.

The "dump" function of the Joblib library has a "compress" parameter that accepts an integer from 0 to 9. The higher the number, the more compressed the file is, and the less space it takes up.

However, as you increase compression, the read and write times increase accordingly. So, a common middle ground is to use 3 or 4, with 0 being the default (no compression).

Below is an example of how you can save 50% of the disk space by going from level 0 to level 4 compression in Joblib.

![](../images/2022/8_august/august-joblib_compression.png)
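If you want to try this yourself, here is a rough sketch. The array is a made-up, highly repetitive example (so it compresses far better than typical real data), and the file names are arbitrary; the actual savings depend entirely on what you are saving.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical, very repetitive data so the compressor has something to work with
data = np.tile(np.arange(100), 1_000)

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "raw.joblib")
    packed_path = os.path.join(tmp, "packed.joblib")

    joblib.dump(data, raw_path)                 # compress=0 is the default
    joblib.dump(data, packed_path, compress=4)  # level 4 of 0-9

    raw_size = os.path.getsize(raw_path)
    packed_size = os.path.getsize(packed_path)

    # Loading works the same way for compressed files
    restored = joblib.load(packed_path)

print(f"uncompressed: {raw_size} bytes, compress=4: {packed_size} bytes")
```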

## 7. How to get total control over randomness in Python

How do you get total control over the randomness in your scripts and notebooks? It is not by using np.random.seed!

According to Robert Kern (a major NumPy contributor) and the Sklearn official user guide, you should use RNG instances for totally reproducible results.

You should replace every "random_state=None" with an instance of np.random.RandomState, so that every run of the script draws from the same sequence of random numbers and produces the same results. How RandomState (RNG) instances behave is particularly important when you use CV splitters.

You can read more about this from a StackOverflow discussion or a pretty detailed guide on controlling randomness by Sklearn:

SO thread: https://bit.ly/3A2hW5i
Sklearn guide: https://bit.ly/3SwbLh9
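As a small illustration, here is a sketch that passes a RandomState instance to a CV splitter. The helper function `fold_test_indices` and the toy data are my own assumptions, not something from the linked guides.

```python
import numpy as np
from sklearn.model_selection import KFold


def fold_test_indices(seed):
    """Hypothetical helper: split toy data twice with one shared RNG instance."""
    rng = np.random.RandomState(seed)
    cv = KFold(n_splits=3, shuffle=True, random_state=rng)
    X = np.arange(12).reshape(6, 2)

    # Each call to split() draws from the same RNG instance,
    # so the second call continues where the first one stopped
    first = [test.tolist() for _, test in cv.split(X)]
    second = [test.tolist() for _, test in cv.split(X)]
    return first, second


run_a = fold_test_indices(seed=0)
run_b = fold_test_indices(seed=0)

# Two "script runs" with the same seed give exactly the same folds
print(run_a == run_b)  # True
```

Note that the instance is consumed as you go: repeated calls to `split` on the same splitter are not guaranteed to return the same folds, which is the CV splitter behavior the Sklearn guide spends time on.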

## 8. There are no pure Python software engineers...

There are almost no pure Python software engineers...

All the rockstar contributors of popular packages like TensorFlow, Sklearn or NumPy have solid backgrounds from other OOP languages like C#, Java or C++. They know the design patterns of OOP code like the back of their hands and can apply those concepts abstractly to any other OOP language without a hitch.

That's why there is such a quality gap between everyday Python code and the code written on popular GitHub repos. You can't write that kind of quality software if you are coming from a pure Python background. 

That's also why there is such a shortage of good software engineering resources designed purely for Python. People who got their software engineering knowledge from other languages can apply their expertise to Python easily without needing to consult a book or a course.

As an example, the most popular book on OOP design principles in C++ has over 2000 ratings on Amazon while the same book on Python has a measly 46 ratings.

## 9. Hyperparameter tuning for multiple metrics with Optuna

It is a giant waste if you are hyperparameter tuning for multiple metrics in separate sessions.

Optuna allows you to create tuning sessions that enable you to tune for as many metrics as you want. Inside your Optuna objective function, simply measure your model using the metrics you want, like precision, recall and log loss, and return them separately.

Then, when you initialize a study object, specify whether you want Optuna to minimize or maximize each metric by providing a list of values to "directions".

![](../images/2022/8_august/august-optuna_multiple_metrics.png)

## 10. changedetection.io for web scraping

One of the heavy challenges of web scraping in data science is websites changing their HTML/JavaScript code.

A single class name change or the introduction of a new tag can totally break your scheduled web scrapers. And the hardest part is that websites change their internal markup so frequently that you don't even know what broke your scraper.

For such cases, you can use the open-source changedetection.io to watch out for website changes. By simply clicking the "Diff" button, you can see what changed and update your code accordingly.

![](../images/2022/8_august/changedetection_sample.png)

Link to the tool: https://changedetection.io/

## 11. GitHub README stats

GitHub profile stats for your READMEs!

If you always wondered how people generate those nice-looking profile stats, then you are in luck. Generating those stats is as easy as adding a single line of Markdown code with a link to your GitHub profile.

The tool's repository has 44k stars.

![](../images/2022/8_august/readme_stats.png)

Link to the repo: https://github.com/anuraghazra/github-readme-stats

## 12. Type I and Type II errors in statistics

If you need help remembering the difference between Type I and Type II errors, here is a helpful meme. You probably won't forget the difference for the rest of your life.

Source: effectsizefaq.com

![](../images/2022/8_august/errors_stats.jpg)

## 13. Using z-scores for outlier detection is paradoxical

I find using z-scores for outlier detection quite paradoxical. 

At the center of z-scores is the mean, which is the statistic most heavily influenced by extreme values. That's why I can't understand why z-score filtering became the most popular method for anomaly detection.

It is true that when your extreme values lie just outside the 1.5IQR range, the z-scores might be useful. However, who has the time to check that? 

To be absolutely safe, you can use the Median Absolute Deviation (MAD), which is built on the median: it is the median of the absolute distances between the samples and the median. MAD doesn't have distribution assumptions either, while z-scores need a normal distribution to work as expected.

![](../images/2022/8_august/august-mad.png)
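To make the paradox concrete, here is a small sketch on made-up numbers. The 0.6745 scaling constant and the 3.5 cutoff are a commonly used convention for the "modified z-score", not the only option, and the 3-sigma cutoff for plain z-scores is likewise just the usual rule of thumb.

```python
import numpy as np

# Hypothetical sample with one obvious outlier
values = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 55.0])

# Classic z-score: the outlier inflates both the mean and the std,
# so its own z-score ends up below the usual cutoff of 3
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > 3]

# MAD: median of absolute deviations from the median
median = np.median(values)
mad = np.median(np.abs(values - median))

# Modified z-score (conventional constant 0.6745, cutoff 3.5)
modified_z = 0.6745 * (values - median) / mad
mad_outliers = values[np.abs(modified_z) > 3.5]

print(z_outliers)    # []
print(mad_outliers)  # [55.]
```

The z-score filter lets 55 through because that single value dragged the mean and standard deviation along with it, while the median-based version flags it.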

## 14. I love `git stash` now

Recently, I came to love the "git stash" command. Here is how I am using it and how you can as well:

1. Put uncommitted changes on a "shelf" so you can come back to them later in your project. Such an awesome feature for quickly trying out new ideas without forgetting them and messing up the work you have done.

2. Removing uncommitted changes - you add something new but it doesn't work as expected. You have come so far along that you can't remember all the lines and files you have changed. Simply call "git stash -u", which quietly shelves all changes in tracked, untracked and staged files and leaves you with a clean working tree. If you never want them back, "git stash drop" deletes the stash.

3. Switch branches without that pesky error that says you have uncommitted changes. Git loves to tell you that you can't change branches unless you save the updates on the current branch. To get around this, I used to make temporary commits with a message that says "temp" (I think I would be fired for this in a real job). Now, I just stash the changes, switch branches to check something out, come back to the original branch and pop the stash! So easy.

## 15. Zipping arrays with varying lengths using `zip_longest`

If you want to zip two arrays with different lengths, use the "zip_longest" function from itertools.

The built-in "zip" function in Python stops at the end of the shortest array, so the extra elements of the longer array are discarded. By using zip_longest, you ensure that no element is omitted, and you can use a custom fill value to pad the shorter array.

![](../images/2022/8_august/august-zip_longest.png)
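A quick side-by-side on two made-up lists:

```python
from itertools import zip_longest

# Hypothetical lists of different lengths
names = ["Ann", "Bob", "Cara"]
scores = [90, 85]

# zip stops at the shortest input, so "Cara" is dropped
zipped = list(zip(names, scores))
print(zipped)  # [('Ann', 90), ('Bob', 85)]

# zip_longest keeps everything and pads with fillvalue (None by default)
padded = list(zip_longest(names, scores, fillvalue=0))
print(padded)  # [('Ann', 90), ('Bob', 85), ('Cara', 0)]
```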

## 16. Conditional looping with `filterfalse` of `itertools`

How do you perform conditional looping in Python without using "if" statements? By using the "filterfalse" function.

"filterfalse" accepts a boolean function (usually a lambda) that tells which elements should be discarded during the loop. For example, in the example below we are skipping numbers that are divisible by three.

In other words, we are only keeping the values that return "False" to the condition inside the looping function.

![](../images/2022/8_august/august-filterfalse.png)
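The same idea in text form, using an arbitrary range of numbers:

```python
from itertools import filterfalse

# Keep only the numbers for which the predicate returns False,
# i.e. drop everything divisible by three
kept = []
for number in filterfalse(lambda n: n % 3 == 0, range(1, 11)):
    kept.append(number)

print(kept)  # [1, 2, 4, 5, 7, 8, 10]
```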

## 17. Speed comparison of the fastest dimensionality reduction algorithms

Building on my earlier post this week, here is a more detailed comparison of the speed of the fastest dimensionality reduction algorithms.

As you can see, tough-old PCA needs almost the same execution time even if you increase the dataset size 5 times. As for the tSNEs, they are embarrassing.

![](../images/2022/8_august/umap_pca_comparison.png)

Source: https://bit.ly/3JAN4fj

## 18. Anatomy of Matplotlib

A plot that is worth a thousand plots.

![](../images/2022/8_august/matplotlib_anatomy.png)

Source: https://bit.ly/3P6gq6H

## 19. First swiss army knife of Matplotlib

The first swiss army knife of Matplotlib - plt.getp

The "plt.getp" function is one of the most flexible and useful functions in all of Matplotlib. And yet, so few use it.

When you call "plt.getp" on any Matplotlib object, it returns the current values of its attributes. You can call it on literally anything - the dots of scatterplots, the lines of bar charts, the spines of axes, the tick locators, the figure itself and it lists all the things you can change about that plot.

![](../images/2022/8_august/august-getp.png)
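A minimal sketch on a throwaway line plot (the data and the "Agg" backend choice are just there to make it run anywhere, including without a display):

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
(line,) = ax.plot([0, 1, 2], [0, 1, 4], linewidth=2.5)

# Ask for a single property by name...
width = plt.getp(line, "linewidth")
print(width)  # 2.5

# ...or print every property of the object and its current value
plt.getp(line)

plt.close(fig)
```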

## 20. Second swiss army knife of Matplotlib

The second swiss army knife of Matplotlib - plt.setp

"plt.setp" is one of the most flexible and useful functions in all of Matplotlib. And yet, so few use it (yep, that is a shameless "almost" duplicate of my last post :)

Calling "setp" on any Matplotlib object without any arguments will print a list of all its attributes and what values they accept. Based on that information, you can change whatever aspect of your plot using only a single function.

Combine it with "plt.getp" and you have almost everything you need to infinitely customize your plots.

![](../images/2022/8_august/august-setp.png)
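And a matching sketch for "setp", again on a throwaway plot with made-up values:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
(line,) = ax.plot([0, 1, 2], [0, 1, 4])

# No keyword arguments: print the settable properties and accepted values
plt.setp(line)

# Change several properties of one object in a single call
plt.setp(line, linewidth=3, linestyle="--", color="red")

# setp also accepts a list of objects, e.g. all x tick labels at once
plt.setp(ax.get_xticklabels(), rotation=45)

# Read the values back with getp to confirm the change
new_width = plt.getp(line, "linewidth")
new_color = plt.getp(line, "color")
print(new_width, new_color)

plt.close(fig)
```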

## 21. Creating function clones with certain arguments fixed

How can you literally freeze a Python function? By using functools.partial!

The "partial" function from functools can freeze certain arguments of a function and create a new instance with a much simplified signature. For example, below we are "cloning" the "read_csv" function so that 4 of its arguments are always fixed at custom values.

Now, you can use "partial_read_csv" just like pd.read_csv - you can even override the arguments you specified while copying the function.

![](../images/2022/8_august/august-partial.png)
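Here is a runnable sketch of the same pattern. The four frozen arguments and the tiny in-memory CSV are my own picks for illustration and may differ from the ones in the screenshot.

```python
import io
from functools import partial

import pandas as pd

# Freeze four arguments of pd.read_csv (values chosen for illustration)
partial_read_csv = partial(
    pd.read_csv,
    sep=";",
    na_values=["?"],
    usecols=["id", "price"],
    dtype={"id": "int64"},
)

# Hypothetical semicolon-separated file, kept in memory for the demo
raw = "id;price;note\n1;9.5;a\n2;?;b\n"

# Behaves like pd.read_csv with the frozen arguments applied
df = partial_read_csv(io.StringIO(raw))
print(df)

# Frozen arguments can still be overridden at call time
df_all = partial_read_csv(io.StringIO(raw), usecols=None)
print(list(df_all.columns))  # ['id', 'price', 'note']
```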

# Resources

## 22. 43 machine learning rules and best practices for ML Engineers by Google

> Do machine learning like the great engineer you are, not like the great machine learning expert you aren’t.

That is a quote from the awesome article by Google Developers that outlines 43 machine learning rules and best practices. Among the 43, there are great pieces of advice like:

1. You don't always have to use machine learning.
2. Watch for silent failures (and what they are).
3. Design and implement the metrics before the models.
4. Your first model should be stupidly simple, like LogReg or LinReg.

Link: https://bit.ly/3A1JszN

## 23. How to remember all classification metrics forever?

Awesome article on how to remember the difference between classification metrics forever. Now, you don't have to inwardly curse sensitivity and specificity. 

Link to the article: https://bit.ly/3bx7Sbf

## 24. OpenImages - GoogleAPIs

Close to 100 million images, with ~20k categories annotated!

Open Images Dataset V6+ is an open-source repository of almost 100 million high-quality images with over 20k categories annotated for image classification. There are also special annotations for instance segmentation, object detection (bounding boxes), etc.

The website has filters for keyword search and download. 

![](../images/2022/8_august/opemimages.png)

Link: https://bit.ly/3Q1Nu0K

## 25. PySnooper - never use another logging library ever again!

With PySnooper, you won't have to use print statements or logging functions ever again!

As you can see in the image, PySnooper traces every line of your script and detects new variables and how they change as they go through loops.

Tools like this are super helpful when working with loooong loops.

![](../images/2022/8_august/pysnooper.png)

Link to the library: https://github.com/cool-RR/PySnooper

## 26. Hundreds of Jupyter notebook templates for various tasks

Naas Jupyter Notebook templates - the largest repository of hundreds of production-ready Jupyter Notebook templates.

The GitHub repo is part of the "Awesome" project series on GitHub and collects useful, ready-to-run notebooks on various petty tasks that would otherwise have been too cumbersome to implement yourself.

The only disadvantage is scrolling through the categories to find what you are looking for. They should put up a website with a search (at least GitHub Pages) - it is the 21st century!

![](../images/2022/8_august/naas.png)

Repo: https://github.com/jupyter-naas/awesome-notebooks

## 27. pipdeptree for much better dependency management

Raise your hand if you used "pip freeze" and vowed to yourself you will never, ever use it again!

I handle dependency conflicts at least once a week - the process is still a mess in Python. Fortunately, I have recently come across a tool called "pipdeptree" which allows you to see dependencies of your environment in a hierarchical fashion.

The library also gives you warnings when there are version conflicts or even worse, circular dependencies (that's usually the sign you have to delete the whole conda environment).

![](../images/2022/8_august/august-pipdeptree.png)

Repo: https://github.com/naiquevin/pipdeptree

## 28. nbdime for diffing notebooks

I am terribly sorry you had to see that. Calling "git diff" when there are changes to Jupyter notebooks produces one of the ugliest things you will ever see in a terminal.

Notebooks are thousands of lines of JSON under the hood. There is structure to them, but that's not something you want to see on the terminal in black and white with no formatting.

Fortunately, there is nbdime for diffing your notebooks. Nbdime is content-aware: it shows a different diff view in the browser depending on the content of the notebook cells.

Here is an example that shows how a change in code leads to a different plot with different colors.

![](../images/2022/8_august/nbdime_example.png)

Repo: https://bit.ly/3d6drxz