# Code

## 1. Progress bars with Rich

TQDM for progress bars in Jupyter is outdated. Instead, go with Rich!

Apart from being an excellent CLI tool, Rich works great in JupyterLab as well.

The display is fully customizable. You can tweak the text labels, the width and ordering of the bars and much more.
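As a minimal sketch, the simplest way to get a Rich progress bar is the `track` helper from `rich.progress`. The loop body and the description text here are placeholders, not from the original post.

```python
import time

from rich.progress import track

total = 0
# track() wraps any iterable and renders a live progress bar while looping
for step in track(range(20), description="Processing..."):
    time.sleep(0.01)  # stand-in for real work
    total += step

print(total)
```

For finer control over labels and columns, the `Progress` class in the same module accepts column objects such as `TextColumn` and `BarColumn`.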

Check out the docs linked below for more details.

![](../images/2022/7_july/progress_bar_2.gif)

Rich progress bar display: https://bit.ly/3vdDw3U

## 2. Pretty printing with Rich

The Python REPL is one of the ugliest things you will see in programming. Give it more style with colorful pretty-printing from Rich👇

![](../images/2022/7_july/pprint.gif)
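A small sketch of how this might look in practice: `pretty.install()` hooks Rich into an interactive session, while `pprint` and `pretty_repr` work anywhere. The `record` dictionary is made-up sample data.

```python
from rich import pretty
from rich.pretty import pprint, pretty_repr

# In an interactive session, this makes the REPL pretty-print every result
pretty.install()

# Hypothetical nested data, only for illustration
record = {"name": "Rich", "tags": ["cli", "jupyter"], "nested": {"depth": 2, "ok": True}}

# Explicit pretty-printing, useful inside scripts
pprint(record)

# The same formatting returned as a plain string
text = pretty_repr(record)
```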

## 3. Python 3.11 news article - lying with statistics

Lying with statistics in action: Python 3.11 is 60% faster than 3.10.

Don't easily trust headlines that twist the truth with statistics. In the benchmark test I performed, the median speed improvement was about 25%, and it was well over 60% only in the extreme cases.

However, should you even be excited about this improvement? 

I don't think so. The benchmark was performed using functions from native Python libraries people rarely use. If you regularly use libraries like NumPy and TensorFlow, which are Python wrappers around code written in other languages, you might not notice the speedup.

Check out the article I wrote for DataCamp down below for the details about the benchmark and how you can rerun it.

Article: https://www.datacamp.com/blog/whats-new-in-python-311-and-should-you-even-bother-with-it

## 4. Encoding rare labels with RareLabelEncoder

Often, when a categorical variable has a high cardinality (too many categories), many of the categories represent only a small proportion of the total. 

Classes with very few samples mostly add noise. For ML models to generalize well for all classes, each class must have enough samples.

One solution to the problem is to group rare categories into a single category called "rare" or "other". The "rarity" can be chosen by selecting a proportion threshold.

You can do this manually in Python, but there is a better way. Using the feature-engine library, you can perform the operation with a Sklearn-like transformer.

Useful parameters of RareLabelEncoder:

- tol: threshold
- replace_with: custom text to replace rare categories
- ignore_format: when True, the transformer will work on numerically-encoded features as well. By default, it only works on Pandas "object" or "category" data types.
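For reference, the manual version of this grouping can be sketched with plain pandas. The function name `group_rare_labels` and the sample data are hypothetical; `tol` and `replace_with` only mirror the parameter names above, and this is not feature-engine's implementation.

```python
import pandas as pd


def group_rare_labels(series, tol=0.05, replace_with="Rare"):
    """Replace categories whose share of rows is below `tol`."""
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < tol].index
    # Keep frequent labels, overwrite rare ones
    return series.where(~series.isin(rare), replace_with)


# Made-up data: two frequent cities and two rare ones
cities = pd.Series(["London"] * 10 + ["Paris"] * 8 + ["Oslo", "Lima"])
grouped = group_rare_labels(cities, tol=0.1)
print(grouped.value_counts())
```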

RareLabelEncoder: https://bit.ly/3vfjNkv

![](../images/2022/7_july/july-rare_label.png)

## 5. Should you always cross-validate? 

Is it a requirement to use cross-validation every time? The answer is a tentative "Yes".

When your dataset is sufficiently large, every random split of train/test sets should resemble the original data well. However, each model comes with its inherent bias and it will have samples that it favors over others. 

That's why it is always recommended to use CV techniques. Even when the data is large, you should at least go for 2-3 fold CV. 

As the dataset size gets smaller, you can increase the folds. When it is dangerously small, like below 100 rows, you can go for extreme CV techniques such as LeaveOneOut or LeavePOut. 
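A rough sketch of both ends of that spectrum, assuming a synthetic dataset from `make_classification` and a logistic regression as a stand-in model: a cheap 3-fold CV and an exhaustive `LeaveOneOut` run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset, used only for illustration
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

# Cheap option for large data: a handful of folds
kfold_scores = cross_val_score(model, X, y, cv=3)

# Extreme option for tiny data: one fold per row
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(kfold_scores.mean(), loo_scores.mean())
```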

I have talked about CV techniques in detail in one of my recent articles. Give it a read!

https://bit.ly/3z5e02c

## 6. Group KFold

What is GroupKFold cross-validation and when should you use it? Hint: non-IID data.

Traditional CV techniques like KFold are all designed for IID data - independent and identically distributed. In other words, the process that generates each row of the dataset does not have a memory of the past samples. 

But what if the data is non-IID?

For example, in the Google Brain Ventilator Pressure competition on Kaggle, participants worked with a simulated dataset of lung pressure of sedated patients connected to a breathing pump.

Each row records several physical attributes of lungs as oxygen goes in and out. So, each "breath" of oxygen into the lungs has over 50 rows of measurements with a timestamp.

Here, we can't use plain-old KFold because the dataset is grouped into thousands of breaths and each breath has more than 50 records. Using KFold has the danger of cutting the dataset "mid-breath".

As a solution, you can use a CV technique called GroupKFold, whose split method accepts an additional "groups" argument: an array holding the group ID of every row in the dataset. Rows that share a group ID are never split between the train and test sets.

For the lungs dataset, the "groups" argument would accept the "breath_id" column.

Below is an example of GroupKFold in Sklearn.

To learn more about such CV techniques, you can check out my latest article: https://bit.ly/3z5e02c

![](../images/2022/7_july/july-groupkfold.png)
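In code, a minimal sketch might look like the following. The `breath_id` array and the random features are stand-ins for the real competition data; the check at the end shows that no breath ends up on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy stand-in for the ventilator data: 6 breaths, 4 rows each
n_breaths, rows_per_breath = 6, 4
breath_id = np.repeat(np.arange(n_breaths), rows_per_breath)
X = np.random.default_rng(0).normal(size=(len(breath_id), 3))
y = np.zeros(len(breath_id))

gkf = GroupKFold(n_splits=3)
folds = list(gkf.split(X, y, groups=breath_id))

for train_idx, test_idx in folds:
    test_breaths = sorted(int(b) for b in set(breath_id[test_idx]))
    shared = set(breath_id[train_idx]) & set(breath_id[test_idx])
    print("test breaths:", test_breaths, "| shared with train:", len(shared))
```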

## 7. Shuffle CV

How can you flirt with the idea of cross validation and yet, still not do it? Hint: Use ShuffleSplit. 

ShuffleSplit is an Sklearn CV splitter that does the following:

It accepts an integer for its n_splits argument and each time, returns shuffled versions of the dataset with custom training/test set proportions.

It is a great alternative to KFold CV because it allows finer control over the number of splits and the number of samples in the train/test sets. It is also a better choice than KFold when you have limited data.

To learn more about such CV techniques, you can check out my latest article: https://bit.ly/3z5e02c

![](../images/2022/7_july/july-shufflecv.png)
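A short sketch on a toy array, assuming 5 splits with a 25% test share; both numbers are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(40).reshape(20, 2)  # 20 toy samples

# 5 independent random splits, each holding out 25% of the rows
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=42)
splits = list(ss.split(X))

for train_idx, test_idx in splits:
    print(len(train_idx), "train rows |", len(test_idx), "test rows")
```

Unlike KFold, the number of splits and the test proportion are set independently, so the same row can appear in several test sets.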

## 8. Cross validation in time series 

Cross-validating time series data is tricky. You can't use traditional KFold because you will end up training on "future" samples and predicting on the "past". Instead, use TimeSeriesSplit of Sklearn.

The syntax is the same as other CV splitters but with one major difference:

In each fold, the training indices will always come before the test indices. So, each successive train set is a superset of previous sets. 

![](../images/2022/7_july/july-time_series_cv.png)
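The ordering guarantee is easy to see on a toy series of 8 time steps (an arbitrary size chosen for this sketch):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # 8 consecutive time steps

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

for train, test in splits:
    # Every training index precedes every test index
    print("train:", train, "test:", test)
```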

## 9. CTGAN for imbalanced classification

Do you use SMOTE in practice? How effective is it?

Lately, I have been thinking about alternatives to SMOTE for severely imbalanced classification problems. More specifically, I was wondering if synthetic data generators like CTGAN can be used to oversample the minority class.

SMOTE uses neighbors in the feature space and draws new samples along the lines between them. CTGAN, in contrast, uses GANs to model the distribution of continuous and discrete columns to synthesize new examples.
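To make the SMOTE idea concrete, here is a NumPy sketch of just the interpolation step. It is a simplification under stated assumptions (random toy points, a hypothetical helper `interpolate_sample`, `k=3` neighbors), not the implementation found in any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(size=(10, 2))  # toy minority-class points


def interpolate_sample(points, k=3, rng=rng):
    """Create one synthetic point between a sample and one of its k nearest neighbors."""
    x = points[rng.integers(len(points))]
    dists = np.linalg.norm(points - x, axis=1)
    neighbors = np.argsort(dists)[1 : k + 1]  # index 0 is the point itself
    neighbor = points[rng.choice(neighbors)]
    lam = rng.random()  # position along the segment, between 0 and 1
    return x + lam * (neighbor - x), x, neighbor


new_point, x, neighbor = interpolate_sample(minority)
print(new_point)
```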

How effective do you think CTGAN would be?

## 10. Autocorrelation in time series with statsmodels

I always found time series analysis fascinating, especially autocorrelation.

Autocorrelation is the same as the correlation coefficient, but it is calculated between a series and its lagged version. Here, lagging means shifting the series a few periods behind, so the present values can be compared to their past.

Autocorrelation can help discover great insights about the time series, such as:

1. Trend - when a clear trend exists in time series, autocorrelation is high at small lags and decays slowly as you shift the series further
2. Seasonality - if autocorrelation goes up and down in fixed periods, seasonality exists in the series.
3. Predictability - high autocorrelation suggests strong predictive power of the series, meaning you can train on the past samples to predict the future.

To make autocorrelation analysis easier, you can plot it using statsmodels. Below is an example autocorrelation plot of temperature in Celsius. As you would expect, there is a strong seasonality in the series, occurring at every 12 lags.

![](../images/2022/7_july/july-autocorr.png)
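The definition above can be sketched directly with NumPy: correlate the series with a shifted copy of itself. The monthly temperature series here is synthetic (a sine wave with a 12-step period), and the helper `autocorr` is a hypothetical name. For the plot itself, statsmodels offers `plot_acf` in `statsmodels.graphics.tsaplots`, which uses a slightly different normalization than this plain Pearson correlation.

```python
import numpy as np


def autocorr(series, lag):
    """Pearson correlation between a series and itself shifted by `lag` steps."""
    return np.corrcoef(series[lag:], series[:-lag])[0, 1]


# Synthetic monthly temperatures with a yearly cycle (10 years)
months = np.arange(120)
temperature = 15 + 10 * np.sin(2 * np.pi * months / 12)

for lag in (6, 12):
    print(lag, round(autocorr(temperature, lag), 3))
```

With a 12-step season, the correlation is strongly negative at lag 6 (opposite seasons) and strongly positive at lag 12 (same season one year apart).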

Advanced time series analysis article: https://bit.ly/3Pmt2qM

## 11. Session expiry trick in Colab

What kind of dirty tricks do you use in Colab?

If I have to get away from my computer and am worried about losing the session, I just import the time module and set the cell to sleep for a few hours, e.g. 60 * 60 * 10 seconds.

That way, the session will not go idle even after the previous cells finish execution.
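The trick fits in a few lines. The helper name `keep_session_busy` is made up for this sketch; the call is left commented out so that the cell only blocks when you want it to.

```python
import time


def keep_session_busy(hours: float) -> None:
    """Block the current cell for the given number of hours."""
    time.sleep(60 * 60 * hours)


# Run this in the last cell before stepping away:
# keep_session_busy(10)
```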

## 12. Git-story

A project's Git tree can become quite complex. To explain how the commits, branches and tags criss-cross in the repo, use a new tool called Git-story to animate the tree.

Under the hood, Git-story uses Manim, the same animation engine used in 3Blue1Brown videos. Features of Git-story:

1. Single command to create an .mp4 video of git history
2. Move the animation start to any commit
3. Different labels for commits, branches and tags
4. Dark and light mode

Git-story can be a great tool to lower the barriers to contributing to many open-source projects by animating their commit history.

![](../images/2022/7_july/git_story.gif)

Link to git-story: https://initialcommit.com/tools/git-story

## 13. Cyclic and seasonal time series patterns

How do you spot cyclic and seasonal time series patterns from a mile away?

If the pattern repeats at a fixed frequency or is tied to the calendar in some way, the pattern is seasonal or periodic. Examples are temperature between seasons, retail sales, economic data, etc.

If the ups and downs of the series are irregular and resemble random fluctuations, the pattern is cyclic. Usually, each of these fluctuations lasts at least 2 years and you can't reasonably predict when the next spike will occur based on the previous ones.

Cyclic patterns are usually associated with four phases of the business cycle - prosperity or boom, recession, depression and recovery.

## 14. SQLGlot for changing SQL dialects

Do you know how to switch between all SQL dialects - Hive, Presto, Spark, MySQL, PostgreSQL, DuckDB, BigQuery, etc.?

With SQLGlot, you don't have to. It is a Python library that has the following features:

- Written in pure Python
- Prettify complex SQL queries
- Translate queries between dialects
- Rewrite queries into optimized form
- Detect syntax errors while parsing
- Build and modify queries with Python functions

Have you seen it Danny Ma, sir?

![](../images/2022/7_july/july-sqlglot.png)

Link to the library: https://github.com/tobymao/sqlglot

## 15. Handcalcs library in Python

Handcalcs is a nifty Python library that converts math code into formulas that look like they were written by hand.

It has a %%render magic command that automatically renders a cell's code as LaTeX output, which you can save as a PDF later.

Handcalcs also exposes a decorator that shows the calculations inside custom functions with user-provided values, just like you would substitute values into a formula while solving a problem by hand.

![](../images/2022/7_july/handcalcs.gif)

Link to handcalcs: https://github.com/connorferster/handcalcs

## 16. Avatarify Python

Photorealistic avatars for video conferencing with Python Avatarify.

Avatarify is a project for swapping your face with anyone you like on Zoom and Skype. Apart from the initial fun of it, you might be interested in the implementation of the project.

The animation engine is based on the First Order Motion Model paper by Aliaksandr Siarohin. I am not gonna pretend I am smart enough to understand what that is, but you can check out the links below for the paper and the desktop app of the project.

![](../images/2022/7_july/avatarify.gif)

Avatarify GitHub (Desktop app): https://github.com/alievk/avatarify-python

Paper: https://bit.ly/3PIOxmj

## 17. Venn diagrams in Python

Drawing Venn diagrams in Matplotlib!

Matplotlib is built on small classes called Artists. Everything is an Artist in Matplotlib - each dot, circle, line, text, spine, etc. They all inherit from a base class called Artist.

If you use these Artists correctly, you can draw practically anything in Matplotlib (even the Mandelbrot set). matplotlib_venn is a library that takes advantage of this feature and allows you to plot Venn diagrams.

![](../images/2022/7_july/july-venn.png)

The library: https://github.com/konstantint/matplotlib-venn

## 18. Why does ensembling work better than single models?

### Reason 1

Members of the ensemble learn different mapping functions from input to output. A good ensemble contains members whose learning functions differ as much as possible, so that they explore the information space created by the data from all angles. They make different assumptions about the structure of the data and make errors in different cases.

### Reason 2

The predictions are always combined in some way. This allows the ensemble to exploit the differences of predictions in all members. In other words, you don't just have to take the word of one model but get a collective opinion on each case, lowering the risk of making an inaccurate prediction.

### Reason 3

There is also a beautiful probabilistic reason why an ensemble of models with different scores beats another set of models with similar scores. The proof is a bit long but I will definitely talk about it next week.

There is a heated debate on whether the benefits gained from ensembles outweigh their disadvantages, but that's also a topic for another post.

## 19. Voting classifier/regressor

How do you reach democracy in machine learning? By using a voting ensemble!

Max voting is a common ensembling technique that uses the majority vote to label new classification samples. If we have three models with the following predictions in a binary classification problem:

- Model 1 -> class 1
- Model 2 -> class 2
- Model 3 -> class 1

The final prediction would be class 1. VotingClassifier of Sklearn can be used to build such an ensemble.
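The vote count itself is simple enough to sketch with the standard library, using the three predictions from the list above:

```python
from collections import Counter

# Predictions of the three models for a single sample
predictions = {"Model 1": "class 1", "Model 2": "class 2", "Model 3": "class 1"}

votes = Counter(predictions.values())
winner, n_votes = votes.most_common(1)[0]
print(winner, n_votes)  # class 1 2
```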

It takes a list of individual classifiers and ensembles them with the max voting technique when its "voting" parameter is set to "hard". When it is set to "soft", the ensemble averages the predicted class probabilities and picks the class with the highest average.

VotingRegressor is the regression counterpart: it has no "voting" parameter and simply averages the predictions of its regressors, much like soft voting.

![](../images/2022/7_july/july-voting_classifier.png)
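A minimal sketch of both voting modes, assuming a synthetic dataset and three arbitrarily chosen base classifiers:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
]

# Hard voting: majority vote over predicted labels
hard = VotingClassifier(estimators=estimators, voting="hard").fit(X, y)
hard_preds = hard.predict(X[:5])

# Soft voting: average the predicted class probabilities
soft = VotingClassifier(estimators=estimators, voting="soft").fit(X, y)
soft_probs = soft.predict_proba(X[:5])

print(hard_preds)
print(soft_probs)
```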

## 20. Stacking ensemble/regressor

People use stacking to silently win competitions on Kaggle. How does it work?

As a rule, multiple performant models with learning functions as different as possible are chosen to form an ensemble. Then, using KFold cross-validation, out-of-fold predictions are generated for each model.

As an example, with 5 models in a stack doing a 5-fold CV on the data, there will be 25 model fits, and the out-of-fold predictions of each model are assembled into one column, giving 5 columns of predictions. This concludes level 1 of the stack.

In the next level, using these columns of predictions as features, a final meta-estimator is trained and final predictions are made.

This leverages the strength of each individual model in the stack and uses their output as inputs to the final estimator. This helps greatly reduce bias in the predictions.

This complicated ensembling technique is implemented in its basic format in Sklearn as StackingClassifier and StackingRegressor. You pass a list of base estimators and one final lightweight meta-estimator like Logistic Regression. It works just like any Sklearn model.

![](../images/2022/7_july/july-stack.png)
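A short sketch of `StackingClassifier`, assuming synthetic data and two arbitrarily chosen base models with a logistic regression on top:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),  # lightweight meta-estimator
    cv=5,  # folds used to build the level-1 predictions
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)

# Level-1 features fed to the meta-estimator: one column per base model here
meta_features = stack.transform(X_test)
print(accuracy, meta_features.shape)
```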

## 21. StackOverflow homepage

This is the StackOverflow homepage, for those who are seeing it for the first time in their lives.

![](../images/2022/7_july/so_homepage.gif)

## 22. PReGex - human-readable RegEx

Human-readable RegEx is finally here! Using the PReGex library, you don't have to use a single RegEx character. 

Below is an example Pregex pattern for matching an email, one of the most common RegEx operations. 

Pregex exposes classes for almost all RegEx characters such as quantifiers, groups, operators, character classes, assertions, etc. Pregex stands for Programmable regex, and rightly so: all its classes can be combined into more complex patterns using Python operators.

You can learn more about Pregex in an article written by Khuyen Tran and from the official docs, both linked below.

![](../images/2022/7_july/july-pregex.png)

Article by Khuyen Tran: https://bit.ly/3S6Q6fh

Pregex docs: https://pregex.readthedocs.io/en/latest/

# Resources

## 23. Workera.ai for individuals

How do your data skills compare to those of employees at FAANG companies?

You can easily find out by taking a skill assessment on Workera AI. The platform offers 10 different AI-based tests for your domain. Example tests are:

- AI Fluent/Literate (decision making and communication)
- Data analyst
- Data engineer
- ML and deep learning engineer, etc.

Each skill assessment has separate tests for sub-skills and returns a score out of 300. Then, you can compare your score against others who have taken the tests at other companies.

It is free for individuals: https://workera.ai/

![](../images/2022/7_july/workera.png)

## 24. Rachel Thomas's Computational Linear Algebra

YouTube gems: Computational Linear Algebra Course taught by Rachel Thomas, co-founder of fast.ai with Jeremy Howard.

According to Forbes, Rachel Thomas is one of the 20 most incredible women in AI. And if she is teaching a course for free on YouTube, you'd better take it.

Her Computational Linear Algebra course will give you hands-on practice in implementing the theory of matrix computations in code. The topics covered are: 

- Parallelization
- SVD - Singular Value Decomposition
- PCA - Principal Component Analysis
- LU-factorization
- PageRank

Linear Algebra is one of the core components of ML math and being able to understand its concepts at a level where you can implement them in code goes a long way.

![](../images/2022/7_july/linalg.png)

Link to the course: https://bit.ly/3vmMBHN

## 25. Andrej Karpathy blog on recipe for training neural networks

The secret recipe for training neural networks - by the former director of AI @Tesla.

In one of his articles, Andrej Karpathy shares his valuable knowledge on how best to train neural nets and the common mistakes to avoid. It covers:

- why neural networks are not easy "plug and play" software code
- how networks fail silently
- 6-step detailed recipe on applying neural nets to any problem

The secret recipe for training neural networks - Andrej Karpathy: https://bit.ly/3PtCZTE

## 26. Google dataset search

Google has a custom domain for dataset search!

Once you type in a keyword, the search will crawl across hundreds of dataset sources to find the best matches. It has filters for data format (structured, unstructured), usage rights, topics and when the dataset was last updated.

I found the results of this sub-domain much cleaner than plain old Google Search, which always contains unnecessary pages and slows down your dataset search.

![](../images/2022/7_july/google_data_search.gif)

Link to the tool: https://datasetsearch.research.google.com/

## 27. 3Blue1Brown blogs on math

Love 3Blue1Brown videos? Wait till you read the articles!

I recently discovered that the 3Blue1Brown website features written versions of Grant Sanderson's most popular videos. All his videos are exceptional, and there are roughly 100 articles (with play-with-it-yourself animations), grouped into 16 categories.

![](../images/2022/7_july/3b1b_articles.gif)

## 28. ML pen and paper exercises

If you keep relearning math concepts for ML, you need some solid deliberate practice.

Watching YouTube videos and reading articles on math concepts gives you the comfortable illusion of "making progress" and "learning something worthwhile". In reality, you are just sitting inside your comfort zone, learning things you know you will inevitably relearn.

Give yourself a serious challenge by solving ML exercises which can mostly be done using pen and paper. You can download them from arXiv or from the GitHub repo.

![](../images/2022/7_july/pen_paper.png)

Exercises (GitHub): https://github.com/michaelgutmann/ml-pen-and-paper-exercises

Exercises (arXiv): https://arxiv.org/abs/2206.13446