# Code tricks

## 1. Displaying ROC Curve without generating predictions

Can you spell out ROC curve without looking it up? If yes, don't flatter yourself, because a lot of people can😁.

But not a lot of people know that you can draw the ROC curve without even generating predictions. Just use the RocCurveDisplay class and its from_estimator method👇

![](../images/2022/7_july/july-roc_curve.png)

## 2. Parquet vs. Feather in terms of memory

Parquet files can be roughly half the size of Feather files on disk.

Because the Parquet file format uses dictionary and run-length (RLE) encodings plus data page compression, it usually takes far less disk space than Feather.

If you want to learn more about the differences, I dropped a link to a SO discussion below👇

![](../images/2022/7_july/july-parquet_vs_feather.png)

Link to SO thread: https://bit.ly/3yuDYvx

## 3. HTML representation of an Sklearn pipeline

You can get an interactive HTML representation of your Sklearn pipeline right inside a notebook.

Just import the set_config function from Sklearn and set display to "diagram"👇

![](../images/2022/7_july/sklearn_pipe.gif)
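A rough text version of what the GIF shows. The two-step pipeline is an assumption for illustration. The call to `estimator_html_repr` is only there so the HTML can be inspected outside a notebook.

```python
from sklearn import set_config
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# Ask Sklearn to render estimators as HTML diagrams
set_config(display="diagram")

# A hypothetical two-step pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# In a notebook, leaving `pipe` as the last expression of a cell draws the diagram.
# Outside a notebook, the same HTML is available as a string:
html = estimator_html_repr(pipe)
print(len(html), "characters of HTML")
```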

## 4. How to choose correct DPI and figure size in Matplotlib

How to choose a correct DPI and figure size in Matplotlib so you don't lose quality by zooming in?

Matplotlib sets figure size in inches - figsize of (12, 6) is 12 inches wide and 6 inches tall. 

The DPI represents dots or pixels per inch. The default DPI of 100 means for a figsize of (12, 6), the image resolution will be 1200x600 pixels. 

Now, there is also the size of the points, lines or other elements in a plot. Those are measured in points, and there are 72 points in an inch. So, at a DPI of 72, a one-point dot would have the area of a single pixel.

At 144 DPI, the same dot would be two pixels wide and a one-point line would be two pixels thick. So, DPI is like a magnifying glass - a higher DPI scales all elements in a plot.

So, to not lose image quality when zooming in, increase DPI while keeping the figsize constant.
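The arithmetic above can be sketched in a few lines of plain Python. The helper names `pixel_size` and `points_to_pixels` are made up for this example and are not part of Matplotlib.

```python
def pixel_size(figsize, dpi):
    """Pixel dimensions of a figure: inches multiplied by dots per inch."""
    width_in, height_in = figsize
    return round(width_in * dpi), round(height_in * dpi)


def points_to_pixels(points, dpi):
    """Convert a size in points (1/72 inch) to pixels at the given DPI."""
    return points * dpi / 72


print(pixel_size((12, 6), 100))   # (1200, 600)
print(points_to_pixels(1, 72))    # 1.0
print(points_to_pixels(1, 144))   # 2.0
```

In Matplotlib itself, both values are set together, for example with `plt.figure(figsize=(12, 6), dpi=200)`.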

Image and content credit: an SO thread down below👇

![](../images/2022/7_july/dpi.png)

StackOverflow thread on the topic: https://bit.ly/3IrsLjY

## 5. Generate business-day frequency time series with certain workweeks

How to generate a business-day frequency time series with only certain workweeks?

Use the bdate_range function of Pandas and its weekmask parameter. Below, we create a time series from 2020 to 2022, with only Mondays, Wednesdays and Fridays as working days.

Such time series can be useful when analyzing data from financial sources.

![](../images/2022/7_july/july-bdate_workweeks.png)
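A minimal sketch of the same idea in text form. The exact start and end dates are assumptions. Note that `weekmask` only takes effect together with a custom business-day frequency (`freq="C"`).

```python
import pandas as pd

# Custom business days: only Mondays, Wednesdays and Fridays count
idx = pd.bdate_range(
    start="2020-01-01",
    end="2022-12-31",
    freq="C",
    weekmask="Mon Wed Fri",
)

print(idx[:5])
print(sorted(set(idx.day_name())))
```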

## 6. Visualize all trees of RandomForest

It would be freakishly cool to visualize all the trees in a Random Forest. But how?

Last time, I showed how you can draw a single Decision Tree as a Sankey diagram using the Pybaobabdt package. To visualize multiple trees of a RandomForest, we can use Matplotlib subplots like below.

Just remember to set high DPI and high figure size before saving.

Image credit: Pybaobabdt docs. Code to create the plot is down below👇

![](../images/2022/7_july/trees_forest.png)

Pybaobabdt docs: https://bit.ly/3unYtJc

Code to generate the plot: https://bit.ly/3yT9CUV

## 7. Time series offset aliases

There are no fewer than 27 time series offset aliases. What are they?

Many Pandas functions like date_range have a parameter called freq (frequency) - it denotes how often each data point should occur in a time series. 

Possible values are daily, hourly, weekly, all work days, month start and end, quarterly, yearly, etc. Check out the link below to learn more about them.

List of offset aliases: https://bit.ly/3Roy7Rb
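To get a feel for a few of these aliases, here is a small sketch. The four aliases and the start date are picked only for illustration; the linked list covers the rest.

```python
import pandas as pd

# A handful of aliases: daily, business day, weekly, month start
aliases = ["D", "B", "W", "MS"]

samples = {
    alias: pd.date_range("2022-01-01", periods=3, freq=alias)
    for alias in aliases
}

for alias, idx in samples.items():
    print(alias, list(idx.strftime("%Y-%m-%d")))
```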

Image credit: Pexels

![](../images/2022/7_july/time.jpg)

## 8. Filtering by partial date components

If your Pandas dataframe has a DatetimeIndex, you can filter it by partial date components.

For example, from 1995 to 1997, from 5th month of 1995 to the end of 2000, from the beginning of 2015 to 17th of July of 2018, etc.

And these all work regardless of the time series index granularity. All courtesy of Pandas.

![](../images/2022/7_july/july-partial_filtering.png)
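Here is a sketch of the three filters mentioned above, run on a made-up daily series. The column name and date range are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data covering 1995 through 2018
idx = pd.date_range("1995-01-01", "2018-12-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(idx))}, index=idx)

# Whole years: from the start of 1995 to the end of 1997
years = df.loc["1995":"1997"]

# From the 5th month of 1995 to the end of 2000
month_to_year = df.loc["1995-05":"2000"]

# From the beginning of 2015 to the 17th of July, 2018
to_day = df.loc["2015":"2018-07-17"]

print(years.index.min(), years.index.max())
```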

## 9. 10 Sklearn Features buried in the docs

There are more than TWO THOUSAND and five hundred pages in the Sklearn user guide PDF. I've handpicked 10 subtle features that were buried deep inside👇

Link to the article: https://bit.ly/3Isd4ZM

## 10. Displaying Precision/Recall curve without generating predictions

Area under the Precision/Recall curve is one of the best metrics to evaluate the performance of models in imbalanced classification problems.

Precision measures the percentage of positive predictions that are actually correct (true positives / (true positives + false positives)).

Recall is the same as sensitivity (true positives / (true positives + false negatives)).

In an imbalanced problem, we are interested in correctly classifying as much of the minority class (positive class or 1) as possible - i.e. true positives. As both the above metrics focus on true positives and don't care about correctly classifying the majority class (true negatives), they are among the best metrics in this context.

By varying the decision threshold of the classifier and plotting precision and recall for each threshold, we get a Precision/Recall curve. 

A perfect classifier for an imbalanced problem would have an area under the curve of 1.

Below is how you can plot the curve in the easiest way possible in Sklearn👇

![](../images/2022/7_july/july-precision_recall_display.png)

## 11. Full list of datetime format strings

Do you know all the TWENTY FOUR datetime format codes? Of course not and neither do I! But I know where to look.

Format codes are those little strings like %H, %m, %d, %c, %j, etc. that you pass into Python datetime functions. They can denote everything from microseconds to whole years, with different name representations in string dates.

Here is the full list from the Python docs: https://bit.ly/3uzMuYT
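A short sketch of a few codes in action, using only numeric codes so the output doesn't depend on locale. The sample date is arbitrary.

```python
from datetime import datetime

dt = datetime(2022, 7, 17, 14, 5, 9)

# Formatting: datetime -> string
formatted = dt.strftime("%Y-%m-%d %H:%M:%S")
day_of_year = dt.strftime("%j")  # zero-padded day of the year

# Parsing: string -> datetime
parsed = datetime.strptime("17/07/2022 14:05", "%d/%m/%Y %H:%M")

print(formatted)    # 2022-07-17 14:05:09
print(day_of_year)  # 198
print(parsed)       # 2022-07-17 14:05:00
```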

Image credit: Pexels

![](../images/2022/7_july/time_2.jpg)

## 12. Time series index with holidays

How do you leave out holidays in a time series and still keep its frequency intact? 

The Pandas bdate_range function has a "holidays" parameter that accepts a list of datetime objects as holiday dates. It takes effect only with a custom business-day frequency (freq="C"). The result is a time series with daily frequency with weekends and provided holidays ignored👇

![](../images/2022/7_july/july-holiday_ts.png)
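A small sketch of the idea. The one-week range and the single holiday are assumptions for illustration. The `holidays` argument needs a custom business-day frequency (`freq="C"`) to work.

```python
import pandas as pd

# One hypothetical holiday that falls on a Monday
holidays = [pd.Timestamp("2022-07-04")]

idx = pd.bdate_range(
    start="2022-07-01",
    end="2022-07-08",
    freq="C",
    holidays=holidays,
)

# The weekend (July 2-3) and the holiday (July 4) are skipped
print(list(idx.strftime("%Y-%m-%d")))
```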

## 13. All Pandas functions to manipulate time series

A while ago, I went completely berserk and wrote about all the Pandas functions you can use to manipulate time series. While the article received more than 10k views on Medium, the notebook got a gold medal on Kaggle with over 110 upvotes and 8k views.

Topics covered are:

✅ Basic date and time functions

✅ Missing data imputation in time series (pretty fun stuff here)

✅ Shifts, lags and percent changes

✅ Upsampling and downsampling

✅ Comparing the growth of multiple time series

✅ All about window functions

The article is 13 minutes long, so you better bookmark it!

Link to the article: https://bit.ly/3NZaIme

## 14. Python for/else clause

Did you know that for loops in Python have an "else" clause?

The else block of a for loop runs only when the loop finishes without hitting a break statement. If the loop breaks, the else block is skipped. In the second snippet below, you can see an example usage from the Python docs. Pretty neat, huh?

![](../images/2022/7_july/july-for_else.png)
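Here is a small sketch in the spirit of the prime-number example from the Python docs. The function name `first_divisor` is made up for this illustration. The else branch runs only when the loop completes without a break.

```python
def first_divisor(n):
    """Return the smallest divisor of n in [2, n), or None if there is none."""
    for candidate in range(2, n):
        if n % candidate == 0:
            result = candidate
            break  # a divisor was found, so the else block is skipped
    else:
        # Reached only when the loop ran to completion without a break
        result = None
    return result


print(first_divisor(9))  # 3
print(first_divisor(7))  # None
```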

## 15. Reading the text of files with Pathlib

You don't have to use the "open" function to read the contents of a file. You can use a much better alternative with Pathlib.

After passing the full file path to the Path class, you can call the read_text method which returns the contents as string👇

![](../images/2022/7_july/july-pathlib_read_text.png)
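A self-contained sketch: it first writes a throwaway file into a temporary directory (an assumption, so the example runs anywhere) and then reads it back with `read_text`.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file created only for this example
    file_path = Path(tmp_dir) / "example.txt"
    file_path.write_text("hello from pathlib\n", encoding="utf-8")

    # read_text opens the file, reads it and closes it in one call
    contents = file_path.read_text(encoding="utf-8")

print(contents)
```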

## 16. \_\_file\_\_ variable

Python has a \_\_file\_\_ variable that lets you see the full path to the current script. The variable isn't available in notebooks or in REPL.

![](../images/2022/7_july/july-__file__.png)
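A minimal sketch of one common use: resolving the script's own location. The guard for a missing `__file__` is an addition for this example, so the same snippet doesn't crash in a notebook or REPL.

```python
from pathlib import Path

# __file__ is defined when running a script or importing a module,
# but not in an interactive session, hence the guard
if "__file__" in globals():
    script_path = Path(__file__).resolve()
    print("Script:", script_path)
    print("Folder:", script_path.parent)
else:
    script_path = None
    print("__file__ is not defined here")
```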

## 17. Mapping any crazy distribution to normal with QuantileTransformer

How do you make crazy distributions like bimodals, trimodals, or multimodals normally distributed? Traditional classes like StandardScaler or MinMaxScaler won't work.

Instead, you can use the QuantileTransformer of Sklearn, which can map almost any distribution, whatever its shape, to normal by using quantile calculations. Don't forget to set output_distribution to "normal". Otherwise, you get a uniform distribution.

![](../images/2022/7_july/july-quantile_transformer.png)
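A rough sketch with a made-up bimodal sample. The two Gaussian clusters and the `n_quantiles` value are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Bimodal data: two well-separated clusters (an assumption for illustration)
rng = np.random.default_rng(0)
bimodal = np.concatenate(
    [rng.normal(-4, 1, 500), rng.normal(6, 2, 500)]
).reshape(-1, 1)

qt = QuantileTransformer(
    n_quantiles=1000, output_distribution="normal", random_state=0
)
transformed = qt.fit_transform(bimodal)

print(f"mean={transformed.mean():.2f}, std={transformed.std():.2f}")
```

Plotting histograms of `bimodal` and `transformed` side by side makes the difference easy to see.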

## 18. Decomposing time series into trend, seasonality and residuals

A time series has three core components - seasonality, trend and noise (residuals). 

These components aren't easily discernible by looking at the plot of the series itself. So, we often use decomposition to isolate each of these components. 

Seasonality lets you see repeating patterns over the time period of the series. 

Trend shows you the general upwards or downwards progress of the time series from the beginning of its earliest date to the latest. 

Anything left out from these two components is noise.

You can use statsmodels' seasonal_decompose function to perform this operation and plot the results. The first subplot displays the series itself while the rest show the individual components.

You can learn more about time series decomposition in my separate article on the topic. Link below👇

![](../images/2022/7_july/july-decomposition.png)

Time series decomposition article: https://bit.ly/3Pmt2qM

## 19. How to download and upload files to AWS S3

AWS S3 is one of the best options for storing your data. In this article, I show you how to upload/download files in Python to your S3 buckets.

The code itself is fairly straightforward but writing the permissions and settings to allow programmatic access to them is the real pain. So, the article focuses mainly on programmatic access with A LOT OF GIFs. 

Feel free to bookmark it for future reference.

Link to the article: https://bit.ly/3RrKJqu

## 20. Git Cheat Sheet with DataCamp

This is one of the most comprehensive Git cheat sheets on the Internet. I am not just saying that because I wrote the contents but while writing it, I made sure to add tricks and commands not available in many of the most popular Git cheat sheets of other platforms.

Thank you to the DataCamp team for making the text contents awesome with graphics.

Cheat sheet link: https://bit.ly/3PmaKpl

## 21. Transformed Target Regressor to manipulate the target array in Sklearn

Wouldn't it be so freakishly comfortable if you could manipulate the target (Y) array right inside Sklearn pipelines?

As you know, in some problems like regression, you have to transform the target array to be normally distributed. But the operation would have to be performed outside your Sklearn pipeline because all Sklearn transformers work on the feature (X) array.

Well, except for one.

The TransformedTargetRegressor lets you add a regression model on top of a transformation function like "np.log" and pass the whole class inside a pipeline.

I have never seen a library that loves its users as much as Sklearn.

![](../images/2022/7_july/july-transformed_target_regressor.png)
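A sketch of how this might look. The synthetic log-normal target, the LinearRegression model and the StandardScaler step are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with a strictly positive, right-skewed target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.exp(X @ np.array([0.5, -0.2, 0.1]) + rng.normal(0, 0.1, 200))

# The regressor is fit on log(y); predictions are mapped back with exp
model = make_pipeline(
    StandardScaler(),
    TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log, inverse_func=np.exp
    ),
)
model.fit(X, y)

preds = model.predict(X[:5])
print(preds)
```

Because `inverse_func` is applied automatically, the predictions come back on the original scale of `y`.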

# Resources

## 22. Resume guide and template by Terence Kuo to get a job from FAANG

Here is a CV guide and template that got offers from five of the FAANG companies by Terence Kuo.

The article is for all types of programmers alike and is one of the most viral posts I've seen on Medium.

https://bit.ly/3OQcgQV

## 23. Calculus playlist by Dr. Trefor Bazett

Calculus is one of the three core pillars of mathematics for machine learning. While 3Blue1Brown gives me a deep visual intuition about its fundamentals, I turn to Dr. Trefor Bazett for the hairier details.

His YouTube playlist on calculus is one of the simplest and yet, most comprehensive resources on learning fascinating calculus.

Link below👇

![](../images/2022/7_july/calculus.jpg)

Link: https://bit.ly/3P0O2U5

Image credit: https://www.quotemaster.org/Calculus

## 24. MLOps course by Iterative AI

The MOOC ecosystem is effectively dead in terms of MLOps resources. We keep seeing cheap courses on ML algorithms but don't see a lot of creators focusing on MLOps, which is just as important.

So, if a company with a massive reputation in the MLOps field releases a comprehensive course on the topic (with certification) completely free of charge, you should gobble up its contents like TicTacs. 

Iterative AI offers a course with 7 modules and over 70 video lessons on data versioning, pipelining, collaborating, tracking and visualizing model experiments in machine learning projects.

Sign up for free!

Link: https://learn.iterative.ai/

## 25. Multivariable Calculus by 3Blue1Brown on Khan Academy

Would you believe me if I said 3Blue1Brown (Grant Sanderson) made 175 hidden videos on pure Multivariable Calculus? Well, he did!

Not many people know that Khan Academy's multivariable calculus playlist is actually taught by Grant Sanderson. Although the videos are not stellar animations like on his own channel, he uses Manim (the animation library he created) in almost every one of them to explain hard multivariable topics in 2D and 3D.

Link to the playlist below👇

![](../images/2022/7_july/curl.jpg)

Multivariable Calculus on Khan Academy: https://www.khanacademy.org/math/multivariable-calculus

YouTube Playlist: https://bit.ly/3nYPrid

Image credit: 3blue1brown.com

## 26. Rich library for CLI color formatting

Rich is one of the most beautiful and useful Python libraries. Put simply, it makes terminal output awesome!

Its features are:

✅ Python code formatting in REPL

✅ Text markup (bold, italic, underline)

✅ 16.7+ MILLION truecolors

✅ Logging

✅ Markdown support

✅ Progress bars (in Jupyter as well)

Check out the library link down below👇

![](../images/2022/7_july/rich_progress.gif)

Link to Rich: https://github.com/Textualize/rich

## 27. 9 distance metrics explained in data science

Distance metrics play an important role in many ML algorithms. A leading example is KNN, which uses distance measurements in multiple dimensions for classification and regression problems.

There are other algorithms as well like Local Outlier Factor for outlier detection, UMAP for dimensionality reduction and many more. And the code implementations for each have a distance parameter that lets you choose the method of calculating the distance between data points.

This article gives you an overview of 9 of the most popular ones and when to use them.

Article link: https://bit.ly/3yzhKse

## 28. choosealicense.com

![](../images/2022/7_july/license.png)

Choosing the right license for your open-source project is of paramount importance. It protects you and your work and ensures that businesses and other law-abiding users can approach your project safely.

The best place to find the correct license is the https://choosealicense.com website. It outlines the available open-source licenses, their limitations and the text contents you can copy/paste inside a file.