# Code

## 1. The difference between two time series dates

How do you find the difference between the dates of two time series?

As long as they have the same format, you can use the difference method of Pandas DatetimeIndex objects. It returns the dates that are present in one index but missing from the other.

Below, we create two time series: one for a full year and one for only business days. The rest is fairly easy👇

![](../images/2022/7_july/july-ts_difference.png)
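In case the screenshot is hard to read, here is a minimal sketch of the same idea. The year 2021 and the variable names are assumptions picked for illustration, not necessarily what the image uses.

```python
import pandas as pd

# One index with every calendar day, one with business days only
full_year = pd.date_range("2021-01-01", "2021-12-31", freq="D")
business_days = pd.date_range("2021-01-01", "2021-12-31", freq="B")

# Dates that are in full_year but not in business_days, i.e. the weekends
weekend_days = full_year.difference(business_days)

print(len(full_year), len(business_days), len(weekend_days))
print(weekend_days[:4])
```

Note that freq="B" only skips Saturdays and Sundays; public holidays still count as business days unless you supply a custom calendar.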

## 2. Set errors to coerce while converting to datetime

When you load a datetime column into Pandas, the default datatype will be "object". To convert it to datetime, you can use the pd.to_datetime() function.

However, if some of the dates are corrupted, you will get fat, red errors. You can set those faulty dates to NaT (Not a Time, i.e. datetime NaN) by setting errors to "coerce".

![](../images/2022/7_july/july-errors_coerce.png)
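A minimal sketch of the behavior, using a small made-up Series (the values are assumptions for illustration):

```python
import pandas as pd

# Two valid dates and two corrupted entries
raw = pd.Series(["2022-07-01", "2022-07-02", "not a date", "2022-13-45"])

# Without errors="coerce" this call would raise; with it, bad entries become NaT
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed)
print("Unparseable rows:", parsed.isna().sum())
```

Counting the NaT values afterwards is a quick way to see how many rows were silently coerced, so you don't lose data without noticing.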

## 3. Print the prediction path of a sample

You can visualize the prediction path of a datapoint in a decision tree using the dtreeviz package. Adding one or two plots like below to your analysis will give viewers a better sense of how predictions are made using decision trees.

Link to dtreeviz in the comments👇

![](../images/2022/7_july/july-pred_path.png)

Link to dtreeviz: https://bit.ly/3nDk2la
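If you only need the path as text, a rough alternative is scikit-learn's own decision_path method. To be clear, the sketch below does not use dtreeviz and does not draw anything; it just prints the nodes a sample passes through. The Iris dataset and max_depth=3 are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

sample = X[[0]]  # keep 2D shape: one row, four features

# Sparse indicator matrix: which nodes does the sample visit?
node_indicator = clf.decision_path(sample)
leaf_id = clf.apply(sample)[0]
path_nodes = node_indicator.indices[
    node_indicator.indptr[0] : node_indicator.indptr[1]
]

for node_id in path_nodes:
    if node_id == leaf_id:
        predicted = iris.target_names[clf.predict(sample)[0]]
        print(f"leaf {node_id}: predicted class = {predicted}")
        continue
    feature = clf.tree_.feature[node_id]
    threshold = clf.tree_.threshold[node_id]
    value = sample[0, feature]
    sign = "<=" if value <= threshold else ">"
    print(f"node {node_id}: {iris.feature_names[feature]} = {value:.2f} {sign} {threshold:.2f}")
```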

## 4. PyBaobabdt for much better decision tree visualizations

PyBaobabdt is a wonderful Python package to visualize decision trees using Sankey diagrams. For a tree classifier, each class is represented with a color, and the width of each link (or root) represents the number of samples in each class. 

The trees can become as beautiful as you want by controlling the max_depth parameter. 

The link to the package in the comments👇

![](../images/2022/7_july/july-pybaobabdt_code.png)

Comments:

Link to the package: https://gitlab.tue.nl/20040367/pybaobab

Code to create the below tree: https://snappify.io/view/85603a55-6127-4a10-b04d-076a4604c5a9

![](../images/2022/7_july/pybaobab_sample.png)

## 5. Mount Google Drive on Colab

It is infuriating that once a session is ended, Google Colab discards uploaded files. So, instead of uploading your CSVs directly to Colab, you can store them in your Google Drive and access them from your notebooks. Here is how👇

![](../images/2022/7_july/july-gdrive_mount.png)
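A minimal sketch of the mounting step. drive.mount comes from the google.colab module, which only exists inside a Colab runtime, so the import is wrapped in a try block. The "data/train.csv" location is a hypothetical path; replace it with wherever your file lives in Drive.

```python
MOUNT_POINT = "/content/drive"

try:
    # Only importable inside a Colab runtime
    from google.colab import drive

    drive.mount(MOUNT_POINT)  # asks you to authorize access to your Drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# Files in your Drive show up under the "MyDrive" folder of the mount point.
# "data/train.csv" is a hypothetical location used for illustration.
csv_path = f"{MOUNT_POINT}/MyDrive/data/train.csv"

if IN_COLAB:
    import pandas as pd

    df = pd.read_csv(csv_path)
```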

## 6. Strip unnecessary components from a datetime object

Sometimes, datetime objects come with unnecessary granularity. They may carry seconds or even nanoseconds when you are just interested in the year/month/day. You can use the Pandas to_period function with a frequency name to strip away the clutter.

![](../images/2022/7_july/july-to_period.png)
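A small sketch with made-up timestamps (the values are assumptions for illustration). On a Series, to_period is reached through the .dt accessor:

```python
import pandas as pd

stamps = pd.Series(
    pd.to_datetime(["2022-07-04 13:45:12.123456", "2022-07-05 08:01:59.000001"])
)

by_day = stamps.dt.to_period("D")    # keep year-month-day only
by_month = stamps.dt.to_period("M")  # keep year-month only

print(by_day.tolist())
print(by_month.tolist())
```

The result has a period dtype rather than a datetime dtype, which is worth keeping in mind if later code expects timestamps.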

## 7. Pandas explode

What do you do if a dataframe cell contains a list of values? Well, you explode💣 them!

Pandas' explode function takes a column and expands it vertically so that any cells that contain more than one value are stretched across multiple rows.

![](../images/2022/7_july/july-pandas_explode.png)
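Here is a minimal sketch on a toy dataframe (the column names and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "user": ["ann", "bob"],
        "skills": [["python", "sql", "ml"], ["r"]],
    }
)

# Each list element gets its own row; the other columns are repeated
exploded = df.explode("skills", ignore_index=True)

print(exploded)
```

Without ignore_index=True, the exploded rows keep the index of the row they came from, so you end up with repeated index labels.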

## 8. Why beginners won't do LR and keep choosing XGBoost

Last year, I saw that a tabular competition on Kaggle was won by an ensemble of Quadratic Discriminant Analysis models. What is QDA, you ask? I had no idea either.

It was a very eye-opening experience for me as a beginner, because I had thought that, having learned XGBoost, I could just ignore any older models.

I was distracted by the hot tools. Turns out, it isn't about the tool but how quickly and efficiently you can solve a problem.

Later, I found that for that particular competition's data, QDA was orders of magnitude faster than any tree-based models and could easily beat them in terms of performance.

So, the moral here is: don't approach problems with a tools-first mindset. Rather, look for the simplest way to solve them well. Don't try to look "cool" by using whatever is popular at the time.

## 9. Set datetime index for plotting

Having a DatetimeIndex in your dataframe makes it stupidly easy to visualize time series. You don't even have to import matplotlib: just extract the column(s) you want from the dataframe and call plot() on them. Pandas takes care of the rest.

![](../images/2022/7_july/july-datetime_index_plot.png)
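A minimal sketch with synthetic data (the "sales" column and the random values are assumptions for illustration). Matplotlib is not imported here, but Pandas still needs it installed to draw the plot.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2022-01-01", periods=90, freq="D")
rng = np.random.default_rng(0)

df = pd.DataFrame({"date": dates, "sales": rng.normal(100, 10, size=90).cumsum()})

# Move the datetime column into the index
df = df.set_index("date")

# Pandas puts the dates on the x-axis automatically
ax = df["sales"].plot(title="Daily sales (synthetic)")
```

When reading from a file, passing parse_dates and index_col to read_csv gets you the same DatetimeIndex in one step.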

## 10. Chaining multiple Pandas functions with Pandas pipe

Pandas has a "pipeline" feature similar to the one in Sklearn. By chaining multiple "pipe" calls together, you can apply multiple preprocessing functions in a single statement. This makes your code much more readable and easier to debug.

![](../images/2022/7_july/july-pandas_pipe.png)
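A minimal sketch of such a chain. The three helper functions and the toy data are hypothetical, written only for this example; each one takes a dataframe as its first argument and returns a new dataframe, which is all pipe requires.

```python
import pandas as pd


def drop_missing(df):
    """Remove rows that contain any missing value."""
    return df.dropna()


def clip_outliers(df, column, lower, upper):
    """Limit the values of one column to the [lower, upper] range."""
    return df.assign(**{column: df[column].clip(lower, upper)})


def add_ratio(df, numerator, denominator, name):
    """Add a new column holding numerator / denominator."""
    return df.assign(**{name: df[numerator] / df[denominator]})


raw = pd.DataFrame(
    {"price": [10.0, None, 500.0, 30.0], "quantity": [2, 3, 5, 10]}
)

clean = (
    raw.pipe(drop_missing)
    .pipe(clip_outliers, "price", 0, 100)
    .pipe(add_ratio, "price", "quantity", "unit_price")
)

print(clean)
```

Extra positional arguments after the function name are forwarded to the function, which is why the helpers can take parameters.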

## 11. Reading missing values with custom representation

People denote missing values on a whim. It might be 0, -9999, # or any other symbol/word that comes to their mind. You can immediately catch those values and encode them properly as NaN values while reading the data with read_csv. Just pass the custom missing value to "na_values".

![](../images/2022/7_july/july-pandas_na_values.png)
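A minimal sketch using an in-memory CSV (the columns and the -9999 and # markers are made up for illustration):

```python
import io

import pandas as pd

csv_text = """id,age,income
1,34,52000
2,-9999,48000
3,29,#
"""

# Treat -9999 and # as missing values while parsing
df = pd.read_csv(io.StringIO(csv_text), na_values=["-9999", "#"])

print(df)
print(df.isna().sum())
```

Be careful with markers like 0, which are often legitimate values in other columns. na_values also accepts a dict that maps column names to their own missing-value markers.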

## 12. Format dates in Matplotlib plots

Has it ever happened to you that, when you visualized a time series, the dates on the x-axis got smooshed together, making them illegible? You can avoid that by calling the "autofmt_xdate()" function on the figure object to automatically format date labels in Matplotlib.

![](../images/2022/7_july/july-autofmt_xdate.png)
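A minimal sketch with synthetic data (the random-walk values are an assumption for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

dates = pd.date_range("2022-01-01", periods=60, freq="D")
values = np.random.default_rng(0).normal(size=60).cumsum()

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(dates, values)

# Rotate and right-align the date labels so they stop overlapping
fig.autofmt_xdate(rotation=30, ha="right")
```

The rotation and ha values shown are also the defaults, so a bare fig.autofmt_xdate() gives the same result.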

## 13. Array to latex library

array_to_latex is an awesome library that converts NumPy arrays and Pandas DataFrames to LaTeX code. Using the library, you can print the code, store it as a string in a variable or even cooler, copy it to your clipboard!

![](../images/2022/7_july/july-array_to_latex.png)

Link: https://github.com/josephcslater/array_to_latex

## 14. np.logspace

np.linspace has a younger brother in NumPy. While linspace generates evenly-spaced numbers, logspace returns evenly-spaced numbers on a log scale. You can choose any base you want.

![](../images/2022/7_july/july-np_logspace.png)
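A quick side-by-side sketch. The one thing that trips people up is that logspace takes exponents, not the endpoint values themselves, as its start and stop arguments.

```python
import numpy as np

# Evenly spaced on a linear scale between 1 and 1000
linear = np.linspace(1, 1000, num=4)

# Evenly spaced on a log scale: 10**0, 10**1, 10**2, 10**3
log10_values = np.logspace(0, 3, num=4)

# Same exponents with base 2: 2**0, 2**1, 2**2, 2**3
log2_values = np.logspace(0, 3, num=4, base=2)

print(linear)
print(log10_values)
print(log2_values)
```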

## 15. Distfit

People fear the unknown - darkness, death, unfamiliar distributions in the data...

Well, that's no longer the case (at least for distributions). Using the distfit library, you can test your data against 89 known distributions in SciPy. The resulting visual shows the best theoretical fit for your empirical data. You can throw in some values to check whether they are outliers or not, as well.

The code to generate the plot and distfit library in the comments👇

![](../images/2022/7_july/distfit.png)

Link to distfit: https://bit.ly/3Rdn2SW

Link to the code that generated the plot: https://bit.ly/3nJMFgM

## 17. Faker - generate fake data

As if all the data in the world is not enough, you can generate synthetic datasets as well. Faker is one of the best libraries to do this in Python.

Every time you call a faker function, it returns a new random name, address, email, phone number or many dozens of other fake attributes. Below is a sample banking dataset with 10k records.

![](../images/2022/7_july/july-faker.png)

Link to the library: https://faker.readthedocs.io/

## 18. Mlxtend - plot decision boundaries of classifiers

One of the most fun things you can do with your classifier is plot its decision boundaries. But, you will quickly realize that the code to generate such a plot is, put mildly, a giant pain in the keyboard. 

Fortunately, the mlxtend package collapses all that code into a function, so that you can draw decision boundaries of any classifier in a single line of code👇

![](../images/2022/7_july/july-decision_boundaries.png)

## 19. Random Forest is the coolest name!
The machine learning community won't come up with a name as cool as Random Forests ever again.

# Resources

## 20. Speech and language processing - Dan Jurafsky

Speech and Language Processing is one of the most comprehensive books on NLP theory. 

Even though the book claims to be introductory, it spans 27 chapters and over 600 pages, covering NLP techniques from basic regex to chatbots and dialogue systems. It is a great book for anyone trying to get to the bottom of NLP theory.

Download link in the comments👇

![](../images/2022/7_july/dan_jurafsky.jpg)

Download link: https://stanford.io/3yIUTf9

## 21. CT-GAN - generate synthetic data from existing sources

There are far more private datasets than open-source ones. But private datasets can be shared too, if you make sure to preserve the anonymity and fidelity of the data.

One of the best tools to do this is the CTGAN library, which when fit to a dataset, can generate a synthetic dataset with the same distributions and features as the original but hiding any sensitive information. 

The resulting dataset would be completely unrecognizable but still have the statistical properties of the original. 

The CTGAN library is based on the "Modeling Tabular Data using Conditional GAN" paper. Link to the Python API and paper in the comments👇

![](../images/2022/7_july/ctgan.png)

CTGAN Python package: https://github.com/sdv-dev/CTGAN

CTGAN paper: https://bit.ly/3RgoPXw

## 22. Forecasting with Darts

Time Series forecasting with Darts🎯

Darts is one of the most popular open-source libraries for time series forecasting. It offers a simple Sklearn-like API to work with univariate and multivariate time series datasets. 

It contains a large set of forecasting models from classic ARIMA to Torch models with GPUs and TPUs. It supports PyTorch Lightning as well!

Check out the library from the link below👇

![](../images/2022/7_july/darts_example.png)

Link to the library: https://bit.ly/3bKVcNv

## 23. Discoart by Jina AI

High-quality image generation right from Google Colab🔥

Jina AI recently released Discoart - an open-source project for generating Disco Diffusion artworks based on text prompts.

Unlike many other image generating models, the library is fully optimized to work with Google Colab's free tier. It has a stupidly simple API allowing you to generate complex images in a single line of code.

Link to the library in the comments👇

![](../images/2022/7_july/discoart.gif)

Link to Discoart: https://github.com/jina-ai/discoart

## 24. Seeing Theory

You have probably rolled a million dice and drawn a million cards while learning statistics. Wouldn't it be just freakin' awesome to visualize all those?

Seeing Theory is one of the best statistics resources on the Internet, mainly for its interactive and animated content. It has beautiful, ingenious visualization tools and explanations that cover important probability concepts in 6 chapters:

✅ Basic Probability

✅ Compound Probability

✅ Probability Distributions

✅ Frequentist Inference

✅ Bayesian Inference

✅ Regression Analysis


Seeing Theory website is an Internet gem. Link in the comments👇

Comments:

Seeing Theory: https://seeing-theory.brown.edu/

![](../images/2022/7_july/seeing_theory.gif)

## 25. Intermediate Python - the book

Most people are at a Python standstill. What is that?

When learning data science, beginners learn the basics of Python just enough to move on to fancier stuff like visualization and ML algorithms. Distracted by the vast data science stack, they never go back to strengthening their Python background.

The open-source book "Intermediate Python" is designed solely to fill in the gaps in your Python knowledge. With over 20 chapters, the book covers some cool Python concepts you never knew existed like:

✅ Decorators

✅ for/else blocks

✅ Coroutines

✅ Generators

✅ Hidden gems from the built-in libraries

and many more. Read it online in the link below👇

Link: https://bit.ly/3yinAy9
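To give a taste of three of those concepts, here is a short sketch. The examples are written for this post and are not taken from the book; all function names are made up.

```python
import functools


def log_calls(func):
    """Decorator: remember the arguments of every call on the wrapper."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls.append((args, kwargs))
        return func(*args, **kwargs)

    wrapper.calls = []
    return wrapper


@log_calls
def first_negative(numbers):
    """Return the first negative number, or None if there is none."""
    for n in numbers:
        if n < 0:
            break
    else:
        # for/else: this branch runs only if the loop never hit "break"
        return None
    return n


def countdown(start):
    """Generator: lazily yields start, start - 1, ..., 1."""
    while start > 0:
        yield start
        start -= 1


print(first_negative([3, -1, 4]))
print(first_negative([1, 2]))
print(list(countdown(3)))
print(len(first_negative.calls))
```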

## 26. data-to-viz.com

Become a data visualization wizard!

Data to Viz is a masterpiece of a platform for anyone trying to convey data insights in the best way possible. 

It lets you choose the most appropriate chart for your data, shows you the code to do it and highlights how to avoid common pitfalls while plotting that chart.

Its library contains almost 40 types of visuals, sorted by purpose and data type. 

Check out the platform from the link below👇

![](../images/2022/7_july/datatoviz.gif)

Link: https://www.data-to-viz.com/