# 5 Signs You've Become an Advanced Pandas User Without Even Realizing It
## Time to take credit
![](images/pixabay.jpg)

<figcaption style="text-align: center;">
    Image by <a href="https://pixabay.com/users/barbaraalane-756613/?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=2144354">Barbara A Lane</a> from <a href="https://pixabay.com//?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=2144354">Pixabay</a>
</figcaption>

### Introduction

### 0. Know when to ditch Pandas

- When you first started out, it might've seemed like Pandas could do everything and that learning it would be enough
- Of course, this is largely due to the fact that many online courses out there market Pandas like that
- But, Pandas has many shortcomings and you are now able to spot them from a mile away
- Instead of blindly busting out Pandas for every data-related task, you know how to take a step back and ask, "Is Pandas the best option here?"
- There are a few scenarios where the answer to that question is a NO with an exclamation mark. 
    1. Real-time data processing - Imagine a cannon that shoots pieces of real-time data from some process at 100 sph (shots per hour :). The pieces are coming fast and furious and you have to catch, process and save each one mid-air. Put gently, Pandas would suffocate. So, you would turn to streaming tools like Apache Kafka.
    2. Massive datasets - when Wes McKinney first wrote Pandas, he had a rule of thumb: the RAM must be 5-10 times bigger than the dataset size for Pandas to work optimally. "Easy enough", you would say if it were 2013, but today, not so much.
    3. High-performance computing - this is like conducting a symphony. Just as a conductor needs to coordinate the actions of many different musicians to create a harmonious performance, high-performance computing tasks require coordination and synchronization of multiple processing elements to achieve the best results. As for Pandas, it plays solo: most of its operations run on a single CPU core.
    4. Production-level data pipelines - Think of data pipelines as a water supply system. Just as a water supply system needs to be reliable, scalable, and maintainable to ensure a constant supply of clean water, data pipelines need similar qualities. While Pandas may take care of cleaning and transformation, the rest should be handled by other libraries. 

- It may be hard to leave the loving furry arms of Pandas, but don't feel guilty about exploring other options when Pandas isn't enough
- Personally, I recently took a great liking to Polars. It is written in Rust from the ground up and aims to fix many of the shortcomings of Pandas.

https://towardsdatascience.com/7-easy-steps-to-switch-from-pandas-to-lightning-fast-polars-and-never-return-b14c66fc85b9

- You can also play mix-and-match with libraries like datatable. Here is a code snippet I often use to load large CSVs in a fraction of a second:

```python
import datatable as dt

df = dt.fread("my_large_file.csv").to_pandas()
```
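
To put a number on the RAM rule of thumb from the massive-datasets point, here is a rough sketch of how you might check it. The toy DataFrame, the 16 GiB figure and the `fits_rule_of_thumb` helper are all made up for illustration; only `memory_usage(deep=True)` is a real Pandas call.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a real dataset (hypothetical columns).
df = pd.DataFrame({
    "id": np.arange(100_000),
    "city": ["Tashkent", "Lisbon", "Osaka", "Quito"] * 25_000,
})

# deep=True also measures the strings themselves, not just the pointers to them.
size_bytes = int(df.memory_usage(deep=True).sum())


def fits_rule_of_thumb(dataset_bytes, ram_bytes, factor=5):
    """Hypothetical helper: True if RAM is at least `factor` times the dataset size."""
    return ram_bytes >= factor * dataset_bytes


ram_bytes = 16 * 1024**3  # assumption: a machine with 16 GiB of RAM
print(f"{size_bytes / 1024**2:.1f} MiB in memory")
print(fits_rule_of_thumb(size_bytes, ram_bytes))
```

If the check fails even with `factor=5`, that is usually a hint to look at chunked loading or another library before fighting Pandas.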

### 1. Need For Speed

- Pandas, humongous library that it is, often offers several ways to perform a single task
- You, the old pro, know which method works the fastest in which situations
- For example, you know the differences between iteration functions like `apply`, `applymap`, `map`, `iterrows` and `itertuples` like the back of your hand
- You are fully aware of trade-offs between using a slower alternative for better functionality and using the best one for optimal speed
- Even though people call you fussy, you like to use `iloc` and `loc` carefully because you know that the first one selects by integer position while the other selects by label, and that position-based lookups tend to be a little faster.
- But you try to avoid these accessors for conditional filtering on large DataFrames because you know that the `query` function can be noticeably faster there. And you also know that the `replace` function is best friends with `query` when it comes to replacing values.
- Besides, you are comfortable with different file formats and consciously choose between CSVs, Parquets, Feathers and HDFs instead of just blindly pouring everything into good-old CSVs. You understand that choosing the right file format for your data may save you hours and memory resources down the line.
- Above all else, you have a deadly trick up your sleeve - vectorization!
- Instead of looking at DataFrames as _data frames_, you think of them as matrices and the columns as vectors.
- So, whenever you find yourself itching to use an iteration function like `apply` or `itertuples`, you check whether you can use vectorization to apply a certain function to all elements in a column simultaneously rather than one-by-one.
- On top of that, you use the underlying NumPy arrays (through the `.values` attribute or the newer `.to_numpy()` method) instead of Pandas Series because you've seen first-hand how much faster vectorization can be with NumPy arrays.
- When all else fails, you don't call it a day and give up. No. 
- You turn to either Cython or Numba for truly computationally intensive tasks because you are a pro. While most people learned Pandas superficially, you spent a few excruciating hours learning these two technologies. That's what sets you apart.
- As if all these weren't enough, you have given the [Enhancing performance](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html) page of the Pandas user guide a thorough read.
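
To make the vectorization point concrete, here is a small sketch that computes the same column four ways, from slowest to (usually) fastest. The `price` and `quantity` columns are invented for the example, and the speed ranking is the typical one rather than a guarantee, so time it on your own data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(1, 100, size=10_000),
    "quantity": rng.integers(1, 10, size=10_000),
})

# 1. apply with axis=1: calls a Python function once per row.
slow = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

# 2. itertuples: still a Python loop, but with lighter per-row overhead.
medium = pd.Series(
    [row.price * row.quantity for row in df.itertuples(index=False)],
    index=df.index,
)

# 3. Vectorized on Series: one operation over whole columns.
fast = df["price"] * df["quantity"]

# 4. Vectorized on the underlying NumPy arrays: skips index alignment.
fastest = df["price"].to_numpy() * df["quantity"].to_numpy()

# All four approaches produce the same numbers.
print(np.allclose(slow, fast), np.allclose(medium, fastest))
```

In a notebook, wrapping each approach in `%timeit` is a quick way to see the gap for yourself.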

### 2. So many data types

- Pandas offers so much flexibility with data types. 
- Instead of just using plain `float`, `int`, and `object` data types, you have made the following two images your wallpapers:

![image.png](attachment:2316d987-6929-4f76-a5a6-c50766ebb54c.png)
PBPython BSD-3 clause

![image.png](attachment:d546d515-bb86-432f-8964-d9fe914c0d2d.png)
SciPy docs

- You consciously choose from the above lists based on your data because you know that using the smallest data type possible is very friendly for your RAM (`int8` takes up an eighth of the memory of `int64`, and the same logic applies to floats).
- You also avoid the `object` data type like the plague. It is the slowest and most memory-hungry one there is.
- Before reading data files, you peek at their top few rows with `head file_name.extension` to decide which data types you want to use for the columns
- Then, when using `read_*` functions, you take the steering wheel and fill out the `dtype` parameter for each column instead of letting Pandas decide for itself
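- You also keep a close eye on copies. You know that most operations spawn off new DataFrames and Series, littering your memory, and that `inplace=True` often still makes a copy under the hood, so you drop references to intermediate results as soon as you are done with them.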
- And you have a firm grip on classes and parameters like `pd.Categorical` and `chunksize`
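
Here is a minimal sketch of taking the steering wheel with `dtype` and `chunksize`. The CSV is generated in memory and its columns are invented for the example, so treat the dtype choices as assumptions that happen to fit this toy data.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a real file (hypothetical columns).
csv_text = "user_id,age,country,score\n" + "\n".join(
    f"{i},{20 + i % 50},{['UZ', 'US', 'JP'][i % 3]},{i % 7 / 2}"
    for i in range(10_000)
)

# Let Pandas decide: integers and floats come back as 64-bit types.
inferred = pd.read_csv(io.StringIO(csv_text))

# Decide yourself: small numeric types plus a categorical for the repeated strings.
dtypes = {"user_id": "int32", "age": "int8", "country": "category", "score": "float32"}
typed = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print("inferred:", inferred.memory_usage(deep=True).sum(), "bytes")
print("typed:   ", typed.memory_usage(deep=True).sum(), "bytes")

# chunksize turns read_csv into an iterator of smaller DataFrames,
# so the whole file never has to sit in memory at once.
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), dtype=dtypes, chunksize=2_500):
    total_rows += len(chunk)
print(total_rows)
```

The small types are only safe because the values fit (ages below 128, IDs below two billion), which is exactly why you look at the data first.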

### 3. Friends with Pandas

- If there is one thing that makes Pandas the king of data analysis libraries, it's got to be its integration with the rest of the data ecosystem
- For example, by now you must have realized how you can change the plotting backend of Pandas from Matplotlib to Plotly, hvPlot, HoloViews, Bokeh, or Altair. Yes, Matplotlib is best friends with Pandas, but once in a while, you fancy something interactive like Plotly or Altair

```python
import pandas as pd
pd.options.plotting.backend = "plotly"  # assumes the plotly package is installed
```

- Speaking of backends, you've also noticed that the brand-new 2.0.0 release added PyArrow-backed data types, which the `read_*` functions can use through the `dtype_backend` parameter when loading data files.
- Before that, NumPy was the only backend, and you know its limitations: little support for non-numeric data types, near-total disregard for missing values, and weak support for more complex data structures (dates, timestamps, categoricals).
- Before 2.0.0, Pandas had been cooking up in-house solutions to these, but they were not as good as some heavy users had hoped
- With the PyArrow backend, loading data can be considerably faster, and it brings a suite of data types that Apache Arrow users are familiar with
- How about web scraping? Like me, you must love how the `read_html` function can retrieve the tables from an HTML page using just its link and return them as a list of DataFrames. In a single line of code, Pandas puts well-known parsing libraries like `beautifulsoup4` and `lxml` to work
- Another cool feature of Pandas I am sure you use all the time in JupyterLab is styling DataFrames.
- Since project Jupyter is so awesome, Pandas developers added a bit of HTML/CSS magic so you can spice up plain old DataFrames in a way that reveals additional insights
- And I don't even have to mention the countless functions and classes Pandas borrows from NumPy and SciPy, with a lot of people none the wiser.
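
As one small illustration of the NumPy point, here is a sketch of how NumPy functions and Pandas objects pass each other around. The Series and DataFrame are toy data invented for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"], name="area")

# NumPy ufuncs accept a Series and hand back a Series with the labels intact.
roots = np.sqrt(s)
print(type(roots).__name__, list(roots.index))

# Going the other way: pull out the raw array and use any NumPy routine on it.
arr = s.to_numpy()
mean_value = float(np.mean(arr))

# The same works for building new DataFrame columns.
df = pd.DataFrame({"x": [1, 10, 100]})
df["log10_x"] = np.log10(df["x"])
print(df)
```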

### 4. The data sculptor