## AVOID THIS IN PANDAS

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = dict()
data['old_column'] = np.random.random(10_239)
df = pd.DataFrame(data)
df.head()

Unnamed: 0,old_column
0,0.061572
1,0.865746
2,0.855411
3,0.21795
4,0.756313


#### **MISTAKE 1**
**Don't use iteration when vectorization is an option**

In [3]:
%%timeit
# Mistake
for index, row in df.iterrows():
    if (df.at[index, 'old_column'] > 0.5):
        df.at[index, 'new_column'] = True
    else:
        df.at[index, 'new_column'] = False



537 ms ± 27.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [4]:
%%timeit
# Solution
df['new_column'] = df['old_column'] > 0.5

63.1 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
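The same pattern covers cases where the new column should hold something other than a boolean. A minimal sketch, assuming hypothetical names (<code>df_demo</code>, <code>label</code>) and seeded random data that are not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical data, seeded so the example is reproducible
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({'old_column': rng.random(1_000)})

# Element-by-element version, kept only to check the vectorized result
looped = [value > 0.5 for value in df_demo['old_column']]

# Vectorized comparison: one operation over the whole column
df_demo['new_column'] = df_demo['old_column'] > 0.5

# np.where picks between two values per row, still without a Python loop
df_demo['label'] = np.where(df_demo['old_column'] > 0.5, 'high', 'low')
```

Both versions produce the same values; the vectorized one simply hands the loop to NumPy.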


#### **MISTAKE 2**

**Avoid <code>inplace=True</code>**

When you use <code>inplace=True</code>, you modify the original DataFrame directly. This may seem convenient, but it can cause unexpected side effects and makes the code harder to understand and debug.

Because no new DataFrame is created, you cannot easily return to the original state. If an error occurs later in your script, recovering the initial DataFrame for debugging is difficult.

In [5]:
# Mistake
df.drop_duplicates(inplace=True)

In [6]:
# Solution
df = df.drop_duplicates()
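Returning a new DataFrame also makes it possible to keep the original under its own name and to chain steps. A small sketch, assuming hypothetical names (<code>raw</code>, <code>deduped</code>, <code>cleaned</code>) and made-up values:

```python
import pandas as pd

# Hypothetical data containing duplicate rows
raw = pd.DataFrame({'old_column': [0.1, 0.1, 0.7, 0.7, 0.9]})

# Assigning the result to a new name leaves `raw` untouched
deduped = raw.drop_duplicates()

# Without inplace, several steps can be chained in one expression
cleaned = (
    raw
    .drop_duplicates()
    .reset_index(drop=True)
)
```

Here <code>raw</code> still holds all five rows, so it remains available if a later step needs to be debugged.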

#### **MISTAKE 3**

**Don't forget to reset index after operations**

In [7]:
# Potential mistake (not always a problem): the filtered rows keep their old index labels
df2 = df.query('new_column == True')
df2.head(5)

Unnamed: 0,old_column,new_column
1,0.865746,True
2,0.855411,True
4,0.756313,True
8,0.544334,True
11,0.912871,True


In [8]:
# Solution: renumber the rows from 0 (the old labels are kept in an `index` column)
df2 = df.query('new_column == True').reset_index()
df2.head(3)

Unnamed: 0,index,old_column,new_column
0,1,0.865746,True
1,2,0.855411,True
2,4,0.756313,True
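Why the stale index matters, and how to avoid the extra <code>index</code> column, can be sketched as follows. The names <code>df_demo</code>, <code>filtered</code>, and <code>renumbered</code> are hypothetical, and the values are rounded stand-ins for the data above:

```python
import pandas as pd

# Hypothetical five-row frame
df_demo = pd.DataFrame({'old_column': [0.06, 0.87, 0.86, 0.22, 0.76]})
df_demo['new_column'] = df_demo['old_column'] > 0.5

filtered = df_demo.query('new_column == True')
# The filtered frame keeps the original labels of the surviving rows
print(filtered.index.tolist())

# Label 0 was filtered out, so a label-based lookup for it fails
try:
    filtered.loc[0]
except KeyError:
    print('label 0 no longer exists')

# drop=True discards the old labels instead of storing them in an `index` column
renumbered = filtered.reset_index(drop=True)
```

Use <code>drop=True</code> when the old labels carry no meaning; keep the default when you still need to trace rows back to the original DataFrame.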


#### **MISTAKE 4**

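**Don't group the whole DataFrame when you need only one group**

When you need an aggregate for a single group, filtering the DataFrame first (for example <code>df[df['country'] == 'US']</code>) and then calculating the mean is often more computationally efficient than grouping the entire DataFrame and extracting one value. The benefit depends on the data, though: in the timings below, with only two groups and the filter written as a <code>query</code> string, the <code>groupby</code> version was actually faster (226 µs vs. 511 µs), so measure on your own data.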

In [9]:
%%timeit
# Mistake
df.groupby('new_column')['old_column'].mean()[True]

226 µs ± 2.89 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [10]:
%%timeit
# Solution
df.query('new_column == True')['old_column'].mean()

511 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
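Part of the cost of <code>query</code> is parsing the expression string, so a plain boolean mask is another way to filter that may be worth timing. A sketch on hypothetical seeded data (<code>df_demo</code> is not part of the original notebook); no timings are claimed here because they depend on data size and the number of groups:

```python
import numpy as np
import pandas as pd

# Hypothetical data, seeded so the example is reproducible
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({'old_column': rng.random(10_000)})
df_demo['new_column'] = df_demo['old_column'] > 0.5

# Group everything, then pick one group's mean
via_groupby = df_demo.groupby('new_column')['old_column'].mean()[True]

# Filter with a query string, then aggregate
via_query = df_demo.query('new_column == True')['old_column'].mean()

# Filter with a boolean mask, then aggregate
via_mask = df_demo.loc[df_demo['new_column'], 'old_column'].mean()
```

All three give the same mean up to floating-point rounding, so the choice comes down to readability and measured speed.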


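#### **MISTAKE 5**

**Don't compare DataFrames with <code>==</code> when you need a single True/False**

<code>df1 == df2</code> compares element by element and returns a DataFrame of booleans, so using it in an <code>if</code> raises a <code>ValueError</code>. <code>df1.equals(df2)</code> returns a single boolean.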

In [11]:
# Mistake
def equal_bad(df1, df2):
    return 1 if df1 == df2 else 0

In [12]:
# Solution
def equal_good(df1, df2):
    return 1 if df1.equals(df2) else 0
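The difference between the two comparisons can be seen on a tiny example. This is a sketch with made-up data; the variable names <code>elementwise</code>, <code>is_equal</code>, and <code>same</code> are hypothetical:

```python
import numpy as np
import pandas as pd

# Two frames with identical contents, including a NaN in the same position
df1 = pd.DataFrame({'a': [1.0, np.nan]})
df2 = pd.DataFrame({'a': [1.0, np.nan]})

# `==` compares element by element and returns a DataFrame of booleans;
# NaN == NaN is False, so the second row does not match
elementwise = df1 == df2

# Using that DataFrame as a condition raises ValueError (ambiguous truth value)
try:
    is_equal = 1 if df1 == df2 else 0
except ValueError:
    is_equal = None

# `.equals` returns a single boolean and treats NaNs in the same location as equal
same = df1.equals(df2)
```

If you do want the element-wise result collapsed to one value, <code>(df1 == df2).all().all()</code> works, but it reports a mismatch wherever both frames hold NaN.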