# Chapter 4 - Code Optimization

## Code Optimization Techniques in NumPy and Pandas

NumPy can operate on an entire array at once, a technique called vectorization: the loop over the elements runs in compiled code instead of the Python interpreter, which substantially improves performance. The simple example below compares the speed of Python's built-in sum function with NumPy's vectorized np.sum function.

## Using NumPy vs Python functions

In [1]:
import numpy as np
import time

# Define a large array
large_array = np.random.rand(10**6)

# NumPy way
start = time.time()
print("NumPy sum:", np.sum(large_array))  # Sum computed by NumPy's vectorized function
print("Time to calculate the sum with np.sum:", time.time() - start)

# Python way: the built-in sum iterates over the array element by element
start = time.time()
print("Built-in sum:", sum(large_array))  # Sum computed by Python's built-in function
print("Time to calculate the sum with the built-in sum:", time.time() - start)

NumPy sum: 500189.0839013288
Time to calculate the sum with np.sum: 0.0003769397735595703
Built-in sum: 500189.0839012991
Time to calculate the sum with the built-in sum: 0.04026198387145996

In this run np.sum is roughly 100 times faster (about 0.0004 s versus 0.04 s). The two totals differ in their last digits because the additions are performed in a different order.
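The timings above come from a single time.time() measurement, and the built-in sum there iterates over a NumPy array rather than a real Python list. The sketch below is one way to make the comparison more robust, assuming the standard-library timeit module and a smaller array so that it runs quickly. The names arr, py_list and timings are illustrative and not part of the original notebook.

```python
import timeit

import numpy as np

# Smaller array than in the cell above so the sketch runs quickly
rng = np.random.default_rng(seed=0)
arr = rng.random(10**5)
py_list = arr.tolist()  # a genuine Python list of floats

# Each statement is executed 20 times; the total elapsed seconds are stored
timings = {
    "np.sum on ndarray": timeit.timeit(lambda: np.sum(arr), number=20),
    "built-in sum on ndarray": timeit.timeit(lambda: sum(arr), number=20),
    "built-in sum on list": timeit.timeit(lambda: sum(py_list), number=20),
}

for label, seconds in timings.items():
    print(f"{label}: {seconds:.5f} s for 20 runs")

# The totals agree up to floating-point rounding
print(np.isclose(np.sum(arr), sum(py_list)))
```

Absolute numbers depend on the machine, so only the relative ordering of the three timings is meaningful.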


## Pandas Optimization Techniques

In [2]:
import pandas as pd
from sklearn import datasets
import numpy as np

# Load the California Housing dataset
california = datasets.fetch_california_housing()
df = pd.DataFrame(data=np.c_[california['data'], california['target']], columns=california['feature_names'] + ['target'])
print(df.head())
df.info()  # info() prints its report directly and returns None

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  target  
0    -122.23   4.526  
1    -122.22   3.585  
2    -122.24   3.521  
3    -122.25   3.413  
4    -122.25   3.422  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOc

### Choosing pd.Categorical data type

Converting a column to the pd.Categorical data type (or calling .astype('category')) can yield significant memory savings for categorical data, that is, data that takes on a limited, usually fixed, number of possible values. A categorical column stores each distinct value once and represents the rows as small integer codes, so the saving is largest for low-cardinality columns (few unique values repeated many times), typically strings; grouping on such columns can also be much faster. High-cardinality columns do not benefit: the continuous MedInc column is converted below only to show the syntax, and the DataFrame grows from about 1.4 MB to 1.9 MB as a result.

In [3]:
df['MedInc'] = df['MedInc'].astype('category')

# The California Housing dataset has no categorical feature such as 'Type', so the next line would raise a KeyError
# df['Type'] = pd.Categorical(df['Type'])

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   MedInc      20640 non-null  category
 1   HouseAge    20640 non-null  float64 
 2   AveRooms    20640 non-null  float64 
 3   AveBedrms   20640 non-null  float64 
 4   Population  20640 non-null  float64 
 5   AveOccup    20640 non-null  float64 
 6   Latitude    20640 non-null  float64 
 7   Longitude   20640 non-null  float64 
 8   target      20640 non-null  float64 
dtypes: category(1), float64(8)
memory usage: 1.9 MB
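The 1.9 MB figure above is larger than the original DataFrame because MedInc has many distinct values. The following sketch contrasts that case with a low-cardinality string column, where the category type does pay off. The data is synthetic and the region labels are hypothetical; nothing here comes from the California Housing dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical low-cardinality column: 100,000 rows, only four distinct labels
labels = rng.choice(["north", "south", "east", "west"], size=100_000)
low_card = pd.Series(labels, dtype=object)
low_card_cat = low_card.astype("category")

# High-cardinality column: 100,000 random floats, practically all distinct
high_card = pd.Series(rng.random(100_000))
high_card_cat = high_card.astype("category")

def megabytes(series: pd.Series) -> float:
    """Memory footprint of a Series in MB, including object contents."""
    return series.memory_usage(deep=True) / 1024**2

print(f"strings as object:   {megabytes(low_card):.2f} MB")
print(f"strings as category: {megabytes(low_card_cat):.2f} MB")
print(f"floats as float64:   {megabytes(high_card):.2f} MB")
print(f"floats as category:  {megabytes(high_card_cat):.2f} MB")
```

Comparing memory_usage(deep=True) before and after the conversion is a quick way to check whether a given column is worth converting.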


### Downcasting Numerical Columns

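When the features are not categorical, memory can still be reduced by downcasting numeric columns to smaller data types. The downcast parameter of the pd.to_numeric() function converts a column to the smallest numeric type that can hold its values; with downcast='float', a float64 column becomes float32. A column can also be cast explicitly with .astype(), as done for Population below. Casting floats to an integer type truncates any fractional part, so it suits count-like columns.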

In [5]:
# Downcast the 'AveBedrms' column from float64 to float32
df['AveBedrms'] = pd.to_numeric(df['AveBedrms'], downcast='float')
# Cast the 'Population' column explicitly to a 32-bit integer
df['Population'] = df['Population'].astype('int32')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   MedInc      20640 non-null  category
 1   HouseAge    20640 non-null  float64 
 2   AveRooms    20640 non-null  float64 
 3   AveBedrms   20640 non-null  float32 
 4   Population  20640 non-null  int32   
 5   AveOccup    20640 non-null  float64 
 6   Latitude    20640 non-null  float64 
 7   Longitude   20640 non-null  float64 
 8   target      20640 non-null  float64 
dtypes: category(1), float32(1), float64(6), int32(1)
memory usage: 1.7 MB
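The effect of the downcast parameter can be checked in isolation. The sketch below uses two small made-up Series rather than the housing data and also illustrates the precision cost of float32, which keeps roughly seven significant decimal digits. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 100, 120], dtype="int64")
floats = pd.Series([1.5, 2.25, 3.0], dtype="float64")

# 'integer' picks the smallest signed integer type that holds all values
small_ints = pd.to_numeric(ints, downcast="integer")
# 'float' targets float32, the smallest float type pandas will downcast to
small_floats = pd.to_numeric(floats, downcast="float")

print(ints.dtype, "->", small_ints.dtype)
print(floats.dtype, "->", small_floats.dtype)

# Round-tripping a value through float32 loses the trailing digits
value = 0.123456789012
roundtrip = float(np.float32(value))
print(f"absolute error after float32 round trip: {abs(roundtrip - value):.2e}")
```

Whether that loss matters depends on the column: it is usually harmless for averages such as AveBedrms, but may not be for values that need many significant digits.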


### Method Chaining (and inplace=True)

Another valuable asset is method chaining, which combines several operations into a single expression. Chaining improves readability and avoids naming intermediate results; it does not by itself make the operations run faster.

The inplace parameter of many Pandas operations is often described as more memory-friendly because it modifies the DataFrame directly instead of returning a new one. In practice many operations often still build a new object internally, so the saving is not guaranteed. Calling an inplace method on a filtered selection, as in df[df['Population'] > 1000].dropna(inplace=True), modifies a temporary object that is then discarded, and Pandas emits the warning "A value is trying to be set on a copy of a slice from a DataFrame" (see the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy). Assigning the result of the chain avoids both problems.

In [7]:
# Step-by-step way: take an explicit copy of the filtered rows, then drop missing values
df_copy = df[df['Population'] > 1000].copy()
df_copy = df_copy.dropna()

# Chained way: filter and drop missing values in one expression
df_filtered = df[df['Population'] > 1000].dropna()
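A longer chain can be sketched on a small synthetic DataFrame. The values below are made up, and the RoomsPerOccupant column is a hypothetical derived feature introduced only for illustration. The point is that every step returns a new object, so the chain's result is assigned once and the source frame stays untouched.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "Population": [500, 1500, 2500, 3000],
    "AveRooms": [5.0, 6.0, np.nan, 10.0],
    "AveOccup": [2.0, 3.0, 2.5, 4.0],
})

result = (
    toy.loc[toy["Population"] > 1000]   # keep rows with a large population
       .dropna()                        # drop rows with missing values
       .assign(RoomsPerOccupant=lambda d: d["AveRooms"] / d["AveOccup"])
       .sort_values("RoomsPerOccupant")
       .reset_index(drop=True)
)

print(result)
print(len(toy), "rows remain in the source frame")
```

Wrapping the chain in parentheses allows one method per line, which keeps each step easy to comment out while debugging.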


### Practical Example of Efficient Memory Utilization


In [8]:
import pandas as pd
from sklearn import datasets
import numpy as np

# Load the California Housing dataset
california = datasets.fetch_california_housing()
df = pd.DataFrame(data=np.c_[california['data'], california['target']], columns=california['feature_names'] + ['target'])

def memory_usage_pandas(df):
    # deep=True also counts the contents of object columns
    total_bytes = df.memory_usage(deep=True).sum()
    return total_bytes / 1024**2  # Convert bytes to megabytes

original_memory = memory_usage_pandas(df)

# Optimize memory usage in Pandas by downcasting float64 columns to float32
# (the California Housing dataset has no categorical features to convert)
df['AveBedrms'] = pd.to_numeric(df['AveBedrms'], downcast='float')
df['AveRooms'] = pd.to_numeric(df['AveRooms'], downcast='float')
optimized_memory = memory_usage_pandas(df)

print(f'Original memory usage: {original_memory} MB')
print(f'Optimized memory usage: {optimized_memory} MB')
print(f'Memory saved: {original_memory - optimized_memory} MB')

Original memory usage: 1.4173622131347656 MB
Optimized memory usage: 1.2598915100097656 MB
Memory saved: 0.157470703125 MB
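The cell above downcasts two columns by hand. A possible generalization, sketched here under the assumption that every float64 column tolerates float32 precision, applies the same call to all float columns. The helper name downcast_floats and the synthetic demo frame are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def downcast_floats(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every float64 column is downcast via pd.to_numeric."""
    out = frame.copy()
    for col in out.select_dtypes(include="float64").columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Synthetic data: 20,000 rows and four float64 columns with values in [0, 1)
rng = np.random.default_rng(seed=0)
demo = pd.DataFrame(rng.random((20_000, 4)), columns=list("abcd"))
slim = downcast_floats(demo)

before = demo.memory_usage(deep=True).sum()
after = slim.memory_usage(deep=True).sum()
print(f"{before / 1024**2:.2f} MB -> {after / 1024**2:.2f} MB")
print(slim.dtypes)
```

Returning a copy keeps the original frame available for comparison, at the cost of briefly holding both versions in memory.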
