## Load the same CSV file 10x faster and with 10x less memory

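### 1. usecols:

Rather than loading all the data and then dropping the columns you do not need, load only the useful columns with the `usecols` parameter. In the example below, the file has only six columns and all of them are requested, so the memory figure does not change; the saving appears when `req_cols` is a subset of the columns.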



In [30]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [31]:
df = pd.read_csv("C:/Users/anike/Downloads/Dataframe.csv")

In [32]:
df.info(verbose=False, memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Columns: 6 entries, Age to Salary
dtypes: float64(1), int64(2), object(3)
memory usage: 21.9 KB


In [33]:
len(df.columns)

6

In [34]:
req_cols =['Age', 'Gender', 'Education Level', 'Job Title', 'Years of Experience',
       'Salary']
len(req_cols)

6

In [35]:
df = pd.read_csv("C:/Users/anike/Downloads/Dataframe.csv", usecols=req_cols)

In [36]:
df.info(verbose=False, memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Columns: 6 entries, Age to Salary
dtypes: float64(1), int64(2), object(3)
memory usage: 21.9 KB
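To make the effect of `usecols` visible, here is a minimal sketch that loads only two columns. It assumes a small in-memory CSV (`csv_text`, a hypothetical stand-in with the same column names and three sample rows) instead of the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file: same header, three sample rows.
csv_text = (
    "Age,Gender,Education Level,Job Title,Years of Experience,Salary\n"
    "32,Male,Bachelor's,Software Engineer,5.0,90000\n"
    "28,Female,Master's,Data Analyst,3.0,65000\n"
    "45,Male,PhD,Senior Manager,,150000\n"
)

# Load every column, then only the two columns we actually need.
full = pd.read_csv(io.StringIO(csv_text))
subset = pd.read_csv(io.StringIO(csv_text), usecols=["Age", "Salary"])

full_bytes = full.memory_usage(deep=True).sum()
subset_bytes = subset.memory_usage(deep=True).sum()
print(list(subset.columns), subset_bytes, "bytes vs", full_bytes, "bytes")
```

The string columns dominate memory here, so dropping them at read time is where most of the saving comes from.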


### 2. Using correct dtypes for numerical data:

In [37]:
df['Age'].memory_usage(index=False, deep= True)

792

In [38]:
df['Age'].min()

24

In [39]:
df['Age'].max()

52

In [40]:
df = pd.read_csv("C:/Users/anike/Downloads/Dataframe.csv", dtype={"Age": "int16"})

In [41]:
df['Age'].memory_usage(index=False, deep=True)

198
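The `int16` choice above can be derived from the observed value range rather than guessed. The sketch below assumes a made-up `ages` series spanning the 24 to 52 range seen above; `smallest_int_dtype` is a hypothetical helper, not a pandas function. For this range even `int8` would be enough.

```python
import numpy as np
import pandas as pd

# Made-up values covering the observed min (24) and max (52).
ages = pd.Series([24, 32, 45, 52], dtype="int64")

def smallest_int_dtype(lo, hi):
    """Return the name of the narrowest signed integer dtype holding [lo, hi]."""
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return np.dtype(candidate).name
    raise ValueError("range does not fit in a 64-bit signed integer")

target = smallest_int_dtype(ages.min(), ages.max())
small = ages.astype(target)

# Compare the bytes used by the data before and after the conversion.
print(target, ages.memory_usage(index=False), small.memory_usage(index=False))
```

pandas can also pick the narrowest type for an already loaded column with `pd.to_numeric(ages, downcast="integer")`.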

### 3. Using correct dtypes for categorical columns:

In this dataset, the `Gender` column is parsed as a string by default, but it contains only a small, fixed set of values. Such columns can be stored far more compactly with the `category` dtype.


In [42]:
df['Gender'].value_counts()

Male      54
Female    45
Name: Gender, dtype: int64

In [43]:
df = pd.read_csv("C:/Users/anike/Downloads/Dataframe.csv", dtype={"Gender": "category"})
df

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,32,Male,Bachelor's,Software Engineer,5.0,90000
1,28,Female,Master's,Data Analyst,3.0,65000
2,45,Male,PhD,Senior Manager,,150000
3,36,Female,Bachelor's,Sales Associate,7.0,60000
4,52,Male,Master's,Director,20.0,200000
...,...,...,...,...,...,...
94,33,Male,Bachelor's,Business Analyst,7.0,75000
95,39,Female,Bachelor's,Training Specialist,12.0,65000
96,47,Male,PhD,Research Scientist,22.0,160000
97,26,Male,Bachelor's,Junior Software Developer,1.0,35000
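The cell above does not measure the saving, so here is a rough sketch that compares the two representations. It assumes a synthetic `gender` series built with the same counts as the `value_counts()` output (54 "Male", 45 "Female") rather than the real file.

```python
import pandas as pd

# Synthetic column with the same value counts as the Gender column.
gender = pd.Series(["Male"] * 54 + ["Female"] * 45)

# Convert the strings to a categorical: small integer codes plus a lookup table.
as_category = gender.astype("category")

str_bytes = gender.memory_usage(index=False, deep=True)
cat_bytes = as_category.memory_usage(index=False, deep=True)
print("strings:", str_bytes, "bytes; category:", cat_bytes, "bytes")
```

The fewer distinct values a column has relative to its length, the larger the saving tends to be.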


### 4. nrows, skiprows

Before loading all the data into RAM, it is good practice to test your functions and workflows on a small sample. pandas makes this easy: `nrows` limits the number of rows read, and `skiprows` skips the rows you do not need.

In most testing scenarios you do not need to load all the data; a sample does just fine. Note that `skiprows` counts lines of the file starting from 0, so index 0 is the header line.

In [44]:
len(df)

99

In [45]:
df = pd.read_csv('C:/Users/anike/Downloads/Dataframe.csv', skiprows=[0,2,5])

In [46]:
len(df)

96
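The following sketch contrasts `nrows` with two ways of using `skiprows`. It assumes a hypothetical ten-row CSV (`csv_text`) in place of the real file, and shows what happens when line 0, the header, is among the skipped lines.

```python
import io
import pandas as pd

# Hypothetical CSV: a header line followed by 10 data rows (ids 0..9).
csv_text = "id,value\n" + "".join(f"{i},{i + 100}\n" for i in range(10))

# Read only the first 3 data rows.
head = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Keep the header (line 0) and skip file lines 1 to 5, i.e. the first 5 data rows.
tail = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 6))

# Skipping line 0 drops the header, so the first data row is used as column names.
no_header = pd.read_csv(io.StringIO(csv_text), skiprows=[0, 2, 5])

print(len(head), len(tail), len(no_header), list(no_header.columns))
```

If the header should survive, start the skipped indices at 1, or pass `names=` explicitly when line 0 is skipped.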

### 5. Loading Data in Chunks:

Loading data in chunks is slower than reading the whole file at once when you concatenate the chunks again afterwards, but it lets you work with files of tens of gigabytes that would not fit in memory in one piece.

In [47]:
len(df)

96

In [48]:
df = pd.read_csv("C:/Users/anike/Downloads/Dataframe.csv", chunksize=1000)

In [49]:
total_len = 0
for chunk in df:
    # Do some preprocessing to reduce the memory size of each chunk
    total_len += len(chunk)
print(total_len)


99


In [50]:
tp = pd.read_csv('C:/Users/anike/Downloads/Dataframe.csv', iterator=True, chunksize=1000)  # gives TextFileReader
df = pd.concat(tp, ignore_index=True)

In [51]:
len(df)

99
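Concatenating every chunk rebuilds the full DataFrame in memory, so for large files it is often better to reduce each chunk to a small result and combine those. A minimal sketch, assuming a hypothetical 250-row CSV (`csv_text`) and a running mean of a `Salary` column:

```python
import io
import pandas as pd

# Hypothetical CSV with 250 rows; the values are made up for illustration.
csv_text = "Age,Salary\n" + "".join(
    f"{20 + i % 30},{30000 + i * 100}\n" for i in range(250)
)

total_rows = 0
salary_sum = 0
# Only one chunk of 100 rows is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=100):
    total_rows += len(chunk)
    salary_sum += chunk["Salary"].sum()

mean_salary = salary_sum / total_rows
print(total_rows, mean_salary)
```

The same pattern works for counts, sums, minima, and maxima; statistics such as the median need a different approach because they cannot be combined from per-chunk results this simply.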

### 6. Multiprocessing using pandas:

pandas has no `n_jobs` parameter for using multiple cores. Instead, we can use the `multiprocessing` library to process chunks asynchronously in several worker processes, which can reduce the run time, in some cases by about half. The cell below times the plain sequential chunk loop.

In [52]:
%%time
df = pd.read_csv("C:/Users/anike/Downloads/Dataframe.csv", chunksize=1000)
total_length = 0
for chunk in df:
    total_length += len(chunk)
print(total_length)

99
CPU times: total: 31.2 ms
Wall time: 2.99 ms
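The timed cell above is still sequential. One possible way to spread the per-chunk work over processes is sketched below, assuming the standard-library `multiprocessing.Pool`; `count_rows` and `parallel_row_count` are hypothetical names chosen for illustration. The CSV is still parsed in the main process, so only the work done on each chunk runs in parallel.

```python
from multiprocessing import Pool

import pandas as pd


def count_rows(chunk: pd.DataFrame) -> int:
    """Per-chunk work; replace with real preprocessing on the chunk."""
    return len(chunk)


def parallel_row_count(path, chunksize=1000, workers=2):
    """Feed chunks of the CSV to a pool of worker processes and sum the results."""
    reader = pd.read_csv(path, chunksize=chunksize)
    with Pool(processes=workers) as pool:
        # Chunks are read in the main process and sent to workers as they arrive;
        # the order of results does not matter for a sum.
        return sum(pool.imap_unordered(count_rows, reader))
```

Call `parallel_row_count(path)` from under an `if __name__ == "__main__":` guard in a script; on Windows the worker processes re-import the main module, so the guard is required, and functions defined only inside a notebook cell may not be picklable there. For a file as small as this one, the process start-up cost is likely to outweigh any gain.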


In [53]:
pip install dask

Note: you may need to restart the kernel to use updated packages.


### 7. Dask Instead of Pandas:


In [54]:
import dask.dataframe as dd
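data = dd.read_csv("C:/Users/anike/Downloads/Dataframe.csv", dtype={'Salary': 'float64'}, assume_missing=True)
# Without parentheses this only displays the bound method and the lazy structure;
# call data.compute() to actually load the data into a pandas DataFrame.
data.compute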

<bound method DaskMethodsMixin.compute of Dask DataFrame Structure:
                   Age  Gender Education Level Job Title Years of Experience   Salary
npartitions=1                                                                        
               float64  object          object    object             float64  float64
                   ...     ...             ...       ...                 ...      ...
Dask Name: read-csv, 1 graph layer>