## Loading a Massive File in Small Chunks with Pandas

In [1]:

import pandas as pd
from pprint import pprint
  
df = pd.read_csv('zomato.csv')
  
df.columns

Index(['url', 'address', 'name', 'online_order', 'book_table', 'rate', 'votes',
       'phone', 'location', 'rest_type', 'dish_liked', 'cuisines',
       'approx_cost(for two people)', 'reviews_list', 'menu_item',
       'listed_in(type)', 'listed_in(city)'],
      dtype='object')

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   url                          51717 non-null  object
 1   address                      51717 non-null  object
 2   name                         51717 non-null  object
 3   online_order                 51717 non-null  object
 4   book_table                   51717 non-null  object
 5   rate                         43942 non-null  object
 6   votes                        51717 non-null  int64 
 7   phone                        50509 non-null  object
 8   location                     51696 non-null  object
 9   rest_type                    51490 non-null  object
 10  dish_liked                   23639 non-null  object
 11  cuisines                     51672 non-null  object
 12  approx_cost(for two people)  51371 non-null  object
 13  reviews_list                 51

In [5]:
df = pd.read_csv("zomato.csv", chunksize=10000)
print(df)

<pandas.io.parsers.readers.TextFileReader object at 0x00000183D9B63BB0>


In [6]:
for data in df:
    pprint(data.shape)

(10000, 17)
(10000, 17)
(10000, 17)
(10000, 17)
(10000, 17)
(1717, 17)
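The six shapes above follow from the row count: 51717 rows split into chunks of 10000 give five full chunks plus a final chunk of 1717 rows. Below is a minimal sketch of the same behaviour on a small in-memory CSV. The `id` and `value` columns and the 25-row size are made up for illustration and are not part of the Zomato data.

```python
import io
import math

import pandas as pd

# Hypothetical stand-in for a large file: 25 data rows, 2 columns.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(25))

n_rows, chunksize = 25, 10
expected_chunks = math.ceil(n_rows / chunksize)  # 10 + 10 + 5 rows

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of a single DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=chunksize)
shapes = [chunk.shape for chunk in reader]

print(expected_chunks)
print(shapes)
```

Note that the reader is consumed once it has been iterated, which is why the notebook calls `pd.read_csv` again before each new loop.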


In [7]:
df = pd.read_csv("zomato.csv", chunksize=10)
  
for data in df:
    pprint(data)
    break

                                                 url  \
0  https://www.zomato.com/bangalore/jalsa-banasha...   
1  https://www.zomato.com/bangalore/spice-elephan...   
2  https://www.zomato.com/SanchurroBangalore?cont...   
3  https://www.zomato.com/bangalore/addhuri-udupi...   
4  https://www.zomato.com/bangalore/grand-village...   
5  https://www.zomato.com/bangalore/timepass-dinn...   
6  https://www.zomato.com/bangalore/rosewood-inte...   
7  https://www.zomato.com/bangalore/onesta-banash...   
8  https://www.zomato.com/bangalore/penthouse-caf...   
9  https://www.zomato.com/bangalore/smacznego-ban...   

                                             address  \
0  942, 21st Main Road, 2nd Stage, Banashankari, ...   
1  2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ...   
2  1112, Next to KIMS Medical College, 17th Cross...   
3  1st Floor, Annakuteera, 3rd Stage, Banashankar...   
4  10, 3rd Floor, Lakshmi Associates, Gandhi Baza...   
5  37, 5-1, 4th Floor, Bosco Court, Gandhi Baza

## Load the Same CSV File 10X Faster and with 10X Less Memory

### 1. usecols:

Rather than loading all the data and then dropping the columns you don’t need, load only the useful columns with `usecols`.



In [38]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [39]:
df = pd.read_csv("zomato.csv")

In [40]:
df.info(verbose=False, memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Columns: 17 entries, url to listed_in(city)
dtypes: int64(1), object(16)
memory usage: 572.2 MB


In [41]:
len(df.columns)

17

In [42]:
req_cols =['address', 'name', 'online_order', 'book_table', 'rate', 'votes',
       'phone', 'location', 'rest_type', 'dish_liked']
len(req_cols)

10

In [43]:
df = pd.read_csv("zomato.csv", usecols=req_cols)

In [44]:
df.info(verbose=False, memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Columns: 10 entries, address to dish_liked
dtypes: int64(1), object(9)
memory usage: 32.9 MB
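To quantify the saving on your own data, you can compare the deep memory footprint of the full and the reduced frame. The sketch below uses a small made-up CSV (the `name`, `address` and `note` columns are hypothetical), so the percentage it prints says nothing about the Zomato file.

```python
import io

import pandas as pd

# Hypothetical data: three text columns, only one of which is needed.
rows = [f"rest_{i},street {i} some long address text,note {i}" for i in range(1000)]
csv_text = "name,address,note\n" + "\n".join(rows)

full = pd.read_csv(io.StringIO(csv_text))
subset = pd.read_csv(io.StringIO(csv_text), usecols=["name"])

# deep=True also counts the bytes held by the string values themselves.
full_bytes = full.memory_usage(deep=True).sum()
subset_bytes = subset.memory_usage(deep=True).sum()

saving_pct = (full_bytes - subset_bytes) / full_bytes * 100
print(f"{full_bytes} -> {subset_bytes} bytes ({saving_pct:.1f}% smaller)")
```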


### 2. Using correct dtypes for numerical data:

In [45]:
df['votes'].memory_usage(index=False, deep=True)

413736

In [46]:
df['votes'].min()

0

In [47]:
df['votes'].max()

16832

In [48]:
df = pd.read_csv("zomato.csv", dtype={"votes": "int16"})

In [49]:
df['votes'].memory_usage(index=False, deep=True)

103434

In [50]:
# Percentage saved, using the int64 and int16 byte counts measured above
(413736 - 103434)/413736*100

75.0
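The 75% reduction comes from storing each value in 2 bytes (`int16`) instead of 8 bytes (`int64`). `int16` is safe here because the largest value, 16832, is below its limit of 32767. Rather than checking the minimum and maximum by hand, `pd.to_numeric` with `downcast` can pick the smallest fitting integer type. A sketch on made-up values covering the same range:

```python
import numpy as np
import pandas as pd

# Made-up column covering the same range as `votes` (0 to 16832).
votes = pd.Series([0, 12, 775, 16832], dtype="int64")

# int16 stores values from -32768 to 32767, so this range fits.
print(np.iinfo("int16").min, np.iinfo("int16").max)

# downcast="integer" selects the smallest signed integer dtype that fits.
small = pd.to_numeric(votes, downcast="integer")

before = votes.memory_usage(index=False, deep=True)  # 4 values * 8 bytes
after = small.memory_usage(index=False, deep=True)   # 4 values * 2 bytes
print(small.dtype, before, after, (before - after) / before * 100)
```

Downcasting happens after the data is loaded, so it helps most when applied per chunk or when the resulting dtype is then passed to `read_csv` through `dtype` on later runs.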

### 3. Using correct dtypes for categorical columns:

In this dataset, the `online_order` column is parsed as a string by default, but it contains only a small, fixed set of values (`Yes` and `No`), which makes it a good fit for the `category` dtype.

In [51]:
df['online_order'].value_counts()

Yes    30444
No     21273
Name: online_order, dtype: int64

In [52]:
df = pd.read_csv("zomato.csv", dtype={"online_order": "category"})
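The notebook does not measure the effect of this change, so here is a hedged sketch of how it could be checked. It uses a made-up single-column CSV that mimics `online_order`. The byte counts depend on the pandas version and are therefore printed rather than stated.

```python
import io

import pandas as pd

# Made-up column with two repeated labels, mimicking `online_order`.
csv_text = "online_order\n" + "\n".join(["Yes", "No"] * 500)

as_text = pd.read_csv(io.StringIO(csv_text))
as_cat = pd.read_csv(io.StringIO(csv_text), dtype={"online_order": "category"})

text_bytes = as_text["online_order"].memory_usage(index=False, deep=True)
cat_bytes = as_cat["online_order"].memory_usage(index=False, deep=True)

# A categorical stores each distinct label once, plus a small integer
# code per row, instead of one string per row.
print(sorted(as_cat["online_order"].cat.categories))
print(text_bytes, cat_bytes)
```

The saving shrinks as the number of distinct values approaches the number of rows, so columns such as `url` or `address` are poor candidates.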

### 4. nrows, skiprows

Before loading all the data into RAM, it is good practice to test your functions and workflows on a small sample. pandas makes this easy: `nrows` limits the number of rows read, and `skiprows` skips the rows you do not need.

For testing purposes you rarely need the full dataset; a sample usually does just fine.

In [53]:
len(df)

51717

In [54]:
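# skiprows takes 0-based line numbers; line 0 is the header line, so skipping it
# makes the first data row the column names
df = pd.read_csv('zomato.csv', skiprows=[0,2,5])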

In [55]:
len(df)

51714
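The cells above show `skiprows` but not `nrows`. The sketch below combines the two on a small made-up CSV (the `id` and `value` columns are hypothetical) and starts the skipped range at 1 so that the header line is kept.

```python
import io

import pandas as pd

# Made-up file: a header line followed by 100 data rows.
csv_text = "id,value\n" + "\n".join(f"{i},{i * i}" for i in range(100))

# nrows: read only the first 10 data rows.
head = pd.read_csv(io.StringIO(csv_text), nrows=10)

# skiprows counts file lines from 0, and line 0 is the header.
# Starting the range at 1 keeps the header and drops the first 50 data rows.
tail = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 51))

# Combining both reads a 10-row window from the middle of the file.
window = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 51), nrows=10)

print(len(head), len(tail), len(window))
print(window["id"].tolist())
```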

### 5. Loading Data in Chunks:

Loading data in chunks is actually slower than reading the whole file directly if you have to concatenate the chunks again, but it lets you work with files of tens of gigabytes that would not fit in memory at once.

In [56]:
len(df)

51714

In [57]:
df = pd.read_csv("zomato.csv", chunksize=1000)

In [58]:
total_len = 0
for chunk in df:
    # Do some preprocessing to reduce the memory size of each chunk
    total_len += len(chunk)
print(total_len)


51717


In [59]:
tp = pd.read_csv('zomato.csv', iterator=True, chunksize=1000)  # gives TextFileReader
df = pd.concat(tp, ignore_index=True)

In [60]:
len(df)

51717
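Concatenating every chunk, as in the last cell, rebuilds the full DataFrame and brings back the full memory cost. When the end result is an aggregate, each chunk can instead be reduced to a small summary and then discarded. A minimal sketch on made-up data, reusing the column names `online_order` and `votes` for illustration only:

```python
import io
from collections import Counter

import pandas as pd

# Made-up data: 1000 rows alternating between two labels.
csv_text = "online_order,votes\n" + "\n".join(
    f"{'Yes' if i % 2 == 0 else 'No'},{i}" for i in range(1000)
)

order_counts = Counter()
total_votes = 0

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=100):
    # Reduce the chunk to a small summary; the chunk itself is not kept.
    order_counts.update(chunk["online_order"].value_counts().to_dict())
    total_votes += int(chunk["votes"].sum())

print(dict(order_counts), total_votes)
```

Only one chunk of 100 rows is held in memory at a time, regardless of how large the file is.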

### 6. Multiprocessing using pandas:

pandas has no `n_jobs` parameter for using multiple cores. Instead, we can use the `multiprocessing` library to process chunks asynchronously in several worker processes, which can reduce the run time, in some cases by about half. The cell below times the plain single-process loop.

In [61]:
%%time
df = pd.read_csv("zomato.csv", chunksize=1000)
total_length = 0
for chunk in df:
    total_length += len(chunk)
print(total_length)

51717
CPU times: total: 8.06 s
Wall time: 8.06 s
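The timed cell above is the serial version. A parallel version could look like the sketch below, which hands each chunk to a `multiprocessing.Pool`. The in-memory CSV, the `process_chunk` worker and the pool size of 2 are assumptions for illustration; the original notebook does not show its parallel code.

```python
import io
from multiprocessing import Pool

import pandas as pd

# Made-up data standing in for a large file: 1000 rows of votes.
CSV_TEXT = "votes\n" + "\n".join(str(i % 50) for i in range(1000))


def process_chunk(chunk):
    """Hypothetical per-chunk work: here, just count the rows."""
    return len(chunk)


def main():
    reader = pd.read_csv(io.StringIO(CSV_TEXT), chunksize=100)
    # Each chunk is pickled and sent to a worker process; imap yields
    # the results in chunk order.
    with Pool(processes=2) as pool:
        total = sum(pool.imap(process_chunk, reader))
    print(total)


# The guard is needed on platforms that start workers by re-importing
# this module (for example Windows and macOS).
if __name__ == "__main__":
    main()
```

Reading the file is still done by a single process, and every chunk has to be copied to a worker, so a speed-up is only likely when the per-chunk work is much heavier than counting rows. On platforms that spawn workers, a function defined interactively in a notebook may not be importable by the workers, in which case it may need to live in a separate `.py` module.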


### 7. Dask Instead of Pandas:


In [70]:
import dask.dataframe as dd
data = dd.read_csv("zomato.csv",dtype={'approx_cost(for two people)': 'float64'},assume_missing=True)
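# Note: `data.compute` without parentheses only displays the bound method and the
# lazy Dask structure; call `data.compute()` to load the result into pandas.
data.compute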

<bound method DaskMethodsMixin.compute of Dask DataFrame Structure:
                  url address    name online_order book_table    rate    votes   phone location rest_type dish_liked cuisines approx_cost(for two people) reviews_list menu_item listed_in(type) listed_in(city)
npartitions=8                                                                                                                                                                                                   
               object  object  object       object     object  object  float64  object   object    object     object   object                     float64       object    object          object          object
                  ...     ...     ...          ...        ...     ...      ...     ...      ...       ...        ...      ...                         ...          ...       ...             ...             ...
...               ...     ...     ...          ...        ...     ...      ...     ...      ... 

In [68]:
pip install dask
