
*Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and Pandas but can operate in parallel on datasets that don’t fit into main memory. Dask’s high-level collections are alternatives to NumPy and Pandas for large datasets.*

[311 Service Requests – 7 GB+ CSV](https://data.cityofnewyork.us/Social-Services/311-Service-Requests/fvrb-kbbt)

[Dask – A better way to work with large CSV files in Python](https://pythondata.com/dask-large-csv-python/)

In [0]:
!wget "https://data.cityofnewyork.us/api/views/fvrb-kbbt/rows.csv?accessType=DOWNLOAD"

--2019-08-02 14:29:40--  https://data.cityofnewyork.us/api/views/fvrb-kbbt/rows.csv?accessType=DOWNLOAD
Resolving data.cityofnewyork.us (data.cityofnewyork.us)... 52.206.140.205, 52.206.140.199, 52.206.68.26
Connecting to data.cityofnewyork.us (data.cityofnewyork.us)|52.206.140.205|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘rows.csv?accessType=DOWNLOAD’

e=DOWNLOAD              [             <=>    ]   7.73G  4.03MB/s               

In [0]:
!rm -r sample_data

In [7]:
%%bash
mv rows.csv\?accessType\=DOWNLOAD 311.csv
ls -la

311.csv  sample_data


In [8]:
%%time
!wc -l 311.csv

21236571 311.csv
CPU times: user 1.14 s, sys: 194 ms, total: 1.33 s
Wall time: 4min 1s
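
The same line count can be obtained from Python without loading the file into memory. The sketch below is only an illustration: `count_lines` is a hypothetical helper, and the small temporary file stands in for `311.csv`. Like `wc -l`, it counts newline characters, so the header row and any line breaks inside quoted fields are included in the total.

```python
import os
import tempfile


def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes in `path`, reading fixed-size binary chunks."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


# Demo on a small throwaway file (a stand-in for 311.csv).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("Unique Key,Descriptor\n1,RADIATOR\n2,Wear & Tear\n")
    demo_path = tmp.name

demo_line_count = count_lines(demo_path)
os.remove(demo_path)
print(demo_line_count)
```

Memory use stays bounded by `chunk_size` regardless of the file size.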


In [0]:
%pip install "dask[complete]"

In [0]:
import dask.dataframe as dd

filename = '311.csv'
df = dd.read_csv(filename, dtype='str')
# The full file is not read into memory here. read_csv only sets up a lazy dataframe,
# ready to run computations on the CSV data using familiar functions from pandas.

In [11]:
df.head(2)

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,Street Name,Cross Street 1,Cross Street 2,Intersection Street 1,Intersection Street 2,Address Type,City,Landmark,Facility Type,Status,Due Date,Resolution Action Updated Date,Community Board,Borough,X Coordinate (State Plane),Y Coordinate (State Plane),Park Facility Name,Park Borough,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Latitude,Longitude,Location
0,39679393,07/08/2018 01:43:22 PM,,DOT,Department of Transportation,Street Condition,Wear & Tear,,11208,265 EUCLID AVENUE,EUCLID AVENUE,ATLANTIC AVENUE,FULTON STREET,,,ADDRESS,BROOKLYN,,,Pending,,07/10/2018 09:00:00 AM,05 BROOKLYN,BROOKLYN,1019245,187789,Unspecified,BROOKLYN,,,,,,,,40.68204501462856,-73.87382614938845,"(40.68204501462856, -73.87382614938845)"
1,40983172,11/19/2018 03:00:16 PM,11/19/2018 03:00:28 PM,TLC,Taxi and Limousine Commission,Taxi Report,Driver Report,,11249,237 KENT AVENUE,KENT AVENUE,NORTH 1 STREET,GRAND STREET,,,ADDRESS,BROOKLYN,,,Closed,,11/19/2018 03:00:28 PM,01 BROOKLYN,BROOKLYN,993766,200359,Unspecified,BROOKLYN,,,JFK Airport,,,,,40.716610788840434,-73.96567244879084,"(40.716610788840434, -73.96567244879084)"


In [0]:
%%time
# Some column names contain spaces. Remove them to make the columns easier to work with.
df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
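
As a quick illustration of what that mapping produces, the sketch below applies the same dictionary comprehension to a few column names taken from the header shown earlier. It also checks which of the new names are valid Python identifiers, since only those can be reached with attribute access such as `df.Descriptor`; the others still need bracket notation.

```python
# A few column names copied from the df.head(2) output.
columns = [
    "Unique Key",
    "Created Date",
    "Complaint Type",
    "X Coordinate (State Plane)",
]

# Same comprehension as in the rename call: old name -> name without spaces.
renamed = {c: c.replace(" ", "") for c in columns}

# Attribute access (df.SomeName) only works for valid identifiers.
attribute_safe = {new: new.isidentifier() for new in renamed.values()}

for old, new in renamed.items():
    print(f"{old!r} -> {new!r} (attribute access: {attribute_safe[new]})")
```

Names that keep parentheses after the spaces are removed remain usable as `df["XCoordinate(StatePlane)"]`.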

In [0]:
%%time
# Create a new dataframe containing only the 'RADIATOR' service calls.
radiator_df = df[df.Descriptor == 'RADIATOR']

In [16]:
%%time
# count() alone is lazy: it builds the computation but does not return a number.
lazy_count = radiator_df.Descriptor.count()
# Call compute() so that Dask runs through the dataframe and counts the records.
lazy_count.compute()

69027
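
A count like this can be cross-checked without Dask by streaming the file through the standard-library `csv` module. This is only a sketch on a small made-up file: `count_matching_rows` is a hypothetical helper, and on the real `311.csv` the column would be `Descriptor` and the value `RADIATOR`. Unlike a raw line count, the `csv` reader treats a quoted field containing a line break as part of one record.

```python
import csv
import os
import tempfile


def count_matching_rows(path, column, value):
    """Stream a CSV file and count the records where `column` equals `value`."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return sum(1 for row in reader if row[column] == value)


# Made-up demo data; the first record has a line break inside a quoted field.
demo_rows = [
    ["Unique Key", "Descriptor", "Incident Address"],
    ["1", "RADIATOR", "1 MAIN\nSTREET"],
    ["2", "Wear & Tear", "2 KENT AVENUE"],
    ["3", "RADIATOR", "3 EUCLID AVENUE"],
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    csv.writer(tmp).writerows(demo_rows)
    demo_path = tmp.name

demo_count = count_matching_rows(demo_path, "Descriptor", "RADIATOR")
os.remove(demo_path)
print(demo_count)
```

This runs on a single core and reads one record at a time, so it is slower than Dask on a large file but needs very little memory.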

In [0]:
%whos