## Profiling - Challenge 3

We need to extract one column of data from a giant CSV file we've inherited and write it to disk in its own file.

Compare the following two approaches, which do the same thing but are written slightly differently.

In [None]:
# Code to create CSV in the first place -- only need to run this if CSV isn't already present.
import pandas as pd
import numpy as np
# Two columns of five million points each
data_1 = np.random.rand(5000000)
data_2 = np.random.rand(5000000)
pd.DataFrame.from_dict({'data_1': data_1, 'data_2': data_2}).to_csv('data.csv')

In [None]:
%%writefile challenge_3_1.py

# Approach 1

import pandas as pd
import numpy as np

@profile
def approach_1():
    data = pd.read_csv('data.csv')    
    data['data_1'].to_csv('data_1_only.csv')

if __name__ == '__main__':
    approach_1()

In [None]:
# Profile Approach 1

# Run this outside Jupyter so that Jupyter's own memory usage isn't included.
# The time command measures the overall execution time
# (less accurate than %timeit since it only measures one run, but it does the job).
!time python -m memory_profiler challenge_3_1.py
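
The comment above notes that a single timed run is less reliable than `%timeit`. As a rough sketch of what repeating a measurement looks like outside Jupyter, the standard-library `timeit.repeat` can be used. The `work` function here is a hypothetical stand-in, not part of the challenge code:

```python
import timeit


def work():
    # Hypothetical stand-in for the code being timed.
    return sum(i * i for i in range(100_000))


# Time five independent runs, calling work() once per run.
timings = timeit.repeat(work, repeat=5, number=1)

# The minimum is usually the least noisy estimate of the run time.
print(f"best of {len(timings)} runs: {min(timings):.4f} s")
```

For a script that takes many seconds per run, one `time` measurement is usually good enough to tell the two approaches apart.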

In [None]:
%%writefile challenge_3_2.py

# Approach 2

import pandas as pd
import numpy as np

@profile
def approach_2():
    data = pd.read_csv('data.csv', chunksize=10000) 
    
    for chunk_num, chunk in enumerate(data):
        # Write the data_1 column of each chunk to the output CSV.
        # The first chunk creates the file and writes the header; later chunks are appended.
        if chunk_num == 0:
            chunk['data_1'].to_csv('data_1_only.csv')
        else:
            chunk['data_1'].to_csv('data_1_only.csv', mode='a', header=False)

if __name__ == '__main__':
    approach_2()

In [None]:
# Profile Approach 2

!time python -m memory_profiler challenge_3_2.py
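
Before comparing speed and memory, it can be worth confirming that the two approaches really do write the same file. The following is a minimal sketch of such a check on a small temporary CSV. The helper names `write_whole` and `write_chunked`, the file names, and the sizes are illustrative assumptions rather than part of the challenge scripts:

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_whole(src, dst):
    # Same idea as Approach 1: load the whole file, then write one column.
    pd.read_csv(src)['data_1'].to_csv(dst)


def write_chunked(src, dst, chunksize=100):
    # Same idea as Approach 2: stream the file in chunks, appending after the first.
    for chunk_num, chunk in enumerate(pd.read_csv(src, chunksize=chunksize)):
        if chunk_num == 0:
            chunk['data_1'].to_csv(dst)
        else:
            chunk['data_1'].to_csv(dst, mode='a', header=False)


with tempfile.TemporaryDirectory() as tmp:
    # Small stand-in for data.csv (1,000 rows instead of five million).
    src = os.path.join(tmp, 'small.csv')
    rng = np.random.default_rng(0)
    pd.DataFrame({'data_1': rng.random(1000),
                  'data_2': rng.random(1000)}).to_csv(src)

    out_1 = os.path.join(tmp, 'out_1.csv')
    out_2 = os.path.join(tmp, 'out_2.csv')
    write_whole(src, out_1)
    write_chunked(src, out_2)

    # Compare the two output files as text.
    with open(out_1) as f1, open(out_2) as f2:
        outputs_match = f1.read() == f2.read()

print(outputs_match)
```

If the outputs match, any difference in the profiles comes from how the data is read and held in memory, not from what is written.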

* Which one is faster?

* Why does one use more memory than the other?

* Do we care?