In [12]:
import pandas as pd

# If your data is in S3, you can download it from the file's public URL.
# Make sure the S3 permissions on the file allow everyone to download it.
# I usually keep a small copy of the data for development and switch to the full dataset later.

#url = "https://s3-ap-southeast-1.amazonaws.com/bigdatasg/colors_100000K.csv"
#url = "https://s3-ap-southeast-1.amazonaws.com/bigdatasg/colors_10000K.csv"
#url = "https://s3-ap-southeast-1.amazonaws.com/bigdatasg/colors_1000K.csv"
url = "https://s3-ap-southeast-1.amazonaws.com/bigdatasg/colors_100K.csv"


# You can start by just loading a few rows from your small data file to ensure everything is working properly.
# However, this approach may still try to download all of the data locally before reading in the first 20 lines. 

numRows = 20
df = pd.read_csv(url, nrows=numRows)
df.head()


Unnamed: 0,orange,9
0,red,4
1,green,0
2,green,9
3,green,4
4,red,8
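
To preview only the first rows without pulling the whole file, one option is to stream the response and stop after a fixed number of lines. The sketch below uses only the standard library. `first_lines` and `preview_remote_csv` are hypothetical helper names (not part of pandas), and the code assumes the file is UTF-8 text. How much data is actually transferred before the connection closes depends on buffering, so treat this as an illustration rather than a guarantee.

```python
import io
import itertools
import urllib.request


def first_lines(line_iter, n):
    """Return the first n items from any iterable of text lines."""
    return list(itertools.islice(line_iter, n))


def preview_remote_csv(url, n=20):
    """Hypothetical helper: open a URL as a text stream and stop after n lines."""
    with urllib.request.urlopen(url) as response:
        # Assumption: the remote file is UTF-8 encoded text.
        text_stream = io.TextIOWrapper(response, encoding="utf-8")
        return first_lines(text_stream, n)
```

The returned lines can be handed to pandas with `pd.read_csv(io.StringIO("".join(lines)))` if a DataFrame is needed.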


In [13]:
"""
One of the easiest ways to copy data to your notebook server while still seeing some 
feedback on the download progress is to use the wget command. 
You can run Linux commands from your notebook by putting a ! in front of them. 
"""

!wget https://s3-ap-southeast-1.amazonaws.com/bigdatasg/colors_100K.csv
        

--2015-11-13 12:05:54--  https://s3-ap-southeast-1.amazonaws.com/bigdatasg/colors_100K.csv
Resolving s3-ap-southeast-1.amazonaws.com... 54.231.241.138
Connecting to s3-ap-southeast-1.amazonaws.com|54.231.241.138|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 799940 (781K) [application/octet-stream]
Saving to: `colors_100K.csv.1'


2015-11-13 12:06:01 (132 KB/s) - `colors_100K.csv.1' saved [799940/799940]
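
Note that the wget output above saved the file as `colors_100K.csv.1`: by default wget adds a numeric suffix when a file with the same name already exists. A minimal sketch of a pure-Python alternative that skips the download when a local copy is already present might look like this. `download_if_missing` is a hypothetical helper, and the `fetch` parameter is an assumption added here so the function can be tried out without network access.

```python
import os
import urllib.request


def download_if_missing(url, filename, fetch=urllib.request.urlretrieve):
    """Download url to filename unless the file already exists.

    Returns True if a download happened, False if the local copy was reused.
    """
    if os.path.exists(filename):
        return False
    # fetch defaults to urllib.request.urlretrieve(url, filename).
    fetch(url, filename)
    return True
```

Calling `download_if_missing(url, "colors_100K.csv")` a second time would then reuse the existing file instead of creating a suffixed copy.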



In [14]:
"""
Once you've downloaded the file to your Jupyter server, you can load the data locally rather than from the URL. 

"""
localfile = "colors_100K.csv"
numRows = 20
df = pd.read_csv(localfile, nrows=numRows)
df.head()


Unnamed: 0,orange,9
0,red,4
1,green,0
2,green,9
3,green,4
4,red,8


In [15]:
"""
If you want to know how many rows your data has, you can read through the file line by line
without holding the data in memory. Here is a quick way to do that. 
"""
count = 0
# The with block closes the file automatically once the loop finishes.
with open(localfile) as myfile:
    for line in myfile:
        count += 1
print("There were {} lines in the file.".format(count))

There were 100000 lines in the file.
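
Counting physical lines matches the number of CSV records only when no quoted field contains a line break. As a hedged alternative, the sketch below counts parsed records with the standard `csv` module, still one record at a time. `count_csv_records` is a hypothetical helper name, and the count includes the header row if the file has one.

```python
import csv


def count_csv_records(path):
    """Count CSV records (including any header row) without loading the file into memory."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(path, newline="") as handle:
        return sum(1 for _ in csv.reader(handle))
```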


In [16]:
"""
Sometimes your system will not have enough memory to hold all of your data. 
When you try to read too much data into memory, you will crash your kernel. 
This happens a lot on micro instances, since they only have 1GB of memory that Linux, Docker, and Jupyter must share. 
If you want to find out where your instance breaks, you can load the file incrementally. 
"""

# Use integer division so that nrows is always a whole number of rows.
step_size = count // 10

for x in range(1, 11):
    numRows = step_size * x
    df = pd.read_csv(localfile, nrows=numRows)
    print("Successfully loaded {} rows".format(numRows))
df.head()


Successfully loaded 10000 rows
Successfully loaded 20000 rows
Successfully loaded 30000 rows
Successfully loaded 40000 rows
Successfully loaded 50000 rows
Successfully loaded 60000 rows
Successfully loaded 70000 rows
Successfully loaded 80000 rows
Successfully loaded 90000 rows
Successfully loaded 100000 rows


Unnamed: 0,orange,9
0,red,4
1,green,0
2,green,9
3,green,4
4,red,8
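
If the full file does not fit in memory, one common approach is to process it in pieces using the `chunksize` parameter of `pd.read_csv`, which yields one DataFrame per chunk instead of a single large one. The sketch below only counts rows. `count_rows_in_chunks` is a hypothetical helper name and the default chunk size is an arbitrary assumption, to be tuned to the memory available on your instance.

```python
import pandas as pd


def count_rows_in_chunks(source, chunk_rows=10000):
    """Read a CSV chunk by chunk and return the total number of data rows."""
    total = 0
    # With chunksize set, read_csv returns an iterator of DataFrames,
    # so only one chunk is held in memory at a time.
    for chunk in pd.read_csv(source, chunksize=chunk_rows):
        total += len(chunk)
    return total
```

For example, `count_rows_in_chunks(localfile)` walks the whole file while keeping at most `chunk_rows` rows loaded at once. Any per-chunk aggregation could replace the `len(chunk)` line.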
