# Look at the raw data with your own eyes

Many times when you are given a dataset, you have NO IDEA what the heck might be
in it based upon the name of the file. If you don't know what is in it and
the file extension is wrong, how can you know what program or pandas method
you should use to open it?

The answer is pretty simple: *look at the contents of it*.

## Let's take a look at a small csv

In [11]:
# imports first (taking my own advice this time!)
import json
import pandas as pd

# first generate the csv, so that we know what went into it
df = pd.DataFrame({'hello': [1, 2, 3], 'world': [4, 5, 6]})
df

   hello  world
0      1      4
1      2      5
2      3      6


In [8]:
df.to_csv('small.csv', index=False)

## Great, we have a small 3-row csv written to disk that we can manually inspect

When it comes to manually inspecting data, the tools you reach for
will likely depend on the operating system and development environment you are
used to.

If you are on Windows, you may want to open it in Notepad.

If you are on OS X, you may want to open it in TextEdit.

If you are on Linux, you might prefer Sublime Text.

There is no right answer, but you must be able to open the file
to take a look at it.

If you are on a \*nix machine (Linux, Unix, OS X, etc.) there are a lot of handy
command line tools that you can use. For small files, I like to use `cat`:

In [9]:
! cat small.csv

hello,world
1,4
2,5
3,6


## Those are the entire contents of the file, just for you to see!

Remember, you can do this by opening the file in your favorite text editor.
I'm just using the commands to keep everything in the notebook.
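If you are not on a \*nix machine and still want to stay inside the notebook, a few lines of plain Python can stand in for `cat`. This is only a minimal sketch: the name `show_file` is made up for illustration, it assumes the file is UTF-8 text, and it reads the whole file into memory, so it is only meant for small files.

```python
import os
import tempfile


def show_file(filename, encoding='utf-8'):
    """Print (and return) the full text of a small file, roughly like `cat`."""
    # newline='' keeps the line endings exactly as they are on disk
    with open(filename, encoding=encoding, newline='') as fh:
        contents = fh.read()
    print(contents)
    return contents


# demo on a throwaway copy of the small csv, written to a temp directory
demo_path = os.path.join(tempfile.mkdtemp(), 'small.csv')
with open(demo_path, 'w', newline='') as fh:
    fh.write('hello,world\n1,4\n2,5\n3,6\n')

contents = show_file(demo_path)
```

Returning the text as well as printing it means you can also inspect it with `repr(contents)` when you suspect odd whitespace or line endings.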

So when is this actually handy? Say somebody did something stupid like the following:

In [29]:
df.to_csv('tricksie.csv', sep='#', index=False)

which would then cause you to want to do something like this:

In [30]:
pd.read_csv('tricksie.csv')

   hello#world
0          1#4
1          2#5
2          3#6


What the hell is that??? Why would anyone do something like this?

Although you might be able to guess what is going on from the mangled
dataframe that we have ended up with, you might want to check just
to make sure.

In [31]:
! cat tricksie.csv

hello#world
1#4
2#5
3#6


Great, now that I can see the file, I can tell that
`#` is acting as the separator, which gives us a clue that
we need to use the `sep` option of `read_csv`:

In [32]:
pd.read_csv('tricksie.csv', sep='#')

   hello  world
0      1      4
1      2      5
2      3      6


Well that makes us feel slightly better... now you can continue
on with your work for now and later on track down the person
that created a csv with `#` as a separator and have a very nice
and polite chat with them ;)
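If you would like a second opinion on the separator, Python's standard library `csv.Sniffer` can try to guess it from a sample of the file. The sketch below is an illustration, not a replacement for looking at the file yourself: the helper name `sniff_delimiter` and the list of candidate separators are my own assumptions, and `Sniffer` is a heuristic that can guess wrong or raise `csv.Error` on messy data.

```python
import csv
import os
import tempfile


def sniff_delimiter(filename, candidates=',;#|\t', sample_size=4096):
    """Guess the separator of a delimited text file from its first few KB."""
    with open(filename, newline='') as fh:
        sample = fh.read(sample_size)
    # only consider the candidate characters (an assumption, adjust as needed)
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# demo on a throwaway file with the same contents as tricksie.csv
demo_path = os.path.join(tempfile.mkdtemp(), 'tricksie.csv')
with open(demo_path, 'w', newline='') as fh:
    fh.write('hello#world\n1#4\n2#5\n3#6\n')

detected = sniff_delimiter(demo_path)
print(repr(detected))
```

Whatever comes back can be passed along as `sep=` to `pd.read_csv`, but it is still worth checking the guess against the raw file first.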

## So what if the file is too big to look at the whole thing?

Then you'll need to look at some subset of it. Typically the first few lines are a good bet.

Let's generate a dataset that's a bit too large to open up using a text
editor or something like that.

In [38]:
nlines = 5000000

with open('big.csv', 'wb') as fh:
    fh.write('col1,col2,col2\n'.encode())
    for i in range(0, nlines):
        s = '1,2,3'
        if i != nlines - 1:
            s += '\n'
        fh.write(s.encode())

In [39]:
# use the wc command to count the number of lines in the file
! wc -l big.csv

5000000 big.csv


Alright, you now have a csv with 5 million lines.
Best of luck opening that all at once!
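If `wc` is not available on your machine, the same count can be sketched in plain Python. Like `wc -l`, this counts newline characters rather than rows, which is why a file whose last line has no trailing newline reports one less than the number of lines you would see in an editor. The function name `count_newlines` and the chunk size are assumptions chosen for illustration.

```python
import os
import tempfile


def count_newlines(filename, chunk_size=1024 * 1024):
    """Count newline bytes in a file without loading it all into memory."""
    count = 0
    with open(filename, 'rb') as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            count += chunk.count(b'\n')
    return count


# demo: a header plus two rows, with no newline after the last row
demo_path = os.path.join(tempfile.mkdtemp(), 'tiny.csv')
with open(demo_path, 'wb') as fh:
    fh.write(b'col1,col2,col2\n1,2,3\n1,2,3')

n_newlines = count_newlines(demo_path)
print(n_newlines)
```

The demo file has three lines of text but only two newline characters, so the count is 2.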

Luckily there are a few ways to deal with this. If you're on a
\*nix platform, you can use the `head` command:

In [41]:
# print out the first 10 lines
! head -n 10 big.csv

col1,col2,col2
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3


If you want to do this in a more cross-platform manner, you could
use the following function, which should work with most versions
of Python and on all operating systems:

In [49]:
def head(filename, n=10):
    with open(filename) as fh:
        # get a python list of the first n lines
        # (readline returns '' once the file runs out, which is harmless)
        list_of_lines = [fh.readline() for _ in range(n)]
        # now concatenate them into a single string so
        # the printed output looks okay
        lines_to_print = ''.join(list_of_lines)
        # finally, print it out
        print(lines_to_print)

In [50]:
head('big.csv')

col1,col2,col2
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3



There you go! Now you can look at the first n lines of a file
that is arbitrarily large without crashing your machine!
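The end of a file can be just as informative as the start, for example to check whether the last row is complete. A companion to `head` could be sketched as follows. This is an assumption-laden illustration: the name `tail` is mine, and it still reads through the whole file once, but it only ever keeps the last `n` lines in memory.

```python
import os
import tempfile
from collections import deque


def tail(filename, n=10):
    """Return the last n lines of a text file as a single string."""
    with open(filename) as fh:
        # a deque with maxlen=n throws away older lines as new ones arrive
        last_lines = deque(fh, maxlen=n)
    return ''.join(last_lines)


# demo: a header plus five rows, with no newline after the last row
demo_path = os.path.join(tempfile.mkdtemp(), 'tiny.csv')
with open(demo_path, 'w', newline='') as fh:
    fh.write('value\n0\n1\n2\n3\n4')

last_two = tail(demo_path, n=2)
print(last_two)
```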