# Introduction to Data Science
## Dealing with data
***

### The command line
From the command line, terminal, or shell you can perform almost any computer operation that you would normally use a mouse and graphical interface (GUI) for. In some cases, such as dealing with raw data files, the command line can give a quick way to start exploring.

![Terminal](images/terminal.png)

#### File system
To navigate the folder structure of the machine you are on you will typically use commands such as `ls` (list) and `cd` (change directory). You can make a directory with `mkdir` or move (`mv`) and copy (`cp`) files. To delete a file you can `rm` (remove) it. To print the contents of a file you can `cat` (concatenate) it to the screen.

Many commands have options you can set when running them. For example, to get a detailed listing with one file per line you can pass the `-l` (long format) flag, e.g. `ls -l`. During the normal course of using the command line, you will learn the most useful flags. If you want to see all possible options you can always read the `man` (manual) page for a command, e.g. `man ls`. When you are done reading the `man` page, you can exit by hitting `q` to quit.

You can use shell commands in IPython notebooks by prefixing the line with an exclamation point.

In [None]:
!ls

In [None]:
!mkdir test

In [None]:
!ls -l

In [None]:
!ls -l images/

In [None]:
!cp images/terminal.png test/some_picture.png

In [None]:
!ls test/

In [None]:
!rm test/ # this fails: plain `rm` cannot delete a folder

In [None]:
# WARNING: this deletes the folder and everything in it without asking for confirmation!
!rm -rf test/

In [None]:
!ls -l
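
The same file operations are available from Python's standard library, which can be handy once an exploration turns into a script. The sketch below is only an illustration: it works inside a temporary directory and creates a stand-in file with made-up contents, rather than touching the real `images/` or `test/` folders.

```python
import shutil
import tempfile
from pathlib import Path

# Work in a throwaway directory so nothing real is modified
workdir = Path(tempfile.mkdtemp())

# Stand-in for images/terminal.png (hypothetical contents)
images = workdir / "images"
images.mkdir()
(images / "terminal.png").write_bytes(b"not a real image")

# mkdir test
test_dir = workdir / "test"
test_dir.mkdir()

# cp images/terminal.png test/some_picture.png
shutil.copy(images / "terminal.png", test_dir / "some_picture.png")

# ls test/
listing = sorted(p.name for p in test_dir.iterdir())
print(listing)

# rm -rf test/  (like the shell version, this does not ask for confirmation)
shutil.rmtree(test_dir)
still_there = test_dir.exists()
print(still_there)

# Clean up the temporary directory
shutil.rmtree(workdir)
```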

#### Data manipulation and exploration
Virtually anything you want to do with a data file can be done at the command line. There are dozens of commands that can be used together to get almost any result! Let's take a look at the file `data/users.csv`.

Before we do anything, let's take a look at the first few lines of the file to get an idea of what's in it.

In [None]:
!head data/users.csv

Maybe we want to see a few more lines of the file,

In [None]:
!head -15 data/users.csv

How about the last few lines of the file?

In [None]:
!tail data/users.csv

We can count how many lines are in the file by using `wc` (a word counting tool) with the `-l` flag to count lines,

In [None]:
!wc -l data/users.csv
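
The same three checks (`head`, `tail`, and `wc -l`) can also be done in Python without loading the whole file into memory. This is a minimal sketch that uses an in-memory string as a hypothetical stand-in for `data/users.csv`; the row values are made up.

```python
import io
from collections import deque
from itertools import islice

# Hypothetical stand-in for data/users.csv: a header plus 20 rows
text = "user,variable1,variable2\n" + "".join(
    f"user{i},{i}.5,-{i}.25\n" for i in range(20)
)

# head -3: read only the first three lines
first = [line.rstrip("\n") for line in islice(io.StringIO(text), 3)]

# tail -2: keep a rolling window of the last two lines
last = [line.rstrip("\n") for line in deque(io.StringIO(text), maxlen=2)]

# wc -l: count lines one at a time
n_lines = sum(1 for _ in io.StringIO(text))

print(first)
print(last)
print(n_lines)
```

With a real file, each `io.StringIO(text)` would be replaced by an opened file handle, for example `open("data/users.csv")`.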

It looks like there are three columns in this file, so let's take a look at the first one alone. Here, we can `cut` the field (`-f`) we want as long as we give the proper delimiter (`-d`; the default is tab).

In [None]:
!cut -f1 -d',' data/users.csv

That's a lot of output. Let's combine the `cut` command with the `head` command by _piping_ the output of one command into another one,

In [None]:
!cut -f1 -d',' data/users.csv | head

We can use pipes (`|`) to string together many commands to create very powerful one-liners. For example, let's get the number of unique users in the first column. We will get all values from the first column, sort them, find all unique values, and then count the number of lines,

In [None]:
!cut -f1 -d',' data/users.csv | sort | uniq | wc -l

Or, we can get a list of the top 10 most frequently occurring users. If we give `uniq` the `-c` flag, it will prefix each value with the number of times it occurs. Since these counts are the first entry in each new line, we can tell `sort` to expect numbers (`-n`) and to give us the results in reverse (`-r`) order. Note that when you want to use two or more single-letter flags, you can just place them one after another.

In [None]:
!cut -f1 -d',' data/users.csv | sort | uniq -c | sort -nr | head
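
For comparison, the two pipelines above can be approximated in Python with `collections.Counter`. This is a sketch on a few made-up rows in the same three-column layout, not the real `data/users.csv`. One difference worth knowing: the shell one-liners treat the header line as just another value, while this version skips it explicitly.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the same three-column layout as data/users.csv
text = """user,variable1,variable2
alice,1.0,2.0
bob,-3.0,-4.0
alice,5.0,-6.0
carol,-7.0,-8.0
alice,9.0,1.0
bob,2.0,3.0
"""

reader = csv.reader(io.StringIO(text))
next(reader)  # skip the header row

# cut -f1 -d',': keep only the first field of each row
counts = Counter(row[0] for row in reader)

# sort | uniq | wc -l: number of distinct users
n_unique = len(counts)

# sort | uniq -c | sort -nr | head: most frequent users first
top = counts.most_common(10)

print(n_unique)
print(top)
```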

After some exploration we decide we want to keep only part of our data and bring it into a new file. Let's find all the records that have a negative value in both the second and third columns and put these results in a file called `data/negative_users.csv`. Searching through files can be done using _[regular expressions](http://www.robelle.com/smugbook/regexpr.html#expression)_ with a tool called `grep` (Global Regular Expression Print). You can direct output into a file using a `>`.

In [None]:
!grep '.*,-.*,-.*' data/users.csv > data/negative_users.csv
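
The `grep` pattern works on text: it keeps any line in which a minus sign directly follows a comma twice. An alternative is to parse the numbers and compare them, which states the intent more directly. The sketch below assumes the three-column layout seen earlier and uses made-up rows; an in-memory buffer stands in for `data/negative_users.csv`. Unlike the `grep` version, it copies the header row into the output.

```python
import csv
import io

# Hypothetical rows in the same layout as data/users.csv
text = """user,variable1,variable2
alice,1.0,2.0
bob,-3.0,-4.0
alice,5.0,-6.0
carol,-7.0,-8.0
"""

source = io.StringIO(text)
target = io.StringIO()  # stands in for data/negative_users.csv

reader = csv.reader(source)
writer = csv.writer(target, lineterminator="\n")

writer.writerow(next(reader))  # copy the header row across

for row in reader:
    # Keep the row only when both numeric columns are negative
    if float(row[1]) < 0 and float(row[2]) < 0:
        writer.writerow(row)

negative_rows = target.getvalue().splitlines()
print(negative_rows)
```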

We can check the data folder to see if our new file is in there,

In [None]:
!ls -l data

#### Using Python, IPython, and Pandas
The command line is great for a first step in data exploration. However, for more in-depth operations you will generally want to move to a language and environment better suited to involved data manipulation. Here, we will discuss the use of Python as a data-crunching tool.

There are many ways to read data in Python. Here we will discuss two options: (1) reading line by line (streaming) and (2) using Pandas to create data frames.

Reading in data line by line can be slightly less structured since it depends on you knowing or determining what each line looks like. For structured data, this might be an unnecessary waste of time. However, for highly unstructured data, this can be very useful.

In [None]:
# Read in the data/users.csv file line by line
with open("data/users.csv", "r") as file:
    for line in file:
        # Break up the line (strip the trailing newline first)
        user, variable1, variable2 = line.strip().split(",")

        # Skip the header row
        if user == "user":
            continue

        # Print the user and the sum of variable1 and variable2
        print("%s: %.3f" % (user, float(variable1) + float(variable2)))
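
Line-by-line reading pays off when some lines do not follow the expected layout. The sketch below assumes a messy, made-up input (a blank line, a non-numeric value, and a row with a missing field) and shows one way to skip such lines and count them instead of crashing.

```python
import io

# Hypothetical messy input: a header, a blank line, and two malformed rows
text = """user,variable1,variable2
alice,1.0,2.0

bob,not_a_number,4.0
carol,-7.0,-8.0
dave,1.0
"""

sums = {}
skipped = 0

for line in io.StringIO(text):
    parts = line.strip().split(",")

    # Skip the header row
    if parts[0] == "user":
        continue

    try:
        user, variable1, variable2 = parts
        sums[user] = float(variable1) + float(variable2)
    except ValueError:
        # Wrong number of fields, or a value that is not a number
        skipped += 1

print(sums)
print(skipped)
```

Counting the skipped lines, rather than silently dropping them, gives a quick sense of how clean the file is.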

For structured data like we have here, Pandas may make more sense.

In [None]:
# Read in data/users.csv using Pandas
import pandas as pd
users = pd.read_csv("data/users.csv")

In [None]:
# Take a look at the pandas DataFrame
users.head()

In [None]:
# Add another column to this DataFrame
users['sum'] = users['variable1'] + users['variable2']
users.head()

Python and Pandas allow us to do complex tasks very easily, such as plotting.

Let's visualize the relationship between variable1 and variable2 in these data with a scatterplot.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(users['variable1'], users['variable2'], ".", markersize=0.75)

If the scatterplot resembles a map, the two variables are probably geographic coordinates. We can rename the columns of our DataFrame to reflect what we discovered.

In [None]:
users.columns = ['user', 'longitude', 'latitude', 'sum']
users.head()

#### Another example
Let's grab a slightly modified version of the survey data from your first homework assignment. It's located on the web, so let's save time by downloading it straight from the command line and placing it in the correct folder.

In [None]:
!curl -o data/ds_survey.csv moakler.org/ds/data/ds_survey.csv

Now that we have the data, let's take a look at what we have.

In [None]:
!head -5 data/ds_survey.csv

It looks like a fairly well structured .csv file. Let's use pandas to turn it into a data frame.

In [None]:
import pandas as pd
survey = pd.read_csv("data/ds_survey.csv")

We have a lot of data here. How can we start to make sense of it? Using the pandas `describe()` method, we can get a summary of the numeric features.

In [None]:
survey.describe()

You might find it easier to flip the data frame,

In [None]:
survey.describe().transpose()

Do we have other features? Some that aren't numeric? Let's take a look.

In [None]:
survey.columns

It looks like we have quite a few. But what type are they if they're not numeric? Let's take a look,

In [None]:
survey.head()

A few fields are Yes/No. Let's turn these into 1s (Yes) and 0s (anything else).

In [None]:
for field in ['full_time', 'regression', 'databases', 'cloud', 'api']:
    survey[field] = (survey[field] == "Yes").astype('int')
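
To see what this conversion does step by step, here is a small sketch on a made-up column (the values are hypothetical, not taken from the survey). Note that anything other than the exact string "Yes", including a missing answer, ends up as 0.

```python
import pandas as pd

# Hypothetical survey answers; None stands for an unanswered question
answers = pd.DataFrame({"cloud": ["Yes", "No", "Yes", None]})

# The comparison yields booleans; astype('int') turns them into 1s and 0s
is_yes = answers["cloud"] == "Yes"
answers["cloud"] = is_yes.astype("int")

encoded = answers["cloud"].tolist()
print(encoded)
```

If unanswered questions need to stay distinguishable from "No", they would have to be handled separately before this step.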

In [None]:
survey.describe().transpose()

We can plot some of the features if we want,

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.hist(survey['terminal'], bins=range(1, 11))
plt.ylabel('Count')
plt.xlabel('Rank')
plt.xticks(ha='center')
plt.show()

We can also take a look at subsets of the data,

In [None]:
survey.loc[survey['cloud'] == 1]
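
The expression inside `.loc[...]` is a boolean mask: one True or False per row. A minimal sketch on made-up rows (only the column names are borrowed from the survey) shows the mask on its own and how a second argument to `.loc` picks out a single column.

```python
import pandas as pd

# Hypothetical rows using column names from the survey cells above
sample = pd.DataFrame({
    "cloud": [1, 0, 1, 0],
    "terminal": [7, 3, 9, 5],
})

# The comparison builds a boolean mask with one entry per row
mask = sample["cloud"] == 1

# .loc keeps the rows where the mask is True
cloud_users = sample.loc[mask]

# A second argument to .loc selects a column at the same time
terminal_ranks = sample.loc[mask, "terminal"].tolist()

print(mask.tolist())
print(terminal_ranks)
```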

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.hist(list(survey.loc[survey['api'] == 1, 'terminal']), color="red", alpha=0.3, bins=range(1, 11), label="API = 1")
plt.hist(list(survey.loc[survey['api'] == 0, 'terminal']), color="blue", alpha=0.3, bins=range(1, 11), label="API = 0")
plt.ylabel('Count')
plt.xlabel('Rank')
plt.legend()