# Data Science at UCSB

# Python for Data Science: Tabular Data

**Jason Freeberg, Fall 2016**

A lot of data is organized in tabular form. It is not the *only* data format, but it is among the easiest to work with because it is well-structured. Other data formats you will come across include [JSON](http://www.json.org/), [relational and non-relational databases](https://www.mongodb.com/scale/relational-vs-non-relational-database), images, and audio files. Believe it or not, you are already familiar with tabular data: it's simply a table with columns and rows, just like in Excel.

As data scientists-to-be, however, we need to define some terms. We will often refer to rows as *observations* or *records*, and columns as *variables* or *features*. The *header* is the top row containing the names of our variables. In the example below our variables are country, salesperson, order id, and so on. Our observations are individual orders with those variable values. Our header, in this case, would be the row with index #1.

![data_pic](http://mothimages.s3.amazonaws.com/tabular_data_1.png)

In today's lab we will get acquainted with the [pandas module](http://pandas.pydata.org/) by loading a Comma-Separated Values (.csv) [file](https://archive.ics.uci.edu/ml/datasets/Forest+Fires) from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.html). We will then inspect it, check for missing values, coerce the variables to the correct types, and create aggregate reports by conditional selection. Then you'll follow the same pipeline on your own with a different dataset!
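
To make the vocabulary above concrete, here is a minimal sketch using a tiny made-up table. The column names and values are invented for illustration and are not the data in the picture.

```python
import pandas as pd

# A tiny made-up table: each key is a variable (column), and each
# position across the lists is one observation (row).
orders = pd.DataFrame({
    'country': ['USA', 'UK', 'USA'],
    'salesperson': ['Ana', 'Ben', 'Cho'],
    'order_amount': [120.0, 75.5, 310.0],
})

# The header is the list of variable names.
header = list(orders.columns)

# shape is (number of observations, number of variables).
n_observations, n_variables = orders.shape

# One observation (the first row) as a plain dictionary.
first_record = orders.iloc[0].to_dict()

print(header)
print("%d observations, %d variables" % (n_observations, n_variables))
print(first_record)
```

Each row is one record, and each column holds one variable measured for every record.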


In [None]:
# Import the modules we'll need and assign the data's URL.

# By the way, it's customary to include all module imports at the beginning of your script.
import numpy as np
import pandas as pd
import urllib2

UCI_data_URL = 'http://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv'


In [None]:
# A helper function to read the data from the url. 
# Just run this cell, but understand what the function is doing.


def read_csv_from_url(URL):
    """
    Takes as input a string containing the URL pointing to a dataset from the UCI data repository.
    Returns a pandas dataframe containing the data with some columns coerced as strings.
    """
    
    response = urllib2.urlopen(URL)
    lines = pd.read_csv(response, 
                        header = 0,
                        index_col = False,
                        dtype = {'DMC' : str,
                                 'temp' : str,
                                 'area' : str})
    
    return pd.DataFrame(lines)

# Load and Check Data

Using the URL and function above, let's load the data into our notebook as a pandas dataframe. We will then inspect the dataframe's size, missing values, and variable types.

In [None]:
# Load the data and print the head of the dataframe.

fire_df = read_csv_from_url(UCI_data_URL)

# Let's check the head and size of our data

print fire_df.head()
print "Number of rows:", fire_df.shape[0]
print "Number of columns:", fire_df.shape[1]

We now have our data loaded and assigned as a pandas.DataFrame object. However, I made a *slight* adjustment and loaded some variables as **strings**. Know that the pandas read_csv( ) function is very well built and could have inferred the correct types for all columns, but variable coercion is a common data preparation task so we will do it in this lab.
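
If you'd like to see the difference between inferred and forced types for yourself, the sketch below reads a three-row, made-up CSV from a string instead of a URL. The sample values are invented; only the column names echo the fire data.

```python
import io
import pandas as pd

# A made-up sample, small enough to read at a glance.
csv_text = u"month,temp,area\nmar,8.2,0.0\noct,18.0,0.36\naug,22.8,10.01\n"

# Default behaviour: read_csv inspects the values and infers numeric types.
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing a column to be read as text via the dtype argument.
forced = pd.read_csv(io.StringIO(csv_text), dtype={'temp': str})

temp_inferred_numeric = pd.api.types.is_numeric_dtype(inferred['temp'])
temp_forced_numeric = pd.api.types.is_numeric_dtype(forced['temp'])

print("temp numeric when inferred: %s" % temp_inferred_numeric)
print("temp numeric when forced to str: %s" % temp_forced_numeric)
```

A column forced to text keeps its values exactly as they were written in the file, which is why we have to convert it before doing any arithmetic.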

**Right now we'll check for missing and incorrect values.**

In [None]:
# .isnull() returns a dataframe of logical (T/F) entries where True = Is_Null and False = Not_Null.
# We can use the sum() method to take the sums by each column. Remember that True = 1, False = 0.

logical_dataframe = fire_df.isnull()
print logical_dataframe.sum()

So we don't have any **NaN** or **None** values in our columns. But we're not out of the woods yet. Let's take a look at our categorical variables and check that they're reasonable. By printing out the unique strings in each column, we'll be able to see if there are any inappropriate values like misspelled days or months.

In [None]:
# The syntax, "dataFrame.columnName" will return a pandas Series object. 
# We can use the unique() method to get the distinct strings held in the Series object. 

print "Class of our returned column:", type(fire_df.month)
print fire_df.month.unique()
print fire_df.day.unique()

Luckily for us the UCI datasets are often very clean. Although we didn't uncover any missing or incorrect values in this dataset, these types of checks will become routine when you start a project or intern at a company.
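
Since the fire data came back clean, here is a sketch of what the same two checks look like on a table that does have problems. The data is made up, and the list of valid days assumes three-letter lowercase abbreviations.

```python
import numpy as np
import pandas as pd

# Made-up data with two deliberate problems: a missing temperature
# and a misspelled day ('tues' instead of 'tue').
dirty = pd.DataFrame({
    'day': ['mon', 'tues', 'wed', 'tue'],
    'temp': [18.0, np.nan, 22.8, 15.1],
})

# Count missing values per column.
missing_per_column = dirty.isnull().sum()

# Keep only the rows whose day is NOT among the values we expect.
valid_days = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
bad_day_rows = dirty[~dirty.day.isin(valid_days)]

print(missing_per_column)
print(bad_day_rows)
```

Printing `unique()` works well when you can eyeball the result; `isin()` is handy when you already know the allowed values and want the offending rows themselves.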

**Now we'll look at our column types and make adjustments as necessary.**

In [None]:
# This will show our columns and their corresponding types.
print 'The data types of our features:'
print fire_df.dtypes, '\n'

# That's a lot to look at, let's narrow our search. This is a conditional selection, which we'll get to later.
print 'Our non-numeric variables:'
print fire_df.dtypes[fire_df.dtypes == 'object']  # Condition is in the square brackets.

# Month and day are okay being objects (strings), but those other three need to be converted to floats...
fire_df.DMC = fire_df.DMC.astype(float)
fire_df.area = fire_df.area.astype(float)
fire_df.temp = fire_df.temp.astype(float)
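
The conversion above works because every string in those columns is a valid number. As a hedged sketch of what you might do when that isn't the case, the made-up example below uses `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into NaN instead of raising an error.

```python
import pandas as pd

# Made-up string columns; 'area' contains one entry that is not a number.
raw = pd.DataFrame({
    'temp': ['8.2', '18.0', '22.8'],
    'area': ['0.0', 'unknown', '10.01'],
})

# astype(float) is fine when every string is a valid number...
raw['temp'] = raw['temp'].astype(float)

# ...but it would raise a ValueError on 'unknown'. to_numeric with
# errors='coerce' converts what it can and marks the rest as NaN.
raw['area'] = pd.to_numeric(raw['area'], errors='coerce')

n_unparseable = raw['area'].isnull().sum()

print(raw.dtypes)
print("Unparseable area values: %d" % n_unparseable)
```

Coercing to NaN hides the original bad value, so it is worth counting the new missing entries right away, as done here.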

# Conditional Selection

Now that we have vetted the data for discrepancies, we can do some exploratory analysis. Let's first cover conditional selection. Our data is 517 x 13, but we often won't want to use the entire table. We might only need a couple of columns, or perhaps we only want to look at the data on Tuesdays. With pandas, it's easy to select columns and rows based on arbitrary conditions.

- Here's the basic syntax: *dataframe*[*condition on **rows***]\[*names or numbers of **columns***]
- Alternatively... *dataframe*.loc[*condition on **rows***, *selection of **columns*** ]
  - Older code uses the .ix attribute in the same way, but .ix was deprecated and then removed in pandas 1.0
  - If you come from an R background, this syntax may seem familiar
- Use *dataframe*.iloc**[ ]** for selecting rows and columns based on **purely numerical** indices

Click [here](http://pandas.pydata.org/pandas-docs/stable/indexing.html) for the full documentation on slicing and dicing pandas DataFrames.
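
The difference between selecting by label and selecting by position is easiest to see when the row labels are not simply 0, 1, 2, and so on. Here is a minimal sketch on a made-up frame with invented values and row labels.

```python
import pandas as pd

# Made-up frame whose row labels are 10, 20, 30 rather than 0, 1, 2.
df = pd.DataFrame(
    {'day': ['mon', 'tue', 'wed'], 'area': [0.0, 36.5, 4.2]},
    columns=['day', 'area'],
    index=[10, 20, 30],
)

# .iloc counts positions: the row at position 0, the column at position 1.
by_position = df.iloc[0, 1]

# .loc uses labels: the row labelled 20, the column named 'area'.
by_label = df.loc[20, 'area']

# .loc also accepts a condition on rows plus a list of column names.
large_fires = df.loc[df.area > 30, ['day', 'area']]

print(by_position)
print(by_label)
print(large_fires)
```

On the fire data the row labels happen to match the positions, so the two styles often give the same answer there.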

In [1]:
# Filter rows...
fire_df[:5]  # first five rows (positions 0 to 4)
fire_df.iloc[:5]  # same as above
fire_df.iloc[[0,1,2,3,4]]  # same again
fire_df[fire_df.area > 30]
fire_df[ (fire_df.area > 30) & (fire_df.rain > 10) ]  # two conditions on rows

# Select columns...
fire_df.temp  # a single column
fire_df["temp"]  # also a single column 
fire_df[['day', 'area', 'rain']]  # multiple columns

# Select AND filter...
fire_df[fire_df.area > 30][['day', 'area', 'rain']]

# Using '.loc' (the replacement for the older '.ix' indexer) and making the same selection as above...
fire_df.loc[ fire_df.area > 30, ['day', 'area', 'rain'] ]

# Just to hide output...
print("Wow Python is so cool!")

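
The introduction mentioned aggregate reports by conditional selection. A minimal sketch of that idea, on made-up observations standing in for the fire data, might look like this: filter first, then summarize.

```python
import pandas as pd

# Made-up observations; only the column names echo the fire data.
sample = pd.DataFrame({
    'day': ['tue', 'tue', 'sat', 'sat', 'sun'],
    'area': [0.0, 40.0, 12.0, 8.0, 100.0],
})

# Conditional selection followed by an aggregate: mean area on Tuesdays.
tuesday_mean_area = sample[sample.day == 'tue'].area.mean()

# A condition is a Series of True/False values, so summing it counts the Trues.
n_large = (sample.area > 30).sum()

# The same kind of report for every day at once, using groupby.
mean_area_by_day = sample.groupby('day').area.mean()

print("Mean area on Tuesdays: %.1f" % tuesday_mean_area)
print("Observations with area > 30: %d" % n_large)
print(mean_area_by_day)
```

Filtering answers a question about one group, while `groupby` produces the same summary for all groups in a single step.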

# Your turn

Get in the driver's seat, because it's your turn to write some code. Look for the `<FILL IN>` bits. Good luck and be sure to ask Jason for clarification or help.

In [None]:
# Read the data from the URL and store it in a pandas dataframe
newURL = "http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"


def myCSVReader(a_URL):
    #response = <FILL IN>
    #lines = pd.<FILL IN>
    response = urllib2.urlopen(a_URL)
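    # This file has no header row, so don't let pandas use the first record as column names.
    lines = pd.read_csv(response, header = None)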
  
    #return <FILL IN>
    return pd.DataFrame(lines)

carData = myCSVReader(newURL)

carData.head()
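
Some raw files ship without a header row and use a placeholder such as '?' for unknown values. As a hedged sketch of how `read_csv` can handle both, the example below reads a made-up sample from a string. The column names are hypothetical and chosen only for this illustration.

```python
import io
import pandas as pd

# Made-up sample: no header row, and '?' wherever a value is unknown.
raw_text = u"3,?,convertible,13495\n1,164,sedan,13950\n2,?,sedan,?\n"

# Hypothetical column names, invented for this example.
column_names = ['risk', 'losses', 'body_style', 'price']

cars = pd.read_csv(io.StringIO(raw_text),
                   header=None,         # the first line is data, not names
                   names=column_names,  # supply the names ourselves
                   na_values='?')       # treat '?' as a missing value

missing_counts = cars.isnull().sum()

print(cars.shape)
print(missing_counts)
```

Once the placeholders are read as missing values, the `isnull()` check from earlier in the lab reports them like any other gap.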