# Data Wrangling

<div class="alert alert-success">
'Data Wrangling' generally refers to transforming raw data into a usable form for your analyses of interest, including loading, aggregating and formatting.
</div>

Note: Throughout this notebook, we will be using '!' to run the shell command 'cat' to print out the contents of example data files.

## Python I/O

<div class="alert alert-success">
Python has some basic tools for I/O (input / output). 
</div>

<div class="alert alert-info">
Official Python 
<a href="https://docs.python.org/3/library/io.html" class="alert-link">documentation</a> 
on I/O.
</div>

In [1]:
# Check out an example data file
!cat files/dat.txt

First line of data
Second line of data

In [2]:
# First, explicitly open the file object for reading
f_obj = open('files/dat.txt', 'r')

# You can then loop through the file object, grabbing each line of data
for line in f_obj:
    # Note that I'm removing the new line marker at the end of each line (the '\n')
    print(line.strip('\n'))

# File objects then have to be closed when you are finished with them
f_obj.close()

First line of data
Second line of data


In [3]:
# Since opening and closing files basically always go together, there is a shortcut that does both
#  Use the 'with' keyword to open files, and the file object will automatically be closed at the end of the code block
with open('files/dat.txt', 'r') as f_obj:
    for line in f_obj:
        print(line.strip('\n'))

First line of data
Second line of data
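
To check that 'with' really does close the file for you, you can look at the file object's `closed` attribute inside and after the block. This is a minimal sketch: it writes its own temporary file (a stand-in for `files/dat.txt`, with the same two lines), and the names `tmp_path`, `closed_inside` and `closed_after` are just for illustration.

```python
import os
import tempfile

# Write a small throwaway file so the example does not depend on files/dat.txt
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('First line of data\nSecond line of data\n')
    tmp_path = tmp.name

with open(tmp_path, 'r') as f_obj:
    lines = [line.rstrip('\n') for line in f_obj]
    # Inside the block the file is still open
    closed_inside = f_obj.closed

# After the block, the file has been closed automatically
closed_after = f_obj.closed

# Clean up the temporary file
os.remove(tmp_path)

print(lines)
print(closed_inside, closed_after)
```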


Using Python's I/O is a pretty 'low level' way to read data files, and it often takes a lot of work to sort out the details of reading files. In the example above, for instance, we had to deal with the new line character explicitly.

As long as you have reasonably well structured data files, using standardized file types, you can use higher-level functions that will take care of a lot of these details - loading data straight into pandas data objects, for example.
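
To give a sense of the manual work involved, here is a rough sketch of turning lines of comma-separated text into lists of integers using only basic Python. The text in `raw` is a made-up stand-in for a data file, wrapped in `io.StringIO` so that it can be looped over like an open file.

```python
import io

# A stand-in for an open file containing comma-separated numbers (made-up data)
raw = io.StringIO('1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n')

rows = []
for line in raw:
    # Remove the new line marker and any surrounding whitespace
    line = line.strip()
    # Skip over empty lines
    if not line:
        continue
    # Split on commas and convert each piece to an integer
    #  (int() ignores the leading space left over after splitting)
    rows.append([int(value) for value in line.split(',')])

print(rows)
```

Every detail here (new lines, empty lines, separators, type conversion) has to be handled by hand, which is what the higher-level readers take care of.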

## Pandas I/O

<div class="alert alert-success">
Pandas has a range of functions that will automatically read in whole files of standard file types into pandas objects.
</div>

<div class="alert alert-info">
Official Pandas
<a href="http://pandas.pydata.org/pandas-docs/stable/io.html" class="alert-link">documentation</a> 
on I/O. 
</div>

In [4]:
import pandas as pd

In [None]:
# Tab complete to check out all the read functions available
pd.read_
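
If tab completion is not available (for example, when running a script rather than a notebook), a rough alternative is to list the names in the pandas namespace that start with `read_`. The variable name `readers` is just for illustration, and the exact list will depend on your pandas version.

```python
import pandas as pd

# Collect all the top-level pandas names that start with 'read_'
readers = [name for name in dir(pd) if name.startswith('read_')]

print(readers)
```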

## File types

<div class="alert alert-success">
There are many different standardized (and un-standardized) file types in which data may be stored. Here, we will start by examining CSV and JSON files.
</div>

### CSV Files

<div class="alert alert-success">
'Comma Separated Value' files store data, separated by commas. Think of them like lists.
</div>

<div class="alert alert-info">
More information on CSV files from
<a href="https://en.wikipedia.org/wiki/Comma-separated_values" class="alert-link">Wikipedia</a>. 
</div>

In [1]:
# Let's have a look at a csv file (printed out in plain text)
!cat files/dat.csv

1, 2, 3, 4
5, 6, 7, 8
9, 10, 11, 12

#### CSV Files with Python

In [8]:
# Python has a module devoted to working with csv's
import csv

In [9]:
# We can read through our file with the csv module
with open('files/dat.csv') as csvfile:
    csv_reader = csv.reader(csvfile, delimiter=',')
    for row in csv_reader:
        print(', '.join(row))

1,  2,  3,  4
5,  6,  7,  8
9,  10,  11,  12
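
Note the doubled spaces in the output above: the csv module returns every field as a string, and keeps the space that follows each comma in the file. The sketch below shows one way to deal with both, using the `skipinitialspace` option of `csv.reader` and converting the fields to integers. The text is a copy of the contents of `files/dat.csv`, wrapped in `io.StringIO` so the example runs without the file.

```python
import csv
import io

# Same contents as files/dat.csv, held in memory for this example
csv_text = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n'

# By default, fields are strings and keep the space after each comma
first_raw = next(csv.reader(io.StringIO(csv_text)))
print(first_raw)  # → ['1', ' 2', ' 3', ' 4']

# skipinitialspace drops the whitespace that follows each delimiter
reader = csv.reader(io.StringIO(csv_text), delimiter=',', skipinitialspace=True)

# Fields are still strings, so convert them to integers explicitly
rows = [[int(value) for value in row] for row in reader]
print(rows)
```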


#### CSV Files with Pandas

In [10]:
# Pandas also has functions to directly load csv data
pd.read_csv?

In [11]:
# Let's read in our csv file
#  Passing the file path lets pandas open and close the file itself
pd.read_csv('files/dat.csv', header=None)

   0   1   2   3
0  1   2   3   4
1  5   6   7   8
2  9  10  11  12
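
Since this file has no header row, pandas labels the columns 0 to 3. As a sketch of how you might give the columns names while loading, the example below uses the `names` and `skipinitialspace` arguments of `pd.read_csv`. The column names 'a' to 'd' are made up for illustration, and the text is a copy of `files/dat.csv` held in memory.

```python
import io

import pandas as pd

# Same contents as files/dat.csv, held in memory for this example
csv_text = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n'

# Supply column names (hypothetical ones here), since the file has no header row
df = pd.read_csv(io.StringIO(csv_text), header=None,
                 names=['a', 'b', 'c', 'd'], skipinitialspace=True)

print(df)

# Columns are loaded as numbers, so we can compute on them directly
b_total = int(df['b'].sum())
print(b_total)
```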


### JSON Files

<div class="alert alert-success">
JavaScript Object Notation files can store hierarchical key/value pairings. Think of them like dictionaries.
</div>

<div class="alert alert-info">
More information on JSON files from
<a href="https://en.wikipedia.org/wiki/JSON" class="alert-link">Wikipedia</a>.
</div>

In [12]:
# Let's have a look at a json file (printed out in plain text)
!cat files/dat.json

{
  "firstName": "John",
  "age": 53
}


In [13]:
# Think of json's as similar to dictionaries
#  Note that 'age' is a number in the json file above, so it is a number here too
d = {'firstName': 'John', 'age': 53}
print(d)

{'firstName': 'John', 'age': 53}


#### JSON Files with Python

In [14]:
# Python also has a module for dealing with json
import json

In [15]:
# Load a json file
with open('files/dat.json') as dat_file:    
    dat = json.load(dat_file)

In [16]:
# Check what data type this gets loaded as
print(type(dat))

<class 'dict'>
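
The json module also works on strings, with `json.loads` and `json.dumps`, and it keeps basic types such as numbers, strings and lists intact. A minimal sketch, using the same content as `files/dat.json` as a string, plus a made-up nested dictionary (the 'person' and 'tags' keys are only for illustration):

```python
import json

# Same content as files/dat.json, as a string
text = '{"firstName": "John", "age": 53}'

# Parse a json formatted string into a dictionary
dat = json.loads(text)
print(type(dat['firstName']), type(dat['age']))

# Nested structures can be written out to a string and read back in
nested = {'person': dat, 'tags': ['a', 'b']}
round_tripped = json.loads(json.dumps(nested))
print(round_tripped == nested)
```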


#### JSON Files with Pandas

In [17]:
# Pandas also has support for reading in json files
pd.read_json?

In [18]:
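# You can read in json formatted strings with pandas
#  Note that here I am specifying to read it in as a pd.Series, as there is a single record of data
#  The string is wrapped in StringIO, as newer pandas versions warn about passing a literal json string
from io import StringIO
pd.read_json(StringIO('{ "first": "Alan", "place": "Manchester"}'), typ='series')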

first          Alan
place    Manchester
dtype: object

In [19]:
# Read in our json file with pandas
#  Passing the file path lets pandas open and close the file itself
pd.read_json('files/dat.json', typ='series')

age            53
firstName    John
dtype: object
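
When a json file holds a list of several records, rather than a single one, pandas can load it as a DataFrame, with one row per record. A minimal sketch, assuming a small made-up list of records (the second record is invented for illustration and is not in the example files):

```python
import io

import pandas as pd

# A json formatted list of records (made-up data for illustration)
json_text = '[{"firstName": "John", "age": 53}, {"firstName": "Alan", "age": 41}]'

# Without typ='series', read_json returns a DataFrame
df = pd.read_json(io.StringIO(json_text))

print(df)
```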