# Data Wrangling
[Data wrangling](https://en.wikipedia.org/wiki/Data_wrangling) is the process of transforming and mapping data from a "raw" format into a format that is more valuable for downstream purposes such as analytics. Read more about data.

Data wrangling can be divided into two steps:

1. Data acquisition
2. Data cleaning


## Data acquisition

Some ways to acquire data are:
    
* Downloading files
* Accessing an API
* Scraping a web page
* Combining data from different formats

### Comma Separated Values (CSV)
A [comma-separated values (CSV)](https://en.wikipedia.org/wiki/Comma-separated_values) file is a delimited text file that uses commas to separate values. CSV is easy to process with code (unlike [.xlsx](https://fileinfo.com/extension/xlsx)). In Python, the contents of a CSV file are commonly represented as a list of rows. There are two common choices for how to represent each row. In the first option, each row is a list.

In [1]:
# Option 1: Each row is a list
csv = [['Q', 'W', 'E'],
       ['R', 'T', 'Y']]

In the second option, each row is a dictionary. This option works well if you have a CSV header, because then the keys of each dictionary can be the column names and the values can be the fields.

In [2]:
# Option 2: Each row is a dictionary
csv = [{'name1': 'Q', 'name2': 'W', 'name3': 'E'},
       {'name1': 'R', 'name2': 'T', 'name3': 'Y'}]
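
The two representations carry the same information, so one can be turned into the other. The following is a minimal sketch, assuming the first list holds the header; the helper name `lists_to_dicts` and the variable names are made up for illustration.

```python
# Option 1 with a header row: every row, including the header, is a list
rows_as_lists = [['name1', 'name2', 'name3'],
                 ['Q', 'W', 'E'],
                 ['R', 'T', 'Y']]

def lists_to_dicts(rows):
    """Pair each data row with the header row to get one dictionary per row."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]

# Option 2: each row is a dictionary keyed by column name
rows_as_dicts = lists_to_dicts(rows_as_lists)
print(rows_as_dicts)
```

The dictionary form repeats the column names in every row, which costs some memory but lets you write `row['name2']` instead of remembering column positions.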

## Loading data from CSVs
### Python's csv Module
We will be using the `unicodecsv` library since it comes with Anaconda and supports Unicode. `unicodecsv` works exactly the same as Python's [`csv`](https://docs.python.org/2/library/csv.html) module, and the `csv` documentation page is still the best way to learn how to use `unicodecsv`.
### Data file
Let's have a look at our data file before reading it with `unicodecsv`.

In [3]:
file_extract = ''
with open('data_files/enrollments.csv', 'r') as f:
    for index, line in enumerate(f):
        file_extract += line
        if index == 5:
            break
print(file_extract)

account_key,status,join_date,cancel_date,days_to_cancel,is_udacity,is_canceled
448,canceled,2014-11-10,2015-01-14,65,True,True
448,canceled,2014-11-05,2014-11-10,5,True,True
448,canceled,2015-01-27,2015-01-27,0,True,True
448,canceled,2014-11-10,2014-11-10,0,True,True
448,current,2015-03-10,,,True,False
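
The same kind of preview can also be written with `itertools.islice`, which stops reading after a fixed number of lines. This is only a sketch: the helper name `preview` is made up, and a small temporary file stands in for `data_files/enrollments.csv` so the example runs on its own.

```python
import itertools
import os
import tempfile

def preview(path, n=6):
    """Return the first n lines of a text file as a single string."""
    with open(path, 'r') as f:
        return ''.join(itertools.islice(f, n))

# Hypothetical stand-in file with a header and two data lines
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("account_key,status\n448,canceled\n448,current\n")

text = preview(tmp.name, n=2)  # header plus the first data line
print(text)

os.remove(tmp.name)  # clean up the stand-in file
```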



### Loading the data

Next we will load the data from the file using `unicodecsv`. The mode `rb` in `open('...', 'rb')` stands for read binary: the file is opened for reading in binary mode, which the [`csv`](https://docs.python.org/2/library/csv.html) documentation page says we need to use. We are using the `DictReader` since our data has a header row. Our reader will be an iterator; the difference between lists and iterators in Python is explained [here](https://www.codementor.io/sheena/python-generators-and-iterators-du1082iua). An iterator lets you write a loop to access each element, but only once.

In [10]:
import unicodecsv
from pprint import pprint

enrollments = []

with open('data_files/enrollments.csv', 'rb') as f:
    reader = unicodecsv.DictReader(f)
    enrollments = list(reader)

pprint(enrollments[0])

OrderedDict([('account_key', '448'),
             ('status', 'canceled'),
             ('join_date', '2014-11-10'),
             ('cancel_date', '2015-01-14'),
             ('days_to_cancel', '65'),
             ('is_udacity', 'True'),
             ('is_canceled', 'True')])
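
The "only once" behaviour of the reader is the reason the cell above copies it into a list. The sketch below illustrates this under two assumptions: it uses the standard library `csv` module as a stand-in for `unicodecsv`, and it reads a small in-memory sample instead of the real file.

```python
import csv
import io

# Hypothetical in-memory sample with two columns from the enrollments file
sample = io.StringIO(
    "account_key,status\n"
    "448,canceled\n"
    "448,current\n"
)

reader = csv.DictReader(sample)

first_pass = [row['status'] for row in reader]   # consumes the reader
second_pass = [row['status'] for row in reader]  # nothing is left to read

print(first_pass)
print(second_pass)
```

The second loop produces an empty list because the reader is already exhausted. Calling `list(reader)` once, as above, keeps the rows available for repeated use.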


Let's read in two more example files.

In [16]:
daily_engagement = []
project_submissions = []

def read_csv(filename):
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)

daily_engagement = read_csv('data_files/daily_engagement.csv')
project_submissions = read_csv('data_files/project_submissions.csv')

pprint(daily_engagement[0])
pprint(project_submissions[0])

OrderedDict([('acct', '0'),
             ('utc_date', '2015-01-09'),
             ('num_courses_visited', '1.0'),
             ('total_minutes_visited', '11.6793745'),
             ('lessons_completed', '0.0'),
             ('projects_completed', '0.0')])
OrderedDict([('creation_date', '2015-01-14'),
             ('completion_date', '2015-01-16'),
             ('assigned_rating', 'UNGRADED'),
             ('account_key', '256'),
             ('lesson_key', '3176718735'),
             ('processing_state', 'EVALUATED')])
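
Every field in the output above is a string, including dates, numbers and booleans. Turning them into proper Python types is a typical first data cleaning step. The following is a minimal sketch, not a complete cleaning routine: the helper names `parse_date` and `parse_int` are made up, and the row is typed in by hand from the last line of the file extract shown earlier.

```python
from datetime import datetime

def parse_date(text):
    """Return a datetime for 'YYYY-MM-DD' text, or None for an empty field."""
    return datetime.strptime(text, '%Y-%m-%d') if text else None

def parse_int(text):
    """Return an int, or None for an empty field."""
    return int(text) if text else None

# Row copied by hand from the extract: 448,current,2015-03-10,,,True,False
row = {'account_key': '448', 'status': 'current',
       'join_date': '2015-03-10', 'cancel_date': '',
       'days_to_cancel': '', 'is_udacity': 'True', 'is_canceled': 'False'}

cleaned = dict(row)  # copy so the raw row stays untouched
cleaned['join_date'] = parse_date(row['join_date'])
cleaned['cancel_date'] = parse_date(row['cancel_date'])
cleaned['days_to_cancel'] = parse_int(row['days_to_cancel'])
cleaned['is_udacity'] = row['is_udacity'] == 'True'
cleaned['is_canceled'] = row['is_canceled'] == 'True'

print(cleaned)
```

Empty fields become `None` instead of raising an error, which matters here because students who have not canceled have no `cancel_date`.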
