# Introduction to working with data

todo: add note about hiding output

1. CSV files
2. Using a Python package
3. Reading data from a file
4. Manipulating data
5. Writing to CSV

## CSV

Comma-separated values (CSV) files represent tabular data as plain text.

CSV is a common format for exporting spreadsheets and databases, and for data made available through an API, e.g. by the Environment Agency.

The first line contains the column names, e.g.

`column 1 name,column 2 name,column 3 name
first row data 1,first row data 2,first row data 3
second row data 1,second row data 2,second row data 3`
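To see how that layout maps onto rows and columns, here is a minimal sketch that reads the same text with Python's built-in `csv` module. The text is held in a string rather than a file, and the variable names are only illustrative.

```python
import csv
import io

# The same layout as the example above, held in a string instead of a file
csv_text = (
    "column 1 name,column 2 name,column 3 name\n"
    "first row data 1,first row data 2,first row data 3\n"
    "second row data 1,second row data 2,second row data 3\n"
)

# DictReader uses the first line as the column names
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(len(rows))                 # number of data rows (the header is not counted)
print(rows[0]["column 2 name"])  # value in the first data row, second column
```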

## Packages

A package is a library of Python files.

No need to reinvent the wheel! You can use packages other people have developed to do common tasks by 'importing' them
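As a small illustration of importing, the sketch below uses `statistics`, a module that ships with Python, so nothing needs installing. Third-party packages such as pandas are imported in the same way once they are installed. The scores are made-up values.

```python
# Import a module someone else has written instead of writing the maths ourselves
import statistics

scores = [45, 56, 67, 78]  # made-up example values

# statistics.mean adds the values up and divides by how many there are
average = statistics.mean(scores)
print(average)
```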

pandas: 
   `"fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python"`   
https://pandas.pydata.org/pandas-docs/version/0.16.1/

## DataFrame

2D tabular data structure used by pandas  
Labeled axes  
Access data using column and row indexes   

|   | a | b | c |
|---|---|---|---|
| 0 | `['a'][0]` |   |   |
| 1 |   | `['b'][1]` |   |
| 2 |   |   | `['c'][2]` |
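The table above can be tried out with a small DataFrame. This is only a sketch: the numbers are made up, and `df` is simply the name used for the DataFrame here.

```python
import pandas

# A small made-up DataFrame with the same column and row labels as the table
df = pandas.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [7, 8, 9],
})

print(df['a'][0])  # column 'a', row 0
print(df['b'][1])  # column 'b', row 1
print(df['c'][2])  # column 'c', row 2

# .loc does the same lookup in one step: row label first, then column label
print(df.loc[2, 'c'])
```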


## First steps with pandas

We'll work through these instructions together and then you'll follow the rest of the tutorial individually.

`import pandas
df = pandas.read_csv('./data/filename.csv')
`

Then:

![readCSV](./images/readCSVLine.PNG)

In [11]:
# Import pandas and read CSV file here. Print to screen to check!


Let's explore the data. pandas has correctly recognised the column headings, which is a good start.

Check what data types have been assigned to a few of the columns: `print(type(df['column_name'][0]))`

![check data type](./images/printColumnType.PNG)
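Here is a minimal sketch of the same check on made-up data. The column names (`Name`, `Result`, `Date`) and values are assumptions for illustration, and the CSV text is read from a string rather than from `./data/filename.csv`.

```python
import io
import pandas

# Made-up data standing in for a CSV file on disk
csv_text = "Name,Result,Date\nFred,45,2019-01-15\nJoanne,89,2019-02-20\n"
df = pandas.read_csv(io.StringIO(csv_text))

# Type of the first value in each column
print(type(df['Name'][0]))
print(type(df['Result'][0]))
print(type(df['Date'][0]))

# dtypes lists the type assigned to every column at once
print(df.dtypes)
```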

In [None]:
# Check data types here

### Correcting data types

What data type is the date column?

We'll correct that by passing another argument, `parse_dates`, to the `read_csv` function.

`df = pandas.read_csv('./data/filename.csv', parse_dates=['column_name'])`

Find out more about timestamps: https://www.unixtimestamp.com/
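The effect of `parse_dates` can be seen by reading the same text twice. This is a sketch on made-up data; the column names `Date` and `Reading` are assumptions.

```python
import io
import pandas

# Made-up data with a date column
csv_text = "Date,Reading\n2019-01-15,1.2\n2019-02-20,3.4\n"

# Without parse_dates the dates are read as text
as_text = pandas.read_csv(io.StringIO(csv_text))

# With parse_dates the named column is converted to pandas Timestamps
as_dates = pandas.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

print(type(as_text['Date'][0]))
print(type(as_dates['Date'][0]))

# A Timestamp knows about its parts, such as the year
print(as_dates['Date'][0].year)
```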

In [None]:
# Type your code using the parse_dates parameter here

### Choosing an index  

pandas has used a zero-based index as default, but we want to use a column from the data. 

`df = pandas.read_csv('./data/filename.csv', parse_dates=['column_name'], index_col='column_name')`
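As a sketch of what a custom index gives you, the example below uses a made-up `Date` column as the index and then looks up a row by its date. The column names and values are assumptions.

```python
import io
import pandas

# Made-up data with a date column
csv_text = "Date,Reading\n2019-01-15,1.2\n2019-02-20,3.4\n"

df = pandas.read_csv(
    io.StringIO(csv_text),
    parse_dates=['Date'],
    index_col='Date',
)

# The dates are now the row labels rather than an ordinary column
print(df.index)
print(list(df.columns))

# Rows can be looked up by date instead of by position
print(df.loc[pandas.Timestamp('2019-02-20'), 'Reading'])
```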

In [None]:
# Type your code using a custom index column here 

## Exercises

Work through these instructions at your own pace. Ask for help at any time.

### Unique values

`pandas.unique` tells us all the unique values in a column, e.g. for a column called 'Names'  
`unique_names = pandas.unique(df['Names'])`
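A quick sketch with made-up names shows what comes back: each value appears once, in the order it was first seen.

```python
import pandas

# Made-up column with some repeated names
df = pandas.DataFrame({'Names': ['Fred', 'George', 'Fred', 'Lucy', 'George']})

unique_names = pandas.unique(df['Names'])
print(unique_names)
```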

1. Create a list of unique values in one of the columns.
2. How many unique values are there in this column? You can look back at the previous notebook if you need a reminder how to do this.

In [None]:
# Unique values code

### Basic statistics

We can calculate basic statistics for a column using `df['column_name'].describe()`

See what happens when you run this command with a numerical and a non-numerical column.

We can run specific queries too: 
`df['column_name'].min()
df['column_name'].max()
df['column_name'].mean()
df['column_name'].std()
df['column_name'].count()`
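The specific queries can be tried on a single column of numbers. The sketch below uses a handful of made-up exam results in a pandas Series, which is what `df['column_name']` returns.

```python
import pandas

# Made-up exam results; a single DataFrame column behaves like this Series
results = pandas.Series([45, 56, 67, 78, 89, 90, 89], name='Result')

lowest = results.min()
highest = results.max()
average = results.mean()
spread = results.std()      # sample standard deviation
how_many = results.count()  # number of non-missing values

print(lowest, highest, average, spread, how_many)
```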

In [None]:
# Basic statistics code

### Groupby

We can calculate summary statistics for a group of our choice using the `groupby` method.   

`grouped_data = df.groupby('column_name')`

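Summary statistics for all numeric columns, by the specified group:  
`grouped_data.describe()`

Mean for each numeric column, by the specified group (recent pandas versions need `numeric_only=True` when the data also has text columns):  
`grouped_data.mean(numeric_only=True)`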

For example, you might have some exam results in a CSV like this:    
`Name,Gender,Result
Fred,Male,45
George,Male,56
Harry,Male,67
Iain,Male,78
Joanne,Female,89
Kira,Female,90
Lucy,Female,89`

You could view summary statistics for exam results (the only numeric column) grouped by gender using this code:   
`grouped_by_gender = df.groupby('Gender')
print(grouped_by_gender.describe())`

And see this:    
`       Result                                                      
        count       mean        std   min    25%   50%    75%   max
Gender                                                             
Female    3.0  89.333333   0.577350  89.0  89.00  89.0  89.50  90.0
Male      4.0  61.500000  14.200939  45.0  53.25  61.5  69.75  78.0`
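If you only want one statistic for one column, you can select the column after grouping. The sketch below rebuilds the exam results from the example above in a string (rather than a file) and is one possible way to do it, not the only one.

```python
import io
import pandas

# The exam results from the example above, held in a string instead of a file
csv_text = """Name,Gender,Result
Fred,Male,45
George,Male,56
Harry,Male,67
Iain,Male,78
Joanne,Female,89
Kira,Female,90
Lucy,Female,89
"""
df = pandas.read_csv(io.StringIO(csv_text))

# Select the numeric column after grouping, then ask for a single statistic
mean_by_gender = df.groupby('Gender')['Result'].mean()
print(mean_by_gender)

# agg calculates several chosen statistics in one go
summary = df.groupby('Gender')['Result'].agg(['count', 'mean', 'max'])
print(summary)
```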

1. What is the mean page length for books published in Amsterdam?
2. What happens if you group by two columns and then find the mean? Use this syntax for the grouping: `grouped_data2 = df.groupby(['EBBO', 'Page Count'])`

### Queries

In [42]:
import pandas

df = pandas.read_csv('./data/testData.csv')

g_d = df.groupby('Gender')

print(g_d.describe())


       Result                                                      
        count       mean        std   min    25%   50%    75%   max
Gender                                                             
Female    3.0  89.333333   0.577350  89.0  89.00  89.0  89.50  90.0
Male      4.0  61.500000  14.200939  45.0  53.25  61.5  69.75  78.0
