# Introduction to working with data

todo: add note about hiding output

* CSV files
* Using a Python package
* Reading data from a file
* Manipulating data
* Writing to CSV

## CSV

Comma-separated values (CSV) files represent tabular data

Used to export data from spreadsheets and databases, and to represent data available from an API, e.g. the Environment Agency

The first line contains the column names, e.g.

`column 1 name,column 2 name,column 3 name
first row data 1,first row data 2,first row data 3
second row data 1,second row data 2,second row data 3`

## Packages

Library of Python files

No need to reinvent the wheel! You can use packages other people have developed to do common tasks by 'importing' them

pandas: 
  * Format and clean data
  * Calculate statistics
  * Visualise data when used with other libraries (not covering today)
  * Write data back to a CSV

Documentation: https://pandas.pydata.org/pandas-docs/version/0.16.1/

## DataFrame

2D tabular data structure used by pandas  
Labeled axes  
Access data using column and row indexes   

|   | a | b | c |
|---|---|---|---|
| 0 | `['a'][0]` |  |  |
| 1 |  | `['b'][1]` |  |
| 2 |  |  | `['c'][2]` |
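As a minimal sketch of the access pattern in the table, the example below builds a small DataFrame by hand. The values are made up for illustration; only the column names `a`, `b`, `c` and the default 0, 1, 2 row index come from the table.

```python
import pandas

# Made-up values; columns a, b, c with the default 0, 1, 2 row index
df = pandas.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [7, 8, 9],
})

# Select the column by name first, then the row by its index label
top_left = df['a'][0]
middle = df['b'][1]
bottom_right = df['c'][2]

print(top_left, middle, bottom_right)
```

Each lookup follows the diagonal shown in the table: column name first, row index second.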


## First steps with pandas

We'll work through these instructions together and then you'll work through some exercises individually.

`import pandas
df = pandas.read_csv('./data/filename.csv')
`

![readCSV](./images/readCSVLine.PNG)
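If you want to try `read_csv` before you have a file to hand, one option is to read CSV text held in a string. This is only a sketch: the data is made up, and `io.StringIO` is an assumption used here to stand in for a file on disk.

```python
import io
import pandas

# Made-up CSV text standing in for a real file (assumption for this sketch).
# The third row has no Result value, so pandas will show it as NaN.
csv_text = """Name,Gender,Result,Group
Fred,Male,45,B
George,Male,56,A
Joanne,Female,,A
"""

# io.StringIO lets pandas read the string as if it were a file
df = pandas.read_csv(io.StringIO(csv_text))

print(df.head())  # the first rows of the DataFrame (up to five)
df.info()         # prints column names, non-null counts and data types
```

With a real file you would pass the file path to `read_csv` instead of the `io.StringIO` object.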

In [13]:
import pandas
df = pandas.read_csv('./data/insert_filename_here.csv')

ParserError: Error tokenizing data. C error: Expected 8 fields in line 11, saw 12


In [None]:
# Import pandas and read CSV file here. Print to screen to check.


# Try using print(df.head()) to see what happens


# Now try running print(df.info())


NaN = Not a Number. It usually indicates missing data, which we'll look at in a future course. You can read more here: https://bit.ly/2RnApT9

Let's explore the data. We can see that pandas has correctly recognised the column headings by asking for a list of them. You can use this command during the activities if you want to remind yourself of the column names (they are case sensitive).

In [12]:
print(df.columns)

Index(['Name', 'Gender', 'Result', 'Group'], dtype='object')


Check what data types have been assigned to a few of the columns using `print(type(df['column_name'][0]))`

![check data type](./images/printColumnType.PNG)

In [None]:
# Check Date column
print(type(df['Date'][0]))

In [None]:
# Check data types here

### Choosing an index  

pandas has used a zero-based index as default, but we want to use a column from the data. 

`df = pandas.read_csv('./data/filename.csv', index_col='column_name')`

In [None]:
df = pandas.read_csv('./data/insert_filename_here.csv', parse_dates=['column_name'], index_col='column_name')
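Here is a rough sketch of what `index_col` and `parse_dates` do together. The column names `Date` and `Level` and the readings are hypothetical, and the CSV text is read from a string rather than a file so the example runs on its own.

```python
import io
import pandas

# Hypothetical readings; 'Date' and 'Level' are made-up column names
csv_text = """Date,Level
2019-01-01,0.52
2019-01-02,0.61
2019-01-03,0.58
"""

df = pandas.read_csv(
    io.StringIO(csv_text),
    parse_dates=['Date'],  # convert the text in this column to dates
    index_col='Date',      # use this column as the row labels
)

print(df.index)  # the dates are now the index, not 0, 1, 2
print(df.head())
```

The chosen column no longer appears in `df.columns` because it has become the index.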

In [None]:
# Type your code using a custom index column here 

## Exercises

Work through these instructions at your own pace. Ask for help at any time.

### 1. Unique values

`pandas.unique` tells us all the unique values in a column, e.g. for a column called 'Names'  
`unique_names = pandas.unique(df['Names'])`
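A small sketch of `pandas.unique` on a column with repeated values. The names are made up for illustration.

```python
import pandas

# Made-up column with some repeated names
df = pandas.DataFrame({'Names': ['Fred', 'Lucy', 'Fred', 'Kira', 'Lucy']})

unique_names = pandas.unique(df['Names'])

print(unique_names)       # each name once, in order of first appearance
print(len(unique_names))  # how many different names there are
```

Using `len()` on the result is one way to count the distinct values.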

In [None]:
# Code for questions 1a and 1b

### 2. Basic statistics

We can calculate basic statistics for a column using `df['column_name'].describe()`

See what happens when you run this command with a numerical column and with a non-numerical column.

We can run specific queries too:     
`df['column_name'].min()
df['column_name'].max()
df['column_name'].mean()
df['column_name'].std()
df['column_name'].count()`

Find out more here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
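As a sketch of the specific queries above, here they are on a small made-up DataFrame. The column names and values are assumptions for illustration, not the exercise data.

```python
import pandas

# Made-up results for illustration
df = pandas.DataFrame({
    'Name': ['Fred', 'George', 'Joanne', 'Kira'],
    'Result': [45, 56, 89, 90],
})

lowest = df['Result'].min()
highest = df['Result'].max()
average = df['Result'].mean()
how_many = df['Result'].count()

print(lowest, highest, average, how_many)

# describe() collects several of these statistics in one go
print(df['Result'].describe())
```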

In [None]:
# Code for question 2a

In [None]:
# Code for question 2b

### 3. Grouping and aggregating

We can calculate summary statistics for a group of our choice using the `groupby` method.   

`grouped_data = df.groupby('column_name')`

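Summary statistics for all numeric columns by specified group  
`grouped_data.describe()`

Mean for each numeric column by specified group (`numeric_only=True` tells pandas to skip text columns, which recent versions would otherwise report as an error)  
`grouped_data.mean(numeric_only=True)`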

For example, you might have some exam results in a CSV like this:    
`Name,Gender,Result,Group
Fred,Male,45,B
George,Male,56,A
Harry,Male,67,A
Iain,Male,78,B
Joanne,Female,89,A
Kira,Female,90,B
Lucy,Female,89,A`

You could view summary statistics for exam results (the only numeric column) grouped by gender using this code:  
`grouped_by_gender = df.groupby('Gender')
print(grouped_by_gender.describe())`

And see this (notice the grouping column is now the DataFrame's index):  
`       Result                                                      
        count       mean        std   min    25%   50%    75%   max
Gender                                                             
Female    3.0  89.333333   0.577350  89.0  89.00  89.0  89.50  90.0
Male      4.0  61.500000  14.200939  45.0  53.25  61.5  69.75  78.0`

You could also group by 2 columns. We need to pass a list of columns to the `groupby()` method. Can you remember how to do this?

`g_d = df.groupby(['Gender', 'Group'])
print(g_d.mean(numeric_only=True))`

`              Result
Gender Group        
Female A        89.0
       B        90.0
Male   A        61.5
       B        61.5`
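To run the exam results example end to end without a file, here is a sketch that reads the same rows from a string. Reading from `io.StringIO` and selecting the `Result` column before calling `mean()` are choices made for this sketch; selecting the column first is one way to keep text columns such as `Name` out of the calculation.

```python
import io
import pandas

# The exam results from the example above, held in a string
csv_text = """Name,Gender,Result,Group
Fred,Male,45,B
George,Male,56,A
Harry,Male,67,A
Iain,Male,78,B
Joanne,Female,89,A
Kira,Female,90,B
Lucy,Female,89,A
"""
df = pandas.read_csv(io.StringIO(csv_text))

# Group by one column, then take the mean of the numeric column
mean_by_gender = df.groupby('Gender')['Result'].mean()
print(mean_by_gender)

# Group by two columns by passing a list of column names
mean_by_both = df.groupby(['Gender', 'Group'])['Result'].mean()
print(mean_by_both)
```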

In [None]:
# Code for question 3a

In [None]:
# Code for question 3b

### 4. Conditional selections

We might want to view only data that meets certain conditions. 

For example, in a DataFrame containing information about films, we could request only films where the `Director` column matched our request. We do this using the boolean conditions we learned about earlier in the session. 

We first select the column we're concerned with: `df['Director']`  
Then come up with a boolean condition: `== "Ridley Scott"`  
`df['Director'] == "Ridley Scott"` evaluates to either `True` or `False` for each row, and printing this would tell us how each row evaluated. 

To only view films where the condition evaluates to `True` we pass this code to the DataFrame:  
`df[code with column and condition]` so in our example...  `df[df['Director'] == "Ridley Scott"]`

This looks intimidating because it is a statement nested within another and a different way of using square brackets. Try taking it step by step in the exercise.
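One way to take it step by step is to store the condition in a variable first. This is a sketch on a tiny hand-made DataFrame; the `Title` column and the variable names `is_ridley` and `ridley_films` are assumptions for illustration.

```python
import pandas

# Tiny hand-made film table for illustration
df = pandas.DataFrame({
    'Title': ['Alien', 'Jaws', 'Gladiator'],
    'Director': ['Ridley Scott', 'Steven Spielberg', 'Ridley Scott'],
})

# Step 1: the condition gives one True/False value per row
is_ridley = df['Director'] == "Ridley Scott"
print(is_ridley)

# Step 2: pass that result to the DataFrame to keep only the True rows
ridley_films = df[is_ridley]
print(ridley_films)
```

Writing `df[df['Director'] == "Ridley Scott"]` does both steps on a single line.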

In [None]:
# Code for questions 4a, b and c

### Writing to CSV

`df.to_csv('new_data.csv')`
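To preview what would be written without creating a file, `to_csv` can be called with no file path, in which case it returns the CSV text as a string. The sketch below uses made-up data and also shows `index=False`, which leaves the row index out of the output.

```python
import pandas

# Made-up data for illustration
df = pandas.DataFrame({'Name': ['Fred', 'Lucy'], 'Result': [45, 89]})

# With no file path, to_csv returns the CSV text instead of writing a file
with_index = df.to_csv()
print(with_index)

# index=False leaves out the row index column
without_index = df.to_csv(index=False)
print(without_index)
```

If you set a custom index column when reading the data, you will usually want to keep the default so that the index is written back out.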