# Tidy-data

__Purpose of this notebook__: Develop a system to approach *untidy-data* with a `tidy-data` mindset.
> Use `pandas` methods to shape/reshape data

Practice __tidy data__ fundamentals:
1. Data is tabular (rows and columns)
1. Each row is an observation
1. Each column is an attribute
1. Each cell has a single value



In [1]:
import numpy as np
import pandas as pd

## 1. Tidying Sales Data

__Universal starting point__:
1. Acquire the data
2. Look at the data format
> This will decide our workflow

In [3]:
# 1. Acquire the data
df_sales = pd.read_csv('./untidy-data/sales.csv')

In [70]:
# 2. Look at the data's format/layout
df_sales

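| | Product | 2016 Sales | 2016 PPU | 2017 Sales | 2017 PPU | 2018 Sales | 2018 PPU |
|---|---|---|---|---|---|---|---|
| 0 | A | 673 | 5 | 231 | 7 | 173 | 9 |
| 1 | B | 259 | 3 | 748 | 5 | 186 | 8 |
| 2 | C | 644 | 3 | 863 | 5 | 632 | 5 |
| 3 | D | 508 | 9 | 356 | 11 | 347 | 14 |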
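
Beyond displaying the frame, the layout can also be checked programmatically. The following is a minimal sketch, not part of the original workflow: it rebuilds the table shown above by hand so it runs without the CSV file, and `sample` is a hypothetical name used only for illustration.

```python
import pandas as pd

# Rebuild the table shown above by hand so the sketch runs without the CSV file.
# `sample` is a hypothetical name, not one used elsewhere in this notebook.
sample = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'D'],
    '2016 Sales': [673, 259, 644, 508],
    '2016 PPU': [5, 3, 3, 9],
    '2017 Sales': [231, 748, 863, 356],
    '2017 PPU': [7, 5, 5, 11],
    '2018 Sales': [173, 186, 632, 347],
    '2018 PPU': [9, 8, 5, 14],
})

print(sample.shape)             # (number of rows, number of columns)
print(sample.columns.tolist())  # the column headers
print(sample.dtypes)            # one dtype per column
```

Headers that carry values (here, years) rather than plain attribute names are usually the first hint that the frame is wide rather than tidy.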


### Tidy Data Approach

__3 Questions to ask__
1. Is this _tidy_ data?
1. _Why_ is the data untidy?
1. _Which_ pandas methods do you need to make it tidy?

__Question 1__: Is this _tidy_ data?
> Read each item as a question
- [x] Data is tabular (rows and columns)
- [ ] Each row is an observation
- [ ] Each column is an attribute
- [x] Each cell has a single value

Yes or no?
> Yes means all four features are met

> No means one or more of the four features is unmet
- [ ] Yes - clean data if needed and start exploring
- [x] No - continue to question 2

__Question 2__: _Why_ is the data untidy?
> This question also reveals WHAT needs to be done

These two features are unmet:
- [ ] Each row is an observation
- [ ] Each column is an attribute

1. column 1
> This column _is_ tidy

    - Attribute is a column
    - Rows are observations of that attribute


2. columns 2-7

    - Columns are a combination of three attributes: `Year`, `Sales` and `PPU`.
        - Each column header is a combination of a year and a sales measure (`Sales` or `PPU`).
        - These values represent different 
    - Rows are values of each corresponding attribute.
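
To make the diagnosis above concrete, the headers of columns 2-7 can be split on the space to show the two pieces of information each one carries. This is a small sketch for illustration only; `headers` and `parts` are hypothetical names, and the header strings are copied from the table above.

```python
import pandas as pd

# Column headers copied from the table above (columns 2-7).
headers = pd.Series(['2016 Sales', '2016 PPU', '2017 Sales',
                     '2017 PPU', '2018 Sales', '2018 PPU'])

# Split each header on the space: the first part is the year,
# the second part is the measure.
parts = headers.str.split(' ', expand=True)
parts.columns = ['Year', 'Measure']

print(parts['Year'].unique())     # the distinct years hidden in the headers
print(parts['Measure'].unique())  # the distinct measures hidden in the headers
```

Three years and two measures account for the six wide columns, which is why both a melt and a split are needed.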


__Question 3__: _Which_ pandas methods do you need to make it tidy?
- `pd.melt()`
- `DataFrame.column_name.str.split(expand=True)`
- `WIP`

Step 1: Use `pandas.melt()`
1. Columns 2-7
    - [ ] Melt into two columns: column names, values
    > Assign the melted dataframe to a variable
    - [ ] Split the variable column into two columns: year, sales measure


In [68]:
sales_data = df_sales.melt(id_vars='Product')

In [71]:
sales_data

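| | Product | variable | value |
|---|---|---|---|
| 0 | A | 2016 Sales | 673 |
| 1 | B | 2016 Sales | 259 |
| 2 | C | 2016 Sales | 644 |
| 3 | D | 2016 Sales | 508 |
| 4 | A | 2016 PPU | 5 |
| 5 | B | 2016 PPU | 3 |
| 6 | C | 2016 PPU | 3 |
| 7 | D | 2016 PPU | 9 |
| 8 | A | 2017 Sales | 231 |
| 9 | B | 2017 Sales | 748 |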
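
One possible way to carry out the remaining split from Step 1 is sketched below. It is a hedged illustration, not the notebook's finished method: the frame is rebuilt by hand so the sketch runs without the CSV file, the names `sample`, `melted`, `Measure` and `tidy` are hypothetical, and the final `pivot` is an assumption about how the unfinished (`WIP`) step might go.

```python
import pandas as pd

# Rebuild the original wide table by hand (hypothetical name `sample`).
sample = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'D'],
    '2016 Sales': [673, 259, 644, 508],
    '2016 PPU': [5, 3, 3, 9],
    '2017 Sales': [231, 748, 863, 356],
    '2017 PPU': [7, 5, 5, 11],
    '2018 Sales': [173, 186, 632, 347],
    '2018 PPU': [9, 8, 5, 14],
})

# Same melt as above: every non-id column becomes a (variable, value) pair.
melted = sample.melt(id_vars='Product')

# Split 'variable' into its two parts: year and measure.
melted[['Year', 'Measure']] = melted['variable'].str.split(' ', expand=True)
melted['Year'] = melted['Year'].astype(int)

# Assumed final step: spread the measures back out so that
# Sales and PPU each get their own column.
tidy = (
    melted.pivot(index=['Product', 'Year'], columns='Measure', values='value')
          .reset_index()
)
tidy.columns.name = None  # drop the leftover 'Measure' label on the columns

print(tidy.head())
```

With this layout each row is one product in one year, and `Year`, `Sales` and `PPU` are separate columns, matching the three attributes identified in Question 2.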
