> **Jupyter slideshow:** This notebook can be displayed as slides. To view it as a slideshow in your browser, run the following in a terminal:


> `> jupyter nbconvert [this_notebook.ipynb] --to slides --post serve`


> To toggle off the slideshow cell formatting, click the `CellToolbar` button, then `View --> Cell Toolbar --> None`

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Intro to Data Representation and Data Cleaning

_Authors: Dave Yerrington (SF)_

---

<img src="https://snag.gy/ywU34V.jpg" width="250">

### Learning Objectives
*After this lesson, you will be able to:*
- Inspect data types
- Clean up a column using `df.apply()`
- Know when to use `.value_counts()` in your code

### Lesson Guide

- [Data quality measures](#data_quality_measures)
- [Common data cleaning strategies](#common_strategies)
- [Pandas tools for cleaning data](#cleaning_tools)
- [Common operations on data by type](#common_operations)
- [Practice inspecting data types and applying functions](#guided_practice)
- [Independent practice: sales data](#independent_practice)


<a id='common_strategies'></a>

### Common data cleaning strategies

---

 - Remove missing values
 - Remove incorrect values
 - Update incorrect values
   - Removing invalid characters
   - Truncating part of a value
   - Adding extra numeric or string-based data
 - Impute missing or invalid data
   - Mean / median / mode of a column, sometimes within group subsets
   - Model-based imputation (K-Nearest Neighbors, MICE, etc.)
 - Backfill or forward fill
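
A few of the strategies above can be sketched in pandas as follows. This is a minimal illustration on made-up data: the `scores` frame and its column names are hypothetical and not part of the lesson's datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two groups with some missing scores.
scores = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "score": [1.0, np.nan, 4.0, np.nan, 8.0],
})

# Remove rows whose score is missing.
dropped = scores.dropna(subset=["score"])

# Impute with the overall column mean.
mean_filled = scores["score"].fillna(scores["score"].mean())

# Impute with the mean of each row's own group.
group_mean = scores.groupby("group")["score"].transform("mean")
group_filled = scores["score"].fillna(group_mean)

# Forward fill carries the last observed value forward;
# backfill pulls the next observed value backward.
forward_filled = scores["score"].ffill()
back_filled = scores["score"].bfill()

print(group_filled.tolist())
```

Which strategy is appropriate depends on why the values are missing; forward and back filling, for instance, usually only make sense for ordered data such as time series.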


<a id='data_quality_measures'></a>

### Measures of data quality

---

 - What is the relative value of the data column?
 - Is the data encoded properly?
 - Is the data consistently encoded? Does it represent the information appropriately?

<a id='cleaning_tools'></a>

### Tools for the data cleaning process using pandas

---

We're starting to get more comfortable with using pandas for manipulating and examining data. Let's add a couple more tools to our toolbox.

The main data types in pandas objects are:
- `float`
- `int`
- `bool`
- `datetime64`
- `timedelta`
- `category`
- `object`

It is always important to evaluate the data types of columns to ensure that the information is properly represented.

See [Pandas: dtypes](http://pandas.pydata.org/pandas-docs/stable/pandas.pdf) for a more detailed reference.
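
As a small sketch of why checking dtypes matters, the example below uses a hypothetical column of numbers that arrived as strings. The data is made up; `pd.to_numeric` with `errors="coerce"` is one possible way to repair such a column, not the only one.

```python
import pandas as pd

# Hypothetical column: numbers read in as strings, plus one bad entry.
raw = pd.Series(["10", "20", "n/a", "40"], dtype="object")
print(raw.dtype)

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.dtype)

# Math only behaves as expected once the column is numeric.
print(numeric.sum())
```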

We will be using two tools extensively in this lesson:

**The `.apply()` function**

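The built-in DataFrame method `.apply()` applies a function along an axis of the DataFrame, passing it each column (`axis=0`, the default) or each row (`axis=1`). With an element-wise function such as `np.sqrt`, the effect is that every cell is transformed. We will explore this method in detail below.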

**Series `.value_counts()` method**

Pandas Series objects have a `.value_counts()` method that returns a new Series containing the counts of the unique values in the data. The result is sorted in descending order by default, so the first element is the most frequently occurring value.

Note: by default, `.value_counts()` excludes null values from the counts. Pass `dropna=False` to include them.

See [pandas Series: value_counts](http://nullege.com/codes/search/pandas.Series.value_counts) for more detailed information.
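
The null-handling behaviour described above can be checked with a short sketch. The `colors` Series is hypothetical example data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries.
colors = pd.Series(["red", "blue", "red", np.nan, "red", np.nan])

# Default behaviour: nulls are left out of the counts.
counts = colors.value_counts()

# dropna=False keeps the nulls as their own entry.
counts_with_null = colors.value_counts(dropna=False)

print(counts)
print(counts_with_null)
```

Comparing the totals of the two results is a quick way to see how many nulls a column contains.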


<a id='common_operations'></a>

### Some common operations on data by type

---

- **float**: precision specific math operations
- **int**: operations with whole numbers
- **bool**: control flow conditions
- **datetime64**: resampling, slicing/selection, frequency back/front filling on a date range
- **timedelta**: date comparisons
- **category**: a more powerful set type; it can capture, for example, days of the week as categories with ordinal (ordering) information
- **object**: all types can be represented as an object, but math and date operations will not be possible. Control flow possibilities are limited unless you are comparing strings.
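
To make two of these concrete, here is a minimal sketch of an ordered `category` and a `timedelta`. The size labels and dates are made-up values chosen only for illustration.

```python
import pandas as pd

# category: an ordered categorical lets comparisons respect a custom order.
sizes = pd.Series(
    pd.Categorical(
        ["small", "large", "medium", "small"],
        categories=["small", "medium", "large"],
        ordered=True,
    )
)
at_least_medium = sizes >= "medium"

# datetime64 and timedelta: subtracting two dates yields a timedelta.
dates = pd.to_datetime(pd.Series(["2016-01-01", "2016-01-15"]))
gap = dates[1] - dates[0]

print(at_least_medium.tolist())
print(gap)
```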

<a id='guided_practice'></a>

### Guided practice: inspecting data types and applying functions

---

[This guided practice follows the questions in the notebook here.](./practice-inspecting-data-applying-functions.ipynb)


In [27]:
import pandas as pd
import numpy as np

**1. Create a small DataFrame with different data types.**

In [2]:
test_data = dict( 
    A = np.random.rand(3),
    B = 1,
    C = 'foo',
    D = pd.Timestamp('20010102'),
    E = pd.Series([1.0]*3).astype('float32'),
    F = False,
    G = pd.Series([1]*3,dtype='int8')
)

In [3]:
test_data

{'A': array([ 0.52262705,  0.19537626,  0.89888063]),
 'B': 1,
 'C': 'foo',
 'D': Timestamp('2001-01-02 00:00:00'),
 'E': 0    1.0
 1    1.0
 2    1.0
 dtype: float32,
 'F': False,
 'G': 0    1
 1    1
 2    1
 dtype: int8}

In [9]:
A = np.random.rand(3)
A

array([ 0.44688535,  0.22198847,  0.23445307])

In [14]:
dft = pd.DataFrame(test_data)
dft

| | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| 0 | 0.29699 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 1 | 0.856104 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 2 | 0.195158 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |


**2. Examine the data types of the columns.**

In [5]:
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

**3. Create a Series object with the integers 1-5 and float 6.0. What data type is the Series?**

In [6]:
pd.Series([1, 2, 3, 4, 5, 6.])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

If a pandas object contains data of multiple dtypes in a single column, the dtype of the column will be chosen to accommodate all of the data types (object is the most general).
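
One common case of this is a missing value inside an integer column. The sketch below uses made-up values; the nullable `Int64` dtype shown at the end is an optional pandas extension type and is mentioned here only as one possible workaround.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])
with_missing = pd.Series([1, 2, np.nan])

# NaN is a float, so the whole column is stored as float64.
print(ints.dtype, with_missing.dtype)

# The nullable integer dtype keeps whole numbers alongside a missing marker.
nullable = with_missing.astype("Int64")
print(nullable.dtype)
```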

**4. Create a Series with data: `[1, 2, 3, 6.3, 'foo']`. What data type is the Series?**

In [18]:
# Named `s` rather than `df`, since this is a Series and not a DataFrame
s = pd.Series([1, 2, 3, 6.3, 'foo'])
s

0      1
1      2
2      3
3    6.3
4    foo
dtype: object

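**5. Find how many columns of each type there are with the `.get_dtype_counts()` method.**

Note: `.get_dtype_counts()` was deprecated in pandas 0.25 and removed in pandas 1.0. On newer versions, `dft.dtypes.value_counts()` gives the same counts.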

In [20]:
dft.get_dtype_counts().astype(list)

bool              1
datetime64[ns]    1
float32           1
float64           1
int64             1
int8              1
object            1
dtype: object
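
Because `.get_dtype_counts()` is no longer available in pandas 1.0 and later, here is a hedged sketch of the equivalent count using `.dtypes.value_counts()`. The `frame` below is a hypothetical stand-in, not the lesson's `dft`.

```python
import pandas as pd

# Hypothetical frame with two float columns and one object column.
frame = pd.DataFrame({
    "x": [1.5, 2.5],
    "y": [0.1, 0.2],
    "label": pd.Series(["a", "b"], dtype="object"),
})

# .dtypes is a Series of dtypes, so value_counts() tallies columns per dtype.
dtype_counts = frame.dtypes.value_counts()
print(dtype_counts)
```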

**With a partner, take 3 minutes to discuss:**

*Without* running this code in a Python interpreter, what common `dtype` would you expect pandas to select for the following data?

    [1, 3, 9, .33, False, '03-20-1978', np.arange(22)]



In [35]:
a=pd.DataFrame([1, 3, 9, .33, False, '03-20-1978', np.arange(22)])
a.get_dtype_counts().astype(list)

object    1
dtype: object

You can do a lot more with dtypes.  Check out 
[Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/pandas.pdf).

**Applying functions to data with `df.apply()`**

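In general, `df.apply()` passes each column (or each row, with `axis=1`) of the DataFrame to a single function. If that function works element-wise, as NumPy functions like `np.sqrt` do, the result is the function applied to every cell.

Note: There is another common built-in method, `Series.map()`, which applies a function to each element of a single Series (column). For example: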

```python
df['a'].map(my_func)
```
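
As a slightly fuller sketch of cleaning a single column, the example below strips currency formatting from a made-up `price` column. The `prices` frame and the `clean_price` helper are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical messy price column stored as strings.
prices = pd.DataFrame({"price": ["$1,200.50", "$35", " $7.25 "]})


def clean_price(value):
    """Strip whitespace, '$' and ',' from a string, then convert to float."""
    return float(value.strip().replace("$", "").replace(",", ""))


# On a Series, .apply() and .map() both call the function on each element.
prices["price_clean"] = prices["price"].apply(clean_price)
same_result = prices["price"].map(clean_price)

print(prices)
```

Writing the cleaning step as a named function keeps it easy to test on a few sample values before applying it to a whole column.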

**6. Create another small DataFrame.**

In [6]:
# Create some more test data
df = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
df

| | a | b | c | d |
|---|---|---|---|---|
| 0 | -0.410785 | 0.19024 | 1.745112 | -0.322609 |
| 1 | 0.281392 | -0.22949 | 1.962935 | 0.946956 |
| 2 | 0.809197 | 0.318424 | 1.247255 | 1.052985 |
| 3 | 0.858677 | 0.244996 | 0.090704 | 1.250873 |
| 4 | -0.737841 | 0.966909 | 0.580455 | -1.126967 |


**7. Use the `.apply()` function to find the square root of all the cells.**

In [7]:
# square root ALL CELLS (NaN == Not a Number)
df.apply(np.sqrt)

| | a | b | c | d |
|---|---|---|---|---|
| 0 | NaN | 0.436165 | 1.321027 | NaN |
| 1 | 0.530464 | NaN | 1.401048 | 0.973116 |
| 2 | 0.899554 | 0.564291 | 1.116806 | 1.026151 |
| 3 | 0.926648 | 0.494971 | 0.301172 | 1.118425 |
| 4 | NaN | 0.983315 | 0.761876 | NaN |


**8. Use `.apply()` to find the mean of the columns.**

In [8]:
df.apply(np.mean, axis=0)

a    0.160128
b    0.298216
c    1.125292
d    0.360248
dtype: float64

**9. Find the mean of the rows.**

In [9]:
df.apply(np.mean, axis=1)

0    0.300489
1    0.740448
2    0.856965
3    0.611313
4   -0.079361
dtype: float64
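
The same `axis` argument works with your own functions, not just NumPy ones. A minimal sketch, using a small hypothetical `frame` with fixed values so the results are easy to verify by hand:

```python
import pandas as pd

# Hypothetical fixed data so the results can be checked by hand.
frame = pd.DataFrame({"a": [1, 4], "b": [3, 10], "c": [2, 7]})

# axis=0: the lambda receives each column as a Series.
col_range = frame.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the lambda receives each row as a Series.
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range)
print(row_range)
```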

### Further Reading

For more advanced `.apply` usage, check out these links:

["Why Not"'s Gist Examples](https://gist.github.com/why-not/4582705)

[Chris Albon's Map + Apply Examples](http://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html)


**Counting occurrences of unique values with `.value_counts()`**

The `.value_counts()` method tells us how many times each unique value occurs in a column's data. It's helpful for identifying unexpected values and for getting a feel for the distribution of the data, especially when looking at group membership. Looking at the value counts per column can give us a quick overview of the values expressed in our data.


Some common use cases of `.value_counts()` include:
 - Finding strings inside of mostly numeric / continuous data
 - Finding non-numeric values
 - General distributions of categorical variables
 - Identifying the most common and least common values
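
The first two use cases can be sketched as follows. The `readings` data is hypothetical, and combining `pd.to_numeric` with `.value_counts()` is one possible approach rather than a prescribed recipe.

```python
import pandas as pd

# Hypothetical column: mostly numeric strings with a few placeholder entries.
readings = pd.Series(["3.2", "4.1", "missing", "3.2", "?", "missing", "5.0"])

# Flag entries that cannot be parsed as numbers...
unparseable = pd.to_numeric(readings, errors="coerce").isna()

# ...then count which non-numeric values show up, most frequent first.
bad_values = readings[unparseable].value_counts()

print(bad_values)
```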

**10. Use NumPy to create a random vector of 50 integers ranging from 0 to 6.**


In [10]:
data = np.random.randint(0, 7, size = 50)
data

array([5, 5, 5, 2, 4, 2, 5, 4, 5, 4, 1, 1, 1, 3, 6, 1, 4, 6, 1, 0, 2, 2, 2,
       5, 6, 5, 1, 2, 4, 4, 5, 0, 4, 2, 2, 1, 1, 1, 0, 1, 3, 6, 2, 6, 3, 6,
       6, 6, 3, 6])

**11. Convert the vector to a Series and count the occurrences of each number.**

In [11]:
s = pd.Series(data)
s.head()

0    5
1    5
2    5
3    2
4    4
dtype: int64

In [12]:
# List how many times each number occurs in our array
s.value_counts()

1    10
6     9
2     9
5     8
4     7
3     4
0     3
dtype: int64

<a id='independent_practice'></a>

### Independent practice: sales data

---

1. Load the `sales.csv` data set from the datasets directory.
2. Inspect the data types.
3. Imagine you've found out that all your values in column 1 are off by 1. Use `.apply()` or `.map()` to add 1 to column 1 of the dataset.
4. Use `.value_counts()` to count the values of one column of the dataset.


In [39]:
# Relative path; assumes the notebook sits next to the datasets directory
sales = pd.read_csv('./datasets/sales.csv')
sales

| | volume_sold | 2015_margin | 2015_q1_sales | 2016_q1_sales |
|---|---|---|---|---|
| 0 | 18.420760 | 93.802281 | 337166.53 | 337804.05 |
| 1 | 4.776510 | 21.082425 | 22351.86 | 21736.63 |
| 2 | 16.602401 | 93.612494 | 277764.46 | 306942.27 |
| 3 | 4.296111 | 16.824704 | 16805.11 | 9307.75 |
| 4 | 8.156023 | 35.011457 | 54411.42 | 58939.90 |
| 5 | 5.005122 | 31.877437 | 255939.81 | 332979.03 |
| 6 | 14.606750 | 76.518973 | 319020.69 | 302592.88 |
| 7 | 4.456466 | 19.337345 | 45340.33 | 55315.23 |
| 8 | 5.047530 | 26.142470 | 57849.23 | 42398.57 |
| 9 | 5.388070 | 22.427024 | 51031.04 | 56241.57 |


In [40]:
sales.get_dtype_counts()

float64    4
dtype: int64

In [42]:
# Square root of all the columns
sales.apply(np.sqrt)

| | volume_sold | 2015_margin | 2015_q1_sales | 2016_q1_sales |
|---|---|---|---|---|
| 0 | 4.291941 | 9.685158 | 580.660426 | 581.209128 |
| 1 | 2.185523 | 4.591560 | 149.505385 | 147.433477 |
| 2 | 4.074604 | 9.675355 | 527.033642 | 554.023709 |
| 3 | 2.072706 | 4.101793 | 129.634525 | 96.476681 |
| 4 | 2.855875 | 5.917048 | 233.262556 | 242.775411 |
| 5 | 2.237213 | 5.646011 | 505.904942 | 577.043352 |
| 6 | 3.821878 | 8.747512 | 564.819166 | 550.084430 |
| 7 | 2.111034 | 4.397425 | 212.932689 | 235.191900 |
| 8 | 2.246671 | 5.112971 | 240.518669 | 205.909130 |
| 9 | 2.321222 | 4.735718 | 225.900509 | 237.153052 |


In [45]:
sales.apply(np.mean,axis=0)

volume_sold          10.018684
2015_margin          46.858895
2015_q1_sales    154631.668200
2016_q1_sales    154699.178750
dtype: float64

In [47]:
sales['volume_sold'].value_counts()

16.059971    1
6.654407     1
7.437252     1
4.200122     1
6.657733     1
12.455723    1
45.556096    1
4.385455     1
10.804337    1
10.270185    1
6.163689     1
7.790503     1
7.560549     1
5.778097     1
8.300180     1
6.537069     1
6.433606     1
4.776510     1
10.186622    1
7.101347     1
11.975769    1
50.275893    1
3.210727     1
11.684510    1
9.292241     1
6.618174     1
5.906274     1
5.047530     1
4.376225     1
8.751920     1
            ..
47.503269    1
5.214882     1
18.420760    1
8.622686     1
10.930398    1
9.003562     1
10.252870    1
6.630904     1
4.941294     1
7.369585     1
7.785867     1
29.878030    1
10.331430    1
9.313785     1
11.997117    1
14.439435    1
12.509967    1
7.754457     1
12.350027    1
5.388070     1
8.555078     1
11.840780    1
5.723896     1
4.456466     1
12.395919    1
9.348477     1
5.781266     1
7.585997     1
5.095464     1
9.849660     1
Name: volume_sold, dtype: int64