<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Intro to Python: Pandas
Week 2 | Day 2

---

### LEARNING OBJECTIVES

- Read a CSV file using pandas
- View data: head, columns, values, describe
- Select data: a single column, slicing by row, by position
- Perform boolean indexing on DataFrames
- Inspect data types
- Clean up a column using `df.apply()`
- Know when to use `.value_counts()` in your code



<a name="Series and DataFrame data types"></a>
## Introduction: Series and DataFrame data types (10 mins)

- Series is a one-dimensional labeled array capable of holding any data type (integers, strings,
floating point numbers, Python objects, etc.). The axis labels are collectively referred to as
the index. The basic method to create a Series is to call:

```Python
s = pd.Series(data, index=index)
```

- Here, data can be many different things:
    - a Python dict
    - an ndarray
    - a scalar value (like 5)

- The passed index is a list of axis labels.
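
The three kinds of input listed above can be sketched as follows. This is a minimal illustration with made-up values; the variable names `s_dict`, `s_array`, and `s_scalar` are only for this example.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index labels.
s_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From an ndarray: the index must have the same length as the data.
s_array = pd.Series(np.array([10, 20, 30]), index=["x", "y", "z"])

# From a scalar: the value is repeated once per index label.
s_scalar = pd.Series(5, index=["a", "b", "c"])

print(s_dict)
print(s_array)
print(s_scalar)
```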



- DataFrame is a 2-dimensional labeled data structure with columns of potentially
different types. You can think of it like a spreadsheet or SQL table, or a dict
of Series objects. It is generally the most commonly used pandas object.

- Like Series, DataFrame accepts many different kinds of input:
    - Dict of 1D ndarrays, lists, dicts, or Series
    - 2-D numpy.ndarray
    - Structured or record ndarray
    - A Series
    - Another DataFrame

- Along with the data, you can optionally pass index (row labels) and columns
(column labels) arguments. If you pass an index and / or columns, you are
guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict
of Series plus a specific index will discard all data not matching up to the
passed index.

- If axis labels are not passed, they will be constructed from the input data based on common sense rules.
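
To make the point about passing an index concrete, here is a small sketch with made-up data. The names `d`, `df_all`, and `df_subset` are assumptions for this example only.

```python
import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

# No index passed: the row labels are the union of the Series indexes.
df_all = pd.DataFrame(d)

# Index passed: only rows "d", "b", and "a" are kept; row "c" is discarded.
df_subset = pd.DataFrame(d, index=["d", "b", "a"])

print(df_all)
print(df_subset)
```

Note that `"one"` has no value for label `"d"`, so that cell is filled with `NaN` in both frames.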

Here is more information on [series and dataframes](http://pandas.pydata.org/pandas-docs/stable/dsintro.html).

**Check:** What are some differences between Series and DataFrame?



<a name="pd.Series"></a>
## Demo / Guided Practice: pd.Series (25 mins)

Let's create a series and see what `pandas.Series` can do.


In [3]:
# create a series using a numpy random number generator
import pandas as pd
import numpy as np

s = pd.Series(np.random.randint(5, 25, 7), index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])  
s

a    10
b    23
c    16
d     6
e    23
f    10
g     6
dtype: int32


Now we have a series of 7 random numbers. Let's try out the same things we did with
a data frame back in W2 L1.1. First, let's look at the series head.

In [4]:
# head of series
s.head()

a    10
b    23
c    16
d     6
e    23
dtype: int32

<details>
    <summary>Solution</summary>
    <code>s.head()</code>
</details>

In [5]:
# tail of series
s.tail()

c    16
d     6
e    23
f    10
g     6
dtype: int32

In [6]:
# summary stats
s.describe()

count     7.000000
mean     13.428571
std       7.345228
min       6.000000
25%       8.000000
50%      10.000000
75%      19.500000
max      23.000000
dtype: float64

In [7]:
# select by label, from c through g (label slices include the end label)
s["c":"g"]

c    16
d     6
e    23
f    10
g     6
dtype: int32

In [8]:
# select just b
s["b"]

23

In [10]:
# slice for rows 1-3
# s[1:4]
s.iloc[1:4]

b    23
c    16
d     6
dtype: int32

**Check:** How would you select just 'd'?


<a name="Boolean indexing"></a>
## Demo / Guided Practice: Boolean indexing (25 mins)

Another common operation is using boolean vectors to filter the data. The operators
are `|` for or, `&` for and, and `~` for not. Each comparison must be wrapped in parentheses,
because these operators bind more tightly than comparison operators such as `<` and `>`.

Let's create another series and use pandas to do some Boolean indexing.

In [11]:
# create another series ranging from -3 to 3

s = pd.Series(range(-3, 4))
s

0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64

In [13]:
# find the values that are > 0. 
mask = s > 0
s[mask]

4    1
5    2
6    3
dtype: int64

In [14]:
# find the values that are < -1 or > 0.5
mask = (s < -1) | (s > 0.5)
s[mask]

0   -3
1   -2
4    1
5    2
6    3
dtype: int64

In [18]:
# find the values that are not < 0.
mask = ~(s < 0)
s[mask]

3    0
4    1
5    2
6    3
dtype: int64
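
The cells above use `|` and `~`; `&` follows the same pattern. A minimal sketch on the same range of values (the thresholds are chosen only for illustration):

```python
import pandas as pd

s = pd.Series(range(-3, 4))

# Values that are > -2 AND < 2. Each comparison gets its own parentheses
# because & binds more tightly than > and <.
mask = (s > -2) & (s < 2)
print(s[mask])
```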

Here is some further information on [boolean indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#slicing-ranges).

**Check:** How would you find all the numbers that are < 2?


In [19]:
# find the values that are < 2
mask = s < 2
s[mask]


0   -3
1   -2
2   -1
3    0
4    1
dtype: int64

**Check:** This looks familiar...didn't we already learn how to read in csv files?
Yes, but that was using Python without any libraries or packages. It took 5 lines of
Python, but using Pandas it only takes one line. Nice!
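
As a rough side-by-side, the sketch below reads the same CSV text with the standard `csv` module and with `pd.read_csv`. The CSV content is hypothetical and held in memory with `io.StringIO` so the example runs without a file on disk.

```python
import csv
import io

import pandas as pd

# Hypothetical CSV content, kept in memory so the example is self-contained.
csv_text = "Product,Quantity\nCPU,1\nSoftware,2\n"

# Plain Python: read the rows, then split the header from the data.
rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]
print(header, records)

# pandas: one call parses the header and infers the column types.
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)
```

With the `csv` module every value comes back as a string, while `read_csv` infers that `Quantity` holds integers.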

<a name="Viewing data: head/tail, describe"></a>
## Demo / Guided Practice: Viewing data: head/tail, describe (25 mins)


In [21]:
# read in csv file and create a pandas dataframe
import pandas as pd
data = pd.read_csv("sales.csv")
data

|  | Account | Name | Rep | Manager | Product | Quantity | Price | Status |
|---|---|---|---|---|---|---|---|---|
| 0 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | CPU | 1 | 30000 | presented |
| 1 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Software | 1 | 10000 | presented |
| 2 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Maintenance | 2 | 5000 | pending |
| 3 | 737550 | Fritsch, Russel and Anderson | Craig Booker | Debra Henley | CPU | 1 | 35000 | declined |
| 4 | 146832 | Kiehn-Spinka | Daniel Hilton | Debra Henley | CPU | 2 | 65000 | won |
| 5 | 218895 | Kulas Inc | Daniel Hilton | Debra Henley | CPU | 2 | 40000 | pending |
| 6 | 218895 | Kulas Inc | Daniel Hilton | Debra Henley | Software | 1 | 10000 | presented |
| 7 | 412290 | Jerde-Hilpert | John Smith | Debra Henley | Maintenance | 2 | 5000 | pending |
| 8 | 740150 | Barton LLC | John Smith | Debra Henley | CPU | 1 | 35000 | declined |
| 9 | 141962 | Herman LLC | Cedric Moss | Fred Anderson | CPU | 2 | 65000 | won |


In [22]:
# head of dataset
data.head()

|  | Account | Name | Rep | Manager | Product | Quantity | Price | Status |
|---|---|---|---|---|---|---|---|---|
| 0 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | CPU | 1 | 30000 | presented |
| 1 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Software | 1 | 10000 | presented |
| 2 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Maintenance | 2 | 5000 | pending |
| 3 | 737550 | Fritsch, Russel and Anderson | Craig Booker | Debra Henley | CPU | 1 | 35000 | declined |
| 4 | 146832 | Kiehn-Spinka | Daniel Hilton | Debra Henley | CPU | 2 | 65000 | won |


In [23]:
# tail of dataset
data.tail()

|  | Account | Name | Rep | Manager | Product | Quantity | Price | Status |
|---|---|---|---|---|---|---|---|---|
| 13 | 307599 | Kassulke, Ondricka and Metz | Wendy Yule | Fred Anderson | Maintenance | 3 | 7000 | won |
| 14 | 688981 | Keeling LLC | Wendy Yule | Fred Anderson | CPU | 5 | 100000 | won |
| 15 | 729833 | Koepp Ltd | Wendy Yule | Fred Anderson | CPU | 2 | 65000 | declined |
| 16 | 729833 | Koepp Ltd | Wendy Yule | Fred Anderson | Monitor | 2 | 5000 | presented |
| 17 | 123456 | cosmos | neil | lucy | universe | 1 | 1000000 | presented |


**Check:** What can looking at the head and tail of a dataset tell us?


In [24]:
# summary stats
data.describe()

|  | Account | Quantity |
|---|---|---|
| count | 18.0 | 18.0 |
| mean | 443432.111111 | 1.722222 |
| std | 263737.607563 | 1.017815 |
| min | 123456.0 | 1.0 |
| 25% | 218895.0 | 1.0 |
| 50% | 359944.5 | 1.5 |
| 75% | 714466.0 | 2.0 |
| max | 740150.0 | 5.0 |


This gives us: count, mean, std, min, 25%, 50%, 75%, and max. Awesome!
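
By default, `describe()` summarizes only the numeric columns of a mixed DataFrame, which is why only `Account` and `Quantity` appear above. The sketch below uses a small made-up frame to show the default next to `include="all"`; the column values are hypothetical.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column.
df = pd.DataFrame(
    {"Quantity": [1, 2, 2, 5], "Status": ["won", "pending", "won", "won"]}
)

# Default: numeric columns only.
numeric_summary = df.describe()

# include="all" adds non-numeric columns, with count/unique/top/freq rows.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```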

**Check:** What was the cautionary tale about relying too heavily on summary stats again?


<a name="Selection: a single column, slicing by row, by position"></a>
## Demo / Guided Practice: Selection: a single column, slicing by row, by position (25 mins)


In [26]:
# select a single column
# data["Product"]
data.Product

0             CPU
1        Software
2     Maintenance
3             CPU
4             CPU
5             CPU
6        Software
7     Maintenance
8             CPU
9             CPU
10            CPU
11    Maintenance
12       Software
13    Maintenance
14            CPU
15            CPU
16        Monitor
17       universe
Name: Product, dtype: object

**Check:** How would you select the 'Quantity' and 'Price' columns separately?


In [27]:
# select several columns at once by passing a list of column names
data[["Quantity", "Price"]]

|  | Quantity | Price |
|---|---|---|
| 0 | 1 | 30000 |
| 1 | 1 | 10000 |
| 2 | 2 | 5000 |
| 3 | 1 | 35000 |
| 4 | 2 | 65000 |
| 5 | 2 | 40000 |
| 6 | 1 | 10000 |
| 7 | 2 | 5000 |
| 8 | 1 | 35000 |
| 9 | 2 | 65000 |


In [28]:
# slice certain rows 
data[9:15]

|  | Account | Name | Rep | Manager | Product | Quantity | Price | Status |
|---|---|---|---|---|---|---|---|---|
| 9 | 141962 | Herman LLC | Cedric Moss | Fred Anderson | CPU | 2 | 65000 | won |
| 10 | 163416 | Purdy-Kunde | Cedric Moss | Fred Anderson | CPU | 1 | 30000 | presented |
| 11 | 239344 | Stokes LLC | Cedric Moss | Fred Anderson | Maintenance | 1 | 5000 | pending |
| 12 | 239344 | Stokes LLC | Cedric Moss | Fred Anderson | Software | 1 | 10000 | presented |
| 13 | 307599 | Kassulke, Ondricka and Metz | Wendy Yule | Fred Anderson | Maintenance | 3 | 7000 | won |
| 14 | 688981 | Keeling LLC | Wendy Yule | Fred Anderson | CPU | 5 | 100000 | won |


**Check:** How would you slice for rows 9 to 14?


In [31]:
# slice for rows 9-14


##### Now, let's try selecting by position.

In [31]:
# First, let's slice some rows.
data.iloc[9:15]


|  | Account | Name | Rep | Manager | Product | Quantity | Price | Status |
|---|---|---|---|---|---|---|---|---|
| 9 | 141962 | Herman LLC | Cedric Moss | Fred Anderson | CPU | 2 | 65000 | won |
| 10 | 163416 | Purdy-Kunde | Cedric Moss | Fred Anderson | CPU | 1 | 30000 | presented |
| 11 | 239344 | Stokes LLC | Cedric Moss | Fred Anderson | Maintenance | 1 | 5000 | pending |
| 12 | 239344 | Stokes LLC | Cedric Moss | Fred Anderson | Software | 1 | 10000 | presented |
| 13 | 307599 | Kassulke, Ondricka and Metz | Wendy Yule | Fred Anderson | Maintenance | 3 | 7000 | won |
| 14 | 688981 | Keeling LLC | Wendy Yule | Fred Anderson | CPU | 5 | 100000 | won |




In [32]:
# slice some columns
data.iloc[:, 1:7]


|  | Name | Rep | Manager | Product | Quantity | Price |
|---|---|---|---|---|---|---|
| 0 | Trantow-Barrows | Craig Booker | Debra Henley | CPU | 1 | 30000 |
| 1 | Trantow-Barrows | Craig Booker | Debra Henley | Software | 1 | 10000 |
| 2 | Trantow-Barrows | Craig Booker | Debra Henley | Maintenance | 2 | 5000 |
| 3 | Fritsch, Russel and Anderson | Craig Booker | Debra Henley | CPU | 1 | 35000 |
| 4 | Kiehn-Spinka | Daniel Hilton | Debra Henley | CPU | 2 | 65000 |
| 5 | Kulas Inc | Daniel Hilton | Debra Henley | CPU | 2 | 40000 |
| 6 | Kulas Inc | Daniel Hilton | Debra Henley | Software | 1 | 10000 |
| 7 | Jerde-Hilpert | John Smith | Debra Henley | Maintenance | 2 | 5000 |
| 8 | Barton LLC | John Smith | Debra Henley | CPU | 1 | 35000 |
| 9 | Herman LLC | Cedric Moss | Fred Anderson | CPU | 2 | 65000 |


**Check:** How would you slice for the 'Manager' and 'Product' columns?


In [33]:
# select the 'Manager' column
data["Manager"]


0      Debra Henley
1      Debra Henley
2      Debra Henley
3      Debra Henley
4      Debra Henley
5      Debra Henley
6      Debra Henley
7      Debra Henley
8      Debra Henley
9     Fred Anderson
10    Fred Anderson
11    Fred Anderson
12    Fred Anderson
13    Fred Anderson
14    Fred Anderson
15    Fred Anderson
16    Fred Anderson
17             lucy
Name: Manager, dtype: object

In [34]:
# select a single value by row label and column name
data.loc[0, "Price"]


'30000'
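
In the sales data the row labels happen to equal the row positions, so `.loc` and `.iloc` look interchangeable. The sketch below uses a hypothetical frame with labels 10, 20, 30 to show where they differ.

```python
import pandas as pd

# Hypothetical frame whose row labels are not 0, 1, 2, ...
df = pd.DataFrame(
    {"Product": ["CPU", "Software", "Monitor"], "Quantity": [1, 2, 3]},
    index=[10, 20, 30],
)

# .loc uses labels: the row labelled 20 and the column named "Quantity".
by_label = df.loc[20, "Quantity"]

# .iloc uses integer positions: second row, second column.
by_position = df.iloc[1, 1]

# Label slices include the end label; position slices exclude the end position.
label_slice = df.loc[10:20]
position_slice = df.iloc[0:1]

print(by_label, by_position)
print(label_slice)
print(position_slice)
```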

<a name="introduction"></a>
## Introduction: Data types, df.apply(), and .value_counts() (5 mins)

Since we're starting to get pretty comfortable with using pandas to do EDA, let's add a
couple more tools to our toolbox.

The main data types stored in pandas objects are float, int, bool, datetime64[ns], datetime64[ns, tz],
timedelta64[ns], category, and object.

`df.apply()` will apply a function along any axis of the DataFrame. We'll see it in action below.

`pandas.Series.value_counts` returns a Series containing counts of unique values. The resulting
Series is in descending order, so the first element is the most frequently occurring
value. By default, NA values are excluded.

- Examples of [dtypes](http://pandas.pydata.org/pandas-docs/stable/pandas.pdf).
- Examples of [value_counts](http://nullege.com/codes/search/pandas.Series.value_counts).
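
To illustrate the NA behaviour of `value_counts`, here is a minimal sketch on a made-up Series with one missing value. The `dropna` parameter is part of pandas; the data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
s = pd.Series(["won", "pending", "won", np.nan, "won"])

# Default: NA values are left out of the counts.
counts = s.value_counts()

# dropna=False keeps NA as its own entry.
counts_with_na = s.value_counts(dropna=False)

print(counts)
print(counts_with_na)
```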


<a name="Inspect data types "></a>
## Demo / Guided Practice: Inspect data types (20 mins)

Let's create a small DataFrame from a dictionary, with a different data type in each column.


In [35]:
import pandas as pd
import numpy as np
dft = pd.DataFrame(dict(A = np.random.rand(3),
                        B = 1,
                        C = 'foo',
                        D = pd.Timestamp('20010102'),
                        E = pd.Series([1.0]*3).astype('float32'),
                        F = False,
                        G = pd.Series([1]*3,dtype='int8')))
dft

|  | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| 0 | 0.09814 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 1 | 0.429981 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 2 | 0.803136 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |


There is a really easy way to see what kind of dtypes are in each column.


In [36]:
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

If a pandas object contains data of multiple dtypes in a single column, the dtype of the
column will be chosen to accommodate all of the data types (object is the most general).

In [37]:
# these ints are coerced to floats
pd.Series([1, 2, 3, 4, 5, 6.])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

In [47]:
# string data forces an ``object`` dtype
s = pd.Series([1, 2, 3, 6., 'foo'])
s

0      1
1      2
2      3
3      6
4    foo
dtype: object

The method `DataFrame.dtypes.value_counts()` will return the number of columns of each type in a DataFrame:

In [50]:
data.dtypes.value_counts()

object    6
int64     2
dtype: int64

You can do a lot more with dtypes; see the pandas documentation [here](http://pandas.pydata.org/pandas-docs/stable/pandas.pdf).

**Check:** Why do you think it might be important to know what kind of dtypes you're working with?
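
One reason dtypes matter is that numbers stored as strings do not behave like numbers. The sketch below uses a hypothetical `prices` Series and `pd.to_numeric` to show the difference; it is an illustration, not part of the sales dataset workflow above.

```python
import pandas as pd

# Hypothetical column where the numbers were read in as strings.
prices = pd.Series(["30000", "10000", "5000"])

# With strings, + concatenates instead of adding.
concatenated = prices.iloc[0] + prices.iloc[1]

# pd.to_numeric converts the strings to numbers so arithmetic works.
numeric_prices = pd.to_numeric(prices)
total = numeric_prices.sum()

print(concatenated)
print(numeric_prices.dtype, total)
```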

<a name=" df.apply()"></a>
## Demo / Guided Practice: df.apply() (20 mins)

Let's create a small data frame.


In [51]:
df = pd.DataFrame(np.random.randint(0 , 10, (5, 4)), columns=['a', 'b', 'c', 'd'])
df

|  | a | b | c | d |
|---|---|---|---|---|
| 0 | 0 | 0 | 3 | 4 |
| 1 | 9 | 7 | 4 | 7 |
| 2 | 7 | 3 | 0 | 6 |
| 3 | 7 | 6 | 6 | 7 |
| 4 | 6 | 4 | 2 | 1 |


Use `df.apply` to find the square root of all the values.


In [54]:
# df.apply(np.sum)
# df.apply(np.sum, axis=1)
df.apply(np.sqrt)

|  | a | b | c | d |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.732051 | 2.0 |
| 1 | 3.0 | 2.645751 | 2.0 | 2.645751 |
| 2 | 2.645751 | 1.732051 | 0.0 | 2.44949 |
| 3 | 2.645751 | 2.44949 | 2.44949 | 2.645751 |
| 4 | 2.44949 | 2.0 | 1.414214 | 1.0 |


Find the mean of all of the columns.


In [55]:
df.apply(np.mean)

a    5.8
b    4.0
c    3.0
d    5.0
dtype: float64

Find the mean of all of the rows.


In [61]:
df.apply(np.mean, axis=1)

0    1.75
1    6.75
2    4.00
3    6.50
4    3.25
dtype: float64

More examples of `df.apply`:

- [df.apply](https://gist.github.com/why-not/4582705)
- [df.apply](http://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html)
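
The examples above apply NumPy functions to a whole DataFrame. For cleaning up a single column, one possible approach is to call `apply` on that column (a Series) with your own function. The sketch below assumes a hypothetical `Status` column with messy text; `clean_status` is a helper name made up for this example.

```python
import pandas as pd

# Hypothetical data: values with stray whitespace and inconsistent case.
df = pd.DataFrame({"Status": [" Won", "pending ", "WON", "Declined"]})


def clean_status(value):
    """Strip surrounding whitespace and lower-case a single value."""
    return value.strip().lower()


# Series.apply calls the function once per value in the column.
df["Status"] = df["Status"].apply(clean_status)

print(df["Status"].tolist())
```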


**Check:** How would you find the std of the columns and of the rows?


<a name=".value_counts()"></a>
## Demo / Guided Practice: .value_counts() (20 mins)

Let's create a random array with 50 integers, ranging from 0 to 6 (the upper bound of 7 passed to `np.random.randint` is exclusive).


In [57]:
data = np.random.randint(0, 7, size = 50)
data

array([4, 0, 5, 2, 3, 5, 2, 1, 1, 4, 2, 0, 6, 6, 6, 1, 4, 3, 4, 0, 3, 2,
       3, 2, 0, 5, 3, 5, 3, 6, 4, 2, 1, 0, 0, 2, 2, 5, 2, 4, 0, 6, 0, 0,
       6, 3, 5, 4, 6, 1])

Convert the array into a series.


In [58]:
data = pd.Series(data)
data

0     4
1     0
2     5
3     2
4     3
5     5
6     2
7     1
8     1
9     4
10    2
11    0
12    6
13    6
14    6
15    1
16    4
17    3
18    4
19    0
20    3
21    2
22    3
23    2
24    0
25    5
26    3
27    5
28    3
29    6
30    4
31    2
32    1
33    0
34    0
35    2
36    2
37    5
38    2
39    4
40    0
41    6
42    0
43    0
44    6
45    3
46    5
47    4
48    6
49    1
dtype: int32

How many of each number are there in the series? Enter `value_counts()`:


In [59]:
data.value_counts()


2    9
0    9
6    7
4    7
3    7
5    6
1    5
dtype: int64
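
Two variations that are often handy, sketched on a small made-up Series: `normalize=True` returns proportions instead of counts, and `sort_index()` orders the result by value rather than by frequency.

```python
import pandas as pd

# Hypothetical small Series of integers.
values = pd.Series([2, 0, 2, 1, 2, 0])

# normalize=True returns proportions instead of raw counts.
proportions = values.value_counts(normalize=True)

# sort_index() orders the result by value rather than by frequency.
counts_by_value = values.value_counts().sort_index()

print(proportions)
print(counts_by_value)
```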

<a name="ind-practice"></a>
## Independent Practice: Data types, df.apply(), and .value_counts() (20 minutes)
- Use the [sales.csv data set](./assets/datasets/sales_info.csv) - we've seen this a few times in previous lessons!
- Inspect the data types
- You've found out that all the values in column 1 are off by 1. Add 1 to every value in column 1 of the dataset
- Use .value_counts to count the values of one column of the dataset

**Bonus**
- Add 3 to column 2
- Use .value_counts for each column of the dataset

