# Looking at DataFrame Data

1. Run the cell below to import the required libraries and create a DataFrame.

In [19]:
import pandas as pd
import numpy as np
import random

num_rows = 100
colors = ['Red', 'Blue', 'Green']

df = pd.DataFrame( {'color': [colors[random.randint(0,2)] for _ in range(num_rows)],
                    'integers': [random.randint(0,15) for _ in range(num_rows)],
                    'floats': [random.random() for _ in range(num_rows)]})
df

,color,integers,floats
0,Blue,5,0.533751
1,Blue,8,0.998759
2,Red,9,0.510857
3,Red,6,0.179754
4,Red,2,0.871387
...,...,...,...
95,Blue,8,0.769097
96,Red,3,0.613756
97,Green,14,0.855698
98,Blue,2,0.723230
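
Because the cell above draws random values, the numbers change on every run. A minimal sketch of one way to make the data reproducible follows, assuming a fixed seed is acceptable for the exercise. The helper name `make_frame` and the seed value are hypothetical and not part of the original notebook.

```python
import random

import pandas as pd


def make_frame(seed, num_rows=100):
    """Build the same kind of DataFrame as above, with a fixed random seed."""
    random.seed(seed)  # the same seed gives the same sequence of random values
    colors = ['Red', 'Blue', 'Green']
    return pd.DataFrame({
        # random.randint includes both end points, so 0, 1 and 2 are all possible
        'color': [colors[random.randint(0, 2)] for _ in range(num_rows)],
        'integers': [random.randint(0, 15) for _ in range(num_rows)],
        'floats': [random.random() for _ in range(num_rows)],
    })


first = make_frame(seed=0)
second = make_frame(seed=0)

print(first.shape)           # (100, 3)
print(first.equals(second))  # True: identical seed, identical data
```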


2. Use the DataFrame `head()` method to view the top five rows. Try giving it a number as an argument to control how many rows are displayed.

In [20]:
df.head()

,color,integers,floats
0,Blue,5,0.533751
1,Blue,8,0.998759
2,Red,9,0.510857
3,Red,6,0.179754
4,Red,2,0.871387
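
As a small illustration of the row-count argument, the sketch below uses a tiny stand-in frame (`demo` is a hypothetical name) so the result is easy to check by eye. It also shows `tail()`, the counterpart of `head()` that returns rows from the end.

```python
import pandas as pd

# A ten-row stand-in for the tutorial's DataFrame
demo = pd.DataFrame({'n': range(10)})

default_rows = demo.head()   # no argument: the first five rows
first_three = demo.head(3)   # the first three rows
last_two = demo.tail(2)      # the last two rows

print(len(default_rows))         # 5
print(first_three['n'].tolist())  # [0, 1, 2]
print(last_two['n'].tolist())     # [8, 9]
```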


3. View summary statistics using the DataFrame `describe()` method.

In [21]:
df.describe()

,integers,floats
count,100.0,100.0
mean,7.36,0.533756
std,4.51601,0.296054
min,0.0,0.023109
25%,3.0,0.309348
50%,8.0,0.494346
75%,11.0,0.81461
max,15.0,0.998759
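
The result of `describe()` is itself a DataFrame, so individual statistics can be read out of it by row and column label. The sketch below assumes a small hand-made frame (`demo`, hypothetical) whose statistics are easy to verify by hand.

```python
import pandas as pd

demo = pd.DataFrame({
    'integers': [1, 2, 3, 4],
    'floats': [0.5, 1.5, 2.5, 3.5],
})

stats = demo.describe()

# Rows are statistic names, columns are the numeric columns of demo
mean_int = stats.loc['mean', 'integers']
median_float = stats.loc['50%', 'floats']

print(mean_int)      # 2.5
print(median_float)  # 2.0
```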


4. The `describe()` method accepts some optional arguments, including `include` and `exclude`. By default, `describe()` only shows statistics for columns with numerical data, but if you add the argument `include=object`, it will display statistics for columns with string data. (The NumPy alias `np.object` meant the same thing but was removed in NumPy 1.24.) Try this.

In [22]:
df.describe(include=object)

,color
count,100
unique,3
top,Red
freq,42
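
A minimal sketch of `include` and `exclude` side by side, on a hypothetical four-row frame. The string column is created with an explicit `object` dtype here, which is an assumption made so the example behaves the same regardless of how a given pandas version infers string columns.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'color': pd.Series(['Red', 'Blue', 'Red', 'Green'], dtype=object),
    'integers': [1, 2, 3, 4],
})

# Only the object (string) column is summarised
text_stats = demo.describe(include=object)

# Same result reached from the other side: drop every numeric column
no_numbers = demo.describe(exclude=[np.number])

print(text_stats.loc['top', 'color'])   # Red
print(text_stats.loc['freq', 'color'])  # 2
print(list(no_numbers.columns))         # ['color']
```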


5. If you change the argument to `include='all'`, it will display statistics for all columns in the DataFrame, inserting `NaN` (not a number) where a statistic does not apply to a column's data type. Try viewing statistics for all columns using `describe()`.

In [23]:
df.describe(include='all')

,color,integers,floats
count,100,100.0,100.0
unique,3,,
top,Red,,
freq,42,,
mean,,7.36,0.533756
std,,4.51601,0.296054
min,,0.0,0.023109
25%,,3.0,0.309348
50%,,8.0,0.494346
75%,,11.0,0.81461
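
The `NaN` placeholders can be checked programmatically with `pd.isna()`. This is a sketch on a hypothetical small frame, not the tutorial's random data.

```python
import pandas as pd

demo = pd.DataFrame({
    'color': ['Red', 'Blue', 'Red', 'Green'],
    'integers': [1, 2, 3, 4],
})

summary = demo.describe(include='all')

# A mean makes no sense for strings, and a "top" value is not computed for numbers
mean_of_color = summary.loc['mean', 'color']
top_of_integers = summary.loc['top', 'integers']

print(pd.isna(mean_of_color))    # True
print(pd.isna(top_of_integers))  # True
print(summary.loc['mean', 'integers'])  # 2.5
```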


## Selecting Data

6. You can select a column using bracket syntax very similar to that used with dictionaries. Put the column name, as a string, in brackets after the DataFrame name. Try this with the column 'color'.

In [24]:
df['color']

0      Blue
1      Blue
2       Red
3       Red
4       Red
      ...  
95     Blue
96      Red
97    Green
98     Blue
99      Red
Name: color, Length: 100, dtype: object
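
The dictionary comparison can be taken a little further: selecting one column returns a pandas `Series`, and asking for a column that does not exist raises `KeyError`, just as a missing dictionary key would. The sketch below uses a hypothetical three-row frame.

```python
import pandas as pd

demo = pd.DataFrame({
    'color': ['Red', 'Blue', 'Red'],
    'integers': [1, 2, 3],
})

col = demo['color']
print(type(col).__name__)  # Series
print(col.name)            # color

# A Series has its own methods, such as counting how often each value occurs
counts = col.value_counts()
print(counts['Red'])       # 2

# A missing column behaves like a missing dictionary key
try:
    demo['missing']
    missing_raised = False
except KeyError:
    missing_raised = True
print(missing_raised)      # True
```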

7. Try selecting the columns 'color' and 'floats' by supplying them as a list of strings in the same bracket syntax.

In [25]:
df[['color', 'floats']]

,color,floats
0,Blue,0.533751
1,Blue,0.998759
2,Red,0.510857
3,Red,0.179754
4,Red,0.871387
...,...,...
95,Blue,0.769097
96,Red,0.613756
97,Green,0.855698
98,Blue,0.723230
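
Passing a list always returns a DataFrame, even when the list holds a single name, and the columns come back in the order given in the list. A short sketch on a hypothetical frame:

```python
import pandas as pd

demo = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green'],
    'integers': [1, 2, 3],
    'floats': [0.1, 0.2, 0.3],
})

# Column order follows the list, not the original DataFrame
subset = demo[['floats', 'color']]
print(list(subset.columns))           # ['floats', 'color']

# One name in a list gives a one-column DataFrame; a bare string gives a Series
one_col_frame = demo[['color']]
one_col_series = demo['color']
print(type(one_col_frame).__name__)   # DataFrame
print(type(one_col_series).__name__)  # Series
```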


8. The bracket syntax in DataFrames is overloaded to select rows as well. Selecting rows uses the same syntax we used to select slices of sequences: a start number, a colon, and an upper bound number, where the upper bound itself is not included. Try selecting three rows from the DataFrame using the slice `10:13`.

In [26]:
df[10:13]

,color,integers,floats
10,Red,2,0.473602
11,Blue,8,0.38073
12,Red,15,0.206527
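
Row slices in plain brackets follow the usual Python rules, including open-ended and negative bounds. The sketch below assumes a hypothetical five-row frame with the default integer index.

```python
import pandas as pd

demo = pd.DataFrame({'value': [10, 20, 30, 40, 50]})

middle = demo[1:3]     # rows at positions 1 and 2; position 3 is excluded
first_two = demo[:2]   # omitted start means "from the beginning"
last_two = demo[-2:]   # negative start counts from the end

print(middle['value'].tolist())     # [20, 30]
print(first_two['value'].tolist())  # [10, 20]
print(last_two['value'].tolist())   # [40, 50]
```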


9. Now let's try the `.loc[]` syntax. It also uses brackets, but here you specify both the rows and the columns to select. Select all of the rows by supplying a lone colon as the first argument, and the column 'color' by supplying it as the second argument (remember that arguments must be separated by a comma).

In [27]:
df.loc[:, 'color']

0      Blue
1      Blue
2       Red
3       Red
4       Red
      ...  
95     Blue
96      Red
97    Green
98     Blue
99      Red
Name: color, Length: 100, dtype: object
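
Selecting every row and a single column with `.loc[]` should give the same Series as the plain bracket selection. The comma between the row and column selectors is easy to drop by accident, so the sketch below writes it out on a hypothetical small frame and compares the two results.

```python
import pandas as pd

demo = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green'],
    'integers': [1, 2, 3],
})

# Rows first, then a comma, then columns
by_loc = demo.loc[:, 'color']
by_brackets = demo['color']
print(by_loc.equals(by_brackets))  # True

# Wrapping the column name in a list keeps the result a DataFrame
as_frame = demo.loc[:, ['color']]
print(as_frame.shape)              # (3, 1)
```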


10. Now specify a slice, `10:13`, for the first argument and a list of columns, `['color', 'integers']`, as a second, to select **four** rows (the upper bound in `loc[]` is included) and two columns.

In [28]:
df.loc[10:13, ['color', 'integers']]

,color,integers
10,Red,2
11,Blue,8
12,Red,15
13,Red,10
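
The inclusive upper bound of `loc[]` is easiest to see next to a plain bracket slice over the same numbers. A minimal sketch on a hypothetical five-row frame with the default index:

```python
import pandas as pd

demo = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'integers': [1, 2, 3, 4, 5],
})

# loc slices by label and includes the upper bound: labels 1, 2 and 3
by_label = demo.loc[1:3, ['color', 'integers']]

# Plain brackets slice by position and exclude the upper bound: positions 1 and 2
by_position = demo[1:3]

print(by_label.index.tolist())     # [1, 2, 3]
print(by_position.index.tolist())  # [1, 2]
```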


11. Now try the `iloc[]` syntax. This uses the position of rows and columns to determine the selection. In this DataFrame, the labels for the rows are the same as their positions, so we can use the same slice `10:13` as the first argument. For the second, use the list `[0, 2]` to select the first and third columns by position. Notice that with `iloc[]`, the upper bound of a slice is not inclusive, so you will get three rows and two columns.

In [29]:
df.iloc[10:13, [0, 2]]

,color,floats
10,Red,0.473602
11,Blue,0.38073
12,Red,0.206527
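
With `iloc[]`, a column slice such as `0:2` and a column list such as `[0, 2]` select different things, and row labels play no part at all. The sketch below uses a hypothetical frame whose row labels deliberately differ from the row positions to make that visible.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        'color': ['Red', 'Blue', 'Green', 'Red'],
        'integers': [1, 2, 3, 4],
        'floats': [0.1, 0.2, 0.3, 0.4],
    },
    index=[10, 20, 30, 40],  # labels that do not match positions 0..3
)

# Slice 0:2 means column positions 0 and 1 (upper bound excluded)
by_slice = demo.iloc[1:3, 0:2]

# List [0, 2] means exactly column positions 0 and 2
by_list = demo.iloc[1:3, [0, 2]]

print(list(by_slice.columns))   # ['color', 'integers']
print(list(by_list.columns))    # ['color', 'floats']
print(by_slice.index.tolist())  # [20, 30]: positions 1 and 2, whatever their labels
```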
