# Accessing Data within Pandas

## Introduction
In this lesson we're going to dig into various methods for accessing data from our Pandas Series and DataFrames.

## Objectives

You will be able to:
* Understand and explain some key Pandas methods
* Access DataFrame data by using the label
* Perform boolean indexing on both Series and DataFrames
* Use simple selectors for series
* Set new Series and DataFrame inputs

## Importing pandas and the data

First, let's make sure we import `pandas` as `pd`.

In [228]:
import pandas as pd

To show how to access data with Pandas, let's use the "wine" data set in the scikit-learn library (you might have heard about this library before - you'll use it extensively when we get to machine learning!). Don't worry about the code below, we're essentially just making sure you have access to the wine data set.

The data contained in the wine data set are the results of a chemical analysis of wines grown in Italy. It contains the quantities of 13 wine constituents. 

In [229]:
from sklearn.datasets import load_wine

data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)

Great! Our data set is now stored in the variable `df`. As you know, you can look at its elements by using `df` or `print(df)`.

In [230]:
print(df)

     alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0      14.23        1.71  2.43               15.6      127.0           2.80   
1      13.20        1.78  2.14               11.2      100.0           2.65   
2      13.16        2.36  2.67               18.6      101.0           2.80   
3      14.37        1.95  2.50               16.8      113.0           3.85   
4      13.24        2.59  2.87               21.0      118.0           2.80   
5      14.20        1.76  2.45               15.2      112.0           3.27   
6      14.39        1.87  2.45               14.6       96.0           2.50   
7      14.06        2.15  2.61               17.6      121.0           2.60   
8      14.83        1.64  2.17               14.0       97.0           2.80   
9      13.86        1.35  2.27               16.0       98.0           2.98   
10     14.10        2.16  2.30               18.0      105.0           2.95   
11     14.12        1.48  2.32               16.8   

Now, what if you want to see only a few rows of the data, based on certain constraints? You'll learn how to access data in this lesson!

## Methods and attributes to access data information

It won't be a surprise that our `df` object is a pandas DataFrame object. Let's verify this using the `type()` function:

In [231]:
type(df)

pandas.core.frame.DataFrame

There are some methods and attributes associated with pandas objects (both DataFrames *and* Series!) which make retrieving information from the data particularly easy. Some commonly used methods:
- `.head()`
- `.tail()`
- `.info()`

And attributes:
- `.index`
- `.columns`
- `.dtypes`
- `.shape`

### Some methods: `.head()`, `.tail()` and `.info()`

By using `.head()` and `.tail()`, you can select the first or last $n$ rows of your DataFrame, respectively. The default $n$ is 5, but you can change this value inside the parentheses. For example:

In [232]:
# First 5 rows of df
df.head()

|  | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |


In [233]:
# last 3 rows of df
df.tail(3)

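|  | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 175 | 13.27 | 4.28 | 2.26 | 20.0 | 120.0 | 1.59 | 0.69 | 0.43 | 1.35 | 10.2 | 0.59 | 1.56 | 835.0 |
| 176 | 13.17 | 2.59 | 2.37 | 20.0 | 120.0 | 1.65 | 0.68 | 0.53 | 1.46 | 9.3 | 0.6 | 1.62 | 840.0 |
| 177 | 14.13 | 4.1 | 2.74 | 24.5 | 96.0 | 2.05 | 0.76 | 0.56 | 1.35 | 9.2 | 0.61 | 1.6 | 560.0 |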
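
To make the role of $n$ concrete, here is a minimal sketch on a small, made-up DataFrame (the `toy` frame and its values are illustrative assumptions, not part of the wine data set):

```python
import pandas as pd

# Hypothetical toy data, used only to keep the output short
toy = pd.DataFrame({
    "alcohol": [14.2, 13.2, 13.1, 14.4, 13.3, 14.0],
    "hue": [1.04, 1.05, 1.03, 0.86, 1.04, 1.05],
})

first_two = toy.head(2)  # rows at positions 0 and 1
last_two = toy.tail(2)   # rows at positions 4 and 5

print(first_two)
print(last_two)
```

Both calls return a new DataFrame and keep the original row labels, which is why the rows from `.tail(2)` are still labeled 4 and 5.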


To get a concise summary of the DataFrame, you can use `.info()`:

In [234]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 13 columns):
alcohol                         178 non-null float64
malic_acid                      178 non-null float64
ash                             178 non-null float64
alcalinity_of_ash               178 non-null float64
magnesium                       178 non-null float64
total_phenols                   178 non-null float64
flavanoids                      178 non-null float64
nonflavanoid_phenols            178 non-null float64
proanthocyanins                 178 non-null float64
color_intensity                 178 non-null float64
hue                             178 non-null float64
od280/od315_of_diluted_wines    178 non-null float64
proline                         178 non-null float64
dtypes: float64(13)
memory usage: 18.2 KB


### Some attributes

Using `.index` you can access the index or row labels of the DataFrame.

In [235]:
df.index

RangeIndex(start=0, stop=178, step=1)

Using `.columns`, you can access the column labels of the DataFrame.

In [236]:
df.columns

Index(['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
       'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
       'proanthocyanins', 'color_intensity', 'hue',
       'od280/od315_of_diluted_wines', 'proline'],
      dtype='object')

Using `.dtypes` returns the data type of each column in the DataFrame (compare with `.info()`!).

In [237]:
df.dtypes

alcohol                         float64
malic_acid                      float64
ash                             float64
alcalinity_of_ash               float64
magnesium                       float64
total_phenols                   float64
flavanoids                      float64
nonflavanoid_phenols            float64
proanthocyanins                 float64
color_intensity                 float64
hue                             float64
od280/od315_of_diluted_wines    float64
proline                         float64
dtype: object

`.shape` returns a tuple representing the dimensionality of the DataFrame, in the form `(rows, columns)`.

In [238]:
df.shape

(178, 13)
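
As a quick recap, the sketch below reads the same four attributes from a small, made-up DataFrame. The `toy` frame, its string row labels, and its values are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical toy data with custom row labels
toy = pd.DataFrame(
    {"alcohol": [14.23, 13.20, 13.16], "shade": ["dark", "light", "dark"]},
    index=["a", "b", "c"],
)

n_rows, n_cols = toy.shape   # .shape is a plain tuple, so it can be unpacked
print(n_rows, n_cols)
print(list(toy.index))       # row labels
print(list(toy.columns))     # column labels
print(toy.dtypes)            # one dtype per column
```

Note that these are attributes, not methods, so they are written without parentheses.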

## Selecting dataframe information

In the previous section, we deliberately omitted two very important attributes:
- `.iloc`, a pandas DataFrame indexer used for integer-location based indexing / selection by position.
- `.loc`, which has two use cases:
    - Selecting by label / index
    - Selecting with a boolean / conditional lookup
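
In the wine data the row labels happen to equal the row positions, which hides the difference between the two indexers. The following sketch uses a made-up frame whose labels are deliberately out of order (an assumption for illustration) to show that `.iloc` looks at position while `.loc` looks at the label:

```python
import pandas as pd

# Hypothetical toy data: the row labels 2, 0, 1 do not match the positions 0, 1, 2
toy = pd.DataFrame({"alcohol": [14.23, 13.20, 13.16]}, index=[2, 0, 1])

by_position = toy.iloc[0, 0]        # first row, first column
by_label = toy.loc[0, "alcohol"]    # row labeled 0, column named "alcohol"

print(by_position)  # 14.23
print(by_label)     # 13.2
```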


### `.iloc`

You can use `.iloc` to select single rows. Because positions start at 0, you select the 4th row with `.iloc[3]`:

In [239]:
df.iloc[3]

alcohol                           14.37
malic_acid                         1.95
ash                                2.50
alcalinity_of_ash                 16.80
magnesium                        113.00
total_phenols                      3.85
flavanoids                         3.49
nonflavanoid_phenols               0.24
proanthocyanins                    2.18
color_intensity                    7.80
hue                                0.86
od280/od315_of_diluted_wines       3.45
proline                         1480.00
Name: 3, dtype: float64

You can use a colon to select several rows. Note that you'll use a structure `.iloc[a:b]` where the row at position `a` is included in the selection and the row at position `b` is excluded.

In [240]:
df.iloc[5:8]

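|  | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 14.2 | 1.76 | 2.45 | 15.2 | 112.0 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 | 1450.0 |
| 6 | 14.39 | 1.87 | 2.45 | 14.6 | 96.0 | 2.5 | 2.52 | 0.3 | 1.98 | 5.25 | 1.02 | 3.58 | 1290.0 |
| 7 | 14.06 | 2.15 | 2.61 | 17.6 | 121.0 | 2.6 | 2.51 | 0.31 | 1.25 | 5.05 | 1.06 | 3.58 | 1295.0 |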


Next, you can use `,` to perform *column* selections based on their position as well. The command below selects all rows of the columns at positions 3 through 6 (position 7 is excluded):

In [241]:
df.iloc[:,3:7]

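|  | alcalinity_of_ash | magnesium | total_phenols | flavanoids |
| --- | --- | --- | --- | --- |
| 0 | 15.6 | 127.0 | 2.80 | 3.06 |
| 1 | 11.2 | 100.0 | 2.65 | 2.76 |
| 2 | 18.6 | 101.0 | 2.80 | 3.24 |
| 3 | 16.8 | 113.0 | 3.85 | 3.49 |
| 4 | 21.0 | 118.0 | 2.80 | 2.69 |
| 5 | 15.2 | 112.0 | 3.27 | 3.39 |
| 6 | 14.6 | 96.0 | 2.50 | 2.52 |
| 7 | 17.6 | 121.0 | 2.60 | 2.51 |
| 8 | 14.0 | 97.0 | 2.80 | 2.98 |
| 9 | 16.0 | 98.0 | 2.98 | 3.15 |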


Last but not least, you can perform column and row selections at once:

In [242]:
df.iloc[5:10,3:9]

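|  | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | 15.2 | 112.0 | 3.27 | 3.39 | 0.34 | 1.97 |
| 6 | 14.6 | 96.0 | 2.5 | 2.52 | 0.3 | 1.98 |
| 7 | 17.6 | 121.0 | 2.6 | 2.51 | 0.31 | 1.25 |
| 8 | 14.0 | 97.0 | 2.8 | 2.98 | 0.29 | 1.98 |
| 9 | 16.0 | 98.0 | 2.98 | 3.15 | 0.22 | 1.85 |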
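
The `.iloc` patterns above can be tried on a smaller frame. This is a minimal sketch on made-up data (the `toy` frame is an assumption for illustration). Besides integers and slices, it also uses negative positions and lists of positions, which `.iloc` accepts as well:

```python
import pandas as pd

# Hypothetical toy data with 4 rows and 3 columns
toy = pd.DataFrame({
    "alcohol": [14.23, 13.20, 13.16, 14.37],
    "ash": [2.43, 2.14, 2.67, 2.50],
    "hue": [1.04, 1.05, 1.03, 0.86],
})

one_row = toy.iloc[1]               # a single integer returns a Series
row_slice = toy.iloc[1:3]           # a slice returns a DataFrame (rows 1 and 2)
last_row = toy.iloc[-1]             # negative positions count from the end
picked = toy.iloc[[0, 3], [0, 2]]   # lists pick specific rows and columns

print(type(one_row).__name__)       # Series
print(row_slice)
print(picked)
```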


### `.loc`

#### a) `.loc` label-based indexing

You can use `.loc` to select data based on row labels and column names. Examples:

In [243]:
df.loc[:,"magnesium"]

0      127.0
1      100.0
2      101.0
3      113.0
4      118.0
5      112.0
6       96.0
7      121.0
8       97.0
9       98.0
10     105.0
11      95.0
12      89.0
13      91.0
14     102.0
15     112.0
16     120.0
17     115.0
18     108.0
19     116.0
20     126.0
21     102.0
22     101.0
23      95.0
24      96.0
25     124.0
26      93.0
27      94.0
28     107.0
29      96.0
       ...  
148     92.0
149    113.0
150    123.0
151    112.0
152    116.0
153     98.0
154    103.0
155     93.0
156     89.0
157     97.0
158     98.0
159     89.0
160     88.0
161    107.0
162    106.0
163    106.0
164     90.0
165     88.0
166    111.0
167     88.0
168    105.0
169    112.0
170     96.0
171     86.0
172     91.0
173     95.0
174    102.0
175    120.0
176    120.0
177     96.0
Name: magnesium, Length: 178, dtype: float64

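An alternative here is simply calling `df["magnesium"]`! You can also pass a slice of row labels. Unlike `.iloc`, a `.loc` slice includes both endpoints, so `7:16` returns the rows labeled 7 through 16: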

In [244]:
df.loc[7:16,"magnesium"]

7     121.0
8      97.0
9      98.0
10    105.0
11     95.0
12     89.0
13     91.0
14    102.0
15    112.0
16    120.0
Name: magnesium, dtype: float64
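
Label slices and position slices treat their end point differently, which is easy to check on a small example. The sketch below uses made-up data (the `toy` frame is an assumption for illustration) and also passes a list of column names to `.loc`:

```python
import pandas as pd

# Hypothetical toy data with the default 0..4 row labels
toy = pd.DataFrame({
    "magnesium": [127.0, 100.0, 101.0, 113.0, 118.0],
    "hue": [1.04, 1.05, 1.03, 0.86, 1.04],
})

by_label = toy.loc[1:3, "magnesium"]              # labels 1, 2 and 3 (end included)
by_position = toy.iloc[1:3]["magnesium"]          # positions 1 and 2 (end excluded)
two_columns = toy.loc[1:3, ["magnesium", "hue"]]  # a list of names returns a DataFrame

print(len(by_label), len(by_position))  # 3 2
print(two_columns)
```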

#### b) boolean indexing using `.loc`

Sometimes you'd like to select certain rows in your data set based on the value for a certain variable. Imagine you'd like to create a new dataframe that only contains the wines with an alcohol percentage below 12. This can be done as follows:

In [245]:
df.loc[df["alcohol"]<12]

|  | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 74 | 11.96 | 1.09 | 2.3 | 21.0 | 101.0 | 3.38 | 2.14 | 0.13 | 1.65 | 3.21 | 0.99 | 3.13 | 886.0 |
| 75 | 11.66 | 1.88 | 1.92 | 16.0 | 97.0 | 1.61 | 1.57 | 0.34 | 1.15 | 3.8 | 1.23 | 2.14 | 428.0 |
| 77 | 11.84 | 2.89 | 2.23 | 18.0 | 112.0 | 1.72 | 1.32 | 0.43 | 0.95 | 2.65 | 0.96 | 2.52 | 500.0 |
| 84 | 11.84 | 0.89 | 2.58 | 18.0 | 94.0 | 2.2 | 2.21 | 0.22 | 2.35 | 3.05 | 0.79 | 3.08 | 520.0 |
| 87 | 11.65 | 1.67 | 2.62 | 26.0 | 88.0 | 1.92 | 1.61 | 0.4 | 1.34 | 2.6 | 1.36 | 3.21 | 562.0 |
| 88 | 11.64 | 2.06 | 2.46 | 21.6 | 84.0 | 1.95 | 1.69 | 0.48 | 1.35 | 2.8 | 1.0 | 2.75 | 680.0 |
| 94 | 11.62 | 1.99 | 2.28 | 18.0 | 98.0 | 3.02 | 2.26 | 0.17 | 1.35 | 3.25 | 1.16 | 2.96 | 345.0 |
| 96 | 11.81 | 2.12 | 2.74 | 21.5 | 134.0 | 1.6 | 0.99 | 0.14 | 1.56 | 2.5 | 0.95 | 2.26 | 625.0 |
| 103 | 11.82 | 1.72 | 1.88 | 19.5 | 86.0 | 2.5 | 1.64 | 0.37 | 1.42 | 2.06 | 0.94 | 2.44 | 415.0 |
| 109 | 11.61 | 1.35 | 2.7 | 20.0 | 94.0 | 2.74 | 2.92 | 0.29 | 2.49 | 2.65 | 0.96 | 3.26 | 680.0 |


You can verify that `df[df["alcohol"]<12]` returns the same result!

However, the `.loc` attribute is useful if you only want the color intensity for the wines with an alcohol percentage below 12. You can obtain the result as follows:

In [246]:
df.loc[df["alcohol"]<12, ["color_intensity"]]

|  | color_intensity |
| --- | --- |
| 74 | 3.21 |
| 75 | 3.8 |
| 77 | 2.65 |
| 84 | 3.05 |
| 87 | 2.6 |
| 88 | 2.8 |
| 94 | 3.25 |
| 96 | 2.5 |
| 103 | 2.06 |
| 109 | 2.65 |
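
Under the hood, the condition inside `.loc` is just a Series of `True`/`False` values, one per row. The following sketch, on made-up data (the `toy` frame and the thresholds are assumptions for illustration), shows that mask and how two conditions might be combined with `&` (and) or `|` (or). Each condition needs its own parentheses:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "alcohol": [11.96, 13.20, 11.66, 14.37],
    "color_intensity": [3.21, 4.38, 3.80, 7.80],
})

mask = toy["alcohol"] < 12   # boolean Series: True, False, True, False

# Rows where both conditions hold, keeping only one column
both = toy.loc[(toy["alcohol"] < 12) & (toy["color_intensity"] > 3.5), ["color_intensity"]]

# Rows where at least one condition holds
either = toy.loc[(toy["alcohol"] < 12) | (toy["color_intensity"] > 7)]

print(mask.tolist())
print(both)
print(either)
```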


## Selectors for series

Until now we've only really discussed pandas DataFrames. Most of these methods and selectors are also applicable to pandas Series. Selecting a single column of a DataFrame with square brackets returns a pandas Series:

In [247]:
# Let's save the color intensity column into an object `col_intensity`
col_intensity = df["color_intensity"]

In [248]:
type(col_intensity)

pandas.core.series.Series

Note how `col_intensity` is a pandas *Series*, not a DataFrame.

Many of the commands discussed before are readily applicable to Series:

In [249]:
col_intensity[0:3]

0    5.64
1    4.38
2    5.68
Name: color_intensity, dtype: float64

In [250]:
col_intensity[col_intensity > 8] # or col_intensity.loc[col_intensity>8]

18      8.700000
49      8.900000
144     8.210000
148     8.420000
149     9.400000
150     8.600000
151    10.800000
153    10.520000
156     9.010000
158    13.000000
159    11.750000
164     9.580000
166    10.680000
167    10.260000
168     8.660000
169     8.500000
171     9.899999
172     9.700000
175    10.200000
176     9.300000
177     9.200000
Name: color_intensity, dtype: float64
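
The same selections can be written with the explicit `.iloc` and `.loc` indexers, which makes it clear whether a position or a label is meant. This is a minimal sketch on a made-up Series (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy Series
s = pd.Series([5.64, 4.38, 5.68, 7.80], name="color_intensity")

first_three = s.iloc[0:3]   # positions 0, 1 and 2
high = s.loc[s > 5.5]       # boolean mask, keeps the original labels
by_label = s.loc[1]         # single label returns a scalar

print(first_three)
print(high)
print(by_label)             # 4.38
```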

## Changing and setting values in DataFrames and series

### Changing values

Imagine that, for some reason, you don't need to distinguish between color intensities above 10 and simply want to cap them: every color intensity bigger than 10 should be set to 10. You can select those values with `.loc` and assign them a new value, like this:

In [251]:
df.loc[df["color_intensity"]>10, "color_intensity"] = 10
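
To see the effect of such an assignment, the sketch below applies it to a copy of a small, made-up frame (the `toy` data is an assumption for illustration). It also shows `Series.clip(upper=...)`, an alternative that is not used in this lesson but produces the same capped values without modifying the original:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"color_intensity": [5.64, 10.8, 13.0, 9.4]})

# Work on a copy so the original values stay available for comparison
capped = toy.copy()
capped.loc[capped["color_intensity"] > 10, "color_intensity"] = 10

# Alternative (not from the lesson): clip returns a new, capped Series
clipped = toy["color_intensity"].clip(upper=10)

print(capped["color_intensity"].tolist())  # [5.64, 10.0, 10.0, 9.4]
print(clipped.tolist())                    # [5.64, 10.0, 10.0, 9.4]
```

Keep in mind that the `.loc` assignment changes the DataFrame in place, so the original values are gone unless you kept a copy.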

### Creating new columns

Now imagine that we want to create a new column named "shade" which has the value "light" when the color_intensity is 7 or below, and "dark" when the intensity is above 7. This can be done as follows:

In [255]:
df.loc[df["color_intensity"]>7, "shade"] = "dark"
df.loc[df["color_intensity"]<=7, "shade"] = "light"

Have another look at `df`. `shade` is added as a 14th column! 
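
A two-way label like this can also be built in a single step. The sketch below uses `numpy.where`, which is an alternative to the two `.loc` assignments rather than the approach taken in this lesson, on a made-up frame (the `toy` data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including a value exactly at the threshold
toy = pd.DataFrame({"color_intensity": [5.64, 7.0, 7.8, 10.0]})

# "dark" where the condition is True, "light" everywhere else
toy["shade"] = np.where(toy["color_intensity"] > 7, "dark", "light")

print(toy)
```

One difference to be aware of: with `np.where`, every row that fails the condition gets "light", including rows with missing values, whereas the two `.loc` assignments would leave such rows without a label.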

## Summary

We've introduced a range of techniques for accessing information in Pandas Series and DataFrames, selecting rows and columns, changing values, and creating new columns! Now, it's time for some practice! Let's start working on a lab where you will get a chance to combine some of these methods!