# Accessing and manipulating data

Let's learn how to access and manipulate data with pandas and NumPy.

## Create some data

In [101]:
import numpy as np
import pandas as pd

Let's create a pandas `DatetimeIndex` of 8 consecutive days.

In [102]:
dates = pd.date_range('1/1/2000', periods=8)

print(type(dates))

dates

<class 'pandas.core.indexes.datetimes.DatetimeIndex'>


DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08'],
              dtype='datetime64[ns]', freq='D')

Next, generate a `numpy.ndarray` of 8 rows and 4 columns, filled with random values drawn from a standard normal distribution.

In [103]:
data = np.random.randn(8, 4)

print(type(data))

data

<class 'numpy.ndarray'>


array([[ 0.22301533, -0.91229755, -0.37655895, -0.03156858],
       [-0.29215726,  0.61006436, -0.4831246 ,  0.66490071],
       [ 1.29454158,  1.00031224, -1.13127866, -1.10478697],
       [ 0.52695049, -1.18478426,  1.30233544,  0.4772429 ],
       [ 0.565327  ,  0.87279109,  0.42421472,  0.42892123],
       [-0.58958098,  0.78380302, -0.61400988,  1.55201693],
       [-1.40099747,  0.21238205,  0.70076325,  0.14134093],
       [-0.5960464 ,  1.17449924, -0.6688395 ,  0.46779946]])
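
Because `np.random.randn` draws fresh values on every run, the numbers shown here will differ from one run to the next. If reproducible output is wanted, one option is NumPy's seeded `Generator`. The sketch below is only an illustration: the seed `42` and the names `rng`, `seeded` and `again` are arbitrary choices, not part of the original notebook.

```python
import numpy as np

# The seed value 42 is an arbitrary choice for illustration.
rng = np.random.default_rng(42)

# Same shape as above: 8 rows, 4 columns, standard normal distribution.
seeded = rng.standard_normal((8, 4))

# Re-creating the generator with the same seed reproduces the same array.
again = np.random.default_rng(42).standard_normal((8, 4))
print(np.array_equal(seeded, again))  # True
```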

Then create a `DataFrame` from the data, indexed by the `DatetimeIndex` we've created, and name the columns `A` to `D`.

In [104]:
df = pd.DataFrame(data, index=dates, columns=['A', 'B', 'C', 'D'])

print(type(df))

df

<class 'pandas.core.frame.DataFrame'>


| | A | B | C | D |
|---|---|---|---|---|
| 2000-01-01 | 0.223015 | -0.912298 | -0.376559 | -0.031569 |
| 2000-01-02 | -0.292157 | 0.610064 | -0.483125 | 0.664901 |
| 2000-01-03 | 1.294542 | 1.000312 | -1.131279 | -1.104787 |
| 2000-01-04 | 0.52695 | -1.184784 | 1.302335 | 0.477243 |
| 2000-01-05 | 0.565327 | 0.872791 | 0.424215 | 0.428921 |
| 2000-01-06 | -0.589581 | 0.783803 | -0.61401 | 1.552017 |
| 2000-01-07 | -1.400997 | 0.212382 | 0.700763 | 0.141341 |
| 2000-01-08 | -0.596046 | 1.174499 | -0.668839 | 0.467799 |


## Accessing data

### Selecting all data in a column

We can access all the data in a column using `[]`. We get a `Series`.

In [105]:
s = df['A']

print(type(s))

s

<class 'pandas.core.series.Series'>


2000-01-01    0.223015
2000-01-02   -0.292157
2000-01-03    1.294542
2000-01-04    0.526950
2000-01-05    0.565327
2000-01-06   -0.589581
2000-01-07   -1.400997
2000-01-08   -0.596046
Freq: D, Name: A, dtype: float64

### Selecting all data in multiple columns

It's also possible to access all the data in multiple columns by passing a list of column names. In that case we get a `DataFrame`.

In [106]:
print(type(['A', 'B']))


s = df[['A', 'B']]

print(type(s))

s

<class 'list'>
<class 'pandas.core.frame.DataFrame'>


| | A | B |
|---|---|---|
| 2000-01-01 | 0.223015 | -0.912298 |
| 2000-01-02 | -0.292157 | 0.610064 |
| 2000-01-03 | 1.294542 | 1.000312 |
| 2000-01-04 | 0.52695 | -1.184784 |
| 2000-01-05 | 0.565327 | 0.872791 |
| 2000-01-06 | -0.589581 | 0.783803 |
| 2000-01-07 | -1.400997 | 0.212382 |
| 2000-01-08 | -0.596046 | 1.174499 |


### Selecting a value in a column

We can select a single value by using `[]` twice: first to select the column (`['A']`), then to select the row by its index label (`dates[2]`).

In [107]:
v = df['A'][dates[2]]

print(type(v))

v

<class 'numpy.float64'>


1.29454157529448
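
Chaining two `[]` lookups works for reading, but pandas also offers `loc` and `at`, which take the row label and the column label in a single call. The sketch below compares the three forms on a stand-in frame. The values produced by `np.arange` are hypothetical and were chosen only so that the result is predictable.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=8)
# Stand-in values 0.0 .. 31.0 instead of random data, for a predictable result.
df = pd.DataFrame(np.arange(32.0).reshape(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

chained = df['A'][dates[2]]     # column first, then row label
by_loc = df.loc[dates[2], 'A']  # row label and column label in one call
by_at = df.at[dates[2], 'A']    # scalar-only accessor

print(chained, by_loc, by_at)   # 8.0 8.0 8.0
```

The single-call forms are usually the safer choice when assigning a value, since chained indexing may end up writing to a temporary copy.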

### loc & iloc

`loc` selects rows by index label, while `iloc` selects them by integer position.

In [108]:
x = pd.DataFrame({'x': ['X1', 'X2', 'X3', 'X4'], 'y': ['Y1', 'Y2', 'Y3', 'Y4']}, index=['A', 'B', 'C', 'D'])

x

| | x | y |
|---|---|---|
| A | X1 | Y1 |
| B | X2 | Y2 |
| C | X3 | Y3 |
| D | X4 | Y4 |


In [123]:
x.iloc[1::-1]

| | x | y |
|---|---|---|
| B | X2 | Y2 |
| A | X1 | Y1 |


In [110]:
x.loc['B']

x    X2
y    Y2
Name: B, dtype: object
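
The two accessors also treat slice endpoints differently, which is easy to miss. The following sketch rebuilds the same small frame and shows that a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position, as ordinary Python slicing does. The variable names `by_label`, `by_position` and `reversed_rows` are illustrative only.

```python
import pandas as pd

x = pd.DataFrame(
    {'x': ['X1', 'X2', 'X3', 'X4'], 'y': ['Y1', 'Y2', 'Y3', 'Y4']},
    index=['A', 'B', 'C', 'D'],
)

by_label = x.loc['A':'C']      # label slice: includes the end label 'C'
by_position = x.iloc[0:2]      # position slice: excludes the end position 2
reversed_rows = x.iloc[1::-1]  # start at position 1 and step backwards

print(list(by_label.index))       # ['A', 'B', 'C']
print(list(by_position.index))    # ['A', 'B']
print(list(reversed_rows.index))  # ['B', 'A']
```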

## Filtering data

In [111]:
s = pd.Series(np.random.randint(size=8, low=0, high=10))


### Filtering data using `[]`

In [112]:
s[s > 3]

2    8
4    6
5    6
6    7
7    8
dtype: int64

### Filtering data using `where`

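Using `where` returns a `Series` with the same shape as the original. Values that do not satisfy the condition are replaced with `NaN`, which is why the result below has dtype `float64`.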

In [113]:
s.where(s > 5)

0    NaN
1    NaN
2    8.0
3    NaN
4    6.0
5    6.0
6    7.0
7    8.0
dtype: float64
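
`where` also accepts a replacement value as its second argument, and `mask` is its inverse. The sketch below illustrates both on a small `Series`. The values are hypothetical, since the random draw above is not reproducible.

```python
import pandas as pd

s = pd.Series([1, 3, 8, 2, 6, 6, 7, 8])  # hypothetical values for illustration

replaced = s.where(s > 5, 0)  # keep values > 5, replace the rest with 0
masked = s.mask(s > 5)        # the inverse: NaN where the condition holds

print(replaced.tolist())  # [0, 0, 8, 0, 6, 6, 7, 8]
print(masked.tolist())    # [1.0, 3.0, nan, 2.0, nan, nan, nan, nan]
```

Supplying an integer replacement keeps the integer dtype here, whereas the default `NaN` replacement forces a conversion to `float64`.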

### Filtering a `DataFrame` on one column

In [114]:
x = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)), columns=list('ABCDE'))

print(x)

x[x['A'] > 2]

   A  B  C  D  E
0  2  6  4  5  9
1  7  5  2  8  3
2  3  6  0  8  2
3  1  9  3  1  8
4  9  1  8  1  1


| | A | B | C | D | E |
|---|---|---|---|---|---|
| 1 | 7 | 5 | 2 | 8 | 3 |
| 2 | 3 | 6 | 0 | 8 | 2 |
| 4 | 9 | 1 | 8 | 1 | 1 |
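
Several conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. The sketch below reuses the values printed above, and the names `both` and `either` are illustrative.

```python
import pandas as pd

# Same values as the frame printed above.
x = pd.DataFrame(
    [[2, 6, 4, 5, 9],
     [7, 5, 2, 8, 3],
     [3, 6, 0, 8, 2],
     [1, 9, 3, 1, 8],
     [9, 1, 8, 1, 1]],
    columns=list('ABCDE'),
)

both = x[(x['A'] > 2) & (x['D'] > 5)]    # rows where both conditions hold
either = x[(x['A'] > 8) | (x['B'] > 8)]  # rows where at least one holds

print(list(both.index))    # [1, 2]
print(list(either.index))  # [3, 4]
```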


## Manipulating data


## Creating a new column

A new column can be added to a `DataFrame` by assigning to a column name that does not exist yet.

### New column is the sum of two existing columns

In [115]:
df['E'] = df['A'] + df['B']

df

| | A | B | C | D | E |
|---|---|---|---|---|---|
| 2000-01-01 | 0.223015 | -0.912298 | -0.376559 | -0.031569 | -0.689282 |
| 2000-01-02 | -0.292157 | 0.610064 | -0.483125 | 0.664901 | 0.317907 |
| 2000-01-03 | 1.294542 | 1.000312 | -1.131279 | -1.104787 | 2.294854 |
| 2000-01-04 | 0.52695 | -1.184784 | 1.302335 | 0.477243 | -0.657834 |
| 2000-01-05 | 0.565327 | 0.872791 | 0.424215 | 0.428921 | 1.438118 |
| 2000-01-06 | -0.589581 | 0.783803 | -0.61401 | 1.552017 | 0.194222 |
| 2000-01-07 | -1.400997 | 0.212382 | 0.700763 | 0.141341 | -1.188615 |
| 2000-01-08 | -0.596046 | 1.174499 | -0.668839 | 0.467799 | 0.578453 |


### New column is an aggregation of values in same row

Here we need to pass `axis=1` so that the aggregation runs across the columns of each row rather than down each column.


In [116]:
df['min'] = df.min(axis=1)

df

| | A | B | C | D | E | min |
|---|---|---|---|---|---|---|
| 2000-01-01 | 0.223015 | -0.912298 | -0.376559 | -0.031569 | -0.689282 | -0.912298 |
| 2000-01-02 | -0.292157 | 0.610064 | -0.483125 | 0.664901 | 0.317907 | -0.483125 |
| 2000-01-03 | 1.294542 | 1.000312 | -1.131279 | -1.104787 | 2.294854 | -1.131279 |
| 2000-01-04 | 0.52695 | -1.184784 | 1.302335 | 0.477243 | -0.657834 | -1.184784 |
| 2000-01-05 | 0.565327 | 0.872791 | 0.424215 | 0.428921 | 1.438118 | 0.424215 |
| 2000-01-06 | -0.589581 | 0.783803 | -0.61401 | 1.552017 | 0.194222 | -0.61401 |
| 2000-01-07 | -1.400997 | 0.212382 | 0.700763 | 0.141341 | -1.188615 | -1.400997 |
| 2000-01-08 | -0.596046 | 1.174499 | -0.668839 | 0.467799 | 0.578453 | -0.668839 |
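
To make the role of `axis` concrete, the sketch below computes the minimum in both directions on a tiny frame. The frame `small` and its values are hypothetical, chosen only so the results can be checked by eye.

```python
import pandas as pd

small = pd.DataFrame({'A': [1, 5], 'B': [4, 2], 'C': [3, 6]})  # hypothetical values

per_column = small.min(axis=0)  # one value per column (the default)
per_row = small.min(axis=1)     # one value per row

print(per_column.tolist())  # [1, 2, 3]
print(per_row.tolist())     # [1, 2]
```

Note that the aggregation runs over every column present at the time of the call, so in the example above the previously added column `E` also takes part in the row minimum.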


## Inverting columns

Assigning two columns to each other in the opposite order swaps their values.

In [117]:
df[['A', 'B']] = df[['B', 'A']]
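
One caveat applies when swapping column values through `.loc` instead of plain `[]`: pandas aligns the right-hand side on column labels before assigning, so the values end up where they started. The sketch below, on a hypothetical two-column frame `t`, shows this behaviour as described in the pandas indexing guide, together with the usual workaround of passing a raw NumPy array.

```python
import pandas as pd

t = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})  # hypothetical values

# Through .loc the right-hand side is aligned on column labels first,
# so this assignment leaves the frame unchanged.
t.loc[:, ['B', 'A']] = t[['A', 'B']]
after_aligned = t['A'].tolist()
print(after_aligned)  # [1, 2]

# Passing a plain NumPy array skips alignment and swaps the values.
t.loc[:, ['B', 'A']] = t[['A', 'B']].to_numpy()
after_swap = t['A'].tolist()
print(after_swap)  # [10, 20]
```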