In [1]:
import pandas as pd
import numpy as np

# Base types: Series and DataFrames

In [24]:
s1 = pd.Series(['milk', 'cheese', 'eggs'])
s2 = pd.Series(np.random.randint(0, 10, (3)))
s1

0      milk
1    cheese
2      eggs
dtype: object

In [25]:
np.array(['milk', 'cheese', 'eggs']).reshape(3, 1)

array([['milk'],
       ['cheese'],
       ['eggs']], dtype='<U6')

Let's put the two Series together in a DataFrame:

In [27]:
df = pd.DataFrame({'fruit': s1, 'amount': s2})
df

    fruit  amount
0    milk       9
1  cheese       5
2    eggs       6


## Accessing columns

Pandas columns can be retrieved in several ways. If we want a single column:

In [33]:
df.fruit
df['fruit']
df.loc[:, 'fruit']

0      milk
1    cheese
2      eggs
Name: fruit, dtype: object

If we want multiple columns:

In [34]:
df[['fruit', 'amount']]
df.loc[:, ['fruit', 'amount']]

    fruit  amount
0    milk       9
1  cheese       5
2    eggs       6
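
A minimal sketch of the difference between the two access styles above: a single label returns a Series, while a list of labels returns a DataFrame, even when the list holds only one name. The frame here is a stand-in for `df` with fixed amounts (an assumption, since the original amounts are random).

```python
import pandas as pd

# Stand-in for the `df` built earlier, with fixed values instead of random ones.
df = pd.DataFrame({'fruit': ['milk', 'cheese', 'eggs'], 'amount': [9, 5, 6]})

single = df['fruit']     # one label -> Series
frame = df[['fruit']]    # list of labels -> DataFrame, even with one column

print(type(single).__name__)  # Series
print(type(frame).__name__)   # DataFrame
print(frame.shape)            # (3, 1)
```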


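Pandas' own methods and attributes take priority over column names in attribute access (`df.name`), so some names are effectively reserved, such as `count` and `size`. A column with such a name can still be reached with brackets, e.g. `df['count']`.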
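
The name clash can be seen in a small sketch. The frame `shop` and its `count` column are hypothetical, chosen only because `count` is also a DataFrame method.

```python
import pandas as pd

# Hypothetical frame whose second column shares its name with DataFrame.count.
shop = pd.DataFrame({'fruit': ['milk', 'cheese', 'eggs'], 'count': [9, 5, 6]})

# Attribute access resolves to the method, not the column.
print(callable(shop.count))    # True

# Bracket access always resolves to the column.
print(shop['count'].tolist())  # [9, 5, 6]
```

For this reason bracket access is the safer default when column names are not under your control.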

## Slicing

In [19]:
df.iloc[:2, :2]

    fruit   some_nr
0    milk  0.075154
1  cheese  0.844885
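
As a hedged illustration of the slice above: `iloc` slices by integer position and excludes the end, while `loc` slices by label and includes it. The frame below is a stand-in with fixed values (an assumption), using the default integer index.

```python
import pandas as pd

# Stand-in frame with the default RangeIndex 0, 1, 2.
df = pd.DataFrame({'fruit': ['milk', 'cheese', 'eggs'], 'amount': [9, 5, 6]})

by_position = df.iloc[:2, :2]             # positions 0 and 1; end excluded
by_label = df.loc[:1, 'fruit':'amount']   # labels 0 through 1; end included

print(by_position.equals(by_label))  # True
```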


<img src="https://storage.googleapis.com/lds-media/images/series-and-dataframe.original.png" style="width:600px" />

# Speed

In [20]:
large_array = np.random.rand(1000000)
large_series = pd.Series(large_array)

In [21]:
%%timeit 
large_array.mean()

587 µs ± 60.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [22]:
%%timeit 
large_series.mean()

1.06 ms ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


So the Series is somewhat slower than the plain NumPy array, but in the same order of magnitude (about 1.8x slower here: 1.06 ms vs. 587 µs).
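
Outside a notebook, where the `%%timeit` magic is not available, a similar comparison can be sketched with the standard-library `timeit` module. The repeat count `n` is an arbitrary choice, and the timings depend on the machine, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

large_array = np.random.rand(1_000_000)
large_series = pd.Series(large_array)

# Both calls compute the same statistic over the same data.
same_result = np.isclose(large_array.mean(), large_series.mean())

# timeit.timeit returns the total seconds for `number` calls,
# so dividing by n gives the average time per call.
n = 20
array_time = timeit.timeit(large_array.mean, number=n) / n
series_time = timeit.timeit(large_series.mean, number=n) / n

print(f"ndarray: {array_time * 1e6:.0f} µs, Series: {series_time * 1e6:.0f} µs")
```

Part of the gap is plausibly because `Series.mean` skips missing values by default (`skipna=True`), which the NumPy call does not do.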

# Multiplication

In [8]:
df.some_nr * 3

0    0.373950
1    0.882715
2    1.830517
Name: some_nr, dtype: float64

**We can even do that with strings!**

In [9]:
df.fruit * 3

0          milkmilkmilk
1    cheesecheesecheese
2          eggseggseggs
Name: fruit, dtype: object
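
A small sketch of what happens here, assuming an object-dtype Series like the one above: the `*` operator is applied to each element, so each string is repeated by Python's own `str * int`. The `.str.repeat` method is an explicit way to get the same result.

```python
import pandas as pd

# dtype=object matches the dtype shown in the output above.
fruit = pd.Series(['milk', 'cheese', 'eggs'], dtype=object)

tripled = fruit * 3              # elementwise str * int
repeated = fruit.str.repeat(3)   # explicit string-method equivalent

print(tripled.tolist())          # ['milkmilkmilk', 'cheesecheesecheese', 'eggseggseggs']
print(tripled.equals(repeated))  # True
```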

## Powers

In [23]:
df.some_nr ** 2

0    0.015538
1    0.086576
2    0.372310
Name: some_nr, dtype: float64
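
Like multiplication, the power operator works elementwise. A minimal sketch with fixed values (an assumption, since the original `some_nr` values are random), showing that `**` matches the `Series.pow` method and that NumPy functions can be applied to a Series directly:

```python
import numpy as np
import pandas as pd

# Fixed stand-in values so the results are exact and reproducible.
nr = pd.Series([0.5, 1.5, 2.0], name='some_nr')

squared = nr ** 2

print(squared.tolist())             # [0.25, 2.25, 4.0]
print(squared.equals(nr.pow(2)))    # True
print(np.sqrt(squared).equals(nr))  # True: the square root undoes the squaring
```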