# Pandas Introduction

In [1]:
import pandas as pd

## Creating series

A Series is the **one-dimensional** labeled array of Pandas.

In [2]:
students = ['Alice', 'Jack', 'Molly']

pd.Series(students)

0    Alice
1     Jack
2    Molly
dtype: object

In [3]:
numbers = [1, 2, 3]

pd.Series(numbers)

0    1
1    2
2    3
dtype: int64

In [4]:
students = ['Alice', 'Jack', None]
pd.Series(students)

0    Alice
1     Jack
2     None
dtype: object

In [5]:
numbers = [1, 2, None]
pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

In [6]:
import numpy as np
np.nan == None

False

In [7]:
np.nan == np.nan

False

In [8]:
np.isnan(np.nan)

True
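
Because `NaN` never compares equal to anything (including itself), equality checks cannot find missing values in a Series. A minimal sketch of the usual alternative, `isna()`, follows; the variable names `eq_mask` and `na_mask` are made up for illustration.

```python
import numpy as np
import pandas as pd

# None in a numeric list is stored as NaN, so the Series becomes float64.
numbers = pd.Series([1, 2, None])

# Comparing with == never matches NaN, so every entry of this mask is False.
eq_mask = numbers == np.nan

# isna() flags the missing entry.
na_mask = numbers.isna()

print(eq_mask.tolist())  # [False, False, False]
print(na_mask.tolist())  # [False, False, True]

# The top-level pd.isna() treats both None and NaN as missing.
print(pd.isna(None), pd.isna(np.nan))  # True True
```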

In [9]:
students_scores = { 'Alice' : 'Physics',
                    'Jack':'Chemistry',
                    'Molly':'English'}
s = pd.Series(students_scores)
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [10]:
s.index

Index(['Alice', 'Jack', 'Molly'], dtype='object')

In [11]:
students = [("Alice", "Brown"), ("Jack", "White"), ("Molly", "Green")]
pd.Series(students)

0    (Alice, Brown)
1     (Jack, White)
2    (Molly, Green)
dtype: object

In [12]:
s = pd.Series(['Physics', 'Chemistry', 'English'], index=['Alice', 'Jack', 'Molly'])
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [13]:
students_scores = { 'Alice' : 'Physics',
                    'Jack':'Chemistry',
                    'Molly':'English'}
s = pd.Series(students_scores, index=['Alice', 'Molly', 'Sam'])
s

Alice    Physics
Molly    English
Sam          NaN
dtype: object
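
The cell above suggests that when both a dictionary and an explicit `index` are passed, the index decides which entries survive. A small sketch that checks this, reusing the same data (the name `subset` is only for illustration):

```python
import pandas as pd

students_scores = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English'}

# Only the labels listed in `index` are kept, in the order given.
subset = pd.Series(students_scores, index=['Alice', 'Molly', 'Sam'])

print(subset.index.tolist())   # ['Alice', 'Molly', 'Sam']
print('Jack' in subset.index)  # False: Jack is not in the index, so he is dropped
print(subset.isna().tolist())  # [False, False, True]: Sam has no value in the dict
```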

## Querying Series

In [14]:
students_classes = { 'Alice' : 'Physics',
                    'Jack':'Chemistry',
                    'Molly':'English',
                    'Sam': 'History'}
s = pd.Series(students_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [15]:
s.iloc[3]

'History'

In [16]:
s.loc['Molly']

'English'

`iloc` and `loc` are attributes, not methods, so we index them with square brackets instead of calling them with parentheses.

In [17]:
s[3]

'History'

In [18]:
s['Molly']

'English'

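We often don't need to spell out `iloc` or `loc`: plain square brackets look the key up by label, and with a non-integer index older Pandas versions fall back to position for an integer key (as in `s[3]` above). That positional fallback is deprecated in recent Pandas releases, so `iloc` is the safer choice for positions.

The real problem comes with integer labels. With an integer index, `s[0]` is always treated as a label lookup, so Pandas raises a `KeyError` if no label `0` exists, even though position 0 does.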

In [19]:
class_code = {99:  'Physics',
              100: 'Chemistry', 
              101: 'English',
              102: 'History'}
s = pd.Series(class_code)
s[0]

KeyError: 0

In [20]:
s.iloc[0]

'Physics'
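
To make the distinction explicit, here is a short sketch on the same integer-keyed Series that uses `loc` for labels and `iloc` for positions. The variable names are illustrative only.

```python
import pandas as pd

class_code = {99: 'Physics',
              100: 'Chemistry',
              101: 'English',
              102: 'History'}
s = pd.Series(class_code)

by_label = s.loc[99]      # label lookup: the entry whose index label is 99
by_position = s.iloc[0]   # positional lookup: the first entry
last = s.iloc[-1]         # negative positions count from the end

# There is no label 0, so a label lookup fails.
try:
    s.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, last)  # Physics Physics History
print(label_zero_found)             # False
```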

### Using Pandas and NumPy for computations

In [21]:
grades = pd.Series([90, 80, 70, 60])

total = 0 
for grade in grades:
    total += grade
print(total/len(grades))

75.0


In [22]:
total = np.sum(grades)
print(total/len(grades))

75.0


In [23]:
numbers = pd.Series(np.random.randint(0, 1000, 10000))

numbers.head() # returns the first five items in our series

0    242
1    322
2    107
3    333
4    609
dtype: int64

In [24]:
len(numbers)

10000

In [25]:
%%timeit -n 100
total = 0 
for number in numbers:
    total += number 
    
total/len(numbers)

3.13 ms ± 93 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [26]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

112 µs ± 19.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


See the difference? In this run the `np.sum` version is roughly 28 times faster than the Python loop (112 µs versus 3.13 ms).
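
As a sanity check that the fast path computes the same thing, the sketch below compares the explicit loop with `np.sum` and with the built-in `Series.mean()`. The seeded generator (`np.random.default_rng(0)`) is an assumption made here for reproducibility; the cells above use `np.random.randint` without a seed.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is repeatable (seed chosen arbitrarily).
rng = np.random.default_rng(0)
numbers = pd.Series(rng.integers(0, 1000, 10000))

# Explicit Python loop, one element at a time.
loop_total = 0
for number in numbers:
    loop_total += number
loop_mean = loop_total / len(numbers)

# Vectorized alternatives: NumPy's sum, and the Series' own mean().
numpy_mean = np.sum(numbers) / len(numbers)
series_mean = numbers.mean()

# All three approaches agree on the result.
print(np.isclose(loop_mean, numpy_mean), np.isclose(loop_mean, series_mean))
```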

In [27]:
numbers.head()

0    242
1    322
2    107
3    333
4    609
dtype: int64

In [28]:
numbers += 2
numbers.head()

0    244
1    324
2    109
3    335
4    611
dtype: int64

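Instead of iterating through the series in Python, the addition is broadcast to every element in a single vectorized operation. For comparison, the next cell applies the same update with an explicit loop.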

In [29]:
for label, value in numbers.items():  # iteritems() was removed in pandas 2.0
    numbers.at[label] = value+2
numbers.head()

0    246
1    326
2    111
3    337
4    613
dtype: int64

In [30]:
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,1000))
for label, value in s.items():  # iteritems() was removed in pandas 2.0
    s.at[label] = value+2

11.7 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [31]:
%%timeit -n 10
s = pd.Series(np.random.randint(0, 1000,1000))
s += 2

347 µs ± 75.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
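
Broadcasting is not limited to `+=`. As a rough sketch, the same element-wise behaviour applies to other arithmetic and comparison operators; the names `curved`, `passed`, and `passing` are hypothetical and used only for this example.

```python
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

# Arithmetic with a scalar returns a new Series; `grades` itself is unchanged.
curved = grades + 5

# Comparisons broadcast too and produce a boolean mask.
passed = grades >= 70

# A boolean mask can be used to select the matching entries.
passing = grades[passed]

print(curved.tolist())   # [95, 85, 75, 65]
print(grades.tolist())   # [90, 80, 70, 60]
print(passing.tolist())  # [90, 80, 70]
```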


In [33]:
s = pd.Series([1, 2, 3])

s.loc['History'] = 102
s

0            1
1            2
2            3
History    102
dtype: int64

In [35]:
students_classes = pd.Series({ 'Alice' : 'Physics',
                    'Jack':'Chemistry',
                    'Molly':'English',
                    'Sam': 'History'})
students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [37]:
kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'], index=['Kelly', 'Kelly', 'Kelly'])
kelly_classes

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [39]:
all_students_classes = pd.concat([students_classes, kelly_classes])  # Series.append was removed in pandas 2.0
all_students_classes

Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [40]:
all_students_classes.loc['Kelly']

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object
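
Two details of the cells above are easy to miss: combining Series returns a new object and leaves the originals untouched, and `loc` returns a Series (not a single value) when a label is repeated. A minimal sketch with a shortened version of the same data, using `pd.concat`:

```python
import pandas as pd

students_classes = pd.Series({'Alice': 'Physics', 'Jack': 'Chemistry'})
kelly_classes = pd.Series(['Philosophy', 'Arts'], index=['Kelly', 'Kelly'])

# concat builds a new Series; neither input is modified.
all_classes = pd.concat([students_classes, kelly_classes])

print(len(students_classes), len(all_classes))  # 2 4

# A unique label gives back a single value...
print(all_classes.loc['Alice'])  # Physics

# ...while a repeated label gives back a Series with every match.
kelly = all_classes.loc['Kelly']
print(kelly.tolist())            # ['Philosophy', 'Arts']

# is_unique reports whether any label is repeated.
print(all_classes.index.is_unique)  # False
```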