

<img src="https://pandas.pydata.org/_static/pandas_logo.png"/>


# What is Pandas?
_pandas_ is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics. In this tutorial, we will learn the main features of Pandas and how to use them in practice.

## Pandas Key Features

- Fast and efficient __DataFrame__ object with default and customized indexing.
- Tools for loading data into __in-memory__ data objects from different file formats.
- Data alignment and integrated handling of missing data.
- __Reshaping__ and pivoting of data sets.
- Label-based slicing, __indexing__ and subsetting of large data sets.
- __Columns__ from a data structure can be deleted or inserted.
- Group by data for __aggregation__ and transformations.
- High performance __merging and joining__ of data.
- __Time Series__ functionality.


## Pandas Data Structures

The main data structures used by Pandas are:

| Data Structure | Dimensions | Description                                                      |
|----------------|------------|------------------------------------------------------------------|
| Series         | 1          | 1D labeled homogeneous array                                     |
| DataFrame      | 2          | General 2D labeled tabular structure                             |
| Panel          | 3          | General 3D labeled, size-mutable array (removed in pandas 0.25)  |

The most common data structure in analytical use is the __DataFrame__.

A __Series__ is basically a list of objects of the same data type.

A __DataFrame__ is a group of Series that are not necessarily of the same data type.
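
As a minimal sketch of that relationship (the column names and values below are made up for illustration), each column of a DataFrame is itself a Series with its own dtype:

```python
import pandas as pd

# Illustrative data: two columns holding different data types
people = pd.DataFrame({'name': ['Ann', 'Ben', 'Cid'],
                       'age': [31, 25, 40]})

name_col = people['name']   # selecting a single column returns a Series
age_col = people['age']

print(type(age_col).__name__)   # Series
print(people.dtypes)            # one dtype per column
```

Every value inside `age_col` shares one dtype, while `name_col` and `age_col` are free to differ from each other.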



## pandas.Series

A Series is an _indexed_ list of values of the same type and of fixed length.

Let's see some ways to create a series in pandas.


In [69]:
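# create an empty Series

# import NumPy and pandas under their conventional aliases
import numpy as np
import pandas as pd

# an explicit dtype keeps the result independent of the pandas version's default
s = pd.Series(dtype='float64')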
print(s)

Series([], dtype: float64)


In [70]:
# create a Series from a NumPy array; remember that a Series has an index!
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)

100    a
101    b
102    c
103    d
dtype: object


In [71]:
# create a Series from a Python dictionary, using the keys as the index
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

a    0.0
b    1.0
c    2.0
dtype: float64


In [72]:
# create a repetitive series from a scalar
s = pd.Series(5, index=range(0,10))
print(s)

0    5
1    5
2    5
3    5
4    5
5    5
6    5
7    5
8    5
9    5
dtype: int64


### Accessing Series Data

In [73]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# using Python's list-slice notation; iloc selects by position
print(s.iloc[0], '\n')
print(s[:3])

(1, '\n')
a    1
b    2
c    3
dtype: int64


In [74]:
# get elements using their index: 
print(s['b'], '\n')
print(s[['a', 'b', 'd']])


(2, '\n')
a    1
b    2
d    4
dtype: int64
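
The two access styles above (by position and by label) can also be written explicitly with `iloc` and `loc`. The following is a small sketch reusing the same Series; the variable names are only illustrative:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

first_by_position = s.iloc[0]    # element at position 0
b_by_label = s.loc['b']          # element with the label 'b'

label_slice = s.loc['a':'c']     # label slices include the end label
position_slice = s.iloc[0:3]     # positional slices exclude the end position

print(label_slice.equals(position_slice))   # True
```

Being explicit about `loc` versus `iloc` avoids ambiguity when the index itself consists of integers.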


### Series Attributes 


In [75]:
# size returns the length of the series
s = pd.Series(np.random.randn(4))
print(s.size)


4


In [76]:
# 'values' returns the values as a NumPy array, without the index
print(s.values)


[-0.40229121 -0.51855802  0.44540257  0.41885138]


In [77]:
# use 'head' and 'tail' to get the first or last elements of the Series
s = pd.Series([1,2,3,4,5,6,7,8,9])
print(s.head(2).values)
# a negative n returns everything except the first |n| elements
print(s.tail(-5).values)



[1 2]
[6 7 8 9]


<hr>


## Time to Exercise 🏋️

### Review the pd.Series documentation

In [79]:
pd.Series?


### Create a pandas Series of odd numbers between 1 and 100

In [41]:
my_vals = list(range(1,100))
odds = [val for val in my_vals if val % 2 == 1]
pd.Series(odds)

0      1
1      3
2      5
3      7
4      9
5     11
6     13
7     15
8     17
9     19
10    21
11    23
12    25
13    27
14    29
15    31
16    33
17    35
18    37
19    39
20    41
21    43
22    45
23    47
24    49
25    51
26    53
27    55
28    57
29    59
30    61
31    63
32    65
33    67
34    69
35    71
36    73
37    75
38    77
39    79
40    81
41    83
42    85
43    87
44    89
45    91
46    93
47    95
48    97
49    99
dtype: int64

### Create a pandas Series from a Python dictionary of length 10 of the form {a: 1, b: 2...}

In [80]:
my_s = pd.Series({'a' : 1, 
           'b' : 2, 
           'c' : 3, 
           'd' : 4, 
           'e' : 5, 
           'f' : 6, 
           'g' : 7, 
           'h' : 8, 
           'i' : 9, 
           'j' : 10})
my_s

a     1
b     2
c     3
d     4
e     5
f     6
g     7
h     8
i     9
j    10
dtype: int64

### Use the Series you've created to print the numeric values of 'd + j' and of 'e - f'

In [81]:
print("d + j = ", my_s['d'] + my_s['j'])
print("e - f = ", my_s['e'] - my_s['f'])

('d + j = ', 14)
('e - f = ', -1)


### You are given a Series of unknown length. Slice out the middle third of it without checking its size first

In [82]:
random_size = np.random.randint(100,200)   # random length between 100 and 199
ints_up_to_200 = list(range(0,200))        # the integers 0..199
ints_list_random_length = ints_up_to_200[ : random_size]
ints_list_random_length = pd.Series(ints_list_random_length)
ints_list_random_length


0        0
1        1
2        2
3        3
4        4
5        5
6        6
7        7
8        8
9        9
10      10
11      11
12      12
13      13
14      14
15      15
16      16
17      17
18      18
19      19
20      20
21      21
22      22
23      23
24      24
25      25
26      26
27      27
28      28
29      29
      ... 
166    166
167    167
168    168
169    169
170    170
171    171
172    172
173    173
174    174
175    175
176    176
177    177
178    178
179    179
180    180
181    181
182    182
183    183
184    184
185    185
186    186
187    187
188    188
189    189
190    190
191    191
192    192
193    193
194    194
195    195
Length: 196, dtype: int64

In [83]:
# now print the middle third of the Series

series_length = ints_list_random_length.size
first_interval_index = series_length // 3          # end of the first third
second_interval_index = first_interval_index * 2   # end of the second third

print(ints_list_random_length[first_interval_index:second_interval_index])

65      65
66      66
67      67
68      68
69      69
70      70
71      71
72      72
73      73
74      74
75      75
76      76
77      77
78      78
79      79
80      80
81      81
82      82
83      83
84      84
85      85
86      86
87      87
88      88
89      89
90      90
91      91
92      92
93      93
94      94
      ... 
100    100
101    101
102    102
103    103
104    104
105    105
106    106
107    107
108    108
109    109
110    110
111    111
112    112
113    113
114    114
115    115
116    116
117    117
118    118
119    119
120    120
121    121
122    122
123    123
124    124
125    125
126    126
127    127
128    128
129    129
Length: 65, dtype: int64


<hr>

## The DataFrame

A Pandas DataFrame is a table-like data structure that combines indexed rows and named columns.

A DataFrame can be created from lists, dictionaries, Series, NumPy ndarrays, another DataFrame, or directly from files and databases.


__Examples:__

In [84]:
# create a data frame from a list of lists
import pandas as pd
data = [['Alice',20],['Bob',32],['Charlie',25]]
df = pd.DataFrame(data, columns=['Name','Age'])
print(df)

      Name  Age
0    Alice   20
1      Bob   32
2  Charlie   25


In [85]:
# create a data frame from a dictionary
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],
        'Age' :[28,34,29,42]
       }
df = pd.DataFrame(data)
print(df)

   Age   Name
0   28    Tom
1   34   Jack
2   29  Steve
3   42  Ricky


In [86]:
# create a data frame from a list of dictionaries (e.g. a json list)
data = [{'name': 'Felix', 'Age': 22},
        {'name': 'Joe',   'Age': 19},
        {'name': 'Alexa', 'Age': 28, 'Title' : 'CEO'},
       ]
df = pd.DataFrame(data)
print(df)

   Age Title   name
0   22   NaN  Felix
1   19   NaN    Joe
2   28   CEO  Alexa


In [87]:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)


     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
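
The examples above cover lists and dictionaries. As a rough sketch of two of the other sources mentioned earlier, Series and NumPy ndarrays, the following uses made-up names and values:

```python
import numpy as np
import pandas as pd

# From a dictionary of Series: rows are aligned on the union of the indexes
heights = pd.Series([1.75, 1.82], index=['James', 'Alex'])
weights = pd.Series([70, 80, 65], index=['James', 'Alex', 'Bob'])
df_from_series = pd.DataFrame({'height': heights, 'weight': weights})
print(df_from_series)   # Bob has no height, so that cell is NaN

# From a 2D NumPy ndarray, with the column names supplied explicitly
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(arr, columns=['x', 'y'])
print(df_from_array)
```

Note how the Series-based constructor fills the missing combination with NaN instead of raising an error.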


### DataFrame attributes and functionality
A Pandas DataFrame shares many attributes and methods with NumPy arrays. For example:

__transpose()__ - Transposes rows and columns.

__axes__ - Returns a list with the row axis labels and column axis labels.

__dtypes__ - Returns the dtype of each column.

__ndim__ - Number of axes / array dimensions.

__shape__ - A tuple with the number of rows and columns.

__size__ - Number of elements in the DataFrame.

__head(n)__ - Returns the first n rows.

__tail(n)__ - Returns the last n rows.
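
A short sketch of these attributes on a small, made-up DataFrame (the name `df_attr` is only for illustration):

```python
import pandas as pd

df_attr = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                        'age': [20, 32, 25]})

print(df_attr.shape)     # (rows, columns)
print(df_attr.ndim)      # number of axes
print(df_attr.size)      # rows * columns
print(df_attr.dtypes)    # dtype of each column
print(df_attr.axes)      # [row labels, column labels]
print(df_attr.T)         # .T is shorthand for transpose()
print(df_attr.head(2))   # first two rows
print(df_attr.tail(1))   # last row
```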


### Pandas and Descriptive Statistics

Pandas has many built-in functions for easily exploring your datasets.

The common functions are _count()_, _sum()_, _mean()_, _median()_, _mode()_, _std()_, _min()_, _max()_ and a few more.

Usually, an aggregation function accepts an _axis_ parameter that indicates whether to aggregate along the _rows_ or the _columns_.

Pandas also has a useful __describe()__ function that applies a list of descriptive functions at once.


In [88]:
df = pd.DataFrame([
    {'name': 'Alice',    'age': 23,  'grade': 78},
    {'name': 'Bob',      'age': 26,  'grade': 48},
    {'name': 'Charlie',  'age': 21,  'grade': 92},
    {'name': 'Dave',     'age': 22,  'grade': 89},
    {'name': 'Eve',      'age': 27              },
    {'name': 'Frank',    'age': 28,  'grade': 81},
    {'name': 'Greg',     'age': 24,  'grade': 72},
    {'name': 'Grace',    'age': 25,  'grade': 97},
    {'name': 'Heidi',    'age': 26,  'grade': 91},
    {'name': 'Judy',     'age': 24,  'grade': 66}
])

df.describe()

Unnamed: 0,age,grade
count,10.0,9.0
mean,24.6,79.333333
std,2.221111,15.491933
min,21.0,48.0
25%,23.25,72.0
50%,24.5,81.0
75%,26.0,91.0
max,28.0,97.0
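
To illustrate the _axis_ parameter mentioned above, here is a minimal sketch on a hypothetical table of scores (the names and numbers are invented):

```python
import pandas as pd

scores = pd.DataFrame({'math': [90, 70, 80],
                       'art': [60, 100, 80]},
                      index=['Alice', 'Bob', 'Charlie'])

per_subject = scores.mean(axis=0)   # aggregate down the rows: one value per column
per_student = scores.mean(axis=1)   # aggregate across the columns: one value per row

print(per_subject)
print(per_student)
```

With `axis=0` (the default) the result is indexed by column name; with `axis=1` it is indexed by row label.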


#### Playing around with DataFrames


In [89]:
# list all students ages
df['age']

0    23
1    26
2    21
3    22
4    27
5    28
6    24
7    25
8    26
9    24
Name: age, dtype: int64

In [90]:
# What's the average age of the class?
df['age'].mean()

24.6

In [91]:
# What are the top three grades?
grades = df['grade'].sort_values() # remember that a column is actually a pandas Series
print(grades.tail(3))

2    92.0
7    97.0
4     NaN
Name: grade, dtype: float64


In [92]:
# Hmm, that wasn't cool: sort_values puts NaN last. Let's filter out the NaN using the dropna function
grades.dropna().tail(3)

8    91.0
2    92.0
7    97.0
Name: grade, dtype: float64

In [93]:
# let's select the name and age columns
new_df = df[['name', 'age']]
new_df

Unnamed: 0,name,age
0,Alice,23
1,Bob,26
2,Charlie,21
3,Dave,22
4,Eve,27
5,Frank,28
6,Greg,24
7,Grace,25
8,Heidi,26
9,Judy,24


### Selecting Data from a DataFrame

We've seen examples of selecting a single column or multiple columns.

Let's see a few more options for multi-axis slicing.

__df.loc[]__ - Label-based slicing.

__df.iloc[]__ - Integer position-based slicing.


__loc__ accepts several kinds of input:

- A single scalar label
- A list of labels
- A slice object
- A Boolean array

Let's see some examples:



In [94]:
df

Unnamed: 0,age,grade,name
0,23,78.0,Alice
1,26,48.0,Bob
2,21,92.0,Charlie
3,22,89.0,Dave
4,27,,Eve
5,28,81.0,Frank
6,24,72.0,Greg
7,25,97.0,Grace
8,26,91.0,Heidi
9,24,66.0,Judy


In [95]:
df.loc[:, 'grade'] # the empty slice [:] selects all rows

0    78.0
1    48.0
2    92.0
3    89.0
4     NaN
5    81.0
6    72.0
7    97.0
8    91.0
9    66.0
Name: grade, dtype: float64

In [96]:
# selecting the value in the third row (label 2) and the age column
df.loc[2,'age']

21

In [97]:
# selecting the first row
df.loc[0, :]

age         23
grade       78
name     Alice
Name: 0, dtype: object

In [98]:
# with iloc, we use the numeric position of the rows and columns we wish to slice

# get the first 3 rows and first 2 columns
df.iloc[:3, :2]

Unnamed: 0,age,grade
0,23,78.0
1,26,48.0
2,21,92.0
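
The four kinds of input that `loc` accepts can be sketched side by side. The `students` table below is a shortened, illustrative version of the data used above:

```python
import pandas as pd

students = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'Dave'],
                         'age': [23, 26, 21, 22],
                         'grade': [78, 48, 92, 89]})

one_value = students.loc[1, 'name']                    # a single scalar label
some_rows = students.loc[[0, 2], ['name', 'grade']]    # lists of labels
row_range = students.loc[1:2, 'age']                   # a slice; the end label 2 is included
passed = students.loc[students['grade'] > 60]          # a Boolean array

print(one_value)
print(some_rows)
print(row_range)
print(passed)
```

Unlike `iloc` and ordinary Python slices, a `loc` slice includes its end label.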


## Time to Exercise  🏋🏻


In [99]:
# Create a data frame from the following json array
zoo_data = [  { "animal": "elephant", "uniq_id": 1001, "water_need": 500 },
              { "animal": "elephant", "uniq_id": 1002, "water_need": 600 },
              { "animal": "elephant", "uniq_id": 1003, "water_need": 550 },
              { "animal": "tiger", "uniq_id": 1004, "water_need": 300 },
              { "animal": "tiger", "uniq_id": 1005, "water_need": 320 },
              { "animal": "tiger", "uniq_id": 1006, "water_need": 330 },
              { "animal": "tiger", "uniq_id": 1007, "water_need": 290 },
              { "animal": "tiger", "uniq_id": 1008, "water_need": 310 },
              { "animal": "zebra", "uniq_id": 1009, "water_need": 200 },
              { "animal": "zebra", "uniq_id": 1010, "water_need": 220 },
              { "animal": "zebra", "uniq_id": 1011, "water_need": 240 },
              { "animal": "zebra", "uniq_id": 1012, "water_need": 230 },
              { "animal": "zebra", "uniq_id": 1013, "water_need": 220 },
              { "animal": "zebra", "uniq_id": 1014, "water_need": 100 },
              { "animal": "zebra", "uniq_id": 1015, "water_need": 80 },
              { "animal": "lion", "uniq_id": 1016, "water_need": 420 },
              { "animal": "lion", "uniq_id": 1017, "water_need": 600 },
              { "animal": "lion", "uniq_id": 1018, "water_need": 500 },
              { "animal": "lion", "uniq_id": 1019, "water_need": 390 },
              { "animal": "kangaroo", "uniq_id": 1020, "water_need": 410 },
              { "animal": "kangaroo", "uniq_id": 1021, "water_need": 430 },
              { "animal": "kangaroo", "uniq_id": 1022, "water_need": 410}
]

### Use the shape, describe and head functions to explore the dataset


### How much water do all the animals need?


### Print the number of unique animals in the zoo



### Print a new data frame with the animal name column and its water need.



### Use the iloc function to select the last 10 rows of the last column


## Sorting

Pandas has two kinds of sorting: by index and by value.

Examples:

In [100]:
# sorting by index
unsorted_df = pd.DataFrame(data  = [1.75, 1.82, 1.68, 1.72],
                           index = ['James', 'Alex', 'Bob', 'Gary'],
                           columns = ['height'])

print(unsorted_df)
print('\n')

sorted_df=unsorted_df.sort_index()
print('sorted:\n')
print(sorted_df)

       height
James    1.75
Alex     1.82
Bob      1.68
Gary     1.72


sorted:

       height
Alex     1.82
Bob      1.68
Gary     1.72
James    1.75


In [101]:
# sorting by value
unsorted_df.sort_values(by='height')

Unnamed: 0,height
Bob,1.68
Gary,1.72
James,1.75
Alex,1.82


In [102]:
# and in the opposite order
unsorted_df.sort_values(by='height', ascending=False)

Unnamed: 0,height
Alex,1.82
James,1.75
Gary,1.72
Bob,1.68
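
`sort_values` also accepts a list of columns, with one sort direction per column. The sketch below assumes a hypothetical `team` column that is not part of the data above:

```python
import pandas as pd

people = pd.DataFrame({'name': ['James', 'Alex', 'Bob', 'Gary'],
                       'team': ['red', 'blue', 'red', 'blue'],
                       'height': [1.75, 1.82, 1.68, 1.72]})

# sort by team (ascending), then by height (descending) within each team
by_team_then_height = people.sort_values(by=['team', 'height'],
                                         ascending=[True, False])
print(by_team_then_height)
```

The original row labels travel with the rows, so the index of the result is no longer in order; `reset_index(drop=True)` can renumber it if needed.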


### Filtering data

Pandas uses __queries__ and __conditions__ to filter data frames.

Examples:


In [103]:
# the students data frame created earlier
df

Unnamed: 0,age,grade,name
0,23,78.0,Alice
1,26,48.0,Bob
2,21,92.0,Charlie
3,22,89.0,Dave
4,27,,Eve
5,28,81.0,Frank
6,24,72.0,Greg
7,25,97.0,Grace
8,26,91.0,Heidi
9,24,66.0,Judy

In [106]:
# using a simple condition to filter records by their grade
my_condition = df['grade'] > 85
df[my_condition]

Unnamed: 0,age,grade,name
2,21,92.0,Charlie
3,22,89.0,Dave
7,25,97.0,Grace
8,26,91.0,Heidi

In [105]:
# filter using multiple conditions
df[ (df['grade'] > 85) & (df['age'] > 24)]

Unnamed: 0,age,grade,name
7,25,97.0,Grace
8,26,91.0,Heidi

In [68]:
# Find the student with no grade
df[ df['grade'].isna() ]

Unnamed: 0,age,grade,name
4,27,,Eve
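
Besides Boolean conditions, a filter can be written as a query string with `DataFrame.query`. The sketch below uses a small, illustrative `students` table and shows that both styles select the same rows:

```python
import pandas as pd

students = pd.DataFrame({'name': ['Alice', 'Bob', 'Grace', 'Heidi'],
                         'age': [23, 26, 25, 26],
                         'grade': [78, 48, 97, 91]})

# Boolean conditions: wrap each one in parentheses and combine them with &
with_mask = students[(students['grade'] > 85) & (students['age'] > 24)]

# the same filter written as a query string
with_query = students.query('grade > 85 and age > 24')

print(with_query)
print(with_mask.equals(with_query))   # True
```

The query form tends to read more naturally for long conditions, while Boolean masks are easier to build up programmatically.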

## Time to Exercise  💪🏻


In [30]:
# Create a data frame from the following json array
zoo_data = [  { "animal": "elephant", "uniq_id": 1001, "water_need": 500 },
              { "animal": "elephant", "uniq_id": 1002, "water_need": 600 },
              { "animal": "elephant", "uniq_id": 1003, "water_need": 550 },
              { "animal": "tiger", "uniq_id": 1004, "water_need": 300 },
              { "animal": "tiger", "uniq_id": 1005, "water_need": 320 },
              { "animal": "tiger", "uniq_id": 1006, "water_need": 330 },
              { "animal": "tiger", "uniq_id": 1007, "water_need": 290 },
              { "animal": "tiger", "uniq_id": 1008, "water_need": 310 },
              { "animal": "zebra", "uniq_id": 1009, "water_need": 200 },
              { "animal": "zebra", "uniq_id": 1010, "water_need": 220 },
              { "animal": "zebra", "uniq_id": 1011, "water_need": 240 },
              { "animal": "zebra", "uniq_id": 1012, "water_need": 230 },
              { "animal": "zebra", "uniq_id": 1013, "water_need": 220 },
              { "animal": "zebra", "uniq_id": 1014, "water_need": 100 },
              { "animal": "zebra", "uniq_id": 1015, "water_need": 80 },
              { "animal": "lion", "uniq_id": 1016, "water_need": 420 },
              { "animal": "lion", "uniq_id": 1017, "water_need": 600 },
              { "animal": "lion", "uniq_id": 1018, "water_need": 500 },
              { "animal": "lion", "uniq_id": 1019, "water_need": 390 },
              { "animal": "kangaroo", "uniq_id": 1020, "water_need": 410 },
              { "animal": "kangaroo", "uniq_id": 1021, "water_need": 430 },
              { "animal": "kangaroo", "uniq_id": 1022, "water_need": 410}
]

### Print the highest water_need of all animals

### What's the water need of the animal with uniq_id 1013? 

### What's the average water_need of an elephant?

### Bonus: how many types of animals are in the zoo? 