

<img src="https://pandas.pydata.org/_static/pandas_logo.png"/>


# What is Pandas?
_pandas_ is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Python with pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics. In this tutorial, we will learn the main features of pandas and how to use them in practice.

## Pandas Key Features

- Fast and efficient __DataFrame__ object with default and customized indexing.
- Tools for loading data into __in-memory__ data objects from different file formats.
- Data alignment and integrated handling of missing data.
- __Reshaping__ and pivoting of data sets.
- Label-based slicing, __indexing__ and subsetting of large data sets.
- __Columns__ from a data structure can be deleted or inserted.
- Group by data for __aggregation__ and transformations.
- High performance __merging and joining__ of data.
- __Time Series__ functionality.
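
A few of the features listed above can be sketched in a handful of lines. This is only an illustration: the fruit counts, store names and managers below are made-up data, not part of the tutorial's data set.

```python
import pandas as pd

# Data alignment: values are matched by index label, not by position.
q1 = pd.Series({'apples': 3, 'pears': 5})
q2 = pd.Series({'pears': 2, 'plums': 7})
total = q1 + q2                      # labels missing on either side become NaN
filled = q1.add(q2, fill_value=0)    # treat a missing label as 0 instead

# Group by: aggregate the rows that share a key.
sales = pd.DataFrame({
    'store':  ['north', 'south', 'north', 'south'],
    'amount': [10, 20, 30, 40],
})
per_store = sales.groupby('store')['amount'].sum()

# Merging: join two tables on a shared column.
managers = pd.DataFrame({'store': ['north', 'south'],
                         'manager': ['Ann', 'Raj']})
merged = sales.merge(managers, on='store', how='left')

print(total)
print(filled)
print(per_store)
print(merged)
```

Each of these topics has many more options than shown here; the sketch only demonstrates the default behavior.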


### Pandas Data Structures

The main data structures in pandas are:

| Data Structure | Dimensions | Description                                                                  |
|----------------|------------|------------------------------------------------------------------------------|
| Series         | 1          | 1D labeled homogeneous array                                                 |
| DataFrame      | 2          | General 2D labeled tabular structure                                         |
| Panel          | 3          | General 3D labeled, size-mutable array (removed from recent pandas releases) |

The most commonly used data structure in analysis is the __DataFrame__.

A __Series__ is essentially a labeled list of values that share one data type.

A __DataFrame__ is a group of Series that share an index but are not necessarily of the same data type.
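
As a minimal sketch of that relationship (the names and heights are made-up values), selecting one column of a DataFrame gives back a Series, and each column keeps its own dtype:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Raj'], 'height': [1.70, 1.82]})

col = df['height']            # a single column is a Series
print(type(col).__name__)     # Series
print(df.dtypes)              # one dtype per column
```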



### pandas.Series

A Series is an _indexed_ list of values of the same type and of fixed length. 

Let's see some ways to create a series in pandas.


In [39]:
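# creating an empty series

# import NumPy and pandas under their usual aliases, np and pd
import numpy as np
import pandas as pd

# pass dtype explicitly: the default dtype of an empty Series differs between pandas versions
s = pd.Series(dtype='float64')
print(s)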

Series([], dtype: float64)


In [40]:
# creating a series from a NumPy array. remember that a series has an index!
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)

100    a
101    b
102    c
103    d
dtype: object


In [41]:
# creating a series from a Python dictionary, using the keys as the index
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

a    0.0
b    1.0
c    2.0
dtype: float64


In [42]:
# create a repetitive series from a scalar
s = pd.Series(5, index=range(0,10))
print(s)

0    5
1    5
2    5
3    5
4    5
5    5
6    5
7    5
8    5
9    5
dtype: int64


#### Accessing Series Data

In [43]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# using Python-style positional access; iloc makes it explicit that we mean positions, not labels
print(s.iloc[0], '\n')
print(s.iloc[:3])

1 

a    1
b    2
c    3
dtype: int64


In [44]:
# get elements using their index: 
print(s['b'], '\n')
print(s[['a', 'b', 'd']])


2 

a    1
b    2
d    4
dtype: int64
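
Label-based and position-based access can disagree when the index itself is made of integers. The sketch below uses a made-up Series whose integer labels are deliberately out of order; `loc` and `iloc` make the intent explicit:

```python
import pandas as pd

# the labels 2, 1, 0 are in the reverse order of the positions 0, 1, 2
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s.loc[0]      # the value whose label is 0
by_position = s.iloc[0]  # the value in the first position

print(by_label)     # 30
print(by_position)  # 10
```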


### Series Attributes 


In [45]:
# size returns the length of the series
s = pd.Series(np.random.randn(4))
print(s.size)


4


In [46]:
# 'values' returns the values as a NumPy array, without the index
print(s.values)


[-1.17881139  0.75582212  1.72079361  1.70281998]


In [47]:
# use the 'head' and 'tail' methods to get the first or last elements in the series
# a negative n in tail(n) returns everything except the first |n| elements
s = pd.Series([1,2,3,4,5,6,7,8,9])
print(s.head(2).values)
print(s.tail(-5).values)



[1 2]
[6 7 8 9]
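
A few more Series attributes, sketched on a small made-up Series (the name `height` and its values are illustrative only):

```python
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'], name='height')

print(s.index)   # the index labels
print(s.dtype)   # float64
print(s.shape)   # (3,)
print(s.name)    # height
print(s.empty)   # False
```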


### The DataFrame

A pandas DataFrame is a tabular data structure that combines indexed rows and named columns. 

A DataFrame can be created from lists, dictionaries, Series, NumPy ndarrays, another DataFrame, or directly from files and databases. 


__Examples:__

In [48]:
# create a data frame from a list of lists, naming the columns explicitly
import pandas as pd
data = [['Alice',20],['Bob',32],['Charlie',25]]
df = pd.DataFrame(data, columns=['Name','Age'])
print(df)

      Name  Age
0    Alice   20
1      Bob   32
2  Charlie   25


In [49]:
# create a data frame from a dictionary
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],
        'Age' :[28,34,29,42]
       }
df = pd.DataFrame(data)
print(df)

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42


In [50]:
# create a data frame from a list of dictionaries (e.g. a json list)
data = [{'name': 'Felix', 'Age': 22},
        {'name': 'Joe',   'Age': 19},
        {'name': 'Alexa', 'Age': 28, 'Title' : 'CEO'},
       ]
df = pd.DataFrame(data)
print(df)

    name  Age Title
0  Felix   22   NaN
1    Joe   19   NaN
2  Alexa   28   CEO


In [51]:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)


     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
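
The examples above cover lists and dictionaries. As a minimal sketch of two of the other sources mentioned earlier, the block below builds a DataFrame from a dictionary of Series and from a 2D NumPy array. The names, titles and column labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary of Series: rows are aligned on the union of the index labels,
# and missing combinations are filled with NaN
ages = pd.Series({'Alice': 20, 'Bob': 32})
titles = pd.Series({'Bob': 'CTO', 'Charlie': 'CEO'})
from_series = pd.DataFrame({'Age': ages, 'Title': titles})
print(from_series)

# From a 2D NumPy array: each array column becomes a DataFrame column
arr = np.arange(6).reshape(3, 2)
from_array = pd.DataFrame(arr, columns=['x', 'y'])
print(from_array)
```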


### DataFrame attributes and functionality
pandas DataFrames share several attributes and methods with NumPy arrays. For example:

__transpose()__ -  Transposes rows and columns. 

__axes__ - Returns a list with the row axis labels and column axis labels. 

__dtypes__ - Returns the columns dtypes. 

__ndim__ - Number of axes / array dimensions. 

__shape__ - A tuple with the number of rows and columns of the DataFrame. 

__size__ - Number of elements in the DataFrame. 

__head(n)__ - Returns the first n rows. 

__tail(n)__ - Returns last n rows. 
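
The attributes and methods listed above can be tried on a small frame. This is a quick sketch using a three-row frame similar to the earlier examples:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve'], 'Age': [28, 34, 29]})

print(df.shape)        # (3, 2): three rows, two columns
print(df.ndim)         # 2
print(df.size)         # 6: rows times columns
print(df.axes)         # [row index, column index]
print(df.dtypes)       # dtype of each column
print(df.transpose())  # rows become columns and vice versa
print(df.head(2))      # first two rows
print(df.tail(1))      # last row
```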


### Pandas and Descriptive Statistics

pandas has many built-in functions for easily exploring your datasets. 

The common functions are _count()_, _sum()_, _mean()_, _median()_, _mode()_, _std()_, _min()_, _max()_ and a few more. 

Usually, an aggregation function accepts an _axis_ parameter that indicates whether to aggregate down each _column_ (axis=0, the default) or across each _row_ (axis=1). 

pandas also has a useful __describe()__ function that applies a set of descriptive functions at once.


In [52]:
data = pd.DataFrame([
    {'name': 'Alice',    'age': 23,  'grade': 78},
    {'name': 'Bob',      'age': 26,  'grade': 48},
    {'name': 'Charlie',  'age': 21,  'grade': 92},
    {'name': 'Dave',     'age': 22,  'grade': 89},
    {'name': 'Eve',      'age': 27              },
    {'name': 'Frank',    'age': 28,  'grade': 81},
    {'name': 'Greg',     'age': 24,  'grade': 72},
    {'name': 'Grace',    'age': 25,  'grade': 97},
    {'name': 'Heidi',    'age': 26,  'grade': 91},
    {'name': 'Judy',     'age': 24,  'grade': 66}
])

data.describe()

|       | age      | grade     |
|-------|----------|-----------|
| count | 10.0     | 9.0       |
| mean  | 24.6     | 79.333333 |
| std   | 2.221111 | 15.491933 |
| min   | 21.0     | 48.0      |
| 25%   | 23.25    | 72.0      |
| 50%   | 24.5     | 81.0      |
| 75%   | 26.0     | 91.0      |
| max   | 28.0     | 97.0      |
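
The _axis_ parameter mentioned above can be sketched on a small frame. The subjects, names and scores here are made up for illustration:

```python
import pandas as pd

scores = pd.DataFrame({'math': [80, 90, 70], 'physics': [60, 100, 90]},
                      index=['Ann', 'Raj', 'Lee'])

col_means = scores.mean(axis=0)  # one value per column (the default)
row_means = scores.mean(axis=1)  # one value per row

print(col_means)
print(row_means)  # Ann 70.0, Raj 95.0, Lee 80.0
```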


#### Playing around with DataFrames


In [53]:
# list all students' ages
data['age']

0    23
1    26
2    21
3    22
4    27
5    28
6    24
7    25
8    26
9    24
Name: age, dtype: int64

In [54]:
# What's the average age in the class?
data['age'].mean()

24.6

In [55]:
# What are the top three grades?
grades = data['grade'].sort_values() # remember that a column is actually a pandas Series
print(grades.tail(3))

2    92.0
7    97.0
4     NaN
Name: grade, dtype: float64


In [56]:
# that's not what we wanted: sort_values puts NaN last. let's filter out the NaN using the dropna function
grades.dropna().tail(3)

8    91.0
2    92.0
7    97.0
Name: grade, dtype: float64

In [57]:
# let's select the name and age columns
new_df = data[['name', 'age']]
new_df

|   | name    | age |
|---|---------|-----|
| 0 | Alice   | 23  |
| 1 | Bob     | 26  |
| 2 | Charlie | 21  |
| 3 | Dave    | 22  |
| 4 | Eve     | 27  |
| 5 | Frank   | 28  |
| 6 | Greg    | 24  |
| 7 | Grace   | 25  |
| 8 | Heidi   | 26  |
| 9 | Judy    | 24  |


### Selecting Data from a DataFrame

We've seen examples of selecting one or more columns. 

Let's see a few more options for multi-axis selection.

__df.loc[]__ - Label-based selection. 

__df.iloc[]__ - Integer position-based selection. 


__loc__ accepts several kinds of selectors:

- A single scalar label
- A list of labels
- A slice object
- A Boolean array

Let's see some examples:



In [58]:
data

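|   | name    | age | grade |
|---|---------|-----|-------|
| 0 | Alice   | 23  | 78.0  |
| 1 | Bob     | 26  | 48.0  |
| 2 | Charlie | 21  | 92.0  |
| 3 | Dave    | 22  | 89.0  |
| 4 | Eve     | 27  | NaN   |
| 5 | Frank   | 28  | 81.0  |
| 6 | Greg    | 24  | 72.0  |
| 7 | Grace   | 25  | 97.0  |
| 8 | Heidi   | 26  | 91.0  |
| 9 | Judy    | 24  | 66.0  |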


In [59]:
data.loc[:, 'grade'] # the bare : slice selects all rows

0    78.0
1    48.0
2    92.0
3    89.0
4     NaN
5    81.0
6    72.0
7    97.0
8    91.0
9    66.0
Name: grade, dtype: float64

In [60]:
# selecting the value at row label 2 (the third row) and the 'age' column
data.loc[2,'age']

21

In [61]:
# selecting the first row
data.loc[0, :]

name     Alice
age         23
grade       78
Name: 0, dtype: object

In [62]:
# with iloc, we use the numeric positions of the rows and columns we wish to select

# get the first 3 rows and first 2 columns
data.iloc[:3, :2]

|   | name    | age |
|---|---------|-----|
| 0 | Alice   | 23  |
| 1 | Bob     | 26  |
| 2 | Charlie | 21  |
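
The four kinds of selector that `loc` accepts can also be sketched side by side. The frame below is a made-up, string-indexed subset of the student data, chosen so that labels and positions are easy to tell apart:

```python
import pandas as pd

df = pd.DataFrame({'age': [23, 26, 21], 'grade': [78.0, 48.0, 92.0]},
                  index=['alice', 'bob', 'charlie'])

single = df.loc['bob']                   # a single label: returns that row as a Series
several = df.loc[['alice', 'charlie']]   # a list of labels
sliced = df.loc['alice':'bob']           # a label slice: both ends are included
masked = df.loc[df['grade'] > 70]        # a Boolean array: keeps the rows where it is True

by_pos = df.iloc[0:2]                    # a position slice: the end is excluded

print(single, several, sliced, masked, by_pos, sep='\n\n')
```

Note the difference between the two slices: a label slice with `loc` includes its end label, while a position slice with `iloc` follows the usual Python rule and excludes the end position.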


### Sorting

pandas can sort in two ways: by index, using __sort_index()__, and by value, using __sort_values()__. 

Examples:

In [63]:
# sorting by index
unsorted_df = pd.DataFrame(data  = [1.75, 1.82, 1.68, 1.72],
                           index = ['James', 'Alex', 'Bob', 'Gary'],
                           columns = ['height'])

print(unsorted_df)
print('\n')

sorted_df=unsorted_df.sort_index()
print('sorted:\n')
print(sorted_df)

       height
James    1.75
Alex     1.82
Bob      1.68
Gary     1.72


sorted:

       height
Alex     1.82
Bob      1.68
Gary     1.72
James    1.75


In [64]:
# sorting by value
unsorted_df.sort_values(by='height')

|       | height |
|-------|--------|
| Bob   | 1.68   |
| Gary  | 1.72   |
| James | 1.75   |
| Alex  | 1.82   |


In [65]:
# and in the opposite order
unsorted_df.sort_values(by='height', ascending=False)

|       | height |
|-------|--------|
| Alex  | 1.82   |
| James | 1.75   |
| Gary  | 1.72   |
| Bob   | 1.68   |
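
Sorting by value also works with more than one column. The sketch below adds a made-up `team` column to the height data, then sorts by team in ascending order and by height in descending order within each team:

```python
import pandas as pd

df = pd.DataFrame({'name':   ['James', 'Alex', 'Bob', 'Gary'],
                   'team':   ['red', 'blue', 'red', 'blue'],
                   'height': [1.75, 1.82, 1.68, 1.72]})

# one ascending flag per column in 'by'
by_team_then_height = df.sort_values(by=['team', 'height'],
                                     ascending=[True, False])
print(by_team_then_height)

# sort_index accepts ascending=False as well
print(df.sort_index(ascending=False))
```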


### Filtering data

pandas uses __queries__ and __conditions__ to filter data frames. 

Examples:


In [66]:
data

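|   | name    | age | grade |
|---|---------|-----|-------|
| 0 | Alice   | 23  | 78.0  |
| 1 | Bob     | 26  | 48.0  |
| 2 | Charlie | 21  | 92.0  |
| 3 | Dave    | 22  | 89.0  |
| 4 | Eve     | 27  | NaN   |
| 5 | Frank   | 28  | 81.0  |
| 6 | Greg    | 24  | 72.0  |
| 7 | Grace   | 25  | 97.0  |
| 8 | Heidi   | 26  | 91.0  |
| 9 | Judy    | 24  | 66.0  |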


In [67]:
# using a simple condition to filter records by their grade
my_condition = data['grade'] > 85
data[my_condition]

|   | name    | age | grade |
|---|---------|-----|-------|
| 2 | Charlie | 21  | 92.0  |
| 3 | Dave    | 22  | 89.0  |
| 7 | Grace   | 25  | 97.0  |
| 8 | Heidi   | 26  | 91.0  |
8,Heidi,26,91.0


In [68]:
# filter using multiple conditions
data[ (data['grade'] > 85) & (data['age'] > 24)]

|   | name  | age | grade |
|---|-------|-----|-------|
| 7 | Grace | 25  | 97.0  |
| 8 | Heidi | 26  | 91.0  |


In [69]:
# Find the student with no grade
data[ data['grade'].isna() ]

|   | name | age | grade |
|---|------|-----|-------|
| 4 | Eve  | 27  | NaN   |
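
The examples above all use Boolean conditions. The same kind of filter can also be written as a query string with `DataFrame.query()`. This sketch rebuilds a four-row subset of the student data so that it runs on its own; the name `students` is just for this example.

```python
import pandas as pd

students = pd.DataFrame({'name':  ['Alice', 'Grace', 'Heidi', 'Eve'],
                         'age':   [23, 25, 26, 27],
                         'grade': [78.0, 97.0, 91.0, None]})

# the query string refers to columns by name; 'and' combines conditions
high_and_older = students.query('grade > 85 and age > 24')
print(high_and_older)

# isin() keeps rows whose value appears in a list; ~ negates a condition
picked = students[students['name'].isin(['Alice', 'Eve'])]
graded = students[~students['grade'].isna()]
print(picked)
print(graded)
```

A comparison against NaN evaluates to False, so Eve's row is dropped by the grade condition without any special handling.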


## Time to Exercise  💪🏻
