# Getting Started with Pandas, a Powerful Python Data Analysis Toolkit

## 1. Introduction

### Pandas
Pandas is an open-source Python library for data analysis. It provides data structures and data manipulation tools designed to make loading, manipulating, merging, and cleaning spreadsheet-like data fast and easy in Python. It is often used alongside analytical libraries like scikit-learn, data visualization libraries like matplotlib, and numerical computing tools like NumPy and SciPy.

### 1.1 Introduction to Pandas Data Structures

Pandas introduces two new data types to Python: **Series and DataFrame**. These two workhorse data structures are not a universal solution for every problem, but they provide a solid basis for most applications. The DataFrame represents an entire spreadsheet or rectangular table of data, whereas a Series is a single column of a DataFrame.

### Series
A Series is a **one-dimensional** array-like object containing a sequence of values and an associated array of data labels, called its index. It is similar to the built-in Python *list*.

Here is an example of a Series created from a list of values:


In [2]:
import pandas as pd

In [3]:
from pandas import Series, DataFrame

In [4]:
obj = pd.Series([1, 3, 5, -7, 9])

In [5]:
obj

0    1
1    3
2    5
3   -7
4    9
dtype: int64

This is a string representation of a **Series**. It shows the **index on the left and the values on the right**. We have not specified an index, so a default one is created. You can get the array representation and index object of the Series via its **values and index attributes**, respectively:

In [7]:
obj.values

array([ 1,  3,  5, -7,  9])

In [8]:
obj.index

RangeIndex(start=0, stop=5, step=1)

You can create a Series with a **label** pointing to each data point:

In [10]:
obj2 = pd.Series([2, 4, 6, -8, 10], index=['a', 'b', 'c', 'd', 'e'])

In [11]:
obj2

a     2
b     4
c     6
d    -8
e    10
dtype: int64

Additionally, you can use labels in the index when selecting **a single value or a set of values**:

In [13]:
obj2['a']

2

In [14]:
obj2[['b', 'c', 'd']]

b    4
c    6
d   -8
dtype: int64

Also, **NumPy functions or NumPy-like operations**, such as scalar multiplication, filtering with a boolean array, or applying math functions, preserve the index-value link:

In [16]:
obj2 * 5

a    10
b    20
c    30
d   -40
e    50
dtype: int64

In [17]:
obj2[obj2 > 0]

a     2
b     4
c     6
e    10
dtype: int64

In [18]:
obj2[obj2 < 0]

d   -8
dtype: int64

In [19]:
obj2[obj2 > 8]

e    10
dtype: int64

In [20]:
import numpy as np

In [21]:
np.exp(obj2)

a        7.389056
b       54.598150
c      403.428793
d        0.000335
e    22026.465795
dtype: float64
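The filtering examples above work in two steps, which can be made explicit. The following is a minimal sketch that rebuilds `obj2` so it runs on its own; the names `mask`, `positive`, and `absolute` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([2, 4, 6, -8, 10], index=['a', 'b', 'c', 'd', 'e'])

# Step 1: the comparison returns a boolean Series with the same index.
mask = obj2 > 0
print(mask)

# Step 2: indexing with the mask keeps only the entries where it is True.
positive = obj2[mask]
print(positive)

# A NumPy function applied to a Series returns a Series with the same labels.
absolute = np.abs(obj2)
print(absolute)
```

Because the mask carries the same labels as the data, the surviving values keep their original index entries.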

We can use **Series** as a specialized dictionary. A **dictionary** is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values.

We can make the **Series-as-dictionary** analogy even clearer by constructing a Series object directly from a **Python dictionary**:

In [23]:
population_dict = {'St.Gallen': 526513, 'Genf': 494869, 'Zürich': 1575217, 'Luzern': 424678}

In [24]:
obj3 = pd.Series(population_dict)

In [25]:
obj3

St.Gallen     526513
Genf          494869
Zürich       1575217
Luzern        424678
dtype: int64

By default, a **Series** is created whose index is drawn from the dictionary's keys. In recent versions of Pandas the keys keep their insertion order, as the output above shows. From here, typical dictionary-style item access can be performed:

In [27]:
obj3['St.Gallen']

526513
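Beyond item access, a few other dictionary-style operations carry over to a Series. The sketch below rebuilds `obj3` so it is self-contained; the variable names `has_genf`, `has_basel`, `labels`, and `first_pair` are only illustrative.

```python
import pandas as pd

population_dict = {'St.Gallen': 526513, 'Genf': 494869,
                   'Zürich': 1575217, 'Luzern': 424678}
obj3 = pd.Series(population_dict)

# Membership tests check the index labels, like the keys of a dict.
has_genf = 'Genf' in obj3
has_basel = 'Basel' in obj3
print(has_genf, has_basel)

# keys() returns the index; items() yields (label, value) pairs.
labels = list(obj3.keys())
first_pair = next(iter(obj3.items()))
print(labels)
print(first_pair)
```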

Unlike a dictionary, though, the Series also supports array-style operations such as **slicing**:

In [29]:
obj3['St.Gallen':'Zürich']

St.Gallen     526513
Genf          494869
Zürich       1575217
dtype: int64
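Note that the label-based slice above includes its end label, which differs from ordinary Python slicing. A small sketch of the contrast, rebuilding `obj3` so it runs on its own (`by_label` and `by_position` are illustrative names):

```python
import pandas as pd

obj3 = pd.Series({'St.Gallen': 526513, 'Genf': 494869,
                  'Zürich': 1575217, 'Luzern': 424678})

# Slicing by label includes the end label ('Zürich').
by_label = obj3['St.Gallen':'Zürich']
print(by_label)

# Slicing by position with iloc excludes the stop position, like a list.
by_position = obj3.iloc[0:2]
print(by_position)
```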

When you pass only a dict, the index in the resulting Series takes the dict’s keys in their **insertion order**. You can **override** this by passing an explicit index with the labels in the order you want them to appear in the resulting Series:

In [31]:
states = ['Basel', 'St.Gallen', 'Genf', 'Zürich', 'Luzern']

In [32]:
obj4 = pd.Series(population_dict, index=states)

In [33]:
obj4

Basel              NaN
St.Gallen     526513.0
Genf          494869.0
Zürich       1575217.0
Luzern        424678.0
dtype: float64

Since there is no value for ‘Basel’, it appears as **NaN (Not a Number)**.

You can detect missing data with the **isna** and **notna** functions (also available under the older names **isnull** and **notnull**):

In [35]:
pd.isna(obj4)

Basel         True
St.Gallen    False
Genf         False
Zürich       False
Luzern       False
dtype: bool

In [36]:
pd.notna(obj4)

Basel        False
St.Gallen     True
Genf          True
Zürich        True
Luzern        True
dtype: bool

Series also has these as instance methods:

In [38]:
obj4.isna()

Basel         True
St.Gallen    False
Genf         False
Zürich       False
Luzern       False
dtype: bool
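Since these functions return boolean Series, they combine naturally with the boolean filtering shown earlier. The following is a minimal sketch under that assumption; it rebuilds `obj4`, and the names `known` and `n_missing` are only illustrative.

```python
import pandas as pd

population_dict = {'St.Gallen': 526513, 'Genf': 494869,
                   'Zürich': 1575217, 'Luzern': 424678}
states = ['Basel', 'St.Gallen', 'Genf', 'Zürich', 'Luzern']
obj4 = pd.Series(population_dict, index=states)

# notna() gives a boolean mask; use it to keep only the known values.
known = obj4[obj4.notna()]
print(known)

# Summing the isna() mask counts the missing entries (True counts as 1).
n_missing = int(obj4.isna().sum())
print(n_missing)
```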

In **arithmetic operations**, Series are automatically aligned by index label. In addition, both the Series object itself and its index have a name attribute:

In [40]:
obj3

St.Gallen     526513
Genf          494869
Zürich       1575217
Luzern        424678
dtype: int64

In [41]:
obj4

Basel              NaN
St.Gallen     526513.0
Genf          494869.0
Zürich       1575217.0
Luzern        424678.0
dtype: float64

In [42]:
obj3 + obj4

Basel              NaN
Genf          989738.0
Luzern        849356.0
St.Gallen    1053026.0
Zürich       3150434.0
dtype: float64
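The same alignment behaviour can be seen on a smaller scale. The sketch below uses two made-up Series (`left` and `right` are hypothetical names with arbitrary values) whose labels only partly overlap:

```python
import pandas as pd

# Two small Series with made-up values and partially overlapping labels.
left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['c', 'b', 'd'])

# Values are matched by label, not by position.
# Labels present in only one operand produce NaN in the result.
total = left + right
print(total)
```

Here `'b'` and `'c'` exist on both sides and are added by label, while `'a'` and `'d'` have no partner and come out as NaN, just like 'Basel' above.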

In [43]:
obj4.name = 'population'

In [44]:
obj4.index.name = 'state'

In [45]:
obj4

state
Basel              NaN
St.Gallen     526513.0
Genf          494869.0
Zürich       1575217.0
Luzern        424678.0
Name: population, dtype: float64
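One place where these names become useful is when a Series is turned into a DataFrame. As a small sketch (using a shortened, self-contained version of `obj4`; `frame` is an illustrative name), `to_frame()` carries both names over:

```python
import pandas as pd

obj4 = pd.Series({'St.Gallen': 526513, 'Genf': 494869}, name='population')
obj4.index.name = 'state'

# The Series name becomes the column label; the index name is kept.
frame = obj4.to_frame()
print(frame)
```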

### DataFrame

First, let’s clarify the term **DataFrame**.

In **Pandas**, it is a **two-dimensional**, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays.
[pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)


Let’s have a look at this.

Let’s first **construct** a new Series listing the area code of each of the four states discussed before. Together with the population Series from above, we can use a dictionary to construct **a single two-dimensional object** containing this information:

In [48]:
areacode_dict = {'St.Gallen': 71, 'Genf': 22, 'Zürich': 44, 'Luzern': 41}

In [49]:
areacode = pd.Series(areacode_dict)

In [50]:
areacode

St.Gallen    71
Genf         22
Zürich       44
Luzern       41
dtype: int64

In [51]:
population = obj3

In [52]:
states = pd.DataFrame({'population': population, 'areacode': areacode})

In [53]:
states

|           | population | areacode |
|-----------|------------|----------|
| St.Gallen | 526513     | 71       |
| Genf      | 494869     | 22       |
| Zürich    | 1575217    | 44       |
| Luzern    | 424678     | 41       |


Now we can access the index labels via the **DataFrame index attribute**. The DataFrame also has a columns attribute, which is an Index object that contains the column labels.

In [55]:
states.index

Index(['St.Gallen', 'Genf', 'Zürich', 'Luzern'], dtype='object')

In [56]:
states.columns

Index(['population', 'areacode'], dtype='object')

Therefore the **DataFrame** can be thought of as a **generalization of a two-dimensional NumPy array**,
where both the rows and columns have a generalized index for accessing the data.
We can also think of a **DataFrame** as a specialization of a **dictionary**. Where a dictionary maps a **key to a value**, a DataFrame **maps** a column name to a Series of column data. For example, asking for the 'areacode' column returns a Series object:

In [58]:
states['areacode']

St.Gallen    71
Genf         22
Zürich       44
Luzern       41
Name: areacode, dtype: int64
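The dictionary analogy extends to a few other operations. The sketch below rebuilds the `states` DataFrame so it runs on its own; `column` is an illustrative name.

```python
import pandas as pd

population = pd.Series({'St.Gallen': 526513, 'Genf': 494869,
                        'Zürich': 1575217, 'Luzern': 424678})
areacode = pd.Series({'St.Gallen': 71, 'Genf': 22, 'Zürich': 44, 'Luzern': 41})
states = pd.DataFrame({'population': population, 'areacode': areacode})

# Column names behave like dict keys.
print('areacode' in states)     # membership checks the column labels
print(list(states.keys()))      # keys() returns the column labels

# Each column is a Series that shares the DataFrame's row index.
column = states['population']
print(type(column).__name__)
print(column.index.equals(states.index))
```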

**Note**: 
For a **DataFrame**, **data['col0']** returns the column labeled 'col0' (here, the first column). In a **two-dimensional NumPy array**, **data[0]** returns the **first row**. While a DataFrame is physically two-dimensional, you can use it to represent **higher-dimensional** data in a tabular format using hierarchical indexing (also known as **multi-indexing**), which incorporates multiple index levels within a single index. Hierarchical indexing is the more common practice. Older versions of Pandas also provided **Panel** and **Panel4D** objects for three-dimensional and four-dimensional data, but these have since been removed from the library.
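A quick way to check what each indexing form returns, and to see hierarchical indexing in action, is the sketch below. The array values and the population figures are placeholders invented for illustration, and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# A small 2 x 3 array with made-up values.
arr = np.array([[1, 2, 3],
                [4, 5, 6]])
df = pd.DataFrame(arr, columns=['col0', 'col1', 'col2'])

first_row = arr[0]        # NumPy: an integer index selects along the first axis
first_col = df['col0']    # DataFrame: a column label selects a column
print(first_row)
print(first_col.tolist())

# Hierarchical index: (state, year) pairs add an extra dimension to a Series.
# The population figures below are placeholders, not real data.
idx = pd.MultiIndex.from_product([['Genf', 'Luzern'], [2000, 2010]],
                                 names=['state', 'year'])
pop = pd.Series([100, 110, 200, 220], index=idx)

print(pop.loc['Genf'])              # all years for one state
print(pop.loc[('Luzern', 2010)])    # a single (state, year) entry
```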

We can construct **DataFrame objects** in a variety of ways:

* From a list of dicts
* From a single Series object
* From a dictionary of Series objects
* From a two-dimensional NumPy array
* From a NumPy structured array

In [60]:
data = [{'x': i, 'y': 4 * i}
        for i in range(4)]

In [61]:
pd.DataFrame(data)

|   | x | y  |
|---|---|----|
| 0 | 0 | 0  |
| 1 | 1 | 4  |
| 2 | 2 | 8  |
| 3 | 3 | 12 |
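When constructing from a list of dicts, the dicts do not all need the same keys. As a small sketch with made-up records (`records` and `frame` are illustrative names), missing entries are filled with NaN:

```python
import pandas as pd

# Two records with different keys: 'a' is missing in the second,
# 'c' is missing in the first.
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
frame = pd.DataFrame(records)
print(frame)

# Columns with a missing entry contain NaN in that row.
print(frame.isna())
```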


In [62]:
pd.DataFrame(population, columns=['population'])

|           | population |
|-----------|------------|
| St.Gallen | 526513     |
| Genf      | 494869     |
| Zürich    | 1575217    |
| Luzern    | 424678     |


In [63]:
pd.DataFrame({'population': population, 'areacode': areacode})

|           | population | areacode |
|-----------|------------|----------|
| St.Gallen | 526513     | 71       |
| Genf      | 494869     | 22       |
| Zürich    | 1575217    | 44       |
| Luzern    | 424678     | 41       |


In [64]:
pd.DataFrame(np.random.rand(3, 2),
             columns = ['a', 'b'],
             index = ['x', 'y', 'z'])

|   | a        | b        |
|---|----------|----------|
| x | 0.018674 | 0.004473 |
| y | 0.720538 | 0.152199 |
| z | 0.564845 | 0.520253 |


In [65]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])

In [66]:
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [67]:
pd.DataFrame(A)

|   | A | B   |
|---|---|-----|
| 0 | 0 | 0.0 |
| 1 | 0 | 0.0 |
| 2 | 0 | 0.0 |


This was a short intro to **Data Analysis with Pandas**. Stay tuned!