**pandas** is an open source Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

**pandas** is well suited for many different kinds of data:

*   Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
*   Ordered and unordered (not necessarily fixed-frequency) time series data
*   Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
*   Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering.

It is worth stressing a few of the things that **pandas** does well:
*   Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
*   Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
*   Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
*   Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
*   Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
*   Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
*   Intuitive merging and joining of data sets
*   Flexible reshaping of data sets
*   Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
*   Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging
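
Two of the features listed above, automatic label alignment and missing-data handling, can be illustrated with a minimal sketch. The Series names and values here are made up for illustration only.

```python
import pandas as pd

# Two Series whose labels only partly overlap
s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# Addition aligns on labels; a label present in only one operand yields NaN
total = s1 + s2
print(total)

# Reductions such as sum() skip missing values by default
print(total.sum())
```

Only the labels `b` and `c` appear in both inputs, so only those two positions hold numbers in the result.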








# Install and import
Pandas is an easy package to install. Open up your terminal program (for Mac users) or command line (for PC users) and install it using either of the following commands:

`conda install pandas`

or

`pip install pandas`

Alternatively, if you're currently viewing this article in a Jupyter notebook you can run this cell:

`!pip install pandas`

pandas is used so often that it is conventionally imported under a shorter name:

In [36]:
import pandas as pd
import numpy as np # we will also need numpy in our examples  

The most commonly used data structures in pandas are:


1.   Series objects: 1D array, similar to a column in a spreadsheet
2.   DataFrame objects: 2D table, similar to a spreadsheet

# Series
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

In [37]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
print(type(data))

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
<class 'pandas.core.series.Series'>


As we see in the output, the *Series* wraps both a sequence of values and a sequence of indices, which we can access with the `values` and `index` attributes.

In [38]:
print(data.values)
print(data.index)

[0.25 0.5  0.75 1.  ]
RangeIndex(start=0, stop=4, step=1)


In [39]:
# Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation
# print(data[1])
print(data[1:3])

1    0.50
2    0.75
dtype: float64


## Series as generalized NumPy array
Although a Series is similar to a one-dimensional NumPy array, there is an essential difference between them: the presence of an index. While a NumPy array has an implicitly defined integer index used to access the values, a pandas Series has an explicitly defined index associated with the values.

In [40]:
# we can use string as an index (values of other types are also allowed)
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [41]:
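# In this case, item access by label works like this:
print(data['b'])
# Access by position; plain data[1] is deprecated for non-integer
# indexes in recent pandas versions, so iloc is used here
print(data.iloc[1])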

0.5
0.5


In [42]:
# Non-contiguous or non-sequential indices can be used as well
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [43]:
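# With an integer index, data[1] is looked up as the label 1, not as
# position 1. No item has the label 1 here, so this raises a KeyError.
# (The item labeled 5 would be data[5].)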
data[1]

KeyError: 1
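
The `KeyError` above arises because, with an integer index, `data[1]` is treated as a label lookup. A minimal sketch of two unambiguous alternatives, using the same Series:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data.loc[5]      # label-based: the item whose index label is 5
by_position = data.iloc[1]  # position-based: the second item

print(by_label, by_position)
```

Both expressions return the same item here, because the label 5 happens to sit at position 1.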

## Series as specialized dictionary
A pandas Series can be thought of as a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values.

In [None]:
# By default, a Series will be created where the index is drawn
# from the dictionary keys (in insertion order in current pandas versions).
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

In [None]:
population['Texas']

Unlike a dictionary, though, the Series also supports array-style operations such as slicing:

In [None]:
population['California':'Florida']

## Constructing Series objects

```Python 
pd.Series(data, index=index)
```

where **index** is an optional argument and **data** can be one of several kinds of object. For example, **data** can be a list or NumPy array, in which case **index** defaults to an integer sequence:

In [None]:
pd.Series([2, 4, 6])

In [None]:
# data can be a scalar
pd.Series(77, index=[111, 222, 333])

`data` can be a dictionary, in which case the index defaults to the dictionary keys (in insertion order in current pandas versions):

In [None]:
pd.Series({2:'a', 1:'b', 3:'c'})

In [None]:
# The index can be explicitly set. 
# Series is populated only with the explicitly identified keys.
pd.Series({2:'a', 1:'b', 3:'c'}, index=[1, 2])
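
As a hedged sketch of how an explicit `index` interacts with a dictionary, the example below requests one label that exists in the dictionary and one that does not. The label `4` is a hypothetical addition chosen for illustration.

```python
import pandas as pd

source = {2: 'a', 1: 'b', 3: 'c'}

# Without an explicit index, the keys are used in their insertion order
default_order = pd.Series(source)
print(default_order.index.tolist())

# Label 3 is in the dict; label 4 is not, so it is filled with NaN
s = pd.Series(source, index=[3, 4])
print(s)
```

The explicit index acts as a selection: keys not listed are dropped, and listed labels with no matching key become missing values.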

# The Pandas DataFrame Object
`DataFrame` is another fundamental structure in Pandas. It can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. 

## DataFrame as a generalized NumPy array
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

Just as we might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, we can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.

In [None]:
population

In [None]:
# Firstly we create a new Series object with dict. 
area_dict = {'California': 423967, 
             'Texas': 695662,
             'New York': 141297,
             'Florida': 170312,
             'Illinois': 149995}
area = pd.Series(area_dict)
area

In [None]:
# Now we can use a dictionary to construct a single two-dimensional
# object containing the information from population and area
states = pd.DataFrame({'population': population,
                       'area': area})
states
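
The Series used above share exactly the same labels. As a minimal sketch of what alignment means when they do not, the example below deliberately uses two Series with only partly overlapping labels (a subset of the figures above):

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'Florida': 170312})

# Rows are matched by label; a label missing from one Series gives NaN
states = pd.DataFrame({'population': population, 'area': area})
print(states)
```

The resulting index is the union of the two input indexes, and only Texas has a value in both columns.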

In [None]:
# A DataFrame has index and columns attributes that
# give access to the row and column labels
# print(states.index)
# print(states.columns)
print(states.values)

## DataFrame as specialized dictionary
Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. For example, asking for the 'area' attribute returns the Series object containing the areas we saw earlier.

In [None]:
states['area']

## Constructing DataFrame objects
A Pandas DataFrame can be constructed in a variety of ways. Here we'll give several examples.

### From a single Series object

A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series.

In [None]:
pd.DataFrame(population, columns=['population'])

In [None]:
pd.DataFrame([population, area], index=['population', 'area'])

In [None]:
pd.DataFrame([population, area], index=['population', 'area']).T

### From a list of dicts
Any list of dictionaries can be made into a DataFrame. We'll use a simple list comprehension to create some data.

In [None]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]

print(data)
pd.DataFrame(data)

Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e., "not a number") values:

In [None]:
pd.DataFrame([{'a': 1, 'b': 2},
              {'b': 3, 'c': 4}])
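
A short sketch of how the filled-in values can be inspected, using the same two dictionaries. `isna()` and `dtypes` are standard DataFrame members.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2},
                   {'b': 3, 'c': 4}])

# isna() marks the cells that were absent from the input dictionaries
print(df.isna())

# Columns that contain NaN are stored as floating point,
# while the complete column 'b' keeps an integer dtype
print(df.dtypes)
```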

### From a dictionary of Series objects
As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well:

In [None]:
pd.DataFrame({'population': population,
              'area': area})

### From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names. If omitted, an integer index will be used for each:

In [None]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['col_1', 'col_2'],
             index=['row_1', 'row_2', 'row_3'])
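
For comparison, here is a minimal sketch of the default case in which both `columns` and `index` are omitted. The array contents are arbitrary.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)

# No index or columns given: both default to integers starting at 0
df = pd.DataFrame(arr)
print(df.index.tolist())    # [0, 1, 2]
print(df.columns.tolist())  # [0, 1]
```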

# The Pandas Index Object
The Index object is an interesting structure in itself; it can be thought of either as an immutable array or as an ordered set.


In [None]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

## Index as immutable array
The Index in many ways operates like an array. For example, we can use standard Python indexing notation to retrieve values or slices.

In [None]:
print(ind[1])
print(ind[::2])

`Index` objects also have many of the attributes familiar from NumPy arrays.

In [None]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

In [None]:
# An Index is immutable, so this assignment raises a TypeError
ind[2] = 4
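
A minimal sketch of the immutability in practice: the failed assignment is caught, and a modified copy is built instead. Using `delete` and `insert` here is just one possible approach, not the only one.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[2] = 4
except TypeError as err:
    # Index objects do not support item assignment
    print('TypeError:', err)

# To "change" an entry, build a new Index; the original stays untouched
new_ind = ind.delete(2).insert(2, 4)
print(new_ind)
```

Immutability makes it safe to share one Index between several Series and DataFrame objects without side effects.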

## Index as ordered set

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic.
The ``Index`` object follows many of the conventions used by Python's built-in ``set`` data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

In [None]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [None]:
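# Named methods are used here because, since pandas 2.0, the operators
# &, | and ^ act element-wise on an Index rather than as set operations
print(indA.intersection(indB))          # intersection
print(indA.union(indB))                 # union
print(indA.symmetric_difference(indB))  # symmetric difference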
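
A minimal sketch of the set operations using the named `Index` methods, which behave the same across pandas versions (the `&`, `|`, and `^` operators were changed in pandas 2.0 to act element-wise instead):

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)            # labels in both
either = indA.union(indB)                   # labels in at least one
only_one = indA.symmetric_difference(indB)  # labels in exactly one
only_a = indA.difference(indB)              # labels in indA but not indB

print(common, either, only_one, only_a, sep='\n')
```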

# Data Indexing and Selection


## Data Selection in Series
As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. 

### Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

In [None]:
data['b']

We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:

In [None]:
print('a' in data)
print(data.keys())
print(list(data.items()))

Series objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:

In [None]:
# 'e' is a new index value, so the Series is extended by one item
data['e'] = 1.25
data
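
The difference between overwriting and extending can be sketched as follows. The value `1.5` is an arbitrary choice for illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

data['d'] = 1.5    # existing label: the value is overwritten
data['e'] = 1.25   # new label: the Series grows by one item

print(data)
```

This mirrors a dictionary, where assigning to an existing key replaces its value and assigning to a new key adds an entry.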

### Series as one-dimensional array
A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, slices, masking, and fancy indexing.


In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

# slicing by explicit index: the endpoint 'c' is included
data['a':'c']

In [None]:
# slicing by implicit integer index: the endpoint (position 2) is excluded
data[0:2]

In [None]:
# masking
data[(data > 0.3) & (data < 0.8)]

Note that when slicing with an **explicit index** (i.e., `data['a':'c']`), the **final index is included**, whereas with an implicit index (i.e., `data[0:2]`), it is excluded.
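
The inclusive and exclusive endpoints can be checked side by side with a short sketch on the same Series:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = data['a':'c']   # labels a, b, c: the endpoint is included
by_position = data[0:2]    # positions 0 and 1: the endpoint is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```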

### Indexers: loc, iloc
If a Series has an explicit integer index, an indexing operation such as `data[1]` will use the explicit indices, while a slicing operation like `data[1:3]` will use the implicit Python-style index.

In [None]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

In [None]:
# explicit index when indexing: this looks up the label 1
data[1]

In [None]:
data[2:3]

In [None]:
# implicit index when slicing
data[1:3]

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes.

First, the ``loc`` attribute allows indexing and slicing that always references the explicit index:

In [None]:
data

In [None]:
data.loc[1]

In [None]:
data.loc[1:3]

The ``iloc`` attribute allows indexing and slicing that always references the implicit Python-style index:

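In [44]:
# Re-create the Series with a non-sequential integer index,
# so that labels and positions differ
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data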

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [45]:
data.iloc[1]

0.5

In [46]:
data.iloc[1:3]

5    0.50
3    0.75
dtype: float64
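
To contrast the two indexers directly, here is a minimal sketch on a Series with an integer index, where labels and positions differ:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

print(data.loc[1])     # label 1 -> 'a'
print(data.iloc[1])    # position 1 -> 'b'

label_slice = data.loc[1:3]      # labels 1 through 3, endpoint included
position_slice = data.iloc[1:3]  # positions 1 and 2, endpoint excluded

print(label_slice.tolist())
print(position_slice.tolist())
```

Writing `loc` or `iloc` explicitly makes the intent clear to the reader, which is why they are often preferred over plain square brackets for integer-indexed data.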

## Data Selection in DataFrame
`DataFrame` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index.

### DataFrame as a dictionary
The first analogy we will consider is the DataFrame as a dictionary of related Series objects.

In [47]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:

In [None]:
data['area']

Equivalently, we can use attribute-style access with column names that are strings:

In [None]:
data.area
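
Attribute-style access has a caveat worth illustrating: it only works when the column name does not clash with an existing DataFrame attribute or method. The sketch below uses two rows of the data above.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

# 'area' does not clash with a DataFrame attribute, so both forms agree
print(data.area.equals(data['area']))

# 'pop' is also the name of a DataFrame method, so data.pop refers to
# that method rather than to the column
print(callable(data.pop))
print(isinstance(data['pop'], pd.Series))
```

For this reason, dictionary-style access such as `data['pop']` is the safer choice, especially for assignment.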

Dictionary-style syntax can also be used to modify the object, in this case adding a new column:

In [48]:
data['density'] = data['pop'] / data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


### DataFrame as two-dimensional array
We can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the **values** attribute:

In [50]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

Many familiar array-like operations can be performed on the `DataFrame` itself. For example, we can transpose the full `DataFrame` to swap rows and columns:

In [49]:
data.T.astype('int')

         California     Texas  New York   Florida  Illinois
area         423967    695662    141297    170312    149995
pop        38332521  26448193  19651127  19552860  12882135
density          90        38       139       114        85


In [None]:
data

In [None]:
# passing a single index to an array accesses a row:
data.values[0]

In [None]:
# passing a single index to a DataFrame accesses a column
data['area']

In [None]:
data

Using the **iloc** indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the `DataFrame` index and column labels are maintained in the result:

In [None]:
# NumPy-style data[:3, :2] is not supported on a DataFrame; use iloc instead
data.iloc[:3, :2]

Similarly, using the **loc** indexer we can index the underlying data in an array-like style but using the explicit index and column names:

In [51]:
# data.loc[:'Illinois', :'pop']
data.loc[:'New York', :'pop']

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


In [52]:
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


With the **loc** indexer, we can combine masking and fancy indexing, as in the following:

In [53]:
data.density > 100

California    False
Texas         False
New York       True
Florida        True
Illinois      False
Name: density, dtype: bool

In [54]:
# print(data.density > 100)
data.loc[data.density > 100, ['pop', 'density']]
# data.loc[data.density > 100, :]
# data.loc[[True, True, False, False, False], :]

               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121


Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:

In [None]:
data.iloc[0, 2] = 90
data
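
Assignment works with `loc` as well, including with a Boolean mask. A minimal sketch on three rows of the data above follows; the replacement values `90.0` and `100.0` are arbitrary choices for illustration.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662, 141297],
                     'pop': [38332521, 26448193, 19651127]},
                    index=['California', 'Texas', 'New York'])
data['density'] = data['pop'] / data['area']

# Label-based assignment to a single cell
data.loc['California', 'density'] = 90.0

# Mask-based assignment: cap every density above 100 at 100
data.loc[data['density'] > 100, 'density'] = 100.0

print(data)
```

Assigning through a single `loc` call, rather than chaining two indexing steps, ensures that the original DataFrame is the object being modified.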

### Additional indexing conventions
There are a couple of extra indexing conventions. First, while *indexing* refers to columns, *slicing* refers to rows:


In [55]:
data['Florida':'Illinois']

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


Such slices can also refer to rows by number rather than by index:

In [56]:
data[1:3]

            area       pop     density
Texas     695662  26448193   38.018740
New York  141297  19651127  139.076746


Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

In [57]:
# print(data.density > 100)
data[data.density > 100]

            area       pop     density
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121


# Short Demo

In [58]:
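# This relative path assumes a sample_data folder next to the notebook
# (as found, for example, in Google Colab); adjust it if the file lives elsewhere.
data = pd.read_csv('sample_data/california_housing_train.csv')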

FileNotFoundError: [Errno 2] No such file or directory: 'sample_data/california_housing_train.csv'

In [None]:
data.head()

In [None]:
data.tail(10)

In [None]:
data.loc[(data.housing_median_age > 50) & (data.median_income > 5)]

In [None]:
data.loc[(data.housing_median_age > 50) & (data.median_income > 5), ['median_house_value', 'total_bedrooms']]
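
Because the CSV file may not be available, the same selection can be sketched on a small stand-in table. The column names follow the queries above, but every value below is made up for illustration and is not taken from the real data set.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real CSV
data = pd.DataFrame({
    'housing_median_age': [52.0, 15.0, 51.0, 30.0],
    'median_income': [6.1, 7.2, 3.4, 5.5],
    'median_house_value': [500001.0, 350000.0, 120000.0, 210000.0],
    'total_bedrooms': [250.0, 1200.0, 310.0, 640.0],
})

# Rows where both conditions hold; parentheses are needed around each
# comparison because & binds more tightly than > in Python
mask = (data.housing_median_age > 50) & (data.median_income > 5)

# Keep only the matching rows and the two columns of interest
selected = data.loc[mask, ['median_house_value', 'total_bedrooms']]
print(selected)
```

Storing the mask in a variable keeps the `loc` call short and lets the same condition be reused for several selections.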
