**Pandas**

pandas is the primary package for data analysis in Python. Its name derives from <b>panel data</b> (and doubles as a play on Python data analysis), and it is the fundamental package that provides <b>relational, table-like data structures (think Excel sheets or SQL tables) and a host of tools for working with those data structures</b>. It is the most widely used Python package for data analysis and is well suited to <b>cross-sectional, time series, and panel data analysis</b>. pandas sits on top of NumPy and can be used with NumPy arrays and NumPy functions. pandas suits a researcher’s needs in several ways:
<i>
+ Has a tabular data structure that can hold both <b>homogeneous and heterogeneous data</b>.
+ Very <b>good indexing capabilities</b> that make data alignment and merging easy.
+ Good <b>time series functionality</b>. No need to use different data structures for time series and cross-sectional data. Allows for both <b>ordered and unordered time-series data</b>.
+ A host of <b>statistical functions</b> built around NumPy and pandas that make a researcher’s task easy and fast.
+ Programming is a lot <b>simpler and faster</b>.
+ Easily handles <b>data manipulation and cleaning</b>.
+ Easy to expand and shorten data sets. <b>Comprehensive merge, join, and group-by functionality for combining multiple data sets</b>.
</i>

**Installing pandas** 

To check whether pandas is installed, open Package Manager and search for pandas. pandas comes preinstalled with the Canopy distribution. If the package is not installed, click Install.
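
The same check can also be done from Python itself, which is handy outside Canopy. The following is a minimal sketch that uses only the standard library's `importlib`; the variable name `pandas_installed` is chosen here for illustration.

```python
import importlib.util

# find_spec returns None when the package cannot be found on the import path
spec = importlib.util.find_spec("pandas")
pandas_installed = spec is not None

if pandas_installed:
    import pandas as pd
    # __version__ reports which release of pandas is available
    print("pandas version:", pd.__version__)
else:
    print("pandas is not installed")
```

If the check reports that pandas is missing, install it through your distribution's package manager before continuing.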

**Importing pandas**

To use pandas, first import it with an import statement:


In [1]:
import pandas as pd                         # This will import pandas into your workspace

In [2]:
import numpy as np                          # We will be using numpy functions so import numpy

**Data Structures in pandas**

There are two basic data structures in pandas: <b><i>Series and DataFrame</i></b>

**Series:** A Series is similar to a one-dimensional NumPy array. In addition to the values specified by the programmer, <b>pandas attaches a label to each of the values</b>. If the programmer does not provide labels, pandas assigns them automatically (0 for the first element, 1 for the second, and so on). The benefit of labelling values is that manipulating the data becomes easier: the Series behaves much like a dictionary in which each value is associated with a label.

In [5]:
series1 = pd.Series([10,20,30,40])
series1

0    10
1    20
2    30
3    40
dtype: int64

In [7]:
series1.values

array([10, 20, 30, 40], dtype=int64)

In [8]:
series1.index

RangeIndex(start=0, stop=4, step=1)
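
To make the dictionary analogy concrete, here is a small sketch that builds a Series from a Python dict and looks values up by label. The name `prices` and the fruit labels are made up for illustration and do not appear elsewhere in this tutorial.

```python
import pandas as pd

# Building a Series from a dict: the keys become the labels (the index)
prices = pd.Series({"apple": 10, "banana": 20, "cherry": 30})

banana_price = prices["banana"]   # look up a value by its label, as with a dict
has_cherry = "cherry" in prices   # membership tests the labels, not the values
labels = list(prices.index)       # the labels, in insertion order

print(banana_price, has_cherry, labels)
```

Note that `in` checks the labels rather than the values, which mirrors how `in` checks the keys of a dict.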

<b>To specify custom index values instead of the defaults, pass them through the index argument:</b>

In [12]:
series2=pd.Series([10,20,30,40,50], index=[1,2,3,4,5])
series2

1    10
2    20
3    30
4    40
5    50
dtype: int64
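
With a custom integer index, a label and a position no longer coincide, so it helps to be explicit about which one is meant. The sketch below recreates `series2` and uses `.loc` for label-based access and `.iloc` for position-based access. It also shows that arithmetic between two Series aligns on labels. The second Series, `other`, is a hypothetical example added here.

```python
import pandas as pd

series2 = pd.Series([10, 20, 30, 40, 50], index=[1, 2, 3, 4, 5])

by_label = series2.loc[1]       # the value whose label is 1
by_position = series2.iloc[1]   # the value at position 1 (the second element)

# Arithmetic aligns on labels, not positions; labels present in only
# one of the two Series produce NaN in the result
other = pd.Series([1, 2, 3], index=[3, 4, 5])
total = series2 + other

print(by_label, by_position)
print(total)
```

Because labels 1 and 2 exist only in `series2`, the corresponding entries of `total` are missing values, and the result is stored as floating point.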