# Pandas
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

## Installing Pandas

Using Anaconda, we can install pandas as follows: 
```bash
conda install pandas
```
Alternatively, we can install it from PyPI using pip:
```bash
pip install pandas
```

At the time of writing this notebook, pandas was at _v0.24_. To install that specific version, append a version specifier to the package name:
```bash
conda install pandas==0.24
```
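
After installing, it can be useful to confirm which version Python actually picks up. A minimal check, assuming pandas is importable in the active environment:

```python
import pandas as pd

# Print the version string of the pandas package that was imported
print(pd.__version__)
```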

## What problem does pandas solve?

Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate.

pandas does not implement significant modeling functionality; for this, look to statsmodels and scikit-learn. More work is still needed to make Python a first-class statistical modeling environment, but we are well on our way toward that goal.




## Importing pandas

In [1]:
import pandas as pd

#  Additional libraries
import numpy as np

## Pandas Datatypes

### Series

[**Series**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:
```python
s = pd.Series(data, index=index)
```    
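Here, data can be a *list*, a *numpy array*, or a *dict*, and the index is a set of labels that act like a primary key for addressing each value. If no index is passed, pandas uses the integers 0, 1, 2, and so on; for a dict, the keys become the index.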

In [2]:
labels = ['a','b','c']
my_data = [10,20,30]
arr = np.array(my_data)
d = {'a':10,'b':20,'c':30}

print("Labels:", labels)
print("My data:", my_data)
print("Dictionary:", d)

Labels: ['a', 'b', 'c']
My data: [10, 20, 30]
Dictionary: {'a': 10, 'b': 20, 'c': 30}


In [3]:
s1=pd.Series(data=my_data)
print(s1)

0    10
1    20
2    30
dtype: int64


In [4]:
s2=pd.Series(data=my_data, index=labels)
print(s2)

a    10
b    20
c    30
dtype: int64


In [5]:
s3=pd.Series(arr, labels)
print(s3)

a    10
b    20
c    30
dtype: int64


In [6]:
s4=pd.Series(d)
print(s4)

a    10
b    20
c    30
dtype: int64
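
Once a Series has labels, they can be used to look values up. The sketch below is illustrative only (the variable names are not from the notebook) and shows access by label, by integer position, and by a list of labels:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = s['b']          # look up a value by its index label
by_position = s.iloc[1]    # look up the same value by integer position
subset = s[['a', 'c']]     # a list of labels returns a new Series

print(by_label, by_position)
print(subset)
```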


In [7]:
print("\nHolding numerical data\n",'-'*25, sep='')
print(pd.Series(arr))
print("\nHolding text labels\n",'-'*20, sep='')
print(pd.Series(labels))
print("\nHolding functions\n",'-'*20, sep='')
print(pd.Series(data=[sum,print,len]))
print("\nHolding objects from a dictionary\n",'-'*50, sep='')
print(pd.Series(data=[d.keys, d.items, d.values]))


Holding numerical data
-------------------------
0    10
1    20
2    30
dtype: int64

Holding text labels
--------------------
0    a
1    b
2    c
dtype: object

Holding functions
--------------------
0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

Holding objects from a dictionary
--------------------------------------------------
0    <built-in method keys of dict object at 0x7f6a...
1    <built-in method items of dict object at 0x7f6...
2    <built-in method values of dict object at 0x7f...
dtype: object


### Operations in Series

A Series supports many mathematical operations, including reductions such as _sum_ and _mean_ and element-wise functions such as _absolute value_.
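
As a rough illustration of these operations, the sketch below uses made-up data (not taken from the notebook). It also shows that arithmetic between two Series aligns on index labels rather than on position:

```python
import numpy as np
import pandas as pd

s = pd.Series([-10, 20, -30], index=['a', 'b', 'c'])

total = s.sum()       # reduction: adds up all values
average = s.mean()    # reduction: arithmetic mean
absolute = s.abs()    # element-wise absolute value, returns a new Series

# Arithmetic between two Series aligns on index labels;
# labels present in only one operand yield NaN in the result.
other = pd.Series([1, 2, 3], index=['b', 'c', 'd'])
aligned_sum = s + other

print(total, average)
print(absolute)
print(aligned_sum)
```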

---
### DataFrame

**DataFrame** is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

* Dict of 1D ndarrays, lists, dicts, or Series
* 2-D numpy.ndarray
* Structured or record ndarray
* A Series
* Another DataFrame

Along with the data, you can optionally pass **index** (row labels) and **columns** (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.
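
To make the index behaviour concrete, here is a small sketch with a dict of Series. The column names `one` and `two` and the values are placeholders chosen for illustration:

```python
import pandas as pd

data = {
    'one': pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c']),
    'two': pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd']),
}

# Without an explicit index, the row labels are the union of the Series indexes
df_all = pd.DataFrame(data)

# With an explicit index, only the requested labels are kept ('a' and 'c' are
# discarded); a requested label missing from a Series becomes NaN
df_subset = pd.DataFrame(data, index=['d', 'b'])

print(df_all)
print(df_subset)
```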

In [8]:
matrix_data = np.random.randint(1,10,size=20).reshape(5,4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
print("\nThe data frame looks like\n",'-'*45, sep='')
print(df)


The data frame looks like
---------------------------------------------
   W  X  Y  Z
A  8  8  3  4
B  5  1  3  4
C  7  6  1  7
D  1  2  6  5
E  4  4  9  5


In [9]:
d={'a':[10,20],'b':[30,40],'c':[50,60]}
df2=pd.DataFrame(data=d,index=['X','Y'])
print(df2)

    a   b   c
X  10  30  50
Y  20  40  60


In [10]:
# 25 rows and 4 columns
matrix_data = np.random.randint(1,100,100).reshape(25,4)
column_headings = ['W','X','Y','Z']
df = pd.DataFrame(data=matrix_data,columns=column_headings)

In [11]:
df.head()

    W   X   Y   Z
0  90  59  43  47
1  67   6  88  99
2   7  46  72  60
3  14  55  75  95
4  26  79  36  75


In [12]:
df.tail()

     W   X   Y   Z
20  90  27  35  41
21  17  67  53  39
22  86  47  92  20
23  90  87  19  11
24  18   9  55  99


In [13]:
print("\nThe 'X' column\n",'-'*25, sep='')
print(df['X'])
print("\nType of the column: ", type(df['X']), sep='')
print("\nThe 'X' and 'Z' columns indexed by passing a list\n",'-'*55, sep='')
print(df[['X','Z']])
print("\nType of the pair of columns: ", type(df[['X','Z']]), sep='')


The 'X' column
-------------------------
0     59
1      6
2     46
3     55
4     79
5     56
6     94
7     25
8     65
9     70
10    37
11    33
12    62
13    34
14    32
15    85
16    48
17    47
18    73
19    51
20    27
21    67
22    47
23    87
24     9
Name: X, dtype: int64

Type of the column: <class 'pandas.core.series.Series'>

The 'X' and 'Z' columns indexed by passing a list
-------------------------------------------------------
     X   Z
0   59  47
1    6  99
2   46  60
3   55  95
4   79  75
5   56  31
6   94  39
7   25  36
8   65  46
9   70  88
10  37  96
11  33  11
12  62  56
13  34  83
14  32  97
15  85  84
16  48  69
17  47  38
18  73  35
19  51   9
20  27  41
21  67  39
22  47  20
23  87  11
24   9  99

Type of the pair of columns: <class 'pandas.core.frame.DataFrame'>


In [14]:
matrix_data = np.random.randint(1,10,size=20).reshape(5,4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
print("\nLabel-based 'loc' method can be used for selecting row(s)\n",'-'*60, sep='')
print("\nSingle row\n")
print(df.loc['C'])
print("\nMultiple rows\n")
print(df.loc[['B','C']])
print("\nIndex position based 'iloc' method can be used for selecting row(s)\n",'-'*70, sep='')
print("\nSingle row\n")
print(df.iloc[2])
print("\nMultiple rows\n")
print(df.iloc[[1,2]])


Label-based 'loc' method can be used for selecting row(s)
------------------------------------------------------------

Single row

W    4
X    8
Y    9
Z    9
Name: C, dtype: int64

Multiple rows

   W  X  Y  Z
B  6  3  6  5
C  4  8  9  9

Index position based 'iloc' method can be used for selecting row(s)
----------------------------------------------------------------------

Single row

W    4
X    8
Y    9
Z    9
Name: C, dtype: int64

Multiple rows

   W  X  Y  Z
B  6  3  6  5
C  4  8  9  9
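
Both indexers also accept a row selector and a column selector together. The sketch below uses a deterministic frame built with `np.arange` (an assumption made for reproducibility, not the random data above). Note that label slices with `loc` include both endpoints, while positional slices with `iloc` exclude the stop position:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)

cell = df_demo.loc['C', 'Y']               # single value by row and column label
block = df_demo.loc['B':'D', ['W', 'Z']]   # label slice: 'B' through 'D' inclusive
same_block = df_demo.iloc[1:4, [0, 3]]     # positional slice: rows 1, 2, 3 only

print(cell)
print(block)
```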


In [15]:
print("\nA column is created by assigning it in relation to an existing column\n",'-'*75, sep='')
df['New'] = df['X']+df['Z']
df['New (Sum of X and Z)'] = df['X']+df['Z']
print(df)
print("\nA column is dropped by using df.drop() method\n",'-'*55, sep='')
df = df.drop('New', axis=1) # axis=0 (rows) is the default, so pass axis=1 to drop a column
print(df)
df1=df.drop('A')
print("\nA row (index) is dropped by using df.drop() method and axis=0\n",'-'*65, sep='')
print(df1)
print("\nAn in-place change can be done by making inplace=True in the drop method\n",'-'*75, sep='')
df.drop('New (Sum of X and Z)', axis=1, inplace=True)
print(df)


A column is created by assigning it in relation to an existing column
---------------------------------------------------------------------------
   W  X  Y  Z  New  New (Sum of X and Z)
A  1  6  8  5   11                    11
B  6  3  6  5    8                     8
C  4  8  9  9   17                    17
D  4  3  4  3    6                     6
E  9  8  8  3   11                    11

A column is dropped by using df.drop() method
-------------------------------------------------------
   W  X  Y  Z  New (Sum of X and Z)
A  1  6  8  5                    11
B  6  3  6  5                     8
C  4  8  9  9                    17
D  4  3  4  3                     6
E  9  8  8  3                    11

A row (index) is dropped by using df.drop() method and axis=0
-----------------------------------------------------------------
   W  X  Y  Z  New (Sum of X and Z)
B  6  3  6  5                     8
C  4  8  9  9                    17
D  4  3  4  3                     6
E  9  8  8  3  