# Background

Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools built around its data structures. The name is derived from "panel data", an econometrics term for multidimensional data sets.

In 2008, Wes McKinney started developing Pandas because he needed a high-performance, flexible tool for data analysis.

Prior to Pandas, Python was used mainly for data munging and preparation and contributed very little to data analysis itself. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze.

Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.

## Key Features of Pandas

- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High performance merging and joining of data.
- Time Series functionality.
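
A few of the features listed above can be sketched in a handful of lines. The data below (`sales`, `regions`, and the two small Series) is invented purely for illustration; the sketch only assumes that pandas is installed.

```python
import pandas as pd

# Hypothetical sales data, invented purely for illustration.
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "units": [10, 15, 7, 3],
})
regions = pd.DataFrame({
    "store": ["A", "B"],
    "region": ["North", "South"],
})

# Group by a key column and aggregate.
units_per_store = sales.groupby("store")["units"].sum()

# Join two tables on a shared column.
merged = sales.merge(regions, on="store", how="left")

# Arithmetic aligns on labels; a label present in only one operand yields NaN.
x = pd.Series([1, 2], index=["a", "b"])
y = pd.Series([10, 20], index=["b", "c"])
aligned = x + y

print(units_per_store)
print(merged)
print(aligned)
```

Only the label `b` exists in both Series, so it is the only entry of `aligned` that holds a number.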

## Pandas deals with the following three data structures 

- Series
- DataFrame
- Panel

These data structures are built on top of NumPy arrays, which means they are fast.

#### Dimension & Description
The best way to think of these data structures is that a higher-dimensional structure is a container of its lower-dimensional structure. For example, a DataFrame is a container of Series, and a Panel is a container of DataFrames.

|Data Structure|Dimensions|Description|
|---|---|---|
|Series|1|1D labeled homogeneous array, size-immutable.|
|DataFrame|2|General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.|
|Panel|3|General 3D labeled, size-mutable array.|

Building and handling arrays of two or more dimensions is tedious, because the burden is on the user to consider the orientation of the data set when writing functions. Pandas data structures reduce this mental effort.
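
The container relationship can be checked directly. The following is a minimal sketch with an invented two-column frame: selecting a single column or a single row of a DataFrame gives back a Series.

```python
import pandas as pd

# Small invented frame used only to illustrate the container idea.
df = pd.DataFrame({"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]})

col = df["x"]      # selecting one column gives a Series
row = df.iloc[0]   # selecting one row also gives a Series

print(type(col).__name__)
print(type(row).__name__)
print(col.index.equals(df.index))   # the column shares the frame's row labels
```

The row Series is labeled by the column names, which is why the orientation of the data never has to be tracked by hand.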

#### Mutability
All Pandas data structures are value-mutable (their contents can be changed). All except Series are also size-mutable; a Series is size-immutable.
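
As a rough illustration of the two kinds of mutability, the sketch below overwrites a value in a Series and adds a column to a DataFrame. The variable names and values are invented for the example.

```python
import pandas as pd

s = pd.Series([10, 20, 30])
s.iloc[1] = 99        # value mutation: an existing element is overwritten

df = pd.DataFrame({"a": [1, 2]})
df["b"] = [3, 4]      # size mutation: a new column is added to the same object

print(s.tolist())
print(df.shape)
```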

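Note − DataFrame is the most widely used and one of the most important data structures. Panel is used much less; it was deprecated in pandas 0.20 and removed in pandas 0.25, so current releases provide only Series and DataFrame.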

# Series
A Series is a one-dimensional, array-like structure with homogeneous data. For example, the following Series is a collection of the integers 10, 23, 56, …

|10|23|56|17|52|61|73|90|26|72|
|---|---|---|---|---|---|---|---|---|---|

**Key Points**
- Homogeneous data
- Size-immutable
- Values of the data are mutable
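
"Homogeneous" means that a Series carries a single dtype for all of its values. A small sketch with invented values shows how pandas picks one dtype that fits every element:

```python
import pandas as pd

ints = pd.Series([10, 23, 56])             # all integers: an integer dtype
mixed_numbers = pd.Series([10, 23.5, 56])  # integers are upcast to float64
mixed = pd.Series([10, "ten"])             # numbers and text fall back to object

print(ints.dtype)
print(mixed_numbers.dtype)
print(mixed.dtype)
```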

In [1]:
import pandas as pd
s = pd.Series(dtype='float64')  # explicit dtype; the default for an empty Series depends on the pandas version
s

Series([], dtype: float64)

## Creating a Series from an ndarray


In [2]:
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])  
s = pd.Series(data)
s
# By default the index is range(n), i.e. 0 to n-1, where n = len(data)

0    a
1    b
2    c
3    d
dtype: object

In [3]:
# Now passing an array with a custom index
series2 = pd.Series(data, index=['a','b','c','d'])
series2

a    a
b    b
c    c
d    d
dtype: object

## Creating a Series from a Dictionary

In [4]:
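dic = {'pen':10,'book':200,'copy':50}
series3 = pd.Series(dic)  # the dictionary keys become the index labels
series3
d = pd.Series(dic, dtype = str)  # the values can be cast to another type on construction

#### Checking data type

In [5]:
type(series3)

pandas.core.series.Series

In [6]:
type(series3['book'])

numpy.int64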
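
When a dictionary is passed together with an explicit `index`, only the requested labels are kept, in the requested order. The sketch below assumes an invented label `bag` that is not a key of the dictionary, to show what happens to missing keys.

```python
import pandas as pd

prices = {"pen": 10, "book": 200, "copy": 50}

# Only the requested labels are kept, in the requested order.
# "bag" is not a key of the dictionary, so its value is NaN.
subset = pd.Series(prices, index=["book", "bag", "pen"])

print(subset)
```

Because NaN is a floating-point value, the resulting Series is stored as float64 rather than as integers.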

## Creating a Series from a Scalar


In [7]:
series4 = pd.Series(1)
series4

0    1
dtype: int64

In [8]:
series5 = pd.Series(4,index=[0,1,2,3])
series5

0    4
1    4
2    4
3    4
dtype: int64

### Accessing data from Series with Position

In [9]:
series5[0]

4

Retrieving a range of indexes

In [10]:
# Selecting the first 2 elements (positions 0 and 1); position 2 is not included
s[:2] 

0    a
1    b
dtype: object

In [11]:
s[:3]

0    a
1    b
2    c
dtype: object

In [12]:
# All the elements in the Series;
# s[:] gives the same result as s
s[:]

0    a
1    b
2    c
3    d
dtype: object

In [13]:
# Gives all the elements from position 0 to the end
s[0:]

0    a
1    b
2    c
3    d
dtype: object

### Retrieving from the end

In [14]:
# Negative positions count back from the end, starting at -1;
# s[-2:] retrieves the last two elements
s[-2:]

2    c
3    d
dtype: object

In [15]:
# This raises a KeyError: with the default integer index, -1 is looked up
# as a label rather than as a position
# s[-1]
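
To get the last element by position regardless of the index labels, `iloc` can be used. The sketch below rebuilds the same four-element Series and records whether the plain `s[-1]` lookup fails.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"])   # default integer index 0..3

last = s.iloc[-1]    # position-based access: the last element

raised = False
try:
    s[-1]            # label-based lookup: there is no label -1
except KeyError:
    raised = True

print(last)
print(raised)
```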

### Selecting by label


In [16]:
s1 = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
s1['a']

1

In [17]:
# Selecting multiple elements:
# pass a list of labels, which gives double brackets [[...]]
s1[['a','c','d']]

a    1
c    3
d    4
dtype: int64

### Dropping elements from a Series

In [18]:
s1.drop('a')
s1
# we can see the element is not dropped from the original Series

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [19]:
s2 = s1.drop('a')
s2
# the drop function returns a new Series with the specified labels removed

b    2
c    3
d    4
e    5
dtype: int64

In [20]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s


0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
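
The NaN entry above is how pandas marks missing data. A short sketch of the usual ways to count, skip, fill, or remove such entries, using the same values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

n_missing = s.isna().sum()   # count the missing entries
total = s.sum()              # NaN is skipped by default
filled = s.fillna(0)         # replace NaN with a constant
dropped = s.dropna()         # remove the missing entries

print(n_missing, total)
print(filled.tolist())
print(dropped.tolist())
```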

# DataFrame
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

### Features of DataFrame
- Columns can be of different types
- Size-mutable
- Labeled axes (rows and columns)
- Can perform arithmetic operations on rows and columns
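
The last feature, arithmetic on rows and columns, might look like the following sketch. The frame and its column names are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

df["total"] = df["a"] + df["b"]         # element-wise arithmetic between columns
doubled = df * 2                        # a scalar is applied to every cell
col_sums = df.sum(axis=0)               # aggregate down each column
row_sums = df[["a", "b"]].sum(axis=1)   # aggregate across each row

print(df)
print(col_sums)
print(row_sums)
```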

### Create DataFrame
A pandas DataFrame can be created from various inputs, such as:

- Lists
- dict
- Series
- Numpy ndarrays
- Another DataFrame
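
The sections below cover lists, dictionaries, and Series. For the remaining two inputs, a minimal sketch with an invented array could look like this; `copy=True` is passed so that the second frame does not share data with the first.

```python
import numpy as np
import pandas as pd

# From a 2D NumPy array; the column names are supplied separately.
arr = np.array([[1, 2], [3, 4], [5, 6]])
from_array = pd.DataFrame(arr, columns=["x", "y"])

# From another DataFrame; copy=True requests an independent copy.
from_frame = pd.DataFrame(from_array, copy=True)
from_frame.loc[0, "x"] = 100   # does not affect from_array

print(from_array)
print(from_frame)
```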

### Creating Object

In [21]:
df = pd.DataFrame()
df

In [22]:
type(df)

pandas.core.frame.DataFrame

## DataFrame Creation from Lists

In [23]:
ls = [2,4,6,8]
df = pd.DataFrame(ls)
df

| |0|
|---|---|
|0|2|
|1|4|
|2|6|
|3|8|


In [25]:
ls = [['Ali',10],['Malik',20],['Hassan',30]]
df = pd.DataFrame(ls,columns=['Name','Age'])
df

| |Name|Age|
|---|---|---|
|0|Ali|10|
|1|Malik|20|
|2|Hassan|30|


In [26]:
type(df)

pandas.core.frame.DataFrame

In [27]:
type(df['Name'])

pandas.core.series.Series

In [28]:
type(df['Age'])

pandas.core.series.Series

In [29]:
type(df['Age'][1])

numpy.int64

In [30]:
type(df['Name'][1])

str

In [31]:
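# Cast only the numeric column; passing dtype=float for the whole frame fails on the Name strings in recent pandas
df = pd.DataFrame(ls,columns=['Name','Age']).astype({'Age': float})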
df

| |Name|Age|
|---|---|---|
|0|Ali|10.0|
|1|Malik|20.0|
|2|Hassan|30.0|


In [32]:
type(df['Age'][1])

numpy.float64

In [33]:
type(df['Name'][1])

str

## Creating a DataFrame from Dict of ndarrays / Lists

In [34]:
dic = {'Name': ['Ali','Noor','Hassan'], 'Age' :[10,45,23] }
df = pd.DataFrame(dic)
df

| |Name|Age|
|---|---|---|
|0|Ali|10|
|1|Noor|45|
|2|Hassan|23|


In [35]:
# using Custom Index
df = pd.DataFrame(dic, index=['stu1','stu2','stu3'])
df

| |Name|Age|
|---|---|---|
|stu1|Ali|10|
|stu2|Noor|45|
|stu3|Hassan|23|


## Creating a DataFrame from a List of Dictionaries

In [36]:
ls = [{'a':1, 'b':2},{'a':5, 'b':10, 'c':20}]
d  = pd.DataFrame(ls)
d
# Note − NaN (Not a Number) fills the entries that are missing from a dictionary.


| |a|b|c|
|---|---|---|---|
|0|1|2|NaN|
|1|5|10|20.0|


In [37]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

# With two column labels that match the dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

df1

| |a|b|
|---|---|---|
|first|1|2|
|second|5|10|


In [38]:
# With two column labels, one of which ('b1') is not a dictionary key, so that column is filled with NaN
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
df2

| |a|b1|
|---|---|---|
|first|1|NaN|
|second|5|NaN|


## Creating a DataFrame from a Dict of Series

In [39]:
d = {'one': pd.Series([1,2,3], index =['a','b','c']),
     'two': pd.Series([1,2,3,4], index = ['a','b','c','d'])
    }
df = pd.DataFrame(d)
df

| |one|two|
|---|---|---|
|a|1.0|1|
|b|2.0|2|
|c|3.0|3|
|d|NaN|4|


In [40]:
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [41]:
# Adding a column
df['Three'] = pd.Series([9,3,4], index=['a','b','c'])
df

| |one|two|Three|
|---|---|---|---|
|a|1.0|1|9.0|
|b|2.0|2|3.0|
|c|3.0|3|4.0|
|d|NaN|4|NaN|


### Deleting Columns

In [42]:
# Using the del statement
del df['Three']
df

| |one|two|
|---|---|---|
|a|1.0|1|
|b|2.0|2|
|c|3.0|3|
|d|NaN|4|


In [43]:
# Using the pop() method
df.pop('one')
df

| |two|
|---|---|
|a|1|
|b|2|
|c|3|
|d|4|
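
Beyond `del` and `pop()`, columns can also be removed without touching the original frame, or inserted at a chosen position. The following is a small sketch with an invented three-column frame:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2], "two": [3, 4], "three": [5, 6]})

removed = df.pop("one")                    # removes the column in place and returns it as a Series
without_three = df.drop(columns="three")   # returns a new frame; df itself is unchanged
df.insert(0, "zero", [0, 0])               # inserts a column at position 0, in place

print(removed.tolist())
print(list(without_three.columns))
print(list(df.columns))
```

`pop()` is convenient when the removed column is still needed afterwards, since it hands the column back instead of discarding it.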


### Selecting

In [44]:
# Get a column
df['two'] 

a    1
b    2
c    3
d    4
Name: two, dtype: int64

In [45]:
# Get a row - based on a label
df.loc['a'] 

two    1
Name: a, dtype: int64

In [46]:
# Get a row - based on location
df.iloc[0]

two    1
Name: a, dtype: int64

### Slice Rows

In [47]:
df[2:4]

| |two|
|---|---|
|c|3|
|d|4|
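
`loc` and `iloc` also accept a row selector and a column selector together. A minimal sketch with an invented frame; note that label slices include both endpoints while position slices exclude the stop position.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [1, 2, 3]},
    index=["a", "b", "c"],
)

by_label = df.loc["a":"b", "two"]   # label slice: both "a" and "b" are included
by_position = df.iloc[0:2, 1]       # position slice: position 2 is excluded
cell = df.loc["c", "one"]           # a single value

print(by_label)
print(by_position)
print(cell)
```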


### Adding Rows

In [52]:
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([df, df2])
df

| |a|b|
|---|---|---|
|0|1|2|
|1|3|4|
|0|5|6|
|1|7|8|
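
In the result above the index labels 0 and 1 each appear twice, because every frame keeps its own labels. If a fresh 0..n-1 index is preferred, `pd.concat` can be asked to discard the old labels, as in this sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=["a", "b"])

# ignore_index=True discards the original labels and renumbers the rows.
combined = pd.concat([df, df2], ignore_index=True)

print(combined)
```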


### Deleting Rows

In [64]:
d = {'one': pd.Series([1,2,3], index =['a','b','c']),
     'two': pd.Series([1,2,3,4], index = ['a','b','c','d'])
    }
df = pd.DataFrame(d)
df


| |one|two|
|---|---|---|
|a|1.0|1|
|b|2.0|2|
|c|3.0|3|
|d|NaN|4|


In [65]:
df.drop("a")

| |one|two|
|---|---|---|
|b|2.0|2|
|c|3.0|3|
|d|NaN|4|


In [66]:
df  
# drop() does not change the original df;
# we have to assign the result back to override it

| |one|two|
|---|---|---|
|a|1.0|1|
|b|2.0|2|
|c|3.0|3|
|d|NaN|4|


In [67]:
df = df.drop('a')
df

| |one|two|
|---|---|---|
|b|2.0|2|
|c|3.0|3|
|d|NaN|4|
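
`drop()` also accepts several labels at once, and rows can be removed by condition with a boolean filter. A brief sketch with an invented one-column frame; the label `zzz` is deliberately absent to show the `errors="ignore"` option.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

dropped = df.drop(["a", "c"])               # drop several labels at once
filtered = df[df["one"] > 2]                # keep only the rows matching a condition
ignored = df.drop("zzz", errors="ignore")   # a missing label is skipped instead of raising

print(dropped)
print(filtered)
```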


-------------------------------------------------

### Creating a Series with 5 random numbers

In [69]:
s = pd.Series(np.random.randn(5))
s

0   -0.664793
1   -0.292523
2    0.083311
3    0.438368
4   -0.111088
dtype: float64

#### Return the labels of a Series (its axes)

In [70]:
s.axes

[RangeIndex(start=0, stop=5, step=1)]

In [74]:
# Checking str labels/axis
s = pd.Series(np.random.randn(2), index=['a','b'] )
s.axes

[Index(['a', 'b'], dtype='object')]

#### Empty Object

In [75]:
s.empty

False

#### Dimension of Object

In [76]:
s.ndim

1

#### Size

In [77]:
s.size

2

#### Values

In [83]:
s.values

array([-1.16072601, -0.59255357])

#### Head and Tail

In [84]:
s.head()

a   -1.160726
b   -0.592554
dtype: float64

In [86]:
s.tail(1)

b   -0.592554
dtype: float64

DataFrame for the examples

In [90]:
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
df

| |Name|Age|Rating|
|---|---|---|---|
|0|Tom|25|4.23|
|1|James|26|3.24|
|2|Ricky|25|3.98|
|3|Vin|23|2.56|
|4|Steve|30|3.2|
|5|Smith|29|4.6|
|6|Jack|23|3.8|


#### Transpose

In [91]:
df.T

| |0|1|2|3|4|5|6|
|---|---|---|---|---|---|---|---|
|Name|Tom|James|Ricky|Vin|Steve|Smith|Jack|
|Age|25|26|25|23|30|29|23|
|Rating|4.23|3.24|3.98|2.56|3.2|4.6|3.8|


#### Axes

In [92]:
df.axes

[RangeIndex(start=0, stop=7, step=1),
 Index(['Name', 'Age', 'Rating'], dtype='object')]

#### dtypes

In [93]:
df.dtypes

Name       object
Age         int64
Rating    float64
dtype: object

#### empty

In [95]:
df.empty

False

#### ndim

In [97]:
df.ndim

2

#### shape

In [98]:
df.shape

(7, 3)

#### size

In [99]:
df.size

21

#### values

In [100]:
df.values

array([['Tom', 25, 4.23],
       ['James', 26, 3.24],
       ['Ricky', 25, 3.98],
       ['Vin', 23, 2.56],
       ['Steve', 30, 3.2],
       ['Smith', 29, 4.6],
       ['Jack', 23, 3.8]], dtype=object)

#### Head and Tail

In [101]:
df.head()

| |Name|Age|Rating|
|---|---|---|---|
|0|Tom|25|4.23|
|1|James|26|3.24|
|2|Ricky|25|3.98|
|3|Vin|23|2.56|
|4|Steve|30|3.2|


# Descriptive Statistics

DataFrame for the examples

In [102]:
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
df

| |Name|Age|Rating|
|---|---|---|---|
|0|Tom|25|4.23|
|1|James|26|3.24|
|2|Ricky|25|3.98|
|3|Vin|23|2.56|
|4|Steve|30|3.2|
|5|Smith|29|4.6|
|6|Jack|23|3.8|
|7|Lee|34|3.78|
|8|David|40|2.98|
|9|Gasper|30|4.8|
|10|Betina|51|4.1|
|11|Andres|46|3.65|


#### sum()

In [108]:
df.sum()

# Note:
# sum() concatenates the strings in the Name column

Name      TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Age                                                     382
Rating                                                44.92
dtype: object

#### mean()

In [105]:
# numeric_only=True skips the Name column; pandas 2.0+ raises an error without it
df.mean(numeric_only=True)

Age       31.833333
Rating     3.743333
dtype: float64

#### std()

In [107]:
# numeric_only=True skips the Name column; pandas 2.0+ raises an error without it
df.std(numeric_only=True)

Age       9.232682
Rating    0.661628
dtype: float64

**Note** − Since a DataFrame is a heterogeneous data structure, not every function works on every column.

- Functions like sum() and cumsum() work with both numeric and character (or string) data elements without any error. Although character aggregations are rarely used in practice, these functions do not raise any exception.

- Functions like abs() and cumprod() raise an exception when the DataFrame contains character or string data, because such operations cannot be performed.
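
One way to sidestep these exceptions is to restrict the operation to the numeric columns first. The sketch below uses a shortened, three-row version of the example frame; `select_dtypes` and the `numeric_only` argument are standard pandas features, while the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tom", "James", "Ricky"],
    "Age": [25, 26, 25],
    "Rating": [4.23, 3.24, 3.98],
})

numeric = df.select_dtypes(include="number")   # keep only the numeric columns
absolute = numeric.abs()                       # safe: no string column is involved
running_product = numeric["Age"].cumprod()
means = df.mean(numeric_only=True)             # alternative: let the function skip strings

print(list(numeric.columns))
print(running_product.tolist())
print(means)
```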

# Summarizing the Data

#### describe()

In [109]:
df.describe()

| |Age|Rating|
|---|---|---|
|count|12.0|12.0|
|mean|31.833333|3.743333|
|std|9.232682|0.661628|
|min|23.0|2.56|
|25%|25.0|3.23|
|50%|29.5|3.79|
|75%|35.5|4.1325|
|max|51.0|4.8|


This function reports the count, mean, standard deviation, minimum, quartiles (from which the IQR can be derived), and maximum. By default it excludes the character columns and summarizes only the numeric columns. The 'include' argument specifies which columns are considered for the summary. It takes a list of values; when it is omitted, only numeric columns are summarized.

- object − Summarizes String columns
- number − Summarizes Numeric columns
- all − Summarizes all columns together (pass it as the string 'all', not as a list value)
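
As a rough sketch of these options on a small invented frame, `include` takes a list of type names, `exclude` does the opposite, and `percentiles` changes which quantiles are reported:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tom", "James", "Ricky", "Vin"],
    "Age": [25, 26, 25, 23],
})

numeric_summary = df.describe(include=["number"])   # numeric columns only
text_summary = df.describe(exclude=["number"])      # everything except numeric columns
custom = df.describe(percentiles=[0.1, 0.9])        # report the 10th and 90th percentiles

print(numeric_summary)
print(text_summary)
print(custom)
```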


In [110]:
df.describe(include=object)

| |Name|
|---|---|
|count|12|
|unique|12|
|top|Ricky|
|freq|1|


In [114]:
df.describe(include='all')

| |Name|Age|Rating|
|---|---|---|---|
|count|12|12.0|12.0|
|unique|12|NaN|NaN|
|top|Ricky|NaN|NaN|
|freq|1|NaN|NaN|
|mean|NaN|31.833333|3.743333|
|std|NaN|9.232682|0.661628|
|min|NaN|23.0|2.56|
|25%|NaN|25.0|3.23|
|50%|NaN|29.5|3.79|
|75%|NaN|35.5|4.1325|
|max|NaN|51.0|4.8|
