# Pandas
* Python library for data analysis and manipulation
* Open source
* Used in **machine learning** to:
    * read a dataset
    * extract the dependent and independent variables from a dataset
    * calculate statistics about the data, such as the mean, median, max, or min of each column
    * clean the data, for example by removing missing values and filtering rows or columns by some criteria
* Plays a leading role in the data preprocessing phase of machine learning, particularly for handling missing values.
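
The tasks listed above can be sketched on a tiny dataset. This is only an illustration: the column names `hours` and `score` and the values are made up, and a real dataset would normally be loaded from a file (for example with `pd.read_csv`) rather than typed in.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value (NaN) in the "hours" column
df_demo = pd.DataFrame({
    "hours": [1.0, 2.0, np.nan, 4.0],
    "score": [10, 20, 30, 40],
})

# Extract the independent variable(s) and the dependent variable
X = df_demo[["hours"]]   # DataFrame with the feature column
y = df_demo["score"]     # Series with the target column

# Column statistics; missing values are skipped by default
mean_hours = df_demo["hours"].mean()
print(mean_hours)

# Clean the data: drop rows with missing values, then filter by a criterion
cleaned = df_demo.dropna()
high = cleaned[cleaned["score"] > 10]
print(high)
```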

__Commands for installing pandas__
* In the Anaconda Prompt (or any shell where the conda `base` environment is active) use
```shell
conda install pandas
```
* In a Linux terminal or another command-line shell (not the Python console itself) use
```shell
pip3 install pandas
```
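
To check that the installation worked, one simple option is to import the library and print its version. This is a minimal sketch; the version string printed depends on your environment.

```python
import pandas as pd

# __version__ is a string; its exact value depends on the installed release
installed_version = pd.__version__
print(installed_version)
```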

## Core components of Pandas
* The two primary components of Pandas are:
    * Series: a one-dimensional labeled array, comparable to a single column
    * DataFrame: a two-dimensional labeled table, made up of a collection of Series
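
The relationship between the two components can be seen by pulling one column out of a DataFrame. A small sketch, using a made-up two-column table (the name `fruit` is only for illustration):

```python
import pandas as pd

# A tiny DataFrame with two columns
fruit = pd.DataFrame({"apples": [3, 4], "mangoes": [5, 7]})

# Selecting a single column gives back a Series
col = fruit["apples"]
print(type(col))

# A Series has one dimension, a DataFrame has two
print(col.ndim, fruit.ndim)
```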

In [24]:
import pandas as pd

The cell above imports the pandas library under the alias `pd`.

## Series
```python
a = pd.Series(data, index=index)  # index is optional
```
_data_ = scalar value, Python dictionary, list, or NumPy ndarray<br>
_index_ = labels to use as the index (optional; defaults to 0, 1, 2, ...)<br><br>
_When data is a list or array, the provided index must have the same length as the data._
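
The length rule can be checked directly: passing a list of three values with only two index labels raises a `ValueError`. A minimal sketch (the flag `mismatch_raised` is just an illustrative name):

```python
import pandas as pd

try:
    # 3 values but only 2 index labels
    pd.Series([1, 2, 3], index=["a", "b"])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```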

In [25]:
#creating series with list
list1=[10,2,3,4,5]
s1 = pd.Series(list1)
print(s1)
print()
print(type(s1))

0    10
1     2
2     3
3     4
4     5
dtype: int64

<class 'pandas.core.series.Series'>


_Notice_ the type of _s1_: it is a pandas `Series`, not a Python list.

In [26]:
#creating series from a list of mixed types (default integer index); the dtype becomes object
s1 = pd.Series([10,2,3,4,5,"tawishi",9.05])
print(s1)
print(type(s1))

0         10
1          2
2          3
3          4
4          5
5    tawishi
6       9.05
dtype: object
<class 'pandas.core.series.Series'>


In [27]:
#Series with custom indexing
index = ['June','July','Feb','Jan']
s2 = pd.Series(['Ámit','Nitin','Rohan','Little'],index=index)
print(s2)

June      Ámit
July     Nitin
Feb      Rohan
Jan     Little
dtype: object


In [28]:
index=[10,20,30,40]
s2 = pd.Series(['Ámit','Nitin','Rohan','Little'],index=index)
print(s2)

10      Ámit
20     Nitin
30     Rohan
40    Little
dtype: object


In [29]:
#creating series with scalar value 
index=[10,20,30,40]
s2 = pd.Series(9,index=index)
print(s2)

10    9
20    9
30    9
40    9
dtype: int64


In [30]:
#creating series from dictionary
# avoid naming the variable `dict`, which would shadow the Python built-in
name_dict = {10:"Ämit",30:"Little",20:"Rahul",40:"Tawishi"}
s3=pd.Series(name_dict)
print(s3)

10       Ämit
30     Little
20      Rahul
40    Tawishi
dtype: object
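
When a Series is built from a dictionary, the keys become the index labels and keep their insertion order. The sketch below (with illustrative names and values) looks up a value by its label and then sorts by index:

```python
import pandas as pd

names = pd.Series({10: "Amit", 30: "Little", 20: "Rahul"})

# Keys are used as index labels, in insertion order
print(list(names.index))

# Look up a value by its label
value_at_30 = names.loc[30]
print(value_at_30)

# sort_index() returns a new Series ordered by label
sorted_names = names.sort_index()
print(sorted_names)
```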


In [31]:
#creating series from numpy array
import numpy as np
index=['a','b','c','d']  #can be numpy array
arr=np.array([10,20,30,40])
s4=pd.Series(arr,index=index)
print(s4)

a    10
b    20
c    30
d    40
dtype: int64


## Dataframe
* A DataFrame can be created from a dictionary: each key becomes a column name and each list of values becomes that column
* index = row labels
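
As a quick sketch of the two points above: if no index is given, the rows are labeled 0, 1, 2, ...; if one is given, its entries become the row labels. The names `demo_data`, `default_df`, and `labeled_df` are only illustrative.

```python
import pandas as pd

demo_data = {"apples": [3, 4, 0, 8], "Mangoes": [5, 7, 9, 3]}

# No index given: rows get the default integer labels
default_df = pd.DataFrame(demo_data)
print(list(default_df.index))

# Index given: its entries become the row labels
labeled_df = pd.DataFrame(demo_data, index=["Nitin", "Amit", "Rohan", "Likita"])
print(labeled_df.loc["Amit", "apples"])
```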

#### Defining data for creating Dataframe

In [32]:
data = {'apples':[3,4,0,8],'Mangoes':[5,7,9,3]}

In [33]:
Index=['Nitin','Amit','Rohan','Likita']    #row names/indices
df = pd.DataFrame(data, index=Index)

print(df)
print(type(df))

        apples  Mangoes
Nitin        3        5
Amit         4        7
Rohan        0        9
Likita       8        3
<class 'pandas.core.frame.DataFrame'>


### Operations on Dataframe

#### Dimensions/Size of Dataframe

In [34]:
df.shape

(4, 2)

#### Displaying indices in Dataframe

In [35]:
df.index

Index(['Nitin', 'Amit', 'Rohan', 'Likita'], dtype='object')

#### Displaying column names in Dataframe

In [36]:
df.columns

Index(['apples', 'Mangoes'], dtype='object')

#### Total indices

In [37]:
len(df.index)

4

#### Total columns

In [38]:
len(df.columns)

2
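
The three results above are related: `shape` is simply the pair (number of rows, number of columns). A small sketch that rebuilds the same table and checks this:

```python
import pandas as pd

df = pd.DataFrame(
    {"apples": [3, 4, 0, 8], "Mangoes": [5, 7, 9, 3]},
    index=["Nitin", "Amit", "Rohan", "Likita"],
)

n_rows, n_cols = df.shape
print(n_rows == len(df.index))    # rows match the length of the index
print(n_cols == len(df.columns))  # columns match the number of column names
print(df.size)                    # total number of cells: rows * columns
```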

#### Displaying first n entries of Dataframe

In [39]:
#use of head(n) method
#default: n=5
df.head()

        apples  Mangoes
Nitin        3        5
Amit         4        7
Rohan        0        9
Likita       8        3


In [40]:
#use of tail(n) method
#default: n=5
df.tail()

        apples  Mangoes
Nitin        3        5
Amit         4        7
Rohan        0        9
Likita       8        3


In [41]:
print(df.tail(2))

        apples  Mangoes
Rohan        0        9
Likita       8        3


In [42]:
print(df.head(3))

       apples  Mangoes
Nitin       3        5
Amit        4        7
Rohan       0        9
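
Because this DataFrame has only four rows, `head()` and `tail()` with the default `n=5` both return the whole table. The sketch below checks that, and also shows that a negative `n` in `head` returns everything except the last `|n|` rows:

```python
import pandas as pd

df = pd.DataFrame(
    {"apples": [3, 4, 0, 8], "Mangoes": [5, 7, 9, 3]},
    index=["Nitin", "Amit", "Rohan", "Likita"],
)

# Fewer than 5 rows, so the defaults return the full table
same = df.head().equals(df.tail())
print(same)

# Negative n: all rows except the last one
all_but_last = df.head(-1)
print(all_but_last)
```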


#### info( ) method

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Nitin to Likita
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   apples   4 non-null      int64
 1   Mangoes  4 non-null      int64
dtypes: int64(2)
memory usage: 96.0+ bytes


#### describe( ) method

In [44]:
df.describe() #summary statistics for each numeric column

         apples   Mangoes
count  4.000000  4.000000
mean   3.750000  6.000000
std    3.304038  2.581989
min    0.000000  3.000000
25%    2.250000  4.500000
50%    3.500000  6.000000
75%    5.000000  7.500000
max    8.000000  9.000000
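
Each row of the `describe()` output can also be computed on its own. The sketch below reproduces the `apples` column of the summary; note that `std()` is the sample standard deviation (it divides by n - 1).

```python
import pandas as pd

apples = pd.Series([3, 4, 0, 8], name="apples")

print(apples.count())         # number of non-missing values
print(apples.mean())          # 3.75
print(apples.std())           # sample standard deviation, about 3.304
print(apples.quantile(0.25))  # 25% row: 2.25
print(apples.median())        # 50% row: 3.5
print(apples.quantile(0.75))  # 75% row: 5.0
```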


#### Transpose of Dataframe

In [45]:
#transpose Dataframe
df.T

         Nitin  Amit  Rohan  Likita
apples       3     4      0       8
Mangoes      5     7      9       3
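
Transposing swaps the row labels and the column names, so the shape flips from (4, 2) to (2, 4). A short sketch that checks this and shows that transposing twice gives back the original table:

```python
import pandas as pd

df = pd.DataFrame(
    {"apples": [3, 4, 0, 8], "Mangoes": [5, 7, 9, 3]},
    index=["Nitin", "Amit", "Rohan", "Likita"],
)

transposed = df.T
print(transposed.shape)        # rows and columns are swapped
print(list(transposed.index))  # the old column names are now row labels

# Transposing twice returns the original layout
round_trip = transposed.T
print(round_trip.equals(df))
```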


### Indexing : selecting data in Pandas dataframe

There are three common ways of selecting data in a DataFrame:
* using indexing operator
```python
DataFrame[]
```
* using labels
```python
DataFrame.loc[]
```
* using positions
```python
DataFrame.iloc[]
```

In [46]:
# way1 = selection by indexing operator
x = df['apples']
print(x)

Nitin     3
Amit      4
Rohan     0
Likita    8
Name: apples, dtype: int64


In [47]:
y = df['Mangoes']
print(y)

Nitin     5
Amit      7
Rohan     9
Likita    3
Name: Mangoes, dtype: int64
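
The indexing operator also accepts a list of column names (which returns a DataFrame rather than a Series) and a boolean condition (which filters rows). A minimal sketch; the names `both` and `many_apples` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"apples": [3, 4, 0, 8], "Mangoes": [5, 7, 9, 3]},
    index=["Nitin", "Amit", "Rohan", "Likita"],
)

# A list of column names returns a DataFrame
both = df[["apples", "Mangoes"]]
print(type(both))

# A boolean condition keeps only the rows where it is True
many_apples = df[df["apples"] > 3]
print(many_apples)
```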


In [48]:
# way2 = using labels
#using .loc[]

x = df.loc[:,:]    #all rows and columns
print(x)
print()
x = df.loc[:,'apples']
print(x)
print()
x = df.loc[:'Rohan',:]
print(x)


        apples  Mangoes
Nitin        3        5
Amit         4        7
Rohan        0        9
Likita       8        3

Nitin     3
Amit      4
Rohan     0
Likita    8
Name: apples, dtype: int64

       apples  Mangoes
Nitin       3        5
Amit        4        7
Rohan       0        9


In [49]:
#way3 : selection by position/location; preferred way
#using .iloc[] (the end position of a slice is excluded)

x = df.iloc[0:3,0:2]
print(x)
print()
x = df.iloc[0:2,0:2]
print(x)
print()

       apples  Mangoes
Nitin       3        5
Amit        4        7
Rohan       0        9

       apples  Mangoes
Nitin       3        5
Amit        4        7
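
One difference between the two accessors is easy to miss: a label slice with `.loc[]` includes its end label, while a position slice with `.iloc[]` excludes its end position. A short sketch comparing them on the same table:

```python
import pandas as pd

df = pd.DataFrame(
    {"apples": [3, 4, 0, 8], "Mangoes": [5, 7, 9, 3]},
    index=["Nitin", "Amit", "Rohan", "Likita"],
)

# Label slice: 'Rohan' is included, so 3 rows come back
by_label = df.loc[:"Rohan"]
print(len(by_label))

# Position slice: position 2 is excluded, so 2 rows come back
by_position = df.iloc[:2]
print(len(by_position))

# The same single cell, reached by label and by position
print(df.loc["Amit", "Mangoes"], df.iloc[1, 1])
```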

