# © 2022 by <b> Aspire Academy </b>, All Rights Reserved.

# DATA HANDLING USING PANDAS PART - 01

# INTRO TO PANDAS

Data Science or data analytics is the process of analyzing a large set of data points to get answers to questions related to that dataset.

The need for data analytics arises from the huge volumes of data that large business organizations, government bodies, communities and consumers have to handle.

# How to handle this big data?

It can be handled through the **Data Life Cycle** (DLC).

# STEPS IN DLC

1. Data is stored in different formats, such as a .csv file, an Excel file, or an HTML file.
2. This data is transformed or converted into a single format and stored in one place; this is where data warehousing comes into the picture.
3. Once the data is stored, we can analyze it, i.e., perform join and merge operations, search operations, etc.
4. Once the analysis is done, we can plot the data in the form of a graph, which is data visualization.


All these sequences of operations for data analysis can be easily and effectively performed by Python and its libraries.
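The four DLC steps can be sketched in a few lines of pandas. This is only an illustrative sketch: the two CSV "files" are simulated with in-memory text, and the city names, column names and numbers are made-up sample data, not part of the original notes.

```python
import io
import pandas as pd

# Step 1: raw data arrives in some file format. Two small CSV "files"
# are simulated here with in-memory text (hypothetical sample data).
sales_csv = io.StringIO("city,units\nDelhi,120\nMumbai,95\nPune,40\n")
prices_csv = io.StringIO("city,price\nDelhi,10.0\nMumbai,12.5\nPune,8.0\n")

# Step 2: bring everything into a single format (pandas DataFrames).
sales = pd.read_csv(sales_csv)
prices = pd.read_csv(prices_csv)

# Step 3: analysis, e.g. a merge (join) followed by a computed column.
report = sales.merge(prices, on="city")
report["revenue"] = report["units"] * report["price"]
print(report)

# Step 4: visualization would normally follow (for example report.plot(...)),
# which needs matplotlib to be installed, so it is left out of this sketch.
```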

# Advantages of PANDAS Library(modules)


+ It makes data importing and data analyzing easier.

+ It is a high-level data manipulation tool developed by **Wes McKinney**. It is built on NumPy and works with matplotlib for plotting, giving us a single and convenient place for data analysis and data visualization work.

+ The term Pandas is derived from **Panel Data**, which is an econometric term for multidimensional, structured datasets.

## INSTALLING PANDAS

The Pandas module doesn't come bundled with standard Python.

Therefore, it has to be installed from PyPI using the **pip** command.

+ pip is commonly read as the recursive acronym **"Pip Installs Packages"**; **"Preferred Installer Program"** is another popular expansion
+ PyPI Stands for **"Python Package Index"**

In [5]:
!pip install pandas

Defaulting to user installation because normal site-packages is not writeable


In [6]:
## Importing PANDAS
import pandas as pd 
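
A quick way to confirm that the installation and import worked is to print the installed version. The exact version string depends on your environment, so no output is shown here.

```python
import pandas as pd

# Confirm the import worked and see which pandas release is installed.
print(pd.__version__)
```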

## DATA STRUCTURES IN PANDAS

A data structure is a way of storing and organizing data in a computer so that it can be accessed and worked with in an appropriate way.

**SERIES**

It is a one-dimensional structure storing homogeneous (same data type) mutable (which can be modified) data.

**DATAFRAMES**

It is a two-dimensional structure storing heterogeneous (multiple data type) mutable data.

**Panel**

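It is a three-dimensional way of storing items. Panel has been removed from recent versions of pandas, so only Series and DataFrame are used in practice.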
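
As a small sketch of the difference between the first two structures, the example below builds one Series and one DataFrame and checks their dimensions. The student names and marks are invented sample data.

```python
import pandas as pd

# A Series: one column of values of a single type, with an index.
marks = pd.Series([78, 85, 91], index=["Asha", "Ravi", "Meena"])

# A DataFrame: several columns, each of which may have its own type.
students = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],
    "marks": [78, 85, 91],
    "passed": [True, True, True],
})

print(marks.ndim)        # number of dimensions of the Series
print(students.ndim)     # number of dimensions of the DataFrame
print(students.dtypes)   # one dtype per column
```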

# SERIES

1. One Dimensional Structure
2. Homogeneous (same type) data
3. Mutable (can be modified)
4. Size of Series data is **immutable**
5. It can be created using **Series()**
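
The sketch below illustrates points 3 and 4 under one reading of them: an existing value can be overwritten in place, while a method such as `drop()` hands back a new Series instead of shrinking the original. The variable names are only for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Value mutability: an existing element can be overwritten in place.
s["b"] = 99
print(s)

# drop() returns a new, shorter Series; `s` itself keeps its length.
shorter = s.drop("c")
print(len(s), len(shorter))
```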


## CREATION OF SERIES

### SYNTAX

>> **import pandas as pd** <br>
>> **pd.Series(data, index = idx)**


The data supplied to Series() can be either:

1. List
2. Ndarray
3. Scalar Value
4. Dictionary
5. Mathematical expression/function

### MAKING SERIES USING LIST, RANGE(), SCALAR VALUE

In [8]:
# create a series using List
import pandas as pd
import numpy as np

series1 = pd.Series([10,20,30,40])
series1

0    10
1    20
2    30
3    40
dtype: int64

In [2]:
# create a series using two different lists

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
no_days = [31,28,31,30,31,30]

series = pd.Series(no_days, index = months)
print(series)

Jan    31
Feb    28
Mar    31
Apr    30
May    31
Jun    30
dtype: int64


In [3]:
#using range() method

series1 = pd.Series(range(5))
series1

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [4]:
# more example
series = pd.Series([1,2,3,4,5,6])

series.index = ['First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth']
series

First     1
Second    2
Third     3
Fourth    4
Fifth     5
Sixth     6
dtype: int64

In [6]:
# handling FP values for generating a series
series1 = pd.Series([2,4,5,6,7.5])
series1

0    2.0
1    4.0
2    5.0
3    6.0
4    7.5
dtype: float64

In [11]:
#using missing values (NaN)
s = pd.Series([7.5, 5.4, np.nan, -34.5])   # np.nan (lowercase) works in all NumPy versions
print(s)

0     7.5
1     5.4
2     NaN
3   -34.5
dtype: float64
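
Once a Series contains NaN, a few standard methods help to find or skip the missing entries. This is a minimal sketch using the same values as the cell above.

```python
import numpy as np
import pandas as pd

s = pd.Series([7.5, 5.4, np.nan, -34.5])

print(s.isna())     # True where the value is missing
print(s.hasnans)    # True if at least one value is missing
print(s.sum())      # NaN is skipped by default when summing
print(s.dropna())   # new Series without the missing entries
```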


In [12]:
# more example
series2 = pd.Series([10,20,30,40,50], index = ['a','b','c','d','e'])
series2

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [13]:
series2['d']

40

In [14]:
series2[['a', 'c', 'e']]

a    10
c    30
e    50
dtype: int64

In [15]:
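## perform indexing, slicing and accessing data from a series

import pandas as pd

s = pd.Series([1,2,3,4], index = ['a', 'b', 'c', 'd'])
print(s.iloc[0])    # first element by position (s[0] is deprecated for labelled indexes)
print(s[:3])        # first three elements
print(s[-3:])       # last three elements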

1
a    1
b    2
c    3
dtype: int64
b    2
c    3
d    4
dtype: int64


## iloc and loc

**iloc**

+ It is used for indexing or selecting based on position, i.e., by row number and column number.
+ It refers to **position-based indexing**.

**loc**

+ loc is used for indexing or selecting based on name, i.e., by row name and column name.
+ It refers to **name-based indexing**.

In [18]:
s = pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e'])

print(s.iloc[1:4])          # position-based indexing: end position 4 is excluded
print(s.loc['b':'e'])       # name-based indexing: end label 'e' is included

b    2
c    3
d    4
dtype: int64
b    2
c    3
d    4
e    5
dtype: int64
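
The difference between `loc` and `iloc` is easiest to see when the labels are themselves integers. The following sketch uses a made-up Series whose labels are 10, 20 and 30.

```python
import pandas as pd

s = pd.Series(["x", "y", "z"], index=[10, 20, 30])

by_label = s.loc[10]           # the element whose label is 10
by_position = s.iloc[0]        # the element at position 0
label_slice = s.loc[10:20]     # label slice: both 10 and 20 are included
position_slice = s.iloc[0:1]   # position slice: position 1 is excluded

print(by_label, by_position)
print(label_slice.tolist())
print(position_slice.tolist())
```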


In [19]:
# series using scalar value
s1 = pd.Series(10, index = range(0,3))
s1

0    10
1    10
2    10
dtype: int64

In [20]:
s2 = pd.Series(15, index = range(1,6,2))
s2

1    15
3    15
5    15
dtype: int64

In [27]:
# To create a series by repeating a scalar string for every label in the index

import pandas as pd

s = pd.Series("WELCOME TO ASPIRE ACADEMY", index = ['Fiza', 'Pritvi', 'Tanmoy', 'Rahul', 'Sushil'])
s

Fiza      WELCOME TO ASPIRE ACADEMY
Pritvi    WELCOME TO ASPIRE ACADEMY
Tanmoy    WELCOME TO ASPIRE ACADEMY
Rahul     WELCOME TO ASPIRE ACADEMY
Sushil    WELCOME TO ASPIRE ACADEMY
dtype: object

In [28]:
# To create a series with range() and an index built by a list comprehension

s = pd.Series(range(1,15,3), index = [x for x in 'abcde'])
s

a     1
b     4
c     7
d    10
e    13
dtype: int64

### CREATING SERIES USING DICTIONARY

In [38]:
# create series using a dictionary

import pandas as pd
series = pd.Series({'Jan': 31, 'Feb':28, 'Mar':31, 'Apr':30})
series

Jan    31
Feb    28
Mar    31
Apr    30
dtype: int64
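
When a dictionary is combined with an explicit `index`, pandas picks the values by key, and any label that is not a key in the dictionary becomes NaN. A short sketch, where the label `'May'` is deliberately missing from the dictionary:

```python
import pandas as pd

days = {'Jan': 31, 'Feb': 28, 'Mar': 31, 'Apr': 30}

# Only the requested labels are kept, in the requested order.
# 'May' is not a key in the dictionary, so its value is NaN.
subset = pd.Series(days, index=['Feb', 'Apr', 'May'])
print(subset)
```

Because NaN is a floating-point value, the resulting Series has a float dtype even though the dictionary values are integers.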

### NAMING A SERIES

In [39]:
import pandas as pd

series = pd.Series({'Jan': 31, 'Feb': 28, 'Mar': 31, 'Apr': 30})

# naming a series
series.name = "Days"
series.index.name = "Month"
series

Month
Jan    31
Feb    28
Mar    31
Apr    30
Name: Days, dtype: int64

In [40]:
# generate a series using a mathematical expression
import pandas as pd
import numpy as np

s1 = np.arange(10,15)
print(s1)

sobj = pd.Series(index = s1, data = s1*4)
sobj

[10 11 12 13 14]


10    40
11    44
12    48
13    52
14    56
dtype: int64

In [42]:
# To generate a series using a mathematical function (exponentiation)

s1 = np.arange(10,15)
print(s1)

sobj = pd.Series(index = s1, data = s1**2)
sobj

[10 11 12 13 14]


10    100
11    121
12    144
13    169
14    196
dtype: int64

## SERIES OBJECT ATTRIBUTES

| Attribute         |  Description                                     |
| ----------------- | -----------------------------------------------  |
| Series.index      | Returns index of the series                      |
| Series.values     | Returns ndarray                                  |
| Series.dtype      | Returns dtype object of the underlying data      |
| Series.shape      | Returns tuple of the shape of underlying data    |
| Series.nbytes     | Returns number of bytes of underlying data       |
| Series.ndim       | Returns the number of dimensions                 |
| Series.size       | Returns number of elements                       |
| Series.itemsize   | Returns the size of the dtype in bytes (removed in pandas 1.0; use Series.dtype.itemsize) |
| Series.hasnans    | Returns True if there are any NaN                |
| Series.empty      | Returns True if series object is empty           |

In [46]:
import pandas as pd

s = pd.Series(range(1,15,3), index = [x for x in 'abcde'])

In [45]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [47]:
s.values

array([ 1,  4,  7, 10, 13])

In [48]:
s.shape

(5,)

In [49]:
s.size  

5

In [50]:
s.nbytes

40

In [51]:
s.ndim

1

In [52]:
s.empty

False

In [53]:
s.hasnans

False

In [54]:
s.dtype

dtype('int64')
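
The attributes above are related to each other: the number of bytes equals the number of elements times the size of one element. The sketch below checks this, assuming a NumPy-backed `int64` Series (the dtype is set explicitly so the result does not depend on the platform).

```python
import pandas as pd

s = pd.Series(range(1, 15, 3), index=list("abcde"), dtype="int64")

# Size of one element in bytes, read from the underlying NumPy dtype.
item_bytes = s.dtype.itemsize

print(item_bytes, s.size, s.nbytes)
print(s.nbytes == s.size * item_bytes)
```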

### Retrieving Values from a series using head() and tail() functions

In [58]:
import pandas as pd

series1 = pd.Series([10,20,30,40,50], index = ['a', 'b', 'c','d','e'])
print(series1)
print("-------------------------")
print(series1.head(2))
print("-------------------------")
print(series1.tail(2))

a    10
b    20
c    30
d    40
e    50
dtype: int64
-------------------------
a    10
b    20
dtype: int64
-------------------------
d    40
e    50
dtype: int64
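
Both functions also work without an argument, in which case they return five elements, and they accept a negative count. A small sketch on a ten-element Series:

```python
import pandas as pd

s = pd.Series(range(10))

first_five = s.head()            # no argument: first 5 elements
all_but_last_two = s.head(-2)    # negative n: everything except the last 2
all_but_first_two = s.tail(-2)   # negative n: everything except the first 2

print(first_five.tolist())
print(all_but_last_two.tolist())
print(all_but_first_two.tolist())
```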


### MATHEMATICAL OPERATIONS ON SERIES

In [59]:
s = pd.Series([11,12,13,14], index = range(1,5))
s1 = pd.Series([21,22,23,24], index = range(1,5))
s3 = pd.Series([21,22,23,24], index = range(101,105))

print(s)
print("-----------------------")
print(s1)
print("-----------------------")
print(s3)

1    11
2    12
3    13
4    14
dtype: int64
-----------------------
1    21
2    22
3    23
4    24
dtype: int64
-----------------------
101    21
102    22
103    23
104    24
dtype: int64


In [60]:
s + s1

1    32
2    34
3    36
4    38
dtype: int64

In [61]:
s + s3

1     NaN
2     NaN
3     NaN
4     NaN
101   NaN
102   NaN
103   NaN
104   NaN
dtype: float64
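
The result of `s + s3` is all NaN because pandas aligns the two Series by index label before adding, and here no label appears in both. If a missing label should be treated as 0 instead, the `add()` method with `fill_value` is one option, as this sketch shows:

```python
import pandas as pd

s = pd.Series([11, 12, 13, 14], index=range(1, 5))
s3 = pd.Series([21, 22, 23, 24], index=range(101, 105))

# Labels present in only one Series are paired with fill_value (0 here)
# instead of producing NaN.
total = s.add(s3, fill_value=0)
print(total)
```

The result still has a float dtype, because the alignment step introduces missing values before they are filled.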

In [62]:
s*s1

1    231
2    264
3    299
4    336
dtype: int64

In [63]:
s/s1

1    0.523810
2    0.545455
3    0.565217
4    0.583333
dtype: float64

## VECTOR OPERATIONS ON SERIES

In [64]:
s

1    11
2    12
3    13
4    14
dtype: int64

In [65]:
s+2

1    13
2    14
3    15
4    16
dtype: int64

In [66]:
s*3

1    33
2    36
3    39
4    42
dtype: int64

In [67]:
s > 13

1    False
2    False
3    False
4     True
dtype: bool

In [68]:
s**2

1    121
2    144
3    169
4    196
dtype: int64
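
Vector operations are not limited to arithmetic operators. NumPy functions can also be applied to a whole Series at once, and the index is preserved. A minimal sketch with the same Series `s`:

```python
import numpy as np
import pandas as pd

s = pd.Series([11, 12, 13, 14], index=range(1, 5))

# A NumPy function is applied element by element; the labels 1..4 are kept.
roots = np.sqrt(s)
print(roots)

# Aggregations reduce the whole Series to a single value.
print(s.sum(), s.mean())
```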

## RETRIEVING VALUES USING CONDITIONS

In [70]:
import pandas as pd

s = pd.Series([1.000000, 1.414214, 1.73051, 2.000000])
s

0    1.000000
1    1.414214
2    1.730510
3    2.000000
dtype: float64

In [71]:
s < 2

0     True
1     True
2     True
3    False
dtype: bool

In [72]:
s[s<2]

0    1.000000
1    1.414214
2    1.730510
dtype: float64

In [73]:
s[s>=2]

3    2.0
dtype: float64
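
Several conditions can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses, because these operators bind more tightly than the comparisons. A short sketch on the same values:

```python
import pandas as pd

s = pd.Series([1.000000, 1.414214, 1.73051, 2.000000])

middle = s[(s > 1) & (s < 2)]      # greater than 1 AND less than 2
edges = s[(s <= 1) | (s >= 2)]     # at most 1 OR at least 2
not_small = s[~(s < 1.5)]          # NOT less than 1.5

print(middle)
print(edges)
print(not_small)
```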

## DELETING ELEMENTS FROM A SERIES

In [74]:
s

0    1.000000
1    1.414214
2    1.730510
3    2.000000
dtype: float64

In [75]:
s.drop(3)

0    1.000000
1    1.414214
2    1.730510
dtype: float64
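
Note that `drop()` returns a new Series and leaves `s` unchanged. To keep the result, assign it to a name. A list of labels removes several elements at once. A minimal sketch:

```python
import pandas as pd

s = pd.Series([1.000000, 1.414214, 1.73051, 2.000000])

# drop() does not modify `s`; the shorter Series is a separate object.
trimmed = s.drop(3)
print(len(s), len(trimmed))

# Reassign to keep the change; a list drops several labels in one call.
s = s.drop([0, 1])
print(s.index.tolist())
```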