## Contents

- Pandas Intro
- Creating series data
- Creating data frames
- Accessing data from series/data frames
- Reading data from environment
- Explore data
- Converting data between formats (for example, JSON to CSV)

## Pandas intro

- Development started in 2008 by Wes McKinney
- Used for analysing, cleaning, exploring, and manipulating data

## What pandas can do

- Finding the correlation between columns
- Handling missing values
- Computing the average, maximum, and minimum
- Plotting
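
The capabilities listed above can be sketched on a small, made-up table. The column names `hours` and `score` and their values are illustrative assumptions, not data from this notebook. Plotting is left out because it additionally needs matplotlib.

```python
import numpy as np
import pandas as pd

# Hypothetical data: hours studied and exam score, with one missing score
marks = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0],
    "score": [10.0, 20.0, np.nan, 40.0, 50.0],
})

# Handling missing values: count them, then fill with the column mean
n_missing = int(marks["score"].isna().sum())
filled = marks["score"].fillna(marks["score"].mean())

# Average, maximum, and minimum of a column
avg, highest, lowest = filled.mean(), filled.max(), filled.min()

# Correlation between two columns (Pearson by default)
corr = marks["hours"].corr(filled)

print(n_missing, avg, highest, lowest, round(corr, 2))
```

Filling with the mean is only one of several strategies; `dropna()` would instead discard the incomplete row.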

In [1]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas

In [3]:
pandas.__version__

'1.2.4'

In [4]:
import pandas as pd

## Series

A Series is a one-dimensional labelled array. It can be created from:

- a list
- a tuple
- a NumPy array

In [6]:
s1= pd.Series([1,2,3,"kakinada"])

print(s1)
print(type(s1))

0           1
1           2
2           3
3    kakinada
dtype: object
<class 'pandas.core.series.Series'>


In [7]:
s1= pd.Series([1,2,3,"kakinada"], index=["x1","x2","x3"])

print(s1)
print(type(s1))

ValueError: Length of passed values is 4, index implies 3.

In [8]:
s1= pd.Series([1,2,3,"kakinada"], index=["x1","x2","x3","x4"])

print(s1)
print(type(s1))

x1           1
x2           2
x3           3
x4    kakinada
dtype: object
<class 'pandas.core.series.Series'>


In [10]:
# Series object from a tuple
s2= pd.Series((1,2,3,4))

print(s2)
s2.index = [11,22,3,4]
s2

0    1
1    2
2    3
3    4
dtype: int64


11    1
22    2
3     3
4     4
dtype: int64

In [11]:
# Series object from a NumPy array

import numpy as np
s3 = pd.Series(np.arange(1,10))
print(s3)
print(type(s3))

0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
dtype: int32
<class 'pandas.core.series.Series'>


In [12]:
s1.index

Index(['x1', 'x2', 'x3', 'x4'], dtype='object')

In [13]:
s2.index

Int64Index([11, 22, 3, 4], dtype='int64')

In [14]:
s3.index

RangeIndex(start=0, stop=9, step=1)

In [15]:
s1

x1           1
x2           2
x3           3
x4    kakinada
dtype: object

In [16]:
s1["x2"]

2

In [17]:
s1["x4"]

'kakinada'

In [18]:
s1.min()

TypeError: '<=' not supported between instances of 'int' and 'str'
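
`s1.min()` fails because the Series mixes integers with a string, so the values cannot be compared. One possible workaround, shown here as a sketch rather than the notebook's own approach, is to convert to numbers with `pd.to_numeric(..., errors="coerce")`, which turns non-numeric entries into `NaN` that `min()` then skips.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, "kakinada"], index=["x1", "x2", "x3", "x4"])

# Non-numeric values become NaN instead of raising an error
numeric = pd.to_numeric(s1, errors="coerce")

# min() ignores NaN by default (skipna=True)
smallest = numeric.min()
print(smallest)
```

Note that the result is a float, because a column containing `NaN` is stored as `float64`.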

In [19]:
s3.min()

1

In [20]:
s2.min()

1

In [21]:
s3.cumsum()

0     1
1     3
2     6
3    10
4    15
5    21
6    28
7    36
8    45
dtype: int32

## Pandas date_range

- `date_range()` generates a sequence of dates or timestamps at a fixed frequency

### Syntax

- pandas.date_range(start=None, end=None, periods=None, freq=None, ...)
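
As a rough guide to the main parameters: `date_range` accepts `start`, `end`, `periods`, and `freq`, and the range is determined by combining them. The sketch below uses made-up dates to show two common combinations.

```python
import pandas as pd

# start + periods: 5 consecutive days (freq defaults to daily)
by_count = pd.date_range(start="2021-01-01", periods=5)

# start + end + freq: one timestamp every 7 days between two dates
by_week = pd.date_range(start="2021-01-01", end="2021-01-31", freq="7D")

print(by_count)
print(by_week)
```

With `end` given, the last generated timestamp is the final one that does not pass `end`, so the weekly range stops on 29 January.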

In [23]:
import pandas as pd

d1 = pd.date_range("01-01-2021",periods = 10)
d1

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
               '2021-01-09', '2021-01-10'],
              dtype='datetime64[ns]', freq='D')

In [24]:
import pandas as pd

d1 = pd.date_range("01-01-2021",periods = 10, freq= "1H")
d1

DatetimeIndex(['2021-01-01 00:00:00', '2021-01-01 01:00:00',
               '2021-01-01 02:00:00', '2021-01-01 03:00:00',
               '2021-01-01 04:00:00', '2021-01-01 05:00:00',
               '2021-01-01 06:00:00', '2021-01-01 07:00:00',
               '2021-01-01 08:00:00', '2021-01-01 09:00:00'],
              dtype='datetime64[ns]', freq='H')

In [26]:
import pandas as pd

d1 = pd.date_range("01-01-2021",end = "01-02-2021", freq= "1H")
d1

DatetimeIndex(['2021-01-01 00:00:00', '2021-01-01 01:00:00',
               '2021-01-01 02:00:00', '2021-01-01 03:00:00',
               '2021-01-01 04:00:00', '2021-01-01 05:00:00',
               '2021-01-01 06:00:00', '2021-01-01 07:00:00',
               '2021-01-01 08:00:00', '2021-01-01 09:00:00',
               '2021-01-01 10:00:00', '2021-01-01 11:00:00',
               '2021-01-01 12:00:00', '2021-01-01 13:00:00',
               '2021-01-01 14:00:00', '2021-01-01 15:00:00',
               '2021-01-01 16:00:00', '2021-01-01 17:00:00',
               '2021-01-01 18:00:00', '2021-01-01 19:00:00',
               '2021-01-01 20:00:00', '2021-01-01 21:00:00',
               '2021-01-01 22:00:00', '2021-01-01 23:00:00',
               '2021-01-02 00:00:00'],
              dtype='datetime64[ns]', freq='H')

In [27]:
import pandas as pd
date = pd.Series(pd.date_range("01-01-2021",end = "01-02-2021", freq= "1H"))
print(date)
print(type(date))

0    2021-01-01 00:00:00
1    2021-01-01 01:00:00
2    2021-01-01 02:00:00
3    2021-01-01 03:00:00
4    2021-01-01 04:00:00
5    2021-01-01 05:00:00
6    2021-01-01 06:00:00
7    2021-01-01 07:00:00
8    2021-01-01 08:00:00
9    2021-01-01 09:00:00
10   2021-01-01 10:00:00
11   2021-01-01 11:00:00
12   2021-01-01 12:00:00
13   2021-01-01 13:00:00
14   2021-01-01 14:00:00
15   2021-01-01 15:00:00
16   2021-01-01 16:00:00
17   2021-01-01 17:00:00
18   2021-01-01 18:00:00
19   2021-01-01 19:00:00
20   2021-01-01 20:00:00
21   2021-01-01 21:00:00
22   2021-01-01 22:00:00
23   2021-01-01 23:00:00
24   2021-01-02 00:00:00
dtype: datetime64[ns]
<class 'pandas.core.series.Series'>


## Data Frames

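- A DataFrame is a two-dimensional, labelled data structure
- It can be created from:
    - a list
    - a tuple
    - a dictionary
    - a NumPy array
- Element types include string, float, complex, and object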

In [28]:
import pandas as p1
import numpy as np

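df1 = p1.DataFrame([[1,2,3],[4,5,"andhra"]])      # from a list of lists (one inner list per row)
df2 = p1.DataFrame((1,3),(1,2))                   # data (1,3) with index (1,2): the second argument is the index
df3 = p1.DataFrame({"k1":11, "m1":[23,34]})       # from a dict; the scalar 11 is repeated for every row
df4 = p1.DataFrame(np.array([[1,3,5],[8,7,6]]))   # from a 2D NumPy array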
print(df1)
print(df2)
print(df3)
print(df4)

   0  1       2
0  1  2       3
1  4  5  andhra
   0
1  1
2  3
   k1  m1
0  11  23
1  11  34
   0  1  2
0  1  3  5
1  8  7  6


In [29]:
df1

   0  1       2
0  1  2       3
1  4  5  andhra


In [30]:
df2

   0
1  1
2  3


In [31]:
df3

   k1  m1
0  11  23
1  11  34


In [32]:
df4

   0  1  2
0  1  3  5
1  8  7  6


In [33]:
type(df4)

pandas.core.frame.DataFrame

In [34]:
print(type(df4))

<class 'pandas.core.frame.DataFrame'>


In [37]:
df2 = p1.DataFrame(((1,3),(1,2),(3,4)),index = ["x","y","z"])
df2

   0  1
x  1  3
y  1  2
z  3  4


In [39]:
df3

   k1  m1
0  11  23
1  11  34


In [40]:
df3.columns = ["column1","column2"]
df3.index = ["row1","row2"]
df3

      column1  column2
row1       11       23
row2       11       34


In [41]:
df3.columns

Index(['column1', 'column2'], dtype='object')

In [43]:
df3.index

Index(['row1', 'row2'], dtype='object')
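
Assigning to `.columns` and `.index` replaces every label at once. When only some labels need to change, `DataFrame.rename` with a mapping is an alternative. This sketch rebuilds a small frame like `df3`; the new label names are purely illustrative.

```python
import pandas as pd

df3 = pd.DataFrame({"k1": 11, "m1": [23, 34]})

# rename returns a new DataFrame; labels not in the mapping are unchanged
renamed = df3.rename(columns={"k1": "first"}, index={0: "row1"})

print(renamed)
print(list(renamed.columns), list(renamed.index))
```

The original `df3` keeps its labels here, since `rename` does not modify the frame in place by default.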

In [45]:
import numpy as np
d1 = {"rollno":["18021A0" + str(i) for i in range(1,101)], 
                "m1":np.random.randint(1,100,100),
                "m2":np.random.randint(1,100,100),
                "m3":np.random.randint(1,100,100)}
df = pd.DataFrame(d1)
df

        rollno  m1  m2  m3
0     18021A01  78  96  26
1     18021A02  87  22  80
2     18021A03  80   5  73
3     18021A04  68  57  59
4     18021A05  96  17  76
..         ...  ..  ..  ..
95   18021A096  82  71  51
96   18021A097  50  75  42
97   18021A098  45  48  36
98   18021A099  52  24  48
99  18021A0100  53  39  71

[100 rows x 4 columns]


In [47]:
df.shape # rows and columns

(100, 4)

In [48]:
df.head(10)  # view the first 10 rows of the DataFrame

      rollno  m1  m2  m3
0   18021A01  78  96  26
1   18021A02  87  22  80
2   18021A03  80   5  73
3   18021A04  68  57  59
4   18021A05  96  17  76
5   18021A06   6  58  62
6   18021A07  75  96  91
7   18021A08  94  78  15
8   18021A09  80   5  79
9  18021A010  57   5  29


In [49]:
df.tail(10)  # view the last 10 rows of the DataFrame

        rollno  m1  m2  m3
90   18021A091  62   5  92
91   18021A092  43  73  78
92   18021A093  54  40  78
93   18021A094  57  32   5
94   18021A095  27  72  78
95   18021A096  82  71  51
96   18021A097  50  75  42
97   18021A098  45  48  36
98   18021A099  52  24  48
99  18021A0100  53  39  71


## Accessing data from a DataFrame

- Indexing
- Slicing
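
Indexing and slicing can be sketched as follows. The table is a small made-up stand-in for the marks data, and `loc`/`iloc` are shown as common companions to plain bracket access rather than something this notebook covers in its cells.

```python
import pandas as pd

# Small stand-in for the marks table (values are made up)
df = pd.DataFrame({
    "rollno": ["A1", "A2", "A3", "A4"],
    "m1": [78, 87, 80, 68],
    "m2": [96, 22, 5, 57],
})

one_column = df["m1"]                # indexing: a single column as a Series
two_columns = df[["rollno", "m1"]]   # a list of names gives a DataFrame
first_rows = df[0:2]                 # slicing: rows at positions 0 and 1
by_label = df.loc[1, "m2"]           # loc: row label 1, column "m2"
by_position = df.iloc[0:2, 1:3]      # iloc: positions, end excluded

print(by_label)
print(by_position)
```

`loc` selects by label and `iloc` by integer position; with the default `RangeIndex` the two happen to coincide for rows.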

In [50]:
df.head()

     rollno  m1  m2  m3
0  18021A01  78  96  26
1  18021A02  87  22  80
2  18021A03  80   5  73
3  18021A04  68  57  59
4  18021A05  96  17  76


In [52]:
df["rollno"]

0       18021A01
1       18021A02
2       18021A03
3       18021A04
4       18021A05
         ...    
95     18021A096
96     18021A097
97     18021A098
98     18021A099
99    18021A0100
Name: rollno, Length: 100, dtype: object

In [56]:
df.head  # no parentheses: this shows the bound method itself; call df.head() to get the first rows

<bound method NDFrame.head of         rollno  m1  m2  m3
0     18021A01  78  96  26
1     18021A02  87  22  80
2     18021A03  80   5  73
3     18021A04  68  57  59
4     18021A05  96  17  76
..         ...  ..  ..  ..
95   18021A096  82  71  51
96   18021A097  50  75  42
97   18021A098  45  48  36
98   18021A099  52  24  48
99  18021A0100  53  39  71

[100 rows x 4 columns]>