In [None]:
import numpy as np
import pandas as pd

## ***Series***
A Series is a one-dimensional, array-like object that holds a sequence of values together with an array of labels (its index) associated with those values. Values in a Series can be looked up by label. <br>
*(A Series is similar to a NumPy array; in fact, it is built on top of the NumPy array object.)*
<br>A Series can hold any Python object.

In [None]:
my_data = [100,200,300]
# Convert my_data (a Python list) to a pandas Series
pd.Series(data = my_data) # a Series prints both the index and the values


0    100
1    200
2    300
dtype: int64

The left-hand column "0 1 2" is an automatically generated index for the values 100, 200 and 300. We can also supply our own index labels and use them to look up the data points.<br>
Let's pass `my_labels` to the Series as the index.

In [None]:
my_labels = ['x','y','z']
pd.Series(data = my_data, index = my_labels)

x    100
y    200
z    300
dtype: int64

**Series using NumPy arrays**

In [None]:
my_array = np.array(my_data)
pd.Series(data = my_array)

0    100
1    200
2    300
dtype: int64

In [None]:
pd.Series(data = my_array, index = my_labels)

x    100
y    200
z    300
dtype: int64

**Series using dictionary**

In [None]:
my_dic = {'x':100,'y':200,'z':300}
pd.Series(my_dic)

x    100
y    200
z    300
dtype: int64

## Grabbing data from Series:
Indexes are the key thing to understand in a Series. pandas uses these index labels (numbers or names) for fast lookup, much as a hash table or a dictionary does.

To understand the concept, let's create three Series, `ser1`, `ser2` and `ser3`, from dictionaries with some made-up data:

In [None]:
dic_1 = {'Toronto': 500, 'Calgary': 200, 'Vancouver': 300, 'Montreal': 700}
dic_2 = {'Calgary': 200, 'Vancouver': 300, 'Montreal': 700}
dic_3 = {'Calgary': 200, 'Vancouver': 300, 'Montreal': 700, 'Jasper':1000}

In [None]:
# Creating pandas series from the dictionaries
ser1 = pd.Series(dic_1)
ser2 = pd.Series(dic_2)
ser3 = pd.Series(dic_3)

In [None]:
ser1

Toronto      500
Calgary      200
Vancouver    300
Montreal     700
dtype: int64

In [None]:
# Grabbing a value from a Series works much like a dictionary lookup.
ser1['Calgary'] # labels are case sensitive: "calgary" is not the same as "Calgary"

200
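
The dictionary analogy extends a little further than square-bracket lookup. The sketch below uses a small stand-in Series (the name `cities` is hypothetical, not part of the notebook) to show membership tests, a safe lookup with a default, and positional access.

```python
import pandas as pd

# Hypothetical data, mirroring the city example above
cities = pd.Series({'Toronto': 500, 'Calgary': 200, 'Vancouver': 300})

has_calgary = 'Calgary' in cities      # membership is tested against the index labels
has_lowercase = 'calgary' in cities    # labels are case sensitive
missing = cities.get('Jasper', 0)      # .get returns a default instead of raising KeyError
by_position = cities.iloc[1]           # positional access is still available through iloc

print(has_calgary, has_lowercase, missing, by_position)
```

Using `.get` is handy when a label may be absent, since `cities['Jasper']` would raise a `KeyError`.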

In [None]:
# Series.append() was removed in pandas 2.0; pd.concat() stacks the two Series instead
ser6 = pd.concat([ser1, ser2])
ser6



Toronto      500
Calgary      200
Vancouver    300
Montreal     700
Calgary      200
Vancouver    300
Montreal     700
dtype: int64

In [None]:
ser4 = ser1 + ser2
ser4

Calgary       400.0
Montreal     1400.0
Toronto         NaN
Vancouver     600.0
dtype: float64
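
The `NaN` for Toronto appears because arithmetic between two Series aligns on the index labels: a label present in only one operand has nothing to be added to. The result is also `float64`, since `NaN` cannot be stored in an `int64` Series. If a missing label should count as zero instead, one option is the `add` method with `fill_value`. This is a minimal sketch; `s1` and `s2` are stand-ins for `ser1` and `ser2`.

```python
import pandas as pd

s1 = pd.Series({'Toronto': 500, 'Calgary': 200, 'Vancouver': 300, 'Montreal': 700})
s2 = pd.Series({'Calgary': 200, 'Vancouver': 300, 'Montreal': 700})

plain_sum = s1 + s2                     # 'Toronto' has no partner in s2, so it becomes NaN
filled_sum = s1.add(s2, fill_value=0)   # treat a label missing from one side as 0

print(filled_sum)
```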

### `isnull()`, `notnull()`
* detect missing data

In [None]:
# Equivalent function form: pd.isnull(ser4)
ser4.isnull()
# isnull is a method, so it needs parentheses (Shift+Tab in Jupyter shows its type)

Calgary      False
Montreal     False
Toronto       True
Vancouver    False
dtype: bool

In [None]:
# Equivalent function form: pd.notnull(ser4)
ser4.notnull()

Calgary       True
Montreal      True
Toronto      False
Vancouver     True
dtype: bool
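
Once the missing entries are detected, the boolean result can be used to filter, count, or replace them. The following sketch rebuilds the summed Series by hand (the name `totals` is hypothetical) and shows a few common follow-ups.

```python
import numpy as np
import pandas as pd

totals = pd.Series({'Calgary': 400.0, 'Montreal': 1400.0,
                    'Toronto': np.nan, 'Vancouver': 600.0})

kept = totals[totals.notnull()]     # boolean mask keeps only the non-missing entries
dropped = totals.dropna()           # same result via the dedicated method
filled = totals.fillna(0)           # or replace missing values with a chosen constant
n_missing = totals.isnull().sum()   # True counts as 1, so this counts the missing values

print(kept)
print(n_missing)
```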

### `axes`, `values`
* `axes`: returns a list containing the row axis labels (the index)
* `values`: returns the values/data as a NumPy array<br>

Let's try `axes` and `values` on our series!

In [None]:
# the row axis labels (index) come back wrapped in a list
ser1.axes
# axes is a property (an attribute), so no parentheses are needed

[Index(['Toronto', 'Calgary', 'Vancouver', 'Montreal'], dtype='object')]

In [None]:
# returns the values/data
ser1.values

array([500, 200, 300, 700])
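
For a Series, `axes` holds just one entry, so the `index` attribute gives the same labels without the surrounding list. Likewise, `to_numpy()` and `tolist()` are alternative ways to pull out the data. A small sketch, using a hypothetical Series `s`:

```python
import pandas as pd

s = pd.Series({'Toronto': 500, 'Calgary': 200})

labels = s.index        # the same Index object that s.axes wraps in a list
data = s.to_numpy()     # the values as a NumPy array
as_list = s.tolist()    # the values as a plain Python list

print(labels)
print(data, as_list)
```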

## **DataFrame**
* A very simple way to think about a DataFrame is as a "bunch of Series put together so that they share the same index". <br>
* A DataFrame is a rectangular table of data that contains an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index; it can be thought of as a dictionary of Series that all share the same index. <br>
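
The "dictionary of Series" picture can be made literal. In this sketch (all names and numbers are made up for illustration), two Series with partly different labels are combined; pandas aligns them on the union of their indexes and fills the gaps with `NaN`.

```python
import pandas as pd

# Hypothetical columns; each Series carries its own labels
population = pd.Series({'Toronto': 500, 'Calgary': 200, 'Vancouver': 300})
visitors = pd.Series({'Calgary': 20, 'Vancouver': 30, 'Jasper': 100})

# Each dictionary key becomes a column; the rows are the union of the Series indexes
city_df = pd.DataFrame({'population': population, 'visitors': visitors})
print(city_df)

# Pulling a column back out returns a Series again
print(type(city_df['population']))
```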

Let's create two lists of labels:
* for rows: 'r1' to 'r10'
* for columns: 'c1' to 'c10'

For the data, let's start with a simple example, using **`arange()`** and **`reshape()`** together to create a 2D array (matrix).<br>


In [None]:
index = 'r1 r2 r3 r4 r5 r6 r7 r8 r9 r10'.split()
columns = 'c1 c2 c3 c4 c5 c6 c7 c8 c9 c10'.split()

In [None]:
index

['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']

In [None]:
columns

['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10']

In [None]:
array_2d = np.arange(0,100).reshape(10,10)
array_2d

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])

In [None]:
# Let's create our first DataFrame using index, columns and array_2d
df = pd.DataFrame(data = array_2d, index = index, columns = columns) # unlike a plain array, a DataFrame displays its row index and column labels
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [None]:
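# Grabbing a single column
# df['c2'] is equivalent; attribute access (df.c2) only works when the column name is a valid Python identifier
df.c2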

r1      1
r2     11
r3     21
r4     31
r5     41
r6     51
r7     61
r8     71
r9     81
r10    91
Name: c2, dtype: int64

In [None]:
type(df['c1']) # It is a pandas Series

pandas.core.series.Series

In [None]:
# Grabbing more than one column, pass the list of columns you need!
df[['c1', 'c10', 'c2']]

Unnamed: 0,c1,c10,c2
r1,0,9,1
r2,10,19,11
r3,20,29,21
r4,30,39,31
r5,40,49,41
r6,50,59,51
r7,60,69,61
r8,70,79,71
r9,80,89,81
r10,90,99,91


## Create New Column

In [None]:
df['new'] = df['c1'] + df['c2']  # SQL equivalent: SELECT *, (c1 + c2) AS new FROM df
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,new
r1,0,1,2,3,4,5,6,7,8,9,1
r2,10,11,12,13,14,15,16,17,18,19,21
r3,20,21,22,23,24,25,26,27,28,29,41
r4,30,31,32,33,34,35,36,37,38,39,61
r5,40,41,42,43,44,45,46,47,48,49,81
r6,50,51,52,53,54,55,56,57,58,59,101
r7,60,61,62,63,64,65,66,67,68,69,121
r8,70,71,72,73,74,75,76,77,78,79,141
r9,80,81,82,83,84,85,86,87,88,89,161
r10,90,91,92,93,94,95,96,97,98,99,181


### Deleting the column -- `drop()`

In [None]:
df.drop('new', axis = 1, inplace = True)  # axis = 1 drops a column; axis = 0 would drop a row
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


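By default, `drop()` returns a new DataFrame and leaves the original untouched. To delete the column from `df` itself, you have to tell pandas so by setting<br>
* ***inplace = True*** (the default is inplace=False).<br>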
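
To see the difference, here is a minimal sketch on a small throwaway frame (`demo` is a hypothetical name). Reassigning the result of `drop()` is a common alternative to `inplace = True`.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(6).reshape(2, 3),
                    index=['r1', 'r2'], columns=['c1', 'c2', 'c3'])

without_c3 = demo.drop('c3', axis=1)   # returns a new DataFrame; demo still has 'c3'
print(list(demo.columns))

demo = demo.drop(columns='c3')         # reassigning has the same effect as inplace=True
print(list(demo.columns))

without_r1 = demo.drop('r1', axis=0)   # axis=0 (the default) drops a row by its label
print(list(without_r1.index))
```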

## Rows
We can retrieve a row by its label or position with **[`loc`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)** and **[`iloc`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html)**.<br>
**loc** -- Access a group of rows and columns by label(s)<br>
**iloc** -- Access a group of rows and columns by integer position(s)

In [None]:
df.loc['r1'] # loc takes the label in square brackets; plain df[...] selects a column, df.loc[...] selects a row
# we see that the rows are Series as well!

c1     0
c2     1
c3     2
c4     3
c5     4
c6     5
c7     6
c8     7
c9     8
c10    9
Name: r1, dtype: int64

In [None]:
df.iloc[0] # iloc[position]: integer position-based selection

c1     0
c2     1
c3     2
c4     3
c5     4
c6     5
c7     6
c8     7
c9     8
c10    9
Name: r1, dtype: int64

In [None]:
# more than one row -- pass a list of row labels!
df.loc[['r1','r2', 'r3']]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
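
Both accessors also accept slices, with one difference that is easy to trip over: a `loc` slice includes its end label, while an `iloc` slice excludes its stop position, as ordinary Python slicing does. A short sketch on a hypothetical 4x4 frame:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['r1', 'r2', 'r3', 'r4'],
                    columns=['c1', 'c2', 'c3', 'c4'])

by_label = demo.loc['r1':'r3']    # label slice: both endpoints included -> r1, r2, r3
by_position = demo.iloc[0:3]      # positional slice: stop excluded -> positions 0, 1, 2
block = demo.iloc[0:2, 1:3]       # rows at positions 0-1, columns at positions 1-2

print(by_label)
print(block)
```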


### Grabbing an element or a sub-set of the dataframe

In [None]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [None]:
df.loc[['r2', 'r5']]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r2,10,11,12,13,14,15,16,17,18,19
r5,40,41,42,43,44,45,46,47,48,49


In [None]:
# df.loc[req_row, req_col] -- pass the row and column labels to get a single element!
df.loc['r1','c1']

0

In [None]:
# for a sub-set, pass the list
df.loc[['r1','r2'],['c1','c2']]

Unnamed: 0,c1,c2
r1,0,1
r2,10,11


In [None]:
# another example - random columns and rows in the list
df.loc[['r2','r5'],['c3','c4']]

Unnamed: 0,c3,c4
r2,12,13
r5,42,43


In [None]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [None]:
# We can do a conditional selection as well
df > 5

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,False,False,False,False,False,False,True,True,True,True
r2,True,True,True,True,True,True,True,True,True,True
r3,True,True,True,True,True,True,True,True,True,True
r4,True,True,True,True,True,True,True,True,True,True
r5,True,True,True,True,True,True,True,True,True,True
r6,True,True,True,True,True,True,True,True,True,True
r7,True,True,True,True,True,True,True,True,True,True
r8,True,True,True,True,True,True,True,True,True,True
r9,True,True,True,True,True,True,True,True,True,True
r10,True,True,True,True,True,True,True,True,True,True


This is similar to a NumPy boolean mask. Let's try this:

* `bool_mask = df % 3 == 0`
* `df[bool_mask]`

This returns the values where the mask is True and NaN where it is False.

In [None]:
# Keep only the values divisible by 3
bool_mask = df % 3 == 0
df[bool_mask]
# The same thing in one step:
# df[df % 3 == 0]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0.0,,,3.0,,,6.0,,,9.0
r2,,,12.0,,,15.0,,,18.0,
r3,,21.0,,,24.0,,,27.0,,
r4,30.0,,,33.0,,,36.0,,,39.0
r5,,,42.0,,,45.0,,,48.0,
r6,,51.0,,,54.0,,,57.0,,
r7,60.0,,,63.0,,,66.0,,,69.0
r8,,,72.0,,,75.0,,,78.0,
r9,,81.0,,,84.0,,,87.0,,
r10,90.0,,,93.0,,,96.0,,,99.0


&#9758; It's not common to apply such an operation to the entire DataFrame. We usually apply it to a column or a row instead.<br>
**For example, we may not want rows with NaN values in the result.**<br>
What to do?<br>
Let's have a look at an example.

In [None]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [None]:
df['c1'] > 11  # a boolean Series: True where c1 is greater than 11
# use it as a mask to filter rows: df[df['c1'] > 11]

r1     False
r2     False
r3      True
r4      True
r5      True
r6      True
r7      True
r8      True
r9      True
r10     True
Name: c1, dtype: bool

In [None]:
df[df['c1']>11] # df[boolean_mask] keeps only the rows where c1 is greater than 11


Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [None]:
result = df[df['c1']>11]
result['c1']

r3     20
r4     30
r5     40
r6     50
r7     60
r8     70
r9     80
r10    90
Name: c1, dtype: int64

In [None]:
df[df['c1']>11]['c1']

r3     20
r4     30
r5     40
r6     50
r7     60
r8     70
r9     80
r10    90
Name: c1, dtype: int64

In [None]:
# Passing multiple rows in a list
df[df['c1']>11].loc[['r3','r5']]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r3,20,21,22,23,24,25,26,27,28,29
r5,40,41,42,43,44,45,46,47,48,49


In [None]:
df[df['c1']==70]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r8,70,71,72,73,74,75,76,77,78,79


In [None]:
df[(df['c1']>60) & (df['c2']>80)]
# note that each condition is wrapped in its own (): & binds more tightly than the comparisons,
# and the combined mask then goes inside df[]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99
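
The same pattern works with `|` (or) and `~` (not). Python's own `and`, `or` and `not` cannot be used here, because the truth value of a whole Series is ambiguous and pandas raises a `ValueError`. The sketch below rebuilds the 10x10 frame under the hypothetical name `demo` and also shows `loc` filtering rows and picking columns in one step.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(0, 100).reshape(10, 10),
                    index=[f'r{i}' for i in range(1, 11)],
                    columns=[f'c{i}' for i in range(1, 11)])

either = demo[(demo['c1'] < 10) | (demo['c1'] > 80)]   # OR: only the first and last rows match
negated = demo[~(demo['c1'] > 11)]                     # NOT: rows where c1 <= 11
picked = demo.loc[demo['c1'] > 60, ['c1', 'c2']]       # filter rows and choose columns together

print(either.index.tolist())
print(negated.index.tolist())
print(picked)
```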


## **`reset_index()`** and **`set_index()`**<br>
We can reset the index of our DataFrame to the default numerical index; pass `inplace = True` to make the change permanent. *The existing index becomes a new column.*

In [None]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [None]:
df.reset_index(inplace = True)
df

Unnamed: 0,index,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
0,r1,0,1,2,3,4,5,6,7,8,9
1,r2,10,11,12,13,14,15,16,17,18,19
2,r3,20,21,22,23,24,25,26,27,28,29
3,r4,30,31,32,33,34,35,36,37,38,39
4,r5,40,41,42,43,44,45,46,47,48,49
5,r6,50,51,52,53,54,55,56,57,58,59
6,r7,60,61,62,63,64,65,66,67,68,69
7,r8,70,71,72,73,74,75,76,77,78,79
8,r9,80,81,82,83,84,85,86,87,88,89
9,r10,90,91,92,93,94,95,96,97,98,99


In [None]:
df['index']

0     r1
1     r2
2     r3
3     r4
4     r5
5     r6
6     r7
7     r8
8     r9
9    r10
Name: index, dtype: object

In [None]:
df.set_index('index', inplace = True)

df

index,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99
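
Both methods also work without `inplace = True`, in which case they return a new DataFrame and leave the original alone. The sketch below (with a hypothetical small frame `demo`) shows the round trip, plus `drop=True` for discarding the old labels rather than keeping them as a column.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(6).reshape(3, 2),
                    index=['r1', 'r2', 'r3'], columns=['c1', 'c2'])

flat = demo.reset_index()                 # old labels move into a column named 'index'
discarded = demo.reset_index(drop=True)   # drop=True throws the old labels away instead
restored = flat.set_index('index')        # promote the column back to the row index

print(flat)
print(restored)
```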


In [None]:
df = pd.DataFrame(data = array_2d, index = index, columns = columns)
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


### `info()`
Provides a concise summary of the DataFrame.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, r1 to r10
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   c1      10 non-null     int64
 1   c2      10 non-null     int64
 2   c3      10 non-null     int64
 3   c4      10 non-null     int64
 4   c5      10 non-null     int64
 5   c6      10 non-null     int64
 6   c7      10 non-null     int64
 7   c8      10 non-null     int64
 8   c9      10 non-null     int64
 9   c10     10 non-null     int64
dtypes: int64(10)
memory usage: 880.0+ bytes


In [None]:
df.describe()

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,45.0,46.0,47.0,48.0,49.0,50.0,51.0,52.0,53.0,54.0
std,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504
min,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
25%,22.5,23.5,24.5,25.5,26.5,27.5,28.5,29.5,30.5,31.5
50%,45.0,46.0,47.0,48.0,49.0,50.0,51.0,52.0,53.0,54.0
75%,67.5,68.5,69.5,70.5,71.5,72.5,73.5,74.5,75.5,76.5
max,90.0,91.0,92.0,93.0,94.0,95.0,96.0,97.0,98.0,99.0
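
Each row of the `describe()` table can be reproduced with an individual method, which helps when reading the numbers. One detail worth knowing is that pandas reports the sample standard deviation (`ddof=1`), whereas NumPy's `np.std` defaults to the population version (`ddof=0`). A sketch using the values of column `c1` under the hypothetical name `c1`:

```python
import numpy as np
import pandas as pd

c1 = pd.Series(np.arange(0, 100, 10))   # 0, 10, ..., 90: the same values as column c1

summary = c1.describe()
sample_std = c1.std()              # pandas default, ddof=1 (matches the 'std' row)
population_std = c1.std(ddof=0)    # the same formula np.std uses by default
median = c1.quantile(0.5)          # the '50%' row is the median

print(summary)
print(sample_std, population_std, median)
```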
