# <a id='toc1_'></a>[Analyze and Prepare data using Python](#toc0_)

Lesson: 07

Time: 00:00:00

**Table of contents**<a id='toc0_'></a>    
- [Analyze and Prepare data using Python](#toc1_)    
  - [NumPy](#toc1_1_)    
  - [Pandas](#toc1_2_)    


---

## <a id='toc1_1_'></a>[NumPy](#toc0_)

In [2]:
import numpy as np

An array of the given shape, filled with `fill_value`:

In [6]:
np.full(shape=(2, 3), fill_value=6)


array([[6, 6, 6],
       [6, 6, 6]])

Identity matrix/array:

In [7]:
np.identity(n=3)


array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

Change the data type with `astype` (converting floats to integers truncates toward zero):

In [13]:
x = np.array([1.0, 3.2, 0.8, 4.0, 7.98])
print(x)
print(x.astype(np.int64))


[1.   3.2  0.8  4.   7.98]
[1 3 0 4 7]


In [14]:
y = np.array([1, 3, 0, 4, 7])
print(y)
print(y.astype(np.float64))


[1 3 0 4 7]
[1. 3. 0. 4. 7.]


A _slice_ is a view of the original array, so _changing_ the slice also changes the original:

In [41]:
arr = np.arange(10)
print(arr)  # The original values of arr

x = arr[2:6]
print(x)
print()

x[1] = 17
print(x)
print(arr)  # The values of the array after Slicing and changing the value
print()

x[:] = 64
print(x)
print(arr)  # The values of the array after Slicing and changing the value


[0 1 2 3 4 5 6 7 8 9]
[2 3 4 5]

[ 2 17  4  5]
[ 0  1  2 17  4  5  6  7  8  9]

[64 64 64 64]
[ 0  1 64 64 64 64  6  7  8  9]


With `copy()`, the slice becomes an independent array and the original stays unchanged:

In [40]:
arr = np.arange(10)
print(arr)  # The original values of arr

x = arr[2:6].copy()
print(x)
print()

x[1] = 17
print(x)
print(arr)  # The values of the array after Slicing and changing the value
print()

x[:] = 64
print(x)
print(arr)  # The values of the array after Slicing and changing the value


[0 1 2 3 4 5 6 7 8 9]
[2 3 4 5]

[ 2 17  4  5]
[0 1 2 3 4 5 6 7 8 9]

[64 64 64 64]
[0 1 2 3 4 5 6 7 8 9]
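
Whether two arrays share the same underlying data can also be checked directly instead of inferred from printed values. The sketch below is a minimal illustration using `np.shares_memory` and the `.base` attribute; the variable names are chosen only for this example.

```python
import numpy as np

arr = np.arange(10)

view = arr[2:6]            # basic slicing returns a view
copied = arr[2:6].copy()   # copy() allocates new, independent data

# np.shares_memory reports whether two arrays overlap in memory
view_shares = np.shares_memory(arr, view)
copy_shares = np.shares_memory(arr, copied)

# A view keeps a reference to the array it was sliced from
view_base_is_arr = view.base is arr

print(view_shares, copy_shares, view_base_is_arr)
```

If the first value is `True`, writing into the slice will be visible in `arr`.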


Boolean indexing _(use a boolean array to select the matching elements of another array)_:

In [65]:
names = np.array(['ali', 'sara', 'taha', 'ali'])
print(names)
print(names == 'ali')


['ali' 'sara' 'taha' 'ali']
[ True False False  True]


In [66]:
data = np.random.randint(low=10, size=(4, 3))
print(data)
print()

print(data[names == 'ali'])
print()

print(data[~(names == 'ali')])  # ~ inverts the boolean mask: rows where names != 'ali'


[[6 2 6]
 [3 9 8]
 [8 8 7]
 [8 2 6]]

[[6 2 6]
 [8 2 6]]

[[3 9 8]
 [8 8 7]]


In [67]:
print(data[names == 'ali', 1:])


[[2 6]
 [2 6]]


In [68]:
mask1 = (names == 'ali') | (names == 'taha')
mask2 = (names == 'ali') & (names == 'taha')

print(mask1)
print(data[mask1])
print()

print(mask2)
print(data[mask2])


[ True False  True  True]
[[6 2 6]
 [8 8 7]
 [8 2 6]]

[False False False False]
[]
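
Masks are combined element-wise with `&`, `|` and `~`. The Python keywords `and` / `or` do not work for this, because they try to reduce the whole array to a single truth value. The sketch below illustrates both points; the `data` values are simply reused from the printed output above.

```python
import numpy as np

names = np.array(['ali', 'sara', 'taha', 'ali'])
data = np.array([[6, 2, 6],
                 [3, 9, 8],
                 [8, 8, 7],
                 [8, 2, 6]])

is_ali = names == 'ali'

not_ali = data[~is_ali]                         # same rows as data[names != 'ali']
ali_or_taha = data[is_ali | (names == 'taha')]  # parentheses are required around each comparison

# `and` asks for bool(is_ali), which is ambiguous for an array with several elements
try:
    is_ali and (names == 'taha')
    raised = False
except ValueError:
    raised = True

print(not_ali)
print(ali_or_taha.shape, raised)
```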


Convert Negative values to Zero:

In [78]:
x = np.random.randn(3, 4)
print(x)
print()

x[x < 0] = 0
print(x)


[[ 0.77211152 -0.76597199  0.75102038  0.33594753]
 [-0.68683318  1.85224896  0.75384187  1.84777921]
 [-0.25003729  1.63777146  0.89882982 -0.60706259]]

[[0.77211152 0.         0.75102038 0.33594753]
 [0.         1.85224896 0.75384187 1.84777921]
 [0.         1.63777146 0.89882982 0.        ]]


Fancy indexing (_Indexing using integer arrays_):

In [99]:
# Create data
arr = np.empty(shape=(5, 3))

for i in range(arr.shape[0]):  # arr.shape[0] == 5
    arr[i] = 5*i + 1  # an arbitrary value, broadcast across the whole row

print(arr)


[[ 1.  1.  1.]
 [ 6.  6.  6.]
 [11. 11. 11.]
 [16. 16. 16.]
 [21. 21. 21.]]


In [104]:
# Fancy indexing: pass a list of row indices (hence the double brackets)
print(arr[[0, 2, -2, -5, -3, 1]])


[[ 1.  1.  1.]
 [11. 11. 11.]
 [16. 16. 16.]
 [ 1.  1.  1.]
 [11. 11. 11.]
 [ 6.  6.  6.]]


Fancy indexing with several index arrays:

In [115]:
a = np.arange(35).reshape((7, 5))
print(a)
print()

# The first list holds row indices and the second column indices; they are paired: (6, 2), (0, 4), (2, 3)
print(a[[6, 0, 2], [2, 4, 3]])
print()

print(a[[2, 6]][:, [0, 3, 1]])


[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]
 [25 26 27 28 29]
 [30 31 32 33 34]]

[32  4 13]

[[10 13 11]
 [30 33 31]]
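
The second expression above selects a rectangular block in two steps. A possible one-step alternative is `np.ix_`, which builds an open mesh from the row and column lists. The sketch below only checks that both forms agree.

```python
import numpy as np

a = np.arange(35).reshape((7, 5))

# Two-step selection: pick rows first, then columns
two_step = a[[2, 6]][:, [0, 3, 1]]

# One-step selection: np.ix_ combines every listed row with every listed column
one_step = a[np.ix_([2, 6], [0, 3, 1])]

# Paired index arrays, by contrast, pick individual elements
paired = a[[6, 0, 2], [2, 4, 3]]

print(one_step)
print(paired)
```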


Transposing arrays and Swapping axes:

In [119]:
arr = np.arange(8).reshape((2, 4))
print(arr)
print()

print(arr.T)  # Transpose


[[0 1 2 3]
 [4 5 6 7]]

[[0 4]
 [1 5]
 [2 6]
 [3 7]]


In [124]:
z = np.arange(60).reshape((3, 4, 5))  # 3*4*5 == 60
print(z)
print()

print(z.shape)


[[[ 0  1  2  3  4]
  [ 5  6  7  8  9]
  [10 11 12 13 14]
  [15 16 17 18 19]]

 [[20 21 22 23 24]
  [25 26 27 28 29]
  [30 31 32 33 34]
  [35 36 37 38 39]]

 [[40 41 42 43 44]
  [45 46 47 48 49]
  [50 51 52 53 54]
  [55 56 57 58 59]]]

(3, 4, 5)


In [127]:
# Axes swapping
axe = z.swapaxes(0, 1)  # returns a view with axes 0 and 1 swapped: (3, 4, 5) -> (4, 3, 5)
print(axe)
print()

print(axe.shape)


[[[ 0  1  2  3  4]
  [20 21 22 23 24]
  [40 41 42 43 44]]

 [[ 5  6  7  8  9]
  [25 26 27 28 29]
  [45 46 47 48 49]]

 [[10 11 12 13 14]
  [30 31 32 33 34]
  [50 51 52 53 54]]

 [[15 16 17 18 19]
  [35 36 37 38 39]
  [55 56 57 58 59]]]

(4, 3, 5)


In [129]:
# Equivalent to z.swapaxes(0, 1)
print(z.transpose((1, 0, 2)))


[[[ 0  1  2  3  4]
  [20 21 22 23 24]
  [40 41 42 43 44]]

 [[ 5  6  7  8  9]
  [25 26 27 28 29]
  [45 46 47 48 49]]

 [[10 11 12 13 14]
  [30 31 32 33 34]
  [50 51 52 53 54]]

 [[15 16 17 18 19]
  [35 36 37 38 39]
  [55 56 57 58 59]]]


Split values into fractional and integer parts (`np.modf`):

In [132]:
x = np.array([2.6, 8.5, -9])

d, i = np.modf(x)

print(np.modf(x))
print(d)  # Fractional part
print(i)  # Integer part


(array([ 0.6,  0.5, -0. ]), array([ 2.,  8., -9.]))
[ 0.6  0.5 -0. ]
[ 2.  8. -9.]


Maximum Values of Multiple arrays:

In [135]:
x = np.random.randn(4)
y = np.random.randn(4)

print(x)
print(y)
print()

print(np.maximum(x, y))  # element-wise maximum; x and y must have the same shape (or be broadcastable)


[-0.06623655  0.55547904 -2.48191337  2.92861181]
[-0.17258407  1.05849755  0.39515491  0.09070186]

[-0.06623655  1.05849755  0.39515491  2.92861181]


Where ( `np.where()` ):

In [140]:
# Without np.where: a Python-level loop (slow on large arrays, and it returns a list)
arr1 = np.array([1, 5, 8])
arr2 = np.array([4, 7, 12])
condition = np.array([True, False, True])

r = [(x if cond else y) for x, y, cond in zip(arr1, arr2, condition)]

print(r)


[1, 7, 8]


In [142]:
# With np.where: vectorized, and the result is a NumPy array
res = np.where(condition, arr1, arr2)  # arr1 If condition Else arr2
print(res)


[1 7 8]


Another example:

In [165]:
x = np.random.randn(2, 3)
print(x)
print()

print(x > 0)
print()

print(np.where(x > 0, 13, -1))  # 13 If x > 0 Else -1
print()

print(np.where(x > 0, 1, x))  # 1 If x > 0 Else x


[[ 0.72039162 -0.74075494 -0.7425139 ]
 [-0.02910374  0.62051568 -0.96183579]]

[[ True False False]
 [False  True False]]

[[13 -1 -1]
 [-1 13 -1]]

[[ 1.         -0.74075494 -0.7425139 ]
 [-0.02910374  1.         -0.96183579]]
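
`np.where` can also be called with only the condition. In that form it returns the indices where the condition holds, one index array per dimension. A minimal sketch with made-up values:

```python
import numpy as np

x = np.array([3, -1, 0, 7, -5])

# One-argument form: a tuple of index arrays (a single array for 1-D input)
positive_idx = np.where(x > 0)[0]

# np.flatnonzero gives the same indices for 1-D input
same_idx = np.flatnonzero(x > 0)

# For 2-D input, row and column indices come back separately
grid = np.array([[1, -2],
                 [-3, 4]])
rows, cols = np.where(grid < 0)

print(positive_idx, rows, cols)
```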


Mathematical and Statistical methods:

In [20]:
arr = np.array([7, 8, -2, 4, 2, 3, 1])

print(np.sort(arr))


[-2  1  2  3  4  7  8]


In [21]:
print(np.max(arr))  # Return the maximum of an array.
print(np.amax(arr))  # Return the maximum of an array.
print(np.argmax(arr))  # Returns the indices of the maximum values.


8
8
1


In [23]:
# Calculations on an array that has missing values (NaN)
x = np.array([1, 4, np.nan, 8, 7, np.nan, 2])

print(np.mean(x))  # plain mean: a single NaN makes the result nan
print(np.nanmean(x))  # nanmean ignores the NaN values
print()

print(np.max(x))  # nan
print(np.nanmax(x))  # ignores NaN
print()

print(np.sum(x))  # nan
print(np.nansum(x))  # treats NaN as zero
print()

print(np.var(x))  # nan
print(np.nanvar(x))  # ignores NaN

# and ...


nan
4.4

nan
8.0

nan
22.0

nan
7.44
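
The `nan*` functions behave as if the missing entries had been filtered out first. The sketch below makes that explicit with `np.isnan` and a boolean mask, which is also a simple way to count the missing values.

```python
import numpy as np

x = np.array([1, 4, np.nan, 8, 7, np.nan, 2])

nan_mask = np.isnan(x)            # True where the value is missing
n_missing = int(nan_mask.sum())   # number of NaN entries

clean = x[~nan_mask]              # keep only the valid values
manual_mean = clean.mean()

print(n_missing, manual_mean, np.nanmean(x))
```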


In [38]:
# Some of Statistical functions
y = np.array([3, 5, 9, 8, 1, 4, 17, 6])

print(np.var(y))
print()

print(np.std(y))
print()

print(np.mean(y))  # Arithmetic mean.
print()

print(np.average(y, weights=[2, 1, 4, 3, 3, 1, 0.6, 3]))  # Weighted mean.
print()

print(np.median(y))
print()

# q value: Quantile or sequence of quantiles to compute, which must be between 0 and 1 inclusive.
print(np.quantile(y, q=0.25))
print(np.quantile(y, q=[0, 0.25, 0.3, 0.5, 0.75, 1]))
print()

# q value: Percentile or sequence of percentiles to compute, which must be between 0 and 100 inclusive.
print(np.percentile(y, q=25))
print(np.percentile(y, q=[0, 25, 30, 50, 75, 100]))

# and ...


21.234375

4.608077147791691

6.625

6.034090909090909

5.5

3.75
[ 1.    3.75  4.1   5.5   8.25 17.  ]

3.75
[ 1.    3.75  4.1   5.5   8.25 17.  ]
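
One detail worth knowing: `np.var` and `np.std` divide by `N` by default (`ddof=0`, the population formulas), whereas pandas uses `ddof=1` (the sample formulas) by default. The sketch below compares the two on the same array.

```python
import numpy as np

y = np.array([3, 5, 9, 8, 1, 4, 17, 6])
n = y.size

population_var = np.var(y)        # divides by n (ddof=0)
sample_var = np.var(y, ddof=1)    # divides by n - 1
sample_std = np.std(y, ddof=1)

print(population_var, sample_var, sample_std)
```

Passing `ddof=1` is the usual choice when the array is a sample drawn from a larger population.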


In [40]:
# Sum and CumSum
arr = np.array([1, 2, 3, 4])

print(np.sum(arr))  # sum.
print()

print(np.cumsum(arr))  # Cumulative sum of the elements.


10

[ 1  3  6 10]


Another example:

In [50]:
x = np.arange(1, 10).reshape((3, 3))
print(x)
print()

print(np.sum(x, axis=0))  # Sum of Columns
print()

print(np.sum(x, axis=1))  # Sum of Rows
print()

print(np.cumsum(x, axis=0))  # CumSum of Columns
print()

print(np.cumsum(x, axis=1))  # CumSum of Rows
print()

print(np.prod(x, axis=0))  # Product of Columns
print()

print(np.cumprod(x, axis=0))  # CumProd of Columns
print()


[[1 2 3]
 [4 5 6]
 [7 8 9]]

[12 15 18]

[ 6 15 24]

[[ 1  2  3]
 [ 5  7  9]
 [12 15 18]]

[[ 1  3  6]
 [ 4  9 15]
 [ 7 15 24]]

[ 28  80 162]

[[  1   2   3]
 [  4  10  18]
 [ 28  80 162]]
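
A reduction along an axis normally drops that axis. With `keepdims=True` the reduced axis is kept with length 1, so the result broadcasts back against the original array. A small sketch that turns each row into shares of its row total:

```python
import numpy as np

x = np.arange(1, 10).reshape((3, 3))

row_totals = x.sum(axis=1, keepdims=True)   # shape (3, 1) instead of (3,)
row_shares = x / row_totals                 # each row is divided by its own total

print(row_totals.shape)
print(row_shares)
```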



`all()` and `any()` functions:

In [54]:
a = np.array([True, True, False])

print(a.any())  # Is at least one of elements true?
print(np.any(a))
print()

print(a.all())  # Are all elements True?
print(np.all(a))

# a is a NumPy array: a.any() / a.all() are ndarray methods, and np.any() / np.all() are the equivalent NumPy functions.


True
True

False
False


In [58]:
b = [0, 2, -3]  # in True/False ->  [False, True, True]

print(np.all(b))
print()

print(np.any(b))
print()

print(b.all())  # AttributeError: a plain list has no .all() method
print(b.any())  # not reached; it would fail the same way


False

True



AttributeError: 'list' object has no attribute 'all'

Unique function:

In [65]:
arr = np.array([1, 4, 7, 5, 5, 4, 1, 3, 5])

print(np.unique(arr))
print()

print(np.unique(arr, return_index=True))
print()

print(np.unique(arr, return_counts=True))


[1 3 4 5 7]

(array([1, 3, 4, 5, 7]), array([0, 7, 1, 3, 2], dtype=int64))

(array([1, 3, 4, 5, 7]), array([2, 1, 2, 3, 1], dtype=int64))


More on `np.sort`: sorting a structured array by field name:

In [71]:
data = [('alex', 17.5, 35), ('sara', 15.75, 27), ('tomas', 16.25, 27)]
print(data)
print(type(data))
print()

# name and type of each field (column)
dtype = [('name', 'S10'), ('score', float), ('age', int)]

arr = np.array(data, dtype=dtype)
print(type(arr))
print()

print(np.sort(arr, order='name'))
print(np.sort(arr, order='score'))
print(np.sort(arr, order='age'))
print()

print(np.sort(arr, order=['age', 'score']))


[('alex', 17.5, 35), ('sara', 15.75, 27), ('tomas', 16.25, 27)]
<class 'list'>

<class 'numpy.ndarray'>

[(b'alex', 17.5 , 35) (b'sara', 15.75, 27) (b'tomas', 16.25, 27)]
[(b'sara', 15.75, 27) (b'tomas', 16.25, 27) (b'alex', 17.5 , 35)]
[(b'sara', 15.75, 27) (b'tomas', 16.25, 27) (b'alex', 17.5 , 35)]

[(b'sara', 15.75, 27) (b'tomas', 16.25, 27) (b'alex', 17.5 , 35)]
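
When the columns live in separate plain arrays rather than in one structured array, `np.argsort` offers a similar result: it returns the order of indices that would sort one array, and that order can then be applied to the others. A minimal sketch reusing the names and scores from above:

```python
import numpy as np

names = np.array(['alex', 'sara', 'tomas'])
scores = np.array([17.5, 15.75, 16.25])

order = np.argsort(scores)          # indices that sort scores in ascending order
names_by_score = names[order]       # reorder the parallel array
best_first = names[order[::-1]]     # reversed order: highest score first

print(order, names_by_score, best_first)
```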


`in1d` function _(deprecated since NumPy 2.0 in favor of `np.isin`, which accepts the same arguments here)_:

In [78]:
x = np.array([0, 7, 1, 4, 2, 5, 7])
y = np.array([3, 4, 7])

print(np.in1d(x, y))
print(np.in1d(x, y, invert=True))
print()

print(x[np.in1d(x, y)])  # Which elements of x are in y?
print(x[np.in1d(x, y, invert=True)])  # Which elements of x are not in y?

[False  True False  True False False  True]
[ True False  True False  True  True False]

[7 4 7]
[0 1 2 5]
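
The same membership test can be written with `np.isin`, which newer NumPy releases recommend over `np.in1d`. Unlike `in1d`, it keeps the shape of its first argument. The sketch below repeats the example; the small 2-D `grid` array is made up for illustration.

```python
import numpy as np

x = np.array([0, 7, 1, 4, 2, 5, 7])
y = np.array([3, 4, 7])

mask = np.isin(x, y)
in_y = x[mask]
not_in_y = x[np.isin(x, y, invert=True)]

# The result has the same shape as the first argument
grid = np.array([[3, 0],
                 [7, 9]])
grid_mask = np.isin(grid, y)

print(in_y, not_in_y)
print(grid_mask)
```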


Save and Load:

In [55]:
a = np.array([4,6,9])

np.save(file='myfile.npy', arr=a)  # Save an array to a binary file in NumPy .npy format.

In [56]:
x = np.load(file='myfile.npy')  # Load arrays or pickled objects from .npy, .npz or pickled files.

print(x)

[4 6 9]


A safer way to handle the file is to open it in a `with` block, which closes it automatically:

In [57]:
# Save
with open(file='myfile.npy', mode='wb') as f:  # mode wb: Write Binary
    np.save(file=f, arr=a)

In [58]:
# Load
with open(file='myfile.npy', mode='rb') as f:  # mode rb: Read Binary
    x = np.load(file=f)
    
print(x)

[4 6 9]


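The `with` block has already closed the file, so an explicit `close()` is redundant here (calling it again is harmless):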

In [59]:
f.close()  # The close() method closes an open file.

Save to a text or CSV file _(slower than the binary formats)_:

In [63]:
arr = np.array([1.02, 5.3, 7, 3.1574])

np.savetxt('myfile.csv', X=arr, delimiter=',', fmt='%0.2f')  # Save an array to a text file (e.g. .txt or .csv).

In [64]:
d = np.loadtxt('myfile.csv', delimiter=',')  # Load data from a text file (e.g. .txt or .csv).

print(d)

[1.02 5.3  7.   3.16]


Save several arrays:

In [60]:
arr1 = np.array([1, 2])
arr2 = np.array([3, 4, 5])

np.savez('myfile.npz', x=arr1, y=arr2)  # Save several arrays into a single file in uncompressed .npz format.

In [61]:
data = np.load('myfile.npz')  # Load arrays or pickled objects from .npy, .npz or pickled files.

print(data)
print()

print(data.files)
print()

print(data['x'])
print(data['y'])

<numpy.lib.npyio.NpzFile object at 0x0000029A7CE7E610>

['x', 'y']

[1 2]
[3 4 5]


Close the open file:

In [62]:
data.close()
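
The object returned by `np.load` for an `.npz` file can also be used as a context manager, which removes the need for an explicit `close()`. The sketch below does a full round trip in a temporary directory; the file name `demo_arrays.npz` is hypothetical and chosen only for this example.

```python
import os
import tempfile

import numpy as np

arr1 = np.array([1, 2])
arr2 = np.array([3, 4, 5])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_arrays.npz')  # hypothetical file name
    np.savez(path, x=arr1, y=arr2)

    # The archive is closed automatically when the with block ends
    with np.load(path) as data:
        keys = sorted(data.files)
        x_loaded = data['x']
        y_loaded = data['y']

print(keys, x_loaded, y_loaded)
```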

`inner` and `outer` product:

In [91]:
a = np.array([1, 2, 3])
b = np.array([5, 6, 0])

# 1*5 + 2*6 + 3*0 == 17
print(np.inner(a, b))  # Inner product of two arrays.
print()

print(np.outer(a, b))  # Compute the outer product of two vectors.


17

[[ 5  6  0]
 [10 12  0]
 [15 18  0]]


In [93]:
x = np.array([[1, 2], 
              [3, 4]])

y = np.array([[5, 6], 
              [7, 8]])


print(np.dot(x, y))  # Dot product of two arrays.
print()

print(x @ y)  # The @ operator: matrix multiplication (same result as np.dot for 2-D arrays).

[[19 22]
 [43 50]]

[[19 22]
 [43 50]]


`inverse` and `pseudo-inverse` of a matrix: 

In [94]:
# Compute the (multiplicative) inverse of a matrix:
print(np.linalg.inv(x))
print()

# Compute the (Moore-Penrose) pseudo-inverse of a matrix:
print(np.linalg.pinv(x))

[[-2.   1. ]
 [ 1.5 -0.5]]

[[-2.   1. ]
 [ 1.5 -0.5]]
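
Two related points, sketched under made-up inputs: to solve a linear system it is usually preferable to call `np.linalg.solve` rather than forming the inverse explicitly, and the pseudo-inverse is also defined for non-square matrices, where `inv` would fail. The vector `b` and the matrix `tall` are assumptions for this example.

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])
b = np.array([5, 6])          # made-up right-hand side

# Solve x @ solution = b without computing inv(x)
solution = np.linalg.solve(x, b)

# The pseudo-inverse exists for a non-square matrix
tall = np.array([[1, 2],
                 [3, 4],
                 [5, 6]])
tall_pinv = np.linalg.pinv(tall)   # shape (2, 3)

print(solution)
print(tall_pinv.shape)
```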


Compute the `qr factorization` of a matrix:

In [112]:
m = np.array([[1, 2], 
              [3, 4]])

print(m)

[[1 2]
 [3 4]]


In [114]:
q, r = np.linalg.qr(m)  # Compute the qr factorization of a matrix.

print(q)
print()
print(r)

[[-0.31622777 -0.9486833 ]
 [-0.9486833   0.31622777]]

[[-3.16227766 -4.42718872]
 [ 0.         -0.63245553]]


In [115]:
print(np.allclose(m, np.dot(q, r)))  # m does equal qr.

True
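
Besides reproducing `m`, the two factors have defining properties that can be checked numerically: the columns of `q` are orthonormal and `r` is upper triangular. A short sketch of those checks:

```python
import numpy as np

m = np.array([[1, 2],
              [3, 4]])

q, r = np.linalg.qr(m)

q_is_orthonormal = np.allclose(q.T @ q, np.eye(2))   # Q^T Q equals the identity
r_is_upper = np.allclose(r, np.triu(r))              # nothing below the diagonal
reconstructs = np.allclose(q @ r, m)

print(q_is_orthonormal, r_is_upper, reconstructs)
```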


## <a id='toc1_2_'></a>[Pandas](#toc0_)

In [3]:
import pandas as pd

Series:

In [123]:
a = pd.Series([12, 8, 19, 17])

print(a)

0    12
1     8
2    19
3    17
dtype: int64


In [162]:
score = pd.Series([12, 8, 19, 17],
              index=['ali', 'taha', 'sara', 'omid'])

print(score)

ali     12
taha     8
sara    19
omid    17
dtype: int64


In [163]:
score = score.reindex(['ali', 'reza', 'sara', 'omid'])

print(score)

ali     12.0
reza     NaN
sara    19.0
omid    17.0
dtype: float64


In [164]:
print(score['ali'])
print(score.iloc[2])  # position-based access; plain score[2] is deprecated for a label-indexed Series

12.0
19.0
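
Label-based and position-based access can be made explicit with `.loc` and `.iloc`. The sketch below uses a small Series with made-up grades. Note that a label slice includes its end label, while a position slice excludes its end position.

```python
import pandas as pd

score = pd.Series([12.0, 20.0, 19.0, 17.0],
                  index=['ali', 'reza', 'sara', 'omid'])

by_label = score.loc['sara']             # access by index label
by_position = score.iloc[2]              # access by integer position

label_slice = score.loc['reza':'sara']   # end label is included
position_slice = score.iloc[1:2]         # end position is excluded

print(by_label, by_position)
print(label_slice)
print(position_slice)
```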


In [165]:
score['reza'] = 20

print(score)

ali     12.0
reza    20.0
sara    19.0
omid    17.0
dtype: float64


In [166]:
print(score.index)
print()

print(score.values)
print()

Index(['ali', 'reza', 'sara', 'omid'], dtype='object')

[12. 20. 19. 17.]



In [167]:
score.index.name = 'Names'

print(score)

Names
ali     12.0
reza    20.0
sara    19.0
omid    17.0
dtype: float64


In [168]:
score.name = 'Grade'

print(score)

Names
ali     12.0
reza    20.0
sara    19.0
omid    17.0
Name: Grade, dtype: float64


Drop Row:

In [169]:
score = score.drop(['omid', 'sara'])

print(score)

Names
ali     12.0
reza    20.0
Name: Grade, dtype: float64


Pop:

In [170]:
myser = pd.Series([12, 4, 5, 7, 2],
                  index=['a', 'b', 'c', 'd', 'e'])

print(myser)

a    12
b     4
c     5
d     7
e     2
dtype: int64


In [171]:
myser.pop('c')  # Return item and drops from series. Raise KeyError if not found.

5

In [172]:
print(myser)

a    12
b     4
d     7
e     2
dtype: int64


Missing values:

In [199]:
s = pd.Series([12, 4, 5, np.nan, 7, 2],
                  index=['a', 'b', 'c', 'd', 'e', 'f'])

print(s)

a    12.0
b     4.0
c     5.0
d     NaN
e     7.0
f     2.0
dtype: float64


In [200]:
print(s.isna())
print()

print(s.isna().sum())  # NaN values count

a    False
b    False
c    False
d     True
e    False
f    False
dtype: bool

1


In [201]:
print(s.notna())
print()

print(s.notna().sum())  # not NaN values count

a     True
b     True
c     True
d    False
e     True
f     True
dtype: bool

5


In [202]:
print(s.isin([5]))

a    False
b    False
c     True
d    False
e    False
f    False
dtype: bool


Sort:

In [209]:
print(s.sort_values(ascending=False))

a    12.0
e     7.0
c     5.0
b     4.0
f     2.0
d     NaN
dtype: float64


In [210]:
print(s)

a    12.0
b     4.0
c     5.0
d     NaN
e     7.0
f     2.0
dtype: float64


Rank of values:

In [207]:
print(s.rank())

a    5.0
b    2.0
c    3.0
d    NaN
e    4.0
f    1.0
dtype: float64


Duplicate indices/rows:

In [211]:
d = pd.Series([12, 4, 5, 7, 2],
                  index=['a', 'b', 'a', 'd', 'e'])

print(d)

a    12
b     4
a     5
d     7
e     2
dtype: int64


In [212]:
print(d['a'])

a    12
a     5
dtype: int64


In [213]:
print(d.index.is_unique)

False


In [214]:
print(d.describe())  # Generate descriptive statistics.

count     5.000000
mean      6.000000
std       3.807887
min       2.000000
25%       4.000000
50%       5.000000
75%       7.000000
max      12.000000
dtype: float64


The individual statistics reported by `describe()` can also be computed separately:

In [222]:
print(d.count())
print(d.mean())
print()

print(d.quantile([0.25, 0.5, 0.75, 0.8]))

5
6.0

0.25    4.0
0.50    5.0
0.75    7.0
0.80    8.0
dtype: float64


Condition:

In [227]:
print(d)

a    12
b     4
a     5
d     7
e     2
dtype: int64


In [228]:
print(d >= 5)

a     True
b    False
a     True
d     True
e    False
dtype: bool


In [229]:
print(d.where(d >= 5))

a    12.0
b     NaN
a     5.0
d     7.0
e     NaN
dtype: float64
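
`where` keeps the original shape and marks non-matching rows as `NaN`. Two common variations, sketched below, are plain boolean indexing, which drops the non-matching rows, and the `other` argument of `where`, which supplies a replacement value instead of `NaN`.

```python
import pandas as pd

d = pd.Series([12, 4, 5, 7, 2],
              index=['a', 'b', 'a', 'd', 'e'])

filtered = d[d >= 5]                   # only the rows where the condition is True
replaced = d.where(d >= 5, other=0)    # same length; non-matching rows become 0

print(filtered)
print(replaced)
```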


Remove duplicate rows:

In [236]:
myser = pd.Series(['a', 'a', 'b', 'e', 'f',  'a', 'f', 'd'])

print(myser)

0    a
1    a
2    b
3    e
4    f
5    a
6    f
7    d
dtype: object


In [237]:
print(myser.duplicated())
print()

print(myser.duplicated().sum())

0    False
1     True
2    False
3    False
4    False
5     True
6     True
7    False
dtype: bool

3


`keep : {'first', 'last', False}`, _default 'first'_\
Method to handle dropping duplicates:

- `'first'` : Drop duplicates except for the first occurrence.
- `'last'` : Drop duplicates except for the last occurrence.
- `False` : Drop all duplicates.

In [238]:
myser.drop_duplicates()  # Return Series with duplicate values removed.

0    a
2    b
3    e
4    f
7    d
dtype: object

In [239]:
print(myser)

0    a
1    a
2    b
3    e
4    f
5    a
6    f
7    d
dtype: object


In [240]:
myser.drop_duplicates(keep='last')

2    b
3    e
5    a
6    f
7    d
dtype: object
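
The third option, `keep=False`, is not shown above. As a sketch on the same data: it treats every occurrence of a repeated value as a duplicate, so only values that appear exactly once survive.

```python
import pandas as pd

myser = pd.Series(['a', 'a', 'b', 'e', 'f', 'a', 'f', 'd'])

only_unique = myser.drop_duplicates(keep=False)       # drop every repeated value
n_flagged = int(myser.duplicated(keep=False).sum())   # all occurrences are flagged

print(only_unique)
print(n_flagged)
```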

`prefix` and `suffix`:

In [241]:
s = pd.Series([1, 2, 3, 4])
print(s)

0    1
1    2
2    3
3    4
dtype: int64


In [242]:
s.add_prefix('item_')

item_0    1
item_1    2
item_2    3
item_3    4
dtype: int64

In [244]:
s.add_prefix('item: ')

item: 0    1
item: 1    2
item: 2    3
item: 3    4
dtype: int64

In [246]:
s.add_suffix('_item')

0_item    1
1_item    2
2_item    3
3_item    4
dtype: int64

In [248]:
w = s.add_suffix('_item')
print(w)

0_item    1
1_item    2
2_item    3
3_item    4
dtype: int64


Arithmetic between Series (values are aligned by index label; labels missing on either side give `NaN`):

In [251]:
a = pd.Series([1, 10, 3], index=['a', 'b', 'c'])
b = pd.Series([4, 5, 6], index=['a', 'b', 'd'])


print(a + b)
print()

print(a * b)
print()

print(a.mod(b))

# and other math operations.

a     5.0
b    15.0
c     NaN
d     NaN
dtype: float64

a     4.0
b    50.0
c     NaN
d     NaN
dtype: float64

a    1.0
b    0.0
c    NaN
d    NaN
dtype: float64


In [253]:
# Labels missing from one Series are filled with fill_value (0) before the operation
print(a.add(b, fill_value=0))  # a + b
print()

print(a.multiply(b, fill_value=0))  # a * b

# and other math operations.

a     5.0
b    15.0
c     3.0
d     6.0
dtype: float64

a     4.0
b    50.0
c     0.0
d     0.0
dtype: float64


Comparison operators:
- `eq`: Equal.
- `ne`: Not equal.
- `gt`: Greater than.
- `ge`: Greater than or equal to.
- `lt`: Less than.
- `le`: Less than or equal to.

In [4]:
s1 = pd.Series([8, 2, 12, 6, 5, 4])
s2 = pd.Series([20, 2, 7, 6, 2, 1])

In [10]:
print(s1.eq(s2))  # same as s1 == s2

0    False
1     True
2    False
3     True
4    False
5    False
dtype: bool


In [11]:
print(s1.gt(s2))  # same as s1 > s2

0    False
1    False
2     True
3    False
4     True
5     True
dtype: bool


In [13]:
print(s1.ge(s2))  # same as s1 >= s2

0    False
1     True
2     True
3     True
4     True
5     True
dtype: bool


`argmax` and `argmin`:

In [14]:
score = pd.Series({'Java': 15,
                   'C++': 20,
                   'Python': 12,
                   'Pascal': 9})

print(score)

Java      15
C++       20
Python    12
Pascal     9
dtype: int64


In [22]:
print(score.argmax())
print(score.idxmax())  # Return the row label of the maximum value.
print(score.max())

1
C++
20


In [23]:
print(score.argmin())
print(score.idxmin())  # Return the row label of the minimum value.
print(score.min())

3
Pascal
9


_Cumulative sum_ and _Cumulative product_:

In [24]:
s = pd.Series([3, 2, np.nan, 5, 0])
print(s)

0    3.0
1    2.0
2    NaN
3    5.0
4    0.0
dtype: float64


In [26]:
print(s.cumsum())

0     3.0
1     5.0
2     NaN
3    10.0
4    10.0
dtype: float64


In [27]:
print(s.cumprod())

0     3.0
1     6.0
2     NaN
3    30.0
4     0.0
dtype: float64
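
By default the cumulative methods skip `NaN`: the missing position stays `NaN` and the running total continues after it. The sketch below shows two alternatives, assuming they suit the data at hand: `skipna=False` lets the `NaN` propagate to all later positions, and `fillna(0)` treats the missing value as zero.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 2, np.nan, 5, 0])

propagated = s.cumsum(skipna=False)   # everything from the NaN onward becomes NaN
as_zero = s.fillna(0).cumsum()        # missing value counted as 0

print(propagated)
print(as_zero)
```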


Value count:

In [28]:
myser = pd.Series(['c', 'a', 'd', 'a', 'a', 'c', 'b', 'b', 'c', 'c'])
print(myser)

0    c
1    a
2    d
3    a
4    a
5    c
6    b
7    b
8    c
9    c
dtype: object


In [32]:
print(myser.value_counts())

c    4
a    3
b    2
d    1
dtype: int64


or with the top-level function _(deprecated since pandas 2.1; prefer the method above)_:

In [33]:
print(pd.value_counts(myser))

c    4
a    3
b    2
d    1
dtype: int64
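
`value_counts` can also report relative frequencies instead of raw counts by passing `normalize=True`. A brief sketch on the same data:

```python
import pandas as pd

myser = pd.Series(['c', 'a', 'd', 'a', 'a', 'c', 'b', 'b', 'c', 'c'])

shares = myser.value_counts(normalize=True)   # fractions that sum to 1
by_label = shares.sort_index()                # order by value label instead of frequency

print(shares)
print(by_label)
```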


Unique values:

In [34]:
print(myser.unique())

['c' 'a' 'd' 'b']


or:

In [37]:
print(pd.unique(myser))

['c' 'a' 'd' 'b']


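Append _(`Series.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell below only runs on older versions)_: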

In [40]:
s1 = pd.Series([8, 2, 12, 6, 5, 4])
s2 = pd.Series([20, 2, 7, 6, 2, 1])

In [41]:
print(s1.append(s2))

0     8
1     2
2    12
3     6
4     5
5     4
0    20
1     2
2     7
3     6
4     2
5     1
dtype: int64



or with `pd.concat` _(the supported way in current pandas)_:

In [47]:
print(pd.concat([s1, s2], ignore_index=True))

0      8
1      2
2     12
3      6
4      5
5      4
6     20
7      2
8      7
9      6
10     2
11     1
dtype: int64


Combine:

In [48]:
s1 = pd.Series({'ali': 16, 'sara': 17})
s2 = pd.Series({'ali': 18, 'sara': 15, 'taha': 19})

Combine the Series with a Series or scalar according to `func`.

In [56]:
print(s1.combine(s2, func=max))

ali     18.0
sara    17.0
taha     NaN
dtype: float64


In [57]:
print(s1.combine(s2, func=max, fill_value=0))

ali     18
sara    17
taha    19
dtype: int64


In [58]:
print(s1.combine(s2, func=min))

ali     16.0
sara    15.0
taha     NaN
dtype: float64


In [59]:
print(s1.combine(s2, func=min, fill_value=0))

ali     16
sara    15
taha     0
dtype: int64


apply:

In [66]:
myser = pd.Series([10, 5, 100])
print(myser)

0     10
1      5
2    100
dtype: int64


`apply`: Invoke function on values of Series.


In [67]:
print(myser.apply(func=np.log10))

0    1.00000
1    0.69897
2    2.00000
dtype: float64


In [70]:
print(myser.apply(func=np.sqrt))

0     3.162278
1     2.236068
2    10.000000
dtype: float64


In [72]:
def f(x):
    return x**2

print(myser.apply(f))

0      100
1       25
2    10000
dtype: int64


Alternatively, call the function on the Series directly. For NumPy functions and plain arithmetic this is vectorized and gives the same result:

In [75]:
print(np.sqrt(myser))
print()

print(f(myser))

0     3.162278
1     2.236068
2    10.000000
dtype: float64

0      100
1       25
2    10000
dtype: int64


Using a `lambda` function:

In [76]:
lam = lambda x: x**2

print(myser.apply(lam))

0      100
1       25
2    10000
dtype: int64


Transform:

In [89]:
ser = pd.Series([10, 4, 9])
print(ser)

0    10
1     4
2     9
dtype: int64


In [90]:
ser.transform([np.sqrt, np.log10])

       sqrt     log10
0  3.162278  1.000000
1  2.000000  0.602060
2  3.000000  0.954243


`agg` _(Aggregate)_:

Aggregate using one or more operations over the specified axis.

In [97]:
ser = pd.Series([10, 4, 9, 2, 18, 6])
print(ser)

0    10
1     4
2     9
3     2
4    18
5     6
dtype: int64


In [105]:
print(ser.agg(['min', 'max', 'mean', 'var', 'std']))

min      2.000000
max     18.000000
mean     8.166667
var     32.166667
std      5.671567
dtype: float64


n Largest values:

In [106]:
data = {'a': 6, 'b': 3, 'c': 8, 'd': 5,
        'e': 9, 'f': 3, 'g': 5, 'h': 4, 'i': 5}

myser = pd.Series(data)
print(myser)

a    6
b    3
c    8
d    5
e    9
f    3
g    5
h    4
i    5
dtype: int64


In [114]:
print(myser.nlargest(4))  # Return the largest n (default=5) elements.

e    9
c    8
a    6
d    5
dtype: int64


In [113]:
print(myser.nlargest(4, keep='last'))


e    9
c    8
a    6
i    5
dtype: int64


n Smallest values:

In [115]:
print(myser.nsmallest())

b    3
f    3
h    4
d    5
g    5
dtype: int64


Group by:

In [127]:
cars_name = ['BMW', 'BMW', 'Benz', 'Benz']
speed = [220, 180, 230, 200]

data = pd.Series(speed, index=cars_name, name='Max Speed')
print(data)

BMW     220
BMW     180
Benz    230
Benz    200
Name: Max Speed, dtype: int64


In [129]:
print(data.groupby(cars_name).max())

BMW     220
Benz    230
Name: Max Speed, dtype: int64


In [130]:
print(data.groupby(cars_name).mean())

BMW     200.0
Benz    215.0
Name: Max Speed, dtype: float64


Between:

`between`: returns a boolean Series equivalent to _left <= series <= right_ (both bounds are included by default):

In [131]:
s = pd.Series([15, 9, 18, 20])
print(s)

0    15
1     9
2    18
3    20
dtype: int64


In [132]:
print(s.between(10, 20))

0     True
1    False
2     True
3     True
dtype: bool
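
Whether the bounds themselves count is controlled by the `inclusive` argument. The sketch below assumes pandas 1.3 or newer, where `inclusive` accepts the strings `'both'`, `'neither'`, `'left'` and `'right'`.

```python
import pandas as pd

s = pd.Series([15, 9, 18, 20])

both = s.between(10, 20)                          # default: 10 <= s <= 20
neither = s.between(10, 20, inclusive='neither')  # 10 < s < 20
left_only = s.between(15, 20, inclusive='left')   # 15 <= s < 20

print(both.tolist(), neither.tolist(), left_only.tolist())
```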


Drop Na:

In [133]:
s = pd.Series([7, 2, np.nan, 18, 34, np.nan])
print(s)

0     7.0
1     2.0
2     NaN
3    18.0
4    34.0
5     NaN
dtype: float64


In [134]:
print(s.dropna(inplace=False))

0     7.0
1     2.0
3    18.0
4    34.0
dtype: float64
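
Dropping rows is not the only option. When the positions matter, the missing values can be filled instead, for example with the mean of the valid values. A minimal sketch, assuming mean imputation is acceptable for the data at hand:

```python
import numpy as np
import pandas as pd

s = pd.Series([7, 2, np.nan, 18, 34, np.nan])

mean_value = s.mean()           # NaN values are skipped by default
filled = s.fillna(mean_value)   # returns a new Series; s itself is unchanged

print(mean_value)
print(filled)
```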


Convert a pandas Series to other types:

In [151]:
ser = pd.Series([15, 9, 18, 20])

print(type(ser))
print(ser)

<class 'pandas.core.series.Series'>
0    15
1     9
2    18
3    20
dtype: int64


- `.to_numpy()`
- `.to_csv()`
- `.to_dict()`
- `.to_clipboard()`
- and ...

example:

In [152]:
arr = ser.to_numpy()

print(type(arr))
print(arr)

<class 'numpy.ndarray'>
[15  9 18 20]


In [153]:
ser_dict = ser.to_dict()  # avoid the name `dict`, which would shadow the built-in type

print(type(ser_dict))
print(ser_dict)

<class 'dict'>
{0: 15, 1: 9, 2: 18, 3: 20}


NumPy array to pandas Series:

In [154]:
ser = pd.Series(arr)

print(type(ser))
print(ser)

<class 'pandas.core.series.Series'>
0    15
1     9
2    18
3    20
dtype: int64


Replace:

In [175]:
s = pd.Series([15, 9, 9, 18, 9, 20])
print(s)

0    15
1     9
2     9
3    18
4     9
5    20
dtype: int64


In [176]:
print(s.replace(to_replace=9, value=10, inplace=False))

0    15
1    10
2    10
3    18
4    10
5    20
dtype: int64


Repeat:

In [163]:
s = pd.Series([15, 9, 18, 9, 20])
print(s)

0    15
1     9
2    18
3     9
4    20
dtype: int64


In [177]:
print(s.repeat(2))

0    15
0    15
1     9
1     9
2    18
2    18
3     9
3     9
4    20
4    20
dtype: int64


Multi index:

In [192]:
cars = [['BMW', 'BMW', 'Benz', 'Benz'],
        ['A', 'B', 'A', 'B']]

speed = [220, 180, 230, 200]

In [193]:
mi = pd.MultiIndex.from_arrays(cars, names=('Machine', 'Class'))
data = pd.Series(speed, index=mi)

print(data)

Machine  Class
BMW      A        220
         B        180
Benz     A        230
         B        200
dtype: int64


Group by _Machine_:

In [195]:
data.groupby(level='Machine').max()

Machine
BMW     220
Benz    230
dtype: int64

In [197]:
data.groupby(level=0).max()

Machine
BMW     220
Benz    230
dtype: int64

Group by _Class_:

In [196]:
data.groupby(level='Class').max()

Class
A    230
B    200
dtype: int64

In [198]:
data.groupby(level=1).max()

Class
A    230
B    200
dtype: int64
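
Apart from grouping, a MultiIndex also supports direct selection on either level. The sketch below shows a few common forms on the same data: a full key as a tuple, an outer-level label, a cross-section on the inner level with `xs`, and `unstack` to pivot one level into columns.

```python
import pandas as pd

cars = [['BMW', 'BMW', 'Benz', 'Benz'],
        ['A', 'B', 'A', 'B']]
speed = [220, 180, 230, 200]

mi = pd.MultiIndex.from_arrays(cars, names=('Machine', 'Class'))
data = pd.Series(speed, index=mi)

one_value = data.loc[('BMW', 'B')]       # full key: a single value
bmw = data.loc['BMW']                    # outer label: Series indexed by Class
class_a = data.xs('A', level='Class')    # inner label: Series indexed by Machine
table = data.unstack(level='Class')      # DataFrame: Machine rows, Class columns

print(one_value)
print(bmw)
print(class_a)
print(table)
```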

Data frame: