<img src="https://docs.google.com/uc?export=download&id=1lAgBWrlSFP0A4sLSIatnKd6iv0bFOEov">

# Data manipulation: Pandas

# 1- Pandas Series

* **pandas** is a Python library that provides **data structures** and 
**manipulation** tools. It is often combined with 
other numerical libraries such as **NumPy**. 

* In **pandas** we can work with **tabular** or **heterogeneous** data, 
for example with its **DataFrame** structure.

* The other important pandas structure is the **Series**.
* Both structures rely on another object defined in pandas, 
the **Index** object, which holds their labels.

In [54]:
from pandas import Series as S, DataFrame as DF
# we could use pandas.Series([1,2,3])
s1 = S([1,2,3])
print("s1==\n",s1)
df1 = DF([1,2,3])
print("df1==\n",df1)

s1==
 0    1
1    2
2    3
dtype: int64
df1==
    0
0  1
1  2
2  3


* A **Series** is a **sequence** of values of the **same type** associated with a sequence of labels called the **index**.

In [55]:
# printing the index and the values of a series object
print("index==",s1.index)
print("values==",s1.values)

index== RangeIndex(start=0, stop=3, step=1)
values== [1 2 3]
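
The Index object mentioned above can also be built and inspected directly. The following is a minimal sketch, not part of the original notebook; the names `labels`, `s` and `df` are illustrative. It assumes only that an Index can be shared between structures and that it does not support item assignment.

```python
import pandas as pd

# Build an Index explicitly and reuse it for a Series and a DataFrame
labels = pd.Index(["a", "b", "c"])
s = pd.Series([1, 2, 3], index=labels)
df = pd.DataFrame({"x": [10, 20, 30]}, index=labels)

# Both structures expose the same labels
same_labels = s.index.equals(df.index)
print(same_labels)

# An Index is immutable: item assignment raises TypeError
try:
    labels[0] = "z"
    index_is_mutable = True
except TypeError:
    index_is_mutable = False
print(index_is_mutable)
```

Immutability is what makes it safe for several structures to share one Index.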


In [56]:
import numpy as np
# d is a dictionary with three keys
d= {"Third":3,"Second":2,"First":1}
# creating a series from a list with an explicit index
s2 = S(list("His"),index=[1,2,3])
print(s2)
# a scalar value is repeated for every label of the index
s2_2 = S("This",index=[1,2])
print(s2_2)
print("---------------")
# creating a series from a dict, first without and then with an explicit index
# (index labels that are not keys of the dict get NaN)
s3 = S({"Third":3,"Second":2,"First":1})
print(s3)
s3_2=S(d,["Third","First","Other"])
print(s3_2)
print("---------------")
# creating a series with ndarray and an associated index
S4 = S(np.random.randn(2),range(2))
print(S4)

1    H
2    i
3    s
dtype: object
1    This
2    This
dtype: object
---------------
First     1
Second    2
Third     3
dtype: int64
Third    3.0
First    1.0
Other    NaN
dtype: float64
---------------
0    0.997399
1   -0.348807
dtype: float64
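
The `Other    NaN` entry above appears because `"Other"` is not a key of the dictionary. A short sketch of how such missing entries can be detected and removed; the variable names are illustrative and not from the original notebook.

```python
import pandas as pd

d = {"Third": 3, "Second": 2, "First": 1}
# "Other" is not a key of d, so its value is missing (NaN)
s = pd.Series(d, index=["Third", "First", "Other"])

# NaN is a float, which is why the integers were upcast to float64
print(s.dtype)

# isna() flags the missing entries; dropna() returns a copy without them
missing = s.isna()
clean = s.dropna()
print(missing)
print(clean)
```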


# 2- Pandas DataFrame

* A **DataFrame** is a rectangular **table** of data organized in **rows** and **columns**, each **associated** with its own **index** of labels.

In [57]:
# creating a DataFrame from a list of lists
df2= DF([[1,2,3,4],list("Try it"),list(np.random.randn(5))])
df2

|  | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4.0 |  |  |
| 1 | T | r | y |  | i | t |
| 2 | 0.498491 | -0.324574 | -2.7263 | 0.415938 | -1.26808 |  |


In [58]:
# creating a DataFrame from list of series:
df4 = DF([S([1,2,3],index=list("abc")),S(["Home","Work","Travel"],index=list("abc"))],index=list("GH"))
df4

|  | a | b | c |
|---|---|---|---|
| G | 1 | 2 | 3 |
| H | Home | Work | Travel |


In [59]:
# creating a DataFrame from a dict of lists
df3 = DF({"Verbs":["Call","Eat","Drive"],"Occurences":[25,3,12]})
df3

|  | Occurences | Verbs |
|---|---|---|
| 0 | 25 | Call |
| 1 | 3 | Eat |
| 2 | 12 | Drive |


In [60]:
# creating a DataFrame from a dict of dicts
df4 = DF({"Verbs":{0:"Call",1:"Eat",2:"Drive"},"Occurences":{3:25,4:3,5:12}})
df4

|  | Occurences | Verbs |
|---|---|---|
| 0 | NaN | Call |
| 1 | NaN | Eat |
| 2 | NaN | Drive |
| 3 | 25.0 | NaN |
| 4 | 3.0 | NaN |
| 5 | 12.0 | NaN |


In [61]:
# creating a DataFrame from a dict of dicts, specifying the rows and columns
df5 = DF({"Verbs":{0:"Call",1:"Eat",2:"Drive"}},index=[1,2],columns=["Verbs"])
df5

|  | Verbs |
|---|---|
| 1 | Eat |
| 2 | Drive |
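
Another common input, not used in the cells above, is a list of dictionaries with one dictionary per row. The sketch below is illustrative only (the `records` list is made up for the example) and also shows how to inspect the shape and labels of the result.

```python
import pandas as pd

records = [
    {"Verbs": "Call", "Occurences": 25},
    {"Verbs": "Eat", "Occurences": 3},
    {"Verbs": "Drive"},  # no "Occurences" key: the cell becomes NaN
]
df = pd.DataFrame(records)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels
print(df.index)          # default RangeIndex for the rows
```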


# 3- Indexing and reindexing


* Indexing and filtering in **Series** and **DataFrame**

In [62]:
# creating a Series
ser= S(range(1,4),index=list("abc"))
# Selecting one element using the given index
print('ser["a"]==',ser["a"])
# Selecting the same element by integer position
# (recent pandas versions prefer ser.iloc[0] for positional access)
print("ser[0]==",ser[0])
# Selecting a slice of elements by label: the end label is included
print('ser["a":"b"]==',ser["a":"b"])
# Slicing by integer position excludes the end, so the result differs:
print('ser[0:1]==',ser[0:1])
# Selecting or filtering values greater than 2
print('ser[ser>2]==',ser[ser>2])
# Selecting a list of elements
print('ser[["a","c"]]==',ser[["a","c"]])
# Assigning a value to a selected slice modifies the original Series
ser["a":"b"]=1000
print('ser==',ser)

ser["a"]== 1
ser[0]== 1
ser["a":"b"]== a    1
b    2
dtype: int64
ser[0:1]== a    1
dtype: int64
ser[ser>2]== c    3
dtype: int64
ser[["a","c"]]== a    1
c    3
dtype: int64
ser== a    1000
b    1000
c       3
dtype: int64
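
The difference between label-based and position-based selection can be made explicit with `loc` and `iloc`, which avoids any ambiguity when the labels are not integers. A minimal sketch reusing the same data as above:

```python
import pandas as pd

ser = pd.Series(range(1, 4), index=list("abc"))

# Label-based slice: the end label "b" is included
by_label = ser.loc["a":"b"]
# Position-based slice: the end position 1 is excluded
by_position = ser.iloc[0:1]

print(by_label.tolist())     # [1, 2]
print(by_position.tolist())  # [1]

# Single element, by label and by position
first_by_label = ser.loc["a"]
first_by_position = ser.iloc[0]
print(first_by_label == first_by_position)
```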


In [63]:
# creating a DataFrame
dfr= DF([["a",1],["b",2],["c",3]],index=["r1","r2","r3"],columns=["letters","digits"])
# Selecting a row by its label
print('dfr.loc["r1"]==\n',dfr.loc["r1"])

# Selecting a column
print('dfr["letters"]==\n',dfr["letters"])
print('dfr.letters==\n',dfr.letters)
print('dfr.loc[:,"letters"]==\n',dfr.loc[:,"letters"])

# Selecting the same column by its integer position
print("dfr.iloc[:,0]==\n",dfr.iloc[:,0])

# For rows, to select by integer position we can use
# a slice or iloc (iloc with the same slice produces the same result)

print("dfr[:1]==\n",dfr[:1])
print("dfr.iloc[0]==\n",dfr.iloc[0])
print("dfr.iloc[0:1]==\n",dfr.iloc[0:1])

# Selecting a slice of elements: for columns we can use :
print("dfr.iloc[:,0:2]==\n",dfr.iloc[:,0:2])
print('dfr.loc[:,"letters":"letters"]==\n',dfr.loc[:,"letters":"letters"])


# Selecting a slice of elements: for rows we can use row labels

print('dfr["r1":"r2"]==\n',dfr["r1":"r2"])
#          with loc

print('dfr.loc["r1":"r2"]==\n',dfr.loc["r1":"r2"])

# or  default indexes
print('dfr[0:2]==\n',dfr[0:2])
#           with iloc
print('dfr.iloc[0:2]==\n',dfr.iloc[0:2])

# Selecting or filtering values greater than 2: selecting rows
print('dfr[dfr["digits"]>2]==\n',dfr[dfr["digits"]>2])
# Comparing every element of the DataFrame with 2, then masking with the result:

print('dfr>2==\n',dfr>2)
print('dfr[dfr>2]==\n',dfr[dfr>2])

# Selecting a list of rows
#     with labels: only with loc
print('dfr.loc[["r1","r3"]]==\n',dfr.loc[["r1","r3"]])
#     with default indexes: only with iloc
print('dfr.iloc[[0,2]]==\n',dfr.iloc[[0,2]])

# Selecting a list of columns:
#     with labels
print('dfr[["digits","letters"]]==\n',dfr[["digits","letters"]])
print('dfr.loc[:,["digits","letters"]]==\n',dfr.loc[:,["digits","letters"]])
#     with default indexes: only with iloc
print('dfr.iloc[:,[1,0]]==\n',dfr.iloc[:,[1,0]])


# selecting one value using the labels and default indexes with at and iat
print('dfr.at["r1","digits"]==\n',dfr.at["r1","digits"])
print('dfr.iat[0,1]==\n',dfr.iat[0,1])
# selecting one value using the labels and default indexes with loc and iloc
print('dfr.loc["r1","digits"]==\n',dfr.loc["r1","digits"])
print('dfr.iloc[0,1]==\n',dfr.iloc[0,1])
# Assigning a value to a selected slice modifies the original DataFrame
dfr["r1":"r2"]=1000
dfr


dfr.loc["r1"]==
 letters    a
digits     1
Name: r1, dtype: object
dfr["letters"]==
 r1    a
r2    b
r3    c
Name: letters, dtype: object
dfr.letters==
 r1    a
r2    b
r3    c
Name: letters, dtype: object
dfr.loc[:,"letters"]==
 r1    a
r2    b
r3    c
Name: letters, dtype: object
dfr.iloc[:,0]==
 r1    a
r2    b
r3    c
Name: letters, dtype: object
dfr[:1]==
    letters  digits
r1       a       1
dfr.iloc[0]==
 letters    a
digits     1
Name: r1, dtype: object
dfr.iloc[0:1]==
    letters  digits
r1       a       1
dfr.iloc[:,0:2]==
    letters  digits
r1       a       1
r2       b       2
r3       c       3
dfr.loc[:,"letters":"letters"]==
    letters
r1       a
r2       b
r3       c
dfr["r1":"r2"]==
    letters  digits
r1       a       1
r2       b       2
dfr.loc["r1":"r2"]==
    letters  digits
r1       a       1
r2       b       2
dfr[0:2]==
    letters  digits
r1       a       1
r2       b       2
dfr.iloc[0:2]==
    letters  digits
r1       a       1
r2       b       2
[remaining printed output truncated]

|  | letters | digits |
|---|---|---|
| r1 | 1000 | 1000 |
| r2 | 1000 | 1000 |
| r3 | c | 3 |
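
Row filtering and column selection can also be combined in a single `loc` call, both for reading and for assignment. The following sketch rebuilds the same `dfr` frame; the `>= 2` condition is chosen only for illustration.

```python
import pandas as pd

dfr = pd.DataFrame(
    [["a", 1], ["b", 2], ["c", 3]],
    index=["r1", "r2", "r3"],
    columns=["letters", "digits"],
)

# Rows where digits >= 2, restricted to the "letters" column
selected = dfr.loc[dfr["digits"] >= 2, "letters"]
print(selected.tolist())  # ['b', 'c']

# Assign through loc to change only the "digits" cells of those rows
dfr.loc[dfr["digits"] >= 2, "digits"] = 0
print(dfr["digits"].tolist())  # [1, 0, 0]
```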


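* **Reindexing**: creating a **new** Series or DataFrame whose **values** are rearranged to match a given list of labels. Labels that are absent from the original get missing values unless a fill method or a fill value is given.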

In [64]:
# reindexing a Series, filling the missing values with
# a forward fill method
print(s1)
rs1=s1.reindex([3,2,1,0],method="ffill")
print(rs1)
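# reindexing a DataFrame, filling the missing values
# with a given argument value.
# Note: df4 here is the dict-of-dicts frame from In [60], so none of the
# requested labels exist and every cell is filled with -1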
rdf4 =df4.reindex(list("HGI"),columns=["c","b"],fill_value=-1)
rdf4

0    1
1    2
2    3
dtype: int64
3    3
2    3
1    2
0    1
dtype: int64


|  | c | b |
|---|---|---|
| H | -1 | -1 |
| G | -1 | -1 |
| I | -1 | -1 |
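
For comparison, here is a sketch of reindexing a frame in which some of the requested labels do exist. The `frame` below is a hypothetical example with row labels `G`, `H` and column labels `a`, `b`, `c`; only the label `I` is new and receives the fill value.

```python
import pandas as pd

frame = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=list("GH"),
    columns=list("abc"),
)

# "H" and "G" exist, "I" does not; columns "c" and "b" both exist
reindexed = frame.reindex(index=list("HGI"), columns=["c", "b"], fill_value=-1)
print(reindexed)

# reindex() returns a new object: the original frame is unchanged
print(frame.shape)
```

Existing rows keep their values in the new order, and only row `I` is filled with -1.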


# 4- Some operations

* **Dropping**: creating a **new** Series or DataFrame by **dropping** the **rows** or **columns** of a given Series or DataFrame.

In [65]:
# creating a new series
newS= S(np.random.randn(3), index=list("abc"))
print(newS)
# Dropping the first and last values
print(newS.drop(['a','c']))

a    1.121524
b   -0.641447
c    1.108643
dtype: float64
b   -0.641447
dtype: float64


In [66]:
# creating a new DataFrame
newDF= DF(np.random.randn(6).reshape(2,3), index=list("ab"),columns=list("ABC"))
print(newDF)
# Dropping the second Column
print(newDF.drop('B',axis=1))

          A         B         C
a -0.943950  0.432016  1.552039
b -0.517964 -1.367501 -0.660572
          A         C
a -0.943950  1.552039
b -0.517964 -0.660572
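
Because `drop` returns a new object, the original structure keeps all its rows and columns. A minimal sketch with fixed values (instead of random ones) that drops a row and a column in one call using the `index` and `columns` keywords:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=list("ab"), columns=list("ABC"))

# Drop one row and one column in a single call
smaller = df.drop(index="a", columns="B")
print(smaller)

# drop() returned a new object; df still has every row and column
print(df.shape)  # (2, 3)
```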


* We can apply arithmetic operations using **operators** or **defined methods**:

In [67]:
df1 = DF(np.arange(6).reshape(2,3),index=["r1","r2"],columns=["c1","c2","c3"])
df1


|  | c1 | c2 | c3 |
|---|---|---|---|
| r1 | 0 | 1 | 2 |
| r2 | 3 | 4 | 5 |


In [68]:
df2 = DF(np.ones((3,3)),index=["r1","r2","r3"],columns=["c1","c2","c3"] )
df2

|  | c1 | c2 | c3 |
|---|---|---|---|
| r1 | 1.0 | 1.0 | 1.0 |
| r2 | 1.0 | 1.0 | 1.0 |
| r3 | 1.0 | 1.0 | 1.0 |


In [69]:
# the rows and columns will be aligned
df1 + df2

|  | c1 | c2 | c3 |
|---|---|---|---|
| r1 | 1.0 | 2.0 | 3.0 |
| r2 | 4.0 | 5.0 | 6.0 |
| r3 | NaN | NaN | NaN |


In [70]:
# using the add method : we can fill the missing values
# the fill value will replace the missing values before applying the operation
df1.add(df2,fill_value=5)

|  | c1 | c2 | c3 |
|---|---|---|---|
| r1 | 1.0 | 2.0 | 3.0 |
| r2 | 4.0 | 5.0 | 6.0 |
| r3 | 6.0 | 6.0 | 6.0 |
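
One detail of `fill_value` is worth a sketch: it is used only when exactly one of the two operands is missing at a position. If both are missing, the result stays missing. The frames `left` and `right` below are made up for this illustration.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"c1": [0.0, np.nan]}, index=["r1", "r2"])
right = pd.DataFrame({"c1": [1.0, np.nan, 1.0]}, index=["r1", "r2", "r3"])

result = left.add(right, fill_value=5)
print(result)
# r1: 0 + 1 = 1 (both present)
# r2: missing on both sides, so the result is NaN
# r3: missing only in left, so 5 + 1 = 6
```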


In [71]:
ser1 = S(range(6),index=list("abcdef"))
ser1

a    0
b    1
c    2
d    3
e    4
f    5
dtype: int64

In [72]:
ser2= S([1]*5,index=list("abcde"))
ser2

a    1
b    1
c    1
d    1
e    1
dtype: int64

In [73]:
# applying a division between two series
ser1.div(ser2)


a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
f    NaN
dtype: float64

In [74]:
# applying a reversed division between two series
ser1.rdiv(ser2)

a         inf
b    1.000000
c    0.500000
d    0.333333
e    0.250000
f         NaN
dtype: float64

In [75]:
ser3 = S([5,6],index=["c1","c2"])
ser3

c1    5
c2    6
dtype: int64

In [76]:
# operation between a Series and a DataFrame
df1.add(ser3) # or df1.add(ser3,axis=1)

|  | c1 | c2 | c3 |
|---|---|---|---|
| r1 | 5.0 | 7.0 | NaN |
| r2 | 8.0 | 10.0 | NaN |
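
By default the Series labels are matched against the column labels. To match them against the row labels instead, `axis=0` can be passed. A short sketch with a hypothetical `per_row` Series:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["r1", "r2"],
    columns=["c1", "c2", "c3"],
)
per_row = pd.Series([10, 20], index=["r1", "r2"])

# axis=0 matches the Series labels against the row labels
result = df.add(per_row, axis=0)
print(result)
```

Here 10 is added to every value of row `r1` and 20 to every value of row `r2`.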


* We can **apply** functions to pandas structures either by passing the structures as arguments or by using the **apply**, **map**, or **applymap** methods.

In [77]:
# calling the NumPy function mean with a DataFrame as argument
np.mean(df1)

c1    1.5
c2    2.5
c3    3.5
dtype: float64

In [78]:
def f1(x):
  return S(np.sum(x),index=["sum"])

df1.apply(f1,axis=0)


|  | c1 | c2 | c3 |
|---|---|---|---|
| sum | 3 | 5 | 7 |


In [79]:
def f2(x):
  return float(x)
# applymap applies f2 to every element
# (since pandas 2.1 the same method is named DataFrame.map)
df1.applymap(f2)


|  | c1 | c2 | c3 |
|---|---|---|---|
| r1 | 0.0 | 1.0 | 2.0 |
| r2 | 3.0 | 4.0 | 5.0 |


In [80]:
def f3(x):
   return np.where(x>3,"Yes","No")
  
df1.loc["r2"].map(f3)

c1     No
c2    Yes
c3    Yes
Name: r2, dtype: object
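
Two further variations, sketched under the assumption of the same 2x3 frame as above: `apply` with `axis=1` passes each row to the function, and `Series.map` also accepts a dictionary instead of a function. The lambda and the dictionary are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["r1", "r2"],
    columns=["c1", "c2", "c3"],
)

# axis=1 passes each row (as a Series) to the function
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)
print(row_range.tolist())  # [2, 2]

# Series.map with a dict replaces each value by the matching dict entry
labels = df["c1"].map({0: "zero", 3: "three"})
print(labels.tolist())  # ['zero', 'three']
```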

* pandas structures can be **sorted** either by **index** or by **values**.
* The values can also be **ranked** according to their **position** in the **sorted** order.

In [81]:
df1.sort_index(ascending=False)

|  | c1 | c2 | c3 |
|---|---|---|---|
| r2 | 3 | 4 | 5 |
| r1 | 0 | 1 | 2 |


In [82]:
df1.iat[0,1]=1000
df1.sort_values(by=["c2"])

|  | c1 | c2 | c3 |
|---|---|---|---|
| r2 | 3 | 4 | 5 |
| r1 | 0 | 1000 | 2 |


In [83]:
ser4=S([6,5,1,6,9,0,-3])
ser4.sort_values()

6   -3
5    0
2    1
1    5
0    6
3    6
4    9
dtype: int64

In [84]:
ser4.rank()

0    5.5
1    4.0
2    3.0
3    5.5
4    7.0
5    2.0
6    1.0
dtype: float64
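
The value 5.5 above is the average of positions 5 and 6, which the two tied values 6 share. The `method` argument of `rank` changes how ties are handled. A sketch comparing a few of the available methods on the same data:

```python
import pandas as pd

ser = pd.Series([6, 5, 1, 6, 9, 0, -3])

# Tied values (the two 6s) are handled differently by each method
avg_rank = ser.rank()                  # default: average of the tied positions
min_rank = ser.rank(method="min")      # lowest position in the tie
first_rank = ser.rank(method="first")  # order of appearance breaks the tie
dense_rank = ser.rank(method="dense")  # like "min", with no gaps afterwards

print(avg_rank.tolist())    # [5.5, 4.0, 3.0, 5.5, 7.0, 2.0, 1.0]
print(min_rank.tolist())    # [5.0, 4.0, 3.0, 5.0, 7.0, 2.0, 1.0]
print(first_rank.tolist())  # [5.0, 4.0, 3.0, 6.0, 7.0, 2.0, 1.0]
print(dense_rank.tolist())  # [5.0, 4.0, 3.0, 5.0, 6.0, 2.0, 1.0]
```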

* A set of **methods** and **functions** produces **descriptive** statistics about the **data** contained in a structure.

In [85]:
df1

|  | c1 | c2 | c3 |
|---|---|---|---|
| r1 | 0 | 1000 | 2 |
| r2 | 3 | 4 | 5 |


In [86]:
# some descriptive values
df1.describe()

|  | c1 | c2 | c3 |
|---|---|---|---|
| count | 2.0 | 2.0 | 2.0 |
| mean | 1.5 | 502.0 | 3.5 |
| std | 2.12132 | 704.278354 | 2.12132 |
| min | 0.0 | 4.0 | 2.0 |
| 25% | 0.75 | 253.0 | 2.75 |
| 50% | 1.5 | 502.0 | 3.5 |
| 75% | 2.25 | 751.0 | 4.25 |
| max | 3.0 | 1000.0 | 5.0 |
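
Each row of the `describe()` result can also be computed on its own. The sketch below rebuilds the same values as `df1` and reproduces three of the figures above; it relies on `std` being the sample standard deviation and on `quantile` interpolating linearly by default.

```python
import pandas as pd

df = pd.DataFrame(
    {"c1": [0, 3], "c2": [1000, 4], "c3": [2, 5]},
    index=["r1", "r2"],
)

mean_c2 = df["c2"].mean()         # (1000 + 4) / 2
std_c1 = df["c1"].std()           # sample standard deviation (ddof=1)
q25_c2 = df["c2"].quantile(0.25)  # 4 + 0.25 * (1000 - 4)

print(mean_c2, round(std_c1, 5), q25_c2)
```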


In [87]:
ser2

a    1
b    1
c    1
d    1
e    1
dtype: int64

In [88]:
# return all the unique values
ser2.unique()

array([1])

In [89]:
# count the number of each unique value
ser2.value_counts()

1    5
dtype: int64

In [0]:
# check if the DataFrame values are in the argument values 
df1.isin([2,3])

In [97]:
#  check if the Series values are in the argument values 
ser1.isin([2,3])

a    False
b    False
c     True
d     True
e    False
f    False
dtype: bool
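
The boolean result of `isin` is typically used as a mask to keep or exclude rows. A minimal sketch with the same data as `ser1`; the names `mask`, `kept` and `excluded` are illustrative.

```python
import pandas as pd

ser = pd.Series(range(6), index=list("abcdef"))

# Boolean mask: True where the value is 2 or 3
mask = ser.isin([2, 3])

kept = ser[mask]       # values that are in the list
excluded = ser[~mask]  # values that are not in the list

print(kept.index.tolist())      # ['c', 'd']
print(excluded.index.tolist())  # ['a', 'b', 'e', 'f']
```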

# 5- Google Colab Help

* Google Colab lets us list the available modules and the functions in a given module.
* It also gives access to the help of a given function or attribute.

In [0]:
df1.isin?
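
The `?` suffix is notebook (IPython) syntax and is not valid in a plain Python script. As a sketch of an alternative that works anywhere, the built-in `dir()` function and the `__doc__` attribute give similar information:

```python
import pandas as pd

# Names defined on DataFrame that start with "is"
names = [name for name in dir(pd.DataFrame) if name.startswith("is")]
print(names)

# The docstring that `df1.isin?` would display in a notebook
doc = pd.DataFrame.isin.__doc__
first_line = next(line.strip() for line in doc.splitlines() if line.strip())
print(first_line)

# help(pd.DataFrame.isin) prints the full documentation
```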

# References

* Wes McKinney. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O’Reilly Media, Inc, 2018.


