# DSC4001-01 Exercise 05

**This exercise notebook goes through the main data types in Pandas:**


* Pandas Series
* Pandas DataFrames


[Pandas](https://pandas.pydata.org/) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. 
 

## Series

The first main data type we will learn about in Pandas is the **Series**. A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have *axis labels*, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any Python object, such as an ndarray, list, string, or dictionary.

In [1]:
import numpy as np
import pandas as pd

### Creating a Series

You can convert a list, numpy array, or dictionary to a Series:
```
s = pd.Series(data, index)
```

* data: Contains data stored in Series
* index: Labels for the data. For list-like data it must have the same length as data. Defaults to RangeIndex (0, 1, ..., len(data)-1) if not provided.
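
As a small sketch of the index rule (the variable names here are illustrative and not part of the exercise): for list-like data a mismatched index length raises a ``ValueError``, while for a dictionary the index selects keys and fills missing ones with ``NaN``.

```python
import numpy as np
import pandas as pd

values = [0.25, 0.5, 0.75]

# No index given: pandas builds a RangeIndex
s_default = pd.Series(values)

# For list-like data, a mismatched index length raises ValueError
try:
    pd.Series(values, index=['a', 'b'])
    length_error = False
except ValueError:
    length_error = True

# For a dict, the index selects keys; labels missing from the dict become NaN
s_from_dict = pd.Series({'a': 10, 'b': 20}, index=['b', 'z'])

print(s_default.index)
print(length_error)
print(s_from_dict)
```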


In [3]:
my_list = [0.25, 0.5, 0.75, 1.0]
my_arr = np.arange(1,5)
my_dict = {'a': 10, 'b': 20, 'c': 30}
my_label = ['a','b','c','d']

In [4]:
# Using List: without index

pd.Series(data=my_list)

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [5]:
# Using List: with index label

pd.Series(data=my_list, index=my_label)

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [6]:
# Using ndarray

pd.Series(my_arr)

0    1
1    2
2    3
3    4
dtype: int64

In [7]:
pd.Series(my_arr, my_label)

a    1
b    2
c    3
d    4
dtype: int64

In [8]:
# Using dictionary

pd.Series(my_dict)

a    10
b    20
c    30
dtype: int64

In [9]:
pd.Series(my_dict, index=['a','c'])

a    10
c    30
dtype: int64

### Data in a Series

``Series`` can hold a variety of object types:

In [10]:
# Using a scalar value

pd.Series(5, my_label)

a    5
b    5
c    5
d    5
dtype: int64

In [11]:
# Using string values

pd.Series(data=['a','b','c','d'])

0    a
1    b
2    c
3    d
dtype: object
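
The ``dtype`` line in each output depends on the values stored. A minimal sketch of how that inference behaves, using made-up values:

```python
import pandas as pd

ints = pd.Series([1, 2, 3])                 # all integers: integer dtype
mixed_numbers = pd.Series([1, 2.5, 3])      # ints and floats are upcast to float64
mixed_objects = pd.Series([1, 'two', 3.0])  # unrelated types fall back to object

print(ints.dtype, mixed_numbers.dtype, mixed_objects.dtype)
```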

### Using an index

Series acts in many ways like a 1D ndarray, and in many ways like a dictionary. 
```
series[index label]
```

ndarray has an implicitly defined integer index, while a Series can have an explicitly defined index associated with the values.

In [12]:
ser1 = pd.Series([1,2,3,4], index=['Kim','Lee','Park','Choi'])
ser1

Kim     1
Lee     2
Park    3
Choi    4
dtype: int64

In [13]:
ser2 = pd.Series([5.2, 3.5, 7.2, 12.5], index=['Kim','Lee','Yoo','Choi'])
ser2

Kim      5.2
Lee      3.5
Yoo      7.2
Choi    12.5
dtype: float64

In [14]:
ser1['Kim']

1

In [15]:
ser1['Kim':'Park']

Kim     1
Lee     2
Park    3
dtype: int64

In [16]:
ser2['Lee']

3.5

Operations between Series automatically align the data based on label index:


In [17]:
ser1 + ser2

Choi    16.5
Kim      6.2
Lee      5.5
Park     NaN
Yoo      NaN
dtype: float64
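
Labels that appear in only one Series produce ``NaN``. If treating the missing side as zero is acceptable for your data (an assumption, not something the exercise requires), ``Series.add`` with ``fill_value`` is one way to avoid that. A small sketch with illustrative names:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['Kim', 'Lee', 'Park'])
right = pd.Series([5.0, 3.5, 7.0], index=['Kim', 'Lee', 'Yoo'])

plain_sum = left + right                     # NaN where a label is missing on one side
filled_sum = left.add(right, fill_value=0)   # the missing side is treated as 0

print(plain_sum)
print(filled_sum)
```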

A dictionary maps arbitrary keys to a set of arbitrary values, while a Series maps typed keys to a set of typed values. We can use *dictionary-style* item access on a Series.

In [18]:
ser3 = pd.Series({'b':10, 'a':35, 'c':20, 'e': 45})
ser3

b    10
a    35
c    20
e    45
dtype: int64

In [19]:
ser3['b']

10

In [20]:
ser3['a':'e']

a    35
c    20
e    45
dtype: int64

We can also use an implicit Python-style integer position:

In [21]:
# indexing by position (recent pandas versions warn here and recommend ser3.iloc[0])
ser3[0]

10

In [22]:
# slicing
ser3[1:3]

a    35
c    20
dtype: int64

In [23]:
# masking
ser3 > 15

b    False
a     True
c     True
e     True
dtype: bool

In [24]:
ser3 < 40

b     True
a     True
c     True
e    False
dtype: bool

In [25]:
(ser3 > 15) & (ser3 < 40)

b    False
a     True
c     True
e    False
dtype: bool

In [26]:
ser3[ (ser3 > 15) & (ser3 < 40) ]

a    35
c    20
dtype: int64

In [27]:
ser3

b    10
a    35
c    20
e    45
dtype: int64

In [28]:
ser3['a'] = 30
ser3

b    10
a    30
c    20
e    45
dtype: int64

**loc** and **iloc**: ``loc`` selects by the explicitly defined index labels, while ``iloc`` selects by implicit integer position

In [29]:
ser4 = pd.Series(['a','b','e','c'], index=[1,3,4,2])
ser4

1    a
3    b
4    e
2    c
dtype: object

In [30]:
ser4.loc[1]

'a'

In [31]:
ser4.loc[1:4]

1    a
3    b
4    e
dtype: object

In [32]:
ser4.iloc[1]

'b'

In [33]:
ser4.iloc[1:4]

3    b
4    e
2    c
dtype: object
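
One difference worth noting in the slices above: ``loc`` includes the end label, while ``iloc`` excludes the end position, like ordinary Python slicing. A short sketch with an illustrative Series:

```python
import pandas as pd

letters = pd.Series(['a', 'b', 'c', 'd'], index=[10, 20, 30, 40])

by_label = letters.loc[10:30]      # labels 10, 20, 30: the end label is included
by_position = letters.iloc[0:2]    # positions 0 and 1: the end position is excluded

print(by_label)
print(by_position)
```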

## DataFrames

DataFrames are the workhorse of Pandas. We can think of a DataFrame as a bunch of Series objects put together to share the same index. 

```
df = pd.DataFrame(data, index, columns)
```

* data: Can contain Series, arrays, constants, or list-like objects
* index: Row labels. Will default to RangeIndex if no index provided
* columns: Column labels. Will default to RangeIndex if no column labels are provided

### Creating a DataFrame

In [34]:
my_dict = {'T1': pd.Series([1, 5, 10], index=['A','B','C']),
           'T2': pd.Series([100, 70, 50, 10], index=['A','D','B','C']),
           'T3': pd.Series(np.random.randint(1, 20, 5), index=['A','B','C','D','E'])}
my_arr = np.random.randn(5,4)
my_list = [{'T1': 1, 'T2': 100, 'T3': 1}, {'T1': 5, 'T2': 50, 'T3': 2}, {'T1': 10, 'T2': 10, 'T3': 3}, {'T2': 70, 'T3': 4}, {'T3': 5} ]
my_row_label = ['A','B','C','D','E']
my_column_label = ['T1','T2','T3','T4']

In [35]:
# Using array

pd.DataFrame(my_arr)

          0         1         2         3
0 -0.687071 -1.249704  0.626584  0.253393
1  0.393273 -0.190639  1.359026 -2.111420
2  0.149328 -1.253329  0.862176 -1.465889
3 -0.569126 -0.236500  1.684282  0.777615
4  1.114002  0.507904 -1.572542 -0.357048


In [36]:
pd.DataFrame(my_arr, index=my_row_label)

          0         1         2         3
A -0.687071 -1.249704  0.626584  0.253393
B  0.393273 -0.190639  1.359026 -2.111420
C  0.149328 -1.253329  0.862176 -1.465889
D -0.569126 -0.236500  1.684282  0.777615
E  1.114002  0.507904 -1.572542 -0.357048


In [37]:
pd.DataFrame(my_arr, index=my_row_label, columns=my_column_label)

         T1        T2        T3        T4
A -0.687071 -1.249704  0.626584  0.253393
B  0.393273 -0.190639  1.359026 -2.111420
C  0.149328 -1.253329  0.862176 -1.465889
D -0.569126 -0.236500  1.684282  0.777615
E  1.114002  0.507904 -1.572542 -0.357048


In [38]:
# Using dictionary of Series

pd.DataFrame(my_dict)

     T1     T2  T3
A   1.0  100.0  13
B   5.0   50.0  16
C  10.0   10.0  17
D   NaN   70.0  15
E   NaN    NaN  19


In [39]:
pd.DataFrame(my_dict, index=['D','A','B'])

    T1   T2  T3
D  NaN   70  15
A  1.0  100  13
B  5.0   50  16


In [40]:
pd.DataFrame(my_dict, index=['D','A','B'], columns=['T1','T3'])

    T1  T3
D  NaN  15
A  1.0  13
B  5.0  16


In [41]:
# Using list of dictionaries

pd.DataFrame(my_list)

     T1     T2  T3
0   1.0  100.0   1
1   5.0   50.0   2
2  10.0   10.0   3
3   NaN   70.0   4
4   NaN    NaN   5


In [42]:
pd.DataFrame(my_list, index=['A','B','C','D','E'])

     T1     T2  T3
A   1.0  100.0   1
B   5.0   50.0   2
C  10.0   10.0   3
D   NaN   70.0   4
E   NaN    NaN   5


In [43]:
pd.DataFrame(my_list, index=['A','B','C','D','E'], columns=['T2'])

      T2
A  100.0
B   50.0
C   10.0
D   70.0
E    NaN


### Data in a DataFrame: Selection and Indexing

A dictionary maps a *key* to a *value*.

A DataFrame maps a **column name** to a **Series of column data**.
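
The dictionary analogy can be sketched as follows. The DataFrame and its column names are made up for illustration:

```python
import pandas as pd

scores = pd.DataFrame({'math': [90, 80], 'art': [70, 60]}, index=['Kim', 'Lee'])

column = scores['math']              # a Series that shares the DataFrame's index
has_art = 'art' in scores            # membership tests look at column names
column_names = list(scores.keys())   # keys() returns the column labels

print(type(column).__name__, has_art, column_names)
```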




In [44]:
df = pd.DataFrame(my_dict)
df

     T1     T2  T3
A   1.0  100.0  13
B   5.0   50.0  16
C  10.0   10.0  17
D   NaN   70.0  15
E   NaN    NaN  19


In [45]:
df['T2']

A    100.0
B     50.0
C     10.0
D     70.0
E      NaN
Name: T2, dtype: float64

In [46]:
# Pass a list of column names

df[['T3','T2']]

   T3     T2
A  13  100.0
B  16   50.0
C  17   10.0
D  15   70.0
E  19    NaN


Creating a new column:

In [50]:
df['T4'] = df['T1'] * df['T2']
df

     T1     T2  T3     T4
A   1.0  100.0  13  100.0
B   5.0   50.0  16  250.0
C  10.0   10.0  17  100.0
D   NaN   70.0  15    NaN
E   NaN    NaN  19    NaN


In [51]:
df['T5'] = 'Hey!'
df

     T1     T2  T3     T4    T5
A   1.0  100.0  13  100.0  Hey!
B   5.0   50.0  16  250.0  Hey!
C  10.0   10.0  17  100.0  Hey!
D   NaN   70.0  15    NaN  Hey!
E   NaN    NaN  19    NaN  Hey!


Modify, delete, and insert columns:

* ``del`` and ``drop`` remove data; ``insert`` adds a column at a given position


In [52]:
df['T5'] = df['T3'][1:3]*2
df

     T1     T2  T3     T4    T5
A   1.0  100.0  13  100.0   NaN
B   5.0   50.0  16  250.0  32.0
C  10.0   10.0  17  100.0  34.0
D   NaN   70.0  15    NaN   NaN
E   NaN    NaN  19    NaN   NaN
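
Rows A, D and E are ``NaN`` because assigning a Series to a column aligns on index labels, and the right-hand side only carries labels B and C. A minimal sketch of that alignment, with an illustrative DataFrame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1, 2, 3]}, index=['A', 'B', 'C'])

# The right-hand Series only has labels B and C, so row A gets NaN
frame['doubled'] = frame['x'].iloc[1:3] * 2

# Labels are matched, not positions: a reversed Series still lands on the right rows
frame['from_series'] = pd.Series([30, 20, 10], index=['C', 'B', 'A'])

print(frame)
```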


In [53]:
del df['T4']
df

     T1     T2  T3    T5
A   1.0  100.0  13   NaN
B   5.0   50.0  16  32.0
C  10.0   10.0  17  34.0
D   NaN   70.0  15   NaN
E   NaN    NaN  19   NaN


In [54]:
df.insert(4, 'T4', df['T1']*df['T2'])
df

     T1     T2  T3    T5     T4
A   1.0  100.0  13   NaN  100.0
B   5.0   50.0  16  32.0  250.0
C  10.0   10.0  17  34.0  100.0
D   NaN   70.0  15   NaN    NaN
E   NaN    NaN  19   NaN    NaN


In [59]:
df.drop('T4', axis=1)

     T1     T2  T3    T5
A   1.0  100.0  13   NaN
B   5.0   50.0  16  32.0
C  10.0   10.0  17  34.0
D   NaN   70.0  15   NaN
E   NaN    NaN  19   NaN


In [64]:
df.max(axis=1)

A    100.0
B    250.0
C    100.0
D     70.0
E     19.0
dtype: float64

In [60]:
df
# drop() returns a new DataFrame unless inplace=True, so df still contains T4

     T1     T2  T3    T5     T4
A   1.0  100.0  13   NaN  100.0
B   5.0   50.0  16  32.0  250.0
C  10.0   10.0  17  34.0  100.0
D   NaN   70.0  15   NaN    NaN
E   NaN    NaN  19   NaN    NaN


In [65]:
df.drop('T4', axis=1, inplace=True)
df

     T1     T2  T3    T5
A   1.0  100.0  13   NaN
B   5.0   50.0  16  32.0
C  10.0   10.0  17  34.0
D   NaN   70.0  15   NaN
E   NaN    NaN  19   NaN


In [66]:
df.drop('E', axis=0, inplace=True)
df

     T1     T2  T3    T5
A   1.0  100.0  13   NaN
B   5.0   50.0  16  32.0
C  10.0   10.0  17  34.0
D   NaN   70.0  15   NaN
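
As an alternative to ``inplace=True``, the result of ``drop`` can be assigned to a name, and the ``columns=`` and ``index=`` keywords can replace ``axis``. A short sketch on an illustrative DataFrame:

```python
import pandas as pd

table = pd.DataFrame({'T1': [1, 5], 'T2': [100, 50]}, index=['A', 'B'])

without_t2 = table.drop(columns=['T2'])   # same effect as drop('T2', axis=1)
without_a = table.drop(index=['A'])       # same effect as drop('A', axis=0)

# The original is untouched because neither call used inplace=True
print(table)
print(without_t2)
print(without_a)
```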


**Selecting Rows and Columns**

* Select column: ``df[column label]``
* Select row by label: ``df.loc[row label]``
* Select row by integer location: ``df.iloc[row position]``


In [67]:
df = pd.DataFrame(my_dict)
df

     T1     T2  T3
A   1.0  100.0  13
B   5.0   50.0  16
C  10.0   10.0  17
D   NaN   70.0  15
E   NaN    NaN  19


In [68]:
df['T1']

A     1.0
B     5.0
C    10.0
D     NaN
E     NaN
Name: T1, dtype: float64

In [69]:
df.loc['B']

T1     5.0
T2    50.0
T3    16.0
Name: B, dtype: float64

In [70]:
df.iloc[1]

T1     5.0
T2    50.0
T3    16.0
Name: B, dtype: float64

In [71]:
df.loc[['A','C']]

     T1     T2  T3
A   1.0  100.0  13
C  10.0   10.0  17


In [72]:
df.loc[['A','C'],['T1','T3']]

     T1  T3
A   1.0  13
C  10.0  17


In [73]:
df.iloc[[0,2],[0,2]]

     T1  T3
A   1.0  13
C  10.0  17


In [74]:
df[1:3]

     T1    T2  T3
B   5.0  50.0  16
C  10.0  10.0  17


**Conditional selection**

In [75]:
df

     T1     T2  T3
A   1.0  100.0  13
B   5.0   50.0  16
C  10.0   10.0  17
D   NaN   70.0  15
E   NaN    NaN  19


In [76]:
df < 13

      T1     T2     T3
A   True  False  False
B   True  False  False
C   True   True  False
D  False  False  False
E  False  False  False


In [77]:
df[ df < 13 ]

     T1    T2   T3
A   1.0   NaN  NaN
B   5.0   NaN  NaN
C  10.0  10.0  NaN
D   NaN   NaN  NaN
E   NaN   NaN  NaN


In [80]:
df[ df['T2'] > 60 ]

    T1     T2  T3
A  1.0  100.0  13
D  NaN   70.0  15


In [81]:
df[ df['T2'] > 60]['T3']

A    13
D    15
Name: T3, dtype: int64
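
Filtering rows and then picking a column can also be written as a single ``loc`` call, and several conditions can be combined with ``&``. A small sketch on an illustrative DataFrame (``df_demo`` is not part of the exercise):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'T1': [1.0, 5.0, np.nan], 'T2': [100.0, 50.0, 70.0]},
                       index=['A', 'B', 'D'])

# Each condition needs its own parentheses when combined with & or |
mask = (df_demo['T2'] > 60) & df_demo['T1'].notna()

selected = df_demo.loc[mask, 'T2']   # rows by mask, column by label, in one step

print(selected)
```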

## Quiz

1. Add, subtract, multiply and divide two Pandas Series

In [88]:
ser1 = pd.Series([2, 6, 4, 10, 8])
ser2 = pd.Series([2, 1, 2, 5, 4])

In [89]:
# add: ser_add
ser_add = ser1 + ser2
# subtract: ser_sub
ser_sub = ser1 - ser2
# multiply: ser_mult
ser_mult = ser1 * ser2
# divide: ser_div
ser_div = ser1 / ser2

In [90]:
print('Add two Series: ')
print(ser_add)

print('Subtract two Series: ')
print(ser_sub)

print('Multiply two Series: ')
print(ser_mult)

print('Divide Ser1 by Ser2: ')
print(ser_div)



Add two Series: 
0     4
1     7
2     6
3    15
4    12
dtype: int64
Subtract two Series: 
0    0
1    5
2    2
3    5
4    4
dtype: int64
Multiply two Series: 
0     4
1     6
2     8
3    50
4    32
dtype: int64
Divide Ser1 by Ser2: 
0    1.0
1    6.0
2    2.0
3    2.0
4    2.0
dtype: float64


2. Add some data to an existing Series:

You can (1) add a new index label and value, or (2) create an additional Series and append it to the original Series.

Original Series:
```
a         10
b         20
c       Data
d    Science
dtype: object
```

Expected output:
```
a         10
b         20
c       Data
d    Science
e       4001
f     Python
dtype: object
```

In [96]:
ser1 = pd.Series([10, 20, 'Data', 'Science'], index=['a','b','c','d'])
ser1['e']=4001
ser1['f']='Python'
print(ser1)

a         10
b         20
c       Data
d    Science
e       4001
f     Python
dtype: object


In [106]:
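# Your code here
# Series.append was removed in pandas 2.0; pd.concat returns a new Series

ser1 = pd.Series([10, 20, 'Data', 'Science'], index=['a','b','c','d'])
ser2 = pd.Series([4001, 'Python'], index=['e','f'])
ser1 = pd.concat([ser1, ser2])
ser1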

a         10
b         20
c       Data
d    Science
e       4001
f     Python
dtype: object

3. Select rows from a given DataFrame based on values in some columns:
* Find the rows where the ``C1`` value equals 4

In [97]:
my_dict = {'C1': [1,4,3,4,5], 'C2': [4,5,6,7,8], 'C3': [7,8,9,0,1]}
df = pd.DataFrame(my_dict)
df

   C1  C2  C3
0   1   4   7
1   4   5   8
2   3   6   9
3   4   7   0
4   5   8   1


In [110]:
# Your code here
df['C1']==4


0    False
1     True
2    False
3     True
4    False
Name: C1, dtype: bool

In [112]:
df[df['C1']==4]

   C1  C2  C3
1   4   5   8
3   4   7   0


In [111]:
df.loc[df['C1'] ==4]

   C1  C2  C3
1   4   5   8
3   4   7   0
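
Two other ways to express the same kind of row selection are ``DataFrame.query`` and ``Series.isin``. This is a sketch on a smaller illustrative DataFrame, not the required quiz answer:

```python
import pandas as pd

quiz_df = pd.DataFrame({'C1': [1, 4, 3, 4, 5], 'C2': [4, 5, 6, 7, 8]})

by_query = quiz_df.query('C1 == 4')              # condition written as a string
by_isin = quiz_df[quiz_df['C1'].isin([4, 5])]    # match any value in a list

print(by_query)
print(by_isin)
```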


4. Change the order of a DataFrame's columns and rows:

Original DataFrame:
```
	C1	C2	C3
0	1	4	7
1	4	5	8
2	3	6	9
3	4	7	0
4	5	8	1
```

Expected output: change column order: C3, C1, C2 & row order: A, B, E, D, C
```
   C3  C1  C2
A   7   1   4
B   8   4   5
E   1   5   8
D   0   4   7
C   9   3   6
```

In [117]:
my_dict = {'C1': [1,4,3,4,5], 'C2': [4,5,6,7,8], 'C3': [7,8,9,0,1]}
df = pd.DataFrame(my_dict, index=['A','B','C','D','E'])
df

   C1  C2  C3
A   1   4   7
B   4   5   8
C   3   6   9
D   4   7   0
E   5   8   1


In [120]:
# Your code here
df = df[['C3', 'C1', 'C2']]
df = df.loc[['A', 'B', 'E', 'D', 'C']]
df

   C3  C1  C2
A   7   1   4
B   8   4   5
E   1   5   8
D   0   4   7
C   9   3   6
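
``DataFrame.reindex`` can reorder rows and columns in one call. Unlike ``loc`` with a list of labels, it fills labels that do not exist with ``NaN`` instead of raising an error. A sketch on a smaller illustrative DataFrame:

```python
import pandas as pd

quiz_df = pd.DataFrame({'C1': [1, 4, 3], 'C2': [4, 5, 6], 'C3': [7, 8, 9]},
                       index=['A', 'B', 'C'])

# Reorder rows and columns together
reordered = quiz_df.reindex(index=['A', 'C', 'B'], columns=['C3', 'C1', 'C2'])

print(reordered)
```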


5. Delete DataFrame row(s) based on given column value:
* Delete the rows where the ``C1`` value is less than 4

In [121]:
my_dict = {'C1': [1,4,3,4,5], 'C2': [4,5,6,7,8], 'C3': [7,8,9,0,1]}
df = pd.DataFrame(my_dict, index=['A','B','C','D','E'])
df

   C1  C2  C3
A   1   4   7
B   4   5   8
C   3   6   9
D   4   7   0
E   5   8   1


In [122]:
# Your code here
df = df[df['C1'] >= 4]   # keep only the rows where C1 is not less than 4
df

   C1  C2  C3
B   4   5   8
D   4   7   0
E   5   8   1
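
Deleting rows by condition can also be written with ``drop``, by first collecting the labels of the rows to remove. A sketch on a smaller illustrative DataFrame:

```python
import pandas as pd

quiz_df = pd.DataFrame({'C1': [1, 4, 3, 4, 5], 'C2': [4, 5, 6, 7, 8]},
                       index=['A', 'B', 'C', 'D', 'E'])

to_delete = quiz_df[quiz_df['C1'] < 4].index    # labels of the rows to remove
kept_by_drop = quiz_df.drop(to_delete)          # returns a new DataFrame
kept_by_mask = quiz_df[~(quiz_df['C1'] < 4)]    # same rows, by inverting the mask

print(kept_by_drop)
```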


6. Remove first 3 rows of a given DataFrame

* use implicit integer index

In [124]:
my_dict = {'C1': [1,4,3,4,5,10,3], 'C2': [4,5,6,7,8,32,2], 'C3': [7,8,9,0,1,11,23], 'C4': [1,3,5,2,3,31,5]}
df = pd.DataFrame(my_dict, index=['A','B','C','D','E','F','G'])
df

   C1  C2  C3  C4
A   1   4   7   1
B   4   5   8   3
C   3   6   9   5
D   4   7   0   2
E   5   8   1   3
F  10  32  11  31
G   3   2  23   5


In [126]:
# Your code here
df = df.iloc[3:]
df

   C1  C2  C3  C4
D   4   7   0   2
E   5   8   1   3
F  10  32  11  31
G   3   2  23   5
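
Two other ways to skip the first rows by position, sketched on a smaller illustrative DataFrame:

```python
import pandas as pd

quiz_df = pd.DataFrame({'C1': [1, 4, 3, 4, 5]}, index=['A', 'B', 'C', 'D', 'E'])

by_drop = quiz_df.drop(quiz_df.index[:3])   # look up the first three labels, then drop them
by_tail = quiz_df.tail(-3)                  # negative n: every row except the first 3

print(by_drop)
```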


## HW Solutions: