# DSC4001-01 Exercise 05

**This exercise notebook goes through the two main data types in Pandas:**


* Pandas Series
* Pandas DataFrames


[Pandas](https://pandas.pydata.org/) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. 
 

## Series

The first main data type we will learn about in Pandas is the **Series**. A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have *axis labels*, meaning it can be indexed by a label instead of just a numeric position. It also does not need to hold numeric data: it can hold arbitrary Python objects, such as strings, lists, or dictionaries.

In [1]:
import numpy as np
import pandas as pd

### Creating a Series

You can convert a list, numpy array, or dictionary to a Series:
```
s = pd.Series(data, index)
```

* data: The values stored in the Series
* index: The axis labels. For list-like data it must have the same length as data. Defaults to RangeIndex (0, 1, ..., len(data)-1) if not provided.
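The length rule above applies to list-like data. As a small sketch (the variable names are illustrative and not part of the exercise), a mismatched index raises a ``ValueError``, while for dictionary data the index selects and reorders keys and fills unknown keys with ``NaN``:

```python
import pandas as pd

values = [0.25, 0.5, 0.75]   # illustrative data, not from the exercise

# List-like data: the index must match the data length
try:
    pd.Series(values, index=['a', 'b'])
    length_error = False
except ValueError:
    length_error = True

# Dictionary data: the index picks keys; unknown keys become NaN
from_dict = pd.Series({'a': 10, 'b': 20, 'c': 30}, index=['c', 'a', 'z'])

print(length_error)
print(from_dict)
```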


In [2]:
my_list = [0.25, 0.5, 0.75, 1.0]
my_arr = np.arange(1,5)
my_dict = {'a': 10, 'b': 20, 'c': 30}
my_label = ['a','b','c','d']

In [3]:
# Using List: without index

pd.Series(data=my_list)

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [4]:
# Using List: with index label

pd.Series(data=my_list, index=my_label)

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [5]:
# Using ndarray

pd.Series(my_arr)

0    1
1    2
2    3
3    4
dtype: int64

In [6]:
pd.Series(my_arr, my_label)

a    1
b    2
c    3
d    4
dtype: int64

In [7]:
# Using dictionary

pd.Series(my_dict)

a    10
b    20
c    30
dtype: int64

In [8]:
pd.Series(my_dict, index=['a','c'])

a    10
c    30
dtype: int64

### Data in a Series

``Series`` can hold a variety of object types:

In [9]:
# Using a scalar value

pd.Series(5, my_label)

a    5
b    5
c    5
d    5
dtype: int64

In [10]:
# Using string values

pd.Series(data=['a','b','c','d'])

0    a
1    b
2    c
3    d
dtype: object

### Using an index

A Series acts in many ways like a 1D ndarray, and in many ways like a dictionary.
```
series[index label]
```

An ndarray has an implicitly defined integer index, while a Series can have an explicitly defined index associated with its values.

In [11]:
ser1 = pd.Series([1,2,3,4], index=['Kim','Lee','Park','Choi'])
ser1

Kim     1
Lee     2
Park    3
Choi    4
dtype: int64

In [12]:
ser2 = pd.Series([5.2, 3.5, 7.2, 12.5], index=['Kim','Lee','Yoo','Choi'])
ser2

Kim      5.2
Lee      3.5
Yoo      7.2
Choi    12.5
dtype: float64

In [13]:
ser1['Kim']

1

In [14]:
ser1['Kim':'Park']

Kim     1
Lee     2
Park    3
dtype: int64

In [15]:
ser2['Lee']

3.5

Operations between Series automatically align the data by index label. A label that appears in only one of the Series yields ``NaN``:


In [16]:
ser1 + ser2

Choi    16.5
Kim      6.2
Lee      5.5
Park     NaN
Yoo      NaN
dtype: float64
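If the ``NaN`` entries are not wanted, one option is the method form of the operator with a fill value. This is a minimal sketch that rebuilds the two Series from the cells above; treating a missing label as 0 is an assumption made for illustration:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['Kim', 'Lee', 'Park', 'Choi'])
ser2 = pd.Series([5.2, 3.5, 7.2, 12.5], index=['Kim', 'Lee', 'Yoo', 'Choi'])

# Treat a label missing from one side as 0 instead of producing NaN
total = ser1.add(ser2, fill_value=0)
print(total)
```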

A dictionary maps arbitrary keys to a set of arbitrary values, while a Series maps typed keys to a set of typed values. We can use *dictionary-style* item access:

In [17]:
ser3 = pd.Series({'b':10, 'a':35, 'c':20, 'e': 45})
ser3

b    10
a    35
c    20
e    45
dtype: int64

In [18]:
ser3['b']

10

In [19]:
ser3['a':'e']

a    35
c    20
e    45
dtype: int64

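We can also use the implicit Python-style integer position. Note that recent pandas versions deprecate integer positions in plain ``[]`` on a labeled index, so ``ser3.iloc[0]`` is the safer spelling of ``ser3[0]``: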

In [20]:
# indexing
ser3[0]

10

In [21]:
# slicing
ser3[1:3]

a    35
c    20
dtype: int64

In [22]:
# masking
ser3 > 15

b    False
a     True
c     True
e     True
dtype: bool

In [23]:
ser3 < 40

b     True
a     True
c     True
e    False
dtype: bool

In [24]:
(ser3 > 15) & (ser3 < 40)

b    False
a     True
c     True
e    False
dtype: bool

In [25]:
ser3[ (ser3 > 15) & (ser3 < 40) ]

a    35
c    20
dtype: int64

In [26]:
ser3

b    10
a    35
c    20
e    45
dtype: int64

In [27]:
ser3['a'] = 30
ser3

b    10
a    30
c    20
e    45
dtype: int64

**loc** and **iloc**: ``loc`` uses the explicitly defined index labels, while ``iloc`` uses the implicit integer positions.

In [28]:
ser4 = pd.Series(['a','b','e','c'], index=[1,3,4,2])
ser4

1    a
3    b
4    e
2    c
dtype: object

In [29]:
ser4.loc[1]

'a'

In [30]:
ser4.loc[1:4]

1    a
3    b
4    e
dtype: object

In [31]:
ser4.iloc[1]

'b'

In [32]:
ser4.iloc[1:4]

3    b
4    e
2    c
dtype: object
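The distinction matters most when the labels are themselves integers, as in ``ser4``. A short sketch of the three ways to access a single element (the variable names are illustrative):

```python
import pandas as pd

ser4 = pd.Series(['a', 'b', 'e', 'c'], index=[1, 3, 4, 2])

by_label = ser4[1]           # plain [] with an integer index is label-based here
by_label_loc = ser4.loc[1]   # always label-based
by_position = ser4.iloc[1]   # always position-based

print(by_label, by_label_loc, by_position)
```

Because plain ``[]`` can be ambiguous in this situation, using ``loc`` or ``iloc`` states the intent explicitly.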

## DataFrames

DataFrames are the workhorse of Pandas. We can think of a DataFrame as a bunch of Series objects put together to share the same index. 

```
df = pd.DataFrame(data, index, columns)
```

* data: Can contain Series, arrays, constants, or list-like objects
* index: Row labels. Defaults to RangeIndex if no index is provided
* columns: Column labels. Defaults to RangeIndex if no column labels are provided
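As a quick check of the defaults described above, the following sketch builds a DataFrame without labels and inspects its attributes (``df_demo`` is a hypothetical table used only for illustration):

```python
import pandas as pd

# Hypothetical small table, used only for illustration
df_demo = pd.DataFrame([[1, 2], [3, 4], [5, 6]])

print(df_demo.index)     # default row labels
print(df_demo.columns)   # default column labels
print(df_demo.shape)     # (number of rows, number of columns)
```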

### Creating a DataFrame

In [33]:
my_dict = {'T1': pd.Series([1, 5, 10], index=['A','B','C']),
           'T2': pd.Series([100, 70, 50, 10], index=['A','D','B','C']),
           'T3': pd.Series(np.random.randint(1, 20, 5), index=['A','B','C','D','E'])}
my_arr = np.random.randn(5,4)
my_list = [{'T1': 1, 'T2': 100, 'T3': 1}, {'T1': 5, 'T2': 50, 'T3': 2}, {'T1': 10, 'T2': 10, 'T3': 3}, {'T2': 70, 'T3': 4}, {'T3': 5} ]
my_row_label = ['A','B','C','D','E']
my_column_label = ['T1','T2','T3','T4']

In [34]:
# Using array

pd.DataFrame(my_arr)

          0         1         2         3
0 -0.551176 -0.695385  0.904619  0.466945
1  1.014225  0.654857 -0.132557  0.390690
2  3.178332  1.035422 -1.508489 -0.671088
3  0.544516  0.015963  0.648555 -0.250007
4  2.546139  0.520403 -0.518351  0.350861


In [35]:
pd.DataFrame(my_arr, index=my_row_label)

          0         1         2         3
A -0.551176 -0.695385  0.904619  0.466945
B  1.014225  0.654857 -0.132557  0.390690
C  3.178332  1.035422 -1.508489 -0.671088
D  0.544516  0.015963  0.648555 -0.250007
E  2.546139  0.520403 -0.518351  0.350861


In [36]:
pd.DataFrame(my_arr, index=my_row_label, columns=my_column_label)

         T1        T2        T3        T4
A -0.551176 -0.695385  0.904619  0.466945
B  1.014225  0.654857 -0.132557  0.390690
C  3.178332  1.035422 -1.508489 -0.671088
D  0.544516  0.015963  0.648555 -0.250007
E  2.546139  0.520403 -0.518351  0.350861


In [37]:
# Using dictionary of Series

pd.DataFrame(my_dict)

     T1     T2  T3
A   1.0  100.0   5
B   5.0   50.0  15
C  10.0   10.0   7
D   NaN   70.0   9
E   NaN    NaN   5


In [38]:
pd.DataFrame(my_dict, index=['D','A','B'])

    T1   T2  T3
D  NaN   70   9
A  1.0  100   5
B  5.0   50  15


In [39]:
pd.DataFrame(my_dict, index=['D','A','B'], columns=['T1','T3'])

    T1  T3
D  NaN   9
A  1.0   5
B  5.0  15


In [40]:
# Using list of dictionaries

pd.DataFrame(my_list)

     T1     T2  T3
0   1.0  100.0   1
1   5.0   50.0   2
2  10.0   10.0   3
3   NaN   70.0   4
4   NaN    NaN   5


In [41]:
pd.DataFrame(my_list, index=['A','B','C','D','E'])

     T1     T2  T3
A   1.0  100.0   1
B   5.0   50.0   2
C  10.0   10.0   3
D   NaN   70.0   4
E   NaN    NaN   5


In [42]:
pd.DataFrame(my_list, index=['A','B','C','D','E'], columns=['T2'])

      T2
A  100.0
B   50.0
C   10.0
D   70.0
E    NaN


### Data in a DataFrame: Selection and Indexing

A dictionary maps a *key* to a *value*.

A DataFrame maps a **column name** to a **Series of column data**.
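The dictionary analogy can be checked directly. This sketch uses a hypothetical two-column table (``df_demo`` is not part of the exercise) to show that membership tests and ``keys()`` operate on column names, and that each column comes back as a Series:

```python
import pandas as pd

# Hypothetical table, used only for illustration
df_demo = pd.DataFrame({'T1': [1, 5, 10], 'T2': [100, 50, 10]},
                       index=['A', 'B', 'C'])

print('T1' in df_demo)        # membership tests look at column names
print('A' in df_demo)         # row labels are not keys
print(list(df_demo.keys()))   # keys() returns the column labels
print(type(df_demo['T1']))    # each column is a Series
```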




In [43]:
df = pd.DataFrame(my_dict)
df

     T1     T2  T3
A   1.0  100.0   5
B   5.0   50.0  15
C  10.0   10.0   7
D   NaN   70.0   9
E   NaN    NaN   5


In [44]:
df['T2']

A    100.0
B     50.0
C     10.0
D     70.0
E      NaN
Name: T2, dtype: float64

In [45]:
# Pass a list of column names

df[['T3','T2']]

   T3     T2
A   5  100.0
B  15   50.0
C   7   10.0
D   9   70.0
E   5    NaN


Creating a new column:

In [46]:
df['T4'] = df['T1'] * df['T2']
df

     T1     T2  T3     T4
A   1.0  100.0   5  100.0
B   5.0   50.0  15  250.0
C  10.0   10.0   7  100.0
D   NaN   70.0   9    NaN
E   NaN    NaN   5    NaN


In [47]:
df['T5'] = 'Hey!'
df

     T1     T2  T3     T4    T5
A   1.0  100.0   5  100.0  Hey!
B   5.0   50.0  15  250.0  Hey!
C  10.0   10.0   7  100.0  Hey!
D   NaN   70.0   9    NaN  Hey!
E   NaN    NaN   5    NaN  Hey!


Modify, delete, and insert columns:
* Use ``del`` or ``drop`` to remove data


In [48]:
df['T5'] = df['T3'][1:3]*2
df

     T1     T2  T3     T4    T5
A   1.0  100.0   5  100.0   NaN
B   5.0   50.0  15  250.0  30.0
C  10.0   10.0   7  100.0  14.0
D   NaN   70.0   9    NaN   NaN
E   NaN    NaN   5    NaN   NaN


In [49]:
del df['T4']
df

     T1     T2  T3    T5
A   1.0  100.0   5   NaN
B   5.0   50.0  15  30.0
C  10.0   10.0   7  14.0
D   NaN   70.0   9   NaN
E   NaN    NaN   5   NaN


In [50]:
df.insert(4, 'T4', df['T1']*df['T2'])
df

     T1     T2  T3    T5     T4
A   1.0  100.0   5   NaN  100.0
B   5.0   50.0  15  30.0  250.0
C  10.0   10.0   7  14.0  100.0
D   NaN   70.0   9   NaN    NaN
E   NaN    NaN   5   NaN    NaN


In [51]:
df.drop('T4', axis=1)

     T1     T2  T3    T5
A   1.0  100.0   5   NaN
B   5.0   50.0  15  30.0
C  10.0   10.0   7  14.0
D   NaN   70.0   9   NaN
E   NaN    NaN   5   NaN


In [52]:
df
# drop returns a copy unless inplace=True, so df is unchanged here

     T1     T2  T3    T5     T4
A   1.0  100.0   5   NaN  100.0
B   5.0   50.0  15  30.0  250.0
C  10.0   10.0   7  14.0  100.0
D   NaN   70.0   9   NaN    NaN
E   NaN    NaN   5   NaN    NaN
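Besides ``inplace=True``, a common way to keep the result of ``drop`` is to assign the returned copy to a name. A minimal sketch with a hypothetical table (``df_demo`` and ``trimmed`` are illustrative names):

```python
import pandas as pd

# Hypothetical table, used only for illustration
df_demo = pd.DataFrame({'T1': [1, 5], 'T4': [100, 250]}, index=['A', 'B'])

# drop returns a new DataFrame; assign it to keep the change
trimmed = df_demo.drop(columns='T4')

print(list(df_demo.columns))   # the original still has T4
print(list(trimmed.columns))   # the copy does not
```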


In [53]:
df.drop('T4', axis=1, inplace=True)
df

     T1     T2  T3    T5
A   1.0  100.0   5   NaN
B   5.0   50.0  15  30.0
C  10.0   10.0   7  14.0
D   NaN   70.0   9   NaN
E   NaN    NaN   5   NaN


In [54]:
df.drop('E', axis=0, inplace=True)
df

     T1     T2  T3    T5
A   1.0  100.0   5   NaN
B   5.0   50.0  15  30.0
C  10.0   10.0   7  14.0
D   NaN   70.0   9   NaN


**Selecting Rows and Columns**

* Select column: ``df[column label]``
* Select row by label: ``df.loc[row label]``
* Select row by integer position: ``df.iloc[row position]``


In [55]:
df = pd.DataFrame(my_dict)
df

     T1     T2  T3
A   1.0  100.0   5
B   5.0   50.0  15
C  10.0   10.0   7
D   NaN   70.0   9
E   NaN    NaN   5


In [56]:
df['T1']

A     1.0
B     5.0
C    10.0
D     NaN
E     NaN
Name: T1, dtype: float64

In [57]:
df.loc['B']

T1     5.0
T2    50.0
T3    15.0
Name: B, dtype: float64

In [58]:
df.iloc[1]

T1     5.0
T2    50.0
T3    15.0
Name: B, dtype: float64

In [59]:
df.loc[['A','C']]

     T1     T2  T3
A   1.0  100.0   5
C  10.0   10.0   7


In [60]:
df.loc[['A','C'],['T1','T3']]

     T1  T3
A   1.0   5
C  10.0   7


In [61]:
df.iloc[[0,2],[0,2]]

     T1  T3
A   1.0   5
C  10.0   7


In [62]:
df[1:3]

     T1    T2  T3
B   5.0  50.0  15
C  10.0  10.0   7
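One detail worth keeping in mind when slicing rows: a label slice with ``loc`` includes both endpoints, while a position slice with ``iloc`` excludes the stop position. A small sketch on a hypothetical table (``df_demo`` is not part of the exercise):

```python
import pandas as pd

# Hypothetical table, used only for illustration
df_demo = pd.DataFrame({'T1': [1, 5, 10, 20], 'T3': [5, 15, 7, 9]},
                       index=['A', 'B', 'C', 'D'])

by_label = df_demo.loc['B':'C']    # label slice: both endpoints included
by_position = df_demo.iloc[1:3]    # position slice: stop is excluded

print(by_label.equals(by_position))
```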


**Conditional selection**

In [63]:
df

     T1     T2  T3
A   1.0  100.0   5
B   5.0   50.0  15
C  10.0   10.0   7
D   NaN   70.0   9
E   NaN    NaN   5


In [64]:
df < 13

      T1     T2     T3
A   True  False   True
B   True  False  False
C   True   True   True
D  False  False   True
E  False  False   True


In [65]:
df[ df < 13 ]

     T1    T2   T3
A   1.0   NaN  5.0
B   5.0   NaN  NaN
C  10.0  10.0  7.0
D   NaN   NaN  9.0
E   NaN   NaN  5.0


In [66]:
df[ df['T2'] > 60 ]

    T1     T2  T3
A  1.0  100.0   5
D  NaN   70.0   9


In [67]:
df[ df['T2'] > 60]['T3']

A    5
D    9
Name: T3, dtype: int64
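The chained form above (filter rows, then pick a column) can also be written as a single ``loc`` call, and several conditions can be combined with ``&`` and ``|``. The sketch below uses a hypothetical table with the same column names (``df_demo`` is illustrative, with fixed values instead of the random ones used earlier):

```python
import numpy as np
import pandas as pd

# Hypothetical table, used only for illustration
df_demo = pd.DataFrame({'T1': [1, 5, 10, np.nan],
                        'T2': [100, 50, 10, 70],
                        'T3': [5, 15, 7, 9]},
                       index=['A', 'B', 'C', 'D'])

# Row mask and column label in a single .loc call
t3_high = df_demo.loc[df_demo['T2'] > 60, 'T3']

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
both = df_demo[(df_demo['T2'] > 60) & (df_demo['T1'].notna())]

print(t3_high)
print(both)
```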

## Quiz

1. Add, subtract, multiply, and divide two Pandas Series

In [68]:
ser1 = pd.Series([2, 6, 4, 10, 8])
ser2 = pd.Series([2, 1, 2, 5, 4])

In [69]:
# add: ser_add

# subtract: ser_sub

# multiply: ser_mult

# divide: ser_div


In [70]:
print('Add two Series: ')
#print(ser_add)

print('Subtract two Series: ')
#print(ser_sub)

print('Multiply two Series: ')
#print(ser_mult)

print('Divide Ser1 by Ser2: ')
#print(ser_div)



Add two Series: 
Subtract two Series: 
Multiply two Series: 
Divide Ser1 by Ser2: 


2. Add some data to an existing Series:

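You can (1) assign a value to a new index label, or (2) create an additional Series and concatenate it with the original Series (for example with ``pd.concat``, since ``Series.append`` was removed in pandas 2.0).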

Original Series:
```
a         10
b         20
c       Data
d    Science
dtype: object
```

Expected output:
```
a         10
b         20
c       Data
d    Science
e       4001
f     Python
dtype: object
```

In [71]:
ser1 = pd.Series([10, 20, 'Data', 'Science'], index=['a','b','c','d'])
ser1

a         10
b         20
c       Data
d    Science
dtype: object

In [72]:
# Your code here



3. Select rows from a given DataFrame based on values in some columns:
* Find the rows where the ``C1`` value equals 4

In [73]:
my_dict = {'C1': [1,4,3,4,5], 'C2': [4,5,6,7,8], 'C3': [7,8,9,0,1]}
df = pd.DataFrame(my_dict)
df

   C1  C2  C3
0   1   4   7
1   4   5   8
2   3   6   9
3   4   7   0
4   5   8   1


In [74]:
# Your code here


4. Change the order of a DataFrame's columns and rows:

Original DataFrame:
```
   C1  C2  C3
A   1   4   7
B   4   5   8
C   3   6   9
D   4   7   0
E   5   8   1
```

Expected output, with column order C3, C1, C2 and row order A, B, E, D, C:
```
   C3  C1  C2
A   7   1   4
B   8   4   5
E   1   5   8
D   0   4   7
C   9   3   6
```

In [75]:
my_dict = {'C1': [1,4,3,4,5], 'C2': [4,5,6,7,8], 'C3': [7,8,9,0,1]}
df = pd.DataFrame(my_dict, index=['A','B','C','D','E'])
df

   C1  C2  C3
A   1   4   7
B   4   5   8
C   3   6   9
D   4   7   0
E   5   8   1


In [76]:
# Your code here


5. Delete DataFrame row(s) based on a given column value:
* Delete the rows where the ``C1`` value is less than 4

In [77]:
my_dict = {'C1': [1,4,3,4,5], 'C2': [4,5,6,7,8], 'C3': [7,8,9,0,1]}
df = pd.DataFrame(my_dict, index=['A','B','C','D','E'])
df

   C1  C2  C3
A   1   4   7
B   4   5   8
C   3   6   9
D   4   7   0
E   5   8   1


In [78]:
# Your code here


6. Remove the first 3 rows of a given DataFrame

* Use the implicit integer index

In [79]:
my_dict = {'C1': [1,4,3,4,5,10,3], 'C2': [4,5,6,7,8,32,2], 'C3': [7,8,9,0,1,11,23], 'C4': [1,3,5,2,3,31,5]}
df = pd.DataFrame(my_dict, index=['A','B','C','D','E','F','G'])
df

   C1  C2  C3  C4
A   1   4   7   1
B   4   5   8   3
C   3   6   9   5
D   4   7   0   2
E   5   8   1   3
F  10  32  11  31
G   3   2  23   5


In [80]:
# Your code here


## HW Solutions: