# Code along session - 19 Feb

## Recap

##### OOP
- Components of OOP: classes, objects (instantiating an object), methods, attributes
- How to create a class and an object
- What is the difference between an attribute and a method
- Different ways of documenting a class: docstrings, type hints, etc.
- Optional topic: private attributes and property

##### What we have learnt so far about processing data with Python
- if there is a text file, we can:
  - read the text file with a ```with``` statement
  - clean the data with, for example, string methods
  - create Python data objects, like a list, to store the cleaned data
  - export the data objects to another text file
  
- but this is not enough, because for instance:
  - we will work with more complicated data inputs, like tabular data formats
  - Python built-in data types have several drawbacks for storing the cleaned data, for instance, low performance and difficulty in calculation
  - --> that's why we need numpy and pandas
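
The text-file workflow recapped above can be sketched as follows. This is only a minimal illustration: the file names and the sample content are made up, and a temporary folder is used so that the sketch runs on its own.

```python
import tempfile
from pathlib import Path

# Hypothetical sample data; the file names below are made up for this sketch.
workdir = Path(tempfile.mkdtemp())
raw_path = workdir / "raw.txt"
clean_path = workdir / "clean.txt"
raw_path.write_text("  Apple \nBANANA\n\n cherry  \n", encoding="utf-8")

# 1. read the text file with a with statement
with open(raw_path, encoding="utf-8") as f:
    lines = f.readlines()

# 2. clean the data with string methods and 3. store it in a list
#    (empty lines are skipped, whitespace is stripped, text is lowercased)
cleaned = [line.strip().lower() for line in lines if line.strip()]
print(cleaned)

# 4. export the list to another text file, one item per line
with open(clean_path, "w", encoding="utf-8") as f:
    f.write("\n".join(cleaned))
```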

## Pandas series and dataframe

### What and why?


- In Python, we have built-in data types, like list and dictionary, but they have drawbacks:
  - Low performance: they are slow at performing calculations on a big dataset
    - numpy has an array data object which is faster because all elements in an array have the same data type
  - No vectorization: we can do element-wise calculation with a numpy array, but not with a list. See the example of multiplying a list by a number below
  
- pandas is a library built on numpy, so pandas is also fast at calculation

- pandas is needed on top of numpy because pandas has an explicit index that is useful when referencing data
  - even though a dictionary has a key for each value, we cannot calculate directly with a dictionary
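
To illustrate the last point, here is a small sketch comparing a dictionary with a pandas series built from it. The `prices` data is hypothetical and only serves as an example.

```python
import pandas as pd

prices = {"apple": 1.5, "banana": 0.5}  # hypothetical example data

# A dictionary does not support arithmetic directly
try:
    prices * 2
except TypeError as error:
    print("dict:", error)

# Without pandas we need an explicit loop or comprehension
doubled_dict = {key: value * 2 for key, value in prices.items()}

# A pandas series keeps the keys as its index and supports vectorized arithmetic
doubled_series = pd.Series(prices) * 2
print(doubled_series)
```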

In [None]:
#list can store elements of different types, which makes it slower in calculation
fruit = [1, 3, "apple"]


In [None]:
#multiplication of a list
b = [1, 2, 3]
print(b)

c = b*2
print(c)

[1, 2, 3]
[1, 2, 3, 1, 2, 3]


In [52]:
#multiplication of a numpy array

import numpy as np

a = np.array([1,2,3])
print(a)

d = a*2
print(d)

[1 2 3]
[2 4 6]
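
The point that all elements of an array share one data type can be checked with a short sketch. The variable names are made up for illustration; when a list mixes numbers and text, numpy converts every element to a string.

```python
import numpy as np

# numbers and text mixed together: numpy converts every element to a string
mixed = np.array([1, 3, "apple"])
print(mixed.dtype.kind)  # "U" stands for unicode string

# only integers: the array keeps an integer data type
numbers = np.array([1, 2, 3])
print(numbers.dtype.kind)  # "i" stands for signed integer
```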


In [6]:
#numpy has an implicit index
a[0]


np.int64(1)

### Data objects in pandas - pandas series
- 1D array of indexed data

In [8]:
import pandas as pd

##### how to create
- it can be created from a list, an array, or a dictionary

In [None]:
#from a list without explicit index
series_list = pd.Series([0.25, 0.5, 0.75, 1.0])
print(series_list)


0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64


In [17]:
#from a list with explicit index
series_list = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])
print(series_list)

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64


In [None]:
#from an array
a = np.array([1,2,3])
series_array = pd.Series(a, index=["a", "b", "c"]) 
print(series_array)


a    1
b    2
c    3
dtype: int64


In [None]:
#from a dictionary without index parameter
series_dict = pd.Series({"a":1, "b":2, "c":3})
print(series_dict)

a    1
b    2
c    3
dtype: int64


In [20]:
#from a dictionary with index parameter
series_dict = pd.Series({"a":1, "b":2, "c":3}, index=["a", "c"])
print(series_dict)

a    1
c    3
dtype: int64
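
A related case worth knowing: if the index parameter contains a label that is not a key of the dictionary, pandas fills it with a missing value. A small sketch, where the label "z" is made up for illustration:

```python
import pandas as pd

# "z" is a label that is not a key in the dictionary (made up for this sketch)
series_missing = pd.Series({"a": 1, "b": 2, "c": 3}, index=["a", "z"])
print(series_missing)

# the missing label is filled with NaN, so the dtype becomes float64
print(series_missing.isna())
```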


##### how to select data from it

##### selection with 'key', similar to dictionary

In [21]:
#use the key of the original dictionary
print(series_dict["a"])

1


In [None]:
#also similar to a dictionary, we can iterate over the keys and items of a pandas series
print(series_dict.keys())
print(list(series_dict.items()))

Index(['a', 'c'], dtype='object')
[('a', 1), ('c', 3)]


##### selection with index, similar to list 

In [None]:
series_dict = pd.Series({"a":1, "b":2, "c":3})

#slicing with explicit index
print(series_dict["a":"b"]) #be aware that the right boundary is included here, unlike when slicing a list

#slicing with implicit index
print(series_dict[0:2]) #be aware that the right boundary is excluded here, like when slicing a list


a    1
b    2
dtype: int64
a    1
b    2
dtype: int64


In [31]:
data = pd.Series(['a', 'b', 'c'], 
                 index=[1,3,5])
data

1    a
3    b
5    c
dtype: object

In [34]:
#Using the [] syntax for indexing and slicing a pandas series can be very confusing when the index is made of integers

#pandas is using explicit index when indexing
print(data[1])

#pandas is using implicit index when slicing
print(data[1:3]) #implicit index is 0, 1, 2

a
3    b
5    c
dtype: object


In [None]:
#alternatively we can use iloc or loc for indexing and slicing in pandas series
#for more readable code

#.loc means that it is always explicit index that is used in indexing and slicing
print(data.loc[1])
print(data.loc[1:3])


#.iloc means that it is always implicit index that is used in indexing and slicing
print(data.iloc[1])
print(data.iloc[1:3])

a
1    a
3    b
dtype: object
b
3    b
5    c
dtype: object
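
The same accessors can be applied to a series with string labels. The sketch below rebuilds `series_dict` so that it runs on its own; the names `by_label` and `by_position` are made up for illustration.

```python
import pandas as pd

series_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# .loc slices by explicit index: the right boundary "b" is included
by_label = series_dict.loc["a":"b"]

# .iloc slices by implicit index: the right boundary 2 is excluded
by_position = series_dict.iloc[0:2]

# both selections contain the elements labelled "a" and "b"
print(by_label.equals(by_position))
```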


### Data objects in pandas - pandas dataframe
- similar to a tabular dataset
- 2D array of indexed data: row index and column names
- from a list, an array, or a dictionary, we can create a pandas series
- from multiple pandas series, we can create a dataframe

##### how to create

In [39]:
#create a dict

area_dict = {'California': 423,
             'Texas': 450,
             'New York': 141,
             'Florida': 141,
             'Illinois': 149}

#create a pandas series
area = pd.Series(area_dict)
area

California    423
Texas         450
New York      141
Florida       141
Illinois      149
dtype: int64

In [40]:
#create a dict
population_dict = {'California': 388,
                   'Texas': 264,
                   'New York': 196,
                   'Florida': 195,
                   'Illinois': 127}

population = pd.Series(population_dict)

population

California    388
Texas         264
New York      196
Florida       195
Illinois      127
dtype: int64

In [44]:
#create a dataframe with two pandas series
states = pd.DataFrame({'pop': population,
                       'area': area})

states

            pop  area
California  388   423
Texas       264   450
New York    196   141
Florida     195   141
Illinois    127   149
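
One reason the explicit index is useful: when a dataframe is built from several series, the values are matched by index label, not by position. A minimal sketch with a subset of the data above, where the order of `population` is changed on purpose and `states_small` is a made-up name:

```python
import pandas as pd

area = pd.Series({"California": 423, "Texas": 450, "New York": 141})

# same states in a different order (reordered for this sketch)
population = pd.Series({"New York": 196, "California": 388, "Texas": 264})

# each row combines the values that share the same index label
states_small = pd.DataFrame({"pop": population, "area": area})
print(states_small)
```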


In [45]:
#columns attribute
states.columns #we can reassign the column names with this attribute

Index(['pop', 'area'], dtype='object')
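
Reassigning column names could look like the sketch below. It works on a small copy so that the original `states` dataframe is not changed; the new names "population" and "area_size" are made up for illustration.

```python
import pandas as pd

states_copy = pd.DataFrame(
    {"pop": [388, 264], "area": [423, 450]},
    index=["California", "Texas"],
)

# reassign all column names at once; the list length must match the number of columns
states_copy.columns = ["population", "area"]
print(states_copy.columns)

# rename returns a new dataframe and only changes the columns listed in the mapping
renamed = states_copy.rename(columns={"area": "area_size"})
print(renamed.columns)
```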

In [43]:
#index attribute
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

##### how to select data from it

In [None]:
#select a column

#by using dictionary-style indexing -> preferable for clean code
print(states['area']) #compare dictionary and pandas 

#by using attribute -> can be confusing if a column name is the same as a method name in the dataframe class
print(states.area) #compare with states.pop, which is confusing because the dataframe class has a pop method


California    423
Texas         450
New York      141
Florida       141
Illinois      149
Name: area, dtype: int64
California    423
Texas         450
New York      141
Florida       141
Illinois      149
Name: area, dtype: int64
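
The confusion around `states.pop` can be checked directly. The sketch below rebuilds a smaller version of `states` so that it runs on its own.

```python
import pandas as pd

states = pd.DataFrame(
    {"pop": [388, 264], "area": [423, 450]},
    index=["California", "Texas"],
)

# attribute access finds the pop method of the dataframe class, not the column
print(callable(states.pop))

# dictionary-style access always returns the column as a pandas series
print(type(states["pop"]))
```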


In [51]:
#select rows with slicing
print(states["California":"New York"])
print(states[1:4])

            pop  area
California  388   423
Texas       264   450
New York    196   141
          pop  area
Texas     264   450
New York  196   141
Florida   195   141
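
As with a pandas series, `.loc` and `.iloc` make it explicit which index is used when selecting rows of a dataframe. A minimal sketch that rebuilds `states` so that it runs on its own; the result names are made up for illustration.

```python
import pandas as pd

states = pd.DataFrame(
    {"pop": [388, 264, 196, 195, 127], "area": [423, 450, 141, 141, 149]},
    index=["California", "Texas", "New York", "Florida", "Illinois"],
)

# .loc uses the explicit index: the right boundary "New York" is included
rows_by_label = states.loc["California":"New York"]
print(rows_by_label)

# .iloc uses the implicit index: the right boundary 4 is excluded
rows_by_position = states.iloc[1:4]
print(rows_by_position)

# .loc also accepts a row label and a column name to select one value
one_cell = states.loc["Texas", "area"]
print(one_cell)
```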
