---
**Author**: Gunnvant

**Description**: OOPs for Data Science

**Audience**: Beginner

**Pre-requisites**: Python 101, Working with flat files

---



## TOC:

- Creating simple classes
- Methods vs attributes
- Inheritance
- Class Assignment and Exercise

There are two main reasons to learn Object Oriented Programming (OOPs):

1. Many popular Python libraries for building ML models and data pipelines provide an object-oriented interface.
2. Many ML/DS positions now require some software development skills, and knowledge of OOPs helps a learner greatly there.

This notebook introduces enough OOPs for you to understand code written by other people, create your own custom classes, and modify classes written by someone else.

### Motivating Example:
- Write a function to read a csv file
- Write another function to find the number of rows and columns in the file read

In [1]:
import csv
path = "../../data/file_data.csv"

def read_csv(path=path):
    rows = []
    with open(path,'r',encoding='utf-8') as f:
        reader = csv.reader(f,delimiter = ",")
        for row in reader:
            rows.append(row)
    return rows
            
def shape(rows):
    num_cols = len(rows[0])
    num_rows = len(rows)
    return num_cols,num_rows

In [2]:
table = read_csv(path)

In [3]:
shape(table)

(11, 808)

At first glance this looks like a decent interface to work with. Now imagine you also need to read a JSON, XML or YAML file.

Think critically about the following questions:

1. Will you now write a `read_json` function?
2. How will you organize all the different functions?


One motivation for object oriented programming is to keep related functionality under one roof. We can implement both `read_csv` and `read_json` as methods of a single `class`.

In [4]:
import json
class Reader():
    def read_csv(self,path):
        rows = []
        with open(path,'r',encoding='utf-8') as f:
            reader = csv.reader(f,delimiter = ",")
            for row in reader:
                rows.append(row)
        return rows
    
    def read_json(self,path):
        with open(path,'r',encoding='utf-8') as f:
            data = json.loads(f.read())
        return data

In [5]:
r = Reader()  # r is an object (an instance) of the class Reader

In [6]:
print(dir(r))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'read_csv', 'read_json']


In [7]:
data=r.read_csv(path)

In [8]:
data[0:2]

[['',
  'sensor_id',
  'time',
  'incoming',
  'outgoing',
  'range',
  'date',
  'hour',
  'minute',
  'total',
  'location_name'],
 ['0',
  '52',
  '2021-06-17 07:07:11.937082+00:00',
  '0',
  '2',
  '1min',
  '2021-06-17',
  '7',
  '7',
  '2',
  'reitan_7eleven_carlberner']]

In [9]:
path = "../../data/sample_json.json"
data = r.read_json(path)

In [10]:
data.keys()

dict_keys(['type', 'metadata', 'features', 'bbox'])

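Functions defined inside a class are called methods. Their first parameter, conventionally named `self`, refers to the object the method is called on:

```python
class Reader():
    def read_csv(self,path):  # <-- this is a method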
        rows = []
        with open(path,'r',encoding='utf-8') as f:
            reader = csv.reader(f,delimiter = ",")
            for row in reader:
                rows.append(row)
        return rows
    
    def read_json(self,path):  # <-- this is a method
        with open(path,'r',encoding='utf-8') as f:
            data = json.loads(f.read())
        return data
```
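To make the role of `self` concrete, here is a minimal sketch. It uses a hypothetical, stripped-down `MiniReader` class (not the `Reader` above) that parses CSV text from a string rather than a file, so it runs without any data files. It shows that calling a method on an object is the same as calling the function on the class and passing the object in by hand.

```python
import csv
import io

class MiniReader:
    """Hypothetical, simplified stand-in for Reader: parses text, not files."""
    def read_csv_text(self, text):
        # `self` is the object the method was called on
        return list(csv.reader(io.StringIO(text)))

r = MiniReader()
text = "a,b\n1,2\n"

# Usual call: Python passes r as `self` automatically
rows_via_object = r.read_csv_text(text)
# Same call, with `self` passed explicitly
rows_via_class = MiniReader.read_csv_text(r, text)

print(rows_via_object)
print(rows_via_object == rows_via_class)
```

This is why every method in `Reader` lists `self` as its first parameter even though we never pass it when calling `r.read_csv(path)`.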

Classes can also have attributes, which hold data stored on the object, often values computed in advance. For example, we can add a `shape` attribute that records the shape of the data that was read in:

In [12]:
class Reader():
    def read_csv(self,path):
        rows = []
        with open(path,'r',encoding='utf-8') as f:
            reader = csv.reader(f,delimiter = ",")
            for row in reader:
                rows.append(row)
        self.shape = {'type':'csv','ncols':len(rows[0]),'nrows':len(rows)}
        return rows
    
    def read_json(self,path):
        with open(path,'r',encoding='utf-8') as f:
            data = json.loads(f.read())
        self.shape = {'type':'json','len':len(data)}
        return data

path_csv = "../../data/file_data.csv"
path_json = "../../data/sample_json.json"

r = Reader()
data_csv = r.read_csv(path_csv)
print(r.shape)
data_json = r.read_json(path_json)
print(r.shape)

{'type': 'csv', 'ncols': 11, 'nrows': 808}
{'type': 'json', 'len': 4}
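In the class above, `shape` only comes into existence when one of the read methods runs. The sketch below illustrates what that means for a freshly created object. `LazyReader` and its `read` method are hypothetical names used only for this illustration, and the method takes a list instead of a file path to keep the example self-contained.

```python
class LazyReader:
    """Hypothetical class: the attribute is created inside a method."""
    def read(self, values):
        # self.shape is created here, the first time read() runs
        self.shape = {'len': len(values)}
        return values

r = LazyReader()
print(hasattr(r, 'shape'))    # False: nothing has been read yet

try:
    r.shape
except AttributeError as err:
    print("AttributeError:", err)

r.read([1, 2, 3])
print(r.shape)

# Attributes live on each object separately
other = LazyReader()
print(hasattr(other, 'shape'))
```

Accessing an attribute before it has been set raises `AttributeError`, which is one reason to give every attribute a starting value when the object is created.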


Python recognises some specially named methods that a class can implement. Two common ones are:

1. `__init__()`: Called when an object of the class is created; used to initialize it with some default values
2. `__len__()`: Called by the built-in `len()` function to find the length of an object

In [13]:
class Reader():
    def __init__(self):
        self.shape = {}
    
    def __len__(self):
        if 'nrows' in self.shape:
            return self.shape['nrows']
        elif 'len' in self.shape:
            return self.shape['len']
        else:
            return 0
        
    def read_csv(self,path):
        rows = []
        with open(path,'r',encoding='utf-8') as f:
            reader = csv.reader(f,delimiter = ",")
            for row in reader:
                rows.append(row)
        self.shape = {'type':'csv','ncols':len(rows[0]),'nrows':len(rows)}
        return rows
    
    def read_json(self,path):
        with open(path,'r',encoding='utf-8') as f:
            data = json.loads(f.read())
        self.shape = {'type':'json','len':len(data)}
        return data

In [14]:
path_csv = "../../data/file_data.csv"
path_json = "../../data/sample_json.json"

r = Reader()
print(r.shape)
print(len(r))
data_csv = r.read_csv(path_csv)
print(r.shape)
print(len(r))
data_json = r.read_json(path_json)
print(r.shape)
print(len(r))

{}
0
{'type': 'csv', 'ncols': 11, 'nrows': 808}
808
{'type': 'json', 'len': 4}
4
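As a smaller illustration of the same two methods, the sketch below uses a hypothetical `RowStore` class (not part of this notebook's `Reader`). It shows that `len(obj)` simply calls `obj.__len__()`, and that an object whose length is 0 is treated as false in a boolean context when the class defines `__len__` but not `__bool__`.

```python
class RowStore:
    """Hypothetical container used only to illustrate __init__ and __len__."""
    def __init__(self, rows=None):
        # Default value: an empty list when nothing is passed in
        self.rows = list(rows) if rows is not None else []

    def __len__(self):
        return len(self.rows)

empty = RowStore()
full = RowStore([['a', 'b'], ['1', '2'], ['3', '4']])

print(len(full), full.__len__())   # both give the same number
print(bool(empty), bool(full))     # an empty store is falsy
```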


### Class Case Study:

1. Create a class `Reader` that supports reading csv, tsv or pipe-delimited files

In [15]:
class Reader():
    def __init__(self):
        self.shape = () ## ncols,nrows
        self.data = None
        self.columns = None
    def __len__(self):
        if len(self.shape)!=0:
            return self.shape[1]
        else:
            return 0
    def read_csv(self,path,delimiter=","):
        rows = []
        with open(path,'r',encoding='utf-8') as f:
            reader = csv.reader(f,delimiter=delimiter)
            for row in reader:
                rows.append(row)
        self.columns = rows[0]
        self.data = rows[1:]
        self.shape = (len(self.columns),len(rows))  # nrows counts the header row too

In [16]:
r = Reader()
print(len(r))
print(r.shape)
print(r.columns)
print(r.data)
r.read_csv(path_csv)
print(len(r))
print(r.shape)
print(r.columns)
print(r.data[0:2])

0
()
None
None
808
(11, 808)
['', 'sensor_id', 'time', 'incoming', 'outgoing', 'range', 'date', 'hour', 'minute', 'total', 'location_name']
[['0', '52', '2021-06-17 07:07:11.937082+00:00', '0', '2', '1min', '2021-06-17', '7', '7', '2', 'reitan_7eleven_carlberner'], ['1', '52', '2021-06-17 07:07:51.166361+00:00', '1', '0', '1min', '2021-06-17', '7', '7', '1', 'reitan_7eleven_carlberner']]
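The run above only exercises the default comma delimiter. The sketch below shows how the `delimiter` argument of `csv.reader` handles tab- and pipe-separated data. To stay self-contained it parses small made-up strings through a hypothetical `parse` helper instead of reading files.

```python
import csv
import io

def parse(text, delimiter=","):
    """Hypothetical helper: parse delimited text held in a string."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

# Made-up sample data in three formats
csv_text = "id,name\n1,a\n"
tsv_text = "id\tname\n1\ta\n"
pipe_text = "id|name\n1|a\n"

print(parse(csv_text))
print(parse(tsv_text, delimiter="\t"))
print(parse(pipe_text, delimiter="|"))
```

All three calls return the same list of rows. With the `Reader` class above, the equivalent call for a tab-separated file would be `r.read_csv(path, delimiter="\t")`.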


*Example Continued ...*

2. Now let's extend the class `Reader` to check data types and make sure any value made up only of digits is converted to a `float`. At this point we don't know enough Python to check whether a string such as `'3.14'` is a float value, so in this example we will not handle it

In [17]:
class Reader():
    
    def __init__(self):
        self.shape = () ## ncols,nrows
        self.data = None
        self.columns = None
    
    def __len__(self):
        if len(self.shape)!=0:
            return self.shape[1]
        else:
            return 0
    
    def convert_float(self,value):
        if value.isdigit():
            return float(value)
        else:
            return value
    
    def read_csv(self,path,delimiter=","):
        rows = []
        with open(path,'r',encoding='utf-8') as f:
            reader = csv.reader(f,delimiter=delimiter)
            for row in reader:
                rows.append([self.convert_float(i) for i in row])
        self.columns = rows[0]
        self.data = rows[1:]
        self.shape = (len(self.columns),len(rows)) 

In [18]:
r = Reader()
print(len(r))
print(r.shape)
print(r.columns)
print(r.data)
r.read_csv(path_csv)
print(len(r))
print(r.shape)
print(r.columns)
print(r.data[0:2])

0
()
None
None
808
(11, 808)
['', 'sensor_id', 'time', 'incoming', 'outgoing', 'range', 'date', 'hour', 'minute', 'total', 'location_name']
[[0.0, 52.0, '2021-06-17 07:07:11.937082+00:00', 0.0, 2.0, '1min', '2021-06-17', 7.0, 7.0, 2.0, 'reitan_7eleven_carlberner'], [1.0, 52.0, '2021-06-17 07:07:51.166361+00:00', 1.0, 0.0, '1min', '2021-06-17', 7.0, 7.0, 1.0, 'reitan_7eleven_carlberner']]
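The `isdigit()` check in `convert_float` leaves strings such as `'3.14'` or `'-2'` unconverted. One possible way to cover them, shown here only as a sketch, is to attempt the conversion and fall back to the original string if it fails. `to_number` is a hypothetical name, and this relies on `try`/`except`, which this notebook has not introduced.

```python
def to_number(value):
    """Hypothetical alternative to convert_float: try float(), else keep the string."""
    try:
        return float(value)
    except ValueError:
        # Not parseable as a number, so return the value unchanged
        return value

for raw in ["52", "3.14", "-2", "1min", ""]:
    print(repr(raw), "->", repr(to_number(raw)))
```

Note that `float()` is more permissive than `isdigit()`: it also accepts strings such as `'1e5'`, `'nan'` and `'inf'`, which may or may not be what you want for a given file.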


*Example Continued ...*

3. Now let's change the `data` attribute to a dictionary that maps each column name to the list of values in that column

In [19]:
class Reader():
    
    def __init__(self):
        self.shape = () ## ncols,nrows
        self.data = None
        self.columns = None
    
    def __len__(self):
        if len(self.shape)!=0:
            return self.shape[1]
        else:
            return 0
    
    def convert_float(self,value):
        if value.isdigit():
            return float(value)
        else:
            return value
    
    def read_csv(self,path,delimiter=","):
        rows = []
        with open(path,'r',encoding='utf-8') as f:
            reader = csv.reader(f,delimiter=delimiter)
            for row in reader:
                rows.append([self.convert_float(i) for i in row])
        self.columns = rows[0]
        self.data = {}
        for idx,col in enumerate(self.columns):
            self.data[col] = []
            for row in rows[1:]:
                self.data[col].append(row[idx])
        self.shape = (len(self.columns),len(rows)) 

In [20]:
r = Reader()
print(len(r))
print(r.shape)
print(r.columns)
print(r.data)
r.read_csv(path_csv)
print(len(r))
print(r.shape)
print(r.columns)
print(r.data['sensor_id'][0:10])

0
()
None
None
808
(11, 808)
['', 'sensor_id', 'time', 'incoming', 'outgoing', 'range', 'date', 'hour', 'minute', 'total', 'location_name']
[52.0, 52.0, 52.0, 52.0, 52.0, 52.0, 52.0, 52.0, 52.0, 52.0]
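The nested loops in `read_csv` turn a list of rows into a dictionary of columns. The sketch below repeats that step on a small made-up table and compares it with a more compact version built on `zip(*rows)`. The variable names are illustrative only.

```python
# Made-up header and data rows, already converted to floats
columns = ['sensor_id', 'total']
rows = [[52.0, 2.0], [52.0, 1.0], [52.0, 5.0]]

# Same approach as in read_csv: one list per column, filled row by row
data_loops = {}
for idx, col in enumerate(columns):
    data_loops[col] = []
    for row in rows:
        data_loops[col].append(row[idx])

# Alternative: zip(*rows) regroups the values column by column
data_zip = {col: list(values) for col, values in zip(columns, zip(*rows))}

print(data_zip)
print(data_loops == data_zip)
```

One difference to keep in mind: if some rows are shorter than others, `zip` silently stops at the shortest row, whereas the loop version raises an `IndexError`.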


*Example Continued...*

4. Let's extend this class to also keep a mapping from each column name to its data type

In [21]:
class Reader():
    
    def __init__(self):
        self.shape = () ## ncols,nrows
        self.data = None
        self.columns = None
        self.dtypes = None
    
    def __len__(self):
        if len(self.shape)!=0:
            return self.shape[1]
        else:
            return 0
    
    def convert_float(self,value):
        if value.isdigit():
            return float(value)
        else:
            return value
        
    def get_dtypes(self,rows):
        # infer each column's type from the first data row only (rows[0] is the header)
        self.dtypes = {}
        for idx,value in enumerate(rows[1]):
            self.dtypes[self.columns[idx]] = "num" if isinstance(value,float) else "str"
            
    def read_csv(self,path,delimiter=","):
        rows = []
        with open(path,'r',encoding='utf-8') as f:
            reader = csv.reader(f,delimiter=delimiter)
            for row in reader:
                rows.append([self.convert_float(i) for i in row])
        self.columns = rows[0]
        self.data = {}
        for idx,col in enumerate(self.columns):
            self.data[col] = []
            for row in rows[1:]:
                self.data[col].append(row[idx])
        self.get_dtypes(rows)
        self.shape = (len(self.columns),len(rows)) 

In [22]:
r = Reader()
print(len(r))
print(r.shape)
print(r.dtypes)
print(r.columns)
print(r.data)
r.read_csv(path_csv)
print(len(r))
print(r.shape)
print(r.dtypes)
print(r.columns)
print(r.data['sensor_id'][0:10])

0
()
None
None
None
808
(11, 808)
{'': 'num', 'sensor_id': 'num', 'time': 'str', 'incoming': 'num', 'outgoing': 'num', 'range': 'str', 'date': 'str', 'hour': 'num', 'minute': 'num', 'total': 'num', 'location_name': 'str'}
['', 'sensor_id', 'time', 'incoming', 'outgoing', 'range', 'date', 'hour', 'minute', 'total', 'location_name']
[52.0, 52.0, 52.0, 52.0, 52.0, 52.0, 52.0, 52.0, 52.0, 52.0]
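`get_dtypes` decides each column's type from the first data row alone. That works for this file, but a column whose first value is numeric and whose later values are not would still be labelled `num`. The sketch below shows a stricter variant that inspects every value. `infer_dtypes` is a hypothetical helper and the data is made up.

```python
def infer_dtypes(data):
    """Hypothetical helper: `data` maps column name -> list of values."""
    dtypes = {}
    for col, values in data.items():
        # A column is numeric only if it is non-empty and every value is a float
        all_numeric = len(values) > 0 and all(isinstance(v, float) for v in values)
        dtypes[col] = "num" if all_numeric else "str"
    return dtypes

data = {
    'total': [2.0, 1.0, 5.0],
    'reading': [3.0, 'n/a', 4.0],        # numeric in the first row, but not everywhere
    'range': ['1min', '1min', '1min'],
}
print(infer_dtypes(data))
```

The trade-off is speed: checking every value takes longer than looking at a single row, which matters on large files.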


## Class Inheritance

Class inheritance lets us create a new class from an existing class. The new class gets all the attributes and methods of the existing class and can add its own.

### Motivating Example:

1. Now that we have the `Reader` class, suppose we want to add functionality that helps us analyze the data.
2. Instead of adding more methods to the `Reader` class, we can create a new class that inherits from `Reader`.

In [23]:
class Table(Reader):
    def get_num_cols(self):
        cols = []
        for col in self.dtypes:
            if self.dtypes[col]=='num':
                cols.append(col)
        return cols
    def summary(self):
        cols = self.get_num_cols()
        results = {}
        for col in cols:
            values = self.data[col]
            results[col]={'mean':sum(values)/len(values),'min':min(values),'max':max(values)}
        return results

In [24]:
r = Table()
print(len(r))
print(r.shape)
print(r.dtypes)
print(r.columns)
print(r.data)
r.read_csv(path_csv)
print(len(r))
print(r.shape)
print(r.dtypes)
print(r.columns)
print(r.data['sensor_id'][0:10])
print(r.summary())

0
()
None
None
None
808
(11, 808)
{'': 'num', 'sensor_id': 'num', 'time': 'str', 'incoming': 'num', 'outgoing': 'num', 'range': 'str', 'date': 'str', 'hour': 'num', 'minute': 'num', 'total': 'num', 'location_name': 'str'}
['', 'sensor_id', 'time', 'incoming', 'outgoing', 'range', 'date', 'hour', 'minute', 'total', 'location_name']
[52.0, 52.0, 52.0, 52.0, 52.0, 52.0, 52.0, 52.0, 52.0, 52.0]
{'': {'mean': 403.0, 'min': 0.0, 'max': 806.0}, 'sensor_id': {'mean': 52.0, 'min': 52.0, 'max': 52.0}, 'incoming': {'mean': 164.96034696406443, 'min': 0.0, 'max': 13406.0}, 'outgoing': {'mean': 253.70879801734822, 'min': 0.0, 'max': 20561.0}, 'hour': {'mean': 13.53903345724907, 'min': 0.0, 'max': 23.0}, 'minute': {'mean': 29.71251548946716, 'min': 0.0, 'max': 59.0}, 'total': {'mean': 418.6691449814126, 'min': 1.0, 'max': 33967.0}}
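`Table` defines no `__init__` of its own, so Python falls back to `Reader.__init__`, which is why a fresh `Table` already has `shape`, `data`, `columns` and `dtypes`. The sketch below illustrates this lookup, plus how a child class can replace an inherited method and still reuse the parent's version through `super()`. `Base` and `WithStats` are hypothetical classes written only for this illustration.

```python
class Base:
    def __init__(self):
        self.data = []

    def load(self, values):
        self.data = list(values)

class WithStats(Base):
    def __init__(self):
        super().__init__()      # run the parent's __init__ first
        self.loaded = False     # then add an attribute of our own

    def load(self, values):     # replaces (overrides) Base.load
        super().load(values)    # reuse the parent's behaviour
        self.loaded = True

    def mean(self):             # new method that only the child has
        return sum(self.data) / len(self.data)

t = WithStats()
t.load([1.0, 2.0, 3.0])
print(t.mean())
print(isinstance(t, Base), issubclass(WithStats, Base))
```

An object of the child class is also an instance of the parent class, so any code that expects a `Base` (or, in this notebook, a `Reader`) will accept it.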
