In [1]:
import numpy as np
from collections import Counter
from inspect import isfunction

## Classes and OOP

You have already been using classes, even if you weren't aware of it.

**Classes** are blueprints for custom objects.

**Objects**, or **instances**, of a class are concrete examples of that class that you can manipulate.

Objects of a class are accessed and manipulated through:

- **Properties**, which hold custom values

- **Methods**, which let you operate on and modify your object's properties

E.g.
- `int` is a class
- `0`, `1`, `-2` are objects, or instances, of the `int` class
- `0` can be viewed as the sole property of the object `0`
- the `+` operation is a method of the `int` class
- `list` is a class
- `[0,1,2]` is a `list` class object
- `.append()` is a method of the `list` class
- `str` is a class
- `'xyz'` is a `str` class object
- `.split()`, `.strip()` are methods of the `str` class
- ...


Methods are basically functions tailor-made for the objects of your class.
For safe coding and convenience, it is good practice to distinguish different types of methods depending on how they act on objects, and on whether you want the user to call them directly.
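
The built-in examples listed earlier can be checked directly. The sketch below uses only built-in types (the variable names are illustrative) and shows that `+` is backed by a method, and that calling a method on an object is the same as calling the class's function with the object as first argument.

```python
# 0 is an instance of the int class
print(type(0) is int)               # True
print(isinstance([0, 1, 2], list))  # True

# the + operator is backed by the __add__ method of int
total = (1).__add__(2)
print(total)                        # 3

# a method is a function attached to the class:
# numbers.append(3) is equivalent to list.append(numbers, 3)
numbers = [0, 1, 2]
numbers.append(3)
list.append(numbers, 4)
print(numbers)                      # [0, 1, 2, 3, 4]

# str methods work the same way
words = "x y z".split()
print(words)                        # ['x', 'y', 'z']
```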

You will create a custom class `DataFrame` to automate all the preprocessing operations you've done in the previous notebooks.

The properties of the class will be:
- the column names of your data,
- the `dict` of your data.

And the methods of the class will allow you to perform:
- loading data into your `dict` from a csv file,
- cleaning and preprocessing the data,
- computing simple stats,
- filling missing values (NAs),
- ...

In [7]:
import os
os.getcwd()

'/Users/sugumaran/Documents/EM-LYON/Data cleaning and Analysis from scratch/Day 3'

In [6]:
DATA_PATH = r'/Users/sugumaran/Documents/EM-LYON/Data cleaning and Analysis from scratch/Day 3'

In [8]:
# os.chdir('../../data')
os.chdir(DATA_PATH)

In [10]:
#list the contents of the directory
os.listdir()

['03_class refactorisation instructions.ipynb',
 '.ipynb_checkpoints',
 'victoria.csv']

## Class refactorization

### step 1

Create a class `DataFrame` with
- a `columns` property: a list of the column names
- a `df` property: a dict for the data
- a `__len__` method to compute the number of rows in the data set

In [11]:
class DataFrame():
    # constructor
    def __init__(self):
        self.columns = []
        self.df = dict()

    # length of dataframe
    def __len__(self):
        return len(self.df)

In [12]:
# instantiate an object of the class DataFrame
df = DataFrame()

In [13]:
df.columns, df.df, len(df)

([], {}, 0)
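
`len(df)` works because Python translates it into a call to `df.__len__()`. Since the data dict is keyed by row index, its length is the number of rows. A minimal sketch with a hypothetical `Basket` class (not part of the exercise):

```python
class Basket:
    """Hypothetical toy class, used only to illustrate __len__."""

    def __init__(self):
        self.items = dict()

    def __len__(self):
        # len(basket) is translated by Python into basket.__len__()
        return len(self.items)


basket = Basket()
basket.items[0] = ["apple", "1.20"]
basket.items[1] = ["pear", "0.90"]
print(len(basket))  # 2
```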

### step 2

Add
- a `read_csv()` method to load data into the `df` property from a csv file

In [79]:
class DataFrame():
    def __init__(self):
        self.columns = []
        self.df = dict()

    def __len__(self):
        return len(self.df)
    
    # load data from csv file
    def read_csv(self, file_name, read_type, sep, encoding):
        with open(file_name, read_type, encoding = encoding) as file:
            idx = -1
            for line in file:
                row = line.split(sep)
                if idx == -1 :
                    self.columns = row
                    idx += 1
                else :
                    self.df[idx] = row
                    idx += 1


In [80]:
df = DataFrame()

In [81]:
df.read_csv('victoria.csv', 'r', sep = '*', encoding='utf-8')

In [82]:
df.columns

['product_name',
 'mrp',
 'price',
 'pdp_url',
 'brand_name',
 'product_category',
 'retailer',
 'description',
 'rating',
 'review_count',
 'style_attributes',
 'total_sizes',
 'available_size',
 'color\n']

In [77]:
len(df)

453386
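
The trailing `'\n'` in the last column name comes from iterating over the file: each line keeps its newline character, and `split()` leaves it attached to the last field. A small sketch on an in-memory string (the sample line is made up):

```python
line = "product_name*mrp*color\n"   # made-up header line, '*' separated

raw_fields = line.split("*")
print(raw_fields)                   # ['product_name', 'mrp', 'color\n']

# removing the newline before splitting avoids the problem
clean_fields = line.rstrip("\n").split("*")
print(clean_fields)                 # ['product_name', 'mrp', 'color']
```

In this notebook the newline is left in place at load time and removed later by the preprocessing step.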

### step 3

Add
- a `__getitem__` method to extract columns of `df`
- an `iloc()` method to extract rows of `df` (input = row index)

In [101]:
class DataFrame():
    def __init__(self):
        self.columns = []
        self.df = dict()

    def __len__(self):
        return len(self.df)
    
    # load data from csv file
    def read_csv(self, file_name, read_type, sep, encoding):
        with open(file_name, read_type, encoding = encoding) as file:
            idx = -1
            for line in file:
                row = line.split(sep)
                if idx == -1 :
                    self.columns = row
                    idx += 1
                else :
                    self.df[idx] = row
                    idx += 1
    # to extract columns of dataframe
    def __getitem__(self, col_name):
        col_index = self.columns.index(col_name)
        return [row[col_index] for row in self.df.values()]

In [102]:
df = DataFrame()
df.read_csv('victoria.csv', 'r', sep = '*', encoding='utf-8')

In [103]:
df['mrp']

['$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$14.50 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$15.00 ',
 '$20.00 ',
 '$20.00 ',
 '$20.00 ',
 '$20.00 ',
 '$20.00 ',
 '$20.00 ',
 '$20.00 ',
 '$20.00 ',
 '$20.00 ',
 '$20.00 ',
 '$20.00 ',
 '$20.00 ',
 '$16.50 ',
 '$16.50 ',
 '$16.50 ',
 '$16.50 ',
 '$16.50 ',
 '$16.50 ',
 '$16.50 ',
 '$16.50 ',
 '$16.50 ',
 '$16.50 ',
 '$16.50 ',
 '$10.50 ',
 ...

In [122]:
class DataFrame():
    def __init__(self):
        self.columns = []
        self.df = dict()

    def __len__(self):
        return len(self.df)
    
    # load data from csv file
    def read_csv(self, file_name, read_type, sep, encoding):
        with open(file_name, read_type, encoding = encoding) as file:
            idx = -1
            for line in file:
                row = line.split(sep)
                if idx == -1 :
                    self.columns = row
                    idx += 1
                else :
                    self.df[idx] = row
                    idx += 1
    # to extract columns of dataframe
    def __getitem__(self, col_name):
        col_index = self.columns.index(col_name)
        return [row[col_index] for row in self.df.values()]
        
    def iloc(self, row_index):
        return {col: value for col, value in zip(self.columns, self.df[row_index]) }

In [123]:
df = DataFrame()
df.read_csv('victoria.csv', 'r', sep = '*', encoding='utf-8')

In [130]:
df.iloc(1)

{'product_name': 'Very Sexy Strappy Lace Thong Panty',
 'mrp': '$14.50 ',
 'price': '$14.50 ',
 'pdp_url': 'https://www.victoriassecret.com/panties/shop-all-panties/strappy-lace-thong-panty-very-sexy?ProductID=328310&CatalogueType=OLS',
 'brand_name': "Victoria's Secret",
 'product_category': 'Strappy Lace Thong Panty',
 'retailer': 'Victoriassecret US',
 'description': 'Lots of cheek peek, pretty lace, a strappy back—this sexy panty is so not subtle. Allover lace with front bow V-back with crisscross straps Low rise Minimal back coverage: lots of cheek peek Imported nylon/spandex',
 'rating': '',
 'review_count': '',
 'style_attributes': '',
 'total_sizes': '"[""XS"", ""S"", ""M"", ""L"", ""XL""]"',
 'available_size': 'S',
 'color\n': 'black\n'}

In [131]:
df.columns

['product_name',
 'mrp',
 'price',
 'pdp_url',
 'brand_name',
 'product_category',
 'retailer',
 'description',
 'rating',
 'review_count',
 'style_attributes',
 'total_sizes',
 'available_size',
 'color\n']
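
Because the class defines `__getitem__`, a column can be read with square brackets: `df['mrp']` is equivalent to `df.__getitem__('mrp')`. A sketch on a hypothetical two-row table (`TinyFrame` and its data are made up for illustration):

```python
class TinyFrame:
    """Hypothetical stand-in for DataFrame with hard-coded data."""

    def __init__(self):
        self.columns = ["product_name", "mrp"]
        self.df = {0: ["thong", "$14.50 "], 1: ["bra", "$20.00 "]}

    def __getitem__(self, col_name):
        # position of the requested column in each row
        col_index = self.columns.index(col_name)
        return [row[col_index] for row in self.df.values()]

    def iloc(self, row_index):
        # one row as a {column name: value} dict
        return dict(zip(self.columns, self.df[row_index]))


tiny = TinyFrame()
print(tiny["mrp"])    # ['$14.50 ', '$20.00 ']
print(tiny.iloc(1))   # {'product_name': 'bra', 'mrp': '$20.00 '}
```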

### step 4

Add a private method to clean the column names

In [139]:
class DataFrame():
    def __init__(self):
        self.columns = []
        self.df = dict()

    def __len__(self):
        return len(self.df)
    
    # load data from csv file
    def read_csv(self, file_name, read_type, sep, encoding):
        with open(file_name, read_type, encoding = encoding) as file:
            idx = -1
            for line in file:
                row = line.split(sep)
                if idx == -1 :
                    self.columns = row
                    idx += 1
                else :
                    self.df[idx] = row
                    idx += 1
    # to extract columns of dataframe
    def __getitem__(self, col_name):
        col_index = self.columns.index(col_name)
        return [row[col_index] for row in self.df.values()]

    def iloc(self, row_index):
        return {col: value for col, value in zip(self.columns, self.df[row_index])}

    # to clean the column names
    def clean_column_names(self):
        col_names = [var.strip() for var in self.columns]
        return col_names

In [140]:
df = DataFrame()
df.read_csv('victoria.csv', 'r', sep = '*', encoding='utf-8')

In [142]:
df.clean_column_names()

['product_name',
 'mrp',
 'price',
 'pdp_url',
 'brand_name',
 'product_category',
 'retailer',
 'description',
 'rating',
 'review_count',
 'style_attributes',
 'total_sizes',
 'available_size',
 'color']
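
In Python, a method is marked as private by convention with a leading underscore. Nothing prevents calling it from outside, but the underscore signals that it is meant for internal use. The sketch below assumes one possible design, where a hypothetical `_clean_column_names()` updates `self.columns` in place instead of returning a new list:

```python
class ColumnHolder:
    """Hypothetical minimal class, used only to illustrate the convention."""

    def __init__(self, columns):
        self.columns = columns
        self._clean_column_names()   # internal call, hidden from the user

    def _clean_column_names(self):
        # leading underscore: private by convention
        self.columns = [name.strip() for name in self.columns]


holder = ColumnHolder(["product_name", " mrp ", "color\n"])
print(holder.columns)   # ['product_name', 'mrp', 'color']
```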

### step 5

Add a `preprocessing()` method, receiving as arguments a list of processing functions.

Instantiate the class, load the data, and run the preprocessing with your processing functions from the previous notebook.

In [148]:
def conv_price(price):
    # '$14.50 ' -> 14.5; values without a '$' sign are treated as missing
    return float(price.strip().strip('$')) if '$' in price else np.nan

In [144]:
# a function replacing empty values with NaN and converting other values to float
def repl_nan(l_value):
    return np.nan if l_value == '' else float(l_value)

In [145]:
# a function turning a size string into a list of size strings
def conv_size(l_string):
    import re  # standard library regular expressions
    l_size = l_string.split(',')
    return [re.sub('[^A-Z0-9]', '', size) for size in l_size]

In [146]:
# a function that strips the white space from an element of a list
def clean_last(element):
    return element.strip()

In [147]:
def identity(x):
    return x

In [149]:
processing_functions = [identity,
                       conv_price,
                       conv_price,
                       identity,
                       identity,
                       identity,
                       identity,
                       identity,
                       repl_nan,
                       repl_nan,
                       repl_nan,
                       conv_size,
                       identity,
                       clean_last]
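
The list holds one function per column, in column order, so that `zip()` can pair each function with the value at the same position in a row. A sketch with simplified stand-ins for the functions above and a made-up three-column row:

```python
import math

def identity(x):
    return x

def to_price(price):
    # simplified stand-in for conv_price: '$14.50 ' -> 14.5
    return float(price.strip().strip("$")) if "$" in price else math.nan

functions = [identity, to_price, str.strip]      # one function per column
row = ["Lace Thong", "$14.50 ", "black\n"]       # made-up row

processed = [f(value) for f, value in zip(functions, row)]
print(processed)   # ['Lace Thong', 14.5, 'black']
```

Note that `zip()` stops at the shorter of its inputs, so a row with more fields than there are functions would be silently truncated.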

In [155]:
class DataFrame():
    def __init__(self):
        self.columns = []
        self.df = dict()
        self._preprocessed = False #private property

    def __len__(self):
        return len(self.df)
    
    # load data from csv file
    def read_csv(self, file_name, read_type, sep, encoding):
        with open(file_name, read_type, encoding = encoding) as file:
            idx = -1
            for line in file:
                row = line.split(sep)
                if idx == -1 :
                    self.columns = row
                    idx += 1
                else :
                    self.df[idx] = row
                    idx += 1
    # to extract columns of dataframe
    def __getitem__(self, col_name):
        col_index = self.columns.index(col_name)
        return [row[col_index] for row in self.df.values()]

    def iloc(self, row_index):
        return {col: value for col, value in zip(self.columns, self.df[row_index])}

    # to clean the column names
    def clean_column_names(self):
        col_names = [var.strip() for var in self.columns]
        return col_names
    
    # preprocessing method
    def preprocessing(self, proc_functions):
        if not self._preprocessed:
            self.columns = [s.strip() for s in self.columns]
            self.df = {idx: [f(r) for f,r in zip(proc_functions, row)] for idx, row in self.df.items()}
            self._preprocessed = True
        
        

In [156]:
# preprocessing method written with explicit loops (same behavior as the dict comprehension)
def preprocessing(self, proc_functions):
    if not self._preprocessed:
        self.columns = [s.strip() for s in self.columns]
        new_df = {}
        for idx, row in self.df.items():
            fr = []
            # apply to each value the function found at the same position
            for f, r in zip(proc_functions, row):
                fr.append(f(r))
            new_df[idx] = fr
        self.df = new_df
        self._preprocessed = True
                

In [157]:
df = DataFrame()
df.read_csv('victoria.csv', 'r', sep = '*', encoding='utf-8')

In [159]:
df.preprocessing(processing_functions)

In [161]:
df.iloc(1)

{'product_name': 'Very Sexy Strappy Lace Thong Panty',
 'mrp': 14.5,
 'price': 14.5,
 'pdp_url': 'https://www.victoriassecret.com/panties/shop-all-panties/strappy-lace-thong-panty-very-sexy?ProductID=328310&CatalogueType=OLS',
 'brand_name': "Victoria's Secret",
 'product_category': 'Strappy Lace Thong Panty',
 'retailer': 'Victoriassecret US',
 'description': 'Lots of cheek peek, pretty lace, a strappy back—this sexy panty is so not subtle. Allover lace with front bow V-back with crisscross straps Low rise Minimal back coverage: lots of cheek peek Imported nylon/spandex',
 'rating': nan,
 'review_count': nan,
 'style_attributes': nan,
 'total_sizes': ['XS', 'S', 'M', 'L', 'XL'],
 'available_size': 'S',
 'color': 'black'}

In [162]:
df.columns

['product_name',
 'mrp',
 'price',
 'pdp_url',
 'brand_name',
 'product_category',
 'retailer',
 'description',
 'rating',
 'review_count',
 'style_attributes',
 'total_sizes',
 'available_size',
 'color']
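
The `_preprocessed` flag makes `preprocessing()` safe to call twice: a second pass would otherwise apply string functions to values that are already floats. A sketch of this run-once guard, using a hypothetical `Flagged` class and made-up data:

```python
class Flagged:
    """Hypothetical minimal class illustrating the run-once guard."""

    def __init__(self, values):
        self.values = values
        self._preprocessed = False

    def preprocessing(self, func):
        if not self._preprocessed:
            self.values = [func(v) for v in self.values]
            self._preprocessed = True


def strip_dollar(price):
    # only works on strings: calling it on a float raises AttributeError
    return float(price.strip().strip("$"))


data = Flagged(["$14.50 ", "$20.00 "])
data.preprocessing(strip_dollar)
data.preprocessing(strip_dollar)   # ignored thanks to the flag
print(data.values)                 # [14.5, 20.0]
```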

### step 6

Add a `head()` method to display the first rows in the data frame

In [180]:
class DataFrame():
    def __init__(self):
        self.columns = []
        self.df = dict()
        self._preprocessed = False #private property

    def __len__(self):
        return len(self.df)
    
    # load data from csv file
    def read_csv(self, file_name, read_type, sep, encoding):
        with open(file_name, read_type, encoding = encoding) as file:
            idx = -1
            for line in file:
                row = line.split(sep)
                if idx == -1 :
                    self.columns = row
                    idx += 1
                else :
                    self.df[idx] = row
                    idx += 1
    # to extract columns of dataframe
    def __getitem__(self, col_name):
        col_index = self.columns.index(col_name)
        return [row[col_index] for row in self.df.values()]
        
    def iloc(self, row_index):
        return {col: value for col, value in zip(self.columns, self.df[row_index]) }
    
    #to clean the column names
    def clean_column_names(self):
        col_names = [var.strip() for var in self.columns]
        return col_names
    
    # preprocessing method
    def preprocessing(self, proc_functions):
        if not self._preprocessed:
            self.columns = [s.strip() for s in self.columns]
            self.df = {idx: [f(r) for f,r in zip(proc_functions, row)] for idx, row in self.df.items()}
            self._preprocessed = True
        
    def head(self, n):
        # print the first n rows (fewer if the data set is shorter)
        for i in range(min(n, len(self.df))):
            print(self.df[i])

In [181]:
df = DataFrame()
df.read_csv('victoria.csv', 'r', sep = '*', encoding='utf-8')

In [184]:
df.head(2)

['Very Sexy Strappy Lace Thong Panty', '$14.50 ', '$14.50 ', 'https://www.victoriassecret.com/panties/shop-all-panties/strappy-lace-thong-panty-very-sexy?ProductID=328310&CatalogueType=OLS', "Victoria's Secret", 'Strappy Lace Thong Panty', 'Victoriassecret US', 'Lots of cheek peek, pretty lace, a strappy back—this sexy panty is so not subtle. Allover lace with front bow V-back with crisscross straps Low rise Minimal back coverage: lots of cheek peek Imported nylon/spandex', '', '', '', '"[""XS"", ""S"", ""M"", ""L"", ""XL""]"', 'S', 'peach melba\n']
['Very Sexy Strappy Lace Thong Panty', '$14.50 ', '$14.50 ', 'https://www.victoriassecret.com/panties/shop-all-panties/strappy-lace-thong-panty-very-sexy?ProductID=328310&CatalogueType=OLS', "Victoria's Secret", 'Strappy Lace Thong Panty', 'Victoriassecret US', 'Lots of cheek peek, pretty lace, a strappy back—this sexy panty is so not subtle. Allover lace with front bow V-back with crisscross straps Low rise Minimal back coverage: lots of c

In [185]:
# other example

In [193]:
import numpy as np

In [194]:
class DataFrame():
    def __init__(self):
        self.columns = []
        self.df = dict()
        self._preprocessed = False #private property

    def __len__(self):
        return len(self.df)
    
    # load data from csv file
    def read_csv(self, file_name, read_type, sep, encoding):
        with open(file_name, read_type, encoding = encoding) as file:
            idx = -1
            for line in file:
                row = line.split(sep)
                if idx == -1 :
                    self.columns = row
                    idx += 1
                else :
                    self.df[idx] = row
                    idx += 1
    # to extract columns of dataframe
    def __getitem__(self, col_name):
        col_index = self.columns.index(col_name)
        return [row[col_index] for row in self.df.values()]
        
    def iloc(self, row_index):
        return {col: value for col, value in zip(self.columns, self.df[row_index]) }
    
    #to clean the column names
    def clean_column_names(self):
        col_names = [var.strip() for var in self.columns]
        return col_names
    
    # preprocessing method
    def preprocessing(self, proc_functions):
        if not self._preprocessed:
            self.columns = [s.strip() for s in self.columns]
            self.df = {idx: [f(r) for f,r in zip(proc_functions, row)] for idx, row in self.df.items()}
            self._preprocessed = True
        
    def head(self, n):
        # collect the first n rows
        ls = list()
        for count, v in enumerate(self.df.values()):
            if count >= n:
                break
            ls.append(v)
        # transpose rows into columns, then map column names to their values
        new_ls = np.array(ls).T.tolist()
        return dict(zip(self.columns, new_ls))

In [195]:
%%time
df = DataFrame()
df.read_csv('victoria.csv', 'r', sep = '*', encoding='utf-8')

CPU times: user 9.68 s, sys: 23.7 s, total: 33.4 s
Wall time: 41.6 s


In [196]:
df.head(2)

### step 7

Add a `describe()` method for simple stats of numerical columns, and run it.

In [None]:
class DataFrame():
    def __init__(self):
        self.columns = []
        self.df = dict()
        self._preprocessed = False #private property

    def __len__(self):
        return len(self.df)
    
    # load data from csv file
    def read_csv(self, file_name, read_type, sep, encoding):
        with open(file_name, read_type, encoding = encoding) as file:
            idx = -1
            for line in file:
                row = line.split(sep)
                if idx == -1 :
                    self.columns = row
                    idx += 1
                else :
                    self.df[idx] = row
                    idx += 1
    # to extract columns of dataframe
    def __getitem__(self, col_name):
        col_index = self.columns.index(col_name)
        return [row[col_index] for row in self.df.values()]
        
    def iloc(self, row_index):
        return {col: value for col, value in zip(self.columns, self.df[row_index]) }
    
    #to clean the column names
    def clean_column_names(self):
        col_names = [var.strip() for var in self.columns]
        return col_names
    
    # preprocessing method
    def preprocessing(self, proc_functions):
        if not self._preprocessed:
            self.columns = [s.strip() for s in self.columns]
            self.df = {idx: [f(r) for f,r in zip(proc_functions, row)] for idx, row in self.df.items()}
            self._preprocessed = True
        
    def head(self, n):
        # print the first n rows (fewer if the data set is shorter)
        for i in range(min(n, len(self.df))):
            print(self.df[i])
                
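    def describe(self):
        for col_name in self.columns:
            col = self[col_name]
            # drop missing values before computing stats
            col = [c for c in col if not (isinstance(c, float) and np.isnan(c))]
            # only numerical columns are described
            if len(col) > 0 and isinstance(col[0], (int, float)):
                print(f"column: {col_name}")
                print(f"minimum: {np.min(col)}")
                print(f"first quartile: {np.percentile(col, 25)}")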
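
`np.percentile(values, q)` takes the data first and the percentile (between 0 and 100) second. The sketch below shows one possible set of statistics that `describe()` could report, computed on a made-up list of prices; the choice of statistics is an assumption, not part of the exercise statement.

```python
import numpy as np

prices = [14.5, np.nan, 10.5, 20.0, 15.0]   # made-up sample with one missing value

# drop missing values first: np.min and np.percentile would otherwise return nan
clean = [p for p in prices if not np.isnan(p)]

stats = {
    "count": len(clean),
    "minimum": float(np.min(clean)),
    "first quartile": float(np.percentile(clean, 25)),
    "median": float(np.percentile(clean, 50)),
    "third quartile": float(np.percentile(clean, 75)),
    "maximum": float(np.max(clean)),
}
for name, value in stats.items():
    print(f"{name}: {value}")
```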

### step 8

Add a `value_counts()` method for categorical columns
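
One possible approach, assuming the column is available as a plain list (as returned by `__getitem__`), is to rely on `collections.Counter`. The sketch below is a standalone function rather than a method, and the sample data is made up.

```python
from collections import Counter

def value_counts(column):
    """Return {value: count}, most frequent first."""
    return dict(Counter(column).most_common())

colors = ["black", "peach melba", "black", "white", "black"]  # made-up sample
counts = value_counts(colors)
print(counts)   # {'black': 3, 'peach melba': 1, 'white': 1}
```

Inside the class, the same logic could be applied to `self[col_name]`.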

### step 9

Add a `fill_na()` method, run it on the relevant columns.
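
A minimal sketch of the idea, assuming missing values are stored as NaN floats (as produced by the preprocessing step) and that the replacement value is chosen by the caller. The standalone `fill_na` function and the sample ratings are illustrative only.

```python
import math

def fill_na(column, value):
    """Return a copy of column where NaN entries are replaced by value."""
    return [value if isinstance(c, float) and math.isnan(c) else c for c in column]

ratings = [4.0, math.nan, 5.0, math.nan]          # made-up sample
known = [r for r in ratings if not math.isnan(r)]
mean_rating = sum(known) / len(known)             # 4.5

filled = fill_na(ratings, mean_rating)
print(filled)   # [4.0, 4.5, 5.0, 4.5]
```

Filling with the mean is only one option; a constant or the most frequent value may suit other columns better.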

### step 10

Instantiate a new object of your class, load the data of `amazon.csv`, and test your class methods. You might have to update your processing functions.