# Optimus Example

This notebook is a simple tutorial on the DataFrameTransformer, DataFrameProfiler, DataFrameAnalyzer and Utilities modules.

- DataFrameTransformer is a dedicated module to easily make DataFrame transformations.

- DataFrameProfiler is a dedicated module to run a basic profile of the DataFrame.

- DataFrameAnalyzer is a dedicated module to plot and inspect important features of a Spark DataFrame.

- The Utilities module contains tool classes that support the use of the DataFrameTransformer and DataFrameAnalyzer modules.

### Importing Modules

In [None]:
# Import optimus
import optimus as op

### Instantiation of Utility class
The Utilities class is a tool class that includes functions to read CSV files and to manage checkpoints (to save, or temporarily save, DataFrames).

In [None]:
# Instance of Utilities class
tools = op.Utilities()

### Reading DataFrame

In [None]:
# Read the dataframe. In this case, the local file
# system (the hard drive of the PC) is used.

df = tools.read_csv(path="foo.csv", sep=',')

### General view of DataFrame

Initially it is a good idea to see a general view of the DataFrame to be analyzed. 

In the following cell, a basic profile of the DataFrame is shown. This overview presents basic information about the DataFrame, such as the number of variables it has, how many values are missing and in which columns, the type of each variable, and some descriptive statistics for each variable plus a frequency plot.

In [None]:
# Instance of profiler class
profiler = op.DataFrameProfiler(df)
profiler.profiler()

### Instantiation of analyzer class

If you want more information for data exploration, Optimus has the DataFrameAnalyzer, which provides several functions for analyzing your dataset. It presents a table that specifies the datatypes found in each column of the DataFrame, among other features.

In [None]:
# Instance of analyzer class
analyzer = op.DataFrameAnalyzer(df=df)

DataFrameAnalyzer has a method called `column_analyze`. This method checks all rows of
the DataFrame and tries to parse each element to determine whether it is a string or
a number. It can also show 20 distinct values of each column, classified according to
their possible datatype. For example, a number can be stored as a string, so this
function can recognize a number in a column of string datatype by trying to parse the string.

Also the function can plot numerical or categorical histograms.
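
The parsing idea described above can be illustrated in plain Python, without Spark. This is only a sketch of the technique, not the code Optimus runs: the helper names `infer_cell_type` and `summarize_column` and the sample values are hypothetical.

```python
def infer_cell_type(value):
    """Classify one cell as 'null', 'integer', 'float' or 'string'."""
    if value is None or str(value).strip() == "":
        return "null"
    text = str(value).strip()
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "string"


def summarize_column(values):
    """Count how many cells of a column fall into each inferred type."""
    counts = {}
    for value in values:
        kind = infer_cell_type(value)
        counts[kind] = counts.get(kind, 0) + 1
    return counts


# Hypothetical column content: words, a number stored as text, empty cells.
sample_column = ["taco", "pizza", "110790", "", None, "3.5"]
type_counts = summarize_column(sample_column)
print(type_counts)
```

A column declared as string can therefore still report integer or float cells, which is the kind of mismatch the analyzer tables are meant to surface.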

### General view of DataFrame

In the following cell, the basic results of analyzing the DataFrame are shown. They include a table that specifies the datatypes found in each column of the DataFrame, among other features. For this particular case, the datatype table is also shown in order to visualize a sample of the column content.

In [None]:
analyzer_tables = analyzer.column_analyze(column_list="*", print_type=True, plots=False)

The results obtained by running the analyzer detail the presence of special characters,
string columns that may contain numbers, and None or empty string values in columns.

You can also plot histograms for individual columns using the `plot_hist` function. This is an interesting feature because you are plotting from a Spark Dataframe:

In [None]:
analyzer.plot_hist("price","numerical")

### Instantiation of DataFrameTransformer
DataFrameTransformer is a specialized class for DataFrame transformations. Transformations are optimized as much as possible by internally using native Spark
transformation functions.

In [None]:
# Instance of transformer class 
transformer = op.DataFrameTransformer(df)

In [None]:
transformer.show()

### Trimming blank spaces at the beginning and end of DataFrame cells

In [None]:
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)

# Trimming string blank spaces:
transformer.trim_col("*")

# Printing trimmed dataFrame:
print('Trimmed dataFrame:')
transformer.show(5)

### Removing special chars and accents:

In [None]:
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)

# Remove special chars:
transformer.remove_special_chars("*").clear_accents("*")

# This can also be done by passing a regex if you want something more customized:
# transformer.remove_special_chars_regex("*", r'[^\w\s]').clear_accents("*")

# Printing dataFrame without special chars and accents:
print('DataFrame without special chars and accents:')
transformer.show(5)
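
What these two clean-up steps do to a single string can be sketched in plain Python. This is an illustration under assumptions, not the Optimus implementation: the functions below only borrow the method names, and the regex is the one from the commented example.

```python
import re
import unicodedata


def clear_accents(text):
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_special_chars(text, pattern=r"[^\w\s]"):
    """Delete every character that is neither a word character nor whitespace."""
    return re.sub(pattern, "", text)


raw = "José#1!"  # hypothetical cell value
cleaned = clear_accents(remove_special_chars(raw))
print(cleaned)
```

The accented letter survives the first step because `\w` matches Unicode letters in Python 3; it is only folded to its base letter by the accent step.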

### Drop dummy column

In [None]:
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)

# Dropping a column:
transformer.drop_col("dummyCol")

# Printing dataFrame without the dummy column:
print('Dataframe without dummy column:')
transformer.show(5)

### Setting all letters to lowerCase

In [None]:
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)

print('Setting all letters to lowerCase:')
transformer.lower_case("*")
transformer.show(5)

### Date Transformation (Format of date is changed)

In [None]:
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)

# Printing the new date format:
print('Dataframe with new date format:')
transformer.date_transform("birth", "yyyyMMdd", "dd-MM-YYYY") \
           .show(5)
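
For a single value, the same reformatting can be sketched with Python's `datetime` module. The function name `reformat_date` and the sample date are hypothetical; the `strptime` codes `%Y%m%d` and `%d-%m-%Y` are assumed to be the counterparts of the `yyyyMMdd` and `dd-MM-YYYY` patterns used in the cell above.

```python
from datetime import datetime


def reformat_date(text, current_format="%Y%m%d", output_format="%d-%m-%Y"):
    """Parse a date string in one format and render it in another."""
    return datetime.strptime(text, current_format).strftime(output_format)


converted = reformat_date("19800730")  # hypothetical birth value
print(converted)  # 30-07-1980
```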

### Age calculation from the client's birth date

In [None]:
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)

print("Calculating client age from birth date:")
transformer.age_calculate("birth", "dd-MM-YYYY", "clientAge") \
           .show(5)
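
One way to compute an age from a birth date string is sketched below in plain Python. It counts completed years and is only an assumption about the intended result; `age_calculate` may round differently. The function name `age_in_years` and the dates are hypothetical.

```python
from datetime import date, datetime


def age_in_years(birth_text, date_format="%d-%m-%Y", today=None):
    """Return completed years between a birth date string and a reference day."""
    birth = datetime.strptime(birth_text, date_format).date()
    today = today or date.today()
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


reference_day = date(2017, 7, 1)  # fixed so the result is reproducible
age = age_in_years("30-07-1980", today=reference_day)
print(age)
```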

### Renaming columns:

In [None]:
# Printing original dataframe:
print("Original dataframe")
transformer.show(5)
print("Renaming some columns of dataFrame")
transformer.rename_col(columns=[("clientAge", "age")])
transformer.show(5)

### Changing the position of DataFrame columns:

In [None]:
# Printing original dataframe:
print("Original dataframe")
transformer.show(5)

# Move the age column to just after the lastName column
print("age column moved")
transformer.move_col("age", "lastName", "after")
transformer.show(5)
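
The reordering itself can be pictured as an operation on the list of column names. The sketch below is plain Python with a hypothetical helper `move_column` and a hypothetical column list; it is not how Optimus implements `move_col`.

```python
def move_column(columns, column, ref_col, position):
    """Return a new column order with `column` placed before or after `ref_col`."""
    if position not in ("before", "after"):
        raise ValueError("position must be 'before' or 'after'")
    reordered = [name for name in columns if name != column]
    index = reordered.index(ref_col)
    if position == "after":
        index += 1
    reordered.insert(index, column)
    return reordered


columns = ["id", "firstName", "lastName", "birth", "age"]  # hypothetical schema
new_order = move_column(columns, "age", "lastName", "after")
print(new_order)
```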

### Setting a custom transformation
The core of this function is the user-defined function supplied as a lambda in the 'func' argument.

In this example, cells that are less than 20 are multiplied by 20; the rest stay intact.
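
The rule can be checked on a few plain values before running it on the DataFrame. The named function below is a hypothetical stand-in for the lambda, and the sample prices are made up.

```python
def scale_small_values(cell):
    """Multiply values below 20 by 20; leave None and larger values unchanged."""
    if cell is not None and cell < 20:
        return cell * 20
    return cell


prices = [8, 20, 35, None, 3]  # hypothetical cell values
transformed = [scale_small_values(price) for price in prices]
print(transformed)  # [160, 20, 35, None, 60]
```

The explicit `None` check matters because comparing `None` with a number raises a `TypeError` in Python 3.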

In [None]:
# Printing original dataframe:
print("Original dataframe")
transformer.show(5)

print('Multiplying a number by 20 if the value in the cell is less than 20:')
# Replacing a number:
func = lambda cell: (cell * 20) if ((cell is not None) and (cell < 20)) else cell
transformer.set_col(['price'], func, 'integer')
transformer.show(20)

After the transformation process detailed in the previous cells, it is a good idea to
analyze the DataFrame again to see whether the transformations have solved the issues related
to special characters, the presence of numbers in columns that are supposed to contain only letters, etc.

### Analyzing columns after transformations

In [None]:
# Setting the transformed dataFrame into the analyzer class
analyzer.set_data_frame(transformer.df)
analyzer_table = analyzer.column_analyze("*", print_type=True, plots=True)

It can be seen from the output of the analyzer object that there are columns with numbers
even when a certain column is supposed to contain only words or letters.

To solve this problem, the `operation_in_type` function of the DataFrameTransformer class
can be used.

`operation_in_type` is useful for operating on the elements of a column that belong to a certain datatype. In this particular example, the last output cell shows (specifically in the 'product' column) values that don't fit the rest of the data: they aren't strings but numbers or empty strings. `operation_in_type` can take care of them and clean the column.

In the following example, `operation_in_type` is run to convert every string that can be
parsed as an integer into a null (None) value. Notice how the 110790 value in the product
column has been changed, while the rest of the column has remained intact.
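
The effect on a single column can be sketched in plain Python as follows. This is an illustration of the idea only; the helpers `parses_as_integer` and `replace_integers` and the sample values are hypothetical, not part of Optimus.

```python
def parses_as_integer(value):
    """True when the cell is a string that int() can parse."""
    if not isinstance(value, str):
        return False
    try:
        int(value.strip())
        return True
    except ValueError:
        return False


def replace_integers(values, replacement=None):
    """Replace integer-like strings and keep every other cell as it is."""
    return [replacement if parses_as_integer(v) else v for v in values]


products = ["taco", "110790", "pizza", "piza"]  # hypothetical column content
cleaned_products = replace_integers(products)
print(cleaned_products)  # ['taco', None, 'pizza', 'piza']
```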

### Transforming the elements of an inferred datatype in certain columns

In [None]:
# This function transforms, in the specified column, only the cells
# that are recognized as the specified dataType.
transformer.operation_in_type([('product', 'integer', None)]).show()

Sometimes there are values that are written differently but actually mean the same thing. In the product
column, for example, there are the following values: 'taaaccoo', 'piza'. It
can be inferred that the correct values are taco and pizza. This problem can
be solved with the lookup function of the DataFrameTransformer class.

### Replacing multiple string values with a single string

In [None]:
transformer.lookup('product', str_to_replace='taco', list_str=['taaaccoo']) 
transformer.lookup('product', str_to_replace='pizza', list_str=['piza', 'pizzza']) 
transformer.show(20)

As can be seen above, the strings specified in the 'list_str' argument have been
replaced with the 'str_to_replace' value.
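
The replacement rule can be sketched on a plain list. The function below reuses the argument names from the cell above for readability, but it is a hypothetical stand-alone helper, not the Optimus method, and the sample values are made up.

```python
def lookup(values, str_to_replace, list_str):
    """Replace every value found in list_str with str_to_replace."""
    variants = set(list_str)
    return [str_to_replace if value in variants else value for value in values]


products = ["taaaccoo", "piza", "pizzza", "pizza", None]  # hypothetical column
products = lookup(products, "taco", ["taaaccoo"])
products = lookup(products, "pizza", ["piza", "pizzza"])
print(products)  # ['taco', 'pizza', 'pizza', 'pizza', None]
```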

## Chaining and lazy evaluation

The previous transformations were done step by step, but the same result can be achieved by
chaining all operations into a single statement, as in the cell below, thanks to the chaining
feature of the DataFrameTransformer class. This approach is more efficient and scalable because
no intermediate `show()` is triggered between steps, so it makes as much use as possible of
the lazy evaluation approach.
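
The combination of chaining and deferred execution can be pictured with a toy class in plain Python. This only mimics the idea; Spark does its own planning, and the class `LazyPipeline` with its methods is hypothetical.

```python
class LazyPipeline:
    """Collects transformations and applies them only when results are requested."""

    def __init__(self, rows):
        self._rows = rows
        self._steps = []

    def _add(self, func):
        self._steps.append(func)
        return self  # returning self is what makes chaining possible

    def trim(self):
        return self._add(str.strip)

    def lower_case(self):
        return self._add(str.lower)

    def collect(self):
        """Run every queued step in one pass over the data."""
        result = []
        for row in self._rows:
            for step in self._steps:
                row = step(row)
            result.append(row)
        return result


pipeline = LazyPipeline(["  Taco ", " PIZZA"]).trim().lower_case()
queued_steps = len(pipeline._steps)  # steps are queued, nothing is computed yet
rows = pipeline.collect()            # the work happens here, in a single pass
print(queued_steps, rows)
```

Each method only records a step and returns the same object, so the data is touched once, when `collect()` is called.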

In [None]:
# Instantiate DataFrameTransformer
transformer = op.DataFrameTransformer(df)
# Show the original dataFrame.
transformer.show(20)

# Chaining function transformations
transformer.trim_col("*") \
           .remove_special_chars("*") \
           .clear_accents("*") \
           .lower_case("*") \
           .drop_col("dummyCol") \
           .date_transform("birth", "yyyyMMdd", "dd-MM-YYYY") \
           .age_calculate("birth", "dd-MM-YYYY", "clientAge") \
           .operation_in_type([('product', 'integer', None)]) \
           .lookup('product', str_to_replace='taco', list_str=['taaaccoo']) \
           .lookup('product', str_to_replace='pizza', list_str=['piza', 'pizzza'])

transformer.show(20)