## Cognizant - Artificial Intelligence Virtual Experience Program

### Task 4 - Machine Learning Production
Developing machine learning algorithms for production.

---------------------------------------------------------------
Gala Groceries found the results of the machine learning model promising and believes that, with more data and time, it can add real value to the business.

To build the foundation for this machine learning use case, they want to deploy a first version of the algorithm to production. In its current state, as a Python notebook, the code is not suitable for running a machine learning model in production.

Therefore, as the Data Scientist who created this algorithm, it is your job to prepare a Python module that contains code to train a model and output the performance metrics when the file is run. More information about Python modules and running Python files is provided in the additional resources. You can assume for this task that the Python file does not need to process, clean, or transform the dataset. The Python file should be able to load a CSV file into a data frame, then immediately start training on that data. Assume that the CSV file will contain the same columns as the dataset that you trained the model on in the previous task.
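
The "train when the file is run" requirement is commonly met with a `__main__` guard, so the same file can be both imported and executed. The following is a minimal sketch rather than the required solution: `load_data`, `main`, and the default path `sales.csv` are hypothetical names chosen for illustration, and the training step is left as a comment.

```python
import sys
from pathlib import Path

import pandas as pd

DEFAULT_PATH = "sales.csv"  # hypothetical file name, for illustration only


def load_data(path: str) -> pd.DataFrame:
    """Load the CSV at `path` into a DataFrame without cleaning or transforming it."""
    return pd.read_csv(path)


def main(path: str = DEFAULT_PATH):
    """Load the data and report its size; returns None if the file is missing."""
    if not Path(path).is_file():
        print(f"No CSV found at {path!r}; nothing to train on.")
        return None
    df = load_data(path)
    print(f"Loaded {df.shape[0]} rows and {df.shape[1]} columns from {path}")
    # Model training and metric reporting would be called from here.
    return df


# This branch runs only when the file is executed directly,
# not when it is imported by another module.
if __name__ == "__main__":
    # Use the first command-line argument as the CSV path if one is given.
    main(sys.argv[1] if len(sys.argv) > 1 else DEFAULT_PATH)
```

Keeping the work inside `main` rather than at module level means the ML engineering team can import the functions without triggering a training run.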

Be sure to write good-quality code: follow best practices and write your code in a clear and uniform manner. More information about best practices is provided in the additional resources. Furthermore, document your code with comments, as this will help the ML engineering team understand what you’ve written.

In [1]:
# Importing Libraries

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

In [25]:
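def get_data(filepath: str) -> pd.DataFrame:
    """Load a CSV file into a DataFrame and print a short summary of it."""
    
    dt = pd.read_csv(filepath)
    
    # shape is (rows, columns)
    print(f'{dt.shape[0]} Rows \n{dt.shape[1]} Columns \n')
    
    # info() prints its report directly and returns None, so it is not wrapped in print()
    dt.info()
    print('\n')
    print(dt.describe())
    
    return dt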
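
Before training, the loaded frame has to be separated into predictors and a target. A minimal sketch of such a helper follows. The function name `create_target_and_predictors` and the column name `estimated_stock_pct` are assumptions, since the task text does not name the target column, and the example frame is made up.

```python
import pandas as pd


def create_target_and_predictors(data: pd.DataFrame, target: str = "estimated_stock_pct"):
    """Split `data` into a predictor frame X and a target series y.

    The default target name is a placeholder; pass the real column name.
    """
    if target not in data.columns:
        raise KeyError(f"Target column {target!r} not found in the data")
    X = data.drop(columns=[target])
    y = data[target]
    return X, y


# Tiny made-up frame, used only to show the call
example = pd.DataFrame({"quantity": [1, 2, 3], "estimated_stock_pct": [0.2, 0.5, 0.9]})
X, y = create_target_and_predictors(example)
print(X.columns.tolist(), y.name)
```

Failing early with a clear error when the target column is absent is helpful here, because the module is expected to receive CSV files it has not seen before.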
    

In [27]:
def training(x_param, y_param, split_ratio=0.8, folds=10, seed=42):
    """Train a RandomForestRegressor on `folds` random splits and report the MAE of each."""
    
    errors = []
    
    for fold in range(folds):
    
        # Instantiate algorithm
        model = RandomForestRegressor()
        scaler = StandardScaler()
        
        # Split training and testing samples.
        # The seed is offset by the fold index so that each fold gets a different split.
        X_train, X_test, Y_train, Y_test = train_test_split(x_param, 
                                                            y_param,
                                                            train_size=split_ratio,
                                                            random_state=seed + fold)
        
        # Fit the scaler on the training data only, then scale both sets
        scaler.fit(X_train)
        X_train = scaler.transform(X_train)
        X_test = scaler.transform(X_test)
        
        # Training
        trained_model = model.fit(X_train, Y_train)
        
        # Predictions
        y_pred = trained_model.predict(X_test)
        
        # Compute the mean absolute error (lower is better)
        mae = mean_absolute_error(y_true=Y_test, y_pred=y_pred)
        errors.append(mae)
        print(f"Fold {fold + 1}: Mean Absolute Error = {mae:.3f}")
    
    # Report and return the average error across all folds
    average_mae = sum(errors) / len(errors)
    print(f"Average MAE: {average_mae:.2f}")
    return average_mae
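
Repeating the split only yields distinct evaluations if the seed differs between iterations: with one fixed `random_state`, `train_test_split` returns the same partition on every call. The sketch below illustrates this on made-up data. Offsetting the seed by the fold index is one simple convention, not the only option.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # made-up data: 20 samples, 2 features
y = np.arange(20)

# Same seed on both calls: the test sets are identical.
_, test_a, _, _ = train_test_split(X, y, train_size=0.8, random_state=42)
_, test_b, _, _ = train_test_split(X, y, train_size=0.8, random_state=42)
same_split = np.array_equal(test_a, test_b)

# Seed offset by the fold index: each fold draws its own partition.
seed = 42
fold_test_sets = []
for fold in range(3):
    _, X_test, _, _ = train_test_split(X, y, train_size=0.8, random_state=seed + fold)
    fold_test_sets.append(X_test)
different_splits = not np.array_equal(fold_test_sets[0], fold_test_sets[1])

print(f"Fixed seed gives identical test sets: {same_split}")
print(f"Offset seeds give different test sets: {different_splits}")
```

The runs stay reproducible either way, because every split is still derived from the single `seed` value.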
    