# QuickML Documentation

## Installing VMware

VMware, or an equivalent hypervisor such as VirtualBox, must be installed to run virtual environments.


## Data Pre-Processing Function

The first step in creating a machine learning model is pre-processing the data that will be fed into it. Pre-processing follows these steps:

1. Acquire the Dataset
2. Import Necessary Libraries
3. Import the Dataset
4. Handle Missing Values
5. Encode Categorical Data
6. Split into Training and Test Sets
7. Scale Features

```python
# Importing All Libraries
import pandas as pd
import numpy as np
import sklearn as sk
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
```

```python
# Mapping independent, dependent, categorical and missing data
# to begin data pre-processing.
var_map = {
    "independent": ["R&D Spend", "Administration", "Marketing Spend", "State"],
    "dependent": ["Profit"],
    "categorical": ["State"],
    "missing": ["Marketing Spend"]
}
```
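
Because every later step indexes the dataset by these names, a quick consistency check on the mapping can catch typos before any processing starts. The sketch below is an assumption rather than part of QuickML: `validate_var_map` is a hypothetical helper, and the column list is only the CSV header that the mapping above appears to expect.

```python
# Hypothetical helper (not part of QuickML): sanity-check a variable map
# against the column names of a dataset before pre-processing.
def validate_var_map(var_map, columns):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    columns = set(columns)
    independent = set(var_map.get("independent", []))

    # Every column named in the map must exist in the dataset.
    for role, names in var_map.items():
        for name in names:
            if name not in columns:
                problems.append(f"{role}: unknown column {name!r}")

    # Categorical and missing columns are processed as part of X,
    # so they must also be listed as independent variables.
    for role in ("categorical", "missing"):
        for name in var_map.get(role, []):
            if name not in independent:
                problems.append(f"{role}: {name!r} is not an independent variable")

    # A column cannot be both an input and the target.
    for name in sorted(independent & set(var_map.get("dependent", []))):
        problems.append(f"{name!r} is both independent and dependent")

    return problems


var_map = {
    "independent": ["R&D Spend", "Administration", "Marketing Spend", "State"],
    "dependent": ["Profit"],
    "categorical": ["State"],
    "missing": ["Marketing Spend"],
}

# Assumed CSV header, for illustration only.
columns = ["R&D Spend", "Administration", "Marketing Spend", "State", "Profit"]
problems = validate_var_map(var_map, columns)
print(problems or "var_map is consistent with the dataset columns")
```

An empty list means the mapping can be passed on safely; otherwise each entry names the role and the offending column.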

```python
# Defining Function
def dataPreProcess(dataSet, varMap):
    # Obtaining Data Set (dataSet is a path to a CSV file)
    data = pd.read_csv(dataSet)

    # Splitting Dependent & Independent Variables
    # .copy() so the edits below do not touch a view of `data`
    X = data[varMap['independent']].copy()
    y = data[varMap['dependent']].copy()

    # Filling in missing values with the column mean
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    X[varMap['missing']] = imputer.fit_transform(X[varMap['missing']])

    # Encoding Categorical Variables: one-hot encode them and pass the
    # remaining columns through unchanged
    col_trans = make_column_transformer(
        (OneHotEncoder(), varMap['categorical']),
        remainder='passthrough',
        sparse_threshold=0)  # always return a dense array
    X = pd.DataFrame(col_trans.fit_transform(X),
                     columns=col_trans.get_feature_names_out(),
                     index=X.index)

    # Splitting Into Train and Test Set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # Feature Scaling: fit on the training set only, then apply the
    # same transformation to the test set
    scale_X = StandardScaler()
    X_train = pd.DataFrame(scale_X.fit_transform(X_train),
                           columns=X_train.columns, index=X_train.index)
    X_test = pd.DataFrame(scale_X.transform(X_test),
                          columns=X_test.columns, index=X_test.index)

    # Returns a dictionary of pre-processed data
    return {
        'X_train': X_train,
        'X_test': X_test,
        'y_train': y_train,
        'y_test': y_test
    }
```

The data pre-processing function takes the path to a CSV dataset and a mapping that names the dependent, independent, missing and categorical columns. The dataset is split into dependent and independent variables, missing values are imputed with the column mean, and the categorical data is encoded.
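
To make the imputation and encoding steps concrete, here is a minimal sketch on a four-row frame. The data values are invented for illustration; the calls are standard scikit-learn (`SimpleImputer`, `OneHotEncoder`, `make_column_transformer`), and `sparse_threshold=0` is set only to force a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy data invented for illustration; one marketing value is missing.
X = pd.DataFrame({
    "Marketing Spend": [100.0, np.nan, 300.0, 200.0],
    "State": ["New York", "Florida", "New York", "California"],
})

# Mean imputation: the NaN is replaced by the mean of the observed
# values, (100 + 300 + 200) / 3 = 200.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X[["Marketing Spend"]] = imputer.fit_transform(X[["Marketing Spend"]])

# One-hot encoding: "State" becomes one 0/1 column per category
# (in sorted order), and the numeric column is passed through unchanged
# as the last column of the result.
encoder = make_column_transformer(
    (OneHotEncoder(), ["State"]),
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense NumPy array
)
encoded = encoder.fit_transform(X)

print(X)
print(encoded)
```

Three distinct states produce three indicator columns, so the two-column input becomes a four-column numeric array.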

Finally, the data is split into training and test sets (70% and 30%) and feature scaled. The function returns a dictionary of the training and test matrices and vectors, ready for a machine learning model to be fitted on them.
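
The split-then-scale step can be sketched as follows, on synthetic numbers generated only for illustration. The sketch assumes the usual practice of fitting the scaler on the training rows and reusing it, unchanged, on the test rows, so that no information from the test set leaks into the scaling. The `test_size` and `random_state` values match those used in the function.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target, generated only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(20, 3))
y = X @ np.array([1.0, 2.0, 3.0])

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation
# from the training rows and standardizes them.
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses the training mean and standard deviation on the test rows.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

After scaling, each training column has mean 0 and unit variance. The test columns are only approximately standardized, which is expected because their statistics were never used.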