# Machine Learning Useful Functions and Cheats
### Pandas and Scikit-Learn Library

This notebook collects examples, helper functions and shortcuts for pandas (https://pandas.pydata.org/), NumPy (https://numpy.org/) and the scikit-learn library (https://scikit-learn.org/stable/).

In [3]:
### Libraries covered in this Jupyter notebook:
import numpy as np
import pandas as pd
import sklearn 

### Usually only individual sklearn subpackages are imported,
### for example the preprocessing package:
from sklearn import preprocessing

The preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

In general, learning algorithms benefit from standardization of the data set. If some outliers are present in the set, robust scalers or transformers are more appropriate. The behaviors of the different scalers, transformers, and normalizers on a dataset containing marginal outliers are highlighted in the scikit-learn example "Compare the effect of different scalers on data with outliers".

https://scikit-learn.org/stable/modules/preprocessing.html
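As a rough illustration of the point about outliers, the sketch below scales the same column with StandardScaler and RobustScaler. The data is a made-up example (not from the original notebook): four ordinary values and one large outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical single feature; the last value is an outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler centers on the mean and divides by the standard deviation,
# so the outlier influences both statistics
X_standard = StandardScaler().fit_transform(X)

# RobustScaler centers on the median and divides by the interquartile range,
# which the single outlier does not change
X_robust = RobustScaler().fit_transform(X)

print(X_standard.ravel())
print(X_robust.ravel())
```

With the robust scaler the four ordinary values keep an even spacing around zero, while the standard scaler squeezes them together because the outlier inflates the standard deviation.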

### Simple Imputer
#### Package to clean missing values

SimpleImputer is a scikit-learn class that fills in the missing values of the columns of a data set:

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

Parameters:  
strategy: the imputation strategy (default='mean').  
Accepted values: "mean", "median", "most_frequent", "constant"

In [1]:
from sklearn.impute import SimpleImputer
def getImputer(df, _strategy="mean", _fill_value=0):
    ### The fill_value parameter only applies when strategy="constant"
    my_imputer = SimpleImputer(missing_values=np.nan, strategy=_strategy, fill_value=_fill_value)
    my_imputer.fit(df)
    return my_imputer

#### Fill the missing values of a data frame
def fill_missing_values(df, my_imputer):
    ### SimpleImputer returns a plain array, so restore the column names and the index
    imputed_df = pd.DataFrame(my_imputer.transform(df), columns=df.columns, index=df.index)
    return imputed_df

#### Instructions
### First, call getImputer with the chosen strategy on the selected columns
### Second, call fill_missing_values to impute the values
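The two steps can be sketched end to end on a tiny data frame. The column names and values below are hypothetical, and the example calls SimpleImputer directly so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy_df = pd.DataFrame({"age": [20.0, np.nan, 40.0],
                       "income": [1000.0, 2000.0, np.nan]})

# Step 1: fit the imputer (it learns the mean of each column)
toy_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy_imputer.fit(toy_df)

# Step 2: transform, then restore the column names and the index
toy_imputed = pd.DataFrame(toy_imputer.transform(toy_df),
                           columns=toy_df.columns, index=toy_df.index)
print(toy_imputed)
```

Fitting on the training data only and reusing the fitted imputer on validation data keeps the validation set from influencing the fill values.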

### Train and Test Split (80-20) Method
#### Useful for splitting data sets to train and compare

The scikit-learn library also lets you split data (for example 80%-20%) very easily with the following function:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Parameters: *arrays, test_size (default: None), train_size (default: None), random_state (default: None), shuffle (default: True) and stratify (default: None)

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(X, y, _train_size=0.8, _random_state=None):
    # test_size is left unset, so it defaults to the complement of train_size
    X_train, X_target, y_train, y_target = train_test_split(X, y, train_size=_train_size,
                                                            random_state=_random_state)
    return [X_train, X_target, y_train, y_target]
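A minimal usage sketch of an 80-20 split, on made-up arrays of 10 samples, shows the sizes of the resulting pieces. The variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# 80% of the rows go to training, the remaining 20% to testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Setting random_state makes the shuffle reproducible, which helps when comparing models on the same split.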

### Pandas Section
Useful methods for getting nulls, categorical values and other information about data model

#### Null Values

To get the nulls per field in pandas, run the following code:

    Total absolutes --> df.isnull().sum()

    Total in % --> df.isnull().sum()/len(df)*100

Both expressions return a pandas.core.series.Series indexed by column name.
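A short sketch on a made-up data frame shows both results. It assumes the percentage is meant over all rows, so it uses isnull().mean(), which is the null count divided by the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" has 2 nulls, "b" has 1 null, "c" has none
nulls_df = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan],
                         "b": ["x", "y", None, "z"],
                         "c": [1, 2, 3, 4]})

# Absolute number of nulls per column
null_counts = nulls_df.isnull().sum()

# Percentage of nulls per column, relative to the total number of rows
null_percent = nulls_df.isnull().mean() * 100

print(null_counts)
print(null_percent)
```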

In [4]:
# Get the columns with null values as a list, to reuse them in later code
def getColumnsWithNulls(df):
    df_nulls = df.isnull().sum()
    # Percentage over all rows (df.count() would only count the non-null rows)
    df_nulls_perc = df.isnull().sum() / len(df) * 100

    ### The next two lines produce the same list:
    null_columns = df_nulls[df_nulls > 0].index.values.tolist()
    null_columns = [col for col in df.columns if df[col].isnull().any()]

    ### Return the null counts, the null percentages and the names of the columns with nulls
    return [df_nulls, df_nulls_perc, null_columns]

def numberofrecordswithnulls(df):
    ### This function returns the number of records that contain at least one null value
    return df.shape[0] - df.dropna().shape[0]

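# Names of the columns that contain at least one missing value
def cols_with_missing(df):
    return [col for col in df.columns if df[col].isnull().any()]

# Drop the columns with missing values (apply to both training and validation data)
def reduced_df_missing_values(df):
    return df.drop(cols_with_missing(df), axis=1)

### Add a boolean "<column>_was_missing" indicator for each column that has nulls
def cols_that_were_missing(df):
    df_plus = df.copy()  # make a copy to avoid changing the original data
    for col in cols_with_missing(df):
        df_plus[col + '_was_missing'] = df[col].isnull()
    return df_plus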

The following code removes the records that have no value in the target column, since these records cannot be used to train or evaluate a model.

In [7]:
def remove_no_target_values(df, target_col):
    # Remove the rows with a missing target value
    return df.dropna(axis=0, subset=[target_col])

The concat function has two options: concatenating new records (another data set), or concatenating new columns onto the existing data set.

This is very useful when we split off different columns to clean them in different ways, such as scaling variables, categorisation, imputing...

- pd.concat([df_col_1n, df_col_2n], axis=1)

axis=1 concatenates column-wise (new columns). To concatenate rows, use axis=0.
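Both directions can be sketched with small made-up frames. The names left, right and more_rows are illustrative only.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": [3, 4]})
more_rows = pd.DataFrame({"a": [5, 6]})

# axis=1: place the frames side by side (new columns), aligned on the index
wide = pd.concat([left, right], axis=1)

# axis=0: stack the frames (new rows); ignore_index renumbers the rows 0..3
long = pd.concat([left, more_rows], axis=0, ignore_index=True)

print(wide.shape)  # (2, 2)
print(long.shape)  # (4, 1)
```

Because axis=1 aligns rows on the index, frames that were cleaned separately should keep the same index before they are joined back together.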

## Scikit Categorical Values
This section gives a high-level overview of how to work with categorical values in pandas and the scikit-learn library.

In [None]:
## Get List of Categorical values of a dataframe:
def get_categorical(df):
    return  [col for col in df.columns if df[col].dtype == "object"]

### Drop all Categorical Values
def drop_categorical_columns(df):
    return df.select_dtypes(exclude=['object'])

### Label Encoder
LabelEncoder is useful for converting categorical values into numbers:

    from sklearn.preprocessing import LabelEncoder
    
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

In [7]:
### Clean Categorical variables using Label encoder:
from sklearn.preprocessing import LabelEncoder

def label_categorical_values_df(df):
    # Make copy to avoid changing original data 
    label_df = df.copy()
    object_cols = get_categorical(df)  # helper defined in the previous cell
    # Apply label encoder to each column with categorical data
    label_encoder = LabelEncoder()
    for col in object_cols:
        label_df[col] = label_encoder.fit_transform(df[col])
    return label_df
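To see what the encoder does to a single column, here is a small sketch on a made-up list of colours. LabelEncoder sorts the distinct values and assigns each one its position in that sorted order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical values
colors = ["red", "green", "blue", "green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the sorted distinct values; each code is an index into it
print(list(encoder.classes_))  # ['blue', 'green', 'red']
print(codes.tolist())          # [2, 1, 0, 1]

# inverse_transform maps the codes back to the original labels
decoded = encoder.inverse_transform(codes)
```

The resulting numbers imply an order (blue < green < red) that may not exist in the data, which is one reason to consider one-hot encoding for unordered categories.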

### One Hot Encoder

Creates one new column for each distinct value of a categorical column. The values in these new columns are 0 or 1:

    from sklearn.preprocessing import OneHotEncoder

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

There are a number of parameters that can be used to customize its behavior.

- We set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't represented in the training data.

- Setting sparse=False ensures that the encoded columns are returned as a numpy array (instead of a sparse matrix). From scikit-learn 1.2 onwards this parameter is named sparse_output.

To use the encoder, we supply only the categorical columns that we want to be one-hot encoded. 
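The effect of handle_unknown='ignore' can be sketched on made-up data where the validation set contains a category that never appears in training. To stay independent of the scikit-learn version, this example keeps the default sparse output and converts it with .toarray().

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column; "purple" only appears in the validation data
train_cat = pd.DataFrame({"color": ["red", "green", "red"]})
valid_cat = pd.DataFrame({"color": ["green", "purple"]})

oh_encoder = OneHotEncoder(handle_unknown="ignore")
oh_encoder.fit(train_cat)

# One output column per category seen during fit: ['green', 'red']
print(list(oh_encoder.categories_[0]))

# The unseen category "purple" becomes a row of zeros instead of an error
valid_encoded = oh_encoder.transform(valid_cat).toarray()
print(valid_encoded)
```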

In [10]:
## All categorical columns
## This method reads the training and validation sets and checks whether each categorical column has the same set of
## values in both. If it doesn't, predictions can fail: for example, if a value in the validation set does not exist in
## the training set, the model won't know how to handle it.

def read_categorical_columns_and_compare(X_train,X_valid):
    object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
    # Columns that can be safely label encoded
    good_label_cols = [col for col in object_cols if 
                       set(X_train[col]) == set(X_valid[col])]
    # Problematic columns that will be dropped from the dataset
    bad_label_cols = list(set(object_cols)-set(good_label_cols))
    print('Categorical columns that will be label encoded:', good_label_cols)
    print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)
    return [good_label_cols, bad_label_cols]



Cardinality: the number of distinct categories in each categorical column

In [12]:
# Get number of unique entries in each column with categorical data
def categories_by_column(df):
    object_cols = [col for col in df.columns if df[col].dtype == 'object']
    object_nunique = list(map(lambda col: df[col].nunique(), object_cols))
    d = dict(zip(object_cols, object_nunique))
    return sorted(d.items(), key=lambda x: x[1])

In [13]:
# Columns that will be one-hot encoded
def low_and_high_cardinities(df, threshold=10):
    object_cols = [col for col in df.columns if df[col].dtype == 'object']
    low_cardinality_cols = [col for col in object_cols if df[col].nunique() < threshold]
    # Columns that will be dropped from the dataset
    high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
    print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
    print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)
    return [low_cardinality_cols, high_cardinality_cols]


#### Low Cardinality Cols direct code:

    low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and X_train_full[cname].dtype == "object"]

## Pipelines
### Bundle Preprocessing steps & validation

The Pipeline class, also part of the sklearn library, helps bundle together several data-cleaning steps. Example from Kaggle:

In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def pipeline(df):
    # Select the columns by dtype: all numeric dtypes versus object (string) columns
    numerical_cols = df.select_dtypes(include='number').columns.tolist()
    categorical_cols = [col for col in df.columns if df[col].dtype == 'object']
    # Preprocessing for numerical data
    numerical_transformer = SimpleImputer(strategy='constant')

    # Preprocessing for categorical data
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    # Bundle preprocessing for numerical and categorical data
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)
        ])
    return preprocessor

We need to provide a model and the preprocessor object to a pipeline, for example:

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

##model = RandomForestRegressor(n_estimators=100, random_state=0)

def createPipeline(preprocessor, model):
    my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])
    return my_pipeline

# Preprocessing of training data, fit model 
##my_pipeline.fit(X_train, y_train)
# Preprocessing of validation data, get predictions
##preds = my_pipeline.predict(X_valid)
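Put together, a preprocessor and a model can be fitted and used in a few lines. The sketch below runs on a made-up data frame with one numerical and one categorical column. The column names, the median strategy and the small forest are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with a missing value in each column
X_toy = pd.DataFrame({
    "size": [50.0, 80.0, np.nan, 120.0, 65.0, 95.0],
    "city": ["A", "B", "A", np.nan, "B", "A"],
})
y_toy = pd.Series([100.0, 160.0, 130.0, 240.0, 125.0, 190.0])

# Numerical columns: impute only. Categorical columns: impute, then one-hot encode
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["size"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city"]),
])

toy_pipeline = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# fit() preprocesses the data and trains the model in one call;
# predict() applies the same fitted preprocessing before predicting
toy_pipeline.fit(X_toy, y_toy)
toy_preds = toy_pipeline.predict(X_toy)
print(toy_preds)
```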



## Cross Validation
### Technique to compare models without relying on a single train/validation split

Runs the 80-20 split several times on different subsets (folds) of the data and compares the results. Recommended for small data sets, where the extra computation time is not a concern. Combining it with pipelines is recommended. Example from Kaggle:

In [7]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

def getPipeline():
    my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                                  ('model', RandomForestRegressor(n_estimators=50,
                                                                  random_state=0))
                                 ])
    return my_pipeline


def getCrossvalidation(pipeline, X, y):
    # Multiply by -1 since sklearn calculates *negative* MAE
    scores = -1 * cross_val_score(pipeline, X, y,
                                  cv=5,
                                  scoring='neg_mean_absolute_error')

    return scores
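A minimal run on synthetic data shows the shape of the result: one MAE value per fold. The data, the random seed and the small forest are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical regression data: the target depends mostly on the first feature
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(100, 3))
y_cv = 2.0 * X_cv[:, 0] + rng.normal(scale=0.1, size=100)

cv_pipeline = Pipeline(steps=[
    ("preprocessor", SimpleImputer()),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# cv=5 trains and evaluates the pipeline five times, each on a different fold;
# multiplying by -1 turns sklearn's negative MAE into a positive MAE
cv_scores = -1 * cross_val_score(cv_pipeline, X_cv, y_cv, cv=5,
                                 scoring="neg_mean_absolute_error")

print("MAE per fold:", cv_scores)
print("Average MAE:", cv_scores.mean())
```

Averaging the per-fold scores gives a single number for comparing alternative models or parameter settings.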


## XGBoost
### State-of-the-Art Algorithm - Ensemble Method

XGBoost is an iterative ensemble algorithm. The image below shows roughly how it works: it starts with a single, fairly naive model, and new models are then trained and added to the ensemble one at a time:

![image.png](attachment:image.png)

In this example, you'll work with the XGBoost library. XGBoost stands for extreme gradient boosting, which is an implementation of gradient boosting with several additional features focused on performance and speed.

    from xgboost import XGBRegressor

    my_model = XGBRegressor()
    my_model.fit(X_train, y_train)
    
The library's documentation is at:
https://xgboost.readthedocs.io/en/latest/build.html

![image.png](attachment:image.png)

### Parameter Tuning
XGBoost has a few parameters that can dramatically affect accuracy and training speed. The first parameters you should understand are:

- n_estimators: specifies how many times to go through the modeling cycle described above. It is equal to the number of models that we include in the ensemble.

Too low a value causes underfitting, which leads to inaccurate predictions on both training data and test data.

Too high a value causes overfitting, which causes accurate predictions on training data, but inaccurate predictions on test data (which is what we care about).

Typical values range from 100-1000, though this depends a lot on the learning_rate

### Stopping rounds
early_stopping_rounds offers a way to automatically find the ideal value for n_estimators. Early stopping causes the model to stop iterating when the validation score stops improving, even if we aren't at the hard stop for n_estimators. It's smart to set a high value for n_estimators and then use early_stopping_rounds to find the optimal time to stop iterating.

Since random chance sometimes causes a single round where validation scores don't improve, you need to specify a number for how many rounds of straight deterioration to allow before stopping. Setting early_stopping_rounds=5 is a reasonable choice. In this case, we stop after 5 straight rounds of deteriorating validation scores.

When using early_stopping_rounds, you also need to set aside some data for calculating the validation scores - this is done by setting the eval_set parameter.

### Learning Rate

Instead of getting predictions by simply adding up the predictions from each component model, we can multiply the predictions from each model by a small number (known as the learning rate) before adding them in.

This means each tree we add to the ensemble helps us less. So, we can set a higher value for n_estimators without overfitting. If we use early stopping, the appropriate number of trees will be determined automatically.

In general, a small learning rate and a large number of estimators will yield more accurate XGBoost models, though the model will also take longer to train since it does more iterations through the cycle. Older XGBRegressor releases defaulted to learning_rate=0.1, while the core library's default is 0.3, so it is safer to set the value explicitly.
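The role of the learning rate can be sketched with a hand-written boosting loop. This is plain gradient boosting for squared error built from scikit-learn decision trees, not XGBoost itself, and the data and parameter values are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: a noisy sine curve
rng = np.random.default_rng(0)
X_gb = rng.uniform(-3, 3, size=(200, 1))
y_gb = np.sin(X_gb[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_estimators = 50

# Start from a naive model that always predicts the mean of the target
prediction = np.full_like(y_gb, y_gb.mean())
initial_mse = np.mean((y_gb - prediction) ** 2)

for _ in range(n_estimators):
    # Each new tree is trained on the errors of the current ensemble
    residuals = y_gb - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_gb, residuals)
    # Only a fraction (the learning rate) of the tree's prediction is added
    prediction += learning_rate * tree.predict(X_gb)

final_mse = np.mean((y_gb - prediction) ** 2)
print(initial_mse, final_mse)
```

Each tree contributes only a tenth of its correction here, so more trees are needed, but no single tree can dominate the ensemble.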

### Jobs
On larger datasets where runtime is a consideration, you can use parallelism to build your models faster. It's common to set the parameter n_jobs equal to the number of cores on your machine. On smaller datasets, this won't help.

The resulting model won't be any better, so micro-optimizing for fitting time is typically nothing but a distraction. But, it's useful in large datasets where you would otherwise spend a long time waiting during the fit command.



In [12]:
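## pip install xgboost
from xgboost import XGBRegressor

def XGBoost_Example(X_train, y_train, X_valid, y_valid, rounds=5, estimators=500, rate=0.05, jobs=4):
    # early_stopping_rounds is passed to the constructor, as XGBoost >= 1.6 expects;
    # older releases took it as an argument of fit() instead
    my_model = XGBRegressor(n_estimators=estimators, learning_rate=rate, n_jobs=jobs,
                            early_stopping_rounds=rounds)
    my_model.fit(X_train, y_train,
                 eval_set=[(X_valid, y_valid)],
                 verbose=False)
    # Return the fitted model so the caller can use it for predictions
    return my_model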
