# Introduction

This is a notebook for storing useful utility functions and useful copy/paste cells for various purposes.

So far, there are:
- A function that displays the information of each feature of a dataframe.
- A function that trains all ML models at once and print out the results along with returning them as a list.
- A cell that installs all various machine learning frameworks.

In [59]:
# Import libraries
import pandas as pd
import numpy as np

# Display Information of a DataFrame

Displays the following information of each feature of a dataframe:
- Total rows and columns
- Column Name
- dtype
- Number of missing rows
- Number of unique values
- Minimum value
- Maximum value
- Mean
- Median

In [5]:
# Check to make sure that the function name is not a reserved keyword
import keyword
keyword.iskeyword('scan')

False

In [40]:
# A function for displaying crucial information about a dataframe all in one go
def scan(df):
    '''
    Displays the following information of each feature in a dataframe:
    - Total rows and columns
    - Column Name
    - dtype
    - Number of missing rows
    - Number of unique values
    - Minimum value
    - Maximum value
    - Mean
    - Median
    '''
    print(f'Rows: {len(df)}')
    print(f'Columns: {len(df.columns)}')
    summary = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summary['Missing'] = df.isna().sum()
    summary['Uniques'] = df.nunique()
    summary['Min'] = df.min(numeric_only=True)
    summary['Max'] = df.max(numeric_only=True)
    summary['Mean'] = df.mean(numeric_only=True)
    summary['Median'] = df.median(numeric_only=True)
    
    return summary

In [80]:
# Import dummy data
data = pd.read_csv('data/machine failure.csv')

scan(data)

Rows: 10000
Columns: 14


Unnamed: 0,dtypes,Missing,Uniques,Min,Max,Mean,Median
UDI,int64,0,10000,1.0,10000.0,5000.5,5000.5
Product ID,object,0,10000,,,,
Type,object,0,3,,,,
Air temperature [K],float64,0,93,295.3,304.5,300.00493,300.1
Process temperature [K],float64,0,82,305.7,313.8,310.00556,310.1
Rotational speed [rpm],int64,0,941,1168.0,2886.0,1538.7761,1503.0
Torque [Nm],float64,0,577,3.8,76.6,39.98691,40.1
Tool wear [min],int64,0,246,0.0,253.0,107.951,108.0
Machine failure,int64,0,2,0.0,1.0,0.0339,0.0
TWF,int64,0,2,0.0,1.0,0.0046,0.0


# Score Multiple Given Models In One Go

Fit and score multiple machine learning models all in one go, print out the progress, and return a list of scores.

## Use Cases

You can use it for various cases, such as:
- Quickly compare multiple models to decide which ones you want to continue working with.
- See how the score might improve on various model based on the features you've engineered.

In [44]:
# Check to make sure that the function name is not a reserved keyword
import keyword
keyword.iskeyword('scorebulk')

False

In [74]:
from sklearn.model_selection import cross_val_score, StratifiedKFold

def scorebulk(X, y, models, scoring='accuracy', cv=5, random_state=22):
    print(f'Evaluating {len(models)} models...')
    print(f'Scoring: {scoring}')
    print(f'Cross-validation folds: {cv}')
    print(f'================================')
    scores = {}
    skf = StratifiedKFold(n_splits=cv)
    for model_name, model in models.items():
        print(f'Evaluating {model_name}...', end='')
        scores[model_name] = np.mean(cross_val_score(model, X, y, scoring=scoring, cv=skf, n_jobs=-1))
        print('Done.')
    
    return scores

In [53]:
import sys
!conda install -q --yes --prefix {sys.prefix} -c conda-forge catboost xgboost lightgbm

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [79]:
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
# from sklearn.ensemble import VotingClassifier

models = {'RandomForestClassifier': RandomForestClassifier(),
          'GradientBoostingClassifier': GradientBoostingClassifier(),
          'AdaBoostClassifier': AdaBoostClassifier(),
          'XGBClassifier': XGBClassifier(),
          'CatBoostClassifier': CatBoostClassifier(silent=True),
          'LGBMClassifier': LGBMClassifier(),
          'HistGradientBoostingClassifier': HistGradientBoostingClassifier(),
#           'VotingClassifier': VotingClassifier()
         }

data = pd.read_csv('data/machine failure.csv')
data = data.drop('Product ID', axis=1).drop('Type', axis=1).drop('UDI', axis=1)
data.columns = data.columns.str.replace('[\[\]\<\>]', '', regex=True)

X = data.drop('Machine failure', axis=1)
y = data['Machine failure']

results = scorebulk(X, y, models, scoring='roc_auc', cv=10)
results

Evaluating 7 models...
Scoring: roc_auc
Cross-validation folds: 10
Evaluating RandomForestClassifier...Done.
Evaluating GradientBoostingClassifier...Done.
Evaluating AdaBoostClassifier...Done.
Evaluating XGBClassifier...Done.
Evaluating CatBoostClassifier...Done.
Evaluating LGBMClassifier...Done.
Evaluating HistGradientBoostingClassifier...Done.


{'RandomForestClassifier': 0.986448057483863,
 'GradientBoostingClassifier': 0.9880253171364315,
 'AdaBoostClassifier': 0.9884149311898673,
 'XGBClassifier': 0.9910851297040555,
 'CatBoostClassifier': 0.9930215564486664,
 'LGBMClassifier': 0.9911795152843746,
 'HistGradientBoostingClassifier': 0.9888335769090244}

# Install Machine Learning Frameworks

The following cell will install various machine learning frameworks to the current environment the Jupyter Notebook instance is running on. It will skip the ones you already have installed.

In [65]:
import sys
!conda install -q --yes --prefix {sys.prefix} -c conda-forge catboost xgboost lightgbm

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

