<a href="https://colab.research.google.com/github/BHouwens/MLUtilityBelt/blob/main/UtilityBelt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Utility Belt

This notebook collects basic utility functions for data preparation and cleanup in ML projects. They are exported to an importable module using `fire` and `notebook2script.py`, following the approach used in fastai's course material. The functions appear in no particular order.

## Drop Rare Features

In [None]:
# export

def drop_rare_features(df, thres=.95):
  """
  Drops non-categorical features whose most frequent value makes up
  at least `thres` of the rows, since the remaining events are
  too rare to be statistically useful
  """
  rare_f = []
  # Columns with the pandas 'category' dtype are skipped
  cat_cols = set(df.select_dtypes(include='category').columns)
  print("Running with threshold", thres)
  print("")

  for col in df.columns:
    if col not in cat_cols:
      freq = df[col].value_counts(normalize=True)
      # An all-NaN column has no value counts to compare
      if freq.empty:
        continue

      # should be enough to check whether the most freq is dominant
      if freq.iloc[0] >= thres:
        rare_f.append(col)
        # Columns that will be dropped are printed in red
        print("\x1b[31m{c}: {v}\x1b[0m".format(c=col, v=freq.iloc[0]))
      else:
        print("{c}: {v}".format(c=col, v=freq.iloc[0]))
  
  df = df.drop(rare_f, axis=1)
  return df
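
The dominance check can be tried on a toy frame. This is a minimal sketch, not the function above: the column names `flag` and `amount` are made up for illustration, and it only reproduces the "share of the most frequent value" comparison.

```python
import pandas as pd

# Toy frame: "flag" is dominated by a single value, "amount" is not
df = pd.DataFrame({
    "flag": [0] * 49 + [1],
    "amount": list(range(50)),
})

thres = 0.95
# Share of rows taken by each column's most frequent value
# (value_counts sorts by frequency, so the first entry is the largest)
top_share = {col: df[col].value_counts(normalize=True).iloc[0] for col in df.columns}
dominated = [col for col, share in top_share.items() if share >= thres]

print(top_share)
print(dominated)
```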

## Roulette Test

In [None]:
import statsmodels.api as sm

ALPHA = .05

def get_relevant_duplicates(df, dep_var):
  """
  Gets number of duplicates with differing targets
  """
  ind_var = [c for c in df.columns if c != dep_var]

  # We need to sift out full duplicate rows
  poss_dups = df.duplicated(ind_var).sum()
  full_dups = df.duplicated().sum()
  return poss_dups - full_dups

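def roulette_test(df, dep_var, threshold):
  """
  Performs a one-sided 1-proportion z-test on df to check for roulette.
  `threshold` is the acceptable proportion of rows (between 0 and 1)
  that repeat another row's features with a differing target.
  The alternative hypothesis is that the true proportion is lower than
  `threshold`. Returns True when that cannot be concluded at
  significance level ALPHA, i.e. df may be a roulette dataset
  """
  relevant_dups = get_relevant_duplicates(df, dep_var)
  if relevant_dups == 0: return False

  # `value` is the proportion under the null hypothesis, not a row count
  _, p_val = sm.stats.proportions_ztest(
      relevant_dups, len(df), value=threshold, alternative='smaller')

  return p_val > ALPHA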
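
The two steps above can be sketched without `statsmodels`. The toy columns `a`, `b`, `y` and the helper `one_sided_ztest` are hypothetical. The helper is a hand-rolled version of a one-sided 1-proportion z-test that, like the `statsmodels` default, takes the variance from the sample proportion; it is meant to show the arithmetic, not to replace the library call.

```python
import math
import pandas as pd

# Rows 0 and 1 share features but differ in target (a conflicting duplicate);
# rows 2 and 3 are exact copies of each other (a harmless full duplicate)
df = pd.DataFrame({
    "a": [1, 1, 2, 2, 3],
    "b": [5, 5, 6, 6, 7],
    "y": [0, 1, 1, 1, 0],
})
features = ["a", "b"]
conflicting = int(df.duplicated(features).sum() - df.duplicated().sum())

def one_sided_ztest(count, nobs, p0):
    """z statistic and p-value for the alternative 'proportion < p0'."""
    p_hat = count / nobs
    se = math.sqrt(p_hat * (1 - p_hat) / nobs)
    z = (p_hat - p0) / se
    # Standard normal CDF evaluated at z, via the error function
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, p_value

# 5 conflicting duplicates in 100 rows, tested against a 10% threshold
z, p_value = one_sided_ztest(count=5, nobs=100, p0=0.10)
print(conflicting, round(z, 3), p_value < 0.05)
```

With these made-up numbers the p-value falls below 0.05, so the proportion is judged to be below the threshold and a test like `roulette_test` would return False.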

## Target Encoding

In [None]:
# export

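def target_encode(df, target, categorical_features, aggregators=("mean", "std")):
  """
  Adds one column per aggregator and categorical feature, holding the
  aggregated target value for each row's category
  """
  for feature in categorical_features:
    # A list keeps the new columns in a predictable order (a set does not)
    aggregation = df.groupby(feature)[target].agg(list(aggregators))
    aggregation.columns = [column + "_per_{}_{}".format(feature, target) for column in aggregation.columns.tolist()]

    # Left merge keeps every row of df and attaches its category's statistics
    df = df.merge(aggregation, how="left", on=feature)

  return df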
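
A small worked example of the group-aggregate-merge pattern, using a made-up `city`/`price` frame. It is a sketch of the same idea rather than a call to the function above, and it joins on the aggregation's index explicitly.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Per-category statistics of the target
stats = df.groupby("city")["price"].agg(["mean", "std"])
stats.columns = ["{}_per_city_price".format(c) for c in stats.columns]

# Attach each row's category statistics; all rows of df are kept
encoded = df.merge(stats, how="left", left_on="city", right_index=True)
print(encoded)
```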

## GAF Image Construction

In [None]:
# export

!pip install -Uqq pyts

import os
import sys
import math
import numpy as np
from PIL import Image
from pyts.image import GramianAngularField
import matplotlib.pyplot as plt

class GAFConstructor:
    """
    A constructor for GAF images from input DataFrames
    """
    def __init__(self, df, historical_points):
      self.df = df
      self.column_count = len(df.columns)
      self.historical_points = historical_points
      # Window length and side length (in pixels) of each per-column GAF tile
      self.factor = 15
      self.img_dim = self.get_img_dimension()
    
    def build(self, img_stem, path_to_save):
      """
      Main callable function, constructing a GAF image from the 
      constructor input DataFrame
      """
      if not self.df.isnull().values.any():
        self.build_gaf_composites(img_stem, path_to_save)
        self.build_single_img(img_stem, path_to_save)
      else:
        print("DataFrame contains null values. Cannot construct GAF images")
    
    def get_img_dimension(self):
      """
      Side length of the square canvas that fits one tile per column
      """
      next_square = self.next_perfect_square()
      square_root = math.sqrt(next_square)
      return int(self.factor * square_root)
    
    def next_perfect_square(self):
      if not self.is_perfect_square():
        next_n = math.floor(math.sqrt(self.column_count)) + 1
        return next_n * next_n
      return self.column_count
    
    def is_perfect_square(self):
      sqrt = math.sqrt(self.column_count)
      return int(sqrt) * int(sqrt) == self.column_count
    
    def get_transform(self, series, transformer):
      """
      Creates and returns a GAF transformation of the input series
      """
      gaf = []

      # Sliding windows of length self.factor, selected by position
      for x in range(0, self.historical_points - self.factor):
          gaf.append(series.iloc[x: x + self.factor].to_numpy())
      
      gaf = np.array(gaf)
      return transformer.transform(gaf)
    
    def build_gaf_composites(self, img_stem, path_to_save):
      transformer = GramianAngularField(image_size=self.factor, sample_range=(-1, 1), method='summation', overlapping=False)
      imgs = {}

      for col in self.df.columns:
        imgs[col] = self.get_transform(self.df[col], transformer)[0]
      
      for col in self.df.columns:
        plt.imsave('{p}/{f}.png'.format(p=path_to_save, f="{i}_{c}".format(i=img_stem, c=col)), imgs[col])
    
    def build_single_img(self, img_stem, path_to_save):
      """
      Constructs a single image from input GAF files
      """
      img_paths = ["{p}/{f}.png".format(p=path_to_save, f="{i}_{c}".format(i=img_stem, c=col)) for col in self.df.columns]
      images = [Image.open(x) for x in img_paths]

      new_im = Image.new('RGB', (self.img_dim, self.img_dim))
      x_offset = 0
      y_offset = 0

      for im in images:
        if x_offset < self.img_dim:
          new_im.paste(im, (x_offset,y_offset))
          x_offset += self.factor
        else:
          y_offset += self.factor
          new_im.paste(im, (0,y_offset))
          x_offset = self.factor

      new_im.save('{p}/{s}.jpg'.format(p=path_to_save, s=img_stem))
      for f in img_paths:
        os.remove(f)  
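
The two calculations behind the class can be sketched in plain NumPy: the summation field for one window, and the canvas size for the tile grid. The helpers `gasf` and `grid_side` are hypothetical names. `gasf` follows the textbook definition (rescale to [-1, 1], take `arccos`, then `cos(phi_i + phi_j)`), assumes a non-constant window, and is not claimed to reproduce `pyts` output exactly.

```python
import math
import numpy as np

def gasf(window):
    """Gramian Angular Summation Field of a 1-D, non-constant window."""
    x = np.asarray(window, dtype=float)
    # Rescale to [-1, 1] so that arccos is defined for every value
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1
    phi = np.arccos(np.clip(x, -1.0, 1.0))
    # Entry (i, j) is cos(phi_i + phi_j)
    return np.cos(phi[:, None] + phi[None, :])

def grid_side(column_count):
    """Tiles per row of the smallest square grid that fits column_count tiles."""
    root = math.isqrt(column_count)
    return root if root * root == column_count else root + 1

factor = 15                       # window length and tile side, as in the class
tile = gasf(np.arange(factor))    # one 15 x 15 tile
side = grid_side(10)              # 10 columns need a 4 x 4 grid
canvas_px = side * factor         # canvas side in pixels
print(tile.shape, side, canvas_px)
```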


# Export

In [None]:
!pip install fire

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fire
  Downloading fire-0.4.0.tar.gz (87 kB)
Building wheels for collected packages: fire
  Building wheel for fire (setup.py) ... done
  Created wheel for fire: filename=fire-0.4.0-py2.py3-none-any.whl size=115942 sha256=c6e4c289471213c0baef9d816554d699e66505825cf67c31eff8b4c8b99a9577
  Stored in directory: /root/.cache/pip/wheels/8a/67/fb/2e8a12fa16661b9d5af1f654bd199366799740a85c64981226
Successfully built fire
Installing collected packages: fire
Successfully installed fire-0.4.0


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

%cd gdrive/My Drive/

Mounted at /content/gdrive
/content/gdrive/My Drive


In [None]:
!python notebook2script.py "Colab Notebooks/UtilityBelt.ipynb"

Converted Colab Notebooks/UtilityBelt.ipynb to Colab Notebooks/exp/nb_UtilityBelt.py
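
For readers unfamiliar with the export step, the sketch below shows one way such a script could pick cells. It is not the actual `notebook2script.py`: it assumes the script keeps code cells whose first line is an `# export` comment, as the tagged cells in this notebook suggest, and `exported_sources` is a hypothetical helper.

```python
import re

# First line of an exported cell: "# export", ignoring case and extra spaces
EXPORT_TAG = re.compile(r"^\s*#\s*export\s*$", re.IGNORECASE)

def exported_sources(notebook):
    """Return the source of every code cell whose first line is an export tag."""
    sources = []
    for cell in notebook["cells"]:
        lines = cell["source"]
        # .ipynb files store a cell's source as a list of lines or one string
        if isinstance(lines, str):
            lines = lines.splitlines(keepends=True)
        if cell["cell_type"] == "code" and lines and EXPORT_TAG.match(lines[0]):
            sources.append("".join(lines[1:]).strip())
    return sources

# Minimal in-memory notebook in the .ipynb JSON layout
notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["## Target Encoding"]},
        {"cell_type": "code", "source": ["# export\n", "\n", "def f():\n", "  return 1"]},
        {"cell_type": "code", "source": ["print('scratch work')"]},
    ]
}

print(exported_sources(notebook))
```

Under that assumption, only cells tagged this way would reach the generated module, and untagged cells would stay in the notebook.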
