<a href="https://colab.research.google.com/github/Hellblazer99/AutoDateRecognition/blob/main/AutoDateTagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automatic Date Tagging Model #
### This notebook implements a set of functions that automatically identify the columns of a dataset that contain dates and convert them to a machine-readable format for further processing. <br> It was a course project for [NLP by HSE University on Coursera](https://www.coursera.org/learn/language-processing) ###
The notebook is divided into two sections:


## 1. Select columns containing dates ##
> Here we use a simple single-layer LSTM for binary classification of data points as `date` or `not a date`. 25% of the data points in each column are sampled and passed to it. If the network classifies more than 70% of them as dates, the column is marked as a date column.

## 2. Converting dates into YYYY-MM-DD format ##
> Here we use an LSTM model with attention to recognise dates in various formats and convert them into a standard one. It was trained on 10,000 dates of various formats for 50 epochs. It generally achieves 100% accuracy. <br>
> This is the same architecture that is used for Neural Machine Translation. It comes from a course project of mine (Andrew Ng's Deep Learning course on Coursera), which I modified slightly and included here.


## Before starting, you need to download this [zip file](https://drive.google.com/file/d/1X950Oh-0NOFyEHMxJUySREe3wgsTsOJA/view?usp=sharing), unzip it, and place its contents in your runtime at `/content/`. It contains the following files:


*   `nmt_utils.py`: Contains several helper functions for the date recognizer model. It also generates dates in various formats using the `faker` module
*   `dataset.csv`: This dataset was generated using the above file and other random data from nltk. It consists of 6 features and 4000 rows. Of the 6 features, 3 are date features
*   `class_model.json`: This is the LSTM model which we'll use to classify columns as **date** or **non-date** features
*   `class_model.h5`: These are the weights for the above model, which we'll load to avoid retraining it
*   `tokenizer.pkl`: This pickle file contains the tokenizer we'll use for feature classification
*   `nmt_weights.h5`: This file contains the weights of our Neural Machine Translation model (LSTM with attention), which we'll use to recognise dates. We'll load these weights to save time


We start by installing the required packages and importing everything we need

In [None]:
!pip install tensorflow==1.2.1
!pip install keras==2.0.7
!pip install faker



In [None]:
from keras.models import load_model, Model, model_from_json
from keras.layers import Bidirectional, Concatenate, Permute, Dot, Input, LSTM, Multiply
from keras.layers import RepeatVector, Dense, Activation, Lambda
from keras.optimizers import Adam
from keras.utils import to_categorical
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import random
import pickle
import itertools
from datetime import date
from keras.preprocessing import sequence
from faker import Faker
from nmt_utils import *

Using TensorFlow backend.


In [None]:
# This is our dataset of 6 columns and 4000 rows
df = pd.read_csv("dataset.csv", index_col=0)
df.tail()

Unnamed: 0,Sentences,Date_1,Words,Date_2,Numbers,Date_3
3995,It might be that a long interval would elapse ...,30 oct 1986,associating,saturday october 11 2014,-1734,saturday december 26 1981
3996,During that long interval Starbuck would ever ...,thursday october 3 1996,grown,august 3 1993,6657,january 12 1975
3997,"Not only that , but the subtle insanity of Aha...",tuesday july 31 1984,practiced,tuesday june 24 2014,-6193,monday may 12 2008
3998,For however eagerly and impetuously the savage...,thursday october 6 2011,valueless,5 mar 1996,-8931,9 mar 1989
3999,Nor was Ahab unmindful of another thing .,15 march 2005,Whether,september 16 2016,-494,wednesday march 20 2019


## Selecting columns containing dates ##

---
The function below samples $25\%$ of the data points from each column and passes them through the date classifier model. If more than $70\%$ of the sampled points are classified as **dates**, the column is classified as a **date column**

In [None]:
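def select_date_cols(df):
  features = df.columns
  date_classifier = load_date_classifier()
  tokenizer = load_tokenizer()

  # Sample 25% of the rows from each column
  TEST_SIZE = int(len(df.index)/4)
  # Sequences are padded or truncated to this length
  MAX_SEQ_LEN = 30
  # A column needs more than 70% positive votes to count as a date column
  THRESH = int(0.7*TEST_SIZE)

  dates = []
  for col in features:
    # Cast to str so that numeric columns can be tokenized as well
    test_sample = random.sample(list(df[col].astype(str)), TEST_SIZE)
    sequences = tokenizer.texts_to_sequences(test_sample)
    seq_matrix = sequence.pad_sequences(sequences, maxlen=MAX_SEQ_LEN)
    preds = date_classifier.predict_classes(seq_matrix)
    preds = preds.tolist()
    # Each prediction is a one-element list, so positive votes appear as [1]
    if preds.count([1]) > THRESH:
      dates.append(col)

  print("\nDate Columns: ", dates)
  return dates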
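
The sampling and voting rule used by `select_date_cols` can be tried in isolation, without loading the LSTM. The sketch below is only an illustration: `looks_like_date` is a hypothetical regex stand-in for the trained classifier, and `is_date_column`, `sample_frac` and `vote_frac` are names introduced here, not part of the notebook.

```python
import random
import re

# Hypothetical stand-in for the LSTM classifier: votes 1 if the string
# contains a four-digit year starting with 19 or 20, otherwise 0.
YEAR = re.compile(r"\b(19|20)\d{2}\b")


def looks_like_date(value):
    return 1 if YEAR.search(value) else 0


def is_date_column(values, sample_frac=0.25, vote_frac=0.7, seed=0):
    """Sample a fraction of the column and apply the vote threshold."""
    rng = random.Random(seed)
    sample_size = int(len(values) * sample_frac)
    threshold = int(vote_frac * sample_size)
    sample = rng.sample(values, sample_size)
    votes = sum(looks_like_date(v) for v in sample)
    # Strictly more than the threshold, as in select_date_cols
    return votes > threshold


# Values taken from the dataset preview, repeated to form small columns
dates = ["30 oct 1986", "thursday october 3 1996", "15 march 2005", "9 mar 1989"] * 10
words = ["associating", "grown", "practiced", "valueless"] * 10

print(is_date_column(dates))  # True
print(is_date_column(words))  # False
```

Sampling keeps the check cheap on large columns, and the threshold leaves room for a few misclassified values in an otherwise clean date column.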

In [None]:
# This function loads the date classifier model
def load_date_classifier():
  date_classifierfile = open('class_model.json', 'r')
  date_classifier_json = date_classifierfile.read()
  date_classifierfile.close()
  date_classifier = model_from_json(date_classifier_json)
  date_classifier.load_weights('class_model.h5')
  print("Classifier model loaded")
  return date_classifier

In [None]:
# This function loads the tokenizer used in the model
def load_tokenizer():
  with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
    return tokenizer

In [None]:
# This function passes the dataset to select_date_cols and creates a separate dates dataset
def classify_date_cols(df):
  df.dropna(inplace=True)
  print("Columns containing dates")
  dates = select_date_cols(df)
  date_dic = {x:df[x] for x in dates}
  date_df = pd.DataFrame(date_dic)
  date_df.dropna(inplace=True)
  print("DataFrame of dates")
  print(date_df.head())
  return [date_df, dates]

In [None]:
date_dff, dates_l = classify_date_cols(df)

Columns containing dates
Classifier model loaded
Date Columns:  ['Date_1', 'Date_2', 'Date_3']
DataFrame of dates
                       Date_1  ...                     Date_3
0                 jun 26 2003  ...        sunday june 21 1998
1                 18 sep 1994  ...        tuesday june 5 2007
2              4 january 1983  ...    monday november 19 1984
3  wednesday february 13 1991  ...      saturday july 15 2000
4     thursday august 21 2003  ...  wednesday december 3 2014

[5 rows x 3 columns]


Here we have extracted the columns containing dates and created a separate dataframe. Run the cell below to see its head

In [None]:
date_dff.head()

Unnamed: 0,Date_1,Date_2,Date_3
0,jun 26 2003,sunday august 31 2008,sunday june 21 1998
1,18 sep 1994,4 nov 1999,tuesday june 5 2007
2,4 january 1983,15 november 1988,monday november 19 1984
3,wednesday february 13 1991,thursday march 24 1983,saturday july 15 2000
4,thursday august 21 2003,10 august 1985,wednesday december 3 2014


## Converting dates into YYYY-MM-DD format ##
---
Here we build the LSTM-with-attention model and load its pretrained weights so that it can be used in the following steps

In [None]:
# Declaring globals for model creation
m = len(date_dff.index)
Tx = 30
Ty = 10
n_a = 32 
n_s = 64 
s0 = np.zeros((m, n_s))
c0 = np.zeros((m, n_s))
human_vocab = {' ': 0, '.': 1, '/': 2, '0': 3, '1': 4, '2': 5, '3': 6, '4': 7, '5': 8, '6': 9, '7': 10, '8': 11, '9': 12, 'a': 13, 'b': 14,'c': 15,'d': 16,'e': 17,'f': 18, 'g': 19, 'h': 20, 'i': 21, 'j': 22, 'k': 23, 'l': 24, 'm': 25, 'n': 26, 'o': 27, 'p': 28, 'q': 29, 'r': 30, 's': 31, 't': 32, 'u': 33, 'v': 34, 'w': 35, 'x': 36, 'y': 37, '<unk>': 38, '<pad>': 39  }
machine_vocab = {'-': 0, '0': 1, '1': 2, '2': 3, '3': 4, '4': 5, '5': 6, '6': 7, '7': 8, '8': 9, '9': 10}
inv_machine_vocab = {0: '-', 1: '0', 2: '1', 3: '2', 4: '3', 5: '4', 6: '5', 7: '6', 8: '7', 9: '8', 10: '9'}
post_activation_LSTM_cell = LSTM(n_s, return_state = True)
output_layer = Dense(len(machine_vocab), activation=softmax)
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation = "tanh")
densor2 = Dense(1, activation = "relu")
activator = Activation(softmax, name='attention_weights')
dotor = Dot(axes = 1)
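
As a small aside, `inv_machine_vocab` is simply `machine_vocab` with keys and values swapped, so it could be derived rather than typed out. The sketch below shows that derivation and how a sequence of predicted indices maps back to a date string; the index list is made up for illustration and is not model output.

```python
machine_vocab = {'-': 0, '0': 1, '1': 2, '2': 3, '3': 4, '4': 5,
                 '5': 6, '6': 7, '7': 8, '8': 9, '9': 10}

# Swap keys and values to get the index -> character lookup
inv_machine_vocab = {index: char for char, index in machine_vocab.items()}

# Illustrative indices, one per output position (Ty = 10)
predicted_indices = [3, 1, 1, 4, 0, 1, 7, 0, 3, 7]
decoded = ''.join(inv_machine_vocab[i] for i in predicted_indices)

print(decoded)  # 2003-06-26
```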

In [None]:
def one_step_attention(a, s_prev):
    """ 
    Performs one step of attention: Outputs a context vector computed as a dot product of the attention weights
    "alphas" and the hidden states "a" of the Bi-LSTM.
    """

    # Repeator is used to repeat s_prev to be of shape (m, Tx, n_s) so that we can concatenate it with all hidden states "a"
    s_prev = repeator(s_prev)
    # We use concatenator to concatenate a and s_prev on the last axis
    concat = concatenator([a, s_prev])
    # We use densor1 to propagate concat through a small fully-connected neural network to compute the "intermediate energies" variable e
    e = densor1(concat) 
    # We use densor2 to propagate e through a small fully-connected neural network to compute the "energies" variable energies
    energies = densor2(e)
    # We use "activator" on "energies" to compute the attention weights "alphas"
    alphas = activator(energies)
    # We use dotor together with "alphas" and "a" to compute the context vector to be given to the next (post-attention) LSTM-cell
    context = dotor([alphas, a])
    
    return context
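
The last two steps of `one_step_attention` (softmax over the energies, then a weighted sum of the hidden states) can be written out with plain Python lists to make the arithmetic visible. This is a sketch under assumptions: it takes the energies as given instead of computing them with `densor1` and `densor2`, it assumes the softmax normalises across the `Tx` time steps, and the numbers are invented for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(hidden_states, energies):
    """Weight each hidden state by its attention weight and sum over time."""
    alphas = softmax(energies)
    width = len(hidden_states[0])
    context = [
        sum(alpha * state[i] for alpha, state in zip(alphas, hidden_states))
        for i in range(width)
    ]
    return alphas, context


# Three time steps (Tx = 3) with hidden states of size 2
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
energies = [0.0, 0.0, 0.0]  # equal energies give uniform attention

alphas, context = context_vector(hidden_states, energies)
print(alphas)   # three equal weights that sum to 1
print(context)  # the plain average of the hidden states
```

With unequal energies, the context vector leans towards the hidden states of the input characters that matter most for the output character being produced.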

In [None]:
def model(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):
    """
    Arguments:
    Tx -- length of the input sequence
    Ty -- length of the output sequence
    n_a -- hidden state size of the Bi-LSTM
    n_s -- hidden state size of the post-attention LSTM
    human_vocab_size -- size of the python dictionary "human_vocab"
    machine_vocab_size -- size of the python dictionary "machine_vocab"

    Returns:
    model -- Keras model instance
    """

    X = Input(shape=(Tx, human_vocab_size))
    s0 = Input(shape=(n_s,), name='s0')
    c0 = Input(shape=(n_s,), name='c0')
    s = s0
    c = c0
    
    # Initializing empty list of outputs
    outputs = []
    
    # Defining pre-attention Bi-LSTM
    a = Bidirectional(LSTM(n_a, return_sequences=True))(X)
    
    # Iterating for Ty steps
    for t in range(Ty):
    
        # Performing one step of the attention mechanism to get back the context vector at step t
        context = one_step_attention(a, s)
        
        # Applying the post-attention LSTM cell to the "context" vector
        s, _, c = post_activation_LSTM_cell(context, initial_state=[s, c])
        
        # Applying Dense layer to the hidden state output of the post-attention LSTM
        out = output_layer(s)
        
        # Appending "out" to the "outputs" list
        outputs.append(out)
    
    # Creating model instance taking three inputs and returning the list of outputs
    model = Model(inputs=[X, s0, c0], outputs=outputs)
    
    return model

In [None]:
def gen_date_recog_model():
  # Creating date recognition model
  date_model = model(Tx, Ty, n_a, n_s, len(human_vocab), len(machine_vocab))
  # Loading pretrained weights for faster execution
  date_model.load_weights('nmt_weights.h5')
  return date_model

In [None]:
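# Creating the nmt (Neural Machine Translation) model
# It gets its own name so that the model() builder function above is not overwritten
date_model = gen_date_recog_model()

In [None]:
# This function takes a date string in any of the supported formats as input and converts it into YYYY-MM-DD format
def date_recognizer(date):
  # Encode the string as a fixed-length sequence of character indices
  source = string_to_int(date, Tx, human_vocab)
  # One-hot encode each index, then swap the first two axes so that the batch axis comes first
  source = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), source))).swapaxes(0,1)
  prediction = date_model.predict([source, s0, c0])
  # Pick the most likely output character at each of the Ty positions
  prediction = np.argmax(prediction, axis = -1)
  output = [inv_machine_vocab[int(i)] for i in prediction]
  return ''.join(output)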
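
The input side of `date_recognizer` turns a string into a fixed-length list of character indices and then into one-hot rows. The real `string_to_int` lives in `nmt_utils.py` and may differ in detail; `encode`, `one_hot` and `toy_vocab` below are hypothetical names for a minimal sketch of the idea, assuming lower-casing, truncation to `length`, and padding with `<pad>`.

```python
def encode(text, length, vocab):
    """Map characters to indices, truncating or padding to a fixed length."""
    text = text.lower().replace(",", "")[:length]
    indices = [vocab.get(ch, vocab["<unk>"]) for ch in text]
    indices += [vocab["<pad>"]] * (length - len(indices))
    return indices


def one_hot(indices, num_classes):
    """Turn each index into a row with a single 1 at that position."""
    return [[1 if i == idx else 0 for i in range(num_classes)] for idx in indices]


# Toy vocabulary, much smaller than human_vocab
toy_vocab = {" ": 0, "1": 1, "9": 2, "m": 3, "a": 4, "r": 5, "<unk>": 6, "<pad>": 7}

indices = encode("9 Mar 1999", 12, toy_vocab)
matrix = one_hot(indices, len(toy_vocab))

print(indices)                      # [2, 0, 3, 4, 5, 0, 1, 2, 2, 2, 7, 7]
print(len(matrix), len(matrix[0]))  # 12 8
```

A fixed length is needed because the model's input layer is declared with shape `(Tx, human_vocab_size)`.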

The function below takes the dates dataset we got from the previous step and passes each value to the date recognition function

In [None]:
# This function recognises every value in each date column, 12,000 dates for the given dataset (might take around a minute to run)
def recognise_dates(date_df, dates):
  for date in dates:
    date_df[date] = date_df[date].apply(date_recognizer)
  print("After recognising dates")
  print(date_df.head())
  return date_df

In [None]:
date_diff = recognise_dates(date_dff, dates_l)

After recognising dates
       Date_1      Date_2      Date_3
0  2003-06-26  2008-08-31  1998-06-21
1  1994-09-18  1999-11-04  2007-06-05
2  1983-01-04  1988-11-15  1984-11-19
3  1991-02-13  1983-03-24  2000-07-15
4  2003-08-21  1985-08-10  2014-12-03


In [None]:
# This is the dates dataset after the dates were recognised
date_diff.head()

Unnamed: 0,Date_1,Date_2,Date_3
0,2003-06-26,2008-08-31,1998-06-21
1,1994-09-18,1999-11-04,2007-06-05
2,1983-01-04,1988-11-15,1984-11-19
3,1991-02-13,1983-03-24,2000-07-15
4,2003-08-21,1985-08-10,2014-12-03


In [None]:
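from datetime import date as date_converter
# Converts a YYYY-MM-DD string into a datetime.date; malformed or impossible dates become "NAN"
def parse_dates(val):
  try:
    y,m,d = val.split('-')
    return date_converter(int(y),int(m),int(d))
  except (ValueError, AttributeError):
    return "NAN"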
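
Because the model predicts each of the ten output characters separately, nothing guarantees that the result is a real calendar date, which is why the parsing step needs a fallback. The sketch below shows which inputs are rejected; `parse_iso` is a hypothetical variant that returns `None` instead of the `"NAN"` marker used in the notebook.

```python
from datetime import date


def parse_iso(val):
    """Return a datetime.date for a YYYY-MM-DD string, or None if it is invalid."""
    try:
        y, m, d = val.split("-")
        return date(int(y), int(m), int(d))
    except (ValueError, AttributeError):
        # ValueError: wrong number of parts, non-numeric parts, impossible date
        # AttributeError: the value is not a string
        return None


print(parse_iso("2003-06-26"))  # 2003-06-26
print(parse_iso("2003-02-30"))  # None (there is no 30th of February)
print(parse_iso("2003-06"))     # None (the day is missing)
print(parse_iso(None))          # None (not a string)
```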

In [None]:
# This function computes the difference between every pair of date columns and adds each one as a new column
def generate_difference(date_df, dates):
  if len(dates) < 2:
    print("Pair-wise difference not possible")
  date_pairs = list(itertools.combinations(dates, 2))
  for date_col in dates:
    date_df[date_col] = date_df[date_col].apply(parse_dates)
  for date_col in dates:
    idx = date_df[date_df[date_col]=="NAN"].index
    date_df.drop(idx, inplace=True)
  for pair in date_pairs:
    date_df[f"{pair[0]} - {pair[1]}"] = date_df[pair[0]] - date_df[pair[1]]
  return date_df
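
For a single row, the pairwise differences reduce to plain `datetime.date` arithmetic over every two-column combination. The sketch below uses the first row of the recognised dates shown earlier; the `row` dictionary is just an illustrative stand-in for one DataFrame row.

```python
import itertools
from datetime import date

# First row of the recognised dates
row = {
    "Date_1": date(2003, 6, 26),
    "Date_2": date(2008, 8, 31),
    "Date_3": date(1998, 6, 21),
}

# combinations() yields each unordered pair of column names once, in column order
diffs = {
    f"{a} - {b}": (row[a] - row[b]).days
    for a, b in itertools.combinations(row, 2)
}

for name, days in diffs.items():
    print(name, days)
# Date_1 - Date_2 -1893
# Date_1 - Date_3 1831
# Date_2 - Date_3 3724
```

With $n$ date columns this produces $n(n-1)/2$ difference columns, so three date columns give three differences.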

In [None]:
date_pair_diff = generate_difference(date_diff, dates_l)

In [None]:
# Run this cell to get the desired output
date_pair_diff.head()

Unnamed: 0,Date_1,Date_2,Date_3,Date_1 - Date_2,Date_1 - Date_3,Date_2 - Date_3
0,2003-06-26,2008-08-31,1998-06-21,-1893 days,1831 days,3724 days
1,1994-09-18,1999-11-04,2007-06-05,-1873 days,-4643 days,-2770 days
2,1983-01-04,1988-11-15,1984-11-19,-2142 days,-685 days,1457 days
3,1991-02-13,1983-03-24,2000-07-15,2883 days,-3440 days,-6323 days
4,2003-08-21,1985-08-10,2014-12-03,6585 days,-4122 days,-10707 days


## I've kept the implementation of the models used here in a separate notebook to avoid clutter. You can find it through this link if required:

* [Date Classifier](https://colab.research.google.com/drive/1pmFRFWzBDYwzuFZv834lzXG9wTROs5ZQ?usp=sharing)

Thank You