<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/MLPG-Book-Cover-Small.png"><br>

This notebook contains an excerpt from the **`Machine Learning Project Guidelines - For Beginners`** book written by *Balasubramanian Chandran*; the content is available [on GitHub](https://github.com/BalaChandranGH/Books/ML-Project-Guidelines).

<br>
<!--NAVIGATION-->

<[ [Case Study 2: Develop and evaluate an Anomaly Detection system](21.02-mlpg-CS2-Develop-and-evaluate-an-Anomaly-Detection-system.ipynb) | [Contents and Acronyms](00.00-mlpg-Contents-and-Acronyms.ipynb) | [References](22.00-mlpg-References.ipynb) ]>

# 21.3. Case Study 3: Normalize & sort dates using Regular Expressions

* A _`regular expression`_ (aka _`regex`_ or _`regexp`_) is a sequence of characters that defines a search pattern
* Usually, such patterns are used by string searching algorithms for "find" or "find and replace" operations on strings, or for input validation
* Python's regular expression module is **`re`** (run `import re` before using it)
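
As a minimal illustration of the three uses mentioned above (find, find and replace, and input validation), the sketch below applies a single pattern with `re`. The sample sentence and the names `text`, `date_pattern`, and `masked` are made up for this example.

```python
import re

text = "Invoice dated 04/20/2009, paid 05/02/2009."
# One or two digits, a slash, one or two digits, a slash, four digits
date_pattern = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")

# Find: the first match, then every match
first = date_pattern.search(text)
all_dates = date_pattern.findall(text)

# Find and replace: mask every date in the text
masked = date_pattern.sub("<DATE>", text)

# Input validation: the whole string must match the pattern
is_valid = date_pattern.fullmatch("4/3/2009") is not None

print(first.group())  # 04/20/2009
print(all_dates)      # ['04/20/2009', '05/02/2009']
print(masked)         # Invoice dated <DATE>, paid <DATE>.
print(is_valid)       # True
```

Compiling the pattern once with `re.compile` keeps the three operations consistent, since they all share the same pattern object.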

* **Date sorter function:**
  - Identify all of the different date variants encoded in a dataset (examples are given below):
    ```
    * 04/20/2009; 04/20/09; 4/20/09; 4/3/09
    * Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
    * 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
    * Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
    * Feb 2009; Sep 2009; Oct 2010
    * 6/2008; 12/2009
    * 2009; 2010
    ```
  - Normalize and sort the dates using the following function:<br>
* **def date_sorter():**<br>
    \# _Regex for a bare four-digit year such as '1999' or '2019'_<br>
    `re1 = r'([12]\d{3})'`
    
    \# _Regex for month/year dates such as '1/2019', '01/1999', or '01-2019'_<br>
    `re2 = r'(\d{1,2}[/-][12]\d{3})'`

    \# _Regex for numeric dates such as '1/1/19', '01-01-19', or '01/01/2019'_<br>
    `re3 = r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})'`

    \# _Regex for month-name and year dates such as 'Jan 2019'_<br>
    `re4 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\S*\s+\d{4})'`

    \# _Regex for day, month-name, and year dates such as '1 Jan 2019' or '01 Jan 2019'_<br>
    `re5 = r'(\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\S*\s+\d{4})'`

    \# _Regex for month-name, day, and year dates such as 'Jan 1, 2019' or 'Jan 01, 2019'_<br>
    `re6 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\S*\s+\d{1,2},?\s+\d{4})'`

    \# _Build the full regular expression_<br>
    `rex = '(%s|%s|%s|%s|%s|%s)' %(re1, re2, re3, re4, re5, re6)`

    \# _Extract the first match from each string; `df` must be a pandas Series of strings, and the result has one column per capture group_<br>
    `extdate = df.str.extract(rex)`

    \# _Optional: keep column 0 (the full match) and correct two misspelled month names_<br>
    `extdate = extdate.iloc[:,0].str.replace('Janaury', 'January').str.replace('Decemeber', 'December')`

    \# _Standardize the date formats (yyyy-mm-dd) and create a series_<br>
    `extdate = pd.Series(pd.to_datetime(extdate))`

    \# _Sort the dates in ascending order and keep the original row indices in that order_<br>
    `extdate = extdate.sort_values(ascending=True).index`

    **return** `pd.Series(extdate.values)`
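
The same extract, normalize, and sort steps can be sketched with only the standard library, which makes each step easy to check in isolation. This is not the book's pandas function: the names `PATTERNS`, `parse_date`, and `date_sorter_plain` are hypothetical, the patterns are tried from most to least specific, and the sketch also accepts the hyphenated and ordinal variants listed earlier. It assumes month-first numeric dates, that two-digit years 00-49 mean 20xx and 50-99 mean 19xx, and that a missing day or month defaults to 1.

```python
import re
from datetime import date

MONTH_NAMES = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split()
MONTH_NUM = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}
# Month abbreviation plus any remaining lowercase letters ("Mar", "March")
MON = r"(?P<mon>" + "|".join(MONTH_NAMES) + r")[a-z]*"

# Tried in order, most specific first, so 'Mar-20-2009' is not read as '20-2009'
PATTERNS = [re.compile(p) for p in (
    r"\b(?P<m>\d{1,2})[/-](?P<d>\d{1,2})[/-](?P<y>\d{4}|\d{2})\b",          # 04/20/2009, 4/3/09
    r"\b(?P<d>\d{1,2})\s+" + MON + r"[.,]?\s+(?P<y>\d{4})\b",               # 20 Mar. 2009
    MON + r"\.?[\s-]+(?P<d>\d{1,2})(?:st|nd|rd|th)?,?[\s-]+(?P<y>\d{4})\b", # Mar 20th, 2009
    MON + r"[.,]?\s+(?P<y>\d{4})\b",                                        # Feb 2009
    r"\b(?P<m>\d{1,2})[/-](?P<y>\d{4})\b",                                  # 6/2008
    r"\b(?P<y>[12]\d{3})\b",                                                # 2009
)]


def parse_date(text):
    """Return the first recognizable date in `text` as a date, or None."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match is None:
            continue
        parts = match.groupdict()
        year = int(parts["y"])
        if year < 100:  # assumption: 00-49 means 20xx, 50-99 means 19xx
            year += 2000 if year < 50 else 1900
        if parts.get("mon"):
            month = MONTH_NUM[parts["mon"]]
        else:
            month = int(parts.get("m") or 1)  # assumption: missing month is January
        day = int(parts.get("d") or 1)        # assumption: missing day is the 1st
        try:
            return date(year, month, day)
        except ValueError:  # impossible date such as month 13; try the next pattern
            continue
    return None


def date_sorter_plain(texts):
    """Return the positions of `texts`, ordered by the date found in each one."""
    dated = []
    for position, text in enumerate(texts):
        parsed = parse_date(text)
        if parsed is not None:  # entries without a date are skipped
            dated.append((parsed, position))
    return [position for _, position in sorted(dated)]


notes = [
    "seen on 04/20/2009",
    "follow-up Mar 20th, 2009",
    "since 6/2008",
    "history from 1999",
    "20 March, 2010",
]
print(date_sorter_plain(notes))  # [3, 2, 1, 0, 4]
```

Like the pandas version, the result is the list of original positions in chronological order rather than the dates themselves.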
