In [5]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
import myst_nb

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)
        
# inspections
insp = pd.read_csv("data/inspections.csv")

# violations
viol = pd.read_csv("data/violations.csv")

(ch:wrangling_transformations)=
# Transformations

At times a feature is not in a form best suited for analysis and so we transform it. There are many reasons a feature might need a transformation. The value codings might not be useful for analysis, we may want to combine two features into one using an arithmetic expression, or we might want to pull information out of a feature to create a new feature. We describe these three basic kinds of transformations: type conversions, mathematical transformations, and extractions.

**Type conversion.** This kind of transformation converts data from one format to another to make it more useful for analysis. We might convert information stored as a string to another format. For example, we would want to convert prices reported as strings to numbers (like changing "$2.17" to 2.17) so that we can compute summary statistics on them. Or we might want to convert a time stored as a string, such as "1955-10-12", to a datetime format. Yet another example occurs when we lump categories together, such as reducing the 11 categories for age in DAWN to 5 groupings.
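The two string conversions mentioned above can be sketched as follows. The small data frame and its column names are made up for illustration and are not part of the inspections data.

```python
import pandas as pd

# Hypothetical data for illustration; not from the inspections dataset.
df = pd.DataFrame({
    "price": ["$2.17", "$10.50", "$0.99"],
    "date": ["1955-10-12", "1955-11-01", "1956-01-15"],
})

# Strip the leading dollar sign, then convert the strings to floats.
df["price_num"] = df["price"].str.replace("$", "", regex=False).astype(float)

# Parse the date strings into pandas datetime values.
df["date_dt"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

print(df.dtypes)
print(df["price_num"].mean())
```

Once the prices are numeric, summary statistics such as the mean become available, and the datetime column supports operations like extracting the year.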

**Mathematical transformation.** One kind of mathematical transformation is when we change the units of a measurement from, say, pounds to kilograms. We might make unit conversions so that statistics on our data can be more easily compared to other statistics. Another reason to transform a feature is to symmetrize its distribution (this notion is covered in {numref}`Section %s <ch:eda_distributions>`). The most common such transformation is the logarithm; another widely used transformation is the square root. Lastly, we might want to create new features from arithmetic operations on others. For example, height and weight can be combined to create body mass index using the formula $weight/height^2$, with weight in kilograms and height in meters.
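As a rough sketch of these three kinds of mathematical transformation, the code below uses made-up measurements. The column names and values are assumptions for illustration, and the body mass index follows the standard definition of weight in kilograms divided by the square of height in meters.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements for illustration only.
people = pd.DataFrame({
    "weight_lb": [150.0, 180.0, 125.0],
    "height_m": [1.70, 1.85, 1.60],
})

# Unit conversion: one pound is 0.45359237 kilograms.
people["weight_kg"] = people["weight_lb"] * 0.45359237

# New feature from arithmetic on two others: weight (kg) / height (m) squared.
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2

# Log transformation of a right-skewed feature (hypothetical incomes).
income = pd.Series([20_000, 35_000, 50_000, 1_000_000])
log_income = np.log10(income)

print(people)
print(log_income)
```

On the log scale the one very large income sits much closer to the others, which is the sense in which the logarithm can make a skewed distribution more symmetric.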

**Extraction.** Sometimes we want to create a feature that contains partial information taken from another feature. For example, each inspection violation consists of a string with a description of the violation, and we may be interested only in whether the violation is related to, say, vermin. We can create a new feature that is True if the violation contains the word "vermin" in its text description and False otherwise. This conversion of information to logical values (or 0-1 values) is extremely useful in data science. The example in
{numref}`Section %s <ch:wrangling_restaurants>` gives you a sense of the power of these binary features.
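A minimal sketch of this kind of extraction is shown below. The descriptions and the names `viol_demo`, `is_vermin`, and `is_vermin_01` are invented for illustration; the actual violations data may be worded differently.

```python
import pandas as pd

# Hypothetical violation descriptions for illustration only.
viol_demo = pd.DataFrame({
    "description": [
        "High risk vermin infestation",
        "Unclean or degraded floors walls or ceilings",
        "Moderate risk Vermin infestation",
        None,
    ]
})

# True where the description mentions "vermin" (case-insensitive);
# missing descriptions are treated as False.
viol_demo["is_vermin"] = viol_demo["description"].str.contains(
    "vermin", case=False, na=False
)

# The same information as 0-1 values.
viol_demo["is_vermin_01"] = viol_demo["is_vermin"].astype(int)

print(viol_demo)
```

Matching without regard to case and treating missing text as False are design choices made here for the sketch; a real analysis should check how the descriptions are actually written before settling on a rule.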

The type conversion that deals with dates and times is of particular note because many datasets include time as a feature. The datetime format is worthy of its own section (see below). Many other examples of transformations and their usefulness can be found in {numref}`Chapter %s <ch:eda>`.

## Transforming Time

In [6]:
# Parse the integer YYYYMMDD dates into datetimes, then extract the year.
# pd.to_datetime replaces pd.datetime, which has been removed from pandas.
insp['new_date'] = pd.to_datetime(insp['date'].astype(str), format='%Y%m%d')
insp['year']     = insp['new_date'].dt.year

viol['new_date'] = pd.to_datetime(viol['date'].astype(str), format='%Y%m%d')
viol['year']     = viol['new_date'].dt.year
