# Data Wrangling:

    --> Transforming or converting data from one format to another
    --> Uses: better analysis and better accuracy when using ML

    Types of Data wrangling:
    -----------------------

    (1) Discretization

    (2) Encoding

    (3) Transformation

    (4) Scaling

# Discretization:
    (1) Discretization is the process of transforming a continuous variable into a discrete variable by creating a set of contiguous intervals that span the range of the variable
    (2) Discretization is also called binning, where bin is an alternate name for interval
    
    Need for Discretization:
    ------------------------
    (1) Better understanding of the data: binned values can be analysed further with group by and cross tab
    (2) Can improve accuracy in some ML and AI models

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame({"Age":np.arange(1,31)})
df

Unnamed: 0,Age
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8
8,9
9,10


In [3]:
df["New_age"] = pd.cut(df['Age'],bins=[0,10,15,20,25,30],labels=['children','Teenagers','Adults','Employee','Manager'])
df

Unnamed: 0,Age,New_age
0,1,children
1,2,children
2,3,children
3,4,children
4,5,children
5,6,children
6,7,children
7,8,children
8,9,children
9,10,children


In [4]:
df[(df['Age']>0) & (df['Age']<=10)]

Unnamed: 0,Age,New_age
0,1,children
1,2,children
2,3,children
3,4,children
4,5,children
5,6,children
6,7,children
7,8,children
8,9,children
9,10,children


In [5]:
df[(df['Age']>10) & (df['Age']<=15)]

Unnamed: 0,Age,New_age
10,11,Teenagers
11,12,Teenagers
12,13,Teenagers
13,14,Teenagers
14,15,Teenagers


In [6]:
df[(df['Age']>15) & (df['Age']<=20)]

Unnamed: 0,Age,New_age
15,16,Adults
16,17,Adults
17,18,Adults
18,19,Adults
19,20,Adults


In [7]:
df[(df['Age']>20) & (df['Age']<=25)]

Unnamed: 0,Age,New_age
20,21,Employee
21,22,Employee
22,23,Employee
23,24,Employee
24,25,Employee


In [8]:
df[(df['Age']>25) & (df['Age']<=30)]

Unnamed: 0,Age,New_age
25,26,Manager
26,27,Manager
27,28,Manager
28,29,Manager
29,30,Manager
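
The five filters above check each bin by hand. As a minimal sketch of the group-by analysis mentioned earlier, the same check can be done in one step; the name `summary` is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": np.arange(1, 31)})
df["New_age"] = pd.cut(
    df["Age"],
    bins=[0, 10, 15, 20, 25, 30],
    labels=["children", "Teenagers", "Adults", "Employee", "Manager"],
)

# pd.cut uses right-closed intervals by default: (0, 10], (10, 15], ...
# min/max/count per bin show which ages ended up under each label
summary = df.groupby("New_age", observed=True)["Age"].agg(["min", "max", "count"])
print(summary)
```

Each row of `summary` should agree with the matching filter above, for example ages 11 to 15 under `Teenagers`.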


# Encoding:
    (1) Converting a discrete categorical variable to a discrete numeric variable
    (2) Reason: most ML algorithms cannot work with text data directly

    Nominal:
    --------
    (1) dummies          : pandas
    (2) one hot encoding : sklearn

    Ordinal:
    --------
    (1) map                                    : pandas
    (2) ordinal encoding (order of categories) : sklearn
    (3) label encoding   (alphabetical order)  : sklearn

In [9]:
df = pd.DataFrame({"Shirtsize":['xl','small','medium','large','xl']})
df

Unnamed: 0,Shirtsize
0,xl
1,small
2,medium
3,large
4,xl


In [10]:
df['Shirtsize'].unique()

array(['xl', 'small', 'medium', 'large'], dtype=object)

In [11]:
df['Shirtsize_replace'] = df['Shirtsize'].replace({'small': 0, 'medium': 1, 'large': 2, 'xl': 3})
df

Unnamed: 0,Shirtsize,Shirtsize_replace
0,xl,3
1,small,0
2,medium,1
3,large,2
4,xl,3


In [12]:
df['Shirtsize_map'] = df['Shirtsize'].map({'small': 0, 'medium': 1, 'large': 2, 'xl': 3})
df

Unnamed: 0,Shirtsize,Shirtsize_replace,Shirtsize_map
0,xl,3,3
1,small,0,0
2,medium,1,1
3,large,2,2
4,xl,3,3
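
`replace` and `map` give the same result above because every size appears in the dictionary. The sketch below shows how they differ when a value is missing from the dictionary; the value `'xxl'` is made up for illustration.

```python
import pandas as pd

sizes = pd.Series(['small', 'xxl'])   # 'xxl' is a made-up value absent from the mapping
mapping = {'small': 0, 'medium': 1, 'large': 2, 'xl': 3}

mapped = sizes.map(mapping)           # values not in the dict become NaN
replaced = sizes.replace(mapping)     # values not in the dict are left unchanged

print(mapped.tolist())
print(replaced.tolist())
```

With `map`, an unexpected category shows up as a missing value, which is usually easier to spot than a leftover string.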


In [13]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['small','medium','large','xl']])
df['oe_Shirtsize'] = oe.fit_transform(df[['Shirtsize']])
df

Unnamed: 0,Shirtsize,Shirtsize_replace,Shirtsize_map,oe_Shirtsize
0,xl,3,3,3.0
1,small,0,0,0.0
2,medium,1,1,1.0
3,large,2,2,2.0
4,xl,3,3,3.0


In [14]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['le_Shirtsize'] = le.fit_transform(df['Shirtsize'])
df

Unnamed: 0,Shirtsize,Shirtsize_replace,Shirtsize_map,oe_Shirtsize,le_Shirtsize
0,xl,3,3,3.0,3
1,small,0,0,0.0,2
2,medium,1,1,1.0,1
3,large,2,2,2.0,0
4,xl,3,3,3.0,3
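
In the output above, `LabelEncoder` assigns codes alphabetically, so `large` gets 0 and `small` gets 2, which does not follow the size order. One pandas-only alternative, sketched here as an assumption rather than the notebook's method, is an ordered `Categorical`; the column name `cat_Shirtsize` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Shirtsize": ['xl', 'small', 'medium', 'large', 'xl']})

size_order = ['small', 'medium', 'large', 'xl']   # explicit order, smallest to largest
cat = pd.Categorical(df['Shirtsize'], categories=size_order, ordered=True)

# .codes gives the position of each value in size_order
df['cat_Shirtsize'] = cat.codes
print(df)
```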


In [15]:
df1 = pd.DataFrame({'city':['H','M','B','H','B','D']})
df1

Unnamed: 0,city
0,H
1,M
2,B
3,H
4,B
5,D


In [16]:
df1['city'].unique()

array(['H', 'M', 'B', 'D'], dtype=object)

In [17]:
pd.get_dummies(df1['city'],dtype='int')

Unnamed: 0,B,D,H,M
0,0,0,1,0
1,0,0,0,1
2,1,0,0,0
3,0,0,1,0
4,1,0,0,0
5,0,1,0,0


In [18]:
pd.get_dummies(df1['city'],dtype='int',drop_first=True)

Unnamed: 0,D,H,M
0,0,1,0
1,0,0,1
2,0,0,0
3,0,1,0
4,0,0,0
5,1,0,0


In [19]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df1['city_le'] = le.fit_transform(df1['city'])
df1

Unnamed: 0,city,city_le
0,H,2
1,M,3
2,B,0
3,H,2
4,B,0
5,D,1
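
The label-encoded `city_le` column implies an order (B < D < H < M) that a nominal variable does not have. The sklearn one hot encoder listed under Nominal is not shown above, so here is a minimal sketch of it; the names `ohe` and `one_hot` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df1 = pd.DataFrame({"city": ["H", "M", "B", "H", "B", "D"]})

ohe = OneHotEncoder()   # the default output is a SciPy sparse matrix
encoded = ohe.fit_transform(df1[["city"]]).toarray().astype(int)

# categories_[0] holds the sorted unique values found in the first (only) column
one_hot = pd.DataFrame(encoded, columns=ohe.categories_[0], index=df1.index)
print(one_hot)
```

The result should match the `pd.get_dummies` table above, with the added benefit that the fitted encoder can be reused on new data.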


# Transformation:

    --> Conversion of skewed data to (approximately) normally distributed data
    --> For the Box-Cox transformation, lambda is usually searched in the range -5 to 5; at lambda = 0 the transformation is defined as log(x), and lambda = 1 only shifts the data without changing its shape. Box-Cox requires strictly positive x

    left skewed distribution to normal distribution:
    ------------------------------------------------
    (1) Exponential transformation
    (2) Power transformation
    (3) Box-Cox transformation : (x**lambda-1)/lambda

    right skewed distribution to normal distribution:
    -------------------------------------------------
    (1) Logarithmic transformation
    (2) Root transformation
    (3) Box-Cox transformation : (x**lambda-1)/lambda
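
To make the Box-Cox formula concrete: `scipy.stats.boxcox` computes `(x**lambda - 1)/lambda` for non-zero lambda and `log(x)` for lambda = 0. The sketch below applies that formula by hand; `boxcox_manual` is a hypothetical helper name, and the lambda is the value scipy reports for the left-skewed example that follows.

```python
import numpy as np

def boxcox_manual(x, lam):
    """Box-Cox transform of strictly positive x for a given lambda (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)          # limiting case as lambda -> 0
    return (x**lam - 1) / lam

x = np.array([5, 6, 7, 30])
lam = 1.7764587626584691          # lambda fitted by scipy for the left-skewed example
manual = boxcox_manual(x, lam)
print(manual)
```

The printed values should agree with the corresponding rows of the scipy output in the next cells (about 9.2576 for x = 5).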

In [20]:
df = pd.DataFrame({'x': [5,6,7,18,19,20,21,22,23,24,25,26,27,28,29,30]})
df.skew()

x   -1.038573
dtype: float64

In [21]:
df['exp_transformation'] = np.exp(df['x'])
df['exp_transformation'].skew()

3.27617198665468

In [22]:
df['pow_transformation'] = df['x']**2 # power n; typically try n = 2 to 5
df['pow_transformation'].skew()

-0.39801047150849417
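
Since no single exponent suits every dataset, one way to choose it is to try each candidate and keep the one whose skew is closest to 0. A minimal sketch on the same data; the dictionary name `skews` is illustrative.

```python
import pandas as pd

x = pd.Series([5, 6, 7, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30])

# skew of x**n for each candidate exponent n = 2..5
skews = {n: (x**n).skew() for n in range(2, 6)}
for n, s in skews.items():
    print(n, s)

# exponent whose transformed data has skew closest to zero
best_n = min(skews, key=lambda n: abs(skews[n]))
print("closest to 0:", best_n)
```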

In [23]:
from scipy.stats import boxcox
df['normal_dist'], lamda = boxcox(df['x'])
df['normal_dist']

0       9.257642
1      13.013915
2      17.290719
3      95.020974
4     104.656988
5     114.695063
6     125.130612
7     135.959322
8     147.177124
9     158.780168
10    170.764807
11    183.127572
12    195.865162
13    208.974427
14    222.452356
15    236.296068
Name: normal_dist, dtype: float64

In [24]:
lamda

1.7764587626584691

In [25]:
df['normal_dist'].skew()

-0.5399337400995566

In [26]:
df1 = pd.DataFrame({'x': [18,19,20,21,22,23,24,25,26,27,28,29,30,40,50]})
df1.skew()

x    1.78074
dtype: float64

In [27]:
df1['log_transformation'] = np.log(df1['x'])
df1['log_transformation'].skew()

1.0965234173645417

In [28]:
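# x**1/100 parses as (x**1)/100, which only rescales x and leaves the skew unchanged;
# the exponent must be parenthesized to take a root
df1['root_transformation'] = df1['x']**(1/2) # n-th root is x**(1/n); here n = 2
df1['root_transformation'].skew()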

In [29]:
from scipy.stats import boxcox
df1['normal_dist'], lamda = boxcox(df1['x'])
df1['normal_dist'], lamda

(0     0.656666
 1     0.657340
 2     0.657931
 3     0.658452
 4     0.658914
 5     0.659327
 6     0.659697
 7     0.660031
 8     0.660332
 9     0.660606
 10    0.660856
 11    0.661084
 12    0.661294
 13    0.662700
 14    0.663441
 Name: normal_dist, dtype: float64,
 -1.503079917373512)

In [30]:
df1['normal_dist'].skew()

0.09752880874312798

# Scaling:
    --> Bringing columns with high-magnitude values onto a smaller, comparable range. Scaling keeps the original shape of the distribution
    --> If one column has high-magnitude values and another has low-magnitude values, many ML algorithms treat the high-magnitude column as more important
    
    Types of scaling:
    -----------------
    (1) Standardization
    (2) Normalization

# Standardization:
    (1) Converting every value to its corresponding z-score is called Standardization
    (2) After conversion the new column has mean 0 and standard deviation 1; if it is also normally distributed, it follows the standard normal distribution
    (3) If the original data is normally distributed, the z-scores follow the standard normal distribution
    (4) The histogram of the z-scores is the z-distribution
    (5) The z-distribution can be right skewed
    (6) The z-distribution can be left skewed
    (7) The z-distribution can be normal (the standard normal distribution)
    (8) A z-score can lie anywhere in (-inf, inf). In most cases it falls within (-4, 4)

    Z-score = (x-mean)/std
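
The points above say that z-scores keep the shape of the original data. A minimal sketch that checks this on the right-skewed values used in the Transformation section: the mean becomes 0 and the standard deviation 1, but the skew does not change.

```python
import pandas as pd

x = pd.Series([18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 40, 50])

# z-score using the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

print("mean:", z.mean(), "std:", z.std(ddof=0))
print("skew before:", x.skew(), "skew after:", z.skew())
```

Standardization is therefore not a substitute for the transformations in the previous section when the goal is to reduce skew.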

In [31]:
df = pd.DataFrame({'x':np.arange(1,6)})
df

Unnamed: 0,x
0,1
1,2
2,3
3,4
4,5


In [32]:
zvalue = (df['x']-df['x'].mean())/df['x'].std(ddof=0)
zvalue

0   -1.414214
1   -0.707107
2    0.000000
3    0.707107
4    1.414214
Name: x, dtype: float64

In [33]:
from sklearn.preprocessing import StandardScaler
StandardScaler().fit_transform(df[['x']])

array([[-1.41421356],
       [-0.70710678],
       [ 0.        ],
       [ 0.70710678],
       [ 1.41421356]])
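
The manual z-score above matches `StandardScaler` only because `std(ddof=0)` was used: `StandardScaler` divides by the population standard deviation, while pandas `std()` defaults to `ddof=1`. A short sketch of the difference; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 6))

z_pop = (x - x.mean()) / x.std(ddof=0)   # population std, same convention as StandardScaler
z_sample = (x - x.mean()) / x.std()      # pandas default ddof=1 (sample std)

print(z_pop.iloc[0], z_sample.iloc[0])
```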

# Normalization:
    (1) Converting all values to the range 0 to 1 is called Normalization (min-max scaling)
    (2) When to use which scaling technique? ==> In most cases we use the standard scaler, except for a few algorithms

    X(scaled) = (x-x.min())/(x.max()-x.min())

In [34]:
X_scaled = (df['x']-df['x'].min())/(df['x'].max()-df['x'].min())
X_scaled

0    0.00
1    0.25
2    0.50
3    0.75
4    1.00
Name: x, dtype: float64

In [35]:
from sklearn.preprocessing import MinMaxScaler
MinMaxScaler().fit_transform(df[['x']])

array([[0.  ],
       [0.25],
       [0.5 ],
       [0.75],
       [1.  ]])
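
The same formula can be applied to several columns at once, which is the situation described under Scaling where columns differ in magnitude. A minimal sketch with made-up `salary` and `age` values:

```python
import pandas as pd

# made-up data: two columns with very different magnitudes
data = pd.DataFrame({
    'salary': [20000, 35000, 50000, 80000],
    'age': [22, 30, 41, 58],
})

# min-max scaling applied column by column
scaled = (data - data.min()) / (data.max() - data.min())
print(scaled)
```

After scaling, both columns run from 0 to 1, so neither dominates purely because of its units.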