# Chapter 4
> Building good datasets for ML



## 1. Handling missing values
**To handle missing values we can either**

1. Drop missing data
2. Impute missing data

In [1]:
import pandas as pd
from io import StringIO

In [9]:
# Build sample data
csv_data = '''A, B, C, D
1.0, 2.0, 3.0, 4.0
5.0, 6.0,, 8.0
10.0, 11.0, 12.0,'''

In [49]:
# Load the data
df = pd.read_csv(StringIO(csv_data))
df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


### 1. Drop missing data 

***To check for null values in columns***
 - df.values
> returns the DataFrame's values as a NumPy array
 - df.isnull()
> returns a boolean DataFrame that is True wherever a value is null
 - df.isnull().sum()
> returns the number of null values in each column


In [13]:
df.values

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6., nan,  8.],
       [10., 11., 12., nan]])

In [14]:
df.isnull()

Unnamed: 0,A,B,C,D
0,False,False,False,False
1,False,False,True,False
2,False,False,False,True


In [16]:
df.isnull().sum()

A     0
 B    0
 C    1
 D    1
dtype: int64

<br>

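***To drop the null values***
- df.dropna(axis, how, thresh, subset)
> drops the rows or columns of df that contain null values
 - axis = 0 \>\> drop rows / axis = 1 \>\> drop columns
 - how = "any" (default) drops if any value is null; how = "all" drops only if all values are null
 - thresh = \<INT\> drops rows/columns that have fewer than \<INT\> non-null values
 - subset = \<cols\> only looks for nulls in these columns when dropping rows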
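The subset parameter is the one option in the list above that the cells below do not exercise. Here is a minimal sketch, assuming a hypothetical frame named demo that holds the same values as the sample data (demo and rows_with_c are illustrative names, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same values as the sample data
demo = pd.DataFrame({
    "A": [1.0, 5.0, 10.0],
    "B": [2.0, 6.0, 11.0],
    "C": [3.0, np.nan, 12.0],
    "D": [4.0, 8.0, np.nan],
})

# Only look at column "C" when deciding which rows to drop:
# row 1 is dropped (C is null), row 2 is kept even though D is null there.
rows_with_c = demo.dropna(subset=["C"])
print(rows_with_c)
```

This is useful when only some columns are required to be complete, for example the target column.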

In [17]:
# drop rows that contain null values
df.dropna(axis = 0)

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


In [18]:
# drop columns that contain null values
df.dropna(axis = 1)

Unnamed: 0,A,B
0,1.0,2.0
1,5.0,6.0
2,10.0,11.0


In [19]:
# drop only the columns where all values are null (none here, so nothing is dropped)
df.dropna(axis = 1, how = "all")

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [23]:
# keep only the rows that have at least 4 non-null values
df.dropna(axis = 0, thresh=4)

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


<br><br>

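### 2. Impute missing data
- replace missing values with other values
> - mean          : fill with the mean of the column's values          # works with numerical values
> - median        : fill with the median (middle value) of the column  # works with numerical values
> - most_frequent : fill with the most frequent value in the column    # also works with non-numerical values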

*mean imputation can also be done with ***df.fillna(df.mean())****
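The strategies other than the mean can be sketched as follows. This is an illustration only, assuming two small hypothetical arrays (prices and colors) that are not part of the notebook's data; strategy="median" and strategy="most_frequent" are existing SimpleImputer options.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric column with one missing value and one outlier
prices = np.array([[10.0], [np.nan], [14.0], [100.0]])

# Hypothetical categorical column with one missing value
colors = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)

# The median is less sensitive to the outlier (100.0) than the mean
median_imputer = SimpleImputer(strategy="median")
price_filled = median_imputer.fit_transform(prices)

# most_frequent fills with the mode, so it also works on strings
mode_imputer = SimpleImputer(strategy="most_frequent")
color_filled = mode_imputer.fit_transform(colors)

print(price_filled.ravel())
print(color_filled.ravel())
```

In this sketch the missing price is filled with 14.0 (the median of 10, 14 and 100) and the missing color with "red".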

In [54]:
# impute null values with SimpleImputer (column means)
from sklearn.impute import SimpleImputer
import numpy as np

imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data


array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])

In [55]:
# replace null values with the mean of each column
df.fillna(df.mean())

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.5,8.0
2,10.0,11.0,12.0,6.0


<br><br><br>
## 2. Handling categorical data
Dealing with categorical data (**ordinal/nominal**) by 
1. mapping values
    1. [map ordinal:](#map_ordinal) *map ordinal values to a numerical order*
    2. [map nominal:](#map_nominal) *map each unique value to its index*
2. encoding values
    1. label encoding
    2. one-hot encoding
In [90]:
# create testing data
import pandas as pd
df = pd.DataFrame([
                   ['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2'],
                   ['red', 'XL', 15.4, 'class2']])

df.columns = ['color', 'size', 'price', 'classlabel']
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class2
1,red,L,13.5,class1
2,blue,XL,15.3,class2
3,red,XL,15.4,class2


<br><a name = "map_ordinal"></a>***map ordinal data***
> map ordinal values to numerical order

In [91]:
# map the ordinal sizes to integers (M < L < XL)
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}
df['size'] = df['size'].map(size_mapping)
df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,class2
1,red,2,13.5,class1
2,blue,3,15.3,class2
3,red,3,15.4,class2


In [92]:
# reverse the mapping to recover the original size labels
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df["size"].map(inv_size_mapping)


0     M
1     L
2    XL
3    XL
Name: size, dtype: object

<br><a name = "map_nominal"></a>***map nominal data***
> 1. [map each unique value to its index](#map_lable_index)
> 2. [scikit-learn label encoder](#map_lable_pandas)
> 3. [One-Hot encoder](#One_Hot_Encoder)

In [93]:
# build a label -> index mapping from the unique class labels
import numpy as np
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping

{'class1': 0, 'class2': 1}

In [94]:
# map values
df['classlabel'] = df['classlabel'].map(class_mapping)

In [95]:
# inverse mapping 
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,class2
1,red,2,13.5,class1
2,blue,3,15.3,class2
3,red,3,15.4,class2


***scikit-learn label encoder***<a name = map_lable_pandas></a>
> encode each label as its integer index using scikit-learn's LabelEncoder

In [99]:
# encode the class labels as integers
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
# store the encoded labels in a copy so df keeps the original strings
encoded_df = df.copy()
encoded_df['classlabel'] = y
encoded_df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,1
1,red,2,13.5,0
2,blue,3,15.3,1
3,red,3,15.4,1


In [102]:
# inverse encoding
y_ = class_le.inverse_transform(y)
encoded_df['classlabel'] = y_
encoded_df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,class2
1,red,2,13.5,class1
2,blue,3,15.3,class2
3,red,3,15.4,class2


#### Faster implementation

In [112]:
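# take the feature columns as a NumPy array
X = df[['color', 'size', 'price']].values
color_le = LabelEncoder()

# label-encode the color column (column 0) directly in the array
X[:, 0] = color_le.fit_transform(X[:, 0])
X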

array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3],
       [2, 3, 15.4]], dtype=object)

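Label encoding may cause nominal values to be treated as ordinal: the integer codes imply an order (here blue < green < red) that does not exist in the data. <br>
We can solve this issue by using ***One-Hot encoding***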
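To make the problem concrete, here is a small sketch that assumes a hypothetical list of colors matching the color column above. LabelEncoder sorts the classes and numbers them, so the codes carry an alphabetical order that has nothing to do with the colors themselves.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical list matching the color column of the sample data
colors = ["green", "red", "blue", "red"]

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the sorted unique labels; a label's position is its code
print(list(le.classes_))  # ['blue', 'green', 'red']
print(list(codes))        # [1, 2, 0, 2]
```

A model that compares these numbers would see red (2) as larger than green (1), which is an artifact of the encoding.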

<a name = One_Hot_Encoder></a>***One-Hot encoding***
> encode each unique value as its own binary (dummy) column

In [118]:
from sklearn.preprocessing import OneHotEncoder
X = df[['color', 'size', 'price']].values
color_ohe = OneHotEncoder()
color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()

array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.]])

Apply it to the full feature matrix, leaving the other columns untouched

In [119]:
from sklearn.compose import ColumnTransformer
X = df[['color', 'size', 'price']].values
c_transf = ColumnTransformer([
    ('onehot', OneHotEncoder(), [0]),     # one-hot encode the color column
    ('nothing', 'passthrough', [1, 2])    # keep size and price unchanged
])
c_transf.fit_transform(X).astype(float)

array([[ 0. ,  1. ,  0. ,  1. , 10.1],
       [ 0. ,  0. ,  1. ,  2. , 13.5],
       [ 1. ,  0. ,  0. ,  3. , 15.3],
       [ 0. ,  0. ,  1. ,  3. , 15.4]])

For a quicker way to one-hot encode, we can use the pandas function pd.get_dummies(categorical_data)

In [144]:
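# one-hot encode with get_dummies
# drop the last column (the labels)
features = df.columns.drop(df.columns[-1])
# create a new df with dummy values; only the string columns are encoded
# dtype=int gives the 0/1 output shown here (recent pandas versions default to bool)
dummy_df = pd.get_dummies(df[features], dtype=int)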

dummy_df

Unnamed: 0,size,price,color_blue,color_green,color_red
0,1,10.1,0,1,0
1,2,13.5,0,0,1
2,3,15.3,1,0,0
3,3,15.4,0,0,1


In [147]:
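# drop_first=True removes the first dummy column (color_blue); blue is then implied by all zeros
dummy_df = pd.get_dummies(df[features], drop_first=True, dtype=int)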

dummy_df

Unnamed: 0,size,price,color_green,color_red
0,1,10.1,1,0
1,2,13.5,0,1
2,3,15.3,0,0
3,3,15.4,0,1
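The same idea is available in scikit-learn through the drop parameter of OneHotEncoder. The sketch below assumes a hypothetical colors array matching the color column above; dropping one column removes the redundancy between the dummy columns, since the dropped category is implied when all remaining columns are 0.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column vector matching the color column of the sample data
colors = np.array(["green", "red", "blue", "red"]).reshape(-1, 1)

# drop="first" removes the column of the first (sorted) category, here "blue"
ohe_drop = OneHotEncoder(drop="first")
encoded = ohe_drop.fit_transform(colors).toarray()

# Remaining columns are green and red; the blue row is all zeros
print(encoded)
```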
