
# **Importing test data with pandas**

In [2]:
import pandas as pd
from io import StringIO
csv_data = '''A,B,C,D
1.0,2.0,3.0, 4.0
5.0, 6.0,,8.0
10.0, 11.0, 12.0,'''
df = pd.read_csv(StringIO(csv_data))
df

,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,NaN,8.0
2,10.0,11.0,12.0,NaN


## The StringIO class lets us read a string as if it were a CSV file and load it into a pandas DataFrame

## **Using the isnull method together with sum to count the missing values per column**

In [3]:
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64
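
The same boolean mask can be summarized in other ways. As a small sketch (the name `df_demo` is only used here for illustration), summing along `axis=1` counts missing values per row, and taking the mean gives the fraction of missing values per column:

```python
import pandas as pd
from io import StringIO

# Same toy data as above, rebuilt so this cell runs on its own
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df_demo = pd.read_csv(StringIO(csv_data))

# Number of missing values in each row (training example)
missing_per_row = df_demo.isnull().sum(axis=1)

# Fraction of missing values in each column (feature)
missing_fraction = df_demo.isnull().mean()

print(missing_per_row)
print(missing_fraction)
```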

## **scikit-learn was originally developed for working with NumPy arrays only. A pandas DataFrame can be more convenient, but NumPy arrays are recommended when possible. To access the underlying NumPy array of a DataFrame, use the values attribute**

In [4]:
df.values

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6., nan,  8.],
       [10., 11., 12., nan]])

### **Rows (training examples) with missing values can be dropped via the dropna method**

In [5]:
df.dropna(axis=0)

,A,B,C,D
0,1.0,2.0,3.0,4.0


### **Columns (features) that contain at least one NaN can be dropped by setting axis=1 in dropna**

In [6]:
df.dropna(axis=1)

,A,B
0,1.0,2.0
1,5.0,6.0
2,10.0,11.0


### **Only drop rows where all columns are NaN**

In [7]:
df.dropna(how='all')
# our dataset has no row where all columns are NaN, so nothing is dropped

,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,NaN,8.0
2,10.0,11.0,12.0,NaN


### **Drop rows that have fewer than 4 real values**

In [8]:
df.dropna(thresh=4)

,A,B,C,D
0,1.0,2.0,3.0,4.0


### **Only drop rows where NaN appears in specific columns**

In [9]:
df.dropna(subset=['C'])

,A,B,C,D
0,1.0,2.0,3.0,4.0
2,10.0,11.0,12.0,NaN


## **Estimating missing values using mean imputation (SimpleImputer from scikit-learn)**

In [10]:
from sklearn.impute import SimpleImputer
import numpy as np
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
# other options for the strategy parameter include 'median' and 'most_frequent'
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])
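
To see how the choice of strategy changes the result, here is a small sketch on a made-up array (`X_demo` is not part of the dataset above) where one column contains an outlier. The mean is pulled toward the outlier, while the median is not:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data: the second column contains an outlier (100.0)
X_demo = np.array([[1.0, 2.0],
                   [1.0, np.nan],
                   [np.nan, 100.0],
                   [7.0, 4.0]])

# Mean imputation: each NaN is replaced by its column mean
X_mean = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(X_demo)

# Median imputation: each NaN is replaced by its column median
X_median = SimpleImputer(missing_values=np.nan, strategy='median').fit_transform(X_demo)

# Most-frequent imputation: each NaN is replaced by the column mode
X_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent').fit_transform(X_demo)

print(X_mean[1, 1])    # mean of 2, 100, 4
print(X_median[1, 1])  # median of 2, 100, 4
print(X_mode[2, 0])    # most frequent value of 1, 1, 7
```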

### ***Imputing using fillna Pandas method***

In [11]:
df.fillna(df.mean())

,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.5,8.0
2,10.0,11.0,12.0,6.0


### **SimpleImputer is part of the transformer API in scikit-learn**

## **Classifiers used in chapter 3 belong to the so-called estimators in scikit-learn**
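
A transformer learns its parameters in `fit` and applies them in `transform`. The sketch below, using made-up arrays `X_train` and `X_new`, shows that the column means are learned from the training data only and then reused to fill gaps in new data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative training data and new data (both hypothetical)
X_train = np.array([[1.0, 2.0],
                    [3.0, np.nan],
                    [5.0, 6.0]])
X_new = np.array([[np.nan, 10.0],
                  [7.0, np.nan]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train)                      # learn the column means from X_train
X_new_imputed = imputer.transform(X_new)  # fill NaNs in X_new with those means

print(imputer.statistics_)  # the learned column means
print(X_new_imputed)
```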

# **Handling categorical data (ordinal, nominal)**

## **Ordinal data can be sorted or ordered, like T-shirt sizes; nominal data has no inherent order, like T-shirt colors**

## **Creating new DataFrame**

In [12]:
import pandas as pd
df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2']])
df.columns = ['color','size','price','classlabel']
df

,color,size,price,classlabel
0,green,M,10.1,class2
1,red,L,13.5,class1
2,blue,XL,15.3,class2


### **The DataFrame above has a nominal feature (color), an ordinal feature (size), and a numerical feature (price)**

### **The learning algorithms for classification in this book do not use ordinal information in class labels**

### **Mapping ordinal features**

In [13]:
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)
df

,color,size,price,classlabel
0,green,1,10.1,class2
1,red,2,13.5,class1
2,blue,3,15.3,class2


### **transforming integer values back to original string**

In [14]:
inv_size_mapping = {v: k for k , v in size_mapping.items()}
df['size'].map(inv_size_mapping)

0     M
1     L
2    XL
Name: size, dtype: object
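
As an alternative to a hand-written dictionary, scikit-learn's `OrdinalEncoder` can apply an explicit category order. This is a sketch of a different technique than the one used above, and it numbers the categories from 0 rather than 1; the `sizes` DataFrame is rebuilt here only for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative copy of the size column
sizes = pd.DataFrame({'size': ['M', 'L', 'XL']})

# The order of the list defines the integer codes: M -> 0, L -> 1, XL -> 2
encoder = OrdinalEncoder(categories=[['M', 'L', 'XL']])
encoded = encoder.fit_transform(sizes[['size']])

# inverse_transform recovers the original strings
decoded = encoder.inverse_transform(encoded)

print(encoded.ravel())
print(decoded.ravel())
```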

### **Many machine learning libraries require class labels encoded as integers. Class labels are not ordinal, so it doesn't matter which integer we assign to each label**

In [15]:
import numpy as np
class_mapping = {label: idx for idx, label  in enumerate(np.unique(df['classlabel']))}
class_mapping

{'class1': 0, 'class2': 1}

### **using mapping dictionary to transform class labels to integers**

In [16]:
df['classlabel'] = df['classlabel'].map(class_mapping)
df

,color,size,price,classlabel
0,green,1,10.1,1
1,red,2,13.5,0
2,blue,3,15.3,1


### **reversing key-value pairs to convert class labels back to string**

In [17]:
inv_class_mapping={v: k for k, v in class_mapping.items()}
df['classlabel']=df['classlabel'].map(inv_class_mapping)
df


,color,size,price,classlabel
0,green,1,10.1,class2
1,red,2,13.5,class1
2,blue,3,15.3,class2


### **Using the LabelEncoder class from scikit-learn**

In [18]:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
# fit_transform is a shortcut for calling fit and transform separately
y

array([1, 0, 1])

### **using inverse_transform to change class labels back**

In [19]:
class_le.inverse_transform(y)

array(['class2', 'class1', 'class2'], dtype=object)
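
After fitting, the encoder stores the sorted unique labels in its `classes_` attribute, so the label-to-integer mapping can be inspected directly. A short sketch (the `labels` array repeats the class labels from above):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative copy of the class labels
labels = np.array(['class2', 'class1', 'class2'])

le = LabelEncoder()
y_demo = le.fit_transform(labels)

# The position of each label in classes_ is its integer code
mapping = {str(label): idx for idx, label in enumerate(le.classes_)}

print(le.classes_)
print(mapping)
```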

### **Using LabelEncoder to transform the nominal color column**

In [20]:
X= df[['color','size','price']].values
color_le=LabelEncoder()
X[:,0] = color_le.fit_transform(X[:,0])
X

# color values now encoded as blue = 0, green = 1, and red = 2

array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

## **One-hot encoding: prevents the algorithm from assuming an order among nominal values**

In [21]:
from sklearn.preprocessing import OneHotEncoder
X=df[['color','size','price']].values
color_ohe = OneHotEncoder()
color_ohe.fit_transform(X[:,0].reshape(-1,1)).toarray()

array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

### **Transforming selected columns in a multi-feature array using ColumnTransformer**

In [22]:
from sklearn.compose import ColumnTransformer
X= df[['color', 'size','price']].values
c_transf = ColumnTransformer([('onehot',OneHotEncoder(), [0]),
                              ('nothing', 'passthrough', [1,2])])
# one-hot encode the first column; 'passthrough' leaves the second and third columns unchanged
c_transf.fit_transform(X).astype(float)

array([[ 0. ,  1. ,  0. ,  1. , 10.1],
       [ 0. ,  0. ,  1. ,  2. , 13.5],
       [ 1. ,  0. ,  0. ,  3. , 15.3]])
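
When the input is a DataFrame, ColumnTransformer also accepts column names instead of integer positions. The sketch below assumes a small stand-in DataFrame `df_demo` with an already integer-encoded size column, uses `remainder='passthrough'` for the untouched columns, and sets `sparse_threshold=0` so the result is always a dense array:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative DataFrame with the size column already mapped to integers
df_demo = pd.DataFrame({'color': ['green', 'red', 'blue'],
                        'size': [1, 2, 3],
                        'price': [10.1, 13.5, 15.3]})

# Select the column to encode by name; pass all other columns through
ct = ColumnTransformer([('onehot', OneHotEncoder(), ['color'])],
                       remainder='passthrough',
                       sparse_threshold=0)

X_out = np.asarray(ct.fit_transform(df_demo), dtype=float)
print(X_out)
```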

### **A convenient way to create dummy features via one-hot encoding is the get_dummies function in pandas (it only converts string columns and leaves all other columns unchanged)**

In [23]:
pd.get_dummies(df[['price','color','size']])

,price,size,color_blue,color_green,color_red
0,10.1,1,0,1,0
1,13.5,2,0,0,1
2,15.3,3,1,0,0


### When using one-hot encoded datasets, keep in mind that this introduces multicollinearity, which can be an issue for certain methods (for example, methods that require matrix inversion)

### We can remove the color_blue column without losing information: if color_green = 0 and color_red = 0, the color must be blue
### To do this, use get_dummies with drop_first=True

In [24]:
pd.get_dummies(df[['price','color','size']],drop_first=True)

,price,size,color_green,color_red
0,10.1,1,1,0
1,13.5,2,0,1
2,15.3,3,0,0
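
One way to make the multicollinearity visible is to compare the rank of the feature matrix with its number of columns. In this sketch (made-up data with six rows and an added intercept column of ones), the full one-hot columns sum to the intercept, so the matrix loses a rank; dropping the first dummy column restores full column rank:

```python
import numpy as np
import pandas as pd

# Illustrative color column with each color appearing twice
colors = pd.DataFrame({'color': ['green', 'red', 'blue',
                                 'green', 'red', 'blue']})

full = pd.get_dummies(colors, dtype=float).to_numpy()
reduced = pd.get_dummies(colors, drop_first=True, dtype=float).to_numpy()

# Add an intercept column of ones, as a linear model would
intercept = np.ones((len(colors), 1))
X_full = np.hstack([intercept, full])        # 4 columns
X_reduced = np.hstack([intercept, reduced])  # 3 columns

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # rank is lower than the column count
print(X_reduced.shape[1], rank_reduced)  # rank equals the column count
```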


### To drop the redundant column via OneHotEncoder, set drop='first'

In [25]:
color_ohe = OneHotEncoder(categories = 'auto', drop='first')
c_transf = ColumnTransformer([('onehot',color_ohe, [0]),
                              ('nothing', 'passthrough', [1,2])])
c_transf.fit_transform(X).astype(float)

array([[ 1. ,  0. ,  1. , 10.1],
       [ 0. ,  1. ,  2. , 13.5],
       [ 0. ,  0. ,  3. , 15.3]])

## **Other nominal encoding schemes: binary encoding, count or frequency encoding**
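
These schemes are only named here, so the following is a rough sketch of one common formulation of each, written with plain pandas on a made-up `colors` Series. Count encoding replaces each category with how often it occurs, frequency encoding uses the relative frequency, and binary encoding writes an integer code for each category as binary digits (the `color_bit` column names are hypothetical):

```python
import pandas as pd

# Illustrative nominal feature
colors = pd.Series(['green', 'red', 'blue', 'green', 'green', 'red'],
                   name='color')

# Count encoding: category -> number of occurrences
counts = colors.value_counts()
count_encoded = colors.map(counts)

# Frequency encoding: category -> relative frequency
freq_encoded = colors.map(counts / len(colors))

# Binary encoding: integer code per category (alphabetical order here),
# then one 0/1 column per binary digit of that code
codes = colors.astype('category').cat.codes.astype(int)
n_bits = max(int(codes.max()).bit_length(), 1)
binary_encoded = pd.DataFrame(
    {f'color_bit{b}': (codes // 2 ** b) % 2 for b in range(n_bits)}
)

print(count_encoded.tolist())
print(freq_encoded.tolist())
print(binary_encoded)
```

With three categories, binary encoding needs two columns instead of the three that one-hot encoding produces.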

## **Optional: Encoding ordinal features. Ordinal features can also be encoded using threshold encoding with 0/1 values**

In [27]:
df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue','XL', 15.3, 'class2']])
df.columns=['color', 'size', 'price','classlabel']
df

,color,size,price,classlabel
0,green,M,10.1,class2
1,red,L,13.5,class1
2,blue,XL,15.3,class2


### Using the apply method of a pandas Series with custom lambda expressions to encode these thresholds

In [28]:
df['x > M'] = df['size'].apply(
    lambda x: 1 if x in {'L','XL'} else 0)
df['x > L'] = df['size'].apply(
    lambda x: 1 if x == 'XL' else 0)
del df['size']
df


,color,price,classlabel,x > M,x > L
0,green,10.1,class2,0,0
1,red,13.5,class1,1,0
2,blue,15.3,class2,1,1
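
With more size levels, writing one lambda per threshold becomes repetitive. As a sketch, the same threshold columns can be generated in a loop from an ordered list of categories (`size_order` and `df_demo` are illustrative names, and the order is an assumption supplied by hand):

```python
import pandas as pd

# Illustrative data with one extra row
df_demo = pd.DataFrame({'size': ['M', 'L', 'XL', 'L']})

# Assumed order of the categories, smallest first
size_order = ['M', 'L', 'XL']
rank = {label: i for i, label in enumerate(size_order)}
ranks = df_demo['size'].map(rank)

# One 0/1 column per threshold: is the size strictly larger than this label?
for i, label in enumerate(size_order[:-1]):
    df_demo[f'x > {label}'] = (ranks > i).astype(int)

print(df_demo)
```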
