# Data Readiness

## Dealing With Missing Values
Cleaning data means removing what we don't need and dealing with missing values. There are generally three approaches to handling missing data.
1. The first option is to simply drop any column with missing values. However, this is generally not a good idea unless it's a column we aren't planning to include in our features or labels, since every other value in that column is thrown away too.

In [None]:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

2. The second option is **imputation**, which fills in the missing values with some number. In some cases the value can be inferred (a NaN for car could just be replaced with 0); otherwise it can be filled with the mean value of its column.

In [None]:
import pandas as pd
from sklearn.impute import SimpleImputer

# Imputation using SimpleImputer (fills with the column mean by default)
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train)) # Learns the fill values from X_train and fills it
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid)) # Reuses the training fill values for X_valid

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
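The two fill choices mentioned above (an inferred constant versus the column mean) can be compared on a small made-up frame. This is only a sketch: the `toy` data and its column names are hypothetical, and which strategy fits depends on what a missing value actually means in your dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: a NaN in "num_cars" plausibly means "no car"
toy = pd.DataFrame({
    "num_cars": [1.0, np.nan, 2.0, np.nan],
    "age": [30.0, 40.0, np.nan, 50.0],
})

# Option A: fill each column with its own mean (SimpleImputer's default strategy)
mean_imputer = SimpleImputer(strategy="mean")
toy_mean = pd.DataFrame(mean_imputer.fit_transform(toy), columns=toy.columns)

# Option B: fill with a constant when the missing value can be inferred
zero_imputer = SimpleImputer(strategy="constant", fill_value=0)
toy_zero = pd.DataFrame(
    zero_imputer.fit_transform(toy[["num_cars"]]), columns=["num_cars"]
)

print(toy_mean)
print(toy_zero)
```

Passing `columns=` when building the DataFrame is an alternative to restoring the column names afterwards.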

3. The third method is an extension of the second: impute as before, but also add another column holding a boolean flag that indicates whether each value was imputed or not. In some cases this helps; in others, not so much.

In [None]:
# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new boolean columns indicating what will be imputed (True where isnull() finds a missing value)
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
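As a possible shortcut, `SimpleImputer` accepts an `add_indicator=True` argument that appends the missing-value flags itself. The sketch below uses a hypothetical two-column frame; the `a_was_missing` name is chosen here by hand to mirror the convention above and is not something scikit-learn produces.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: only column "a" has a missing value
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
})

# add_indicator=True appends one 0/1 column per feature that had
# missing values when the imputer was fitted
indicator_imputer = SimpleImputer(add_indicator=True)
result = indicator_imputer.fit_transform(toy)

# Name the columns by hand: the original features first, then the flag for "a"
toy_plus = pd.DataFrame(result, columns=["a", "b", "a_was_missing"])
print(toy_plus)
```

The flags come back as floats (0.0 and 1.0) rather than booleans, because the imputer returns a single numeric array.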

### Determining Columns With Missing Values
We can determine which columns have missing values through the following code:

In [None]:
# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

# Alternatively we can get all the cols with missing values like this
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
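Raw counts can be hard to judge without the row count next to them, so it may also help to look at the fraction of missing values per column. A minimal sketch on a hypothetical frame (the `toy` data is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missing values per column
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

# isnull() gives booleans, so the column mean is the fraction that is missing
missing_fraction = toy.isnull().mean()

# Keep only columns that have missing values, worst first
missing_fraction = missing_fraction[missing_fraction > 0].sort_values(ascending=False)
print(missing_fraction)
```

A column that is mostly empty may be a candidate for dropping, while one with only a few gaps is usually better imputed.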

## Dealing With Categorical Variables 
Categorical variables are ones which take a limited number of values. Think along the lines of questions where the answer options are restricted to things like "never", "rarely", "sometimes", and "often". Other cases include codes such as stock tickers or favorite colors. These variables can be identified by checking each column's datatype. If a column is of type object, it usually holds text rather than integers or floats, and is thus a categorical variable.

In [None]:
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)

Again, there are a few approaches to dealing with these variable types.

1. The first option is, again, to drop them if they don't relate to features you're interested in. However, if they do, you should never drop them.

In [None]:
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

2. The second approach is called **label encoding**. This process assigns each value a different integer. Note that in doing so we establish a hierarchy among the categories. Sometimes this makes sense (like the never to always question types), but in other cases there may not be a clear ordering in the values. Those that do have an order are called **ordinal variables**.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Make copy to avoid changing original data 
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Get a list of all columns with categorical data 
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])
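One caveat on the hierarchy point: `LabelEncoder` numbers the values in sorted (alphabetical) order, which need not match the real-world ranking. The sketch below, using a hypothetical survey column, contrasts that with an explicit ordering supplied through `pd.Categorical`; the `order` list is an assumption about what the answers mean.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical survey answers
answers = pd.Series(["never", "often", "rarely", "sometimes", "often"])

# LabelEncoder sorts the values alphabetically before numbering them,
# so "often" (1) ends up below "rarely" (2)
alphabetical_codes = LabelEncoder().fit_transform(answers)

# Supplying the order explicitly keeps the intended ranking
order = ["never", "rarely", "sometimes", "often"]
ordered_codes = pd.Categorical(answers, categories=order, ordered=True).codes

print(list(alphabetical_codes))
print(list(ordered_codes))
```

For columns that truly have no order, neither numbering is meaningful, which is the motivation for one-hot encoding.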

Note that sometimes the validation set will contain labels the training set does not, and vice versa. In these cases it is best to either drop the columns whose values don't appear in both sets or take them into account some other way.

In [None]:
# All categorical columns
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely label encoded
good_label_cols = [col for col in object_cols if 
                   set(X_train[col]) == set(X_valid[col])]
        
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
        
print('Categorical columns that will be label encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)
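One way to take unseen labels into account "some other way" instead of dropping the column is scikit-learn's `OrdinalEncoder`, which can map values it has not seen to a sentinel. This is a sketch under assumptions: the toy `train` and `valid` frames are hypothetical, `-1` is an arbitrary choice of sentinel, and the `handle_unknown="use_encoded_value"` option requires scikit-learn 0.24 or newer.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: "green" appears only in the validation set
train = pd.DataFrame({"color": ["red", "blue", "red"]})
valid = pd.DataFrame({"color": ["blue", "green"]})

# Values not seen during fit are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_encoded = encoder.fit_transform(train)
valid_encoded = encoder.transform(valid)

print(train_encoded.ravel())
print(valid_encoded.ravel())
```

Whether a shared "unknown" code is acceptable depends on the model; tree-based models tend to tolerate it better than linear ones.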

3. The third approach is called **One-Hot Encoding** (throwback to ECE labs), which takes the column and creates multiple new columns in its place, one per category, each with a value of 0 or 1: a 1 if the row had that value and a 0 if it did not. Because this does not assume any ordering, it suits categorical variables that have no intrinsic ranking, which are referred to as **nominal variables**. One-hot encoding generally does not perform well if the categorical variable takes on a large number of values (>15).

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Get a list of all columns with categorical data 
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

# Apply one-hot encoder to each categorical column, creating a dataframe with ONLY the one-hot encoded categorical data
# Note: sparse_output needs scikit-learn >= 1.2; older versions call this parameter sparse
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to the numerical features by concatenating the two
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
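To see what the encoded output looks like, here is a small sketch using `pd.get_dummies` on a hypothetical frame. It is shown only to make the 0/1 layout visible and is not a replacement for the encoder above: `get_dummies` does not remember the training categories, so applying it separately to training and validation data can yield mismatched columns.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numerical column
toy = pd.DataFrame({
    "color": ["red", "blue", "green", "red"],
    "price": [10, 20, 30, 40],
})

# Replace "color" with one 0/1 column per distinct value
one_hot = pd.get_dummies(toy, columns=["color"], dtype=int)
print(one_hot)

# Each row has exactly one 1 among the new color columns
color_cols = [col for col in one_hot.columns if col.startswith("color_")]
print(one_hot[color_cols].sum(axis=1).tolist())
```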

### Understanding Cardinality 
This relates to both one-hot and label encoding. The cardinality of a categorical variable is the number of unique entries it contains. A column with two unique entries would have a cardinality of 2. We can get the cardinality of each categorical column like so:

In [None]:
# Get number of unique entries in each column with categorical data
# object_cols is a previously defined variable containing all the categorical columns in the dataset
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])
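Cardinality can then be used to decide which encoding each column gets. The sketch below splits columns by a threshold; the toy frame and the `max_cardinality` value of 3 are hypothetical (on real data a threshold nearer the 15 mentioned earlier may be more sensible).

```python
import pandas as pd

# Hypothetical toy data: "size" has few distinct values, "ticker" has many
toy = pd.DataFrame({
    "size": ["S", "M", "L", "M"],
    "ticker": ["AAPL", "MSFT", "GOOG", "AMZN"],
})
categorical_cols = ["size", "ticker"]

# Assumed threshold for this tiny example
max_cardinality = 3

# Low-cardinality columns are candidates for one-hot encoding
low_cardinality_cols = [
    col for col in categorical_cols if toy[col].nunique() <= max_cardinality
]

# The rest could be label encoded or dropped instead
high_cardinality_cols = [
    col for col in categorical_cols if col not in low_cardinality_cols
]

print("One-hot encode:", low_cardinality_cols)
print("Label encode or drop:", high_cardinality_cols)
```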