<br><br>
<span style="font-size:2em;font-weight:lighter;">194.025 Introduction to Machine Learning</span><br>
<span style="font-size:3em;font-weight:normal;line-height:70%;">Assignment 1: Data & Pre-processing</span>

---



Welcome to the first assignment of our course **Introduction to Machine Learning**. You will be able to earn up to a total of 10 points. Please read all descriptions carefully to get a full picture of what you have to do. 

**Remark:** Some code cells are set to read-only. Please execute them regardless, as they contain important code. You can run a Jupyter cell by pressing `SHIFT + ENTER`, or by pressing the play button at the top (in the row where you can find the save button). Cells where you have to implement code contain the comment `# YOUR CODE HERE` followed by `raise NotImplementedError`. Simply remove the `raise NotImplementedError` and insert your code.

Some other code cells start with the comment `# hidden tests ...`. Please do not change them in any way as they are used to grade the tasks after your submission.

This part is meant to be a gentle introduction to working with Python and Jupyter, and to provide some hands-on examples of how to deal with data. In this case, we look at a data set where the instances (rows) are diamonds and the features (columns) describe properties such as their carat weight, cut, and price.

In [1]:
import numpy as np
from pandas import read_csv

In [2]:
# load our data set for this exercise
data = read_csv("https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv")

## Data Analysis

We want to investigate what our data set looks like, including size, data types, and missing values. The next cells should give you a very brief overview of the data set.

In [3]:
# print shape of data (rows, columns)
print("Shape of data: " + str(data.shape))

Shape of data: (6000, 8)


In [4]:
# print column names and data types
print(data.dtypes)

Carat Weight    float64
Cut              object
Color            object
Clarity          object
Polish           object
Symmetry         object
Report           object
Price             int64
dtype: object


In [5]:
# print first rows of data set
print(data[:5])

   Carat Weight    Cut Color Clarity Polish Symmetry Report  Price
0          1.10  Ideal     H     SI1     VG       EX    GIA   5169
1          0.83  Ideal     H     VS1     ID       ID   AGSL   3470
2          0.85  Ideal     H     SI1     EX       EX    GIA   3183
3          0.91  Ideal     E     SI1     VG       VG    GIA   4370
4          0.83  Ideal     G     SI1     EX       EX    GIA   3171


In [6]:
# investigate missing values
print(data.isnull().sum())

Carat Weight    0
Cut             0
Color           0
Clarity         0
Polish          0
Symmetry        0
Report          0
Price           0
dtype: int64


#### Missing Values

Fortunately, our data set does not contain any missing values. This is, however, not ideal for practicing how to deal with missing values. Therefore, we randomly delete some of the values :)
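
As a rough sketch of how such random deletion can work, the toy example below blanks out entries of a small DataFrame wherever a random draw falls below a threshold. The frame `toy`, the seed, and the 30% threshold are illustrative assumptions and not part of the assignment.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, not the diamond data set
toy = pd.DataFrame({"a": np.arange(10.0), "b": list("abcdefghij")})

# a seeded generator makes the random pattern reproducible (seed chosen arbitrarily)
rng = np.random.default_rng(0)

# boolean array with the same shape as the frame; True marks entries to blank out
drop = rng.random(toy.shape) < 0.3

# .mask() replaces every entry where the condition is True with a missing value
toy_missing = toy.mask(drop)

print(toy_missing.isnull().sum())
```

Without a fixed seed, the number of missing values per column changes on every run, which is why the counts printed in a notebook like this one will likely differ from run to run.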

In [7]:
data_help = data[['Cut', 'Price', 'Carat Weight']].mask(np.random.random(data[['Cut', 'Price', 'Carat Weight']].shape) < .05)
data['Cut'] = data_help['Cut']
data['Price'] = data_help['Price']
data['Carat Weight'] = data_help['Carat Weight']

In [8]:
# let's check again for missing values
print(data.isnull().sum())

Carat Weight    335
Cut             298
Color             0
Clarity           0
Polish            0
Symmetry          0
Report            0
Price           276
dtype: int64


## Data Pre-Processing

Perfect :) Now that we have a data set with mixed data types and missing values, we are ready to look into the first set of tasks. For this, you will have to implement a number of different functions to process your data correctly.

#### Dealing with Missing Values (2 Points)

You will now need to deal with the missing values that we introduced to the data set previously. As a quick reminder, we know of multiple strategies:
- Delete rows or columns that contain any missing values
- Impute missing numeric values with the mean, min, max, ... of the corresponding column or a dummy value
- Impute missing categorical values with a dummy value

For this exercise, please do not delete any columns! Complete and use the below functions to impute the missing values accordingly.

**Remark:** Pandas DataFrames come with a number of useful functions. One of them is the `.fillna()` method, where you can specify what you want to impute the missing values with. E.g., you could say `.fillna(mean)` or `.fillna('???')`.
In addition to this, you can use any `numpy` functionality such as `np.mean()`.
To access a specific column in your data set, you can do so with `data['Column Name']`. This way, you can, e.g., calculate the minimum of a column with `np.min(data['Column Name'])`. You can also overwrite the column with new values as long as the size fits (see above where we introduced random missing values).
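
The strategies listed above can be sketched on a small example. The frame `toy` and its column names are made up for illustration; the calls `dropna()`, `fillna()`, and `.mean()` are standard Pandas methods.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "weight": [1.0, np.nan, 3.0, 2.0],
    "cut": ["Good", "Ideal", None, "Good"],
})

# strategy 1: drop every row that contains at least one missing value
dropped = toy.dropna(axis=0)

# strategy 2: impute the numeric column with its mean (.mean() skips NaN by default)
mean_weight = toy["weight"].mean()
filled = toy.copy()
filled["weight"] = filled["weight"].fillna(mean_weight)

# strategy 3: impute the categorical column with a dummy value
filled["cut"] = filled["cut"].fillna("???")

print(len(dropped), mean_weight)
print(filled)
```

Two details are worth keeping in mind. By default, `.fillna()` returns a new object rather than changing the column in place, so the result has to be assigned back. Also, on a plain NumPy array that contains `NaN`, `np.mean()` returns `nan`, whereas `np.nanmean()` ignores the missing entries.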

In [9]:
def impute_column(df, column_name, imp_value):
    """
    Takes the name of a column and a value to impute missing values with
    and overwrites the missing values of said column. This is done with
    the .fillna() method, which returns a filled copy of the column; the
    result is assigned back so that the underlying data set is updated.
    
    Parameters
    ----------
    df : DataFrame
        our data set
    column_name : str
        the name of the column you want to handle
    imp_value
        the value you want to use to overwrite missing values with
    """
    
    df[column_name] = df[column_name].fillna(imp_value)

In [10]:
# use the above function to impute all missing values with values of your choice

impute_column(data, 'Carat Weight', 0.123)
impute_column(data, 'Cut', 'Ideal')
impute_column(data, 'Price', 3000)

# let's see if you managed to impute all missing values
print(data.isnull().sum())

Carat Weight    0
Cut             0
Color           0
Clarity         0
Polish          0
Symmetry        0
Report          0
Price           0
dtype: int64


In [11]:
# hidden tests - DO NOT CHANGE THIS CELL

#### Scaling your Features (4 Points)

Now that we have no more missing values, we can start to apply some of the pre-processing steps we have heard about in the lecture. As we have a data set with mixed data types (numeric and categorical), we also have to keep in mind that we need to treat them differently. Let's start with the numeric features and scale them.

For this task, you will have to implement your own **min-max scaler**. Remember that min-max scaling is given by

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

in case you only want to scale in the range [0,1], or by

$$x' = a + \frac{(x - \min(x))(b - a)}{\max(x) - \min(x)}$$

in case you want to scale in a range [a,b].
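
To make the formulas concrete, here is a small worked example. For the vector (2, 4, 6, 10) we have min = 2 and max = 10, so scaling to [0,1] gives (0, 0.25, 0.5, 1) and scaling to [10,30] gives (10, 15, 20, 30). The function name `min_max_sketch` and the guard for constant vectors are assumptions made for this sketch; the formulas above do not define what should happen when max(x) = min(x).

```python
import numpy as np

def min_max_sketch(x, a=0.0, b=1.0):
    """Illustrative min-max scaler to the range [a, b]."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.min(x), np.max(x)
    if hi == lo:
        # the formula would divide by zero here; mapping everything
        # to a is just one possible convention (an assumption)
        return np.full_like(x, a)
    return a + (x - lo) * (b - a) / (hi - lo)

v = np.array([2.0, 4.0, 6.0, 10.0])
print(min_max_sketch(v))          # values 0, 0.25, 0.5, 1
print(min_max_sketch(v, 10, 30))  # values 10, 15, 20, 30
```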

In [12]:
# let's create a random vector first
x = np.random.uniform(low=1, high=500, size=(50,))
print(x)

[404.48900122 102.50152415  12.86643321 265.56276686 150.74994165
 480.47336721 136.93728258 169.39661888 383.82726601 213.2095866
 445.62997417 188.78022875 240.95514068 440.02189272 337.18530432
 463.81624409 494.40417357 205.51574405 479.38660978  46.79645648
 466.83746437  63.17930053 197.28324004 491.56842159 158.38705508
 222.08372638 422.53047725 196.85226636 223.75085424 239.80548014
 251.71161798 281.28671633 221.12620451 289.84732424 114.24740396
 187.5192798  496.62130728  84.6714433  430.79084084 294.7646887
 182.48357055 241.88519432 358.09690827 360.5972108  137.13974554
 189.51586061  97.93724777  73.22605152 382.78072826  57.69294424]


In [13]:
def min_max_scale(x, a, b):
    """
    Parameters
    ----------
    x : np.array
        data vector
    a : int or float
        lower bound of the min-max range
    b : int or float
        upper bound of the min-max range

    Returns
    -------
    x : np.array
        scaled data vector
    """

    minNum = np.min(x)
    maxNum = np.max(x)
    x = a + (x - minNum) / (maxNum - minNum) * (b - a)
    
    return x

In [14]:
# test your solution by scaling x to a range [10,30]

x = min_max_scale(x, 10, 30)

print("Min: " + str(np.min(x)) + "\nMax: " + str(np.max(x)))

Min: 10.0
Max: 30.0


In [15]:
# hidden tests - DO NOT CHANGE THIS CELL

In [16]:
# test your solution by scaling x to a range [0,1]

x = min_max_scale(x, 0, 1)

print("Min: " + str(np.min(x)) + "\nMax: " + str(np.max(x)))

Min: 0.0
Max: 1.0


In [17]:
# hidden tests - DO NOT CHANGE THIS CELL

In [18]:
# now apply your min-max scaler to our data set by scaling each numeric feature to a range [0,1]

# FOR ME: DataFrame.apply() calls the given function once per column (axis=0 is the default), so each numeric column is scaled independently
data[['Carat Weight', 'Price']] = data[['Carat Weight', 'Price']].apply(
    lambda numeric_col: min_max_scale(numeric_col.values, 0, 1)
)

In [19]:
# hidden tests - DO NOT CHANGE THIS CELL

In [20]:
# hidden tests - DO NOT CHANGE THIS CELL

#### One-Hot-Encoding of Categorical Features (4 Points)

Now that we have dealt with our numeric data, let's look at the categorical features. As we want to transform all of our data to numeric vectors, we will need to encode our categorical features appropriately. For this, we have heard about **one-hot-encoding**.

As a quick reminder: One-hot-encoding first looks at how many unique values a certain feature has. Then, we create as many new columns as we have unique values and populate them with all 0s. Finally, we update each 0 to a 1, where the respective value matches with the column name.

**Remark:** NumPy offers the function `np.unique()` to get all unique values of a vector. You can also change your NumPy array to a simple list by casting `list(array)`. Also, you can create new NumPy arrays filled with 0s using `np.zeros((a,b))` where `a` and `b` indicate the dimensions (rows, columns) of the resulting array. You can access any desired value with `array[i,j]` where `i` and `j` are the respective positions of your element.
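
The following sketch shows how these building blocks fit together. It is a vectorized variant based on the `return_inverse` option of `np.unique()`, not necessarily the approach the task expects. Note that `np.unique()` returns the unique values in sorted order, so the columns of the result follow that sorted order rather than the order in which values first appear.

```python
import numpy as np

x = np.array(["Austria", "Italy", "Germany", "Austria", "France"])

# unique values come back sorted; inverse[i] is the column index of x[i]
categories, inverse = np.unique(x, return_inverse=True)

# start with all zeros, then set exactly one entry per row to 1
x_enc = np.zeros((len(x), len(categories)), dtype=int)
x_enc[np.arange(len(x)), inverse] = 1

print(categories)
print(x_enc)
```

Here the columns correspond to Austria, France, Germany, and Italy, in that order, so the row for 'Italy' has its 1 in the last column.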

In [21]:
# here is a quick example

# this is our original feature vector
x1 = np.array(['Austria', 'Italy', 'Germany', 'Austria', 'Austria', 'France', 'Italy'])
print(x1)

# and we want to transform it into this
x2 = np.array([[1, 0, 0, 0],
               [0, 1, 0, 0],
               [0, 0, 1, 0],
               [1, 0, 0, 0],
               [1, 0, 0, 0],
               [0, 0, 0, 1],
               [0, 1, 0, 0]])
print(x2)

['Austria' 'Italy' 'Germany' 'Austria' 'Austria' 'France' 'Italy']
[[1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [1 0 0 0]
 [1 0 0 0]
 [0 0 0 1]
 [0 1 0 0]]


In [22]:
def one_hot_encode(x):
    """
    Parameters
    ----------
    x : np.array
        data vector containing categorical values
        shape (N,) where N is the number of data points

    Returns
    -------
    x_enc : np.array
        one-hot-encoded data vector
        shape (N,D) where D is the number of unique values
        of the original vector
    """
    
    x_unique = np.unique(x)

    x_enc = np.zeros((len(x), len(x_unique)), dtype=int)
    for i, val in enumerate(x):
        # position of val among the sorted unique values = column to set to 1
        category_idx = np.where(x_unique == val)[0][0]
        x_enc[i, category_idx] = 1
    
    return x_enc

In [23]:
# test your solution by encoding the following vector
x = np.array(['a', 'b', 'a', 'c', 'c', 'd', 'f', 'b', 'c', 'a', 'a', 'e', 'd', 'e', 'c', 'f'])
x_enc = one_hot_encode(x)
print(x_enc)

[[1 0 0 0 0 0]
 [0 1 0 0 0 0]
 [1 0 0 0 0 0]
 [0 0 1 0 0 0]
 [0 0 1 0 0 0]
 [0 0 0 1 0 0]
 [0 0 0 0 0 1]
 [0 1 0 0 0 0]
 [0 0 1 0 0 0]
 [1 0 0 0 0 0]
 [1 0 0 0 0 0]
 [0 0 0 0 1 0]
 [0 0 0 1 0 0]
 [0 0 0 0 1 0]
 [0 0 1 0 0 0]
 [0 0 0 0 0 1]]


In [24]:
# hidden tests - DO NOT CHANGE THIS CELL

Our function only works for a single column and cannot be applied directly to a Pandas DataFrame. In practice, however, we usually want to use existing library functions anyway. Thus, we will now one-hot-encode the categorical features of our data set using the Pandas function `pd.get_dummies()`.

We use this method by specifying a set of parameters:
- `data`: our data set
- `columns`: a list of the column names that we want to encode
- `prefix`: a list of prefixes that we want to add to the new column names (optional)

In the end, the call will look something like this: `pd.get_dummies(data=data, columns=['aaa', 'bbb', 'ddd'], prefix=['a', 'b', 'd'])`.
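
As a minimal sketch of such a call, the example below encodes one categorical column of a toy frame. The frame `toy`, its column names, and the prefix `c` are made up for illustration. The `dtype=int` argument is optional; it is used here to get 0/1 columns, since recent Pandas versions return boolean columns by default.

```python
import pandas as pd

# hypothetical toy frame; column names are made up for illustration
toy = pd.DataFrame({
    "size": [1.0, 2.0, 3.0],
    "color": ["red", "green", "red"],
})

# encode only 'color'; columns that are not listed are left untouched
encoded = pd.get_dummies(data=toy, columns=["color"], prefix=["c"], dtype=int)

print(encoded)
```

The new columns are named `<prefix>_<value>` and are appended after the untouched columns, while the original `color` column is removed.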

In [25]:
# import pandas
import pandas as pd

# transform the remaining categorical features using the above described function

data = pd.get_dummies(data=data, columns=['Cut', 'Color', 'Clarity', 'Polish', 'Symmetry', 'Report'])

print(data[:5])

   Carat Weight     Price  Cut_Fair  Cut_Good  Cut_Ideal  Cut_Signature-Ideal  \
0      0.350556  0.030037     False     False       True                False   
1      0.253678  0.012941     False     False       True                False   
2      0.260854  0.010053     False     False       True                False   
3      0.282382  0.008211     False     False       True                False   
4      0.253678  0.009932     False     False       True                False   

   Cut_Very Good  Color_D  Color_E  Color_F  ...  Polish_EX  Polish_G  \
0          False    False    False    False  ...      False     False   
1          False    False    False    False  ...      False     False   
2          False    False    False    False  ...       True     False   
3          False    False     True    False  ...      False     False   
4          False    False    False    False  ...       True     False   

   Polish_ID  Polish_VG  Symmetry_EX  Symmetry_G  Symmetry_ID  Symmetry_VG

In [26]:
# hidden tests - DO NOT CHANGE THIS CELL