<div class='bar_title'></div>

*Enterprise AI*

# Assignment 1 - Introduction to Machine Learning

Gunther Gust / Justus Ameling<br>
Chair of Enterprise AI

Summer Semester 24

<img src="https://github.com/GuntherGust/tds2_data/blob/main/images/d3.png?raw=true" style="width:20%; float:left;" />

In this assignment, you are provided with a dataset containing information about a local housing market. The goal of this assignment is to develop a machine-learning model that can predict the price of a house given a set of input features. The assignment is divided into three tasks:
- Data Preprocessing
- Model Development
- Model Evaluation

## Data Preprocessing

We first need to import the necessary libraries for our data preprocessing task. We start with the `pandas` package, which can be imported by running the command `import pandas as pd`. The `as` keyword gives the package an alias, making it easier to refer to the package later in the code.

In [None]:
import pandas as pd

Next, we will load our dataset, which is stored as a CSV (comma-separated values) file. This can be done using the command `pd.read_csv("./folder/filename.csv")`. The `./` indicates that the path starts in the current directory. If the file is stored in a different directory, you can adjust the path accordingly. As a result of this command, the dataset is loaded into a pandas DataFrame.
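
As a minimal sketch of how `pd.read_csv` behaves, the example below reads a small, made-up CSV from memory instead of from disk. The content of `csv_text` and the name `toy_data` are illustrative assumptions and have nothing to do with the actual `Housing.csv` file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "price,area,mainroad\n100,50,yes\n200,80,no\n"

# read_csv accepts a path or a file-like object; StringIO avoids needing a real file.
toy_data = pd.read_csv(io.StringIO(csv_text))

print(toy_data.shape)    # number of rows and columns
print(toy_data.columns)  # column names taken from the header row
```

With a real file, the first argument would simply be the path string instead of the `StringIO` object.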

In [None]:
housing_data = # Load the Housing.csv file as a pandas dataframe

### Data Sampling
The first step for our data science task is to split our dataset into a feature set and a target variable. Our target variable is the house price and can be extracted using the command `housing_data["price"]`, which returns the DataFrame column as a pandas Series. Our feature set contains the remaining columns of the DataFrame. We can extract them by using the command `housing_data.drop("price", axis=1)`, which returns a new DataFrame without the column "price".
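
The following sketch illustrates the same two operations on a tiny, invented DataFrame. The names `toy_data`, `toy_target`, and `toy_features` are assumptions made for this example only.

```python
import pandas as pd

# Invented data; only the column name "price" mirrors the assignment.
toy_data = pd.DataFrame(
    {"price": [100, 200, 300], "area": [50, 80, 120], "mainroad": ["yes", "no", "yes"]}
)

# Selecting a single column with square brackets returns a Series.
toy_target = toy_data["price"]

# drop returns a new DataFrame; toy_data itself is left unchanged.
toy_features = toy_data.drop("price", axis=1)

print(type(toy_target).__name__, toy_target.shape)
print(list(toy_features.columns))
```

Because `drop` does not modify its input by default, the original DataFrame still contains the `price` column afterwards.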

In [None]:
Y = # Extract the price column from the dataFrame
X = # Extract the remaining columns from the dataFrame

In [None]:
### TEST ###
print(f"TEST X: {X.shape == (545, 12)}\t TEST Y: {Y.shape == (545,)}")
### TEST ###

Next, we need to create a train and a test set. This can be done by using the `train_test_split` function from the `sklearn.model_selection` package. The `train_test_split` function receives the features and the target variable as input and returns four objects: the train and test features (DataFrames) and the train and test targets (Series). We can control the size of the test set with the `test_size` parameter. Finally, the `random_state` parameter sets the seed for the random number generator, ensuring the split is reproducible. <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split">Documentation</a>
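
To make the return values concrete, here is a small sketch on invented data. The 70/30 split, the seed `42`, and all variable names are arbitrary choices for illustration and differ from the values requested in the task.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten invented samples with one feature and one target.
toy_X = pd.DataFrame({"area": range(10)})
toy_y = pd.Series(range(10), name="price")

# The function returns the splits in the order: X train, X test, y train, y test.
a_train, a_test, b_train, b_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42
)

print(len(a_train), len(a_test))
# Features and targets are split consistently, so their indices match.
print(a_train.index.equals(b_train.index))
```

Running the cell again with the same `random_state` yields the same rows in each split.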

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = # Create a train-test split with 80% of the data for training and a random state of 0

In [None]:
### TEST ###
print(f"train_test_split: {Y_train.sum() == 2083048940}")
### TEST ###

### Data Inspection
From now on, we will only work with our train dataset so that we are not biased during our modeling task.<br>
Let us now inspect our dataset. To do so, we look at the first five rows of our DataFrame using the command `X_train.head()`. This gives us an overview of the columns in our dataset.

In [None]:
# show the first five rows of the X_train dataFrame

To get some general information about the shape of our dataset, we can use the `shape` attribute of the dataFrame. The shape attribute returns the number of rows and columns in our dataset.

In [None]:
# Print out the number of rows and columns in the X_train dataFrame

Let's create some statistical information about our dataset. We can do this using the `describe()` function, which calculates some statistical parameters (e.g., mean or standard deviation). By default, it will only return information about numerical columns.

In [None]:
# Apply the describe function on the numerical columns of the X_train dataFrame

We can use the `describe()` function with the parameter `include='object'` to receive statistical information about the non-numerical columns. 
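
The sketch below shows both variants of `describe()` on an invented DataFrame. The column names and values are assumptions; the text column is cast to `object` explicitly so that the example behaves the same across pandas versions.

```python
import pandas as pd

# Invented data with one numerical and one text column.
toy = pd.DataFrame(
    {"area": [50.0, 80.0, 110.0], "furnishing": ["furnished", "furnished", "unfurnished"]}
).astype({"furnishing": "object"})

# Default: count, mean, std, min, quartiles, and max for numerical columns.
numeric_summary = toy.describe()

# include="object": count, unique, top (most frequent value), and freq.
object_summary = toy.describe(include="object")

print(numeric_summary)
print(object_summary)
```

The `top` row of the second summary is the most frequent value of each non-numerical column, and `freq` is how often it occurs.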

In [None]:
# Apply the describe function on the non-numerical columns of the X_train dataFrame

In [None]:
# What is the most frequent value for the feature 'furnishingstatus'?
# Answer:

### Handling Missing Values

The next step is to check for missing values in our dataset. We can do this by using the `isna()` function. This function returns a dataFrame with boolean values, where `True` indicates that the value is missing and `False` indicates that the value is present. To count the number of missing values in each column, we can use the `sum()` function.
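
As an illustration on invented data, chaining `isna()` and `sum()` works because `True` counts as 1 and `False` as 0 when summed. The DataFrame `toy` below is an assumption for this example.

```python
import numpy as np
import pandas as pd

# Invented data with one missing value in "area".
toy = pd.DataFrame({"area": [50.0, np.nan, 110.0], "stories": [1, 2, 3]})

# isna() gives a boolean DataFrame; sum() adds up the True values per column.
missing_per_column = toy.isna().sum()

print(missing_per_column)
```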

In [None]:
# Show the number of missing values for the X_train DataFrame

In [None]:
# Which columns have missing values, and how many do they have?
# Answer:

Now we know that our data includes null values. Since not all machine learning models can handle missing values, we need to deal with them. One option is to remove the rows with missing values. However, by doing so we might lose valuable information. Another option is to impute them. First, let us import the necessary class from sklearn by running the command `from sklearn.impute import SimpleImputer`. You can find more information about the imputation of missing values <a href="https://scikit-learn.org/stable/modules/impute.html#impute">here</a>.

In [None]:
# Import the SimpleImputer class from the sklearn library

Next, we will create our imputers. To do this, we must pass a strategy defining how the missing values are imputed. For our numeric column `area`, we will use the `mean` strategy, and for our categorical column `furnishingstatus`, we will use the `most_frequent` strategy.

In [None]:
numerical_imputer = SimpleImputer(strategy='mean')
categorical_imputer = # Create a SimpleImputer object with the strategy as 'most_frequent'

To apply the imputer to our dataset, we can use the `fit_transform()` function. This function will impute the missing values and return a numpy array. To convert the numpy array back to a pandas dataFrame, we can use the `pd.DataFrame()` function. As additional parameters, we need to hand over the columns and the index of the original dataFrame.
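
The sketch below walks through these steps for a text column on invented data. The DataFrame `toy`, its non-default index, and the name `imputer_demo` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented data with a non-default index and one missing entry.
toy = pd.DataFrame(
    {"furnishing": ["furnished", np.nan, "furnished", "unfurnished"]},
    index=[10, 11, 12, 13],
    dtype="object",
)

imputer_demo = SimpleImputer(strategy="most_frequent")

# fit_transform expects 2D input (hence the double brackets) and returns a numpy array.
imputed_array = imputer_demo.fit_transform(toy[["furnishing"]])

# Re-attach the original index and column name.
imputed = pd.DataFrame(imputed_array, index=toy.index, columns=["furnishing"])

print(imputed)
```

Passing the original index matters because pandas aligns on index labels when the imputed column is assigned back. With a fresh default index (0, 1, 2, ...), the values would not line up with the shuffled rows of a train split.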

In [None]:
numeric_imputed_values = pd.DataFrame(
    numerical_imputer.fit_transform(X_train[["area"]]),index=X_train.index, columns=["area"]
    )
categorical_imputed_values = #apply the imputer on the 'furnishingstatus' column and convert the result to a dataFrame

Now, let us bring the imputed data back to the original dataFrame. We can do this by assigning the imputed dataFrame to the `area` and `furnishingstatus` columns.

In [None]:
X_train["area"] = numeric_imputed_values
X_train["furnishingstatus"] = categorical_imputed_values

Finally, we can verify that our dataset has no missing values by using the `isna()` function and the `sum()` function.

In [None]:
# Show the number of missing values for the X_train dataFrame to prove that the missing values have been filled

### Scaling the Data
Our next preprocessing step is to scale our data, which can be helpful for some models. Before we do this, let us first identify the numerical and the non-numerical columns in our dataset. We can do this by using the `select_dtypes()` function, with the parameter `include='object'` for the non-numerical columns and `exclude='object'` for the numerical columns.
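
A small sketch of `select_dtypes()` on invented data follows. The column names are assumptions, and the text column is cast to `object` explicitly so the selection works the same across pandas versions.

```python
import pandas as pd

# Invented data: two numerical columns and one text column.
toy = pd.DataFrame(
    {"area": [50.0, 80.0], "bedrooms": [2, 3], "mainroad": ["yes", "no"]}
).astype({"mainroad": "object"})

# .columns turns the selected sub-DataFrame into an Index of column names.
toy_categorical = toy.select_dtypes(include="object").columns
toy_numerical = toy.select_dtypes(exclude="object").columns

print(list(toy_categorical))
print(list(toy_numerical))
```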

In [None]:
categorical_columns = X_train.select_dtypes(include="object").columns
numerical_columns = # Select the numerical columns from the X_train dataFrame

In [None]:
### TEST ###
print(f" Number of numerical columns: {len(numerical_columns) == 5}, Number of categorical columns: {len(categorical_columns) == 7}")
### TEST ###

Now, we can scale our numerical columns. To do this, we need to import the `StandardScaler` from the `sklearn.preprocessing` package. <a href="https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling">Here </a> you can find more information about scaling.

In [None]:
# Import the StandardScaler class from the sklearn library

Afterward, we can create our scaler and apply `fit_transform` to the numerical columns. To bring the scaled data back into a DataFrame, we can again use the `pd.DataFrame()` function.

Finally, we can set our scaled data as our numerical columns in our original dataFrame.
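
The following sketch standardizes one invented column. The names `toy`, `scaler_demo`, and `scaled` are assumptions for this example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented data with a non-default index.
toy = pd.DataFrame({"area": [50.0, 80.0, 110.0]}, index=[3, 7, 9])

scaler_demo = StandardScaler()

# fit_transform subtracts the column mean and divides by the column standard deviation.
scaled = pd.DataFrame(
    scaler_demo.fit_transform(toy), index=toy.index, columns=toy.columns
)

print(scaled["area"].mean())       # close to 0
print(scaled["area"].std(ddof=0))  # close to 1
print(scaled["area"].std())        # pandas default (ddof=1) is slightly larger
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), whereas the pandas `std()` method uses the sample standard deviation (`ddof=1`) by default. This is why `std()` on a scaled column returns a value slightly above 1.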

In [None]:
scaler = # Create a StandardScaler object

scaled_values = # Apply the scaler on the numerical columns of the X_train dataFrame and convert the result to a dataFrame 
X_train[numerical_columns] = scaled_values

In [None]:
### TEST ###
print(f"Scaler: {abs(X_train.area.std() - 1.0011487654563194) < 1e-9}")
### TEST ###

### Handling Categorical Variables

One more step remains before we can train our machine-learning model: handling categorical variables. Here we will use the `OneHotEncoder` from the `sklearn.preprocessing` package. First, we must import the class by running the command `from sklearn.preprocessing import OneHotEncoder`. <a href="https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features">Here</a> you can find more information about encoding categorical features.

In [None]:
# Import the OneHotEncoder class from the sklearn library

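To transform our categorical values, we need to create our `OneHotEncoder`. When we create our encoder, we need to set the parameter `sparse_output` to `False` (in scikit-learn versions before 1.2, this parameter is called `sparse`). The encoder will then return a numpy array instead of a sparse matrix.

In [None]:
one_hot_encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2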
encoded_values = # Apply the one hot encoder on the categorical columns of the X_train dataFrame and convert the result to a dataFrame # Hint: you can use the get_feature_names_out() method to get the column names 

In [None]:
### TEST ###
print(f"Number of categorical columns: {encoded_values.shape[1]==15}")
### TEST ###

Next, we bring the encoded data back to our DataFrame. To do so, we first drop the categorical columns of the DataFrame using the `drop()` function. As parameters, we need to pass the `categorical_columns`, the parameter `axis=1` to drop columns instead of rows, and the parameter `inplace=True` to apply the changes to the original DataFrame.

Finally, we can concatenate the encoded data to our original DataFrame. To do this, we can use the `pd.concat()` function. As parameters, we need to pass a list containing the original DataFrame and the encoded DataFrame, and the parameter `axis=1` to concatenate the DataFrames column-wise.

In [None]:
# Drop the categorical columns from the X_train dataFrame 
X_train = # concat the X_train dataFrame with the encoded values column-wise
X_train.head()

## Model Training

We have preprocessed our data and are ready to train our machine-learning model. We will use a Random Forest Regressor for this task. First, we must import the necessary class by running the command `from sklearn.ensemble import RandomForestRegressor`. <a href="https://scikit-learn.org/stable/supervised_learning.html" >Here</a> you can find more supervised learning models.

In [None]:
# Import the RandomForestRegressor class from the sklearn library

To create our model, we can call the `RandomForestRegressor()` constructor and save the result in a variable. Additionally, we set the parameter `random_state=0` to ensure reproducibility.

In [None]:
model = # Create the RandomForestRegressor object with a random state of 0

To fit our `model` to the data, we can use the `fit()` function. As input to the function, we must pass our feature set `X_train` and our target variable `Y_train`.

In [None]:
# fit the model on the X_train and Y_train data

After the training process is done, we can use the `predict()` function to predict the price of the houses in our dataset. Let us first calculate the in-sample score of our model. To do so, we compute the predictions on our training dataset.
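
As a minimal sketch of the create, fit, and predict sequence, the example below trains a small forest on synthetic data. The data-generating formula, the choice of `n_estimators=20`, and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: price grows roughly linearly with area, plus noise.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"area": rng.uniform(30, 200, size=50)})
toy_y = 1000 * toy_X["area"] + rng.normal(0, 5000, size=50)

# random_state fixes the randomness inside the forest, so results are reproducible.
forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(toy_X, toy_y)

# predict returns one value per input row as a numpy array.
toy_pred = forest.predict(toy_X)

print(toy_pred.shape)
```

Predicting on the same rows used for fitting gives the in-sample predictions, which usually look better than predictions on unseen data.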

In [None]:
in_sample_prediction = #Use the model to predict the X_train data

To assess our model's predictive power, we need to calculate a score. We use two metrics: the mean absolute error and the mean absolute percentage error. Both can be imported from `sklearn.metrics` by running the command `from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error`.

In [None]:
# Import the mean_absolute_error function and the mean_absolute_percentage_error function from the sklearn library

To calculate the scores, we have to hand over the true values and the predicted values to the functions.
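
A short worked example with invented numbers shows what the two metrics compute. The values in `true_values` and `predicted_values` are assumptions chosen so that the arithmetic is easy to follow by hand.

```python
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

true_values = [100, 200, 400]
predicted_values = [110, 180, 400]

# Absolute errors are 10, 20, and 0, so the mean is 30 / 3 = 10.
mae_demo = mean_absolute_error(true_values, predicted_values)

# Relative errors are 10/100, 20/200, and 0/400, so the mean is 0.2 / 3.
mape_demo = mean_absolute_percentage_error(true_values, predicted_values)

print(mae_demo, mape_demo)
```

Note that `mean_absolute_percentage_error` returns a fraction rather than a percentage, so a result of 0.1 corresponds to an average error of 10%.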

In [None]:
mae = mean_absolute_error(Y_train, in_sample_prediction)
mape = # Calculate the mean absolute percentage error
print(f"Mean Absolute Error: {mae:.2f}\t Mean Absolute Percentage Error: {mape:.2f}")

However, the more interesting part is to evaluate the model on the test dataset. To do so, we must first preprocess the test dataset in the same way as the training dataset. After that, we can evaluate our out-of-sample metrics.<br>
First, we apply the `numerical_imputer` and the `categorical_imputer` to the test dataset. This time, we only use the `transform()` function because the imputers are already fitted on the training dataset. We fit our imputers, scaler, and encoder only on the training dataset so that no information from the test set leaks into the preprocessing. This is also the more realistic scenario, since we cannot fit them on production data either.
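
The difference between `fit_transform()` and `transform()` can be sketched with invented data. The DataFrames `train_demo` and `test_demo` and the name `imputer_demo` are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train_demo = pd.DataFrame({"area": [50.0, np.nan, 110.0]})
test_demo = pd.DataFrame({"area": [np.nan, 200.0]})

imputer_demo = SimpleImputer(strategy="mean")

# fit_transform learns the mean of the training column (80.0) and fills the gap with it.
train_filled = imputer_demo.fit_transform(train_demo)

# transform reuses the training mean; the test value 200.0 does not influence it.
test_filled = imputer_demo.transform(test_demo)

print(train_filled.ravel())
print(test_filled.ravel())
```

The same pattern applies to the scaler and the encoder: call `fit_transform()` on the training data and only `transform()` on the test data.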

In [None]:
numeric_imputed_values = # Apply the numerical imputer on the 'area' column of the X_test dataFrame and convert the result to a dataFrame
categorical_imputed_values = # Apply the categorical imputer on the 'furnishingstatus' column of the X_test dataFrame and convert the result to a dataFrame

X_test["area"] = numeric_imputed_values
X_test["furnishingstatus"] = categorical_imputed_values

Additionally, we also need to scale the values with our previously fitted `scaler`.

In [None]:
scaled_values = # Apply the standard scaler on the numerical columns of the X_test dataFrame and convert the result to a dataFrame
X_test[numerical_columns] = scaled_values

The last preprocessing step is to apply the `one_hot_encoder` with the `transform` function and reassemble the dataFrame.

In [None]:
encoded_values = pd.DataFrame(one_hot_encoder.transform(X_test[categorical_columns]),index=X_test.index,columns=one_hot_encoder.get_feature_names_out())
X_test.drop(categorical_columns, axis=1, inplace=True)
X_test = # Concatenate the X_test dataFrame with the encoded values column-wise
# Print out the first five rows of the X_test dataFrame

Finally, let us calculate the out-of-sample metrics for our model. To do so, we compute the predictions on our test dataset and then calculate the mean absolute error and the mean absolute percentage error.

In [None]:
mape_oos = # Calculate the out-of-sample mean absolute percentage error
mae_oos = # Calculate the out-of-sample mean absolute error
print(f"Mean Absolute Error: {mae_oos:.2f}\t Mean Absolute Percentage Error: {mape_oos:.2f}")

Now, we have successfully trained and evaluated our machine-learning model. 
- What can you say about the model's performance when you compare the in-sample and the out-of-sample metrics?
 - !--Answer--!
- Name three different techniques that could improve the model:
 - !--Answer--!
 - !--Answer--!
 - !--Answer--!
- Name at least one different encoding technique for categorical variables:
 - !--Answer--!
- Name at least one different scaling technique:
 - !--Answer--!


 Note: Have a look at the <a href="https://scikit-learn.org/stable/user_guide.html">scikit-learn</a> documentation.