### __BUSA3020 Group Assignment - Predicting Used Car Sale Prices__

--- 

**Kaggle Competition Ends:** Friday, 31 May 2024 @ 3:00pm (Week 13)  
**Assignment Due Date on iLearn:** Friday, 31 May 2024 @ 11.59pm (Week 13)  

**Overview:**   

- In the group assignment you will form a team of 3 students and participate in a forecasting competition on Kaggle
- The goal is to predict prices of used cars based on car characteristics and regression models

**Instructions:** 

- Form a team of 3 students 
- Each team member needs to join [https://www.kaggle.com](https://www.kaggle.com/)  
- Choose a team leader and form a team on Kaggle [https://www.kaggle.com/t/ff5fb5beaeb14f7686df98fef9d1c0bc](https://www.kaggle.com/t/ff5fb5beaeb14f7686df98fef9d1c0bc)
    - The team leader clicks on `team`, joins, and then invites the other 2 team members to join
    - Your **team's name must start** with our unit code, for instance you could have a team called BUSA3020_algorithm_arena
- All team members should work on all the tasks; however:
    - Each team member will be responsible for one of the 3 tasks listed below    
- **Your predictions must be generated by a model you develop here** 
    - You will receive a mark of **zero** if your code is not able to produce the forecasts you submit to Kaggle 

**Competition Rankings**

The rankings for the competition are determined through two different leaderboards:

- **Public Leaderboard Ranking**: Available during the competition, these rankings are calculated based on 50% of the test dataset, which includes 1,500 observations. This allows participants to see how they are performing while the competition is still ongoing.
- **Final Leaderboard Ranking**: These rankings are recalculated from the other 50% of the test dataset, which consists of the remaining 1,500 observations, and are revealed 5 minutes after the competition concludes. This final evaluation determines the ultimate standings of the competition.



**Marks**: 

- Total Marks: 40
- Your individual mark will consist of:  
    - 50% x overall assignment mark + 45% x mark for the task that you are responsible for + 5% x mark received from your teammates for your effort in group work 
- 1 mark: Ranking in the top 5 positions on the **final** leaderboard for your unit 
- 3 marks: Reaching the 1st place in your unit according to the **final** leaderboard ranking


**Submissions:**  

1. On Kaggle: submit your team's forecast in order to be ranked by Kaggle
    - Limit of 20 submissions per day
2. On iLearn **only team leader to submit** the assignment Jupyter notebook re-named to your team's name on Kaggle   
    - The Jupyter notebook must contain team members' names/ID numbers, and the group name on Kaggle
    - One 15 minute video recording of your work 
        - 5 marks will be deducted from each Task for which there is no video presentation or if you don't follow the above instructions
3. On iLearn each student needs to submit a file with their teammates' names, ID number and a mark for their group effort (out of 100%)
    - You don't need to score yourself

---
---

**Fill out the following information**

- Team Name on Kaggle: `BUSA3020_datanoobs`
- Team Leader and Team Member 1: `Chau Anh Cong`
- Team Member 2: `Tran Tuan Huy Bui`
- Team Member 3: `Thomas Haywood Ruiz`

---
---

**Import Libraries and Dataset**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import numpy as np
#pd.set_option("display.max_rows", None, "display.max_columns", None, "display.width", None) # pretty printing

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_absolute_percentage_error

from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, StackingRegressor
from xgboost import XGBRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

import mlflow
# mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlruns
mlflow.set_tracking_uri("http://127.0.0.1:5000")


In [None]:
df = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

---

## Task 1: Problem Description and Initial Data Analysis

1. Based on the Competition Overview, datasets and additional information provided on Kaggle, along with insights gained from personal research of the topic, write **Problem Description** (about 500 words) focusing on the sections listed below: 
- Forecasting Problem - explain what we are trying to do and how it could be used in the real world, e.g. who and how may benefit from it (3 marks)    
- Evaluation Criteria - discuss the criteria that are used to assess forecasting performance in detail (3 marks)     
- Categorise the variables provided in the dataset according to their type; Hint: similar to what we had in Programming Task 1 (2 marks)  
- Missing Values - only explain what you find for both the training and test datasets at this stage (2 marks)
- Provide and discuss some interesting *univariate* summary statistics and distributions in the training dataset  (2 marks)       
- Other Hints:
    - You should **not** discuss any specific predictive algorithms at this stage
    - Minimise the number of cells you use to enhance presentation and readability

**Total Marks: 12**   

Student in charge of this task: `Thomas Haywood Ruiz`

The primary goal of this project is to develop a model that can accurately predict the prices of listed cars based on vehicle features such as year, horsepower, fuel economy, and power. The analysis will inform stakeholders of the impact of specific features on car prices. Consumers can use the findings to make more informed decisions when buying or selling a vehicle. Car dealerships and online marketplaces can optimise their pricing strategies around the features that have the most impact on car value. Additionally, insurance companies can use this data to set insurance premiums based on car value.

The performance of the forecast model will be assessed using the mean absolute percentage error (MAPE). This metric averages the absolute percentage difference between the predicted and actual prices, so a lower value indicates a better model. To select a suitable model, the MAPE results of different forecast models will be compared, and the model with the lowest MAPE will be used for the car value analysis. Although MAPE is a widely used metric for forecast evaluation, it is sensitive to outliers, which may skew the evaluation; because each error is divided by the actual price, errors on low-priced cars are weighted more heavily. Therefore, to ensure the reliability and integrity of the results, this limitation will be carefully considered in the analysis. 
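As a minimal illustration of the metric, MAPE can be computed by hand on toy numbers (these prices are made up and are not taken from the competition data). The helper name `mape` is hypothetical; it follows the same definition as `sklearn.metrics.mean_absolute_percentage_error` when no actual price is zero.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.10 means 10%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) / np.abs(y_true)))

# Toy prices (illustrative only): both predictions miss by $2,000
actual = [10_000, 40_000]
predicted = [12_000, 42_000]
score = mape(actual, predicted)
print(f"MAPE: {score:.3f}")
```

The same $2,000 miss contributes 20% for the cheaper car but only 5% for the more expensive one, which is why errors on low-priced listings dominate the score.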

In [None]:
# ADD Count for types, and add Date Type, and Text (torque)

In [None]:
print(df['transmission_display'].value_counts())

In [None]:
# print(df['torque'].unique())

In [None]:
df.isnull().sum()

In [None]:
df_test.isnull().sum()

Comparing the missing-value counts of both datasets, the test set contains more variables with missing values than the training set. To ensure the reliability of the analysis, these missing values need to be handled appropriately during data preprocessing. 
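To make the comparison easier to read, the two `isnull()` outputs could be placed side by side as percentages. The sketch below uses toy frames in place of `df` and `df_test`; the helper name `missing_summary` is hypothetical.

```python
import numpy as np
import pandas as pd

def missing_summary(train, test):
    """Percentage of missing values per column, train and test side by side."""
    summary = pd.DataFrame({
        "train_missing_pct": train.isnull().mean() * 100,
        "test_missing_pct": test.isnull().mean() * 100,
    })
    # Keep only columns with at least one missing value in either dataset
    has_missing = (summary.fillna(0) > 0).any(axis=1)
    return summary[has_missing].sort_values("test_missing_pct", ascending=False)

# Toy frames standing in for df and df_test
toy_train = pd.DataFrame({"mileage": [10.0, np.nan, 30.0, 40.0],
                          "year": [2018, 2019, 2020, 2020]})
toy_test = pd.DataFrame({"mileage": [np.nan, np.nan, 5.0, 7.0],
                         "year": [2017, 2020, 2020, 2019]})
summary = missing_summary(toy_train, toy_test)
print(summary)
```

Calling `missing_summary(df, df_test)` on the real data would list only the affected columns, which keeps the output short.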

In [None]:
df.describe()

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df['price'])
plt.title('Distribution of price')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
print(df['price'].describe())

These statistics on the price distribution reflect market trends, product demand, and the availability of certain vehicles. The average price of listed cars is approximately `$28851`, which gives insight into the typical range of car prices: more regular-priced vehicles are being advertised because lower-end cars are more widely available. The 75th percentile is `$36992`, which means that 75% of listed cars fall below this value, so regular-priced cars are listed more often than high-end ones; however, the presence of high outliers reveals that some high-end luxury cars are being listed too. 

In [None]:
make_counts = df['make_name'].value_counts()

plt.figure(figsize=(14, 6))
make_counts.plot(kind='bar')
plt.title('Counts of Car Makes')
plt.xlabel('Car Make')
plt.ylabel('Count')
plt.show()


In [None]:
df.tail()

In [None]:
print(df['year'].value_counts().sort_values())

In [None]:
# print(df['year'] == 2021)

In [None]:
year_counts = df['year'].value_counts().sort_index() 

plt.figure(figsize=(14, 6))
year_counts.plot(kind='bar')
plt.title('Count of Cars per year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()

Based on the graph above, more recently manufactured cars are listed more frequently than older ones. Additionally, considering the price distribution graph, most of these later-made vehicles are priced lower, which indicates that the year of manufacture may affect a car's price. Moreover, the substantial increase in the number of cars listed for 2020 compared to other years suggests several possible reasons:

* There is a high demand for cars manufactured in 2020
* More people are trying to sell their car made in 2020
* There was a major increase in car manufacturing in 2020 with more vehicles being created. 

In [None]:
print(df['is_new'].value_counts())

The split between new and used cars is fairly even.

Table listing all the features present in the dataset and their type

|Variable Kind|Number of Features|Feature Names|
| --- | --- | --- |
| Numeric | 10 | `city_fuel_economy`, `daysonmarket`, `engine_displacement`, `highway_fuel_economy`, `horsepower`, `latitude`, `longitude`, `mileage`, `savings_amount`, `seller_rating` |
| Nominal | 16 | `vin`, `body_type`, `city`, `dealer_zip`, `engine_type`, `exterior_color`, `franchise_dealer`, `fuel_type`, `interior_color`, `is_new`, `listing_color`, `make_name`, `model_name`, `transmission`, `transmission_display`, `wheel_system` |
| Date | 2 | `listed_date`, `year` |
| Text | 10 | `back_legroom`, `front_legroom`, `fuel_tank_volume`, `height`, `length`, `maximum_seating`, `wheelbase`, `width`, `power`, `torque` |

---

## Task 2: Data Cleaning, Missing Observations and Feature Engineering
- In this task you need to follow a set of instructions/questions listed below.
- Make sure you explain each answer carefully both in Markdown text, as well as on your video.

**Total Marks: 12**

Student in charge of this task: `Tran Tuan Huy Bui`

In [None]:
class Cleaner:

    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()


    def extract_numerical_values(self, features: list):
        for feature in features:
            self.df[feature] = self.df[feature].str.split().str[0]
            self.df[feature] = pd.to_numeric(self.df[feature], errors='coerce')
        #return self.df


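    def extract_multiple_numerical_values(self, feature:str, value1:str, value2:str):
        '''
        Splits a text feature such as torque or power into two numerical columns:
        value1 (the first number) and value2 (the number after '@', e.g. the rpm).
        The original feature is dropped.
        '''

        # First run of digits in the string
        first_number = self.df[feature].str.extract(r'(\d+)', expand=False)
        self.df[value1] = pd.to_numeric(first_number, errors='coerce')

        # Number following '@'; remove the thousands separator before converting
        second_number = self.df[feature].str.extract(r'@\s*(\d+,?\d*)', expand=False)
        self.df[value2] = pd.to_numeric(second_number.str.replace(',', '', regex=False), errors='coerce')

        del self.df[feature]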


    def impute_numerical_columns(self, numerical_cols:list):
        self.df[numerical_cols] = self.df.loc[:, numerical_cols] \
                                    .fillna(self.df[numerical_cols].mean(axis=0))        


    def impute_categorical_columns(self, categorical_cols:list):
        self.df[categorical_cols] = self.df.loc[:, categorical_cols] \
                                    .fillna(self.df[categorical_cols].mode(axis=0).iloc[0])
    
    

train_cleaner = Cleaner(df)

**Task 2, Question 1**: Clean **all** numerical features so that they can be used in training algorithms. For instance, back_legroom feature is in object format containing both numerical values and text. Extract numerical values (equivalently eliminate the text) so that the numerical values can be used as a regular feature.  
(2 marks)

In [None]:
num_one_item_col = ['back_legroom','front_legroom', 
                  'fuel_tank_volume', 'height', 'length', 
                  'maximum_seating', 'wheelbase', 'width']
                  

In [None]:
train_cleaner.extract_numerical_values(num_one_item_col)


`(Task 2, Question 1 Text Here - insert more cells as required)`

**Task 2, Question 2** Create at least 5 new features from the existing numerical variables which contain multiple items of information, for example you could extract maximum torque and torque rpm from the torque variable.  
(2 marks)

In [None]:
train_cleaner.extract_multiple_numerical_values('torque', 'max_torque', 'torque_rpm')
train_cleaner.extract_multiple_numerical_values('power', 'max_power', 'power_rpm')



In [None]:
train_cleaner.df['car_age'] = 2024 - train_cleaner.df['year']

`(Task 2, Question 2 Text Here - insert more cells as required)`
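To show what the regular expressions used in `extract_multiple_numerical_values` capture, here is a small standalone sketch. The sample strings are assumptions about the `"<value> <unit> @ <rpm> RPM"` layout and are not copied from the dataset.

```python
import pandas as pd

# Sample strings in an assumed "<value> <unit> @ <rpm> RPM" format
sample = pd.Series(["174 lb-ft @ 4,000 RPM", "355 hp @ 5,600 RPM", None])

# First run of digits: the peak value
peak = pd.to_numeric(sample.str.extract(r"(\d+)", expand=False), errors="coerce")

# Digits after "@", with the thousands separator removed: the engine speed
rpm_text = sample.str.extract(r"@\s*(\d+,?\d*)", expand=False)
rpm = pd.to_numeric(rpm_text.str.replace(",", "", regex=False), errors="coerce")

parsed = pd.DataFrame({"peak": peak, "rpm": rpm})
print(parsed)
```

Missing entries stay missing after parsing, so they are still picked up by the imputation step that follows.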

**Task 2, Question 3**: Impute the missing values for all features in both the training and test datasets.   
(3 marks)

In [None]:
## Task 2, Question 3 Code Here

In [None]:
numerical_cols = ['back_legroom', 'city_fuel_economy', 'daysonmarket', 'engine_displacement', 
                  'front_legroom', 'fuel_tank_volume', 'height', 'highway_fuel_economy', 'horsepower', 
                  'latitude', 'longitude', 'length', 'maximum_seating', 'mileage', 'savings_amount', 
                  'seller_rating', 'max_torque', 'torque_rpm', 'max_power', 'power_rpm', 'wheelbase', 'width']

train_cleaner.impute_numerical_columns(numerical_cols)

In [None]:
categorical_cols = ['body_type', 'city', 'dealer_zip', 'engine_type', 'exterior_color', 'franchise_dealer', 'fuel_type',
                    'interior_color', 'is_new', 'listing_color', 'make_name', 'model_name', 'transmission', 'transmission_display', 'wheel_system']

train_cleaner.impute_categorical_columns(categorical_cols)

`(Task 2, Question 3 Text Here - insert more cells as required)`
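The `Cleaner` class fills each dataset using that dataset's own means and modes. One alternative, sketched here as an assumption and not the approach used in this notebook, is to compute the fill values on the training data only and reuse them for the test data, so both sets are imputed with the same values. The helper name `fit_fill_values` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train, numerical_cols, categorical_cols):
    """Compute fill values (means and modes) from the training data only."""
    fill = train[numerical_cols].mean().to_dict()
    fill.update(train[categorical_cols].mode().iloc[0].to_dict())
    return fill

# Toy frames standing in for the cleaned training and test data
toy_train = pd.DataFrame({"mileage": [10.0, 20.0, np.nan],
                          "fuel_type": ["Gasoline", "Gasoline", None]})
toy_test = pd.DataFrame({"mileage": [np.nan, 50.0],
                         "fuel_type": [None, "Diesel"]})

fill_values = fit_fill_values(toy_train, ["mileage"], ["fuel_type"])
train_filled = toy_train.fillna(fill_values)
test_filled = toy_test.fillna(fill_values)  # test rows receive training statistics
print(test_filled)
```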

**Task 2, Question 4**: Encode all categorical variables appropriately as discussed in class. 

- Where multiple values are given for an observation encode the observation as 'other'. 
- Where a categorical feature contains more than 5 unique values, map the features into 5 most frequent values + 'other' and then encode appropriately. For instance, map colours into 5 basic colours + 'other': [red, yellow, green, blue, purple, other] and then encode.  
(2 marks)

In [None]:
## Task 2, Question 4 Code Here



In [None]:
def encode_multiple_values(df, column):
    """
    Encodes observations with multiple values for a column as 'other'
    """
    # Create a boolean mask for rows with multiple values
    multi_value_mask = df[column].apply(lambda x: isinstance(x, list) or isinstance(x, set))
    
    # Replace multiple values with 'other'
    df.loc[multi_value_mask, column] = 'other'
    
    return df

In [None]:
train_cleaner.df.head()
# train_cleaner.df.info()

# Create a copy
df_encode = train_cleaner.df.copy()
# df_encode.head()

In [None]:
def most_frequent(df, feature, n):
    """
    Returns a DataFrame with one column per feature, listing that
    feature's n most frequent values in descending order of frequency
    """
    top_values = {
        feat: pd.Series(df[feat].value_counts().head(n).index)
        for feat in feature
    }

    # Columns with fewer than n unique values are padded with NaN
    return pd.DataFrame(top_values)

In [None]:
# Define the list of colors
color = ['black', 'white', 'gray', 'silver', 'red', 'yellow', 'green', 'blue', 'purple', 'other']

# Convert the list of colors to a set for faster lookup
color_set = set(color)

# Create a function to check if any word in the observation matches a color
def match_color(observation):
    # Convert the observation to lowercase for case-insensitive matching
    observation = str(observation).lower()
    
    # Check if any word in the observation matches a color
    for word in observation.split():
        if word in color_set:
            return word
    
    # If no match is found, return 'other'
    return 'other'

In [None]:
allCatMostFrequent_df = most_frequent(df_encode, categorical_cols, 5)
allCatMostFrequent_df

In [None]:
colorFeat = ['exterior_color', 'interior_color', 'listing_color']
df_color = df_encode[colorFeat].copy()

# Apply the match_color function to the 'df_color'
for feat in colorFeat:
    df_color[f'matched_color_{feat}'] = df_encode[feat].apply(match_color)
    
# df_color.head()

matchedColorFeat = ['matched_color_exterior_color', 'matched_color_interior_color', 'matched_color_listing_color']
allColorFeatMostFrequent_df = most_frequent(df_color, matchedColorFeat, 100)

allColorFeatMostFrequent_df

In [None]:
# for feat in categorical_cols:
#     encode_multiple_values(df_encode, feat)
    
#     # Convert the column to string before using str accessor
#     df_encode[feat] = df_encode[feat].astype(str)
    
    
#     other_count = df[feat].str.contains('other').sum()
#     print(f"Number of observations {feat} encoded as 'other': {other_count}")


`(Task 2, Question 4 Text Here - insert more cells as required)`
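The "five most frequent values plus 'other'" rule from Question 4 could be applied along the following lines. This is a minimal sketch on a toy column, using `n=2` to keep the output small; the helper name `collapse_rare` is hypothetical, and one-hot encoding via `pd.get_dummies` is only one possible encoding choice.

```python
import pandas as pd

def collapse_rare(series, n=5, other="other"):
    """Keep the n most frequent values of a column and map the rest to `other`."""
    top_values = series.value_counts().head(n).index
    return series.where(series.isin(top_values), other)

# Toy column (illustrative values only)
toy = pd.DataFrame({"make_name": ["Ford", "Ford", "Toyota", "Toyota",
                                  "Honda", "Kia", "BMW"]})
toy["make_name"] = collapse_rare(toy["make_name"], n=2)

# One-hot encode the reduced column
encoded = pd.get_dummies(toy, columns=["make_name"], dtype=int)
print(encoded.columns.tolist())
```

Note that `collapse_rare` also maps missing values to `'other'`, so it is safest to run it after imputation.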

**Task 2, Question 5**: Perform any other actions you think need to be done on the data before constructing predictive models, and clearly explain what you have done.   
(1 mark)

In [None]:
## Task 2, Question 5 Code Here

`(Task 2, Question 5 Text Here - insert more cells as required)`

**Task 2, Question 6**: Perform some EDA to measure the relationship between the features and the target and carefully explain your findings. 
(2 marks)

In [None]:
## Task 2, Question 6 Code Here

`(Task 2, Question 6 Text Here - insert more cells as required)`
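One simple way to start measuring the relationship between features and the target is to rank the numerical features by their correlation with `price`. The sketch below runs on synthetic data generated for illustration only; the helper name `correlation_with_target` is hypothetical.

```python
import numpy as np
import pandas as pd

def correlation_with_target(df, target="price"):
    """Pearson correlation of every numeric column with the target, strongest first."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    return corr.sort_values(key=np.abs, ascending=False)

# Synthetic data: price falls with mileage, "noise" is unrelated
rng = np.random.default_rng(0)
mileage = rng.uniform(0, 100_000, size=200)
toy = pd.DataFrame({
    "mileage": mileage,
    "noise": rng.normal(size=200),
    "price": 40_000 - 0.2 * mileage + rng.normal(scale=1_000, size=200),
})
ranking = correlation_with_target(toy)
print(ranking)
```

Pearson correlation only captures linear relationships, so scatter plots or box plots per category remain useful alongside it.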

--- 
## Task 3: Fit and tune a forecasting model, submit predictions & win competition

Make sure you **clearly explain each step** you do both in Markdown and on the recorded video.   
*In this task you must not create any additional features and should only rely on the datasets constructed in Task 2.*

1. Build and explain at least 3 machine learning (ML) regression models taking into account the outcomes of Tasks 1 & 2 (3 marks)    
2. Fit the models and tune hyperparameters via cross-validation: make sure you comment and explain each step clearly (3 marks)   
3. Select your best algorithm, create predictions using the test dataset, and submit your predictions on Kaggle's competition page. Make sure you explain all the steps that led you to choose this algorithm both in the video presentation and in your written answer. (4 marks)   
4. Provide Kaggle ranking and score (screenshot your final submission) and comment (e.g. how could you improve your ranking?) (2 marks)   

- Hints:
    - To perform well you will need to iterate Tasks 2 and Task 3
    - Make sure your Python code works, so that a marker can replicate your Kaggle submission and score.
    - You will receive the mark of zero if your code does not produce the forecasts uploaded to Kaggle 

**Total Marks: 12**

Student in charge of this task: `Chau Anh Cong`
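Since the competition is scored on MAPE, one option for the cross-validated tuning in item 2 is to pass a MAPE-based scorer to the search; without a `scoring` argument, `RandomizedSearchCV` falls back to the estimator's default score, which is R² for regressors. The sketch below is an assumption about how this could be set up, on synthetic data and not the competition data. It also places `StandardScaler` inside a pipeline, because SVR is sensitive to feature scale.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (not the competition data)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(120, 3))
y_toy = 30_000 + 5_000 * X_toy[:, 0] + rng.normal(scale=500, size=120)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold
pipeline = make_pipeline(StandardScaler(), SVR(kernel="linear"))
param_distributions = {"svr__C": [1, 10, 100, 1000]}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=4,
    cv=3,
    scoring="neg_mean_absolute_percentage_error",  # higher (closer to 0) is better
    random_state=42,
)
search.fit(X_toy, y_toy)

cv_mape = -search.best_score_  # flip the sign to recover MAPE
print(search.best_params_, round(cv_mape, 4))
```

With this setup the hyperparameters are chosen by the same metric that is later logged to MLflow and used on the leaderboard.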

In [None]:
df_train = train_cleaner.df.copy()

In [None]:
y = df_train['price']
X = df_train[numerical_cols]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, y_train.shape

In [None]:
mlflow.set_experiment("Group Assignment")

## Linear Models

In [None]:
with mlflow.start_run(run_name="Linear Regression"):

    # Train a model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Make predictions
    predictions = model.predict(X_test)
    
    # Log model
    mlflow.sklearn.log_model(model, "linear_model")
    
    # Log metrics
    mape = mean_absolute_percentage_error(y_test, predictions)
    mlflow.log_metric("mape", mape)


In [None]:
with mlflow.start_run(run_name="Ridge Regression"):

    # Train a model
    model = RidgeCV(alphas=[0.001, 0.01, 0.1, 1, 10, 100, 1000])
    model.fit(X_train, y_train)
    
    # Make predictions
    predictions = model.predict(X_test)
    
    # Log model
    mlflow.sklearn.log_model(model, "ridge_model")
    
    # Log metrics
    mape = mean_absolute_percentage_error(y_test, predictions)
    mlflow.log_metric("mape", mape)
    
    alpha = model.alpha_
    # Log parameters
    mlflow.log_param("alpha", alpha)

In [None]:
with mlflow.start_run(run_name="Lasso Regression"):

    # Train a model
    model = LassoCV(alphas=np.logspace(-4, 1, 50))
    model.fit(X_train, y_train)
    
    # Make predictions
    predictions = model.predict(X_test)
    
    # Log model
    mlflow.sklearn.log_model(model, "lasso_model")
    
    # Log metrics
    mape = mean_absolute_percentage_error(y_test, predictions)
    mlflow.log_metric("mape", mape)
    
    alpha = model.alpha_
    # Log parameters
    mlflow.log_param("alpha", alpha)

## Non-linear Models

### SVR

In [None]:
with mlflow.start_run(run_name="SVR Regression"):

    param_grid = {
        'C': [0.1, 1, 10, 100],
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'gamma': [0.001, 0.01, 0.1, 1],
    }

    # Initialize the SVR
    svr = SVR()

    # Initialize the RandomizedSearchCV
    random_search = RandomizedSearchCV(estimator=svr, param_distributions=param_grid, 
                                        cv=5, refit=True, random_state=42, n_jobs=-1)

    # Fit the model
    random_search.fit(X_train, y_train)

    # Get the best parameters
    model = random_search.best_estimator_

    # Make predictions
    predictions = model.predict(X_test)
    
    # Log model
    mlflow.sklearn.log_model(model, "svr_model")
    
    # Log metrics
    mape = mean_absolute_percentage_error(y_test, predictions)
    mlflow.log_metric("mape", mape)
    
    params = random_search.best_params_

    # Log parameters
    mlflow.log_param('C', params['C'])
    mlflow.log_param('kernel', params['kernel'])
    mlflow.log_param('gamma', params['gamma'])

### Random Forest Regressor

In [None]:
with mlflow.start_run(run_name="Random Forest Regression"):

    param_grid = {
        'n_estimators': [100, 200, 500, 1000],
        'max_depth': [10, 20, 30, 40, 50],
    }

    # Initialize the Random Forest Regressor
    rf = RandomForestRegressor(random_state=42)

    # Initialize the RandomizedSearchCV
    random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_grid, 
                                        cv=5, refit=True, random_state=42, n_jobs=-1)

    # Fit the model
    random_search.fit(X_train, y_train)

    # Get the best parameters
    model = random_search.best_estimator_

    # Make predictions
    predictions = model.predict(X_test)
    
    # Log model
    mlflow.sklearn.log_model(model, "random_forest_model")
    
    # Log metrics
    mape = mean_absolute_percentage_error(y_test, predictions)
    mlflow.log_metric("mape", mape)
    
    params = random_search.best_params_
    # Log parameters
    mlflow.log_param('n_estimators', params['n_estimators'])
    mlflow.log_param('max_depth', params['max_depth'])

### AdaBoost

In [None]:
with mlflow.start_run(run_name="AdaBoost Regression"):

    param_grid = {
        'n_estimators': [500, 600, 700],
        'learning_rate': [0.5, 0.6, 0.7, 0.8],
        'estimator': [
            DecisionTreeRegressor(max_depth=1),
            DecisionTreeRegressor(max_depth=5)
        ],
        'loss': ['linear', 'square', 'exponential']
    }

    # Initialize the AdaBoost Regressor
    adb = AdaBoostRegressor(random_state=42)

    # Initialize the RandomizedSearchCV
    random_search = RandomizedSearchCV(estimator=adb, param_distributions=param_grid, 
                                        cv=5, refit=True, random_state=42, n_jobs=-1)

    # Fit the model
    random_search.fit(X_train, y_train)

    # Get the best parameters
    model = random_search.best_estimator_

    # Make predictions
    predictions = model.predict(X_test)
    
    # Log model
    mlflow.sklearn.log_model(model, "adaboost_model")
    
    # Log metrics
    mape = mean_absolute_percentage_error(y_test, predictions)
    mlflow.log_metric("mape", mape)
    
    params = random_search.best_params_
    # Log parameters
    mlflow.log_param('n_estimators', params['n_estimators'])
    mlflow.log_param('learning_rate', params['learning_rate'])
    mlflow.log_param('base_estimator', params['estimator'])
    mlflow.log_param('loss', params['loss'])

### XGBRegressor

In [None]:
with mlflow.start_run(run_name="XGBoost Regression"):

    param_grid = {
        'n_estimators': [100, 200, 500, 1000],
        'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3],
        'max_depth': [3, 5, 7, 10]
    }

    # Initialize the XGBoost Regressor
    xg_reg = XGBRegressor(random_state=42)

    # Initialize the RandomizedSearchCV
    random_search = RandomizedSearchCV(estimator=xg_reg, param_distributions=param_grid, 
                                        cv=5, refit=True, random_state=42, n_jobs=-1)

    # Fit the model
    random_search.fit(X_train, y_train)

    # Get the best parameters
    model = random_search.best_estimator_

    # Make predictions
    predictions = model.predict(X_test)
    
    # Log model
    mlflow.sklearn.log_model(model, "xgboost_model")
    
    # Log metrics
    mape = mean_absolute_percentage_error(y_test, predictions)
    mlflow.log_metric("mape", mape)
    
    params = random_search.best_params_
    # Log parameters
    mlflow.log_param('n_estimators', params['n_estimators'])
    mlflow.log_param('learning_rate', params['learning_rate'])
    mlflow.log_param('max_depth', params['max_depth'])

# Submission

In [None]:
logged_model = 'runs:/c7b064ab714b49139e5eccee7d96b7be/xgboost_model'

# Load model as a PyFuncModel.
model = mlflow.pyfunc.load_model(logged_model)

In [None]:
test_cleaner = Cleaner(df_test)

test_cleaner.extract_numerical_values(num_one_item_col)
test_cleaner.extract_multiple_numerical_values('torque', 'max_torque', 'torque_rpm')
test_cleaner.extract_multiple_numerical_values('power', 'max_power', 'power_rpm')
test_cleaner.df['car_age'] = 2024 - test_cleaner.df['year']
test_cleaner.impute_numerical_columns(numerical_cols)
test_cleaner.impute_categorical_columns(categorical_cols)

In [None]:
df_submit = test_cleaner.df.copy()

In [None]:
output = pd.DataFrame({"vin": df_test['vin'].values, "price": model.predict(df_submit[numerical_cols])})
output

`(Task 3 - insert more cells as required)`

In [None]:
output.to_csv('output.csv', index=False)

---
---
## Marking Criteria

To receive full marks your solutions must satisfy the following criteria:

- Problem Description: 12 marks
- Data Cleaning: 12 marks
- Building Forecasting models: 12 marks
- Competition Points: 4 marks
- Forecasts correctly uploaded to Kaggle
- Python code is clean and concise
- Written explanations are provided in clear and easy to understand sentences
- Video presentations are limited to 15 minutes in duration
- Each team member delivers a 5-minute presentation on their assigned task
    - During the video recording, make sure that both your face and Jupyter Notebook are clearly visible
    - Your code must be readable on the video
    - Discuss both the actions you took and, more importantly, the reasoning behind these actions, explaining the significance of key steps
- The assignment notebook is well-organised and easy to follow
- Failure to meet the above marking criteria will result in a deduction of marks

---
---