# TBD - Kaggle

## Authors
| **Name**              | **NIU**   |
|-----------------------|-----------|
| Arnau Muñoz Barrera   | 1665982   |
| José Ortín López      | 1667573   |


## Database

To access the source Database: [Link to Database](https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data)



## Import Libraries

In [None]:
# Install the packages imported below in the notebook environment
%pip install pandas numpy seaborn matplotlib scikit-learn

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


## Review Initial Structure and Data

In [None]:
df = pd.read_csv('data/vehicles-dataset.csv', engine='python', on_bad_lines='skip')

df.describe(include='all')

##### Determine columns & their types

In [None]:
print("[INFO] Dataset columns & their types: \n", df.dtypes)

##### Determine quantity of NaNs

In [None]:
# Helper function: check whether the dataset contains any missing values
def has_nans(df):
    return df.isna().sum().sum() > 0

# Helper function: percentage of missing values per column, sorted ascending
def get_percentage_nan_per_column(df):
    return df.isna().sum().sort_values() / len(df) * 100

print("[INFO] Does the dataset have missing values?", "Yes" if has_nans(df) else "No")

if has_nans(df):
    print("\n [INFO] Percentage of missing values per column: \n", get_percentage_nan_per_column(df))

    df_missing = pd.DataFrame(list(get_percentage_nan_per_column(df).items()), columns=["Field", "Percentage"])

    plt.figure(figsize=(10,6))
    sns.barplot(x="Percentage", y="Field", data=df_missing, palette="viridis")

    plt.title("Percentage of null values per field", fontsize=14)
    plt.xlabel("Percentage (%)")
    plt.ylabel("Field")

    plt.show()
else:
    print("\n [INFO] No missing values detected in the dataset.")
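
As a quick sanity check on what a per-column missing percentage should look like, here is a minimal sketch on a toy DataFrame. The frame `toy` and its values are made up for illustration and are not taken from the vehicles dataset. It relies on the fact that `isna().mean()` is the fraction of missing rows per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "price": [1000, 2500, np.nan, 4000],
    "county": [np.nan, np.nan, np.nan, np.nan],
    "year": [2010, 2012, 2015, 2018],
})

# Fraction of missing rows per column, scaled to a 0-100 percentage
nan_pct = toy.isna().mean().mul(100).sort_values()
print(nan_pct)
```

A column that is entirely empty should come out at exactly 100, which is an easy way to confirm that the percentages are on the expected scale.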


### Conclusions

As the previous results show, the main issues to review and address are:

| **Problem description**                                | **Proposed Solution**      | **Affected fields**      |
|--------------------------------------------------------|----------------------------|--------------------------|
| Field that needs format clean-up                       | Field modifications        | **posting_date**         |
| Fields with the most NaN values                        | Erase column               | **county**, **VIN**      |
| Fields with different types or inconsistent ranges     | Normalization              | **price**, **odometer**  |
| Irrelevant fields                                      | Erase column               | **TBD**                  |



### Clean Up ***Posting_date*** 
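
One possible way to approach this clean-up is sketched below. It assumes that `posting_date` holds ISO 8601 strings with a UTC offset, which has not been verified here. The sample strings and the derived names `posting_year` and `posting_month` are hypothetical.

```python
import pandas as pd

# Hypothetical sample of raw posting_date values, including bad entries
raw = pd.Series([
    "2021-05-04T12:31:18-0500",
    "2021-04-30T09:00:00-0400",
    "not a date",
    None,
])

# Parse to timezone-aware timestamps in UTC; unparseable values become NaT
parsed = pd.to_datetime(raw, errors="coerce", utc=True)

# Derive simple numeric features that a regression model can use
posting_year = parsed.dt.year
posting_month = parsed.dt.month

print(parsed)
```

Converting everything to UTC avoids mixing offsets within one column. Rows that end up as `NaT` can then be handled in the same way as other missing values.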


### Erase ***County*** & ***VIN***

In [None]:
df = df.drop(['county', 'VIN'], axis=1)
df.describe(include='all')

## Metric Selection

In this section, we select the regression metrics and tools used to analyze the performance of our final model. To do so, we fit a linear regression on our data and define a set of evaluation functions, including diagnostic plots (predicted vs. real values and error scatter plots).

The metrics we are going to analyze are the following: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Coefficient of Determination (R²). Once we obtain the results and plots, we will decide which metric to use in order to select the best-performing model.

In [None]:
# Declaration of functions to analyze the metrics:
def calculMetriques(y_true, y_pred, metric='mse'):
    if metric == 'mse':
        result = mean_squared_error(y_true,y_pred)
    elif metric == 'rmse':
        result = np.sqrt(mean_squared_error(y_true,y_pred))
    elif metric == 'mae':
        result = mean_absolute_error(y_true,y_pred)
    elif metric == 'r2':
        result = r2_score(y_true,y_pred)
    else:
        raise ValueError(f"Unrecognized metric: {metric}")

    return result

In [None]:
def mostrar_grafics(y, y_pred):
    # Scatter plot of real values (y) against predicted values (y_pred)
    plt.scatter(y, y_pred)
    # Red dashed line marking the ideal case (y_pred = y)
    plt.plot(y, y, '--', c='red')
    plt.xlabel('y_real')  # Label for the X-axis
    plt.ylabel('y_pred')  # Label for the Y-axis
    plt.show()            # Display the plot

    # Scatter plot of the signed errors (y_pred - y)
    plt.scatter(y, y_pred - y)
    plt.xlabel('y_real')
    plt.ylabel('error')   # Label for the Y-axis (signed error)
    plt.show()

    # Scatter plot of the absolute error for each y value (its mean is the MAE)
    plt.scatter(y, abs(y_pred - y))
    plt.xlabel('y_real')
    plt.ylabel('absolute error')  # Label for the Y-axis
    plt.show()

    # Scatter plot of the absolute percentage error for each y value (its mean is the MAPE)
    plt.scatter(y, abs(y_pred - y) / y)
    plt.xlabel('y_real')
    plt.ylabel('absolute percentage error')  # Label for the Y-axis
    plt.show()

In [None]:
##### MODEL TRAINING #####
# Split X and y
target_att = 'price'
# LinearRegression only accepts numeric, NaN-free inputs:
# keep the numeric columns and drop rows with missing values
df_model = df.select_dtypes(include='number').dropna()
attributes = [k for k in df_model.columns if k != target_att]
X = df_model[attributes]
y = df_model[target_att]

lr = LinearRegression(fit_intercept=True)
lr.fit(X, y)

# Evaluate the model on the training data (an optimistic estimate)
y_pred = lr.predict(X)

# Call to the metric calculation function
print("Metrics: \n")
print(" Mean Squared Error (MSE):", calculMetriques(y, y_pred, metric="mse"))
print(" Root Mean Squared Error (RMSE):", calculMetriques(y, y_pred, metric="rmse"))
print(" Mean Absolute Error (MAE):", calculMetriques(y, y_pred, metric="mae"))
print(" Coefficient of Determination (R²):", calculMetriques(y, y_pred, metric="r2"))

# Call to the plot generation function
mostrar_grafics(y, y_pred)
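
Metrics computed on the same rows used for fitting tend to look better than they would on unseen data. The sketch below shows one common alternative, a hold-out split with `train_test_split`. It uses synthetic data (a made-up linear price/odometer relationship), so the names `X_demo` and `y_demo` and all the numbers are assumptions for illustration, not results from the vehicles dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: price falls linearly with odometer, plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 200_000, size=(500, 1))
y_demo = 30_000 - 0.1 * X_demo[:, 0] + rng.normal(0, 2_000, size=500)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# Compare the error on seen and unseen rows
mae_train = mean_absolute_error(y_train, model.predict(X_train))
mae_test = mean_absolute_error(y_test, model.predict(X_test))
print("MAE train:", mae_train)
print("MAE test: ", mae_test)
```

A large gap between the two values would suggest overfitting. For a simple linear model on this kind of data, the two are usually close.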

Once the results have been analyzed with the different metrics and plots, we choose the metric we will use to evaluate the models' performance and select the model that best fits our dataset.

##### Metrics chosen to evaluate our model
* **Mean Squared Error (MSE)**:

    Measures the mean squared error between the real and predicted values.

    It tells us how large the errors are, penalizing large errors especially heavily. It also gives a general idea of the model's accuracy.

    $MSE=\frac{1}{m}\sum_{i=1}^{m}(y_i-\hat{y}_i)^2$

    <span style="color: red;">✗ *Metric not chosen*</span>

* **Root Mean Squared Error (RMSE)**:
    Square root of the MSE.

    It tells us how far off our predictions typically are, expressed in the same monetary units as the price. It is a useful metric because it gives an overall view of performance.

    $RMSE=\sqrt{MSE}$

    <span style="color: red;">✗ *Metric not chosen*</span>

* **Mean Absolute Error (MAE)**:
    Computes the mean error in absolute value.

    It tells us how far off the model is on average, in absolute value. Unlike the MSE, it does not over-penalize outliers. It is a very useful metric when we want a concrete measure of the model's overall behavior without very large errors dominating the result.

    $MAE=\frac{1}{m}\sum_{i=1}^{m}\lvert y_i-\hat{y}_i\rvert$

    <span style="color: green;">✔ *Metric chosen*</span>

* **Coefficient of Determination (R²)**:
    Percentage of the price variability that is explained by the model.

    It tells us how well the model explains the structure of the dataset. A low R² means the model is unable to capture the relationship between the input variables and the target variable (price). A high R², in contrast, means the model accounts for the variations in price. It is a very useful metric for comparing models of different complexity.

    $R^2 = 1-\frac{\sum_{i=1}^{m}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{m}(y_i-\bar{y})^2}$

    where $\bar{y}$ is the mean of the real values.

    <span style="color: red;">✗ *Metric not chosen*</span>
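
To illustrate how these metrics react to a single large error, here is a small worked example. The price vectors are invented for illustration and do not come from the dataset. The first set of predictions is off by 1000 on every sample. The second is identical except for one prediction that is off by 10000.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical real prices and two sets of predictions
y_true = np.array([10_000, 12_000, 15_000, 20_000])
y_pred_clean = np.array([11_000, 11_000, 16_000, 19_000])    # every error is 1000
y_pred_outlier = np.array([11_000, 11_000, 16_000, 30_000])  # last error is 10000

def summarize(y_true, y_pred):
    """Return (MAE, RMSE, R2) for one set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    return mae, rmse, r2

mae_clean, rmse_clean, r2_clean = summarize(y_true, y_pred_clean)
mae_out, rmse_out, r2_out = summarize(y_true, y_pred_outlier)

print("clean:   MAE=%.1f RMSE=%.1f R2=%.3f" % (mae_clean, rmse_clean, r2_clean))
print("outlier: MAE=%.1f RMSE=%.1f R2=%.3f" % (mae_out, rmse_out, r2_out))
```

With the single outlier, the MAE rises from 1000 to 3250, while the RMSE rises from 1000 to roughly 5074 and R² turns negative. This shows that the squared-error metrics are more sensitive to individual large errors than the MAE.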