<a href="https://colab.research.google.com/github/IlyaZutler/Project_2-Trucks/blob/main/Basic_ML_Project_introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Project guidelines - Heavy Machinery Auction Price Estimator - ML Project Documentation

> https://www.kaggle.com/t/9baafb8850d74e4499c7b1ba97d6f115

## Overview

Welcome to the Heavy Machinery Auction Price Estimator competition! This document will guide you through the steps required to participate in the competition, including setting up your environment, understanding the dataset, building your model, and submitting your predictions.

### Goal
The objective of this competition is to develop a model that accurately estimates the final sale price of heavy machinery based on historical auction data. This will help create a reliable pricing guide for heavy machinery.

### Timeline
- **Start Date:** [Start Date]
- **End Date:** 14/07/2024

## Dataset Description

The dataset consists of historical sales data of heavy machinery. It includes detailed information such as manufacturing year, usage hours, and various machine configurations.

### Files
- `train.csv`: Contains the training data with features and the target variable (`SalePrice`).
- `valid.csv`: Contains the validation data with features but without the `SalePrice` column. This is used to generate predictions for submission.

### Data Columns
- `SalesID`: Unique identifier for each sale.
- `SalePrice`: The target variable, representing the final sale price of the machinery.
- `YearMade`: The year the machinery was manufactured.
- `MachineHoursCurrentMeter`: The usage hours of the machinery.
- Other columns representing various machine configurations and features.

## Evaluation Metric

Submissions will be evaluated using the Root Mean Squared Error (RMSE) metric, which measures the average magnitude of the errors between the predicted and actual auction sale prices. The formula for RMSE is:

\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]

Where \( y_i \) is the actual sale price and \( \hat{y}_i \) is the predicted sale price.
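
As a quick sanity check on the formula, the sketch below computes RMSE directly with NumPy. The function name `rmse` and the toy prices are illustrative assumptions, not part of the competition material.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy example (made-up prices): absolute errors of 1000, 2000 and 0
actual = [10000, 15000, 20000]
predicted = [11000, 13000, 20000]
score = rmse(actual, predicted)
print(score)
```

Because the errors are squared before averaging, a few large misses on expensive machines can dominate the score.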

## Submission File

Your submission file should be a CSV file with two columns: `SalesID` and `SalePrice`. The file should include a header and be formatted as follows:

```
SalesID,SalePrice
1,10000
2,15000
...
```

Ensure that your submission file includes all IDs from the validation set and follows the specified format to be successfully scored.
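
The format requirements above can be checked before uploading. The following is a minimal sketch; the helper `check_submission` and the toy IDs are hypothetical and only illustrate the kind of checks that may be useful.

```python
import pandas as pd

def check_submission(submission, valid_ids):
    """Return a list of problems found in a submission DataFrame (empty if none)."""
    problems = []
    if list(submission.columns) != ['SalesID', 'SalePrice']:
        problems.append('columns must be exactly SalesID, SalePrice')
        return problems
    if submission['SalesID'].duplicated().any():
        problems.append('duplicate SalesID values')
    if set(submission['SalesID']) != set(valid_ids):
        problems.append('SalesID values do not match the validation set')
    if submission['SalePrice'].isna().any():
        problems.append('missing SalePrice values')
    return problems

# Toy data standing in for valid.csv and two candidate submissions
valid_ids = [1, 2, 3]
good = pd.DataFrame({'SalesID': [1, 2, 3], 'SalePrice': [10000.0, 15000.0, 12000.0]})
bad = pd.DataFrame({'SalesID': [1, 2], 'SalePrice': [10000.0, None]})
print(check_submission(good, valid_ids))
print(check_submission(bad, valid_ids))
```

In practice `valid_ids` would come from the `SalesID` column of `valid.csv`.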

## Getting Started

### 1. Environment Setup

Set up your Python environment with the necessary libraries:

```bash
pip install pandas numpy scikit-learn
```

### 2. Exploratory Data Analysis (EDA)

Conduct EDA to understand the dataset and identify any data quality issues. Look for missing values, outliers, and relationships between features and the target variable.
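
A few pandas one-liners cover most of these checks. The sketch below runs on a small made-up frame that stands in for `train.csv`; the 1900 cutoff for suspicious years is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are made up)
df = pd.DataFrame({
    'SalePrice': [66000, 57000, 10000, 38500, 11000],
    'YearMade': [2004, 1996, 2001, 1000, 2007],
    'MachineHoursCurrentMeter': [68.0, 4640.0, np.nan, 3486.0, np.nan],
    'UsageBand': ['Low', 'Low', None, 'High', 'Medium'],
})

# Share of missing values per column, highest first
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Implausible manufacturing years (the 1900 cutoff is an assumption)
suspicious_years = df[df['YearMade'] < 1900]
print(suspicious_years)

# Linear correlation of each numeric feature with the target
correlations = df.select_dtypes('number').corr()['SalePrice'].drop('SalePrice')
print(correlations)
```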

### 3. Data Preprocessing

- Handle missing values appropriately.
- Encode categorical variables.
- Normalize or standardize numerical features if necessary.
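
One possible way to cover the first two steps is sketched below, assuming median imputation for numeric columns and integer category codes for everything else. The helper `preprocess` is hypothetical, and the toy frame only mimics the real columns.

```python
import numpy as np
import pandas as pd

def preprocess(df, medians=None):
    """Fill numeric gaps with medians and turn non-numeric columns into integer codes."""
    out = df.copy()
    numeric_cols = out.select_dtypes('number').columns
    if medians is None:
        # Learn the medians from this frame (normally the training data)
        medians = out[numeric_cols].median()
    out[numeric_cols] = out[numeric_cols].fillna(medians)
    for col in out.columns.difference(numeric_cols):
        # Categories are coded in sorted order; missing values get the code -1
        out[col] = out[col].astype('category').cat.codes
    return out, medians

# Toy frame standing in for a slice of train.csv (values are made up)
raw = pd.DataFrame({
    'YearMade': [2004, 1996, 2001],
    'MachineHoursCurrentMeter': [68.0, np.nan, 2838.0],
    'UsageBand': ['Low', None, 'High'],
})
clean, medians = preprocess(raw)
print(clean)
```

Note that category codes produced this way are only comparable between two frames if both share the same category list, so a real pipeline would need to fix the categories from the training data. Scaling is omitted here because tree-based models such as random forests are insensitive to monotonic rescaling of features.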

### 4. Baseline Model

Start with a baseline model to establish a benchmark performance. A simple approach is to predict the mean of the target variable:

```python
import pandas as pd

train = pd.read_csv('train.csv')
baseline_prediction = train['SalePrice'].mean()

# Create a submission file
valid = pd.read_csv('valid.csv')
submission = pd.DataFrame({'SalesID': valid['SalesID'], 'SalePrice': baseline_prediction})
submission.to_csv('baseline_submission.csv', index=False)
```

### 5. RandomForestRegressor Model

Build a RandomForestRegressor model:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

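# Split the data
# As a first pass, drop the identifier, keep only numeric columns
# and fill missing values with the column median
X = train.drop(columns=['SalePrice', 'SalesID']).select_dtypes('number')
X = X.fillna(X.median())
y = train['SalePrice']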
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Validate the model
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'RMSE: {rmse}')
```

### 6. Model Improvement

- Handle missing values and categorical variables more effectively.
- Use feature importances to identify key features.
- Perform feature engineering to create new informative features.
- Tune hyperparameters using grid search or other techniques.
- Monitor for overfitting by comparing training and testing performance.
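
Several of these points can be combined in one short experiment. The sketch below uses synthetic data in place of the preprocessed auction features, and the parameter grid is a deliberately tiny assumption; it shows a grid search, a train/test RMSE comparison, and the feature importances of the best model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the preprocessed features
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Small, illustrative grid; real searches usually cover more values
param_grid = {'max_depth': [3, None], 'min_samples_leaf': [1, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=42),
    param_grid,
    scoring='neg_root_mean_squared_error',
    cv=3,
)
search.fit(X_train, y_train)
best = search.best_estimator_

# A large gap between train and test RMSE is a sign of overfitting
train_rmse = np.sqrt(mean_squared_error(y_train, best.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, best.predict(X_test)))
print(search.best_params_)
print(f'Train RMSE: {train_rmse:.3f}, Test RMSE: {test_rmse:.3f}')
print(best.feature_importances_)
```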

### 7. Final Submission

Generate predictions for the validation set:

```python
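valid = pd.read_csv('valid.csv')
# Use exactly the columns the model was trained on, in the same order
X_valid = valid[model.feature_names_in_]
X_valid = X_valid.fillna(X_valid.median(numeric_only=True))
y_valid_pred = model.predict(X_valid)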

# Create a submission file
submission = pd.DataFrame({'SalesID': valid['SalesID'], 'SalePrice': y_valid_pred})
submission.to_csv('final_submission.csv', index=False)
```

## Practical Data Science Guidelines

- **Efficient Workflows:** Use a random subset of 20,000 rows for initial experiments. Use the full dataset for the final submission.
- **Iterative Approach:** Start with a basic model and iteratively improve it by trying small ideas.
- **Feature Engineering:** Transform and combine existing features creatively.
- **Documentation:** Keep track of your experiments and results. Document what works and what doesn't.
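
The subset and documentation advice can be sketched as follows. The toy frame and the `log_experiment` helper are hypothetical; the logged RMSE values are rounded from the runs later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the full training data
full = pd.DataFrame({'SalesID': np.arange(100_000), 'SalePrice': np.arange(100_000) * 1.0})

# A fixed random_state makes the subset identical across runs, so RMSE
# changes reflect the idea being tested rather than a different sample
subset = full.sample(n=20_000, random_state=42)

# Minimal experiment log: one row per idea tried
experiments = []

def log_experiment(name, train_rmse, test_rmse):
    experiments.append({'name': name, 'train_rmse': train_rmse, 'test_rmse': test_rmse})

log_experiment('numeric columns only', 3743.4, 9842.2)
log_experiment('drop MachineID', 3760.1, 9767.1)
print(pd.DataFrame(experiments))
```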

## Collaboration and Presentation

- **Collaboration:** Discuss your work openly within your team or with other teams. Sharing insights and learning from each other is encouraged.
- **Presentation:** Present your methodology, results, and the techniques that helped the most. Document your journey and the steps you took to achieve your results.

## Have Fun!

Enjoy the process of learning and competing. This is an opportunity to improve your skills and collaborate with others.

Good luck in the competition!

# OK, let's get started!

In [1]:
import gdown
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from pathlib import Path

def download_from_gdrive(url, filename):
    # Extract the file ID from the URL
    file_id = url.split('/')[-2]
    download_url = f"https://drive.google.com/uc?id={file_id}"

    # Download the file
    if Path(filename).exists():
        print(f"File '{filename}' already exists. Skipping download.")
    else:
        gdown.download(download_url, filename, quiet=False)
        print(f"File downloaded as: {filename}")

train = 'https://drive.google.com/file/d/1guqSpDv1Q7ZZjSbXMYGbrTvGns0VCyU5/view?usp=drive_link'
valid = 'https://drive.google.com/file/d/1j7x8xhMimKbvW62D-XeDfuRyj9ia636q/view?usp=drive_link'
# Example usage

download_from_gdrive(train, 'train.csv')
download_from_gdrive(valid, 'valid.csv')

df = pd.read_csv('train.csv')
df.head()

Downloading...
From (original): https://drive.google.com/uc?id=1guqSpDv1Q7ZZjSbXMYGbrTvGns0VCyU5
From (redirected): https://drive.google.com/uc?id=1guqSpDv1Q7ZZjSbXMYGbrTvGns0VCyU5&confirm=t&uuid=de3b5f4c-7e97-48ec-87ff-7656154e2993
To: /content/train.csv
100%|██████████| 116M/116M [00:02<00:00, 53.2MB/s]


File downloaded as: train.csv


Downloading...
From: https://drive.google.com/uc?id=1j7x8xhMimKbvW62D-XeDfuRyj9ia636q
To: /content/valid.csv
100%|██████████| 3.32M/3.32M [00:00<00:00, 38.1MB/s]


File downloaded as: valid.csv




Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,...,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
0,1139246,66000,999089,3157,121,3.0,2004,68.0,Low,11/16/2006 0:00,...,,,,,,,,,Standard,Conventional
1,1139248,57000,117657,77,121,3.0,1996,4640.0,Low,3/26/2004 0:00,...,,,,,,,,,Standard,Conventional
2,1139249,10000,434808,7009,121,3.0,2001,2838.0,High,2/26/2004 0:00,...,,,,,,,,,,
3,1139251,38500,1026470,332,121,3.0,2001,3486.0,High,5/19/2011 0:00,...,,,,,,,,,,
4,1139253,11000,1057373,17311,121,3.0,2007,722.0,Medium,7/23/2009 0:00,...,,,,,,,,,,


In [None]:
df_valid = pd.read_csv('valid.csv')
df_valid.head()

In [None]:
df.info()

In [26]:
# prompt: print value_counts for all categorical columns

categorical_cols = df.select_dtypes(include='object').columns
for col in categorical_cols:
  print(f"Value counts for column '{col}':")
  print(df[col].value_counts())
  print()


Value counts for column 'UsageBand':
UsageBand
Medium    33985
Low       23620
High      12034
Name: count, dtype: int64

Value counts for column 'saledate':
saledate
2/16/2009 0:00    1932
2/15/2011 0:00    1352
2/19/2008 0:00    1300
2/15/2010 0:00    1219
2/11/2008 0:00    1100
                  ... 
1/16/2004 0:00       1
3/27/2006 0:00       1
7/25/2003 0:00       1
1/16/2006 0:00       1
6/9/2008 0:00        1
Name: count, Length: 3919, dtype: int64

Value counts for column 'fiModelDesc':
fiModelDesc
310G        5039
416C        4869
580K        4315
310E        4233
140G        4083
            ... 
EX210-5        1
KX025          1
EX120-5F       1
EX100-5E       1
HW180          1
Name: count, Length: 4999, dtype: int64

Value counts for column 'fiBaseModel':
fiBaseModel
580      19798
310      17354
D6       13110
416      12687
D5        9342
         ...  
830-2        1
272          1
PC230        1
KBD65        1
HW180        1
Name: count, Length: 1950, dtype: int64


In [None]:
df.describe()

In [None]:
df.UsageBand.value_counts()

In [None]:
df['UsageBand_2'] = df['UsageBand'].map({'Low': 1, 'Medium': 2, 'High': 3})
df.head()

In [None]:
# mean (target) encoding for fiModelDesc
# Create a dictionary to map fiModelDesc values to their mean values
mean_values = df.groupby('fiModelDesc')['SalePrice'].mean().to_dict()
# Replace fiModelDesc values with their mean values
df['fiModelDesc_2'] = df['fiModelDesc'].map(mean_values)

df.head()

In [18]:
# the most simple and most aggressive and dumb data processing
df2 = df.select_dtypes('number')  # drop all categorical variables
df2 = df2.dropna()  # drop all rows with missing values
df2 = df2.set_index('SalesID')
df2

SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand_2,fiModelDesc_2
1139246,66000,999089,3157,121,3.0,2004,68.0,1.0,46081.730769
1139248,57000,117657,77,121,3.0,1996,4640.0,1.0,55499.064371
1139249,10000,434808,7009,121,3.0,2001,2838.0,3.0,12437.068966
1139251,38500,1026470,332,121,3.0,2001,3486.0,3.0,31495.714286
1139253,11000,1057373,17311,121,3.0,2007,722.0,2.0,11038.404908
...,...,...,...,...,...,...,...,...,...
6319998,55000,1831997,5864,149,16.0,1999,24.0,1.0,44508.219178
6320871,15200,1827657,18266,149,0.0,2006,24.0,1.0,15210.000000
6321495,11200,1925538,18648,149,99.0,1000,2196.0,1.0,17232.258065
6321692,13000,1920012,18588,149,99.0,2008,24.0,1.0,13265.151515


In [19]:
X = df2.drop(columns='SalePrice')
y = df2['SalePrice']

X_small = X.sample(20000)
y_small = y.loc[X_small.index]

X_train, X_test, y_train, y_test = train_test_split(X_small, y_small, test_size=0.3, random_state=42)


In [20]:
%%time

model = RandomForestRegressor()
model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

def RMSE(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

print('Train RMSE:', RMSE(y_train, y_train_pred))
print('Test RMSE:', RMSE(y_test, y_test_pred))

Train RMSE: 3743.379206231142
Test RMSE: 9842.151455889727
CPU times: user 8.74 s, sys: 2.81 ms, total: 8.74 s
Wall time: 8.85 s


In [23]:
X.MachineID.nunique(), X.auctioneerID.nunique()

(60552, 30)

In [24]:
%%time

X2 = X_small.drop(columns=['MachineID'])
X_train, X_test, y_train, y_test = train_test_split(X2, y_small, test_size=0.3, random_state=42)

model = RandomForestRegressor()
model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

def RMSE(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

print('Train RMSE:', RMSE(y_train, y_train_pred))
print('Test RMSE:', RMSE(y_test, y_test_pred))

Train RMSE: 3760.1239464306695
Test RMSE: 9767.054721812301
CPU times: user 7.14 s, sys: 7.48 ms, total: 7.14 s
Wall time: 8.2 s


In [25]:
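# Recreate the engineered features on the validation set, reusing the
# mappings learned from the training data (mean_values comes from the
# mean-encoding cell above); without this, selecting X2.columns raises
# KeyError: "['UsageBand_2', 'fiModelDesc_2'] not in index"
df_valid['UsageBand_2'] = df_valid['UsageBand'].map({'Low': 1, 'Medium': 2, 'High': 3})
df_valid['fiModelDesc_2'] = df_valid['fiModelDesc'].map(mean_values)

X_valid = df_valid.set_index('SalesID')[X2.columns].copy()

# the dumbest null handling possible - fill with mean value
# (this also covers fiModelDesc values that never appear in the training data)
for col in X_valid.columns:
    X_valid[col] = X_valid[col].fillna(X_valid[col].mean())

y_valid_pred = model.predict(X_valid)
y_valid_pred = pd.Series(y_valid_pred, index=X_valid.index, name='SalePrice')
y_valid_pred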

In [None]:
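from datetime import datetime

# Timestamped filename; strftime avoids the ':' characters that isoformat()
# produces, which are not allowed in filenames on some systems
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
y_valid_pred.to_csv(f'submission_{timestamp}.csv')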

In [None]:
### Feature importance
pd.Series(
    model.feature_importances_,
    index=model.feature_names_in_
).sort_values(ascending=False)

ModelID                     0.635940
YearMade                    0.188489
MachineHoursCurrentMeter    0.107432
auctioneerID                0.034715
datasource                  0.033423
dtype: float64