<a href="https://colab.research.google.com/github/IlyaZutler/Project_2-Trucks/blob/main/DM%20_%20Project_2_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dynamic mitochondria Project - Heavy Machinery Auction Price Estimator

> https://www.kaggle.com/t/9baafb8850d74e4499c7b1ba97d6f115

### Timeline
- **Start Date:** [Start Date]
- **End Date:** 14/07/2024 (11 days to go)

## Getting Started

### 1. Environment Setup

Set up your Python environment with the necessary libraries:

```bash
pip install pandas numpy scikit-learn matplotlib seaborn gdown
```

### 2. Exploratory Data Analysis (EDA)

Conduct EDA to understand the dataset and identify any data quality issues. Look for missing values, outliers, and relationships between features and the target variable.
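
A minimal sketch of these checks, run on a tiny made-up frame so that it stands alone. The column names mirror the competition data, but the values and the `eda_df` name are invented for illustration; in the project the frame would come from `pd.read_csv('train.csv')`.

```python
import numpy as np
import pandas as pd

# Tiny invented sample standing in for the real training data
eda_df = pd.DataFrame({
    'SalePrice': [66000, 57000, 10000, 38500, 11000, 26500],
    'YearMade': [2004, 1996, 2001, 2001, 2007, 1000],
    'MachineHoursCurrentMeter': [68.0, 4640.0, np.nan, 3486.0, 722.0, np.nan],
    'UsageBand': ['Low', 'Low', None, 'High', 'Medium', None],
})

# Share of missing values per column, largest first
missing_share = eda_df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Summary statistics expose suspicious values such as YearMade == 1000
print(eda_df.describe())

# Linear correlation of each numeric feature with the target
numeric_part = eda_df.select_dtypes('number')
target_corr = numeric_part.corr()['SalePrice'].drop('SalePrice')
print(target_corr)
```

Correlation only captures linear relationships, so it is a first screen rather than a ranking of feature usefulness.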

### 3. Data Preprocessing

- Handle missing values appropriately.
- Encode categorical variables.
- Normalize or standardize numerical features if necessary.
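
The three steps above could be wired together roughly as follows. This is a sketch under assumptions, not the project's actual preprocessing: the `toy` frame, the choice of median imputation, and the `'missing'` placeholder category are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Invented sample with one numeric and one categorical column
toy = pd.DataFrame({
    'MachineHoursCurrentMeter': [68.0, 4640.0, np.nan, 3486.0],
    'UsageBand': ['Low', 'Low', np.nan, 'High'],
})
numeric_cols = ['MachineHoursCurrentMeter']
categorical_cols = ['UsageBand']

# Numeric columns: fill gaps with the median, then standardize
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical columns: treat missing as its own category, then encode as integers
categorical_steps = Pipeline([
    ('impute', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encode', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
])

preprocess = ColumnTransformer([
    ('num', numeric_steps, numeric_cols),
    ('cat', categorical_steps, categorical_cols),
])

prepared = preprocess.fit_transform(toy)
print(prepared)
```

Tree-based models such as a random forest do not need scaled inputs, so the scaling step matters mainly if a different model family is tried later.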

### 4. Baseline Model

Start with a baseline model to establish a benchmark performance. A simple approach is to predict the mean of the target variable:

```python
import pandas as pd

train = pd.read_csv('train.csv')
baseline_prediction = train['SalePrice'].mean()

# Create a submission file
valid = pd.read_csv('valid.csv')
submission = pd.DataFrame({'SalesID': valid['SalesID'], 'SalePrice': baseline_prediction})
submission.to_csv('baseline_submission.csv', index=False)
```

### 5. RandomForestRegressor Model

Build a RandomForestRegressor model:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

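# Split the data. Keep numeric columns only and fill gaps: the forest does not
# accept raw strings, and NaN support depends on the scikit-learn version.
X = train.drop(columns=['SalePrice', 'SalesID']).select_dtypes('number')
X = X.fillna(X.median())
y = train['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)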

# Train the model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Validate the model
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'RMSE: {rmse}')
```

### 6. Model Improvement

- Handle missing values and categorical variables more effectively.
- Use feature importances to identify key features.
- Perform feature engineering to create new informative features.
- Tune hyperparameters using grid search or other techniques.
- Monitor for overfitting by comparing training and testing performance.
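
As an illustration of the tuning and overfitting checks listed above, here is a small grid search on synthetic data. The data, the parameter grid, and the names `X_demo`, `y_demo`, and `search` are assumptions made for the example; a real grid for this dataset would need to be chosen by experiment.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression problem: only the first two features carry signal
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Small illustrative grid, scored by cross-validated RMSE
param_grid = {'max_depth': [3, None], 'min_samples_leaf': [1, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=42),
    param_grid,
    scoring='neg_root_mean_squared_error',
    cv=3,
)
search.fit(X_tr, y_tr)
best_model = search.best_estimator_
print(search.best_params_)

# A large gap between train and test error is a sign of overfitting
train_rmse = np.sqrt(mean_squared_error(y_tr, best_model.predict(X_tr)))
test_rmse = np.sqrt(mean_squared_error(y_te, best_model.predict(X_te)))
print(f'Train RMSE: {train_rmse:.3f}, Test RMSE: {test_rmse:.3f}')

# Feature importances of the tuned model, one value per input column
print(best_model.feature_importances_)
```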

### 7. Final Submission

Generate predictions for the validation set:

```python
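valid = pd.read_csv('valid.csv')
# Use exactly the columns the model was trained on, filled the same way
X_valid = valid[X_train.columns]
X_valid = X_valid.fillna(X_train.median(numeric_only=True))
y_valid_pred = model.predict(X_valid)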

# Create a submission file
submission = pd.DataFrame({'SalesID': valid['SalesID'], 'SalePrice': y_valid_pred})
submission.to_csv('final_submission.csv', index=False)
```

## Practical Data Science Guidelines

- **Efficient Workflows:** Use a random subset of 20,000 rows for initial experiments. Use the full dataset for the final submission.
- **Iterative Approach:** Start with a basic model and iteratively improve it by trying small ideas.
- **Feature Engineering:** Transform and combine existing features creatively.
- **Documentation:** Keep track of your experiments and results. Document what works and what doesn't.
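
The subsampling advice can be sketched as follows. The `full_df` frame is a random stand-in for the real training data, and fixing `random_state` is a suggestion for keeping experiments comparable between runs.

```python
import numpy as np
import pandas as pd

# Stand-in for the full training frame; in the project this would be `train`
full_df = pd.DataFrame({
    'SalesID': np.arange(50_000),
    'SalePrice': np.random.default_rng(0).integers(5_000, 100_000, size=50_000),
})

# A fixed random_state keeps the subset identical between runs
subset = full_df.sample(n=min(20_000, len(full_df)), random_state=42)
print(subset.shape)
```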

## Collaboration and Presentation

- **Collaboration:** Discuss your work openly within your team or with other teams. Sharing insights and learning from each other is encouraged.
- **Presentation:** Present your methodology, results, and the techniques that helped the most. Document your journey and the steps you took to achieve your results.



In [10]:
# !pip install --upgrade scikit-learn
# # Try upgrading again, sometimes package managers need a nudge
# !pip install -U scikit-learn
# # Alternatively, force a reinstall to ensure the latest version

import gdown
from pathlib import Path

import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
#from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

import sklearn
print(sklearn.__version__)

ImportError: cannot import name '_fit_context' from 'sklearn.base' (/usr/local/lib/python3.10/dist-packages/sklearn/base.py)

In [1]:
import gdown
from pathlib import Path

import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error


ImportError: cannot import name 'SimpleImputer' from 'sklearn.preprocessing' (/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/__init__.py)

In [3]:
!pip install --upgrade scikit-learn
# `-U` is shorthand for `--upgrade`, so a second call adds nothing.
# If the import still fails after upgrading, restarting the runtime so the
# new version is actually loaded may help.

import gdown
from pathlib import Path

import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer # Import from the correct module
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error



ImportError: cannot import name '_fit_context' from 'sklearn.base' (/usr/local/lib/python3.10/dist-packages/sklearn/base.py)

In [None]:
def download_from_gdrive(url, filename):
    # Extract the file ID from the URL
    file_id = url.split('/')[-2]
    download_url = f"https://drive.google.com/uc?id={file_id}"

    # Download the file
    if Path(filename).exists():
        print(f"File '{filename}' already exists. Skipping download.")
    else:
        gdown.download(download_url, filename, quiet=False)
        print(f"File downloaded as: {filename}")

train = 'https://drive.google.com/file/d/1guqSpDv1Q7ZZjSbXMYGbrTvGns0VCyU5/view?usp=drive_link'
valid = 'https://drive.google.com/file/d/1j7x8xhMimKbvW62D-XeDfuRyj9ia636q/view?usp=drive_link'
# Example usage

download_from_gdrive(train, 'train.csv')
download_from_gdrive(valid, 'valid.csv')

df = pd.read_csv('train.csv')
df_valid = pd.read_csv('valid.csv')
df.head().T

File 'train.csv' already exists. Skipping download.
File 'valid.csv' already exists. Skipping download.




| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| SalesID | 1139246 | 1139248 | 1139249 | 1139251 | 1139253 |
| SalePrice | 66000 | 57000 | 10000 | 38500 | 11000 |
| MachineID | 999089 | 117657 | 434808 | 1026470 | 1057373 |
| ModelID | 3157 | 77 | 7009 | 332 | 17311 |
| datasource | 121 | 121 | 121 | 121 | 121 |
| auctioneerID | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| YearMade | 2004 | 1996 | 2001 | 2001 | 2007 |
| MachineHoursCurrentMeter | 68.0 | 4640.0 | 2838.0 | 3486.0 | 722.0 |
| UsageBand | Low | Low | High | High | Medium |
| saledate | 11/16/2006 0:00 | 3/26/2004 0:00 | 2/26/2004 0:00 | 5/19/2011 0:00 | 7/23/2009 0:00 |


## Exploratory Data Analysis (EDA)

In [None]:
#df.fiProductClassDesc.value_counts()

In [None]:
#df.info()

In [None]:
# df.isnull().sum()

In [None]:
#df.describe()

In [None]:
#sns.histplot(data=df, x='SalePrice', bins=20)

In [None]:
# Show value_counts for every categorical column. Note that some truly categorical columns, such as ModelID, are stored as numbers.
categorical_cols = df.select_dtypes(include='object').columns
for col in categorical_cols:
  print(f"Value counts for column '{col}':")
  print(df[col].value_counts())
  print()

### 3. Data Preprocessing

In [None]:
# Fix inconsistent spelling in 'Transmission': replace 'AutoShift' with 'Autoshift'
df['Transmission'] = df['Transmission'].replace('AutoShift', 'Autoshift')

In [None]:
def fix_transmission(X):
    X = X.copy()
    X['Transmission'] = X['Transmission'].replace('AutoShift', 'Autoshift')
    return X

In [None]:
# Convert 'saledate' to datetime
df['saledate'] = pd.to_datetime(df['saledate'])
# Extract the year
df['saleYear'] = df['saledate'].dt.year
df['saleMonth'] = df['saledate'].dt.month

df = df.drop('saledate', axis=1)

In [None]:
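def fix_year(X):
    X = X.copy()
    # Parse the raw date strings first; the .dt accessor only works on datetimes
    saledate = pd.to_datetime(X['saledate'])
    X['saleYear'] = saledate.dt.year
    X['saleMonth'] = saledate.dt.month
    X = X.drop('saledate', axis=1)
    return X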

In [None]:
# New feature: the first word of fiProductClassDesc
# df['fiProductClassDesc_first_word'] = df['fiProductClassDesc'].apply(lambda x: x.split()[0])

In [None]:
def fix_first_word(X):
    X = X.copy()
    # The .str accessor leaves missing descriptions as NaN instead of raising
    X['fiProductClassDesc_first_word'] = X['fiProductClassDesc'].str.split().str[0]
    return X

In [None]:
# OrdinalEncoder for column 'ProductSize':
# df['ProductSize_3'] = df['ProductSize'].map({'Compact': 1, 'Mini': 2, 'High': 3, 'Small': 4, 'Medium': 5, 'High': 6, 'Large / Medium': 7, 'Large': 8})
# df['ProductSize_3'] = df['ProductSize_3'].fillna(df['ProductSize_3'].mean())
# df['ProductSize_3'].value_counts(), df['ProductSize_3'].isnull().sum()

In [None]:
# Processing of ordinal categorical features using scikit-learn's OrdinalEncoder
# encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']], handle_unknown='use_encoded_value', unknown_value= -1) # Handle unknown values
# df['UsageBand_3'] = encoder.fit_transform(df[['UsageBand']])
# mean_real = df['UsageBand_3'][df.UsageBand_3.isin([0,1,2])].mean()
# df['UsageBand_3'] = df['UsageBand_3'].replace(-1, mean_real)
# # Fix: Fill missing values in the 'UsageBand_3' column and keep it in the DataFrame
# df['UsageBand_3'] = df['UsageBand_3'].fillna(mean_real)
# df['UsageBand_3'].value_counts(), df['UsageBand_3'].isnull().sum()



In [None]:
# 'Undercarriage_Pad_Width' - remove ' inch' from data - to numeric
# df['Undercarriage_Pad_Width_3'] = df['Undercarriage_Pad_Width'].str.replace(' inch', '').replace('None or Unspecified', -1)
# df['Undercarriage_Pad_Width_3'] = pd.to_numeric(df['Undercarriage_Pad_Width_3'])
# mean_real_2 = df['Undercarriage_Pad_Width_3'][df.Undercarriage_Pad_Width_3 != -1].mean()
# df['Undercarriage_Pad_Width_3'] = df['Undercarriage_Pad_Width_3'].replace(-1, mean_real_2)
# df['Undercarriage_Pad_Width_3'] = df['Undercarriage_Pad_Width_3'].fillna(mean_real_2)
# df['Undercarriage_Pad_Width_3'].value_counts(),df['Undercarriage_Pad_Width_3'].isnull().sum()

In [None]:
# 'Stick_Length' - remove " from data - to numeric
# df['Stick_Length_3'] = df['Stick_Length'].str.replace("' ", '.').str.replace('"', '').replace('None or Unspecified', -1)
# df['Stick_Length_3'] = pd.to_numeric(df['Stick_Length_3'])
# mean_real_3 = df['Stick_Length_3'][df.Stick_Length_3.isna() | df.Stick_Length_3 != -1 ].mean()
# print(mean_real_3)
# df['Stick_Length_3'] = df['Stick_Length_3'].replace(-1, mean_real_3)
# df['Stick_Length_3'] = df['Stick_Length_3'].fillna(mean_real_3)
# df['Stick_Length_3'].value_counts(), df['Stick_Length_3'].isnull().sum()

In [None]:
#df[df['YearMade'] == 1000].head(20) # I have no idea what to do with year 1000
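
One possible treatment, offered as an assumption rather than the approach used in this notebook: read `YearMade == 1000` as a placeholder for an unknown year, convert it to a missing value, and keep a flag so the model can still see that the year was absent. The `years` frame and the `YearMade_missing` column name are made up for the example.

```python
import numpy as np
import pandas as pd

# Invented sample with two placeholder years
years = pd.DataFrame({'YearMade': [2004, 1000, 1996, 1000, 2007]})

# Assumption: 1000 is a sentinel for "year unknown", not a real build year
years['YearMade_missing'] = years['YearMade'].eq(1000)
years['YearMade'] = years['YearMade'].where(~years['YearMade_missing'], np.nan)

# Fill the gaps with the median of the known years
years['YearMade'] = years['YearMade'].fillna(years['YearMade'].median())
print(years)
```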

## Mean / Target Encoding

In [None]:
# Mean / target encoding for all non-numeric columns
# global_mean_target = df['SalePrice'].mean()

# Replace values with their target mean values
def target_encode(cat, target_mean_dict, target_nan_mean_dict):
    return cat.map(target_mean_dict).fillna(target_nan_mean_dict)

# Create a dictionary to map non-numerical column values to their mean target values
target_mean_dict = {}
target_nan_mean_dict = {}
for col in df.select_dtypes(exclude='number').columns:
    target_mean_dict[col] = df.groupby(col)['SalePrice'].mean().to_dict()
    target_nan_mean_dict[col] = df[df[col].isna()]['SalePrice'].mean()

# Create new columns with target-encoded values
for col in df.select_dtypes(exclude='number').columns:
    df[col + '_2'] = target_encode(df[col], target_mean_dict[col], target_nan_mean_dict[col])
    df[col + '_2'] = df[col + '_2'].astype(float)


In [None]:
def mean_target_encode(X):
    X = X.copy()

    target_mean_dict = {}
    target_nan_mean_dict = {}
    for col in X.select_dtypes(exclude='number').columns:
        target_mean_dict[col] = X.groupby(col)['SalePrice'].mean().to_dict()
        target_nan_mean_dict[col] = X[X[col].isna()]['SalePrice'].mean()

    for col in X.select_dtypes(exclude='number').columns:
        X[col + '_2'] = X[col].map(target_mean_dict[col]).fillna(target_nan_mean_dict[col])
        X[col + '_2'] = X[col + '_2'].astype(float)

    return X
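
Because `mean_target_encode` reads `SalePrice` from the frame it is given, it cannot be applied to `valid.csv`, which has no target. A hedged sketch of one way around this is an encoder that learns the category means once on the training data and reuses them afterwards. The `TargetMeanEncoder` class, the `'__missing__'` placeholder, and the fallback to the global mean for unseen categories are all assumptions for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TargetMeanEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical helper: replace each category with its mean target from training."""

    def fit(self, X, y):
        # Align X and y on a fresh 0..n-1 index before grouping
        y = pd.Series(y).reset_index(drop=True)
        X = X.reset_index(drop=True)
        self.global_mean_ = float(y.mean())
        self.columns_ = list(X.select_dtypes(exclude='number').columns)
        # Missing values are treated as their own category
        self.maps_ = {
            col: y.groupby(X[col].fillna('__missing__')).mean().to_dict()
            for col in self.columns_
        }
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns_:
            encoded = X[col].fillna('__missing__').map(self.maps_[col])
            # Categories unseen during fit fall back to the global mean
            X[col + '_2'] = encoded.fillna(self.global_mean_).astype(float)
        return X


# Invented data: fit on "training" rows, then encode rows without a target
train_demo = pd.DataFrame({'UsageBand': ['Low', 'Low', 'High', None]})
prices = pd.Series([10000, 20000, 40000, 30000])
valid_demo = pd.DataFrame({'UsageBand': ['Low', 'Medium', None]})

encoder = TargetMeanEncoder().fit(train_demo, prices)
encoded_valid = encoder.transform(valid_demo)
print(encoded_valid)
```

Since the class follows the scikit-learn `fit`/`transform` convention, it could stand in for the `FunctionTransformer(mean_target_encode)` step of a `Pipeline`, where `fit` receives `y` separately.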


In [None]:
def drop_columns(X, col):
    X = X.copy()
    X = X.drop(columns = col)
    return X

# preprocessing of categorical ordinal features

# Data for the model

In [None]:
# Prepare data
df2 = df.select_dtypes('number')  # drop all categorical variables
df2 = df2.fillna(df2.mean())
df2 = df2.set_index('SalesID')

In [None]:
def prepare_data(X):
    X = X.copy()
    X = X.select_dtypes('number')  # drop all categorical variables
    X = X.fillna(X.mean())  # use the frame's own means, not the global df
    X = X.set_index('SalesID')
    return X

In [None]:
y = df2['SalePrice']
X = df2.drop(columns=['SalePrice', 'MachineID', 'ModelID'])
#, 'ProductSize_2', 'UsageBand_2', 'Undercarriage_Pad_Width_2','Stick_Length_2'
scaler = MinMaxScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

X_small = X.sample(150000)
y_small = y.loc[X_small.index]

X_train, X_test, y_train, y_test = train_test_split(X_small, y_small, test_size=0.3, random_state=42)

In [None]:
# NOTE: mean_target_encode reads X['SalePrice'], so as written this pipeline
# only runs on a frame that still contains the target column.
pipeline = Pipeline([
    ('transmission_fixer', FunctionTransformer(fix_transmission)),
    ('year_fixer', FunctionTransformer(fix_year)),
    ('first_word_fixer', FunctionTransformer(fix_first_word)),
    ('mean_target_encoder', FunctionTransformer(mean_target_encode)),
    # kw_args must be a dict that maps argument names to values
    ('drop', FunctionTransformer(drop_columns, kw_args={'col': ['MachineID', 'ModelID']})),
    ('prepare_data', FunctionTransformer(prepare_data)),
    # Impute only after prepare_data, once just numeric columns remain
    ('fillna', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),  # Data scaling step (min-max normalization)
    ('model', RandomForestRegressor(n_jobs=-1,
                              n_estimators = 200,
                              #max_depth = 12,
                              min_impurity_decrease = 1000
                              )),
])

# Train the pipeline on the training set.
# Assumes X_train / X_test still hold the raw columns the steps above expect
# (for example 'Transmission' and 'saledate').
pipeline.fit(X_train, y_train)

# Make predictions on the training and test sets
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

print(f'Train RMSE:', np.sqrt(mean_squared_error(y_train, y_train_pred)))
print(f'Test RMSE:', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print(f'R²:' , r2_score(y_test, y_test_pred))
print(f'Train MAE:', mean_absolute_error(y_train, y_train_pred))
print(f'Test MAE:', mean_absolute_error(y_test, y_test_pred))


In [None]:
%%time
model = RandomForestRegressor(n_jobs=-1,
                              n_estimators = 200,
                              #max_depth = 12,
                              min_impurity_decrease = 1000

                              )
model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print(f'Train RMSE:', np.sqrt(mean_squared_error(y_train, y_train_pred)))
print(f'Test RMSE:', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print(f'R²:' , r2_score(y_test, y_test_pred))
print(f'Train MAE:', mean_absolute_error(y_train, y_train_pred))
print(f'Test MAE:', mean_absolute_error(y_test, y_test_pred))


Train RMSE: 3603.672192370149
Test RMSE: 7285.839171813587
R²: 0.9002193251448545
Train MAE: 2692.8154358164456
Test MAE: 4679.875259798874
CPU times: user 4min 46s, sys: 1.09 s, total: 4min 47s
Wall time: 2min 54s


In [None]:
### Feature importance
pd.Series(
    model.feature_importances_,
    index=model.feature_names_in_
).sort_values(ascending=False)

fiModelDesc_2                      0.796909
saleYear                           0.069227
YearMade                           0.055596
saleMonth                          0.009361
fiBaseModel_2                      0.007822
fiProductClassDesc_2               0.007599
state_2                            0.007339
Enclosure_2                        0.004495
fiSecondaryDesc_2                  0.004020
MachineHoursCurrentMeter           0.003893
auctioneerID                       0.003876
Ripper_2                           0.003363
ProductSize_2                      0.002491
Tire_Size_2                        0.002182
fiModelDescriptor_2                0.001900
fiModelSeries_2                    0.001696
Blade_Type_2                       0.001577
Hydraulics_2                       0.001470
Stick_Length_2                     0.001459
UsageBand_2                        0.001303
Coupler_2                          0.001081
datasource                         0.001062
Blade_Width_2                   