<a href="https://colab.research.google.com/github/IlyaZutler/Project_2-Trucks/blob/main/DM%20_%20Project_2_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dynamic mitochondria Project - Heavy Machinery Auction Price Estimator

> https://www.kaggle.com/t/9baafb8850d74e4499c7b1ba97d6f115

### Timeline
- **Start Date:** [Start Date]
- **End Date:** 14/07/2024 (11 days to go)

### 2. Exploratory Data Analysis (EDA)


### 3. Data Preprocessing

### 6. Model Improvement

- Handle missing values and categorical variables more effectively.
- Use feature importances to identify key features.
- Perform feature engineering to create new informative features.
- Tune hyperparameters using grid search or other techniques.
- Monitor for overfitting by comparing training and testing performance.
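
The tuning and overfitting checks listed above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual setup: `X_demo`, `y_demo`, and the grid values are placeholders chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): swap in the real feature matrix and target.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Small illustrative grid; the values are placeholders, not tuned recommendations.
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_tr, y_tr)

# Compare train and test error of the refitted best model to spot overfitting.
best_model = search.best_estimator_
train_rmse = np.sqrt(mean_squared_error(y_tr, best_model.predict(X_tr)))
test_rmse = np.sqrt(mean_squared_error(y_te, best_model.predict(X_te)))
print("Best params:", search.best_params_)
print("Train RMSE:", train_rmse, "Test RMSE:", test_rmse)
```

A large gap between the two RMSE values suggests the model is memorising the training rows rather than generalising.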

## Practical Data Science Guidelines

- **Efficient Workflows:** Use a random subset of 20,000 rows for initial experiments. Use the full dataset for the final submission.
- **Iterative Approach:** Start with a basic model and iteratively improve it by trying small ideas.
- **Feature Engineering:** Transform and combine existing features creatively.
- **Documentation:** Keep track of your experiments and results. Document what works and what doesn't.
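
A possible way to draw the reproducible subset mentioned above is sketched here. The helper name `sample_rows` and the toy frame `demo` are hypothetical; only `DataFrame.sample` is a real pandas call.

```python
import pandas as pd


def sample_rows(frame, n=20_000, seed=42):
    """Return at most n randomly chosen rows; the seed keeps the draw reproducible."""
    return frame.sample(n=min(n, len(frame)), random_state=seed)


# Toy frame standing in for the training data (assumption).
demo = pd.DataFrame({"SalePrice": range(100)})
subset = sample_rows(demo, n=10)
print(len(subset))
```

Capping `n` at the frame length avoids the error pandas raises when asked for more rows than exist.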

## Collaboration and Presentation

- **Collaboration:** Discuss your work openly within your team or with other teams. Sharing insights and learning from each other is encouraged.
- **Presentation:** Present your methodology, results, and the techniques that helped the most. Document your journey and the steps you took to achieve your results.



In [86]:
import gdown
from pathlib import Path

import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error


In [87]:
def download_from_gdrive(url, filename):
    # Extract the file ID from the URL
    file_id = url.split('/')[-2]
    download_url = f"https://drive.google.com/uc?id={file_id}"

    # Download the file
    if Path(filename).exists():
        print(f"File '{filename}' already exists. Skipping download.")
    else:
        gdown.download(download_url, filename, quiet=False)
        print(f"File downloaded as: {filename}")

train = 'https://drive.google.com/file/d/1guqSpDv1Q7ZZjSbXMYGbrTvGns0VCyU5/view?usp=drive_link'
valid = 'https://drive.google.com/file/d/1j7x8xhMimKbvW62D-XeDfuRyj9ia636q/view?usp=drive_link'
# Example usage

download_from_gdrive(train, 'train.csv')
download_from_gdrive(valid, 'valid.csv')

df = pd.read_csv('train.csv')
df_valid = pd.read_csv('valid.csv')

File 'train.csv' already exists. Skipping download.
File 'valid.csv' already exists. Skipping download.




In [88]:
df

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,fiModelDesc,fiBaseModel,fiSecondaryDesc,fiModelSeries,fiModelDescriptor,ProductSize,fiProductClassDesc,state,ProductGroup,ProductGroupDesc,Drive_System,Enclosure,Forks,Pad_Type,Ride_Control,Stick,Transmission,Turbocharged,Blade_Extension,Blade_Width,Enclosure_Type,Engine_Horsepower,Hydraulics,Pushblock,Ripper,Scarifier,Tip_Control,Tire_Size,Coupler,Coupler_System,Grouser_Tracks,Hydraulics_Flow,Track_Type,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
0,1139246,66000,999089,3157,121,3.0,2004,68.0,Low,11/16/2006 0:00,521D,521,D,,,,Wheel Loader - 110.0 to 120.0 Horsepower,Alabama,WL,Wheel Loader,,EROPS w AC,None or Unspecified,,None or Unspecified,,,,,,,,2 Valve,,,,,None or Unspecified,None or Unspecified,,,,,,,,,,,,,Standard,Conventional
1,1139248,57000,117657,77,121,3.0,1996,4640.0,Low,3/26/2004 0:00,950FII,950,F,II,,Medium,Wheel Loader - 150.0 to 175.0 Horsepower,North Carolina,WL,Wheel Loader,,EROPS w AC,None or Unspecified,,None or Unspecified,,,,,,,,2 Valve,,,,,23.5,None or Unspecified,,,,,,,,,,,,,Standard,Conventional
2,1139249,10000,434808,7009,121,3.0,2001,2838.0,High,2/26/2004 0:00,226,226,,,,,Skid Steer Loader - 1351.0 to 1601.0 Lb Operat...,New York,SSL,Skid Steer Loaders,,OROPS,None or Unspecified,,,,,,,,,,Auxiliary,,,,,,None or Unspecified,None or Unspecified,None or Unspecified,Standard,,,,,,,,,,,
3,1139251,38500,1026470,332,121,3.0,2001,3486.0,High,5/19/2011 0:00,PC120-6E,PC120,,-6E,,Small,"Hydraulic Excavator, Track - 12.0 to 14.0 Metr...",Texas,TEX,Track Excavators,,EROPS w AC,,,,,,,,,,,2 Valve,,,,,,None or Unspecified,,,,,,,,,,,,,,
4,1139253,11000,1057373,17311,121,3.0,2007,722.0,Medium,7/23/2009 0:00,S175,S175,,,,,Skid Steer Loader - 1601.0 to 1751.0 Lb Operat...,New York,SSL,Skid Steer Loaders,,EROPS,None or Unspecified,,,,,,,,,,Auxiliary,,,,,,None or Unspecified,None or Unspecified,None or Unspecified,Standard,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
401120,6333336,10500,1840702,21439,149,1.0,2005,,,11/2/2011 0:00,35NX2,35,NX,2,,Mini,"Hydraulic Excavator, Track - 3.0 to 4.0 Metric...",Maryland,TEX,Track Excavators,,EROPS,,,,,,,,,,,Auxiliary,,,,,,None or Unspecified,,,,Steel,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
401121,6333337,11000,1830472,21439,149,1.0,2005,,,11/2/2011 0:00,35NX2,35,NX,2,,Mini,"Hydraulic Excavator, Track - 3.0 to 4.0 Metric...",Maryland,TEX,Track Excavators,,EROPS,,,,,,,,,,,Standard,,,,,,None or Unspecified,,,,Steel,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
401122,6333338,11500,1887659,21439,149,1.0,2005,,,11/2/2011 0:00,35NX2,35,NX,2,,Mini,"Hydraulic Excavator, Track - 3.0 to 4.0 Metric...",Maryland,TEX,Track Excavators,,EROPS,,,,,,,,,,,Auxiliary,,,,,,None or Unspecified,,,,Steel,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
401123,6333341,9000,1903570,21435,149,2.0,2005,,,10/25/2011 0:00,30NX,30,NX,,,Mini,"Hydraulic Excavator, Track - 2.0 to 3.0 Metric...",Florida,TEX,Track Excavators,,EROPS,,,,,,,,,,,Standard,,,,,,None or Unspecified,,,,Steel,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,


## Exploratory Data Analysis (EDA)

In [89]:
#df.fiProductClassDesc.value_counts()

In [90]:
#df.info()

In [91]:
 #df.SalesID.nunique()

In [92]:
# df.isnull().sum()

In [93]:
#df.describe()

In [94]:
#sns.histplot(data=df, x='SalePrice', bins=20)

In [95]:
# Show value counts for all categorical columns. Note that some truly categorical columns, such as ModelID, have a numeric dtype and are skipped here.
categorical_cols = df.select_dtypes(exclude='number').columns
for col in categorical_cols:
  print(f"Value counts for column '{col}':")
  print(df[col].value_counts())
  print(f"NaN values:{df[col].isnull().sum()}")
  print()

Value counts for column 'UsageBand':
UsageBand
Medium    33985
Low       23620
High      12034
Name: count, dtype: int64
NaN values:331486

Value counts for column 'saledate':
saledate
2/16/2009 0:00    1932
2/15/2011 0:00    1352
2/19/2008 0:00    1300
2/15/2010 0:00    1219
2/11/2008 0:00    1100
                  ... 
1/16/2004 0:00       1
3/27/2006 0:00       1
7/25/2003 0:00       1
1/16/2006 0:00       1
6/9/2008 0:00        1
Name: count, Length: 3919, dtype: int64
NaN values:0

Value counts for column 'fiModelDesc':
fiModelDesc
310G        5039
416C        4869
580K        4315
310E        4233
140G        4083
            ... 
EX210-5        1
KX025          1
EX120-5F       1
EX100-5E       1
HW180          1
Name: count, Length: 4999, dtype: int64
NaN values:0

Value counts for column 'fiBaseModel':
fiBaseModel
580      19798
310      17354
D6       13110
416      12687
D5        9342
         ...  
830-2        1
272          1
PC230        1
KBD65        1
HW180        1


### 3. Data Preprocessing

In [96]:
def Num_to_Object(X, col):
    # Cast each listed numeric column to object so it is treated as categorical.
    for col_ in col:
        X[col_] = X[col_].astype('object')
    return X


df = Num_to_Object(df, col = ['datasource'])
df_valid = Num_to_Object(df_valid, col = ['datasource'])


In [97]:
#df['Transmission'] = df['Transmission'].replace('AutoShift', 'Autoshift')
def fix_mistakes(X, replacement_dict):

    for col, replacements in replacement_dict.items():
        X[col] = X[col].replace(replacements)
    return X


fix_mistakes(df, replacement_dict = {'Transmission': {'AutoShift': 'Autoshift'}})
fix_mistakes(df_valid, replacement_dict = {'Transmission': {'AutoShift': 'Autoshift'}})

Unnamed: 0,SalesID,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,fiModelDesc,fiBaseModel,fiSecondaryDesc,fiModelSeries,fiModelDescriptor,ProductSize,fiProductClassDesc,state,ProductGroup,ProductGroupDesc,Drive_System,Enclosure,Forks,Pad_Type,Ride_Control,Stick,Transmission,Turbocharged,Blade_Extension,Blade_Width,Enclosure_Type,Engine_Horsepower,Hydraulics,Pushblock,Ripper,Scarifier,Tip_Control,Tire_Size,Coupler,Coupler_System,Grouser_Tracks,Hydraulics_Flow,Track_Type,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
0,1222837,902859,1376,121,3,1000,0.0,,1/5/2012 0:00,375L,375,,,L,Large / Medium,"Hydraulic Excavator, Track - 66.0 to 90.0 Metr...",Kentucky,TEX,Track Excavators,,EROPS,,,,,,,,,,,Standard,,,,,,None or Unspecified,,,,Steel,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
1,1222839,1048320,36526,121,3,2006,4412.0,Medium,1/5/2012 0:00,TX300LC2,TX300,LC,2,,Large / Medium,"Hydraulic Excavator, Track - 28.0 to 33.0 Metr...",Connecticut,TEX,Track Excavators,,EROPS w AC,,,,,,,,,,,Auxiliary,,,,,,Hydraulic,,,,Steel,None or Unspecified,"12' 4""",None or Unspecified,Yes,Double,,,,,
2,1222841,999308,4587,121,3,2000,10127.0,Medium,1/5/2012 0:00,270LC,270,,,LC,Large / Medium,"Hydraulic Excavator, Track - 24.0 to 28.0 Metr...",Connecticut,TEX,Track Excavators,,EROPS w AC,,,,,,,,,,,Auxiliary,,,,,,None or Unspecified,,,,Steel,None or Unspecified,"12' 4""",None or Unspecified,None or Unspecified,Double,,,,,
3,1222843,1062425,1954,121,3,1000,4682.0,Low,1/5/2012 0:00,892DLC,892,D,,LC,Large / Medium,"Hydraulic Excavator, Track - 28.0 to 33.0 Metr...",Connecticut,TEX,Track Excavators,,EROPS,,,,,,,,,,,Standard,,,,,,None or Unspecified,,,,Steel,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
4,1222845,1032841,4701,121,3,2002,8150.0,Medium,1/4/2012 0:00,544H,544,H,,,,Wheel Loader - 120.0 to 135.0 Horsepower,Florida,WL,Wheel Loader,,EROPS w AC,None or Unspecified,,None or Unspecified,,,,,,,,2 Valve,,,,,20.5,Manual,,,,,,,,,,,,,Standard,Conventional
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11568,6333344,1919201,21435,149,2,2005,,,3/7/2012 0:00,30NX,30,NX,,,Mini,"Hydraulic Excavator, Track - 2.0 to 3.0 Metric...",Texas,TEX,Track Excavators,,EROPS,,,,,,,,,,,Standard,,,,,,None or Unspecified,,,,Steel,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
11569,6333345,1882122,21436,149,2,2005,,,1/28/2012 0:00,30NX2,30,NX,2,,Mini,"Hydraulic Excavator, Track - 3.0 to 4.0 Metric...",Florida,TEX,Track Excavators,,EROPS,,,,,,,,,,,Auxiliary,,,,,,None or Unspecified,,,,Steel,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
11570,6333347,1944213,21435,149,2,2005,,,1/28/2012 0:00,30NX,30,NX,,,Mini,"Hydraulic Excavator, Track - 2.0 to 3.0 Metric...",Florida,TEX,Track Excavators,,EROPS,,,,,,,,,,,Auxiliary,,,,,,None or Unspecified,,,,Rubber,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
11571,6333348,1794518,21435,149,2,2006,,,3/7/2012 0:00,30NX,30,NX,,,Mini,"Hydraulic Excavator, Track - 2.0 to 3.0 Metric...",Texas,TEX,Track Excavators,,EROPS,,,,,,,,,,,Auxiliary,,,,,,None or Unspecified,,,,Rubber,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,


In [98]:
def fix_data(X, date_col):
    for col in date_col:
        X[col] = pd.to_datetime(X[col])
        X[col + '_Year'] = X[col].dt.year
        X[col + '_Month'] = X[col].dt.month
        X = X.drop(col, axis=1)
    return X


df = fix_data(df, date_col = ['saledate'])
df_valid = fix_data(df_valid, date_col = ['saledate'])

In [99]:
def first_word_name(X, col):
    for col_ in col:
        X[col_+'_first_word'] = X[col_].apply(lambda x: x.split()[0] if isinstance(x, str) else x)
    return X


first_word_name(df, col = ['fiProductClassDesc'])
first_word_name(df_valid, col = ['fiProductClassDesc'])

Unnamed: 0,SalesID,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,fiBaseModel,fiSecondaryDesc,fiModelSeries,fiModelDescriptor,ProductSize,fiProductClassDesc,state,ProductGroup,ProductGroupDesc,Drive_System,Enclosure,Forks,Pad_Type,Ride_Control,Stick,Transmission,Turbocharged,Blade_Extension,Blade_Width,Enclosure_Type,Engine_Horsepower,Hydraulics,Pushblock,Ripper,Scarifier,Tip_Control,Tire_Size,Coupler,Coupler_System,Grouser_Tracks,Hydraulics_Flow,Track_Type,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls,saledate_Year,saledate_Month,fiProductClassDesc_first_word
0,1222837,902859,1376,121,3,1000,0.0,,375L,375,,,L,Large / Medium,"Hydraulic Excavator, Track - 66.0 to 90.0 Metr...",Kentucky,TEX,Track Excavators,,EROPS,,,,,,,,,,,Standard,,,,,,None or Unspecified,,,,Steel,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,,2012,1,Hydraulic
1,1222839,1048320,36526,121,3,2006,4412.0,Medium,TX300LC2,TX300,LC,2,,Large / Medium,"Hydraulic Excavator, Track - 28.0 to 33.0 Metr...",Connecticut,TEX,Track Excavators,,EROPS w AC,,,,,,,,,,,Auxiliary,,,,,,Hydraulic,,,,Steel,None or Unspecified,"12' 4""",None or Unspecified,Yes,Double,,,,,,2012,1,Hydraulic
2,1222841,999308,4587,121,3,2000,10127.0,Medium,270LC,270,,,LC,Large / Medium,"Hydraulic Excavator, Track - 24.0 to 28.0 Metr...",Connecticut,TEX,Track Excavators,,EROPS w AC,,,,,,,,,,,Auxiliary,,,,,,None or Unspecified,,,,Steel,None or Unspecified,"12' 4""",None or Unspecified,None or Unspecified,Double,,,,,,2012,1,Hydraulic
3,1222843,1062425,1954,121,3,1000,4682.0,Low,892DLC,892,D,,LC,Large / Medium,"Hydraulic Excavator, Track - 28.0 to 33.0 Metr...",Connecticut,TEX,Track Excavators,,EROPS,,,,,,,,,,,Standard,,,,,,None or Unspecified,,,,Steel,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,,2012,1,Hydraulic
4,1222845,1032841,4701,121,3,2002,8150.0,Medium,544H,544,H,,,,Wheel Loader - 120.0 to 135.0 Horsepower,Florida,WL,Wheel Loader,,EROPS w AC,None or Unspecified,,None or Unspecified,,,,,,,,2 Valve,,,,,20.5,Manual,,,,,,,,,,,,,Standard,Conventional,2012,1,Wheel
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11568,6333344,1919201,21435,149,2,2005,,,30NX,30,NX,,,Mini,"Hydraulic Excavator, Track - 2.0 to 3.0 Metric...",Texas,TEX,Track Excavators,,EROPS,,,,,,,,,,,Standard,,,,,,None or Unspecified,,,,Steel,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,,2012,3,Hydraulic
11569,6333345,1882122,21436,149,2,2005,,,30NX2,30,NX,2,,Mini,"Hydraulic Excavator, Track - 3.0 to 4.0 Metric...",Florida,TEX,Track Excavators,,EROPS,,,,,,,,,,,Auxiliary,,,,,,None or Unspecified,,,,Steel,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,,2012,1,Hydraulic
11570,6333347,1944213,21435,149,2,2005,,,30NX,30,NX,,,Mini,"Hydraulic Excavator, Track - 2.0 to 3.0 Metric...",Florida,TEX,Track Excavators,,EROPS,,,,,,,,,,,Auxiliary,,,,,,None or Unspecified,,,,Rubber,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,,2012,1,Hydraulic
11571,6333348,1794518,21435,149,2,2006,,,30NX,30,NX,,,Mini,"Hydraulic Excavator, Track - 2.0 to 3.0 Metric...",Texas,TEX,Track Excavators,,EROPS,,,,,,,,,,,Auxiliary,,,,,,None or Unspecified,,,,Rubber,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,,2012,3,Hydraulic


In [100]:
def ord_encod_nan(X, col, categories):
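    # Values outside `categories` are encoded as -1 and then turned into NaN.
    encoder = OrdinalEncoder(categories=[categories], handle_unknown='use_encoded_value', unknown_value=-1)
    X[col + '_3'] = encoder.fit_transform(X[[col]]).ravel()
    # Assign the result instead of using a chained inplace replace, which may not update X.
    X[col + '_3'] = X[col + '_3'].replace(-1, np.nan)
    X[col + '_3'] = pd.to_numeric(X[col + '_3'], errors='coerce')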

    return X


df = ord_encod_nan(df, col = 'UsageBand', categories = ['Low', 'Medium', 'High'])
df_valid = ord_encod_nan(df_valid, col = 'UsageBand', categories = ['Low', 'Medium', 'High'])

df = ord_encod_nan(df, col = 'ProductSize', categories = ['Mini', 'Compact', 'Small', 'Medium', 'Large / Medium', 'Large', 'High'])
df_valid = ord_encod_nan(df_valid, col = 'ProductSize', categories = ['Mini', 'Compact', 'Small', 'Medium', 'Large / Medium', 'Large', 'High'])

In [101]:
def replace_dict(X, col, repl_dict):
    new_name = col + '_3'

    X[new_name] = X[col]
    for old, new in repl_dict.items():
        X[new_name] = X[new_name].str.replace(old, new, regex=False)
    # 'None or Unspecified' and any other non-numeric text become NaN here.
    X[new_name] = pd.to_numeric(X[new_name], errors='coerce')

    return X

df = replace_dict(df, col = 'Undercarriage_Pad_Width', repl_dict= {' inch': ''})
df_valid = replace_dict(df_valid, col= 'Undercarriage_Pad_Width', repl_dict = {' inch': ''})

df = replace_dict(df, col = 'Stick_Length', repl_dict= {"' ": '.', '"': ''})
df_valid = replace_dict(df_valid, col= 'Stick_Length', repl_dict = {"' ": '.', '"': ''})
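
Note that the `Stick_Length` replacement above turns `12' 4"` into `12.4`, so the inches are read as a decimal fraction (a value such as `9' 10"` would become `9.1`). A possible alternative, sketched below, converts feet and inches to decimal feet. The helper name `feet_inches_to_feet` and the sample values are hypothetical.

```python
import re

import numpy as np
import pandas as pd


def feet_inches_to_feet(value):
    """Convert strings like 12' 4" to decimal feet; anything else becomes NaN."""
    if not isinstance(value, str):
        return np.nan
    match = re.fullmatch(r"\s*(\d+)'\s*(\d+)\"\s*", value)
    if match is None:
        return np.nan
    feet, inches = int(match.group(1)), int(match.group(2))
    return feet + inches / 12


# Illustrative values only (assumption), in the format seen in the table output.
lengths = pd.Series(["12' 4\"", "9' 10\"", "None or Unspecified", np.nan])
converted = lengths.map(feet_inches_to_feet)
print(converted)
```

With this conversion the numeric order of the column matches the physical order of the stick lengths.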

In [102]:
#df[df['YearMade'] == 1000].head(20) # I have no idea what to do with year 1000
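
One option for `YearMade == 1000` is sketched below, under the assumption that 1000 is a placeholder for an unknown year rather than a real value. The helper `mask_placeholder_year` and the `_missing` flag column are hypothetical names.

```python
import numpy as np
import pandas as pd


def mask_placeholder_year(frame, col="YearMade", placeholder=1000):
    """Return a copy where the placeholder year is NaN, plus a 0/1 flag column."""
    out = frame.copy()
    is_placeholder = out[col] == placeholder
    out[col + "_missing"] = is_placeholder.astype(int)
    out[col] = out[col].where(~is_placeholder, np.nan)
    return out


# Toy frame with the same placeholder value seen in the data (assumption).
demo = pd.DataFrame({"YearMade": [1000, 2006, 2000, 1000]})
cleaned = mask_placeholder_year(demo)
print(cleaned)
```

The NaN values would then be filled by the same mean imputation used for the other numeric columns, while the flag keeps the information that the year was unknown.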

## Mean / Target encoding (depends on the train / test split)

In [103]:
def make_target_mean_dict(X, y):
    X1 = X.copy()

    y1 = pd.DataFrame(y)
    X1 = pd.concat([X1, y1], axis=1)
    X1 = X1.rename(columns={0: 'SalePrice'})

    target_mean_dict = {}
    target_nan_mean_dict = {}

    for col in X1.select_dtypes(exclude='number').columns:
        target_mean_dict[col] = X1.groupby(col)['SalePrice'].mean().to_dict()
        target_nan_mean_dict[col] = X1[X1[col].isna()]['SalePrice'].mean()
    X1 = X1.drop(columns=['SalePrice'])

    return target_mean_dict, target_nan_mean_dict


def target_encode(X, target_mean_dict, target_nan_mean_dict):
    for col in X.select_dtypes(exclude='number').columns:
        X[col + '_2'] = X[col].map(target_mean_dict[col]).fillna(target_nan_mean_dict[col])
        X[col + '_2'] = X[col + '_2'].astype(float)
    return X

#target_mean_dict, target_nan_mean_dict = make_target_mean_dict(df, target_col = 'SalePrice')
#df = target_encode(df, target_mean_dict, target_nan_mean_dict)
#df_valid = target_encode(df_valid, target_mean_dict, target_nan_mean_dict)
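
With high-cardinality columns such as `fiModelDesc`, plain category means can be noisy for categories seen only a few times. A common variation, which is not what the functions above do, is to shrink each category mean toward the global mean. The sketch below assumes a hypothetical helper `smoothed_target_means` and a hand-picked `weight`.

```python
import pandas as pd


def smoothed_target_means(categories, target, weight=10):
    """Per-category target mean shrunk toward the global mean.

    weight acts like a number of pseudo-observations at the global mean.
    """
    global_mean = target.mean()
    stats = target.groupby(categories).agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + weight * global_mean) / (stats["count"] + weight)
    return smoothed.to_dict(), global_mean


# Toy training data (assumption): one frequent and one rare category.
train_cat = pd.Series(["310G", "310G", "310G", "HW180"])
train_y = pd.Series([20000.0, 22000.0, 24000.0, 90000.0])
mapping, fallback = smoothed_target_means(train_cat, train_y, weight=2)

# Unseen categories fall back to the global training mean.
test_cat = pd.Series(["310G", "HW180", "unseen"])
encoded = test_cat.map(mapping).fillna(fallback)
print(encoded)
```

As with the unsmoothed version, the mapping should be built from the training split only and then applied to the test and validation data.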

# Select data for testing the model

In [104]:
df2 = df.sample(1500, random_state=42)

y = df2['SalePrice']
X = df2.drop(columns=['SalePrice', 'SalesID', 'MachineID', 'ModelID'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

target_mean_dict, target_nan_mean_dict = make_target_mean_dict(X_train, y_train)
X_train = target_encode(X_train, target_mean_dict, target_nan_mean_dict)
X_test = target_encode(X_test, target_mean_dict, target_nan_mean_dict)

X_train = X_train.select_dtypes('number')
X_test = X_test.select_dtypes('number')

X_train = X_train.fillna(X_train.mean())
X_test = X_test.fillna(X_train.mean())


#train_test_split

In [105]:
# scaler = MinMaxScaler()  # scaling changes nothing for a random forest
# X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
# X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
# X_valid = pd.DataFrame(scaler.transform(X_valid), columns=X_valid.columns, index=X_valid.index)

# feature_names = X_train.columns  # this is awfully complicated
# imputer = SimpleImputer(strategy='mean')
# X_train = imputer.fit_transform(X_train)
# X_test = imputer.transform(X_test)
# X_valid = imputer.transform(X_valid)
# X_train = pd.DataFrame(X_train, columns=feature_names)
# X_test = pd.DataFrame(X_test, columns=feature_names)
# X_valid = pd.DataFrame(X_valid, columns=feature_names)

# import seaborn as sns
# import matplotlib.pyplot as plt

# plt.figure(figsize=(20, 20))
# sns.heatmap(X.corr(), vmin=-1, fmt=".1f", vmax=1, annot=True, cmap='BrBG')
# plt.show()

In [106]:
%%time
model = RandomForestRegressor(n_jobs=-1,
                              n_estimators = 700,
                              #max_depth = 10,
                              min_impurity_decrease = 10,
                              random_state = 42,
                              max_features = 'sqrt',
                              #max_samples=0.75
                              )
model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print(f'Train RMSE:', np.sqrt(mean_squared_error(y_train, y_train_pred)))
print(f'Test RMSE:', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print(f'R²:' , r2_score(y_test, y_test_pred))
print(f'Train MAE:', mean_absolute_error(y_train, y_train_pred))
print(f'Test MAE:', mean_absolute_error(y_test, y_test_pred))


Train RMSE: 3525.859427864453
Test RMSE: 14556.215910600093
R²: 0.5399363319708925
Train MAE: 2335.318471631502
Test MAE: 9918.1059095961
CPU times: user 3.88 s, sys: 205 ms, total: 4.08 s
Wall time: 2.89 s


In [107]:
### Feature importance
pd.Series(
    model.feature_importances_,
    index=model.feature_names_in_
).sort_values(ascending=False)

fiModelDesc_2                      0.287887
fiBaseModel_2                      0.135472
fiProductClassDesc_2               0.084208
YearMade                           0.056679
fiSecondaryDesc_2                  0.046997
Enclosure_2                        0.033026
ProductSize_2                      0.026963
saledate_Year                      0.026730
state_2                            0.026162
ProductSize_3                      0.023775
fiProductClassDesc_first_word_2    0.021548
ProductGroup_2                     0.016570
saledate_Month                     0.016357
ProductGroupDesc_2                 0.015911
fiModelDescriptor_2                0.015470
auctioneerID                       0.011763
Tire_Size_2                        0.010391
Hydraulics_2                       0.009494
Enclosure_Type_2                   0.008532
MachineHoursCurrentMeter           0.008278
Ripper_2                           0.007988
Blade_Type_2                       0.007767
datasource_2                    

# Validation

In [108]:
# df2 = df.select_dtypes('number')
# X_valid2 = df_valid.select_dtypes('number')

# y = df2['SalePrice']
# X = df2.drop(columns=['SalePrice', 'MachineID', 'ModelID', 'SalesID'])
# X_valid = X_valid2.drop(columns=['MachineID', 'ModelID', 'SalesID'])


# target_mean_dict, target_nan_mean_dict = make_target_mean_dict(X, y)
# X = target_encode(X, target_mean_dict, target_nan_mean_dict)
# X_valid = target_encode(X_valid, target_mean_dict, target_nan_mean_dict)

# X = X.select_dtypes('number')
# X_valid = X_valid.select_dtypes('number')

# X = X.fillna(X.mean())
# X_valid = X_valid.fillna(X.mean())


In [109]:
# %%time
# model = RandomForestRegressor(n_jobs=-1,
#                               n_estimators = 700,
#                               #max_depth = 10,
#                               min_impurity_decrease = 10,
#                               random_state = 42,
#                               max_features = 'sqrt',
#                               #max_samples=0.75
#                               )
# model.fit(X, y)

# y_valid_pred = model.predict(X_valid)

In [110]:
# Create a submission file
# SalesID is dropped from X_valid above, so take it from df_valid instead.
# submission = pd.DataFrame({'SalesID': df_valid['SalesID'], 'SalePrice': y_valid_pred})
# submission.to_csv('final_submission.csv', index=False)