<a href="https://colab.research.google.com/github/IlyaZutler/Project_2-Trucks/blob/main/DM%20_%20Project_2_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dynamic mitochondria Project - Heavy Machinery Auction Price Estimator

> https://www.kaggle.com/t/9baafb8850d74e4499c7b1ba97d6f115

### Timeline
- **Start Date:** [Start Date]
- **End Date:** 14/07/2024 (11 days to go)

### 2. Exploratory Data Analysis (EDA)

Conduct EDA to understand the dataset and identify any data quality issues. Look for missing values, outliers, and relationships between features and the target variable.
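
As a minimal sketch of the checks described above, the helpers below summarize missing values per column and flag outliers with the common 1.5 x IQR rule. The function names `missing_summary` and `iqr_outlier_mask` and the toy values are illustrative assumptions, not part of the project code.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing count and share, most incomplete columns first."""
    summary = pd.DataFrame({
        "n_missing": frame.isna().sum(),
        "share_missing": frame.isna().mean(),
    })
    return summary.sort_values("share_missing", ascending=False)

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy frame standing in for the auction data (values are made up).
toy = pd.DataFrame({
    "SalePrice": [10000, 12000, 11000, 13000, 250000],
    "UsageBand": ["Low", None, "High", None, None],
})
print(missing_summary(toy))
print(toy[iqr_outlier_mask(toy["SalePrice"])])
```

On the real data the same calls would be `missing_summary(df)` and `iqr_outlier_mask(df['SalePrice'])`.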

### 3. Data Preprocessing

- Handle missing values appropriately.
- Encode categorical variables.
- Normalize or standardize numerical features if necessary.
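
One possible way to handle missing numeric values, sketched here under the assumption that median imputation is acceptable for this data, is scikit-learn's `SimpleImputer` with `add_indicator=True`, which also appends a 0/1 column marking where values were missing. The array below is made-up demo data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two numeric features with one missing value each (made-up numbers).
X_demo = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [np.nan, 8.0],
])

# Fill with the column median and append one missing-indicator column
# per feature that had missing values during fit.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_filled = imputer.fit_transform(X_demo)
print(X_filled)
```

The indicator columns let a tree model learn whether "missing" is itself informative, which plain mean filling would hide.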

### 6. Model Improvement

- Handle missing values and categorical variables more effectively.
- Use feature importances to identify key features.
- Perform feature engineering to create new informative features.
- Tune hyperparameters using grid search or other techniques.
- Monitor for overfitting by comparing training and testing performance.
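
The tuning and overfitting checks above can be sketched as follows. This is an illustrative example on synthetic data with a deliberately tiny grid; the parameter values are assumptions chosen for speed, not recommended settings for the auction data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the real features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Cross-validated grid search over a small, illustrative grid.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=42),
    param_grid={"max_depth": [3, None], "min_samples_leaf": [1, 5]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_tr, y_tr)

# Compare train and held-out error; a large gap suggests overfitting.
best = search.best_estimator_
train_rmse = np.sqrt(mean_squared_error(y_tr, best.predict(X_tr)))
test_rmse = np.sqrt(mean_squared_error(y_te, best.predict(X_te)))
print(search.best_params_)
print(f"train RMSE {train_rmse:.3f}, test RMSE {test_rmse:.3f}, gap {test_rmse - train_rmse:.3f}")
```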


## Practical Data Science Guidelines

- **Efficient Workflows:** Use a random subset of 20,000 rows for initial experiments. Use the full dataset for the final submission.
- **Iterative Approach:** Start with a basic model and iteratively improve it by trying small ideas.
- **Feature Engineering:** Transform and combine existing features creatively.
- **Documentation:** Keep track of your experiments and results. Document what works and what doesn't.

## Collaboration and Presentation

- **Collaboration:** Discuss your work openly within your team or with other teams. Sharing insights and learning from each other is encouraged.
- **Presentation:** Present your methodology, results, and the techniques that helped the most. Document your journey and the steps you took to achieve your results.



In [2]:
import gdown
from pathlib import Path

import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error


In [3]:
def download_from_gdrive(url, filename):
    # Extract the file ID from the URL
    file_id = url.split('/')[-2]
    download_url = f"https://drive.google.com/uc?id={file_id}"

    # Download the file
    if Path(filename).exists():
        print(f"File '{filename}' already exists. Skipping download.")
    else:
        gdown.download(download_url, filename, quiet=False)
        print(f"File downloaded as: {filename}")

train = 'https://drive.google.com/file/d/1guqSpDv1Q7ZZjSbXMYGbrTvGns0VCyU5/view?usp=drive_link'
valid = 'https://drive.google.com/file/d/1j7x8xhMimKbvW62D-XeDfuRyj9ia636q/view?usp=drive_link'
# Example usage

download_from_gdrive(train, 'train.csv')
download_from_gdrive(valid, 'valid.csv')

df = pd.read_csv('train.csv')
df_valid = pd.read_csv('valid.csv')

Downloading...
From (original): https://drive.google.com/uc?id=1guqSpDv1Q7ZZjSbXMYGbrTvGns0VCyU5
From (redirected): https://drive.google.com/uc?id=1guqSpDv1Q7ZZjSbXMYGbrTvGns0VCyU5&confirm=t&uuid=473b72ac-4d37-416c-aa02-32f351f1bfe9
To: /content/train.csv
100%|██████████| 116M/116M [00:02<00:00, 51.5MB/s]


File downloaded as: train.csv


Downloading...
From: https://drive.google.com/uc?id=1j7x8xhMimKbvW62D-XeDfuRyj9ia636q
To: /content/valid.csv
100%|██████████| 3.32M/3.32M [00:00<00:00, 21.5MB/s]


File downloaded as: valid.csv




## Exploratory Data Analysis (EDA)

In [None]:
#df.fiProductClassDesc.value_counts()

In [None]:
#df.info()

In [None]:
 #df.SalesID.nunique()

In [None]:
# df.isnull().sum()

In [None]:
#df.describe()

In [None]:
#sns.histplot(data=df, x='SalePrice', bins=20)

In [None]:
# Print value_counts for every non-numeric column. Note that some truly categorical
# columns, such as ModelID, are stored as numbers and are skipped here.
categorical_cols = df.select_dtypes(exclude='number').columns
for col in categorical_cols:
  print(f"Value counts for column '{col}':")
  print(df[col].value_counts())
  print(f"NaN values:{df[col].isnull().sum()}")
  print()
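
A more compact view than printing every `value_counts` is a cardinality table. The sketch below is an assumption-laden helper (`cardinality_report` is a hypothetical name) that also accepts numeric columns which are really categorical, such as `ModelID`.

```python
import pandas as pd

def cardinality_report(frame: pd.DataFrame, extra_categorical=()) -> pd.DataFrame:
    """Unique-value and missing-value counts for categorical columns.

    `extra_categorical` lists numeric columns that should be treated as
    categorical as well (for example ID-like columns).
    """
    cols = list(frame.select_dtypes(exclude="number").columns)
    cols += [c for c in extra_categorical if c in frame.columns]
    cols = list(dict.fromkeys(cols))  # drop duplicates, keep order
    report = pd.DataFrame({
        "n_unique": frame[cols].nunique(),
        "n_missing": frame[cols].isna().sum(),
    })
    return report.sort_values("n_unique", ascending=False)

# Made-up rows for illustration.
toy = pd.DataFrame({
    "ModelID": [1, 1, 2],
    "state": ["A", None, "B"],
    "SalePrice": [100.0, 200.0, 300.0],
})
print(cardinality_report(toy, extra_categorical=["ModelID"]))
```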

### 3. Data Preprocessing

In [4]:
def Num_to_Object(X, col):
    # Cast numeric ID-like columns to object so they are treated as categorical.
    for col_ in col:
        X[col_] = X[col_].astype('object')
    return X

df = Num_to_Object(df, col = ['datasource'])
df_valid = Num_to_Object(df_valid, col = ['datasource'])


In [None]:
#df['Transmission'] = df['Transmission'].replace('AutoShift', 'Autoshift')
def fix_mistakes(X, replacement_dict):

    for col, replacements in replacement_dict.items():
        X[col] = X[col].replace(replacements)
    return X

fix_mistakes(df, replacement_dict = {'Transmission': {'AutoShift': 'Autoshift'}})
fix_mistakes(df_valid, replacement_dict = {'Transmission': {'AutoShift': 'Autoshift'}})

In [6]:
def fix_data(X, date_col):
    for col in date_col:
        X[col] = pd.to_datetime(X[col])
        X[col + '_Year'] = X[col].dt.year
        X[col + '_Month'] = X[col].dt.month
        X = X.drop(col, axis=1)
    return X

df = fix_data(df, date_col = ['saledate'])
df_valid = fix_data(df_valid, date_col = ['saledate'])

In [None]:
def first_word_name(X, col):
    for col_ in col:
        X[col_+'_first_word'] = X[col_].apply(lambda x: x.split()[0] if isinstance(x, str) else x)
    return X

first_word_name(df, col = ['fiProductClassDesc'])
first_word_name(df_valid, col = ['fiProductClassDesc'])

In [8]:
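def ord_encod_nan(X, col, categories):
    # Encode `col` by the order given in `categories`.
    # Values outside `categories` (e.g. 'None or Unspecified') end up as NaN.
    encoder = OrdinalEncoder(categories=[categories], handle_unknown='use_encoded_value', unknown_value=-1)
    X[col + '_3'] = encoder.fit_transform(X[[col]]).ravel()
    # Assign the result back instead of using a chained inplace replace,
    # which does not reliably modify the DataFrame.
    X[col + '_3'] = X[col + '_3'].replace(-1, np.nan)

    return X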

df = ord_encod_nan(df, col = 'UsageBand', categories = ['Low', 'Medium', 'High'])
df_valid = ord_encod_nan(df_valid, col = 'UsageBand', categories = ['Low', 'Medium', 'High'])

df = ord_encod_nan(df, col = 'ProductSize', categories = ['Mini', 'Compact', 'Small', 'Medium', 'Large / Medium', 'Large', 'High'])
df_valid = ord_encod_nan(df_valid, col = 'ProductSize', categories = ['Mini', 'Compact', 'Small', 'Medium', 'Large / Medium', 'Large', 'High'])
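
As an alternative sketch, the same ordered encoding can be done with pandas alone via an ordered `Categorical`, whose codes are -1 for any value not in the category list (including missing values). The helper name `ordinal_codes` is an assumption for illustration.

```python
import pandas as pd

def ordinal_codes(values: pd.Series, categories: list) -> pd.Series:
    """Map values to 0..n-1 following `categories`; anything else becomes NaN."""
    cat = pd.Categorical(values, categories=categories, ordered=True)
    codes = pd.Series(cat.codes, index=values.index, dtype="float64")
    # Categorical uses -1 for unknown or missing values; turn that into NaN.
    return codes.where(codes >= 0)

usage = pd.Series(["Low", "High", None, "None or Unspecified", "Medium"])
encoded = ordinal_codes(usage, ["Low", "Medium", "High"])
print(encoded)
```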

In [11]:
def replace_dict(X, col, repl_dict):
    # Strip unit text from `col` and store the numeric result in `<col>_3`.
    new_name = col + '_3'
    X[new_name] = X[col]
    for old, new in repl_dict.items():
        X[new_name] = X[new_name].str.replace(old, new, regex=False)
    # Non-numeric leftovers such as 'None or Unspecified' are coerced to NaN.
    X[new_name] = pd.to_numeric(X[new_name], errors='coerce')

    return X

df = replace_dict(df, col = 'Undercarriage_Pad_Width', repl_dict= {' inch': ''})
df_valid = replace_dict(df_valid, col= 'Undercarriage_Pad_Width', repl_dict = {' inch': ''})

df = replace_dict(df, col = 'Stick_Length', repl_dict= {"' ": '.', '"': ''})
df_valid = replace_dict(df_valid, col= 'Stick_Length', repl_dict = {"' ": '.', '"': ''})
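
Note that the `Stick_Length` replacement turns a value like `9' 6"` into `9.6`, so inches are read as decimal digits. A hedged alternative, assuming the values follow the pattern `<feet>' <inches>"`, is to convert to decimal feet explicitly. The helper `feet_inches_to_feet` is a hypothetical name.

```python
import re
import numpy as np
import pandas as pd

# Assumed format: feet, apostrophe, optional space, inches, double quote.
_FEET_INCHES = re.compile(r"^\s*(\d+)'\s*(\d+)\"\s*$")

def feet_inches_to_feet(value):
    """Convert strings like 9' 6" to decimal feet; anything else becomes NaN."""
    if not isinstance(value, str):
        return np.nan
    match = _FEET_INCHES.match(value)
    if match is None:
        return np.nan
    feet, inches = int(match.group(1)), int(match.group(2))
    return feet + inches / 12.0

sample = pd.Series(["9' 6\"", "10' 2\"", "None or Unspecified", None])
print(sample.map(feet_inches_to_feet))
```

With this conversion `9' 10"` correctly sorts above `9' 6"`, which the decimal-digit reading does not guarantee.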

## New features (dependent on the df split)

In [None]:
#df[df['YearMade'] == 1000].head(20)  # unclear how to handle YearMade == 1000
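
One option for the `YearMade == 1000` rows, sketched under the assumption that 1000 is a placeholder for an unknown year, is to treat it as missing and derive the machine's age at sale. The helper name `add_machine_age` and the new column names are illustrative; `saledate_Year` is the column produced by `fix_data`.

```python
import numpy as np
import pandas as pd

def add_machine_age(frame, year_made_col="YearMade",
                    sale_year_col="saledate_Year", placeholder=1000):
    """Add a cleaned year column and the age of the machine at sale time."""
    out = frame.copy()
    # Assumption: the placeholder year means "unknown", so mask it as NaN.
    year_made = out[year_made_col].where(out[year_made_col] != placeholder)
    out[year_made_col + "_clean"] = year_made
    out["machine_age"] = out[sale_year_col] - year_made
    return out

toy = pd.DataFrame({"YearMade": [1000, 2004], "saledate_Year": [2010, 2011]})
res = add_machine_age(toy)
print(res)
```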

## Mean / Target encoding

In [None]:
def make_target_mean_dict(X, target_col):
    # Mean target per category, plus the mean target of rows where the category is missing.
    target_mean_dict = {}
    target_nan_mean_dict = {}

    # Use the X passed in, not the global df.
    for col in X.select_dtypes(exclude='number').columns:
        target_mean_dict[col] = X.groupby(col)[target_col].mean().to_dict()
        target_nan_mean_dict[col] = X[X[col].isna()][target_col].mean()

    return target_mean_dict, target_nan_mean_dict

target_mean_dict, target_nan_mean_dict = make_target_mean_dict(df, target_col = 'SalePrice')

def target_encode(X, target_mean_dict, target_nan_mean_dict):
    for col in X.select_dtypes(exclude='number').columns:
        X[col + '_2'] = X[col].map(target_mean_dict[col]).fillna(target_nan_mean_dict[col])
        X[col + '_2'] = X[col + '_2'].astype(float)
    return X

df = target_encode(df, target_mean_dict, target_nan_mean_dict)
df_valid = target_encode(df_valid, target_mean_dict, target_nan_mean_dict)
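
Because the category means above are computed on all of `df`, each training row's encoding includes its own `SalePrice`, which can inflate training scores. A common mitigation, sketched here as an assumption rather than the notebook's method, is out-of-fold encoding: each row is encoded with means computed on the other folds only. The helper name `oof_target_encode` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(frame, col, target_col, n_splits=5, seed=42):
    """Encode `col` with target means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=frame.index, dtype="float64")
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kfold.split(frame):
        fit_part = frame.iloc[fit_idx]
        means = fit_part.groupby(col)[target_col].mean()
        mapped = frame[col].iloc[enc_idx].map(means)
        # Categories unseen in the fitting folds fall back to the fold's overall mean.
        mapped = mapped.fillna(fit_part[target_col].mean())
        encoded.iloc[enc_idx] = mapped.to_numpy()
    return encoded

# Made-up rows; category 'C' appears once, so it can never see its own price.
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B", "A", "B", "C"],
    "SalePrice": [10, 20, 30, 40, 30, 50, 1000],
})
encoded = oof_target_encode(toy, "state", "SalePrice", n_splits=3)
print(encoded)
```

The validation set would still be encoded with means from the full training data, as in `target_encode`.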

## Select data for model

In [None]:
df = df.select_dtypes('number')
X_valid = df_valid.select_dtypes('number')

y = df['SalePrice']
X = df.drop(columns=['SalePrice', 'MachineID', 'ModelID', 'SalesID'])
X_valid = X_valid.drop(columns=['MachineID', 'ModelID', 'SalesID'])

In [None]:
# import seaborn as sns
# import matplotlib.pyplot as plt

# plt.figure(figsize=(20, 20))
# sns.heatmap(X.corr(), vmin=-1, fmt=".1f", vmax=1, annot=True, cmap='BrBG')
# plt.show()


## Train/test split

In [None]:
X_train = X.sample(30000, random_state=42)
y_train = y.loc[X_train.index]

X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# scaler = MinMaxScaler()
# X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
# X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
# X_valid = pd.DataFrame(scaler.transform(X_valid), columns=X_valid.columns, index=X_valid.index)

# Impute with training-set means only, so no test or validation statistics leak in.
train_means = X_train.mean()
X_train = X_train.fillna(train_means)
X_test = X_test.fillna(train_means)
X_valid = X_valid.fillna(train_means)

In [None]:
%%time
model = RandomForestRegressor(n_jobs=-1,
                              n_estimators = 700,
                              #max_depth = 10,
                              min_impurity_decrease = 10,
                              random_state = 42,
                              max_features = 'sqrt',
                              #max_samples=0.75
                              )

model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

y_valid_pred = model.predict(X_valid)

print('Train RMSE:', np.sqrt(mean_squared_error(y_train, y_train_pred)))
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print('Test R²:', r2_score(y_test, y_test_pred))
print('Train MAE:', mean_absolute_error(y_train, y_train_pred))
print('Test MAE:', mean_absolute_error(y_test, y_test_pred))


Train RMSE: 3791.6000972295774
Train MAE: 2576.7964623446896
CPU times: user 14min 11s, sys: 8.82 s, total: 14min 20s
Wall time: 8min 43s


In [None]:
### Feature importance
pd.Series(
    model.feature_importances_,
    index=model.feature_names_in_
).sort_values(ascending=False)

fiModelDesc_2                      0.292358
fiBaseModel_2                      0.104009
fiProductClassDesc_2               0.083600
YearMade                           0.083561
saledate_Year                      0.059312
Enclosure_2                        0.053806
fiSecondaryDesc_2                  0.050498
ProductSize_2                      0.033550
state_2                            0.018536
fiModelDescriptor_2                0.017398
fiProductClassDesc_first_word_2    0.017108
saledate_Month                     0.016775
ProductGroupDesc_2                 0.016284
ProductGroup_2                     0.014976
MachineHoursCurrentMeter           0.012377
auctioneerID                       0.011468
datasource_2                       0.008987
Ripper_2                           0.007140
Hydraulics_2                       0.006955
fiModelSeries_2                    0.006494
Tire_Size_2                        0.005989
Blade_Type_2                       0.005690
Coupler_System_2                
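
Impurity-based importances like the ones above are computed on the training data and tend to favor features with many distinct values, which may partly explain why target-encoded, high-cardinality columns rank first. A hedged cross-check is permutation importance on held-out data; the sketch below uses synthetic data and is not a rerun of the notebook's model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data in which only column 0 carries signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and measure the drop in score.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)
for idx in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f} +/- {result.importances_std[idx]:.4f}")
```

With the notebook's objects the call would be `permutation_importance(model, X_test, y_test, ...)`.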

In [None]:
# Create a submission file
# SalesID was dropped from X_valid above, so take it from df_valid (same row order).
submission = pd.DataFrame({'SalesID': df_valid['SalesID'], 'SalePrice': y_valid_pred})
submission.to_csv('final_submission.csv', index=False)