<a href="https://colab.research.google.com/github/IlyaZutler/Project_2-Trucks/blob/main/DM%20_%20Project_2_1_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dynamic mitochondria Project - Heavy Machinery Auction Price Estimator

> https://www.kaggle.com/t/9baafb8850d74e4499c7b1ba97d6f115

### Timeline
- **Start Date:** [Start Date]
- **End Date:** 14/07/2024 (11 days to go)

### 2. Exploratory Data Analysis (EDA)

Conduct EDA to understand the dataset and identify any data quality issues. Look for missing values, outliers, and relationships between features and the target variable.
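
The checks named above can be sketched on a small frame. This is a minimal illustration, not the notebook's actual EDA: the `toy` data is made up, and the 1.5 × IQR rule is just one common outlier heuristic.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the real notebook would use the loaded training data.
toy = pd.DataFrame({
    "SalePrice": [10000, 12000, 11000, 250000, 9000, 13000],
    "YearMade": [1998, 2001, 1000, 2005, 1995, 2003],
    "UsageBand": ["Low", None, "High", None, None, "Medium"],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Flag outliers in the target with the 1.5 * IQR rule.
q1, q3 = toy["SalePrice"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["SalePrice"] < q1 - 1.5 * iqr) | (toy["SalePrice"] > q3 + 1.5 * iqr)

# Correlation of the numeric features with the target.
target_corr = toy.select_dtypes("number").corr()["SalePrice"].drop("SalePrice")

print(missing_share)
print(toy[is_outlier])
print(target_corr)
```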

### 3. Data Preprocessing

- Handle missing values appropriately.
- Encode categorical variables.
- Normalize or standardize numerical features if necessary.
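
The three steps above might look like this on a toy frame. It is a sketch under assumptions: the category values are illustrative, and median imputation, integer codes and min-max scaling are only one possible choice for each step.

```python
import pandas as pd

# Illustrative data; column names follow the dataset, values are made up.
toy = pd.DataFrame({
    "MachineHoursCurrentMeter": [1200.0, None, 300.0, 4500.0],
    "Enclosure": ["OROPS", "EROPS", None, "OROPS"],
})

# 1. Missing values: median for numeric columns, an explicit label for categorical ones.
hours_median = toy["MachineHoursCurrentMeter"].median()
toy["MachineHoursCurrentMeter"] = toy["MachineHoursCurrentMeter"].fillna(hours_median)
toy["Enclosure"] = toy["Enclosure"].fillna("Missing")

# 2. Categorical encoding: integer codes (one of several possible encodings).
toy["Enclosure_code"] = toy["Enclosure"].astype("category").cat.codes

# 3. Min-max scaling to [0, 1]; tree models such as random forests usually do not need it.
col = toy["MachineHoursCurrentMeter"]
toy["hours_scaled"] = (col - col.min()) / (col.max() - col.min())

print(toy)
```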

### 6. Model Improvement

- Handle missing values and categorical variables more effectively.
- Use feature importances to identify key features.
- Perform feature engineering to create new informative features.
- Tune hyperparameters using grid search or other techniques.
- Monitor for overfitting by comparing training and testing performance.
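
As a rough illustration of the last two points, the sketch below runs a small grid search and then compares train and test RMSE. The synthetic data and the parameter grid are assumptions chosen to keep the example fast, not recommended settings for this dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the real features and SalePrice.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Small, hypothetical grid; widen it only once the pipeline works end to end.
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)

best = search.best_estimator_
train_rmse = np.sqrt(mean_squared_error(y_train, best.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, best.predict(X_test)))

# A test RMSE far above the train RMSE is a sign of overfitting.
print(search.best_params_)
print(f"Train RMSE: {train_rmse:.3f}  Test RMSE: {test_rmse:.3f}")
```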

### 7. Final Submission

Generate predictions for the validation set:

```python
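import pandas as pd

# `model` is the estimator fitted earlier; X_valid must go through the same
# preprocessing as the training features before calling predict()
valid = pd.read_csv('valid.csv')
X_valid = valid.drop(columns=['SalesID'])
y_valid_pred = model.predict(X_valid)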

# Create a submission file
submission = pd.DataFrame({'SalesID': valid['SalesID'], 'SalePrice': y_valid_pred})
submission.to_csv('final_submission.csv', index=False)
```

## Practical Data Science Guidelines

- **Efficient Workflows:** Use a random subset of 20,000 rows for initial experiments. Use the full dataset for the final submission.
- **Iterative Approach:** Start with a basic model and iteratively improve it by trying small ideas.
- **Feature Engineering:** Transform and combine existing features creatively.
- **Documentation:** Keep track of your experiments and results. Document what works and what doesn't.
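
A fixed-seed subset and a simple experiment log could be set up as follows. This is only a sketch: the `full` frame is a stand-in for the training data, and the log fields (`idea`, `rows`, `test_rmse`) are hypothetical names.

```python
import numpy as np
import pandas as pd

# Stand-in for the full training frame.
full = pd.DataFrame({"SalePrice": np.arange(50_000), "YearMade": 2000})

# Fixed-seed random subset for quick experiments
# (20,000 rows, or fewer if the frame is smaller).
SAMPLE_SIZE = 20_000
subset = full.sample(n=min(SAMPLE_SIZE, len(full)), random_state=42)

# Minimal experiment log: one dict per run, collected into a DataFrame for comparison.
experiments = []
experiments.append({"idea": "baseline, numeric columns only", "rows": len(subset), "test_rmse": None})
experiment_log = pd.DataFrame(experiments)

print(subset.shape)
print(experiment_log)
```

Keeping `random_state` fixed means every experiment sees the same rows, so score differences come from the idea being tested rather than from the sample.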

## Collaboration and Presentation

- **Collaboration:** Discuss your work openly within your team or with other teams. Sharing insights and learning from each other is encouraged.
- **Presentation:** Present your methodology, results, and the techniques that helped the most. Document your journey and the steps you took to achieve your results.



In [172]:
import gdown
from pathlib import Path

import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error


In [173]:
def download_from_gdrive(url, filename):
    # Extract the file ID; assumes a share link of the form
    # https://drive.google.com/file/d/<file_id>/view?...
    file_id = url.split('/')[-2]
    download_url = f"https://drive.google.com/uc?id={file_id}"

    # Download the file
    if Path(filename).exists():
        print(f"File '{filename}' already exists. Skipping download.")
    else:
        gdown.download(download_url, filename, quiet=False)
        print(f"File downloaded as: {filename}")

train = 'https://drive.google.com/file/d/1guqSpDv1Q7ZZjSbXMYGbrTvGns0VCyU5/view?usp=drive_link'
valid = 'https://drive.google.com/file/d/1j7x8xhMimKbvW62D-XeDfuRyj9ia636q/view?usp=drive_link'

# Download both files (skipped when they already exist locally)
download_from_gdrive(train, 'train.csv')
download_from_gdrive(valid, 'valid.csv')

df = pd.read_csv('train.csv')
df_valid = pd.read_csv('valid.csv')

File 'train.csv' already exists. Skipping download.
File 'valid.csv' already exists. Skipping download.




## Exploratory Data Analysis (EDA)

In [174]:
#df.fiProductClassDesc.value_counts()

In [175]:
#df.info()

In [176]:
 #df.SalesID.nunique()

In [177]:
# df.isnull().sum()

In [178]:
#df.describe()

In [179]:
#sns.histplot(data=df, x='SalePrice', bins=20)

In [180]:
# Show value_counts for every non-numeric column. Note that some truly categorical columns, such as ModelID, are stored as numbers and are skipped here.
categorical_cols = df.select_dtypes(exclude='number').columns
for col in categorical_cols:
  print(f"Value counts for column '{col}':")
  print(df[col].value_counts())
  print(f"NaN values:{df[col].isnull().sum()}")
  print()

Value counts for column 'UsageBand':
UsageBand
Medium    33985
Low       23620
High      12034
Name: count, dtype: int64
NaN values:331486

Value counts for column 'saledate':
saledate
2/16/2009 0:00    1932
2/15/2011 0:00    1352
2/19/2008 0:00    1300
2/15/2010 0:00    1219
2/11/2008 0:00    1100
                  ... 
1/16/2004 0:00       1
3/27/2006 0:00       1
7/25/2003 0:00       1
1/16/2006 0:00       1
6/9/2008 0:00        1
Name: count, Length: 3919, dtype: int64
NaN values:0

Value counts for column 'fiModelDesc':
fiModelDesc
310G        5039
416C        4869
580K        4315
310E        4233
140G        4083
            ... 
EX210-5        1
KX025          1
EX120-5F       1
EX100-5E       1
HW180          1
Name: count, Length: 4999, dtype: int64
NaN values:0

Value counts for column 'fiBaseModel':
fiBaseModel
580      19798
310      17354
D6       13110
416      12687
D5        9342
         ...  
830-2        1
272          1
PC230        1
KBD65        1
HW180        1


### 3. Data Preprocessing

In [181]:
#df['Transmission'] = df['Transmission'].replace('AutoShift', 'Autoshift')

In [None]:
def fix_mistakes(X, replacement_dict):
    #X = X.copy()
    for col, replacements in replacement_dict.items():
        X[col] = X[col].replace(replacements)
    return X

fix_mistakes(df, replacement_dict = {'Transmission': {'AutoShift': 'Autoshift'}})
fix_mistakes(df_valid, replacement_dict = {'Transmission': {'AutoShift': 'Autoshift'}})

In [183]:
def fix_data(X, date_col):
    for col in date_col:
        X[col] = pd.to_datetime(X[col])
        X[col + '_Year'] = X[col].dt.year
        X[col + '_Month'] = X[col].dt.month
        X = X.drop(col, axis=1)
    return X

df = fix_data(df, date_col = ['saledate'])
df_valid = fix_data(df_valid, date_col = ['saledate'])

In [None]:
def first_word_name(X, col):
    for col_ in col:
        X[col_+'_first_word'] = X[col_].apply(lambda x: x.split()[0] if isinstance(x, str) else x)
    return X

first_word_name(df, col = ['fiProductClassDesc'])
first_word_name(df_valid, col = ['fiProductClassDesc'])

In [185]:
# OrdinalEncoder for column 'ProductSize':
# df['ProductSize_3'] = df['ProductSize'].map({'Compact': 1, 'Mini': 2, 'High': 3, 'Small': 4, 'Medium': 5, 'High': 6, 'Large / Medium': 7, 'Large': 8})
# df['ProductSize_3'] = df['ProductSize_3'].fillna(df['ProductSize_3'].mean())
# df['ProductSize_3'].value_counts(), df['ProductSize_3'].isnull().sum()

In [186]:
# Processing of ordinal categorical features using scikit-learn's OrdinalEncoder
# encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']], handle_unknown='use_encoded_value', unknown_value= -1) # Handle unknown values
# df['UsageBand_3'] = encoder.fit_transform(df[['UsageBand']])
# mean_real = df['UsageBand_3'][df.UsageBand_3.isin([0,1,2])].mean()
# df['UsageBand_3'] = df['UsageBand_3'].replace(-1, mean_real)
# # Fix: Fill missing values in the 'UsageBand_3' column and keep it in the DataFrame
# df['UsageBand_3'] = df['UsageBand_3'].fillna(mean_real)
# df['UsageBand_3'].value_counts(), df['UsageBand_3'].isnull().sum()


In [187]:
# 'Undercarriage_Pad_Width' - remove ' inch' from data - to numeric
# df['Undercarriage_Pad_Width_3'] = df['Undercarriage_Pad_Width'].str.replace(' inch', '').replace('None or Unspecified', -1)
# df['Undercarriage_Pad_Width_3'] = pd.to_numeric(df['Undercarriage_Pad_Width_3'])
# mean_real_2 = df['Undercarriage_Pad_Width_3'][df.Undercarriage_Pad_Width_3 != -1].mean()
# df['Undercarriage_Pad_Width_3'] = df['Undercarriage_Pad_Width_3'].replace(-1, mean_real_2)
# df['Undercarriage_Pad_Width_3'] = df['Undercarriage_Pad_Width_3'].fillna(mean_real_2)
# df['Undercarriage_Pad_Width_3'].value_counts(),df['Undercarriage_Pad_Width_3'].isnull().sum()

In [188]:
# 'Stick_Length' - remove " from data - to numeric
# df['Stick_Length_3'] = df['Stick_Length'].str.replace("' ", '.').str.replace('"', '').replace('None or Unspecified', -1)
# df['Stick_Length_3'] = pd.to_numeric(df['Stick_Length_3'])
# mean_real_3 = df['Stick_Length_3'][df.Stick_Length_3 != -1].mean()  # mean() already skips NaN
# print(mean_real_3)
# df['Stick_Length_3'] = df['Stick_Length_3'].replace(-1, mean_real_3)
# df['Stick_Length_3'] = df['Stick_Length_3'].fillna(mean_real_3)
# df['Stick_Length_3'].value_counts(), df['Stick_Length_3'].isnull().sum()
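
The commented-out conversion above turns a feet-and-inches string into a decimal by swapping the separator, which would read 9' 6" as 9.6 even though 9 feet 6 inches is 9.5 feet. A possible alternative is to convert everything to inches. The helper below is a hypothetical sketch and assumes values look like `9' 6"` or `None or Unspecified`.

```python
import re

import numpy as np
import pandas as pd

# Matches strings such as 9' 6" (feet, then inches).
_FEET_INCHES = re.compile(r"""^\s*(\d+)'\s*(\d+)"\s*$""")

def feet_inches_to_inches(value):
    """Convert a string such as 9' 6" to total inches; anything else becomes NaN."""
    if not isinstance(value, str):
        return np.nan
    match = _FEET_INCHES.match(value)
    if match is None:
        return np.nan
    feet, inches = match.groups()
    return int(feet) * 12 + int(inches)

sample = pd.Series(["9' 6\"", "10' 2\"", "None or Unspecified", None])
stick_length_in = sample.map(feet_inches_to_inches)
print(stick_length_in.tolist())
```

Unparseable entries stay as NaN here, so the imputation strategy can be chosen separately.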

In [189]:
#df[df['YearMade'] == 1000].head(20) # no clear idea yet how to handle YearMade == 1000
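
One option for the `YearMade` value of 1000 is to treat it as a placeholder for "unknown". That is an assumption about the data, not something the dataset documents here. The sketch below keeps a missing-value flag, imputes the median of the known years, and derives a hypothetical `age_at_sale` feature.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "YearMade": [1000, 1998, 2004, 1000, 2001],
    "saledate_Year": [2006, 2008, 2010, 2011, 2009],
})

# Assumption: 1000 is a sentinel for "unknown", not a real manufacturing year.
toy["YearMade_missing"] = (toy["YearMade"] == 1000).astype(int)
year_made = toy["YearMade"].replace(1000, np.nan)

# Impute with the median of the known years, then derive the machine's age at sale.
toy["YearMade_clean"] = year_made.fillna(year_made.median())
toy["age_at_sale"] = toy["saledate_Year"] - toy["YearMade_clean"]

print(toy)
```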

## Mean / Target Encoding

In [190]:
def make_target_mean_dict(X, target_col):
    # For each non-numeric column, store the mean target per category
    # and the mean target of the rows where that column is missing
    target_mean_dict = {}
    target_nan_mean_dict = {}

    for col in X.select_dtypes(exclude='number').columns:
        target_mean_dict[col] = X.groupby(col)[target_col].mean().to_dict()
        target_nan_mean_dict[col] = X[X[col].isna()][target_col].mean()

    return target_mean_dict, target_nan_mean_dict


target_mean_dict, target_nan_mean_dict = make_target_mean_dict(df, target_col = 'SalePrice')


def target_encode(X, target_mean_dict, target_nan_mean_dict):
    for col in X.select_dtypes(exclude='number').columns:
        X[col + '_2'] = X[col].map(target_mean_dict[col]).fillna(target_nan_mean_dict[col])
        X[col + '_2'] = X[col + '_2'].astype(float)
    return X

df = target_encode(df, target_mean_dict, target_nan_mean_dict)
df_valid = target_encode(df_valid, target_mean_dict, target_nan_mean_dict)
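
The encoding above computes category means on the full training frame, so each row's own `SalePrice` contributes to its encoded value, which can make training scores look better than they are. A common mitigation is out-of-fold encoding. The sketch below is a minimal illustration, not the notebook's method: `oof_target_encode`, the toy data and the fold count are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(frame, col, target_col, n_splits=5, seed=42):
    """Encode `col` with target means computed on the other folds only."""
    global_mean = frame[target_col].mean()
    encoded = pd.Series(np.nan, index=frame.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(frame):
        # Means come from the fitting folds; they are applied to the held-out fold.
        fold_means = frame.iloc[fit_idx].groupby(col)[target_col].mean()
        encoded.iloc[enc_idx] = frame.iloc[enc_idx][col].map(fold_means).to_numpy()
    # Categories unseen in the fitting folds (and NaN keys) fall back to the global mean.
    return encoded.fillna(global_mean)

toy = pd.DataFrame({
    "fiBaseModel": ["580", "310", "580", "D6", "310", "580", "D6", "310", "580", "310"],
    "SalePrice": [20000, 15000, 22000, 60000, 14000, 21000, 58000, 16000, 19000, 15500],
})
toy["fiBaseModel_oof"] = oof_target_encode(toy, "fiBaseModel", "SalePrice")
print(toy)
```

For the validation set, the means can still be computed once on the whole training frame, since no validation targets are involved.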

## Data for the Model

In [227]:
df2 = df.select_dtypes('number')
X_valid = df_valid.select_dtypes('number')

y = df2['SalePrice']
X = df2.drop(columns=['SalePrice', 'MachineID', 'ModelID', 'SalesID'])
X_valid2 = X_valid.drop(columns=['MachineID', 'ModelID', 'SalesID'])

X_small = X.sample(1000, random_state=42)
y_small = y.loc[X_small.index]

# Swap the comments below to train on the whole sample instead of a train/test split
X_train, X_test, y_train, y_test = train_test_split(X_small, y_small, test_size=0.2, random_state=42)
# X_train = X_small
# y_train = y_small


# scaler = MinMaxScaler()
# X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
# X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
# X_valid2 = pd.DataFrame(scaler.transform(X_valid2), columns=X_valid2.columns, index=X_valid2.index)

X_train = X_train.fillna(X_train.mean())
X_test = X_test.fillna(X_train.mean())
X_valid2 = X_valid2.fillna(X_train.mean())


In [229]:
%%time
model = RandomForestRegressor(n_jobs=-1,
                              n_estimators = 500,
                              #max_depth = 10,
                              min_impurity_decrease = 100,
                              random_state = 42
                              )

model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

y_valid_pred = model.predict(X_valid2)

print('Train RMSE:', np.sqrt(mean_squared_error(y_train, y_train_pred)))
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print('R²:', r2_score(y_test, y_test_pred))
print('Train MAE:', mean_absolute_error(y_train, y_train_pred))
print('Test MAE:', mean_absolute_error(y_test, y_test_pred))


Train RMSE: 3398.7265872829685
Test RMSE: 9511.716290436589
R²: 0.8241455394021174
Train MAE: 2345.9663404491343
Test MAE: 6449.848791341991
CPU times: user 6.85 s, sys: 145 ms, total: 6.99 s
Wall time: 4.35 s


In [218]:
### Feature importance
pd.Series(
    model.feature_importances_,
    index=model.feature_names_in_
).sort_values(ascending=False)

fiModelDesc_2                      0.855111
saledate_Year                      0.069607
YearMade                           0.051306
fiProductClassDesc_2               0.003654
fiBaseModel_2                      0.003606
saledate_Month                     0.002307
Ripper_2                           0.002255
Enclosure_2                        0.002019
ProductSize_2                      0.001969
fiSecondaryDesc_2                  0.001346
Tire_Size_2                        0.000944
fiModelDescriptor_2                0.000499
MachineHoursCurrentMeter           0.000456
Blade_Width_2                      0.000394
Drive_System_2                     0.000394
Hydraulics_2                       0.000364
fiModelSeries_2                    0.000291
Blade_Type_2                       0.000281
fiProductClassDesc_first_word_2    0.000274
ProductGroup_2                     0.000268
Transmission_2                     0.000267
ProductGroupDesc_2                 0.000263
Scarifier_2                     

In [219]:
# Create a submission file
submission = pd.DataFrame({'SalesID': X_valid['SalesID'], 'SalePrice': y_valid_pred})
submission.to_csv('final_submission.csv', index=False)