<a href="https://colab.research.google.com/github/IlyaZutler/Project_2-Trucks/blob/main/DM%20_%20Project_2_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dynamic mitochondria Project - Heavy Machinery Auction Price Estimator

> https://www.kaggle.com/t/9baafb8850d74e4499c7b1ba97d6f115

### Timeline
- **Start Date:** [Start Date]
- **End Date:** 14/07/2024 (11 days to go)

## Getting Started

### 5. RandomForestRegressor Model

### 6. Model Improvement

- Handle missing values and categorical variables more effectively.
- Use feature importances to identify key features.
- Perform feature engineering to create new informative features.
- Tune hyperparameters using grid search or other techniques.
- Monitor for overfitting by comparing training and testing performance.
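
As an illustration of the last three points, here is a minimal sketch of a grid search followed by a train/test comparison. It runs on synthetic data from `make_regression` (a stand-in, not the auction data), and the parameter grid values are arbitrary assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (hypothetical); swap in the prepared auction features.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Small illustrative grid; the values are assumptions, not tuned recommendations.
param_grid = {"max_depth": [4, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)

# Compare train and test error of the best model to spot overfitting.
best = search.best_estimator_
train_rmse = np.sqrt(mean_squared_error(y_train, best.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, best.predict(X_test)))
print("Best params:", search.best_params_)
print(f"Train RMSE: {train_rmse:.1f}  Test RMSE: {test_rmse:.1f}")

# One importance value per input feature; larger means more influential.
importances = best.feature_importances_
```

A test RMSE far above the train RMSE suggests the forest is memorising the training rows, in which case stronger regularisation (shallower trees, larger leaves) is worth trying.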

### 7. Final Submission

Generate predictions for the validation set:

```python
valid = pd.read_csv('valid.csv')
X_valid = valid.drop(columns=['SalesID'])
y_valid_pred = model.predict(X_valid)

# Create a submission file
submission = pd.DataFrame({'SalesID': valid['SalesID'], 'SalePrice': y_valid_pred})
submission.to_csv('final_submission.csv', index=False)
```

## Practical Data Science Guidelines

- **Efficient Workflows:** Use a random subset of 20,000 rows for initial experiments. Use the full dataset for the final submission.
- **Iterative Approach:** Start with a basic model and iteratively improve it by trying small ideas.
- **Feature Engineering:** Transform and combine existing features creatively.
- **Documentation:** Keep track of your experiments and results. Document what works and what doesn't.
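
A minimal sketch of the subset workflow, assuming a hypothetical `df_full` table in place of the real training data. Fixing `random_state` means every experiment sees the same 20,000 rows, so results stay comparable between runs.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full training table.
rng = np.random.default_rng(0)
df_full = pd.DataFrame({
    "SalesID": np.arange(100_000),
    "SalePrice": rng.integers(5_000, 100_000, size=100_000),
})

# Draw a reproducible 20,000-row subset for quick experiments.
df_small = df_full.sample(n=20_000, random_state=42)
print(df_small.shape)
```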

## Collaboration and Presentation

- **Collaboration:** Discuss your work openly within your team or with other teams. Sharing insights and learning from each other is encouraged.
- **Presentation:** Present your methodology, results, and the techniques that helped the most. Document your journey and the steps you took to achieve your results.



In [5]:
import gdown
from pathlib import Path

import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.pipeline import Pipeline

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [None]:
def download_from_gdrive(url, filename):
    # Extract the file ID from the URL
    file_id = url.split('/')[-2]
    download_url = f"https://drive.google.com/uc?id={file_id}"

    # Download the file
    if Path(filename).exists():
        print(f"File '{filename}' already exists. Skipping download.")
    else:
        gdown.download(download_url, filename, quiet=False)
        print(f"File downloaded as: {filename}")

train = 'https://drive.google.com/file/d/1guqSpDv1Q7ZZjSbXMYGbrTvGns0VCyU5/view?usp=drive_link'
valid = 'https://drive.google.com/file/d/1j7x8xhMimKbvW62D-XeDfuRyj9ia636q/view?usp=drive_link'
# Example usage

download_from_gdrive(train, 'train.csv')
download_from_gdrive(valid, 'valid.csv')

df = pd.read_csv('train.csv')
df_valid = pd.read_csv('valid.csv')
df.head().T

## Exploratory Data Analysis (EDA)

In [None]:
#df.fiProductClassDesc.value_counts()

In [None]:
#df.info()

In [None]:
# df.isnull().sum()

In [None]:
#df.describe()

In [None]:
#sns.histplot(data=df, x='SalePrice', bins=20)

In [None]:
# Show value_counts for all categorical columns. Note that some truly categorical columns, such as ModelID, are stored as numeric types.
categorical_cols = df.select_dtypes(exclude='number').columns
for col in categorical_cols:
  print(f"Value counts for column '{col}':")
  print(df[col].value_counts())
  print(f"NaN values:{df[col].isnull().sum()}")
  print()

### 3. Data Preprocessing

In [7]:
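def fix_transmission(X):
    # Work on a copy so the caller's DataFrame is not modified
    X = X.copy()
    # Merge the two spellings of the same category
    X['Transmission'] = X['Transmission'].replace('AutoShift', 'Autoshift')
    return X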

In [8]:
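def fix_year(X):
    # Work on a copy so the caller's DataFrame is not modified
    X = X.copy()
    # Split the sale date into numeric year and month features
    X['saledate'] = pd.to_datetime(X['saledate'])
    X['saleYear'] = X['saledate'].dt.year
    X['saleMonth'] = X['saledate'].dt.month
    X = X.drop('saledate', axis=1)
    return X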

In [9]:
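def fix_first_word(X):
    # Work on a copy so the caller's DataFrame is not modified
    X = X.copy()
    # Keep the first word of the description; the .str accessor leaves NaN as NaN
    X['fiProductClassDesc_first_word'] = X['fiProductClassDesc'].str.split().str[0]
    return X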

In [None]:
# OrdinalEncoder for column 'ProductSize':
# df['ProductSize_3'] = df['ProductSize'].map({'Compact': 1, 'Mini': 2, 'High': 3, 'Small': 4, 'Medium': 5, 'High': 6, 'Large / Medium': 7, 'Large': 8})
# df['ProductSize_3'] = df['ProductSize_3'].fillna(df['ProductSize_3'].mean())
# df['ProductSize_3'].value_counts(), df['ProductSize_3'].isnull().sum()

In [None]:
# Processing of categorical ordinal features using sklearn's OrdinalEncoder
# encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']], handle_unknown='use_encoded_value', unknown_value= -1) # Handle unknown values
# df['UsageBand_3'] = encoder.fit_transform(df[['UsageBand']])
# mean_real = df['UsageBand_3'][df.UsageBand_3.isin([0,1,2])].mean()
# df['UsageBand_3'] = df['UsageBand_3'].replace(-1, mean_real)
# # Fix: Fill missing values in the 'UsageBand_3' column and keep it in the DataFrame
# df['UsageBand_3'] = df['UsageBand_3'].fillna(mean_real)
# df['UsageBand_3'].value_counts(), df['UsageBand_3'].isnull().sum()

In [None]:
# 'Undercarriage_Pad_Width' - remove ' inch' from data - to numeric
# df['Undercarriage_Pad_Width_3'] = df['Undercarriage_Pad_Width'].str.replace(' inch', '').replace('None or Unspecified', -1)
# df['Undercarriage_Pad_Width_3'] = pd.to_numeric(df['Undercarriage_Pad_Width_3'])
# mean_real_2 = df['Undercarriage_Pad_Width_3'][df.Undercarriage_Pad_Width_3 != -1].mean()
# df['Undercarriage_Pad_Width_3'] = df['Undercarriage_Pad_Width_3'].replace(-1, mean_real_2)
# df['Undercarriage_Pad_Width_3'] = df['Undercarriage_Pad_Width_3'].fillna(mean_real_2)
# df['Undercarriage_Pad_Width_3'].value_counts(),df['Undercarriage_Pad_Width_3'].isnull().sum()

In [None]:
# 'Stick_Length' - remove " from data - to numeric
# df['Stick_Length_3'] = df['Stick_Length'].str.replace("' ", '.').str.replace('"', '').replace('None or Unspecified', -1)
# df['Stick_Length_3'] = pd.to_numeric(df['Stick_Length_3'])
# mean_real_3 = df['Stick_Length_3'][df.Stick_Length_3.isna() | (df.Stick_Length_3 != -1)].mean()
# print(mean_real_3)
# df['Stick_Length_3'] = df['Stick_Length_3'].replace(-1, mean_real_3)
# df['Stick_Length_3'] = df['Stick_Length_3'].fillna(mean_real_3)
# df['Stick_Length_3'].value_counts(), df['Stick_Length_3'].isnull().sum()
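
Note that replacing `' ` with `.` turns a value such as `9' 6"` into `9.6`, which reads as decimal feet even though 6 inches is half a foot. A possible alternative is to convert everything to inches. The sketch below assumes the values look like `9' 6"` or `None or Unspecified`, as the commented-out code implies; `stick_length_to_inches` is a hypothetical helper, not part of the notebook.

```python
import re

import numpy as np
import pandas as pd


def stick_length_to_inches(value):
    """Convert a string such as 9' 6" to total inches; anything else becomes NaN."""
    if not isinstance(value, str):
        return np.nan
    match = re.fullmatch(r"\s*(\d+)'\s*(\d+)\"\s*", value)
    if match is None:
        # Covers 'None or Unspecified' and any unexpected format
        return np.nan
    feet, inches = match.groups()
    return int(feet) * 12 + int(inches)


toy = pd.Series(["9' 6\"", "10' 10\"", "None or Unspecified", None])
inches = toy.map(stick_length_to_inches)
print(inches)
```

Leaving the unknown values as NaN lets a later imputation step decide how to fill them.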

In [None]:
#df[df['YearMade'] == 1000].head(20) # I have no idea what to do with year 1000
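
One option for the year 1000 rows, sketched under the assumption that 1000 is a placeholder for an unknown manufacturing year (the data does not confirm this): replace it with NaN and keep an indicator column so the model can still see that the year was missing. `fix_year_made` and `YearMade_missing` are hypothetical names.

```python
import numpy as np
import pandas as pd


def fix_year_made(X, placeholder=1000):
    # Work on a copy so the caller's DataFrame is not modified
    X = X.copy()
    is_missing = X["YearMade"] == placeholder
    # Flag the placeholder rows, then blank them out for later imputation
    X["YearMade_missing"] = is_missing.astype(int)
    X["YearMade"] = X["YearMade"].where(~is_missing, np.nan)
    return X


toy = pd.DataFrame({"YearMade": [1000, 1998, 2004, 1000]})
fixed = fix_year_made(toy)
print(fixed)
```

A function of this shape could be wrapped in a `FunctionTransformer` and placed ahead of the imputer, which would then fill the blanked years.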

## Mean / Target Encoding

In [10]:
class MeanTargetEncode(BaseEstimator, TransformerMixin):
    """Add a '<col>_2' column holding the mean target value of each category."""

    def __init__(self):
        self.target_mean_dict = None
        self.target_nan_mean_dict = None

    def fit(self, X, y):
        # Attach the target by position so a mismatched index cannot misalign rows
        data = X.copy()
        data['SalePrice'] = np.asarray(y)

        self.target_mean_dict = dict()
        self.target_nan_mean_dict = dict()

        for col in X.select_dtypes(exclude='number').columns:
            # Mean target per category (groupby skips NaN keys)
            self.target_mean_dict[col] = data.groupby(col)['SalePrice'].mean().to_dict()
            # Mean target of rows where the category is missing; also used for unseen categories
            self.target_nan_mean_dict[col] = data.loc[data[col].isna(), 'SalePrice'].mean()

        return self

    def transform(self, X):
        X = X.copy()

        for col in X.select_dtypes(exclude='number').columns:
            X[col + '_2'] = X[col].map(self.target_mean_dict[col]).fillna(self.target_nan_mean_dict[col])
            X[col + '_2'] = X[col + '_2'].astype(float)

        return X
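
To make the encoding concrete, here is a toy walk-through in plain pandas of the same logic the class applies to each column. The data and the `Powershift` category are made up for illustration.

```python
import pandas as pd

# Toy training data (hypothetical values)
toy = pd.DataFrame({
    "Transmission": ["Standard", "Standard", "Autoshift", None],
    "SalePrice": [10000.0, 20000.0, 40000.0, 30000.0],
})

# Mean price per category, plus the mean price of rows with a missing category
category_means = toy.groupby("Transmission")["SalePrice"].mean().to_dict()
nan_mean = toy.loc[toy["Transmission"].isna(), "SalePrice"].mean()

# New data containing a known, a missing, and an unseen category
new = pd.Series(["Standard", "Autoshift", None, "Powershift"])
encoded = new.map(category_means).fillna(nan_mean)
print(encoded.tolist())  # [15000.0, 40000.0, 30000.0, 30000.0]
```

Both the missing value and the unseen `Powershift` category fall back to the mean price of the rows whose category was missing during fitting.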


## Data for the Model

In [11]:
def drop_columns(X, col):
    X = X.drop(columns = col)
    return X

In [12]:
def prepare_data(X):
    X = X.select_dtypes('number')  # drop all categorical variables
    return X

In [13]:
df_2 = df.sample(150000)

y = df_2['SalePrice']
X = df_2.drop(columns=['SalePrice'])


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

pipeline = Pipeline([
    ('transmission_fixer', FunctionTransformer(fix_transmission)),
    ('year_fixer', FunctionTransformer(fix_year)),
    ('first_word_fixer', FunctionTransformer(fix_first_word)),
    ('mean_target_encode', MeanTargetEncode()),
    ('drop', FunctionTransformer(drop_columns, kw_args={'col': ['MachineID', 'ModelID','SalesID']})),
    ('prepare_data', FunctionTransformer(prepare_data)),
    ('fillna', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),
    ('model', RandomForestRegressor(n_jobs=-1,
                              n_estimators = 200,
                              #max_depth = 12,
                              min_impurity_decrease = 1000,
                              random_state=42
                              ))
])


pipeline.fit(X_train, y_train)

y_train_pred = pipeline.predict(X_train)

y_test_pred = pipeline.predict(X_test)

print('Train RMSE:', np.sqrt(mean_squared_error(y_train, y_train_pred)))
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print('R²:' , r2_score(y_test, y_test_pred))
print('Train MAE:', mean_absolute_error(y_train, y_train_pred))
print('Test MAE:', mean_absolute_error(y_test, y_test_pred))


Train RMSE: 3601.1786361075306
Test RMSE: 7836.2937888276765
R²: 0.8852581841956384
Train MAE: 2690.375799904813
Test MAE: 4919.277895616473


In [None]:
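regressor = pipeline.named_steps['model']
feature_importances = regressor.feature_importances_

# The model is fitted on the NumPy array produced by SimpleImputer and MinMaxScaler,
# so it has no feature_names_in_ attribute. Recover the column names by running only
# the preprocessing steps that come before the last three (imputer, scaler, model).
# This assumes the imputer drops no column, i.e. no column is entirely NaN.
feature_names = pipeline[:-3].transform(X_train).columns

# Create a DataFrame with the feature importances and sort them
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

print(feature_importance_df)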