
This notebook predicts the current stage of a construction project from its dates and, as the final step, recommends materials for the predicted stage. The plan is as follows:

### Step 1: Load and Inspect the Dataset
- Load the dataset from the uploaded CSV file.
- Inspect the dataset to understand its structure and identify the relevant features and target variable.

### Step 2: Data Preprocessing
- **Handle missing values in the 'current_stage' column:**
  - If 'current_stage' is NaN and 'Project_status' is 'Completed', set 'current_stage' to 'Handover'.
- Convert the date columns to the datetime data type and create new features from them.
- Fill the remaining NaN values with an appropriate strategy (such as the mean or median for numerical columns).
- Perform label encoding on the categorical target variable to convert it to numerical values for model training.
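
The preprocessing rules above can be tried on a small toy frame first. The sketch below is only illustrative: the rows are made up, and `built_up_area` is a hypothetical column name. It also drops rows whose stage is still unknown after the 'Handover' rule, which is one possible alternative to keeping them as a separate class.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the project data (values are made up for illustration)
toy = pd.DataFrame({
    "current_stage": ["Tiles work", None, None, "Handover"],
    "Project_status": ["Ongoing", "Completed", "Ongoing", "Completed"],
    "built_up_area": [1200.0, np.nan, 900.0, 1500.0],  # hypothetical column
})

# Rule from Step 2: a completed project with no recorded stage is at 'Handover'
mask = toy["current_stage"].isna() & (toy["Project_status"] == "Completed")
toy.loc[mask, "current_stage"] = "Handover"

# Rows whose stage is still unknown carry no usable label, so they are dropped here
toy = toy.dropna(subset=["current_stage"]).copy()

# Median imputation for the numeric columns
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# Label encoding of the target
encoder = LabelEncoder()
toy["current_stage_encoded"] = encoder.fit_transform(toy["current_stage"])
print(toy)
```

Whether unlabeled rows should be dropped or kept depends on how many there are in the real dataset; the sketch only shows the mechanics.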

### Step 3: Feature Engineering
- Create additional features that might be useful for predicting the current stage of the project. This could include ratios or differences between existing features.
- Select relevant features for training the model, excluding non-numeric and irrelevant columns.
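
As a rough sketch of the ratio and difference features described above, the date-derived columns could be built by one helper that takes an explicit reference date. The function name `build_date_features` and the `as_of` parameter are assumptions made for this example; passing the reference date explicitly keeps the features reproducible and lets training and prediction share the same code.

```python
import pandas as pd

def build_date_features(df, as_of):
    """Return a copy of df with date-derived features measured at `as_of`."""
    out = df.copy()
    start = pd.to_datetime(out["actual_commencement_date"], errors="coerce")
    finish = pd.to_datetime(out["estimated_finish_date"], errors="coerce")
    as_of = pd.Timestamp(as_of)

    out["duration_until_estimated_finish"] = (finish - start).dt.days
    out["duration_since_commencement"] = (as_of - start).dt.days
    out["remaining_duration"] = (finish - as_of).dt.days

    # A planned duration of zero would produce inf, so it is treated as missing
    planned = out["duration_until_estimated_finish"]
    planned = planned.where(planned != 0)
    out["progress_ratio"] = out["duration_since_commencement"] / planned

    out["is_delayed"] = (out["remaining_duration"] < 0).astype(int)
    return out

sample = pd.DataFrame([{
    "actual_commencement_date": "2023-01-01",
    "estimated_finish_date": "2023-12-31",
}])
features = build_date_features(sample, as_of="2023-07-02")
print(features[["duration_since_commencement", "progress_ratio", "is_delayed"]])
```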

### Step 4: Data Splitting
- Split the dataset into training and testing sets to train the model and evaluate its performance.

### Step 5: Model Training
- Train a predictive model (e.g., Gradient Boosting Machine) using the training set.

### Step 6: Model Evaluation
- Evaluate the performance of the trained model using the testing set to ensure that it can make accurate predictions.
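
When some stages have only a handful of rows, a single train/test split can give a noisy accuracy figure. One possible complement is stratified k-fold cross-validation, sketched below. The data comes from `make_classification` and merely stands in for the project features; the fold count and `n_estimators` value are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the feature matrix and encoded stages (an assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=42
)

model = GradientBoostingClassifier(n_estimators=50, random_state=42)

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

Note that stratified splitting requires every class to have at least as many rows as there are folds, so very rare stages may need to be merged or handled separately.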

### Step 7: Prediction Function
- Define a prediction function that takes the necessary inputs (such as the start date, end date, and built-up area) and uses the trained model to predict the current stage of the project.
- **Recommend materials:** Use the additional data to recommend materials for the predicted stage of the project.
- Test the prediction function to ensure it works correctly.
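
The materials recommendation is essentially a text lookup from a stage name to a row of the materials table. A minimal sketch of such a lookup follows, assuming the `Activity` and `Materials (suggestions)` columns used later in this notebook. The helper name `lookup_materials` and the toy table are made up for illustration. Matching is literal (`regex=False`), so characters such as parentheses in stage names are not interpreted as a pattern.

```python
import pandas as pd

def lookup_materials(materials, stage, default="No materials suggested"):
    """Return the first non-missing materials entry whose activity contains `stage`."""
    stage = stage.strip()  # stage names in the data may carry trailing spaces
    matches = materials.loc[
        materials["Activity"].str.contains(stage, case=False, na=False, regex=False),
        "Materials (suggestions)",
    ].dropna()
    return matches.iloc[0] if not matches.empty else default

# Toy materials table (values are illustrative only)
materials_toy = pd.DataFrame({
    "Activity": ["Slab of B (bottom)", "Tiles work", "Cleaning & survey"],
    "Materials (suggestions)": ["TMT bar, cement", "Tiles", None],
})

print(lookup_materials(materials_toy, "Slab of B (bottom) "))
print(lookup_materials(materials_toy, "Cleaning & survey"))
print(lookup_materials(materials_toy, "Handover"))
```

Returning a default for both "no matching row" and "matching row with an empty materials cell" keeps the caller from having to handle NaN values.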

Once we have defined the prediction function, we will be able to use it to make predictions on new data and recommend materials for the predicted stage.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from datetime import datetime
import numpy as np

# Define the file path to the dataset
file_path = '/content/4th_floor_with_estimated_durations_final_output.csv'

# Re-loading the dataset
data = pd.read_csv(file_path)

# Step 2: Data Preprocessing
# Converting date columns to datetime objects
data['actual_commencement_date'] = pd.to_datetime(data['actual_commencement_date'], errors='coerce')
data['estimated_finish_date'] = pd.to_datetime(data['estimated_finish_date'], errors='coerce')

# Creating new features based on the date columns
# Note: these features are measured against the run date, so they change from day to day
current_date = datetime.now()
data['duration_until_estimated_finish'] = (data['estimated_finish_date'] - data['actual_commencement_date']).dt.days
data['duration_since_commencement'] = (current_date - data['actual_commencement_date']).dt.days
data['remaining_duration'] = (data['estimated_finish_date'] - current_date).dt.days
data['progress_ratio'] = data['duration_since_commencement'] / data['duration_until_estimated_finish']

# Handling missing values in the 'current_stage' column
data.loc[(data['current_stage'].isna()) & (data['Project_status'] == 'Completed'), 'current_stage'] = 'Handover'

# Step 3: Feature Engineering
# Creating new features
data['year_of_commencement'] = data['actual_commencement_date'].dt.year
data['month_of_commencement'] = data['actual_commencement_date'].dt.month
data['year_of_estimated_finish'] = data['estimated_finish_date'].dt.year
data['month_of_estimated_finish'] = data['estimated_finish_date'].dt.month
data['days_exceeding_estimated_duration'] = data['duration_since_commencement'] - data['duration_until_estimated_finish']
data['is_delayed'] = (data['remaining_duration'] < 0).astype(int)

# Handling other missing values with appropriate strategies
data.fillna(data.mean(numeric_only=True), inplace=True)

# Performing label encoding on the 'current_stage' column
label_encoder = LabelEncoder()
data['current_stage'] = data['current_stage'].astype(str)  # Any remaining NaN values become the literal class 'nan'
data['current_stage_encoded'] = label_encoder.fit_transform(data['current_stage'])

# Step 4: Data Splitting
# Selecting relevant features for the model
feature_columns = [
    'duration_until_estimated_finish', 'duration_since_commencement', 'remaining_duration',
    'progress_ratio', 'year_of_commencement', 'month_of_commencement',
    'year_of_estimated_finish', 'month_of_estimated_finish',
    'days_exceeding_estimated_duration', 'is_delayed'
]

# Defining the feature set and the target variable
X = data[feature_columns]
y = data['current_stage_encoded']

# Splitting the data into training and testing sets (80% training and 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Displaying a message to indicate the preprocessing steps are completed
print("Preprocessing steps completed.")



In [None]:
from sklearn.ensemble import GradientBoostingClassifier

# Initializing and training the Gradient Boosting Classifier
gbm_model = GradientBoostingClassifier(random_state=42)
gbm_model.fit(X_train, y_train)

# Displaying a message to indicate that the model has been trained
"Gradient Boosting Model has been trained successfully."


'Gradient Boosting Model has been trained successfully.'

In [None]:
from sklearn.metrics import classification_report, accuracy_score
import numpy as np

def evaluate_model(model, X_test, y_test, label_encoder):
    try:
        # Predicting the current stage on the testing set
        y_pred = model.predict(X_test)

        # Calculating the accuracy of the model
        accuracy = accuracy_score(y_test, y_pred)

        # Getting the unique labels present in the test set and predictions
        unique_labels = np.unique(np.concatenate((y_test, y_pred)))

        # Getting the classification report with the correct labels
        class_report = classification_report(y_test, y_pred, labels=unique_labels, target_names=label_encoder.classes_[unique_labels])

        return accuracy, class_report
    except Exception as e:
        print(f"Error evaluating model: {e}")
        return None, None

# Usage:

accuracy, class_report = evaluate_model(gbm_model, X_test, y_test, label_encoder)

if accuracy is not None and class_report is not None:
    print("Model Accuracy:", accuracy)
    print("\nClassification Report:\n", class_report)


Model Accuracy: 0.950207468879668

Classification Report:
                                                                                              precision    recall  f1-score   support

                                                  1st Floor slab casting_estimated_duration       1.00      0.67      0.80         3
                                               1st floor Columns casting_estimated_duration       0.50      0.50      0.50         2
                                                  2nd Floor slab casting_estimated_duration       1.00      0.88      0.93         8
                                                  3rd Floor slab casting_estimated_duration       1.00      1.00      1.00         7
                                              3rd floor Columns casting _estimated_duration       0.50      1.00      0.67         1
                                                  4th Floor slab casting_estimated_duration       1.00      1.00      1.00         5
         

In [None]:
# Getting the unique stage names in the main dataset and the materials data file
import pandas as pd

materials_data = pd.read_csv("/content/Copy of stage_with materai.csv")
unique_stages_main_dataset = data['current_stage'].unique()
unique_stages_materials_data = materials_data['Activity'].unique()

unique_stages_main_dataset, unique_stages_materials_data


(array(['Handover',
        'Plumbing & Sanitary,Electrification Works_estimated_duration',
        'Plastering on outer sides_estimated_duration',
        'Painting and Finishing_estimated_duration',
        '3rd Floor slab casting_estimated_duration',
        'Electrical concealed, PVC Fitting, plastering at 1st-4th floor_estimated_duration',
        'nan', 'Brick work at 1st Floor _estimated_duration',
        '3rd floor Columns casting _estimated_duration',
        'Tiles work_estimated_duration',
        'Electrical concealed, PVC Fitting, plastering at ground floor_estimated_duration',
        '4th Floor slab casting_estimated_duration',
        'Brick work of 2nd to 4th Floor _estimated_duration',
        'Doors & Windows Fixing Furniture work_estimated_duration',
        'Ground Floor slab casting _estimated_duration',
        'Cleaning & survey_estimated_duration',
        '1st Floor slab casting_estimated_duration',
        'Excavation,leveling & P.C.C  for Basement  B1 _esti

In [None]:
# Removing the "_estimated_duration" suffix from the stage names in the main dataset
data['current_stage_cleaned'] = data['current_stage'].str.replace('_estimated_duration', '', regex=False)

# Creating a dictionary to map the encoded labels to the cleaned stage names.
# The existing label encoder is reused rather than re-fitted: re-fitting on the cleaned
# names could reorder the classes and misalign them with the labels the trained model predicts.
label_to_stage_mapping = {
    label: stage.replace('_estimated_duration', '')
    for label, stage in enumerate(label_encoder.classes_)
}

# Displaying the cleaned unique stage names
cleaned_unique_stages_main_dataset = data['current_stage_cleaned'].unique()
cleaned_unique_stages_main_dataset


array(['Handover', 'Plumbing & Sanitary,Electrification Works',
       'Plastering on outer sides', 'Painting and Finishing',
       '3rd Floor slab casting',
       'Electrical concealed, PVC Fitting, plastering at 1st-4th floor',
       'nan', 'Brick work at 1st Floor ', '3rd floor Columns casting ',
       'Tiles work',
       'Electrical concealed, PVC Fitting, plastering at ground floor',
       '4th Floor slab casting', 'Brick work of 2nd to 4th Floor ',
       'Doors & Windows Fixing Furniture work',
       'Ground Floor slab casting ', 'Cleaning & survey',
       '1st Floor slab casting',
       'Excavation,leveling & P.C.C  for Basement  B1 ',
       '4th floor Columns casting ',
       'Brick work at Basement to Ground Floor ', 'Slab of B (bottom) ',
       'Electrical concealed, PVC Fitting, plastering at Basement',
       '2nd floor Columns casting ', '2nd Floor slab casting',
       'Raft footing, Column B1, Retaining wall Reinforcement ,Concrete pouring ',
       '1st flo

In [None]:
# Updating the prediction function to use the cleaned stage names

# Defining the prediction function
def predict_current_stage(inputs):
    """
    Function to predict the current stage of a project and recommend materials.

    Args:
    inputs (dict): Dictionary containing the necessary inputs (start date, end date).

    Returns:
    dict: Dictionary containing the predicted stage and recommended materials.
    """
    # Creating a data frame from the inputs
    input_data = pd.DataFrame([inputs])

    # Converting date columns to datetime objects and creating new features
    input_data['actual_commencement_date'] = pd.to_datetime(input_data['actual_commencement_date'])
    input_data['estimated_finish_date'] = pd.to_datetime(input_data['estimated_finish_date'])
    current_date = datetime.now()
    input_data['duration_until_estimated_finish'] = (input_data['estimated_finish_date'] - input_data['actual_commencement_date']).dt.days
    input_data['duration_since_commencement'] = (current_date - input_data['actual_commencement_date']).dt.days
    input_data['remaining_duration'] = (input_data['estimated_finish_date'] - current_date).dt.days
    input_data['progress_ratio'] = input_data['duration_since_commencement'] / input_data['duration_until_estimated_finish']
    input_data['year_of_commencement'] = input_data['actual_commencement_date'].dt.year
    input_data['month_of_commencement'] = input_data['actual_commencement_date'].dt.month
    input_data['year_of_estimated_finish'] = input_data['estimated_finish_date'].dt.year
    input_data['month_of_estimated_finish'] = input_data['estimated_finish_date'].dt.month
    input_data['days_exceeding_estimated_duration'] = input_data['duration_since_commencement'] - input_data['duration_until_estimated_finish']
    input_data['is_delayed'] = (input_data['remaining_duration'] < 0).astype(int)

    # Selecting the relevant features
    input_features = input_data[feature_columns]

    # Making the prediction using the trained model
    predicted_label = gbm_model.predict(input_features)[0]

    # Getting the predicted stage and the recommended materials
    predicted_stage = label_to_stage_mapping[predicted_label]
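    # regex=False matches the stage name literally (names may contain characters such as parentheses)
    recommended_materials = materials_data.loc[materials_data['Activity'].str.contains(predicted_stage, case=False, na=False, regex=False), 'Materials (suggestions)'].values[0]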

    # Returning the results
    return {
        "Predicted Stage": predicted_stage,
        "Recommended Materials": recommended_materials
    }

# Testing the prediction function with a sample input
test_input = {
    "actual_commencement_date": "2022-03-22",
    "estimated_finish_date": "2024-12-31",
}

predict_current_stage(test_input)


{'Predicted Stage': 'Electrical concealed, PVC Fitting, plastering at 1st-4th floor',
 'Recommended Materials': 'cement, sand,  circuit pipe, Cpvc,&Pvc pipe'}

In [None]:
# Updating the prediction function to handle cases where no match is found in the materials data

# Defining the prediction function
def predict_current_stage(inputs):
    """
    Function to predict all possible current stages of a project and recommend materials for each stage.

    Args:
    inputs (dict): Dictionary containing the necessary inputs (start date, end date).

    Returns:
    list: List of dictionaries containing the possible stages and recommended materials, ordered by probability.
    """
    # Creating a data frame from the inputs
    input_data = pd.DataFrame([inputs])

    # Converting date columns to datetime objects and creating new features
    input_data['actual_commencement_date'] = pd.to_datetime(input_data['actual_commencement_date'])
    input_data['estimated_finish_date'] = pd.to_datetime(input_data['estimated_finish_date'])
    current_date = datetime.now()
    input_data['duration_until_estimated_finish'] = (input_data['estimated_finish_date'] - input_data['actual_commencement_date']).dt.days
    input_data['duration_since_commencement'] = (current_date - input_data['actual_commencement_date']).dt.days
    input_data['remaining_duration'] = (input_data['estimated_finish_date'] - current_date).dt.days
    input_data['progress_ratio'] = input_data['duration_since_commencement'] / input_data['duration_until_estimated_finish']
    input_data['year_of_commencement'] = input_data['actual_commencement_date'].dt.year
    input_data['month_of_commencement'] = input_data['actual_commencement_date'].dt.month
    input_data['year_of_estimated_finish'] = input_data['estimated_finish_date'].dt.year
    input_data['month_of_estimated_finish'] = input_data['estimated_finish_date'].dt.month
    input_data['days_exceeding_estimated_duration'] = input_data['duration_since_commencement'] - input_data['duration_until_estimated_finish']
    input_data['is_delayed'] = (input_data['remaining_duration'] < 0).astype(int)

    # Selecting the relevant features
    input_features = input_data[feature_columns]

    # Making the prediction using the trained model to get the probability of each stage
    predicted_probs = gbm_model.predict_proba(input_features)[0]

    # Getting all possible stages and the recommended materials for each stage
    # (the columns of predict_proba follow the order of gbm_model.classes_)
    predictions = []
    for label, prob in zip(gbm_model.classes_, predicted_probs):
        stage = label_to_stage_mapping[label]
        recommended_materials = materials_data.loc[materials_data['Activity'].str.contains(stage, case=False, na=False, regex=False), 'Materials (suggestions)']
        recommended_materials = recommended_materials.values[0] if not recommended_materials.empty else "No materials suggested"
        predictions.append({
            "Stage": stage,
            "Probability": prob,
            "Recommended Materials": recommended_materials
        })

    # Sorting the predictions by probability in descending order
    predictions = sorted(predictions, key=lambda x: x['Probability'], reverse=True)

    # Returning the results
    return predictions

# Testing the prediction function with a sample input
test_input = {
    "actual_commencement_date": "2023-01-01",
    "estimated_finish_date": "2023-12-31",
}

predict_current_stage(test_input)



[{'Stage': 'Tiles work',
  'Probability': 0.7588985450343456,
  'Recommended Materials': 'Tiles'},
 {'Stage': 'Painting and Finishing',
  'Probability': 0.240418600886764,
  'Recommended Materials': 'paint, putty,primer'},
 {'Stage': 'Plastering on outer sides',
  'Probability': 0.00034630110196847433,
  'Recommended Materials': 'cement, sand'},
 {'Stage': 'Handover',
  'Probability': 5.3903872672968355e-05,
  'Recommended Materials': 'No materials suggested'},
 {'Stage': 'Electrical concealed, PVC Fitting, plastering at ground floor',
  'Probability': 4.980366320809199e-05,
  'Recommended Materials': 'cement, sand,  circuit pipe, Cpvc,&Pvc pipe'},
 {'Stage': 'Cleaning & survey',
  'Probability': 3.705526428761009e-05,
  'Recommended Materials': nan},
 {'Stage': '3rd Floor slab casting',
  'Probability': 3.134730084204892e-05,
  'Recommended Materials': 'TMT bar ,cement, sand, aggregates'},
 {'Stage': '4th Floor slab casting',
  'Probability': 2.7485928877424896e-05,
  'Recommended Mat
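
One detail worth checking when ranking stages by probability: in scikit-learn, the columns of `predict_proba` follow the classifier's `classes_` attribute, which contains only the labels seen during training. The toy sketch below uses made-up data in which label 1 never appears in the training set, so column position and label value differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy training data: label 1 never appears (values are made up for illustration)
X_toy = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y_toy = np.array([0, 0, 2, 2, 3, 3])

clf = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

probs = clf.predict_proba(np.array([[1.05]]))[0]

# Pair each probability with its label through classes_, not through its position
ranked = sorted(zip(clf.classes_, probs), key=lambda pair: pair[1], reverse=True)

print("Labels known to the model:", clf.classes_)
print("Ranked (label, probability):", ranked)
```

If every encoded stage is present in the training split, position and label coincide, but pairing through `classes_` stays correct in both cases.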

In [None]:
from pandas.tseries.offsets import MonthEnd

# Updating the prediction function to provide predictions for the current and upcoming stages for each month

# Defining the prediction function
def predict_current_stage(inputs):
    """
    Function to predict the current and upcoming stages for each month and recommend materials for each stage.

    Args:
    inputs (dict): Dictionary containing the necessary inputs (start date, end date).

    Returns:
    list: List of dictionaries containing the predictions for each month.
    """
    # Creating a data frame from the inputs
    input_data = pd.DataFrame([inputs])

    # Converting date columns to datetime objects
    input_data['actual_commencement_date'] = pd.to_datetime(input_data['actual_commencement_date'])
    input_data['estimated_finish_date'] = pd.to_datetime(input_data['estimated_finish_date'])

    # Creating a list to store the predictions for each month
    monthly_predictions = []

    # Looping from the current date to the estimated finish date; after the first iteration
    # each step lands on a month end (the first step only moves to the end of the current month)
    current_date = pd.to_datetime("today")
    while current_date <= input_data['estimated_finish_date'].iloc[0]:
        # Creating new features using the current date in the loop
        input_data['duration_until_estimated_finish'] = (input_data['estimated_finish_date'] - input_data['actual_commencement_date']).dt.days
        input_data['duration_since_commencement'] = (current_date - input_data['actual_commencement_date']).dt.days
        input_data['remaining_duration'] = (input_data['estimated_finish_date'] - current_date).dt.days
        input_data['progress_ratio'] = input_data['duration_since_commencement'] / input_data['duration_until_estimated_finish']
        input_data['year_of_commencement'] = input_data['actual_commencement_date'].dt.year
        input_data['month_of_commencement'] = input_data['actual_commencement_date'].dt.month
        input_data['year_of_estimated_finish'] = input_data['estimated_finish_date'].dt.year
        input_data['month_of_estimated_finish'] = input_data['estimated_finish_date'].dt.month
        input_data['days_exceeding_estimated_duration'] = input_data['duration_since_commencement'] - input_data['duration_until_estimated_finish']
        input_data['is_delayed'] = (input_data['remaining_duration'] < 0).astype(int)

        # Selecting the relevant features
        input_features = input_data[feature_columns]

        # Making the prediction using the trained model to get the probability of each stage
        predicted_probs = gbm_model.predict_proba(input_features)[0]

        # Getting the most likely stage and the recommended materials for the current date in the loop
        # (the columns of predict_proba follow the order of gbm_model.classes_)
        top_prediction_index = np.argmax(predicted_probs)
        top_stage = label_to_stage_mapping[gbm_model.classes_[top_prediction_index]]
        top_probability = predicted_probs[top_prediction_index]
        recommended_materials = materials_data.loc[materials_data['Activity'].str.contains(top_stage, case=False, na=False, regex=False), 'Materials (suggestions)']
        recommended_materials = recommended_materials.values[0] if not recommended_materials.empty else "No materials suggested"

        # Adding the prediction for the current date to the list of monthly predictions
        monthly_predictions.append({
            "Date": current_date.strftime('%Y-%m-%d'),
            "Stage": top_stage,
            "Probability": top_probability,
            "Recommended Materials": recommended_materials
        })

        # Moving to the next month end
        current_date = current_date + MonthEnd(1)

    # Returning the results
    return monthly_predictions

# Testing the prediction function with a sample input
test_input = {
    "actual_commencement_date": "2023-08-21",
    "estimated_finish_date": "2024-12-03",
}

predict_current_stage(test_input)


[{'Date': '2023-09-27',
  'Stage': 'Raft footing, Column B1, Retaining wall Reinforcement ,Concrete pouring ',
  'Probability': 0.9999997952300529,
  'Recommended Materials': 'TMT bar ,cement, sand, aggregates'},
 {'Date': '2023-09-30',
  'Stage': 'Raft footing, Column B1, Retaining wall Reinforcement ,Concrete pouring ',
  'Probability': 0.9999997910600721,
  'Recommended Materials': 'TMT bar ,cement, sand, aggregates'},
 {'Date': '2023-10-31',
  'Stage': 'Ground Floor slab casting ',
  'Probability': 0.9999986250059786,
  'Recommended Materials': 'TMT bar ,cement, sand, aggregates ,Pvc pipes,circuit pipes,lightbox, fan box'},
 {'Date': '2023-11-30',
  'Stage': '2nd floor Columns casting ',
  'Probability': 0.9999998738109648,
  'Recommended Materials': 'TMT bar ,cement, sand, aggregates'},
 {'Date': '2023-12-31',
  'Stage': 'Brick work at Basement to Ground Floor ',
  'Probability': 0.9999993383237755,
  'Recommended Materials': ' blocks/bricks'},
 {'Date': '2024-01-31',
  'Stage': '
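
The dates in the output above show how `MonthEnd(1)` behaves: the first step only moves from 2023-09-27 to the end of the same month, and later steps land on month ends. If evenly spaced dates on the same day of each month are preferred, `pd.DateOffset(months=k)` is one option. The helper `monthly_dates` below is a hypothetical name used only for this sketch.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

def monthly_dates(start, end):
    """Dates from start to end (inclusive), one calendar month apart."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    dates = []
    k = 0
    current = start
    while current <= end:
        dates.append(current)
        k += 1
        # Offsetting from the start date avoids drift when a month is shorter
        current = start + pd.DateOffset(months=k)
    return dates

first = pd.Timestamp("2023-09-27")

# MonthEnd rolls forward to the end of the current month first
month_end_step = first + MonthEnd(1)  # 2023-09-30

# DateOffset keeps the day of the month where possible
same_day_steps = monthly_dates("2023-09-27", "2023-12-03")

print(month_end_step.date())
print([d.strftime("%Y-%m-%d") for d in same_day_steps])
```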

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%pip install sklearn2pmml
