Install required packages:
pip install pandas scikit-learn
So far I have mainly combined the two datasets by merging them on project ID. The two datasets:
- GI_Interactive_Queue: provides project attributes such as capacity, fuel type, queue timing, etc.
- gi_master_extracted_data: contains upgrade cost information stored in a JSON structure.
- Split project_ids into individual rows for projects with multiple IDs.
- Parse project_cost_map_json (needed because the cost maps are stored as strings and must be converted into actual dictionary objects).
- Extract project-level upgrade costs from the parsed JSON structure.
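The merge-and-extract steps above can be sketched as follows. This is a minimal illustration on made-up data: the delimiter for multi-ID rows, the `summer_mw` column name, and the exact JSON layout are assumptions, not confirmed details of the real datasets.

```python
import json
import pandas as pd

# Hypothetical minimal versions of the two datasets (values invented).
queue = pd.DataFrame({
    "project_id": ["P1", "P2"],
    "summer_mw": [100.0, 250.0],
})
master = pd.DataFrame({
    "project_ids": ["P1;P2"],  # multiple project IDs stored in one field
    "project_cost_map_json": ['{"P1": 1200000, "P2": 3400000}'],
})

# Split project_ids into individual rows for projects with multiple IDs
master = (
    master.assign(project_id=master["project_ids"].str.split(";"))
    .explode("project_id")
)

# Parse the JSON string into a dict, then pull out each project's cost
cost_maps = master["project_cost_map_json"].apply(json.loads)
master["cost"] = [m.get(pid) for m, pid in zip(cost_maps, master["project_id"])]

# Merge upgrade costs onto the queue attributes on project ID
merged = queue.merge(master[["project_id", "cost"]], on="project_id", how="inner")
print(merged)
```

The `explode` call is what turns one multi-ID row into one row per project, so the subsequent merge on `project_id` lines up one-to-one.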
- Target variable: cost (extracted from the cost map)
- Input feature: Summer MW (only one feature included so far, to confirm the model works end to end)
The linear regression model is fit with scikit-learn: Cost = Intercept + (Coefficient × Summer MW)
The model outputs:
- Regression intercept
- Cost coefficient per MW
- Example cost prediction for a hypothetical project
The example predicts the expected upgrade cost for a hypothetical 100 MW project using the trained model.
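A minimal sketch of this regression step, assuming the merged data has been reduced to a Summer MW column and a cost column (the training values below are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: Summer MW vs. upgrade cost (values invented).
X = np.array([[50.0], [120.0], [200.0], [300.0]])  # Summer MW
y = np.array([1.0e6, 2.2e6, 4.1e6, 6.0e6])         # upgrade cost in dollars

# Fit: Cost = Intercept + (Coefficient x Summer MW)
model = LinearRegression().fit(X, y)
print(f"Intercept: {model.intercept_:.0f}")
print(f"Cost coefficient per MW: {model.coef_[0]:.0f}")

# Example cost prediction for a hypothetical 100 MW project
predicted = model.predict(np.array([[100.0]]))[0]
print(f"Predicted upgrade cost for 100 MW: {predicted:.0f}")
```

With a single feature, `model.coef_[0]` is directly interpretable as the marginal upgrade cost per MW of summer capacity.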
Goal: Calculate the median timeline, since the average is too skewed by outliers.
Input feature: in_service_timeline
A function parses strings like "24-48 months" from the datasets and converts them into numeric values by:
- removing the word "months"
- splitting the range into its lower and upper bounds
- taking the average of the two bounds
The median is then taken over these numeric values. For now, the baseline model itself doesn't matter much: since it predicts the median for every project, the initial result will simply be the median timeline.
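The parsing steps above can be sketched as follows; the function name and the sample timeline strings are hypothetical, and single values without a range are assumed to pass through unchanged.

```python
import statistics

def parse_timeline(text):
    """Convert a string like '24-48 months' to a numeric value in months."""
    # Remove the word "months" and surrounding whitespace
    cleaned = text.lower().replace("months", "").strip()
    if "-" in cleaned:
        # Split into lower and upper bounds, then take the average
        low, high = cleaned.split("-")
        return (float(low) + float(high)) / 2
    # Assumption: single values like "36 months" are returned as-is
    return float(cleaned)

# Hypothetical sample of in_service_timeline strings
timelines = ["24-48 months", "12-24 months", "36 months", "48-60 months"]
values = [parse_timeline(t) for t in timelines]

print(values)
print(statistics.median(values))  # the baseline prediction
```

The baseline then predicts `statistics.median(values)` for every project, which is why the initial evaluation trivially returns the median.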