<a href="https://colab.research.google.com/github/IAMDSVSSANGRAL/Capstone--Project/blob/main/appliance_energy_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Appliance Energy Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Team
##### **Team Member 1 -Samadhan**


# **Project Summary -**

Objective:
The objective of this project is to develop a regression model that accurately predicts the energy consumption of household appliances based on various input features. The model aims to provide insights into energy usage patterns and facilitate energy efficiency improvements in residential settings.

Data:
The project utilizes a dataset that contains information on household appliance energy consumption along with several relevant input features. The dataset includes variables such as temperature, humidity, time of day, and various appliance power readings. The data is collected over a specific time period and is representative of real-world residential energy usage scenarios.

Tasks:

Exploratory Data Analysis (EDA):

*   Perform a thorough analysis of the dataset to understand the distribution, statistics, and relationships among variables.
*   Identify any missing values, outliers, or data quality issues that need to be addressed.
*   Visualize the data using appropriate charts and graphs to gain insights into the patterns and trends.

Data Preprocessing:

*   Handle missing values by applying suitable imputation techniques or deciding on appropriate strategies for dealing with them.
*   Address outliers and anomalies by considering various methods such as removal, transformation, or capping.
*   Normalize or scale the data if necessary to ensure all features are on a similar scale.

Feature Engineering:

*   Explore the relationships between the input features and the target variable (appliance energy consumption) to identify potential feature engineering opportunities.
*   Create new features, derive meaningful variables, or transform existing variables to capture important patterns or interactions in the data.

Model Development:

*   Split the dataset into training and testing sets for model development and evaluation.
*   Select an appropriate regression algorithm (e.g., linear regression, decision tree regression, random forest regression) based on the project requirements and characteristics of the data.
*   Train the model using the training data and tune hyperparameters to optimize performance.
*   Evaluate the model's performance using various metrics such as mean squared error (MSE), mean absolute error (MAE), and R-squared.

Model Evaluation and Interpretation:

*   Assess the model's performance on the testing data to measure its ability to generalize to unseen data.
*   Interpret the model's coefficients or feature importance to gain insights into the factors that have the most significant impact on appliance energy consumption.
*   Validate the model's predictions against domain knowledge or external benchmarks to ensure its reliability and usefulness.

Model Deployment and Recommendations:

*   Deploy the trained model into a production environment or create a user-friendly interface for stakeholders to interact with the model.
*   Provide recommendations based on the model's predictions and insights to improve energy efficiency, optimize appliance usage, or suggest modifications in residential settings.

Conclusion:
The Appliance Energy Prediction regression project aims to develop a robust regression model to accurately predict household appliance energy consumption. By analyzing and understanding the data, performing feature engineering, and building an effective regression model, the project provides valuable insights and recommendations for optimizing energy usage and promoting energy-efficient practices in residential settings.

Note: This project summary provides a general outline and can be tailored based on specific requirements, dataset characteristics, and project goals.
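
The model development and evaluation steps outlined in the summary can be sketched end to end as follows. This is only an illustration on synthetic data, not the project's actual pipeline: the three features, the coefficients, and the noise level are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption: 3 numeric features).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Hold out 30% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit on the training rows only, then predict the held-out rows.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# The three metrics named in the summary.
metrics = {
    "MSE": mean_squared_error(y_test, y_pred),
    "MAE": mean_absolute_error(y_test, y_pred),
    "R2": r2_score(y_test, y_pred),
}
print(metrics)
```

MSE penalizes large errors more heavily than MAE, while R-squared reports the share of variance explained, so the three are usually read together.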

# **GitHub Link -**

https://github.com/IAMDSVSSANGRAL/applianceenergyprediction

# **Problem Statement**


**Write Problem Statement Here.**

# **General Guidelines** : -

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.

     The additional credits will have advantages over other students during Star Student selection.

             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.


```
# Chart visualization code
```


*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

from pandas.plotting import scatter_matrix
%matplotlib inline

import seaborn as sns
from datetime import datetime as dt

pd.set_option('display.max_columns', None)

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#importing the data set
data_raw = pd.read_csv('/content/drive/MyDrive/Santa/Regression capstone/data_application_energy.csv')

In [None]:
#creating a copy of data set
data = data_raw.copy()

### Dataset First View

In [None]:
# Dataset First Look
data.head()

In [None]:
# Dataset Rows & Columns count
num_rows, num_cols = data.shape

print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

### Dataset Information

In [None]:
# Dataset Info
data.info()

In [None]:
# Convert the 'date' column from object (string) to datetime
data['date'] = pd.to_datetime(data['date'])


In [None]:
# Setting date as the index:
data.set_index('date', inplace=True)

#### Duplicate Values

In [None]:
# Collect any duplicated rows into a DataFrame named 'df'
df = data[data.duplicated()]

In [None]:
# df is empty, so there are no duplicate rows in the data
df.head()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isna().sum()

In [None]:
# Visualizing the missing values
import missingno as msno
import matplotlib.pyplot as plt

# Plotting the null matrix
msno.matrix(data)

# Customizing the plot
plt.title('Null Matrix')
plt.show()


### What did you know about your dataset?

We examined the data and found no duplicate rows in the dataset.

We also found no null values in the dataset.

While examining the data types, we found that the date feature was stored as an object, whereas it should be a datetime, so we converted it and set it as the index.
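
The three checks described above (duplicates, nulls, and the dtype of the date column) can be bundled into one small helper. This is a sketch on a toy frame; `quality_report` is a hypothetical name, and the toy rows are made up for illustration rather than taken from the dataset.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> dict:
    """Summarize duplicates, missing values, and the dtype of the 'date' column."""
    return {
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_values": int(frame.isna().sum().sum()),
        "date_is_datetime": bool(pd.api.types.is_datetime64_any_dtype(frame["date"])),
    }


# Toy rows (assumption): the last row duplicates the second one.
toy = pd.DataFrame({
    "date": ["2016-01-11 17:00:00", "2016-01-11 17:10:00", "2016-01-11 17:10:00"],
    "Appliances": [60, 50, 50],
})

before = quality_report(toy)

# Parse the strings into real timestamps, as done for the project data.
toy["date"] = pd.to_datetime(toy["date"])
after = quality_report(toy)

print(before)
print(after)
```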

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe(include='all')

### Variables Description

**The observation data consists of the following variables:**


date: date and time, formatted as year-month-day hour:minute:second

Appliances: energy use in Wh [TARGET]

lights: energy use of light fixtures in the house, in Wh

T1: Temperature in kitchen area, in Celsius

RH_1: Humidity in kitchen area, in %

T2: Temperature in living room area, in Celsius

RH_2: Humidity in living room area, in %

T3: Temperature in laundry room area, in Celsius

RH_3: Humidity in laundry room area, in %

T4: Temperature in office room, in Celsius

RH_4: Humidity in office room, in %

T5: Temperature in bathroom, in Celsius

RH_5: Humidity in bathroom, in %

T6: Temperature outside the building (north side), in Celsius

RH_6: Humidity outside the building (north side), in %

T7: Temperature in ironing room, in Celsius

RH_7: Humidity in ironing room, in %

T8: Temperature in teenager room 2, in Celsius

RH_8: Humidity in teenager room 2, in %

T9: Temperature in parents room, in Celsius

RH_9: Humidity in parents room, in %

T_out: Temperature outside (from Chièvres weather station), in Celsius

Press_mm_hg: Pressure (from Chièvres weather station), in mm Hg

RH_out: Humidity outside (from Chièvres weather station), in %

Windspeed: Wind speed (from Chièvres weather station), in m/s

Visibility: Visibility (from Chièvres weather station), in km

Tdewpoint: Dew point temperature (from Chièvres weather station), in Celsius

rv1: Random variable 1, nondimensional

rv2: Random variable 2, nondimensional

### Check Unique Values for each variable.

In [None]:
# Checking Unique Values count for each variable.
for i in data.columns.tolist():
  print("The unique values in",i, "is",data[i].nunique(),".")

In [None]:
# Round the unique values to two decimal places
rounded_unique_values = data.apply(lambda x: set(round(val, 2) for val in x))

# Print the unique values for each feature
for feature, unique in rounded_unique_values.items():
    print(f'{feature}: {unique}')

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Separating columns:
temperature_column = [i for i in data.columns if "T" in i]
humidity_column = [i for i in data.columns if "RH" in i]
other = [i for i in data.columns if ("T" not in i) and ("RH" not in i)]

In [None]:
data[temperature_column].describe(include='all')

In [None]:
data[humidity_column].describe()

In [None]:
data[other].describe()

The first, second, and third quartiles of the lights column are all 0, which means that at least 75% of its values are 0.

**Presence of outliers in the other columns**

Looking at the statistics of the "other" columns, we can see that there are some outliers in Visibility, Windspeed, and Appliances.
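
One common way to make "some outliers" concrete is the 1.5 × IQR rule, which is also what box plots use for their whiskers. The sketch below applies it to a made-up series; `iqr_outlier_mask` is a hypothetical helper, and the values are illustrative, not taken from the dataset.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the first or third quartile."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy values (assumption): one reading is far above the rest.
toy = pd.Series([50, 60, 60, 70, 80, 90, 100, 1080], name="Appliances")
mask = iqr_outlier_mask(toy)

print("Outliers:", toy[mask].tolist())
print("Share of outliers:", mask.mean())
```

Applied column by column, the same mask gives a per-feature outlier count that can guide whether to cap, transform, or keep the extreme values.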

In [None]:
# Counting values of the "lights" column:
data['lights'].value_counts(normalize=True)

About 77% of the values in the lights column are 0, so it carries little information for prediction, and we drop this column.

In [None]:
# Dropping the lights column:
data.drop(columns='lights', inplace=True)

In [None]:
#examining the outlier in dataset
# Assuming 'data' is your DataFrame
num_columns = len(data.columns)
fig, axes = plt.subplots(nrows=num_columns, figsize=(8, num_columns*4))

for i, column in enumerate(data.columns):
    data.boxplot(column=column, ax=axes[i])
    axes[i].set_title(f'Box Plot for {column}')
    axes[i].set_xlabel('Column')
    axes[i].set_ylabel('Values')

plt.tight_layout()
plt.show()

In [None]:
# A closer look at three columns
fig_sub = make_subplots(rows=1, cols=3, shared_yaxes=False)

fig_sub.add_trace(go.Box(y=data['Appliances'].values,name='Appliances'),row=1, col=1)
fig_sub.add_trace(go.Box(y=data['Windspeed'].values,name='Windspeed'),row=1, col=2)
fig_sub.add_trace(go.Box(y=data['Visibility'].values,name='Visibility'),row=1, col=3)
fig_sub.show()



In [None]:
#AUTOEDA
!pip install sweetviz
import sweetviz as sv
sweet_report = sv.analyze(data)
sweet_report.show_html('sweet_report.html')

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
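# Creating new calendar features from the DatetimeIndex
data['month'] = data.index.month
data['weekday'] = data.index.weekday
data['hour'] = data.index.hour
# DatetimeIndex.week was removed in pandas 2.0; isocalendar().week is the replacement
data['week'] = data.index.isocalendar().week.astype(int)
data['day'] = data.index.day
data['day_of_week'] = data.index.dayofweek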

In [None]:
# Create a pivot table of mean appliance energy use by day of month (rows) and month (columns)
daily_energy = data.pivot_table(values='Appliances', index='day', columns='month', aggfunc='mean')

# Create a heatmap using the pivot table
plt.figure(figsize=(10, 5))
plt.title('Daily Energy Consumption')
plt.xlabel('Month')
plt.ylabel('Day')
plt.imshow(daily_energy, cmap='YlGnBu', aspect='auto')
plt.colorbar(label='Energy Consumption')
plt.xticks(range(0,5), ['Jan', 'Feb', 'Mar', 'Apr', 'May'])
# imshow places rows at positions 0..n-1, so label each position with its day of month
plt.yticks(range(len(daily_energy.index)), daily_energy.index)
plt.show()


##### 1. Why did you pick the specific chart?

I chose this heatmap to see how the average appliance energy consumption varies across the days of each month.


##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Map the day of the week values to their respective names
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
data['day_of_week'] = data['day_of_week'].map(lambda x: day_names[x])

# Create a box plot or violin plot to compare energy consumption across different days of the week
plt.figure(figsize=(10, 6))
sns.boxplot(x='day_of_week', y='Appliances', data=data, order=day_names)  # or sns.violinplot()
plt.title('Appliance Energy Consumption by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Energy Consumption')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Create a line plot to show the trend of energy consumption over time
plt.figure(figsize=(15, 6))
plt.plot(data.index, data['Appliances'])
plt.title('Energy Consumption of Appliances Over Time')
plt.xlabel('Date')
plt.ylabel('Energy Consumption')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Visualizing distributions using Histograms:
data.hist(figsize=(17, 20), grid=False);

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# numeric_only=True skips 'day_of_week', which holds day names after Chart - 2
correlation_matrix = data.corr(numeric_only=True)
plt.figure(figsize=(18, 15))
sns.heatmap(correlation_matrix, annot=True, cmap="RdYlGn")
plt.title("Correlation Matrix Heatmap")
plt.show()

In [None]:
# Reorder the columns so that related features sit next to each other in the heatmap
desired_order = ["T1","T2","T3","T4","T5","T6","T7","T8","T9","T_out","Tdewpoint","RH_1","RH_2","RH_3","RH_4","RH_7","RH_8","RH_9","RH_6","RH_5","RH_out","Press_mm_hg",
                "Windspeed","Visibility","rv1", "rv2","Appliances"]
# Assign the reordered DataFrame to new_data
new_data = data.reindex(columns=desired_order)

In [None]:
# Correlation Heatmap visualization code
correlation_matrix = new_data.corr()
plt.figure(figsize=(18, 15))
sns.heatmap(correlation_matrix, annot=True, cmap="RdYlGn")
plt.title("Correlation Matrix Heatmap")
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Get the list of column names in your dataset
columns = data.columns

# One subplot per column, stacked vertically
num_rows = len(columns)
num_cols = 1

# Create subplots with specified number of rows and columns
fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(10, 80))

# Scatter each column against "Appliances" (the "Appliances" panel plots the target against itself)
for i, column in enumerate(columns):
    sns.scatterplot(data=data, x="Appliances", y=column, ax=axes[i])
    axes[i].set_xlabel("Appliances")
    axes[i].set_ylabel(column)

# Adjust the spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

There is a strong presence of heteroscedasticity, which is commonly addressed with a log transformation of the target.
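
As a rough illustration of what a log transformation does to a right-skewed target, the sketch below uses a synthetic log-normal series as a stand-in for the Appliances column (an assumption, not the real data). `np.log1p` is used so that zero values would stay finite, and `np.expm1` maps predictions back to the original scale.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed stand-in for the target column (assumption).
appliances = pd.Series(rng.lognormal(mean=4.0, sigma=0.6, size=5000))

# log1p(x) = log(1 + x) compresses the long right tail.
log_appliances = np.log1p(appliances)

skew_before = appliances.skew()
skew_after = log_appliances.skew()

# expm1 is the exact inverse of log1p, so the original scale can be recovered.
restored = np.expm1(log_appliances)

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A model trained on the log-scale target needs its predictions passed through `np.expm1` before errors are reported in Wh.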

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant linear relationship between the independent variables and the appliance energy consumption.

Alternative Hypothesis (H1): There is a significant linear relationship between the independent variables and the appliance energy consumption.

#### 2. Perform an appropriate statistical test.

In [None]:
data.columns

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy.stats import pearsonr

# Separate the independent variables from the target; 'day_of_week' holds day names (strings), so it is excluded
column_to_drop = ['Appliances','day_of_week']
independent_variables = data.drop(column_to_drop, axis = 1)
dependent_variable = data['Appliances']

# Step 2: Perform the Correlation Test (Pearson correlation) for each feature against the target
correlation_coefficients, p_values = [], []
for feature in independent_variables.columns:
    correlation_coefficient, p_value = pearsonr(independent_variables[feature], dependent_variable)
    correlation_coefficients.append(correlation_coefficient)
    p_values.append(p_value)

# Step 3: Interpret the Results for each feature
alpha = 0.05  # Significance level (commonly set to 0.05)
for i, feature in enumerate(independent_variables.columns):
    print(f"Correlation Coefficient for '{feature}': {correlation_coefficients[i]:.4f}")
    print(f"P-value for '{feature}': {p_values[i]:.4f}")

    if p_values[i] < alpha:
        print("Result: The correlation is statistically significant (reject H0).\n")
    else:
        print("Result: There is no significant correlation (fail to reject H0).\n")


##### Which statistical test have you done to obtain P-Value?

The p-values above were obtained with the Pearson correlation test. The Pearson correlation coefficient, also known as Pearson's r, measures the strength and direction of the linear relationship between two continuous variables.

##### Why did you choose the specific statistical test?

The p-value obtained from the test indicates the probability of observing the calculated correlation coefficient (or a more extreme value) if the null hypothesis is true. The null hypothesis (H0) in this context states that there is no significant linear relationship between the two variables.

By comparing the p-value to a chosen significance level (alpha), commonly set to 0.05 (5%), we can determine whether to reject or fail to reject the null hypothesis. If the p-value is less than alpha, we reject the null hypothesis, suggesting a statistically significant correlation. If the p-value is greater than alpha, we fail to reject the null hypothesis, indicating no significant correlation.

This test is appropriate when you want to assess the strength and direction of the linear relationship between two continuous variables. It is commonly used to explore the association between variables in correlation analysis and is widely used in various fields of research and data analysis.
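
The decision rule can be seen on a small synthetic example. The variables below are made up for illustration: `related` is built from `x` plus noise, and `unrelated` is independent noise. The snippet also checks that `pearsonr` agrees with the textbook definition of r as computed by `np.corrcoef`.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

# Synthetic data (assumption): one variable depends on x, the other does not.
x = rng.normal(size=200)
related = 0.8 * x + rng.normal(scale=0.5, size=200)
unrelated = rng.normal(size=200)

alpha = 0.05  # significance level

r_rel, p_rel = pearsonr(x, related)
r_unrel, p_unrel = pearsonr(x, unrelated)

# pearsonr's coefficient matches the correlation matrix entry from NumPy.
r_manual = np.corrcoef(x, related)[0, 1]

for name, r, p in [("related", r_rel, p_rel), ("unrelated", r_unrel, p_unrel)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: r = {r:.3f}, p = {p:.4f} -> {decision}")
```

With tens of thousands of rows, even very weak correlations tend to come out statistically significant, so the size of r matters as much as the p-value.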


## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

Thankfully, there are no missing values in our dataset.

#### What all missing value imputation techniques have you used and why did you use those techniques?

We have not used any missing-value handling technique, as there are no NaN values in the dataset.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
data.info()

In [None]:
columns_to_drop = ['day_of_week','rv1','rv2','hour']
data.drop(columns_to_drop, axis=1, inplace=True)

##### Variance threshold

In [None]:
'''from sklearn.feature_selection import VarianceThreshold
var = VarianceThreshold(threshold=0.0)
var.fit(data)'''

In [None]:
'''var.get_support()'''

In [None]:
'''data.columns[var.get_support()]'''

In [None]:
'''constant_column = [column for column in data.columns if column not in data.columns[var.get_support()]]'''

In [None]:
'''print(len(constant_column))'''

In [None]:
'''for feature in constant_column:
  print(feature)'''

### Finding the skewed and symmetrical data

In [None]:
#examining the skewness in the dataset to check the distribution
skewness = data.skew()
print(skewness)

# Threshold on |skewness| used to separate roughly symmetrical from skewed features
skewness_threshold = 0.5

# Separate features into symmetrical and skewed based on skewness threshold
symmetrical_features = skewness[abs(skewness) < skewness_threshold].index
skewed_features = skewness[abs(skewness) >= skewness_threshold].index

#printing the features
print(symmetrical_features)
print(skewed_features)

# Create new DataFrames for symmetrical and skewed features
symmetrical_data = data[symmetrical_features]
skewed_data = data[skewed_features]


### 5. Data Transformation

In [None]:
# Drop the target from the skewed features; reassigning avoids modifying a slice of 'data' in place
skewed_data = skewed_data.drop(columns='Appliances')

In [None]:
skewed_data

In [None]:
# Import the libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

# Initialize the PowerTransformer (Yeo-Johnson by default, with standardization)
power_transformer = PowerTransformer()

# Fit and transform the skewed features.
# Note: fitting on the full dataset before the train/test split lets test rows influence the transform.
power_transformed = pd.DataFrame(power_transformer.fit_transform(skewed_data))
power_transformed.columns = skewed_data.columns


In [None]:
power_transformed

In [None]:
# Reset the index to the default integer index
symmetrical_data = symmetrical_data.reset_index(drop=True)

In [None]:
symmetrical_data

In [None]:
# Concatenate horizontally (along columns)
tranformed_data = pd.concat([symmetrical_data, power_transformed], axis=1)

In [None]:
tranformed_data

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Yes, the data needs transformation, especially the skewed features; I used a power transformation (PowerTransformer) to address this.
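
A leakage-free variant of the power transformation fits the transformer on the training rows only and reuses the learned parameters on the test rows. The sketch below shows this on synthetic exponential data (an assumption standing in for the skewed features); the small `skew` helper is hypothetical and only there to compare before and after.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(7)

# Synthetic right-skewed features (assumption: two exponential columns).
X = rng.exponential(scale=2.0, size=(1000, 2))
X_train, X_test = train_test_split(X, test_size=0.3, random_state=7)

pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_train_t = pt.fit_transform(X_train)  # lambdas are learned from training rows only
X_test_t = pt.transform(X_test)        # the same lambdas are reused on test rows


def skew(a: np.ndarray) -> np.ndarray:
    """Column-wise sample skewness (third standardized moment)."""
    centered = a - a.mean(axis=0)
    return (centered ** 3).mean(axis=0) / (centered ** 2).mean(axis=0) ** 1.5


print("lambdas:", pt.lambdas_)
print("skew before:", skew(X_train), "after:", skew(X_train_t))
```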

### 6. Scaling the DATA set

In [None]:
# Importing the required library
from sklearn.preprocessing import StandardScaler

# StandardScaler
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(tranformed_data))
scaled_data.columns = tranformed_data.columns
scaled_data

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
'''# DImensionality Reduction (If needed)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the data
#scaler = StandardScaler()
#scaled_data = scaler.fit_transform(new_data)

# Apply PCA
pca = PCA(n_components=2)  # Specify the number of co

principal_components = pca.fit_transform(scaled_data)

# Access the principal components and explained variance ratio
components = pca.components_
explained_variance_ratio = pca.explained_variance_ratio_'''


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
x = scaled_data
y = data['Appliances']

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=3)

70/30 Split: This ratio involves splitting the data into 70% for training and 30% for testing. It is a commonly used ratio when there is a sufficient amount of data available. The larger portion is used for training the model, while the smaller portion is used for evaluating its performance.
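
One way to keep the 70/30 split clean is to put the scaler inside a pipeline, so that its statistics are learned from the training portion only. The sketch below uses synthetic data; the feature count, coefficients, and noise level are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic features and target (assumption: 4 features, one of them irrelevant).
X = rng.normal(loc=20.0, scale=5.0, size=(1000, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=1000)

# 70% of rows for training, 30% for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=3
)

# fit() scales using training statistics; score() reuses them on the test rows.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
test_r2 = pipe.score(X_test, y_test)

scaler_mean = pipe.named_steps["standardscaler"].mean_
print(len(X_train), len(X_test), round(test_r2, 3))
```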


## ***7. ML Model Implementation***

### ML Model - 1 - Multiple Linear Regression Model

In [None]:
# Importing the model and the metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Creating and fitting the model
reg = LinearRegression()
reg.fit(x_train, y_train)

# R-squared on the training set
training_score = reg.score(x_train, y_train)

# Predicting on the test set
y_pred = reg.predict(x_test)

# Printing the training score
print("Train score:", training_score)

# Calculating the test-set error metrics
MSE = mean_squared_error(y_test, y_pred)
print("Test MSE :", MSE)

r2 = r2_score(y_test, y_pred)
print("Test R2 :", r2)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
sns.displot(y_pred - y_test,kind ='kde')

In [None]:
plt.figure(figsize=(8,5))
plt.plot(np.array(y_test))
plt.plot(y_pred)
# Legend order follows plotting order: y_test (actual) first, then y_pred (predicted)
plt.legend(["Actual","Predicted"])
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import RandomizedSearchCV

# Create a Linear Regression model (you can replace this with any other regression model)
model = LinearRegression()

# Define the hyperparameter search space.
# Note: copy_X only controls whether X is copied before fitting; it does not change the fitted model.
param_dist = {'fit_intercept': [True, False],
              'copy_X': [True, False],
              'positive': [True, False]}

# Perform RandomizedSearchCV; the space has only 2 * 2 * 2 = 8 combinations, so n_iter=8 covers all of them
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist,
                                   n_iter=8, scoring='neg_mean_squared_error', cv=5, n_jobs=-1)

# Fit the RandomizedSearchCV to find the best hyperparameters
random_search.fit(x_train, y_train)

# Get the best hyperparameters and model
# (best_estimator_ is already refit on the full training set because refit=True by default)
best_params = random_search.best_params_
best_model = random_search.best_estimator_

# Evaluate the best model on the test set
test_predictions = best_model.predict(x_test)

# Calculate evaluation metrics for the test predictions
mse = mean_squared_error(y_test, test_predictions)
r2 = r2_score(y_test, test_predictions)

print("Best Hyperparameters:", best_params)
print("Test MSE:", mse)
print("Test R2:", r2)


In [None]:
best_model.score(x_train, y_train)

In [None]:
sns.displot(test_predictions - y_test,kind ='kde')

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2 - Polynomial Regression model


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
data.info()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


X = data.drop(columns=['Appliances'])
y = data['Appliances']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Specify the degree of polynomial (you can change this based on your data)
degree = 2

# Create polynomial features
poly_features = PolynomialFeatures(degree=degree)
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Create a Linear Regression model
model = LinearRegression()

# Train the model using the polynomial features
model.fit(X_train_poly, y_train)

# Make predictions on the training and test data
train_predictions = model.predict(X_train_poly)
test_predictions = model.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)

print("Polynomial Regression (Degree {}):".format(degree))
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

# Plot the actual vs. predicted values for the test set
plt.scatter(y_test, test_predictions)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted values (Test Set)")
plt.show()


#### Let's try the same model with degree = 3 to see whether the fit improves
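Before raising the degree, it helps to estimate how many columns `PolynomialFeatures` will produce. With its default settings (`include_bias=True`, `interaction_only=False`), `n` input features expand to C(n + d, d) columns at degree `d`. The sketch below uses only the standard library. The feature count of 33 and the names `poly_feature_count` and `n_input_features` are assumptions for illustration; substitute `X.shape[1]` from the actual data.

```python
from math import comb


def poly_feature_count(n_features: int, degree: int) -> int:
    """Number of monomials of total degree <= `degree` in `n_features` variables, bias included."""
    return comb(n_features + degree, degree)


n_input_features = 33  # assumed for illustration; use X.shape[1] in practice
for d in (1, 2, 3):
    print(f"degree={d}: {poly_feature_count(n_input_features, d)} columns")
```

Under this assumption the degree 3 design matrix is twelve times wider than the degree 2 one, which makes an unregularized fit slower and more prone to overfitting.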

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


X = data.drop(columns=['Appliances'],axis= 1)
y = data['Appliances']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Specify the degree of polynomial (you can change this based on your data)
degree = 3

# Create polynomial features
poly_features = PolynomialFeatures(degree=degree)
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Create a Linear Regression model
model = LinearRegression()

# Train the model using the polynomial features
model.fit(X_train_poly, y_train)

# Make predictions on the training and test data
train_predictions = model.predict(X_train_poly)
test_predictions = model.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)

print("Polynomial Regression (Degree {}):".format(degree))
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

# Plot the actual vs. predicted values for the test set
plt.scatter(y_test, test_predictions)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted values (Test Set)")
plt.show()


##### Have you seen any improvement? Note down the improvement with the updated Evaluation Metric Score Chart.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3 - RIDGE Regression Model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import  Ridge
from sklearn.metrics import mean_squared_error, r2_score


X = data.drop(columns=['Appliances'],axis= 1)
y = data['Appliances']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Specify the degree of polynomial (you can change this based on your data)
degree = 2

# Create polynomial features
poly_features = PolynomialFeatures(degree=degree)
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Create a Ridge Regression model (L2-regularized linear regression)
ridge_model = Ridge(alpha=1.0)

# Train the model using the polynomial features
ridge_model.fit(X_train_poly, y_train)

# Make predictions on the training and test data
train_predictions = ridge_model.predict(X_train_poly)
test_predictions = ridge_model.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)

print("Ridge Regression on Polynomial Features (Degree {}):".format(degree))
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

# Plot the actual vs. predicted values for the test set
plt.scatter(y_test, test_predictions)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted values (Test Set)")
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning
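The search in this section tunes only `alpha` on features that were already expanded. If the polynomial degree should be tuned as well, one option is to wrap the expansion and the model in a `Pipeline` so that both are re-fit inside every cross-validation fold. The following is a minimal sketch on synthetic data, not the notebook's dataset; `X_demo`, `y_demo`, the step names, and the added `StandardScaler` step are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: the target depends quadratically on the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Expansion, scaling, and regression are chained so each CV fold fits them together
pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])

# Step-name prefixes route each parameter to the right pipeline step
param_grid = {
    "poly__degree": [1, 2, 3],
    "ridge__alpha": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)
best_params_demo = search.best_params_
print(best_params_demo)
```

Because the pipeline receives the raw features, the same `search` object can later call `predict` on raw test features without a separate `transform` step.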


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

# Create a Ridge Regression model
ridge_model = Ridge()

# Perform Cross-Validation and Hyperparameter Tuning
param_grid = {'alpha': [0.1, 1.0, 10.0]}  # Define the hyperparameter grid

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=ridge_model, param_grid=param_grid,
                           scoring='neg_mean_squared_error', cv=5)

# Fit the GridSearchCV to find the best alpha (the degree is fixed by X_train_poly)
grid_search.fit(X_train_poly, y_train)

# Get the best alpha from the GridSearchCV results
best_alpha = grid_search.best_params_['alpha']
best_model = grid_search.best_estimator_

# Make predictions on the training and test data
train_predictions = best_model.predict(X_train_poly)
test_predictions = best_model.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)

print("Best Alpha:", best_alpha)
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

# Plot the actual vs. predicted values for the test set
plt.scatter(y_test, test_predictions)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted values (Test Set)")
plt.show()


### ML Model - 4 - Lasso Regression Model
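Lasso differs from Ridge in that its L1 penalty can drive coefficients to exactly zero, which acts as a form of feature selection. The sketch below illustrates this on synthetic data where only two of ten features matter; `X_demo`, `y_demo`, and the chosen `alpha` values are assumptions for illustration and are unrelated to the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of ten features influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.1).fit(X_demo, y_demo)

# Count coefficients that were shrunk to exactly zero by each model
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
print("Exactly-zero coefficients: Lasso =", n_zero_lasso, ", Ridge =", n_zero_ridge)
```

On wide polynomial feature matrices this sparsity can make the fitted model easier to inspect, at the cost of some bias in the surviving coefficients.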

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import  Lasso
from sklearn.metrics import mean_squared_error, r2_score


X = data.drop(columns=['Appliances'],axis= 1)
y = data['Appliances']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Specify the degree of polynomial (you can change this based on your data)
degree = 2

# Create polynomial features
poly_features = PolynomialFeatures(degree=degree)
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Create a Lasso Regression model (L1-regularized linear regression)
lasso_model = Lasso(alpha=1.0)

# Train the model using the polynomial features
lasso_model.fit(X_train_poly, y_train)

# Make predictions on the training and test data
train_predictions = lasso_model.predict(X_train_poly)
test_predictions = lasso_model.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)

print("Lasso Regression on Polynomial Features (Degree {}):".format(degree))
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

# Plot the actual vs. predicted values for the test set
plt.scatter(y_test, test_predictions)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted values (Test Set)")
plt.show()


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

# Create a Lasso Regression model
lasso_model = Lasso()

# Perform Cross-Validation and Hyperparameter Tuning
param_grid = {'alpha': [0.1, 1.0, 10.0]}  # Define the hyperparameter grid

# Create the GridSearchCV object (tune the Lasso model, not the earlier Ridge model)
grid_search = GridSearchCV(estimator=lasso_model, param_grid=param_grid,
                           scoring='neg_mean_squared_error', cv=5)

# Fit the GridSearchCV to find the best alpha
grid_search.fit(X_train_poly, y_train)

# Get the best alpha from the GridSearchCV results
best_alpha = grid_search.best_params_['alpha']
best_model = grid_search.best_estimator_

# Make predictions on the training and test data
train_predictions = best_model.predict(X_train_poly)
test_predictions = best_model.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)

print("Best Alpha:", best_alpha)
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

# Plot the actual vs. predicted values for the test set
plt.scatter(y_test, test_predictions)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted values (Test Set)")
plt.show()


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import  Lasso
from sklearn.metrics import mean_squared_error, r2_score


X = data.drop(columns=['Appliances'],axis= 1)
y = data['Appliances']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Specify the degree of polynomial (you can change this based on your data)
degree = 3

# Create polynomial features
poly_features = PolynomialFeatures(degree=degree)
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Create a Lasso Regression model (L1-regularized linear regression)
lasso_model = Lasso(alpha=1.0)

# Train the model using the polynomial features
lasso_model.fit(X_train_poly, y_train)

# Make predictions on the training and test data
train_predictions = lasso_model.predict(X_train_poly)
test_predictions = lasso_model.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)

print("Lasso Regression on Polynomial Features (Degree {}):".format(degree))
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

# Plot the actual vs. predicted values for the test set
plt.scatter(y_test, test_predictions)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted values (Test Set)")
plt.show()


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

# Create a Lasso Regression model
lasso_model = Lasso()

# Perform Cross-Validation and Hyperparameter Tuning
param_grid = {'alpha': [0.1, 1.0, 10.0]}  # Define the hyperparameter grid

# Create the GridSearchCV object (tune the Lasso model, not the earlier Ridge model)
grid_search = GridSearchCV(estimator=lasso_model, param_grid=param_grid,
                           scoring='neg_mean_squared_error', cv=5)

# Fit the GridSearchCV to find the best alpha
grid_search.fit(X_train_poly, y_train)

# Get the best alpha from the GridSearchCV results
best_alpha = grid_search.best_params_['alpha']
best_model = grid_search.best_estimator_

# Make predictions on the training and test data
train_predictions = best_model.predict(X_train_poly)
test_predictions = best_model.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)

print("Best Alpha:", best_alpha)
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

# Plot the actual vs. predicted values for the test set
plt.scatter(y_test, test_predictions)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted values (Test Set)")
plt.show()


### ML Model - 5 - Elastic Net Regression Model
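For reference, the objective that scikit-learn's `ElasticNet` minimizes combines the Lasso and Ridge penalties. In the formula below, $n$ is the number of training samples, $w$ the coefficient vector, $\alpha$ the `alpha` parameter, and $\rho$ the `l1_ratio` parameter (0.5 by default); the symbol names are chosen here for illustration.

```latex
$$
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
\;+\; \alpha\,\rho\,\lVert w \rVert_1
\;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
$$
```

Setting $\rho = 1$ recovers the Lasso penalty, while $\rho = 0$ leaves only the squared L2 term. A search over `alpha` alone therefore explores just one slice of the model family; adding `l1_ratio` to the grid would cover the mix between the two penalties.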

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import  ElasticNet
from sklearn.metrics import mean_squared_error, r2_score


X = data.drop(columns=['Appliances'],axis= 1)
y = data['Appliances']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Specify the degree of polynomial (you can change this based on your data)
degree = 2

# Create polynomial features
poly_features = PolynomialFeatures(degree=degree)
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Create an Elastic Net model (combined L1 and L2 regularization)
ElasticNet_model = ElasticNet(alpha=1.0)

# Train the model using the polynomial features
ElasticNet_model.fit(X_train_poly, y_train)

# Make predictions on the training and test data
train_predictions = ElasticNet_model.predict(X_train_poly)
test_predictions = ElasticNet_model.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)

print("Elastic Net Regression on Polynomial Features (Degree {}):".format(degree))
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

# Plot the actual vs. predicted values for the test set
plt.scatter(y_test, test_predictions)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted values (Test Set)")
plt.show()


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

# Create an Elastic Net model
ElasticNet_model = ElasticNet()

# Perform Cross-Validation and Hyperparameter Tuning
param_grid = {'alpha': [0.1, 1.0, 10.0]}  # Define the hyperparameter grid

# Create the GridSearchCV object (tune the Elastic Net model, not the earlier Ridge model)
grid_search = GridSearchCV(estimator=ElasticNet_model, param_grid=param_grid,
                           scoring='neg_mean_squared_error', cv=5)

# Fit the GridSearchCV to find the best alpha
grid_search.fit(X_train_poly, y_train)

# Get the best alpha from the GridSearchCV results
best_alpha = grid_search.best_params_['alpha']
best_model = grid_search.best_estimator_

# Make predictions on the training and test data
train_predictions = best_model.predict(X_train_poly)
test_predictions = best_model.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)

print("Best Alpha:", best_alpha)
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

# Plot the actual vs. predicted values for the test set
plt.scatter(y_test, test_predictions)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted values (Test Set)")
plt.show()


In [None]:
'''from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import  ElasticNet
from sklearn.metrics import mean_squared_error, r2_score


X = data.drop(columns=['Appliances'],axis= 1)
y = data['Appliances']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Specify the degree of polynomial (you can change this based on your data)
degree = 3

# Create polynomial features
poly_features = PolynomialFeatures(degree=degree)
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Create a Linear Regression model
ElasticNet_model = ElasticNet(alpha=1.0)

# Train the model using the polynomial features
ElasticNet_model.fit(X_train_poly, y_train)

# Make predictions on the training and test data
train_predictions = ElasticNet_model.predict(X_train_poly)
test_predictions = ElasticNet_model.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)

print("Polynomial Regression (Degree {}):".format(degree))
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

# Plot the actual vs. predicted values for the test set
plt.scatter(y_test, test_predictions)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted values (Test Set)")
plt.show()
'''

In [None]:
'''from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

# Create an Elastic Net model
ElasticNet_model = ElasticNet()

# Perform Cross-Validation and Hyperparameter Tuning
param_grid = {'alpha': [0.1, 1.0, 10.0]}  # Define the hyperparameter grid

# Create the GridSearchCV object (tune the Elastic Net model, not the earlier Ridge model)
grid_search = GridSearchCV(estimator=ElasticNet_model, param_grid=param_grid,
                           scoring='neg_mean_squared_error', cv=5)

# Fit the GridSearchCV to find the best alpha
grid_search.fit(X_train_poly, y_train)

# Get the best alpha from the GridSearchCV results
best_alpha = grid_search.best_params_['alpha']
best_model = grid_search.best_estimator_

# Make predictions on the training and test data
train_predictions = best_model.predict(X_train_poly)
test_predictions = best_model.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)

print("Best Alpha:", best_alpha)
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

# Plot the actual vs. predicted values for the test set
plt.scatter(y_test, test_predictions)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted values (Test Set)")
plt.show()'''


### Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Create a Random Forest Regressor model
rf_model = RandomForestRegressor(n_estimators=20, random_state=42)

# Train the model
rf_model.fit(X_train_poly, y_train)

# Make predictions on the training and test data
train_predictions_rf = rf_model.predict(X_train_poly)
test_predictions_rf = rf_model.predict(X_test_poly)

# Evaluate the model
train_mse_rf = mean_squared_error(y_train, train_predictions_rf)
test_mse_rf = mean_squared_error(y_test, test_predictions_rf)

train_r2_rf = r2_score(y_train, train_predictions_rf)
test_r2_rf = r2_score(y_test, test_predictions_rf)

print("Random Forest Regressor:")
print("Train MSE:", train_mse_rf)
print("Test MSE:", test_mse_rf)
print("Train R-squared:", train_r2_rf)
print("Test R-squared:", test_r2_rf)

# Plot the actual vs. predicted values for the test set
plt.scatter(y_test, test_predictions_rf)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted values (Test Set)")
plt.show()


In [None]:
'''from sklearn.ensemble import RandomForestRegressor

# Create a Random Forest Regressor model
random_forest_model = RandomForestRegressor()

# Perform Cross-Validation and Hyperparameter Tuning
param_grid = {'n_estimators': [10, 20],
              'max_depth': [None, 10, 20]}  # Define the hyperparameter grid

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=random_forest_model, param_grid=param_grid,
                           scoring='neg_mean_squared_error', cv=5)

# Fit the GridSearchCV to find the best hyperparameters
grid_search.fit(X_train_poly, y_train)

# Get the best hyperparameters from the GridSearchCV results
best_n_estimators = grid_search.best_params_['n_estimators']
best_max_depth = grid_search.best_params_['max_depth']
best_model = grid_search.best_estimator_

# Make predictions on the training and test data
train_predictions = best_model.predict(X_train_poly)
test_predictions = best_model.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)

print("Best n_estimators:", best_n_estimators)
print("Best max_depth:", best_max_depth)
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

# Plot the actual vs. predicted values for the test set
plt.scatter(y_test, test_predictions)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted values (Test Set)")
plt.show()'''


### GRADIENT BOOSTING

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

# Create a Gradient Boosting Regressor model
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
gb_model.fit(X_train_poly, y_train)

# Make predictions on the training and test data
train_predictions_gb = gb_model.predict(X_train_poly)
test_predictions_gb = gb_model.predict(X_test_poly)

# Evaluate the model
train_mse_gb = mean_squared_error(y_train, train_predictions_gb)
test_mse_gb = mean_squared_error(y_test, test_predictions_gb)

train_r2_gb = r2_score(y_train, train_predictions_gb)
test_r2_gb = r2_score(y_test, test_predictions_gb)

print("Gradient Boosting Regressor:")
print("Train MSE:", train_mse_gb)
print("Test MSE:", test_mse_gb)
print("Train R-squared:", train_r2_gb)
print("Test R-squared:", test_r2_gb)

# Plot the actual vs. predicted values for the test set
plt.scatter(y_test, test_predictions_gb)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted values (Test Set)")
plt.show()


### XGBOOST

In [None]:
import xgboost as xgb

# Create an XGBoost Regressor model
xgb_model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
xgb_model.fit(X_train_poly, y_train)

# Make predictions on the training and test data
train_predictions_xgb = xgb_model.predict(X_train_poly)
test_predictions_xgb = xgb_model.predict(X_test_poly)

# Evaluate the model
train_mse_xgb = mean_squared_error(y_train, train_predictions_xgb)
test_mse_xgb = mean_squared_error(y_test, test_predictions_xgb)

train_r2_xgb = r2_score(y_train, train_predictions_xgb)
test_r2_xgb = r2_score(y_test, test_predictions_xgb)

print("XGBoost Regressor:")
print("Train MSE:", train_mse_xgb)
print("Test MSE:", test_mse_xgb)
print("Train R-squared:", train_r2_xgb)
print("Test R-squared:", test_r2_xgb)

# Plot the actual vs. predicted values for the test set
plt.scatter(y_test, test_predictions_xgb)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'k--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted values (Test Set)")
plt.show()


### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.
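As a reference for this question, the metrics printed throughout the notebook can be computed by hand. The sketch below uses only the standard library; the function name `regression_metrics` and the four sample values are hypothetical and serve only to show how each number is derived.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R-squared for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n   # penalizes large errors quadratically
    mae = sum(abs(e) for e in errors) / n  # average absolute miss, in target units
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot          # share of variance explained
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


metrics_demo = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(metrics_demo)
```

RMSE and MAE are expressed in the same units as the target, which usually makes them easier to communicate than MSE, while R-squared is unitless and compares the model against always predicting the mean.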

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.
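One way to approach this question with scikit-learn alone is to compare a tree ensemble's impurity-based importances with permutation importance. The following is a minimal sketch on synthetic data in which only the first feature drives the target; `X_demo`, `y_demo`, and `forest` are assumptions for illustration, and it does not reflect which model or tool the notebook's author chose.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: only feature 0 influences the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances: derived from the training data, normalized to sum to 1
impurity_importance = forest.feature_importances_

# Permutation importance: drop in score when a single column is shuffled
perm = permutation_importance(forest, X_demo, y_demo, n_repeats=5, random_state=0)

ranking = np.argsort(perm.importances_mean)[::-1]
print("Features ranked by permutation importance:", ranking)
```

In practice, permutation importance is more informative when computed on held-out data, since it then reflects what the model relies on to generalize rather than what it memorized.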

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Create a Random Forest Regressor object
rf_regressor = RandomForestRegressor(n_estimators=100, max_features='sqrt')

# Fit the Random Forest model to the training data
rf_regressor.fit(x_train, y_train)

# Make predictions on the test data
predictions = rf_regressor.predict(x_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Mean Squared Error:", mse)
print("R-squared:", r2)
rf_regressor.score(x_train, y_train)

In [None]:
rf_regressor.score(x_train, y_train)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Flatten y_train and y_test into 1-dimensional arrays before fitting,
# in case they are column vectors (scikit-learn expects a 1-D target here)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

# Create a Random Forest Regressor object
rf_regressor = RandomForestRegressor(n_estimators=100, max_features='sqrt')

# Fit the Random Forest model to the training data
rf_regressor.fit(x_train, y_train)

# Make predictions on the test data
predictions = rf_regressor.predict(x_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Mean Squared Error:", mse)
print("R-squared:", r2)


In [None]:
from sklearn.ensemble import AdaBoostRegressor

# Create an AdaBoostRegressor object
ada_regressor = AdaBoostRegressor(n_estimators=100, learning_rate=0.1)

# Fit the AdaBoost model to the training data
ada_regressor.fit(x_train, y_train)

# Make predictions on the test data
predictions = ada_regressor.predict(x_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Create a GradientBoostingRegressor object
gb_regressor = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1)

# Fit the Gradient Boosting model to the training data
gb_regressor.fit(x_train, y_train)

# Make predictions on the test data
predictions = gb_regressor.predict(x_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

In [None]:
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

# Create an SVR object
svm_regressor = SVR(kernel='rbf', C=1.0, epsilon=0.1)

# Fit the SVR model to the training data
svm_regressor.fit(x_train, y_train)

# Make predictions on the test data
predictions = svm_regressor.predict(x_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

In [None]:
scaled_data.shape

#### DECISION TREE

In [None]:
x = scaled_data.iloc[:,1:34]
y = scaled_data.iloc[:,0:1]

In [None]:
# Train-test split (lowercase names match the x_train/x_test used in the cells below)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

In [None]:
from sklearn.tree import DecisionTreeRegressor

treemodel = DecisionTreeRegressor()

treemodel.fit(x_train,y_train)

train_score = treemodel.score(x_train,y_train)

#prediction stage
y_pred = treemodel.predict(x_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
print(train_score)

In [None]:
#cross validation
parameter = { 'criterion' :['squared_error', 'absolute_error'],
              'max_depth' : [1,2,3,4,5,6,7,8,9,10]}

from sklearn.model_selection import RandomizedSearchCV

treemodel =  DecisionTreeRegressor()
cv = RandomizedSearchCV(treemodel,param_distributions = parameter, scoring ='neg_mean_squared_error',cv = 5)

In [None]:
cv.fit(x_train,y_train)

In [None]:
cv.best_params_

In [None]:
cv.best_score_

In [None]:
cv.score(x_train,y_train)

In [None]:
y_pred = cv.predict(x_test)

In [None]:
# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

In [None]:
x = scaled_data.iloc[:,1:34]
y = scaled_data.iloc[:,0:1]

### Polynomial Regression


In [None]:
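from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, r2_score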


In [None]:
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(x_train)

In [None]:
model = LinearRegression()

In [None]:
k = 5  # Number of folds
scores = cross_val_score(model, X_poly, y_train, cv=k, scoring='neg_mean_squared_error')
mse_scores = -scores  # Convert negative mean squared error to positive

In [None]:
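# cross_val_score fits clones of the estimator, so fit the model itself before scoring or predicting
model.fit(X_poly, y_train)
model.score(X_poly, y_train)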

In [None]:
new_data_poly = poly.transform(x_test)
predicted_y = model.predict(new_data_poly)

In [None]:
# Evaluate the model performance
mse = mean_squared_error(y_test, predicted_y)
r2 = r2_score(y_test, predicted_y)

print("Mean Squared Error:", mse)
print("R-squared:", r2)


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Flatten y_train and y_test into 1-dimensional arrays before fitting,
# in case they are column vectors (scikit-learn expects a 1-D target here)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

# Create a Random Forest Regressor object
rf_regressor = RandomForestRegressor(n_estimators=100, max_features='sqrt')

# Fit the Random Forest model to the training data
rf_regressor.fit(x_train, y_train)

# Make predictions on the test data
predictions = rf_regressor.predict(x_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Mean Squared Error:", mse)
print("R-squared:", r2)


## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.
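A save-and-reload round trip can be sketched as follows. This is a minimal example that assumes `joblib` is installed alongside scikit-learn; the stand-in `LinearRegression` model, the synthetic arrays, and the file name `best_model.joblib` are hypothetical and should be replaced with the notebook's chosen model and real unseen rows.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in for the best model: a small regressor fitted on synthetic data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])
fitted = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_model.joblib")  # hypothetical file name
    joblib.dump(fitted, model_path)      # save the fitted estimator to disk
    restored = joblib.load(model_path)   # load it back as a new object

# Sanity check: the restored model must reproduce the original predictions
X_unseen = rng.normal(size=(5, 3))
same_predictions = np.allclose(fitted.predict(X_unseen), restored.predict(X_unseen))
print("Predictions match after reload:", same_predictions)
```

Any preprocessing fitted on the training data (for example a scaler or a `PolynomialFeatures` transformer) needs to be saved as well, or bundled with the model in a `Pipeline`, so that unseen data is transformed the same way at prediction time.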


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***