<a href="https://colab.research.google.com/github/IAMDSVSSANGRAL/applianceenergyprediction/blob/main/Appliance_energy_prediction_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - Regression
##### **Contribution**    - Team
##### **Team Member 1 -Samadhan**


# **Project Summary -**

Objective:
The objective of this project is to develop a regression model that accurately predicts the energy consumption of household appliances based on various input features. The model aims to provide insights into energy usage patterns and facilitate energy efficiency improvements in residential settings.

Data:
The project utilizes a dataset that contains information on household appliance energy consumption along with several relevant input features. The dataset includes variables such as temperature, humidity, time of day, and various appliance power readings. The data is collected over a specific time period and is representative of real-world residential energy usage scenarios.

Tasks:

Exploratory Data Analysis (EDA):

- Perform a thorough analysis of the dataset to understand the distribution, statistics, and relationships among variables.
- Identify any missing values, outliers, or data quality issues that need to be addressed.
- Visualize the data using appropriate charts and graphs to gain insights into the patterns and trends.

Data Preprocessing:

- Handle missing values by applying suitable imputation techniques or deciding on appropriate strategies for dealing with them.
- Address outliers and anomalies by considering various methods such as removal, transformation, or capping.
- Normalize or scale the data if necessary to ensure all features are on a similar scale.

Feature Engineering:

- Explore the relationships between the input features and the target variable (appliance energy consumption) to identify potential feature engineering opportunities.
- Create new features, derive meaningful variables, or transform existing variables to capture important patterns or interactions in the data.

Model Development:

- Split the dataset into training and testing sets for model development and evaluation.
- Select an appropriate regression algorithm (e.g., linear regression, decision tree regression, random forest regression) based on the project requirements and characteristics of the data.
- Train the model using the training data and tune hyperparameters to optimize performance.
- Evaluate the model's performance using various metrics such as mean squared error (MSE), mean absolute error (MAE), and R-squared.

Model Evaluation and Interpretation:

- Assess the model's performance on the testing data to measure its ability to generalize to unseen data.
- Interpret the model's coefficients or feature importance to gain insights into the factors that have the most significant impact on appliance energy consumption.
- Validate the model's predictions against domain knowledge or external benchmarks to ensure its reliability and usefulness.

Model Deployment and Recommendations:

- Deploy the trained model into a production environment or create a user-friendly interface for stakeholders to interact with the model.
- Provide recommendations based on the model's predictions and insights to improve energy efficiency, optimize appliance usage, or suggest modifications in residential settings.

Conclusion:
The Appliance Energy Prediction regression project aims to develop a robust regression model to accurately predict household appliance energy consumption. By analyzing and understanding the data, performing feature engineering, and building an effective regression model, the project provides valuable insights and recommendations for optimizing energy usage and promoting energy-efficient practices in residential settings.
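
The model development and evaluation steps listed in the summary can be illustrated with a minimal sketch. It uses synthetic stand-in data and a plain least-squares fit with NumPy, both of which are assumptions made for illustration; the helper names `add_intercept` and `regression_metrics` are hypothetical and this is not the project's final model.

```python
import numpy as np

# Synthetic stand-in data (an assumption, not the project dataset):
# two features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.5, size=200)

# Hold out the last 20% of rows as a test set.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]


def add_intercept(features):
    """Prepend a column of ones so the fit includes an intercept term."""
    return np.column_stack([np.ones(len(features)), features])


# Ordinary least squares fit on the training rows only.
coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)
y_pred = add_intercept(X_test) @ coef


def regression_metrics(y_true, y_hat):
    """Return MSE, MAE and R-squared for a set of predictions."""
    residuals = y_true - y_hat
    mse = float(np.mean(residuals ** 2))
    mae = float(np.mean(np.abs(residuals)))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {"mse": mse, "mae": mae, "r2": 1.0 - ss_res / ss_tot}


metrics = regression_metrics(y_test, y_pred)
print(metrics)
```

MSE penalises large errors more heavily than MAE, while R-squared reports the share of target variance explained on the held-out rows, so the three are usually read together.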


# **GitHub Link -**

https://github.com/IAMDSVSSANGRAL/applianceenergyprediction

# **Problem Statement**



- **Data Source**: The dataset spans approximately 4.5 months and includes information collected at 10-minute intervals. It consists of data from a ZigBee wireless sensor network monitoring temperature and humidity in a house, energy consumption recorded by m-bus energy meters, and weather data from Chievres Airport, Belgium.

- **Data Averaging**: The wireless sensor network reports temperature and humidity every 3.3 minutes, but the data is averaged over 10-minute periods.

- **Objective**: The primary goal is to develop a machine learning model capable of accurately predicting energy usage based on the provided features.

- **Utility**: This predictive model has potential applications for building managers, energy companies, and policymakers. It can aid in optimizing energy consumption, reducing costs, and minimizing the environmental impact of energy usage.

- **Influence Factors**: The model aims to consider a range of influencing factors, including temperature, humidity, illumination, and time of day, all of which can impact energy consumption in a building.

- **Pattern and Trend Identification**: Building managers and energy firms can benefit from this model by identifying patterns and trends in energy consumption. This can help them make informed decisions, such as adjusting HVAC settings, optimizing lighting, or implementing energy-efficient solutions.

- **Policymaker Applications**: Policymakers can also leverage the insights from this model to develop regulations and incentives that promote energy efficiency and sustainability.

- **Random Variables**: The dataset includes random variables designed for testing regression models and filtering out non-predictive features.

- **Integration of External Data**: External weather data from Chievres Airport, Belgium, was integrated into the dataset using date and time columns, enhancing the model's ability to make energy usage forecasts.

- **Environmental Impact**: One of the broader goals is to contribute to reducing the environmental impact of energy usage through better management and decision-making.
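
The averaging step described under **Data Averaging** can be sketched with pandas. The readings below are hypothetical values spaced 200 seconds apart (roughly 3.3 minutes), and the use of `resample` is an assumption about how such averaging could be done, not a record of how the published dataset was built.

```python
import numpy as np
import pandas as pd

# Hypothetical raw sensor readings every 200 seconds (about 3.3 minutes).
raw_index = pd.date_range(
    "2016-01-11 17:00:00", periods=18, freq=pd.Timedelta(seconds=200)
)
raw = pd.DataFrame(
    {"T1": np.linspace(19.0, 20.7, 18), "RH_1": np.linspace(47.0, 45.3, 18)},
    index=raw_index,
)

# Average the readings that fall inside each 10-minute window.
averaged = raw.resample("10min").mean()
print(averaged)
```

With this spacing, each 10-minute window contains three raw readings, so one hour of readings collapses to six averaged rows.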

# **General Guidelines** : -

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.

     The additional credits will have advantages over other students during Star Student selection.

             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.


```
# Chart visualization code
```


*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
from datetime import datetime as dt

# Import Data Visualisation Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as pl
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from pandas.plotting import scatter_matrix
%matplotlib inline

# Set the plot style and display options
plt.style.use('ggplot')
sns.set()

# To display all the columns in Dataframe
pd.set_option('display.max_columns', None)
# Import Library to visualise missing data
import missingno as mno

# Import and Ignore warnings for better code readability,
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#importing the data set
data_raw = pd.read_csv('/content/drive/MyDrive/Santa/Regression capstone/data_application_energy.csv')

In [None]:
#creating a copy of data set
data = data_raw.copy()

### Dataset First View

In [None]:
# Dataset First Look
data.head()

In [None]:
# Dataset Rows & Columns count
num_rows, num_cols = data.shape

print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

### Dataset Information

In [None]:
# Dataset Info
data.info()

In [None]:
# Convert the 'date' column from string to datetime
data['date'] = pd.to_datetime(data['date'])

In [None]:
# Setting date as the index:
data.set_index('date', inplace=True)

#### Duplicate Values

In [None]:
# Collect any duplicated rows in a DataFrame named 'df'
df = data[data.duplicated()]

In [None]:
# An empty result means there are no duplicate rows in the data
df.head()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isna().sum()

In [None]:
# Visualizing the missing values
import missingno as msno
import matplotlib.pyplot as plt

# Plotting the null matrix
msno.matrix(data)

# Customizing the plot
plt.title('Null Matrix')
plt.show()


### What did you know about your dataset?

The data is loaded as a Pandas DataFrame with 29 columns and 19,735 rows. The main points from the initial inspection are:

1. **Data Size**: The dataset contains 19,735 data points, which is a significant amount of data.

2. **Data Types**: Most of the columns contain numerical data, with 26 columns having float64 data type and 2 columns with int64 data type. The 'date' column is read as an object (string) column before it is converted to datetime.

3. **Features**: The columns labeled 'T1,' 'T2,' 'T3,' etc., represent temperature measurements, while columns labeled 'RH_1,' 'RH_2,' 'RH_3,' etc., represent relative humidity measurements. 'Appliances' and 'lights' are integer columns, which might be related to energy consumption and lighting. Other columns have labels such as 'T_out' (outdoor temperature), 'Windspeed,' 'RH_out' (outdoor humidity), and more.

4. **Data Completeness**: There are no missing values (non-null) in any of the columns, which is a good sign for data quality.

5. **Memory Usage**: The dataset consumes 4.4+ MB of memory, which might be relevant for memory-constrained analyses.

6. **No Duplicate Values**: The duplicate check returns an empty DataFrame, so there are no duplicated rows in the data.
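
The checks summarised in this list can also be collected programmatically instead of being read off the printed output. The sketch below uses a hypothetical helper, `summarize_frame`, and a tiny illustrative frame rather than the project data.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Collect the basic facts usually read off df.info() and df.isna()."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Tiny illustrative frame (hypothetical values, not the project data).
# The last row repeats the first one, so one duplicate is expected.
example = pd.DataFrame(
    {
        "Appliances": [60, 60, 50, 60],
        "T1": [19.89, 19.89, 19.89, 19.89],
        "RH_1": [47.6, 46.7, 46.3, 47.6],
    }
)
summary = summarize_frame(example)
print(summary)
```

Calling `summarize_frame(data)` on the loaded dataset would give the same kind of summary in one dictionary.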


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe(include='all')

### Variables Description

**The observation data consists of the following variables:**


date: Timestamp in year-month-day hour:minute:second format

Appliances: Energy use in Wh [TARGET]

lights: Energy use of light fixtures in the house, in Wh

T1: Temperature in kitchen area, in Celsius

RH_1: Humidity in kitchen area, in %

T2: Temperature in living room area, in Celsius

RH_2: Humidity in living room area, in %

T3: Temperature in laundry room area, in Celsius

RH_3: Humidity in laundry room area, in %

T4: Temperature in office room, in Celsius

RH_4: Humidity in office room, in %

T5: Temperature in bathroom, in Celsius

RH_5: Humidity in bathroom, in %

T6: Temperature outside the building (north side), in Celsius

RH_6: Humidity outside the building (north side), in %

T7: Temperature in ironing room, in Celsius

RH_7: Humidity in ironing room, in %

T8: Temperature in teenager room 2, in Celsius

RH_8: Humidity in teenager room 2, in %

T9: Temperature in parents room, in Celsius

RH_9: Humidity in parents room, in %

T_out: Temperature outside (from Chièvres weather station), in Celsius

Press_mm_hg: Pressure (from Chièvres weather station), in mm Hg

RH_out: Humidity outside (from Chièvres weather station), in %

Windspeed: Wind speed (from Chièvres weather station), in m/s

Visibility: Visibility (from Chièvres weather station), in km

Tdewpoint: Dew point temperature (from Chièvres weather station), in Celsius

rv1: Random variable 1, nondimensional

rv2: Random variable 2, nondimensional

### Check Unique Values for each variable.

In [None]:
# Checking Unique Values count for each variable.
for i in data.columns.tolist():
  print("The unique values in",i, "is",data[i].nunique(),".")

In [None]:
# Round the unique values to two decimal places
rounded_unique_values = data.apply(lambda x: set(round(val, 2) for val in x))

# Print the unique values for each feature
for feature, unique in rounded_unique_values.items():
    print(f'{feature}: {unique}')

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Separating columns:
temperature_column = [i for i in data.columns if "T" in i]
humidity_column = [i for i in data.columns if "RH" in i]
other = [i for i in data.columns if ("T" not in i)&("RH" not in i)]

In [None]:
data[temperature_column].describe(include='all')

We can derive several insights regarding the temperature variables:

1. **Count**:
   - There are 19,735 data points for each of the temperature-related variables (T1, T2, T3, T4, T5, T6, T7, T8, T9, T_out, and Tdewpoint). This indicates that there are no missing values in these columns.

2. **Mean (Average)**:
   - The mean values differ between rooms. For example, "T3" has a mean of approximately 22.27°C and "T5" a mean of about 19.59°C. The outdoor variables (T6, T_out, and Tdewpoint) have noticeably lower means than the indoor rooms.

3. **Standard Deviation (std)**:
   - The standard deviations for the temperature-related variables range from approximately 1.61°C to 2.20°C. Variables like "T3" and "T4" have relatively low variability, while "T9" has slightly higher variability.

4. **Minimum (min)**:
   - The minimum values indicate the lower bounds of the temperature measurements. The indoor rooms stay well above the outdoor variables (T6, T_out, and Tdewpoint), which record the lowest minimums.

5. **25th Percentile (25%)**:
   - The 25th percentile values represent the lower quartile of the data. For example, the 25th percentile of "T2" is approximately 18.79°C.

6. **Median (50%)**:
   - The median values (50th percentile) represent the middle values of the dataset. For instance, the median temperature "T7" is approximately 20.03°C.

7. **75th Percentile (75%)**:
   - The 75th percentile values represent the upper quartile of the data. The 75th percentile of "T6" is approximately 11.26°C.

8. **Maximum (max)**:
   - The maximum values represent the upper bounds of the temperature measurements. For example, "T4" has a maximum of approximately 26.20°C and "T5" a maximum of about 25.79°C.
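
Statements about which variable has the highest or lowest statistic are easy to get wrong when read off a wide table by eye. A minimal sketch of ranking the `describe()` output instead is shown below; the three columns and their values are hypothetical, chosen only to make the result easy to follow.

```python
import pandas as pd

# Hypothetical readings for three columns (illustrative values only).
temps = pd.DataFrame(
    {
        "T1": [20.0, 21.0, 22.0, 23.0],
        "T6": [2.0, 6.0, 10.0, 14.0],
        "T_out": [1.0, 5.0, 9.0, 12.0],
    }
)

# Transpose describe() so each variable is a row, then sort by a statistic.
stats = temps.describe().T
by_mean = stats["mean"].sort_values(ascending=False)

print(by_mean)
print("highest mean:", by_mean.index[0], "| lowest mean:", by_mean.index[-1])
print("highest max:", stats["max"].idxmax(), "| lowest min:", stats["min"].idxmin())
```

The same pattern applied to `data[temperature_column]` would give the ranking for the real columns.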

In [None]:
data[humidity_column].describe()

We can derive several insights regarding the relative humidity (RH) variables:

1. **Count**:
   - There are 19,735 data points for each of the RH-related variables (RH_1, RH_2, RH_3, RH_4, RH_5, RH_6, RH_7, RH_8, RH_9, and RH_out). This indicates that there are no missing values in these columns.

2. **Mean (Average)**:
   - The mean values for the relative humidity variables vary across the columns. For example, "RH_5" has a mean of approximately 50.95% and "RH_7" a mean of around 35.39%, while "RH_out", which represents outdoor relative humidity, averages approximately 79.75%, higher than either.

3. **Standard Deviation (std)**:
   - The standard deviations for the relative humidity variables also vary. "RH_5" has a standard deviation of approximately 9.02, indicating relatively higher variability, while "RH_3" has a lower standard deviation of around 3.25.

4. **Minimum (min)**:
   - The minimum values for the relative humidity variables indicate the lower bounds of the humidity measurements. For example, "RH_6" has a minimum of approximately 1.00%, which suggests outliers at its lower bound, and "RH_out" has a minimum of 24.00%.

5. **25th Percentile (25%)**:
   - The 25th percentile values represent the lower quartile of the data. "RH_7" has a 25th percentile value of approximately 31.50%.

6. **Median (50%)**:
   - The median values (50th percentile) represent the middle values of the dataset. "RH_9" has a median relative humidity of approximately 40.90%.

7. **75th Percentile (75%)**:
   - The 75th percentile values represent the upper quartile of the data. "RH_4" has a 75th percentile value of approximately 42.16%.

8. **Maximum (max)**:
   - The maximum values represent the upper bounds of the relative humidity measurements. "RH_1" has a maximum of approximately 63.36%, while "RH_out" reaches 100.00%.
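
A very low minimum such as the one noted for "RH_6" can be checked against a simple rule before deciding how to treat it. The sketch below applies Tukey's IQR rule to hypothetical readings; the helper name `iqr_outlier_mask` and the multiplier of 1.5 are assumptions, not a method taken from this notebook.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical humidity readings with one suspiciously low value.
rh = pd.Series([1.0, 48.0, 50.0, 52.0, 54.0, 56.0, 58.0, 60.0], name="RH_6")
mask = iqr_outlier_mask(rh)
print(rh[mask])
```

Whether a flagged value is a sensor fault or a real reading still needs domain judgment; the rule only marks candidates.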



In [None]:
data[other].describe()

We can derive several insights regarding the variables Appliances, lights, Press_mm_hg, Windspeed, Visibility, rv1, and rv2:

1. **Appliances**:
   - The "Appliances" variable represents energy consumption related to appliances. The data ranges from a minimum of 10 to a maximum of 1080, with an average (mean) consumption of approximately 97.69. **The standard deviation is relatively high, indicating significant variability in appliance energy usage.**

2. **Lights**:
   - The "lights" variable shows energy consumption related to lighting. It varies from 0 to 70, with an average of approximately 3.80. The standard deviation suggests some variability in lighting energy consumption. **The 75th percentile is 0, meaning at least 75 percent of the readings are zero, which is unusual.**

3. **Press_mm_hg**:
   - "Press_mm_hg" represents atmospheric pressure. The pressure varies from 729.30 to 772.30, with an average of approximately 755.52. The data has relatively low variability.

4. **Windspeed**:
   - The "Windspeed" variable indicates wind speed and varies from 0 to 14. The average wind speed is about 4.04. The standard deviation suggests some variation in wind speed. **The maximum of 14 is far above the 75th percentile of 5.50, which suggests outliers at the upper end.**

5. **Visibility**:
   - "Visibility" represents the visibility in km. It ranges from 1 to 66, with an average of approximately 38.33. The data exhibits relatively **high variability**.

6. **rv1 and rv2**:
   - The columns "rv1" and "rv2" have identical statistics, suggesting that they are likely **highly correlated or identical features**. They have a minimum value of approximately 0.0053 and a maximum value of around 49.9965.
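
Matching summary statistics only hint that two columns are the same, so it is worth testing directly. The sketch below builds a hypothetical frame in which `rv2` is an exact copy of `rv1` and shows one way to confirm this before dropping the redundant column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which rv2 is an exact copy of rv1.
rng = np.random.default_rng(42)
values = rng.uniform(0, 50, size=100)
frame = pd.DataFrame(
    {
        "rv1": values,
        "rv2": values.copy(),
        "Appliances": rng.integers(10, 200, size=100),
    }
)

# equals() checks element-wise identity; corr() gives the Pearson correlation.
identical = frame["rv1"].equals(frame["rv2"])
correlation = frame["rv1"].corr(frame["rv2"])
print("identical:", identical, "| correlation:", round(correlation, 4))

# If the columns are identical, keeping one of them loses no information.
if identical:
    frame = frame.drop(columns="rv2")
```

Running the same two checks on `data['rv1']` and `data['rv2']` would settle whether the real columns are duplicates or merely similar.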

In [None]:
# Create a dictionary to map current column names to new column names
column_mapping = {'T1': 'KITCHEN_TEMP',
    'RH_1': 'KITCHEN_HUM',
    'T2': 'LIVING_TEMP',
    'RH_2' :'LIVING_HUM',
    'T3': 'BEDROOM_TEMP',
    'RH_3':'BEDROOM_HUM',
    'T4' : 'OFFICE_TEMP',
    'RH_4' : 'OFFICE_HUM',
    'T5' : 'BATHROOM_TEMP',
    'RH_5': 'BATHROOM_HUM',
    'T6':'OUTSIDE_TEMP_build',
    'RH_6': 'OUTSIDE_HUM_build',
    'T7': 'IRONING_ROOM_TEMP',
    'RH_7' : 'IRONING_ROOM_HUM',
    'T8' :'TEEN_ROOM_2_TEMP',
    'RH_8' : 'TEEN_ROOM_HUM',
    'T9': 'PARENTS_ROOM_TEMP',
    'RH_9': 'PARENTS_ROOM_HUM',
    'T_out' :'OUTSIDE_TEMP_wstn',
    'RH_out' :'OUTSIDE_HUM_wstn'}

# Rename the columns using the mapping
data.rename(columns=column_mapping, inplace=True)

In [None]:
data.head()

In [None]:
#creating new features
data['month'] = data.index.month
data['weekday'] = data.index.weekday
data['hour'] = data.index.hour
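# DatetimeIndex.week was removed in pandas 2.0; isocalendar() provides the ISO week number
data['week'] = data.index.isocalendar().week.astype(int).to_numpy()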
data['day'] = data.index.day
data['day_of_week'] = data.index.dayofweek

In [None]:
data.head(2)

In [None]:
# Counting values of the "lights" column:
data['lights'].value_counts(normalize=True)

About 77% of the values in the lights column are 0, so the column carries little information for prediction. We therefore drop it.

In [None]:
# Dropping the lights column:
data.drop(columns='lights', inplace=True)

In [None]:
# Reorder the columns for readability, with the target 'Appliances' last
desired_order = ['KITCHEN_TEMP','LIVING_TEMP','BEDROOM_TEMP','OFFICE_TEMP','BATHROOM_TEMP','OUTSIDE_TEMP_build','IRONING_ROOM_TEMP','TEEN_ROOM_2_TEMP','PARENTS_ROOM_TEMP','OUTSIDE_TEMP_wstn',
                 'KITCHEN_HUM','LIVING_HUM','BEDROOM_HUM','OFFICE_HUM','BATHROOM_HUM','OUTSIDE_HUM_build','IRONING_ROOM_HUM','TEEN_ROOM_HUM','PARENTS_ROOM_HUM','OUTSIDE_HUM_wstn',
                 "Tdewpoint","Press_mm_hg","Windspeed","Visibility","rv1", "rv2",'month','weekday','hour','week','day','day_of_week',"Appliances"]
# Reassign the reordered DataFrame to 'data'
data = data.reindex(columns=desired_order)

In [None]:
data.tail(2)

In [None]:
#AUTOEDA
!pip install sweetviz
import sweetviz as sv
sweet_report = sv.analyze(data)
sweet_report.show_html('sweet_report.html')

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Create a pivot table to aggregate the daily energy consumption
daily_energy = data.pivot_table(values='Appliances', index='day', columns='month', aggfunc = 'mean')

# Create a heatmap using the pivot table
plt.figure(figsize=(10, 5))
plt.title('Daily Energy Consumption')
plt.xlabel('Month')
plt.ylabel('Day')
plt.imshow(daily_energy, cmap='YlGnBu', aspect='auto')
plt.colorbar(label='Energy Consumption')
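plt.xticks(range(0,5), ['Jan', 'Feb', 'Mar', 'Apr', 'May'])
# imshow rows are positioned at 0..n-1, so label each row with its day number
plt.yticks(range(len(daily_energy.index)), daily_energy.index)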
plt.show()


##### 1. Why did you pick the specific chart?

I chose a heatmap because it shows the mean appliance energy consumption for each day of each month in a single grid, which makes high- and low-consumption days easy to spot.


##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Map the day of the week values to their respective names
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
data['day_of_week'] = data['day_of_week'].map(lambda x: day_names[x])

# Create a box plot or violin plot to compare energy consumption across different days of the week
plt.figure(figsize=(10, 6))
sns.boxplot(x='day_of_week', y='Appliances', data=data, order=day_names)  # or sns.violinplot()
plt.title('Appliance Energy Consumption by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Energy Consumption')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

In [None]:
# Create a line plot to show the trend of energy consumption over time
import plotly.express as px

# Assuming you have a DataFrame 'data' with a datetime index
fig = px.line(data, x=data.index, y='Appliances', title='Energy Consumption of Appliances Over Time')
fig.update_xaxes(title_text='Date', tickangle=-45)
fig.update_yaxes(title_text='Energy Consumption')

# Show the Plotly figure
fig.show()


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Dropping the day_of_week column (it holds day names, which the numeric box plots below cannot use):
data.drop(columns='day_of_week', inplace=True)

In [None]:
# Chart - 4 visualization code
# Examining the outlier in the dataset
# Assuming 'data' is your DataFrame
num_columns = len(data.columns)
fig, axes = plt.subplots(nrows=num_columns, figsize=(8, num_columns*6))

for i, column in enumerate(data.columns):
    # Exclude 'day_of_week' from the visualization
    if column != 'day_of_week':
        data.boxplot(column=column, ax=axes[i])
        axes[i].set_title(f'Box Plot for {column}')
        axes[i].set_xlabel('Column')
        axes[i].set_ylabel('Values')

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
#close look on four columns
fig_sub = make_subplots(rows=1, cols=4, shared_yaxes=False)

fig_sub.add_trace(go.Box(y=data['Appliances'].values,name='Appliances'),row=1, col=1)
fig_sub.add_trace(go.Box(y=data['Windspeed'].values,name='Windspeed'),row=1, col=2)
fig_sub.add_trace(go.Box(y=data['Visibility'].values,name='Visibility'),row=1, col=3)
fig_sub.add_trace(go.Box(y=data['Press_mm_hg'].values,name='Press_mm_hg'),row=1, col=4)

fig_sub.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
import matplotlib.pyplot as plt

# Assuming 'data' is your DataFrame with the energy consumption data
# You can group the data by hour and calculate the mean energy consumption for each hour
hourly_energy = data.groupby('hour')['Appliances'].mean()

# Create a line chart to visualize the hourly energy consumption patterns
plt.figure(figsize=(12, 6))
plt.plot(hourly_energy.index, hourly_energy.values, marker='o', linestyle='-')
plt.title('Hourly Energy Consumption Patterns')
plt.xlabel('Hour of the Day')
plt.ylabel('Energy Consumption (mean)')
plt.xticks(range(24))
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Temperature columns to plot against appliance energy consumption, in display order
temperature_features = ['KITCHEN_TEMP', 'LIVING_TEMP', 'BEDROOM_TEMP', 'OFFICE_TEMP', 'BATHROOM_TEMP',
                        'OUTSIDE_TEMP_build', 'IRONING_ROOM_TEMP', 'TEEN_ROOM_2_TEMP',
                        'PARENTS_ROOM_TEMP', 'OUTSIDE_TEMP_wstn']

# Columns measured outside the house; all others are indoor rooms
outdoor_features = {'OUTSIDE_TEMP_build', 'OUTSIDE_TEMP_wstn'}

# Draw one scatter plot with a regression line per temperature column,
# labelling each figure with the column it actually shows
for column in temperature_features:
    location = 'Outdoor' if column in outdoor_features else 'Indoor'
    plt.figure(figsize=(10, 6))
    sns.regplot(x=column, y='Appliances', data=data, scatter_kws={'alpha':0.5}, line_kws={'color':'red'})
    plt.title(f'Scatter Plot and Regression Line for {location} Temperature vs. Energy Consumption')
    plt.xlabel(f'{location} Temperature ({column})')
    plt.ylabel('Energy Consumption (Appliances)')
    plt.grid(True)

plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code
import matplotlib.pyplot as plt

# Assuming 'data' is your DataFrame with the relevant columns (e.g., 'hour' and 'Appliances')
# You can create a line chart to show energy consumption throughout the day
plt.figure(figsize=(10, 6))

# Group the data by hour and calculate the mean energy consumption for each hour
hourly_energy = data.groupby('hour')['Appliances'].mean()

# Split the data into daytime (6:00 AM to 6:00 PM) and nighttime (6:00 PM to 6:00 AM)
# using label-based slicing on the hour index (.loc includes both end labels)
daytime_energy = hourly_energy.loc[6:17]
early_night_energy = hourly_energy.loc[0:5]
late_night_energy = hourly_energy.loc[18:23]

# Plot the daytime and nighttime energy consumption; the two nighttime segments are
# drawn separately so that no line is drawn across the daytime hours
plt.plot(daytime_energy.index, daytime_energy.values, label='Daytime', marker='o', color='r')
plt.plot(early_night_energy.index, early_night_energy.values, label='Nighttime', marker='o', color='b')
plt.plot(late_night_energy.index, late_night_energy.values, marker='o', color='b')

plt.title('Energy Consumption Throughout the Day')
plt.xlabel('Hour of the Day')
plt.ylabel('Mean Energy Consumption')
plt.xticks(range(24))
plt.grid(True)
plt.legend()

plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
import matplotlib.pyplot as plt

# Assuming 'data' is your DataFrame with relevant columns (e.g., 'weekday' and 'Appliances')
# You can create a line chart to compare energy consumption on weekdays vs. weekends

plt.figure(figsize=(10, 6))

# Group the data by 'weekday' and calculate the mean energy consumption for weekdays and weekends
weekday_energy = data[data['weekday'] < 5].groupby('hour')['Appliances'].mean()
weekend_energy = data[data['weekday'] >= 5].groupby('hour')['Appliances'].mean()

# Plot energy consumption for weekdays and weekends
plt.plot(weekday_energy.index, weekday_energy.values, label='Weekdays', marker='o')
plt.plot(weekend_energy.index, weekend_energy.values, label='Weekends', marker='o')

plt.title('Energy Consumption on Weekdays vs. Weekends')
plt.xlabel('Hour of the Day')
plt.ylabel('Mean Energy Consumption')
plt.xticks(range(24))
plt.grid(True)
plt.legend()

plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming 'data' is your DataFrame with relevant columns (e.g., 'Appliances', 'Tdewpoint', 'Press_mm_hg', 'Windspeed')
# You can create scatter plots to explore these relationships

plt.figure(figsize=(6, 4))

# Scatter plot of energy consumption vs. dew point temperature
sns.scatterplot(data=data, x='Tdewpoint', y='Appliances', alpha=0.5, label='Tdewpoint')

# Scatter plot of energy consumption vs. atmospheric pressure
sns.scatterplot(data=data, x='Press_mm_hg', y='Appliances', alpha=0.5, label='Press_mm_hg')

# Scatter plot of energy consumption vs. wind speed
sns.scatterplot(data=data, x='Windspeed', y='Appliances', alpha=0.5, label='Wind Speed')

# Scatter plot of energy consumption vs. OUTSIDE_TEMP_wstn
sns.scatterplot(data=data, x='OUTSIDE_TEMP_wstn', y='Appliances', alpha=0.5, label='OUTSIDE_TEMP_wstn')

# Scatter plot of energy consumption vs. OUTSIDE_HUM_wstn
sns.scatterplot(data=data, x='OUTSIDE_HUM_wstn', y='Appliances', alpha=0.5, label='OUTSIDE_HUM_wstn')

plt.title('Energy Consumption vs. Weather Variables')
plt.xlabel('Weather Variables')
plt.ylabel('Energy Consumption')
plt.legend()
plt.grid(True)

plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Visualizing distributions using Histograms:
data.hist(figsize=(17, 20), grid=True);

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

##### 1. Why did you pick the specific chart?

Answer Here.

In [None]:
# Correlation Heatmap visualization code
correlation_matrix = data.corr()
plt.figure(figsize=(21, 18))
sns.heatmap(correlation_matrix, annot=True, cmap="RdYlGn")
plt.title("Correlation Matrix Heatmap")
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Get the list of feature column names (every column except the target "Appliances")
columns = data.columns.drop("Appliances")

# Determine the number of rows and columns for subplots
num_rows = len(columns)
num_cols = 1

# Create subplots with specified number of rows and columns
fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(10, 80))

# Iterate over each feature and plot it against "Appliances"
for i, column in enumerate(columns):
    sns.scatterplot(data=data, x="Appliances", y=column, ax=axes[i])
    axes[i].set_xlabel("Appliances")
    axes[i].set_ylabel(column)

# Adjust the spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

There is a strong presence of heteroscedasticity in the data; a log transformation is the usual way to address it.
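
As a minimal sketch of the log transformation mentioned above, the snippet below applies `np.log1p` to a synthetic right-skewed target and maps it back with `np.expm1`. The data is generated here for illustration only and merely stands in for `Appliances`; the helper `skewness` is a hypothetical name, not part of the notebook.

```python
import numpy as np

# Synthetic right-skewed target standing in for 'Appliances' (illustrative values only)
rng = np.random.default_rng(0)
y = rng.lognormal(mean=4.0, sigma=0.6, size=1000)

# log1p compresses large values; expm1 maps values back to the original scale
y_log = np.log1p(y)
y_back = np.expm1(y_log)

def skewness(values):
    """Sample skewness: the third standardized moment."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

print(f"skewness before: {skewness(y):.2f}, after: {skewness(y_log):.2f}")
```

If a model is trained on the transformed target, its predictions would need to be passed through `np.expm1` before computing errors on the original scale.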

## ***5. Hypothesis Testing***

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant linear relationship between the independent variables and the appliance energy consumption.

Alternative Hypothesis (H1): There is a significant linear relationship between the independent variables and the appliance energy consumption.

#### 2. Perform an appropriate statistical test.

In [None]:
data.columns

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy.stats import pearsonr

# Separate the features from the target variable
column_to_drop = ['Appliances']
independent_variables = data.drop(column_to_drop, axis = 1)
dependent_variable = data['Appliances']

# Perform the correlation test (Pearson correlation) for each feature against the target
correlation_coefficients, p_values = [], []
for feature in independent_variables.columns:
    correlation_coefficient, p_value = pearsonr(independent_variables[feature], dependent_variable)
    correlation_coefficients.append(correlation_coefficient)
    p_values.append(p_value)

# Interpret the results for each feature
alpha = 0.05  # Significance level (commonly set to 0.05)
for i, feature in enumerate(independent_variables.columns):
    print(f"Correlation Coefficient for '{feature}': {correlation_coefficients[i]:.4f}")
    print(f"P-value for '{feature}': {p_values[i]:.4g}")

    if p_values[i] < alpha:
        print("Result: The correlation is statistically significant (reject H0).\n")
    else:
        print("Result: There is no significant correlation (fail to reject H0).\n")


##### Which statistical test have you done to obtain P-Value?

The statistical test used to obtain the p-value is the Pearson correlation test, applied to each feature against `Appliances`. The Pearson correlation coefficient, also known as Pearson's r or simply r, is a measure of the linear relationship between two continuous variables.

##### Why did you choose the specific statistical test?

The Pearson correlation test is appropriate for assessing the strength and direction of the linear relationship between two continuous variables, which is exactly what the hypothesis concerns: each feature is tested against the appliance energy consumption.

The p-value obtained from the test is the probability of observing a correlation coefficient at least as extreme as the calculated one if the null hypothesis is true. The null hypothesis (H0) in this context states that there is no linear relationship between the two variables.

By comparing the p-value to a chosen significance level (alpha), commonly set to 0.05 (5%), we decide whether to reject the null hypothesis. If the p-value is less than alpha, we reject H0 and conclude that the correlation is statistically significant. Otherwise, we fail to reject H0: the data do not provide sufficient evidence of a linear relationship.
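
To make the link between r and its p-value concrete, here is a small sketch on synthetic data (not the project dataset). It recomputes the two-sided p-value from the t-statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom and compares it with the value returned by `scipy.stats.pearsonr`. The sample size and coefficients are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data: y depends weakly on x plus noise (illustrative only)
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)

r, p_scipy = stats.pearsonr(x, y)

# Under H0 (no linear relationship, normally distributed data), this statistic
# follows a Student t distribution with n - 2 degrees of freedom
t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"r = {r:.4f}, p (scipy) = {p_scipy:.6f}, p (manual) = {p_manual:.6f}")
```

With a very large number of rows, even a tiny r can produce a p-value below 0.05, so the size of r is worth reading alongside the significance result.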


## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

Thankfully, there are no missing values in our dataset.

#### What all missing value imputation techniques have you used and why did you use those techniques?

We have not used any missing-value handling technique, as there are no NaN values in the dataset.
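
A check of this kind can be sketched as follows. The toy frame below is hypothetical (one value is deliberately set to NaN so the output is not all zeros); in the notebook the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real check would run on the project's `data`
toy = pd.DataFrame({
    "KITCHEN_TEMP": [19.9, 20.1, np.nan],
    "Appliances": [60, 50, 70],
})

# Count missing values per column and check whether any exist at all
missing_per_column = toy.isna().sum()
has_missing = bool(toy.isna().any().any())

print(missing_per_column)
print("Any missing values:", has_missing)
```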

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
df= data.copy()
col_list = list(df.describe().columns)

#find the outliers using boxplot
plt.figure(figsize=(25, 20))
plt.suptitle("Box Plot", fontsize=18, y=0.95)

for n, ticker in enumerate(col_list):

    ax = plt.subplot(8, 4, n + 1)

    plt.subplots_adjust(hspace=0.5, wspace=0.2)

    sns.boxplot(x=df[ticker],color='pink', ax = ax)

    # chart formatting
    ax.set_title(ticker.upper())


In [None]:
# Handling Outliers & Outlier treatments
import pandas as pd
import numpy as np

def find_outliers_iqr(data):
    # Calculate the first quartile (Q1) and third quartile (Q3) for each column
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)

    # Calculate the interquartile range (IQR) for each column
    iqr = q3 - q1

    # Calculate the lower and upper bounds for outliers for each column
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    # Check for outliers in each column and count the number of outliers
    outliers_count = (data < lower_bound) | (data > upper_bound)
    num_outliers = outliers_count.sum()

    return num_outliers


outliers_per_column = find_outliers_iqr(data)
print("Number of outliers per column:")
print(outliers_per_column.sort_values(ascending = False))



In [None]:
# Handling Outliers & Outlier treatments
# Cap values outside the 1.5 * IQR fences of each numeric column
for ftr in col_list:
  print(ftr,'\n')
  q_25 = np.percentile(df[ftr], 25)
  q_75 = np.percentile(df[ftr], 75)
  iqr = q_75 - q_25
  print('Percentiles: 25th=%.3f, 75th=%.3f, IQR=%.3f' % (q_25, q_75, iqr))
  # calculate the outlier cutoff
  cut_off = iqr * 1.5
  lower = q_25 - cut_off
  upper = q_75 + cut_off
  print(f"\nlower = {lower} and upper = {upper} \n ")
  # identify outliers
  n_outliers = ((df[ftr] < lower) | (df[ftr] > upper)).sum()
  print('Identified outliers: %d' % n_outliers)
  # cap outliers at the fences (rows are kept, only extreme values are replaced)
  if n_outliers != 0:
    data[ftr] = df[ftr].clip(lower=lower, upper=upper)
    print(f"{ftr} Outliers Capped")
  print("\n-------\n")

In [None]:
plt.figure(figsize=(25, 20))
plt.suptitle("Box Plot after Outlier Capping", fontsize=18, y=0.95)
# plot all columns in a loop with boxplots
for n, ticker in enumerate(col_list):

    ax = plt.subplot(8, 4, n + 1)

    plt.subplots_adjust(hspace=0.5, wspace=0.2)

    sns.boxplot(x=data[ticker],color='g' ,ax = ax)

    # chart formatting
    ax.set_title(ticker.upper())


In [None]:
data.shape

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Average building temperature across all indoor temperature columns
temp_cols = ['KITCHEN_TEMP', 'LIVING_TEMP', 'BEDROOM_TEMP', 'OFFICE_TEMP',
             'BATHROOM_TEMP', 'IRONING_ROOM_TEMP', 'TEEN_ROOM_2_TEMP', 'PARENTS_ROOM_TEMP']
data['Average_building_Temperature'] = data[temp_cols].mean(axis=1)
# Absolute difference between inside and outside temperature
data['Temperature_difference'] = abs(data['Average_building_Temperature'] - data['OUTSIDE_TEMP_build'])

# Average building humidity across all indoor humidity columns
hum_cols = ['KITCHEN_HUM', 'LIVING_HUM', 'BEDROOM_HUM', 'OFFICE_HUM',
            'BATHROOM_HUM', 'IRONING_ROOM_HUM', 'TEEN_ROOM_HUM', 'PARENTS_ROOM_HUM']
data['Average_building_humidity'] = data[hum_cols].mean(axis=1)
# Absolute difference between outside and inside building humidity
data['Humidity_difference'] = abs(data['OUTSIDE_HUM_build'] - data['Average_building_humidity'])




In [None]:
#do not remove hour
columns_to_drop = [
'rv1',
'rv2']
data.drop(columns_to_drop, axis=1, inplace=True)

In [None]:
data.shape

### Finding the skewed and symmetrical data

In [None]:
# examining the skewness in the dataset to check the distribution
skewness = data.skew()

# setting up the threshold
skewness_threshold = 0.5

# Separate features into symmetrical and skewed based on skewness threshold
symmetrical_features = skewness[abs(skewness) < skewness_threshold].index
skewed_features = skewness[abs(skewness) >= skewness_threshold].index

# Create new DataFrames for symmetrical and skewed features
print('FEATURES FOLLOWED SYMMETRICAL DISTRIBUTION :')
symmetrical_data = data[symmetrical_features]
print(symmetrical_features)

print('FEATURES FOLLOWED SKEWED DISTRIBUTION :')
skewed_data = data[skewed_features]
print(skewed_features)


### 5. Data Transformation

In [None]:
# Dropping the target from the skewed features is disabled:
# skewed_data.drop('Appliances', axis=1, inplace=True)

In [None]:
skewed_data

In [None]:
# import the libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

# Initialize the PowerTransformer (uses the Yeo-Johnson method by default)
power_transformer = PowerTransformer()

# Fit and transform the skewed features, keeping the original column names
power_transformed = pd.DataFrame(power_transformer.fit_transform(skewed_data),
                                 columns=skewed_data.columns)


In [None]:
power_transformed

In [None]:
# Reset the index to the default integer index so it lines up with power_transformed
symmetrical_data = symmetrical_data.reset_index(drop=True)

In [None]:
symmetrical_data

In [None]:
# Concatenate horizontally (along columns)
tranformed_data = pd.concat([symmetrical_data, power_transformed], axis=1)

In [None]:
tranformed_data

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

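Yes. Several features are skewed, so a power transformation (scikit-learn's `PowerTransformer`, which uses the Yeo-Johnson method by default) was applied to the skewed features to make their distributions more symmetric.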
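
The effect of the power transformation can be sketched on a synthetic skewed column. The column name `skewed_feature` and the lognormal data are assumptions made for this illustration; they are not taken from the project dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature (illustrative only)
rng = np.random.default_rng(42)
toy = pd.DataFrame({"skewed_feature": rng.lognormal(mean=0.0, sigma=1.0, size=2000)})

# PowerTransformer defaults to method='yeo-johnson' and standardizes the output
pt = PowerTransformer()
transformed = pd.DataFrame(pt.fit_transform(toy), columns=toy.columns)

skew_before = toy["skewed_feature"].skew()
skew_after = transformed["skewed_feature"].skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Comparing the skewness before and after is a quick way to confirm that the transformation did what it was meant to do for each column.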

### 6. Scaling the DATA set

In [None]:
# importing the desired library
from sklearn.preprocessing import StandardScaler

# StandardScaler
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(tranformed_data))
scaled_data.columns = tranformed_data.columns
scaled_data

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Initialize a PCA instance without specifying the number of components
pca = PCA()

# Fit the PCA model to your standardized data
pca.fit(scaled_data)

# Calculate the cumulative explained variance
cumulative_explained_variance = np.cumsum(pca.explained_variance_ratio_)

# Create an elbow plot to visualize the explained variance
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, marker='o', linestyle='--')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA Elbow Plot')
plt.grid()
plt.show()


# Create a PCA instance and specify the number of components you want to retain
# For example, if you want to retain 10 components, set n_components=10
n_components = 10
pca = PCA(n_components=n_components)

# Fit the PCA model to your standardized data and transform it
transformed_data_pca = pca.fit_transform(scaled_data)

# The variable 'transformed_data_pca' now contains your data in the reduced-dimensional space with 'n_components' principal components.

# You can also access explained variance to see how much variance is explained by each component
explained_variance = pca.explained_variance_ratio_

In [None]:
explained_variance

In [None]:
transformed_data_pca.shape

In [None]:
transformed_data_pca
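
Instead of fixing the number of components by hand, scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` and then keeps the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below shows this on synthetic data; the 0.95 threshold and the generated matrix are assumptions for illustration, not values used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 500 rows, 20 correlated columns built from 5 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(500, 20))

# A float in (0, 1) asks PCA to keep enough components to explain that share of variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X)

print("Components kept:", pca_95.n_components_)
print("Variance explained:", pca_95.explained_variance_ratio_.sum())
```

This gives the same kind of answer as reading the elbow plot, but as a reproducible number.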

### 8. Data Splitting

In [None]:
x = transformed_data_pca
y = data['Appliances']

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=3)

80/20 Split: The data is split into 80% for training and 20% for testing (`test_size=0.20`). This is a commonly used ratio when a sufficient amount of data is available: the larger portion is used for training the model, while the smaller held-out portion is used for evaluating its performance.
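
A common alternative ordering is to split first and fit every preprocessing step on the training portion only, so that the test rows influence neither the transformers nor the model. The sketch below wires the same steps used above into a scikit-learn pipeline on synthetic data. It is an illustration under assumed data and shapes, not the pipeline this notebook actually ran, and the feature matrix deliberately contains no target column.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Synthetic features and target (illustrative only)
rng = np.random.default_rng(3)
X = rng.lognormal(size=(400, 12))
y = 2.0 * X[:, 0] + rng.normal(size=400)

# Split before any fitting so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=3)

# Each step is fit on the training data only when .fit() is called
model = make_pipeline(
    PowerTransformer(),
    StandardScaler(),
    PCA(n_components=10),
    LinearRegression(),
)
model.fit(X_train, y_train)

print("Train R2:", model.score(X_train, y_train))
print("Test R2:", model.score(X_test, y_test))
```

Wrapping the steps in one pipeline also means cross-validation refits the transformers inside each fold.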


## ***7. ML Model Implementation***

### ML Model - 1 - Simple Linear Regression Model

In [None]:
# importing the model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# defining the object
reg = LinearRegression()
reg.fit(x_train, y_train)

# training dataset score (R2)
training_score = reg.score(x_train, y_train)

# predicting the value
y_pred = reg.predict(x_test)

# printing the training R2
print("Train score:" ,training_score)

# calculating the MSE
MSE  = mean_squared_error((y_test),(y_pred))
print("Test MSE :" , MSE)

# calculating the test R2
r2 = r2_score((y_test),(y_pred))
print("Test R2 :" ,r2)

In [None]:
# Visualizing evaluation Metric Score chart
sns.displot(y_pred - y_test,kind ='kde')

In [None]:
#plot to compare the predicted values against the actual values.
plt.figure(figsize=(8,5))
plt.plot(np.array(y_test))
plt.plot(y_pred)
plt.legend(["Actual","Predicted"])
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The model used here is linear regression, which predicts a continuous target variable from one or more independent variables. The model and its performance on the evaluation metrics are explained below:

**Model Explanation**:
- **Model Type**: Linear Regression
  - Linear regression is a simple and commonly used regression technique that assumes a linear relationship between the independent variables and the target variable.

**Performance Evaluation**:
- **Training Score**: The training score is 0.176, which represents the coefficient of determination (R-squared) for the model's performance on the training data. It indicates that approximately 17.6% of the variance in the target variable (y_train) can be explained by the model. This suggests that the model has relatively weak explanatory power on the training data.

- **Test Mean Squared Error (MSE)**: The test Mean Squared Error is 1518.43. MSE measures the average squared difference between the actual target values (y_test) and the predicted values (y_pred) on the test data. A lower MSE is desirable, and this value indicates the average prediction error of the model on the test data.

- **Test R-squared (R2) Score**: The test R2 score is 0.156, which represents the coefficient of determination (R-squared) for the model's performance on the test data. It indicates that approximately 15.6% of the variance in the target variable (y_test) can be explained by the model. This suggests that the model has relatively weak explanatory power on the test data.

**Interpretation**:
- The linear regression model has limited predictive power in this context, as indicated by both the training and test R-squared scores. The R-squared values are relatively low, indicating that the model does not explain a significant proportion of the variance in the target variable.

- The test Mean Squared Error (MSE) of 1518.43 suggests that the model's predictions have a relatively high average squared error when compared to the actual target values. This indicates that the model's predictions are not very accurate.

**Recommendations**:
- The model's performance can be improved by considering the following:
  - Feature Engineering: Evaluate the features used in the model and consider feature selection or transformation to improve model performance.
  - Model Complexity: Experiment with more complex models or non-linear models if a linear relationship may not adequately capture the data.
  - Data Quality: Ensure the quality and relevance of the training and test data. Data preprocessing and data cleaning may be necessary.
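  - Hyperparameter Tuning: Ordinary least squares has few meaningful hyperparameters, so gains are more likely from regularized variants (e.g., Ridge or Lasso) than from tuning `LinearRegression` itself.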

In summary, the linear regression model used in this context shows limited explanatory power and predictive performance. To achieve better results, you may need to explore more advanced modeling techniques and further data analysis.
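
The reported metrics can be put on a more interpretable scale with a little arithmetic. The sketch below takes the test MSE and test R2 quoted above and derives the RMSE (the typical error in the units of the target) and, using R2 = 1 - MSE / Var(y_test), the variance of the test target that these two numbers imply. Only the two reported values are used; nothing is recomputed from the data.

```python
import math

test_mse = 1518.43  # test MSE reported above
test_r2 = 0.156     # test R2 reported above

# RMSE is in the same units as the target variable
rmse = math.sqrt(test_mse)

# R2 = 1 - MSE / Var(y_test)  =>  Var(y_test) = MSE / (1 - R2)
implied_variance = test_mse / (1 - test_r2)
implied_std = math.sqrt(implied_variance)

print(f"RMSE: {rmse:.2f}")
print(f"Implied std of y_test: {implied_std:.2f}")
```

An RMSE only slightly below the standard deviation of the target is another way of seeing that the model is not much better than always predicting the mean.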

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV

# Create a Linear Regression model (you can replace this with any other regression model)
model = LinearRegression()

# Define hyperparameter search space (you can customize this based on your model)
# Note: 'copy_X' only controls whether the input array is copied; it does not change the fitted model
param_dist = {'fit_intercept': [True, False],
              'copy_X': [True, False],
              'positive':[True, False]}

# Perform RandomizedSearchCV; the search space has only 2 * 2 * 2 = 8 combinations,
# so n_iter is set to 8 to cover it without exceeding its size
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist,
                                   n_iter=8, scoring='neg_mean_squared_error', cv=5, n_jobs=-1)

# Fit the RandomizedSearchCV to find the best hyperparameters
random_search.fit(x_train, y_train)

# Get the best hyperparameters and model
# (best_estimator_ is already refit on the full training set because refit=True by default)
best_params = random_search.best_params_
best_model = random_search.best_estimator_

training_score_val = best_model.score(x_train, y_train)
# Evaluate the best model on the test set
test_predictions = best_model.predict(x_test)

# Calculate evaluation metrics for the test predictions
mse = mean_squared_error(y_test, test_predictions)
r2 = r2_score((y_test),(test_predictions))


print("Best Hyperparameters:", best_params)


# print the training score and test metrics
print("Train score:" ,training_score_val)
print("Test MSE:", mse)
print("Test R2:", r2)


In [None]:

sns.displot(test_predictions - y_test,kind ='kde')

In [None]:
# plot to compare the predicted values against the actual values
plt.figure(figsize=(8,5))
plt.plot(np.array(y_test))
plt.plot(test_predictions)
plt.legend(["Actual","Predicted"])
plt.show()

##### Which hyperparameter optimization technique have you used and why?

We have used the RandomizedSearchCV hyperparameter optimization technique. Unlike GridSearchCV, which tries every combination, RandomizedSearchCV samples a fixed number of candidate combinations from the search space. The main reasons for choosing it are listed below, followed by a comparison with GridSearchCV and Bayesian Optimization:

**RandomizedSearchCV**:

1. **Randomized Search Space Exploration**: RandomizedSearchCV, as the name suggests, explores the hyperparameter space in a randomized manner. It samples a specified number of candidate hyperparameter combinations from a defined search space. This approach can be more efficient than GridSearchCV for high-dimensional search spaces.

2. **Efficiency**: When dealing with a large number of hyperparameters and their possible values, RandomizedSearchCV offers a way to efficiently explore different combinations without trying every possible combination, which can be time-consuming.

3. **Parallel Processing**: It allows for parallel processing (as indicated by `n_jobs=-1`), which means it can take advantage of multi-core processors and speed up the search process.

4. **Scalability**: RandomizedSearchCV is more scalable and suitable for large search spaces and datasets. It's often chosen when you have limited computational resources.

5. **Balanced Exploration**: Because combinations are drawn at random, the search spreads its evaluations across the whole hyperparameter space rather than concentrating on one region of a grid.

6. **Performance**: While it may not guarantee finding the absolute best hyperparameters, it often performs well in practice and can help discover hyperparameters that lead to good model performance.

**GridSearchCV**, on the other hand, performs an exhaustive search over all possible hyperparameter combinations, which can be computationally expensive. It's typically suitable when you have a smaller search space or want to ensure that you've explored every possible combination.

**Bayesian Optimization** is another technique that uses probabilistic models to guide the search process. It's beneficial when each model evaluation is expensive, because it uses past evaluations to decide which combination to try next.

In summary, RandomizedSearchCV is chosen in scenarios where efficiency, scalability, and balanced exploration of the hyperparameter space are important. It is particularly useful when you have a large number of hyperparameters and their ranges to consider, making it a practical choice for many machine learning applications. However, the choice of hyperparameter optimization technique should depend on the specific characteristics of your problem, available resources, and the nature of the hyperparameter search space.
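
The difference between exhaustive and sampled search can be seen directly with scikit-learn's `ParameterGrid` and `ParameterSampler`, applied here to the same search space used for the linear regression above. The choice of `n_iter=5` and `random_state=0` is an assumption made for this illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_dist = {
    "fit_intercept": [True, False],
    "copy_X": [True, False],
    "positive": [True, False],
}

# Exhaustive search evaluates every combination: 2 * 2 * 2 = 8
grid = list(ParameterGrid(param_dist))

# Randomized search evaluates only n_iter combinations drawn from the same space
sampled = list(ParameterSampler(param_dist, n_iter=5, random_state=0))

print("Grid size:", len(grid))
print("Sampled candidates:", len(sampled))
```

With a space this small the two approaches cost about the same; the savings from random sampling only become noticeable when the grid has hundreds or thousands of combinations.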

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

There doesn't seem to be any improvement in the model's performance as indicated by the evaluation metrics.

Here's a comparison of the metrics before and after hyperparameter optimization:

Before Hyperparameter Optimization (Before CV):
- Train R-squared (R2) Score: 0.176
- Test Mean Squared Error (MSE): 1518.43
- Test R-squared (R2) Score: 0.156

After Hyperparameter Optimization (After CV):
- Train R-squared (R2) Score: 0.176
- Test Mean Squared Error (MSE): 1518.43
- Test R-squared (R2) Score: 0.156

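The metrics remain the same before and after hyperparameter optimization. This suggests that the optimization process did not lead to any noticeable improvement in the model's performance based on these evaluation metrics. This is not surprising: ordinary least squares has a closed-form solution, and the options searched (`fit_intercept`, `copy_X`, `positive`) add neither regularization nor model capacity.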

In practice, hyperparameter optimization may not always result in significant improvements. The choice of hyperparameters and optimization techniques can vary depending on the dataset and the model. If the performance of the model is unsatisfactory, you may want to consider other approaches such as feature engineering, feature selection, trying different model algorithms, or collecting more data if possible. Additionally, exploring a wider range of hyperparameters or other optimization techniques may be necessary to make more substantial improvements.
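
As an illustration of a search over a hyperparameter that does change the fitted model, the sketch below tunes the regularization strength `alpha` of Ridge regression with `GridSearchCV`. Ridge is swapped in here purely as an example and is not one of the models used in this notebook; the data is synthetic and the `alpha` grid is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(size=200)

# alpha controls the strength of the L2 penalty on the coefficients
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

print("Best alpha:", search.best_params_["alpha"])
print("Best CV MSE:", -search.best_score_)
```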

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Let's explain each of the evaluation metrics mentioned and their indications toward business impact:

1. **R-squared (R2) Score**:
   - **Indication Towards Business**: R-squared measures the proportion of the variance in the target variable that can be explained by the model. Its maximum value is 1, and a higher R-squared indicates that the model explains more variance in the data. On held-out data it can also be negative, which means the model predicts worse than simply using the mean of the target.
   - **Business Impact**: A high R-squared suggests that the model is effective in explaining and predicting the target variable, which can be valuable for business decision-making. It indicates how well the model aligns with the underlying patterns in the data. However, it's important to consider other metrics and domain knowledge to assess the true business impact.

2. **Mean Squared Error (MSE)**:
   - **Indication Towards Business**: MSE quantifies the average squared difference between predicted and actual values. Lower MSE values indicate more accurate predictions.
   - **Business Impact**: A lower MSE signifies that the model's predictions are closer to the actual values, which is beneficial for businesses. It implies that the model's predictions are more reliable and can lead to more informed decisions. For example, more accurate forecasts of appliance energy use support better planning of household energy demand.

3. **R-squared (R2) Score for Test Data**:
   - **Indication Towards Business**: This R-squared score specifically measures the proportion of variance explained by the model on unseen test data. It helps assess the model's generalization performance.
   - **Business Impact**: A high R2 score on the test data suggests that the model is likely to perform well in real-world scenarios and make accurate predictions. This leads to more confident and effective business decisions.

In the provided context, the ML model's performance is not strong, as indicated by the low R-squared scores and relatively high MSE. The R-squared scores are less than 0.2, which means that the model explains a small proportion of the variance in the target variable. This can indicate that the model may not be highly reliable for making business decisions in its current form.

**Business Impact**:
- The low R-squared scores may result in less accurate predictions, potentially leading to suboptimal business decisions.
- For businesses, making decisions based on a model with limited predictive power can be risky and result in inefficient resource allocation.

To improve business impact, it's advisable to consider the following actions:
- Explore more advanced modeling techniques that may capture the data's underlying patterns better.
- Invest in data quality and feature engineering to extract more relevant information.
- Collect additional data to enhance the model's performance.
- Regularly monitor and update the model as new data becomes available.

In conclusion, the evaluation metrics provide insights into the model's performance and its potential impact on business decisions. In this case, there is room for improvement to make the model more valuable for making informed and accurate business choices.
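
The two metrics discussed above can be computed by hand on a tiny example to see how they behave. The arrays below are made-up values chosen so the arithmetic is easy to follow, and `r2_manual` is a hypothetical helper written for this sketch; the results are checked against scikit-learn's `r2_score` and `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_good = np.array([1.1, 1.9, 3.2, 3.8])  # close to the actual values
y_bad = np.array([4.0, 3.0, 2.0, 1.0])   # reversed, worse than predicting the mean

def r2_manual(actual, predicted):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

print("Good predictions: R2 =", r2_manual(y_true, y_good),
      "MSE =", mean_squared_error(y_true, y_good))
print("Bad predictions:  R2 =", r2_manual(y_true, y_bad),
      "MSE =", mean_squared_error(y_true, y_bad))
```

The second case shows that R2 is not bounded below by 0: predictions that are worse than the mean of the target give a negative score.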

### ML Model - 2 - Polynomial Regression model


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Assuming you have already split your data into x_train, x_test, y_train, and y_test

# Choose the degree of the polynomial (e.g., 2 for quadratic)
degree = 2

# Create a Polynomial Regression model using a pipeline
polyreg = make_pipeline(PolynomialFeatures(degree), LinearRegression())

# Fit the model to the training data
polyreg.fit(x_train, y_train)

# Predict on the test data
y_pred = polyreg.predict(x_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Calculate the R2 score for the training data
training_r2 = polyreg.score(x_train, y_train)

print(f"Training R-squared (R2) Score: {training_r2:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared (R2) Score: {r2:.2f}")

In [None]:
# Visualizing evaluation Metric Score chart
sns.displot(y_pred - y_test,kind ='kde')

In [None]:
# Plot actual vs. predicted values on the test set
plt.figure(figsize=(8,5))
plt.plot(np.array(y_test))
plt.plot(y_pred)
plt.legend(["Actual","Predicted"])
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import cross_val_score, GridSearchCV

# Create a Polynomial Regression model without specifying the degree
polyreg = make_pipeline(PolynomialFeatures(), LinearRegression())

# Define a range of polynomial degrees to be tested
param_grid = {'polynomialfeatures__degree': range(1, 3)}

# Initialize GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(polyreg, param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the model to the training data
grid_search.fit(x_train, y_train)

# Get the best polynomial degree
best_degree = grid_search.best_params_['polynomialfeatures__degree']

# Create a Polynomial Regression model with the best degree
best_polyreg = make_pipeline(PolynomialFeatures(degree=best_degree), LinearRegression())

# Perform cross-validation to evaluate the model
cv_scores = cross_val_score(best_polyreg, x_train, y_train, cv=5, scoring='neg_mean_squared_error')
cv_r2_scores = cross_val_score(best_polyreg, x_train, y_train, cv=5, scoring='r2')

# Calculate the mean squared error and R2 score
mse_cv = -cv_scores.mean()
r2_cv = cv_r2_scores.mean()

# Fit the best model to the training data
best_polyreg.fit(x_train, y_train)

# Predict on the test data
y_pred = best_polyreg.predict(x_test)

# Evaluate the model on the test data
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Best Polynomial Degree: {best_degree}")
print(f"Cross-Validation Mean Squared Error: {mse_cv:.2f}")
print(f"Cross-Validation R-squared (R2) Score: {r2_cv:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared (R2) Score: {r2:.2f}")


In [None]:
# Visualizing evaluation Metric Score chart
sns.displot(y_pred - y_test,kind ='kde')

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

There doesn't seem to be any improvement in the model's performance as indicated by the evaluation metrics.

Here's a comparison of the metrics before and after hyperparameter optimization:

Before Hyperparameter Optimization (Before CV):
- Train R-squared (R2) Score: 0.176
- Test Mean Squared Error (MSE): 1518.43
- Test R-squared (R2) Score: 0.156

After Hyperparameter Optimization (After CV):
- Train R-squared (R2) Score: 0.176
- Test Mean Squared Error (MSE): 1518.43
- Test R-squared (R2) Score: 0.156

The metrics are identical before and after hyperparameter optimization, so the search did not improve the model. This is the expected outcome if the search selected degree 2, the same degree already used in the baseline model.

In practice, hyperparameter optimization may not always result in significant improvements. The choice of hyperparameters and optimization techniques can vary depending on the dataset and the model. If the performance of the model is unsatisfactory, you may want to consider other approaches such as feature engineering, feature selection, trying different model algorithms, or collecting more data if possible. Additionally, exploring a wider range of hyperparameters or other optimization techniques may be necessary to make more substantial improvements.
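
As a concrete case of the "wider range of hyperparameters" point: `range(1, 3)` yields only degrees 1 and 2, so a search over it cannot go beyond a degree-2 baseline. The sketch below shows this on synthetic quadratic data. The data and all `demo_` names are assumptions made for illustration, not the project's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# range() excludes its stop value, so this grid contains degrees 1 and 2 only.
# To include cubic terms as well, range(1, 4) would be needed.
demo_degrees = list(range(1, 3))

# Synthetic, noise-free quadratic data (hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] ** 2

demo_pipe = make_pipeline(PolynomialFeatures(), LinearRegression())
demo_search = GridSearchCV(
    demo_pipe,
    {"polynomialfeatures__degree": demo_degrees},
    cv=5,
    scoring="neg_mean_squared_error",
)
demo_search.fit(X_demo, y_demo)

demo_best_degree = demo_search.best_params_["polynomialfeatures__degree"]
print("Degrees searched:", demo_degrees)
print("Best degree:", demo_best_degree)
```

Whenever the baseline already uses the best value in the grid, the "tuned" model is the same model, and its metrics will match exactly.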








### ML Model - 3 - RIDGE Regression Model

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Assuming you have already created the 'x_train', 'x_test', 'y_train', and 'y_test' datasets
# 'x_train' and 'x_test' are the results of polynomial regression on PCA-transformed data

# Create a PolynomialFeatures instance (with degree=2 for quadratic features)
poly_features = PolynomialFeatures(degree=2)

# Transform the data to include polynomial features
x_train_poly = poly_features.fit_transform(x_train)
x_test_poly = poly_features.transform(x_test)

# Create a Ridge regression model
ridge_reg = Ridge(alpha=1.0)  # You can adjust the alpha parameter (regularization strength)

# Fit the Ridge model to the training data
ridge_reg.fit(x_train_poly, y_train)

# Predict on the test data
y_pred = ridge_reg.predict(x_test_poly)

# Calculate R-squared (R2) for the test data
test_r2 = ridge_reg.score(x_test_poly, y_test)

# Calculate R-squared (R2) for the training data
training_r2 = ridge_reg.score(x_train_poly, y_train)

# Calculate Mean Squared Error (MSE) for the test data
mse = mean_squared_error(y_test, y_pred)


print(f"Test R-squared (R2) Score: {test_r2:.2f}")
print(f"Training R-squared (R2) Score: {training_r2:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")


In [None]:
# Visualizing evaluation Metric Score chart
sns.displot(y_pred - y_test,kind ='kde')

In [None]:
# Plot actual vs. predicted values on the test set
plt.figure(figsize=(8,5))
plt.plot(np.array(y_test))
plt.plot(y_pred)
plt.legend(["Actual","Predicted"])
plt.show()

The choice of polynomial degree should balance model complexity against performance. Results are available for two degrees, 2 and 3:

**Degree 2 Model:**
- Test R-squared (R2) Score: 0.23
- Training R-squared (R2) Score: 0.27
- Mean Squared Error (MSE): 7933.72

**Degree 3 Model:**
- Test R-squared (R2) Score: 0.36
- Training R-squared (R2) Score: 0.27
- Mean Squared Error (MSE): 6589.76

Here are some insights:

1. **Test R-squared (R2) Score**: The degree-3 model has a higher R2 score on the test data, indicating that it explains more of the variance in the target variable compared to the degree-2 model. A higher R2 suggests better predictive performance.

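2. **Training R-squared (R2) Score**: Both models report the same training R2 score (0.27), so the cubic terms did not improve the fit on the training data. For the degree-3 model, the test R2 (0.36) is higher than the training R2. That is unusual and is not a sign of overfitting, but it may depend on the particular train/test split and is worth re-checking.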

3. **Mean Squared Error (MSE)**: The degree-3 model has a lower MSE on the test data, which is a measure of prediction accuracy. A lower MSE is generally desirable.

Given these insights, the degree-3 model appears to be better at capturing the underlying relationships in the data and making predictions on unseen data: it has a higher R2 score and a lower MSE on the test data. However, a cubic expansion adds many more features than a quadratic one, so its generalization should be confirmed with cross-validation before it is preferred.
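
The complexity side of this trade-off can be quantified. For n input features, a full polynomial expansion of degree d, including the bias column that `PolynomialFeatures` adds by default, has C(n + d, d) columns. The sketch below evaluates that count. The input width of 10 is a hypothetical figure chosen for illustration, not the actual width of `x_train`, and `n_polynomial_features` is a helper defined here, not a library function.

```python
from math import comb


def n_polynomial_features(n_inputs: int, degree: int) -> int:
    """Number of columns in a full polynomial expansion, bias term included."""
    return comb(n_inputs + degree, degree)


demo_n_inputs = 10  # hypothetical number of input features
for demo_degree in (1, 2, 3):
    print(f"degree {demo_degree}: {n_polynomial_features(demo_n_inputs, demo_degree)} columns")
```

Going from degree 2 to degree 3 multiplies the number of columns several times over, which is why regularization (Ridge, Lasso) becomes more important at higher degrees.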

#### 2. Cross-Validation & Hyperparameter Tuning


In [None]:
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
import numpy as np

# Assuming 'x_train', 'x_test', 'y_train', and 'y_test' are already defined

# Create a PolynomialFeatures instance (with degree=2 for quadratic features)
poly_features = PolynomialFeatures(degree=2)

# Create a Ridge regression model
ridge_reg = Ridge()

# Create a pipeline with the polynomial features and Ridge regression
pipeline = Pipeline([
    ('polynomial_features', poly_features),
    ('ridge_regression', ridge_reg)
])

# Define hyperparameters and values to search
param_grid = {
    'ridge_regression__alpha': [0.001, 0.01, 0.1, 1]  # You can adjust the alpha values
}

# Perform Grid Search with Cross-Validation on the training data only,
# so the test set stays unseen during tuning
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)

# Get the best hyperparameters from the grid search
best_alpha = grid_search.best_params_['ridge_regression__alpha']

# Use the best pipeline (polynomial features + Ridge), which GridSearchCV
# has already refit on the full training data
best_ridge_reg = grid_search.best_estimator_

# Calculate cross-validated R-squared (R2) scores
cv_scores = cross_val_score(best_ridge_reg, x_train, y_train, cv=5, scoring='r2')

# Predict on the test data and calculate the R-squared (R2) score
y_pred = best_ridge_reg.predict(x_test)
test_r2 = best_ridge_reg.score(x_test, y_test)

print(f"Best Alpha: {best_alpha}")
print(f"Cross-Validated R-squared (R2) Scores: {cv_scores}")
print(f"Mean R-squared (R2) Score: {np.mean(cv_scores):.2f}")
print(f"Training R-squared (R2) Score: {best_ridge_reg.score(x_train, y_train):.2f}")
print(f"Test R-squared (R2) Score: {test_r2:.2f}")


In [None]:
# Plot actual vs. predicted values on the test set
plt.figure(figsize=(8,5))
plt.plot(np.array(y_test))
plt.plot(y_pred)
plt.legend(["Actual","Predicted"])
plt.show()

### ML Model - 4 - Lasso Regression Model

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Assuming you have already created the 'x_train', 'x_test', 'y_train', and 'y_test' datasets
# 'x_train' and 'x_test' are the results of polynomial regression on PCA-transformed data

# Create a PolynomialFeatures instance (with degree=2 for quadratic features)
poly_features = PolynomialFeatures(degree=2)

# Transform the data to include polynomial features
x_train_poly = poly_features.fit_transform(x_train)
x_test_poly = poly_features.transform(x_test)

# Create a Lasso regression model
lasso_reg = Lasso(alpha=1.0)  # You can adjust the alpha parameter (regularization strength)

# Fit the Lasso model to the training data
lasso_reg.fit(x_train_poly, y_train)

# Predict on the test data
y_pred = lasso_reg.predict(x_test_poly)

# Calculate R-squared (R2) for the test data
test_r2 = lasso_reg.score(x_test_poly, y_test)

# Calculate R-squared (R2) for the training data
training_r2 = lasso_reg.score(x_train_poly, y_train)

# Calculate Mean Squared Error (MSE) for the test data
mse = mean_squared_error(y_test, y_pred)

print(f"Test R-squared (R2) Score: {test_r2:.2f}")
print(f"Training R-squared (R2) Score: {training_r2:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")

In [None]:
# Plot actual vs. predicted values on the test set
plt.figure(figsize=(8,5))
plt.plot(np.array(y_test))
plt.plot(y_pred)
plt.legend(["Actual","Predicted"])
plt.show()

In [None]:
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Assuming you have 'x' and 'y' as your data and target variable

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=3)

# Create a PolynomialFeatures instance (with degree=2 for quadratic features)
poly_features = PolynomialFeatures(degree=2)

# Transform the data to include polynomial features
x_train_poly = poly_features.fit_transform(x_train)
x_test_poly = poly_features.transform(x_test)

# Scale the features
scaler = StandardScaler()
x_train_poly = scaler.fit_transform(x_train_poly)
x_test_poly = scaler.transform(x_test_poly)

# Create a Lasso regression model
lasso_reg = Lasso(max_iter=10000)  # Increase max_iter and adjust the alpha parameter if needed

# Define hyperparameters and values to search
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1, 10]  # You can adjust the alpha values
}

# Perform Grid Search with Cross-Validation
grid_search = GridSearchCV(lasso_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(x_train_poly, y_train)  # Use the training data for cross-validation

# Get the best hyperparameters from the grid search
best_alpha = grid_search.best_params_['alpha']

# Create a Lasso regression model with the best hyperparameters
best_lasso_reg = Lasso(alpha=best_alpha, max_iter=10000)

# Fit the Lasso model to the training data
best_lasso_reg.fit(x_train_poly, y_train)

# Predict on the test data
y_pred = best_lasso_reg.predict(x_test_poly)

# Calculate R-squared (R2) for the test data
test_r2 = best_lasso_reg.score(x_test_poly, y_test)

# Calculate R-squared (R2) for the training data
training_r2 = best_lasso_reg.score(x_train_poly, y_train)

# Calculate Mean Squared Error (MSE) for the test data
mse = mean_squared_error(y_test, y_pred)

# Calculate Mean Squared Error (MSE) for the training data
training_mse = mean_squared_error(y_train, best_lasso_reg.predict(x_train_poly))

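# Calculate cross-validated R-squared (R2) scores on the same transformed
# training features the model was tuned on (not the raw 'x', which also contains test rows)
cv_scores = cross_val_score(best_lasso_reg, x_train_poly, y_train, cv=5, scoring='r2')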

print(f"Best Alpha: {best_alpha}")
print(f"Test R-squared (R2) Score: {test_r2:.2f}")
print(f"Training R-squared (R2) Score: {training_r2:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Training Mean Squared Error (MSE): {training_mse:.2f}")
print(f"Cross-Validated R-squared (R2) Scores: {cv_scores}")
print(f"Mean R-squared (R2) Score: {np.mean(cv_scores):.2f}")


In [None]:
# Plot actual vs. predicted values on the test set
plt.figure(figsize=(8,5))
plt.plot(np.array(y_test))
plt.plot(y_pred)
plt.legend(["Actual","Predicted"])
plt.show()
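
An alternative to transforming and scaling the data by hand is to put every step into a single `Pipeline` and pass that to `GridSearchCV`, so the polynomial expansion and the scaler are refit inside each cross-validation fold. Below is a minimal sketch on synthetic data. The dataset, the step names (`poly`, `scale`, `lasso`), and the alpha grid are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (hypothetical): the target depends on two of five features.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(300, 5))
y_demo = 4.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=3)

# All preprocessing lives inside the pipeline, so each CV fold fits it
# on that fold's training part only.
demo_lasso_pipeline = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("lasso", Lasso(max_iter=10000)),
])

demo_alphas = [0.001, 0.01, 0.1, 1, 10]
demo_lasso_search = GridSearchCV(
    demo_lasso_pipeline,
    {"lasso__alpha": demo_alphas},
    cv=5,
    scoring="neg_mean_squared_error",
)
demo_lasso_search.fit(X_tr, y_tr)  # the test split is never seen during tuning

demo_lasso_alpha = demo_lasso_search.best_params_["lasso__alpha"]
# best_estimator_ is the pipeline refit on all training data; .score() returns R2.
demo_lasso_test_r2 = demo_lasso_search.best_estimator_.score(X_te, y_te)

print("Best alpha:", demo_lasso_alpha)
print(f"Test R2: {demo_lasso_test_r2:.2f}")
```

With this layout there is no separate "create the best model and fit it again" step, because `GridSearchCV` refits the best configuration by default.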

### ML Model - 5 - Elastic Net Regression Model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import  ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# Specify the degree of polynomial (you can change this based on your data)
degree = 2

# Create polynomial features
poly_features = PolynomialFeatures(degree=degree)
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Create an ElasticNet regression model
ElasticNet_model = ElasticNet(alpha=1.0)

# Train the model using the polynomial features
ElasticNet_model.fit(X_train_poly, y_train)

# Make predictions on the training and test data
train_predictions = ElasticNet_model.predict(X_train_poly)
test_predictions = ElasticNet_model.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)

print("ElasticNet Regression (Polynomial Degree {}):".format(degree))
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

In [None]:
# Plot actual vs. predicted values on the test set
# (test_predictions matches the current y_test; y_pred comes from an earlier split)
plt.figure(figsize=(8,5))
plt.plot(np.array(y_test))
plt.plot(test_predictions)
plt.legend(["Actual","Predicted"])
plt.show()

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

# Create an ElasticNet regression model
ElasticNet_model = ElasticNet()

# Perform Cross-Validation and Hyperparameter Tuning
param_grid = {'alpha': [0.1, 1.0, 10.0]}  # Define the hyperparameter grid

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=ElasticNet_model, param_grid=param_grid,
                           scoring='neg_mean_squared_error', cv=5)

# Fit the GridSearchCV to find the best alpha
grid_search.fit(X_train_poly, y_train)

# Get the best alpha and the refit best model from the GridSearchCV results
best_alpha = grid_search.best_params_['alpha']
best_model = grid_search.best_estimator_

# Make predictions on the training and test data
train_predictions = best_model.predict(X_train_poly)
test_predictions = best_model.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)

print("Best Alpha:", best_alpha)
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

In [None]:
# Plot actual vs. predicted values on the test set
# (test_predictions matches the current y_test; y_pred comes from an earlier split)
plt.figure(figsize=(8,5))
plt.plot(np.array(y_test))
plt.plot(test_predictions)
plt.legend(["Actual","Predicted"])
plt.show()
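
The grid above tunes only `alpha`. Elastic net has a second hyperparameter, `l1_ratio`, which controls the mix between the L1 (Lasso) and L2 (Ridge) penalties. A possible extension is to search both together, sketched here on synthetic data. The data, grid values, and `demo_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic data (hypothetical): only the first two of six features matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# alpha sets the overall penalty strength;
# l1_ratio moves the penalty between pure L2 (0.0) and pure L1 (1.0).
demo_enet_grid = {
    "alpha": [0.01, 0.1, 1.0],
    "l1_ratio": [0.2, 0.5, 0.8],
}

demo_enet_search = GridSearchCV(
    ElasticNet(max_iter=10000),
    demo_enet_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
demo_enet_search.fit(X_demo, y_demo)

demo_enet_cv_mse = -demo_enet_search.best_score_  # undo the sign flip of the scorer
print("Best parameters:", demo_enet_search.best_params_)
print(f"Best CV MSE: {demo_enet_cv_mse:.3f}")
```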

### ML Model - 6 - Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Create a Random Forest Regressor model
rf_model = RandomForestRegressor(n_estimators=20, random_state=42)

# Train the model
rf_model.fit(X_train_poly, y_train)

# Make predictions on the training and test data
train_predictions_rf = rf_model.predict(X_train_poly)
test_predictions_rf = rf_model.predict(X_test_poly)

# Evaluate the model
train_mse_rf = mean_squared_error(y_train, train_predictions_rf)
test_mse_rf = mean_squared_error(y_test, test_predictions_rf)

train_r2_rf = r2_score(y_train, train_predictions_rf)
test_r2_rf = r2_score(y_test, test_predictions_rf)

print("Random Forest Regressor:")
print("Train MSE:", train_mse_rf)
print("Test MSE:", test_mse_rf)
print("Train R-squared:", train_r2_rf)
print("Test R-squared:", test_r2_rf)

In [None]:
# Plot actual vs. predicted values on the test set
plt.figure(figsize=(8,5))
plt.plot(np.array(y_test))
plt.plot(test_predictions_rf)
plt.legend(["Actual","Predicted"])
plt.show()

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Create a Random Forest Regressor model
rf_model = RandomForestRegressor(n_estimators=20, random_state=42)

# Train the model
rf_model.fit(X_train_poly, y_train)

# Make predictions on the test data
test_predictions_rf = rf_model.predict(X_test_poly)

# Calculate Test MSE and Test R-squared
test_mse_rf = mean_squared_error(y_test, test_predictions_rf)
test_r2_rf = r2_score(y_test, test_predictions_rf)

# Perform cross-validation on the training data.
# The built-in 'neg_mean_squared_error' scorer returns negated MSE, so negate it back.
k = 5  # Number of folds (you can adjust this as needed)
mse_scores = -cross_val_score(rf_model, X_train_poly, y_train, cv=k, scoring='neg_mean_squared_error')
r2_scores = cross_val_score(rf_model, X_train_poly, y_train, cv=k, scoring='r2')

# Calculate the mean cross-validated MSE and R-squared
mean_mse = np.mean(mse_scores)
mean_r2 = np.mean(r2_scores)

# Print the cross-validation results
print("Cross-Validation Results for Random Forest Regressor:")
print(f"Cross-Validated MSE: {mean_mse:.2f}")
print(f"Cross-Validated R-squared: {mean_r2:.2f}")
print(f"Test MSE: {test_mse_rf:.2f}")
print(f"Test R-squared: {test_r2_rf:.2f}")
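
scikit-learn scorers follow a "greater is better" convention, which is why the built-in MSE scorer is named `neg_mean_squared_error` and returns negative values that have to be negated to recover the MSE. The sketch below shows this on synthetic data, with a small linear model standing in for the random forest to keep the example fast (an assumption for illustration; all `demo_` names are hypothetical).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic linear data with a little noise (hypothetical).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

# The scorer returns one negated MSE per fold.
demo_neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)
demo_cv_mse = -demo_neg_mse  # flip the sign to get ordinary (non-negative) MSE values

print("Raw scorer output:", demo_neg_mse)
print("Mean CV MSE:", demo_cv_mse.mean())
```

A negative "MSE" in a results table is usually a sign that the scorer output was reported without the sign flip, or that the sign was flipped twice.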


In [None]:
# Visualizing evaluation Metric Score chart
sns.displot(test_predictions_rf - y_test,kind ='kde')

In [None]:
# Plot actual vs. predicted values on the test set
plt.figure(figsize=(8,5))
plt.plot(np.array(y_test))
plt.plot(test_predictions_rf)
plt.legend(["Actual","Predicted"])
plt.show()

### ML Model - 7 - GRADIENT BOOSTING

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

# Create a Gradient Boosting Regressor model
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
gb_model.fit(X_train_poly, y_train)

# Make predictions on the training and test data
train_predictions_gb = gb_model.predict(X_train_poly)
test_predictions_gb = gb_model.predict(X_test_poly)

# Evaluate the model
train_mse_gb = mean_squared_error(y_train, train_predictions_gb)
test_mse_gb = mean_squared_error(y_test, test_predictions_gb)

train_r2_gb = r2_score(y_train, train_predictions_gb)
test_r2_gb = r2_score(y_test, test_predictions_gb)

print("Gradient Boosting Regressor:")
print("Train MSE:", train_mse_gb)
print("Test MSE:", test_mse_gb)
print("Train R-squared:", train_r2_gb)
print("Test R-squared:", test_r2_gb)

In [None]:
# Visualizing evaluation Metric Score chart
sns.displot(test_predictions_gb - y_test,kind ='kde')

In [None]:
# Plot actual vs. predicted values on the test set
plt.figure(figsize=(8,5))
plt.plot(np.array(y_test))
plt.plot(test_predictions_gb)
plt.legend(["Actual","Predicted"])
plt.show()

In [None]:
'''# We can use this code snippet for cross validation
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np

# Create a Gradient Boosting Regressor
gb_model = GradientBoostingRegressor(random_state=42)

# Define a parameter grid for hyperparameter tuning
param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 4]
}

# Create a GridSearchCV object with 5-fold cross-validation
grid_search = GridSearchCV(gb_model, param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the GridSearchCV object on your data
grid_search.fit(X_train_poly, y_train)

# Get the best model from the search
best_gb_model = grid_search.best_estimator_

# Make predictions using the best model
train_predictions_gb = best_gb_model.predict(X_train_poly)
test_predictions_gb = best_gb_model.predict(X_test_poly)

# Evaluate the best model
train_mse_gb = mean_squared_error(y_train, train_predictions_gb)
test_mse_gb = mean_squared_error(y_test, test_predictions_gb)
train_r2_gb = r2_score(y_train, train_predictions_gb)
test_r2_gb = r2_score(y_test, test_predictions_gb)

print("Best Gradient Boosting Regressor after hyperparameter tuning:")
print("Train MSE:", train_mse_gb)
print("Test MSE:", test_mse_gb)
print("Train R-squared:", train_r2_gb)
print("Test R-squared:", test_r2_gb)

# Print the best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)'''

### ML Model - 8 - XGBOOST

In [None]:
import xgboost as xgb

# Create an XGBoost Regressor model
xgb_model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
xgb_model.fit(X_train_poly, y_train)

# Make predictions on the training and test data
train_predictions_xgb = xgb_model.predict(X_train_poly)
test_predictions_xgb = xgb_model.predict(X_test_poly)

# Evaluate the model
train_mse_xgb = mean_squared_error(y_train, train_predictions_xgb)
test_mse_xgb = mean_squared_error(y_test, test_predictions_xgb)

train_r2_xgb = r2_score(y_train, train_predictions_xgb)
test_r2_xgb = r2_score(y_test, test_predictions_xgb)

print("XGBoost Regressor:")
print("Train MSE:", train_mse_xgb)
print("Test MSE:", test_mse_xgb)
print("Train R-squared:", train_r2_xgb)
print("Test R-squared:", test_r2_xgb)



In [None]:
# Visualizing evaluation Metric Score chart
sns.displot(test_predictions_xgb - y_test,kind ='kde')

In [None]:
# Plot actual vs. predicted values on the test set
plt.figure(figsize=(8,5))
plt.plot(np.array(y_test))
plt.plot(test_predictions_xgb)
plt.legend(["Actual","Predicted"])
plt.show()

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

### ML Model - 9 - SUPPORT VECTOR REGRESSOR

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=3)

# Create an SVR model
svr = SVR(kernel='rbf')  # You can choose the kernel (e.g., 'linear', 'rbf', 'poly')

# Fit the SVR model to the training data
svr.fit(x_train, y_train)

# Predict on the test data
y_pred = svr.predict(x_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Calculate the R2 score for the training data
training_r2 = svr.score(x_train, y_train)

print(f"Training R-squared (R2) Score: {training_r2:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared (R2) Score: {r2:.2f}")


In [None]:
# Plot actual vs. predicted values on the test set
plt.figure(figsize=(8,5))
plt.plot(np.array(y_test))
plt.plot(y_pred)
plt.legend(["Actual","Predicted"])
plt.show()
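
SVR with an RBF kernel is distance-based, so it is generally sensitive to the scale of the input features. If the inputs have not already been standardized earlier in the notebook, one option is to wrap a scaler and the model in a pipeline. This is a sketch on synthetic data; the feature ranges, the `C` value, and all `demo_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data (hypothetical): a small-range feature that drives the target
# and a large-range feature that is pure noise.
rng = np.random.default_rng(3)
X_demo = np.column_stack([rng.uniform(0, 1, 300), rng.uniform(0, 1000, 300)])
y_demo = 10.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=3)

# Without scaling, the large-range feature dominates the kernel distances.
demo_svr_raw = SVR(kernel="rbf", C=10.0).fit(X_tr, y_tr)

# With scaling inside a pipeline, both features contribute on the same scale.
demo_svr_scaled = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)).fit(X_tr, y_tr)

demo_r2_raw = demo_svr_raw.score(X_te, y_te)
demo_r2_scaled = demo_svr_scaled.score(X_te, y_te)

print(f"Test R2 without scaling: {demo_r2_raw:.2f}")
print(f"Test R2 with scaling:    {demo_r2_scaled:.2f}")
```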

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***