<a href="https://colab.research.google.com/github/Pawanme9034/Bike_Sharing_Demand_Prediction-Capstone_Project/blob/main/Bike_Sharing_Demand_Prediction_Capstone_ProjectML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Bike Sharing Demand Prediction



##### **Project Type**    - EDA/Regression
##### **Contribution**    - Individual
##### **Name-**  Pawan Kumar Singh


# **Project Summary -**

### Rental bikes have been introduced in many cities to improve mobility and comfort. Making a rental bike available and accessible to the public at the right time matters because it reduces waiting time, so providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is predicting the bike count required at each hour to keep that supply stable.


# **GitHub Link -**

https://github.com/Pawanme9034/Bike_Sharing_Demand_Prediction-Capstone_Project/blob/main/Bike_Sharing_Demand_Prediction_Capstone_ProjectML.ipynb

# **Problem Statement**


The goal is to predict the number of rental bikes required at each hour so that the city can maintain a stable supply of rental bikes.


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
#let's import the modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from datetime import datetime
import datetime as dt


from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MultiLabelBinarizer

from sklearn.linear_model import LinearRegression

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV

from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import log_loss

import warnings
warnings.filterwarnings('ignore')



### Dataset Loading

In [None]:
# Load Dataset
import requests
from io import StringIO
# uploading data through Github directly
url = "https://raw.githubusercontent.com/Pawanme9034/Bike_Sharing_Demand_Prediction-Capstone_Project/main/SeoulBikeData.csv"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
req = requests.get(url, headers=headers)
data = StringIO(req.text)

bike_df=pd.read_csv(data)

### Dataset First View

In [None]:
# Dataset First Look
bike_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
bike_df.shape

### Dataset Information

In [None]:
# Dataset Info
bike_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
value=len(bike_df[bike_df.duplicated()])
value

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
bike_df.isnull().sum()

### What did you know about your dataset?

1. The dataset contains 8760 rows and 14 columns.
2. The data is organized as hourly records, each with a date and an hour.
3. There are no missing or null values in the dataset.
4. dtypes: float64(6), int64(4), object(4)
5. Memory usage: 848.3+ KB
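
The checks above can be gathered into one small helper. This is a minimal sketch, not part of the original notebook: `summarize_frame` is a hypothetical name, and the toy frame only stands in for `bike_df`.

```python
import pandas as pd

def summarize_frame(df: pd.DataFrame) -> dict:
    """Collect the basic facts checked above (hypothetical helper)."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isnull().sum().sum()),
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
    }

# Toy frame standing in for bike_df (illustrative values only)
toy = pd.DataFrame({
    "Hour": [0, 1, 1],
    "Temperature": [-5.2, -5.5, -5.5],
    "Seasons": ["Winter", "Winter", "Winter"],
})
summary = summarize_frame(toy)
print(summary)
```

Calling `summarize_frame(bike_df)` in the notebook would return the same fields for the real data.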



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
bike_df.columns

In [None]:
# Dataset Describe
bike_df.describe()

### Variables Description

**Date** : *The date of the observation, covering 365 days from 01/12/2017 to 30/11/2018, formatted as DD/MM/YYYY, type : str. It needs to be converted to datetime format.*

**Rented Bike Count** : *Number of bikes rented per hour. This is the dependent variable that we need to predict, type : int*

**Hour**: *The hour of the day, from 0 to 23, type : int. It needs to be converted to a categorical data type.*

**Temperature(°C)**: *Temperature in Celsius, type : Float*

**Humidity(%)**: *Humidity in the air in %, type : int*

**Wind speed (m/s)** : *Speed of the wind in m/s, type : Float*

**Visibility (10m)**: *Visibility in units of 10 m, type : int*

**Dew point temperature(°C)**: *Temperature to which the air must cool to become saturated with water vapor, type : Float*

**Solar Radiation (MJ/m2)**: *Solar radiation in MJ/m2, type : Float*

**Rainfall(mm)**: *Amount of rainfall in mm, type : Float*

**Snowfall (cm)**: *Amount of snowfall in cm, type : Float*

**Seasons**: *Season of the year, type : str. There are only 4 seasons in the data.*

**Holiday**: *Whether the day is a holiday or not, type : str*

**Functioning Day**: *Whether the day is a functioning day or not, type : str*
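
Because the dates are stored as DD/MM/YYYY strings, parsing them with an explicit format is the safest route. The snippet below is a small sketch on illustrative date strings rather than the real column; the variable names are assumptions made for the example.

```python
import pandas as pd

# Illustrative date strings in the dataset's DD/MM/YYYY layout
dates = pd.Series(["01/12/2017", "02/12/2017", "30/11/2018"])

# An explicit format avoids day/month ambiguity (01/12 is 1 December, not 12 January)
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

months = parsed.dt.month.tolist()
day_names = parsed.dt.day_name().tolist()
print(months, day_names)
```

Once the column is a datetime, the `.dt` accessor exposes the year, month and weekday name without any manual string handling.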



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print(bike_df.apply(lambda col: col.unique()))

In [None]:
#print the unique value
bike_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
#Rename the complex columns name
bike_df=bike_df.rename(columns={'Rented Bike Count':'Rented_Bike_Count',
                                'Temperature(°C)':'Temperature',
                                'Temperature(�C)':'Temperature',
                                'Humidity(%)':'Humidity',
                                'Wind speed (m/s)':'Wind_speed',
                                'Visibility (10m)':'Visibility',
                                'Dew point temperature(°C)':'Dew_point_temperature',
                                'Dew point temperature(�C)':'Dew_point_temperature',
                                'Solar Radiation (MJ/m2)':'Solar_Radiation',
                                'Rainfall(mm)':'Rainfall',
                                'Snowfall (cm)':'Snowfall',
                                'Functioning Day':'Functioning_Day'})

In [None]:
bike_df.columns

In [None]:
bike_df.head(2)

In [None]:
# Parse the "Date" strings (DD/MM/YYYY) into datetime values
bike_df['Date'] = pd.to_datetime(bike_df['Date'], format="%d/%m/%Y")
bike_df['Date']

In [None]:
bike_df['year'] = bike_df['Date'].dt.year
bike_df['month'] = bike_df['Date'].dt.month
bike_df['day'] = bike_df['Date'].dt.day_name()

In [None]:
#creating a new column of "weekdays_weekend" and drop the column "Date","day","year"
bike_df['weekdays_weekend']=bike_df['day'].apply(lambda x : 1 if x=='Saturday' or x=='Sunday' else 0 )
bike_df=bike_df.drop(columns=['Date','day','year'],axis=1)


In [None]:
bike_df['weekdays_weekend'].value_counts()

* ***We parse the "Date" column and derive three columns from it: "year", "month" and "day".***
* ***The "year" column has only 2 unique values because the data runs from December 2017 to November 2018. Treating this span as a single year, the "year" column adds nothing, so we drop it.***
* ***The "day" column holds the name of the weekday. For this analysis we do not need the individual day, only whether it is a weekday or a weekend, so we encode that in "weekdays_weekend" and drop the "day" column.***
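
An alternative way to build the weekend flag, sketched here on a few illustrative dates, is to compare the numeric weekday instead of matching day names. This is an assumption about how one might write it, not the notebook's own code.

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["01/12/2017", "02/12/2017", "03/12/2017", "04/12/2017"]),
    format="%d/%m/%Y",
)

# dayofweek counts Monday as 0, so Saturday is 5 and Sunday is 6
is_weekend = (dates.dt.dayofweek >= 5).astype(int)
print(is_weekend.tolist())
```

The numeric comparison does not depend on the locale used for day names, which can matter if the notebook runs on a differently configured machine.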

In [None]:
bike_df.head(2)

In [None]:
#Change the int64 columns into category columns
cols=['Hour','month','weekdays_weekend']
for col in cols:
  bike_df[col]=bike_df[col].astype('category')

In [None]:
#assign the numerical columns to a variable
numerical_columns=list(bike_df.select_dtypes(['int64','float64']).columns)
numerical_features=pd.Index(numerical_columns)
numerical_features

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
bike_df.head(2)

In [None]:
bike_df.info()


#### Chart - 1

In [None]:
bike_df.head(2)

In [None]:
# Chart - 1 visualization code
fig,ax=plt.subplots(figsize=(20,8))
sns.barplot(data=bike_df,x='month',y='Rented_Bike_Count',ax=ax,capsize=.2,hue='Seasons')
ax.set(title='Count of Rented Bikes According to Month')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

# Hourly Distribution

In [None]:
# Chart - 2 visualization code
fig,ax=plt.subplots(figsize=(20,8))
sns.pointplot(data=bike_df,x='Hour',y='Rented_Bike_Count',hue='weekdays_weekend',ax=ax)
ax.set(title='Count of Rented Bikes by Hour, Split by Weekday/Weekend')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
bike_df.head(2)

In [None]:
# Chart - 3 visualization code
fig, ax = plt.subplots(figsize=(20, 8))

sns.barplot(data=bike_df,x='Seasons',y='Rented_Bike_Count',ax=ax,capsize=.2)
ax.set(title='Count of Rented Bikes According to Season')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

# Scatter plot of hourly rentals against snowfall
sns.scatterplot(data=bike_df, y='Rented_Bike_Count', x='Snowfall')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

import seaborn as sns
import matplotlib.pyplot as plt

# Set the style of the plot
sns.set_style('darkgrid')

# Create the scatter plot
sns.scatterplot(data=bike_df, x='Temperature', y='Rented_Bike_Count', color='blue', alpha=0.5, marker='o', s=50)

# Customize the plot
plt.xlabel('Temperature (°C)')
plt.ylabel('Rented Bike Count')
plt.title('Scatter Plot of Temperature vs Rented Bike Count')
plt.legend(['Bike Data'], loc='upper right')

# Adjust the plot aesthetics
plt.tight_layout()

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
import seaborn as sns

# Set the style of the plot
sns.set_style('darkgrid')

# Create the line plot
sns.lineplot(data=bike_df, x='Solar_Radiation', y='Rented_Bike_Count', color='green')

# Customize the plot
plt.xlabel('Solar Radiation')
plt.ylabel('Mean Rented Bike Count')
plt.title('Mean Rented Bike Count by Solar Radiation')

# Adjust the plot aesthetics
plt.tight_layout()

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Select the target column before averaging so non-numeric columns are never aggregated
bike_df.groupby('Wind_speed')['Rented_Bike_Count'].mean().plot()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code
for col in numerical_features:
  fig,ax=plt.subplots(figsize=(10,6))
  sns.regplot(x=bike_df[col],y=bike_df['Rented_Bike_Count'],scatter_kws={"color": 'orange'}, line_kws={"color": "black"})

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Plot the distribution of every numerical column
# (histplot replaces the deprecated distplot)
for col in numerical_features:
  sns.histplot(bike_df[col], kde=True)
  plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
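# Restrict to numerical columns; recent pandas versions raise on string columns in corr()
bike_df[numerical_features].corr()['Rented_Bike_Count']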

In [None]:
# Correlation Heatmap visualization code
## plot the Correlation matrix
plt.figure(figsize=(20,8))
correlation=bike_df[numerical_features].corr()
mask = np.triu(np.ones_like(correlation, dtype=bool))
sns.heatmap((correlation),mask=mask, annot=True,cmap='coolwarm')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# sns.pairplot(bike_df,hue='Seasons',corner=True)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

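Null hypothesis (H0): Rented_Bike_Count follows a normal distribution.

Alternate hypothesis (H1): Rented_Bike_Count does not follow a normal distribution.

Assumption: Observations are identically distributed (all elements in the test have an equal probability of occurring).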

In [None]:
#Checking the histogram
plt.figure(figsize=(14, 6))
plt.hist(bike_df['Rented_Bike_Count'])
plt.show()


#### 2. Perform an appropriate statistical test.

In [None]:
# Shapiro-Wilk normality test
from scipy.stats import shapiro

DataToTest = bike_df['Rented_Bike_Count']

stat, p = shapiro(DataToTest)

print('stat=%.2f, p=%.30f' % (stat, p))

if p > 0.05:
    print('Normal distribution')
else:
    print('Not a normal distribution')

##### Which statistical test have you done to obtain P-Value?

Normality test using the Shapiro-Wilk test, which checks whether the data is normally distributed.

##### Why did you choose the specific statistical test?

The Shapiro-Wilk test is used to assess the normality of a dataset. It helps determine whether the data follows a normal distribution or deviates significantly from it.
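
To see how the test separates the two cases, here is a small sketch on synthetic samples (not the bike data). The sample sizes and seed are arbitrary choices for illustration. Note that SciPy's documentation cautions that the p-value may not be accurate for samples larger than 5000, which applies to the 8760-row column tested above.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=500)    # clearly right-skewed sample
bell = rng.normal(loc=0.0, scale=1.0, size=500)  # sample drawn from a normal distribution

stat_skewed, p_skewed = shapiro(skewed)
stat_bell, p_bell = shapiro(bell)
print("skewed: W=%.3f, p=%.3g" % (stat_skewed, p_skewed))
print("normal: W=%.3f, p=%.3g" % (stat_bell, p_bell))
```

A W statistic close to 1 indicates data that looks normal; a small p-value is evidence against normality.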

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

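Null hypothesis (H0): There is no correlation between Rented_Bike_Count and Temperature.

Alternate hypothesis (H1): Rented_Bike_Count and Temperature are correlated.

Assumption: Identical and normal distribution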

#### 2. Perform an appropriate statistical test.

In [None]:
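# Restrict to numerical columns; recent pandas versions raise on string columns in corr()
bike_df[numerical_features].corr()['Rented_Bike_Count']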

In [None]:
FirstSample =bike_df[1:30]['Rented_Bike_Count']
SecondSample = bike_df[1:30]['Temperature']

plt.plot(FirstSample,SecondSample)
plt.show()

In [None]:
#Spearman Rank Correlation
from scipy.stats import spearmanr
stat, p = spearmanr(FirstSample, SecondSample)

print('stat=%.3f, p=%.5f' % (stat, p))
if p > 0.05:
    print('no significant correlation')
else:
    print('significant correlation')

In [None]:
#pearson correlation
from scipy.stats import pearsonr
stat, p = pearsonr(FirstSample, SecondSample)

print('stat=%.3f, p=%.5f' % (stat, p))
if p > 0.05:
    print('no significant correlation')
else:
    print('significant correlation')

##### Which statistical test have you done to obtain P-Value?

Answer.
Correlation Test - Pearson and Spearman’s Rank Correlation

##### Why did you choose the specific statistical test?

Pearson correlation is used to measure the strength and direction of a linear relationship between two continuous variables.

Spearman's rank correlation is used to measure the strength and direction of a monotonic relationship between two variables, which can be continuous or ranked.
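
The difference between the two coefficients shows up on data that is monotonic but not linear. The following sketch uses made-up values (an exponential curve), chosen only to illustrate the contrast.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11, dtype=float)
y = np.exp(x)   # strictly increasing but far from linear

r_pearson, _ = pearsonr(x, y)
rho_spearman, _ = spearmanr(x, y)
print("Pearson r = %.3f, Spearman rho = %.3f" % (r_pearson, rho_spearman))
```

Spearman works on ranks, so any strictly increasing relationship gives a coefficient of 1, while Pearson is pulled below 1 by the curvature.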

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
contingency_data = pd.crosstab(bike_df['Seasons'], bike_df['Holiday'],margins = False)
contingency_data

In [None]:
# Perform Statistical Test to obtain P-Value
import scipy.stats as stats

# Perform the chi-square test of independence
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_data)

# Print the chi-square test result
print("Chi-square statistic:", chi2)
print("p-value:", p_value)
print("Degrees of freedom:", dof)

# Compare this test's own p-value (p_value, not the earlier p) with the 0.05 level
if p_value > 0.05:
    print('independent categories')
else:
    print('dependent categories')

##### Which statistical test have you done to obtain P-Value?

A chi-square test of independence was used. Based on its results, the chi-square statistic is 122.59, the p-value is approximately 2.14e-26, and the degrees of freedom is 3.

Since the p-value is extremely small (smaller than the typical significance level of 0.05), we can reject the null hypothesis of independence between the "Seasons" and "Holiday" variables. This suggests that there is a significant association or dependency between the two variables in the dataset.



##### Why did you choose the specific statistical test?

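The chi-square test of independence is suited to checking whether two categorical variables are associated. Here it indicates that the occurrence of seasons and holidays in the dataset is not independent: the variables "Seasons" and "Holiday" are related, and the difference in their frequencies is statistically significant.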
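
For intuition, the statistic can be computed by hand: each expected count is the row total times the column total divided by the grand total. The sketch below uses an illustrative 2x2 table, not the real Seasons x Holiday counts.

```python
import numpy as np

# Illustrative 2x2 table of observed counts (not the real Seasons x Holiday data)
observed = np.array([[10, 20],
                     [30, 40]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected count per cell if rows and columns were independent
expected = row_totals @ col_totals / total

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(expected, chi2_stat, dof)
```

On a 2x2 table `scipy.stats.chi2_contingency` applies Yates' continuity correction by default, so its statistic would differ from this uncorrected value; tables with more than one degree of freedom, such as the 4x2 table above, are not corrected.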

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
pip install missingno

In [None]:

import missingno as msno
import pandas as pd

# Visualize where values are missing across all columns of bike_df
msno.matrix(bike_df)


In [None]:
# Handling Missing Values & Missing Value Imputation
print(bike_df.isna().sum())

#### What all missing value imputation techniques have you used and why did you use those techniques?

Luckily, there are no missing values in the dataset, so I did not use any missing value imputation technique.

### 2. Handling Outliers

### Rented_Bike_Count is skewed, so I use a boxplot to check for outliers

In [None]:
# Skewness is only defined for numerical columns
bike_df[numerical_features].skew()

In [None]:
bike_df['Rented_Bike_Count'].skew()

In [None]:
# Handling Outliers & Outlier treatments
#Boxplot of Rented Bike Count to check outliers
plt.figure(figsize=(10,6))
plt.ylabel('Rented_Bike_Count')
sns.boxplot(x=bike_df['Rented_Bike_Count'])
plt.show()

The above boxplot shows that there are outliers in the Rented Bike Count column.

In [None]:
plt.figure(figsize=(10, 8))

# Distribution of the square-root-transformed rental count
# (histplot replaces the deprecated distplot)
sqrt_count = np.sqrt(bike_df['Rented_Bike_Count'])
ax = sns.histplot(sqrt_count, kde=True, stat="density", color="purple")
ax.axvline(sqrt_count.mean(), color='blue', linestyle='dashed', linewidth=2)     # mean
ax.axvline(sqrt_count.median(), color='green', linestyle='dashed', linewidth=2)  # median

# Set the labels after plotting so seaborn does not overwrite them
ax.set_xlabel('sqrt(Rented Bike Count)')
ax.set_ylabel('Density')

plt.show()


In [None]:
#After applying sqrt on Rented Bike Count, check whether we still have outliers
plt.figure(figsize=(10,6))

sns.boxplot(x=np.sqrt(bike_df['Rented_Bike_Count']))
plt.xlabel('sqrt(Rented_Bike_Count)')
plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

np.sqrt(bike_df['Rented_Bike_Count']) will return a new Series object with the square root of each value in the 'Rented_Bike_Count' column.


This transformation is often applied to data to reduce skewness, as taking the square root can help normalize the distribution and make it more symmetric.
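
The effect on skewness can be checked numerically. This is a sketch on synthetic right-skewed data (an exponential sample with an arbitrary seed), not the real rental counts. It also shows the back-transformation, since a model trained on the square-root target predicts on the square-root scale.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
counts = pd.Series(rng.exponential(scale=500.0, size=5000))  # synthetic right-skewed "counts"

skew_before = counts.skew()
skew_after = np.sqrt(counts).skew()
print("skew before: %.2f, after sqrt: %.2f" % (skew_before, skew_after))

# A model trained on sqrt(y) predicts on the sqrt scale; square to get back to counts
pred_sqrt = np.array([10.0, 20.0])
pred_counts = pred_sqrt ** 2
```

Any error metric meant to be read in units of bikes should be computed after squaring the predictions.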


### 3. Categorical Encoding

In [None]:
#Assign all categorical features to a variable
categorical_features = bike_df.select_dtypes(include=['object', 'category']).columns
categorical_features

In [None]:
bike_df_copy = pd.get_dummies(bike_df, columns=categorical_features, drop_first=True)


#### What all categorical encoding techniques have you used & why did you use those techniques?

One-hot encoding is a common preprocessing step for categorical features in machine learning models. By converting categorical variables into binary columns, it allows machine learning algorithms to work with categorical data more effectively.
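
A toy example makes the effect of `drop_first=True` concrete. The frame below is illustrative and not taken from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Seasons": ["Winter", "Spring", "Summer", "Autumn"],
    "Temperature": [-3.0, 12.0, 27.0, 14.0],
})

# drop_first=True removes one level per feature (here "Autumn", the first alphabetically)
encoded = pd.get_dummies(toy, columns=["Seasons"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level avoids perfectly collinear dummy columns, which matters for linear regression; a row with all three season columns at zero represents the dropped level.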

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

In [None]:
bike_df_copy.columns

# Train test split

In [None]:
X = bike_df_copy.drop(columns=['Rented_Bike_Count'], axis=1)
y = np.sqrt(bike_df_copy['Rented_Bike_Count'])
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=0)


In [None]:
# With the following function we can select highly correlated features.
# For each correlated pair it flags the later column, so the earlier one is kept.

def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # compare the absolute coefficient value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

In [None]:
corr_features = correlation(X_train, 0.7)
len(set(corr_features))

In [None]:
corr_features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# drop() returns a new DataFrame, so assign the result back
X_train = X_train.drop(columns=list(corr_features))
X_test = X_test.drop(columns=list(corr_features))

##### What all feature selection methods have you used  and why?

Correlated features are features that exhibit a high degree of linear relationship or dependency between them. Having highly correlated features can introduce multicollinearity in a model, which can affect the model's performance and interpretability.

##### Which all features you found important and why?

The remaining features are not strongly correlated with each other, so they are kept. Highly correlated features are dropped because they introduce redundancy, reduce model interpretability, can make the model unstable, and increase the risk of overfitting. Selecting uncorrelated features can improve model performance, enhance interpretability, and give more reliable and generalizable results.
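
A compact, vectorized variant of the correlation filter is sketched below on a toy frame. It assumes a threshold of 0.7 on the absolute coefficient; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

a = np.arange(10, dtype=float)
toy = pd.DataFrame({
    "a": a,
    "b": -2.0 * a + 1.0,                   # perfectly (negatively) correlated with a
    "c": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],   # only weakly related to a
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Taking the absolute value first means strongly negative correlations are caught as well, and assigning the result of `drop` keeps the reduced frame for later steps.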

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Plotting the distplots without any transformation

for col in X_train.columns:
    plt.figure(figsize=(14,4))
    plt.subplot(121)
    sns.distplot(X_train[col])
    plt.title(col)

    plt.subplot(122)
    stats.probplot(X_train[col], dist="norm", plot=plt)
    plt.title(col)

    plt.show()

In [None]:
# Transform Your data
# Apply Yeo-Johnson transform
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PowerTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import pandas as pd

pt1 = PowerTransformer()
X_train_transformed2 = pt1.fit_transform(X_train)
X_test_transformed2 = pt1.transform(X_test)

lr = LinearRegression()

# Perform cross-validation with 5 folds
cv_scores = cross_val_score(lr, X_train_transformed2, y_train, cv=5, scoring='r2')

# Train the model on the full training set
lr.fit(X_train_transformed2, y_train)

# Predict on the test set
y_pred3 = lr.predict(X_test_transformed2)

r2 = r2_score(y_test, y_pred3)
print("R-squared score:", r2)

df = pd.DataFrame({'cols': X_train.columns, 'Yeo_Johnson_lambdas': pt1.lambdas_})
# print(df)

print("Cross-validated R-squared scores:", cv_scores)
mean_r2 = np.mean(cv_scores)
print("Mean R-squared score:", mean_r2)


In [None]:

# pd.DataFrame({'cols':X_train.columns,'Yeo_Johnson_lambdas':pt1.lambdas_})

In [None]:
X_train_transformed = pd.DataFrame(X_train_transformed2,columns=X_train.columns)

for col in X_train_transformed.columns:
    plt.figure(figsize=(14,4))
    plt.subplot(121)
    sns.distplot(X_train[col])
    plt.title(col)

    plt.subplot(122)
    sns.distplot(X_train_transformed[col])
    plt.title(col)

    plt.show()

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# fit the scaler to the train set, it will learn the parameters
scaler.fit(X_train)

# transform train and test sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Instantiate a LinearRegression model
lr = LinearRegression()

# Fit the model using the scaled features (X_train_scaled) and the continuous labels (y_train)
lr.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred = lr.predict(X_test_scaled)

# Calculate the mean squared error (MSE) on the test set
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
from sklearn.metrics import r2_score

# Assuming y_test contains the true target values and y_pred contains the predicted values
r2 = r2_score(y_test, y_pred)
print("R-squared score:", r2)
from sklearn.model_selection import cross_val_score

# Assuming X_train_scaled and y_train contain the scaled training data and target values, respectively
# Assuming lr is your trained linear regression model

# Perform cross-validation with 5 folds
cv_scores = cross_val_score(lr, X_train_scaled, y_train, cv=5, scoring='r2')

# Print the cross-validated R-squared scores
print("Cross-validated R-squared scores:", cv_scores)
print("Mean R-squared score:", cv_scores.mean())


In [None]:
X_train

In [None]:
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

ax1.scatter(X_train['Visibility'], X_train['Temperature'])
ax1.set_title("Before Scaling")
ax2.scatter(X_train_scaled['Visibility'], X_train_scaled['Temperature'],color='red')
ax2.set_title("After Scaling")
plt.show()

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

# before scaling
ax1.set_title('Before Scaling')
sns.kdeplot(X_train['Visibility'], ax=ax1)
sns.kdeplot(X_train['Temperature'], ax=ax1)

# after scaling
ax2.set_title('After Standard Scaling')
sns.kdeplot(X_train_scaled['Visibility'], ax=ax2)
sns.kdeplot(X_train_scaled['Temperature'], ax=ax2)
plt.show()

# Comparison of Distributions

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

# before scaling
ax1.set_title('Temperature Distribution Before Scaling')
sns.kdeplot(X_train['Temperature'], ax=ax1)

# after scaling
ax2.set_title('Temperature Distribution After Standard Scaling')
sns.kdeplot(X_train_scaled['Temperature'], ax=ax2)
plt.show()

# Why scaling is important?

In [None]:
from sklearn.preprocessing import MinMaxScaler

MinMaxScaler_scaler = MinMaxScaler()

# fit the scaler to the train set, it will learn the parameters
MinMaxScaler_scaler.fit(X_train)

# transform train and test sets
X_train_scaled_MinMaxScaler = MinMaxScaler_scaler.transform(X_train)
X_test_scaled_MinMaxScaler = MinMaxScaler_scaler.transform(X_test)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Create an instance of the LinearRegression model
lr = LinearRegression()

# Fit the model on the scaled training data
lr.fit(X_train_scaled_MinMaxScaler, y_train)

# Predict the target variable for the scaled test data
y_pred_MinMaxScaler = lr.predict(X_test_scaled_MinMaxScaler)

# Calculate the mean squared error (MSE)
mse = mean_squared_error(y_test, y_pred_MinMaxScaler)

print("Mean Squared Error:", mse)

# Calculate the R-squared score
r2 = r2_score(y_test, y_pred_MinMaxScaler)
print("R-squared score:", r2)

# Perform cross-validation with 5 folds
cv_scores_MinMaxScaler = cross_val_score(lr, X_train_scaled_MinMaxScaler, y_train, cv=5, scoring='r2')
mean_r2_MinMaxScaler = np.mean(cv_scores_MinMaxScaler)
print("Mean cross-validated R-squared score:", mean_r2_MinMaxScaler)


In [None]:
X_train_scaled_MinMaxScaler = pd.DataFrame(X_train_scaled_MinMaxScaler, columns=X_train.columns)
X_test_scaled_MinMaxScaler = pd.DataFrame(X_test_scaled_MinMaxScaler, columns=X_test.columns)

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

ax1.scatter(X_train['Visibility'], X_train['Temperature'])
ax1.set_title("Before Scaling")
ax2.scatter(X_train_scaled_MinMaxScaler['Visibility'], X_train_scaled_MinMaxScaler['Temperature'],color='red')
ax2.set_title("After Scaling")
plt.show()

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

# before scaling
ax1.set_title('Before Scaling')
sns.kdeplot(X_train['Visibility'], ax=ax1)
sns.kdeplot(X_train['Temperature'], ax=ax1)

# after scaling
ax2.set_title('After Min-Max Scaling')
sns.kdeplot(X_train_scaled_MinMaxScaler['Visibility'], ax=ax2)
sns.kdeplot(X_train_scaled_MinMaxScaler['Temperature'], ax=ax2)
plt.show()

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))

# before scaling
ax1.set_title('Temperature Distribution Before Scaling')
sns.kdeplot(X_train['Temperature'], ax=ax1)

# after scaling
ax2.set_title('Temperature Distribution After Min-Max Scaling')
sns.kdeplot(X_train_scaled_MinMaxScaler['Temperature'], ax=ax2)
plt.show()

##### Which method have you used to scale your data and why?

I used standardization because it performed slightly better than normalization (min-max scaling).
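A comparison like this can be checked by cross-validating the same model with each scaler inside a pipeline. The sketch below is illustrative only: it uses synthetic data from `make_regression` rather than the bike-sharing features, and `X_demo`, `y_demo` and `scaler_scores` are hypothetical names. For ordinary least squares with an intercept, both scalers are affine rescalings of each feature, so the two scores are expected to be nearly identical; the choice of scaler tends to matter more for regularized or distance-based models.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data (assumption): not the bike-sharing features.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

scaler_scores = {}
for name, scaler_cls in [("standardization", StandardScaler), ("normalization", MinMaxScaler)]:
    # Keeping the scaler inside the pipeline refits it on each training fold,
    # so the validation fold never influences the scaling parameters.
    pipe = make_pipeline(scaler_cls(), LinearRegression())
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
    scaler_scores[name] = scores.mean()
    print(f"{name}: mean CV R-squared = {scores.mean():.4f}")
```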

## Applying the standardization method to the model inputs

In [None]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

### 7. Dimensionality Reduction

The "Curse of Dimensionality" refers to the challenges that arise when dealing with high-dimensional data. It causes sparsity of data, increased computational complexity, overfitting, and difficulties with distance-based algorithms. Mitigation strategies include feature selection, dimensionality reduction, regularization, and leveraging domain knowledge.
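One common way to decide how many dimensions are worth keeping is to look at the cumulative explained variance of a full PCA fit. The following is a minimal sketch under stated assumptions: the data is synthetic (ten features generated from three latent factors), the 95% threshold is an arbitrary illustrative choice, and names such as `X_demo` and `n_keep` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in (assumption): 10 observed features driven by 3 latent factors.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X_demo)
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components needed for 95% variance:", n_keep)
```

If only a few components are needed to reach the threshold, dimensionality reduction is likely to help; if nearly all of them are needed, it is unlikely to buy much.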

##### Do you think that dimensionality reduction is needed? Explain Why?

Apply PCA to the training set:

In [None]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score


In [None]:
for i in range(1, 47):
  # Keep the first i principal components
  pca = PCA(n_components=i)
  X_train_pca = pca.fit_transform(X_train)
  # Fit a linear regression model on the transformed training set
  lrp = LinearRegression()
  lrp.fit(X_train_pca, y_train)
  # Transform the test set using the same PCA transformation
  X_test_pca = pca.transform(X_test)
  # Make predictions on the transformed test set
  y_pred = lrp.predict(X_test_pca)
  # Evaluate the performance of the model using the R-squared score
  r2 = r2_score(y_test, y_pred)
  print("n_components:", i, "R-squared score:", r2)

In [None]:
pca = PCA(n_components=42)  # Specify the number of components to keep
X_train_pca = pca.fit_transform(X_train)
#Fit a linear regression model on the transformed training set:
lrp = LinearRegression()
lrp.fit(X_train_pca, y_train)
# Transform the test set using the same PCA transformation:
X_test_pca = pca.transform(X_test)
# Make predictions on the transformed test set:
y_pred = lrp.predict(X_test_pca)
# Evaluate the performance of the model using the R-squared score:
r2 = r2_score(y_test, y_pred)
print("R-squared score:", r2)


In [None]:
# transforming in 3D
pca = PCA(n_components=3)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

In [None]:
import plotly.express as px
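# Colour the points by the numeric target so Plotly uses a continuous colour scale
y_train_pca = np.ravel(y_train)
# Pass only the arrays: df has a different number of rows than the training set
fig = px.scatter_3d(x=X_train_pca[:,0], y=X_train_pca[:,1], z=X_train_pca[:,2],
              color=y_train_pca)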
fig.update_layout(
    margin=dict(l=20, r=20, t=20, b=20),
    paper_bgcolor="LightSteelBlue",
)
fig.show()

In [None]:
plt.plot(np.cumsum(pca.explained_variance_ratio_))

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
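The split used by the models above happens earlier in the notebook, and its ratio is not shown in this section. As a hedged sketch of what this cell could contain, the following applies `train_test_split` with an 80/20 ratio, which is an assumption, to synthetic stand-in arrays (`X_demo` and `y_demo` are hypothetical names).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in data (assumption): 1000 rows, 5 features.
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.normal(size=1000)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print("Train shape:", X_tr.shape, "Test shape:", X_te.shape)
```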

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
# Fit the Algorithm
reg= LinearRegression()
reg.fit(X_train, y_train)


In [None]:
#check the score
reg.score(X_train, y_train)


In [None]:
# Predict on the model
# Get predictions for the training and test sets
y_pred_train=reg.predict(X_train)
y_pred_test=reg.predict(X_test)
print("Predicted target values:", y_pred_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Calculate the evaluation metric scores
train_score = r2_score(y_train, y_pred_train)
print(train_score)
test_score = r2_score(y_test, y_pred_test)
print(test_score)

In [None]:
# Cross checking with cross val score
np.mean(cross_val_score(reg,X_train, y_train,scoring='r2',cv=10))

In [None]:
# Import the metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Calculate MSE on the training set
MSE_lr = mean_squared_error(y_train, y_pred_train)
print("MSE :", MSE_lr)

# Calculate RMSE
RMSE_lr = np.sqrt(MSE_lr)
print("RMSE :", RMSE_lr)

# Calculate MAE
MAE_lr = mean_absolute_error(y_train, y_pred_train)
print("MAE :", MAE_lr)

# Calculate R2 and adjusted R2; n and p come from the training set
# because the predictions being scored are training predictions
r2_lr = r2_score(y_train, y_pred_train)
print("R2 :", r2_lr)
n, p = X_train.shape
Adjusted_R2_lr = 1 - (1 - r2_lr) * (n - 1) / (n - p - 1)
print("Adjusted R2 :", Adjusted_R2_lr)
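The same five metrics are computed for every model in this notebook, so they could be gathered into one helper. The function below is a sketch, not part of the original notebook: `regression_metrics` is a hypothetical name, and the small arrays are made-up values used only to show the call.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, R2 and adjusted R2 for one set of predictions."""
    n_samples = len(y_true)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    # Adjusted R2 penalizes R2 for the number of features used by the model.
    adjusted_r2 = 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2,
        "Adjusted R2": adjusted_r2,
    }


# Made-up values for illustration only.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5, 10.5, 13.0])
metrics_demo = regression_metrics(y_true_demo, y_pred_demo, n_features=2)
for metric_name, value in metrics_demo.items():
    print(f"{metric_name}: {value:.4f}")
```

Calling it once with training predictions and once with test predictions, each time with the matching `y`, keeps the sample count in the adjusted R2 consistent with the data being scored.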


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Define the parameter grid for GridSearchCV

param_grid = {
    'fit_intercept': [True, False],
}

# Fit the Algorithm
# Perform GridSearchCV with cross-validation
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(reg, param_grid, cv=5)
grid_search.fit(X_train, y_train)



In [None]:
# Predict on the model
# Print the best parameters and best score
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

In [None]:

# Perform cross-validation with the best estimator
linear_regression_best = grid_search.best_estimator_
cv_scores = cross_val_score(linear_regression_best, X_train, y_train, cv=10)

# Print the cross-validation scores
print("Cross-Validation Scores:", cv_scores)
print("Mean CV Score:", cv_scores.mean())

# Fit the best estimator on the training data
linear_regression_best.fit(X_train, y_train)

# Evaluate the model on the testing data
test_score = linear_regression_best.score(X_test, y_test)
print("Test Score:", test_score)

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

# **DECISION TREE**

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - 2 Implementation
from sklearn.tree import DecisionTreeRegressor

decision_regressor = DecisionTreeRegressor(max_depth=None,
                                           max_features=12,
                                           max_leaf_nodes=150)

# Fit the Algorithm
decision_regressor.fit(X_train, y_train)

In [None]:

# Get predictions for the training and test sets
y_pred_train_d = decision_regressor.predict(X_train)
y_pred_test_d = decision_regressor.predict(X_test)

In [None]:
# Check the R-squared score on the training set
decision_regressor.score(X_train, y_train)

In [None]:
# Visualizing evaluation Metric Score chart
# Import the metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
print("Model Score:", decision_regressor.score(X_train, y_train))

# Calculate MSE on the training set
MSE_d = mean_squared_error(y_train, y_pred_train_d)
print("MSE :", MSE_d)

# Calculate RMSE
RMSE_d = np.sqrt(MSE_d)
print("RMSE :", RMSE_d)

# Calculate MAE
MAE_d = mean_absolute_error(y_train, y_pred_train_d)
print("MAE :", MAE_d)

# Calculate R2 and adjusted R2; n and p come from the training set
# because the predictions being scored are training predictions
r2_d = r2_score(y_train, y_pred_train_d)
print("R2 :", r2_d)
n, p = X_train.shape
Adjusted_R2_d = 1 - (1 - r2_d) * (n - 1) / (n - p - 1)
print("Adjusted R2 :", Adjusted_R2_d)

In [None]:

# for importance, name in sorted(zip(decision_regressor.feature_importances_, X_train.columns), reverse = True):
  #print (name,importance)
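The loop above relies on `X_train.columns`, which is no longer available once `X_train` has been replaced by the NumPy array returned by `scaler.transform`. One possible workaround, sketched here on synthetic data, is to keep the feature names in a separate list and pair them with `feature_importances_`. The names `feature_names`, `tree_demo` and the `feature_i` labels are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 5 features, 2 of them informative.
X_arr, y_demo = make_regression(
    n_samples=300, n_features=5, n_informative=2, random_state=0
)
# In the notebook this list would be captured before scaling,
# e.g. from the DataFrame's column names (hypothetical names here).
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]

tree_demo = DecisionTreeRegressor(max_leaf_nodes=50, random_state=0)
tree_demo.fit(X_arr, y_demo)

# Pair each importance with its name and sort from most to least important.
importances = pd.Series(
    tree_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```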


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Define the parameter grid for GridSearchCV
param_grid = {
    'max_depth': [None, 5,8,9, 10, 15,16],
    'max_features': [2, 5,9,10,12],
    'max_leaf_nodes': [100,150,180]
}


In [None]:
# Perform cross-validation with GridSearchCV
# Create the grid search object
grid_search = GridSearchCV(decision_regressor, param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the grid search object to the training data
grid_search.fit(X_train, y_train)

# Get the best estimator from grid search
best_regressor = grid_search.best_estimator_

# Make predictions on the test set using the best model
y_pred = best_regressor.predict(X_test)

# Calculate the mean squared error on the test set
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Print the best parameters found by grid search
print("Best Parameters:", grid_search.best_params_)

In [None]:
# Print the best parameters and best score
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)



In [None]:
# Evaluate the model on the test set (score() returns R-squared for regressors)
best_model = grid_search.best_estimator_
test_r2 = best_model.score(X_test, y_test)
print("Test R-squared: ", test_r2)


In [None]:
# Perform cross-validation with best parameters
# Cross-validate on the training split only, so the test rows stay unseen
cross_val_scores = cross_val_score(best_model, X_train, y_train, cv=5)
print("Cross-Validation Scores: ", cross_val_scores)
print("Average Cross-Validation Score: ", cross_val_scores.mean())

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

# RandomForestRegressor

In [None]:
# ML Model - 3 Implementation
#import the packages
from sklearn.ensemble import RandomForestRegressor
# Create an instance of the RandomForestRegressor
rf_model = RandomForestRegressor()

In [None]:
# Fit the Algorithm
rf_model.fit(X_train,y_train)

In [None]:
# Predict on the model

y_pred_train_r = rf_model.predict(X_train)
y_pred_test_r = rf_model.predict(X_test)

In [None]:
rf_model.score(X_train, y_train)

In [None]:
# Perform cross-validation
# Cross-validate on the training split only, so the test rows stay unseen
scores = cross_val_score(rf_model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)

print("Root Mean Squared Error scores:", rmse_scores)
print("Average RMSE score:", np.mean(rmse_scores))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Import the metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
print("Model Score:", rf_model.score(X_train, y_train))

# Calculate MSE on the training set
MSE_rf = mean_squared_error(y_train, y_pred_train_r)
print("MSE :", MSE_rf)

# Calculate RMSE
RMSE_rf = np.sqrt(MSE_rf)
print("RMSE :", RMSE_rf)

# Calculate MAE
MAE_rf = mean_absolute_error(y_train, y_pred_train_r)
print("MAE :", MAE_rf)

# Calculate R2 and adjusted R2; n and p come from the training set
# because the predictions being scored are training predictions
r2_rf = r2_score(y_train, y_pred_train_r)
print("R2 :", r2_rf)
n, p = X_train.shape
Adjusted_R2_rf = 1 - (1 - r2_rf) * (n - 1) / (n - p - 1)
print("Adjusted R2 :", Adjusted_R2_rf)


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Number of trees in random forest
n_estimators = [20,60,100,120]

# Number of features to consider at every split
max_features = [0.2,0.6,1.0]

# Maximum number of levels in tree
max_depth = [2,8,None]

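# Fraction of the training samples drawn to build each tree
max_samples = [0.5,0.75,1.0]

# Bootstrap samples (max_samples can only be set when bootstrap=True,
# so bootstrap=False is left out of the search)
bootstrap = [True]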

# Minimum number of samples required to split a node
min_samples_split = [2, 5]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2]
# Fit the Algorithm

# Predict on the model

In [None]:
param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
              'max_samples':max_samples,
              'bootstrap':bootstrap,
              'min_samples_split':min_samples_split,
              'min_samples_leaf':min_samples_leaf
             }
print(param_grid)

In [None]:
from sklearn.model_selection import RandomizedSearchCV

rf_grid = RandomizedSearchCV(estimator = rf_model,
                       param_distributions = param_grid,
                       cv = 5,
                       verbose=2,
                       n_jobs = -1)

In [None]:
rf_grid.fit(X_train,y_train)

In [None]:
rf_grid.best_params_

In [None]:
rf_grid.best_score_
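`best_score_` is a mean cross-validated score on the training data, so a natural follow-up is to score the refitted `best_estimator_` on the held-out test set. The sketch below shows that step on synthetic data with a deliberately small search space; the data, the reduced parameter lists and names such as `search` and `test_r2` are assumptions made for illustration, not the notebook's actual search.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption): not the bike-sharing features.
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Small illustrative search space; bootstrap stays at its default (True),
# which is required for max_samples to be valid.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": [20, 60],
        "max_depth": [2, 8, None],
        "max_samples": [0.5, 0.75, 1.0],
    },
    n_iter=4,
    cv=3,
    scoring="r2",
    random_state=0,
)
search.fit(X_tr, y_tr)

# The best estimator is refitted on the full training split by default.
y_pred_te = search.best_estimator_.predict(X_te)
test_r2 = r2_score(y_te, y_pred_te)
test_rmse = np.sqrt(mean_squared_error(y_te, y_pred_te))
print("Best parameters:", search.best_params_)
print("Mean CV R-squared:", search.best_score_)
print("Test R-squared:", test_r2, "Test RMSE:", test_rmse)
```

A large gap between the cross-validated score and the test score would suggest that the tuned model is overfitting the training split.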

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
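A save-and-reload round trip with `joblib` could look like the sketch below. It is illustrative only: the model is a small random forest trained on synthetic data rather than the notebook's best model, the file name `bike_demand_model.joblib` is hypothetical, and a temporary directory is used so that nothing is written next to the notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption); the last 50 rows play the role of unseen data.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
model_demo = RandomForestRegressor(n_estimators=20, random_state=0)
model_demo.fit(X_demo[:150], y_demo[:150])

# Save the fitted model to disk (hypothetical file name).
model_path = os.path.join(tempfile.mkdtemp(), "bike_demand_model.joblib")
joblib.dump(model_demo, model_path)

# Load it back and check that it reproduces the original predictions.
loaded_model = joblib.load(model_path)
X_unseen = X_demo[150:]
same_predictions = np.allclose(
    model_demo.predict(X_unseen), loaded_model.predict(X_unseen)
)
print("Loaded model reproduces predictions:", same_predictions)
```

For deployment, the fitted scaler would need to be saved as well (or bundled with the model in a pipeline), since the model expects scaled inputs.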

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***