Technological Institute of the Philippines | Quezon City - Computer Engineering
--- | ---
Course Code: | CPE 019
Code Title: | Emerging Technologies in CpE 2 - Classifications and Regression
<hr> | <hr>
<u>**CPE019 Assignment (2nd Sem, A.Y. 2023-2024)**</u> | **Assignment 7.1**
**Name** | Cortez, Angelica
**Section** | CPE32S3
**Schedule**: | Wednesday - 10:30am - 1:30pm
**Date Performed**: | 04/07/2024
**Date Submitted**: | 04/11/2024
**Instructor**: | Engr. Roman Richard
<hr>

# Instructions:

Choose any dataset applicable to the classification problem, and also, choose any dataset applicable to the regression problem.


Explain your datasets and the problem being addressed.


For classification, do the following:


* Create a base model


* Evaluate the model with k-fold cross validation


* Improve the accuracy of your model by applying additional hidden layers


For regression, do the following:


* Create a base model


* Improve the model by standardizing the dataset


* Show tuning of layers and neurons (see evaluating small and larger networks)


Submit the link to your Google Colab (make sure that it is accessible to me)

# CLASSIFICATION

# ABOUT THE DATASET
The Raisin dataset describes two raisin varieties grown in Turkey, Kecimen and Besni. It covers 900 raisins, 450 from each variety. Images of the raisins were captured with a computer vision system (CVS) and, after several preprocessing stages, seven morphological features were extracted from each image: area, perimeter, major and minor axis lengths, eccentricity, convex area, and extent. The task is to classify each raisin into its variety from these features, which has practical use in agricultural sorting and quality control.

Features:

* Area - Represents the number of pixels within the boundaries of the raisin, indicating its size.


* Perimeter - Measures the length of the boundary of the raisin, providing information about its shape.


* MajorAxisLength - Indicates the length of the main axis of the raisin, representing its longest dimension.


* MinorAxisLength - Represents the length of the minor axis of the raisin, indicating its shortest dimension.


* Eccentricity - Provides a measure of the eccentricity of the ellipse that best fits the shape of the raisin, offering insights into its elongation.


* ConvexArea - Indicates the number of pixels in the smallest convex shell that can enclose the raisin, providing information about its convexity.


* Extent - Represents the ratio of the area of the raisin to the total number of pixels in the bounding box, indicating how much the raisin fills its bounding box.


Target Variable:

* Class - The target variable indicates the variety of the raisin, with two categories: "Kecimen" and "Besni."



Source of the dataset: https://archive.ics.uci.edu/dataset/850/raisin
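
As a quick sanity check on the feature definitions above, the sketch below recomputes Eccentricity from the two axis lengths using the standard ellipse formula, e = sqrt(1 - (minor/major)^2). That the dataset uses exactly this formula is an assumption, although it agrees with the rows displayed in the preprocessing section. The helper name `ellipse_eccentricity` is made up for illustration.

```python
import math

def ellipse_eccentricity(major_axis, minor_axis):
    """Eccentricity of an ellipse from its full major and minor axis lengths."""
    ratio = minor_axis / major_axis
    return math.sqrt(1.0 - ratio ** 2)

# Axis lengths taken from row 0 of the Raisin dataframe
ecc = ellipse_eccentricity(442.246011, 253.291155)
print(round(ecc, 4))  # the dataset lists 0.819738 for this row
```

A value near 0 means the fitted ellipse is close to a circle, while a value near 1 means the raisin is strongly elongated.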

### Preprocessing of data

In [None]:
pip install scikeras

Collecting scikeras
  Downloading scikeras-0.12.0-py3-none-any.whl (27 kB)
Installing collected packages: scikeras
Successfully installed scikeras-0.12.0


In [None]:
# Import libraries

import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [None]:
# sets the random seed for reproducibility and to produce the same results each time
seed = 7
numpy.random.seed(seed)

In [None]:
# load a dataset from a csv file
raisin_df = pandas.read_csv("/content/Raisin_Dataset.csv")
dataset = raisin_df.values

In [None]:
# displays the entire dataframe
raisin_df

Unnamed: 0,Area,MajorAxisLength,MinorAxisLength,Eccentricity,ConvexArea,Extent,Perimeter,Class
0,87524,442.246011,253.291155,0.819738,90546,0.758651,1184.040,Kecimen
1,75166,406.690687,243.032436,0.801805,78789,0.684130,1121.786,Kecimen
2,90856,442.267048,266.328318,0.798354,93717,0.637613,1208.575,Kecimen
3,45928,286.540559,208.760042,0.684989,47336,0.699599,844.162,Kecimen
4,79408,352.190770,290.827533,0.564011,81463,0.792772,1073.251,Kecimen
...,...,...,...,...,...,...,...,...
895,83248,430.077308,247.838695,0.817263,85839,0.668793,1129.072,Besni
896,87350,440.735698,259.293149,0.808629,90899,0.636476,1214.252,Besni
897,99657,431.706981,298.837323,0.721684,106264,0.741099,1292.828,Besni
898,93523,476.344094,254.176054,0.845739,97653,0.658798,1258.548,Besni


In [None]:
# summarize the dataframe
raisin_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900 entries, 0 to 899
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Area             900 non-null    int64  
 1   MajorAxisLength  900 non-null    float64
 2   MinorAxisLength  900 non-null    float64
 3   Eccentricity     900 non-null    float64
 4   ConvexArea       900 non-null    int64  
 5   Extent           900 non-null    float64
 6   Perimeter        900 non-null    float64
 7   Class            900 non-null    object 
dtypes: float64(5), int64(2), object(1)
memory usage: 56.4+ KB


In [None]:
# checking for null values
raisin_df.isna().sum()

Area               0
MajorAxisLength    0
MinorAxisLength    0
Eccentricity       0
ConvexArea         0
Extent             0
Perimeter          0
Class              0
dtype: int64

In [None]:
# encode class values as integers with specific mapping
encoded_y = raisin_df['Class'] = raisin_df['Class'].map({'Besni': 1, 'Kecimen': 0})

In [None]:
# display 10 random rows to check that the "Class" column is now numeric
raisin_df.sample(10)

Unnamed: 0,Area,MajorAxisLength,MinorAxisLength,Eccentricity,ConvexArea,Extent,Perimeter,Class
256,61463,369.399745,213.61962,0.815832,63117,0.786777,966.493,0
538,145693,591.180144,321.431191,0.839272,151644,0.648528,1595.364,1
259,40403,289.259316,179.223383,0.784922,41209,0.722204,761.949,0
485,85492,437.013969,250.892609,0.818781,89018,0.723344,1182.575,1
453,96582,446.705203,278.325498,0.782172,100113,0.706598,1216.979,1
841,92472,424.419993,282.211505,0.746902,95982,0.746163,1204.61,1
862,105308,473.313393,284.170622,0.799711,108094,0.787014,1259.934,1
891,107486,462.813134,296.091238,0.768571,108914,0.759967,1235.078,1
60,66774,348.557975,246.476257,0.707082,69097,0.692446,1003.374,0
63,64717,342.57671,245.732037,0.696759,66649,0.662541,997.989,0


In [None]:
# split the data to separate features (X) and target labels (Y)
x = raisin_df.iloc[:, :-1].values
y = raisin_df["Class"].values

# split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=11342)

# standardize the features using StandardScaler (fit on the training set only)
norm_data = StandardScaler()
X_train = norm_data.fit_transform(X_train)
X_test = norm_data.transform(X_test)

***ANALYSIS:***

The first step preprocesses the data with pandas, NumPy, scikit-learn, scikeras, and Keras. After installing and importing the required packages, the dataset is loaded from "Raisin_Dataset.csv". Basic exploratory data analysis (EDA) follows: displaying the dataframe, summarizing its columns and types, and checking for null values, of which there are none. The "Class" labels are then mapped to integers (Kecimen = 0, Besni = 1) so the network can consume them, and a sample of ten rows confirms the conversion. Finally, the data is separated into features (X) and target labels (y), split into 80% training and 20% test sets, and standardized with a StandardScaler fitted on the training set only.
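
To make the label mapping and the 80/20 split concrete, here is a minimal NumPy sketch. It is not scikit-learn's `train_test_split` implementation; the label array is a hypothetical stand-in for the real "Class" column, and the seed and variable names are chosen only for illustration.

```python
import numpy as np

# Hypothetical stand-in for the 900 raisin labels (450 per variety)
labels = np.array(["Kecimen"] * 450 + ["Besni"] * 450)

# Same mapping as the notebook: Kecimen -> 0, Besni -> 1
mapping = {"Kecimen": 0, "Besni": 1}
y_demo = np.array([mapping[name] for name in labels])

# Shuffle the row indices, then hold out the first 20% of them for testing
rng = np.random.default_rng(7)
order = rng.permutation(len(y_demo))
n_test = int(round(0.2 * len(y_demo)))
test_idx, train_idx = order[:n_test], order[n_test:]

print(len(train_idx), len(test_idx))  # 720 180
```

With `test_size=0.2` on 900 rows, 720 rows are left for training and 180 for testing.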

### Create a Base Model and Evaluate the model with k-fold cross validation

In [None]:
def base_model():
  # create model
  model = Sequential()
  model.add(Dense(32, input_shape=(7,), activation='relu'))
  model.add(Dense(16, activation='relu'))
  model.add(Dense(1, activation='sigmoid'))
  # compile model
  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  return model

Evaluate the model with k-fold cross validation

In [None]:
estimator = KerasClassifier(model=base_model, epochs=100, batch_size=5, verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
results = cross_val_score(estimator, X_train, y_train, cv=kfold)
print("Base Model: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Base Model: 80.22% (6.63%)


***ANALYSIS:***

This section creates a base neural network and evaluates it with k-fold cross-validation. The base model has three dense layers: two hidden layers with 32 and 16 ReLU units, and a single sigmoid output unit. It is compiled with binary cross-entropy loss and the Adam optimizer, then trained and evaluated on the standardized training set using stratified 10-fold cross-validation. The mean accuracy across the folds is 80.22% with a standard deviation of 6.63%. The mean summarizes the model's typical performance, while the standard deviation shows how much accuracy varies from fold to fold; a spread of 6.63 percentage points means individual folds differ noticeably.
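
The idea behind stratified k-fold splitting can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not scikit-learn's `StratifiedKFold` algorithm; the function name `stratified_folds` and the balanced label array are assumptions made for the example.

```python
import numpy as np

def stratified_folds(y, n_splits, seed=0):
    """Assign each sample to a fold so every fold keeps the class ratio."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        # Deal this class's samples out to the folds as evenly as possible
        for fold, chunk in enumerate(np.array_split(idx, n_splits)):
            fold_of[chunk] = fold
    return fold_of

# Hypothetical balanced labels standing in for the 900 raisins
y_demo = np.array([0] * 450 + [1] * 450)
fold_of = stratified_folds(y_demo, n_splits=10)

for fold in range(10):
    in_fold = y_demo[fold_of == fold]
    print(fold, len(in_fold), int(in_fold.sum()))  # fold id, size, Besni count
```

Each fold serves once as the validation set while the other nine are used for training, so every reported accuracy comes from data the model did not see during that fit.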

### Improve the accuracy of your model by applying additional hidden layers

In [None]:
from keras.layers import Dropout
def apply_model():
    model = Sequential()
    model.add(Dense(32, input_shape=(7,), activation='relu'))
    model.add(Dropout(0.2))  # Dropout layer for regularization
    model.add(Dense(64, activation='relu'))  # Additional hidden layer
    model.add(Dense(32, activation='relu'))  # Additional hidden layer
    model.add(Dense(16, activation='relu'))  # Additional hidden layer
    model.add(Dense(1, activation='sigmoid'))


    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(model=apply_model, epochs=100, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
results = cross_val_score(pipeline, x, encoded_y, cv=kfold)
print("Improved Model: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Improved Model: 86.78% (4.26%)


***ANALYSIS:***

This section tries to improve accuracy by deepening the network. The function "apply_model" defines four hidden layers (32, 64, 32, and 16 units), two more than the base model, and adds a dropout layer with a rate of 0.2 after the first hidden layer to reduce overfitting. The model is wrapped in a pipeline with a StandardScaler and evaluated using stratified 10-fold cross-validation, this time on the full dataset (x, encoded_y) rather than on X_train. Accuracy improved to 86.78% with a standard deviation of 4.26%. This suggests that the added hidden layers and regularization help, although part of the gain may come from the different evaluation data and the per-fold standardization, since the two runs are not strictly like-for-like.
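
To quantify how much capacity the extra layers add, the sketch below counts the trainable parameters of each stack of Dense layers from the layer sizes in the two model functions. The helper `dense_param_count` is hypothetical; it uses the fact that a Dense layer with a bias has (inputs + 1) * units weights.

```python
def dense_param_count(layer_sizes):
    """Trainable parameters of a stack of Dense layers: (inputs + 1 bias) * units each."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

base_sizes = [7, 32, 16, 1]              # base_model: 7 inputs, two hidden layers
deeper_sizes = [7, 32, 64, 32, 16, 1]    # apply_model: four hidden layers (Dropout adds no weights)

print(dense_param_count(base_sizes))     # 801
print(dense_param_count(deeper_sizes))   # 4993
```

The deeper network has roughly six times as many weights to fit on the same 900 samples, which is one reason the dropout layer is a sensible addition.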

# REGRESSION

# ABOUT THE DATASET

The Energy Efficiency Dataset focuses on predicting the heating load and cooling load requirements of buildings. The dataset contains various attributes such as the relative compactness, surface area, wall area, roof area, overall height, and glazing area of buildings. The goal is to estimate the amount of heating and cooling load needed for efficient temperature regulation in buildings based on these attributes. It's a regression problem because the target variables, heating load and cooling load, are continuous values representing quantities rather than categories.



* X1 - Relative Compactness - This variable represents the relative compactness of the building, which is the ratio of its volume to its surface area.

* X2 - Surface Area - Surface area refers to the total external area of the building, including walls, roof, and windows.

* X3 - Wall Area - Wall area specifically refers to the total surface area of the walls of the building.

* X4 - Roof Area - Roof area denotes the total surface area of the building's roof.

* X5 - Overall Height - Overall height indicates the height of the building from the ground to its highest point.

* X6 - Orientation - Orientation describes the direction in which the building faces (e.g., north, south, east, west).

* X7 - Glazing Area - Glazing area represents the total area of windows or glass walls in the building.

* X8 - Glazing Area Distribution - Glazing area distribution refers to the distribution of windows or glass walls across different sides of the building.

* y1 - Heating Load - Heating load is the amount of heating required to maintain a comfortable indoor temperature in the building.

* y2 - Cooling Load - Cooling load represents the amount of cooling needed to maintain a comfortable indoor temperature in the building.

Source of the dataset: https://archive.ics.uci.edu/dataset/242/energy+efficiency

### Preprocessing of data

In [None]:
# Import libraries

from keras.models import Sequential
from keras.layers import Dense
from scikeras.wrappers import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [None]:
# load a dataset

enEff_df = pandas.read_csv("/content/ENB2012_data.csv")
data_set = enEff_df.values

In [None]:
enEff_df

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,Y1,Y2
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0,15.55,21.33
1,0.98,514.5,294.0,110.25,7.0,3,0.0,0,15.55,21.33
2,0.98,514.5,294.0,110.25,7.0,4,0.0,0,15.55,21.33
3,0.98,514.5,294.0,110.25,7.0,5,0.0,0,15.55,21.33
4,0.90,563.5,318.5,122.50,7.0,2,0.0,0,20.84,28.28
...,...,...,...,...,...,...,...,...,...,...
763,0.64,784.0,343.0,220.50,3.5,5,0.4,5,17.88,21.40
764,0.62,808.5,367.5,220.50,3.5,2,0.4,5,16.54,16.88
765,0.62,808.5,367.5,220.50,3.5,3,0.4,5,16.44,17.11
766,0.62,808.5,367.5,220.50,3.5,4,0.4,5,16.48,16.61


In [None]:
seed = 7
numpy.random.seed(seed)

In [None]:
enEff_df.dtypes

X1    float64
X2    float64
X3    float64
X4    float64
X5    float64
X6      int64
X7    float64
X8      int64
Y1    float64
Y2    float64
dtype: object

In [None]:
x = enEff_df.iloc[:, :8].values
y = enEff_df[['Y1', 'Y2']].values

### Create a base model

In [None]:
def baseline_model():
  model = Sequential()
  model.add(Dense(32, input_shape=(8,), kernel_initializer= 'normal' , activation= 'relu' ))
  model.add(Dense(2, kernel_initializer= 'normal'))

  model.compile(loss= 'mean_squared_error' , optimizer= 'adam')
  return model

estimator = KerasRegressor(model=baseline_model, epochs=50, batch_size=10, verbose=0)
kfold = KFold(n_splits=10)
results = cross_val_score(estimator, x, y, cv=kfold, scoring='neg_mean_squared_error')
print("Baseline Model: %.2f (%.2f) MSE" % (results.mean(), results.std()))

Baseline Model: -19.88 (9.65) MSE


***ANALYSIS:***

For the regression problem using the Energy Efficiency dataset, I created a base model with a neural network architecture. The model consists of two dense layers: a hidden layer with 32 neurons and ReLU activation, and an output layer with two neurons, one for heating load (Y1) and one for cooling load (Y2). It is compiled with the mean squared error loss and the Adam optimizer. With 10-fold cross-validation on the unscaled features, the baseline scored -19.88 with a standard deviation of 9.65. Because the scoring is scikit-learn's neg_mean_squared_error, the sign is flipped: the actual MSE is about 19.88, and values closer to zero are better. MSE is expressed in squared units of the load, not as a percentage.
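
The sign convention can be illustrated with a small NumPy sketch. The function `neg_mse` is a hypothetical helper that averages the squared error over all samples and both outputs, which is how I understand the default behaviour of the scorer for two equally weighted targets; the predictions are made up.

```python
import numpy as np

def neg_mse(y_true, y_pred):
    """Negated mean squared error, averaged over all samples and both outputs."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return -np.mean((y_true - y_pred) ** 2)

# Two real target rows (Y1, Y2) from the dataframe and made-up predictions
y_true = [[15.55, 21.33], [20.84, 28.28]]
y_pred = [[16.55, 20.33], [22.84, 26.28]]
score = neg_mse(y_true, y_pred)   # absolute errors are 1, 1, 2, 2, so the MSE is 2.5
print(score)

# Converting the reported cross-validation score back to error units
reported = -19.88
mse = -reported
rmse = np.sqrt(mse)
print(round(mse, 2), round(rmse, 2))
```

Taking the square root gives an RMSE of roughly 4.46 in the same units as the loads, which is easier to compare against the target values in the dataframe.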

### Improve the model by standardizing the dataset

In [None]:
def standardized_model():
  model = Sequential()
  model.add(Dense(32, input_shape=(8,), kernel_initializer= 'normal' , activation= 'relu' ))
  model.add(Dense(2, kernel_initializer= 'normal'))

  model.compile(loss= 'mean_squared_error' , optimizer= 'adam')
  return model

# evaluate model with standardized dataset
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(model=standardized_model, epochs=10, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = KFold(n_splits=10)
results = cross_val_score(pipeline, x, y, cv=kfold, scoring='neg_mean_squared_error')
print("Standardized data: %.2f (%.2f) MSE" % (results.mean(), results.std()))

Standardized data: -20.36 (6.94) MSE


***ANALYSIS:***

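To put the features on a common scale, I wrapped the model in a pipeline with a StandardScaler and evaluated it with 10-fold cross-validation. The network has the same architecture as the baseline model. The score was -20.36 (an MSE of about 20.36) with a standard deviation of 6.94, which is slightly worse on average than the baseline's -19.88, though less variable across folds. This run is not a like-for-like comparison, however: it trained for 10 epochs with a batch size of 5, whereas the baseline trained for 50 epochs with a batch size of 10, so the shorter training may be masking the benefit of standardization.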
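
What the standardization step does inside each cross-validation fold can be sketched with plain NumPy. This is an illustration of the idea rather than the pipeline's actual code: the statistics are computed on the training fold only and then reused on the held-out fold. The tiny arrays are a handful of real X2 and X5 values; the variable names are assumptions.

```python
import numpy as np

# A few real rows of X2 (surface area) and X5 (overall height) from the dataframe
train = np.array([[514.5, 7.0], [563.5, 7.0], [784.0, 3.5], [808.5, 3.5]])
held_out = np.array([[563.5, 7.0]])  # stands in for a validation fold

# Fit: statistics come from the training fold only
mean = train.mean(axis=0)
std = train.std(axis=0)   # population std (ddof=0), matching StandardScaler

# Transform: the same statistics are reused on the held-out fold
train_z = (train - mean) / std
held_out_z = (held_out - mean) / std

print(train_z.mean(axis=0).round(6), train_z.std(axis=0).round(6))
print(held_out_z)
```

After scaling, surface area (hundreds) and overall height (single digits) contribute on a comparable scale, which generally makes gradient-based training better conditioned.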

### Show tuning of layers and neurons (see evaluating small and larger networks)


FOR SMALLER NETWORK
```
# 8 inputs -> [16] -> 2 outputs

```



In [None]:
# Function to create a smaller model
def smaller_model():
  model = Sequential()
  model.add(Dense(16, input_shape=(8,), kernel_initializer= 'normal' , activation= 'relu' ))
  model.add(Dense(2, kernel_initializer= 'normal'))

  model.compile(loss= 'mean_squared_error' , optimizer= 'adam')
  return model

# Evaluate smaller model with standardized dataset
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(model=smaller_model, epochs=100, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = KFold(n_splits=10)
results = cross_val_score(pipeline, x, y, cv=kfold, scoring='neg_mean_squared_error')
print("Smaller Model: %.2f (%.2f) MSE" % (results.mean(), results.std()))

Smaller Model: -9.50 (3.86) MSE



FOR LARGER NETWORK
```
# 8 inputs -> [32 -> 16 -> 8] -> 2 outputs
```



In [None]:
# Function to create a larger model
def larger_model():
  # create model
  model = Sequential()
  model.add(Dense(32, input_shape=(8,), kernel_initializer= 'normal' , activation= 'relu' ))
  model.add(Dense(16, kernel_initializer='normal', activation='relu'))
  model.add(Dense(8, kernel_initializer='normal', activation='relu'))
  model.add(Dense(2, kernel_initializer= 'normal'))
  # compile model
  model.compile(loss= 'mean_squared_error' , optimizer= 'adam')
  return model

# Evaluate larger model with standardized dataset
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(model=larger_model, epochs=50, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = KFold(n_splits=10)
results = cross_val_score(pipeline, x, y, cv=kfold, scoring='neg_mean_squared_error')
print("Larger Model: %.2f (%.2f) MSE" % (results.mean(), results.std()))

Larger Model: -10.13 (4.42) MSE


***ANALYSIS:***


To explore the effect of tuning layers and neurons, I evaluated a smaller and a larger network, both on standardized inputs. The smaller model, with one hidden layer of 16 neurons, scored -9.50 (an MSE of about 9.50) with a standard deviation of 3.86. The larger model, with three hidden layers of 32, 16, and 8 neurons, scored -10.13 (an MSE of about 10.13) with a standard deviation of 4.42. The larger model therefore had both a slightly higher error and higher variability across folds. The smaller model trained for 100 epochs and the larger for 50, so the comparison is not fully controlled, but the results suggest that adding layers and neurons does not automatically improve performance; finding a good architecture means balancing model complexity against predictive error.
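
The four regression runs can be compared side by side with a short script. The scores and epoch counts are copied from the cells in this notebook; the run names and the `runs` structure are made up for illustration.

```python
# Scores and epoch counts copied from the cells in this notebook
runs = [
    {"name": "baseline (raw inputs)",  "epochs": 50,  "neg_mse": -19.88},
    {"name": "standardized, 32 units", "epochs": 10,  "neg_mse": -20.36},
    {"name": "smaller, 16 units",      "epochs": 100, "neg_mse": -9.50},
    {"name": "larger, 32-16-8 units",  "epochs": 50,  "neg_mse": -10.13},
]

# neg_mean_squared_error is "higher is better", so sort descending
ranked = sorted(runs, key=lambda run: run["neg_mse"], reverse=True)

for run in ranked:
    print("%-24s epochs=%-4d MSE=%.2f" % (run["name"], run["epochs"], -run["neg_mse"]))
```

Listing the epochs next to each score makes it visible that the best run also trained the longest, so training length is a plausible factor alongside network size.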

# CONCLUSION:

In conclusion, for the classification task, I utilized the Raisin dataset, which comprises morphological features of two raisin varieties. The goal was to classify raisins into two classes: 'Besni' and 'Kecimen'. Initially, I created a base model with a simple neural network architecture and evaluated its performance using k-fold cross-validation. The base model achieved a mean accuracy of 80.22%. To enhance accuracy, I introduced additional hidden layers to the model and applied dropout regularization. As a result, the accuracy improved to 86.78%, which suggests that the deeper network helps, although the two models were evaluated on different data (the training split versus the full dataset).

For the regression task, I employed the Energy Efficiency dataset, which contains building parameters such as relative compactness, surface area, and glazing area. The objective was to predict the heating and cooling loads of buildings from these parameters. I first created a base regression model, then standardized the inputs to put the features on a common scale, and finally tuned the number of layers and neurons by evaluating a smaller and a larger network. Based on the resources in the module, the negated MSE should be close to zero, and my models did not quite reach that level: the best result was the smaller network at -9.50, compared with -19.88 for the baseline. I think the remaining gap comes down to finding the right balance of architecture and training settings, since the runs used different numbers of epochs and batch sizes.

Even though I didn't reach the desired error level, learning to train a base model for regression was a valuable experience. It taught me the importance of understanding the data, selecting appropriate features, and tuning the model parameters. While falling short of the target may be disappointing, it's a part of the learning process. I now have a better understanding of the challenges involved in regression modeling and can use this knowledge to improve future attempts. Furthermore, this experience has helped me grow as a computer engineering student and equipped me with valuable skills for tackling similar problems in the future.