## Neural Networks with Keras
This notebook applies several neural network design choices (weight initialization, network depth, activation function, learning rate, and optimizer) to the "Video Game Sales" dataset from Kaggle (https://www.kaggle.com/datasets/gregorut/videogamesales/data) and compares them using RMSE on a held-out test set.

### Loading The Data and The Necessary Libraries:

In [None]:
# Importing necessary libraries
import numpy as np
import pandas as pd

from keras.models import Sequential
from keras.layers import Dense, LeakyReLU
from keras.optimizers import SGD, Adam, RMSprop
from keras.initializers import RandomUniform, glorot_uniform, Zeros
from keras.activations import relu, sigmoid
from keras.callbacks import EarlyStopping

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import category_encoders as ce

import warnings
# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Load the dataset
data_nn = pd.read_csv('vgsales.csv')

### Basic EDA and Basic Feature Engineering:

In [3]:
# Work on a copy for the exploratory checks; the modelling cells below use data_nn
data = data_nn.copy()
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


In [4]:
data.isna().sum()

Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
dtype: int64

In [5]:
data.dropna(inplace=True)
data.isna().sum()

Rank            0
Name            0
Platform        0
Year            0
Genre           0
Publisher       0
NA_Sales        0
EU_Sales        0
JP_Sales        0
Other_Sales     0
Global_Sales    0
dtype: int64

In [20]:
categorical_features = ['Name', 'Platform', 'Genre', 'Publisher']

In [None]:
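# Split data into features and target
# Note: X keeps Rank and the regional sales columns; Global_Sales is (up to rounding)
# the sum of the regional sales, so the target is nearly a linear function of the inputs
X = data_nn.drop('Global_Sales', axis=1)
y = data_nn['Global_Sales']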

# Preprocess the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12345)

encoder = ce.TargetEncoder(cols=categorical_features)
X_train[categorical_features] = encoder.fit_transform(X_train[categorical_features], y_train)
X_test[categorical_features] = encoder.transform(X_test[categorical_features])

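# Handle missing values by filling them with the column medians of the training set
# (the test set reuses the training medians so no test information leaks into preprocessing)
train_medians = X_train.median()
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)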

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Set random seeds (Python, NumPy and the Keras backend) for reproducibility
from keras.utils import set_random_seed
set_random_seed(12345)
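
The encoder and scaler above are fitted on the training split and then applied to the test split. The same pattern can be used for median imputation. The following is a minimal NumPy sketch of it, assuming a tiny made-up array; the helper names `fit_stats` and `apply_stats` are hypothetical and not part of the notebook.

```python
import numpy as np

def fit_stats(train):
    """Compute per-column median, mean and std from the training data only."""
    median = np.nanmedian(train, axis=0)
    filled = np.where(np.isnan(train), median, train)
    return median, filled.mean(axis=0), filled.std(axis=0)

def apply_stats(data, median, mean, std):
    """Impute and standardize any split using the training statistics."""
    filled = np.where(np.isnan(data), median, data)
    return (filled - mean) / std

# Hypothetical toy data: one feature column with a missing value in each split
train = np.array([[1.0], [2.0], [np.nan], [6.0]])
test = np.array([[np.nan], [4.0]])

median, mean, std = fit_stats(train)
train_scaled = apply_stats(train, median, mean, std)
test_scaled = apply_stats(test, median, mean, std)

print("training median:", median)
print("scaled test rows:", test_scaled.ravel())
```

Because the test rows are transformed with statistics computed on the training rows, the test set stays unseen until evaluation.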

### Basic Function for evaluation:

In [144]:
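def evaluate_model(model):
    # Predict on the held-out test set and report the root mean squared error
    y_pred = model.predict(X_test)
    # Take the square root explicitly; the `squared` argument is not available in recent scikit-learn releases
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"RMSE: {rmse:.4f}")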
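
For reference, RMSE is the square root of the mean squared difference between targets and predictions. A minimal NumPy sketch, assuming made-up values (the function name `rmse` is illustrative): both inputs are flattened first, because Keras predictions have shape `(n, 1)` while the targets have shape `(n,)`, and subtracting those directly would broadcast to an `(n, n)` matrix.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two array-likes of equal length."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical values: one prediction is off by 2, the others are exact
example = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(f"RMSE: {example:.4f}")
```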

### Different Weight Initialization Methods:

In [145]:
def build_model_weight(initializer):
    model = Sequential()
    model.add(Dense(64, input_dim=X_train.shape[1], kernel_initializer=initializer, activation='relu'))
    model.add(Dense(32, kernel_initializer=initializer, activation='relu'))
    model.add(Dense(1, kernel_initializer=initializer, activation='linear'))
    return model

In [143]:
initializers = [RandomUniform(seed=12345), glorot_uniform(seed=12345), Zeros()]
initializer_names = ['Uniform', 'Xavier', 'Zero']

for initializer, name in zip(initializers, initializer_names):
    print(f"\nResults for {name} initialization:")
    model = build_model_weight(initializer)
    model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mean_squared_error'])
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=32, verbose=0)
    evaluate_model(model)


Results for Uniform initialization:
104/104 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
RMSE: 0.0208

Results for Xavier initialization:
104/104 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
RMSE: 0.0170

Results for Zero initialization:
104/104 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
RMSE: 1.0141


### Discussion: Weight Initialization
In this experiment, we investigated how different weight initialization techniques affect a neural network's performance on a regression problem. Three initialization strategies were tested: Uniform, Xavier (Glorot), and Zero. The outcomes and our observations are as follows:

1. Uniform Initialization:
* RMSE: 0.0208
* Uniform initialization draws the initial weights at random from a uniform distribution over a fixed range (Keras' `RandomUniform` defaults to [-0.05, 0.05]). Because every value in the range is equally likely, no neuron starts with a systematically larger weight than the others. The model with uniform initialization performed reasonably well in our experiment, with an RMSE of 0.0208, which suggests the initial weights gave training an adequate starting point.

2. Xavier (Glorot) Initialization:
* RMSE: 0.0170
* Xavier initialization, also called Glorot initialization, scales the initial weights by the layer's fan-in and fan-out so that the variance of activations and gradients stays roughly constant from layer to layer. Because this reduces the risk of vanishing or exploding gradients, it typically leads to faster convergence and better overall performance. The Xavier-initialized model achieved the lowest RMSE in our experiment, 0.0170, which suggests that Xavier initialization suits this regression task well.

3. Zero Initialization:
* RMSE: 1.0141
* With zero initialization, every weight starts at zero. All neurons in a layer then compute the same output and receive identical updates, so the symmetry between them is never broken. With ReLU hidden layers the effect is even stronger: the hidden activations are all zero, the gradient of every kernel is zero, and only the output bias can change, so the network can only learn a constant prediction. As expected, the model with zero initialization performed very poorly, with an RMSE of 1.0141.
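
The effect of zero initialization can be checked numerically on a tiny network. The following is a minimal NumPy sketch, assuming one ReLU hidden layer instead of the notebook's two and random toy data; the function `mse_gradients` is hypothetical and written only for this illustration. With all-zero weights, every gradient except the output bias gradient is exactly zero.

```python
import numpy as np

def mse_gradients(W1, b1, W2, b2, x, y):
    """MSE gradients for a network with one ReLU hidden layer and a linear output."""
    z1 = x @ W1 + b1
    h = np.maximum(z1, 0.0)
    pred = h @ W2 + b2
    d_pred = 2.0 * (pred - y) / len(x)   # derivative of the mean squared error
    dW2 = h.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_h = d_pred @ W2.T
    d_z1 = d_h * (z1 > 0)                # ReLU passes gradient only where z1 > 0
    dW1 = x.T @ d_z1
    db1 = d_z1.sum(axis=0)
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))              # toy inputs (assumption)
y = rng.normal(size=(8, 1)) + 1.0        # toy targets with a non-zero mean (assumption)

# All-zero weights and biases
zero_grads = mse_gradients(np.zeros((3, 4)), np.zeros(4),
                           np.zeros((4, 1)), np.zeros(1), x, y)
# Small random weights for comparison
rand_grads = mse_gradients(rng.normal(scale=0.5, size=(3, 4)), np.zeros(4),
                           rng.normal(scale=0.5, size=(4, 1)), np.zeros(1), x, y)

for name, g_zero, g_rand in zip(["dW1", "db1", "dW2", "db2"], zero_grads, rand_grads):
    print(f"{name}: zero-init all zero = {not np.any(g_zero)}, "
          f"random-init all zero = {not np.any(g_rand)}")
```

Since only the output bias receives a gradient, training can move the prediction towards the mean of the targets but no further.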

### Analysis and Impact of Weight Initialization
In neural network training, weight initialization is essential. It has a direct impact on how fast and efficiently a model learns from the data. Our experiments led us to the following conclusions:

* Uniform Initialization: This gave the model a balanced start and led to respectable performance. It was not the best, but the network could still learn, although less effectively than with Xavier initialization.

* Xavier Initialization: This method gave the lowest error of the three. It tends to make training more effective because it preserves gradient flow across the network's layers, which is particularly helpful for deeper networks.

* Zero Initialization: Zero initialization served as a negative example. Its poor performance highlights how important it is to choose a suitable initialization method, and shows that a network cannot learn meaningful patterns in the data unless its initial weights differ from one another.
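
To make the difference between the fixed-range and Xavier schemes concrete, Glorot uniform initialization samples from [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)), so the range shrinks as the layer gets wider. A minimal NumPy sketch of that formula, using the 64-to-32 layer size from the model above; the helper names are illustrative and this is not the Keras implementation itself.

```python
import numpy as np

def glorot_uniform_limit(fan_in, fan_out):
    """Half-width of the Glorot (Xavier) uniform sampling range."""
    return np.sqrt(6.0 / (fan_in + fan_out))

def glorot_uniform_sample(fan_in, fan_out, rng):
    """Draw a (fan_in, fan_out) weight matrix from the Glorot uniform range."""
    limit = glorot_uniform_limit(fan_in, fan_out)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(12345)
limit = glorot_uniform_limit(64, 32)     # sqrt(6 / 96) = 0.25
W = glorot_uniform_sample(64, 32, rng)

print("limit:", limit)
print("sample variance:", W.var(), "target variance:", 2.0 / (64 + 32))
```

For this layer the Glorot range is five times wider than the [-0.05, 0.05] default of `RandomUniform`, which is one way the two "random" schemes in the experiment differ.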

### Conclusion
In conclusion, the choice of weight initialization strategy has a substantial effect on a neural network's performance and convergence. In our regression task, Xavier initialization was the most successful method and gave the lowest error. This experiment emphasises how important it is to choose an appropriate weight initialization strategy in order to train neural networks effectively.

### Varying The Number of Layers:

In [146]:
def build_model_layer(num_layers):
    model = Sequential()
    model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
    for _ in range(num_layers - 1):
        model.add(Dense(64, activation='relu'))
    model.add(Dense(1, activation='linear'))
    return model

In [148]:
layers = [1, 3, 5]
layer_names = ['1 Hidden Layer', '3 Hidden Layers', '5 Hidden Layers']

for num_layers, name in zip(layers, layer_names):
    print(f"\nResults for {name}:")
    model = build_model_layer(num_layers)
    model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mean_squared_error'])
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=32, verbose=0)
    evaluate_model(model)


Results for 1 Hidden Layer:
104/104 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
RMSE: 0.0382

Results for 3 Hidden Layers:
104/104 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step
RMSE: 0.0300

Results for 5 Hidden Layers:
104/104 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step
RMSE: 0.0374


### Discussion: Varying the Number of Layers
In this experiment, we investigated how the number of hidden layers affects a neural network's performance on a regression problem. Three architectures were tested, with one, three, and five hidden layers. The outcomes and our observations are as follows:

1. 1 Hidden Layer:
* RMSE: 0.0382
* The most basic model, which had just one hidden layer, performed fairly well, with an RMSE of 0.0382. This model is computationally efficient since it has fewer parameters, which allows it to train more quickly. However, because of its simplicity, it might not be able to detect more intricate patterns in the data.

2. 3 Hidden Layers:
* RMSE: 0.0300
* Adding more layers improved performance, lowering the RMSE to 0.0300. This model struck a balance between training time and complexity. The extra layers let the network capture more complex patterns in the data, which improved prediction accuracy. This suggests that, up to a point, deepening the network can improve its ability to generalise.

3. 5 Hidden Layers:
* RMSE: 0.0374
* With five hidden layers, performance declined slightly relative to the three-layer model, with an RMSE of 0.0374. Despite its greater capacity, this architecture did not perform better. The added depth probably introduced new problems, such as overfitting or the vanishing gradients that make deeper networks harder to train.
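
The claim that shallower models have fewer parameters can be quantified. A small sketch that counts the weights and biases of the three architectures, assuming 10 input features (the dataset's 11 columns minus the target) and 64 units per hidden layer as in `build_model_layer`; the function name `count_parameters` is illustrative.

```python
def count_parameters(input_dim, num_layers, units=64):
    """Count weights and biases for `num_layers` hidden Dense layers plus one output unit."""
    sizes = [input_dim] + [units] * num_layers + [1]
    # Each Dense layer has n_in * n_out weights and n_out biases
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

counts = {n: count_parameters(10, n) for n in (1, 3, 5)}
for n, total in counts.items():
    print(f"{n} hidden layer(s): {total} parameters")
```

Going from one to five hidden layers multiplies the parameter count by more than twenty, which is consistent with the higher risk of overfitting discussed above.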

### Analysis and Impact of Number of Layers
A neural network's ability to learn from data is directly impacted by the number of layers in the network. Below is a summary of the effects that were noticed:

* 1 Hidden Layer: Although this model is simpler and trains more quickly, it may not capture all of the structure in the data. It works well for less complex problems or when computing power is limited.

* 3 Hidden Layers: In our experiment, this configuration gave the best results by striking a balance between training efficiency and complexity. The extra layers let the network model more intricate relationships in the data, which improved accuracy.

* 5 Hidden Layers: Adding further layers increases the model's capacity, but it also raises the risk of overfitting and makes training harder. The fact that the three-layer network outperformed the five-layer network here indicates that the extra complexity was not helpful and was possibly even harmful.

### Conclusion
This experiment emphasises how important it is to choose a neural network's depth appropriately. Deeper networks may capture more intricate patterns, but they also need more careful tuning and regularisation to prevent overfitting and training problems. With an RMSE of 0.0300, the model with three hidden layers gave the best results for our regression task and offered the best balance. This shows that adding layers does not necessarily improve performance and that each problem calls for its own architecture.

### Different Activation Functions:

In [150]:
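def build_model_activation(activation):
    # Same 64-32-1 architecture for every activation, so that only the activation differs
    model = Sequential()
    if activation == 'leaky_relu':
        # LeakyReLU is applied as a separate layer after each linear Dense layer
        model.add(Dense(64, input_dim=X_train.shape[1]))
        model.add(LeakyReLU(alpha=0.1))
        model.add(Dense(32))
        model.add(LeakyReLU(alpha=0.1))
    else:
        model.add(Dense(64, input_dim=X_train.shape[1], activation=activation))
        model.add(Dense(32, activation=activation))
    model.add(Dense(1, activation='linear'))
    return model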

In [151]:
activations = ['relu', 'leaky_relu', 'sigmoid']
activation_names = ['ReLU', 'Leaky ReLU', 'Sigmoid']

for activation, name in zip(activations, activation_names):
    print(f"\nResults for {name} activation:")
    model = build_model_activation(activation)
    model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mean_squared_error'])
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=32, verbose=0)
    evaluate_model(model)


Results for ReLU activation:
104/104 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
RMSE: 0.0091

Results for Leaky ReLU activation:
104/104 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
RMSE: 0.0331

Results for Sigmoid activation:
104/104 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
RMSE: 0.0612


### Discussion: Activation Functions
In this challenge, we looked at how various activation functions affect a neural network's ability to solve a regression problem. ReLU, Leaky ReLU, and Sigmoid were the three activation functions that we examined. The outcomes and our observations are as follows:

1. ReLU Activation:
* RMSE: 0.0091
* ReLU (Rectified Linear Unit) is one of the most widely used activation functions in neural networks, especially in deep learning. It is cheap to compute and mitigates the vanishing gradient problem, because its gradient is 1 for every positive input. The model with ReLU activation achieved the lowest RMSE in our experiment, 0.0091. This suggests that ReLU works well for our regression task, allowing effective learning and better model performance.

2. Leaky ReLU Activation:
* RMSE: 0.0331
* Leaky ReLU is a variant of ReLU designed to address the "dying ReLU" problem, in which neurons become inactive and stop learning altogether. It allows a small, non-zero gradient when the unit's input is negative. Despite this theoretical advantage, the model with Leaky ReLU activation had a higher RMSE (0.0331) than the ReLU model. This shows that although Leaky ReLU can be useful in some situations, it did not improve performance on our particular regression problem.

3. Sigmoid Activation:
* RMSE: 0.0612
* The Sigmoid function was widely used in the past but has since fallen out of favour because of issues such as vanishing gradients and slower convergence. In our experiment, the model with Sigmoid activation performed the worst, with an RMSE of 0.0612. This higher error is consistent with the limitations of the Sigmoid function, especially in deeper networks, where it can significantly slow down learning and reduce model performance.
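
The gradient behaviour described above can be seen directly from the definitions. A minimal NumPy sketch of the three activations and their derivatives, assuming a Leaky ReLU slope of 0.1 for negative inputs (the value passed to `LeakyReLU` in the model builder); these are textbook formulas, not the Keras implementations.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu_grad(x):
    return (x > 0).astype(float)          # 0 for negative inputs: the unit can "die"

def leaky_relu_grad(x, alpha=0.1):
    return np.where(x > 0, 1.0, alpha)    # small but non-zero for negative inputs

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                  # never larger than 0.25

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print("relu grad:      ", relu_grad(x))
print("leaky relu grad:", leaky_relu_grad(x))
print("sigmoid grad:   ", np.round(sigmoid_grad(x), 4))
```

Because the sigmoid derivative is at most 0.25 and shrinks quickly away from zero, gradients multiplied through several sigmoid layers become small, which is the vanishing gradient effect mentioned above.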

### Analysis and Impact of Activation Functions
The performance and training dynamics of neural networks are strongly impacted by the selection of activation function. Below is a summary of the effects that were noticed:

* ReLU Activation: ReLU performed best of the functions examined, which makes it the recommended choice for this regression task. Its ability to mitigate vanishing gradients and to train quickly and effectively was a clear advantage.

* Leaky ReLU Activation: Although Leaky ReLU addresses some shortcomings of standard ReLU, it did not improve performance here. This indicates that, for our data, the small gradient Leaky ReLU provides for negative inputs brought no benefit over the simpler standard ReLU.

* Sigmoid Activation: The poor performance of the Sigmoid function is consistent with its known limitations in modern neural networks. Vanishing gradients and slower training make it less suitable for deep or complex networks.

### Conclusion
This experiment demonstrates how important activation functions are to neural network performance and training. ReLU was the clear winner for our regression task, offering the best combination of accuracy and efficiency. Despite its theoretical advantages, Leaky ReLU did not outperform ReLU here. The Sigmoid function lagged well behind, consistent with its known limitations. These results highlight how important it is to choose activation functions that match the needs and characteristics of the task at hand.

### Different Learning Rates:

In [153]:
def build_model_lrate():
    model = Sequential()
    model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='linear'))
    return model

In [155]:
learning_rates = [0.01, 0.1, 0.5]
learning_rate_names = ['0.01', '0.1', '0.5']

for lr, name in zip(learning_rates, learning_rate_names):
    print(f"\nResults for learning rate {name}:")
    model = build_model_lrate()
    optimizer = Adam(learning_rate=lr)
    model.compile(optimizer=optimizer, loss='mean_squared_error', metrics=['mean_squared_error'])
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=32, verbose=0)
    evaluate_model(model)


Results for learning rate 0.01:
104/104 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
RMSE: 0.0343

Results for learning rate 0.1:
104/104 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
RMSE: 0.0399

Results for learning rate 0.5:
104/104 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
RMSE: 1.0178


### Discussion: Learning Rates
In this experiment, we looked at how different learning rates affect a neural network's convergence and performance on a regression problem. Three learning rates were tested with the Adam optimizer: 0.01, 0.1, and 0.5. The outcomes and our observations are as follows:

1. Learning Rate: 0.01
* RMSE: 0.0343
* Among the tested rates, a learning rate of 0.01 performed best, with an RMSE of 0.0343. As the smallest of the three rates, it let the model converge gradually and steadily through small adjustments to the weights. The result suggests that a conservative learning rate, by keeping weight updates small, can reduce the error without overshooting the minimum.

2. Learning Rate: 0.1
* RMSE: 0.0399
* With a learning rate of 0.1, performance declined slightly, to an RMSE of 0.0399. A higher learning rate speeds up the early phase of training, but the larger steps in the weight updates probably caused the model to bounce around the optimum instead of settling into it, so it did not reach the accuracy of the model trained with a learning rate of 0.01. This shows that a learning rate that is too high can lead to higher error even when initial progress is quicker.

3. Learning Rate: 0.5
* RMSE: 1.0178
* The highest learning rate, 0.5, performed poorly, with an RMSE of 1.0178. This excessive learning rate caused very large weight changes, which resulted in instability and divergence. Instead of approaching the optimum step by step, the model probably overshot it repeatedly, leading to a large error. This result illustrates the risk of an overly high learning rate, which can prevent the model from converging and learning effectively.
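
The overshooting behaviour can be illustrated on a toy problem. The sketch below runs plain gradient descent (not Adam, which the experiment used) on the one-dimensional quadratic f(w) = 0.5 * c * w**2; the curvature c = 5 and the function name `gradient_descent` are arbitrary assumptions chosen so that the three tested rates fall on both sides of the stability threshold 2 / c.

```python
def gradient_descent(lr, steps=50, w0=1.0, curvature=5.0):
    """Minimize f(w) = 0.5 * curvature * w**2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = curvature * w      # derivative of f at the current w
        w = w - lr * grad         # each step multiplies w by (1 - lr * curvature)
    return w

final = {lr: gradient_descent(lr) for lr in (0.01, 0.1, 0.5)}
for lr, w in final.items():
    print(f"lr={lr}: |w| after 50 steps = {abs(w):.3e}")
```

With rates of 0.01 and 0.1 the iterate shrinks towards the minimum at 0, while with 0.5 each step overshoots by more than it corrects and the iterate grows without bound. Adam rescales its steps, so the thresholds in the real experiment differ, but the qualitative picture is the same.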

### Analysis and Impact of Learning Rates
One crucial hyperparameter that has a big impact on neural network performance and training dynamics is the learning rate. Below is a summary of the effects that were noticed:

* Learning Rate: 0.01: This rate offered the best compromise between accuracy and convergence speed. The smaller, more controlled updates let the model learn effectively and achieve the lowest RMSE.

* Learning Rate: 0.1: This rate made the weight updates larger but reduced the accuracy of the model. The larger updates prevented the model from settling into the optimum, which led to a marginally higher error.

* Learning Rate: 0.5: This excessively high learning rate resulted in instability and poor performance. The large updates caused the model to diverge, showing that too high a learning rate can be harmful to learning.

### Conclusion
This experiment shows the importance of selecting a suitable learning rate when training neural networks. The best results were obtained with the lowest learning rate (0.01), which allowed precise weight updates and steady convergence. The higher learning rates (0.1 and 0.5) produced higher error, and the highest rate caused marked instability. These results show that, although faster training is attractive, the learning rate must be chosen to keep training accurate and efficient while avoiding divergence and overshooting.

### Different Optimization Algorithms:

In [156]:
def build_model_optimization():
    model = Sequential()
    model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='linear'))
    return model

In [157]:
optimizers = [SGD(), Adam(), RMSprop()]
optimizer_names = ['SGD', 'Adam', 'RMSprop']

for optimizer, name in zip(optimizers, optimizer_names):
    print(f"\nResults for {name} optimizer:")
    model = build_model_optimization()
    model.compile(optimizer=optimizer, loss='mean_squared_error', metrics=['mean_squared_error'])
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=32, verbose=0)
    evaluate_model(model)


Results for SGD optimizer:
104/104 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
RMSE: 0.0266

Results for Adam optimizer:
104/104 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
RMSE: 0.0592

Results for RMSprop optimizer:
104/104 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
RMSE: 0.0168


### Discussion: Optimization Algorithms
In this challenge, we investigated the effects of several optimisation strategies on a neural network's performance on a regression problem. Three optimizers were put to the test: Adam (Adaptive Moment Estimation), RMSprop (Root Mean Square Propagation), and SGD (Stochastic Gradient Descent). The outcomes and our observations are as follows:

1. SGD Optimizer:
* RMSE: 0.0266
* SGD stands for stochastic gradient descent, and it's one of the most straightforward and popular optimisation algorithms. It uses the gradients of the loss function to update the model's parameters. An RMSE of 0.0266 was obtained by the SGD optimizer in our experiment, suggesting good performance. SGD is a dependable option due to its simplicity, although it frequently needs the learning rate to be carefully adjusted and can occasionally converge slowly or become stuck in local minima.

2. Adam Optimizer:
* RMSE: 0.0592
* Adam combines ideas from AdaGrad and RMSprop, two earlier refinements of SGD. It keeps moving averages of the gradients and of the squared gradients and uses them to compute an adaptive learning rate for every parameter. Although Adam is widely used and has proven successful in many contexts, its RMSE of 0.0592 was the highest of the three optimizers here. This may mean that the default learning rate or other hyperparameters were not well suited to our regression problem. Note, however, that the same architecture trained with Adam's default settings reached an RMSE of 0.0170 in the weight initialization experiment (Glorot uniform is the Keras default for `Dense` layers), so part of this gap may reflect run-to-run variation and not the optimizer itself.

3. RMSprop Optimizer:
* RMSE: 0.0168
* RMSprop adapts the learning rate for every parameter by dividing it by the square root of an exponentially decaying average of squared gradients. This keeps update sizes comparable across parameters whose gradient magnitudes differ widely and often speeds up convergence. RMSprop achieved the lowest RMSE in our experiment, 0.0168, ahead of both SGD and Adam, which suggests that its parameter updates were well suited to this regression task.
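
The three update rules can be written out for a single scalar parameter. The following is a minimal sketch of the textbook forms, assuming the default hyperparameters that Keras documents for these optimizers (SGD learning rate 0.01; RMSprop and Adam learning rate 0.001); the function names are illustrative, and library implementations differ in details such as where epsilon is added.

```python
import numpy as np

def sgd_step(w, g, lr=0.01):
    """Plain SGD: move against the gradient."""
    return w - lr * g

def rmsprop_step(w, g, v, lr=0.001, rho=0.9, eps=1e-7):
    """RMSprop: scale the step by a decaying average of squared gradients."""
    v = rho * v + (1.0 - rho) * g ** 2
    return w - lr * g / (np.sqrt(v) + eps), v

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Adam: decaying averages of gradients (m) and squared gradients (v), bias-corrected."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# One step from w = 1.0 with gradient g = 2.0 (toy values)
w, g = 1.0, 2.0
w_sgd = sgd_step(w, g)
w_rms, _ = rmsprop_step(w, g, v=0.0)
w_adam, _, _ = adam_step(w, g, m=0.0, v=0.0, t=1)
print(f"SGD: {w_sgd:.6f}  RMSprop: {w_rms:.6f}  Adam: {w_adam:.6f}")
```

On the first step, SGD moves in proportion to the gradient, whereas RMSprop and Adam move by roughly a fixed multiple of their learning rate regardless of the gradient's size, which is why the three optimizers can behave quite differently with their default settings.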

### Analysis and Impact of Optimization Algorithms
The performance of neural networks during training is greatly impacted by the optimisation algorithm selection. Below is a summary of the effects that were noticed:

* SGD Optimizer: With an RMSE of 0.0266, SGD delivered solid performance. This suggests that SGD can be a good optimizer, even for complex tasks, when tuned appropriately.

* Adam Optimizer: Adam's RMSE of 0.0592 indicates that its performance on this task was not optimal, despite its sophisticated mechanism for adapting learning rates. This implies that Adam is not always the best option out of the box; it may need more careful tuning and may not suit every kind of problem.

* RMSprop Optimizer: With an RMSE of 0.0168, RMSprop gave the best performance. It achieved better results on this task by adjusting learning rates according to the magnitude of recent gradients, which helped it navigate the loss landscape.

### Conclusion
This experiment demonstrates the importance of choosing an appropriate optimisation algorithm for neural network training. SGD remains a strong and dependable option, but RMSprop performed better on our regression task and had the lowest RMSE. Adam did not fare as well in this particular case, despite its widespread use, which highlights that no single optimizer is always the best. Every optimisation algorithm has advantages and disadvantages, and the particulars of the problem affect how effective it is. For this reason, experimenting with several optimizers and tuning their hyperparameters is essential to getting the best results for a particular task.