# CCT College Dublin

## Assessment Cover Page

**Module Title**: Machine Learning for AI  
**Assessment Title**: ML_CA1  
**Lecturer Name**: David McQuaid  
**Student Full Name**: Ingrid Menezes Castro  
**Student Number**: 2020341  
**Assessment Due Date**: 31/05/2024  
**Date of Submission**: 31/05/2024  

**GITHUB LINK**: https://github.com/IC2020341/IngridCastro_ML_CA2

## Declaration

<div style="border: 1px solid black; padding: 10px;">
By submitting this assessment, I confirm that I have read the CCT policy on Academic Misconduct and understand the implications of submitting work that is not my own or does not appropriately reference material taken from a third party or other source. I declare it to be my own work and that all material from third parties has been appropriately referenced. I further confirm that this work has not previously been submitted for assessment by myself or someone else in CCT College Dublin or any other higher education institution.
</div>

-------------

In [None]:
# Imports
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn


# Data Preparation
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold,cross_val_score


# NN
import keras
import tensorflow as tf
from tensorflow.keras.layers import Dense, LeakyReLU
from tensorflow.keras.models import Sequential


# RegressionAlgorithms
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import GradientBoostingRegressor


# Other
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from matplotlib import pyplot
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

In [None]:
# Installations

-----------

# Summary

**1. Neural Networks**
- Data Understanding
- Data Visualisation
- Data Preparation
- Neural Networks to predict Income
- Regression Algorithm to predict Income
- Prediction of a New Customer

**2. Semantic Analysis**
- Data Understanding
- Data Preparation
- Task 1: Sentiment Analysis
- Task 2: Visualisations

------------

# 1. Neural Networks

## 1.1. Data Understanding

In this first part of the data analysis we try to understand what we are dealing with: we search for missing, duplicated and NA values and carry out some exploratory data analysis (EDA).

In [None]:
df1 = pd.read_csv("BankRecords.csv")

In [None]:
df1.head()

In [None]:
df1.shape

In [None]:
df1.info()

In [None]:
df1.describe()

In [None]:
df1.isnull().sum()

In [None]:
df1.nunique()

In [None]:
df1.duplicated().sum()

In [None]:
df1.isna().sum()
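
The individual checks above can also be gathered into one table. The following is a minimal sketch on a made-up frame, not the real `BankRecords.csv`; the helper name `quality_report` and the toy values are assumptions for illustration. Note that `isnull()` is an alias of `isna()` in pandas, so the two calls above return the same counts.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value count and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })

# Toy frame standing in for the bank records (values are made up)
toy = pd.DataFrame({
    "Age": [25, 40, 40, None],
    "Education": ["Degree", "Masters", "Masters", "Diploma"],
})

report = quality_report(toy)
n_duplicates = int(toy.duplicated().sum())  # fully identical rows after the first

print(report)
print("duplicated rows:", n_duplicates)
```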

## 1.2. Data Visualisations

In [None]:
sns.pairplot(df1)
plt.show()

In [None]:
numeric_columns = df1.select_dtypes(include=['float64', 'int64']).columns
categorical_columns = df1.select_dtypes(include=['object', 'bool', 'category']).columns

df1[numeric_columns].hist(bins=30, figsize=(15, 10), layout=(len(numeric_columns)//3+1, 3))
plt.tight_layout()
plt.show()

for column in categorical_columns:
    plt.figure(figsize=(10, 6))
    sns.countplot(x=column, data=df1)
    plt.title(f'Distribution of {column}')
    plt.show()

In [None]:
age_counts = df1['Age'].value_counts().sort_index(ascending=True)
age_counts

Observation: There are negative values for Experience(Years), as can be seen below. These should be dealt with when scaling the data.

In [None]:
experience_counts = df1['Experience(Years)'].value_counts().sort_index(ascending=True)
experience_counts
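
As a rough illustration of what scaling does to such values, the sketch below applies min-max scaling to made-up experience figures. It suggests that the scaler only shifts the negatives (the column minimum becomes 0) rather than correcting them. Clipping at zero is shown purely as a hypothetical alternative; it is not something this notebook does.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up experience values, including impossible negatives
experience = np.array([[-3.0], [-1.0], [0.0], [10.0], [17.0]])

n_negative = int((experience < 0).sum())

# Min-max scaling computes (x - min) / (max - min), so the minimum (-3) maps to 0
scaled = MinMaxScaler().fit_transform(experience)

# Hypothetical alternative: treat negative experience as 0 before scaling
clipped = np.clip(experience, 0, None)
scaled_clipped = MinMaxScaler().fit_transform(clipped)

print("negative entries:", n_negative)
print("scaled:", scaled.ravel())
print("scaled after clipping:", scaled_clipped.ravel())
```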

## 1.3. Data preparation

Data preparation involves the following steps:
- Encode the categorical variables;
- Scale the data;
- Prepare the data for modelling.

To encode the variables, we transform the categorical variables into numerical ones so that they can later be scaled and split. In the next cells we label-encode the following variables:
- Personal Loan;
- Securities Account;
- CD Account;
- Online Banking;
- Credit Card.

These variables take the values 'Yes' or 'No'. `LabelEncoder` assigns codes in alphabetical order, so 'No' becomes 0 and 'Yes' becomes 1.

In [None]:
label_encoder = LabelEncoder()
columns = ['Personal Loan', 'Securities Account', 'CD Account', 'Online Banking', 'CreditCard']

for column in columns:
    df1[column] = label_encoder.fit_transform(df1[column])
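
To check which code each label receives, the fitted encoder's `classes_` attribute can be inspected. This is a small sketch on a made-up 'Yes'/'No' column. In the loop above the same encoder is refitted for every column, so after the loop it only remembers the mapping of the last column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up Yes/No column standing in for e.g. 'Personal Loan'
toy = pd.Series(["Yes", "No", "No", "Yes"], name="Personal Loan")

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy)

# classes_ is sorted, and the position of each label is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(encoded)
```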

In [None]:
df1.head()

As shown above, the 'Education' variable is still categorical. For this one we apply one-hot encoding with `pd.get_dummies`, which creates three new columns:
- Education_Degree;
- Education_Diploma;
- Education_Masters.

After that, the resulting Boolean columns are converted to integers.

In [None]:
df1 = pd.get_dummies(df1, columns=['Education'])
df1.head()

In [None]:
boolean = ['Education_Degree', 'Education_Diploma', 'Education_Masters']
df1[boolean] = df1[boolean].astype(int)
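
The two steps above (creating the dummies, then casting them) can also be done in one call by passing `dtype=int` to `pd.get_dummies`. This is a minimal sketch on a made-up 'Education' column, assuming the same three categories as in the dataset.

```python
import pandas as pd

# Made-up column with the three education levels
toy = pd.DataFrame({"Education": ["Degree", "Masters", "Diploma", "Degree"]})

# dtype=int produces 0/1 integer columns directly instead of booleans
encoded = pd.get_dummies(toy, columns=["Education"], dtype=int)

print(encoded)
```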

In [None]:
df1.head()

In [None]:
df1.drop(columns=['Sort Code'], inplace=True)

In [None]:
df1.dtypes

In [None]:
df1['Credit Score'] = (df1['Credit Score'] * 10).astype(int)

In [None]:
df1.head()

### Data preparation for modelling and scaling

For scaling I will use the `MinMaxScaler`. The independent variables (X) are all columns except 'Income(Thousands's)', which is the dependent variable (y).

In [None]:
X = df1.iloc[:, np.r_[0:3, 4:15]]
y = df1.iloc[:, 3]
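
Positional slicing depends on the exact column order. A name-based selection expresses "everything except the target" more directly. This is a sketch on a made-up frame: the 'Family' column and all values are hypothetical, and only the target name is taken from the dataset.

```python
import pandas as pd

TARGET = "Income(Thousands's)"

# Made-up rows; 'Family' is a hypothetical extra feature
toy = pd.DataFrame({
    "Age": [25, 40],
    "Experience(Years)": [1, 15],
    TARGET: [49, 120],
    "Family": [4, 2],
})

X_toy = toy.drop(columns=[TARGET])  # every column except the target
y_toy = toy[TARGET]                 # the target on its own

print(list(X_toy.columns))
```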

In [None]:
scaler = MinMaxScaler()
X= scaler.fit_transform(X)

## 1.4. Neural Networks to predict Income

In [None]:
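# Fully connected regression network: 100 -> 150 -> 50 -> 50 -> 1
model = Sequential()
# ReLU output is non-negative, so the LeakyReLU that follows leaves it unchanged
model.add(Dense(100, input_shape=(X.shape[1],), activation='relu'))
model.add(LeakyReLU(negative_slope=0.01))
model.add(Dense(150))
model.add(LeakyReLU(negative_slope=0.01))
model.add(Dense(50))
model.add(LeakyReLU(negative_slope=0.01))
model.add(Dense(50))
model.add(LeakyReLU(negative_slope=0.01))
# Single linear output unit for the predicted income
model.add(Dense(1))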

In [None]:
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mse'])
model.fit(X, y, epochs=500, batch_size=10, verbose=2)

In [None]:
scores = model.evaluate(X, y)
# scores[0] is the loss and scores[1] the 'mse' metric; MSE is not a percentage
print("MSE: %.2f" % scores[1])
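
MSE is expressed in squared units of the target (here, thousands squared), which makes it hard to read directly. Taking the square root gives the RMSE, which is back in the units of the income column. This is a small worked sketch on made-up numbers, not on the model's real predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true incomes and predictions (in thousands)
y_true = np.array([50.0, 80.0, 120.0])
y_hat = np.array([55.0, 70.0, 125.0])

# Squared errors are 25, 100 and 25, so their mean is 50
mse = mean_squared_error(y_true, y_hat)
rmse = float(np.sqrt(mse))

print(f"MSE = {mse:.2f}, RMSE = {rmse:.2f}")
```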

In [None]:
y_pred_nn = model.predict(X)

In [None]:
print(y_pred_nn)

In [None]:
plt.figure(figsize=(8, 6))

plt.scatter(y, y_pred_nn, color='blue', label='Neural Network Predictions')

plt.plot(y, y, color='red', linestyle='--', label='Perfect Predictions')

plt.title('Comparison of Predictions: Neural Network')
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.legend()
plt.grid(True)
plt.show()

## 1.5. Regression Algorithm to predict Income

The first thing we need to do is find out which regression algorithm is the best fit for this dataset. To do that, we compare their performance with cross-validation and then optimise and tune the chosen algorithm.

### Data preparation

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 1)
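
Here `X` was already scaled on the full dataset before the split, so the scaler has seen the test rows. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that pattern on synthetic data; all names and values are made up, and this is not the procedure used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic features and target, only for illustration
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(100, 3))
y_raw = X_raw @ np.array([1.0, 2.0, 3.0])

# Split first, so the test rows stay unseen
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.2, random_state=1)

# Learn min and max from the training rows only, then apply to both parts
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # may fall slightly outside [0, 1]

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```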

### Model comparison

In [None]:
models = []

models.append(("DT", DecisionTreeRegressor()))
models.append(("RF", RandomForestRegressor()))
models.append(("LR", LinearRegression()))
models.append(("RDG", Ridge()))
models.append(("LSS", Lasso()))
models.append(("EN", ElasticNet()))
models.append(("GBR", GradientBoostingRegressor()))

In [None]:
results = []
names = []

for name, model in models:
    kfold = KFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='neg_mean_squared_error')
    results.append(cv_results)
    names.append(name)
    print('%s: Mean MSE = %f, Standard Deviation = %f' % (name, -cv_results.mean(), cv_results.std()))
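
The `neg_mean_squared_error` scorer returns the MSE with its sign flipped, because scikit-learn scorers follow a "higher is better" convention. That is why the mean is negated in the print statement above. The sketch below reproduces this on synthetic data with a single model; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem, only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=kfold, scoring="neg_mean_squared_error")

# Flip the sign to get ordinary MSE values, one per fold
mse_per_fold = -neg_mse
rmse_per_fold = np.sqrt(mse_per_fold)

print(f"Mean MSE = {mse_per_fold.mean():.2f}, mean RMSE = {rmse_per_fold.mean():.2f}")
```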

In [None]:
pyplot.boxplot(results, labels = names)
pyplot.title("Algorithm Comparison")
pyplot.ylabel("Negative MSE (closer to 0 is better)")
pyplot.show()

### Random Forest Regressor

The model that performed best was the Random Forest Regressor, so it will be optimised and later compared with the performance of our neural network.

In [None]:
model = RandomForestRegressor()
model.fit(X_train, y_train)

In [None]:
y_pred_RF_bo = model.predict(X)  # model before optimisation; X includes the training rows

In [None]:
print(y_pred_RF_bo)

In [None]:
plt.figure(figsize=(8, 6))

plt.scatter(y, y_pred_RF_bo, color='blue', label='Random Forest Predictions')

plt.plot(y, y, color='red', linestyle='--', label='Perfect Predictions')

plt.title('Random Forest Predictions before optimisation')
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.legend()
plt.grid(True)
plt.show()

### Random Forest Optimisation

In [None]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 80, num = 10)]

# Number of features to consider at every split
max_features = ['sqrt']

# Maximum number of levels in tree
max_depth = [2, 4, 6, 8, 10, 12]

# Minimum number of samples required to split a node
min_samples_split = [2, 3, 4, 5, 8]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 3]

# Method of selecting samples for training each tree
bootstrap = [True, False]

param_grid = {'n_estimators': n_estimators,
              'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'min_samples_leaf': min_samples_leaf,
              'bootstrap': bootstrap}
print(param_grid)
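
Before launching the search it can help to count how many candidates the grid contains. This sketch rebuilds the same grid under a separate name (`grid`, chosen so that it does not overwrite `param_grid`) and counts the combinations with `ParameterGrid`. The fold count of 3 is an assumption that matches `cv = 3` in the grid search cell.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same values as the grid defined above
grid = {
    "n_estimators": [int(x) for x in np.linspace(start=10, stop=80, num=10)],
    "max_features": ["sqrt"],
    "max_depth": [2, 4, 6, 8, 10, 12],
    "min_samples_split": [2, 3, 4, 5, 8],
    "min_samples_leaf": [1, 2, 3],
    "bootstrap": [True, False],
}

n_candidates = len(ParameterGrid(grid))  # 10 * 1 * 6 * 5 * 3 * 2
n_fits = n_candidates * 3                # one fit per candidate per fold

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```

With these values the grid holds 1,800 candidates, so a 3-fold search trains 5,400 forests, plus one final refit of the best candidate.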

In [None]:
rf_model = RandomForestRegressor()

rf_Grid = GridSearchCV(estimator = rf_model, param_grid = param_grid, cv = 3, verbose=2, n_jobs = 4)
rf_Grid.fit(X_train, y_train)

In [None]:
rf_Grid.best_params_

In [None]:
train_mse = mean_squared_error(y_train, rf_Grid.predict(X_train))

test_mse = mean_squared_error(y_test, rf_Grid.predict(X_test))

print(f'Train MSE: {train_mse:.5f}')
print(f'Test MSE: {test_mse:.5f}')