# Regression Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**TEAM NM1**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: Spain Electricity Shortfall Challenge

The government of Spain is considering an expansion of its renewable energy resource infrastructure investments. As such, they require information on the trends and patterns of the country's renewable sources and fossil fuel energy generation. Your company has been awarded the contract to:

- 1. analyse the supplied data;
- 2. identify potential errors in the data and clean the existing data set;
- 3. determine if additional features can be added to enrich the data set;
- 4. build a model that is capable of forecasting the three hourly demand shortfalls;
- 5. evaluate the accuracy of the best machine learning model;
- 6. determine what features were most important in the model’s prediction decision, and
- 7. explain the inner working of the model to a non-technical audience.

Formally, the problem statement given to you, the senior data scientist, by your manager via email reads as follows:

> In this project you are tasked to model the shortfall between the energy generated by means of fossil fuels and various renewable sources - for the country of Spain. The daily shortfall, which will be referred to as the target variable, will be modelled as a function of various city-specific weather features such as `pressure`, `wind speed`, `humidity`, etc. As with all data science projects, the provided features are rarely adequate predictors of the target variable. As such, you are required to perform feature engineering to ensure that you will be able to accurately model Spain's three hourly shortfalls.
 
On top of this, she has provided you with a starter notebook containing vague explanations of what the main outcomes are. 

![ml.jpg](attachment:ml.jpg)

# Introduction

Machine Learning (ML) can be defined as the use of algorithms and statistical models to find patterns and relationships in data, so that computer systems can learn from that data and adapt without following explicit instructions. A basic framework exists to help data scientists teach these systems to produce more accurate predictions and analyses. It is the Data Science Process, and it contains the following phases:

![data-science-life-cycle.png](attachment:data-science-life-cycle.png)

This basic framework has been used by data scientists for years and has proven effective for building models. But what is a model? It is a program that has learned patterns from data and can use them to make predictions or decisions on a previously unseen dataset. Since our features are numerical and the target, the load shortfall, is a continuous quantity, we will use regression to develop our model. Regression is a technique for investigating the relationship between independent variables (features) and a dependent variable (outcome). In machine learning, it is used for predictive modelling, whereby an algorithm predicts continuous outcomes. Regression comes in several forms; in this project we explore two: Simple Linear Regression (SLR) and Multiple Linear Regression (MLR).

Simple Linear Regression, or SLR, is a statistical method for establishing the relationship between two variables using a straight line. The line is defined by its slope and intercept, which are chosen to minimize the regression errors. A straight line can be written using the following equation:

$$y = a + bx$$

Here, $a$ is the intercept of the line with the y-axis, and $b$ is the gradient.
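As a minimal sketch of how the intercept and gradient can be estimated, the example below applies the standard least-squares formulas to a small made-up dataset and cross-checks the result with `np.polyfit`. The data values and variable names are assumptions chosen for illustration; they do not come from the project dataset.

```python
import numpy as np

# Hypothetical toy data lying exactly on the line y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Least-squares estimates of the gradient b and the intercept a
x_mean, y_mean = x.mean(), y.mean()
b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
a = y_mean - b * x_mean

# Cross-check: np.polyfit returns coefficients from the highest degree down
b_check, a_check = np.polyfit(x, y, deg=1)

print(f"intercept a = {a:.2f}, gradient b = {b:.2f}")
```

Because the toy data has no noise, both approaches recover the intercept and gradient that were used to generate it.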

Multiple Linear Regression, or MLR, is a statistical technique that extends simple linear regression to model the relationship between two or more independent variables and one dependent variable. It is used when there are two or more x variables. Our regression equation now becomes:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

where:

- $Y$ is the response variable, which depends on the $p$ predictor variables;
- $\beta_0$ is the intercept, interpreted as the value of $Y$ when all predictor variables are equal to zero;
- $\beta_j$ is the average effect on $Y$ of a one-unit increase in $X_j$, assuming all other predictors are held fixed.
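To make the coefficients concrete, here is a small sketch that fits a multiple linear regression with scikit-learn on synthetic data. The generating relationship (intercept 1.0, coefficients 2.0 and -0.5) is an assumption made up for this example, so that the fitted values can be compared against known ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic predictors X1 and X2 (assumed data, not the project dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Assumed true relationship without noise: Y = 1.0 + 2.0*X1 - 0.5*X2
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]

# Fit the model; intercept_ holds beta_0 and coef_ holds beta_1..beta_p
model = LinearRegression().fit(X, y)
beta_0 = model.intercept_
beta_1, beta_2 = model.coef_

print(f"beta_0 = {beta_0:.3f}, beta_1 = {beta_1:.3f}, beta_2 = {beta_2:.3f}")
```

With noise-free data the fitted coefficients match the generating ones; with real data they are estimates, and each is read as the effect of its predictor with the others held fixed.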

# Problem Statement

![electricity-pylon-3916954_960_720.jpg](attachment:electricity-pylon-3916954_960_720.jpg)

The government of Spain is embarking on a significant expansion of its renewable energy infrastructure investments to transition towards a more sustainable energy mix. To achieve this, comprehensive insights into the trends and patterns of renewable energy sources and fossil fuel energy generation are imperative. Our organization has been entrusted with the task of addressing the following key objectives:

- <b>Data Analysis and Cleaning:</b> Rigorously analyze the provided dataset to uncover trends, patterns, and potential errors. Implement data cleaning and quality enhancement procedures to ensure that the dataset is robust and reliable for further analysis.

- <b>Feature Engineering:</b> Investigate the possibility of augmenting the dataset with additional features that can provide deeper insights into energy generation and shortfalls. Effective feature engineering is essential for developing a robust predictive model.

- <b>Demand Shortfall Forecasting:</b> Develop a predictive model capable of accurately forecasting the three-hourly demand shortfalls. The demand shortfall will serve as our target variable, and we aim to model it as a function of various city-specific weather features, including pressure, wind speed, humidity, and more.

- <b>Model Evaluation:</b> Assess the performance of multiple machine learning models and identify the most accurate and reliable model. We will employ rigorous evaluation metrics to determine the model's effectiveness.

- <b>Feature Importance:</b> Explore and determine the significance of individual features in influencing the model's prediction decisions. Understanding feature importance is crucial for optimizing renewable energy strategies.

- <b>Explainability:</b> Translate the inner workings of the predictive model into understandable insights for a non-technical audience. The ability to explain complex data-driven decisions in plain language is vital for informed decision-making.

Throughout the project, our overarching goal is to support Spain's sustainable energy transition by providing valuable insights, robust predictive models, and actionable recommendations. We understand that the provided dataset may require extensive preprocessing and feature engineering to achieve accurate forecasts and informed decision-making.

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modeling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

![import%20process.jpg](attachment:import%20process.jpg)

As data scientists, we rely on various tools to carry out each step of building an effective model that produces accurate results. These tools come as packages or libraries, each with specialised capabilities for its intended use. The libraries we focus on cover loading, manipulating and visualising data, preparing data for analysis, data preprocessing, model building, statistical functions, numerical operations and linear algebra. Once the required packages are loaded, we can begin working on the data. These packages greatly simplify the development and deployment of a dependable model.

In [10]:
# Libraries for data loading, data manipulation and data visualisation
import pandas as pd  # Data processing and manipulation 
import matplotlib.pyplot as plt  # Data visualization
import seaborn as sns  # Statistical data visualization
import numpy as np  # Numerical operations and linear algebra 

# Libraries for data preparation and model building
from scipy.stats import norm  # Statistical functions
from sklearn.preprocessing import StandardScaler  # Data preprocessing
import warnings  # Warning handling
from statsmodels.graphics.factorplots import interaction_plot
warnings.filterwarnings('ignore')  # Ignore warnings

# Display Matplotlib plots in Jupyter Notebook
%matplotlib inline  

# Setting global constants to ensure notebook results are reproducible
PARAMETER_CONSTANT = '###'

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

To build a reliable and useful model, we need a dataset of structured data that can be used to train, test and evaluate the model. Structured data takes the form of a table whose columns and rows can be read and analysed; in this context, the columns are the features and the rows are the observations. We therefore load the data into our notebook to gain access to these features and observations. The procedure used to load and preview the data is shown below.

In [4]:
df = pd.read_csv('df_test.csv')  # Load the dataset with the pandas read_csv() function
df.head()  # Preview the first five rows of the dataset

| | Unnamed: 0 | time | Madrid_wind_speed | Valencia_wind_deg | Bilbao_rain_1h | Valencia_wind_speed | Seville_humidity | Madrid_humidity | Bilbao_clouds_all | Bilbao_wind_speed | ... | Barcelona_temp_max | Madrid_temp_max | Barcelona_temp | Bilbao_temp_min | Bilbao_temp | Barcelona_temp_min | Bilbao_temp_max | Seville_temp_min | Madrid_temp | Madrid_temp_min |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 8763 | 2018-01-01 00:00:00 | 5.0 | level_8 | 0.0 | 5.0 | 87.0 | 71.333333 | 20.0 | 3.0 | ... | 287.816667 | 280.816667 | 287.356667 | 276.15 | 280.38 | 286.816667 | 285.15 | 283.15 | 279.866667 | 279.15 |
| 1 | 8764 | 2018-01-01 03:00:00 | 4.666667 | level_8 | 0.0 | 5.333333 | 89.0 | 78.0 | 0.0 | 3.666667 | ... | 284.816667 | 280.483333 | 284.19 | 277.816667 | 281.01 | 283.483333 | 284.15 | 281.15 | 279.193333 | 278.15 |
| 2 | 8765 | 2018-01-01 06:00:00 | 2.333333 | level_7 | 0.0 | 5.0 | 89.0 | 89.666667 | 0.0 | 2.333333 | ... | 284.483333 | 276.483333 | 283.15 | 276.816667 | 279.196667 | 281.816667 | 282.15 | 280.483333 | 276.34 | 276.15 |
| 3 | 8766 | 2018-01-01 09:00:00 | 2.666667 | level_7 | 0.0 | 5.333333 | 93.333333 | 82.666667 | 26.666667 | 5.666667 | ... | 284.15 | 277.15 | 283.19 | 279.15 | 281.74 | 282.15 | 284.483333 | 279.15 | 275.953333 | 274.483333 |
| 4 | 8767 | 2018-01-01 12:00:00 | 4.0 | level_7 | 0.0 | 8.666667 | 65.333333 | 64.0 | 26.666667 | 10.666667 | ... | 287.483333 | 281.15 | 286.816667 | 281.816667 | 284.116667 | 286.15 | 286.816667 | 284.483333 | 280.686667 | 280.15 |


<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


![Exploratory-Data-Analysis.png](attachment:Exploratory-Data-Analysis.png)

Exploratory Data Analysis (EDA) is the critical process of performing initial investigations on data to discover patterns, identify anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations. Data scientists often use it to see what the data can reveal beyond the formal modeling or hypothesis-testing task, and it provides a better understanding of the data set variables and the relationships between them. <br> 
EDA is a crucial step in gaining insights before diving into modeling or decision-making. Here's a breakdown of a sound approach to carrying out EDA on your dataset:

- <b>Understand the Problem and Goals:</b><br>
Clearly define the problem you want to solve or the goals you aim to achieve with your dataset.<br>
Identify what you want to learn or discover through EDA.

- <b>Data Collection and Familiarization:</b><br>
Gather the dataset you intend to analyze.<br>
Understand the dataset's structure, data types, and any accompanying documentation.

- <b>Data Cleaning:</b><br>
Handle Missing Values: Identify and deal with missing data through imputation, removal, or other appropriate methods.<br>
Handle Outliers: Detect and decide how to treat outliers, which may include filtering, transformation, or more advanced methods.

- <b>Univariate Analysis:</b><br>
Examine Individual Variables: Analyze each variable independently to understand its distribution, central tendency, spread, and outliers.<br>
Visualize Distributions: Use histograms, box plots, and summary statistics to describe the data.

- <b>Bivariate and Multivariate Analysis:</b><br>
Explore Relationships: Investigate relationships between variables using scatter plots, correlation matrices, and cross-tabulations.<br>
Identify Patterns: Look for patterns, trends, or associations between different variables.<br>
Group Comparisons: Compare groups within categorical variables.

- <b>Visualizations:</b><br>
Create Data Visualizations: Generate various plots, such as histograms, scatter plots, bar charts, box plots, and heatmaps to visualize data patterns.<br>
Interpret Visuals: Interpret the visualizations to draw insights and identify trends or anomalies.<br><br>

The significance of EDA is to assist data scientists in ensuring that the results they produce are valid and applicable to any desired business outcomes and goals as well as guiding stakeholders by confirming they are asking the right questions.
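The steps listed above can be sketched in a few lines of pandas. The miniature DataFrame below is an assumption made for illustration: its column names are borrowed from the dataset, but the values are made up, and the variable names (`toy`, `missing_counts` and so on) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; values are made up for illustration
toy = pd.DataFrame({
    "Madrid_wind_speed": [5.0, 4.7, 2.3, 2.7, 4.0],
    "Madrid_humidity": [71.3, 78.0, 89.7, np.nan, 64.0],
    "Valencia_wind_deg": ["level_8", "level_8", "level_7", "level_7", "level_7"],
})

# Data cleaning check: count missing values per column
missing_counts = toy.isnull().sum()

# Univariate analysis: summary statistics of the numeric columns
summary = toy.describe()

# Bivariate analysis: pairwise correlation between the numeric columns
correlations = toy.select_dtypes("number").corr()

# Group comparison: mean wind speed within each wind-direction category
group_means = toy.groupby("Valencia_wind_deg")["Madrid_wind_speed"].mean()

print(missing_counts)
print(correlations)
print(group_means)
```

The same calls apply unchanged to a full DataFrame; plots such as histograms or heatmaps would typically be built on top of these summaries.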


<b>Check the shape of the dataset</b>
- It is good practice to first check the shape of the dataset to get a general overview of its size.

In [5]:
# print the shape
shape = df.shape
print('The shape of the dataset: ', shape)

The shape of the dataset:  (2920, 48)


Now we can see that the dataset contains <b>2920 instances</b> and <b>48 variables</b>.

<b>Summary of the Dataset</b>

In [6]:
# Summary of the dataset: the number of non-null values and the data type of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2920 entries, 0 to 2919
Data columns (total 48 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            2920 non-null   int64  
 1   time                  2920 non-null   object 
 2   Madrid_wind_speed     2920 non-null   float64
 3   Valencia_wind_deg     2920 non-null   object 
 4   Bilbao_rain_1h        2920 non-null   float64
 5   Valencia_wind_speed   2920 non-null   float64
 6   Seville_humidity      2920 non-null   float64
 7   Madrid_humidity       2920 non-null   float64
 8   Bilbao_clouds_all     2920 non-null   float64
 9   Bilbao_wind_speed     2920 non-null   float64
 10  Seville_clouds_all    2920 non-null   float64
 11  Bilbao_wind_deg       2920 non-null   float64
 12  Barcelona_wind_speed  2920 non-null   float64
 13  Barcelona_wind_deg    2920 non-null   float64
 14  Madrid_clouds_all     2920 non-null   float64
 15  Seville_wind_speed   

<b>Dataset description</b>
- The dataset contains several columns that hold data about the atmospheric conditions of different locations within Spain, representing weather variables such as <b>cloud cover, wind, rainfall, snow, temperature and pressure</b>.
- It contains both numerical and categorical data, with float, integer and object data types.
- The first column, which has no variable name (`Unnamed: 0`), is the observation ID.

<b>Statistical Properties of the Dataset</b>

In [7]:
# Statistical properties of the numerical variables in the dataset; non-numeric (object) columns are excluded
df.describe()

| | Unnamed: 0 | Madrid_wind_speed | Bilbao_rain_1h | Valencia_wind_speed | Seville_humidity | Madrid_humidity | Bilbao_clouds_all | Bilbao_wind_speed | Seville_clouds_all | Bilbao_wind_deg | ... | Barcelona_temp_max | Madrid_temp_max | Barcelona_temp | Bilbao_temp_min | Bilbao_temp | Barcelona_temp_min | Bilbao_temp_max | Seville_temp_min | Madrid_temp | Madrid_temp_min |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 2920.0 | 2920.0 | 2920.0 | 2920.0 | 2920.0 | 2920.0 | 2920.0 | 2920.0 | 2920.0 | 2920.0 | ... | 2920.0 | 2920.0 | 2920.0 | 2920.0 | 2920.0 | 2920.0 | 2920.0 | 2920.0 | 2920.0 | 2920.0 |
| mean | 10222.5 | 2.45782 | 0.067517 | 3.012785 | 67.123516 | 62.644463 | 43.355422 | 2.283562 | 15.477283 | 162.643836 | ... | 290.695462 | 288.888393 | 289.911289 | 284.920684 | 286.522375 | 289.124971 | 288.483641 | 290.152431 | 287.869763 | 286.61813 |
| std | 843.075718 | 1.774838 | 0.153381 | 1.99634 | 20.611292 | 24.138393 | 30.486298 | 1.654787 | 25.289197 | 97.749873 | ... | 7.113599 | 9.089699 | 7.119411 | 6.803424 | 6.492355 | 7.168049 | 6.221324 | 7.906915 | 8.977511 | 8.733163 |
| min | 8763.0 | 0.0 | 0.0 | 0.0 | 11.666667 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 273.816667 | 269.816667 | 272.65 | 266.483333 | 268.12 | 271.483333 | 270.138667 | 271.15 | 268.713333 | 267.816667 |
| 25% | 9492.75 | 1.333333 | 0.0 | 1.666667 | 52.0 | 43.0 | 13.333333 | 1.0 | 0.0 | 86.666667 | ... | 284.816667 | 281.483333 | 284.3075 | 280.15 | 281.778333 | 283.483333 | 284.15 | 284.483333 | 280.816667 | 279.816667 |
| 50% | 10222.5 | 2.0 | 0.0 | 2.333333 | 70.333333 | 63.0 | 45.0 | 1.666667 | 0.0 | 140.0 | ... | 290.15 | 287.483333 | 289.483333 | 284.483333 | 286.265 | 288.816667 | 288.483333 | 289.15 | 286.396667 | 285.483333 |
| 75% | 10952.25 | 3.333333 | 0.0 | 4.0 | 85.0 | 84.0 | 75.0 | 3.333333 | 20.0 | 233.333333 | ... | 296.483333 | 295.483333 | 295.816667 | 289.816667 | 291.119167 | 295.15 | 292.816667 | 295.15 | 294.4525 | 293.15 |
| max | 11682.0 | 13.333333 | 1.6 | 14.333333 | 100.0 | 100.0 | 97.333333 | 10.666667 | 93.333333 | 360.0 | ... | 309.483333 | 313.483333 | 308.15 | 307.483333 | 308.966667 | 306.816667 | 310.816667 | 314.483333 | 312.223333 | 310.15 |


<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

![31896diagram.png](attachment:31896diagram.png)

Feature engineering is the process of creating new features (variables) or transforming existing features in your dataset to improve the performance of machine learning models. It involves selecting, modifying, or creating features in a way that makes them more informative and suitable for modeling. Feature engineering is a crucial step in the data preprocessing phase of a data science project. Here's an explanation of the feature engineering process:

- <b>Feature Selection:</b><br>
Start by selecting relevant features from your dataset. Remove irrelevant or redundant features that do not contribute to the target variable or may introduce noise.

- <b>Handling Missing Data:</b><br>
Identify missing values in your dataset and decide how to handle them. You can fill missing values with appropriate measures like mean, median, mode, or use more advanced techniques like imputation.

- <b>Creating Interaction Terms:</b><br>
Combine two or more features to create new features that capture the interaction between them. This can help the model capture complex relationships in the data.

- <b>Categorical Encoding:</b><br>
Deal with categorical variables by converting them into numerical format. Common methods include one-hot encoding, label encoding, or target encoding, depending on the nature of the data and the machine learning algorithm you plan to use.

- <b>Time-Series Feature Engineering:</b><br>
For time-series data, you may create lag features, rolling statistics, or other time-based features that capture temporal dependencies.<br>

Feature engineering is both an art and a science, and it often requires experimentation and domain knowledge to create the most informative features. It can significantly impact the performance of machine learning models, making it a critical step in the data science pipeline.
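A minimal sketch of several of these steps is given below. It uses a made-up four-row frame whose column names mirror the dataset; the values, the choice of median imputation, the reading of `level_8` as the integer 8, and the new column names (`hour`, `month`, `Madrid_humidity_lag1`, `wind_x_humidity`) are all assumptions for illustration, not the notebook's confirmed approach.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; values are made up for illustration
toy = pd.DataFrame({
    "time": ["2018-01-01 00:00:00", "2018-01-01 03:00:00",
             "2018-01-01 06:00:00", "2018-01-01 09:00:00"],
    "Valencia_wind_deg": ["level_8", "level_8", "level_7", "level_7"],
    "Madrid_wind_speed": [5.0, np.nan, 2.0, 3.0],
    "Madrid_humidity": [70.0, 78.0, 90.0, 82.0],
})

# Handling missing data: fill gaps with the column median
median_speed = toy["Madrid_wind_speed"].median()
toy["Madrid_wind_speed"] = toy["Madrid_wind_speed"].fillna(median_speed)

# Categorical encoding: pull the number out of labels such as "level_8"
toy["Valencia_wind_deg"] = (
    toy["Valencia_wind_deg"].str.extract(r"(\d+)", expand=False).astype(int)
)

# Time-series features: parse the timestamp, then derive calendar parts and a lag
toy["time"] = pd.to_datetime(toy["time"])
toy["hour"] = toy["time"].dt.hour
toy["month"] = toy["time"].dt.month
toy["Madrid_humidity_lag1"] = toy["Madrid_humidity"].shift(1)

# Interaction term: product of two existing features
toy["wind_x_humidity"] = toy["Madrid_wind_speed"] * toy["Madrid_humidity"]

print(toy)
```

Note that a lag feature leaves the first row empty, so that row would need to be dropped or imputed before modelling.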

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more regression models that are able to accurately predict the three hour load shortfall. |

---

Data modeling is the phase of the data science process in which you create and evaluate machine learning models. Throughout this phase, the goal is to create a model that accurately predicts outcomes or solves a specific problem while ensuring that it performs well on unseen data. Model evaluation, validation, and interpretation are essential to gain confidence in the model's reliability and effectiveness in real-world applications.<br>
This phase involves the following key steps:

- <b>Model Selection:</b><br>
In this phase, you choose the machine learning algorithms or models that are most suitable for your problem. The selection depends on the nature of your data (e.g., structured or unstructured), the type of problem (e.g., classification or regression), and the complexity of relationships you want to capture.

- <b>Data Splitting:</b><br>
To assess your model's performance, you divide your dataset into two parts: a training set and a testing set. The training set is used to train the model, while the testing set is reserved for evaluating how well the model generalizes to new, unseen data.

- <b>Model Training:</b><br>
With the training data, you feed it into the chosen model, which learns patterns, relationships, and associations within the data. The model adjusts its internal parameters to make predictions.

- <b>Model Evaluation:</b><br>
You assess the model's performance using various evaluation metrics, depending on the problem type. For example, in classification tasks, you might use accuracy, precision, recall, or F1 score, while regression tasks can use metrics like Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE).

- <b>Hyperparameter Tuning:</b><br>
Most machine learning models have hyperparameters, which are settings that control the learning process. You search for the best hyperparameters through techniques like grid search or random search to optimize model performance.

- <b>Deployment:</b><br>
If your model meets the desired performance and is validated, you deploy it for practical use. Deployment can involve integrating the model into applications, websites, or decision support systems.
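The splitting, training and evaluation steps above can be sketched as follows. The data is synthetic and the generating relationship is an assumption made for this example; in the notebook, `X` and `y` would instead be the engineered features and the load shortfall target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumed): a linear signal plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 10.0 + 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Data splitting: hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Model training: fit on the training set only
model = LinearRegression()
model.fit(X_train, y_train)

# Model evaluation: score predictions on the unseen test set
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

For time-ordered data such as three-hourly readings, a chronological split (for example `shuffle=False` in `train_test_split`) may be more appropriate than a random one, since shuffling lets the model see observations from after the period it is tested on.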

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

The model performance stage is a critical step in the data science and machine learning pipeline, occurring after you've built and evaluated your predictive model. This stage involves assessing how well your model performs in real-world scenarios and making critical decisions about its readiness for deployment. Here's an in-depth look at the model performance stage:

1. <b>Selecting the Best Way to Compare Model Performance:</b><br>

Model evaluation involves comparing different models to determine which one performs best. The choice of the evaluation metric depends on the specific problem and business goals. Common evaluation metrics include:<br>

<b>Accuracy:</b> This metric is suitable for balanced classification tasks, but it can be misleading when dealing with imbalanced datasets.<br>

<b>Precision and Recall:</b> These metrics are useful for imbalanced classification problems. Precision measures the accuracy of positive predictions, while recall measures the ability to find all positive instances.<br>

<b>F1 Score:</b> The F1 score is the harmonic mean of precision and recall. It provides a balance between the two and is valuable for imbalanced datasets.<br>

<b>Mean Squared Error (MSE):</b> MSE is commonly used for regression tasks. It measures the average squared difference between predicted and actual values.<br>

2. <b>Choosing the Best Model:</b>

Once you have selected a suitable evaluation metric(s), you can compare the performance of different models. The process often includes the following steps:<br>

<b>Train Multiple Models:</b> Train a variety of machine learning algorithms, including different classifiers or regressors.<br>

<b>Cross-Validation:</b> Use cross-validation techniques like k-fold cross-validation to assess the model's generalization performance.<br>

<b>Evaluate Models:</b> Assess each model's performance based on the chosen evaluation metric(s). This includes measuring their accuracy, precision, recall, F1 score, MSE, R2, or other relevant metrics.<br>

In summary, selecting the best model and comparing model performance involve a comprehensive evaluation process. The choice should be based on business goals, interpretability, domain knowledge, and the trade-offs between various factors. Documenting the selection process and rationale is essential for transparency and accountability in model deployment.
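As a sketch of the comparison process, the example below scores three candidate regressors with 5-fold cross-validation and picks the one with the lowest RMSE. The synthetic data, the candidate list and the hyperparameter values are assumptions chosen for illustration; they are not the models selected in this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed): a linear signal plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.3, size=200)

# Hypothetical candidate models to compare
candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}

rmse_by_model = {}
for name, estimator in candidates.items():
    # scikit-learn reports negated MSE so that larger scores are always better
    scores = cross_val_score(
        estimator, X, y, cv=5, scoring="neg_mean_squared_error"
    )
    rmse_by_model[name] = float(np.sqrt(-scores.mean()))

# The best model is the one with the lowest cross-validated RMSE
best_model = min(rmse_by_model, key=rmse_by_model.get)

for name, rmse in rmse_by_model.items():
    print(f"{name}: RMSE = {rmse:.3f}")
print("best:", best_model)
```

Because the synthetic target is linear in the features, the linear models are expected to beat the shallow tree here; on real data the ranking has to be established empirically.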

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

The stage of model explanation in the data science and machine learning process is essential for understanding how a model makes its predictions, especially in complex models like deep neural networks or ensemble models. Model explanation techniques help provide insights into the reasons behind a model's decisions. Here's an overview of the model explanation stage:

- <b>Interpretability vs. Complexity:</b><br>
As machine learning models become more powerful and complex, their decision-making processes often become less interpretable. Model explanation aims to bridge the gap between model complexity and human understanding.
<br><br>
- <b>Why Model Explanation?</b><br>
Model explanations are critical for various reasons, including:<br>
Trust: Understanding why a model makes specific predictions builds trust in its decisions.<br>
Accountability: In regulated industries, models must be explainable to meet legal requirements.<br>
Bias Detection: Model explanations help uncover and mitigate biases in the model's predictions.<br>
Insights: Explanations provide insights into the relationship between input features and predictions.
<br><br>
- <b>Communication:</b><br>
Effectively communicating model explanations to stakeholders, especially non-technical audiences, is vital for ensuring that model decisions are well-received and understood.<br><br>

In summary, the stage of model explanation focuses on making machine learning models transparent and interpretable. It helps answer the question "why" a model makes specific predictions and is crucial for gaining trust, accountability, and insights into complex model decisions. Model explanations contribute to responsible AI and are a valuable aspect of model deployment and application.
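For a linear model, one simple way to relate input features to predictions is to standardise the features and compare the absolute sizes of the fitted coefficients. The sketch below illustrates this on synthetic data; the feature names, scales and generating relationship are assumptions made up for the example, and this is only one of several possible explanation techniques.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names and synthetic data with different scales
features = ["wind_speed", "humidity", "pressure"]
rng = np.random.default_rng(1)
X = rng.normal(size=(250, 3)) * np.array([2.0, 20.0, 10.0])

# Assumed relationship: wind_speed matters most, pressure not at all
y = 5.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=250)

# Standardise so that coefficients are comparable across features
X_scaled = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_scaled, y)

# Rank features by the absolute size of their standardised coefficient
importance = pd.Series(np.abs(model.coef_), index=features)
importance = importance.sort_values(ascending=False)

print(importance)
```

Each value can be read as the change in the prediction for a one standard deviation increase in that feature, holding the others fixed, which is a statement a non-technical audience can follow.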

## 8. Conclusion

Developing an effective and reliable model is a thorough and meticulous process, as the sections above show. Each phase needs to be carefully understood so that its requirements can be met. By meeting these requirements, one can build a model that has learned the relevant patterns from the data and can then be applied to unseen data without further guidance.