<H1> 1. What is Machine Learning? </h1>
<font size = "+1"> Machine learning is the science (and art) of programming computers so they can
learn from data.<p><br>
Here is a slightly more general definition:<br>
[Machine learning is the] field of study that gives computers the ability to learn
without being explicitly programmed.<br>
—Arthur Samuel, 1959

<h3> (a) Traditional Approach </h3>
<font size = "+1">Consider how you would write a spam filter using traditional programming techniques:<br>
1.  You might notice that some words or phrases (such as “4U”, “credit card”, “free”, and “amazing”) tend to come up a lot in the subject line.<br>
2. You would write a detection algorithm for each of the patterns that you noticed,
and your program would flag emails as spam if a number of these patterns were
detected.<br>
3. You would test your program and repeat steps 1 and 2 until it was good enough
to launch.



![image.png](attachment:cff314bd-3ebb-4d96-955b-7468043e0ece.png)

<font size = "+1"> Since the problem is difficult, your program will likely become a long list of complex
rules—pretty hard to maintain.


<h3> (b) Machine Learning approach </h3>
<font size = "+1"> What if spammers notice that all their emails containing “4U” are blocked? They
might start writing “For U” instead. A spam filter using traditional programming
techniques would need to be updated to flag “For U” emails. If spammers keep
working around your spam filter, you will need to keep writing new rules forever.</h5>
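
The rule-based approach described above can be sketched in a few lines. This is only an illustration: the pattern list comes from the example words in the text, while the function name `is_spam` and the threshold of two matching rules are assumptions made for this sketch.

```python
import re

# Hand-written rules: each pattern was noticed by a human reading spam.
# The threshold of 2 matching rules is an arbitrary, illustrative choice.
SPAM_PATTERNS = [
    re.compile(r"\b4U\b", re.IGNORECASE),
    re.compile(r"\bcredit card\b", re.IGNORECASE),
    re.compile(r"\bfree\b", re.IGNORECASE),
    re.compile(r"\bamazing\b", re.IGNORECASE),
]

def is_spam(subject, threshold=2):
    """Flag a subject line if at least `threshold` patterns match."""
    hits = sum(1 for pattern in SPAM_PATTERNS if pattern.search(subject))
    return hits >= threshold

print(is_spam("Amazing offer: free credit card 4U"))  # all four rules match
print(is_spam("Meeting notes for Tuesday"))           # no rule matches
print(is_spam("Amazing deal For U"))                  # "For U" dodges the 4U rule
```

The last call shows the maintenance problem: the rephrased subject matches only one rule, so it slips through until someone writes a new pattern by hand.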

![image-2.png](attachment:e166efe8-bc50-45c2-9397-9641aa1aacef.png)

<h3>(c) Automatic adaptation to change </h3>
<font size = "+1">In contrast, a spam filter based on machine learning techniques automatically notices
that “For U” has become unusually frequent in spam flagged by users, and it starts
flagging them without your intervention. <h5>

![image-5.png](attachment:343841ae-d3fc-4842-9fcd-4c7c88bb7f80.png)
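
As a rough sketch of how a filter can pick up “For U” on its own, the code below simply counts which words show up more often in user-flagged spam than in normal mail. The data format (pairs of subject and flag), the function names, and the scoring rule are all assumptions made for illustration; a real filter would use a proper classifier.

```python
from collections import Counter

def train(emails):
    """Count, per word, how many spam and ham subjects contain it.

    `emails` is a list of (subject, flagged_as_spam) pairs.
    """
    spam_counts, ham_counts = Counter(), Counter()
    for subject, flagged in emails:
        words = set(subject.lower().split())
        (spam_counts if flagged else ham_counts).update(words)
    return spam_counts, ham_counts

def spam_score(subject, spam_counts, ham_counts):
    """Fraction of the subject's known words that occur mostly in spam."""
    words = set(subject.lower().split())
    known = [w for w in words if spam_counts[w] + ham_counts[w] > 0]
    if not known:
        return 0.0
    return sum(1 for w in known if spam_counts[w] > ham_counts[w]) / len(known)

emails = [
    ("cheap pills 4u", True),
    ("win money 4u", True),
    ("lunch on friday", False),
    ("project update", False),
]
before = spam_score("special deal for u", *train(emails))

# Users flag two new spam emails that use the "for u" spelling.
emails += [("cheap pills for u", True), ("win prizes for u", True)]
after = spam_score("special deal for u", *train(emails))

print(before, after)
```

No rule was edited between the two scores; only the training data changed.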

<h3>(d) Helping humans</h3>
<font size = "+1"> Finally, machine learning can help humans learn (Figure 1-4). ML models can be
inspected to see what they have learned (although for some models this can be
tricky). <br>For instance, once a spam filter has been trained on enough spam, it can
easily be inspected to reveal the list of words and combinations of words that it
believes are the best predictors of spam. Sometimes this will reveal unsuspected
correlations or new trends, and thereby lead to a better understanding of the problem.
Digging into large amounts of data to discover hidden patterns is called data
mining, and machine learning excels at it. </h5>

![image-6.png](attachment:4de23256-4c6d-4147-a98a-b6c5ef40b645.png)

<h3>(e) To summarize, machine learning is great for:</h3>
<font size = "+1">• Problems for which existing solutions require a lot of fine-tuning or long lists of rules (a machine learning model can often simplify code and perform better than
the traditional approach)<br>
• Complex problems for which using a traditional approach yields no good solution
(the best machine learning techniques can perhaps find a solution)<br>
• Fluctuating environments (a machine learning system can easily be retrained on
new data, always keeping it up to date)<br>
• Getting insights about complex problems and large amounts of data

<h3>(f) Examples of Applications:</h3>
<font size = "+1">• <i>Detecting tumors in brain scans </i><br>
This is semantic image segmentation, where each pixel in the image is classified
(as we want to determine the exact location and shape of tumors), typically using
CNNs or transformers.<br><br>
• <i>Automatically classifying news articles </i><br>
This is natural language processing (NLP), and more specifically text classification,
which can be tackled using recurrent neural networks (RNNs) and CNNs,
but transformers work even better.<br><br>
• <i>Automatically flagging offensive comments on discussion forums </i><br>
This is also text classification, using the same NLP tools.<br><br>
• <i>Making your app react to voice commands </i><br>
This is speech recognition, which requires processing audio samples: since they
are long and complex sequences, they are typically processed using RNNs, CNNs,
or transformers. <br><br>
• <i>Representing a complex, high-dimensional dataset in a clear and insightful diagram </i><br>
This is data visualization, often involving dimensionality reduction techniques.<br><br>
• <i>Recommending a product that a client may be interested in, based on past purchases </i><br>
This is a recommender system. One approach is to feed past purchases (and
other information about the client) to an artificial neural network (see Chapter 10),
and get it to output the most likely next purchase. This neural net would
typically be trained on past sequences of purchases across all clients.<br><br>

<H1> 2. Types of Machine Learning Systems </h1>
<font size = "+1"> There are so many different types of machine learning systems that it is useful to
classify them in broad categories, based on the following criteria:<p><br>
• How they are supervised during training (supervised, unsupervised, reinforcement, self-supervised, and others)<br><br>
• Whether or not they can learn incrementally on the fly (online versus batch 
learning)<br><br>
• Whether they work by simply comparing new data points to known data points,
or instead by detecting patterns in the training data and building a predictive
model, much like scientists do (instance-based versus model-based learning)

<h3> 2.1 Training Supervision </h3>
<font size = "+1"> ML systems can be classified according to the amount and type of supervision they
get during training. </font>
<h4> 2.1.1 Supervised Learning </h4>
<font size = "+1">In supervised learning, the training set you feed to the algorithm includes the desired solutions, called labels.<br><br> A typical supervised learning task is <b>classification</b>. The spam filter is a good example of this: it is trained with many example emails along with their class (spam or ham), and it must learn how to classify new emails. Example algorithms: logistic regression and decision trees.

![image-8.png](attachment:ff4ccc0d-0d18-46b1-a7fd-165fa8efd9e9.png)
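
To make the classification task concrete, here is a minimal logistic regression trained with plain gradient descent. It is a sketch under several assumptions: each email is reduced to a single made-up feature (the number of suspicious words it contains), the training data is invented, and the learning rate and number of epochs are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(spam | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

def predict(x, w, b):
    """Return 1 (spam) if the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Feature: number of suspicious words; label: 1 = spam, 0 = ham (made-up data).
xs = [0, 0, 1, 1, 3, 4, 5, 6]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = train_logistic(xs, ys)

print(predict(0, w, b))  # an email with no suspicious words
print(predict(6, w, b))  # an email with six suspicious words
```

The labels in `ys` are what make this supervised: the algorithm is told the right answer for every training email.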

<font size = "+1"> Another typical task is to predict a target numeric value, such as the price of a
car, given a set of features (mileage, age, brand, etc.). This sort of task is called
<b>regression</b>. To train the system, you need to give it many examples of
cars, including both their features and their targets (i.e., their prices). Example algorithms: linear regression and decision trees.

![image-9.png](attachment:ba16ed89-438a-44cc-9993-28171f417766.png)
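
A minimal regression sketch follows, using the closed-form least-squares fit for a single feature. The car data is invented and deliberately falls on a perfect line so the result is easy to check; real data would be noisy and would have more features than mileage alone.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Made-up training set: mileage in thousands of km, price in dollars.
mileage = [10, 20, 30, 40, 50]
price = [27000, 24000, 21000, 18000, 15000]

intercept, slope = fit_line(mileage, price)
predicted = intercept + slope * 25   # price estimate for a car with 25,000 km
print(intercept, slope, predicted)
```

Here the target is a number (a price) rather than a class, which is what distinguishes regression from classification.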

<h4> 2.1.2 Unsupervised Learning </h4>
<font size = "+1">In unsupervised learning, as you might guess, the training data is unlabeled. The system tries to learn without a teacher. Typical tasks include clustering and dimensionality reduction.<br>
For example, say you have a lot of data about your blog’s visitors. You may want
to run a clustering algorithm to try to detect groups of similar visitors.
At no point do you tell the algorithm which group a visitor belongs to: it finds
those connections without your help. <br>For example, it might notice that 40% of your
visitors are teenagers who love comic books and generally read your blog after school,
while 20% are adults who enjoy sci-fi and who visit during the weekends.

![image-10.png](attachment:97ba6903-74d4-4833-b4aa-956b603ce1e9.png)
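
The sketch below runs a very small k-means clustering on visitor ages. The text does not name a clustering algorithm, so k-means is an assumption here, as are the invented ages and the choice of starting centroids.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Tiny 1D k-means: assign each value to its nearest centroid, then
    move each centroid to the mean of the values assigned to it."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Made-up visitor ages; no group labels are given to the algorithm.
ages = [13, 14, 15, 16, 35, 38, 40, 42]
centroids, clusters = kmeans_1d(ages, centroids=[ages[0], ages[-1]])
print(centroids)
print(clusters)
```

The algorithm is never told which visitors are teenagers and which are adults; the two groups emerge from the ages alone.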

<font size = "+1"><b>Visualization </b> algorithms are also good examples of unsupervised learning: you feed
them a lot of complex and unlabeled data, and they output a 2D or 3D representation
of your data that can easily be plotted. <br><br>
A related task is <b> dimensionality reduction</b>, in which the goal is to simplify the data
without losing too much information. One way to do this is to merge several correlated features into one. <br> For example, a car’s mileage may be strongly correlated with its
age, so the dimensionality reduction algorithm will merge them into one feature that
represents the car’s wear and tear. This is called <i>feature extraction</i>.
<br><br>Yet another important unsupervised task is <b>anomaly detection. </b><br>Examples include detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, and automatically removing outliers from a dataset before feeding it to another learning algorithm. The system is shown mostly normal instances during training, so it learns to recognize them; then, when it sees a new instance, it can tell whether it looks like a normal one or whether it is likely an anomaly.

![image-12.png](attachment:8ba2cf2b-8e7a-4753-bd49-54b7f9f85294.png)
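
One very simple way to sketch anomaly detection is to learn the mean and standard deviation of normal instances and flag anything that lies too many standard deviations away. This z-score rule is an assumption chosen for brevity, not the method the text has in mind; the transaction amounts and the cutoff of 3 are made up.

```python
from statistics import mean, stdev

def fit_normal_profile(amounts):
    """Learn what 'normal' looks like from mostly normal training data."""
    return mean(amounts), stdev(amounts)

def is_anomaly(amount, profile, max_z=3.0):
    """Flag an amount that is more than `max_z` standard deviations from the mean."""
    mu, sigma = profile
    return abs(amount - mu) / sigma > max_z

# Made-up credit card transaction amounts seen during training.
normal_amounts = [20, 25, 22, 30, 27, 24, 26, 23]
profile = fit_normal_profile(normal_amounts)

print(is_anomaly(28, profile))    # close to the usual amounts
print(is_anomaly(500, profile))   # far outside the usual range
```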

<h4> 2.1.3 Semi-supervised Learning </h4>
<font size = "+1">Since labeling data is usually time-consuming and costly, you will often have plenty
of unlabeled instances, and few labeled instances. Some algorithms can deal with data
that’s partially labeled. This is called semi-supervised learning.

![image-13.png](attachment:495135a4-7a51-41e8-94b0-d9ae6ba81603.png)

<font size = "+1"> Some photo-hosting services, such as Google Photos, are good examples of this. Once
you upload all your family photos to the service, it automatically recognizes that the
same person A shows up in photos 1, 5, and 11, while another person B shows up in
photos 2, 5, and 7. This is the unsupervised part of the algorithm (clustering). Now
all the system needs is for you to tell it who these people are. Just add one label per
person and it is able to name everyone in every photo, which is useful for searching
photos.<br><br>
Most semi-supervised learning algorithms are combinations of unsupervised and
supervised algorithms. For example, a clustering algorithm may be used to group
similar instances together, and then every unlabeled instance can be labeled with the
most common label in its cluster. Once the whole dataset is labeled, it is possible to
use any supervised learning algorithm.
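
The label-propagation step described above can be sketched as follows. The cluster ids are assumed to come from some earlier clustering step, and the function name and sample names are hypothetical.

```python
from collections import Counter

def propagate_labels(cluster_ids, labels):
    """Give every unlabeled instance the most common label in its cluster.

    `labels[i]` is None when instance i is unlabeled.
    """
    votes = {}
    for cid, label in zip(cluster_ids, labels):
        if label is not None:
            votes.setdefault(cid, Counter())[label] += 1
    majority = {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}
    # Instances in a cluster with no labeled member stay None.
    return [label if label is not None else majority.get(cid)
            for cid, label in zip(cluster_ids, labels)]

# Six photos grouped into two clusters; only one photo per person is labeled.
cluster_ids = [0, 0, 0, 1, 1, 1]
labels = ["Alice", None, None, None, "Bob", None]
full_labels = propagate_labels(cluster_ids, labels)
print(full_labels)
```

After this step the dataset is fully labeled and can be passed to any supervised learning algorithm.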

<h4> 2.1.4 Self-supervised Learning </h4>
<font size = "+1">Another approach to machine learning involves actually generating a fully labeled
dataset from a fully unlabeled one.<br>
For example, suppose that what you really want is to have a pet classification model:
given a picture of any pet, it will tell you what species it belongs to. If you have a
large dataset of unlabeled photos of pets, you can start by training an image-repairing
model using self-supervised learning. <br>To do this, you randomly mask
a small part of each image and then train a model to recover the original image. During training, the masked images are used as the inputs to the
model, and the original images are used as the labels.


![image-14.png](attachment:cb901a40-5bc1-49e6-9c49-7b41af1616f3.png)
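
The key trick, building labeled pairs out of unlabeled data, can be sketched without any model at all. Below, each “image” is just a short list of made-up pixel values, and `MASK`, `make_training_pair`, and the patch size are assumptions of this sketch.

```python
import random

MASK = None  # placeholder standing in for a hidden pixel

def make_training_pair(image, mask_size, rng):
    """Return (masked_input, label): the label is the untouched original."""
    start = rng.randrange(len(image) - mask_size + 1)
    masked = list(image)
    masked[start:start + mask_size] = [MASK] * mask_size
    return masked, list(image)

rng = random.Random(0)
unlabeled_images = [
    [0.1, 0.5, 0.9, 0.3, 0.7, 0.2],
    [0.8, 0.6, 0.4, 0.2, 0.0, 0.1],
]
dataset = [make_training_pair(img, mask_size=2, rng=rng) for img in unlabeled_images]

for masked, label in dataset:
    print(masked, "->", label)
```

No human labeling was needed: the inputs and the labels both come from the same unlabeled images.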

<font size = "+1">Once the image-repairing model is performing well, it should be able
to distinguish different pet species: when it repairs an image of a cat whose face is masked, it must know not to add a dog’s face.<br> It is now possible to tweak
the model so that it predicts pet species instead of repairing images. The final step
consists of fine-tuning the model on a labeled dataset: the model already knows what
cats, dogs, and other pet species look like, so this step is only needed so the model
can learn the mapping between the species it already knows and the labels we expect
from it.

<h4> 2.1.5 Reinforcement learning </h4>
<font size = "+1">Reinforcement learning is a very different beast. The learning system, called an agent
in this context, can observe the environment, select and perform actions, and get
rewards in return (or penalties in the form of negative rewards). It must then learn by itself what is the best strategy, called a policy, to get
the most reward over time. A policy defines what action the agent should choose
when it is in a given situation.<br>

![image-15.png](attachment:c62b0fe8-3684-4357-bfce-74a1228d47cf.png)

![image-16.png](attachment:76476f0b-e6ea-45f5-8deb-43a94a3dd4a9.png)
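
Below is a heavily simplified sketch of an agent learning a policy from rewards. It assumes a toy environment invented for illustration (a robot with two battery states and two actions), gives rewards immediately, and ignores how actions change the next state, so it is closer to a bandit problem than to full reinforcement learning. The exploration rate and step count are arbitrary.

```python
import random

# Hypothetical environment: reward for each action in each situation.
REWARDS = {
    "low_battery":  {"recharge": 1.0, "work": -10.0},
    "full_battery": {"recharge": 0.0, "work": 5.0},
}
ACTIONS = ["recharge", "work"]

def learn_policy(steps=2000, epsilon=0.2, seed=0):
    """Estimate the average reward of each action per state, then act greedily."""
    rng = random.Random(seed)
    value = {s: {a: 0.0 for a in ACTIONS} for s in REWARDS}
    count = {s: {a: 0 for a in ACTIONS} for s in REWARDS}
    for _ in range(steps):
        state = rng.choice(list(REWARDS))          # observe the environment
        if rng.random() < epsilon:                 # explore occasionally
            action = rng.choice(ACTIONS)
        else:                                      # otherwise exploit
            action = max(ACTIONS, key=lambda a: value[state][a])
        reward = REWARDS[state][action]            # get a reward or a penalty
        count[state][action] += 1
        value[state][action] += (reward - value[state][action]) / count[state][action]
    # The policy maps each situation to the action with the best estimate.
    return {s: max(ACTIONS, key=lambda a: value[s][a]) for s in REWARDS}

policy = learn_policy()
print(policy)
```

Nobody tells the agent which action is correct; it discovers the policy by trying actions and observing the rewards.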

<h3> 2.2 Batch Versus Online Learning </h3>
 <font size = "+1">Another criterion used to classify machine learning systems is whether or not the
system can learn incrementally from a stream of incoming data. </font>
<h4> 2.2.1 Batch learning </h4>

<font size = "+1"> In batch learning, the system is incapable of learning incrementally: it must be trained
using all the available data. This will generally take a lot of time and computing
resources, so it is typically done offline. First the system is trained, and then it is
launched into production and runs without learning anymore; it just applies what it
has learned. This is called <i>offline learning.</I><br><br>Unfortunately, a model’s performance tends to decay slowly over time, simply because
the world continues to evolve while the model remains unchanged. This phenomenon
is often called model rot or data drift. The solution is to regularly retrain the
model on up-to-date data.<br><br>If you want a batch learning system to know about new data (such as a new type of
spam), you need to train a new version of the system from scratch on the full dataset
(not just the new data, but also the old data), then replace the old model with the new
one. Fortunately, the whole process of training, evaluating, and launching a machine
learning system can be automated fairly easily.<br>This solution is simple and often works fine, but training using the full set of data
can take many hours, so you would typically train a new system only every 24 hours
or even just weekly.<br>When retraining from scratch is too slow or too costly, a better option is to use algorithms that are capable of learning
incrementally.
<h4> 2.2.2 Online learning</h4>
<font size = "+1"> In online learning, you train the system incrementally by feeding it data instances
sequentially, either individually or in small groups called mini-batches. Each learning
step is fast and cheap, so the system can learn about new data on the fly.

![image-17.png](attachment:d744913e-69db-4633-bcb7-51bb1edfc481.png)

<font size = "+1"> Online learning is useful for systems that need to adapt to change extremely rapidly
(e.g., to detect new patterns in the stock market). It is also a good option if you have
limited computing resources; for example, if the model is trained on a mobile device.
<p><font size = "+1">Additionally, online learning algorithms can be used to train models on huge datasets
that cannot fit in one machine’s main memory (this is called <i>out-of-core </i>learning).
The algorithm loads part of the data, runs a training step on that data, and repeats the
process until it has run on all of the data. Out-of-core learning is usually done offline (that is, not on the live system), so think of it as incremental learning.

![image-19.png](attachment:5c4bc8c9-503d-415d-86af-9b2d85363b19.png)

<font size = "+1"> One important parameter of online learning systems is how fast they should adapt
to changing data: this is called the learning rate. If you set a high learning rate, then
your system will rapidly adapt to new data, but it will also tend to quickly forget the
old data (and you don’t want a spam filter to flag only the latest kinds of spam it was
shown). <br> Conversely, if you set a low learning rate, the system will have more inertia;
that is, it will learn more slowly, but it will also be less sensitive to noise in the new
data or to sequences of nonrepresentative data points (outliers).
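
The effect of the learning rate can be seen with the simplest possible online learner: a running estimate that moves a fraction of the way toward each new value. The update rule, the stream, and both learning rates are assumptions chosen for illustration.

```python
def online_estimate(stream, learning_rate, initial=0.0):
    """Update an estimate one instance at a time, without storing the stream."""
    estimate = initial
    for x in stream:
        estimate += learning_rate * (x - estimate)
    return estimate

# The data hovers at 10 for a long time, then suddenly shifts to 20.
stream = [10.0] * 50 + [20.0] * 5

fast = online_estimate(stream, learning_rate=0.5, initial=10.0)
slow = online_estimate(stream, learning_rate=0.05, initial=10.0)
print(fast, slow)
```

After only five new values, the high-learning-rate estimate has almost completely forgotten the old level, while the low-learning-rate estimate has barely moved. If those five values were outliers rather than a real shift, the slow learner would be the safer one.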

<h3> 2.3 Instance-Based Versus Model-Based Learning </h3>
<font size = "+1"> One more way to categorize machine learning systems is by how they generalize.
Most machine learning tasks are about making predictions. This means that given a number of training examples, the system needs to be able to make good predictions for (generalize to) examples it has never seen before. Having a good performance measure on the training data is good, but insufficient; the true goal is to perform well on new instances.<br></font>
<h4> 2.3.1 Instance-based learning </h4>
<font size = "+1">  The most trivial form of learning is simply to learn by heart. If you were to
create a spam filter this way, it would just flag all emails that are identical to emails
that have already been flagged by users: not the worst solution, but certainly not the
best.<br>
Instead of just flagging emails that are identical to known spam emails, your spam filter could be programmed to also flag emails that are very similar to known spam emails. This requires a measure of similarity between two emails. A (very basic) similarity measure between two emails could be to count the number of words they have in common. The system would flag an email as spam if it has many words in common with a known spam email.<br>
This is called <i>instance-based learning</I>: the system learns the examples, then generalizes to new cases by using a similarity measure to compare them to the learned examples (or a subset of them).

![image-20.png](attachment:51928d9d-72e9-430e-b627-7f75cab5a7fb.png)
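
The word-counting similarity measure described above takes only a few lines. The known spam emails, the function names, and the cutoff of three shared words are assumptions made for this sketch.

```python
def common_words(a, b):
    """Very basic similarity: the number of distinct words two emails share."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def looks_like_spam(email, known_spam, min_common=3):
    """Flag an email that shares many words with any known spam email."""
    return any(common_words(email, spam) >= min_common for spam in known_spam)

# The "learned" examples are simply stored as they are.
known_spam = ["win a free credit card today", "amazing pills 4u cheap"]

print(looks_like_spam("get your free credit card now", known_spam))
print(looks_like_spam("lunch today at noon", known_spam))
```

There is no training step beyond storing the examples; all the work happens when a new email is compared against them.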

<h4> 2.3.2 Model-based learning </h4>
<font size = "+1">Another way to generalize from a set of examples is to build a model of these examples and then use that model to make predictions.<br>
This is called <i>model-based learning</I>.

![image.png](attachment:10c09442-7c09-4190-ad44-b045cb1b67ee.png)

<font size = "+1">In summary:<br>
• You studied the data.<br>
• You selected a model.<br>
• You trained it on the training data (i.e., the learning algorithm searched for the
model parameter values that minimize a cost function).<br>
• Finally, you applied the model to make predictions on new cases (this is called
inference), hoping that this model will generalize well.<br>
This is what a typical machine learning project looks like.

<h1>3. Main Challenges of Machine Learning</h1>
<font size = "+1"> In short, since your main task is to select a model and train it on some data, the two
things that can go wrong are “bad model” and “bad data”. <u>Let’s start with examples of bad data</u>.<br>
<h3>3.1.1 Insufficient Quantity of Training Data </h3> <font size = "+1">
Unlike humans, most machine learning algorithms need a lot of data
to work properly. Even for very simple problems you typically
need thousands of examples, and for complex problems such as image or speech
recognition you may need millions of examples (unless you can reuse parts of an
existing model).


-------------------------------------------------------------

<font size = "+1.5"> <u>Question 1.</u> What is The Unreasonable Effectiveness of Data ?<br><br><u>Answer.</u> In a famous paper published in 2001, Microsoft researchers Michele Banko and Eric
Brill showed that very different machine learning algorithms, including fairly simple ones, performed almost identically well on a complex problem of natural language
disambiguation once they were given enough data. <br> As the authors put it, “these results suggest that we may want to reconsider the trade-off
between spending time and money on algorithm development versus spending it
on corpus development.” <br>The idea that data matters more than algorithms for complex problems was further
popularized by Peter Norvig et al. in a 2009 paper titled “The Unreasonable Effectiveness of Data”. </h5>


![image.png](attachment:c4fd30f8-f0fe-44fe-9af2-9ca7e64530bd.png)

-----------------------------------------------------------------------------

<h3>3.1.2 Non-representative Training Data </h3> <font size = "+1">
In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. This is true whether you use instance-based learning or model-based learning.


<font size = "+1"> For example, the set of countries used to train the linear model in the life satisfaction example shown below was
not perfectly representative; it did not contain any country with a GDP per capita lower than $23,500 or higher than $62,500.

![image.png](attachment:7d5a9529-567c-4395-b7e0-33a7b4efd7f7.png)

<font size = "+1"> If we add countries with a GDP per capita lower than $23,500 or higher than $62,500 and train a linear model on the extended data, we get the solid line, while the old model is represented by the dotted line.

![image.png](attachment:2089ed9d-c271-49cb-890c-368a5aaa8259.png)

<font size = "+1">As you can see, not only does adding a few missing
countries significantly alter the model, but it makes it clear that such a simple linear
model is probably never going to work well. It seems that very rich countries are not
happier than moderately rich countries (in fact, they seem slightly unhappier!), and
conversely some poor countries seem happier than many rich countries. <br>
If the sample is too small, you
will have <i>sampling noise</i> (i.e., nonrepresentative data as a result of chance), but even very large samples can be nonrepresentative if the sampling method is flawed. This is called <i>sampling bias.</i>

------------------------------------------------------------------------------

<font size = "+1.5"> <u>Question 2.</u> Write an example of Sampling Bias ?<br><br><u>Answer.</u>  The most famous example of sampling bias happened during the US presidential election in 1936, which pitted Landon against Roosevelt: the <i>Literary Digest</i> conducted a very large poll, sending mail to about 10 million people. It got 2.4 million answers, and predicted with high confidence that Landon would get 57% of the votes. Instead, Roosevelt won with 62% of the votes. The flaw was in the <i>Literary Digest’s</i> sampling method:<br><br>
• First, to obtain the addresses to send the polls to, the <i>Literary Digest</i> used telephone directories, lists of magazine subscribers, club membership lists, and the like. All of these lists tended to favor wealthier people, who were more likely to vote Republican (hence Landon). <br><br>
• Second, less than 25% of the people who were polled answered. Again this introduced a sampling bias, by potentially ruling out people who didn’t care
much about politics, people who didn’t like the <i>Literary Digest</i>, and other key
groups. This is a special type of sampling bias called nonresponse bias. </h5>


--------------------------------------------------

<h3>3.1.3 Poor-Quality Data </h3> <font size = "+1">
Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor-quality
measurements), it will make it harder for the system to detect the underlying
patterns, so your system is less likely to perform well. It is often well worth the effort
to spend time cleaning up your training data. For example:
<br><br>• If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually. <br><br>
• If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median age), or train one model with the feature and one model without it.
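
As a small illustration of one of these options, the sketch below fills missing ages with the median of the known ages. The sample values and the use of `None` to mark a missing value are assumptions.

```python
from statistics import median

def fill_missing_with_median(values):
    """Replace missing entries (None) with the median of the known entries."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]

# Two customers did not specify their age.
ages = [25, None, 40, 31, None, 58]
cleaned = fill_missing_with_median(ages)
print(cleaned)
```

The median is computed from the known values only, so the missing entries do not distort it.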


<h3>3.1.4 Irrelevant Features </h3> <font size = "+1">
As the saying goes: garbage in, garbage out. Your system will only be capable of
learning if the training data contains enough relevant features and not too many
irrelevant ones. A critical part of the success of a machine learning project is coming
up with a good set of features to train on. This process, called feature engineering,
involves the following steps:
<br><br>• Feature selection (selecting the most useful features to train on among existing features). <br><br>
• Feature extraction (combining existing features to produce a more useful one; as we saw earlier, dimensionality reduction algorithms can help).<br><br>
• Creating new features by gathering new data.<br><br>

<u> Now let’s look at some examples of bad algorithms. </u>
<h3> 3.2.1 Overfitting the Training Data</h3> <font size = "+1">
Say you are cheated on by a partner: you might be tempted to say that all men or all women are cheaters. Overgeneralizing is
something that we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful. In machine learning this is called <b>overfitting</b>. This means that the model performs well on the training data, but it does not generalize
well, i.e., it performs poorly on the test set or on any data it has never seen before.
<br> Even though a high-degree polynomial life satisfaction model performs much better on the training data than the simple linear model, would you really trust its predictions, for example at a GDP per capita of 80,000 USD?

![image.png](attachment:595cd914-5c4b-469a-8ae4-ad4c68e78909.png)

<font size = "+1"> Constraining a model to make it simpler and reduce the risk of overfitting is called <i>regularization</i>.<br>
<font size = "+1"> The dotted line represents the original model that
was trained on the countries represented as circles (without the countries represented
as squares), the solid line is our second model trained with all countries (circles and
squares), and the dashed line is a model trained with the same data as the first model
but with a regularization constraint. You can see that regularization forced the model
to have a smaller slope: this model does not fit the training data (circles) as well as
the first model, but it actually generalizes better to new examples that it did not see
during training (squares).

![image.png](attachment:73d56bd8-b4a0-4977-a007-5d075edd0a74.png)

-------------------------------

<font size = "+1.5"> <u>Question 3.</u> What is regularization ?<br><br><u>Answer.</u>  Constraining a model to make it simpler and reduce the risk of overfitting is called <i>regularization.</i><br>For example, the linear model we defined earlier has two parameters,
θ<sub>0</sub> and θ<sub>1</sub>. This gives the learning algorithm two degrees of freedom to adapt the model
to the training data: it can tweak both the height (θ<sub>0</sub>) and the slope (θ<sub>1</sub>) of the line. If
we forced θ<sub>1</sub> = 0, the algorithm would have only one degree of freedom and would
have a much harder time fitting the data properly: all it could do is move the line
up or down to get as close as possible to the training instances, so it would end up
around the mean. A very simple model indeed! If we allow the algorithm to modify
θ<sub>1</sub> but we force it to keep it small, then the learning algorithm will effectively have
somewhere in between one and two degrees of freedom. It will produce a model that’s
simpler than one with two degrees of freedom, but more complex than one with just
one. You want to find the right balance between fitting the training data perfectly and
keeping the model simple enough to ensure that it will generalize well </h5>
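
The idea of keeping θ<sub>1</sub> small can be sketched with a penalty on the slope. The text does not say which kind of regularization is used, so the squared (ridge-style) penalty below is an assumption, as are the data points and the name `alpha` for the penalty strength.

```python
def fit_regularized_line(xs, ys, alpha):
    """Minimize sum((y - theta0 - theta1*x)**2) + alpha * theta1**2.

    alpha = 0 is ordinary least squares; a larger alpha forces a smaller slope.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    theta1 = sxy / (sxx + alpha)
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1

xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]   # made-up training data

free = fit_regularized_line(xs, ys, alpha=0.0)    # two full degrees of freedom
mild = fit_regularized_line(xs, ys, alpha=5.0)    # slope is held back
flat = fit_regularized_line(xs, ys, alpha=1e9)    # slope is effectively forced to 0

for name, (theta0, theta1) in [("free", free), ("mild", mild), ("flat", flat)]:
    print(name, theta0, theta1)
```

With a huge `alpha` the line becomes flat and sits at the mean of the targets, which matches the θ<sub>1</sub> = 0 case described in the answer.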


---------------------

<h3> 3.2.2 Underfitting the Training Data</h3> <font size = "+1">Underfitting is the opposite of overfitting: it occurs when your
model is too simple to learn the underlying structure of the data. For example, a
linear model of life satisfaction is prone to underfit; reality is just more complex
than the model, so its predictions are bound to be inaccurate, even on the training
examples.<br><br>
Here are the main options for fixing this problem:<br><br>
• Select a more powerful model, with more parameters.<br><br>
• Feed better features to the learning algorithm (feature engineering).<br><br>
• Reduce the constraints on the model (for example, by reducing the regularization).

![image.png](attachment:c2923386-87f1-4fed-bf56-9562942d6a23.png)

<h1>4. Testing and Validating </h1>
<font size = "+1"> The only way to know how well a model will generalize to new cases is to actually
try it out on new cases. One way to do that is to put your model in production and
monitor how well it performs. This works well, but if your model is horribly bad,
your users will complain—not the best idea.<br>
A better option is to split your data into two sets: the <i>training set</i> and the <i>test set.</i>
As these names imply, you train your model using the training set, and you test it
using the test set. The error rate on new cases is called the generalization error (or
<i>out-of-sample error</i>), and by evaluating your model on the test set, you get an estimate
of this error. This value tells you how well your model will perform on instances it
has never seen before.<br>
If the training error is low (i.e., your model makes few mistakes on the training
set) but the generalization error is high, it means that your model is overfitting the
training data.
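
The sketch below splits a made-up dataset into a training set and a test set, then evaluates a deliberately extreme model that only memorizes its training examples. The 80/20 split, the seed, and the function names are assumptions.

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and hold out a fraction as the test set."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def error_rate(predict, dataset):
    """Fraction of instances for which the prediction is wrong."""
    return sum(1 for x, y in dataset if predict(x) != y) / len(dataset)

# Made-up task: the label is 1 when x is at least 50, else 0.
data = [(x, 1 if x >= 50 else 0) for x in range(100)]
train_set, test_set = train_test_split(data)

# A model that learns the training set by heart and nothing else.
memory = dict(train_set)
def memorizer(x):
    return memory.get(x, -1)   # -1 means "never seen this input"

training_error = error_rate(memorizer, train_set)
generalization_error = error_rate(memorizer, test_set)
print(training_error, generalization_error)
```

A zero training error combined with a high error on the test set is exactly the signature of overfitting described above.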

-------------------------------------

<font size = "+1.5"> <u>Question 4.</u> What are Hyperparameters ?<br><br><u>Answer.</u>  A Machine Learning model is defined as a mathematical model with a number of parameters that need to be learned from the data. By training a model with existing data, we are able to fit the model parameters. 
However, there is another kind of parameter, known as Hyperparameters, that cannot be directly learned from the regular training process. They are usually fixed before the actual training process begins. These parameters express important properties of the model such as its complexity or how fast it should learn. <br> <br></h5>
Some examples of model hyperparameters include:<br>

<font size = "+1.5"> 1. The penalty in Logistic Regression Classifier i.e. L1 or L2 regularization. <br>
<font size = "+1.5"> 2. The learning rate for training a neural network. <br>
<font size = "+1.5"> 3. The C and sigma hyperparameters for support vector machines. <br>
<font size = "+1.5"> 4. The k in k-nearest neighbors. <br>


---------------------------------

<h1>5. Hyperparameter Tuning and Model Selection </h1> <font size = "+1">Evaluating a model is simple enough: just use a test set. But suppose you are hesitating between two types of models (say, a linear model and a polynomial model): how
can you decide between them? One option is to train both and compare how well
they generalize using the test set.<br> Now suppose that the linear model generalizes better, but you want to apply some
regularization to avoid overfitting. The question is, how do you choose the value of
the regularization hyperparameter? One option is to train 100 different models using
100 different values for this hyperparameter. Suppose you find the best hyperparameter value that produces a model with the lowest generalization error—say, just 5%
error. You launch this model into production, but unfortunately it does not perform
as well as expected and produces 15% errors. What just happened?<br>
The problem is that you measured the generalization error multiple times on the test
set, and you adapted the model and hyperparameters to produce the best model for
that particular set. This means the model is unlikely to perform as well on new data.<br>
A common solution to this problem is called holdout validation: you
simply hold out part of the training set to evaluate several candidate models and select the best one. The new held-out set is called the validation set (or the development
set, or dev set). More specifically, you train multiple models with various
hyperparameters on the reduced training set (i.e., the full training set minus the
validation set), and you select the model that performs best on the validation set.
After this holdout validation process, you train the best model on the full training set
(including the validation set), and this gives you the final model. Lastly, you evaluate
this final model on the test set to get an estimate of the generalization error.

![image.png](attachment:d73a398d-2391-4d69-b933-5e27e79bf167.png)
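
The holdout validation procedure can be sketched end to end. Everything specific here is an assumption: the data is invented and treated as already shuffled, the model is a line with a squared penalty on its slope, `alpha` is the hyperparameter being tuned, and the split sizes and candidate values are arbitrary.

```python
def fit_ridge(points, alpha):
    """Fit y = b + w*x while penalizing the slope with strength alpha."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    w = sxy / (sxx + alpha)
    return y_mean - w * x_mean, w

def mse(model, points):
    """Mean squared error of a (b, w) model on a set of (x, y) points."""
    b, w = model
    return sum((b + w * x - y) ** 2 for x, y in points) / len(points)

data = [(1, 3.2), (6, 12.9), (3, 7.1), (8, 17.2), (2, 4.8),
        (7, 15.1), (4, 9.0), (5, 10.8), (9, 19.1), (10, 20.9)]

# 1. Set the test set aside; it is not touched until the very end.
train_full, test = data[:8], data[8:]
# 2. Hold out part of the training set as the validation set.
train, valid = train_full[:6], train_full[6:]

# 3. Train one candidate per hyperparameter value on the reduced training set.
candidates = [0.0, 0.1, 1.0, 10.0]
val_errors = {a: mse(fit_ridge(train, a), valid) for a in candidates}
best_alpha = min(val_errors, key=val_errors.get)

# 4. Retrain the winner on the full training set, then evaluate once on the test set.
final_model = fit_ridge(train_full, best_alpha)
test_error = mse(final_model, test)
print("chosen alpha:", best_alpha)
print("test MSE:", test_error)
```

Because the test set plays no part in choosing `alpha`, the final test error remains an honest estimate of the generalization error.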

-------------------------------------

<font size = "+1.5"> <u>Question 5.</u> What is No Free Lunch Theorem ?<br><br><u>Answer.</u>  In a famous 1996 paper, David Wolpert demonstrated that if you make absolutely
no assumption about the data, then there is no reason to prefer one model over any
other. This is called the No Free Lunch (NFL) theorem. <br>For some datasets the best
model is a linear model, while for other datasets it is a neural network. There is no
model that is a priori guaranteed to work better on all datasets (hence the name of the theorem).
The only way to know for sure which model is best is to evaluate them all.<br> Since
this is not possible, in practice you make some reasonable assumptions about the
data and evaluate only a few reasonable models. For example, for simple tasks you
may evaluate linear models with various levels of regularization, and for a complex
problem you may evaluate various neural networks.

---------------------------------