# Useful resources:

>* [GitHub repository for the code snippets in the book](https://github.com/ageron/handson-ml3)
>* [Python tutorial](https://learnpython.org/)
>* [Matplotlib & Numpy](https://homl.info/tutorials)
>* [Mathematics Resources](https://homl.info/tutorials)
>* [!!♥♥♥ So Important machine learning course (take it after this book)](https://homl.info/skdoc)

# Content:
* [Chapter (1)](#ch1)
* [Chapter (2)](#ch2)


* [Chapter (8)](#ch8)
* [Chapter (9)](#ch9)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

# 🚩 Chapter: 1 <a id='ch1'></a>

>* Your spam filter is a machine learning program that, given examples of spam emails (flagged by users) and examples of regular emails (nonspam, also called “ham”), can learn to flag spam. 
>* The examples that the system uses to learn are called the training set. 
>* Each training example is called a training instance (or sample). 
>* The part of a machine learning system that learns and makes predictions is called a model. 
>* Neural networks and random forests are examples of models.

>* Digging deeper into the data to discover patterns is called `Data Mining`. Machine learning helps find those patterns when we cannot find them ourselves. For example, to distinguish the spoken words "one" and "two", you cannot hard-code rules that tell them apart in every situation and in the presence of noise, but a machine learning model can be trained to find that pattern.
> ![image.png](attachment:image.png)


>To summarize, machine learning is great for:

>* Problems for which existing solutions require a lot of fine-tuning or long lists of rules (a machine learning model can often simplify code and perform better than the traditional approach)

>* Complex problems for which using a traditional approach yields no good solution (the best machine learning techniques can perhaps find a solution)

>* Fluctuating environments (a machine learning system can easily be retrained on new data, always keeping it up to date)

>* Getting insights about complex problems and large amounts of data

> #### `target` & `label` refer to the same thing, but `target` is more common in regression and `label` in classification.
> #### `features` are sometimes called `predictors` or `attributes`.

________________________________________________

> Machine learning systems can be classified by the following criteria:

>* How they are supervised during training **(supervised, unsupervised, semi-supervised, self-supervised, and others)**

>* Whether or not they can learn incrementally on the fly **(online versus batch learning)**

>* Whether they work by simply comparing new data points to known data points, or instead by detecting patterns in the training data and building a predictive model, much like scientists do **(instance-based versus model-based learning)** <font color = 'red'> this type is related to **generalization**</font>.

## Examples of Machine Learning:


>`Analyzing images of products on a production line to automatically classify them`
>This is image classification, typically performed using convolutional neural networks (CNNs; see Chapter 14) or sometimes transformers (see Chapter 16).

>`Detecting tumors in brain scans`
>This is semantic image segmentation, where each pixel in the image is classified (as we want to determine the exact location and shape of tumors), typically using CNNs or transformers.

>`Automatically classifying news articles`
>This is natural language processing (NLP), and more specifically text classification, which can be tackled using recurrent neural networks (RNNs) and CNNs, but transformers work even better (see Chapter 16).

>`Automatically flagging offensive comments on discussion forums`
>This is also text classification, using the same NLP tools.

>`Summarizing long documents automatically`
>This is a branch of NLP called text summarization, again using the same tools.

>`Creating a chatbot or a personal assistant`
>This involves many NLP components, including natural language understanding (NLU) and question-answering modules.

>`Forecasting your company’s revenue next year, based on many performance metrics`
>This is a regression task (i.e., predicting values) that may be tackled using any regression model, such as a linear regression or polynomial regression model (see Chapter 4), a regression support vector machine (see Chapter 5), a regression random forest (see Chapter 7), or an artificial neural network (see Chapter 10). If you want to take into account sequences of past performance metrics, you may want to use RNNs, CNNs, or transformers (see Chapters 15 and 16).

>`Making your app react to voice commands`
>This is speech recognition, which requires processing audio samples: since they are long and complex sequences, they are typically processed using RNNs, CNNs, or transformers (see Chapters 15 and 16).

>`Detecting credit card fraud`
>This is anomaly detection, which can be tackled using isolation forests, Gaussian mixture models (see Chapter 9), or autoencoders (see Chapter 17).

>`Segmenting clients based on their purchases so that you can design a different marketing strategy for each segment`
>This is clustering, which can be achieved using k-means, DBSCAN, and more (see Chapter 9).

>`Representing a complex, high-dimensional dataset in a clear and insightful diagram`
>This is data visualization, often involving dimensionality reduction techniques (see Chapter 8).

>`Recommending a product that a client may be interested in, based on past purchases`
>This is a recommender system. One approach is to feed past purchases (and other information about the client) to an artificial neural network (see Chapter 10), and get it to output the most likely next purchase. This neural net would typically be trained on past sequences of purchases across all clients.

>`Building an intelligent bot for a game`
>This is often tackled using reinforcement learning (RL; see Chapter 18), which is a branch of machine learning that trains agents (such as bots) to pick the actions that will maximize their rewards over time (e.g., a bot may get a reward every time the player loses some life points), within a given environment (such as the game). The famous AlphaGo program that beat the world champion at the game of Go was built using RL.

____________________________

##### Semi-Supervised Algorithm:

> Some photo-hosting services, such as Google Photos, are good examples of this. Once you upload all your family photos to the service, it automatically recognizes that the same person A shows up in photos 1, 5, and 11, while another person B shows up in photos 2, 5, and 7. This is the unsupervised part of the algorithm (clustering). Now all the system needs is for you to tell it who these people are. Just add one label per person and it is able to name everyone in every photo, which is useful for searching photos.

>Most semi-supervised learning algorithms are combinations of unsupervised and supervised algorithms. For example, a clustering algorithm may be used to group similar instances together, and then every unlabeled instance can be labeled with the most common label in its cluster. Once the whole dataset is labeled, it is possible to use any supervised learning algorithm.
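
The cluster-then-label idea described above can be sketched in a few lines. This is only an illustration: the cluster assignments are assumed to come from some clustering algorithm, and `propagate_labels` and the toy photo data are hypothetical names and values, not something from the book.

```python
from collections import Counter

def propagate_labels(cluster_ids, labels):
    """Give every unlabeled instance the most common known label in its cluster.

    cluster_ids: cluster index of each instance (from any clustering algorithm).
    labels: known label of each instance, or None when it is unlabeled.
    """
    votes = {}
    for cluster, label in zip(cluster_ids, labels):
        if label is not None:
            votes.setdefault(cluster, Counter())[label] += 1
    # Most common known label per cluster (clusters without any label are skipped).
    majority = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    return [
        label if label is not None else majority.get(cluster)
        for cluster, label in zip(cluster_ids, labels)
    ]

# Hypothetical toy data: 7 photos grouped into 2 clusters, only 3 are labeled.
cluster_ids = [0, 0, 1, 1, 0, 1, 0]
labels = ["Alice", None, "Bob", None, None, None, "Alice"]
full_labels = propagate_labels(cluster_ids, labels)
print(full_labels)
```

Once `full_labels` is filled in, the whole dataset is labeled and any supervised algorithm can be trained on it.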

_______________________

##### Self-Supervised Algorithm:

> For example, if you have a large dataset of unlabeled images, you can randomly mask a small part of each image and then train a model to recover the original image (Figure 1-12). During training, the masked images are used as the inputs to the model, and the original images are used as the labels.

> The resulting model may be quite useful in itself—for example, to repair damaged images or to erase unwanted objects from pictures. But more often than not, a model trained using self-supervised learning is not the final goal. You’ll usually want to tweak and fine-tune the model for a slightly different task—one that you actually care about.

> For example, suppose that what you really want is to have a pet classification model: given a picture of any pet, it will tell you what species it belongs to. If you have a large dataset of unlabeled photos of pets, you can start by training an image-repairing model using self-supervised learning. Once it’s performing well, it should be able to distinguish different pet species: when it repairs an image of a cat whose face is masked, it must know not to add a dog’s face. Assuming your model’s architecture allows it (and most neural network architectures do), it is then possible to tweak the model so that it predicts pet species instead of repairing images. The final step consists of fine-tuning the model on a labeled dataset: the model already knows what cats, dogs, and other pet species look like, so this step is only needed so the model can learn the mapping between the species it already knows and the labels we expect from it.
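
A minimal sketch of how the self-supervised training pairs could be built, assuming tiny 1-D lists of pixel values stand in for real images. The helper `make_masked_pairs`, the mask size, and the toy values are all hypothetical; training the model itself is left out.

```python
import random

def make_masked_pairs(images, mask_size=2, mask_value=0, seed=42):
    """Build (masked_input, original_label) pairs from unlabeled 1-D 'images'."""
    rng = random.Random(seed)
    pairs = []
    for image in images:
        # Pick a random window and hide it; the untouched image is the label.
        start = rng.randrange(len(image) - mask_size + 1)
        masked = list(image)
        masked[start:start + mask_size] = [mask_value] * mask_size
        pairs.append((masked, list(image)))
    return pairs

# Hypothetical unlabeled "images" (nonzero pixel values only).
images = [[5, 3, 8, 1, 9, 2], [7, 7, 4, 6, 1, 3]]
pairs = make_masked_pairs(images)
for masked, original in pairs:
    print(masked, "->", original)
```

No human labeling is needed: the labels are generated from the data itself, which is what distinguishes this setup from supervised learning.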

_______________________

>* `model rot` or `data drift` is when a model's performance decays over time because the world keeps evolving while the model stays unchanged. "It is a drawback of offline learning / batch learning".

>* Online learning algorithms can be used to train models on huge datasets that cannot fit in one machine's main memory (this is called `out-of-core learning`).
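
To make the out-of-core idea concrete, here is a rough sketch in which a linear model is updated one small batch at a time, so the full dataset never has to sit in memory. The generator `stream_batches` is a hypothetical stand-in for reading chunks from disk, and the relation y = 4 + 3x, the noise level, and the learning rate are assumptions chosen for the demo.

```python
import random

def stream_batches(n_batches, batch_size, seed=0):
    """Hypothetical data source: yields small batches, as if read from disk."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        xs = [rng.uniform(0, 1) for _ in range(batch_size)]
        # Assumed underlying relation: y = 4 + 3x plus a little Gaussian noise.
        yield [(x, 4 + 3 * x + rng.gauss(0, 0.1)) for x in xs]

def partial_fit(theta, batch, learning_rate=0.1):
    """One gradient step on a single batch for the model y = theta0 + theta1 * x."""
    theta0, theta1 = theta
    m = len(batch)
    grad0 = sum(2 * (theta0 + theta1 * x - y) for x, y in batch) / m
    grad1 = sum(2 * (theta0 + theta1 * x - y) * x for x, y in batch) / m
    return theta0 - learning_rate * grad0, theta1 - learning_rate * grad1

theta = (0.0, 0.0)
for batch in stream_batches(n_batches=2000, batch_size=20):
    theta = partial_fit(theta, batch)  # the batch can be discarded afterwards
print(theta)  # should end up close to (4, 3)
```

The learning rate controls how fast the model adapts to new batches, which is the same trade-off online learning systems face with changing data.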


____________________________________

>* Both the `utility function (or fitness function)` and the `cost function` measure performance, but the `utility function` measures how good the model is, while the `cost function` measures how bad it is (how far its predictions are from the true values).
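
A tiny illustration of the two views of performance. Mean squared error is used as the cost here, and its negative as a utility; that pairing is just one convenient assumption for the sketch, and the predictions are made-up numbers.

```python
def mse_cost(y_true, y_pred):
    """Cost function: mean squared error, lower is better."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def utility(y_true, y_pred):
    """Utility function: negated cost, higher is better."""
    return -mse_cost(y_true, y_pred)

# Hypothetical targets and the predictions of two models.
y_true = [3.0, 5.0, 7.0]
pred_a = [2.5, 5.0, 7.5]
pred_b = [1.0, 5.0, 9.0]

print("cost   :", mse_cost(y_true, pred_a), mse_cost(y_true, pred_b))
print("utility:", utility(y_true, pred_a), utility(y_true, pred_b))
```

Both measures rank the models identically: model A has the lower cost and therefore the higher utility.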


## Main Challenges in Machine Learning:

> Challenges are either `bad data` or `bad model`.

###### Insufficient quantity of training data:
> Preparing and improving the data deserves a lot of effort, but because more data is often hard to obtain, it is still worth putting work into choosing the best algorithm and tuning its parameters.

> It is not always easy or cheap to get extra training data, so don't abandon algorithms just yet.

###### Nonrepresentative Training Data:
> The training data for predicting a country's happiness may be nonrepresentative, for example:
>> You may study only the wealthier countries and leave out the others, so your model will generalize poorly (this is **sampling bias**).
>> In the following figure, the dotted line is the model trained on the biased data and the solid line is the model trained on the representative data. The difference shows that a simple linear model is not suitable here: some poor countries seem happier than moderately rich ones, which is counterintuitive.
>>![image.png](attachment:image.png)

> Perhaps the most famous **example of sampling bias** happened during the US presidential election in 1936, which pitted Landon against Roosevelt: the Literary Digest conducted a very large poll, sending mail to about 10 million people. It got 2.4 million answers, and predicted with high confidence that Landon would get 57% of the votes. Instead, Roosevelt won with 62% of the votes. The flaw was in the Literary Digest’s sampling method:

>>* **First,** to obtain the addresses to send the polls to, the Literary Digest used telephone directories, lists of magazine subscribers, club membership lists, and the like. All of these lists tended to favor wealthier people, who were more likely to vote Republican (hence Landon).
>>* **Second,** less than 25% of the people who were polled answered. Again this introduced a sampling bias, by potentially ruling out people who didn’t care much about politics, people who didn’t like the Literary Digest, and other key groups. This is a special type of sampling bias called nonresponse bias.
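
The effect of sampling bias can be reproduced with a small simulation. Every number below (share of wealthy voters, voting probabilities, chance of appearing on the mailing list) is made up for illustration and is not taken from the 1936 poll.

```python
import random

rng = random.Random(1936)

# Hypothetical population: 30% wealthy voters; all probabilities are invented.
population = []
for _ in range(50_000):
    wealthy = rng.random() < 0.30
    votes_a = rng.random() < (0.70 if wealthy else 0.30)
    # Wealthy voters are far more likely to appear on the poll's mailing list.
    listed = rng.random() < (0.90 if wealthy else 0.20)
    population.append((votes_a, listed))

def share_for_a(voters):
    """Fraction of the given voters who vote for candidate A."""
    return sum(votes_a for votes_a, _ in voters) / len(voters)

true_share = share_for_a(population)
random_share = share_for_a(rng.sample(population, 5_000))
mailing_list = [voter for voter in population if voter[1]]
biased_share = share_for_a(rng.sample(mailing_list, 5_000))

print(f"true: {true_share:.3f}  random sample: {random_share:.3f}  "
      f"mailing-list sample: {biased_share:.3f}")
```

The random sample lands close to the true share, while the mailing-list sample overestimates candidate A no matter how many people are polled: a larger biased sample does not fix the bias.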

###### Poor-Quality Data:
> Data cleaning is crucial to fix errors, outliers, noise, etc.

###### Irrelevant Features:
> **Garbage in, garbage out**: you need to feed the model the most relevant features and drop the irrelevant ones.
>> This process, called feature engineering, involves the following steps:
>>* Feature selection (selecting the most useful features to train on among existing features)
>>* Feature extraction (combining existing features to produce a more useful one⁠—as we saw earlier, dimensionality reduction algorithms can help)
>>* Creating new features by gathering new data.

### The previous were `bad data` challenges and following are `bad model` challenges ###

###### Overfitting:
> Good example:
>* Complex models such as deep neural networks can detect subtle patterns in the data, but if the training set is noisy, or if it is too small, which introduces **sampling noise**, then the model is likely to detect patterns in the noise itself (as in the taxi driver example). Obviously these patterns will not generalize to new instances. For example, say you feed your life satisfaction model many more attributes, including uninformative ones such as the country’s name. In that case, a complex model may detect patterns like the fact that all countries in the training data with a w in their name have a life satisfaction greater than 7: New Zealand (7.3), Norway (7.6), Sweden (7.3), and Switzerland (7.5). How confident are you that the w-satisfaction rule generalizes to Rwanda or Zimbabwe?
> ----- **this example is related to how irrelevant features may badly affect the model** -----


> Solutions:
>* Simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model.
>* Gather more training data.
>* Reduce the noise in the training data (e.g., fix data errors and remove outliers).

> Constraining a model to **make it simpler and reduce the risk of overfitting** is called `regularization`. 
>* For example, a linear regression model with two parameters θ0 and θ1 has two `degrees of freedom`. If you force θ1 = 0 to simplify the model, it has only one degree of freedom: the line can only move up or down, so it ends up sitting around the mean. Forcing θ0 = 0 instead also leaves a single degree of freedom. A better technique is to let the algorithm modify θ1 but force it to stay small. This leaves the model with somewhere between one and two degrees of freedom, so it is simpler than the unconstrained model yet still flexible enough to generalize well.
>* Try to find the right balance between fitting the training data and keeping the model simple, to ensure that it generalizes well.
>* The regularization parameter is a `hyper-parameter`: it is set before training and remains constant during training, unlike the model parameters, which are optimized by the training process.
>![image-2.png](attachment:image-2.png)
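
The "keep θ1 small" idea can be sketched with a one-feature linear model whose cost is the sum of squared errors plus `lam * theta1**2`. This is a ridge-style penalty picked for the illustration (the notes do not name a specific regularizer), and the function name and data points are hypothetical.

```python
def fit_ridge_line(xs, ys, lam):
    """Fit y = theta0 + theta1 * x, minimizing squared errors + lam * theta1**2.

    The intercept theta0 is not penalized, so with a huge lam the line
    flattens out at the mean of ys.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    theta1 = s_xy / (s_xx + lam)  # the penalty shrinks the slope toward 0
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1

# Hypothetical training data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

for lam in [0, 10, 1000]:
    theta0, theta1 = fit_ridge_line(xs, ys, lam)
    print(f"lam={lam:<5} theta0={theta0:.3f} theta1={theta1:.3f}")
```

With `lam = 0` the model uses both degrees of freedom; as `lam` grows the slope shrinks toward zero and the model approaches the flat line at the mean.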

###### Underfitting: 
> To solve underfitting, you can: 
>* choose another proper model (more complex one).
>* feed more relevant features to the model.
>* reduce the constraints on the model, for example by reducing the regularization hyperparameter (lambda).


>**Adding more data reduces `overfitting`, but it doesn't help in case of `underfitting`.**
_____________________________________________

>* `generalization error` is the error rate on new cases; evaluating the model on the test set (a set that is supposed to be like the real world) gives an estimate of it.

>* The `80-20%` train-test split is not always the best choice. If your data contains 1 million instances, for example, a `99-1%` split leaves about 10,000 instances for testing, which is sufficient to test on.
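
A minimal sketch of such a split using only the standard library. The function name `split_train_test` and the fixed seed are choices made for this illustration; a range of integers stands in for one million real instances.

```python
import random

def split_train_test(data, test_ratio, seed=42):
    """Shuffle a copy of the data and hold out test_ratio of it for testing."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split on every run
    test_size = round(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]

# One million placeholder instances, split 99% / 1%.
instances = range(1_000_000)
train_set, test_set = split_train_test(instances, test_ratio=0.01)
print(len(train_set), len(test_set))  # 990000 10000
```

Fixing the seed keeps the test set stable across runs, so the model never gets to see test instances during a later training run.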

_____________________________

>### If you are unsure which model to choose among `three models`: 
>* you can evaluate all of them on the test set and choose the best.

>### If you are unsure which of `15 models` to choose "this case does not always arise":
>* You cannot rely on the test set alone: if you evaluate all the models on the same test set and choose the best, you pick the model that happens to fit that particular set, which does not necessarily generalize well ... (TO HANDLE THIS PROBLEM, USE ANOTHER SET CALLED `VALIDATION_SET/DEVELOPMENT_SET/DEV_SET`, AND HOLD OUT THE TEST SET FOR THE FINAL EVALUATION)
 

>### If you have `one model`, and want to choose the `best value for regularization parameter` "a hyperparameter":
>* Do the same: compare the candidate values on the validation set.

>💡 Try `not` to make the validation set `too large` relative to the training set: the candidate models would then be trained on a much smaller training set than the final model, so the comparison is not ideal (it is like selecting the fastest sprinter to participate in a marathon).

>💡 Try `not` to make the validation set `too small` relative to the training set: the model evaluations would be imprecise, so you could not trust the judgement that a certain model is good.


> Final note: if you have a `small amount of data`, you are forced to use small training and validation sets. In that case, use the `cross-validation` technique: split the data into several small validation sets, evaluate each model once per validation set after training it on the rest of the data, and average the scores. Repeat this for every model; the model with the best average is the best model.
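
The cross-validation procedure can be sketched from scratch as follows. The two candidate "models" (predict the mean, or fit a straight line), the helper names, and the synthetic data are all assumptions made for the example; in practice a library routine would normally do this.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k validation folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_mse(fit, xs, ys, k=5):
    """Average validation MSE of a model over k folds."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        # Train on everything except the current validation fold.
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(sum((predict(xs[i]) - ys[i]) ** 2 for i in fold) / len(fold))
    return sum(scores) / k

def fit_mean(xs, ys):
    """Candidate model 1: always predict the mean target."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_line(xs, ys):
    """Candidate model 2: least-squares straight line."""
    x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return lambda x: y_mean + slope * (x - x_mean)

# Hypothetical small dataset: y is roughly 2x + 1.
rng = random.Random(1)
xs = list(range(20))
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

cv_scores = {"mean": cross_val_mse(fit_mean, xs, ys),
             "line": cross_val_mse(fit_line, xs, ys)}
print(cv_scores)
print("best model:", min(cv_scores, key=cv_scores.get))
```

Every instance is used for validation exactly once, which is why this gives a more reliable comparison than a single small validation set.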

_____________________________

### Data Mismatch: 

> Data mismatch happens when, for example, you want to build a mobile app that distinguishes among pictures of flowers. You train the model on flower images from the web, so when it is tested in reality, it performs poorly.
> This is because the web images are quite different from the pictures actually taken with the mobile app. This is `Data Mismatch`.

> To solve this situation, you have two ways:
>* **Preprocess your data** to make the web images look like pictures taken with a mobile device (for example, reduce their quality).
>* **If some of your images happen to be similar to the real mobile photos,** collect them, shuffle them, and put some in the validation set and the others in the test set.


> To check `whether you have data mismatch or overfitting`, turn to the `train-dev` set, which is entirely different from the `dev-set`:
>* One solution is to hold out some of the training pictures (from the web) in yet another set that Andrew Ng dubbed the train-dev set (Figure 1-26). After the model is trained (on the training set, not on the train-dev set), you can evaluate it on the train-dev set. If the model performs poorly, then it must have overfit the training set, so you should try to simplify or regularize the model, get more training data, and clean up the training data. But if it performs well on the train-dev set, then you can evaluate the model on the dev set. If it performs poorly, then the problem must be coming from the data mismatch. You can try to tackle this problem by preprocessing the web images to make them look more like the pictures that will be taken by the mobile app, and then retraining the model. Once you have a model that performs well on both the train-dev set and the dev set, you can evaluate it one last time on the test set to know how well it is likely to perform in production.
>![image.png](attachment:image.png)
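
The decision procedure above can be written down as a small helper. The function `diagnose` and the `gap` threshold (how much worse an error must be to count as "poor") are assumptions made for this sketch, and the error rates are made-up numbers.

```python
def diagnose(train_error, train_dev_error, dev_error, gap=0.05):
    """Tell overfitting from data mismatch using three error rates.

    train_error:     error on the training set (web pictures)
    train_dev_error: error on held-out web pictures (train-dev set)
    dev_error:       error on the dev set (representative mobile pictures)
    gap:             assumed threshold for a significant degradation
    """
    if train_dev_error - train_error > gap:
        # Poor results even on unseen data from the training distribution.
        return "overfitting"
    if dev_error - train_dev_error > gap:
        # Fine on held-out web pictures, poor on mobile pictures.
        return "data mismatch"
    return "ok"

print(diagnose(0.02, 0.15, 0.17))  # overfitting
print(diagnose(0.02, 0.03, 0.20))  # data mismatch
print(diagnose(0.02, 0.03, 0.04))  # ok
```

Only when the result is "ok" does it make sense to evaluate one last time on the test set.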


_____________________________

# 🚩 Chapter: 2 <a id='ch2'></a>

#### Popular open data repositories:

* [OpenML.org](https://openml.org/)

* [Kaggle.com](https://kaggle.com/datasets)

* [PapersWithCode.com](https://paperswithcode.com/datasets)

* [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml)

* [Amazon’s AWS datasets](https://registry.opendata.aws/)

* [TensorFlow datasets](https://tensorflow.org/datasets)

#### Meta portals (they list open data repositories):

* [DataPortals.org](https://dataportals.org/)

* [OpenDataMonitor.eu](https://opendatamonitor.eu/)

#### Other pages listing many popular open data repositories:

* [Wikipedia’s list of machine learning datasets](https://homl.info/9)

* [Quora.com](https://homl.info/10)

* [The datasets subreddit](https://reddit.com/r/datasets)

____________________________

## `Main steps` in machine learning project:
> ![image.png](attachment:image.png)

# 🚩 Chapter: 9 <a id='ch9'></a>

> * The vast majority of the available data is unlabeled, therefore most machine learning applications require unsupervised learning.

> * if intelligence was a cake, `unsupervised learning` would be the cake, `supervised learning` would be the `icing` on the cake, and `reinforcement learning` would be the cherry on the cake.

* some unsupervised tasks are:
> 1. Dimensionality reduction
> 2. Clustering (can be used in many applications including):
>> * **Customer segmentation** (Can be used as a step of more advanced algorithm like doing customer segmentation for further recommender system)
>> * **Data analysis** (you can cluster your data and analyze each cluster separately )
>> * **Dimensionality reduction** (It is usually possible to measure each instance’s affinity with each cluster; affinity is any measure of how well an instance fits into a cluster. Each instance’s feature vector x can then be replaced with the vector of its cluster affinities. If there are k clusters, then this vector is k-dimensional. The new vector is typically much lower-dimensional than the original feature vector, but it can preserve enough information for further processing.)
>> * **Feature engineering**
>> * **Anomaly detection** (if you have clustered the users of your website based on their behavior, you can detect users with unusual behavior, such as an unusual number of requests per second.)
>> * **Semi-supervised learning** (like Google Images)
>> * **Search engines**
>> * **Image segmentation** (clustering pixels according to their color, then replacing each pixel’s color *with the mean color of its cluster*.)
> 3. Anomaly detection (AKA: Outlier Detection)
>> * Can be used to detect the outlier instances "not just outlier of each column". Note **normal instances are called `inliers` not `outliers`**
> 4. Density estimation 
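>> * The task of estimating the probability density function (PDF) of the random process that generated the dataset. It is useful for anomaly detection: instances located in low-density regions are likely to be anomalies.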
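
A rough sketch of density estimation used for anomaly detection, assuming a single feature (requests per second) that is modeled with one Gaussian. The traffic values and the threshold rule (lowest density seen on normal data) are made up for the example; real data would usually call for a richer density model.

```python
from statistics import NormalDist

# Hypothetical "normal" behaviour: requests per second from regular users.
normal_traffic = [9, 10, 11, 10, 12, 9, 10, 11, 10, 8]

# Density estimation: fit a Gaussian PDF to the observed data.
density = NormalDist.from_samples(normal_traffic)

# Assumed rule: anything less likely than the least likely normal instance
# is flagged as an anomaly (outlier).
threshold = min(density.pdf(x) for x in normal_traffic)

new_instances = [10, 11, 55]
anomalies = [x for x in new_instances if density.pdf(x) < threshold]
print(anomalies)  # [55]
```
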
________________________