##### AI

“The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.”

In other words, any element of intelligence can be broken down into steps that are each so simple and “mechanical” that they can be written down as a computer program.

Artificial intelligence is narrow: being able to solve one problem tells us nothing about the ability to solve another, different problem.

Narrow AI refers to AI that handles one task.

##### Machine learning

Systems that improve their performance in a given task with more and more experience or data.

##### Data Science

Includes machine learning and statistics, as well as certain aspects of computer science such as algorithms, data storage, and web application development.

##### Deep learning
Deep learning refers to certain kinds of machine learning techniques where several “layers” of simple processing units (neurons) are connected in a network so that the input to the system is passed through each one of them in turn.



The first stage of the problem solving process: defining the choices and their consequences

To define what our goal is, or in other words, when we can consider the problem solved

## Minimax algorithm

Uncertainty

**Fuzzy logic** is a type of mathematical logic that deals with reasoning that is approximate rather than fixed and exact.
Fuzzy logic is used to handle problems that involve ambiguity, vagueness, and uncertainty.

A classic example of fuzzy logic is in temperature control. Instead of saying "on" (1) or "off" (0) for a heater or air conditioner, you can use fuzzy logic to gradually adjust the heating or cooling based on the difference between the desired temperature and the current temperature. This allows for a smoother, more human-like control of temperature.

IF temperature is slightly cool, turn the heater to low

IF temperature is very cold, turn the heater to high

We then define fuzzy sets like "slightly cool" and "very cold" with smooth membership functions:

Slightly cool:

            temp <= 65 = 0
            65 < temp < 68 = smoothly increasing from 0 to 1
            68 <= temp < 72 = smoothly decreasing from 1 to 0
            temp >= 72 = 0

Very cold:

            temp <= 65 = 1
            65 < temp < 68 = smoothly decreasing from 1 to 0
            temp >= 68 = 0
            
Based on the degree of membership of the current temperature in each fuzzy set, the output level is adjusted proportionally.
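A minimal sketch of such a controller, under stated assumptions. The thresholds 65, 68 and 72 are reused from the notes, but the shapes of the two sets (linear ramps, with colder temperatures belonging more strongly to "very cold"), the output levels `LOW` and `HIGH`, and all function names are assumptions made for illustration.

```python
def ramp_up(x, lo, hi):
    """Membership that is 0 at or below lo, 1 at or above hi, linear in between."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def very_cold(temp):
    # Assumed shape: fully "very cold" at or below 65, fading out by 68.
    return 1.0 - ramp_up(temp, 65, 68)

def slightly_cool(temp):
    # Assumed shape: rises from 65 to 68, then falls back to 0 at 72.
    return min(ramp_up(temp, 65, 68), 1.0 - ramp_up(temp, 68, 72))

LOW, HIGH = 0.3, 1.0  # hypothetical heater levels for "low" and "high"

def heater_level(temp):
    """Blend the two rules in proportion to how strongly each one applies."""
    return slightly_cool(temp) * LOW + very_cold(temp) * HIGH

for t in (60, 66, 68, 70, 75):
    print(t, round(heater_level(t), 2))
```

The output changes gradually as the temperature moves between the thresholds, instead of jumping between "on" and "off".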

**Probability** has turned out to be the best approach for reasoning under uncertainty. It gives the ability to think of uncertainty as a thing that can be quantified (expressed as a number) at least in principle.

The main advantage of probabilistic reasoning is the ability to handle uncertain and conflicting evidence

The weather forecast says it's going to rain with 90% probability tomorrow but the day turns out to be all sun and no rain.
We can't conclude that the weather forecast was wrong based on only the single event. The forecast said it's going to rain with 90% probability, which means it would not rain with 10% probability or in one out of 10 days. It is perfectly plausible that the day in question was the 1 in 10 event. Concluding that the probability 90% was correct would also be wrong because by the same argument, we could then conclude that 80% chance of rain was also correct, and both cannot be correct at the same time.


**Odds**

An expression like 3:1 (three to one), which means that we expect that for every three cases of an outcome there is one case of the opposite outcome

The odds 1:5 (one win for every five losses), even if it can be expressed as the decimal number 0.2, is different from 20% probability (or probability 0.2 using the mathematicians’ notation). The odds 1:5 mean that you’d have to play the game six times to get one win on the average. The probability 20% means that you’d have to play five times to get one win on the average

In general, if the odds in favor of an event are x:y, the probability of the event is given by x / (x+y)
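The conversion can be sketched in a few lines of Python. The function names are made up for this note, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction

def odds_to_probability(favorable, unfavorable):
    """Odds favorable:unfavorable -> probability favorable / (favorable + unfavorable)."""
    return Fraction(favorable, favorable + unfavorable)

def probability_to_odds(p):
    """Probability -> odds as a (favorable, unfavorable) pair in lowest terms."""
    p = Fraction(p)
    return p.numerator, p.denominator - p.numerator

print(odds_to_probability(1, 5))            # 1/6
print(odds_to_probability(3, 1))            # 3/4
print(probability_to_odds(Fraction(1, 5)))  # (1, 4)
```

This makes the difference concrete: odds of 1:5 are a probability of 1/6, while a probability of 20% corresponds to odds of 1:4.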

Prior and posterior odds

Prior odds: our assessment of the odds before obtaining new information that may be relevant.

Posterior odds: the odds after the prior odds have been updated with the new information.

Likelihood ratio: the probability of the observation if the event of interest occurred, divided by the probability of the observation if it did not.

## Bayes rule

posterior odds = likelihood ratio × prior odds

    P(A|B) =       P(B|A) * P(A) 
             ------------------------
                        P(B)

P(A|B) is the posterior probability, or probability of A given B.

P(B|A) is the likelihood, or probability of B given A.

P(A) is the prior probability, or probability of A.

P(B) is the marginal probability of B, also called the evidence: the overall probability of observing B whether or not A holds.


### The base rate fallacy

The base rate fallacy is when the base rate or prior probability is ignored when making probability calculations.

The base rate fallacy leads to overestimating the relevance of new evidence, when we should be combining it with the prior base rate to get the full picture. Bayes' Rule helps avoid this fallacy by formally incorporating the base rate.

Example : let’s assume that 5 in 100 women have breast cancer. Suppose that if a person has breast cancer, then the mammograph test will find it 80 times out of 100. Suppose that if the person being tested actually doesn’t have breast cancer, the chances that the test nevertheless comes out positive are 10 in 100.

The base rate fallacy could occur if we ignored the fact that only 5 in 100 women have breast cancer to begin with.

For example, if a woman tests positive, we might make the mistake of thinking:

The test is 80% accurate for women with cancer.

Therefore, there's an 80% chance this woman has breast cancer.

But this ignores the base rate: the 5% prevalence of breast cancer in the general population.

Using Bayes' Rule and taking into account the base rate gives a different result:

Prior probability of cancer 
               
                     P(Cancer) : 0.05 (5/100)

Positive test probability if they do have cancer
            
                  P(Positive ∣ Cancer) : 0.80 (80/100) 

Positive test probability if they don't have cancer:
                
                   P(Positive ∣ NoCancer) : 0.10 (1/10)                   

Posterior probability of having cancer with a positive test is actually only about 29.6% (~30%)

![image.png](attachment:17ad5cc4-100d-4fc7-b0b1-a68dcc22bd31.png)

                           (0.80*0.05)              0.04
                    -------------------------- = ---------- = 0.296
                     (0.80*0.05)+(0.10*0.95)        0.135 
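The same calculation as a small Python sketch. The function name and parameter names are made up for this note; P(Positive) in the denominator is expanded as true positives plus false positives.

```python
def posterior(prior, p_pos_given_event, p_pos_given_no_event):
    """Bayes' rule for a yes/no event, with P(B) expanded by total probability."""
    true_positives = p_pos_given_event * prior
    false_positives = p_pos_given_no_event * (1 - prior)
    return true_positives / (true_positives + false_positives)

p = posterior(prior=0.05, p_pos_given_event=0.80, p_pos_given_no_event=0.10)
print(round(p, 3))  # 0.296
```

Changing `prior` while keeping the test characteristics fixed shows how strongly the result depends on the base rate.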



Prior odds - 5:95 (five women with breast cancer for every 95 women without breast cancer)

Likelihood ratio : The probability of a positive result in case of cancer divided by the probability of a positive result in case of no cancer. With the above numbers, this is given by 80/100 divided by 10/100, which is 8

Posterior Odds = Prior odds * likelihood ratio
                
                 5:95 * 8 
                 
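                 = 40:95

Converting back with x / (x+y), posterior odds of 40:95 give a probability of 40 / (40 + 95) ≈ 0.296, the same result as the probability form of Bayes' rule.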
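The odds form of the update can be checked with exact fractions. This is only a sketch of the arithmetic above; the variable names are made up for this note.

```python
from fractions import Fraction

prior_odds = Fraction(5, 95)                               # 5:95
likelihood_ratio = Fraction(80, 100) / Fraction(10, 100)   # 8
posterior_odds = prior_odds * likelihood_ratio             # 40:95, reduced to 8/19

# Odds x:y correspond to probability x / (x + y).
posterior_probability = posterior_odds / (1 + posterior_odds)

print(posterior_odds)                 # 8/19, the same ratio as 40:95
print(float(posterior_probability))   # about 0.296
```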


The base rate fallacy occurs when someone overestimates the likelihood of having breast cancer upon receiving a positive test result without adequately considering the low prevalence of the disease in the population

#### Linear regression and nearest neighbors

Where linear regression excels compared to nearest neighbors is interpretability

Interpreting the trained model in nearest neighbors in a similar fashion as the weights in linear regression is impossible: the learned model is basically the whole data, and it is usually way too big and complex to provide us with much insight.

#### logistic regression

We can turn the linear regression method’s outputs into predictions about labels. The technique for doing this is called logistic regression

Like linear regression, logistic regression is constrained by the linearity property: the boundary between the predicted labels is a linear function of the inputs.
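A minimal sketch of the idea: take the linear regression output and squash it through the sigmoid function to get a probability, then threshold it to get a label. The weights, bias, and threshold below are hypothetical values chosen by hand, not fitted to any data.

```python
import math

def predict_probability(weights, bias, features):
    """Linear combination of the features passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))

def predict_label(weights, bias, features, threshold=0.5):
    """Turn the probability into a 0/1 label."""
    return 1 if predict_probability(weights, bias, features) >= threshold else 0

weights, bias = [1.5, -2.0], 0.25   # hypothetical, not learned from data

print(predict_label(weights, bias, [2.0, 0.5]))  # z = 2.25, so label 1
print(predict_label(weights, bias, [0.0, 1.0]))  # z = -1.75, so label 0
```

The label flips exactly where the linear combination crosses zero, which is why the decision boundary is a straight line (or a plane in higher dimensions).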

#### Neural Network

The argument for neural networks is that by simulating the lower-level, “subsymbolic” data processing on the level of neurons and neural networks, intelligence will emerge.

Example - Image Recognition 

A symbolic, rules-based approach would involve manually engineering features like pixel patterns, strokes, etc. and writing rules like "if see circular shape in top right, recognize as 9".

A neural network takes a different subsymbolic approach. The input image is encoded into neuron activation values. These neurons are connected in layers and weighted connections are learned based on training data. There are no manually coded rules or features.

This allows neural networks to handle complex, real-world tasks where traditional symbolic AI might struggle

##### GPUs for NN

The CPU can retrieve data to be processed from the computer’s memory, and store the result in the memory. Thus, data storage and processing are handled by two separate components of the computer: the memory and the CPU. 

In neural networks, the system consists of a large number of neurons, each of which can process information on its own so that instead of having a CPU process each piece of information one after the other, the neurons process vast amounts of information simultaneously.

While a CPU processes data sequentially, neural networks have massively parallel processing across neurons, enabling efficient learning from huge datasets.

We need parallel processing to simulate neural networks efficiently. Graphics processing units (GPUs) have this capability, and they have become a cost-effective way to run large deep learning methods.

#### Activations

Once the linear combination has been computed, the neuron does one more operation. It takes the linear combination and puts it through a so-called activation function

Typical examples of the activation function include:

identity function: do nothing and just output the linear combination

step function: if the value of the linear combination is greater than zero, send a pulse (ON), otherwise do nothing (OFF).

![image.png](attachment:5d695245-dfc0-4183-a498-fb1dbf009b96.png)

                        if input_sum > 0:
                            output = 1  # ON
                        else:
                            output = 0  # OFF

Sigmoid function: a “soft” version of the step function

![image.png](attachment:42b6c518-4bd7-4e2f-a972-9bc4301b4ec2.png)

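                import math

                x1, x2, w1, w2 = 0.5, -1.0, 0.8, 0.2  # illustrative inputs and weights
                linear_combination = (x1 * w1) + (x2 * w2)
                # Activation function (sigmoid): squashes any real number into (0, 1)
                output = 1 / (1 + math.exp(-linear_combination))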

Real, biological neurons communicate by sending out sharp, electrical pulses called “spikes”, so that at any given time, their outgoing signal is either on or off (1 or 0).

The step function imitates this behavior.

However, artificial neural networks tend to use activation functions that output a continuous numerical activation level at all times, such as the sigmoid function

Perceptron - A simple neuron model with the step activation function. 
It was among the very first formal models of neural computation
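A perceptron can be sketched as a linear combination followed by the step function. The bias term and the hand-picked weights below are assumptions for illustration (the notes above do not mention a bias); with these values the neuron behaves like a logical AND.

```python
def perceptron(inputs, weights, bias):
    """Linear combination of the inputs, then the step activation."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

weights, bias = [1.0, 1.0], -1.5   # hand-picked, not learned

for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron([a, b], weights, bias))
```

Only the input (1, 1) pushes the sum above zero, so it is the only case where the neuron switches on.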


Take a small black-and-white image that’s 100 pixels wide and 100 pixels tall. You feed this image to your neural net by setting the excitement of each simulated neuron in the input layer so that it’s equal to the brightness of each pixel (10,000 neurons (100x100) representing the brightness of every pixel in the image).

![image.png](attachment:b0251b5b-a2a6-4d4a-9f4a-0d4f009b2ea8.png)

You then connect this big layer of neurons to another big layer of neurons above it, say a few thousand, and these in turn to another layer of another few thousand neurons, and so on for a few layers. Finally, in the topmost layer of the sandwich, the output layer, you have just two neurons—one representing “hot dog” and the other representing “not hot dog.” The idea is to teach the neural net to excite only the first of those neurons if there’s a hot dog in the picture, and only the second if there isn’t.
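The layered "sandwich" can be sketched as repeated application of one layer function. To keep the example small it assumes a 4x4 image (16 input neurons) instead of 100x100, two hidden layers of 8 neurons, sigmoid activations, and random untrained weights, so the two outputs are meaningless until the network is trained. All names are made up for this note.

```python
import math
import random

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def layer_forward(inputs, weights):
    """weights[j][i] is the connection from input i to neuron j of the next layer."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def random_weights(n_out, n_in, rng):
    return [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
layer_sizes = [16, 8, 8, 2]   # input pixels, two hidden layers, two output neurons
weights = [random_weights(n_out, n_in, rng)
           for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

image = [rng.random() for _ in range(layer_sizes[0])]  # pixel brightness in [0, 1]

activations = image
for layer_weights in weights:
    activations = layer_forward(activations, layer_weights)

hot_dog, not_hot_dog = activations
print(hot_dog, not_hot_dog)
```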

#### backprop

The goal of backprop is to change those weights so that they make the network work: so that when you pass in an image of a hot dog to the lowest layer, the topmost layer’s “hot dog” neuron ends up getting excited.

The way it works is that you start with the last two neurons, and figure out just how wrong they were: how much of a difference is there between what the excitement numbers should have been and what they actually were?

Then you look at each of the connections leading into those neurons, the ones from the next lower layer, and figure out how much each contributed to the error.

You keep doing this until you’ve gone all the way to the first set of connections, at the very bottom of the network. 

At that point you know how much each individual connection contributed to the overall error, and in a final step, you change each of the weights in the direction that best reduces the error overall. 
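The procedure can be sketched on a tiny network with 2 inputs, 2 hidden neurons, and 1 output neuron. This is an illustration under assumptions, not a full training setup: sigmoid activations, squared error, a single training example, a learning rate of 0.5, and hand-picked starting weights are all choices made for this note.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward(x, w_hidden, w_out):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output

def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    hidden, output = forward(x, w_hidden, w_out)
    # Start at the output neuron: how wrong was it?
    delta_out = (output - target) * output * (1 - output)
    # Pass the error back: each hidden neuron's share depends on its outgoing weight.
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1 - hidden[j])
                    for j in range(len(hidden))]
    # Change every weight in the direction that reduces the error.
    new_w_out = [w_out[j] - lr * delta_out * hidden[j] for j in range(len(hidden))]
    new_w_hidden = [[w_hidden[j][i] - lr * delta_hidden[j] * x[i]
                     for i in range(len(x))]
                    for j in range(len(hidden))]
    return new_w_hidden, new_w_out

x, target = [1.0, 0.5], 1.0            # one made-up training example
w_hidden = [[0.1, -0.2], [0.4, 0.3]]   # made-up starting weights
w_out = [0.2, -0.1]

_, before = forward(x, w_hidden, w_out)
for _ in range(200):
    w_hidden, w_out = backprop_step(x, target, w_hidden, w_out)
_, after = forward(x, w_hidden, w_out)

print("output before training:", before)
print("output after training: ", after)
```

After repeated steps the output moves toward the target. Real networks do the same thing across many examples and many more weights.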

The incredible thing is that when you do this with millions or billions of images, the network starts to get pretty good at saying whether an image has a hot dog in it.

And what’s even more remarkable is that the individual layers of these image-recognition nets start being able to “see” images in sort of the same way our own visual system does.

That is, the first layer might end up detecting edges, in the sense that its neurons get excited when there are edges and don’t get excited when there aren’t; the layer above that one might be able to detect sets of edges, like corners; the layer above that one might start to see shapes; and the layer above that one might start finding stuff like “open bun” or “closed bun,” in the sense of having neurons that respond to either case. 

The net organizes itself, in other words, into hierarchical layers without ever having been explicitly programmed that way.

Neural nets can be thought of as trying to take things—images, words, recordings of someone talking, medical data—and put them into what mathematicians call a high-dimensional vector space, where the closeness or distance of the things reflects some important feature of the actual world

#### Vectors in NN

Big patterns of neural activity can be captured in a vector space, with each neuron’s activity corresponding to a number, and each number to a coordinate of a really big vector.

Example :

Consider the neural activity in the visual cortex of the brain when a person is shown various images of geometric shapes. We have a set of neurons in this region, and each neuron is responsible for detecting specific features in the images.

For instance:

Neuron A is sensitive to detecting straight lines.

Neuron B is activated when it sees curves.

Neuron C is dedicated to identifying angles.

Now, we monitor the brain while showing different shapes. When a straight line is shown, neuron A becomes active. When a curve is shown, neuron B fires up. When an angle is displayed, neuron C shows activity.

We can represent these patterns of neural activity in a vector space, where each neuron's activity corresponds to a number. For example:

For a stimulus with a straight line: [1, 0, 0]

For a stimulus with a curve: [0, 1, 0]

For a stimulus with an angle: [0, 0, 1]

In this simple example, we've created a vector space with each neuron's activity corresponding to a specific coordinate within a vector. These vectors help us capture the neural responses to different visual stimuli.

Suppose we are recording from a neural network with 1000 neurons. At a given moment, we observe the following neuron activation levels:

Neuron 1: 0.2 

Neuron 2: 0.5

Neuron 3: 0.3

...

Neuron 1000: 0.8

We can represent this overall pattern of neural activity as a 1000-dimensional vector:

Neural Activity Vector = [0.2, 0.5, 0.3, ..., 0.8]

Each element of the vector represents the firing rate of one neuron. So the vector captures the instantaneous state of the entire neural network in one mathematical representation.

We can then perform vector operations on these neural activity patterns. For example, we can calculate the distance between two patterns by taking the Euclidean distance of their vector representations. Or we can find patterns with similar activity using vector correlations.
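Both operations can be sketched in plain Python. The helper names are made up for this note, and the example vectors are the three-neuron patterns from the shapes example.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two activity vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_correlation(a, b):
    """Correlation between two activity vectors, in the range [-1, 1]."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    spread_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return covariance / (spread_a * spread_b)

line = [1, 0, 0]    # neuron A active
curve = [0, 1, 0]   # neuron B active

print(euclidean_distance(line, curve))    # sqrt(2), about 1.414
print(pearson_correlation(line, curve))   # about -0.5
print(pearson_correlation(line, line))    # about 1.0
```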

Over time, the neural network's activity traces out a trajectory through this high-dimensional vector space. Analyzing these trajectories with mathematical techniques sheds light on how information is represented and processed by the neural network dynamics.


You can feed the text of Wikipedia, many billions of words long, into a simple neural net, training it to spit out, for each word, a big list of numbers that correspond to the excitement of each neuron in a layer. 

If you think of each of these numbers as a coordinate in a complex space, then essentially what you’re doing is finding a point, known in this context as a vector, for each word somewhere in that space. 

Now, train your network in such a way that words appearing near one another on Wikipedia pages end up with similar coordinates, and voilà, something crazy happens: words that have similar meanings start showing up near one another in the space. 

That is, “insane” and “unhinged” will have coordinates close to each other, as will “three” and “seven,” and so on.

What’s more, so-called vector arithmetic makes it possible to, say, subtract the vector for “France” from the vector for “Paris,” add the vector for “Italy,” and end up in the neighborhood of “Rome.” It works without anyone telling the network explicitly that Rome is to Italy as Paris is to France.
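The arithmetic can be illustrated with toy vectors. The two-dimensional coordinates below are invented by hand so that the analogy works. They are not learned embeddings, and real word vectors have hundreds of dimensions and only land near the expected word.

```python
import math

# Hand-made toy vectors: first coordinate ~ "which country", second ~ "is a capital".
vectors = {
    "France":  [1.0, 0.0], "Paris":  [1.0, 1.0],
    "Italy":   [2.0, 0.0], "Rome":   [2.0, 1.0],
    "Germany": [3.0, 0.0], "Berlin": [3.0, 1.0],
}

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_word(point, exclude=()):
    """Word whose vector is closest to the given point, skipping excluded words."""
    candidates = [w for w in vectors if w not in exclude]
    return min(candidates, key=lambda w: distance(point, vectors[w]))

# Paris - France + Italy
query = [p - f + i for p, f, i in
         zip(vectors["Paris"], vectors["France"], vectors["Italy"])]

print(nearest_word(query, exclude=("Paris", "France", "Italy")))  # Rome
```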


A thought is in a way a dance of vectors, "some big pattern of neural activity in the head of a person"

Neural nets are just thoughtless fuzzy pattern recognizers: useful, but only as useful as fuzzy pattern recognizers can be.

A real intelligence doesn’t break when you slightly change the requirements of the problem it’s trying to solve