# Harshit's Artificial Intelligence (AI) Notes

As I see it, the field of AI can be roughly divided into the following sub-domains:

- Data Science
- Learning
- Robotics

### Data Science
Data science is the field that deals with the large amounts of data that real-world problems require and produce. The following topics may be roughly placed under data science:

- Data Storage
- Databases
- Data Mining
- Data Exploration
- Big Data
- Data Cleansing
- Data Analysis
- Data Visualization And Representation

### Learning

Learning is the science of making computers or machines learn in a way that is quite similar to how a human learns. Learning is mostly about mathematics. Statistics plays a major role in it, and hence it is also known as "statistical learning". When statistical learning methods are converted to computer algorithms and programs, we call it "machine learning".

### Robotics

Robotics is the science of creating machines which try to replicate how a human body works using mechanical, electrical, computer and electronic systems.

---

## Statistical Learning

There are 2 aspects of every statistical learning problem. They are:

- Input variables
- Output variables

### Input Variables
Input variables are the various factors affecting the _output_ or the _result_ of any statistical learning problem. They are also known as _predictors_, _independent variables_ or _features_.

Example:

In the advertising budget problem, the budget of **television based advertisements**, **radio based advertisements** and **newspaper based advertisements** can be shown by variables, **$X_{1}$**, **$X_{2}$** and **$X_{3}$** respectively.

Factors | Variables
--- | ---
TV | $X_{1}$
RADIO | $X_{2}$
NEWSPAPER | $X_{3}$

These are the input variables for the advertising budget problem.

### Output Variables

An output variable is the final result of a prediction or learning problem, that is, the outcome of the whole problem. The output variables depend on the input variables for that particular problem. They are also known as _responses_ or _dependent variables_.

Example:

In the advertising budget problem, the **sales** are what we predict from the given input variables, using one of the many statistical learning methods. We can represent the sales as **$Y$**.

Result | Variable
--- | ---
Sales | $Y$

### Relationship between the input variables and the output variables

Thus, we observe a _quantitative response_ ($Y$), due to _$p$_ different _predictors_ ($X_{1}, X_{2}, X_{3}, \cdots , X_{p}$).

Collecting the $p$ predictors into a single vector, we get:
$$
X = (X_{1}, X_{2}, X_{3}, \cdots, X_{p})
$$

Also, $X$ affects $Y$. Thus, there is some relationship between $X$ and $Y$.

It can be shown by the following equation:
$$
Y = f(X) + \varepsilon
$$

where,
- $f(X)$ is a fixed but unknown function which is dependent on $X$,
- $\varepsilon$ is the _error term_ which is independent of $X$ and has a mean of _zero_. An error is **_positive_** if the observation lies **above** the _curve of $f(X)$_ and **_negative_** if it lies **below** it.

Our goal is to find an estimate of $f$, which would fit $X$ to $Y$ with the minimum error.
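
A small simulation can make the roles of $f$ and $\varepsilon$ concrete. This is only a sketch: the straight-line `f` and the Gaussian noise are assumptions made up for illustration (in a real problem $f$ is unknown), and none of the numbers come from the advertising data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def f(x):
    """Hypothetical 'true' function; in a real problem f is unknown."""
    return 2.0 + 3.0 * x

# Generate observations Y = f(X) + epsilon, with epsilon drawn independently of X
xs = [random.uniform(0, 10) for _ in range(10_000)]
eps = [random.gauss(0, 1.5) for _ in xs]
ys = [f(x) + e for x, e in zip(xs, eps)]

# An error is positive when the observation lies above f(X), negative when below
n_above = sum(1 for e in eps if e > 0)
n_below = sum(1 for e in eps if e < 0)

print("mean of the errors:", round(statistics.fmean(eps), 3))
print("observations above / below f:", n_above, n_below)
```

The mean of the simulated errors should come out close to zero, with roughly as many observations above the curve as below it.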

### Why estimate $f$?

Our motive behind estimating $f$ can be one (or both) of the following:
1. Prediction
2. Inference

### Prediction

Prediction means estimating a future result based on the past. We humans predict by looking at data from the past and trying to guess or estimate the future. Similarly, machines can be taught using various learning techniques so that they can estimate or guess the future. Examples of domains where prediction can be applied are:

- Weather forecasts
- Disaster analysis
- Stock markets
- Traffic forecast

For prediction, we have a set of _inputs_, $X$, readily available. The _output_, $Y$, is not available to us.

We can then predict $Y$ as follows:
$$
\hat{Y} = \hat{f}(X)
$$

Where,
- $\hat{Y}$ is our prediction for $Y$, and,
- $\hat{f}(X)$ is our estimate for $f(X)$.

The error term, $\varepsilon$, does not appear here because it averages to zero.

The accuracy of $\hat{Y}$, as a prediction of $Y$, depends on 2 quantities:
1. Reducible error
2. Irreducible error

In general, $\hat{f}$ will not be a perfect estimate for $f$, and this inaccuracy introduces a _reducible error_. We can reduce it by using a more appropriate statistical learning method.

Even if we could estimate $f$ perfectly, so that $\hat{Y} = f(X)$, our prediction would still contain some error, because $Y$ is also a function of $\varepsilon$. The variability associated with $\varepsilon$ also affects the accuracy of our prediction. This is the _irreducible error_. We cannot remove it, no matter how well we estimate $f$.

**Why is the irreducible error > 0?**

$\varepsilon$ may contain _unmeasured variables_ that would be useful in predicting $Y$. It may also contain _unmeasurable variation_.

$$
E(Y - \hat{Y})^{2} = E[f(X) + \varepsilon - \hat{f}(X)]^{2} = [f(X) - \hat{f}(X)]^{2} + \mathrm{Var}(\varepsilon)
$$

where,
- $E(Y - \hat{Y})^{2}$ is the _expected value_ of the squared difference between the actual and the predicted value of $Y$, treating $\hat{f}$ and $X$ as fixed,
- $[f(X) - \hat{f}(X)]^{2}$ is the _squared difference_ between $f$ and its estimate $\hat{f}$. It is _reducible_ in nature.
- $\mathrm{Var}(\varepsilon)$ is the variance associated with $\varepsilon$. It is _irreducible_ in nature.
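
The decomposition can be checked numerically. When the square is expanded, the cross term $2[f(X) - \hat{f}(X)]E(\varepsilon)$ vanishes because $\varepsilon$ has mean zero, which leaves only the two terms above. The sketch below assumes a made-up `f`, a deliberately imperfect `f_hat` and Gaussian noise; all three are hypothetical choices, not something taken from a real dataset.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

def f(x):
    """Hypothetical true function (unknown in practice)."""
    return 2.0 + 3.0 * x

def f_hat(x):
    """Hypothetical, deliberately imperfect estimate of f."""
    return 2.5 + 2.8 * x

x0 = 4.0      # a fixed input value
sigma = 1.5   # standard deviation of epsilon, so its variance is sigma ** 2

# Draw many responses Y = f(x0) + epsilon and record the squared prediction error
sq_errors = []
for _ in range(200_000):
    y = f(x0) + random.gauss(0, sigma)
    sq_errors.append((y - f_hat(x0)) ** 2)

expected_sq_error = statistics.fmean(sq_errors)
reducible = (f(x0) - f_hat(x0)) ** 2   # shrinks as f_hat gets closer to f
irreducible = sigma ** 2               # stays even if f_hat equals f

print("simulated E(Y - Y_hat)^2:", round(expected_sq_error, 3))
print("reducible + irreducible: ", round(reducible + irreducible, 3))
```

The two printed numbers should agree closely. Replacing `f_hat` with `f` would drive the reducible part to zero, while the irreducible part would remain.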

### Inference

We are often interested in knowing how the output of a problem is affected by the input. We want to know how $Y$ is affected as the predictors $X_{1}, X_{2}, X_{3}, \cdots, X_{p}$ change.

Thus, we want to understand the relationship between $X$ and $Y$.

Here, $\hat{f}$ cannot be treated as a "_black box_". We need the exact form of $\hat{f}$.

**We may want to infer,**
- which predictors are _associated_ with the response,
- what the _relationship between each_ predictor and the response is (positive, negative, etc.),
- whether the relationship between $Y$ and each predictor can be adequately summarized using a linear equation, or whether it is more complicated (quadratic, cubic, etc.).

### How do we estimate $f$?

To estimate $f$, we need to teach our method. To teach it, we use some data which we call the _training data_. Training data is the dataset, or part of the dataset, which is used to train or teach the method how to estimate $f$.

**Characteristics of training data:**

- $i$ denotes the $i^{th}$ observation out of the total $n$ observations.
- $j$ denotes the $j^{th}$ predictor out of the $p$ total predictors.

Thus, $x_{ij}$ is the value of the $j^{th}$ predictor for the $i^{th}$ observation, and $y_{i}$ is the response variable for the $i^{th}$ observation.

Our training data set therefore consists of,
$$
\{(x_{1}, y_{1}), (x_{2}, y_{2}), \cdots, (x_{n}, y_{n})\}
$$
where,
$$
x_{i} = (x_{i1}, x_{i2}, \cdots, x_{ip})^{T}
$$
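
The indexing can be tried out on a tiny table. The sketch below assumes $n = 4$ observations of $p = 3$ predictors with made-up values; the helper `x(i, j)` is a hypothetical convenience that maps the 1-based notation onto Python's 0-based lists.

```python
# Hypothetical training data: n = 4 observations, p = 3 predictors.
# All numbers are made up for illustration.
X = [
    [100.0, 20.0, 30.0],   # x_1
    [50.0, 40.0, 10.0],    # x_2
    [20.0, 45.0, 60.0],    # x_3
    [150.0, 35.0, 55.0],   # x_4
]
y = [20.0, 11.0, 9.0, 18.0]  # y_i is the response for observation i

n = len(X)      # number of observations
p = len(X[0])   # number of predictors

def x(i, j):
    """Return x_ij: the j-th predictor of the i-th observation (1-based)."""
    return X[i - 1][j - 1]

# The training set is the collection of pairs (x_1, y_1), ..., (x_n, y_n)
training_data = list(zip(X, y))

print(n, p)
print(x(2, 3))            # third predictor of the second observation
print(training_data[0])   # the pair (x_1, y_1)
```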

Our goal is to apply a statistical learning method to our training data in order to estimate the unknown function $f$.

We want to find a function $\hat{f}$ such that,
$Y \approx \hat{f}(X)$, for any observation $(X, Y)$.

Most statistical methods for this task can be classified into:

1. Parametric methods
2. Non-parametric methods

### Parametric Methods

**Parametric methods involve a 2-step, _model-based_ approach.**

**STEP 1:**

First, we make an assumption about the functional form, or shape, of $f$.

One very simple assumption, and perhaps the first one we might make for any given problem, is that $f$ is _linear_ in $X$:

$$
f(X) = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \cdots + \beta_{p}X_{p}
$$

This is a _linear model_.

Linear models are very simple. Instead of having to estimate an entirely arbitrary $p$-dimensional function $f(X)$, one only needs to estimate the $p + 1$ coefficients, $\beta_{0}, \beta_{1}, \beta_{2}, \cdots, \beta_{p}$.

**STEP 2:**

After a model is selected, a _procedure_ needs to be selected to _fit_ or _train_ the model. In the case of our linear model, we need to estimate the parameters $\beta_{0}, \beta_{1}, \beta_{2}, \cdots, \beta_{p}$ such that,

$$
Y \approx \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \cdots + \beta_{p}X_{p}
$$

The most common approach to fitting the model is referred to as "**_(ordinary) least squares_**". However, _least squares_ is only one of many possible ways to fit a linear model.

**Thus, a _parametric method_ reduces the problem of estimating $f$ down to one of estimating a set of parameters.**
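
The two steps can be sketched in plain Python. The example assumes simulated data with made-up "true" coefficients (`true_beta`) and fits the linear model by least squares through the normal equations $(X^{T}X)\beta = X^{T}y$. The small `solve` helper is a hypothetical stand-in for a proper linear algebra routine, which is what one would normally use in practice.

```python
import random

random.seed(2)  # fixed seed so the run is repeatable

# Hypothetical "true" coefficients beta_0..beta_3, used only to simulate data
true_beta = [3.0, 0.05, 0.2, 0.01]

# Simulate 200 observations of p = 3 predictors and a noisy response
rows, ys = [], []
for _ in range(200):
    x = [random.uniform(0, 300), random.uniform(0, 50), random.uniform(0, 100)]
    y = true_beta[0] + sum(b * v for b, v in zip(true_beta[1:], x))
    y += random.gauss(0, 0.5)
    rows.append([1.0] + x)   # the leading 1.0 multiplies the intercept beta_0
    ys.append(y)

def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    k = len(b)
    M = [A[i][:] + [b[i]] for i in range(k)]   # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]

# Least squares: minimise the residual sum of squares via the normal equations
k = len(rows[0])
XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
Xty = [sum(r[i] * yi for r, yi in zip(rows, ys)) for i in range(k)]
beta_hat = solve(XtX, Xty)

print("estimated coefficients:", [round(b, 3) for b in beta_hat])
```

Only $p + 1 = 4$ numbers are estimated here, which is the whole point of the parametric approach: the assumed linear form does the rest.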

## Machine Learning Models

### Decision Trees

A **_Decision Tree_** is a very basic model used in machine learning. Decision trees are not very accurate for predictions, and much more powerful models exist, but they are very easy to understand and sometimes act as building blocks for more complex models.

Here is an example of a simple decision tree:

![Simple Decision Tree (PNG)](/res/simple_decision_tree.png)

The process of recognizing patterns from data is called **fitting** or **training**. The data used to **train** or **fit** the model is called the **training data**.

There can be more complex, and thus more accurate, decision trees. An example of a deeper decision tree that predicts more accurately is given below:

![Deeper Decision Tree (PNG)](/res/deeper_decision_tree.png)

This tree gives more accurate predictions than, say, a tree like the one below:

![Less Deep Decision Tree (PNG)](/res/less_deep_decision_tree.png)

Deeper decision trees have more '_splits_'. _Splits_ allow us to take more _factors_ into account in our decision-making process. The more factors we consider, the better the prediction tends to be. The last node of the tree, where we obtain our prediction, is known as the **leaf** node.
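
A decision tree can be read as a set of nested conditions, where every `if` is a split and every `return` is a leaf. The sketch below is hand-written rather than learned from data, and the split thresholds and prices are made up; they are not taken from the figures.

```python
def shallow_tree(bedrooms):
    """One split: only the number of bedrooms is considered."""
    if bedrooms > 2:
        return 900_000   # leaf node (hypothetical predicted price)
    return 600_000       # leaf node

def deeper_tree(bedrooms, land_size):
    """Two levels of splits: bedrooms first, then land size."""
    if bedrooms > 2:
        if land_size > 500:
            return 1_100_000
        return 800_000
    if land_size > 500:
        return 700_000
    return 550_000

# The deeper tree can tell apart houses that the shallow tree treats as identical
print(shallow_tree(3), deeper_tree(3, 650), deeper_tree(3, 400))
print(shallow_tree(2), deeper_tree(2, 650), deeper_tree(2, 300))
```

The shallow tree has 2 leaves and the deeper one has 4, so the deeper tree can give different predictions to houses with the same number of bedrooms.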

Let's use some data to try out our new tricks.

### Data exploration using Pandas

Pandas is an open-source Python data analysis library. We can use Pandas in our Python code by importing it. We generally import Pandas using the abbreviation **pd**.

    import pandas as pd

The most important feature of Pandas is its **_DataFrame_** object. A DataFrame is a container for tabular data, similar to an Excel sheet or a SQL table. The DataFrame object then allows us to do a lot of things with the data using powerful methods in the Pandas library.
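
To get a feel for a DataFrame before loading a real file, one can build a tiny one by hand. The table below is made up for illustration, and the variable name `homes` is hypothetical.

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values
homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [650000, 903000, 1330000],
})

print(homes.shape)           # (number of rows, number of columns)
print(list(homes.columns))   # column labels
print(homes.head())          # the first rows, like the top of a spreadsheet
```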

As an example, we will be looking at data about home prices in Melbourne, Australia.

    # Save the file path to a variable for easier access
    melb_data_filepath = 'melb_data.csv'
    # Read data from the CSV file and store it in a DataFrame object called melb_data
    melb_data = pd.read_csv(melb_data_filepath)
    # Print a summary of the Melbourne data
    melb_data.describe()

Statistic | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0
mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378
std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772
min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0
25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0
50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0
75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0
max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0


### Interpreting data descriptions

The table shows 8 numbers for each numeric column in the original dataset.

The **count** row shows the number of rows containing non-missing values. Missing values are those for which no data was recorded. For example, the size of a second bedroom would not be recorded for a 1-bedroom house, and would thus be missing.

The **mean** is the average of the values.

**std** is the standard deviation of the values. The standard deviation shows us how spread out the values are numerically. For more about standard deviation, refer to the basic statistics section.

To understand the remaining rows, imagine sorting the values of a column in ascending order. The first value is the **min** value. **25%**, **50%** and **75%** are percentile values (_$x^{th}$ percentile_). Each indicates a value which is bigger than $x$% of the values in the column and smaller than $(100 - x)$% of them. **max** is the largest value.
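
These statistics are easy to verify by hand on a very small column. The sketch below uses a made-up Series with one missing value; the name `areas` and its numbers are hypothetical and are not taken from the Melbourne data.

```python
import pandas as pd

# Made-up column with one missing value (None is stored as NaN)
areas = pd.Series([10.0, 20.0, 30.0, 40.0, None], name="BuildingArea")

summary = areas.describe()
print(summary)

# count ignores the missing entry, so it is smaller than the number of rows
print("rows:", len(areas), "non-missing:", int(summary["count"]))

# The 25% value lies above a quarter of the non-missing values
below = int((areas < summary["25%"]).sum())
print("values below the 25% mark:", below, "out of", int(summary["count"]))
```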

### Selecting data for modeling

The Melbourne housing dataset has a lot of variables in it, which makes the data difficult to grasp and understand. We need to narrow down the number of columns (or factors) to those which actually matter. We will start by doing this intuitively.

The **columns** property of the DataFrame object allows us to see a list of all the columns we have in our dataset.

    # We will continue our Python code from the last cell containing Python code
    melb_data.columns

    Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
           'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
           'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
           'Longtitude', 'Regionname', 'Propertycount'],
          dtype='object')

The Melbourne dataset contains rows with missing values. We will learn how to handle missing values later, so for now we will just discard (or drop) those rows. To do so, we will use the **dropna** method. The **na** stands for **Not Available**.

    melb_data = melb_data.dropna(axis=0)
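
The effect of `dropna` is easiest to see on a few rows. The sketch below uses a made-up DataFrame (`homes`) rather than the Melbourne file, so the values are purely illustrative.

```python
import pandas as pd

# Made-up rows; None marks a missing value
homes = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [93.0, None, 174.0, 126.0],
    "YearBuilt": [1940.0, 1970.0, None, 1999.0],
})

# axis=0 drops rows (not columns) that contain at least one missing value
complete = homes.dropna(axis=0)

print(len(homes), "rows before,", len(complete), "rows after")
print(complete)
```

Note that `dropna` returns a new DataFrame and leaves the original untouched, which is why the result is assigned back to a variable.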

### Selecting the prediction target

Using Pandas, one can pull out a single variable using the _dot-notation_. This single column is stored in a **_Series_**. It is similar to a DataFrame but with only a single column.

We will use the _dot-notation_ to select the column we want to predict, which is called the **prediction target**. Our prediction target here would be the house prices. By convention, the prediction target is stored in a variable called `y`.

    y = melb_data.Price
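
The same selection can be tried on a small, made-up DataFrame. The sketch also shows bracket notation, which selects the same column and additionally works for column names that are not valid Python identifiers (for example, names containing spaces).

```python
import pandas as pd

# Made-up data standing in for the Melbourne DataFrame
homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [650000, 903000, 1330000],
})

y = homes.Price           # dot-notation
y_alt = homes["Price"]    # bracket notation selects the same column

print(type(y).__name__)   # the selected column is a Series
print(y.name, len(y))
print(y.equals(y_alt))
```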

### Choosing features

The