# Gradient Descent Algorithm

Given a function: <br>

\begin{equation}
f(x;\theta) = \theta_0 + \theta_1 x 
\end{equation}<br>

where $\theta$ is a vector that determines what our function will be.<br>

For instance, if $\theta = [0.2, 1.5]$, the function is then $f(x; \theta) = 0.2 + 1.5 x$. <br>
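As a quick check of this example, the function can be evaluated directly. This is only a minimal sketch: the helper name `f` and the sample inputs are chosen here for illustration and are not part of the original notebook.

```python
import numpy as np

def f(x, theta):
    """Evaluate f(x; theta) = theta_0 + theta_1 * x."""
    return theta[0] + theta[1] * x

theta = np.array([0.2, 1.5])       # the example parameters from the text
x = np.array([0.0, 1.0, 2.0])      # illustrative inputs (assumption)
values = f(x, theta)
print(values)
```

Changing `theta` changes which line the function describes, which is exactly what gradient descent adjusts.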


Let's say we wish to find a $\theta$ that minimises $|f(x;\theta) - y|$.

We want to choose $\theta$ so as to minimise a cost function $j(\theta)$. <br>

We start with some "initial guess" for $\theta$ and repeatedly change $\theta$ to make $j(\theta)$ smaller, until hopefully we converge to a value of $\theta$ that minimises $j(\theta)$. <br>

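Specifically, let's consider the gradient descent algorithm, which starts with some initial $\theta$ and repeatedly performs the update:

\begin{equation}
\theta := \theta - \alpha \nabla_\theta j(\theta)
\end{equation} <br>

where $\alpha$ is the learning rate. The implementation below uses the cost $j(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i;\theta) - y_i)^2$ over $m$ samples, which is built on the Euclidean distance between the vector of predictions and $y$. For two vectors $\textbf{u}$ and $\textbf{v}$ of length $n$, that distance is: <br>

\begin{equation}
d(\textbf{u}, \textbf{v}) = d(\textbf{v}, \textbf{u}) = \sqrt{\sum_{i=1}^{n} (\textbf{u}_i - \textbf{v}_i)^2}
\end{equation} <br>

This is also known as the $\ell^2$ distance between $\textbf{u}$ and $\textbf{v}$, or the $\ell^2$-norm of $\textbf{u} - \textbf{v}$:<br>

\begin{equation}
\| \textbf{u} - \textbf{v} \|_2
\end{equation} <br>

so that $j(\theta) = \frac{1}{2m} \| f(x;\theta) - y \|_2^2$.

Implement batch gradient descent with NumPy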

In [2]:
import numpy as np
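def batch_gradient_descent(alpha, x, y, n_iters=100):
    """Fit theta by batch gradient descent on the cost J = sum(loss^2) / (2m).

    x is assumed to be an (m, 2) array; for f(x; theta) = theta_0 + theta_1 * x
    its first column should be all ones. y has length m, alpha is the step size.
    """
    m = x.shape[0]  # number of samples
    theta = np.ones(2)  # initial guess
    x_transpose = x.transpose()
    for i in range(n_iters):  # avoid shadowing the built-in iter()
        hypothesis = np.dot(x, theta)  # current predictions
        loss = hypothesis - y  # residuals
        J = np.sum(loss ** 2) / (2 * m)  # cost
        print("iter %s | J: %.3f" % (i, J))
        gradient = np.dot(x_transpose, loss) / m  # gradient of J w.r.t. theta
        theta = theta - alpha * gradient  # update
    return theta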

If the implementation is correct, running the code below should not raise an `AssertionError` or a `TypeError`.
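The check cell itself is not included here. A minimal sketch of what such a check could look like follows. It assumes synthetic, noise-free data generated from $f(x) = 0.2 + 1.5x$, and it uses a hypothetical helper `gradient_descent` that applies the same update as `batch_gradient_descent` without the per-iteration print. The step size and iteration count are illustrative choices.

```python
import numpy as np

def gradient_descent(alpha, x, y, n_iters=100):
    """Same update rule as batch_gradient_descent, without printing the cost."""
    m = x.shape[0]
    theta = np.ones(x.shape[1])
    for _ in range(n_iters):
        loss = x @ theta - y                      # residuals
        theta = theta - alpha * (x.T @ loss) / m  # gradient step
    return theta

# Synthetic data (assumption): points lying exactly on the line 0.2 + 1.5 x
raw_x = np.linspace(0.0, 2.0, 50)
x = np.column_stack([np.ones_like(raw_x), raw_x])  # column of ones carries theta_0
y = 0.2 + 1.5 * raw_x

theta = gradient_descent(alpha=0.1, x=x, y=y, n_iters=2000)

# Neither assertion should fail if the update is implemented correctly
assert isinstance(theta, np.ndarray)
assert np.allclose(theta, [0.2, 1.5], atol=1e-6)
print(theta)
```

The column of ones is what lets a single matrix product compute $\theta_0 + \theta_1 x$ for every sample.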

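## Components of Machine Learning


The three components of machine learning:<br>

1. Experience in a form the computer can understand (the model)<br><br>
1. A measure of how well that experience generalizes (the objective function)<br><br>
1. Choosing better experience to achieve better generalization (model optimization)<br><br>

<br>
Machine learning is usually divided into supervised and unsupervised learning. See https://en.wikipedia.org/wiki/Machine_learning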

![machlearn.png](attachment:machlearn.png)

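## Supervised Learning

The algorithm is given problem descriptions together with the corresponding answers, or solution steps together with the corresponding feedback, and is left to generalize how to solve the problem.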

### Regression

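Given 13 columns of data for different areas (income, pupil-teacher ratio, and so on) together with the median house price of each area, the algorithm is left to generalize how to estimate an area's house price from these data.
* The model is defined as a function <br><br>
  The input is the information $\textbf{X}$ on $m$ areas, represented as an $m\times 13$ matrix <br><br>
  The output is the median house prices $\textbf{y}$ of the $m$ areas, represented as a vector of length $m$: <br><br>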
<br><br>
   \begin{equation}
   f(\textbf{X}):=\{\textbf{X} \mapsto \textbf{y} \}\\
   \textbf{X} \in \mathbb{R}^{m\times 13},
   \textbf{y} \in \mathbb{R}^m
   \end{equation}
<br><br>
<br><br>

* The objective function is the mean squared error of $f(\textbf{X})$. Let $\textbf{y}$ be the true median house prices of the $m$ areas and $f(\textbf{X}_i), i = 1,2,...,m$ the median house prices predicted by the model.
<br><br>
     \begin{equation}
      L(\textbf{X}) = MSE(\textbf{y}, f(\textbf{X})) = \frac{1}{m} \sum_{i=1}^{m} (\textbf{y}_i - f(\textbf{X}_i))^2
     \end{equation}
<br><br>
* Optimization: we want to find the model $f$ with the smallest mean squared error
<br><br>
 \begin{equation}
     f = \arg\min_{f}\quad L(\textbf{X})
 \end{equation}
 <br><br>
<br>
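The mean squared error can be computed in a few lines of NumPy. The sketch below is illustrative only: the function name `mean_squared_error` is chosen here, the true values are the first three `MEDV` entries of the table shown further down, and the predictions are made-up numbers standing in for a model's output. A smaller value means the predictions are closer to the true prices.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([24.0, 21.6, 34.7])   # first three MEDV values
y_pred = np.array([25.0, 20.6, 32.7])   # hypothetical predictions (assumption)

mse = mean_squared_error(y_true, y_pred)
print(mse)
```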

In [7]:
with open('boston_housing_desc.txt', 'r') as desc_file:
    for col_description in desc_file:
        print(col_description)

1. CRIM - per capita crime rate by town

2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.

3. INDUS - proportion of non-retail business acres per town.

4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)

5. NOX - nitric oxides concentration (parts per 10 million)

6. RM - average number of rooms per dwelling

7. AGE - proportion of owner-occupied units built prior to 1940

8. DIS - weighted distances to five Boston employment centres

9. RAD - index of accessibility to radial highways

10. TAX - full-value property-tax rate per $10,000

11. PTRATIO - pupil-teacher ratio by town

12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

13. LSTAT - % lower status of the population

14. MEDV - Median value of owner-occupied homes in $1000's



In [8]:
import pandas as pd
housing = pd.read_csv('boston_housing_m.csv', delimiter=',')
housing.head(10)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD1,TAX1,PTRATIO1,B1,LSTAT1,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,0.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,0.0,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,0.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2
5,0.02985,0.0,2.18,0.0,0.458,6.43,58.7,6.0622,3.0,222.0,18.7,394.12,5.21,28.7
6,0.08829,12.5,7.87,0.0,0.524,6.012,66.6,5.5605,5.0,311.0,0.0,395.6,12.43,22.9
7,0.14455,12.5,7.87,0.0,0.524,6.172,96.1,5.9505,5.0,311.0,15.2,396.9,19.15,27.1
8,0.21124,12.5,0.0,0.0,0.524,5.631,100.0,6.0821,5.0,311.0,15.2,386.63,29.93,16.5
9,0.17004,12.5,7.87,0.0,0.524,6.004,85.9,6.5921,5.0,311.0,15.2,386.71,17.1,18.9


### Classification

<br>
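Given measurements of sepal length, sepal width, petal length and petal width for some iris flowers, together with the corresponding species, one of {setosa, versicolor, virginica}, the algorithm is left to generalize how to identify the species accurately from these measurements.
* The model is defined as a function <br><br>
  The input is the measurements $\textbf{X}$ of $m$ flowers, represented as an $m\times 4$ matrix <br><br>
  The output is the species $\textbf{y}$ of the $m$ flowers, represented as a vector of length $m$: <br><br>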
<br><br>
   \begin{equation}
   f(\textbf{X}):=\{\textbf{X} \mapsto \textbf{y} \}\\
   \textbf{X} \in \mathbb{R}^{m\times 4},
   \textbf{y} \in \{\text{setosa, versicolor, virginica}\}^m
   \end{equation}
<br><br>
<br><br>

* The objective function is the accuracy: for $m$ flowers whose true species $\textbf{y}$ are known to us but not to the model, the model scores one point for each correct guess and nothing for a wrong one. The total divided by $m$ is the accuracy.
<br><br>
     \begin{equation}
      L(\textbf{X}) = \frac{1}{m} \sum_{i=1}^{m}\begin{cases}
           1 \quad\quad\text{if $\quad f(\textbf{X}_i) = \textbf{y}_{i}$}
            \\
            \quad
            \\
            0 \quad\quad\text{if $\quad f(\textbf{X}_i) \neq \textbf{y}_{i}$}
            \end{cases}
     \end{equation}
<br><br>
* Optimization: we want to find the model $f$ with the highest accuracy
<br><br>
 \begin{equation}
     f = \arg\max_{f}\quad L(\textbf{X})
 \end{equation}
 <br><br>
<br>
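The accuracy described in the text, one point per correct guess divided by $m$, can be sketched in NumPy as follows. The function name `accuracy` and the label arrays are illustrative assumptions, not data from the iris file.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true == y_pred)

# Hypothetical labels: the second prediction is wrong, the other three are right
y_true = np.array(['setosa', 'versicolor', 'virginica', 'setosa'])
y_pred = np.array(['setosa', 'virginica', 'virginica', 'setosa'])

score = accuracy(y_true, y_pred)
print(score)
```

Comparing the two arrays gives a boolean array, and its mean is the share of `True` entries.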

In [6]:
import numpy as np
import pandas as pd
iris = pd.read_csv('iris.csv', delimiter=',')
iris.iloc[np.r_[0:5, 63:68, 125:130],:]

Unnamed: 0,Sepal length,Sepal width,Petal length,Petal width,Species
0,5.2,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.3,setosa
63,6.1,2.9,4.7,1.4,versicolor
64,5.6,2.9,3.6,1.3,versicolor
65,6.7,3.1,4.4,1.4,versicolor
66,5.6,3.0,4.5,1.5,versicolor
67,5.8,2.7,4.1,1.0,versicolor


# Using Pandas

#### Loading the iris dataset

A dataset in csv format can be loaded with `numpy.loadtxt` or `pandas.read_csv`<br>

`pandas.read_csv` is recommended<br>

In addition, `pandas.read_excel` can load datasets in Excel format<br>
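To illustrate why `pandas.read_csv` (the function used in the cells of this notebook) is usually the more convenient choice, the sketch below loads the same small CSV text with both libraries. The in-memory text stands in for a file and holds three rows copied from the iris table; using `io.StringIO` instead of a file path is an assumption made to keep the example self-contained.

```python
import io
import numpy as np
import pandas as pd

csv_text = """Sepal length,Sepal width,Petal length,Petal width,Species
5.2,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
6.1,2.9,4.7,1.4,versicolor
"""

# pandas keeps the header as column names and infers a type for each column
df = pd.read_csv(io.StringIO(csv_text))

# numpy.loadtxt expects numeric data, so the header row and the text column
# have to be skipped explicitly
arr = np.loadtxt(io.StringIO(csv_text), delimiter=',', skiprows=1,
                 usecols=(0, 1, 2, 3))

print(df.dtypes)
print(arr.shape)
```

With pandas the species column survives as text, whereas the NumPy version only keeps the four numeric columns.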

In [9]:
import pandas as pd
import numpy as np
iris = pd.read_csv('iris.csv', delimiter=',')
print('The dataset is now of type', type(iris))
print('All Species:', np.unique(iris['Species']))
print('First 10 rows of the dataset:')
iris.head(10)

The dataset is now of type <class 'pandas.core.frame.DataFrame'>
All Species: ['setosa' 'versicolor' 'virginica']
First 10 rows of the dataset:


Unnamed: 0,Sepal length,Sepal width,Petal length,Petal width,Species
0,5.2,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.3,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


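A `DataFrame` can be viewed as a dictionary in which each key maps to a `Series`, i.e. one column of data.<br>

It can be converted to an `ndarray` with `pd.DataFrame.to_numpy()` (`pd.DataFrame.as_matrix()` in older pandas versions; it was removed in pandas 1.0). <br>

A `DataFrame` can also be used like a 2-dimensional `ndarray`, but indexing and slicing work differently <br>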
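The dictionary view can be made concrete with a small hand-built `DataFrame`. This is a sketch with three illustrative rows, not the full iris data.

```python
import pandas as pd

# Build a DataFrame from a dict: each key becomes a column name
df = pd.DataFrame({
    'Petal length': [1.4, 1.4, 1.3],
    'Petal width': [0.2, 0.2, 0.2],
    'Species': ['setosa', 'setosa', 'setosa'],
})

column = df['Petal length']        # key lookup returns one column as a Series
print(type(column))
print(list(df.keys()))             # the keys are the column names

for name, series in df.items():    # iterate over (column name, Series) pairs
    print(name, series.dtype)
```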

In [10]:
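print('Shape of the DataFrame', iris.shape)
print('Mean of the DataFrame along axis 0, i.e. the mean of each column')
iris.mean(axis=0, numeric_only=True)  # skip the non-numeric Species column

Shape of the DataFrame (149, 5)
Mean of the DataFrame along axis 0, i.e. the mean of each column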


Sepal length    5.831544
Sepal width     3.057718
Petal length    3.742282
Petal width     1.192617
dtype: float64

In [11]:
print('A single column of a DataFrame is a Series', type(iris['Sepal length']))

A single column of a DataFrame is a Series <class 'pandas.core.series.Series'>


In [12]:
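try:
    iris[:10, [2,3, -1]]
except Exception:  # TypeError in older pandas; the exception type varies by version
    print('DataFrame slicing and indexing differ from ndarray')
    
print('DataFrame slicing and indexing:')
iris.iloc[:10,[2,3,-1]]

DataFrame slicing and indexing differ from ndarray
DataFrame slicing and indexing: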


Unnamed: 0,Petal length,Petal width,Species
0,1.4,0.2,setosa
1,1.4,0.2,setosa
2,1.3,0.2,setosa
3,1.5,0.2,setosa
4,1.4,0.3,setosa
5,1.7,0.4,setosa
6,1.4,0.3,setosa
7,1.5,0.2,setosa
8,1.4,0.2,setosa
9,1.5,0.1,setosa


In [13]:
iris.loc[:9,['Petal length', 'Petal width', 'Species']]  # .loc slices by label and includes the end label

Unnamed: 0,Petal length,Petal width,Species
0,1.4,0.2,setosa
1,1.4,0.2,setosa
2,1.3,0.2,setosa
3,1.5,0.2,setosa
4,1.4,0.3,setosa
5,1.7,0.4,setosa
6,1.4,0.3,setosa
7,1.5,0.2,setosa
8,1.4,0.2,setosa
9,1.5,0.1,setosa
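One detail worth keeping in mind when switching between `.iloc` and `.loc`: `.iloc` slices by position and excludes the end, as `ndarray` slicing does, while `.loc` slices by label and includes the end label. The sketch below uses a small stand-in `DataFrame` with the default integer labels rather than the iris data.

```python
import pandas as pd

df = pd.DataFrame({'value': range(20)})   # default integer labels 0..19

by_position = df.iloc[:10]   # positions 0..9, end excluded
by_label = df.loc[:10]       # labels 0..10, end label included

print(len(by_position), len(by_label))
```

So the same slice bound selects one more row with `.loc` than with `.iloc` when the labels are the default integers.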


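A `DataFrame` can be converted to an `ndarray` with `pd.DataFrame.to_numpy()` (`pd.DataFrame.as_matrix()` in older pandas versions; it was removed in pandas 1.0)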

In [14]:
iris_ndarray = iris.to_numpy()  # as_matrix() was removed in pandas 1.0
print('iris_ndarray is now of type', type(iris_ndarray))
print('Column names of iris', iris.columns.values)

iris_ndarray is now of type <class 'numpy.ndarray'>
Column names of iris ['Sepal length' 'Sepal width' 'Petal length' 'Petal width' 'Species']
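Because the iris table mixes numeric columns with a text column, converting the whole `DataFrame` yields an array of generic Python objects. The sketch below shows one way to obtain a purely numeric array instead. It assumes a pandas version that provides `DataFrame.to_numpy()` (0.24 or later) and uses a two-row stand-in for the iris data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Sepal length': [5.2, 4.9],
    'Sepal width': [3.5, 3.0],
    'Species': ['setosa', 'setosa'],
})

mixed = df.to_numpy()   # float and text columns together give dtype object

# Keep only the numeric columns before converting to get a float array
numeric = df.select_dtypes(include='number').to_numpy()

print(mixed.dtype, numeric.dtype)
print(numeric.shape)
```

The numeric array is the form that functions such as `np.dot` expect.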
