**<font color = black size=6>Experiment 10: The EM Algorithm</font>**

In [133]:
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import random
from scipy.stats import norm

**<font color = blue size=4>Part 1: Experiment Tasks</font>**

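<span style="color:purple">=======  
You are given a data file (height.csv) containing the heights of all students in a class, but not their genders. Assume that male heights follow one Gaussian distribution $N_1(\mu_1,\sigma_1^2)$ and female heights follow another, $N_2(\mu_2,\sigma_2^2)$. This gives a Gaussian mixture model: $x\sim\alpha_1 N_1(\mu_1,\sigma_1^2)+\alpha_2 N_2(\mu_2,\sigma_2^2)$, where the mixing weights satisfy $\alpha_1+\alpha_2=1$. Use the EM algorithm to estimate this mixture, i.e. find estimates of the parameters $\alpha_1,\alpha_2,\mu_1,\sigma_1,\mu_2,\sigma_2$. For brevity, write $\theta_1=(\alpha_1,\mu_1,\sigma_1)$ and $\theta_2=(\alpha_2,\mu_2,\sigma_2)$.</span>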

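<span style="color:purple">The data set (height.csv) contains a single feature: the student's height. Following the lecture slides, the latent variable $z_i$ is encoded as $z_i=1$ for a male student and $z_i=2$ for a female student.</span>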
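
The mixture model described above can be evaluated directly. The following is a minimal sketch, assuming illustrative parameter values that are not taken from the data; `mixture_pdf` is a hypothetical helper name. It also checks numerically that the density integrates to 1 when the weights sum to 1.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def mixture_pdf(x, alphas, mus, sigmas):
    """Density of a Gaussian mixture at x: sum_k alpha_k * N(x | mu_k, sigma_k^2)."""
    x = np.asarray(x, dtype=float)
    return sum(a * norm.pdf(x, loc=m, scale=s) for a, m, s in zip(alphas, mus, sigmas))

# Hypothetical parameter values, chosen only for illustration.
alphas, mus, sigmas = (0.6, 0.4), (175.0, 165.0), (5.0, 5.0)

# Integrate over a range that covers essentially all of the probability mass.
# `points` tells quad where the peaks are so the adaptive rule does not miss them.
total, _ = quad(lambda t: float(mixture_pdf(t, alphas, mus, sigmas)),
                100.0, 240.0, points=list(mus))
print(round(total, 6))
```

If the weights did not sum to 1, the integral would equal their sum instead, which is one quick way to spot an invalid initialization.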

**<font color = black size=4>E-step (Expectation Step)</font>**

<span style="color:purple">1) Load the data set 'height.csv' and convert it to the format you need.</span>

In [134]:
D = pd.read_csv('./data.csv')
print(type(D))
print(D.iloc[0])


<class 'pandas.core.frame.DataFrame'>
height    170.269308
Name: 0, dtype: float64


<span style="color:purple">2) Initialization  
Initialize the parameters at $t=0$: ($\alpha_1(0)$, $\alpha_2(0)$, $\mu_1(0)$, $\mu_2(0)$, $\sigma_1(0)$, $\sigma_2(0)$). </span>

In [135]:
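#parameter=[alpha1,alpha2,mu1,mu2,sigma1,sigma2]
# NOTE: toy values, used only to sanity-check P() below. They are not a valid
# starting point for EM on the height data: alpha1 + alpha2 should equal 1,
# and mu1, mu2 should be on the scale of the data (heights around 170).
parameter=[1,1,0,2,0.1,0.1]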

<span style="color:purple">3) Write the function P(x,parameter,z)  
Given the parameters $(\alpha_1(t),\alpha_2(t),\mu_1(t),\mu_2(t),\sigma_1(t),\sigma_2(t))$ and the data set D, compute $P(x_i,z_i|\theta)$ for each sample $x_i$.</span>

<span style="color:purple">.  
When $z_i=1$, $$P(x_i,z_i|\theta)=\alpha_1(t)f_1(x_i|\theta_1(t))$$
    When $z_i=2$, $$P(x_i,z_i|\theta)=\alpha_2(t)f_2(x_i|\theta_2(t))$$</span>

<span style="color:purple"> .   
Here $f_1(x_i|\theta_1(t))$ is the probability density of sample $x_i$ under model $N_1$ ($f_2$ is defined in the same way with $\mu_2,\sigma_2$):
    $$f_1(x_i|\theta_1(t))=\frac{1}{{\sqrt{2\pi}\sigma_1}} e^{-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}}$$</span>

In [136]:
#parameter=[alpha1,alpha2,mu1,mu2,sigma1,sigma2]
def P(x,parameter,z):
    # Joint density P(x, z | theta) = alpha_z * f_z(x | theta_z)
    if z == 1:
        alpha1 = parameter[0]
        mu1 = parameter[2]
        sigma1 = parameter[4]
        p = alpha1 * norm.pdf(x, mu1, sigma1)
    elif z == 2:
        alpha2 = parameter[1]
        mu2 = parameter[3]
        sigma2 = parameter[5]
        p = alpha2 * norm.pdf(x, mu2, sigma2)
    else:
        # Without this branch, p would be undefined for any other z
        raise ValueError("z must be 1 or 2")
    return p

print(P(2,parameter,2))

3.989422804014327
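
As a cross-check of the density formula in step 3), the sketch below evaluates the closed-form Gaussian density by hand and compares it with `scipy.stats.norm.pdf` for the same inputs as the check above ($\alpha_2=1$, $\mu_2=2$, $\sigma_2=0.1$, $x=2$). The helper name `gaussian_pdf` is an assumption made for this example.

```python
import math
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma):
    """Closed-form density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# alpha_2 * f_2(x | theta_2) computed two ways
closed_form = 1 * gaussian_pdf(2, 2, 0.1)
library = 1 * norm.pdf(2, loc=2, scale=0.1)
print(closed_form, library)
```

At $x=\mu$ the exponential equals 1, so the value reduces to $1/(\sqrt{2\pi}\sigma)$, which is about 3.9894 for $\sigma=0.1$. A density can exceed 1; only its integral is bounded by 1.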


<span style="color:purple">4) Write the function Y(x,parameter,z)  
Given the parameters $(\alpha_1(t),\alpha_2(t),\mu_1(t),\mu_2(t),\sigma_1(t),\sigma_2(t))$ and the data set D, compute for each sample $x_i$ the posteriors $y_{1,i}=P((z_i=1)|x_i,\theta)$ and $y_{2,i}=P((z_i=2)|x_i,\theta)$.  
The formulas are:  
</span>

$$P((z_i=1)|x_i,\theta) = \frac{\alpha_1(t)f_1(x_i|\theta_1(t))}{\alpha_1(t)f_1(x_i|\theta_1(t))+\alpha_2(t)f_2(x_i|\theta_2(t))}$$  
$$P((z_i=2)|x_i,\theta) = \frac{\alpha_2(t)f_2(x_i|\theta_2(t))}{\alpha_1(t)f_1(x_i|\theta_1(t))+\alpha_2(t)f_2(x_i|\theta_2(t))}$$

In [137]:
#parameter=[alpha1,alpha2,mu1,mu2,sigma1,sigma2]
def Y(x,parameter,z):
    alpha1, alpha2, mu1, mu2, sigma1, sigma2 = parameter
    # Joint densities alpha_k * f_k(x | theta_k) for both components
    p1 = alpha1 * norm.pdf(x, loc=mu1, scale=sigma1)
    p2 = alpha2 * norm.pdf(x, loc=mu2, scale=sigma2)
    # Posterior P(z | x, theta): normalise by the mixture density p1 + p2.
    # If both densities underflow to 0 (parameters far from the data),
    # this is 0/0 and the result is NaN.
    if z == 1:
        return p1 / (p1 + p2)
    elif z == 2:
        return p2 / (p1 + p2)
    raise ValueError("z must be 1 or 2")
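
The posterior formula divides by the mixture density, so it breaks down when both component densities underflow to zero, for example when the initial means are far from the data. The sketch below shows one common workaround, computing the same posteriors in log space with `scipy.special.logsumexp`. It is an alternative formulation offered under that assumption, not part of the assignment; `responsibilities` is a hypothetical name.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def responsibilities(x, parameter):
    """Return (y1, y2) for an array of samples x, computed in log space."""
    alpha1, alpha2, mu1, mu2, sigma1, sigma2 = parameter
    x = np.asarray(x, dtype=float)
    # log(alpha_k * f_k(x)) for k = 1, 2, stacked into shape (2, len(x))
    log_joint = np.stack([
        np.log(alpha1) + norm.logpdf(x, loc=mu1, scale=sigma1),
        np.log(alpha2) + norm.logpdf(x, loc=mu2, scale=sigma2),
    ])
    # Subtracting the log of the mixture density normalises each column
    log_post = log_joint - logsumexp(log_joint, axis=0)
    return np.exp(log_post[0]), np.exp(log_post[1])

x = np.array([170.0])
bad_init = [0.5, 0.5, 0.0, 1.0, 1.0, 1.0]   # means far from a height of 170

with np.errstate(invalid="ignore", divide="ignore"):
    p1 = 0.5 * norm.pdf(x, 0.0, 1.0)   # underflows to 0.0
    p2 = 0.5 * norm.pdf(x, 1.0, 1.0)   # underflows to 0.0
    naive_y1 = p1 / (p1 + p2)          # 0/0 gives nan

y1, y2 = responsibilities(x, bad_init)
print(naive_y1, y1 + y2)
```

The log-space version stays finite here, but a poor starting point can still lead EM to a poor local optimum, so sensible initial values remain important.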


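<span style="color:purple">5) Write the function Q(x,parameter)  
 Compute $Q(\theta)$, the expectation of the log-likelihood under the posterior distribution of $z$ given $\theta(t)$. The expectation for a single sample is: $$E_{z_i}logP(x_i,z_i|\theta)=y_{1,i}log(\alpha_1(t)f_1(x_i|\theta_1(t)))+y_{2,i}log(\alpha_2(t)f_2(x_i|\theta_2(t)))$$</span>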

In [138]:
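#parameter=[alpha1,alpha2,mu1,mu2,sigma1,sigma2]
def Q(x,parameter):
    alpha1 = parameter[0]
    alpha2 = parameter[1]
    mu1 = parameter[2]
    mu2 = parameter[3]
    sigma1 = parameter[4]
    sigma2 = parameter[5]
    # Accept a scalar, a Series, or a single-column DataFrame
    x = np.asarray(x, dtype=float).ravel()
    y1 = Y(x,parameter,1)
    y2 = Y(x,parameter,2)

    # log(alpha_k * f_k(x)); logpdf avoids log(0) when the density underflows
    log_p1 = np.log(alpha1) + norm.logpdf(x, loc=mu1, scale=sigma1)
    log_p2 = np.log(alpha2) + norm.logpdf(x, loc=mu2, scale=sigma2)

    # Weight each log term by its posterior, then sum over all samples.
    # (The posteriors multiply the logs; they do not go inside them.)
    q = np.sum(y1 * log_p1 + y2 * log_p2)

    return q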
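
For reference, summing the single-sample expectation from step 5) over all $m$ samples and expanding the Gaussian log-density gives the expression below. This is a sketch under the usual EM convention that the posteriors $y_{k,i}$ are computed at $\theta(t)$ and then held fixed while $\theta$ varies:

$$Q(\theta)=\sum_{i=1}^m\sum_{k=1}^2 y_{k,i}\left[\log\alpha_k-\log\left(\sqrt{2\pi}\,\sigma_k\right)-\frac{(x_i-\mu_k)^2}{2\sigma_k^2}\right]$$

Setting $\partial Q/\partial\mu_k=0$ gives $\sum_{i=1}^m y_{k,i}(x_i-\mu_k)=0$, that is, $\mu_k=\frac{\sum_{i=1}^m y_{k,i}x_i}{\sum_{i=1}^m y_{k,i}}$, which is the update rule used for $\mu$ in step 7).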

**<font color = black size=4>M-step (Maximization Step)</font>**

<span style="color:purple">6) Write the function alpha_expection(D,parameter)  
 Given the parameters $(\alpha_1(t),\alpha_2(t),\mu_1(t),\mu_2(t),\sigma_1(t),\sigma_2(t))$ and the data set D, compute the updated values $(\alpha_1(t+1),\alpha_2(t+1))$ for round $(t+1)$.
</span>

$$\alpha_1(t+1)=\frac{\sum_{i=1}^m{y_{1,i}}}{m}$$  
$$\alpha_2(t+1)=\frac{\sum_{i=1}^m{y_{2,i}}}{m}$$

In [139]:
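#parameter=[alpha1,alpha2,mu1,mu2,sigma1,sigma2]
def alpha_expection(D,parameter):
    # Flatten the data to a 1-D float array. Iterating D.items() on a DataFrame
    # yields whole columns, not samples, so the sums would not be scalars.
    x = np.asarray(D, dtype=float).ravel()
    m = x.size

    # Posteriors for all samples at once
    y1 = Y(x, parameter, 1)
    y2 = Y(x, parameter, 2)

    # New mixing weight = average posterior of each component
    alpha1_new = np.sum(y1) / m
    alpha2_new = np.sum(y2) / m

    return alpha1_new, alpha2_new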

<span style="color:purple">7) Write the function mu_expection(D,parameter)  
Given the parameters $(\alpha_1(t),\alpha_2(t),\mu_1(t),\mu_2(t),\sigma_1(t),\sigma_2(t))$ and the data set D, compute the updated values $(\mu_1(t+1),\mu_2(t+1))$ for round $(t+1)$.
</span>

$$\mu_1(t+1)=\frac{\sum_{i=1}^m{y_{1,i}x_i}}{\sum_{i=1}^m{y_{1,i}}}$$
$$\mu_2(t+1)=\frac{\sum_{i=1}^m{y_{2,i}x_i}}{\sum_{i=1}^m{y_{2,i}}}$$

In [140]:
#parameter=[alpha1,alpha2,mu1,mu2,sigma1,sigma2]
def mu_expection(D,parameter):
    # Flatten the data to a 1-D float array (works for a DataFrame, Series, or ndarray)
    x = np.asarray(D, dtype=float).ravel()

    # Posteriors for all samples at once
    y1 = Y(x, parameter, 1)
    y2 = Y(x, parameter, 2)

    # New mean = posterior-weighted average of the samples
    mu1_new = np.sum(y1 * x) / np.sum(y1)
    mu2_new = np.sum(y2 * x) / np.sum(y2)

    return mu1_new, mu2_new

<span style="color:purple">8) Write the function sigma_expection(D,parameter,mu_next_1,mu_next_2)  
Given the parameters $(\alpha_1(t),\alpha_2(t),\mu_1(t),\mu_2(t),\sigma_1(t),\sigma_2(t))$ and the data set D, compute the updated values $(\sigma_1(t+1),\sigma_2(t+1))$ for round $(t+1)$.
</span>

$$\sigma_1(t+1)=\sqrt{\frac{\sum_{i=1}^m{y_{1,i}(x_i-\mu_1(t+1))^2}}{\sum_{i=1}^m{y_{1,i}}}}$$
$$\sigma_2(t+1)=\sqrt{\frac{\sum_{i=1}^m{y_{2,i}(x_i-\mu_2(t+1))^2}}{\sum_{i=1}^m{y_{2,i}}}}$$

In [141]:
#parameter=[alpha1,alpha2,mu1,mu2,sigma1,sigma2]
def sigma_expection(D,parameter,mu_next_1,mu_next_2):
    # Flatten the data to a 1-D float array (works for a DataFrame, Series, or ndarray)
    x = np.asarray(D, dtype=float).ravel()

    # Posteriors for all samples at once, using the round-t parameters
    y1 = Y(x, parameter, 1)
    y2 = Y(x, parameter, 2)

    # New sigma = square root of the posterior-weighted variance around the NEW mean
    sigma1_new = (np.sum(y1 * (x - mu_next_1) ** 2) / np.sum(y1)) ** 0.5
    sigma2_new = (np.sum(y2 * (x - mu_next_2) ** 2) / np.sum(y2)) ** 0.5

    return sigma1_new, sigma2_new
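
The three update formulas in steps 6) to 8) share the same posteriors, so they can also be computed in a single pass. The sketch below is one possible vectorised arrangement; `m_step` is a hypothetical helper that takes the posteriors $y_{1,i}$ directly, and the toy data is made up for the check.

```python
import numpy as np

def m_step(x, y1):
    """One M-step for a two-component mixture, given posteriors y1 (y2 = 1 - y1)."""
    x = np.asarray(x, dtype=float)
    y1 = np.asarray(y1, dtype=float)
    y2 = 1.0 - y1
    n1, n2 = y1.sum(), y2.sum()                      # effective sample counts
    alpha1, alpha2 = n1 / x.size, n2 / x.size
    mu1, mu2 = (y1 * x).sum() / n1, (y2 * x).sum() / n2
    sigma1 = np.sqrt((y1 * (x - mu1) ** 2).sum() / n1)
    sigma2 = np.sqrt((y2 * (x - mu2) ** 2).sum() / n2)
    return [alpha1, alpha2, mu1, mu2, sigma1, sigma2]

# Toy check with hard (0/1) posteriors: the updates reduce to per-group statistics.
x = np.array([170.0, 180.0, 160.0, 164.0])
y1 = np.array([1.0, 1.0, 0.0, 0.0])
print(m_step(x, y1))
```

With hard assignments the first group is {170, 180} (mean 175, standard deviation 5) and the second is {160, 164} (mean 162, standard deviation 2), so the formulas can be verified by hand.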

**<font color = black size=4>Iterating the E-step and M-step</font>**

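<span style="color:purple">9) Use the functions written above to run the EM iterations until a convergence criterion is met. Complete the iteration for at least 3 different sets of initial values, then compare the results and choose the best.  
    Possible convergence criteria:  
    1. The number of iterations reaches a fixed limit;  
    2. The change in every parameter between consecutive rounds is below a threshold.</span>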
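
The two convergence criteria can be combined in one loop: stop as soon as the parameter change is small, and otherwise stop at the iteration limit. A minimal sketch follows, assuming the parameters are kept in a flat list; `has_converged`, `run_until_converged`, and the stand-in update rule are hypothetical names used only for illustration.

```python
def has_converged(old, new, threshold=1e-6):
    """Criterion 2: every parameter changed by less than threshold."""
    return max(abs(n - o) for o, n in zip(old, new)) < threshold

def run_until_converged(update, parameter, max_iter=100, threshold=1e-6):
    """Apply update repeatedly; stop on criterion 2, or on criterion 1 (max_iter)."""
    for iteration in range(1, max_iter + 1):
        new_parameter = update(parameter)
        if has_converged(parameter, new_parameter, threshold):
            return new_parameter, iteration
        parameter = new_parameter
    return parameter, max_iter

def toy_update(p):
    """Stand-in update rule (not EM): halves the distance to the fixed point 3.0."""
    return [3.0 + 0.5 * (v - 3.0) for v in p]

final, n_iter = run_until_converged(toy_update, [0.0], max_iter=100, threshold=1e-6)
print(final, n_iter)
```

In an EM run, `update` would perform one E-step followed by one M-step. Note that a NaN change never compares as smaller than the threshold, so a run that produces NaN only stops at the iteration limit.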

<img src='./EM Algorithm Pseudocode.png'>

.  
For reference, the true parameters of this data set are: $\theta_1:(\alpha_1=0.625,\mu_1=175,\sigma_1=4)$, $\theta_2:(\alpha_2=0.375,\mu_2=165,\sigma_2=6)$

In [142]:
# EM iteration: repeat the E-step and M-step, starting from the given initial parameters
def em_algorithm(D, parameter, max_iter=100, threshold=1e-6):
    # parameter=[alpha1,alpha2,mu1,mu2,sigma1,sigma2]
    parameter = list(parameter)
    history = [parameter]   # parameters after each round; history[0] is the initial value

    for it in range(max_iter):
        # E-step: the posteriors y1, y2 are computed by Y() inside each update function.
        # M-step: update alpha, mu, sigma from the previous round's parameters.
        alpha1, alpha2 = alpha_expection(D, parameter)
        mu1, mu2 = mu_expection(D, parameter)
        sigma1, sigma2 = sigma_expection(D, parameter, mu1, mu2)
        new_parameter = [alpha1, alpha2, mu1, mu2, sigma1, sigma2]
        history.append(new_parameter)

        # Convergence: every parameter changed by less than the threshold
        delta = max(abs(n - o) for n, o in zip(new_parameter, parameter))
        parameter = new_parameter
        if delta < threshold:
            break

    return history


# Observed-data log-likelihood, used here to compare runs from different initial values
def log_likelihood(x, parameter):
    alpha1, alpha2, mu1, mu2, sigma1, sigma2 = parameter
    return np.sum(np.log(alpha1 * norm.pdf(x, mu1, sigma1) + alpha2 * norm.pdf(x, mu2, sigma2)))


# Heights as a 1-D Series of floats (D is a single-column DataFrame)
heights = pd.Series(np.asarray(D, dtype=float).ravel())

# Maximum number of iterations and convergence threshold
max_iter = 1000
threshold = 1e-6

# Different initial values. They must be on the scale of the data (heights around 170):
# with means near 0, both Gaussian densities underflow to 0 and the posterior in Y()
# becomes 0/0 = NaN. The previous version also ignored these values inside em_algorithm.
initial_parameters = [
    [0.5, 0.5, 180, 160, 5, 5],
    [0.3, 0.7, 170, 165, 10, 10],
    [0.6, 0.4, 178, 168, 3, 8]
]

best_parameter = None
best_likelihood = float('-inf')

# Run EM from each initial value and keep the run with the highest log-likelihood
for parameters in initial_parameters:
    history = em_algorithm(heights, parameters, max_iter, threshold)
    final_likelihood = log_likelihood(heights, history[-1])
    print("init:", parameters, "iterations:", len(history) - 1, "log-likelihood:", final_likelihood)

    if final_likelihood > best_likelihood:
        best_parameter = history[-1]
        best_likelihood = final_likelihood

# Print the best result
print("Best Parameters:")
print("alpha1:", best_parameter[0])
print("alpha2:", best_parameter[1])
print("mu1:", best_parameter[2])
print("mu2:", best_parameter[3])
print("sigma1:", best_parameter[4])
print("sigma2:", best_parameter[5])


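
A useful sanity check for any EM implementation is that the observed-data log-likelihood never decreases from one iteration to the next. The sketch below runs a compact, vectorised EM loop on synthetic data and records the log-likelihood at each step. The data is simulated from the reference parameters quoted above, not read from the assignment's file, and `em_step`, the seed, and the starting point are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

def em_step(x, parameter):
    """One EM iteration; returns (new_parameter, log-likelihood at the input parameter)."""
    alpha1, alpha2, mu1, mu2, sigma1, sigma2 = parameter
    # E-step: joint densities and posteriors
    p1 = alpha1 * norm.pdf(x, mu1, sigma1)
    p2 = alpha2 * norm.pdf(x, mu2, sigma2)
    log_lik = np.log(p1 + p2).sum()
    y1 = p1 / (p1 + p2)
    y2 = 1.0 - y1
    # M-step: weights, means, then standard deviations around the new means
    n1, n2 = y1.sum(), y2.sum()
    mu1_new, mu2_new = (y1 * x).sum() / n1, (y2 * x).sum() / n2
    sigma1_new = np.sqrt((y1 * (x - mu1_new) ** 2).sum() / n1)
    sigma2_new = np.sqrt((y2 * (x - mu2_new) ** 2).sum() / n2)
    new_parameter = [n1 / x.size, n2 / x.size, mu1_new, mu2_new, sigma1_new, sigma2_new]
    return new_parameter, log_lik

# Synthetic heights: component 1 ~ N(175, 4^2) with weight 0.625, component 2 ~ N(165, 6^2)
rng = np.random.default_rng(0)
n = 2000
from_first = rng.random(n) < 0.625
x = np.where(from_first, rng.normal(175, 4, n), rng.normal(165, 6, n))

parameter = [0.5, 0.5, 180.0, 160.0, 5.0, 5.0]   # starting point on the scale of the data
log_liks = []
for _ in range(50):
    parameter, log_lik = em_step(x, parameter)
    log_liks.append(log_lik)

print(parameter)
print(bool(np.all(np.diff(log_liks) >= -1e-6)))   # monotone up to rounding error
```

Because the two components overlap heavily, the estimates from a single run of this size may differ noticeably from the reference values; the monotonicity check is about the correctness of the updates, not the accuracy of the estimates.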

**<font color = blue size=4>Part 2: Submission</font>**

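I. Submit your completed code before the lab session ends. If you have not finished by then, submit the parts you have completed and include the remaining parts in the lab report afterwards.  
Requirements:  
1) File name format: StudentID-Name.ipynb  
2) Do NOT submit folders, archives, data sets, or other unrelated files; submit only the single ipynb file. If you submit the wrong file, come to the front desk and ask the teaching assistant to delete the wrong version before you resubmit.

II. Lab report deadline: November 24, 14:20  
Requirements:  
1) File name format: StudentID-Name.pdf  
2) Do NOT submit folders, archives, code files, data sets, or any other files unrelated to the lab report; submit only the single pdf file.  
3) Do not add extra information such as the experiment number to the file name; follow the format above.  
4) The submission address for the lab report changes every week and is time-limited. Reports are due at the start of the following week's lab session, so please submit on time.

Submission address for the Experiment 10 (EM algorithm) lab report: https://send2me.cn/9UjusmMn/S_Cz3o_FpKQEsA  

III. Lecture slides: https://www.jianguoyun.com/p/DZKTh-IQp5WhChiIn6gFIAA  
Lab materials: https://www.jianguoyun.com/p/DbCHB9wQp5WhChiKn6gFIAA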