# Ancestral Sampling

This notebook performs ancestral sampling and computes the accuracy of the sampled labels on both the train and test sets.

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model
import torch
from scipy.special import softmax 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from collections import Counter

### Data Processing

The data is loaded from the pickle file and divided into dataframes corresponding to the features and the target attributes.

In [2]:
# Load Data
df = pd.read_pickle('./pickle/df.pkl')

In [3]:
# Sample 200 random data points from the whole data
df_sampled = df.sample(200,random_state=47)
df_features = df_sampled.iloc[:,:-1].copy()
df_target = df_sampled.iloc[:,-1].copy()

In [4]:
# Normalize the data
scaler = StandardScaler()
df_features = scaler.fit_transform(df_features)
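
As a rough illustration of what the scaling step does, the sketch below standardizes a small made-up matrix by hand with NumPy: each column is shifted to zero mean and divided by its population standard deviation (`ddof=0`), which is what `StandardScaler` computes by default. The array `X_toy` is a hypothetical stand-in for the feature columns, not the notebook's data.

```python
import numpy as np

# Hypothetical toy feature matrix (3 samples, 2 features); not the notebook's data
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

# Column-wise mean and population standard deviation (ddof=0)
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)

# Standardize: zero mean and unit variance per column
X_scaled = (X_toy - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```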

In [5]:
# Divide the data into a train set and a test set
df_train_features, df_test_features, df_train_target, df_test_target = train_test_split(
    df_features, df_target, stratify=df_target, random_state=47, test_size=0.5
)
df_train_target = df_train_target.to_numpy()
df_test_target = df_test_target.to_numpy()

In [6]:
# Parameters
D = df_train_features.shape[1]
N_train = df_train_features.shape[0]
N_test = df_test_features.shape[0]
n_cat = 11 

### Ancestral Sampling 

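Sample random coefficients from a standard normal distribution, draw a label for each training point from the resulting softmax probabilities, and compute the accuracy on the train set. The accuracy obtained does not exceed 0.09, which is about the chance level for 11 categories (1/11 ≈ 0.09), so the sampled labels match the true ones no better than random guessing.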

In [7]:
# Sample one coefficient vector (beta); printed for inspection only, not used below
beta = np.random.normal(0,1,size=D)
print("beta:", beta)

# Sample one coefficient vector per category
beta_array = np.zeros((n_cat,D))
for i in range(n_cat):
    beta_array[i,:] = np.random.normal(0,1,size=D)

# Sample observations (y's) as one-hot vectors
y = np.zeros((N_train,n_cat))
for n in range(N_train):
    # Linear score (logit) of each category for data point n
    probs = np.zeros(n_cat)
    for i in range(n_cat):
        probs[i] = np.dot(beta_array[i,:], df_train_features[n,:])

    # Convert the scores to probabilities and draw one category
    p = softmax(probs)
    y[n,:] = np.random.multinomial(1, p)
    print('n, p and y ', n, p, y[n,:])

beta: [ 2.0029377   0.46873328 -0.45788922  0.15469789  1.65517814 -1.40596334
 -1.1825488   0.00703063 -0.70670871  1.33186812  0.06180392  0.98037278
  0.06253616  0.57500729]
n, p and y  0 [1.24546751e-05 6.86118818e-04 9.61971559e-01 1.66829003e-03
 2.65351494e-04 1.35638617e-04 4.35781795e-05 3.10429792e-02
 8.96409490e-05 4.08296138e-03 1.42813474e-06] [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
n, p and y  1 [0.19251091 0.00239054 0.09833176 0.01261024 0.0176309  0.18107501
 0.00630443 0.39141977 0.02773475 0.01192291 0.05806878] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
n, p and y  2 [3.23834143e-05 2.39778031e-03 9.96392469e-02 1.65363161e-04
 1.05762586e-03 3.64137027e-04 6.82780540e-03 8.55134065e-06
 3.12335224e-01 5.41193385e-01 3.59784974e-02] [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
n, p and y  3 [3.10402463e-01 6.25188719e-02 4.67421331e-05 1.04647120e-04
 1.85487588e-08 1.43115655e-04 1.36930143e-04 2.99961396e-03
 6.19988425e-01 2.13009940e-04 3.44616243e-03] [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
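
The sampling loop above can also be written in vectorized form. The following is a minimal sketch under stated assumptions, not the notebook's own implementation: `sample_labels` is a hypothetical helper, the feature matrix `X_demo` is random stand-in data, and the softmax is computed directly with NumPy rather than with `scipy.special.softmax`.

```python
import numpy as np

def sample_labels(X, n_cat, rng):
    """Ancestral sampling sketch: draw coefficients, then labels given the coefficients.

    X is an (N, D) feature matrix; returns (beta, probs, labels).
    """
    n_features = X.shape[1]
    # Step 1: sample one coefficient vector per category from N(0, 1)
    beta = rng.normal(0.0, 1.0, size=(n_cat, n_features))
    # Step 2: linear scores for every (data point, category) pair
    logits = X @ beta.T
    # Numerically stable softmax over the category axis
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    # Step 3: sample one category index per data point from its probabilities
    labels = np.array([rng.choice(n_cat, p=p) for p in probs])
    return beta, probs, labels

rng = np.random.default_rng(47)
X_demo = rng.normal(size=(5, 14))  # hypothetical stand-in for the scaled train features
beta_demo, probs_demo, labels_demo = sample_labels(X_demo, n_cat=11, rng=rng)
print(probs_demo.sum(axis=1))  # each row sums to 1 (up to rounding)
print(labels_demo)             # one integer label in 0..10 per data point
```

Returning integer labels directly avoids the later `argmax` over one-hot vectors, and passing in a feature matrix makes it possible to sample labels for any split.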

### Accuracy

In [9]:
# Create a label array to compare with the true values
label_array = np.array([]) 
for i in y:
    label_array = np.append(label_array,np.argmax(i))

In [10]:
print("Train Accuracy:", 1.0*np.sum(label_array == df_train_target.flatten()) / len(df_train_target))

Train Accuracy: 0.09


In [11]:
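# label_array holds labels sampled from the train features; it is compared here with the test labels
print("Test Accuracy:", 1.0*np.sum(label_array == df_test_target.flatten()) / len(df_test_target))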

Test Accuracy: 0.09
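
For reference, the accuracy computation can be wrapped in a small helper and compared against the chance level of 1/11 for 11 categories. This is a sketch with made-up labels: `accuracy`, `true_labels`, and `random_labels` are hypothetical names, and the labels are drawn uniformly at random rather than taken from the notebook's data.

```python
import numpy as np

def accuracy(predicted, true):
    """Fraction of positions where the predicted label equals the true label."""
    predicted = np.asarray(predicted)
    true = np.asarray(true)
    return float(np.mean(predicted == true))

n_cat = 11
rng = np.random.default_rng(0)

# Hypothetical labels: both arrays are drawn uniformly and independently
true_labels = rng.integers(0, n_cat, size=100_000)
random_labels = rng.integers(0, n_cat, size=100_000)

print("Chance level:", 1 / n_cat)
print("Accuracy of random labels:", accuracy(random_labels, true_labels))
```

An accuracy close to this baseline, such as the 0.09 reported above, is consistent with predictions that carry no information about the true labels.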
