In [121]:
import pandas as pd
import numpy as np
import seaborn as sns
from itertools import combinations
import matplotlib.pyplot as plt

**Data Description**<br>
Attached are the absolute values (unitless) and phase values (in radians) of the sound field distribution corresponding to 20 different input locations.

Each file contains a matrix with two columns of data. The first column holds the spatial coordinates of the field distribution. The second column holds the absolute or phase values. The spatial coordinates in the first column start at 0, go up to 100, then jump back to 0 and go up to 100 again. Each run of data from 0 to 100 represents the field distribution generated by one input location. There are 20 groups of data in total, corresponding to 20 different input locations.

Once the data is separated into 20 groups, each group is the output (the field distribution along a line) produced when the input is at a particular location along another line. For example, the first group corresponds to location x = 1, the second group to location x = 2, and the 20th group to location x = 20.
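
The grouping described above can be sketched on a small synthetic array. This is only an illustration: `data`, `starts`, and `fields` are hypothetical names, and the array uses 3 groups of 5 points instead of 20 groups covering 0 to 100.

```python
import numpy as np

# Synthetic stand-in for one file: 3 groups with coordinates 0..4
# (the real files have 20 groups with coordinates 0..100).
n_groups, n_points = 3, 5
coords = np.tile(np.arange(n_points), n_groups)
values = np.arange(n_groups * n_points, dtype=float)
data = np.column_stack([coords, values])

# A new group starts wherever the coordinate column returns to 0.
starts = np.flatnonzero(data[:, 0] == 0)

# Split the value column at every group start except the first one.
groups = np.split(data[:, 1], starts[1:])

# Row k (0-based) holds the field for input location x = k + 1.
fields = np.vstack(groups)
print(fields.shape)
```

Stacking the groups as rows gives one field per input location, which is the layout the later cells rely on.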

The goal is to use this data to train a model so that, given a particular output that is a linear combination of any of the 20 output groups, the model identifies which input locations were combined.

The training data can be any linear combination of the 20 groups.
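
As a minimal sketch of what one training sample looks like, assume the combination weights are restricted to 0 or 1 (a group is either included or not). The `fields` matrix below is random data standing in for the 20 measured groups, and `active` and `observed` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_points = 20, 101

# Hypothetical field matrix: one row per input location (random stand-in data).
fields = rng.normal(size=(n_inputs, n_points))

# Binary selection vector: 1 means that input location is active.
active = np.zeros(n_inputs, dtype=int)
active[[0, 8, 17]] = 1  # locations x = 1, 9 and 18

# The observed output is the sum of the active rows.
observed = active @ fields

# The same result, written as a boolean-mask sum over rows.
masked_sum = fields[active.astype(bool)].sum(axis=0)
print(np.allclose(observed, masked_sum))
```

The learning task is then the inverse problem: recover `active` from `observed`.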

In [122]:
abs_val = pd.read_csv("abs_p.csv")

In [123]:
phase_val = pd.read_csv("arg_p.csv")

In [124]:
# Drop the first 7 rows, which describe the data rather than contain it
abs_val = abs_val.iloc[7:]
phase_val = phase_val.iloc[7:]

In [125]:
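# Divide the original dataset into 20 groups, one per input location
input_dist = list()
# Each group starts where the coordinate column resets to '0' (values are still strings here).
# The index labels keep the 7-row offset from the slice above, so subtract 7 to get positions.
group_start = list(abs_val[abs_val.iloc[:, 0] == '0'].index)
group_start.append(len(abs_val) + 7)  # sentinel marking the end of the last group
for i in range(len(group_start) - 1):
    start_idx = group_start[i] - 7
    end_idx = group_start[i + 1] - 7
    group_abs = np.array(abs_val[start_idx:end_idx]).astype(float)
    group_phase = np.array(phase_val[start_idx:end_idx]).astype(float)
    # The complex pressure would be abs * exp(1j * phase); the real-valued
    # form below is what the rest of this notebook uses.
    # wave_pressure = group_abs[:, 1] * np.exp(group_phase[:, 1] * 1j)
    wave_pressure = group_abs[:, 1] * np.exp(group_phase[:, 1])
    input_dist.append(wave_pressure)
input_dist = np.array(input_dist)  # shape: (number of groups, points per group)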

In [126]:
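# Randomly generate the data set: each sample is the sum of a random subset of groups
position = np.arange(20) + 1
choices = list()
output = list()
for _ in range(10_000):
    # Draw a random inclusion probability so that sparse and dense subsets both occur
    random_a = np.random.uniform()
    random_b = 1 - random_a
    choice = np.random.choice([0, 1], size=20, p=[random_a, random_b])
    choices.append(choice)
    # Sum along axis 0 so an empty selection yields a zero vector, not the scalar 0
    output.append(input_dist[choice.astype(bool)].sum(axis=0))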

Try several supervised learning methods and compare them by the score each one returns.
Candidates are decision trees, random forests, linear regression, SVM, and logistic regression. A neural network is not needed.
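
One possible way to set up such a comparison is sketched below. It is an assumption about the framing rather than the notebook's own method: the summed outputs are treated as features and the 0/1 choice vectors as multi-output targets, and `fields` is random stand-in data for the measured groups. Only two of the listed models are shown.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
fields = rng.normal(size=(20, 101))          # hypothetical stand-in for the 20 group fields
labels = rng.integers(0, 2, size=(500, 20))  # 0/1 choice vectors, one row per sample
features = labels @ fields                   # summed outputs, one row per sample

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

scores = {}
for name, model in [
    ("linear", LinearRegression()),
    ("tree", DecisionTreeRegressor(random_state=0)),
]:
    model.fit(X_train, y_train)
    # For regressors, score() is R^2, averaged uniformly over the 20 outputs.
    scores[name] = model.score(X_test, y_test)

print(scores)
```

Because the outputs are exact linear combinations of the fields, a linear model is expected to do well on noise-free data like this, while a single tree has to memorise subsets.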

In [127]:
# Import the machine learning models used below
from sklearn.model_selection import train_test_split
# out_train, out_test = train_test_split(output, test_size = 0.2, random_state=42)
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression


In [128]:
lin_reg = LinearRegression()
tree_reg = DecisionTreeRegressor()

In [129]:
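# Transpose so that rows are spatial points and columns are the 20 group fields.
# A plain ndarray is used because recent scikit-learn versions reject np.matrix.
input_dist = input_dist.T
# Fit one generated sample: the coefficients should recover its 0/1 choice vector
model = lin_reg.fit(input_dist, output[1])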

In [130]:
model.coef_

array([ 1.00000000e+00,  2.24621029e-15,  1.82891211e-15,  1.75626187e-15,
        2.37061698e-15,  1.80937828e-15,  1.09045805e-15,  8.31721692e-16,
        1.00000000e+00,  1.00000000e+00,  1.00000000e+00, -8.81579572e-17,
        1.00000000e+00, -1.60688889e-15, -1.43946289e-15, -2.54121796e-15,
       -1.78706784e-15,  1.00000000e+00,  1.00000000e+00,  1.83452811e-16])

In [131]:
choices[1]

array([1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0])
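
The fitted coefficients match the choice vector up to floating-point noise of about 1e-15, so rounding them gives the selected input locations. The sketch below reproduces that step on random stand-in data, using `np.linalg.lstsq` (plain least squares without an intercept) in place of `LinearRegression`; `fields`, `true_choice`, and `recovered` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(1)
fields = rng.normal(size=(20, 101))        # hypothetical stand-in for the 20 group fields
true_choice = rng.integers(0, 2, size=20)  # which input locations are active
observed = true_choice @ fields            # summed output for that selection

# Solve fields.T @ coef = observed in the least-squares sense.
coef, *_ = np.linalg.lstsq(fields.T, observed, rcond=None)

# Round away the floating-point noise to get a 0/1 vector.
recovered = np.rint(coef).astype(int)
print(np.array_equal(recovered, true_choice))
```

This exact recovery relies on the fields being linearly independent and the output being noise-free; with measurement noise, a threshold on the coefficients would need tuning.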