# Introduction to Data Fusion
Developed by Roger Wang (rq.wang@rutgers.edu)

## What's data fusion?
Data fusion is the process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source.

An introduction: https://www.youtube.com/watch?v=6qV3YjFppuc

## We are good at data fusion naturally
Humans perform data fusion all the time and are very good at it.

REALLY? Let me show you some examples.

McGurk Effect: https://www.youtube.com/watch?v=PWGeUztTkRA

Cocktail party effect: https://www.youtube.com/watch?v=mN--nV61gDo

Now, let's learn how we can use mathematics to do data fusion.

## Motivation: 
"Due to the rich characteristics of natural processes and environments, it is rare that a single acquisition method provides complete understanding thereof. Information about a phenomenon or a system of interest can be obtained from different types of instruments, measurement techniques, experimental setups, and other types of sources. "

## Data fusion is a challenging task for several reasons:
1. the data are generated by very complex systems: biological, environmental, sociological, and psychological, to name a few, driven by numerous underlying processes that depend on a large number of variables to which we have no access.
2. due to the augmented diversity, the number, type, and scope of new research questions that can be posed is potentially very large. 
3. working with heterogeneous data sets such that the respective advantages of each data set are maximally exploited, and drawbacks suppressed, is not an evident task. (Lahat et al., 2015)

In a broader perspective, each data acquisition framework is a modality, and a collection of such frameworks is called multimodal. Data assimilation can be treated as a multimodal fusion of a modeling modality and a data stream modality.

A key property of a multimodal system is complementarity: each modality brings value (or information) that helps resolve the uncertainty. "In mathematical terms, this added value is known as diversity. Diversity allows to reduce the number of degrees of freedom in the system by providing constraints that enhance uniqueness, interpretability, robustness, performance, and other desired properties, ... Diversity can be found in a broad range of scenarios, and plays a key role in a wide scope of mathematical and engineering studies." (Lahat et al., 2015)

## Mathematical Presentation:
In general, we are interested in a system:

$$ x=f(\mathbf{z}),$$
where $\mathbf{z}$ is a vector of contributing factors that determine the system state $x$.

## Data fusion types:
There are two types of data fusion methods: model-driven and data-driven. When a forecasting model is unknown, too complicated to use, or rapidly changing, we have to use data-driven (model-free) methods.

## Data preparation
This time, we will focus on the global temperature reconstructed from various sources. The data can be downloaded from https://www.ncdc.noaa.gov/paleo-search/study/10437

In [None]:
%cd
#!mkdir ./tempData
%cd ./tempData
#!wget https://www1.ncdc.noaa.gov/pub/data/paleo/contributions_by_author/frank2010/ensembles-10yearsmth.txt

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('ensembles-10yearsmth.txt',delimiter='\t')
m, n = df.shape
df.head()

In [None]:
fig=plt.plot(df['Year'],df.iloc[:,1:],'.')

In [None]:
df['mean']=df.iloc[:,1:].mean(axis=1)
df.head()

In [None]:
fig=plt.plot(df['Year'],df.iloc[:,1:n],'.',color='grey',alpha=0.1)
fig=plt.plot(df['Year'],df['mean'],'white')

In [None]:
# sample variance of each ensemble member
var=df.iloc[:,1:n].var()
var.hist()

In [None]:
df['var_mean1']=df.iloc[:,1:n]@np.reciprocal(var)/np.sum(np.reciprocal(var))
fig=plt.plot(df['Year'],df.iloc[:,1:-2],'.',color='grey',alpha=0.1)
fig=plt.plot(df['Year'],df['mean'],'white')
fig=plt.plot(df['Year'],df['var_mean1'],'black')
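
The cell above weights each ensemble member by the reciprocal of `var` and normalizes the weights so they sum to one. If the members are treated as independent, unbiased measurements $y_i$ of the same quantity with variances $\sigma_i^2$, the minimum-variance linear combination is $\hat{x}=\sum_i w_i y_i$ with $w_i = \sigma_i^{-2} / \sum_j \sigma_j^{-2}$. The following is a minimal sketch on synthetic data, not on the temperature ensemble; the signal, the three noise levels, and the helper `rmse` are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one true signal observed by three sensors
# with different (assumed known) noise standard deviations
truth = np.sin(np.linspace(0, 2 * np.pi, 500))
sigmas = np.array([0.1, 0.5, 1.0])
obs = truth[:, None] + rng.normal(0.0, sigmas, size=(500, 3))

# Inverse-variance weights, normalized to sum to one
weights = 1.0 / sigmas**2
weights /= weights.sum()

naive = obs.mean(axis=1)   # equal weights
fused = obs @ weights      # inverse-variance weights


def rmse(estimate):
    """Root-mean-square error against the known synthetic truth."""
    return np.sqrt(np.mean((estimate - truth) ** 2))


print(f"naive mean RMSE:    {rmse(naive):.3f}")
print(f"weighted mean RMSE: {rmse(fused):.3f}")
```

In this toy case the weighted mean should track the truth more closely than the naive mean, because the noisiest sensor contributes very little. With real data the variances are not known and have to be estimated, which is what the surrounding cells attempt.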

In [None]:
from sklearn.linear_model import LinearRegression

# drop rows with missing values before fitting
df=df.dropna()

model = LinearRegression(fit_intercept=True)
year=df[['Year']].to_numpy()    # shape (m, 1), as scikit-learn expects

# variance of each ensemble member about its own linear trend
var=np.zeros(n-1)
for i, c in enumerate(df.columns[1:n]):
    y=df[c].to_numpy()
    model.fit(year,y)
    yfit=model.predict(year)
    var[i]=np.var(y-yfit)

plt.hist(var)
plt.show()    

df['var_mean2']=df.iloc[:,1:n]@np.reciprocal(var)/np.sum(np.reciprocal(var))
# fig=plt.plot(df['Year'],df.iloc[:,1:-2],'.',color='grey',alpha=0.1)
fig=plt.plot(df['Year'],df['mean'],'black',label='naive mean')
fig=plt.plot(df['Year'],df['var_mean1'],'blue',label='weighted mean 1')    
fig=plt.plot(df['Year'],df['var_mean2'],'red',label='weighted mean 2')    
plt.legend()
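
The reason for removing a trend before computing the variance is that the raw variance of a series mixes the slow signal with the noise, while the weights are meant to reflect the noise only. Here is a small sketch of that effect on a made-up series. It uses `np.polyfit` rather than scikit-learn's `LinearRegression`, and the slope and noise level are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical series: a linear trend plus noise with known standard deviation
t = np.arange(1000, dtype=float)
noise_sd = 0.2
y = 0.002 * t + rng.normal(0.0, noise_sd, size=t.size)

# Raw variance mixes the trend (signal) with the noise
raw_var = np.var(y)

# Fit and remove a straight line, then measure the variance of the residual
slope, intercept = np.polyfit(t, y, deg=1)
resid_var = np.var(y - (slope * t + intercept))

print(f"true noise variance: {noise_sd**2:.4f}")
print(f"raw variance:        {raw_var:.4f}")
print(f"residual variance:   {resid_var:.4f}")
```

The residual variance should land close to the true noise variance, while the raw variance is inflated by the trend. A series with a strong trend but little noise would otherwise be down-weighted for the wrong reason.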

# Challenge:
Complete the following code to remove the trend with a polynomial regression (degree 2), calculate the variance of the residuals, and compute the weighted average with the new variance vector.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# poly = PolynomialFeatures(__, include_bias=False)

var=np.zeros(n-1)
for i, c in enumerate(df.____):
    poly_model = make_pipeline(PolynomialFeatures(___),
                           LinearRegression())               #
    poly_model.fit(df[___].to_numpy()[:,np.newaxis],df[___].to_numpy())     # fitting the data
    yfit=poly_model.predict(df[___].to_numpy()[:,np.newaxis])              # making predictions
    var[i]=np.var(df[___].to_numpy()-yfit)                  # calculate variance

plt.hist(var)
plt.show()    

df['var_mean3']=df.iloc[:,1:n]@np.reciprocal(var)/np.sum(np.reciprocal(var))    # normalization

# fig=plt.plot(df['Year'],df.iloc[:,1:-2],'.',color='grey',alpha=0.1)
fig=plt.plot(df['Year'],df['mean'],'black',label='naive mean')
fig=plt.plot(df['Year'],df['var_mean1'],'blue',label='weighted mean 1')    
fig=plt.plot(df['Year'],df['var_mean2'],'red',label='weighted mean 2')    
fig=plt.plot(df['Year'],df['var_mean3'],'yellow',label='weighted mean 3')    

plt.legend()

# Data Fusion using Matrix Factorization

In [None]:
#!pip install pymf3

In [None]:
import pymf3
import numpy as np

data=df.iloc[:,1:n].to_numpy()

# nmf_mdl = pymf3.semiNMF(data, num_bases=1)

nmf_mdl = pymf3.semiNMF(data, num_bases=1)
nmf_mdl.factorize(niter=1000)
# plt.plot(nmf_mdl.W)
# scalar=np.mean(np.mean(data))/np.mean(nmf_mdl.W)

df['seminmf']=nmf_mdl.W.ravel()    # flatten the single basis column to 1-D
# fig=plt.plot(df['Year'],df.iloc[:,1:-2],'.',color='grey',alpha=0.1)
fig = plt.figure(figsize=(12,8))
fig=plt.plot(df['Year'],df['mean'],'black',label='naive mean')
fig=plt.plot(df['Year'],df['var_mean1'],'blue',label='weighted mean 1')    
fig=plt.plot(df['Year'],df['var_mean2'],'red',label='weighted mean 2')    
fig=plt.plot(df['Year'],df['var_mean3'],'yellow',label='weighted mean 3')    
fig=plt.plot(df['Year'],df['seminmf']*np.mean(nmf_mdl.H),'purple',label='semi-NMF')    
# fig=plt.plot(df['Year'],df.iloc[:,1:3],'x',label='semi-NMF',alpha=0.5)    

plt.legend()
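
With a single basis, the factorization approximates the data matrix as the product of one column (a shared temporal pattern) and one row (a loading for each series). The two factors are only determined up to a common scale, which is why the plot above multiplies `W` by the mean of `H` to bring it back to the units of the data. The sketch below illustrates the idea with a truncated SVD from NumPy as a stand-in. It is not semi-NMF, which additionally constrains one of the two factors to be nonnegative, and the synthetic signal and noise levels are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 200 time steps, 5 series sharing one signal
# with different noise levels
signal = np.cos(np.linspace(0, 4 * np.pi, 200)) + 2.0
noise_sd = [0.1, 0.2, 0.3, 0.4, 0.5]
data = signal[:, None] + rng.normal(0.0, noise_sd, size=(200, 5))

# Best rank-1 approximation data ~ w @ h via truncated SVD
U, s, Vt = np.linalg.svd(data, full_matrices=False)
w = U[:, :1] * s[0]   # shape (200, 1): shared temporal pattern
h = Vt[:1, :]         # shape (1, 5): loading of each series on that pattern

# w is defined only up to scale and sign; multiplying by mean(h)
# expresses it in the units of the data
fused = (w * h.mean()).ravel()

error = np.sqrt(np.mean((fused - signal) ** 2))
print(f"RMSE of rank-1 estimate against the synthetic signal: {error:.3f}")
```

Because the sign ambiguity affects `w` and `h` together, the product `w * h.mean()` is unaffected by it.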