# Data science is OSEMN

According to a popular model, the elements of data science are

* Obtaining data
* Scrubbing data
* Exploring data
* Modeling data
* iNterpreting data

and hence the acronym OSEMN, pronounced as “Awesome”.

We will start with the **O** and move on to the rest later, but first let's take a quick look at what it all boils down to:

In [None]:
import numpy as np
data = np.loadtxt('populations.txt')
year, hares, lynxes, carrots = data.T # trick: columns to variables


from matplotlib import pyplot as plt
%matplotlib inline

plt.axes([0.2, 0.1, 0.5, 0.8])  # leave room on the right for the legend
plt.plot(year, hares, year, lynxes, year, carrots)
plt.legend(('Hare', 'Lynx', 'Carrot'), loc=(1.05, 0.5))  # legend placed outside the axes

Plotting the data makes a clear (and reasonable) correlation between prey and predator evident. How can it be quantified? Is it statistically significant? What about the correlation between carrots and hares? Is it evident? Is it significant?
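
One possible way to approach these questions is sketched below: the Pearson coefficient as a measure of linear correlation, and a permutation test as a rough check of significance. The function names `pearson_r` and `permutation_pvalue`, the number of permutations and the synthetic input are assumptions made for illustration; they are not part of the original notebook. Note also that shuffling destroys the time ordering, so for strongly autocorrelated series the resulting p-value is likely to be optimistic.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length 1-D series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Fraction of shuffles of y that correlate with x at least as strongly as y itself."""
    rng = np.random.default_rng(seed)
    r_obs = abs(pearson_r(x, y))
    r_null = np.array([abs(pearson_r(x, rng.permutation(y))) for _ in range(n_perm)])
    # The +1 terms count the observed ordering as one of the permutations,
    # which keeps the estimate away from an impossible p = 0.
    return (np.sum(r_null >= r_obs) + 1) / (n_perm + 1)

# Synthetic stand-in data (NOT the hare-lynx-carrot measurements):
# y is x plus a small deterministic perturbation.
t = np.arange(21)
x = np.sin(2 * np.pi * t / 10)
y = x + 0.1 * np.cos(t)

print('r =', pearson_r(x, y))
print('p =', permutation_pvalue(x, y))
```

In the notebook, the same functions could be called as `pearson_r(hares, lynxes)` and `permutation_pvalue(hares, lynxes)`.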

Finding correlations in data is the main goal of data science, though that is not the end of the story: as this precious [site](http://tylervigen.com/spurious-correlations) demonstrates, **correlation is not causation**.
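
A small simulation can make the point concrete. The sketch below generates pairs of series that are independent by construction and compares the typical size of their correlation for white noise and for random walks (cumulative sums of noise). The setup, seed, series length and helper name `mean_abs_corr` are illustrative assumptions, not something taken from the linked site.

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials, n_steps = 500, 100

def mean_abs_corr(make_series):
    """Average |Pearson r| between pairs of independently generated series."""
    rs = [abs(np.corrcoef(make_series(), make_series())[0, 1])
          for _ in range(n_trials)]
    return np.mean(rs)

def white_noise():
    # Independent samples: no trend, no memory
    return rng.normal(size=n_steps)

def random_walk():
    # Cumulative sum of noise: drifts over time, like many real-world series
    return np.cumsum(rng.normal(size=n_steps))

noise_corr = mean_abs_corr(white_noise)
walk_corr = mean_abs_corr(random_walk)

print(f'white noise : mean |r| = {noise_corr:.2f}')
print(f'random walks: mean |r| = {walk_corr:.2f}')
```

Series that drift over time tend to show sizeable correlations even when nothing connects them, which is one reason a large coefficient alone says little about cause and effect.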


*Exercise*: write an algorithm that determines and quantifies the correlation between two time series. Use the hare-lynx-carrot dataset as an example.

In [None]:
# Mean and (population) standard deviation of each series
hares_mean = np.mean(hares)
lynxes_mean = np.mean(lynxes)
carrots_mean = np.mean(carrots)
hares_std = np.std(hares)
lynxes_std = np.std(lynxes)
carrots_std = np.std(carrots)

n = len(hares)               # number of samples in each series
lags = np.arange(-n + 1, n)  # lag axis matching mode='full'

# Dividing by n * std_x * std_y scales the result to [-1, 1];
# the value at lag 0 is then the Pearson correlation coefficient.

# Hares - Lynxes
crosscorr_hares_lynxes = np.correlate(hares - hares_mean, lynxes - lynxes_mean, mode='full')
crosscorr_hares_lynxes = crosscorr_hares_lynxes / (n * hares_std * lynxes_std)

plt.plot(lags, crosscorr_hares_lynxes)
plt.xlabel('Lag')
plt.ylabel('Scaled cross-correlation')
plt.title('Hares - Lynxes')
plt.show()

# Largest positive correlation over all lags
corr_quantificator_hl = np.max(crosscorr_hares_lynxes)
print('Hares - Lynxes:', corr_quantificator_hl)


# Hares - Carrots
crosscorr_hares_carrots = np.correlate(hares - hares_mean, carrots - carrots_mean, mode='full')
crosscorr_hares_carrots = crosscorr_hares_carrots / (n * hares_std * carrots_std)

plt.plot(lags, crosscorr_hares_carrots)
plt.xlabel('Lag')
plt.ylabel('Scaled cross-correlation')
plt.title('Hares - Carrots')
plt.show()

corr_quantificator_hc = np.max(crosscorr_hares_carrots)
print('Hares - Carrots:', corr_quantificator_hc)


# Lynxes - Carrots
crosscorr_lynxes_carrots = np.correlate(lynxes - lynxes_mean, carrots - carrots_mean, mode='full')
crosscorr_lynxes_carrots = crosscorr_lynxes_carrots / (n * lynxes_std * carrots_std)

plt.plot(lags, crosscorr_lynxes_carrots)
plt.xlabel('Lag')
plt.ylabel('Scaled cross-correlation')
plt.title('Lynxes - Carrots')
plt.show()

corr_quantificator_lc = np.max(crosscorr_lynxes_carrots)
print('Lynxes - Carrots:', corr_quantificator_lc)


# Hares - Hares (autocorrelation); its value at lag 0 is 1 by construction
autocorr_hares = np.correlate(hares - hares_mean, hares - hares_mean, mode='full')
autocorr_hares = autocorr_hares / (n * hares_std * hares_std)

plt.plot(lags, autocorr_hares)
plt.xlabel('Lag')
plt.ylabel('Scaled auto-correlation')
plt.title('Hares - Hares')
plt.show()

corr_quantificator_hh = np.max(autocorr_hares)
print('Hares - Hares:', corr_quantificator_hh)
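
The maximum of the cross-correlation says how strongly two series are related, but not at which lag. As a possible extension, the sketch below also returns the lag of the peak. The helper `peak_lag` is hypothetical, and it deliberately picks the largest value in magnitude so that a strong anticorrelation is not missed (a choice that differs from the `np.max` used above). The input is synthetic noise with a known shift, not the population data.

```python
import numpy as np

def peak_lag(x, y):
    """Return (lag, value) where the normalized cross-correlation of x and y peaks in magnitude."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Normalized so that the value at lag 0 equals the Pearson coefficient
    cc = np.correlate(x - x.mean(), y - y.mean(), mode='full') / (n * x.std() * y.std())
    lags = np.arange(-n + 1, n)
    i = np.argmax(np.abs(cc))
    return lags[i], cc[i]

# Synthetic check (not the real dataset): y is x delayed by 3 samples
rng = np.random.default_rng(0)
z = rng.normal(size=203)
x = z[3:]    # x[n] = z[n + 3]
y = z[:-3]   # y[n] = z[n] = x[n - 3]

lag, value = peak_lag(x, y)
# With np.correlate(x, y), a negative peak lag means that y trails x
print('peak lag:', lag, 'value:', value)
```

Applied to the notebook data, `peak_lag(hares, lynxes)` would give the shift, in years, at which the two populations line up best.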
