### Learning
* Supervised Learning: uses labeled inputs to train models and learn outputs
* Unsupervised Learning: uses unlabeled inputs to find patterns or structure in the data
* Reinforcement Learning: agent learning in an interactive environment based on rewards and penalties

### Features
* Qualitative Data: a finite number of categories or groups
	* Nominal Data: categorical data without order, e.g. genders or countries
	* Ordinal Data: categorical data with order, e.g. age groups or mood ratings

* Quantitative Data: numerically valued data (discrete or continuous), e.g. height (continuous) or number of cats owned (discrete)

* One-hot Encoding: represents each category as its own binary column, with '1' where the value matches that category and '0' everywhere else
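
As a minimal sketch of one-hot encoding, the snippet below encodes a nominal feature with NumPy. The `countries` values are made up for illustration and are not part of the dataset used later in this notebook.

```python
import numpy as np

# Hypothetical nominal feature (illustrative values only)
countries = np.array(['USA', 'India', 'Canada', 'India'])

# np.unique returns the sorted categories and, for each value, the index of its category
categories, indices = np.unique(countries, return_inverse=True)

# Row i of the identity matrix has a single 1 in column i,
# so indexing it with the category indices yields one-hot rows
one_hot = np.eye(len(categories), dtype=int)[indices]

print(categories)
print(one_hot)
```

Each row contains exactly one '1', in the column of the category that the value matches.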

### Output
* Supervised
	* Classification: predict discrete classes
		* Multiclass: output the exact class, e.g. cat or dog or horse, plant species, types of fruit
		* Binary: output Boolean, e.g. cat or NOT cat, positive or negative, spam or NOT spam
		
	* Regression: predict continuous values, e.g. a stock price or tomorrow's temperature

### Model
* Training: feed inputs through the model to get a prediction vector, calculate the loss between the predictions and the true values, and adjust the model to reduce that loss
* Validation: reality check during/after training to ensure the model can handle unseen data
* Testing: checks how well the final chosen model generalizes to data it has never seen

* Loss: the closer the prediction is to the true value, the smaller the loss
	* L1 Loss: loss = sum(|y_real - y_predicted|)
	* L2 Loss: loss = sum((y_real - y_predicted) ** 2)
	* Binary Cross-Entropy Loss: loss = -1/N * sum(y_real * log(y_predicted) + (1 - y_real) * log(1 - y_predicted))
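
The three losses above can be sketched directly in NumPy. The function names, the example vectors, and the `eps` clipping constant (used to avoid `log(0)`) are assumptions added for illustration.

```python
import numpy as np

def l1_loss(y_real, y_predicted):
    # Sum of absolute errors
    return np.sum(np.abs(y_real - y_predicted))

def l2_loss(y_real, y_predicted):
    # Sum of squared errors
    return np.sum((y_real - y_predicted) ** 2)

def binary_cross_entropy(y_real, y_predicted, eps=1e-12):
    # Clip predictions away from 0 and 1 so the logarithms stay finite
    # (eps is an illustrative choice, not part of the formula)
    p = np.clip(y_predicted, eps, 1 - eps)
    # np.mean applies the 1/N factor from the formula
    return -np.mean(y_real * np.log(p) + (1 - y_real) * np.log(1 - p))

# Made-up labels and two made-up prediction vectors
y_real = np.array([1.0, 0.0, 1.0, 0.0])
y_good = np.array([0.9, 0.1, 0.8, 0.2])   # close to the true values
y_bad = np.array([0.4, 0.6, 0.3, 0.7])    # far from the true values

for loss in (l1_loss, l2_loss, binary_cross_entropy):
    print(loss.__name__, loss(y_real, y_good), loss(y_real, y_bad))
```

For every loss, the predictions closer to the true values give the smaller number.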

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

In [11]:
cols = ['fLength', 'fWidth', 'fSize', 'fConc', 'fConc1', 'fAsym', 'fM3Long', 'fM3Trans', 'fAlpha', 'fDist', 'class']
df = pd.read_csv('data/magic04.data', names=cols)		# Create dataframe
df.head()

| | fLength | fWidth | fSize | fConc | fConc1 | fAsym | fM3Long | fM3Trans | fAlpha | fDist | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.7967 | 16.0021 | 2.6449 | 0.3918 | 0.1982 | 27.7004 | 22.011 | -8.2027 | 40.092 | 81.8828 | g |
| 1 | 31.6036 | 11.7235 | 2.5185 | 0.5303 | 0.3773 | 26.2722 | 23.8238 | -9.9574 | 6.3609 | 205.261 | g |
| 2 | 162.052 | 136.031 | 4.0612 | 0.0374 | 0.0187 | 116.741 | -64.858 | -45.216 | 76.96 | 256.788 | g |
| 3 | 23.8172 | 9.5728 | 2.3385 | 0.6147 | 0.3922 | 27.2107 | -6.4633 | -7.1513 | 10.449 | 116.737 | g |
| 4 | 75.1362 | 30.9205 | 3.1611 | 0.3168 | 0.1832 | -5.5277 | 28.5525 | 21.8393 | 4.648 | 356.462 | g |


In [12]:
df['class'] = (df['class'] == 'g').astype(int)		# Encode 'class' as 1 for gamma ('g') and 0 for hadron

In [13]:
df.head()

| | fLength | fWidth | fSize | fConc | fConc1 | fAsym | fM3Long | fM3Trans | fAlpha | fDist | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.7967 | 16.0021 | 2.6449 | 0.3918 | 0.1982 | 27.7004 | 22.011 | -8.2027 | 40.092 | 81.8828 | 1 |
| 1 | 31.6036 | 11.7235 | 2.5185 | 0.5303 | 0.3773 | 26.2722 | 23.8238 | -9.9574 | 6.3609 | 205.261 | 1 |
| 2 | 162.052 | 136.031 | 4.0612 | 0.0374 | 0.0187 | 116.741 | -64.858 | -45.216 | 76.96 | 256.788 | 1 |
| 3 | 23.8172 | 9.5728 | 2.3385 | 0.6147 | 0.3922 | 27.2107 | -6.4633 | -7.1513 | 10.449 | 116.737 | 1 |
| 4 | 75.1362 | 30.9205 | 3.1611 | 0.3168 | 0.1832 | -5.5277 | 28.5525 | 21.8393 | 4.648 | 356.462 | 1 |


In [None]:
for label in cols[:-1]:		# Loop over every feature column (skip 'class')
	# Overlay normalized histograms of this feature for gammas (class 1) and hadrons (class 0)
	plt.hist(df[df['class']==1][label], color='blue', label='gamma', alpha=0.7, density=True)
	plt.hist(df[df['class']==0][label], color='red', label='hadron', alpha=0.7, density=True)
	plt.title(label)
	plt.ylabel('Probability')
	plt.xlabel(label)
	plt.legend()
	plt.show()

# Train, validation, test datasets

In [29]:
train, valid, test = np.split(df.sample(frac=1), [int(0.6*len(df)), int(0.8*len(df))])
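
To make the split points concrete, here is a small sketch of the same `np.split` call applied to shuffled row indices instead of a dataframe. The row count `n_rows` and the fixed seed are assumptions chosen only so the example is small and repeatable.

```python
import numpy as np

n_rows = 10  # Hypothetical dataset size, for illustration

# Shuffle the row indices (plays the role of df.sample(frac=1))
shuffled = np.random.default_rng(0).permutation(n_rows)

# Cut at 60% and 80% of the length: three pieces of 60%, 20%, and 20%
train_idx, valid_idx, test_idx = np.split(
    shuffled, [int(0.6 * n_rows), int(0.8 * n_rows)]
)

print(len(train_idx), len(valid_idx), len(test_idx))
```

The two numbers passed to `np.split` are cut positions, not sizes: everything before the first cut goes to training, everything between the cuts to validation, and the remainder to testing.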

In [26]:
def scale_dataset(dataframe, oversample=False):
	X = dataframe[dataframe.columns[:-1]].values		# Get all features
	y = dataframe[dataframe.columns[-1]].values		# Get class column

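	scaler = StandardScaler()						# Init scaler
	X = scaler.fit_transform(X)						# Fit on this split, then standardize each feature to zero mean and unit variance

	if oversample:
		ros = RandomOverSampler()					# Resample the smaller class until both classes have the same count
		X, y = ros.fit_resample(X, y)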

	data = np.hstack((X, np.reshape(y, (-1, 1))))	# Stack, put them side by side

	return data, X, y
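
The idea behind random oversampling can be sketched in plain NumPy: rows of the smaller class are drawn again, with replacement, until every class is as large as the biggest one. This is an illustration of the concept only, not the `imblearn` implementation; the helper name `random_oversample`, the seed, and the toy data are assumptions.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are the same size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()

    # Start with every original row, then append extra draws for the smaller classes
    keep = [np.arange(len(y))]
    for cls, count in zip(classes, counts):
        if count < target:
            members = np.flatnonzero(y == cls)
            keep.append(rng.choice(members, size=target - count, replace=True))

    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Toy imbalanced data: four rows of class 1, two rows of class 0
X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([1, 1, 1, 1, 0, 0])

X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y), np.bincount(y_bal))
```

No new information is created: every added row is an exact copy of an existing minority-class row.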

In [34]:
train, valid, test = np.split(df.sample(frac=1), [int(0.6*len(df)), int(0.8*len(df))])
train, X_train, y_train = scale_dataset(train, oversample=True)			# Oversample so the model trains on balanced classes
valid, X_valid, y_valid = scale_dataset(valid, oversample=False)		# Keep the original class balance for evaluation
test, X_test, y_test = scale_dataset(test, oversample=False)			# Keep the original class balance for evaluation
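
Note that `scale_dataset` fits a fresh `StandardScaler` on every split, so the validation and test sets are standardized with their own statistics. A common alternative is to compute the mean and standard deviation on the training set only and reuse them for the other splits. The sketch below shows that pattern in plain NumPy; the helper names and the synthetic data are assumptions for illustration, not part of this notebook.

```python
import numpy as np

def fit_standardizer(X):
    # Per-feature mean and (population) standard deviation of the training data
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0   # Avoid dividing by zero for constant features
    return mean, std

def apply_standardizer(X, mean, std):
    # Reuse previously computed statistics on any split
    return (X - mean) / std

# Synthetic stand-ins for the real feature matrices
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(600, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

mean, std = fit_standardizer(X_train)                     # Statistics come from training data only
X_train_scaled = apply_standardizer(X_train, mean, std)
X_test_scaled = apply_standardizer(X_test, mean, std)     # Test data never influences the statistics

print(X_train_scaled.mean(axis=0), X_test_scaled.mean(axis=0))
```

With scikit-learn the same pattern is `scaler.fit(X_train)` followed by `scaler.transform(...)` on each split. The training features end up with mean 0 and variance 1 exactly, while the other splits are only approximately standardized.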