# Modeling and Prediction

The role of machine learning is to discover patterns and relationships in data and to put them to use. The methods range from simple to extremely complicated, but the end goal is simple: take what is known and use it to predict what is not.

### Finding relationships between input and target (label) data

Consider trying to predict MPG (miles per gallon) for vehicles. The dataset could include model year, vehicle weight, horsepower, number of cylinders, and so on, along with each vehicle's MPG rating.

Input features are typically referred to using the symbol X, with subscripts differentiating the inputs when multiple input features exist. For example, X_1 refers to manufacturer region, X_2 to model year, and so on.

The target variable is typically referred to as Y. A simple formula relating the two is

Y = f(X) + E

f is the unknown function that relates the input variables to the target, Y. It is commonly referred to as the signal.

E is the error, commonly called noise.
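As a rough illustration of the signal-plus-noise decomposition, the sketch below simulates MPG data with NumPy. The linear form of `f`, the weight range, and the noise level are all made-up assumptions for the example, not properties of any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(weight):
    # Hypothetical "true" signal: MPG falls linearly as weight rises
    return 45.0 - 0.006 * weight

weight = rng.uniform(1500, 5000, size=200)  # input feature X (assumed range, in pounds)
noise = rng.normal(0.0, 2.0, size=200)      # E: random noise with an assumed spread
mpg = f(weight) + noise                     # Y = f(X) + E

# Because f is known here, subtracting it from Y recovers the noise
residual = mpg - f(weight)
print("mean of residual:", residual.mean())
print("std of residual: ", residual.std())
```

In a real problem f is unknown, so the residual can only be estimated after fitting a model.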

##### "The challenge of machine learning is to use data to determine what the true signal is, while ignoring noise"

If you knew f in the automobile example, you could compute the expected MPG rating of any car. The measured ratings would still differ from f(X) because of numerous sources of noise, E, including:

1. Imperfect measurement of each vehicle's MPG rating, caused by small inaccuracies in the measuring devices (measurement noise)
2. Variations in the manufacturing process, causing each car in the fleet to have a slightly different MPG measurement (manufacturing process noise)
3. Noise in the measurement of the input features, such as weight and horsepower
4. Lack of access to the broader set of features that would exactly determine MPG

Assuming a good estimate of f, machine learning has two goals: prediction and inference.

#### Prediction

Given a well-fitted model, you can generate predictions of the target (Y) from new information (X_new), producing estimates for new data as needed.

Examples of ML prediction cases:

- deciphering handwritten digits or voice recordings
- predicting the stock market
- forecasting
- predicting which users are most likely to click, convert, or buy
- predicting which users will need product support and which are likely to unsubscribe
- determining which transactions are fraudulent
- making recommendations

#### Inference

Machine learning models can also be used to better understand the relationships between the input features and the output target, answering questions such as:

- Which input features are most strongly related to the target variable?
- Are those relationships positive or negative?
- Is f a simple relationship, or is it a function that's more nuanced and nonlinear?
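One simple way to start answering the first two questions is to look at the correlation between each input and the target. The sketch below uses synthetic data; the feature names, coefficients, and the `unrelated` column are assumptions chosen for illustration. Pearson correlation only captures linear relationships, so it says little about the third question.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic, standardized inputs (all values are made up for illustration)
weight = rng.normal(size=n)
model_year = rng.normal(size=n)
unrelated = rng.normal(size=n)

# Assumed relationship: heavier cars get lower MPG, newer cars slightly higher
mpg = -3.0 * weight + 1.0 * model_year + rng.normal(size=n)

features = {"weight": weight, "model_year": model_year, "unrelated": unrelated}

# Pearson correlation of each input with the target
correlations = {name: np.corrcoef(values, mpg)[0, 1] for name, values in features.items()}

# Rank by strength (absolute value); the sign gives the direction
for name, r in sorted(correlations.items(), key=lambda item: -abs(item[1])):
    print(f"{name:>10}: {r:+.2f}")

strongest = max(correlations, key=lambda name: abs(correlations[name]))
```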

## Models

#### Parametric vs nonparametric
Parametric models assume that f takes a specific functional form, whereas nonparametric models don't make such strict assumptions. Parametric approaches tend to be simple and interpretable, but are often less accurate.

Nonparametric approaches are usually less interpretable but more accurate across a broad range of problems.

### Parametric Methods

Linear regression is a parametric model. It assumes f is a linear combination of the numerical values of the inputs. With two inputs:

f(X) = B_0 + B_1 * X_1 + B_2 * X_2
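To make the formula concrete, here is a minimal sketch that estimates B_0, B_1, and B_2 by ordinary least squares with `np.linalg.lstsq` and then predicts the target for one new sample. The "true" coefficients, input ranges, and noise level are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

# Two synthetic input features (ranges are arbitrary assumptions)
X1 = rng.uniform(0, 10, n)
X2 = rng.uniform(0, 5, n)

# Assumed true coefficients [B_0, B_1, B_2], plus a little noise
true_B = np.array([2.0, 1.5, -0.5])
y = true_B[0] + true_B[1] * X1 + true_B[2] * X2 + rng.normal(0, 0.1, n)

# Design matrix: a column of ones lets the fit estimate the intercept B_0
A = np.column_stack([np.ones(n), X1, X2])
B_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print("estimated [B_0, B_1, B_2]:", B_hat)

# Prediction for a new sample with X_1 = 4 and X_2 = 2
x_new = np.array([1.0, 4.0, 2.0])
y_pred = x_new @ B_hat
print("prediction for x_new:", y_pred)
```

Because the data really were generated from a linear function, the estimates land close to the assumed coefficients. That would not hold if the true f had a different shape.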

Other commonly used parametric models include:

- logistic regression
- polynomial regression
- linear discriminant analysis
- quadratic discriminant analysis
- (parametric) mixture models
- naive Bayes

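#### Drawbacks

The biggest drawback of parametric methods is the strong assumption about the true form of f. If the real relationship does not match the assumed form, the model stays systematically off no matter how much data it is given.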

### Nonparametric Methods

In nonparametric methods, f doesn't take a simple fixed functional form. The form and complexity of f adapt to the complexity of the data. A classification tree is one example.

Other examples:

- k-nearest neighbors
- splines
- basis expansion methods
- kernel smoothing
- generalized additive models
- neural nets
- bagging
- boosting
- random forests
- support vector machines
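To illustrate how a nonparametric method adapts to the data where a parametric one cannot, the sketch below fits a straight line and a hand-rolled k-nearest-neighbors regressor to noisy samples of a sine curve. The sine signal, noise level, and `k=5` are assumptions for the demo, and `knn_predict` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: a nonlinear signal plus noise (both assumed for the demo)
x_train = np.sort(rng.uniform(0, 2 * np.pi, 200))
y_train = np.sin(x_train) + rng.normal(0, 0.1, 200)

# Evaluate against the noiseless signal on a grid inside the training range
x_test = np.linspace(0.5, 2 * np.pi - 0.5, 50)
y_true = np.sin(x_test)

# Parametric: assume f is a straight line (np.polyfit returns slope first)
slope, intercept = np.polyfit(x_train, y_train, deg=1)
linear_pred = slope * x_test + intercept

def knn_predict(x_query, k=5):
    """Average the targets of the k training points closest to each query."""
    dists = np.abs(x_train[None, :] - x_query[:, None])  # shape (n_query, n_train)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# Nonparametric: no assumed shape, predictions follow the local data
knn_pred = knn_predict(x_test)

linear_mse = np.mean((linear_pred - y_true) ** 2)
knn_mse = np.mean((knn_pred - y_true) ** 2)
print("linear MSE:", linear_mse)
print("k-NN MSE:  ", knn_mse)
```

The straight line cannot follow the curve, so its error stays high, while k-NN tracks it closely. The price is interpretability: the k-NN fit has no coefficients to read.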


### Supervised vs unsupervised

Supervised problems are those in which you have access to the target variable for a set of training data. Unsupervised problems are those in which there is no identified target variable.

Unsupervised problems fall into two main classes:

#### Clustering

Use the input features to discover natural groupings in the data (k-means and so on).
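A bare-bones version of k-means can be sketched in a few lines of NumPy. The two synthetic blobs, the fixed iteration count, and the `kmeans` helper below are assumptions for illustration; a production implementation would add smarter initialization and a convergence check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs; no labels are given to the algorithm
blob_a = rng.normal([0, 0], 0.5, size=(50, 2))
blob_b = rng.normal([5, 5], 0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means: alternate between assigning points and moving centroids."""
    init_rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[init_rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if it has none)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

labels, centroids = kmeans(X, k=2)
print("cluster sizes:", np.bincount(labels))
print("centroids:")
print(centroids)
```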

#### Dimensionality reduction
Transform the input features into a small number of coordinates that capture most of the variability of the data (principal component analysis (PCA), multidimensional scaling, manifold learning).
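As a sketch of the idea behind PCA, the example below builds five observed features from two hidden factors and then recovers a two-coordinate representation using the singular value decomposition. The mixing matrix and noise level are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hidden factors drive five observed features (mixing weights are assumed)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, -1.0, 2.0, 0.0],
                   [0.0, 1.0, 1.0, -0.5, 1.5]])
X = latent @ mixing + rng.normal(0, 0.05, size=(200, 5))

# PCA via SVD: center the data, then decompose it
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of total variance captured by each principal component
explained = S**2 / np.sum(S**2)
print("explained variance ratio:", np.round(explained, 3))

# Project onto the first two components: five features become two coordinates
X_reduced = X_centered @ Vt[:2].T
print("reduced shape:", X_reduced.shape)
```

Here two components capture almost all of the variability because the data were constructed that way. Real data rarely fall off this sharply.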

### Classification

Classification is about putting things into buckets, so to speak: the target is a category rather than a number.



In [11]:
# Example code: logistic regression is used as the classifier
from sklearn.linear_model import LogisticRegression as Model
import numpy as np

In [28]:
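def train(features, target):
    """Fit a classifier on features of shape (n_samples, n_features)."""
    # Print the inputs so their shapes can be checked by eye
    print(features)
    print(target)
    model = Model()
    model.fit(features, target)
    return model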


In [29]:
def predict(model, new_features):
    """Return predicted labels for new_features of shape (n_samples, n_features)."""
    preds = model.predict(new_features)
    return preds

In [34]:
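# Each row is one sample and each column is one feature, so 3 samples need shape (3, 2).
# The original (2, 3) layout raised:
# ValueError: Found input variables with inconsistent numbers of samples: [2, 3]
titanic_feats = np.array([[3, 22], [1, 38], [3, 25]])
titanic_target = np.array([0, 1, 0])
# A single new sample must also be 2-D: one row, two feature columns
titanic_test = np.array([[3, 23]])

In [35]:
model = train(titanic_feats, titanic_target)
predictions = predict(model, titanic_test)

[[ 3 22]
 [ 1 38]
 [ 3 25]]
[0 1 0]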