# Machine Learning Notes

## 1. Introduction to Machine Learning
### 1.1 Intro to Machine Learning

**Model**: a specification of a relationship that exists between different variables.

**Machine Learning**: we use it to refer to creating and using models that are learned from data, which in other contexts might be called predictive modeling or data mining.

**Features**: whatever inputs we provide to our model. We choose features based on experience and domain expertise.

**Supervised vs. Unsupervised Learning**

**Supervised learning**: there is a set of data labelled with the correct answers to learn from. In supervised learning, we relate a response to predictors, either to predict the response for new observations (prediction) or to understand the relation between the response and the predictors (inference).

**Unsupervised learning**: there are no such labels. There is no associated response for each observation, so there is nothing to predict and consequently no regression model can be fitted. The analysis that can be done in this case is to find relations between the variables or between the observations, for example by clustering.

**Overfitting**: producing a model that performs well on the training set but does a poor job on new data. Low bias and high variance lead to overfitting.

**Underfitting**: producing a model that does a poor job on both the training data and new data. High bias and low variance correspond to underfitting.

**Bias and Variance**

**Bias**: the error that comes from the model's own assumptions. A high-bias model makes similar mistakes no matter which training set it is fitted on. To fix high bias, one can add features.

**Variance**: the difference in predictions that arises from using different training sets. To improve a model with high variance, one can remove features or obtain more training data.
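The link between bias, variance, and under- or over-fitting can be illustrated with a small experiment. The sketch below assumes NumPy is available and uses made-up data (a straight line plus noise); the helper name mean_squared_error and all numeric settings are assumptions chosen for illustration, not part of the original notes. It fits polynomials of degree 0, 1 and 9 to ten training points and reports the error on the training points and on ten new points.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical data: a straight line plus Gaussian noise.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = 2.0 * x_train + 1.0 + rng.normal(0.0, 0.3, size=x_train.size)
x_new = np.linspace(-0.9, 0.9, 10)
y_new = 2.0 * x_new + 1.0 + rng.normal(0.0, 0.3, size=x_new.size)


def mean_squared_error(coefficients, x, y):
    """Average squared difference between the fitted polynomial and the data."""
    return float(np.mean((np.polyval(coefficients, x) - y) ** 2))


train_errors, new_errors = {}, {}
for degree in (0, 1, 9):
    # Least-squares polynomial fit of the given degree on the training points.
    coefficients = np.polyfit(x_train, y_train, degree)
    train_errors[degree] = mean_squared_error(coefficients, x_train, y_train)
    new_errors[degree] = mean_squared_error(coefficients, x_new, y_new)
    print(degree, train_errors[degree], new_errors[degree])
```

The degree 0 fit ignores the trend in the data (high bias). The degree 9 fit passes through all ten training points, so its training error is close to zero, but that says little about how it behaves on new points; how much worse it does there depends on the noise that was drawn.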

### 1.2. How to choose the right model?

How do we choose the model that will perform best on new data? If we use the test set to pick among models, it no longer gives an honest estimate of performance. In this situation we should split the data into three parts: a training set to build the models, a validation set for choosing among trained models, and a test set for judging the final model.

Pretty much always, we extract features from our data that fall into one of these three categories, and the type of feature constrains the type of model we can use:

1. yes or no, which we can encode as 0 or 1 (suited to the Naive Bayes classifier);
2. numerical features (required by regression models);
3. a choice from a discrete set of options (decision trees can deal with numeric or categorical data).
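The three-way split described above can be sketched with the standard library alone. The function name train_validation_test_split, the default fractions, and the seed are assumptions made for this illustration; scikit-learn offers its own splitting utilities.

```python
import random


def train_validation_test_split(rows, validation_fraction=0.2,
                                test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and cut it into train, validation and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


rows = list(range(100))  # stand-in for 100 observations
train, validation, test = train_validation_test_split(rows)
print(len(train), len(validation), len(test))  # 60 20 20
```

Models are fitted on train, compared on validation, and only the chosen model is scored once on test.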



## 2. SciKit Learn

### 2.1 Starting with SciKit Learn
Loading a dataset:



In [None]:
from sklearn.datasets import load_iris

iris = load_iris()
print(type(iris))
print(iris.data)
print(iris.feature_names)
print(iris.target)  # target is what we are going to predict
print(iris.target_names)

print(type(iris.data))  # checking the type of the features and the response
print(type(iris.target))
print(iris.data.shape)
print(iris.target.shape)

x = iris.data  # storing the feature matrix and the response vector
y = iris.target

### 2.2 K nearest neighbor 

**Source**: Data Science from Scratch & http://scikit-learn.org/stable/modules/neighbors.html

#### Introduction
* One of the simplest predictive models: it makes no mathematical assumptions and does not require any sort of heavy machinery. The only requirements are i) some notion of distance, and ii) an assumption that points that are close to one another are similar.
* The K nearest neighbor predictive model does not help us understand the drivers of whatever phenomenon we are looking at.
* **Curse of dimensionality**: points in high-dimensional spaces tend not to be close to one another at all.

* It's great for many applications, with personalization tasks being among the most common. To make a personalized offer to one customer, you might employ KNN to find similar customers and base your offer on their purchase behaviors. KNN has also been applied to medical diagnosis and credit scoring.
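The curse of dimensionality mentioned in the list above can be seen in a small experiment. This sketch uses only the standard library; the helper names, the number of pairs, and the chosen dimensions are assumptions for illustration. It draws random pairs of points in the unit cube and compares the smallest distance found with the average distance.

```python
import math
import random


def random_point(dim, rng):
    """A point with each coordinate drawn uniformly from [0, 1)."""
    return [rng.random() for _ in range(dim)]


def distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def min_to_mean_ratio(dim, num_pairs, rng):
    """Smallest pairwise distance divided by the average pairwise distance."""
    distances = [distance(random_point(dim, rng), random_point(dim, rng))
                 for _ in range(num_pairs)]
    return min(distances) / (sum(distances) / num_pairs)


rng = random.Random(0)  # fixed seed so the run is repeatable
ratios = {dim: min_to_mean_ratio(dim, 1000, rng) for dim in (1, 10, 100)}
for dim, ratio in ratios.items():
    print(dim, round(ratio, 3))
```

In one dimension the closest pair is far closer than the average pair, so "nearest" is meaningful. As the dimension grows, the ratio moves toward 1: the closest points are hardly closer than typical points.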

#### 2.2.1 What is k-Nearest Neighbors?
The model for kNN is the entire training dataset. When a prediction is required for an unseen data instance, the kNN algorithm will search through the training dataset for the k most similar instances. The prediction attribute of the most similar instances is summarized and returned as the prediction for the unseen instance.

The similarity measure is dependent on the type of data. For real-valued data, the Euclidean distance can be used. For other types of data, such as categorical or binary data, the Hamming distance can be used.

In the case of regression problems, the average of the predicted attribute may be returned. In the case of classification, the most prevalent class may be returned.
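The procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how scikit-learn implements it: the function names and the toy data are made up, the distance is Euclidean, and ties in the vote are broken by whichever label was encountered first.

```python
import math
from collections import Counter


def euclidean_distance(p, q):
    """Straight-line distance between two real-valued points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_nearest_labels(training, query, k):
    """Labels of the k training points closest to query.

    training is a list of (point, label) pairs: the 'model' is the data itself.
    """
    by_distance = sorted(training,
                         key=lambda pair: euclidean_distance(pair[0], query))
    return [label for _, label in by_distance[:k]]


def knn_classify(training, query, k):
    """Most prevalent class among the k nearest neighbors."""
    return Counter(k_nearest_labels(training, query, k)).most_common(1)[0][0]


def knn_regress(training, query, k):
    """Average of the numeric labels of the k nearest neighbors."""
    values = k_nearest_labels(training, query, k)
    return sum(values) / len(values)


# Toy data, invented for this example.
labelled = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
            ((5.0, 5.0), "b"), ((5.2, 4.9), "b")]
numeric = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 2.0), ((3.0,), 3.0)]

predicted_class = knn_classify(labelled, (1.1, 1.0), k=3)
predicted_value = knn_regress(numeric, (1.2,), k=2)
print(predicted_class)  # a
print(predicted_value)  # 1.5
```

Sorting the whole training set for every query is the brute-force approach; it is simple but becomes slow as the dataset grows.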

#### 2.2.2 Nearest Neighbor Algorithms
1. **Brute Force**: computes the distances between all pairs of points in the dataset.

2. **K-D Tree**: the basic idea behind this more efficient algorithm is that if point A is very distant from point B, and point B is very close to point C, then we know that points A and C are very distant, without having to explicitly calculate their distance. (https://www.youtube.com/watch?v=TLxWtXEbtFE)

3. **Ball Tree**: To address the inefficiencies of KD trees in higher dimensions, the ball tree data structure was developed. Where KD trees partition data along Cartesian axes, ball trees partition data in a series of nesting hyper-spheres. This makes tree construction more costly than that of the KD tree, but results in a data structure which can be very efficient on highly structured data, even in very high dimensions. Because of the spherical geometry of the ball tree nodes, it can out-perform a KD tree in high dimensions, though the actual performance is highly dependent on the structure of the training data.

The choice of neighbors search algorithm is controlled through the keyword 'algorithm', which must be one of ['auto', 'ball_tree', 'kd_tree', 'brute']. With 'auto', scikit-learn attempts to pick the most appropriate algorithm from the training data.
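As a rough check that the three search algorithms only differ in speed and not in the answer, the sketch below runs the same query with each value of the algorithm keyword. It assumes NumPy and scikit-learn are installed; the random data, the sizes, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)   # fixed seed so the run is repeatable
points = rng.rand(200, 3)        # 200 made-up training points in 3 dimensions
queries = rng.rand(5, 3)         # 5 made-up query points

results = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    searcher = NearestNeighbors(n_neighbors=3, algorithm=algorithm).fit(points)
    # kneighbors returns (distances, indices) of the 3 nearest training points
    distances, indices = searcher.kneighbors(queries)
    results[algorithm] = distances

same_answer = all(np.allclose(results["brute"], results[name])
                  for name in ("kd_tree", "ball_tree"))
print(same_answer)
```

The neighbor distances agree up to floating-point rounding; which algorithm is fastest depends on the number of points, the dimension, and the structure of the data.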

#### 2.2.3 Scikit Nearest Neighbors
**Nearest Neighbors Classification**

Neighbors-based classification does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.

scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier, which uses the k nearest neighbors of each query point, and RadiusNeighborsClassifier, which uses the neighbors within a fixed radius of each query point. When the data is not uniformly sampled, the latter can be a better choice. For high-dimensional parameter spaces, the radius-based method becomes less effective due to the so-called "curse of dimensionality".

By default, the classifiers use uniform weights, so every neighbor's vote counts equally. To give closer neighbors more influence on the fit, one can change weights = 'uniform' to weights = 'distance', or pass a user-defined function.

Current status: what is the difference between nearest neighbor and the classifiers? I have opened some tabs to watch videos and read some tutorials.

An example:

In [None]:
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
x = iris.data    # feature matrix
y = iris.target  # response vector
print(x.shape)
print(y.shape)

knn = KNeighborsClassifier(n_neighbors=5)
# to see all the parameter values: print(knn.get_params())
knn.fit(x, y)  # the model must be fitted before it can predict

# predict expects a 2D array: one row per observation
print(knn.predict([[3, 5, 4, 2]]))


In [None]:
# Several observations can be predicted at once:
X_New = [[3, 5, 4, 2], [5, 4, 3, 2]]
print(knn.predict(X_New))
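The effect of the weights option described in this section can be seen on a tiny made-up dataset. The data and variable names below are assumptions chosen so that the two settings disagree; the sketch assumes scikit-learn is installed.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 1D data, invented for this example: one point of class 0 near the
# query and two points of class 1 further away.
toy_X = [[0.0], [1.0], [1.1]]
toy_y = [0, 1, 1]
query = [[0.1]]

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights="uniform")
distance_knn = KNeighborsClassifier(n_neighbors=3, weights="distance")
uniform_knn.fit(toy_X, toy_y)
distance_knn.fit(toy_X, toy_y)

# Uniform: class 1 wins the vote 2 to 1.
# Distance: the class 0 point is much closer, so its vote outweighs the others.
uniform_prediction = uniform_knn.predict(query)[0]
distance_prediction = distance_knn.predict(query)[0]
print(uniform_prediction, distance_prediction)  # 1 0
```

With weights = 'distance' each neighbor's vote is weighted by the inverse of its distance to the query point, which is why the single nearby point wins here.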

#### 2.2.4 Nearest Neighbors Regression
1. Neighbors-based regression can be used where the data labels are continuous variables. The label assigned to a query point is computed based on the mean of the labels of its nearest neighbors.

2. scikit-learn implements two neighbors regressors: KNeighborsRegressor and RadiusNeighborsRegressor.
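A small sketch of the two regressors listed above, assuming scikit-learn is installed. The toy data, the value of k, and the radius are assumptions chosen so that the results are easy to check by hand.

```python
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor

# Toy 1D data, invented for this example: the label equals the feature.
toy_X = [[0.0], [1.0], [2.0], [3.0]]
toy_y = [0.0, 1.0, 2.0, 3.0]
query = [[1.2]]

# k = 2: the nearest neighbors of 1.2 are 1.0 and 2.0, whose mean label is 1.5.
k_regressor = KNeighborsRegressor(n_neighbors=2).fit(toy_X, toy_y)
k_value = float(k_regressor.predict(query)[0])

# radius = 0.5: only the point at 1.0 lies within 0.5 of the query.
radius_regressor = RadiusNeighborsRegressor(radius=0.5).fit(toy_X, toy_y)
radius_value = float(radius_regressor.predict(query)[0])

print(k_value, radius_value)  # 1.5 1.0
```

The first regressor always averages a fixed number of neighbors, while the second averages however many training points fall inside the radius.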

#### 2.2.5 Comparing different Models
The video linked below is really useful as it uses scikit-learn tools for the whole process. There are other ways as well, for example for splitting the dataset. https://www.youtube.com/watch?v=0pP4EwWJgIU&index=5&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A 

##### Using logistic regression

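In [None]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

X, y = iris.data, iris.target  # feature matrix and response vector
print(X.shape)  # to check the dataset
print(y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
print(X_train.shape)  # to check the new training and test sets
print(y_train.shape)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))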

##### Using k nearest neighbor method

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
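The same comparison can be extended to several values of k. The sketch below assumes a recent scikit-learn; the range of k, the random_state value, and the variable names are assumptions for illustration. Because the held-out data is used here to choose k, it plays the role of a validation set, so the best score is an optimistic estimate and a separate test set would be needed to judge the final model.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# random_state fixes the shuffle so the split is the same on every run
X_fit, X_val, y_fit, y_val = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=4)

scores = {}
for k in range(1, 26):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_fit, y_fit)
    scores[k] = metrics.accuracy_score(y_val, model.predict(X_val))

# max returns the smallest k among those that tie for the highest accuracy
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

Very small k tends to follow the noise in the training data (high variance), while very large k smooths over real structure (high bias), so the accuracy usually peaks somewhere in between.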