Explain how the Maximal Margin Classifier works:
----
Wikipedia: In machine learning, a margin classifier is a classifier which is able to give an associated distance from the decision boundary for each example. For instance, if a linear classifier (e.g. perceptron or linear discriminant analysis) is used, the distance (typically euclidean distance, though others may be used) of an example from the separating hyperplane is the margin of that example.

The notion of margin is important in several machine learning classification algorithms, as it can be used to bound the generalization error of the classifier. These bounds are frequently shown using the VC dimension. Of particular prominence is the generalization error bound on boosting algorithms and support vector machines.

My own words: It makes the most sense to describe this in the context of the SVM. In the SVM, the goal is to separate two or more classes, and the method of classification is essentially finding a boundary (a line, or with a polynomial or radial kernel a curved region) such that each side is homogeneous, i.e. contains all, or at least predominantly, one class of data. Many boundaries can separate the same data; the MMC picks the separating hyperplane whose distance to the nearest training points of either class (the margin) is as large as possible. Only those nearest points, the support vectors, determine where the hyperplane sits.

Outside of the Support Vector Machine setting, here are lecture notes that discuss the margin idea alongside the Perceptron algorithm:
https://people.eecs.berkeley.edu/~jrs/189/lec/03.pdf
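
To make the margin idea concrete, here is a minimal sketch in plain Python. The toy points, the labels (+1 / -1), and the two candidate hyperplanes are made up for illustration; the code only measures the margin of a given hyperplane and does not search for the optimal one.

```python
import math

def signed_distance(w, b, x):
    """Signed Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points, labels):
    """Smallest label-adjusted distance over all points.

    The value is positive only if every point lies on its correct side.
    """
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

# Hypothetical toy data: two points per class, labels are -1 and +1.
points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
labels = [-1, -1, 1, 1]

# Two candidate hyperplanes; both separate the data, but with different margins.
margin_a = margin((1.0, 1.0), -6.0, points, labels)
margin_b = margin((1.0, 1.0), -5.0, points, labels)

print(round(margin_a, 3))  # 1.414
print(round(margin_b, 3))  # 0.707
```

Both hyperplanes classify every point correctly, but the first one sits midway between the closest points of the two classes, (2, 2) and (4, 4), and therefore has the larger margin. A maximal margin classifier would prefer it.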

What are the limitations of the Maximal Margin Classifier?
----


In what way does the Support Vector Classifier extend MMC?
----
Wikipedia: In geometry, the hyperplane separation theorem is a theorem about disjoint convex sets in n-dimensional Euclidean space. There are several rather similar versions. In one version of the theorem, if both these sets are closed and at least one of them is compact, then there is a hyperplane in between them and even two parallel hyperplanes in between them separated by a gap. In another version, if both disjoint convex sets are open, then there is a hyperplane in between them, but not necessarily any gap. An axis which is orthogonal to a separating hyperplane is a separating axis, because the orthogonal projections of the convex bodies onto the axis are disjoint.

The hyperplane separation theorem is due to Hermann Minkowski. The Hahn–Banach separation theorem generalizes the result to topological vector spaces.

A related result is the supporting hyperplane theorem.

In geometry, a maximum-margin hyperplane is a hyperplane which separates two 'clouds' of points and is at equal distance from the two. The margin between the hyperplane and the clouds is maximal. See the article on Support Vector Machines for more details.

My own words: I essentially explained this relationship in question 1: the SVM builds its hyperplane on the maximal margin idea, and the margin determines which observations matter when fitting the boundary to the dataset. For example, suppose you have a really well-defined boundary between class "0" and class "1", but class "1" has a factor of 10 more observations than class "0". Intuitively, in regression, the line of fit would be skewed in the direction of class "1", because every observation pulls on the fit. A maximal margin classifier instead depends only on the observations nearest the boundary (the support vectors), so the extra class "1" observations far from the boundary do not move it. The support vector classifier extends the MMC by softening the margin: a tuning parameter controls how many observations may fall inside the margin or on the wrong side of the hyperplane, which lets it handle data that are noisy or not perfectly separable. This idea extends to polynomial and radial kernels. To really put it in plain words, the margin says how far from the hyperplane the classifier should look.
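
One way to see "how far from the hyperplane to look" in numbers is the slack of the standard soft-margin formulation, $\xi_i = \max(0, 1 - y_i(w \cdot x_i + b))$. This formulation is not stated in the notes above, so treat it as an assumption; the hyperplane and the points below are hypothetical, and the sketch only evaluates slack for a fixed hyperplane rather than fitting one.

```python
def slack(w, b, x, y):
    """Soft-margin slack for one point with label y in {-1, +1}.

    0.0 means the point is on or outside its margin on the correct side;
    a value between 0 and 1 means it is inside the margin but still
    correctly classified; a value above 1 means it is misclassified.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

# Hypothetical hyperplane x1 = 3, with margin boundaries at x1 = 2 and x1 = 4.
w, b = (1.0, 0.0), -3.0

points = [(0.0, 0.0), (2.5, 1.0), (3.5, 0.0), (6.0, 2.0), (9.0, 1.0)]
labels = [-1, -1, -1, 1, 1]

slacks = [slack(w, b, x, y) for x, y in zip(points, labels)]
print(slacks)  # [0.0, 0.5, 1.5, 0.0, 0.0]
```

The points far from the boundary, such as (0, 0) and (9, 1), have zero slack and contribute nothing to the penalty, while the point inside the margin and the misclassified point do. Adding more far-away points to either class would leave these values unchanged.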

Explain the metrics Precision, Recall, F-score and Accuracy:
----
Precision: $\frac{tp}{tp + fp}$

Recall: $\frac{tp}{tp + fn}$

F-score: $2 \cdot \frac{precision \cdot recall}{precision + recall}$

Accuracy: $\frac{tp + tn}{tp + tn + fp + fn}$
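
The four formulas above can be computed directly from confusion-matrix counts. The sketch below is a plain-Python illustration; the function name, the choice to return 0.0 when a denominator is zero, and the example counts are assumptions made for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F-score, and accuracy from raw counts.

    Ratios with a zero denominator are reported as 0.0 (a convention
    chosen for this sketch, not the only possible one).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "accuracy": accuracy,
    }

# Hypothetical counts: 100 examples in total.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(name, round(value, 3))
```

With these counts, precision is 40/50, recall is 40/60, and accuracy is 70/100.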

Provide numerical examples when F1 is fairer than Accuracy:
----
Most obviously, F-score is best for imbalanced data. Not all data is cleanly split 50/50 between two classes, and when it isn't, F1 is fairer than accuracy. F-score isn't affected by the true negatives; it depends only on the true positives and on the misclassified examples (false positives and false negatives), so a classifier cannot score well simply by predicting the majority (negative) class.
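
A worked example with made-up counts: suppose there are 1000 examples, of which only 10 are positive, and the classifier produces $tp = 1$, $fn = 9$, $fp = 1$, $tn = 989$.

Accuracy: $\frac{1 + 989}{1000} = 0.99$

Precision: $\frac{1}{1 + 1} = 0.5$

Recall: $\frac{1}{1 + 9} = 0.1$

F-score: $2 \cdot \frac{0.5 \cdot 0.1}{0.5 + 0.1} \approx 0.167$

Accuracy looks excellent because the 989 true negatives dominate it, yet the classifier misses 9 of the 10 positives. The F-score of roughly 0.17 reflects that.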

Explain the principle of using TF-IDF for finding relevant words. Provide a numerical example.
----
My own words: TF-IDF is a nice little tool for finding relevant words, or more precisely, the words most relevant to each individual work in a corpus.

Step one: Calculate the term frequency of each token in each document. This can be done by tokenizing the corpus, splitting it up into "documents", counting how many times each token shows up in each document, and tabulating the counts. Term frequency is important for telling us which words the author favors.

Step two: Calculate the inverse document frequency. For every token $t$ in the corpus, take $N$, the total number of documents, divide it by $n_t$, the number of documents that contain $t$, and take the logarithm of the result: $\log\frac{N}{n_t}$.

Step three: Once you've calculated the term frequency and inverse document frequency for every token in the corpus, multiply the two to get the TF-IDF score of each token in each document.

Intuition: Incredibly common words like "I", "the", and "and" are effectively removed from scoring. Because they show up in every document, the usual logarithmic form of inverse document frequency gives them $\log(1) = 0$, so they have a score of 0 and are ignored. TF-IDF gives much higher weights to words that are distinctive to a few documents in the corpus, and down-weights words that show up everywhere, like the filler of everyday language.
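
As a numerical example, the three steps can be sketched in plain Python. The three tiny "documents" are made up, term frequency is taken as a raw count, and the logarithm is the natural log with no smoothing; real libraries often use different variants, so the exact numbers are specific to these assumptions.

```python
import math
from collections import Counter

# Hypothetical corpus of three short documents.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
    "d3": "the cat chased the dog",
}

# Step one: term frequency (raw counts) per document.
tokenized = {name: text.split() for name, text in docs.items()}
tf = {name: Counter(tokens) for name, tokens in tokenized.items()}

# Step two: inverse document frequency, log(N / n_t).
n_docs = len(docs)
vocab = {tok for tokens in tokenized.values() for tok in tokens}
df = {tok: sum(1 for tokens in tokenized.values() if tok in tokens) for tok in vocab}
idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}

# Step three: TF-IDF is the product of the two.
tfidf = {
    name: {tok: count * idf[tok] for tok, count in counts.items()}
    for name, counts in tf.items()
}

for tok, score in sorted(tfidf["d1"].items(), key=lambda kv: -kv[1]):
    print(tok, round(score, 3))
```

In document `d1`, "the" appears twice but occurs in all three documents, so its score is $2 \cdot \log(3/3) = 0$. "mat" appears once and only in `d1`, so its score is $1 \cdot \log(3/1) \approx 1.099$, the highest in that document. "cat" occurs in two of the three documents and scores $\log(3/2) \approx 0.405$.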

What are n-grams? Provide an example in which n-grams (n>1) is better than unigram:
----
My own words: N-grams are powerful because they retain the sequencing of words rather than treating each token independently. Bigrams and trigrams allow the scientist to peek at the sentence structure and nab occurrences like "not happy", "not sunny", etc., which effectively negate the sentiment. You could use those negations to more accurately describe the content and sentiment of the text.

Example:
"I have not had an awesome day, it's been crazy!"

With unigrams, the sentence looks overwhelmingly positive.
With bigrams, you begin to understand negations.
With trigrams, you start peeking at the underlying sentence structure.
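
The example sentence can be split into n-grams with a few lines of plain Python. The `ngrams` helper and the very simple tokenization (lowercasing, dropping the comma and exclamation mark, splitting on whitespace) are assumptions made for this sketch.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I have not had an awesome day, it's been crazy!"

# Naive tokenization, good enough for this one sentence.
tokens = sentence.lower().replace(",", "").replace("!", "").split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
fourgrams = ngrams(tokens, 4)

print(len(tokens), len(bigrams), len(trigrams))  # 10 9 8
print(("not", "had") in bigrams)                 # True
print(("not", "had", "an", "awesome") in fourgrams)  # True
```

As unigrams, "not" and "awesome" are unrelated tokens. The bigram `("not", "had")` shows that a negation is present, but in this particular sentence "not" and "awesome" are three tokens apart, so it takes a 4-gram window to hold both in a single feature.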

Describe a commercial application of sentiment analysis:
----

What is the role of cost and gamma in the radial SVM kernel in terms of flexibility and generalization (overfitting data)?
----

Provide a small example of using the decision tree for classification.
----