# Table of Contents
* [1. Overview of Project](#overview-project)
* [2. Overview of Machine Learning](#overview-ml)
* [3. Analyst Scored Domains Dataset](#scored-domains)
    * [3.1 Visualization](#visualization)
    * [3.2 Statistical Analysis](#statistical-analysis)
    * [3.3 Machine Learning Models](#ml-models)
* [4 Inference Scored Domains Dataset](#inferrence-scored-domains)
    * [4.1 Visualization](#visualization)
    * [4.2 Statistical Analysis](#statistical-analysis)
    * [4.3 Machine Learning Models](#ml-models)
* [5 Analysis](#analysis)
* [6 Conclusions](#conclusions)

--- 
# 1. Overview of Project <a class="anchor" id="overview-project"></a>

PPS currently uses the **Google Detection** API to find uses of particular pictures throughout the web. Once Google Detection has found where an image is used, domain data is gathered and stored in a MongoDB database. A rule-based scoring system then uses certain input features to determine whether the domain is worth pursuing for prosecution.

##  Current Rule Based Scoring
* **`essential.score.latest.score`**: the output of the scoring system (from 1 - 13, where 13 means the domain is valuable)
* **`essential.BellonaStatus`**: checked first. Values of `SKIP_` or `HOLD_` receive a score of zero because those domains are not valuable.
* **`essential.score.latest.components.whoisScoreComponents.registrantContactCountry`**: the USA gets a score of 3; some European countries, Singapore, Canada, and Australia get a score of 1; everything else gets 0.
* **`essential.score.latest.components.privateRegistrationStatus`**: if the domain is registered privately (that is, through a domain service), we can't easily determine who the owner is, which makes it harder to prosecute, so these domains get a score of 0. Every other domain keeps its whois score (which isn't zero, since those were already filtered out).
* **`essential.score.latest.components.alexaData.trafficData.ran`**: finally, a domain can get a boost from Alexa if the site is popular. A rank below 10,000 earns a bonus of 10 (so a USA site reaches 13), and the bonus steps down to 2 for a rank below 500,000, with intermediate bonuses for ranks in between.
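The rules above can be sketched as a small Python function. This is only an illustration, not the production scoring code: the function name, argument names, the `"ACTIVE"` status value, and the two-letter country codes are all assumptions. The qualifying European countries are not listed in this report, so they are left out, and only the two Alexa tiers stated above are included.

```python
# Hypothetical sketch of the rule-based score; names and encodings are assumptions.

# Countries scored 1. The qualifying European countries are not listed in this
# report, so only the three named ones appear here.
SCORE_ONE_COUNTRIES = {"SG", "CA", "AU"}

# (rank upper bound, bonus). Only the two tiers stated in the text are known;
# the intermediate tiers are not specified, so ranks in between fall through
# to the lowest stated bonus in this sketch.
ALEXA_BONUS_TIERS = [(10_000, 10), (500_000, 2)]


def rule_based_score(bellona_status, registrant_country, is_private, alexa_rank=None):
    """Return the rule-based score for one domain."""
    # Rule 1: skipped or held domains score zero.
    if bellona_status.startswith(("SKIP_", "HOLD_")):
        return 0

    # Rule 2: whois country component.
    if registrant_country == "US":
        whois_score = 3
    elif registrant_country in SCORE_ONE_COUNTRIES:
        whois_score = 1
    else:
        whois_score = 0

    # Rule 3: private registrations (and zero-scored countries) score zero.
    if is_private or whois_score == 0:
        return 0

    # Rule 4: add the Alexa bonus for the first tier the rank falls under.
    bonus = 0
    if alexa_rank is not None:
        for upper_bound, tier_bonus in ALEXA_BONUS_TIERS:
            if alexa_rank < upper_bound:
                bonus = tier_bonus
                break
    return whois_score + bonus


print(rule_based_score("ACTIVE", "US", False, alexa_rank=5_000))  # 3 + 10 = 13
```

Keeping the tiers in a list rather than hard-coding them means the unspecified intermediate bonuses could be filled in later without touching the function.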

##  Idea
> Implement a supervised machine learning classification algorithm to better predict which domains are worth pursuing for copyright infringement.

---
# 2. Overview of Machine Learning <a class="anchor" id="overview-ml"></a>
Here are a few terms that are used frequently throughout the rest of this report; a basic understanding of them will help.

## Input Features
* The input variables that the model uses to make predictions. Ex: **`domainAge`**

## Class Labels/Output/Response
* This refers to the class or label that is associated with a set of input features. In our case it is: **`analystResult`**.

## Training example
* A training example is made up of input variables and an output label. A machine learning model will try to find a relationship (map) between the input features and the output label.

## Training
* The process of running a machine learning model on training data, during which the model's parameters will "learn" the relationship that maps the inputs to the outputs.

## Predicting
* Once a model has been trained, it can take a new example (an **unseen** set of input features) and predict which class it belongs to. If we know the class ahead of time, we can measure prediction accuracy as well as other performance metrics.

## Train/Test split - Preventing overfitting 
* Because most machine learning models learn their parameters from the data they are trained on, I want to make every effort to prevent overfitting (illustrated below).

![overfitting.jpg](attachment:overfitting.jpg)

* Overfitting occurs when the model begins to tune its parameters to noise in the training data. This makes accuracy on the training data (which the model has seen before) very high, but often means that accuracy plummets when new, unseen test data is presented.
* To prevent this, I will often split my data from the start into a training set and a testing set.
* For example, say I have 1000 labeled examples. I may split them into a training set of 800 examples and a test set of 200 examples. I would train my model on the set of 800, and then make predictions on the 200 test examples. Because the model has not seen the 200 test examples, they are a good indicator of how it may perform on new data.
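The 800/200 split described above can be sketched with the standard library alone. The helper name `split_train_test`, the seed, and the placeholder examples are assumptions made for illustration; in practice a library routine such as scikit-learn's `train_test_split` does the same job.

```python
import random


def split_train_test(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 1000 placeholder examples: (input features, label) pairs with made-up values.
examples = [({"domainAge": i}, i % 2) for i in range(1000)]
train, test = split_train_test(examples)
print(len(train), len(test))  # 800 200
```

Shuffling before splitting matters when the data is stored in some meaningful order (for example by date), since otherwise the test set would not resemble the training set.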

## Accuracy vs. Precision vs. Recall vs. F1-Score

Several metrics will be used to score model prediction performance in this report. To motivate the need for these different metrics, imagine a model whose only goal is to maximize its accuracy. Now consider the following situation: the model is trying to predict which patients may have cancer. Overall, 99% of the population is cancer free, and only 1% has the disease. If the model only tries to maximize accuracy, it may simply predict that **no one** has cancer, and it would still achieve 99% accuracy. Therefore we need to consider a confusion matrix in order to better evaluate our results and prevent the situation above from occurring.

![confusion%20matrix.png](attachment:confusion%20matrix.png)

True positives and true negatives are the observations that are correctly predicted, and are therefore shown in green. We want to minimize false positives and false negatives, so they are shown in red. These terms can be confusing, so let's take them one by one.

**True Positives (TP)** - These are the correctly predicted positive values, which means that the actual class is yes and the predicted class is also yes. E.g. the actual class indicates that the domain is valuable and the predicted class tells you the same thing.

**True Negatives (TN)** - These are the correctly predicted negative values, which means that the actual class is no and the predicted class is also no. E.g. the actual class says this domain is not valuable and the predicted class tells you the same thing.

False positives and false negatives occur when the actual class contradicts the predicted class.

**False Positives (FP)** - When the actual class is no and the predicted class is yes. E.g. the actual class says this domain is not valuable but the predicted class tells you that the domain is valuable. Note, this is also referred to as a **Type I error**.

**False Negatives (FN)** - When the actual class is yes but the predicted class is no. E.g. the actual class indicates that this domain is valuable but the predicted class tells you that the domain is not valuable. Note, this is also referred to as a **Type II error**.

Once you understand these four parameters then we can calculate Accuracy, Precision, Recall and F1 score.
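As a minimal sketch of how these four counts are obtained, the snippet below compares actual and predicted labels pair by pair. The function name and the eight labels are made up for illustration; `True` stands for "the domain is valuable".

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for boolean labels (True = domain is valuable)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for is_valuable, predicted_valuable in zip(actual, predicted):
        if predicted_valuable:
            # Predicted yes: correct if the domain really is valuable.
            counts["TP" if is_valuable else "FP"] += 1
        else:
            # Predicted no: a miss if the domain really is valuable.
            counts["FN" if is_valuable else "TN"] += 1
    return counts


# Made-up labels for eight domains, purely for illustration.
actual    = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
print(confusion_counts(actual, predicted))
# {'TP': 2, 'TN': 4, 'FP': 1, 'FN': 1}
```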

### Accuracy
Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to total observations. One may think that a model with high accuracy is the best model. Accuracy is a great measure, but only when the dataset is symmetric, meaning the numbers of false positives and false negatives are nearly the same. Otherwise, you have to look at other metrics to evaluate the performance of your model.

### $$Accuracy = \frac{TP+TN}{TP+FP+FN+TN}$$
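To see why accuracy alone can mislead, here is the cancer example from earlier worked through with the formula above. The population size of 1000 patients is an assumption chosen to make the 99%/1% split easy to count.

```python
# Assumed population: 1000 patients, 10 with cancer (1%), 990 without (99%).
# The model predicts "no cancer" for everyone, so it never predicts positive.
tp, fp = 0, 0    # no positive predictions at all
tn, fn = 990, 10  # every healthy patient is right, every sick patient is missed

accuracy = (tp + tn) / (tp + fp + fn + tn)
found_cases = tp / (tp + fn)  # share of actual cancer cases the model catches

print(accuracy)     # 0.99
print(found_cases)  # 0.0
```

The model scores 99% accuracy while catching none of the actual cases, which is exactly the failure the remaining metrics are meant to expose.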

### Precision 
"How often does the algorithm cause a false alarm?" Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all the domains that our model labeled as worth pursuing, how many actually were worth pursuing?

### $$Precision = \frac{TP}{TP+FP}$$

### Recall
"How sensitive is our algorithm?" Recall is the ratio of correctly predicted positive observations to all observations whose actual class is yes. The question recall answers is: of all the domains that were actually marked as worth pursuing, how many did the model classify as worth pursuing?

### $$Recall = \frac{TP}{TP+FN}$$

### F1 Score
F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar cost. If the costs of false positives and false negatives are very different, it's better to look at both Precision and Recall.

### $$F1 = \frac{2*(Recall * Precision)}{(Recall + Precision)}$$
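The three formulas above translate directly into code. This is a minimal sketch: the function name and the example counts are made up, and returning `0.0` when a denominator is zero is a convention assumed here, not something defined in this report.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f1


# Made-up counts: 30 true positives, 10 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.75 0.6 0.667
```

Note that F1 (0.667) sits closer to the lower of the two values than a plain average (0.675) would, which is how it penalizes a model that is strong on only one of precision or recall.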

I am going to be using these metrics throughout this report. 