# Machine Learning Assessment
This assessment determines how much you have learnt over the past few sprints. The results will be used to decide how EDSA can best prepare you for the working world. It consists of theory and practical questions on regression, classification, and NLP.

The answers for this test will be entered into Athena as multiple-choice questions. The questions are included in this notebook, shown in **bold** and numbered to match the Athena questions.

As this is a time-constrained assessment, if you are struggling with a question, move on to one you are better prepared to answer rather than spending unnecessary time on it.

**_Good Luck!_**

## Section 1 - Machine Learning Theory Questions
Let's start with a couple of theoretical questions.

**Q1. Which of the following would be considered a sign of overfitting?**

A) Much lower testing error than training error.

B) Equal training and testing error.

C) Very large testing error.

D) Much lower training error than testing error.


**Q2. In the equation below, what does `a` represent?**

$$
y = ax + b
$$

A) a is the y-intercept of the linear model.

B) a is the x-intercept of the linear model.

C) a is the slope, or gradient, of the linear model.

D) a is an unknown quantity.


**Q3. What is true about the function below?**

$$
y = 3x + 2
$$

A) The y-variable is always 2 units greater than the x-variable.

B) If we increase the x-variable by 2 units, the y-variable will increase by 1 unit.

C) When the value of the x-variable is 2, the y-variable will be equal to 0.

D) When the value of the x-variable is 0, the y-variable will be equal to 2.


**Q4. Into what interval does the logistic function transform the response?**

A) Between -1 and 1

B) Between 0 and 1

C) Between minus infinity and 0

D) Between 0 and infinity

## Section 2 - Machine Learning Practical Questions
**_Note:_** While there are other ways to obtain the answers to the Athena questions, we recommend writing the following functions to ensure you get the correct answers.

### Imports

In [None]:
import numpy as np
import pandas as pd

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score

import string
from nltk import TreebankWordTokenizer

### Reading in the data
For this assessment we will be using a dataset about the quality of wine. This dataset will be used for both the classification and regression questions. Read in the data and take a look at it.

In [None]:
df = pd.read_csv('winequality.csv')
df.head()

## Task 1 - Data pre-processing

Write a function to pre-process the data so that we can run it through the classifier. The function should:
* Split the data into features and labels
* Standardise the features using sklearn's ```StandardScaler```
* Split the data into 70% training and 30% testing data
* Set `random_state` to 42 in the call to `train_test_split`
* If there are any NaN values, fill them with zeros

_**Function Specifications:**_
* Should take a dataframe as input.
* Should return two `tuples` of the form `(X_train, y_train), (X_test, y_test)`.

**Note: be sure to pay attention to the test size and random state you use as the following questions assume you split the data correctly**
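
The steps listed above can be sketched on a small synthetic dataframe. This is an illustration only: the column names (`acidity`, `sugar`, `quality`), the helper name `preprocess_sketch`, and the choice to pass the label column by name are all hypothetical, and the order of operations simply follows the bullets as far as possible. The values it produces will not match the expected outputs for the wine data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess_sketch(frame, label_col):
    """Fill NaNs, split features/labels, standardise, then split 70/30."""
    frame = frame.fillna(0)                       # replace missing values with zeros
    X = frame.drop(columns=[label_col]).values    # features as a numpy array
    y = frame[label_col].values                   # labels as a numpy array
    X = StandardScaler().fit_transform(X)         # zero mean, unit variance per column
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    return (X_tr, y_tr), (X_te, y_te)


# Hypothetical toy data, not the wine dataset
toy = pd.DataFrame({
    "acidity": [7.0, 6.3, np.nan, 7.2, 8.1, 6.2, 7.0, 6.3, 8.1, 7.9],
    "sugar":   [20.7, 1.6, 6.9, 8.5, 1.2, 7.0, 20.7, 1.6, 1.5, 0.9],
    "quality": [6, 6, 6, 5, 7, 5, 6, 6, 5, 7],
})
(toy_X_train, toy_y_train), (toy_X_test, toy_y_test) = preprocess_sketch(toy, "quality")
print(toy_X_train.shape, toy_X_test.shape)
```

Returning numpy arrays rather than dataframes is what makes positional indexing such as `X_train[0]` work in the cells that follow.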

In [None]:
def data_preprocess(df):
    
    #your code here
    
    return 

In [None]:
(X_train, y_train), (X_test, y_test) = data_preprocess(df)
print(X_train[0])
print(y_train[0])
print(X_test[7])
print(y_test[7])

_**Expected Outputs:**_

```python
(X_train, y_train), (X_test, y_test) = data_preprocess(df)
print(X_train[0])
print(y_train[0])
print(X_test[7])
print(y_test[7])

[-0.57136659 -0.83357715 -1.02610916 -0.26533465 -0.61846572 -0.79974133
 -0.48035289 -0.31396599 -1.32623327 -0.26925241 -1.0773326   0.50996897]
7
[ 1.75018984 -0.07953061  1.88358789 -0.95318049 -0.76558683  0.39882967
 -0.98745132 -1.34019672  0.76818761  1.12849668  0.46278438 -1.16701119]
5
```

**Q5. What is the result of printing out `X_train[10][5]`?**

**Q6. What is the result of printing out `X_test[10][5]`?**

**Q7. What is the result of printing out `y_train[10]`?**

**Q8. What is the result of printing out `y_test[10]`?**

## Task 2 - Training a Logistic Regression Model

Now that we have formatted our data, we can fit a model using sklearn's `LogisticRegression` class with its default parameters. Write a function that will take as input `(X_train, y_train)` that we created previously, and return a trained model.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)`.
* Should return an sklearn `LogisticRegression` model.
* The returned model should be fitted to the data.
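
As an illustration of what a fitted `LogisticRegression` exposes, here is a minimal sketch on synthetic three-class data. The data and the names `toy_X`, `toy_y`, and `toy_model` are hypothetical. For a multi-class problem, `intercept_` holds one value per class and `coef_` has one row per class and one column per feature, so an index such as `coef_[1][3]` addresses the second class row and the fourth feature column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# Three hypothetical classes, each a cluster of 30 points in 4 dimensions
toy_X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 4)) for c in (0, 2, 4)])
toy_y = np.repeat([0, 1, 2], 30)

toy_model = LogisticRegression()   # default parameters
toy_model.fit(toy_X, toy_y)

print(toy_model.intercept_.shape)  # one intercept per class
print(toy_model.coef_.shape)       # one row of coefficients per class
```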

In [None]:
def train_model(X_train, y_train):
    
    #your code here
    
    return 

**Q9. What is the intercept term of the fitted model?**

**Q10. What is the value of `lm.coef_[1][3]`?**

## Task 3 - Testing the Classification Model

Now that you have trained your model, let's see how well it does on the test set. Write a function which returns the accuracy of your trained model when tested with the test set.

_**Function Specifications:**_
* Should take the fitted model and two numpy `arrays` `X_test, y_test` as input.
* Should return a `float` of the accuracy of the model. This number should be between zero and one.
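
Accuracy is the fraction of predictions that equal the true label. The sketch below, using made-up label arrays, computes it by hand and with sklearn's `accuracy_score` to show that the two agree. The array names are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score

true_labels = np.array([5, 6, 6, 7, 5])
predicted_labels = np.array([5, 6, 5, 7, 6])

# Fraction of positions where the prediction equals the true label
manual = np.mean(true_labels == predicted_labels)
print(manual)                                         # 0.6
print(accuracy_score(true_labels, predicted_labels))  # 0.6
```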

In [None]:
def calculate_accuracy(lm, X_test, y_test):
    
    #your code here
    
    return

In [None]:
lm = train_model(X_train, y_train)
print(calculate_accuracy(lm, X_test, y_test))

**Q11. What is the accuracy of this Logistic Regression model?**

## Task 4 - Train Random Forest Classifier model

Let us try to improve this accuracy by training a model using sklearn's `RandomForestClassifier` class with its `random_state` set to 6. We'll write a function that takes as input the features and label variables that we created previously, and returns a trained model.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)`.
* Should return an sklearn `RandomForestClassifier` model which has a random state of 6.
* The returned model should be fitted to the data.
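
A random forest involves random sampling, so fixing `random_state` is what makes the answers reproducible. The sketch below, on hypothetical synthetic data, fits two forests with the same seed and checks that they give identical predicted probabilities. The names `forest_a` and `forest_b` are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(1)
toy_X = rng.normal(size=(60, 3))
toy_y = (toy_X[:, 0] + toy_X[:, 1] > 0).astype(int)   # hypothetical binary label

forest_a = RandomForestClassifier(random_state=6).fit(toy_X, toy_y)
forest_b = RandomForestClassifier(random_state=6).fit(toy_X, toy_y)

# Same seed and same data should give the same forest
same = np.array_equal(forest_a.predict_proba(toy_X), forest_b.predict_proba(toy_X))
print(same)
```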

In [None]:
#def train_rf_model

#your code here

Now that you have trained your model, let's see how well it does on the test set. Use the `calculate_accuracy` function you previously created to do this.

In [None]:
clf = train_rf_model(X_train, y_train)
print(calculate_accuracy(clf, X_test, y_test))

**Q12. What is the accuracy of this Random Forest model?**

## Task 5 - Train Linear Regression Model

Since this dataset is about predicting quality, which ranges from 1 to 10, let's try fitting a regression model to the data instead of a classification model and see how well that performs.

Fit a model using sklearn's `LinearRegression` class with its default parameters. Write a function that will take as input `(X_train, y_train)` that we created previously, and return a trained model.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)`.
* Should return an sklearn `LinearRegression` model.
* The returned model should be fitted to the data.
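
To show what `intercept_` and `coef_` mean on a fitted `LinearRegression`, the sketch below fits noise-free synthetic data generated from a known linear rule and recovers its parameters. The data and the name `toy_reg` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

toy_X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 1.0], [4.0, 2.0]])
# Known rule: intercept 1.0, coefficients 2.0 and -0.5
toy_y = 1.0 + 2.0 * toy_X[:, 0] - 0.5 * toy_X[:, 1]

toy_reg = LinearRegression().fit(toy_X, toy_y)
print(np.round(toy_reg.intercept_, 6))   # a single float for a 1-D target
print(np.round(toy_reg.coef_, 6))        # one coefficient per feature
```

Unlike the multi-class logistic model, a regression on a one-dimensional target has a one-dimensional `coef_`, so `coef_[1]` is the coefficient of the second feature.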

In [None]:
def train_reg_model(X_train, y_train):
    
    #your code here
    
    return

**Q13. What is the result of printing out `reg.intercept_`?**

**Q14. What is the result of printing out `reg.coef_[1]`?**

## Task 6 - Test Regression Model

We would now like to test our regression model. This test should give the residual sum of squares, which for your convenience is written as
$$
RSS = \sum_{i=1}^N (p_i - y_i)^2,
$$
where $p_i$ refers to the $i^{\rm th}$ prediction made from `X_test`, $y_i$ refers to the $i^{\rm th}$ value in `y_test`, and $N$ is the length of `y_test`.

_**Function Specifications:**_
* Should take a trained model and two `arrays` as input. This will be the `X_test` and `y_test` variables. 
* Should return the residual sum of squares over the input from the predicted values of `X_test` as compared to values of `y_test`.
* The output should be a `float` rounded to 2 decimal places.
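
As a worked instance of the RSS formula, the sketch below uses three made-up predictions and targets. The array names are illustrative, and in the task itself the predictions would come from `model.predict(X_test)`.

```python
import numpy as np

predictions = np.array([5.5, 6.0, 4.0])
actual = np.array([5.0, 7.0, 4.0])

residuals = predictions - actual               # [0.5, -1.0, 0.0]
rss = round(float(np.sum(residuals ** 2)), 2)  # 0.25 + 1.0 + 0.0
print(rss)  # 1.25
```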


In [None]:
#your code here

**Q15. What is the RSS value for this Linear Regression Model?**

## Task 7 - Train Decision Tree Regression Model

Let us try to improve on this result by training a model using sklearn's `DecisionTreeRegressor` class with a random state value of 10. Write a function that will take as input `(X_train, y_train)` that we created previously, and return a trained model.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)`.
* Should return an sklearn `DecisionTreeRegressor` model with a random state value of 10.
* The returned model should be fitted to the data.
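
A brief sketch of fitting a `DecisionTreeRegressor` on hypothetical synthetic data follows. With default settings the tree is grown until its leaves are pure, so when every input row is distinct it reproduces the training targets exactly.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

toy_X = np.arange(10, dtype=float).reshape(-1, 1)   # ten distinct inputs
toy_y = toy_X.ravel() ** 2

toy_tree = DecisionTreeRegressor(random_state=10).fit(toy_X, toy_y)

# RSS measured on the same data the tree was trained on
train_rss = float(np.sum((toy_tree.predict(toy_X) - toy_y) ** 2))
print(train_rss)
```

Because the training RSS of a fully grown tree says little about its quality, the task evaluates the model on the test set instead.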

In [None]:
#your code here

Now that you have trained your model, let's see how well it does on the test set. Use the function you previously created to do this.

**Q16. What is the RSS value for this Decision Tree Regression Model?**

## Task 8 - Compare Classification and Regression

How do these regression models compare to the classification models? It's hard to compare residual sum of squares with accuracy, so let's do something simple: round the regression predictions to their closest integer, and use those values to calculate accuracy.

_**Function Specifications:**_
* Should take the fitted model and two numpy `arrays` `X_test, y_test` as input.
* Calculate the model predictions and round them to their closest integer.
* Should return a `float` of the accuracy of the model. This number should be between zero and one.
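
The rounding step can be illustrated with made-up regression outputs. The arrays below are hypothetical, and in the task the raw predictions would come from `model.predict(X_test)`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

raw_predictions = np.array([5.4, 6.6, 4.9, 7.2])
true_quality = np.array([5, 6, 5, 7])

rounded = np.round(raw_predictions).astype(int)   # [5, 7, 5, 7]
print(accuracy_score(true_quality, rounded))      # 0.75
```

Note that `np.round` rounds exact halves to the nearest even integer, which rarely matters for continuous predictions but is worth knowing.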


In [None]:
def compare_reg_class(model, X_test, y_test):
    
    #your code here
    
    return 

In [None]:
print(compare_reg_class(reg,  X_test, y_test))
print(compare_reg_class(dt,  X_test, y_test))

**Q17. What is the accuracy of your Linear Regression Model?**

**Q18. What is the accuracy of your Decision Tree Regression Model?**

## Task 9 - Mean Absolute Error
Write a function to compute the Mean Absolute Error (MAE), which is given by:

$$
MAE = \frac{1}{N} \sum_{i=1}^N |p_i - y_i|
$$

where $p_i$ refers to the $i^{\rm th}$ `prediction`, $y_i$ refers to the $i^{\rm th}$ value in `y_test`, and $N$ is the length of `y_test`.

_**Function Specifications:**_
* Should take two `arrays` as input. You can think of these as the `predictions` and `y_test` variables you get when testing a model.
* Should return the mean absolute error over the input from the predicted values of `X_test` as compared to values of `y_test`.
* The output should be a `float` rounded to 2 decimal places.
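
As a worked instance of the MAE formula, the sketch below uses four made-up predictions and targets, chosen to differ from the inputs used in the questions. The array names are illustrative.

```python
import numpy as np

predictions = np.array([2.0, 4.0, 6.0, 8.0])
actual = np.array([1.0, 4.0, 8.0, 5.0])

abs_errors = np.abs(predictions - actual)   # [1.0, 0.0, 2.0, 3.0]
mae = round(float(abs_errors.mean()), 2)    # (1 + 0 + 2 + 3) / 4
print(mae)  # 1.5
```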


In [None]:
#def mean_abs_err

#your code here

In [None]:
print(mean_abs_err(np.array([5,7,1.2]),np.array([3.2,2,2])))

**Q19. What is the result of printing out `mean_abs_err(np.array([1,1,1]),np.array([2,2,2]))`?**

**Q20. What is the result of printing out `mean_abs_err(np.array([5,7,1.2]),np.array([3.2,2,2]))`?**

## Section 3 - Natural Language Processing



**Q21. Which of the following would be considered a stop word?**

A) and

B) book

C) like

D) always

### Reading in the data
For the practical questions, let's read in the first chapter of Treasure Island, written by Robert Louis Stevenson. The text file `treasure_island.txt` contains only this first chapter.

In [None]:
data = open('treasure_island.txt', 'r', encoding='ISO-8859-1').read()
print(data[:863])

## Task 10 - Text pre-processing
Write a function that removes the punctuation from the text and converts all the letters to lowercase letters.
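
One possible approach, sketched below, uses `string.punctuation` from the standard library together with `str.translate`. The function name `clean_text_sketch` is hypothetical, and `string.punctuation` only covers ASCII punctuation, so any non-ASCII quotation marks in the file would need separate handling.

```python
import string


def clean_text_sketch(text):
    """Drop ASCII punctuation and lower-case the remaining characters."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).lower()


print(clean_text_sketch("Hello, World! It's 9 o'clock."))  # hello world its 9 oclock
```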

## Task 11 - Tokenisation
Tokenise the data using nltk's `TreebankWordTokenizer`.
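
A minimal sketch of the tokeniser on a made-up, already cleaned sentence follows. The sample text is hypothetical. The notebook does not state whether the "Nth token" in the questions is counted from zero or from one, so the sketch only shows how a one-based position maps to a Python index.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
sample = "the old seaman came to the inn door"   # hypothetical cleaned text
tokens = tokenizer.tokenize(sample)

print(tokens[:3])
print(tokens[4 - 1])   # counting from one, the 4th token sits at index 3
```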

**Q22. What is the 133rd token in this chapter?**

**Q23. What is the 21st token in this chapter?**

## Task 12 - Count
Write a function which counts the number of times a word occurs in the text.
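
One way to count occurrences, sketched below, is `collections.Counter` over a list of tokens. The function name `count_word_sketch` and the sample sentence are hypothetical. Matching here is exact and case-sensitive, so it assumes the text has already been lower-cased and tokenised.

```python
from collections import Counter


def count_word_sketch(tokens, word):
    """Count exact, case-sensitive matches of `word` in a list of tokens."""
    return Counter(tokens)[word]


sample_tokens = "the captain said the captain would stay".split()
print(count_word_sketch(sample_tokens, "captain"))  # 2
print(count_word_sketch(sample_tokens, "admiral"))  # 0
```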

**Q24. How many times does the word "admiral" appear in the text?**

**Q25. How many times does the word "captain" appear in the text?**