<div align="right"><sub>Data Science and Machine Intelligence, Semester 2, 2017</sub>
</div>


# Assignment 1
Due Date: Friday, September 1st, 5pm.

Value: 25% of course mark

Every numbered task is worth 1 mark.

---

### Exercise on Linear Regression   

In this exercise, you will work with statistical data on crime in different cities. The data relate predictors/features such as the educational level of the population, government expenditure on policing, individual income, and inequality to the crime rate in a given city. The file containing the data set for this assignment is `crime.csv`. Make sure you open the file and visually inspect it yourself to understand the structure of your data.

To summarize, in this data set, the **features/predictors** ($X$) are:

* Education: mean years of schooling of the population aged 25 years or over
* Police: per capita expenditure on police protection
* Income: average per capita monthly income
* Inequality: income inequality, calculated as the percentage of families earning below half the median income

and the **response variable/target/outcome** ($y$) is:

* Crime: crime rate as number of offenses per 100,000 population

You can load the data into memory and output the first 5 instances of the data set using the following code:

In [1]:
import pandas as pd

data = pd.read_csv('crime.csv', index_col=0)
print(data.head())

   Education  Police  Income  Inequality  Crime
1       12.1     5.8    3940        26.1    791
2        7.3    10.3    5570        19.4   1635
3       10.9     4.5    3180        25.0    578
4        6.1    14.9    6730        16.7   1969
5        7.1    10.9    5780        17.4   1234


You can extract the feature matrix $X$ (education, police, income, inequality) and the target vector $y$ (crime) using the following code:

In [2]:
import numpy as np
from sklearn.utils import shuffle
feature_cols = ['Education','Police','Income','Inequality']
target = ['Crime']
X = np.array(data[feature_cols])
y = np.array(data[target])
X, y = shuffle(X, y, random_state=1)
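
As a quick sanity check on the extraction step, the sketch below rebuilds the five rows printed by `data.head()` by hand and inspects the resulting array shapes. The names `sample`, `X_demo`, `y_demo`, `y_flat`, `X_shuf` and `y_shuf` are made up for this illustration and are not part of the assignment code. Note that selecting with a list of column names gives a 2-D column, while selecting with a plain column name gives a 1-D vector.

```python
import numpy as np
import pandas as pd
from sklearn.utils import shuffle

# The five rows shown in the data.head() output, typed in by hand (illustration only).
sample = pd.DataFrame(
    {
        "Education": [12.1, 7.3, 10.9, 6.1, 7.1],
        "Police": [5.8, 10.3, 4.5, 14.9, 10.9],
        "Income": [3940, 5570, 3180, 6730, 5780],
        "Inequality": [26.1, 19.4, 25.0, 16.7, 17.4],
        "Crime": [791, 1635, 578, 1969, 1234],
    },
    index=[1, 2, 3, 4, 5],
)

X_demo = np.array(sample[["Education", "Police", "Income", "Inequality"]])
y_demo = np.array(sample[["Crime"]])  # list of columns -> 2-D array, shape (5, 1)
y_flat = np.array(sample["Crime"])    # single column name -> 1-D array, shape (5,)

# shuffle applies the same permutation to both arrays, so rows stay paired.
X_shuf, y_shuf = shuffle(X_demo, y_demo, random_state=1)

print(X_demo.shape, y_demo.shape, y_flat.shape)  # (5, 4) (5, 1) (5,)
```

Some scikit-learn and plotting functions expect a 1-D target, so it can help to know which of the two shapes you are holding.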

1. Plot the Education predictor/feature variable against Crime (the predictor should be on the x axis and crime on the y axis). 
2. Plot the Police predictor/feature variable against Crime. 
3. Plot the Income predictor/feature variable against Crime. 
4. Plot the Inequality predictor/feature variable against Crime. 
5. Is the education variable positively or negatively correlated with crime?
6. Is the police variable positively or negatively correlated with crime?
7. Split the data into two halves: a training set and a test set.
8. Fit a multivariate linear regression model on the training data using all the features available.
9. What are the intercept ($\theta_0$) and coefficients ($\theta_1$, $\theta_2$, $\theta_3$ and $\theta_4$) of the model?
10. What is the $R^2$ score (i.e. the coefficient of determination, which measures the proportion of the outcome's variation explained by the model) for the training data, and for the test data?
11. Given the following imaginary cities with the provided values for the predictors education, police, income and inequality, which city should have the highest level of crime according to your model?

| City name        | education           | police  | income  | inequality |
| ------------- |:-------------:| -----:| -----:| -----:|
| City 1      | 10 | 5 | 6000  | 16   |
| City 2      | 8 | 11 | 4500  | 25   |
| City 3      | 6 | 8 | 3780  |  17  |
| City 4      | 12 | 6 | 5634  |  22  |

<ol start="12">
  
<li>Re-instantiate your linear regression model with the parameter `fit_intercept` set to `False` and rerun your analysis on the entire feature matrix $X$. Setting `fit_intercept` to `False` fits a model with no intercept parameter $\theta_0$. Output the coefficients you get for $\theta_1 ... \theta_4$.</li>
<li>Calculate the coefficients $\theta_1 ... \theta_4$ using the analytical/closed-form solution of linear regression. Make sure those estimates coincide with what you get in task 12 to be certain you got it right. Use the matrix algebra functionality provided by the `numpy` library to find the optimal vector **$\theta$**. Provide the line of code you created to calculate the solution.</li>
</ol>
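
For a model without an intercept, the least-squares estimate satisfies the normal equation $X^T X \theta = X^T y$. The sketch below illustrates this on synthetic, noise-free data rather than on `crime.csv`. The names `X_toy`, `y_toy`, `theta_true` and `theta_closed` are invented for this example, and solving the linear system with `np.linalg.solve` is just one of several ways to express the closed-form solution in `numpy`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients and no noise (illustration only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 4))
theta_true = np.array([2.0, -1.0, 0.5, 3.0])
y_toy = X_toy @ theta_true

# Closed-form solution: solve (X^T X) theta = X^T y for theta.
theta_closed = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)

# The same model fitted by scikit-learn, without an intercept term.
model = LinearRegression(fit_intercept=False).fit(X_toy, y_toy)

print(theta_closed)
print(np.allclose(theta_closed, model.coef_))
```

Comparing the two estimates with `np.allclose` is more robust than comparing printed digits, because the two routes can differ by floating-point rounding.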

### Exercise on Classification

"Churn rate" is a business term describing the rate at which customers leave or cease paying for a product or service. It is a critical figure in many businesses, and understanding what keeps customers engaged is valuable. Consequently, there is growing interest among companies in developing better churn-detection techniques. Predicting churn is particularly important for businesses with subscription models, such as cell phone, pay-TV, or any other service provided in exchange for a subscription fee.

The data set we'll be using, `churn.csv`, contains real customer data from a telecommunications company. Each row represents a subscribing telephone customer. Each column contains a customer attribute, such as call minutes used during different times of day, charges incurred for services, lifetime account duration, etc. The last column is the label indicating whether the customer quit the service (1) or is still a customer of the telecom company (0).

Your task:

<ol start="14">
  
<li> Read the data from the file into the appropriate $X$ and $y$ data structures and shuffle it.</li>
<li> Split the data into a training set and test set (test set size should be 33%)</li>
<li> Scale the data using the StandardScaler class from scikit-learn</li>
<li> Train a logistic regression model and estimate its performance on the test data</li>
<li> Train a K nearest neighbors classifier and estimate its performance on the test data</li>
<li> Train a Multilayer Perceptron (Artificial neural network) classifier and estimate its performance on the test data</li>
<li> Train a support vector machine classifier (using a radial basis function as kernel) and estimate its performance on the test data</li>
<li> Print out a confusion matrix for the support vector machine classifier</li>
<li> Print out a classification report for the support vector machine classifier (displaying precision, recall and f1-score)</li>
<li> Plot an ROC curve for the logistic regression classifier</li>
</ol>
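
The general shape of the split, scale, train and evaluate workflow can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: the data come from `make_classification` rather than `churn.csv`, all variable names are invented for the example, and the classifier uses default hyperparameters. One point worth noticing is that the scaler is fitted on the training set only and then applied to both sets, so no information from the test set leaks into training.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem (stand-in data, not churn.csv).
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)

# Hold out 33% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=0
)

# Learn the scaling parameters from the training set, then apply them to both sets.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# Train on the scaled training data and evaluate on the scaled test data.
clf = LogisticRegression().fit(X_tr_s, y_tr)
accuracy = clf.score(X_te_s, y_te)
cm = confusion_matrix(y_te, clf.predict(X_te_s))

print(accuracy)
print(cm)
```

The other scikit-learn classifiers follow the same `fit` / `predict` / `score` interface, so the same skeleton applies once the estimator is swapped.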

### Exercise on Regularization

Using the Boston house prices [data set](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html):

<ol start="24">
<li> Fit a linear regression model using Ridge regularization and print out the coefficients of the model </li>
<li> Fit a linear regression model using Lasso regularization and print out the coefficients of the model </li>
<li> Describe the most striking difference between the coefficients of Ridge and Lasso regression.</li>
</ol>

I provide you with some initial code to get you started:

In [None]:
from sklearn.model_selection import KFold, train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, SGDRegressor
import numpy as np
import pylab as pl

from sklearn.datasets import load_boston
boston = load_boston()

# Append a constant column of ones to every row of the feature matrix
X = np.array([np.concatenate((v,[1])) for v in boston.data])
y = boston.target

# Hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
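
To show how the two regularized estimators are called, here is a minimal sketch on synthetic data from `make_regression`. The data, the variable names and the choice `alpha=1.0` are assumptions made for this illustration only. They are not tuned, and they are not the Boston data used in the tasks.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem (stand-in data); only 3 of the 8 features are informative.
X_reg, y_reg = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)

# alpha sets the regularization strength; 1.0 is an arbitrary choice here.
ridge = Ridge(alpha=1.0).fit(X_reg, y_reg)
lasso = Lasso(alpha=1.0).fit(X_reg, y_reg)

# Each fitted model exposes one coefficient per feature in coef_.
print("Ridge:", ridge.coef_)
print("Lasso:", lasso.coef_)
```

Printing the two coefficient vectors side by side, as here, makes them easy to compare.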

### Exercise on Validation

<ol start="27">

<li> In your own words, explain what the last two lines of code do in the following code snippet: </li>
</ol>

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
cv = cross_val_score(KNeighborsClassifier(1), X, y, cv=10)
cv.mean()

### Deliverables

<ol start="28">

<li> You should hand in a single Python notebook file (.ipynb) containing all the code necessary to answer each of the tasks listed above. Include comments in the code to clarify which task each code snippet addresses. Use separate code cells to organize your code. Where appropriate, also use comments to provide your answer to a task. Please make sure your code runs smoothly so that I can execute it on my computer. Name your Python notebook according to the following pattern: `yourLastName_DSMI_A1.ipynb`. Create your own private repository to work on the assignment by using the following link: https://classroom.github.com/a/ozIoYww6
The notebook file and the data files should be available in your own private GitHub repository "https://github.com/OPClasses2/assignment1-YourGithubUserName".
</li>
</ol>