# Network Analytics
## Group Assignment 1

### Group I:
Mark O'Shea  
Rejpal Matharu
Mingyang Tham  
Anna Kurek  
Letty Huang  
Yiting Wang

### Part B:

This script begins by importing the necessary libraries.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix



#### Steps 1, 2
This dataset contains information on 1599 different wines, including some aspects of their chemical composition and their quality. The data is loaded into a `pandas` dataframe, and an additional binary variable $good\_wine$ is created based on the $quality$ variable - $good\_wine = 1$ if $quality \geq 6$, and $0$ otherwise. 

In [2]:
# The file is semicolon-delimited; a plain ';' separator avoids the regex-separator parser warning
wines = pd.read_csv('winequality-red.csv', sep=';')
wines.columns = wines.columns.str.replace('"', '')
# Binary target: 1 if quality >= 6, otherwise 0
wines['good_wine'] = np.where(wines['quality']>=6, 1, 0)


#### Steps 3, 4
Since the Nearest Neighbours method is sensitive to scaling, each variable in the dataset has to be standardised: its mean is subtracted and the result is divided by its standard deviation, so that every variable is measured in units of its own standard deviation. This is done to each variable except for $good\_wine$, using the formula:
$$x_{standard} = \frac{x - \mu}{\sigma}$$
where $\mu$ is the mean of each variable, and $\sigma$ is the standard deviation of each variable.
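As a small worked example of the formula above, the sketch below standardises a made-up list of values in plain Python. The function name `standardise` and the sample values are hypothetical and chosen only so the arithmetic is easy to follow (mean 5, standard deviation 2). It uses the population standard deviation, which the `StandardScaler` documentation describes as its estimator (equivalent to `numpy.std(x, ddof=0)`).

```python
import statistics

def standardise(values):
    """Return (x - mu) / sigma for each x, using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

# Hypothetical values, for illustration only (mean = 5, population std = 2)
sample = [2, 4, 4, 4, 5, 5, 7, 9]
z = standardise(sample)
print(z)
```

After standardising, the values have mean 0 and standard deviation 1, so a variable measured on a large scale no longer dominates the distance calculation.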

The normalisation is done before splitting the dataset so that the mean and standard deviation used in the calculation are estimated from the entire dataset. This gives a larger sample for estimating these statistics, making them more representative of the population.

After normalising the variables, the dataset is randomly shuffled and split into two dataframes of near-equal size (799 and 800 rows). Both steps are performed by the `train_test_split` function in the `sklearn.model_selection` library. The first set is used for training, and the second is reserved as a test dataset, to be used later to evaluate the model that is ultimately chosen.

In [6]:
# Standardise every column, then restore the original 0/1 labels
scaler = StandardScaler()
nwine = pd.DataFrame(scaler.fit_transform(wines), index=wines.index, columns=wines.columns)
nwine['good_wine'] = wines['good_wine']

# Shuffle and split into training and test halves
winetrain, winetest = train_test_split(nwine, test_size=0.5, random_state=123)

#### Steps 5, 6

The training dataset is now ready for model fitting. In this study, several k-Nearest Neighbours models with varying values of k are evaluated, each using 5-fold cross-validation, with the mean accuracy over all 5 folds taken to be the accuracy of the model. The nearest neighbours are calculated using the Euclidean distance over all given variables except $quality$, as this variable was used to create the category of interest, $good\_wine$.
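To make the classification rule concrete, here is a minimal plain-Python sketch of k-nearest neighbours with Euclidean distance and a majority vote. It is an illustration only, not how `KNeighborsClassifier` is implemented internally; the function name `knn_predict` and the two-feature toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the labels of the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical standardised features (two per wine) and good_wine labels
train_X = [(-1.0, -0.5), (-0.8, -1.2), (-0.2, 0.1), (0.9, 1.1), (1.2, 0.7), (0.6, 1.4)]
train_y = [0, 0, 0, 1, 1, 1]

pred_high = knn_predict(train_X, train_y, query=(1.0, 1.0), k=3)
pred_low = knn_predict(train_X, train_y, query=(-0.9, -0.9), k=3)
print(pred_high, pred_low)
```

Because every feature contributes to the distance on the same footing, this rule only behaves sensibly when the features share a common scale, which is why the variables are standardised first.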

A total of 101 models were evaluated, with k starting at 1 and increasing in steps of 5 up to 501, and the result of each model was saved in a dictionary.

In [13]:
# Run 5-fold cross-validation for k = 1, 6, 11, ..., 501 and record the mean accuracy of each k in a dictionary.
resultdict = {}
for num in range(1, 502, 5):
    knn = KNeighborsClassifier(n_neighbors=num)
    xvres = cross_val_score(knn, X=winetrain.loc[:,'fixed acidity':'alcohol'],
                            y=winetrain['good_wine'], cv=5, scoring='accuracy')
    resultdict[num] = xvres.mean()

#Saving best k result:
bestk = max(resultdict, key=resultdict.get)
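One detail of this selection step is worth spelling out: `max` with a `key` function returns the first maximal key in iteration order, so if two values of k tie on mean accuracy, the one inserted first (here, the smaller k) is chosen. The sketch below shows this with made-up accuracies; the dictionary `scores` and its values are hypothetical.

```python
# Hypothetical mean accuracies keyed by k, inserted in ascending order of k
scores = {1: 0.70, 6: 0.74, 11: 0.74, 16: 0.71}

# k = 6 and k = 11 tie; max returns the first one encountered
best_k = max(scores, key=scores.get)
print(best_k)
```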

#### Step 7

The highest percentage of correctly classified wines occurred at $k=51$, so the 51-nearest neighbours model is chosen as the optimal model. The model is then retrained on the entire training dataset and used to predict observations in the test dataset, providing an unbiased estimate of the model's performance on new data.

In [15]:
knn=KNeighborsClassifier(n_neighbors=bestk)
knn.fit(winetrain.loc[:,'fixed acidity':'alcohol'], winetrain['good_wine'])
predictions = knn.predict(winetest.loc[:,'fixed acidity':'alcohol'])

confusion_matrix(y_true=winetest['good_wine'], y_pred=predictions)

array([[248, 139],
       [ 86, 327]])

Of the 413 wines of good quality in the test set, this model correctly classified 327, and of the 387 wines that were not of good quality, it correctly classified 248. Given that $good\_wine$ was the category of interest, this result indicates that the model has a sensitivity of 79% (327/413) and a specificity of 64% (248/387), for an overall accuracy of 72% (575/800).
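The figures above follow directly from the confusion matrix, in which rows are the true classes and columns are the predicted classes. A short sketch of the arithmetic, with variable names chosen here for illustration:

```python
# Confusion matrix from the test set: rows are true classes, columns are predictions
tn, fp = 248, 139   # true class 0: predicted 0, predicted 1
fn, tp = 86, 327    # true class 1: predicted 0, predicted 1

sensitivity = tp / (tp + fn)                 # share of good wines found
specificity = tn / (tn + fp)                 # share of other wines found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all wines classified correctly

print(f"{sensitivity:.3f} {specificity:.3f} {accuracy:.3f}")
```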

Compared to a naive model, which predicts every wine to fall under the majority class, this model performs considerably better. Given the training dataset used, a naive model would predict every wine to be a good wine, as the majority of wines in the training data (442 of 799) are good wines. Using this prediction, it would have correctly classified 413 of the 800 wines in the test dataset, for a total accuracy of 52%, about 20 percentage points below the 72% achieved by the 51-nearest neighbours model.

In [24]:
print(sum(winetrain['good_wine']))
print(winetrain.shape)
print(sum(winetest['good_wine']))
print(winetest.shape)

442
(799, 13)
413
(800, 13)
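Using the counts printed above, the naive baseline can be checked with a few lines of arithmetic. This is a sketch with variable names chosen for illustration; the counts are the ones reported by the cell above.

```python
# Counts taken from the output above
train_good, train_total = 442, 799
test_good, test_total = 413, 800

# The naive model always predicts the majority class of the training data
majority_class = 1 if train_good > train_total - train_good else 0

# Its test accuracy is the share of test wines that belong to that class
correct = test_good if majority_class == 1 else test_total - test_good
baseline_accuracy = correct / test_total
print(majority_class, baseline_accuracy)
```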
