# COMP2420/6420 - Introduction to Data Management, Analysis and Security


### Australian National University

### College of Engineering and Computer Science

Assignment 3 
============

  
|**Maximum marks**         |**100**
|--------------------------|--------
|  **Weight**              |  **15% of the total marks for the course**
|  **Submission deadline** |  **5pm, Friday, May 18**
|  **Submission mode**     |  **Electronic, using wattle**
|  **Estimated time**      |  **20 hours**
|  **Penalty**             |  **100% after the deadline**
  


# Submission

You need to submit the notebook `Assignment-3.ipynb` as part of your submission on wattle. You need to add your group and student details below. Remember that your filename should be exactly `Assignment-3.ipynb`. Any change to the file name will mean your file can't be marked by the automarker, resulting in a zero mark.

**Note**

* For answers requiring free form written text, use the designated cells denoted by `YOUR ANSWER HERE` -- double click on the cell to write inside them.
* For all coding questions please write your code after the comment `YOUR CODE HERE`.
* After inserting your code **please remove** the following line from each code cell `raise NotImplementedError()`.
* In the process of testing your code, you can insert more cells or use print statements for debugging, but when submitting your file remember to remove these cells and calls respectively.
* You will be marked on **correctness** and **readability** of your code. If your marker can't understand your code, you will get zero marks.
* We have marked some questions with the tags **hard** and **slightly hard**, so that you can plan your time accordingly.
* We advise students to skip Q2.4 at first. You can always come back after finishing all other questions.
* Changes from the last version are marked in <font color='magenta'>magenta</font>; there are a few clarifications.

### Group Name : XXX
### Student Id1: uXXXXXXX
### Student Id2: uXXXXXXX

In [None]:
import json
import os
import urllib
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from scipy.stats import ttest_ind, ttest_rel, ttest_1samp
from sklearn.preprocessing import scale
plt.style.use('seaborn-notebook')
## inline figures
%matplotlib inline

## suppress warnings so that they are not shown in the output
import warnings
warnings.filterwarnings("ignore")

#### If you need to put more imports please insert them below

In [None]:
## Put extra imports here if required by your code

## Part 1 Data Analysis (15 marks)

We will use the tweets dataset from [Assignment 2](https://cs.anu.edu.au/courses/comp2420/assessment/02-assignments/ass2/comp2420/assignment-2/). The questions in Part 1 are not merely about performing a t-test: you need to think carefully about the type of t-test to run and craft your hypothesis accordingly.
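As a rough illustration of how the three t-test functions imported above differ, the sketch below runs each one on synthetic data. The arrays `a`, `b` and `a_after` and the threshold `alpha = 0.05` are assumptions made up for this example; they are not taken from the tweets dataset, and the sketch does not say which test suits which question.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind, ttest_rel

# Synthetic data for illustration only (NOT the tweets dataset).
rng = np.random.default_rng(0)
a = rng.normal(loc=10.0, scale=2.0, size=200)
b = rng.normal(loc=10.5, scale=2.0, size=150)
# A second measurement on the same 200 units as `a` (paired data).
a_after = a + rng.normal(loc=0.3, scale=1.0, size=a.size)

# One-sample test: is the mean of `a` different from a fixed value?
t1, p1 = ttest_1samp(a, popmean=10.0)

# Independent two-sample test: two unrelated groups of different sizes.
# equal_var=False selects Welch's t-test (no equal-variance assumption).
t2, p2 = ttest_ind(a, b, equal_var=False)

# Paired test: two measurements taken on the same units.
t3, p3 = ttest_rel(a, a_after)

alpha = 0.05  # assumed significance level for this sketch
for name, p in [("one-sample", p1), ("independent", p2), ("paired", p3)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: p = {p:.4f} -> {decision}")
```

The choice between these functions depends on whether you compare one sample to a constant, two unrelated groups, or two columns measured on the same rows.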

#### Reading the dataframe

In [None]:
df_tweets = pd.read_hdf(os.path.join('data','yt_tweets_df.h5'))
df_tweets.head(5)

#### For questions Q1.1, Q1.2 and Q1.3 you need to work on the dataframe `df_tweets`

## Q 1.1
### Compare the mean for '#friends' for tweets in language 'en' (lang_tweet='en') against the overall mean value, 612. (5 marks)
Give your analysis with the help of a t-test. You will need to explicitly state your hypothesis and the p-value being used. In the two cells below, write your code to perform the test in the first cell, and write your hypothesis, p-value and the result of running the test in the second cell.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

## Q 1.2
### Compare the mean for '#friends' for tweets tagged with language English (lang_tweet='en') against the tweets tagged with language Japanese (lang_tweet='ja'). (5 marks)
Give your analysis with the help of a t-test. You will need to explicitly state your hypothesis and the p-value being used. In the two cells below, write your code to perform the test in the first cell, and write your hypothesis, p-value and the result of running the test in the second cell.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

## Q 1.3
### Compare the mean for '#followers' against '#friends' for tweets tagged with language English (lang_tweet='en'). (5 marks)

Give your analysis with the help of a t-test. You will need to explicitly state your hypothesis and the p-value being used. In the two cells below, write your code to perform the test in the first cell, and write your hypothesis, p-value and the result of running the test in the second cell.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

## Part 2 Regression (45 marks)

We will use the data from before 2018 in [Sean Lahman's Baseball Database](http://www.seanlahman.com/#sportsdata) to create a metric for picking baseball players using linear regression. This database contains the "complete batting and pitching statistics from 1871 to 2017, plus fielding statistics, standings, team stats, managerial records, post-season data, and more". [Documentation is provided here](http://www.seanlahman.com/files/database/readme2017.txt).

We have extracted the data and derived two dataframes from it. These dataframes have historical offensive (that is, batting statistics) information about various teams and players up to and including the 2017 season.

Name of pandas DataFrame  | Name of file
:---: |  :---: |
stats |  baseball_team_stats_offensive_players.h5
playerLS | baseball_players_offensive_stats.h5

Description of **stats** DataFrame

Field| Description
:---: |  :---: |
teamID| unique ID for a baseball team
yearID| year for which we have stats
w| number of games won out of 162 games played in a season
1B| normalized number of singles hit by team
2B| normalized number of doubles hit by team
3B| normalized number of triples hit by team
HR| normalized number of home runs hit by team
BB| normalized number of Base on Balls by team

Description of **playerLS** DataFrame

Field| Description
:---: |  :---: |
playerID| unique ID for a player
POS| position where a player plays in the team
minYear| year the player started his career
maxYear| year the player played his last game
1B| normalized number of singles hit by player
2B| normalized number of doubles hit by player
3B| normalized number of triples hit by player
HR| normalized number of home runs hit by player
BB| normalized number of Base on Balls by player
nameFirst| first name of the player
nameLast| last name of the player
salary| median salary of the player

***Note:*** You don't need to understand exactly what each of the features means! They can be seen as team/individual statistics for a baseball game.

In [None]:
stats = pd.read_hdf(os.path.join('data','baseball_team_stats_offensive_players.h5'))
stats.head(5)

In [None]:
playerLS = pd.read_hdf(os.path.join('data','baseball_players_offensive_stats.h5'))
playerLS.head(5)

## Q 2.1

### Build a simple linear regression model to predict the number of wins for each entry in the `stats` dataframe. Your features should be made up of the columns pertaining to normalized singles, doubles, triples, HR, and BB rates. (10 marks)

To decide which of these terms to include, fit your model on data up to and including 2002 and select the model that performs best on data from 2003 to 2017. Use the fitted model to define a new [sabermetric](https://en.wikipedia.org/wiki/Sabermetrics) summary, which we'll call Offensive Predicted Wins (OPW). Also list the coefficients of your model.
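The fit-then-select procedure described above can be sketched on made-up data as follows. Everything here is an assumption for illustration: the `toy` frame, its columns `year`, `f1`, `f2` and `target`, the three candidate feature sets, and the use of mean squared error as the selection criterion. It is one possible reading of "best performing", not the required one.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy frame; column names are NOT those of `stats`.
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    "year": rng.integers(1990, 2018, size=n),  # years 1990..2017
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
})
toy["target"] = 80 + 5 * toy["f1"] - 3 * toy["f2"] + rng.normal(scale=1.0, size=n)

# Split by time: fit on the earlier years, compare models on the later years.
train = toy[toy["year"] <= 2002]
valid = toy[toy["year"] >= 2003]

# Candidate feature subsets to compare (assumed for this sketch).
candidates = [["f1"], ["f2"], ["f1", "f2"]]
results = {}
for cols in candidates:
    model = LinearRegression().fit(train[cols], train["target"])
    pred = model.predict(valid[cols])
    # Mean squared error on the held-out years.
    results[tuple(cols)] = float(np.mean((valid["target"] - pred) ** 2))

best = min(results, key=results.get)
print("best feature set:", best, "validation MSE:", results[best])
```

Splitting by year rather than at random keeps later seasons out of the fitting step, which matches the instruction to fit on earlier data and select on later data.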

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

#### Write your coefficients here in the following order: 1B, 2B, 3B, HR, BB
YOUR ANSWER HERE

## Q 2.2

### Compute the OPW for each player based on the average rates in the `playerLS` DataFrame (5 marks)

Notice that players essentially have the same features as teams, so you can use your model from Q2.1 to perform a prediction. Add this column to the playerLS DataFrame and name it OPW.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Q 2.3
### Plot and describe the relationship between the median salary (in millions) and the predicted number of wins for a player. (10 marks)
Players should have been active in the seasons between 2010 and 2012 inclusive, and should have at least 5 years of experience.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

####  Write your description here
YOUR ANSWER HERE

## Q 2.4 
#### <font color='magenta'> Pick a team of 9 players such that you have a player for each of the 5 positions: C, 1B, 2B, 3B and SS, and 4 players from position OF</font>. The total budget you have for team salary is 25 million dollars.  Try to optimize for the expected/average OPW. (20 marks)  <font color='red'>hard</font>

There are many ways to do this. Any reasonable optimization will be worth marks, along with an explanation of what you are doing and why. You should write your explanation in the text block provided.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

#### Your explanation here

YOUR ANSWER HERE

## Part 3 Classification (20 marks)

In this example we will use the [credit card clients](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) dataset.

This dataset has 24 columns. The last column, named `DEFAULT`, is the target variable. It takes a binary value, 1 or 0, indicating whether the client will default next month or not. Your task is to **create a KNN classifier** for this dataset in Q3. You don't need to write the code to download and read the dataset as we have done this for you. You will need to work on the dataframe `df_credit`.

Description of **df_credit** dataframe

Field| Description (type of values it takes)
:---: |  :---: |
LIMIT_BAL| Amount of the given credit
SEX| Gender (1 = male; 2 = female). 
EDUCATION|  Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)
MARRIAGE|  Marital status (1 = married; 2 = single; 3 = others)
AGE| Age (year)
PAY_0| History of past payment, last month (-1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above)
PAY_2| History of past payment, 2 months back (same as PAY_0)
PAY_3| History of past payment, 3 months back (same as PAY_0)
PAY_4| History of past payment, 4 months back (same as PAY_0)
PAY_5| History of past payment, 5 months back (same as PAY_0)
PAY_6| History of past payment, 6 months back (same as PAY_0)
BILL_AMT1|  Amount of bill statement, last month
BILL_AMT2|  Amount of bill statement, 2 months back
BILL_AMT3|  Amount of bill statement, 3 months back
BILL_AMT4|  Amount of bill statement, 4 months back
BILL_AMT5|  Amount of bill statement, 5 months back
BILL_AMT6|  Amount of bill statement, 6 months back
PAY_AMT1|  Amount of previous payment, last month
PAY_AMT2|  Amount of previous payment, 2 months back
PAY_AMT3|  Amount of previous payment, 3 months back
PAY_AMT4|  Amount of previous payment, 4 months back
PAY_AMT5|  Amount of previous payment, 5 months back
PAY_AMT6|  Amount of previous payment, 6 months back
DEFAULT|  Will default this time (Yes = 1, No = 0)


##### You will need to create a training and test set yourself. Refer to the [lab 6](https://cs.anu.edu.au/courses/comp2420/labs/lab-6/) exercise

In [None]:
df_credit = pd.read_hdf(os.path.join('data','df_credit.h5'))
df_credit.head()

## Q 3.1
#### Write a **ten-fold cross validation** to estimate the optimal value for $k$ for the data set. <font color='magenta'>You need to consider only values between 20 and 50 (inclusive) for $k$.</font> (10 marks)

##### You will need to create a training and test set yourself. Refer to the [lab 6](https://cs.anu.edu.au/courses/comp2420/labs/lab-6/) exercise

***Note*** Keep in mind that the optimal value of $k$ depends on $d$, where $d$ is the number of features used.
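A minimal sketch of a ten-fold search over $k$ is given below, using `cross_val_score` from the imports at the top of the notebook. The data (`X_toy`, `y_toy`) are synthetic and only three features wide; they stand in for whatever training split you create from `df_credit`, and choosing the $k$ with the highest mean accuracy is an assumed selection rule.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data for illustration (NOT the credit dataset).
rng = np.random.default_rng(2)
X_toy = np.vstack([
    rng.normal(0.0, 1.0, size=(150, 3)),
    rng.normal(1.5, 1.0, size=(150, 3)),
])
y_toy = np.array([0] * 150 + [1] * 150)

# Mean ten-fold cross-validation accuracy for every k in 20..50 inclusive.
mean_scores = {}
for k in range(20, 51):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, X_toy, y_toy, cv=10)
    mean_scores[k] = scores.mean()

# Pick the k with the highest mean validation accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print("best k:", best_k, "mean accuracy:", round(mean_scores[best_k], 3))
```

Because KNN is distance based, the scale of each feature affects the result, so the same loop may pick a different $k$ once features are added, removed or rescaled.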

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Q 3.2 
#### Create a boxplot showing training scores for the optimal $k$ for each $d$-dimensional subspace with $d$ ranging from one to 23. <font color='magenta'>You need to consider only values between 20 and 50 (inclusive) for $k$.</font> (5 marks)
The plot should have the scores on the y-axis and the different dimensions $d$ on the x-axis. You should increase the features incrementally -- this exercise needs you to start from one feature and increase the number of features to 23 incrementally.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Q 3.3

#### Evaluate your performance on the test set with the best ($k$,$d$) pair. (5 marks)

Additionally, write a brief discussion of your conclusions to the questions and tasks in Q3.1 and Q3.2 in 100 words or less each.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

## Part 4 K-Means (10 marks)

We will use the standard [breast cancer data set](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html) from sklearn. We have already loaded the dataset for you. $X$ contains all the features.

In [None]:
breast_cancer = datasets.load_breast_cancer()
X = scale(breast_cancer.data)

## Q 4.1 
#### Implement K-Means clustering for the breast cancer data. (10 marks) <font color='red'>slightly hard</font>

Complete the function kmeans below.
 
***Note:*** 
- You are **not allowed** to use any of the **sklearn's pre-implemented algorithms or functions**. 
- You are **not allowed** to use any pre-implemented **k-means** algorithm from **any module** .
- You **should** use the **numpy** library to do matrix operations and calculations.
- <font color='magenta'> You **should** use a smart initialization strategy.</font>
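One commonly used initialization is k-means++ seeding, sketched below with numpy only. It is offered as one possible option, not the required one. The helper names `kmeans_pp_init` and `assign_labels` and the toy data are hypothetical, and the sketch assumes that the points in `X` are not all identical (otherwise the sampling probabilities are undefined).

```python
import numpy as np

def kmeans_pp_init(X, n_cluster, random_seed=0):
    """Hypothetical helper: pick initial centers with k-means++ seeding."""
    rng = np.random.default_rng(random_seed)
    n = X.shape[0]
    centers = np.empty((n_cluster, X.shape[1]), dtype=float)
    # First center: a data point chosen uniformly at random.
    centers[0] = X[rng.integers(n)]
    # Squared distance from each point to its nearest chosen center so far.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for i in range(1, n_cluster):
        # Sample the next center with probability proportional to d2,
        # so points far from the existing centers are favoured.
        probs = d2 / d2.sum()
        centers[i] = X[rng.choice(n, p=probs)]
        d2 = np.minimum(d2, np.sum((X - centers[i]) ** 2, axis=1))
    return centers

def assign_labels(X, centers):
    """Hypothetical helper: index of the nearest center for each row of X."""
    dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.argmin(dists, axis=1)

# Toy data: three well-separated 2-D blobs (NOT the breast cancer data).
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                   for c in (-3.0, 0.0, 3.0)])
init_centers = kmeans_pp_init(X_toy, n_cluster=3, random_seed=4)
toy_labels = assign_labels(X_toy, init_centers)
print(init_centers.shape, toy_labels.shape)  # (3, 2) (150,)
```

Note that the label array has one entry per datapoint, with values in `0 .. n_cluster - 1`, matching the description in the docstring of `kmeans`.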

In [None]:
def kmeans(X, n_cluster, random_seed=2, n_init=100):
    '''
    Function calculates the centroids after performing k-means on the given dataset. 
    Function returns two values new calculated centers and labels for each datapoint.
    If we have n_cluster = 4 then labels from algorithm will correspond to values 0,1,2 and 3
    
    Args:
        X: np.array representing set of input data
        n_cluster: number of clusters to use for clustering
        random_seed: random seed to use for calling random function in numpy
        n_init: max number of iterations to use for k-means
    Returns:
        centers: np.array representing the centers for n_clusters
        labels: np.array containing a label for each datapoint in X
    '''
    
    centers = np.zeros((n_cluster,X.shape[1]))
    labels = np.zeros(X.shape[0], dtype=int)  # one label per datapoint
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return centers,labels

## change the parameters of the function call to test your implementation
centers, labels = kmeans(X,n_cluster=4, random_seed=4, n_init=300)

In [None]:
## optional: you can write code to visualize or check your algorithm here


## Part 5 Decision Trees (10 marks)

The following is a small synthetic data set about the weather conditions.  We are
going to try and use decision trees to predict whether it will rain or not on the given day.


|Temperature| Cloudy| UV Index| Humidity| Rain
|---:|--:|--:|--:|--:|
|25|No| Low| Low| No 
|29|No| Low| High| No
|26|No| Low| Medium| No
|26|No| Medium| Medium| No
|27|No| Medium| High| No
|28|No| High | High| No
|25|No| High |Low| No
|29|Yes| Low |Low| Yes
|28|No| Medium| High| Yes
|28|Yes| Medium| High| Yes
|26|No| Low |Low| Yes
|27|Yes| Low |High| Yes

**Note:**
* You can treat temperature as a continuous variable and split on a range of temperature values.
* Attribute selection in the tree uses information gain.
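For reference, entropy and information gain for categorical attributes can be computed as in the sketch below. The function names `entropy` and `information_gain` and the toy lists `windy` and `play` are made up for illustration and are unrelated to the weather table above. Entropy is taken as $H(Y) = -\sum_y p(y)\log_2 p(y)$, and information gain as $H(Y)$ minus the weighted average entropy of $Y$ within each attribute value.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy of labels minus the weighted entropy after splitting."""
    n = len(labels)
    # Group the labels by the value the attribute takes on each row.
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy example: the attribute separates the two classes perfectly.
windy = ["yes", "yes", "no", "no"]
play = ["no", "no", "yes", "yes"]
print(entropy(play))                  # 1.0
print(information_gain(windy, play))  # 1.0
```

For a continuous attribute such as temperature, one way to reuse these helpers is to first turn it into a binary attribute with a threshold (for example, "value <= t" versus "value > t") and compare the gain across candidate thresholds.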

## Q 5.1
#### What is the initial entropy of Rain?  (2 marks)


YOUR ANSWER HERE

## Q 5.2
#### Which attribute would the decision-tree building algorithm choose at the root of the tree?   (2 marks)

Choose one through inspection and explain your reasoning in a sentence. 

YOUR ANSWER HERE

## Q 5.3
#### Calculate and specify the information gain of the attribute you chose to split on in the previous question.  (3 marks)

YOUR ANSWER HERE

## Q 5.4

#### Consider a decision tree built from an arbitrary set of data. If the output is binary, what is the maximum training set error for this dataset? Explain your answer. (Please note that this is the error on the same dataset the tree was trained on.  A new test set could have arbitrary errors.) (3 marks) <font color='red'>slightly hard</font>

YOUR ANSWER HERE