# **Q1**

**Step 1**: Load the CSV into a `DataFrame`.  

**Step 2**: Assess how many unique values of class there are.  

_Note: `data.query('class == 0')` fails because `class` is a Python keyword, so the rows are selected with boolean indexing instead._  



In [0]:
import pandas as pd
import numpy as np

data = pd.read_csv('./data/pulsar.csv', index_col=0)
number_RFI = len(data[data['class'] == 0])
number_pulsars = len(data[data['class'] == 1])

print(f'The number of pulsars is {number_pulsars}, and the number of noise data points is {number_RFI}')
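
As a possible alternative to filtering once per class, `value_counts` tallies every label in one pass. The sketch below uses a small hypothetical frame (`toy`) in place of the pulsar data, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the pulsar data: only the 'class' column matters here.
toy = pd.DataFrame({'class': [0, 0, 0, 1, 0, 1]})

# value_counts returns a Series indexed by the distinct labels.
counts = toy['class'].value_counts()
number_RFI = counts.loc[0]
number_pulsars = counts.loc[1]

print(f'The number of pulsars is {number_pulsars}, and the number of noise data points is {number_RFI}')
```

This also reveals any unexpected labels, since every distinct value of `class` appears in `counts`.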

**Step 3**: Check the range of standard deviations and means in the data set.

_The standard deviations are not of order 1 and the means are not close to 0, so the data should be standardised before analysis to make sure the model does not over-emphasise features with a large mean or variance._



In [0]:
dev0 = data['std_dm']
dev1 = data['std_pf']
mean0 = data['mean_dm']
mean1 = data['mean_pf']

print(f'Dispersion Dev  : {max(dev0)}\t{min(dev0)} \nPulse Dev\t: {max(dev1)}\t{min(dev1)}')
print(f'\n\nDispersion Mean : {max(mean0)}\t{min(mean0)} \nPulse Mean\t: {max(mean1)}\t{min(mean1)}')
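
The same range check can be sketched for several columns at once with `DataFrame.agg`. The frame below is a hypothetical stand-in that reuses the column names from the cell above; its values are made up for illustration.

```python
import pandas as pd

# Hypothetical values; only the column names come from the notebook.
toy = pd.DataFrame({
    'mean_pf': [100.0, 120.0, 90.0],
    'std_pf': [40.0, 55.0, 35.0],
    'mean_dm': [2.0, 150.0, 1.0],
    'std_dm': [15.0, 80.0, 12.0],
})

# One row per statistic, one column per feature.
ranges = toy.agg(['min', 'max'])
print(ranges)
```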

**Step 4**: Standardise the data using sklearn  



In [0]:
from sklearn import preprocessing

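# Drop only the label column; iloc[:, 0:-2] would also discard a feature if 'class' is the last column
data_no_class = data.drop(columns='class')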

data_scaler = preprocessing.StandardScaler().fit(data_no_class)

data_std_temp = data_scaler.transform(data_no_class)
type(data_std_temp)
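
A quick sanity check on what `StandardScaler` produces: every column of the transformed array should have mean 0 and standard deviation 1. The sketch below uses synthetic data (an assumption, not the pulsar set) with deliberately different scales per column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features with very different means and spreads.
rng = np.random.default_rng(0)
X = rng.normal(loc=[100.0, 2.0], scale=[40.0, 15.0], size=(200, 2))

X_std = StandardScaler().fit_transform(X)

# StandardScaler uses the population standard deviation, which matches np.std's default.
col_means = X_std.mean(axis=0)
col_stds = X_std.std(axis=0)
print(col_means, col_stds)
```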

**Step 5**: Merge the standardised data with the associated class

_This involves rebuilding the data frame we imported: the feature column labels and the original row index are reused so that the standardised values line up with their class labels._  



In [0]:
data_classes = data['class']

# Reuse the original column labels and row index so that concat aligns the rows correctly
features_std = pd.DataFrame(data_std_temp, columns=data_no_class.columns, index=data_no_class.index)

data_std = pd.concat([features_std, data_classes], axis='columns')
data_std.head()
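
Because `pd.concat` aligns on the row index, it may be worth checking that the rebuilt feature frame and the class column share the same index. Whether this matters for `pulsar.csv` depends on its index column, which is not shown here; the sketch below uses hypothetical values to show what happens when the indices differ.

```python
import numpy as np
import pandas as pd

# Hypothetical labels whose index does not start at 0.
labels = pd.Series([0, 1, 0], index=[10, 11, 12], name='class')
scaled = np.array([[0.1], [-1.2], [1.1]])

# A default RangeIndex (0, 1, 2) does not match 10, 11, 12, so rows are not paired up.
misaligned = pd.concat([pd.DataFrame(scaled, columns=['mean_pf']), labels], axis='columns')

# Passing the original index keeps each row next to its label.
aligned = pd.concat([pd.DataFrame(scaled, columns=['mean_pf'], index=labels.index), labels], axis='columns')

print(misaligned.shape, aligned.shape)
```

A check such as `data_std.isna().any().any()` would flag the misaligned case.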

As can be seen, we've recreated the initial data set, but it is now standardised.  



**Step 6**: Reassess the standard deviations and means

_After scaling, each feature column has a mean of 0 and a standard deviation of 1. The extreme values printed below are therefore not 0 or 1 themselves, but they are now on a comparable scale across features, so no feature dominates the model purely because of its units._  



In [0]:
dev0 = data_std['std_dm']
dev1 = data_std['std_pf']
mean0 = data_std['mean_dm']
mean1 = data_std['mean_pf']

print(f'Dispersion Dev  : {max(dev0)}\t{min(dev0)} \nPulse Dev\t: {max(dev1)}\t{min(dev1)}')
print(f'\n\nDispersion Mean : {max(mean0)}\t{min(mean0)} \nPulse Mean\t: {max(mean1)}\t{min(mean1)}')

# **Q2**

**Step 1**: Import the random forest module  

**Step 2**: Separate training data from test data; I will employ a 50/50 split  



In [0]:
from sklearn.ensemble import RandomForestClassifier as classify
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Use every feature column; iloc[:, 0:-2] would drop the last feature along with 'class'
Xtrain, Xtest, ytrain, ytest = train_test_split(data_std.drop(columns='class'), data_std['class'], train_size=0.5)
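
Since pulsars make up a small part of the data, one option is a stratified split, which keeps the class proportions the same in both halves, together with a fixed `random_state` so the split is repeatable. The sketch below uses synthetic labels with an assumed 10% positive rate rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 10 of which belong to the minority class.
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 10 + [0] * 90)

# stratify=y preserves the 10% minority share in both train and test.
Xtr, Xte, ytr, yte = train_test_split(X, y, train_size=0.5, stratify=y, random_state=0)

print(ytr.sum(), yte.sum())
```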

**Step 3**: Train my model with my training data sets  



In [0]:
model1 = classify().fit(Xtrain, ytrain)
prediction1 = model1.predict(Xtest)


**Step 4**: Test the efficacy of my model using a confusion matrix, as I'm predicting classes  



In [0]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

mat = confusion_matrix(ytest, prediction1)

fig = plt.figure(figsize=(9, 9))
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False, cmap='Blues')
plt.xlabel('true label')
plt.ylabel('predicted label');
print(f'model has accuracy of {100*accuracy_score(ytest, prediction1):.1f}% on test data')

# **Q3**

**Step 1**: Obtain the list of parameters for my model  



In [0]:
from sklearn.model_selection import GridSearchCV as gridsearch

classify().get_params()

**Step 2**: Following Lab 6, I could either pick the most important hyperparameters to search over or search over all of them. Because of time constraints, I chose to optimise only `criterion`, `max_depth`, `min_samples_leaf`, and `n_estimators`.

**NOTE: THIS BLOCK IS INCREDIBLY SLOW DUE TO THE NUMBER OF PARAMETERS OPTIMISED (40 MINS)**

In [0]:
param_grid = {'criterion': ['gini', 'entropy'],
 'max_depth': np.arange(2, 10),
 'min_samples_leaf': np.arange(1, 10),
 'n_estimators': np.arange(10,100, 10),
 }

grid = gridsearch(classify(), param_grid)

grid.fit(Xtrain, ytrain)
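
To see why the search is slow, the number of candidate settings can be counted with `ParameterGrid`. The sketch below reuses the grid from the cell above and assumes the default 5-fold cross-validation of recent scikit-learn versions.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': np.arange(2, 10),
              'min_samples_leaf': np.arange(1, 10),
              'n_estimators': np.arange(10, 100, 10)}

# 2 * 8 * 9 * 9 combinations of hyperparameters.
n_candidates = len(ParameterGrid(param_grid))

# Assumption: cv=5, the GridSearchCV default in recent scikit-learn releases.
n_fits = n_candidates * 5

print(n_candidates, n_fits)
```

Passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel, which may shorten the wall-clock time on a multi-core machine.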

**Step 3**: Obtain optimised parameters  



In [0]:
grid.best_params_

**Step 4**: Retest the model with the optimised parameters  



In [0]:
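# Reuse the parameters found by the grid search rather than hard-coding them
# (in my run these were criterion='entropy', max_depth=8, min_samples_leaf=1, n_estimators=50)
model2 = classify(**grid.best_params_).fit(Xtrain, ytrain)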
prediction2 = model2.predict(Xtest)

print(f'model has accuracy of {100*accuracy_score(ytest, prediction2):.1f}% on test data')
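
With the default `refit=True`, `GridSearchCV` also keeps a copy of the best model already refitted on the full training data, available as `best_estimator_`. The sketch below shows this on a small synthetic problem with a deliberately tiny grid (both are assumptions chosen so it runs quickly).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a tiny grid, for illustration only.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
small_grid = {'max_depth': [2, 4], 'n_estimators': [10, 20]}

grid = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
grid.fit(X, y)

# Already fitted with grid.best_params_, so no separate retraining step is needed.
best_model = grid.best_estimator_
predictions = best_model.predict(X)

print(grid.best_params_)
```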

**Step 5**: Reassess the effectiveness of my model  



In [0]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

mat = confusion_matrix(ytest, prediction2)

fig = plt.figure(figsize=(9, 9))
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False, cmap='Blues')
plt.xlabel('true label')
plt.ylabel('predicted label');

_**Observations**: after optimising the model, it has a false positive percentage of 0.6% and a false negative rate of nearly 20%. This implies that the model is good at identifying RFI but less reliable at identifying true pulsars, most likely because true pulsars make up a much smaller subset of the data._
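
One way to compute such rates from a confusion matrix is sketched below on hypothetical labels. It assumes the usual definitions (false positives over all true negatives-class samples, false negatives over all true positive-class samples), which may differ from how the percentages above were calculated.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 8 noise samples (0) and 4 pulsars (1).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

false_positive_rate = fp / (fp + tn)  # noise wrongly flagged as pulsar
false_negative_rate = fn / (fn + tp)  # pulsars missed by the model

print(f'FPR = {100 * false_positive_rate:.1f}%, FNR = {100 * false_negative_rate:.1f}%')
```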


# **Q4**

**Breakdown**: This task requires applying the `learning_curve` function to my model from **Q2**, `model1`. The function returns the training set sizes, together with the training score and the validation score for each size. All that is left is plotting the results.

In [0]:
from sklearn.model_selection import learning_curve

test_sizes = np.linspace(0.01,0.5,20)

N, train_lc, val_lc = learning_curve(model1, Xtest, ytest,
                                           train_sizes=test_sizes)
train = np.mean(train_lc, axis=1)
test = np.mean(val_lc, axis=1)
mean = 0.5*(train[-1]+test[-1])


plt.plot(N,train, '-', label='Train')
plt.plot(N,test, '-', label='Validation')
plt.plot([0,N[-1]],[mean,mean], '--', alpha=0.3, color='black')
plt.ylabel('Score')
plt.xlabel('Training Sizes')
plt.legend()

**Observations**: the learning curve appears to converge at approximately 500 samples, which is approximately 5% of the total data. This suggests that additional data would be unlikely to improve the model much.
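
When reading sample counts off the plot, it helps to know that fractional `train_sizes` are taken relative to the largest training fold of the cross-validation, not the whole data set passed in. The sketch below illustrates this on synthetic data with a small forest and an explicit `cv=5` (all assumptions made for speed).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# 200 synthetic samples; with 5 folds, each training fold holds 160 of them.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

N, train_lc, val_lc = learning_curve(
    RandomForestClassifier(n_estimators=10, random_state=0),
    X, y, train_sizes=[0.5, 1.0], cv=5)

# N holds absolute sample counts: fractions of 160, not of 200.
print(N)
```

The score arrays have one row per training size and one column per fold, which is why the notebook averages over `axis=1`.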