![Augury](https://github.com/augurysys/machine_learning_assignment/blob/master/AUGURY_logo.png?raw=true)

# **Augury Work Assignment - Machine Learning**

---

This assignment uses Google Colab, a Google service based on Jupyter Notebook, which is an interactive Python interpreter running in the browser.

You can read more about Google Colab [here](https://colab.research.google.com/notebooks/welcome.ipynb).
For additional information about Jupyter Notebook, see [here](http://jupyter.org).

# Overview - Speaker Gender Identification


The task we wish to tackle is speaker gender classification. It is a well-known, basic speech processing task, and is often a preliminary stage before higher-order tasks such as speaker diarization.

This is a binary classification task based on 100 features extracted from voice recordings of different speakers. The dataset consists of a training set with 1515 audio recordings and a test set with 460 audio recordings. Unfortunately, the names and origins of the features were lost, so we cannot tell in advance which features are more or less relevant and need to learn this from the data.

In addition, we don’t have accurate labels - we hired 5 annotators, each with a different level of skill in speaker gender classification, who gave us their best opinions. Since the task is relatively simple, we expect all annotators to be better than chance, but some may be worse than others. We need to decide on a ground truth for the data given the feedback we have.

Note that the training data might not contain the same number of male and female speakers - it's hard to know exactly without accurate ground truth, but one class seems more populated than the other. Despite this, we want to avoid a classifier that is biased towards one of the classes - we want to be equally accurate on both genders, since in a real-world setting we expect the distribution to be more or less 50-50.
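
To illustrate why plain accuracy can hide a bias towards the larger class, here is a minimal sketch on hypothetical toy labels (not the assignment data). It compares plain accuracy with per-class recall and balanced accuracy, which is the mean of the per-class recalls. The names `y_true` and `y_pred` are made up for this example.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             recall_score)

# Hypothetical toy labels: 8 samples of class 0 and 2 samples of class 1
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# A degenerate classifier that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Plain accuracy rewards the majority-class guess
plain_acc = accuracy_score(y_true, y_pred)

# Recall computed separately for each class exposes the ignored class
per_class_recall = recall_score(y_true, y_pred, average=None)

# Balanced accuracy is the mean of the per-class recalls
balanced_acc = balanced_accuracy_score(y_true, y_pred)

print(plain_acc, per_class_recall, balanced_acc)
```

Here the plain accuracy is 0.8 even though the minority class is never predicted, while the balanced accuracy is 0.5, the same as guessing.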

## Prerequisites - Importing the Data
There is no need to edit this cell - just run it whenever you need to load the data into memory.
The variable *features_train* is a Pandas DataFrame containing 1515 examples with 100 features each, where every example is based on an audio recording of a single speaker.
More about the DataFrame interface can be found [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame).


In [None]:
import io
import urllib.request
import zipfile
import pandas as pd
from IPython.display import display, HTML

# Download the assignment archive into memory (Python 3 standard library)
response = urllib.request.urlopen('https://github.com/augurysys/machine_learning_assignment/raw/master/augury_ml_assignment_2018.zip')
augury_ml_assignment_zip = response.read()

# Wrap the raw bytes in a file-like object so that zipfile can read them
zip_file_bytes = io.BytesIO(augury_ml_assignment_zip)
zip_file = zipfile.ZipFile(zip_file_bytes)

# Each CSV file is read from the archive as bytes and parsed into a DataFrame
feature_train_csv_data = io.BytesIO(zip_file.read('features_train.csv'))
features_train = pd.read_csv(feature_train_csv_data)

label_train_csv_data = io.BytesIO(zip_file.read('labels_train.csv'))
labels_train = pd.read_csv(label_train_csv_data)

feature_test_csv_data = io.BytesIO(zip_file.read('features_test.csv'))
features_test = pd.read_csv(feature_test_csv_data)

label_test_csv_data = io.BytesIO(zip_file.read('labels_test_true.csv'))
labels_test = pd.read_csv(label_test_csv_data)


display(features_train)
display(labels_train)

# Assignment

Your task is to classify the recordings by speaker gender - *male* or *female*.
To perform classification, we suggest you use the [Scikit-Learn](http://scikit-learn.org/stable/index.html) package, which offers a rich variety of classifiers and regressors, as well as data preprocessing and manipulation tools, performance metrics and more.

Please review the code below. No need to make any changes.


In [None]:
import sklearn
from sklearn import preprocessing
import numpy as np

le = preprocessing.LabelEncoder()
le.fit(labels_test.to_numpy())

x_train = features_train.to_numpy()
# Encode each annotator's column of labels as integers
y_train_annorators = np.zeros_like(labels_train.to_numpy(), dtype=int)
for annotator in range(y_train_annorators.shape[1]):
  y_train_annorators[:, annotator] = \
    le.transform(labels_train.to_numpy()[:, annotator])

x_test = features_test.to_numpy()
y_test = le.transform(labels_test.to_numpy())
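
For readers unfamiliar with `LabelEncoder`, the following minimal sketch shows what the encoding step does. The label strings used here are an assumption made for illustration; the actual values stored in the CSV files may differ. The names `toy_labels` and `toy_le` are hypothetical.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical label values; the strings in the real CSV files may differ
toy_labels = np.array(['male', 'female', 'female', 'male'])

toy_le = preprocessing.LabelEncoder()
toy_le.fit(toy_labels)

# The encoder stores the distinct classes in sorted order and maps each
# label to its index in that sorted list
encoded = toy_le.transform(toy_labels)

# inverse_transform maps the integers back to the original labels
decoded = toy_le.inverse_transform(encoded)

print(toy_le.classes_)
print(encoded)
print(decoded)
```

Because the classes are sorted, the integer assigned to each class depends only on the set of label strings, not on the order in which they appear in the data.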

# Part 1 - Ground Truth Learning

In this part, we wish to unify the feedback from the 5 annotators into a single ground truth.
Please add code below, so that *y_train* contains a single column with the best approximation for ground truth for the training data you can achieve.

If you wish to skip this part, feel free to use only the first annotator, by uncommenting the code below.

In [None]:
y_train = np.zeros(y_train_annorators.shape[0])

################################################################################
#                                                                              #
# Enter your ground truth estimation code here                                 #
#                                                                              #
################################################################################


# TO SKIP SECTION: Uncomment the code below
# y_train = le.transform(labels_train.to_numpy()[:, 0])
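
One simple baseline for combining annotators is a majority vote. The sketch below runs on a hypothetical toy array (`toy_annotations`), not on the assignment data, and is only one possible starting point rather than the expected solution. It also computes how often each annotator agrees with the majority, which can serve as a rough proxy for annotator reliability.

```python
import numpy as np

# Hypothetical toy annotations: 4 recordings (rows) x 5 annotators (columns),
# already encoded as 0/1
toy_annotations = np.array([
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 0, 1],
])

# With an odd number of annotators and binary labels there are no ties:
# a recording gets label 1 when more than half of the annotators voted 1
votes_for_one = toy_annotations.sum(axis=1)
toy_majority = (votes_for_one > toy_annotations.shape[1] / 2).astype(int)

# Fraction of recordings on which each annotator agrees with the majority
agreement = (toy_annotations == toy_majority[:, None]).mean(axis=0)

print(toy_majority)
print(agreement)
```

A plain majority vote treats all annotators as equally skilled. Since the annotators here are expected to differ in skill, weighting the votes by an estimate of each annotator's reliability is a possible refinement.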

# Part 2 - Classification

In this part, you will perform classification using the ground truth you've learned.
Please choose a classifier and train it on the training data using the code below.

In [None]:
from sklearn.metrics import accuracy_score

classifier = None  
# this should be your classifier model, as represented by a 
# Scikit-learn classifier object

################################################################################
#                                                                              #
# Enter your classifier training code here                                     #
#                                                                              #
################################################################################

y_train_pred = classifier.predict(x_train)
y_test_pred = classifier.predict(x_test)
print("Accuracy on training set: {:4.2f}"
      .format(accuracy_score(y_train, y_train_pred)))
print("Accuracy on test set: {:4.2f}"
      .format(accuracy_score(y_test, y_test_pred)))
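
As a rough illustration of the training step, the sketch below fits a Scikit-learn pipeline on synthetic, imbalanced data. The data, the names `x_demo`, `y_demo` and `demo_classifier`, and the choice of logistic regression are all assumptions made for this example; any Scikit-learn classifier with `fit` and `predict` methods could be used instead. The `class_weight='balanced'` option reweights samples inversely to class frequency, which is one way to counter class imbalance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x_train / y_train: two shifted Gaussian clouds
# with a 3:1 class imbalance
rng = np.random.RandomState(0)
x_demo = np.vstack([rng.normal(0.0, 1.0, size=(300, 10)),
                    rng.normal(1.0, 1.0, size=(100, 10))])
y_demo = np.array([0] * 300 + [1] * 100)

# Standardize the features, then fit a class-weighted logistic regression
demo_classifier = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight='balanced', max_iter=1000),
)
demo_classifier.fit(x_demo, y_demo)

# Score on the training data itself, only to show the API; a held-out
# set is needed for an honest estimate
demo_pred = demo_classifier.predict(x_demo)
demo_score = balanced_accuracy_score(y_demo, demo_pred)
print("Balanced accuracy on demo data: {:4.2f}".format(demo_score))
```

In the notebook, the fitted object would be assigned to `classifier` so that the evaluation code in the cell above can call `classifier.predict`.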

# Part 3 - Dimensionality Reduction (Optional)

The data contains 100 features, but not all of them may be necessary to achieve high performance. Please try to reduce the dimensionality of the input data without reducing classification accuracy. Feel free to edit your code above.
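
One common way to reduce dimensionality is principal component analysis (PCA). The sketch below is an illustration on synthetic data, not a prescribed solution: it builds 20 correlated features from 3 hidden factors and lets PCA keep the smallest number of components that explains at least 95% of the variance. The names `x_demo`, `reducer` and `n_kept`, and the 95% threshold, are assumptions chosen for the example. Feature selection methods are an alternative that keeps a subset of the original features instead of creating new ones.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 samples, 20 features generated from 3 latent factors
# plus a small amount of noise
rng = np.random.RandomState(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
x_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

# A float n_components in (0, 1) asks PCA to keep the smallest number of
# components whose cumulative explained variance reaches that fraction
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
x_reduced = reducer.fit_transform(x_demo)

n_kept = reducer.named_steps['pca'].n_components_
print("Kept {} of {} dimensions".format(n_kept, x_demo.shape[1]))
```

When used together with a classifier, the reduction step should be fitted on the training data only and then applied to the test data, for example by placing it inside the same pipeline as the classifier.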