---
---
Problem Set 10: Machine Learning III

Applied Data Science using Python

New York University, Abu Dhabi

Out: 28 Nov 2023 || **Due: 07 Dec 2023 at 23:59**

---
---
# Start Here
## Learning Goals
### General Goals
- Learn the fundamental concepts of applied machine learning
- Learn the basic and advanced machine learning models


### Specific Goals
- Learn to apply different models for classification and regression
- Learn about different modalities of data
- Learn about the trade-offs between performance and computation

## Collaboration Policy
- You are allowed to talk with / work with other students on homework assignments.
- You can share ideas but not code, analyses or results; you must submit your own code and results. All submitted code will be compared against all code submitted this and previous semesters and online using MOSS. We will also critically analyze the similarities in the submitted reports, methodologies, and results, **but we will not police you**. We expect you all to be mature and responsible enough to finish your work with full integrity.
- You are expected to comply with the [University Policy on Academic Integrity and Plagiarism](https://www.nyu.edu/about/policies-guidelines-compliance/policies-and-guidelines/academic-integrity-for-students-at-nyu.html). Violations may result in penalties, such as failure in a particular assignment.

## Late Submission Policy
You can submit the homework up to 3 days late. However, we will deduct **20 points** from your homework grade **for each late day you take**. We will not accept the homework after 3 late days.

## Distribution of Class Materials
These problem sets and recitations are the intellectual property of NYUAD, and we request that students **not** distribute them or their solutions to other students who have not signed up for this class and/or intend to sign up in the future. We also request that you do not post these problem sets and recitations online or on any public platform.

## Disclaimer
The number of points does not necessarily signify or correlate with the difficulty level of the tasks.

## Submission
You will submit all your code as a Python Notebook through [Brightspace](https://brightspace.nyu.edu/) as **P11_YOUR NETID.ipynb**.

---




# General Instructions
This homework is worth 100 points. It has 3 parts. Below each part, we provide a set of concepts required to complete that part. All the parts need to be completed in this Jupyter (Colab) Notebook. **Start this homework early as the dataset is huge, and modeling will take a lot more time than your previous homeworks and recitations.**

*For this homework, we will not explicitly tell you which models to use. Now that you are aware of the different machine learning models, all the models you have learnt are fair game.*



# Introduction

## Modalities

There are typically (though arguably) 6 different **modalities** or modes of data that a data scientist encounters in their work: (i) numbers; (ii) text; (iii) images; (iv) audio/speech; (v) video, and (vi) graphs/networks.

There are separate fields and courses pertaining to each of these modalities offered at many academic institutions. For example, if you want to learn about text, you would take a *Natural Language Processing (NLP)* or a *Text Processing* course. For in-depth understanding of images and video, you would take a *Multimedia Analysis* or a *Computer Vision (CV)* course. To learn about audio or speech, you would take a *Speech Processing* or *Audio Processing* course. To learn about graphs and networks, you would take a *Network Science* or a *Graph Theory* course. These courses cover in-depth understanding of how to *process* and *represent* data of a given modality, and also teach the basic challenges within each field. That said, as of 2021, most of the state-of-the-art methods for dealing with all of these modalities have converged on using *neural networks* in some form, and so to learn the state-of-the-art methods in neural networks, you would take a *Deep Learning* course. Still, learning the basics of how to *represent* data from different modalities is key to being successful in that particular field.

In most of your assignments so far, you have dealt with the first modality, i.e. numbers. This is by design, as all the other modalities can typically be converted to (scalars or vectors or some form of) numbers, and so dealing with other modalities mostly requires you to *transform* your data to numbers: *textual data* can be converted to frequency values or other vector representations, as you have briefly seen in P7 and R12; *images* typically get converted to pixel values or other vector representations; *videos* are basically images in time; *speech/audio* typically gets converted to spectrograms, which are basically similar to images; finally, *networks* can be represented as lists, matrices or vector representations. In the end, all data is just numbers. Arguably, most data can just be converted to vectors or, as you now know, *embeddings*. This has led to papers such as [*Word2Vec*](https://arxiv.org/abs/1301.3781) for text, [*wav2vec*](https://ai.facebook.com/blog/wav2vec-state-of-the-art-speech-recognition-through-self-supervision/) for speech, [*node2vec*](https://dl.acm.org/doi/abs/10.1145/2939672.2939754) and [*graph2vec*](https://arxiv.org/abs/1707.05005) for graphs and networks, and so on.

To reiterate, an *embedding* is a relatively *low-dimensional* space into which you can translate high-dimensional vectors/data. Some images are big, some images are small, some images are black and white, some are RGB. Similarly, some voice recordings are long, some are short. Embeddings are a way to (i) represent high-dimensional data in a compact form that is easy to work with in downstream tasks, and (ii) standardize the representation across different types of data. As a result, they make relationships between different instances easier to capture and more *meaningful*, like you saw in the case of *word embeddings* in the *visualization* assignment.

In this assignment, you will use *embeddings* not for words, but instead for *speech* (or *speakers*) for the task of *speaker recognition/identification*, *speaker gender recognition*, and *speaker age estimation*.



## Motivation: Solving Voice Crimes Using d-vectors

The [United States Coast Guard (USCG)](https://en.wikipedia.org/wiki/United_States_Coast_Guard) handles about 16,000 *Mayday* calls for help every year. Every distress call the Coast Guard receives compels the federal agency to launch an expensive search-and-rescue effort involving at a minimum a small rescue boat, a C-130 fixed-wing aircraft or rescue helicopter, and the several Coast Guardsmen to operate them. The cost of each outing can run from \$10,000 to \$250,000. Small boats typically cost \$4,500 per hour to operate, whereas the helicopters can cost about \$16,000.

Unfortunately,  **about 1 percent of these distress calls are fake**. The men and women of the Coast Guard put themselves at risk every time the surface and air assets respond to a call for assistance. **Hoax callers** place Coast Guardsmen at unnecessary risk. Furthermore, hoax calls interfere with legitimate search and rescue cases, diverting assets from being available to help actual mariners in distress.

The penalty for transmitting a hoax distress call to the Coast Guard is up to six years in prison, a \$250,000 fine, a \$5,000 civil fine, and reimbursement to the Coast Guard for the cost of performing the search. Nevertheless, recently, an anonymous caller cost the U.S. Coast Guard roughly \$500,000 by sending first responders on unnecessary rescue missions 28 times through hoax calls. The hoax caller made 28 calls, every time speaking in a different voice to fool the USCG. *True story*. <sup>1</sup>

Unfortunately, the USCG does not have a system to be able to identify the voice of the person making the call. If they had such a system, every time a distress call came, they could have matched it to the voices of the callers from the past to make sure it was a legit call. If a caller had been red-flagged in the past, such a system would have been able to identify if the same person made the call again just by using a voice-matching system.

Your voice is like a fingerprint, and a voice-matching system is typically robust enough to not be fooled even if you try to change your voice.

The USCG has thus reached out to NYUAD's data scientists for help in creating a system through which they can recognize speakers through their voices. Because most fake callers do not speak in their own voice, USCG is not even able to make out other states of the caller such as their gender. Therefore USCG would want to go one step ahead and also create two other machine learning models that are able to detect the **gender**, and the **age** of the speaker.

In this homework, you will use *speaker embeddings* which we will call *d-vectors* ("d" representing the deep neural networks) as features to accomplish three tasks by training three machine learning models:

- *Voice2SpeakerID: model for predicting the ID of the speaker from voice*

- *Voice2Gender: model for predicting gender of the speaker from voice*

- *Voice2Age: model for estimating the age of the speaker from voice*

These three models will be created using the machine learning techniques you have learnt in the class.

------------------
<sup>1. Well, this is actually a [true story](https://www.dhs.gov/science-and-technology/news/2017/09/26/snapshot-voice-forensics-can-help-coast-guard-catch-hoax) :) </sup>



## Dataset

The dataset you are given is a preprocessed version of voice recordings collected by the United States Coast Guard, and is attached with the handout in three parts:

- 1. **voice_forensics_train.csv**: This file contains the training dataset that you will use across all three tasks. There are **7078** speakers in the dataset, with each speaker having **4** recordings and hence 4 rows. As a result, there are **28312** rows. It has 132 columns. These columns are:
    - *Id:* This is the unique identifier for each row.
    - *speaker_id:* This is the unique identifier for each **speaker**.
    - *gender*: This is the self-identified gender of the speaker as "male" for male, or "female" for female.
    - *age*: This is the age of the speaker.
    - *d1,d2,...d128*: These values represent the 128-dimensional d-vector for the current speaker's current recording. Because each speaker has 4 recordings, each speaker has 4 different d-vectors.<sup>2</sup>

- 2. **voice2speakerid_test_x.csv**: This is the test dataset for the **Voice2SpeakerID** model. The labels are hidden for this dataset, as you will use your model to predict the labels for this dataset, and submit to Kaggle. This file contains 14156 rows, and 129 columns. These columns are:
    - *Id:* This is the unique identifier for each row. (This is the identifier you will use for submission to Kaggle.)
    - *d1,d2,...d128*. These values represent the 128-dimensional d-vector for the current speaker's current recording. You have to use these features to **match** the *speaker_id* from your training set.

- 3. **voice2genderage_test_x.csv**: This is the test dataset for the **Voice2Gender** and **Voice2Age** models. The labels are hidden for this dataset, as you will use your models to predict the labels for this dataset, and submit to Kaggle. This file contains 8802 rows, and 129 columns. These columns are:
    - *Id:* This is the unique identifier for each row. (This is the identifier you will use for submission to Kaggle.)
    - *d1,d2,...d128*. These values represent the 128-dimensional d-vector for the current speaker's current recording. You have to use these features to predict the *gender* and *age* of the speaker.
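
As a minimal sketch of how the training table described above might be separated into features and targets: the helper name `split_features_targets` and the tiny synthetic frame are illustrative assumptions, not part of the handout. With the real data you would pass `pd.read_csv("voice_forensics_train.csv")` to the helper instead of the toy frame.

```python
import numpy as np
import pandas as pd

# Column names follow the dataset description: d1, d2, ..., d128.
FEATURE_COLS = [f"d{i}" for i in range(1, 129)]


def split_features_targets(df):
    """Return the d-vector matrix and the three target columns (hypothetical helper)."""
    X = df[FEATURE_COLS].to_numpy()
    return X, df["speaker_id"], df["gender"], df["age"]


# Synthetic stand-in for the real file: 3 speakers x 4 recordings each.
rng = np.random.default_rng(0)
n_speakers, n_recordings = 3, 4
toy = pd.DataFrame(
    rng.normal(size=(n_speakers * n_recordings, 128)), columns=FEATURE_COLS
)
toy.insert(0, "Id", np.arange(len(toy)))
toy.insert(1, "speaker_id", np.repeat(np.arange(n_speakers), n_recordings))
toy.insert(2, "gender", np.repeat(["male", "female", "female"], n_recordings))
toy.insert(3, "age", np.repeat([31, 45, 27], n_recordings))

X, y_speaker, y_gender, y_age = split_features_targets(toy)
print(X.shape)                           # (12, 128)
print(toy.groupby("speaker_id").size())  # recordings per speaker
```

The same feature matrix `X` is reused for all three tasks; only the target column changes.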

The 128-dimensional d-vectors are extracted from a complex deep learning based speaker identification model from this [paper](http://mlsp.cs.cmu.edu/people/rsingh/docs/pairwiseloss.pdf). In simple terms, all audio recordings are converted to [spectrograms](https://en.wikipedia.org/wiki/Spectrogram), which are then used to train a neural network model. Once the neural network is trained, a *d-vector* or *embedding* is extracted by passing each audio recording one-by-one through the trained neural network as shown in the diagram below.

![dvector](https://drive.google.com/uc?id=1abh5I0U3wI1LR3myScD4dpIdvjYMeqjP)

**If you do not understand this, *THAT'S OK*. For the purposes of this assignment, think of each d-vector as a feature representation of an audio/speech recording of a person which you will use to create your models. Similar to how words can be represented using word-embeddings or a BoW embedding, audio/speech recordings can be represented as d-vectors or audio/speech embeddings**.

You will use the same dataset for training across all three tasks. Your independent variable or *features* will remain the same for **all** the tasks (unless you augment some new features). What will change then? (i) the target/dependent variable; (ii) the model and the model parameters; and of course (iii) the evaluation metrics and performance based on whether the task is a classification task or a regression task.

-------

<sup>2. We saw in PS7 that an interesting property of **embeddings** is that they form meaningful relationships. We saw this concretely in the case of word embeddings where in a scatter plot of words, words which were *similar* formed clusters. This is a property that you will also find within *d-vectors*. More concretely, in this dataset, we have given you 4 d-vectors for each speaker. What you will notice is that if you take the **euclidean distance** of two d-vectors from the same speaker, and the **euclidean distance** of any of those two d-vectors to a d-vector from another speaker, the euclidean distance between d-vectors from same speaker will be much smaller i.e. **d-vectors from the same speaker will be closer to each other**. If you are curious, you can check this for yourself by selecting a few random speakers and d-vectors, and using scikit-learn's **[euclidean distance metric](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html)** to compute distances. This should also give you a clue on how to use these d-vectors to create a model for **Voice2SpeakerID** task. :)</sup>
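
The distance check suggested in the footnote above can be sketched as follows. The data here is synthetic and deliberately constructed so that recordings from the same "speaker" lie close together, so it only illustrates the computation; it says nothing about the real d-vectors. All variable names are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.default_rng(42)

# Synthetic stand-in: each "speaker" is a random centre in 128-d space,
# and each recording is that centre plus a small amount of noise.
n_speakers, n_recordings, dim = 5, 4, 128
centres = rng.normal(size=(n_speakers, dim))
noise = 0.1 * rng.normal(size=(n_speakers * n_recordings, dim))
X = np.repeat(centres, n_recordings, axis=0) + noise
speaker_ids = np.repeat(np.arange(n_speakers), n_recordings)

D = euclidean_distances(X)  # (20, 20) matrix of pairwise distances

# Boolean masks: pairs from the same speaker, excluding each row with itself.
same_speaker = speaker_ids[:, None] == speaker_ids[None, :]
off_diagonal = ~np.eye(len(X), dtype=bool)

within = D[same_speaker & off_diagonal].mean()
between = D[~same_speaker].mean()
print(f"mean within-speaker distance:  {within:.3f}")
print(f"mean between-speaker distance: {between:.3f}")
```

On the real training set, you could run the same comparison on the rows of a few randomly chosen `speaker_id` values.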


## Evaluation Metrics
Once you have trained your model, it is very important to evaluate it using the *right* metric. For example, accuracy, while widely used, is generally *not* a good metric, especially when the data is not balanced across the different classes. *Precision*, *Recall*, *F1*, and *Area Under the Curve (AUC)* are almost always considered better choices. For regression tasks, *Mean Absolute Error (MAE)*, *Root Mean Squared Error (RMSE)*, and *R-Squared (R2)* are some of the popular choices. Familiarize yourself with these metrics by reading about them online. Here are some articles that will be useful for understanding the **[ROC AUC Score](https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/)** and the **[F1 score](https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9)**.
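
To make the metrics named above concrete, here is a small sketch using scikit-learn on made-up labels and predictions (the arrays are hypothetical and unrelated to the assignment data). RMSE is computed as the square root of `mean_squared_error` so that the snippet works across scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# Hypothetical classification labels and predictions (imbalanced on purpose).
y_true = np.array(["female", "female", "male", "male", "male", "male"])
y_pred = np.array(["female", "male", "male", "male", "male", "male"])

acc = accuracy_score(y_true, y_pred)
# With string labels, tell scikit-learn which class counts as "positive".
f1_female = f1_score(y_true, y_pred, pos_label="female")
# Macro-averaging gives every class the same weight.
f1_macro = f1_score(y_true, y_pred, average="macro")
print(f"accuracy={acc:.3f}  F1(female)={f1_female:.3f}  macro-F1={f1_macro:.3f}")

# Hypothetical regression targets and predictions (ages in years).
age_true = np.array([25.0, 40.0, 33.0, 58.0])
age_pred = np.array([28.0, 37.0, 33.0, 50.0])

mae = mean_absolute_error(age_true, age_pred)
rmse = np.sqrt(mean_squared_error(age_true, age_pred))
r2 = r2_score(age_true, age_pred)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

In this toy example the accuracy is 5/6 while the F1 score for the minority class is only 2/3, which is the kind of gap that makes accuracy misleading on imbalanced data.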

## Hints:

- We highly recommend that you do the parts in the order they are presented in the problem set.

- The problem sets that you have worked on so far were not easy, but they were all the same in one aspect: *they required less computation, and a lot more thinking*. This problem set is different from all the others because it revolves around a large dataset and requires more computation and arguably less thinking. If you ever want to pursue a career in machine learning, you will deal with datasets that are at least as large as this one, and will face the same problems. We could have given you a small dataset instead, but a big goal of this homework is for you to be able to deal with large real-world datasets.

- Because it’s a large dataset, it is likely that your models will take more time to train (and test). Running time can range from a minute to several hours if you are using grid search over many parameters. The strategy for this homework, therefore, is to start with simpler models without grid search and parameter tuning (models that train and predict quickly), and then gradually add grid search and other parameters if needed. If you cannot figure out via intuition which models are simple and likely to run quickly, you can figure this out empirically by trying them out on a small subset of your training set (i.e. via sampling). If your model is taking more than 10 minutes, then it’s not a simple model; try something simpler first. Only if you try simpler models and are not able to pass the baselines/benchmarks should you move to complex models and/or more parameters.

- If you use grid search, and it takes a lot of time, you should try smaller values for your number of folds.

- Choose your loss and scoring parameters wisely, based on the metric you want to optimize. For example, if the metric is *root mean squared error*, the loss you would like minimized is *L2*. Think why.

- If any of your models use regularization, limit the range of parameters. A rule of thumb is to try values in *powers of 10* first (e.g. `0.001, 0.01, 0.1, 1, 10, 100`), and then gradually move to a finer range if needed.

- Part 3 is **not** a classification task. It is a regression task, so you should try models that perform regression. Most of the classification models you have learnt in class have corresponding regression counterparts in scikit-learn. For example, for an SVM, you have [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) (which is a classifier), and you also have [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) (which is a regressor).

- If you want to use SVM with a linear kernel, use [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) and [LinearSVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html) modules instead of SVC and SVR modules, as the former will run faster.

- Normalizing or Standardizing the data is not needed for this assignment. This is because the extracted *d-vectors* were already normalized by the neural network.

- For each task, we ask you to submit at least 2 models. This is so that you try different models. If you are lucky and achieve your goal or target performance by simply trying a single model, we do not expect you to spend a lot of time optimizing the second model as long as it performs reasonably. It’s not required for your second model to pass the baselines or benchmarks. We just want to see that you tried a second model with reasonable parameters.

- For the last part on Voice2Age, passing Baseline C may require more than just modeling i.e. creatively thinking about the dataset.
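
Several of the hints above (time a simple model on a small sample first, keep the number of folds small, and search regularization values in powers of 10) can be combined into one workflow. The following is only a sketch on synthetic data generated with `make_classification`; the sample sizes, the choice of `LinearSVC`, and the grid are illustrative assumptions, not a recommended solution.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for a large training set (sizes are arbitrary).
X, y = make_classification(
    n_samples=5000, n_features=128, n_informative=20, random_state=0
)

# 1) Time a simple model on a small random subset first.
rng = np.random.default_rng(0)
subset = rng.choice(len(X), size=500, replace=False)
start = time.perf_counter()
# dual=False is generally preferred when there are more samples than features.
LinearSVC(C=1.0, dual=False).fit(X[subset], y[subset])
elapsed = time.perf_counter() - start
print(f"fit on {len(subset)} rows took {elapsed:.2f}s")

# 2) Only then add a coarse grid search: powers of 10, few folds.
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(LinearSVC(dual=False), param_grid, cv=3, scoring="accuracy")
search.fit(X[subset], y[subset])
print(search.best_params_, round(search.best_score_, 3))
```

If the timing on the subset is acceptable, the same code can be rerun on progressively larger samples before committing to the full training set.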

# Part I: Voice2SpeakerID (35 points)

Create a model that predicts the *id* of the speaker based on the *d-vector* of their voice as shown in the figure below:

![voice2speakerid](https://drive.google.com/uc?id=12nyrSxkbp-wMznX6hYWN3W70UQvSK4a8)

This is a **matching** task i.e. we will **not** test your model on speakers that are not represented in the provided training set.


## Prompt

More concretely, you will use the d-vectors in `voice_forensics_train.csv` to fit at least two models for the prediction of speaker id from speaker embedding. You will then predict the `speaker_id` of the d-vectors in `voice2speakerid_test_x.csv` and submit the predictions to Kaggle.

We have provided you with a sample file as `voice2speakerid_solution_sample.csv`. Use the format of that file to submit your predictions.

We have provided three baselines on Kaggle based on the `accuracy` metric which you will need to beat to score full points on this task.

For this part, you will submit your code as part of the notebook, and will also submit the `voice2speakerid_solution.csv` file with your predictions to Brightspace as well as to Kaggle.
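
A minimal sketch of writing a predictions file with pandas is shown below. The column layout (`Id` plus one target column) is an assumption based on the dataset description; always check it against the provided sample file. The helper `make_submission` and the toy values are hypothetical.

```python
import os
import tempfile

import pandas as pd


def make_submission(ids, predictions, target_col):
    """Pair each test-row Id with its prediction (hypothetical helper)."""
    if len(ids) != len(predictions):
        raise ValueError("need exactly one prediction per test row")
    return pd.DataFrame({"Id": list(ids), target_col: list(predictions)})


# Toy values standing in for the test Ids and a model's predictions.
submission = make_submission([0, 1, 2], [17, 4, 17], target_col="speaker_id")

# Written to a temporary folder here; in the notebook you would write
# voice2speakerid_solution.csv next to your code instead.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "voice2speakerid_solution.csv")
    submission.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    written = pd.read_csv(path)

print(written)
```

The same helper could be reused for the later parts by changing `target_col`, again assuming the sample files follow the same two-column layout.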

In [None]:
# Write your solution here

############# SOLUTION ###############

############ SOLUTION END #############

## *Concepts and tools required to complete this task*

*   Basics of machine learning
*   Classification
*   Understanding of *embeddings*
*   Critical Thinking
*   Scikit-Learn
*   Pandas and numpy




## Rubric

- +10 points for logical and reasonable steps for training and testing the models using the techniques taught in the course, and for well-documented code and evaluation of **at least two models**, at least one of which makes the same predictions as submitted on Kaggle
- +5 points for proper comments
- +5 points for achieving/crossing Baseline A
- +5 points for achieving/crossing Baseline B
- +10 points for achieving/crossing Baseline C



# Part II: Voice2Gender (30 points)

In general, women speak at a higher pitch—about an octave higher than men. An adult woman's average range is from 165 to 255 Hz, while a man's is 85 to 155 Hz. Men's voices are generally deeper. There are many other differences in the voice quality of men and women that make them identifiably different. Identifying the gender from voice is the most fundamental task -- one that is useful in many other downstream tasks.

Create a binary classifier which takes in the *d-vector* as input, and is able to predict the *gender* of the speaker as output, as shown in the diagram below:

![voice2gender](https://drive.google.com/uc?id=1dyeFlOuqIDcGOXpNsRuf_I-oPI1qxp_D)





## Prompt

More concretely, you will use the d-vectors in `voice_forensics_train.csv` to fit at least two models for the prediction of gender from speaker embedding. You will then predict the `gender` as `female` or `male` of the d-vectors in `voice2genderage_test_x.csv` and submit the predictions to Kaggle.

We have provided you with a sample file as `voice2gender_solution_sample.csv`. Use the format of that file to submit your predictions.

We have provided three baselines on Kaggle based on the `F1` metric which you will need to beat to score full points on this task.

For this part, you will submit your code as part of the notebook, and will also submit the `voice2gender_solution.csv` file with your predictions to Brightspace as well as to Kaggle.


In [None]:
# Write your solution here

############# SOLUTION ###############


############ SOLUTION END #############

## *Concepts and tools required to complete this task*

*   Basics of machine learning
*   Classification
*   Supervised learning
*   Understanding of *embeddings*
*   Critical Thinking
*   Scikit-Learn
*   Pandas and numpy


## Rubric

- +10 points for logical and reasonable steps for training and testing the models using the techniques taught in the course, and for well-documented code and evaluation of **at least two models**, at least one of which makes the same predictions as submitted on Kaggle
- +5 points for achieving/crossing Baseline A
- +5 points for achieving/crossing Baseline B
- +10 points for achieving/crossing Baseline C




# Part III: Voice2Age (35 points)

Because *age* is a continuous variable, this is a regression task. Create a regression model that predicts the *age* of the speaker from their voice *d-vector*.

![voice2age](https://drive.google.com/uc?id=1htIuj-fW_LYH4_WZ_3LczOIRldkMt92z)


## Prompt

More concretely, you will use the features in `voice_forensics_train.csv` to fit at least two models for the estimation of age from speaker embedding. You will then predict the `age` of the speaker corresponding to the given d-vectors in `voice2genderage_test_x.csv` and submit the predictions to Kaggle.

We have provided you with a sample file as `voice2age_solution_sample.csv`. Use the format of that file to submit your predictions.

We have provided three baselines on Kaggle based on the `RMSE` metric which you will need to beat to score full points on this task.

For this part, you will submit your code as part of the notebook, and will also submit the `voice2age_solution.csv` file with your predictions to Brightspace as well as to Kaggle.
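
Before submitting, it can help to estimate RMSE locally and compare it against a trivial predictor. The sketch below uses synthetic data from `make_regression`; `Ridge` and `DummyRegressor` are stand-ins chosen for illustration, and the always-predict-the-mean model is a local sanity check, not one of the Kaggle baselines. The helper `cv_rmse` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem with 128 features (sizes are arbitrary).
X, y = make_regression(
    n_samples=1000, n_features=128, n_informative=20, noise=10.0, random_state=0
)


def cv_rmse(model, X, y, folds=3):
    """Mean cross-validated RMSE (scikit-learn reports it negated)."""
    scores = cross_val_score(
        model, X, y, cv=folds, scoring="neg_root_mean_squared_error"
    )
    return -scores.mean()


rmse_dummy = cv_rmse(DummyRegressor(strategy="mean"), X, y)
rmse_ridge = cv_rmse(Ridge(alpha=1.0), X, y)
print(f"always-predict-the-mean RMSE: {rmse_dummy:.2f}")
print(f"ridge regression RMSE:        {rmse_ridge:.2f}")
```

A model whose cross-validated RMSE is not clearly below that of the mean predictor has learned little from the features.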


In [None]:
# Write your solution here

############# SOLUTION ###############

############ SOLUTION END #############

## *Concepts and tools required to complete this task*

*   Basics of machine learning
*   Classification versus regression
*   Supervised learning
*   Understanding of *embeddings*
*   Critical Thinking
*   Scikit-Learn
*   Pandas and numpy

## Rubric

- +10 points for logical and reasonable steps for training and testing the models using the techniques taught in the course, and for well-documented code and evaluation of **at least two models**, at least one of which makes the same predictions as submitted on Kaggle
- +5 points for proper comments
- +5 points for achieving/crossing Baseline A
- +5 points for achieving/crossing Baseline B
- +10 points for achieving/crossing Baseline C


# Final Remarks

## Model Selection
There are many supervised machine learning models for the tasks of *classification* and *regression*: k-nearest neighbours, decision trees, random forests, logistic/linear regression, support vector machines, neural networks, polynomial regression, and so on. Some models are specific to classification, some to regression, and some apply to both. There are many models to choose from, and for each model, there are **many** parameters to tune. Of course, you can brute force, and try *all* the models in the world, and tune *all* the parameters on *all* the possible values. That is not only infeasible, but would also result in some heavy overfitting.
So how do you choose the right model for a given problem? This is a really big question. The ability to choose the right model, the right architecture, and tune the right set of parameters efficiently requires deep understanding of the task at hand, years of practice, and knowledge of the literature.

## Embeddings
Notice how, across all the tasks, you did not really have to deal with any **".wav"** or **".mp3"** files, or any sort of audio/speech processing. In fact, if we had not told you that these *embeddings* or *d-vectors* corresponded to voice recordings, or had not provided you with the Coast Guard context, you would have had no way of knowing. To accomplish the tasks above, all you needed to know was that these vectors of numbers are *features* to be used in your model, very similar to PS9 where your features were unknown. That is precisely the beauty of *embeddings* in general: they abstract away modality-related details so that a data scientist can focus on the task of *modeling*. That said, it is also important to acknowledge that many recent machine learning or deep learning models are actually *end-to-end*, and don't use embeddings at all; for example, an audio spectrogram goes into the neural network as input, and out comes the prediction. Of course, all of these statements are simplified versions of complex details, understanding of which requires significant domain knowledge.

In the context of embeddings though, a wide variety of research is being conducted on what is the best embedding for a given modality and type of data -- the reason why we went from Word2Vec to GloVe to ELMo to BERT to [RoBERTa](https://arxiv.org/abs/1907.11692) to [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and [GPT-3](https://arxiv.org/abs/2005.14165) for textual data. As of the end of 2020, BERT, RoBERTa, GPT-2, and GPT-3 are the most common *pre-trained* deep learning based [*language models*](https://en.wikipedia.org/wiki/Language_model) being used for extracting word embeddings. Figuring out the optimal architecture for extracting embeddings for a given modality is an art in itself -- one that is essential in building real-world systems such as Alexa or Google Home, but teaching it is, unfortunately, out of the scope of this course. We encourage you to take courses in *deep learning* and *NLP* to learn these methods in detail as a follow-up to this course.