**1.1. In your own words, what does the role of a data scientist involve? (2 marks)**

A data scientist uses data to derive insights that can be used to solve problems and drive decision making (e.g., how to increase sales via different marketing methods). The role typically involves setting key questions to answer, figuring out how to address the question or solve a problem with data (inc. which data is needed and what will be done with it to get useful answers), primary data collection and/or secondary data sourcing, data combining, data cleaning, exploratory (and planned) data analysis, data visualisation, and predictive modelling using machine learning methods. They must also be able to effectively communicate findings, insights, and predictions to non-technical audiences (e.g., in reports or presentations).

**1.2. What is an outlier? (4 marks) Here we expect to see the following:
a. Definition
b. Examples
c. Should outliers always be removed? Why?
d. What are other possible issues that you can find in a dataset?**

a. An outlier is a value that differs substantially from the other values in a data set. There are different types of outliers, including "real" (or "natural") outliers and errors. Outlier definitions and methods for detecting them vary (e.g., an absolute z-score > 3, the IQR rule, box plots for continuous variables, count plots for categorical variables), as do methods for dealing with them (e.g., in some cases you might only deal with 'extreme' outliers).
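As a minimal sketch of the two numeric rules mentioned above, the following uses only Python's standard library. The function names, the thresholds (3 for the z-score, 1.5 for the IQR multiplier), and the `ages` values are illustrative assumptions, not part of the original answer.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / sd > threshold]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up survey ages with one implausible entry.
ages = [19, 20, 21, 22, 20, 23, 21, 19, 22, 20, 21, 200]

print(zscore_outliers(ages))
print(iqr_outliers(ages))
```

Note that in small samples a single extreme value inflates the standard deviation, so the z-score rule can miss outliers that the IQR rule still flags.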

b. An example of a "real" outlier might be if someone scores much higher (e.g., has a z-score > 3) than the average on a psychometric test; the score is not a mistake, the person who has contributed the data point is just much better at that test than the average person. An example of an error outlier might be if someone accidentally entered their age as 200 instead of 20 when answering a survey question; they are not actually 200 years old, they have just incorrectly entered the data point.

c. No, outliers should not always be removed. Deciding what to do with outliers depends on the type of outlier and how sensitive your analysis is to outliers. If an outlier is a mistake (e.g., somebody enters their age as 200), then it can either be removed or replaced with the median value for that variable as a "best guess" (e.g., median age). If an outlier appears to be "real", reflecting natural variation in the data, there are multiple methods for dealing with it. If your analysis is robust against outliers (e.g., tree-based models or non-parametric tests, which do not have a strict assumption of no outliers), you may choose to leave them as they are. Alternatively, you can run your analysis twice, once with and once without the outliers, to understand their impact on your findings and interpretation. Finally, you could choose to replace outliers with the median of the variable. Removing "real" outliers is not typically recommended without sound justification.
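The two options described above (replacing an error outlier with the median, and checking sensitivity by analysing the data with and without the outlier) can be sketched as follows. The validity rule `is_plausible_age` and the data are hypothetical, and the median here is computed from the valid values only, which is one reasonable choice rather than the only one.

```python
import statistics

def is_plausible_age(age):
    """Hypothetical validity rule for a survey age field."""
    return 0 <= age <= 120

def replace_errors_with_median(values, is_valid):
    """Replace values failing is_valid with the median of the valid values."""
    valid = [v for v in values if is_valid(v)]
    median = statistics.median(valid)
    return [v if is_valid(v) else median for v in values]

ages = [19, 20, 21, 22, 20, 23, 21, 200]  # 200 is a data-entry error

cleaned = replace_errors_with_median(ages, is_plausible_age)

# Sensitivity check: compare a summary statistic with and without the outlier.
mean_with = statistics.mean(ages)
mean_without = statistics.mean([v for v in ages if is_plausible_age(v)])

print(cleaned)
print(mean_with, round(mean_without, 2))
```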

d. Beyond outliers in your data, you may find duplicated data (e.g., from someone completing a survey twice accidentally), you may find missing values (e.g., someone skipping a question in a survey), you may find typos (e.g., someone spells their email address incorrectly), you may find data inconsistencies (e.g., two different ages associated with the same participant or customer ID), or that data is uninformative due to lack of variation.
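A rough sketch of how some of these issues could be detected is shown below, using plain Python on a small, made-up list of records. The column names and values are assumptions for illustration; in practice a library such as pandas would usually be used for the same checks.

```python
from collections import Counter, defaultdict

records = [
    {"id": 1, "age": 34, "email": "a@example.com", "country": "UK"},
    {"id": 2, "age": None, "email": "b@example.com", "country": "UK"},  # missing age
    {"id": 1, "age": 34, "email": "a@example.com", "country": "UK"},    # exact duplicate
    {"id": 3, "age": 51, "email": "c@example.com", "country": "UK"},
    {"id": 3, "age": 15, "email": "c@example.com", "country": "UK"},    # conflicting age
]

# Duplicated rows: identical in every field.
row_counts = Counter(tuple(sorted(r.items())) for r in records)
duplicates = [dict(row) for row, n in row_counts.items() if n > 1]

# Missing values: count None per column.
missing = Counter(key for r in records for key, value in r.items() if value is None)

# Inconsistencies: more than one distinct age recorded for the same id.
ages_by_id = defaultdict(set)
for r in records:
    if r["age"] is not None:
        ages_by_id[r["id"]].add(r["age"])
inconsistent_ids = sorted(i for i, ages in ages_by_id.items() if len(ages) > 1)

# Uninformative columns: only one distinct non-missing value.
constant_columns = [
    column for column in records[0]
    if len({r[column] for r in records if r[column] is not None}) == 1
]

print(duplicates, dict(missing), inconsistent_ids, constant_columns)
```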

**1.3 Describe the concepts of data cleaning and data quality. (4 marks) Here we expect to see the following:
a. What is data cleaning?
b. Why is data cleaning important?
c. What type of mistakes do we expect to commonly see in datasets?**

a. Data cleaning is the process of preparing your data set for error-free, valid, and reliable data visualisation, analysis, or modelling by ensuring its quality and integrity. High quality data is accurate, valid, complete, consistent, timely, and unique. Therefore, data cleaning involves checking for and ensuring these core elements of data quality. A typical data cleaning process would involve checking your data for issues (e.g., checking for missing values, outliers, typos, duplicated data, outdated data, assessing normality, assessing reliability, validity checks), correcting issues (e.g., imputing missing values and/or outliers with the median, removing duplicated data, fixing typos etc.), changing data types (e.g., changing string values to numeric or one-hot encoding of non-ordinal categorical data), and transforming data for analysis (including scaling, normalising, and standardising).
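To make the "check, correct, convert" steps above more concrete, here is a small sketch in plain Python. The raw rows, column names, and the `clean_row` helper are all hypothetical; the order of steps (tidy text, convert types, de-duplicate, impute) is one sensible arrangement, not a fixed recipe.

```python
import statistics

raw_rows = [
    {"customer id": " 001", "city": " london ", "spend": "12.50"},
    {"customer id": "002", "city": "Leeds", "spend": ""},          # missing spend
    {"customer id": "001", "city": "London", "spend": "12.50"},    # duplicate once cleaned
    {"customer id": "003", "city": "leeds", "spend": "30"},
]

def clean_row(row):
    """Tidy column names, trim and normalise text, convert spend to float."""
    spend = row["spend"].strip()
    return {
        "customer_id": row["customer id"].strip(),
        "city": row["city"].strip().title(),
        "spend": float(spend) if spend else None,  # None marks a missing value
    }

cleaned = [clean_row(r) for r in raw_rows]

# Remove exact duplicates while keeping the first occurrence.
seen, unique_rows = set(), []
for row in cleaned:
    key = tuple(row.items())
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

# Impute missing spend with the median of the observed values.
median_spend = statistics.median(
    [r["spend"] for r in unique_rows if r["spend"] is not None]
)
for row in unique_rows:
    if row["spend"] is None:
        row["spend"] = median_spend

print(unique_rows)
```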

b. Data cleaning is an essential part of ensuring that you are making valid and accurate conclusions from your data. If the data you feed into a model is rubbish (e.g., has errors or lots of duplicated or outdated data, or is based on unreliable or invalid measurements), then you can expect the same from the model's output, as reflected in the well-known phrase "Garbage In, Garbage Out".

c. Common mistakes in data sets include: missing values leading to incomplete cases, outliers from errors, duplicated data, mis-matched data (e.g., wrong customer ID assigned to data row during merging), typos (e.g., extra spaces, spelling errors), column names with spaces, unreliable measurements, low validity measurements, and outdated data.

**1.4 Discuss what is Unsupervised Learning - Clustering in Machine Learning using an example. (7.5 marks) Here we expect to see the following:
a. Definition.
b. When is it used?
c. What is a possible real-world application of unsupervised learning?
d. What are its main limitations?**

a. Unsupervised learning is when you train a model on unlabeled data so that it can learn and identify the hidden structures or patterns in the data. Clustering is a specific type of unsupervised learning, where the model groups data points into "clusters" based on similarity or distance metrics, thereby identifying natural groupings in the data. A cluster is a set of data points grouped together because of their similarities along a set of dimensions (e.g., in the figure below, the two clusters are grouping different types of animals together based on their similarity). Different algorithms use different types of distance metrics (e.g., Euclidean) to identify these groupings.

![image.png](attachment:image.png)

Source of image: https://blog.bismart.com/en/classification-vs.-clustering-a-practical-explanation
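To illustrate how a distance-based clustering algorithm forms groups, here is a minimal sketch of k-means (Lloyd's algorithm) with Euclidean distance in plain Python. The function name, the toy 2-D points, and the choice of initialising centroids from random data points are assumptions for illustration; a production analysis would typically use a library implementation instead.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: assign points to the nearest centroid, then update centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of made-up 2-D points.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (8.1, 7.9), (7.8, 8.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

No labels are supplied at any point; the grouping emerges from the distances alone, and the cluster indices themselves (0 or 1) are arbitrary.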

b. You use this approach when you're looking to categorise or segment your data into groups based on patterns and similarities in the data, but do not know what those groups are or should be.

c. An example of a real-world application is building a recommendation engine (e.g., such as the ones on Netflix or Spotify). You can use clustering to group similar users together (e.g., based on what shows they have watched or music they have listened to) and suggest things to users that similar users have liked/used (e.g., movies/shows watched and liked on Netflix, or songs added to playlists or liked on Spotify). For example, Netflix uses clustering to drive decision making around what new shows and movies to create based on which users like which content.

d. Overall, clustering methods tend to be very sensitive to the type of input and the parameters the user chooses, and these parameters can be difficult to determine. Selecting appropriate parameters relies on either strong domain knowledge or, where that doesn't exist, basic trial and error. One of the biggest issues with algorithms such as k-means is that the number of clusters must be specified in advance, and if an unsuitable number is chosen, this can lead to poor results (there are methods such as the "elbow method" for determining the ideal number of clusters, but these are not without issues either). Clustering algorithms can also be sensitive to outliers (e.g., in k-means, outliers can impact centroid positions). These algorithms can also struggle in high-dimensional data sets, as distance becomes less meaningful (which is referred to as the "curse of dimensionality") and data points can appear almost equally distant from each other, which makes it difficult for clustering algorithms to accurately distinguish between different clusters. The results you get are also highly dependent on which distance metric you choose (e.g., Euclidean, cosine similarity), which isn't ideal when you don't know what the ground truth is. Like deciding the number of clusters, this decision relies on either strong domain knowledge or trial and error. Some clustering algorithms (e.g., k-means) also make strong assumptions regarding the shape and size of clusters, assuming sphericity and similar cluster sizes, which can lead to poor performance where clusters are in fact irregular in shape or size (although some algorithms, such as spectral clustering, are better able to deal with varying clusters).
Scalability can be a problem, as some clustering algorithms are very resource intensive due to their computational complexity, which in certain use cases (e.g., data at Netflix's scale) is infeasible, although there are workarounds for this (e.g., "mini-batch" k-means). Finally, interpretability can be problematic, as it is not always clear what each cluster represents in "real terms" because the algorithms find hidden patterns in the data that can be difficult to describe.
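The "curse of dimensionality" point can be illustrated with a small simulation, sketched here under simple assumptions (uniformly random points in a unit hypercube, Euclidean distance). The `relative_contrast` helper is a hypothetical name: it measures how much further the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """(max - min) / min distance from one reference point to all the others."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    reference, others = points[0], points[1:]
    distances = [math.dist(reference, p) for p in others]
    return (max(distances) - min(distances)) / min(distances)

# As the number of dimensions grows, nearest and farthest neighbours
# become almost equally far away, so the contrast shrinks.
for dims in (2, 10, 100, 1000):
    print(dims, round(relative_contrast(200, dims), 2))
```

When the contrast approaches zero, "closest centroid" carries little information, which is why distance-based clustering tends to degrade without dimensionality reduction or feature selection.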


**1.5 Discuss what is Supervised Learning - Classification in Machine Learning using an example. (7.5 marks) Here we expect to see the following:
a. Definition.
b. When is it used?
c. What is a possible real-world application of supervised learning? 
d. What data do we need for it? Is there any processing that needs to be done?**

a. Supervised learning is when a model is trained using labeled data made up of "input-output" pairs, where the input is the features (e.g., symptoms) and the output is the label (e.g., diagnosis). During training, the model learns the association between the features and labels. Classification is a type of supervised learning where the model is trained to assign labels to input data (see image below for illustration). First, there is a training phase, where the model learns the association between features and labels, then the model's performance is evaluated using a test set (e.g., using accuracy, recall, F1-score, confusion matrices). Once the model is confirmed to be accurate and sensitive (few false positives and false negatives, many true positives and true negatives), the model can be used to predict the labels of new, unseen input data with confidence.
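The evaluation metrics mentioned above all derive from the four cells of a binary confusion matrix. The following sketch computes them in plain Python; the function names and the example labels and predictions are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall (sensitivity), specificity, and F1-score."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Made-up test-set labels (1 = diagnosis) and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```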

b. You use this approach when you are interested in being able to make predictions of classifications/labels based on new, unseen input data.

c. A real-world application of supervised learning could be diagnosis of Alzheimer's disease. You could train a classifier (e.g., a Random Forest Classifier) on a training set of numerical features (e.g., memory problems, physical health, daily functioning etc.) and labels (diagnosis or no diagnosis). A typical training-test split could be 70:30 or 80:20. You could then evaluate how well the model generalises to new data by looking at how accurate, sensitive, specific, and precise its predictions are in the test set (e.g., on the held-out 30% of the data). In this process, you could also use cross-validation (e.g., k-fold) within the training set to look at how different input features and hyperparameters affect the evaluation metrics, keeping the test set for the final evaluation only. Once you are happy with your model's accuracy and sensitivity, you could then use this model to help clinicians make accurate diagnoses of Alzheimer's disease based on new patient data.
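The data-splitting logic described above can be sketched without any modelling library. The helper names below are hypothetical, and the sketch only produces row indices: a 70:30 hold-out split, followed by k-fold splits made inside the training portion so that the test rows are never used while tuning.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle row indices and hold out a fraction as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(train_indices, k=5):
    """Yield (fit, validation) index lists; each training row is validated once."""
    folds = [train_indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation

train_idx, test_idx = train_test_split_indices(100, test_fraction=0.3)
folds = list(k_fold_indices(train_idx, k=5))

print(len(train_idx), len(test_idx))
print([(len(fit), len(val)) for fit, val in folds])
```

A classifier would be fitted on each `fit` subset and scored on the matching `validation` subset to compare settings, then refitted on all of `train_idx` and scored once on `test_idx`.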

d. This approach requires a set of input features (X) and labels (y). It is often necessary to standardise continuous features beforehand (e.g., z-scoring) so that features measured on larger scales do not dominate the model's learning (this matters most for distance-based or gradient-based models and less for tree-based models), and to encode categorical data as numeric data. It is important to prevent data leakage between training and test sets during the standardisation process: the mean and standard deviation should be estimated from the training set only and then applied to the test set, because if all data is standardised together, the model will indirectly have information about the test data even if it is excluded from training. If any of the categorical features are non-ordinal, it is a good idea to one-hot encode the data. One-hot encoding transforms each categorical level into a binary feature (e.g., Ethnicity with levels 1-4 becomes four binary features), which ensures the model learns the appropriate association between each level of the feature and the output label, without incorrectly assuming an ordered or linear relationship between the levels (i.e., there is no meaningful numeric relationship between Ethnicity 1 and Ethnicity 2).
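A minimal sketch of leakage-free standardisation and one-hot encoding follows, again in plain Python. The helper names and the numeric values are assumptions for illustration; the key point is that the scaling statistics are learned from the training data and merely reused on the test data.

```python
import statistics

def fit_standardiser(train_values):
    """Learn the mean and standard deviation from the training data only."""
    return statistics.mean(train_values), statistics.stdev(train_values)

def standardise(values, mean, sd):
    """Apply z-scoring with previously learned statistics."""
    return [(v - mean) / sd for v in values]

def one_hot(values, levels):
    """Turn each categorical value into one binary indicator per level."""
    return [[1 if v == level else 0 for level in levels] for v in values]

# Made-up continuous feature, already split into training and test rows.
train_scores = [10.0, 12.0, 14.0, 16.0, 18.0]
test_scores = [20.0, 8.0]

mean, sd = fit_standardiser(train_scores)
train_z = standardise(train_scores, mean, sd)
test_z = standardise(test_scores, mean, sd)  # reuse training statistics: no leakage

# Non-ordinal categorical feature with levels 1-4 (as in the Ethnicity example).
ethnicity_levels = [1, 2, 3, 4]
encoded = one_hot([2, 4, 1], ethnicity_levels)

print(train_z, test_z)
print(encoded)
```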

![image.png](attachment:image.png)

Source of image: https://blog.bismart.com/en/classification-vs.-clustering-a-practical-explanation

