## 1.1 In your own words, what does the role of a data scientist involve? 
#### 2 points

A data scientist takes data, converts it into a comprehensible state, and uses it to answer questions posed either by themselves or by the customer or client they are working with. In essence, they act as an interpreter between raw data and the customer or client, drawing out conclusions and interpretations from the data.

## 1.2 What is an outlier? Here we expect to see the following: 
### a. Definition
### b. Examples
### c. Should outliers always be removed? Why?
### d. What are other possible issues that you can find in a dataset?

#### 4 points

An outlier is an anomalous or abnormal data point: one that is conspicuously different from the majority of the other data points. Outliers can arise from exceptional real-life cases, but can also result from errors in data gathering or data input. For example, if surveying a group of university students, a dataset could consist of 2000 results, with 1999 of those surveyed aged between 18 and 28, and 1 aged 60. This 60-year-old would be an outlier; feasibly there could be an older student, but it is not the norm, and in this group they are an anomaly. If the dataset instead consisted of the same number of students aged between 18 and 28, and 1 with an age of 250, this outlier would most likely be due to a data input error (e.g. a typo), as humans cannot live to 250.
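As a rough illustration of how such a point might be flagged automatically, the sketch below applies the interquartile range (IQR) rule, a common convention rather than the only possible method. The ages are made-up sample values, and the helper name `iqr_outliers` and the multiplier `k=1.5` are assumptions chosen for this example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical survey ages: mostly 18 to 28, plus one 60-year-old student.
ages = [18, 19, 19, 20, 21, 21, 22, 23, 24, 25, 26, 28, 60]

outliers = iqr_outliers(ages)
print(outliers)  # [60]
```

The rule only flags the value as unusual. Whether it is a genuine older student or an input error still has to be judged by a person.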

If outliers are not removed, they can skew the data and change the results of data analysis significantly. Outliers can affect results to such an extent that any conclusions drawn from a small sample and applied to a wider population will be erroneous.
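The size of this effect can be seen with a small made-up sample. In the sketch below the ages are hypothetical and 250 stands in for a data-entry typo. The mean is pulled up sharply by the single bad value, while the median barely moves.

```python
import statistics

# Hypothetical ages; 250 stands in for a data-entry typo.
ages_clean = [18, 19, 20, 21, 22, 23, 24, 25, 26, 28]
ages_with_typo = ages_clean + [250]

mean_clean = statistics.mean(ages_clean)            # 22.6
mean_typo = statistics.mean(ages_with_typo)         # about 43.3
median_clean = statistics.median(ages_clean)        # 22.5
median_typo = statistics.median(ages_with_typo)     # 23

print(mean_clean, round(mean_typo, 1))
print(median_clean, median_typo)
```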

Similar issues can be caused by missing data or duplicate rows, which can be dealt with (alongside anomalies) during data cleaning (see below).


## 1.3 Describe the concepts of data cleaning and data quality. Here we expect to see the following:
### a. What is data cleaning?
### b. Why is data cleaning important?
### c. What type of mistakes do we expect to commonly see in datasets?
#### 4 points

Data cleaning is the preparation of raw data for analysis, transforming raw data into a usable dataset.

During data cleaning, duplicates will be removed. Duplicates can alter the results by overrepresenting certain aspects of the data in a way that is not true to life. They can be removed using the pandas drop_duplicates method. Rows (or columns) which contain incomplete records will also be assessed.
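A minimal sketch of duplicate removal with pandas is shown below. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical survey data; the third row repeats the second exactly.
df = pd.DataFrame({
    "student_id": [1, 2, 2, 3],
    "age": [19, 22, 22, 25],
})

# Count rows that repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduped))
```

By default `drop_duplicates` compares whole rows. Its `subset` parameter can restrict the comparison to chosen columns, such as an ID column.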

Where placeholders such as '?' or 'n/a' have been used in records, they will be replaced with 'NaN'. pandas has built-in methods for dealing with NaN values, e.g. dropna and fillna. Rows containing NaN can either be dropped, or the NaN values can be replaced with a filler of the programmer's choice, e.g. the mean of the column.
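The sketch below shows one way this could look in pandas, under the assumption that the data is loaded from a CSV file (an in-memory string stands in for the file here). The `na_values` argument of `read_csv` converts the listed placeholders to NaN at load time. For data that is already loaded, `DataFrame.replace` can do the same job.

```python
import io
import pandas as pd

# Hypothetical raw data with '?' and 'n/a' used as placeholders.
raw = io.StringIO("name,age\nAna,19\nBen,?\nCy,22\nDi,n/a\n")

# Treat the placeholders as missing values while loading.
df = pd.read_csv(raw, na_values=["?", "n/a"])

# Option 1: drop the rows where age is missing.
dropped = df.dropna(subset=["age"])

# Option 2: fill missing ages with the column mean.
filled = df.fillna({"age": df["age"].mean()})

print(dropped)
print(filled)
```

Which option is appropriate depends on the dataset: dropping rows loses information, while filling with the mean invents values that were never observed.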

In addition, records that are in the incorrect format or datatype can be corrected. For example, a date could have been input as a string rather than as a datetime datatype. During data cleaning this will be fixed, in order to be able to work with the data more easily later on.
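As a small example of the date case, the sketch below converts a column of date strings with `pd.to_datetime`. The column name and the date format are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column of dates stored as plain strings.
df = pd.DataFrame({"signup_date": ["2023-01-15", "2023-02-01", "2023-03-20"]})

# Convert the strings to a datetime column; the format is stated explicitly.
df["signup_date"] = pd.to_datetime(df["signup_date"], format="%Y-%m-%d")

# Date parts are now available through the .dt accessor.
months = df["signup_date"].dt.month.tolist()
print(months)  # [1, 2, 3]
```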

Data quality is 'a measure of the condition of data based on factors such as accuracy, completeness, consistency, reliability and whether it's up to date.' (https://www.techtarget.com/searchdatamanagement/definition/data-quality). As is clear from the description of the data cleaning process, rigorous data cleaning results in high quality data. 

Data cleaning is important for a number of reasons. Primarily, it is often not possible to work with a raw or uncleansed dataset without it throwing errors; for example, if a numeric column contains a '?' in place of a value, operations such as calculating the mean or sorting by value can fail or give misleading results.

In addition, data cleaning can include removing columns that are not required in the processing, or that do not have enough correct/complete data to be usable - removing columns and streamlining the dataset can make processes quicker and more efficient. 
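A sketch of both kinds of column removal is given below. The column names are invented, and the threshold of three non-missing values is an arbitrary assumption that would need to be chosen per dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one unneeded column and one mostly empty column.
df = pd.DataFrame({
    "age": [19, 22, 25, 21],
    "notes": [np.nan, np.nan, np.nan, "late reply"],
    "internal_ref": ["a1", "b2", "c3", "d4"],
})

# Remove a column that is not required for the analysis.
df = df.drop(columns=["internal_ref"])

# Keep only the columns that have at least 3 non-missing values.
df = df.dropna(axis="columns", thresh=3)

print(list(df.columns))  # ['age']
```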

Essentially, when it comes to data cleaning, the phrase 'garbage in, garbage out' must be remembered. If you are working with low quality data, you will get low quality results - they will be inaccurate, unrepresentative, and not answer the question you posed to begin with.

## 1.4 Discuss what is Unsupervised Learning - Clustering in Machine Learning using an example. Here we expect to see the following:
### a. Definition.
### b. When is it used?
### c. What is a possible real-world application of unsupervised learning?
### d. What are its main limitations?
#### 7.5 points

Machine Learning is 'a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.' (https://www.ibm.com/topics/machine-learning#:~:text=the%20next%20step-,What%20is%20machine%20learning%3F,learn%2C%20gradually%20improving%20its%20accuracy.)

Unsupervised Learning is a type of Machine Learning in which unlabelled data is used to try to discover patterns in the data. It differs from supervised learning as there are no labelled target values for the algorithm to try to meet or match.

Clustering involves separating the data into similar groupings (clusters), according to certain patterns in the data. From this, the algorithm can then establish differences in the data that contribute to these resulting clusters.

For example, fraud detection in banks can use clustering. The algorithm does not know what 'fraud' is, but it can assess the transactions from an account and group them into what is 'normal' behaviour for the account owner and what is abnormal.
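To make the idea concrete, here is a minimal sketch of k-means, one common clustering algorithm, written in plain Python for one-dimensional data. It is an illustration only and not how a bank would implement fraud detection. The transaction amounts are invented, the helper `kmeans_1d` is hypothetical, and starting the two centroids at the smallest and largest amounts is a simplification. In practice a library implementation such as scikit-learn's `KMeans` would normally be used.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on 1-D data.

    Each round assigns every value to its nearest centroid, then moves
    each centroid to the mean of the values assigned to it.
    """
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical transaction amounts for a single account.
amounts = [12.5, 8.0, 15.2, 9.9, 11.3, 14.0, 950.0, 1020.0]

centroids, clusters = kmeans_1d(amounts, centroids=[min(amounts), max(amounts)])
print(clusters[0])  # the small, everyday amounts
print(clusters[1])  # the two unusually large amounts
```

The algorithm only separates the amounts into two groups. Deciding that the group of large transactions deserves a fraud check is an interpretation made afterwards.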

One main limitation of Clustering is a high sensitivity to outliers (which are described in 1.2). If there are too many outliers, the algorithm's boundaries for each cluster will be skewed, and therefore the groupings will not be as clear or reliable.


## 1.5 Discuss what is Supervised Learning - Classification in Machine Learning using an example. Here we expect to see the following:
### a. Definition.
### b. When is it used?
### c. What is a possible real-world application of supervised learning?
### d. What data do we need for it? Is there any processing that needs to be done?
#### 7.5 points

Supervised learning trains an algorithm on a labelled dataset, made up of input features (the independent variables) and known results (the dependent variable), to create a model which successfully predicts the dependent variable from the independent variables.

Classification is similar to clustering in that the algorithm groups the data into different sets, or categories, based on certain characteristics. However, Classification uses labelled data while, as detailed above, Clustering uses unlabelled data.

The programmer must divide the data into a 'training' set and a 'testing' set, so that the algorithm can first learn from the training set and then be tested on data it has not seen, to show that the resulting model works and is accurate.
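The split-then-test workflow can be sketched in plain Python as follows. Everything here is illustrative: the song features and mood labels are invented, the helpers `split_train_test` and `predict_1nn` are hypothetical, and a one-nearest-neighbour rule is used only because it is the simplest classifier to write out, not because it is the recommended choice.

```python
import random

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train, features):
    """Return the label of the training row closest to `features`."""
    def squared_distance(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], features))
    return min(train, key=squared_distance)[1]

# Hypothetical labelled data: (tempo in BPM, energy from 0 to 1) -> mood label.
songs = [
    ((60, 0.20), "sad"), ((65, 0.25), "sad"),
    ((70, 0.30), "sad"), ((72, 0.22), "sad"),
    ((120, 0.80), "happy"), ((128, 0.85), "happy"),
    ((135, 0.90), "happy"), ((125, 0.75), "happy"),
]

train, test = split_train_test(songs)

# Evaluate only on the rows the classifier has not seen.
correct = sum(predict_1nn(train, features) == label for features, label in test)
accuracy = correct / len(test)
print(len(train), len(test), accuracy)
```

On this tiny, cleanly separated toy data every test row is classified correctly. Real datasets are rarely this tidy, which is exactly why the held-out test set is needed.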

One real-world application of Classification in Supervised Learning is when Spotify creates mood playlists (e.g. 'Sad Songs: songs for a broken heart'). Based on previous playlists made by its users, it can see that certain songs are grouped together in playlists for different emotions, labelled by the playlist titles. Another real-world application of Classification is the Captcha tests that many computer users need to complete to access certain websites. These tests train the computer to recognise images of, for example, a bike or traffic lights.