# Machine Learning
## Compulsory task

1. For each of the following examples describe at least one possible input and
output. Justify your answers:  
* 1.1 A self-driving car
* 1.2 Netflix recommendation system
* 1.3 Signature recognition
* 1.4 Medical diagnosis


1. Answer here

| Example | Input(s) | Output(s) |
| ------- | -------- | --------- |
| 1.1 | Camera feed, GPS data, sensor data (lidar, radar, etc.), destination input | Planned route, speed adjustments, steering commands, obstacle detection, pedestrian recognition |
| 1.2 | User viewing history, ratings, genre preferences, time of day, device used | Personalized list of recommended movies or TV shows, tailored to the user's preferences |
| 1.3 | Image of a signature | Verification of authenticity (positive/negative), identity confirmation |
| 1.4 | Patient symptoms, medical history, diagnostic tests (blood tests, imaging scans, etc.) | Diagnosis of illness or condition, recommended treatment plan, prognosis |


2. For each of the following case studies, determine whether it is appropriate to utilise regression or classification machine learning algorithms. Justify your answers:
* 2.1 Classifying emails as promotion or social based on their content and metadata.
* 2.2 Forecasting the stock price of a company based on historical data and market trends.
* 2.3 Sorting images of animals into different species based on their visual features.
* 2.4 Predicting the likelihood of a patient having a particular disease based on medical history and diagnostic test results.

2. Answer here
* 2.1 Classification - the goal is to categorize emails into one of two classes: promotion or social. Classification algorithms can be trained on labeled data where emails are tagged as promotion or social, and then the algorithm learns patterns in the content and metadata to make predictions on new, unseen emails.

* 2.2 Regression - Stock price prediction involves predicting a continuous variable (the price of the stock) based on historical data and market trends. Regression algorithms are suitable for this task as they can analyze patterns in historical data and extrapolate to make predictions about future stock prices.

* 2.3 Classification - As with the email task, the goal is to assign each input to one of a fixed set of categories, here the animal species shown in an image. Classification algorithms can be trained on labeled image data, where each image is associated with a specific species label, so that the algorithm learns the visual features that distinguish one species from another.

* 2.4 Classification - This task involves predicting whether a patient has a particular disease based on their medical history and diagnostic test results. It is essentially a binary classification problem in which the two classes are "has the disease" and "does not have the disease". Many classifiers also output a probability for each class, which matches the "likelihood" wording of the question.
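
The difference between the two problem types in question 2 can be illustrated with a minimal sketch. It uses only the Python standard library, and all data and feature names (`days`, `prices`, `discount_words`) are made up for illustration. The regression part fits a straight line by ordinary least squares and returns a continuous number, while the classification part uses a simple nearest-centroid rule and returns a discrete label. Real stock or email models would need far richer features and methods.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def fit_centroids(xs, labels):
    """Nearest-centroid classifier: store the mean feature value per class."""
    return {c: mean(x for x, l in zip(xs, labels) if l == c) for c in set(labels)}


def classify(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))


# Regression (as in 2.2): the output is a continuous value.
days = [1, 2, 3, 4, 5]                      # hypothetical trading days
prices = [10.0, 12.0, 14.0, 16.0, 18.0]     # hypothetical closing prices
slope, intercept = fit_line(days, prices)
predicted_price = slope * 6 + intercept     # forecast for day 6

# Classification (as in 2.1): the output is one of a fixed set of labels.
discount_words = [0, 1, 0, 5, 6, 7]         # hypothetical feature per email
labels = ["social", "social", "social", "promotion", "promotion", "promotion"]
centroids = fit_centroids(discount_words, labels)
predicted_label = classify(centroids, 4)    # label for a new email

print(predicted_price, predicted_label)
```

The key contrast is the type of the prediction: `predicted_price` is a number on a continuous scale, whereas `predicted_label` can only ever be one of the classes seen during training.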

3. For each of the following real-world problems, determine whether it is appropriate to utilise a supervised or unsupervised machine learning algorithm. Justify your answers:
* 3.1 Detecting anomalies in a manufacturing process using sensor data without prior knowledge of specific anomaly patterns.
* 3.2 Predicting customer lifetime value based on historical transaction data and customer demographics.
* 3.3 Segmenting customer demographics based on their purchase history, browsing behaviour, and preferences.
* 3.4 Analysing social media posts to categorise them into different themes.


3. Answer here
* 3.1 Unsupervised learning - An unsupervised learning algorithm would be appropriate for detecting anomalies in the manufacturing process because it doesn't require labeled data indicating which instances are normal and which are anomalous. Instead, the algorithm can learn patterns from the sensor data without prior knowledge of specific anomaly patterns. Techniques such as clustering or density estimation can be used to identify data points that deviate significantly from the norm, indicating potential anomalies.

* 3.2 Supervised learning - Predicting customer lifetime value involves using historical transaction data and customer demographics to predict a continuous target variable (lifetime value). Since historical data with known outcomes (lifetime values) are available, supervised learning algorithms can be trained on this labeled data to predict the lifetime value of new customers based on their transaction history and demographics.

* 3.3 Unsupervised learning - Customer segmentation typically involves identifying meaningful groups or clusters within a dataset based on similarities or patterns in the data. Since there are no predefined categories or labels for customer segments, unsupervised learning techniques such as clustering would be appropriate. Algorithms like k-means clustering or hierarchical clustering can be used to segment customers based on their purchase history, browsing behavior, and preferences without the need for labeled data.

* 3.4 Supervised learning (possibly with some unsupervised pre-processing) -  While unsupervised techniques like topic modeling could be used to identify underlying themes in the social media posts, supervised learning may be more appropriate if specific predefined themes or categories are required. In this case, labeled data would be necessary to train a supervised learning model to classify posts into predefined themes. However, unsupervised techniques could still be useful for feature extraction or dimensionality reduction before applying supervised learning algorithms. Therefore, a combination of both supervised and unsupervised techniques may be beneficial.
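
As a rough illustration of the unsupervised answers (3.1 and 3.3), the sketch below works on unlabeled numbers only. It flags anomalies with a z-score rule, which is a much simpler stand-in for the density estimation mentioned above, and groups customers with a basic one-dimensional k-means. The readings, spending values, threshold, and initial centroids are all assumptions chosen for the example, not values from any real dataset.

```python
from statistics import mean, pstdev


def find_anomalies(readings, threshold=2.0):
    """Flag readings more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(readings), pstdev(readings)
    if sigma == 0:
        return []  # all readings identical, nothing stands out
    return [r for r in readings if abs(r - mu) / sigma > threshold]


def assign(values, centroids):
    """Put each value into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for v in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)
    return clusters


def k_means_1d(values, centroids, iterations=10):
    """Basic k-means on 1-D data, starting from the given initial centroids."""
    for _ in range(iterations):
        clusters = assign(values, centroids)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [mean(c) if c else old for c, old in zip(clusters, centroids)]
    return centroids, assign(values, centroids)


# 3.1: hypothetical sensor temperatures with one unusual reading, no labels given.
temperatures = [20.1, 19.9, 20.0, 20.2, 19.8, 20.0, 35.0]
anomalies = find_anomalies(temperatures)

# 3.3: hypothetical monthly spend per customer, no segment labels given.
monthly_spend = [5, 7, 6, 50, 55, 52]
centroids, segments = k_means_1d(monthly_spend, centroids=[5, 55])

print(anomalies)
print(segments)
```

Neither function is ever told which readings are faulty or which segment a customer belongs to; both groupings come from the structure of the data alone, which is what makes the approach unsupervised.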

4. For each of the following real-world problems, determine whether it is appropriate to utilise semi-supervised machine learning algorithms. Justify your answers:
* 4.1 Predicting fraudulent financial transactions using a dataset where most transactions are labelled as fraudulent or legitimate.
* 4.2 Analysing customer satisfaction surveys where only a small portion of the data is labelled with satisfaction ratings.
* 4.3 Identifying spam emails in a dataset where the majority of emails are labelled.
* 4.4 Predicting the probability of default for credit card applicants based on their complete financial and credit-related information.


4. Answer here
* 4.1 Supervised learning is sufficient - Semi-supervised learning is most useful when labeled data is scarce and unlabeled data is plentiful. Here most transactions are already labeled as fraudulent or legitimate, so a supervised classifier can be trained directly on the labeled data. The small unlabeled remainder could still be folded in with a semi-supervised method, but it is unlikely to add much compared with the labels already available.

* 4.2 Semi-supervised learning - In scenarios where only a small portion of the data is labeled (e.g., satisfaction ratings in customer surveys), semi-supervised learning can be beneficial. By utilizing both the labeled and unlabeled data, semi-supervised algorithms can learn more robust representations of the data and improve predictive accuracy. This approach can help in extracting patterns and insights from the unlabeled data, thereby enhancing the analysis of customer satisfaction surveys.

* 4.3 Supervised learning is sufficient - The majority of emails are already labeled as spam or non-spam, so a classifier can be trained directly on the labeled data. Semi-supervised learning is only worth considering if the unlabeled minority is large enough to reveal spam patterns that are missing from the labeled set.

* 4.4 Supervised learning - The problem involves predicting the probability of default for credit card applicants based on their financial and credit-related information, and there is no indication that a significant portion of the data is unlabeled. Supervised learning models can be trained directly on labeled data, where each applicant's default status is known, to predict the likelihood of default accurately. While unlabeled data could potentially provide additional insights, it's not explicitly mentioned in the problem description, so a supervised learning approach is sufficient.
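
The idea of training on a few labels and then using unlabeled data to refine the model, as in 4.2, can be sketched with self-training (pseudo-labeling), one common semi-supervised technique. The example below is a minimal standard-library sketch under stated assumptions: the single numeric feature (`score`), the class names, and the `min_margin` confidence rule are all hypothetical, and the base model is a simple nearest-centroid classifier that assumes at least two classes.

```python
from statistics import mean


def fit_centroids(xs, labels):
    """Nearest-centroid model: the mean feature value of each class."""
    return {c: mean(x for x, l in zip(xs, labels) if l == c) for c in set(labels)}


def predict(centroids, x):
    """Return (label, margin); a larger margin means a more confident prediction."""
    ranked = sorted(centroids, key=lambda c: abs(x - centroids[c]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, margin


def self_train(labeled_x, labeled_y, unlabeled_x, min_margin=2.0, rounds=5):
    """Repeatedly add confidently pseudo-labeled points to the training set."""
    xs, ys, pool = list(labeled_x), list(labeled_y), list(unlabeled_x)
    for _ in range(rounds):
        centroids = fit_centroids(xs, ys)
        guesses = [(x, predict(centroids, x)) for x in pool]
        confident = [(x, label) for x, (label, margin) in guesses if margin >= min_margin]
        if not confident:
            break  # nothing left that the model is sure about
        for x, label in confident:
            xs.append(x)
            ys.append(label)
            pool.remove(x)
    return fit_centroids(xs, ys)


# Hypothetical survey scores (0-10): only two responses carry a satisfaction label.
labeled_scores = [1, 9]
labeled_ratings = ["dissatisfied", "satisfied"]
unlabeled_scores = [2, 3, 8, 7, 5]

model = self_train(labeled_scores, labeled_ratings, unlabeled_scores)
print(model)
```

In this toy run the ambiguous score of 5 is never pseudo-labeled because it sits equally far from both classes, which shows why a confidence rule matters: adding uncertain pseudo-labels can reinforce the model's own mistakes.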
