# COMP1008 2024/25 Coursework - Wine Quality Prediction with Machine Learning

#### **Student Name**: Gabriel Bridger
#### **Student ID**: 2086810 

## Task description

**Main Task**: Utilizing the provided Red Wine Quality dataset, build a Linear Regression Model and another Machine Learning Model of your choice to predict wine quality. Employ appropriate methods from the `pandas`, `matplotlib`, and `sklearn` libraries to analyze and process the dataset for building predictive models.



**Format**: Use this Jupyter Notebook as a template to write your report in `Markdown` cells, supported by your source code in Code cells. Ensure your code produces the corresponding plots or results addressing the questions. Rename this .ipynb file to `202425_COMP1008_cw_XXX.ipynb`, where XXX is your username (e.g., psxyz), and submit it to Moodle by <b><font color = "red">24 March, 3pm</font></b>.

**Marks**: The coursework is worth a total of 100 marks (accounting for 25% of the COMP1008 module grade). Marks will be awarded based on your understanding of machine learning theories, the informativeness and presentation of your code, visualizations, and results (e.g., code comments, necessary labels in plots), your self-learning ability in solving the specific problem, and how succinct, concise, and clear your report writing is.

Please check the detailed instructions at the end of this template file.

<div class="alert alert-success" style="text-align:left;">
<h2>Question 1. Prediction Model 1 - Linear Regression Model<span style="float:right;">[50 marks]</span></h2></div>

#### Question 1a <span style="color:red">(5 marks)</span> 
**TASK**: Briefly explain why the Red Wine Quality dataset is suitable for linear regression analysis.
- Identify at least 3 characteristics that make this dataset appropriate for regression.
- Use 3 bullet points (one for each characteristic) to present your answer concisely.
- Your explanation should reflect your understanding of the linear regression model.

<b>Q1a Answer</b>:

**Continuous Numerical Values**    
The dataset consists of continuous numerical values, so each attribute can be used directly as an input to a linear equation. This suits linear regression, which fits a line described by a linear equation to the relationship between the chosen inputs and the output. The more data points the algorithm is given, and the wider the range they cover, the more accurately the fitted line can represent that relationship.

**Clear Target Values**    
Having a clear target value, the `quality` attribute, means a 'loss' value can be calculated to measure how well the fitted equation predicts the output from the set of inputs. In an iterative fitting procedure, the algorithm compares the current loss with the previous loss; if the loss has decreased, it keeps shifting the regression line in that direction on the next iteration, rather than blindly guessing which change will improve the accuracy of the predicted relationship.

**Likely Linear Relation**    
Many of the potential input attributes within the dataset are likely to be linearly related to the output, meaning there likely exists a linear equation that approximates the overall relationship between the inputs of the wine dataset and the `quality` output. Because the aim of linear regression is to establish exactly this kind of linear relationship, the model should be able to predict the output reasonably accurately.
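
The loss-driven fitting described under "Clear Target Values" can be sketched with plain gradient descent. This is only an illustration under stated assumptions: the data is synthetic (not the wine dataset), the names `mse`, `learning_rate` and `losses` are made up for this example, and `sklearn`'s `LinearRegression` solves the least-squares problem directly rather than iterating like this.

```python
import numpy as np

# Synthetic data (NOT the wine dataset): y = 2x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.05, size=200)


def mse(w, b):
    """Mean squared error of the line y = w*x + b against the targets."""
    return float(np.mean((w * x + b - y) ** 2))


w, b = 0.0, 0.0        # start from a flat line
learning_rate = 0.5    # assumed step size for this toy problem
losses = [mse(w, b)]

for _ in range(500):
    error = w * x + b - y
    # Gradients of the MSE with respect to the slope and the intercept.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Step against the gradient, i.e. in the direction that lowers the loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    losses.append(mse(w, b))

print(f"w={w:.2f}, b={b:.2f}, first loss={losses[0]:.3f}, last loss={losses[-1]:.4f}")
```

Without a target column there would be no `error` term to compute, which is why a clear target value is needed before any loss can guide the fit.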

---

#### Question 1b <span style="color:red">(15 marks)</span>

**TASK**: Analyze the dataset using appropriate methods from the `pandas` and/or `matplotlib` libraries. 
- Identify potential issues with the current dataset, specify which part(s) of the dataset are affected. Explain what could go wrong if the data is not properly pre-processed.
- Provide at least 2 short-code solutions demonstrating how you analyze these issues.  
- Briefly explain how each code snippet helps evaluate data quality issues.


<b>Q1b answer</b>:

### Potential Issues Within the Current Dataset
**NaN Values**

`NaN` (`Not a Number`) values exist within the dataset, for example in the `density` and `free sulfur dioxide` attributes.

If pre-processing isn't applied to remove or replace the `NaN` values, this could break the model: if an input is `NaN`, the formula (the equation of the regression line) cannot be used to predict an output value for it, which would cause the program to fail.


**Outliers Within the Dataset**

Some columns in the dataset, such as `chlorides` and `sulphates`, contain extreme values (outliers), which make it harder for the algorithm to fit a regression line to the dataset. These values need to be identified and then either removed or replaced to avoid harming the model.

If pre-processing isn't properly applied then the outliers can skew the predicted relationship between the inputs and the outputs resulting in inaccurate predictions for a given set of inputs.


**Not Normally Distributed**

Some attributes, such as `residual sugar`, have skewed distributions. Linear regression does not strictly require normally distributed inputs (the usual normality assumption concerns the residuals), but heavily skewed features tend to contain extreme values that have a disproportionate influence on the fit.

If pre-processing isn't properly applied, the model may be more sensitive to the extreme values present in some columns. It will try to fit a linear relationship that satisfies both the extreme and the typical values, so the regression line ends up somewhere in between and its predictions could be inaccurate.
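
One way to check for the skew described above is the pandas `.skew()` method, shown here as a minimal sketch. The column is synthetic and the name `skewed_feature` is a stand-in chosen for this example, not a column of the wine dataset; the `np.log1p` transform is one common option for reducing right skew, not the method used elsewhere in this report.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed column standing in for something like `residual sugar`.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"skewed_feature": rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

# Sample skewness: roughly 0 for a symmetric column, large and positive for a long right tail.
skew_before = toy["skewed_feature"].skew()

# log1p computes log(1 + x), which compresses the right tail and is safe for zeros.
skew_after = np.log1p(toy["skewed_feature"]).skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

Running the same `.skew()` call on every column of the real dataframe would show which attributes are most affected.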


**Duplicate Values**

Duplicate rows can become an issue when the algorithm fits the linear relationship between the inputs and the outputs. The model may become biased towards the duplicated rows: because they appear more often, they carry more weight in the fit than they may deserve in reality. The model can also **overfit** to those specific entries. If copies of the same row end up in both the training and the test data, performance metrics such as `loss` may look better than they should, because the regression line predicts that one combination of inputs well, while on unseen data it may not perform as well due to the over-representation of a single row within the dataset.

In [3]:
import pandas as pd
import numpy as np
wineQualDf = pd.read_csv('winequality-red.csv')

# run to set up the data frame to be pre-processed

**Counting NaN Values**

The `countNaNRows()` function takes the dataframe to be analysed and uses `df.isna()` to create a dataframe of booleans identifying whether the value at each location is `NaN`. `.any(axis=1)` then reduces that dataframe to a single boolean `Series` with one entry per row (`axis=1` checks across the columns of each row, whereas `axis=0` would check down each column): an entry is `True` if any value in that row is `True`, and `False` otherwise. Finally, `.sum()` is applied to the `Series`, treating `True` as 1 and `False` as 0, so the total is the number of rows that contain at least one `NaN` value.

In [19]:
def countNaNRows(df):
    NaNCount = df.isna().any(axis=1).sum()
    return NaNCount


countNaNRows(wineQualDf)

np.int64(275)
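
The chain of calls inside `countNaNRows` can be traced step by step on a small frame. This is a sketch on made-up values, not the wine data; the variable names are chosen for illustration. It also adds a per-column count, which helps show which parts of the dataset are affected.

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN in each column, in different rows.
toy = pd.DataFrame({
    "density": [0.99, np.nan, 1.00, 0.98],
    "pH": [3.2, 3.4, np.nan, 3.1],
})

nan_mask = toy.isna()                # boolean DataFrame, same shape as toy
row_has_nan = nan_mask.any(axis=1)   # boolean Series: True where a row has any NaN
nan_rows = row_has_nan.sum()         # True counts as 1, so this counts rows
nan_per_column = nan_mask.sum()      # NaN count for each column separately

print(row_has_nan.tolist())  # [False, True, True, False]
print(nan_rows)              # 2
print(nan_per_column.to_dict())
```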

**Counting Outliers**

When handling outliers, the harder problem is not removing the values from the dataframe but deciding what constitutes an outlier.

In the code snippet below, an outlier is identified as a value that is either:
- *greater than the third quartile of its column plus 3 × the interquartile range*
- *less than the first quartile of its column minus 3 × the interquartile range*

The function `countOutlierRows` counts the number of rows that would be removed for containing values classified as outliers. It first computes the first and third quartiles of each column of the dataframe and, from them, the interquartile range. It then builds a boolean mask that is `True` wherever a value meets either condition above, uses `.any(axis=1)` to flag each row containing at least one such value, and calls `.sum()` to count the flagged rows.

In [15]:
def countOutlierRows(df):
    quart1 = df.quantile(0.25)
    quart3 = df.quantile(0.75)

    IQrange = quart3 - quart1

    lowerBound = quart1-3*IQrange
    upperBound = quart3+3*IQrange

    outlierCount = ((df<lowerBound) | (df>upperBound)).any(axis=1).sum()

    return outlierCount


countOutlierRows(wineQualDf)

np.int64(169)
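
The same 3 × IQR rule can also be applied per column to see which attributes contribute the outliers. The sketch below uses made-up values under the column names `chlorides` and `sulphates`; the numbers are not taken from the wine dataset and the variable names are illustrative.

```python
import pandas as pd

# Toy frame: one extreme `chlorides` value, no extreme `sulphates` values.
toy = pd.DataFrame({
    "chlorides": [0.07, 0.08, 0.08, 0.09, 0.07, 0.08, 0.61],
    "sulphates": [0.55, 0.60, 0.62, 0.58, 0.65, 0.57, 0.59],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 3 * iqr
upper = q3 + 3 * iqr

# Each comparison needs its own parentheses because `|` binds tighter than `<` and `>`.
outlier_mask = (toy < lower) | (toy > upper)

outliers_per_column = outlier_mask.sum()        # outlier count per attribute
outlier_rows = outlier_mask.any(axis=1).sum()   # rows with at least one outlier

print(outliers_per_column.to_dict())
print(outlier_rows)
```

A per-column breakdown like this makes it easier to judge whether a 3 × IQR threshold is too strict or too lenient for a particular attribute.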

**Counting the Number of Duplicate Rows**

The function below, `countDuplicateRows`, counts the duplicate rows within the dataframe. It uses the built-in `.duplicated()` method, which checks every row to see whether an identical row has already appeared earlier in the dataframe. This returns a `Series` in which, as with `.any()`, each element is either `True` or `False`. Applying `.sum()` then counts the `True` values, which is the number of rows that are repeats of an earlier row.

So for example in the dataframe:
    *'A': [1, 2, 3, 1, 2],*
    *'B': ['a', 'b', 'c', 'a', 'b']*

the output of `countDuplicateRows()` would be 2 because there are 2 occurrences of data duplication within the dataframe: `Row 4` is a duplicate of `Row 1` (1, a) and `Row 5` is a duplicate of `Row 2` (2, b).

In [20]:
def countDuplicateRows(df):
    dupeCount = df.duplicated().sum()
    return dupeCount

countDuplicateRows(wineQualDf)

np.int64(2)
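
The worked example with columns `A` and `B` can be reproduced directly. This is a small sketch on that toy frame only; the `keep=False` variant is included to show the difference between counting repeats and counting every row involved in a duplication.

```python
import pandas as pd

# The example dataframe from the explanation above.
toy = pd.DataFrame({
    "A": [1, 2, 3, 1, 2],
    "B": ["a", "b", "c", "a", "b"],
})

# Default keep="first": only the later copies are marked True.
dupe_mask = toy.duplicated()
dupe_count = dupe_mask.sum()

# keep=False marks every row that has an identical twin, originals included.
all_copies = toy.duplicated(keep=False).sum()

print(dupe_mask.tolist())  # [False, False, False, True, True]
print(dupe_count)          # 2
print(all_copies)          # 4
```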

---

#### Question 1c <span style="color:red">(20 marks)</span>
**TASK**: Apply appropriate data preprocessing techniques to address the issues identified in Question 1b.
- Provide a code solution that resolves the identified data issue(s).
- Briefly explain the methods and parameters used in your solution. Ensure your explanation clearly justifies how these techniques improve data quality and suitability for analysis.


<b>Q1c answer</b>: 

In [4]:
class NaNhandling():
    @staticmethod
    def handleNaNmodal(df):
        correctedDf = df.apply(lambda x: x.fillna(x.mode()[0],axis=0))
        return correctedDf
    

    @staticmethod
    def handleNaNdrop(df):
        return df.dropna()
    
    @staticmethod
    def handleNaNValues(df):
        handleMethod = 'modal'
        
        if countNaNRows(df) == 0:
            return df
        
        if handleMethod == 'modal':
            return NaNhandling.handleNaNmodal(df)

        elif handleMethod == 'drop':
            return NaNhandling.handleNaNdrop(df)
        

class outlierHandling():
    @staticmethod
    def getUpperLowerBounds(df):
        quart1 = df.quantile(0.25)
        quart3 = df.quantile(0.75)

        IQrange = quart3 - quart1

        lowerBound = quart1-3*IQrange
        upperBound = quart3+3*IQrange      

        return lowerBound,upperBound

    @staticmethod
    def handleOutliersModal(df):
        
        lowerBound,upperBound = outlierHandling.getUpperLowerBounds(df)
    
        for col in df.columns:
            modalVal = df[col].mode()[0]
            df[col] = df[col].apply(lambda x: modalVal if x<lowerBound[col] or x>upperBound[col] else x)

        return df

    @staticmethod
    def handleOutliersDrop(df):
        lowerBound,upperBound = outlierHandling.getUpperLowerBounds(df)
        # Parenthesise each comparison: `|` binds tighter than `<` and `>`
        strippedDf = df[~((df<lowerBound) | (df>upperBound)).any(axis=1)]
        return strippedDf


    @staticmethod
    def handleOutliers(df):
        handleMethod = 'modal'

        # Skip the work only when there are no outlier rows to handle
        if countOutlierRows(df) == 0:
            return df
        
        if handleMethod == 'modal':
            return outlierHandling.handleOutliersModal(df)
        
        elif handleMethod == 'drop':
            return outlierHandling.handleOutliersDrop(df)

        
def removeDuplicateRows(df):
    return df.drop_duplicates()


def performPreProcessing(df):
    df = NaNhandling.handleNaNValues(df)
    df = outlierHandling.handleOutliers(df)
    df = removeDuplicateRows(df)

    return df


performPreProcessing(wineQualDf)





NameError: name 'countNaNRows' is not defined

---

#### Question 1d <span style="color:red">(10 marks)</span>

**TASK**: Train and evaluate a Linear Regression model using the preprocessed dataset.   
- Print the model's weights.  
- Print the model's accuracy. 
    - Evaluate the model using at least three different metrics.
    - Briefly discuss the advantages of each metric in assessing model performance.


<b>Q1d answer</b>: 

In [13]:
# Your code here


---

<div class="alert alert-success" style="text-align:left;"><h2>Question 2. Prediction Model 2<span style="float:right;">[20 marks]</span></h2></div>

#### Question 2a <span style="color:red">(10 marks)</span>

**TASK**: Build a different machine learning model for the same prediction task.
- Choose a model covered in the lectures or explain your choice of a different method. If you choose a different method, provide at least two arguments to justify your choice compared to the ones covered in the lectures. 
- Specify which model you selected and why. 
- List the key parameters of your chosen model (Model 2).
- Provide a code implementation for the selected method.

<b>Q2a answer</b>: 

In [14]:
# Your code here


---

### Question 2b <span style="color:red">(10 marks)</span>
**TASK**: Evaluate the performance of your new model and compare it to Prediction Model 1.
- Analyze whether the new model performs better or worse and explain why.
    - Base your evaluation on the same metrics used in Question 1d.
- Include one plot visually comparing the performance of both models.
- Provide a brief textual explanation interpreting the results.

<b>Q2b answer</b>: 

In [None]:
# Your code here


---

<div class="alert alert-success" style="text-align:left;"><h2>Question 3. Comparison and Improvement<span style="float:right;">[30 marks]</span></h2></div>

#### Question 3a <span style="color:red">(15 marks)</span>
**TASK**: Analyze the impact of removing the least important feature from Prediction Model 1.
- Identify and remove the least important feature. 
- Retrain the Linear Regression model and evaluate its performance. 
- Compare the results before and after feature removal.
- Provide a code implementation and a justification explaining the impact on model performance.


<b>Q3a answer</b>:

In [None]:
# Your code here


---

#### Question 3b <span style="color:red">(15 marks)</span>
**TASK**: Based on your observations, suggest strategies for improving future models when predicting on new data.
- Discuss potential improvements. 

<b>Hint</b>: based on relevant analysis, feature selection, feature scaling and data processing (e.g. resolve imbalanced samples, errors and outliers, etc.) could all potentially improve the model by reducing training time, fixing overfitting and improving interpretability, etc. 
You can also explore external resources for other potential approaches or techniques.<br>

<b>Note</b>: Coding is optional here, but your answers should be supported by relevant analysis or justifications.

<b>Q3b answer</b>:

In [None]:
# Add your answer here

---

## Appendix. Coursework Instructions

<b>Coursework Support</b>:
- COMP1008 computing tutorials and exercises on data processing and machine learning models on different example problems
- Example code building and analysing machine learning models in COMP1008 lectures slides on 'Machine learning'
- In the computing sessions, Q&A support for developing .ipynb projects
- In Teams channel 'COMP1008 2024/25 / Questions': support of common questions

<b>Marks</b>: in total 100 marks (count for 25% in COMP1008), awarded on the basis of:
- knowledge and understanding on the theories covered in lectures when answering the questions in the Jupyter Notebook report
- how informative and well presented your code, visualisations and results are (e.g. necessary labels in plots)
- self-learning ability making use of tutorial materials and online resources
- problem solving skills to obtain the answers and results for the specific dataset
- concise report with key details, e.g. parameters, data, etc. for others to repeat your methods and obtain the same results.

For more information on COMP1008 assessment please refer to the coursework issue in Moodle ('Course Content / Assessment').

<b>Format</b>:
- One single .ipynb file named 202425_COMP1008_cw_XXX.ipynb, where XXX is your username (e.g. psxyz)
- The .ipynb file should include your code and answers, using this given .ipynb template (please add cells as needed)
- You could use additional Python libraries as you wish, in addition to the ones demonstrated in the computing sessions
- There are multiple ways using different methods to complete the tasks. These are fine as long as all answers and analysis are supported by the code implemented in Jupyter Notebook, not by using other means (e.g. operations in Excel, or by using other languages, etc.).

<b>Submission</b>: 
- Deadline: <b><font color = "red">24 March, 3pm</font></b>.
- Late submission leads to a 5% deduction of the coursework on each weekday. Work submitted one week late will receive a 0 for the coursework.
- Method: in Moodle submit a single .ipynb file named 202425_COMP1008_cw_XXX.ipynb
- If you can’t submit your coursework on time due to ECs, please contact Student Services and your personal tutor ASAP

<b>Note: Plagiarism vs. Group Discussions</b> 

As you should know, plagiarism is completely unacceptable and will be dealt with according to the University's standard policies.<br>
Students are encouraged to have only general discussions on the theory (not the specific questions) when completing the coursework.<br>
It is important that when you actually do your coursework and write the answers, you do it individually.<br>
Do NOT, under any circumstances, share your report, code or figures, etc. with anyone else.