# COMP1008 2024/25 Coursework - Wine Quality Prediction with Machine Learning

#### **Student Name**: Gabriel Bridger
#### **Student ID**: 2086810 

## Task description

**Main Task**: Utilizing the provided Red Wine Quality dataset, build a Linear Regression Model and another Machine Learning Model of your choice to predict wine quality. Employ appropriate methods from the `pandas`, `matplotlib`, and `sklearn` libraries to analyze and process the dataset for building predictive models.



**Format**: Use this Jupyter Notebook as a template to write your report in `Markdown` cells, supported by your source code in Code cells. Ensure your code produces the corresponding plots or results addressing the questions. Rename this .ipynb file to `202425_COMP1008_cw_XXX.ipynb`, where XXX is your username (e.g., psxyz), and submit it to Moodle by <b><font color = "red">24 March, 3pm</font></b>.

**Marks**: The coursework is worth a total of 100 marks (accounting for 25% of the COMP1008 module grade). Marks will be awarded based on your understanding of machine learning theories, the informativeness and presentation of your code, visualizations, and results (e.g., code comments, necessary labels in plots), your self-learning ability in solving the specific problem, and how succinct and clear your report writing is.

Please check the detailed instructions at the end of this template file.

<div class="alert alert-success" style="text-align:left;">
<h2>Question 1. Prediction Model 1 - Linear Regression Model<span style="float:right;">[50 marks]</span></h2></div>

#### Question 1a <span style="color:red">(5 marks)</span> 
**TASK**: Briefly explain why the Red Wine Quality dataset is suitable for linear regression analysis.
- Identify at least 3 characteristics that make this dataset appropriate for regression.
- Use 3 bullet points (one for each characteristic) to present your answer concisely.
- Your explanation should reflect your understanding of the linear regression model.

<b>Q1a Answer</b>:

**Continuous Numerical Values**    
The dataset consists of continuous numerical values, so every attribute can be used directly as a term in a linear equation. Linear regression fits a line (described by a linear equation) to represent the relationship between the given attributes, so the more samples the algorithm is given, and the wider the range of values they cover, the more accurately the fitted line will represent the relationship between the inputs and the selected output.

**Clear Target Values**    
The dataset has a clear target attribute, `quality`, so a loss value can be calculated to measure how well the fitted equation predicts the output from the set of inputs. An iterative fitting procedure can compare the current loss with the previous loss: if the loss has decreased, it keeps shifting the regression line in that direction on the next iteration, rather than blindly guessing which change will improve the accuracy of the predicted relationship.

**Likely Linear Relation**    
Many of the input attributes in the dataset are likely to be linearly related to wine quality, meaning that a linear equation probably exists that approximates the overall relationship between the inputs and the output. Because the aim of linear regression is to establish exactly this kind of linear relationship, a linear regression model should be able to predict it reasonably accurately.
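
The loss-guided fitting described under **Clear Target Values** can be sketched as follows. This is a minimal illustration on synthetic data, not the wine dataset: the data, the `learning_rate` value and the `mse` helper are all assumptions made for the example. Note that `sklearn`'s `LinearRegression` solves the least-squares problem directly rather than iterating, so the loop below only illustrates the idea of using the loss to decide which way to move the line.

```python
import numpy as np

# Synthetic data (illustrative only): y is roughly 2*x + 1 plus small noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.05, size=200)


def mse(w, b):
    """Mean squared error of the line y = w*x + b on the synthetic data."""
    return float(np.mean((w * x + b - y) ** 2))


w, b = 0.0, 0.0        # start from an arbitrary line
learning_rate = 0.5    # hypothetical step size chosen for this toy problem
losses = [mse(w, b)]

for _ in range(500):
    error = w * x + b - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Step against the gradient, i.e. in the direction that lowers the loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    losses.append(mse(w, b))

print(f"initial loss: {losses[0]:.4f}, final loss: {losses[-1]:.4f}")
print(f"fitted line: y = {w:.2f} * x + {b:.2f}")
```

Because the target is known for every sample, the loss can be recomputed after each step, which is what lets the procedure confirm that it is moving in a useful direction.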

---

#### Question 1b <span style="color:red">(15 marks)</span>

**TASK**: Analyze the dataset using appropriate methods from the `pandas` and/or `matplotlib` libraries. 
- Identify potential issues with the current dataset, specify which part(s) of the dataset are affected. Explain what could go wrong if the data is not properly pre-processed.
- Provide at least 2 short-code solutions demonstrating how you analyze these issues.  
- Briefly explain how each code snippet helps evaluate data quality issues.


<b>Q1b answer</b>:

### Potential Issues Within the Current Dataset
**NaN Values**

`NaN` (Not a Number) values exist within the dataset, for example in the `density` and `free sulfur dioxide` attributes.

If pre-processing isn't applied to remove the `NaN` values from the dataset, they could break the model: when an input is `NaN`, the formula (the equation of the regression line) cannot be used to predict an output value for it, which would cause the program to fail with an error.
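
One way to check for this issue is to count the missing values in each column. The sketch below uses a small toy frame in place of the wine data (the values are illustrative only); on the real dataset the same calls would be applied to `wineQualDf`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the wine data (values are illustrative only).
toy = pd.DataFrame({
    "density": [0.9978, np.nan, 0.9970, 0.9980],
    "free sulfur dioxide": [np.nan, 25.0, np.nan, 17.0],
    "pH": [3.51, 3.20, 3.26, 3.16],
})

# Number of missing values in each column.
nan_per_column = toy.isna().sum()
print(nan_per_column)

# Number of rows that contain at least one missing value.
rows_with_nan = int(toy.isna().any(axis=1).sum())
print(f"rows with at least one NaN: {rows_with_nan}")
```

The per-column counts show which attributes are affected, and the row count shows how much data would be lost if those rows were simply dropped.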

**Outliers Within the Dataset**

Some columns, such as `chlorides` and `sulphates`, contain extreme values (outliers), which make it harder for the algorithm to fit a regression line to the dataset. These values need to be identified and then removed from the dataset to avoid harming the model.

If pre-processing isn't properly applied, the outliers can skew the fitted relationship between the inputs and the output, resulting in inaccurate predictions for a given set of inputs.
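
A quick way to look for such extreme values is to compare each column's maximum with its quartiles. The sketch below does this on a toy frame (illustrative values, not the real data); the name `max_gap_in_iqrs` is a hypothetical helper quantity introduced for the example.

```python
import pandas as pd

# Toy frame standing in for the wine data (values are illustrative only).
toy = pd.DataFrame({
    "chlorides": [0.076, 0.098, 0.092, 0.075, 0.080, 0.611],
    "sulphates": [0.56, 0.68, 0.65, 0.58, 0.60, 0.62],
})

# Five-number summary of every column.
summary = toy.describe().loc[["min", "25%", "50%", "75%", "max"]]
print(summary)

# How far the maximum lies above the upper quartile, measured in IQRs.
iqr = summary.loc["75%"] - summary.loc["25%"]
max_gap_in_iqrs = (summary.loc["max"] - summary.loc["75%"]) / iqr
print(max_gap_in_iqrs)
```

A column whose maximum sits many interquartile ranges above its upper quartile (here the toy `chlorides` column) is a candidate for closer inspection, for example with a box plot.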

**Not Normally Distributed**

Some attributes, such as `residual sugar`, have skewed distributions. Strictly, linear regression assumes that the residuals, rather than the inputs, are normally distributed, but heavily skewed inputs tend to contain extreme values that distort the fit.

If pre-processing isn't properly applied, the model may be more sensitive to the outliers/extreme values present in some columns: it will try to fit a linear relationship that satisfies both the extreme and the typical values, so the regression line will end up somewhere in between and its predictions could be inaccurate.
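
Skewness can be measured directly with pandas. The sketch below uses a toy frame (illustrative values only) and also applies a log transform, which is one common way to reduce right skew; the transform is an assumption added for illustration and is not part of the analysis above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the wine data (values are illustrative only).
toy = pd.DataFrame({
    "residual sugar": [1.9, 2.6, 2.3, 1.9, 2.0, 2.2, 10.7, 15.5],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.40, 3.30, 3.45, 3.38],
})

# Sample skewness per column: values far from 0 indicate a skewed distribution.
skew_before = toy.skew()
print(skew_before)

# log1p(x) = log(1 + x) compresses large values and so shortens the right tail.
transformed = toy.copy()
transformed["residual sugar"] = np.log1p(toy["residual sugar"])
skew_after = transformed.skew()
print(skew_after)
```

Comparing the skewness before and after shows whether the transform has made the column more symmetric; a histogram of the column gives the same information visually.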

In [25]:
import pandas as pd

# Load the dataset into a dataframe so that it can be pre-processed.
wineQualDf = pd.read_csv('winequality-red.csv')

In [26]:
def removeNaNValues(df: pd.DataFrame) -> pd.DataFrame:
    # Drop every row that contains at least one NaN value.
    return df.dropna()

removeNaNValues(wineQualDf)


| Unnamed: 0 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | 6.9 | 0.685 | 0.00 | 2.5 | 0.105 | 22.0 | 37.0 | 0.99660 | 3.46 | 0.57 | 10.6 | 6.58 |
| 32 | 8.3 | 0.655 | 0.12 | 2.3 | 0.083 | 15.0 | 113.0 | 0.99660 | 3.17 | 0.66 | 9.8 | 5.34 |
| 33 | 6.9 | 0.605 | 0.12 | 10.7 | 0.073 | 40.0 | 83.0 | 0.99930 | 3.45 | 0.52 | 9.4 | 6.11 |
| 34 | 5.2 | 0.320 | 0.25 | 1.8 | 0.103 | 13.0 | 50.0 | 0.99570 | 3.38 | 0.55 | 9.2 | 5.44 |
| 35 | 7.8 | 0.645 | 0.00 | 5.5 | 0.086 | 5.0 | 18.0 | 0.99860 | 3.40 | 0.55 | 9.6 | 6.71 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1593 | 6.8 | 0.620 | 0.08 | 1.9 | 0.068 | 28.0 | 38.0 | 0.99651 | 3.42 | 0.82 | 9.5 | 6.03 |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6.91 |
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6.99 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5.14 |


In [27]:
def replaceNaNValues(df: pd.DataFrame) -> pd.DataFrame:
    # Fill the NaN values in each column with that column's mode (most common value).
    return df.apply(lambda col: col.fillna(col.mode()[0]))

replaceNaNValues(wineQualDf)

| Unnamed: 0 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 6.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5.10 |
| 1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 6.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5.09 |
| 2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 6.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5.66 |
| 3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 6.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6.15 |
| 4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 6.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5.99 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99680 | 3.45 | 0.58 | 10.5 | 5.72 |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6.91 |
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6.99 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5.14 |


**Clearing NaN Values**

There are 2 different ways that `NaN` values can be handled: by removing the associated row from the dataset, or by getting the mode of the given column and replacing each `NaN` value in the column with the modal value.

**Removing NaN Rows**

The first way to handle `NaN` values is simply to remove the rows that contain `NaN` from the dataset. pandas has a built-in method for this, `.dropna()`, which looks through the dataframe for `NaN` values and drops the corresponding rows of data.

**Replacing NaN Values**

The other way to handle `NaN` values is to identify each column that contains a `NaN` value, get the mode (most common value) of that entire column, and replace every `NaN` value within the column with the modal value. This is then repeated for every column within the dataset.
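
The trade-off between the two approaches can be seen on a small toy frame (illustrative values only). The median fill at the end is an alternative added here as an assumption, since a median is often used for continuous columns in which few values repeat exactly.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the wine data (values are illustrative only).
toy = pd.DataFrame({
    "density": [0.9978, np.nan, 0.9970, 0.9978],
    "free sulfur dioxide": [6.0, 25.0, np.nan, 6.0],
})

# Option 1: drop every row that contains a NaN (loses data).
dropped = toy.dropna()

# Option 2: keep every row and fill NaNs with the column's mode.
filled_mode = toy.apply(lambda col: col.fillna(col.mode()[0]))

# Alternative (assumption, not used above): fill NaNs with the column's median.
filled_median = toy.fillna(toy.median())

print(f"rows before: {len(toy)}, after dropna: {len(dropped)}, after fill: {len(filled_mode)}")
print(filled_mode)
```

Dropping rows guarantees that only observed values are used but shrinks the dataset, whereas filling keeps every row at the cost of introducing values that were never measured.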

In [34]:
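def removeOutliers(df: pd.DataFrame) -> pd.DataFrame:
    # First and third quartiles of every column.
    quart1 = df.quantile(0.25)
    quart3 = df.quantile(0.75)

    IQrange = quart3 - quart1

    # Values more than 3 x IQR beyond the quartiles count as outliers.
    lowerBound = quart1 - 3 * IQrange
    upperBound = quart3 + 3 * IQrange

    # Keep only the rows in which no column holds an outlier.
    strippedDf = df[~((df < lowerBound) | (df > upperBound)).any(axis=1)]

    return strippedDf


removeOutliers(wineQualDf)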

| Unnamed: 0 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | NaN | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5.10 |
| 1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | NaN | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5.09 |
| 2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | NaN | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5.66 |
| 3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | NaN | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6.15 |
| 4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | NaN | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5.99 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | NaN | 3.45 | 0.58 | 10.5 | 5.72 |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6.91 |
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6.99 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5.14 |


**Removing Outliers**

When removing outliers, the harder problem isn't removing the values from the dataframe but deciding what constitutes an outlier in the first place.

In the code snippet above an outlier is identified as a value that is either:
- *greater than the third quartile of the dataset plus 3 × the interquartile range*
- *less than the first quartile of the dataset minus 3 × the interquartile range*

This method calculates the 25th and 75th percentiles of each column in the dataframe, then uses the interquartile range between these 2 values to produce a lower and an upper threshold for that column. Finally, it checks every column and removes any row containing a value that is either above the upper threshold or below the lower threshold of its column.
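
The choice of multiplier determines how strict the filter is. The sketch below compares the 3 × IQR rule used above with the more common 1.5 × IQR rule on a toy column (illustrative values only); `iqr_filter` and its parameter `k` are hypothetical names introduced for this example.

```python
import pandas as pd


def iqr_filter(df: pd.DataFrame, k: float) -> pd.DataFrame:
    """Keep the rows whose values all lie within k IQRs of the quartiles."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return df[~outside.any(axis=1)]


# Toy column with one mildly high value (0.120) and one extreme value (0.611).
toy = pd.DataFrame({
    "chlorides": [0.070, 0.072, 0.075, 0.076, 0.078,
                  0.080, 0.085, 0.090, 0.120, 0.611],
})

kept_strict = iqr_filter(toy, k=1.5)   # common default multiplier
kept_lenient = iqr_filter(toy, k=3.0)  # multiplier used in the answer above

print(f"rows kept with k=1.5: {len(kept_strict)}")
print(f"rows kept with k=3.0: {len(kept_lenient)}")
```

With the larger multiplier only the most extreme value is dropped, so more of the data is retained for training, at the cost of leaving milder outliers in place.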


---

#### Question 1c <span style="color:red">(20 marks)</span>
**TASK**: Apply appropriate data preprocessing techniques to address the issues identified in Question 1b.
- Provide a code solution that resolves the identified data issue(s).
- Briefly explain the methods and parameters used in your solution. Ensure your explanation clearly justifies how these techniques improve data quality and suitability for analysis.


<b>Q1c answer</b>: 

In [29]:
# Your code here


---

#### Question 1d <span style="color:red">(10 marks)</span>

**TASK**: Train and evaluate a Linear Regression model using the preprocessed dataset.   
- Print the model's weights.  
- Print the model's accuracy. 
    - Evaluate the model using at least three different metrics.
    - Briefly discuss the advantages of each metric in assessing model performance.


<b>Q1d answer</b>: 

In [30]:
# Your code here


---

<div class="alert alert-success" style="text-align:left;"><h2>Question 2. Prediction Model 2<span style="float:right;">[20 marks]</span></h2></div>

#### Question 2a <span style="color:red">(10 marks)</span>

**TASK**: Build a different machine learning model for the same prediction task.
- Choose a model covered in the lectures or explain your choice of a different method. If you choose a different method, provide at least two arguments to justify your choice compared to the ones covered in the lectures. 
- Specify which model you selected and why. 
- List the key parameters of your chosen model (Model 2).
- Provide a code implementation for the selected method.

<b>Q2a answer</b>: 

In [31]:
# Your code here


---

### Question 2b <span style="color:red">(10 marks)</span>
**TASK**: Evaluate the performance of your new model and compare it to Prediction Model 1.
- Analyze whether the new model performs better or worse and explain why.
    - Base your evaluation on the same metrics used in Question 1d.
- Include one plot visually comparing the performance of both models.
- Provide a brief textual explanation interpreting the results.

<b>Q2b answer</b>: 

In [None]:
# Your code here


---

<div class="alert alert-success" style="text-align:left;"><h2>Question 3. Comparison and Improvement<span style="float:right;">[30 marks]</span></h2></div>

#### Question 3a <span style="color:red">(15 marks)</span>
**TASK**: Analyze the impact of removing the least important feature from Prediction Model 1.
- Identify and remove the least important feature. 
- Retrain the Linear Regression model and evaluate its performance. 
- Compare the results before and after feature removal.
- Provide a code implementation and a justification explaining the impact on model performance.


<b>Q3a answer</b>:

In [None]:
# Your code here


---

#### Question 3b <span style="color:red">(15 marks)</span>
**TASK**: Based on your observations, suggest strategies for improving future models when predicting on new data.
- Discuss potential improvements. 

<b>Hint</b>: based on relevant analysis, feature selection, feature scaling and data processing (e.g. resolve imbalanced samples, errors and outliers, etc.) could all potentially improve the model by reducing training time, fixing overfitting and improving interpretability, etc. 
You can also explore external resources for other potential approaches or techniques.<br>

<b>Note</b>: Coding is optional here, but your answers should be supported by relevant analysis or justifications.

<b>Q3b answer</b>:

In [None]:
# Add your answer here

---

## Appendix. Coursework Instructions

<b>Coursework Support</b>:
- COMP1008 computing tutorials and exercises on data processing and machine learning models on different example problems
- Example code building and analysing machine learning models in COMP1008 lectures slides on 'Machine learning'
- In the computing sessions, Q&A support for developing .ipynb projects
- In Teams channel 'COMP1008 2024/25 / Questions': support of common questions

<b>Marks</b>: in total 100 marks (count for 25% in COMP1008), awarded on the basis of:
- knowledge and understanding on the theories covered in lectures when answering the questions in the Jupyter Notebook report
- how informative and well presented your code, visualisations and results are (e.g. necessary labels in plots)
- self-learning ability making use of tutorial materials and online resources
- problem solving skills to obtain the answers and results for the specific dataset
- concise report with key details, e.g. parameters, data, etc. for others to repeat your methods and obtain the same results.

For more information on the COMP1008 assessment, please refer to the coursework issue in Moodle ('Course Content / Assessment').

<b>Format</b>:
- One single .ipynb file named 202425_COMP1008_cw_XXX.ipynb, where XXX is your username (e.g. psxyz)
- The .ipynb file should include your code and answers, using this given .ipynb template (please add cells as needed)
- You could use additional Python libraries as you wish, in addition to the ones demonstrated in the computing sessions
- There are multiple ways using different methods to complete the tasks. These are fine as long as all answers and analysis are supported by the code implemented in Jupyter Notebook, not by using other means (e.g. operations in Excel, or by using other languages, etc.).

<b>Submission</b>: 
- Deadline: <b><font color = "red">24 March, 3pm</font></b>.
- Late submission leads to a 5% deduction of the coursework on each weekday. Work submitted one week late will receive a 0 for the coursework.
- Method: in Moodle submit a single .ipynb file named 202425_COMP1008_cw_XXX.ipynb
- If you can’t submit your coursework on time due to ECs, please contact Student Services and your personal tutor ASAP

<b>Note: Plagiarism vs. Group Discussions</b> 

As you should know, plagiarism is completely unacceptable and will be dealt with according to the University's standard policies.<br>
Students are encouraged to have only general discussions on the theory (not the specific questions) when completing the coursework.<br>
It is important that when you actually do your coursework and write the answers, you do it individually.<br>
Do NOT, under any circumstances, share your report, code or figures, etc. with anyone else.