# Using the CRISP-DM Method for MLN 601 Machine Learning
# Assessment 2: Classification 







# 1. Stage One - Determine Business Objectives and Assess the Situation  <a class="anchor"></a>
The traditional process of winemaking relies heavily on subjective, time-consuming and expensive sensory analysis by human experts to certify quality. This assessment often occurs at the final stage of production, which creates significant financial risk if a batch is deemed unsatisfactory (Nebot et al., 2015). This project confronts this challenge by leveraging machine learning to create a proactive and data-driven quality control system.


The objective of this project is to develop a robust binary classification model capable of predicting wine quality with high accuracy. The analysis will use the Wine Quality dataset from the UCI Machine Learning Repository, which contains detailed physicochemical measurements. To frame this as a classification problem, the original quality score (on a scale from 0 to 10) is transformed into a binary variable: wines with a score below 6 are labeled 'low' quality (1), and those with a score of 6 or above are labeled 'high' quality (0).
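The labeling rule above can be sketched in a few lines of pandas. This is a minimal illustration, assuming the raw score lives in a Series; the function name `binarize_quality` and the toy scores are hypothetical and not taken from the dataset.

```python
import pandas as pd


def binarize_quality(quality: pd.Series, threshold: int = 6) -> pd.Series:
    """Map raw scores to the binary target: 1 = 'low' (< threshold), 0 = 'high'."""
    return (quality < threshold).astype(int)


# Toy scores for illustration only (not real samples)
toy_scores = pd.Series([3, 5, 6, 7, 8], name="quality")
toy_target = binarize_quality(toy_scores)
print(toy_target.tolist())  # [1, 1, 0, 0, 0]
```

Keeping the threshold as a parameter makes it easy to test how sensitive the class balance is to the cutoff.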


By training prediction algorithms on this data, the analysis aims to uncover the key chemical indicators that distinguish high-quality wine from its lower-quality counterparts. The ultimate value of the resulting model will be measured by its predictive power, assessed through evaluation metrics such as accuracy, precision, recall and the area under the ROC curve (AUC-ROC). A successful model would provide winemakers with an objective tool to forecast quality early, enabling timely interventions and minimizing the risk of costly production failures.



## 1.1 Success Criteria
The success criteria for this project are fundamentally tied to the statistical power and reliability of the classification model. To ensure the final model is robust enough for practical application, a competitive evaluation of multiple algorithms and parameter sets will be conducted. The "champion" model will be selected based on its performance against two critical metrics:


**Primary Metric (F1-Score):** An F1-Score of $ \ge 0.70 $ for the 'low' quality (1) class is the most critical hurdle. This metric is chosen because it reflects the business trade-offs. A False Negative (failing to detect a low-quality batch) is the most costly error, potentially leading to wasted resources and reputational damage. A False Positive (flagging a good batch for review) is less costly but still inefficient. The F1-Score balances the two by rewarding the detection of bad batches (recall) while penalizing unnecessary interventions (precision).


**Overall Performance (AUC-ROC):** An AUC-ROC score of $ \ge 0.80 $ is required to confirm the discriminative ability of the model. This metric provides a holistic assessment of how well the model can distinguish between high and low-quality wines across all possible decision thresholds. A high AUC value indicates that the model consistently ranks low-quality wines as riskier than high-quality ones, rather than performing well only at one arbitrary cutoff value.
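As a rough sketch, both thresholds can be checked with scikit-learn's `f1_score` and `roc_auc_score`. The helper name `check_success_criteria`, the 0.5 cutoff and the toy labels and probabilities are assumptions made for illustration; `y_prob` is taken to be the predicted probability of the 'low' class (1).

```python
from sklearn.metrics import f1_score, roc_auc_score


def check_success_criteria(y_true, y_prob, cutoff=0.5, f1_min=0.70, auc_min=0.80):
    """Score predictions for the 'low' quality class (label 1) against both thresholds."""
    # Turn probabilities of class 1 into hard labels at the chosen cutoff
    y_pred = [1 if p >= cutoff else 0 for p in y_prob]
    f1_low = f1_score(y_true, y_pred, pos_label=1)
    auc = roc_auc_score(y_true, y_prob)
    passes = bool(f1_low >= f1_min and auc >= auc_min)
    return {"f1_low": f1_low, "auc": auc, "passes": passes}


# Toy labels and probabilities, made up for illustration only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]
report = check_success_criteria(y_true, y_prob)
print(report)
```

Note that the F1-Score depends on the cutoff, while AUC-ROC is computed from the probabilities and is independent of it.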


The model with the highest F1-Score among the candidates that also meets the minimum AUC-ROC threshold will be selected as the champion. The next critical step is to analyze its interpretability by extracting feature importance: each physicochemical variable is ranked by its contribution to the final quality prediction. Success here means identifying the 3-5 variables that most consistently drive the classification of a wine. For example, the model might reveal that low alcohol content combined with high volatile acidity is the most powerful predictor of a "low quality" rating. This elevates the model from a simple predictive tool to a diagnostic system. It gives winemakers a fine-grained, data-driven explanation of why a batch is at risk, so that they can target interventions at specific chemical characteristics and eventually improve the production process end to end.
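The selection rule and the feature ranking can be sketched in plain Python. The function names, model names, scores and importance values below are hypothetical placeholders, not results from this project.

```python
def select_champion(results, auc_min=0.80):
    """Return the model with the highest low-class F1 among those meeting the AUC floor."""
    eligible = {name: m for name, m in results.items() if m["auc"] >= auc_min}
    if not eligible:
        return None  # no candidate satisfies the AUC-ROC requirement
    return max(eligible, key=lambda name: eligible[name]["f1_low"])


def top_features(names, importances, k=3):
    """Rank features by importance (descending) and keep the top k names."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Placeholder scores, invented purely to exercise the rule
results = {
    "model_a": {"f1_low": 0.74, "auc": 0.78},  # best F1 but fails the AUC floor
    "model_b": {"f1_low": 0.72, "auc": 0.83},
    "model_c": {"f1_low": 0.69, "auc": 0.85},
}
champion = select_champion(results)
print(champion)  # model_b

# Placeholder importances, not measured values
best = top_features(["alcohol", "volatile acidity", "pH", "sulphates"],
                    [0.30, 0.25, 0.05, 0.15], k=3)
print(best)  # ['alcohol', 'volatile acidity', 'sulphates']
```

Applying the AUC floor before comparing F1 mirrors the wording of the criterion: AUC-ROC acts as a gate, F1 as the ranking metric.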

## 1.2 Assess the Current Situation<a class="anchor"></a>

This project is supported by a well-defined set of resources ensuring a smooth and effective workflow from data analysis to final reporting.
* **Personnel and Research**
The project will be executed by Md. Arifuzzaman Munaf, a postgraduate student specializing in Advanced Artificial Intelligence. This role encompasses responsibility for the entire project lifecycle including data preprocessing, exploratory analysis, model development, rigorous evaluation, and final documentation. The analysis will be supplemented by existing academic literature and research, primarily sourced through Google Scholar, to ground the project in established methodologies.


* **Data Source**
The analysis will use the renowned Wine Quality dataset from the UCI Machine Learning Repository. This dataset is composed of two separate files for red and white wine varieties, containing 1,599 and 4,898 samples respectively. Each sample is described by 11 physicochemical attributes (e.g., fixed acidity, alcohol) and a single quality score. The data is well-structured, complete and contains no missing values (Cortez et al., 2009).


* **Computational Environment**
All development and analysis will primarily take place within Google Colab, a cloud-based platform that provides free access to significant computational resources including ample RAM (≈12.5 GB) and optional GPU/TPU hardware acceleration for demanding tasks. Google Drive will be integrated for seamless data storage and access. A local M3-powered device with 16 GB of RAM will serve as a supplementary resource for offline development and debugging.


* **Software & Libraries**
   The project will be implemented in Python (v3.11.13), the default Colab kernel, with Python (v3.12.2) used for offline debugging. The rest of the computing stack will include:
   - **Data Manipulation**: pandas and numpy for efficient data handling.
   - **Visualization**: matplotlib and seaborn for exploratory data analysis.
   - **Machine Learning Models**: scikit-learn for building the modeling pipeline and XGBoost and LightGBM for implementing advanced gradient boosting algorithms to maximize predictive performance.
   - Version control will be managed using GitHub to ensure a reproducible and well-documented workflow.




# 2. Stage Two - Data Understanding <a class="anchor"></a>
The second stage of the CRISP-DM process requires you to acquire the data listed in the project resources. This initial collection includes data loading, if this is necessary for data understanding. For example, if you use a specific tool for data understanding, it makes perfect sense to load your data into this tool. If you acquire multiple data sources then you need to consider how and when you're going to integrate the various sources.

## 2.1 Initial Data Acquisition <a class="anchor"></a>
List the data sources acquired together with their locations, the methods used to acquire them, and any problems encountered along with the resolutions achieved. This will help both with future replication of this project and with the execution of similar projects. Ensure you are clear about the various ways in which you can import data into your Notebook. Data can be read directly from the source (e.g. a website) or uploaded into the notebook from your computer or cloud storage.

In [None]:
# Import Libraries Required
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [None]:
#Data source: 
#Source Query location: 
#path =  'F:/Projects/Data Science/Defaults/train_/train.csv' or URL 
# reads the data from the CSV file; the first row is used as the column headers
#df =  pd.read_csv(path, sep=',') 
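For the two Wine Quality files described in Section 1.2, acquisition might look like the sketch below. It assumes the files are semicolon-separated with a header row; the function name `load_wine` and the `wine_type` column are hypothetical additions, and combining the two files into one frame is a design choice rather than a requirement. In-memory strings stand in for the real files so the example runs without downloads.

```python
import io

import pandas as pd


def load_wine(red_source, white_source) -> pd.DataFrame:
    """Read both files and stack them, tagging each row with its wine type."""
    red = pd.read_csv(red_source, sep=";")      # assumption: ';' delimiter
    white = pd.read_csv(white_source, sep=";")
    red["wine_type"] = "red"                    # hypothetical helper column
    white["wine_type"] = "white"
    return pd.concat([red, white], ignore_index=True)


# Tiny in-memory stand-ins for the real files (values are illustrative)
red_csv = io.StringIO('"fixed acidity";"alcohol";"quality"\n7.4;9.4;5\n7.8;9.8;5\n')
white_csv = io.StringIO('"fixed acidity";"alcohol";"quality"\n7.0;8.8;6\n')

wine = load_wine(red_csv, white_csv)
print(wine.shape)  # (3, 4)
```

`load_wine` accepts anything `pd.read_csv` accepts, so the same call works with local paths, Google Drive paths or URLs.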

## 2.2 Describe Data <a class="anchor"></a>
Data description of the data that has been acquired including its format, its quantity (for example, the number of records and fields in each table), the identities of the fields and any other surface features which have been discovered. Evaluate whether the data acquired satisfies your requirements to solve the problem.

In [None]:
#df.columns, df.shape, df.dtypes, df.describe(), df.info() and df.head(10) Use Pandas to explore and clean up your tabular data 

## 2.3 Verify Data Quality <a class="anchor"></a>

Examine the quality of the data:

- Is the data complete (does it cover all that you require)?
- Is it correct, or does the data contain errors ?
- Are there missing values in the data? If so, where do they occur?
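The questions above can be answered with a short per-column report. This is a minimal sketch; the function name `quality_report` and the toy frame are hypothetical, and the duplicate count is an extra check not listed above.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype and missing-value counts for every column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (100 * df.isna().mean()).round(1),
    })


# Toy frame with one missing value and one duplicated row (illustrative only)
toy = pd.DataFrame({
    "alcohol": [9.4, np.nan, 10.1, 9.4],
    "pH": [3.5, 3.2, 3.3, 3.5],
})
report = quality_report(toy)
n_duplicates = int(toy.duplicated().sum())
print(report)
print("duplicate rows:", n_duplicates)
```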

### 2.3.1. Outliers <a class="anchor"></a>
At this point, we may also want to remove any outliers. These can be due to typos in data entry, mistakes in units, or they could be legitimate but extreme values or rare events. To avoid discarding legitimate values, only remove points that meet the definition of extreme outliers:

https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm

- Below the first quartile − 3 ∗ interquartile range
- Above the third quartile + 3 ∗ interquartile range
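The two fences above translate directly into a boolean mask. This is a minimal sketch, assuming a numeric pandas Series; the function name `extreme_outlier_mask` and the toy values are hypothetical.

```python
import pandas as pd


def extreme_outlier_mask(values: pd.Series, k: float = 3.0) -> pd.Series:
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR (k = 3 for extreme outliers)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy values for illustration: Q1 = 3, Q3 = 7, IQR = 4, fences at -9 and 19
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])
mask = extreme_outlier_mask(toy)
print(toy[mask].tolist())  # [100.0]
```

Filtering with `df[~mask]` would then keep only the non-extreme rows for the column being checked.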

## 2.4 Initial Data Exploration  <a class="anchor"></a>
During this stage, address data questions using querying, data visualization and reporting techniques. These may include:

- **Distribution** of key attributes (for example, the target attribute of a prediction task)
- **Relationships** between pairs or small numbers of attributes
- Results of **simple aggregations**
- **Properties** of significant sub-populations
- **Simple** statistical analyses

These analyses may contribute to or refine the data description and quality aspects of your report, and feed into other data preparation steps needed for further analysis. 

- **Data exploration component of your report** - Describe results of your data exploration, including first findings or initial hypothesis and their impact on the remainder of the project. Include graphs and plots here to indicate data characteristics that suggest further examination of interesting data subsets.

### 2.4.1 Distributions  <a class="anchor"></a>

In [None]:
def count_values_table(series):
    """Return the count and percentage share of each distinct value in a Series."""
    count_val = series.value_counts()
    count_val_percent = (100 * count_val / len(series)).round(1)
    # keys= names the two columns directly; renaming columns 0 and 1 had no
    # effect because the concatenated columns are not labelled 0 and 1
    return pd.concat([count_val, count_val_percent], axis=1,
                     keys=['Count Values', '% of Total Values'])

In [None]:
# Histogram
def hist_chart(df, col):
    """Plot a histogram of one column, ignoring missing values."""
    plt.style.use('fivethirtyeight')
    plt.hist(df[col].dropna(), edgecolor='k')
    plt.xlabel(col)
    plt.ylabel('Number of Entries')
    plt.title('Distribution of ' + col)

In [None]:
# col = 'account_risk_band'
# Histogram & Results
# hist_chart(df, col)
# count_values_table(df.account_risk_band)

### 2.4.2 Correlations  <a class="anchor"></a>
Can we derive any correlations from this dataset? A pairplot shows pairwise relationships, per-variable distributions and, with `kind='reg'`, a fitted regression line for each pair.
Correlograms are well suited to exploratory analysis because they let you quickly inspect the relationship between every pair of variables in your matrix.
This is easy to do with seaborn: just call the `pairplot` function.

Pairplot documentation is found here: https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [None]:
#Seaborn allows to make a correlogram or correlation matrix really easily. 
#sns.pairplot(df.dropna().drop(['x'], axis=1), hue='y', kind ='reg')

#plt.show()


In [None]:
#df_agg = df.drop(['x'], axis=1).groupby(['y']).sum()
#df_agg = df.groupby(['y']).sum()
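Alongside the pairplot, a numeric ranking of each feature's correlation with the target can help decide which relationships to examine first. This is a minimal sketch; the function name `correlation_with_target` and the toy frame are hypothetical, and Pearson correlation against a binary target is only a rough first indicator.

```python
import pandas as pd


def correlation_with_target(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric feature with the target, strongest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy frame for illustration only (target: 1 = 'low', 0 = 'high')
toy = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0],
    "volatile_acidity": [0.8, 0.6, 0.7, 0.3],
    "target": [1, 1, 0, 0],
})
ranking = correlation_with_target(toy, "target")
print(ranking.round(3))
```

Sorting by absolute value keeps strong negative relationships near the top instead of at the bottom of the list.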

# 3. Stage Three - Data Preparation <a class="anchor"></a>

## 3.1 Select Your Data <a class="anchor"></a>
This is the stage of the project where you decide on the data that you're going to use for analysis. The criteria you might use to make this decision include the relevance of the data to your machine learning goal, the quality of the data, and also technical constraints such as limits on data volume or data types. Note that data selection covers selection of attributes (columns) as well as selection of records (rows) in a table.

Rationale for inclusion/exclusion - List the data to be included/excluded and the reasons for these decisions.

In [None]:
X_train_regr = df.drop(['date_maint', 'account_open_date'], axis = 1)
X_train = df.drop(['target', 'date_maint', 'account_open_date'], axis = 1)
X_test = test.drop(['date_maint', 'account_open_date'], axis = 1)
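For a frame that carries a binary `target` column, separating features from the label and holding out a stratified test set could be sketched as follows. The function name `split_features_target`, the 80/20 split, the seed and the toy values are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_features_target(df, target="target", test_size=0.2, seed=42):
    """Drop the label from the features and return a stratified train/test split."""
    X = df.drop(columns=[target])
    y = df[target]
    # stratify=y keeps the class ratio the same in both partitions
    return train_test_split(X, y, test_size=test_size, stratify=y, random_state=seed)


# Toy frame with a balanced target (illustrative values only)
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 9.1, 10.0, 11.5, 12.4, 9.6],
    "pH": [3.5, 3.2, 3.3, 3.1, 3.4, 3.6, 3.2, 3.3, 3.0, 3.5],
    "target": [1, 1, 0, 0, 0, 1, 1, 0, 0, 1],
})
X_train, X_test, y_train, y_test = split_features_target(toy)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```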

## 3.2 Clean The Data <a class="anchor"></a>
This task involves raising the data quality to the level required by the analysis techniques that you've selected. This may involve selecting clean subsets of the data, the insertion of suitable defaults, or more ambitious techniques such as the estimation of missing data by modelling.
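A minimal cleaning sketch is shown below, covering two of the options mentioned above: selecting a clean subset (here, dropping exact duplicate rows) and inserting suitable defaults (here, the column median). Both choices and the function name `clean` are assumptions for illustration; whether duplicates should be dropped at all is a judgment call for the dataset at hand.

```python
import numpy as np
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, then fill numeric gaps with each column's median."""
    cleaned = df.drop_duplicates().reset_index(drop=True)
    numeric_cols = cleaned.select_dtypes("number").columns
    medians = cleaned[numeric_cols].median()
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(medians)
    return cleaned


# Toy frame with one duplicated row and one missing value (illustrative only)
toy = pd.DataFrame({
    "alcohol": [9.4, 9.4, np.nan, 10.0],
    "pH": [3.5, 3.5, 3.2, 3.3],
})
cleaned = clean(toy)
print(cleaned)
```

Computing the median after removing duplicates avoids letting repeated rows pull the fill value.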

# 4. Stage Four - Modelling <a class="anchor"></a>
As the first step in modelling, you'll select the actual modelling technique that you'll be using, e.g. a decision tree.
  


## 4.1. Modelling technique <a class="anchor"></a>
Document the actual modelling technique that is to be used.

Import Models in your code below:
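One way to organise the imports is a dictionary of candidate estimators, so that every model can be fitted and scored in the same loop. The three scikit-learn classifiers and their settings below are illustrative choices only; other estimators that follow the scikit-learn interface can be added to the same dictionary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Candidate models keyed by a short name; hyperparameters are placeholders
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

for name, model in candidates.items():
    print(name, type(model).__name__)
```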

## 4.2. Modelling assumptions <a class="anchor"></a>
Many modelling techniques make specific assumptions about the data, for example that all attributes have uniform distributions, no missing values allowed, class attribute must be symbolic etc. Record any assumptions made.


## 4.3. Build Model <a class="anchor"></a>
Run the modelling tool on the prepared dataset to create your model.

**Parameter settings** - With any modelling tool there are often a large number of parameters that can be adjusted. List the parameters and their chosen values, along with the rationale for the choice of parameter settings.

**Model** - This is the actual model produced by the modelling tool, not a report on the model.

**Model description** - Describe the resulting model, report on the interpretation of the model and document any difficulties encountered with their meanings.
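A minimal build sketch is shown below, using a scikit-learn `Pipeline` so that scaling and the classifier are fitted together. The data is synthetic (generated with `make_classification` as a stand-in with 11 features), and the logistic regression and its parameter values are illustrative assumptions, not the project's chosen model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 11 features (not the wine data)
X, y = make_classification(n_samples=200, n_features=11, n_informative=5,
                           random_state=0)

model = Pipeline([
    ("scale", StandardScaler()),                        # put features on one scale
    ("clf", LogisticRegression(C=1.0, max_iter=1000)),  # placeholder settings
])
model.fit(X, y)

# Record the parameter settings that were chosen, for the report
chosen_params = {key: value for key, value in model.get_params().items()
                 if key in ("clf__C", "clf__max_iter")}
print(chosen_params)

# Predicted probability of class 1 for each sample
proba_class_1 = model.predict_proba(X)[:, 1]
```

Reading the settings back from `get_params()` keeps the documented parameters in sync with what was actually fitted.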

## 4.4. Assess Model <a class="anchor"></a>
Interpret the models according to your knowledge, your prediction success criteria and your desired test design. Judge the success of the modelling and discovery techniques technically, then discuss the machine learning results in the business context. This task only considers models, whereas the evaluation phase also takes into account all other results that were produced in the course of the project.

At this stage you should rank the models and assess them according to the evaluation criteria. You should take the business objectives and business success criteria into account as far as you can here. In most ML projects a single technique is applied more than once and results are generated with several different techniques. 

**Model assessment** - Summarise the results of this task, list the qualities of your generated models (e.g. in terms of accuracy) and rank their quality in relation to each other.

**Revised parameter settings** - According to the model assessment, revise parameter settings and tune them for the next modelling run. Iterate model building and assessment until you strongly believe that you have found the best model. Document all such revisions and assessments.
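The revise-and-rerun loop can be partly automated with scikit-learn's `GridSearchCV`. The sketch below uses synthetic data, a decision tree and a small placeholder grid; `scoring="f1"` scores the class labelled 1, and all of these choices are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the wine data)
X, y = make_classification(n_samples=300, n_features=11, n_informative=5,
                           random_state=0)

# Placeholder grid: 3 x 2 = 6 parameter combinations
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1",  # F1 for the positive class (label 1)
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`search.cv_results_` keeps the score of every combination, which can be used to document each revision and its assessment.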

# 5. Stage Five - Evaluate <a class="anchor"></a>
Previous steps deal with the accuracy and generality of the model. During this step you should assess the degree to which the model meets your business objectives and seek to determine if there is some business reason why this model is deficient.

**Assessment of machine learning results** - Summarise assessment results in terms of business success criteria, including a final statement regarding whether the project meets the initial business objectives.

**Approved models** - After assessing models with respect to business success criteria, the generated models that meet the selected criteria become the approved models. For this initial assessment, you are only required to consider one model.

# 6. Stage Six - Deploy <a class="anchor"></a>

In the deployment stage you would determine a strategy for deploying the model and document it here, together with plans for ongoing monitoring and maintenance. This is particularly important because a predictive machine learning model can significantly impact business operations. For the purposes of this assessment we will use this section to conclude the report. The previous steps should contain your code and narrative text inserted at the relevant sections. Here, you should look at lessons learnt: the things that went right, what went wrong, what you did well and areas for improvement. Additionally, summarise any other experiences during the project.

