# DX 601 Final Project

## Introduction

In this project, you will practice all the skills that you have learned throughout this module.
You will pick a data set to analyze from a provided list, and then perform a variety of analyses.
Most of the problems and questions are more open ended than your previous homework assignments, and you will be asked to explain your choices.
Most of them will have a particular type of solution implied, but it is up to you to figure out the details based on what you have learned in this module.

## Instructions

Each problem asks you to perform some analysis of the data, and usually answer some questions about the results.
Make sure that your question answers are well supported by your analysis and explanations; simply stating an answer without support will earn minimal points.

Notebook cells for code and text have been added for your convenience, but feel free to add additional cells.

## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx500-examples
* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Submission

This project will be entirely manually graded.
However, we may rerun some or all of your code to confirm that it works as described.

### Late Policy

The normal homework late policy for OMDS does not apply to this project.
Boston University requires final grades to be submitted within 72 hours of class instruction ending, so we cannot accommodate 5 days of late submissions.

However, we have delayed the due date of this project to be substantially later than necessary given its scope, and given you more days for submission with full credit than you would have had days for submission with partial credit under the homework late policy.
Finally, the deadlines for DX 601 and DX 602 were coordinated to be a week apart while giving ample time for both of their projects.

## Shared Imports

For this project, you are forbidden to use modules that were not loaded in this template.
While other modules are handy in practice, modules that trivialize these problems interfere with our assessment of your own knowledge and skills.

If you believe a module covered in the course material (not live sessions) is missing, please check with your learning facilitator.

In [1]:
import math
import sys

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats
import sklearn.linear_model

from sklearn.decomposition import PCA

## Problems

### Problem 1 (5 points)

Pick one of the following data sets to analyze in this project.
Load the data set, and show a random sample of 10 rows.

* [Iris data set](https://archive.ics.uci.edu/dataset/53/iris) ([PMLB copy](https://github.com/EpistasisLab/pmlb/tree/master/datasets/iris))
* [Breast Cancer Wisconsin](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic) ([PMLB copy](https://github.com/EpistasisLab/pmlb/tree/master/datasets/_deprecated_breast_cancer_wisconsin))
* [Wine Quality](https://archive.ics.uci.edu/dataset/186/wine+quality) ([PMLB - white subset only](https://github.com/EpistasisLab/pmlb/tree/master/datasets/wine_quality_white))


The PMLB copies of the data are generally cleaner and recommended for this project, but the other links are provided to give you more context.
To load the data from the PMLB GitHub repository, navigate to the `.tsv.gz` file in GitHub and copy the link from the "Raw" button.

If the data set you choose has more than ten columns, you may limit later analysis that is requested per column to just the first ten columns.

In [25]:
# Load Wine Quality dataset
WHITE_WINE_QUALITY_DATASET = "https://github.com/EpistasisLab/pmlb/raw/refs/heads/master/datasets/wine_quality_white/wine_quality_white.tsv.gz"

# Load dataset into a dataframe
df = pd.read_csv(WHITE_WINE_QUALITY_DATASET, sep="\t")

# Random sample of 10 rows
df.sample(10)

# Reproducible Random Sample
# df.sample(10, random_state=67) 

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2686 | 8.0 | 0.17 | 0.29 | 2.4 | 0.029 | 52.0 | 119.0 | 0.98944 | 3.03 | 0.33 | 12.9 | 6 |
| 2350 | 7.9 | 0.31 | 0.22 | 13.3 | 0.048 | 46.0 | 212.0 | 0.99942 | 3.47 | 0.59 | 10.0 | 5 |
| 466 | 7.0 | 0.14 | 0.32 | 9.0 | 0.039 | 54.0 | 141.0 | 0.9956 | 3.22 | 0.43 | 9.4 | 6 |
| 2273 | 6.1 | 0.46 | 0.32 | 6.2 | 0.053 | 10.0 | 94.0 | 0.99537 | 3.35 | 0.47 | 10.1 | 5 |
| 3194 | 5.7 | 0.16 | 0.32 | 1.2 | 0.036 | 7.0 | 89.0 | 0.99111 | 3.26 | 0.48 | 11.0 | 5 |
| 2158 | 7.4 | 0.18 | 0.27 | 1.3 | 0.048 | 26.0 | 105.0 | 0.994 | 3.52 | 0.66 | 10.6 | 6 |
| 4089 | 6.8 | 0.27 | 0.24 | 4.6 | 0.098 | 36.0 | 127.0 | 0.99412 | 3.15 | 0.49 | 9.6 | 6 |
| 1943 | 6.3 | 0.25 | 0.44 | 11.6 | 0.041 | 48.0 | 195.0 | 0.9968 | 3.18 | 0.52 | 9.5 | 5 |
| 1396 | 7.3 | 0.18 | 0.29 | 1.2 | 0.044 | 12.0 | 143.0 | 0.9918 | 3.2 | 0.48 | 11.3 | 7 |
| 64 | 7.2 | 0.24 | 0.27 | 1.4 | 0.038 | 31.0 | 122.0 | 0.9927 | 3.15 | 0.46 | 10.3 | 6 |


I selected the Wine Quality (white) dataset from PMLB. I loaded the data from the compressed TSV file using pandas and took a random sample of 10 rows, which is displayed above. The output column is named `target` in the PMLB copy and holds the wine quality score; all other columns are treated as input features. I also included a commented-out version of the sampling code that sets `random_state`, so that the same "random" rows can be reproduced on every run.
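The effect of `random_state` can be checked on a small stand-in frame. This is a minimal sketch: the `demo` frame and its columns are hypothetical and not taken from the wine data, and the seed value is simply the one used in the commented-out line above.

```python
import pandas as pd

# Hypothetical stand-in data; the values are illustrative only.
demo = pd.DataFrame({"x": range(100), "y": [v * 0.5 for v in range(100)]})

# Sampling twice with the same seed selects the same rows in the same order.
first = demo.sample(10, random_state=67)
second = demo.sample(10, random_state=67)

print(first.equals(second))
```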

### Problem 2 (10 points)

List all the columns in the data set, and describe each of them in your own words.
You may have to search to learn about the data set columns, but make sure that the descriptions are your own words.

In [30]:
# List all column names in dataframe
cols = list(df.columns)
print(f"Columns ({len(cols)}):")
for i, c in enumerate(cols, start=1):
    print(f"{i}. {c}")

Columns (12):
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol
12. target


Below are brief descriptions for each column in the Wine Quality (white) dataset:

- `fixed acidity`: Refers to the non-volatile acids in wine—mainly tartaric and malic acid—that do not evaporate easily. These acids influence the wine’s overall sourness, structure, and stability.
- `volatile acidity`: Measures acids that can evaporate, mainly acetic acid (associated with vinegar smell). High values can lead to unpleasant vinegar-like aromas, so lower levels are preferred for quality.
- `citric acid`: A weak acid naturally present in small amounts. Adds freshness, “brightness”, and enhances the wine’s flavor and aroma when present at moderate levels.
- `residual sugar`: Amount of sugar left after fermentation (in g/L). Higher values indicate sweeter wines; lower values correspond to dry wines. White wines often vary widely in this measurement.
- `chlorides`: Represents the amount of salt in the wine (mostly sodium chloride) (in g/L). High chloride levels can negatively affect taste.
- `free sulfur dioxide`: SO2 that is not chemically bound (in mg/L) and is available to act as an antimicrobial and antioxidant. Too much can impact aroma; too little may allow spoilage.
- `total sulfur dioxide`: Total amount of SO2 (free + bound). High levels may affect smell/taste; low levels may harm preservation.
- `density`: Density of the wine (in g/cm³), which is close to that of water. Closely related to sugar and alcohol content: more sugar or less alcohol increases density.
- `pH`: Acidity level of the wine, measured on the pH scale. Lower pH = more acidic, which influences taste, preservation, and color stability.
- `sulphates`: Amount of potassium sulfate added, often used as a wine preservative. Linked to SO2 production, which affects freshness and microbial stability.
- `alcohol`: Alcohol percentage by volume (% ABV). Generally, higher alcohol levels increase the wine's body, warmth, and often its quality rating.
- `target`: Our target variable, the wine quality score. This is a sensory score assigned by trained wine experts. It is an ordinal scale where 0 = very bad and 10 = excellent.

### Problem 3 (15 points)

Plot histograms of each column.
For each column, state the distribution covered in this module that you think best matches that column.
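One possible way to lay out per-column histograms is sketched below. It assumes synthetic stand-in data: the `demo` frame, its column names, and the bin count of 30 are all hypothetical choices, and the three generating distributions are only examples, not a statement of which distributions this module covers.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in columns drawn from three different distributions.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "roughly_normal": rng.normal(10.0, 2.0, 500),
    "right_skewed": rng.exponential(3.0, 500),
    "counts": rng.poisson(4.0, 500),
})

# One histogram per column, side by side.
fig, axes = plt.subplots(1, len(demo.columns), figsize=(12, 3))
for ax, name in zip(axes, demo.columns):
    ax.hist(demo[name], bins=30)
    ax.set_title(name)
    ax.set_xlabel("value")
    ax.set_ylabel("count")
fig.tight_layout()
```

The shape of each histogram (symmetric, skewed, discrete) is the evidence to cite when naming a matching distribution.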

In [None]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 4 (20 points)

Plot each pair of an input column and the output column.
Classify each pair of input column and the output column as being independent or not.
Describe in words why you think that was the case.
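Alongside the plots, a numeric check can help support a classification. The sketch below uses synthetic data (all variable names are hypothetical) and the Pearson correlation from `scipy.stats`. Note that this is only one possible criterion: a correlation near zero does not by itself prove independence, since a nonlinear dependency can have zero linear correlation.

```python
import numpy as np
import scipy.stats

# Hypothetical data: one input drives the output, the other does not.
rng = np.random.default_rng(6)
dependent_input = rng.normal(size=500)
unrelated_input = rng.normal(size=500)
output = 0.7 * dependent_input + rng.normal(scale=0.5, size=500)

# pearsonr returns the correlation coefficient and a two-sided p-value.
r_dep, p_dep = scipy.stats.pearsonr(dependent_input, output)
r_ind, p_ind = scipy.stats.pearsonr(unrelated_input, output)

print(f"dependent input: r = {r_dep:.3f}, p = {p_dep:.3g}")
print(f"unrelated input: r = {r_ind:.3f}, p = {p_ind:.3g}")
```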

In [None]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 5 (20 points)

Build an ordinary least squares regression for the target using all the input columns.
Report the mean squared error of the model over the whole data set.
Plot the actual values vs the predicted outputs to compare them. 
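The fit-and-score steps can be sketched as follows, assuming synthetic data in place of the chosen data set (`X`, `y`, and the generating coefficients are hypothetical). The mean squared error is computed directly with NumPy over the same rows used for fitting, as the problem statement asks.

```python
import numpy as np
import sklearn.linear_model

# Hypothetical inputs and target with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.5 + rng.normal(scale=0.3, size=200)

# Ordinary least squares fit using all input columns.
model = sklearn.linear_model.LinearRegression()
model.fit(X, y)

# Mean squared error over the whole (training) data set.
predictions = model.predict(X)
mse = np.mean((y - predictions) ** 2)
print(f"MSE: {mse:.4f}")
```

A scatter plot of `y` against `predictions` would then give the requested visual comparison; points near the diagonal indicate good predictions.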

In [None]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 6 (20 points)

Which input column gives the best linear model of the target on its own?
How does that model compare to the model in problem 5?
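One way to compare single-column models is to fit one regression per input and rank them by mean squared error. The sketch below assumes synthetic data and that lowest MSE is the meaning of "best"; the `demo` frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
import sklearn.linear_model

# Hypothetical data where column "a" carries most of the signal.
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
demo["target"] = 2.0 * demo["a"] + 0.5 * demo["b"] + rng.normal(scale=0.5, size=300)

# Fit one single-input model per column and record its MSE.
mse_by_column = {}
for name in ["a", "b", "c"]:
    model = sklearn.linear_model.LinearRegression()
    model.fit(demo[[name]], demo["target"])
    residuals = demo["target"] - model.predict(demo[[name]])
    mse_by_column[name] = float(np.mean(residuals ** 2))

# The column with the smallest MSE gives the best single-input linear model.
best = min(mse_by_column, key=mse_by_column.get)
print(best, mse_by_column)
```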


In [None]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 7 (20 points)

Pick and plot a pair of input columns with a visible dependency.
Identify a split of the values of one column illustrating the dependency and plot histograms of the other variable on both sides of the split.
That is, pick a threshold $t$ for one column $x$ and make two histograms, one where $x < t$ and one where $x \geq t$.

These histograms should look significantly different to make the dependency clear.
There should be enough data in both histograms so that these differences are unlikely to be noise.
Also make sure that the horizontal axis is the same in both histograms for clarity.
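The split-and-compare procedure above can be sketched as follows. The data is synthetic, the variable names are hypothetical, and using the median as the threshold $t$ is just one assumption that guarantees both sides have plenty of data. Sharing the bin edges and the x-axis keeps the two histograms directly comparable.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical pair of columns where y depends on x.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, 1000)
y = 0.8 * x + rng.normal(scale=1.0, size=1000)

# Split y by a threshold on x; the median balances the two groups.
t = np.median(x)
below = y[x < t]
above = y[x >= t]

# Compute bin edges once from all of y so both histograms use the same bins.
bins = np.histogram_bin_edges(y, bins=30)

fig, axes = plt.subplots(2, 1, sharex=True, figsize=(6, 5))
axes[0].hist(below, bins=bins)
axes[0].set_title(f"y where x < {t:.2f}")
axes[1].hist(above, bins=bins)
axes[1].set_title(f"y where x >= {t:.2f}")
axes[1].set_xlabel("y")
fig.tight_layout()
```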

In [None]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 8 (40 points)

Perform principal components analysis of the input columns.
Compute how much of the data variation is explained by the first half of the principal components.
Build a linear regression using coordinates computed from the first half of the principal components.
Compare the mean squared error of this model to the previous model.
Plot actual targets vs predictions again. 

This problem depends on material from week 13.
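A minimal sketch of this workflow on synthetic data follows. Several choices here are assumptions rather than requirements: the data and names are hypothetical, "first half" is taken as floor division of the column count, and the inputs are not standardized before PCA (scikit-learn's `PCA` centers the data but does not rescale it).

```python
import numpy as np
import sklearn.linear_model
from sklearn.decomposition import PCA

# Hypothetical inputs with two strongly related columns, plus a target.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=400)
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.2, size=400)

# Keep the first half of the principal components (assumption: floor division).
n_keep = X.shape[1] // 2
pca = PCA(n_components=n_keep)
Z = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()

# Regression on the reduced coordinates versus regression on all inputs.
model_pca = sklearn.linear_model.LinearRegression().fit(Z, y)
mse_pca = np.mean((y - model_pca.predict(Z)) ** 2)
model_full = sklearn.linear_model.LinearRegression().fit(X, y)
mse_full = np.mean((y - model_full.predict(X)) ** 2)

print(f"explained variance: {explained:.3f}")
print(f"MSE (PCA half): {mse_pca:.4f}, MSE (all inputs): {mse_full:.4f}")
```

On the training data the reduced model cannot have a lower MSE than the full model, because its predictions are a restricted linear function of the same inputs.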

In [None]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 9 (20 points)

What pair of input columns has the highest correlation?
How is that correlation reflected in the principal components?
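Finding the most correlated pair can be sketched as a scan over the correlation matrix. This example uses synthetic data with hypothetical column names, and it assumes "highest correlation" means largest absolute value, so a strong negative correlation also counts.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs where "d" is built to be strongly (negatively) tied to "a".
rng = np.random.default_rng(4)
demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "d"])
demo["d"] = -0.9 * demo["a"] + 0.3 * demo["d"]

corr = demo.corr()

# Scan each unordered pair once and keep the largest absolute correlation.
best_pair, best_value = None, 0.0
cols = list(demo.columns)
for i, first in enumerate(cols):
    for second in cols[i + 1:]:
        r = corr.loc[first, second]
        if abs(r) > abs(best_value):
            best_pair, best_value = (first, second), r

print(best_pair, round(float(best_value), 3))
```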

In [None]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 10 (30 points)

Identify an outlier row in the data set.
You may use any criterion discussed in this module, and you must explain the criterion and how it led to picking this row.
Give a visualization showing how much this row sticks out compared to the other data based on your criteria.
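As an illustration only, the sketch below flags the row with the largest absolute z-score in any column. Whether z-scores are among the criteria discussed in this module is an assumption, and the `demo` frame, its column names, and the planted extreme value are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one planted extreme value in row 17.
rng = np.random.default_rng(5)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
demo.loc[17, "b"] = 9.0

# Standardize each column, then take each row's largest absolute z-score.
z = (demo - demo.mean()) / demo.std()
max_abs_z = z.abs().max(axis=1)

# The row with the largest score is the outlier candidate.
outlier_index = max_abs_z.idxmax()
print(outlier_index, round(float(max_abs_z[outlier_index]), 2))
```

A histogram of `max_abs_z` with the selected row's score marked would show how far it sits from the rest of the data.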

In [None]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Generative AI Usage

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the [generative AI policy](https://www.bu.edu/cds-faculty/culture-community/gaia-policy/).
If you did not use any generative AI tools, simply write NONE below.

YOUR ANSWER HERE