# DX 601 Final Project

## Introduction

In this project, you will practice all the skills that you have learned throughout this module.
You will pick a data set to analyze from a list provided, and then perform a variety of analyses.
Most of the problems and questions are open-ended compared to your previous homework assignments, and you will be asked to explain your choices.
Most of them will have a particular type of solution implied, but it is up to you to figure out the details based on what you have learned in this module.

## Instructions

Each problem asks you to perform some analysis of the data, and usually answer some questions about the results.
Make sure that your answers are well supported by your analysis and explanations; simply stating an answer without support will earn minimal points.

Notebook cells for code and text have been added for your convenience, but feel free to add additional cells.

## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx500-examples
* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Submission

This project will be entirely manually graded.
However, we may rerun some or all of your code to confirm that it works as described.

### Late Policy

The normal homework late policy for OMDS does not apply to this project.
Boston University requires final grades to be submitted within 72 hours of class instruction ending, so we cannot accommodate 5 days of late submissions.

However, we have set the due date of this project substantially later than its scope requires.
You have more days to submit for full credit than the homework late policy would have given you to submit for partial credit.
Finally, the deadlines for DX 601 and DX 602 were coordinated to be a week apart while giving ample time for both of their projects.

## Shared Imports

For this project, you may not use modules that are not imported in this template.
While other modules are handy in practice, modules that trivialize these problems interfere with our assessment of your own knowledge and skills.

If you believe a module covered in the course material (not live sessions) is missing, please check with your learning facilitator.

In [1]:
import math
import sys

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats
import sklearn.linear_model

from sklearn.decomposition import PCA

## Problems

### Problem 1 (5 points)

Pick one of the following data sets to analyze in this project.
Load the data set, and show a random sample of 10 rows.

* [Iris data set](https://archive.ics.uci.edu/dataset/53/iris) ([PMLB copy](https://github.com/EpistasisLab/pmlb/tree/master/datasets/iris))
* [Breast Cancer Wisconsin](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic) ([PMLB copy](https://github.com/EpistasisLab/pmlb/tree/master/datasets/_deprecated_breast_cancer_wisconsin))
* [Wine Quality](https://archive.ics.uci.edu/dataset/186/wine+quality) ([PMLB - white subset only](https://github.com/EpistasisLab/pmlb/tree/master/datasets/wine_quality_white))


The PMLB copies of the data are generally cleaner and recommended for this project, but the other links are provided to give you more context.
To load the data from the PMLB GitHub repository, navigate to the `.tsv.gz` file in GitHub and copy the link from the "Raw" button.

If the data set you choose has more than ten columns, you may limit later analysis that is requested per column to just the first ten columns.
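One way to apply the ten-column limit is to select columns by position rather than by name. The following is a minimal sketch, assuming the label column is named `target` as in the PMLB copies; the `demo` frame and its `feature_*` column names are made up for illustration and are not part of any of the listed data sets.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a loaded data set: 12 input columns plus "target".
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(size=(5, 12)),
    columns=[f"feature_{i}" for i in range(12)],
)
demo["target"] = [0, 1, 0, 1, 1]

# Keep the first ten input columns (by position) and the target column.
input_columns = [c for c in demo.columns if c != "target"][:10]
limited = demo[input_columns + ["target"]]

print(limited.columns.tolist())
```

Selecting by position keeps the same code usable for whichever data set you pick.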

In [3]:
# YOUR CODE HERE
# Load the Dataset 
wine_quality = pd.read_csv("https://github.com/EpistasisLab/pmlb/raw/refs/heads/master/datasets/wine_quality_white/wine_quality_white.tsv.gz", sep="\t")

In [4]:
# View the dataset
wine_quality.sample(10)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1651 | 6.4 | 0.42 | 0.74 | 12.8 | 0.076 | 48.0 | 209.0 | 0.9978 | 3.12 | 0.58 | 9.0 | 6 |
| 4634 | 5.8 | 0.29 | 0.38 | 10.7 | 0.038 | 49.0 | 136.0 | 0.99366 | 3.11 | 0.59 | 11.2 | 6 |
| 1030 | 7.1 | 0.2 | 0.41 | 2.1 | 0.054 | 24.0 | 166.0 | 0.9948 | 3.48 | 0.62 | 10.5 | 6 |
| 3349 | 6.4 | 0.17 | 0.27 | 9.9 | 0.047 | 26.0 | 101.0 | 0.99596 | 3.34 | 0.5 | 9.9 | 6 |
| 437 | 6.6 | 0.16 | 0.32 | 1.4 | 0.035 | 49.0 | 186.0 | 0.9906 | 3.35 | 0.64 | 12.4 | 8 |
| 1245 | 8.0 | 0.66 | 0.72 | 17.55 | 0.042 | 62.0 | 233.0 | 0.9999 | 2.92 | 0.68 | 9.4 | 4 |
| 4159 | 7.4 | 0.16 | 0.3 | 13.7 | 0.056 | 33.0 | 168.0 | 0.99825 | 2.9 | 0.44 | 8.7 | 7 |
| 4405 | 5.9 | 0.29 | 0.16 | 7.9 | 0.044 | 48.0 | 197.0 | 0.99512 | 3.21 | 0.36 | 9.4 | 5 |
| 4190 | 7.0 | 0.22 | 0.3 | 1.4 | 0.04 | 14.0 | 63.0 | 0.98985 | 3.2 | 0.33 | 12.0 | 6 |
| 915 | 5.6 | 0.29 | 0.05 | 0.8 | 0.038 | 11.0 | 30.0 | 0.9924 | 3.36 | 0.35 | 9.2 | 5 |


In [5]:
# Limit the dataset to the first 10 input columns (plus target) by dropping "alcohol"
wine_quality = wine_quality.drop(columns="alcohol")

In [6]:
# View the dataset again with the 10 remaining input columns and the target
wine_quality.sample(10)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4562 | 5.6 | 0.18 | 0.3 | 10.2 | 0.028 | 28.0 | 131.0 | 0.9954 | 3.49 | 0.42 | 7 |
| 2500 | 6.8 | 0.21 | 0.36 | 18.1 | 0.046 | 32.0 | 133.0 | 1.0 | 3.27 | 0.48 | 5 |
| 3636 | 6.5 | 0.26 | 0.39 | 1.4 | 0.02 | 12.0 | 66.0 | 0.99089 | 3.25 | 0.75 | 7 |
| 3987 | 7.3 | 0.23 | 0.41 | 14.6 | 0.048 | 73.0 | 223.0 | 0.99863 | 3.16 | 0.71 | 6 |
| 356 | 7.3 | 0.22 | 0.37 | 14.3 | 0.063 | 48.0 | 191.0 | 0.9978 | 2.89 | 0.38 | 6 |
| 3955 | 7.0 | 0.16 | 0.3 | 2.6 | 0.043 | 34.0 | 90.0 | 0.99047 | 2.88 | 0.47 | 6 |
| 104 | 7.4 | 0.25 | 0.37 | 13.5 | 0.06 | 52.0 | 192.0 | 0.9975 | 3.0 | 0.44 | 5 |
| 1725 | 6.9 | 0.17 | 0.22 | 4.6 | 0.064 | 55.0 | 152.0 | 0.9952 | 3.29 | 0.37 | 6 |
| 1459 | 7.9 | 0.11 | 0.49 | 4.5 | 0.048 | 27.0 | 133.0 | 0.9946 | 3.24 | 0.42 | 6 |
| 507 | 6.0 | 0.24 | 0.27 | 1.9 | 0.048 | 40.0 | 170.0 | 0.9938 | 3.64 | 0.54 | 7 |


YOUR ANSWERS HERE

### Problem 2 (10 points)

List all the columns in the data set, and describe each of them in your own words.
You may have to search to learn about the data set columns, but make sure that the descriptions are your own words.

In [7]:
# YOUR CODE HERE

In [8]:
# List all of the columns in the dataset
wine_quality.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'target'],
      dtype='object')

YOUR ANSWERS HERE

### Column Descriptions
* fixed acidity: A value representing the concentration of non-volatile acids in the wine. An example of a fixed acid is tartaric acid.
* volatile acidity: A value representing the concentration of volatile acids in the wine. An example of a volatile acid is acetic acid.
* citric acid: Amount of citric acid in the wine. Citric acid is a weak organic acid present in wine in small quantities. It is known to add freshness and enhance the wine’s flavor and aroma.
* residual sugar: The amount of sugar remaining after fermentation. A higher residual sugar content will result in a sweeter wine.
* chlorides: The concentration of salt in the wine. A higher chloride concentration may indicate a higher salinity and can negatively affect wine flavor.
* free sulfur dioxide: Free sulfur dioxide refers to the portion of sulfur dioxide (SO₂) that is not bound to other molecules and acts as an antimicrobial and antioxidant agent. Optimal levels of free sulfur dioxide help preserve wine, but excessive SO₂ can affect taste and aroma.
* total sulfur dioxide: The total amount of free and bound SO₂. 
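* density: The mass of the wine per unit volume. Sugar raises the density and alcohol lowers it, so the value reflects both.
* pH: A measure of how acidic the wine is on the 0 to 14 pH scale, where lower values are more acidic. The values in this data set sit around 3.
* sulphates: The concentration of sulphate salts in the wine, an additive that can contribute to the wine's SO₂ levels.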
* target: The wine quality score assigned by wine experts based on sensory data. The scores range from 0 (very bad) to 10 (excellent).

In [9]:
wine_quality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  target                4898 non-null   int64  
dtypes: float64(10), int64(1)
memory usage: 421.1 KB


### Problem 3 (15 points)

Plot histograms of each column.
For each column, state the distribution covered in this module that you think best matches that column.
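When judging which distribution a histogram resembles, it can help to compare the binned densities against a candidate curve. The sketch below is one possible check, not a required method: it measures the gap between a density histogram and a normal curve with the same mean and standard deviation. The helper names `normal_pdf` and `normal_fit_gap` and both synthetic columns are assumptions made for illustration.

```python
import math

import numpy as np


def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return np.exp(-0.5 * z**2) / (std * math.sqrt(2 * math.pi))


def normal_fit_gap(values, bins=30):
    """Mean absolute gap between a density histogram and a fitted normal curve."""
    density, edges = np.histogram(values, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    fitted = normal_pdf(centers, values.mean(), values.std())
    return float(np.mean(np.abs(density - fitted)))


# Synthetic columns (not taken from any of the project data sets).
rng = np.random.default_rng(0)
bell_shaped = rng.normal(loc=0.0, scale=1.0, size=5000)
right_skewed = rng.exponential(scale=1.0, size=5000)

gap_bell = normal_fit_gap(bell_shaped)
gap_skewed = normal_fit_gap(right_skewed)
print(f"bell-shaped gap: {gap_bell:.3f}, right-skewed gap: {gap_skewed:.3f}")
```

A smaller gap suggests the normal curve is a closer match. A number like this supports a visual judgment of the plotted histogram but does not replace it.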

In [10]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 4 (20 points)

For each input column, plot it against the output column.
Classify each pair as independent or not.
Describe in words why you think that is the case.
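One rough numerical companion to these plots is to check whether the average of an input column shifts across the values of the output. The sketch below assumes an integer-valued output and uses made-up data; `group_mean_spread` is a hypothetical helper, not something from the course material. Differing conditional means are a sign of dependence, but similar means do not prove independence.

```python
import numpy as np


def group_mean_spread(x, y):
    """Range of the per-target means of x, in units of x's overall std."""
    means = np.array([x[y == level].mean() for level in np.unique(y)])
    return float((means.max() - means.min()) / x.std())


rng = np.random.default_rng(1)
target = rng.integers(3, 9, size=3000)          # made-up integer scores 3..8
related = 0.5 * target + rng.normal(size=3000)  # shifts with the target
unrelated = rng.normal(size=3000)               # ignores the target

spread_related = group_mean_spread(related, target)
spread_unrelated = group_mean_spread(unrelated, target)
print(f"related: {spread_related:.2f}, unrelated: {spread_unrelated:.2f}")
```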

In [11]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 5 (20 points)

Build an ordinary least squares regression for the target using all the input columns.
Report the mean squared error of the model over the whole data set.
Plot the actual values vs the predicted outputs to compare them. 
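As a reminder of the mechanics, the sketch below fits an ordinary least squares model on made-up data and computes the mean squared error over the same rows. It uses `np.linalg.lstsq` directly as an illustration; `sklearn.linear_model.LinearRegression` from the shared imports fits the same kind of model. The data, coefficients, and variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))  # made-up input columns
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Add a column of ones so the fit includes an intercept term.
X_with_intercept = np.column_stack([np.ones(len(X)), X])
coefficients, *_ = np.linalg.lstsq(X_with_intercept, y, rcond=None)

# Mean squared error over the whole (training) data set.
predictions = X_with_intercept @ coefficients
mse = float(np.mean((y - predictions) ** 2))
print(f"coefficients: {coefficients.round(2)}, MSE: {mse:.4f}")
```

Plotting `y` against `predictions` with `plt.scatter` gives the actual-versus-predicted comparison; points close to the diagonal indicate a good fit.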

In [12]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 6 (20 points)

Which input column gives the best linear model of the target on its own?
How does that model compare to the model in Problem 5?
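One way to approach this comparison is to fit a separate one-variable line for each input column and rank the columns by mean squared error. The sketch below assumes that "best" means lowest mean squared error and uses made-up data; `single_column_mse` is a hypothetical helper.

```python
import numpy as np


def single_column_mse(x, y):
    """MSE of the least squares line y ~ slope * x + intercept for one column."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(np.mean((y - (slope * x + intercept)) ** 2))


rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))  # made-up input columns
y = 3.0 * X[:, 1] + 0.5 * X[:, 0] + rng.normal(scale=0.5, size=500)

mse_per_column = [single_column_mse(X[:, j], y) for j in range(X.shape[1])]
best_column = int(np.argmin(mse_per_column))
print(f"MSE per column: {np.round(mse_per_column, 2)}, best: {best_column}")
```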


In [13]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 7 (20 points)

Pick and plot a pair of input columns with a visible dependency.
Identify a split of the values of one column illustrating the dependency and plot histograms of the other variable on both sides of the split.
That is, pick a threshold $t$ for one column $x$ and make two histograms, one where $x < t$ and one where $x \geq t$.

These histograms should look significantly different to make the dependency clear.
There should be enough data in both histograms so that these differences are unlikely to be noise.
Also make sure that the horizontal axis is the same in both histograms for clarity.
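The split described above can be sketched as follows, using made-up data in which one column depends on the other by construction. The median is used as the threshold $t$ only because it guarantees both sides have plenty of rows; any threshold that illustrates the dependency would do. Computing one set of bin edges and reusing it is one way to keep the horizontal axis identical.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 10.0, size=4000)
other = 0.8 * x + rng.normal(size=4000)  # depends on x by construction

t = np.median(x)  # threshold giving two balanced halves
below = other[x < t]
above = other[x >= t]

# One set of bin edges for both histograms keeps the horizontal axis identical.
edges = np.histogram_bin_edges(other, bins=20)
counts_below, _ = np.histogram(below, bins=edges)
counts_above, _ = np.histogram(above, bins=edges)
print(counts_below)
print(counts_above)
```

Passing the same `edges` array as the `bins` argument of `plt.hist` for each subset draws the two histograms over matching bins.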

In [14]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 8 (40 points)

Perform principal components analysis of the input columns.
Compute how much of the data variation is explained by the first half of the principal components.
Build a linear regression using coordinates computed from the first half of the principal components.
Compare the mean squared error of this model to that of the previous model.
Plot actual targets vs predictions again. 

This problem depends on material from week 13.
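The steps in this problem can be sketched on made-up data as follows. This version computes the principal components with a singular value decomposition in NumPy to show what is happening underneath; `PCA` from the shared imports reports the same ratios through its `explained_variance_ratio_` attribute. The synthetic columns and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
latent = rng.normal(size=(300, 2))
# Six made-up input columns driven mostly by two latent factors.
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(300, 6))
y = latent[:, 0] - 2.0 * latent[:, 1] + 0.1 * rng.normal(size=300)

# PCA via SVD of the centered data; rows of Vt are the principal components.
X_centered = X - X.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)

k = X.shape[1] // 2  # first half of the components
explained_by_half = float(explained_ratio[:k].sum())
coords = X_centered @ Vt[:k].T  # coordinates on the first k components

# Least squares regression of the target on the component coordinates.
design = np.column_stack([np.ones(len(coords)), coords])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
mse_half = float(np.mean((y - design @ coef) ** 2))
print(f"explained by first {k}: {explained_by_half:.3f}, MSE: {mse_half:.4f}")
```

Because the regression sees only `k` coordinates instead of all input columns, its mean squared error on the same rows cannot be lower than that of the full least squares model.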

In [15]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 9 (20 points)

What pair of input columns has the highest correlation?
How is that correlation reflected in the principal components?
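A minimal sketch of finding the most correlated pair, assuming "highest" is judged by absolute Pearson correlation. The four columns are made up, with columns 0 and 2 related by construction.

```python
import numpy as np

rng = np.random.default_rng(6)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
# Four made-up columns; columns 0 and 2 are strongly related by construction.
X = np.column_stack(
    [a, b, 0.9 * a + 0.1 * rng.normal(size=1000), rng.normal(size=1000)]
)

corr = np.corrcoef(X, rowvar=False)

# Look only above the diagonal so each pair is counted once and self-pairs are skipped.
upper = np.triu(np.abs(corr), k=1)
i, j = np.unravel_index(np.argmax(upper), upper.shape)
print(f"columns {i} and {j}: correlation {corr[i, j]:.3f}")
```

With a pandas data frame, `DataFrame.corr()` produces the same matrix with column labels attached.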

In [16]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 10 (30 points)

Identify an outlier row in the data set.
You may use any criterion discussed in this module, and you must explain the criterion and how it led to picking this row.
Give a visualization showing how much this row sticks out compared to the other data based on your criteria.
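As one illustration, the sketch below scores each row by its largest absolute z-score across columns and picks the row with the highest score. This assumes a z-score criterion; check that whichever criterion you use was covered in the module. The data is made up, and the extreme row at index 123 is planted on purpose.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 4))   # made-up data
X[123] = [0.0, 9.0, 0.0, 0.0]   # planted extreme row for illustration

# Standardize each column, then score each row by its largest absolute z-score.
z = (X - X.mean(axis=0)) / X.std(axis=0)
row_scores = np.abs(z).max(axis=1)
outlier_row = int(np.argmax(row_scores))
print(f"row {outlier_row} has max |z| = {row_scores[outlier_row]:.2f}")
```

A histogram of `row_scores` with the chosen row marked is one way to show how far that row sits from the rest.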

In [17]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Generative AI Usage

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the [generative AI policy](https://www.bu.edu/cds-faculty/culture-community/gaia-policy/).
If you did not use any generative AI tools, simply write NONE below.

YOUR ANSWER HERE