# DX 601 Week 3 Homework

## Introduction

In this homework, you will practice plotting data and calculating model predictions and losses.

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for sample code.

* https://github.com/bu-cds-omds/dx500-examples
* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Instructions

You should replace every instance of "..." below.
These are where you are expected to write code to answer each problem.

After some of the problems, there are extra code cells that will test functions that you wrote so you can quickly see how they run on an example.
If your code works on these examples, it is more likely to be correct.
However, the autograder will test different examples, so working correctly on these examples does not guarantee full credit for the problem.
You may change the example inputs to further test your functions on your own.
You may also add your own example inputs for problems where we did not provide any.

Be sure to run each code block after you edit it to make sure it runs as expected.
When you are done, we strongly recommend you run all the code from scratch (Runtime menu -> Restart and Run all) to make sure your current code works for all problems.

If your code raises an exception when run from scratch, it will interfere with the autograder process, causing you to lose some or all points for this homework.
Please ask for help in YellowDig or schedule an appointment with a learning facilitator if you get stuck.


## Shared Imports

Do not install or use any additional modules.
Installing additional modules may result in an autograder failure resulting in zero points for some or all problems.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

In [None]:
from sklearn.linear_model import LinearRegression

### Problem 1

The code below loads a small mango data set into the variable `mango_data`.
The variable `mango_data` is a pandas dataframe, which lets you easily access data one row at a time or one column at a time.


In [None]:
mango_data = pd.read_csv("mango-tiny.tsv", sep="\t")

In [None]:
mango_data

Given a dataframe, you can select a column by indexing it with the name of the column.
The indexing operation uses square brackets and the name of the column goes between them.
Here is an example selecting the "softness" column of the mango data.

In [None]:
mango_data["softness"]

Most Python operations and functions that work with sequences such as lists also work with pandas dataframe columns.
For example, you can compute the sum of a dataframe column using the `sum` function.

In [None]:
sum(mango_data["softness"])
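As a small illustration of the point above, the sketch below uses a made-up stand-in dataframe (`toy`, which is not the real mango data) to show the built-in sequence functions next to the equivalent pandas method on a column.

```python
import pandas as pd

# Hypothetical stand-in for the mango data; the values are made up.
toy = pd.DataFrame({"softness": [1, 3, 2, 4]})

builtin_total = sum(toy["softness"])   # built-in function, iterates over the column
method_total = toy["softness"].sum()   # pandas method on the same column
largest = max(toy["softness"])         # other sequence functions work too
count = len(toy["softness"])

print(builtin_total, method_total, largest, count)
```

The pandas methods are usually the more convenient choice on large columns, but either form is fine for data this small.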

Set `p1` to the average of the estimated sweetness column in the small mango data set.

In [None]:
# YOUR CHANGES HERE

p1 = mango_data["estimated_sweetness"].mean()

Check the value of `p1`.

In [None]:
p1

### Problem 2

Write a function `p2` that takes in an input number `x` and returns $3 x + 1$.

(You can make any specific linear function this way, but it may only work on simpler input formats.)

In [None]:
# YOUR CHANGES HERE

def p2(x):
    return 3 * x + 1
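
The remark about simpler input formats can be seen in a short sketch. The function name `linear` is only for illustration; it has the same body as the function requested above. Plain numbers and NumPy arrays work, while a plain Python list does not.

```python
import numpy as np

def linear(x):
    # Same formula as the function requested above: 3x + 1.
    return 3 * x + 1

scalar_result = linear(2)                     # a plain number
array_result = linear(np.array([0, 1, 2]))    # NumPy applies it element by element

try:
    linear([0, 1, 2])     # 3 * list repeats the list, then "+ 1" raises TypeError
    list_failed = False
except TypeError:
    list_failed = True

print(scalar_result, array_result, list_failed)
```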



### Problem 3

Plot the small mango data set using the estimated sweetness as the x axis and the rated flavor as the y axis.

To do this, edit the line below that says
```
#p3 = plt.scatter(..., ...)
```
The first ellipsis should be replaced with the x data, and the second ellipsis should be replaced with the y data.
After you do this, uncomment the line by removing the `#` character at the beginning.
Make sure not to leave any extra spaces there.

In [None]:
# YOUR CHANGES HERE

# uncomment the following line after filling in the data to be plotted.
p3 = plt.scatter(mango_data["estimated_sweetness"], mango_data["rated_flavor"])

plt.xlabel("Estimated Sweetness")
plt.ylabel("Rated Flavor")
plt.title("Rated Flavor vs Estimated Sweetness")

### Problem 4

The function `f4` implements a very simple linear function.
`f4` takes in a dataframe and returns the "estimated_sweetness" column as its output.
So $f_4(x) = 1.0 (\mathrm{estimated~sweetness})$.

In [120]:
def f4(df):
    # the rename here is not strictly necessary, but makes the output more clear below.
    # without the rename, the column below would be called estimated_sweetness.
    return df["estimated_sweetness"].rename("prediction")

In [121]:
print(mango_data[["estimated_sweetness", "rated_flavor"]].head())

   estimated_sweetness  rated_flavor
0                    4             5
1                    5             1
2                    3             3
3                    1             2
4                    1             1


In [122]:
f4(mango_data)

0    4
1    5
2    3
3    1
4    1
5    1
6    1
7    2
Name: prediction, dtype: int64

Set `p4` to be the residuals from using `f4` as a model predicting the rated flavor column.
The result should be a sequence of residuals, not a single number.

A Python list, NumPy array, or pandas Series are all acceptable output types.
Whatever output type you use, the order of outputs should match the order of rows in the mango data set.

In [123]:
import pandas as pd

In [124]:
print(mango_data.columns.tolist())

['green_rating', 'yellow_rating', 'softness', 'wrinkles', 'estimated_flavor', 'estimated_sweetness', 'rated_flavor']


In [125]:
# YOUR CHANGES HERE
predicted = f4(mango_data)
actual = mango_data["rated_flavor"]
# residual = actual value minus predicted value, one per row
p4 = actual - predicted

Check the values in `p4`.

In [127]:
p4

0    1
1   -4
2    0
3    1
4    0
5    0
6    0
7    0
dtype: int64
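
For reference, here is a minimal sketch of the same residual calculation on made-up numbers, assuming the convention residual = actual minus predicted. The names `toy_actual` and `toy_predicted` are hypothetical. It also shows how to convert the result to the other accepted output types.

```python
import pandas as pd

toy_actual = pd.Series([5, 1, 3])       # made-up target values
toy_predicted = pd.Series([4, 5, 3])    # made-up predictions

toy_residuals = toy_actual - toy_predicted   # element-wise, row order preserved
as_list = toy_residuals.tolist()             # plain Python list
as_array = toy_residuals.to_numpy()          # NumPy array

print(as_list)
```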

### Problem 5

The function `f5` is another linear function computing $f_5(x) = 0.8 (\mathrm{yellow~rating}) + 0.2 (\mathrm{softness})$.


In [None]:
def f5(df):
    return 0.8 * df["yellow_rating"] + 0.2 * df["softness"]

In [None]:
f5(mango_data)

Set `p5` to be the $L_1$ loss of each row of data after using `f5` to predict the rated flavor column.
The result should be a sequence of losses, not a single number.

A Python list, NumPy array, or pandas Series are all acceptable output types.
Whatever output type you use, the order of outputs should match the order of rows in the mango data set.

In [None]:
# YOUR CHANGES HERE

# f5 operates on whole columns, so it can be called on the dataframe directly.
predict = f5(mango_data)
p5 = np.abs(predict - mango_data["rated_flavor"])

Check the values in `p5`.

In [None]:
p5
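
A minimal sketch of the per-row $L_1$ loss on made-up numbers, taking the $L_1$ loss of a row to be the absolute difference between the actual and predicted values. The variable names are hypothetical.

```python
import pandas as pd

toy_actual = pd.Series([5.0, 1.0, 3.0])      # made-up target values
toy_predicted = pd.Series([4.0, 2.5, 3.0])   # made-up predictions

# One loss per row: |actual - predicted|
l1_per_row = (toy_actual - toy_predicted).abs()

print(l1_per_row.tolist())
```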

### Problem 6

The function `f6` is a linear function computing $f_6(x) = 1.2 (\mathrm{yellow~rating}) - 0.1 (\mathrm{green~rating}) - 0.1 (\mathrm{wrinkles})$.

In [None]:
def f6(df):
    return 1.2 * df["yellow_rating"] - 0.1 * df["green_rating"] - 0.1 * df["wrinkles"]

In [None]:
f6(mango_data)

Set `p6` to be the average $L_2$ loss using `f6` to predict the rated flavor column.

**Note that the average $L_2$ loss was requested.**
Some of the videos this week calculated the sum of $L_2$ losses instead.

In [None]:
# YOUR CHANGES HERE
# f6 operates on whole columns, so it can be called on the dataframe directly.
predict = f6(mango_data)
square_errors = (predict - mango_data["rated_flavor"]) ** 2
p6 = np.mean(square_errors)

Check the value of `p6`.

In [None]:
p6
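
To make the difference between the sum and the average of $L_2$ losses concrete, here is a small sketch on made-up numbers (the variable names are hypothetical). The average is the sum divided by the number of rows.

```python
import pandas as pd

toy_actual = pd.Series([5.0, 1.0, 3.0])      # made-up target values
toy_predicted = pd.Series([4.0, 3.0, 3.0])   # made-up predictions

l2_per_row = (toy_actual - toy_predicted) ** 2   # squared error for each row
l2_sum = l2_per_row.sum()                        # total over all rows
l2_mean = l2_per_row.mean()                      # total divided by row count

print(l2_sum, l2_mean)
```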

### Problem 7

Write a function `p7` taking in a dataframe like the mango data and returning the "estimated_flavor" column as its predictions.
(This should be similar to the predictions in Problem 4.)

In [None]:
# YOUR CHANGES HERE

def p7(df):
    return df["estimated_flavor"] 

Check the output of p7 on the mango data set.

In [None]:
p7(mango_data)

### Problem 8

Set `p8` to be the average $L_2$ loss using the "yellow_rating" column to predict the "rated_flavor" column as in Problem 3.

In [None]:
# YOUR CHANGES HERE

square_errors = (mango_data["yellow_rating"] - mango_data["rated_flavor"]) ** 2
p8 = np.mean(square_errors)

Check the value of `p8`.

In [None]:
p8

### Problem 9

Write a function `p9` taking in a dataframe and returning the results of $0.5 (\mathrm{yellow~rating}) + 0.4 (\mathrm{estimated~flavor})$.

In [None]:
# YOUR CHANGES HERE

def p9(df):
    return 0.5 * df["yellow_rating"] + 0.4 * df["estimated_flavor"]

Check the output of `p9` with the mango data set.

In [None]:
p9(mango_data)

### Problem 10

Set `p10` to be the average $L_1$ loss using the prediction $0.3 (\mathrm{yellow~rating}) + 0.1(\mathrm{softness}) + 0.4(\mathrm{estimated~sweetness})$ for the mango data set's rated flavor column.

In [None]:
# YOUR CHANGES HERE

predict = (
    0.3 * mango_data["yellow_rating"] + 
    0.1 * mango_data["softness"] + 
    0.4 * mango_data["estimated_sweetness"] 
    )

p10 = np.mean(np.abs(predict - mango_data["rated_flavor"]))

Check the value of `p10`.

In [None]:
p10

### Problem 11

Build a linear regression for the mango rated flavor column using just the estimated flavor column.
Set `p11` to the prediction of this model when the estimated flavor value is 3.

In [132]:
# YOUR CHANGES HERE
# Fit on the mango data itself: estimated flavor is the only input,
# and rated flavor is the target.
X = mango_data[["estimated_flavor"]]
y = mango_data["rated_flavor"]

model = LinearRegression()
model.fit(X, y)

# Predict for a single row whose estimated flavor is 3.
p11 = model.predict(pd.DataFrame({"estimated_flavor": [3]}))[0]

Check the value of `p11`.

In [133]:
p11

### Problem 12

Build a linear regression for the mango rated flavor column using just the yellow rating column.
Set `p12` to the additive constant in the linear equation.

You can look at the videos or code examples to see how to get the additive constant depending on how you built the model, or evaluate your model with all zero inputs.
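
The two routes to the additive constant can be sketched on made-up data that follows $y = 2x + 1$ exactly. The names `toy` and `toy_model` are hypothetical, and this assumes the model was built with scikit-learn's `LinearRegression`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data on the line y = 2x + 1 (not the mango data).
toy = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 7.0]})

toy_model = LinearRegression()
toy_model.fit(toy[["x"]], toy["y"])

# Route 1: read the fitted attribute.
from_attribute = float(toy_model.intercept_)

# Route 2: evaluate the model with every input set to zero.
from_zero_input = float(toy_model.predict(pd.DataFrame({"x": [0.0]}))[0])

print(from_attribute, from_zero_input)   # both should be close to 1
```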

In [134]:
# YOUR CHANGES HERE
# Fit on the mango data: yellow rating is the only input,
# and rated flavor is the target.
x = mango_data[["yellow_rating"]]
y = mango_data["rated_flavor"]

model = LinearRegression()
model.fit(x, y)

# intercept_ holds the additive constant of the fitted line.
p12 = float(model.intercept_)

Check the value of `p12`.

In [135]:
p12

### Problem 13

Build a linear regression for the mango rated flavor column using just the yellow rating column.
(You can reuse the regression built for problem 12.)
Set `p13` to the coefficient of the yellow rating value in the linear equation.

You can look at the videos or code examples to see how to get the coefficient, or you may be able to deduce it with a couple of evaluations of the model (e.g. $f(1) - f(0)$).
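
Both ways of recovering a coefficient can be sketched on made-up data that follows $y = 2x + 1$ exactly. The names `toy`, `toy_model`, and `evaluate` are hypothetical. With a single input, $f(1) - f(0)$ cancels the additive constant and leaves the slope.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data on the line y = 2x + 1 (not the mango data).
toy = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 7.0]})
toy_model = LinearRegression().fit(toy[["x"]], toy["y"])

def evaluate(value):
    # Helper for this sketch: predict for one input value.
    return float(toy_model.predict(pd.DataFrame({"x": [value]}))[0])

from_attribute = float(toy_model.coef_[0])      # one entry per input column
from_difference = evaluate(1.0) - evaluate(0.0)

print(from_attribute, from_difference)   # both should be close to 2
```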

In [None]:
# Same fit as Problem 12: yellow rating is the only input,
# and rated flavor is the target.
x = mango_data[["yellow_rating"]]
y = mango_data["rated_flavor"]

model = LinearRegression()
model.fit(x, y)


In [None]:
# YOUR CHANGES HERE
# coef_ has one entry per input column; there is only one column here.
p13 = model.coef_[0]


Check the value of `p13`.

In [136]:
p13

### Problem 14

Set `p14` to be the sample variance of the rated flavors in the mango data set.

In [None]:
# YOUR CHANGES HERE

p14 = mango_data['rated_flavor'].var()

Check the value of `p14`.

In [None]:
p14
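
One detail worth checking when computing a sample variance is the divisor. The sketch below, on made-up numbers, compares the pandas default (`ddof=1`, dividing by $n - 1$) with the NumPy default (`ddof=0`, dividing by $n$).

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])   # made-up numbers

sample_var = values.var()                          # pandas default: ddof=1
numpy_default = np.var(values.to_numpy())          # NumPy default: ddof=0
numpy_sample = np.var(values.to_numpy(), ddof=1)   # matches the pandas default

print(sample_var, numpy_default, numpy_sample)
```

Here the squared deviations from the mean add up to 5, so the sample variance is 5/3 and the population variance is 5/4.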

### Problem 15

Set `p15` to be the means of each column of the mango data set.
Your output should be a sequence of the means in the same order as the columns of the mango data set.

You can calculate this however you like with just Python, NumPy's [numpy.mean](https://numpy.org/doc/stable/reference/generated/numpy.mean.html), or pandas' [pandas.DataFrame.mean](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html) method.
We suggest trying the pandas method for your own convenience.


In [None]:
mango_data

In [None]:
# YOUR CHANGES HERE

p15 = mango_data.mean()

Check the values in `p15`.

In [None]:
p15
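
As a rough illustration of column-wise means, the sketch below uses a made-up two-column dataframe (`toy`, not the mango data) and computes the means with both pandas and NumPy. In both cases the results come back in column order.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in dataframe; the values are made up.
toy = pd.DataFrame({"softness": [1, 2, 3], "wrinkles": [4, 5, 6]})

with_pandas = toy.mean()                        # Series indexed by column name
with_numpy = np.mean(toy.to_numpy(), axis=0)    # axis=0 averages down each column

print(with_pandas.tolist(), with_numpy.tolist())
```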

### Problem 16

Given the following three functions,

* $f_a(x) = 1.0 (\mathrm{yellow~rating})$
* $f_b(x) = 0.4 (\mathrm{yellow~rating}) + 0.6 (\mathrm{estimated~sweetness})$
* $f_c(x) = 0.4 (\mathrm{yellow~rating}) + 0.2 (\mathrm{softness}) + 0.3 (\mathrm{estimated~sweetness})$

set `p16` to `"a"`, `"b"`, or `"c"` to indicate which one has the lowest $L_1$ loss predicting the rated flavor column.


In [None]:
# YOUR CHANGES HERE

fa = mango_data["yellow_rating"]
fb = 0.4 * mango_data["yellow_rating"] + 0.6 * mango_data["estimated_sweetness"]
fc = 0.4 * mango_data["yellow_rating"] + 0.2 * mango_data["softness"] + 0.3 * mango_data["estimated_sweetness"]

loss_a = (fa - mango_data["rated_flavor"]).abs().mean()
loss_b = (fb - mango_data["rated_flavor"]).abs().mean()
loss_c = (fc - mango_data["rated_flavor"]).abs().mean()

p16 = min([("a", loss_a), ("b", loss_b), ("c", loss_c)], key=lambda x: x[1])[0]
p16



### Problem 17

Load the data file "f17.tsv" and set `p17` to be a sequence of the means of each column.

In [None]:
# YOUR CHANGES HERE

df = pd.read_csv("f17.tsv", sep="\t")

p17 = df.mean().tolist()

p17

### Problem 18

Set `p18` to be the $R^2$ value of the function `f18` predicting the rated flavor for the mango data set.

In [None]:
def f18(df):
    return 0.7 * df["estimated_flavor"]

In [None]:
from sklearn.metrics import r2_score

In [None]:
y_true = mango_data["rated_flavor"]
y_pred = f18(mango_data)

# Manual cross-check: R^2 = 1 - SS_res / SS_tot.
ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1 - (ss_res / ss_tot)

In [None]:
# YOUR CHANGES HERE
# r2_score takes the true values first, then the predictions.
p18 = r2_score(y_true, y_pred)

p18

### Problem 19

Set `p19` to be the average $L_2$ loss of using `f19` to predict the rated flavor of the mango data set.

`f19` is not linear, but this should not affect your loss calculation.

In [None]:
def f19(df):
    return (df["yellow_rating"] ** 2) / 10 + df["estimated_sweetness"] * 0.5

In [None]:
# YOUR CHANGES HERE
actual = mango_data["rated_flavor"]
predicted = f19(mango_data)
p19 = ((actual - predicted) ** 2).mean()

In [None]:
p19

### Problem 20

Build a linear regression for the mango rated flavor column using all the other columns as inputs.
Set `p20` to the column with the highest positive coefficient.
(`p20` should be a string with the name of the column.)

In [None]:
mango_inputs = mango_data.drop("rated_flavor", axis=1)
mango_inputs

In [None]:
# YOUR CHANGES HERE

x = mango_inputs
y = mango_data["rated_flavor"]

model = LinearRegression()
model.fit(x, y)

coefficients = pd.Series(model.coef_, index=x.columns)

p20 = coefficients[coefficients > 0].idxmax()

Check the value of `p20`.

In [None]:
p20
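
The step of matching coefficients back to column names can be sketched on made-up data. The columns `a`, `b`, and `c`, and the target built from them, are hypothetical. The target is constructed as $2a - b + 0.5c$, so the fitted coefficients should come out close to those values.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up inputs (not the mango data).
toy_inputs = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 1, 4, 3, 6],
    "c": [0, 1, 0, 2, 1],
})
# Target built from known coefficients so the fit is easy to check.
toy_target = 2.0 * toy_inputs["a"] - 1.0 * toy_inputs["b"] + 0.5 * toy_inputs["c"]

toy_model = LinearRegression().fit(toy_inputs, toy_target)

# coef_ follows the column order of the inputs, so label it with the column names.
toy_coefficients = pd.Series(toy_model.coef_, index=toy_inputs.columns)
largest = toy_coefficients.idxmax()   # label of the largest coefficient

print(toy_coefficients.round(3).to_dict(), largest)
```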

### Generative AI Usage

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the [generative AI policy](https://www.bu.edu/cds-faculty/culture-community/gaia-policy/).
If you did not use any generative AI tools, simply write NONE below.

YOUR ANSWER HERE

None