# Homework 7 - Multiple Regression, K Nearest Neighbors
CS 133  
Dr. Henderson  
Spring 2025  
v1

---

In [18]:
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

## 1. Multiple Regression

A climate research team collected data on regional microclimates to understand carbon sequestration rates in forest ecosystems. They measured 20 standardized environmental parameters (all normalized to center around zero). Their target variable was the **carbon sequestration rate** measured in tons per hectare per year, which ranged from -20 (indicating carbon release) to +20 (indicating strong carbon capture), with zero representing carbon neutrality. The researchers aimed to build a model that could predict how these various environmental factors collectively influence whether a forest area acts as a carbon sink or source.

1.1 (5pts) Using the `carbon.csv` dataset, apply multiple regression to create a model that can predict the carbon sequestration rate based on the 20 standardized environmental parameters.

In [24]:
carbon = pd.read_csv('carbon.csv')
pf2 = PolynomialFeatures(degree=2)
carbon.info()
carbon.describe()
carbon.head()
#Need to drop carbon sequestration rate.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 399 entries, 0 to 398
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   soil_ph_deviation           399 non-null    float64
 1   moisture_content_variation  399 non-null    float64
 2   nitrogen_level_fluctuation  399 non-null    float64
 3   tree_density_difference     399 non-null    float64
 4   understory_diversity_index  399 non-null    float64
 5   canopy_coverage_deviation   399 non-null    float64
 6   temperature_anomaly         399 non-null    float64
 7   precipitation_difference    399 non-null    float64
 8   elevation_variation         399 non-null    float64
 9   wind_pattern_change         399 non-null    float64
 10  sunlight_exposure_shift     399 non-null    float64
 11  fungal_colony_presence      399 non-null    float64
 12  insect_population_variance  399 non-null    float64
 13  water_body_influence        399 non

Unnamed: 0,soil_ph_deviation,moisture_content_variation,nitrogen_level_fluctuation,tree_density_difference,understory_diversity_index,canopy_coverage_deviation,temperature_anomaly,precipitation_difference,elevation_variation,wind_pattern_change,...,fungal_colony_presence,insect_population_variance,water_body_influence,forest_edge_proximity,human_activity_impact,wildfire_history,geographic_orientation,local_pollution_level,adjacent_landscape_effect,carbon_sequestration_rate
0,-0.005818,-0.006002,-0.007275,-0.00622,0.008112,-0.027816,-0.037692,0.002191,-0.040248,-0.012,...,0.035238,0.03187,0.007499,-0.025301,0.020776,-0.051908,0.026896,-0.014664,-0.011946,1.334152
1,-0.004206,-0.004483,-0.003353,-0.001938,-0.025225,-0.003248,-0.003622,0.020648,-0.043638,-0.004623,...,-0.010668,-0.033787,-0.01312,0.002057,0.008023,-0.015728,0.016876,-0.034842,0.039858,2.174149
2,-0.01732,0.029524,-0.034604,0.00872,-0.003044,-0.006266,0.013242,-0.001315,0.015953,-0.003595,...,0.013845,-0.018368,-0.000995,-0.011459,-0.009702,0.036563,0.008172,-0.012342,0.036234,9.427182
3,-0.034326,0.035395,-0.007513,-0.010726,0.034265,-0.011407,-0.028913,0.021936,-0.023797,-0.007403,...,0.012387,0.02814,0.009673,0.005057,-0.001007,0.011433,-0.002299,0.012543,-0.012346,11.17785
4,0.017268,-0.031469,-0.012357,-0.005917,0.000782,-0.019566,0.035803,0.009124,0.037865,0.037545,...,0.019093,-0.003897,0.013784,-0.004139,-0.002013,0.010758,0.011178,0.04286,0.016149,7.595099
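
As a rough sketch of the workflow in 1.1, the snippet below fits a `LinearRegression` model on a small synthetic frame. The column names `f1`, `f2`, and `target`, as well as the generated values, are assumptions made for illustration and are not taken from `carbon.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical columns, not the homework dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"f1": rng.normal(size=50), "f2": rng.normal(size=50)})
demo["target"] = 3.0 * demo["f1"] - 2.0 * demo["f2"] + 1.0  # exact linear relation

# Separate the input columns from the target column before fitting.
X_demo = demo.drop(columns="target")
y_demo = demo["target"]

model = LinearRegression().fit(X_demo, y_demo)
print(model.coef_, model.intercept_)
```

Because the synthetic target is an exact linear combination of the inputs, the fitted coefficients should land very close to 3, -2 and the intercept close to 1. Real data will not fit this cleanly.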


1.2 (2 pts) How well does your model explain the data? (Use an appropriate metric and explain what it means).

(Explain here)
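
One commonly used metric for regression is the coefficient of determination, $ R^2 $, which can be read as the fraction of the variance in the target that the model accounts for. The sketch below computes it by hand and compares it with `LinearRegression.score`. The data is synthetic and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: three inputs, a linear signal, and some noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -0.5, 2.0]) + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = ((y_demo - pred) ** 2).sum()
ss_tot = ((y_demo - y_demo.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot

# score() on a regressor returns the same quantity.
r2_sklearn = model.score(X_demo, y_demo)
print(r2_manual, r2_sklearn)
```

A value near 1 means the predictions track the target closely, while a value near 0 means the model does little better than always predicting the mean.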

1.3 (2 pts) A new microclimate was measured on the island of Borneo. Use the data from the file `borneo.csv` and your regression model to predict the carbon sequestration rate of Borneo. Interpret the results.

(Interpret here)

## 2. Polynomial Regression

Health experts at Regional Medical Center are studying factors that influence recovery time following total knee replacement surgery. They've collected data from 399 patients to develop a predictive model that could help healthcare providers optimize post-surgical care plans and provide more accurate recovery timelines to patients.

For each patient, the researchers measured 12 key health metrics (normalized to a 0-1 scale). The target variable is the post-surgical **recovery time**, measured as the number of days (ranging from 0-25) until the patient achieves independent mobility according to standardized assessments.

The orthopedic department wants to develop a reliable model to predict recovery timelines for future patients. This would allow for more personalized care plans, better resource allocation, and improved patient expectations management.

2.1 (4pts) Using the `orthopedic_recovery.csv` dataset, apply standard (non-polynomial) multiple regression to create a model that can predict the recovery time based on the 12 input parameters.

2.2 (2pts) What is the accuracy of your model?

(Explain here)

2.3 (5pts) Using multiple polynomial regression, build another model of the `orthopedic_recovery.csv` dataset.
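
A minimal sketch of polynomial regression, assuming a degree of 2 and a single synthetic input: the inputs are first expanded with `PolynomialFeatures`, and an ordinary `LinearRegression` is then fit on the expanded columns. The data and the names `x`, `y`, `poly_model`, and `plain_model` are illustrative assumptions, not part of the homework dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with an exact quadratic relationship.
x = np.linspace(-2, 2, 30).reshape(-1, 1)
y = 2.0 * x[:, 0] ** 2 - x[:, 0] + 0.5

# Expand the single input into the columns 1, x, x^2.
pf = PolynomialFeatures(degree=2)
x_poly = pf.fit_transform(x)

# The polynomial model is a linear model fit on the expanded columns.
poly_model = LinearRegression().fit(x_poly, y)

# A plain linear model on the raw input, for comparison.
plain_model = LinearRegression().fit(x, y)

print(plain_model.score(x, y), poly_model.score(x_poly, y))
```

On this curved synthetic data the polynomial model scores higher than the plain one. On real data the gap depends on how nonlinear the relationship is, and higher degrees can overfit.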

2.4 (2pts) What is the accuracy of your new model?

(Explain here)

2.5 (5pts) Using the data in the `patient.csv` file predict the recovery time for the patient using both models.  
_Note: you will need to transform the data before passing it to your polynomial model_
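
The note above can be illustrated with a small sketch: a new sample has to pass through the same fitted `PolynomialFeatures` object (using `transform`, not `fit_transform`) before it reaches the polynomial model. The columns `a` and `b` and all values here are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic training data with two hypothetical inputs.
rng = np.random.default_rng(2)
train = pd.DataFrame({"a": rng.uniform(0, 4, size=20), "b": rng.uniform(0, 4, size=20)})
target = train["a"] ** 2 + train["b"]  # exact degree-2 relationship

# Fit the feature expansion and the model on the training data.
pf = PolynomialFeatures(degree=2)
poly_model = LinearRegression().fit(pf.fit_transform(train), target)

# A new sample with the same columns as the training data.
new_sample = pd.DataFrame({"a": [5.0], "b": [2.0]})

# Reuse the already-fitted transformer; do not fit it again on the new sample.
new_poly = pf.transform(new_sample)
prediction = poly_model.predict(new_poly)[0]
print(prediction)
```

Passing `new_sample` directly to `poly_model.predict` would fail, since the model was trained on six expanded columns rather than the two raw ones.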

2.6 (2pts) Based on these predictions, what would be your estimate and how confident are you?

(Answer here)

## 3. K Nearest Neighbors

The kNN algorithm requires a distance metric between samples in the sample space. The most common distance metric used is Euclidean which is defined as:  

$$ \mathrm{distance}(a, b) = \sqrt{(a_{f_1} - b_{f_1})^2 + (a_{f_2} - b_{f_2})^2 + \cdots + (a_{f_n} - b_{f_n})^2} $$

Here $ a $ and $ b $ are two samples in the dataset and $ f_i $ denotes feature $ i $ of the dataset, for $ i = 1, \dots, n $.
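
As a small worked example of the formula, the sketch below evaluates it with NumPy arrays for the points (1, 2) and (3, 4), then cross-checks the result against `np.linalg.norm`. The variable names are chosen for illustration only.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# Square the per-feature differences, sum them, then take the square root.
by_formula = np.sqrt(((a - b) ** 2).sum())

# The Euclidean norm of the difference vector is the same quantity.
by_norm = np.linalg.norm(a - b)

print(by_formula, by_norm)
```

Both differences are 2, so the distance is the square root of 4 + 4 = 8.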

3.1 (7pts) Create a Python function to calculate the distance between two Pandas Series. You can assume the series will have the same number of features (although it would be a good idea to check), and that all features are numeric. Try not to use Python `for` loops.

In [15]:
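def distance(a, b):
    # Euclidean distance between two numeric Series with the same number of features.
    if len(a) != len(b):
        raise ValueError("a and b must have the same number of features")
    # Vectorized: square the element-wise differences, sum, then take the square root.
    return (((a - b) ** 2).sum()) ** 0.5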
        

3.2 (3pts) Check your function by creating a DataFrame from the list `[[1, 2], [3, 4]]` and calling your function with the two rows.

In [16]:
l = pd.DataFrame([ [1,2],[3,4] ])
distance(l.iloc[0],l.iloc[1])

np.float64(2.8284271247461903)

3.3 (14pts) To predict the value of an unlabeled sample, the kNN algorithm finds the $ k $ nearest samples in the dataset using the Euclidean distance. This requires testing the new sample against each row in the dataset and remembering the nearest $ k $ rows.  

Create a Python function called `kNN` that takes a parameter $ k $, an unlabeled sample as a Pandas Series, and a dataframe of labeled samples. Your function should return a dataframe of the $ k $ rows that are closest to the unlabeled sample.

Suggested pseudocode:
```
k is the number of neighbors to use
u is an unlabeled sample
X is the input samples from the labeled dataframe

neighbors is a new list of tuples
for each row in X
  calculate the distance to u
  append a tuple of the row index and distance to the neighbors list
  sort the neighbors list by distance
  truncate the neighbors list to the top k entries
select the rows in X from the index values in the neighbors list and return the new dataframe
```

_Hint: To sort a list of tuples on one of the tuple elements use `list.sort(key=lambda x: x[n])` where `n` is the position of the value to sort on._
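
To illustrate the hint, here is a short sketch that sorts a list of `(index, distance)` tuples by the second element and keeps the two smallest. The values in `pairs` are made up for the example.

```python
# Hypothetical (row index, distance) pairs.
pairs = [(0, 5.0), (1, 1.5), (2, 3.2), (3, 0.7)]

# Sort in place on position 1 of each tuple (the distance).
pairs.sort(key=lambda t: t[1])

# Keep only the two closest entries.
nearest_two = pairs[:2]
print(nearest_two)
```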

In [17]:
def kNN(k, u, X):
    # Returns a dataframe of the k rows of X that are nearest to u.
    neighbors = []
    for idx, row in X.iterrows():
        neighbors.append((idx, distance(u, row)))
        # Keep only the k closest (index, distance) pairs seen so far.
        neighbors.sort(key=lambda pair: pair[1])
        neighbors = neighbors[:k]
    # iterrows() yields index labels, so select the rows with .loc.
    return X.loc[[n[0] for n in neighbors]]
        

3.4 (3pts) Test your function using the unit tests in the cells below.

In [None]:
kNN(1, pd.Series([1,2]), pd.DataFrame([[1,2],[3,4],[10,10]]))

In [None]:
kNN(2, pd.Series([1,2]), pd.DataFrame([[1,2],[3,4],[10,10]]))

In [None]:
kNN(1, pd.Series([8,7]), pd.DataFrame([[1,2],[3,4],[10,10]]))

3.5 (8pts) To predict the class of an unlabeled sample, kNN uses the majority class from $ k $ nearest neighbors.

Create a function called `kNN_predict` which takes the same parameters as your function from 3.3 along with the corresponding class labels (e.g., `kNN_predict(k, u, X, y)`). Your function should perform the following:

- Calls your kNN function to get the $ k $ samples closest to $ u $
- Uses the closest $ k $ samples to return the predicted class (hint: use the mode of the $ k $ samples)

Your function should return the value of the predicted class (make sure the return value is not a `Series` or `DataFrame` object).
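
As a sketch of the mode hint, the snippet below takes the most frequent label from a small Series and unwraps it into a plain value. The labels are invented for the example. `Series.mode()` itself returns a Series, which is why the extra `.iloc[0]` step is needed.

```python
import pandas as pd

# Hypothetical labels of the k nearest neighbors.
labels = pd.Series(["orange", "apple", "orange"])

# mode() returns a Series of the most frequent value(s), sorted.
modes = labels.mode()

# Take the first entry so the result is a plain value, not a Series.
predicted = modes.iloc[0]
print(predicted)
```

When two classes tie, `mode()` returns both, and taking the first entry picks whichever sorts first. Whether that tie-breaking rule is acceptable is a design choice left to the implementer.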

3.6 (2pts) Test your function on the two unit tests below.

In [None]:
kNN_predict(1, pd.Series([1,2]), pd.DataFrame([[1,2],[3,4],[10,10]]), pd.Series(['apple','orange', 'orange']))

In [None]:
kNN_predict(3, pd.Series([1,2]), pd.DataFrame([[1,2],[3,4],[10,10]]), pd.Series(['apple','orange', 'orange']))

Using Seaborn load the `iris` dataset and create a dataframe of samples `X` by dropping the `species` column. Then create the corresponding labels by assigning the name `y` to the `species` column.

Load the `flowers.csv` file from the homework directory into a dataframe named `flowers`.

3.7 (4pts) For each row in the `flowers` dataframe, call your `kNN_predict` function with $ k = 3 $ to predict the class.

## 4. K Nearest Neighbors in scikit-learn

The scikit-learn module `neighbors` contains a kNN classifier named `KNeighborsClassifier`.

4.1 (3pts) Use the `KNeighborsClassifier` to fit a model to the `X` and `y` from the `iris` dataset you created above. Use $ k = 3 $ in your model.

4.2 (2pts) Use your model to predict the class of each flower in the `flowers` dataframe you loaded above.
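
A minimal sketch of the fit-then-predict pattern with `KNeighborsClassifier`, using a tiny invented dataset instead of `iris`. The columns `w` and `h`, the class names, and the sample values are all assumptions for illustration.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labeled samples forming two well-separated groups.
X_demo = pd.DataFrame({
    "w": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],
    "h": [1.0, 0.8, 1.1, 5.0, 4.9, 5.2],
})
y_demo = pd.Series(["small", "small", "small", "large", "large", "large"])

# n_neighbors is the k in kNN.
clf = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

# Unlabeled samples must have the same columns as the training data.
new_points = pd.DataFrame({"w": [1.1, 4.9], "h": [0.9, 5.1]})
predictions = clf.predict(new_points)
print(predictions)
```

`predict` returns one label per row, so the result can be compared against the output of a hand-written kNN on the same inputs.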

---

### Submission Instructions

Be sure to ***SAVE YOUR WORK***!  

Next, select Kernel -> Restart Kernel and Run All Cells...

Make sure there are no errors.

Then select File -> Save and Export Notebook As -> HTML and submit your HTML file to Canvas.