## HW6: Statistical Fundamentals


In [2]:
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline

###### 1. In the Week 6 Statistics class we used the example of political polling and the need for random sampling to avoid bias. 

###### How can we introduce error if we do not sufficiently randomize our sampling?

We introduce bias into our results if we do not properly randomize our sampling. For instance, in a political poll, if we survey only people from a specific region or demographic, the sample won't reflect the entire population, and the results will be skewed. This is called sampling bias: some groups are over-represented while others are under-represented, so predictions drawn from the sample can be inaccurate. Without random sampling, we cannot expect the sample to be representative of the population.
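The effect can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the two regions, their support rates (65% and 35%), and the sample size of 1,000 are hypothetical values chosen for illustration, not real polling data.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: two regions with different levels of support.
# 1 = supports the candidate, 0 = does not.
urban = [1 if random.random() < 0.65 else 0 for _ in range(50_000)]
rural = [1 if random.random() < 0.35 else 0 for _ in range(50_000)]
population = urban + rural

true_support = sum(population) / len(population)

# Random sample: every voter is equally likely to be chosen.
random_sample = random.sample(population, 1_000)
random_estimate = sum(random_sample) / len(random_sample)

# Biased sample: only urban voters are surveyed.
biased_sample = random.sample(urban, 1_000)
biased_estimate = sum(biased_sample) / len(biased_sample)

random_error = abs(random_estimate - true_support)
biased_error = abs(biased_estimate - true_support)

print(f"true support:    {true_support:.3f}")
print(f"random estimate: {random_estimate:.3f} (error {random_error:.3f})")
print(f"biased estimate: {biased_estimate:.3f} (error {biased_error:.3f})")
```

The random sample's error shrinks as the sample grows, while the biased sample's error does not: surveying more urban voters only gives a more precise estimate of urban support.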

###### 2. What is the difference between variance and standard deviation?

Variance is the average of the squared differences from the mean. It measures how far the data points are spread out from the mean, but its units are the square of the data's units.

Standard deviation is the square root of the variance. It has the same units as the data, which makes it easier to interpret in the context of the data.
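A short worked example may make the relationship concrete. The `prices` values below are hypothetical. One detail worth noting: the definition above divides by n (population variance), whereas pandas `DataFrame.var()` and `DataFrame.std()` default to `ddof=1`, which divides by n - 1 (sample variance).

```python
import math
import statistics

# Hypothetical data, in dollars.
prices = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = statistics.fmean(prices)

# Population variance: average squared difference from the mean (dollars squared).
pop_var = sum((x - mean) ** 2 for x in prices) / len(prices)

# Standard deviation: square root of the variance (back in dollars).
pop_std = math.sqrt(pop_var)

# Sample variance and standard deviation divide by n - 1 instead of n,
# which matches the pandas default (ddof=1).
sample_var = statistics.variance(prices)
sample_std = statistics.stdev(prices)

print(pop_var, pop_std)        # 4.0 2.0
print(sample_var, sample_std)
```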

###### 3. Load the sklearn California housing dataset into a DataFrame.  Don't forget to get the target column as well.

###### https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing

In [1]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

california_housing = fetch_california_housing()

df = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
df['MedHouseVal'] = california_housing.target  

df.tail()

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.17192 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |
| 20639 | 2.3886 | 16.0 | 5.254717 | 1.162264 | 1387.0 | 2.616981 | 39.37 | -121.24 | 0.894 |


###### 4. Calculate the Variance and Standard Deviation of each column using the pandas functions.

In [2]:
var = df.var()
std = df.std()

var, std

(MedInc         3.609323e+00
 HouseAge       1.583963e+02
 AveRooms       6.121533e+00
 AveBedrms      2.245915e-01
 Population     1.282470e+06
 AveOccup       1.078700e+02
 Latitude       4.562293e+00
 Longitude      4.014139e+00
 MedHouseVal    1.331615e+00
 dtype: float64,
 MedInc            1.899822
 HouseAge         12.585558
 AveRooms          2.474173
 AveBedrms         0.473911
 Population     1132.462122
 AveOccup         10.386050
 Latitude          2.135952
 Longitude         2.003532
 MedHouseVal       1.153956
 dtype: float64)

###### 5. Create a correlation matrix for the columns (including the target column, or median house value).

In [3]:
corr_matrix = df.corr()
corr_matrix

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| MedInc | 1.0 | -0.119034 | 0.326895 | -0.06204 | 0.004834 | 0.018766 | -0.079809 | -0.015176 | 0.688075 |
| HouseAge | -0.119034 | 1.0 | -0.153277 | -0.077747 | -0.296244 | 0.013191 | 0.011173 | -0.108197 | 0.105623 |
| AveRooms | 0.326895 | -0.153277 | 1.0 | 0.847621 | -0.072213 | -0.004852 | 0.106389 | -0.02754 | 0.151948 |
| AveBedrms | -0.06204 | -0.077747 | 0.847621 | 1.0 | -0.066197 | -0.006181 | 0.069721 | 0.013344 | -0.046701 |
| Population | 0.004834 | -0.296244 | -0.072213 | -0.066197 | 1.0 | 0.069863 | -0.108785 | 0.099773 | -0.02465 |
| AveOccup | 0.018766 | 0.013191 | -0.004852 | -0.006181 | 0.069863 | 1.0 | 0.002366 | 0.002476 | -0.023737 |
| Latitude | -0.079809 | 0.011173 | 0.106389 | 0.069721 | -0.108785 | 0.002366 | 1.0 | -0.924664 | -0.14416 |
| Longitude | -0.015176 | -0.108197 | -0.02754 | 0.013344 | 0.099773 | 0.002476 | -0.924664 | 1.0 | -0.045967 |
| MedHouseVal | 0.688075 | 0.105623 | 0.151948 | -0.046701 | -0.02465 | -0.023737 | -0.14416 | -0.045967 | 1.0 |

###### Which feature has the highest correlation to the target?

In [4]:
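# Correlation of each feature with the target, excluding the target's
# self-correlation of 1.0 so we do not rely on its position after sorting
target_corr = corr_matrix['MedHouseVal'].drop('MedHouseVal')
highest_corr_feature = target_corr.idxmax()
highest_corr_value = target_corr.max()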

highest_corr_feature, highest_corr_value

('MedInc', 0.6880752079585484)

MedInc (median income) has the highest correlation with the target, at about 0.69.
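Picking the largest signed value works for this dataset because MedInc also has the largest correlation in absolute terms (the next strongest is Latitude at about -0.14). In general, though, a strong negative correlation is just as informative as a strong positive one. The toy DataFrame below is hypothetical and only sketches the difference between the two rankings.

```python
import pandas as pd

# Hypothetical toy data: 'b' falls perfectly as the target rises,
# while 'a' rises with it only loosely.
toy = pd.DataFrame({
    "a": [2, 1, 4, 3, 5],
    "b": [10, 8, 6, 4, 2],
    "target": [1, 2, 3, 4, 5],
})

# Correlation of each feature with the target, excluding the target itself.
target_corr = toy.corr()["target"].drop("target")

signed_best = target_corr.idxmax()          # largest positive correlation
absolute_best = target_corr.abs().idxmax()  # strongest correlation of either sign

print(target_corr)
print(signed_best, absolute_best)
```

Here the signed ranking selects `a` (r = 0.8) and overlooks `b` (r = -1.0), which is the stronger linear relationship.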

###### 6. Create both KNN and Random Forest regression models on the California Housing data.  Pick an accuracy metric and determine which model is more accurate.  Remember to split training and testing data.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

x = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [6]:
# KNN Model
knn_model = KNeighborsRegressor(n_neighbors=5)
knn_model.fit(x_train, y_train)
y_pred_knn = knn_model.predict(x_test)

In [7]:
# Random Forest Model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(x_train, y_train)
y_pred_rf = rf_model.predict(x_test)

In [8]:
from sklearn.metrics import r2_score

# Calculate R² (coefficient of determination) on the test set for both models
score_knn = r2_score(y_test, y_pred_knn)
score_rf = r2_score(y_test, y_pred_rf)
score_knn, score_rf

(0.14631049965900345, 0.8051230593157366)

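Random Forest is more accurate: its test-set R² is about 0.81, compared with about 0.15 for KNN. Part of KNN's weak score is likely due to the unscaled features. KNN relies on distances between rows, and columns with large ranges, such as Population, dominate those distances.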
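One way to see how feature scale affects KNN is to standardize the features before fitting. The sketch below uses synthetic data rather than the California Housing data: the two informative features, the large-scale noise column, and its scale of 1,000,000 are assumptions chosen to exaggerate the effect. It does not show how much scaling would change the KNN score reported above.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 2_000

# Synthetic data: two informative features on a unit scale, plus one
# irrelevant feature on a much larger scale.
informative = rng.normal(size=(n, 2))
large_scale_noise = rng.normal(scale=1_000_000.0, size=(n, 1))
X = np.hstack([informative, large_scale_noise])
y = informative[:, 0] + 2.0 * informative[:, 1] + rng.normal(scale=0.1, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# KNN on raw features: distances are dominated by the large-scale column.
knn_raw = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
r2_raw = r2_score(y_test, knn_raw.predict(X_test))

# KNN on standardized features: the scaler is fit on the training data only.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn_scaled.fit(X_train, y_train)
r2_scaled = r2_score(y_test, knn_scaled.predict(X_test))

print(f"R² without scaling: {r2_raw:.3f}")
print(f"R² with scaling:    {r2_scaled:.3f}")
```

Wrapping the scaler and the model in a pipeline keeps the test data out of the scaling statistics. Random forests split on one feature at a time, so they are largely insensitive to this kind of rescaling.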