## Homework #1

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can follow the instructions from <a href='https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/01-intro/06-environment.md'>06-environment.md</a>.

In [1]:
import numpy as np
import pandas as pd

### Question 1

What's the version of Pandas that you installed?

You can get the version information using the `__version__` field:

In [2]:
pd.__version__

'2.0.3'

### Getting the data

For this homework, we'll use the California Housing Prices dataset. Download it from <a href='https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv'>here</a>.

You can do it with wget:
```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
```
Or just open it with your browser and click "Save as...".

Now read it with Pandas.

In [3]:
data = pd.read_csv("https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv")

In [4]:
data.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


### Question 2

How many columns are in the dataset?

In [5]:
data.shape[1]

10

### Question 3

Which columns in the dataset have missing values?

In [6]:
data.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

`total_bedrooms`
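
To get only the names of the columns that contain missing values, the counts can be filtered with a boolean mask. Below is a minimal sketch on a toy DataFrame: the `toy` frame and its values are made up for illustration, and on the real data you would apply the same two lines to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only (not the real dataset)
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

# Count missing values per column
missing_counts = toy.isna().sum()

# Keep only the columns with at least one missing value
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(cols_with_missing)
```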

### Question 4

How many unique values does the `ocean_proximity` column have?

In [7]:
data.ocean_proximity.value_counts()

ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64

In [8]:
data.ocean_proximity.nunique()

5

### Question 5

What's the average value of `median_house_value` for the houses located near the bay?

In [9]:
data.groupby('ocean_proximity')['median_house_value'].agg(['min', 'mean', 'max'])

| ocean_proximity | min | mean | max |
|---|---|---|---|
| <1H OCEAN | 17500.0 | 240084.285464 | 500001.0 |
| INLAND | 14999.0 | 124805.392001 | 500001.0 |
| ISLAND | 287500.0 | 380440.0 | 450000.0 |
| NEAR BAY | 22500.0 | 259212.31179 | 500001.0 |
| NEAR OCEAN | 22500.0 | 249433.977427 | 500001.0 |


In [10]:
data[data.ocean_proximity == 'NEAR BAY']['median_house_value'].mean()

259212.31179039303
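
The same filter can also be written with `.loc`, which selects the rows and the column in a single step instead of chaining two indexing operations. A small sketch, assuming a toy DataFrame whose values are invented for illustration:

```python
import pandas as pd

# Toy frame, invented for illustration only (not the real dataset)
toy = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"],
    "median_house_value": [100.0, 50.0, 300.0, 400.0],
})

# Boolean mask that is True for the rows near the bay
near_bay = toy["ocean_proximity"] == "NEAR BAY"

# Select rows by mask and one column by name, then average
near_bay_mean = toy.loc[near_bay, "median_house_value"].mean()
print(near_bay_mean)
```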

### Question 6



1. Calculate the average of `total_bedrooms` column in the dataset
2. Use the `fillna` method to fill the missing values in `total_bedrooms` with the mean value from the previous step.
3. Now, calculate the average of `total_bedrooms` again.
4. Has it changed?

> **Note**: take into account only 3 digits after the decimal point

In [11]:
mean_bedrooms = data['total_bedrooms'].mean()
mean_bedrooms

537.8705525375618

In [12]:
data['total_bedrooms'].fillna(mean_bedrooms).mean()

537.8705525375617

No, it stays the same (up to 3 digits after the decimal point).
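
This is expected: `mean()` skips missing values by default, so if the `n` observed values have mean `m`, filling `k` missing entries with `m` gives a new mean of `(n*m + k*m) / (n + k) = m`. Any difference in the last digit is floating-point rounding. A minimal sketch with a toy series (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy series with one missing value, invented for illustration only
s = pd.Series([1.0, 2.0, np.nan, 6.0])

# mean() ignores NaN by default: (1 + 2 + 6) / 3
mean_before = s.mean()

# After filling, the mean is taken over all four values
mean_after = s.fillna(mean_before).mean()

print(mean_before, mean_after)
```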

### Question 7

1. Select all the rows located on islands.
2. Select only columns `housing_median_age`, `total_rooms`, `total_bedrooms`.
3. Get the underlying NumPy array. Let's call it `X`.
4. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
5. Invert `XTX`.
6. Create an array `y` with values `[950, 1300, 800, 1000, 1300]`.
7. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
8. What's the value of the last element of `w`?

In [14]:
island_df = data[data.ocean_proximity == 'ISLAND']
island_df = island_df[['housing_median_age', 'total_rooms', 'total_bedrooms']]
island_df

| | housing_median_age | total_rooms | total_bedrooms |
|---|---|---|---|
| 8314 | 27.0 | 1675.0 | 521.0 |
| 8315 | 52.0 | 2359.0 | 591.0 |
| 8316 | 52.0 | 2127.0 | 512.0 |
| 8317 | 52.0 | 996.0 | 264.0 |
| 8318 | 29.0 | 716.0 | 214.0 |


In [15]:
X = island_df.values
XTX = X.T.dot(X)

XTX_inv = np.linalg.inv(XTX)
XTX_inv

array([[ 9.19403586e-04, -3.66412216e-05,  5.43072261e-05],
       [-3.66412216e-05,  8.23303633e-06, -2.77534485e-05],
       [ 5.43072261e-05, -2.77534485e-05,  1.00891325e-04]])

In [16]:
y = np.array([950, 1300, 800, 1000, 1300])

In [17]:
w = (XTX_inv @ X.T) @ y

In [18]:
w

array([23.12330961, -1.48124183,  5.69922946])

In [19]:
w[2]

5.6992294550655656

> **Note**: we just implemented the normal equation


$$w = (X^T X)^{-1} X^T y$$


We'll talk about it more next week (Machine Learning for Regression).
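
As a side note, the same weights can be obtained without forming the inverse explicitly, by solving the linear system `(X^T X) w = X^T y` with `np.linalg.solve`. This is not what the homework asks for, only an alternative sketch. The array `X` below is typed in by hand from the island rows shown earlier, and the names `w_inv` and `w_solve` are introduced here for illustration.

```python
import numpy as np

# Island rows copied by hand from the output above
# (columns: housing_median_age, total_rooms, total_bedrooms)
X = np.array([
    [27.0, 1675.0, 521.0],
    [52.0, 2359.0, 591.0],
    [52.0, 2127.0, 512.0],
    [52.0, 996.0, 264.0],
    [29.0, 716.0, 214.0],
])
y = np.array([950, 1300, 800, 1000, 1300])

# Normal equation with an explicit inverse, as in the homework
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Same system solved directly, without computing the inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

print(w_inv)
print(w_solve)
```

Both approaches should agree up to floating-point error. Solving the system directly is usually preferred in practice because it avoids the extra rounding error that an explicit inverse can introduce.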