## Homework

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can use the instructions from
[06-environment.md](../../../01-intro/06-environment.md).
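
Before starting, it can help to confirm that each library can be found by your Python interpreter. The snippet below is a minimal sketch and not part of the course instructions: the helper name `check_packages` is made up for illustration, and it only reports whether each package can be located, without importing it.

```python
from importlib.util import find_spec

# Import names of the libraries this homework relies on.
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn"]


def check_packages(names):
    """Return a dict mapping each package name to True if it can be located."""
    return {name: find_spec(name) is not None for name in names}


status = check_packages(REQUIRED)
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'MISSING'}")
```

Any package reported as missing can then be installed by following the environment instructions linked above.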

### Q1. Pandas version

What's the version of Pandas that you installed?

You can get the version information using the `__version__` field:

In [1]:
# Import libraries

import pandas as pd
import numpy as np

In [2]:
pd.__version__

'2.2.2'

### Getting the data 

For this homework, we'll use the Laptops Price dataset. Download it from 
[here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv).

You can do it with wget:

In [None]:
# !wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv

In [3]:
# Read the data into a DataFrame

data = pd.read_csv("https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv")

data.head()

| | Laptop | Status | Brand | Model | CPU | RAM | Storage | Storage type | GPU | Screen | Touch | Final Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ASUS ExpertBook B1 B1502CBA-EJ0436X Intel Core... | New | Asus | ExpertBook | Intel Core i5 | 8 | 512 | SSD | NaN | 15.6 | No | 1009.0 |
| 1 | Alurin Go Start Intel Celeron N4020/8GB/256GB ... | New | Alurin | Go | Intel Celeron | 8 | 256 | SSD | NaN | 15.6 | No | 299.0 |
| 2 | ASUS ExpertBook B1 B1502CBA-EJ0424X Intel Core... | New | Asus | ExpertBook | Intel Core i3 | 8 | 256 | SSD | NaN | 15.6 | No | 789.0 |
| 3 | MSI Katana GF66 12UC-082XES Intel Core i7-1270... | New | MSI | Katana | Intel Core i7 | 16 | 1000 | SSD | RTX 3050 | 15.6 | No | 1199.0 |
| 4 | HP 15S-FQ5085NS Intel Core i5-1235U/16GB/512GB... | New | HP | 15S | Intel Core i5 | 16 | 512 | SSD | NaN | 15.6 | No | 669.01 |


### Q2. Records count

How many records are in the dataset?

- 12
- 1000
- 2160
- 12160


In [4]:
data.shape[0]

2160
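
The row count can be read in several equivalent ways. The sketch below uses a made-up three-row CSV (not the real dataset) loaded from a string, so it runs without a download; the column values are purely illustrative.

```python
import io

import pandas as pd

# Hypothetical three-row CSV, only for illustration (not the real dataset).
csv_text = """Laptop,Brand,Final Price
Laptop A,Asus,1009.0
Laptop B,HP,669.01
Laptop C,Asus,789.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Three equivalent ways to count rows.
n_rows_shape = df.shape[0]
n_rows_len = len(df)
n_rows_index = len(df.index)

print(n_rows_shape, n_rows_len, n_rows_index)
```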

### Q3. Laptop brands

How many laptop brands are present in the dataset?

- 12
- 27
- 28
- 2160

In [5]:
data.Brand.value_counts()

Brand
Asus                415
HP                  368
Lenovo              366
MSI                 308
Acer                137
Apple               116
Dell                 84
Microsoft            77
Gigabyte             48
Razer                37
Medion               32
LG                   32
Alurin               29
PcCom                24
Samsung              22
Dynabook Toshiba     19
Vant                 11
Primux                8
Deep Gaming           8
Innjoo                6
Thomson               4
Prixton               3
Millenium             2
Denver                1
Jetwing               1
Realme                1
Toshiba               1
Name: count, dtype: int64

In [6]:
data.Brand.nunique()

27
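
`nunique()` and `value_counts()` both skip missing values by default, which matters if the column has gaps. A small sketch on a made-up brand column (the values are hypothetical) shows the difference:

```python
import pandas as pd

# Hypothetical brand column with one missing entry.
brands = pd.Series(["Asus", "HP", "Asus", None, "Dell"], name="Brand")

# Missing values are ignored by default.
n_unique = brands.nunique()

# With dropna=False the missing value counts as one extra category.
n_unique_with_na = brands.nunique(dropna=False)

# value_counts() also drops missing values by default.
n_from_counts = len(brands.value_counts())

print(n_unique, n_unique_with_na, n_from_counts)
```

In the homework data the `Brand` column has no missing values, so both approaches give the same count.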

### Q4. Missing values

How many columns in the dataset have missing values?

- 0
- 1
- 2
- 3

In [7]:
data.isna().sum()

Laptop             0
Status             0
Brand              0
Model              0
CPU                0
RAM                0
Storage            0
Storage type      42
GPU             1371
Screen             4
Touch              0
Final Price        0
dtype: int64
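
The per-column totals above still need to be turned into a count of columns. One way to do that, sketched on a made-up frame with hypothetical values, is to test each total against zero and sum the resulting booleans:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two of the three columns contain missing values.
df = pd.DataFrame(
    {
        "RAM": [8, 16, 8],
        "GPU": [np.nan, "RTX 3050", np.nan],
        "Screen": [15.6, np.nan, 14.0],
    }
)

missing_per_column = df.isna().sum()
n_columns_with_missing = int((missing_per_column > 0).sum())

# Equivalent shortcut: any() flags each column that has at least one missing value.
n_columns_with_missing_alt = int(df.isna().any().sum())

print(n_columns_with_missing, n_columns_with_missing_alt)
```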

In [8]:
data.columns

Index(['Laptop', 'Status', 'Brand', 'Model', 'CPU', 'RAM', 'Storage',
       'Storage type', 'GPU', 'Screen', 'Touch', 'Final Price'],
      dtype='object')

### Q5. Maximum final price

What's the maximum final price of Dell notebooks in the dataset?

In [9]:
data.groupby('Brand')['Final Price'].agg(['min', 'mean', 'max', 'count'])

| Brand | min | mean | max | count |
|---|---|---|---|---|
| Acer | 264.14 | 1001.285766 | 3691.0 | 137 |
| Alurin | 239.0 | 484.701379 | 869.0 | 29 |
| Apple | 299.0 | 1578.227672 | 3849.0 | 116 |
| Asus | 239.25 | 1269.380699 | 5758.14 | 415 |
| Deep Gaming | 1334.0 | 1505.3775 | 1639.01 | 8 |
| Dell | 379.0 | 1153.839881 | 3936.0 | 84 |
| Denver | 329.95 | 329.95 | 329.95 | 1 |
| Dynabook Toshiba | 397.29 | 999.197895 | 1805.01 | 19 |
| Gigabyte | 799.0 | 1698.488958 | 3799.0 | 48 |
| HP | 210.14 | 952.628478 | 5368.77 | 368 |


In [10]:
data[data.Brand == 'Dell']['Final Price'].max()

3936.0
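
The two cells above reach the same number in different ways: one aggregates every brand with `groupby`, the other filters a single brand with a boolean mask. A minimal sketch on made-up prices (hypothetical values) shows that both routes agree:

```python
import pandas as pd

# Hypothetical prices, for illustration only.
df = pd.DataFrame(
    {
        "Brand": ["Dell", "HP", "Dell", "HP"],
        "Final Price": [379.0, 210.0, 1200.0, 950.0],
    }
)

# A boolean mask with .loc selects the rows and the column in one step.
max_dell_mask = df.loc[df["Brand"] == "Dell", "Final Price"].max()

# groupby computes the maximum for every brand at once.
max_by_brand = df.groupby("Brand")["Final Price"].max()
max_dell_groupby = max_by_brand["Dell"]

print(max_dell_mask, max_dell_groupby)
```

Using `.loc` with a mask and a column label avoids the chained indexing of `df[mask]['column']`, which is harmless for reading but can cause warnings when assigning.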

### Q6. Median value of Screen

1. Find the median value of `Screen` column in the dataset.
2. Next, calculate the most frequent value of the same `Screen` column.
3. Use `fillna` method to fill the missing values in `Screen` column with the most frequent value from the previous step.
4. Now, calculate the median value of `Screen` once again.

Has it changed?

> Hint: refer to existing `mode` and `median` functions to complete the task.

- Yes
- No

In [11]:
# Calculate median
median_screen = data.Screen.median()
median_screen

15.6

In [12]:
# Calculate the mode (most frequent value)
mode_screen = data['Screen'].mode()
mode_screen

0    15.6
Name: Screen, dtype: float64

In [13]:
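# Fill in the missing values with the most frequent value.
# mode() returns a Series, so take its first element to get a scalar;
# passing the Series itself would only fill the row with index label 0.

data['Screen'] = data['Screen'].fillna(mode_screen.iloc[0])

In [14]:
# Check whether any missing values remain in the Screen column

data['Screen'].isna().sum()

0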

In [15]:
# Check whether the median is still the same as before
median_screen = data.Screen.median()
median_screen

# Answer: No, the median has not changed

15.6
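
One detail in this task is easy to miss: `Series.mode()` returns a Series (there can be several modes), and `fillna` aligns a Series argument on index labels instead of broadcasting it. The sketch below uses made-up screen sizes (hypothetical values) to contrast filling with the mode Series and filling with its first element.

```python
import numpy as np
import pandas as pd

# Hypothetical screen sizes with two missing values.
screen = pd.Series([15.6, np.nan, 14.0, 15.6, np.nan])

mode_series = screen.mode()        # a Series holding the most frequent value(s)
mode_value = mode_series.iloc[0]   # the scalar most frequent value

# Passing the Series aligns on index labels: only label 0 could be filled,
# and that position is not missing here, so nothing changes.
filled_with_series = screen.fillna(mode_series)

# Passing the scalar fills every missing position.
filled_with_scalar = screen.fillna(mode_value)

print(filled_with_series.isna().sum(), filled_with_scalar.isna().sum())
print(screen.median(), filled_with_scalar.median())
```

Because the fill value equals the existing median in this toy series, the median stays the same after filling, which mirrors the homework result.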

### Q7. Sum of weights

1. Select all the "Innjoo" laptops from the dataset.
2. Select only columns `RAM`, `Storage`, `Screen`.
3. Get the underlying NumPy array. Let's call it `X`.
4. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
5. Compute the inverse of `XTX`.
6. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100]`.
7. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
8. What's the sum of all the elements of the result?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.

- 0.43
- 45.29
- 45.58
- 91.30

In [16]:
innjoo_df = data[data.Brand == 'Innjoo']
innjoo_df = innjoo_df[['RAM', 'Storage', 'Screen']]

innjoo_df

| | RAM | Storage | Screen |
|---|---|---|---|
| 1478 | 8 | 256 | 15.6 |
| 1479 | 8 | 512 | 15.6 |
| 1480 | 4 | 64 | 14.1 |
| 1481 | 6 | 64 | 14.1 |
| 1482 | 6 | 128 | 14.1 |
| 1483 | 6 | 128 | 14.1 |


In [None]:
X = innjoo_df.values
print(X)

XTX = X.T.dot(X)
print(XTX)

XTX_inverse = np.linalg.inv(XTX)
print(XTX_inverse)

In [76]:
y = np.array([1100, 1300, 800, 900, 1000, 1100])

In [78]:
# Matrix-vector product (not element-wise `*`), so w is a vector of 3 weights
w = XTX_inverse @ X.T @ y

In [80]:
round(w.sum(), 2)

91.3
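
The steps of Q7 amount to the normal equation $w = (X^T X)^{-1} X^T y$. The sketch below repeats them with the Innjoo rows typed in by hand from the table above, so it runs without the dataset. It also contrasts the matrix product `@` with element-wise `*`, and cross-checks the weights against `np.linalg.lstsq`; that cross-check is an addition for illustration and not part of the homework.

```python
import numpy as np

# Feature rows (RAM, Storage, Screen) copied from the Innjoo table above.
X = np.array(
    [
        [8, 256, 15.6],
        [8, 512, 15.6],
        [4, 64, 14.1],
        [6, 64, 14.1],
        [6, 128, 14.1],
        [6, 128, 14.1],
    ]
)
y = np.array([1100, 1300, 800, 900, 1000, 1100])

XTX = X.T @ X
XTX_inv = np.linalg.inv(XTX)

# Matrix-vector product: a (3, 6) matrix times a length-6 vector gives 3 weights.
w = XTX_inv @ X.T @ y

# Element-wise `*` broadcasts y across the rows and yields a (3, 6) array instead.
elementwise = (XTX_inv @ X.T) * y

# Cross-check against NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w.shape, elementwise.shape)
print(round(float(w.sum()), 2))
```

The entries of the element-wise array add up to the same total as the weights, so the final sum does not reveal the difference. Only the `@` version produces the weight vector itself.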

## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2024/homework/hw01
* If your answer doesn't match options exactly, select the closest one