In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

Question 1
What's the version of NumPy that you installed?

In [2]:
np.__version__

'1.20.1'

Getting the Data

In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv")

In [4]:
#make sure dataset has been loaded
df.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


Question 2
How many records are in the dataset?

Here you need to specify the number of rows.

In [5]:
#using the shape attribute (it is an attribute, not a method, so no parentheses)
#note: shape returns (number of rows, number of columns)
df.shape

(11914, 16)

In [6]:
#or using len function
len(df)

11914

As you can see, the dataset has 11914 rows, whether we read it from the shape attribute or from the len function.

Question 3
Who are the most frequent car manufacturers (top-3) according to the dataset?

In [7]:
df.groupby('Make').Make.count().nlargest(3)

Make
Chevrolet     1123
Ford           881
Volkswagen     809
Name: Make, dtype: int64

The three most frequent car manufacturers in the dataset are:
1. Chevrolet
2. Ford
3. Volkswagen
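
The same kind of top-N count can also be sketched with value_counts, which sorts by frequency in descending order by default. The toy DataFrame below is hypothetical and only stands in for the "Make" column:

```python
import pandas as pd

# Hypothetical toy data standing in for the "Make" column of the dataset.
toy = pd.DataFrame({"Make": ["Ford", "Audi", "Ford", "BMW", "Audi", "Ford"]})

# value_counts() counts each distinct value and sorts the counts descending,
# so head(2) keeps the two most frequent manufacturers.
top2 = toy["Make"].value_counts().head(2)
print(top2)
```

On the real data, the equivalent call would be df['Make'].value_counts().head(3).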

Question 4
What's the number of unique Audi car models in the dataset?

In [8]:
df[df['Make']=='Audi']['Model'].nunique()

34

There are 34 unique Audi models in the dataset.

Question 5
How many columns in the dataset have missing values?

In [9]:
df.isnull().sum()

Make                    0
Model                   0
Year                    0
Engine Fuel Type        3
Engine HP              69
Engine Cylinders       30
Transmission Type       0
Driven_Wheels           0
Number of Doors         6
Market Category      3742
Vehicle Size            0
Vehicle Style           0
highway MPG             0
city mpg                0
Popularity              0
MSRP                    0
dtype: int64

Based on this output, 5 columns in the dataset have missing values (a count of zero means the column has no missing values):
1. Engine Fuel Type
2. Engine HP 
3. Engine Cylinders
4. Number of Doors 
5. Market Category 
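
Instead of counting the non-zero rows by eye, the number of affected columns can be computed directly. This is a minimal sketch on a hypothetical toy DataFrame; the column names a, b, c are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: columns "a" and "c" contain missing values, "b" does not.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", "z"],
    "c": [np.nan, np.nan, 1.0],
})

# Missing values per column, as in the cell above.
missing_per_column = toy.isnull().sum()

# Number of columns with at least one missing value, and their names.
n_cols_with_missing = int((missing_per_column > 0).sum())
cols_with_missing = missing_per_column[missing_per_column > 0].index.tolist()
print(n_cols_with_missing, cols_with_missing)
```

Applied to df, the same two expressions would give the count and the list of column names without manual inspection.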

Question 6
Find the median value of "Engine Cylinders" column in the dataset.

In [10]:
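#statistics.median sorts the data, and NaN values make that ordering unreliable,
#so use the pandas median, which skips missing values by default.
#(the statistics library is still imported because it is used in the next cells)
import statistics

median = df['Engine Cylinders'].median()
median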

6.0

Next, calculate the most frequent value of the same "Engine Cylinders".

In [11]:
#pandas mode() ignores missing values and returns a Series; take its first entry
mode = df['Engine Cylinders'].mode()[0]
mode

4.0

Use the fillna method to fill the missing values in "Engine Cylinders" with the most frequent value from the previous step.

In [12]:
New_Engine_Cylinders=df['Engine Cylinders'].fillna(mode)

Now, calculate the median value of "Engine Cylinders" once again.

In [13]:
new_median = statistics.median(New_Engine_Cylinders)

In [14]:
#check whether the median changed after filling the missing values
if new_median != median:
    print('Yes')
else:
    print('No')

No


As you can see, the median does not change after we replace the missing values in the "Engine Cylinders" column with the most frequent value.
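
Only 30 of the 11914 values in this column are missing, so filling them rarely moves the middle of the sorted data. The sketch below reproduces the effect on a hypothetical toy Series (the values are made up), using the pandas median and mode, which skip NaN by default:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: mode is 4, median is 6, one value is missing.
cylinders = pd.Series([4, 4, 4, 4, 6, 6, 6, 8, 8, 8, np.nan])

# Series.median() and Series.mode() ignore NaN by default.
median_before = cylinders.median()
most_frequent = cylinders.mode()[0]

# Fill the gap with the most frequent value and recompute the median.
filled = cylinders.fillna(most_frequent)
median_after = filled.median()

print(median_before, most_frequent, median_after)
```

With a larger share of missing values, the same fill could shift the median, so the comparison is worth running rather than assuming.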

Question 7

Select all the "Lotus" cars from the dataset.

In [15]:
df_lotus = df[df['Make']=='Lotus']

Select only columns "Engine HP", "Engine Cylinders".

In [16]:
df_lotus = df_lotus[['Engine HP','Engine Cylinders']]

Now drop all duplicated rows using drop_duplicates method (you should get a dataframe with 9 rows).

In [17]:
df_lotus = df_lotus.drop_duplicates()
df_lotus

Unnamed: 0,Engine HP,Engine Cylinders
3912,189.0,4.0
3913,218.0,4.0
3918,217.0,4.0
4216,350.0,8.0
4257,400.0,6.0
4259,276.0,6.0
4262,345.0,6.0
4292,257.0,4.0
4293,240.0,4.0


Get the underlying NumPy array. Let's call it X.

In [18]:
X = df_lotus.values
X

array([[189.,   4.],
       [218.,   4.],
       [217.,   4.],
       [350.,   8.],
       [400.,   6.],
       [276.,   6.],
       [345.,   6.],
       [257.,   4.],
       [240.,   4.]])

Compute matrix-matrix multiplication between the transpose of X and X. To get the transpose, use X.T. Let's call the result XTX.

In [19]:
def vector_vector_multiplication(u, v):
    assert u.shape[0] == v.shape[0]
    
    n = u.shape[0]
    
    result = 0.0

    for i in range(n):
        result = result + u[i] * v[i]
    
    return result

In [20]:
def matrix_vector_multiplication(U, v):
    assert U.shape[1] == v.shape[0]
    
    num_rows = U.shape[0]
    
    result = np.zeros(num_rows)
    
    for i in range(num_rows):
        result[i] = vector_vector_multiplication(U[i], v)
    
    return result

In [21]:
def matrix_matrix_multiplication(U, V):
    assert U.shape[1] == V.shape[0]
    
    num_rows = U.shape[0]
    num_cols = V.shape[1]
    
    result = np.zeros((num_rows,num_cols))
    
    for i in range(num_cols):
        vi = V[:, i]
        Uvi = matrix_vector_multiplication(U, vi)
        result[:, i] = Uvi
    
    return result
        

In [22]:
XTX = matrix_matrix_multiplication(X.T,X)
XTX

array([[7.31684e+05, 1.34100e+04],
       [1.34100e+04, 2.52000e+02]])
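
As a cross-check of the hand-written loops, the same product can be sketched with NumPy's built-in matrix multiplication operator. X is retyped here from the array printed above so that the block runs on its own; XTX_builtin is a name introduced only for this illustration:

```python
import numpy as np

# X copied from the output above: columns are Engine HP and Engine Cylinders.
X = np.array([[189., 4.], [218., 4.], [217., 4.],
              [350., 8.], [400., 6.], [276., 6.],
              [345., 6.], [257., 4.], [240., 4.]])

# The @ operator performs matrix-matrix multiplication on 2-D arrays.
XTX_builtin = X.T @ X
print(XTX_builtin)
```

The result is a symmetric 2x2 matrix, which is expected for any product of a matrix's transpose with the matrix itself.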

Invert XTX.

In [23]:
#Invert XTX
XTX_inv = np.linalg.inv(XTX)

In [24]:
XTX_inv

array([[ 5.53084235e-05, -2.94319825e-03],
       [-2.94319825e-03,  1.60588447e-01]])
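
One way to sanity-check an inverse is to multiply it back by the original matrix and compare with the identity. A minimal sketch, with XTX retyped from the output above and identity_check as an illustrative name:

```python
import numpy as np

# XTX copied from the earlier output.
XTX = np.array([[731684., 13410.],
                [13410., 252.]])

XTX_inv = np.linalg.inv(XTX)

# XTX @ XTX_inv should equal the 2x2 identity up to floating-point error.
identity_check = XTX @ XTX_inv
is_identity = np.allclose(identity_check, np.eye(2))
print(is_identity)
```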

Create an array y with values [1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800].

In [25]:
y = np.array([1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800])

Multiply the inverse of XTX with the transpose of X, and then multiply the result by y. Call the result w.

In [26]:
#We can reuse the matrix_matrix_multiplication function defined earlier to multiply the inverse of XTX by the transpose of X
XTX_inv_XT = matrix_matrix_multiplication(XTX_inv,X.T)

In [27]:
#Then multiply that result by y to get w
#(equivalent to XTX_inv_XT.dot(y))
w = matrix_vector_multiplication(XTX_inv_XT, y)

What's the value of the first element of w?

In [28]:
#we can use indexing to get the first element
w[0]

4.594944810094551
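
The steps above can also be sketched in a few lines with NumPy's linear algebra routines. X and y are retyped from the cells above; w_solve and w_lstsq are names introduced only for this illustration. Solving the linear system directly avoids forming the inverse explicitly, which is generally the more numerically stable choice:

```python
import numpy as np

# X and y copied from the cells above.
X = np.array([[189., 4.], [218., 4.], [217., 4.],
              [350., 8.], [400., 6.], [276., 6.],
              [345., 6.], [257., 4.], [240., 4.]])
y = np.array([1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800])

# Solve (X.T X) w = X.T y instead of computing inv(X.T X) first.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares solver on the original system.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_solve[0])
```

Both approaches should agree with the first element of w computed above, up to floating-point error.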

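This is the normal-equation solution for linear regression: w = (X.T X)^-1 X.T y.
Because X has two feature columns and no column of ones, the model has no intercept term:
w[0] is the weight for "Engine HP"
w[1] is the weight for "Engine Cylinders"

y ≈ w[0]*engine_hp + w[1]*engine_cylinders (linear regression equation)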