### Course: Machine Learning

### Date: 26th January 2024

### Asignment: Lab1

## Question 1

In [2]:
import pandas as pd
import numpy as np
import torch as tc
import zipfile


Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


In [3]:
# Extract the contents of the 'archive.zip' file into the 'extracts' directory
with zipfile.ZipFile("archive.zip", "r") as z:
    z.extractall("extracts")

# Read the CSV file 'car_web_scraped_dataset.csv' into a pandas DataFrame
car_dataframe = pd.read_csv("extracts/car_web_scraped_dataset.csv")


In [4]:
car_dataframe.shape

(2840, 6)

In [5]:
car_dataframe.head(5)

Unnamed: 0,name,year,miles,color,condition,price
0,Kia Forte,2022,"41,406 miles","Gray exterior, Black interior","No accidents reported, 1 Owner","$15,988"
1,Chevrolet Silverado 1500,2021,"15,138 miles","White exterior, Black interior","1 accident reported, 1 Owner","$38,008"
2,Toyota RAV4,2022,"32,879 miles","Silver exterior, Unknown interior","No accidents reported, 1 Owner","$24,988"
3,Honda Civic,2020,"37,190 miles","Blue exterior, Black interior","No accidents reported, 1 Owner","$18,998"
4,Honda Civic,2020,"27,496 miles","Black exterior, Black interior","No accidents reported, 1 Owner","$19,498"


## Question 2

- This is suitable for regression because in regression, the goal is to understand the relationship between one or more independent variables and the dependent(target) variable which in this case could be the price which is also a continuous numeric outcome. The given dataset includes a numeric target variable, "price," and offers the opportunity to explore and quantify the relationships between various independent variables such as "year," "miles," "color," and "condition" in predicting the car prices.

## Question 3

<ol type="a">
  <li>
  <h3>Approach Explanation</h3>
    To determine the appropriate price category for each car in the dataset, I opted for a quartile-based approach. By calculating the first, second, and third quartiles of the 'price' column, I identified distinct price ranges. Cars with prices falling below the first quartile are categorized as 'cheap,' those between the first and second quartiles as 'average,' those between the second and third quartiles as 'expensive,' and those exceeding the third quartile as 'very expensive.' This strategy is designed to offer a meaningful classification, taking into account the statistical distribution of car prices in the dataset.
  </li>

 
 </br>
  <li>
  <h3>Approach Implementation</h3>
  
  </li>
</ol>

In [6]:
car_dataframe['price'].describe()

count        2840
unique       1245
top       $19,998
freq           51
Name: price, dtype: object

In [7]:
# Convert the 'price' column to numeric after removing any '$' and ',' characters
car_dataframe['price'] = pd.to_numeric(car_dataframe['price'].replace('[\$,]','', regex=True), errors='coerce')

# Calculate the quartiles of the 'price' column
quantiles = car_dataframe['price'].quantile(q=[0.25, 0.5, 0.75, 1])
quantiles

0.25     17851.0
0.50     23000.0
0.75     31222.5
1.00    252900.0
Name: price, dtype: float64

In [8]:
quantiles=quantiles.to_numpy()

In [9]:
def group_price(price):
    """
    Group the price into categories based on quantiles.
    Args:
    price (float): The price to be categorized.
    Returns:
    str: The category of the price.
    """
    if(price <= quantiles[0]):
        return 'cheap'
    elif(quantiles[0] < price <= quantiles[1]):
        return 'average'
    elif(quantiles[1] < price <= quantiles[2]):
        return 'expensive'
    elif(price > quantiles[2]):
        return 'very expensive'

In [10]:
car_dataframe['price_category'] = car_dataframe['price'].apply(group_price)

In [11]:
car_dataframe.shape

(2840, 7)

In [12]:
car_dataframe.head()

Unnamed: 0,name,year,miles,color,condition,price,price_category
0,Kia Forte,2022,"41,406 miles","Gray exterior, Black interior","No accidents reported, 1 Owner",15988,cheap
1,Chevrolet Silverado 1500,2021,"15,138 miles","White exterior, Black interior","1 accident reported, 1 Owner",38008,very expensive
2,Toyota RAV4,2022,"32,879 miles","Silver exterior, Unknown interior","No accidents reported, 1 Owner",24988,expensive
3,Honda Civic,2020,"37,190 miles","Blue exterior, Black interior","No accidents reported, 1 Owner",18998,average
4,Honda Civic,2020,"27,496 miles","Black exterior, Black interior","No accidents reported, 1 Owner",19498,average


## Question 4

*All the categorical features were preprocessed into numerical values using one-hot encoding*
After the target variable **price** was being predicted by the linear regression model using the other features in the dataset together with the encoded features.

In [13]:
car_dataframe.dtypes

name              object
year               int64
miles             object
color             object
condition         object
price              int64
price_category    object
dtype: object

In [14]:
car_dataframe.dropna(inplace=True)

In [15]:

# One-hot encoding categorical columns to numerical values
car_dataframe = pd.get_dummies(car_dataframe, columns=['condition', 'color', 'price_category', 'name'], drop_first=True)

# Converting 'miles' column to numeric, removing non-numeric characters
car_dataframe['miles'] = pd.to_numeric(car_dataframe['miles'].replace('[\D,]','', regex=True ), errors='coerce')
car_dataframe.head()

Unnamed: 0,year,miles,price,"condition_1 accident reported, 2 Owners","condition_1 accident reported, 3 Owners","condition_1 accident reported, 4 Owners","condition_1 accident reported, 5 Owners","condition_2 accidents reported, 1 Owner","condition_2 accidents reported, 2 Owners","condition_2 accidents reported, 3 Owners",...,name_Volkswagen Routan,name_Volkswagen Taos,name_Volkswagen Tiguan,name_Volvo S60,name_Volvo S80,name_Volvo S90,name_Volvo V60,name_Volvo XC40,name_Volvo XC60,name_Volvo XC90
0,2022,41406,15988,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,2021,15138,38008,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,2022,32879,24988,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,2020,37190,18998,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,2020,27496,19498,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [16]:
# Converting the car_dataframe to float32 data type
car_dataframe = car_dataframe.astype('float32')
car_dataframe.head()

Unnamed: 0,year,miles,price,"condition_1 accident reported, 2 Owners","condition_1 accident reported, 3 Owners","condition_1 accident reported, 4 Owners","condition_1 accident reported, 5 Owners","condition_2 accidents reported, 1 Owner","condition_2 accidents reported, 2 Owners","condition_2 accidents reported, 3 Owners",...,name_Volkswagen Routan,name_Volkswagen Taos,name_Volkswagen Tiguan,name_Volvo S60,name_Volvo S80,name_Volvo S90,name_Volvo V60,name_Volvo XC40,name_Volvo XC60,name_Volvo XC90
0,2022.0,41406.0,15988.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2021.0,15138.0,38008.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2022.0,32879.0,24988.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2020.0,37190.0,18998.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2020.0,27496.0,19498.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
car_dataframe.dtypes

year                                       float32
miles                                      float32
price                                      float32
condition_1 accident reported, 2 Owners    float32
condition_1 accident reported, 3 Owners    float32
                                            ...   
name_Volvo S90                             float32
name_Volvo V60                             float32
name_Volvo XC40                            float32
name_Volvo XC60                            float32
name_Volvo XC90                            float32
Length: 401, dtype: object

In [18]:
# Converting DataFrame columns to PyTorch tensors and  drop the price column as the labels
inputs = tc.tensor(car_dataframe.drop('price', axis=1).values, dtype=tc.float32)
outputs = tc.tensor(car_dataframe['price'].values, dtype=tc.float32)

<ol type='a'>
    <li>The input and the target tensors are</li>

<ol>

In [19]:
print(inputs)

tensor([[ 2022., 41406.,     0.,  ...,     0.,     0.,     0.],
        [ 2021., 15138.,     0.,  ...,     0.,     0.,     0.],
        [ 2022., 32879.,     0.,  ...,     0.,     0.,     0.],
        ...,
        [ 2022., 27894.,     0.,  ...,     0.,     0.,     0.],
        [ 2021., 50220.,     0.,  ...,     0.,     0.,     0.],
        [ 2021., 26510.,     0.,  ...,     0.,     0.,     0.]])


In [20]:
print(outputs)

tensor([15988., 38008., 24988.,  ..., 29999., 22992., 24135.])


b. 

In [21]:
def generate_random_params(num_params):
    """
    Generate random parameters with the specified number of parameters.
    Args:
    num_params (int): The number of parameters to generate.
    Returns:
    torch.Tensor: Randomly generated parameters with the specified number of parameters.
    """
    weights = tc.rand((num_params, 1), requires_grad=True)
    return weights

In [22]:
input_size = inputs.shape
input_size

torch.Size([2840, 400])

In [23]:
num_params = inputs.shape[1]
random_params = generate_random_params(num_params)
print("Random parameters =  ", random_params)

Random parameters =   tensor([[0.8194],
        [0.1683],
        [0.2737],
        [0.6589],
        [0.3180],
        [0.5666],
        [0.7804],
        [0.3328],
        [0.2861],
        [0.2793],
        [0.1411],
        [0.5701],
        [0.9617],
        [0.8884],
        [0.6807],
        [0.5993],
        [0.4619],
        [0.8197],
        [0.4197],
        [0.7341],
        [0.6499],
        [0.9669],
        [0.4143],
        [0.1991],
        [0.4446],
        [0.9518],
        [0.6048],
        [0.5729],
        [0.8002],
        [0.9364],
        [0.9354],
        [0.3648],
        [0.8275],
        [0.8056],
        [0.0149],
        [0.7933],
        [0.3851],
        [0.3229],
        [0.3657],
        [0.8205],
        [0.2407],
        [0.0021],
        [0.7242],
        [0.3711],
        [0.0204],
        [0.4643],
        [0.7268],
        [0.7763],
        [0.8355],
        [0.1570],
        [0.3196],
        [0.4223],
        [0.9458],
        [0.5639],
      

c.

In [24]:
def linear_regression(inputs, weights, bias):
    """
    Performs linear regression on the given inputs using the provided weights and bias.

    Args:
    inputs (tensor): The input tensor for the regression.
    weights (tensor): The weights tensor for the regression.
    bias (tensor): The bias tensor for the regression.

    Returns:
    tensor: The result of the linear regression.
    """
    return tc.matmul(inputs, weights) + bias

In [25]:
def mean_squared_error(outputs, labels):
    return tc.mean((outputs - labels)**2)

In [26]:
predicitons = linear_regression(inputs, random_params, 0)
pd.DataFrame({'predictions': predicitons.view(-1).detach().numpy(), 'labels': outputs.view(-1).detach().numpy()})

Unnamed: 0,predictions,labels
0,8628.613281,15988.0
1,4206.037598,38008.0
2,7194.170410,24988.0
3,7917.666504,18998.0
4,6285.540039,19498.0
...,...,...
2835,18249.941406,8995.0
2836,21966.128906,9495.0
2837,6355.602539,29999.0
2838,10111.581055,22992.0


In [27]:
squared_error = mean_squared_error(predicitons, outputs)
print("Mean Squared Error =  ", squared_error.item())

Mean Squared Error =   468207840.0


d. 

In [28]:
def f(x):
    """Calculates the function f(x) = 2 * x^T * x"""
    return 2 * tc.matmul(x.t(), x)

(tensor([[4.0844e+07, 6.2293e+08, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        [6.2293e+08, 1.0328e+10, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        ...,
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00]]), tensor([[ 23687.4473, 183261.6406,  12664.8750,  ...,   9504.9131,
          11252.1377,  15617.0684],
        [183261.6406, 220186.3906, 194573.3281,  ..., 163498.6094,
         177472.5469, 248893.5469],
        [ 12664.8750, 194573.3281,      0.0000,  ...,      0.0000,
              0.0000,      0.0000],
        ...,
        [  9504.9131, 163498.6094,      0.0000,  ...,      0.0000,
             

In [29]:


def getGMatrix(input):
    G = []
    for i in range(5):
        x = input[i]
        x.requires_grad = True
        y = f(x)
        y.backward()
        print(x.grad == 4 * x)
        print(x.grad)
        G.append(x.grad)

    return G

In [30]:
getGMatrix(inputs[:5, :])

tensor([True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, Tr

[tensor([8.0880e+03, 1.6562e+05, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 4.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 4.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e