Import Libraries

In [44]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Note: root_mean_squared_error is only available in scikit-learn 1.4 and later
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, root_mean_squared_error

Import Data

In [6]:
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv"

In [7]:
df = pd.read_csv(url)

In [22]:
df.head(5)

,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [9]:
df.shape

(53940, 10)

In [10]:
df.columns

Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
       'z'],
      dtype='object')

In [11]:
df1 = df[["carat","price"]]

In [12]:
df1

,carat,price
0,0.23,326
1,0.21,326
2,0.23,327
3,0.29,334
4,0.31,335
...,...,...
53935,0.72,2757
53936,0.72,2757
53937,0.70,2757
53938,0.86,2757


In [33]:
X = df1["carat"].values.reshape(-1, 1)  # reshape to (n_samples, n_features)
y = df1["price"].values.reshape(-1, 1)

In machine learning and data processing, the terms n_samples and n_features refer to the dimensions of a dataset and how it’s organized:

    n_samples: The number of data points (or observations) in your dataset. Each sample represents an individual instance you want to use for training or predicting. For example, in a dataset of house prices, each row (house) would be a sample.

    n_features: The number of attributes (or input variables) used to describe each sample. In the house price example, features might include the number of bedrooms, square footage, and location.

Together, they form a 2D array where the shape is (n_samples, n_features). This is the standard input format for most machine learning models.
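
As a small illustration of the (n_samples, n_features) layout, the sketch below builds a toy array for the house price example. The numbers and the names houses and sqft are made up for illustration and are not part of the diamonds data.

```python
import numpy as np

# Hypothetical toy data: 4 houses (samples), 3 attributes each (features).
# Columns: bedrooms, square footage, age in years.
houses = np.array([
    [3, 1500, 10],
    [2,  900, 25],
    [4, 2200,  5],
    [3, 1300, 40],
])

# shape is always (n_samples, n_features)
n_samples, n_features = houses.shape
print(n_samples, n_features)  # 4 3

# Pulling out one column gives a 1D array with no feature axis,
# which is why a single feature usually has to be reshaped before fitting.
sqft = houses[:, 1]
print(sqft.shape)  # (4,)
```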

Why (-1, 1) and (1, -1) in Reshaping?

In NumPy, reshape rearranges the dimensions of an array without changing its data. The value -1 is a placeholder that tells NumPy to infer that dimension's size from the other specified dimension and the array's total number of elements. Here’s the significance of each:

    reshape(-1, 1): This reshapes the array to have one column and as many rows as necessary based on the original array’s size. It’s used to transform a 1D array (like [1, 2, 3, 4]) into a 2D column array ([[1], [2], [3], [4]]), which scikit-learn estimators expect for X even when there is only a single feature.
        -1 means "figure out the appropriate number of rows automatically."
        1 means "ensure we have one column."

    This is particularly useful for fitting models with a single feature.

    reshape(1, -1): This reshapes the array to have one row and as many columns as necessary based on the original array’s size. For example, np.array([1, 2, 3, 4]).reshape(1, -1) becomes [[1, 2, 3, 4]].
        1 means "set the number of rows to 1."
        -1 automatically calculates the number of columns based on the original array’s size.

This format (1, -1) is often used when you have a single sample and need to match the (n_samples, n_features) format for compatibility with functions that expect a 2D array.
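
The two reshape patterns above can be checked directly. The sketch below uses the [1, 2, 3, 4] array from the text, then fits a model on hypothetical toy data (price set to exactly 2000 times carat, purely for illustration) to show a single new sample being reshaped before predict.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

values = np.array([1, 2, 3, 4])

column = values.reshape(-1, 1)  # 4 rows inferred, 1 column
row = values.reshape(1, -1)     # 1 row, 4 columns inferred
print(column.shape)  # (4, 1)
print(row.shape)     # (1, 4)

# Hypothetical toy data: one feature (carat), price = 2000 * carat.
X_toy = np.array([0.2, 0.3, 0.5, 1.0]).reshape(-1, 1)
y_toy = 2000.0 * X_toy.ravel()
model = LinearRegression().fit(X_toy, y_toy)

# A single new observation must also be 2D: one sample, one feature.
new_sample = np.array([0.7]).reshape(1, -1)
prediction = model.predict(new_sample)
print(new_sample.shape)  # (1, 1)
print(prediction.shape)  # (1,)
```

With only one feature, reshape(1, -1) and reshape(-1, 1) give the same (1, 1) shape for a single value; the difference matters once a sample has several features.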

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [36]:
X_test.shape

(10788, 1)
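
The test set holds 20% of the 53940 rows, which is where 10788 comes from. A minimal sketch on hypothetical toy arrays (X_toy and y_toy are made up here) shows the same behaviour on a scale that is easy to inspect, and that a fixed random_state makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 1 feature, target = 2 * feature.
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10) * 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (8, 1) (2, 1)

# Features and targets are shuffled together, so rows stay paired.
print(np.array_equal(y_te, X_te.ravel() * 2))  # True

# The same random_state reproduces the same split.
_, X_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))  # True
```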

Model Building And Training

In [37]:
lr = LinearRegression()
lr.fit(X_train,y_train)

Prediction

In [41]:
y_predict = lr.predict(X_test)

In [46]:
print("mae: ",mean_absolute_error(y_test,y_predict))
print("mse: ",mean_squared_error(y_test,y_predict))
print("rmse: ",root_mean_squared_error(y_test,y_predict))
print("r2_score: ", r2_score(y_test,y_predict))


mae:  1009.504742060089
mse:  2401388.654479092
rmse:  1549.6414599768207
r2_score:  0.8489390686155808
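
To make the four metrics concrete, the sketch below computes them by hand with NumPy on a few made-up values (y_true and y_pred are hypothetical, not taken from the diamonds results) and compares them with the scikit-learn functions. RMSE is the square root of MSE, which matches the values printed above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted prices, for illustration only.
y_true = np.array([300.0, 500.0, 1000.0, 2000.0])
y_pred = np.array([350.0, 450.0, 1100.0, 1900.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae)  # 75.0
print(mse)  # 6250.0

# The manual results agree with scikit-learn.
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))  # True
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))   # True
print(np.isclose(r2, r2_score(y_true, y_pred)))              # True
```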
