# Lab: Generative AI for Model Development

## Learning objectives

In this lab, you will learn how to use generative AI to create Python code that can:

1. Use linear regression in one variable to fit the parameters of a model
2. Use linear regression in multiple variables to fit the parameters of a model
3. Use polynomial regression in a single variable to fit the parameters of a model
4. Create a pipeline that scales multiple features, expands them into polynomial features, and performs linear regression
5. Use grid search with cross-validation and ridge regression to create a model with optimum hyperparameters

## Write Python code that can perform the following tasks
Read the CSV file, located on a given file path, into a pandas data frame, assuming that the first row of the file can be used as the headers for the data.
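
As a minimal sketch of what this prompt asks for, the snippet below reads a small in-memory CSV with `header=0`, which tells `pandas.read_csv` to take the column names from the first row. The sample rows are made up and only stand in for a file on disk. `header=0` is also what pandas infers when no column names are passed, so stating it is a matter of clarity rather than necessity.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the CSV file (assumption for illustration)
csv_text = "Manufacturer,RAM_GB,Price\nAcer,8,978\nDell,4,634\n"

# header=0 uses the first row of the file as the column names
sample_df = pd.read_csv(io.StringIO(csv_text), header=0)

print(sample_df.columns.tolist())
print(sample_df.shape)
```

The same call accepts a local file path or a URL in place of the `StringIO` object.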

In [191]:
import pandas as pd

In [192]:
URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_mod1.csv"

In [193]:
# header=0 uses the first row of the file as the column headers
df = pd.read_csv(URL, header=0)

In [194]:
df

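| | Unnamed: 0 | Manufacturer | Category | Screen | GPU | OS | CPU_core | Screen_Size_cm | CPU_frequency | RAM_GB | Storage_GB_SSD | Weight_kg | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Acer | 4 | IPS Panel | 2 | 1 | 5 | 35.560 | 1.6 | 8 | 256 | 1.60 | 978 |
| 1 | 1 | Dell | 3 | Full HD | 1 | 1 | 3 | 39.624 | 2.0 | 4 | 256 | 2.20 | 634 |
| 2 | 2 | Dell | 3 | Full HD | 1 | 1 | 7 | 39.624 | 2.7 | 8 | 256 | 2.20 | 946 |
| 3 | 3 | Dell | 4 | IPS Panel | 2 | 1 | 5 | 33.782 | 1.6 | 8 | 128 | 1.22 | 1244 |
| 4 | 4 | HP | 4 | Full HD | 2 | 1 | 7 | 39.624 | 1.8 | 8 | 256 | 1.91 | 837 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 233 | 233 | Lenovo | 4 | IPS Panel | 2 | 1 | 7 | 35.560 | 2.6 | 8 | 256 | 1.70 | 1891 |
| 234 | 234 | Toshiba | 3 | Full HD | 2 | 1 | 5 | 33.782 | 2.4 | 8 | 256 | 1.20 | 1950 |
| 235 | 235 | Lenovo | 4 | IPS Panel | 2 | 1 | 5 | 30.480 | 2.6 | 8 | 256 | 1.36 | 2236 |
| 236 | 236 | Lenovo | 3 | Full HD | 3 | 1 | 5 | 39.624 | 2.5 | 6 | 256 | 2.40 | 883 |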


In [195]:
columns_with_missing_values=df.isnull().any()
columns_with_missing_values

Unnamed: 0        False
Manufacturer      False
Category          False
Screen            False
GPU               False
OS                False
CPU_core          False
Screen_Size_cm     True
CPU_frequency     False
RAM_GB            False
Storage_GB_SSD    False
Weight_kg          True
Price             False
dtype: bool

In [196]:
df["Weight_kg"] = df["Weight_kg"].fillna(df["Weight_kg"].mean())

In [197]:
df

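| | Unnamed: 0 | Manufacturer | Category | Screen | GPU | OS | CPU_core | Screen_Size_cm | CPU_frequency | RAM_GB | Storage_GB_SSD | Weight_kg | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Acer | 4 | IPS Panel | 2 | 1 | 5 | 35.560 | 1.6 | 8 | 256 | 1.60 | 978 |
| 1 | 1 | Dell | 3 | Full HD | 1 | 1 | 3 | 39.624 | 2.0 | 4 | 256 | 2.20 | 634 |
| 2 | 2 | Dell | 3 | Full HD | 1 | 1 | 7 | 39.624 | 2.7 | 8 | 256 | 2.20 | 946 |
| 3 | 3 | Dell | 4 | IPS Panel | 2 | 1 | 5 | 33.782 | 1.6 | 8 | 128 | 1.22 | 1244 |
| 4 | 4 | HP | 4 | Full HD | 2 | 1 | 7 | 39.624 | 1.8 | 8 | 256 | 1.91 | 837 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 233 | 233 | Lenovo | 4 | IPS Panel | 2 | 1 | 7 | 35.560 | 2.6 | 8 | 256 | 1.70 | 1891 |
| 234 | 234 | Toshiba | 3 | Full HD | 2 | 1 | 5 | 33.782 | 2.4 | 8 | 256 | 1.20 | 1950 |
| 235 | 235 | Lenovo | 4 | IPS Panel | 2 | 1 | 5 | 30.480 | 2.6 | 8 | 256 | 1.36 | 2236 |
| 236 | 236 | Lenovo | 3 | Full HD | 3 | 1 | 5 | 39.624 | 2.5 | 6 | 256 | 2.40 | 883 |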


In [198]:
columns_with_missing_values=df.isnull().any()
columns_with_missing_values

Unnamed: 0        False
Manufacturer      False
Category          False
Screen            False
GPU               False
OS                False
CPU_core          False
Screen_Size_cm     True
CPU_frequency     False
RAM_GB            False
Storage_GB_SSD    False
Weight_kg         False
Price             False
dtype: bool

In [199]:
df["Screen_Size_cm"] = df["Screen_Size_cm"].fillna(df["Screen_Size_cm"].mean())

In [200]:
columns_with_missing_values=df.isnull().any()
columns_with_missing_values

Unnamed: 0        False
Manufacturer      False
Category          False
Screen            False
GPU               False
OS                False
CPU_core          False
Screen_Size_cm    False
CPU_frequency     False
RAM_GB            False
Storage_GB_SSD    False
Weight_kg         False
Price             False
dtype: bool
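
The mean imputation applied to `Weight_kg` and `Screen_Size_cm` above can be sketched on a toy column. The numbers are made up and are not taken from the laptop data set. The point of the example is that `Series.mean()` skips missing values by default, so the replacement value is the mean of the entries that are present.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (made-up numbers for illustration)
toy = pd.DataFrame({"Weight_kg": [1.0, 2.0, np.nan, 3.0]})

# Series.mean() ignores NaN by default, so this is the mean of 1.0, 2.0 and 3.0
mean_weight = toy["Weight_kg"].mean()

# Replace the missing entry with that mean
toy["Weight_kg"] = toy["Weight_kg"].fillna(mean_weight)

print(toy["Weight_kg"].tolist())
```

Assigning the result back to the column, as the lab cells do, avoids relying on in-place modification.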

In [201]:
df["Screen_Size_cm"]=df["Screen_Size_cm"].astype(float) 

In [202]:
df["Weight_kg"] = df["Weight_kg"].astype(float)

In [203]:
df.dtypes

Unnamed: 0          int64
Manufacturer       object
Category            int64
Screen             object
GPU                 int64
OS                  int64
CPU_core            int64
Screen_Size_cm    float64
CPU_frequency     float64
RAM_GB              int64
Storage_GB_SSD      int64
Weight_kg         float64
Price               int64
dtype: object

In [204]:
df['Screen_Size_inch'] = df['Screen_Size_cm'] / 2.54

In [205]:
df['Screen_Size_inch']

0      14.0
1      15.6
2      15.6
3      13.3
4      15.6
       ... 
233    14.0
234    13.3
235    12.0
236    15.6
237    14.0
Name: Screen_Size_inch, Length: 238, dtype: float64

In [206]:
df[['Weight_pounds']] = df[['Weight_kg']] * 2.20462

In [207]:
df[['Weight_pounds']]

| | Weight_pounds |
|---|---|
| 0 | 3.527392 |
| 1 | 4.850164 |
| 2 | 4.850164 |
| 3 | 2.689636 |
| 4 | 4.210824 |
| ... | ... |
| 233 | 3.747854 |
| 234 | 2.645544 |
| 235 | 2.998283 |
| 236 | 5.291088 |


In [208]:
df['CPU_frequency'] = df['CPU_frequency'] / df['CPU_frequency'].max()

In [209]:
df

| | Unnamed: 0 | Manufacturer | Category | Screen | GPU | OS | CPU_core | Screen_Size_cm | CPU_frequency | RAM_GB | Storage_GB_SSD | Weight_kg | Price | Screen_Size_inch | Weight_pounds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Acer | 4 | IPS Panel | 2 | 1 | 5 | 35.560 | 0.551724 | 8 | 256 | 1.60 | 978 | 14.0 | 3.527392 |
| 1 | 1 | Dell | 3 | Full HD | 1 | 1 | 3 | 39.624 | 0.689655 | 4 | 256 | 2.20 | 634 | 15.6 | 4.850164 |
| 2 | 2 | Dell | 3 | Full HD | 1 | 1 | 7 | 39.624 | 0.931034 | 8 | 256 | 2.20 | 946 | 15.6 | 4.850164 |
| 3 | 3 | Dell | 4 | IPS Panel | 2 | 1 | 5 | 33.782 | 0.551724 | 8 | 128 | 1.22 | 1244 | 13.3 | 2.689636 |
| 4 | 4 | HP | 4 | Full HD | 2 | 1 | 7 | 39.624 | 0.620690 | 8 | 256 | 1.91 | 837 | 15.6 | 4.210824 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 233 | 233 | Lenovo | 4 | IPS Panel | 2 | 1 | 7 | 35.560 | 0.896552 | 8 | 256 | 1.70 | 1891 | 14.0 | 3.747854 |
| 234 | 234 | Toshiba | 3 | Full HD | 2 | 1 | 5 | 33.782 | 0.827586 | 8 | 256 | 1.20 | 1950 | 13.3 | 2.645544 |
| 235 | 235 | Lenovo | 4 | IPS Panel | 2 | 1 | 5 | 30.480 | 0.896552 | 8 | 256 | 1.36 | 2236 | 12.0 | 2.998283 |
| 236 | 236 | Lenovo | 3 | Full HD | 3 | 1 | 5 | 39.624 | 0.862069 | 6 | 256 | 2.40 | 883 | 15.6 | 5.291088 |


In [210]:
# Convert the 'Screen' attribute into indicator variables
df1 = pd.get_dummies(df['Screen'], prefix='Screen')
# Append df1 into the original data frame df
df = pd.concat([df, df1], axis=1)
# Drop the original 'Screen' attribute from the data frame
df.drop('Screen', axis=1, inplace=True)


In [211]:
df

| | Unnamed: 0 | Manufacturer | Category | GPU | OS | CPU_core | Screen_Size_cm | CPU_frequency | RAM_GB | Storage_GB_SSD | Weight_kg | Price | Screen_Size_inch | Weight_pounds | Screen_Full HD | Screen_IPS Panel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Acer | 4 | 2 | 1 | 5 | 35.560 | 0.551724 | 8 | 256 | 1.60 | 978 | 14.0 | 3.527392 | False | True |
| 1 | 1 | Dell | 3 | 1 | 1 | 3 | 39.624 | 0.689655 | 4 | 256 | 2.20 | 634 | 15.6 | 4.850164 | True | False |
| 2 | 2 | Dell | 3 | 1 | 1 | 7 | 39.624 | 0.931034 | 8 | 256 | 2.20 | 946 | 15.6 | 4.850164 | True | False |
| 3 | 3 | Dell | 4 | 2 | 1 | 5 | 33.782 | 0.551724 | 8 | 128 | 1.22 | 1244 | 13.3 | 2.689636 | False | True |
| 4 | 4 | HP | 4 | 2 | 1 | 7 | 39.624 | 0.620690 | 8 | 256 | 1.91 | 837 | 15.6 | 4.210824 | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 233 | 233 | Lenovo | 4 | 2 | 1 | 7 | 35.560 | 0.896552 | 8 | 256 | 1.70 | 1891 | 14.0 | 3.747854 | False | True |
| 234 | 234 | Toshiba | 3 | 2 | 1 | 5 | 33.782 | 0.827586 | 8 | 256 | 1.20 | 1950 | 13.3 | 2.645544 | True | False |
| 235 | 235 | Lenovo | 4 | 2 | 1 | 5 | 30.480 | 0.896552 | 8 | 256 | 1.36 | 2236 | 12.0 | 2.998283 | False | True |
| 236 | 236 | Lenovo | 3 | 3 | 1 | 5 | 39.624 | 0.862069 | 6 | 256 | 2.40 | 883 | 15.6 | 5.291088 | True | False |
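
The three-step encoding above (create the indicator columns, concatenate, drop the original) can also be written as one call. The sketch below uses a made-up three-row frame. Passing `columns=` makes `pd.get_dummies` encode and drop the named column in one step, and `dtype=int` is an optional choice that yields 0/1 values instead of the booleans shown in the output above.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "Screen": ["IPS Panel", "Full HD", "Full HD"],
    "Price": [978, 634, 946],
})

# Encode 'Screen' as indicator columns and remove the original column in one call
encoded = pd.get_dummies(toy, columns=["Screen"], prefix="Screen", dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Whether 0/1 integers or booleans are preferable depends on what consumes the frame next; scikit-learn estimators accept both.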


In [212]:
df['Price_EUR'] = df['Price'] * 0.88

In [213]:
df

| | Unnamed: 0 | Manufacturer | Category | GPU | OS | CPU_core | Screen_Size_cm | CPU_frequency | RAM_GB | Storage_GB_SSD | Weight_kg | Price | Screen_Size_inch | Weight_pounds | Screen_Full HD | Screen_IPS Panel | Price_EUR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Acer | 4 | 2 | 1 | 5 | 35.560 | 0.551724 | 8 | 256 | 1.60 | 978 | 14.0 | 3.527392 | False | True | 860.64 |
| 1 | 1 | Dell | 3 | 1 | 1 | 3 | 39.624 | 0.689655 | 4 | 256 | 2.20 | 634 | 15.6 | 4.850164 | True | False | 557.92 |
| 2 | 2 | Dell | 3 | 1 | 1 | 7 | 39.624 | 0.931034 | 8 | 256 | 2.20 | 946 | 15.6 | 4.850164 | True | False | 832.48 |
| 3 | 3 | Dell | 4 | 2 | 1 | 5 | 33.782 | 0.551724 | 8 | 128 | 1.22 | 1244 | 13.3 | 2.689636 | False | True | 1094.72 |
| 4 | 4 | HP | 4 | 2 | 1 | 7 | 39.624 | 0.620690 | 8 | 256 | 1.91 | 837 | 15.6 | 4.210824 | True | False | 736.56 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 233 | 233 | Lenovo | 4 | 2 | 1 | 7 | 35.560 | 0.896552 | 8 | 256 | 1.70 | 1891 | 14.0 | 3.747854 | False | True | 1664.08 |
| 234 | 234 | Toshiba | 3 | 2 | 1 | 5 | 33.782 | 0.827586 | 8 | 256 | 1.20 | 1950 | 13.3 | 2.645544 | True | False | 1716.00 |
| 235 | 235 | Lenovo | 4 | 2 | 1 | 5 | 30.480 | 0.896552 | 8 | 256 | 1.36 | 2236 | 12.0 | 2.998283 | False | True | 1967.68 |
| 236 | 236 | Lenovo | 3 | 3 | 1 | 5 | 39.624 | 0.862069 | 6 | 256 | 2.40 | 883 | 15.6 | 5.291088 | True | False | 777.04 |


In [214]:
min_val = df['CPU_frequency'].min()  


In [215]:
max_val = df['CPU_frequency'].max()

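In [216]:
# Min-max normalization; rename the result so it does not duplicate the existing 'CPU_frequency' column name
normalized_frequency = ((df['CPU_frequency'] - min_val) / (max_val - min_val)).rename('CPU_frequency_min_max')

In [217]:
df = pd.concat([df, normalized_frequency], axis=1)

In [218]:
df

| | Unnamed: 0 | Manufacturer | Category | GPU | OS | CPU_core | Screen_Size_cm | CPU_frequency | RAM_GB | Storage_GB_SSD | Weight_kg | Price | Screen_Size_inch | Weight_pounds | Screen_Full HD | Screen_IPS Panel | Price_EUR | CPU_frequency_min_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Acer | 4 | 2 | 1 | 5 | 35.560 | 0.551724 | 8 | 256 | 1.60 | 978 | 14.0 | 3.527392 | False | True | 860.64 | 0.235294 |
| 1 | 1 | Dell | 3 | 1 | 1 | 3 | 39.624 | 0.689655 | 4 | 256 | 2.20 | 634 | 15.6 | 4.850164 | True | False | 557.92 | 0.470588 |
| 2 | 2 | Dell | 3 | 1 | 1 | 7 | 39.624 | 0.931034 | 8 | 256 | 2.20 | 946 | 15.6 | 4.850164 | True | False | 832.48 | 0.882353 |
| 3 | 3 | Dell | 4 | 2 | 1 | 5 | 33.782 | 0.551724 | 8 | 128 | 1.22 | 1244 | 13.3 | 2.689636 | False | True | 1094.72 | 0.235294 |
| 4 | 4 | HP | 4 | 2 | 1 | 7 | 39.624 | 0.620690 | 8 | 256 | 1.91 | 837 | 15.6 | 4.210824 | True | False | 736.56 | 0.352941 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 233 | 233 | Lenovo | 4 | 2 | 1 | 7 | 35.560 | 0.896552 | 8 | 256 | 1.70 | 1891 | 14.0 | 3.747854 | False | True | 1664.08 | 0.823529 |
| 234 | 234 | Toshiba | 3 | 2 | 1 | 5 | 33.782 | 0.827586 | 8 | 256 | 1.20 | 1950 | 13.3 | 2.645544 | True | False | 1716.00 | 0.705882 |
| 235 | 235 | Lenovo | 4 | 2 | 1 | 5 | 30.480 | 0.896552 | 8 | 256 | 1.36 | 2236 | 12.0 | 2.998283 | False | True | 1967.68 | 0.823529 |
| 236 | 236 | Lenovo | 3 | 3 | 1 | 5 | 39.624 | 0.862069 | 6 | 256 | 2.40 | 883 | 15.6 | 5.291088 | True | False | 777.04 | 0.764706 |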
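
The two scalings applied to `CPU_frequency` above (division by the maximum, then min-max normalization) can be compared on a toy series. The values are made up, and the name given to the min-max result is only an illustrative choice. The sketch also checks that min-max normalization gives the same result whether or not the values were divided by the maximum first, since dividing by a positive constant cancels out of the formula.

```python
import pandas as pd

# Made-up clock speeds in GHz (assumption for illustration)
freq = pd.Series([1.6, 2.0, 2.9], name="CPU_frequency")

# Simple feature scaling: divide by the maximum, so the largest value becomes 1
scaled = freq / freq.max()

# Min-max scaling: subtract the minimum and divide by the range, so values span [0, 1]
min_max = (freq - freq.min()) / (freq.max() - freq.min())
# Give the result its own name so it can sit next to the original column
min_max = min_max.rename("CPU_frequency_min_max")

# Applying min-max scaling to the already scaled values gives the same numbers
rescaled = (scaled - scaled.min()) / (scaled.max() - scaled.min())

print(pd.concat([freq, min_max], axis=1))
```

Naming the new series before concatenating keeps every column label in the frame unique, which matters because selecting a duplicated label returns all columns that carry it.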


## Simple linear regression

Write Python code that performs the following tasks.
1. Develop and train a linear regression model that uses one attribute of a data frame as the source variable and another as the target variable.
2. Calculate and display the MSE and R^2 values for the trained model.
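
To make the two metrics concrete, here is a minimal sketch that computes them by hand in plain Python, using the standard definitions MSE = mean of squared residuals and R^2 = 1 - SS_res / SS_tot. The helper name `mse_and_r2` and the numbers are hypothetical. The third case shows that R^2 becomes negative when the predictions are worse than always predicting the mean of the true values, which can happen on a held-out test set.

```python
def mse_and_r2(y_true, y_pred):
    """Return (MSE, R^2) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared errors of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared errors of a constant "predict the mean" model
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return ss_res / n, 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]  # made-up target values

perfect = mse_and_r2(y_true, [1.0, 2.0, 3.0])   # exact predictions
baseline = mse_and_r2(y_true, [2.0, 2.0, 2.0])  # always predicts the mean
worse = mse_and_r2(y_true, [3.0, 2.0, 1.0])     # worse than the mean

print(perfect, baseline, worse)
```

`mean_squared_error` and `r2_score` from `sklearn.metrics` follow the same definitions, so they can be read with this picture in mind.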

In [219]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# Use the data frame 'df' prepared above: 'CPU_frequency' is the source variable and 'Price' is the target variable

# Extract the source variable and target variable from the data frame
X = df[['CPU_frequency']]
y = df['Price']
# Hold out 20% of the rows as a test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize a linear regression model
model = LinearRegression()


In [220]:
# Train the model using the source and target variables
model.fit(X_train, y_train)


In [221]:
# Make predictions using the trained model with the test data set
y_pred = model.predict(X_test)


In [222]:
# Calculate the mean squared error (MSE)
mse = mean_squared_error(y_test, y_pred)


In [223]:
# Calculate the coefficient of determination (R^2)
r2 = r2_score(y_test, y_pred)


In [224]:
# Display the MSE and R^2 values
print("Mean Squared Error (MSE):", mse)
print("Coefficient of Determination (R^2):", r2)

# Additional details:
# - The 'LinearRegression' class from the 'sklearn.linear_model' module is used to create a linear regression model.
# - The 'fit()' method is used to train the model using the source and target variables.
# - The 'predict()' method is used to make predictions using the trained model.
# - The 'mean_squared_error()' function from the 'sklearn.metrics' module is used to calculate the MSE.
# - The 'r2_score()' function from the 'sklearn.metrics' module is used to calculate the R^2 value.

Mean Squared Error (MSE): 239035.9942943603
Coefficient of Determination (R^2): -0.03719417833496452


## Multiple linear regression

Write Python code that performs the following tasks.
1. Develop and train a linear regression model that uses some attributes of a data frame as the source variables and one of the attributes as the target variable.
2. Calculate and display the MSE and R^2 values for the trained model.
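
As a small, hedged illustration of what "fitting the parameters" means with several source variables, the sketch below trains `LinearRegression` on synthetic, noise-free data whose true intercept and coefficients are known, and then reads them back from `intercept_` and `coef_`. The data and the names `X_toy`, `y_toy`, and `toy_model` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 200 rows, 3 source variables (made up for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))

# Noise-free target: y = 5 + 2*x0 - 1*x1 + 0.5*x2
true_coef = np.array([2.0, -1.0, 0.5])
y_toy = 5.0 + X_toy @ true_coef

# Fit the model; one coefficient is learned per source variable, plus an intercept
toy_model = LinearRegression().fit(X_toy, y_toy)

print("Intercept:", toy_model.intercept_)
print("Coefficients:", toy_model.coef_)
```

On real data the recovered coefficients are only estimates, and their sizes depend on the units of each source variable.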

In [225]:
# Extract the source variables and target variable from the data frame
X = df[['RAM_GB', 'CPU_core', 'GPU']]
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize a linear regression model
model = LinearRegression()



In [226]:
# Train the model using the source and target variables
model.fit(X_train, y_train)


In [227]:
# Make predictions using the trained model with the test data set
y_pred = model.predict(X_test)


In [228]:
# Calculate the mean squared error (MSE)
mse = mean_squared_error(y_test, y_pred)


In [229]:
# Calculate the coefficient of determination (R^2)
r2 = r2_score(y_test, y_pred)


In [230]:
# Display the MSE and R^2 values
print("Mean Squared Error (MSE):", mse)
print("Coefficient of Determination (R^2):", r2)


Mean Squared Error (MSE): 218015.97877279608
Coefficient of Determination (R^2): 0.0540131638556397


## Multiple Polynomial regression

Write Python code that performs the following tasks.
1. Develop and train multiple polynomial regression models, with orders 2, 3, 5, and 7, that use one attribute of a data frame as the source variable and another as the target variable.
2. Calculate and display the MSE and R^2 values for the trained models.
3. Compare the performance of the models.
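
One thing to keep in mind when comparing orders: if the source variable takes only a few distinct values, raising the order stops helping early. With k distinct values, a polynomial of order k - 1 can already match the mean target of each group, and higher orders cannot lower the training MSE any further. The sketch below demonstrates this on made-up data with three distinct source values; all names and numbers are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Made-up source variable with only three distinct values, two rows per value
x = np.array([1, 1, 2, 2, 3, 3], dtype=float).reshape(-1, 1)
# Made-up target; the group means are 11, 21 and 16
y = np.array([10.0, 12.0, 20.0, 22.0, 15.0, 17.0])

mse_by_degree = {}
for degree in [1, 2, 3]:
    # Expand x into the columns 1, x, x^2, ... up to the given degree
    x_poly = PolynomialFeatures(degree=degree).fit_transform(x)
    model = LinearRegression().fit(x_poly, y)
    # Training MSE: the model is scored on the data it was fitted on
    mse_by_degree[degree] = mean_squared_error(y, model.predict(x_poly))

for degree, mse in mse_by_degree.items():
    print(f"degree {degree}: training MSE = {mse:.4f}")
```

Here the order-2 and order-3 models reach the same training MSE, so any difference between them in practice would be rounding noise rather than a better fit.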

In [231]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

# Use the data frame 'df' prepared above: 'GPU' is the source variable and 'Price' is the target variable

# Extract the source variable and target variable from the data frame
X = df[['GPU']]
y = df['Price']



In [232]:
# Polynomial orders to compare
orders = [2, 3, 5, 7]

# Lists that store the MSE and R^2 values of every model (created once, before the loop)
mse_values = []
r2_values = []

# Loop through the polynomial orders
for order in orders:
    # Create polynomial features
    polynomial_features = PolynomialFeatures(degree=order)
    X_poly = polynomial_features.fit_transform(X)

    # Initialize a linear regression model and train it on the polynomial features
    model = LinearRegression()
    model.fit(X_poly, y)

    # Make predictions on the same data the model was trained on (no test split here)
    y_pred = model.predict(X_poly)

    # Calculate the mean squared error (MSE) and the coefficient of determination (R^2)
    mse = mean_squared_error(y, y_pred)
    r2 = r2_score(y, y_pred)

    # Append the MSE and R^2 values to the lists
    mse_values.append(mse)
    r2_values.append(r2)

    # Display the MSE and R^2 values for the current model
    print(f"Polynomial Order {order}:")
    print("Mean Squared Error (MSE):", mse)
    print("Coefficient of Determination (R^2):", r2)
    print()

Polynomial Order 2:
Mean Squared Error (MSE): 297676.6116812453
Coefficient of Determination (R^2): 0.09462094391880316

Polynomial Order 3:
Mean Squared Error (MSE): 297676.6116812454
Coefficient of Determination (R^2): 0.09462094391880294

Polynomial Order 5:
Mean Squared Error (MSE): 297676.6116812454
Coefficient of Determination (R^2): 0.09462094391880294

Polynomial Order 7:
Mean Squared Error (MSE): 297676.6116812454
Coefficient of Determination (R^2): 0.09462094391880294


In [233]:
# Compare the performance of the models: the lowest MSE is best, the highest MSE is worst
best_order = orders[np.argmin(mse_values)]
worst_order = orders[np.argmax(mse_values)]


In [234]:
print("Model Comparison:")
print(f"Best Polynomial Order: {best_order}")
print(f"Worst Polynomial Order: {worst_order}")


Model Comparison:
Best Polynomial Order: 2
Worst Polynomial Order: 3

The four models differ only in the last printed digits, which is floating-point rounding rather than a real difference in fit, so this ranking carries no practical meaning. GPU takes only the values 1, 2, and 3 in the rows shown earlier; if that holds for the whole column, a second-order polynomial can already match the mean price of each GPU group, and higher orders cannot lower the MSE further. These scores are also computed on the training data, so they are not directly comparable with the test-set scores of the other sections.


## Pipeline

Write Python code that performs the following tasks.
1. Create a pipeline that performs feature scaling, polynomial feature generation, and linear regression. Use the same set of multiple features as before to create this pipeline.
2. Calculate and display the MSE and R^2 values for the trained model.
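
The following is a minimal sketch of such a pipeline on synthetic data, using `Pipeline` with named steps so that each stage can be inspected afterwards. The data, the step names (`scale`, `poly`, `model`), and the variable names are assumptions made for this example. Calling `fit` on the pipeline fits the scaler and the polynomial expansion on the training rows only, and `predict` reuses those fitted transforms on the test rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a purely quadratic, noise-free target (made up for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 1.0 + X_toy[:, 0] ** 2 + 2.0 * X_toy[:, 1] * X_toy[:, 2]

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Scaling -> polynomial feature generation -> linear regression, applied in order
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])
pipe.fit(X_tr, y_tr)

y_hat = pipe.predict(X_te)
toy_mse = mean_squared_error(y_te, y_hat)
toy_r2 = r2_score(y_te, y_hat)

print("Generated features:", pipe.named_steps["poly"].n_output_features_)
print("MSE:", toy_mse, "R^2:", toy_r2)
```

`make_pipeline` builds the same kind of object with automatically generated step names; explicit names are mainly a convenience when parameters of a step need to be addressed later.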

In [243]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


In [246]:
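# Extract the source variables and target variable from the data frame
X = df[['RAM_GB', 'CPU_core', 'GPU']]
y = df['Price']
# Recreate the same 80/20 split that was used for multiple linear regression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)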


In [250]:
# Create a pipeline that performs parameter scaling, polynomial feature generation, and linear regression
pipeline = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2),
    LinearRegression()
)

In [252]:
#Train the model using the source and target variables
pipeline.fit(X_train, y_train)


In [253]:
# Make predictions using the trained model
y_pred = pipeline.predict(X_test)


In [255]:
# Calculate the mean squared error (MSE)
mse = mean_squared_error(y_test, y_pred)


In [256]:
# Calculate the coefficient of determination (R^2)
r2 = r2_score(y_test, y_pred)


In [257]:
# Display the MSE and R^2 values
print("Mean Squared Error (MSE):", mse)
print("Coefficient of Determination (R^2):", r2)


Mean Squared Error (MSE): 232298.3520095865
Coefficient of Determination (R^2): -0.00795906931257484


## Grid search and Ridge regression
Ridge regression can improve on plain linear regression, especially when the model uses polynomial features of multiple attributes, because its penalty term discourages large coefficients. The strength of that penalty is set by the hyperparameter alpha. A grid search determines the optimum value of alpha for the given set of features: it scores each candidate value with cross-validation and then refits a model with the best value.

You can use generative AI to create the Python code to perform a grid search for the optimum ridge regression model, which uses polynomial features generated from multiple attributes.
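
One possible variation, sketched here on synthetic data, wraps the polynomial expansion and the ridge model in a single pipeline and hands that pipeline to `GridSearchCV`. The data, the step names, and the variable names are assumptions for illustration. Inside a pipeline, a parameter is addressed as step name, double underscore, parameter name, hence `ridge__alpha`. Holding out a test set before the search gives a score on rows that played no part in choosing alpha.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic signal plus a little noise (made up for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 1.0 + X_toy[:, 0] ** 2 + 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("ridge", Ridge()),
])

# Candidate values for the ridge penalty strength
param_grid = {"ridge__alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10]}

# 5-fold cross-validation on the training rows; the best model is refit on all of them
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

best_alpha = search.best_params_["ridge__alpha"]
# With the default scoring, score() returns the R^2 of the refit best model
test_r2 = search.score(X_te, y_te)

print("Best alpha:", best_alpha)
print("Test R^2:", test_r2)
```

The per-candidate cross-validation scores are available in `search.cv_results_["mean_test_score"]`, which is useful for checking how sensitive the result is to alpha.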

In [258]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score


In [260]:
# Extract the source variables and target variable from the data frame
X = df[['RAM_GB', 'CPU_core', 'GPU']]
y = df['Price']

# Create polynomial features
polynomial_features = PolynomialFeatures(degree=2)

In [261]:
# Transform the source variables into polynomial features
X_poly = polynomial_features.fit_transform(X)


In [262]:
# Define the hyperparameter values for the grid search
param_grid = {'alpha': [0.0001,0.001,0.01, 0.1, 1, 10]}


In [264]:
# Initialize a ridge regression model
model = Ridge()


In [265]:
# Perform grid search with cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5)


In [268]:
# Train the model using the polynomial features and target variable
grid_search.fit(X_poly, y)


In [270]:
# Make predictions with the best model found by the grid search (on the same data it was trained on)
y_pred = grid_search.predict(X_poly)

# Calculate the mean squared error (MSE)
mse = mean_squared_error(y, y_pred)

# Calculate the coefficient of determination (R^2)
r2 = r2_score(y, y_pred)

# Display the MSE and R^2 values
print("Mean Squared Error (MSE):", mse)
print("Coefficient of Determination (R^2):", r2)


# Additional details:
# - The 'PolynomialFeatures' class from the 'sklearn.preprocessing' module is used to create polynomial features.
# - The 'GridSearchCV' class from the 'sklearn.model_selection' module is used to perform grid search with cross-validation.
# - The 'Ridge' class from the 'sklearn.linear_model' module is used for ridge regression.
# - The 'fit_transform()' method is used to transform the source variables into polynomial features.
# - The 'param_grid' parameter in the 'GridSearchCV' class specifies the hyperparameter values to search over.
# - The 'cv' parameter in the 'GridSearchCV' class specifies the number of folds for cross-validation.
# - The best model found by grid search can be accessed using the 'best_estimator_' attribute of the grid search object.


Mean Squared Error (MSE): 175434.61057360156
Coefficient of Determination (R^2): 0.4664182005162655


## Conclusion

With this, I have learned how to use generative AI to create Python code that can:

1. Implement linear regression in a single variable
2. Implement linear regression in multiple variables
3. Implement polynomial regression for different orders of a single variable
4. Create a pipeline that scales multiple features, generates polynomial features from them, and performs linear regression
5. Apply a grid search to create an optimum ridge regression model for multiple features