# Step 1: Load Data

**ChatGPT prompt**

Generate Python code to solve a regression problem using the following 5 steps:

Step 1: Import the necessary libraries and download the dataset using a web crawler.
Use the URL https://gist.githubusercontent.com/GaneshSparkz/b5662effbdae8746f7f7d8ed70c42b2d/raw/faf8b1a0d58e251f48a647d3881e7a960c3f0925/50_Startups.csv to fetch the dataset. Convert the CSV content to a pandas DataFrame and print a summary of the dataset.

Step 2: Load the dataset and explore its contents. Perform one-hot encoding on the "State" feature.

Step 3: Preprocess the data by separating features (X) and setting the target variable (y) to 'Profit'. Convert X and y to numpy arrays suitable for the sklearn LinearRegression model. Split the dataset into training and testing sets using a test size of 20%.

Step 4: Train Lasso Linear Regression models with different variables and calculate Mean Squared Error (MSE). List the names of variables used in each model along with the MSE values, and present the results in a table format.

Step 5: Make predictions on the test data, calculate the Mean Squared Error (MSE) using the predicted values, and print the MSE.

Replace placeholders such as 'train.csv', 'target_column_name', and any parameter values with appropriate values in the code.

Feel free to adjust hyperparameters, perform additional preprocessing, feature engineering, or explore different regression algorithms.


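#### Prompt: Use Python to download the 50_Startups dataset from the web, preprocess the data and encode the features, then split the dataset into training and test sets. Use a Lasso linear regression model to predict company profit (Profit), train models on different numbers of variables and compute the mean squared error (MSE), and finally output the MSE for each variable combination along with the final model's MSE.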

In [1]:
# Step 1: Import the necessary libraries and download the dataset using a web crawler.
import pandas as pd
import requests
from io import StringIO
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Define the URL of the dataset
dataset_url = "https://gist.githubusercontent.com/GaneshSparkz/b5662effbdae8746f7f7d8ed70c42b2d/raw/faf8b1a0d58e251f48a647d3881e7a960c3f0925/50_Startups.csv"

# Use requests to fetch the content of the dataset
response = requests.get(dataset_url)
response.raise_for_status()  # stop on an HTTP error instead of parsing an error page
data_csv = response.text

# Convert the CSV content to a pandas DataFrame
data = pd.read_csv(StringIO(data_csv))

# Explore the dataset
print(data.head())
print(data.info())
print(data.describe())

# Step 2: Load the dataset and explore its contents. Perform one-hot encoding on the "State" feature.
state_encoded = pd.get_dummies(data['State'], prefix='State', drop_first=True)
data = pd.concat([data, state_encoded], axis=1)
data.drop('State', axis=1, inplace=True)

# Step 3: Preprocess the data by separating features (X) and setting the target variable (y) to 'Profit'. Convert X and y to numpy arrays suitable for the sklearn LinearRegression model. Split the dataset into training and testing sets using a test size of 20%.
X = data.drop('Profit', axis=1).values
y = data['Profit'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train Lasso Linear Regression models with different variables and calculate Mean Squared Error (MSE). List the names of variables used in each model along with the MSE values, and present the results in a table format.
results = []

for num_variables in range(1, X_train.shape[1] + 1):
    lasso_model = Lasso(alpha=0.1)
    lasso_model.fit(X_train[:, :num_variables], y_train)
    y_pred = lasso_model.predict(X_test[:, :num_variables])
    mse = mean_squared_error(y_test, y_pred)
    variable_names = list(data.drop('Profit', axis=1).columns[:num_variables])
    results.append((num_variables, variable_names, mse))

# Print the results table
print("Number of Variables\tVariable Names\t\t\tMean Squared Error")
print("--------------------------------------------------------------------")
for num_variables, variable_names, mse in results:
    variable_names_str = ", ".join(variable_names)
    print(f"{num_variables}\t\t\t{variable_names_str}\t\t\t{mse}")

# Step 5: Make predictions on the test data, calculate the Mean Squared Error (MSE) using the predicted values, and print the MSE.
final_model = Lasso(alpha=0.1)  # You can adjust the alpha parameter
final_model.fit(X_train, y_train)
final_predictions = final_model.predict(X_test)
final_mse = mean_squared_error(y_test, final_predictions)
print(f"Final Model Mean Squared Error: {final_mse}")


   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39
3  144372.41       118671.85        383199.62    New York  182901.99
4  142107.34        91391.77        366168.42     Florida  166187.94
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     object 
 4   Profit           50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB
None
           R&D Spend  Administration  Marketing Spend         Profit
count      50.000000       50.000000
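
`pd.read_csv` also accepts a URL or any file-like object directly, so the separate `requests` and `StringIO` steps could be folded into one call. A minimal sketch, assuming a hypothetical helper `load_startups` and an inline sample (the first three rows shown above) so that it runs offline:

```python
from io import StringIO

import pandas as pd

# Inline sample with the same columns as 50_Startups.csv (first three rows),
# used here so the sketch runs without network access.
SAMPLE_CSV = """R&D Spend,Administration,Marketing Spend,State,Profit
165349.20,136897.80,471784.10,New York,192261.83
162597.70,151377.59,443898.53,California,191792.06
153441.51,101145.55,407934.54,Florida,191050.39
"""


def load_startups(source):
    """Hypothetical helper: read the dataset from a URL, a path, or a file-like object."""
    frame = pd.read_csv(source)
    # Fail early if the expected target column is missing
    # (for example, when an error page was downloaded instead of the CSV).
    if "Profit" not in frame.columns:
        raise ValueError("expected a 'Profit' column")
    return frame


sample = load_startups(StringIO(SAMPLE_CSV))
print(sample.shape)
print(sample.dtypes)
```

Passing `dataset_url` in place of the `StringIO` object would fetch the full 50-row file.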

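#### Prompt: Use Python to download the 50_Startups dataset from the web and load it as a pandas DataFrame. Perform a basic exploration and summary of the dataset, split the data into features (X) and the target variable (y), and divide the dataset into training and test sets at an 80/20 ratio. Train a linear regression model and compute the mean squared error (MSE) of its predictions on the test set. Finally, output the MSE result.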

In [2]:
# Step 1: Import the necessary libraries and download the dataset using a web crawler.
import requests
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from io import StringIO

# Define the URL of the dataset
#dataset_url = "https://gist.github.com/GaneshSparkz/b5662effbdae8746f7f7d8ed70c42b2d/raw/bf6a5b1e89cf9e24fcb90701375a219c99616e57/train.csv"
dataset_url = "https://gist.githubusercontent.com/GaneshSparkz/b5662effbdae8746f7f7d8ed70c42b2d/raw/faf8b1a0d58e251f48a647d3881e7a960c3f0925/50_Startups.csv"
# Use requests to fetch the content of the dataset
response = requests.get(dataset_url)
response.raise_for_status()  # stop on an HTTP error instead of parsing an error page
data_csv = response.text

# Convert the CSV content to a pandas DataFrame
data = pd.read_csv(StringIO(data_csv))

# Print a summary of the dataset
print(data.head())
print(data.info())
print(data.describe())

# Rest of the steps remain unchanged.
# Step 2: Load the dataset and explore its contents.
# Step 3: Preprocess the data by separating features (X) and the target variable (y). Split the dataset into training and testing sets using a test size of 20%.
# Step 4: Initialize a Linear Regression model, train it using the training data, and fit the model.
# Step 5: Make predictions on the test data, calculate the Mean Squared Error (MSE) using the predicted values, and print the MSE.


   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39
3  144372.41       118671.85        383199.62    New York  182901.99
4  142107.34        91391.77        366168.42     Florida  166187.94
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     object 
 4   Profit           50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB
None
           R&D Spend  Administration  Marketing Spend         Profit
count      50.000000       50.000000

# Step 2: Preprocessing Data
1. One-hot encode the State variable

**Prompt**

Modify Step 2 to one-hot encode the "State" feature.

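#### Prompt: One-hot encode the 'State' feature in the dataset, generating new binary variables that represent the different states, add these new variables to the original dataset, and then drop the original 'State' column.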

In [None]:
# Perform one-hot encoding on the 'State' feature
state_encoded = pd.get_dummies(data['State'], prefix='State', drop_first=True)
data = pd.concat([data, state_encoded], axis=1)
data.drop('State', axis=1, inplace=True)




In [None]:
data.columns

Index(['R&D Spend', 'Administration', 'Marketing Spend', 'Profit',
       'State_Florida', 'State_New York'],
      dtype='object')
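
With `drop_first=True`, `get_dummies` drops the first category in sorted order (here `California`), which becomes the baseline: a row with 0 in both remaining columns is a California row. A small sketch on a three-row frame; the `dtype=int` argument is an addition not used in the notebook, chosen so the new columns hold 0/1 integers rather than booleans:

```python
import pandas as pd

states = pd.DataFrame({"State": ["New York", "California", "Florida"]})

# drop_first=True removes the first category (California), leaving two indicator columns.
encoded = pd.get_dummies(states["State"], prefix="State", drop_first=True, dtype=int)
print(encoded)
```

Dropping one category avoids a redundant column, since the third state is implied when the other two indicators are 0.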

#### Prompt: Select 'R&D Spend', 'Administration', 'Marketing Spend', and the one-hot encoded 'State_Florida' and 'State_New York' as the features (X), with 'Profit' as the target variable (Y). Reshape X and Y into suitable shapes (5 features and 1 target variable), and check the shape of X.

In [None]:
X = data[['R&D Spend', 'Administration', 'Marketing Spend', 'State_Florida', 'State_New York']].values.reshape(-1, 5)
Y = data['Profit'].values.reshape(-1, 1)
X.shape

# ChatGPT's version below is equivalent and works just as well:
# X = data.drop('Profit', axis=1).values.reshape(-1, 5)
# Y = data['Profit'].values.reshape(-1, 1)


(50, 5)

# Step 3: Train Test split

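#### Prompt: Split the features (X) and target variable (Y) into training and test sets at an 80/20 ratio, setting the random seed to 42 so the results are reproducible. Then print the shape of the training feature data (X_train).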

In [None]:
# Step 3: Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
print(X_train.shape)

(40, 5)
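
The shape follows from the split arithmetic: 20% of 50 rows is 10 test rows, leaving 40 for training. A sketch on placeholder arrays of the same shape (the zeros and `np.arange` values are stand-ins, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.zeros((50, 5))            # placeholder features, same shape as X
y_demo = np.arange(50, dtype=float)   # placeholder target, kept 1-D

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (40, 5) (10, 5)
```

Because `random_state` is fixed, rerunning the split selects the same rows each time.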


# Step 4: Build Model and Evaluate

#### Prompt: Initialize a linear regression model (Linear Regression) and train it using the training-set features (X_train) and target variable (y_train).

In [None]:
# Step 4: Initialize a Linear Regression model, train it using the training data, and fit the model.
model = LinearRegression()
model.fit(X_train, y_train)
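
After fitting, the learned weights are available as `coef_` and `intercept_`. A sketch on synthetic data generated from a known linear rule (the data and the names `X_syn`, `y_syn`, and `demo_model` are made up for illustration), showing that the model recovers the coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(40, 2))
# Noise-free synthetic target: y = 2*x0 + 3*x1 + 1
y_syn = 2.0 * X_syn[:, 0] + 3.0 * X_syn[:, 1] + 1.0

demo_model = LinearRegression().fit(X_syn, y_syn)
print(demo_model.coef_, demo_model.intercept_)
```

On the notebook's `model`, the same attributes would give one weight per column of `X_train`.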



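#### Prompt: Train Lasso linear regression models using different numbers of variables and compute the mean squared error (MSE) of each model. For each model, select variable sets ranging from 1 variable up to all variables, and train a Lasso model with the alpha parameter set to 0.1. Predict on the test data, compute and store the MSE for each variable set, and finally print a table with the number of variables, the variable names, and the corresponding MSE for each set.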

In [None]:
# Step 4: Train Lasso Linear Regression models with different variables and calculate MSE.

from sklearn.linear_model import Lasso

# Create a list to store results (number of variables, variable names, MSE)
results = []

# Test different variables
for num_variables in range(1, X_train.shape[1] + 1):
    # Train Lasso model with the selected variables
    lasso_model = Lasso(alpha=0.1)  # You can adjust the alpha parameter
    lasso_model.fit(X_train[:, :num_variables], y_train)

    # Make predictions on the test data
    y_pred = lasso_model.predict(X_test[:, :num_variables])

    # Calculate MSE
    mse = mean_squared_error(y_test, y_pred)

    # Store the results (number of variables, variable names, MSE)
    variable_names = list(data.drop('Profit', axis=1).columns[:num_variables])
    results.append((num_variables, variable_names, mse))

# Print the results table
print("Number of Variables\tVariable Names\t\t\tMean Squared Error")
print("--------------------------------------------------------------------")
for num_variables, variable_names, mse in results:
    variable_names_str = ", ".join(variable_names)
    print(f"{num_variables}\t\t\t{variable_names_str}\t\t\t{mse}")


Number of Variables	Variable Names			Mean Squared Error
--------------------------------------------------------------------
1			R&D Spend			59510962.80895922
2			R&D Spend, Administration			83764133.64832704
3			R&D Spend, Administration, Marketing Spend			80926321.17724034
4			R&D Spend, Administration, Marketing Spend, State_Florida			82009753.59553678
5			R&D Spend, Administration, Marketing Spend, State_Florida, State_New York			82009745.37455888
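
The loop above only evaluates prefixes of the column order (the first 1, 2, ... columns), not every combination of variables. A sketch of an exhaustive search with `itertools.combinations`, run on synthetic data with hypothetical column names `a`, `b`, and `c`; with the notebook's 5 features the same loop would fit 31 models:

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
features = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
# Synthetic target: depends on a and c only, plus a little noise.
target = 4.0 * features["a"] - 2.0 * features["c"] + rng.normal(scale=0.1, size=50)

F_train, F_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

rows = []
for k in range(1, features.shape[1] + 1):
    for subset in combinations(features.columns, k):
        cols = list(subset)
        fitted = Lasso(alpha=0.1).fit(F_train[cols], t_train)
        subset_mse = mean_squared_error(t_test, fitted.predict(F_test[cols]))
        rows.append({"n_variables": k, "variables": ", ".join(cols), "mse": subset_mse})

# One row per subset, best (lowest MSE) first.
subset_table = pd.DataFrame(rows).sort_values("mse").reset_index(drop=True)
print(subset_table)
```

Collecting the rows in a DataFrame also avoids the tab-alignment problems of the hand-printed table.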


# Step 5: Deploy Answer

In [None]:
!pip install optuna



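#### Prompt: Use the trained linear regression model to predict on the test-set features (X_test), compute the mean squared error (MSE) between the predictions and the actual values, and output the MSE value.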

In [None]:
# Step 5: Make predictions on the test data, calculate the Mean Squared Error (MSE) using the predicted values, and print the MSE.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

Mean Squared Error: 82010363.04430094
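
MSE is the mean of the squared residuals, so it is expressed in squared units of Profit; taking the square root (RMSE) gives an error on the same scale as Profit, roughly 9,056 for the value above. A sketch that computes MSE by hand on a few made-up numbers, compares it with `mean_squared_error`, and converts the reported MSE to RMSE:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual = np.array([100.0, 200.0, 300.0])      # made-up values
predicted = np.array([110.0, 190.0, 330.0])   # made-up values

# Residuals are -10, 10, -30, so the squared errors are 100, 100, 900.
manual_mse = np.mean((actual - predicted) ** 2)
library_mse = mean_squared_error(actual, predicted)
print(manual_mse, library_mse)

# Square root of the MSE reported above, on the same scale as Profit.
reported_rmse = np.sqrt(82010363.04430094)
print(round(reported_rmse, 2))
```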


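#### Prompt: Use Python to download the 50_Startups dataset from the web, with 'R&D Spend' as the feature (X) and 'Profit' as the target variable (y), and split the dataset into training and test sets (80/20). Use Optuna to optimize the Lasso model's regularization parameter alpha, with the goal of minimizing the mean squared error (MSE). Run 100 trials to find the best alpha value, and train the final model with that value. Finally, compute and output the final model's MSE, along with the name of the variable used by the best model.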

In [None]:
import pandas as pd
import requests
from io import StringIO
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
import optuna

# Define the URL of the dataset
dataset_url = "https://gist.githubusercontent.com/GaneshSparkz/b5662effbdae8746f7f7d8ed70c42b2d/raw/faf8b1a0d58e251f48a647d3881e7a960c3f0925/50_Startups.csv"

# Use requests to fetch the content of the dataset
response = requests.get(dataset_url)
response.raise_for_status()  # stop on an HTTP error instead of parsing an error page
data_csv = response.text

# Convert the CSV content to a pandas DataFrame
data = pd.read_csv(StringIO(data_csv))

# Preprocess the data
X = data[['R&D Spend']].values
y = data['Profit'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the objective function for Optuna optimization
def objective(trial):
    alpha = trial.suggest_float('alpha', 0.001, 1)
    lasso_model = Lasso(alpha=alpha)
    lasso_model.fit(X_train, y_train)
    y_pred = lasso_model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return mse

# Optimize alpha using Optuna
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100)

# Get the best parameters
best_alpha = study.best_params['alpha']
print(f"Best Alpha: {best_alpha}")

# Train the final Lasso model with the best alpha
final_model = Lasso(alpha=best_alpha)
final_model.fit(X_train, y_train)
final_predictions = final_model.predict(X_test)
final_mse = mean_squared_error(y_test, final_predictions)
print(f"Final Model Mean Squared Error (After Optimization): {final_mse}")

# Print the best model's variable name
best_variable_name = 'R&D Spend'
print(f"Best Model Variable Name: {best_variable_name}")


[I 2023-08-15 06:44:36,156] A new study created in memory with name: no-name-fb40aced-5721-4adb-9d03-627629cab55f
[I 2023-08-15 06:44:36,163] Trial 0 finished with value: 59510962.81653254 and parameters: {'alpha': 0.8017123723860038}. Best is trial 0 with value: 59510962.81653254.
[I 2023-08-15 06:44:36,175] Trial 1 finished with value: 59510962.81229075 and parameters: {'alpha': 0.4086825480760995}. Best is trial 1 with value: 59510962.81229075.
[I 2023-08-15 06:44:36,187] Trial 2 finished with value: 59510962.81307502 and parameters: {'alpha': 0.48135257245079016}. Best is trial 1 with value: 59510962.81229075.
[I 2023-08-15 06:44:36,196] Trial 3 finished with value: 59510962.80900576 and parameters: {'alpha': 0.10430894426197101}. Best is trial 3 with value: 59510962.80900576.
[I 2023-08-15 06:44:36,207] Trial 4 finished with value: 59510962.809984185 and parameters: {'alpha': 0.19496837350969967}. Best is trial 3 with value: 59510962.80900576.
[I 2023-08-15 06:44:36,237] Trial 5 f

Best Alpha: 0.0012610851097637475
Final Model Mean Squared Error (After Optimization): 59510962.80789355
Best Model Variable Name: R&D Spend
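
Two things stand out in this run. The MSE barely changes across trials, most likely because alpha values up to 1 are negligible next to a feature in the 10^5 range. The objective also scores each alpha on the test set, so the final MSE is not an independent estimate. The sketch below shows one alternative; it is an assumption, not the notebook's method. It standardizes the feature and picks alpha by cross-validation on the training data only, using scikit-learn's `cross_val_score` over a fixed grid in place of Optuna. The data is synthetic, and the slope, offset, and noise level are made-up numbers.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic single feature on a spend-like scale (made-up parameters).
X_syn = rng.normal(loc=70000.0, scale=40000.0, size=(50, 1))
y_syn = 0.85 * X_syn[:, 0] + 49000.0 + rng.normal(scale=9000.0, size=50)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

alpha_grid = np.logspace(-3, 3, 13)
cv_mse = []
for alpha in alpha_grid:
    pipeline = make_pipeline(StandardScaler(), Lasso(alpha=alpha))
    # cross_val_score returns negated MSE for this scorer, so flip the sign back.
    scores = cross_val_score(
        pipeline, X_tr, y_tr, scoring="neg_mean_squared_error", cv=5
    )
    cv_mse.append(-scores.mean())

best_alpha = float(alpha_grid[int(np.argmin(cv_mse))])
print(f"Best alpha from cross-validation: {best_alpha}")

# The test set is used once, after alpha has been chosen.
final_pipeline = make_pipeline(StandardScaler(), Lasso(alpha=best_alpha)).fit(X_tr, y_tr)
test_mse = float(np.mean((y_te - final_pipeline.predict(X_te)) ** 2))
print(f"Test MSE: {test_mse}")
```

The same cross-validated score could be returned from an Optuna objective instead of the test-set MSE.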
