# Dataset Description
This dataset contains information about student performance in Portuguese at two secondary schools. It includes student grades along with demographic, social, and school-related features, and was collected from school reports and questionnaires. The attributes are described below:

* school: student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
* sex: student's sex (binary: 'F' - female or 'M' - male)
* age: student's age (numeric: from 15 to 22)
* address: student's home address type (binary: 'U' - urban or 'R' - rural)
* famsize: family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
* Pstatus: parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
* Medu: mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
* Fedu: father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
* Mjob: mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
* Fjob: father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
* reason: reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
* guardian: student's guardian (nominal: 'mother', 'father' or 'other')
* traveltime: home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
* studytime: weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
* failures: number of past class failures (numeric: n if 1<=n<3, else 4)
* schoolsup: extra educational support (binary: yes or no)
* famsup: family educational support (binary: yes or no)
* paid: extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
* activities: extra-curricular activities (binary: yes or no)
* nursery: attended nursery school (binary: yes or no)
* higher: wants to take higher education (binary: yes or no)
* internet: Internet access at home (binary: yes or no)
* romantic: with a romantic relationship (binary: yes or no)
* famrel: quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
* freetime: free time after school (numeric: from 1 - very low to 5 - very high)
* goout: going out with friends (numeric: from 1 - very low to 5 - very high)
* Dalc: workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
* Walc: weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
* health: current health status (numeric: from 1 - very bad to 5 - very good)
* absences: number of school absences (numeric: from 0 to 93)
* G1: first period grade (numeric: from 0 to 20)
* G2: second period grade (numeric: from 0 to 20)
* G3: final grade (numeric: from 0 to 20, output target)

# Import Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from scipy import stats
from math import sqrt
%matplotlib inline

In [None]:
df = pd.read_csv("/kaggle/input/student-performance-data-set/student-por.csv")
df.head(5)

In [None]:
print("Data Shape: number of Rows = {0}, number of Columns = {1}".format(df.shape[0],df.shape[1]))

In [None]:
# Equivalent f-string version (suggested by GPT)
print(f"Data Shape: number of Rows = {df.shape[0]}, number of Columns = {df.shape[1]}")


In [None]:
df.info()

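Display summary statistics for all numeric columns in the dataset.

These include each column's mean, standard deviation, minimum, maximum, and quartiles.

`.T` (transpose) is used so that each feature becomes a row and each statistic a column, which is easier to read.

The summary is a quick way to inspect the distribution of the data and to spot outliers or missing values.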



In [None]:
print("Show Statistical Description of Numerical Columns")
df.describe().T
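
To make the effect of the transpose concrete, here is a minimal sketch on a made-up three-row frame (the `toy` DataFrame and its values are illustrative assumptions, not rows from the dataset). After `.T`, each feature occupies one row and each statistic one column.

```python
import pandas as pd

# Hypothetical toy data, not taken from student-por.csv
toy = pd.DataFrame({"G1": [10, 12, 14], "absences": [0, 4, 2]})

# Rows = features, columns = statistics (count, mean, std, min, quartiles, max)
summary = toy.describe().T
print(summary[["mean", "min", "max"]])
```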

# Data Cleaning

In [None]:
#check for missing values
df.isnull().sum()

The dataset has no missing values.

In [None]:
# Check for duplicate rows
df.duplicated().value_counts()

# Visualization

Check whether the two sexes are balanced.

In [None]:

# Check whether the two sexes are balanced

# Ignore warnings
warnings.filterwarnings("ignore")

# Count the occurrences of each category in the 'sex' column
target_count = df.sex.value_counts()

# Print the count of females (look up by label: value_counts() sorts by frequency, not by label)
print('Female:', target_count['F'])

# Print the count of males
print('Male:', target_count['M'])

# Create a count plot of 'sex' with seaborn
sns.countplot(data=df, x="sex", hue="sex", palette="Blues")

# Set the title of the plot
plt.title('Count of Sex') 
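
Because `value_counts()` orders its result by frequency, looking counts up by label is safer than by position. The sketch below uses a hypothetical `sex` Series to show label-based lookup together with the `normalize=True` option, which reports shares instead of raw counts.

```python
import pandas as pd

# Hypothetical values for illustration
sex = pd.Series(["F", "M", "F", "F", "M"], name="sex")

counts = sex.value_counts()                # raw counts, most frequent first
shares = sex.value_counts(normalize=True)  # fractions that sum to 1

print("Female:", counts["F"], f"({shares['F']:.0%})")
print("Male:", counts["M"], f"({shares['M']:.0%})")
```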


Examine whether failures (number of past class failures) is related to sex.


In [None]:
# Examine whether failures (number of past class failures) is related to sex



# Grouping the DataFrame by 'failures' and 'sex', counting the occurrences, and resetting the index
failure_counts = df.groupby(["failures", "sex"]).size().reset_index(name="count")

# Printing the DataFrame showing counts of failures by sex
print(failure_counts)

# Creating a count plot of 'failures' with seaborn, differentiated by 'sex'
sns.countplot(data=df, x="failures", hue="sex", palette="Blues")

# Setting the title of the plot
plt.title('Count of Failures by Sex')


Is there a potential relationship between age and the number of failures? Do older students fail more often?

In [None]:
# Grouping the DataFrame by 'failures' and 'age', counting the occurrences, and resetting the index
failure_counts = df.groupby(["failures", "age"]).size().reset_index(name="count")

# Printing the DataFrame showing counts of failures by age
print(failure_counts)

# Creating a horizontal count plot of 'age' with seaborn, differentiated by 'failures'
sns.countplot(data=df, y='age', hue='failures', palette="rocket_r")

# Setting the title of the plot
plt.title('Count of Failures by Age')

# Setting the label for the y-axis
plt.ylabel('Age')


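Do rural (R) and urban (U) students differ noticeably in how past failures are distributed?

If high failure counts are concentrated in one type of area, that may point to the influence of educational resources or other socioeconomic factors.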

In [None]:
# Create a pivot table to count the occurrences of failures for each address type
pivot_table = df.pivot_table(index='failures', columns='address', aggfunc='size')

# Plot the heatmap using seaborn, with annotations, a colormap, and format for annotations
sns.heatmap(pivot_table, annot=True, cmap='tab20c_r', fmt='g')

# Add title to the plot
plt.title('Distribution of Failures by Address Type')

# Add label for the x-axis
plt.xlabel('Address Type')

# Add label for the y-axis
plt.ylabel('Number of Failures')
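
Raw counts can be hard to compare when one group is much larger than the other. One possible complement, sketched here on hypothetical data, is `pd.crosstab` with `normalize='columns'`, which turns each address column into proportions that sum to 1.

```python
import pandas as pd

# Hypothetical toy data: five urban and three rural students
toy = pd.DataFrame({
    "failures": [0, 0, 0, 1, 0, 1, 2, 0],
    "address":  ["U", "U", "U", "U", "R", "R", "R", "U"],
})

# Share of students at each failure count, within each address type
shares = pd.crosstab(toy["failures"], toy["address"], normalize="columns")
print(shares.round(2))
```

The same table can be passed to `sns.heatmap` in place of the count pivot if a plot of proportions is preferred.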


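Do students whose parents live apart tend to have more past failures?

If Pstatus = A is markedly over-represented at high failure counts, that may suggest that family structure affects academic performance.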

In [None]:
# Create a pivot table to count the occurrences of failures for each Pstatus type
pivot_table = df.pivot_table(index='failures', columns='Pstatus', aggfunc='size')

# Plot the heatmap using seaborn, with annotations, a colormap, and format for annotations
sns.heatmap(pivot_table, annot=True, cmap='tab20c_r', fmt='g')

# Add title to the plot
plt.title('Distribution of Failures by Pstatus Type')

# Add label for the x-axis
plt.xlabel('Pstatus Type')

# Add label for the y-axis
plt.ylabel('Number of Failures')


Analyze whether the parents' cohabitation status affects the distribution of the three grades.

In [None]:
# Create subplots with 1 row and 3 columns, each with its own y-axis
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=False)

# Iterate through grade periods and plot
for i, grade_period in enumerate(['G1', 'G2', 'G3']):
    sns.countplot(ax=axes[i], data=df, y=grade_period, hue="Pstatus", palette="Blues")
    axes[i].set_title(f'{grade_period} vs Pstatus')
    axes[i].set_ylabel("Grade")

# Show the plots
plt.show()

# Define columns for grade periods
columns = ["G1", "G2", "G3"]

# Iterate through grade periods
for i in range(len(columns)):
    # Group the DataFrame by 'Pstatus' and the current grade period, counting occurrences, and reset the index
    Pstatus_counts = df.groupby(["Pstatus", columns[i]]).size().reset_index(name="count")
    # Print the results
    print(columns[i],"\n",Pstatus_counts)
    print("\n")
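
The per-grade count tables printed above get long. A more compact comparison, sketched here under the assumption that group means and medians are an adequate summary, aggregates each grade column by `Pstatus`. The `toy` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Pstatus": ["T", "T", "T", "A", "A"],
    "G1": [10, 12, 14, 11, 13],
    "G3": [11, 12, 16, 10, 14],
})

# One row per Pstatus value, with mean and median for each grade column
summary = toy.groupby("Pstatus")[["G1", "G3"]].agg(["mean", "median"])
print(summary)
```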


In [None]:
# Create subplots with 1 row and 3 columns, adjusting figure size
fig, axes = plt.subplots(1, 3, figsize=(25, 15))

# Iterate through grade periods and plot
for i, grade in enumerate(['G1', 'G2', 'G3']):
    sns.countplot(data=df, y=grade, hue='failures', ax=axes[i], palette='Set2', dodge=True)
    axes[i].set_title(f'Failure vs {grade}')  # Set title for each subplot
    axes[i].set_ylabel('Grade')  # Set label for y-axis
    axes[i].set_xlabel('Count')  # Set label for x-axis

# Show the plots
plt.show()

# Define columns for grade periods
columns = ["G1", "G2", "G3"]

# Iterate through grade periods
for i in range(len(columns)):
    # Group the DataFrame by 'failures' and the current grade period, counting occurrences, and reset the index
    failures_counts = df.groupby(["failures", columns[i]]).size().reset_index(name="count")
    # Print the results
    print(columns[i],"\n",failures_counts)
    print("\n")


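The preceding cell asks whether the number of past failures affects the distribution of a student's current grades (G1/G2/G3). The next cell repeats the analysis for weekly study time (studytime).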



In [None]:
# Create subplots with 1 row and 3 columns, adjusting figure size
fig, axes = plt.subplots(1, 3, figsize=(25, 15))

# Iterate through grade periods and plot
for i, grade in enumerate(['G1', 'G2', 'G3']):
    sns.countplot(data=df, y=grade, hue='studytime', ax=axes[i], palette='Set1', dodge=True)
    axes[i].set_title(f'StudyTime vs {grade}')  # Set title for each subplot
    axes[i].set_ylabel('Grade')  # Set label for y-axis
    axes[i].set_xlabel('Count')  # Set label for x-axis

# Show the plots
plt.show()

# Define columns for grade periods
columns = ["G1", "G2", "G3"]

# Iterate through grade periods
for i in range(len(columns)):
    # Group the DataFrame by 'studytime' and the current grade period, counting occurrences, and reset the index
    studytime_counts = df.groupby(["studytime", columns[i]]).size().reset_index(name="count")
    # Print the results
    print(columns[i],"\n",studytime_counts)
    print("\n")


Does how often students go out with friends (goout) affect their grades?



In [None]:
# Create subplots with 1 row and 3 columns, adjusting figure size
fig, axes = plt.subplots(1, 3, figsize=(25, 15))

# Iterate through grade periods and plot
for i, grade in enumerate(['G1', 'G2', 'G3']):
    sns.countplot(data=df, y=grade, hue='goout', ax=axes[i], palette='Dark2', dodge=True)
    axes[i].set_title(f'Go Out vs {grade}')  # Set title for each subplot
    axes[i].set_ylabel('Grade')  # Set label for y-axis
    axes[i].set_xlabel('Count')  # Set label for x-axis

# Show the plots
plt.show()

# Define columns for grade periods
columns = ["G1", "G2", "G3"]

# Iterate through grade periods
for i in range(len(columns)):
    # Group the DataFrame by 'goout' and the current grade period, counting occurrences, and reset the index
    goout_counts = df.groupby(["goout", columns[i]]).size().reset_index(name="count")
    # Print the results
    print(columns[i],"\n",goout_counts)
    print("\n")


Does the number of school absences (absences) affect grades (G1/G2/G3)?

In [None]:
# Create subplots with 1 row and 3 columns, adjusting figure size
fig, axes = plt.subplots(1, 3, figsize=(35, 25))

# Iterate through grade periods and plot
for i, grade in enumerate(['G1', 'G2', 'G3']):
    sns.countplot(data=df, y=grade, hue='absences', ax=axes[i], palette='rocket', dodge=True)
    axes[i].set_title(f'Absences vs {grade}')  # Set title for each subplot
    axes[i].set_ylabel('Grade')  # Set label for y-axis
    axes[i].set_xlabel('Count')  # Set label for x-axis

# Show the plots
plt.show()

# Define columns for grade periods
columns = ["G1", "G2", "G3"]

# Iterate through grade periods
for i in range(len(columns)):
    # Group the DataFrame by 'absences' and the current grade period, counting occurrences, and reset the index
    absences_counts = df.groupby(["absences", columns[i]]).size().reset_index(name="count")
    # Print the results
    print(columns[i],"\n",absences_counts)
    print("\n")


How do the grade distributions of the two schools (GP vs MS) differ?



In [None]:
# Create subplots with 1 row and 3 columns, adjusting figure size
fig, axes = plt.subplots(1, 3, figsize=(35, 25))

# Iterate through grade periods and plot
for i, grade in enumerate(['G1', 'G2', 'G3']):
    sns.countplot(data=df, y=grade, hue='school', ax=axes[i], palette='Blues', dodge=True)
    axes[i].set_title(f'School vs {grade}')  # Set title for each subplot
    axes[i].set_ylabel('Grade')  # Set label for y-axis
    axes[i].set_xlabel('Count')  # Set label for x-axis

# Adjust layout to prevent overlapping
plt.tight_layout()

# Show the plots
plt.show()

# Define columns for grade periods
columns = ["G1", "G2", "G3"]

# Iterate through grade periods
for i in range(len(columns)):
    # Group the DataFrame by 'school' and the current grade period, counting occurrences, and reset the index
    school_counts = df.groupby(["school", columns[i]]).size().reset_index(name="count")
    # Print the results
    print(columns[i],"\n",school_counts)
    print("\n")


# Data Preprocessing

In [None]:
df.info()

Convert all non-numeric categorical columns to numeric form so they can be used in later statistical analysis or fed into a model.

In [None]:
# Importing necessary library
from sklearn.preprocessing import LabelEncoder

# Creating a LabelEncoder object
label_encoder = LabelEncoder()

# List of categorical columns to be encoded
categorical_columns = ["school","sex","address","famsize","Pstatus","Mjob","Fjob","reason","guardian","schoolsup","famsup","paid","activities",
                "nursery","higher","internet","romantic"]

# Iterate through each column and perform label encoding
for column in categorical_columns:
    # Fit the encoder once and replace the column with its integer codes
    df[column] = label_encoder.fit_transform(df[column])

    # Create a dictionary mapping original values to encoded values
    mapping = {label: code for code, label in enumerate(label_encoder.classes_.tolist())}
    # Print the column name together with its mapping
    print(column, mapping)
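
`LabelEncoder` assigns codes in sorted order of the labels, and the fitted `classes_` attribute records that order. The sketch below, on a hypothetical `Mjob`-like list, shows how to read the mapping back and how `inverse_transform` recovers the original labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values resembling the Mjob column
jobs = ["teacher", "other", "health", "other", "at_home"]

encoder = LabelEncoder()
codes = encoder.fit_transform(jobs)

# classes_ holds the labels in code order: index 0 is code 0, and so on
mapping = {label: code for code, label in enumerate(encoder.classes_.tolist())}
print(mapping)

# Decode the integer codes back to the original labels
restored = encoder.inverse_transform(codes).tolist()
print(restored == jobs)
```

Note that the integer codes impose an arbitrary order on nominal categories such as jobs, which distance-based or linear models may misread as a ranking.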


Plot a correlation heatmap showing the correlation coefficients between all numeric columns in the dataset.

In [None]:
# Calculate the correlation matrix
corr = df.corr()

# Create a figure with a large size
plt.figure(figsize=(50,50))

# Plot the heatmap using seaborn, with annotations and a blue colormap
sns.heatmap(corr, annot=True, cmap="Blues")

# Set the title of the plot
plt.title('Correlation Heatmap', fontsize=20)
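
With this many columns the heatmap is dense. If the main interest is the target, one option is to pull out just the `G3` column of the correlation matrix and sort it by absolute value. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "G1": [8, 10, 12, 14, 16],
    "G2": [9, 10, 13, 14, 17],
    "absences": [6, 2, 4, 0, 1],
    "G3": [9, 11, 13, 15, 17],
})

# Correlation of every feature with the target, strongest (by magnitude) first
target_corr = toy.corr()["G3"].drop("G3").sort_values(key=abs, ascending=False)
print(target_corr)
```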


In [None]:
df.info()

Plot a box plot with an overlaid strip plot of G1 to check for obvious outliers (extremely high or low grades).

In [None]:
# Checking for outliers in the feature 'G1'
plt.figure(figsize = (60,30))
sns.boxplot(x='G1', data=df)
sns.stripplot(x='G1', data=df, color="#804630")

Check whether G2 (second period grade) contains outliers.

In [None]:
# Checking for outliers in the feature 'G2'
plt.figure(figsize = (60,30))
sns.boxplot(x='G2', data=df)
sns.stripplot(x='G2', data=df, color="#804630")

Use the z-score method to remove outliers, keeping only rows in which every column has an absolute z-score below 3.

In [None]:
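# Absolute z-score of every value (all columns are numeric after label encoding)
z_scores = np.abs(stats.zscore(df))
print(z_scores.shape)
# Keep only rows in which every column lies within 3 standard deviations of its mean
df = df[(z_scores < 3).all(axis=1)]
df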
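
The filter keeps a row only if every one of its columns lies within three standard deviations of that column's mean, so a single extreme value drops the whole row. A small sketch on synthetic data, computing the z-score by hand with the population standard deviation (`ddof=0`, which is also the default in `scipy.stats.zscore`):

```python
import pandas as pd

# Synthetic data: 19 ordinary rows plus one with an extreme 'absences' value
toy = pd.DataFrame({
    "absences": [10] * 19 + [100],
    "G1": list(range(20)),
})

# z = (x - mean) / std, column by column, using the population std (ddof=0)
z = ((toy - toy.mean()) / toy.std(ddof=0)).abs()

# Keep rows where every column is within 3 standard deviations
filtered = toy[(z < 3).all(axis=1)]
print(len(toy), "->", len(filtered))
```

Because label-encoded binary columns are z-scored as well, rows belonging to a rare category can be dropped for that reason alone. It may be worth restricting the filter to genuinely continuous columns.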

In [None]:
#df['G2'] = np.log(df['G2'])

Re-examine the G1 distribution after outlier removal to see whether the anomalous values were removed.

In [None]:
# Checking for outliers in the feature 'G1'
plt.figure(figsize = (60,30))
sns.boxplot(x='G1', data=df)
sns.stripplot(x='G1', data=df, color="#804630")

Examine the G2 distribution after outlier removal to see whether the anomalous data were removed.

In [None]:
# Checking for outliers in the feature 'G2'
plt.figure(figsize = (60,30))
sns.boxplot(x='G2', data=df)
sns.stripplot(x='G2', data=df, color="#804630")

# Feature selection

In [None]:
# Apply a feature selection method to keep the features that contribute most to the model,
# instead of training on every feature regardless of how informative it is

In other words: drop the G3 grade and use all remaining columns as the model inputs.

In [None]:
x = df.drop('G3', axis=1)
y = df['G3']

List all column names (that is, the features) currently in x.

In [None]:
all_features = x.columns
all_features

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectFromModel

Create a decision tree classifier that uses entropy as its split criterion, to be used for training or feature selection.

In [None]:
# Taking object from the library to use the model.
# Use the entropy criterion to define feature importance.
dtc = DecisionTreeClassifier(random_state=0, criterion='entropy') 

Create a SelectFromModel that will use the DecisionTreeClassifier just created to pick out the important features.

In [None]:
selector = SelectFromModel(estimator=dtc)

Fit the DecisionTreeClassifier on x and y and use it to select the important features.

In [None]:
selector.fit(x, y)

List the index positions of the selected features.

In [None]:
selector.get_support(indices=True)

Store the column indices of the selected features and display them.

In [None]:
selected_features_idx = selector.get_support(indices=True)
selected_features_idx

Use the selected indices to look up the corresponding feature names in all_features.

In [None]:
selected_features = all_features[selected_features_idx]
selected_features
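
By default `SelectFromModel` keeps the features whose importance is at least the mean importance across all features. The sketch below shows this on synthetic data in which only one column carries signal. It uses `DecisionTreeRegressor` rather than the classifier above, on the assumption that a regressor is the more natural fit for a numeric target such as `G3`. The column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on 'signal' only, the rest is noise
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["signal", "n1", "n2", "n3"])
y = 3 * X["signal"] + rng.normal(scale=0.1, size=200)

selector = SelectFromModel(DecisionTreeRegressor(random_state=0)).fit(X, y)

# Importances sum to 1, so the default "mean" threshold is 1 / n_features
importances = pd.Series(selector.estimator_.feature_importances_, index=X.columns)
selected = X.columns[selector.get_support()]

print(importances.round(3))
print("threshold:", round(float(selector.threshold_), 3))
print("selected:", list(selected))
```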

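Manually choose a set of features considered important, to use as the model inputs.

'Walc': weekend alcohol consumption

'absences': number of school absences

'G1': first period grade

'G2': second period grade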

In [None]:
feat = ['Walc', 'absences', 'G1', 'G2']

# Model and Optimization

 Random Forest Regression model

Use a RandomForestRegressor with Walc, absences, G1, and G2 to predict the final grade G3, and evaluate the prediction accuracy.

In [None]:
# Random Forest Regression model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Separate features and target variable
features = df[feat]  # Features
target = df['G3']  # Target variable

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Initialize the Random Forest Regression with specified parameters
RFR = RandomForestRegressor(random_state=100, criterion='squared_error', max_depth=30, min_samples_leaf=5, n_jobs=1)

# Train the regression
RFR.fit(X_train, y_train)

# Predict on the testing data
y_pred = RFR.predict(X_test)

mse = mean_squared_error(y_test, y_pred)

rmse = sqrt(mse)

r2 = r2_score(y_test, y_pred)

print("RFR Mean Squared Error MSE:", mse)
print("RFR Root Mean Squared Error RMSE:", rmse)
print("RFR R^2 Score:", r2)
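
For reference, the three metrics can be computed by hand: MSE is the mean of the squared errors, RMSE is its square root (so it is in grade points), and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A worked example on four made-up values:

```python
from math import sqrt

# Made-up true grades and predictions
y_true = [10, 12, 14, 16]
y_hat = [11, 12, 13, 17]

n = len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))  # 1 + 0 + 1 + 1 = 3
mean_true = sum(y_true) / n                                 # 13.0
ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # 9 + 1 + 1 + 9 = 20

mse_demo = ss_res / n            # 3 / 4
rmse_demo = sqrt(mse_demo)       # square root of the MSE
r2_demo = 1 - ss_res / ss_tot    # 1 - 3 / 20

print(mse_demo, round(rmse_demo, 3), round(r2_demo, 2))
```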

Use GridSearchCV to try multiple combinations of random forest parameters and pick the best setting for this data.

In [None]:
# Using GridSearchCV to find the best-performing Random Forest Regression model (optimization)
from sklearn.model_selection import GridSearchCV
number = [5,11,13,41,42,101]
numbers = list(range(1, 31))
param_grid = {'criterion': ["squared_error", "absolute_error"],
              'random_state' : number,
              'n_jobs' : [1, -1],
              'max_depth' :  numbers}
grid = GridSearchCV(RandomForestRegressor(),param_grid,cv = 5)
grid.fit(X_train,y_train)
grid.best_params_
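
`n_jobs` only controls parallelism and does not change the fitted model, and searching over `random_state` tunes the seed rather than the model. A leaner variant, sketched here on synthetic data from `make_regression` (an assumption standing in for the student features), fixes the seed, searches only parameters that shape the trees, and scores by RMSE. The grid values are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the student dataset)
X_demo, y_demo = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)

# Search only parameters that change the shape of the trees
demo_grid = {
    "max_depth": [2, 4, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=25, random_state=0),  # fixed seed
    demo_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    n_jobs=1,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("CV RMSE:", round(-search.best_score_, 2))
```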

Retrieve the best RandomForestRegressor found by GridSearchCV (with its best parameters), ready to use directly.

In [None]:
grid.best_estimator_

Use the tuned random forest model to predict G3 and compute its MSE, RMSE, and R².

In [None]:
grid_predictions = grid.predict(X_test)

mse = mean_squared_error(y_test, grid_predictions)

rmse = sqrt(mse)

r2 = r2_score(y_test, grid_predictions)

print("Optimized RFR Mean Squared Error MSE:", mse)
print("Optimized RFR Root Mean Squared Error RMSE:", rmse)
print("Optimized RFR R^2 Score:", r2)

List all configurable parameter names (hyperparameters) of RandomForestRegressor, to see what can be tuned.

In [None]:
# Get the list of available parameters in Random Forest Regression model
parameters = RandomForestRegressor().get_params().keys()

# Print the list of available parameters
print(parameters)

 Decision Tree Regression model


Use a decision tree regression model (DecisionTreeRegressor) to predict G3 and evaluate it with MSE, RMSE, and R².

In [None]:
# Decision Tree Regression model
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Separate features and target variable
features = df[feat]  # Features
target = df['G3']  # Target variable

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Initialize the Decision Tree Regression with specified parameters
DTR = DecisionTreeRegressor(random_state=100, criterion='squared_error', max_depth=30, min_samples_leaf=5)

# Train the regression
DTR.fit(X_train, y_train)

# Predict on the testing data
y_pred = DTR.predict(X_test)

mse = mean_squared_error(y_test, y_pred)

rmse = sqrt(mse)

r2 = r2_score(y_test, y_pred)

print("DTR Mean Squared Error MSE:", mse)
print("DTR Root Mean Squared Error RMSE:", rmse)
print("DTR R^2 Score:", r2)

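Use cross-validation and a parameter search to pick, from thousands of combinations, the decision tree regression settings that predict G3 most accurately.

In [None]:
# Using GridSearchCV to find the best-performing Decision Tree Regression model (optimization)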
from sklearn.model_selection import GridSearchCV
number = [5,11,13,41,42,101]
numbers = list(range(1, 31))
param_grid = {'random_state': number,
              'criterion' : ["squared_error", "absolute_error", "friedman_mse", "poisson"],
              'max_depth' : numbers,
              'min_samples_leaf' :  numbers}
grid = GridSearchCV(DecisionTreeRegressor(),param_grid,cv = 5)
grid.fit(X_train,y_train)
grid.best_params_

Retrieve the best DecisionTreeRegressor fitted by GridSearchCV (with its best parameter settings), which can be used directly for prediction.

In [None]:
grid.best_estimator_

Use the best decision tree regression model found by the grid search to predict G3, and quantify its performance with MSE, RMSE, and R².

In [None]:
grid_predictions = grid.predict(X_test)

mse = mean_squared_error(y_test, grid_predictions)

rmse = sqrt(mse)

r2 = r2_score(y_test, grid_predictions)

print("Optimized DTR Mean Squared Error MSE:", mse)
print("Optimized DTR Root Mean Squared Error RMSE:", rmse)
print("Optimized DTR R^2 Score:", r2)

List all tunable hyperparameter names (keys) of DecisionTreeRegressor, for use in a grid search or model fine-tuning.

In [None]:
# Get the list of available parameters in Decision Tree Regression model
parameters = DecisionTreeRegressor().get_params().keys()

# Print the list of available parameters
print(parameters)

Linear Regression model


The approach is essentially the same as for the two models above.

In [None]:
# Linear Regression model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Separate features and target variable
features = df[feat]  # Features
target = df['G3']  # Target variable

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Initialize the Linear Regression with specified parameters
LR = LinearRegression(fit_intercept= True ,n_jobs = 1)

# Train the regression
LR.fit(X_train, y_train)

# Predict on the testing data
y_pred = LR.predict(X_test)

mse = mean_squared_error(y_test, y_pred)

rmse = sqrt(mse)

r2 = r2_score(y_test, y_pred)

print("LR Mean Squared Error MSE:", mse)
print("LR Root Mean Squared Error RMSE:", rmse)
print("LR R^2 Score:", r2)

In [None]:
# Using GridSearchCV to find the best-performing Linear Regression model (optimization)
from sklearn.model_selection import GridSearchCV

param_grid = {'fit_intercept': [True, False],
              'n_jobs' : [1, -1]}
grid = GridSearchCV(LinearRegression(),param_grid,cv = 5)
grid.fit(X_train,y_train)
grid.best_params_

In [None]:
grid.best_estimator_

In [None]:
grid_predictions = grid.predict(X_test)

mse = mean_squared_error(y_test, grid_predictions)

rmse = sqrt(mse)

r2 = r2_score(y_test, grid_predictions)

print("Optimized LR Mean Squared Error MSE:", mse)
print("Optimized LR Root Mean Squared Error RMSE:", rmse)
print("Optimized LR R^2 Score:", r2)

In [None]:
# Get the list of available parameters in Linear Regression model
parameters = LinearRegression().get_params().keys()

# Print the list of available parameters
print(parameters)

Support Vector Machine Regression model


Same approach as above.

In [None]:
# Support Vector Machine Regression model
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

# Separate features and target variable
features = df[feat]  # Features
target = df['G3']  # Target variable

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Initialize the Support Vector Machine Regression  with specified parameters
SVMR = SVR(kernel ='poly')

# Train the regression
SVMR.fit(X_train, y_train)

# Predict on the testing data
y_pred = SVMR.predict(X_test)

mse = mean_squared_error(y_test, y_pred)

rmse = sqrt(mse)

r2 = r2_score(y_test, y_pred)


print("SVMR Mean Squared Error MSE:", mse)
print("SVMR Root Mean Squared Error RMSE:", rmse)
print("SVMR R^2 Score:", r2)
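
SVR is sensitive to feature scale, and the features here mix a 1 to 5 rating (`Walc`) with 0 to 20 grades and absence counts. One common remedy, sketched on synthetic data, is to wrap the model in a pipeline with `StandardScaler` so that the scaling is fitted on the training split only. The data and parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (not the student dataset)
X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler's mean and variance are learned from X_tr only, then reused on X_te
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel="linear", C=10.0))
scaled_svr.fit(X_tr, y_tr)

print("R^2 on held-out data:", round(scaled_svr.score(X_te, y_te), 3))
```

Inside a grid search, the pipeline's SVR parameters are addressed with the step-name prefix, for example `svr__C` or `svr__kernel`.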

In [None]:
# Using GridSearchCV to find the best-performing Support Vector Machine Regression model (optimization)
from sklearn.model_selection import GridSearchCV
number = [5,11,13,41,42,101]
numbers = list(range(1, 11))
# Note: 'degree' is only used by the 'poly' kernel, which is not in this grid,
# so the ten 'degree' values below repeat identical fits
param_grid = {'gamma' : ['scale', 'auto'],
              'kernel' : ['linear', 'rbf', 'sigmoid'],
              'degree' :  numbers}
grid = GridSearchCV(SVR(),param_grid,refit=True, verbose=3, cv = 5)
grid.fit(X_train,y_train)
grid.best_params_

In [None]:
grid.best_estimator_

In [None]:
grid_predictions = grid.predict(X_test)

mse = mean_squared_error(y_test, grid_predictions)

rmse = sqrt(mse)

r2 = r2_score(y_test, grid_predictions)

print("Optimized SVMR Mean Squared Error MSE:", mse)
print("Optimized SVMR Root Mean Squared Error RMSE:", rmse)
print("Optimized SVMR R^2 Score:", r2)

In [None]:
# Get the list of available parameters in Support Vector Machine Regression model
parameters = SVR().get_params().keys()

# Print the list of available parameters
print(parameters)

XGBoost Regression model


Same approach as above.

In [None]:
# XGBoost Regression model
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Separate features and target variable
features = df[feat]  # Features
target = df['G3']  # Target variable

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Initialize the XGBoost Regression with specified parameters
XGBR = XGBRegressor(gamma= 0.3, random_state= 42, n_estimators=11, n_jobs= -1, max_depth=10)

# Train the regression
XGBR.fit(X_train, y_train)

# Predict on the testing data
y_pred = XGBR.predict(X_test)

mse = mean_squared_error(y_test, y_pred)

rmse = sqrt(mse)

r2 = r2_score(y_test, y_pred)

print("XGBR Mean Squared Error MSE:", mse)
print("XGBR Root Mean Squared Error RMSE:", rmse)
print("XGBR R^2 Score:", r2)

In [None]:
# Using GridSearchCV to find the best-performing XGBoost Regression model (optimization)
from sklearn.model_selection import GridSearchCV
number = [5,11,13,41,42,101]
numbers = list(range(1, 11))
param_grid = {'random_state' : number,
              'n_estimators' : numbers,
              'n_jobs' :  [1, -1],
              'max_depth': numbers}
grid = GridSearchCV(XGBRegressor(), param_grid, cv = 5)
grid.fit(X_train,y_train)
grid.best_params_

In [None]:
grid.best_estimator_

In [None]:
grid_predictions = grid.predict(X_test)

mse = mean_squared_error(y_test, grid_predictions)

rmse = sqrt(mse)

r2 = r2_score(y_test, grid_predictions)

print("Optimized XGBR Mean Squared Error MSE:", mse)
print("Optimized XGBR Root Mean Squared Error RMSE:", rmse)
print("Optimized XGBR R^2 Score:", r2)

In [None]:
# Get the list of available parameters in XGBoost Regression model
parameters = XGBRegressor().get_params().keys()

# Print the list of available parameters
print(parameters)

K Nearest Neighbors Regression model


Same approach as above.

In [None]:
# K Nearest Neighbors Regression model
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Separate features and target variable
features = df[feat]  # Features
target = df['G3']  # Target variable

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Initialize the K Nearest Neighbors Regression model with specified parameters
KNNR = KNeighborsRegressor(n_neighbors= 7, n_jobs= 1, metric= 'manhattan')

# Train the regression
KNNR.fit(X_train, y_train)

# Predict on the testing data
y_pred = KNNR.predict(X_test)

mse = mean_squared_error(y_test, y_pred)

rmse = sqrt(mse)

r2 = r2_score(y_test, y_pred)

print("KNNR Mean Squared Error MSE:", mse)
print("KNNR Root Mean Squared Error RMSE:", rmse)
print("KNNR R^2 Score:", r2)
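
To show what the regressor computes, the sketch below predicts a query point by hand as the mean target of its k nearest training points under Manhattan distance, then compares the result with `KNeighborsRegressor`. The five training points are made up.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up training points and targets
X_demo = np.array([[0, 0], [1, 1], [2, 2], [3, 3], [10, 10]], dtype=float)
y_demo = np.array([5.0, 7.0, 9.0, 11.0, 20.0])
query = np.array([[1.4, 1.0]])
k = 3

# Manhattan distance: sum of absolute coordinate differences
distances = np.abs(X_demo - query).sum(axis=1)
nearest = np.argsort(distances)[:k]     # indices of the k closest points
manual = y_demo[nearest].mean()         # unweighted mean of their targets

model = KNeighborsRegressor(n_neighbors=k, metric="manhattan").fit(X_demo, y_demo)
print(manual, model.predict(query)[0])
```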

In [None]:
# Using GridSearchCV to find the best-performing K Nearest Neighbors Regression model (optimization)
from sklearn.model_selection import GridSearchCV
number = [5,11,13,41,42,101]
numbers = list(range(1, 51))
param_grid = {'n_neighbors': number,
              'n_jobs' :  [1, -1],
              'metric' : ['manhattan','euclidean','minkowski']}
grid = GridSearchCV(KNeighborsRegressor(),param_grid,cv = 5)
grid.fit(X_train,y_train)
grid.best_params_

In [None]:
grid.best_estimator_

In [None]:
grid_predictions = grid.predict(X_test)

mse = mean_squared_error(y_test, grid_predictions)

rmse = sqrt(mse)

r2 = r2_score(y_test, grid_predictions)

print("Optimized KNNR Mean Squared Error MSE:", mse)
print("Optimized KNNR Root Mean Squared Error RMSE:", rmse)
print("Optimized KNNR R^2 Score:", r2)

In [None]:
# Get the list of available parameters in K Nearest Neighbors Regression model
parameters = KNeighborsRegressor().get_params().keys()

# Print the list of available parameters
print(parameters)