# Regression Quiz

**Dataset link:** https://drive.google.com/file/d/12njuYiHbyaXxOO7g_gCVTOCoOUeeqg8j/view?usp=sharing

Data Dictionary (column description)
- Gender: Gender of the student (male/female)
- EthnicGroup: Ethnic group of the student (group A to E)
- ParentEduc: Parent(s) education background (from some_highschool to master's degree)
- LunchType: School lunch type (standard or free/reduced)
- TestPrep: Test preparation course followed (completed or none)
- ParentMaritalStatus: Parent(s) marital status (married/single/widowed/divorced)
- PracticeSport: How often the student practices sport (never/sometimes/regularly)
- IsFirstChild: If the child is first child in the family or not (yes/no)
- NrSiblings: Number of siblings the student has (0 to 7)
- TransportMeans: Means of transport to school (schoolbus/private)
- WklyStudyHours: Weekly self-study hours (less than 5 hrs; between 5 and 10 hrs; more than 10 hrs)
- MathScore: Math test score (0-100)
- ReadingScore: Reading test score (0-100)
- WritingScore: Writing test score (0-100)
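
The encoders used later in this notebook match category labels as exact strings, so it can help to list the distinct values of each categorical column before encoding. A minimal sketch on a small hypothetical frame (the helper name `distinct_values` and the toy rows are assumptions for illustration, not part of the dataset):

```python
import pandas as pd


def distinct_values(frame: pd.DataFrame) -> dict:
    """Map each non-numeric column to its sorted distinct non-null values."""
    return {
        col: sorted(frame[col].dropna().unique())
        for col in frame.select_dtypes(exclude="number").columns
    }


# Toy rows standing in for the real CSV (hypothetical data).
toy = pd.DataFrame({
    "ParentEduc": ["some college", "master's degree", None, "some college"],
    "TransportMeans": ["school_bus", None, "private", "school_bus"],
    "NrSiblings": [3.0, 0.0, 4.0, 1.0],
})

values = distinct_values(toy)
for col, vals in values.items():
    print(col, "->", vals)
```

Running the same helper on the real `df` would show the exact spellings to use in any hand-written category list.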

In [35]:
import pandas as pd
import numpy as np

df = pd.read_csv("Expanded_data_with_more_features.csv")

print(df.head())
# Shape (rows, columns)
# print("Shape:", df.shape)
print(df.info())
print(df.isna().sum())

   Unnamed: 0  Gender EthnicGroup          ParentEduc     LunchType TestPrep  \
0           0  female         NaN   bachelor's degree      standard     none   
1           1  female     group C        some college      standard      NaN   
2           2  female     group B     master's degree      standard     none   
3           3    male     group A  associate's degree  free/reduced     none   
4           4    male     group C        some college      standard     none   

  ParentMaritalStatus PracticeSport IsFirstChild  NrSiblings TransportMeans  \
0             married     regularly          yes         3.0     school_bus   
1             married     sometimes          yes         0.0            NaN   
2              single     sometimes          yes         4.0     school_bus   
3             married         never           no         1.0            NaN   
4             married     sometimes          yes         0.0     school_bus   

  WklyStudyHours  MathScore  ReadingScore  W

In [36]:
# Fill categorical with mode
cat_cols = [
    'EthnicGroup', 'ParentEduc', 'TestPrep',
    'ParentMaritalStatus', 'PracticeSport',
    'IsFirstChild', 'TransportMeans', 'WklyStudyHours'
]

for col in cat_cols:
    # Assign the result back: chained fillna(..., inplace=True) acts on a
    # copy and will have no effect on df from pandas 3.0 onward.
    df[col] = df[col].fillna(df[col].mode()[0])

# Fill numeric with median
df['NrSiblings'] = df['NrSiblings'].fillna(df['NrSiblings'].median())

# Verify
print(df.isna().sum())

Unnamed: 0             0
Gender                 0
EthnicGroup            0
ParentEduc             0
LunchType              0
TestPrep               0
ParentMaritalStatus    0
PracticeSport          0
IsFirstChild           0
NrSiblings             0
TransportMeans         0
WklyStudyHours         0
MathScore              0
ReadingScore           0
WritingScore           0
dtype: int64
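
The mode/median imputation step can also be written as a single `DataFrame.fillna` call that takes a `{column: value}` mapping. The sketch below uses a toy frame (hypothetical data) rather than the real CSV, and is one possible way to express the step, not the notebook's own code:

```python
import pandas as pd

# Toy frame with one gap per column (hypothetical data).
toy = pd.DataFrame({
    "TestPrep": ["none", None, "completed", "none"],
    "NrSiblings": [3.0, None, 1.0, 2.0],
})

# One {column: fill_value} mapping: mode for text, median for numbers.
fill_values = {
    "TestPrep": toy["TestPrep"].mode()[0],
    "NrSiblings": toy["NrSiblings"].median(),
}

# fillna on the whole frame returns a new DataFrame, so there is no
# chained assignment on an intermediate column object.
filled = toy.fillna(fill_values)
print(filled)
```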


In [37]:
# Number of duplicate rows
print("Duplicate rows:", df.duplicated().sum())

# See first few duplicates (if any)
print(df[df.duplicated()].head())


Duplicate rows: 0
Empty DataFrame
Columns: [Unnamed: 0, Gender, EthnicGroup, ParentEduc, LunchType, TestPrep, ParentMaritalStatus, PracticeSport, IsFirstChild, NrSiblings, TransportMeans, WklyStudyHours, MathScore, ReadingScore, WritingScore]
Index: []
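
One caveat with this check: `Unnamed: 0` looks like a row counter carried over from the CSV (an assumption based on the `df.head()` output). Wherever that counter differs between two rows, the rows cannot compare equal even if every other field matches. A sketch of a check that ignores the counter column, on toy data:

```python
import pandas as pd

# Toy frame: rows 0 and 2 differ only in the counter column (hypothetical data).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Gender": ["female", "male", "female"],
    "MathScore": [71, 69, 71],
})

# Full-row comparison: the counter makes every row distinct.
full_dupes = int(toy.duplicated().sum())

# Compare only the substantive columns.
content_cols = [c for c in toy.columns if c != "Unnamed: 0"]
content_dupes = int(toy.duplicated(subset=content_cols).sum())

print("Duplicates (all columns):", full_dupes)
print("Duplicates (ignoring counter):", content_dupes)
```

Whether repeated student records should be dropped is a separate modelling decision; the sketch only shows how to surface them.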


In [38]:
# import pandas as pd
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# # --------------------------------
# # 1. Target & features
# # --------------------------------
# y = df['MathScore']
# X = df.drop(columns=['MathScore', 'Unnamed: 0'], errors='ignore')

# # --------------------------------
# # 2. Column groups
# # --------------------------------
# ordinal_cols = ['ParentEduc', 'WklyStudyHours', 'PracticeSport']
# ordinal_categories = [
#     ["some_highschool", "highschool", "some_college",
#      "associate's degree", "bachelor's degree", "master's degree"],  # ParentEduc
#     ["< 5", "5 - 10", "> 10"],                                       # WklyStudyHours
#     ["never", "sometimes", "regularly"]                              # PracticeSport
# ]

# binary_cols = ['Gender', 'LunchType', 'TestPrep', 'IsFirstChild', 'TransportMeans']
# binary_categories = [
#     ["female", "male"],            # Gender
#     ["free/reduced", "standard"],  # LunchType
#     ["none", "completed"],         # TestPrep
#     ["no", "yes"],                 # IsFirstChild
#     ["schoolbus", "private"]       # TransportMeans
# ]

# nominal_cols = ['EthnicGroup', 'ParentMaritalStatus']
# numeric_cols = ['NrSiblings', 'ReadingScore', 'WritingScore']

# # --------------------------------
# # 3. ColumnTransformer with sklearn.preprocessing
# # --------------------------------
# preprocessor = ColumnTransformer(
#     transformers=[
#         ('ord', OrdinalEncoder(categories=ordinal_categories,
#                                handle_unknown='use_encoded_value', unknown_value=-1),
#          ordinal_cols),
#         ('bin', OrdinalEncoder(categories=binary_categories,
#                                handle_unknown='use_encoded_value', unknown_value=-1),
#          binary_cols),
#         ('nom', OneHotEncoder(handle_unknown='ignore', drop='first'),
#          nominal_cols),
#         ('num', 'passthrough', numeric_cols)
#     ],
#     remainder='drop'
# )

# # --------------------------------
# # 4. Fit & transform
# # --------------------------------
# X_encoded = preprocessor.fit_transform(X)

# print("Shape before encoding:", X.shape)
# print("Shape after encoding:", X_encoded.shape)


In [40]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score

# === 1) Target/Features
y = df['MathScore']
X = df.drop(columns=['MathScore', 'Unnamed: 0'], errors='ignore')

# === 2) Column groups
ordinal_cols = ['ParentEduc', 'WklyStudyHours', 'PracticeSport']
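ordinal_categories = [
    # Labels must match the data exactly; unmatched values are encoded as -1.
    # df.head() shows "some college" with a space. Verify the remaining
    # labels against df['ParentEduc'].unique().
    ["some_highschool", "highschool", "some college",
     "associate's degree", "bachelor's degree", "master's degree"],  # ParentEduc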
    ["< 5", "5 - 10", "> 10"],                                       # WklyStudyHours
    ["never", "sometimes", "regularly"]                              # PracticeSport
]

binary_cols = ['Gender', 'LunchType', 'TestPrep', 'IsFirstChild', 'TransportMeans']
binary_categories = [
    ["female", "male"],            # Gender
    ["free/reduced", "standard"],  # LunchType
    ["none", "completed"],         # TestPrep
    ["no", "yes"],                 # IsFirstChild
    ["school_bus", "private"]      # TransportMeans (df.head() shows "school_bus")
]

nominal_cols = ['EthnicGroup', 'ParentMaritalStatus']
numeric_cols = ['NrSiblings', 'ReadingScore', 'WritingScore']

# === 3) Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# === 4) Preprocessing pipelines
# Ordinal
ordinal_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=ordinal_categories,
                               handle_unknown='use_encoded_value', unknown_value=-1))
])

# Binary
binary_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=binary_categories,
                               handle_unknown='use_encoded_value', unknown_value=-1))
])

# Nominal
nominal_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', drop='first'))
])

# Numeric
numeric_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Combine in ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('ord', ordinal_preprocessor, ordinal_cols),
        ('bin', binary_preprocessor, binary_cols),
        ('nom', nominal_preprocessor, nominal_cols),
        ('num', numeric_preprocessor, numeric_cols)
    ],
    remainder='drop'
)

# === 5) Final pipeline
model = Pipeline(steps=[
    ('prep', preprocessor),
    ('reg', LinearRegression())
])

# === 6) Train
model.fit(X_train, y_train)

# === 7) Evaluate
y_pred = model.predict(X_test)
print("MAE :", mean_absolute_error(y_test, y_pred))
print("RMSE:", root_mean_squared_error(y_test, y_pred))
print("R²  :", r2_score(y_test, y_pred))


MAE : 4.367727502059757
RMSE: 5.464513578668688
R²  : 0.8713057792671536
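
For reference, the three metrics can be computed by hand from the residuals. The sketch below uses four hypothetical scores and predictions (not values from this dataset) and plain NumPy, following the standard definitions of MAE, RMSE, and R²:

```python
import numpy as np

y_true = np.array([70.0, 55.0, 90.0, 65.0])  # hypothetical test scores
y_hat = np.array([68.0, 60.0, 85.0, 65.0])   # hypothetical predictions

errors = y_true - y_hat

# Mean absolute error: average size of the residuals.
mae = np.mean(np.abs(errors))

# Root mean squared error: penalises large residuals more heavily.
rmse = np.sqrt(np.mean(errors ** 2))

# R^2: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MAE :", mae)
print("RMSE:", rmse)
print("R2  :", r2)
```

MAE and RMSE are in the same units as the target, so the values reported above read as points on the 0-100 math score scale.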
