Task 1: Load and Split Preprocessed Data

In [37]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# Load the data from the CSV file
data = pd.read_csv('exam_scores.csv')

# Define the target variable
target_variable = 'MathScore'  # Choose the target variable (MathScore, ReadingScore, or WritingScore)

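# Drop rows with missing values in the target variable, then reset the index
# so the column-wise concatenation further down lines up row for row
data = data.dropna(subset=[target_variable]).reset_index(drop=True)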

# Separate the features by dropping the target column
features = data.drop(columns=[target_variable])

# One-hot encode categorical variables
categorical_cols = ['Gender', 'EthnicGroup', 'ParentEduc', 'LunchType', 'TestPrep', 'ParentMaritalStatus', 'PracticeSport', 'IsFirstChild', 'TransportMeans', 'WklyStudyHours']
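# Note: scikit-learn >= 1.2 names this argument sparse_output; older releases use sparse=False
encoder = OneHotEncoder(drop='first', sparse_output=False)
# Keep the original row index so the encoded columns align with the numerical ones
encoded_features = pd.DataFrame(encoder.fit_transform(features[categorical_cols]), index=features.index)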

# Assign column names to encoded features
encoded_feature_names = encoder.get_feature_names_out(categorical_cols)
encoded_features.columns = encoded_feature_names

# Concatenate encoded features with numerical features
preprocessed_data = pd.concat([encoded_features, features.drop(columns=categorical_cols)], axis=1)

# Fill remaining missing values with each column's mean (fitted on all rows, before the split)
imputer = SimpleImputer(strategy='mean')
preprocessed_data = pd.DataFrame(imputer.fit_transform(preprocessed_data), columns=preprocessed_data.columns)

# Extract the target column
target = data[target_variable]

# Split the data into training and test sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(preprocessed_data, target, test_size=0.3, random_state=42)



Task 2: Choose an Algorithm

Linear regression is a simple and widely used algorithm for regression tasks. It models the dependent variable (target) as a linear combination of the independent variables (features) plus an intercept. Fitting the model means estimating the coefficients that minimize the sum of squared differences between the predicted and actual target values (ordinary least squares).
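As a rough illustration of what "minimizing the sum of squared differences" means, the sketch below fits a toy problem twice: once by solving the least-squares problem directly with NumPy, and once with LinearRegression. The toy data and the names X_toy, y_toy, beta, and toy_model are made up for this example and are not part of the exam-score data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (invented for illustration): y = 2*x1 - 3*x2 + 5, with no noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 2.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1] + 5.0

# Least squares by hand: append a column of ones so the last coefficient is the intercept
X_design = np.column_stack([X_toy, np.ones(len(X_toy))])
beta, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)

# The same fit through scikit-learn
toy_model = LinearRegression().fit(X_toy, y_toy)

print("lstsq coefficients and intercept:", beta)
print("sklearn coefficients:", toy_model.coef_, "intercept:", toy_model.intercept_)
```

Both approaches should recover essentially the same coefficients, since LinearRegression solves the same ordinary least squares problem.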

Task 3: Train and Test a Model

In [38]:
# Train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Test the model on the test set
y_pred = model.predict(X_test)


Task 4: Evaluate the Model

In [39]:
# Calculate the mean squared error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Calculate the coefficient of determination (R-squared)
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2}")

Mean Squared Error: 29.779818462250265
R-squared: 0.873855802089955


Task 5: Summary

The steps taken to train and evaluate the model were: loading and preprocessing the data, splitting it into training and test sets (70% / 30%), choosing the linear regression algorithm, training the model on the training set, generating predictions on the test set, and evaluating those predictions using mean squared error (MSE) and the coefficient of determination (R-squared).

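After training, it is worth inspecting the coefficients of the linear regression model. Each coefficient is the estimated change in the target variable for a one-unit increase in that feature, holding the other features fixed. For a one-hot encoded column, it is the estimated difference from the dropped reference category. The sign gives the direction of the association (positive or negative). Magnitudes are only directly comparable between features measured on the same scale, and a large coefficient does not by itself establish statistical significance.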
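One way to inspect the coefficients is to pair them with the feature names and sort by absolute value. The sketch below uses a tiny invented dataset in place of X_train and y_train. The column names TestPrep_none and NrSiblings, and the names X_demo, y_demo, and demo_model, are hypothetical and chosen only for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for X_train / y_train (values invented for illustration)
X_demo = pd.DataFrame({
    "TestPrep_none": [0, 1, 0, 1, 0, 1],
    "NrSiblings": [1, 2, 0, 3, 2, 1],
})
y_demo = 70 - 5 * X_demo["TestPrep_none"] + 0.5 * X_demo["NrSiblings"]

demo_model = LinearRegression().fit(X_demo, y_demo)

# Label each coefficient with its feature name, then order by absolute size
coefs = pd.Series(demo_model.coef_, index=X_demo.columns)
coefs_sorted = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs_sorted)
```

With the notebook's own objects, the same pattern would presumably be pd.Series(model.coef_, index=X_train.columns).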

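The MSE measures how far the predictions are from the actual target values on average, in squared units. A lower MSE means the predictions are closer to the actual values. The MSE of about 29.78 reported in Task 4 corresponds to a root mean squared error of about 5.46 score points. The R-squared of about 0.874 means the model accounts for roughly 87% of the variance of the target in the test set.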
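To make the metrics concrete, the sketch below computes MSE, RMSE, and R-squared by hand on four invented values. The arrays y_true_demo and y_pred_demo are made up for illustration and are not taken from the notebook's results.

```python
import numpy as np

# Invented actual and predicted scores
y_true_demo = np.array([60.0, 75.0, 82.0, 91.0])
y_pred_demo = np.array([62.0, 73.0, 85.0, 90.0])

# MSE: mean of the squared residuals
residuals = y_true_demo - y_pred_demo
mse_demo = np.mean(residuals ** 2)

# RMSE: square root of the MSE, in the same units as the target
rmse_demo = np.sqrt(mse_demo)

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print(f"MSE: {mse_demo}, RMSE: {rmse_demo:.3f}, R-squared: {r2_demo:.3f}")
```

These are the same definitions that mean_squared_error and r2_score apply to y_test and y_pred.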