### Explanation of the data


- Gender:
Represents the student's gender (male or female).
A categorical feature often used in demographic analysis.

- EthnicGroup:
Represents the student's ethnic background (e.g., group A, group B, group C).
Another categorical variable that might relate to performance patterns.

- ParentEduc:
Indicates the highest level of education attained by the student’s parent (e.g., bachelor's degree, some college, master's degree).
A categorical variable that could correlate with the student's academic performance.

- LunchType:
Refers to the type of lunch the student receives (standard or free/reduced).
A socio-economic indicator that may influence academic performance.

- TestPrep:
Denotes whether the student completed test preparation (none, completed).
A categorical variable relevant to their scores.

- ParentMaritalStatus:
The marital status of the student’s parents (married, single, divorced).
May provide socio-demographic context.

- PracticeSport:
Frequency of the student practicing sports (regularly, sometimes, never).
Could impact mental and physical well-being, influencing academic performance.

- IsFirstChild:
Indicates if the student is the first child in the family (yes or no).
A categorical variable that might provide insight into family dynamics affecting education.

- NrSiblings:
The number of siblings the student has.
A numerical variable, potentially related to resource allocation within the family.

- TransportMeans:
How the student travels to school (e.g., school_bus, car, walk).
A categorical variable that might correlate with punctuality or attendance.

- WklyStudyHours:
The student’s weekly study hours, binned into ranges (< 5, 5 - 10, > 10).
A categorical variable closely tied to academic performance.

- MathScore:
The student’s score in Math.
A numerical target variable for predictive tasks (e.g., regression).

- ReadingScore:
The student’s score in Reading.
Another numerical variable for prediction or analysis.

- WritingScore:
The student’s score in Writing.
Also a numerical variable, likely to be analyzed alongside Math and Reading scores.


# 1. Import Libraries and Load Data

Dataset from Kaggle:
https://www.kaggle.com/datasets/desalegngeb/students-exam-scores

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder  # Encodes categorical data (e.g., strings) into numerical format.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# mean_squared_error: the average squared difference between the actual and predicted values (lower is better).
# r2_score: how well the regression model fits the data. 1 is a perfect fit, 0 matches always predicting the mean, and negative values are worse than that.
from sklearn.impute import SimpleImputer

# Load datasets
df = pd.read_csv('/content/Expanded_data_with_more_features.csv')

# Preview the expanded dataset
df.head()


ImportError: DLL load failed while importing _multiarray_umath: The specified module could not be found.

ModuleNotFoundError: No module named 'sklearn'

In [None]:
df.columns

In [None]:
# Drop the 'Unnamed: 0' column: it is a leftover row index from the CSV export, not a feature
df = df.drop(columns=['Unnamed: 0'])

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
# Handle missing values:
# For numerical columns, use the mean or median to fill missing values
numerical_columns = ['NrSiblings', 'WklyStudyHours', 'MathScore', 'ReadingScore', 'WritingScore']
numerical_imputer = SimpleImputer(strategy='mean')  # Can also use 'median' if necessary

# Convert the 'WklyStudyHours' column to numeric values
def convert_study_hours(value):
    if isinstance(value, str):  # Only process strings
        if value == "< 5":
            return 4  # Treat "< 5" as 4 hours of study
        elif value == "> 10":
            return 12  # Treat "> 10" as 12 hours of study
        elif value == "10-May":
            return 7.5  # "10-May" is "5 - 10" misread as a date by spreadsheet software; use the same midpoint as the range branch below
        elif ' - ' in value:  # For ranges like '5 - 10'
            # Extract the two numbers and return the average
            lower, upper = value.split(' - ')
            return (float(lower) + float(upper)) / 2
    # For non-string values (numerical), return the value as is
    return value

# Apply conversion function to the 'WklyStudyHours' column
df['WklyStudyHours'] = df['WklyStudyHours'].apply(convert_study_hours)

# Now handle missing values for the rest of the numerical columns
df[numerical_columns] = numerical_imputer.fit_transform(df[numerical_columns])

# Display the updated DataFrame
df
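
The label-to-number conversion in this cell can also be written as a lookup table, which makes every mapping visible at a glance. The following is a minimal sketch, not the notebook's own method: `STUDY_HOURS_MAP` and `study_hours_to_numeric` are hypothetical names, and the representative hour values are assumptions that should be adjusted to whatever the conversion function uses.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table: each label maps to one representative number of hours.
# "10-May" is assumed to be "5 - 10" after a spreadsheet misread it as a date.
STUDY_HOURS_MAP = {
    "< 5": 4.0,
    "5 - 10": 7.5,
    "10-May": 7.5,
    "> 10": 12.0,
}

def study_hours_to_numeric(series: pd.Series) -> pd.Series:
    """Map study-hour labels to floats; unknown labels and NaN become NaN."""
    return series.map(STUDY_HOURS_MAP).astype(float)

# Small made-up sample, including a missing value and an unexpected label
sample = pd.Series(["< 5", "5 - 10", "10-May", "> 10", np.nan, "unknown"])
converted = study_hours_to_numeric(sample)
print(converted.tolist())
```

Because unexpected labels turn into NaN instead of staying as strings, the mean imputer that follows can handle them without raising a conversion error.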


# 2. Data Preprocessing

In [None]:
# For categorical columns, use the mode to fill missing values
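# 'Gender' and 'LunchType' are included so that no string columns remain when the model is fitted
categorical_columns = ['Gender', 'EthnicGroup', 'ParentEduc', 'LunchType', 'TestPrep',
                       'ParentMaritalStatus', 'PracticeSport', 'IsFirstChild', 'TransportMeans']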

categorical_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_columns] = categorical_imputer.fit_transform(df[categorical_columns])

# Now, handle categorical columns by encoding them
label_encoder = LabelEncoder()

# Encode categorical variables into numeric labels
for col in categorical_columns:
    df[col] = label_encoder.fit_transform(df[col])   # LabelEncoder is used to convert categorical labels into numeric labels.
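
Label encoding assigns integers in alphabetical order, so a linear model will read an ordering into categories that have none (for example, transport means). One common alternative is one-hot encoding. The sketch below uses `pandas.get_dummies` on a made-up toy frame; the values are illustrative and this is a swapped-in technique, not what the notebook does.

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the dataset description.
toy = pd.DataFrame({
    "TransportMeans": ["school_bus", "car", "school_bus"],
    "PracticeSport": ["never", "sometimes", "regularly"],
    "NrSiblings": [1, 0, 3],
})

# One indicator column per category. drop_first=True drops the first category
# of each column so the indicators are not perfectly collinear.
encoded = pd.get_dummies(
    toy,
    columns=["TransportMeans", "PracticeSport"],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
```

Each remaining indicator column gets its own regression coefficient, interpreted relative to the dropped baseline category.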

# 3. Split the Data into Features and Target Variables

In [None]:
df.isnull().sum()

In [None]:
# Define features and target variables
X = df.drop(['MathScore', 'ReadingScore', 'WritingScore'], axis=1)
y_math = df['MathScore']
y_reading = df['ReadingScore']
y_writing = df['WritingScore']


# 4. Train-Test Split

In [None]:
# Split data into training and testing sets.
# A single call keeps the same rows in train/test for the features and all three targets.
(X_train, X_test,
 y_train_math, y_test_math,
 y_train_reading, y_test_reading,
 y_train_writing, y_test_writing) = train_test_split(
    X, y_math, y_reading, y_writing, test_size=0.2, random_state=42)


# 5. Model Training and Prediction (Linear Regression)

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Drop columns with all NaN values from X_train and X_test
X_train = X_train.dropna(axis=1, how='all')
X_test = X_test.dropna(axis=1, how='all')

# Recalculate the numerical and categorical columns after cleaning
numerical_cols = X_train.select_dtypes(include=['float64', 'int64']).columns
categorical_cols = X_train.select_dtypes(include=['object']).columns

# Initialize the imputers for numerical and categorical columns
numerical_imputer = SimpleImputer(strategy='mean')
categorical_imputer = SimpleImputer(strategy='most_frequent')

# Apply the imputers only to non-empty numerical columns
X_train[numerical_cols] = numerical_imputer.fit_transform(X_train[numerical_cols])

# Apply the imputers only if categorical columns exist
if len(categorical_cols) > 0:
    X_train[categorical_cols] = categorical_imputer.fit_transform(X_train[categorical_cols])

# Apply the imputers on the test set as well
X_test[numerical_cols] = numerical_imputer.transform(X_test[numerical_cols])

# Apply the categorical imputer only if categorical columns exist
if len(categorical_cols) > 0:
    X_test[categorical_cols] = categorical_imputer.transform(X_test[categorical_cols])

# Check for NaNs in X_train and X_test
print(f"Checking NaN values in X_train: {X_train.isnull().sum().any()}")
print(f"Checking NaN values in X_test: {X_test.isnull().sum().any()}")
print(f"Checking NaN values in y_train_math: {y_train_math.isnull().sum()}")

# Initialize Linear Regression models
model_math = LinearRegression()
model_reading = LinearRegression()
model_writing = LinearRegression()

# Train the models
model_math.fit(X_train, y_train_math)
model_reading.fit(X_train, y_train_reading)
model_writing.fit(X_train, y_train_writing)

# Make predictions
y_pred_math = model_math.predict(X_test)
y_pred_reading = model_reading.predict(X_test)
y_pred_writing = model_writing.predict(X_test)
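
Since the three models share the same features, they could also be fitted in one step: scikit-learn's `LinearRegression` accepts a two-dimensional target and fits one set of coefficients per column. A minimal sketch on synthetic data follows; `X_demo`, `Y_demo`, and `multi_model` are hypothetical names, and the random data only stands in for the real features and scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: 100 samples, 4 features, 3 targets (e.g. Math, Reading, Writing)
X_demo = rng.normal(size=(100, 4))
true_coefs = rng.normal(size=(4, 3))
Y_demo = X_demo @ true_coefs + rng.normal(scale=0.1, size=(100, 3))

# One fit call handles all three targets
multi_model = LinearRegression().fit(X_demo, Y_demo)

preds = multi_model.predict(X_demo[:5])
print(preds.shape)              # one row per sample, one column per target
print(multi_model.coef_.shape)  # one row of coefficients per target
```

The predictions are the same as those from three separate models; the benefit is only less repeated code.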


# 6. Evaluate the Model Performance

In [None]:
# Display results
print("Math Score Predictions:", y_pred_math)
print("Reading Score Predictions:", y_pred_reading)
print("Writing Score Predictions:", y_pred_writing)


In [None]:
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier  # Example classifier

# Example training data for classification task
training_data = pd.DataFrame({
    'Gender': ['female', 'male', 'female', 'male', 'female'],
    'EthnicGroup': ['group A', 'group B', 'group C', 'group A', 'group B'],
    'ParentEduc': ['bachelor\'s degree', 'high school', 'some college', 'high school', 'master\'s degree'],
    'LunchType': ['standard', 'free/reduced', 'standard', 'free/reduced', 'standard'],
    'TestPrep': ['none', 'completed', 'none', 'completed', 'none'],
    'ParentMaritalStatus': ['married', 'single', 'married', 'single', 'married'],
    'PracticeSport': ['regularly', 'never', 'regularly', 'never', 'regularly'],
    'IsFirstChild': ['yes', 'no', 'yes', 'no', 'yes'],
    'NrSiblings': [1, 3, 2, 0, 1],
    'TransportMeans': ['school_bus', 'car', 'car', 'school_bus', 'school_bus'],
    'WklyStudyHours': ['< 5', '5-10', '> 10', '5-10', '< 5'],
    'Performance': ['pass', 'fail', 'pass', 'fail', 'pass']  # Target variable
})

# Save feature names
training_feature_names = [
    'Gender', 'EthnicGroup', 'ParentEduc', 'LunchType', 'TestPrep',
    'ParentMaritalStatus', 'PracticeSport', 'IsFirstChild',
    'NrSiblings', 'TransportMeans', 'WklyStudyHours'
]

# Encode categorical variables
label_encoders = {}
categorical_cols = [
    'Gender', 'EthnicGroup', 'ParentEduc', 'LunchType', 'TestPrep',
    'ParentMaritalStatus', 'PracticeSport', 'IsFirstChild', 'TransportMeans', 'WklyStudyHours'
]

for col in categorical_cols:
    label_encoders[col] = LabelEncoder().fit(training_data[col])
    training_data[col] = label_encoders[col].transform(training_data[col])

# Split data into features and target
X = training_data[training_feature_names]
y = training_data['Performance']

# Encode the target variable
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(y)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a model (replace RandomForestClassifier with your model)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

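# Print the classification report.
# Passing labels explicitly keeps the report valid even if the tiny test set contains only one class.
report = classification_report(y_test, y_pred,
                               labels=list(range(len(target_encoder.classes_))),
                               target_names=target_encoder.classes_,
                               zero_division=0)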
print("\nClassification Report:\n", report)
