<a href="https://colab.research.google.com/github/Abhinav-gowda/student-performance-predection-ML-and-Modelling-project/blob/main/student_performance_predection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd

df = pd.read_csv('/content/StudentPerformanceFactors.csv')

**Reasoning**:
Display the first few rows of the DataFrame to verify successful loading.



In [None]:
df.head()

## Explore and preprocess the data

### Subtask:
Analyze the dataset to understand its structure, identify missing values, and handle categorical features. This might involve steps like one-hot encoding or label encoding.


**Reasoning**:
Print a concise summary of the DataFrame (index dtype, column dtypes, non-null counts, and memory usage), then count the missing values in each column.



In [None]:
df.info()
print('\nMissing values per column:')
print(df.isnull().sum())

**Reasoning**:
Identify the categorical columns and apply one-hot encoding to them, handling missing values in categorical columns if any.



In [None]:
categorical_cols = df.select_dtypes(include='object').columns

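# Handle missing values in categorical columns by filling with the mode.
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which is deprecated chained assignment in recent pandas versions.
for col in ['Teacher_Quality', 'Parental_Education_Level', 'Distance_from_Home']:
    if col in categorical_cols:
        df[col] = df[col].fillna(df[col].mode()[0])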

df_encoded = pd.get_dummies(df, columns=categorical_cols, dummy_na=False)
display(df_encoded.head())
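
To make the imputation and encoding step concrete, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and are not taken from `StudentPerformanceFactors.csv`; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame (made-up values) with one numeric and one categorical column.
toy = pd.DataFrame({
    "Hours_Studied": [10, 20, 15, 12],
    "Teacher_Quality": ["High", None, "Medium", "High"],
})

# Fill the missing category with the most frequent value (the mode).
mode_value = toy["Teacher_Quality"].mode()[0]
toy["Teacher_Quality"] = toy["Teacher_Quality"].fillna(mode_value)

# One-hot encode: each category becomes its own indicator column.
toy_encoded = pd.get_dummies(toy, columns=["Teacher_Quality"])
print(toy_encoded)
```

Each distinct category turns into a `Teacher_Quality_<value>` column, which is why the encoded DataFrame has many more columns than the raw one.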

## Split the data

### Subtask:
Split the data into training and testing sets to evaluate the model's performance on unseen data.


**Reasoning**:
Define the features and the target variable, then use train_test_split to split the encoded data into training and testing sets.



In [None]:
from sklearn.model_selection import train_test_split

X = df_encoded.drop('Exam_Score', axis=1)
y = df_encoded['Exam_Score']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Choose and train a model

### Subtask:
Select a suitable regression model (e.g., Linear Regression, Ridge, Lasso, or a tree-based model like RandomForestRegressor) and train it on the training data.


**Reasoning**:
Train a RandomForestRegressor model on the training data.



In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
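
The subtask lists several candidate regressors. One way to choose between them is to compare cross-validated scores before committing to a model. The sketch below does this on synthetic data from `make_regression`, which is an assumption standing in for the student dataset, so the scores say nothing about which model is best for the real data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption; not the student dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Mean 5-fold cross-validated R² for each candidate.
cv_scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, est in candidates.items()
}

for name, score in cv_scores.items():
    print(f"{name}: mean R² = {score:.3f}")
```

On the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, so the test set stays untouched during model selection.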

## Evaluate the model

### Subtask:
Evaluate the trained model using appropriate metrics like R², Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).


**Reasoning**:
Calculate MAE, RMSE, and R² score to evaluate the model's performance on the test set.



In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f'Mean Absolute Error (MAE): {mae:.2f}')
print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')
print(f'R-squared (R²): {r2:.2f}')
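
As a worked example of what these three metrics measure, the sketch below computes them by hand with the standard library only. The four score pairs are made up for illustration and are not model output.

```python
import math

# Made-up actual and predicted scores, only to illustrate the formulas.
actual = [70, 65, 80, 75]
predicted = [68, 66, 77, 75]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]  # [2, -1, 3, 0]

# MAE: mean of absolute errors -> (2 + 1 + 3 + 0) / 4 = 1.5
mae_demo = sum(abs(e) for e in errors) / n

# RMSE: square root of the mean squared error -> sqrt((4 + 1 + 9 + 0) / 4)
rmse_demo = math.sqrt(sum(e ** 2 for e in errors) / n)

# R²: 1 - (residual sum of squares / total sum of squares around the mean)
mean_actual = sum(actual) / n                              # 72.5
ss_res = sum(e ** 2 for e in errors)                       # 14
ss_tot = sum((a - mean_actual) ** 2 for a in actual)       # 125.0
r2_demo = 1 - ss_res / ss_tot                              # 0.888

print(f"MAE={mae_demo:.2f}, RMSE={rmse_demo:.2f}, R²={r2_demo:.3f}")
```

RMSE squares each error before averaging, so it is never smaller than MAE and grows faster when a few predictions are far off.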

## Visualize results

### Subtask:
Create visualizations to show the model's predictions and explore the relationships between features and student performance.


**Reasoning**:
Plot actual against predicted exam scores to check how closely predictions follow the diagonal, then plot a histogram of the residuals to check whether the errors are centered around zero.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of Actual vs. Predicted Exam Scores
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2) # Diagonal line
plt.xlabel("Actual Exam Score")
plt.ylabel("Predicted Exam Score")
plt.title("Actual vs. Predicted Exam Scores")
plt.grid(True)
plt.show()

# Histogram of Residuals
residuals = y_test - y_pred
plt.figure(figsize=(8, 6))
sns.histplot(residuals, kde=True)
plt.xlabel("Residuals (Actual - Predicted)")
plt.ylabel("Frequency")
plt.title("Histogram of Residuals")
plt.show()

## Analyze feature importance

### Subtask:
Analyze which features have the most significant impact on student performance using the trained Random Forest Regressor model.


**Reasoning**:
Access the feature importances from the trained Random Forest model, create a pandas Series, sort it, and then visualize the top features.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Get feature importances
feature_importances = model.feature_importances_

# Create a pandas Series with feature names as index
feature_importance_series = pd.Series(feature_importances, index=X_train.columns)

# Sort feature importances in descending order
sorted_feature_importances = feature_importance_series.sort_values(ascending=False)

# Select the top N features (e.g., top 15)
top_n = 15
top_features = sorted_feature_importances.head(top_n)

# Create a bar plot of the top N most important features
plt.figure(figsize=(12, 8))
sns.barplot(x=top_features.values, y=top_features.index)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title(f'Top {top_n} Most Important Features for Student Performance Prediction')
plt.tight_layout()
plt.show()

# Print the sorted feature importances
print("Sorted Feature Importances:")
display(sorted_feature_importances)

## Summarize findings

### Subtask:
Present the model evaluation results, visualizations, and analytical insights in a clear and concise manner.


**Reasoning**:
Combine the model evaluation metrics, visualization insights, and feature importance analysis into a comprehensive summary.



In [None]:
print("## Model Evaluation and Insights Summary")

print("\n### Model Performance Metrics")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R²): {r2:.2f}")
print("\nInterpretation:")
print(f"- MAE indicates that, on average, the model's predictions are off by approximately {mae:.2f} exam score points.")
print(f"- RMSE provides a similar measure of error magnitude, with a value of {rmse:.2f}.")
print(f"- R² suggests that approximately {r2:.0%} of the variance in exam scores can be explained by the features in the model.")

print("\n### Insights from Visualizations")
print("The scatter plot of actual vs. predicted exam scores shows a positive correlation, with points generally clustering around the diagonal line.")
print("This indicates that the model is capturing the overall trend in student performance, although there is some spread, suggesting room for improvement.")
print("The histogram of residuals shows that the errors are roughly centered around zero and appear somewhat normally distributed.")
print("This suggests that the model's errors are not strongly biased in one direction.")

print("\n### Key Takeaways from Feature Importance Analysis")
print(f"The analysis of feature importance revealed the top {top_n} most influential factors in predicting student performance:")
display(top_features)
print("\nInterpretation of Top Features:")
print("- **Attendance** and **Hours_Studied** are the most significant predictors, highlighting the direct impact of student effort and presence.")
print("- **Previous_Scores** is also a strong indicator, suggesting that past academic performance is a key determinant of future success.")
print("- Other important factors include **Tutoring_Sessions**, **Physical_Activity**, and **Sleep_Hours**, which relate to support systems, well-being, and study habits.")
print("- Factors like **Access_to_Resources_High**, **Parental_Involvement**, and **Family_Income_High** also play a notable role, indicating the influence of external support and socioeconomic factors.")

## Summary:

### Data Analysis Key Findings

*   The dataset contains 6607 entries with 20 columns covering various factors related to student performance.
*   Missing values were found in 'Teacher\_Quality' (78), 'Parental\_Education\_Level' (90), and 'Distance\_from\_Home' (67) columns.
*   Missing categorical values were successfully imputed with the mode of their respective columns.
*   Categorical features were successfully converted into numerical format using one-hot encoding, resulting in 41 columns in the encoded DataFrame.
*   The data was split into training (80%) and testing (20%) sets for model training and evaluation.
*   A Random Forest Regressor model was trained on the training data.
*   The model evaluation metrics are: Mean Absolute Error (MAE): 1.08, Root Mean Squared Error (RMSE): 2.17, and R-squared (R²): 0.67. This indicates that the model's predictions are, on average, off by about 1.08 exam score points, and it explains approximately 67% of the variance in exam scores.
*   Visualizations show a positive correlation between actual and predicted exam scores, with predictions generally clustering around the ideal line, indicating the model captures the overall trend.
*   The histogram of residuals suggests that prediction errors are roughly centered around zero and somewhat normally distributed, implying no strong directional bias in errors.
*   Feature importance analysis identified 'Attendance' and 'Hours\_Studied' as the most significant predictors of student performance.
*   'Previous\_Scores' is also a strong predictor, followed by factors like 'Tutoring\_Sessions', 'Physical\_Activity', and 'Sleep\_Hours'.
*   Factors related to external support and socioeconomic status, such as 'Access\_to\_Resources\_High', 'Parental\_Involvement', and 'Family\_Income\_High', also showed notable importance.

### Insights or Next Steps

*   The current model is a reasonable baseline (R² = 0.67); other regression algorithms or hyperparameter tuning may improve performance.
*   The key features identified (Attendance, Hours Studied, Previous Scores) point to where educators and parents could focus interventions to improve student performance.
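
The hyperparameter tuning mentioned as a next step could look roughly like the sketch below. It uses `GridSearchCV` on synthetic data from `make_regression`; the data and the small parameter grid are assumptions chosen to keep the example fast, not settings tuned for the student dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption; not the student dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Illustrative grid: every combination is evaluated with 3-fold cross-validation.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated R²: {search.best_score_:.3f}")
```

With the notebook's data, the search would be fitted on `X_train` and `y_train`, and `search.best_estimator_` would then be evaluated once on the held-out test set.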


# Task
Create a simple web application in Google Colab that uses the trained `RandomForestRegressor` model to predict student exam scores based on user input. The web app should have an interface for users to enter feature values and display the predicted score.

## Save the trained model

### Subtask:
Save the trained `RandomForestRegressor` model to a file so it can be loaded later by the web app.


**Reasoning**:
Save the trained RandomForestRegressor model to a file using joblib.



In [None]:
import joblib

joblib.dump(model, 'random_forest_model.joblib')

## Create a prediction function

### Subtask:
Define a Python function that takes user input for the relevant features, preprocesses it to match the format expected by the model, and returns a predicted exam score.


**Reasoning**:
Define a function to take user input, preprocess it, load the saved model, and predict the exam score.



In [None]:
import joblib
import pandas as pd

def predict_exam_score(input_features):
    """
    Predicts student exam score based on input features.

    Args:
        input_features (dict): A dictionary where keys are feature names
                               and values are the input values.

    Returns:
        float: The predicted exam score.
    """
    # Create a DataFrame from the input features
    input_df = pd.DataFrame([input_features])

    # Ensure column order matches training data
    # This is crucial for correct prediction with one-hot encoded features
    # Assuming X_train.columns is available from previous steps
    # (if not, you would need to load the training feature names or recreate them)
    input_df = input_df.reindex(columns=X_train.columns, fill_value=0)

    # Load the trained model
    loaded_model = joblib.load('random_forest_model.joblib')

    # Make prediction
    predicted_score = loaded_model.predict(input_df)[0]

    return predicted_score

# Example usage (replace with actual input logic in the web app)
# This is just for testing the function
# Example input dictionary - make sure to match the structure of your one-hot encoded features
# You would need to dynamically create this dictionary based on user input from the web form
example_input = {
    'Hours_Studied': 20,
    'Attendance': 85,
    'Sleep_Hours': 7,
    'Previous_Scores': 70,
    'Tutoring_Sessions': 1,
    'Physical_Activity': 3,
    'Parental_Involvement_High': False,
    'Parental_Involvement_Low': True,
    'Parental_Involvement_Medium': False,
    'Access_to_Resources_High': True,
    'Access_to_Resources_Low': False,
    'Access_to_Resources_Medium': False,
    'Extracurricular_Activities_No': True,
    'Extracurricular_Activities_Yes': False,
    'Motivation_Level_High': False,
    'Motivation_Level_Low': True,
    'Motivation_Level_Medium': False,
    'Internet_Access_No': False,
    'Internet_Access_Yes': True,
    'Family_Income_High': False,
    'Family_Income_Low': True,
    'Family_Income_Medium': False,
    'Teacher_Quality_High': False,
    'Teacher_Quality_Low': False,
    'Teacher_Quality_Medium': True,
    'School_Type_Private': False,
    'School_Type_Public': True,
    'Peer_Influence_Negative': False,
    'Peer_Influence_Neutral': True,
    'Peer_Influence_Positive': False,
    'Learning_Disabilities_No': True,
    'Learning_Disabilities_Yes': False,
    'Parental_Education_Level_College': False,
    'Parental_Education_Level_High School': True,
    'Parental_Education_Level_Postgraduate': False,
    'Distance_from_Home_Far': False,
    'Distance_from_Home_Moderate': True,
    'Distance_from_Home_Near': False,
    'Gender_Female': False,
    'Gender_Male': True
}

predicted_score = predict_exam_score(example_input)
print(f"Predicted Exam Score: {predicted_score:.2f}")
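
The function above relies on `X_train` still being in memory. One possible way to remove that dependency is to save the training column order together with the model, as in the sketch below. The toy data, the dictionary layout, and the file name `model_bundle.joblib` are assumptions for illustration, not part of the original notebook.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Tiny made-up training frame standing in for the encoded features.
X_demo = pd.DataFrame({
    "Hours_Studied": [10, 20, 15, 25, 5, 30],
    "Attendance": [80, 90, 85, 95, 60, 98],
    "Gender_Male": [1, 0, 1, 0, 1, 0],
})
y_demo = [62, 70, 66, 74, 58, 78]

demo_model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Bundle the model with the column order it was trained on.
bundle_path = os.path.join(tempfile.mkdtemp(), "model_bundle.joblib")
joblib.dump({"model": demo_model, "columns": list(X_demo.columns)}, bundle_path)

# Later (for example in a fresh session): load the bundle and align a new row.
bundle = joblib.load(bundle_path)
row = pd.DataFrame([{"Attendance": 88, "Hours_Studied": 18}])  # Gender_Male omitted
row = row.reindex(columns=bundle["columns"], fill_value=0)     # reorder, fill missing with 0
prediction = bundle["model"].predict(row)[0]
print(f"Predicted score: {prediction:.2f}")
```

Because the column list travels with the model, the prediction code no longer depends on rerunning the preprocessing cells first.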

## Design the web app interface

### Subtask:
Create a simple user interface using a library like `Gradio` or `Streamlit` to allow users to input feature values.


**Reasoning**:
Import the gradio library and define the Gradio interface components based on the features in the training data, handling the mapping from user-friendly categorical inputs to the one-hot encoded format within the prediction function.



In [None]:
import gradio as gr
import joblib  # needed for joblib.load below if this cell runs in a fresh session
import pandas as pd

def predict_exam_score_gradio(hours_studied, attendance, parental_involvement, access_to_resources,
                              extracurricular_activities, sleep_hours, previous_scores, motivation_level,
                              internet_access, tutoring_sessions, family_income, teacher_quality,
                              school_type, peer_influence, physical_activity, learning_disabilities,
                              parental_education_level, distance_from_home, gender):
    """
    Predicts student exam score based on user input from Gradio interface.

    Args:
        hours_studied (int): Hours studied.
        attendance (int): Attendance percentage.
        parental_involvement (str): Level of parental involvement.
        access_to_resources (str): Access to resources level.
        extracurricular_activities (str): Participation in extracurricular activities.
        sleep_hours (int): Hours of sleep.
        previous_scores (int): Previous exam scores.
        motivation_level (str): Motivation level.
        internet_access (str): Internet access availability.
        tutoring_sessions (int): Number of tutoring sessions.
        family_income (str): Family income level.
        teacher_quality (str): Teacher quality level.
        school_type (str): School type.
        peer_influence (str): Peer influence type.
        physical_activity (int): Level of physical activity.
        learning_disabilities (str): Presence of learning disabilities.
        parental_education_level (str): Parental education level.
        distance_from_home (str): Distance from home.
        gender (str): Gender.

    Returns:
        float: The predicted exam score.
    """
    # Create a dictionary for the input features
    input_features = {
        'Hours_Studied': hours_studied,
        'Attendance': attendance,
        'Sleep_Hours': sleep_hours,
        'Previous_Scores': previous_scores,
        'Tutoring_Sessions': tutoring_sessions,
        'Physical_Activity': physical_activity,
        'Gender_Female': gender == 'Female',
        'Gender_Male': gender == 'Male',  # Evaluates to False when nothing is selected (None)
        'Parental_Involvement_High': parental_involvement == 'High',
        'Parental_Involvement_Low': parental_involvement == 'Low',
        'Parental_Involvement_Medium': parental_involvement == 'Medium',
        'Access_to_Resources_High': access_to_resources == 'High',
        'Access_to_Resources_Low': access_to_resources == 'Low',
        'Access_to_Resources_Medium': access_to_resources == 'Medium',
        'Extracurricular_Activities_No': extracurricular_activities == 'No',
        'Extracurricular_Activities_Yes': extracurricular_activities == 'Yes',
        'Motivation_Level_High': motivation_level == 'High',
        'Motivation_Level_Low': motivation_level == 'Low',
        'Motivation_Level_Medium': motivation_level == 'Medium',
        'Internet_Access_No': internet_access == 'No',
        'Internet_Access_Yes': internet_access == 'Yes',
        'Family_Income_High': family_income == 'High',
        'Family_Income_Low': family_income == 'Low',
        'Family_Income_Medium': family_income == 'Medium',
        'Teacher_Quality_High': teacher_quality == 'High',
        'Teacher_Quality_Low': teacher_quality == 'Low',
        'Teacher_Quality_Medium': teacher_quality == 'Medium',
        'School_Type_Private': school_type == 'Private',
        'School_Type_Public': school_type == 'Public',
        'Peer_Influence_Negative': peer_influence == 'Negative',
        'Peer_Influence_Neutral': peer_influence == 'Neutral',
        'Peer_Influence_Positive': peer_influence == 'Positive',
        'Learning_Disabilities_No': learning_disabilities == 'No',
        'Learning_Disabilities_Yes': learning_disabilities == 'Yes',
        'Parental_Education_Level_College': parental_education_level == 'College',
        'Parental_Education_Level_High School': parental_education_level == 'High School',
        'Parental_Education_Level_Postgraduate': parental_education_level == 'Postgraduate',
        'Distance_from_Home_Far': distance_from_home == 'Far',
        'Distance_from_Home_Moderate': distance_from_home == 'Moderate',
        'Distance_from_Home_Near': distance_from_home == 'Near'
    }

    # Convert boolean values to integers (0 or 1)
    for key, value in input_features.items():
        if isinstance(value, bool):
            input_features[key] = int(value)


    # Create a DataFrame from the input features
    input_df = pd.DataFrame([input_features])

    # Ensure column order matches training data
    # Assuming X_train.columns is available from previous steps
    input_df = input_df.reindex(columns=X_train.columns, fill_value=0)

    # Load the trained model
    loaded_model = joblib.load('random_forest_model.joblib')

    # Make prediction
    predicted_score = loaded_model.predict(input_df)[0]

    return predicted_score

# Define Gradio input components
inputs = [
    gr.Number(label="Hours Studied", minimum=0, maximum=50),
    gr.Number(label="Attendance (%)", minimum=0, maximum=100),
    gr.Dropdown(['Low', 'Medium', 'High'], label="Parental Involvement"),
    gr.Dropdown(['Low', 'Medium', 'High'], label="Access to Resources"),
    gr.Radio(['Yes', 'No'], label="Extracurricular Activities"),
    gr.Number(label="Sleep Hours", minimum=0, maximum=12),
    gr.Number(label="Previous Scores", minimum=0, maximum=100),
    gr.Dropdown(['Low', 'Medium', 'High'], label="Motivation Level"),
    gr.Radio(['Yes', 'No'], label="Internet Access"),
    gr.Number(label="Tutoring Sessions", minimum=0),
    gr.Dropdown(['Low', 'Medium', 'High'], label="Family Income"),
    gr.Dropdown(['Low', 'Medium', 'High'], label="Teacher Quality"),
    gr.Radio(['Public', 'Private'], label="School Type"),
    gr.Dropdown(['Negative', 'Neutral', 'Positive'], label="Peer Influence"),
    gr.Number(label="Physical Activity (hours/week)", minimum=0),
    gr.Radio(['Yes', 'No'], label="Learning Disabilities"),
    gr.Dropdown(['High School', 'College', 'Postgraduate'], label="Parental Education Level"),
    gr.Dropdown(['Near', 'Moderate', 'Far'], label="Distance from Home"),
    gr.Radio(['Male', 'Female'], label="Gender")
]

# Define Gradio output component
output = gr.Label(label="Predicted Exam Score")

# Create Gradio Interface
interface = gr.Interface(
    fn=predict_exam_score_gradio,
    inputs=inputs,
    outputs=output,
    title="Student Exam Score Prediction"
)

# Launch the interface
interface.launch(debug=True)

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://0d93f44a16ceb7629c.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
