# 📊 Predicting Student Scores Based on Study Hours

This project demonstrates a simple yet powerful application of **linear regression** to predict student performance based on the number of hours they study. Using a dataset of students' study hours and corresponding scores, we build a machine learning model that learns the relationship between these two variables.

The workflow includes:
- Loading and exploring the dataset
- Splitting the data into training and testing sets
- Training a linear regression model
- Making predictions on unseen data
- Evaluating the model using metrics like R² score, Mean Squared Error (MSE), and Mean Absolute Error (MAE)

This project is ideal for beginners in data science and machine learning who want to understand how supervised learning works in a real-world context.

## Importing Essential Python Libraries for Data Analysis and Visualization

Before diving into data analysis and visualization, we need to import several key Python libraries:

- **pandas**: A powerful library for data manipulation and analysis. It provides data structures like DataFrames that make it easy to handle structured data.
- **numpy**: A fundamental package for numerical computing in Python. It offers support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions.
- **matplotlib.pyplot**: A popular plotting library used for creating static, animated, and interactive visualizations in Python.
- **seaborn**: Built on top of matplotlib, seaborn provides a high-level interface for drawing attractive and informative statistical graphics.

These libraries form the foundation for most data science and machine learning workflows in Python.

In [2]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns 

## Loading and Previewing the Dataset

We begin by importing the dataset using **pandas**. The `read_excel()` function reads data from an Excel file named `student_scores_large.xlsx` and loads it into a DataFrame called `df`, which serves as the primary structure for analyzing and manipulating the data. Reading `.xlsx` files requires an Excel engine such as `openpyxl` to be installed alongside pandas.

To get a quick overview of the dataset, we use `df.head()`, which displays the first five rows. This helps us understand the structure of the data, including column names and sample values.

In [5]:
df = pd.read_excel('student_scores_large.xlsx')
df.head()

|   | Hours | Scores |
|---|-------|--------|
| 0 | 4.49  | 43.4   |
| 1 | 11.41 | 100.0  |
| 2 | 8.78  | 82.1   |
| 3 | 7.18  | 61.2   |
| 4 | 1.87  | 19.8   |


## Exploring Dataset Structure and Data Types

To gain a better understanding of the dataset, we use the `df.info()` method. This function provides a concise summary of the DataFrame, including:

- The number of entries (rows) and columns
- Column names and their data types (e.g., integer, float, object)
- The number of non-null (non-missing) values in each column
- Memory usage of the DataFrame

This overview is helpful for identifying missing data and understanding the structure before performing further analysis.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450 entries, 0 to 449
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Hours   450 non-null    float64
 1   Scores  450 non-null    float64
dtypes: float64(2)
memory usage: 7.2 KB


## Checking the Dimensions of the Dataset

To understand the size of our dataset, we use the `df.shape` attribute. This returns a tuple representing the number of rows and columns in the DataFrame:

- The first value indicates the total number of rows (observations).
- The second value indicates the total number of columns (features).

This quick check helps us gauge the scale of the data we're working with and informs decisions about memory usage and processing strategies.

In [7]:
df.shape

(450, 2)

## Generating Summary Statistics for Numerical Columns

To quickly understand the distribution and central tendencies of the dataset, we use the `df.describe()` method. This function computes summary statistics for all numerical columns, including:

- **Count**: Number of non-null entries
- **Mean**: Average value
- **Standard deviation (std)**: Measure of spread
- **Min and Max**: Minimum and maximum values
- **25%, 50%, 75%**: Percentile values (quartiles)

These statistics help identify patterns, detect outliers, and guide further data cleaning or analysis.

In [8]:
df.describe()

|       | Hours    | Scores    |
|-------|----------|-----------|
| count | 450.0    | 450.0     |
| mean  | 5.946156 | 52.333778 |
| std   | 3.556427 | 28.989509 |
| min   | 0.06     | 0.0       |
| 25%   | 2.8775   | 27.075    |
| 50%   | 6.135    | 52.9      |
| 75%   | 9.0475   | 76.5      |
| max   | 11.89    | 100.0     |
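
As a rough illustration of what these summary rows mean, the sketch below recomputes them with Python's standard `statistics` module for the five `Hours` values shown in the `df.head()` preview, not the full 450-row dataset. The variable names are illustrative, and the `inclusive` quantile method is used on the assumption that it corresponds to the linear interpolation pandas applies by default.

```python
import statistics

# The five Hours values from the df.head() preview (not the full dataset)
hours = [4.49, 11.41, 8.78, 7.18, 1.87]

count = len(hours)
mean = statistics.mean(hours)
std = statistics.stdev(hours)  # sample standard deviation (divides by n - 1)

# Quartiles with linear interpolation between neighbouring sorted values
q1, median, q3 = statistics.quantiles(hours, n=4, method="inclusive")

print(f"count={count}, mean={mean:.3f}, std={std:.3f}")
print(f"min={min(hours)}, 25%={q1}, 50%={median}, 75%={q3}, max={max(hours)}")
```

With only five values the quartiles fall exactly on data points; on the full dataset they are usually interpolated between two neighbours, which is why `describe()` reports values such as 2.8775.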


## Identifying Missing Values in the Dataset

To check for missing data, we use the expression `df.isnull().sum()`, which chains two steps:

1. `df.isnull()` returns a DataFrame of the same shape with `True` for each missing value and `False` otherwise.
2. `.sum()` then adds up the `True` values column-wise, giving the total count of missing entries in each column.

This is a crucial step in data cleaning, helping us decide whether to impute, drop, or otherwise handle missing values before analysis.

In [9]:
df.isnull().sum()

Hours     0
Scores    0
dtype: int64

### Interpretation of Missing Values Output

The output shows that both columns, **Hours** and **Scores**, have zero missing values:

- `Hours     0`: No missing entries in the "Hours" column.
- `Scores    0`: No missing entries in the "Scores" column.

This means the dataset is complete for these two variables, and we can proceed with analysis without needing to handle missing data for them.
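
For contrast, here is a minimal sketch of what the same check looks like when a value is missing, along with two common ways to handle it. The `toy` DataFrame and its values are hypothetical and are not part of the project dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing score (not the project dataset)
toy = pd.DataFrame({"Hours": [1.0, 2.5, 4.0], "Scores": [15.0, np.nan, 42.0]})

# Count missing entries per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: fill the gap with the mean of the observed scores
imputed = toy.fillna({"Scores": toy["Scores"].mean()})
print(imputed)
```

Dropping rows is the simplest option but discards data; mean imputation keeps every row at the cost of slightly shrinking the variance of the filled column.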

## Importing Machine Learning Tools from scikit-learn

To build and evaluate a linear regression model, we import several essential components from the **scikit-learn** library:

- `train_test_split`: Splits the dataset into training and testing subsets, allowing us to evaluate model performance on unseen data.
- `LinearRegression`: The core class used to create and train a linear regression model.
- `mean_squared_error`, `mean_absolute_error`, `r2_score`: Metrics used to assess the accuracy and effectiveness of the model's predictions:
  - **Mean Squared Error (MSE)**: Measures the average squared difference between predicted and actual values.
  - **Mean Absolute Error (MAE)**: Measures the average absolute difference between predicted and actual values.
  - **R² Score**: Indicates how well the model explains the variability of the target variable (closer to 1 means better fit).

These tools are fundamental for building, training, and evaluating regression models in supervised learning tasks.

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
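
To make the three metric definitions concrete, the following sketch implements them in plain Python on a tiny hand-made example. The function names and the sample values are illustrative only; in the notebook itself the scikit-learn functions imported above are used.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]

print(mean_squared_error_manual(y_true, y_pred))   # (0.25 + 0 + 1) / 3
print(mean_absolute_error_manual(y_true, y_pred))  # (0.5 + 0 + 1) / 3 = 0.5
print(r2_score_manual(y_true, y_pred))             # 1 - 1.25 / 8 = 0.84375
```

The R² formula shows why a value near 1 indicates a good fit: the residual sum of squares is small relative to the spread of the actual values around their mean.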

## Selecting Features and Target Variable

To prepare the data for modeling, we define:

- `x`: The **feature variable**, representing the number of hours studied. This will be used as the input for our model.
- `y`: The **target variable**, representing the corresponding scores achieved. This is what we aim to predict.

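By isolating these columns from the DataFrame, we set up the data for training a supervised learning model using linear regression. Note that the slice `df.iloc[:, :1]` keeps `x` as a two-dimensional DataFrame, which is the shape scikit-learn expects for features, while `df.iloc[:, -1]` returns `y` as a one-dimensional Series.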

In [16]:
x = df.iloc[:, :1]
y = df.iloc[:, -1]
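
The difference between the two `iloc` selections can be checked on a small frame. This sketch rebuilds the five rows from the `df.head()` preview by hand; the name `sample` is illustrative.

```python
import pandas as pd

# First five rows from the df.head() preview, rebuilt by hand for illustration
sample = pd.DataFrame({
    "Hours": [4.49, 11.41, 8.78, 7.18, 1.87],
    "Scores": [43.4, 100.0, 82.1, 61.2, 19.8],
})

x = sample.iloc[:, :1]  # column slice   -> DataFrame with shape (5, 1)
y = sample.iloc[:, -1]  # single column  -> Series with shape (5,)

print(type(x).__name__, x.shape)
print(type(y).__name__, y.shape)
```

Selecting by position works here because the frame has exactly two columns; selecting by name (`sample[["Hours"]]` and `sample["Scores"]`) would give the same result and is more robust if columns are reordered.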

## Splitting the Dataset into Training and Testing Sets

To evaluate the performance of our machine learning model, we divide the data into two subsets:

- `x_train`, `y_train`: Used to train the model.
- `x_test`, `y_test`: Used to test the model's predictions on unseen data.

We use `train_test_split()` from scikit-learn with the following parameters:
- `test_size=0.25`: Allocates 25% of the data for testing and 75% for training.
- `random_state=26`: Ensures reproducibility by setting a fixed seed for random shuffling.

This split helps us assess how well the model generalizes to new data.

In [17]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=26)
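
As a quick sanity check on the split sizes, the sketch below applies the same parameters to stand-in arrays with 450 rows. The arrays `x_demo` and `y_demo` are placeholders, not the real data. With 450 rows, a 25% test share works out to 113 test rows and 337 training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the dataset (450)
x_demo = np.arange(450).reshape(-1, 1)
y_demo = np.arange(450)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=26
)
print(len(x_tr), len(x_te))

# Repeating the call with the same seed reproduces the same split
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=26
)
print(np.array_equal(x_te, x_te2))
```

Changing `random_state` selects a different set of rows, so reported metrics can shift slightly from one seed to another.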

## Creating the Linear Regression Model

We initialize a linear regression model by creating an instance of the `LinearRegression` class from scikit-learn:

- `regressor = LinearRegression()`: This line sets up the model object, which will later be trained on the dataset using the training data.

At this stage, the model is created but not yet trained. The next step will be to fit it to the training data using the `.fit()` method.

In [20]:
regressor = LinearRegression()

## Training the Linear Regression Model

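With the training data prepared, we now fit the linear regression model by calling `.fit()` on the training features and targets. During fitting, the model estimates the slope and intercept of the line that minimizes the sum of squared errors on the training set: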


In [22]:
regressor.fit(x_train, y_train)

| Parameter     | Value |
|---------------|-------|
| fit_intercept | True  |
| copy_X        | True  |
| tol           | 1e-06 |
| n_jobs        | None  |
| positive      | False |
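
To show what fitting a one-feature linear regression amounts to, here is a minimal sketch of the closed-form least-squares solution in plain Python. It uses only the five rows from the `df.head()` preview as toy data, so the numbers it produces will differ from the model trained on the full training set. The helper `fit_line` is hypothetical and is not how scikit-learn is implemented internally.

```python
# Five (Hours, Scores) pairs from the df.head() preview, used only as toy data
hours = [4.49, 11.41, 8.78, 7.18, 1.87]
scores = [43.4, 100.0, 82.1, 61.2, 19.8]


def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


slope, intercept = fit_line(hours, scores)
residuals = [y - (intercept + slope * x) for x, y in zip(hours, scores)]

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"sum of residuals={sum(residuals):.6f}")  # close to zero for a least-squares fit
```

On a fitted scikit-learn model, the corresponding values are available as `regressor.coef_` and `regressor.intercept_`.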


## Making Predictions with the Trained Model

After training the linear regression model, we use it to make predictions on the test data by passing `x_test` to `.predict()`, which returns one predicted score per test row:



In [23]:
y_prediction = regressor.predict(x_test)

## Displaying Test Features and Predicted Values

To inspect the model's predictions, we print the test features followed by the predicted scores:


In [27]:
print(x_test)
print(y_prediction)

     Hours
14    2.18
322   7.92
243   4.44
165  10.02
59    3.90
..     ...
238   7.75
282   4.51
273  10.96
19    3.49
141   3.02

[113 rows x 1 columns]
[23.00552292 67.96745291 40.70830372 84.41693949 36.47843574 90.37008701
 34.59849442 91.93670478 27.47038356 91.23172679 22.61386848 67.73246024
 97.2632052  39.4550095  22.22221404 46.2697968  70.86569578 71.80566644
 78.54212285 52.69292965 10.15925721 20.57726538 21.04725071 73.84226955
 95.30493299 97.10654342 55.74783431 51.90962077  8.200985   41.33495083
  7.65266878 63.81591582 45.17316436 28.64534689 96.79321987 46.73978213
 19.40230205 32.40522954 29.35032489 37.65339907 64.44256293 99.06481564
 32.09190598 53.24124587 13.29249275 16.58239006 86.3752117  56.92279763
 67.18414402 17.0523754  18.85398583 90.29175613 65.06921004 93.894977
 57.47111385 58.72440807 55.90449608 35.93011952 12.58751475 46.11313502
 33.50186198 54.49454009 36.32177396 53.16291499 53.0845841  80.18707151
 56.76613586 17.83568428 19.55896383 75.252
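
Because the model is a straight line, its slope and intercept can be recovered from any two printed predictions. The sketch below does this with the first two test rows, assuming the predictions are listed in the same order as the rows of `x_test`, and then checks the result against the third and fourth rows. The `predict` helper is illustrative.

```python
# Two (Hours, predicted score) pairs copied from the printed output
h1, p1 = 2.18, 23.00552292
h2, p2 = 7.92, 67.96745291

# A line through two points: slope is rise over run
slope = (p2 - p1) / (h2 - h1)
intercept = p1 - slope * h1


def predict(hours):
    """Predicted score for a given number of study hours."""
    return intercept + slope * hours


print(f"score = {slope:.2f} * hours + {intercept:.2f}")

# Compare with the printed predictions for 4.44 and 10.02 hours
print(predict(4.44), "vs 40.70830372")
print(predict(10.02), "vs 84.41693949")
```

Read this way, the fitted model adds roughly 7.83 points to the predicted score for each additional hour of study, starting from an intercept of about 5.93.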

## Evaluating Model Performance with Regression Metrics

To assess how well our linear regression model performs, we calculate three key evaluation metrics:

- `r2_score`: Measures the proportion of variance in the target variable that is explained by the model. A value closer to 1 indicates a better fit.
- `mean_squared_error (mse)`: Represents the average of the squared differences between predicted and actual values. Lower values indicate better accuracy.
- `mean_absolute_error (mae)`: Calculates the average absolute difference between predicted and actual values, offering a more interpretable error measure.

These metrics help us understand the model's predictive accuracy and guide improvements if needed.

In [30]:
r2_score_for_y_predict = r2_score(y_test,y_prediction)
mse = mean_squared_error(y_test,y_prediction)
mae = mean_absolute_error(y_test,y_prediction)

print(r2_score_for_y_predict)
print(mse)
print(mae)


0.9483295123611838
45.659936313039545
5.376284267136397


### Interpretation of Model Evaluation Metrics

The model's performance metrics indicate a strong fit:

- **R² Score: 0.948**  
  This means that approximately 94.8% of the variance in the test-set scores is explained by the number of hours studied. It's a high value, suggesting the model captures the relationship well.

- **Mean Squared Error (MSE): 45.66**  
  On average, the squared difference between predicted and actual scores is about 45.66. Because the errors are squared, MSE is expressed in squared score points and penalizes large errors more heavily than small ones.

- **Mean Absolute Error (MAE): 5.38**  
  The average absolute difference between predicted and actual scores is around 5.38 points on a 0 to 100 score scale, which is relatively low and indicates good predictive accuracy.

Overall, these metrics suggest that the linear regression model performs well on the test data.
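
Two further quantities can be derived from the printed metrics without rerunning the model. The sketch below computes the root mean squared error (RMSE), which puts the MSE back into score points, and the error of a naive baseline that always predicts the mean test score. The baseline follows from the definition R² = 1 − MSE / baseline MSE; the variable names are illustrative.

```python
import math

# Values copied from the printed metrics
r2 = 0.9483295123611838
mse = 45.659936313039545

# RMSE: typical error size, in the same units as the scores
rmse = math.sqrt(mse)

# Rearranging R² = 1 - mse / baseline_mse gives the MSE of predicting the test mean
baseline_mse = mse / (1 - r2)
baseline_rmse = math.sqrt(baseline_mse)

print(f"RMSE: {rmse:.2f}")
print(f"Baseline RMSE (always predict the mean): {baseline_rmse:.2f}")
```

An RMSE of about 6.76 points against a baseline of roughly 29.7 points shows how much of the spread in scores the single feature accounts for.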