# The Spark Foundation (Internship, Task 1)

## Terkuma Saaondo (May 2024, Intern)

# Prediction using supervised machine learning

**Introduction to Predicting Student Scores Based on Study Hours**

Study time is one of the most commonly cited factors in student performance. In this project, we build a supervised machine learning model, a simple linear regression, that predicts a student's percentage score from the number of hours studied. The model is fitted to recorded pairs of study hours and scores, and the resulting relationship can help students and educators reason about how study time relates to results.

In [1]:
# This task is to predict the student's result based on the number of study hours

In [2]:
# Importing needed libraries
import pandas as pd
import numpy as np  
import matplotlib.pyplot as plt  
%matplotlib inline


In [7]:
# Load the data from the remote URL
url = "http://bit.ly/w-data"
s_data = pd.read_csv(url)
print("Data imported successfully")

s_data.head(10)

In [None]:
# Basic statistics of the dataset
print(s_data.describe())

# Check for missing values
print(s_data.isnull().sum())


In [None]:
# Scatter plot of hours studied against percentage scores
s_data.plot(x='Hours', y='Scores', style='o')  
plt.title('Hours vs Percentage')  
plt.xlabel('Hours Studied')  
plt.ylabel('Percentage Score')  
plt.show()

In [None]:
# Correlation Heatmap
import seaborn as sns  # needed for the heatmap and the pair plot

plt.figure(figsize=(8, 6))
correlation_matrix = s_data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()


In [None]:
# Pair Plot
sns.pairplot(s_data)
plt.show()


In [None]:
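# Preparing the data: features (Hours) and target (Scores)
X = s_data.iloc[:, :-1].values  # all columns except the last, kept as a 2-D array
y = s_data.iloc[:, 1].values    # the Scores column as a 1-D array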
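
The slicing above matters because scikit-learn estimators expect a two-dimensional feature matrix and a one-dimensional target. As a minimal sketch, the small frame `toy` below is a made-up stand-in for `s_data` (its values are illustrative, not taken from the dataset) and shows the shapes that result:

```python
import pandas as pd

# Hypothetical stand-in for s_data with the same two columns
toy = pd.DataFrame({"Hours": [1.5, 3.0, 4.5, 6.0],
                    "Scores": [20, 35, 47, 62]})

X_toy = toy.iloc[:, :-1].values  # every column except the last -> 2-D array
y_toy = toy.iloc[:, 1].values    # second column only -> 1-D array

print(X_toy.shape)  # (4, 1)
print(y_toy.shape)  # (4,)
```

Slicing with `:-1` rather than a single column index is what keeps `X_toy` two-dimensional.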

In [None]:
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                            test_size=0.2, random_state=0) 
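
To see what `test_size=0.2` and `random_state=0` do, the sketch below splits ten synthetic rows. The arrays `X_demo` and `y_demo` are assumptions made up for illustration; the real notebook splits the values read from `s_data`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic samples with a single feature each (illustrative only)
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 10 * X_demo.ravel()

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # (8, 1) (2, 1)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
print(np.array_equal(X_te, X_te2))  # True
```

Fixing `random_state` is what makes the reported results repeatable from one run to the next.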

In [None]:
# Algorithm Training 
from sklearn.linear_model import LinearRegression  
regressor = LinearRegression()  
regressor.fit(X_train, y_train) 

print("Training complete.")
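
After fitting, the learned line can be read from the `coef_` and `intercept_` attributes. The following sketch fits a separate model on synthetic points that lie exactly on the line score = 10 * hours + 5, so the recovered values can be checked by eye; `hours_demo`, `scores_demo`, and `model` are hypothetical names, not part of the notebook above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic points that lie exactly on score = 10 * hours + 5
hours_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
scores_demo = 10.0 * hours_demo.ravel() + 5.0

model = LinearRegression().fit(hours_demo, scores_demo)

slope = model.coef_[0]        # one coefficient per feature
intercept = model.intercept_  # predicted score at zero hours

# Both values should be very close to 10 and 5 respectively
print("slope:", slope)
print("intercept:", intercept)
```

On real data the points will not lie exactly on a line, so the slope is best read as the average change in score per additional hour studied.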

In [None]:
# Regression line computed from the fitted slope and intercept
line = regressor.coef_ * X + regressor.intercept_

# Plot all data points together with the fitted line
plt.scatter(X, y)
plt.plot(X, line)
plt.show()

In [None]:
# Making Predictions
print(X_test) # Testing data - In Hours
y_pred = regressor.predict(X_test) # Predicting the scores

In [None]:
# Comparing Actual vs Predicted
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})  
df 

In [None]:
# You can also test with your own data
hours = 9.25
own_pred = regressor.predict([[hours]])  # predict expects a 2-D array: one row, one feature
print("No of Hours = {}".format(hours))
print("Predicted Score = {}".format(own_pred[0]))
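
scikit-learn's `predict` expects input of shape `(n_samples, n_features)`, so a single number has to be wrapped into a 2-D array before it is passed in. The sketch below illustrates this with a hypothetical model (`model`) trained on made-up points lying on score = 10 * hours; it is not the regressor fitted in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical model trained on made-up points on the line score = 10 * hours
model = LinearRegression().fit(
    np.array([[1.0], [2.0], [3.0]]), np.array([10.0, 20.0, 30.0]))

hours = 9.25
single = np.array([[hours]])   # shape (1, 1): one sample, one feature
pred = model.predict(single)   # returns an array with one prediction

# Several values at once: reshape a flat array into one column
several = model.predict(np.array([2.5, 5.0]).reshape(-1, 1))

print(pred.shape, several.shape)  # (1,) (2,)
```

Passing a bare float or a 1-D array instead raises a `ValueError` in current scikit-learn versions.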

In [None]:
# Evaluating the model

# The final step is to evaluate the performance of the algorithm. This is
# particularly important when comparing how well different algorithms perform
# on the same dataset. For simplicity, we use the mean absolute error here;
# many other metrics exist.
from sklearn import metrics
print('Mean Absolute Error:',
      metrics.mean_absolute_error(y_test, y_pred))
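
Mean absolute error is only one of several regression metrics available in `sklearn.metrics`. As a sketch on made-up values (`y_true_demo` and `y_pred_demo` are illustrative, not results from this notebook), the block below computes it alongside mean squared error, its square root, and the R² score:

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted scores, for illustration only
y_true_demo = np.array([20.0, 30.0, 60.0, 70.0])
y_pred_demo = np.array([22.0, 27.0, 61.0, 74.0])

mae = metrics.mean_absolute_error(y_true_demo, y_pred_demo)  # mean of |error|
mse = metrics.mean_squared_error(y_true_demo, y_pred_demo)   # mean of error**2
rmse = np.sqrt(mse)                                          # same units as the scores
r2 = metrics.r2_score(y_true_demo, y_pred_demo)              # share of variance explained

print("MAE:", mae)    # 2.5
print("MSE:", mse)    # 7.5
print("RMSE:", rmse)
print("R^2:", r2)
```

Mean squared error penalises large misses more heavily than mean absolute error, so the two can rank models differently.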