<h1 align="center"><b>GRIP : The Sparks Foundation</b></h1>

<h1 align="center">Data Science and Business Analytics Intern</h1>

## **Task 1: Prediction using Supervised ML**
### **Predict a student's percentage score from the number of study hours**
In this task we use linear regression to predict the percentage score a student is expected to achieve based on the number of hours they studied. This is a simple linear regression task because it involves just two variables: one input (hours) and one output (score).

## **Author: Jagrut Manish Thakare**


In [None]:
# Importing all libraries required in this notebook
import pandas as pd 
import numpy as np  
import matplotlib.pyplot as plt 
import seaborn as sns

In [None]:
url = "student_scores - student_scores.csv"
data = pd.read_csv(url)
print("Data imported successfully")

data.head(10) 

In [None]:
print(data.shape)

In [None]:
data.describe()

In [None]:
data.info()

Let's plot the data points on a 2-D graph to eyeball the dataset and see whether any relationship between the two variables is visible. The plot can be created with the following script:

In [None]:
# Scatter plot of scores against hours studied
data.plot(x='Hours', y='Scores', style='o')  
plt.title('Hours vs Percentage')  
plt.xlabel('Hours Studied')       
plt.ylabel('Percentage Score')    
plt.show()

**From the graph above, we can clearly see that there is a positive linear relation between the number of hours studied and percentage of score.**
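
One way to put a number on that visual impression is Pearson's correlation coefficient, which is the covariance of the two variables divided by the product of their standard deviations. The block below is a minimal sketch of that formula, assuming a small made-up dataset; `toy_hours`, `toy_scores` and `pearson_r` are hypothetical names and values for illustration, not part of the notebook's data.

```python
import numpy as np

# Hypothetical toy data for illustration only (not the notebook's dataset)
toy_hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_scores = np.array([12.0, 25.0, 31.0, 44.0, 50.0])

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)

r = pearson_r(toy_hours, toy_scores)
print(f"Pearson r = {r:.3f}")
```

A value close to +1 indicates a strong positive linear relationship, and a value close to 0 indicates little or no linear relationship.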

In [None]:
data.corr(method='pearson')

In [None]:
data.corr(method='spearman')

In [None]:
hours = data['Hours']
scores = data['Scores']


In [None]:
fig, ax = plt.subplots()
sns.histplot(scores, kde=True, ax = ax)
ax.set_xlim(0, 120)
plt.show()

In [None]:
sns.displot(hours, kde=True)
plt.show()

### **Preparing the data**

The next step is to divide the data into "attributes" (inputs) and "labels" (outputs).

In [None]:
X = data.iloc[:, :-1].values  # all columns except the last (Hours), 2-D array
y = data.iloc[:, -1].values   # last column (Scores), 1-D array

Now that we have our attributes and labels, the next step is to split the data into training and test sets. We'll do this with scikit-learn's train_test_split() function, holding out 20% of the rows for testing:
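
To make the idea of a train/test split concrete, here is a minimal NumPy sketch of what such a split does conceptually: shuffle the row indices with a fixed seed, then hold out a fraction of them. `simple_train_test_split` is a hypothetical helper written for illustration. It is not scikit-learn's implementation, and it will not pick the same rows as scikit-learn for a given seed.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices, then hold out the first `test_size` fraction as the test set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))  # round the test count up (a choice made here)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Hypothetical toy data: 10 rows, one feature column
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = 10 * X_toy.ravel() + 5

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_toy, y_toy, test_size=0.2, seed=0)
print(X_tr.shape, X_te.shape)
```

Fixing the seed makes the split reproducible, which is the role `random_state` plays in the notebook's call.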

In [None]:
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                            test_size=0.2, random_state=50) 

### **Training the Algorithm**
We have split our data into training and testing sets, and now is finally the time to train our algorithm. 
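
For a single input variable, an ordinary least squares fit has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below illustrates that formula on made-up data; `fit_simple_ols` and the toy arrays are hypothetical and are not the notebook's model or dataset.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form least squares fit for one feature: returns (slope, intercept)."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    intercept = y_mean - slope * x_mean
    return float(slope), float(intercept)

# Hypothetical toy data lying exactly on the line score = 10 * hours + 5
toy_hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_scores = np.array([15.0, 25.0, 35.0, 45.0, 55.0])

slope, intercept = fit_simple_ols(toy_hours, toy_scores)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

On real, noisy data the points will not fall exactly on a line, and the same formula gives the line that minimises the sum of squared vertical errors.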

In [None]:
from sklearn.linear_model import LinearRegression  
reg = LinearRegression()  
reg.fit(X_train, y_train) 

print("Training complete.")

In [None]:
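# Computing the fitted regression line y = m*x + c
m = reg.coef_        # slope (array with a single element)
c = reg.intercept_   # intercept
line = m*X + c

# Plotting the full dataset together with the fitted line
plt.scatter(X, y)
plt.plot(X, line)
plt.show()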

### **Making Predictions**
Now that we have trained our algorithm, it's time to make some predictions.

In [None]:
print(X_test) # Testing data - In Hours
y_pred = reg.predict(X_test) # Predicting the scores

In [None]:
# Comparing Actual vs Predicted
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})  
df 

In [None]:
sns.set_style('whitegrid')
sns.displot(np.array(y_test-y_pred), kde=True)
plt.show()

### **What would be the predicted score if a student studies for 9.25 hours/day?**

In [None]:
# You can also test with your own data
hours = 9.25
own_pred = reg.predict([[hours]])  # predict() expects a 2-D input and returns a 1-D array
print("If a student studies for {} hours per day, they will score {:.2f} % in the exam".format(hours, own_pred[0]))

### **Evaluating the model**

The final step is to evaluate the performance of the model. This step is particularly important for comparing how well different algorithms perform on a particular dataset. For simplicity, we report the mean absolute error here, along with the R2 score. Many other metrics exist, such as the mean squared error.
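
As a rough illustration of what such metrics measure, the sketch below computes the mean absolute error, the mean squared error and the R2 score by hand with NumPy. The arrays `y_true` and `y_hat` are hypothetical values chosen for easy arithmetic. They are not results from this notebook.

```python
import numpy as np

# Hypothetical actual and predicted scores (not results from the notebook)
y_true = np.array([20.0, 40.0, 60.0, 80.0])
y_hat = np.array([25.0, 35.0, 65.0, 75.0])

errors = y_true - y_hat

mae = np.abs(errors).mean()   # average size of the errors
mse = (errors ** 2).mean()    # average squared error, penalises large misses more

ss_res = (errors ** 2).sum()                    # unexplained variation
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total variation around the mean
r2 = 1 - ss_res / ss_tot                        # share of variation explained by the model

print(f"MAE = {mae:.2f}, MSE = {mse:.2f}, R2 = {r2:.3f}")
```

MAE is in the same units as the score, which makes it easy to interpret. R2 is unitless and equals 1 for a perfect fit.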

In [None]:
from sklearn import metrics  
from sklearn.metrics import r2_score
print('Mean Absolute Error:', 
      metrics.mean_absolute_error(y_test, y_pred)) 
print('R2 Score:', r2_score(y_test, y_pred)*100,"%")
