<a href="https://colab.research.google.com/github/Hannah-Susan-Mathew/The-Sparks-Foundation-Tasks/blob/main/TSF_Task_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![picture](https://drive.google.com/uc?export=view&id=1iRlF3NeMR4qPgaYsTcLioy2RruUftGHV)


### **Importing the Libraries**

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from urllib.request import urlretrieve
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score as r2

### **Downloading the Data**

The dataset is downloaded using the `urlretrieve` function from `urllib.request`.

In [None]:
url = 'http://bit.ly/w-data'

In [None]:
urlretrieve(url, 'student_scores.csv');

In [None]:
student_scores_df = pd.read_csv('student_scores.csv')
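Re-running the cells above downloads the file every time. As a minimal sketch, the download can be skipped when the file is already on disk. The helper name `download_if_missing` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, filename):
    """Download `url` to `filename` unless the file already exists.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    path = Path(filename)
    if not path.exists():
        # Only hit the network when there is no local copy
        urlretrieve(url, str(path))
    return path
```

Under these assumptions, `download_if_missing(url, 'student_scores.csv')` could replace the direct `urlretrieve` call.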

### **About the Dataset**

In [None]:
student_scores_df.shape

(25, 2)

In [None]:
student_scores_df.head(10)

| | Hours | Scores |
|---|---|---|
| 0 | 2.5 | 21 |
| 1 | 5.1 | 47 |
| 2 | 3.2 | 27 |
| 3 | 8.5 | 75 |
| 4 | 3.5 | 30 |
| 5 | 1.5 | 20 |
| 6 | 9.2 | 88 |
| 7 | 5.5 | 60 |
| 8 | 8.3 | 81 |
| 9 | 2.7 | 25 |


In [None]:
student_scores_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Hours   25 non-null     float64
 1   Scores  25 non-null     int64  
dtypes: float64(1), int64(1)
memory usage: 528.0 bytes


In [None]:
student_scores_df.describe()

| | Hours | Scores |
|---|---|---|
| count | 25.0 | 25.0 |
| mean | 5.012 | 51.48 |
| std | 2.525094 | 25.286887 |
| min | 1.1 | 17.0 |
| 25% | 2.7 | 30.0 |
| 50% | 4.8 | 47.0 |
| 75% | 7.4 | 75.0 |
| max | 9.2 | 95.0 |


### **Visualising the Data**

In [None]:
fig = px.histogram(student_scores_df, 
                   x='Hours', 
                   marginal='box', 
                   color_discrete_sequence=['orange'], 
                   title='Distribution of Hours')
fig.update_layout(bargap=0.1)
fig.show()

In [None]:
fig = px.histogram(student_scores_df, 
                   x='Scores',
                   marginal='box', 
                   color_discrete_sequence=['green'], 
                   title='Distribution of Scores')
fig.update_layout(bargap=0.1)
fig.show()

In [None]:
fig = px.scatter(student_scores_df, 
                 x='Hours', 
                 y='Scores', 
                 hover_data=['Hours','Scores'], 
                 title='No. of hours v/s Score obtained')
fig.update_traces(marker_size=10)
fig.show()

In [None]:
corr = student_scores_df.Scores.corr(student_scores_df.Hours)
print('Correlation coefficient:',corr)

Correlation coefficient: 0.9761906560220887


The correlation coefficient is 0.976, which is very close to 1. The `Hours` and `Scores` columns are therefore strongly positively correlated, i.e., the relationship between them is close to linear, so a linear regression line is a reasonable fit for the data.
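To make the coefficient less of a black box, here is a small sketch that computes Pearson's r by hand and cross-checks it against NumPy. It uses only the ten rows shown in the `head(10)` output, so the value will not match the full-dataset figure of 0.976 exactly. The names `hours_sample`, `scores_sample`, `r_manual`, and `r_numpy` are illustrative.

```python
import numpy as np

# First ten rows of the dataset, copied from the head(10) output
hours_sample = np.array([2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7])
scores_sample = np.array([21, 47, 27, 75, 30, 20, 88, 60, 81, 25])

# Pearson's r: sum of products of deviations, divided by the
# square root of the product of the summed squared deviations
dx = hours_sample - hours_sample.mean()
dy = scores_sample - scores_sample.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Cross-check against NumPy's correlation matrix
r_numpy = np.corrcoef(hours_sample, scores_sample)[0, 1]
print(r_manual, r_numpy)
```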

### **Modelling**

The goal is to predict a student's score from the number of hours studied, so we separate the input column (`Hours`) from the target column (`Scores`).

In [None]:
input  = student_scores_df[['Hours']]
target = student_scores_df.Scores

Now we split the data into training and testing sets in a 70:30 ratio. With only 25 rows, this leaves 17 rows for training and 8 for testing.

In [None]:
input_train, input_test, target_train, target_test = train_test_split(input,
                                                                      target,
                                                                      test_size = 0.3,
                                                                      random_state = 50)
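Because 30% of 25 is not a whole number, it helps to check how many rows end up in each set. The sketch below uses stand-in arrays with the same number of rows as the dataset, not the real data; `X_demo` and `y_demo` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows (25) as the dataset
X_demo = np.arange(25).reshape(-1, 1)
y_demo = np.arange(25)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=50)

# The test size is rounded up, so 0.3 * 25 = 7.5 becomes 8 test rows
print(len(X_tr), len(X_te))  # 17 8
```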

Next, we fit a linear regression model to the training data.

In [None]:
model = LinearRegression()
model.fit(input_train, target_train);

In [None]:
# Parameters of the regression line
print('Slope       =', model.coef_[0])
print('y-intercept =', model.intercept_)

Slope       = 9.521606076248064
y-intercept = 3.7843079418921874


In [None]:
x_range = np.linspace(1,9.5,200)
y_range = model.predict(x_range.reshape(-1, 1))

fig = go.Figure([
    go.Scatter(x = x_range,
               y = y_range,
               name = 'Regression Line'),
    go.Scatter(x = input_train.squeeze(),
               y = target_train,
               name = 'Train',
               mode = 'markers'),
    go.Scatter(x = input_test.squeeze(),
               y = target_test,
               name = 'Test',
               mode = 'markers')])

fig.update_layout(title = 'No. of hours v/s Score obtained')
fig.update_traces(marker_size = 10)
fig.show()

### **Evaluating the model**

In [None]:
test_preds = model.predict(input_test)

In [None]:
pd.DataFrame({'Hours':input_test.squeeze(),
              'Original Score':target_test.squeeze(),
              'Predicted Score':test_preds,
              'Residual Error':target_test.squeeze()-test_preds})

| | Hours | Original Score | Predicted Score | Residual Error |
|---|---|---|---|---|
| 15 | 8.9 | 95 | 88.526602 | 6.473398 |
| 20 | 2.7 | 30 | 29.492644 | 0.507356 |
| 23 | 6.9 | 76 | 69.483390 | 6.516610 |
| 22 | 3.8 | 35 | 39.966411 | -4.966411 |
| 14 | 1.1 | 17 | 14.258075 | 2.741925 |
| 12 | 4.5 | 41 | 46.631535 | -5.631535 |
| 8 | 8.3 | 81 | 82.813638 | -1.813638 |
| 9 | 2.7 | 25 | 29.492644 | -4.492644 |


In [None]:
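# The `squared=False` argument is deprecated in newer scikit-learn releases,
# so take the square root of the MSE explicitly to get the RMSE.
rmse_value = np.sqrt(mse(target_test, test_preds))
mae_value = mae(target_test, test_preds)
r2_value = r2(target_test, test_preds)
pd.DataFrame({'Root_Mean_Squared_Error':[rmse_value],'Mean_Absolute_Error':[mae_value],'R-squared_Score':[r2_value]})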

| | Root_Mean_Squared_Error | Mean_Absolute_Error | R-squared_Score |
|---|---|---|---|
| 0 | 4.636799 | 4.14294 | 0.97172 |
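The three metrics can also be computed directly from the residuals, which makes their definitions explicit. This sketch copies the original and predicted scores from the test-set table. Since those values are rounded to six decimals, the results should agree with the reported metrics only up to rounding.

```python
import numpy as np

# Original and predicted scores copied from the test-set table
actual = np.array([95, 30, 76, 35, 17, 41, 81, 25])
predicted = np.array([88.526602, 29.492644, 69.483390, 39.966411,
                      14.258075, 46.631535, 82.813638, 29.492644])

residuals = actual - predicted

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean(residuals ** 2))
# MAE: mean of the absolute residuals
mae_manual = np.mean(np.abs(residuals))
# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(rmse_manual, mae_manual, r2_manual)
```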


### **Prediction**

In [None]:
expected_score = model.predict(np.array([[9.25]]))[0]
print('The expected score of a student who studies for 9.25 hrs/day is',round(expected_score, 2))

The expected score of a student who studies for 9.25 hrs/day is 91.86
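The prediction is just the regression line evaluated at 9.25 hours. As a quick check, the arithmetic below uses the slope and intercept printed earlier in the notebook; the variable names are illustrative.

```python
# Coefficients copied from the fitted model's printed output
slope = 9.521606076248064
intercept = 3.7843079418921874

study_hours = 9.25
# score = slope * hours + intercept
manual_score = slope * study_hours + intercept
print(round(manual_score, 2))  # 91.86
```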


---