## THE SPARKS FOUNDATION : DATA SCIENCE AND BUSINESS ANALYTICS

### TASK 1 :- Prediction using Supervised ML (Level - Beginner)

### Author : Jeevan Chhajed

In this task, we will predict the percentage of marks that a student is expected to score based upon the number of hours they studied. This is a simple linear regression task as it involves just two variables.
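
As a rough illustration of what "simple linear regression" means with two variables, the sketch below fits a straight line by the closed-form least-squares formulas. The arrays `hours_demo` and `scores_demo` are made-up toy values, not the dataset used in this notebook, and scikit-learn (used later) may compute the fit differently internally.

```python
import numpy as np

# Hypothetical toy data, NOT the notebook's dataset
hours_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores_demo = np.array([12.0, 25.0, 31.0, 44.0, 48.0])

# Closed-form ordinary least squares for a single feature:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
#   intercept = mean(y) - slope * mean(x)
x_mean, y_mean = hours_demo.mean(), scores_demo.mean()
slope = np.sum((hours_demo - x_mean) * (scores_demo - y_mean)) / np.sum((hours_demo - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print("slope:", slope)
print("intercept:", intercept)
```

The fitted line is then `score = intercept + slope * hours`, which is the form the model takes in the steps below.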


In [None]:
# Importing all libraries required in this notebook
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

## Step 1 : Reading Data from online source

In [None]:
### Reading data from remote link
data = pd.read_csv('http://bit.ly/w-data')
data.head(15)

Let's plot the data points on a 2-D graph to eyeball the dataset and see whether any relationship between the two variables is visible. We can create the plot with the following script.

## Step 2 : Data Visualization

In [None]:
# Plotting the distribution of scores
data.plot(x='Hours', y='Scores', style='o')
plt.title('Hours vs Percentage')
plt.xlabel('Hours Studied')  
plt.ylabel('Percentage Score')
plt.show()

#### From the graph above, we can see a positive linear relationship between the number of hours studied and the percentage score.

## Step 3 : Preparing The Data
The next step is to divide the data into "attributes" (inputs) and "labels" (outputs).

In [None]:
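# Attributes (inputs): every column except the last, kept as a 2-D array
x = data.iloc[:, :-1].values
# Labels (outputs): the second column, as a 1-D array
y = data.iloc[:, 1].values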
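
The two `iloc` selections produce arrays of different shapes, which matters because scikit-learn estimators expect a 2-D feature array and a 1-D target. A minimal sketch, using a made-up four-row frame (`demo`) with the same column names as the plot above:

```python
import pandas as pd

# Hypothetical four-row frame; the values are invented for illustration
demo = pd.DataFrame({"Hours": [1.5, 3.2, 5.1, 8.3], "Scores": [20, 27, 47, 81]})

features = demo.iloc[:, :-1].values  # all columns but the last -> 2-D array
labels = demo.iloc[:, 1].values      # second column only -> 1-D array

print(features.shape)  # (4, 1)
print(labels.shape)    # (4,)
```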

## Step 4 : Algorithm Training
Split the data into a training set and a test set, then train the algorithm on the training set.

In [None]:
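# Hold out 20% of the rows for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
regress = LinearRegression()
# x_train is already 2-D, so no reshape is needed before fitting
regress.fit(x_train, y_train)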

In [None]:
print("Training Complete !!!")
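
To see what `test_size` and `random_state` do, here is a small sketch on synthetic data. The arrays `demo_x` and `demo_y` are assumptions made up for this example; only the `train_test_split` call mirrors the one used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 25 samples with one feature each
demo_x = np.arange(25, dtype=float).reshape(-1, 1)
demo_y = 2.0 * demo_x.ravel() + 1.0

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
a_train, a_test, b_train, b_test = train_test_split(
    demo_x, demo_y, test_size=0.2, random_state=0
)

# Repeating the call with the same random_state reproduces the same split
a_train2, a_test2, _, _ = train_test_split(
    demo_x, demo_y, test_size=0.2, random_state=0
)

print(len(a_train), len(a_test))
print(np.array_equal(a_test, a_test2))
```

With 25 rows, this leaves 20 rows for training and 5 for testing.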

## Step 5 : Plotting the line of regression

In [None]:
# Computing the regression line: y = slope * x + intercept
line = regress.coef_ * x + regress.intercept_

# Plotting the full dataset together with the fitted line
plt.scatter(x, y)
plt.plot(x, line, color='red')
plt.show()
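
The expression `coef_ * x + intercept_` should give the same values as calling `predict` on the same inputs. A minimal check, assuming made-up toy data (`demo_x`, `demo_y`) rather than the notebook's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, invented for this check
demo_x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
demo_y = np.array([12.0, 25.0, 31.0, 44.0, 48.0])
demo_model = LinearRegression().fit(demo_x, demo_y)

# coef_ has shape (1,) and demo_x has shape (5, 1), so broadcasting yields (5, 1)
manual_line = demo_model.coef_ * demo_x + demo_model.intercept_
predicted_line = demo_model.predict(demo_x)  # shape (5,)

print(manual_line.shape, predicted_line.shape)
print(np.allclose(manual_line.ravel(), predicted_line))
```

Note that the manual version keeps the 2-D shape of the input, while `predict` returns a 1-D array.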

## Step 6 : Making Predictions
Now that we have trained our algorithm, it's time to make some predictions.

In [None]:
### Testing data - In Hours
print(x_test)

### Predicting the scores
y_pred = regress.predict(x_test)

In [None]:
### Comparing Actual vs Predicted
data1 = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

In [None]:
data1

In [None]:
### Computing the R^2 score on the training data and the test data

print("Training score:", regress.score(x_train, y_train))
print("Testing score:", regress.score(x_test, y_test))
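
For a `LinearRegression` model, `score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on made-up toy data (`demo_x`, `demo_y` are assumptions, not the notebook's dataset) and compares it with `score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, invented for this check
demo_x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
demo_y = np.array([12.0, 25.0, 31.0, 44.0, 48.0])
demo_model = LinearRegression().fit(demo_x, demo_y)
demo_pred = demo_model.predict(demo_x)

# Residual sum of squares: how far the predictions are from the targets
ss_res = np.sum((demo_y - demo_pred) ** 2)
# Total sum of squares: how far the targets are from their own mean
ss_tot = np.sum((demo_y - demo_y.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

print("manual R^2:", manual_r2)
print("model.score:", demo_model.score(demo_x, demo_y))
```

A value close to 1 means the line explains most of the variance in the scores; a much lower test score than training score would suggest overfitting.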

In [None]:
### Plotting a line graph to show the difference between the actual and predicted values.

data1.plot(kind='line', figsize=(8,6))
plt.grid(which='major', linewidth=0.8, color='red')
plt.minorticks_on()  # minor grid lines are only drawn when minor ticks are enabled
plt.grid(which='minor', linewidth=0.5, color='blue')
plt.show()

In [None]:
### Testing your own data.
hours = 9.25
test = np.array([hours])
test = test.reshape(-1,1)
own_pred = regress.predict(test)
print("No. of Hours = {}".format(hours))
print("Predicted Score = {}".format(own_pred[0]))

## Step 7 : Evaluating the model
The final step is to evaluate the performance of the algorithm. This step is particularly important for comparing how well different algorithms perform on a particular dataset. Here we report three common regression metrics: the mean absolute error, the mean squared error, and the root mean squared error. Many other metrics exist.

In [None]:
from sklearn import metrics  
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred)) 
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
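
To make the three metrics concrete, the sketch below computes them directly with NumPy. The arrays `actual_demo` and `predicted_demo` are made-up values for illustration, not results from the model above.

```python
import numpy as np

# Hypothetical actual and predicted scores, invented for illustration
actual_demo = np.array([20.0, 27.0, 69.0, 30.0, 62.0])
predicted_demo = np.array([17.0, 34.0, 75.0, 27.0, 60.0])

errors = actual_demo - predicted_demo

mae = np.mean(np.abs(errors))  # mean absolute error: average size of the errors
mse = np.mean(errors ** 2)     # mean squared error: penalises large errors more
rmse = np.sqrt(mse)            # root mean squared error: same units as the scores

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

Because MAE and RMSE are expressed in the same units as the target, they are usually easier to interpret here as "percentage points off" than MSE.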