# Python and Data Science
## Machine Learning
### Part 1: Linear Regression using Scikit-Learn

Libraries Used:

- [NumPy](https://www.youtube.com/watch?v=lLRBYKwP8GQ&t=1073s) - Used for creating and working with arrays
- [Pandas](https://www.youtube.com/watch?v=zN2Hua6oII0&t=8s) - Used for loading and handling data sets
- [matplotlib](https://www.youtube.com/watch?v=nzKy9GY12yo) - Used for making charts and graphs
- [scikit-learn intro](https://www.youtube.com/watch?v=rvVkVsG49uU) - Provides the machine learning algorithms, including linear regression
- [scikit-learn tutorial](https://www.youtube.com/watch?v=M9Itm95JzL0) - In-depth discussion
- [pickle](https://www.youtube.com/watch?v=6Q56r_fVqgw) - Used to save trained models to disk


In [None]:
# Library Imports
import pandas as pd
import numpy as np
import sklearn.model_selection  # explicit submodule import so sklearn.model_selection.train_test_split resolves
import matplotlib.pyplot as pyplot
import pickle
from sklearn import linear_model
from sklearn.utils import shuffle
from matplotlib import style

In [None]:
# Loads the data set and prints its first 5 rows (one row per student)
data = pd.read_csv("student-mat.csv", sep=";")
print(data.head())

In [None]:
# Selects a few columns for the calculations and then prints them
data = data[["G1", "G2", "G3", "traveltime", "health", "freetime"]]
print(data.head())

## Attributes and Labels

In machine learning:

1. **Attributes** (Features): Characteristics or properties of the data used as input to an algorithm, also called variables. Each data point is represented by a feature vector holding its attribute values.

2. **Labels** (Target Variable): The values we want the model to learn to predict. In supervised learning the dataset is labeled: each data point has an associated label that serves as the ground truth.

The goal is to use attributes to learn patterns and relationships that enable the model to predict labels for new, unseen data.
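
As a rough illustration of the split between attributes and labels, here is a minimal sketch on a made-up table. The column names (`hours_studied`, `absences`, `final_grade`) and their values are hypothetical and are not taken from `student-mat.csv`.

```python
import pandas as pd

# Hypothetical toy data: two attribute columns and one label column.
toy = pd.DataFrame({
    "hours_studied": [2, 4, 6, 8],
    "absences": [5, 3, 1, 0],
    "final_grade": [8, 11, 14, 17],
})

label = "final_grade"
X = toy.drop(columns=[label]).to_numpy()  # attributes: one feature vector per row
y = toy[label].to_numpy()                 # labels: one target value per row

print(X.shape, y.shape)
print(X[0], y[0])  # first student's feature vector and its label
```

Each row of `X` is the feature vector for one data point, and the matching entry of `y` is its ground-truth label.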

List the attributes here:

What is the label in this example?


In [None]:
# Chooses the column (label) we are predicting
predict = "G3"
x = np.array(data.drop(predict, axis=1)) # Attributes: every selected column except the label
y = np.array(data[predict]) # Labels: the G3 values to predict

In [None]:
# Split the data set into train and test sets
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1)

### Linear Regression
Using a set of data points, the linear regression (LR) algorithm attempts to find a 'line of best fit' through the data.
The 'line of best fit' is the straight line that minimizes the total squared vertical distance between the line and the data points.
LR works best when the data is strongly correlated to begin with.
 
 <table>
    <tr>
        <td>
            <img src="LR-data1.png" width="50%" />
        </td>
        <td>
            <img src="LR-data2.png" width="50%" />
        </td>
    </tr>
    <tr>
        <td>
            <p style="text-align:center">Strongly Correlated</p>
        </td>
        <td>
            <p style="text-align:center">Poorly Correlated</p>
        </td>
    </tr>
 </table>
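
One way to put a number on "strongly" versus "poorly" correlated is the Pearson correlation coefficient, which NumPy exposes as `np.corrcoef`. The sketch below uses synthetic data (an assumption, not the data behind the two images) in which the same linear trend is hidden under small or large noise.

```python
import numpy as np

# Synthetic data (an assumption): the same linear trend with two noise levels.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)

strong = 2.0 * x + rng.normal(0, 1.0, size=x.size)   # small noise, trend is clear
weak = 2.0 * x + rng.normal(0, 40.0, size=x.size)    # large noise swamps the trend

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r.
r_strong = np.corrcoef(x, strong)[0, 1]
r_weak = np.corrcoef(x, weak)[0, 1]

print("strongly correlated r:", round(r_strong, 3))
print("poorly correlated r:  ", round(r_weak, 3))
```

A value of r close to 1 or -1 suggests a straight line describes the data well, while a value near 0 suggests it does not.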
 
 <b>Linear Equation: y = mx + b, where m is the gradient (slope) of the line and b is the y-intercept.
 For any given x, once m and b are known, y can be predicted.</b>
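
To make the equation concrete, here is a minimal single-feature sketch. The points are invented so that they lie exactly on a line with m = 2 and b = 1, and `np.polyfit` with `deg=1` is used as one convenient least squares fit. The helper name `predict_y` is hypothetical.

```python
import numpy as np

# Made-up points that lie exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 least squares fit returns the slope first, then the intercept.
m, b = np.polyfit(x, y, deg=1)

def predict_y(x_new):
    """Apply y = mx + b with the fitted m and b."""
    return m * x_new + b

print("m =", round(m, 6), "b =", round(b, 6))
print("prediction for x = 5:", round(predict_y(5.0), 6))
```

With several attributes, as in this notebook, the same idea extends to one coefficient per attribute plus a single intercept.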

In [None]:
linear = linear_model.LinearRegression()
# Fit the line of best fit to the training data
linear.fit(x_train, y_train)
# score() returns R^2 (the coefficient of determination) on the test set
acc = linear.score(x_test, y_test)
print(acc)

## Save the Model with pickle

- Use the Run All button several times until the printed score is high
- Comment out the cell above
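
The rerun-until-good step can also be automated. The following is a minimal sketch, assuming synthetic stand-in data and a hypothetical file name `best_model_sketch.pickle`: it retrains on a fresh random split each time and saves a model only when its test score beats the best seen so far.

```python
import pickle

import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 3 attributes, linear target plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = x @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

best_score = -np.inf
for _ in range(20):
    # Each iteration draws a new random train/test split.
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
    model = linear_model.LinearRegression()
    model.fit(x_train, y_train)
    score = model.score(x_test, y_test)  # R^2 on this split's test set

    # Overwrite the saved file only when this model is the best so far.
    if score > best_score:
        best_score = score
        with open("best_model_sketch.pickle", "wb") as f:
            pickle.dump(model, f)

# Reload the best model that was saved.
with open("best_model_sketch.pickle", "rb") as f:
    best_model = pickle.load(f)

print("best R^2:", best_score)
```

Keep in mind that a score chosen as the best of many random splits tends to be optimistic, so it is better read as an upper estimate than as the model's expected performance on new data.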

In [None]:
# Open a file in binary write mode and save the trained model to it
with open("studentmodel.pickle", "wb") as f:
    pickle.dump(linear, f)

In [None]:
# Load the model back from the pickle file (the with block closes the file afterwards)
with open("studentmodel.pickle", "rb") as pickle_in:
    linear = pickle.load(pickle_in)

In [None]:
# Inspect the fitted model: one coefficient per attribute, plus the intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

In [None]:
# Prints each prediction, the attributes it was made from, and the actual G3 grade
predictions = linear.predict(x_test)
for i in range(len(predictions)):  # use i so the feature array x is not overwritten
    print(predictions[i], x_test[i], y_test[i])

### Plot the data 

In [None]:
style.use("ggplot")

# Set up a scatter plot
p = "G1"
pyplot.scatter(data[p], data["G3"]) # Plots graph points
pyplot.xlabel(p) # Labels x-axis 
pyplot.ylabel("Final Grade") # Labels y-axis
pyplot.show() # Displays Graph