# Linear Regression

## Importing the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [None]:
# Reading in a CSV file containing data for positions and their corresponding salaries
dataset = pd.read_csv('Position_Salaries.csv')

In [None]:
# Outputting information about the dataset
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Position  10 non-null     object
 1   Level     10 non-null     int64 
 2   Salary    10 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 368.0+ bytes


The `info()` method summarizes the data frame: the number of rows and columns, each column's name, its count of non-null values, and its data type. Here it shows 10 rows with no missing values, one text column (`Position`), and two integer columns (`Level` and `Salary`). This is a quick way to check the dataset's structure before any cleaning or analysis.

In [None]:
# Generating descriptive statistics about the dataset
dataset.describe()

|       | Level   | Salary        |
|-------|---------|---------------|
| count | 10.0    | 10.0          |
| mean  | 5.5     | 249500.0      |
| std   | 3.02765 | 299373.883668 |
| min   | 1.0     | 45000.0       |
| 25%   | 3.25    | 65000.0       |
| 50%   | 5.5     | 130000.0      |
| 75%   | 7.75    | 275000.0      |
| max   | 10.0    | 1000000.0     |


The `describe()` method generates descriptive statistics for each numeric column: the count, mean, standard deviation, minimum, maximum, and quartiles. These give a sense of how the data is distributed and can flag potential issues such as missing values (a count lower than the number of rows) or outliers. Here, for example, the mean salary (249,500) is well above the median (130,000), which indicates that a few very high salaries skew the distribution.


In [None]:
# Displaying the first few rows of the dataset
dataset.head()

|   | Position          | Level | Salary |
|---|-------------------|-------|--------|
| 0 | Business Analyst  | 1     | 45000  |
| 1 | Junior Consultant | 2     | 50000  |
| 2 | Senior Consultant | 3     | 60000  |
| 3 | Manager           | 4     | 80000  |
| 4 | Country Manager   | 5     | 110000 |


The `head()` method displays the first few rows of the data frame (5 by default). It gives a quick overview of the column names and typical values, and is commonly used to check that the data has been imported correctly.

`iloc` is a pandas indexer for selecting subsets of data by integer position.

`[:, 1:-1]` selects all rows (`:`) and every column from the second column up to, but not including, the last one (`1:-1`) as the input features. In this dataset that is only the `Level` column.

`[:, -1]` selects all rows (:) and only the last column (-1) as the output labels.

`.values` returns the data as a NumPy array.

In [None]:
# Select all rows and every column from the second through the second-to-last (here, only Level) as the feature matrix
X = dataset.iloc[:, 1:-1].values

# Select all rows and the last column of the dataset as the target variable
y = dataset.iloc[:, -1].values
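
The difference between slicing columns and picking a single column matters because scikit-learn expects a 2-D feature matrix and a 1-D target. The sketch below shows the resulting shapes on a made-up three-row frame; `toy` and its values are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for Position_Salaries.csv
toy = pd.DataFrame({
    "Position": ["A", "B", "C"],
    "Level": [1, 2, 3],
    "Salary": [100, 200, 300],
})

X_toy = toy.iloc[:, 1:-1].values  # a column slice keeps a 2-D array
y_toy = toy.iloc[:, -1].values    # a single column index gives a 1-D array

print(X_toy.shape)  # (3, 1)
print(y_toy.shape)  # (3,)
```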

Normally, you would split the data into training and test sets here. For this exercise, however, you will train the model on all of the data.
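
For reference, a typical split could look like the following sketch. It is not used in this notebook; the `X_demo` and `y_demo` arrays, the 80/20 ratio, and the `random_state` value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 10 samples with a single feature
X_demo = np.arange(1, 11).reshape(-1, 1)
y_demo = X_demo.ravel() * 10

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (8, 1) (2, 1)
```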

## Training the Linear Regression model on the whole dataset

In [None]:
from sklearn.linear_model import LinearRegression

`LinearRegression` is a class from the `sklearn.linear_model` module that implements ordinary least squares linear regression for predicting continuous values.

`lin_reg` is an instance of the `LinearRegression` class that will be used to fit the model and make predictions.

`fit` is a method of the `LinearRegression` class that trains (fits) the linear regression model using the input features (`X`) and output labels (`y`).
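
As a minimal sketch of this create-then-fit pattern, the example below uses made-up data that follows y = 2x + 1 exactly. The names `X_demo`, `y_demo`, and `demo_reg` are illustrative and separate from the `lin_reg` you will create in the next cell.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line y = 2x + 1
X_demo = np.array([[1], [2], [3], [4]])
y_demo = np.array([3, 5, 7, 9])

demo_reg = LinearRegression()  # create the model object
demo_reg.fit(X_demo, y_demo)   # estimate slope and intercept from the data

# The learned slope and intercept should be close to 2 and 1
print(demo_reg.coef_, demo_reg.intercept_)
```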

In [None]:
# Create a Linear Regression object

# Train (fit) the model using the input features (X) and output labels (y)


## Visualising the Linear Regression results

This utility function is used to visualize the fitted model on a scatter plot. 

You do not need to understand this as it is mostly for show, but if you are curious:

The function first draws a red scatter plot of the feature and target values. If `smooth` is set to True, it replaces `X` with a denser grid of values running from the minimum to the maximum of the original feature in steps of 0.1. It then builds `model_prediction_features`: the `X` values unchanged if `model_type` is set to 'Linear', or `X` expanded by the polynomial feature transformer `poly_reg` if `model_type` is set to 'Polynomial'.

Finally, the function plots the fitted model on the scatter plot by passing the `model_prediction_features` matrix to the `model.predict()` method and plotting the line in blue color. It sets the title of the plot to show the regression type, sets the axis labels, and then shows the plot using the `plt.show()` function.


In [None]:
def plot_model(X, y, model, model_type = 'Linear', poly_reg = None, smooth = False):
  """
  Function to visualize the fitted model on a scatter plot.

  Parameters:
  -----------
  X : numpy array
      The feature array.
  y : numpy array
      The target array.
  model : fitted model object
      The trained model object that is used for prediction and plotting.
  model_type : str, optional (default='Linear')
      A string indicating the type of regression model used.
      If set to 'Linear', the function uses the original features in X.
      If set to 'Polynomial', the function uses poly_reg to transform X.
  poly_reg : PolynomialFeatures object, optional (default=None)
      The polynomial feature transformer that is used to expand the features when model_type is set to 'Polynomial'.
  smooth : bool, optional (default=False)
      A boolean variable to indicate if the plot should have a smooth line or not.

  Returns:
  --------
  None
  """

  # Plot the scatter plot of X and y with color red
  plt.scatter(X, y, color = 'red')

  # If smooth is True, create a sequence of values to use for X-axis on the plot
  if smooth:
    # Use the array's own min/max so np.arange receives scalars, not 1-element arrays
    X_grid = np.arange(X.min(), X.max(), 0.1)
    X = X_grid.reshape(-1, 1)
  
  # Create new feature matrix based on the model type
  model_prediction_features = X if model_type == 'Linear' else poly_reg.fit_transform(X)

  # Plot the fitted model on the scatter plot with blue line
  plt.plot(X, model.predict(model_prediction_features), color = 'blue')

  # Set title, axis labels and show the plot
  plt.title(f'Truth or Bluff ({model_type} Regression)')
  plt.xlabel('Position Level')
  plt.ylabel('Salary')
  plt.show()

In [None]:
# Plot the model using the above method

## Predicting a new result with Linear Regression

In [None]:
# Use the trained linear regression to make a prediction on a new data point
# Here, the position level is 6.5
# The output prediction is a continuous value indicating the predicted salary
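
One detail that often trips people up: `predict` expects a 2-D array with one row per sample, so a single value has to be wrapped in double brackets. A small sketch on made-up data follows; `demo_reg` and the data are illustrative assumptions, not the model trained above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line y = 2x + 1
X_demo = np.array([[1], [2], [3], [4]])
y_demo = np.array([3, 5, 7, 9])
demo_reg = LinearRegression().fit(X_demo, y_demo)

# [[6.5]] is one row (sample) with one column (feature)
new_point = [[6.5]]
prediction = demo_reg.predict(new_point)

print(prediction.shape)  # (1,)
print(prediction[0])     # close to 2 * 6.5 + 1 = 14
```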

## Training the Polynomial Regression model on the whole dataset

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
# Create a polynomial features object with a degree of 4
# Note: with only 10 data points, a degree-4 polynomial is likely to overfit

# Transform the original feature matrix X to include polynomial features up to degree 4
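
To make the transformation concrete, the sketch below applies a degree-4 `PolynomialFeatures` to two made-up values. `X_demo` and `demo_poly` are illustrative names, not the objects you create in this cell.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[2], [3]])

demo_poly = PolynomialFeatures(degree=4)
X_demo_poly = demo_poly.fit_transform(X_demo)

# Each row becomes [1, x, x^2, x^3, x^4]:
# x = 2 gives 1, 2, 4, 8, 16 and x = 3 gives 1, 3, 9, 27, 81
print(X_demo_poly)
```

The leading column of ones is the bias term, which `PolynomialFeatures` includes by default.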


In [None]:
# Create a pandas DataFrame from the transformed polynomial features
# The DataFrame will have columns named using the names of the polynomial features


In [None]:
# Create another Linear Regression object

# Train (fit) the model using the polynomial input features (X_poly) and output labels (y)
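
Polynomial regression is still a linear model, only fitted on the expanded features. A minimal sketch, assuming made-up data that follows y = x² exactly (all names here are illustrative and use degree 2 rather than 4):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up data lying exactly on the curve y = x^2
X_demo = np.array([[1], [2], [3], [4], [5]])
y_demo = X_demo.ravel() ** 2

# Expand x into [1, x, x^2], then fit an ordinary linear regression on it
demo_poly = PolynomialFeatures(degree=2)
X_demo_poly = demo_poly.fit_transform(X_demo)
demo_reg_2 = LinearRegression().fit(X_demo_poly, y_demo)

# Predictions on the training points should reproduce y_demo closely
print(demo_reg_2.predict(X_demo_poly))
```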


## Visualising the Polynomial Regression results

In [None]:
# Plot the results of the polynomial regression


## Visualising the Polynomial Regression results (for higher resolution and smoother curve)

In [None]:
# Make the same plot as above with a smooth curve

## Predicting a new result with Polynomial Regression

In [None]:
# Use the trained polynomial regression to make a prediction on a new data point
# Here, the position level is 6.5
# The output prediction is a continuous value indicating the predicted salary
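
Note that a model trained on polynomial features cannot take the raw level directly: the new point must go through the same `PolynomialFeatures` transformation first. As an alternative that this notebook does not use, a scikit-learn pipeline can bundle both steps so the transformation is applied automatically. The sketch below assumes made-up data following y = x²; `demo_model` is an illustrative name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data lying exactly on the curve y = x^2
X_demo = np.array([[1], [2], [3], [4], [5]])
y_demo = X_demo.ravel() ** 2

# The pipeline expands the features and then fits the linear model
demo_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
demo_model.fit(X_demo, y_demo)

# The raw value is passed in; the pipeline transforms it before predicting
prediction = demo_model.predict([[6.5]])
print(prediction[0])  # close to 6.5 ** 2 = 42.25
```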