### ENEL 645 - Data Mining and Machine Learning
### Assignment 1 - Linear Regression
#### Author: Steven Duong (30022492)

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

### 1. Load Data

We load the data using the pandas library.

In [2]:
from google.colab import files
uploaded = files.upload()

Saving reduced_version_data_ENEL_645.csv to reduced_version_data_ENEL_645.csv


In [3]:
import io

# read the uploaded CSV into df, indexed by community name
df = pd.read_csv(io.BytesIO(uploaded['reduced_version_data_ENEL_645.csv']), index_col='Community Name')
df.head()

Community Name,Sector,Group Category,Category,Crime Count,Resident Count,Year,Month
WHITEHORN,NORTHEAST,Crime,Street Robbery,1,12019,2019,SEP
FOOTHILLS,EAST,Crime,Theft OF Vehicle,10,317,2019,NOV
ACADIA,SOUTH,Crime,Theft FROM Vehicle,13,10520,2019,SEP
MAHOGANY,SOUTHEAST,Crime,Theft OF Vehicle,1,11784,2019,NOV
LINCOLN PARK,WEST,Crime,Commercial Break & Enter,5,2617,2019,NOV


In [4]:
# convert categorical variables into dummy/indicator variables
df = pd.get_dummies(df)
df.head()

Community Name,Crime Count,Resident Count,Year,Sector_CENTRE,Sector_EAST,Sector_NORTH,Sector_NORTHEAST,Sector_NORTHWEST,Sector_SOUTH,Sector_SOUTHEAST,...,Month_DEC,Month_FEB,Month_JAN,Month_JUL,Month_JUN,Month_MAR,Month_MAY,Month_NOV,Month_OCT,Month_SEP
WHITEHORN,1,12019,2019,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
FOOTHILLS,10,317,2019,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
ACADIA,13,10520,2019,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
MAHOGANY,1,11784,2019,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
LINCOLN PARK,5,2617,2019,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
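To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies` does on a tiny frame. The rows and the index labels are hypothetical and are not taken from the assignment data.

```python
import pandas as pd

# Toy frame with hypothetical rows (not from the assignment dataset)
toy = pd.DataFrame(
    {
        "Sector": ["EAST", "SOUTH", "EAST"],
        "Crime Count": [10, 13, 1],
    },
    index=["A", "B", "C"],
)

# Each distinct string value becomes its own indicator column;
# numeric columns pass through unchanged.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())

# The indicator columns of one categorical variable sum to 1 in every row,
# so together they carry the same information as an intercept term.
row_sums = encoded[["Sector_EAST", "Sector_SOUTH"]].astype(int).sum(axis=1)
print(row_sums.tolist())
```

Because the indicators of each variable sum to 1, the encoded features are linearly dependent once an intercept is included. Passing `drop_first=True` to `pd.get_dummies` is one common way to avoid this; the notebook keeps all columns.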


In [5]:
# split the dataset into input features (X) and target variable (y)
X = df.drop('Crime Count', axis=1)
y = df['Crime Count']

In [6]:
# display a summary of the feature matrix X
# X.info() lists the non-null count per column, which reveals missing values
X.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100000 entries, WHITEHORN to KILLARNEY/GLENGARRY
Data columns (total 35 columns):
 #   Column                                  Non-Null Count   Dtype
---  ------                                  --------------   -----
 0   Resident Count                          100000 non-null  int64
 1   Year                                    100000 non-null  int64
 2   Sector_CENTRE                           100000 non-null  uint8
 3   Sector_EAST                             100000 non-null  uint8
 4   Sector_NORTH                            100000 non-null  uint8
 5   Sector_NORTHEAST                        100000 non-null  uint8
 6   Sector_NORTHWEST                        100000 non-null  uint8
 7   Sector_SOUTH                            100000 non-null  uint8
 8   Sector_SOUTHEAST                        100000 non-null  uint8
 9   Sector_WEST                             100000 non-null  uint8
 10  Group Category_Crime                    100000 non-n

In [7]:
# display target vector y
# y.info() can check for null data points
y.info()

<class 'pandas.core.series.Series'>
Index: 100000 entries, WHITEHORN to KILLARNEY/GLENGARRY
Series name: Crime Count
Non-Null Count   Dtype
--------------   -----
100000 non-null  int64
dtypes: int64(1)
memory usage: 1.5+ MB
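As a complement to reading the non-null counts from `info()`, missing values can also be counted directly. The sketch below uses a hypothetical toy frame with one deliberately missing entry; the notebook's own data shows no nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value
toy = pd.DataFrame(
    {
        "Resident Count": [12019, np.nan, 10520],
        "Year": [2019, 2019, 2019],
    }
)

# isna() marks missing entries; summing counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# A single number for the whole frame
total_missing = int(missing_per_column.sum())
print(total_missing)
```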


In [16]:
# split the dataset into training set and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=41)
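The role of `random_state` in the split above can be illustrated on toy arrays. This is a sketch with made-up data: fixing the seed makes the shuffle reproducible, and the train and test sets never share a row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features (hypothetical values)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# Two calls with the same seed produce the same partition
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=41)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=41)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```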

### 2. Hypothesis Function

This is a supervised regression problem with multiple features, so we first need a hypothesis function that predicts the target from those features. We select a linear function of all of our features as the hypothesis. It has the form:

h(x) = θ₀ + θ₁x₁ + ... + θₙxₙ

We can use sklearn's LinearRegression class to implement this hypothesis, and then fit the model using our training data.

In [17]:
model = LinearRegression()
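The linear hypothesis h(x) = θ₀ + θ₁x₁ + ... + θₙxₙ can be written in a few lines of NumPy. This is an illustrative sketch; the function name `hypothesis` and the toy parameter values are assumptions, not part of the assignment.

```python
import numpy as np

def hypothesis(theta0, theta, X):
    """Return theta0 + theta_1*x_1 + ... + theta_n*x_n for each row of X."""
    return theta0 + X @ theta

# Two samples with two features each (hypothetical values)
X_toy = np.array([[1.0, 2.0], [3.0, 4.0]])
theta0_toy = 0.5
theta_toy = np.array([2.0, -1.0])

preds = hypothesis(theta0_toy, theta_toy, X_toy)
print(preds)
```

After fitting, a `LinearRegression` model stores θ₀ in `intercept_` and θ₁ through θₙ in `coef_`, and `predict` evaluates the same expression.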

### 3. Normal Equation and OLS for Linear Regression

Sklearn's LinearRegression does not use Gradient Descent to optimize the parameters of the model. Instead, it computes the Ordinary Least Squares (OLS) solution directly. Under the assumptions of the Gauss-Markov theorem, OLS is the best linear unbiased estimator of the coefficients.

Ordinary Least Squares minimizes the sum of the squared differences between the observed and predicted values. These differences are known as the residuals.
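A small numeric sketch of the residuals and their sum of squares follows. The observed and predicted values are made up for illustration.

```python
import numpy as np

# Hypothetical observed and predicted values
y_obs = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 7.0])

# Residual = observed value minus predicted value, one per sample
residuals = y_obs - y_pred

# OLS chooses the coefficients that make this quantity as small as possible
rss = float(np.sum(residuals ** 2))
print(residuals, rss)
```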

Model Definition: We define our hyperplane as:

y = θ₀ + θ₁X₁ + θ₂X₂ + ... + θₖXₖ + ε

Here, 'y' is the target variable we want to predict, 'X' are the features in our dataset, 'θ' are the coefficients for each feature that we want to learn, and 'ε' is the random error term. Each 'θ' coefficient represents the change in 'y' for a one-unit change in the corresponding 'X', assuming all other factors are held constant.

Matrix Formulation: This equation can be rewritten in matrix form as:

y = Xθ + ε

Here, 'X' is a matrix of the features, 'θ' is a vector of the parameters, and 'ε' is a vector of the error terms.

Solving for Coefficients: To find the best 'θ' coefficients that minimize the sum of the squared residuals, we use the Normal Equation:

θ = (X'X)^-1 X'y

Here, 'X' is the matrix of features (with a leading column of ones for the intercept θ₀), 'y' is the vector of the target variable, X' is the transpose of X, (X'X)^-1 is the inverse of the product of X' and X, and X'y is the product of X' and y.

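Calculating the Coefficients: The Normal Equation provides a direct method to calculate the 'θ' coefficients that minimize the sum of the squared residuals. In other words, it gives us the coefficients of the best-fit hyperplane for our data. The formula requires X'X to be invertible, which fails when features are linearly dependent; in that case a least-squares solver can still return a set of coefficients that minimizes the squared residuals.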
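The Normal Equation can be checked against `LinearRegression` on synthetic data. The sketch below is an illustration under stated assumptions: the data are randomly generated, the names `X_design` and `theta_normal` are hypothetical, and the features are well conditioned so that X'X is invertible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 50 samples, 3 features, known coefficients plus small noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
true_theta = np.array([1.0, 2.0, -3.0, 0.5])  # intercept first
y_toy = true_theta[0] + X_toy @ true_theta[1:] + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so theta_0 is estimated with the other coefficients
X_design = np.column_stack([np.ones(len(X_toy)), X_toy])

# Normal Equation: solve (X'X) theta = X'y instead of forming the inverse
theta_normal = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_toy)

# Reference fit with sklearn
reference = LinearRegression().fit(X_toy, y_toy)
theta_sklearn = np.concatenate([[reference.intercept_], reference.coef_])

print(np.allclose(theta_normal, theta_sklearn))
```

Solving the linear system with `np.linalg.solve` is generally preferred over computing the inverse explicitly, since it is cheaper and numerically more stable. With a full set of one-hot columns plus an intercept, as in this notebook, X'X is singular and this direct approach would not be reliable; a least-squares routine such as `np.linalg.lstsq` handles that case.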

In [18]:
model.fit(X_train, y_train)
predictions = model.predict(X_test)

### 4. Cost Function for Linear Regression and Model Evaluation

Since this is a linear regression problem, we use the mean squared error as our cost function. It is given by:

J = (1/2m) * Σ(hᵢ - yᵢ)²

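Here, m is the number of samples, hᵢ is the prediction for sample i, and yᵢ is the observed value. We can use sklearn's built-in mean_squared_error function to evaluate the model. Note that this function divides by m rather than 2m, so the value it reports equals 2J. The factor of 1/2 is only a convenience when differentiating and does not change which θ minimizes the cost.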

In [19]:
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print('MSE:', mse)

MSE: 525.3552882121902
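The relationship between the cost J and the value returned by `mean_squared_error` can be verified by hand. The numbers below are hypothetical and serve only to show the arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed counts and predictions
y_true = np.array([1.0, 10.0, 13.0, 1.0, 5.0])
y_hat = np.array([2.0, 8.0, 13.0, 3.0, 5.0])

m = len(y_true)
squared_errors = (y_hat - y_true) ** 2

mse_manual = squared_errors.sum() / m      # divides by m
cost_j = squared_errors.sum() / (2 * m)    # divides by 2m, as in J
mse_sklearn = mean_squared_error(y_true, y_hat)

# RMSE is in the same units as the target, which makes it easier to interpret
rmse = float(np.sqrt(mse_sklearn))

print(mse_manual, cost_j, mse_sklearn, rmse)
```

Applied to the result above, an MSE of about 525.36 corresponds to an RMSE of roughly 22.9, expressed in the same units as Crime Count.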
