# WEATHER PREDICTION USING LINEAR REGRESSION

_**Predicting apparent temperature using Linear Regression.**_

In [None]:
# Imports required packages

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

## Data Collection

In [None]:
# Loads dataset from csv file
weather = pd.read_csv("weather.csv")

# Displays the first few instances from the dataset
display(weather.head())

## Exploratory Data Analysis (EDA)

In [None]:
# Checks for basic information about the dataset

weather.info()

**Observations from the basic dataset information are as follows.**

- Column "Precip Type" has missing values
- Out of 12 columns, 4 are non-numeric

In [None]:
# Checks for the descriptive statistics of the dataset

weather.describe()

**Observations from the basic descriptive statistics are as follows.**

- Column "Loud Cover" is single-valued
- Columns have different scales

**Finds values associated with non-numerical/categorical columns.**

In [None]:
# Finds values associated with "Summary"

print(weather.Summary.value_counts())

print("\nTotal unique values:", weather.Summary.nunique())

In [None]:
# Finds values associated with "Precip Type"

print(weather["Precip Type"].value_counts())

print("\nTotal unique values:", weather["Precip Type"].nunique())

In [None]:
# Finds values associated with "Daily Summary"

print(weather["Daily Summary"].value_counts())

print("\nTotal unique values:", weather["Daily Summary"].nunique())

In [None]:
# Checks for missing values in each categorical column

weather[["Summary", "Precip Type", "Daily Summary"]].isnull().sum()

In [None]:
# Initializes default Seaborn theme
sns.set_theme()

# Plots a histogram of each numeric feature to analyze the distribution of the data
weather.hist(bins=50, figsize=(15,8))
plt.show()

From the above distributions, the features "Humidity", "Wind Speed (km/h)" and "Pressure (millibars)" could have outliers.
A boxplot of each of these features is plotted below for further analysis.

In [None]:
sns.boxplot(weather.Humidity)
plt.show()

Only the rows with above-zero humidity are to be considered.

In [None]:
sns.boxplot(weather["Wind Speed (km/h)"])
plt.show()

Only the rows with a wind speed of at most 60 km/h are to be considered.

In [None]:
sns.boxplot(weather["Pressure (millibars)"])
plt.show()

Only the rows with above-zero pressure (millibars) are to be considered.

All the outliers identified above will be dropped in the following section.
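
The three thresholds stated above can be written as boolean masks, which also makes it easy to count how many rows each rule would drop before anything is removed. The following is a minimal sketch on a small made-up DataFrame; the values and the names `toy`, `dropped` and `toy_clean` are illustrative assumptions and are not taken from `weather.csv`.

```python
import pandas as pd

# Toy data for illustration only; these values are not taken from weather.csv
toy = pd.DataFrame({
    "Humidity": [0.0, 0.5, 0.8, 0.9],
    "Wind Speed (km/h)": [10.0, 63.9, 12.0, 5.0],
    "Pressure (millibars)": [1015.0, 1012.0, 0.0, 1008.0],
})

# One boolean mask per rule; True means the row is kept
keep_humidity = toy["Humidity"] > 0
keep_wind = toy["Wind Speed (km/h)"] <= 60
keep_pressure = toy["Pressure (millibars)"] > 0

# Number of rows each rule would drop on its own
dropped = pd.Series({
    "Humidity": (~keep_humidity).sum(),
    "Wind Speed (km/h)": (~keep_wind).sum(),
    "Pressure (millibars)": (~keep_pressure).sum(),
})
print(dropped)

# Rows that satisfy all three rules at once
toy_clean = toy[keep_humidity & keep_wind & keep_pressure]
print(toy_clean.shape)
```

Counting per rule first gives a quick sanity check that no single threshold removes an unexpectedly large share of the data.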

## Preparing Data

### Checking for Duplicates and Single-valued Columns

**Removes duplicate observations, if any**

In [None]:
# Drops duplicate instances, if any
weather.drop_duplicates(keep='first', inplace=True)

**Removes single-valued columns**

In [None]:
# Drops "Loud Cover" as it holds a single value for every row
weather.drop(columns=["Loud Cover"], inplace=True)
print("Data shape after single-valued column removal:", weather.shape)

### Removing Outliers

In [None]:
# Removes the outliers found in the earlier analysis

weather = weather[weather['Humidity'] != 0.0]
weather = weather[weather['Wind Speed (km/h)'] <= 60]
weather = weather[weather['Pressure (millibars)'] > 0]
weather.shape

### Removing Other Columns and Rows

In [None]:
# Removes column "Formatted Date" as no time-series analysis is being performed.

weather.drop(columns=["Formatted Date"], inplace=True)
weather.shape

In [None]:
# As the rows with a missing "Precip Type" are only a tiny portion of
# the total number of rows, those rows are removed

weather.dropna(subset=["Precip Type"], axis=0, inplace=True)

In [None]:
# Resets the index of the DataFrame to have contiguous index numbers before further processing
weather.reset_index(inplace=True, drop=True)

In [None]:
# Shows the post-preprocessing shape of the data
print(weather.shape)

### Separating Test Set

**To ensure the same distribution in both the training and test datasets, and to make the test dataset representative of the population, stratified sampling over the column "Temperature (C)" was considered.**

In [None]:
# Creates the column on which stratification will be based. Each value in this
# column is the label of the temperature bin that the instance falls into.

weather["Temperature_bin"] = pd.cut(
    weather["Temperature (C)"],                       # Values to be binned
    bins=[-30., -10., 0.0, 10., 20., 30., np.inf],    # Creates six bins
    labels=[1, 2, 3, 4, 5, 6])                        # Associates labels to each bin

In [None]:
# Splits data into train and test dataset applying stratification

train_set, test_set = train_test_split(
    weather, test_size = 0.2, stratify = weather["Temperature_bin"], random_state=42)
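
One way to check that stratification does what is described above is to compare the share of rows per temperature bin in the full data, the training set and the test set. The sketch below reuses the same bin edges on synthetic temperatures; the data and the names `toy` and `bin_share` are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic temperatures for illustration only (not the weather.csv data)
rng = np.random.default_rng(42)
toy = pd.DataFrame({"Temperature (C)": rng.uniform(-20, 39, size=5000)})

# Same bin edges and labels as used for the real data
toy["Temperature_bin"] = pd.cut(
    toy["Temperature (C)"],
    bins=[-30., -10., 0.0, 10., 20., 30., np.inf],
    labels=[1, 2, 3, 4, 5, 6])

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["Temperature_bin"], random_state=42)


def bin_share(df):
    """Returns the fraction of rows falling into each temperature bin."""
    return df["Temperature_bin"].astype(int).value_counts(normalize=True).sort_index()


# Bin proportions side by side; with stratification the columns nearly match
proportions = pd.DataFrame({
    "overall": bin_share(toy),
    "train": bin_share(toy_train),
    "test": bin_share(toy_test),
})
print(proportions)
```

If the three columns differ noticeably on the real data, the bin edges may not cover the observed temperature range well.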

In [None]:
# Removes the intermediate attribute 'Temperature_bin' after stratification
# as it is no longer required

train_set = train_set.drop(columns="Temperature_bin")
test_set = test_set.drop(columns="Temperature_bin")

### Separating Target Column

In [None]:
# Separates the target from the features for both the training and test sets

X_train = train_set.drop("Apparent Temperature (C)", axis = 1)
target_train = train_set["Apparent Temperature (C)"].copy()

X_test = test_set.drop("Apparent Temperature (C)", axis = 1)
target_test = test_set["Apparent Temperature (C)"].copy()

### Transforming Data

#### Transforming Training Data

**Creating a transformation pipeline to impute missing numeric values and encode categorical data in the training dataset**

In [None]:
# Sets the lists of numerical and categorical attributes

cat_attribs = ["Summary", "Precip Type", "Daily Summary"]
num_attribs = [col for col in X_train.columns if col not in cat_attribs]

In [None]:
# Sets up the data transformation pipeline for numerical attributes
# Note that feature scaling is NOT required for the algorithm used here

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    #("std_scaler", StandardScaler())    # Not required
])

In [None]:
# Transforms both numerical and categorical attributes by using ColumnTransformer

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),      # Uses sub-pipeline already defined above
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_attribs)])  # One-hot encoding; categories unseen during fit are encoded as all zeros

X_train_transformed = full_pipeline.fit_transform(X_train)

In [None]:
# Checks for the shape of the transformed training dataset

X_train_transformed.shape
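
Because the encoder is fitted on the training data only, a category that appears solely in the test data is unknown to it. With default settings `OneHotEncoder` raises an error in that case, while `handle_unknown="ignore"` encodes the unseen category as an all-zero row. The sketch below shows both behaviours on toy data; the category values and variable names are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categories for illustration only
toy_train = pd.DataFrame({"Precip Type": ["rain", "snow", "rain"]})
toy_test = pd.DataFrame({"Precip Type": ["snow", "hail"]})  # "hail" is unseen

# Default encoder: transform raises ValueError for the unseen category
strict_encoder = OneHotEncoder().fit(toy_train)
try:
    strict_encoder.transform(toy_test)
    raised = False
except ValueError:
    raised = True
print("Default encoder raised:", raised)

# Tolerant encoder: the unseen category becomes a row of zeros
tolerant_encoder = OneHotEncoder(handle_unknown="ignore").fit(toy_train)
encoded = tolerant_encoder.transform(toy_test).toarray()
print(encoded)
```

This matters most for high-cardinality columns such as "Daily Summary", where rare values can end up in only one of the two splits.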

#### Transforming Testing Data

In [None]:
X_test_transformed = full_pipeline.transform(X_test)

In [None]:
# Checks for the shape of the transformed testing dataset

X_test_transformed.shape

## Modeling

### Modeling Using Closed Form Approach

**Using the Singular Value Decomposition (SVD) Approach of the LinearRegression (LR) Algorithm**

In [None]:
# Fits a LinearRegression model

lr_model = LinearRegression()
lr_model.fit(X_train_transformed, target_train)
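
To make the SVD idea concrete, the sketch below solves an ordinary least squares problem by building the pseudoinverse from the singular value decomposition and compares the result with `LinearRegression`. It runs on synthetic dense data, and the names `X_b`, `theta_svd` and `theta_sklearn` are assumptions for illustration; it is not a reproduction of scikit-learn's internal code path.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

# Adds a column of ones so the intercept is estimated with the weights
X_b = np.c_[np.ones(len(X)), X]

# Thin SVD: X_b = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X_b, full_matrices=False)

# Pseudoinverse solution theta = V @ diag(1/s) @ U.T @ y,
# treating very small singular values as zero for numerical stability
tol = s.max() * max(X_b.shape) * np.finfo(s.dtype).eps
s_inv = np.where(s > tol, 1.0 / s, 0.0)
theta_svd = Vt.T @ (s_inv * (U.T @ y))

# Reference solution from scikit-learn (intercept first, then coefficients)
lr = LinearRegression().fit(X, y)
theta_sklearn = np.r_[lr.intercept_, lr.coef_]

print("SVD solution:    ", theta_svd)
print("sklearn solution:", theta_sklearn)
```

Unlike the normal equation, this route never inverts `X.T @ X`, so it stays well defined even when features are collinear, as one-hot encoded columns typically are.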

In [None]:
# Performs predictions on both training and testing dataset

predictions_train_lr = lr_model.predict(X_train_transformed)
predictions_test_lr = lr_model.predict(X_test_transformed)

In [None]:
# Computes the root mean squared error (RMSE) on both datasets

rmse_train_lr = np.sqrt(mean_squared_error(target_train, predictions_train_lr))
rmse_test_lr = np.sqrt(mean_squared_error(target_test, predictions_test_lr))
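
RMSE is the square root of the mean of the squared differences between actual and predicted values, so it is expressed in the same unit as the target (degrees Celsius here). The sketch below computes it by hand on a few made-up numbers and checks it against `mean_squared_error`; the arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE written out step by step: square the errors, average, take the root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn, as done in the cell above
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```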

### Analyzing Model Performance

_Note that cross-validation was not used here: the closed-form model has no hyperparameters to tune, so performance is assessed directly on the held-out test set._

In [None]:
# Shows Linear Regression model performance on both datasets

print("Linear Regression Model Performance (in RMSE):\n")
print("Train Error:", rmse_train_lr)
print("Test Error:", rmse_test_lr)

In [None]:
# Shows the model's predictions and prediction errors side by side
# for the first few instances of the test dataset

pd.DataFrame({
    "Actual Target": target_test, 
    "LR Prediction": predictions_test_lr, 
    "LR Prediction Error": np.abs(target_test - predictions_test_lr)
}).head(10)