# Project - Predictive Modelling

## Import Libraries

1. General libraries to work with data and visualize data:

In [None]:
import numpy as np
import pandas as pd

# For Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

2. scikit-learn and statsmodels libraries to perform regression:

In [None]:
# For randomized data splitting
from sklearn.model_selection import train_test_split

# To build the linear regression models (scikit-learn and statsmodels)
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

3. To check model performance:

In [None]:
from sklearn import metrics

# Regression metrics (MAE, MSE, R^2); r2_score is used in section 1.4
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score
from statsmodels.stats.outliers_influence import variance_inflation_factor

# 1 Linear Regression

## 1.1 Load dataset and explore

Read the data and do exploratory data analysis. Describe the data briefly. (Check the Data types, shape, EDA, 5 point summary). Perform Univariate, Bivariate Analysis, Multivariate Analysis.

In [None]:
df= pd.read_excel('compactiv.xlsx')
df.head()

In [None]:
df.shape

In [None]:
df.info()

Most of the columns in the data are numeric in nature ('int64' or 'float64' type).
'runqsz' is object type.

TODO: decide which columns can be dropped.
TODO: identify which columns are correlated.

In [None]:
df.describe().T

### Univariate Analysis

TODO: add univariate plots. Reference: https://www.analyticsvidhya.com/blog/2020/07/univariate-analysis-visualization-with-illustrations-in-python/
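Until the plots are added, a numeric summary can serve as a first univariate pass. The following is a minimal sketch that assumes only pandas; the helper name `univariate_summary` and the toy frame are illustrative and not part of the original notebook. On the real data one would pass `df` instead of `toy`.

```python
import pandas as pd

def univariate_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Centre, spread and skewness for every numeric column."""
    numeric = frame.select_dtypes(include="number")
    return pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "std": numeric.std(),
        "skew": numeric.skew(),  # > 0 suggests a long right tail
    })

# Toy data standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],
    "b": [2, 4, 6, 8, 10],
    "label": list("xyzxy"),  # non-numeric columns are ignored
})
summary = univariate_summary(toy)
print(summary)
```

A large gap between mean and median, together with a skewness far from zero, flags columns worth plotting individually.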

### Bivariate Analysis

In [None]:
# TODO: adjust plot size

df = df.iloc[:, 0:23]
sns.pairplot(df, diag_kind='kde');

### Multivariate Analysis
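
One way to fill in this section is to list the most strongly correlated feature pairs. The sketch below assumes only numpy and pandas; `top_correlated_pairs`, the 0.8 threshold and the synthetic frame are illustrative assumptions, not the notebook's own method.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Masked (NaN) entries compare as False, so they are filtered out here
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data (hypothetical): x_noisy is nearly a copy of x, z is independent
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "x_noisy": x + rng.normal(scale=0.1, size=200),
    "z": rng.normal(size=200),
})
pairs = top_correlated_pairs(toy)
print(pairs)
```

On the real data, `top_correlated_pairs(df)` would give a short list of candidate columns to inspect before modelling.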

## 1.2 Missing Values & Outliers

Impute null values if present, and also check for values that are equal to zero. Do they have any meaning, or do we need to change or drop them? Check for the possibility of creating new features if required. Also check for outliers and duplicates, if any.

In [None]:
df.duplicated().sum()

There are no duplicates in the data.

### Missing Values

In [None]:
df.isnull().sum()

The 'rchar' column has 104 missing values and the 'wchar' column has 15. Boxplots are generated to visualize the skewness of both columns.

In [None]:
sns.boxplot(x= 'rchar', y='runqsz', data=df)
plt.title('Boxplot of rchar')
plt.show()

In [None]:
sns.boxplot(x= 'wchar', y='runqsz', data=df)
plt.title('Boxplot of wchar')
plt.show()

Median imputation is preferred when the distribution is skewed, as the median is less sensitive to outliers than the mean.
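
As a small illustration of that point, the sketch below compares the two fill values on a hypothetical right-skewed sample (the numbers are made up and are not taken from the dataset).

```python
import pandas as pd

# Hypothetical right-skewed sample with two missing entries
s = pd.Series([10, 12, 11, 13, 12, 500, None, None], dtype="float64")

mean_fill = s.mean()      # pulled upward by the single extreme value
median_fill = s.median()  # stays close to the bulk of the data
print("mean:", mean_fill, "median:", median_fill)

# Fill the gaps with the median, as done for 'rchar' and 'wchar'
filled = s.fillna(median_fill)
print(filled.tolist())
```

Here a single extreme value moves the mean far away from the typical observations, while the median stays among them.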

In [None]:
medianfiller_rchar=df['rchar'].median()
medianfiller_rchar

In [None]:
medianfiller_wchar=df['wchar'].median()
medianfiller_wchar

In [None]:
df['rchar']=df['rchar'].fillna(medianfiller_rchar)
df['wchar']=df['wchar'].fillna(medianfiller_wchar)

In [None]:
df.info()

There are no more missing values.

TODO: check for zero values.
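
A possible starting point for the zero-value check is a per-column count of exact zeros. This is a sketch assuming only pandas; `zero_report` and the toy column names are hypothetical. Whether a zero is a valid measurement or a placeholder still has to be judged per column.

```python
import pandas as pd

def zero_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of exact zeros in each numeric column, largest share first."""
    numeric = frame.select_dtypes(include="number")
    counts = (numeric == 0).sum()
    report = pd.DataFrame({
        "n_zero": counts,
        "pct_zero": 100 * counts / len(numeric),
    })
    return report.sort_values("pct_zero", ascending=False)

# Toy data standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "reads": [0, 3, 5, 0],
    "writes": [1, 2, 3, 4],
    "mostly_idle": [0, 0, 0, 7],
})
report = zero_report(toy)
print(report)
```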

### 1.3.1 Encode Data

In [None]:
df['runqsz'].value_counts()

In [None]:
# dtype=int keeps the dummy column numeric (newer pandas versions default to bool)
df = pd.get_dummies(df, columns=['runqsz'], drop_first=True, dtype=int)
df.head()

### Outliers

In [None]:
plt.figure(figsize=(12, 8))
feature_list = df.columns
for i, col in enumerate(feature_list):
    plt.subplot(4, 6, i + 1)
    sns.boxplot(y=df[col])
    plt.title(col)
# Adjust spacing once, after all subplots are drawn
plt.tight_layout()
plt.show()

Several columns contain outliers.

In [None]:
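def remove_outlier(col):
    # Despite its name, this returns the 1.5*IQR lower and upper bounds
    # used for capping; it does not remove any rows.
    Q1, Q3 = col.quantile([0.25, 0.75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range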

In [None]:
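# Cap each column at its IQR bounds rather than dropping rows.
# Note: this loop also caps the target 'usr' and the dummy-encoded column.
for i in df.columns:
    LL, UL = remove_outlier(df[i])
    df[i] = np.where(df[i] > UL, UL, df[i])
    df[i] = np.where(df[i] < LL, LL, df[i])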
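
As a worked illustration of the 1.5 × IQR capping rule, the sketch below applies it to two hypothetical series. `iqr_bounds` is an illustrative helper equivalent in spirit to the bounds function used in the notebook, and `Series.clip` stands in for the two `np.where` calls.

```python
import pandas as pd

def iqr_bounds(col: pd.Series, k: float = 1.5):
    """Lower and upper capping bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = col.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical numeric column with one extreme value
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])
low, high = iqr_bounds(s)               # Q1 = 3, Q3 = 7, IQR = 4
capped = s.clip(lower=low, upper=high)  # 100.0 is pulled down to the upper bound
print(low, high, capped.tolist())

# Hypothetical sparse 0/1 flag: Q1 = Q3 = 0, so IQR = 0 and both bounds are 0
flag = pd.Series([0] * 9 + [1])
flag_low, flag_high = iqr_bounds(flag)
capped_flag = flag.clip(lower=flag_low, upper=flag_high)
print(capped_flag.tolist())
```

The second case shows a side effect worth checking with `value_counts()` after capping: a column dominated by one value has an IQR of zero, so capping turns it into a constant.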

In [None]:
plt.figure(figsize=(12, 8))
feature_list = df.columns
for i, col in enumerate(feature_list):
    plt.subplot(4, 6, i + 1)
    sns.boxplot(y=df[col])
    plt.title(col)
# Adjust spacing once, after all subplots are drawn
plt.tight_layout()
plt.show()

In [None]:
df.head(5)

In [None]:
df['pgscan'].value_counts()

In [None]:
df['pgout'].value_counts()

In [None]:
df.drop(['pgscan'], inplace=True, axis=1)

In [None]:
df.shape

## 1.3 Modeling

Encode the data (having string values) for modelling. Split the data into train and test sets (70:30). Apply linear regression using scikit-learn. Check for significant variables using an appropriate method from statsmodels. Create multiple models and check the performance of predictions on the train and test sets using R-squared, RMSE and adjusted R-squared. Compare these models and select the best one with appropriate reasoning.

TODO: confirm which scikit-learn method is used.

### 1.3.2 Split Data for Stats & Linear Models

In [None]:
# independent variables
X = df.drop(['usr'], axis=1)
# dependent variable
y = df[['usr']]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.30,
                                                    random_state=1)

Stats Model:

### Check Multi-collinearity using VIF:

In [None]:
# Compute VIF for each predictor to detect multicollinearity
vif = pd.DataFrame()
vif['feature'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif.sort_values('VIF')

In [None]:
# Drop features with VIF scores greater than 5
to_drop = vif[vif["VIF"] > 5]["feature"].values
X_train = X_train.drop(to_drop, axis=1)
X_test = X_test.drop(to_drop, axis=1)
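
For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The sketch below computes it with numpy only and includes an intercept in each auxiliary regression; `vif_table` and the synthetic data are assumptions for illustration. Its values can differ from the `variance_inflation_factor` call above, which is applied to `X_train` without a constant column.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    X = features.to_numpy(dtype=float)
    n, p = X.shape
    vifs = {}
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        vifs[features.columns[j]] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF").sort_values(ascending=False)

# Synthetic data (hypothetical): a_plus_b is almost exactly a + b
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)
toy = pd.DataFrame({"a": a, "b": b, "a_plus_b": a + b + rng.normal(scale=0.05, size=500)})

vifs = vif_table(toy)
vifs_reduced = vif_table(toy.drop(columns="a_plus_b"))
print(vifs)
print(vifs_reduced)
```

In this toy frame all three columns exceed the cutoff, yet removing any one of them resolves the collinearity. That is why dropping the highest-VIF feature one at a time and recomputing is often preferred to dropping everything above the cutoff at once.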

In [None]:
# Add constant to X_train and X_test
X_trainc = sm.add_constant(X_train)
X_testc = sm.add_constant(X_test)

In [None]:
X_trainc.head()

In [None]:
X_testc.head()

Linear Model:

In [None]:
X_train.head()

In [None]:
X_test.head()

### 1.3.3 Fit Stats linear model using OLS

In [None]:
ols_model = sm.OLS(y_train, X_trainc)
ols_results = ols_model.fit()

In [None]:
# let's print the regression summary
# print the summary statistics for the training set

print("Training set:")
print(ols_results.summary())

In [None]:
# make a prediction for the testing set
ytest_predict_stats = ols_results.predict(X_testc)
print("Predicted y:", ytest_predict_stats)

In [None]:
# calculate the RMSE and R-squared for the training set
train_rmse = np.sqrt(mean_squared_error(y_train, ols_results.predict(X_trainc)))
train_r_squared = ols_results.rsquared
print("Training set RMSE:", train_rmse)
print("Training set R-squared:", train_r_squared)

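# calculate the RMSE and R-squared for the testing set
test_rmse = np.sqrt(mean_squared_error(y_test, ytest_predict_stats))
# y_test is a one-column DataFrame, so select 'usr' to align with the predicted Series
y_test_values = y_test['usr']
test_r_squared = 1 - (np.sum((y_test_values - ytest_predict_stats)**2) / np.sum((y_test_values - y_test_values.mean())**2))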
print("Testing set RMSE:", test_rmse)
print("Testing set R-squared:", test_r_squared)

RSquared:

In [None]:
print('The variation in the dependent variable which is explained by the independent variables is',round(ols_results.rsquared*100,4),'%')

RMSE:

In [None]:
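# np.sqrt is used because the `squared=False` argument is no longer available in recent scikit-learn releases
print("The Root Mean Square Error (RMSE) of the model for the training set is",np.sqrt(mean_squared_error(y_train,ols_results.fittedvalues)))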

In [None]:
print("The Root Mean Square Error (RMSE) of the model for the testing set is",np.sqrt(mean_squared_error(y_test,ytest_predict_stats)))

In [None]:
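# Fit the initial model
initial_model = sm.OLS(y_train, X_trainc).fit()

# Print the summary of the initial model
print(initial_model.summary())

# Backward elimination: repeatedly drop the least significant feature
# (largest p-value above 0.05) and refit the model
model = initial_model  # ensures `model` exists even if nothing is dropped
p_values = model.pvalues.drop('const')
while len(p_values) > 0 and p_values.max() > 0.05:
    X_trainc = X_trainc.drop(columns=p_values.idxmax())
    model = sm.OLS(y_train, X_trainc).fit()
    p_values = model.pvalues.drop('const')

# Keep the same columns in the test design matrix so it matches the final model
X_testc = X_testc[X_trainc.columns]

# Print the summary of the final model
print(model.summary())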

### Linear Regression model

Fit the model (no feature scaling is applied):

In [None]:
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

In [None]:
ytrain_predict_linear = linear_model.predict(X_train)
ytest_predict_linear= linear_model.predict(X_test)
#print("Predicted y:", ytrain_predict_linear)

In [None]:
# print the summary statistics for the training set
print("Training set:")
print("R-squared:", linear_model.score(X_train, y_train))
print("Intercept:", linear_model.intercept_)
print("Coefficient:", linear_model.coef_)

### Linear Regression model evaluation:

In [None]:
# calculate the RMSE for the training set
train_rmse = np.sqrt(mean_squared_error(y_train, ytrain_predict_linear))
# Get the score on training set:
print('The coefficient of determination R^2 of the prediction on Train set',linear_model.score(X_train, y_train))
print("Training set RMSE:", train_rmse)

print(" ")
# calculate the RMSE and R-squared for the testing set
test_rmse = np.sqrt(mean_squared_error(y_test, ytest_predict_linear))
test_r_squared = linear_model.score(X_test, y_test)
print('The coefficient of determination R^2 of the prediction on Test set',test_r_squared)
print("Testing set RMSE:", test_rmse)

In [None]:
scores = cross_val_score(linear_model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')

# Calculate the root mean squared error (RMSE) from the cross-validation scores
rmse_scores = np.sqrt(-scores)

# Print the average RMSE across all folds
print("Average RMSE: ", np.mean(rmse_scores))

### 1.3.5 Best Model

In [None]:
# create a Scikit-learn linear regression model and fit it to the training data
lr_model_sk = LinearRegression()
lr_model_sk.fit(X_train, y_train)

# make predictions on the testing data using the Scikit-learn model
predictions_sk = lr_model_sk.predict(X_test)

# calculate the RMSE for the Scikit-learn model
rmse_sk = np.sqrt(mean_squared_error(y_test, predictions_sk))
print("Scikit-learn model RMSE:", rmse_sk)

# create an OLS stats linear regression model and fit it to the training data
X_train_stats = sm.add_constant(X_train)
ols_model = sm.OLS(y_train, X_train_stats).fit()

# make predictions on the testing data using the OLS stats model
X_test_stats = sm.add_constant(X_test)
predictions_stats = ols_model.predict(X_test_stats)

# calculate the RMSE for the OLS stats model
rmse_stats = np.sqrt(mean_squared_error(y_test, predictions_stats))
print("OLS stats model RMSE:", rmse_stats)

## 1.4 Inference

Based on these predictions, what are the business insights and recommendations?
Please explain and summarise the various steps performed in this project. There should be proper business interpretation and actionable insights present.

In [None]:
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the mean squared error and R^2 score on the test data
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the model coefficients, mean squared error, and R^2 score
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)
print('Mean Squared Error:', mse)
print('R^2 Score:', r2)


Note to remember:

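On the training data, the R-squared (R2) value of a least-squares fit with an intercept ranges from 0 to 1, with 1 indicating a perfect fit between the model and the data. On unseen data it can be negative when the model predicts worse than simply using the mean of the outcome. A higher R2 value indicates that the model explains more of the variance in the data.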

The root mean square error (RMSE) is the square root of the average squared difference between the actual and predicted values of the outcome variable. It is measured in the same units as the outcome variable. RMSE is non-negative and has no upper bound; a lower value indicates that the model has better predictive power.

The adjusted R-squared (R2) is a modified version of the R-squared value that adjusts for the number of predictor variables in the model. It typically ranges from negative infinity to 1, with a higher value indicating a better fit between the model and the data. The adjusted R2 penalizes the inclusion of irrelevant predictors in the model and rewards the inclusion of relevant predictors.
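
The three measures above can be written out directly. The sketch below assumes only numpy; the helper names and the toy values are illustrative, and the adjusted R2 uses the usual formula 1 - (1 - R2)(n - 1)/(n - p - 1) for n samples and p predictors.

```python
import numpy as np

def _flat(values) -> np.ndarray:
    # Flatten to 1-D so an (n, 1) column and an (n,) vector do not broadcast to (n, n)
    return np.asarray(values, dtype=float).ravel()

def rmse(y_true, y_pred) -> float:
    y_true, y_pred = _flat(y_true), _flat(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred) -> float:
    y_true, y_pred = _flat(y_true), _flat(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def adjusted_r2(r2_value: float, n_samples: int, n_features: int) -> float:
    return 1.0 - (1.0 - r2_value) * (n_samples - 1) / (n_samples - n_features - 1)

# Toy values (hypothetical): a good fit and a deliberately bad one
y_true = [3.0, 5.0, 7.0, 9.0]
good_pred = [2.0, 5.0, 8.0, 9.0]
bad_pred = [9.0, 7.0, 5.0, 3.0]

good_r2 = r2(y_true, good_pred)
print("RMSE:", rmse(y_true, good_pred))
print("R2:", good_r2)
print("Adjusted R2 (1 predictor):", adjusted_r2(good_r2, n_samples=4, n_features=1))
print("R2 of the bad predictions:", r2(y_true, bad_pred))  # negative: worse than predicting the mean
```

For the test set in this notebook, `n_samples` would be `len(y_test)` and `n_features` the number of columns in `X_test`.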