### Q1. Problem Statement: Model Evaluation Metrics for Regression

- Write a Python program that reads the winequality-white.csv file into a DataFrame. Complete the tasks below while building the different models, then evaluate the models on `RMSE, MAPE, RMSLE.`

1. Load the given dataset into a data frame

2. Find missing values and drop them if you find any

3. Check data types for all features

4. Extract dependent and independent variables into the y & x data frame ("alcohol" is our dependent feature)

5. Split your data into train and test, by 20% as test size

6. Create a new data frame for comparing all models, with columns for model name, RMSE, MAPE, and RMSLE

7. Build linear regression, SVM, ridge, lasso, and decision tree models, measure their RMSE, MAPE, and RMSLE, and fill in the final data frame

**Step - 1:** `Importing Libraries`

In [4]:
import pandas as pd 
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split

**Step - 2:** `Loading given CSV file into DataFrame`

In [13]:
df = pd.read_csv("WineQT.csv")

**Step - 3:** `Exploring Data`

In [14]:
df.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,2
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,3
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,4
5,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5,5
6,7.9,0.6,0.06,1.6,0.069,15.0,59.0,0.9964,3.3,0.46,9.4,5,6
7,7.3,0.65,0.0,1.2,0.065,15.0,21.0,0.9946,3.39,0.47,10.0,7,7
8,7.8,0.58,0.02,2.0,0.073,9.0,18.0,0.9968,3.36,0.57,9.5,7,8
9,6.7,0.58,0.08,1.8,0.097,15.0,65.0,0.9959,3.28,0.54,9.2,5,10


In [15]:
df.tail(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
1133,6.7,0.32,0.44,2.4,0.061,24.0,34.0,0.99484,3.29,0.8,11.6,7,1584
1134,7.5,0.31,0.41,2.4,0.065,34.0,60.0,0.99492,3.34,0.85,11.4,6,1586
1135,5.8,0.61,0.11,1.8,0.066,18.0,28.0,0.99483,3.55,0.66,10.9,6,1587
1136,6.3,0.55,0.15,1.8,0.077,26.0,35.0,0.99314,3.32,0.82,11.6,6,1590
1137,5.4,0.74,0.09,1.7,0.089,16.0,26.0,0.99402,3.67,0.56,11.6,6,1591
1138,6.3,0.51,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6,1592
1139,6.8,0.62,0.08,1.9,0.068,28.0,38.0,0.99651,3.42,0.82,9.5,6,1593
1140,6.2,0.6,0.08,2.0,0.09,32.0,44.0,0.9949,3.45,0.58,10.5,5,1594
1141,5.9,0.55,0.1,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6,1595
1142,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5,1597


In [16]:
df.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
count,1143.0,1143.0,1143.0,1143.0,1143.0,1143.0,1143.0,1143.0,1143.0,1143.0,1143.0,1143.0,1143.0
mean,8.311111,0.531339,0.268364,2.532152,0.086933,15.615486,45.914698,0.99673,3.311015,0.657708,10.442111,5.657043,804.969379
std,1.747595,0.179633,0.196686,1.355917,0.047267,10.250486,32.78213,0.001925,0.156664,0.170399,1.082196,0.805824,463.997116
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0,0.0
25%,7.1,0.3925,0.09,1.9,0.07,7.0,21.0,0.99557,3.205,0.55,9.5,5.0,411.0
50%,7.9,0.52,0.25,2.2,0.079,13.0,37.0,0.99668,3.31,0.62,10.2,6.0,794.0
75%,9.1,0.64,0.42,2.6,0.09,21.0,61.0,0.997845,3.4,0.73,11.1,6.0,1209.5
max,15.9,1.58,1.0,15.5,0.611,68.0,289.0,1.00369,4.01,2.0,14.9,8.0,1597.0


**Step - 3:** `Checking Data Types`

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1143 entries, 0 to 1142
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1143 non-null   float64
 1   volatile acidity      1143 non-null   float64
 2   citric acid           1143 non-null   float64
 3   residual sugar        1143 non-null   float64
 4   chlorides             1143 non-null   float64
 5   free sulfur dioxide   1143 non-null   float64
 6   total sulfur dioxide  1143 non-null   float64
 7   density               1143 non-null   float64
 8   pH                    1143 non-null   float64
 9   sulphates             1143 non-null   float64
 10  alcohol               1143 non-null   float64
 11  quality               1143 non-null   int64  
 12  Id                    1143 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 116.2 KB


**Step - 4:** `Checking Null Values`

In [18]:
df.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
Id                      0
dtype: int64
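
The dataset above has no missing values, so nothing needs to be dropped. For completeness, here is a minimal sketch of how the "drop them if you find any" task could look. The `toy` frame is a hypothetical stand-in, not the wine data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value (not the wine data)
toy = pd.DataFrame({
    "alcohol": [9.4, np.nan, 9.8],
    "pH": [3.51, 3.20, 3.26],
})

# Count missing values per column, as df.isnull().sum() does above
missing_per_column = toy.isnull().sum()

# Drop every row that contains at least one NaN and renumber the index
toy_clean = toy.dropna().reset_index(drop=True)

print(missing_per_column)
print(toy_clean.shape)
```

On the real data the same two calls would be applied to `df` before extracting x and y.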

**Step - 5:** `Extract our dependent variable into y & independent into x`

In [19]:
# Everything is fine: there are no missing values, so nothing needs to be dropped.
# Now extract the dependent variable into y and the independent variables into x.

x = df.drop(columns="alcohol")
y = df["alcohol"]

**Step - 6:** `Splitting into Train and Test`

In [20]:
# Hold out 20% of the rows as the test set; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=100)

# Report the shapes of both splits
print("Shape of train data:", x_train.shape, y_train.shape)
print("Shape of test data:", x_test.shape, y_test.shape)

Shape of train data: (914, 12) (914,)
Shape of test data: (229, 12) (229,)
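
The 914/229 split can be checked by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to the next whole row and gives the remainder to the training set; the variable names are illustrative only.

```python
import math

n_samples = 1143   # rows reported by df.info()
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the test count is rounded up, and the rest goes to train
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

These counts match the shapes printed above.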


**Step - 7:** `Create a new DataFrame for a comparison of all models containing columns as Model Name, RMSE, MAPE, RMSLE`

In [None]:
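# Empty results table with one column per metric.
# Note: this reuses the name df, which held the wine data until now;
# x and y were already extracted in Step 5, so they are unaffected.
data = {'Model Name':[],'MAPE':[],'RMSE':[],'RMSLE':[]}
df = pd.DataFrame(data)
df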

Unnamed: 0,Model Name,MAPE,RMSE,RMSLE


**Step - 8:** `Model Building`

In [21]:
from sklearn.linear_model import LinearRegression

# 1. Linear Regression Model
model_ler = LinearRegression().fit(x_train, y_train)
pred = model_ler.predict(x_test)

# 2. RMSE Calculation (Root Mean Squared Error)
rmse = np.sqrt(np.mean((y_test - pred) ** 2))

# 3. MAPE (Mean Absolute Percentage Error)
mape = np.mean(np.abs((y_test - pred) / y_test)) * 100

# 4. RMSLE (Root Mean Squared Logarithmic Error)
rmsle = np.sqrt(np.mean((np.log(pred + 1) - np.log(y_test + 1)) ** 2))

# Storing results in DataFrame
df1 = pd.DataFrame({'Model Name': ['Linear Regression'], 'MAPE': [mape], 'RMSE': [rmse], 'RMSLE': [rmsle]})

# Start a fresh results DataFrame and add this model's row
# (this is the first model, so resetting df loses nothing)
df = pd.DataFrame()
df = pd.concat([df, df1], ignore_index=True)

# Display the results
df

Unnamed: 0,Model Name,MAPE,RMSE,RMSLE
0,Linear Regression,4.055171,0.545484,0.047833
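
The same three formulas are repeated in every model cell below. As a minimal sketch, they could be wrapped in one helper; the function name `regression_metrics` and the demo values are assumptions for illustration and are not part of the original notebook. `np.log1p(v)` is equivalent to `np.log(v + 1)`.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAPE (in percent) and RMSLE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # RMSE: square root of the mean squared difference
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

    # MAPE: mean absolute error relative to the true value, in percent
    # (undefined if any true value is zero)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

    # RMSLE: RMSE computed on log(1 + value); requires values greater than -1
    rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

    return {"RMSE": rmse, "MAPE": mape, "RMSLE": rmsle}

# Hypothetical demo values: each prediction is off by exactly 1 from a true value of 10
demo = regression_metrics([10.0, 10.0], [9.0, 11.0])
print(demo)
```

In the demo every error is 1 on a true value of 10, so RMSE is 1.0 and MAPE is 10 percent. RMSLE is much smaller because it measures the error on a logarithmic scale.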


In [None]:
# SVM
from sklearn.svm import SVR  

# Assuming x_train, x_test, y_train, y_test are already defined

# 1. Support Vector Regression (SVR) Model
model_svr = SVR().fit(x_train, y_train)
pred = model_svr.predict(x_test)

# 2. RMSE Calculation (Root Mean Squared Error)
rmse = np.sqrt(np.mean((y_test - pred) ** 2))

# 3. MAPE (Mean Absolute Percentage Error)
mape = np.mean(np.abs((y_test - pred) / y_test)) * 100

# 4. RMSLE (Root Mean Squared Logarithmic Error)
rmsle = np.sqrt(np.mean((np.log(pred + 1) - np.log(y_test + 1)) ** 2))

# Storing results in DataFrame
df1 = pd.DataFrame({'Model Name': ['SVM Regression'], 'MAPE': [mape], 'RMSE': [rmse], 'RMSLE': [rmsle]})

# Note: df is re-created here, so rows from earlier models are discarded
# and the table below contains only this model's row
df = pd.DataFrame()
df = pd.concat([df, df1], ignore_index=True)

# Display the results
print(df)


       Model Name      MAPE      RMSE     RMSLE
0  SVM Regression  6.474655  0.928665  0.078149


In [None]:
# ridge Regression
from sklearn.linear_model import Ridge

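# Manually normalize X (since normalize=True is deprecated)
# Note: x_test is divided by its own column norms, not the training norms,
# so the train and test features end up on different scales.
x_train_norm = x_train / np.linalg.norm(x_train, axis=0)
x_test_norm = x_test / np.linalg.norm(x_test, axis=0)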

# Create and train Ridge Regression model
ridgeReg = Ridge(alpha=0.0005)
ridgeReg.fit(x_train_norm, y_train)

# Predictions
pred = ridgeReg.predict(x_test_norm)

# 1. RMSE (Root Mean Squared Error)
rmse = np.sqrt(np.mean((y_test - pred) ** 2))

# 2. MAPE (Mean Absolute Percentage Error)
mape = np.mean(np.abs((y_test - pred) / y_test)) * 100

# 3. RMSLE (Root Mean Squared Logarithmic Error)
rmsle = np.sqrt(np.mean((np.log(pred + 1) - np.log(y_test + 1)) ** 2))

# Creating a DataFrame to store the results
df1 = pd.DataFrame({'Model Name': ['Ridge Regression'], 'MAPE': [mape], 'RMSE': [rmse], 'RMSLE': [rmsle]})

# Note: df is re-created here, so rows from earlier models are discarded
# and the table below contains only this model's row
df = pd.DataFrame()
df = pd.concat([df, df1], ignore_index=True)

# Display results
df


Unnamed: 0,Model Name,MAPE,RMSE,RMSLE
0,Ridge Regression,62.688115,6.500858,0.455113
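
The Ridge errors above are far larger than in the other cells. One plausible cause is that the cell divides `x_test` by column norms computed on the test set itself. A column norm grows with the number of rows, so the 914-row and 229-row sets end up on different scales. The sketch below uses synthetic stand-in arrays (not the wine data) to show the effect and the usual remedy of reusing the training norms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for x_train / x_test: same distribution, different row counts
x_train_demo = rng.normal(loc=5.0, scale=1.0, size=(914, 3))
x_test_demo = rng.normal(loc=5.0, scale=1.0, size=(229, 3))

# Compute the column norms on the training data only
train_norms = np.linalg.norm(x_train_demo, axis=0)
x_train_scaled = x_train_demo / train_norms
x_test_scaled = x_test_demo / train_norms        # reuse the training norms

# For comparison: scaling the test set by its own norms, as in the cell above
x_test_own = x_test_demo / np.linalg.norm(x_test_demo, axis=0)

# How much larger the self-normalized test features are on average
ratio = x_test_own.mean() / x_test_scaled.mean()
print(round(ratio, 2))
```

With these row counts the self-normalized test features are roughly twice as large as they should be, which a model fitted on `x_train_scaled` cannot compensate for.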


In [22]:
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Initialize Lasso Regression model
lassoReg = Lasso(alpha=0.0005)
lassoReg.fit(x_train, y_train)  

# Predictions
pred = lassoReg.predict(x_test)  # Predict on raw X_test

# 1. RMSE (Root Mean Squared Error)
rmse = np.sqrt(np.mean((y_test - pred) ** 2))

# 2. MAPE (Mean Absolute Percentage Error)
mape = np.mean(np.abs((y_test - pred) / y_test)) * 100

# 3. RMSLE (Root Mean Squared Logarithmic Error)
rmsle = np.sqrt(np.mean((np.log(pred + 1) - np.log(y_test + 1)) ** 2))

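# Creating a DataFrame to store the results
df1 = pd.DataFrame({'Model Name': ['Lasso Regression'], 'MAPE': [mape], 'RMSE': [rmse], 'RMSLE': [rmsle]})

# Display results
# Note: df1 is not appended to df in this cell, so the table below
# still shows only the rows that df already contained
df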


Unnamed: 0,Model Name,MAPE,RMSE,RMSLE
0,Linear Regression,4.055171,0.545484,0.047833


In [23]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Initialize Decision Tree Regressor
dtreg = DecisionTreeRegressor(max_depth=5)
dtreg.fit(x_train, y_train)  # Train the model

# Predictions
pred = dtreg.predict(x_test)

# 1. RMSE (Root Mean Squared Error)
rmse = np.sqrt(np.mean((y_test - pred) ** 2))

# 2. MAPE (Mean Absolute Percentage Error)
mape = np.mean(np.abs((y_test - pred) / y_test)) * 100

# 3. RMSLE (Root Mean Squared Logarithmic Error)
rmsle = np.sqrt(np.mean((np.log(pred + 1) - np.log(y_test + 1)) ** 2))

# Creating a DataFrame to store the results
df1 = pd.DataFrame({'Model Name': ['Decision Tree Regression'], 'MAPE': [mape], 'RMSE': [rmse], 'RMSLE': [rmsle]})

# Append to existing DataFrame (or create a new one if df doesn't exist)
try:
    df = pd.concat([df, df1], ignore_index=True)  # Append to existing df
except NameError:
    df = df1  # Create df if it doesn't exist

# Display results
df


Unnamed: 0,Model Name,MAPE,RMSE,RMSLE
0,Linear Regression,4.055171,0.545484,0.047833
1,Decision Tree Regression,5.388244,0.748939,0.06384


**Step - 9:** `Comparing all the models`

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# List to store model results
results = []

# Function to compute evaluation metrics
def evaluate_model(model, model_name):
    model.fit(x_train, y_train)
    pred = model.predict(x_test)

    # 1. RMSE (Root Mean Squared Error)
    rmse = np.sqrt(mean_squared_error(y_test, pred))

    # 2. MAPE (Mean Absolute Percentage Error)
    mape = np.mean(np.abs((y_test - pred) / y_test)) * 100

    # 3. RMSLE (Root Mean Squared Logarithmic Error)
    rmsle = np.sqrt(np.mean((np.log(pred + 1) - np.log(y_test + 1)) ** 2))

    # Append results to list
    results.append({'Model Name': model_name, 'MAPE': mape, 'RMSE': rmse, 'RMSLE': rmsle})

# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=0.0005),
    "Lasso Regression": Lasso(alpha=0.0005),
    "Decision Tree Regression": DecisionTreeRegressor(max_depth=5),
    "SVM Regression": SVR()
}

# Evaluate each model
for name, model in models.items():
    evaluate_model(model, name)

# Create DataFrame
df_results = pd.DataFrame(results)

# Display Results
df_results


Unnamed: 0,Model Name,MAPE,RMSE,RMSLE
0,Linear Regression,4.055171,0.545484,0.047833
1,Ridge Regression,4.464806,0.602671,0.051869
2,Lasso Regression,5.554982,0.772075,0.065461
3,Decision Tree Regression,5.432385,0.770927,0.065471
4,SVM Regression,6.474655,0.928665,0.078149
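
To read off a ranking, the comparison table can be sorted by one of the metrics. The sketch below re-enters the values from the output above into a frame named `df_results_demo` (a name chosen here for illustration) and sorts by RMSE, where lower is better.

```python
import pandas as pd

# Values copied from the Step 9 output above
df_results_demo = pd.DataFrame({
    "Model Name": ["Linear Regression", "Ridge Regression", "Lasso Regression",
                   "Decision Tree Regression", "SVM Regression"],
    "MAPE": [4.055171, 4.464806, 5.554982, 5.432385, 6.474655],
    "RMSE": [0.545484, 0.602671, 0.772075, 0.770927, 0.928665],
    "RMSLE": [0.047833, 0.051869, 0.065461, 0.065471, 0.078149],
})

# Sort ascending by RMSE so the best-performing model comes first
ranked = df_results_demo.sort_values("RMSE").reset_index(drop=True)
best_model = ranked.loc[0, "Model Name"]

print(ranked)
print("Lowest RMSE:", best_model)
```

On this run, linear regression has the lowest value on all three metrics and SVM regression the highest. Lasso and the decision tree are nearly tied, and their order depends on the metric chosen.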
