# Model Training

#### The EDA-encoded data is not saved, because the entire process needs to be automated.

Why didn't we save that data?

    Because the same preprocessing has to run every time new data arrives, so there is no benefit in saving a one-off encoded copy.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("./data/gemstone.csv")

In [3]:
df.head()

Unnamed: 0,id,carat,cut,color,clarity,depth,table,x,y,z,price
0,0,1.52,Premium,F,VS2,62.2,58.0,7.27,7.33,4.55,13619
1,1,2.03,Very Good,J,SI2,62.0,58.0,8.06,8.12,5.05,13387
2,2,0.7,Ideal,G,VS1,61.2,57.0,5.69,5.73,3.5,2772
3,3,0.32,Ideal,G,VS1,61.6,56.0,4.38,4.41,2.71,666
4,4,1.7,Premium,G,VS2,62.6,59.0,7.65,7.61,4.77,14453


#### 1. Drop the id column

In [4]:
df = df.drop(labels = ["id"], axis=1)

In [5]:
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
0,1.52,Premium,F,VS2,62.2,58.0,7.27,7.33,4.55,13619
1,2.03,Very Good,J,SI2,62.0,58.0,8.06,8.12,5.05,13387
2,0.7,Ideal,G,VS1,61.2,57.0,5.69,5.73,3.5,2772
3,0.32,Ideal,G,VS1,61.6,56.0,4.38,4.41,2.71,666
4,1.7,Premium,G,VS2,62.6,59.0,7.65,7.61,4.77,14453


#### 2. Splitting Independent and Dependent Features

In [6]:
X = df.drop(labels=["price"], axis=1)
Y = df[["price"]]

In [7]:
Y

Unnamed: 0,price
0,13619
1,13387
2,2772
3,666
4,14453
...,...
193568,1130
193569,2874
193570,3036
193571,681


#### 3. Defining which columns should be ordinal encoded and which should be scaled

In [8]:
# select_dtypes is another way to get the numerical and categorical columns; the earlier method works too.
categorical_cols = X.select_dtypes(include='object').columns
numerical_cols = X.select_dtypes(exclude="object").columns

In [9]:
categorical_cols

Index(['cut', 'color', 'clarity'], dtype='object')

In [10]:
numerical_cols

Index(['carat', 'depth', 'table', 'x', 'y', 'z'], dtype='object')
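As a small variation on the cell above, the split can also key on numeric dtype instead of `object`. This is only a sketch on a hypothetical two-row sample (the `sample` frame and its values are made up for illustration); it may be a safer selector if string columns are ever stored under a dtype other than `object`.

```python
import pandas as pd

# Hypothetical sample with the same kinds of columns as the gemstone data
sample = pd.DataFrame({
    "carat": [1.52, 0.70],
    "cut": ["Premium", "Ideal"],
    "depth": [62.2, 61.2],
    "color": ["F", "G"],
})

# Numeric columns are selected by dtype family; everything else is treated as categorical
numerical = sample.select_dtypes(include="number").columns.tolist()
categorical = sample.select_dtypes(exclude="number").columns.tolist()

print(numerical)
print(categorical)
```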

#### 4. Define the custom ranking for each ordinal variable

In [11]:
# Each list is written in the rank order I assigned; a category's position in its list becomes its encoded rank.

cut_categories = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
color_categories = ["D", "E", "F", "G", "H", "I", "J"]
clarity_categories = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]
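To make the ranking concrete, here is a minimal sketch of how an explicit category list turns into integer codes. It uses only the `cut` ordering from the cell above on three made-up values; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# One list of categories per column; here there is a single column
encoder = OrdinalEncoder(categories=[cut_order])

values = np.array([["Ideal"], ["Fair"], ["Premium"]], dtype=object)
encoded = encoder.fit_transform(values)

# Each value is replaced by its zero-based position in cut_order
print(encoded.ravel())
```

The codes start at 0, so "Fair" maps to 0 and "Ideal" maps to 4.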

## This next part is important: how the preprocessing is automated.

#### How do you handle missing values?

    Fill with the mean, median, or mode.

    Until now we used .fillna (the pandas technique). Now we will use a machine learning technique from scikit-learn.

#### 5. Handling Missing Values, Feature Scaling, and Feature Engineering

In [12]:
from sklearn.impute import SimpleImputer  # Handles missing values
from sklearn.preprocessing import StandardScaler  # Handles feature scaling
from sklearn.preprocessing import OrdinalEncoder  # Encodes the ranked categorical values
# Pipeline
from sklearn.pipeline import Pipeline  # Chains steps so that each step's output feeds the next step
from sklearn.compose import ColumnTransformer  # Applies each pipeline to its own columns and combines the results

# SimpleImputer is a univariate imputer that fills missing values using simple strategies.

    # The strategies are "mean", "median", "most_frequent" (the mode), and "constant"

    # If the feature has outliers, use the median
    # If the feature is categorical, use the mode ("most_frequent")
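The strategy choice can be illustrated on a toy example. This is a sketch with made-up values (the `num` and `cat` frames are hypothetical): the numeric column contains an outlier, so the median is a more representative fill value than the mean, and the categorical column is filled with its most frequent value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical columns, each with one missing value
num = pd.DataFrame({"carat": [0.3, 0.5, np.nan, 9.0]})  # 9.0 is an outlier
cat = pd.DataFrame({"cut": ["Ideal", "Ideal", np.nan, "Fair"]}, dtype=object)

# Median of the observed values 0.3, 0.5, 9.0 is 0.5 (the mean would be about 3.27)
num_filled = SimpleImputer(strategy="median").fit_transform(num)

# The mode of the observed values is "Ideal"
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(cat)

print(num_filled.ravel())
print(cat_filled.ravel())
```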

#### Handling missing values

#### Then handling feature scaling

#### Why feature scaling?

        When we apply regression, we should scale the features:

        * So that gradient-based optimization reaches the global minimum faster.

        * So that features with large raw values are brought down to a range comparable with the other features.
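A quick sketch of what standardization does, on made-up numbers: two columns with very different ranges (the `raw` array is hypothetical) both end up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (e.g. carat vs. table)
raw = np.array([
    [0.3, 56.0],
    [0.7, 58.0],
    [1.5, 60.0],
    [2.0, 62.0],
])

scaled = StandardScaler().fit_transform(raw)

# After scaling, every column has mean ~0 and standard deviation ~1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```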

#### Handling feature engineering

        * For feature engineering we specifically use ordinal encoding.

        * The categorical features must be mapped to numbers in an automated way.

#### Why ordinal encoding?

        * Wherever the categories of a feature have a natural rank, we use ordinal encoding.

#### 1. SIMPLE IMPUTER
#### 2. ORDINAL ENCODING (categorical columns only)
#### 3. STANDARD SCALER

(For the categorical columns, the simple imputer's output goes to the ordinal encoder, and the encoded output then goes to the standard scaler. For the numerical columns, the simple imputer's output goes straight to the standard scaler.)

#### This is a pipeline

## So for this we need to implement the pipeline

# What is a pipeline?

        * A pipeline simply chains multiple steps together.

        We can create one with this import:

                from sklearn.pipeline import Pipeline # Connects each step's output to the next step's input (imputer, then encoder and/or scaler).

        (A Pipeline only connects one output to the next within a single chain; we still need to group the separate pipelines together.)

#### Grouping all pipelines

        from sklearn.compose import ColumnTransformer

#### Now we create the pipelines.

    The first pipeline is only for the numerical features.

#### 6. Creating Pipeline

In [14]:
## Numerical Pipeline

num_pipeline = Pipeline(
    steps = [
        ("imputer", SimpleImputer(strategy="median")),
        ("scaling", StandardScaler())
    ]
)

## Categorical Pipeline

cat_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("ordinalencoder", OrdinalEncoder(categories=[cut_categories, color_categories, clarity_categories])),
        # Ordinal encoding produces integer ranks (0, 1, 2, ...), so we scale them as well.
        # With one-hot encoding, this scaling step would not be needed.
        ("scaler", StandardScaler())
    ]
)

preprocessor = ColumnTransformer([
    ("num_pipeline", num_pipeline, numerical_cols),
    ("cat_pipeline", cat_pipeline, categorical_cols)
])

#### We created the numerical pipeline and the categorical pipeline separately.

Now we need to combine them.

That is why we imported ColumnTransformer.

The preprocessor above combines both pipelines, each applied to its own set of columns.

#### 7. Train, Test Split

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.30, random_state=42)

#### What do we need to do with X_train?

    We call fit_transform on X_train using the preprocessor variable.
    (The preprocessor is the ColumnTransformer created above to combine the pipelines.)

#### Use fit_transform only on X_train; use transform on X_test.
#### Otherwise the preprocessing statistics would be learned from the test data, which is data leakage.
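The fit-on-train, transform-on-test rule can be sketched with a single scaler on made-up numbers (the `train` and `test` arrays are hypothetical). The scaler learns its mean and standard deviation from the training rows only, and the test row is scaled with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the train statistics

print(scaler.mean_)   # mean of the training column
print(test_scaled)    # the test value expressed in training-set units
```

Calling `fit_transform` on the test data instead would replace the training statistics with test statistics, which is exactly the leakage the note above warns about.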

In [16]:
preprocessor.fit_transform(X_train)  # Missing values are imputed, categories are encoded,
# and features are scaled automatically.

array([[-0.82314374, -1.12998781, -0.64189666, ...,  0.87410007,
        -0.93674681,  1.35074594],
       [ 0.94502267, -1.77782269,  0.92190185, ..., -1.13764403,
         0.91085333,  0.68445511],
       [ 1.9584839 ,  0.16568195,  0.40063568, ..., -0.13177198,
         0.91085333,  0.01816428],
       ...,
       [ 0.92345966,  0.90606467,  0.40063568, ..., -0.13177198,
         0.29498662,  0.01816428],
       [-1.03877378, -0.66724861, -0.64189666, ..., -1.13764403,
         0.29498662,  2.01703677],
       [-1.03877378, -0.01941373,  0.92190185, ..., -1.13764403,
         0.29498662, -1.31441737]])

In [17]:
# Convert the transformed arrays into DataFrames

X_train = pd.DataFrame(preprocessor.fit_transform(X_train), columns=preprocessor.get_feature_names_out())
X_test = pd.DataFrame(preprocessor.transform(X_test), columns=preprocessor.get_feature_names_out())

In [18]:
X_train.head()

Unnamed: 0,num_pipeline__carat,num_pipeline__depth,num_pipeline__table,num_pipeline__x,num_pipeline__y,num_pipeline__z,cat_pipeline__cut,cat_pipeline__color,cat_pipeline__clarity
0,-0.823144,-1.129988,-0.641897,-0.780451,-0.835103,-0.876024,0.8741,-0.936747,1.350746
1,0.945023,-1.777823,0.921902,1.073226,1.166389,0.946633,-1.137644,0.910853,0.684455
2,1.958484,0.165682,0.400636,1.703116,1.755063,1.742237,-0.131772,0.910853,0.018164
3,-0.995648,-0.574701,-0.641897,-1.122391,-1.161138,-1.165334,0.8741,-0.32088,2.017037
4,-0.995648,0.25823,0.400636,-1.176382,-1.152082,-1.136403,-1.137644,1.52672,-0.648127


In [19]:
X_test.head()

Unnamed: 0,num_pipeline__carat,num_pipeline__depth,num_pipeline__table,num_pipeline__x,num_pipeline__y,num_pipeline__z,cat_pipeline__cut,cat_pipeline__color,cat_pipeline__clarity
0,-0.629077,0.25823,-0.12063,-0.600482,-0.581521,-0.572248,0.8741,-1.552614,-0.648127
1,2.605374,-2.148014,-0.12063,2.126042,2.198832,1.959219,-1.137644,0.294987,-1.314417
2,-1.125026,-1.222536,0.921902,-1.374347,-1.414721,-1.46911,-0.131772,-0.936747,2.017037
3,-1.017211,-0.574701,0.921902,-1.158385,-1.161138,-1.194265,-0.131772,1.52672,2.017037
4,0.858771,0.628421,-0.641897,0.947248,0.985258,1.004495,0.8741,0.910853,-0.648127


#### 8. Model Training

In [20]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error 

In [21]:
regression = LinearRegression()
regression.fit(X_train, y_train)

In [22]:
regression.coef_

array([[ 6432.97591819,  -132.34206204,   -70.48787525, -1701.38593925,
         -494.17005097,   -76.32351645,    68.80035873,  -464.67990411,
          652.10059539]])

In [23]:
regression.intercept_

array([3976.8787389])
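A bare coefficient array is easier to read when each value is paired with its feature name. The following is a sketch on synthetic data, not the gemstone data: the `X_demo` and `y_demo` names and the generating formula are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known relationship: y = 3 * carat - 1 * depth + 5
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(50, 2)), columns=["carat", "depth"])
y_demo = 3.0 * X_demo["carat"] - 1.0 * X_demo["depth"] + 5.0

model_demo = LinearRegression().fit(X_demo, y_demo)

# np.ravel flattens coef_ whether it is 1-D or 2-D (it is 2-D when y is a DataFrame)
coef_table = pd.Series(np.ravel(model_demo.coef_), index=X_demo.columns)
print(coef_table)
print(model_demo.intercept_)
```

The same `pd.Series(np.ravel(...), index=...)` pattern should apply to `regression.coef_` with `X_train.columns` as the index.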

In [24]:
# Next we want predictions: create a y_pred variable and call predict on the test data.

# But we want to automate this step as well.

In [30]:
import numpy as np

# This function computes the evaluation metrics for a model
def evaluate_model(true, predicted):
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)  # Reuse the MSE computed above
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square
# Adjusted R-squared can also be computed from r2_square using its formula.
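The adjusted R-squared mentioned in the comment uses the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features. A minimal sketch follows; the helper name `adjusted_r2` is hypothetical and not part of the notebook.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: penalizes R-squared for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Illustrative values only: R-squared of 0.9 with 11 samples and 5 features
example = adjusted_r2(0.9, n_samples=11, n_features=5)
print(example)
```

With the notebook's data, it could be called as `adjusted_r2(r2_square, X_test.shape[0], X_test.shape[1])`.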

#### In a real-world scenario, we do not train just one model; we train multiple models and compare them.

In [35]:
#### Training multiple models

models = {
    "Linear Regression" : LinearRegression(),
    "Lasso" : Lasso(),
    "Ridge" : Ridge(),
    "ElasticNet" : ElasticNet()
}

# Iterate over every model, fit it, and evaluate it on the test set.

trained_model_list = []
model_list = []
r2_list = []

for name, model in models.items():
    model.fit(X_train, y_train)
    trained_model_list.append(model)

    # Make predictions on the held-out test set
    y_pred = model.predict(X_test)
    mae, rmse, r2_square = evaluate_model(y_test, y_pred)

    print(name)
    model_list.append(name)

    # Note: despite the label, these metrics are computed on the test set
    print("MODEL TRAINING PERFORMANCE")
    print("RMSE:", rmse)
    print("MAE:", mae)
    print("R2_SQUARE", r2_square*100)

    r2_list.append(r2_square)

    print("-"*35)
    print("\n")

# Whichever model has the highest R-squared is the one to pick.

Linear Regression
MODEL TRAINING PERFORMANCE
RMSE: 1014.6296630375463
MAE: 675.075827006748
R2_SQUARE 93.62906819996049
-----------------------------------


Lasso
MODEL TRAINING PERFORMANCE
RMSE: 1014.6591302750638
MAE: 676.2421173665508
R2_SQUARE 93.62869814082755
-----------------------------------


Ridge
MODEL TRAINING PERFORMANCE
RMSE: 1014.6343233534445
MAE: 675.1077629781357
R2_SQUARE 93.62900967491628
-----------------------------------


ElasticNet
MODEL TRAINING PERFORMANCE
RMSE: 1533.3541245902313
MAE: 1060.9432977143008
R2_SQUARE 85.44967219374031
-----------------------------------




In [36]:
model_list

['Linear Regression', 'Lasso', 'Ridge', 'ElasticNet']
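Picking the model with the highest R-squared can be automated too. This sketch hardcodes the model names and the R-squared values (rounded from the printed output above, divided by 100) so that it runs on its own; in the notebook, `model_list` and `r2_list` would be used directly.

```python
# Values copied (rounded) from the evaluation output above
model_names = ["Linear Regression", "Lasso", "Ridge", "ElasticNet"]
r2_scores = [0.9362907, 0.9362870, 0.9362901, 0.8544967]

# Pair each name with its score and keep the pair with the largest score
best_name, best_r2 = max(zip(model_names, r2_scores), key=lambda pair: pair[1])

print(best_name, best_r2)
```

The first three scores differ only from the sixth decimal place on, so the choice between them is effectively a tie on this metric.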