## Step 1: Data Pre-processing:

The very first step is data pre-processing, which we have already discussed in this tutorial. This process contains the below steps:

### Importing libraries:

First, we will import the libraries that will help in building the model. Below is the code for it:

In [1]:
# importing libraries  
import numpy as nm  
import matplotlib.pyplot as mtp  
import pandas as pd

### Importing dataset:

Now we will import the dataset (`50_Startups.csv`), which contains all the variables. Below is the code for it:

In [11]:
#importing datasets  
data_set= pd.read_csv('50_Startups.csv')  

In [12]:
data_set.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In the above output, we can clearly see that there are five variables: four are continuous and one (`State`) is categorical.
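
A quick way to confirm which variables are continuous and which are categorical is to look at the column dtypes. The sketch below rebuilds the five preview rows in a hypothetical `preview` frame (the name is an assumption made for illustration) and separates numeric from non-numeric columns:

```python
import pandas as pd

# Hypothetical mini-frame holding the five preview rows shown above.
preview = pd.DataFrame({
    "R&D Spend": [165349.20, 162597.70, 153441.51, 144372.41, 142107.34],
    "Administration": [136897.80, 151377.59, 101145.55, 118671.85, 91391.77],
    "Marketing Spend": [471784.10, 443898.53, 407934.54, 383199.62, 366168.42],
    "State": ["New York", "California", "Florida", "New York", "Florida"],
    "Profit": [192261.83, 191792.06, 191050.39, 182901.99, 166187.94],
})

# Numeric columns are the continuous variables; the rest are categorical.
continuous = preview.select_dtypes(include="number").columns.tolist()
categorical = preview.select_dtypes(exclude="number").columns.tolist()

print(continuous)
print(categorical)
```

On the full dataset, the same two `select_dtypes` calls could be applied to `data_set` directly.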

### Extracting dependent and independent Variables:

In [49]:
#Extracting Independent and dependent Variable  
x= data_set.iloc[:, :-1].values  
y= data_set.iloc[:, 4].values  

As we saw in the dataset preview, the last column of `x` (`State`) contains a categorical variable, which is not suitable for fitting the model directly. So we need to encode this variable.
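
To make the slicing concrete, here is a minimal sketch on a hypothetical three-row frame (`demo` is an assumed name, with values taken from the preview). It shows that `iloc[:, :-1]` keeps every column except the last, that `iloc[:, 4]` picks `Profit`, and that `State` ends up at index 3 of `x`:

```python
import pandas as pd

# Hypothetical three-row frame with the same column layout as the dataset.
demo = pd.DataFrame({
    "R&D Spend": [165349.20, 162597.70, 153441.51],
    "Administration": [136897.80, 151377.59, 101145.55],
    "Marketing Spend": [471784.10, 443898.53, 407934.54],
    "State": ["New York", "California", "Florida"],
    "Profit": [192261.83, 191792.06, 191050.39],
})

x = demo.iloc[:, :-1].values  # all columns except the last one (Profit)
y = demo.iloc[:, 4].values    # the fifth column (Profit)

print(x.shape, y.shape)
print(x[0, 3])  # State sits at column index 3 of x
```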

### Encoding Dummy Variables:

As we have one categorical variable (State), which cannot be directly applied to the model, we will encode it. One option for encoding a categorical variable into numbers is the `LabelEncoder` class. But it is not sufficient here, because the resulting integer codes still imply a relational order, which may create a wrong model. So in order to remove this problem, we will use `OneHotEncoder`, which will create the dummy variables. The two methods are compared first, followed by the code:

#### Encoding Methods: Label Encoding vs. One-Hot Encoding

##### 1. Label Encoding
- **What It Does**: Converts each category into a numeric label (e.g., `0, 1, 2, ...`).
- **Example**:  
  - Categories: `["Red", "Blue", "Green"]`  
  - Encoded: `[0, 1, 2]`
- **Usage**:  
  - Useful when the categorical variable has an **ordinal relationship** (e.g., `Low < Medium < High`).
  - **Caution**: Introduces **implicit order**, which might not make sense for **nominal categories** like `State` or `Color`.

##### 2. One-Hot Encoding
- **What It Does**: Converts each category into a separate binary column (dummy variables).
- **Example**:  
  - Categories: `["Red", "Blue", "Green"]`  
  - Encoded:  
    ```
    [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]
    ```
- **Usage**:  
  - Suitable for **nominal categories** (no inherent order), such as `State` or `City`.
  - Avoids introducing **artificial relationships** between categories.
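
The difference between the two encodings can be seen in a small sketch using scikit-learn's `LabelEncoder` and `OneHotEncoder` on the color example. Note that scikit-learn orders categories alphabetically, so the exact codes differ from the illustrative `[0, 1, 2]` above; the variable names are assumptions made for illustration:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = np.array(["Red", "Blue", "Green"])

# LabelEncoder assigns integer codes following the sorted order of the classes.
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(colors)
print(label_encoder.classes_)
print(labels)

# OneHotEncoder expects 2-D input: one row per sample, one column per feature.
onehot_encoder = OneHotEncoder()
onehot = onehot_encoder.fit_transform(colors.reshape(-1, 1)).toarray()
print(onehot_encoder.categories_[0])
print(onehot)
```

The integer labels suggest an ordering between colors that does not exist, while the one-hot rows treat every color as equally distant from the others.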


In [50]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Apply One-Hot Encoding to the 'State' column (assumed to be at index 3)
column_transformer = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(), [3])  # Apply OneHotEncoder to column index 3
    ],
    remainder='passthrough'  # Keep other columns as they are
)

x = column_transformer.fit_transform(x)


In [53]:
print(x)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3

Here we are encoding only one independent variable, `State`, as the other variables are continuous.

## Understanding Dummy Variables and the Dummy Variable Trap

When using **One-Hot Encoding**, categorical variables are converted into multiple binary columns (dummy variables), each representing a category. Here's a detailed explanation based on the given scenario:

### Dummy Variable Encoding Output
After encoding the `State` column, each unique state is represented as a binary column:
- **California**: `[1, 0, 0]`
- **Florida**: `[0, 1, 0]`
- **New York**: `[0, 0, 1]`

### Correspondence of Columns:
- The **first column** corresponds to `California`.
- The **second column** corresponds to `Florida`.
- The **third column** corresponds to `New York`.
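
The column order listed above can be checked rather than assumed. The sketch below runs the same `ColumnTransformer` setup on a hypothetical three-row stand-in for `x` (`x_demo` is an assumed name) and reads the category order from the fitted encoder:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical three-row stand-in for x: three numeric columns, then State at index 3.
x_demo = np.array([
    [165349.20, 136897.80, 471784.10, "New York"],
    [162597.70, 151377.59, 443898.53, "California"],
    [153441.51, 101145.55, 407934.54, "Florida"],
], dtype=object)

column_transformer = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), [3])],
    remainder="passthrough",
)
encoded = column_transformer.fit_transform(x_demo)

# categories_ lists the states in the same order as the dummy columns.
state_order = column_transformer.named_transformers_["onehot"].categories_[0]
print(list(state_order))
print(encoded[:, :3])
```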

This allows the model to differentiate between states using binary values. However, if we include **all three dummy variables**, it introduces a redundancy, as the values in the third column can be derived from the first two. This redundancy leads to the **dummy variable trap**.

---

### What is the Dummy Variable Trap?
The **dummy variable trap** occurs when:
1. All dummy variables are included in the model together with an intercept, causing perfect **multicollinearity** (the dummy columns always sum to 1, which duplicates the intercept term).
2. The model becomes unable to uniquely determine the effect of each variable because one variable can be expressed as a linear combination of others.

For example:
- If `California = 0` and `Florida = 0`, it implies `New York = 1`; in general, `New York = 1 - California - Florida`.  
Thus, one column is always dependent on the others, leading to redundant information.
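
The redundancy can be checked numerically. The following sketch, using five hypothetical rows of dummy values, compares the number of columns in the design matrix with its rank, with and without the first dummy column:

```python
import numpy as np

# One-hot dummies for five hypothetical rows: columns are California, Florida, New York.
dummies = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
], dtype=float)
intercept = np.ones((dummies.shape[0], 1))

# With all three dummies plus an intercept, the columns are linearly dependent.
full_design = np.hstack([intercept, dummies])
print(full_design.shape[1], np.linalg.matrix_rank(full_design))  # 4 columns, rank 3

# Dropping one dummy restores full column rank.
reduced_design = np.hstack([intercept, dummies[:, 1:]])
print(reduced_design.shape[1], np.linalg.matrix_rank(reduced_design))  # 3 columns, rank 3
```

A rank lower than the number of columns means at least one coefficient cannot be determined uniquely.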

---

### Avoiding the Dummy Variable Trap
To avoid this issue:
1. **Exclude one dummy variable**: This prevents multicollinearity by removing the redundant column.
2. **Interpretation**: The excluded column becomes the **baseline category**, and the remaining dummy variables compare against it.

For the given dataset:
- Remove the first column (`California`), so the dataset retains only:
  - Florida: `[1, 0]`
  - New York: `[0, 1]`
- California is now implicitly represented when both dummy variables are `0`.
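
As an aside, the same baseline encoding can be produced in one step with pandas. This is an alternative to the approach used in this tutorial, not part of it: `pd.get_dummies` with `drop_first=True` omits the first category, and the `states` series is a hypothetical example:

```python
import pandas as pd

states = pd.Series(["New York", "California", "Florida"], name="State")

# drop_first=True omits the first category in sorted order (California),
# which then acts as the baseline.
dummies = pd.get_dummies(states, drop_first=True).astype(int)
print(dummies)
```

The California row contains only zeros, which is how the baseline category is represented.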

### Code to Avoid the Dummy Variable Trap:

**Now, we are writing a single line of code just to avoid the dummy variable trap. It drops the first column (`California`) from `x`:**

In [54]:
#avoiding the dummy variable trap:  
x = x[:, 1:]  

In [55]:
print(x)

[[0.0 1.0 165349.2 136897.8 471784.1]
 [0.0 0.0 162597.7 151377.59 443898.53]
 [1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 1.0 144372.41 118671.85 383199.62]
 [1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 1.0 131876.9 99814.71 362861.36]
 [0.0 0.0 134615.46 147198.87 127716.82]
 [1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 1.0 120542.52 148718.95 311613.29]
 [0.0 0.0 123334.88 108679.17 304981.62]
 [1.0 0.0 101913.08 110594.11 229160.95]
 [0.0 0.0 100671.96 91790.61 249744.55]
 [1.0 0.0 93863.75 127320.38 249839.44]
 [0.0 0.0 91992.39 135495.07 252664.93]
 [1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 1.0 114523.61 122616.84 261776.23]
 [0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 1.0 94657.16 145077.58 282574.31]
 [1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 1.0 86419.7 153514.11 0.0]
 [0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 1.0 78389.47 153773.43 299737.29]
 [1.0 0.0 73994.56 122782.75 303319.26]
 [1.0 0.0 67532.53 105751.03 304768.73]
 [0.0 1.0 77044.01 99281.34 140574.81]
 [0

As we can see in the above output, the first column has been removed. If we do not remove the first dummy variable, then it may introduce multicollinearity in the model.

**Now we will split the dataset into training and test set. The code for this is given below:**

In [56]:
# Splitting the dataset into training and test set.  
from sklearn.model_selection import train_test_split  
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)  
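
To see what `test_size=0.2` means in practice, here is a small sketch on hypothetical stand-in arrays. It assumes 50 rows, as the file name `50_Startups.csv` suggests; `x_demo` and `y_demo` are illustrative names:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with 50 rows; row i is [2i, 2i + 1] and its target is i.
x_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)
print(x_train.shape, x_test.shape)

# Each feature row stays paired with its own target after shuffling.
print(np.array_equal(x_train[:, 0] // 2, y_train))
```

With 50 rows, 40 go to the training set and 10 to the test set, and fixing `random_state` makes the split reproducible.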

## Step 2: Fitting our MLR model to the Training set:
Now that our dataset is prepared, we will fit our regression model to the training set. This is similar to what we did in the Simple Linear Regression model. The code for this will be:

In [57]:

#Fitting the MLR model to the training set:  
from sklearn.linear_model import LinearRegression  
regressor= LinearRegression()  
regressor.fit(x_train, y_train)  
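
After fitting, the learned parameters are available as `coef_` and `intercept_`. The sketch below uses hypothetical noise-free data generated from a known equation, so the recovered values can be compared with the true ones; it is an illustration, not the startup data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 5 + 3*x1 + 2*x2.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=(20, 2))
y_demo = 5 + 3 * x_demo[:, 0] + 2 * x_demo[:, 1]

regressor = LinearRegression()
regressor.fit(x_demo, y_demo)

print(regressor.intercept_)  # close to 5
print(regressor.coef_)       # close to [3, 2]
```

For the model in this tutorial, `regressor.coef_` would hold one coefficient per column of `x_train`, in the same column order.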

## Step 3: Prediction of Test set results:

The last step for our model is checking the performance of the model. We will do it by predicting the test set result. For prediction, we will create a y_pred vector. Below is the code for it:

In [58]:
#Predicting the Test set result;  
y_pred= regressor.predict(x_test)  

In [59]:
print('Train Score: ', regressor.score(x_train, y_train))  
print('Test Score: ', regressor.score(x_test, y_test))  

Train Score:  0.9501847627493607
Test Score:  0.9347068473282966


The above scores are R² values (coefficients of determination), not accuracy: the model explains about 95% of the variance in profit on the training dataset and about 93% on the test dataset.
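
The value returned by `regressor.score` is the coefficient of determination, R² = 1 - SS_res / SS_tot. A minimal sketch of that formula follows; the helper `r2_score_manual` is a hypothetical name, and the numbers are made up for illustration:

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
print(r2_score_manual(y_true, y_true))     # perfect predictions
print(r2_score_manual(y_true, [6.0] * 4))  # always predicting the mean
```

Perfect predictions give a score of 1, while a model that always predicts the mean of the targets gives 0.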

`Note: In the next topic, we will see how we can improve the performance of the model using the Backward Elimination process.`