# Linear Regression with Categorical Variables
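In this tutorial, we fit a linear regression model to a dataset that contains categorical variables. The categorical columns are one-hot encoded with pandas, and the model is fit with ordinary least squares (OLS) in statsmodels.
We use the Car Price Prediction dataset from Kaggle (https://www.kaggle.com/datasets/hellbuoy/car-price-prediction).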

### Step 1: Import Required Libraries

In [10]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

### Step 2: Load and Explore Data

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/yangliuiuk/data/main/CarPrice_Assignment.csv")
df.head()

Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   car_ID            205 non-null    int64  
 1   symboling         205 non-null    int64  
 2   CarName           205 non-null    object 
 3   fueltype          205 non-null    object 
 4   aspiration        205 non-null    object 
 5   doornumber        205 non-null    object 
 6   carbody           205 non-null    object 
 7   drivewheel        205 non-null    object 
 8   enginelocation    205 non-null    object 
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64  
 14  enginetype        205 non-null    object 
 15  cylindernumber    205 non-null    object 
 16  enginesize        205 non-null    int64  
 1

### Step 3: Select and Encode Categorical Variables

In [22]:
# Encode categorical variables using pd.get_dummies
df_encoded = pd.get_dummies(df, columns=['fueltype', 'aspiration', 'doornumber', 'carbody', 'drivewheel', 'enginelocation',
                                            'enginetype', 'cylindernumber', 'fuelsystem'], dtype=float)
df_encoded.head()

Unnamed: 0,car_ID,symboling,CarName,wheelbase,carlength,carwidth,carheight,curbweight,enginesize,boreratio,...,cylindernumber_twelve,cylindernumber_two,fuelsystem_1bbl,fuelsystem_2bbl,fuelsystem_4bbl,fuelsystem_idi,fuelsystem_mfi,fuelsystem_mpfi,fuelsystem_spdi,fuelsystem_spfi
0,1,3,alfa-romero giulia,88.6,168.8,64.1,48.8,2548,130,3.47,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,2,3,alfa-romero stelvio,88.6,168.8,64.1,48.8,2548,130,3.47,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,3,1,alfa-romero Quadrifoglio,94.5,171.2,65.5,52.4,2823,152,2.68,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,4,2,audi 100 ls,99.8,176.6,66.2,54.3,2337,109,3.19,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,5,2,audi 100ls,99.4,176.6,66.4,54.3,2824,136,3.19,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
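
A common variation, not used in the cell above, is to pass `drop_first=True` to `pd.get_dummies`. When every level of a categorical variable gets its own dummy column and the model also has an intercept, the dummy columns sum to the intercept column, so the design matrix is rank-deficient and the individual dummy coefficients are not uniquely determined. The sketch below illustrates this on a small hypothetical DataFrame (`toy`, made up for illustration and not taken from the car price data); `design_rank` is a helper introduced here, not a pandas or statsmodels function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only (not the car price dataset)
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas", "gas", "diesel", "gas"],
    "enginesize": [130, 97, 152, 109, 136, 120],
})

# One dummy column per level (the approach used in this tutorial)
full = pd.get_dummies(toy, columns=["fueltype"], dtype=float)

# Drop the first level so it becomes the reference category
reduced = pd.get_dummies(toy, columns=["fueltype"], drop_first=True, dtype=float)


def design_rank(frame):
    """Return (number of columns, matrix rank) after prepending an intercept column."""
    design = np.column_stack([np.ones(len(frame)), frame.to_numpy(dtype=float)])
    return design.shape[1], int(np.linalg.matrix_rank(design))


full_cols, full_rank = design_rank(full)
reduced_cols, reduced_rank = design_rank(reduced)

print("all levels: columns =", full_cols, "rank =", full_rank)
print("drop_first: columns =", reduced_cols, "rank =", reduced_rank)
```

With all levels kept, the toy design matrix has 4 columns but rank 3; with `drop_first=True` it has 3 columns and rank 3. Each remaining dummy coefficient is then read as a difference from the dropped reference level (here `diesel`).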


### Step 4: Linear Regression

In [23]:
# Select all predictor variables (features) including encoded categorical variables
X = df_encoded.drop(columns=['car_ID', 'CarName', 'price'])  # Drop columns that are not predictors

# Add a constant (intercept) term to the model
X = sm.add_constant(X)

# Define the dependent variable (target variable)
Y = df_encoded['price']

# Split the data into a training set and a testing set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Fit the linear regression model on the training data
model = sm.OLS(Y_train, X_train).fit()

# Get regression summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.948
Model:                            OLS   Adj. R-squared:                  0.931
Method:                 Least Squares   F-statistic:                     55.65
Date:                Thu, 25 Jan 2024   Prob (F-statistic):           8.01e-62
Time:                        08:59:33   Log-Likelihood:                -1458.9
No. Observations:                 164   AIC:                             3000.
Df Residuals:                     123   BIC:                             3127.
Df Model:                          40                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                 -8390.91
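
As a sanity check on the summary header, the degrees of freedom and the adjusted R-squared can be recomputed by hand from the other reported figures. The sketch below uses only numbers copied from the output above; because the reported R-squared is already rounded to three decimals, the recomputed adjusted value is approximate.

```python
# Figures copied from the regression summary above
n_obs = 164        # No. Observations
df_model = 40      # Df Model (predictors, excluding the intercept)
r_squared = 0.948  # R-squared, already rounded to three decimals

# Residual degrees of freedom: observations minus predictors minus the intercept
df_resid = n_obs - df_model - 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print("Df Residuals:", df_resid)
print("Adj. R-squared (approx.):", round(adj_r_squared, 3))
```

Both values agree with the summary (123 and roughly 0.931). The gap between R-squared and adjusted R-squared reflects the fairly large number of predictors relative to 164 training rows.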

In [26]:
# Make predictions on the test data
Y_pred = model.predict(X_test)

# Evaluate the model
rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
r2 = r2_score(Y_test, Y_pred)

print("Root Mean Squared Error:", rmse)
print("R-squared:", r2)

Root Mean Squared Error: 3089.16761282734
R-squared: 0.8791174248055004
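
To make the two metrics concrete, the sketch below computes RMSE and R-squared directly from their definitions with NumPy. The arrays `y_true` and `y_hat` are hypothetical values chosen for illustration; they are not taken from the car price data.

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_hat = np.array([11.0, 11.0, 16.0, 18.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual, in the units of the target
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("RMSE:", rmse_manual)
print("R-squared:", r2_manual)
```

These are the same quantities that `mean_squared_error` (followed by `np.sqrt`) and `r2_score` return. On held-out data, R-squared can be negative when the model predicts worse than the mean of the test targets.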


In [28]:
Y_test.describe()

count       41.000000
mean     13489.894317
std       8995.422247
min       5151.000000
25%       7898.000000
50%       9960.000000
75%      13499.000000
max      41315.000000
Name: price, dtype: float64
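
The summary statistics of `Y_test` give the RMSE some context. One rough way to read it, sketched below with figures copied from the outputs above, is to compare the RMSE with the mean and the standard deviation of the test prices. The variable names here are introduced for illustration only.

```python
# Figures copied from the outputs shown above
rmse = 3089.16761282734    # test RMSE
test_mean = 13489.894317   # mean of Y_test
test_std = 8995.422247     # standard deviation of Y_test

# RMSE as a fraction of the typical price level
rmse_over_mean = rmse / test_mean

# RMSE as a fraction of the spread of prices in the test set
rmse_over_std = rmse / test_std

print(f"RMSE / mean price: {rmse_over_mean:.3f}")
print(f"RMSE / std of price: {rmse_over_std:.3f}")
```

The typical prediction error is about 23% of the average test price and about a third of the standard deviation of test prices. Since the test set has only 41 cars and a long right tail (maximum 41315 against a median of 9960), these ratios should be treated as rough indicators.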