<div align="center">

# Final Project Report

### STAT 306  
### The University of British Columbia  
### Yu Chang #47945050
### Zhuoran Wang  

</div>

## 1. Introduction

#### Background of Research
The Chinese automobile company Geely Auto aims to enter the U.S. market by establishing a local manufacturing facility and producing cars domestically. Success in this market requires a clear understanding of the factors that influence car prices in the U.S. By identifying the car features that American consumers value most, Geely Auto can better align its product offerings with market expectations. This study analyzes how various factors influence car prices, using a suitable multiple linear regression model to inform Geely Auto’s strategic decisions on feature selection and pricing.

#### Motivation for Analysis
The primary motivation for this study is to develop a multiple linear regression model that accurately estimates car prices based on their features. By understanding the relationship between various attributes and car prices, Geely Auto can tailor its production to match U.S. consumer preferences and ensure a competitive market entry. Additionally, the insights gained from this analysis can support feature selection and engineering for optimal pricing strategies.

By employing a multiple linear regression model, this study aims to evaluate the strength and significance of the relationships between car-related features and price. The results will provide actionable insights into which features drive value in the eyes of U.S. consumers. This understanding is critical for guiding Geely Auto’s decisions on which features to prioritize in its designs, ensuring its products meet market expectations while remaining cost-effective.




#### Dataset Overview

Geely Auto intends to sell cars in the U.S. and has hired an automobile consulting firm to analyze the factors that influence car pricing. In particular, the company aims to identify the key factors affecting car prices in the American market, as these may differ significantly from those in the Chinese market.

The dataset was published in 2019. The exact timeframe and methodology of data collection are not stated on the source website. All features are measured in U.S. customary units, and all data points come from vehicles in the U.S., based on various market surveys. The dataset has no missing values and no duplicate rows, which supports data quality for statistical inference.


#### Overview of the variables recorded in the dataset


| Variable | Type | Description | Unit |
|----------|------|-------------|------|
| car_ID | Categorical | The unique ID of each car in the dataset | Unitless |
| symboling | Categorical | The symbol for each car, corresponding to the levels of the feature 'carbody' | Unitless |
| CarName | Categorical | The name of the car | Unitless |
| fueltype | Binary Categorical | The type of fuel the car uses: "gas" or "diesel" | Unitless |
| aspiration | Categorical | The type of aspiration used in the car's engine | Unitless |
| doornumber | Categorical | The number of doors on the car | Unitless |
| carbody | Categorical | The car body category (five levels, e.g. 'sedan', 'wagon') | Unitless |
| drivewheel | Categorical | The drive wheel type of the car | Unitless |
| enginelocation | Categorical | The location of the engine in the car | Unitless |
| wheelbase | Numerical | The distance between the front and rear axles of the vehicle | inches |
| carlength | Numerical | The length of the car | inches |
| carwidth | Numerical | The width of the car | inches |
| carheight | Numerical | The height of the car | inches |
| curbweight | Numerical | The total weight of the vehicle without passengers or cargo, including all necessary operating fluids | lbs |
| enginetype | Categorical | The type of engine used in the vehicle | Unitless |
| cylindernumber | Categorical | The number of cylinders in the car's engine | Unitless |
| enginesize | Numerical | The size of the engine | cubic inches |
| fuelsystem | Categorical | The fuel delivery system of the car | Unitless |
| boreratio | Numerical | The ratio of the cylinder's bore to its stroke, affecting the engine's efficiency and power output | Unitless |
| stroke | Numerical | The distance the piston travels inside the cylinder, impacting engine displacement and performance | inches |
| compressionratio | Numerical | The ratio of the cylinder's maximum to minimum volume, influencing engine efficiency and power generation | Unitless |
| horsepower | Numerical | The power of the engine | hp |
| peakrpm | Numerical | The engine speed, in revolutions per minute, at peak power | rpm |
| citympg | Numerical | The fuel efficiency when driving in the city | miles per gallon |
| highwaympg | Numerical | The fuel efficiency when driving on the highway | miles per gallon |
| price | Numerical | The price of the car | US dollars |

## 2. Analysis



### Preliminary Data Cleaning 


#### a. Similar Representation

The features **carlength**, **carwidth**, **carheight**, **curbweight**, and **wheelbase** all describe the size and dimensions of a car. Including all of them in the model would introduce redundancy and increase the complexity of the analysis. To address this, we select **carlength** as the most representative measure of car size, because it is straightforward to interpret and is expected to be relevant to car pricing. Dropping the remaining features reduces multicollinearity, simplifies the model, and lowers the risk of overfitting.

Similarly, the features **citympg** and **highwaympg** both describe the car’s fuel efficiency, but under different driving conditions. Since we expect **citympg** to better reflect everyday driving for most users, we retain it and remove **highwaympg**. This reduces redundancy without losing critical information about fuel economy.

The features **fuelsystem**, **boreratio**, **stroke**, **compressionratio**, **enginetype**, and **horsepower** are all associated with the performance of the car’s engine. However, many of them are technical and less intuitive to interpret when explaining car prices. We retain **horsepower**, as it is a well-known and widely accepted metric of engine performance with a clear impact on a car's value. The other features are excluded to streamline the analysis and improve the model's interpretability.


#### b. Redundant Features

The feature **car_ID** is only a unique identifier for each car in the dataset, and **symboling** is only a numeric encoding of the feature **carbody**. To simplify the model, these features are excluded from the analysis.

#### c. Complex Encoding

The feature **CarName** is a text-based attribute. Encoding it as a categorical variable would be impractical because of the large number of unique values, which could result in an excessively complex model. In addition, encoding the text as a bag of words would produce a sparse word matrix, which might fit this feature poorly and increase computational cost. Advanced techniques such as word vectors or transfer learning are beyond the scope of this analysis, so this feature is excluded for simplicity.

#### d. Sparse Feature
The categorical feature **carbody** has many levels relative to the size of the dataset. Because our dataset is small, the data points for each car body type are sparse. This sparsity may leave certain car body types underrepresented, so their effects would be estimated poorly. To simplify the model and improve its performance, we have decided to drop this feature.

---

### Data Cleaning for Numerical Features

First, we inspect the pre-cleaned dataset to see which features remain.





Then, we plot the distributions of the numeric features left in the dataset to check the skewness.



1) The distributions of **horsepower** and **enginesize** are right-skewed. Skewed predictors can contribute to violations of linear model assumptions such as linearity and homoscedasticity, and might affect the model’s performance. To address this, **we apply a natural log transformation to these two features to reduce skewness**.

2) The response variable **price** is also skewed. However, we do not usually transform the response as long as the linear model assumptions are met, so we keep it unchanged for now.

After the transformation, the distributions of these two features are closer to symmetric.
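The effect of the log transformation on skewness can be checked numerically as well as visually. The following is a minimal sketch using only the Python standard library. The `horsepower` values are hypothetical and chosen only to be right-skewed, and the helper `skewness` is an illustrative name, not part of the original analysis.

```python
import math

def skewness(values):
    """Skewness as the average cubed z-score (population form)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n

# Hypothetical horsepower values, chosen only to be right-skewed.
horsepower = [60, 68, 70, 76, 88, 95, 110, 116, 160, 262]

# Natural log transformation, as applied to horsepower and enginesize.
log_horsepower = [math.log(v) for v in horsepower]

skew_before = skewness(horsepower)
skew_after = skewness(log_horsepower)
print(f"skewness before: {skew_before:.2f}, after log: {skew_after:.2f}")
```

A skewness closer to zero after the transformation is consistent with a more symmetric distribution, although the histogram remains the more informative check.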

---

### Data Cleaning for Categorical Features
Here is the plot for categorical features:

The class distribution plot above shows that the following three features have highly imbalanced classes:

1) **drivewheel**
2) **enginelocation**
3) **cylindernumber**

Such imbalance can negatively affect the model's overall performance, as the model may overfit to the majority classes while underfitting the minority classes. To avoid this potential problem, we have decided to **exclude** these features.
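Class imbalance can also be summarized as the share of rows that fall in the most common level. The sketch below illustrates the idea with made-up counts. The column contents, the helper `majority_share`, and the 0.9 cutoff are all assumptions for illustration, not values from the dataset or the report.

```python
from collections import Counter

def majority_share(labels):
    """Return the most common level and the fraction of rows it covers."""
    counts = Counter(labels)
    level, count = counts.most_common(1)[0]
    return level, count / len(labels)

# Hypothetical column values; the real counts come from the dataset.
enginelocation = ["front"] * 48 + ["rear"] * 2
doornumber = ["four"] * 28 + ["two"] * 22

THRESHOLD = 0.9  # assumed cutoff for "highly imbalanced", not from the report

for name, column in [("enginelocation", enginelocation), ("doornumber", doornumber)]:
    level, share = majority_share(column)
    flag = "imbalanced" if share > THRESHOLD else "ok"
    print(f"{name}: majority level {level!r} covers {share:.0%} -> {flag}")
```

A rule like this only makes the visual judgment explicit. The choice of cutoff is a modeling decision and should be reported alongside the result.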

---

### Investigate Multicollinearity



We plot the correlation matrix of the remaining numerical features below.



[Plot: correlation matrix of the numerical features]


From the correlation matrix, **horsepower** and **enginesize** are highly correlated. This matches our intuition: a larger engine usually delivers more horsepower because it can burn more fuel and generate more power, much as the height and weight of a child are highly correlated.


To investigate further, we compute the variance inflation factor (VIF) for each feature.

[Table: VIF for each numerical feature]

The VIF for **horsepower** is close to 10 (a VIF of 10 or more indicates severe multicollinearity).

Based on the above analysis, it is reasonable to drop this feature to reduce the risk of multicollinearity in our fitted model.
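For readers unfamiliar with the VIF, it is computed for feature j as 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below implements this definition with the Python standard library only. The feature values are hypothetical, so the resulting VIFs do not match the ones in this report, and the helpers `solve`, `r_squared`, and `vif` are illustrative names.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(y, predictors):
    """R^2 from an OLS fit of y on the predictor columns plus an intercept."""
    rows = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - rss / tss

def vif(features):
    """VIF_j = 1 / (1 - R_j^2), regressing feature j on all the others."""
    out = {}
    for name, column in features.items():
        others = [c for other, c in features.items() if other != name]
        out[name] = 1.0 / (1.0 - r_squared(column, others))
    return out

# Hypothetical values: enginesize and horsepower move together, carlength less so.
features = {
    "enginesize": [90, 97, 108, 110, 122, 130, 141, 152, 181, 209],
    "horsepower": [68, 70, 82, 86, 92, 111, 114, 121, 160, 182],
    "carlength": [158, 171, 165, 177, 169, 176, 189, 173, 184, 187],
}
vifs = vif(features)
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.1f}")
```

In this toy data the two engine-related features inflate each other's VIF, which is the same pattern that motivates dropping one of them.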

---

### Model Fitting and Selection

To select the most suitable model, we perform best subset selection.

* Why best subset selection? Unlike forward/backward selection, which relies on sequential inclusion or removal of covariates, best subset selection evaluates all possible combinations of the covariates. This ensures that the selected model does not depend on the order in which covariates are added.
* To address post-selection inference bias, we split the dataset into two subsets: a selection set and a modeling set. Best subset selection is performed on the selection set to determine the optimal group of variables. This separation keeps the modeling set independent for model fitting and evaluation.
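The two steps above, splitting the data and then searching all covariate subsets on the selection set only, can be sketched as follows. This is a simplified illustration using the Python standard library and synthetic data. The seed, sample size, coefficients, and the helper names `solve`, `rss_of_fit`, and `best_subsets` are assumptions of this sketch, not the procedure actually run for this report.

```python
import itertools
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def rss_of_fit(y, columns):
    """Residual sum of squares from an OLS fit of y on columns plus an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

def best_subsets(y, features):
    """For each subset size, keep the lowest-RSS subset and its adjusted R^2."""
    n = len(y)
    mean_y = sum(y) / n
    tss = sum((v - mean_y) ** 2 for v in y)
    results = {}
    for size in range(1, len(features) + 1):
        best = min(itertools.combinations(features, size),
                   key=lambda s: rss_of_fit(y, [features[f] for f in s]))
        rss = rss_of_fit(y, [features[f] for f in best])
        results[size] = (best, 1 - (rss / (n - size - 1)) / (tss / (n - 1)))
    return results

# Synthetic data: price depends on enginesize and carlength, not on "noise".
rng = random.Random(306)
n = 40
data = {
    "enginesize": [rng.uniform(90, 200) for _ in range(n)],
    "carlength": [rng.uniform(150, 190) for _ in range(n)],
    "noise": [rng.uniform(0, 1) for _ in range(n)],
}
price = [100 * e + 50 * c + rng.gauss(0, 500)
         for e, c in zip(data["enginesize"], data["carlength"])]

# Random split: selection set for choosing variables, modeling set held back.
indices = list(range(n))
rng.shuffle(indices)
selection_idx, modeling_idx = indices[: n // 2], indices[n // 2:]

sel_y = [price[i] for i in selection_idx]
sel_x = {name: [col[i] for i in selection_idx] for name, col in data.items()}
results = best_subsets(sel_y, sel_x)
for size, (subset, adj_r2) in results.items():
    print(size, subset, round(adj_r2, 3))
```

Within a fixed subset size, minimizing the residual sum of squares is equivalent to maximizing adjusted R^2, so the criteria only matter when comparing models of different sizes.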


[Plot: adjusted R^2 by number of covariates]
[Plot: Cp by number of covariates]

* The model with 4 covariates has the highest adjusted R^2; we call it **Model 4**.
* The model with 6 covariates has the Cp value closest to the number of covariates plus one; we call it **Model 6**.
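For reference, the two selection criteria can be written out as follows. The notation is an assumption of this sketch rather than something defined in the report: $n$ is the number of observations in the selection set, $p$ is the number of covariates in the candidate model (excluding the intercept), $P$ is the number of covariates in the full model, and $\mathrm{RSS}$ and $\mathrm{TSS}$ are the residual and total sums of squares.

$$
R^2_{\text{adj}} = 1 - \frac{\mathrm{RSS}_p/(n-p-1)}{\mathrm{TSS}/(n-1)}, \qquad
C_p = \frac{\mathrm{RSS}_p}{\hat{\sigma}^2} - n + 2(p+1), \qquad
\hat{\sigma}^2 = \frac{\mathrm{RSS}_P}{n-P-1}
$$

If a candidate model has little bias, $C_p$ is expected to be close to $p + 1$, which is why the value nearest to the number of covariates plus one is preferred.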

We select the features identified in these two models and fit two multiple linear regression models on the modeling set (test split): one with the four features from Model 4 and another with the six features from Model 6.

[Table: RMSE and adjusted R^2 for Model 4 and Model 6]

From the table, we can see that the model with 6 covariates (**Model 6**) has a smaller RMSE and a higher adjusted R^2 than **Model 4**, so we choose it as our final model.
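As a reminder of what the RMSE comparison measures, the sketch below computes it for two sets of predictions. The prices and predictions are hypothetical round numbers chosen for illustration, and `rmse` is an illustrative helper name.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical prices and predictions from two candidate models.
actual = [10000.0, 15000.0, 20000.0]
pred_a = [11000.0, 14000.0, 21000.0]  # each prediction is off by 1000
pred_b = [10500.0, 15500.0, 19500.0]  # each prediction is off by 500

rmse_a = rmse(actual, pred_a)
rmse_b = rmse(actual, pred_b)
print(f"model A RMSE: {rmse_a:.0f}, model B RMSE: {rmse_b:.0f}")
```

Because RMSE is in the same units as the response, the difference between two models can be read directly in US dollars.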

Now our Model Equation is: 

### Check Model Assumptions

[Plot: residuals versus fitted values]

The residual plot indicates a violation of the constant-variance assumption (that is, heteroscedasticity), as the variance of the residuals appears to increase with the fitted values. To address this, we will transform the response variable y.
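A crude numeric counterpart to reading the residual plot is to compare the spread of the residuals for low and high fitted values. The sketch below does this with hypothetical fitted values and residuals. It is an informal check under these assumptions, not a formal test, and `spread_by_half` is an illustrative helper name.

```python
import statistics

def spread_by_half(fitted, residuals):
    """Std. dev. of residuals for the lower and upper halves of fitted values."""
    order = sorted(range(len(fitted)), key=lambda i: fitted[i])
    half = len(order) // 2
    low = [residuals[i] for i in order[:half]]
    high = [residuals[i] for i in order[half:]]
    return statistics.pstdev(low), statistics.pstdev(high)

# Hypothetical fitted prices and residuals with a fan-shaped pattern.
fitted = [6000, 7000, 8000, 9000, 15000, 20000, 30000, 35000]
residuals = [-300, 250, -200, 350, -2500, 3000, -4000, 3800]

low_sd, high_sd = spread_by_half(fitted, residuals)
print(f"residual sd, low fitted: {low_sd:.0f}; high fitted: {high_sd:.0f}")
```

A much larger spread in the upper half matches the fan shape described above. After transforming the response, the same comparison can be repeated to see whether the spread has evened out.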

## 3. Conclusion



## 4. Appendix (Optional)
