
# CSC503 Data Mining: Analyzing Computer Configurations and Prices

- Chengze Hao V01010253
- Yuwen Lu V01022209

https://github.com/Reece-Lu/Laptop_Data_Crawler_and_Price_Prediction_UvicCSC503Project

In [None]:
from IPython.display import Image
import warnings
warnings.filterwarnings("ignore")

## Section 1: Laptop Data Crawler.

In this section, we discuss the development of a laptop data crawler built with **Scrapy**. The crawler uses web scraping techniques and Python to automatically collect laptop information from the product pages of Smartprix.com. It extracts specifications such as processor, memory, storage, and graphics card, together with prices.

To ensure data accuracy, the crawler will handle dynamic web pages and errors effectively. The collected data will be cleaned and preprocessed to address inconsistencies, missing values, and outliers.


### Introduction to the website being crawled.
The website being crawled is [**Smartprix.com**](https://www.smartprix.com/laptops). It is an online platform that provides comprehensive information on various electronic products, including laptops. Users can browse through laptop models, compare their specifications and prices, and read user reviews and ratings. The data crawler collects laptop data from Smartprix.com, which is utilized for further analysis and insights into the laptop market.


### Choosing the Data to be crawled on Smartprix

Taking the webpage of https://www.smartprix.com/laptops/hp-pavilion-15-eg3081tu-laptop-13th-gen-core-ppd19cikb8j4 as an example, we extract and crawl the data of the laptop's price, group rank, overall rank, and the detailed configuration information, and add it to our dataset.

In [None]:
Image(filename='/content/CleanShot 2023-06-26 at 17.46.20@2x.png',height=400)


In [None]:
Image(filename='/content/CleanShot 2023-06-26 at 17.46.42@2x.png',height=400)

### The implementation of the Data Crawler

Link of the code:
https://github.com/Reece-Lu/Laptop_Data_Crawler_and_Price_Prediction_UvicCSC503Project/blob/master/csc503project/spiders/crawling_spider.py

**Allowed domains:**

``` python
allowed_domains = ["smartprix.com"]
```
In this code, we set the allowed_domains to specify the domain we are allowed to crawl. In this example, we only allow crawling of pages under the smartprix.com domain. This ensures that the crawler only retrieves data from within that domain and does not navigate to pages from other domains.

**Explanation of Rule:**
``` python
rules = (
    Rule(LinkExtractor(allow=(r'laptops/(.*-.*-.*-.*)'),
                       deny=(r'brand', r'compare', r'list')),
         callback='parse_item',
         follow=True),
)
```
Here, we define a Rule to specify how the spider should follow and parse links. The Rule consists of a LinkExtractor and other parameters.

- The LinkExtractor is used to define which links should be followed by the spider. In this example, we use a regular expression `laptops/(.*-.*-.*-.*)` to match links with a specific format. Additionally, we use the deny parameter to exclude certain links, such as those containing "brand", "compare", or "list".

- The callback parameter specifies which method should be called to parse the response after following a link. In this case, we use the parse_item method for parsing.

- The follow parameter indicates whether the spider should continue following links extracted from the current page. Here, we set it to True to ensure that links are followed and related pages are crawled.
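
To see which URLs this rule would follow, the allow and deny patterns can be tried outside Scrapy with Python's `re` module. This is only a minimal sketch: `would_follow` is a hypothetical helper, the second URL is made up for illustration, and Scrapy applies further filters (such as the `allowed_domains` check) that are not modelled here.

```python
import re

# Patterns copied from the Rule above
ALLOW = re.compile(r'laptops/(.*-.*-.*-.*)')
DENY = [re.compile(p) for p in (r'brand', r'compare', r'list')]

def would_follow(url):
    """Hypothetical helper: True if the URL matches allow and no deny pattern."""
    return bool(ALLOW.search(url)) and not any(p.search(url) for p in DENY)

product_url = ("https://www.smartprix.com/laptops/"
               "hp-pavilion-15-eg3081tu-laptop-13th-gen-core-ppd19cikb8j4")
compare_url = "https://www.smartprix.com/laptops/compare/a-b-c-d"  # made-up URL

print(would_follow(product_url))  # True
print(would_follow(compare_url))  # False: rejected by the 'compare' deny pattern
```

The allow pattern requires at least three hyphens after `laptops/`, which is what separates product pages from shorter category links.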

**Explanation of parse_item:**

```python
def parse_item(self, response):
    # Get the page link
    link = response.url

    # Extract data
    rating = response.xpath('//div[@class="pg-prd-rating"]').get()
    pricewrap = response.xpath('//div[@class="pg-prd-pricewrap"]/div[@class="price"]/text()').get()
    s_score = response.xpath('//div[@class="pg-prd-s-score"]').get()
    quick_specs = response.xpath('//div[@class="sm-fullspecs-grp"]').getall()
    sm_box = response.xpath('//div[@class="sm-box"]').get()

    # Collect the extracted fields into one item (the pipeline writes it to CSV)
    item = {
        'link': link,
        'rating': rating,
        'pricewrap': pricewrap,
        's_score': s_score,
        'quick_specs': quick_specs,
        'sm_box': sm_box,
    }

    print(item)
    yield item

```

The parse_item method is a callback function used to parse the page content. In this example, we use XPath selectors to extract data from the page. Specifically:

- The **link** variable stores the link of the current page.
- The **rating** variable uses an XPath selector to extract the content of the `<div>` element with class="pg-prd-rating".
- The **pricewrap** variable uses an XPath selector to extract the text content of the `<div class="pg-prd-pricewrap">` element's child `<div class="price">`.
- The **s_score** variable uses an XPath selector to extract the content of the `<div>` element with class="pg-prd-s-score".
- The **quick_specs** variable uses an XPath selector to extract the content of all `<div>` elements with class="sm-fullspecs-grp".
- The **sm_box** variable uses an XPath selector to extract the content of the `<div>` element with class="sm-box".


The extracted data is then stored in the item dictionary, and yield item is used to return the item for further processing and storage.


**Explanation of CSVWriterPipeline:**

```python
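import csv

class CSVWriterPipeline:
    def open_spider(self, spider):
        # utf-8 so that non-ASCII text such as the ₹ sign is written consistently
        self.file = open('data.csv', 'w', newline='', encoding='utf-8')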
        self.writer = csv.DictWriter(self.file, fieldnames=['link', 'rating', 'pricewrap', 's_score', 'quick_specs', 'sm_box'])
        self.writer.writeheader()

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow(item)
        return item
```

`CSVWriterPipeline` is a custom pipeline used to save the extracted data to a CSV file. In this example, the following methods are implemented:

The `open_spider` method is called when the spider starts and is used to open the CSV file and create a DictWriter object for writing the data. We also write the header of the CSV file using `writer.writeheader()`.

The `close_spider` method is called when the spider finishes and is used to close the CSV file.

The `process_item` method is used to process each item passed from the spider. Here, we use `writer.writerow(item)` to write each item to the CSV file. Finally, we return item to continue passing it to subsequent pipelines or processing methods.
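
The behaviour of `csv.DictWriter` can be checked in isolation. The sketch below is an illustration rather than part of the crawler: it writes one made-up item to an in-memory buffer (standing in for `data.csv`) with a shortened field list, then reads it back.

```python
import csv
import io

fieldnames = ['link', 'pricewrap']   # shortened, illustrative field list
buffer = io.StringIO()               # stands in for the data.csv file

writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerow({'link': 'https://example.com/laptop', 'pricewrap': '₹58,990'})

# Reading the text back shows that the comma inside the price survived,
# because the writer quotes fields that contain the delimiter.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(rows[0]['pricewrap'])
```

This quoting is what allows whole HTML fragments, which contain commas and quotes, to be stored in single CSV cells.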

### The result of the Data Crawler
The link to the raw data:
https://drive.google.com/file/d/1fMwEWie3xGfXoDAxHPfvD_tpOvffYaKW/view?usp=sharing

This table contains numerous HTML tags, so cleaning the data is a necessary step.


In [None]:
Image(filename='/content/CleanShot 2023-06-26 at 23.13.10@2x.png',height=600)


-------

## Section 2: Raw Data Cleaning

In [None]:
import pandas as pd

df = pd.read_csv('/content/laptop_price_raw.csv')

df.size

11145

We simply drop the rows that contain empty values, because the dataset does not lack samples.

In [None]:
df.dropna(inplace=True)
df.describe()

Unnamed: 0,link,Rating,Price,Score,Specs
count,2224,2224,2224,2224,2224
unique,2224,1785,1055,2099,2219
top,https://www.smartprix.com/laptops/hp-victus-16...,"<div class=""pg-prd-rating""><span class=""sm-rat...","₹59,990","<div class=""pg-prd-s-score""><div class=""score ...","['<div class=""sm-fullspecs-grp""><div class=""ti..."
freq,1,6,36,5,3


**Handling the column of "Rating":**

In [None]:
# Create two empty columns
df['Rating_Number'] = ''
df['Votes_Number'] = ''

# Extract the numbers and store them in the respective columns
df['Rating_Number'] = df['Rating'].str.extract(r'rating:(\d+\.?\d*)')
df['Votes_Number'] = df['Rating'].str.extract(r'margin-left:10px;">(\d+)')

for i in range(10):
  row = df.iloc[i]
  print(row.Rating_Number,row.Votes_Number)

4.35 115
4.05 64
4.55 68
4.45 74
4.6 59
4.5 109
4.4 413
4 558
4.5 181
4.35 207
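
The two patterns can be tried on a single string with Python's `re` module. The HTML fragment below is hypothetical: it is only shaped like the scraped "Rating" cell so that both patterns have something to match.

```python
import re

# Hypothetical fragment shaped like the scraped "Rating" cell
rating_html = ('<div class="pg-prd-rating"><span class="sm-rating" '
               'style="--rating:4.35"></span>'
               '<span style="margin-left:10px;">115 votes</span></div>')

rating_match = re.search(r'rating:(\d+\.?\d*)', rating_html)
votes_match = re.search(r'margin-left:10px;">(\d+)', rating_html)

# The captured groups are strings; convert them before doing arithmetic
rating_number = float(rating_match.group(1))
votes_number = int(votes_match.group(1))
print(rating_number, votes_number)
```

The same applies to the DataFrame columns: `str.extract` returns strings, so a conversion such as `astype(float)` is needed before the values are treated as numbers.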


**Handling the column of "Price":**

In [None]:
df['Price'] = df['Price'].str.replace('₹', '')

for i in range(10):
  row = df.iloc[i]
  print(row.Price)

58,990
31,990
76,990
38,990
75,763
58,990
49,990
58,990
62,990
64,990
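
After the currency sign is removed, the prices are still strings with thousands separators. A minimal sketch of a full conversion to integers follows; `parse_price` is a hypothetical helper, not part of the notebook.

```python
def parse_price(text):
    """Hypothetical helper: turn '₹58,990' or '58,990' into the integer 58990."""
    return int(text.replace('₹', '').replace(',', '').strip())

prices = [parse_price(p) for p in ['₹58,990', '31,990']]
print(prices)
```

On the DataFrame, such a helper could be applied with `df['Price'].map(parse_price)`.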


**Handling the column of "Score":**

In [None]:
# Create five empty columns
df['Computer_Score'] = ''
df['Group_Rank'] = ''
df['Group_Rank_Size'] = ''
df['Total_Rank'] = ''
df['Total_Rank_Size'] = ''

# Extract the numbers and store them in the respective columns
df['Computer_Score'] = df['Score'].str.extract(r'score rank-(\d+)-bg')
df['Group_Rank'] = df['Score'].str.extract(r'Group Rank: <b class="rank-\d+">#(\d+)')
df['Group_Rank_Size'] = df['Score'].str.extract(r'Group Rank: <b class="rank-\d+">#\d+</b> \/ (\d+)')
df['Total_Rank'] = df['Score'].str.extract(r'Overall Rank: <b class="rank-\d+">#(\d+)')
df['Total_Rank_Size'] = df['Score'].str.extract(r'Overall Rank: <b class="rank-\d+">#\d+</b> \/ (\d+)')

df.describe()

Unnamed: 0,link,Rating,Price,Score,Specs,Rating_Number,Votes_Number,Computer_Score,Group_Rank,Group_Rank_Size,Total_Rank,Total_Rank_Size
count,2224,2224,2224,2224,2224,2224.0,2224,2224,2224,2224,2224,2224
unique,2224,1785,1055,2099,2219,28.0,459,4,401,36,1041,2
top,https://www.smartprix.com/laptops/hp-victus-16...,"<div class=""pg-prd-rating""><span class=""sm-rat...",59990,"<div class=""pg-prd-s-score""><div class=""score ...","['<div class=""sm-fullspecs-grp""><div class=""ti...",4.3,1,2,57,345,448,2352
freq,1,6,36,5,3,196.0,98,1288,24,172,18,2221


**Handling the column of "Specs":**

Different laptops may list different attributes in their configurations, so we first collect the attribute names that occur and then use this list to read each laptop's configuration.

In [None]:
from bs4 import BeautifulSoup

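# `row` still holds the last laptop from the loop above,
# so the attribute names below come from that one spec table
specs_html = row.Specs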
soup = BeautifulSoup(specs_html, 'html.parser')

titles = soup.select('td.title')
values = [title.text.strip() for title in titles]
unique_values = list(set(values))

print(unique_values)

['Device Type', 'Bluetooth', 'Ethernet', 'Sales Package', 'Weight', 'Processor', 'PPI', 'Generation', 'USB Ports', 'Speed', 'Cores', 'Model', 'Dedicated Memory', 'Touchpad', 'Warranty', 'Battery Backup', 'Type', 'Battery Details', 'Features', 'RAM', 'Touch', 'Speakers', 'Resolution', 'Refresh Rate', 'Inbuilt Microphone', 'GPU', 'Series', 'Battery', 'HDMI', 'Keyboard', 'WiFi', 'OS', 'Dimensions', 'Headphone Jack', 'Optical Drive', 'Anti Glare Screen', 'Cache', 'Keyboard Backlit', 'Utility', 'Brand', 'Card Reader', 'Solid State Drive', 'Size', 'Camera', 'Microphone In', 'SSD Interface']
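
The list above is taken from the spec table of a single laptop (the `row` left over from the previous loop), so attributes that only appear on other laptops would be missed. A possible way to gather the names across all rows is sketched below. It uses the standard library's `HTMLParser` as a dependency-free stand-in for BeautifulSoup; `TitleCollector`, `collect_attribute_names`, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collects the text of every <td class="title"> cell."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "title" in classes:
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data

def collect_attribute_names(specs_column):
    """Union of attribute names over every laptop's spec HTML."""
    names = set()
    for specs_html in specs_column:
        parser = TitleCollector()
        parser.feed(specs_html)
        names.update(title.strip() for title in parser.titles)
    return sorted(names)

# Made-up spec tables for two laptops
sample_specs = [
    '<table><tr><td class="title">RAM</td><td>8 GB</td></tr></table>',
    '<table><tr><td class="title">RAM</td><td>16 GB</td></tr>'
    '<tr><td class="title">Touch</td><td>Yes</td></tr></table>',
]
all_attributes = collect_attribute_names(sample_specs)
print(all_attributes)
```

In the notebook, the function would be called on `df['Specs']`, and the result could replace `unique_values`.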


In [None]:
import pandas as pd
from bs4 import BeautifulSoup

# create the columns according to the dictionary, namely "unique_values".
for value in unique_values:
    df = df.assign(**{value: None})

# update the values of these columns for each laptop.
# Iterate over the index labels: after dropna() the labels are no longer 0..n-1,
# so writing through df.at with a positional counter would target the wrong rows
# and re-create the dropped ones as mostly empty rows.
for idx in df.index:
    specs_html = df.at[idx, 'Specs']
    soup = BeautifulSoup(specs_html, 'html.parser')
    titles = soup.select('td.title')

    for value in unique_values:
        found_value = None
        for title in titles:
            if title.text.strip() == value:
                next_sibling = title.find_next_sibling('td')
                if next_sibling is not None:
                    if next_sibling.find('span'):
                        found_value = next_sibling.find('span').text.strip()
                    elif next_sibling.find('svg'):
                        found_value = next_sibling.find('svg')['style'].split(':')[2].split(';')[0]
                    else:
                        found_value = next_sibling.text.strip()
                break

        # Attributes missing from a laptop's spec table are recorded as 'No'
        df.at[idx, value] = found_value if found_value is not None else 'No'

print(df)

                                                   link  \
0     https://www.smartprix.com/laptops/hp-victus-16...   
1     https://www.smartprix.com/laptops/acer-one-14-...   
2     https://www.smartprix.com/laptops/hp-pavilion-...   
3     https://www.smartprix.com/laptops/infinix-inbo...   
4     https://www.smartprix.com/laptops/dell-inspiro...   
...                                                 ...   
194                                                 NaN   
330                                                 NaN   
1295                                                NaN   
1394                                                NaN   
2216                                                NaN   

                                                 Rating   Price  \
0     <div class="pg-prd-rating"><span class="sm-rat...  58,990   
1     <div class="pg-prd-rating"><span class="sm-rat...  31,990   
2     <div class="pg-prd-rating"><span class="sm-rat...  76,990   
3     <div class="pg-pr

In [None]:
import pandas as pd

# Drop specified columns
df = df.drop(['link', 'Rating', 'Score', 'Specs'], axis=1)

# Move the "Price" column to the end of the DataFrame
price_col = df.pop('Price')
df['Price'] = price_col

# Save the DataFrame as a CSV file
df.to_csv('laptop_price_cleaned.csv', index=False)

print("Data saved as 'laptop_price_cleaned.csv'")

Data saved as 'laptop_price_cleaned.csv'


In [None]:
print(df.iloc[0])

Rating_Number                                                      4.35
Votes_Number                                                        115
Computer_Score                                                        2
Group_Rank                                                            2
Group_Rank_Size                                                     481
Total_Rank                                                          278
Total_Rank_Size                                                    2352
Device Type                                                     Netbook
Bluetooth                                                          v5.2
Ethernet                                 Integrated 10/100/1000 GbE LAN
Sales Package                                                        No
Weight                                                     2.46\u2009kg
Processor                                 11th Gen Intel Core i5 11400H
PPI                                                      ~ 137\u

------


## Section 3: Prepare the Data for Machine Learning.
Here we preprocess the cleaned data so that it can be fed to machine learning algorithms. Preparing the data in advance offers several benefits:

- **Improved accuracy:** Clean, consistent, and properly formatted data lets machine learning models make more accurate predictions.
- **Efficient model training:** Once missing values are filled and features are scaled and encoded, the same prepared data can be passed directly to each algorithm.
- **Enhanced feature extraction:** Preparing data involves transforming raw data into meaningful features. Proper feature extraction helps uncover patterns and relationships within the data, which the algorithms can then exploit.
- **Increased generalization:** Well-prepared data helps models generalize to unseen samples.
- **Compatibility with algorithms:** Different machine learning algorithms have specific requirements regarding the format and characteristics of their input. Preparing the data to meet these requirements ensures compatibility with the chosen algorithms.
- **Reduced bias:** By carefully examining and cleaning the data, biases and discriminatory factors can be identified and addressed.



**Import the dataset:**

We use all columns except `Price` as the feature matrix `computer`, and the `Price` column as the label vector `computer_labels`.

In [None]:
import pandas as pd
import numpy as np
data= pd.read_csv('/content/laptop_price_cleaned (2).csv')
data = pd.DataFrame(data)
computer = data.drop("Price",axis=1)
computer_labels = data["Price"].copy()
computer_num = computer.select_dtypes(include=[np.number])
computer_text = computer.select_dtypes(exclude=[np.number])
data.head()

Unnamed: 0,Rating_Number,Votes_Number,Computer_Score,Group_Rank,Group_Rank_Size,Total_Rank,Total_Rank_Size,HDMI,Keyboard,Generation,...,Touchpad,Warranty,Features,Security Lock Port,Processor,Adapter Type,Bluetooth,Weight,Device Type,Price
0,4.35,115,2,2,481,278,2352,1 x HDMI 2.1 Port,"Full-Size, Mica Silver Keyboard with Numeric K...",11th Gen,...,HP Imagepad with multi-touch Gesture Support; ...,1year Warranty,"250nits Brightness, Micro-Edge, 45% NTSC",No,11th Gen Intel Core i5 11400H,150 W Smart AC Power Adapter,v5.2,2.46kg,Netbook,58990.0
1,4.05,64,2,3,178,1187,2352,1 x HDMI Port,Keyboard with touchpad with multi gesture and ...,11th Gen,...,Yes,1year Onsite Warranty,No,Yes,11th Gen Intel Core i3 1115G4,AC 45- Watt power adapter,v5,1.5kg,Ultrabook,31990.0
2,4.55,68,2,116,332,735,2352,1 x HDMI 2.1 Port,"Full-size, backlit, Natural Silver keyboard wi...",13th Gen,...,Yes,1year Onsite Warranty,"micro-edge, Low Blue Light, Brightness: 300 ni...",Yes,13th Gen Intel Core i5 1340P,65 W Smart AC power adapter,v5.3,1.75kg,Ultrabook,76990.0
3,4.45,74,3,55,345,1459,2352,1 x HDMI 1.4 Port,Yes,11th Gen,...,Yes,1year Onsite Warranty,"Peak Brightness: 300nits, 100% sRGB, NTSC 72%",Yes,11th Gen Intel Core i5 1155G7,65W Type-C,v5.1,1.24kg,Ultrabook,38990.0
4,4.6,59,2,58,332,517,2352,1 x HDMI v1.4 Port,Standard Keyboard,13th Gen,...,Yes,1year Onsite Warranty,300nits WVA Display w/ ComfortView Plus Support,No,13th Gen Intel Core i5 1340P,65 Watt AC Adapter,v5.2,1.85kg,Hybrid,75763.0


We first inspect the dataset to see its column names and what the values in each column look like, for example whether a column is numeric or text, and what the target column looks like.

Understanding the data's structure and characteristics helps in making informed decisions about preprocessing and in selecting suitable machine learning algorithms.

In addition, inspecting the dataset makes it possible to assess data quality by identifying missing values, outliers, and noise.

**Build the pipeline:**

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

num_pipeline

`num_pipeline` is an example of a pipeline object created using make_pipeline that is designed to handle numerical features.

`SimpleImputer` is a class in scikit-learn used for handling missing values. By specifying the strategy as "median", it automatically calculates the median value for each feature and replaces the missing values with it.

`StandardScaler` performs feature scaling on the numerical features. It is a class in scikit-learn used for feature standardization, which rescales each feature to have a mean of 0 and a standard deviation of 1.
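
The two steps can be reproduced by hand with NumPy on a small made-up matrix. This is a sketch of the arithmetic only, not of scikit-learn's implementation.

```python
import numpy as np

# Made-up data with one missing value per column
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [np.nan, 50.0]])

# Step 1: median imputation, column by column (NaNs are skipped by nanmedian)
medians = np.nanmedian(X, axis=0)
X_imputed = np.where(np.isnan(X), medians, X)

# Step 2: standardization using the population standard deviation (ddof=0)
X_scaled = (X_imputed - X_imputed.mean(axis=0)) / X_imputed.std(axis=0)

print(medians)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1 up to floating-point error, so features measured on very different scales (for example votes and ranks) become comparable.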

In [None]:
computer_num_prepared = num_pipeline.fit_transform(computer_num)
columns = num_pipeline.get_feature_names_out()
df_computer_num_prepared = pd.DataFrame(computer_num_prepared,
                                       columns=num_pipeline.get_feature_names_out())
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

num_attribs = computer.select_dtypes(include='number').columns.tolist()

cat_attribs = computer.select_dtypes(exclude='number').columns.tolist()

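cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    # On scikit-learn 1.2 and later this argument is named `sparse_output`;
    # the `sparse` spelling was removed in 1.4.
    OneHotEncoder(handle_unknown="ignore", sparse=False))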


preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs)])

preprocessing


`num_attribs` lists all columns whose values are numerical.

`cat_attribs` lists all columns whose values are text.

The `SimpleImputer` in `cat_pipeline` fills missing values in categorical features with the most frequent value of each column.

`OneHotEncoder(handle_unknown="ignore", sparse=False)` performs one-hot encoding on the categorical features. With `handle_unknown="ignore"`, a category that was not seen during fitting is encoded as all zeros instead of raising an error. The `sparse=False` argument ensures that the encoded features are returned as a dense array.

The `ColumnTransformer` class allows us to specify transformers and their corresponding subsets of columns or features. In this case, the preprocessing object is created using ColumnTransformer with two transformers:`("num", num_pipeline, num_attribs)` and `("cat", cat_pipeline, cat_attribs)`. By using ColumnTransformer, we can combine these transformers into a single preprocessing pipeline that applies the respective transformations to the numerical and categorical features.
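
The effect of one-hot encoding with unknown categories ignored can be illustrated in plain Python. `fit_categories` and `one_hot` are hypothetical helpers that only mimic the idea, and the category `'Gaming'` is made up.

```python
def fit_categories(values):
    """Learn the sorted list of distinct categories from the training values."""
    return sorted(set(values))

def one_hot(value, categories):
    """One column per known category; an unseen value yields all zeros."""
    return [1 if value == category else 0 for category in categories]

categories = fit_categories(['Ultrabook', 'Netbook', 'Hybrid', 'Ultrabook'])
known_row = one_hot('Netbook', categories)
unknown_row = one_hot('Gaming', categories)   # not seen when fitting

print(categories)
print(known_row, unknown_row)
```

Because each distinct text value becomes its own column, free-text columns with many unique values produce a very wide feature matrix.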



**Dividing the test set and training set:**




In [None]:
# computer_prepared = preprocessing.fit_transform(computer)
# df_computer_prepared = pd.DataFrame(computer_prepared, columns=preprocessing.get_feature_names_out())
# df_computer_labels = pd.DataFrame(computer_labels, columns=['Price'])

from sklearn.model_selection import train_test_split
train_data,test_data,train_labels,test_labels = train_test_split(computer, computer_labels, test_size=0.2, random_state=42)


In this code snippet:

`train_test_split` is a function used to split the dataset into random train and test subsets.

`computer` refers to the input data or features of your dataset.

`computer_labels` refers to the corresponding labels or target values for the input data.

`test_size=0.2` specifies the proportion of the dataset that should be allocated to the test set. In this case, 20% of the data will be used for testing, while the remaining 80% will be used for training.

`random_state=42` sets the random seed for reproducibility. It ensures that the data is split in the same way each time you run the code, allowing for consistent results.

After executing the code, the dataset is split into train_data and test_data for the input features, and train_labels and test_labels for the corresponding target values.
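
Conceptually, such a split shuffles the row indices with a seeded random generator and cuts off a fixed share for testing. The sketch below shows that idea with the standard library; `simple_split` is a hypothetical helper and does not reproduce scikit-learn's exact shuffling, so the selected indices differ from those of `train_test_split`.

```python
import math
import random

def simple_split(n_samples, test_size=0.2, seed=42):
    """Hypothetical helper: return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # seeded, hence reproducible
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(10)
print(len(train_idx), len(test_idx))
```

Calling the function again with the same seed returns the same partition, which is the role `random_state=42` plays above.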

## Section 4: Machine Learning.
In this section, we compare supervised regression models for computer price prediction, with and without an unsupervised dimensionality-reduction step.

The supervised learning models we use are linear regression and decision tree regression.

The unsupervised learning technique we use is Principal Component Analysis (PCA).



**Linear Regression:**

Linear regression is a supervised machine learning algorithm used for predicting a continuous target variable based on one or more input features. It assumes a linear relationship between the input features and the target variable.

The equation of a simple linear regression model with one input feature is given by: $y = b_0 + b_1 x$

**$y$** is the predicted value or the target variable.

**$b_0$** is the y-intercept, which represents the predicted value of $y$ when $x$ is 0.

**$b_1$** is the slope of the line, which represents the change in the predicted value of $y$ for a unit change in $x$.

**$x$** is the input feature or independent variable

The equation for multiple linear regression extends the simple linear regression equation with one coefficient for each input feature: $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$

In our project, we use linear regression to predict the computer price. The price is $y$, the other attributes are $x_1, \dots, x_n$, and training the model means finding suitable coefficients $b_0, \dots, b_n$.
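
As a small worked example of finding the coefficients, the sketch below generates made-up data from known coefficients and recovers them by least squares with NumPy. It illustrates the equation only and is not the scikit-learn model used later.

```python
import numpy as np

# Made-up features and a target generated from b0=1, b1=2, b2=3
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2

# A leading column of ones lets the solver estimate the intercept b0
A = np.column_stack([np.ones_like(x1), x1, x2])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

print(coeffs)  # approximately [1. 2. 3.]
```

Because the data contains no noise, the estimated coefficients match the generating ones up to floating-point error.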



**Do linear regression:**

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = make_pipeline(preprocessing, LinearRegression())

lin_reg.fit(train_data, train_labels)


In this code, the `LinearRegression` class is imported from `sklearn.linear_model`; the `make_pipeline` function was already imported from `sklearn.pipeline` in Section 3.

The `make_pipeline` function is used to create a pipeline by concatenating the preprocessing steps defined in the `preprocessing` object (which typically consists of transformers like `ColumnTransformer`) with the `LinearRegression` estimator. This pipeline combines the data preprocessing steps and the linear regression model into a single object.

The `fit` method is then called on the `lin_reg` object, which fits the pipeline to the training data `train_data` and the corresponding target labels `train_labels`. This step trains the linear regression model using the preprocessed data.

In [None]:
lin_predictions = lin_reg.predict(test_data)

The code `lin_predictions = lin_reg.predict(test_data)` is used to obtain predictions on the test data using the trained linear regression model (`lin_reg`).

**The Accuracy of Linear Regression:**

In [None]:
from sklearn.metrics import mean_squared_error

lin_rmse = mean_squared_error(test_labels, lin_predictions, squared=False)
lin_rmse

5.860929990479504e+16

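Here we calculate the root mean squared error (RMSE) between the predicted values and the actual target values. It is the square root of the mean squared difference between predictions and true values, so it is expressed in the same unit as the price, and a lower RMSE indicates smaller prediction errors. The value obtained here, about 5.86e+16, is many orders of magnitude larger than any laptop price in the dataset, which means the model produces wildly wrong predictions for at least some test rows. A plausible cause is the very wide one-hot encoded feature matrix, which can make the least-squares fit numerically unstable.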
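
The RMSE computation itself is short enough to write out by hand. The prices and predictions below are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between two sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Errors of 3000 and 4000 give sqrt((9e6 + 16e6) / 2)
value = rmse([50000, 60000], [53000, 56000])
print(round(value, 2))  # 3535.53
```

Since the errors are squared before averaging, a single very large error dominates the result.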

**Decision Tree Regression:**

After the linear regression, we train a second model for comparison and choose decision tree regression. In contrast to linear regression, which fits a single linear formula to the data, a decision tree does not assume a particular functional form and instead learns a hierarchy of splitting rules.

In a decision tree, the algorithm builds a tree-like model by recursively partitioning the input space based on the values of input features. Each partition corresponds to a leaf node in the tree, which represents a prediction or a decision. The splitting of the input space is determined based on certain criteria, such as maximizing information gain or minimizing impurity, depending on whether it's a classification or regression task.

In the context of decision tree regression, the algorithm constructs a decision tree specifically tailored for regression problems. It recursively splits the data based on feature values to create partitions that minimize the mean squared error (MSE) or other regression-specific cost functions. The predicted output value for a given input is typically the average (or weighted average) of the target values within the leaf node where that input falls.
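
One splitting step can be sketched in plain Python: try every threshold between neighbouring feature values and keep the one with the lowest total squared error around the two leaf means. `best_split` is a hypothetical helper for a single feature, and the RAM sizes and prices are made up; a real regression tree repeats this over all features and recursively inside each partition.

```python
def sse(values):
    """Sum of squared deviations from the mean (n times the MSE)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimising the squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    return best[1:]

ram_gb = [4, 8, 16, 32]                 # made-up feature values
price = [30000, 32000, 70000, 74000]    # made-up targets
threshold, left_mean, right_mean = best_split(ram_gb, price)
print(threshold, left_mean, right_mean)  # 12.0 31000.0 72000.0
```

Each leaf predicts the mean of its training targets, which matches the averaging described above.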

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = make_pipeline(preprocessing, DecisionTreeRegressor())

tree_reg.fit(train_data, train_labels)

tree_predictions = tree_reg.predict(test_data)

We use scikit-learn's `DecisionTreeRegressor` together with `make_pipeline` to create a pipeline with a decision tree regression model. The pipeline combines the data preprocessing steps defined in the `preprocessing` object with the decision tree regression model. The model is then trained on the training data using the `fit` method. Subsequently, predictions are made on the test data using the trained model.

**The Accuracy of Decision Tree Regression:**

In [None]:
tree_rmse = mean_squared_error(test_labels,tree_predictions,squared=False)
tree_rmse

12083.07086717119

This is the RMSE of the decision tree regression on the test set.

**Fine-Tune Decision Tree Regression:**

The untuned decision tree reaches an RMSE of about 12,083. This is far below the 5.86e+16 of the linear regression, but it is still large relative to the laptop prices in the dataset, so we tune the decision tree's hyperparameters.

**Grid search** is a suitable method for this: it is a technique for finding good hyperparameters of a model. It involves defining a grid of hyperparameter values, fitting the model with each candidate, and selecting the best-performing one based on an evaluation metric. scikit-learn's `GridSearchCV` automates this process.


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

full_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("decision_tree", DecisionTreeRegressor(random_state=42))
])

param_grid = [
    {'decision_tree__max_depth': [None, 3, 5, 7]},
    {'decision_tree__min_samples_split': [2, 5, 10]},
    {'decision_tree__min_samples_leaf': [1, 2, 4]}
]

grid_search = GridSearchCV(
    full_pipeline,
    param_grid,
    cv=3,
    scoring='neg_root_mean_squared_error'
)

grid_search.fit(computer, computer_labels)

The code above performs a grid search over a decision tree regressor using scikit-learn's `Pipeline` and `GridSearchCV`. The pipeline consists of the preprocessing steps followed by the decision tree regressor. Because `param_grid` is a list of three separate dictionaries, each hyperparameter (maximum depth, minimum samples per split, minimum samples per leaf) is varied on its own while the others keep their default values, which gives 10 candidates rather than all 36 joint combinations. Each candidate is evaluated with 3-fold cross-validation using the negative RMSE as the score. Note that the search is fitted on the full dataset (`computer`, `computer_labels`), not only on the training split.
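
The number of candidates a parameter grid produces can be counted with a few lines of plain Python. `expand` is a hypothetical helper that enumerates one dictionary's combinations; it mirrors the idea of the grid, not scikit-learn's internals.

```python
from itertools import product

param_grid = [
    {'decision_tree__max_depth': [None, 3, 5, 7]},
    {'decision_tree__min_samples_split': [2, 5, 10]},
    {'decision_tree__min_samples_leaf': [1, 2, 4]},
]

def expand(grid):
    """All combinations of the values in one parameter dictionary."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*grid.values())]

# List of dictionaries: each one is expanded separately
separate = [candidate for grid in param_grid for candidate in expand(grid)]

# One merged dictionary: every combination of all three parameters
merged = {key: values for grid in param_grid for key, values in grid.items()}
joint = expand(merged)

print(len(separate), len(joint))  # 10 36
```

With `cv=3`, the separate grids require 30 model fits, whereas the merged grid would require 108 but would also explore interactions between the parameters.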

In [None]:
grid_search.best_params_

{'decision_tree__min_samples_leaf': 2}

Here we find the best parameters for **Decision Tree Regression**.

**The Accuracy of Refined Decision Tree Regression:**

In [None]:
best_decision_tree = grid_search.best_estimator_
tree_predictions_new = best_decision_tree.predict(test_data)
tree_rmse_new = mean_squared_error(test_labels,tree_predictions_new,squared=False)
tree_rmse_new

1894.657027472717

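Finally, we use the best estimator found by the grid search to predict on the test set. Its **RMSE** of about 1,895 is much lower than that of the untuned **Decision Tree Regression** model (about 12,083) and of the **Linear Regression** model (about 5.86e+16). However, the grid search was fitted on the full dataset, which includes the test rows, so this figure is optimistic; fitting on `train_data` and `train_labels` would give a fairer estimate.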

**PCA:**

**PCA** is an unsupervised dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional representation while retaining as much of the data's variance as possible.

The mathematical steps of **PCA** are:

**1. Standardize the data:** `X_std = (X - X.mean(axis=0)) / X.std(axis=0)`

**2. Calculate the covariance matrix:** `C = np.cov(X_std.T)`

**3. Find the eigenvalues and eigenvectors of C:** `eigenvalues, eigenvectors = np.linalg.eig(C)`

**4. Select the principal components:** `top_k_indices = np.argsort(eigenvalues)[::-1][:k]` `W = eigenvectors[:, top_k_indices]`

**5. Project the data:** `Y = np.dot(X_std, W)`

In short, PCA computes the covariance matrix of the standardized data, finds its eigenvalues and eigenvectors, and projects the data onto the eigenvectors with the largest eigenvalues to obtain a lower-dimensional dataset.
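
The five steps can be assembled into one runnable function. This is a sketch on made-up data: `pca_by_hand` is a hypothetical name, and it uses `np.linalg.eigh` in place of `np.linalg.eig`, a substitution that suits symmetric matrices such as a covariance matrix and guarantees real eigenvalues.

```python
import numpy as np

def pca_by_hand(X, k):
    """Project X onto its top-k principal components."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)         # 1. standardize
    C = np.cov(X_std.T)                                  # 2. covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(C)        # 3. eigen-decomposition
    top_k_indices = np.argsort(eigenvalues)[::-1][:k]    # 4. largest eigenvalues
    W = eigenvectors[:, top_k_indices]
    Y = np.dot(X_std, W)                                 # 5. projection
    explained_ratio = eigenvalues[top_k_indices] / eigenvalues.sum()
    return Y, explained_ratio

# Made-up data: the second column is roughly twice the first
X = np.array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, 6.2],
              [4.0, 7.9],
              [5.0, 10.1]])
Y, ratio = pca_by_hand(X, k=1)
print(Y.shape, ratio)
```

Because the two columns are almost perfectly correlated, a single component captures nearly all of the variance. The explained-variance ratio is the quantity to check before discarding components.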











In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardized data
computer_prepared = preprocessing.fit_transform(computer)
scaler = StandardScaler()
computer_scaled = scaler.fit_transform(computer_prepared)

n_components = 2
pca = PCA(n_components=n_components)

# Do PCA
computer_pca = pca.fit_transform(computer_scaled)

# View the proportion of explained variance
explained_variance_ratio = pca.explained_variance_ratio_
print("Explaining the proportion of variance：", explained_variance_ratio)

# View the variance of the principal components
components_variance = pca.explained_variance_
print("Variance of principal components：", components_variance)

# View the dataset after PCA
print("dataset after PCA：", computer_pca)

Explaining the proportion of variance： [0.0028486  0.00277231]
Variance of principal components： [22.36303641 21.76408547]
dataset after PCA： [[ 4.52900251 -0.79670759]
 [-2.56186308 -1.29085492]
 [ 2.48860945 -0.4238146 ]
 ...
 [ 1.54289363 -0.20875797]
 [-4.32955156 -1.3582883 ]
 [ 1.24038104 -0.22808352]]


`computer_prepared = preprocessing.fit_transform(computer)`

`scaler = StandardScaler()`

`computer_scaled = scaler.fit_transform(computer_prepared)`

These three lines apply the `preprocessing` pipeline we defined earlier and then standardize the result, so that the dataset is ready for PCA.

Next, we create the PCA object and specify the number of principal components to retain. Assuming we want two-dimensional data, we set `n_components = 2` and create `pca = PCA(n_components=n_components)`.

Finally, `computer_pca = pca.fit_transform(computer_scaled)` runs PCA and returns the reduced dataset `computer_pca`. The output shows that each of the two components explains only about 0.28% of the total variance.
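Fixing `n_components = 2` is one choice; another is to pick the number of components from the cumulative explained variance. The following is a minimal sketch on synthetic data, not on the laptop dataset. The names `X_demo`, `n_needed`, and `pca_95`, and the 95% threshold, are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a prepared feature matrix (assumption):
# 10 observed features driven by 3 latent factors plus a little noise
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components and accumulate the explained variance ratios
pca_full = PCA().fit(X_demo_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 95%
n_needed = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Passing a float between 0 and 1 asks scikit-learn to make the same choice
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_demo_scaled)
print("Components needed for 95% variance:", n_needed, pca_95.n_components_)
```

Printing `cumulative` for the real dataset would show how much variance a given number of components retains before a value is fixed.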

In [None]:
# Standardize the PCA-reduced data
df_computer_pca = pd.DataFrame(computer_pca)
scaler = StandardScaler()
computer_pca_scaled = scaler.fit_transform(df_computer_pca)

# Re-split into training and test sets after PCA
train_data_pca,test_data_pca,train_labels,test_labels = train_test_split(computer_pca_scaled, computer_labels, test_size=0.2, random_state=42)



Because PCA produces a new dataset, we standardize it again and re-split it into training and test sets.
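In the cells above, the scaler and PCA are fit on the full dataset before the split. An alternative ordering, sketched here on synthetic data, is to split first and let a scikit-learn pipeline fit the scaler and PCA on the training set only, so that the test set plays no part in the fitted transformation. The data and the names `X_demo`, `y_demo`, and `pca_lin_reg` are assumptions for illustration and are not part of the project code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption, not the laptop dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 8))
y_demo = X_demo @ rng.normal(size=8) + rng.normal(scale=0.5, size=500)

# Split first, so the test set is never seen while fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaler and PCA are fit on the training data only, inside the pipeline
pca_lin_reg = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pca_lin_reg.fit(X_train, y_train)

# predict() applies the already-fitted scaler and PCA to the test data
demo_predictions = pca_lin_reg.predict(X_test)
demo_rmse = float(np.sqrt(np.mean((y_test - demo_predictions) ** 2)))
print("RMSE on the synthetic test set:", demo_rmse)
```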

**Linear Regression After PCA:**

Next, we evaluate the performance of linear regression on the PCA-reduced dataset.

In [None]:
# Create the linear regression model
lin_reg_pca = LinearRegression()

# Fit the linear regression model
lin_reg_pca.fit(train_data_pca, train_labels)

# Make predictions
lin_predictions_pca = lin_reg_pca.predict(test_data_pca)

In [None]:
# Calculate the RMSE of linear regression after PCA
lin_rmse_pca = mean_squared_error(test_labels, lin_predictions_pca, squared=False)
lin_rmse_pca

56049.100302730054

The result shows that the **RMSE** of **Linear Regression** after PCA (about 56,049) is much higher than before PCA, and also much higher than that of the refined **Decision Tree Regression** model (about 1,895).
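RMSE is the square root of the mean squared prediction error, so it is expressed in the same units as the price. The sketch below computes it by hand on a few made-up values (`y_true` and `y_pred` are assumptions, not project data) and compares the result with the square root of `mean_squared_error`. Taking the square root explicitly also avoids relying on the `squared=False` argument, which has been deprecated in recent scikit-learn releases.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration (assumption)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# Errors are 1, 0, -1, -2; squared errors are 1, 0, 1, 4; their mean is 1.5
rmse_manual = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Same quantity through scikit-learn, taking the square root explicitly
rmse_sklearn = float(np.sqrt(mean_squared_error(y_true, y_pred)))

print(rmse_manual, rmse_sklearn)  # both equal sqrt(1.5)
```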




## Section 5: Conclusion and Prospects.



**Conclusion:**
- Comparison of linear regression and decision tree regression:
Linear regression and decision tree regression are both used for regression tasks, but they have different characteristics. Linear regression offers interpretability through coefficients that represent feature-target relationships, and it assumes linearity. Decision tree regression can capture non-linear relationships without explicit feature engineering. In our project, we prepared the data carefully before training, making the dataset clean and complete. In this setting, linear regression performed better than decision tree regression, even after we tuned the decision tree's parameters with grid search.

- Comparison of linear regression before and after PCA: Combining linear regression with PCA reduces dimensionality and helps address collinearity among features, which may improve model performance on high-dimensional datasets. In our project, however, linear regression after PCA performed worse than linear regression applied directly. This can occur because information is lost during dimensionality reduction: in our run, each of the two retained components explains only about 0.28% of the total variance, so most of the information in the features is discarded. PCA also reduces interpretability, since the original features are replaced by principal components, and it may amplify noise. The impact of PCA on performance should therefore be carefully assessed.

**Prospect:**

For our project, we would like to understand why linear regression performs so much worse after PCA reduces the dimensionality of the data. This may be related to our preprocessing: we may not have prepared the data in a way that suits PCA. For example, we did not whiten the data, but simply applied the same preprocessing as before and then performed PCA, which may explain the poor performance of linear regression after PCA. We expected linear regression to perform somewhat worse after PCA, but the gap is larger than we anticipated.
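One way to start testing the whitening hypothesis is scikit-learn's `whiten=True` option, which rescales each principal component to unit variance. The sketch below only shows what the option does on synthetic data; `X_demo`, `plain`, and `whitened` are assumed names, and whether whitening changes the regression result on the laptop dataset would still need to be checked.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic features with very different scales (assumption for illustration)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])

# Plain PCA: component variances equal the leading eigenvalues
plain = PCA(n_components=2).fit_transform(X_demo)

# Whitened PCA: each component is rescaled to unit variance
whitened = PCA(n_components=2, whiten=True).fit_transform(X_demo)

plain_var = plain.var(axis=0, ddof=1)
whitened_var = whitened.var(axis=0, ddof=1)
print("Variance without whitening:", plain_var)
print("Variance with whitening:", whitened_var)
```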