<a href="https://colab.research.google.com/github/PranayJagtap06/UFM_Mobile_Phone_Pricing/blob/master/ufm_mobile_phone_pricing_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mobile Phone Pricing Classifier

> This notebook is in association with the Unified Mentor Machine Learning internship project submission.

The task of the project is to develop a system that predicts the price range of a mobile phone from data on phones available in the market. Each phone must be categorized as `0: low cost`, `1: medium cost`, `2: high cost`, or `3: very high cost`.


Visit the deployed [Mobile Price Range Classifier Streamlit App](https://mob-price-range-classifier.streamlit.app/).

## 1. Importing Dataset

In [None]:
!git clone https://github.com/PranayJagtap06/UFM_Mobile_Phone_Pricing.git

Cloning into 'UFM_Mobile_Phone_Pricing'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 6 (delta 0), reused 3 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (6/6), 74.43 KiB | 846.00 KiB/s, done.


In [None]:
import zipfile

# Extract the dataset archive; the context manager closes the file even if extraction fails
with zipfile.ZipFile("/content/UFM_Mobile_Phone_Pricing/mobile_phone_pricing.zip", "r") as zip_ref:
    zip_ref.extractall("/content")

## 2. Importing Libraries

In [None]:
!pip install icecream

Collecting icecream
  Downloading icecream-2.1.3-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting colorama>=0.3.9 (from icecream)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting executing>=0.3.1 (from icecream)
  Downloading executing-2.1.0-py2.py3-none-any.whl.metadata (8.9 kB)
Collecting asttokens>=2.0.1 (from icecream)
  Downloading asttokens-3.0.0-py3-none-any.whl.metadata (4.7 kB)
Downloading icecream-2.1.3-py2.py3-none-any.whl (8.4 kB)
Downloading asttokens-3.0.0-py3-none-any.whl (26 kB)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Downloading executing-2.1.0-py2.py3-none-any.whl (25 kB)
Installing collected packages: executing, colorama, asttokens, icecream
Successfully installed asttokens-3.0.0 colorama-0.4.6 executing-2.1.0 icecream-2.1.3


In [None]:
!pip install dagshub mlflow[jupyter]

Collecting dagshub
  Downloading dagshub-0.3.45-py3-none-any.whl.metadata (11 kB)
Collecting mlflow[jupyter]
  Downloading mlflow-2.18.0-py3-none-any.whl.metadata (29 kB)
Collecting appdirs>=1.4.4 (from dagshub)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting dacite~=1.6.0 (from dagshub)
  Downloading dacite-1.6.0-py3-none-any.whl.metadata (14 kB)
Collecting gql[requests] (from dagshub)
  Downloading gql-3.5.0-py2.py3-none-any.whl.metadata (9.2 kB)
Collecting dataclasses-json (from dagshub)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting treelib>=1.6.4 (from dagshub)
  Downloading treelib-1.7.0-py3-none-any.whl.metadata (1.3 kB)
Collecting pathvalidate>=3.0.0 (from dagshub)
  Downloading pathvalidate-3.2.1-py3-none-any.whl.metadata (12 kB)
Collecting boto3 (from dagshub)
  Downloading boto3-1.35.71-py3-none-any.whl.metadata (6.7 kB)
Collecting dagshub-annotation-converter>=0.1.0 (from dagshub)
  Downloading dagshub_annotat

In [None]:
import os
import plotly
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio

pio.templates.default = "seaborn"
pio.renderers.default = "colab"

from icecream import ic
from urllib.parse import urlparse
from typing import Dict, Any, Optional
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

import dagshub
import mlflow

## 3. Load Dataset

In [None]:
pd.set_option("display.max_colwidth", None)
df = pd.read_csv("/content/Mobile Phone Pricing/dataset.csv")
df.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,...,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,...,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,...,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,...,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,...,1208,1212,1411,8,2,15,1,1,0,1


## 4. Inspecting Dataset for `null` values

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   battery_power  2000 non-null   int64  
 1   blue           2000 non-null   int64  
 2   clock_speed    2000 non-null   float64
 3   dual_sim       2000 non-null   int64  
 4   fc             2000 non-null   int64  
 5   four_g         2000 non-null   int64  
 6   int_memory     2000 non-null   int64  
 7   m_dep          2000 non-null   float64
 8   mobile_wt      2000 non-null   int64  
 9   n_cores        2000 non-null   int64  
 10  pc             2000 non-null   int64  
 11  px_height      2000 non-null   int64  
 12  px_width       2000 non-null   int64  
 13  ram            2000 non-null   int64  
 14  sc_h           2000 non-null   int64  
 15  sc_w           2000 non-null   int64  
 16  talk_time      2000 non-null   int64  
 17  three_g        2000 non-null   int64  
 18  touch_sc

In [None]:
df.isnull().sum()

Unnamed: 0,0
battery_power,0
blue,0
clock_speed,0
dual_sim,0
fc,0
four_g,0
int_memory,0
m_dep,0
mobile_wt,0
n_cores,0


The dataset is clean and ready to use.

## 5. Exploratory Data Analysis

In [None]:
df.describe()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,...,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,1238.5185,0.495,1.52225,0.5095,4.3095,0.5215,32.0465,0.50175,140.249,4.5205,...,645.108,1251.5155,2124.213,12.3065,5.767,11.011,0.7615,0.503,0.507,1.5
std,439.418206,0.5001,0.816004,0.500035,4.341444,0.499662,18.145715,0.288416,35.399655,2.287837,...,443.780811,432.199447,1084.732044,4.213245,4.356398,5.463955,0.426273,0.500116,0.500076,1.118314
min,501.0,0.0,0.5,0.0,0.0,0.0,2.0,0.1,80.0,1.0,...,0.0,500.0,256.0,5.0,0.0,2.0,0.0,0.0,0.0,0.0
25%,851.75,0.0,0.7,0.0,1.0,0.0,16.0,0.2,109.0,3.0,...,282.75,874.75,1207.5,9.0,2.0,6.0,1.0,0.0,0.0,0.75
50%,1226.0,0.0,1.5,1.0,3.0,1.0,32.0,0.5,141.0,4.0,...,564.0,1247.0,2146.5,12.0,5.0,11.0,1.0,1.0,1.0,1.5
75%,1615.25,1.0,2.2,1.0,7.0,1.0,48.0,0.8,170.0,7.0,...,947.25,1633.0,3064.5,16.0,9.0,16.0,1.0,1.0,1.0,2.25
max,1998.0,1.0,3.0,1.0,19.0,1.0,64.0,1.0,200.0,8.0,...,1960.0,1998.0,3998.0,19.0,18.0,20.0,1.0,1.0,1.0,3.0
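
The `min` row above shows `px_height` and `sc_w` reaching 0, which a null check cannot flag. A minimal sketch of how such entries could be counted, assuming that a zero in these two columns is a placeholder or implausible value (the dataset documentation is not quoted here, so that is an assumption). The toy frame is hypothetical and stands in for `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `df`; values are illustrative only.
toy = pd.DataFrame({
    "px_height": [20, 905, 0, 1216],
    "sc_w": [7, 0, 0, 8],
    "ram": [2549, 2631, 2603, 2769],
})

# Columns where a value of 0 is assumed to be physically implausible.
suspect_cols = ["px_height", "sc_w"]

# Count zero entries per suspect column.
zero_counts = (toy[suspect_cols] == 0).sum()
print(zero_counts)
```

On the real data, the same expression with `df` in place of `toy` would show how many rows are affected before deciding whether to keep, impute, or drop them.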


Let's explore the target variable `price_range`.

In [None]:
df.price_range.value_counts()

Unnamed: 0_level_0,count
price_range,Unnamed: 1_level_1
1,500
2,500
3,500
0,500


That's good, the dataset is balanced.

Let's see how many 4G phones there are in the dataset.

In [None]:
df.four_g.value_counts()

Unnamed: 0_level_0,count
four_g,Unnamed: 1_level_1
1,1043
0,957


Let's see how many 3G phones are present.

In [None]:
df.three_g.value_counts()

Unnamed: 0_level_0,count
three_g,Unnamed: 1_level_1
1,1523
0,477


Checking the count of dual sim phones.

In [None]:
df.dual_sim.value_counts()

Unnamed: 0_level_0,count
dual_sim,Unnamed: 1_level_1
1,1019
0,981


Now let's create some plots to better understand the dataset.

In [None]:
# HTML template and helper function for saving plotly plots as HTML to embed them later
with open('html_template.html', 'w') as f:
  f.write("""
  <!doctype html>
  <html>
  <head>
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  </head>

  <body>
  <!-- <h3>{{ heading }}</h3> -->
  {{ fig }}
  </body>
  </html>
  """)

def fig_to_html(fig: plotly.graph_objs._figure.Figure,
                # plot_heading: str,
                output_path: Optional[str]="output.html",
                template_path: Optional[str]="html_template.html") -> None:
  """
  Convert a plotly figure to an HTML.
  """
  # Create output directory if it doesn't exist
  output_dir = "plotly_html"
  os.makedirs(output_dir, exist_ok=True)

  from jinja2 import Template
  # Convert the figure to HTML
  plotly_jinja_data = {
      "fig": fig.to_html(full_html=False, include_plotlyjs="cdn"),
      # "heading": plot_heading
      }

  # Load the template
  with open(os.path.join(output_dir, output_path), "w", encoding="utf-8") as f:
    with open(template_path, "r", encoding="utf-8") as template_file:
      template = Template(template_file.read())
      f.write(template.render(plotly_jinja_data))

In [None]:
# Mobile Phone Scatter plot
fig1 = px.scatter(
    df,
    x="battery_power",
    y="ram",
    color="price_range",
    size="int_memory",
    title="Mobile Phone Scatter Plot",
    labels={
        "battery_power": "Battery Power (mAh)",
        "ram": "RAM (MB)",
        "price_range": "Price Range",
        "int_memory": "Internal Memory (GB)"
    },
    hover_data=["battery_power", "ram", "price_range", "int_memory"],
)
fig1.update_layout(
    plot_bgcolor="darkgrey",
    template="seaborn"
)
fig_to_html(fig1, "mobile_phone_scatter_plot.html")
fig1.show()

The above scatter plot visualizes the relationship between battery power and RAM for mobile phones, categorized by price range and sized by internal memory.

**Observations:**
1. **Relationship Between Battery Power and Price Range:** Battery power is evenly distributed across price ranges, with no strong positive or negative correlation. Phones in both lower (0) and higher (3) price ranges are spread across the entire spectrum of battery capacities (roughly 500 to 2000 mAh). This suggests that battery power is not a strong differentiating factor for pricing, as even budget phones offer competitive battery capacities.
2. **Relationship Between RAM and Price Range:** RAM increases consistently with price range. Phones in Price Range 0 (black points) are clustered at the lower end of the RAM spectrum (500–1500 MB), while phones in Price Range 3 (light orange points) are clustered at the higher end (3000–4000 MB). This indicates that RAM is a key differentiator for pricing, with higher RAM being a feature of premium phones.
3. **Impact of Internal Memory (Point Size):** Larger data points, representing higher internal memory, are more prevalent in higher price ranges. Phones in Price Range 3 not only have higher RAM but also tend to have higher internal memory (as seen from larger point sizes). Conversely, smaller data points (lower internal memory) are predominantly in Price Range 0 and Price Range 1. This indicates that internal memory, along with RAM, is another significant factor influencing phone pricing.
4. **Clustering:** Phones in lower price ranges (0 and 1) are clustered in the lower region of the plot, reflecting low RAM and internal memory across the full range of battery power. Phones in higher price ranges (2 and 3) dominate the upper region of the plot due to higher RAM and larger data points, representing more premium configurations.
5. **Insights:**
  - Battery power does not significantly affect price range, as high-capacity batteries are available across all ranges.
  - RAM and internal memory are critical features for premium phones, as seen from their strong positive correlation with price range.
  - Budget phones (Price Range 0) tend to compromise on RAM and internal memory while offering competitive battery power.


In [None]:
# 2. Battery Power vs. Price Range
fig2 = px.box(df, x="price_range", y="battery_power", title="Battery Power vs. Price Range")
fig_to_html(fig2, "battery_power_vs_price_range.html")
fig2.show()

The above box plot gives insight into the distribution of battery power across different price categories.

**Observations:**
1. **Median Battery Power:** The median battery power increases as the price range increases, suggesting that higher-priced phones tend to have higher battery power.
2. **Price Range 0:** Has a relatively lower median and interquartile range (IQR) for battery power compared to higher price ranges.
3. **Price Range 1 to 3:** There is an increasing trend where both the median and overall battery power distribution slightly increase, though the change is not drastic.
4. **Spread & Variation:** The whiskers of each box plot extend from the minimum to the maximum values, indicating the range of battery_power within each price range. There is a significant overlap in the range of battery power across all price categories, which indicates that some lower-priced devices may still offer comparable battery power to mid-range or high-priced devices.
5. **Insights:** Higher price ranges are generally associated with slightly higher battery power, but the overlap suggests that battery power is not a strong differentiator between price ranges. Manufacturers might prioritize other features besides battery power when justifying higher prices, or there might be diminishing returns in battery capacity for premium-priced devices.
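
The box-plot reading above can be checked numerically. A minimal sketch, using a hypothetical toy frame in place of `df`, that computes the per-class median and interquartile range of `battery_power`. The helper name `iqr` is introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for `df`; values are illustrative only.
toy = pd.DataFrame({
    "price_range":   [0, 0, 0, 1, 1, 1],
    "battery_power": [600, 800, 1000, 900, 1200, 1500],
})

def iqr(s: pd.Series) -> float:
    """Interquartile range: 75th percentile minus 25th percentile."""
    return s.quantile(0.75) - s.quantile(0.25)

# One row per price range with its median and IQR of battery power.
summary = toy.groupby("price_range")["battery_power"].agg(median="median", iqr=iqr)
print(summary)
```

Comparing the difference between class medians with the IQR gives a rough sense of how much the distributions overlap.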

In [None]:
# 3. RAM vs. Price Range
fig3 = px.scatter(
    df,
    x="ram",
    y="price_range",
    title="RAM vs. Price Range",
    color="price_range",
    size="int_memory",
    labels={
        "ram": "RAM (MB)",
        "price_range": "Price Range",
        "int_memory": "Internal Memory (GB)"
        },
    hover_data=["ram", "price_range", "int_memory"]
)
fig3.update_layout(
    plot_bgcolor="darkgrey",
    template="seaborn"
)
fig_to_html(fig3, "ram_vs_price_range.html")
fig3.show()

The scatter plot displays the relationship between RAM (in MB) and the price range of mobile phones.

**Observations:**
1. **RAM Influence:** The plot suggests that RAM is a significant factor influencing the price range of mobile phones. Phones with higher RAM tend to be more expensive.
2. **Clustering:** The clustering of data points suggests that there are specific price ranges associated with certain RAM configurations.
3. **Internal Memory:** The size of the data points reveals that internal memory is another important factor determining the price of mobile phones.
4. **Insights:** This plot indicates that RAM plays a vital role in pricing mobile phones. Higher RAM values lead to higher price ranges, and the size of the data points highlights the impact of internal memory on the cost.






In [None]:
# 4. 4G Availability by Price Range (only the `four_g` column is plotted)
fig4 = px.histogram(df, x="price_range", color="four_g", title="4G Availability by Price Range",
                    barmode="group",
                    labels={"four_g": "4G Availability"},
                    # facet_col="four_g"
                   )
fig_to_html(fig4, "3g_4g_availability_by_price_range.html")
fig4.show()

The grouped histogram above shows the number of phones with and without 4G support in each price range.

**Observations:**
1. **Higher Price Range, Higher 4G Availability:** As the price range increases, the number of phones with 4G availability also increases. This trend is consistent across all price categories.
2. **Prevalence of 4G:** In all price ranges, a significant proportion of phones offer 4G capabilities. This indicates that 4G is a prevalent feature in the mobile phone market.
3. **Insights:** The plot suggests a strong correlation between price range and 4G availability. This indicates that phones with higher price tags are more likely to have 4G capabilities. This observation aligns with the expectation that newer and higher-end phones are more likely to incorporate advanced features like 4G.
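
Because every price range contains 500 phones, raw counts and shares tell the same story here, but shares make the comparison explicit. A minimal sketch, using a hypothetical toy frame in place of `df`, that computes the fraction of 4G phones within each price range:

```python
import pandas as pd

# Hypothetical toy frame standing in for `df`; values are illustrative only.
toy = pd.DataFrame({
    "price_range": [0, 0, 0, 0, 3, 3, 3, 3],
    "four_g":      [0, 0, 1, 1, 0, 1, 1, 1],
})

# normalize="index" turns each row (price range) into fractions that sum to 1.
share = pd.crosstab(toy["price_range"], toy["four_g"], normalize="index")
print(share)
```

The column labelled `1` is the 4G share per price range. If these shares barely differ across ranges on the real data, the association with price is weak.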

In [None]:
corr_matrix = df.corr()  # Calculate the correlation matrix

# Create the heatmap using px.imshow
fig5 = px.imshow(
    corr_matrix,
    text_auto=".4f",  # Display correlation values rounded to 4 decimal places
    aspect="auto",  # Adjust aspect ratio for better visualization
    color_continuous_scale="viridis",  # Choose a color scale
    title="Correlation Heatmap for Price Range"
)

fig5.show()
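
The heatmap covers every pair of columns. When only the relationship with `price_range` matters, that single column can be pulled out and sorted. A minimal sketch, using a hypothetical toy frame in place of `df`:

```python
import pandas as pd

# Hypothetical toy frame standing in for `df`; values are illustrative only.
toy = pd.DataFrame({
    "ram":           [500, 1500, 2500, 3500],
    "battery_power": [1000, 900, 1100, 1000],
    "price_range":   [0, 1, 2, 3],
})

# Pearson correlation of each feature with the target, strongest first.
target_corr = (
    toy.corr()["price_range"]
    .drop("price_range")
    .sort_values(ascending=False)
)
print(target_corr)
```

Pearson correlation treats the ordinal class labels 0 to 3 as numbers, so this is a rough ranking rather than a formal test of feature importance.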

## 6. Preparing Training Data

In [None]:
# Copying dataset
ds = df.copy(deep=True)

In [None]:
# Separating features and target
X = ds.drop("price_range", axis=1)
y = ds.price_range

In [None]:
# Creating train-test-split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=ds.price_range, random_state=42)
ic(X_train.shape, X_test.shape, y_train.shape, y_test.shape);

ic| X_train.shape: (1600, 20)
    X_test.shape: (400, 20)
    y_train.shape: (1600,)
    y_test.shape: (400,)
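
With 500 phones per class and `test_size=0.2`, the `stratify` argument keeps the classes balanced in both splits, which is why each class has a support of 100 in the classification report. A minimal sketch on synthetic labels (the arrays `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 2000 rows, four balanced classes.
y_demo = np.repeat([0, 1, 2, 3], 500)
X_demo = np.arange(2000).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Per-class counts in the train and test splits.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```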


Now we are ready to train classification models.

## 7. Setting Up DAGsHub & mlflow

First, let's set up dagshub for experiment tracking.

In [None]:
!dagshub login

❗❗❗ AUTHORIZATION REQUIRED ❗❗❗


Open the following link in your browser to authorize the client:
https://dagshub.com/login/oauth/authorize?state=fd6cd198-2afd-4b69-89a2-32dd30ed24aa&client_id=32b60ba385aa7cecf24046d8195a71c07dd345d9657977863b52e7748e0f0f28&middleman_request_id=4d96426b0bfbc129b62ee5f94cc23f785902c514c0c4162c749344a0f35bac36


Waiting for authorization
✅ OAuth token added


In [None]:
from google.colab import userdata
repo_owner_ = userdata.get('REPO_OWNER')
repo_name_ = userdata.get('REPO_NAME')
tracking_uri = userdata.get('MLFLOW_TRACKING_URI')

os.makedirs('tmp', exist_ok=True)

# Creating function to log experiments to mlflow
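def create_experiment(
    experiment_name: str,
    run_name: str,
    run_metrics: Dict[str, Any],
    model,
    model_name: Optional[str] = None,
    artifact_paths: Optional[Dict[str, str]] = None,
    run_params: Optional[Dict[str, Any]] = None,
    tag_dict: Optional[Dict[str, str]] = None,
):
    """Log the params, metrics, artifacts, and model of one run to MLflow on DagsHub."""
    # Use None defaults instead of mutable dict defaults that are shared across calls
    artifact_paths = artifact_paths or {}
    tag_dict = tag_dict or {}
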
    try:
        dagshub.init(repo_owner=f"{repo_owner_}", repo_name=f"{repo_name_}", mlflow=True)

        # You can get your MLflow tracking URI from your DagsHub repo: open the "Remote" dropdown menu, go to the "Experiments" tab, copy the MLflow experiment tracking URI, and paste it below
        mlflow.set_tracking_uri(f"{tracking_uri}")

        mlflow.set_experiment(experiment_name)

        with mlflow.start_run(run_name=run_name):

            # log params
            if run_params is not None:
                for param in run_params:
                    mlflow.log_param(param, run_params[param])

            # log metrics
            for metric, value in run_metrics.items():
                if isinstance(value, list):
                    # If the metric is a list, log each value as a separate step
                    for step, v in enumerate(value):
                        mlflow.log_metric(metric, v, step=step)
                else:
                    # If it's a single value, log it normally
                    mlflow.log_metric(metric, value)

            tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme

            # log artifacts
            for artifact_name, path in artifact_paths.items():
                if path and os.path.exists(path):
                    if tracking_url_type_store != "file":
                        mlflow.log_artifact(
                            path,
                            # artifact_name
                        )
                elif path:
                    print(f"Warning: Artifact file not found: {path}")

            # log model
            if tracking_url_type_store != "file":
                # mlflow.sklearn.save_model(model, save_path)
                mlflow.sklearn.log_model(model, "sk_model")

            mlflow.set_tags(tag_dict)

        print(f'Run - {run_name} is logged to Experiment - {experiment_name}')
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        import traceback
        traceback.print_exc()

## 8. Model Training & Evaluation

### 8.1 Logistic Regression Classifier

#### 8.1.1 Model Training

In [None]:
np.random.seed(42)

# Create a pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, solver="liblinear"))

# Create a parameter grid
param_grid = {
    'logisticregression__C': [0.01, 0.1, 1, 10, 100],  # Regularization parameter
    'logisticregression__penalty': ['l1', 'l2']  # Penalty type
}

# Create a GridSearchCV object
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")

# Fit the model
grid_search.fit(X_train, y_train)

In [None]:
# Best estimator
grid_search.best_estimator_

In [None]:
# Best score
grid_search.best_score_

0.859375
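
`best_score_` is the mean cross-validated accuracy of the winning parameter combination. The full grid can be inspected through `cv_results_`. A minimal sketch on synthetic data, with two simplifications that are assumptions of this example rather than the notebook's setup: it uses the default solver and tunes only `C`. The names `demo_pipe`, `demo_grid`, and `demo_search` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic four-class data standing in for X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

demo_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
demo_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}
demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5, scoring="accuracy")
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, best rank first.
cols = ["param_logisticregression__C", "mean_test_score", "std_test_score", "rank_test_score"]
cv_table = pd.DataFrame(demo_search.cv_results_)[cols].sort_values("rank_test_score")
print(cv_table)
```

Looking at `std_test_score` next to `mean_test_score` shows whether the best combination is clearly better or only marginally ahead of its neighbours.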

#### 8.1.2 Model Evaluation

In [None]:
# Train Set Score (Accuracy)
train_acc = grid_search.score(X_train, y_train)
print(f"Training Accuracy: {train_acc*100:.2f}%")

# Test Set Score (Accuracy)
test_acc = grid_search.score(X_test, y_test)
print(f"Testing Accuracy: {test_acc*100:.2f}%")

Training Accuracy: 89.38%
Testing Accuracy: 84.00%


In [None]:
# Making predictions on X_test
np.random.seed(42)
y_preds = grid_search.best_estimator_.predict(X_test)

In [None]:
# Making a prediction on a single test sample
pred_price = grid_search.best_estimator_.predict(X_test.iloc[[15]])
true_price = y_test.iloc[15]

print("Price Range Prediction for Logistic Regression Model:")
print("\tTest Set:")
print(f"""\t\tPredicted Price Range: {pred_price[0]:.2f} | True Price Range: {true_price:.2f}""")
ds.loc[X_test.iloc[15].name]

Price Range Prediction for Logistic Regression Model:
	Test Set:
		Predicted Price Range: 3.00 | True Price Range: 3.00


Unnamed: 0,1985
battery_power,1829.0
blue,1.0
clock_speed,2.1
dual_sim,0.0
fc,8.0
four_g,0.0
int_memory,59.0
m_dep,0.1
mobile_wt,91.0
n_cores,5.0


In [None]:
# Classification Report
print(f"Logistic Regression Classification Report:\n\n{classification_report(y_test, y_preds)}")

Logistic Regression Classification Report:

              precision    recall  f1-score   support

           0       0.99      1.00      1.00       100
           1       0.72      0.68      0.70       100
           2       0.68      0.70      0.69       100
           3       0.96      0.98      0.97       100

    accuracy                           0.84       400
   macro avg       0.84      0.84      0.84       400
weighted avg       0.84      0.84      0.84       400



Let's analyze the performance metrics of the Logistic Regression model:

**Metrics Analysis:**
1. **Training Accuracy (89.38%):** This metric represents the model's accuracy on the training data. It indicates that the model correctly predicted the price range for approximately 89.38% of the mobile phones in the training set.
2. **Testing Accuracy (84.00%):** This metric represents the model's accuracy on the testing data, which is unseen data during training. It indicates that the model correctly predicted the price range for approximately 84.00% of the mobile phones in the testing set.
3. **Precision:** Precision measures the proportion of correctly predicted positive instances (for a specific class) out of all instances predicted as positive for that class.
  - Precision for class 0 (Low Price) is very high (0.99), meaning that when the model predicts a phone as 'Low Price,' it is almost always correct.
  - Precision for class 3 (Very High Price) is also high (0.96), indicating good accuracy in predicting 'Very High Price' phones.
  - Precision for classes 1 (Medium Price) and 2 (High Price) is lower (0.72 and 0.68 respectively), suggesting that the model has more difficulty accurately identifying these price ranges.
4. **Recall:** Recall measures the proportion of correctly predicted positive instances (for a specific class) out of all actual positive instances for that class.
  - Recall for class 0 (Low Price) is perfect (1.00), indicating that the model correctly identifies all 'Low Price' phones.
  - Recall for class 3 (Very High Price) is also high (0.98), meaning it captures most 'Very High Price' phones.
  - Recall for classes 1 (Medium Price) and 2 (High Price) is lower (0.68 and 0.70 respectively), indicating that the model misses some phones belonging to these price ranges.
5. **F1-score:** The F1-score is the harmonic mean of precision and recall, providing a balanced measure of both metrics.
  - F1-scores generally follow the trends of precision and recall.
  - Higher F1-scores for classes 0 and 3 indicate better overall performance for these price ranges.
  - Lower F1-scores for classes 1 and 2 highlight the model's relatively weaker performance in these categories.
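
The definitions above can be worked through by hand from a confusion matrix. The matrix below is hypothetical: it was constructed to agree with the per-class figures in the classification report and is not the notebook's actual confusion matrix.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class),
# chosen to be consistent with the reported precision, recall, and accuracy.
cm = np.array([
    [100,  0,  0,  0],
    [  1, 68, 31,  0],
    [  0, 26, 70,  4],
    [  0,  0,  2, 98],
])

tp = np.diag(cm)                      # correctly classified phones per class
precision = tp / cm.sum(axis=0)       # divide by column sums (all predicted as class)
recall = tp / cm.sum(axis=1)          # divide by row sums (all actually in class)
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2), accuracy)
```

In this illustrative matrix almost all errors sit in the cells where classes 1 and 2 are confused with each other, which is the pattern the lower scores for those two classes point to.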

**Model Performance and Generalization:** The model demonstrates reasonably good overall performance, with an accuracy of 84.00% on the testing set. The drop in accuracy from training to testing (89.38% to 84.00%) indicates mild overfitting, but the gap is small and the test accuracy is close to the cross-validated score of 85.94%, so the model generalizes fairly well to unseen data.

**Model Insights and Use:**
The model excels at predicting 'Low Price' and 'Very High Price' mobile phones, achieving high precision, recall, and F1-scores for these classes. The model has more difficulty accurately predicting 'Medium Price' and 'High Price' phones, as indicated by lower precision, recall, and F1-scores. Despite some weaknesses, the model can be useful for providing a preliminary price range prediction for mobile phones. It can assist in market analysis, product categorization, and potentially even pricing strategies.

In [None]:
# Plotting the Confusion Matrix
def plot_confusion_matrix(y_test: np.ndarray, y_preds: np.ndarray, model_name: str, plot_name: str) -> None:
    """Plot confusion matrix."""

    cm = confusion_matrix(y_test, y_preds)

    fig = px.imshow(
        cm,
        text_auto=True,  # Display values on the heatmap
        labels=dict(x="Predicted", y="True"),  # Set axis labels
        x=['Low', 'Medium', 'High', 'Very High'],  # Update x-axis labels
        y=['Low', 'Medium', 'High', 'Very High'],  # Update y-axis labels
        color_continuous_scale="Blues"  # Customize the color scale
    )

    fig.update_layout(title=f"Confusion Matrix: {model_name}")  # Set plot title
    fig_to_html(fig, f"{plot_name}")
    fig.show()  # Display plot

plot_confusion_matrix(y_test.to_numpy(), y_preds, "Logistic Regression", "confusion_matrix_log_reg.html")

The above confusion matrix is consistent with the classification report.

In [None]:
# Plotting Precision-Recall Curve
def plot_precision_recall_curve(y_test: np.ndarray, y_preds: np.ndarray, model_name: str, plot_name: str) -> None:
    """Plot precision-recall curve."""

    import plotly.graph_objects as go
    from sklearn.metrics import precision_recall_curve, average_precision_score
    from sklearn.preprocessing import label_binarize

    # y_test holds the true labels and y_preds the predicted (hard) labels

    # 1. Binarize the labels
    n_classes = len(ds['price_range'].unique())  # Get the number of classes
    y_test_bin = label_binarize(y_test, classes=range(n_classes))
    y_preds_bin = label_binarize(y_preds, classes=range(n_classes))

    # 2. Create the Plotly figure
    fig = go.Figure()

    # 3. Calculate and plot precision-recall curves for each class
    for i in range(n_classes):
        precision, recall, _ = precision_recall_curve(y_test_bin[:, i], y_preds_bin[:, i])
        avg_precision = average_precision_score(y_test_bin[:, i], y_preds_bin[:, i])

        fig.add_trace(go.Scatter(
            x=recall,
            y=precision,
            mode='lines',
            name=f"Class {i} (Avg Precision: {avg_precision:.2f})"
        ))

    # 4. Update layout for better visualization
    fig.update_layout(
        title=f"Precision-Recall Curve: {model_name}",
        xaxis_title="Recall",
        yaxis_title="Precision",
        xaxis_range=[0, 1],
        yaxis_range=[0, 1],
        showlegend=True
    )

    fig_to_html(fig, f"{plot_name}")

    fig.show()  # Display plot

plot_precision_recall_curve(y_test.to_numpy(), y_preds, "Logistic Regression", "pr_curve_log_reg.html")

**Overall Interpretation of PR Curve:**
The Precision-Recall curve reinforces the observations from the classification report and accuracy metrics. The model demonstrates strong performance in identifying 'Low Price' and 'Very High Price' phones, but it struggles with the 'Medium Price' and 'High Price' categories. This could be due to overlapping features or less clear distinctions between these price ranges in the dataset.

**Insights:**

- The model might be most useful in scenarios where correctly identifying 'Low Price' and 'Very High Price' phones is critical, even if it means some misclassification of 'Medium Price' and 'High Price' phones.
- If accurate prediction of all price ranges is equally important, further investigation and model improvement may be necessary, focusing on improving the performance for the 'Medium Price' and 'High Price' categories.
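
The plotting function above receives hard class predictions, so each per-class curve reduces to a single operating point joined to the corners. A precision-recall curve is usually traced from continuous scores. A minimal sketch of that variant on synthetic data, assuming the classifier exposes `predict_proba` (logistic regression does); all names ending in `_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, label_binarize

# Synthetic four-class data standing in for the phone dataset.
X_demo, y_demo = make_classification(
    n_samples=800, n_features=10, n_informative=6, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf_demo.fit(X_tr, y_tr)

proba = clf_demo.predict_proba(X_te)              # shape (n_samples, n_classes)
classes = clf_demo.classes_
y_te_bin = label_binarize(y_te, classes=classes)  # one-vs-rest indicator matrix

pr_summary = {}
for i, cls in enumerate(classes):
    # Scores are class probabilities, so the curve has many thresholds.
    precision, recall, _ = precision_recall_curve(y_te_bin[:, i], proba[:, i])
    ap = average_precision_score(y_te_bin[:, i], proba[:, i])
    pr_summary[int(cls)] = (len(precision), ap)
    print(f"class {cls}: {len(precision)} curve points, average precision {ap:.3f}")
```

In the notebook this would mean passing `grid_search.best_estimator_.predict_proba(X_test)` to the plotting function instead of `y_preds`.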

In [None]:
# Plotting ROC Curve
import plotly.graph_objects as go
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.preprocessing import label_binarize

def plot_roc_curve(y_test: np.ndarray, y_preds: np.ndarray, model_name: str, plot_name: str) -> None:
    """Plots the ROC curve."""

    # 1. Binarize the labels.
    n_classes = len(ds['price_range'].unique())  # Get the number of classes
    y_test_bin = label_binarize(y_test, classes=range(n_classes))
    y_preds_bin = label_binarize(y_preds, classes=range(n_classes))

    # 2. Create the figure.
    fig = go.Figure()

    # 3. Calculate the fpr and tpr.
    for i in range(n_classes):
        fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_preds_bin[:, i])
        roc_auc = auc(fpr, tpr)

        fig.add_trace(go.Scatter(
            x=fpr,
            y=tpr,
            mode='lines',
            name=f"Class {i} (AUC = {roc_auc:.2f})"
        ))

    # 4. Update the plot.
    fig.update_layout(
        title=f"ROC Curve: {model_name}",
        xaxis_title="False Positive Rate",
        yaxis_title="True Positive Rate",
        xaxis_range=[0, 1],
        yaxis_range=[0, 1],
        showlegend=True
    )

    fig_to_html(fig, f"{plot_name}")

    fig.show()  # Display

plot_roc_curve(y_test.to_numpy(), y_preds, "Logistic Regression", "roc_curve_log_reg.html")

**Overall Interpretation:**
The ROC curve further supports the findings from the precision-recall curve and other metrics. The model demonstrates outstanding performance in identifying 'Low Price' and 'Very High Price' phones, achieving high true positive rates with low false positive rates. However, it faces challenges in discriminating between 'Medium Price' and 'High Price' phones, as indicated by the lower and more curved ROC curves for these classes.

**Insights:**

- The model is highly reliable for scenarios where correctly identifying 'Low Price' and 'Very High Price' phones is crucial, even if it means some misclassification of 'Medium Price' and 'High Price' phones.
- If accurate prediction of all price ranges is equally important, further investigation and model improvement may be necessary, focusing on improving the discrimination ability for the 'Medium Price' and 'High Price' categories.
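
As with the precision-recall curve, ROC curves built from hard labels contain one operating point per class. When only a summary number is needed, `roc_auc_score` accepts class probabilities directly. A minimal sketch on synthetic data; the names are hypothetical, and macro-averaged one-vs-rest AUC is one reasonable choice among several.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic four-class data standing in for the phone dataset.
X_demo, y_demo = make_classification(
    n_samples=800, n_features=10, n_informative=6, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf_demo.fit(X_tr, y_tr)

# One-vs-rest AUC per class, averaged with equal weight per class.
macro_auc = roc_auc_score(
    y_te, clf_demo.predict_proba(X_te), multi_class="ovr", average="macro"
)
print(f"macro one-vs-rest ROC AUC: {macro_auc:.3f}")
```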

#### 8.1.3 Logging Model

In [None]:
# Logging Experiment
from datetime import datetime
experiment_name = "mob_price_pred_log_reg"
run_name = "run_"+str(datetime.now().strftime("%d-%m-%y_%H:%M:%S"))

run_metrics = {"train_acc": train_acc, "test_acc": test_acc}

artifact_paths = {
    "mob_scatter_plot": "/content/plotly_html/mobile_phone_scatter_plot.html",
    "battery_power_vs_price_range": "/content/plotly_html/battery_power_vs_price_range.html",
    "ram_vs_price_range": "/content/plotly_html/ram_vs_price_range.html",
    "3g_4g_availability_by_price_range": "/content/plotly_html/3g_4g_availability_by_price_range.html",
    "confusion_matrix": "/content/plotly_html/confusion_matrix_log_reg.html",
    "pr_curve": "/content/plotly_html/pr_curve_log_reg.html",
    "roc_curve": "/content/plotly_html/roc_curve_log_reg.html",
}

run_params = {"penalty": grid_search.best_params_["logisticregression__penalty"], "C": grid_search.best_params_["logisticregression__C"]}

create_experiment(experiment_name, run_name, run_metrics, grid_search.best_estimator_, model_name="log_reg", artifact_paths=artifact_paths, run_params=run_params, tag_dict={"tag1": "Logistic Regression", "tag2": "Mobile Phone Price Prediction"})



🏃 View run run_28-11-24_14:12:27 at: https://dagshub.com/pranay.makxenia/ML_Projects.mlflow/#/experiments/13/runs/d87ee03b963f4205a86fd31ee4817bff
🧪 View experiment at: https://dagshub.com/pranay.makxenia/ML_Projects.mlflow/#/experiments/13
Run - run_28-11-24_14:12:27 is logged to Experiment - mob_price_pred_log_reg


### 8.2 K-Nearest Neighbors Classifier

#### 8.2.1 Model Training

In [None]:
np.random.seed(42)

# Create a pipeline
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Create a parameter grid
param_grid = {
    'kneighborsclassifier__n_neighbors': [3, 5, 7, 9, 11, 13, 15],  # Number of neighbors
    'kneighborsclassifier__weights': ['uniform', 'distance']  # Weighting scheme
}

# Create a GridSearchCV object
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")

# Fit the model
grid_search.fit(X_train, y_train)

In [None]:
# Best estimator
grid_search.best_estimator_

In [None]:
# Best score
grid_search.best_score_

0.579375

#### 8.2.2 Model Evaluation

In [None]:
# Train Set Score (Accuracy)
train_acc = grid_search.score(X_train, y_train)
print(f"Training Accuracy: {train_acc*100:.2f}%")

# Test Set Score (Accuracy)
test_acc = grid_search.score(X_test, y_test)
print(f"Testing Accuracy: {test_acc*100:.2f}%")

Training Accuracy: 100.00%
Testing Accuracy: 57.50%


In [None]:
# Predict class labels for the test features (X_test)
np.random.seed(42)
y_preds = grid_search.best_estimator_.predict(X_test)

In [None]:
# Predict a single test sample (row 15) and compare it with its true label
sample = X_test.iloc[[15]]  # double brackets keep a one-row DataFrame with column names
pred_price = grid_search.best_estimator_.predict(sample)
true_price = y_test.iloc[15]

print("Price Range Prediction for K-Nearest Neighbors Model:")
print("\tTest Set:")
print(f"""\t\tPredicted Price Range: {pred_price[0]:.2f} | True Price Range: {true_price:.2f}""")
ds.loc[X_test.iloc[15].name]

Price Range Prediction for K-Nearest Neighbors Model:
	Test Set:
		Predicted Price Range: 3.00 | True Price Range: 3.00


Unnamed: 0,1985
battery_power,1829.0
blue,1.0
clock_speed,2.1
dual_sim,0.0
fc,8.0
four_g,0.0
int_memory,59.0
m_dep,0.1
mobile_wt,91.0
n_cores,5.0


In [None]:
# Classification Report
print(f"K-Nearest Neighbors Classification Report:\n\n{classification_report(y_test, y_preds)}")

K-Nearest Neighbors Classification Report:

              precision    recall  f1-score   support

           0       0.79      0.68      0.73       100
           1       0.41      0.45      0.43       100
           2       0.44      0.47      0.45       100
           3       0.72      0.70      0.71       100

    accuracy                           0.57       400
   macro avg       0.59      0.57      0.58       400
weighted avg       0.59      0.57      0.58       400



Performance metrics of the K-Nearest Neighbors model:

**Metrics Analysis:**
1. **Training Accuracy (100.00%):** This metric represents the model's accuracy on the training data. For K-Nearest Neighbors, a perfect training score is not informative on its own: if the grid search selected `weights='distance'`, every training point is its own nearest neighbor at distance zero and is therefore always assigned its own label. The score has to be read together with the cross-validation and test scores.
2. **Testing Accuracy (57.50%):** This metric represents the model's accuracy on the testing data, which was not seen during training. The test accuracy (57.50%) is far below the training accuracy (100.00%) and close to the cross-validated accuracy (57.94%), so the low score reflects genuinely poor generalization rather than an unlucky split.
3. **Precision:** Precision measures the proportion of correctly predicted positive instances (for a specific class) out of all instances predicted as positive for that class.
  - Precision for class 0 (Low Price) is relatively high (0.79), meaning that when the model predicts a phone as 'Low Price,' it is correct about 79% of the time.
  - Precision for class 3 (Very High Price) is also relatively good (0.72), indicating decent accuracy in predicting 'Very High Price' phones.
  - Precision for classes 1 (Medium Price) and 2 (High Price) is lower (0.41 and 0.44 respectively), suggesting that the model has more difficulty accurately identifying these price ranges.
4. **Recall:** Recall measures the proportion of correctly predicted positive instances (for a specific class) out of all actual positive instances for that class.
  - Recall for class 0 (Low Price) is 0.68, indicating that the model correctly identifies about 68% of 'Low Price' phones.
  - Recall for classes 1 (Medium Price) and 2 (High Price) is low (0.45 and 0.47 respectively), so the model misses more than half of the phones in these ranges. Recall for class 3 (Very High Price) is 0.70, comparable to class 0.
5. **F1-score:** The F1-score is the harmonic mean of precision and recall, providing a balanced measure of both metrics.
  - F1-scores generally follow the trends of precision and recall.
  - Classes 0 and 3 have the highest F1-scores (0.73 and 0.71), while classes 1 and 2 have much lower F1-scores (0.43 and 0.45), reflecting the model's weaker performance on the middle price ranges.
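
As a quick sanity check on the report above, the F1-scores can be recomputed from the printed precision and recall values. This is only a sketch: the `report` dictionary is typed in by hand from the table, the helper name `f1_score_from` is made up for illustration, and because the inputs are already rounded to two decimals the results could differ from scikit-learn's in the last digit.

```python
# Precision and recall per class, copied by hand from the KNN classification report.
report = {
    0: (0.79, 0.68),
    1: (0.41, 0.45),
    2: (0.44, 0.47),
    3: (0.72, 0.70),
}

def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Per-class F1, rounded the same way the report prints it.
f1_by_class = {label: round(f1_score_from(p, r), 2) for label, (p, r) in report.items()}

# Macro average: the unweighted mean of the per-class scores.
macro_f1 = round(sum(f1_by_class.values()) / len(f1_by_class), 2)

for label, f1 in f1_by_class.items():
    print(f"class {label}: F1 = {f1:.2f}")
print(f"macro F1 = {macro_f1:.2f}")
```

The recomputed values match the `f1-score` column and the `macro avg` row of the report.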

**Model Performance and Generalization:** The model demonstrates poor overall performance, with a testing accuracy of only 57.50%. This is significantly lower than the training accuracy, highlighting the overfitting issue. The large discrepancy between training and testing accuracy indicates that the model has not generalized well to unseen data. It has memorized the training data but fails to apply the learned patterns to new instances effectively.

**Model Insights and Use:** The model shows some ability to predict 'Low Price' and 'Very High Price' phones, although with limited accuracy. The model suffers from severe overfitting, resulting in poor generalization to unseen data. It has difficulty accurately predicting 'Medium Price' and 'High Price' phones.
In its current state, the model is not reliable for predicting mobile phone price ranges. Its poor generalization makes it unsuitable for practical applications.
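
One caveat when reading the 100% training accuracy: for K-Nearest Neighbors it can arise from the weighting scheme alone. With `weights='distance'`, each training point lies at distance zero from itself, so it is always given its own label. The sketch below illustrates this on synthetic data with random labels (both are assumptions made for the demonstration; it does not use the phone dataset and does not show which weighting the grid search actually selected).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))      # synthetic features
y = rng.integers(0, 4, size=200)   # random labels: there is nothing to learn

train_scores = {}
for weights in ("uniform", "distance"):
    knn = KNeighborsClassifier(n_neighbors=5, weights=weights).fit(X, y)
    # Score on the same data the model was fitted on.
    train_scores[weights] = knn.score(X, y)
    print(f"{weights:>8}: training accuracy = {train_scores[weights]:.2f}")
```

Even though the labels are pure noise, the distance-weighted model scores perfectly on its own training data, which is why the cross-validated and test scores are the ones to rely on.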

In [None]:
# Plotting the Confusion Matrix
plot_confusion_matrix(y_test.to_numpy(), y_preds, "K-Nearest Neighbors", "confusion_matrix_knn.html")

The confusion matrix above is consistent with the classification report for the K-Nearest Neighbors classifier.

In [None]:
# Plotting Precision-Recall Curve
plot_precision_recall_curve(y_test.to_numpy(), y_preds, "K-Nearest Neighbors", "pr_curve_knn.html")

**Overall Interpretation:** The Precision-Recall curve reflects the observations from the classification report and accuracy metrics. The K-Nearest Neighbors model exhibits suboptimal performance, especially for the 'Medium Price' and 'High Price' categories. The curves for these classes are lower and more curved, indicating a significant trade-off between precision and recall.

**Insights:**
- The model might be somewhat useful in scenarios where correctly identifying 'Low Price' and 'Very High Price' phones is more important than achieving high accuracy across all price ranges. However, the overall performance is not ideal, particularly for 'Medium Price' and 'High Price' phones.
- Further investigation and model improvement are necessary to address the limitations and improve the precision and recall for all price categories. This might involve feature engineering, hyperparameter tuning, data augmentation, or exploring alternative algorithms.

In [None]:
# Plotting ROC Curve
plot_roc_curve(y_test.to_numpy(), y_preds, "K-Nearest Neighbors", "roc_curve_knn.html")

**Overall Interpretation:** The ROC curve reinforces the findings from the precision-recall curve and other metrics. The K-Nearest Neighbors model exhibits suboptimal performance, especially for the 'Medium Price' and 'High Price' categories. The curves for these classes are lower and further away from the top-left corner, indicating a less effective discrimination ability. The model struggles to distinguish between these price ranges effectively.

**Insights:**
- The model's performance is relatively better for 'Low Price' and 'Very High Price' phones, but it struggles with 'Medium Price' and 'High Price' phones.
- The lower AUC scores for classes 1 and 2 suggest that the model has difficulty accurately classifying these price ranges.
- Further investigation and model improvement are necessary to address the limitations and improve the overall performance, particularly for the 'Medium Price' and 'High Price' categories. This might involve feature engineering, hyperparameter tuning, data augmentation, or exploring alternative algorithms.
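
The plotting calls above pass hard class predictions (`y_preds`) to the helper. If the curve is computed from those labels, each class has only a single operating point between the two corners, whereas class probabilities give a full curve. The sketch below compares the two on synthetic data; `make_classification` stands in for the phone dataset, `n_neighbors=9` is an arbitrary choice, and the variable names are assumptions rather than the notebook's own helper.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, label_binarize

# Synthetic 4-class data standing in for the phone dataset (an assumption).
X, y = make_classification(n_samples=800, n_features=10, n_informative=6,
                           n_classes=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9))
model.fit(X_tr, y_tr)

classes = model.classes_
y_true_bin = label_binarize(y_te, classes=classes)              # one column per class
y_score = model.predict_proba(X_te)                             # class probabilities
y_hard_bin = label_binarize(model.predict(X_te), classes=classes)

auc_from_scores, auc_from_labels = {}, {}
for i, cls in enumerate(classes):
    # One-vs-rest ROC curve from probabilities: many thresholds.
    fpr, tpr, _ = roc_curve(y_true_bin[:, i], y_score[:, i])
    auc_from_scores[cls] = auc(fpr, tpr)
    # Same curve from hard 0/1 labels: only one interior point.
    fpr_h, tpr_h, _ = roc_curve(y_true_bin[:, i], y_hard_bin[:, i])
    auc_from_labels[cls] = auc(fpr_h, tpr_h)
    print(f"class {cls}: AUC from probabilities = {auc_from_scores[cls]:.2f}, "
          f"from hard labels = {auc_from_labels[cls]:.2f}")
```

In the notebook, the equivalent change would be to pass `grid_search.best_estimator_.predict_proba(X_test)` to the plotting helper, provided the helper accepts per-class scores.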

#### 8.2.3 Logging Model

In [None]:
# Logging Experiment
from datetime import datetime
experiment_name = "mob_price_pred_knn"
run_name = "run_"+str(datetime.now().strftime("%d-%m-%y_%H:%M:%S"))

run_metrics = {"train_acc": train_acc, "test_acc": test_acc}

artifact_paths = {"mob_scatter_plot": "/content/plotly_html/mobile_phone_scatter_plot.html", "battery_power_vs_price_range": "/content/plotly_html/battery_power_vs_price_range.html", "ram_vs_price_range": "/content/plotly_html/ram_vs_price_range.html", "3g_4g_availability_by_price_range": "/content/plotly_html/3g_4g_availability_by_price_range.html",
    "confusion_matrix": "/content/plotly_html/confusion_matrix_knn.html", "pr_curve": "/content/plotly_html/pr_curve_knn.html", "roc_curve": "/content/plotly_html/roc_curve_knn.html",
                  }

run_params = {"n_neighbors": grid_search.best_params_["kneighborsclassifier__n_neighbors"], "weights": grid_search.best_params_["kneighborsclassifier__weights"]}

create_experiment(experiment_name, run_name, run_metrics, grid_search.best_estimator_, model_name="knn", artifact_paths=artifact_paths, run_params=run_params, tag_dict={"tag1": "KNN", "tag2": "Mobile Phone Price Prediction"})

2024/11/28 14:20:01 INFO mlflow.tracking.fluent: Experiment with name 'mob_price_pred_knn' does not exist. Creating a new experiment.


🏃 View run run_28-11-24_14:20:00 at: https://dagshub.com/pranay.makxenia/ML_Projects.mlflow/#/experiments/14/runs/cf729e801b3c498bb5c6e6d38fb73c1f
🧪 View experiment at: https://dagshub.com/pranay.makxenia/ML_Projects.mlflow/#/experiments/14
Run - run_28-11-24_14:20:00 is logged to Experiment - mob_price_pred_knn
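
The artifact dictionary is rebuilt by hand for every model, which makes it easy to drop or mistype a key. One possible way to reduce the repetition is sketched below. The helper `build_artifact_paths` and its `suffix` parameter are hypothetical (they are not part of the notebook), and the file names simply follow the pattern used in the logging cells above.

```python
from pathlib import PurePosixPath

PLOT_DIR = PurePosixPath("/content/plotly_html")

# Plots shared by every run (from the exploratory analysis).
SHARED_PLOTS = {
    "mob_scatter_plot": "mobile_phone_scatter_plot.html",
    "battery_power_vs_price_range": "battery_power_vs_price_range.html",
    "ram_vs_price_range": "ram_vs_price_range.html",
    "3g_4g_availability_by_price_range": "3g_4g_availability_by_price_range.html",
}

def build_artifact_paths(suffix: str) -> dict[str, str]:
    """Return the artifact dict for one model, e.g. suffix='knn' or 'log_reg'."""
    paths = {key: str(PLOT_DIR / name) for key, name in SHARED_PLOTS.items()}
    # Model-specific evaluation plots follow the '<plot>_<suffix>.html' pattern.
    for plot in ("confusion_matrix", "pr_curve", "roc_curve"):
        key = plot
        paths[key] = str(PLOT_DIR / f"{plot}_{suffix}.html")
    return paths

artifact_paths = build_artifact_paths("knn")
for key, path in artifact_paths.items():
    print(f"{key}: {path}")
```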


### 8.3 Random Forest Classifier

#### 8.3.1 Model Training

In [None]:
np.random.seed(42)

# Create a pipeline
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(n_jobs=-1))

# Create a parameter grid
param_grid = {
    'randomforestclassifier__n_estimators': [100, 150, 250, 300],
    'randomforestclassifier__max_features': [10, 19, 'sqrt', 'log2'],
    'randomforestclassifier__max_depth': [None, 5, 10, 20],
    'randomforestclassifier__min_samples_split': [2, 5, 10],
    'randomforestclassifier__min_samples_leaf': [1, 2, 4]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")

# Fit the model
grid_search.fit(X_train, y_train)


invalid value encountered in cast



In [None]:
# Best estimator
grid_search.best_estimator_

In [None]:
# Best score
grid_search.best_score_

0.8868750000000001
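
To put the cost of this search in perspective, the number of candidate settings in the grid above can be counted directly. The sketch below copies the grid by hand and multiplies it out. With `cv=5`, each candidate is fitted five times, and `GridSearchCV` refits the best candidate once more on the full training set by default.

```python
from math import prod

# Copied by hand from the Random Forest parameter grid above.
param_grid = {
    "randomforestclassifier__n_estimators": [100, 150, 250, 300],
    "randomforestclassifier__max_features": [10, 19, "sqrt", "log2"],
    "randomforestclassifier__max_depth": [None, 5, 10, 20],
    "randomforestclassifier__min_samples_split": [2, 5, 10],
    "randomforestclassifier__min_samples_leaf": [1, 2, 4],
}

cv_folds = 5
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds  # excludes the final refit on the full training set

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

An exhaustive search of this size is slow; sampling a subset of candidates with `RandomizedSearchCV` is a common alternative, although it is not what this notebook uses.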

#### 8.3.2 Model Evaluation

In [None]:
# Train Set Score (Accuracy)
train_acc = grid_search.score(X_train, y_train)
print(f"Training Accuracy: {train_acc*100:.2f}%")

# Test Set Score (Accuracy)
test_acc = grid_search.score(X_test, y_test)
print(f"Testing Accuracy: {test_acc*100:.2f}%")

Training Accuracy: 97.81%
Testing Accuracy: 91.00%


In [None]:
# Predict class labels for the test features (X_test)
np.random.seed(42)
y_preds = grid_search.best_estimator_.predict(X_test)

In [None]:
# Predict a single test sample (row 15) and compare it with its true label
sample = X_test.iloc[[15]]  # double brackets keep a one-row DataFrame with column names
pred_price = grid_search.best_estimator_.predict(sample)
true_price = y_test.iloc[15]

print("Price Range Prediction for Random Forest Model:")
print("\tTest Set:")
print(f"""\t\tPredicted Price Range: {pred_price[0]:.2f} | True Price Range: {true_price:.2f}""")
ds.loc[X_test.iloc[15].name]

Price Range Prediction for Random Forest Model:
	Test Set:
		Predicted Price Range: 3.00 | True Price Range: 3.00


Unnamed: 0,1985
battery_power,1829.0
blue,1.0
clock_speed,2.1
dual_sim,0.0
fc,8.0
four_g,0.0
int_memory,59.0
m_dep,0.1
mobile_wt,91.0
n_cores,5.0


In [None]:
# Classification Report
print(f"Random Forest Classifier Classification Report:\n\n{classification_report(y_test, y_preds)}")

Random Forest Classifier Classification Report:

              precision    recall  f1-score   support

           0       0.96      0.95      0.95       100
           1       0.86      0.87      0.87       100
           2       0.86      0.87      0.87       100
           3       0.96      0.95      0.95       100

    accuracy                           0.91       400
   macro avg       0.91      0.91      0.91       400
weighted avg       0.91      0.91      0.91       400



Performance metrics of the Random Forest Classifier model:

**Metrics Analysis:**
1. **Training Accuracy (97.81%):** This metric represents the model's accuracy on the training data. It indicates that the model correctly predicted the price range for approximately 97.81% of the mobile phones in the training set. This high accuracy suggests that the model has learned the training data very well. However, it's essential to consider the testing accuracy to assess if the model is overfitting.
2. **Testing Accuracy (91.00%):** This metric represents the model's accuracy on the testing data, which is unseen data during training. It indicates that the model correctly predicted the price range for approximately 91.00% of the mobile phones in the testing set. This high testing accuracy, compared to the training accuracy, suggests that the model generalizes well to new, unseen data and is not significantly overfitting.
3. **Precision:** Precision measures the proportion of correctly predicted positive instances (for a specific class) out of all instances predicted as positive for that class.
  - Precision for classes 0 (Low Price) and 3 (Very High Price) is very high (0.96), meaning that when the model predicts a phone as 'Low Price' or 'Very High Price,' it is correct about 96% of the time.
  - Precision for classes 1 (Medium Price) and 2 (High Price) is also relatively good (0.86), indicating decent accuracy in predicting these price ranges.
4. **Recall:** Recall measures the proportion of correctly predicted positive instances (for a specific class) out of all actual positive instances for that class.
  - Recall for classes 0 (Low Price) and 3 (Very High Price) is 0.95, indicating that the model correctly identifies about 95% of 'Low Price' and 'Very High Price' phones.
  - Recall for classes 1 (Medium Price) and 2 (High Price) is lower (0.87) but still good, and marginally above their precision (0.86), suggesting a good ability to capture phones belonging to these price ranges.
5. **F1-score:** The F1-score is the harmonic mean of precision and recall, providing a balanced measure of both metrics.
  - F1-scores generally follow the trends of precision and recall.
  - Classes 0 and 3 have high F1-scores (0.95), while classes 1 and 2 have slightly lower but still good F1-scores (0.87), reflecting the model's overall performance on each price range.

**Model Performance and Generalization:** The model demonstrates excellent overall performance, with a high testing accuracy of 91.00%. This indicates that the model is able to predict mobile phone price ranges accurately. The relatively small difference between training and testing accuracy suggests that the model generalizes well to unseen data and is not significantly overfitting. It has learned the underlying patterns in the data and can apply them effectively to new instances.

**Model Insights and Use:** The model excels at predicting all price ranges, achieving high precision, recall, and F1-scores for all classes. It demonstrates strong discrimination ability and generalization capabilities. While the model performs very well, there is still room for potential improvement, particularly for classes 1 and 2, where the precision and recall are slightly lower than for classes 0 and 3. The Random Forest Classifier model is a highly effective and reliable tool for predicting mobile phone price ranges. It can be used for market analysis, product categorization, pricing strategies, and other applications where accurate price range prediction is crucial.

In [None]:
# Plotting the Confusion Matrix
plot_confusion_matrix(y_test.to_numpy(), y_preds, "Random Forest", "confusion_matrix_rf.html")

The confusion matrix above is consistent with the classification report for the Random Forest Classifier.

In [None]:
# Plotting Precision-Recall Curve
plot_precision_recall_curve(y_test.to_numpy(), y_preds, "Random Forest", "pr_curve_rf.html")

**Overall Interpretation:** The Precision-Recall curve demonstrates that the Random Forest Classifier performs very well across all price ranges, achieving high precision and recall. The curves for all classes are generally high and stay close to the top-right corner, indicating a good balance between precision and recall.

**Insights:**
- The Random Forest Classifier is a highly effective model for predicting mobile phone price ranges, achieving excellent performance for all classes.
- The model demonstrates a good ability to distinguish between different price ranges, minimizing both false positives and false negatives.
- The high average precision scores for all classes further support the model's strong performance.

In [None]:
# Plotting ROC Curve
plot_roc_curve(y_test.to_numpy(), y_preds, "Random Forest", "roc_curve_rf.html")

**Overall Interpretation:** The ROC curve analysis demonstrates that the Random Forest Classifier performs exceptionally well across all price ranges, achieving high true positive rates with low false positive rates. The curves for all classes are generally high and close to the top-left corner, indicating a strong ability to discriminate between different price ranges.

**Insights:**
- The Random Forest Classifier is a highly effective model for predicting mobile phone price ranges, achieving excellent performance for all classes.
- The model demonstrates a strong ability to distinguish between different price ranges, minimizing both false positives and false negatives.
- The high AUC scores for all classes further support the model's strong performance.

#### 8.3.3 Logging Model

In [None]:
# Logging Experiment
from datetime import datetime
experiment_name = "mob_price_pred_random_forest"
run_name = "run_"+str(datetime.now().strftime("%d-%m-%y_%H:%M:%S"))

run_metrics = {"train_acc": train_acc, "test_acc": test_acc}

artifact_paths = {"mob_scatter_plot": "/content/plotly_html/mobile_phone_scatter_plot.html", "battery_power_vs_price_range": "/content/plotly_html/battery_power_vs_price_range.html", "ram_vs_price_range": "/content/plotly_html/ram_vs_price_range.html",
    "confusion_matrix": "/content/plotly_html/confusion_matrix_rf.html", "pr_curve": "/content/plotly_html/pr_curve_rf.html", "roc_curve": "/content/plotly_html/roc_curve_rf.html",
                  }

run_params = {
    "n_estimators": grid_search.best_params_["randomforestclassifier__n_estimators"],
    "max_features": grid_search.best_params_["randomforestclassifier__max_features"],
    "max_depth": grid_search.best_params_["randomforestclassifier__max_depth"],
    "min_samples_split": grid_search.best_params_["randomforestclassifier__min_samples_split"],
    "min_samples_leaf": grid_search.best_params_["randomforestclassifier__min_samples_leaf"]
}

create_experiment(experiment_name, run_name, run_metrics, grid_search.best_estimator_, model_name="random_forest", artifact_paths=artifact_paths, run_params=run_params, tag_dict={"tag1": "Random Forest Classifier", "tag2": "Mobile Phone Price Prediction"})

2024/11/28 15:17:35 INFO mlflow.tracking.fluent: Experiment with name 'mob_price_pred_random_forest' does not exist. Creating a new experiment.


🏃 View run run_28-11-24_15:17:34 at: https://dagshub.com/pranay.makxenia/ML_Projects.mlflow/#/experiments/15/runs/237207ce11a8487898c3665851dd28af
🧪 View experiment at: https://dagshub.com/pranay.makxenia/ML_Projects.mlflow/#/experiments/15
Run - run_28-11-24_15:17:34 is logged to Experiment - mob_price_pred_random_forest


## 9. Conclusion


After analyzing the models trained above, we can see that the `Random Forest Classifier` is the best model, followed by the `Logistic Regression Classifier`. We will still deploy all three models on a Streamlit app for comparison.
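
The comparison can be made explicit by tabulating the accuracies reported above. The sketch below fills in only the K-Nearest Neighbors and Random Forest figures from sections 8.2 and 8.3; the Logistic Regression values from section 8.1 would be added to the `results` dictionary in the same way. The dictionary layout is an assumption made for illustration.

```python
# Accuracies (in percent) copied by hand from the evaluation cells above.
results = {
    "K-Nearest Neighbors": {"train_acc": 100.00, "test_acc": 57.50},
    "Random Forest": {"train_acc": 97.81, "test_acc": 91.00},
}

# Train/test gap: a rough indicator of overfitting.
gaps = {
    name: round(scores["train_acc"] - scores["test_acc"], 2)
    for name, scores in results.items()
}

# Rank models by test accuracy, best first.
ranking = sorted(results, key=lambda name: results[name]["test_acc"], reverse=True)
best_model = ranking[0]

for name in ranking:
    print(f"{name:<20} test = {results[name]['test_acc']:.2f}%  gap = {gaps[name]:.2f} pts")
print(f"Best by test accuracy: {best_model}")
```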

# Next

Next, we will create a Streamlit app that deploys the models for predicting the mobile phone price range.

In [None]:
from google.colab import files
import shutil

def zip_and_download_folder(folder_path, zip_filename):
  shutil.make_archive(zip_filename, 'zip', folder_path)
  files.download(zip_filename + '.zip')

zip_and_download_folder('/content/plotly_html', 'plotly_html')
