<a href="https://colab.research.google.com/github/JatinKrRana/AlmaBetter-Capstone_Projects/blob/main/Mobile_Price_Range_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Mobile Price Range Prediction Project: A Machine Learning Classification Journey



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Name**            - Jatin Kumar Rana

# **Project Summary -**

In the fast-evolving landscape of technology, predicting the price range of mobile phones has become a critical task for both consumers and manufacturers. In this Mobile Price Range Prediction project, we use Machine Learning Classification to develop a model that predicts the price range of mobile devices from their features. The project involves several stages, including exploratory data analysis (EDA), data preprocessing, and the implementation of supervised machine learning algorithms such as Logistic Regression, Random Forest Classifier, and XGBoost Classifier.

### **Exploratory Data Analysis (EDA):**

Our journey commences with a detailed exploration of the dataset through EDA. Visualization tools and statistical techniques are employed to uncover patterns, trends, and relationships within the data. Descriptive statistics provide a snapshot of central tendencies, while data distribution plots illuminate the spread of key variables. Correlation matrices help discern connections between different features.

Understanding the nuances of the data is critical during EDA. Insights gained from this process guide subsequent decisions in data preprocessing and model selection. For instance, if certain features exhibit strong correlations with the target variable, they may play a pivotal role in predicting mobile price ranges.

### **Data Preprocessing:**

Building on the insights from EDA, we transition to the essential phase of data preprocessing. Obtaining a relevant and comprehensive dataset is fundamental to the success of any machine learning endeavor. This dataset, containing information about various mobile phones and their features, forms the foundation for our predictive model.

Identifying and addressing missing data takes precedence in the preprocessing stage. Techniques such as imputation or removal of missing values are applied to enhance the dataset's quality. With the dataset cleaned of missing values, encoding categorical data becomes the next focus. This step ensures that machine learning algorithms can effectively utilize all available information, as they typically operate on numerical data.

Data cleaning and feature engineering follow. This involves handling outliers, scaling features, and creating new features that might enhance the predictive power of the model. Feature engineering, informed by the patterns uncovered during EDA, becomes a crucial step in preparing the dataset for the subsequent machine learning algorithms.

### **Supervised Machine Learning Algorithms and Implementation:**

The heart of the project lies in the implementation of supervised machine learning algorithms. Three classifiers (Logistic Regression, Random Forest Classifier, and XGBoost Classifier) are chosen for their effectiveness in classification tasks.

**Logistic Regression:**
Logistic Regression is a linear model that is usually introduced for binary classification and extends to more than two classes. In our project, where the target has four price ranges, it serves as a baseline model, providing a simple yet effective approach to predict mobile price ranges. The algorithm calculates the probability of a mobile belonging to each price range based on its features.
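
To make the per-class probability idea concrete, here is a minimal sketch of multiclass logistic regression. It runs on synthetic data from `make_classification` rather than the project's dataset, and the names `X_demo`, `y_demo`, and `clf` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the phone data: 4 classes, 6 numeric features (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Standardizing the features helps the solver converge
X_scaled = StandardScaler().fit_transform(X_demo)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_scaled, y_demo)

# One probability per class for each sample; every row sums to 1
proba = clf.predict_proba(X_scaled[:3])
print(proba.shape)
print(proba.sum(axis=1))
```

The predicted class is simply the column with the highest probability, which is what `clf.predict` returns.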

**Random Forest Classifier:**
Random Forest is an ensemble learning algorithm known for its robustness and high accuracy. It operates by constructing multiple decision trees during training and outputs the mode of the classes as the prediction. Random Forest is particularly effective in handling complex relationships within the data.
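
The way a forest combines its trees can be checked by hand. The sketch below uses synthetic data, not the project's dataset, and illustrative names. One caveat: scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by the individual trees rather than counting hard votes, so the example reproduces that averaging.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic four-class data standing in for the phone features (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

sample = X_demo[:5]

# Collect each tree's class-probability estimate, then average across trees
tree_probas = np.stack([tree.predict_proba(sample) for tree in forest.estimators_])
manual_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[manual_proba.argmax(axis=1)]

print(manual_pred)
print(forest.predict(sample))
```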

**XGBoost Classifier:**
XGBoost, or Extreme Gradient Boosting, is a powerful and efficient algorithm that belongs to the gradient boosting family. It sequentially builds a series of weak learners, each one correcting the errors of those before it, to create a strong predictive model. XGBoost is known for its speed and performance, making it a popular choice in machine learning competitions.

The implementation of these algorithms involves training the models on the preprocessed data and evaluating their performance using metrics such as accuracy, precision, recall, and F1 score. Hyperparameter tuning may be applied to optimize the models further.
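
For a four-class target, precision, recall, and F1 have to be averaged over the classes. The sketch below uses small hand-made label lists (not results from this project) and macro averaging, which weights every price range equally. Macro averaging is an assumption that suits this dataset because its classes are balanced.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made labels for illustration only: two phones per price range
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 3, 3]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='macro')
recall = recall_score(y_true, y_pred, average='macro')
f1 = f1_score(y_true, y_pred, average='macro')

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```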

In conclusion, the Mobile Price Range Prediction project combines data preprocessing, exploratory data analysis, and the implementation of three supervised machine learning algorithms to create a robust predictive model. The outcome of this project not only provides valuable insights into the factors influencing mobile prices but also showcases the power of machine learning in solving real-world problems in the technology domain.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


### **Problem Statement:**

In the fiercely competitive mobile phone market, companies seek to understand the relationship between mobile phone features (e.g., RAM, internal memory) and selling prices. The goal is to categorize phones into price ranges rather than predict exact prices.

### **Objectives:**

**1. Data Collection and Preprocessing:**

* Assemble a comprehensive dataset.
* Cleanse and standardize the data.

**2. Exploratory Data Analysis (EDA):**

* Uncover patterns and correlations.
* Identify key features influencing prices.

**3. Feature Engineering:**

* Enhance predictive features.
* Consider new feature creation.

**4. Model Development:**

* Build a machine learning model for price range classification.
* Evaluate various algorithms.

**5. Model Evaluation and Fine-Tuning:**

* Assess model performance metrics.
* Optimize model parameters.

**6. Interpretability and Insights:**

* Extract actionable insights.
* Highlight influential features.

**7. Documentation and Reporting:**

* Document the entire process.
* Summarize findings and recommendations.

This study aims to equip mobile phone companies with actionable insights for strategic decision-making in product positioning and pricing.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following questions are answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that, for each algorithm, the following points are addressed.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

from datetime import datetime

from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Loading Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
mob_price_df = pd.read_csv('/content/drive/MyDrive/data_mobile_price_range.csv')

### Dataset First View

In [None]:
# Dataset First Look
mob_price_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
no_of_row = mob_price_df.shape[0]
no_of_col = mob_price_df.shape[1]

print(f"Number of rows in mob_price_df is {no_of_row} and columns is {no_of_col}.")

### Dataset Information

In [None]:
# Dataset Info
mob_price_df.info()

#### Duplicate Values

In [None]:
# Calculating the number of duplicate values in each column
num_duplicates_in_columns = mob_price_df.apply(lambda x: x.duplicated().sum())
print(f'Number of Duplicate Values in Each Column:\n{num_duplicates_in_columns}')

In [None]:
print(f'Number of Duplicate row in the dataset is:{mob_price_df.duplicated().sum()}')
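
The two duplicate checks above answer different questions. A per-column count is driven by cardinality (a binary column repeats almost every value), while the row-level count flags records that are identical in every field. A small sketch on a hypothetical toy frame shows the difference:

```python
import pandas as pd

# Hypothetical toy frame, not the project's data
toy_df = pd.DataFrame({
    'wifi': [0, 1, 0, 1, 0],
    'ram': [512, 1024, 512, 2048, 512],
})

# Repeated values within each column: equals row count minus unique count
per_column = toy_df.apply(lambda col: col.duplicated().sum())

# Fully identical rows: this is the check that matters for data quality
full_rows = toy_df.duplicated().sum()

print(per_column)
print(full_rows)
```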

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null_value_count = mob_price_df.isnull().sum().sum()
print(f"Number of null values in the mob_price_df dataframe is {null_value_count}.")

In [None]:
bool_df = mob_price_df.isnull()
column_name_list = list(mob_price_df.columns)
null_value_list = []

for i in column_name_list:
    null_value = (bool_df[i] == True).sum()
    null_value_list.append(null_value)

mob_price_null_value_count_df = pd.DataFrame({'Column Name': column_name_list, 'Null Value Count': null_value_list})
print(mob_price_null_value_count_df)

### What did you know about your dataset?

The dataset mob_price_df consists of 2000 rows and 21 columns, representing various features of mobile phones (e.g., battery power, camera quality, RAM, etc.). The data types include int64 and float64. There are no null values or duplicate rows. The target variable is price_range.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
pd.DataFrame(mob_price_df.columns,columns = ['Variables of Mobile Price DataFrame'])

In [None]:
# Dataset Describe
mob_price_df.describe()

### Variables Description

**1. battery_power:**  Battery capacity in milliampere-hours (mAh).

**2. blue**: Bluetooth availability (0 for absent, 1 for present).

**3. clock_speed:** Processor speed in gigahertz (GHz).

**4. dual_sim:** Dual SIM card support (0 for single SIM, 1 for dual SIM).

**5. fc:** Front Camera megapixels.

**6. four_g:** 4G network compatibility (0 for absent, 1 for present).

**7. int_memory:** Internal Memory in gigabytes.

**8. m_dep:** Mobile Depth (thickness) in centimeters.

**9. mobile_wt:** Mobile Weight in grams.

**10. n_cores:** Number of processor cores.

**11. pc:** Primary Camera megapixels.

**12. px_height:** Pixel Resolution Height.

**13. px_width:** Pixel Resolution Width.

**14. ram:** Random Access Memory in megabytes.

**15. sc_h:** Screen Height in centimeters.

**16. sc_w:** Screen Width in centimeters.

**17. talk_time:** Maximum call duration in hours.

**18. three_g:** 3G network compatibility (0 for absent, 1 for present).

**19. touch_screen:** Touch screen availability (0 for absent, 1 for present).

**20. wifi:** Wi-Fi capability (0 for absent, 1 for present).

**21. price_range:** Categorical variable indicating the price range (0, 1, 2, or 3).

### Check Unique Values for each variable.

In [None]:
# Number of unique values in each variable
mob_price_df.nunique()

In [None]:
# Unique Values for each variable.
for i in mob_price_df.columns:
    print(i, ':', mob_price_df[i].unique(), '\n')

## 3. ***Data Wrangling***

### Data Wrangling Code

**1. Identifying numerical and categorical columns:**

In [None]:
numerical_variables = []
categorical_variables = []

for col in mob_price_df.columns:
    if mob_price_df[col].nunique() > 5:
        numerical_variables.append(col)
    else:
        categorical_variables.append(col)

# Print the result
print('Numerical Columns:', numerical_variables)
print('Categorical Columns:', categorical_variables)

**2. Counting the number of '0' values in the 'px_height' and 'sc_w' columns.**

In [None]:
columns_with_zeros = ['px_height', 'sc_w']
count_zeros = mob_price_df[columns_with_zeros].eq(0).sum()
print(count_zeros)

**3. Removing '0' values of 'px_height' (pixel resolution height) column**

In [None]:
# .copy() makes the filtered frame independent, so later column assignments are safe
mob_price_df = mob_price_df[mob_price_df['px_height'] != 0].copy()

**4. Replace '0' Values with 'nan' in the 'sc_w' (screen width) Column**

In [None]:
mob_price_df['sc_w']=mob_price_df['sc_w'].replace(0,np.nan)

**5. Impute Missing Values using K-Nearest Neighbors (KNN)**

In [None]:
impute_knn = KNNImputer(n_neighbors=1)
mobile_data=pd.DataFrame(impute_knn.fit_transform(mob_price_df),columns=mob_price_df.columns)

In [None]:
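# Re-check on the imputed frame: no zeros and no NaNs should remain in these columns
columns_with_zeros = ['px_height', 'sc_w']
count_zeros = mobile_data[columns_with_zeros].eq(0).sum()
print(count_zeros)
print(mobile_data[columns_with_zeros].isnull().sum())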

### What all manipulations have you done and insights you found?



Data Manipulations and Insights:

**1. Column Identification:**

* Identified numerical and categorical columns based on the number of unique values.

**2. Handled '0' values in the 'px_height' and 'sc_w' columns, since a pixel height or screen width of 0 is not physically meaningful.**

* Counted '0' values in 'px_height' and 'sc_w': 2 and 180 respectively.
* Removed the rows with '0' in 'px_height', since there are only 2 of them.
* Replaced '0' with NaN in 'sc_w'.

**3. Imputation using KNN:**

* Utilized KNN imputation for missing values.

**4. Re-assessment of '0' Values:**

* Re-counted '0' values after manipulations.
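
The zero-to-NaN replacement and the KNN imputation can be sketched on a toy frame. The values below are made up for illustration. With `n_neighbors=1`, each missing width is copied from the row whose remaining features are closest.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (assumption): a width of 0 stands for a missing measurement
toy_df = pd.DataFrame({
    'sc_h': [10.0, 11.0, 18.0, 17.0],
    'sc_w': [5.0, 0.0, 9.0, 0.0],
})

# Mark the implausible zeros as missing
toy_df['sc_w'] = toy_df['sc_w'].replace(0, np.nan)

# Fill each NaN from the single nearest row, measured on the observed features
imputer = KNNImputer(n_neighbors=1)
imputed_df = pd.DataFrame(imputer.fit_transform(toy_df), columns=toy_df.columns)

print(imputed_df)
```

Note that the imputer returns a NumPy array, so the result has to be wrapped back into a DataFrame and used in place of the original frame for the imputed values to take effect.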

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### Univariate Analysis (U):

#### Chart - 1: Histograms for Numeric Variables

In [None]:
# Setting the style for better visualization
sns.set(style="whitegrid")

# Creating subplots for histograms
plt.figure(figsize=(15, 10))
for i, variable in enumerate(numerical_variables, 1):
    plt.subplot(4, 4, i)
    sns.histplot(data=mob_price_df, x=variable, bins=20, kde=True)
    plt.title(f'Histogram - {variable}')

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose histograms because they are effective for visualizing the distribution of numeric data.

##### 2. What is/are the insight(s) found from the chart?

The columns 'battery_power', 'clock_speed', 'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_width', 'ram', 'sc_h', and 'talk_time' show a uniform distribution, indicating even spread of values. For 'fc', 'px_height', and 'sc_w', a right-skewed distribution is observed with values concentrated towards the lower end and some higher outliers. Median may be more robust for central tendency in the right-skewed group.
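
The visual reading of the histograms can be backed by a skewness statistic. The sketch below uses synthetic columns (a uniform one and an exponential one) as stand-ins for the two groups described above. The column names are illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): one flat column, one with a long right tail
demo = pd.DataFrame({
    'uniform_like': rng.uniform(500, 2000, size=2000),
    'right_skewed': rng.exponential(scale=4.0, size=2000),
})

# Skewness near 0 means symmetric; clearly positive means a right tail
skewness = demo.skew()
print(skewness)

# In a right-skewed column the mean is pulled above the median
print(demo['right_skewed'].mean(), demo['right_skewed'].median())
```

Running `mob_price_df[numerical_variables].skew()` would apply the same check to the project data.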

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**

* Customer Tailoring: Understanding diverse customer preferences from uniform distributions can guide tailored marketing and product strategies, potentially expanding market reach and satisfaction.

* Identifying Premium Products: Recognizing right-skewed distributions allows the identification of higher outliers, helping businesses capitalize on premium or high-performance products that attract specific customer segments.

**Negative Impact:**

* Challenges in Targeting: Uniformly distributed features might pose challenges in targeted marketing or product differentiation, as there may be a lack of distinct patterns or preferences among customers.

* Meeting Customer Expectations: Right-skewed distributions in crucial features may signal challenges in meeting the expectations of customers who prioritize higher values in those aspects, potentially affecting product success.

#### Chart - 2: Bar chart for Categorical Variables to show the distribution of each category.


In [None]:
# Create subplots for bar charts
plt.figure(figsize=(15, 10))
for i, variable in enumerate(categorical_variables, 1):
    plt.subplot(3, 3, i)
    sns.countplot(data=mob_price_df, x=variable)
    plt.title(f'Bar Chart - {variable}')

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Bar charts are chosen for categorical variables because they effectively visualize the count or frequency of categories within each variable.

##### 2. What is/are the insight(s) found from the chart?

1. 'blue', 'dual_sim', 'four_g', 'touch_screen', and 'wifi' are binary features with nearly equal counts for both values, indicating balanced features.
2. 'price_range' has four equal bars, suggesting a balanced distribution among the four price ranges.
3. 'three_g' is imbalanced, with a lower count of 0 (no 3G capability), around 500, and a higher count of 1 (3G capability), around 1500.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Potential Positive Impact:**

1. Understanding binary features helps tailor strategies for a broad audience.
2. Diverse price ranges in 'price_range' cater to varied customer budgets, potentially expanding customer base and increasing sales.

**Potential Negative Impact:**

1. Imbalanced 'three_g' distribution may indicate a gap in product features, potentially leading to lower satisfaction.
2. Challenges in targeted marketing due to nearly equal distribution in binary features may limit effective strategy implementation.

**Justification:**

1. Positive impacts include informed strategic decisions and increased market reach.
2. Negative impacts stem from potential limitations in product features and challenges in targeted marketing, affecting customer satisfaction and market share.

#### Chart - 3: The distribution of Numerical Columns (Boxplots) to check outliers:

In [None]:
# Set up the figure with a specified size
fig = plt.figure(figsize=(12, 25))

# Counter for subplot position
c = 1

# Iterate over numerical columns to create subplots
for i in numerical_variables:
    plt.subplot(10, 4, c)

    # Set plot title
    plt.title('Distribution of {}'.format(i))

    # Create a boxplot for the current numerical column
    sns.boxplot(x=i, data=mob_price_df, color="tomato")

    # Increment the counter
    c = c + 1

# Adjust layout for better spacing
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I picked boxplots to visually represent the distribution of numerical variables in a dataset. Boxplots provide a concise summary of key statistics, help identify outliers, allow easy comparison between variables, are space-efficient, and contribute to overall readability.

##### 2. What is/are the insight(s) found from the chart?


The chart reveals that, overall, there are no significant outliers in most of the numerical variables. However, there is an exception in the 'fc' column, where significant outliers are observed. This suggests that the distribution of values in the 'fc' column deviates from the norm, with some data points being notably different from the majority.
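
A box plot with default settings draws points as outliers when they lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically. The helper `iqr_outliers` and the sample values below are hypothetical.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up front-camera megapixel values with one extreme entry
fc_demo = pd.Series([0, 1, 2, 3, 3, 4, 5, 6, 7, 19], name='fc')
outliers = iqr_outliers(fc_demo)

print(outliers.tolist())
```

Calling `iqr_outliers(mob_price_df['fc'])` would list the points that the box plot marks individually.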

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**

* The identification of outliers in the 'fc' column can lead to a refinement of business strategy by uncovering valuable insights into customer behavior or operational improvements.

**Potential Negative Impact:**

* Outliers may indicate data quality issues or errors, risking biased analyses and potentially leading to suboptimal business decisions. Addressing these issues is crucial to ensure accurate and reliable insights.

#### Chart - 4: The distribution of Categorical Columns (Boxplots) to check outliers

In [None]:
# Visualizing the distribution of Categorical Columns (Boxplots) to check outliers
# Set up the figure with a specified size
fig = plt.figure(figsize=(12, 20))

# Counter for subplot position
c = 1

# Iterate over categorical columns to create subplots
for i in categorical_variables:
    plt.subplot(10, 4, c)

    # Set plot title
    plt.title('Distribution of {}'.format(i))

    # Create a boxplot for the current categorical column
    sns.boxplot(x=i, data=mob_price_df, color="green")

    # Increment the counter
    c = c + 1

# Adjust layout for better spacing
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose boxplots to visualize the distribution of categorical columns. While boxplots are traditionally used for numerical data, they can help identify outliers or variations in categorical data. The visualization allows for a quick comparison of different categorical variables, and the inclusion of titles enhances interpretability.

##### 2. What is/are the insight(s) found from the chart?

There are no outliers in any of the categorical columns.

### Bivariate Analysis (B):

#### Chart - 5: Box plots for 'price_range' against numeric variables

In [None]:
# Numeric Variables
numeric_variables = ['battery_power', 'clock_speed', 'fc', 'int_memory', 'mobile_wt', 'pc',
                     'px_height', 'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time']

# Create subplots for box plots
plt.figure(figsize=(18, 12))
for i, variable in enumerate(numeric_variables, 1):
    plt.subplot(3, 4, i)
    sns.boxplot(data=mob_price_df, x='price_range', y=variable)
    plt.title(f'Box Plot - {variable} vs Price Range')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose box plots for their ability to provide a clear comparison of the distribution of numeric variables across different price ranges. Box plots summarize central tendency, spread, and identify outliers effectively. They are robust to skewed data and offer insights into the variability between price ranges.

##### 2. What is/are the insight(s) found from the chart?

1. Most numeric variables show a similar distribution across price ranges, offering diverse options for customers.
2. 'ram' is the clear exception: its boxes shift upward with each price range, so higher RAM is associated with higher-priced devices.
3. Outliers in 'fc' vs 'price_range' imply devices with exceptional front camera specifications, potentially commanding higher prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Potential Positive Impact:**

1. Diverse Options: Similar distributions across price ranges provide diverse options, enhancing customer satisfaction.
2. RAM as a Pricing Signal: Because RAM rises with price range, it is a strong basis for positioning devices into price tiers and a likely key feature for the model.
3. Premium Devices: Outliers in 'fc' indicate a market for premium devices, potentially increasing revenue from higher-priced products.

**Potential Negative Impact:**

1. Over-reliance on RAM: Pricing on RAM alone may misposition devices whose other features do not match their tier, leading to missed revenue.
2. Focus on Premium Devices: Overemphasis on premium devices risks limiting market reach, leading to negative growth if market demand for such features is limited.

**Justification:**

* Positive impact arises from strategic opportunities, while negative impact may occur if product strategies are not aligned with actual market demand. A balanced approach is crucial for success.
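
A box-plot reading of a feature against 'price_range' can be cross-checked by comparing per-class medians. The sketch below uses made-up numbers, not values from the dataset. If the medians rise steadily from class 0 to class 3, the feature increases with price.

```python
import pandas as pd

# Made-up toy values for illustration: two phones per price range
demo = pd.DataFrame({
    'price_range': [0, 0, 1, 1, 2, 2, 3, 3],
    'ram': [400, 900, 1500, 1900, 2300, 2900, 3400, 3900],
})

# Median of the feature within each price range
median_by_class = demo.groupby('price_range')['ram'].median()
print(median_by_class)

# True when each class has a higher median than the one before it
print(median_by_class.is_monotonic_increasing)
```

On the project data, `mob_price_df.groupby('price_range')[numeric_variables].median()` would give the same summary for every numeric feature at once.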

#### Chart - 6: Count plots for categorical variables

In [None]:
# Categorical Variables
categorical_variables = ['blue', 'dual_sim', 'four_g', 'three_g', 'touch_screen', 'wifi']

# Create subplots for count plots
plt.figure(figsize=(15, 8))
for i, variable in enumerate(categorical_variables, 1):
    plt.subplot(2, 3, i)
    sns.countplot(data=mob_price_df, x=variable, hue='price_range')
    plt.title(f'Count Plot - {variable} vs Price Range')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose count plots for their effectiveness in comparing the distribution of categorical variables against 'price_range'. They provide a clear representation of category counts, support multiple comparisons, and facilitate easy interpretation of relationships between categorical variables and the target variable.

##### 2. What is/are the insight(s) found from the chart?

Most count charts show a uniform distribution of price ranges for binary categorical variables, indicating balance.
An exception is the 'three_g' count chart, which reveals an imbalance with a higher frequency of 0 (no 3G) in the lower range and a higher frequency of 1 (3G) in the higher range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Potential Positive Impact:**

1. Balanced binary features across price ranges enhance customer satisfaction.
2. Insight on 3G capability being more frequent in higher price ranges presents an opportunity to market and price 3G-capable devices as premium products.

**Potential Negative Impact:**

1. Imbalance in 3G feature distribution may lead to negative growth if customer preference for 3G capability in higher-priced devices is not addressed.

**Justification:**

* Aligning product strategies with customer preferences can lead to increased sales and revenue.
* Ignoring the observed trend in 3G capability may result in missed market opportunities and negative growth.

#### Chart - 7: Pie chart for 3G And 4G Connectivity

In [None]:
binary_features = ['four_g', 'three_g']
palette_color = sns.color_palette('rocket_r')

for feature in binary_features:
    fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(16, 8))

    # Pie chart: overall share of devices with (1) and without (0) the feature.
    # Reindexing to [1, 0] keeps the wedge order in step with the legend labels.
    counts = mob_price_df[feature].value_counts().reindex([1, 0])
    counts.plot.pie(autopct='%1.1f%%', ax=ax1, colors=palette_color, shadow=True, labeldistance=None)
    ax1.set_title(f'Overall distribution of {feature}')
    ax1.set_ylabel('')
    ax1.legend(['Support', 'Does not Support'])

    # Count plot: the same feature split by price range
    sns.countplot(x=feature, hue='price_range', data=mob_price_df, ax=ax2, color='red')
    ax2.set_title(f'{feature} by price range')
    ax2.set_xlabel(feature)
    ax2.legend(['Low Cost', 'Medium Cost', 'High Cost', 'Very High Cost'])
    ax2.set_xticks([0, 1])
    ax2.set_xticklabels(['Does not Support', 'Support'])
    plt.show()

##### 1. Why did you pick the specific chart?

The pie chart is chosen to display the overall distribution of binary categories ('Support' and 'Does not Support'). It provides a quick visual overview. The count plot is selected for comparing binary categories against different price ranges, offering insights into their distribution. The 'rocket_r' color palette is used for clear distinction, and legends/labels enhance interpretability.

##### 2. What is/are the insight(s) found from the chart?

**4G Support:**

* The pie chart for 'four_g' indicates a relatively balanced distribution, with 52.1% supporting 4G and 47.9% not supporting. This suggests a significant portion of devices in the dataset support 4G, but there is also a notable presence of devices that do not.

**3G Support:**

* The pie chart for 'three_g' shows a more pronounced distribution, with 76.1% supporting 3G and 23.9% not supporting. The majority of devices in the dataset support 3G, indicating its widespread availability among the sampled mobile devices.
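
The percentages shown on a pie chart can also be read directly from the data with `value_counts(normalize=True)`. The series below is a made-up example, not the project's 'three_g' column.

```python
import pandas as pd

# Made-up binary feature: 1 = supports 3G, 0 = does not
three_g_demo = pd.Series([1, 1, 1, 0, 1, 1, 0, 1], name='three_g')

# Share of each value as a percentage, rounded to one decimal place
share = three_g_demo.value_counts(normalize=True).mul(100).round(1)
print(share)
```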

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**

1. Balanced 4G Distribution: Flexibility to cater to both 4G and non-4G markets, addressing diverse customer preferences.
2. Dominant 3G Support: Opportunity to optimize services for widespread 3G, potentially leading to increased sales.

**Negative Growth Consideration:**

1. Limited 4G Adoption: Potential negative growth if there is unmet demand for 4G in regions where it is essential.
2. Shift Towards 5G: Risk of negative growth if the dataset does not consider 5G, and the market is rapidly transitioning to this advanced standard.

**Justification:**

* Insights allow tailoring offerings to diverse connectivity preferences, but potential negative growth may arise from unmet demand for advanced standards or industry shifts toward technologies like 5G. Businesses should balance current market needs with future trends.







### Multivariate Analysis:

#### Chart - 8: 3D plots to visualize interactions among three variables (ram, battery_power, px_height)

In [None]:
from mpl_toolkits.mplot3d import Axes3D

# Select three numerical variables for the 3D plot
selected_3d_variables = ['ram', 'battery_power', 'px_height']

# Create a 3D scatter plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(mob_price_df[selected_3d_variables[0]],
           mob_price_df[selected_3d_variables[1]],
           mob_price_df[selected_3d_variables[2]],
           c='blue', marker='o')

# Set axis labels
ax.set_xlabel(selected_3d_variables[0])
ax.set_ylabel(selected_3d_variables[1])
ax.set_zlabel(selected_3d_variables[2])

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

The 3D scatter plot was chosen to visually explore interactions among three specific variables ('ram', 'battery_power', and 'px_height') in the mob_price_df dataset. This type of chart allows for the simultaneous examination of multivariate relationships, providing insights into patterns and dependencies among the selected variables in a three-dimensional space.

##### 2. What is/are the insight(s) found from the chart?

The even distribution of dots toward the bottom of the 3D scatter plot suggests that most devices have similar, relatively low screen pixel heights. Within this range, there is limited variation in RAM and battery power, implying a potential lack of strong linear correlation. The pattern indicates homogeneity within a subgroup of devices and facilitates the identification of potential outliers with unique feature combinations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**

* Consistent Features: Insights suggest a market segment valuing consistency; targeted strategies can enhance satisfaction.
* Homogeneous Subgroup: Identifying a niche market with specific preferences allows tailored products/services for increased market share.
* Outlier Detection: Recognizing unique products helps adapt strategies to capitalize on emerging trends.

**Negative Growth Considerations:**

* Limited Variation: Limited diversity in RAM and battery power may require product portfolio diversification.
* Potential Lack of Correlation: Difficulty in predicting preferences based on identified features may challenge optimization strategies.
* Overlooking Heterogeneous Segments: Focusing on homogeneous subgroups may lead to missed opportunities in diverse market segments.

In summary, adapting to positive insights while addressing potential challenges is crucial for sustained business growth.

#### Chart - 9: 3D plots to visualize interactions among three variables (ram, battery_power, px_width)

In [None]:
from mpl_toolkits.mplot3d import Axes3D

# Select three numerical variables for the 3D plot
selected_3d_variables = ['ram', 'battery_power', 'px_width']

# Create a 3D scatter plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(mob_price_df[selected_3d_variables[0]],
           mob_price_df[selected_3d_variables[1]],
           mob_price_df[selected_3d_variables[2]],
           c='green', marker='o')

# Set axis labels
ax.set_xlabel(selected_3d_variables[0])
ax.set_ylabel(selected_3d_variables[1])
ax.set_zlabel(selected_3d_variables[2])

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?


The 3D scatter plot was chosen to visualize interactions among the selected numerical variables ('ram', 'battery_power', and 'px_width') in the mob_price_df dataset. This chart enables a simultaneous examination of multivariate relationships and facilitates a comparative analysis of RAM, battery power, and screen pixel width in a three-dimensional space.

##### 2. What is/are the insight(s) found from the chart?

The even distribution of dots across the entire 3D scatter plot suggests:

1. **Uniform Variation:** Devices exhibit a diverse range of values for RAM, battery power, and screen pixel width.
2. **Lack of Clear Patterns:** No distinct clusters or patterns indicate a lack of clear segmentation based on these variables.
3. **Potential Independence:** Changes in one variable may not be strongly correlated with changes in the other two.
4. **Diverse Device Characteristics:** The chart reflects a diverse set of devices with various combinations of features.
5. **No Dominant Segment:** Absence of a concentrated region implies no dominant or prevalent segment of devices based on these three features.
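
The observations above rest on visual inspection alone. As a rough numerical cross-check, the sketch below computes the pairwise Pearson correlations among the three plotted columns. The helper name `pairwise_correlations` and the synthetic `demo_df` are illustrative assumptions so that the snippet runs on its own; in the notebook the function would be called on `mob_price_df` instead.

```python
import numpy as np
import pandas as pd


def pairwise_correlations(df, columns):
    """Return the Pearson correlation matrix restricted to `columns`."""
    return df[columns].corr(method="pearson")


# Synthetic stand-in for mob_price_df (assumption: independent uniform columns),
# used only so the snippet is self-contained.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "ram": rng.integers(256, 4000, size=500),
    "battery_power": rng.integers(500, 2000, size=500),
    "px_width": rng.integers(500, 2000, size=500),
})

corr = pairwise_correlations(demo_df, ["ram", "battery_power", "px_width"])
print(corr.round(2))
```

Off-diagonal values close to zero would be consistent with the "potential independence" reading, while values far from zero would be a reason to revisit it.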

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


**Positive Impact:**

* **Diverse Market Understanding:** Helps tailor products to a broad range of consumer preferences.
* **Flexibility in Marketing:** Allows versatile marketing approaches adaptable to a diverse audience.

**Negative Growth Considerations:**

* **Difficulty in Targeting Segments:** Challenges in targeting specific consumer segments may impact market penetration.
* **Product Cannibalization Risk:** Even distribution increases the risk of internal competition, potentially leading to product cannibalization.
* **Limited Competitive Advantage:** Lack of clear patterns may make it challenging to establish a distinct competitive advantage based on specific device features.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Selecting all numeric variables
numeric_variables = mob_price_df.select_dtypes(include='number')

# Calculating the correlation matrix
correlation_matrix = numeric_variables.corr()

# Creating a heatmap for the correlation matrix
plt.figure(figsize=(18, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix for Numeric Variables')
plt.show()

##### 1. Why did you pick the specific chart?


A correlation matrix with a heatmap was chosen because it provides a concise and visually interpretable summary of relationships between multiple numeric variables. The heatmap's color representation facilitates the identification of strong positive or negative correlations, aiding in the understanding of variable interactions and potential multicollinearity in predictive modeling.

##### 2. What is/are the insight(s) found from the chart?

**1. Strong Positive Correlation:**

* 'ram' has a strong positive correlation with 'price_range,' suggesting that higher RAM is associated with higher price ranges.

**2. Moderate Positive Correlations:**

* 'px_height' and 'px_width' are moderately positively correlated.
* 'sc_h' and 'sc_w' (screen height and width) show a moderate positive correlation.

**3. Other Correlations:**

* Weak or negligible correlations are observed for 'battery_power' and 'clock_speed.'
* Positive correlation between 'four_g' and 'three_g.'

**4. Inverse Correlations:**

* Negative correlation between 'three_g' and 'talk_time,' indicating devices with 3G may have slightly lower talk times.

**5. Multicollinearity Warning:**

* 'pc' and 'fc' exhibit a strong positive correlation, suggesting potential multicollinearity, which can impact predictive model stability.
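
With many features, strongly correlated pairs can be hard to pick out of an annotated heatmap by eye. The sketch below shows one way to list them directly from a correlation matrix. The helper name `strong_pairs`, the 0.5 threshold, and the synthetic `demo_df` are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def strong_pairs(corr, threshold=0.5):
    """List feature pairs whose absolute correlation is at least `threshold`."""
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # The comparison also discards any NaN entries left by the mask.
    pairs = pairs[pairs.abs() >= threshold]
    # Sort by absolute strength, strongest first.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Synthetic stand-in data (assumption): 'b' is built from 'a',
# while 'c' is independent noise.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
demo_df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.3, size=300),
    "c": rng.normal(size=300),
})

demo_pairs = strong_pairs(demo_df.corr(), threshold=0.5)
print(demo_pairs)
```

In the notebook, calling `strong_pairs(correlation_matrix)` would give a short list of candidate pairs to review for multicollinearity before modeling.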

#### Chart - 15 - Pair Plot

In [None]:
# Creating a scatter plot matrix
sns.pairplot(mob_price_df[numerical_variables])

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

The scatter plot matrix is chosen for its ability to visually explore multivariate relationships between numerical variables in the mob_price_df dataset. It allows quick identification of patterns, correlations, and outliers, making it a comprehensive and efficient visualization for understanding the dataset's numeric features.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to reach a final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
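
No test is recorded in this cell. Purely as an illustrative sketch, and assuming a hypothesis suggested by the heatmap discussion (null: mean `ram` is the same in every `price_range` class; alternative: at least one class mean differs), a one-way ANOVA could be run as follows. The helper name `anova_by_class` and the synthetic `demo_df` are assumptions; the hypothesis itself is not stated by the author.

```python
import numpy as np
import pandas as pd
from scipy import stats


def anova_by_class(df, feature, target):
    """One-way ANOVA of `feature` across the groups defined by `target`."""
    groups = [group[feature].to_numpy() for _, group in df.groupby(target)]
    return stats.f_oneway(*groups)


# Synthetic stand-in data (assumption): mean 'ram' rises with 'price_range'.
rng = np.random.default_rng(2)
price_range = np.repeat([0, 1, 2, 3], 100)
demo_df = pd.DataFrame({
    "price_range": price_range,
    "ram": 800 + 700 * price_range + rng.normal(scale=300, size=400),
})

f_stat, p_value = anova_by_class(demo_df, "ram", "price_range")
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

A p-value below the chosen significance level (commonly 0.05) would lead to rejecting the null hypothesis of equal class means. ANOVA assumes roughly normal groups with similar variances, so a non-parametric alternative may be preferable if those assumptions do not hold.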

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(Mandatory for textual datasets, e.g., NLP, sentiment analysis, text clustering.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### Which feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used? Explain why.

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.
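
One way to approach this question is to look at the share of rows in each class of the target. The sketch below assumes the target column is `price_range`; the helper name `class_shares` and the synthetic `demo_labels` are illustrative, and in the notebook the function would be applied to `mob_price_df['price_range']`.

```python
import pandas as pd


def class_shares(labels):
    """Return the share of rows in each class, sorted by class label."""
    return labels.value_counts(normalize=True).sort_index()


# Synthetic stand-in labels (assumption): four equally sized classes.
demo_labels = pd.Series([0, 1, 2, 3] * 50, name="price_range")

shares = class_shares(demo_labels)
print(shares)
```

Shares that are close to one another indicate a balanced target; a class with a much smaller share than the others would suggest resampling or class weighting.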

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If it needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian optimization)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian optimization)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian optimization)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best-performing ML model in a pickle or joblib file for the deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
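
The save and load steps are left empty in this notebook. As a minimal sketch of the round trip, the example below pickles a fitted scikit-learn model, reloads it, and checks that both copies predict the same labels on unseen rows. The `LogisticRegression` model, the synthetic data, and the file name `best_model.pkl` are stand-ins chosen for illustration; they are not the author's final model or data.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): two features and a binary target.
rng = np.random.default_rng(3)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# Write the model to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_model.pkl")  # placeholder file name
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)

# Sanity check: the reloaded model should reproduce the original predictions.
X_unseen = rng.normal(size=(10, 2))
predictions_match = np.array_equal(
    model.predict(X_unseen), restored_model.predict(X_unseen)
)
print("Predictions match after reload:", predictions_match)
```

Pickle files should only be loaded from trusted sources, and they are generally safest to reload with the same library versions that were used to save them.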

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***