<a href="https://colab.research.google.com/github/Mervin151111/Mervin19/blob/main/Mobile_price_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Mobile Price Range Prediction


##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1**    - Mervin Pereira

# **Project Summary -**

The project is about exploring and analyzing a mobile phone dataset to gain insights and understanding about the factors that contribute to the pricing of mobile phones. The dataset contains 2000 records with 21 columns, including various features such as battery power, clock speed, RAM, etc.

The goal of the project is to understand the relationship between the features and the target variable, price range, and to build a model that can predict the price range of a mobile phone based on its features. To achieve this, we started by exploring the dataset and performing data preprocessing tasks such as checking for missing values, outliers, and data types.

Next, we visualized the data with charts such as histograms and box plots to understand the distribution of each feature. We also used correlation matrices to examine the relationships between the features and the target variable.

The target variable, price range, is divided into four categories: low cost, medium cost, high cost, and very high cost. The aim of the project is not to predict the actual price of a mobile phone, but to determine which features (battery power, RAM, internal memory, camera megapixels, screen size, and others) are most important in determining its price range.

The first step in the analysis is to perform exploratory data analysis to get a better understanding of the dataset. The data is checked for missing values, outliers, and data type errors. Descriptive statistics and data visualization techniques are used to understand the distribution and relationships between variables. From the analysis, it was found that there are no missing values in the dataset, but there are some outliers present in the data. The outliers were removed using the IQR method, and the data was standardized using the StandardScaler technique.
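
The preprocessing described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's exact code: the 1.5 x IQR fences, the helper name `remove_iqr_outliers`, and the toy values are all assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def remove_iqr_outliers(frame, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]


# Toy data (hypothetical values): one extreme front-camera reading.
toy = pd.DataFrame({
    "fc": [1, 2, 3, 4, 5, 6, 40],
    "ram": [500, 900, 1500, 2100, 2600, 3200, 3900],
})
clean = remove_iqr_outliers(toy, ["fc"])

# Standardize each remaining column to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(clean),
    columns=clean.columns,
    index=clean.index,
)
print(len(toy), "->", len(clean), "rows after IQR filtering")
```

In a real pipeline the scaler would normally be fitted on the training split only and then applied to the test split, so that test data does not leak into the scaling statistics.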

After data preprocessing, feature engineering is performed to identify the most relevant features for the model. Correlation analysis is used to identify the features that have a significant relationship with the target variable. The features with low correlation coefficients are eliminated to improve the performance of the model.
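
One way to implement this kind of correlation-based filtering is sketched below. The data is synthetic, the target is deliberately built from `ram` alone, and the 0.2 cut-off is an assumed value, not one taken from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the dataset (hypothetical values, real column names).
ram = rng.integers(256, 3999, size=n)
toy = pd.DataFrame({
    "ram": ram,
    "battery_power": rng.integers(501, 1999, size=n),
    "talk_time": rng.integers(2, 21, size=n),
    # Target built from RAM quartiles so that only `ram` is strongly related to it.
    "price_range": pd.qcut(ram, 4, labels=False),
})

# Absolute Pearson correlation of every feature with the target.
target_corr = toy.corr()["price_range"].drop("price_range").abs()

threshold = 0.2  # assumed cut-off for illustration
selected = target_corr[target_corr >= threshold].index.tolist()
print(target_corr.sort_values(ascending=False))
print("Selected features:", selected)
```

Pearson correlation only captures linear relationships, so a feature dropped this way may still carry useful non-linear signal for tree-based models.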

The final step in the analysis is to build a model using different machine learning techniques. The dataset is split into training and testing sets, and different classification algorithms such as logistic regression, decision trees, random forests, and support vector machines are applied to the data. The performance of the models is evaluated based on different metrics such as accuracy, precision, recall, and F1 score. The best performing model is selected based on these metrics.
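
The modelling workflow described above might look roughly like the sketch below. It uses synthetic four-class data from `make_classification` in place of the phone features, and the two models and their settings are illustrative assumptions rather than the notebook's final configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic four-class data standing in for the phone features (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=42
)

# Stratified split keeps the class proportions equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "macro_f1": f1_score(y_test, pred, average="macro"),
    }
    print(name, scores[name])
```

Macro-averaged F1 weights the four price classes equally, which is a reasonable default when the classes are balanced.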

In conclusion, this project aims to analyze sales data of mobile phones and determine the factors that drive their selling prices. The project involves different steps such as exploratory data analysis, data preprocessing, feature engineering, and model building. The final goal is to develop a model that can accurately categorize mobile phones based on their price range. The insights from this analysis can be used by mobile phone manufacturers to optimize their product features and pricing strategies.

# **GitHub Link -**

https://github.com/Mervin151111/Mervin19/

# **Problem Statement**


The problem statement of this project is to analyze the sales data of mobile phones and identify the key factors that drive their prices. The objective is to find out the relationship between the features of a mobile phone (such as RAM, internal memory, camera, battery power, etc.) and its selling price. The goal is not to predict the actual price but to classify the price into four categories indicating how high the price is.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]

6. You may add more ML algorithms for model creation. Make sure that each algorithm answers the following format.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd 
import numpy as np  
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from scipy import stats
from sklearn.linear_model import LogisticRegression

### Dataset Loading

In [None]:
# Load Dataset

df = pd.read_csv("/content/data_mobile_price_range.csv")
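
Since the guidelines reward exception handling, the load step could be wrapped in a small helper. The sketch below is one possible shape; the function name `load_dataset` and the temporary sample file are hypothetical.

```python
import os
import tempfile

import pandas as pd


def load_dataset(path):
    """Read a CSV into a DataFrame, raising a clear error if the file is missing or empty."""
    try:
        frame = pd.read_csv(path)
    except FileNotFoundError as exc:
        raise FileNotFoundError(
            f"Dataset not found at {path!r}; upload it or fix the path."
        ) from exc
    except pd.errors.EmptyDataError as exc:
        raise ValueError(f"Dataset at {path!r} is empty.") from exc
    return frame


# Demonstration with a temporary file (hypothetical two-row sample).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    pd.DataFrame({"ram": [512, 2048], "price_range": [0, 2]}).to_csv(sample_path, index=False)
    sample = load_dataset(sample_path)

print(sample.shape)
```

In the notebook the call would become `df = load_dataset("/content/data_mobile_price_range.csv")`.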

### Dataset First View

In [None]:
# Dataset First Look

df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

df.shape

# We have 2000 rows and 21 columns

### Dataset Information

In [None]:
# Dataset Info

df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

df.duplicated().sum()

# No duplicate values in our dataset

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

df.isnull().mean()*100

# No null/missing values in our dataset

In [None]:
# Visualizing the missing values

sns.heatmap(df.isnull(), cmap="viridis")

# Visualization shows that there are no missing values

### What did you know about your dataset?

Our dataset contains 2000 rows and 21 columns.
The columns are named battery_power, blue, clock_speed, dual_sim, fc, four_g, int_memory, m_dep, mobile_wt, n_cores, pc, price_range, px_height, px_width, ram, sc_h, sc_w, talk_time, three_g, touch_screen, and wifi.
The column price_range is the target variable, and the dataset is balanced across its classes.
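
Class balance can be checked directly from the target column. The sketch below uses a hypothetical, perfectly balanced series; on the real data the same calls would be applied to `df['price_range']`.

```python
import pandas as pd

# Hypothetical target column with 500 phones in each of the four classes.
price_range = pd.Series([0, 1, 2, 3] * 500, name="price_range")

# Share of each class; a balanced four-class target has roughly 0.25 per class.
class_share = price_range.value_counts(normalize=True).sort_index()
print(class_share)

# Ratio of largest to smallest class: 1.0 means perfectly balanced.
imbalance_ratio = class_share.max() / class_share.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```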

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

df.columns

In [None]:
# Dataset Describe

df.describe()

### Variables Description 

This dataset contains information on mobile phone specifications for 2000 mobile phones. There are 21 columns, including battery power, clock speed, dual sim support, front camera, internal memory, number of cores, rear camera, screen height, screen width, RAM, and WiFi support.

The mean and standard deviation of each column summarize its center and spread, while the minimum and maximum give its range: battery power ranges from 501 to 1998 mAh, clock speed from 0.5 to 3 GHz, internal memory from 2 to 64 GB, and RAM from 256 to 3998 MB.

The minimum, 25th percentile, 50th percentile (median), 75th percentile, and maximum values provide additional information on the distribution of the data for each column.

The target variable in this dataset is 'price_range' which ranges from 0 to 3, with 0 being the lowest price range and 3 being the highest. This suggests that the dataset could be used for predicting the price range of mobile phones based on their specifications.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

# iterate over the columns of the dataframe
for col in df.columns:
    # get the unique values of the column and print them
    unique_vals = df[col].unique()
    print(f"Unique values in column {col}: {unique_vals}")

In [None]:
# Find the number of unique values in each column

# iterate over the columns of the dataframe
for col in df.columns:
    # count the distinct values in the column and print the count
    n_unique = df[col].nunique()
    print(f"Number of unique values in column {col}: {n_unique}")
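
The same counts are available without an explicit loop through `DataFrame.nunique()`. The sketch below uses a small hypothetical frame with a few of the dataset's column names, and also shows how the counts can flag binary columns.

```python
import pandas as pd

# Small hypothetical frame using a few of the dataset's column names.
toy = pd.DataFrame({
    "blue": [0, 1, 0, 1],
    "four_g": [1, 1, 0, 0],
    "n_cores": [2, 4, 8, 4],
    "ram": [512, 1024, 2048, 3072],
})

# Number of distinct values per column, in one call.
unique_counts = toy.nunique()
print(unique_counts)

# Columns with exactly two distinct values are candidates for binary flags.
binary_cols = unique_counts[unique_counts == 2].index.tolist()
print("Binary columns:", binary_cols)
```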

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# There are no missing values and duplicate values


### What all manipulations have you done and insights you found?

There are no missing or duplicate values, so no data manipulation was needed.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

plt.hist(df['ram'], bins=20)
plt.xlabel('RAM')
plt.ylabel('Frequency')
plt.title('Distribution of RAM')
plt.show()

##### 1. Why did you pick the specific chart?

The histogram is commonly used for single variable analysis because it provides a visual representation of the distribution of a single continuous variable. In this case, we used a histogram to analyze the distribution of RAM in the dataset. The x-axis represents the range of RAM values and the y-axis represents the frequency of occurrence for each value or range of values. The height of each bar represents the frequency or number of times a particular RAM value occurs in the dataset.

##### 2. What is/are the insight(s) found from the chart?

From the chart, the RAM column appears to have a roughly normal distribution.
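
A visual impression of normality can be cross-checked with skewness and kurtosis. The sketch below compares two synthetic samples over a RAM-like range (neither is the real data); on the notebook's data the equivalent calls would be `df['ram'].skew()` and `df['ram'].kurt()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Two synthetic samples over a RAM-like range (not the real data).
uniform_like = pd.Series(rng.uniform(256, 3998, size=20_000))
normal_like = pd.Series(rng.normal(2100, 600, size=20_000))

for name, sample in [("uniform", uniform_like), ("normal", normal_like)]:
    # pandas reports excess kurtosis: about 0 for a normal, about -1.2 for a uniform.
    print(f"{name}: skew={sample.skew():.2f}, excess kurtosis={sample.kurt():.2f}")
```

A strongly negative excess kurtosis with near-zero skew points to a flat, uniform-like shape rather than a bell curve.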

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This information can be useful for manufacturers and consumers alike. Manufacturers can use this information to determine what the most common RAM configurations are and adjust their production accordingly. Consumers can use this information to help guide their purchasing decisions, knowing what the most common RAM configurations are and what to expect in terms of performance.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

sns.violinplot(x=df['battery_power'])
plt.xlabel('Battery Power')
plt.title('Distribution of Battery Power')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a violin plot because it allows us to see both the distribution of the data as well as the probability density at different values.

##### 2. What is/are the insight(s) found from the chart?

The violin spans roughly 300 to 2100 mAh, but its tails extend past the data because of kernel smoothing; the observed battery power values lie between 501 and 1998 mAh. The majority of devices have a battery power between 750 and 1750 mAh. The median, represented by the white dot in the center of the plot, is located around 1250 mAh.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Insights gained from analyzing data can lead to informed business decisions that can potentially lead to positive impact. For example, if we found that a certain feature like battery power was strongly correlated with the price range of a mobile phone, a company may choose to prioritize investing in improving that feature in their products to appeal to consumers who are willing to pay more for those features.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

sns.kdeplot(df['int_memory'], fill=True)
plt.xlabel('Internal Memory (GB)')
plt.ylabel('Density')
plt.title('Distribution of Internal Memory')
plt.show()

##### 1. Why did you pick the specific chart?

The kernel density plot is used to visualize the distribution of a continuous variable. It is a smoothed version of a histogram and can help identify the shape of the distribution, the presence of multiple peaks, and any outliers. In this specific case, we used the kernel density plot to visualize the distribution of internal memory (in GB) of mobile phones in the dataset. The fill parameter was set to True to show the area under the curve, which can help with better visualization of the distribution

##### 2. What is/are the insight(s) found from the chart?

The KDE plot shows that the internal memory of the phones is spread across the observed range of 2 to 64 GB. There is a slight bump in the density around 8 GB and 16 GB, indicating that these are common memory sizes for phones. Any density drawn below 2 GB or above 64 GB is an artifact of kernel smoothing, since no phone in the dataset lies outside that range. Overall, this plot provides a smooth and continuous representation of the distribution of internal memory in the dataset.
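
A kernel density estimate assigns some density outside the observed range, which is worth keeping in mind when reading the tails of the curve. The sketch below illustrates this with SciPy's `gaussian_kde` on synthetic memory sizes; it is an illustration of the general effect, not a reproduction of the seaborn plot.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)

# Synthetic memory sizes drawn uniformly from 2 to 64 GB (not the real data).
memory = rng.integers(2, 65, size=2_000).astype(float)

kde = gaussian_kde(memory)

# The estimate is positive just outside the observed range even though no sample lies there.
density_above_max = kde(memory.max() + 2.0)[0]
density_below_min = kde(memory.min() - 2.0)[0]
print(f"observed range: {memory.min():.0f} to {memory.max():.0f}")
print(f"density at max + 2: {density_above_max:.5f}, at min - 2: {density_below_min:.5f}")
```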

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insights obtained from the chart could potentially lead to a positive business impact. For example, if the company observes that a majority of its customers are interested in smartphones with 8 to 16 GB of memory, it could consider launching new models that prioritize features enhancing the user experience related to internal memory. This could potentially increase customer satisfaction, brand loyalty, and revenue for the company.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

sns.boxplot(x=df['pc'])
plt.xlabel('Rear Camera Megapixels')
plt.title('Distribution of Rear Camera Megapixels')
plt.show()

##### 1. Why did you pick the specific chart?

We picked the box plot to visualize the distribution of rear camera megapixels because it provides information about the median, quartiles, minimum, and maximum values, as well as any outliers. This allows us to quickly understand the range and variability of the data, which can be useful for making comparisons or identifying potential issues.

##### 2. What is/are the insight(s) found from the chart?

The boxplot shows that the majority of phones have rear camera megapixels between 5 and 15.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

If the business is focused on marketing smartphones with high-quality rear cameras, the insights gained from this chart can help them understand the distribution and range of rear camera megapixels in the market. They can use this information to design and market their products accordingly to meet the needs and preferences of their target customers.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

sns.countplot(x='four_g', data=df)
plt.xlabel('4G Support')
plt.ylabel('Frequency')
plt.title('Distribution of 4G Support')
plt.show()

##### 1. Why did you pick the specific chart?

I picked this chart to visualize the distribution of the 4G support in the dataset using a countplot. A countplot is a good choice for categorical variables with few unique values. It helps to understand how many devices have 4G support and how many do not.

##### 2. What is/are the insight(s) found from the chart?

The plot shows that more devices in the dataset support 4G than do not: around 1100 devices support 4G and around 900 do not. This suggests that 4G is a widely available feature in mobile devices today.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insight that a majority of the phones in the dataset have 4G support can be useful for businesses in the mobile industry. They can focus on developing and marketing 4G-enabled devices to cater to the consumer demand for faster and more efficient mobile connectivity. Additionally, they can use this information to conduct market research and gather feedback on the performance of their 4G-enabled devices compared to those without 4G support. This can lead to improved product development and increased customer satisfaction, which can ultimately have a positive impact on the business.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

sns.kdeplot(df['talk_time'], fill=True)
plt.xlabel('Talk Time (min)')
plt.ylabel('Density')
plt.title('Distribution of Talk Time')
plt.show()

##### 1. Why did you pick the specific chart?

We picked the kernel density estimation (KDE) plot for the distribution of talk time because it provides a smooth estimate of the density of observations in the data and allows us to visualize the shape of the distribution in a continuous manner. The fill option also makes it easy to see the proportion of observations in different areas of the distribution.

##### 2. What is/are the insight(s) found from the chart?

The KDE plot shows the distribution of talk time for mobile phones in the dataset. The plot indicates that the most common talk time for phones is between 5 and 16 minutes, with a peak around 7 minutes. Additionally, the plot shows that talk times below 5 minutes and above 21 minutes are relatively rare.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing the typical talk time across the dataset lets a manufacturer benchmark its own devices against the distribution and decide whether a longer talk time is worth promoting as a selling point.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

sns.pairplot(df, vars=['battery_power', 'int_memory', 'ram', 'px_height', 'px_width'], hue='price_range')
# y > 1 lifts the title above the top row of subplots so they do not overlap
plt.suptitle('Pairplot of Key Features', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

The sns.pairplot() function is used to plot pairwise relationships between variables in a dataset. In this example, we selected a subset of variables that we believe are key features of a mobile phone - battery_power, int_memory, ram, px_height, and px_width.

By including the hue='price_range' argument, we can color-code the scatterplots based on the different price ranges of the phones. This allows us to visually inspect how these key features are related to the price range of the phones.

##### 2. What is/are the insight(s) found from the chart?

We gathered several insights:

1) There appears to be a positive correlation between the ram and price_range variables. This suggests that phones with higher RAM capacity tend to fall in more expensive price ranges.


2) There also appears to be a positive correlation between the px_width and px_height variables, which makes sense since they are both measures of screen resolution. However, neither variable shows a clear relationship with price_range.


3) The int_memory and battery_power variables appear to be relatively evenly distributed across the price_range categories, with some variation.


4) Apart from ram, none of the plotted variables shows a clear relationship with price_range; the separation between categories in the scatterplots is most visible along the ram axis.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from data analysis can potentially lead to positive business impact. For example, the insight that higher RAM and battery power are associated with higher prices could help a smartphone manufacturer to target a market segment that values high-performance devices and price them accordingly. Similarly, understanding the relationship between different features and price range can inform product development and marketing strategies to target specific customer segments.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# 0 -> Low cost, 1-> medium cost, 2-> high cost, 3-> very high cost 

# Set plot size
plt.figure(figsize=(10, 8))

# Create a dictionary for the labels
label_dict = {0:'Low Cost', 1:'Medium Cost', 2:'High Cost', 3:'Very High Cost'}

# Map the numeric classes to readable labels in a copy of the data, so that
# the legend entries are guaranteed to match the plotted colours
plot_df = df.assign(price_label=df['price_range'].map(label_dict))

# Plot the scatterplot
ax = sns.scatterplot(x='battery_power', y='ram', hue='price_label',
                     hue_order=list(label_dict.values()), data=plot_df)

# Set the x and y axis labels
plt.xlabel('Battery Power')
plt.ylabel('RAM')

# Set the title of the plot
plt.title('Battery Power vs RAM with Price Range')

# Set the legend title and place the legend outside the plot area
ax.legend(title='Price Range', bbox_to_anchor=(1.05, 1), loc='upper left')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot with a hue variable is often used to visualize the relationship between two continuous variables and how they vary with a categorical variable. In this example, the scatter plot shows the relationship between the battery power and RAM of mobile devices, and the hue variable is the price range of the devices. This type of plot can help to identify any patterns or trends in the data that might be related to the price range of the devices.
Additionally, the use of colors to represent different price ranges makes it easier to quickly identify the range to which a given device belongs. This makes the scatter plot an effective way to visualize multiple variables simultaneously and can be especially helpful for exploratory data analysis.

##### 2. What is/are the insight(s) found from the chart?

1) Devices with higher battery power tend to have higher RAM capacity, as we can see a positive correlation between the two variables.

2) The price range of the devices appears to be related to both the battery power and the RAM of the device. For example, most devices in the "Very High Cost" range have high battery power and high RAM.

3) There are clear clusters of devices in each price range based on their battery power and RAM. This could suggest that manufacturers may be using similar hardware configurations for devices in each price range.

4) The plot shows that there are a large number of devices in the "Medium Cost" and "High Cost" ranges with similar battery power and RAM capacity, making it difficult to distinguish between them based on these two variables alone.
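
A correlation read off a scatter plot is worth confirming with a number. The sketch below computes the Pearson coefficient on two synthetic columns that were generated independently of each other; on the real data the same call would be `df['battery_power'].corr(df['ram'])`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 2_000

# Synthetic columns generated independently of each other (not the real data).
toy = pd.DataFrame({
    "battery_power": rng.integers(501, 1999, size=n),
    "ram": rng.integers(256, 3999, size=n),
})

# Pearson correlation: near 0 means no linear relationship, near +/-1 a strong one.
r = toy["battery_power"].corr(toy["ram"])
print(f"Pearson r between battery_power and ram: {r:.3f}")
```

If the coefficient on the real data is close to zero, the first insight above would need to be revisited.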

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

1) The insights could be used to inform product development decisions, such as determining which hardware configurations are most popular in each price range, and using this information to guide future product designs.

2) The insights could be used to develop more targeted marketing campaigns for different price ranges, highlighting the features and specifications that are most relevant to each target audience.

3) The insights could help companies better understand their competition and identify areas where they can differentiate their products based on battery power and RAM capacity.

4) By identifying patterns in the data, the insights could help companies make more informed pricing decisions, such as determining the optimal price points for devices with different hardware configurations.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# 0 -> Low cost, 1-> medium cost, 2-> high cost, 3-> very high cost 

price_labels = ['Low cost', 'Medium cost', 'High cost', 'Very high cost']
sns.barplot(x='price_range', y='battery_power', data=df, estimator=np.mean)
plt.xlabel('Price Range')
plt.ylabel('Battery Power')
plt.title('Price Range vs Battery Power')
plt.xticks(range(4), price_labels)
plt.show()


##### 1. Why did you pick the specific chart?

A bar plot is commonly used to display the mean or median of a continuous variable for different categories. In this example, the bar plot is used to visualize the mean battery power for each of the four price ranges (low cost, medium cost, high cost, and very high cost). This type of plot can be helpful for comparing the average battery power across different price ranges and identifying any trends or patterns in the data.

##### 2. What is/are the insight(s) found from the chart?

1) The average battery power tends to increase with the price range of the devices, which suggests that battery power is an important factor in determining the price of a device.

2) The difference in average battery power between the lowest price range ("Low cost") and the highest price range ("Very high cost") is relatively large, indicating that there may be significant differences in the hardware configurations of devices in different price ranges.

3) The average battery power for devices in the "Medium cost" and "High cost" ranges is relatively similar, suggesting that there may be more overlap in the hardware configurations of devices in these two price ranges.

4) The plot can be used to inform product development decisions by identifying the average battery power levels for each price range. For example, manufacturers may decide to focus on developing devices with higher battery power to target the higher-end price ranges.
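
The numbers behind the bars can be obtained with a group-by. The sketch below uses a small hypothetical sample; the label mapping mirrors the tick labels used in the chart.

```python
import pandas as pd

# Hypothetical sample: two phones per price class.
toy = pd.DataFrame({
    "price_range": [0, 0, 1, 1, 2, 2, 3, 3],
    "battery_power": [900, 1100, 1150, 1250, 1200, 1300, 1350, 1450],
})

price_labels = {0: "Low cost", 1: "Medium cost", 2: "High cost", 3: "Very high cost"}

# Mean battery power per price class: the same statistic the bar heights represent.
mean_battery = (
    toy.groupby("price_range")["battery_power"].mean().rename(index=price_labels)
)
print(mean_battery)
```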

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

1) The insights could be used to inform product development decisions, such as determining which hardware configurations are most popular in each price range, and using this information to guide future product designs.

2) The insights could be used to develop more targeted marketing campaigns for different price ranges, highlighting the features and specifications that are most relevant to each target audience.

3) The insights could help companies better understand their competition and identify areas where they can differentiate their products based on battery power and other hardware specifications.

4) By identifying patterns in the data, the insights could help companies make more informed pricing decisions, such as determining the optimal price points for devices with different hardware configurations.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

# Define label names
labels = ['Low Cost (0)', 'Medium Cost (1)', 'High Cost (2)', 'Very High Cost (3)']

# Map each numeric class to its label so the legend always matches the bars
order = [0, 1, 2, 3]
plot_df = df.assign(price_label=df['price_range'].map(dict(zip(order, labels))))

# Create the count plot (axes-level, so only one legend is drawn)
ax = sns.countplot(x='dual_sim', hue='price_label', hue_order=labels, data=plot_df)
plt.xlabel('Dual SIM Support')
plt.ylabel('Count')
plt.title('Price Range vs Dual SIM Support')
ax.legend(title='Price Range', bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

##### 1. Why did you pick the specific chart?

A count plot displays the number of observations in each category of a variable. In this case, it is used to compare the number of devices with and without dual SIM support across the four price ranges. This type of plot can be helpful for identifying any patterns or trends in the data, and for comparing the distribution of a categorical variable across different categories.


##### 2. What is/are the insight(s) found from the chart?

1) The majority of devices across all price ranges support dual SIM cards. However, the proportion of devices with dual SIM support is slightly higher in the low and medium price ranges compared to the high and very high price ranges.

2) In the low and medium price ranges, there is a fairly even distribution of devices that support dual SIM cards and devices that do not. In the high and very high price ranges, however, there are significantly fewer devices that support dual SIM cards.

3) The trend of fewer devices supporting dual SIM cards in the higher price ranges could be explained by the fact that devices in these price ranges may be targeted towards higher-end users who prioritize other features over dual SIM support.
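
Proportions are easier to compare in a table than by reading bar heights. The sketch below builds a row-normalized cross-tabulation on a hypothetical sample of eight phones.

```python
import pandas as pd

# Hypothetical sample of eight phones.
toy = pd.DataFrame({
    "price_range": [0, 0, 1, 1, 2, 2, 3, 3],
    "dual_sim": [1, 0, 1, 1, 0, 1, 0, 0],
})

# Row-normalized table: each row sums to 1 and gives the dual-SIM share within one price class.
share = pd.crosstab(toy["price_range"], toy["dual_sim"], normalize="index")
print(share)
```

The column labelled 1 holds the share of dual SIM phones in each price class, which is the quantity the insights above compare.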

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this chart could potentially help create a positive business impact. By understanding the relationship between price range and dual SIM support, a company could make informed decisions about product development, marketing strategies, and pricing.

For example, a company could use this information to target different price ranges with different marketing messages or product features, depending on the prevalence of dual SIM support in each price range.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

sns.barplot(x='n_cores', y='price_range', data=df, ci=None)
plt.xlabel('Number of Cores')
# price_range is a class label from 0 to 3, not a dollar amount
plt.ylabel('Mean Price Range (0-3)')
plt.title('Number of Cores vs Mean Price Range')
plt.show()

##### 1. Why did you pick the specific chart?

This bar plot puts the number of cores on the x-axis and the price range on the y-axis to show the relationship between the number of cores in a mobile device and its price range. Each bar represents the mean price range for that number of cores; ci=None suppresses the error bars, so the variability of the data is not shown. The plot can be used to identify any trends or patterns in the data.

##### 2. What is/are the insight(s) found from the chart?

1) Devices with more cores tend to fall in a higher price range than those with fewer cores.

2) The spread of price ranges may be wider for devices with a single core than for devices with multiple cores, which would suggest more variability in the features and quality of single-core devices. The bar plot is drawn without error bars, however, so it cannot confirm this on its own.

3) The difference in mean price range between devices with four and eight cores is relatively small compared to the difference between devices with one and four cores. This suggests that there may be a point of diminishing returns when it comes to adding more cores to a mobile device, in terms of the impact on price.
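
Because the bars show only means, the spread within each group has to be computed separately. The sketch below does this on a hypothetical sample with three core counts.

```python
import pandas as pd

# Hypothetical sample: three phones for each of three core counts.
toy = pd.DataFrame({
    "n_cores": [1, 1, 1, 4, 4, 4, 8, 8, 8],
    "price_range": [0, 1, 3, 1, 2, 2, 1, 2, 3],
})

# Mean (the bar height) together with spread and group size, which the bars alone do not show.
summary = toy.groupby("n_cores")["price_range"].agg(["mean", "std", "count"])
print(summary)
```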

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

1) Companies involved in the mobile device industry could use the information to inform decisions about product development, such as deciding how many cores to include in a new device based on the desired price range. For example, if a company wants to create a device priced in the mid-range, they may want to consider using a processor with four cores, as this is the point at which the price range begins to increase more steeply.

2) The insights could also inform pricing strategies, such as deciding how to price devices with different numbers of cores. For example, if a company wants to create a device with eight cores, they may decide to price it at a premium compared to devices with fewer cores, but not significantly higher than a device with four cores, as the chart suggests that there may be a point of diminishing returns in terms of the impact on price.

3) By understanding the relationship between the number of cores and the price range, companies could optimize their product offerings and pricing strategies to better meet customer needs and preferences, potentially leading to increased sales and revenue.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

sns.boxplot(x='dual_sim', y='price_range', data=df)
plt.xlabel('Dual SIM Support')
plt.ylabel('Price Range')
plt.title('Dual SIM Support vs Price Range')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot was chosen to explore the relationship between dual SIM support and price range. It suits this comparison because it shows the median, quartiles, and overall spread of the price range for each group (dual SIM: yes or no) side by side.

##### 2. What is/are the insight(s) found from the chart?

1) Devices with dual SIM support tend to have a higher median price range than devices without dual SIM support.

2) The interquartile range (IQR) for devices with dual SIM support is larger than the IQR for devices without dual SIM support. This suggests that the prices of devices with dual SIM support are more spread out across the middle 50% of the data.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart could help businesses in the mobile device industry make decisions about which features to include in their products and at what price points to offer them. For example, if devices with dual SIM support tend to have higher median prices and a wider range of prices, a company might consider emphasizing this feature in their marketing and pricing strategies to appeal to customers who are willing to pay more for this feature.

#### Chart - 13

In [None]:
# Chart - 13 visualization code

sns.scatterplot(x='battery_power', y='ram', hue='price_range', size='int_memory', data=df)
plt.xlabel('Battery Power')
plt.ylabel('RAM')
plt.title('Battery Power vs RAM by Price Range and Internal Memory')
plt.show()

##### 1. Why did you pick the specific chart?

This chart was likely chosen because it allows for the visualization of multiple variables simultaneously, including battery power, RAM, price range, and internal memory. The use of hue and size aesthetics in the scatterplot allows for the inclusion of additional information without making the plot too cluttered or difficult to read.

##### 2. What is/are the insight(s) found from the chart?

The scatterplot shows the relationship between battery power, RAM, price range, and internal memory in smartphones. The size of the points corresponds to the amount of internal memory, and the color of the points corresponds to the price range.

The plot suggests that there is a positive relationship between battery power and RAM, as smartphones with higher battery power tend to have higher RAM. Additionally, it appears that higher priced smartphones tend to have both higher battery power and higher RAM.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart could potentially help create a positive business impact by informing decisions related to product development and pricing strategies. For example, if the company sees that high-priced phones tend to have both high battery power and high RAM, they may consider investing more in the development of phones with these features and pricing them accordingly. Additionally, the company may use this information to target specific customer segments with different price ranges and feature preferences.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

plt.figure(figsize=(15, 12))
corr = df.corr()
sns.heatmap(corr, cmap='coolwarm', annot=True, fmt='.2f', annot_kws={"size": 10})
plt.title('Correlation Matrix')
plt.show()

##### 1. Why did you pick the specific chart?

We picked the heatmap because it shows the pairwise correlations between all variables in the dataset at a glance. This makes it easy to identify which variables are strongly correlated with each other and with the target.

##### 2. What is/are the insight(s) found from the chart?

1) The most positively correlated features are RAM and price range (0.92).

2) Many feature pairs show negative correlations, so no single pair is listed here as the most negatively correlated.
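Reading the strongest pairs off a large annotated heatmap is error-prone, so it can help to rank them directly. The sketch below shows one way to do that on a made-up toy frame (`toy` and its values are assumptions for illustration; only `ram` and `price_range` are column names taken from this notebook).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's df; values are made up for illustration.
toy = pd.DataFrame({
    "ram":         [500, 1000, 2000, 3000, 4000],
    "px_height":   [900, 700, 800, 300, 200],
    "price_range": [0, 1, 2, 3, 3],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values()

print("Most negative pair:", pairs.index[0], round(pairs.iloc[0], 2))
print("Most positive pair:", pairs.index[-1], round(pairs.iloc[-1], 2))
```

Replacing `toy` with the real `df` would list the actual most negatively and most positively correlated pairs for this dataset.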

#### Chart - 15 - Pair Plot 

In [None]:
# Pair Plot visualization code

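# sns.pairplot creates its own figure, so a preceding plt.figure() call would
# only leave an empty figure behind; use pairplot's height argument to resize.
sns.pairplot(df, hue='price_range')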
plt.show()

##### 1. Why did you pick the specific chart?

The pair plot is a useful tool for visualizing relationships between multiple variables in a dataset. By plotting all possible pairwise combinations of variables in a dataset, we can quickly identify which variables are related to each other and potentially identify any patterns or trends. In this case, we are using the pair plot to explore the relationships between the different features in our mobile phone dataset, and how they relate to the price range of the phone.

##### 2. What is/are the insight(s) found from the chart?

1) There is a strong positive correlation between RAM and price range.

2) There appears to be a moderate positive correlation between battery power and price range.

3) There are clear clusters in the data based on price range, suggesting that different price ranges correspond to different combinations of features.

4) There appears to be a moderate negative correlation between price range and front camera mega pixels.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

df.isnull().sum()

# There are no missing values hence no Imputation technique is required

#### What all missing value imputation techniques have you used and why did you use those techniques?

None. The dataset has no missing values, so no imputation was required.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

# calculate IQR for each column
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

# detect outliers
outliers = ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)

# print the number of outliers
print('Number of outliers:', outliers.sum())

In [None]:
# remove outliers without using any technique
# df = df[~outliers]

# reset index
# df = df.reset_index(drop=True)

# print the new shape of the dataset
# print('New shape:', df.shape)

In [None]:
# Calculate the IQR for each column in the dataset
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

# Define a threshold for the IQR
threshold = 1.5

# Remove any row containing a value below Q1 - threshold * IQR or above Q3 + threshold * IQR
df = df[~((df < (Q1 - threshold * IQR)) | (df > (Q3 + threshold * IQR))).any(axis=1)]

# Reset the index of the cleaned dataset
df = df.reset_index(drop=True)

# Print the new shape of the cleaned dataset
print('New shape:', df.shape)

##### What all outlier treatment techniques have you used and why did you use those techniques?

We tried both the Z-score and the IQR methods. The Z-score method did not remove all the outliers, so we used the IQR method. It computes the interquartile range, IQR = Q3 - Q1, where Q1 and Q3 are the first and third quartiles of each column. Rows containing a value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are treated as outliers and removed from the dataset.
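One caveat when the IQR rule is applied to every column: for a binary column where more than 75% of rows share one value, Q1 equals Q3, the IQR is zero, and every row with the minority value is flagged. The sketch below restricts the rule to chosen columns to show the difference. The helper `iqr_filter`, the toy data, and the `has_feature` column are hypothetical; whether this affects the project dataset has not been checked here.

```python
import pandas as pd

def iqr_filter(frame, columns, threshold=1.5):
    """Drop rows with a value outside [Q1 - t*IQR, Q3 + t*IQR] in any of `columns`."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    outside = (frame[columns] < q1 - threshold * iqr) | (frame[columns] > q3 + threshold * iqr)
    return frame[~outside.any(axis=1)].reset_index(drop=True)

# Toy data, made up for illustration: one extreme value and one binary flag.
toy = pd.DataFrame({
    "battery_power": [1, 2, 3, 4, 5, 6, 7, 50],
    "has_feature":   [1, 1, 1, 1, 1, 1, 0, 1],
})

rows_all_columns = len(iqr_filter(toy, ["battery_power", "has_feature"]))
rows_continuous_only = len(iqr_filter(toy, ["battery_power"]))
print(rows_all_columns, rows_continuous_only)
```

Filtering on the continuous column alone drops only the extreme row, while including the binary flag also drops the single row where the flag is 0.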

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

# not needed

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing 
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.) 

Not needed: this dataset has no textual features.

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

# not needed

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

# not needed

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

# Only scaling is applied (see Data Scaling below); no other transformation is needed

### 6. Data Scaling

##### Which method have you used to scale you data and why?

We use StandardScaler to standardize the features of a dataset before applying certain machine learning algorithms. StandardScaler is a preprocessing technique that transforms the features such that they have zero mean and unit variance. This is done by subtracting the mean of each feature and dividing by the standard deviation of that feature.
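The transformation described above can be written out by hand as z = (x - mean) / std, applied per column. The following is a minimal sketch with NumPy on a made-up matrix (`X` and its values are assumptions for illustration), not the scaler call used in this notebook.

```python
import numpy as np

# Toy feature matrix (rows = phones, columns = features); values are made up.
X = np.array([[500.0, 1.0],
              [1000.0, 2.0],
              [1500.0, 3.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 per column
print(X_scaled.std(axis=0))   # approximately 1 per column
```

In practice the mean and standard deviation should be computed on the training split only and then reused on the test split, which is what calling `fit_transform` on the training data and `transform` on the test data achieves.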

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

# Nil

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting and Data scaling

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.


# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('price_range', axis=1), df['price_range'], test_size=0.2, random_state=42)

# Scaling our data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

##### What data splitting ratio have you used and why? 

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)



##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

### 10. Hyperparameter Tuning

In [None]:
# define the model
model = RandomForestClassifier(random_state=42)

# define the hyperparameters to be tuned
param_grid = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# perform a grid search of the hyperparameters
grid_search = GridSearchCV(model, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)


## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# get the best estimator model
best_model = grid_search.best_estimator_

# GridSearchCV refits the best estimator on the full training set by default
# (refit=True), so this extra fit is redundant but harmless
best_model.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred = best_model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy score:", accuracy)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The machine learning model used in this project is a Random Forest Classifier, which is an ensemble learning method that constructs multiple decision trees at training time and outputs the class that is the mode of the classes of the individual trees.
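The "mode of the classes" rule can be illustrated with a few lines of plain Python. This is a sketch of hard majority voting on hypothetical per-tree votes (`majority_vote` and `votes` are illustrative names); note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so the sketch shows the idea rather than the library's exact procedure.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class label among per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five trees for one phone (price_range labels).
votes = [2, 3, 2, 1, 2]
predicted = majority_vote(votes)
print(predicted)
```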

The model was trained on the given dataset and achieved an accuracy score of 89.73%.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

cv_score = grid_search.cv_results_['mean_test_score'][grid_search.best_index_]
print("CV score:", cv_score)

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

# define the model
model = LogisticRegression(random_state=42)

# fit the model to the training data
model.fit(X_train_scaled, y_train)

# predict the target variable on the testing data
y_pred = model.predict(X_test_scaled)

# evaluate the accuracy score of the model
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy score:", accuracy)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2: cross-validation score

# Note: grid_search was fitted on the RandomForestClassifier above, so the value
# printed here is the random forest's best mean CV score, not a score for the
# LogisticRegression model defined in the previous cell.
cv_score = grid_search.cv_results_['mean_test_score'][grid_search.best_index_]
print("CV score:", cv_score)

##### Which hyperparameter optimization technique have you used and why?

In the given code, we have used Grid Search Cross Validation to optimize the hyperparameters of the RandomForestClassifier model. Grid Search is a popular technique for hyperparameter optimization, which involves defining a grid of hyperparameters to be searched, and then exhaustively searching all possible combinations of hyperparameters using cross-validation. The purpose of hyperparameter optimization is to find the best set of hyperparameters that yield the highest performance of the model on a given dataset.
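To get a feel for what "exhaustively searching all possible combinations" costs, the grid can be enumerated by hand. The sketch below uses only the standard library and the same parameter grid as the tuning cell; `combos` and `n_folds` are illustrative names, and the count excludes the final refit on the full training set.

```python
from itertools import product

# Same grid as in the hyperparameter tuning cell.
param_grid = {
    "n_estimators": [100, 200, 300, 400],
    "max_depth": [None, 10, 20, 30, 40],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Every combination of one value per hyperparameter.
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_folds = 5  # cv=5 in the GridSearchCV call
total_fits = len(combos) * n_folds
print(len(combos), "candidates,", total_fits, "cross-validation fits")
```

Because the number of fits grows multiplicatively with each added hyperparameter value, a randomized search over the same ranges may be worth considering if the grid is widened further.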

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, accuracy improved by about two percentage points.

Evaluation Metric Score Chart

| Model           | Accuracy Score |
|-----------------|----------------|
| Default Model   | 0.85           |
| Optimized Model | 0.87           |

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

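The mean cross-validation score from the grid search, i.e. accuracy averaged over the five folds. Averaging over folds gives a more reliable estimate of accuracy than a single train/test split, which matters when the predictions are used to inform pricing decisions.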

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Overall, this project demonstrates the use of machine learning techniques to predict the price range of mobile devices, which can be useful for businesses in the mobile device industry to optimize pricing strategies and improve profitability.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***