Project Attachments - AI Developer
Devices Price Classification System
using Python and Spring Boot

Project Description:
Build a Devices Price Classification System (AI system) using Python and Spring Boot. The system consists of two small projects:
- Python project: predicts a device's price range from its characteristics, so that sellers can classify device prices.
- Spring Boot project: contains a simple entity and a few endpoints that call the Python service for a set of test cases and store the results.

Advice and Guidance (Evaluation Criteria):
- Focus on the requirements and the questions (since they are the most weighted things here).
- We are going in-depth in the details, so we highly recommend not applying any algorithm or concept if it is not suitable for your case. Include comments on each algorithm or concept applied to articulate the rationale behind your choices and decisions.

- Documentation: Provide clear documentation on how to run the application and interact with the API endpoints.
- Code Quality: The code will be evaluated for readability, maintainability, and adherence to best practices.

Python Project
DataSet: Devices specifications:
○ Train Data: attached
○ Test Data: attached

Dataset columns are as follows:

- id - ID

- battery_power - Total energy a battery can store at one time, measured in mAh

- blue - Has Bluetooth or not

- clock_speed - The speed at which the microprocessor executes instructions

- dual_sim - Has dual sim support or not

- fc - Front Camera megapixels

- four_g - Has 4G or not

- int_memory - Internal Memory in Gigabytes

- m_dep - Mobile Depth in cm

- mobile_wt - Weight of mobile phone

- n_cores - Number of cores of the processor

- pc - Primary Camera megapixels

- px_height - Pixel Resolution Height

- px_width - Pixel Resolution Width

- ram - Random Access Memory in Megabytes

- sc_h - Screen Height of mobile in cm

- sc_w - Screen Width of mobile in cm

- talk_time - Longest time that a single battery charge will last during calls

- three_g - Has 3G or not

- touch_screen - Has touch screen or not

- wifi - Has wifi or not

- price_range - This is the target variable with the value of:
- 0 (low cost)
- 1 (medium cost)
- 2 (high cost)
- 3 (very high cost)

Modeling Steps:
- Do the following operations, to build your own ML model, to predict or classify the price for any device:

Data Preparing:

- Prepare the data carefully and apply feature engineering where it helps. (Add your comments)
○ EDA. (Show 1-2 insights, add your comments)

- Select and illustrate appropriate charts for your dataset to facilitate the discovery of patterns, insights, and correlations. (Add your comments)
○ Train using an appropriate algorithm. (Add your comments)

Evaluate your model:

- Show some evaluation metrics (a confusion matrix or any other metrics). (Add your comments)

Optimize your model:

- Choose an appropriate algorithm to make your result good enough. (Add your comments)

- Endpoints:
○ RESTful API to predict the price for any device:

- Will take the specs for any device, and send it to your ML model, then return the predicted price.
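The prediction step behind such an endpoint can be sketched independently of any web framework. This is a minimal illustration, not the project's implementation: the function name `predict_price`, the response shape, and the `DummyModel` stand-in are all hypothetical, and the model is only assumed to expose a scikit-learn style `predict` method. The feature names and class labels come from the column list above.

```python
from typing import Any, Mapping

# Feature names taken from the dataset column list (id and price_range excluded).
FEATURES = [
    "battery_power", "blue", "clock_speed", "dual_sim", "fc", "four_g",
    "int_memory", "m_dep", "mobile_wt", "n_cores", "pc", "px_height",
    "px_width", "ram", "sc_h", "sc_w", "talk_time", "three_g",
    "touch_screen", "wifi",
]

# Labels for the four target classes of price_range.
PRICE_LABELS = {0: "low cost", 1: "medium cost", 2: "high cost", 3: "very high cost"}


def predict_price(specs: Mapping[str, Any], model: Any) -> dict:
    """Validate a specs payload and return the predicted price range.

    `model` is assumed to expose a scikit-learn style `predict(rows)` method.
    """
    missing = [name for name in FEATURES if name not in specs]
    if missing:
        raise ValueError(f"missing features: {missing}")
    # Order the values exactly as the feature list defines them.
    row = [float(specs[name]) for name in FEATURES]
    label = int(model.predict([row])[0])
    return {"price_range": label, "label": PRICE_LABELS[label]}


class DummyModel:
    """Hypothetical stand-in for a trained classifier (illustration only)."""

    def predict(self, rows):
        # Toy rule based on the ram value (position 13); not a real model.
        return [min(3, int(r[13] // 1000)) for r in rows]
```

A web framework handler would then only need to parse the JSON body, call `predict_price`, and map a `ValueError` to a 400 response.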

SpringBoot Project Entities:
- Device: to describe every device in our system.

EndPoints: Implement RESTful endpoints to handle the following operations
- GET /api/devices/: Retrieve a list of all devices.
- GET /api/devices/{id}: Retrieve details of a specific device by ID.
- POST /api/devices: Add a new device.
- POST /api/predict/{deviceId}
○ This will call the Python API to predict the price, and save the result in the device entity here.
○ Apply some best practices here, like transaction management.

Testing:
- Do prediction for 10 devices from the Test dataset above.
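One way to prepare those 10 test cases is to turn rows of the test data into JSON-ready payloads that can be posted to the device endpoint. The sketch below is only an illustration: the helper name `build_device_payloads` is hypothetical, and dropping the dataset's `id` column assumes that the API assigns its own identifiers.

```python
import json

import pandas as pd


def build_device_payloads(test_df: pd.DataFrame, n: int = 10) -> list[dict]:
    """Turn the first `n` rows of the test data into JSON-ready dicts."""
    sample = test_df.head(n)
    # Assumption: the API generates its own IDs, so the dataset's `id` is dropped.
    sample = sample.drop(columns=["id"], errors="ignore")
    # Round-trip through JSON so NumPy scalars become plain Python numbers
    # and missing values become None (JSON null).
    return json.loads(sample.to_json(orient="records"))
```

Each returned dict can then be sent as the request body when adding a device, followed by a call to the predict endpoint with the ID returned for that device.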

DataStorage:
- The choice of database is not critical; use any kind of database.

Please make sure the project repo is public.

Best Of Luck

In [14]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    precision_score
)

# Set pandas to display all columns
pd.set_option('display.max_columns', None)

# Set the path to the data
TRAIN_PATH = 'data/train.csv'
TEST_PATH = 'data/test.csv'

# Set random seed
RANDOM_STATE = 2024

In [15]:
# Load datasets
train_df = pd.read_csv(TRAIN_PATH)

# Display the first few rows of the training data
train_df.head()

,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1.0,0.0,7.0,0.6,188.0,2.0,2.0,20.0,756.0,2549.0,9.0,7.0,19,0,0,1,1
1,1021,1,0.5,1,0.0,1.0,53.0,0.7,136.0,3.0,6.0,905.0,1988.0,2631.0,17.0,3.0,7,1,1,0,2
2,563,1,0.5,1,2.0,1.0,41.0,0.9,145.0,5.0,6.0,1263.0,1716.0,2603.0,11.0,2.0,9,1,1,0,2
3,615,1,2.5,0,0.0,0.0,10.0,0.8,131.0,6.0,9.0,1216.0,1786.0,2769.0,16.0,8.0,11,1,0,0,2
4,1821,1,1.2,0,13.0,1.0,44.0,0.6,141.0,2.0,14.0,1208.0,1212.0,1411.0,8.0,2.0,15,1,1,0,1


### Training Data Overview
- **First few rows**: The data includes various features like battery power, whether the device has Bluetooth (`blue`), the clock speed, and many others, alongside the target variable `price_range`.

In [16]:
# Print the shape of the training data
train_df.shape

(2000, 21)

- **Shape**: As the output above shows, the training data has `2000` rows and `21` columns (20 features plus the target variable).

In [17]:
# Check the statistical description of the training data
train_df.describe()

,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
count,2000.0,2000.0,2000.0,2000.0,1995.0,1995.0,1995.0,1995.0,1996.0,1996.0,1995.0,1996.0,1998.0,1998.0,1999.0,1999.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,1238.5185,0.495,1.52225,0.5095,4.310276,0.521303,32.04812,0.502256,140.266533,4.518036,9.915789,644.651804,1251.287788,2124.262262,12.303652,5.766383,11.011,0.7615,0.503,0.507,1.5
std,439.418206,0.5001,0.816004,0.500035,4.335766,0.499671,18.146476,0.28853,35.384676,2.288946,6.058469,443.355443,432.35293,1085.273372,4.212373,4.3574,5.463955,0.426273,0.500116,0.500076,1.118314
min,501.0,0.0,0.5,0.0,0.0,0.0,2.0,0.1,80.0,1.0,0.0,0.0,500.0,256.0,5.0,0.0,2.0,0.0,0.0,0.0,0.0
25%,851.75,0.0,0.7,0.0,1.0,0.0,16.0,0.2,109.0,3.0,5.0,282.0,874.25,1206.5,9.0,2.0,6.0,1.0,0.0,0.0,0.75
50%,1226.0,0.0,1.5,1.0,3.0,1.0,32.0,0.5,141.0,4.0,10.0,564.0,1247.0,2147.5,12.0,5.0,11.0,1.0,1.0,1.0,1.5
75%,1615.25,1.0,2.2,1.0,7.0,1.0,48.0,0.8,170.0,7.0,15.0,947.25,1633.0,3065.5,16.0,9.0,16.0,1.0,1.0,1.0,2.25
max,1998.0,1.0,3.0,1.0,19.0,1.0,64.0,1.0,200.0,8.0,20.0,1960.0,1998.0,3998.0,19.0,18.0,20.0,1.0,1.0,1.0,3.0


### Statistics, Missing Values, and Basic Information
- **Descriptive Statistics**: The dataset covers a wide range of values for each feature, suggesting varied types of devices. For example, `battery_power` ranges from 501 to 1998 mAh.

In [18]:
# Check for missing values in the training data
train_df.isnull().sum()


battery_power    0
blue             0
clock_speed      0
dual_sim         0
fc               5
four_g           5
int_memory       5
m_dep            5
mobile_wt        4
n_cores          4
pc               5
px_height        4
px_width         2
ram              2
sc_h             1
sc_w             1
talk_time        0
three_g          0
touch_screen     0
wifi             0
price_range      0
dtype: int64

- **Missing Data**: Twelve of the 21 columns have missing values, but never more than 5 per column out of 2000 rows. For instance, `fc` (front camera megapixels) and `four_g` have 5 missing entries each, while `sc_h` and `sc_w` have only 1.
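With so few gaps, either dropping the affected rows or imputing them is defensible. The following is a minimal sketch of one imputation strategy, not necessarily the one used in this notebook: binary flags such as `four_g` get their most frequent value, and all other columns get their median. The helper name `impute_missing` and its `binary_cols` parameter are hypothetical.

```python
import pandas as pd


def impute_missing(df: pd.DataFrame, binary_cols: list[str]) -> pd.DataFrame:
    """Fill NaNs: most frequent value for binary flags, median for the rest."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for col in out.columns[out.isnull().any()]:
        if col in binary_cols:
            fill = out[col].mode().iloc[0]
        else:
            fill = out[col].median()
        out[col] = out[col].fillna(fill)
    return out
```

If the data is later split into training and validation sets, the fill values should be computed on the training part only, so that no information from the held-out rows leaks into training.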

In [19]:
# Check the information of the training data
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   battery_power  2000 non-null   int64  
 1   blue           2000 non-null   int64  
 2   clock_speed    2000 non-null   float64
 3   dual_sim       2000 non-null   int64  
 4   fc             1995 non-null   float64
 5   four_g         1995 non-null   float64
 6   int_memory     1995 non-null   float64
 7   m_dep          1995 non-null   float64
 8   mobile_wt      1996 non-null   float64
 9   n_cores        1996 non-null   float64
 10  pc             1995 non-null   float64
 11  px_height      1996 non-null   float64
 12  px_width       1998 non-null   float64
 13  ram            1998 non-null   float64
 14  sc_h           1999 non-null   float64
 15  sc_w           1999 non-null   float64
 16  talk_time      2000 non-null   int64  
 17  three_g        2000 non-null   int64  
 18  touch_sc

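- **Basic Information**: All the features are numerical (`int64` or `float64`), so no categorical encoding is needed before modeling. The columns that contain missing values are stored as `float64`, because pandas cannot represent `NaN` in a plain integer column.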
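Columns such as `fc` or `n_cores` show up as `float64` in the `info()` output even though they hold counts or flags, since pandas stores integer columns containing `NaN` as floats. A small check like the one below can list the float columns whose non-missing values are all whole numbers. The helper name `whole_number_float_columns` is hypothetical and the function is only an illustrative sketch.

```python
import pandas as pd


def whole_number_float_columns(df: pd.DataFrame) -> list[str]:
    """Return float columns whose non-missing values are all whole numbers."""
    result = []
    for col in df.select_dtypes(include="float").columns:
        values = df[col].dropna()
        # A remainder of 0 after dividing by 1 means the value has no fraction.
        if (values % 1 == 0).all():
            result.append(col)
    return result
```

Such columns can be cast back to an integer dtype once their missing values have been handled, while genuinely continuous features like `clock_speed` stay as floats.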