## Assignment
Prepare a model in a Jupyter Notebook using Python. Use only the training data to train the model, and check its performance on unseen data from the test dataset to make sure it does not overfit.
Ensure that the notebook reflects your thought process. It is better to show every approach you tried, not only the final one (e.g. if you tested several models, show all of them). The path to the final model should be clearly shown.
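
To make the overfitting check concrete, here is a minimal sketch that fits on training rows only and then compares the error on training and held-out rows. The synthetic data, the least-squares model, and the helper names `add_bias` and `rmse` are all assumptions for illustration; none of them come from the assignment or the laptop dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one feature, a linear target, and Gaussian noise.
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1.0, size=200)

# Hold out the last 50 rows as the "unseen" test set.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]


def add_bias(features):
    """Append a column of ones so the model can learn an intercept."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


# Fit ordinary least squares on the training rows only.
coef, *_ = np.linalg.lstsq(add_bias(X_train), y_train, rcond=None)


def rmse(features, target):
    """Root mean squared error of the fitted model on the given rows."""
    pred = add_bias(features) @ coef
    return float(np.sqrt(np.mean((pred - target) ** 2)))


train_rmse = rmse(X_train, y_train)
test_rmse = rmse(X_test, y_test)
print(f"train RMSE: {train_rmse:.3f}, test RMSE: {test_rmse:.3f}")
```

A test error far above the training error would suggest overfitting; similar values suggest the model generalizes to the held-out rows.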


In [90]:
import pandas as pd
import os
import numpy as np

In [91]:
train_df = pd.read_json("./train_dataset.json", orient="columns")
test_df = pd.read_json("./test_dataset.json", orient="columns")
val_df = pd.read_json("./val_dataset.json", orient="columns")

df = pd.concat([train_df, val_df, test_df], axis=0)
df.head(10)

Unnamed: 0,graphic card type,communications,resolution (px),CPU cores,RAM size,operating system,drive type,input devices,multimedia,RAM type,CPU clock speed (GHz),CPU model,state,drive memory size (GB),warranty,screen size,buynow_price
7233,dedicated graphics,"[bluetooth, lan 10/100/1000 mbps]",1920 x 1080,4,32 gb,[no system],ssd + hdd,"[keyboard, touchpad, illuminated keyboard, num...","[SD card reader, camera, speakers, microphone]",ddr4,2.6,intel core i7,new,1250.0,producer warranty,"17"" - 17.9""",4999.0
5845,dedicated graphics,"[wi-fi, bluetooth, lan 10/100 mbps]",1366 x 768,4,8 gb,[windows 10 home],ssd,"[keyboard, touchpad, numeric keyboard]","[SD card reader, camera, speakers, microphone]",ddr3,2.4,intel core i7,new,256.0,seller warranty,"15"" - 15.9""",2649.0
10303,,"[bluetooth, nfc (near field communication)]",1920 x 1080,2,8 gb,[windows 10 home],hdd,,[SD card reader],ddr4,1.6,intel core i7,new,1000.0,producer warranty,"15"" - 15.9""",3399.0
10423,,,,2,,,,,,,,,new,,producer warranty,,1599.0
5897,integrated graphics,"[wi-fi, bluetooth]",2560 x 1440,4,8 gb,[windows 10 home],ssd,"[keyboard, touchpad, illuminated keyboard]","[SD card reader, camera, speakers, microphone]",ddr4,1.2,other CPU,new,256.0,producer warranty,"12"" - 12.9""",4499.0
4870,integrated graphics,"[wi-fi, bluetooth, lan 10/100 mbps]",1366 x 768,2,8 gb,[windows 10 home],hdd,"[keyboard, touchpad, numeric keyboard]","[SD card reader, camera, speakers, microphone]",ddr4,2.0,intel core i3,new,1000.0,producer warranty,"15"" - 15.9""",2099.0
2498,dedicated graphics,"[wi-fi, bluetooth, lan 10/100/1000 mbps]",1920 x 1080,4,8 gb,"[windows 8.1 home 64-bit, other]",hdd,"[keyboard, touchpad, illuminated keyboard, num...","[SD card reader, camera, speakers, microphone]",ddr3,2.4,intel core i7,new,1000.0,producer warranty,"17"" - 17.9""",2699.0
6220,dedicated graphics,"[wi-fi, bluetooth, lan 10/100/1000 mbps]",1920 x 1080,4,8 gb,[no system],ssd,"[keyboard, touchpad, illuminated keyboard, num...","[SD card reader, camera, speakers, microphone]",ddr4,2.5,intel core i5,new,256.0,producer warranty,"15"" - 15.9""",3199.0
10594,integrated graphics,"[nfc (near field communication), gps]",1920 x 1080,2,8 gb,[windows 10 professional],,[touchpad],[SD card reader],ddr4,2.5,intel core i5,new,500.0,producer warranty,"15"" - 15.9""",2749.0
11640,integrated graphics,"[wi-fi 802.11 b/g/n/ac, bluetooth, lan 10/100/...",1920 x 1080,2,8 gb,[windows 10 professional],ssd,"[keyboard, touchpad, numeric keyboard]","[SD card reader, camera, speakers, microphone]",ddr4,2.5,intel core i5,new,256.0,producer warranty,"15"" - 15.9""",3199.0
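
Once the three frames are concatenated, nothing records which split a row came from, yet the assignment requires training on the training data only. One possible way to keep that information is to tag each row before concatenating. This is a sketch on toy frames: the `split` column name and the sample values are assumptions, and the extra column would change the shapes reported later in the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the three frames loaded from JSON.
train_df = pd.DataFrame({"state": ["new", "new"], "buynow_price": [1000.0, 2000.0]}, index=[1, 2])
val_df = pd.DataFrame({"state": ["new"], "buynow_price": [1500.0]}, index=[3])
test_df = pd.DataFrame({"state": ["new"], "buynow_price": [2500.0]}, index=[4])

# Tag each row with its source split before concatenating.
df = pd.concat(
    [
        train_df.assign(split="train"),
        val_df.assign(split="val"),
        test_df.assign(split="test"),
    ],
    axis=0,
)

# ... shared cleaning and feature engineering would go here ...

# Recover the original partitions for modelling.
train_part = df[df["split"] == "train"].drop(columns="split")
test_part = df[df["split"] == "test"].drop(columns="split")
print(len(train_part), len(test_part))
```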


In [92]:
df.columns

Index(['graphic card type', 'communications', 'resolution (px)', 'CPU cores',
       'RAM size', 'operating system', 'drive type', 'input devices',
       'multimedia', 'RAM type', 'CPU clock speed (GHz)', 'CPU model', 'state',
       'drive memory size (GB)', 'warranty', 'screen size', 'buynow_price'],
      dtype='object')

In [93]:
df.dtypes

graphic card type          object
communications             object
resolution (px)            object
CPU cores                  object
RAM size                   object
operating system           object
drive type                 object
input devices              object
multimedia                 object
RAM type                   object
CPU clock speed (GHz)     float64
CPU model                  object
state                      object
drive memory size (GB)    float64
warranty                   object
screen size                object
buynow_price              float64
dtype: object

In [94]:
df.shape

(7853, 17)

In [95]:
df = df.dropna()

In [96]:
df.shape

(6109, 17)
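
`dropna()` removes 1,744 of the 7,853 rows here (about 22%). Before discarding that many rows, it can help to see which columns contribute the missing values. A minimal sketch on a hypothetical toy frame; the values are illustrative and not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing cells.
toy = pd.DataFrame({
    "RAM type": ["ddr4", None, "ddr3", "ddr4"],
    "drive memory size (GB)": [256.0, np.nan, 1000.0, np.nan],
    "buynow_price": [2649.0, 1599.0, 2699.0, 3199.0],
})

# Count missing cells per column, most affected columns first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)

# Count how many rows dropna() would discard (rows with at least one NaN).
rows_dropped = len(toy) - len(toy.dropna())

print(missing_per_column)
print(f"dropna() would remove {rows_dropped} of {len(toy)} rows")
```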

In [97]:
def get_unique_values(column_name, dataframe):
    # Split each comma-separated string into its own column
    # (assumes string cells; .str.split yields NaN for non-string cells such as lists)
    column_df = pd.DataFrame(dataframe[column_name].str.split(',').tolist())
    
    unique_values = []
    for column in column_df.columns:
        # Collecting unique non-null values
        unique_values.extend(
            value for value in column_df[column].unique()
            if value not in unique_values and value is not None
        )

    return unique_values

In [98]:
compact_cols = ['communications', 'input devices', 'multimedia']

In [99]:
# Example: to_list() turns the column of lists into a list of lists,
# which pd.DataFrame expands into one column per list position
pd.DataFrame(df['communications'].to_list())

Unnamed: 0,0,1,2,3,4,5
0,bluetooth,lan 10/100/1000 mbps,,,,
1,wi-fi,bluetooth,lan 10/100 mbps,,,
2,wi-fi,bluetooth,,,,
3,wi-fi,bluetooth,lan 10/100 mbps,,,
4,wi-fi,bluetooth,lan 10/100/1000 mbps,,,
...,...,...,...,...,...,...
6104,wi-fi,bluetooth,lan 10/100/1000 mbps,,,
6105,bluetooth,lan 10/100 mbps,,,,
6106,wi-fi,bluetooth,lan 10/100/1000 mbps,,,
6107,bluetooth,lan 10/100 mbps,,,,


In [100]:
compact_dict = {col: [] for col in compact_cols}

# Apply get_unique_values to every column in compact_cols,
# collecting each distinct value once per column
for col in compact_cols:
    compact_dict[col] = get_unique_values(col,df)

In [101]:
compact_dict

{'communications': [nan], 'input devices': [nan], 'multimedia': [nan]}
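
The `[nan]` results are consistent with these columns holding Python lists rather than comma-separated strings (cell In [99] expands them with `to_list()`): `.str.split` returns NaN for non-string cells. Here is a minimal sketch of an alternative, assuming list-valued cells. `get_unique_values_from_lists` is a hypothetical helper name and the toy frame is illustrative.

```python
import pandas as pd


def get_unique_values_from_lists(column_name, dataframe):
    """Return the unique items of a list-valued column, in order of first appearance."""
    # explode() gives every list element its own row; dropna() skips missing cells.
    exploded = dataframe[column_name].explode().dropna()
    return exploded.unique().tolist()


# Hypothetical toy frame mimicking the structure of the 'communications' column.
toy = pd.DataFrame({
    "communications": [
        ["bluetooth", "lan 10/100/1000 mbps"],
        ["wi-fi", "bluetooth", "lan 10/100 mbps"],
        ["wi-fi", "bluetooth"],
    ]
})

unique_comms = get_unique_values_from_lists("communications", toy)
print(unique_comms)
# ['bluetooth', 'lan 10/100/1000 mbps', 'wi-fi', 'lan 10/100 mbps']
```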

In [102]:
df['resolution_x'] = df['resolution (px)'].apply(lambda x: x.split(' x ')[0]).astype(int)
df['resolution_y'] = df['resolution (px)'].apply(lambda x: x.split(' x ')[1]).astype(int)

df['RAM size'] = df['RAM size'].apply(lambda x: x.split(' gb')[0]).astype(int)

In [103]:
df['RAM size'].unique()

array([32,  8, 12,  4, 16,  2, 20,  6, 24])

In [104]:
df = df.drop(columns=['resolution (px)'])

In [112]:
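# Lower bound of the screen-size range: the first number in the string
df['screen-size_x'] = df['screen size'].str.extract(r'(\d+\.?\d*)', expand=False).astype(float)
# Upper bound: the number after the dash (whitespace kept outside the capture group)
df['screen-size_y'] = df['screen size'].str.extract(r'-\s*(\d+\.?\d*)', expand=False).astype(float)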


In [113]:
df['screen-size_y']

7233    17.9
5845    15.9
5897    12.9
4870    15.9
2498    17.9
        ... 
9211    15.9
2748    17.9
2072    17.9
4741    15.9
6980    15.9
Name: screen-size_y, Length: 6109, dtype: float64
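
Both bounds of the screen-size range can also be pulled out in a single pass with named capture groups. This sketch assumes every value follows the `17" - 17.9"` pattern seen in the rows above; the group names `low` and `high` are illustrative, and strings in any other format would come back as NaN.

```python
import pandas as pd

# Sample strings in the format shown in the notebook's 'screen size' column.
screen = pd.Series(['17" - 17.9"', '15" - 15.9"', '12" - 12.9"'], name="screen size")

# Named groups become column names in the DataFrame returned by str.extract.
pattern = r'(?P<low>\d+(?:\.\d+)?)"\s*-\s*(?P<high>\d+(?:\.\d+)?)"'
bounds = screen.str.extract(pattern).astype(float)

print(bounds)
```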