# Transforming the Laptop Prices Dataset
The Laptop Prices dataset can be found [here](https://www.kaggle.com/ionaskel/laptop-prices). You can download it and find all the specifications on that dedicated page. To run this notebook successfully, the dataset must be placed in the same directory as the notebook.

## Load the dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
laptops = pd.read_csv('laptops.csv', encoding = "ISO-8859-1",  index_col='Index')
laptops.head()

Index,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price_euros
1,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,1339.69
2,Apple,Macbook Air,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,898.94
3,HP,250 G6,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,575.0
4,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,2537.45
5,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,1803.6


In [3]:
# Get unique values for the company name
unique_companies = laptops['Company'].unique()
unique_companies

array(['Apple', 'HP', 'Acer', 'Asus', 'Dell', 'Lenovo', 'Chuwi', 'MSI',
       'Microsoft', 'Toshiba', 'Huawei', 'Xiaomi', 'Vero', 'Razer',
       'Mediacom', 'Samsung', 'Google', 'Fujitsu', 'LG'], dtype=object)

## Remove the entries containing Flash memory
A few laptops use Flash storage. Since they make up a small share of the dataset (75 of 1,303 rows), we drop every entry whose "Memory" field mentions Flash.

In [4]:
print('Number of rows before dropping: {}'.format(laptops.shape[0]))
laptops = laptops[~laptops.Memory.str.contains('Flash')].copy()
print('Number of rows after dropping: {}'.format(laptops.shape[0]))

Number of rows before dropping: 1303
Number of rows after dropping: 1228
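
Before dropping rows, it can help to count how many entries the filter matches. The sketch below uses a small made-up frame (the `sample` values are illustrative, not the full dataset) and the same `str.contains('Flash')` test as the cell above.

```python
import pandas as pd

# Illustrative rows only; the notebook itself works on the `laptops` DataFrame.
sample = pd.DataFrame({
    'Memory': ['128GB SSD', '128GB Flash Storage', '500GB HDD', '64GB Flash Storage'],
})

# Boolean mask: True where the storage description mentions Flash
is_flash = sample['Memory'].str.contains('Flash')

n_flash = int(is_flash.sum())    # number of rows that would be dropped
kept = sample[~is_flash].copy()  # same filter as in the notebook cell

print('Rows matching "Flash": {}'.format(n_flash))
print('Rows kept: {}'.format(kept.shape[0]))
```

Note that `str.contains` matches the substring anywhere in the entry, so a row that combines Flash storage with another drive would also be dropped.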


## Split the memory to SSD and HDD 
The "Memory" column contains a concatenated string describing the SSD and HDD storage (for example, "128GB SSD" or "500GB HDD"). We split it into two columns, one for the SSD capacity and one for the HDD capacity, both in GB.

In [5]:
def memory_SSD(x):
    """
    Function to transform the column named 'Memory'.
    It extracts the SSD part of the memory.
    
    :param str x: one entry in the column named 'Memory'
    """
    tokens = str(x).split()  # split the string
    if tokens[1] == 'SSD':
        # convert from TB to GB
        mem_size = tokens[0]
        if mem_size[-2:] == 'TB':
            return str(float(mem_size[:-2]) * 1024)
        else:
            return mem_size[:-2]
    else:
        return '0'
    
# apply the function to each entry in the "Memory" column
laptops['SSD_Memory_GB'] = laptops['Memory'].apply(memory_SSD)

In [6]:
def memory_HDD(x):
    """
    Function to transform the column named 'Memory'.
    It extracts the HDD part of the memory.
    
    :param str x: one entry in the column named 'Memory'
    """
    tokens = str(x).split()
    if len(tokens) > 2:
        if tokens[2] == '+': # if it contains SSD and HDD
            mem_size = tokens[3]  # convert from TB to GB
            if mem_size[-2:] == 'TB':
                return str(float(mem_size[:-2]) * 1024)
            else:
                return mem_size[:-2]
        else:
            return '0'
    else:  # if it contains only HDD
        if tokens[1] == 'HDD':
            mem_size = tokens[0]  # convert from TB to GB
            if mem_size[-2:] == 'TB':
                return str(float(mem_size[:-2]) * 1024)
            else:
                return mem_size[:-2]
        else:
            return '0'
        
laptops['HDD_Memory_GB'] = laptops['Memory'].apply(memory_HDD)
laptops.head()

Index,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price_euros,SSD_Memory_GB,HDD_Memory_GB
1,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,1339.69,128,0
3,HP,250 G6,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,575.0,256,0
4,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,2537.45,512,0
5,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,1803.6,256,0
6,Acer,Aspire 3,Notebook,15.6,1366x768,AMD A9-Series 9420 3GHz,4GB,500GB HDD,AMD Radeon R5,Windows 10,2.1kg,400.0,0,500
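
The two functions above can also be folded into one parser. The following is only a sketch, not the method used in this notebook: `parse_memory` is a hypothetical helper, and it assumes every entry is made of parts shaped like `<size><unit> <type>` joined by `+`. Unlike `memory_SSD`, which only looks at the first part, it checks the type of every part and returns numbers instead of strings.

```python
def parse_memory(entry):
    """Return (ssd_gb, hdd_gb) as floats for one 'Memory' entry (hypothetical helper)."""
    totals = {'SSD': 0.0, 'HDD': 0.0}
    for part in str(entry).split('+'):
        tokens = part.split()
        if len(tokens) < 2:
            continue  # skip anything that is not '<size> <type>'
        size, kind = tokens[0], tokens[1]
        if kind not in totals:
            continue  # other storage types are ignored in this sketch
        value = float(size[:-2])
        if size[-2:] == 'TB':
            value *= 1024  # convert from TB to GB
        totals[kind] += value
    return totals['SSD'], totals['HDD']


print(parse_memory('128GB SSD'))
print(parse_memory('256GB SSD + 1TB HDD'))
```

It could be applied with `laptops['Memory'].apply(parse_memory)`, which yields one tuple per row.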


In [7]:
del laptops['Memory']

## Remove some extra characters
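The columns named "Weight", "Ram" and "ScreenResolution" contain extra characters: a unit suffix ("kg", "GB") or descriptive words before the pixel resolution. We strip them and store the result in new columns.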

In [8]:
# check which two-character suffixes occur in the "Weight" column
laptops['Weight'].str[-2:].unique()
# remove 'kg' at the end of each entry in the "Weight" column
laptops['Weight_kg'] = laptops['Weight'].apply(lambda x: str(x)[:-2])
del laptops['Weight']

In [9]:
# check which two-character suffixes occur in the "Ram" column
laptops['Ram'].str[-2:].unique()
# remove 'GB' at the end of each entry in the "Ram" column
laptops['Ram_GB'] = laptops['Ram'].apply(lambda x: str(x)[:-2])
del laptops['Ram']
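
The stripped values are still strings. If numeric columns are wanted, one option is `pd.to_numeric`. The sketch below runs on a small illustrative frame (`sample` is made up for the example) and assumes every entry ends in a two-character unit.

```python
import pandas as pd

# Illustrative rows only
sample = pd.DataFrame({'Weight': ['1.37kg', '1.86kg'], 'Ram': ['8GB', '16GB']})

# Strip the two-character unit suffix with the vectorized .str accessor,
# then convert to a numeric dtype; errors='coerce' turns unparsable values into NaN.
sample['Weight_kg'] = pd.to_numeric(sample['Weight'].str[:-2], errors='coerce')
sample['Ram_GB'] = pd.to_numeric(sample['Ram'].str[:-2], errors='coerce')

print(sample.dtypes)
```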

In [10]:
# Keep only the last token (the pixel resolution) of each entry in the column named "ScreenResolution"
laptops['ScreenResolution_px'] = laptops['ScreenResolution'].apply(lambda x: str(x).split()[-1])
del laptops['ScreenResolution']
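
The resolution string could be split further into a width and a height. This is a sketch on illustrative rows, and the column names `Width_px` and `Height_px` are assumptions made for the example; they are not created by this notebook.

```python
import pandas as pd

# Illustrative rows only
sample = pd.DataFrame({
    'ScreenResolution': [
        'IPS Panel Retina Display 2560x1600',
        '1440x900',
        'Full HD 1920x1080',
    ],
})

# Last whitespace-separated token is the resolution, e.g. '2560x1600'
resolution = sample['ScreenResolution'].str.split().str[-1]

# Split on 'x' into two columns and convert each to a number
parts = resolution.str.split('x', expand=True)
sample['Width_px'] = pd.to_numeric(parts[0])
sample['Height_px'] = pd.to_numeric(parts[1])

print(sample[['Width_px', 'Height_px']])
```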

## Transform the CPU and GPU information
We split the CPU column into the model name and the clock rate. For the GPU, we keep only the first two words of the model name.

In [11]:
def cpu_type(x):
    """
    Function to transform the column named 'Cpu'.
    It extracts the Cpu model name.
    
    :param str x: one entry in the column named 'Cpu'
    """
    tokens = str(x).split()
    model_tokens = tokens[:-1]  # drop the trailing clock rate
    if model_tokens[0] == 'Intel':
        return ' '.join(model_tokens[:3])  # keep at most three words
    elif model_tokens[0] == 'AMD':
        return ' '.join(model_tokens[:2])  # keep at most two words
    else:
        return ' '.join(model_tokens)

# Note: cpu_type() is defined above but not applied here;
# the column keeps the first two words of each entry (e.g. 'Intel Core')
laptops['Cpu_Type'] = laptops['Cpu'].apply(lambda x: ' '.join(str(x).split()[:2]))

In [12]:
# extract the clock rate of the CPU
laptops['Cpu_Frequency_GHz'] = laptops['Cpu'].apply(lambda x: str(x).split()[-1][:-3])
del laptops['Cpu']
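
The slice `[:-3]` assumes that every entry ends in "GHz". A regular expression makes that assumption explicit and leaves a missing value where the pattern does not match. This is a sketch on illustrative rows, not the approach used in the cell above.

```python
import pandas as pd

# Illustrative rows only
sample = pd.DataFrame({
    'Cpu': ['Intel Core i5 2.3GHz', 'Intel Core i5 7200U 2.5GHz', 'AMD A9-Series 9420 3GHz'],
})

# Capture the number directly before a trailing 'GHz';
# rows without a match become NaN instead of a wrongly sliced string.
clock = sample['Cpu'].str.extract(r'([\d.]+)GHz$', expand=False)
sample['Cpu_Frequency_GHz'] = pd.to_numeric(clock, errors='coerce')

print(sample['Cpu_Frequency_GHz'])
```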

In [13]:
# extract the GPU model
laptops['Gpu_Type'] = laptops['Gpu'].apply(lambda x: ' '.join(str(x).split()[:2]))
del laptops['Gpu']

In [14]:
# write the cleaned dataset to a .csv file
laptops.to_csv('laptops_updated.csv')
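
To check that the file can be read back, a round trip like the one below may be useful. It is a sketch on a tiny illustrative frame written to a temporary directory, so it does not touch `laptops_updated.csv`.

```python
import os
import tempfile

import pandas as pd

# Illustrative rows only, with the same index name as the notebook's DataFrame
sample = pd.DataFrame(
    {'Company': ['Apple', 'HP'], 'Ram_GB': ['8', '8']},
    index=pd.Index([1, 3], name='Index'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'laptops_updated.csv')
    sample.to_csv(path)  # the index is written as the first column
    reloaded = pd.read_csv(path, index_col='Index')

print(reloaded.dtypes)
```

Columns that hold digits as strings, such as `Ram_GB` here, are inferred as numbers when the file is read again.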