# Analyzing the laptops dataset

Playground project using the laptops.csv file to practice with pandas. The original file is available [here](https://dsserver-prod-resources-1.s3.amazonaws.com/293/laptops.csv?versionId=MzL1FMfo0SoiLJBjWpFkxHqLhEqD__SO) as part of a course by [Dataquest.io](https://dataset.io).

In [1]:
# importing pandas and numpy
import pandas as pd
import numpy as np

In [2]:
# opening the laptops.csv file as laptops
# setting the encoding to "Latin-1" to read the file
laptops = pd.read_csv("laptops.csv", encoding= "Latin-1")

# gathering initial information about the dataset
laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
Manufacturer                1303 non-null object
Model Name                  1303 non-null object
Category                    1303 non-null object
Screen Size                 1303 non-null object
Screen                      1303 non-null object
CPU                         1303 non-null object
RAM                         1303 non-null object
 Storage                    1303 non-null object
GPU                         1303 non-null object
Operating System            1303 non-null object
Operating System Version    1133 non-null object
Weight                      1303 non-null object
Price (Euros)               1303 non-null object
dtypes: object(13)
memory usage: 132.5+ KB


The laptops.csv file has 1303 entries and 13 columns. All columns have the object dtype (strings). The column names are not consistent: they mix capital letters and non-alphabetic characters, and the "Storage" column name even starts with a space.

In order to prepare the dataset for analysis I will:
- make the column names consistent
- inspect the values to see whether some can be converted to numeric or other types than object
- inspect the dataset to find missing values
- decide what to do with the missing values based on what I find

## 1. Cleaning the dataset
### 1.1 Header row
The goal for the header row is to have consistent names. The header row will be in lower case, use "_" instead of spaces, and otherwise contain only alphanumeric characters.

In [3]:
# defining clean_col function to clean and unite the header row
def clean_col(col):
    col = col.strip() 
    col = col.replace("(", "") 
    col = col.replace(")", "")
    col = col.replace("Operating System", "os")
    col = col.replace(" ", "_")
    col = col.lower()
    return col

In [4]:
# using the clean_col function to assign new values in the header row
new_header = []

for c in laptops.columns:
   clean_header = clean_col(c)
   new_header.append(clean_header)

laptops.columns = new_header

# checking the output
print(laptops.columns)

Index(['manufacturer', 'model_name', 'category', 'screen_size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'os', 'os_version', 'weight',
       'price_euros'],
      dtype='object')
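
The loop above can also be written as a list comprehension. This is only a minimal sketch: `demo` is a stand-in frame that I made up for illustration, holding a few of the raw column names from the `info()` output.

```python
import pandas as pd

def clean_col(col):
    """Normalize one column label: trim, drop parentheses, shorten, snake_case."""
    col = col.strip()
    col = col.replace("(", "").replace(")", "")
    col = col.replace("Operating System", "os")
    col = col.replace(" ", "_")
    return col.lower()

# stand-in frame with a few of the raw labels (illustrative only)
demo = pd.DataFrame(columns=["Manufacturer", " Storage",
                             "Operating System Version", "Price (Euros)"])

# apply the cleaning function to every label in one expression
demo.columns = [clean_col(c) for c in demo.columns]
print(list(demo.columns))
# ['manufacturer', 'storage', 'os_version', 'price_euros']
```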


### 1.2 Converting object values to integers and floats
Some of the dataset columns can be converted to integers and floats, which makes numeric comparisons and statistics possible. The columns I'm going to convert from the object type are:
- screen_size
- ram
- weight
- price_euros

In [5]:
laptops.head(3)

Unnamed: 0,manufacturer,model_name,category,screen_size,screen,cpu,ram,storage,gpu,os,os_version,weight,price_euros
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,"1339,69"
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,"898,94"
2,HP,250 G6,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,"575,00"


#### 1.2.1 Screen size
To be able to compare screen sizes, I will remove the '"' character and convert the values in the column to the float data type. Let's look at the unique values in the screen size column to determine whether the conversion is possible. If no problems are detected, I will convert it.

In [6]:
laptops["screen_size"].unique()

array(['13.3"', '15.6"', '15.4"', '14.0"', '12.0"', '11.6"', '17.3"',
       '10.1"', '13.5"', '12.5"', '13.0"', '18.4"', '13.9"', '12.3"',
       '17.0"', '15.0"', '14.1"', '11.3"'], dtype=object)

In [7]:
# removing the " character
laptops["screen_size"] = laptops["screen_size"].str.replace('"',"")

# converting the object to float
laptops["screen_size"] = laptops["screen_size"].astype(float)

# renaming the header from screen_size to screen_size_inch
laptops.rename(columns={"screen_size":"screen_size_inch"}, inplace= True)

# printing the descriptive statistic for the column
laptops["screen_size_inch"].describe()

count    1303.000000
mean       15.017191
std         1.426304
min        10.100000
25%        14.000000
50%        15.600000
75%        15.600000
max        18.400000
Name: screen_size_inch, dtype: float64

#### 1.2.2 Ram
The next column to clean and convert is ram, whose values currently consist of a number and the "GB" suffix. I will remove the "GB", convert the numeric value to an integer, and record the unit by renaming the header to ram_gb.

In [8]:
laptops["ram"].unique()

array(['8GB', '16GB', '4GB', '2GB', '12GB', '6GB', '32GB', '24GB', '64GB'],
      dtype=object)

In [9]:
# removing the "GB" from the string
laptops["ram"] = laptops["ram"].str.replace("GB","")

# converting the object to integer
laptops["ram"] = laptops["ram"].astype(int)

# renaming the header from ram to ram_gb
laptops.rename(columns={"ram":"ram_gb"}, inplace=True)

# printing the descriptive statistic for the column
laptops["ram_gb"].describe()

count    1303.000000
mean        8.382195
std         5.084665
min         2.000000
25%         4.000000
50%         8.000000
75%         8.000000
max        64.000000
Name: ram_gb, dtype: float64

#### 1.2.3 Weight
Next I will remove the "kg" from the weight column. The values were probably entered manually, because one value ends in "kgs" instead of "kg". It is hard to spot in the output of the unique method, but once we run .astype(), pandas raises an error saying that it cannot convert all the values in the column. Thus I will also remove the "s" character, then convert the column to the float data type and rename it to weight_kg.

In [10]:
laptops["weight"].unique()

array(['1.37kg', '1.34kg', '1.86kg', '1.83kg', '2.1kg', '2.04kg', '1.3kg',
       '1.6kg', '2.2kg', '0.92kg', '1.22kg', '0.98kg', '2.5kg', '1.62kg',
       '1.91kg', '2.3kg', '1.35kg', '1.88kg', '1.89kg', '1.65kg',
       '2.71kg', '1.2kg', '1.44kg', '2.8kg', '2kg', '2.65kg', '2.77kg',
       '3.2kg', '0.69kg', '1.49kg', '2.4kg', '2.13kg', '2.43kg', '1.7kg',
       '1.4kg', '1.8kg', '1.9kg', '3kg', '1.252kg', '2.7kg', '2.02kg',
       '1.63kg', '1.96kg', '1.21kg', '2.45kg', '1.25kg', '1.5kg',
       '2.62kg', '1.38kg', '1.58kg', '1.85kg', '1.23kg', '1.26kg',
       '2.16kg', '2.36kg', '2.05kg', '1.32kg', '1.75kg', '0.97kg',
       '2.9kg', '2.56kg', '1.48kg', '1.74kg', '1.1kg', '1.56kg', '2.03kg',
       '1.05kg', '4.4kg', '1.90kg', '1.29kg', '2.0kg', '1.95kg', '2.06kg',
       '1.12kg', '1.42kg', '3.49kg', '3.35kg', '2.23kg', '4.42kg',
       '2.69kg', '2.37kg', '4.7kg', '3.6kg', '2.08kg', '4.3kg', '1.68kg',
       '1.41kg', '4.14kg', '2.18kg', '2.24kg', '2.67kg', '2.14kg',
       '1.

In [11]:
# removing the "kg" from the string
laptops["weight"] = laptops["weight"].str.replace("kg","")

# removing the "s" from the string
laptops["weight"] = laptops["weight"].str.replace("s","")

# converting the object to float
laptops["weight"] = laptops["weight"].astype(float)

# renaming the header from weight to weight_kg
laptops.rename(columns={"weight":"weight_kg"}, inplace=True)

# printing the descriptive statistic for the column
laptops["weight_kg"].describe()

count    1303.000000
mean        2.038734
std         0.665475
min         0.690000
25%         1.500000
50%         2.040000
75%         2.300000
max         4.700000
Name: weight_kg, dtype: float64
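
A stray suffix like the "kgs" mentioned above can be located without reading the whole `unique()` output. The sketch below is one possible way to do it, not necessarily how the value was found in this notebook. The `weights` series is toy data that I made up, and `pd.to_numeric(..., errors="coerce")` is used so that values which cannot be converted become NaN.

```python
import pandas as pd

# toy weights, including one stray "kgs" suffix (illustrative values only)
weights = pd.Series(["1.37kg", "2.1kg", "4kgs", "0.92kg"])

# strip the expected suffix, then convert; failures become NaN instead of raising
stripped = weights.str.replace("kg", "", regex=False)
as_number = pd.to_numeric(stripped, errors="coerce")

# the raw values behind the NaN entries are the malformed ones
bad_values = weights[as_number.isna()]
print(bad_values.tolist())
# ['4kgs']
```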

#### 1.2.4 Price in Euros
The price in euros has no extra characters to remove. Instead, the "," decimal separator has to be replaced with "." so the column can be converted to the float dtype.

In [12]:
# replacing the "," with "." from the string
laptops["price_euros"] = laptops["price_euros"].str.replace(",",".")

# converting the object to float
laptops["price_euros"] = laptops["price_euros"].astype(float)

# printing the descriptive statistic for the column
laptops["price_euros"].describe()

count    1303.000000
mean     1123.686992
std       699.009043
min       174.000000
25%       599.000000
50%       977.000000
75%      1487.880000
max      6099.000000
Name: price_euros, dtype: float64

### 1.3 Splitting strings
Some columns contain more than one piece of information. The values in the gpu column seem to be a manufacturer (Intel, AMD) followed by a model name/number. Let's extract the manufacturer by itself so we can find the most common ones. The same can be applied to the cpu column, and to the screen column by extracting the dimensions to get the screen resolution.

#### 1.3.1 Extracting the GPU manufacturer

In [13]:
# creating new column gpu_manufacturer
laptops["gpu_manufacturer"] = (laptops["gpu"]
                                       .str.split()
                                       .str[0])

# displaying the value counts to see the gpu manufacturer proportion
laptops["gpu_manufacturer"].value_counts()

Intel     722
Nvidia    400
AMD       180
ARM         1
Name: gpu_manufacturer, dtype: int64

#### 1.3.2 Extracting the CPU manufacturer

In [14]:
# creating new column cpu_manufacturer
laptops["cpu_manufacturer"] = (laptops["cpu"]
                                       .str.split()
                                       .str[0])

# displaying the value counts to see the cpu manufacturer proportion
laptops["cpu_manufacturer"].value_counts()

Intel      1240
AMD          62
Samsung       1
Name: cpu_manufacturer, dtype: int64

#### 1.3.3 Extracting the screen resolution

In [15]:
# creating new column screen_resolution
laptops["screen_resolution"] = (laptops["screen"]
                                       .str.split()
                                       .str[-1])

# displaying the value counts to see the screen resolution proportion
laptops["screen_resolution"].value_counts()

1920x1080    841
1366x768     308
3840x2160     43
3200x1800     27
1600x900      23
2560x1440     23
2560x1600      6
2256x1504      6
2304x1440      6
1920x1200      5
2400x1600      4
2880x1800      4
1440x900       4
2160x1440      2
2736x1824      1
Name: screen_resolution, dtype: int64
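
If the resolution is needed as numbers, the "WIDTHxHEIGHT" string can be split once more. This is an optional sketch that goes one step further than the cell above. The three `screens` values are taken from the `head(3)` output, and the column names `width_px` and `height_px` are my own choice.

```python
import pandas as pd

# three screen descriptions as they appear in the first rows of the dataset
screens = pd.Series([
    "IPS Panel Retina Display 2560x1600",
    "1440x900",
    "Full HD 1920x1080",
])

# the resolution is the last whitespace-separated token
resolution = screens.str.split().str[-1]

# split on "x" into two columns and convert both to integers
dims = resolution.str.split("x", expand=True).astype(int)
dims.columns = ["width_px", "height_px"]  # hypothetical column names
print(dims)
```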

#### 1.3.4 Processor speed
The processor speed can be extracted from the cpu column.

In [16]:
# creating new column cpu_speed_ghz
laptops["cpu_speed_ghz"] = (laptops["cpu"]
                        .str.split()
                        .str[-1]
                        )
# removing the "GHz" string
laptops["cpu_speed_ghz"] = laptops["cpu_speed_ghz"].str.replace("GHz","")

# converting the speed to float
laptops["cpu_speed_ghz"] = laptops["cpu_speed_ghz"].astype(float)

# printing descriptive statistic for the processor speed
laptops["cpu_speed_ghz"].describe()

count    1303.000000
mean        2.298772
std         0.506340
min         0.900000
25%         2.000000
50%         2.500000
75%         2.700000
max         3.600000
Name: cpu_speed_ghz, dtype: float64

#### 1.3.5 Storage
The storage column can be split and then converted to integer values.

In [17]:
# creating new column storage_gb
laptops["storage_gb"] = (laptops["storage"]
                        .str.split()
                        .str[0]
                        )
# converting the TB to GB (1TB is 1000 GB)
laptops["storage_gb"] = laptops["storage_gb"].str.replace("TB", "000")

# removing the GB
laptops["storage_gb"] = laptops["storage_gb"].str.replace("GB", "")

#converting the storage to integer
laptops["storage_gb"] = laptops["storage_gb"].astype(int)

# printing descriptive statistic for the storage size
laptops["storage_gb"].describe()

count    1303.000000
mean      441.928626
std       356.903966
min         1.000000
25%       256.000000
50%       256.000000
75%       512.000000
max      2000.000000
Name: storage_gb, dtype: float64
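
Replacing "TB" with "000" depends on the exact spelling of each value. As an alternative, the number and the unit can be captured separately with a regular expression and then multiplied. This is a sketch under my own assumptions: apart from "128GB SSD", the `storage` values below are made up, and only the first drive in each string is considered, as in the cell above.

```python
import pandas as pd

# toy storage descriptions (mostly hypothetical spellings)
storage = pd.Series(["128GB SSD", "1TB HDD", "256GB SSD +  1TB HDD", "1.0TB Hybrid"])

# capture the leading number and its unit from the first drive only
parts = storage.str.extract(r"^\s*(?P<size>\d+(?:\.\d+)?)\s*(?P<unit>GB|TB)")

# 1 TB is treated as 1000 GB, the same convention as above
factor = parts["unit"].map({"GB": 1, "TB": 1000})
storage_gb = (parts["size"].astype(float) * factor).astype(int)
print(storage_gb.tolist())
# [128, 1000, 256, 1000]
```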

### 1.4 Identifying manual entry errors
As the values might not be consistent, I will check whether there are discrepancies and correct the data where possible.

In [18]:
laptops["os"].value_counts()

Windows      1125
No OS          66
Linux          62
Chrome OS      27
macOS          13
Mac OS          8
Android         2
Name: os, dtype: int64

In [19]:
laptops["category"].value_counts()

Notebook              727
Gaming                205
Ultrabook             196
2 in 1 Convertible    121
Workstation            29
Netbook                25
Name: category, dtype: int64

#### 1.4.1 Correcting Operating System values
From the output it is visible that the operating system ("os") column has some values that can be merged. macOS appears under two spellings, "macOS" and "Mac OS", probably because of manual entry. I will create a mapping dictionary with OS names to get consistent data.

In [20]:
mapping_dictionary = {
    "Windows" : "Windows",
    "No OS" : "No OS",
    "Linux" : "Linux",
    "Chrome OS" : "Chrome OS",
    "macOS" : "macOS",
    "Mac OS" : "macOS",
    "Android" : "Android"
}

# mapping the values and assigning them back to the laptops dataset
laptops["os"] = laptops["os"].map(mapping_dictionary)

# correct value counts
laptops["os"].value_counts()


Windows      1125
No OS          66
Linux          62
Chrome OS      27
macOS          21
Android         2
Name: os, dtype: int64
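
The mapping dictionary lists every OS name because `Series.map` turns any value that is missing from the dictionary into NaN. The sketch below shows that behaviour on a toy series (`os_col` is made up for illustration) and compares it with `Series.replace`, which leaves unlisted values unchanged.

```python
import pandas as pd

# toy OS column with both spellings of macOS
os_col = pd.Series(["Windows", "macOS", "Mac OS", "Linux"])

# map() with a partial dictionary: unlisted values become NaN
mapped = os_col.map({"Mac OS": "macOS"})

# replace() with the same dictionary: unlisted values are kept as they are
replaced = os_col.replace({"Mac OS": "macOS"})

print(mapped.isna().sum())   # 3
print(replaced.tolist())     # ['Windows', 'macOS', 'macOS', 'Linux']
```

With `map`, a complete dictionary also works as a check: an unexpected NaN afterwards points at a spelling that was not anticipated.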

### 1.5 Identifying empty cells
Some columns in the dataset have empty values. I will analyze which values are missing in which columns. Based on that analysis I will decide what the next steps will be.

In [21]:
# identifying the null values
laptops.isnull().sum()


manufacturer           0
model_name             0
category               0
screen_size_inch       0
screen                 0
cpu                    0
ram_gb                 0
storage                0
gpu                    0
os                     0
os_version           170
weight_kg              0
price_euros            0
gpu_manufacturer       0
cpu_manufacturer       0
screen_resolution      0
cpu_speed_ghz          0
storage_gb             0
dtype: int64

In [22]:
laptops["os_version"].value_counts(dropna=False)

10      1072
NaN      170
7         45
10 S       8
X          8
Name: os_version, dtype: int64

There is only one column with empty cells: os_version. Altogether, 170 values for the operating system version are missing. Looking at the OS version more closely, we can see that the 170 missing values form the second most common value. Since the OS version is closely linked to the os column, some of the missing versions can be guessed from the operating system.

In [23]:
os_with_null = laptops.loc[laptops["os_version"].isnull(), "os"]
os_with_null.value_counts()

No OS        66
Linux        62
Chrome OS    27
macOS        13
Android       2
Name: os, dtype: int64

Out of the 170 missing values, 66 have no OS, 62 have Linux, 27 have Chrome OS, 13 have macOS and 2 have Android. Since this is an older dataset, we can assume that the laptops with macOS have X as their OS version (see [source](https://en.wikipedia.org/wiki/MacOS)). It is not possible to guess the version for Linux, because there are many Linux distributions and the dataset does not specify which one is used. I will add the OS version to all Mac laptops. The laptops with no OS will be marked as "Version Unknown" to get rid of the null values.

In [24]:
# adding the "X" for the macOS where the value was missing
laptops.loc[laptops["os"] == "macOS", "os_version"] = "X"

# adding the "Version Unknown" to empty cells in the os_version column
laptops.loc[laptops["os"] == "No OS", "os_version"] = "Version Unknown"

In [25]:
laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()

Linux        62
Chrome OS    27
Android       2
Name: os, dtype: int64

In the end, the laptops dataset is left with only 91 null values in the os_version column, a reduction of 79. The data is now **clean and prepared for analysis.**
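
The assignments above overwrite os_version for every macOS and "No OS" row. If existing values should never be touched, the condition can be combined with a mask of the missing cells. This is a sketch on a toy frame (`df` is made up for illustration), not the code used in this notebook.

```python
import numpy as np
import pandas as pd

# toy frame with a mix of known and missing versions
df = pd.DataFrame({
    "os": ["macOS", "macOS", "No OS", "Linux", "Windows"],
    "os_version": ["X", np.nan, np.nan, np.nan, "10"],
})

# rows where the version is missing
missing = df["os_version"].isnull()

# fill only the missing cells of the matching operating systems
df.loc[missing & (df["os"] == "macOS"), "os_version"] = "X"
df.loc[missing & (df["os"] == "No OS"), "os_version"] = "Version Unknown"

remaining = df["os_version"].isnull().sum()
print(remaining)  # 1 (the Linux row)
```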

## 2. Analyzing the dataset
Now that the dataset is cleaned, I will answer some questions about the laptops. The questions originally proposed by Dataquest.io are:
- Are laptops made by Apple more expensive than those made by other manufacturers?
- What is the best value laptop with a screen size of 15" or more?
- Which laptop has the most storage space?

### 2.1 Are laptops by Apple expensive?
#### 2.1.1 Average price of Apple and other manufacturers


In [26]:
# creating a dataframe with Apple laptop prices and a dataframe with the prices of all other laptop manufacturers
apple_laptops = laptops.loc[laptops["manufacturer"] == "Apple", ["price_euros"]]
other_laptops = laptops.loc[laptops["manufacturer"] != "Apple", ["manufacturer","price_euros"]]

# printing descriptive statistics for both dataframes
print("Apple Laptops")
print(apple_laptops.describe())
print("\n")
print("Other Laptops")
print(other_laptops.describe())
print("\n")
print("Apple median value", apple_laptops.median())
# numeric_only=True keeps the text column "manufacturer" out of the median
print("Other laptops median value", other_laptops.median(numeric_only=True))

Apple Laptops
       price_euros
count    21.000000
mean   1564.198571
std     561.623595
min     898.940000
25%    1163.000000
50%    1339.690000
75%    1958.900000
max    2858.000000


Other Laptops
       price_euros
count  1282.000000
mean   1116.471123
std     698.903305
min     174.000000
25%     598.000000
50%     959.500000
75%    1478.500000
max    6099.000000


Apple median value price_euros    1339.69
dtype: float64
Other laptops median value price_euros    959.5
dtype: float64


If we look only at the price and do not take any other parameters into consideration, then we can say that **Apple laptops are expensive.** The average price of Apple laptops is 1564€, while other manufacturers have an average price of 1116€, a difference of 448€. The median price (the middle value) is 1339.69€ for Macs and 959.50€ for other laptops. The difference between the median values is about 380€.
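
The price gaps can be checked directly from the summary figures printed above. The numbers in this small sketch are copied from that output. The variable names are my own.

```python
# figures copied from the describe() and median() output above
apple_mean, other_mean = 1564.198571, 1116.471123
apple_median, other_median = 1339.69, 959.5

# gap between the averages and between the medians, in euros
mean_gap = apple_mean - other_mean
median_gap = apple_median - other_median

print(round(mean_gap), round(median_gap))
# 448 380
```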

Looking at the maximum values in the dataset, Apple's priciest laptop costs 2858€, while the priciest laptop from the other manufacturers costs 6099€. I will do a further analysis to find out:
- which manufacturer is selling such an expensive laptop
- how Apple laptops compare to each individual laptop manufacturer

#### 2.1.2 Median price of Apple laptops compared with other manufacturers

In [27]:
# printing the column-wise maxima of the other laptops
# (max() reduces each column on its own, so the two values can come from different rows)
print("Most expensive laptop manufacturer:", "\n", other_laptops.max())
print("\n",)
# number of laptops manufacturers
print("Counts of laptops manufacturers", "\n",other_laptops["manufacturer"].value_counts())
print("\n",)

# median laptop prices by manufacturer
print("Median prices grouped by manufacturer" "\n", other_laptops.groupby("manufacturer")["price_euros"].median().sort_values(ascending=False))


Most expensive laptop manufacturer: 
 manufacturer    Xiaomi
price_euros       6099
dtype: object


Counts of laptops manufacturers 
 Dell         297
Lenovo       297
HP           274
Asus         158
Acer         103
MSI           54
Toshiba       48
Samsung        9
Razer          7
Mediacom       7
Microsoft      6
Vero           4
Xiaomi         4
Fujitsu        3
LG             3
Google         3
Chuwi          3
Huawei         2
Name: manufacturer, dtype: int64


Median prices grouped by manufacturer
 manufacturer
Razer        2899.00
LG           2099.00
Samsung      1649.00
MSI          1599.00
Microsoft    1569.50
Google       1559.00
Huawei       1424.00
Toshiba      1211.50
Xiaomi       1099.45
Asus         1012.50
Dell          985.00
HP            966.50
Lenovo        899.00
Fujitsu       739.00
Acer          559.00
Mediacom      265.00
Chuwi         248.90
Vero          206.85
Name: price_euros, dtype: float64


#### 2.1.3 Conclusion: Apple is not *that* expensive
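The highest price in the dataset is 6099€. The output above cannot tell us who makes that laptop, though: `max()` is computed for each column separately, so "Xiaomi" is simply the alphabetically last manufacturer name, and Xiaomi's median price of 1099.45€ sits well below that figure. Identifying the maker requires looking up the row that holds the maximum price.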
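
Calling `max()` on a whole DataFrame reduces every column independently, so the manufacturer and the price it reports may belong to different rows. A row lookup with `idxmax()` avoids this. The sketch below uses a toy frame with made-up manufacturer names and prices, so it says nothing about the real dataset.

```python
import pandas as pd

# toy data: the alphabetically last name is NOT the priciest one
toy = pd.DataFrame({
    "manufacturer": ["Zeta", "Alpha", "Mid"],
    "price_euros": [500.0, 3000.0, 1200.0],
})

# column-wise maxima: "Zeta" and 3000.0 come from different rows
column_max = toy.max()
print(column_max["manufacturer"], column_max["price_euros"])

# row lookup: index label of the highest price, then the full row
priciest = toy.loc[toy["price_euros"].idxmax()]
print(priciest["manufacturer"], priciest["price_euros"])
# Alpha 3000.0
```

Applied to this notebook, the same pattern would be `other_laptops.loc[other_laptops["price_euros"].idxmax()]`.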

There are 18 other laptop manufacturers in the dataset, and 7 of them have a higher median price than Apple. The manufacturers who produce the most laptops (Lenovo, Dell, HP, Asus, Acer) have a median price of at most 1012.50€. Those manufacturers may be able to keep prices lower because they focus on a broad portfolio. The most expensive brands like Razer, LG or Samsung have a lower number of laptops.

Apple is producing 21 types of laptops, so it is better to compare this manufacturer with brands such as Toshiba or Samsung. Toshiba is slightly cheaper, with a median price of 1211.50€, and Samsung is more expensive than Apple, with a median price of 1649€. Based on the data, it seems that the more laptop types a manufacturer produces, the cheaper its laptops tend to be.

After deeper analysis we can say that **Apple laptops are not out of the price range and are not the most expensive laptops on the market.**
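
To compare the number of models and the median price side by side, both can be computed in a single `groupby` call. This is a minimal sketch on toy data (manufacturers "A" and "B" and their prices are made up), assuming the same column names as in this notebook.

```python
import pandas as pd

# toy data: "A" sells three cheaper models, "B" a single expensive one
toy = pd.DataFrame({
    "manufacturer": ["A", "A", "A", "B"],
    "price_euros": [500.0, 700.0, 900.0, 2000.0],
})

# model count and median price per manufacturer, priciest first
summary = (toy.groupby("manufacturer")["price_euros"]
              .agg(["count", "median"])
              .sort_values("median", ascending=False))
print(summary)
```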

### 2.2 Cheapest laptop with 15" screen

In [28]:
# Dataframe containing only laptops 15" and higher
laptops_15 = laptops.loc[laptops["screen_size_inch"] >= 15.0]
laptops_15.loc[:,["manufacturer","model_name","os","screen_size_inch","price_euros"]].sort_values("price_euros").head()



Unnamed: 0,manufacturer,model_name,os,screen_size_inch,price_euros
290,Acer,Chromebook C910-C2ST,Chrome OS,15.6,199.0
1102,Acer,Chromebook 15,Chrome OS,15.6,209.0
555,Asus,A541NA-GO342 (N3350/4GB/500GB/Linux),Linux,15.6,224.0
30,Chuwi,"LapBook 15.6""",Windows,15.6,244.99
483,Chuwi,"Lapbook 15,6",Windows,15.6,248.9


The cheapest laptop with a screen of at least 15" is the **Chromebook C910-C2ST** by Acer, priced at 199€. It runs Chrome OS. The three cheapest laptops run operating systems that do not dominate the market. If we are looking for a laptop with Windows, the cheapest is the **LapBook by Chuwi**, costing 244.99€.
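
The Windows-only answer can also be obtained by combining both conditions in one mask. The sketch below rebuilds the five rows from the table above as a small frame (`rows` is my own name) and picks the cheapest match with `nsmallest`.

```python
import pandas as pd

# the five cheapest 15"+ laptops from the table above
rows = pd.DataFrame({
    "manufacturer": ["Acer", "Acer", "Asus", "Chuwi", "Chuwi"],
    "os": ["Chrome OS", "Chrome OS", "Linux", "Windows", "Windows"],
    "screen_size_inch": [15.6, 15.6, 15.6, 15.6, 15.6],
    "price_euros": [199.0, 209.0, 224.0, 244.99, 248.9],
})

# both conditions must hold: large screen AND Windows
mask = (rows["screen_size_inch"] >= 15.0) & (rows["os"] == "Windows")

# cheapest laptop among the matching rows
cheapest_windows = rows.loc[mask].nsmallest(1, "price_euros")
print(cheapest_windows)
```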

### 2.3 Laptop with the most storage space

In [29]:
laptops.loc[:,["manufacturer","model_name","os","storage_gb","price_euros"]].sort_values(["storage_gb", "price_euros"], ascending=False).head(16)

Unnamed: 0,manufacturer,model_name,os,storage_gb,price_euros
1063,Dell,Inspiron 5567,Windows,2000,989.99
341,Lenovo,IdeaPad 320-15ABR,Windows,2000,899.0
279,Lenovo,IdeaPad 320-17IKBR,No OS,2000,849.0
775,Asus,Q524UQ-BHI7T15 (i7-7500U/12GB/2TB/GeForce,Windows,2000,839.0
467,Dell,Inspiron 5570,Windows,2000,759.0
171,HP,17-bs001nv (i5-7200U/6GB/2TB/Radeon,Windows,2000,699.0
709,HP,17-ak002nv (A10-9620P/6GB/2TB/Radeon,Windows,2000,655.01
807,HP,15-ba043na (A12-9700P/8GB/2TB/W10),Windows,2000,629.0
1130,HP,15-bs078cl (i7-7500U/8GB/2TB/W10),Windows,2000,629.0
688,HP,17-Y002nv (A10-9600P/6GB/2TB/Radeon,Windows,2000,569.0


The largest storage space in the dataset is 2TB (2000 GB). Laptops with this storage size are listed above; in total there are 16 laptops with 2TB storage. The most expensive is the **Inspiron 5567 by Dell**, priced at 989.99€. The cheapest is the **14-am079na (N3710/8GB/2TB/W10)** by HP, costing 389€.
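
Instead of sorting and guessing how many rows to show with `head()`, the laptops with the largest storage can be selected by comparing against the column maximum. This is a sketch on toy data: the model names are placeholders, and only the two prices 989.99 and 389.0 echo figures from the text above.

```python
import pandas as pd

# toy data with two laptops sharing the largest storage
toy = pd.DataFrame({
    "model_name": ["model_a", "model_b", "model_c", "model_d"],
    "storage_gb": [2000, 256, 2000, 512],
    "price_euros": [989.99, 500.0, 389.0, 700.0],
})

# keep every row whose storage equals the maximum, cheapest first
top_storage = (toy[toy["storage_gb"] == toy["storage_gb"].max()]
               .sort_values("price_euros"))

print(len(top_storage))                      # 2
print(top_storage.iloc[0]["price_euros"])    # 389.0
```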