# Project_LAPTOPS

## Introduction


The main goal of this project is to explore and clean our laptops dataframe and perform some basic analysis.
We'll start by importing two useful libraries, numpy and pandas.

In [451]:
import numpy as np
import pandas as pd

df = pd.read_csv("laptops.csv", encoding="Latin-1")
df.head()

| | Manufacturer | Model Name | Category | Screen Size | Screen | CPU | RAM | Storage | GPU | Operating System | Operating System Version | Weight | Price (Euros) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apple | MacBook Pro | Ultrabook | 13.3" | IPS Panel Retina Display 2560x1600 | Intel Core i5 2.3GHz | 8GB | 128GB SSD | Intel Iris Plus Graphics 640 | macOS | | 1.37kg | 133969 |
| 1 | Apple | Macbook Air | Ultrabook | 13.3" | 1440x900 | Intel Core i5 1.8GHz | 8GB | 128GB Flash Storage | Intel HD Graphics 6000 | macOS | | 1.34kg | 89894 |
| 2 | HP | 250 G6 | Notebook | 15.6" | Full HD 1920x1080 | Intel Core i5 7200U 2.5GHz | 8GB | 256GB SSD | Intel HD Graphics 620 | No OS | | 1.86kg | 57500 |
| 3 | Apple | MacBook Pro | Ultrabook | 15.4" | IPS Panel Retina Display 2880x1800 | Intel Core i7 2.7GHz | 16GB | 512GB SSD | AMD Radeon Pro 455 | macOS | | 1.83kg | 253745 |
| 4 | Apple | MacBook Pro | Ultrabook | 13.3" | IPS Panel Retina Display 2560x1600 | Intel Core i5 3.1GHz | 8GB | 256GB SSD | Intel Iris Plus Graphics 650 | macOS | | 1.37kg | 180360 |


In [452]:
# We display summary information about our dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
Manufacturer                1303 non-null object
Model Name                  1303 non-null object
Category                    1303 non-null object
Screen Size                 1303 non-null object
Screen                      1303 non-null object
CPU                         1303 non-null object
RAM                         1303 non-null object
 Storage                    1303 non-null object
GPU                         1303 non-null object
Operating System            1303 non-null object
Operating System Version    1133 non-null object
Weight                      1303 non-null object
Price (Euros)               1303 non-null object
dtypes: object(13)
memory usage: 132.4+ KB



We can see that every column is represented as the object type, indicating that they are represented by strings, not numbers. Also, one of the columns, Operating System Version, has null values.
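As a minimal sketch of how the columns with null values could be listed directly, the example below uses a small toy dataframe (the `toy` frame and its values are illustrative assumptions, not the project's data) and counts the missing values per column with `isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the laptops data (illustrative values only)
toy = pd.DataFrame({
    "RAM": ["8GB", "16GB", "4GB"],
    "Operating System Version": ["10", np.nan, "X"],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()

# Keep only the columns that have at least one missing value
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)
```

Applied to `df`, the same two lines would single out the Operating System Version column reported by `df.info()`.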

The column labels mix upper and lowercase letters, spaces, and parentheses, which makes them harder to read and work with. One noticeable issue is that the "Storage" column name has a space in front of it. These quirks can be hard to spot, so removing extra whitespace from all column names will save us work in the long run.


In [453]:
# We display the columns contained in our dataframe
df.columns

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', ' Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')

In [454]:
new_columns = []
for column in df.columns :
    clean_c = column.strip()
    new_columns.append(clean_c)
df.columns = new_columns

df.columns

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', 'Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')

We observe that there is no longer any whitespace at the start or end of any column label.

However, the column labels still have a variety of upper and lowercase letters, as well as parentheses, which will make them harder to work with and read. 

Let's finish cleaning our column labels by:

- Replacing spaces with underscores.
- Removing special characters.
- Making all labels lowercase.
- Shortening any long column names.

In [455]:
def cleaning(col):
    col = col.strip()
    col = col.replace("Operating System","os")
    col = col.replace(" ","_")
    col = col.replace("(","")
    col = col.replace(")","")
    col = col.lower()
    return col

new_columns = []
for c in df.columns:
    clean_c = cleaning(c)
    new_columns.append(clean_c)

df.columns = new_columns
print(df.columns)

Index(['manufacturer', 'model_name', 'category', 'screen_size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'os', 'os_version', 'weight',
       'price_euros'],
      dtype='object')
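The same cleaning steps can also be sketched without an explicit loop, by chaining the vectorized string methods available on `df.columns`. The toy frame below is an assumption used only to keep the example self-contained; its labels are shaped like the ones in this project.

```python
import pandas as pd

# Toy frame with labels shaped like the ones in this project
toy = pd.DataFrame(columns=[" Storage", "Operating System Version", "Price (Euros)"])

# Apply the same steps as cleaning(), one vectorized call per step
toy.columns = (
    toy.columns
    .str.strip()                                         # remove surrounding whitespace
    .str.replace("Operating System", "os", regex=False)  # shorten long names
    .str.replace(" ", "_", regex=False)                  # spaces -> underscores
    .str.replace("(", "", regex=False)                   # drop parentheses
    .str.replace(")", "", regex=False)
    .str.lower()                                         # lowercase everything
)
print(list(toy.columns))
```

Passing `regex=False` treats each pattern as a literal string, which matters for the parentheses because they have a special meaning in regular expressions.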


### Converting String Columns to Numeric

We observed earlier that all 13 columns have the object dtype, meaning they're stored as strings.


In [456]:
# Now, we identify the unique values in the ram column of the laptops dataframe

unique_ram = df["ram"].unique()
unique_ram 

array(['8GB', '16GB', '4GB', '2GB', '12GB', '6GB', '32GB', '24GB', '64GB'], dtype=object)

From the result above, we identify a clear pattern in the ram column: every value is an integer followed by the characters GB.
We would like to convert both the ram and screen_size columns to numeric dtypes so that we can analyze them.
To do so, we first have to remove the non-digit characters.


In [457]:
# First, we remove the string "GB" from each value

updated_ram_col = []
for ram in df["ram"]:
    ram = ram.replace("GB", "")
    updated_ram_col.append(ram)
df["ram"] = updated_ram_col

# Now, we convert the ram column to integer

df["ram"] = df["ram"].astype(int)
df["ram"].unique()

array([ 8, 16,  4,  2, 12,  6, 32, 24, 64], dtype=int64)
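The loop above could also be written as a single vectorized expression. This is only a sketch on a toy series (the `ram` series below is an assumption standing in for `df["ram"]`), using `Series.str.replace` followed by `astype(int)`.

```python
import pandas as pd

# Toy series standing in for df["ram"]
ram = pd.Series(["8GB", "16GB", "4GB"], name="ram")

# Remove the literal "GB" from every value, then convert the strings to integers
ram_int = ram.str.replace("GB", "", regex=False).astype(int)
print(ram_int.tolist())
```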

In [458]:
# We do the same operations for the screen_size column
df["screen_size"].unique()

array(['13.3"', '15.6"', '15.4"', '14.0"', '12.0"', '11.6"', '17.3"',
       '10.1"', '13.5"', '12.5"', '13.0"', '18.4"', '13.9"', '12.3"',
       '17.0"', '15.0"', '14.1"', '11.3"'], dtype=object)

In [459]:
updated_screen_size_col = []
for size in df["screen_size"]:
    size = size.replace('"', "")
    updated_screen_size_col.append(size)
df["screen_size"] = updated_screen_size_col

# Now, we convert the screen_size column to float

df["screen_size"] = df["screen_size"].astype(float)
df["screen_size"].unique()

array([ 13.3,  15.6,  15.4,  14. ,  12. ,  11.6,  17.3,  10.1,  13.5,
        12.5,  13. ,  18.4,  13.9,  12.3,  17. ,  15. ,  14.1,  11.3])
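Since the ram and screen_size conversions follow the same pattern, they could be wrapped in one small helper. The function `to_numeric_without_suffix` below is hypothetical (it is not part of pandas or of this project); it is a sketch that assumes the unit always appears as a literal suffix, and it uses `pd.to_numeric(..., errors="coerce")` so that an unexpected value becomes NaN instead of raising an error.

```python
import pandas as pd

def to_numeric_without_suffix(series, suffix):
    """Hypothetical helper: remove a unit suffix and convert to numbers."""
    cleaned = series.str.strip().str.replace(suffix, "", regex=False)
    # errors="coerce" turns anything that still isn't a number into NaN
    return pd.to_numeric(cleaned, errors="coerce")

# Toy series; the last value is deliberately malformed
sizes = pd.Series(['13.3"', '15.6"', "unknown"])
sizes_numeric = to_numeric_without_suffix(sizes, '"')
print(sizes_numeric)
```

Whether silently coercing bad values to NaN is preferable to failing loudly with `astype(float)` depends on how much the input is trusted.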

### Renaming the columns

Now that we've converted our columns to numeric dtypes, the final step is to rename them. This step is optional, but it is useful when the characters we removed carried information that helps us understand the data.

Because the GB characters told us the unit (gigabytes) of the laptop's ram, we'll rename the column from ram to ram_gb. Likewise, the " character told us that screen sizes are in inches, so we'll rename screen_size to screen_size_inches.


In [460]:
df.rename(columns={'ram': 'ram_gb'}, inplace=True)
df.rename(columns={"screen_size": "screen_size_inches"}, inplace=True)
df.head(10)

| | manufacturer | model_name | category | screen_size_inches | screen | cpu | ram_gb | storage | gpu | os | os_version | weight | price_euros |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apple | MacBook Pro | Ultrabook | 13.3 | IPS Panel Retina Display 2560x1600 | Intel Core i5 2.3GHz | 8 | 128GB SSD | Intel Iris Plus Graphics 640 | macOS | | 1.37kg | 133969 |
| 1 | Apple | Macbook Air | Ultrabook | 13.3 | 1440x900 | Intel Core i5 1.8GHz | 8 | 128GB Flash Storage | Intel HD Graphics 6000 | macOS | | 1.34kg | 89894 |
| 2 | HP | 250 G6 | Notebook | 15.6 | Full HD 1920x1080 | Intel Core i5 7200U 2.5GHz | 8 | 256GB SSD | Intel HD Graphics 620 | No OS | | 1.86kg | 57500 |
| 3 | Apple | MacBook Pro | Ultrabook | 15.4 | IPS Panel Retina Display 2880x1800 | Intel Core i7 2.7GHz | 16 | 512GB SSD | AMD Radeon Pro 455 | macOS | | 1.83kg | 253745 |
| 4 | Apple | MacBook Pro | Ultrabook | 13.3 | IPS Panel Retina Display 2560x1600 | Intel Core i5 3.1GHz | 8 | 256GB SSD | Intel Iris Plus Graphics 650 | macOS | | 1.37kg | 180360 |
| 5 | Acer | Aspire 3 | Notebook | 15.6 | 1366x768 | AMD A9-Series 9420 3GHz | 4 | 500GB HDD | AMD Radeon R5 | Windows | 10 | 2.1kg | 40000 |
| 6 | Apple | MacBook Pro | Ultrabook | 15.4 | IPS Panel Retina Display 2880x1800 | Intel Core i7 2.2GHz | 16 | 256GB Flash Storage | Intel Iris Pro Graphics | Mac OS | X | 2.04kg | 213997 |
| 7 | Apple | Macbook Air | Ultrabook | 13.3 | 1440x900 | Intel Core i5 1.8GHz | 8 | 256GB Flash Storage | Intel HD Graphics 6000 | macOS | | 1.34kg | 115870 |
| 8 | Asus | ZenBook UX430UN | Ultrabook | 14.0 | Full HD 1920x1080 | Intel Core i7 8550U 1.8GHz | 16 | 512GB SSD | Nvidia GeForce MX150 | Windows | 10 | 1.3kg | 149500 |
| 9 | Acer | Swift 3 | Ultrabook | 14.0 | IPS Panel Full HD 1920x1080 | Intel Core i5 8250U 1.6GHz | 8 | 256GB SSD | Intel UHD Graphics 620 | Windows | 10 | 1.6kg | 77000 |


In [461]:
# Let's now display the descriptive statistics for the ram_gb column

ram_gb_desc = df["ram_gb"].describe()
ram_gb_desc

count    1303.000000
mean        8.382195
std         5.084665
min         2.000000
25%         4.000000
50%         8.000000
75%         8.000000
max        64.000000
Name: ram_gb, dtype: float64

From the result above, we observe that the ram_gb column has 1303 values, ranging from a minimum of 2 GB to a maximum of 64 GB.

On average, the laptops in our dataframe have approximately 8.4 GB of ram, and the median is 8 GB.

Now, we would like to extract the manufacturer name from the cpu column and count the laptops for each CPU manufacturer.


In [462]:
# We define a new column "cpu_manufacturer" and assign it the first word of each value in the cpu column
df["cpu_manufacturer"] = df["cpu"].str.split().str[0]

# We count the values in the new column "cpu_manufacturer" for each manufacturer 
cpu_manufacturer_counts = df["cpu_manufacturer"].value_counts()

cpu_manufacturer_counts


Intel      1240
AMD          62
Samsung       1
Name: cpu_manufacturer, dtype: int64
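To make the chained expression easier to follow, here is a sketch that splits it into its two steps on a toy series (the `cpu` series below is an assumption that reuses two values from the cpu column).

```python
import pandas as pd

# Toy series with two values taken from the cpu column
cpu = pd.Series(["Intel Core i5 2.3GHz", "AMD A9-Series 9420 3GHz"])

# Step 1: split each string on whitespace, giving a list of words per row
words = cpu.str.split()

# Step 2: .str[0] picks the first element of each list, i.e. the first word
first_word = words.str[0]

print(first_word.value_counts())
```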

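Now, we observe that there are two variations of the Apple operating system name in our dataset: Mac OS and macOS.
We'll create a dictionary called correction and pass it as an argument to Series.map() to change both into mac_os.
Note that Series.map() returns NaN for any value that is not a key of the dictionary, which is why the os value of the non-Apple laptops appears empty in the outputs that follow.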


In [463]:
correction  = {"macOS":"mac_os", "Mac OS" : "mac_os" }
df["os"] = df["os"].map(correction)
df.head()

| | manufacturer | model_name | category | screen_size_inches | screen | cpu | ram_gb | storage | gpu | os | os_version | weight | price_euros | cpu_manufacturer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apple | MacBook Pro | Ultrabook | 13.3 | IPS Panel Retina Display 2560x1600 | Intel Core i5 2.3GHz | 8 | 128GB SSD | Intel Iris Plus Graphics 640 | mac_os | | 1.37kg | 133969 | Intel |
| 1 | Apple | Macbook Air | Ultrabook | 13.3 | 1440x900 | Intel Core i5 1.8GHz | 8 | 128GB Flash Storage | Intel HD Graphics 6000 | mac_os | | 1.34kg | 89894 | Intel |
| 2 | HP | 250 G6 | Notebook | 15.6 | Full HD 1920x1080 | Intel Core i5 7200U 2.5GHz | 8 | 256GB SSD | Intel HD Graphics 620 | | | 1.86kg | 57500 | Intel |
| 3 | Apple | MacBook Pro | Ultrabook | 15.4 | IPS Panel Retina Display 2880x1800 | Intel Core i7 2.7GHz | 16 | 512GB SSD | AMD Radeon Pro 455 | mac_os | | 1.83kg | 253745 | Intel |
| 4 | Apple | MacBook Pro | Ultrabook | 13.3 | IPS Panel Retina Display 2560x1600 | Intel Core i5 3.1GHz | 8 | 256GB SSD | Intel Iris Plus Graphics 650 | mac_os | | 1.37kg | 180360 | Intel |
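The empty os value for the HP laptop in the output above comes from how `Series.map()` treats values that are missing from the dictionary. The sketch below, on a toy series that is an assumption for illustration, contrasts it with `Series.replace()`, which leaves unmatched values untouched and may be the safer choice when only some values need correcting.

```python
import pandas as pd

# Toy series standing in for df["os"]
os_col = pd.Series(["macOS", "Mac OS", "Windows", "No OS"])
corrections = {"macOS": "mac_os", "Mac OS": "mac_os"}

# map(): values that are not keys of the dictionary become NaN
mapped = os_col.map(corrections)

# replace(): values that are not keys of the dictionary are kept as they are
replaced = os_col.replace(corrections)

print(mapped.tolist())
print(replaced.tolist())
```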


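Now, we would like to drop the rows and columns that contain null values. DataFrame.dropna() returns a new dataframe, so we store the version without null-containing rows (axis=0) and the version without null-containing columns (axis=1) in two separate variables; df itself is unchanged.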

In [464]:
laptops_no_null_rows = df.dropna(axis = 0)
laptops_no_null_cols = df.dropna(axis = 1)
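A minimal sketch of what the two `axis` values do, using a toy dataframe (an assumption for illustration only):

```python
import numpy as np
import pandas as pd

# Toy frame: one null in the os_version column, in the first row
toy = pd.DataFrame({
    "os": ["mac_os", "Windows"],
    "os_version": [np.nan, "10"],
})

# axis=0 drops every row that contains at least one null
no_null_rows = toy.dropna(axis=0)

# axis=1 drops every column that contains at least one null
no_null_cols = toy.dropna(axis=1)

print(no_null_rows.shape, no_null_cols.shape, toy.shape)
```

Because `dropna()` returns a copy, the original frame keeps its shape unless the result is assigned back to it.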

Now, for rows with No OS values, let's replace the missing value in the os_version column with the value version_unknown.

In [465]:
df.loc[df["os"] == "No OS" , "os_version"] = "version_unknown"


# We count the laptops that still have a null os_version, grouped by os
value_counts_after = df.loc[df["os_version"].isnull(), "os"].value_counts()
value_counts_after

mac_os    13
Name: os, dtype: int64

Since the 13 laptops listed above have a null os_version and all run mac_os, we replace the os_version of the mac_os laptops with X. Recounting the null values in os_version afterwards is a useful double check.

In [466]:
df.loc[df["os"] == "mac_os", "os_version"] = "X"
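The double check could be sketched as follows. The toy dataframe is an assumption for illustration; the sketch also restricts the assignment to rows whose os_version is actually null, a slightly narrower variant of the assignment above that leaves existing version values alone.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the os and os_version columns
toy = pd.DataFrame({
    "os": ["mac_os", "mac_os", "Windows"],
    "os_version": [np.nan, "X", "10"],
})

# Select only the mac_os rows whose version is missing
mask = (toy["os"] == "mac_os") & toy["os_version"].isnull()
toy.loc[mask, "os_version"] = "X"

# Double check: count the null values left in os_version
remaining_nulls = toy["os_version"].isnull().sum()
print(remaining_nulls)
```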

Finally, we would like to convert the weight column to a numeric dtype so that we can analyze it.
To do so, we need to remove the kg and kgs characters from the end of each string.
We'll then rename the column to weight_kg, save the cleaned dataframe to a CSV file, and display the first ten rows of a few columns of our final dataframe.

In [467]:
df["weight"] = df["weight"].str.replace("kgs","").str.replace("kg","").astype(float)

df.rename(columns = {"weight": "weight_kg"}, inplace=True)

df.to_csv('df_cleaned.csv',index=False)

df[["os","cpu","weight_kg"]].head(10)

| | os | cpu | weight_kg |
|---|---|---|---|
| 0 | mac_os | Intel Core i5 2.3GHz | 1.37 |
| 1 | mac_os | Intel Core i5 1.8GHz | 1.34 |
| 2 | | Intel Core i5 7200U 2.5GHz | 1.86 |
| 3 | mac_os | Intel Core i7 2.7GHz | 1.83 |
| 4 | mac_os | Intel Core i5 3.1GHz | 1.37 |
| 5 | | AMD A9-Series 9420 3GHz | 2.1 |
| 6 | mac_os | Intel Core i7 2.2GHz | 2.04 |
| 7 | mac_os | Intel Core i5 1.8GHz | 1.34 |
| 8 | | Intel Core i7 8550U 1.8GHz | 1.3 |
| 9 | | Intel Core i5 8250U 1.6GHz | 1.6 |
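The order of the two replacements in the weight conversion matters: removing kg before kgs would leave a stray s behind. The sketch below, on a toy series that is an assumption for illustration, shows that pitfall and an alternative that handles both suffixes with a single regular expression.

```python
import pandas as pd

# Toy series standing in for the original weight strings
weights = pd.Series(["1.37kg", "2.1kg", "4kgs"])

# Pitfall: removing "kg" first turns "4kgs" into "4s"
wrong_order = weights.str.replace("kg", "", regex=False)

# Alternative: one regex that matches "kg" or "kgs" at the end of the string
weight_kg = weights.str.replace(r"kgs?$", "", regex=True).astype(float)

print(wrong_order.tolist())
print(weight_kg.tolist())
```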
