# Data cleaning basics continuation

In [1]:
import numpy as np
import pandas as pd

In [2]:
laptops = pd.read_csv("laptops.csv", encoding="Latin-1")

In [3]:
laptops.head(3)

| | Manufacturer | Model Name | Category | Screen Size | Screen | CPU | RAM | Storage | GPU | Operating System | Operating System Version | Weight | Price (Euros) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apple | MacBook Pro | Ultrabook | 13.3" | IPS Panel Retina Display 2560x1600 | Intel Core i5 2.3GHz | 8GB | 128GB SSD | Intel Iris Plus Graphics 640 | macOS | NaN | 1.37kg | 133969 |
| 1 | Apple | Macbook Air | Ultrabook | 13.3" | 1440x900 | Intel Core i5 1.8GHz | 8GB | 128GB Flash Storage | Intel HD Graphics 6000 | macOS | NaN | 1.34kg | 89894 |
| 2 | HP | 250 G6 | Notebook | 15.6" | Full HD 1920x1080 | Intel Core i5 7200U 2.5GHz | 8GB | 256GB SSD | Intel HD Graphics 620 | No OS | NaN | 1.86kg | 57500 |


We want to deal with the null values without removing whole rows or columns, because dropping them could interfere directly with our analysis.
One method is to explore all of the values in the column. For this we can use Series.value_counts() with the dropna=False parameter. By default, Series.value_counts() won't include null values in its output; this parameter lets us explicitly indicate that we want to see them:

In [4]:
# print(laptops["os_version"].value_counts(dropna=False))
print(laptops["Operating System Version"].value_counts(dropna=False))

10      1072
NaN      170
7         45
X          8
10 S       8
Name: Operating System Version, dtype: int64
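
To make the effect of dropna easier to see in isolation, here is a minimal sketch on a made-up Series. The values and the names versions, default_counts and full_counts are illustrative assumptions, not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "Operating System Version" column (values are invented)
versions = pd.Series(["10", "10", np.nan, "7", np.nan, "X"])

# Default behaviour: NaN entries are left out of the counts
default_counts = versions.value_counts()

# dropna=False: NaN gets its own row in the result
full_counts = versions.value_counts(dropna=False)

print(default_counts)
print(full_counts)

# The difference between the two totals is the number of missing values
n_missing = full_counts.sum() - default_counts.sum()
print(n_missing)
```

Comparing the two totals is a quick sanity check against Series.isnull().sum().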


<br>**Now let's see which operating systems the rows with null values have:**

In [5]:
os_with_null_values = laptops.loc[
                                laptops["Operating System Version"].isnull(),
                                "Operating System"
                                ]
print(os_with_null_values.value_counts())

No OS        66
Linux        62
Chrome OS    27
macOS        13
Android       2
Name: Operating System, dtype: int64


Immediately we can observe a few things:
* The largest group of missing values (66 of 170) belongs to laptops that don't include any OS. This is an important distinction, because it's not so much that we don't know what the value is, as that there can't be a value.
* 13 of the laptops that come with macOS do not specify the version. Leaning on our knowledge of macOS, we might know that the full name of macOS used to be Mac OS X, and so we might want to fill these values to be more consistent.

In both of these cases, we can fill the missing values to make our data more correct. For the rest of the values, it's probably best to leave them as missing so we don't remove important values.<br>
First, let's explore those macOS rows a bit more to make sure our intuition was correct:

In [6]:
# mac_os_versions = laptops.loc[laptops["os"] == "macOS", "os_version"]

mac_os_versions = laptops.loc[laptops["Operating System"] == "macOS", "Operating System Version"]

print(mac_os_versions.value_counts(dropna=False))

NaN    13
Name: Operating System Version, dtype: int64


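The output lists only the 13 missing values: every row labelled macOS lacks a version, and the 8 rows with the value X are not among them, so they must carry a different Operating System label. Since Mac OS X is the matching version name, we'll fill in all of these NaN values with X. We can use assignment with a boolean comparison to perform this replacement: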

In [7]:
#laptops.loc[laptops["os"] == "macOS", "os_version"] = "X"
laptops.loc[(laptops["Operating System"] == "macOS"),"Operating System Version"] = "X"
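
The same pattern can be narrowed so that it only touches entries that are actually missing, which matters when some matching rows already hold a value. A minimal sketch on a made-up frame follows; the rows and the names toy, is_mac and is_missing are illustrative assumptions, not part of the notebook:

```python
import numpy as np
import pandas as pd

# Toy frame: column names mirror the dataset, the rows are invented
toy = pd.DataFrame({
    "Operating System": ["macOS", "macOS", "Windows", "No OS"],
    "Operating System Version": [np.nan, "X", "10", np.nan],
})

# Boolean masks: one per condition
is_mac = toy["Operating System"] == "macOS"
is_missing = toy["Operating System Version"].isnull()

# Combine the masks with & and assign through .loc,
# so only macOS rows with a missing version are changed
toy.loc[is_mac & is_missing, "Operating System Version"] = "X"

print(toy)
```

Rows that match only one of the two conditions (the Windows row and the No OS row here) are left as they were.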

For our other case, let's insert a "Version Unknown" value into the Operating System Version column for any laptop with a No OS value in the Operating System column:

In [8]:
bool_list_no_os = laptops["Operating System"] == "No OS"
laptops.loc[bool_list_no_os, "Operating System Version"] = "Version Unknown"

In [9]:
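# value_counts_after = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()
value_counts_after = laptops.loc[laptops["Operating System Version"].isnull(), "Operating System"].value_counts()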

---

Let's take the storage column and split it into capacity (in GB) and type, and do this for up to 2 storage devices (the most any laptop has in this dataset).

In [10]:
# This adjustment is only for continuity; it doesn't produce results different from the original
laptops = laptops.rename(columns={"Storage" : "storage"})

In [11]:
laptops.columns = laptops.columns.str.replace(" St", "st")

In [12]:
laptops.columns

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', 'storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')

First, let's take a look at the storage data:

In [13]:
print(laptops.loc[70:90, 'storage'])

70               128GB SSD
71               256GB SSD
72               256GB SSD
73    128GB SSD +  1TB HDD
74                 1TB HDD
75                 1TB HDD
76                 2TB HDD
77    128GB SSD +  1TB HDD
78                 1TB HDD
79    128GB SSD +  1TB HDD
80               256GB SSD
81               512GB SSD
82               256GB SSD
83               128GB SSD
84                 1TB HDD
85    128GB SSD +  1TB HDD
86               256GB SSD
87               256GB SSD
88    128GB SSD +  1TB HDD
89               256GB SSD
90                 1TB HDD
Name: storage, dtype: object
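
Before writing a row-by-row function, it may help to see one possible shape of the result. The sketch below is not the notebook's treating_storage implementation; it is an alternative using vectorized string methods on a few sample values. The output column names (storage_1_capacity_gb and so on) and the conversion of 1 TB to 1000 GB are assumptions made for illustration:

```python
import pandas as pd

# Sample values copied from the outputs above
storage = pd.Series([
    "256GB SSD",
    "128GB SSD +  1TB HDD",
    "1TB HDD",
    "128GB Flash Storage",
])

# Split on "+" into at most two drives; single-drive rows get a missing second part
drives = storage.str.split("+", n=1, expand=True).reindex(columns=[0, 1])

result = pd.DataFrame(index=storage.index)
for i in (0, 1):
    # Capture the number, the unit (GB or TB) and the remaining text as the type
    parts = drives[i].str.strip().str.extract(r"^(\d+(?:\.\d+)?)(GB|TB)\s+(.+)$")
    capacity = parts[0].astype(float)
    # Assumption: 1 TB is counted as 1000 GB
    capacity = capacity.where(parts[1] != "TB", capacity * 1000)
    result[f"storage_{i + 1}_capacity_gb"] = capacity
    result[f"storage_{i + 1}_type"] = parts[2]

print(result)
```

Laptops with a single drive end up with NaN in both storage_2 columns, which keeps "no second drive" distinct from a drive of unknown size.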


In [14]:
def treating_storage(sto_str):
    sto_list = sto_str.split()
    