## Reading CSV Files with Encodings

We've learned how to select, assign, and analyze data with pandas using pre-cleaned data. In reality, data is rarely in the format needed to perform analysis. Data scientists commonly spend [over half their time cleaning data](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/), so knowing how to clean "messy" data is an extremely important skill.

We'll learn the basics of data cleaning with pandas as we work with laptops.csv, a CSV file containing information about 1,300 laptop computers. The first five rows of the CSV file are shown below:

We can start by reading the data into pandas. Let's look at what happens when we use the pandas.read_csv() function with only the filename as an argument:

In [108]:
import pandas as pd

laptops = pd.read_csv('../Datasets/laptops.csv')

We get an error! Reading the traceback, we can see it references UTF-8, which is a type of encoding. 

Computers, at their lowest levels, can only understand binary (0 and 1) and encodings are systems for representing characters in binary.

This error is telling us that pandas tried to decode the file's bytes as UTF-8 and came across a byte sequence that isn't valid in that encoding.
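To make the idea of decoding concrete, here is a minimal sketch that uses only built-in Python. The sample word is made up for illustration and is not taken from laptops.csv:

```python
# Encode a short string as Latin-1. The final character becomes the single byte 0xE9.
raw = "café".encode("latin1")

# Decoding the same bytes as UTF-8 fails, because a lone 0xE9 byte at the end
# of the input is not a complete UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding that was used to write the bytes recovers the text.
text = raw.decode("latin1")
print(utf8_ok, text)
```

The same kind of failure is what pandas reports when a file's bytes don't match the encoding it was asked to use.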

Thankfully, the pandas.read_csv() function has an encoding argument we can use to specify an encoding:

In [110]:
# df = pd.read_csv("filename.csv", encoding="encoding_type")

The top four most popular encodings, which we can use to set the encoding parameter of pandas.read_csv() above, are:

- utf-8 - Universal Coded Character Set Transformation Format—8-bit, a dominant character encoding for the web.
- latin1 - Also known as 'ISO-8859-1', a part of the ISO/IEC 8859 series.
- Windows-1252 - A character encoding of the Windows family, also known as 'cp1252' or sometimes ANSI.
- utf-16 - Similar to 'utf-8' but built from 16-bit code units (one or two per character) instead of 8-bit ones.

Since the pandas.read_csv() function already tried to read in the laptops.csv file using the default encoding type (utf-8) and failed, we know the file's not encoded using that format!

Let's try another popular encoding type to see if that works.

In [111]:
laptops = pd.read_csv('../Datasets/laptops.csv', encoding='latin1')

In [112]:
laptops.head(2)

Unnamed: 0,Manufacturer,Model Name,Category,Screen Size,Screen,CPU,RAM,Storage,GPU,Operating System,Operating System Version,Weight,Price (Euros)
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,133969
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,89894
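When the encoding isn't known in advance, one option is to try a few candidates in order. The sketch below is only an illustration: the helper name `read_csv_with_fallback` and the tiny temporary file are assumptions made so the example runs without laptops.csv.

```python
import os
import tempfile

import pandas as pd

# Write a tiny Latin-1 encoded CSV to a temporary file (stand-in for a real dataset).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as f:
    f.write("name,city\nZoé,Málaga\n".encode("latin1"))
    path = f.name


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Hypothetical helper: return (dataframe, encoding) for the first encoding that decodes."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")


df, used = read_csv_with_fallback(path)
os.remove(path)
print(used)
```

Keep in mind that latin1 can decode any sequence of bytes, so a successful read doesn't prove the encoding is right. It's still worth checking the text for garbled characters.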


In [113]:
laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Manufacturer              1303 non-null   object
 1   Model Name                1303 non-null   object
 2   Category                  1303 non-null   object
 3   Screen Size               1303 non-null   object
 4   Screen                    1303 non-null   object
 5   CPU                       1303 non-null   object
 6   RAM                       1303 non-null   object
 7    Storage                  1303 non-null   object
 8   GPU                       1303 non-null   object
 9   Operating System          1303 non-null   object
 10  Operating System Version  1133 non-null   object
 11  Weight                    1303 non-null   object
 12  Price (Euros)             1303 non-null   object
dtypes: object(13)
memory usage: 132.5+ KB


## Cleaning Column Names

We can see that all columns are represented by the object dtype, indicating that they store string values, not numerical values. Also, one of the columns, Operating System Version, contains some null values.

The column labels also have a mix of upper and lowercase letters, as well as spaces and parentheses, which will make them harder to work with and read. One noticeable issue is that the " Storage" column name has a leading space in front of it. These quirks with column labels can sometimes be hard to spot, so removing extra whitespaces from all column names will avoid headaches in the long run.

We can access the column axis labels of a dataframe using the [DataFrame.columns attribute](https://pandas.pydata.org/pandas-docs/stable/basics.html#attributes-and-the-raw-ndarray-s). This returns an Index object, the array-like container pandas uses for axis labels, with the label (name) of each column:

In [114]:
laptops.columns

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', ' Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')

Not only can we use the attribute to view the column labels, we can also assign new ones with it:

In [115]:
laptops_test = laptops.copy()
laptops_test.columns = ['A', 'B', 'C', 'D', 'E',
                        'F', 'G', 'H', 'I', 'J',
                        'K', 'L', 'M']
print(laptops_test.columns)

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'], dtype='object')


Next, let's use the DataFrame.columns attribute to remove whitespaces from the column names.

In [116]:
laptops.columns = [col.strip() for col in laptops.columns]

In [117]:
laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Manufacturer              1303 non-null   object
 1   Model Name                1303 non-null   object
 2   Category                  1303 non-null   object
 3   Screen Size               1303 non-null   object
 4   Screen                    1303 non-null   object
 5   CPU                       1303 non-null   object
 6   RAM                       1303 non-null   object
 7   Storage                   1303 non-null   object
 8   GPU                       1303 non-null   object
 9   Operating System          1303 non-null   object
 10  Operating System Version  1133 non-null   object
 11  Weight                    1303 non-null   object
 12  Price (Euros)             1303 non-null   object
dtypes: object(13)
memory usage: 132.5+ KB
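The list comprehension above isn't the only way to do this. Because the column labels here are strings, the Index object also exposes vectorized string methods through `.str`. A small sketch on a made-up dataframe:

```python
import pandas as pd

# Toy dataframe with a leading space in one label, mimicking " Storage".
df = pd.DataFrame({"RAM": ["8GB"], " Storage": ["128GB SSD"]})

# Index.str.strip() removes leading and trailing whitespace from every label.
df.columns = df.columns.str.strip()
print(list(df.columns))
```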


This is a good start, but we still need to standardize the column labels a bit more. Let's finish cleaning them up by:

- Replacing spaces between words with underscores.
- Removing any special characters, like parentheses.
- Making all labels lowercase.
- Shortening any long column names.

Since we need to perform these steps on each of our column labels, it makes sense for us to create a helper function that uses [Python string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) to clean our column labels as described above. Then we can use a loop (here, a list comprehension) to apply that function to each column label. Let's look at an example:

In [118]:
def clean_col(col):
    col = col.replace('(', '').replace(')', '').lower().strip()
    col = col.replace('operating system', 'os').replace(' ', '_')
    return col

new_columns = [clean_col(c) for c in laptops.columns]

In [119]:
new_columns

['manufacturer',
 'model_name',
 'category',
 'screen_size',
 'screen',
 'cpu',
 'ram',
 'storage',
 'gpu',
 'os',
 'os_version',
 'weight',
 'price_euros']

In [120]:
laptops.columns = new_columns

In [121]:
laptops.head(2)

Unnamed: 0,manufacturer,model_name,category,screen_size,screen,cpu,ram,storage,gpu,os,os_version,weight,price_euros
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,133969
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,89894
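As an alternative to building a list and assigning it to the columns attribute, DataFrame.rename() also accepts a function for its `columns` parameter. This sketch reuses the same cleaning logic on an empty, made-up dataframe:

```python
import pandas as pd


def clean_col(col):
    # Same steps as in the lesson: drop parentheses, lowercase, trim, shorten, underscore.
    col = col.replace("(", "").replace(")", "").lower().strip()
    return col.replace("operating system", "os").replace(" ", "_")


# Only the labels matter here, so the frame has no rows.
df = pd.DataFrame(columns=["Model Name", " Storage", "Operating System Version", "Price (Euros)"])

# rename() calls clean_col once per column label and returns a new dataframe.
df = df.rename(columns=clean_col)
print(list(df.columns))
```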


## Converting String Columns to Numeric

We observed earlier that all 13 columns have the object dtype, indicating they're storing strings. Let's look at the first few rows of some of our columns:

In [122]:
print(laptops.iloc[:5, 2:5])

    category screen_size                              screen
0  Ultrabook       13.3"  IPS Panel Retina Display 2560x1600
1  Ultrabook       13.3"                            1440x900
2   Notebook       15.6"                   Full HD 1920x1080
3  Ultrabook       15.4"  IPS Panel Retina Display 2880x1800
4  Ultrabook       13.3"  IPS Panel Retina Display 2560x1600


Of these three columns, we have three different types of text data:

- category: Purely text data; it has no numeric values.
- screen_size: Numeric data stored as text data because of the " character that represents "inches."
- screen: A combination of text data (screen type) and numeric data (screen resolution).

Because the values in the screen_size column are stored as text data, we can't easily sort them numerically. For instance, if we wanted to select laptops with screens 15" or larger, we'd be unable to do so without using some clever tricks.

Let's address this problem by converting the screen_size column to purely numeric values. Whenever we convert text to numeric data, we can follow this data cleaning workflow:

![image.png](attachment:image.png)

The first step is to explore the data. One of the best ways to start exploring the data is to use the Series.unique() method to view all of the unique values in the column:

In [123]:
laptops['screen_size'].unique()

array(['13.3"', '15.6"', '15.4"', '14.0"', '12.0"', '11.6"', '17.3"',
       '10.1"', '13.5"', '12.5"', '13.0"', '18.4"', '13.9"', '12.3"',
       '17.0"', '15.0"', '14.1"', '11.3"'], dtype=object)

Our next step is to identify patterns and special cases that block us from converting the column to numeric. Looking at the results above, we can observe the following:

- All values in this column follow a pattern: two digits, followed by a decimal (.), followed by a single digit, followed by a double quotation mark ("). We'll eventually need to remove that " so we can convert the column to numeric.
- There are no special cases; every unique value in the column matches this pattern.
- Because the int dtype won't be able to store these decimal values, we'll eventually need to convert the column to a float dtype.
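With only a handful of unique values, checking the pattern by eye works well. For longer lists, one way to check it programmatically is a regular expression with Series.str.fullmatch(). The sketch below uses a few sample values rather than the full column, and the regular expression is our own encoding of the pattern described above:

```python
import pandas as pd

# A few sample values in the same format as the screen_size column.
screen_size = pd.Series(['13.3"', '15.6"', '17.0"', '11.3"'])

# Two digits, a literal dot, one digit, then a double quotation mark.
matches = screen_size.str.fullmatch(r'\d{2}\.\d"')

# Any value that doesn't match the pattern is a special case to look at.
special_cases = screen_size[~matches]
print(special_cases.tolist())
```

An empty result means every value follows the pattern.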

Let's see if we can identify any patterns and special cases in the ram column next.

In [125]:
laptops['ram'].unique()

array(['8GB', '16GB', '4GB', '2GB', '12GB', '6GB', '32GB', '24GB', '64GB'],
      dtype=object)

A note about Series.unique(): The Series.unique() method returns a NumPy array, not a list or pandas series. This means that we can't use the Series methods we've learned so far, like Series.head(). If you want to convert the result to a list, you can use the tolist() method of the NumPy array:

In [126]:
unique_ram = laptops["ram"].unique().tolist()

In [127]:
unique_ram

['8GB', '16GB', '4GB', '2GB', '12GB', '6GB', '32GB', '24GB', '64GB']
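If you'd rather keep working with a series, Series.drop_duplicates() returns the unique values as a series, so methods like Series.head() are still available. A short sketch with made-up values:

```python
import pandas as pd

ram = pd.Series(["8GB", "16GB", "8GB", "4GB"])

# drop_duplicates() keeps the first occurrence of each value and returns a Series.
unique_ram_series = ram.drop_duplicates()
first_two = unique_ram_series.head(2).tolist()
print(first_two)
```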

## Removing Non-Digit Characters

We identified a clear pattern in the ram column; all values were integers, followed by the characters GB (gigabyte) at the end of the string.

To be able to convert both the ram and screen_size columns to numeric dtypes, we'll have to first remove the non-digit characters, GB and ", respectively.

![image.png](attachment:image.png)

Thankfully, the pandas library contains dozens of [vectorized string methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#method-summary) we can use to manipulate text data. Many of them perform the same operations as the Python string methods we've used already. Most pandas vectorized string methods are available using the [Series.str accessor](http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling). This means we can access them by adding str between the series object name and the method name:

![image.png](attachment:image.png)

In our case, we can use the Series.str.replace() method, which is a vectorized version of the Python str.replace() method we used earlier when cleaning up column labels. Here's how we use it to clean up the screen_size column:

In [128]:
laptops["screen_size"] = laptops["screen_size"].str.replace('"', '')
print(laptops["screen_size"].unique())
print("`screen_size` dtype:", laptops["screen_size"].dtype)

['13.3' '15.6' '15.4' '14.0' '12.0' '11.6' '17.3' '10.1' '13.5' '12.5'
 '13.0' '18.4' '13.9' '12.3' '17.0' '15.0' '14.1' '11.3']
`screen_size` dtype: object


Although screen_size still has an object dtype, the unique string values it contains are clearly ready to be converted to numeric values. 

We'll handle that step on the following screen.

But first, let's remove the non-digit characters from the ram column like we've done for the screen_size column in the provided code.

In [129]:
laptops['ram'] = laptops['ram'].str.replace('GB', '')
laptops['ram'].unique()

array(['8', '16', '4', '2', '12', '6', '32', '24', '64'], dtype=object)
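One detail worth knowing: Series.str.replace() can treat its pattern either as literal text or as a regular expression, and the default has differed between pandas versions. Passing `regex=False` makes the intent explicit. A sketch on sample values:

```python
import pandas as pd

ram = pd.Series(["8GB", "16GB", "4GB"])

# regex=False treats "GB" as plain text rather than as a regular expression.
cleaned = ram.str.replace("GB", "", regex=False)
print(cleaned.tolist())
```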

## Converting Columns to Numeric dtypes

Now, we can convert the columns to a numeric dtype. This is also referred to as type casting or changing the data type.

![image.png](attachment:image.png)

To do this, we use the Series.astype() method. To convert the column to a numeric dtype, we can pass either int or float as the argument for the method. Since the int dtype can't handle decimal values, we'll convert the screen_size column to the float dtype:

In [130]:
laptops["screen_size"] = laptops["screen_size"].astype(float)
print(laptops["screen_size"].unique())
print("`screen_size` dtype:", laptops["screen_size"].dtype)

[13.3 15.6 15.4 14.  12.  11.6 17.3 10.1 13.5 12.5 13.  18.4 13.9 12.3
 17.  15.  14.1 11.3]
`screen_size` dtype: float64


Our screen_size column is now the float64 dtype. Let's convert the dtype of the ram column to numeric next.

In [131]:
laptops['ram'] = laptops['ram'].astype(float)
laptops['ram'].unique()

array([ 8., 16.,  4.,  2., 12.,  6., 32., 24., 64.])

In [132]:
laptops.dtypes

manufacturer     object
model_name       object
category         object
screen_size     float64
screen           object
cpu              object
ram             float64
storage          object
gpu              object
os               object
os_version       object
weight           object
price_euros      object
dtype: object
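Two variations are worth a quick sketch. Since the ram values are whole numbers, int would also work as the target dtype. And when a column might still contain values that can't be converted, pandas.to_numeric() with `errors="coerce"` turns them into NaN instead of raising an error. The sample values below are made up:

```python
import pandas as pd

# Whole-number strings can be cast straight to int.
ram = pd.Series(["8", "16", "4"])
ram_int = ram.astype(int)

# "unknown" can't be parsed, so errors="coerce" replaces it with NaN.
messy = pd.Series(["13.3", "15.6", "unknown"])
as_float = pd.to_numeric(messy, errors="coerce")

print(ram_int.tolist())
print(as_float.isna().tolist())
```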

Now that we've converted our columns to numeric dtypes, the final step is to rename the columns. This is an optional step, and can be useful if the non-digit values contain information that helps us understand the data.

![image.png](attachment:image.png)

In our case, the quote characters we removed from the screen_size column denoted that the screen size was in inches.

To stop us from losing information that helps us understand the data, we can use the [DataFrame.rename() method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html) to rename the column from screen_size to screen_size_inches.

Below, we specify the axis=1 parameter so pandas knows that we want to rename labels in the column axis as opposed to the index axis (axis=0):

In [133]:
laptops.rename({"screen_size": "screen_size_inches"}, axis=1, inplace=True)
print(laptops.dtypes)

manufacturer           object
model_name             object
category               object
screen_size_inches    float64
screen                 object
cpu                    object
ram                   float64
storage                object
gpu                    object
os                     object
os_version             object
weight                 object
price_euros            object
dtype: object


Note that we can either use inplace=True or assign the result back to the dataframe; both will give us the same results.
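Here's a sketch of the assignment style on a toy dataframe. It also uses the `columns=` keyword, which is equivalent to passing `axis=1`:

```python
import pandas as pd

df = pd.DataFrame({"screen_size": [13.3, 15.6]})

# Without inplace=True, rename() returns a new dataframe and leaves df unchanged.
renamed = df.rename(columns={"screen_size": "screen_size_inches"})

print(list(df.columns))
print(list(renamed.columns))
```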

Let's rename the ram column next and analyze the results.

In [134]:
laptops.rename({'ram': 'ram_gb'}, axis=1, inplace=True)

laptops.ram_gb.describe()

count    1303.000000
mean        8.382195
std         5.084665
min         2.000000
25%         4.000000
50%         8.000000
75%         8.000000
max        64.000000
Name: ram_gb, dtype: float64

## Extracting Values from Strings

Columns often contain useful information that's buried within some text so it's useful to be able to extract these values (substrings) from strings. For example, let's look at the first five values from the gpu (graphics processing unit) column to see if there's any useful information we can extract from it:

In [135]:
print(laptops["gpu"].head())

0    Intel Iris Plus Graphics 640
1          Intel HD Graphics 6000
2           Intel HD Graphics 620
3              AMD Radeon Pro 455
4    Intel Iris Plus Graphics 650
Name: gpu, dtype: object


The information in this column tells us the chip manufacturer (e.g., Intel, AMD) followed by its model name/number. Being able to analyze the data by the manufacturer could be useful to us so let's extract it, with the idea that we'll store it in a new column, gpu_manufacturer.

The pandas library has a great vectorized string method for this situation: [Series.str.split() method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html). We can use it to split the column on any character (or pattern), and store the results in a pandas series that contains a list of each element after splitting. By default, the method splits on a whitespace character (space) so that text is broken into individual words, like in this example:

In [136]:
gpu_head_split = laptops["gpu"].head().str.split()
print(gpu_head_split)

0    [Intel, Iris, Plus, Graphics, 640]
1           [Intel, HD, Graphics, 6000]
2            [Intel, HD, Graphics, 620]
3               [AMD, Radeon, Pro, 455]
4    [Intel, Iris, Plus, Graphics, 650]
Name: gpu, dtype: object


Notice how the method returns a series object containing a list of the words from the original gpu column. Now all we need to do is select the first element in each list to create our new gpu_manufacturer column.

The pandas library comes to the rescue with another vectorized string method we can leverage here! The [Series.str accessor](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#indexing-with-str) can be used with [] notation to directly index by position locations:

In [137]:
print(gpu_head_split.str[0])

0    Intel
1    Intel
2    Intel
3      AMD
4    Intel
Name: gpu, dtype: object


Since we've been working on laptops["gpu"].head(), we're only seeing the first five rows of laptops["gpu"]. We could easily apply this technique to the entire dataframe by dropping the call to head(). Then, we could assign our results from str[0] to a new column, gpu_manufacturer.

Let's do that now in the exercise below.

In [140]:
laptops["gpu_manufacturer"] = (laptops["gpu"]
                                       .str.split()
                                       .str[0]
                              )
laptops["cpu_manufacturer"] = (laptops["cpu"]
                                       .str.split()
                                       .str[0]
                              )

laptops["cpu_manufacturer"].value_counts()

cpu_manufacturer
Intel      1240
AMD          62
Samsung       1
Name: count, dtype: int64
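If we also wanted to keep the rest of the string, Series.str.split() accepts `n` to limit the number of splits and `expand=True` to return the pieces as dataframe columns. This is a sketch on two sample values, and the column name `gpu_model` is just an illustrative choice:

```python
import pandas as pd

gpu = pd.Series(["Intel Iris Plus Graphics 640", "AMD Radeon Pro 455"])

# n=1 splits only at the first whitespace; expand=True returns a two-column dataframe.
parts = gpu.str.split(n=1, expand=True)
parts.columns = ["gpu_manufacturer", "gpu_model"]
print(parts)
```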

## Correcting Bad Values

In the previous exercise, we saw that although Intel was the top CPU and GPU manufacturer, it truly dominates the CPU manufacturing market! It's always a good idea to question your data to make sure it makes sense, because sometimes it may have issues we need to address.

If our data has been scraped from a webpage or if there was manual data entry involved at some point, we may end up with inconsistent values in our dataset. This can make it difficult to analyze our data holistically. Let's look at an example from our os column:

In [141]:
print(laptops["os"].value_counts())

os
Windows      1125
No OS          66
Linux          62
Chrome OS      27
macOS          13
Mac OS          8
Android         2
Name: count, dtype: int64


We can see that there are two representations of the Apple operating system in our dataset: Mac OS and macOS. One way we can fix this is with the [Series.map() method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html). While we could use the Series.str.replace() method to fix this particular issue, the Series.map() method is ideal when we want to change multiple values in a column at once, so let's take this opportunity to learn how this other method works.

The most common way to use Series.map() is with a mapping dictionary. Let's look at an example using a series of misspelled fruit that's being stored in a series called s:

![image.png](attachment:image.png)

To fix all the spelling mistakes at the same time, we create a dictionary called corrections and pass that dictionary as an argument to Series.map() to map the incorrect words (keys) onto the correct ones (values):

corrections = {
    "pair": "pear",
    "oranje": "orange",
    "bananna": "banana"
}
s_fixed = s.map(corrections)
print(s_fixed)

![image.png](attachment:image.png)

Notice that each string key was replaced by its corresponding string value. One important thing to remember with the Series.map() method is that if a value from the series doesn't exist as a key in the dictionary, it will convert that value to NaN. To see this "mistake" in action, let's see what happens when we call map() on s_fixed using the same corrections dictionary:

In [None]:
s_fixed_again = s_fixed.map(corrections)
print(s_fixed_again)

![image.png](attachment:image.png)

Because none of the values in the s_fixed series matched any of the keys in our corrections dictionary, all the values in s_fixed have become NaN values! This is a very common occurrence, especially when working in a Jupyter notebook environment where we can easily re-run cells accidentally.

When using the map() method, make sure that each unique value in the series is represented as a key in the dictionary being passed to the map() method, otherwise you'll get NaN values in your resulting series. If there are values in the series you don't want to change, ensure you set their keys and values equal to each other so that "no changes are mapped" but each unique value appears as a key in the dictionary.
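The behavior described above can be reproduced with a small runnable sketch. The contents of `s` here are an assumption made for illustration. The last line shows Series.replace(), which leaves values that aren't in the dictionary unchanged instead of turning them into NaN:

```python
import pandas as pd

# Hypothetical series of misspelled fruit names.
s = pd.Series(["pair", "oranje", "bananna", "oranje"])

corrections = {"pair": "pear", "oranje": "orange", "bananna": "banana"}

# First pass: every value is a key in the dictionary, so all of them are corrected.
s_fixed = s.map(corrections)

# Second pass: none of the corrected values are keys, so map() returns all NaN.
s_fixed_again = s_fixed.map(corrections)

# replace() only changes values that appear as keys and keeps the rest as they are.
s_safe = s_fixed.replace(corrections)

print(s_fixed.tolist())
print(s_fixed_again.isna().all())
print(s_safe.tolist())
```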

Let's use Series.map() to clean the values in the os column.

In [143]:
laptops['os'].unique()

array(['macOS', 'No OS', 'Windows', 'Mac OS', 'Linux', 'Android',
       'Chrome OS'], dtype=object)

In [144]:
list_of_os = laptops['os'].unique()

In [145]:
corrections = {}

for os in list_of_os:
    if os == 'Mac OS':
        corrections[os] = 'macOS'
    else:
        corrections[os] = os
corrections

{'macOS': 'macOS',
 'No OS': 'No OS',
 'Windows': 'Windows',
 'Mac OS': 'macOS',
 'Linux': 'Linux',
 'Android': 'Android',
 'Chrome OS': 'Chrome OS'}

In [146]:
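# Series.map() returns a new series; it does not modify laptops['os'] in place.
# To keep the cleaned values, assign the result back:
#     laptops['os'] = laptops['os'].map(corrections)
laptops['os'].map(corrections).value_counts()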

os
Windows      1125
No OS          66
Linux          62
Chrome OS      27
macOS          21
Android         2
Name: count, dtype: int64

## Dropping Missing Values

In previous lessons, we talked briefly about missing values and how both NumPy and pandas represent these as null values. In pandas, null values will be indicated by either NaN or None.

Recall that we can use the [DataFrame.isnull() method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html) to identify missing values in each column. The method returns a boolean dataframe, which we can then use the DataFrame.sum() method on to give us a count of the True values for each column:

In [147]:
print(laptops.isnull().sum())

manufacturer            0
model_name              0
category                0
screen_size_inches      0
screen                  0
cpu                     0
ram_gb                  0
storage                 0
gpu                     0
os                      0
os_version            170
weight                  0
price_euros             0
gpu_manufacturer        0
cpu_manufacturer        0
dtype: int64


It's clear that we have only one column with null values, os_version, which has 170 missing values.

There are a few options for handling these missing values:

- Remove all rows that contain missing values.
- Remove all columns that contain missing values.
- Fill each missing value with some other value.
- Leave the missing values as they are.

The first two options are often used when preparing data for machine learning algorithms, which are unable to handle data with null values. We can use the [DataFrame.dropna() method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) if we want to remove or drop rows and/or columns with null values.

The DataFrame.dropna() method accepts an axis parameter, which indicates whether we want to drop along the index axis (axis=0) or the column axis (axis=1). Let's look at an example:

![image.png](attachment:image.png)

The default value for the axis parameter is 0, so df.dropna() is equivalent to df.dropna(axis=0):

![image.png](attachment:image.png)

The rows with index labels x and z contain null values, so those rows were dropped. Let's look at what happens when we pass axis=1 to specify the column axis instead:

![image.png](attachment:image.png)

Only the column with label C contains null values, so, in this case, just that one column was removed.

Let's practice using DataFrame.dropna() to remove rows and columns:

In [148]:
laptops_no_null_rows = laptops.dropna()
laptops_no_null_cols = laptops.dropna(axis=1)
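To see both behaviors side by side, here is a sketch on a small hypothetical dataframe. The labels and values are chosen for illustration: rows x and z have a null value, and all nulls sit in column C.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [np.nan, 8, np.nan]},
    index=["x", "y", "z"],
)

# Default axis=0: drop every row that contains at least one null value.
no_null_rows = df.dropna()

# axis=1: drop every column that contains at least one null value.
no_null_cols = df.dropna(axis=1)

print(list(no_null_rows.index))
print(list(no_null_cols.columns))
```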

## Filling Missing Values

On the previous screen, we listed a few ways we can deal with missing values:

- Remove all rows that contain missing values.
- Remove all columns that contain missing values.
- Fill each missing value with some other value.
- Leave the missing values as they are.

While dropping rows or columns is the easiest approach to dealing with missing values, it may not always be the best approach. For example, removing a disproportionate amount of one manufacturer's laptops could impact our analysis.

With this in mind, it's a good idea to explore the missing values in the os_version column before we make a decision. As we've seen, the Series.value_counts() method is a great way to explore all of the unique values in a column. Let's use it again here, but this time we'll use a parameter we haven't seen before:

In [149]:
print(laptops["os_version"].value_counts(dropna=False))

os_version
10      1072
NaN      170
7         45
X          8
10 S       8
Name: count, dtype: int64


Because we set the dropna parameter to False, the result includes null (NaN) values. Analyzing the results, we can see that 10 is the most frequent value in the column, followed by our NaN missing values.

Since it's so closely related to the os_version column, let's also explore the os column. We'll only look at rows where the os_version is missing:

In [150]:
os_with_null_v = laptops.loc[laptops["os_version"].isnull(), "os"]
print(os_with_null_v.value_counts())

os
No OS        66
Linux        62
Chrome OS    27
macOS        13
Android       2
Name: count, dtype: int64


From these results, we can conclude a couple of important things:

- The most frequent value is No OS. This is important to note because if there is no operating system on the laptop, there shouldn't be a version defined in the os_version column.
- Thirteen of the laptops that come with macOS do not specify the version. We can use our knowledge of macOS to confirm that os_version should be equal to X for these rows.

In both of these cases, we can fill in the missing values to make our data more complete. For the rest of the values, it's probably best to leave them as NaN so we don't remove important values.

In [151]:
laptops.loc[laptops["os"] == "macOS", "os_version"] = "X"

For rows with No OS values in the os column, let's replace the missing value in the os_version column with the value Not Applicable.

In [152]:
laptops['os_version'].value_counts()

os_version
10      1072
7         45
X         21
10 S       8
Name: count, dtype: int64

In [153]:
laptops['os'].value_counts()

os
Windows      1125
No OS          66
Linux          62
Chrome OS      27
macOS          13
Mac OS          8
Android         2
Name: count, dtype: int64

In [154]:
laptops['os_version'].value_counts(dropna=False)

os_version
10      1072
NaN      157
7         45
X         21
10 S       8
Name: count, dtype: int64

In [155]:
laptops.loc[laptops['os'] == 'No OS', 'os_version'] = 'Not Applicable'

In [156]:
laptops['os_version'].value_counts(dropna=False)

os_version
10                1072
NaN                 91
Not Applicable      66
7                   45
X                   21
10 S                 8
Name: count, dtype: int64

## Challenge: Clean a String Column

Now it's time to practice what we've learned so far! In this challenge, we'll clean the weight column. Let's look at a sample of the data in that column:

In [157]:
print(laptops["weight"].head())

0    1.37kg
1    1.34kg
2    1.86kg
3    1.83kg
4    1.37kg
Name: weight, dtype: object


Your challenge is to convert the values in this column to numeric values. As a reminder, here's the data cleaning workflow you can use:

![image.png](attachment:image.png)

While it appears that the weight column may just need the kg characters removed from the end of each string, there is one special case: one of the values ends with kgs, so you'll have to remove both kg and kgs characters.

In the last step of this challenge, we'll also ask you to use the [DataFrame.to_csv() method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html) to save the cleaned data to a CSV file. It's a good idea to save your dataframe as a CSV file when you finish cleaning in case you wish to perform your analysis later.

We can use the following syntax to save a dataframe as a CSV file:

df.to_csv('filename.csv', index=False)

By default, pandas will save the index labels as a column in the CSV file. Our dataset has integer labels that don't contain any data, so we don't need to save the index.

In [159]:
laptops['weight_kg'] = laptops['weight'].str.replace('kg', '').str.replace('s', '').astype(float)

laptops.to_csv('laptops_cleaned.csv', index=False)
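Another way to handle both suffixes in one step is a regular expression. The pattern `kgs?$` matches kg or kgs at the end of the string only. The sketch below uses sample values, not the real column, and writes the CSV to an in-memory buffer instead of a file:

```python
import io

import pandas as pd

weight = pd.Series(["1.37kg", "1.34kg", "4kgs"])

# "s?" makes the trailing s optional; "$" anchors the match to the end of the string.
weight_kg = weight.str.replace(r"kgs?$", "", regex=True).astype(float)

# index=False leaves the integer index labels out of the CSV.
buffer = io.StringIO()
pd.DataFrame({"weight_kg": weight_kg}).to_csv(buffer, index=False)

# Reading the text back shows that only the data column was saved.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(list(round_trip.columns))
```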

Our dataset is almost ready for analysis, but there are still some data cleaning tasks left! Here are your next steps:

- Convert the price_euros column to a numeric dtype.
- Extract the screen resolution from the screen column.
- Extract the processor speed from the cpu column.

Here are some questions you might attempt answering by analyzing the cleaned data:

- Are laptops made by Apple more expensive than those made by other manufacturers?
- What is the best value laptop with a screen size of 15" or more?
- Which laptop has the most storage space?