
# MISSION 6: Data Cleaning Basics

In this mission, we will learn techniques for cleaning and preparing a messy data set.

## 1. Reading CSV Files with Encodings

In this mission, we'll learn the basics of data cleaning with pandas as we work with `laptops.csv`, a CSV file containing information about 1,300 laptop computers.

We can start by reading the data into pandas. Let's look at what happens when we use the `pandas.read_csv()` [function](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) with only the filename argument:


```
laptops = pd.read_csv("laptops.csv")
```

We get an error! This error references UTF-8, which is a type of encoding. Computers, at their lowest levels, can only understand binary - `0` and `1` - and encodings are systems for representing characters in binary.
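As a small illustration of what an encoding is, the sketch below encodes one character two ways in plain Python. The character `é` is chosen only for demonstration; it is an assumption, not something taken from `laptops.csv`.

```python
# The same character maps to different bytes under different encodings.
text = "é"

utf8_bytes = text.encode("utf-8")      # two bytes
latin1_bytes = text.encode("latin-1")  # one byte

print(utf8_bytes, latin1_bytes)

# Bytes written as Latin-1 are not necessarily valid UTF-8.
# Decoding them as UTF-8 fails, which is the kind of problem
# behind the error pandas raised above.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(decoded_ok)
```

A reader that assumes UTF-8 cannot interpret the single Latin-1 byte, so it raises an error instead of guessing.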

If a file has an unknown encoding, we can **try the most common encodings**:

* UTF-8
* Latin-1 (also known as ISO-8859-1)
* Windows-1251

The `pandas.read_csv()` function has an `encoding` argument we can use to specify an encoding:



```
df = pd.read_csv("filename.csv", encoding="some_encoding")
```
Since the `pandas.read_csv()` function already tried to read in the file with UTF-8 and failed, we know the file's not encoded with that format. Let's try the next most popular encoding in the exercise.
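The idea of trying encodings in order could be sketched as follows. The helper `read_with_fallback` and the tiny in-memory CSV are hypothetical stand-ins for illustration; they are not part of pandas or of the mission's data.

```python
import io

import pandas as pd

# Hypothetical stand-in for a file with an unknown encoding:
# a tiny CSV whose bytes were written as Latin-1.
raw_bytes = "Manufacturer,Model Name\nAcer,Aspire é\n".encode("latin-1")


def read_with_fallback(data, encodings=("utf-8", "latin-1", "windows-1251")):
    """Return (dataframe, encoding) for the first encoding that decodes."""
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(data), encoding=enc), enc
        except UnicodeDecodeError:
            continue  # this encoding failed, so try the next one
    raise ValueError("none of the candidate encodings worked")


df, used = read_with_fallback(raw_bytes)
print(used)
print(df)
```

One caveat: Latin-1 assigns a character to every possible byte, so it never raises a decoding error. A successful read only shows that the bytes could be decoded, so it is still worth checking that the text looks right.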



First, let's import `laptops.csv` into this Colaboratory Notebook:

In [1]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [2]:
# After completing verification, open the CSV file in Google Drive, right-click it, select "Get shareable link", and copy the unique file ID from the link.
file_id = "1VjDliDrTkzvTvjz6uRnjt7Z72uv6TI6s"

In [3]:
# Download the dataset to the notebook's working directory
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('laptops.csv')

Instructions:


1. Import the pandas library
2. Use the `pandas.read_csv()` function to read the `laptops.csv` file into a dataframe `laptops`.
  * Specify the encoding using the string `"Latin-1"`.
3. Use the `DataFrame.info()` method to display information about the `laptops` dataframe.


In [4]:
import pandas as pd
laptops = pd.read_csv("laptops.csv", encoding="Latin-1")

laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Manufacturer              1303 non-null   object
 1   Model Name                1303 non-null   object
 2   Category                  1303 non-null   object
 3   Screen Size               1303 non-null   object
 4   Screen                    1303 non-null   object
 5   CPU                       1303 non-null   object
 6   RAM                       1303 non-null   object
 7    Storage                  1303 non-null   object
 8   GPU                       1303 non-null   object
 9   Operating System          1303 non-null   object
 10  Operating System Version  1133 non-null   object
 11  Weight                    1303 non-null   object
 12  Price (Euros)             1303 non-null   object
dtypes: object(13)
memory usage: 132.5+ KB


## 2. Cleaning Column Names

From the `DataFrame.info()` output above, we can see that every column has the `object` dtype, indicating that the values are stored as strings, not numbers. Also, one of the columns, `Operating System Version`, has null values.

The column labels have a variety of **upper and lowercase letters**, as well as **spaces and parentheses**, which will make them harder to work with and read. One noticeable issue is that the `" Storage"` column name has a space in front of it. These quirks with column labels can sometimes be hard to spot, so removing extra whitespace from all column names will save us work in the long run.

We can access the column axis of a dataframe using the `DataFrame.columns` attribute. This returns an `Index` object, an immutable array-like pandas structure, with the labels of each column:

In [6]:
print(laptops.columns)

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', ' Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')


Not only can we use the attribute to view the column labels, we can also assign new labels to the attribute:



```
laptops_test = laptops.copy()
laptops_test.columns = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M']

print(laptops_test.columns)
```


```
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'], dtype='object')
```
Next, let's use the `DataFrame.columns` attribute to remove whitespace from the column names.



In [7]:
new_columns = []

for i in laptops.columns:
    stripped = i.strip(" ")
    new_columns.append(stripped)
    
laptops.columns = new_columns

print(laptops.columns)

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', 'Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')
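The loop above can also be written without an explicit loop. The sketch below is an alternative, not the mission's method: it uses the `Index.str.strip()` string method on a hypothetical two-column dataframe that mimics the stray space in `" Storage"`.

```python
import pandas as pd

# Hypothetical miniature of the laptops dataframe, with the same stray space.
df = pd.DataFrame({"RAM": ["8GB"], " Storage": ["128GB SSD"]})

# Index.str.strip() removes leading and trailing whitespace
# from every column label in one call.
df.columns = df.columns.str.strip()

print(list(df.columns))
```

Both approaches give the same labels. The loop makes each step visible, while the vectorized call is shorter.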


## 3. Cleaning Column Names Continued

The column labels still have a variety of upper and lowercase letters, as well as parentheses, which will make them harder to work with and read. Let's finish cleaning our column labels by:

* Replacing spaces with underscores.
* Removing special characters.
* Making all labels lowercase.
* Shortening any long column names.

We can create a function that uses [Python string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) to clean our column labels, and then again use a loop to apply that function to each label. Let's look at an example:



```
def clean_col(col):
    col = col.strip()
    col = col.replace("(","")
    col = col.replace(")","")
    col = col.lower()
    return col

new_columns = []
for c in laptops.columns:
    clean_c = clean_col(c)
    new_columns.append(clean_c)

laptops.columns = new_columns
print(laptops.columns)
```



```
Index(['manufacturer', 'model name', 'category', 'screen size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'operating system',
       'operating system version', 'weight', 'price euros'],
      dtype='object')
```



Our code:

* Defined a function, which:
  * Used the `str.strip()` [method](https://docs.python.org/3.6/library/stdtypes.html#str.strip) to remove whitespace from the start and end of the string.
  * Used the `str.replace()` [method](https://docs.python.org/3.6/library/stdtypes.html#str.replace) to remove parentheses from the string.
  * Used the `str.lower()` [method](https://docs.python.org/3.6/library/stdtypes.html#str.lower) to make the string lowercase.
  * Returned the modified string.
* Used a loop to apply the function to each item in the index object and assign it back to the `DataFrame.columns` attribute.
* Printed the new values for the `DataFrame.columns` attribute.

Let's use this technique to clean the column labels in our dataframe, adding a few extra cleaning 'chores' along the way.


1. Define a function, which accepts a string argument, and:
  * Removes any whitespace from the start and end of the string.
  * Replaces the substring `Operating System` with the abbreviation `os`.
  * Replaces all spaces with underscores.
  * Removes parentheses from the string.
  * Makes the entire string lowercase.
  * Returns the modified string.
2. Use a loop to apply the function to each item in the `DataFrame.columns` attribute for the `laptops` dataframe. Assign the result back to the `DataFrame.columns` attribute.


In [8]:
def clean_col(col):
    col = col.strip()
    col = col.replace("Operating System","os")
    col = col.replace(" ","_")
    col = col.replace("(","")
    col = col.replace(")","")
    col = col.lower()
    return col

columns_clean = []

for i in laptops.columns:
    col = clean_col(i)
    columns_clean.append(col)
    
laptops.columns = columns_clean

print(laptops.columns)

Index(['manufacturer', 'model_name', 'category', 'screen_size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'os', 'os_version', 'weight',
       'price_euros'],
      dtype='object')
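As an optional variation, the same cleaning function can be passed to `DataFrame.rename()`, which accepts a function for its `columns` argument and applies it to every label. The empty dataframe below is a hypothetical sample built from four of the original labels.

```python
import pandas as pd


def clean_col(col):
    """Normalize one column label (same steps as the function above)."""
    col = col.strip()
    col = col.replace("Operating System", "os")
    col = col.replace(" ", "_")
    col = col.replace("(", "")
    col = col.replace(")", "")
    col = col.lower()
    return col


# Hypothetical sample: four of the original labels, no rows.
df = pd.DataFrame(
    columns=["Manufacturer", " Storage", "Operating System Version", "Price (Euros)"]
)

# rename() calls clean_col on each label and returns a new dataframe.
df = df.rename(columns=clean_col)

print(list(df.columns))
```

Note that `rename()` returns a new dataframe rather than modifying the original, which is why the result is assigned back to `df`.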


## 4. Converting String Columns to Numeric

We observed earlier that all 13 columns have the `object` dtype, meaning they're stored as strings. Let's look at the first few rows of some of our columns:

In [9]:
print(laptops.iloc[:5,2:5])

    category screen_size                              screen
0  Ultrabook       13.3"  IPS Panel Retina Display 2560x1600
1  Ultrabook       13.3"                            1440x900
2   Notebook       15.6"                   Full HD 1920x1080
3  Ultrabook       15.4"  IPS Panel Retina Display 2880x1800
4  Ultrabook       13.3"  IPS Panel Retina Display 2560x1600


These three columns hold three different types of text data:

* `category`: Purely text data - there are no numeric values.
* `screen_size`: Numeric data stored as text data because of the `"` character.
* `screen`: A combination of pure text data with numeric data.

Because the values in the `screen_size` column are stored as text data, we can't sort them numerically. For instance, if we wanted to select laptops with screens 15" or larger, we'd be unable to do so.
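To see why text values don't sort numerically, here is a minimal sketch in plain Python. The `9.7"` value is made up for illustration; it doesn't come from the dataset.

```python
# Screen sizes stored as text; the 9.7" entry is hypothetical.
sizes_as_text = ['13.3"', '9.7"', '15.6"']

# Strings are compared character by character, so '9...' sorts after '1...'.
print(sorted(sizes_as_text))

# After removing the quote character and converting to float,
# the values sort by magnitude.
sizes_as_numbers = [float(s.replace('"', "")) for s in sizes_as_text]
print(sorted(sizes_as_numbers))
```

The text sort places `9.7"` last, even though it is the smallest screen.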

Let's convert the `screen_size` column to numeric next. Whenever we convert text to numeric data, we can follow this data cleaning workflow:

![alt text](https://s3.amazonaws.com/dq-content/293/cleaning_workflow.svg)


The first step is to **explore the data**. One of the best ways to do this is to use the `Series.unique()` method to view all of the unique values in the column:

In [10]:
print(laptops["screen_size"].dtype)
print(laptops["screen_size"].unique())

object
['13.3"' '15.6"' '15.4"' '14.0"' '12.0"' '11.6"' '17.3"' '10.1"' '13.5"'
 '12.5"' '13.0"' '18.4"' '13.9"' '12.3"' '17.0"' '15.0"' '14.1"' '11.3"']




Our next step is to **identify patterns and special cases**. We can observe the following:

* All values in this column follow the same pattern - a series of digit and period characters, followed by a quote character (`"`).
* There are no special cases. Every value matches the same pattern.
* We'll need to convert the column to a `float` dtype, as the `int` dtype won't be able to store the decimal values.
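For a column with too many unique values to inspect by eye, the pattern described above could be checked in code. This is a sketch under stated assumptions: the regular expression is one way of writing "digits, a period, digits, then a quote", and the series holds only a few of the values printed earlier.

```python
import pandas as pd

# A few of the unique values shown above.
screen_size = pd.Series(['13.3"', '15.6"', '14.0"', '17.3"'])

# One or more digits, a literal period, one or more digits, a quote character.
pattern = r'\d+\.\d+"'

# fullmatch() returns True only where the whole string fits the pattern.
matches = screen_size.str.fullmatch(pattern)

# Any value that breaks the pattern would appear here as a special case.
special_cases = screen_size[~matches]

print(matches.all())
print(special_cases.tolist())
```

An empty list of special cases supports the observation that every value follows the same pattern.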

Let's identify any patterns and special cases in the `ram` column next.

In [11]:
unique_ram = laptops["ram"].unique()

print(unique_ram)

['8GB' '16GB' '4GB' '2GB' '12GB' '6GB' '32GB' '24GB' '64GB']


Here, we can identify a clear pattern in the `ram` column: all values are integers followed by the characters `GB` at the end of the string.



## 5. Removing Non-Digit Characters

To convert both the `ram` and `screen_size` columns to numeric dtypes, we'll first have to remove the non-digit characters (as depicted in the flow chart above).

The pandas library contains dozens of [vectorized string methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#method-summary) we can use to manipulate text data, many of which perform the same operations as Python string methods. Most vectorized string methods are available using the `Series.str` [accessor](http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling), which means we can access them by adding `str` between the series name and the method name:

![alt text](https://s3.amazonaws.com/dq-content/346/Syntax.png)
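To make the accessor syntax concrete, the sketch below compares calling a Python string method on each value with calling its vectorized counterpart through `Series.str`. The three-value series is a small sample of the `ram` values shown earlier.

```python
import pandas as pd

ram = pd.Series(["8GB", "16GB", "4GB"])

# Plain Python: call the string method on one value at a time.
lowered_loop = [value.lower() for value in ram]

# Vectorized: Series.str applies the same method to every value.
lowered_vec = ram.str.lower()

print(lowered_vec.tolist())
```

Both produce the same strings, but the vectorized version returns a series that can be assigned straight back to a dataframe column.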



In this case, we can use the `Series.str.replace()` method, which is a vectorized version of the Python `str.replace()` method we used in the previous screen, to remove all the quote characters from every string in the `screen_size` column:

In [12]:
laptops["screen_size"] = laptops["screen_size"].str.replace('"','')

print(laptops["screen_size"].unique())

['13.3' '15.6' '15.4' '14.0' '12.0' '11.6' '17.3' '10.1' '13.5' '12.5'
 '13.0' '18.4' '13.9' '12.3' '17.0' '15.0' '14.1' '11.3']


Let's remove the non-digit characters from the `ram` column next.

In [13]:
laptops["ram"] = laptops["ram"].str.replace("GB","")
print(laptops["ram"].unique())

['8' '16' '4' '2' '12' '6' '32' '24' '64']
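One detail worth knowing: `Series.str.replace()` has a `regex` parameter, and its default has differed between pandas versions. For the patterns used above (`"` and `GB`) the result is the same either way. The sketch below passes `regex=False` explicitly so the pattern is always treated as literal text. The period example is only an illustration of the difference, not a step in this mission.

```python
import pandas as pd

ram = pd.Series(["8GB", "16GB", "4GB"])

# regex=False treats "GB" as literal text, whatever the pandas version.
ram_digits = ram.str.replace("GB", "", regex=False)
print(ram_digits.tolist())

# Illustration only: a period is a wildcard in a regular expression.
size = pd.Series(["13.3"])
literal = size.str.replace(".", "", regex=False)  # removes just the period
wildcard = size.str.replace(".", "", regex=True)  # removes every character
print(literal.tolist(), wildcard.tolist())
```

Being explicit about `regex` avoids surprises when the text to remove contains characters such as `.`, `(`, or `+`.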


## 6. Converting Columns to Numeric Dtypes

Now we can **convert (or cast) the columns to a numeric dtype**, which is Step 5 in the flow chart image.

![alt text](https://s3.amazonaws.com/dq-content/293/cleaning_workflow.svg)

To do this, we use the `Series.astype()` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.astype.html). To convert the column to a numeric dtype, we can use either `int` or `float` as the parameter for the method. Since the `int` dtype can't store decimal values, we'll convert the `screen_size` column to the `float` dtype:

In [14]:
laptops["screen_size"] = laptops["screen_size"].astype(float)
print(laptops["screen_size"].dtype)
print(laptops["screen_size"].unique())

float64
[13.3 15.6 15.4 14.  12.  11.6 17.3 10.1 13.5 12.5 13.  18.4 13.9 12.3
 17.  15.  14.1 11.3]


Our `screen_size` column is now the `float64` dtype. Let's convert the dtype of the `ram` column to numeric next.

In [15]:
laptops["ram"] = laptops["ram"].astype(int)
laptops.dtypes

manufacturer     object
model_name       object
category         object
screen_size     float64
screen           object
cpu              object
ram               int64
storage          object
gpu              object
os               object
os_version       object
weight           object
price_euros      object
dtype: object
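The sketch below ties the workflow together on a small sample series (three of the screen sizes from the dataset). It shows why the non-digit characters have to be removed before casting, and that numeric comparisons work once the dtype is `float`.

```python
import pandas as pd

raw = pd.Series(['13.3"', '15.6"', '17.3"'])

# Casting before removing the quote character fails with a ValueError.
try:
    raw.astype(float)
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)

# Remove the quote character, then cast to float.
cleaned = raw.str.replace('"', "", regex=False).astype(float)

# With a numeric dtype, we can select screens that are 15" or larger.
large = cleaned[cleaned >= 15]
print(large.tolist())
```

This is the selection that wasn't possible while `screen_size` was stored as text.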

## 7. Renaming Columns

## 8. Extracting Values from Strings

## 9. Correcting Bad Values

## 10. Dropping Missing Values

## 11. Filling Missing Values

## 12. Challenge: Clean a String Column