<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#ERRORS" data-toc-modified-id="ERRORS-0.0.1"><span class="toc-item-num">0.0.1&nbsp;&nbsp;</span>ERRORS</a></span></li></ul></li></ul></li><li><span><a href="#Cleaning-Column-Headers" data-toc-modified-id="Cleaning-Column-Headers-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Cleaning Column Headers</a></span></li><li><span><a href="#Cleaning-Row-Values" data-toc-modified-id="Cleaning-Row-Values-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Cleaning Row Values</a></span><ul class="toc-item"><li><span><a href="#Text-to-Numeric-Data" data-toc-modified-id="Text-to-Numeric-Data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Text to Numeric Data</a></span></li><li><span><a href="#Null-Values" data-toc-modified-id="Null-Values-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Null Values</a></span></li></ul></li></ul></div>

#### ERRORS

`UnicodeDecodeError                        Traceback (most recent call last)
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 4: invalid continuation byte`

If a file has an unknown encoding, we can try the most common encodings in turn:

- UTF-8

- Latin-1 (also known as ISO-8859-1)

- Windows-1251

The `pandas.read_csv()` function has an `encoding` argument we can use to specify an encoding:

`df = pd.read_csv("filename.csv", encoding="some_encoding")`

Since `pandas.read_csv()` already tried to read the file as UTF-8 and failed, we know the file isn't encoded in that format. Let's try the next most common encoding on the list.
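
The trial-and-error approach described above can be sketched as a loop. This is a minimal illustration, not a definitive recipe: the file name, its contents, and the variable names are made up so that the example is self-contained. Note that Latin-1 can decode any byte sequence, so a successful read does not prove the guess was right; it is still worth inspecting the result.

```python
import os
import tempfile

import pandas as pd

# Build a small Latin-1 encoded CSV so the example is self-contained.
# The file name, columns, and values are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "w", encoding="latin-1") as f:
    f.write("name,city\nRené,Orléans\n")

df = None
used_encoding = None
for enc in ["utf-8", "latin-1", "windows-1251"]:
    try:
        df = pd.read_csv(path, encoding=enc)
        used_encoding = enc
        break  # stop at the first encoding that decodes without error
    except UnicodeDecodeError:
        continue  # this encoding failed, try the next one

print(used_encoding)
print(df)
```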

## Cleaning Column Headers
We can access the column axis of a dataframe using the `DataFrame.columns` attribute. This returns an `Index` object, an array-like structure similar to a NumPy ndarray, containing the label of each column.

Not only can we use the attribute to view the column labels, we can also assign new labels to the attribute:

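`df = df.copy()`

`df.columns = ['A', 'B', 'C', 'D', 'E',
               'F', 'G', 'H', 'I', 'J',
               'K', 'L', 'M']`

Rather than replacing the labels wholesale, we can also clean the existing ones. For example, this loop strips leading and trailing whitespace from each label:

`new_columns = []
for c in df.columns:
    new_columns.append(c.strip())
df.columns = new_columns`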
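
The stripping loop above can also be written without an explicit loop, since the column `Index` exposes the same vectorized `.str` accessor as a series. A minimal sketch, assuming a small dataframe with made-up labels:

```python
import pandas as pd

# Hypothetical labels with stray leading and trailing whitespace.
df = pd.DataFrame([[1, 2]], columns=[" Price ", "Weight "])

# Index.str applies the string method to every label at once.
df.columns = df.columns.str.strip()
print(list(df.columns))
```

Both forms produce the same labels; the loop is easier to extend with custom logic, while the accessor is more compact.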

However, the column labels still contain a mix of uppercase and lowercase letters, as well as parentheses, which makes them harder to read and work with. Let's finish cleaning our column labels by:

- Replacing spaces with underscores.

- Removing special characters.

- Making all labels lowercase.

- Shortening any long column names.

`def clean_col(col):
    col = col.strip()
    col = col.replace("(","")
    col = col.replace(")","")
    col = col.lower()
    return col`

`new_columns = []
for c in df.columns:
    clean_c = clean_col(c)
    new_columns.append(clean_c)`

`df.columns = new_columns
print(df.columns)`

In the code above, we defined a function, `clean_col()`, which:

- Uses the `str.strip()` method to remove whitespace from the start and end of the string.

- Uses the `str.replace()` method to remove parentheses from the string.

- Uses the `str.lower()` method to make the string lowercase.

- Returns the modified string.
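
The function above handles whitespace, parentheses, and case, but not the other two goals listed earlier: replacing spaces with underscores and shortening long names. One way to add them is sketched below. The column labels and the abbreviation of "Operating System" to "os" are hypothetical and chosen only for illustration.

```python
import pandas as pd

def clean_col(col):
    col = col.strip()
    # Shorten a long label first (hypothetical abbreviation).
    col = col.replace("Operating System", "os")
    # Replace the remaining spaces with underscores.
    col = col.replace(" ", "_")
    # Remove parentheses.
    col = col.replace("(", "")
    col = col.replace(")", "")
    col = col.lower()
    return col

# Made-up labels for demonstration.
df = pd.DataFrame(columns=["Manufacturer", " Operating System Version", "Price (Euros)"])
df.columns = [clean_col(c) for c in df.columns]
print(list(df.columns))
```

The order matters here: the abbreviation is applied before lowercasing and before spaces become underscores, so the search string still matches the original label.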

## Cleaning Row Values
### Text to Numeric Data

**Process**

Explore Data --> Identify Patterns/Special Cases --> Remove non-digit chars --> Convert to Numeric --> Rename (if required)

*Explore Data* - view all of the unique values with `Series.unique()`

*Identify Patterns/Special Cases* - Are the values consistent? Special cases? 

*Remove non-digit chars* - before converting to a numeric value, we must remove any non-numeric characters:
- `df.col.str.replace('char_to_replace', 'new_character')`

*Convert to Numeric (CAST)* - convert column values to a numeric dtype:
- `df.col.astype(float)` (accepted dtypes include `float`, `int`, `'Int64'`, `'object'`, and `'category'`)

*Rename (if required)* - now, rename the column to reflect the new values
- `df.rename({'col': 'new_col'}, axis=1, inplace=True)`
    - `axis=1` ---> rename labels in the column axis
    - `inplace=True` ---> modify the dataframe in place instead of returning a new one
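
The whole process can be sketched end to end on a small example. The `ram` column, its values, and the new name `ram_gb` are hypothetical; the point is only to show the steps in order.

```python
import pandas as pd

# Hypothetical text column that should be numeric.
df = pd.DataFrame({"ram": ["8GB", "16GB", "4GB", "8GB"]})

# 1. Explore: look at the unique values to spot patterns.
print(df["ram"].unique())

# 2. Remove non-digit characters (regex=False treats "GB" as a literal string).
df["ram"] = df["ram"].str.replace("GB", "", regex=False)

# 3. Convert to a numeric dtype.
df["ram"] = df["ram"].astype(int)

# 4. Rename the column so the unit is not lost.
df = df.rename({"ram": "ram_gb"}, axis=1)

print(df.dtypes)
```

Here the result of `rename()` is assigned back to `df` rather than using `inplace=True`; both approaches leave `df` with the renamed column.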

The pandas library contains dozens of vectorized string methods we can use to manipulate text data, many of which perform the same operations as Python string methods. Most vectorized string methods are available through the `Series.str` accessor, which means we can access them by adding `str` between the series name and the method name.

For example, `df.col.str.split()` splits each string on whitespace; the result is a series in which each element is a Python list. When we chain several methods, wrapping the whole expression in parentheses lets us spread the chain over multiple lines, which makes the code easier to read.
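
A short sketch of both ideas, splitting on whitespace and chaining methods inside parentheses, is shown below. The series values and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical text values.
s = pd.Series(["Intel Core i5", "AMD Ryzen 7"])

# Each element becomes a Python list of words.
split_lists = s.str.split()

# Parentheses let the method chain span several lines.
first_words = (s
               .str.split()
               .str[0]      # take the first item of each list
              )

print(split_lists)
print(first_words)
```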

If your data has been scraped from a webpage or if there was manual data entry involved at some point, you may end up with inconsistent values.

One way we can fix this is with the `Series.map()` method. The `Series.map()` method is ideal when we want to change multiple values in a column, but we'll use it now as an opportunity to learn how the method works.

The most common way to use `Series.map()` is with a dictionary. Let's look at an example using a series of misspelled fruit:


`0 pair
1 oranje
2 banannna
`

`corrections = {
    "pair": "pear",
    "oranje": "orange",
    "banannna": "banana"
}`

`s = s.map(corrections)`

`0 pear
1 orange
2 banana
`

We can see that each of our corrections was made across the series. One important thing to remember about `Series.map()` is that if a value in the series doesn't exist as a key in the dictionary, that value is converted to `NaN`. Let's see what happens when we run `map` one more time:

`s = s.map(corrections)`

`0 NaN
1 NaN
2 NaN
`


Applied to a dataframe column, the general pattern is:

`df.col = df.col.map(mapping_dict)`
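
Because unmatched values become `NaN`, it can help to keep the original value whenever the dictionary has no matching key. Two possible ways to do that are sketched below; the extra `"apple"` entry is added here only to show an unmatched value, and neither approach is presented as the required one.

```python
import pandas as pd

s = pd.Series(["pair", "oranje", "banannna", "apple"])
corrections = {
    "pair": "pear",
    "oranje": "orange",
    "banannna": "banana",
}

# "apple" is not a key in the dictionary, so map() turns it into NaN.
mapped = s.map(corrections)

# Option 1: fall back to the original value where map() produced NaN.
kept = s.map(corrections).fillna(s)

# Option 2: Series.replace() leaves unmatched values untouched.
replaced = s.replace(corrections)

print(mapped)
print(kept)
print(replaced)
```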

### Null Values
In previous missions, we've talked briefly about missing values and how both NumPy and pandas represent these as null values. In pandas, null values will be indicated by either `NaN` or `None`.

Recall that we can use the `DataFrame.isnull()` method to identify missing values, which returns a boolean dataframe. We can then use the `DataFrame.sum()` method to count the `True` values in each column:

`print(df.isnull().sum())`
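
As a small illustration of the counting step above, the sketch below builds a made-up dataframe that contains both `NaN` and `None` and counts the nulls per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two columns.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, None],
    "c": [1, 2, 3],
})

# isnull() gives True/False per cell; sum() counts the True values per column.
null_counts = df.isnull().sum()
print(null_counts)
```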

There are a few options for handling missing values:

- Remove any rows that have missing values.
- Remove any columns that have missing values.
- Fill the missing values with some other value.
- Leave the missing values as is.

The first two options are often used to prepare data for machine learning algorithms, many of which cannot handle null values. We can use the `DataFrame.dropna()` method to remove, or drop, rows and columns with null values.

The `DataFrame.dropna()` method accepts an `axis` parameter, which indicates whether we want to drop along the column or index axis. Let's look at an example:

- `df.dropna()` or `df.dropna(axis=0)` ---> drops rows that contain null values
- `df.dropna(axis=1)` ---> drops columns that contain null values
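
The effect of the `axis` parameter is easier to see on a small example. The dataframe below is made up for illustration; `dropna()` returns a new dataframe and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical data: nulls in row 1 (column "a") and row 2 (column "c").
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4, 5, 6],
    "c": [7.0, 8.0, np.nan],
})

rows_dropped = df.dropna()        # keeps only rows with no nulls
cols_dropped = df.dropna(axis=1)  # keeps only columns with no nulls

print(rows_dropped)
print(cols_dropped)
```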

- `print(df[col].value_counts(dropna=False))` ---> includes null values in the counts
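
The sketch below shows the difference `dropna=False` makes, and also illustrates the third option from the list above, filling missing values, using `Series.fillna()`. The series and the `"unknown"` fill value are hypothetical.

```python
import pandas as pd

# Hypothetical column with two missing values.
s = pd.Series(["x", "y", None, "x", None])

default_counts = s.value_counts()               # nulls are left out
full_counts = s.value_counts(dropna=False)      # nulls get their own row

# Fill the missing values with a placeholder instead of dropping them.
filled = s.fillna("unknown")

print(default_counts)
print(full_counts)
print(filled)
```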