# Learning Notebook - Part 2 of 3 - Common Problems and Solutions

## 1. Introduction

Sometimes the data we receive to work with has a few problems.
These can be badly formatted lines, strange characters in strings, values stored with the wrong data type...

**Data cleaning** is the name we usually give to the process of manually inspecting raw data and fixing the issues we find.
Clean data can then be used in a data pipeline to train some cool machine learning models :)

So, let's clean some data!

<img src="./media/henry.jpeg" width="400" align='left'>

In [1]:
# Some imports
import csv
import os
import pandas as pd

In [2]:
# Some helper functions to get the data files path
def pokemons_filepath(filename):
    return os.path.join('data', 'pokemons', filename)


def sharks_filepath(filename):
    return os.path.join('data', 'sharks', filename)

## 2. Bad lines

When a file has one or more lines with more columns than the rest of the file, `read_csv` cannot build a DataFrame from the data and raises a `ParserError`.

In [3]:
# sometimes, there are bad lines in a csv file
try:
    pd.read_csv(pokemons_filepath('pokemons_bad_lines.csv'))
except Exception as e:
    print(e)

Error tokenizing data. C error: Expected 12 fields in line 3, saw 16



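One possible way of solving this problem is simply telling `read_csv` not to raise an error, which makes it skip these bad lines. For this, we use the argument **error_bad_lines**. A warning is still shown, letting us know which lines were skipped. (This argument was deprecated in pandas 1.3 in favour of `on_bad_lines='skip'`, or `on_bad_lines='warn'` to keep the warning, and it was removed in pandas 2.0.)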

*Attention: the warning counts lines starting from 1. So when it says that line 3 was skipped, it means the line with index 2 in the file.*

In [4]:
# error_bad_lines=False ignores bad lines and builds the df by dropping the bad lines
pd.read_csv(pokemons_filepath('pokemons_bad_lines.csv'), error_bad_lines=False)

b'Skipping line 3: expected 12 fields, saw 16\nSkipping line 6: expected 12 fields, saw 13\n'


Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
2,4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
3,6,Charmeleon,Fire,,58,64,58,80,65,80,1,False
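
A minimal sketch of the same idea using `on_bad_lines`, assuming pandas 1.3 or later. The inline CSV is made-up data standing in for `pokemons_bad_lines.csv`, so that the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical inline data: the third line has three extra (empty) fields
raw = (
    "#,Name,Type 1\n"
    "1,Bulbasaur,Grass\n"
    "2,Ivysaur,Grass,,,\n"
    "3,Venusaur,Grass\n"
)

# 'skip' silently drops lines with too many fields; 'warn' also reports them
clean = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(clean["Name"].tolist())
```

Only the two well-formed rows survive, which is the same data loss we saw above.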


But what if we're not interested in losing data? We may inspect those lines to see what's actually causing the problem!

So, first, we open the file and create a file object that we can work with. For this we use Python's built-in function **open**.

Then, we call method **readlines** on the file object in order to store each line as an element in a list.

Now we can inspect the lines where we know there's something wrong.

In the end, we should close the file object, using its **close** method.

[docs](https://docs.python.org/3.3/tutorial/inputoutput.html#methods-of-file-objects)

In [5]:
# create a file object f
f = open(pokemons_filepath('pokemons_bad_lines.csv'), 'r')

# create a list of lines
lines_list = f.readlines()

# print the content of specific lines
print(lines_list[2])
print(lines_list[5])

# closing the file
f.close()

2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False,,,,

5,Charmander,Fire,,39,52,43,60,50,65,1,False,extra column
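
As a side note, Python's `with` statement closes the file for us, even if an error happens while reading. The sketch below is only an illustration: it writes a tiny made-up file to a temporary folder first, so it doesn't depend on the notebook's data.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical two-line file, where the second line has an extra column
    path = os.path.join(tmp, "example.csv")
    with open(path, "w") as f:
        f.write("#,Name\n1,Bulbasaur,extra\n")

    # The file object is closed automatically when the block ends
    with open(path, "r") as f:
        lines_list = f.readlines()

print(f.closed)
print(lines_list[1])
```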



It turns out that the lines with extra columns have no problem other than the extra elements at the end.
So we can try to do better than just dropping the lines.

For this, we can use Python's **csv** module: it has the **reader** function, which goes through a csv file and returns each line as a list of strings. Wrapping it in `list` gives us a list of lists.

Then, we just have to make sure that each inner list has the same number of elements.

[docs](https://docs.python.org/3/library/csv.html)

In [6]:
# we can even deal with the extra columns, if they happen at the end of the line
f = open(pokemons_filepath('pokemons_bad_lines.csv'), 'r')

# read the csv into a list of lists
csv_list = list(csv.reader(f))

# these two lists correspond to the first two lines of the file
csv_list[:2]

[['#',
  'Name',
  'Type 1',
  'Type 2',
  'HP',
  'Attack',
  'Defense',
  'Sp. Atk',
  'Sp. Def',
  'Speed',
  'Generation',
  'Legendary'],
 ['1',
  'Bulbasaur',
  'Grass',
  'Poison',
  '45',
  '49',
  '49',
  '65',
  '65',
  '45',
  '1',
  'False']]

In [7]:
# we want each line to have at most 12 elements
n_elements = 12

# each inner list can only have up to n_elements
csv_list = [i[:n_elements] for i in csv_list]
f.close()

# we finally create a DataFrame, using the first list as the column names, and the other lists as data
pd.DataFrame(csv_list[1:], columns=csv_list[0])

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,5,Charmander,Fire,,39,52,43,60,50,65,1,False
5,6,Charmeleon,Fire,,58,64,58,80,65,80,1,False
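
Slicing only handles lines that are too long. If a file could also contain lines that are too short, one option is to pad them as well. The helper below, `normalize_rows`, is a hypothetical name used for illustration, and the inline data is made up.

```python
import csv
import io


def normalize_rows(rows, n_elements, fill=""):
    """Truncate long rows and pad short ones so each has n_elements items."""
    return [row[:n_elements] + [fill] * (n_elements - len(row)) for row in rows]


# Hypothetical data: one line with extra fields, one line with a missing field
raw = "#,Name,Type 1\n1,Bulbasaur,Grass\n2,Ivysaur,Grass,,,\n3,Venusaur\n"

rows = list(csv.reader(io.StringIO(raw)))
fixed = normalize_rows(rows, n_elements=3)
print(fixed)
```

Padding with an empty string is just one choice; whether a short line should be padded or inspected by hand depends on the data.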


## 3. Pandas for data cleaning

<img src="./media/pandas_cleaning.jpeg" width="500" align='left'>

Let's now see an example of how to use pandas to clean data in a DataFrame.
Some of these methods were already shown to you, but here you'll get a more extended example of how to use them.

We'll start by previewing our data.

In [8]:
! head -5 data/sharks/endangered_sharks.csv

Id,Scientific name,Common name,Population trend,IUCN status
#027,Haploblepharus fuscus,   Brown/shyshark     ,unknown,Vulnerable[96]
#041,Mustelus fasciatus,  Striped/smooth-hound    ,decreasing,Critically endangered[128]
#023,Glyphis gangeticus,     Ganges/shark   ,decreasing,Critically endangered [88]
#011,Carcharhinus plumbeus,   Sandbar/shark  ,decreasing,Vulnerable[54]


And then read it into a pandas DataFrame.

In [9]:
sharks = pd.read_csv(sharks_filepath('endangered_sharks.csv'))
sharks.head()

Unnamed: 0,Id,Scientific name,Common name,Population trend,IUCN status
0,#027,Haploblepharus fuscus,Brown/shyshark,unknown,Vulnerable[96]
1,#041,Mustelus fasciatus,Striped/smooth-hound,decreasing,Critically endangered[128]
2,#023,Glyphis gangeticus,Ganges/shark,decreasing,Critically endangered [88]
3,#011,Carcharhinus plumbeus,Sandbar/shark,decreasing,Vulnerable[54]
4,#025,Glyphis glyphis,Speartooth/shark,decreasing,Endangered[92]


Fortunately, the reading part went ok. But the need for some cleaning is very obvious!


### String methods

[docs](https://pandas.pydata.org/pandas-docs/stable/text.html)

The first tool we'll use is Python's string methods, applied to pandas DataFrames.
We can apply the same string methods that we know from Python to all the elements in a pandas Series, by calling them with a `.str` before the method name.

Let's see some examples.

In the first example, we'll tackle column 'Common name'. We can see that the words in the sharks' names are separated by slashes ('/') instead of blank spaces (' ').

In order to replace the slashes with blank spaces, we can use method **replace**.
Notice that we call the method as `.str.replace`!

In [10]:
# replacing the slashes '/' with blank spaces ' ' in column 'Common name'
sharks['Common name'] = sharks['Common name'].str.replace('/', ' ')
sharks.head()

Unnamed: 0,Id,Scientific name,Common name,Population trend,IUCN status
0,#027,Haploblepharus fuscus,Brown shyshark,unknown,Vulnerable[96]
1,#041,Mustelus fasciatus,Striped smooth-hound,decreasing,Critically endangered[128]
2,#023,Glyphis gangeticus,Ganges shark,decreasing,Critically endangered [88]
3,#011,Carcharhinus plumbeus,Sandbar shark,decreasing,Vulnerable[54]
4,#025,Glyphis glyphis,Speartooth shark,decreasing,Endangered[92]
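
One thing to keep in mind is that `.str.replace` can treat its pattern either as plain text or as a regular expression, and the default changed in pandas 2.0 (plain text from then on). The sketch below uses made-up names and passes `regex=False` explicitly, so it behaves the same way on older and newer versions.

```python
import pandas as pd

# Hypothetical values; the last one contains a '.', which is special in a regex
names = pd.Series(["Brown/shyshark", "Sandbar/shark", "Great.white"])

# regex=False treats the pattern as plain text
with_spaces = names.str.replace("/", " ", regex=False)
no_dots = with_spaces.str.replace(".", " ", regex=False)
print(no_dots.tolist())
```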


When we previewed the file, we could see that the values in the 'Common name' column had some blank spaces at the beginning and at the end. We can see that again by selecting the first value of the column.

In [11]:
sharks.loc[0, 'Common name']

'   Brown shyshark     '

In order to remove these blank spaces, we'll use method **strip**. Remember that method strip only removes blank spaces at the beginning and at the end of the strings. All the blank spaces in the middle of the string will be kept, which is exactly what we want!

In [12]:
# remove blank spaces in column Common name
sharks['Common name'] = sharks['Common name'].str.strip()
sharks.loc[0, 'Common name']

'Brown shyshark'

Method strip removes blank spaces by default. But we can also ask it to remove another character or set of characters.

For instance, by calling method strip with ' -', we're removing both blank spaces ' ' and dashes '-'.

In [13]:
s = pd.Series(['---1--', '-  2  ', '  3   '])
s.str.strip(' -')

0    1
1    2
2    3
dtype: object
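
If we only want to clean one side of the strings, there are also methods `lstrip` (left side) and `rstrip` (right side). A small sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["  Ganges shark  ", "--Sandbar shark--"])

left = s.str.lstrip(" -")   # only the beginning of each string
right = s.str.rstrip(" -")  # only the end of each string
both = s.str.strip(" -")    # both sides

print(left.tolist())
print(right.tolist())
print(both.tolist())
```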

Then, in column 'IUCN status', we can see that all the strings end with [some_number]. We want to remove these from the strings. For this, we'll use methods **split** and **get**.

As its name suggests, method split splits the strings, according to some separator, into a list of parts.
For instance, by calling split('[') for the first value of the column, we get the list ['Vulnerable', '96]'].

Then, method get allows us to retrieve one element from each of the lists, by its position.

In [14]:
# remove [some_number] in column IUCN status
sharks['IUCN status'] = sharks['IUCN status'].str.split('[').str.get(0)
sharks.head()

Unnamed: 0,Id,Scientific name,Common name,Population trend,IUCN status
0,#027,Haploblepharus fuscus,Brown shyshark,unknown,Vulnerable
1,#041,Mustelus fasciatus,Striped smooth-hound,decreasing,Critically endangered
2,#023,Glyphis gangeticus,Ganges shark,decreasing,Critically endangered
3,#011,Carcharhinus plumbeus,Sandbar shark,decreasing,Vulnerable
4,#025,Glyphis glyphis,Speartooth shark,decreasing,Endangered
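
Notice that one of the values in the file was written as 'Critically endangered [88]', with a blank space before the bracket. Splitting on '[' leaves that space at the end of the string, so it may be worth chaining a `strip` after `get`. A small sketch with two values copied from the preview above:

```python
import pandas as pd

status = pd.Series(["Vulnerable[96]", "Critically endangered [88]"])

# split gives a list of parts for each string
parts = status.str.split("[")
print(parts.tolist())

# get(0) keeps the first part; strip removes the leftover trailing space
first = parts.str.get(0)
cleaned = first.str.strip()
print(cleaned.tolist())
```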


The next example separates column 'Scientific name' into two new columns: 'Genus' and 'Species', where the genus is the first word from the scientific name, and the species is the second word.

We'll use the split method again. But this time, we have to give it two extra arguments. The first is `n`, the maximum number of splits to perform: with `n=1`, each string is split only at the first blank space, giving at most two parts. The second is `expand=True`, which returns the parts as the columns of a DataFrame instead of a Series of lists. This way, we can assign the two parts to two new columns.

In [15]:
# split the Scientific name in two columns: genus and species
sharks[['Genus', 'Species']] = sharks['Scientific name'].str.split(' ', n=1, expand=True)
sharks.head()

  


Unnamed: 0,Id,Scientific name,Common name,Population trend,IUCN status,Genus,Species
0,#027,Haploblepharus fuscus,Brown shyshark,unknown,Vulnerable,Haploblepharus,fuscus
1,#041,Mustelus fasciatus,Striped smooth-hound,decreasing,Critically endangered,Mustelus,fasciatus
2,#023,Glyphis gangeticus,Ganges shark,decreasing,Critically endangered,Glyphis,gangeticus
3,#011,Carcharhinus plumbeus,Sandbar shark,decreasing,Vulnerable,Carcharhinus,plumbeus
4,#025,Glyphis glyphis,Speartooth shark,decreasing,Endangered,Glyphis,glyphis
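
To see what limiting the number of splits does, here is a small sketch with a made-up three-word value ('Genus species subspecies' is a placeholder, not a real name).

```python
import pandas as pd

names = pd.Series(["Glyphis glyphis", "Genus species subspecies"])

# n=1: split at the first blank space only, so there are at most two parts
split_once = names.str.split(" ", n=1, expand=True)
split_once.columns = ["Genus", "Species"]
print(split_once)

# without n, every blank space splits, and shorter rows get a missing value
split_all = names.str.split(" ", expand=True)
print(split_all.shape)
```

With `n=1`, everything after the first blank space stays together in the second column.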


Now we've got a genus and a species for each shark. But there is an inconsistency in the way we're presenting the genus, which is capitalized, and the species, which is not capitalized (it's starting with lower case).

In order to fix this, we can use method **capitalize**, which makes the first character in a string uppercase and the remaining characters lowercase.

In [16]:
# capitalize the strings in column 'Species'
sharks['Species'] = sharks['Species'].str.capitalize()
sharks.head()

Unnamed: 0,Id,Scientific name,Common name,Population trend,IUCN status,Genus,Species
0,#027,Haploblepharus fuscus,Brown shyshark,unknown,Vulnerable,Haploblepharus,Fuscus
1,#041,Mustelus fasciatus,Striped smooth-hound,decreasing,Critically endangered,Mustelus,Fasciatus
2,#023,Glyphis gangeticus,Ganges shark,decreasing,Critically endangered,Glyphis,Gangeticus
3,#011,Carcharhinus plumbeus,Sandbar shark,decreasing,Vulnerable,Carcharhinus,Plumbeus
4,#025,Glyphis glyphis,Speartooth shark,decreasing,Endangered,Glyphis,Glyphis
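
A quick sketch, with made-up values, comparing `capitalize` with the related method `title`, which capitalizes every word:

```python
import pandas as pd

s = pd.Series(["fuscus", "gLYPHIS", "great white"])

capitalized = s.str.capitalize()  # first character upper, the rest lower
titled = s.str.title()            # first character of every word upper

print(capitalized.tolist())
print(titled.tolist())
```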


The last example with string manipulation is in column 'Id', where we want to remove the number signs '#'.

In [17]:
# remove the '#' from column 'Id'
sharks['Id'] = sharks['Id'].str.lstrip('#')
sharks.head()

Unnamed: 0,Id,Scientific name,Common name,Population trend,IUCN status,Genus,Species
0,27,Haploblepharus fuscus,Brown shyshark,unknown,Vulnerable,Haploblepharus,Fuscus
1,41,Mustelus fasciatus,Striped smooth-hound,decreasing,Critically endangered,Mustelus,Fasciatus
2,23,Glyphis gangeticus,Ganges shark,decreasing,Critically endangered,Glyphis,Gangeticus
3,11,Carcharhinus plumbeus,Sandbar shark,decreasing,Vulnerable,Carcharhinus,Plumbeus
4,25,Glyphis glyphis,Speartooth shark,decreasing,Endangered,Glyphis,Glyphis


### Set data types

Now that column 'Id' lost the '#', we can turn it into a numeric column.

Method **astype** can be used to cast the values of the column to int data type.

In [18]:
sharks['Id'] = sharks['Id'].astype(int)
sharks.head()

Unnamed: 0,Id,Scientific name,Common name,Population trend,IUCN status,Genus,Species
0,27,Haploblepharus fuscus,Brown shyshark,unknown,Vulnerable,Haploblepharus,Fuscus
1,41,Mustelus fasciatus,Striped smooth-hound,decreasing,Critically endangered,Mustelus,Fasciatus
2,23,Glyphis gangeticus,Ganges shark,decreasing,Critically endangered,Glyphis,Gangeticus
3,11,Carcharhinus plumbeus,Sandbar shark,decreasing,Vulnerable,Carcharhinus,Plumbeus
4,25,Glyphis glyphis,Speartooth shark,decreasing,Endangered,Glyphis,Glyphis
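
Method astype raises an error if any value can't be converted. When a column may still contain some dirty values, one alternative is `pd.to_numeric` with `errors='coerce'`, which turns them into missing values instead. A sketch with made-up ids (the 'n/a' is an assumption for illustration, it's not in our file):

```python
import pandas as pd

ids = pd.Series(["#027", "#041", "n/a"])
stripped = ids.str.lstrip("#")

# values that can't be parsed become NaN, so the result is a float column
numeric = pd.to_numeric(stripped, errors="coerce")
print(numeric.tolist())

# astype(int) works when every value is a valid integer
clean_ints = stripped[:2].astype(int)
print(clean_ints.tolist())
```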


### Replace values & Rename columns

Our data is starting to look much better! Now let's see an example of the **replace** method.
We'll try to make the information in column 'Population trend' more legible.

The first thing to do is to replace the cells with value 'unknown' with the character '?', and the cells with value 'decreasing' with the value True. Then, we'll change the column name to 'Decreasing population'.

In [19]:
# besides "decreasing", are there any other labels, e.g. "increasing"?
sharks['Population trend'].unique()

array(['unknown', 'decreasing'], dtype=object)

In [20]:
# replacing values in column 'Population trend'
sharks['Population trend'] = sharks['Population trend'].replace({'unknown': '?', 'decreasing': True})
sharks.head()

Unnamed: 0,Id,Scientific name,Common name,Population trend,IUCN status,Genus,Species
0,27,Haploblepharus fuscus,Brown shyshark,?,Vulnerable,Haploblepharus,Fuscus
1,41,Mustelus fasciatus,Striped smooth-hound,True,Critically endangered,Mustelus,Fasciatus
2,23,Glyphis gangeticus,Ganges shark,True,Critically endangered,Glyphis,Gangeticus
3,11,Carcharhinus plumbeus,Sandbar shark,True,Vulnerable,Carcharhinus,Plumbeus
4,25,Glyphis glyphis,Speartooth shark,True,Endangered,Glyphis,Glyphis
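
Here is the same replacement on a tiny made-up Series. Since the result mixes a string ('?') with booleans (True), the column keeps the generic object data type.

```python
import pandas as pd

trend = pd.Series(["unknown", "decreasing", "decreasing"], dtype=object)

# keys are the values to look for, values are what to put in their place
replaced = trend.replace({"unknown": "?", "decreasing": True})

print(replaced.tolist())
print(replaced.dtype)
```

Note that replace returns a new Series, which is why the notebook assigns the result back to the column.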


In [21]:
# Renaming column 'Population trend' to 'Decreasing population'
sharks = sharks.rename(columns={'Population trend': 'Decreasing population'})
sharks.head()

Unnamed: 0,Id,Scientific name,Common name,Decreasing population,IUCN status,Genus,Species
0,27,Haploblepharus fuscus,Brown shyshark,?,Vulnerable,Haploblepharus,Fuscus
1,41,Mustelus fasciatus,Striped smooth-hound,True,Critically endangered,Mustelus,Fasciatus
2,23,Glyphis gangeticus,Ganges shark,True,Critically endangered,Glyphis,Gangeticus
3,11,Carcharhinus plumbeus,Sandbar shark,True,Vulnerable,Carcharhinus,Plumbeus
4,25,Glyphis glyphis,Speartooth shark,True,Endangered,Glyphis,Glyphis
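
A detail worth knowing about rename: by default, it silently ignores names that don't exist in the DataFrame, so a typo goes unnoticed. Passing `errors='raise'` makes it complain instead. A small sketch with a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Id": [27], "Population trend": [True]})

# rename returns a new DataFrame; df itself is not changed
renamed = df.rename(columns={"Population trend": "Decreasing population"})
print(list(renamed.columns))

# with errors='raise', a misspelled column name raises a KeyError
try:
    df.rename(columns={"Population trnd": "Decreasing population"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True
print(typo_caught)
```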


And finally our data seems much better, right?
The only thing left is to sort the values using the 'Id' column, but I'll leave that one to you!