# Learning Notebook - Part 1 of 3 - Dealing with csv's... and beyond

Welcome to data wrangling!

Up to this point in the academy, you have already handled several datasets. Some were big, some were small. Some were very clean, some were a little messier (maybe with some missing values).

But in all cases they were pretty easy to handle (thanks, pd.read_csv!) and conveniently accessible.

Well, in real life, that rarely happens. When dealing with a data science problem, you'll find that the data you need is probably scattered across multiple sources and stored in several different formats, and there are always additional challenges!


![title](./media/files_everywhere.png)


But worry not, that's what we're here for :)
In this specialization you'll learn many tools that will turn you into a professional data wrangler.

**Let's start with some handy Jupyter notebook tools.**

## 1. ! (system shell access)

In a Jupyter notebook, any statement that starts with an exclamation mark (!) is sent to the underlying operating system's shell.

In practice, this means that you can run shell commands in the notebooks, in the same way as you do in your computer terminal.

**Disclaimer**: the goal of this is to show that we can run shell commands from the notebooks, and we'll use Unix commands to demonstrate it. In most of the examples, we tried to include a Windows equivalent. If by any chance some command doesn't work on your machine, try to figure out for yourself what works on your OS, because this will probably be handy in the future!

Let's see some examples now. The first is to list the files in the current directory.

In [1]:
# list the current directory
# in Windows: ! dir
! ls -lh

total 284K
drwxrwxr-x 6 ppribeir ppribeir 4,0K jul 31 20:27  data
-rw-rw-r-- 1 ppribeir ppribeir  28K jul 31 20:27 'Exercise notebook.ipynb'
-rw-rw-r-- 1 ppribeir ppribeir 132K ago  1 10:49 "Learning Notebook - Part 1 of 3 - Dealing with csv's... and beyond.ipynb"
-rw-rw-r-- 1 ppribeir ppribeir  65K jul 31 20:27 'Learning Notebook - Part 2 of 3 - Common Problems and Solutions.ipynb'
-rw-rw-r-- 1 ppribeir ppribeir  35K jul 31 20:27 'Learning Notebook - Part 3 of 3 - Dealing with larger datasets.ipynb'
drwxrwxr-x 2 ppribeir ppribeir 4,0K jul 31 20:27  media
-rw-rw-r-- 1 ppribeir ppribeir 1,5K jul 31 20:35  README.md
-rw-rw-r-- 1 ppribeir ppribeir 1,4K jul 31 16:05  requirements_old.txt
-rw-rw-r-- 1 ppribeir ppribeir  133 jul 31 22:13  requirements.txt


To see the contents of a file, we just have to use the command **cat** (Unix) or **type** (Windows), followed by the file path.

In [2]:
# print the contents of a file
# in Windows: ! type data\lorem\lorem_ipsum_short.txt
! cat data/lorem/lorem_ipsum_short.txt

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis et tristique enim. Sed eget venenatis eros, quis suscipit lorem. Ut malesuada, erat a scelerisque cursus, odio mi sodales ligula, at elementum ipsum dui condimentum ipsum. Nam imperdiet viverra dictum. Aenean commodo accumsan iaculis. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Maecenas sed suscipit odio, ornare ullamcorper felis. Donec sit amet egestas sem, ac semper sem. Integer mattis purus sed orci volutpat porta. Donec cursus dapibus interdum. Donec ornare mattis dolor. Sed quis scelerisque nisl, in efficitur odio. Phasellus posuere libero eget orci feugiat scelerisque. Phasellus ullamcorper tortor nec facilisis blandit.
Vestibulum tempus, purus id ultrices eleifend, lorem nunc malesuada odio, vitae aliquam elit ipsum a nulla. Vivamus sed neque arcu. Vivamus commodo nunc a est hendrerit tempor. Nullam viverra sit amet augue in sagittis. Pellentesque molestie porttitor volutpa

Sometimes files are very big. In that case, printing all the content of a file can be too expensive and not very useful.

Thus, it's important to know how big a file is before opening it.
A good way to do that is to count the number of lines in the file.

In [3]:
# counting the number of lines in a file
# in Windows: ! type data\lorem\lorem_ipsum_long.txt | find /c /v ""

! wc -l < data/lorem/lorem_ipsum_long.txt

150


Note: `wc -l` counts newline characters (`\n`; you'll hear more about them in a bit), so if the last line doesn't end with a newline, it won't be included in this count!
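To make the note about the last line concrete, here is a minimal Python sketch. It assumes a small temporary demo file (hypothetical, not part of the course data) whose last line has no trailing newline, and compares counting newline characters, which is what `wc -l` does, with counting lines the way Python iterates over a file. The helper names `count_newlines` and `count_lines` are made up for this illustration.

```python
import os
import tempfile


def count_newlines(path):
    """Count newline bytes in the file, which is what `wc -l` reports."""
    with open(path, "rb") as f:
        # Read in chunks so that big files are never loaded fully into memory
        return sum(chunk.count(b"\n") for chunk in iter(lambda: f.read(1 << 16), b""))


def count_lines(path):
    """Count lines as Python iterates them (a final unterminated line counts)."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


# Hypothetical demo file: three lines, the last one without a trailing newline
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first\nsecond\nthird")
    demo_path = tmp.name

newline_count = count_newlines(demo_path)
line_count = count_lines(demo_path)
print(newline_count, line_count)

os.remove(demo_path)
```

The two numbers differ by one for this file, which is exactly the situation the note warns about.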

When a file is in fact very big, we can still preview it. But instead of printing all its content, we'll just want to print its first lines.

For this we can use the command **head** (Unix) or **more** (Windows).

In [4]:
# print the content of the first two lines of the file
# in Windows: ! more /e data\lorem\lorem_ipsum_long.txt P 2
! head -2 data/lorem/lorem_ipsum_long.txt

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed pharetra pulvinar nulla, nec ultricies velit posuere nec. Nulla rhoncus convallis lectus in dignissim. Nulla aliquet justo risus, ac dapibus urna faucibus aliquet. Cras in pretium leo. Etiam id neque a erat feugiat vehicula. Curabitur accumsan volutpat ante, vitae vestibulum lorem congue et. Vivamus fringilla massa id dictum bibendum. Maecenas iaculis arcu ut tellus varius, at imperdiet metus lacinia. Maecenas eget turpis metus. Maecenas est mauris, venenatis cursus nulla at, gravida posuere sapien. Fusce a ex purus. Pellentesque eleifend, lorem sed pulvinar scelerisque, orci eros maximus purus, id sagittis dolor est non nulla. Vestibulum ex metus, porttitor a leo ornare, pharetra pharetra libero. Suspendisse cursus ligula in ante tincidunt rhoncus malesuada eu justo.
Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Nam nec sem nisl. Aliquam at velit neque. Cras nec sem fringilla, h

Similarly, we can also preview the last lines of the file, with the **tail** command on Unix.

Unfortunately, the Windows command prompt doesn't have a built-in command equivalent to tail, although PowerShell offers `Get-Content <file> -Tail <n>`. There are also some packages that can be installed for this purpose (check this [stackoverflow post](https://stackoverflow.com/questions/187587/a-windows-equivalent-of-the-unix-tail-command)). Or you can always **type** the whole file (if it's not too big) and read the last lines...

In [5]:
# print the content of the last three lines of the file
# in Windows :(
! tail -3 data/lorem/lorem_ipsum_long.txt

Morbi sodales et felis in bibendum. Nullam sollicitudin dapibus tellus, at molestie dui sagittis sed. Proin a tellus ac mi pharetra ullamcorper et in diam. Donec at posuere massa. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Quisque faucibus justo sollicitudin tincidunt tincidunt. Mauris magna massa, convallis quis erat sit amet, varius consequat nibh. In ac elit eu purus tempor lobortis quis at ante. Aliquam erat volutpat. Etiam sed arcu ut ex venenatis suscipit. Quisque ac porta diam. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce tristique malesuada diam, at hendrerit neque euismod sit amet.
Aenean commodo posuere elit, et ornare mi bibendum eget. Mauris scelerisque elit neque, vel eleifend leo pellentesque et. Aliquam erat volutpat. Vestibulum porttitor, neque in eleifend tincidunt, dolor tortor sagittis velit, a vehicula odio lectus et neque. Etiam odio diam, congue nec neque quis, hendrerit convallis nisi. Nullam leo lig
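Since head and tail are not available in every shell, a portable alternative is to do the same thing in plain Python. The sketch below is one possible way to do it, using only the standard library; the function names `head` and `tail` and the `sample` list are made up for this illustration. Both functions accept any iterable of lines, including an open file.

```python
from collections import deque
from itertools import islice


def head(lines, n):
    """Return the first n items of an iterable of lines."""
    return list(islice(lines, n))


def tail(lines, n):
    """Return the last n items, keeping at most n lines in memory."""
    return list(deque(lines, maxlen=n))


# A small in-memory stand-in for a file with ten lines
sample = [f"line {i}" for i in range(1, 11)]

first_two = head(sample, 2)
last_three = tail(sample, 3)
print(first_two)
print(last_three)
```

With a real file, the same functions can be called as `head(f, 2)` or `tail(f, 3)` inside a `with open(path) as f:` block, and the file is read line by line rather than all at once.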

**Optional tip**

Another useful thing is that we can use variables with the ! commands.

We can assign the result of a ! command to a variable `var` in the notebook. `var` will be a list, where each element corresponds to a line of the command's output. Examples:

In [6]:
var = ! wc -l < ./data/lorem/lorem_ipsum_long.txt
var

['150']

In [7]:
var = ! ls -la
var

['total 296',
 'drwxrwxr-x 5 ppribeir ppribeir   4096 ago  1 10:49 .',
 'drwxrwxr-x 3 ppribeir ppribeir   4096 jul 31 20:22 ..',
 'drwxrwxr-x 6 ppribeir ppribeir   4096 jul 31 20:27 data',
 '-rw-rw-r-- 1 ppribeir ppribeir  28604 jul 31 20:27 Exercise notebook.ipynb',
 'drwxrwxr-x 2 ppribeir ppribeir   4096 jul 31 20:38 .ipynb_checkpoints',
 "-rw-rw-r-- 1 ppribeir ppribeir 134242 ago  1 10:49 Learning Notebook - Part 1 of 3 - Dealing with csv's... and beyond.ipynb",
 '-rw-rw-r-- 1 ppribeir ppribeir  65559 jul 31 20:27 Learning Notebook - Part 2 of 3 - Common Problems and Solutions.ipynb',
 '-rw-rw-r-- 1 ppribeir ppribeir  35678 jul 31 20:27 Learning Notebook - Part 3 of 3 - Dealing with larger datasets.ipynb',
 'drwxrwxr-x 2 ppribeir ppribeir   4096 jul 31 20:27 media',
 '-rw-rw-r-- 1 ppribeir ppribeir   1460 jul 31 20:35 README.md',
 '-rw-rw-r-- 1 ppribeir ppribeir   1338 jul 31 16:05 requirements_old.txt',
 '-rw-rw-r-- 1 ppribeir ppribeir    133 jul 31 22:13 requirements.txt']

Conversely, we can also pass Python variables to the ! commands, by wrapping them in curly braces.

In [8]:
filename = "./data/lorem/lorem_ipsum_long.txt"
! wc -l < {filename}

150
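The ! syntax only works inside Jupyter/IPython. In a regular Python script, a rough equivalent is the standard library's `subprocess` module. The sketch below is an illustration under one assumption: it launches the Python interpreter itself as the child process, so that the example doesn't depend on any OS-specific command such as `wc` or `dir`.

```python
import subprocess
import sys

# Run a child process and capture what it prints.
# sys.executable is the current Python interpreter, used here only to keep
# the example independent of the operating system.
result = subprocess.run(
    [sys.executable, "-c", "print('first line'); print('second line')"],
    capture_output=True,
    text=True,
    check=True,
)

# Like `var = ! command`, we end up with one list element per output line
output_lines = result.stdout.splitlines()
print(output_lines)
```

Passing the command as a list of arguments, as done here, also avoids quoting problems with file names that contain spaces.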


---

**Now that we know how to handle files, let's learn about file formats.**


## 2. File Formats Introduction

Throughout this academy we've been dealing a lot with csv files.
But in the real world, there is a plethora of file formats! 

Just for fun, you can scroll [this Wikipedia page](https://en.wikipedia.org/wiki/List_of_file_formats), to see a list with all the file formats that exist out there.

In this notebook, we'll learn about the most common file formats and how to read data from files with different formats into pandas DataFrames.
We'll also cover how to deal with different character encodings.

In [9]:
# Some imports
import chardet
import os
import pandas as pd

### Delimiter separated values

A [delimiter separated values](https://en.wikipedia.org/wiki/Delimiter-separated_values) file is a file where each row represents a data point and its fields are separated by a **delimiter** character. The lines are separated by [newlines](https://en.wikipedia.org/wiki/Newline#In_programming_languages) (which you can think of as the `\n` character).

The best known example of a delimiter separated values file is the **comma separated values** file, where the delimiter character is a comma. These files usually have the .csv extension, i.e., they are called something.csv.

Other common delimiters are:

* tab (`\t`)
* colon (`:`)
* pipe (`|`)
* space (` `)

In fact, using the tab as a delimiter is so common that tab separated values files are called tsvs, and have the .tsv extension.

In [10]:
# Using the ! magic and the head command to see the first 5 lines of a csv file
#! more /e data\pokemons\pokemons.csv P 5
! head -5 data/pokemons/pokemons.csv

#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False


### JSON

[JSON](https://en.wikipedia.org/wiki/JSON) stands for JavaScript Object Notation. It was derived from JavaScript but it is language independent, so many languages (including Python, of course!) have built-in support to parse and generate JSON data.

This data format is very common in communication between browsers, servers and databases, so expect to see it a lot!

In JSON files, the data is represented in key-value pairs and arrays. 
Also, JSON files use the .json extension.

In [11]:
# Displaying the contents of a JSON file
#! type data\pokemons\pokemons.json
! cat data/pokemons/pokemons.json

{
    "1": {
        "Name": "Bulbasaur",
        "Types": [
            "Grass",
            "Poison"
        ],
        "HP": 45,
        "Attack": 49,
        "Defense": 49,
        "Sp. Atk": 65,
        "Sp. Def": 65,
        "Speed": 45,
        "Generation": 1,
        "Legendary": false
    },
    "2": {
        "Name": "Ivysaur",
        "Types": [
            "Grass",
            "Poison"
        ],
        "HP": 60,
        "Attack": 62,
        "Defense": 63,
        "Sp. Atk": 80,
        "Sp. Def": 80,
        "Speed": 60,
        "Generation": 1,
        "Legendary": false
    }
}
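As a quick illustration of that built-in support, here is a minimal sketch using Python's standard `json` module. The `raw` string is a shortened, hand-written version of the first entry of the file above (not read from disk), so treat it as an assumption made for the example.

```python
import json

# A shortened version of the first entry of the file, written as a string
raw = '{"1": {"Name": "Bulbasaur", "Types": ["Grass", "Poison"], "HP": 45, "Legendary": false}}'

# Parse JSON text into Python objects: objects become dicts, arrays become
# lists, and true/false become True/False
pokemons = json.loads(raw)
print(pokemons["1"]["Name"])
print(pokemons["1"]["Types"])
print(pokemons["1"]["Legendary"])

# Generate JSON text again from the Python objects
round_trip = json.dumps(pokemons, indent=4)
print(round_trip)
```

For a file on disk, `json.load(f)` does the same parsing on an open file object.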

### Excel

[Excel](https://en.wikipedia.org/wiki/Microsoft_Excel) is a spreadsheet application developed by Microsoft, which has been THE tool used by Data Analysts for the past decades.

We're not planning to use Excel much around here, but working as a data scientist in a company, you'll very likely receive some Excel files with data.

Unfortunately, using the head or cat commands to preview Excel files won't show us much... So we'll have to rely on pandas for that (more on that soon).

In [12]:
# Previewing excel files using head doesn't help us much
#! more /e data\pokemons\pokemons.xls P 5
! head -5 data/pokemons/pokemons.xls

��ࡱ�                >  ��	               �          ����    ����    �   �   ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������	   ��       �  ��    �   \ p None                                                                                                            B  �a   =  �           c        �   �   @    �    =  �Z �?N*8      X"       �   �    1  �   ��      Arial1  �   ��      Arial1  �   ��      Arial1  �   ��      Arial1  �   ��      Arial1  �   ��      Arial1  �  ��      Arial1  �   ��      Arial �   General�  

### HTML

[HTML](https://en.wikipedia.org/wiki/HTML) stands for Hypertext Markup Language and is the standard markup language for creating web pages and web applications.

As a data scientist, it's not very likely that you'll receive HTML files with data. But if you consider that HTML is the language that developers use to actually "paint" web pages, you may start finding it more interesting :)

But more on that later, let's now preview an HTML file.

In [13]:
#! type data\pokemons\pokemons.html
! cat data/pokemons/pokemons.html

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>#</th>
      <th>Name</th>
      <th>Type 1</th>
      <th>Type 2</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>Sp. Atk</th>
      <th>Sp. Def</th>
      <th>Speed</th>
      <th>Generation</th>
      <th>Legendary</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Bulbasaur</td>
      <td>Grass</td>
      <td>Poison</td>
      <td>45</td>
      <td>49</td>
      <td>49</td>
      <td>65</td>
      <td>65</td>
      <td>45</td>
      <td>1</td>
      <td>False</td>
    </tr>
    <tr>
      <td>2</td>
      <td>Ivysaur</td>
      <td>Grass</td>
      <td>Poison</td>
      <td>60</td>
      <td>62</td>
      <td>63</td>
      <td>80</td>
      <td>80</td>
      <td>60</td>
      <td>1</td>
      <td>False</td>
    </tr>
  </tbody>
</table>

**Optional tip**

Just a curiosity about Jupyter notebooks and HTML.

You can write HTML code in a markdown cell and the notebook will render it.
For instance, if you copy the content of the HTML file that we read above and paste it in a markdown cell, this will be the result.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type 1</th>
      <th>Type 2</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>Sp. Atk</th>
      <th>Sp. Def</th>
      <th>Speed</th>
      <th>Generation</th>
      <th>Legendary</th>
    </tr>
    <tr>
      <th>#</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>1</th>
      <td>Bulbasaur</td>
      <td>Grass</td>
      <td>Poison</td>
      <td>45</td>
      <td>49</td>
      <td>49</td>
      <td>65</td>
      <td>65</td>
      <td>45</td>
      <td>1</td>
      <td>False</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ivysaur</td>
      <td>Grass</td>
      <td>Poison</td>
      <td>60</td>
      <td>62</td>
      <td>63</td>
      <td>80</td>
      <td>80</td>
      <td>60</td>
      <td>1</td>
      <td>False</td>
    </tr>
  </tbody>
</table>

The table even looks pretty because Jupyter applies some CSS by default, to add styles to the output tables.

However, you can't really use it to calculate anything; you can only display it.

## 3. Reading files with pandas

<img src="./media/panda_reading.png" align="left"/>

Pandas has a set of functions to read data from files into pandas DataFrames.
Their signature looks like

```
pd.read_XXX(filepath, other arguments)
```

where XXX should be replaced with the file type: **csv** (for delimiter separated values files), **json**, **excel**, **html**, among others.

By now, you should already be familiar with the read_csv function. But so far, we've only seen very basic examples of its usage. Let's explore this family of functions a bit more to see how we can handle more complicated situations.

### 1. read_csv

[docs](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)

* **Delimiter type**

The read_csv function can be used to read any delimiter separated values file.
By default, it assumes that the delimiter character is a comma (hence, read_csv). If we want to use a different delimiter, we can use argument **sep**.

In [14]:
# preview tsv
# ! type data\pokemons\pokemons_tabs.tsv
! cat data/pokemons/pokemons_tabs.tsv

#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
1	Bulbasaur	Grass	Poison	45	49	49	65	65	45	1	False
2	Ivysaur	Grass	Poison	60	62	63	80	80	60	1	False
3	Venusaur	Grass	Poison	80	82	83	100	100	80	1	False
4	Mega Venusaur	Grass	Poison	80	100	123	122	120	80	1	False
5	Charmander	Fire		39	52	43	60	50	65	1	False


In [15]:
pd.read_csv(os.path.join('data', 'pokemons', 'pokemons.csv'), sep=',').head()

# You can also try this one:
#pd.read_csv(os.path.join('data', 'pokemons', 'pokemons_semicol.csv'), sep=';').head()

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,5,Charmander,Fire,,39,52,43,60,50,65,1,False
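To see the effect of **sep** on a tab separated file without depending on the files on disk, here is a small sketch that reads a hand-written TSV string through `io.StringIO`. The `tsv_text` content is a shortened, made-up excerpt of the pokemons data, used only for illustration.

```python
import io

import pandas as pd

# A tiny tab separated table, kept in memory instead of in a file
tsv_text = "#\tName\tType 1\tHP\n1\tBulbasaur\tGrass\t45\n2\tIvysaur\tGrass\t60\n"

# With the right delimiter, we get one column per field
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(tsv_df.shape)

# With the default delimiter (a comma), every line ends up in a single column
wrong_df = pd.read_csv(io.StringIO(tsv_text))
print(wrong_df.shape)
```

A DataFrame with a single, oddly named column is usually a good hint that the delimiter was not the one read_csv expected.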


**Tip**

Using `os.path.join('path', 'to', 'file')` makes the file path OS independent, which means that this command should work both on Windows and Unix.
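As a small sketch of what this tip means in practice, the snippet below builds the same path in two ways: with `os.path.join`, as used in this notebook, and with `pathlib.Path` from the standard library, which is an alternative that this notebook does not otherwise use. Both pick the separator that matches the OS they run on.

```python
import os
from pathlib import Path

# The separator is chosen by Python according to the operating system
joined = os.path.join("data", "pokemons", "pokemons.csv")
print(joined)

# pathlib builds the same path with the / operator
as_path = Path("data") / "pokemons" / "pokemons.csv"
print(as_path)
```

pandas read functions accept both the string and the `Path` object.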

* **Using a column as index**

If we want to use a certain column as the DataFrame index, we can use the argument **index_col** giving it the name or the number of the column.

In [16]:
pd.read_csv(os.path.join('data', 'pokemons', 'pokemons_short.csv'), index_col='#')

# Using the index column number instead of name would be
# pd.read_csv(os.path.join('data', 'pokemons', 'pokemons_short.csv'), index_col=0)

#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
5,Charmander,Fire,,39,52,43,60,50,65,1,False


* **Specifying column names**

read_csv assumes that the first row in the file has the column names. If the file doesn't have column names, this can be inconvenient! To avoid this, we can specify the column names to use, with argument **names**.

In [17]:
# preview file without column names
# ! type data\pokemons\pokemons_no_names.csv
! cat data/pokemons/pokemons_no_names.csv

1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
5,Charmander,Fire,,39,52,43,60,50,65,1,False


In [18]:
cols = ['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk',
       'Sp. Def', 'Speed', 'Generation', 'Legendary']
pd.read_csv(os.path.join('data', 'pokemons', 'pokemons_no_names.csv'), names=cols)

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,5,Charmander,Fire,,39,52,43,60,50,65,1,False


Also, if the file has column names but we want to override them, we can still use argument **names**. But we need to add argument **header** too: this argument specifies in which line the column names are.

So, header=0 means that the header (column names) is the first line, and names=list_of_names means that we want to use list_of_names instead of the current header.

In [19]:
# preview file with meaningless column names
# ! type data\pokemons\pokemons_meaningless_names.csv
! cat ./data/pokemons/pokemons_meaningless_names.csv

Col_1,Col_2,Col_3,Col_4,Col_5,Col_6,Col_7,Col_8,Col_9,Col_10,Col_11,Col_12
1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
5,Charmander,Fire,,39,52,43,60,50,65,1,False


In [20]:
cols = ['#', 'My Name', 'My Type 1', 'My Type 2', 'My HP', 'My Attack', 'My Defense', 'My Sp. Atk',
       'My Sp. Def', 'My Speed', 'My Generation', 'My Legendary']
pd.read_csv(os.path.join('data', 'pokemons', 'pokemons_meaningless_names.csv'), names=cols, header=0)

Unnamed: 0,#,My Name,My Type 1,My Type 2,My HP,My Attack,My Defense,My Sp. Atk,My Sp. Def,My Speed,My Generation,My Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,5,Charmander,Fire,,39,52,43,60,50,65,1,False


* **Ignoring blank lines**

By default, empty lines are ignored by read_csv.
If we're interested in detecting empty lines, we can use argument **skip_blank_lines=False** to see the empty lines in the DataFrame as NaNs.

In [21]:
# preview file with blank rows
# ! type data\pokemons\pokemons_blank_rows.csv
! cat ./data/pokemons/pokemons_blank_rows.csv

#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False

3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False

4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
5,Charmander,Fire,,39,52,43,60,50,65,1,False


In [22]:
pd.read_csv(os.path.join('data', 'pokemons', 'pokemons_blank_rows.csv'), skip_blank_lines=False)

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1.0,Bulbasaur,Grass,Poison,45.0,49.0,49.0,65.0,65.0,45.0,1.0,False
1,2.0,Ivysaur,Grass,Poison,60.0,62.0,63.0,80.0,80.0,60.0,1.0,False
2,,,,,,,,,,,,
3,3.0,Venusaur,Grass,Poison,80.0,82.0,83.0,100.0,100.0,80.0,1.0,False
4,,,,,,,,,,,,
5,4.0,Mega Venusaur,Grass,Poison,80.0,100.0,123.0,122.0,120.0,80.0,1.0,False
6,5.0,Charmander,Fire,,39.0,52.0,43.0,60.0,50.0,65.0,1.0,False


* **Replacing NA values**

Some values are interpreted as NaNs by default, like empty fields and strings such as 'NaN', 'NA' or 'null'.
If we want additional values to be treated as NaNs, we can use the **na_values** argument.

In [23]:
# in this file, we have some missing values marked as Unknown and ???
# ! type data\pokemons\pokemons_missing_values.csv
! cat ./data/pokemons/pokemons_missing_values.csv

#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
2,Unknown,Grass,Poison,60,62,63,80,80,60,1,???
3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,???
5,Unknown,Fire,,39,52,43,60,50,65,1,False


In [24]:
# recognize Unknown in column Name as NaN, recognize ??? in column Legendary as NaN
pd.read_csv(
    os.path.join('data', 'pokemons', 'pokemons_missing_values.csv'),
    na_values={'Name': 'Unknown', 'Legendary': '???'}
)

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,,Grass,Poison,60,62,63,80,80,60,1,
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,
4,5,,Fire,,39,52,43,60,50,65,1,False
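The default NaN markers can also get in the way, for example when 'NA' is a legitimate value in the data. As a hedged sketch, the snippet below uses a made-up in-memory CSV (the `csv_text` string is an assumption for illustration, not one of the course files) to compare the default behaviour with `keep_default_na=False`, a read_csv argument that switches the default markers off.

```python
import io

import pandas as pd

# Made-up data where 'NA' and 'n/a' appear as plain text in the file
csv_text = "Name,Type 2\nBulbasaur,Poison\nCharmander,NA\nCharmeleon,n/a\n"

# By default, both 'NA' and 'n/a' are read as missing values
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df["Type 2"].isna().sum())

# With keep_default_na=False, they are kept as ordinary strings
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(literal_df["Type 2"].tolist())
```

`keep_default_na=False` can be combined with **na_values** when only a specific set of markers should count as missing.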


* **Decimal separator**

Number separators are an issue that has divided the world. For instance:

```
3,141.592
```

For many Europeans, who use the comma as the decimal separator, this is the number $\pi$, rounded to six decimal places.
For Americans, on the other hand, this is more or less the number of millions of dollars that Donald Trump has (according to [Forbes](https://www.forbes.com/profile/donald-trump/), and it might be outdated now...).

So, depending where our data comes from, we may need to specify the decimal separator to use when reading a file. For this, we can use the **decimal** argument.

In [25]:
# Reading the file while assuming a dot as decimal separator (the default)
combats = pd.read_csv(os.path.join('data', 'pokemons', 'pokemons_stats.csv'), sep=';')
combats

Unnamed: 0,#,Combats,Wins,Wins (%)
0,1,133,37,"27,81954887218045"
1,2,121,46,"38,016528925619836"
2,3,132,89,"67,42424242424242"
3,4,125,70,56
4,5,112,55,"49,107142857142855"


In [26]:
# read_csv interprets the "Wins (%)" column as object (i.e., as strings)
combats.dtypes

#            int64
Combats      int64
Wins         int64
Wins (%)    object
dtype: object

In [27]:
# reading a file with the right decimal separator specification
combats = pd.read_csv(os.path.join('data', 'pokemons', 'pokemons_stats.csv'), sep=';', decimal=',')

# read_csv interprets the "Wins (%)" column as float
combats.dtypes

#             int64
Combats       int64
Wins          int64
Wins (%)    float64
dtype: object
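When a file also uses a thousands separator, read_csv has a **thousands** argument that can be combined with **decimal**. The sketch below uses a made-up in-memory CSV (`csv_text` and its `item` and `price` columns are assumptions for illustration) written in the European style, with the dot for thousands and the comma for decimals.

```python
import io

import pandas as pd

# Made-up data: dot as thousands separator, comma as decimal separator
csv_text = "item;price\nA;1.234,5\nB;3.141,592\n"

# Without any hints, the price column cannot be parsed as numbers
naive = pd.read_csv(io.StringIO(csv_text), sep=";")
print(naive["price"].dtype)

# Telling read_csv about both separators gives a float column
parsed = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",", thousands=".")
print(parsed["price"].dtype)
print(parsed["price"].tolist())
```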

### 2. read_json

[docs](https://pandas.pydata.org/docs/reference/api/pandas.io.json.read_json.html)

* **Orientation**

The read_json function has a very important argument: **orient**.
There are different ways to represent a DataFrame as a JSON object, and this argument specifies how read_json should interpret the keys and the values in the JSON file.

If we don't consider the JSON orientation when reading a file, we may get ValueErrors or see our DataFrame transposed! So, make sure you check the documentation in order to understand the orientation of your JSON file and pass it correctly to the read_json function.

In [28]:
# split orientation doesn't match our json file
try:
    pd.read_json(os.path.join('data', 'pokemons', 'pokemons.json'), orient='split')
except Exception as e:
    print("Sorry, split orientation doesn't work for this json file...", e)

Sorry, split orientation doesn't work for this json file... JSON data had unexpected key(s): 2, 1


In [29]:
# neither does records orientation, as it returns the DataFrame transposed
pd.read_json(os.path.join('data', 'pokemons', 'pokemons.json'), orient='records')

Unnamed: 0,1,2
Name,Bulbasaur,Ivysaur
Types,"[Grass, Poison]","[Grass, Poison]"
HP,45,60
Attack,49,62
Defense,49,63
Sp. Atk,65,80
Sp. Def,65,80
Speed,45,60
Generation,1,1
Legendary,False,False


In [30]:
# index orientation is what we're looking for
pd.read_json(os.path.join('data', 'pokemons', 'pokemons.json'), orient='index')

Unnamed: 0,Name,Types,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,"[Grass, Poison]",45,49,49,65,65,45,1,False
2,Ivysaur,"[Grass, Poison]",60,62,63,80,80,60,1,False


### 3. read_excel

[docs](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html)

Since Excel files are spreadsheets, there may be some cells whose content doesn't belong to the "table" we want to convert into a DataFrame.
Often, these cells are either at the beginning or at the end of the file. And we have two convenient arguments for these situations!

Consider the following DataFrame, read from an Excel file.

In [31]:
pd.read_excel(os.path.join('data', 'pokemons', 'pokemons_garbage_rows.xlsx'))

Unnamed: 0,Data gathered from https://www.kaggle.com/terminus7/pokemon-challenge/data,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
0,,,,,,,,hello!!,,,,
1,,,Sometimes there's random data in excel files,,,Because they are spreadsheets!,,,,,,
2,,,,,,,,,,,,
3,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
4,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
5,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
6,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
7,4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
8,,,,,,,,,123,,,
9,And,in,the,footer,too!,,,,123,,,


* **Skip rows at the beginning of the file**

In order to skip n rows at the beginning of the file, we can use argument **skiprows**.

In [32]:
pd.read_excel(os.path.join('data', 'pokemons', 'pokemons_garbage_rows.xlsx'), skiprows=4)

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65.0,65.0,45.0,1.0,0.0
1,2,Ivysaur,Grass,Poison,60,62,63,80.0,80.0,60.0,1.0,0.0
2,3,Venusaur,Grass,Poison,80,82,83,100.0,100.0,80.0,1.0,0.0
3,4,Mega Venusaur,Grass,Poison,80,100,123,122.0,120.0,80.0,1.0,0.0
4,,,,,,,,,123.0,,,
5,And,in,the,footer,too!,,,,123.0,,,
6,,,,,,Pure,GARBAGE!!!,,,,,


* **Skip rows at the end of the file**

And to skip rows at the end of the file, we can use argument **skipfooter**.

In [33]:
pd.read_excel(os.path.join('data', 'pokemons', 'pokemons_garbage_rows.xlsx'), skiprows=4, skipfooter=3)

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False


You can also grab a specific sheet using the **sheet_name** argument.

In [34]:
pd.read_excel(
    os.path.join('data', 'pokemons', 'pokemons_garbage_rows.xlsx'),
    sheet_name="Sheet1",
    skiprows=4,
    skipfooter=3
)

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False


### 4. read_html

[docs](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html)

* **Return type is list!**

The read_html function returns a list of DataFrames (one per table found in the HTML), and not a single DataFrame like the other pandas read functions we've seen so far.

So, if you want to read an HTML table into a DataFrame, you can do the following.

In [35]:
pd.read_html(os.path.join('data', 'pokemons', 'pokemons.html'))[0]

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False


## 4. Encodings

At their lowest level, computers can only understand two things: zeros and ones.
So everything that we handle with a computer (text, music clips, photos) is represented by a long sequence of zeros and ones, called *bits*.

In particular, characters (like 'a' or '1' or '!') are represented in terms of bits too.
When people started programming computers, they had to define how characters translate into bits. This mapping is called an **encoding**.

Seems easy, right? We have letters (a-z), digits (0-9), and a couple of special characters like punctuation. And so an encoding called [ASCII](https://en.wikipedia.org/wiki/ASCII) was born.

<br>
<img src="./media/ASCII_code_chart.png" width="500">
<br>
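
To make the character-to-bits mapping concrete, here is a small sketch that prints the ASCII code and the 8-bit pattern of a few characters. The helper name `ascii_bits` is just an illustration, not something defined elsewhere in this notebook.

```python
def ascii_bits(char):
    """Return the 8-bit binary string for a single ASCII character."""
    code_point = ord(char)            # integer code, e.g. 97 for 'a'
    return format(code_point, '08b')  # zero-padded binary string

for char in ['a', '1', '!']:
    print(repr(char), ord(char), ascii_bits(char))
```

For example, 'a' has code 97, which is `01100001` in bits.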
    
But as we know, nothing is easy... Different languages have their own special characters (e.g. 'ç') and accents (e.g. 'ã'), and some use scripts that are completely different from the Latin alphabet (like Arabic or Chinese). On top of that, letters can come in lower and upper case (e.g. 'a' and 'A'), etc.

Thus, many different encoding standards started emerging. And the encodings madness began...

Nowadays, [UTF-8](https://pt.wikipedia.org/wiki/UTF-8) is the most widely used encoding on the Web. However, there are still hundreds of encodings in use! For instance, [these](https://docs.python.org/3/library/codecs.html#standard-encodings) are the built-in encodings in Python.
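
As a quick illustration of why the choice of encoding matters, the sketch below encodes the same character with a few of Python's built-in encodings. The three encodings are picked only as examples.

```python
text = 'ç'

# The same character becomes different bytes depending on the encoding.
encoded = {}
for encoding in ['utf-8', 'latin-1', 'utf-16-le']:
    encoded[encoding] = text.encode(encoding)
    print(encoding, encoded[encoding], len(encoded[encoding]), 'byte(s)')

# ASCII has no code for this character at all, so encoding fails.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print('fits in ASCII?', ascii_ok)
```

The bytes differ between encodings, so a reader has to know which one was used to get the original character back.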

But enough about history, let's play with encodings!

The file below has pokemon names in English. Let's start by previewing it.

In [36]:
# preview the file with pokemons english names
# ! type data\pokemons\pokemons_english_names.csv
! cat data/pokemons/pokemons_english_names.csv

#,English Name
1,Bulbasaur
2,Ivysaur
3,Venusaur
4,Charmander
5,Charmeleon


Now let's use the read_csv function to read it, with the **encoding** argument to specify the encoding we want to use.

In this case, we're reading the file with the ASCII encoding.

In [37]:
# reading the file while specifying an encoding
pd.read_csv(os.path.join('data', 'pokemons', 'pokemons_english_names.csv'), encoding='ascii')

Unnamed: 0,#,English Name
0,1,Bulbasaur
1,2,Ivysaur
2,3,Venusaur
3,4,Charmander
4,5,Charmeleon


Ok, now let's try to read pokemon names in French. Again, we'll try to use the ASCII encoding. But this time we won't be so lucky...

This makes sense, because French uses accented characters (like 'è'), which ASCII does not support.

In [38]:
# trying to read a file with the wrong encoding -> error
try:
    pd.read_csv(os.path.join('data', 'pokemons', 'pokemons_french_names.csv'), encoding='ascii')
except UnicodeDecodeError as e:
    # ASCII cannot decode the accented characters, so we get a decoding error
    print(e)

'ascii' codec can't decode byte 0xe8 in position 60: ordinal not in range(128)
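
The same error can be reproduced without the file. This sketch assumes the name is stored with 'è' as the single byte 0xe8 (the byte the error message complains about). The position differs from the message above because we decode only one name instead of the whole file.

```python
# One French name, stored with 'è' as the single byte 0xe8 (assumption).
raw = b'Salam\xe8che'

failure_position = None
bad_byte = None
try:
    raw.decode('ascii')
except UnicodeDecodeError as error:
    failure_position = error.start   # index of the first undecodable byte
    bad_byte = raw[failure_position]
    print(error)

print('offending byte:', hex(bad_byte), 'at position', failure_position)
```

ASCII only defines codes 0 to 127, and 0xe8 is 232, hence "ordinal not in range(128)".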


Be very careful with encodings: sometimes an encoding doesn't cause an error in read_csv, but still decodes some characters incorrectly! See that weird character in Pokemon #4?

In [39]:
# trying to read a file with a wrong encoding -> weird characters!
pd.read_csv(os.path.join('data', 'pokemons', 'pokemons_french_names.csv'), encoding='cp1255')

Unnamed: 0,#,French Name
0,1,Bulbizarre
1,2,Herbizarre
2,3,Florizarre
3,4,Salamטche
4,5,Reptincel
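
To see where the weird character comes from, we can decode the bytes of that name by hand. This is a sketch that assumes the accented letter is stored as the byte 0xe8 (the byte from the ASCII error above) and compares cp1255 with Latin-1, a common Western European encoding.

```python
raw = b'Salam\xe8che'  # assumed bytes of Pokemon #4's French name

decoded = {}
for encoding in ['cp1255', 'latin-1']:
    # Both calls succeed, but they map byte 0xe8 to different characters.
    decoded[encoding] = raw.decode(encoding)
    print(encoding, '->', decoded[encoding])
```

cp1255 is a Hebrew encoding, so it turns byte 0xe8 into a Hebrew letter instead of raising an error.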


Ok, so we're reaching the conclusion that reading a file without knowing its encoding can be very painful...

<br>
<img src="./media/tell_me_your_files_encodings.png" width="500">
<br>

But fear no more, because we have a very handy tool that can help us with encodings!
Meet [chardet](https://chardet.readthedocs.io/en/latest/).

Chardet makes an educated guess about the encoding of a file, which we can then use to read it. The result also includes a confidence score between 0 and 1, so treat it as a guess and not as a guarantee.

In [40]:
# using chardet to find the encoding of our pokemons french names file
with open(os.path.join('data', 'pokemons', 'pokemons_french_names.csv'), 'rb') as f:
    result = chardet.detect(f.read())
result

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

In [41]:
# reading the file specifying the encoding returned by chardet -> it works!
pd.read_csv(os.path.join('data', 'pokemons', 'pokemons_french_names.csv'), encoding='ISO-8859-1')

Unnamed: 0,#,French Name
0,1,Bulbizarre
1,2,Herbizarre
2,3,Florizarre
3,4,Salamèche
4,5,Reptincel
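
If chardet is not available, a much cruder alternative is to try a short list of candidate encodings in order. This is not how chardet works, just a minimal sketch. The helper name `first_working_encoding` and the candidate lists are hypothetical.

```python
def first_working_encoding(raw, candidates):
    """Return the first candidate that decodes `raw` without error, else None."""
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes, try the next one
        return encoding
    return None

raw = b'Salam\xe8che'  # assumed bytes of a French name

print(first_working_encoding(raw, ['ascii', 'utf-8', 'latin-1']))

# Decoding without an error is not proof of the right encoding:
# cp1255 also "works" here, but produces the wrong character.
print(first_working_encoding(raw, ['ascii', 'cp1255', 'latin-1']))
```

The order of the candidates matters, so always check the decoded text with your own eyes.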


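Let's see an example with slightly more complicated characters. Notice that even the preview comes out garbled, because the terminal does not decode the file with the right encoding.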

In [42]:
# ! type data\pokemons\pokemons_japanese_names.csv 
! cat data/pokemons/pokemons_japanese_names.csv

#,���{�̖��O
1,�t�V�M�_�l
2,�t�V�M�\�E
3,�t�V�M�o�i
4,�q�g�J�Q
5,���U�[�h


In [43]:
# chardet to the rescue!
with open(os.path.join('data', 'pokemons', 'pokemons_japanese_names.csv'), 'rb') as f:
    result = chardet.detect(f.read())
result

{'encoding': 'SHIFT_JIS', 'confidence': 0.99, 'language': 'Japanese'}

In [44]:
# reading the file with the right encoding
pd.read_csv(os.path.join('data', 'pokemons', 'pokemons_japanese_names.csv'), encoding='SHIFT_JIS')

Unnamed: 0,#,日本の名前
0,1,フシギダネ
1,2,フシギソウ
2,3,フシギバナ
3,4,ヒトカゲ
4,5,リザード
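
The garbled preview from `cat` can be reproduced in Python. This sketch assumes the terminal decodes the bytes as UTF-8 and shows a replacement character for every byte it cannot interpret.

```python
name = 'ヒトカゲ'  # Pokemon #4 in the table above

raw = name.encode('shift_jis')

# Decode with the wrong encoding, replacing invalid bytes with U+FFFD.
as_seen_in_terminal = raw.decode('utf-8', errors='replace')
print(as_seen_in_terminal)

# Decode with the right encoding to get the original name back.
print(raw.decode('shift_jis'))
```

The first print should look like row 4 of the garbled preview, and the second gives back the readable name.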
