In [1]:
# Install the environment
%pip install git+https://github.com/AwePhD/NotebooksLabsessionImage.git


Collecting git+https://github.com/AwePhD/NotebooksLabsessionImage.git
  Cloning https://github.com/AwePhD/NotebooksLabsessionImage.git to /tmp/pip-req-build-wko581wo
  Running command git clone -q https://github.com/AwePhD/NotebooksLabsessionImage.git /tmp/pip-req-build-wko581wo
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: NLI
  Building wheel for NLI (PEP 517) ... done
  Created wheel for NLI: filename=NLI-1.0.0-py3-none-any.whl size=2406 sha256=74aa5692ff28a7ace01deb5f6b2999d40864cd7ab8900ccf3958945ce986ccb2
  Stored in directory: /tmp/pip-ephem-wheel-cache-7y8z23vr/wheels/17/4a/a4/4f920391e876c3c2632ecc7851748e1c11539349fe2eefd2c4
Successfully built NLI
Installing collected packages: NLI
Successfully installed NLI-1.0.0


In [2]:
# Import dataset
# Can be found at https://www.kaggle.com/vishalsubbiah/pokemon-images-and-types
!rm -rf ./*
!curl -LO https://github.com/AwePhD/NotebooksLabsessionImage/raw/main/pokemon_dataset.zip
!unzip -qq pokemon_dataset.zip
!rm pokemon_dataset.zip


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   156  100   156    0     0   1695      0 --:--:-- --:--:-- --:--:--  1695
100 2484k  100 2484k    0     0  10.6M      0 --:--:-- --:--:-- --:--:-- 10.6M


In [1]:
# Third party imports
import pandas as pd

## Import from CSV

In the previous notebook, we manipulated data loaded from a CSV file. Now we will look at the import step itself.

The full documentation for `.read_csv` is [available online](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv).

Let's look at the contents of the CSV file.

In [2]:
%%!
head -10 pokemon.csv

['Name,Type1,Type2',
 'bulbasaur,Grass,Poison',
 'ivysaur,Grass,Poison',
 'venusaur,Grass,Poison',
 'charmander,Fire',
 'charmeleon,Fire',
 'charizard,Fire,Flying',
 'squirtle,Water',
 'wartortle,Water',
 'blastoise,Water']

Here is the signature of `read_csv` as shown in the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv):

![](../img_md/read_csv.png)

Conclusion: there are many options to choose from.

We will cover a few of them here, namely how to specify the separator character, the column names, and the index column.

#### Direct use of `.read_csv`

Because our `.csv` is well formatted, calling `.read_csv` with only the file path is enough to import the data.

In [3]:
pd.read_csv("./pokemon.csv").head()

,Name,Type1,Type2
0,bulbasaur,Grass,Poison
1,ivysaur,Grass,Poison
2,venusaur,Grass,Poison
3,charmander,Fire,
4,charmeleon,Fire,
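Some rows in the file have only two fields (for example `charmander,Fire`), which is why `Type2` is empty for them in the output above. The sketch below reproduces this on a small inline sample instead of the real file; the `sample` and `df` names are only for illustration.

```python
import io

import pandas as pd

# Inline sample copied from the first rows of pokemon.csv shown earlier
# (an assumption: it stands in for the real file).
sample = io.StringIO(
    "Name,Type1,Type2\n"
    "bulbasaur,Grass,Poison\n"
    "charmander,Fire\n"
    "charizard,Fire,Flying\n"
)
df = pd.read_csv(sample)

# Rows with fewer fields than the header get NaN in the missing columns.
print(df)

# Count how many Pokemon have no second type in this sample.
print(df["Type2"].isna().sum())
```

Checking `isna()` right after the import is a quick way to see which columns contain missing values.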


One useful operation is to turn the `Name` column into the index. The name of a Pokemon can reasonably be treated as a unique identifier, so referring to a row by name is more readable, and often more useful, than referring to it by an arbitrary integer.

In [4]:
pd.read_csv("./pokemon.csv", index_col="Name").head()

Name,Type1,Type2
bulbasaur,Grass,Poison
ivysaur,Grass,Poison
venusaur,Grass,Poison
charmander,Fire,
charmeleon,Fire,
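With the name as index, rows can be looked up by label. Here is a minimal sketch on an inline sample (the `sample` and `pokemon` names are hypothetical, and the data is a small excerpt rather than the full file):

```python
import io

import pandas as pd

# Small excerpt standing in for pokemon.csv (assumption for illustration).
sample = io.StringIO(
    "Name,Type1,Type2\n"
    "bulbasaur,Grass,Poison\n"
    "charmander,Fire\n"
    "charizard,Fire,Flying\n"
)
pokemon = pd.read_csv(sample, index_col="Name")

# Select a whole row by its label.
print(pokemon.loc["charizard"])

# Select a single value by row label and column name.
print(pokemon.loc["bulbasaur", "Type2"])

# Check that the names really are unique before relying on them as keys.
print(pokemon.index.is_unique)
```

If `is_unique` were `False`, `.loc` would return several rows for a duplicated name, so the check is worth running on the real data.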


For instance, we can now list the Pokemon in alphabetical order by sorting the index.

In [5]:
pd.read_csv("./pokemon.csv", index_col='Name').sort_index().head()

Name,Type1,Type2
abomasnow,Grass,Ice
abra,Psychic,
absol,Dark,
accelgor,Bug,
aegislash-blade,Steel,Ghost


#### Dealing with non-standard `.csv`

Depending on where your `.csv` comes from, it may use a different format, in which case `.read_csv` will not read it correctly with its default settings. Let's look at a new version of our file:


In [6]:
%%!
head -5 pokemon_bad.csv

['bulbasaur;Grass;Poison',
 'ivysaur;Grass;Poison',
 'venusaur;Grass;Poison',
 'charmander;Fire',
 'charmeleon;Fire']

The delimiter is `;` instead of `,` and there are no column names :cry:.

In [7]:
pd.read_csv("./pokemon_bad.csv").head()

,bulbasaur;Grass;Poison
0,ivysaur;Grass;Poison
1,venusaur;Grass;Poison
2,charmander;Fire
3,charmeleon;Fire
4,charizard;Fire;Flying
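To see what went wrong, it helps to inspect the shape and the column labels of the result. The sketch below uses a few inline lines in the same format as `pokemon_bad.csv`; the `bad` and `df_wrong` names are hypothetical.

```python
import io

import pandas as pd

# A few lines in the same format as pokemon_bad.csv (assumption: inline
# sample used instead of the real file).
bad = (
    "bulbasaur;Grass;Poison\n"
    "ivysaur;Grass;Poison\n"
    "charmander;Fire\n"
)
df_wrong = pd.read_csv(io.StringIO(bad))

# With the default comma separator, every line is read as a single field,
# and the first line is consumed as the header.
print(df_wrong.shape)
print(list(df_wrong.columns))
```

A single column whose label looks like a data row is a strong hint that both the separator and the header settings need to be adjusted.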


We can still build the expected `DataFrame` by supplying the missing information ourselves. The `sep` argument overrides the default separator character, which is a comma, and the `names` argument specifies the column names.

In [8]:
pd.read_csv(
    "./pokemon_bad.csv",
    sep=';',
    index_col=0, # First column as index
    names=["Type1", "Type2"], # Names of the columns
).head()

,Type1,Type2
bulbasaur,Grass,Poison
ivysaur,Grass,Poison
venusaur,Grass,Poison
charmander,Fire,
charmeleon,Fire,


Note: this `DataFrame` is not equivalent to the first one, because its index has no name (previously the index was named `Name`).

This can be fixed by using `.rename_axis` ✨

In [9]:
pd.read_csv(
    "./pokemon_bad.csv",
    sep=';',
    index_col=0, # First column as index
    names=["Type1", "Type2"], # Names of the columns
).rename_axis('Name').head()

Name,Type1,Type2
bulbasaur,Grass,Poison
ivysaur,Grass,Poison
venusaur,Grass,Poison
charmander,Fire,
charmeleon,Fire,
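Another possible approach, shown here as a sketch rather than as the notebook's own method, is to name all three columns and select the index by label, which makes the `.rename_axis` call unnecessary. The inline `bad` sample and the `pokemon` name are assumptions for illustration.

```python
import io

import pandas as pd

# A few lines in the same format as pokemon_bad.csv (assumption: inline
# sample used instead of the real file).
bad = (
    "bulbasaur;Grass;Poison\n"
    "ivysaur;Grass;Poison\n"
    "charmander;Fire\n"
)
pokemon = pd.read_csv(
    io.StringIO(bad),
    sep=";",
    names=["Name", "Type1", "Type2"],  # Name every column, index included
    index_col="Name",  # Then pick the index column by its name
)

# The index is already named, and short rows get NaN for Type2.
print(pokemon)
print(pokemon.index.name)
```

Naming every column keeps the call explicit about what each field means, at the cost of one extra entry in `names`.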
