# Introduction to Pandas

In [1]:
import datetime
print(f"Last updated: {datetime.datetime.now()}")

Last updated: 2024-12-23 09:09:21.766099


## What is pandas?

If you're getting into machine learning and data science and you're using Python, you're going to use pandas.

[pandas](https://pandas.pydata.org/) is an open source library which helps you analyse and manipulate data.

<img src="images/pandas-6-step-ml-framework-tools-highlight.png" alt="a 6 step machine learning framework along with tools you can use for each step" width="700"/>

## Why pandas?

pandas provides a simple to use but very capable set of functions you can use on your data.

It's integrated with many other data science and machine learning tools which use Python so having an understanding of it will be helpful throughout your journey.

One of the main use cases you'll come across is using pandas to transform your data in a way which makes it usable with machine learning algorithms.

## What does this notebook cover?

Because the pandas library is vast, there are often many ways to do the same thing. This notebook covers some of the most fundamental functions of the library, which are more than enough to get started.

## Where can I get help?

If you get stuck or think of something you'd like to do which this notebook doesn't cover, don't fear!

The recommended steps you take are:
1. **Try it** - Since pandas is very friendly, your first step should be to use what you know and try to figure out the answer to your own question (getting it wrong is part of the process). If in doubt, run your code.
2. **Search for it** - If trying it on your own doesn't work, since someone else has probably tried to do something similar, try searching for your problem in the following places (either via a search engine or direct):
    * [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/) - the best place for learning pandas, this resource covers all of the pandas functionality.
    * [Stack Overflow](https://stackoverflow.com/) - the developers' Q&A hub. It's full of questions and answers on problems across a wide range of software development topics and chances are, there's one related to your problem.
    <!-- * [ChatGPT](https://chat.openai.com/) - ChatGPT is very good at explaining code, however, it can make mistakes. Best to verify the code it writes first before using it. Try asking "Can you explain the following code for me? {your code here}" and then continue with follow up questions from there.
     -->
An example of searching for a pandas function might be:

> "how to fill all the missing values of two columns using pandas"

Searching this on Google leads to this post on Stack Overflow: https://stackoverflow.com/questions/36556256/how-do-i-fill-na-values-in-multiple-columns-in-pandas

The next steps here are to read through the post and see if it relates to your problem. If it does, great, take the code/information you need and **rewrite it** to suit your own problem.

3. **Ask for help** - If you've been through the above 2 steps and you're still stuck, you might want to ask your question on Stack Overflow. Remember to be as specific as possible and provide details on what you've tried.
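As a rough sketch of where the example search above might lead, the snippet below fills the missing values of two columns at once with `fillna()`. The `DataFrame` and its column names (`a`, `b`, `c`) are made up for illustration and are not part of this notebook's data.

```python
import pandas as pd

# Hypothetical data with a missing value (NaN) in every column
df = pd.DataFrame({
    "a": [1.0, float("nan"), 3.0],
    "b": [float("nan"), 5.0, 6.0],
    "c": [7.0, float("nan"), 9.0],
})

# Fill the missing values of columns "a" and "b" only, leaving "c" untouched
df[["a", "b"]] = df[["a", "b"]].fillna(0)
print(df)
```

Filling with `0` is only one choice. Depending on your problem, a column mean or some other value might suit the data better.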

Remember, you don't have to learn all of these functions off by heart to begin with.

What's most important is remembering to continually ask yourself, "what am I trying to do with the data?".

Start by answering that question and then practice finding the code which does it.

Let's get started.

## 0. Importing pandas

To get started using pandas, the first step is to import it.

The most common way (and method you should use) is to import pandas as the abbreviation `pd` (e.g. `pandas` -> `pd`).

If you see the letters `pd` used anywhere in machine learning or data science, it's probably referring to the pandas library.

In [2]:
import pandas as pd

# Print the version
print(f"pandas version: {pd.__version__}")

pandas version: 2.2.2


## 1. Datatypes

pandas has two main datatypes, `Series` and `DataFrame`.
* [`pandas.Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) - a 1-dimensional column of data.
* [`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) (most common) - a 2-dimensional table of data with rows and columns.

You can create a `Series` using `pd.Series()` and passing it a Python list.

In [3]:
# Creating a series of car types ["BMW", "Toyota", "Honda"]
car_types = pd.Series(["BMW", "Toyota", "Honda"])
car_types

Unnamed: 0,0
0,BMW
1,Toyota
2,Honda


In [None]:
# Creating a series of colours
car_colours = pd.Series(["Red", "Blue", "White"])
car_colours

Unnamed: 0,0
0,Red
1,Blue
2,White


You can create a `DataFrame` by using `pd.DataFrame()` and passing it a Python dictionary.

Let's use our two `Series` as the values.

In [4]:
# Creating a DataFrame of cars and colours blue, red, white
# car_data = ..
car_types = pd.Series(["BMW", "Toyota", "Honda"])
car_colours = pd.Series(["Blue", "Red", "White"])
car_data = pd.DataFrame({"Car type": car_types, "Colour": car_colours})
car_data


Unnamed: 0,Car type,Colour
0,BMW,Blue
1,Toyota,Red
2,Honda,White


You can see the keys of the dictionary became the column headings (text in bold) and the values of the two `Series`'s became the values in the DataFrame.

It's important to note, many different types of data could go into the DataFrame.

Here we've used only text but you could use floats, integers, dates and more.
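To make that concrete, here is a small sketch of a `DataFrame` which mixes text, integers, floats and dates. The column names and values are invented for this example.

```python
import pandas as pd

# Hypothetical car data with a different kind of value in each column
mixed = pd.DataFrame({
    "Car type": ["BMW", "Toyota", "Honda"],          # text
    "Doors": [5, 4, 4],                              # integers
    "Engine size (L)": [3.0, 1.8, 2.0],              # floats
    "Date sold": pd.to_datetime(["2024-01-15", "2024-02-03", "2024-03-21"]),  # dates
})

# Each column keeps its own datatype
print(mixed.dtypes)
```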

### Exercises

1. Make a `Series` of different foods.
2. Make a `Series` of different dollar values (these can be integers).
3. Combine your `Series`'s of foods and dollar values into a `DataFrame`.

Try it out for yourself first, then see how your code goes against the solution.

**Note:** Make sure your two `Series` are the same size before combining them in a DataFrame.
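A quick sketch of why the sizes matter: pandas lines up `Series` by their index, so a shorter `Series` leaves gaps which get filled with `NaN` (missing values). The names `foods`, `prices` and `menu` below are only for illustration.

```python
import pandas as pd

foods = pd.Series(["Pizza", "Burger", "Salad"])
prices = pd.Series([10, 20])  # one value short on purpose

# The missing price for index 2 becomes NaN
menu = pd.DataFrame({"Food": foods, "Price": prices})
print(menu)
```

Because `NaN` is a float, the `Price` column ends up as a float column rather than an integer one.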

In [5]:
# Your code here

food_series = pd.Series(["Pizza", "Burger", "Salad"])
food_series

Unnamed: 0,0
0,Pizza
1,Burger
2,Salad


In [6]:
price_series = pd.Series([10, 20, 30])
price_series

Unnamed: 0,0
0,10
1,20
2,30


In [7]:
food_series = pd.Series(["Pizza", "Burger", "Salad"])
price_series = pd.Series([10, 20, 30])
food_data = pd.DataFrame({"Food": food_series, "Price": price_series})
food_data

Unnamed: 0,Food,Price
0,Pizza,10
1,Burger,20
2,Salad,30


## 2. Importing data

Creating `Series` and `DataFrame`'s from scratch is nice but what you'll usually be doing is importing your data in the form of a `.csv` (comma separated value), spreadsheet file or something similar such as an [SQL database](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html).

pandas allows for easy importing of data like this through functions such as [`pd.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) and [`pd.read_excel()`](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) (for Microsoft Excel files).
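If you don't have a file handy, the following sketch shows `pd.read_csv()` working on CSV text held in memory. Wrapping the text in `io.StringIO` makes it behave like a file. The two-row sample is made up for the example.

```python
import io

import pandas as pd

# A tiny, made-up CSV: the first line holds the column names
csv_text = "Make,Colour,Doors\nToyota,White,4\nHonda,Red,4\n"

# read_csv accepts file paths as well as file-like objects such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```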

Say you wanted to get this information from this Google Sheet document into a pandas `DataFrame`.

<img src="images/pandas-car-sales-csv.png" alt="spreadsheet with car sales information" width="600">

You could export it as a `.csv` file and then import it using `pd.read_csv()`.

> **Tip:** If the Google Sheet is public, `pd.read_csv()` can read it via URL. Try searching for "pandas read Google Sheet with URL".

In this case, the exported `.csv` file is called `car-sales.csv`.

In [8]:
# Import car sales data
import pandas as pd

car_sales = pd.read_csv("car-sales.csv")
car_sales


Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
1,Honda,Red,87899,4,"$5,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
3,BMW,Black,11179,5,"$22,000.00"
4,Nissan,White,213095,4,"$3,500.00"
5,Toyota,Green,99213,4,"$4,500.00"
6,Honda,Blue,45698,4,"$7,500.00"
7,Honda,Blue,54738,4,"$7,000.00"
8,Toyota,White,60000,4,"$6,250.00"
9,Nissan,White,31600,4,"$9,700.00"


Now we've got the same data from the spreadsheet available in a pandas `DataFrame` called `car_sales`.

Having your data available in a `DataFrame` allows you to take advantage of all of pandas' functionality on it.

Another common practice you'll see is data being imported to a `DataFrame` called `df` (short for `DataFrame`).

In [9]:
# Import the car sales data and save it to df

car_sales = pd.read_csv("car-sales.csv")
car_sales

df = pd.read_csv("car-sales.csv")
df

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
1,Honda,Red,87899,4,"$5,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
3,BMW,Black,11179,5,"$22,000.00"
4,Nissan,White,213095,4,"$3,500.00"
5,Toyota,Green,99213,4,"$4,500.00"
6,Honda,Blue,45698,4,"$7,500.00"
7,Honda,Blue,54738,4,"$7,000.00"
8,Toyota,White,60000,4,"$6,250.00"
9,Nissan,White,31600,4,"$9,700.00"


Now `car_sales` and `df` contain the exact same information, the only difference is the name. Like any other variable, you can name your `DataFrame`'s whatever you want. But best to choose something simple.

### Anatomy of a DataFrame

Different functions use different labels for different things. This graphic sums up some of the main components of `DataFrame`'s and their different names.

<img src="images/pandas-dataframe-anatomy.png" alt="pandas dataframe with different sections labelled" width="800"/>


## 3. Exporting data

After you've made a few changes to your data, you might want to export it and save it so someone else can access the changes.

pandas allows you to export `DataFrame`'s to `.csv` format using [`.to_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) or spreadsheet format using [`.to_excel()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html).

We haven't made any changes yet to the `car_sales` `DataFrame` but let's try exporting it.

In [10]:
# Export the car sales DataFrame to csv
import os

# Ensure the "exported" directory exists before writing to it
os.makedirs("exported", exist_ok=True)

# index=False leaves the row index out of the exported file
car_sales.to_csv("exported/exported-car-sales.csv", index=False)

print("DataFrame exported successfully to 'exported/exported-car-sales.csv'")


DataFrame exported successfully to 'exported/exported-car-sales.csv'


Running this will save a file called `exported-car-sales.csv` to the `exported` folder.

<img src="images/pandas-exported-car-sales-csv.png" alt="folder with exported car sales csv file highlighted" width="600"/>
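The `index` parameter of `.to_csv()` is worth a closer look. In the sketch below, `.to_csv()` is called without a file path so it returns the CSV text instead of writing a file. The `cars` data is made up for the example.

```python
import io

import pandas as pd

cars = pd.DataFrame({"Make": ["Toyota", "Honda"], "Doors": [4, 4]})

# By default the index is written as an extra first column with no name
with_index = cars.to_csv()

# index=False leaves the index out
without_index = cars.to_csv(index=False)

# Reading the first version back in produces an "Unnamed: 0" column
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))
```

If you ever see an `Unnamed: 0` column after importing a file, an exported index is a likely cause.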

### Exercises

1. Practice importing a `.csv` file using `pd.read_csv()`. You can download `heart-disease.csv`, which contains anonymous patient medical records and whether or not they have heart disease.
2. Practice exporting a `DataFrame` using `.to_csv()`. You could export the heart disease `DataFrame` after you've imported it.

**Note:**
* Make sure the `heart-disease.csv` file is in the same folder as your notebook or be sure to use the filepath where the file is.
* You can name the variables and exported files whatever you like but make sure they're readable.

In [11]:
# Your code here

# 1. Import data
import pandas as pd

# Import the CSV file
heart_disease_data = pd.read_csv("heart-disease.csv")

# Show the first 5 rows
print(heart_disease_data.head())


   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   3       145   233    1        0      150      0      2.3      0   
1   37    1   2       130   250    0        1      187      0      3.5      0   
2   41    0   1       130   204    0        0      172      0      1.4      2   
3   56    1   1       120   236    0        1      178      0      0.8      2   
4   57    0   0       120   354    0        1      163      1      0.6      2   

   ca  thal  target  
0   0     1       1  
1   0     2       1  
2   0     2       1  
3   0     2       1  
4   0     2       1  


In [12]:
# 2. Export the heart disease DataFrame to a new CSV file
heart_disease_data.to_csv("exported_heart_disease.csv", index=False)

print("File exported successfully as 'exported_heart_disease.csv'")


File exported successfully as 'exported_heart_disease.csv'


## 4. Describing data

One of the first things you'll want to do after you import some data into a pandas `DataFrame` is to start exploring it.

pandas has many built in functions which allow you to quickly get information about a `DataFrame`.

Let's explore some using the heart disease `DataFrame` imported in the exercise above.

In [15]:
# Your code here
import pandas as pd

# Load the data from the CSV file
heart_disease_df = pd.read_csv("heart-disease.csv")

# Show the first 5 rows of the DataFrame
print("First rows of the DataFrame:")
print(heart_disease_df.head())

# Show general information about the DataFrame
# (.info() prints its report itself and returns None, so no print() is needed)
print("\nInfo summary:")
heart_disease_df.info()

# Get descriptive statistics for the numeric columns
print("\nDescriptive statistics:")
print(heart_disease_df.describe())

# Show the datatype of each column
print("\nColumn datatypes:")
print(heart_disease_df.dtypes)

# Check for missing values
print("\nMissing values in each column:")
print(heart_disease_df.isnull().sum())


First rows of the DataFrame:
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   3       145   233    1        0      150      0      2.3      0   
1   37    1   2       130   250    0        1      187      0      3.5      0   
2   41    0   1       130   204    0        0      172      0      1.4      2   
3   56    1   1       120   236    0        1      178      0      0.8      2   
4   57    0   0       120   354    0        1      163      1      0.6      2   

   ca  thal  target  
0   0     1       1  
1   0     2       1  
2   0     2       1  
3   0     2       1  
4   0     2       1  

Info summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-n

[`.dtypes`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html) shows us what datatype each column contains.

In [17]:
import pandas as pd

# Load the data from the CSV file
heart_disease_df = pd.read_csv("heart-disease.csv")

# Show the datatype of each column
print(heart_disease_df.dtypes)



age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object


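Here every column is numeric: `oldpeak` is a `float64` and the rest are `int64`. In the `car_sales` `DataFrame`, by contrast, the `Price` column is stored as text (`object`) rather than as a number like `Odometer (KM)` or `Doors`. Don't worry, pandas makes this easy to fix.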
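One possible way to turn price text such as `"$4,000.00"` into numbers is sketched below. It assumes every value follows the same dollar-sign-and-commas format as `car-sales.csv`. The variable names are only for illustration.

```python
import pandas as pd

# Price strings in the same format as the car sales data
prices = pd.Series(["$4,000.00", "$5,000.00", "$22,000.00"])

# Remove the "$" and "," characters, then convert the remaining text to floats
cleaned = prices.str.replace(r"[$,]", "", regex=True).astype(float)
print(cleaned)
```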

[`.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) gives you a quick statistical overview of the numerical columns.

In [18]:
# Your code here
# Get descriptive statistics for the numeric columns
print("\nDescriptive statistics:")
print(heart_disease_df.describe())


Descriptive statistics:
              age         sex          cp    trestbps        chol         fbs  \
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000   
mean    54.366337    0.683168    0.966997  131.623762  246.264026    0.148515   
std      9.082101    0.466011    1.032052   17.538143   51.830751    0.356198   
min     29.000000    0.000000    0.000000   94.000000  126.000000    0.000000   
25%     47.500000    0.000000    0.000000  120.000000  211.000000    0.000000   
50%     55.000000    1.000000    1.000000  130.000000  240.000000    0.000000   
75%     61.000000    1.000000    2.000000  140.000000  274.500000    0.000000   
max     77.000000    1.000000    3.000000  200.000000  564.000000    1.000000   

          restecg     thalach       exang     oldpeak       slope          ca  \
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000   
mean     0.528053  149.646865    0.326733    1.039604    1.399340    0.729373   

[`.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) shows a handful of useful information about a `DataFrame` such as:
* How many entries (rows) there are
* Whether there are missing values (if a column's non-null count is less than the number of entries, it has missing values)
* The datatypes of each column
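If you only care about missing values, a more direct check is to count them per column. The sketch below uses a small made-up `DataFrame` with one missing `age` value.

```python
import pandas as pd

# Hypothetical patient data with one missing age
patients = pd.DataFrame({"age": [63, None, 41], "chol": [233, 250, 204]})

# .isna() marks missing values as True, .sum() counts them per column
missing_per_column = patients.isna().sum()
print(missing_per_column)
```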

In [19]:
# Your code here
# .info() prints its report itself and returns None, so no print() is needed
print("\nInfo summary:")
heart_disease_df.info()


Info summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


You can also call various statistical and mathematical methods such as [`.mean()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html) or [`.sum()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html) directly on a `DataFrame` or `Series`.

In [20]:
# Calling .mean() on a DataFrame
# Hint: use numeric_only=True to get the mean values of numeric columns only
import pandas as pd

# Load the data from the CSV file
heart_disease_df = pd.read_csv("heart-disease.csv")

# Calculate the mean of the numeric columns only
mean_values = heart_disease_df.mean(numeric_only=True)

# Show the results
print(mean_values)


age          54.366337
sex           0.683168
cp            0.966997
trestbps    131.623762
chol        246.264026
fbs           0.148515
restecg       0.528053
thalach     149.646865
exang         0.326733
oldpeak       1.039604
slope         1.399340
ca            0.729373
thal          2.313531
target        0.544554
dtype: float64


In [21]:
# Calling .mean() on a Series
car_prices = pd.Series([3000, 3500, 11250])
car_prices.mean()

5916.666666666667

In [22]:
# Calling .sum() on a DataFrame
import pandas as pd

# Load the data from the CSV file
heart_disease_df = pd.read_csv("heart-disease.csv")

# Calculate the sum of the numeric columns only
sum_values = heart_disease_df.sum(numeric_only=True)

# Show the results
print(sum_values)


age         16473.0
sex           207.0
cp            293.0
trestbps    39882.0
chol        74618.0
fbs            45.0
restecg       160.0
thalach     45343.0
exang          99.0
oldpeak       315.0
slope         424.0
ca            221.0
thal          701.0
target        165.0
dtype: float64


Your remark?

In [23]:
# Calling .sum() on a DataFrame with numeric_only=True
# Calculate the sum of the numeric columns only in car_sales
sum_values = car_sales.sum(numeric_only=True)

# Show the results
print(sum_values)


Price       70000
In_stock        2
dtype: int64


In [24]:
# Calling .sum() on a Series
import pandas as pd

# Example Series with numeric data
car_prices = pd.Series([20000, 30000, 25000, 15000, 22000])

# Calculate the sum of the values in the Series
total_price = car_prices.sum()

# Show the result
print(total_price)


112000


Calling these on a whole `DataFrame` may not be as helpful as targeting an individual column. But it's helpful to know they're there.
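Targeting an individual column might look like the following sketch. The three rows are taken from the car sales data above and the variable names are only for illustration.

```python
import pandas as pd

cars = pd.DataFrame({
    "Make": ["Toyota", "Honda", "BMW"],
    "Odometer (KM)": [150043, 87899, 11179],
})

# Select one column (a Series), then call the method on it
mean_km = cars["Odometer (KM)"].mean()
total_km = cars["Odometer (KM)"].sum()
print(mean_km, total_km)
```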

`.columns` will show you all the columns of a `DataFrame`.

In [25]:
import pandas as pd

# Example DataFrame
car_sales = pd.DataFrame({
    'Car': ['BMW', 'Toyota', 'Honda'],
    'Color': ['Blue', 'Red', 'White'],
    'Price': [25000, 20000, 22000]
})

# Show the column names
print(car_sales.columns)


Index(['Car', 'Color', 'Price'], dtype='object')


You can save them to a list which you could use later.

In [26]:
# Save car_sales columns to a list
car_sales = pd.DataFrame({
    'Make': ['BMW', 'Toyota', 'Honda'],
    'Color': ['Blue', 'Red', 'White'],
    'Price': [25000, 20000, 22000]
})

# Convert the column labels to a plain Python list
car_columns = list(car_sales.columns)

# Show the first element of the list
print(car_columns[0])




Make


`.index` will show you the values in a `DataFrame`'s index (the column on the far left).

In [34]:
#your code here
car_sales.index

RangeIndex(start=0, stop=3, step=1)

pandas `DataFrame`'s, like Python lists, are 0-indexed (unless otherwise changed). This means they start at 0.

<img src="images/pandas-dataframe-zero-indexed.png" alt="dataframe with index number 0 highlighted" width="700"/>

In [31]:
# Show the length of a DataFrame, Hint use len()
import pandas as pd

# Example DataFrame (the column names and colours are in French)
df = pd.DataFrame({
    'Marque': ['BMW', 'Toyota', 'Honda', 'Audi', 'Ford', 'Chevrolet', 'Nissan', 'Kia', 'Hyundai', 'Mercedes'],
    'Couleur': ['Bleu', 'Rouge', 'Blanc', 'Noir', 'Gris', 'Jaune', 'Vert', 'Orange', 'Violet', 'Beige'],
    'Prix': [25000, 20000, 22000, 30000, 18000, 21000, 25000, 26000, 23000, 27000]
})

# Show the length (number of rows) of the DataFrame
print(len(df))


10


So even though the length of this `DataFrame` is 10, its index labels go from 0 to 9.
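The sketch below checks that relationship on a small made-up `DataFrame` and also shows what "unless otherwise changed" can look like, by swapping the default index for letters.

```python
import pandas as pd

cars = pd.DataFrame({"Make": ["Toyota", "Honda", "BMW"]})

# With the default index, the last label is always the length minus 1
last_label = cars.index[-1]
print(len(cars), last_label)

# The index can be replaced, here with hypothetical letter labels
relabelled = cars.copy()
relabelled.index = ["a", "b", "c"]
print(relabelled)
```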

## 5. Viewing and selecting data

Some common methods for viewing and selecting data in a pandas DataFrame include:

* [`DataFrame.head(n=5)`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) - Displays the first `n` rows of a DataFrame (e.g. `car_sales.head()` will show the first 5 rows of the `car_sales` DataFrame).
* [`DataFrame.tail(n=5)`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html) - Displays the last `n` rows of a DataFrame.
* [`DataFrame.loc[]`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) - Accesses a group of rows and columns by labels or a boolean array.
* [`DataFrame.iloc[]`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) - Accesses a group of rows and columns by integer position (e.g. `car_sales.iloc[0]` shows all the columns of the row at position `0`).
* [`DataFrame.columns`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html) -  Lists the column labels of the DataFrame.
* `DataFrame['A']` - Selects the column named `'A'` from the DataFrame.
* `DataFrame[DataFrame['A'] > 5]` - Boolean indexing filters rows based on column values meeting a condition (e.g. all rows where column `'A'` is greater than `5`).
* [`DataFrame.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) - Creates a line plot of a DataFrame's columns (e.g. plot `Make` vs. `Odometer (KM)` columns with `car_sales[["Make", "Odometer (KM)"]].plot();`).
* [`DataFrame.hist()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html) - Generates histograms for columns in a DataFrame.
* [`pandas.crosstab()`](https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html) - Computes a cross-tabulation of two or more factors.
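Of the methods listed above, `pd.crosstab()` is perhaps the least self-explanatory, so here is a brief sketch. It counts how many cars fall into each `Make` and `Doors` combination, using the values from the car sales data shown earlier.

```python
import pandas as pd

car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan",
             "Toyota", "Honda", "Honda", "Toyota", "Nissan"],
    "Doors": [4, 4, 3, 5, 4, 4, 4, 4, 4, 4],
})

# Rows are the unique makes, columns are the unique door counts,
# each cell is the number of cars with that combination
table = pd.crosstab(car_sales["Make"], car_sales["Doors"])
print(table)
```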

In practice, you'll constantly be making changes to your data, and viewing it. Changing it, viewing it, changing it, viewing it.

You won't always want to change all of the data in your `DataFrame`'s either. So there are just as many different ways to select data as there are to view it.

`.head()` allows you to view the first 5 rows of your `DataFrame`. You'll likely be using this one a lot.

In [33]:
# Show the first 5 rows of car_sales
print(car_sales.head())

     Make  Color  Price
0     BMW   Blue  25000
1  Toyota    Red  20000
2   Honda  White  22000


Why 5 rows? Good question. I don't know the answer. But 5 seems like a good amount.

Want more than 5?

No worries, you can pass `.head()` an integer to display more than or less than 5 rows.

In [35]:
# Show the first 7 rows of car_sales
print(car_sales.head(7))

     Make  Color  Price
0     BMW   Blue  25000
1  Toyota    Red  20000
2   Honda  White  22000


`.tail()` allows you to see the bottom 5 rows of your `DataFrame`. This is helpful if your changes are influencing the bottom rows of your data.

In [36]:
# Show bottom 5 rows of car_sales
print(car_sales.tail())

     Make  Color  Price
0     BMW   Blue  25000
1  Toyota    Red  20000
2   Honda  White  22000


You can use `.loc[]` and `.iloc[]` to select data from your `Series` and `DataFrame`'s.

Let's see.

In [37]:
# Create a sample series
animals = pd.Series(["cat", "dog", "bird", "snake", "ox", "lion"],
                    index=[0, 3, 9, 8, 67, 3])
animals

Unnamed: 0,0
0,cat
3,dog
9,bird
8,snake
67,ox
3,lion


[`.loc[]`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) takes an index label as input (here the labels happen to be integers). It returns every entry of your `Series` or `DataFrame` whose index label matches.

In [38]:
# Select all indexes with 3
animals.loc[3]

Unnamed: 0,0
3,dog
3,lion


In [39]:
# Select index 0
animals.loc[0]

'cat'

Let's try with our `car_sales` DataFrame.

In [40]:
# print car_sales
# Your code here
import pandas as pd

# Build the data for the DataFrame
car_data = {
    'Make': ['Toyota', 'Honda', 'Toyota', 'BMW', 'Nissan', 'Toyota', 'Honda', 'Honda', 'Toyota', 'Nissan'],
    'Colour': ['White', 'Red', 'Blue', 'Black', 'White', 'Green', 'Blue', 'Blue', 'White', 'White'],
    'Odometer (KM)': [150043, 87899, 32549, 11179, 213095, 99213, 45698, 54738, 60000, 31600],
    'Doors': [4, 4, 3, 5, 4, 4, 4, 4, 4, 4],
    'Price': ['$4,000.00', '$5,000.00', '$7,000.00', '$22,000.00', '$3,500.00', '$4,500.00', '$7,500.00', '$7,000.00', '$6,250.00', '$9,700.00']
}

# Create the DataFrame
car_sales = pd.DataFrame(car_data)

# Show the DataFrame
print(car_sales)



     Make Colour  Odometer (KM)  Doors       Price
0  Toyota  White         150043      4   $4,000.00
1   Honda    Red          87899      4   $5,000.00
2  Toyota   Blue          32549      3   $7,000.00
3     BMW  Black          11179      5  $22,000.00
4  Nissan  White         213095      4   $3,500.00
5  Toyota  Green          99213      4   $4,500.00
6   Honda   Blue          45698      4   $7,500.00
7   Honda   Blue          54738      4   $7,000.00
8  Toyota  White          60000      4   $6,250.00
9  Nissan  White          31600      4   $9,700.00


In [41]:
# Select row at index 3
car_sales.loc[3]

Unnamed: 0,3
Make,BMW
Colour,Black
Odometer (KM),11179
Doors,5
Price,"$22,000.00"


[`iloc[]`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) does a similar thing but works with exact positions.


In [43]:
# Recreate the animals Series from earlier (note its out-of-order index)
animals = pd.Series(["cat", "dog", "bird", "snake", "ox", "lion"],
                    index=[0, 3, 9, 8, 67, 3])
animals

Unnamed: 0,0
0,cat
3,dog
9,bird
8,snake
67,ox
3,lion


In [44]:
# Select row at position 3 of animals
animals.iloc[3]

'snake'


Even though `'snake'` appears at index 8 in the series, it's shown using `.iloc[3]` because it's at the 3rd (starting from 0) position.

Let's try with the `car_sales` `DataFrame`.

In [45]:
# Select row at position 3
car_sales.iloc[3]

Unnamed: 0,3
Make,BMW
Colour,Black
Odometer (KM),11179
Doors,5
Price,"$22,000.00"


In [None]:
#remark ?
#write here


You can see it's the same as `.loc[]` because the index is in order, position 3 is the same as index 3.

You can also use slicing with `.loc[]` and `.iloc[]`.

In [None]:
# Get all rows up to position 3
animals.iloc[:3]

0     cat
3     dog
9    bird
dtype: object

In [None]:
#this is what we call slicing, we will practice on that later :)

In [46]:
# Get all rows up to (and including) index 3
car_sales.loc[:3]

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
1,Honda,Red,87899,4,"$5,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
3,BMW,Black,11179,5,"$22,000.00"


In [None]:
# Get all rows of the "Colour" column
car_sales.loc[:, "Colour"] # note: ":" stands for "all", e.g. "all indices in the first axis"

0    White
1      Red
2     Blue
3    Black
4    White
5    Green
6     Blue
7     Blue
8    White
9    White
Name: Colour, dtype: object

When should you use `.loc[]` or `.iloc[]`?
* Use `.loc[]` when you're selecting rows and columns **based on their labels or a condition** (e.g. retrieving data for specific columns).
* Use `.iloc[]` when you're selecting rows and columns **based on their integer index positions** (e.g. extracting the first ten rows regardless of the labels).

However, in saying this, it will often take a bit of practice with each of the methods before you figure out which you'd like to use.
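To help with that practice, here is a side-by-side sketch of the two. It uses a made-up `DataFrame` with letter labels so that labels and positions can't be confused. Note that a `.loc[]` slice includes its end label while an `.iloc[]` slice stops before its end position.

```python
import pandas as pd

cars = pd.DataFrame(
    {
        "Make": ["Toyota", "Honda", "BMW", "Nissan"],
        "Odometer (KM)": [150043, 87899, 11179, 31600],
    },
    index=["a", "b", "c", "d"],  # hypothetical letter labels
)

# Label-based slice: rows "b" and "c" (end label included)
by_label = cars.loc["b":"c"]

# Position-based slice: positions 1 and 2 (end position 3 excluded)
by_position = cars.iloc[1:3]

# .loc[] also accepts a condition, optionally with a column to return
high_km_makes = cars.loc[cars["Odometer (KM)"] > 100000, "Make"]

print(by_label)
print(by_position)
print(high_km_makes)
```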

If you want to select a particular column, you can use `DataFrame['COLUMN_NAME']`.

In [None]:
# Select Make column
car_sales['Make']

0    Toyota
1     Honda
2    Toyota
3       BMW
4    Nissan
5    Toyota
6     Honda
7     Honda
8    Toyota
9    Nissan
Name: Make, dtype: object

In [None]:
# Select Colour column
car_sales['Colour']

0    White
1      Red
2     Blue
3    Black
4    White
5    Green
6     Blue
7     Blue
8    White
9    White
Name: Colour, dtype: object

Boolean indexing works with column selection too. Using it will select the rows which fulfill the condition in the brackets.

In [None]:
# Select cars with over 100,000 on the Odometer
car_sales[car_sales["Odometer (KM)"] > 100000]

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
4,Nissan,White,213095,4,"$3,500.00"


In [None]:
# Select cars with less than 100,000 on the Odometer
# Your code here

In [None]:
# Select cars which are made by Toyota
car_sales[car_sales["Make"] == "Toyota"]

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
5,Toyota,Green,99213,4,"$4,500.00"
8,Toyota,White,60000,4,"$6,250.00"
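Conditions can also be combined. The sketch below selects Toyotas with under 100,000 on the odometer, using `&` for "and" (`|` would mean "or"). Each condition needs its own parentheses. The data repeats two columns of the car sales table above.

```python
import pandas as pd

car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan",
             "Toyota", "Honda", "Honda", "Toyota", "Nissan"],
    "Odometer (KM)": [150043, 87899, 32549, 11179, 213095,
                      99213, 45698, 54738, 60000, 31600],
})

# Build a boolean mask from two conditions, then use it to filter the rows
mask = (car_sales["Make"] == "Toyota") & (car_sales["Odometer (KM)"] < 100000)
selected = car_sales[mask]
print(selected)
```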


In [None]:
# Select cars with 4 doors
# Your code here


## Summary

### Main topics we covered
* [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) - a single column (can be multiple rows) of values.
* [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) - multiple columns/rows of values (a DataFrame is comprised of multiple Series).
* [Importing data](https://pandas.pydata.org/pandas-docs/stable/reference/io.html) - we used [`pd.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas-read-csv) to read in a CSV (comma-separated values) file but there are multiple options for reading data.
* [Exporting data](https://pandas.pydata.org/pandas-docs/stable/reference/io.html) - we exported our data using `to_csv()`, however there are multiple methods of exporting data.
* [Describing data](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)
    * `df.dtypes` - find the datatypes present in a dataframe.
    * `df.describe()` - find various numerical features of a dataframe.
    * `df.info()` - find the number of rows and whether or not any of them are empty.
* [Viewing and selecting data](https://pandas.pydata.org/docs/user_guide/10min.html#viewing-data)
    * `df.head()` - view the first 5 rows of `df`.
    * `df.loc` & `df.iloc` - select specific parts of a dataframe.
    * `df['A']` - select column `A` of `df`.
    * `df[df['A'] > 1000]` - select the rows of `df` where column `A` has values over 1000.
    * `df['A'].plot()` - plot values from column `A` using matplotlib (defaults to line graph).
* [Manipulating data and performing operations](https://pandas.pydata.org/docs/user_guide/10min.html#operations) - pandas has many built-in functions you can use to manipulate data, also many of the Python operators (e.g. `+`, `-`, `>`, `==`) work with pandas.
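As a small taste of that last point, the sketch below applies `*` and `>` to a whole column at once. The kilometres-to-miles factor (about 0.621371) and the new column name are choices made for this example.

```python
import pandas as pd

cars = pd.DataFrame({"Odometer (KM)": [150043, 87899, 32549]})

# Arithmetic operators apply to every value in the column
cars["Odometer (miles)"] = cars["Odometer (KM)"] * 0.621371

# Comparison operators return a Series of True/False values
over_50k = cars["Odometer (KM)"] > 50000

print(cars)
print(over_50k)
```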

### Further reading
Since pandas is such a large library, it would be impossible to cover it all in one go.

The following are some resources you might want to look into for more.
* [Python for Data Analysis by Wes McKinney](https://wesmckinney.com/book/) - possibly the most complete text of the pandas library (apart from the documentation itself) written by the creator of pandas.
* [Medium article](https://medium.com/datadriveninvestor/the-only-30-methods-you-should-master-to-become-a-pandas-pro-b91b8b36bc2d) - articles on Medium can be a helpful companion during your journey. Practise searching them for information and answers to your questions.

