# Introduction to pandas
In this notebook, we will learn the basics of the `pandas` library. This is a powerful package that is often used to work with data in Python. It allows you to look at your data, perform manipulations, and create visualizations.

There is a lot to learn about `pandas`, but we will cover the basics in this notebook. Check out the [official documentation](https://pandas.pydata.org/pandas-docs/stable/reference/io.html) for more information about the available functions and methods.

This lesson covers the following topics:
- Reading data as a DataFrame
- Basic operations on a DataFrame: 
    - Selecting 
    - Counting 
    - Calculating the average
    - Adding new data
    - Dealing with missing data
    - Sorting 
- Writing data to a CSV file 


First, you have to import the `pandas` package. You can do this by running the code below. Note that it's common practice to import `pandas` as `pd` to save time when typing. 

In [1]:
import pandas as pd

`pandas` works with a structure called a `DataFrame`. This is a table with rows and columns. You can think of it as a spreadsheet. It consists of `columns` (vertical series of data) and `rows` (horizontal series).
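
To make the idea of rows and columns concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The variable name `example` is only an illustration, and the three songs are taken from the dataset used later in this notebook.

```python
import pandas as pd

# Each dictionary key becomes a column name,
# and each list holds that column's values, one per row.
example = pd.DataFrame({
    'song': ['Espresso', 'Too Sweet', 'Europapa'],
    'artist': ['Sabrina Carpenter', 'Hozier', 'Joost'],
    'length': [175, 251, 160],
})

print(example.shape)          # (number of rows, number of columns)
print(list(example.columns))  # the column names
```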

#### Loading data

Let's start by loading a CSV file and saving it to a variable. You can do this by using the `read_csv` function, which is a `pandas` function. This function takes the `path` to the file as a so-called `argument`. An argument is information that you pass to a function to customize it. In many cases, the function will not work without arguments. Arguments are placed between the parentheses of the function, and are separated by commas. 

The second argument we will use is `sep`. This is short for `separator`. It tells the function which character is used to separate the columns in the CSV file. The default separator is a comma, but you can also use a semicolon, tab, or any other character. Since the comma is the default separator, you technically don't have to specify it, but we do it here for illustration purposes.

You can give your DataFrame any name you want, but it is common to call it `df`. This is short for `DataFrame`.

In [2]:
df = pd.read_csv('../../data/songs.csv', sep = ',')
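
To see what `sep` does, the sketch below reads data that is separated by semicolons instead of commas. It uses `io.StringIO` as a stand-in for a file on disk, so the text and the names `semicolon_text` and `df_semicolon` are made up for this illustration.

```python
import io
import pandas as pd

# Stand-in for a file: two songs, with columns separated by semicolons.
semicolon_text = (
    "song;artist;length\n"
    "Espresso;Sabrina Carpenter;175\n"
    "Too Sweet;Hozier;251\n"
)

# Without sep=';' pandas would look for commas and read every line as one single column.
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=';')
print(df_semicolon)
```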

`pandas` supports many file types, such as Excel files, JSON files, and more. Here is [an overview](https://pandas.pydata.org/docs/reference/io.html) of the supported file types. 

The function you use to load your data will depend on the file type. For example, the function to load an Excel file is `read_excel()`. See the [official documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) for more information about the function and its arguments. 

#### Inspecting data

The `head()` function allows you to see the first rows of the DataFrame: 5 by default, or the number of rows you pass as an argument. This is useful to get a quick overview of the data.

In [3]:
df.head(5)

Unnamed: 0,song,artist,length,language,rating
0,BIRDS OF A FEATHER,Billie Eilish,210,English,6.5
1,Espresso,Sabrina Carpenter,175,English,7.2
2,Not Like Us,Kendrick Lamar,274,English,8.0
3,Si Antes Te Hubiera Conocido,KAROL G,198,Spanish,5.3
4,Ik Wil Dansen,Froukje,194,Dutch,9.6


You can also use the `tail()` function to see the last rows of the DataFrame. You can specify the number of rows you want to see by passing an argument to the function.

In [4]:
df.tail(6)

Unnamed: 0,song,artist,length,language,rating
4,Ik Wil Dansen,Froukje,194,Dutch,9.6
5,Too Sweet,Hozier,251,English,3.3
6,Europapa,Joost,160,Dutch,4.7
7,Who,Jimin,170,English,
8,Supernova,aespa,179,Korean,7.6
9,LUNCH,Billie Eilish,180,English,8.6
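
As a small illustration of how the row count works for `head()` and `tail()`, here is a sketch on a made-up DataFrame called `numbers` (it is not part of the songs data).

```python
import pandas as pd

# A stand-in DataFrame with 10 rows, holding the values 0 to 9.
numbers = pd.DataFrame({'value': range(10)})

print(len(numbers.head()))   # no argument: the first 5 rows
print(len(numbers.head(3)))  # the first 3 rows
print(numbers.tail(2))       # the last 2 rows
```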


Get a summary of the DataFrame with the `info()` function. This will show you the number of rows, the number of columns, the data types of the columns, and the number of non-null values in each column.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   song      10 non-null     object 
 1   artist    10 non-null     object 
 2   length    10 non-null     int64  
 3   language  10 non-null     object 
 4   rating    9 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 528.0+ bytes


#### Data manipulation
This section covers some basic operations that you can perform on a DataFrame, such as selecting data, counting values, adding values, and sorting.

##### Selecting a column
You can grab a single column from the DataFrame by using the column name in square brackets:

In [6]:
df['song']

0              BIRDS OF A FEATHER
1                        Espresso
2                     Not Like Us
3    Si Antes Te Hubiera Conocido
4                   Ik Wil Dansen
5                       Too Sweet
6                        Europapa
7                             Who
8                       Supernova
9                           LUNCH
Name: song, dtype: object

In practice, you will often want to convert a column to a regular Python list. You can do this by using the `tolist()` function:

In [7]:
df['song'].tolist()

['BIRDS OF A FEATHER',
 'Espresso',
 'Not Like Us',
 'Si Antes Te Hubiera Conocido',
 'Ik Wil Dansen',
 'Too Sweet',
 'Europapa',
 'Who',
 'Supernova',
 'LUNCH']

You can also select multiple columns by passing a `list` of column names to the DataFrame.

In [8]:
df[['song', 'artist']]

Unnamed: 0,song,artist
0,BIRDS OF A FEATHER,Billie Eilish
1,Espresso,Sabrina Carpenter
2,Not Like Us,Kendrick Lamar
3,Si Antes Te Hubiera Conocido,KAROL G
4,Ik Wil Dansen,Froukje
5,Too Sweet,Hozier
6,Europapa,Joost
7,Who,Jimin
8,Supernova,aespa
9,LUNCH,Billie Eilish


##### Selecting a row

There are multiple ways to select a row in a DataFrame. You can use `iloc` to select a row by its integer position, counting from 0. Note that `iloc` is followed by square brackets, not parentheses. This also works for ranges of rows.

In [9]:
df.iloc[0]

song        BIRDS OF A FEATHER
artist           Billie Eilish
length                     210
language               English
rating                     6.5
Name: 0, dtype: object

<div class="alert alert-block alert-info">

<b>Exercise</b> <p>
Use `.iloc` to select a range of 5 rows from the DataFrame.

</div>



In [10]:
# Your code here:
df.iloc[:5]

Unnamed: 0,song,artist,length,language,rating
0,BIRDS OF A FEATHER,Billie Eilish,210,English,6.5
1,Espresso,Sabrina Carpenter,175,English,7.2
2,Not Like Us,Kendrick Lamar,274,English,8.0
3,Si Antes Te Hubiera Conocido,KAROL G,198,Spanish,5.3
4,Ik Wil Dansen,Froukje,194,Dutch,9.6
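
The sketch below shows a few more ways to use `iloc`, on a small stand-in DataFrame named `songs` with four rows copied from the dataset. The variable names are only illustrative.

```python
import pandas as pd

songs = pd.DataFrame({
    'song': ['BIRDS OF A FEATHER', 'Espresso', 'Not Like Us', 'Too Sweet'],
    'length': [210, 175, 274, 251],
})

middle = songs.iloc[1:3]  # rows at positions 1 and 2 (the end position is excluded)
last = songs.iloc[-1]     # negative positions count from the end
cell = songs.iloc[0, 1]   # row position 0, column position 1

print(middle)
print(last)
print(cell)
```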


You can also select rows based on a condition. For example, you can select all rows for a specific artist:

In [11]:
df[df['artist'] == 'Billie Eilish']

Unnamed: 0,song,artist,length,language,rating
0,BIRDS OF A FEATHER,Billie Eilish,210,English,6.5
9,LUNCH,Billie Eilish,180,English,8.6
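
Conditions can also be combined. The following sketch, on a small stand-in DataFrame with four songs from the dataset, uses `&` for "and" and `|` for "or". The threshold of 220 is an arbitrary choice for this example.

```python
import pandas as pd

songs = pd.DataFrame({
    'song': ['BIRDS OF A FEATHER', 'Ik Wil Dansen', 'Too Sweet', 'Europapa'],
    'language': ['English', 'Dutch', 'English', 'Dutch'],
    'length': [210, 194, 251, 160],
})

# Each condition needs its own parentheses when you combine them.
long_english = songs[(songs['language'] == 'English') & (songs['length'] > 220)]
dutch_or_long = songs[(songs['language'] == 'Dutch') | (songs['length'] > 220)]

print(long_english)
print(dutch_or_long)
```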


##### Counting values

The function `value_counts()` allows you to count the number of occurrences of each value in a column.

In [12]:
df['language'].value_counts()

language
English    6
Dutch      2
Spanish    1
Korean     1
Name: count, dtype: int64
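
If you are interested in proportions rather than raw counts, `value_counts()` accepts a `normalize` argument. A minimal sketch on a made-up column of four values:

```python
import pandas as pd

# Stand-in column: three English songs and one Dutch song.
languages = pd.Series(['English', 'English', 'English', 'Dutch'], name='language')

counts = languages.value_counts()                # number of occurrences per value
shares = languages.value_counts(normalize=True)  # fraction of the total per value

print(counts)
print(shares)
```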

##### Calculating the average

To calculate the average of a column, you can use a `mean()` function. Here we import it from `numpy` and apply it to a list of values. A `pandas` column also has its own `mean()` method, so `df['length'].mean()` gives the same result.

NumPy is a package that includes many mathematical functions, and is essential for working with numerical data in Python.

In [13]:
from numpy import mean
lengths = df['length'].tolist()
mean(lengths)

199.1
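
For comparison, here is a sketch of the same calculation done with the `mean()` method that a `pandas` column provides. The values are the ten song lengths from the dataset, typed in by hand so that the example runs on its own.

```python
import pandas as pd

# The ten song lengths from the dataset.
lengths = pd.Series([210, 175, 274, 198, 194, 251, 160, 170, 179, 180])

print(lengths.mean())  # average
print(lengths.min())   # shortest
print(lengths.max())   # longest
```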

##### Null values

You may have noticed that the `rating` column contains the value `NaN`. This is called a `null value`, and it means that the data is missing. This can happen for many reasons, such as a mistake in the data collection process, or because the data is not available. 

In [14]:
# The DataFrame is printed here again for your convenience:
df

Unnamed: 0,song,artist,length,language,rating
0,BIRDS OF A FEATHER,Billie Eilish,210,English,6.5
1,Espresso,Sabrina Carpenter,175,English,7.2
2,Not Like Us,Kendrick Lamar,274,English,8.0
3,Si Antes Te Hubiera Conocido,KAROL G,198,Spanish,5.3
4,Ik Wil Dansen,Froukje,194,Dutch,9.6
5,Too Sweet,Hozier,251,English,3.3
6,Europapa,Joost,160,Dutch,4.7
7,Who,Jimin,170,English,
8,Supernova,aespa,179,Korean,7.6
9,LUNCH,Billie Eilish,180,English,8.6


Null values can cause problems when you are working with data. In the example below, we try to calculate the average rating of the songs. The result is `nan` instead of a number, because the `mean()` function from `numpy` cannot average a list that contains a missing value.

In [15]:
ratings = df['rating'].tolist()
mean(ratings)

nan
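
How a missing value is treated depends on the function you use. The sketch below compares three options on a short made-up list of ratings. The name `ratings_example` is only illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in ratings with one missing value.
ratings_example = pd.Series([6.5, 7.2, np.nan, 8.6])

plain_mean = np.mean(ratings_example.tolist())     # one missing value makes the result nan
pandas_mean = ratings_example.mean()               # the pandas method skips missing values by default
numpy_nanmean = np.nanmean(ratings_example.tolist())  # numpy variant that ignores nan

print(plain_mean)
print(pandas_mean)
print(numpy_nanmean)
```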

You can find null values in a DataFrame by using the `isnull()` function. This will return a DataFrame with `True` and `False` values. 

In [16]:
df.isnull()

Unnamed: 0,song,artist,length,language,rating
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
5,False,False,False,False,False
6,False,False,False,False,False
7,False,False,False,False,True
8,False,False,False,False,False
9,False,False,False,False,False
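
A table full of `True` and `False` is hard to scan when the data gets larger. As a sketch, you can count the missing values per column, or select only the rows that have one. The example uses three rows from the dataset.

```python
import numpy as np
import pandas as pd

songs = pd.DataFrame({
    'song': ['Europapa', 'Who', 'Supernova'],
    'rating': [4.7, np.nan, 7.6],
})

# True counts as 1 and False as 0, so sum() gives the number of missing values per column.
missing_per_column = songs.isnull().sum()

# Use the True/False column as a condition to keep only the rows without a rating.
rows_with_missing = songs[songs['rating'].isnull()]

print(missing_per_column)
print(rows_with_missing)
```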


The easiest way to deal with null values is to remove them from the data. You can do this by applying the `dropna()` function to the DataFrame. This function will remove any rows that contain null values. 

In [17]:
df_dropped = df.dropna()
ratings = df_dropped['rating'].tolist()
mean(ratings)

6.755555555555556

See, now it works! 

The `dropna()` function has an argument called `subset`. This allows you to specify the columns that you want to check for null values, instead of checking all columns. 


<div class="alert alert-block alert-info">

<b>Exercise</b> <p>
Take a look at the [official documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) for the `dropna()` function and try to figure out how to remove rows that contain null values in the `rating` column only. 

</div>






In [18]:
# Your code here: 
df.dropna(subset = ['rating'])


Unnamed: 0,song,artist,length,language,rating
0,BIRDS OF A FEATHER,Billie Eilish,210,English,6.5
1,Espresso,Sabrina Carpenter,175,English,7.2
2,Not Like Us,Kendrick Lamar,274,English,8.0
3,Si Antes Te Hubiera Conocido,KAROL G,198,Spanish,5.3
4,Ik Wil Dansen,Froukje,194,Dutch,9.6
5,Too Sweet,Hozier,251,English,3.3
6,Europapa,Joost,160,Dutch,4.7
8,Supernova,aespa,179,Korean,7.6
9,LUNCH,Billie Eilish,180,English,8.6


Another option is to manually add a value to replace the NaN value. You can do this with `.loc`. It allows you to select a cell by its row label and column name, and assign a new value to it. Like `iloc`, it is followed by square brackets.


<div class="alert alert-block alert-info">

<b>Exercise</b> <p>
Try to add a rating of choice to the song with the missing value. \
Hint: the syntax looks like this: `df.loc[0, 'rating'] = 5` 

</div>




In [20]:
# Your code here:
df.loc[7, 'rating'] = 9
df


Unnamed: 0,song,artist,length,language,rating
0,BIRDS OF A FEATHER,Billie Eilish,210,English,6.5
1,Espresso,Sabrina Carpenter,175,English,7.2
2,Not Like Us,Kendrick Lamar,274,English,8.0
3,Si Antes Te Hubiera Conocido,KAROL G,198,Spanish,5.3
4,Ik Wil Dansen,Froukje,194,Dutch,9.6
5,Too Sweet,Hozier,251,English,3.3
6,Europapa,Joost,160,Dutch,4.7
7,Who,Jimin,170,English,9.0
8,Supernova,aespa,179,Korean,7.6
9,LUNCH,Billie Eilish,180,English,8.6
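
When you do not know the row label in advance, you can let `pandas` find the missing values for you. The sketch below shows two possible approaches on three rows from the dataset. The replacement value 9 is the same arbitrary choice as in the exercise.

```python
import numpy as np
import pandas as pd

songs = pd.DataFrame({
    'song': ['Europapa', 'Who', 'Supernova'],
    'rating': [4.7, np.nan, 7.6],
})

# Option 1: fillna() returns a new column; songs itself is not changed.
filled = songs['rating'].fillna(9)

# Option 2: select the rows where rating is missing with .loc and assign in place.
songs.loc[songs['rating'].isnull(), 'rating'] = 9

print(filled)
print(songs)
```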


##### Adding a column

In this exercise we will add a new column to an existing DataFrame. One way to do this is by making a `list` with the values you want to add, and assigning it to a new column name. It's important that the length of the list matches the number of rows in the DataFrame, otherwise it will not work. 

You can use the following syntax to add a column: 
`df['new_column'] = new_column_list`

<div class="alert alert-block alert-info">

<b>Exercise</b>  
- Make a list that contains a Boolean (True or False) for each song in the DataFrame. This Boolean should be True if you know the song, and False if you don't.
- Add the list to the existing DataFrame as a column. Don't forget to come up with a descriptive name for the column.

</div>


In [22]:
# Your code here:
known = [True, True, True, False, True, False, True, True, True, True]
df['known'] = known
df


Unnamed: 0,song,artist,length,language,rating,known
0,BIRDS OF A FEATHER,Billie Eilish,210,English,6.5,True
1,Espresso,Sabrina Carpenter,175,English,7.2,True
2,Not Like Us,Kendrick Lamar,274,English,8.0,True
3,Si Antes Te Hubiera Conocido,KAROL G,198,Spanish,5.3,False
4,Ik Wil Dansen,Froukje,194,Dutch,9.6,True
5,Too Sweet,Hozier,251,English,3.3,False
6,Europapa,Joost,160,Dutch,4.7,True
7,Who,Jimin,170,English,9.0,True
8,Supernova,aespa,179,Korean,7.6,True
9,LUNCH,Billie Eilish,180,English,8.6,True
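
A new column does not have to come from a hand-written list; it can also be calculated from an existing column. The sketch below assumes that `length` is measured in seconds, which the dataset does not state explicitly. The column names `length_minutes` and `is_long` are made up for this example.

```python
import pandas as pd

songs = pd.DataFrame({
    'song': ['Espresso', 'Not Like Us', 'Europapa'],
    'length': [175, 274, 160],
})

# Assumption: length is in seconds, so dividing by 60 gives minutes.
songs['length_minutes'] = songs['length'] / 60

# A comparison gives one Boolean per row.
songs['is_long'] = songs['length'] > 200

print(songs)
```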


##### Sorting data

You can sort the data in a DataFrame by using the `sort_values()` function. This function takes a column name as an argument. You can also specify whether you want to sort the data in ascending or descending order.

In [23]:
df_sorted = df.sort_values(by='rating', ascending=False)

In [24]:
df_sorted

Unnamed: 0,song,artist,length,language,rating,known
4,Ik Wil Dansen,Froukje,194,Dutch,9.6,True
7,Who,Jimin,170,English,9.0,True
9,LUNCH,Billie Eilish,180,English,8.6,True
2,Not Like Us,Kendrick Lamar,274,English,8.0,True
8,Supernova,aespa,179,Korean,7.6,True
1,Espresso,Sabrina Carpenter,175,English,7.2,True
0,BIRDS OF A FEATHER,Billie Eilish,210,English,6.5,True
3,Si Antes Te Hubiera Conocido,KAROL G,198,Spanish,5.3,False
6,Europapa,Joost,160,Dutch,4.7,True
5,Too Sweet,Hozier,251,English,3.3,False
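
Notice that the row labels on the left keep their original values after sorting. The sketch below sorts on two columns at once and then renumbers the rows with `reset_index()`. It uses four rows from the dataset, and the variable names are only illustrative.

```python
import pandas as pd

songs = pd.DataFrame({
    'song': ['Ik Wil Dansen', 'Too Sweet', 'Europapa', 'LUNCH'],
    'language': ['Dutch', 'English', 'Dutch', 'English'],
    'rating': [9.6, 3.3, 4.7, 8.6],
})

# Sort by language (A to Z) first, then by rating (high to low) within each language.
by_language_then_rating = songs.sort_values(
    by=['language', 'rating'], ascending=[True, False]
)

# drop=True discards the old row labels instead of keeping them as a column.
renumbered = by_language_then_rating.reset_index(drop=True)

print(by_language_then_rating)
print(renumbered)
```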


Now it's time to write the data to a file. You can do this by using the `to_csv()` function. This function takes the path to the file as an argument.

In [25]:
df_sorted.to_csv('sorted_songs.csv')
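
By default, `to_csv()` also writes the row labels as the first column. The sketch below shows the difference that `index=False` makes. To avoid creating files, it calls `to_csv()` without a path, in which case the CSV text is returned as a string. The two-row DataFrame is a stand-in for the real data.

```python
import io
import pandas as pd

songs = pd.DataFrame({'song': ['Espresso', 'Europapa'], 'rating': [7.2, 4.7]})

with_index = songs.to_csv()                # row labels are written as the first column
without_index = songs.to_csv(index=False)  # only the actual columns are written

print(with_index)
print(without_index)

# Reading the first version back turns the old row labels into a column named 'Unnamed: 0'.
reread = pd.read_csv(io.StringIO(with_index))
print(list(reread.columns))
```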

It's also possible to write the data to a different file type. For example, if your type of choice is an Excel file, you can use the `to_excel()` function.

We have only scratched the surface of what you can do with `pandas`. If you can think of a data manipulation task, it's likely that there is a function for it. Consult the documentation or your favorite AI assistant to learn more about the possibilities.