## Introduction to pandas
In this notebook, we will learn the basics of the `pandas` library. This is a powerful package that is often used to work with data in Python. It allows you to inspect your data, manipulate it, and create visualizations.

There is a lot to learn about `pandas`, but we will cover the basics in this notebook. Check out the [official documentation](https://pandas.pydata.org/pandas-docs/stable/reference/io.html) for more information about the available functions and methods.

This lesson covers the following topics:
- Reading data as a DataFrame
- Basic operations on a DataFrame: 
    - Selecting 
    - Counting 
    - Calculating the average
    - Adding new data
    - Dealing with missing data
    - Sorting 
- Writing data to a file 


First, you have to import the `pandas` package. You can do this by running the following code:

In [2]:
import pandas as pd

`pandas` works with a structure called a `DataFrame`. This is a table with rows and columns. You can think of it as a spreadsheet.

#### Loading and inspecting data

Let's start by loading a CSV file and saving it to a variable. You can do this by using the `read_csv` function. This function takes the path to the file as an argument:

In [141]:
df = pd.read_csv('songs.csv')
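
If you want to try `read_csv` without a file on disk, one option is to read CSV text from memory. The sketch below assumes a few rows of the songs data written out as text; the names `csv_text` and `songs` are made up for this illustration.

```python
import io

import pandas as pd

# A few rows of the songs data as CSV text, so the example runs without
# a file on disk (an assumption made for illustration).
csv_text = """song,artist,length,language,rating
BIRDS OF A FEATHER,Billie Eilish,210,English,6.5
Espresso,Sabrina Carpenter,175,English,7.2
Who,Jimin,170,English,
"""

# read_csv accepts a file path or a file-like object such as StringIO.
songs = pd.read_csv(io.StringIO(csv_text))

print(songs.shape)             # (3, 5): three rows, five columns
print(songs.columns.tolist())  # the column names taken from the first line
```

The empty field at the end of the last row is read as a missing value (`NaN`).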

The `head()` method allows you to see the first rows of the DataFrame. By default it shows 5 rows, but you can pass a different number as an argument. This is useful to get a quick overview of the data.

In [142]:
df.head(5)

Unnamed: 0,song,artist,length,language,rating
0,BIRDS OF A FEATHER,Billie Eilish,210,English,6.5
1,Espresso,Sabrina Carpenter,175,English,7.2
2,Not Like Us,Kendrick Lamar,274,English,8.0
3,Si Antes Te Hubiera Conocido,KAROL G,198,Spanish,5.3
4,Ik Wil Dansen,Froukje,194,Dutch,9.6


You can also use the `tail()` function to see the last rows of the DataFrame. You can specify the number of rows you want to see by passing an argument to the function.

In [143]:
df.tail(5)

Unnamed: 0,song,artist,length,language,rating
5,Too Sweet,Hozier,251,English,3.3
6,Europapa,Joost,160,Dutch,4.7
7,Who,Jimin,170,English,
8,Supernova,aespa,179,Korean,7.6
9,LUNCH,Billie Eilish,180,English,8.6
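
As a small illustration of how the argument to `head()` and `tail()` works, the sketch below builds a stand-in DataFrame with the ten song titles from this notebook (the variable names are made up for the example).

```python
import pandas as pd

# Stand-in for the songs table: only the song titles, for illustration.
songs = pd.DataFrame({
    "song": [
        "BIRDS OF A FEATHER", "Espresso", "Not Like Us",
        "Si Antes Te Hubiera Conocido", "Ik Wil Dansen", "Too Sweet",
        "Europapa", "Who", "Supernova", "LUNCH",
    ]
})

default_head = songs.head()   # no argument: the first 5 rows
first_three = songs.head(3)   # the first 3 rows
last_two = songs.tail(2)      # the last 2 rows

print(len(default_head), len(first_three), len(last_two))  # 5 3 2
print(last_two)
```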


Get a summary of the DataFrame with the `info()` function. This will show you the number of rows, the number of columns, the data types of the columns, and the number of non-null values in each column.

In [144]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   song      10 non-null     object 
 1   artist    10 non-null     object 
 2   length    10 non-null     int64  
 3   language  10 non-null     object 
 4   rating    9 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 528.0+ bytes
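
If you only need one piece of this summary, you can ask for it directly. The sketch below uses a three-row stand-in DataFrame (made up for this example) and shows the `shape` and `dtypes` attributes and a count of missing values per column.

```python
import pandas as pd

# Three rows from the songs table, used as a small stand-in.
songs = pd.DataFrame({
    "song": ["Europapa", "Who", "Supernova"],
    "length": [160, 170, 179],
    "rating": [4.7, None, 7.6],
})

print(songs.shape)         # (3, 3): rows, columns
print(songs.dtypes)        # the data type of each column
print(songs.isna().sum())  # the number of missing values in each column
```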



#### Data manipulation
This section covers some basic operations that you can perform on a DataFrame, such as selecting data, counting values, calculating an average, handling missing values, adding a column, and sorting.

##### Selecting a column
You can grab a single column from the DataFrame by using the column name in square brackets:

In [146]:
df['song']

0              BIRDS OF A FEATHER
1                        Espresso
2                     Not Like Us
3    Si Antes Te Hubiera Conocido
4                   Ik Wil Dansen
5                       Too Sweet
6                        Europapa
7                             Who
8                       Supernova
9                           LUNCH
Name: song, dtype: object

The result is a pandas `Series`. In practice, you will often want a plain Python list instead. You can get one by using the `tolist()` method:

In [147]:
df['song'].tolist()

['BIRDS OF A FEATHER',
 'Espresso',
 'Not Like Us',
 'Si Antes Te Hubiera Conocido',
 'Ik Wil Dansen',
 'Too Sweet',
 'Europapa',
 'Who',
 'Supernova',
 'LUNCH']

##### Selecting a row

There are multiple ways to select a row in a DataFrame. You can use the `iloc` indexer to select a row by its integer position, starting at 0. This also works for ranges of rows.

In [148]:
df.iloc[0]

song        BIRDS OF A FEATHER
artist           Billie Eilish
length                     210
language               English
rating                     6.5
Name: 0, dtype: object
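
To illustrate the range selection mentioned above, here is a minimal sketch on a four-row stand-in DataFrame (made up for this example). As with Python lists, the end of a range is not included.

```python
import pandas as pd

# The first four songs of the table, with only two columns for brevity.
songs = pd.DataFrame({
    "song": ["BIRDS OF A FEATHER", "Espresso", "Not Like Us",
             "Si Antes Te Hubiera Conocido"],
    "length": [210, 175, 274, 198],
})

first_two = songs.iloc[0:2]  # rows at positions 0 and 1 (position 2 is excluded)
last_row = songs.iloc[-1]    # negative positions count from the end
cell = songs.iloc[2, 1]      # row position 2, column position 1

print(first_two)
print(last_row["song"])
print(cell)                  # 274
```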

You can also select a row based on a condition. For example, you can select all rows for a specific artist:

In [149]:
df[df['artist'] == 'Billie Eilish']

Unnamed: 0,song,artist,length,language,rating
0,BIRDS OF A FEATHER,Billie Eilish,210,English,6.5
9,LUNCH,Billie Eilish,180,English,8.6
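
Conditions can also be combined. The sketch below, on a five-row stand-in DataFrame made up for this example, uses `&` for "and" and the `isin()` method to match any value from a list. Each condition needs its own parentheses when you combine them.

```python
import pandas as pd

# Five songs from the table, with only the columns used in the conditions.
songs = pd.DataFrame({
    "song": ["Espresso", "Not Like Us", "Ik Wil Dansen", "Supernova", "LUNCH"],
    "language": ["English", "English", "Dutch", "Korean", "English"],
    "length": [175, 274, 194, 179, 180],
})

# English songs with a length below 200.
english_short = songs[(songs["language"] == "English") & (songs["length"] < 200)]

# Songs whose language is either Dutch or Korean.
dutch_or_korean = songs[songs["language"].isin(["Dutch", "Korean"])]

print(english_short)
print(dutch_or_korean)
```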


##### Counting values

The function `value_counts()` allows you to count the number of occurrences of each value in a column.

In [150]:
df['language'].value_counts()

language
English    6
Dutch      2
Spanish    1
Korean     1
Name: count, dtype: int64
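
If you prefer shares instead of counts, `value_counts()` accepts `normalize=True`. A minimal sketch, using the language column of the songs table as a stand-in DataFrame:

```python
import pandas as pd

# The language column from the songs table.
songs = pd.DataFrame({
    "language": ["English", "English", "English", "Spanish", "Dutch",
                 "English", "Dutch", "English", "Korean", "English"],
})

counts = songs["language"].value_counts()                # occurrences per value
shares = songs["language"].value_counts(normalize=True)  # fraction of all rows
n_languages = songs["language"].nunique()                # number of distinct values

print(counts)
print(shares)
print(n_languages)  # 4
```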

##### Calculating the average

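To calculate the average of a column, you can use a `mean()` function. The example below converts the column to a list and uses the `mean` function from `numpy`. Importing `numpy` is not strictly necessary: pandas columns have their own `mean()` method, so `df['length'].mean()` gives the same result.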

In [151]:
from numpy import mean
lengths = df['length'].tolist()
mean(lengths)

199.1
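
As a sketch of the pandas-only route, the example below calls `mean()` directly on the column, together with a few related methods. The DataFrame is a stand-in that holds the `length` values from the songs table.

```python
import pandas as pd

# The length column from the songs table.
songs = pd.DataFrame({
    "length": [210, 175, 274, 198, 194, 251, 160, 170, 179, 180],
})

average = songs["length"].mean()    # same result as numpy's mean on the list
median = songs["length"].median()   # the middle value
longest = songs["length"].max()     # the largest value

print(average, median, longest)
```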

##### Null values

You may have noticed that the `rating` column contains the value `NaN`. This is called a `null value`, and it means that the data is missing. 

In [154]:
df

Unnamed: 0,song,artist,length,language,rating
0,BIRDS OF A FEATHER,Billie Eilish,210,English,6.5
1,Espresso,Sabrina Carpenter,175,English,7.2
2,Not Like Us,Kendrick Lamar,274,English,8.0
3,Si Antes Te Hubiera Conocido,KAROL G,198,Spanish,5.3
4,Ik Wil Dansen,Froukje,194,Dutch,9.6
5,Too Sweet,Hozier,251,English,3.3
6,Europapa,Joost,160,Dutch,4.7
7,Who,Jimin,170,English,
8,Supernova,aespa,179,Korean,7.6
9,LUNCH,Billie Eilish,180,English,8.6


Null values can cause problems when you are working with data. In the example below, we try to calculate the average rating of the songs. The result is `nan`, because `numpy`'s `mean` does not skip missing values: a sum that includes `NaN` is itself `NaN`.

In [152]:
ratings = df['rating'].tolist()
mean(ratings)

nan

The easiest way to deal with null values is to remove them from the data. You can do this by applying the `dropna()` method to the DataFrame. It returns a new DataFrame without the rows that contain null values; the original `df` is left unchanged.

In [153]:
df_dropped = df.dropna()
ratings = df_dropped['rating'].tolist()
mean(ratings)

6.755555555555556

See, now it works! 
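
There are a few other ways to handle missing values that you may find useful. The sketch below, on a three-row stand-in DataFrame made up for this example, shows that the pandas `mean()` method skips `NaN` by default, that `dropna()` can be limited to specific columns with `subset`, and that `fillna()` replaces missing values. Filling with the column mean is just one possible choice, not a recommendation.

```python
import pandas as pd

# Three songs from the table; "Who" has no rating.
songs = pd.DataFrame({
    "song": ["Europapa", "Who", "Supernova"],
    "rating": [4.7, None, 7.6],
})

# The pandas mean() method ignores missing values by default (skipna=True).
mean_rating = songs["rating"].mean()

# Drop only the rows where 'rating' is missing.
rated_only = songs.dropna(subset=["rating"])

# Replace missing ratings with the column mean (one option among many).
filled = songs["rating"].fillna(mean_rating)

print(mean_rating)
print(rated_only)
print(filled.tolist())
```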

Another option is to fill in a value manually. You can do this with the `.loc` indexer, which lets you select a cell by its row label and column name and assign a new value to it.

Exercise: Try to add a rating of choice to the song with the missing value. \
Hint: the syntax looks like this: `df.loc[0, 'rating'] = 5` 

In [164]:
# Your code here:




##### Adding a column

You can add a column to an existing DataFrame by assigning a list to a new column name. It's important that the length of the list matches the number of rows in the DataFrame, otherwise it will not work. 

You can use the following syntax to add a column: `df['new_column'] = new_column_list`
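
Here is a minimal sketch of this syntax on a three-row stand-in DataFrame. The column names `play_count` and `length_minutes` and the play counts themselves are made up for this illustration, and the conversion assumes that `length` is measured in seconds.

```python
import pandas as pd

# Three songs from the table.
songs = pd.DataFrame({
    "song": ["Espresso", "Not Like Us", "LUNCH"],
    "length": [175, 274, 180],
})

# Add a column from a list with one value per row (made-up numbers).
songs["play_count"] = [12, 30, 7]

# Add a column calculated from an existing one (assumes length is in seconds).
songs["length_minutes"] = songs["length"] / 60

# A list with the wrong number of values raises a ValueError.
try:
    songs["too_short"] = [1, 2]
except ValueError as error:
    print("Could not add column:", error)

print(songs)
```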

Exercise: 
- Make a list that contains a Boolean (True or False) for whether you know the songs in the DataFrame or not. 
- Add the column to the DataFrame. Don't forget to come up with a descriptive name for the column.

In [None]:
# Your code here:




##### Sorting data

You can sort the data in a DataFrame by using the `sort_values()` function. This function takes the column name as an argument. You can also specify whether you want to sort the data in ascending or descending order.

In [165]:
df_sorted = df.sort_values(by='rating', ascending=False)

In [166]:
df_sorted

Unnamed: 0,song,artist,length,language,rating
4,Ik Wil Dansen,Froukje,194,Dutch,9.6
9,LUNCH,Billie Eilish,180,English,8.6
2,Not Like Us,Kendrick Lamar,274,English,8.0
8,Supernova,aespa,179,Korean,7.6
1,Espresso,Sabrina Carpenter,175,English,7.2
0,BIRDS OF A FEATHER,Billie Eilish,210,English,6.5
3,Si Antes Te Hubiera Conocido,KAROL G,198,Spanish,5.3
6,Europapa,Joost,160,Dutch,4.7
5,Too Sweet,Hozier,251,English,3.3
7,Who,Jimin,170,English,
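
Note that the row with the missing rating ends up last. The sketch below, on a four-row stand-in DataFrame made up for this example, shows how `na_position` changes that, and how to sort by more than one column by passing lists to `by` and `ascending`.

```python
import pandas as pd

# Four songs from the table; "Who" has no rating.
songs = pd.DataFrame({
    "song": ["Espresso", "Europapa", "Who", "Ik Wil Dansen"],
    "language": ["English", "Dutch", "English", "Dutch"],
    "rating": [7.2, 4.7, None, 9.6],
})

# Sort by rating (ascending) and put missing ratings first instead of last.
nan_first = songs.sort_values(by="rating", na_position="first")

# Sort by language (A to Z), then by rating from high to low within a language.
grouped = songs.sort_values(by=["language", "rating"], ascending=[True, False])

# Sorting keeps the original row labels; reset_index renumbers them from 0.
renumbered = grouped.reset_index(drop=True)

print(nan_first)
print(renumbered)
```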


##### Writing data to a file

Now it's time to write the data to a file. You can do this by using the `to_csv()` method. This method takes the path to the output file as an argument. Here we save the sorted DataFrame:

In [None]:
df_sorted.to_csv('sorted_songs.csv')
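
By default, `to_csv()` also writes the row labels (the index) as an extra first column without a name. When you read such a file back, that column shows up as `Unnamed: 0`. The sketch below illustrates the difference with `index=False`. It writes to text in memory rather than to a file, and uses a two-row stand-in DataFrame made up for this example.

```python
import io

import pandas as pd

# Two songs from the table.
songs = pd.DataFrame({
    "song": ["Espresso", "LUNCH"],
    "rating": [7.2, 8.6],
})

# Without a path, to_csv() returns the CSV content as a string.
with_index = songs.to_csv()
without_index = songs.to_csv(index=False)

reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

print(reloaded_with.columns.tolist())     # ['Unnamed: 0', 'song', 'rating']
print(reloaded_without.columns.tolist())  # ['song', 'rating']
```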

We have only scratched the surface of what you can do with `pandas`. If you can think of a data manipulation task, there is likely a function for it. Consult the documentation or your favorite AI assistant to learn more about the possibilities.
