# Pandas

The pandas package is a powerful tool for data manipulation and analysis in Python. It provides data structures and functions for working with structured data, such as the `DataFrame`. A dataframe is a two-dimensional table with labeled rows and columns. Each row has an index label, and each column has a name. Let us read a `csv` file into a dataframe:

In [26]:
import pandas as pd

df = pd.read_csv("country.csv")

print(type(df))

<class 'pandas.core.frame.DataFrame'>


Notice that the variable `df` is now a `pandas.DataFrame` object. Let us print this object to see what it contains:

In [2]:
print(df)

                   Name      Continent  Population
0           Afghanistan           Asia    22720000
1               Albania         Europe     3401200
2               Algeria         Africa    31471000
3        American Samoa        Oceania       68000
4               Andorra         Europe       78000
5                Angola         Africa    12878000
6              Anguilla  North America        8000
7            Antarctica     Antarctica           0
8   Antigua and Barbuda  North America       68000
9             Argentina  South America    37032000
10               Canada  North America    40100000


Notice how easy it was to read the `csv` file into a `DataFrame`. There was no need to create a csv reader, iterate over the file line by line, or worry about opening and closing the file.
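To make the contrast concrete, here is a minimal sketch that places the manual approach next to the pandas one. It uses a small in-memory CSV instead of `country.csv` so that it runs on its own; the names `csv_text`, `rows`, and `df_small` are made up for this illustration.

```python
import csv
import io

import pandas as pd

# A tiny in-memory stand-in for a CSV file (illustrative data only).
csv_text = "Name,Continent,Population\nAlbania,Europe,3401200\nAngola,Africa,12878000\n"

# Manual approach: create a reader, loop over the rows, convert types by hand.
rows = []
for record in csv.DictReader(io.StringIO(csv_text)):
    record["Population"] = int(record["Population"])
    rows.append(record)

# pandas approach: one call parses the header, the rows, and the column types.
df_small = pd.read_csv(io.StringIO(csv_text))

print(rows[0])
print(df_small)
```

Both versions end up with the same data, but the pandas version needs no explicit loop and no manual type conversion.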

We can also display the first few rows of the `DataFrame` using the `head()` method:

In [7]:
df.head()

             Name Continent  Population
0     Afghanistan      Asia    22720000
1         Albania    Europe     3401200
2         Algeria    Africa    31471000
3  American Samoa   Oceania       68000
4         Andorra    Europe       78000


Notice how nicely formatted the output is. Similar to numpy arrays, we can see the shape and size of the dataframe:

In [4]:
print("Shape of the dataframe: ", df.shape)
print("Number of elements in the dataframe: ", df.size)

Shape of the dataframe:  (10, 3)
Number of elements in the dataframe:  30


We can also list the columns of a dataframe:

In [6]:
df.columns

Index(['Name', 'Continent', 'Population'], dtype='object')

We can also see the data types for each column:

In [7]:
df.dtypes

Name          object
Continent     object
Population     int64
dtype: object

Notice that the column 'Population' has the type `int64`. Pandas identifies that the column contains integers and converts it automatically, whereas when we read the csv file ourselves we had to convert the column to an integer type explicitly.
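The type inference can be observed, and overridden, in a short sketch. The in-memory CSV and the variable names below are assumptions made for illustration; `dtype` is a standard `read_csv` parameter and `astype` is a standard Series method.

```python
import io

import pandas as pd

# Illustrative CSV text; not the contents of country.csv.
csv_text = "Name,Population\nAndorra,78000\nAngola,12878000\n"

# By default pandas inspects the values and picks an integer type.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)

# The inference can be overridden, here by keeping the column as text.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"Population": str})

# An explicit conversion then plays the role of the manual int() call.
converted = as_text["Population"].astype("int64")
print(converted.sum())
```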

It is also possible to display specific columns of a DataFrame using the syntax:

In [None]:
print(type(df['Name']))
df['Name']

<class 'pandas.core.series.Series'>


0             Afghanistan
1                 Albania
2                 Algeria
3          American Samoa
4                 Andorra
5                  Angola
6                Anguilla
7              Antarctica
8     Antigua and Barbuda
9               Argentina
10                 Canada
Name: Name, dtype: object

Notice that the type of the data is now a `pandas.core.series.Series`. We can preserve the type as a dataframe by using double square brackets `[[ ]]`, that is, by passing a list of column names:

In [9]:
print(type(df[['Name']]))
df[['Name']]

<class 'pandas.core.frame.DataFrame'>


                  Name
0          Afghanistan
1              Albania
2              Algeria
3       American Samoa
4              Andorra
5               Angola
6             Anguilla
7           Antarctica
8  Antigua and Barbuda
9            Argentina


We see that the data is a pandas dataframe. It is also possible to specify more than one column:

In [9]:
df[['Name', 'Population']]

                  Name  Population
0          Afghanistan    22720000
1              Albania     3401200
2              Algeria    31471000
3       American Samoa       68000
4              Andorra       78000
5               Angola    12878000
6             Anguilla        8000
7           Antarctica           0
8  Antigua and Barbuda       68000
9            Argentina    37032000


## The iloc attribute

Similar to how numpy arrays are indexed by position, pandas dataframes have an `iloc` attribute that can be used to access rows and columns by position:

In [12]:
print(df)
print()
print("Element at row 0 and column 1: ", df.iloc[0, 1])
print("Element at row 3 and column 2: ", df.iloc[3, 2])

                  Name      Continent  Population
0          Afghanistan           Asia    22720000
1              Albania         Europe     3401200
2              Algeria         Africa    31471000
3       American Samoa        Oceania       68000
4              Andorra         Europe       78000
5               Angola         Africa    12878000
6             Anguilla  North America        8000
7           Antarctica     Antarctica           0
8  Antigua and Barbuda  North America       68000
9            Argentina  South America    37032000

Element at row 0 and column 1:  Asia
Element at row 3 and column 2:  68000


It is important to note that the positions start at 0.

We can also use slicing to get a range of values:

In [15]:
print(df)
print()
print("Rows from 0 to 3, with all columns: ", df.iloc[0:4,:])
print()
print("Rows from start to 5, and columns from 1 to last: ", df.iloc[:6,1:])

                  Name      Continent  Population
0          Afghanistan           Asia    22720000
1              Albania         Europe     3401200
2              Algeria         Africa    31471000
3       American Samoa        Oceania       68000
4              Andorra         Europe       78000
5               Angola         Africa    12878000
6             Anguilla  North America        8000
7           Antarctica     Antarctica           0
8  Antigua and Barbuda  North America       68000
9            Argentina  South America    37032000

Rows from 0 to 3, with all columns:               Name Continent  Population
0     Afghanistan      Asia    22720000
1         Albania    Europe     3401200
2         Algeria    Africa    31471000
3  American Samoa   Oceania       68000

Rows from start to 5, and columns from 1 to last:    Continent  Population
0      Asia    22720000
1    Europe     3401200
2    Africa    31471000
3   Oceania       68000
4    Europe       78000
5    Africa    12878000

## The loc attribute

Similar to the `iloc` attribute, the `loc` attribute allows you to select rows and columns, but instead of using integer indices, you use labels. For example, in our dataframe, the columns have labels. We can use these labels to select the columns:

In [18]:
print(df)
print()

print("Rows with index 1, 2, 3, 4, 5 and columns 'Continent' and 'Population':\n ", df.loc[1:5, ['Continent', 'Population']])

                  Name      Continent  Population
0          Afghanistan           Asia    22720000
1              Albania         Europe     3401200
2              Algeria         Africa    31471000
3       American Samoa        Oceania       68000
4              Andorra         Europe       78000
5               Angola         Africa    12878000
6             Anguilla  North America        8000
7           Antarctica     Antarctica           0
8  Antigua and Barbuda  North America       68000
9            Argentina  South America    37032000

Rows with index 1, 2, 3, 4, 5 and columns 'Continent' and 'Population':
    Continent  Population
1    Europe     3401200
2    Africa    31471000
3   Oceania       68000
4    Europe       78000
5    Africa    12878000


There are two main differences between `loc` and `iloc` in Pandas. First, `loc` is label-based, which means that you specify the row and column labels to access a particular cell or group of cells. On the other hand, `iloc` is integer-based, which means that you specify the row and column positions to access a particular cell or group of cells.

Second, a `loc` slice is inclusive, which means that it includes the last element in the specified range, while an `iloc` slice is exclusive, which means that it excludes the last element in the specified range. In our dataframe the row labels happen to be the integers 0, 1, 2, and so on, which is why `df.loc[1:5]` returns five rows whereas `df.iloc[1:5]` would return four.
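The difference is easier to see when the row labels are not integers. The following is a small sketch on a made-up dataframe (`df_demo` and its string index are assumptions for illustration, not part of `country.csv`):

```python
import pandas as pd

# Illustrative dataframe whose row labels are strings rather than 0, 1, 2, ...
df_demo = pd.DataFrame(
    {"Population": [3401200, 31471000, 68000, 78000]},
    index=["Albania", "Algeria", "American Samoa", "Andorra"],
)

# loc slices by label and includes the end label.
by_label = df_demo.loc["Albania":"American Samoa"]

# iloc slices by position and excludes the end position.
by_position = df_demo.iloc[0:2]

print(len(by_label))     # 3
print(len(by_position))  # 2
```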

## Filtering data

One of the main uses of Pandas is to filter data. This is a very common task in data analysis. Filtering data is the process of selecting a subset of data from a larger dataset based on certain criteria. For example, let us filter our dataset to include only rows where the population is greater than 1,000,000:

In [10]:
df[df['Population'] > 1000000]

           Name      Continent  Population
0   Afghanistan           Asia    22720000
1       Albania         Europe     3401200
2       Algeria         Africa    31471000
5        Angola         Africa    12878000
9     Argentina  South America    37032000
10       Canada  North America    40100000


We see that the rows 0, 1, 2, 5, 9, and 10 satisfy the condition. We can also filter on more than one condition. In plain Python we would combine conditions with `and` and `or`; with pandas we need to use the `&` and `|` operators instead, and wrap each condition in parentheses:

In [15]:
# get rows where population is greater than 1 million and continent is Asia
df[(df['Population'] > 1000000) & (df['Continent'] == 'Asia')]

          Name Continent  Population
0  Afghanistan      Asia    22720000


In [16]:
# get rows where population is greater than 1 million or continent is Asia
df[(df['Population'] > 1000000) | (df['Continent'] == 'Asia')]

           Name      Continent  Population
0   Afghanistan           Asia    22720000
1       Albania         Europe     3401200
2       Algeria         Africa    31471000
5        Angola         Africa    12878000
9     Argentina  South America    37032000
10       Canada  North America    40100000
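The reason `and` and `or` cannot be used is that they ask for a single truth value, while a comparison on a column produces one boolean per row. The sketch below, on a small made-up dataframe (`df_demo` is an assumption for illustration), shows the element-wise operators working and `and` raising a `ValueError`. The parentheses around each condition matter because `&` and `|` bind more tightly than comparison operators such as `>` and `==`.

```python
import pandas as pd

# Illustrative data only.
df_demo = pd.DataFrame({
    "Continent": ["Asia", "Europe", "Africa"],
    "Population": [22720000, 78000, 31471000],
})

# Each comparison yields a boolean Series with one value per row.
large = df_demo["Population"] > 1000000
asian = df_demo["Continent"] == "Asia"

# & and | combine the two Series element by element.
print(df_demo[large & asian])
print(df_demo[large | asian])

# `and` needs one truth value for the whole Series, which pandas refuses to guess.
try:
    large and asian
    error_raised = False
except ValueError:
    error_raised = True
print(error_raised)  # True
```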


The `~` operator can be used to indicate `not`. Applied to a boolean Series, it turns every `True` into `False` and every `False` into `True`:

In [17]:
# get rows where population is not greater than 1 million
df[~(df['Population'] > 1000000)]

                  Name      Continent  Population
3       American Samoa        Oceania       68000
4              Andorra         Europe       78000
6             Anguilla  North America        8000
7           Antarctica     Antarctica           0
8  Antigua and Barbuda  North America       68000
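One caveat worth a small sketch: `~` negates element-wise on a boolean Series, but on a plain Python `True` it is the integer bitwise operator and gives `-2`, so `not` remains the right tool for single Python booleans. The `mask` Series below is made up for illustration.

```python
import pandas as pd

# A boolean Series, like the result of df['Population'] > 1000000.
mask = pd.Series([True, False, True])

# On a boolean Series, ~ flips every element.
flipped = ~mask
print(flipped.tolist())  # [False, True, False]

# For a single Python boolean, use `not` rather than ~.
print(not True)  # False
```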


## Pandas methods

There are many methods that can be used to manipulate dataframes. It is important to note that while DataFrames are mutable, most of the methods used to manipulate them do not change the original DataFrame. Instead, they return a new DataFrame with the changes applied. Many of these methods accept an `inplace` parameter; if we want the changes to be made in place, we need to set it to `True`. By default it is `False`. For example, the `drop` method is used to remove rows or columns from a DataFrame:

In [13]:
df.drop(columns='Population')

df.head()

             Name Continent  Population
0     Afghanistan      Asia    22720000
1         Albania    Europe     3401200
2         Algeria    Africa    31471000
3  American Samoa   Oceania       68000
4         Andorra    Europe       78000


Although we used the `drop` method to remove the column `Population`, we see that the column is still there. This is because the `drop` method does not modify the original DataFrame. To keep the change, we need to assign the result of the `drop` method to a variable (a new one such as `df2`, or `df` itself) or use the `inplace` argument:

In [14]:
# saving the modified dataframe in a new variable
df2 = df.drop(columns='Population')
df2.head()

             Name Continent
0     Afghanistan      Asia
1         Albania    Europe
2         Algeria    Africa
3  American Samoa   Oceania
4         Andorra    Europe


In [None]:
# using inplace=True to save the modification back to the dataframe
df.drop(columns='Population', inplace=True)
df.head()

             Name Continent
0     Afghanistan      Asia
1         Albania    Europe
2         Algeria    Africa
3  American Samoa   Oceania
4         Andorra    Europe


The `drop_duplicates` method is used to remove duplicate rows from a DataFrame. It takes a few optional arguments, including `subset`, which specifies the columns to consider when identifying duplicates. Similar to the `drop` method, the `drop_duplicates` method returns a new DataFrame with the duplicates removed. If you want to modify the original DataFrame in place, you can use the `inplace` argument.

The `insert` method is used to insert a new column at a specific position in a DataFrame. It takes three main arguments: the position at which to insert the new column, the name of the new column, and the values for the new column. Unlike the methods above, `insert` modifies the original DataFrame directly and returns `None`; it has no `inplace` argument.
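A minimal sketch of `insert` on a small made-up dataframe (`df_demo` and its values are assumptions for illustration). In pandas, `DataFrame.insert(loc, column, value)` adds a column at the given position, changes the dataframe directly, and returns `None`:

```python
import pandas as pd

# Illustrative data only.
df_demo = pd.DataFrame({
    "Name": ["Albania", "Angola"],
    "Population": [3401200, 12878000],
})

# Insert a 'Continent' column at position 1, between 'Name' and 'Population'.
result = df_demo.insert(1, "Continent", ["Europe", "Africa"])

print(result)                 # None: the change was made to df_demo itself
print(list(df_demo.columns))  # ['Name', 'Continent', 'Population']
```

Because the return value is `None`, writing `df_demo = df_demo.insert(...)` would discard the dataframe, so the call is best left as a statement on its own.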

The `replace` method is used to replace values in a DataFrame. It takes two arguments: the value to be replaced and the value to replace it with. The `replace` method returns a new DataFrame with the values replaced. If you want to modify the original DataFrame in place, you can use the `inplace` argument.

Let us look at `drop_duplicates` and `replace` in action:

In [22]:
df = pd.read_csv("country.csv")

df_no_duplicates = df.drop_duplicates(subset=['Continent'])
print("DataFrame after dropping duplicate continents:\n")
print(df_no_duplicates)
print()
df_replace = df.replace(to_replace="Asia", value="South Asia")
print("DataFrame after replacing Asia with South Asia:\n")
print(df_replace)
print()

DataFrame after dropping duplicate continents:

             Name      Continent  Population
0     Afghanistan           Asia    22720000
1         Albania         Europe     3401200
2         Algeria         Africa    31471000
3  American Samoa        Oceania       68000
6        Anguilla  North America        8000
7      Antarctica     Antarctica           0
9       Argentina  South America    37032000

DataFrame after replacing Asia with South Asia:

                   Name      Continent  Population
0           Afghanistan     South Asia    22720000
1               Albania         Europe     3401200
2               Algeria         Africa    31471000
3        American Samoa        Oceania       68000
4               Andorra         Europe       78000
5                Angola         Africa    12878000
6              Anguilla  North America        8000
7            Antarctica     Antarctica           0
8   Antigua and Barbuda  North America       68000
9             Argentina  South America    37032000
10               Canada  North America    40100000

Finally, the `sort_values()` method can be used to sort the DataFrame by any column of our choosing:

In [23]:
print("DataFrame after sorting by Population in ascending order:\n")
print(df.sort_values(by='Population'))
print()
print("DataFrame after sorting by Population in descending order:\n")
print(df.sort_values(by='Population', ascending=False))

DataFrame after sorting by Population in ascending order:

                   Name      Continent  Population
7            Antarctica     Antarctica           0
6              Anguilla  North America        8000
3        American Samoa        Oceania       68000
8   Antigua and Barbuda  North America       68000
4               Andorra         Europe       78000
1               Albania         Europe     3401200
5                Angola         Africa    12878000
0           Afghanistan           Asia    22720000
2               Algeria         Africa    31471000
9             Argentina  South America    37032000
10               Canada  North America    40100000

DataFrame after sorting by Population in descending order:

                   Name      Continent  Population
10               Canada  North America    40100000
9             Argentina  South America    37032000
2               Algeria         Africa    31471000
0           Afghanistan           Asia    22720000
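5                Angola         Africa    12878000
1               Albania         Europe     3401200
4               Andorra         Europe       78000
3        American Samoa        Oceania       68000
8   Antigua and Barbuda  North America       68000
6              Anguilla  North America        8000
7            Antarctica     Antarctica           0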

If we print the dataframe, we see that the order has not changed:

In [24]:
print(df)

                   Name      Continent  Population
0           Afghanistan           Asia    22720000
1               Albania         Europe     3401200
2               Algeria         Africa    31471000
3        American Samoa        Oceania       68000
4               Andorra         Europe       78000
5                Angola         Africa    12878000
6              Anguilla  North America        8000
7            Antarctica     Antarctica           0
8   Antigua and Barbuda  North America       68000
9             Argentina  South America    37032000
10               Canada  North America    40100000


This is because we did not save the modification. To save it, we need to set the `inplace` parameter to `True`:

In [25]:
print("DataFrame after sorting by Population in descending order:\n")
df.sort_values(by='Population', ascending=False, inplace=True)
print(df)

DataFrame after sorting by Population in descending order:

                   Name      Continent  Population
10               Canada  North America    40100000
9             Argentina  South America    37032000
2               Algeria         Africa    31471000
0           Afghanistan           Asia    22720000
5                Angola         Africa    12878000
1               Albania         Europe     3401200
4               Andorra         Europe       78000
3        American Samoa        Oceania       68000
8   Antigua and Barbuda  North America       68000
6              Anguilla  North America        8000
7            Antarctica     Antarctica           0
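An alternative to `inplace=True` is to assign the sorted result to a variable, which leaves the original dataframe untouched. The sketch below uses a small made-up dataframe (`df_demo` is an assumption for illustration) and also shows `reset_index(drop=True)`, a standard pandas method that renumbers the rows after sorting; whether renumbering is wanted depends on the use case.

```python
import pandas as pd

# Illustrative data only.
df_demo = pd.DataFrame({
    "Name": ["Andorra", "Angola", "Anguilla"],
    "Population": [78000, 12878000, 8000],
})

# Assigning the result keeps the sorted copy; df_demo keeps its original order.
sorted_df = df_demo.sort_values(by="Population", ascending=False)
print(sorted_df.index.tolist())  # [1, 0, 2]: original row labels are preserved

# Optionally renumber the rows 0, 1, 2 in their new order.
renumbered = sorted_df.reset_index(drop=True)
print(renumbered)
```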
