## Introducing Pandas

Pandas is a popular open-source data analysis library for the Python programming language, and it sits at the center of a large ecosystem of Python data science tools. It is widely used for data cleaning and analysis, and it integrates easily with libraries for statistics, [machine learning](https://scikit-learn.org/stable/), [web scraping](https://www.kaggle.com/code/patrickgomes/web-scraping-to-pandas-step-by-step-in-9-lines), [data visualization](https://pandas.pydata.org/pandas-docs/version/0.13/visualization.html), and more.


#### Import the Pandas library

Import the pandas library to get access to its features.

In [1]:
import pandas as pd
# pd.set_option('display.max_rows', None)

#### Importing a dataset

Pandas can import a variety of file types, and each file type has its own import function. We can use the [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function to load the *population.csv* file.

This function imports the CSV file's contents into an object called a **DataFrame**, a two-dimensional labeled data structure with columns of potentially different types.


In [2]:
url = 'https://raw.githubusercontent.com/GiorgioBar/pandas/main/datasets/population.csv'
population_df = pd.read_csv(url)
population_df.head()

Unnamed: 0,ITTER107,Territory,SEXISTAT1,Gender,ETA1,Age,Value
0,IT,Italy,1,males,Y0,0 years,205956
1,IT,Italy,2,females,Y0,0 years,195565
2,IT,Italy,9,total,Y0,0 years,401521
3,ITC,Nord-ovest,1,males,Y0,0 years,54122
4,ITC,Nord-ovest,2,females,Y0,0 years,51309


This DataFrame consists of seven columns and an index (the range of ascending numbers on the left side of the DataFrame). Index labels serve as identifiers for rows of data. We can set any column as the index of the DataFrame. If we do not explicitly tell pandas which column to use, the library generates a numeric index starting from 0.
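
As a minimal sketch of the two index options described above, the snippet below reads a small inline CSV instead of the real file. The three-row `csv_text` sample and the variable names are assumptions made for illustration; only `read_csv`, its `index_col` parameter, and `set_index` come from pandas itself.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the layout of population.csv
csv_text = """ITTER107,Territory,Gender,Age,Value
IT,Italy,males,0 years,205956
IT,Italy,females,0 years,195565
ITC,Nord-ovest,males,0 years,54122
"""

# Default behaviour: pandas generates a numeric index starting from 0
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.index.tolist())

# Use an existing column as the index while reading the file...
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col="ITTER107")

# ...or afterwards; set_index returns a new DataFrame
also_indexed_df = default_df.set_index("ITTER107")
print(indexed_df.index.tolist())
```

Index labels do not have to be unique: here the label `IT` identifies two rows.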

#### Taking a first look at the DataFrame

We can get the first *N* rows of a DataFrame using the *head()* method. Similarly, we can get the last *N* rows using the *tail()* method. Both return five rows when called without an argument.

In [3]:
population_df.head(4)

Unnamed: 0,ITTER107,Territory,SEXISTAT1,Gender,ETA1,Age,Value
0,IT,Italy,1,males,Y0,0 years,205956
1,IT,Italy,2,females,Y0,0 years,195565
2,IT,Italy,9,total,Y0,0 years,401521
3,ITC,Nord-ovest,1,males,Y0,0 years,54122


In [4]:
population_df.tail(3)

Unnamed: 0,ITTER107,Territory,SEXISTAT1,Gender,ETA1,Age,Value
41307,IT111,Sud Sardegna,2,females,Y23,23 years,1251
41308,IT111,Sud Sardegna,1,males,TOTAL,total,165713
41309,IT111,Sud Sardegna,9,total,Y20,20 years,2784


We can check the number of rows and columns in the DataFrame with the *shape* attribute, and extract a row by [its index position](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) (which starts counting at 0) or [by its index label](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) (either the default numeric labels or the values of a column we have set as the index). Extracting a single row returns an object of class **Series**, a one-dimensional labeled array of values.

In [5]:
print(population_df.shape)
print(population_df.iloc[2])

(41310, 7)
ITTER107          IT
Territory      Italy
SEXISTAT1          9
Gender         total
ETA1              Y0
Age          0 years
Value         401521
Name: 2, dtype: object
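
The difference between position-based and label-based row access is easier to see when the index is not numeric. The following is a small sketch on a hypothetical three-row DataFrame (`sample_df` and its values are assumptions for illustration), with territory codes used as index labels.

```python
import pandas as pd

# Hypothetical sample with territory codes as index labels
sample_df = pd.DataFrame(
    {
        "Territory": ["Italy", "Nord-ovest", "Piemonte"],
        "Value": [401521, 105431, 26804],
    },
    index=["IT", "ITC", "ITC1"],
)

row_by_position = sample_df.iloc[1]  # second row, counting from 0
row_by_label = sample_df.loc["ITC"]  # row whose index label is "ITC"

print(type(row_by_position).__name__)  # Series
print(row_by_position.equals(row_by_label))  # True: both select the same row
```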


We can [sort](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) the DataFrame by multiple columns in different orders. Notice that by default this method returns a new DataFrame object, leaving the original data intact. Most pandas operations return a new Series or DataFrame object, which we can either assign to a new variable or use to overwrite the original one. The use of the *inplace=True* keyword argument, available for some methods, is discouraged.

In [6]:
population_df.sort_values(by="Value", ascending=False).head(3)
# population_df.sort_values(by=["Age", "Value"], ascending=[True, False]).head(3)

Unnamed: 0,ITTER107,Territory,SEXISTAT1,Gender,ETA1,Age,Value
40604,IT,Italy,9,total,TOTAL,total,58983122
40603,IT,Italy,2,females,TOTAL,total,30235705
40602,IT,Italy,1,males,TOTAL,total,28747417
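
To illustrate that sorting returns a new object, here is a short sketch on a hypothetical four-row DataFrame (the `sample_df` data is made up for the example). It sorts by two columns in different orders, checks that the original is unchanged, and then overwrites the variable by reassignment rather than with *inplace=True*.

```python
import pandas as pd

# Hypothetical sample data, deliberately unsorted
sample_df = pd.DataFrame(
    {
        "Age": ["1 years", "0 years", "0 years", "1 years"],
        "Value": [405952, 54122, 401521, 26804],
    }
)

# Age ascending, then Value descending within each age
sorted_df = sample_df.sort_values(by=["Age", "Value"], ascending=[True, False])

# The original DataFrame keeps its row order
values_after_sort_call = sample_df["Value"].tolist()
print(values_after_sort_call)        # [405952, 54122, 401521, 26804]
print(sorted_df["Value"].tolist())   # [401521, 54122, 405952, 26804]

# To keep the sorted result, reassign the variable
sample_df = sample_df.sort_values(by="Value", ascending=False).reset_index(drop=True)
```

Note that *Age* holds strings here, so it is sorted alphabetically; with the full dataset, "10 years" would come before "2 years".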


## Perform SQL-like operations using pandas

In [7]:
url = 'https://raw.githubusercontent.com/GiorgioBar/pandas/main/datasets/area.csv'
area_df = pd.read_csv(url)

In [8]:
area_df.head()

Unnamed: 0,ITTER107,Territory,TIPO_DATO4,Data type,Value
0,IT,Italy,TOTAREA,total area (Ha),30206830.0
1,IT,Italy,TOTAREA2,total area (km2),302068.3
2,ITC,Nord-ovest,TOTAREA,total area (Ha),5792680.0
3,ITC,Nord-ovest,TOTAREA2,total area (km2),57926.8
4,ITC1,Piemonte,TOTAREA,total area (Ha),2538670.0


#### SELECT

We can select some columns of a DataFrame by passing a list of column names to the indexing operator ([ ]) of the DataFrame.

In [9]:
# print(population_df.columns)
population_df[['Territory', 'Gender', 'Age', 'Value']].head(4)

Unnamed: 0,Territory,Gender,Age,Value
0,Italy,males,0 years,205956
1,Italy,females,0 years,195565
2,Italy,total,0 years,401521
3,Nord-ovest,males,0 years,54122
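
One detail worth a quick sketch: the type of the result depends on what is passed to the indexing operator. The example below uses a hypothetical two-row DataFrame (`sample_df` is an assumption for illustration); a single column name yields a Series, while a list of names yields a DataFrame, even if the list holds only one name.

```python
import pandas as pd

# Hypothetical sample data
sample_df = pd.DataFrame(
    {
        "Territory": ["Italy", "Nord-ovest"],
        "Gender": ["total", "total"],
        "Value": [401521, 105431],
    }
)

one_column = sample_df["Value"]                   # a string selects a Series
two_columns = sample_df[["Territory", "Value"]]   # a list selects a DataFrame
one_column_df = sample_df[["Value"]]              # a one-item list is still a DataFrame

print(type(one_column).__name__, type(two_columns).__name__, type(one_column_df).__name__)
```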


#### WHERE

DataFrames can be filtered using [boolean indexing](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-boolean), passing a Series of boolean values to the indexing operator.

In [10]:
# population_df[population_df["Value"] > 50000000]

is_less_than_one_year_old = population_df["Age"] == "0 years"
print(is_less_than_one_year_old.value_counts())

False    40905
True       405
Name: Age, dtype: int64


In [11]:
population_df[is_less_than_one_year_old]

Unnamed: 0,ITTER107,Territory,SEXISTAT1,Gender,ETA1,Age,Value
0,IT,Italy,1,males,Y0,0 years,205956
1,IT,Italy,2,females,Y0,0 years,195565
2,IT,Italy,9,total,Y0,0 years,401521
3,ITC,Nord-ovest,1,males,Y0,0 years,54122
4,ITC,Nord-ovest,2,females,Y0,0 years,51309
...,...,...,...,...,...,...,...
400,IT110,Barletta-Andria-Trani,2,females,Y0,0 years,1356
401,IT110,Barletta-Andria-Trani,9,total,Y0,0 years,2844
41183,IT111,Sud Sardegna,1,males,Y0,0 years,864
41216,IT111,Sud Sardegna,2,females,Y0,0 years,785


We can combine multiple conditions using the | (OR) and & (AND) operators. Each condition must be enclosed in parentheses, because in Python | and & bind more tightly than comparison operators such as == and >.

In [12]:
population_df[(population_df["Age"] == "0 years") & (population_df["Value"] > 100000)]

Unnamed: 0,ITTER107,Territory,SEXISTAT1,Gender,ETA1,Age,Value
0,IT,Italy,1,males,Y0,0 years,205956
1,IT,Italy,2,females,Y0,0 years,195565
2,IT,Italy,9,total,Y0,0 years,401521
5,ITC,Nord-ovest,9,total,Y0,0 years,105431
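
The cell above shows an AND condition. As a complementary sketch, the snippet below shows rough counterparts of SQL's `OR`, `NOT`, and `IN` on a hypothetical four-row DataFrame (`sample_df` and its values are assumptions for illustration). The `~` operator and `Series.isin` are standard pandas features.

```python
import pandas as pd

# Hypothetical sample data
sample_df = pd.DataFrame(
    {
        "Territory": ["Italy", "Italy", "Nord-ovest", "Piemonte"],
        "Gender": ["males", "total", "total", "total"],
        "Value": [205956, 401521, 105431, 26804],
    }
)

# OR: rows for Italy, or rows with a small value
either = sample_df[(sample_df["Territory"] == "Italy") | (sample_df["Value"] < 50000)]

# NOT: invert a boolean Series with ~
not_total = sample_df[~(sample_df["Gender"] == "total")]

# IN: keep rows whose value appears in a list
north = sample_df[sample_df["Territory"].isin(["Nord-ovest", "Piemonte"])]

print(either.index.tolist())     # [0, 1, 3]
print(not_total.index.tolist())  # [0]
print(north.index.tolist())      # [2, 3]
```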


#### GROUP BY

We can group DataFrame rows into buckets, based on shared values in a given column or columns, by invoking the *groupby* method on a DataFrame. The method returns a **DataFrameGroupBy** object, a storage container that provides a set of methods to analyze each independent group.


In [13]:
# Series.unique() returns the unique values of a Series (as a NumPy array)

print(population_df["ITTER107"].unique().size)
print(population_df["Territory"].unique().size)

print(population_df["Age"].unique())
print(population_df["Age"].unique().size)

135
134
['0 years' '1 years' '2 years' '3 years' '4 years' '5 years' '6 years'
 '7 years' '8 years' '9 years' '10 years' '11 years' '12 years' '13 years'
 '14 years' '15 years' '16 years' '17 years' '18 years' '19 years'
 '20 years' '21 years' '22 years' '23 years' '24 years' '25 years'
 '26 years' '27 years' '28 years' '29 years' '30 years' '31 years'
 '32 years' '33 years' '34 years' '35 years' '36 years' '37 years'
 '38 years' '39 years' '40 years' '41 years' '42 years' '43 years'
 '44 years' '45 years' '46 years' '47 years' '48 years' '49 years'
 '50 years' '51 years' '52 years' '53 years' '54 years' '55 years'
 '56 years' '57 years' '58 years' '59 years' '60 years' '61 years'
 '62 years' '63 years' '64 years' '65 years' '66 years' '67 years'
 '68 years' '69 years' '70 years' '71 years' '72 years' '73 years'
 '74 years' '75 years' '76 years' '77 years' '78 years' '79 years'
 '80 years' '81 years' '82 years' '83 years' '84 years' '85 years'
 '86 years' '87 years' '88 years' '89 year

In [14]:
italy_population_df = population_df[(population_df["Gender"] == "total") & (population_df["Age"] != "total")]
italy_population_df

Unnamed: 0,ITTER107,Territory,SEXISTAT1,Gender,ETA1,Age,Value
2,IT,Italy,9,total,Y0,0 years,401521
5,ITC,Nord-ovest,9,total,Y0,0 years,105431
8,ITC1,Piemonte,9,total,Y0,0 years,26804
11,ITC11,Torino,9,total,Y0,0 years,13938
14,ITC12,Vercelli,9,total,Y0,0 years,1002
...,...,...,...,...,...,...,...
41292,IT111,Sud Sardegna,9,total,Y61,61 years,5343
41295,IT111,Sud Sardegna,9,total,Y19,19 years,2869
41297,IT111,Sud Sardegna,9,total,Y98,98 years,128
41299,IT111,Sud Sardegna,9,total,Y86,86 years,1819


In [22]:
# territory_groups = italy_population_df.groupby("ITTER107")
territory_groups = italy_population_df.groupby(["ITTER107", "Territory"])

The DataFrameGroupBy object stores a group for each unique value in the column we specify or, when we group by several columns, for each unique combination of their values. Its *size* method returns a Series with a list of the groups and their row counts. The *get_group* method accepts a group name and returns a DataFrame with the corresponding rows.

In [25]:
print(italy_population_df["ITTER107"].nunique())
print(territory_groups.ngroups)
print(territory_groups.size())
# territory_groups.get_group("IT")
# Since we are grouping by two columns, the get_group method requires a tuple of values
territory_groups.get_group(("IT", "Italy"))

135
135
ITTER107  Territory            
IT        Italy                    101
IT108     Monza e della Brianza    101
IT109     Fermo                    101
IT110     Barletta-Andria-Trani    101
IT111     Sud Sardegna             101
                                  ... 
ITG2      Sardegna                 101
ITG25     Sassari                  101
ITG26     Nuoro                    101
ITG27     Cagliari                 101
ITG28     Oristano                 101
Length: 135, dtype: int64


Unnamed: 0,ITTER107,Territory,SEXISTAT1,Gender,ETA1,Age,Value
2,IT,Italy,9,total,Y0,0 years,401521
551,IT,Italy,9,total,Y1,1 years,405952
1087,IT,Italy,9,total,Y2,2 years,424329
1222,IT,Italy,9,total,Y3,3 years,444659
1728,IT,Italy,9,total,Y4,4 years,465225
...,...,...,...,...,...,...,...
38714,IT,Italy,9,total,Y96,96 years,41709
38996,IT,Italy,9,total,Y97,97 years,29970
39398,IT,Italy,9,total,Y98,98 years,21142
39864,IT,Italy,9,total,Y99,99 years,14144


We can invoke methods on the GroupBy object to apply aggregate operations to every group.

In [26]:
territory_groups.sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,SEXISTAT1,Value
ITTER107,Territory,Unnamed: 2_level_1,Unnamed: 3_level_1
IT,Italy,909,58983122
IT108,Monza e della Brianza,909,870112
IT109,Fermo,909,168485
IT110,Barletta-Andria-Trani,909,379251
IT111,Sud Sardegna,909,335108
...,...,...,...
ITG2,Sardegna,909,1579181
ITG25,Sassari,909,474142
ITG26,Nuoro,909,199349
ITG27,Cagliari,909,419770
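
In the output above, *sum* is applied to every numeric column, so the *SEXISTAT1* code is summed as well, although that total carries no meaning. A possible refinement, sketched here on a hypothetical four-row DataFrame (`sample_df` and its values are assumptions for illustration), is to select the column of interest before aggregating and, if needed, to compute several aggregates at once with *agg*.

```python
import pandas as pd

# Hypothetical sample data with two territories and two rows each
sample_df = pd.DataFrame(
    {
        "ITTER107": ["IT", "IT", "ITC", "ITC"],
        "Territory": ["Italy", "Italy", "Nord-ovest", "Nord-ovest"],
        "SEXISTAT1": [9, 9, 9, 9],
        "Value": [401521, 405952, 105431, 106000],
    }
)

groups = sample_df.groupby(["ITTER107", "Territory"])

# Aggregate only the Value column; the result is a Series indexed by group
value_totals = groups["Value"].sum()
print(value_totals)

# Several aggregates at once; the result is a DataFrame with one column each
summary = groups["Value"].agg(["sum", "mean", "max"])
print(summary)
```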




