## Introducing Pandas

Pandas is a popular open source library for data analysis built on top of the Python programming language, and it sits at the center of a large Python ecosystem of data science tools. It is heavily used for data analysis and cleaning, and it integrates easily with libraries for statistics, [machine learning](https://scikit-learn.org/stable/), [web scraping](https://www.kaggle.com/code/patrickgomes/web-scraping-to-pandas-step-by-step-in-9-lines), [data visualization](https://pandas.pydata.org/pandas-docs/version/0.13/visualization.html) and more.


#### Import the Pandas library

Import the pandas library to get access to its features.

In [12]:
import pandas as pd

#### Importing a dataset

Pandas can import a variety of file types, and each file type has its own import function. We can use the [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function to tell Pandas to open the *population.csv* file.

This function loads the CSV file's contents into an object called a **DataFrame**, a two-dimensional labeled data structure with columns of potentially different types.


In [13]:
url = 'https://raw.githubusercontent.com/GiorgioBar/pandas/main/datasets/population.csv'
population_df = pd.read_csv(url)
population_df.head()

Unnamed: 0,ITTER107,Territory,SEXISTAT1,Gender,ETA1,Age,Value
0,IT,Italy,1,males,Y0,0 years,205956
1,IT,Italy,2,females,Y0,0 years,195565
2,IT,Italy,9,total,Y0,0 years,401521
3,ITC,Nord-ovest,1,males,Y0,0 years,54122
4,ITC,Nord-ovest,2,females,Y0,0 years,51309


This DataFrame consists of seven columns and an index (the range of ascending numbers on the left side of the DataFrame). Index labels serve as identifiers for rows of data. We can set any column as the index of the DataFrame; if we do not explicitly tell pandas which column to use, the library generates a numeric index starting from 0.
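
The difference between the generated index and a column-based index can be sketched as follows. This is a minimal illustration, not part of the original notebook: `csv_text` is a hypothetical three-row excerpt built from the rows shown above, and the choice of `ITTER107` as index is only an assumption made for the example.

```python
import io

import pandas as pd

# Hypothetical excerpt mirroring some columns of population.csv.
csv_text = """ITTER107,Territory,Gender,Age,Value
IT,Italy,males,0 years,205956
IT,Italy,females,0 years,195565
ITC,Nord-ovest,males,0 years,54122
"""

# Without index_col, pandas generates a numeric index starting from 0.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.index.tolist())

# index_col tells read_csv which column to use as the index.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col="ITTER107")
print(indexed_df.index.tolist())

# set_index does the same for a DataFrame that is already loaded.
# It returns a new DataFrame and leaves default_df unchanged.
same_df = default_df.set_index("ITTER107")
print(same_df.index.equals(indexed_df.index))
```

Note that index labels do not have to be unique: here the label `IT` identifies two rows.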

#### Taking a first look at the dataframe

We can get the first *N* rows of a DataFrame using the *head()* method. Similarly, we can get the last *N* rows using the *tail()* method. Both return five rows when *N* is omitted.

In [14]:
population_df.head(4)

Unnamed: 0,ITTER107,Territory,SEXISTAT1,Gender,ETA1,Age,Value
0,IT,Italy,1,males,Y0,0 years,205956
1,IT,Italy,2,females,Y0,0 years,195565
2,IT,Italy,9,total,Y0,0 years,401521
3,ITC,Nord-ovest,1,males,Y0,0 years,54122


In [15]:
population_df.tail(3)

Unnamed: 0,ITTER107,Territory,SEXISTAT1,Gender,ETA1,Age,Value
41307,IT111,Sud Sardegna,2,females,Y23,23 years,1251
41308,IT111,Sud Sardegna,1,males,TOTAL,total,165713
41309,IT111,Sud Sardegna,9,total,Y20,20 years,2784


We can inquire about the number of rows and columns in the DataFrame with the *shape* attribute, and extract a row by [its index position](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) (which starts counting at 0) or [by its index label](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) (for example, if we have set a specific column of the DataFrame as index). Selecting a single row this way returns an object of class **Series**, a one-dimensional labeled array of values.

In [16]:
print(population_df.shape)
print(population_df.iloc[2])

(41310, 7)
ITTER107          IT
Territory      Italy
SEXISTAT1          9
Gender         total
ETA1              Y0
Age          0 years
Value         401521
Name: 2, dtype: object
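
The cell above only uses position-based access. As a minimal sketch of label-based access, assume a small hypothetical DataFrame (`sample_df`, with values copied from the first rows of the output shown earlier) whose `Gender` column is set as the index:

```python
import pandas as pd

# Hypothetical excerpt; values copied from the first rows of population_df.
sample_df = pd.DataFrame(
    {
        "Territory": ["Italy", "Italy", "Italy"],
        "Gender": ["males", "females", "total"],
        "Value": [205956, 195565, 401521],
    }
)

# Use the Gender column as index so rows can be addressed by label.
by_gender = sample_df.set_index("Gender")
print(by_gender.shape)

# iloc selects by position (counting from 0), loc selects by index label.
row_by_position = by_gender.iloc[2]
row_by_label = by_gender.loc["total"]

# Both expressions return the same row as a Series.
print(type(row_by_label).__name__)
print(row_by_position.equals(row_by_label))
```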


We can [sort](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) the DataFrame by one or more columns, each in its own order. Notice that by default this method returns a new DataFrame object, leaving the original data intact. Most pandas operations return a new Series or DataFrame object rather than modifying the original; we can either assign this object to a new variable or overwrite the original one. The usage of the *inplace=True* keyword argument, available for some methods, is discouraged.

In [17]:
population_df.sort_values(by="Value", ascending=False).head(3)
# population_df.sort_values(by=["Age", "Value"], ascending=[True, False]).head(3)

Unnamed: 0,ITTER107,Territory,SEXISTAT1,Gender,ETA1,Age,Value
40604,IT,Italy,9,total,TOTAL,total,58983122
40603,IT,Italy,2,females,TOTAL,total,30235705
40602,IT,Italy,1,males,TOTAL,total,28747417
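
To illustrate that sorting returns a new object, here is a small sketch on a hypothetical four-row DataFrame (`sample_df`, with values taken from the outputs above). It sorts by two columns in different orders and then checks that the original row order is untouched:

```python
import pandas as pd

# Hypothetical excerpt; values copied from the first rows of population_df.
sample_df = pd.DataFrame(
    {
        "Territory": ["Italy", "Italy", "Nord-ovest", "Nord-ovest"],
        "Gender": ["males", "females", "males", "females"],
        "Value": [205956, 195565, 54122, 51309],
    }
)

# Sort by Gender ascending, then by Value descending within each gender.
sorted_df = sample_df.sort_values(by=["Gender", "Value"], ascending=[True, False])

# The sorted copy keeps the original index labels in their new order.
print(sorted_df.index.tolist())

# The original DataFrame is left intact.
print(sample_df.index.tolist())

# To keep the result, assign it to a variable instead of using inplace=True.
sample_sorted = sorted_df.reset_index(drop=True)
print(sample_sorted["Value"].tolist())
```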


## Perform SQL-like operations using pandas

In [18]:
url = 'https://raw.githubusercontent.com/GiorgioBar/pandas/main/datasets/area.csv'
area_df = pd.read_csv(url)

In [19]:
area_df.head()

Unnamed: 0,ITTER107,Territory,TIPO_DATO4,Data type,Value
0,IT,Italy,TOTAREA,total area (Ha),30206830.0
1,IT,Italy,TOTAREA2,total area (km2),302068.3
2,ITC,Nord-ovest,TOTAREA,total area (Ha),5792680.0
3,ITC,Nord-ovest,TOTAREA2,total area (km2),57926.8
4,ITC1,Piemonte,TOTAREA,total area (Ha),2538670.0


#### SELECT

We can select some columns of a DataFrame by passing a list of column names to the indexing operator ([ ]) of the DataFrame.

In [38]:
# print(population_df.columns)
population_df[['Territory', 'Gender', 'Age', 'Value']].head(4)

Unnamed: 0,Territory,Gender,Age,Value
0,Italy,males,0 years,205956
1,Italy,females,0 years,195565
2,Italy,total,0 years,401521
3,Nord-ovest,males,0 years,54122
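
One detail worth illustrating is what the indexing operator returns. The sketch below uses a hypothetical two-row DataFrame (`sample_df`) rather than the full dataset: a single column name yields a Series, while a list of names yields a DataFrame, even when the list has only one element.

```python
import pandas as pd

# Hypothetical excerpt; values copied from the first rows of population_df.
sample_df = pd.DataFrame(
    {
        "Territory": ["Italy", "Italy"],
        "Gender": ["males", "females"],
        "Value": [205956, 195565],
    }
)

# A single column name returns a Series.
value_series = sample_df["Value"]
print(type(value_series).__name__)

# A list of column names returns a DataFrame, in the order requested.
subset_df = sample_df[["Value", "Territory"]]
print(subset_df.columns.tolist())

# A one-element list still returns a DataFrame.
single_column_df = sample_df[["Value"]]
print(type(single_column_df).__name__)
```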


#### WHERE

DataFrames can be filtered using [boolean indexing](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-boolean), passing a Series of boolean values to the indexing operator.

In [54]:
# population_df[population_df["Value"] > 50000000]

is_less_than_one_year_old = population_df["Age"] == "0 years"
print(is_less_than_one_year_old.value_counts())

False    40905
True       405
Name: Age, dtype: int64


In [55]:
population_df[is_less_than_one_year_old]

Unnamed: 0,ITTER107,Territory,SEXISTAT1,Gender,ETA1,Age,Value
0,IT,Italy,1,males,Y0,0 years,205956
1,IT,Italy,2,females,Y0,0 years,195565
2,IT,Italy,9,total,Y0,0 years,401521
3,ITC,Nord-ovest,1,males,Y0,0 years,54122
4,ITC,Nord-ovest,2,females,Y0,0 years,51309
...,...,...,...,...,...,...,...
400,IT110,Barletta-Andria-Trani,2,females,Y0,0 years,1356
401,IT110,Barletta-Andria-Trani,9,total,Y0,0 years,2844
41183,IT111,Sud Sardegna,1,males,Y0,0 years,864
41216,IT111,Sud Sardegna,2,females,Y0,0 years,785


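We can combine multiple conditions using the | (OR) and & (AND) operators, enclosing each condition in a pair of parentheses. The parentheses are required because | and & bind more tightly than comparison operators such as == and >.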

In [60]:
population_df[(population_df["Age"] == "0 years") & (population_df["Value"] > 100000)]

Unnamed: 0,ITTER107,Territory,SEXISTAT1,Gender,ETA1,Age,Value
0,IT,Italy,1,males,Y0,0 years,205956
1,IT,Italy,2,females,Y0,0 years,195565
2,IT,Italy,9,total,Y0,0 years,401521
5,ITC,Nord-ovest,9,total,Y0,0 years,105431
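
The cell above shows the AND case. A minimal sketch of the remaining combinations follows, on a hypothetical five-row DataFrame (`sample_df`) built from the rows shown earlier. The ~ (NOT) operator and the `isin` method are not used elsewhere in this notebook and are included here as related pandas features.

```python
import pandas as pd

# Hypothetical excerpt; values copied from the first rows of population_df.
sample_df = pd.DataFrame(
    {
        "Territory": ["Italy", "Italy", "Italy", "Nord-ovest", "Nord-ovest"],
        "Gender": ["males", "females", "total", "males", "females"],
        "Value": [205956, 195565, 401521, 54122, 51309],
    }
)

# OR: keep rows that satisfy at least one condition.
either_df = sample_df[(sample_df["Gender"] == "total") | (sample_df["Value"] < 52000)]
print(either_df.index.tolist())

# NOT: invert a boolean Series with ~.
not_total_df = sample_df[~(sample_df["Gender"] == "total")]
print(not_total_df.index.tolist())

# isin: similar to SQL's IN, tests membership in a list of values.
in_list_df = sample_df[sample_df["Gender"].isin(["males", "females"])]
print(in_list_df.index.tolist())
```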


#### GROUP BY
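
The original notebook leaves this section without an example. As a hedged sketch of how a SQL `GROUP BY` is commonly expressed in pandas, the code below uses the `groupby` method on a hypothetical five-row DataFrame (`sample_df`, with values copied from the outputs above). Rows labeled `total` are filtered out first so that they are not counted twice.

```python
import pandas as pd

# Hypothetical excerpt; values copied from the first rows of population_df.
sample_df = pd.DataFrame(
    {
        "Territory": ["Italy", "Italy", "Italy", "Nord-ovest", "Nord-ovest"],
        "Gender": ["males", "females", "total", "males", "females"],
        "Value": [205956, 195565, 401521, 54122, 51309],
    }
)

# Roughly equivalent SQL:
#   SELECT Territory, SUM(Value) FROM sample
#   WHERE Gender <> 'total' GROUP BY Territory
no_totals_df = sample_df[sample_df["Gender"] != "total"]
value_by_territory = no_totals_df.groupby("Territory")["Value"].sum()
print(value_by_territory)

# Several aggregations can be computed at once with agg.
summary_df = no_totals_df.groupby("Territory")["Value"].agg(["sum", "max", "count"])
print(summary_df)
```

In this excerpt the sum for Italy matches the value of its `total` row (401521), which is a quick sanity check on the grouping.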




