# Introduction to Pandas

Pandas is an open-source data manipulation and analysis library for Python. It provides the data structures and functions needed to manipulate and analyze structured data. The two primary data structures in Pandas are:

- **Series:** A one-dimensional labeled array that can hold any data type. It is similar to a column in a spreadsheet or a single column in a database table.
- **DataFrame:** A two-dimensional labeled data structure with columns that can be of different data types. It is similar to a spreadsheet or a SQL table, where data is organized in rows and columns.

Key features of Pandas include:

- **Data Cleaning and Preparation:** Pandas provides functions and methods to clean and prepare data, such as handling missing values, dealing with duplicates, and transforming data.

- **Data Exploration and Analysis:** Pandas facilitates the exploration and analysis of data through various functions for summarization, grouping, aggregation, and statistical analysis.

- **Time Series Analysis:** Pandas has robust support for time series data, allowing for easy manipulation and analysis of temporal data.

- **Integration with Other Libraries:** Pandas can be seamlessly integrated with other libraries such as NumPy, Matplotlib, and scikit-learn, providing a comprehensive ecosystem for data analysis and machine learning.

- **Data Import and Export:** Pandas supports the import and export of data from various file formats, including CSV, Excel, SQL databases, and more.

## Series

As stated above, a **Series** is a one-dimensional array-like object that contains a sequence of values of the same type, each identified by a label. A Series is composed of two array-like objects: one containing the sequence of values and the other containing the associated data labels (called the *index*). By convention, `pandas` is imported as `pd`.

In [89]:
# Importing necessary libraries
import pandas as pd
import numpy as np

# Creating a Pandas Series with 10 random values chosen from the range 0 to 999
my_series = pd.Series(np.random.choice(1000, size=10))

# Printing the entire Series
print("The Series:\n", my_series)

# Printing the sequence of values using the Pandas Array attribute
print("\nThe sequence of values is:\n", my_series.array)

# Printing the index of the Series
print("\nThe index is:\n", my_series.index)

The Series:
 0    815
1    180
2    891
3    829
4    929
5    942
6    478
7    448
8    629
9    210
dtype: int64

The sequence of values is:
 <NumpyExtensionArray>
[np.int64(815), np.int64(180), np.int64(891), np.int64(829), np.int64(929),
 np.int64(942), np.int64(478), np.int64(448), np.int64(629), np.int64(210)]
Length: 10, dtype: int64

The index is:
 RangeIndex(start=0, stop=10, step=1)


The index labels can be specified using the parameter `index` within the `pd.Series()` constructor, as shown below:

In [90]:
# Importing the 'string' module for character manipulation
import string

# Creating a Pandas Series with 10 random values and a custom index of the first 10 lowercase letters
my_series = pd.Series(np.random.choice(1000, size=10), index=list(string.ascii_lowercase)[0:10])

# Displaying the Series
print(my_series)

a    204
b    501
c    295
d    903
e    932
f    389
g    390
h    413
i     38
j    976
dtype: int64


The index of a Series can be replaced by assigning a new sequence of labels to its `index` attribute.

In [91]:
# Creating a list of the first 10 uppercase letters
new_index = list(string.ascii_uppercase)[0:10]

# Shuffling the list using np.random.shuffle
np.random.shuffle(new_index)

# Displaying the shuffled index array
print(f"The new index array for my series: {new_index}")

# Assigning the shuffled index to the Pandas Series
my_series.index = new_index

# Displaying the Series with the new index
print("\nSeries with New Index:")
print(my_series)

The new index array for my series: ['D', 'H', 'A', 'F', 'G', 'J', 'I', 'C', 'B', 'E']

Series with New Index:
D    204
H    501
A    295
F    903
G    932
J    389
I    390
C    413
B     38
E    976
dtype: int64


Similarly to NumPy, values in Pandas `Series` and `DataFrame` can be accessed using various methods, such as indexing (fancy and boolean) and slicing. The syntax closely follows the one demonstrated in the NumPy lesson, with the addition of label-based selection.

In [92]:
# Examples of fancy and boolean indexing and slicing

# Printing the first three elements of the series using slicing
print(f"Printing the first three elements of my series:\n{my_series[:3]}")

# Printing the last three elements of my series using slicing
print(f"\nPrinting the last three elements of my series:\n{my_series[-3:]}")

# Printing three selected elements of the series using position-based fancy indexing (.iloc)
print(f"\nPrinting three selected elements of my series:\n{my_series.iloc[[0, 6, 8]]}")

# Printing three selected elements of the series using label-based fancy indexing
print(f"\nPrinting three selected elements of my series:\n{my_series[['A', 'B', 'D']]}")

# Using boolean indexing to filter elements greater than 500
print(f"\nmy_series contains {(my_series > 500).sum()} values greater than 500:\n{my_series[my_series > 500]}")

Printing the first three elements of my series:
D    204
H    501
A    295
dtype: int64

Printing the last three elements of my series:
C    413
B     38
E    976
dtype: int64

Printing three selected elements of my series:
D    204
I    390
B     38
dtype: int64

Printing three selected elements of my series:
A    295
B     38
D    204
dtype: int64

my_series contains 4 values greater than 500:
H    501
F    903
G    932
E    976
dtype: int64
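
Because a Series can carry non-integer labels, integers inside plain square brackets are ambiguous (position or label?), and recent pandas versions may emit a `FutureWarning` for positional access through `[]`. The sketch below uses the explicit accessors instead; the series `s` and its values are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical series with string labels (values chosen for illustration)
s = pd.Series([204, 501, 295, 903], index=['D', 'H', 'A', 'F'])

# Position-based selection: elements at positions 0 and 2
by_position = s.iloc[[0, 2]]

# Label-based selection: elements labelled 'A' and 'D', in the requested order
by_label = s.loc[['A', 'D']]

print(by_position)
print(by_label)
```

Using `.iloc` for positions and `.loc` for labels keeps the intent explicit regardless of the index type.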




Not only is the syntax for indexing and slicing similar to NumPy, but `Series` and `DataFrame` also preserve NumPy properties in mathematical operations, such as broadcasting and vectorization. Moreover, Pandas aligns data by index label, so an operation between two objects combines the values that share the same label.

In [93]:
# Multiplying each element in the series by 2
print(my_series * 2, "\n")

# Performing a series of mathematical operations: multiply by 2, subtract the original series, and then subtract 1
print(my_series * 2 - my_series - 1,  "\n")

# Squaring each element in the series
print(my_series ** 2,  "\n")

D     408
H    1002
A     590
F    1806
G    1864
J     778
I     780
C     826
B      76
E    1952
dtype: int64 

D    203
H    500
A    294
F    902
G    931
J    388
I    389
C    412
B     37
E    975
dtype: int64 

D     41616
H    251001
A     87025
F    815409
G    868624
J    151321
I    152100
C    170569
B      1444
E    952576
dtype: int64 



In [94]:
# Defining a list of cities
cities = ['Milano', 'Novara', 'Torino', 'Como', 'Catanzaro', 'Roma', 'Messina', 'Firenze', 'Venezia']

# Creating a Pandas Series (obj1) with random integer values and the specified index
obj1 = pd.Series(np.random.randint(low=10_000, high=1_000_000, size=len(cities)), index=cities)

# Creating another Pandas Series (obj2) with different random integer values and the same index
obj2 = pd.Series(np.random.randint(low=1, high=1_000, size=len(cities)), index=cities)

# Modifying specific values in obj1 and obj2
obj1['Novara'] = 0
obj2['Como'] = np.nan

obj1.name, obj2.name = 'cities population', 'cities population'
# Naming the index of obj1 only (the index of obj2 is left unnamed)
obj1.index.name = 'cities'

# Displaying obj1, obj2, and the result of their addition
print("obj1:")
print(obj1)

print("\nobj2:")
print(obj2)

print("\nobj1 + obj2:")
print(obj1 + obj2)

obj1:
cities
Milano       660747
Novara            0
Torino        91726
Como         874018
Catanzaro    907963
Roma         410462
Messina      711884
Firenze      426752
Venezia      758547
Name: cities population, dtype: int64

obj2:
Milano       130.0
Novara       116.0
Torino       430.0
Como           NaN
Catanzaro    669.0
Roma         210.0
Messina      101.0
Firenze      960.0
Venezia      434.0
Name: cities population, dtype: float64

obj1 + obj2:
cities
Milano       660877.0
Novara          116.0
Torino        92156.0
Como              NaN
Catanzaro    908632.0
Roma         410672.0
Messina      711985.0
Firenze      427712.0
Venezia      758981.0
Name: cities population, dtype: float64
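
The two series above share the same labels in the same order, so label alignment is not visible. As a minimal sketch, assuming two hypothetical series `s1` and `s2` whose labels overlap only partially, the following shows how values are matched by label and how `Series.add` with `fill_value` can avoid NaN for labels present on one side only.

```python
import pandas as pd

# Two hypothetical series whose labels overlap only partially and differ in order
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'b', 'd'])

# Values are matched by label, not by position; labels present in only one
# operand produce NaN in the result
aligned_sum = s1 + s2

# Series.add with fill_value treats a label missing on one side as 0
filled_sum = s1.add(s2, fill_value=0)

print(aligned_sum)
print(filled_sum)
```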


It is possible to generate a `Series` from a Python list or dictionary and vice versa.

In [95]:
# Creating a Python list using the range function
my_list = list(range(0, 10))
print("Python List:\n", my_list)

# Creating a Pandas Series from the Python list
my_series = pd.Series(my_list)
print("\nPandas Series:\n", my_series)

# Converting the Pandas Series back to a Python list
_list = my_series.to_list()
print("\nPython list:\n",_list)

# Converting the Pandas Series to a Python dict
_dict = my_series.to_dict()
print("\nPython dict:\n", _dict)

# Creating a Python dictionary
my_cities_population = {'Legnano': 60_118, 'Milano': 1_358_420, 'Segrate': 36_961, 'Corsico': 34_505}
print("\nPython dict:\n", my_cities_population)

# Creating a Pandas Series from the Python dictionary
my_series = pd.Series(my_cities_population, name='Comuni nella città metr. di Milano per popolazione')
print("\nPandas Series:\n", my_series)

Python List:
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Pandas Series:
 0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

Python list:
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Python dict:
 {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9}

Python dict:
 {'Legnano': 60118, 'Milano': 1358420, 'Segrate': 36961, 'Corsico': 34505}

Pandas Series:
 Legnano      60118
Milano     1358420
Segrate      36961
Corsico      34505
Name: Comuni nella città metr. di Milano per popolazione, dtype: int64


`Series` and `DataFrame` may contain *NA* (missing) values. The function `pd.isna()`, also available as the method `isna()`, returns a boolean mask that marks the NA values in a `Series` or `DataFrame`.

In [96]:
# Checking for missing values in obj1 and printing the result
print("Does obj1 contain NA?\n", obj1.isna())

# Checking for missing values in obj2 and printing the result
print("\nDoes obj2 contain NA?\n", obj2.isna())

# Checking whether obj2 contains any missing values
if obj2.isna().sum() > 0:
    # Printing a message if obj2 contains missing values
    print("\nYes, obj2 contains NA at index:\n", list(obj2[obj2.isna()].index))

Does obj1 contain NA?
 cities
Milano       False
Novara       False
Torino       False
Como         False
Catanzaro    False
Roma         False
Messina      False
Firenze      False
Venezia      False
Name: cities population, dtype: bool

Does obj2 contain NA?
 Milano       False
Novara       False
Torino       False
Como          True
Catanzaro    False
Roma         False
Messina      False
Firenze      False
Venezia      False
Name: cities population, dtype: bool

Yes, obj2 contains NA at index:
 ['Como']
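
Once missing values have been located, they are usually either removed or replaced. The sketch below illustrates both options with `dropna()` and `fillna()`; the series `s` is hypothetical, and filling with the mean is just one possible choice, not a recommendation from the original lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value
s = pd.Series([130.0, np.nan, 430.0], index=['Milano', 'Como', 'Torino'])

# Remove the entries that are NA
without_na = s.dropna()

# Replace NA with a chosen value (here the mean of the observed values)
imputed = s.fillna(s.mean())

print(without_na)
print(imputed)
```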


## DataFrame

A `DataFrame` is a two-dimensional labeled data structure with potentially heterogeneous columns, often used for more complex and structured data. The `DataFrame` has both a row and a column index. It is created with the constructor `pd.DataFrame(data, index, columns)`. When `data` is a dictionary and `columns` names a key that isn’t contained in it, that column appears filled with missing values.

![title](img/axis.png)

In [97]:
# Define a dictionary with data for Italian cities and their populations
cities = {
    'city': ['Rome', 'Milan', 'Naples', 'Turin', 'Palermo', 'Genoa', 'Bologna'],
    'population': [2_872_800, 1_378_689, 967_069, 885_265, 678_492, 580_097, 391_374],
    'area': [1287.36, 181.67, 117.27, 130.17, 158.9, 240.29, 140.86]
}

# Create a pandas DataFrame using the dictionary
df = pd.DataFrame(cities, columns = ['city', 'area', 'population'])

# Print the DataFrame
print(df)

      city     area  population
0     Rome  1287.36     2872800
1    Milan   181.67     1378689
2   Naples   117.27      967069
3    Turin   130.17      885265
4  Palermo   158.90      678492
5    Genoa   240.29      580097
6  Bologna   140.86      391374
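
The remark about columns that are missing from the dictionary can be illustrated with a short sketch. The column name `region` is hypothetical and deliberately absent from the data.

```python
import pandas as pd

# Hypothetical data: the dictionary has no 'region' key
data = {'city': ['Rome', 'Milan'], 'population': [2_872_800, 1_378_689]}

# Requesting the extra column 'region' creates it filled with NaN
df_extra = pd.DataFrame(data, columns=['city', 'population', 'region'])

print(df_extra)
```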


For a large dataset, it can be useful to inspect the first or last rows using the methods `DataFrame.head()` and `DataFrame.tail()`, respectively. Both return five rows by default.

In [98]:
# Inspecting the first five rows
print("The first five rows")
print(df.head(), '\n')

# Inspecting the last five rows
print("The last five rows")
print(df.tail())

The first five rows
      city     area  population
0     Rome  1287.36     2872800
1    Milan   181.67     1378689
2   Naples   117.27      967069
3    Turin   130.17      885265
4  Palermo   158.90      678492 

The last five rows
      city    area  population
2   Naples  117.27      967069
3    Turin  130.17      885265
4  Palermo  158.90      678492
5    Genoa  240.29      580097
6  Bologna  140.86      391374


## reindex

An important method in Pandas is `reindex`, which creates a new object with the values rearranged to match the new index.

In [99]:
# Creating a DataFrame (df) using pd.DataFrame() constructor
# The DataFrame has two columns 'A' and 'B' with values generated using np.arange() function for numeric sequences
# The index parameter sets the row labels of the DataFrame
df = pd.DataFrame({'A':np.arange(0,3), 'B':np.arange(4,7)}, index= ['b','c','a'])

# Printing the DataFrame df to the console
print(df)

# Creating a new DataFrame (df2) by reindexing df with a new index ['a', 'b', 'c', 'd']
# Rows that don't have corresponding labels in the original DataFrame will be filled with NaN (Not a Number)
df2 = df.reindex(['a', 'b', 'c', 'd'])

# Printing the reindexed DataFrame df2 to the console
print("\nCreating a new DataFrame (df2) by reindexing df with a new index")
print(df2)

   A  B
b  0  4
c  1  5
a  2  6

Creating a new DataFrame (df2) by reindexing df with a new index
     A    B
a  2.0  6.0
b  0.0  4.0
c  1.0  5.0
d  NaN  NaN
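
In the output above the columns became floating point because NaN had to be inserted for the new label `'d'`. As a minimal sketch, the `fill_value` parameter of `reindex` supplies a different filler, and the `columns` keyword reindexes the columns instead of the rows. The names `df_small`, `df_filled` and `df_cols` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same layout as the example above, under a different (hypothetical) name
df_small = pd.DataFrame({'A': np.arange(0, 3), 'B': np.arange(4, 7)}, index=['b', 'c', 'a'])

# fill_value replaces the NaN that reindex would otherwise insert for new labels
df_filled = df_small.reindex(['a', 'b', 'c', 'd'], fill_value=0)

# Reindexing the columns instead of the rows
df_cols = df_small.reindex(columns=['B', 'A'])

print(df_filled)
print(df_cols)
```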


## Dropping elements from a series or a dataframe

In many cases, you may need to drop elements from a Series or a DataFrame. The `drop` method removes the specified entries (index or column labels) and returns a new object without them, leaving the original unchanged.

In [100]:
# Defining column names
columns = ['Capital', 'Population', 'GDP']

# Creating a dictionary containing data
data = {'Italy': ['Rome', 58997201, 2090448e+06],
        'Japan': ['Tokyo', 126226568, 4971929e+06],
        'United States': ['Washington D.C.', 341139019, 27675271e+06],
        'France': ['Paris', 65554051, 2936702e+06]
       }

# Creating a DataFrame from the dictionary, setting the index to be the keys of the dictionary
df = pd.DataFrame.from_dict(data, orient='index', columns=columns)

# Dropping rows with index 'Japan' and 'France'
new_df = df.drop(['Japan', 'France'])
print(new_df,'\n')

# Dropping columns 'Population' and 'GDP'
new_df = df.drop(['Population', 'GDP'], axis='columns')
print(new_df,'\n')

# Dropping row with index 'Italy' and column 'GDP'
new_df = df.drop(index='Italy', columns='GDP')
print(new_df,'\n')

                       Capital  Population           GDP
Italy                     Rome    58997201  2.090448e+12
United States  Washington D.C.   341139019  2.767527e+13 

                       Capital
Italy                     Rome
Japan                    Tokyo
United States  Washington D.C.
France                   Paris 

                       Capital  Population
Japan                    Tokyo   126226568
United States  Washington D.C.   341139019
France                   Paris    65554051 
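
Two details of `drop` are worth a quick check: it does not modify the original object, and by default it raises a `KeyError` for a label that does not exist. The following sketch, built on a small hypothetical frame `df_demo`, shows both, using the `errors='ignore'` option to skip a missing label.

```python
import pandas as pd

# Small hypothetical frame used only for this illustration
df_demo = pd.DataFrame(
    {'Capital': ['Rome', 'Tokyo'], 'Population': [58997201, 126226568]},
    index=['Italy', 'Japan'],
)

# drop returns a new object; df_demo itself is unchanged
dropped = df_demo.drop('Japan')

# 'Spain' is not in the index: errors='ignore' skips it instead of raising KeyError
same = df_demo.drop('Spain', errors='ignore')

print(dropped)
print(df_demo)
```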



## Indexing, selection and filtering

Pandas indexing is a powerful feature that allows you to access and manipulate data within DataFrames and Series efficiently. Here are some key points about pandas indexing:

1. **Label-based Indexing**: Pandas primarily uses label-based indexing for accessing data. This means that you can access elements using row and column labels rather than integer positions.

2. **Indexing in DataFrames**: In DataFrames, indexing can be applied to both rows and columns. You can use row labels (index) and column labels to access specific data points, slices of rows or columns, or even subsets of the DataFrame.

3. **Index Objects**: Indexes in pandas are objects that store the axis labels (row or column labels). These index objects are immutable, meaning their individual elements cannot be modified in place; to change the labels, you assign a whole new index. This immutability makes it safe to share an index between data structures.

4. **Hierarchical Indexing**: Pandas supports hierarchical indexing, also known as multi-level indexing. It allows you to have multiple levels of row or column labels, providing a way to represent higher-dimensional data in a DataFrame.

5. **Selection Methods**: Pandas provides various methods for indexing and selecting data, including `loc[]`, `iloc[]`, and `at[]`. 
   - `loc[]` is used for label-based indexing, allowing you to select data based on row and column labels.
   - `iloc[]` is used for positional indexing, allowing you to select data based on integer positions.
   - `at[]` is similar to `loc[]` but provides faster access to a single scalar value.

6. **Boolean Indexing**: You can use boolean arrays or boolean conditions to filter rows or columns in a DataFrame based on certain criteria. This is often referred to as boolean indexing.

7. **Setting and Resetting Index**: You can set or reset the index of a DataFrame using `set_index()` and `reset_index()` methods, respectively. This allows you to change the row labels or move the existing index into a column.

Understanding pandas indexing is crucial for efficient data manipulation and analysis, as it provides a flexible and intuitive way to access and modify data in DataFrames and Series.

![title](img/indexing_options.png)
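
Several items from the list above (`at[]`, `set_index()`, `reset_index()` and hierarchical indexing) are not exercised in the cells that follow. The sketch below touches each of them once; the frame `df_demo` and its `value` column are hypothetical placeholders, not real statistics.

```python
import pandas as pd

# Hypothetical frame with placeholder values
df_demo = pd.DataFrame({
    'country': ['Italy', 'Italy', 'France'],
    'city': ['Rome', 'Milan', 'Paris'],
    'value': [10, 20, 30],
})

# set_index moves a column into the row index
by_city = df_demo.set_index('city')

# at[] reads a single scalar by row label and column label
rome_value = by_city.at['Rome', 'value']

# Passing two columns builds a hierarchical (multi-level) index
by_country_city = df_demo.set_index(['country', 'city'])

# Selecting on the first level returns the rows of that country, indexed by city
italy_rows = by_country_city.loc['Italy']

# reset_index moves the index levels back into ordinary columns
restored = by_country_city.reset_index()

print(rome_value)
print(italy_rows)
```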

In [101]:
# Defining column names
columns = ['Capital', 'Population', 'GDP']

# Creating a dictionary containing data
data = {'Italy': ['Rome', 58997201, 2090448e+06], 
        'Japan': ['Tokyo', 126226568, 4971929e+06], 
        'United States': ['Washington D.C.', 341139019, 27675271e+06], 
        'France': ['Paris', 65554051, 2936702e+06],
        'Unite Kingdom': ['London', 67596281, 3495261e+06],
        'Brazil': ['Brasília', 205375043, 2331391e+06],
        'China': ['Beijing', 1409670000, 18532633e+06],
        'India': ['New Delhi', 1428627663, 3937011e+06],
        'Germany': ['Berlin', 84669326, 4591100e+06],
       }

# Creating a DataFrame from the dictionary, setting the index to be the keys of the dictionary
df = pd.DataFrame.from_dict(data, orient='index', columns=columns)
print(df, '\n')

# Selecting the 'Population' column
print(df['Population'], '\n')

# Selecting the 'GDP' and 'Capital' columns
print(df[['Capital','GDP']], '\n')

# Selecting the 'Italy' and 'France' rows
print(df.loc[['Italy', 'France']], '\n')

# Selecting the population and GDP of 'Italy' and 'France'
print(df.loc[['Italy', 'France'],['Population', 'GDP']], '\n')

# Selecting the first two rows
print(df.iloc[:2])

                       Capital  Population           GDP
Italy                     Rome    58997201  2.090448e+12
Japan                    Tokyo   126226568  4.971929e+12
United States  Washington D.C.   341139019  2.767527e+13
France                   Paris    65554051  2.936702e+12
Unite Kingdom           London    67596281  3.495261e+12
Brazil                Brasília   205375043  2.331391e+12
China                  Beijing  1409670000  1.853263e+13
India                New Delhi  1428627663  3.937011e+12
Germany                 Berlin    84669326  4.591100e+12 

Italy              58997201
Japan             126226568
United States     341139019
France             65554051
Unite Kingdom      67596281
Brazil            205375043
China            1409670000
India            1428627663
Germany            84669326
Name: Population, dtype: int64 

                       Capital           GDP
Italy                     Rome  2.090448e+12
Japan                    Tokyo  4.971929e+12
United S

`loc` and `iloc` are both indexing methods in pandas, but they have different use cases and behaviors:

1. **`loc`**:
   - `loc` is primarily label-based indexing, meaning it selects data based on row and column labels.
   - It is used to access a group of rows and columns by label(s) or a boolean array.
   - With `loc`, you specify row and column labels explicitly.
   - The syntax for `loc` is `df.loc[row_label, column_label]`.
   - It is inclusive for both the start and stop labels when slicing.
   - `loc` is useful when you know the label(s) of the rows and columns you want to select.

2. **`iloc`**:
   - `iloc` is positional indexing, meaning it selects data based on integer positions.
   - It is used to access a group of rows and columns by integer positions.
   - With `iloc`, you specify row and column positions explicitly, similar to array indexing in Python.
   - The syntax for `iloc` is `df.iloc[row_position, column_position]`.
   - It is exclusive for the stop position when slicing, following Python's convention.
   - `iloc` is useful when you want to select data by integer position, regardless of the row and column labels.
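
The difference in slicing behavior described above (inclusive stop for `loc`, exclusive stop for `iloc`) is easy to verify on a small example. The frame `df_demo` below is hypothetical.

```python
import pandas as pd

# Hypothetical frame with string row labels
df_demo = pd.DataFrame({'x': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

# Label slice: both 'a' and 'c' are included
label_slice = df_demo.loc['a':'c']

# Position slice: position 2 is excluded, as in regular Python slicing
position_slice = df_demo.iloc[0:2]

print(label_slice)
print(position_slice)
```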

In [102]:
# Selecting 'GDP' of countries with more than 60e+06 population
print(df[df.Population > 60e+06]['GDP'])

Japan            4.971929e+12
United States    2.767527e+13
France           2.936702e+12
Unite Kingdom    3.495261e+12
Brazil           2.331391e+12
China            1.853263e+13
India            3.937011e+12
Germany          4.591100e+12
Name: GDP, dtype: float64


The provided code snippet selects the 'GDP' of countries with a population greater than 60 million using boolean indexing. 

- `df.Population > 60e+06`: This creates a boolean mask where each element in the 'Population' column of the DataFrame `df` is compared to 60 million. It returns `True` for countries with a population greater than 60 million and `False` otherwise.
- `df[df.Population > 60e+06]`: This uses boolean indexing to filter the rows of the DataFrame `df` where the condition (population greater than 60 million) is `True`. It selects only the rows that meet the condition.
- `['GDP']`: This selects the 'GDP' column from the filtered DataFrame, resulting in a Series containing the GDP values of countries with a population greater than 60 million.
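
The same selection can also be written as a single `.loc` call that takes the boolean mask and the column label together, which avoids chaining two indexing steps. This is a sketch on a hypothetical frame `df_demo` with made-up numbers; it also shows how two conditions can be combined.

```python
import pandas as pd

# Hypothetical frame with made-up numbers
df_demo = pd.DataFrame(
    {'Population': [50, 70, 90], 'GDP': [1.0, 2.0, 3.0]},
    index=['X', 'Y', 'Z'],
)

# Row mask and column label in one .loc call
gdp_large = df_demo.loc[df_demo['Population'] > 60, 'GDP']

# Conditions are combined with & (and) or | (or); each needs its own parentheses
gdp_mid = df_demo.loc[(df_demo['Population'] > 60) & (df_demo['GDP'] < 2.5), 'GDP']

print(gdp_large)
print(gdp_mid)
```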

## Sorting

Sorting rows or columns of a dataset by some criterion is another important built-in operation. 

The method `sort_index` returns a new object sorted by its labels along the chosen axis (row labels with `axis=0`, column labels with `axis=1`). In contrast, the method `sort_values` sorts the object by its values.

In [103]:
print(df)

                       Capital  Population           GDP
Italy                     Rome    58997201  2.090448e+12
Japan                    Tokyo   126226568  4.971929e+12
United States  Washington D.C.   341139019  2.767527e+13
France                   Paris    65554051  2.936702e+12
Unite Kingdom           London    67596281  3.495261e+12
Brazil                Brasília   205375043  2.331391e+12
China                  Beijing  1409670000  1.853263e+13
India                New Delhi  1428627663  3.937011e+12
Germany                 Berlin    84669326  4.591100e+12


In [104]:
sorted_df_by_row = df.sort_index(axis=0)

# Sorted dataframe by index (axis=0)
print(sorted_df_by_row)

                       Capital  Population           GDP
Brazil                Brasília   205375043  2.331391e+12
China                  Beijing  1409670000  1.853263e+13
France                   Paris    65554051  2.936702e+12
Germany                 Berlin    84669326  4.591100e+12
India                New Delhi  1428627663  3.937011e+12
Italy                     Rome    58997201  2.090448e+12
Japan                    Tokyo   126226568  4.971929e+12
Unite Kingdom           London    67596281  3.495261e+12
United States  Washington D.C.   341139019  2.767527e+13


In [105]:
sorted_df_by_col = df.sort_index(axis=1)

# Sorted dataframe by column (axis=1)
print(sorted_df_by_col)

                       Capital           GDP  Population
Italy                     Rome  2.090448e+12    58997201
Japan                    Tokyo  4.971929e+12   126226568
United States  Washington D.C.  2.767527e+13   341139019
France                   Paris  2.936702e+12    65554051
Unite Kingdom           London  3.495261e+12    67596281
Brazil                Brasília  2.331391e+12   205375043
China                  Beijing  1.853263e+13  1409670000
India                New Delhi  3.937011e+12  1428627663
Germany                 Berlin  4.591100e+12    84669326


In [106]:
sorted_series_by_values = df['GDP'].sort_values()

# The 'GDP' Series sorted by its own values
print(sorted_series_by_values, "\n")

# Sorting the entire DataFrame by the 'GDP' column, not only the Series
sorted_df_by_values = df.sort_values('GDP')
print(sorted_df_by_values)

Italy            2.090448e+12
Brazil           2.331391e+12
France           2.936702e+12
Unite Kingdom    3.495261e+12
India            3.937011e+12
Germany          4.591100e+12
Japan            4.971929e+12
China            1.853263e+13
United States    2.767527e+13
Name: GDP, dtype: float64 

                       Capital  Population           GDP
Italy                     Rome    58997201  2.090448e+12
Brazil                Brasília   205375043  2.331391e+12
France                   Paris    65554051  2.936702e+12
Unite Kingdom           London    67596281  3.495261e+12
India                New Delhi  1428627663  3.937011e+12
Germany                 Berlin    84669326  4.591100e+12
Japan                    Tokyo   126226568  4.971929e+12
China                  Beijing  1409670000  1.853263e+13
United States  Washington D.C.   341139019  2.767527e+13


In [107]:
# Sorting by multiple columns, both in descending order

sorted_df_by_values = df.sort_values(['Population', 'GDP'], ascending=False)
print(sorted_df_by_values)

                       Capital  Population           GDP
India                New Delhi  1428627663  3.937011e+12
China                  Beijing  1409670000  1.853263e+13
United States  Washington D.C.   341139019  2.767527e+13
Brazil                Brasília   205375043  2.331391e+12
Japan                    Tokyo   126226568  4.971929e+12
Germany                 Berlin    84669326  4.591100e+12
Unite Kingdom           London    67596281  3.495261e+12
France                   Paris    65554051  2.936702e+12
Italy                     Rome    58997201  2.090448e+12
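
When sorting by several columns, `ascending` also accepts a list with one flag per column, so that each key can have its own direction. A minimal sketch on a hypothetical frame `df_demo`:

```python
import pandas as pd

# Hypothetical frame with a grouping column and a score
df_demo = pd.DataFrame(
    {'group': ['b', 'a', 'b', 'a'], 'score': [1, 4, 3, 2]},
    index=['w', 'x', 'y', 'z'],
)

# 'group' ascending, then 'score' descending within each group
ordered = df_demo.sort_values(['group', 'score'], ascending=[True, False])

print(ordered)
```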


## Summarizing and Descriptive Statistics

Using the appropriate methods, it's possible to compute a descriptive statistic (e.g. mean, cov) or a mathematical operation (e.g. sum, prod) from a Series or a DataFrame.

The `pandas.DataFrame.sum()` method computes the sum of the values over a specified axis of a DataFrame.

In [110]:
# Importing the string module to access string constants
import string

# Generating a list of lowercase letters ('a' to 'z')
indexes = list(string.ascii_lowercase)

# Creating a DataFrame with 5 rows and 2 columns, filled with random normal values
# The columns are named 'A' and 'B', and the index is the first 5 lowercase letters ('a' to 'e')
df = pd.DataFrame(data=np.random.normal(size=(5, 2)), columns=['A', 'B'], index=indexes[:5])

# Printing the DataFrame to the console
print(df)

          A         B
a -0.416273 -1.652043
b  0.931256  1.767693
c  0.646331 -0.054900
d -0.823023 -1.073058
e -0.047175 -1.040609


In [111]:
# Adding a new column 'A+B' which is the sum of columns 'A' and 'B' for each row
df['A+B'] = df.sum(axis='columns')

# Calculating the sum of each column
total_s = df.sum(axis=0)

# Transforming the column sum into a DataFrame and transposing it
total_f = total_s.to_frame().T

# Setting the index of the new DataFrame to 'Total'
total_f.index = ['Total']

# Concatenating the original DataFrame with the column sum DataFrame
df = pd.concat([df, total_f])

# Printing the final DataFrame
print(df)

              A         B       A+B
a     -0.416273 -1.652043 -2.068316
b      0.931256  1.767693  2.698950
c      0.646331 -0.054900  0.591431
d     -0.823023 -1.073058 -1.896082
e     -0.047175 -1.040609 -1.087784
Total  0.291115 -2.052916 -1.761801


Summing the two columns of interest directly gives the same result as calling the `sum()` method with `axis='columns'`.

In [113]:
# Creating a DataFrame with 5 rows and 2 columns, filled with random normal values
# The columns are named 'A' and 'B', and the index is the first 5 lowercase letters ('a' to 'e')
df = pd.DataFrame(data=np.random.normal(size=(5, 2)), columns=['A', 'B'], index=indexes[:5])

# Adding a new column 'A+B' which is the sum of the values in columns 'A' and 'B' for each row
df['A+B'] = df['A'] + df['B']

# Printing the DataFrame to the console
print(df)

          A         B       A+B
a -2.191691 -1.704411 -3.896102
b  1.010604 -0.151641  0.858963
c  0.025903 -0.650193 -0.624290
d  0.918977  0.006619  0.925596
e  0.237880 -0.358286 -0.120407


By default, `sum()` skips NA (null) values. Setting the `skipna` parameter to `False` includes them in the computation, which affects the resulting sum.

When `skipna=False`, the result for any column that contains at least one NA value is itself NA, because a sum involving a missing value is undefined.

In [125]:
df.loc['a',['A','B']] = np.nan
df

print(df.sum(),"\n")

# Setting the parameter skipna to False means that NA values are not excluded when computing the result.
print(df.sum(skipna=False))

A      2.193363
B     -1.153501
A+B   -2.856239
dtype: float64 

A           NaN
B           NaN
A+B   -2.856239
dtype: float64


In [167]:
# Creating a DataFrame with 5 rows and 2 columns, filled with random integer values between 1 and 99
# The columns are named 'A' and 'B'
df = pd.DataFrame(np.random.randint(low=1, high=100, size=(5, 2)), columns=['A', 'B'])

# Adding a new column 'A*B' which is the product of the values in columns 'A' and 'B' for each row
df['A*B'] = df.prod(axis=1)

# Printing the DataFrame to the console
print(df)

    A   B   A*B
0  61  20  1220
1  52  50  2600
2  48  17   816
3  81  52  4212
4  93  70  6510
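
The statistics mentioned at the start of this section (`mean`, `cov`) follow the same pattern as `sum` and `prod`. The sketch below computes them on a small hypothetical frame `df_demo` and also calls `describe()`, which collects several summary statistics at once.

```python
import pandas as pd

# Hypothetical frame: column B is exactly twice column A
df_demo = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0], 'B': [2.0, 4.0, 6.0, 8.0]})

col_means = df_demo.mean()        # mean of each column
row_means = df_demo.mean(axis=1)  # mean of each row
covariance = df_demo.cov()        # pairwise sample covariance of the columns
summary = df_demo.describe()      # count, mean, std, min, quartiles, max

print(col_means)
print(covariance)
print(summary)
```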
