# Essential DataFrame Operations: Summarization and Data Manipulation
_This section focuses on summarizing DataFrames, chaining methods for efficient workflows, and performing key DataFrame operations._

---

## Contents
1. **Introduction**  
   - Why summarizing data is important  
   - Streamlining operations with method chaining  

2. **Key Concepts**  
   - Summarizing a DataFrame  
   - Chaining DataFrame Methods  
   - DataFrame Operations  

3. **Practical Exercises**  
   Explanation of what the user will learn in this section.

---

## Datasets Used  
- [Volumen de solicitudes de visa colombiana recibidas desde 2017](https://www.datos.gov.co/Estad-sticas-Nacionales/Volumen-de-solicitudes-de-visa-colombiana-recibida/mgr2-njqc/about_data)  
### About Dataset

#### Context  
This dataset provides information about the number of visa applications received by the Colombian Visa and Immigration Authority. It is aligned with the definition established by Resolution 5477 of 2022, detailing the nationality, gender, date of birth of the applicant, and the type of visa requested, classified by the purpose of stay. It aims to track trends in visa applications over the years.

#### Content  
##### Features of the dataset:
- **Año Solicitud**: The year in which the visa application was submitted.  
- **Nacionalidad**: The nationality of the applicant as per the passport.  
- **Sexo**: The gender of the applicant as per the passport.  
- **Fecha de Nacimiento**: The applicant's date of birth as per the passport.  
- **Vocación de Permanencia**: The type of visa requested, classified by the purpose of stay (e.g., temporary, permanent).  
- **Número**: The total number of visa applications received.

#### Acknowledgements  
Special thanks to the Ministry of Foreign Affairs of Colombia for providing this dataset.

#### Inspiration  
This dataset can be used for analyzing trends in immigration and visa applications, identifying demographic patterns, and understanding migration flows into Colombia over the years.

#### Source  
The data was provided by the Ministry of Foreign Affairs of Colombia and can be accessed through the Colombian Open Data portal.

--- 

## Author
**Author Name:** Juan Alejandro Carrillo Jaimes  

**Contact:** [jalejandrocjaimes@gmail.com](mailto:jalejandrocjaimes@gmail.com) - [Linkedin-AlejoCJaimes31](https://www.linkedin.com/in/alejocjaimes31/)  

**Purpose:** This content was created as an educational resource for university students.


# 1. Introduction
In this recipe, we explore some of the most common DataFrame attributes and methods using the *visa_applications* dataset.

## Why summarizing data is important  

<p align="center">
  <img src="https://miro.medium.com/max/511/1*87OCzaZL-Es-DFiVL4ANAw.png" width="500" height="600"/>
</p>

Understanding how the data is distributed is one of the first tasks in any analysis. Summarizing the data with DataFrame methods, from data types to descriptive statistics, gives a first picture of its distribution and variability.


## Streamlining operations with method chaining  

<p align="center">
  <img src="https://www.sharpsightlabs.com/wp-content/uploads/2021/03/pandas-chain_syntax-explanation.png" width="500" height="300"/>
</p>

Chaining operations lets you write fewer lines of code while keeping a clear, top-to-bottom view of each transformation.

# Dataset Important Information

The step-by-step process for importing the dataset is described in [C2-1-Selecting-and-Organizing-Columns.ipynb](https://github.com/Doc-UP-AlejandroJaimes/Pandas-for-Education-Learning-through-Hands-On-Examples/blob/main/C2-Essential-Pandas-Techniques-for-DataFrames/C2-01-Selection-and-Organization/C2-1-Selecting-and-Organizing-Columns.ipynb).

In [133]:
import pandas as pd # type: ignore
import numpy as np # type: ignore
import os

In [134]:
# absolute path
current_file_path = os.path.abspath('C2-Essential-Pandas-Techniques-for-DataFrames/C2-01-Selection-and-Organization/C2_1_Selecting_and_Organizing_Columns.ipynb')

# go up 5 levels from the notebook path to reach the repository root
root_dir = os.path.abspath(os.path.join(current_file_path, '../../../../../'))

# dataset directory
dataset_dir = os.path.join(root_dir, 'datasets', 'visa-col-application-datagov-df')

for dirname, _, filenames in os.walk(dataset_dir):
    for filename in filenames:
        print(os.path.join(dirname, filename))

c:\Users\study_2025\Documents\Github\Doc-UP-AlejandroJaimes\Pandas-for-Education-Learning-through-Hands-On-Examples\datasets\visa-col-application-datagov-df\Visa_Applications_Colombia_2017_20250217.csv


In [135]:
visa_applications = pd.read_csv(os.path.join(dataset_dir, 'Visa_Applications_Colombia_2017_20250217.csv'))

1. Normalize the column names by renaming them as follows:

    **Año Solicitud** -> *year_application*

    **Nacionalidad** -> *nationality*

    **Sexo** -> *gender*

    **Fecha de Nacimiento** -> *birth_date*

    **Vocación de permanencia** -> *permanent_stay_intent*

    **Número** -> *number_of_application*

In [136]:
new_columns = {
    'Año Solicitud': 'year_application',
    'Nacionalidad': 'nationality',
    'Sexo': 'gender',
    'Fecha de Nacimiento': 'birth_date',
    'Vocación de permanencia': 'permanent_stay_intent',
    'Número': 'number_of_application',
}
visa_applications.rename(columns=new_columns, inplace=True)
visa_applications.columns.tolist()

['year_application',
 'nationality',
 'gender',
 'birth_date',
 'permanent_stay_intent',
 'number_of_application']

# 2. Key Concepts

## 2.1 Summarizing a DataFrame

Summarizing a DataFrame is a crucial step in data analysis as it provides an overview of the dataset’s structure, key statistics, and missing values.

### **Basic Properties**  
- **`.shape`** – Returns a tuple with the number of rows and columns in the DataFrame.
- **`.size`** – Returns the total number of elements (cells) in the DataFrame.  
    
- **`.ndim`** – Returns the number of dimensions (2 for DataFrames, 1 for Series).  
    
- **`len(df)`** – Returns the number of rows in the DataFrame.  
    
- **`.count()`** – Returns the number of non-null values per column.  
    

#### **Summary Statistics**  
- **`.min()`** – Returns the minimum value per column.  
    
- **`.max()`** – Returns the maximum value per column.  
    
- **`.mean()`** – Returns the mean (average) per column.  
    
- **`.median()`** – Returns the median per column.  
    
- **`.std()`** – Returns the standard deviation per column.  
    

#### **Statistical Summary**  
- **`.describe()`** – Computes the main descriptive statistics and quartiles at once (count, mean, std, min, 25%, 50%, 75%, max) for numerical columns.  
    
- **`.describe().T`** – Transposes the output of `.describe()` for better readability.  
    
- **`.describe(percentiles=[0.1, 0.5, 0.9])`** – Includes custom percentiles in the summary.  
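
As a small illustration of these options, the sketch below runs `.describe()` on a made-up DataFrame. The `toy` frame and its values are invented for this example and are not taken from the visa dataset. It also shows `exclude="number"`, one way to summarize the non-numeric columns instead.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "year": [2017, 2018, 2018, 2019, 2020],
    "sex": ["F", "M", "M", "F", "M"],
    "n": [1, 2, 2, 3, 10],
})

# Default: only the numeric columns (year, n) are summarized
numeric_summary = toy.describe()

# Custom percentiles, transposed so each original column becomes a row
custom_summary = toy.describe(percentiles=[0.1, 0.5, 0.9]).T

# Non-numeric columns: count, unique, top (most frequent value), freq
text_summary = toy.describe(exclude="number")

print(numeric_summary)
print(custom_summary)
print(text_summary)
```

For text columns the summary switches from mean and quartiles to `unique`, `top`, and `freq`, which is usually what you want for categorical fields such as `nationality`.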


#### **Key Methods for Summarization**  

-  **`.info()`** – Prints a concise summary of the DataFrame, including column names, data types, non-null counts, and memory usage.  
   
- **`.head(n)` and `.tail(n)`** – Display the first or last `n` rows of the DataFrame.  
  
- **`.value_counts()`** – Counts how many times each unique value occurs in a column (called on a whole DataFrame, it counts unique rows).  
  
- **`.nunique()`** – Returns the number of unique values in each column.  
  
- **`.isnull().sum()`** – Counts the missing values in each column.  
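
The difference between calling `.value_counts()` on a single column and on a whole DataFrame is easy to miss, so here is a minimal sketch on invented data (the `toy` frame is hypothetical, not part of the visa dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing gender
toy = pd.DataFrame({
    "gender": ["F", "M", "M", np.nan, "M"],
    "intent": ["temporal", "temporal", "permanente", "temporal", "temporal"],
})

# On a Series: occurrences of each distinct value (NaN is dropped by default)
per_value = toy["gender"].value_counts()

# On a DataFrame: occurrences of each distinct row (rows containing NaN are dropped by default)
per_row = toy.value_counts()

# Distinct values per column, and missing values per column
distinct = toy.nunique()
missing = toy.isnull().sum()

print(per_value, per_row, distinct, missing, sep="\n\n")
```

For a categorical column, the per-column form (`df["col"].value_counts()`) is usually the one you want.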


In [137]:
print('BASIC PROPERTIES')
print('=='*25)
shape = visa_applications.shape # (349583, 6)
size = visa_applications.size # number of rows * number of columns
dim = visa_applications.ndim # number of dimensions
total_rows = len(visa_applications) # number of rows
print(f'Shape: {shape}\nSize: {size}\nDimensions: {dim}\nTotal Rows: {total_rows}')
print('Non-null values in each column: ')
print(visa_applications.count())
print('=='*25)


BASIC PROPERTIES
Shape: (349583, 6)
Size: 2097498
Dimensions: 2
Total Rows: 349583
Non-null values in each column: 
year_application         349583
nationality              349583
gender                   349576
birth_date               349583
permanent_stay_intent    349583
number_of_application    349583
dtype: int64


In [138]:
visa_applications.select_dtypes(include='number').min()

year_application         2017
number_of_application       1
dtype: int64

In [139]:
print('SUMMARY STATISTICS (numeric columns only)')
print('=='*25)
numeric_cols = visa_applications.select_dtypes(include='number') # keep only the numeric columns
ss_min = numeric_cols.min() # minimum per column
ss_max = numeric_cols.max() # maximum per column
ss_mean = numeric_cols.mean() # mean per column
ss_median = numeric_cols.median() # median per column
ss_std = numeric_cols.std() # standard deviation per column
ss_var = numeric_cols.var() # variance per column
ss_quantile = numeric_cols.quantile([0.25, 0.5, 0.75]) # quartiles per column
print(f'Min: {ss_min}\n\nMax: {ss_max}\n\nMean: {ss_mean}\n\nMedian: {ss_median}\n\nStd: {ss_std}\n\nVar: {ss_var}\n\nQuantile: {ss_quantile}')
print('=='*25)

SUMMARY STATISTICS (numeric columns only)
Min: year_application         2017
number_of_application       1
dtype: int64

Max: year_application         2024
number_of_application      22
dtype: int64

Mean: year_application         2019.270342
number_of_application       1.653776
dtype: float64

Median: year_application         2019.0
number_of_application       1.0
dtype: float64

Std: year_application         1.825129
number_of_application    0.998218
dtype: float64

Var: year_application         3.331098
number_of_application    0.996439
dtype: float64

Quantile:       year_application  number_of_application
0.25            2018.0                    1.0
0.50            2019.0                    1.0
0.75            2021.0                    2.0


In [140]:
print('STATISTICAL SUMMARY')
print('=='*25)
stat_summary = visa_applications.describe()
print(stat_summary)
print('TRANSPOSE STATISTICAL SUMMARY WITH PERCENTILES')
print('=='*25)
stat_summary = visa_applications.describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9])
print(stat_summary.T)
print('=='*25)

STATISTICAL SUMMARY
       year_application  number_of_application
count     349583.000000          349583.000000
mean        2019.270342               1.653776
std            1.825129               0.998218
min         2017.000000               1.000000
25%         2018.000000               1.000000
50%         2019.000000               1.000000
75%         2021.000000               2.000000
max         2024.000000              22.000000
TRANSPOSE STATISTICAL SUMMARY WITH PERCENTILES
                          count         mean       std     min     10%  \
year_application       349583.0  2019.270342  1.825129  2017.0  2017.0   
number_of_application  349583.0     1.653776  0.998218     1.0     1.0   

                          25%     50%     75%     90%     max  
year_application       2018.0  2019.0  2021.0  2022.0  2024.0  
number_of_application     1.0     1.0     2.0     3.0    22.0  


In [141]:
visa_applications.select_dtypes(include='number').mean(skipna=False)

year_application         2019.270342
number_of_application       1.653776
dtype: float64

**Keep in mind**
    
1. By default, **`.describe()`** summarizes only numerical columns; pass `include='all'` to cover the other data types as well.

2. When a numeric column has missing values, pandas skips them by default. You can change this behavior by setting the `skipna` parameter to `False`, in which case any missing value makes the result `NaN`.

    **Other functions using `skipna`**
    - `.sum(skipna=True/False)`
    - `.min(skipna=True/False)`
    - `.max(skipna=True/False)`
    - `.std(skipna=True/False)`
    - `.median(skipna=True/False)`
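
The numeric columns of the visa dataset contain no missing values, so `skipna` makes no visible difference there. The sketch below uses a small invented Series with one gap to show the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
s = pd.Series([1.0, 2.0, np.nan, 3.0])

mean_default = s.mean()             # NaN is skipped: (1 + 2 + 3) / 3
mean_strict = s.mean(skipna=False)  # any NaN makes the result NaN

print(mean_default)
print(mean_strict)
```

Using `skipna=False` can serve as a quick guard: a `NaN` result signals that the column still has gaps to handle.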

In [142]:
visa_applications.select_dtypes(include='number').mean()

year_application         2019.270342
number_of_application       1.653776
dtype: float64

In [143]:
print('KEY METHODS')
print('=='*25)
ds_info = visa_applications.info() # prints the summary directly and returns None
ds_head = visa_applications.head(2) # first 2 rows
ds_tail = visa_applications.tail(2) # last 2 rows
ds_values_c = visa_applications.value_counts() # count of each unique row (all columns combined)
ds_unique = visa_applications.nunique() # number of unique values per column
ds_miss_vals = visa_applications.isnull().sum() # number of missing values per column
print(f'Info: {ds_info}\n\nHead: {ds_head}\n\nTail: {ds_tail}\n\nValue Counts: {ds_values_c}\n\nUnique Values: {ds_unique}\n\nMissing Values: {ds_miss_vals}')
print('=='*25)

KEY METHODS
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 349583 entries, 0 to 349582
Data columns (total 6 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   year_application       349583 non-null  int64 
 1   nationality            349583 non-null  object
 2   gender                 349576 non-null  object
 3   birth_date             349583 non-null  object
 4   permanent_stay_intent  349583 non-null  object
 5   number_of_application  349583 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 16.0+ MB
Info: None

Head:    year_application          nationality    gender  birth_date  \
0              2017          ECUATORIANA  FEMENINO  24/07/1897   
1              2017  FEDERACION DE RUSIA  FEMENINO  03/05/1919   

         permanent_stay_intent  number_of_application  
0  Con vocación de permanencia                      2  
1  Sin vocación de permanencia                      2  

Tail:         year_applic

## 2.2 Chaining DataFrame Methods

Method chaining in pandas means applying several methods one after another in a single expression instead of assigning each intermediate result to its own variable. This keeps code concise and readable and avoids cluttering the namespace with throwaway variables.

**Example**
We will use the `.isna()` method to flag `NaN` values and chain it with `.any(axis=1)`, which returns `True` for rows with at least one `NaN`.

In [144]:
df_nulls = visa_applications[visa_applications.isna().any(axis=1)]
df_nulls

| | year_application | nationality | gender | birth_date | permanent_stay_intent | number_of_application |
|---|---|---|---|---|---|---|
| 248458 | 2021 | ESTADOUNIDENSE | NaN | 31/05/2003 | Sin vocación de permanencia | 2 |
| 304706 | 2022 | VENEZOLANA | NaN | 01/05/1974 | Con vocación de permanencia | 2 |
| 304707 | 2022 | CUBANA | NaN | 08/04/1992 | Sin vocación de permanencia | 1 |
| 304708 | 2022 | CUBANA | NaN | 01/12/1992 | Sin vocación de permanencia | 1 |
| 304709 | 2022 | CUBANA | NaN | 08/04/1993 | Sin vocación de permanencia | 2 |
| 304710 | 2022 | CUBANA | NaN | 07/04/1997 | Sin vocación de permanencia | 1 |
| 304711 | 2022 | CUBANA | NaN | 27/09/2003 | Sin vocación de permanencia | 1 |


1. Determine whether there are any missing values in the DataFrame.

In [145]:
visa_applications.isnull().any().any()

np.True_

2. Apply the following transformation to the column `permanent_stay_intent`, using lowercase for the new values:
    - `Sin vocación de permanencia` -> *temporal*
    - `Con vocación de permanencia` -> *permanente*

In [146]:
visa_applications['permanent_stay_intent'] = (
    visa_applications['permanent_stay_intent']
    .str.lower()
    .str.replace('sin vocación de permanencia', 'temporal')
    .str.replace('con vocación de permanencia', 'permanente')
)
visa_applications.head()

| | year_application | nationality | gender | birth_date | permanent_stay_intent | number_of_application |
|---|---|---|---|---|---|---|
| 0 | 2017 | ECUATORIANA | FEMENINO | 24/07/1897 | permanente | 2 |
| 1 | 2017 | FEDERACION DE RUSIA | FEMENINO | 03/05/1919 | temporal | 2 |
| 2 | 2017 | FRANCESA | FEMENINO | 20/08/1919 | temporal | 1 |
| 3 | 2017 | CUBANA | FEMENINO | 03/02/1922 | temporal | 2 |
| 4 | 2017 | ESTADOUNIDENSE | FEMENINO | 17/11/1922 | temporal | 1 |


3. Standardize the values of the `gender` column as follows:
   - *FEMENINO* -> **F**
   - *MASCULINO* -> **M**

In [147]:
visa_applications['gender'].value_counts()

gender
MASCULINO    224294
FEMENINO     125282
Name: count, dtype: int64

In [148]:
# map the full labels to single letters; values not in the dictionary (including NaN) are left unchanged
visa_applications['gender'] = visa_applications['gender'].replace({'MASCULINO': 'M', 'FEMENINO': 'F'})
visa_applications.head()

| | year_application | nationality | gender | birth_date | permanent_stay_intent | number_of_application |
|---|---|---|---|---|---|---|
| 0 | 2017 | ECUATORIANA | F | 24/07/1897 | permanente | 2 |
| 1 | 2017 | FEDERACION DE RUSIA | F | 03/05/1919 | temporal | 2 |
| 2 | 2017 | FRANCESA | F | 20/08/1919 | temporal | 1 |
| 3 | 2017 | CUBANA | F | 03/02/1922 | temporal | 2 |
| 4 | 2017 | ESTADOUNIDENSE | F | 17/11/1922 | temporal | 1 |


In [149]:
visa_applications.tail(5)

| | year_application | nationality | gender | birth_date | permanent_stay_intent | number_of_application |
|---|---|---|---|---|---|---|
| 349578 | 2024 | ESPAÑOLA | M | 11/03/2024 | permanente | 2 |
| 349579 | 2024 | AZERBAIYANA | M | 06/04/2024 | temporal | 1 |
| 349580 | 2024 | CHINA | M | 08/04/2024 | temporal | 1 |
| 349581 | 2024 | ESTADOUNIDENSE | M | 06/05/2024 | temporal | 2 |
| 349582 | 2024 | EGIPCIA | M | 12/09/2024 | temporal | 2 |


**Keep in mind**
    
**Benefits of Method Chaining**
1. **Improves readability**: The code is structured in a logical, top-down flow.
2. **Avoids temporary variables**: Reduces clutter and lets intermediate results be released as soon as the next step has consumed them.
3. **Can help performance**: pandas still evaluates each step eagerly, so any gain comes from not keeping intermediate objects alive rather than from the chain itself.

For *readability*, method chains are often written as one method call per line, surrounded by parentheses. This makes it easier to read the chain, insert comments on what each step returns, or comment out lines to debug what is happening.
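
A minimal sketch of that layout, on invented rows shaped like the visa data (the `toy` frame and the specific steps are assumptions chosen only to illustrate the formatting):

```python
import pandas as pd

# Hypothetical rows, invented for illustration
toy = pd.DataFrame({
    "gender": ["FEMENINO", None, "MASCULINO", "MASCULINO"],
    "number_of_application": [2, 1, 3, 1],
})

cleaned = (
    toy
    .dropna(subset=["gender"])                               # drop rows with no gender
    .assign(gender=lambda d: d["gender"].str[0])             # FEMENINO -> F, MASCULINO -> M
    .sort_values("number_of_application", ascending=False)   # largest counts first
    # .head(2)                                               # uncomment while debugging
    .reset_index(drop=True)                                  # renumber rows from 0
)
print(cleaned)
```

Because each step sits on its own line, a single step can be disabled by commenting it out without touching the rest of the chain.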

## 2.3 DataFrame operations

Python's arithmetic and comparison operators work with DataFrames, as they do with Series.

When an arithmetic or comparison operator is used with a DataFrame, the operation is applied to each value of each column. Typically, when an operator is used with a DataFrame, the columns are either all numeric or all object (usually strings). If the DataFrame does not contain homogeneous data, the operation is likely to fail.

**Example**
We will add 2 to each value of the DataFrame.

In [150]:
visa_applications['birth_date'] = pd.to_datetime(visa_applications['birth_date'], dayfirst=True)
visa_applications.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 349583 entries, 0 to 349582
Data columns (total 6 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   year_application       349583 non-null  int64         
 1   nationality            349583 non-null  object        
 2   gender                 349576 non-null  object        
 3   birth_date             349583 non-null  datetime64[ns]
 4   permanent_stay_intent  349583 non-null  object        
 5   number_of_application  349583 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 16.0+ MB


In [151]:
visa_applications + 2

TypeError: can only concatenate str (not "int") to str

To use an operator with a DataFrame successfully, first select homogeneous data. We use `gender` as the label for our index, and then select the columns we want with the `select_dtypes` method.

In [152]:
visa_app = visa_applications.copy()
visa_app.set_index('gender', inplace=True)
visa_app.head()

| gender | year_application | nationality | birth_date | permanent_stay_intent | number_of_application |
|---|---|---|---|---|---|
| F | 2017 | ECUATORIANA | 1897-07-24 | permanente | 2 |
| F | 2017 | FEDERACION DE RUSIA | 1919-05-03 | temporal | 2 |
| F | 2017 | FRANCESA | 1919-08-20 | temporal | 1 |
| F | 2017 | CUBANA | 1922-02-03 | temporal | 2 |
| F | 2017 | ESTADOUNIDENSE | 1922-11-17 | temporal | 1 |


In [153]:
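# note: this selects from the original DataFrame, so the `gender` index set in the previous cell is not carried over
visa_app = visa_applications.select_dtypes(include='number')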
visa_app.head()

| | year_application | number_of_application |
|---|---|---|
| 0 | 2017 | 2 |
| 1 | 2017 | 2 |
| 2 | 2017 | 1 |
| 3 | 2017 | 2 |
| 4 | 2017 | 1 |


In [154]:
visa_app = visa_app + 2
visa_app.head()

| | year_application | number_of_application |
|---|---|---|
| 0 | 2019 | 4 |
| 1 | 2019 | 4 |
| 2 | 2019 | 3 |
| 3 | 2019 | 4 |
| 4 | 2019 | 3 |
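
Comparison operators behave the same way as the addition shown here: they are applied element by element and return a DataFrame of booleans. A short sketch on invented numbers (the `nums` frame is hypothetical):

```python
import pandas as pd

# Hypothetical numeric-only frame
nums = pd.DataFrame({
    "year_application": [2017, 2019, 2024],
    "number_of_application": [1, 2, 22],
})

above_one = nums > 1             # element-wise comparison -> DataFrame of booleans
share_above = above_one.mean()   # True counts as 1, so the mean is the share of True values
doubled = nums * 2               # arithmetic is element-wise as well

print(above_one)
print(share_above)
print(doubled)
```

Taking the mean of a boolean result is a common way to get a proportion per column without writing a loop.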


1. Add a new column called `age`, computed as the current year minus the year of birth.

In [156]:
year_date =  visa_applications['birth_date'].dt.year
current_year = pd.Timestamp.now().year
age = current_year - year_date
visa_applications['age'] = age
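
Subtracting years alone can overstate an age by one for anyone whose birthday has not yet occurred in the current year. If that precision matters, one possible refinement is sketched below on invented dates; the fixed `reference` date is an assumption used only to keep the result reproducible.

```python
import pandas as pd

# Hypothetical birth dates, invented for illustration
birth = pd.Series(pd.to_datetime(["1990-03-15", "1990-11-30", "2024-02-18"]))
reference = pd.Timestamp("2025-06-01")  # assumed "today"; pd.Timestamp.now() in real use

# True where the birthday falls later in the year than the reference date
birthday_pending = (
    (birth.dt.month > reference.month)
    | ((birth.dt.month == reference.month) & (birth.dt.day > reference.day))
)

age_by_year = reference.year - birth.dt.year            # plain year difference
age_exact = age_by_year - birthday_pending.astype(int)  # subtract 1 if the birthday is still ahead

print(pd.DataFrame({"birth": birth, "by_year": age_by_year, "exact": age_exact}))
```

For a coarse summary the year difference is often good enough; the correction only changes rows whose birthday is still ahead in the reference year.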

In [157]:
visa_applications.head()

| | year_application | nationality | gender | birth_date | permanent_stay_intent | number_of_application | age |
|---|---|---|---|---|---|---|---|
| 0 | 2017 | ECUATORIANA | F | 1897-07-24 | permanente | 2 | 128 |
| 1 | 2017 | FEDERACION DE RUSIA | F | 1919-05-03 | temporal | 2 | 106 |
| 2 | 2017 | FRANCESA | F | 1919-08-20 | temporal | 1 | 106 |
| 3 | 2017 | CUBANA | F | 1922-02-03 | temporal | 2 | 103 |
| 4 | 2017 | ESTADOUNIDENSE | F | 1922-11-17 | temporal | 1 | 103 |


2. Calculate the column `still_alive`, keeping the following in mind:
   - Women tend to live longer than men, a difference known as the longevity gap. In 2021, the average life expectancy for **women was 73.8 years**, compared to **68.4 years for men**.
   - To represent this, the column `still_alive` will be **1** if the applicant's age does not exceed the life expectancy for their gender (*still alive*), and **0** otherwise.

In [158]:
def still_alive(age, gender):
    alive = 0
    if gender == 'M' and age <=68.4:
        alive = 1

    if gender == 'F' and age <=73.8:
        alive = 1
    
    return alive

visa_applications['still_alive'] = visa_applications.apply(lambda row: still_alive(row['age'], row['gender']), axis=1)
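
Row-wise `.apply(..., axis=1)` calls a Python function once per row, which can be slow on a frame of this size. A possible vectorized alternative, sketched on invented rows (the `toy` frame and the `LIMITS` name are assumptions for illustration), uses `.map()` to look up the threshold for each row and a single comparison:

```python
import numpy as np
import pandas as pd

# Hypothetical rows, including one with a missing gender
toy = pd.DataFrame({
    "gender": ["M", "M", "F", "F", np.nan],
    "age": [68, 69, 73, 74, 30],
})

LIMITS = {"M": 68.4, "F": 73.8}  # thresholds quoted in the task above

# Look up each row's threshold; a missing or unknown gender gives NaN
limit = toy["gender"].map(LIMITS)

# Comparisons against NaN are False, so an unknown gender yields 0, as in still_alive()
toy["still_alive"] = (toy["age"] <= limit).astype(int)

print(toy)
```

The result matches the row-wise function for these rows, including the missing-gender case, which falls through to 0 in both versions.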

In [159]:
visa_applications.tail(15)

| | year_application | nationality | gender | birth_date | permanent_stay_intent | number_of_application | age | still_alive |
|---|---|---|---|---|---|---|---|---|
| 349568 | 2024 | KENIANA | M | 2023-09-14 | temporal | 1 | 2 | 1 |
| 349569 | 2024 | ECUATORIANA | M | 2023-09-29 | temporal | 6 | 2 | 1 |
| 349570 | 2024 | ESTADOUNIDENSE | M | 2023-11-22 | permanente | 1 | 2 | 1 |
| 349571 | 2024 | BRASILERA | M | 2023-12-10 | permanente | 1 | 2 | 1 |
| 349572 | 2024 | ESTADOUNIDENSE | M | 2023-12-10 | permanente | 2 | 2 | 1 |
| 349573 | 2024 | VENEZOLANA | M | 2023-12-12 | temporal | 1 | 2 | 1 |
| 349574 | 2024 | NICARAGÜENSE | M | 2024-02-18 | temporal | 2 | 1 | 1 |
| 349575 | 2024 | ESTADOUNIDENSE | M | 2024-02-19 | permanente | 2 | 1 | 1 |
| 349576 | 2024 | ESTADOUNIDENSE | M | 2024-02-23 | temporal | 1 | 1 | 1 |
| 349577 | 2024 | MEXICANA | M | 2024-03-08 | permanente | 1 | 1 | 1 |


In [None]:
visa_applications['still_alive'].value_counts()

still_alive
1    332593
0     16990
Name: count, dtype: int64

In [160]:
visa_applications['still_alive'].value_counts(normalize=True)

still_alive
1    0.951399
0    0.048601
Name: proportion, dtype: float64

In [161]:
visa_applications['still_alive'].mean()

np.float64(0.951399238521324)

In [185]:
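# reset the column if a previous run created it; errors='ignore' avoids a KeyError on the first run
visa_applications.drop(columns=['rem_life_expect'], inplace=True, errors='ignore')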

3. Calculate the remaining life expectancy based on the following averages: 
   - MALES  (68.4 years) 
   - FEMALES (73.8 years).

    **Note**: If there are `NaN` values, use 0.

In [186]:
visa_applications.head()

| | year_application | nationality | gender | birth_date | permanent_stay_intent | number_of_application | age | still_alive |
|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | ECUATORIANA | F | 1897-07-24 | permanente | 2 | 128 | 0 |
| 1 | 2017 | FEDERACION DE RUSIA | F | 1919-05-03 | temporal | 2 | 106 | 0 |
| 2 | 2017 | FRANCESA | F | 1919-08-20 | temporal | 1 | 106 | 0 |
| 3 | 2017 | CUBANA | F | 1922-02-03 | temporal | 2 | 103 | 0 |
| 4 | 2017 | ESTADOUNIDENSE | F | 1922-11-17 | temporal | 1 | 103 | 0 |


In [187]:
LIFE_EXPECTANCY = {'M': 68.4, 'F': 73.8}

def calculate_life_expectancy(age, gender):
    if pd.isna(age) or gender not in LIFE_EXPECTANCY:
        return 0
    
    remaining_years = LIFE_EXPECTANCY[gender] - age
    return max(remaining_years, 0) # Ensure no negative values.


visa_applications['rem_life_expect'] = visa_applications.apply(
    lambda row: 
        calculate_life_expectancy(
            row['age'], row['gender']
        ),
        axis=1
    ).round(3)
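
The same calculation can also be written without a row-wise `.apply()`. The sketch below is one possible vectorized form on invented rows (the `toy` frame is hypothetical); it relies on `.map()` returning `NaN` for a missing or unknown gender.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, including an age above the expectancy and a missing gender
toy = pd.DataFrame({
    "gender": ["M", "F", "F", np.nan],
    "age": [27, 66, 106, 40],
})

LIFE_EXPECTANCY = {"M": 68.4, "F": 73.8}

toy["rem_life_expect"] = (
    toy["gender"].map(LIFE_EXPECTANCY)  # expectancy per row, NaN if gender is unknown
    .sub(toy["age"])                    # expectancy minus current age
    .clip(lower=0)                      # negative results become 0
    .fillna(0)                          # unknown gender becomes 0
    .round(3)
)

print(toy)
```

Each step in the chain corresponds to one branch of `calculate_life_expectancy`, so the two versions can be compared line by line.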

In [188]:
visa_applications.sample(n=15, random_state=42)

| | year_application | nationality | gender | birth_date | permanent_stay_intent | number_of_application | age | still_alive | rem_life_expect |
|---|---|---|---|---|---|---|---|---|---|
| 72412 | 2017 | ECUATORIANA | M | 2017-02-27 | temporal | 1 | 8 | 1 | 60.4 |
| 305356 | 2022 | BRITANICA | F | 1959-12-18 | permanente | 2 | 66 | 1 | 7.8 |
| 192112 | 2019 | ESTADOUNIDENSE | M | 1998-02-14 | permanente | 4 | 27 | 1 | 41.4 |
| 210086 | 2020 | PERUANA | F | 1972-07-16 | permanente | 1 | 53 | 1 | 20.8 |
| 322058 | 2022 | VENEZOLANA | M | 1987-07-05 | permanente | 3 | 38 | 1 | 30.4 |
| 238793 | 2020 | VENEZOLANA | M | 1998-12-16 | temporal | 1 | 27 | 1 | 41.4 |
| 252540 | 2021 | AUSTRALIANA | F | 1976-09-09 | permanente | 4 | 49 | 1 | 24.8 |
| 19244 | 2017 | NICARAGÜENSE | F | 1996-11-19 | temporal | 1 | 29 | 1 | 44.8 |
| 186690 | 2019 | SUDAFRICANA | M | 1993-06-17 | temporal | 1 | 32 | 1 | 36.4 |
| 144162 | 2019 | VENEZOLANA | F | 1966-08-22 | permanente | 1 | 59 | 1 | 14.8 |


In [189]:
visa_applications.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 349583 entries, 0 to 349582
Data columns (total 9 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   year_application       349583 non-null  int64         
 1   nationality            349583 non-null  object        
 2   gender                 349576 non-null  object        
 3   birth_date             349583 non-null  datetime64[ns]
 4   permanent_stay_intent  349583 non-null  object        
 5   number_of_application  349583 non-null  int64         
 6   age                    349583 non-null  int32         
 7   still_alive            349583 non-null  int64         
 8   rem_life_expect        349583 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int32(1), int64(3), object(3)
memory usage: 22.7+ MB


4. Round `rem_life_expect` to two decimal places.

In [190]:
visa_applications['rem_life_expect'].round(2).sample(n=4, random_state=42)

72412     60.4
305356     7.8
192112    41.4
210086    20.8
Name: rem_life_expect, dtype: float64

# 3. Exercises

## **Exercises: DataFrame Summarization, Chaining Methods, and Operations**  

### **Dataset: Volumen de solicitudes de visa colombiana recibidas desde 2017**  
📌 *Make sure you have downloaded the dataset before running the exercises.*  

---

### **Exercise 1: Summarizing Visa Applications by Year and Nationality**  
**Objective:** Generate summary statistics for visa applications grouped by **year** and **nationality**.  

**Tasks:**  
1. Group the dataset by **"year_application"** and **"nationality"**.  
2. Compute the following statistics for each nationality per year:  
   - **Total visa applications**  
   - **Mean number of applications**  
   - **Standard deviation of applications**  
3. Sort the summary table in **descending order** based on total visa applications.  

**Hint:** Use `.groupby()`, `.agg()`, and `.sort_values()`.  

---

### **Exercise 2: Chaining Methods to Filter and Summarize Data Efficiently**  
**Objective:** Use **method chaining** to filter and summarize data in a single command.  

**Tasks:**  
1. Select only the records where:  
   - **year_application is 2023 or later**  
   - **permanent_stay_intent** is "Sin vocación de permanencia" (`temporal` if you applied the transformation from section 2.2)  
2. Compute the **total number of applications** for each nationality in this subset.  
3. Sort the result in **descending order** and return only the **top 10 nationalities**.  

**Hint:** Chain `.query()`, `.groupby()`, `.agg()`, and `.sort_values()` into a **single operation**.  

---

### **Exercise 3: Creating a Summary Table of Gender-Based Application Patterns**  
**Objective:** Summarize visa applications based on **gender** and **permanent_stay_intent**.  

**Tasks:**  
1. Use **pivot tables** to summarize the total visa applications by:  
   - **"gender"** (rows)  
   - **"permanent_stay_intent"** (columns)  
2. Compute **both absolute and percentage distributions** of applications.  
3. Sort the table based on the highest number of applications per gender.  

**Hint:** Use `.pivot_table()` and `.apply(lambda x: x / x.sum() * 100)`.  

---

### **Exercise 4: Applying Operations to Detect Application Trends**  
**Objective:** Use **cumulative sums and rolling averages** to analyze visa trends over time.  

**Tasks:**  
1. Compute the **cumulative total number of applications** per year.  
2. Calculate the **rolling mean (window = 3 years)** to smooth out fluctuations.  
3. Identify **years with the most significant increases or drops** in applications.  

**Hint:** Use `.cumsum()` and `.rolling().mean()`.  

---

### **Exercise 5: Using Chaining to Compare Missing Data Patterns**  
**Objective:** Detect missing values and summarize their distribution in the dataset.  

**Tasks:**  
1. Identify **which columns contain missing values** and compute the **percentage of missing data** per column.  
2. Filter out columns where **more than 5%** of data is missing.  
3. Create a new DataFrame that only contains **rows with at least one missing value** for further analysis.  

**Hint:** Use `.isna()`, `.sum()`, `.mean()`, `.query()`, and `.dropna()`.  

---

### **Exercise 6: Transforming Data by Transposing and Normalizing Applications**  
**Objective:** Normalize and transpose data for easier analysis.  

**Tasks:**  
1. Extract only the **yearly totals** of visa applications for each nationality.  
2. Normalize the values using **min-max scaling**:  
   $X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$
3. Transpose the resulting DataFrame so that years become columns and nationalities become rows.  

**Hint:** Use `.pivot_table()`, `MinMaxScaler` from `sklearn.preprocessing`, and `.T`.