# Essential DataFrame Operations: Selecting and Organizing Columns
_This section covers the fundamentals of selecting multiple DataFrame columns, using different selection methods, and organizing column names for better readability._

---

## Contents
1. **Introduction**  
   - Overview of DataFrame column selection  
   - Importance of organized data  

2. **Key Concepts**  
   - Selecting multiple DataFrame Columns  
   - Selecting columns with Methods  
   - Ordering Column Names  

3. **Practical Exercises**  
   Hands-on exercises on selecting, filtering, renaming, and reordering columns.

---

## Datasets Used  
- [Volumen de solicitudes de visa colombiana recibidas desde 2017](https://www.datos.gov.co/Estad-sticas-Nacionales/Volumen-de-solicitudes-de-visa-colombiana-recibida/mgr2-njqc/about_data)  
### About Dataset

#### Context  
This dataset provides information about the number of visa applications received by the Colombian Visa and Immigration Authority. It is aligned with the definition established by Resolution 5477 of 2022, detailing the nationality, gender, date of birth of the applicant, and the type of visa requested, classified by the purpose of stay. It aims to track trends in visa applications over the years.

#### Content  
##### Features of the dataset:
- **Año Solicitud**: The year in which the visa application was submitted.  
- **Nacionalidad**: The nationality of the applicant as per the passport.  
- **Sexo**: The gender of the applicant as per the passport.  
- **Fecha de Nacimiento**: The applicant's date of birth as per the passport.  
- **Vocación de Permanencia**: The type of visa requested, classified by the purpose of stay (e.g., temporary, permanent).  
- **Número**: The total number of visa applications received.

#### Acknowledgements  
Special thanks to the Ministry of Foreign Affairs of Colombia for providing this dataset.

#### Inspiration  
This dataset can be used for analyzing trends in immigration and visa applications, identifying demographic patterns, and understanding migration flows into Colombia over the years.

#### Source  
The data was provided by the Ministry of Foreign Affairs of Colombia and can be accessed through the Colombian Open Data portal.

--- 

## Author
**Author Name:** Juan Alejandro Carrillo Jaimes  

**Contact:** [jalejandrocjaimes@gmail.com](mailto:jalejandrocjaimes@gmail.com) - [Linkedin-AlejoCJaimes31](https://www.linkedin.com/in/alejocjaimes31/)  

**Purpose:** This content was created as an educational resource for university students.


# 1. Introduction
The **goal** of this chapter is to walk you through several fundamental DataFrame operations, such as selecting columns, ordering them, and handling null values, among others.

## Overview of DataFrame column selection
![image.png](attachment:image.png)

Selecting columns from a pandas DataFrame is one of the core operations when analyzing data. It allows you to extract only the data you're interested in, which is crucial for making the data more manageable and focusing on the relevant information. You can select columns by their names, their index positions, or even using more advanced techniques like regular expressions.
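As a rough illustration of the three approaches just mentioned, here is a minimal sketch on a small made-up DataFrame. The `toy` frame and its column names are assumptions for illustration only, not the visa dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three selection styles.
toy = pd.DataFrame({
    "year": [2017, 2018],
    "nationality": ["CUBANA", "FRANCESA"],
    "number": [2, 1],
})

by_name = toy[["year", "number"]]      # select by column labels
by_position = toy.iloc[:, [0, 2]]      # select by integer positions
by_regex = toy.filter(regex=r"^n")     # select labels that start with "n"

print(by_name.columns.tolist())
print(by_position.columns.tolist())
print(by_regex.columns.tolist())
```

Each call returns a new DataFrame, so `toy` itself is left untouched.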

## Importance of organized data
![sorting-pandas-image](https://i.ytimg.com/vi/F0tQ1YG6BWQ/maxresdefault.jpg)

Organizing your data is vital for performing efficient analysis and ensuring that the insights you derive are meaningful. Proper organization includes arranging the columns in a logical order, renaming them for clarity, and ensuring that the data is consistent and properly formatted. This is essential for both the human user and the computational tasks that will be performed on the data.

# Dataset Important Information
You can download the dataset from [Volumen de solicitudes de visa colombiana recibidas desde 2017](https://www.datos.gov.co/Estad-sticas-Nacionales/Volumen-de-solicitudes-de-visa-colombiana-recibida/mgr2-njqc/about_data) and export it as CSV.

**Steps to download**

1. Go to [datos.gov.co](https://www.datos.gov.co/) and create an account. (The site's default language is Spanish; you can switch it to English.)
   
![image-2.png](attachment:image-2.png)

2. Copy and paste the URL of the dataset into your browser: https://www.datos.gov.co/Estad-sticas-Nacionales/Volumen-de-solicitudes-de-visa-colombiana-recibida/mgr2-njqc/about_data

![image.png](attachment:image.png)

3. Click on "Exportar" ("Export" if you switched the site to English), select the CSV format, and click on download.

![image-3.png](attachment:image-3.png)

4. Wait for the download to finish and rename the file to `Visa_Applications_Colombia_2017_20250217.csv`.

![image-4.png](attachment:image-4.png)

5. Save the file in the **datasets** folder at the root of the project, inside a folder called `visa-col-application-datagov-df`.

![image-5.png](attachment:image-5.png)

6. Import the dataset.

In [1]:
import pandas as pd # type: ignore
import numpy as np # type: ignore
import os

In [2]:
# absolute path
current_file_path = os.path.abspath('C2-Essential-Pandas-Techniques-for-DataFrames/C2-01-Selection-and-Organization/C2_1_Selecting_and_Organizing_Columns.ipynb')

# go up five path levels; the first '..' only strips the notebook file name
root_dir = os.path.abspath(os.path.join(current_file_path, '../../../../../'))

# dataset directory
dataset_dir = os.path.join(root_dir, 'datasets', 'visa-col-application-datagov-df')

for dirname, _, filenames in os.walk(dataset_dir):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        

c:\Users\study_2025\Documents\Github\Doc-UP-AlejandroJaimes\Pandas-for-Education-Learning-through-Hands-On-Examples\datasets\visa-col-application-datagov-df\Visa_Applications_Colombia_2017_20250217.csv


In [3]:
visa_applications = pd.read_csv(os.path.join(dataset_dir, 'Visa_Applications_Colombia_2017_20250217.csv'))

In [4]:
visa_applications.shape

(349583, 6)

In [5]:
visa_applications.head()

Unnamed: 0,Año Solicitud,Nacionalidad,Sexo,Fecha de Nacimiento,Vocación de permanencia,Número
0,2017,ECUATORIANA,FEMENINO,24/07/1897,Con vocación de permanencia,2
1,2017,FEDERACION DE RUSIA,FEMENINO,03/05/1919,Sin vocación de permanencia,2
2,2017,FRANCESA,FEMENINO,20/08/1919,Sin vocación de permanencia,1
3,2017,CUBANA,FEMENINO,03/02/1922,Sin vocación de permanencia,2
4,2017,ESTADOUNIDENSE,FEMENINO,17/11/1922,Sin vocación de permanencia,1


# 2. Key Concepts

## 2.1 Selecting multiple DataFrame columns

We can select a *single* column by passing the column name to the index operator of a DataFrame, or use `.loc` / `.iloc` to select several columns. This was covered in Chapter 1 (*C1-Series*), in the section on selecting a column.

1. Normalize the column names by renaming them as follows:

    **Año Solicitud** -> *year_application*

    **Nacionalidad** -> *nationality*

    **Sexo** -> *gender*

    **Fecha de Nacimiento** -> *birth_date*

    **Vocación de permanencia** -> *permanet_stay_intent*

    **Número** -> *number_of_application*


In [8]:
new_columns = {
    'Año Solicitud': 'year_application',
    'Nacionalidad': 'nationality',
    'Sexo': 'gender',
    'Fecha de Nacimiento': 'birth_date',
    'Vocación de permanencia': 'permanet_stay_intent',
    'Número': 'number_of_application',
}
visa_applications.rename(columns=new_columns, inplace=True)
visa_applications.columns.tolist()

['year_application',
 'nationality',
 'gender',
 'birth_date',
 'permanet_stay_intent',
 'number_of_application']
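The cell above renames in place. As a hedged alternative, the sketch below shows that `rename` can also return a new DataFrame and that `errors="raise"` surfaces a misspelled source name, which is otherwise ignored silently. The `toy` frame is a hypothetical stand-in for the real dataset.

```python
import pandas as pd

# Hypothetical two-column stand-in for the visa dataset.
toy = pd.DataFrame({"Año Solicitud": [2017], "Sexo": ["FEMENINO"]})
mapping = {"Año Solicitud": "year_application", "Sexo": "gender"}

# Without inplace=True, rename returns a new DataFrame and leaves toy as it was.
renamed = toy.rename(columns=mapping)

# By default a key that matches no column is ignored; errors="raise" reports it.
try:
    toy.rename(columns={"Sex": "gender"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(renamed.columns.tolist())
print(typo_detected)
```

Returning a new object makes it easier to compare the frame before and after the rename.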

2. Pass a list of the desired columns to the indexing operator.

    **Remember**
    
    **2.1** The index operation can return either a Series or a DataFrame. If we pass in a list with a *single item*, we get back a DataFrame. If we pass in just a *string with the column name*, we get back a Series.

    **2.2** We can also use `.loc` to pull out a column by name. Because this index operation requires a row selector first, we use a colon (`:`) to indicate a slice that selects all of the rows. This can also return either a *DataFrame* or a *Series*.

In [9]:
selected_columns = ['year_application', 'nationality', 'gender']
visa_app_columns = visa_applications[selected_columns]
print(f'Shape: {visa_applications.shape}')
visa_app_columns.head()

Shape: (349583, 6)


Unnamed: 0,year_application,nationality,gender
0,2017,ECUATORIANA,FEMENINO
1,2017,FEDERACION DE RUSIA,FEMENINO
2,2017,FRANCESA,FEMENINO
3,2017,CUBANA,FEMENINO
4,2017,ESTADOUNIDENSE,FEMENINO


In [10]:
# 2.1 Selecting columns by indexing and list
print(visa_applications[selected_columns].__class__)
print(visa_applications['year_application'].__class__)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [11]:
# 2.2 Selecting columns using loc
print(visa_applications.loc[:, ['year_application']].__class__)
print(visa_applications.loc[:, 'year_application'].__class__)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


#### Keep in mind

1. The DataFrame index operator is very flexible and capable of accepting a number of different objects.
2.  Usually, a single column is selected with a *string*, resulting in a **Series**. When a **DataFrame** is desired, put the column name in a *single-element* list.
3.  The `loc` attribute is very useful for selecting columns by label. It can return a Series or a DataFrame, as required.
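The same Series-versus-DataFrame behaviour carries over to position-based selection with `iloc` and to label slices with `loc`. The following is a minimal sketch on a made-up frame; the `toy` data is an assumption for illustration.

```python
import pandas as pd

# Hypothetical sample rows shaped like the renamed visa dataset.
toy = pd.DataFrame({
    "year_application": [2017, 2017],
    "nationality": ["CUBANA", "FRANCESA"],
    "gender": ["FEMENINO", "FEMENINO"],
    "birth_date": ["03/02/1922", "20/08/1919"],
})

# iloc slices by position; the end position is excluded.
first_two = toy.iloc[:, :2]

# loc slices by label; both endpoints are included.
label_slice = toy.loc[:, "nationality":"birth_date"]

# A single integer position returns a Series, just like a single label.
single = toy.iloc[:, 0]

print(first_two.columns.tolist())
print(label_slice.columns.tolist())
print(type(single).__name__)
```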

## 2.2 Selecting columns with Methods

Some DataFrame methods make column selection easier. Two useful ones are **.select_dtypes** and **.filter**.

### 2.2.1 `select_dtypes`

The `select_dtypes` method is used in pandas to **select columns** in a DataFrame based on their data type(s). It allows you to filter the columns that match a specific data type or a list of data types. This is particularly useful when you want to work with specific kinds of data, such as numeric, string, or datetime types, without manually inspecting each column.

#### Syntax:
```python
DataFrame.select_dtypes(include=None, exclude=None)
```

- **`include`**: This parameter specifies the data types to include. It can take a single data type or a list of data types (e.g., `'int64'`, `'float64'`, `'object'`, etc.).
- **`exclude`**: This parameter specifies the data types to exclude, which can also be a single or list of data types.
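To make the two parameters concrete, here is a small sketch that uses `include` and `exclude` on a made-up frame. The `toy` data and the `share` column are hypothetical.

```python
import pandas as pd

# Hypothetical frame with an integer, a text, and a float column.
toy = pd.DataFrame({
    "year_application": [2017, 2018],
    "nationality": ["CUBANA", "FRANCESA"],
    "share": [0.5, 0.25],
})

numeric = toy.select_dtypes(include="number")      # integers and floats
non_numeric = toy.select_dtypes(exclude="number")  # everything else
floats_only = toy.select_dtypes(include="floating")

print(numeric.columns.tolist())
print(non_numeric.columns.tolist())
print(floats_only.columns.tolist())
```

Using `exclude="number"` is a convenient way to get the text-like columns without naming their exact dtype.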

#### Accepted data types
| Data Type            | Equivalent Data Type                          | Description |
|----------------------|---------------------------------------------|-------------|
| `number`            | `np.number`, `'number'`                     | Selects both integers and floats regardless of size. |
| `float64`           | `np.float64`, `float`, `'float64'`, `'float'` | Selects only 64-bit floats. |
| `float16`, `float32`, `float128` | `np.float16`, `np.float32`, `np.float128`, `'float16'`, `'float32'`, `'float128'` | Selects exactly 16, 32, and 128-bit floats, respectively. |
| `floating`          | `np.floating`, `'floating'`                 | Selects all floats regardless of size. |
| `int`              | `np.int64`, `np.int_`, `'int64'`, `'int_'`, `'int'` | Selects only 64-bit integers. |
| `int8`, `int16`, `int32` | `np.int8`, `np.int16`, `np.int32`, `'int8'`, `'int16'`, `'int32'` | Selects exactly 8, 16, and 32-bit integers, respectively. |
| `integer`           | `np.integer`, `'integer'`                   | Selects all integers regardless of size. |
| `Int64`             | `'Int64'`                                   | Selects nullable integers; no NumPy equivalent. |
| `object`            | `'object'`, `'O'`                           | Selects all object data types. |
| `datetime64`        | `np.datetime64`, `'datetime64'`, `'datetime'` | All datetimes are 64-bit. |
| `timedelta64`       | `np.timedelta64`, `'timedelta64'`, `'timedelta'` | All timedeltas are 64-bit. |
| `category`          | `pd.Categorical`, `'category'`              | Unique to pandas; no NumPy equivalent. |

**Note:** Older references also list the aliases `np.float_`, `np.int0`, and `np.object`. These have been removed from recent NumPy releases, so they are omitted here.
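As one example of the table in use, the sketch below converts a text date column to a datetime dtype and then selects it with `include="datetime"`. The `toy` frame is hypothetical, and the day/month/year format is an assumption based on the sample rows shown earlier.

```python
import pandas as pd

# Hypothetical rows; dates are assumed to be day/month/year strings.
toy = pd.DataFrame({
    "birth_date": ["24/07/1897", "03/05/1919"],
    "number_of_application": [2, 2],
})

# Parse the text dates so the column gets a datetime dtype.
toy["birth_date"] = pd.to_datetime(toy["birth_date"], format="%d/%m/%Y")

date_cols = toy.select_dtypes(include="datetime")
print(date_cols.columns.tolist())
print(toy["birth_date"].dt.year.tolist())
```

Until such a conversion is done, a date column read from CSV is treated as text and is not matched by `'datetime'`.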



1. Shorten the column names: split each name on `_` and keep the first three letters of the first part.

In [12]:
visa_applications_copy = visa_applications.copy()

In [13]:
print(f'Current Columns: {visa_applications_copy.columns.tolist()}')
# short columns names
def shorten_cols(col):
    return (
        str(col)
        .lower()
        .replace(' ', '_')
        .split('_')[0][:3]
    )
visa_applications_copy.rename(columns=shorten_cols, inplace=True)
print(f'New Columns: {visa_applications_copy.columns.tolist()}')

Current Columns: ['year_application', 'nationality', 'gender', 'birth_date', 'permanet_stay_intent', 'number_of_application']
New Columns: ['yea', 'nat', 'gen', 'bir', 'per', 'num']
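One caveat with this kind of shortening is that two columns sharing a prefix end up with the same name. The sketch below reuses the same `shorten_cols` logic on a hypothetical set of columns (`year_of_birth` is invented for illustration) and checks for collisions with `Index.is_unique`.

```python
import pandas as pd

def shorten_cols(col):
    # Same logic as above: lowercase, first token before "_", first 3 letters.
    return str(col).lower().replace(" ", "_").split("_")[0][:3]

# Hypothetical columns; two of them share the prefix "year".
toy = pd.DataFrame(columns=["year_application", "year_of_birth", "nationality"])

short = toy.rename(columns=shorten_cols)
has_collision = not short.columns.is_unique

print(short.columns.tolist())
print(has_collision)
```

With the six columns of the visa dataset the shortened names happen to be unique, so the rename is safe there.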


2. Use the *.select_dtypes* method to select only the integer columns.

    If you would like to select all the numeric columns, you may pass the string `number` to the `include` parameter.

    ```python
    visa_applications_copy.select_dtypes(include="number")
    ```

In [14]:
visa_applications_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 349583 entries, 0 to 349582
Data columns (total 6 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   yea     349583 non-null  int64 
 1   nat     349583 non-null  object
 2   gen     349576 non-null  object
 3   bir     349583 non-null  object
 4   per     349583 non-null  object
 5   num     349583 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 16.0+ MB


In [15]:
visa_app_int_cols = visa_applications_copy.select_dtypes(include="int")
visa_app_int_cols.head()

Unnamed: 0,yea,num
0,2017,2
1,2017,2
2,2017,1
3,2017,2
4,2017,1


3. Use the *.select_dtypes* method to select *integer and string* columns (the string columns here are stored with the `object` dtype, as the `info()` output shows).

In [16]:
visa_app_dalltypes_cols = visa_applications_copy.select_dtypes(include=["int","object"])
visa_app_dalltypes_cols.head()

Unnamed: 0,yea,nat,gen,bir,per,num
0,2017,ECUATORIANA,FEMENINO,24/07/1897,Con vocación de permanencia,2
1,2017,FEDERACION DE RUSIA,FEMENINO,03/05/1919,Sin vocación de permanencia,2
2,2017,FRANCESA,FEMENINO,20/08/1919,Sin vocación de permanencia,1
3,2017,CUBANA,FEMENINO,03/02/1922,Sin vocación de permanencia,2
4,2017,ESTADOUNIDENSE,FEMENINO,17/11/1922,Sin vocación de permanencia,1


### 2.2.2 `filter`

The `filter` method selects a subset of **rows or columns** based on their labels. It matches column names or index labels only (exactly, by substring, or by regular expression) and never looks at the data values.

#### Syntax:
```python
DataFrame.filter(items=None, like=None, regex=None, axis=None)
```

- **`items`**: A list of column or row labels to include in the result.
- **`like`**: A string to match labels that contain this substring.
- **`regex`**: A regular expression to match labels that conform to the pattern.
- **`axis`**: The axis to filter on: rows (`axis=0`) or columns (`axis=1`). For a DataFrame, the default is the columns.
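The examples in this section all filter columns, so here is a brief sketch of the `axis` parameter filtering row labels instead. The `toy` frame and its nationality index are hypothetical.

```python
import pandas as pd

# Hypothetical frame indexed by nationality so that row labels are text.
toy = pd.DataFrame(
    {"number_of_application": [2, 1, 3]},
    index=["CUBANA", "FRANCESA", "COLOMBIANA"],
)

# axis=0 matches the substring against the index labels.
rows = toy.filter(like="ANA", axis=0)

# Without axis, the column labels are inspected, and none contains "ANA".
cols = toy.filter(like="ANA")

print(rows.index.tolist())
print(cols.shape)
```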

In [23]:
visa_applications_copy.columns = visa_applications.columns
visa_applications_copy.head()

Unnamed: 0,year_application,nationality,gender,birth_date,permanet_stay_intent,number_of_application
0,2017,ECUATORIANA,FEMENINO,24/07/1897,Con vocación de permanencia,2
1,2017,FEDERACION DE RUSIA,FEMENINO,03/05/1919,Sin vocación de permanencia,2
2,2017,FRANCESA,FEMENINO,20/08/1919,Sin vocación de permanencia,1
3,2017,CUBANA,FEMENINO,03/02/1922,Sin vocación de permanencia,2
4,2017,ESTADOUNIDENSE,FEMENINO,17/11/1922,Sin vocación de permanencia,1


1. Use the `like` parameter to search for all columns whose names contain `on`.

In [24]:
visa_applications_copy.filter(like='on').head()

Unnamed: 0,year_application,nationality,number_of_application
0,2017,ECUATORIANA,2
1,2017,FEDERACION DE RUSIA,2
2,2017,FRANCESA,1
3,2017,CUBANA,2
4,2017,ESTADOUNIDENSE,1


2. Use the `items` parameter to select the columns `year_application` and `nationality`.

In [25]:
cols_to_search = ['year_application', 'nationality']
visa_applications_copy.filter(items=cols_to_search).head()

Unnamed: 0,year_application,nationality
0,2017,ECUATORIANA
1,2017,FEDERACION DE RUSIA
2,2017,FRANCESA
3,2017,CUBANA
4,2017,ESTADOUNIDENSE


3. Use the `regex` parameter to search for all columns whose names contain an underscore.

In [27]:
# underscore pattern
pattern = r'_'
visa_applications_copy.filter(regex=pattern).head()

Unnamed: 0,year_application,birth_date,permanet_stay_intent,number_of_application
0,2017,24/07/1897,Con vocación de permanencia,2
1,2017,03/05/1919,Sin vocación de permanencia,2
2,2017,20/08/1919,Sin vocación de permanencia,1
3,2017,03/02/1922,Sin vocación de permanencia,2
4,2017,17/11/1922,Sin vocación de permanencia,1


4. Use the `regex` parameter to search for all columns whose names end with `on`.

In [31]:
pattern = r"\b\w*on\b"
visa_applications_copy.filter(regex=pattern).head()

Unnamed: 0,year_application,number_of_application
0,2017,2
1,2017,2
2,2017,1
3,2017,2
4,2017,1
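The word-boundary pattern used above can also be written with an end-of-string anchor. The sketch below, built on an empty frame that only carries the six column names, suggests that `on$` selects the same columns.

```python
import pandas as pd

# Empty frame that only carries the column labels used in this section.
cols = [
    "year_application", "nationality", "gender",
    "birth_date", "permanet_stay_intent", "number_of_application",
]
toy = pd.DataFrame(columns=cols)

word_boundary = toy.filter(regex=r"\b\w*on\b").columns.tolist()
anchored = toy.filter(regex=r"on$").columns.tolist()  # "$" anchors at the end

print(word_boundary)
print(anchored)
```

The anchored form is shorter and states the intent ("ends with") more directly.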


5. Use the `regex` parameter to search for all columns whose names contain `y`.

In [32]:
pattern = r"\b\w*y\w*\b"
visa_applications_copy.filter(regex=pattern).head()

Unnamed: 0,year_application,nationality,permanet_stay_intent
0,2017,ECUATORIANA,Con vocación de permanencia
1,2017,FEDERACION DE RUSIA,Sin vocación de permanencia
2,2017,FRANCESA,Sin vocación de permanencia
3,2017,CUBANA,Sin vocación de permanencia
4,2017,ESTADOUNIDENSE,Sin vocación de permanencia


#### Keep in mind

1. The `.select_dtypes` method accepts either a list or single data type in its **include** or **exclude** parameters and returns a DataFrame with columns of just those given data types.
2. The `.filter` method selects columns **by only inspecting the column names** and not the actual data values. The most common parameters are *items*, *like*, and *regex*, only one of which can be used at a time.
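The two points about `.filter` can be checked with a short sketch: combining two of the parameters raises a `TypeError`, and labels listed in `items` that do not exist are skipped rather than reported. The `toy` frame and the label `not_a_column` are hypothetical.

```python
import pandas as pd

# Hypothetical one-row frame.
toy = pd.DataFrame({"year_application": [2017], "nationality": ["CUBANA"]})

# items, like, and regex are mutually exclusive.
try:
    toy.filter(like="on", regex=r"_")
    raised = False
except TypeError:
    raised = True

# Labels in items that are not present are dropped without an error.
missing = toy.filter(items=["nationality", "not_a_column"])

print(raised)
print(missing.columns.tolist())
```

Because missing labels pass silently, `df[[...]]` may be preferable when a typo should fail loudly.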

## 2.3 Ordering column names

One of the most important things is to organize our columns in a readable way for the analysis we are going to do. *Matt Harrison and Theodore Petrou* recommend in their book **Pandas 1.x Cookbook** the following sorting criteria:
- Classify each column as either categorical or continuous.
- Group common columns within the categorical and continuous columns.
- Place the most important groups of columns first with categorical columns before continuous ones.
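A first pass at these criteria can be automated by dtype, as in the hedged sketch below. It treats every numeric column as continuous, which is only a starting assumption: a column such as a year may be better handled as categorical after manual review. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical rows shaped like the renamed visa dataset.
toy = pd.DataFrame({
    "year_application": [2017, 2017],
    "nationality": ["CUBANA", "FRANCESA"],
    "gender": ["FEMENINO", "FEMENINO"],
    "number_of_application": [2, 1],
})

# Assumption: numeric dtype means continuous, anything else means categorical.
continuous = toy.select_dtypes(include="number").columns.tolist()
categorical = toy.select_dtypes(exclude="number").columns.tolist()

# Categorical columns first, then continuous ones.
reordered = toy[categorical + continuous]
print(reordered.columns.tolist())
```

The manual grouping shown in the steps that follow is still needed to put related columns next to each other.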

In [68]:
visa_applications_copy2 = visa_applications.copy()

1. Output all the column names and scan for similar categorical and continuous columns.

In [69]:
visa_applications_copy2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 349583 entries, 0 to 349582
Data columns (total 6 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   year_application       349583 non-null  int64 
 1   nationality            349583 non-null  object
 2   gender                 349576 non-null  object
 3   birth_date             349583 non-null  object
 4   permanet_stay_intent   349583 non-null  object
 5   number_of_application  349583 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 16.0+ MB


2. Classify columns as Categorical or Continuous

    #### **Categorical Columns** (Qualitative, represent groups or categories)
    1. `nationality` → Different nationalities (categorical, **object** type).
    2. `gender` → Male, Female, or other categories (**object** type).
    3. `birth_date` → Although it is an object type, it represents a date (could be treated separately).
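    4. `permanet_stay_intent` → Whether the visa is requested with or without intent of permanent stay (`Con vocación de permanencia` / `Sin vocación de permanencia`, **object** type).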

    #### **Continuous Columns** (Quantitative, represent numerical values)
    1. `year_application` → Year (numeric, **int64**).
    2. `number_of_application` → Count of applications (numeric, **int64**).



In [70]:
cat_demographics = [
    'nationality',
    'gender',
    'birth_date'
]

cat_app_intent = ['permanet_stay_intent']

cont_app_details = [
    'year_application',
    'number_of_application'
]

3. According to the given criteria, categorical columns should come first, and within each type, the most important groups should be prioritized.
   
   **Important**
   *Ensure that this list contains all the columns from the original.* You can test it with
   ```python
   set(ds.columns) == set(new_col_order)
   ```

In [71]:
new_col_order = (cat_demographics + cat_app_intent + cont_app_details)

In [72]:
set(visa_applications_copy2.columns) == set(new_col_order)

True
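One limitation of the set comparison is that it ignores duplicates, so a column listed twice in the new order still passes. The sketch below uses plain Python lists (the names are illustrative) to show a stricter check based on sorted lists.

```python
# Hypothetical column lists; "gender" is accidentally listed twice.
original_cols = ["year_application", "nationality", "gender"]
bad_order = ["nationality", "gender", "gender", "year_application"]

# Sets collapse duplicates, so this check passes even though the list is wrong.
set_check = set(original_cols) == set(bad_order)

# Comparing sorted lists also compares how often each name appears.
strict_check = sorted(original_cols) == sorted(bad_order)

print(set_check)
print(strict_check)
```

Selecting with a list that contains a duplicate would produce a DataFrame with a repeated column, so the stricter check is a cheap safeguard.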

In [73]:
visa_applications_copy2 = visa_applications_copy2[new_col_order]

In [74]:
visa_applications_copy2.head()

Unnamed: 0,nationality,gender,birth_date,permanet_stay_intent,year_application,number_of_application
0,ECUATORIANA,FEMENINO,24/07/1897,Con vocación de permanencia,2017,2
1,FEDERACION DE RUSIA,FEMENINO,03/05/1919,Sin vocación de permanencia,2017,2
2,FRANCESA,FEMENINO,20/08/1919,Sin vocación de permanencia,2017,1
3,CUBANA,FEMENINO,03/02/1922,Sin vocación de permanencia,2017,2
4,ESTADOUNIDENSE,FEMENINO,17/11/1922,Sin vocación de permanencia,2017,1


In [77]:
print('Original Columns order: ', visa_applications.columns.tolist())
print('New Columns order: ', visa_applications_copy2.columns.tolist())

Original Columns order:  ['year_application', 'nationality', 'gender', 'birth_date', 'permanet_stay_intent', 'number_of_application']
New Columns order:  ['nationality', 'gender', 'birth_date', 'permanet_stay_intent', 'year_application', 'number_of_application']


# 3. Exercises

---

## **Exercise Set: Selecting and Ordering DataFrame Columns**
### **Dataset: Volumen de solicitudes de visa colombiana recibidas desde 2017**  
📌 *Make sure you have downloaded the dataset before running the exercises.*  

---

- Normalize the column names by renaming them as follows:

   **Año Solicitud** -> *year_application*

   **Nacionalidad** -> *nationality*
   
   **Sexo** -> *sex*
   
   **Fecha de Nacimiento** -> *birth_date*
   
   **Vocación de permanencia** -> *permanent_stay_intent*
   
   **Número** -> *number_of_application*


#### **Exercise 1: Selecting Specific Columns**  
**Objective:** Learn how to select individual and multiple columns from a DataFrame.  

##### **Task:**  
1. Select and print only the `nationality` column.  
2. Select the `year_application` and `number_of_application` columns.  
3. Extract the first 10 rows of the `sex` column.  
4. Retrieve all visa applications where `permanent_stay_intent` is `"Con vocación de permanencia"`.  

💡 **Hint:** Use `df['column_name']` and `df[['col1', 'col2']]`.  

---

#### **Exercise 2: Selecting Columns with Methods**  
**Objective:** Use DataFrame methods to retrieve column information.  

##### **Task:**  
1. Print the list of all column names in the dataset.  
2. Retrieve only the numerical columns from the dataset.  
3. Identify the total number of columns.  

💡 **Hint:** Use `.columns`, `.select_dtypes()`, and `len()`.  

---

#### **Exercise 3: Reordering Columns**  
**Objective:** Change the order of DataFrame columns for better readability.  

##### **Task:**  
1. Reorder the dataset so that `number_of_application` appears first, followed by the rest of the columns.  
2. Move the `birth_date` column to be the last column.  
3. Swap the positions of `sex` and `nationality`.  

💡 **Hint:** Use `df = df[['col1', 'col2', ..., 'coln']]`.  

---

#### **Exercise 4: Renaming Columns for Clarity**  
**Objective:** Modify column names to improve readability.  

##### **Task:**  
1. Convert all column names to lowercase (if not already).  
2. Rename:  
   - `permanent_stay_intent` → `visa_type`  
   - `number_of_application` → `total_applications`  
3. Replace underscores with spaces in column names.  

💡 **Hint:** Use `.rename()` and `df.columns.str.replace()`.  

---

#### **Exercise 5: Filtering Based on Column Values**  
**Objective:** Extract specific rows based on column criteria.  

##### **Task:**  
1. Retrieve all applications where the `nationality` is `"VENEZOLANA"`.  
2. Select all visa applications submitted after 2020.  
3. Extract records where the `number_of_application` is greater than 1000.  

💡 **Hint:** Use conditional filtering with `df[df['column'] condition]`.  

---

#### **Exercise 6: Dropping Unnecessary Columns**  
**Objective:** Remove columns that may not be relevant for analysis.  

##### **Task:**  
1. Drop the `birth_date` column.  
2. Remove columns that contain only categorical data.  
3. Delete any column that contains more than 50% missing values.  

💡 **Hint:** Use `df.drop()` and `df.select_dtypes()`.  

---
