# Essential DataFrame Operations: Advanced Operations and Case Study
_This section explores handling missing values, transposing DataFrames, and applying these concepts in a practical exercise._

---

## Contents
1. **Introduction**  
   - Challenges in real-world data processing  
   - Importance of handling missing values and data transformations  

2. **Key Concepts**  
   - Comparing Missing Values  
   - Transposing the Direction of a DataFrame Operation  
3. **Special Exercise: University Admissions Analysis**
   - Use the key Pandas operations learned in Chapter 2 to analyze the admissions dataset and answer important questions about the data.

---
## Datasets Used
- [zeeshier/student-admission-records](https://www.kaggle.com/datasets/zeeshier/student-admission-records)  

### About Dataset  

#### Context  
This dataset is crafted for beginners to practice data cleaning and preprocessing techniques in machine learning. It contains **157 rows** of student admission records, including duplicate rows, missing values, and some data inconsistencies (e.g., outliers, unrealistic values). It’s ideal for practicing common data preparation steps before applying machine learning algorithms.  

The dataset simulates a **university admission record system**, where each student’s admission profile includes test scores, high school percentages, and admission status. The data contains realistic flaws often encountered in raw datasets, offering hands-on experience in data wrangling.  

#### Content  
##### Features of the dataset:  
- **Name**: Student's first name (Pakistani names).  
- **Age**: Age of the student (some outliers and missing values).  
- **Gender**: Gender (Male/Female).  
- **Admission Test Score**: Score obtained in the admission test (includes outliers and missing values).  
- **High School Percentage**: Student's high school final score percentage (includes outliers and missing values).  
- **City**: City of residence in Pakistan.  
- **Admission Status**: Whether the student was accepted or rejected.  

#### Acknowledgements  
Special thanks to [Zeeshier](https://www.kaggle.com/zeeshier) for providing this dataset for educational purposes.  

#### Inspiration  
This dataset is ideal for practicing data cleaning, handling missing values, detecting duplicates, and identifying outliers—key preprocessing steps before applying machine learning models.  

#### Source  
The dataset is publicly available on Kaggle:  
[Student Admission Records Dataset](https://www.kaggle.com/datasets/zeeshier/student-admission-records)  

---
## Author
**Author Name:** Juan Alejandro Carrillo Jaimes  

**Contact:** [jalejandrocjaimes@gmail.com](mailto:jalejandrocjaimes@gmail.com) - [Linkedin-AlejoCJaimes31](https://www.linkedin.com/in/alejocjaimes31/)  

**Purpose:** This content was created as an educational resource for university students.


# 1. Introduction
In real-world data analysis, datasets often contain missing or incomplete values that can impact insights and decision-making. Handling these missing values correctly is crucial for maintaining data integrity. Additionally, data transformation techniques like transposing DataFrames allow us to reshape data for better analysis and visualization.

## Challenges in real-world data processing  

<p align="center">
  <img src="https://cdn.botpenguin.com/assets/website/Preprocessing_0359d7faa0.png" width="500" height="500"/>
</p>

Real-world datasets are rarely perfect; they often contain missing, inconsistent, or inaccurate data. Missing values can occur due to human error, system failures, or incomplete data collection. Ignoring these gaps in data can lead to misleading analyses and incorrect conclusions.  

To address this, we need strategies for identifying, handling, and replacing missing values. Pandas provides several methods, such as `.isna()`, `.fillna()`, and `.dropna()`, to manage missing data efficiently.
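
As a minimal sketch of these three methods, the example below uses a small, hypothetical DataFrame (the name `toy` and its values are assumptions for illustration, not part of the admissions dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the admissions dataset.
toy = pd.DataFrame({
    "name": ["Ali", "Sara", None],
    "score": [80.0, np.nan, 65.0],
})

# .isna() builds a Boolean mask of missing cells; .sum() counts them per column.
missing_per_column = toy.isna().sum()

# .fillna() with a dict replaces missing values only in the listed columns.
filled = toy.fillna({"score": toy["score"].median()})

# .dropna() removes every row that contains at least one missing value.
complete_rows = toy.dropna()

print(missing_per_column)
print(filled)
print(complete_rows)
```

Filling with the median is only one possible choice; which replacement is appropriate depends on the column and on the analysis.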

## Importance of handling missing values and data transformations  

<p align="center">
  <img src="https://fastercapital.com/i/Data-Transformation-Unlocking-Business-Insights--A-Guide-to-Effective-Data-Transformation--Handling-Missing-Data.webp" width="500" height="300"/>
</p>

Data transformation is an essential step in preprocessing because it allows for better structure and interpretation. One common transformation technique is **transposing** a DataFrame using `.T`, which flips rows and columns.  

Transposing is useful when the existing format does not align with the required analysis, making it easier to extract insights and visualize relationships within the data. It is particularly helpful when dealing with time-series data, correlation matrices, or datasets where attributes should be restructured for better readability.
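
The following sketch illustrates `.T` on a small, hypothetical DataFrame (the labels `A`, `B`, `C` and the values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data: one row per student, one column per attribute.
scores = pd.DataFrame(
    {"test_score": [50, 99, 89], "hs_percentage": [68.9, 60.73, 85.29]},
    index=["A", "B", "C"],
)

# .T swaps the axes: row labels become column labels and vice versa.
flipped = scores.T

print(scores.shape)   # (3, 2)
print(flipped.shape)  # (2, 3)
print(flipped)
```

Note that transposing columns of different dtypes may force pandas to find a common dtype for each new column, so it tends to work best on homogeneous data.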


# Dataset Important Information

The step-by-step process for importing the dataset from Kaggle is described in [C1-Introduction-To-Pandas-and-DataFrame-Structure.ipynb](https://github.com/Doc-UP-AlejandroJaimes/Pandas-for-Education-Learning-through-Hands-On-Examples/blob/main/C1-Getting-Started-with-Pandas-Basics-and-Fundamentals/C1-Introduction/C1-Introduction-To-Pandas-and-DataFrame-Structure.ipynb).

In [2]:
from kaggle.api.kaggle_api_extended import KaggleApi # type: ignore
import pandas as pd
import numpy as np 
import os

# Authenticate with the Kaggle API
api = KaggleApi()
api.authenticate()

In [8]:
# Download the dataset
current_path = os.getcwd()
root_dir = os.path.abspath(os.path.join(current_path, '../../'))
dataset_path = os.path.join(root_dir, 'datasets', 'sudent-adm-records-kaggle-df')
api.dataset_download_files('zeeshier/student-admission-records', path=dataset_path, unzip=True)

Dataset URL: https://www.kaggle.com/datasets/zeeshier/student-admission-records


In [10]:
def check_files(full_path):
    for dirname, _, filenames in os.walk(full_path):
        for filename in filenames:
            print(os.path.join(dirname, filename))

check_files(dataset_path)

c:\Users\study_2025\Documents\Github\Doc-UP-AlejandroJaimes\Pandas-for-Education-Learning-through-Hands-On-Examples\datasets\sudent-adm-records-kaggle-df\student_admission_record_dirty.csv


In [37]:
filepath = os.path.join(dataset_path, 'student_admission_record_dirty.csv')
std_adm_dirty = pd.read_csv(filepath)

In [42]:
std_adm_dirty.head()

,name,age,gender,test_score,hs_percentage,city,admit_status
0,Shehroz,24.0,Female,50.0,68.9,Quetta,Rejected
1,Waqar,21.0,Female,99.0,60.73,Karachi,
2,Bushra,17.0,Male,89.0,,Islamabad,Accepted
3,Aliya,17.0,Male,55.0,85.29,Karachi,Rejected
4,Bilal,20.0,Male,65.0,61.13,Lahore,


1. Normalize all column names to lower case, replace white spaces with underscores, and shorten column names where necessary.

In [38]:
print(f'Current Columns: {std_adm_dirty.columns.tolist()}')

Current Columns: ['Name', 'Age', 'Gender', 'Admission Test Score', 'High School Percentage', 'City', 'Admission Status']


In [39]:
std_adm_dirty.columns = std_adm_dirty.columns.str.replace(' ','_').str.lower()
print(f'Normalized Columns: {std_adm_dirty.columns.tolist()}')

Normalized Columns: ['name', 'age', 'gender', 'admission_test_score', 'high_school_percentage', 'city', 'admission_status']


In [40]:
def shorten_cols(col):
    return (
        str(col)
        .replace('admission_test_score', 'test_score')
        .replace('high_school_percentage', 'hs_percentage')
        .replace('admission_status', 'admit_status')
    )
std_adm_dirty.rename(columns=shorten_cols, inplace=True)
print(f'Shortened Columns: {std_adm_dirty.columns.tolist()}')

Shortened Columns: ['name', 'age', 'gender', 'test_score', 'hs_percentage', 'city', 'admit_status']
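
As an alternative sketch, the same renaming can be expressed with an explicit dictionary passed to `.rename`. The empty DataFrame `toy` and the mapping name `short_names` below are assumptions introduced only for illustration.

```python
import pandas as pd

# Hypothetical empty frame that only carries column names.
toy = pd.DataFrame(columns=["Name", "Admission Test Score", "Admission Status"])

# Same normalization as above: spaces to underscores, then lower case.
toy.columns = toy.columns.str.replace(" ", "_").str.lower()

# An explicit old-name -> new-name mapping; unlisted columns are left unchanged.
short_names = {
    "admission_test_score": "test_score",
    "admission_status": "admit_status",
}
toy = toy.rename(columns=short_names)

print(toy.columns.tolist())  # ['name', 'test_score', 'admit_status']
```

A dictionary makes each rename visible at a glance, while a function such as `shorten_cols` is convenient when the same rule applies to many columns.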


2. Sort columns according to the following criteria:
- Classify each column as either categorical or continuous.
- Group common columns within the categorical and continuous columns.
- Place the most important groups of columns first with categorical columns before continuous ones.

In [41]:
std_adm_dirty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           147 non-null    object 
 1   age            147 non-null    float64
 2   gender         147 non-null    object 
 3   test_score     146 non-null    float64
 4   hs_percentage  146 non-null    float64
 5   city           147 non-null    object 
 6   admit_status   147 non-null    object 
dtypes: float64(3), object(4)
memory usage: 8.7+ KB


**Categorical Data**
1. name
2. gender
3. city
4. admit_status

**Continuous Data**

1. test_score
2. hs_percentage
3. age

In [45]:
# Change age from float to integer using the nullable Int64 dtype, which accepts NaN values.
std_adm_dirty['age'] = std_adm_dirty['age'].astype('Int64')
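
The sketch below, on a hypothetical Series named `ages`, illustrates why the capitalized `'Int64'` (pandas' nullable integer dtype) is used here instead of NumPy's `'int64'`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value.
ages = pd.Series([24.0, 21.0, np.nan])

# NumPy's int64 cannot hold a missing value, so this conversion raises.
try:
    ages.astype("int64")
    plain_int_worked = True
except ValueError:
    plain_int_worked = False

# The nullable Int64 dtype stores the missing entry as <NA>.
nullable_ages = ages.astype("Int64")

print(plain_int_worked)  # False
print(nullable_ages)
```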

In [47]:
cat_student = [
    'name',
    'gender',
    'city',
    'admit_status'
]

cont_student = [
    'test_score',
    'hs_percentage',
    'age'
]

new_cols_order = cat_student + cont_student
print(new_cols_order)

# Ensure that this list contains all the columns from the original.

print(f'Contains all the columns from the original?: {set(std_adm_dirty.columns) == set(new_cols_order)}')

['name', 'gender', 'city', 'admit_status', 'test_score', 'hs_percentage', 'age']
Contains all the columns from the original?: True


In [48]:
std_adm_dirty = std_adm_dirty[new_cols_order]
std_adm_dirty.head()

,name,gender,city,admit_status,test_score,hs_percentage,age
0,Shehroz,Female,Quetta,Rejected,50.0,68.9,24
1,Waqar,Female,Karachi,,99.0,60.73,21
2,Bushra,Male,Islamabad,Accepted,89.0,,17
3,Aliya,Male,Karachi,Rejected,55.0,85.29,17
4,Bilal,Male,Lahore,,65.0,61.13,20
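
The categorical and continuous lists above were written by hand. As a hedged alternative, `select_dtypes` can derive them from the column dtypes. The sketch assumes that every non-numeric column is categorical, which will not hold for every dataset; the frame `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with mixed dtypes.
toy = pd.DataFrame({
    "name": ["A", "B"],
    "age": [24, 21],
    "city": ["Quetta", "Karachi"],
    "test_score": [50.0, 99.0],
})

# Numeric columns are treated as continuous, everything else as categorical.
continuous_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Categorical columns first, then continuous ones.
reordered = toy[categorical_cols + continuous_cols]

print(reordered.columns.tolist())  # ['name', 'city', 'age', 'test_score']
```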


# 2. Key Concepts

## 2.1 Comparing Missing Values

Pandas uses the NumPy NaN object (`np.nan`) to represent a missing value. This is an unusual object with interesting mathematical properties. For instance, it is not equal to itself, whereas even Python's `None` object evaluates as `True` when compared to itself.

In [49]:
np.nan == np.nan

False

In [51]:
None == None

True

All other comparisons against `np.nan` also return `False`, except not equal (`!=`), which returns `True`.

In [52]:
np.nan > 5

False

In [53]:
5 > np.nan

False

In [54]:
5 != np.nan

True
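
Because comparison operators cannot detect `NaN`, a dedicated check is needed for scalar values. The sketch below shows three functions that exist for this purpose; the variable names are chosen only for illustration.

```python
import math

import numpy as np
import pandas as pd

value = np.nan

# Equality cannot detect NaN, because NaN is never equal to anything.
found_with_equals = value == np.nan

# Dedicated checks behave as expected.
found_with_math = math.isnan(value)
found_with_numpy = bool(np.isnan(value))
found_with_pandas = bool(pd.isna(value))

# pd.isna also treats None as missing; math.isnan and np.isnan
# raise a TypeError when given None.
none_is_missing = bool(pd.isna(None))

print(found_with_equals, found_with_math, found_with_numpy, found_with_pandas)
print(none_is_missing)
```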

### Equals Operator
Series and DataFrames use the equals operator, `==`, to make element-by-element comparisons. The result is an object with the same dimensions. The steps below show how to use the equals operator, which is different from the `.equals` method.

1. Compare each element to scalar value

In [56]:
std_adm_dirty == 24

,name,gender,city,admit_status,test_score,hs_percentage,age
0,False,False,False,False,False,False,True
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...
152,False,False,False,False,False,False,False
153,False,False,False,False,False,False,False
154,False,False,False,False,False,False,False
155,False,False,False,False,False,False,False


2. You can also use the equals operator to compare two DataFrames with one another on an element-by-element basis. Take, for instance, `std_adm_cat` compared against itself, as follows. Note that positions holding `NaN` evaluate to `False`.

In [57]:
std_adm_cat = std_adm_dirty[cat_student].copy()

In [58]:
std_adm_cat.head()

,name,gender,city,admit_status
0,Shehroz,Female,Quetta,Rejected
1,Waqar,Female,Karachi,
2,Bushra,Male,Islamabad,Accepted
3,Aliya,Male,Karachi,Rejected
4,Bilal,Male,Lahore,


In [59]:
std_adm_cat_self_compare = std_adm_cat == std_adm_cat
std_adm_cat_self_compare.head()

,name,gender,city,admit_status
0,True,True,True,True
1,True,True,True,False
2,True,True,True,True
3,True,True,True,True
4,True,True,True,False


3. Using the `.all` method to determine whether each column contains only `True` values yields an unexpected result.

In [60]:
std_adm_cat_self_compare.all()

name            False
gender          False
city            False
admit_status    False
dtype: bool

This happens because missing values do not compare as equal to one another. If you tried to count missing values by using the equals operator and summing the Boolean columns, you would get zero for each one.

In [61]:
(std_adm_dirty == np.nan).sum()

name             0
gender           0
city             0
admit_status     0
test_score       0
hs_percentage    0
age              0
dtype: Int64

Instead of using `==` to find missing values, use the `.isna` method.

In [62]:
missing_values = std_adm_dirty.isna().sum()
missing_values

name             10
gender           10
city             10
admit_status     10
test_score       11
hs_percentage    11
age              10
dtype: int64

4. The correct way to compare two entire DataFrames with one another is not with the equals operator (`==`) but with the `.equals` method. This method treats `NaN` values that are in the same locations as equal.

In [64]:
std_adm_cat.equals(std_adm_cat)

True
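
The contrast between the two approaches can be sketched on small, hypothetical DataFrames (all names and values below are illustrative assumptions). The example also shows that `.equals` is strict about dtypes, so two frames with the same values but different dtypes are not considered equal.

```python
import numpy as np
import pandas as pd

# Two hypothetical frames with identical content, including a NaN.
left = pd.DataFrame({"age": [24.0, np.nan]})
right = pd.DataFrame({"age": [24.0, np.nan]})

# Element-wise comparison fails at the NaN position.
elementwise_all_equal = bool((left == right).all().all())

# .equals treats NaN values in the same location as equal.
same_with_equals = left.equals(right)

# .equals also requires matching dtypes: int64 vs float64 is not equal.
as_int = pd.DataFrame({"age": [24]})
as_float = pd.DataFrame({"age": [24.0]})
same_values_diff_dtype = as_int.equals(as_float)

print(elementwise_all_equal)   # False
print(same_with_equals)        # True
print(same_values_diff_dtype)  # False
```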

5. Inside the `pandas.testing` sub-package, a function exists that developers should use when creating unit tests. The `assert_frame_equal` function raises an `AssertionError` if two DataFrames are not equal. It returns `None` if the two DataFrames are equal. 

In [70]:
from pandas.testing import assert_frame_equal
assert_frame_equal(std_adm_cat,std_adm_cat) is None

True
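
As a hedged sketch of how this function might be used in a test, the example below compares two hypothetical frames whose values match but whose dtypes differ. By default the dtype mismatch raises an `AssertionError`; the `check_dtype=False` argument relaxes that check.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical expected and actual results of some computation.
expected = pd.DataFrame({"age": [24, 21]})          # int64
actual_float = pd.DataFrame({"age": [24.0, 21.0]})  # float64

# The default comparison is strict and fails on the dtype difference.
try:
    assert_frame_equal(expected, actual_float)
    strict_check_passed = True
except AssertionError:
    strict_check_passed = False

# With check_dtype=False only the values are compared, so this passes.
relaxed_result = assert_frame_equal(expected, actual_float, check_dtype=False)

print(strict_check_passed)  # False
print(relaxed_result)       # None
```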

## 2.2 Transposing the Direction of a DataFrame Operation

The `axis` parameter controls the direction in which an operation takes place. It can be `index` (or `0`) or `columns` (or `1`). Nearly all DataFrame methods default the `axis` parameter to `0`, which applies the operation along the index, producing one result per column. This section shows how to invoke the same method along both axes.
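
A minimal sketch of the two directions, using `count` on a hypothetical three-row DataFrame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "test_score": [50.0, np.nan, 89.0],
    "hs_percentage": [68.9, 60.73, np.nan],
})

# Default axis=0 ("index"): move down the rows, one result per column.
per_column = toy.count()

# axis=1 ("columns"): move across the columns, one result per row.
per_row = toy.count(axis="columns")

print(per_column)
print(per_row)
```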

In [71]:
std_adm_dirty.head()

,name,gender,city,admit_status,test_score,hs_percentage,age
0,Shehroz,Female,Quetta,Rejected,50.0,68.9,24
1,Waqar,Female,Karachi,,99.0,60.73,21
2,Bushra,Male,Islamabad,Accepted,89.0,,17
3,Aliya,Male,Karachi,Rejected,55.0,85.29,17
4,Bilal,Male,Lahore,,65.0,61.13,20


1. Group the dataset by `city`.

In [101]:
grouped_std_adm = std_adm_dirty.groupby('city')

2. Counting the grouped rows produces a DataFrame of homogeneous integer data, on which operations can sensibly be done both vertically and horizontally. The `count` method returns the number of non-missing values; here it reports them per column for each city. For a plain DataFrame, its `axis` parameter defaults to `0`.

In [102]:
grouped_std_adm.count()

city,name,gender,admit_status,test_score,hs_percentage,age
Islamabad,14,16,16,16,14,14
Karachi,25,27,26,25,26,28
Lahore,17,16,16,16,15,15
Multan,21,21,21,20,21,20
Peshawar,17,18,18,18,16,18
Quetta,29,25,29,28,30,27
Rawalpindi,15,15,14,14,14,15


3. Summing the counts with `axis=1` (or `axis="columns"`) changes the direction of the operation, so that we get back the total number of non-missing items in each row, that is, for each city.

In [105]:
grouped_counts = grouped_std_adm.count()
grouped_counts.sum(axis=1).head()

city
Islamabad     90
Karachi      157
Lahore        95
Multan       124
Peshawar     105
dtype: Int64

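4. The axis can also be passed by name: `axis="columns"` is equivalent to `axis=1` and returns the same row totals, while `axis="index"` is equivalent to `axis=0` and operates down each column, as the median of the counts shows.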

In [106]:
grouped_counts.sum(axis="columns").head()

city
Islamabad     90
Karachi      157
Lahore        95
Multan       124
Peshawar     105
dtype: Int64

In [107]:
grouped_counts.median(axis="index")

name             17.0
gender           18.0
admit_status     18.0
test_score       18.0
hs_percentage    16.0
age              18.0
dtype: Float64

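5. The `.cumsum` method accumulates values along an axis. Called with the default `axis=0`, as here, it keeps a running total of the counts down each column, city by city; passing `axis=1` would instead accumulate across each row. It gives a slightly different view of the data.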

In [111]:
grouped_std_adm_cumsum = grouped_counts.cumsum()
grouped_std_adm_cumsum.head()

city,name,gender,admit_status,test_score,hs_percentage,age
Islamabad,14,16,16,16,14,14
Karachi,39,43,42,41,40,42
Lahore,56,59,58,57,55,57
Multan,77,80,79,77,76,77
Peshawar,94,98,97,95,92,95
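
To make the difference between the two directions concrete, the sketch below applies `.cumsum` along both axes to a small frame. The frame `counts` is built by hand for illustration, reusing the first two rows and columns of the counts shown earlier.

```python
import pandas as pd

# Hand-built excerpt: two cities, two columns of non-missing counts.
counts = pd.DataFrame(
    {"name": [14, 25], "gender": [16, 27]},
    index=["Islamabad", "Karachi"],
)

# Default axis=0: running total down each column (city after city).
down_rows = counts.cumsum()

# axis=1: running total across each row (column after column).
across_columns = counts.cumsum(axis=1)

print(down_rows)
print(across_columns)
```

With `axis=0` the last row holds the column totals; with `axis=1` the last column holds the row totals.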


# 3. Special Exercise: University Admissions Analysis

### **Dataset**: `Student Admission Records`  
**Description:** Contains information about the university admission process, including test scores, high school percentages, city of origin, and admission status.  

---

### **Objective**  
Use the key Pandas operations learned in Chapter 2 to analyze the admissions dataset and answer important questions about the data.

---

### **Questions and Tasks**  

#### **1. Data Exploration**  
1. Load the dataset into a Pandas `DataFrame`.  
2. Display the first 5 rows of the DataFrame.  
3. What are the dataset's columns?  

#### **2. Selecting and Manipulating Columns**  
4. Select the columns `admission_status`, `high_school_percentage`, and `admission_test_score`.  
5. Rename the selected columns to `status`, `hs_score`, and `test_score`, respectively.  

#### **3. Sorting and Summarizing Data**  
6. Sort the DataFrame by `hs_score` in descending order and display the top 5 rows.  
7. Compute the average of `hs_score` and `test_score`.  
8. Count how many students were admitted and how many were rejected.  

#### **4. Handling Missing Values and Advanced Operations**  
9. Identify if there are any missing values in the dataset.  
10. If `hs_score` or `test_score` contain missing values, replace them with the median of each column.  
11. Create a new column called `final_score`, which is the average of `hs_score` and `test_score`.  
12. Normalize the values in `final_score` to a range of 0 to 1.  

#### **5. Data Analysis and Visualization**  
13. Which city has the most admitted students?  
14. How does the average admission test score (`test_score`) differ between admitted and rejected students?  
15. Transpose the DataFrame so that columns become rows and vice versa.  

---

### **Hint:**  
You can use methods like `.loc[]`, `.rename()`, `.sort_values()`, `.groupby()`, `.fillna()`, `.apply()`, `.transpose()`, and others covered in this chapter.  
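
As an illustration of how a few of these methods combine, the sketch below works on hypothetical toy data rather than the admissions dataset. It assumes min-max scaling, `(x - min) / (max - min)`, as the normalization; the column name `final_norm` is introduced only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the admissions dataset.
toy = pd.DataFrame({
    "hs_score": [60.0, np.nan, 80.0],
    "test_score": [50.0, 70.0, 90.0],
})

# Replace missing values with the median of each column.
toy = toy.fillna(toy.median())

# Row-wise average of the two scores.
toy["final_score"] = toy[["hs_score", "test_score"]].mean(axis=1)

# Min-max scaling maps the smallest value to 0 and the largest to 1.
lowest = toy["final_score"].min()
highest = toy["final_score"].max()
toy["final_norm"] = (toy["final_score"] - lowest) / (highest - lowest)

print(toy)
```

This formula divides by zero when all values are identical, so a real solution may need to handle that case.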

---

This exercise challenges students to apply all the concepts learned in the chapter to a real-world scenario. 🚀