# 🐼 Pandas Handbook

## 03 - Data Inspection

Check out the official [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)  

This notebook uses the [Ramen Ratings dataset](https://www.kaggle.com/datasets/residentmario/ramen-ratings/data) from Kaggle to demonstrate how to inspect the data with pandas.

## 📚 Table of Contents
---
 
📌 **Basic Dataset Overview**  
🔍 **Missing & Null Values**  
🧬 **Column-Level Inspection**  
🗂️ **Sorting & Ordering**  
📊 **Descriptive Statistics**  
🔗 **Combining Data Inspection Methods**  
👉 **Next Topic: Data Selection**  

---

In [1]:
import pandas as pd
import os

In [2]:
# Build the path to the raw CSV and use the review number as the row index.
data_raw = "../data/raw/"
csv_file = "ramen-ratings.csv"
import_path = os.path.join(data_raw, csv_file)
df = pd.read_csv(import_path, index_col='Review #')

### 📌 Basic Dataset Overview

`df.head()` – Displays the first 5 rows of the DataFrame.  
`df.tail()` – Displays the last 5 rows of the DataFrame.  
`df.shape` – Returns the number of rows and columns as a tuple.  
`df.columns` – Lists the column names in the DataFrame.  
`df.index` – Shows the index (row labels) of the DataFrame.  
`df.dtypes` – Displays the data type of each column.  
`df.info()` – Provides a concise summary of the DataFrame, including column types, non-null counts, and memory usage.  
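`df.memory_usage()` – Returns the memory usage of each column in bytes, useful for optimization. For `object` columns the default counts only the references, not the strings they point to; pass `deep=True` for the full figure.  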

In [3]:
df.head()

Review #,Brand,Variety,Style,Country,Stars,Top Ten
2580,New Touch,T's Restaurant Tantanmen,Cup,Japan,3.75,
2579,Just Way,Noodles Spicy Hot Sesame Spicy Hot Sesame Guan...,Pack,Taiwan,1.0,
2578,Nissin,Cup Noodles Chicken Vegetable,Cup,USA,2.25,
2577,Wei Lih,GGE Ramen Snack Tomato Flavor,Pack,Taiwan,2.75,
2576,Ching's Secret,Singapore Curry,Pack,India,3.75,


In [4]:
df.tail()

Review #,Brand,Variety,Style,Country,Stars,Top Ten
5,Vifon,"Hu Tiu Nam Vang [""Phnom Penh"" style] Asian Sty...",Bowl,Vietnam,3.5,
4,Wai Wai,Oriental Style Instant Noodles,Pack,Thailand,1.0,
3,Wai Wai,Tom Yum Shrimp,Pack,Thailand,2.0,
2,Wai Wai,Tom Yum Chili Flavor,Pack,Thailand,2.0,
1,Westbrae,Miso Ramen,Pack,USA,0.5,


In [5]:
df.shape

(2580, 6)

In [6]:
df.columns

Index(['Brand', 'Variety', 'Style', 'Country', 'Stars', 'Top Ten'], dtype='object')

In [7]:
df.index

Index([2580, 2579, 2578, 2577, 2576, 2575, 2574, 2573, 2572, 2571,
       ...
         10,    9,    8,    7,    6,    5,    4,    3,    2,    1],
      dtype='int64', name='Review #', length=2580)

In [8]:
df.dtypes

Brand      object
Variety    object
Style      object
Country    object
Stars      object
Top Ten    object
dtype: object

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2580 entries, 2580 to 1
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Brand    2580 non-null   object
 1   Variety  2580 non-null   object
 2   Style    2578 non-null   object
 3   Country  2580 non-null   object
 4   Stars    2580 non-null   object
 5   Top Ten  41 non-null     object
dtypes: object(6)
memory usage: 141.1+ KB


In [10]:
df.memory_usage()

Index      20640
Brand      20640
Variety    20640
Style      20640
Country    20640
Stars      20640
Top Ten    20640
dtype: int64
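
Every column above reports the same 20640 bytes because, for `object` columns, the default measurement only counts one reference per row. The following is a minimal sketch of how `deep=True` changes the picture. It uses a small made-up frame (`toy`) rather than the ramen data, and forces `object` dtype to match the dtypes shown earlier.

```python
import pandas as pd

# Hypothetical toy frame standing in for the ramen data (values are made up).
toy = pd.DataFrame(
    {
        "Brand": ["Nissin", "Wai Wai", "Vifon"],
        "Stars": ["3.75", "1", "Unrated"],
    },
    dtype=object,  # keep plain object columns, as in the notebook's df
)

# Default: counts only the per-row references held by each object column.
shallow = toy.memory_usage()

# deep=True: additionally measures the Python string objects themselves.
deep = toy.memory_usage(deep=True)

print(shallow)
print(deep)
```

The deep figure is usually the more honest one when deciding whether a text column is worth converting to a more compact dtype.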

### 🔍 Missing & Null Values

`df.isnull().sum()` – Shows the number of missing values (`NaN`) per column.  
`df.isna().sum()` – Gives the same result; `isnull()` is an alias of `isna()`.  
`df['COLUMN'].count()` – Counts non-null values in the specified `'COLUMN'`.  
`df.notnull()` – Returns a DataFrame of the same shape indicating whether each value is not null (`True`) or null (`False`).  

In [11]:
df.isnull().sum()

Brand         0
Variety       0
Style         2
Country       0
Stars         0
Top Ten    2539
dtype: int64

In [12]:
df.isna().sum()

Brand         0
Variety       0
Style         2
Country       0
Stars         0
Top Ten    2539
dtype: int64

In [13]:
df['Top Ten'].count()

np.int64(41)

In [14]:
df.notnull()

Review #,Brand,Variety,Style,Country,Stars,Top Ten
2580,True,True,True,True,True,False
2579,True,True,True,True,True,False
2578,True,True,True,True,True,False
2577,True,True,True,True,True,False
2576,True,True,True,True,True,False
...,...,...,...,...,...,...
5,True,True,True,True,True,False
4,True,True,True,True,True,False
3,True,True,True,True,True,False
2,True,True,True,True,True,False
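
Raw counts are easier to judge as a share of all rows, and a boolean mask like the one above can also be used to pull out the affected rows. This is a small sketch on a made-up frame (`toy`, with invented values), not on the ramen data itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps (values are made up).
toy = pd.DataFrame({
    "Style": ["Cup", "Pack", np.nan, "Bowl"],
    "Top Ten": [np.nan, np.nan, np.nan, "2016 #1"],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the share of missing entries per column.
missing_share = toy.isna().mean()

# Keep only the rows that have at least one missing value.
rows_with_gaps = toy[toy.isna().any(axis=1)]

print(missing_share)
print(rows_with_gaps)
```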


### 🧬 Column-Level Inspection

`df['COLUMN'].value_counts()` – Returns a count of unique values in the specified `'COLUMN'`, sorted in descending order.  
`(df['COLUMN'].str.strip() == '').sum()` – Returns a count of empty or whitespace-only strings in the specified `'COLUMN'`.  
`df['COLUMN'].apply(len)` – Applies the `len()` function to each value in the specified `'COLUMN'`; this raises a `TypeError` if the column contains `NaN`.  
`df['COLUMN'].unique()` – Lists unique values in the specified `'COLUMN'`.  
`df['COLUMN'].nunique()` – Returns the number of unique values in the specified `'COLUMN'`.  
`df['COLUMN'].sample()` – Returns one randomly chosen value from the specified `'COLUMN'`; pass `n` to draw more.  

In [15]:
df['Brand'].value_counts()

Brand
Nissin      381
Nongshim     98
Maruchan     76
Mama         71
Paldo        66
           ... 
Omachi        1
Haioreum      1
Sutah         1
Tung-I        1
Westbrae      1
Name: count, Length: 355, dtype: int64

In [16]:
(df['Brand'].str.strip() == '').sum()

np.int64(0)

In [17]:
df['Stars'].apply(len)

Review #
2580    4
2579    1
2578    4
2577    4
2576    4
       ..
5       3
4       1
3       1
2       1
1       3
Name: Stars, Length: 2580, dtype: int64
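
`apply(len)` works on `Stars` because that column has no missing values. On a column with gaps, such as `Style`, `len()` would be called on a float `NaN` and fail. A hedged sketch of the difference, using a made-up Series in place of the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a text column that contains a missing value.
style = pd.Series(["Cup", "Pack", np.nan], name="Style")

# len() is not defined for a float NaN, so apply(len) raises TypeError here.
try:
    style.apply(len)
except TypeError as exc:
    print(f"apply(len) failed: {exc}")

# The .str accessor skips missing values and propagates them as NaN instead.
lengths = style.str.len()
print(lengths)
```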

In [18]:
df['Style'].unique()

array(['Cup', 'Pack', 'Tray', 'Bowl', 'Box', 'Can', 'Bar', nan],
      dtype=object)

In [19]:
df['Style'].nunique()

7

In [20]:
df['Variety'].sample()

Review #
2233    Cup Noodles Milk Seafood Flavour
Name: Variety, dtype: object
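
Two optional parameters are often useful with the methods above: `value_counts()` can return shares instead of counts and can include missing values, and `sample()` accepts a seed so the draw is repeatable. The sketch below assumes a made-up Series rather than the real `Style` column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Style column (values are made up).
style = pd.Series(["Pack", "Pack", "Cup", np.nan], name="Style")

# normalize=True returns fractions; dropna=False keeps NaN as its own category.
shares = style.value_counts(normalize=True, dropna=False)

# n sets how many values to draw; random_state makes the draw reproducible.
picked = style.sample(n=2, random_state=0)

print(shares)
print(picked)
```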

### 🗂️ Sorting & Ordering

`df.sort_values('COLUMN')` – Sorts the DataFrame by values in the specified `'COLUMN'`.  
`df.sort_index()` – Sorts the DataFrame by its index values.  

In [21]:
df.sort_values('Country', ascending=True).head()

Review #,Brand,Variety,Style,Country,Stars,Top Ten
2301,Suimin,Noodle With Oriental Chicken Flavour,Cup,Australia,3.0,
987,Trident,Singapore Soft Noodles,Pack,Australia,2.75,
961,Trident,Chow Mein Soft Noodles,Pack,Australia,2.75,
1980,Suimin,Noodles Witrh Prawn & Chicken Flavour,Cup,Australia,3.5,
2349,Fantastic,Noodles Chicken & Corn Flavour,Cup,Australia,3.0,


In [22]:
df.sort_index(ascending=True).head()

Review #,Brand,Variety,Style,Country,Stars,Top Ten
1,Westbrae,Miso Ramen,Pack,USA,0.5,
2,Wai Wai,Tom Yum Chili Flavor,Pack,Thailand,2.0,
3,Wai Wai,Tom Yum Shrimp,Pack,Thailand,2.0,
4,Wai Wai,Oriental Style Instant Noodles,Pack,Thailand,1.0,
5,Vifon,"Hu Tiu Nam Vang [""Phnom Penh"" style] Asian Sty...",Bowl,Vietnam,3.5,
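
`sort_values()` also accepts a list of columns, with a separate direction for each. The following sketch sorts a made-up frame by country (A to Z) and then by rating (highest first); the frame and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame indexed like the ramen data (values are made up).
toy = pd.DataFrame(
    {
        "Country": ["USA", "Japan", "Japan", "USA"],
        "Stars": [2.25, 3.75, 5.0, np.nan],
    },
    index=pd.Index([4, 3, 2, 1], name="Review #"),
)

# Sort by Country ascending, then by Stars descending within each country.
# na_position="last" keeps rows with a missing sort key at the end.
ranked = toy.sort_values(
    ["Country", "Stars"],
    ascending=[True, False],
    na_position="last",
)

print(ranked)
```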


### 📊 Descriptive Statistics

`df.describe()` – Generates descriptive statistics. By default it summarizes numeric columns; when the DataFrame has none, as here, it reports `count`, `unique`, `top`, and `freq` for the object columns instead.  
`df.mean()` – Calculates the mean of the values in each column.  
`df.median()` – Calculates the median of the values in each column.  
`df.std()` – Computes the standard deviation of the values in each column.  
`df.quantile()` – Returns the requested quantile; the default `q=0.5` is the median.  
`df.sum()` – Returns the sum of the values in each column.  
`df.cumsum()` – Computes the cumulative sum of the values in each column.  
`df.var()` – Calculates the variance of the values in each column.  
`df.min()` – Returns the minimum value in each column.  
`df.max()` – Returns the maximum value in each column.  
`df.nlargest(n, 'COLUMN')` – Returns the `n` rows with the largest values in the specified `'COLUMN'`.  
`df.nsmallest(n, 'COLUMN')` – Returns the `n` rows with the smallest values in the specified `'COLUMN'`.

In [23]:
df.describe()

,Brand,Variety,Style,Country,Stars,Top Ten
count,2580,2580,2578,2580,2580,41
unique,355,2413,7,38,51,38
top,Nissin,Beef,Pack,Japan,4,\n
freq,381,7,1531,352,384,4
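
The table above shows `count`, `unique`, `top`, and `freq` because every column in `df` has the `object` dtype. The `include` parameter controls which columns `describe()` summarizes. A minimal sketch on a made-up frame with one text column and one numeric column:

```python
import pandas as pd

# Hypothetical toy frame with mixed dtypes (values are made up).
toy = pd.DataFrame({
    "Brand": pd.Series(["Nissin", "Nissin", "Vifon"], dtype=object),
    "Stars": [3.75, 2.25, 3.5],
})

# Default: only numeric columns (count, mean, std, min, quartiles, max).
numeric_summary = toy.describe()

# Only object columns (count, unique, top, freq).
text_summary = toy.describe(include="object")

# Every column; statistics that do not apply to a column show up as NaN.
full_summary = toy.describe(include="all")

print(numeric_summary)
print(text_summary)
print(full_summary)
```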


Before calculating statistics, we replace `'Unrated'` with `0` and convert the `Stars` column to `float`, because the column is stored as strings (`object` dtype). This conversion allows numeric methods like `mean()`, `median()`, and `std()` to work on the ratings data. Despite its name, `stars_df` is a Series, so each of these methods returns a single value.  

In [24]:
stars_df = df['Stars'].replace('Unrated', 0).astype(float)
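
Counting `'Unrated'` as `0` pulls the mean down, since an unrated product is treated as the worst possible score. One alternative, shown here only as a sketch and not as the approach this notebook takes, is to let `pd.to_numeric` turn unparseable entries into `NaN`, which the statistical methods then skip. The Series `raw` is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the raw Stars column (values are made up).
raw = pd.Series(["3.75", "1", "Unrated", "5"], name="Stars", dtype=object)

# The notebook's approach: treat 'Unrated' as a rating of 0.
as_zero = raw.replace("Unrated", 0).astype(float)

# Alternative: anything that is not a number becomes NaN.
as_missing = pd.to_numeric(raw, errors="coerce")

print(as_zero.mean())     # averages over all four rows
print(as_missing.mean())  # averages over the three rated rows only
```

Which option fits depends on what an unrated entry should mean for the analysis.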

In [25]:
stars_df.mean()

np.float64(3.6504263565891466)

In [26]:
stars_df.median()

np.float64(3.75)

In [27]:
stars_df.std()

np.float64(1.0223580523512048)

In [28]:
stars_df.quantile()

np.float64(3.75)
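
With no argument, `quantile()` uses `q=0.5`, which is why the result matches `median()`. Passing a list returns several quantiles at once. A short sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical ratings (values are made up).
stars = pd.Series([0.5, 2.0, 3.5, 3.75, 5.0], name="Stars")

# Default q=0.5, identical to the median.
median = stars.quantile()

# A list of q values returns a Series indexed by those values.
quartiles = stars.quantile([0.25, 0.5, 0.75])

print(median)
print(quartiles)
```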

In [29]:
stars_df.sum()

np.float64(9418.099999999999)

In [30]:
stars_df.cumsum()

Review #
2580       3.75
2579       4.75
2578       7.00
2577       9.75
2576      13.50
         ...   
5       9412.60
4       9413.60
3       9415.60
2       9417.60
1       9418.10
Name: Stars, Length: 2580, dtype: float64

In [31]:
stars_df.var()

np.float64(1.0452159872073488)

In [32]:
stars_df.min()

np.float64(0.0)

In [33]:
stars_df.max()

np.float64(5.0)

In [34]:
stars_df.nlargest(5)

Review #
2570    5.0
2569    5.0
2566    5.0
2563    5.0
2559    5.0
Name: Stars, dtype: float64

In [35]:
stars_df.nsmallest(5)

Review #
2548    0.0
2527    0.0
2503    0.0
2458    0.0
2426    0.0
Name: Stars, dtype: float64
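
The calls above run on a Series, so no column name is needed. On a DataFrame, the column is passed as the second argument, and the `keep` parameter decides what happens with ties. The sketch below uses a made-up frame with three tied top ratings.

```python
import pandas as pd

# Hypothetical toy frame with tied ratings (values are made up).
toy = pd.DataFrame(
    {"Brand": ["A", "B", "C", "D"], "Stars": [5.0, 4.0, 5.0, 5.0]},
    index=pd.Index([4, 3, 2, 1], name="Review #"),
)

# Default keep="first": ties are resolved by order of appearance.
top_two = toy.nlargest(2, "Stars")

# keep="all": every row tied with the last selected value is included,
# so the result can contain more than n rows.
top_with_ties = toy.nlargest(2, "Stars", keep="all")

print(top_two)
print(top_with_ties)
```

Ties also explain the earlier output: many reviews share the top and bottom ratings, and the default returns only the first five in row order.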

### 🔗 Combining Data Inspection Methods

You can combine multiple inspection techniques to get deeper insights. For example, count the frequency of each star rating and then sort the results to understand the distribution of scores.

In [36]:
stars_val_count = stars_df.value_counts()
stars_val_count.sort_index(ascending=False).head()

Stars
5.00    386
4.75     64
4.50    135
4.30      4
4.25    143
Name: count, dtype: int64

In [37]:
stars_val_count.sort_values(ascending=False).head()

Stars
4.00    393
5.00    386
3.75    350
3.50    335
3.00    176
Name: count, dtype: int64
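
Another way to combine the methods from this notebook is to collect several per-column results into one overview table. The following is a sketch on a made-up frame; the name `overview` and its column labels are illustrative choices, not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up).
toy = pd.DataFrame({
    "Style": ["Cup", "Pack", np.nan, "Pack"],
    "Stars": [3.75, 1.0, 2.25, 1.0],
})

# Each method returns a Series indexed by column name,
# so the results line up row by row in a new DataFrame.
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
    "unique": toy.nunique(),
})

print(overview)
```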

### 👉 Next Topic: [Data Selection](./04-data-selection.ipynb)

Learn how to select data with pandas.