# Introduction

This Python 3 notebook explores the VMware inventory exported by [RVTools](https://www.robware.net/) and produces statistical summaries of the data to assess the scale, complexity, and other characteristics of the environment.

The analysis is aimed at assessing the feasibility of migrating these VMware virtual machines to Red Hat OpenShift Virtualization.

# Initial Setup

This section configures the notebook and imports the required packages.

### Required Packages

This notebook uses `pandas` for the analysis and `matplotlib` for the charts; some later cells also import `numpy`.

In [None]:
# Analyze the VM inventory generated by rvtools
import pandas as pd
import os
from pathlib import Path
import matplotlib.pyplot as plt

### Configuration

Set up a few options and other configuration parameters.

In [None]:
# Set the display option to prevent line wrapping
pd.set_option('display.max_colwidth', None)
pd.set_option('display.expand_frame_repr', False)

### Set up the data directory
First, set up the directory where the RVTools Excel data files will be stored. This folder must also contain an index.xlsx file (refer to the provided template for the expected format). The index.xlsx file should list the valid RVTools Excel file names along with their corresponding vCenter instances. This script assumes that there is one Excel file per vCenter instance.
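
As a rough illustration of what the index might contain, the sketch below builds a small in-memory table with the two columns that later cells read, `vCenter` and `In Scope`. The instance names and flag values are made up, and the real template may carry additional columns.

```python
import pandas as pd

# Hypothetical index contents; the real index.xlsx may have more columns.
index_example = pd.DataFrame({
    "vCenter": ["vcenter-a", "vcenter-b", "vcenter-c"],
    "In Scope": ["Yes", "No", "Y"],
})

# Map the Yes/No flags to booleans; anything unrecognized counts as out of scope.
flag_map = {"yes": True, "y": True, "no": False, "n": False}
index_example["In Scope"] = (
    index_example["In Scope"]
    .str.strip()
    .str.lower()
    .map(lambda value: flag_map.get(value, False))
)

# Keep only the instances flagged as in scope.
in_scope = index_example.loc[index_example["In Scope"], "vCenter"].tolist()
print(in_scope)
```

Treating unrecognized flags as out of scope is a design choice of this sketch; a stricter variant could raise an error instead.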

In [None]:
# Use the notebook's working directory as the base directory
current_dir = "."

# Specify the directory containing the Excel files
DATA_DIR = os.path.join(current_dir, '../data')

In [None]:
# Constants and configuration parameters

# The index file
INDEX_FILENAME = "index.xlsx"
INDEX_FILEPATH = os.path.join(DATA_DIR, INDEX_FILENAME)
INDEX_SHEETNAME = "index"

### Functions

This section defines helper functions used by the analysis code. `read_rvtools_excel_files` reads the Excel `.xlsx` files exported from RVTools and returns a dictionary with filenames as keys and a dictionary of DataFrames (one per sheet) as values.

* The `index.xlsx` and `index_template.xlsx` files are explicitly ignored.

Parameters:
* directory (str): The directory containing the Excel files.
* filenames_to_process (list): A list of filenames to process.

Returns:
* dict: A dictionary containing the DataFrames from each Excel file.
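
The file-selection rule described above can be sketched without touching the disk. The helper name `select_rvtools_files` and the sample filenames are hypothetical; the sketch only shows which names would pass the filter, assuming the list of wanted names holds base names without extensions.

```python
import os

EXCLUDED = {"index.xlsx", "index_template.xlsx"}

def select_rvtools_files(filenames, wanted_basenames):
    """Return the Excel files whose base name is wanted and that are not index files."""
    selected = []
    for name in filenames:
        base, ext = os.path.splitext(name)
        if name.lower() in EXCLUDED:
            continue  # never treat the index files as RVTools exports
        if ext.lower() not in (".xlsx", ".xls"):
            continue  # skip anything that is not an Excel workbook
        if base in wanted_basenames:
            selected.append(name)
    return selected

files = ["vc1.xlsx", "vc2.xlsx", "index.xlsx", "notes.txt"]
print(select_rvtools_files(files, {"vc1", "index"}))
```

Here `index.xlsx` is rejected even though `index` appears in the wanted set, because the exclusion check runs first.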

In [None]:
def read_rvtools_excel_files(directory, filenames_to_process):
    rvtools_data = {}

    for filename in os.listdir(directory):
        # Only process files listed in the index file.
        filename_base, _ = os.path.splitext(filename)
        if filename_base not in filenames_to_process: continue
        if filename in ['index.xlsx', 'index_template.xlsx']: continue

        if filename.endswith('.xlsx') or filename.endswith('.xls'):
            filepath = os.path.join(directory, filename)
            excel_data = pd.read_excel(filepath, sheet_name=None)
            rvtools_data[filename] = excel_data

    return rvtools_data

# Read, Clean and Filter the RVTools data

This section reads the `RVTools` exported files. It uses an `index.xlsx` metadata file to identify the
in-scope `vCenter` instances for the migration.

### Read the index metadata file

First, read the index file to determine which RVTools files to process. The file also carries additional metadata,
including an `In Scope` flag for each `vCenter` instance, which can be edited to process only specific instances.

In [None]:
# First read the index Excel file
import pandas as pd

# Read the index Excel file
index_df = pd.read_excel(
    INDEX_FILEPATH, sheet_name=INDEX_SHEETNAME, 
    nrows=19, index_col='vCenter', 
    true_values=['Yes', 'Y'], false_values=['No', 'N'], 
    na_filter=False, dtype={'In Scope': bool}
)

# Clean up column names and handle missing values
index_df.columns = index_df.columns.str.replace(' ', '_')
index_df.fillna('', inplace=True)

# Filter for in-scope vCenters
inscope_df = index_df[index_df['In_Scope']].reset_index()

# Count occurrences of each vCenter
pivot_table = inscope_df.groupby('vCenter').size().reset_index(name='Count')

# Print in a structured table format
print("\nIn-Scope vCenter Instances:")
print(pivot_table.to_string(index=False))

### Read the RVTools Exported Spreadsheets

Read all the RVTools exported files available in the `data` directory. The data is read into a dictionary
with one entry per file, where the key is the filename and the value is **another** nested dictionary with the
sheet name as the key and a `DataFrame` containing that sheet's values.

The files **need** to be in the Microsoft `xlsx` format.

**Note**: This step will take a few minutes to complete. Be patient!
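
To make the nested structure concrete, here is a small sketch that uses made-up in-memory frames in place of real workbooks. The filenames and VM names are illustrative only.

```python
import pandas as pd

# Made-up stand-in for what reading two RVTools workbooks might produce.
rvtools_example = {
    "vc1.xlsx": {
        "vInfo": pd.DataFrame({"VM": ["app01", "db01"]}),
        "vHost": pd.DataFrame({"Host": ["esx01"]}),
    },
    "vc2.xlsx": {
        "vInfo": pd.DataFrame({"VM": ["web01"]}),
    },
}

# Outer key: filename; inner key: sheet name; value: DataFrame.
vm_counts = {
    filename: len(sheets["vInfo"])
    for filename, sheets in rvtools_example.items()
    if "vInfo" in sheets
}
print(vm_counts)
```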

In [None]:
import pandas as pd
import os

# Read the index Excel file
index_df = pd.read_excel(
    INDEX_FILEPATH, sheet_name=INDEX_SHEETNAME, 
    nrows=19, index_col='vCenter', 
    true_values=['Yes', 'Y'], false_values=['No', 'N'], 
    na_filter=False, dtype={'In Scope': bool}
)

# Clean up column names and handle missing values
index_df.columns = index_df.columns.str.replace(' ', '_')
index_df.fillna('', inplace=True)

# Ensure 'In_Scope' column is boolean
index_df['In_Scope'] = index_df['In_Scope'].astype(bool)

# Filter for in-scope vCenters
inscope_df = index_df[index_df['In_Scope']].reset_index()

# Extract list of in-scope vCenter instances
inscope_vcenter_instances = inscope_df['vCenter'].tolist()

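# Function to read RVTools Excel files while excluding 'index_template.xlsx' and 'index.xlsx'.
# Note: `vcenters` is accepted but not used for filtering, so every other .xlsx file in the directory is read.
def read_rvtools_excel_files(directory, vcenters):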
    rvtools_data = {}
    exclude_files = {"index_template.xlsx", "index.xlsx"}  # Set of filenames to exclude

    for file in os.listdir(directory):
        if file.endswith(".xlsx") and file.lower() not in exclude_files:
            file_path = os.path.join(directory, file)
            try:
                xls = pd.ExcelFile(file_path)
                sheets = {sheet: xls.parse(sheet) for sheet in xls.sheet_names}
                rvtools_data[file] = sheets
            except Exception as e:
                print(f"Error reading {file}: {e}")
    
    return rvtools_data

# Read the RVTools Excel files from the configured data directory
rvtools_data = read_rvtools_excel_files(DATA_DIR, inscope_vcenter_instances)

# Display the loaded data (for demonstration purposes)
for filename, sheets in rvtools_data.items():
    print(f'Processed RVTools File: {filename}')
    for sheet_name, df in sheets.items():
        print(f"  Sheet: {sheet_name} Shape: {df.shape}")

print(f'Total files processed: {len(rvtools_data)}')

### Create the consolidated data frames

Create two consolidated dataframes containing information from all `vCenter` instances:

1. One dataframe containing the `vInfo` details
2. The second one containing the `vHost` details

These will be used later to summarize the information.
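
The consolidation step amounts to tagging each sheet with its source and stacking the results. A minimal sketch with made-up data, using `assign` so the per-file frames are not modified in place (the notebook cell itself adds the column directly to each sheet):

```python
import pandas as pd

# Made-up per-vCenter vInfo sheets.
sheets_by_vcenter = {
    "vc1": pd.DataFrame({"VM": ["app01", "db01"], "CPUs": [2, 8]}),
    "vc2": pd.DataFrame({"VM": ["web01"], "CPUs": [4]}),
}

# assign() returns a copy, so the original frames are left untouched.
tagged = [df.assign(vCenter=name) for name, df in sheets_by_vcenter.items()]

# Stack the tagged frames into one table with a fresh 0..n-1 index.
combined = pd.concat(tagged, ignore_index=True)
print(combined)
```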

In [None]:
import pandas as pd

# Initialize the lists
vinfo_sheets = []
vhost_sheets = []

# Load vInfo and vHost sheets into the lists
for filename, sheets in rvtools_data.items():
    vCenter = filename.split('.')[0].lower()  # Extract vCenter instance name from the filename

    # Ensure 'vInfo' and 'vHost' sheets exist before processing
    if 'vInfo' in sheets and 'vHost' in sheets:
        # Add the vCenter instance name to each sheet
        sheets['vInfo']['vCenter'] = vCenter
        sheets['vHost']['vCenter'] = vCenter

        # Append to the respective lists
        vinfo_sheets.append(sheets['vInfo'])
        vhost_sheets.append(sheets['vHost'])

# Filter out empty or all-NA DataFrames
vinfo_sheets = [df for df in vinfo_sheets if not df.dropna(how='all').empty]
vhost_sheets = [df for df in vhost_sheets if not df.dropna(how='all').empty]

# Concatenate the filtered DataFrames only if there are valid sheets
consolidated_vinfo_df = pd.concat(vinfo_sheets, ignore_index=True) if vinfo_sheets else pd.DataFrame()
consolidated_vhost_df = pd.concat(vhost_sheets, ignore_index=True) if vhost_sheets else pd.DataFrame()

# Verify data ingestion success with improved readability
print("\n✅ Data Ingestion Verification ✅")
print(f"Total vInfo (Total VMs) records: {len(consolidated_vinfo_df)}")
print(f"Total vHost (Total HW) records: {len(consolidated_vhost_df)}\n")

if not consolidated_vinfo_df.empty:
    print("\n🔹 Sample vInfo Data:")
    print(consolidated_vinfo_df.head(2).to_string(index=False), "\n")

### Distribution of VMs per In-Scope vCenter

What share of the total VM count does each vCenter hold?
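
The same percentages that the pie chart displays can also be computed as numbers. A minimal sketch on made-up data, using `value_counts(normalize=True)`:

```python
import pandas as pd

# Made-up VM list with its source vCenter.
vms = pd.DataFrame({
    "VM": ["a", "b", "c", "d"],
    "vCenter": ["vc1", "vc1", "vc1", "vc2"],
})

# normalize=True returns fractions of the total; scale to percent.
share = vms["vCenter"].value_counts(normalize=True).mul(100).round(1)
print(share)
```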

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Group by vCenter and count VMs
grouped_summary = consolidated_vinfo_df.groupby("vCenter").size().reset_index(name="Count")

# Create a pivot table for better visualization
grouped_summary_pivot = grouped_summary.pivot_table(values="Count", index="vCenter", aggfunc="sum")

# Print summary
total_vms = grouped_summary["Count"].sum()
total_vcenters = grouped_summary.shape[0]

print(f"Overall Distribution of {total_vms:,} VMs in the {total_vcenters:,} vCenter instances:\n")
print(grouped_summary_pivot)
print("\nNote: This is the TOTAL VM count, and will include VM templates, SRM placeholders, Orphaned objects...")

# Plot a pie chart
plt.figure(figsize=(8, 8))
plt.pie(
    grouped_summary["Count"],
    labels=grouped_summary["vCenter"],
    autopct="%1.1f%%",
    startangle=140,
)
plt.title("\nDistribution of VMs by vCenter Instances")
plt.show()

### Clean the VM List
Filter the consolidated `vInfo` content using the following criteria:

1. Remove all `template` entries
2. Remove all `SRM Placeholder` entries
3. Remove all `orphaned VM object` entries
4. Remove all `other objects`

Adjust `ignore_patterns` and `os_filter_patterns` to suit your needs.
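
The tagging approach can be sketched on a handful of made-up rows: every VM starts with an empty exclusion reason, each rule overwrites it where it applies, and only rows that still have an empty reason remain in scope. The column names follow the RVTools `vInfo` sheet as used in this notebook; the data is invented.

```python
import pandas as pd

vms = pd.DataFrame({
    "VM": ["app01", "tmpl-rhel9", "CTX-gw01", "old-db"],
    "Template": [False, True, False, False],
    "Connection state": ["connected", "connected", "connected", "orphaned"],
})

ignore_patterns = ["virtual appliance", "CTX"]  # illustrative subset

# Start with no reason, then let each rule overwrite where it matches.
vms["Exclusion Reason"] = ""
name_hit = vms["VM"].str.contains("|".join(ignore_patterns), case=False, na=False)
vms.loc[name_hit, "Exclusion Reason"] = "Ignored VM Pattern"
vms.loc[vms["Template"], "Exclusion Reason"] = "Template"
vms.loc[vms["Connection state"].str.lower() == "orphaned", "Exclusion Reason"] = "Orphaned VM"

# Rows without a reason are in scope.
in_scope = vms[vms["Exclusion Reason"] == ""]
print(vms[["VM", "Exclusion Reason"]])
```

Because later rules overwrite earlier ones, a VM that matches several rules is reported under the last rule that applies.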

In [None]:
import re
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display

# Define patterns to ignore in the VM column
ignore_patterns = ["virtual_appliance", "virtual appliance", "CTX"]

# Define OS types to filter out (matched against the VMware Tools OS, falling back to the configuration-file OS)
os_filter_patterns = [
    r"^Microsoft Windows 7(?:\s*\(.*\))?$",
    r"^Microsoft Windows Server 2003 Standard(?:\s*\(.*\))?$",
    r"^Microsoft Windows Server 2003(?:\s*\(.*\))?$",
    r"^Other$",
    r"^Other\s+\d+\.\d+.*$",
    r"^Other 3\.x or later Linux(?:\s*\(.*\))?$",
    r"^Other 3\.x Linux(?:\s*\(.*\))?$",
    r"^Other 4\.x or later Linux(?:\s*\(.*\))?$",
    r"^Other 5\.x or later Linux(?:\s*\(.*\))?$",
    r"^VMware Photon OS(?:\s*\(.*\))?$",
    r"^Microsoft Windows Server 2008(?:\s*\(.*\))?$",
    r"^Microsoft Windows Server 2008 R2(?:\s*\(.*\))?$",
    r"^Red Hat Enterprise Linux (4|5|6)(?:\s*\(.*\))?$",
    r"^Debian GNU/Linux (7|8|9|10|11)(?:\s*\(.*\))?$",
    r"^CentOS (4|5|6|7)(?:\s*\(.*\))?$",
    r"^CentOS 4/5/6/7(?:\s*\(.*\))?$",
    r"^CentOS 4/5(?:\s*\(.*\))?$",
    r"^CentOS 4/5/6(?:\s*\(.*\))?$",
    r"^CentOS Stream 8(?:\s*\(.*\))?$",
    r"^Ubuntu Linux(?:\s*\(.*\))?$",
    r"^Linux \d+\.\d+.*$",
    r"^Appgate(?:\s*\(.*\))?$",
    r"^Amazon Linux 2(?:\s*\(.*\))?$",
    r"^Other Linux(?:\s+.*)?$",
    r"^FreeBSD(?:\s+.*)?$",
    r"^RiOS(?:\s*\(.*\))?$",
    r"^Red Hat Fedora(?:\s*\(.*\))?$",
    r"^SuSE Linux Enterprise (11|12)(?:\s*\(.*\))?$"
]

def clean_os_name(os_name):
    """Normalize OS names by removing extra spaces and redundant information."""
    if pd.isna(os_name) or os_name.strip() == '':
        return ''
    os_name = os_name.strip()
    os_name = re.sub(r"\s*\(.*\)$", "", os_name)  # Remove (32-bit) / (64-bit)
    os_name = re.sub(r"\s+", " ", os_name)  # Remove extra spaces
    return os_name

def os_filter(os_name):
    """Check if the OS should be filtered based on the defined patterns."""
    if pd.isna(os_name) or os_name.strip() == '':
        return False
    os_name_clean = clean_os_name(os_name)
    return any(re.fullmatch(pattern, os_name_clean, re.IGNORECASE) for pattern in os_filter_patterns)

# Load or define consolidated_vinfo_df before processing
if 'consolidated_vinfo_df' not in globals():
    raise ValueError("consolidated_vinfo_df is not defined.")

# Ensure OS column is populated
consolidated_vinfo_df['OS Effective'] = consolidated_vinfo_df['OS according to the VMware Tools'].fillna(
    consolidated_vinfo_df['OS according to the configuration file'])

# Normalize OS names before filtering
consolidated_vinfo_df['Cleaned OS'] = consolidated_vinfo_df['OS Effective'].apply(clean_os_name)

# Apply filtering logic
consolidated_vinfo_df['Exclusion Reason'] = ''
consolidated_vinfo_df.loc[consolidated_vinfo_df['VM'].str.contains('|'.join(ignore_patterns), case=False, na=False), 'Exclusion Reason'] = 'Ignored VM Pattern'
consolidated_vinfo_df.loc[consolidated_vinfo_df['Cleaned OS'].apply(os_filter), 'Exclusion Reason'] = 'Excluded OS'
consolidated_vinfo_df.loc[consolidated_vinfo_df['Template'].fillna(False) == True, 'Exclusion Reason'] = 'Template'
consolidated_vinfo_df.loc[consolidated_vinfo_df['SRM Placeholder'].fillna(False) == True, 'Exclusion Reason'] = 'SRM Placeholder'
consolidated_vinfo_df.loc[consolidated_vinfo_df['Connection state'].fillna('').str.lower() == 'orphaned', 'Exclusion Reason'] = 'Orphaned VM'

# Only exclude powered-off VMs if their OS is in the exclusion list
consolidated_vinfo_df.loc[(consolidated_vinfo_df['Powerstate'].fillna('').str.lower() == 'poweredoff') & 
                          (consolidated_vinfo_df['Cleaned OS'].apply(os_filter)), 'Exclusion Reason'] = 'Powered Off'

# Filter only in-scope VMs
filtered_vinfo_df = consolidated_vinfo_df[consolidated_vinfo_df['Exclusion Reason'] == '']

# Count the number of VMs that were filtered out
ignored_vinfo_df = consolidated_vinfo_df[consolidated_vinfo_df['Exclusion Reason'] != '']
ignored_vm_artifacts = len(ignored_vinfo_df)  # Define ignored_vm_artifacts

# Display results
print("\n\U0001F50D Filtering Summary")
print(f"\U0001F539 Removed: {ignored_vm_artifacts:,} VM templates, SRM placeholders, orphaned, powered-off VMs, and excluded OS patterns.")
print(f"\U0001F539\U0001F539 Filtered (In-Scope) VM count: {len(filtered_vinfo_df):,}.\n")

# Display the list of "in-scope" VMs after filtering
if not filtered_vinfo_df.empty:
    print("\n✅ Sample of filtered (In-Scope) VMs:")
    display(filtered_vinfo_df[['VM', 'OS Effective']].head())

# Display the list of "ignored" VMs
if not ignored_vinfo_df.empty:
    print("\n❌ Sample of filtered (Out-of-Scope) VMs:")
    display(ignored_vinfo_df[['VM', 'OS Effective', 'Exclusion Reason']].head())

# Create pie chart
labels = ['Filtered-Out', 'In-Scope']
sizes = [ignored_vm_artifacts, len(filtered_vinfo_df)]
colors = ['lightcoral', 'lightgreen']
explode = (0.1, 0)  # explode the 'Filtered Out' section for emphasis

plt.figure(figsize=(7, 7))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors, explode=explode, shadow=True, startangle=140)
plt.title("\nVM Filtering Breakdown")

# Show plot
plt.show()

### Summary of vInfo & vHosts after cleanup

The data is now clean and ready for the analysis that follows.

In [None]:
import pandas as pd

# Select the in-scope vCenter instances
inscope_vinfo_condition = filtered_vinfo_df['vCenter'].isin(inscope_vcenter_instances)
inscope_vinfo_df = filtered_vinfo_df[inscope_vinfo_condition]

# Summary stats for in-scope VMs, as a share of all VMs read from the exports
inscope_vm_count = len(inscope_vinfo_df)
percent_inscope_vms = (inscope_vm_count / len(consolidated_vinfo_df)) * 100.0

print("\n✅ VM Summary ✅")
print(f"🔹 {inscope_vm_count:,} VMs are in-scope ({percent_inscope_vms:0.2f}% of total VMs).\n")

# Create a pivot table for VMs by vCenter
vm_pivot = inscope_vinfo_df.pivot_table(index="vCenter", aggfunc="size").reset_index()
vm_pivot.columns = ["vCenter", "VM Count"]
display(vm_pivot)

# Select the in-scope hosts
inscope_vhost_condition = consolidated_vhost_df['vCenter'].isin(inscope_vcenter_instances)
inscope_vhost_df = consolidated_vhost_df[inscope_vhost_condition]

# Summary stats for in-scope hosts
inscope_host_count = len(inscope_vhost_df)
percent_inscope_hosts = (inscope_host_count / len(consolidated_vhost_df)) * 100.0

print("\n✅ Host Summary ✅")
print(f"🔹 {inscope_host_count:,} hosts are in-scope ({percent_inscope_hosts:0.2f}% of total hosts).\n")

# Create a pivot table for hosts by vCenter
host_pivot = inscope_vhost_df.pivot_table(index="vCenter", aggfunc="size").reset_index()
host_pivot.columns = ["vCenter", "Host Count"]
display(host_pivot)

# Analysis

The primary analysis starts here.

### 1- Create a consolidated view of the client's landscape

Summary of VMs (vCPU, memory, etc.) and hosts (CPU, cores, etc.).
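
As an alternative way to build this kind of per-vCenter roll-up, the sketch below uses `groupby` with named aggregation on made-up rows. It assumes, as the rest of the notebook does, that the `Memory` column is reported in MiB; the output column names are hypothetical.

```python
import pandas as pd

vms = pd.DataFrame({
    "VM": ["app01", "db01", "web01"],
    "vCenter": ["vc1", "vc1", "vc2"],
    "CPUs": [2, 8, 4],
    "Memory": [4096, 16384, 8192],  # assumed to be MiB
})

# One row per vCenter, with a count of VMs and sums of CPU and memory.
summary = vms.groupby("vCenter").agg(
    vm_count=("VM", "count"),
    total_cpus=("CPUs", "sum"),
    total_memory_mib=("Memory", "sum"),
)

# Keep a numeric GiB column rather than a formatted string so it stays sortable.
summary["total_memory_gib"] = summary["total_memory_mib"] / 1024
print(summary)
```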

In [None]:
import pandas as pd

# === vInfo Pivot Table (VM-Level Info) ===
vinfo_pivot_df = filtered_vinfo_df.pivot_table(
    index='vCenter',
    values=['VM', 'CPUs', 'Memory', 'NICs', 'Provisioned MiB'],
    aggfunc={
        'VM': 'count',
        'CPUs': 'sum',
        'Memory': 'sum',
        'NICs': 'sum',
        'Provisioned MiB': 'sum'
    },
    margins=False
)

# === vHost Pivot Table (Host-Level Info) ===
vhost_pivot_df = consolidated_vhost_df.pivot_table(
    index='vCenter',
    values=['Host', '# VMs total', '# CPU', '# Cores'],
    aggfunc={
        'Host': 'count',
        '# VMs total': 'sum',
        '# CPU': 'sum',
        '# Cores': 'sum'
    },
    margins=False
)

# === Convert Memory & Disk from MiB to GiB/TiB ===
def format_storage(mib_value):
    if mib_value >= 1_048_576:  # 1 TiB = 1,048,576 MiB
        return f"{mib_value / 1_048_576:.2f} TiB"
    return f"{mib_value / 1024:.2f} GiB"  # Below 1 TiB, report in GiB

# Apply formatting for Memory and Total Disk Capacity
vinfo_pivot_df['Total Disk Capacity'] = vinfo_pivot_df['Provisioned MiB'].apply(format_storage)
vinfo_pivot_df['Memory'] = vinfo_pivot_df['Memory'].apply(format_storage)

# === Pretty Display for Each Table ===
try:
    from IPython.display import display  # Works for Jupyter Notebook

    print("\n✅ VM Info (vInfo) ✅")
    display(vinfo_pivot_df)

    print("\n✅ Host Info (vHost) ✅")
    display(vhost_pivot_df)

except ImportError:
    print("\n✅ VM Info (vInfo) ✅")
    print(vinfo_pivot_df)

    print("\n✅ Host Info (vHost) ✅")
    print(vhost_pivot_df)

### 2- Summarize the Operating Systems

In this section, we summarize the guest operating systems of the in-scope VMs.
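
Grouping by operating system only works well if equivalent names are normalized first. The sketch below mirrors the two regular-expression steps used in this notebook (dropping a trailing parenthesized suffix and collapsing repeated spaces) on made-up names; the helper name `normalize_os` is hypothetical.

```python
import re
from collections import Counter

def normalize_os(name):
    """Drop a trailing '(...)' suffix and collapse repeated whitespace."""
    name = re.sub(r"\s*\(.*\)$", "", name.strip())
    return re.sub(r"\s+", " ", name)

raw_names = [
    "Red Hat Enterprise Linux 8 (64-bit)",
    "Red Hat Enterprise  Linux 8",
    "Microsoft Windows Server 2019 (64-bit)",
]

# Count VMs per normalized OS name.
counts = Counter(normalize_os(name) for name in raw_names)
print(counts)
```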

In [None]:
import re
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display

# Define patterns to ignore in the VM column
ignore_patterns = ["virtual_appliance", "virtual appliance", "CTX"]

# Define OS types to filter out (matched against the VMware Tools OS, falling back to the configuration-file OS)
os_filter_patterns = [
    r"^Microsoft Windows 7(?:\s*\(.*\))?$",
    r"^Microsoft Windows Server 2003 Standard(?:\s*\(.*\))?$",
    r"^Microsoft Windows Server 2003(?:\s*\(.*\))?$",
    r"^Other$",
    r"^Other\s+\d+\.\d+.*$",
    r"^Other 3\.x or later Linux(?:\s*\(.*\))?$",
    r"^Other 3\.x Linux(?:\s*\(.*\))?$",
    r"^Other 4\.x or later Linux(?:\s*\(.*\))?$",
    r"^Other 5\.x or later Linux(?:\s*\(.*\))?$",
    r"^VMware Photon OS(?:\s*\(.*\))?$",
    r"^Microsoft Windows Server 2008(?:\s*\(.*\))?$",
    r"^Microsoft Windows Server 2008 R2(?:\s*\(.*\))?$",
    r"^Red Hat Enterprise Linux (4|5|6)(?:\s*\(.*\))?$",
    r"^Debian GNU/Linux (7|8|9|10|11)(?:\s*\(.*\))?$",
    r"^CentOS (4|5|6|7)(?:\s*\(.*\))?$",
    r"^CentOS 4/5/6/7(?:\s*\(.*\))?$",
    r"^CentOS 4/5(?:\s*\(.*\))?$",
    r"^CentOS 4/5/6(?:\s*\(.*\))?$",
    r"^CentOS Stream 8(?:\s*\(.*\))?$",
    r"^Ubuntu Linux(?:\s*\(.*\))?$",
    r"^Linux \d+\.\d+.*$",
    r"^Appgate(?:\s*\(.*\))?$",
    r"^Amazon Linux 2(?:\s*\(.*\))?$",
    r"^Other Linux(?:\s+.*)?$",
    r"^FreeBSD(?:\s+.*)?$",
    r"^RiOS(?:\s*\(.*\))?$",
    r"^Red Hat Fedora(?:\s*\(.*\))?$",
    r"^SuSE Linux Enterprise (11|12)(?:\s*\(.*\))?$"
]

def clean_os_name(os_name):
    """Normalize OS names by removing extra spaces and redundant information."""
    if pd.isna(os_name) or os_name.strip() == '':
        return ''
    os_name = os_name.strip()
    os_name = re.sub(r"\s*\(.*\)$", "", os_name)  # Remove (32-bit) / (64-bit)
    os_name = re.sub(r"\s+", " ", os_name)  # Remove extra spaces
    return os_name

def os_filter(os_name):
    """Check if the OS should be filtered based on the defined patterns."""
    if pd.isna(os_name) or os_name.strip() == '':
        return False
    os_name_clean = clean_os_name(os_name)
    return any(re.fullmatch(pattern, os_name_clean, re.IGNORECASE) for pattern in os_filter_patterns)

# Load or define consolidated_vinfo_df before processing
if 'consolidated_vinfo_df' not in globals():
    raise ValueError("consolidated_vinfo_df is not defined.")

# Ensure OS column is populated
consolidated_vinfo_df['OS Effective'] = consolidated_vinfo_df['OS according to the VMware Tools'].fillna(
    consolidated_vinfo_df['OS according to the configuration file'])

# Normalize OS names before filtering
consolidated_vinfo_df['Cleaned OS'] = consolidated_vinfo_df['OS Effective'].apply(clean_os_name)

# Apply filtering logic
consolidated_vinfo_df['Exclusion Reason'] = ''
consolidated_vinfo_df.loc[consolidated_vinfo_df['VM'].str.contains('|'.join(ignore_patterns), case=False, na=False), 'Exclusion Reason'] = 'Ignored VM Pattern'
consolidated_vinfo_df.loc[consolidated_vinfo_df['Cleaned OS'].apply(os_filter), 'Exclusion Reason'] = 'Excluded OS'
consolidated_vinfo_df.loc[consolidated_vinfo_df['Template'].fillna(False) == True, 'Exclusion Reason'] = 'Template'
consolidated_vinfo_df.loc[consolidated_vinfo_df['SRM Placeholder'].fillna(False) == True, 'Exclusion Reason'] = 'SRM Placeholder'
consolidated_vinfo_df.loc[consolidated_vinfo_df['Connection state'].fillna('').str.lower() == 'orphaned', 'Exclusion Reason'] = 'Orphaned VM'

# Only exclude powered-off VMs if their OS is in the exclusion list
consolidated_vinfo_df.loc[(consolidated_vinfo_df['Powerstate'].fillna('').str.lower() == 'poweredoff') & 
                          (consolidated_vinfo_df['Cleaned OS'].apply(os_filter)), 'Exclusion Reason'] = 'Powered Off'

# Filter only in-scope VMs
filtered_vinfo_df = consolidated_vinfo_df[consolidated_vinfo_df['Exclusion Reason'] == '']

# Create the pivot table
inscope_guest_os_pivot = filtered_vinfo_df.pivot_table(index='Cleaned OS', values='VM', aggfunc='count')
inscope_guest_os_pivot = inscope_guest_os_pivot.sort_values(by='VM', ascending=True)

# Check if pivot table is empty
if inscope_guest_os_pivot.empty:
    raise ValueError("The pivot table is empty after filtering. Check input data.")

# Calculate percentages
total_vms = inscope_guest_os_pivot['VM'].sum()
percentages = (inscope_guest_os_pivot['VM'] / total_vms * 100).round(1)

# Create figure and axis with better layout management
fig, ax = plt.subplots(figsize=(14, max(6, len(inscope_guest_os_pivot) * 0.4)), constrained_layout=True)

# Create horizontal bar chart
bars = ax.barh(inscope_guest_os_pivot.index, inscope_guest_os_pivot['VM'], color=plt.cm.tab10.colors)

# Add labels to bars
for bar, count, percentage in zip(bars, inscope_guest_os_pivot['VM'], percentages):
    ax.text(bar.get_width() + 2, bar.get_y() + bar.get_height()/2, 
            f"{count} VMs ({percentage}%)", va='center', fontsize=8, color='black')

# Style and labels
ax.set_xlabel("Number of VMs", fontsize=10)
ax.set_ylabel("Operating Systems", fontsize=10)
ax.set_title("Guest OS Distribution for In-Scope VMs", fontsize=12, pad=20)
ax.grid(axis='x', linestyle='--', alpha=0.7)

# Explicitly set y-ticks before modifying labels to avoid warnings
ax.set_yticks(range(len(inscope_guest_os_pivot)))
ax.set_yticklabels(inscope_guest_os_pivot.index, fontsize=9, rotation=5)

# Show plot
plt.show()

### 3- OSs with over 500 VMs Associated

In this section, we summarize the guest operating systems that have more than 500 VMs.

In [None]:
# Filter and display OSs with over 500 VMs
over_500_vms = inscope_guest_os_pivot[inscope_guest_os_pivot['VM'] > 500]
display(over_500_vms)

# Check if there are any OSs with more than 500 VMs
if over_500_vms.empty:
    print("No operating systems have more than 500 VMs.")
else:
    # Pie chart for OSs with over 500 VMs
    plt.figure(figsize=(10, 8))  # Slightly wider for better readability

    # Define a color palette
    colors = plt.cm.tab10.colors[:len(over_500_vms)]

    # Generate pie chart
    wedges, texts, autotexts = plt.pie(
        over_500_vms['VM'],
        labels=over_500_vms.index,
        autopct='%1.1f%%',
        startangle=140,
        colors=colors,
        wedgeprops={'edgecolor': 'black', 'linewidth': 1}  # Add borders for clarity
    )

    # Improve text readability
    for text in texts:
        text.set_fontsize(10)  # Adjust label size
    for autotext in autotexts:
        autotext.set_fontsize(10)  # Adjust percentage text size
        autotext.set_color('white')  # Improve visibility

    # Add title
    plt.title('Operating Systems with Over 500 VMs', fontsize=12, pad=20)

    # Adjust layout to prevent labels from overlapping
    plt.tight_layout()
    plt.show()

### 4- Group VMs by Memory Size

In this section, we group the VMs into memory-size tiers.
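
One alternative to a hand-written tier lookup is `pd.cut`. The sketch below is an illustration on made-up values, not the method used in the cell that follows; `right=False` makes each bin include its lower edge and exclude its upper edge, so a 4 GB VM lands in the `4-16 GB` tier.

```python
import pandas as pd

memory_gb = pd.Series([2, 4, 16, 48, 300])

edges = [0, 4, 16, 32, 64, 128, 256, float("inf")]
labels = ["0-4 GB", "4-16 GB", "16-32 GB", "32-64 GB",
          "64-128 GB", "128-256 GB", "256+ GB"]

# right=False gives half-open bins [lower, upper).
tiers = pd.cut(memory_gb, bins=edges, labels=labels, right=False)
print(tiers.tolist())
```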

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Define memory tiers (in GB) in ascending order
memory_tiers = {
    '0-4 GB': (0, 4),
    '4-16 GB': (4, 16),
    '16-32 GB': (16, 32),
    '32-64 GB': (32, 64),
    '64-128 GB': (64, 128),
    '128-256 GB': (128, 256),
    '256+ GB': (256, float('inf'))
}

# Function to categorize memory
def categorize_memory(memory_gb):
    for tier, (lower, upper) in memory_tiers.items():
        if lower <= memory_gb < upper:
            return tier
    return 'Unknown'

# The 'Memory' column is reported in MiB; convert it to GB (MiB / 1024)
filtered_vinfo_df = filtered_vinfo_df.copy()  # Explicit copy to avoid SettingWithCopyWarning
filtered_vinfo_df.loc[:, 'Memory GB'] = filtered_vinfo_df['Memory'] / 1024

# Apply categorization
filtered_vinfo_df.loc[:, 'Memory Tier'] = filtered_vinfo_df['Memory GB'].apply(categorize_memory)

# Create a pivot table summarizing VM counts per memory tier
memory_tier_summary = (
    filtered_vinfo_df.groupby('Memory Tier')
    .agg({'VM': 'count'})
    .rename(columns={'VM': 'VM Count'})
)

# Sort the memory tiers in ascending order based on the predefined tier order
memory_tier_summary = memory_tier_summary.reindex(memory_tiers.keys()).fillna(0)

# Display the results
try:
    from IPython.display import display  # Works for Jupyter Notebook
    print("\n✅ Memory Tier Summary (Sorted) ✅")
    display(memory_tier_summary)
except ImportError:
    print("\n✅ Memory Tier Summary (Sorted) ✅")
    print(memory_tier_summary)

# Generate pie chart with sorted order
plt.figure(figsize=(8, 8))
plt.pie(
    memory_tier_summary['VM Count'],
    labels=memory_tier_summary.index,
    autopct='%1.1f%%',
    startangle=140
)
plt.title("VM Distribution by Memory Tier (Sorted)")
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle
plt.show()

### 5- Group VMs by Disk Size

This section groups VMs into categories defined by allocated disk-size tiers.
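
Two details matter when tiering by disk size: the unit conversion and which edge of each bin is inclusive. The sketch below uses made-up sizes and shortened labels; it converts MiB to decimal TB and relies on the `pd.cut` default of right-inclusive bins, so a disk of exactly 1 TB falls into the first tier.

```python
import pandas as pd

# One MiB is 2**20 bytes; one decimal TB is 10**12 bytes.
MIB_TO_TB = 2**20 / 10**12

provisioned_mib = pd.Series([500_000, 1_000_000, 2_500_000])
disk_tb = provisioned_mib * MIB_TO_TB

bins = [0, 1, 2, 10]
labels = ["<=1 TB", "<=2 TB", "<=10 TB"]

# Default right=True gives bins (0, 1], (1, 2], (2, 10].
tiers = pd.cut(disk_tb, bins=bins, labels=labels)
edge_case = pd.cut(pd.Series([1.0]), bins=bins, labels=labels)

print(disk_tb.round(3).tolist())
print(tiers.tolist(), edge_case.tolist())
```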

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Create explicit copy to avoid SettingWithCopyWarning
filtered_vinfo_df = filtered_vinfo_df.copy()

# Convert disk size to TB and categorize into tiers
mib_to_tb_conversion_factor = 2**20 / 10**12
filtered_vinfo_df.loc[:, 'Disk Size TB'] = filtered_vinfo_df['Provisioned MiB'] * mib_to_tb_conversion_factor

# Identify VMs with missing or zero disk size
no_disk_size_vms_df = filtered_vinfo_df[(filtered_vinfo_df['Disk Size TB'].isna()) | (filtered_vinfo_df['Disk Size TB'] == 0)]
no_disk_size_vm_count = no_disk_size_vms_df.shape[0]

# Filter out VMs without disk size for further analysis
filtered_disk_df = filtered_vinfo_df[filtered_vinfo_df['Disk Size TB'] > 0].copy()

# Define disk size bins and labels (pd.cut bins are right-inclusive, so exactly 1 TB falls in the first tier)
disk_size_bins = [0, 1, 2, 10, 20, 50, 100, float('inf')]
disk_bin_labels = ['Tiny (<=1 TB)', 'Easy (<=2 TB)', 'Medium (<=10 TB)', 'Hard (<=20 TB)', 
                   'Very Hard (<=50 TB)', 'White Glove (<=100 TB)', 'Extreme (>100 TB)']

filtered_disk_df.loc[:, 'Disk Size Tiers'] = pd.cut(
    filtered_disk_df['Disk Size TB'],
    bins=disk_size_bins,
    labels=disk_bin_labels
).astype(str)

# Pivot table based on disk size tiers
disk_tier_pivot_df = filtered_disk_df.pivot_table(
    index='Disk Size Tiers', 
    values=['VM', 'Disk Size TB'], 
    aggfunc={'VM': 'count', 'Disk Size TB': 'sum'},
    observed=False
)

# Sort index to maintain ascending order
disk_tier_pivot_df = disk_tier_pivot_df.reindex(disk_bin_labels)

# Fill NaN values to avoid errors
disk_tier_pivot_df = disk_tier_pivot_df.fillna(0)

# Add total row with rounded 'Disk Size TB' to nearest 0.5 TB
total_row = pd.DataFrame({
    'VM': [disk_tier_pivot_df['VM'].sum()],
    'Disk Size TB': [round(disk_tier_pivot_df['Disk Size TB'].sum() * 2) / 2]
}, index=['Total'])

# Combine pivot table with total row
disk_tier_pivot_with_total = pd.concat([disk_tier_pivot_df, total_row])

# Fill NaN values and format for display
formatted_table = disk_tier_pivot_with_total.copy()
formatted_table['VM'] = formatted_table['VM'].fillna(0).astype(int)
formatted_table['Disk Size TB'] = formatted_table['Disk Size TB'].fillna(0).apply(lambda x: f"{x:,.2f}")
formatted_table['VM'] = formatted_table['VM'].apply(lambda x: f"{x:,}")

# Display table
print("Disk Tier Summary with Total (Formatted):")
print(formatted_table.to_string())

# Calculate percentages for VM distribution in each disk tier
vm_counts = disk_tier_pivot_df['VM'].sum()
percentages = (disk_tier_pivot_df['VM'] / vm_counts) * 100

# Generate labels (cast counts to int, since reindex/fillna can leave them as floats)
labels = [f"{tier} - {pct:.1f}% ({int(count)} VMs)" 
          for tier, pct, count in zip(disk_tier_pivot_df.index, percentages, disk_tier_pivot_df['VM'])]

# Automatically assign colors based on number of categories
num_tiers = len(disk_bin_labels)
colors = plt.cm.viridis(np.linspace(0, 1, num_tiers))

# Explode effect to emphasize slices
explode = [0.1] * num_tiers  # Adjust explosion for each tier

# Pie Chart: Automatically assigned colors with explode effect
plt.figure(figsize=(8, 8))
plt.pie(disk_tier_pivot_df['VM'], colors=colors, autopct='%1.1f%%', startangle=140,
        wedgeprops={'edgecolor': 'black', 'linewidth': 1}, explode=explode)
plt.title('VM Distribution by Tier (Tiny, Easy, Medium, etc.)', fontsize=12)
plt.legend(labels, loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))
plt.tight_layout()
plt.show()

# Bar chart for VMs without disk size
plt.figure(figsize=(6, 4))
plt.bar(['No Disk Size'], [no_disk_size_vm_count], color='purple')
plt.title('VMs without Disk Size Information', fontsize=12)
plt.ylabel('Number of VMs')
plt.tight_layout()
plt.show()

### 6- Categorize Host Compute Nodes

Categorize the compute nodes for ALL vCenters by their model number.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as mcolors

# Group all hosts by model
all_host_model_pivot_df = consolidated_vhost_df.pivot_table(index=['Vendor', 'Model'], values='Host', aggfunc='count')
display(all_host_model_pivot_df)

# Bar chart for all host models
fig, ax = plt.subplots(figsize=(12, 8))  # Make it wider to avoid label overlap

# Generate a colormap with enough distinct colors
norm = mcolors.Normalize(vmin=0, vmax=len(all_host_model_pivot_df)-1)
colors = cm.viridis(norm(range(len(all_host_model_pivot_df))))

# Create a bar chart with automatic colors, drawn on the axes created above (avoids an extra empty figure)
bar_plot = all_host_model_pivot_df['Host'].plot(kind='bar', ax=ax, color=colors, edgecolor='black')

# Add title and labels
plt.title('Host Node Compute Models', fontsize=12)
plt.xlabel('Host Model', fontsize=10)
plt.ylabel('Host Count', fontsize=10)

# Improve x-tick readability by rotating the labels
plt.xticks(rotation=90, ha='right', fontsize=8)

# Show the bar chart
plt.tight_layout()  # Adjust layout to prevent overlapping labels
plt.show()

### 7- Categorize Host Compute Nodes by vCenter

Categorize compute nodes by their model number for each vCenter separately.
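
The counting behind both host-model views can be sketched with `groupby(...).size()` on made-up rows; the vendor and model strings below are placeholders rather than values taken from any export.

```python
import pandas as pd

hosts = pd.DataFrame({
    "Host": ["esx01", "esx02", "esx03", "esx04"],
    "vCenter": ["vc1", "vc1", "vc1", "vc2"],
    "Vendor": ["Vendor A", "Vendor A", "Vendor B", "Vendor B"],
    "Model": ["Model X", "Model X", "Model Y", "Model Y"],
})

# Host count per (vCenter, Vendor, Model) combination.
per_vcenter = (
    hosts.groupby(["vCenter", "Vendor", "Model"])
    .size()
    .rename("Host Count")
    .reset_index()
)

# Host count per (Vendor, Model) across all vCenters.
overall = hosts.groupby(["Vendor", "Model"]).size()

print(per_vcenter)
print(overall)
```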

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Group all the hosts by their model and vCenter
all_host_model_vcenter_pivot_df = consolidated_vhost_df.pivot_table(
    index=['vCenter', 'Vendor', 'Model'],
    values=['Host'],
    aggfunc={'Host': 'count'},
    observed=False,
    margins=False,
    sort=True
)

print(f'Distribution of ALL host models by vCenter:\n')
print(all_host_model_vcenter_pivot_df)
all_host_model_vcenter_pivot_df.to_clipboard(excel=True)

# Bar chart based on the pivot table (bars follow the pivot's index order: vCenter, Vendor, Model)
plt.figure(figsize=(14, 8))  # Increase figure size for better visibility

# Prepare labels and data
labels = all_host_model_vcenter_pivot_df.index.map(lambda x: f'{x[1]} {x[2]} ({x[0]})')  # Combine Vendor, Model, and vCenter in the label
sizes = all_host_model_vcenter_pivot_df['Host']

# Automatically assign colors using the 'viridis' colormap
colors = plt.cm.viridis(np.linspace(0, 1, len(sizes)))

# Create the vertical bar chart
plt.bar(labels, sizes, color=colors, edgecolor='black')

# Add title and formatting
plt.title('Host Node Compute Models', fontsize=14)
plt.xlabel('Host Model', fontsize=12)
plt.ylabel('Number of Hosts', fontsize=12)

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right')

# Display the bar chart
plt.tight_layout()  # Adjust layout to prevent clipping
plt.show()

### 8- Count the ESXi Datacenters & Clusters

Summarize the total number of VMware datacenters and ESXi clusters.
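
Cluster names are not guaranteed to be unique across datacenters, so the dataset-wide count of unique cluster names can be lower than the sum of the per-datacenter counts. The sketch below illustrates this with made-up data; `demo_df` and its values are hypothetical and are not taken from any RVTools export.

```python
import pandas as pd

# Hypothetical sample: the name 'Cluster-A' exists in both DC1 and DC2
demo_df = pd.DataFrame({
    'Datacenter': ['DC1', 'DC1', 'DC2', 'DC2'],
    'Cluster': ['Cluster-A', 'Cluster-B', 'Cluster-A', 'Cluster-C'],
})

# Unique cluster names across the whole dataset
unique_clusters = demo_df['Cluster'].nunique()

# Unique cluster names counted separately within each datacenter
per_dc_counts = demo_df.groupby('Datacenter')['Cluster'].nunique()

# Cluster names that appear in more than one datacenter
dc_per_cluster = demo_df.groupby('Cluster')['Datacenter'].nunique()
shared_names = dc_per_cluster[dc_per_cluster > 1].index.tolist()

print(f"Unique cluster names: {unique_clusters}")
print(f"Sum of per-datacenter counts: {int(per_dc_counts.sum())}")
print(f"Names shared across datacenters: {shared_names}")
```

When the two totals differ, the shared names account for the gap, which is the same check the cell below performs on the real data.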

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Remove NaN values from Datacenter and Cluster
filtered_vinfo_df = filtered_vinfo_df.dropna(subset=['Datacenter', 'Cluster'])

# Get unique Datacenters
datacenters = filtered_vinfo_df['Datacenter'].unique()

# Print total number of in-scope VMware Datacenters
print(f'Total VMware Datacenters: {len(datacenters):,}')

# Get unique Clusters
clusters = filtered_vinfo_df['Cluster'].unique()

# Print total number of in-scope ESXi clusters
print(f'Total ESXi clusters: {len(clusters):,}')

# Pivot the Cluster information on the Datacenters
inscope_datacenter_pivot = filtered_vinfo_df.pivot_table(index='Datacenter',
                                                          values='Cluster',
                                                          aggfunc=pd.Series.nunique)  # Count unique clusters per datacenter

# Rename column for clarity
inscope_datacenter_pivot.rename(columns={'Cluster': 'Cluster_Count'}, inplace=True)

# Sort by Cluster_Count in descending order
inscope_datacenter_pivot = inscope_datacenter_pivot.sort_values(by='Cluster_Count', ascending=False)

# Calculate the total number of clusters from the pivot table
total_clusters_from_pivot = int(inscope_datacenter_pivot['Cluster_Count'].sum())  # Convert to Python int

# Calculate the total number of datacenters
total_datacenters = int(inscope_datacenter_pivot.shape[0])  # Convert to Python int

# Show only the top 10 datacenters in the summary
print(f'\nDistribution of Clusters to Datacenters (Top 10 by Cluster Count):\n')
print(inscope_datacenter_pivot.head(10).to_string(index=True))  # Shows only top 10 sorted
print(f'\nTotal Clusters across all Datacenters (from pivot table): {total_clusters_from_pivot}')
print(f'Total number of Datacenters: {total_datacenters}')

# Debug: Compare Total ESXi clusters vs. Sum of clusters per datacenter
if len(clusters) != total_clusters_from_pivot:
    print("\n⚠️ Warning: The total ESXi clusters count does not match the sum of clusters across datacenters.")
    print(f"Total ESXi clusters (unique across dataset): {len(clusters):,}")
    print(f"Total Clusters from pivot table (sum per datacenter): {total_clusters_from_pivot:,}")

    # Find clusters mapped to multiple datacenters
    cluster_datacenter_mapping = filtered_vinfo_df.groupby('Cluster')['Datacenter'].nunique()
    multi_dc_clusters = cluster_datacenter_mapping[cluster_datacenter_mapping > 1]

    if not multi_dc_clusters.empty:
        print("\nClusters that appear in multiple datacenters (Top 10 shown):")
        print(multi_dc_clusters.head(10).to_string(index=True))  # Shows only top 10

# Copy full results to clipboard for easy pasting into Excel
inscope_datacenter_pivot.to_clipboard(excel=True)

# Create a pie chart (using only the top 10 datacenters for clarity)
top_10_pivot = inscope_datacenter_pivot.head(10)

plt.figure(figsize=(8, 8))
plt.pie(top_10_pivot['Cluster_Count'], labels=top_10_pivot.index, 
        autopct='%1.1f%%', startangle=140, explode=[0.05] * len(top_10_pivot))
plt.title('Clusters distributed to Top 10 Datacenters')
plt.show()

### 9- VM distribution by ESXi Clusters

This is orthogonal to the VM distribution analysis by vCenter, as a single vCenter is likely to contain multiple clusters.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Pivot the VM information on the clusters
inscope_cluster_pivot = inscope_vinfo_df.pivot_table(index='Cluster',
                                                      values='VM',
                                                      aggfunc='count')  # Count VMs per cluster

# Rename column for clarity
inscope_cluster_pivot.rename(columns={'VM': 'VM_Count'}, inplace=True)

# Sort by VM_Count in descending order
inscope_cluster_pivot = inscope_cluster_pivot.sort_values(by='VM_Count', ascending=False)

# Calculate the total number of VMs
total_vms = int(inscope_cluster_pivot['VM_Count'].sum())  # Convert to Python int

# Calculate the total number of clusters
total_clusters = int(inscope_cluster_pivot.shape[0])  # Convert to Python int

# Show only the top 10 clusters in the summary
print(f'\nDistribution of VMs to Clusters (Top 10 by VM Count):\n')
print(inscope_cluster_pivot.head(10).to_string(index=True))  # Shows only top 10 sorted
print(f'\nTotal VMs across all clusters: {total_vms}')
print(f'Total number of clusters: {total_clusters}')

# Copy full results to clipboard for easy pasting into Excel
inscope_cluster_pivot.to_clipboard(excel=True)

# Create a pie chart (using only the top 10 clusters for clarity)
top_10_pivot = inscope_cluster_pivot.head(10)

plt.figure(figsize=(8, 8))
plt.pie(top_10_pivot['VM_Count'], labels=top_10_pivot.index, 
        autopct='%1.1f%%', startangle=140, explode=[0.05] * len(top_10_pivot))
plt.title('VMs distributed to Top 10 Clusters')
plt.show()

### 10- VMs Categorized by Environment

This analysis is separate from the VM distribution by environment. The goal is to enhance the vInfo data with additional categorization.

1. Create an Environment Column: This column will be based on the name of the ESXi cluster, which indicates the site location.
2. Add a Site-Type Column: This column will categorize the environment into one of the following types: Dev, QA, Test, NonProd and Prod.
3. Handle Unclassified Clusters: Any cluster names that do not fall into the five categories above will be grouped as "Unknown."
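
The cluster-name fallback depends on keyword order: `prod` is a substring of `nonprod`, so the more specific keyword has to be tested first. The following is a minimal sketch of that idea only; `ENV_KEYWORDS`, `classify_cluster_name`, and the sample cluster names are hypothetical and are not part of the notebook.

```python
# Hypothetical keyword table; order matters because 'prod' is a substring of 'nonprod'
ENV_KEYWORDS = [
    ('nonprod', 'NonProd'),
    ('prod', 'Prod'),
    ('dev', 'Dev'),
    ('qa', 'QA'),
    ('test', 'Test'),
]

def classify_cluster_name(cluster_name):
    """Return the label of the first keyword found in the cluster name."""
    name = str(cluster_name).strip().lower()
    for keyword, label in ENV_KEYWORDS:
        if keyword in name:
            return label
    return 'Unknown'

# Hypothetical cluster names, for illustration only
for sample in ['DC1-NonProd-01', 'dc2-prod-sql', 'lab-dev', 'mgmt-cluster']:
    print(sample, '->', classify_cluster_name(sample))
```

Because matching is by substring, short keywords such as `qa` can also match unrelated names, so the resulting "Unknown" and "QA" groups are worth a manual review.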

In [None]:
##### VMs Categorized by Environment
import pandas as pd
import matplotlib.pyplot as plt

# Define a function to categorize 'Cluster' into environments based on 'Environment' column or 'Cluster' name
def determine_environment(row):
    # Ensure 'Environment' column is checked first
    environment = str(row.get('Environment', '')).strip().lower()
    if environment in ['nonprod', 'non-production', 'nonproduction']:
        return 'NonProd'
    elif environment in ['prod', 'production']:
        return 'Prod'
    elif environment in ['dev', 'development']:
        return 'Dev'
    elif environment in ['qa']:
        return 'QA'
    elif environment in ['test', 'uat']:  # value is already lowercased, so compare against 'uat'
        return 'Test'

    # If 'Environment' is empty, check 'Cluster' name for environment keywords
    cluster_name = str(row.get('Cluster', '')).strip().lower()
    if 'nonprod' in cluster_name:
        return 'NonProd'
    elif 'prod' in cluster_name:
        return 'Prod'
    elif 'dev' in cluster_name:
        return 'Dev'
    elif 'qa' in cluster_name:
        return 'QA'
    elif 'test' in cluster_name:
        return 'Test'
    
    # Return 'Unknown' if no matching environment is found
    return 'Unknown'

# Apply the function to create or update the 'Environment' column in the DataFrame
inscope_vinfo_df['Environment'] = inscope_vinfo_df.apply(determine_environment, axis=1)

# Display the distribution of environments
print(f"Environment Distribution:\n")
environment_counts = inscope_vinfo_df['Environment'].value_counts()
for env, count in environment_counts.items():
    print(f"{env}: {count}")

### 11- VMs Graphed by Environment

In [None]:
##### Block Display by Environment

# Count the VMs classified by each environment type
print(f"Total VMs in Prod   : {len(inscope_vinfo_df[inscope_vinfo_df['Environment'] == 'Prod']):,}")
print(f"Total VMs in Dev    : {len(inscope_vinfo_df[inscope_vinfo_df['Environment'] == 'Dev']):,}")
print(f"Total VMs in QA     : {len(inscope_vinfo_df[inscope_vinfo_df['Environment'] == 'QA']):,}")
print(f"Total VMs in Test   : {len(inscope_vinfo_df[inscope_vinfo_df['Environment'] == 'Test']):,}")
print(f"Total VMs in NonProd: {len(inscope_vinfo_df[inscope_vinfo_df['Environment'] == 'NonProd']):,}")
print(f"Total VMs in Unknown: {len(inscope_vinfo_df[inscope_vinfo_df['Environment'] == 'Unknown']):,}")

import matplotlib.pyplot as plt

# Count the number of VMs in each environment type
env_counts = inscope_vinfo_df['Environment'].value_counts()

# Plot the pie chart
plt.figure(figsize=(8, 8))
plt.pie(env_counts, labels=env_counts.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.tab20.colors, wedgeprops={'edgecolor': 'black'})
plt.title('VM Distribution by Environment')

# Display the pie chart
plt.axis('equal')  # Equal aspect ratio ensures that pie chart is drawn as a circle.
plt.show()

### 12- Summarize Operating Systems by Supported vs. Unsupported

In this section, we provide a summary of the supported and unsupported operating systems for the in-scope vCenter instances.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Ensure that 'ignored_vinfo_df' represents the VMs that were filtered out
ignored_vinfo_df = consolidated_vinfo_df[consolidated_vinfo_df['Exclusion Reason'] != '']

# Function to truncate OS names for better visualization
def truncate_os_names(series):
    return series.rename(lambda x: x if len(x) <= 40 else x[:37] + '...')

# Group and count OS occurrences for both In-Scope and Out-of-Scope VMs
in_scope_os_counts = truncate_os_names(filtered_vinfo_df['Cleaned OS'].value_counts())
out_of_scope_os_counts = truncate_os_names(ignored_vinfo_df['Cleaned OS'].value_counts())

# Pie chart for In-Scope vs Out-of-Scope VMs
plt.figure(figsize=(8, 8))
scope_counts = [len(filtered_vinfo_df), len(ignored_vinfo_df)]
labels = ['In-Scope (Filtered VMs)', 'Out-of-Scope (Ignored VMs)']
colors = ['lightgreen', 'lightcoral']
plt.pie(scope_counts, labels=labels, colors=colors, autopct='%1.1f%%', startangle=140, explode=(0, 0.1))
plt.title('Percentage of In-Scope vs Out-of-Scope VMs')
plt.tight_layout()
plt.show()

# Bar chart for In-Scope OS Counts
plt.figure(figsize=(14, 6))
in_scope_os_counts.sort_values(ascending=False).plot(kind='bar', color='green')
plt.title('In-Scope OS Counts')
plt.xlabel('Operating System')
plt.ylabel('Number of VMs')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Bar chart for Out-of-Scope OS Counts
plt.figure(figsize=(14, 6))
out_of_scope_os_counts.sort_values(ascending=False).plot(kind='bar', color='red')
plt.title('Out-of-Scope OS Counts')
plt.xlabel('Operating System')
plt.ylabel('Number of VMs')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

### 13- Migration Complexity by OS

Operating systems are categorized into the following migration-complexity levels:

* Easy
* Medium
* Hard
* Database
* Unsupported
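
Each OS name is assigned the level of the first mapping key it contains, and anything without a match is treated as Unsupported. Here is a small sketch of that lookup, assuming the same mapping values as the cell that follows; `os_difficulty_demo`, `lookup_difficulty`, and the sample OS strings are hypothetical.

```python
# Mapping values mirror the cell that follows; the dictionary name is hypothetical
os_difficulty_demo = {
    'Windows': 'Medium',
    'Red Hat': 'Easy',
    'Ubuntu': 'Hard',
    'Suse': 'Hard',
    'Oracle': 'Database',
    'Microsoft SQL': 'Database',
}

def lookup_difficulty(os_name, mapping=os_difficulty_demo):
    """Return the level of the first mapping key contained in os_name."""
    return next((level for key, level in mapping.items() if key in os_name), 'Unsupported')

# Hypothetical OS strings, for illustration only
for sample in ['Red Hat Enterprise Linux 8', 'Microsoft Windows Server 2019',
               'FreeBSD 13', 'SUSE Linux Enterprise 15']:
    print(f"{sample}: {lookup_difficulty(sample)}")
```

The match is case-sensitive, so a value such as `SUSE Linux Enterprise 15` falls through to Unsupported unless the `Cleaned OS` column has already been normalized to the spelling used in the mapping.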

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Categorize supported OS into difficulty levels
os_difficulty_mapping = {
    'Windows': 'Medium',
    'Red Hat': 'Easy',
    'Ubuntu': 'Hard',
    'Suse': 'Hard',
    'Oracle': 'Database',
    'Microsoft SQL': 'Database'
}

# Ensure we only count OS instances from in-scope VMs
in_scope_os_counts = filtered_vinfo_df['Cleaned OS'].value_counts()

# Initialize difficulty counts
difficulty_counts = {'Easy': 0, 'Medium': 0, 'Hard': 0, 'Database': 0, 'Unsupported': 0}

# Map OS instances to difficulty levels
for os_name, count in in_scope_os_counts.items():
    difficulty = next((difficulty for key, difficulty in os_difficulty_mapping.items() if key in os_name), 'Unsupported')
    difficulty_counts[difficulty] += count

# Identify unsupported OS instances within in-scope VMs
unsupported_count = difficulty_counts['Unsupported']

# Print White Glove OS instances
# (the loop variable is not named 'os', which would overwrite the imported os module)
white_glove_os = [os_name for os_name in in_scope_os_counts.index if 'Oracle' in os_name or 'Microsoft SQL' in os_name]
print("\n\U0001F4DD White Glove OS Instances:")
for os_name in white_glove_os:
    print(f"\U0001F539 {os_name}: {in_scope_os_counts[os_name]}")

# Create a bar chart with automatic colors
plt.figure(figsize=(10, 6))

difficulty_levels = list(difficulty_counts.keys())
os_counts = list(difficulty_counts.values())

# Generate a colormap automatically
colors = plt.cm.viridis(np.linspace(0, 1, len(difficulty_levels)))

bars = plt.bar(difficulty_levels, os_counts, color=colors)
plt.title('Migration Complexity by OS')
plt.xlabel('Difficulty Level')
plt.ylabel('Number of OS Instances')

# Add numeric labels on bars
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval + 0.5, yval, ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

### 14- Migration Complexity by Disk
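
The disk tiers are built with `pd.cut`, whose intervals are right-closed by default: a VM at exactly 2 TB lands in the first tier, while a value of exactly 0 falls outside every bin unless `include_lowest=True` is passed. The sketch below shows the difference using the same bin edges and labels as the cell that follows; the sample sizes are hypothetical.

```python
import pandas as pd

size_bins = [0, 2, 10, 20, 50, float('inf')]
size_labels = ['Tiny (<=2TB)', 'Easy (2-10TB)', 'Medium (10-20TB)',
               'Hard (20-50TB)', 'White Glove (>50TB)']

# Hypothetical disk sizes in TB
sizes_tb = pd.Series([0.0, 0.5, 2.0, 2.1, 75.0])

# Default behaviour: intervals are (0, 2], (2, 10], ... so 0.0 is not binned
default_cut = pd.cut(sizes_tb, bins=size_bins, labels=size_labels)

# With include_lowest=True the first interval also contains 0.0
inclusive_cut = pd.cut(sizes_tb, bins=size_bins, labels=size_labels, include_lowest=True)

print(pd.DataFrame({'TB': sizes_tb, 'default': default_cut, 'include_lowest': inclusive_cut}))
```

VMs that end up without a category are omitted from the summary counts, so it helps to check how many rows have a zero or missing `Disk Size TB` before reading the chart.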

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Define disk size categories
size_bins = [0, 2, 10, 20, 50, float('inf')]
size_labels = ['Tiny (<=2TB)', 'Easy (2-10TB)', 'Medium (10-20TB)', 'Hard (20-50TB)', 'White Glove (>50TB)']

# Categorize VMs by disk size
filtered_vinfo_df['Disk Size Category'] = pd.cut(filtered_vinfo_df['Disk Size TB'], bins=size_bins, labels=size_labels)

# Create summary
disk_size_summary = filtered_vinfo_df['Disk Size Category'].value_counts().reindex(size_labels).reset_index()
disk_size_summary.columns = ['Disk Size Category', 'VM Count']

# Display summary
print("\n📊 VM Totals by Disk Size Category:")
print(disk_size_summary.to_string(index=False))

# Generate dynamic colors
plt.figure(figsize=(10, 6))
colors = plt.cm.viridis(np.linspace(0, 1, len(size_labels)))

# Plot bar chart with dynamic colors
bars = plt.bar(disk_size_summary['Disk Size Category'], disk_size_summary['VM Count'], color=colors)

plt.xlabel('Disk Size Category')
plt.ylabel('Number of VMs')
plt.title('VM Totals by Disk Size Category')
plt.xticks(rotation=45, ha='right')

# Add numeric labels on bars
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval + 0.5, yval, ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

### 15- Migration Complexity by Disk & OS

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Create explicit copy to avoid SettingWithCopyWarning
filtered_vinfo_df = filtered_vinfo_df.copy()

# Define disk size categories
size_bins = [0, 2, 10, 20, 50, float('inf')]
size_labels = ['Tiny (<=2TB)', 'Easy (2-10TB)', 'Medium (10-20TB)', 'Hard (20-50TB)', 'White Glove (>50TB)']

filtered_vinfo_df['Disk Size Category'] = pd.cut(filtered_vinfo_df['Disk Size TB'], bins=size_bins, labels=size_labels)

# OS difficulty mapping
os_difficulty_mapping = {
    'Windows': 'Medium',
    'Red Hat': 'Easy',
    'Ubuntu': 'Hard',
    'Suse': 'Hard',
    'Oracle': 'Database',
    'Microsoft SQL': 'Database'
}

# Map OS to difficulty level
filtered_vinfo_df['OS Difficulty'] = filtered_vinfo_df['Cleaned OS'].apply(
    lambda os: next((difficulty for key, difficulty in os_difficulty_mapping.items() if key in os), 'Unsupported')
)

# Create a pivot table summarizing VMs by Disk Size and OS Difficulty
pivot_summary = filtered_vinfo_df.pivot_table(
    index='Disk Size Category',
    columns='OS Difficulty',
    values='VM',
    aggfunc='count',
    fill_value=0,
    observed=True
).reset_index()

# Display pivot summary
print("\n📊 Migration Complexity by Disk Size Category and OS Difficulty:")
print(pivot_summary.to_string(index=False, header=True))

# Generate dynamic colors based on the number of OS difficulty categories
num_categories = len(pivot_summary.columns) - 1  # Excluding 'Disk Size Category'
colors = plt.cm.viridis(np.linspace(0, 1, num_categories))

# Plot stacked bar chart with dynamic colors
ax = pivot_summary.set_index('Disk Size Category').plot(
    kind='bar',
    stacked=True,
    figsize=(12, 7),
    colormap=plt.cm.viridis  # Apply dynamic colormap
)

plt.xlabel('Disk Size Category')
plt.ylabel('Number of VMs')
plt.title('\nMigration Complexity by Disk Size Category and OS Difficulty Level')
plt.xticks(rotation=45, ha='right')
plt.legend(title='OS Difficulty', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

### 16- Estimated Migration Time by OS

Change `fte_count` to match the number of engineers (FTEs) who will deliver the engagement.
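
The arithmetic behind the estimate can be sketched in a few lines. The defaults below (110 minutes per 500 GB, one hour of post-migration troubleshooting per VM, 8-hour days) come from the constants in the cell that follows; the helper name `estimate_days` and the sample disk sizes are hypothetical. The model assumes the work divides evenly across all FTEs.

```python
def estimate_days(disk_sizes_tb, minutes_per_500gb=110, pmt_hours=1,
                  hours_per_day=8, fte_count=10):
    """Estimate working days needed to migrate VMs with the given disk sizes (in TB)."""
    total_minutes = 0.0
    for size_tb in disk_sizes_tb:
        # Transfer time: convert TB to GB, then scale by the per-500GB rate
        copy_minutes = (size_tb * 1024 / 500) * minutes_per_500gb
        # Add the fixed post-migration troubleshooting time per VM
        total_minutes += copy_minutes + pmt_hours * 60
    # Spread the total effort evenly across the available FTEs
    return total_minutes / (hours_per_day * 60 * fte_count)

sample_sizes = [0.5, 1.0, 4.0]  # hypothetical disk sizes in TB
for ftes in (5, 10):
    print(f"{ftes} FTEs: {estimate_days(sample_sizes, fte_count=ftes):.2f} days")
```

Doubling `fte_count` halves the estimated days in this model, which is a simplification: it ignores network throughput limits and any migrations that cannot run in parallel.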

In [None]:
import pandas as pd
import math

# Constants
migration_time_per_500gb = 110  # minutes (1 hour 50 minutes per 500GB)
te_hours_per_day = 8            # 8 hours per day per FTE
fte_count = 10                  # 10 FTEs available
pmt_hours = 1                   # Post-migration troubleshooting time per VM in hours

# Ensure required columns are present in filtered_vinfo_df
required_columns = {'Cleaned OS', 'Disk Size TB', 'VM', 'Cluster'}
available_columns = set(filtered_vinfo_df.columns)
optional_columns = {'Environment'}  # Optional columns
missing_columns = required_columns - available_columns

if missing_columns:
    raise ValueError(f"The following required columns are missing in filtered_vinfo_df: {missing_columns}")

# Assign filtered_vinfo_df to inscope_vinfo_df to maintain consistency
inscope_vinfo_df = filtered_vinfo_df.copy()

# Preprocess columns
inscope_vinfo_df['Cleaned OS'] = inscope_vinfo_df['Cleaned OS'].fillna("Unknown OS").astype(str).str.strip()
inscope_vinfo_df['Disk Size TB'] = pd.to_numeric(inscope_vinfo_df['Disk Size TB'], errors='coerce').fillna(0)
inscope_vinfo_df['VM'] = inscope_vinfo_df['VM'].fillna('Unknown VM')
inscope_vinfo_df['Cluster'] = inscope_vinfo_df['Cluster'].fillna('').str.lower()

if 'Environment' in inscope_vinfo_df.columns:
    inscope_vinfo_df['Environment'] = inscope_vinfo_df['Environment'].fillna('unknown').str.lower().str.strip()

# Define disk size classification bins
size_bins = [0, 2, 10, 20, 50, float('inf')]
size_labels = ['Tiny (<=2TB)', 'Easy (2-10TB)', 'Medium (10-20TB)', 'Hard (20-50TB)', 'White Glove (>50TB)']

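# Assign Disk Size Category
# include_lowest=True keeps VMs whose missing size was filled with 0 above in the first bin;
# otherwise they fall outside every bin and drop out of the summary
inscope_vinfo_df['Disk Size Category'] = pd.cut(
    inscope_vinfo_df['Disk Size TB'], bins=size_bins, labels=size_labels, include_lowest=True
)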

# Generate supported OS list dynamically
# (named differently from truncate_os_names in section 12 so that helper is not overwritten)
def list_os_names(os_series):
    """Return the OS names (the index of a value_counts Series) as a list."""
    return os_series.index.tolist()

supported_os_counts = list_os_names(filtered_vinfo_df['Cleaned OS'].value_counts())

# Assign complexities based on classification
def classify_complexity(row):
    if 'sql-' in row['Cluster']:
        return 'MSSQL-DBs'
    elif 'oracle' in row['Cluster']:
        return 'Oracle-DBs'
    elif row['Disk Size Category'] == 'Tiny (<=2TB)':
        return 'Tiny'
    elif row['Disk Size Category'] == 'Easy (2-10TB)':
        return 'Easy'
    elif row['Disk Size Category'] == 'Medium (10-20TB)':
        return 'Medium'
    elif row['Disk Size Category'] == 'Hard (20-50TB)':
        return 'Hard'
    elif row['Disk Size Category'] == 'White Glove (>50TB)':
        return 'White Glove'
    else:
        return 'Unknown'

inscope_vinfo_df['Complexity'] = inscope_vinfo_df.apply(classify_complexity, axis=1)

# Define complexity sorting order
complexity_order = ['Tiny', 'Easy', 'Medium', 'Hard', 'White Glove', 'Oracle-DBs', 'MSSQL-DBs']
inscope_vinfo_df['Complexity'] = pd.Categorical(
    inscope_vinfo_df['Complexity'],
    categories=complexity_order,
    ordered=True
)

# Sort DataFrame
inscope_vinfo_df = inscope_vinfo_df.sort_values('Complexity')

# Assign OS Support dynamically
inscope_vinfo_df['OS Support'] = inscope_vinfo_df['Cleaned OS'].apply(
    lambda os: 'Supported' if any(supported_os in os for supported_os in supported_os_counts) else 'Not Supported'
)

# Compute Migration Time
inscope_vinfo_df['Migration Time (minutes)'] = inscope_vinfo_df['Disk Size TB'].apply(
    lambda size: ((size * 1024) / 500) * migration_time_per_500gb
)

# Compute Total Time (Migration + Post-Migration)
pmt_minutes = pmt_hours * 60
inscope_vinfo_df['Total Time (minutes)'] = inscope_vinfo_df['Migration Time (minutes)'] + pmt_minutes

# Summarize data by Complexity and OS Support
disk_classification_summary = inscope_vinfo_df.groupby(['Complexity', 'OS Support'], observed=True).agg(
    VM_Count=('VM', 'count'),
    Total_Disk=('Disk Size TB', 'sum'),
    Total_Mig_Time_Minutes=('Total Time (minutes)', 'sum')
).reset_index()

# Ensure all Complexity and OS Support combinations exist
for complexity in complexity_order:
    for os_support in ['Supported', 'Not Supported']:
        if not ((disk_classification_summary['Complexity'] == complexity) &
                (disk_classification_summary['OS Support'] == os_support)).any():
            disk_classification_summary = pd.concat([
                disk_classification_summary,
                pd.DataFrame({
                    'Complexity': [complexity],
                    'OS Support': [os_support],
                    'VM_Count': [0],
                    'Total_Disk': [0],
                    'Total_Mig_Time_Minutes': [0]
                })
            ], ignore_index=True)

# Remove rows where VM_Count is 0
disk_classification_summary = disk_classification_summary[disk_classification_summary['VM_Count'] > 0]

disk_classification_summary['Formatted_Mig_Time'] = disk_classification_summary['Total_Mig_Time_Minutes'].apply(
    lambda minutes: f"{minutes / 60:,.1f}h"
)

disk_classification_summary['Total_Days'] = disk_classification_summary['Total_Mig_Time_Minutes'].apply(
    lambda minutes: f"{minutes / (te_hours_per_day * 60 * fte_count):,.1f}"
)

def format_disk_size(tb_value):
    return f"{tb_value * 1024:,.0f} GB" if tb_value < 1 else f"{tb_value:,.2f} TB"

total_disk_tb_numeric = disk_classification_summary['Total_Disk'].astype(float).sum()
disk_classification_summary['Total_Disk'] = disk_classification_summary['Total_Disk'].astype(float).apply(format_disk_size)

total_mig_time_minutes = disk_classification_summary['Total_Mig_Time_Minutes'].sum()

totals_row = {
    'Complexity': 'Totals',
    'OS Support': '',
    'VM_Count': f"{disk_classification_summary['VM_Count'].astype(str).replace(',', '', regex=True).astype(int).sum():,}",
    'Total_Disk': format_disk_size(total_disk_tb_numeric),
    'Formatted_Mig_Time': f"{total_mig_time_minutes / 60:,.1f}h",
    'Total_Days': f"{total_mig_time_minutes / (te_hours_per_day * 60 * fte_count):,.1f}",
    'Total_Mig_Time_Minutes': total_mig_time_minutes
}
disk_classification_summary = pd.concat([disk_classification_summary, pd.DataFrame([totals_row])], ignore_index=True)

disk_classification_summary

### 17- Estimated Migration Time by vCenter

In [None]:
import pandas as pd

# Constants
fte_hours_per_day = 8  # 8 hours per day per FTE
fte_count = 10         # 10 FTEs available

complexity_order = ['Tiny', 'Easy', 'Medium', 'Hard', 'White Glove', 'Oracle-DBs', 'MSSQL-DBs']
inscope_vinfo_df['Complexity'] = pd.Categorical(
    inscope_vinfo_df['Complexity'],
    categories=complexity_order,
    ordered=True
)

# Ensure the vCenter column exists
if 'vCenter' not in inscope_vinfo_df.columns:
    raise ValueError("The required column 'vCenter' is missing from inscope_vinfo_df.")

# Get unique vCenters
vcenters = inscope_vinfo_df['vCenter'].unique()

# Dictionary to store summaries for each vCenter
vcenter_summaries = {}

# Function to format the table output
def custom_table_format_with_totals(headers, rows):
    horizontal_line = "─"
    vertical_line = "│"
    corner_tl, corner_tr = "╭", "╮"
    corner_bl, corner_br = "╰", "╯"
    join_t, join_b, join_c = "┬", "┴", "┼"
    col_widths = [max(len(str(item)) for item in col) for col in zip(headers, *rows)]
    
    def make_row(items):
        return vertical_line + vertical_line.join(f"{str(item).rjust(width)}" for item, width in zip(items, col_widths)) + vertical_line
    
    top_line = corner_tl + join_t.join(horizontal_line * width for width in col_widths) + corner_tr
    header_row = make_row(headers)
    divider_row = join_c.join(horizontal_line * width for width in col_widths).join(["├", "┤"])
    data_rows = [make_row(row) for row in rows[:-1]]
    totals_divider_row = join_c.join(horizontal_line * width for width in col_widths).join(["├", "┤"])
    totals_row = make_row(rows[-1])
    bottom_line = corner_bl + join_b.join(horizontal_line * width for width in col_widths) + corner_br
    return "\n".join([top_line, header_row, divider_row] + data_rows + [totals_divider_row, totals_row, bottom_line])

# Process each vCenter separately
for vcenter in vcenters:
    # Filter the DataFrame for the current vCenter
    vcenter_df = inscope_vinfo_df[inscope_vinfo_df['vCenter'] == vcenter]

    # Perform calculations on the filtered DataFrame
    disk_classification_summary = vcenter_df.groupby(['Complexity', 'OS Support'], observed=True).agg(
        VM_Count=('VM', 'count'),
        Total_Disk=('Disk Size TB', 'sum'),
        Total_Mig_Time_Minutes=('Total Time (minutes)', 'sum')
    ).reset_index()

    # Ensure all "Complexity" and "OS Support" categories exist
    for complexity in complexity_order:
        for os_support in ['Supported', 'Not Supported']:
            if not ((disk_classification_summary['Complexity'] == complexity) &
                    (disk_classification_summary['OS Support'] == os_support)).any():
                disk_classification_summary = pd.concat([
                    disk_classification_summary,
                    pd.DataFrame({
                        'Complexity': [complexity],
                        'OS Support': [os_support],
                        'VM_Count': [0],
                        'Total_Disk': [0.0],
                        'Total_Mig_Time_Minutes': [0]
                    })
                ], ignore_index=True)

    # Remove rows where VM_Count is 0
    disk_classification_summary = disk_classification_summary[disk_classification_summary['VM_Count'] > 0]

    # Convert 'Complexity' to a categorical type with the custom order
    disk_classification_summary['Complexity'] = pd.Categorical(
        disk_classification_summary['Complexity'],
        categories=complexity_order,
        ordered=True
    )

    # Sort the summary by the custom complexity order
    disk_classification_summary = disk_classification_summary.sort_values('Complexity')

    # Store numeric values separately before formatting
    total_disk_tb_numeric = disk_classification_summary['Total_Disk'].astype(float).sum()
    total_mig_time_minutes = disk_classification_summary['Total_Mig_Time_Minutes'].sum()

    # Add calculated columns for formatted migration time and days
    disk_classification_summary['Formatted_Mig_Time'] = disk_classification_summary['Total_Mig_Time_Minutes'].apply(
        lambda minutes: f"{minutes / 60:,.1f}h"
    )
    disk_classification_summary['Days_Per_FTEs'] = disk_classification_summary['Total_Mig_Time_Minutes'].apply(
        lambda minutes: f"{minutes / (fte_hours_per_day * 60 * fte_count):,.1f}"
    )

    # Format numerical values correctly
    disk_classification_summary['VM_Count'] = disk_classification_summary['VM_Count'].astype(int).apply(lambda x: f"{x:,}")
    disk_classification_summary['Total_Disk'] = disk_classification_summary['Total_Disk'].astype(float).apply(lambda x: f"{x:,.0f}")

    # Compute and append totals row
    totals_row = {
        'Complexity': 'Totals',
        'OS Support': '',
        'VM_Count': f"{disk_classification_summary['VM_Count'].astype(str).replace(',', '', regex=True).astype(int).sum():,}",
        'Total_Disk': f"{total_disk_tb_numeric:,.0f}",
        'Formatted_Mig_Time': f"{total_mig_time_minutes / 60:,.1f}h",
        'Days_Per_FTEs': f"{total_mig_time_minutes / (fte_hours_per_day * 60 * fte_count):,.1f}"
    }
    disk_classification_summary = pd.concat(
        [disk_classification_summary, pd.DataFrame([totals_row])],
        ignore_index=True
    )

    # Store the summary in the dictionary
    vcenter_summaries[vcenter] = disk_classification_summary

    # Print the table for the current vCenter
    print(f"\nSummary Table for vCenter: {vcenter}")
    headers = [
        "Complexity", "OS Support", "VM Count", "Total Disk (TB)",
        "Total Migration Time", "Total Days"
    ]
    rows = disk_classification_summary[
        ['Complexity', 'OS Support', 'VM_Count', 'Total_Disk', 'Formatted_Mig_Time', 'Days_Per_FTEs']
    ].values.tolist()
    print(custom_table_format_with_totals(headers, rows))

### 18- Estimated Migration Time (Summary)

In [None]:
# Calculate global totals across all vCenters
global_total_mig_time = sum(
    summary['Total_Mig_Time_Minutes'].sum() for summary in vcenter_summaries.values()
)
global_total_days = global_total_mig_time / (fte_hours_per_day * 60 * fte_count)

# Ensure all values are correctly formatted
print(f"\nGlobal Total Migration Time: {global_total_mig_time / 60:,.1f}h")
print(f"Global Total Days: {global_total_days:,.1f}")