This code performs the following steps:

1. **Import Libraries**: It imports the necessary libraries: `os` for interacting with the operating system, `pandas` for data manipulation, and `numpy` for numerical operations.

2. **Define Directories**: It specifies two directories containing CSV files: `'C:\\Users\\pskotte\\Desktop\\New folder'` and `'C:\\Users\\pskotte\\Desktop\\ILMS'`.

3. **Determine Source Function**: It defines a function `determine_source(filepath)` that infers the source of a file by substring match on its path. A path containing 'New folder' returns 'DM', a path containing 'ILMS' returns 'ILMS', and anything else returns 'Unknown'.

4. **Read CSV Files**: It iterates through the specified directories, reads every file ending in `.csv` into a DataFrame, adds a 'Source' column indicating its origin, and appends the DataFrame to the list `df_list`.

5. **Concatenate DataFrames**: It concatenates all DataFrames in `df_list` into a single DataFrame `combined_df`.

6. **Filter Non-Blank Rows**: It defines a list of columns to check for non-blank values. It first drops rows from `combined_df` in which all of these columns are NaN. It then goes through the columns one by one and drops rows whose value in that column is an empty or whitespace-only string. Because `astype(str)` turns NaN into the string `'nan'`, NaN values are not removed by this second pass.

7. **Convert Column Type**: It converts the 'Invalid NOC Info 1' column to string type so that sorting a column of mixed types does not raise a `TypeError`.

8. **Sort DataFrame**: It sorts the filtered DataFrame `filtered_df` by the 'Invalid NOC Info 1' column and stores the result in `sorted_df`.

9. **Display DataFrame**: Finally, it displays the sorted DataFrame `sorted_df`.

This code effectively reads multiple CSV files from specified directories, filters out rows with blank values in certain columns, and sorts the resulting DataFrame by a specific column.
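
The "at least one non-blank value" rule from step 6 can be sketched on a toy DataFrame. This is a minimal illustration, not the notebook's exact logic: the helper name `keep_rows_with_any_value` and the sample values are assumptions, and the sketch treats NaN and whitespace-only strings alike as blank.

```python
import pandas as pd

def keep_rows_with_any_value(df, columns):
    """Keep rows where at least one of `columns` holds a non-blank value.

    Hypothetical helper: NaN and whitespace-only strings both count as blank.
    """
    # Replace NaN with '' and trim whitespace so both kinds of blank look the same.
    as_text = df[columns].fillna('').astype(str).apply(lambda col: col.str.strip())
    has_value = as_text.ne('').any(axis=1)
    # .copy() gives an independent DataFrame that is safe to modify later.
    return df[has_value].copy()

# Made-up sample data for illustration only.
toy = pd.DataFrame({
    'Invalid NOC Info 1': ['C01', None, '   '],
    'Valid NOC Info 1': [None, None, 'C02'],
})
kept = keep_rows_with_any_value(toy, ['Invalid NOC Info 1', 'Valid NOC Info 1'])
print(kept.index.tolist())
```

Here the second row is dropped because both columns are blank, while the third row survives because 'Valid NOC Info 1' has a value.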

In [1]:
import os
import pandas as pd
import numpy as np

# Directories containing the CSV files
directories = [
    'C:\\Users\\pskotte\\Desktop\\New folder',
    'C:\\Users\\pskotte\\Desktop\\ILMS'
]

# Function to determine source
def determine_source(filepath):
    if 'New folder' in filepath:
        return 'DM'
    elif 'ILMS' in filepath:
        return 'ILMS'
    return 'Unknown'

# Read and concatenate all CSV files in the directories
df_list = []
for directory in directories:
    for filename in os.listdir(directory):
        if filename.endswith('.csv'):
            file_path = os.path.join(directory, filename)
            df = pd.read_csv(file_path)
            df['Source'] = determine_source(file_path)
            df_list.append(df)

# Concatenate all dataframes
combined_df = pd.concat(df_list, ignore_index=True)

# Columns to check for non-blank values
columns_to_check = [
    'Invalid NOC Info 1', 'Valid NOC Info 1',
    'Invalid NOC Info 2', 'Valid NOC Info 2',
    'Invalid NOC Info 3', 'Valid NOC Info 3'
]

# Filter the dataframe to only keep rows where any of the columns have non-blank values
filtered_df = combined_df.dropna(subset=columns_to_check, how='all')

for column in columns_to_check:
    filtered_df = filtered_df[filtered_df[column].astype(str).str.strip() != '']

# Convert 'Invalid NOC Info 1' column to string type to avoid TypeError
filtered_df['Invalid NOC Info 1'] = filtered_df['Invalid NOC Info 1'].astype(str)

# Sort the filtered dataframe by 'Invalid NOC Info 1' column
sorted_df = filtered_df.sort_values(by='Invalid NOC Info 1')

# Display the sorted dataframe
sorted_df

Unnamed: 0,Report Name,Return Account #,Return Date,Return Post Date (Final),Return Post Date (Pending Redeposit),Debit Reclear Status,ACH RTN,Company ID,Collection Point,Collection App,...,Invalid NOC Info 3,Valid NOC Info 3,Orig Item Trace,Return Trace,Orig Par,Return Par,State,Report Date,Report Time,Source
69807,CR03320 RETURNED ITEMS REPORT,1453037594,07/28/23,07/28/23,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,,,111000020008754,251584100000204,23206012653908,23208017549789,TX,07/28/23,06:10:00,DM
299602,CR03320 RETURNED ITEMS REPORT,1453037594,09/29/23,09/29/23,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,,,111000020021077,231177120040001,23271017079673,23272007770251,TX,10/02/23,06:10:00,DM
299603,CR03320 RETURNED ITEMS REPORT,1453037594,09/29/23,09/29/23,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,,,111000020021080,231177120040003,23271017079676,23272007770253,TX,10/02/23,06:10:00,DM
299604,CR03320 RETURNED ITEMS REPORT,1453037594,09/29/23,09/29/23,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,,,111000020021081,231177120040005,23271017079677,23272007770255,TX,10/02/23,06:10:00,DM
212721,CR03320 RETURNED ITEMS REPORT,1453037594,09/01/23,09/01/23,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,,,111000020006224,242175730020001,23243013342100,23244010370617,TX,09/05/23,06:15:00,DM
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76488,CR03320 RETURNED ITEMS REPORT,1453037594,07/28/23,07/28/23,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,,,111000020000326,111104580006722,23208013391106,23209014816610,TX,07/31/23,06:09:00,DM
62586,CR03320 RETURNED ITEMS REPORT,1453037594,07/24/23,07/24/23,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,,,111000020000467,242071755920737,23205008136681,23205009520759,TX,07/25/23,06:07:00,DM
62571,CR03320 RETURNED ITEMS REPORT,1453037594,07/25/23,07/25/23,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,,,111000020000201,91408440901241,23205008141051,23206003617908,TX,07/25/23,06:07:00,DM
69869,CR03320 RETURNED ITEMS REPORT,1453037594,07/27/23,07/27/23,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,,,111000020000002,243381110000037,23206012645154,23208009916090,TX,07/28/23,06:10:00,DM


This code performs the following steps:

1. **Import Libraries**: It imports the necessary libraries: `pandas` for data manipulation, `pyodbc` for connecting to the SQL Server, and `re` for regular expression operations.

2. **SQL Server Connection**: It establishes a connection to a SQL Server database using the provided connection string (`conn_str`). The connection is stored in the `conn` variable, and a cursor object is created for executing SQL queries.

3. **Extract Receiver IDs**: It assumes that `sorted_df` is a DataFrame containing a column named 'Receiver ID'. It extracts the unique values from this column and stores them in the `receiver_ids` array.

4. **Clean Receiver IDs**: It removes specific substrings ('SP', 'LAL', and '-') from each Receiver ID and strips any leading or trailing whitespace. The cleaned Receiver IDs are stored in the `cleaned_receiver_ids` array.

5. **Separate Numerical and Non-Numerical IDs**: It separates the cleaned Receiver IDs into numerical and non-numerical IDs using regular expressions. Numerical IDs are stored in the `numerical_ids` array, and non-numerical IDs are stored in the `non_numerical_ids` array.

6. **Print Count of Non-Numerical IDs**: It prints the count of non-numerical Receiver IDs.

7. **Format Numerical IDs for SQL Query**: It formats the numerical Receiver IDs into a comma-separated string (`formatted_ids`) for use in an SQL query.

8. **SQL Query Execution**: It defines an SQL query template (`sql_template`) to retrieve the MCM Account Number and Receiver ID from the database. The query is executed using the formatted numerical Receiver IDs, and the result is stored in the `result_df` DataFrame.

9. **Update DataFrame**: It left-merges `result_df` into `sorted_df`, matching 'Receiver ID' in `sorted_df` against 'Receiver_ID' in `result_df`. The helper 'Receiver_ID' column is dropped after merging, the 'MCM Account Number' column is set from the merged 'MCM_Account_Number' values, and the temporary 'MCM_Account_Number' column is then removed.

10. **Count NaN or Blank Values**: It counts the number of rows in the 'MCM Account Number' column that have NaN or blank values and prints the count.

11. **Close Connection**: It closes the cursor and the database connection.

12. **Clean and Display DataFrame**: It strips letters and hyphens from the 'Receiver ID' column of `sorted_df` and displays the first few rows of the updated DataFrame.

This code effectively cleans and processes Receiver IDs, retrieves corresponding MCM Account Numbers from a SQL Server database, updates the DataFrame, and counts the number of rows with missing or blank MCM Account Numbers.
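
Steps 4 and 5 can be tried in isolation with the standard library alone. The sketch below uses made-up Receiver IDs, and the helper names `clean_receiver_id` and `split_numeric` are hypothetical; only the regular expression mirrors the one in the cell.

```python
import re

def clean_receiver_id(raw):
    """Remove the 'SP' and 'LAL' markers and hyphens, then trim whitespace."""
    return re.sub(r'SP|LAL|-', '', str(raw)).strip()

def split_numeric(ids):
    """Split cleaned IDs into all-digit IDs and everything else."""
    numeric = [i for i in ids if re.fullmatch(r'\d+', i)]
    other = [i for i in ids if not re.fullmatch(r'\d+', i)]
    return numeric, other

# Made-up sample values for illustration only.
sample = ['SP-12345', 'LAL 678', '90-12', 'ABC']
cleaned = [clean_receiver_id(s) for s in sample]
numeric, other = split_numeric(cleaned)
print(numeric, other)
```

Only the all-digit IDs would be safe to place in the numeric `IN (...)` list of the query.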

In [2]:
import pandas as pd
import pyodbc
import re

# SQL Server Connection String
conn_str = (
    r'Driver={SQL Server};'
    r'Server=rpt_ap_prd.internal.mcmcg.com;'
    r'Database=crs5_oltp;'
    r'Trusted_Connection=yes;'
)
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()

# Assuming 'sorted_df' is your DataFrame with 'Receiver ID'
receiver_ids = sorted_df['Receiver ID'].unique()

# Remove 'SP', 'LAL', and '-' from Receiver IDs and strip whitespace
# (the loop variable is named 'rid' to avoid shadowing the built-in id())
cleaned_receiver_ids = [re.sub(r'(SP|LAL|-)', '', str(rid)).strip() for rid in receiver_ids]

# Separate numerical and non-numerical Receiver IDs
numerical_ids = [rid for rid in cleaned_receiver_ids if re.match(r'^\d+$', rid)]
non_numerical_ids = [rid for rid in cleaned_receiver_ids if not re.match(r'^\d+$', rid)]

# Print the count of non-numerical Receiver IDs
print(f"Count of non-numerical Receiver IDs: {len(non_numerical_ids)}")

# Directly format the numerical receiver IDs for the SQL query
formatted_ids = ','.join(numerical_ids)

sql_template = ''' 
SELECT ca.cnsmr_accnt_idntfr_agncy_id as MCM_Account_Number, capj.cnsmr_pymnt_jrnl_id as Receiver_ID 
from cnsmr_accnt ca
inner join cnsmr_accnt_pymnt_jrnl capj on ca.cnsmr_accnt_id  = capj.cnsmr_accnt_id
WHERE cnsmr_pymnt_jrnl_id in ({})
'''

query = sql_template.format(formatted_ids)
result_df = pd.read_sql_query(query, conn)

# Update the 'MCM Account Number' in 'sorted_df' based on 'Receiver ID'
sorted_df = pd.merge(sorted_df, result_df[['Receiver_ID', 'MCM_Account_Number']], left_on='Receiver ID', right_on='Receiver_ID', how='left')
sorted_df.drop(columns=['Receiver_ID'], inplace=True)  # Remove the 'Receiver_ID' column after merging
sorted_df['MCM Account Number'] = sorted_df['MCM_Account_Number']
sorted_df.drop(columns=['MCM_Account_Number'], inplace=True)  # Remove the temporary 'MCM_Account_Number' column

# Count the number of rows with NaN or blank values for 'MCM Account Number'
nan_or_blank_count = sorted_df['MCM Account Number'].isna().sum() + (sorted_df['MCM Account Number'].astype(str).str.strip() == '').sum()
print(f"Number of columns with NaN or blank 'MCM Account Number': {nan_or_blank_count}")

# Close the cursor and the connection, then display the updated DataFrame
cursor.close()
conn.close()
sorted_df.head()

# Assuming 'sorted_df' is your DataFrame with the 'Receiver ID' column
# Remove alphabetical characters and '-' from 'Receiver ID'
sorted_df['Receiver ID'] = sorted_df['Receiver ID'].apply(lambda x: re.sub(r'[a-zA-Z-]', '', str(x)).strip())

# Display the updated DataFrame
sorted_df.head()

Count of non-numerical Receiver IDs: 0




Number of columns with NaN or blank 'MCM Account Number': 1096


Unnamed: 0,Report Name,Return Account #,Return Date,Return Post Date (Final),Return Post Date (Pending Redeposit),Debit Reclear Status,ACH RTN,Company ID,Collection Point,Collection App,...,Valid NOC Info 3,Orig Item Trace,Return Trace,Orig Par,Return Par,State,Report Date,Report Time,Source,MCM Account Number
0,CR03320 RETURNED ITEMS REPORT,1453037594,07/28/23,07/28/23,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,,111000020008754,251584100000204,23206012653908,23208017549789,TX,07/28/23,06:10:00,DM,
1,CR03320 RETURNED ITEMS REPORT,1453037594,09/29/23,09/29/23,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,,111000020021077,231177120040001,23271017079673,23272007770251,TX,10/02/23,06:10:00,DM,321163542.0
2,CR03320 RETURNED ITEMS REPORT,1453037594,09/29/23,09/29/23,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,,111000020021080,231177120040003,23271017079676,23272007770253,TX,10/02/23,06:10:00,DM,321312282.0
3,CR03320 RETURNED ITEMS REPORT,1453037594,09/29/23,09/29/23,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,,111000020021081,231177120040005,23271017079677,23272007770255,TX,10/02/23,06:10:00,DM,321163543.0
4,CR03320 RETURNED ITEMS REPORT,1453037594,09/01/23,09/01/23,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,,111000020006224,242175730020001,23243013342100,23244010370617,TX,09/05/23,06:15:00,DM,317796181.0
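
The query in this cell builds its `IN (...)` list by string formatting. A common alternative is to pass the IDs as query parameters using `?` placeholders, which is the marker style pyodbc also uses. The sketch below demonstrates the idea against an in-memory SQLite database as a stand-in; the table `journal`, its columns, and the helper `select_by_ids` are assumptions made for illustration, not the real schema.

```python
import sqlite3

def select_by_ids(conn, ids):
    """Fetch rows whose receiver_id is in `ids`, using one '?' per value."""
    placeholders = ','.join('?' for _ in ids)
    sql = (
        'SELECT receiver_id, account_number FROM journal '
        f'WHERE receiver_id IN ({placeholders}) ORDER BY receiver_id'
    )
    # The values travel separately from the SQL text, so they are never
    # interpreted as part of the statement.
    return conn.execute(sql, list(ids)).fetchall()

# Hypothetical stand-in table with made-up rows.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE journal (receiver_id INTEGER, account_number INTEGER)')
conn.executemany('INSERT INTO journal VALUES (?, ?)', [(1, 100), (2, 200), (3, 300)])
rows = select_by_ids(conn, [1, 3])
conn.close()
print(rows)
```

With a very long ID list, the server may limit how many parameters a single statement accepts, so the IDs might need to be sent in batches.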


In [3]:
# Save the updated DataFrame to a CSV file
sorted_df.to_csv(r'C:\Users\pskotte\Desktop\sorted_df.csv', index=False)

ENSURE THAT ALL OF THE MCM ACCOUNT NUMBERS ARE PRESENT IN THE DATAFRAME
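
One way to act on the note above is to count the rows that still lack an account number before moving on. This is a minimal sketch on made-up data; the helper name `missing_account_mask` is hypothetical.

```python
import pandas as pd

def missing_account_mask(df, column='MCM Account Number'):
    """Return a boolean Series that is True where the account number is NaN or blank."""
    values = df[column]
    return values.isna() | values.astype(str).str.strip().eq('')

# Made-up sample data for illustration only.
toy = pd.DataFrame({'MCM Account Number': [321163542.0, None, 317796181.0]})
missing = missing_account_mask(toy)
print(f"Rows without an MCM Account Number: {int(missing.sum())}")
```

A count of zero would indicate that every row has an account number.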

In [16]:
import pandas as pd
sorted_df = pd.read_csv(r'C:\Users\pskotte\Desktop\sorted_df.csv')

In [17]:
import pandas as pd

# Ensure the Report Date column is in datetime format
sorted_df['Report Date'] = pd.to_datetime(sorted_df['Report Date'], errors='coerce')

# Group by 'MCM Account Number' and get the earliest Report Date for each group
early_report_dates = sorted_df.groupby('MCM Account Number')['Report Date'].min().reset_index()
early_report_dates.rename(columns={'Report Date': 'Earliest Report Date'}, inplace=True)

# Merge the earliest report dates back to the original dataframe
sorted_df = pd.merge(sorted_df, early_report_dates, on='MCM Account Number', how='left')

# Display the updated DataFrame
sorted_df.head()

Unnamed: 0,Report Name,Return Account #,Return Date,Return Post Date (Final),Return Post Date (Pending Redeposit),Debit Reclear Status,ACH RTN,Company ID,Collection Point,Collection App,...,Orig Item Trace,Return Trace,Orig Par,Return Par,State,Report Date,Report Time,Source,MCM Account Number,Earliest Report Date
0,CR03320 RETURNED ITEMS REPORT,1453037594,7/28/2023,7/28/2023,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,111000000000000.0,251584000000000.0,23206000000000.0,23208000000000.0,TX,2023-07-28,6:10:00,DM,319042946.0,2023-07-28
1,CR03320 RETURNED ITEMS REPORT,1453037594,9/29/2023,9/29/2023,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,111000000000000.0,231177000000000.0,23271000000000.0,23272000000000.0,TX,2023-10-02,6:10:00,DM,321163542.0,2023-10-02
2,CR03320 RETURNED ITEMS REPORT,1453037594,9/29/2023,9/29/2023,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,111000000000000.0,231177000000000.0,23271000000000.0,23272000000000.0,TX,2023-10-02,6:10:00,DM,321312282.0,2023-10-02
3,CR03320 RETURNED ITEMS REPORT,1453037594,9/29/2023,9/29/2023,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,111000000000000.0,231177000000000.0,23271000000000.0,23272000000000.0,TX,2023-10-02,6:10:00,DM,321163543.0,2023-10-02
4,CR03320 RETURNED ITEMS REPORT,1453037594,9/1/2023,9/1/2023,,,122000030,1481090909,ECGROUP2,MCMANAGE1,...,111000000000000.0,242176000000000.0,23243000000000.0,23244000000000.0,TX,2023-09-05,6:15:00,DM,317796181.0,2023-09-05
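
The group-then-merge pattern in this cell can also be written in one step with `groupby(...).transform('min')`, which returns a value for every original row. The sketch below uses made-up account numbers and dates to show the idea.

```python
import pandas as pd

# Made-up sample data for illustration only.
toy = pd.DataFrame({
    'MCM Account Number': [1, 1, 2],
    'Report Date': pd.to_datetime(['2023-10-02', '2023-07-28', '2023-09-05']),
})

# transform('min') broadcasts each group's earliest date back to its rows,
# so no separate merge is needed.
toy['Earliest Report Date'] = (
    toy.groupby('MCM Account Number')['Report Date'].transform('min')
)
print(toy)
```

Both rows of account 1 receive 2023-07-28, the earlier of its two report dates.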


In [28]:
import pandas as pd
import pyodbc
import sqlalchemy as sa
import re

# Azure Synapse Connection String
conn_str = (
    r'Driver={ODBC Driver 17 for SQL Server};'
    r'Server=tcp:azwsynt00.sql.azuresynapse.net,1433;'
    r'Database=AZWSYNT00;'
    r'Authentication=ActiveDirectoryIntegrated;'
)

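# Create a SQLAlchemy engine; URL-encode the ODBC string so that characters
# such as ';', '{' and spaces are passed through intact
from urllib.parse import quote_plus
engine = sa.create_engine(f'mssql+pyodbc:///?odbc_connect={quote_plus(conn_str)}')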

# Assuming 'sorted_df' is your DataFrame with 'MCM Account Number'
sorted_df['MCM Account Number'] = sorted_df['MCM Account Number'].astype(str).apply(lambda x: x.split('.')[0])

# Drop rows where 'MCM Account Number' is NaN or 'nan'
sorted_df = sorted_df[sorted_df['MCM Account Number'].notna() & (sorted_df['MCM Account Number'] != 'nan')]

account_numbers = sorted_df['MCM Account Number'].unique()

# Format account numbers for SQL query
formatted_account_numbers = ','.join([f"'{str(num)}'" for num in account_numbers])

sql_query = f'''
SELECT
cpj.ConsumerPaymentJournalID,
cpj.BucketTransactionTypeCode, -- 2 PAYMENT 4 NSF 9 REVERSAL
 da.DimAccountKey, -- the one to match to dw.Fact_Collection.DimAccountKey
 da.AccountID, -- that one that matches capj.ConsumerAccountID
 da.AccountNumber AS 'MCM Account Number 1', -- mcm account number the one we show our consumers and vendors
 capj.ConsumerAccountPaymentAmount,
 capj.ConsumerAccountPaymentBalanceAmount,
 capj.ConsumerAccountPaymentIsNSFFlag,
 FORMAT(capj.ConsumerAccountPaymentPostedDate, 'yyyy-MM-dd') as ConsumerAccountPaymentPostedDate
FROM ref.DM_Consumer_Payment_Journal cpj
INNER JOIN ref.DM_Consumer_Account_Payment_Journal capj
ON (cpj.ConsumerPaymentJournalID = capj.ConsumerPaymentJournalID)
INNER JOIN dw.Dim_Account da
ON (capj.ConsumerAccountID = da.AccountID)
WHERE da.AccountNumber IN ({formatted_account_numbers})
ORDER BY cpj.ConsumerPaymentJournalID
'''

# Execute the query using SQLAlchemy engine
result_df = pd.read_sql_query(sql_query, engine)

# Convert the column in result_df to string to match sorted_df
result_df['MCM Account Number 1'] = result_df['MCM Account Number 1'].astype(str)

# Merge only Earliest Report Date from sorted_df with result_df based on 'MCM Account Number' and 'MCM Account Number 1'
df_merged = pd.merge(sorted_df[['MCM Account Number', 'Earliest Report Date']], result_df, left_on='MCM Account Number', right_on='MCM Account Number 1', how='left')

# Count the total number of NaN or blank cells across all columns
nan_count = df_merged.isna().sum().sum()
blank_count = df_merged.apply(lambda x: x.astype(str).str.strip().eq('').sum()).sum()
nan_or_blank_count = nan_count + blank_count
print(f"Number of columns with NaN or blank values: {nan_or_blank_count}")

# Display the updated DataFrame
df_merged.head()

Number of columns with NaN or blank values: 41


Unnamed: 0,MCM Account Number,Earliest Report Date,ConsumerPaymentJournalID,BucketTransactionTypeCode,DimAccountKey,AccountID,MCM Account Number 1,ConsumerAccountPaymentAmount,ConsumerAccountPaymentBalanceAmount,ConsumerAccountPaymentIsNSFFlag,ConsumerAccountPaymentPostedDate
0,319042946,2023-07-28,304481027,2,17934983,19283169,319042946,233.33,2799.65,N,2023-02-22
1,319042946,2023-07-28,317190112,2,17934983,19283169,319042946,233.33,2566.32,N,2023-03-21
2,319042946,2023-07-28,333344222,2,17934983,19283169,319042946,233.33,2332.99,N,2023-04-25
3,319042946,2023-07-28,346053995,2,17934983,19283169,319042946,233.33,2099.66,N,2023-05-23
4,319042946,2023-07-28,361920967,2,17934983,19283169,319042946,233.33,1866.33,N,2023-06-27
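
The cell above turns float account numbers such as `319042946.0` into text by splitting on the decimal point and then filtering out the string `'nan'`. A sketch of an alternative is to drop the missing values first and cast through an integer type. The helper name `account_numbers_as_text` and the sample values are assumptions.

```python
import pandas as pd

def account_numbers_as_text(series):
    """Convert float-like account numbers to digit strings, dropping missing values."""
    # Anything that cannot be parsed as a number becomes NaN.
    numeric = pd.to_numeric(series, errors='coerce')
    present = numeric.dropna()
    # Casting to int64 removes the trailing '.0' before the string conversion.
    return present.astype('int64').astype(str)

# Made-up sample data for illustration only.
toy = pd.Series([319042946.0, float('nan'), 321163542.0])
as_text = account_numbers_as_text(toy)
print(as_text.tolist())
```

The result keeps the original index, so it can still be aligned with the rows it came from.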


In [47]:
import pandas as pd
import hashlib

# Convert 'Earliest Report Date' and 'ConsumerAccountPaymentPostedDate' to datetime
df_merged['Earliest Report Date'] = pd.to_datetime(df_merged['Earliest Report Date'], errors='coerce')
df_merged['ConsumerAccountPaymentPostedDate'] = pd.to_datetime(df_merged['ConsumerAccountPaymentPostedDate'], errors='coerce')

# Keep payments posted on or after 'Earliest Report Date' and on or after 2023-11-01
filtered_df = df_merged[(df_merged['ConsumerAccountPaymentPostedDate'] >= df_merged['Earliest Report Date']) & (df_merged['ConsumerAccountPaymentPostedDate'] >= pd.to_datetime('2023-11-01'))]

# Add a column that counts the occurrences of each 'ConsumerPaymentJournalID'
filtered_df['ConsumerPaymentJournalID_Count'] = filtered_df.groupby('ConsumerPaymentJournalID')['ConsumerPaymentJournalID'].transform('count')

# Create a hash value for each row to identify duplicates
def hash_row(row):
    row_string = ''.join(row.values.astype(str))
    return hashlib.md5(row_string.encode()).hexdigest()

# Apply the hash function to each row and create a new column 'hash_value'
filtered_df['hash_value'] = filtered_df.apply(hash_row, axis=1)

# Add a column that counts the occurrences of each 'hash_value'
filtered_df['hash_value_count'] = filtered_df.groupby('hash_value')['hash_value'].transform('count')

# Display the filtered DataFrame
filtered_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df['ConsumerPaymentJournalID_Count'] = filtered_df.groupby('ConsumerPaymentJournalID')['ConsumerPaymentJournalID'].transform('count')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df['hash_value'] = filtered_df.apply(hash_row, axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,MCM Account Number,Earliest Report Date,ConsumerPaymentJournalID,BucketTransactionTypeCode,DimAccountKey,AccountID,MCM Account Number 1,ConsumerAccountPaymentAmount,ConsumerAccountPaymentBalanceAmount,ConsumerAccountPaymentIsNSFFlag,ConsumerAccountPaymentPostedDate,ConsumerPaymentJournalID_Count,hash_value,hash_value_count
6,319042946,2023-07-28,479152709,2,17934983,19283169,319042946,221.61,1399.67,N,2024-02-28,2,78a29cdca96bf38c197058fb30d0a806,2
7,319042946,2023-07-28,491417190,2,17934983,19283169,319042946,221.61,1178.06,N,2024-03-27,2,4d191b2e51f7dfdb0086e66dd338e85e,2
8,319042946,2023-07-28,505654002,2,17934983,19283169,319042946,221.61,956.45,N,2024-04-24,2,34479085d20b7832cf56b76a246780ad,2
9,319042946,2023-07-28,519703982,2,17934983,19283169,319042946,221.61,734.84,N,2024-05-22,2,4ad8d25b813b0f9d4bf4973f572434f1,2
27,317796181,2023-09-05,435068120,2,16850024,18134242,317796181,202.79,2433.6,N,2023-11-30,1,2606f39b54a59ce823a589825714fad9,1


In [48]:
len(filtered_df)

34326

In [49]:
# Save the updated DataFrame to a CSV file
filtered_df.to_csv(r'C:\Users\pskotte\Desktop\filtered_df.csv', index=False)