Create a pipeline for the XML files: convert the entries to CSV, then move them to SQLite.
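As a rough sketch of the last stage of that pipeline, a DataFrame can be written to SQLite with `DataFrame.to_sql` and a standard-library `sqlite3` connection. The table name `sms` and the in-memory database are assumptions made for illustration; a real run would point at a file on disk. The two message bodies are copied from samples shown later in this notebook.

```python
import sqlite3
import pandas as pd

# Stand-in for the SMS table: two message bodies copied from this notebook.
sms = pd.DataFrame({
    'body': [
        'SDJ4N12JZY Confirmed. On 19/4/24 at 10:03 AM Take Ksh300.00 cash from NICKSON MIDUNGA CHOGO Your M-PESA float balance is Ksh83,107.00. Click the link to Download M-Pesa Agent App and Transact the SMART way https://bit.ly/3Ll6JQU',
        'SDJ6NL185I Confirmed. On 19/4/24 at 12:54 PM Take Ksh5,500.00 cash from MWENDWA KANYAI Your M-PESA float balance is Ksh71,087.00. Click the link to Download M-Pesa Agent App and Transact the SMART way https://bit.ly/3Ll6JQU',
    ]
})

# ':memory:' keeps the example self-contained; the table name 'sms' is hypothetical.
conn = sqlite3.connect(':memory:')
sms.to_sql('sms', conn, if_exists='replace', index=False)

# Read the row count back to confirm the write.
row_count = pd.read_sql_query('SELECT COUNT(*) AS n FROM sms', conn)['n'].iloc[0]
print(row_count)
conn.close()
```

`if_exists='replace'` overwrites the table on every run, which suits a notebook that is re-executed often; `'append'` would be the choice for incremental loads.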

In [33]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
import xmltodict as xd
import string
import re
from datetime import datetime

In [3]:
# #open and convert the xml to csv
# with open('data/mochada_mpesa.xml', 'r') as file:
#     data = xd.parse(file.read())


# #extracting the messages
# messages = data['smses']['sms']

# #converting to dataframe
# df = pd.DataFrame(messages)


# #save as a csv(optional)
# df.to_csv('mochada_sms.csv', index=False)

# df.tail()

With the XML file converted to CSV, we can start data wrangling.

## Data Wrangling
Let's look at the columns in our dataset, clean them if need be, and then move on to Exploratory Data Analysis before training our algorithm.

In [2]:
#loading csv file
df = pd.read_csv('mochada_sms.csv')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3080 entries, 0 to 3079
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   @protocol        3080 non-null   int64  
 1   @address         3080 non-null   object 
 2   @date            3080 non-null   int64  
 3   @type            3080 non-null   int64  
 4   @subject         0 non-null      float64
 5   @body            3080 non-null   object 
 6   @toa             0 non-null      float64
 7   @sc_toa          0 non-null      float64
 8   @service_center  3080 non-null   int64  
 9   @read            3080 non-null   int64  
 10  @status          3080 non-null   int64  
 11  @locked          3080 non-null   int64  
 12  @date_sent       3080 non-null   int64  
 13  @sub_id          3080 non-null   int64  
 14  @readable_date   3080 non-null   object 
 15  @contact_name    3080 non-null   object 
dtypes: float64(3), int64(9), object(4)
memory usage: 385.1+ KB


In [6]:
df.head(2)

Unnamed: 0,@protocol,@address,@date,@type,@subject,@body,@toa,@sc_toa,@service_center,@read,@status,@locked,@date_sent,@sub_id,@readable_date,@contact_name
0,0,MPESA,1713512313950,1,,SDJ7N53FS3 Confirmed. On 19/4/24 at 10:38 AM T...,,,254722500040,1,-1,0,1713512297000,2,"Apr 19, 2024 10:38:33 AM",(Unknown)
1,0,MPESA,1713512587737,1,,SDJ9N5M21L Confirmed. On 19/4/24 at 10:42 AM T...,,,254722500040,1,-1,0,1713512571000,2,"Apr 19, 2024 10:43:07 AM",(Unknown)


Based on the output from the last cell, here is a general outline of what we will do:
- Create a new dataframe with only three main columns: address, body, and readable date
- Convert the readable date to a pandas datetime object
- Inspect the body and decide what information to extract, e.g. balance amount, amount taken or given
- Perform a weekly analysis from Wednesday to Tuesday
- Calculate the monthly commission earned based on the transactions
- Do a time series analysis of the dataset
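The Wednesday-to-Tuesday weekly analysis in the outline can be sketched with a weekly frequency anchored on Tuesday (`W-TUE`), so that each bin ends on a Tuesday and therefore starts on the preceding Wednesday. The `amount` column and its values below are made up for illustration; only the grouping technique is the point.

```python
import pandas as pd

# Illustrative transactions: Wed 17 Apr, Mon 22 Apr, and Wed 24 Apr 2024.
tx = pd.DataFrame({
    'date': pd.to_datetime(['2024-04-17 10:00', '2024-04-22 09:00', '2024-04-24 08:00']),
    'amount': [300.0, 100.0, 500.0],
})

# 'W-TUE' builds weekly bins labelled by the Tuesday that closes each week.
weekly = tx.groupby(pd.Grouper(key='date', freq='W-TUE'))['amount'].sum()
print(weekly)
```

The first two rows fall in the week ending Tuesday 23 April, and the third opens the next week.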

In [7]:
#columns to keep
keep = ['@address', '@body', '@readable_date']

#new data frame 
data = df[keep].copy()

#renaming the columns 
data.columns = ['address', 'body', 'date']
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3080 entries, 0 to 3079
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   address  3080 non-null   object
 1   body     3080 non-null   object
 2   date     3080 non-null   object
dtypes: object(3)
memory usage: 72.3+ KB


In [8]:
#converting the date into datetime object
data['date'] = pd.to_datetime(data['date'])

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3080 entries, 0 to 3079
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   address  3080 non-null   object        
 1   body     3080 non-null   object        
 2   date     3080 non-null   datetime64[ns]
dtypes: datetime64[ns](1), object(2)
memory usage: 72.3+ KB


In [10]:
data.tail()

Unnamed: 0,address,body,date
3075,MPESA,SGS245L5GQ Confirmed. On 28/7/24 at 10:55 AM T...,2024-07-28 10:55:57
3076,MPESA,SGS8488HBK Confirmed. on 28/7/24 at 11:20 AM G...,2024-07-28 11:20:05
3077,MPESA,SGS449HBC6 Confirmed. On 28/7/24 at 11:31 AM T...,2024-07-28 11:31:31
3078,MPESA,SGS949ZVIH Confirmed. On 28/7/24 at 11:35 AM T...,2024-07-28 11:36:22
3079,MPESA,SGS04DA44U Confirmed. On 28/7/24 at 12:05 PM T...,2024-07-28 12:05:47


With the new dataframe created, most of the information appears to be contained in the `body` column. Let's load one entry of this column and see what information is worth extracting.

In [11]:
data['body'].iloc[4]

'SDJ4N12JZY Confirmed. On 19/4/24 at 10:03 AM Take Ksh300.00 cash from NICKSON MIDUNGA CHOGO Your M-PESA float balance is Ksh83,107.00. Click the link to Download M-Pesa Agent App and Transact the SMART way https://bit.ly/3Ll6JQU'

Each transaction message has a unique `transaction ID`, then the status (`confirmed` or `failed`), the transaction type (`take` or `give`), the `amount`, the `name` of the customer, and the remaining `balance`.

Let's do some string manipulation.
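One way to pull several of these fields out in a single pass is `Series.str.extract` with named groups, which returns one column per group and `NaN` for rows that do not match. This is a minimal sketch that assumes the `Take`/`Give` layout of the sample message above; other message types (airtime, failures) are deliberately left unmatched.

```python
import pandas as pd

bodies = pd.Series([
    'SDJ4N12JZY Confirmed. On 19/4/24 at 10:03 AM Take Ksh300.00 cash from NICKSON MIDUNGA CHOGO Your M-PESA float balance is Ksh83,107.00. Click the link to Download M-Pesa Agent App and Transact the SMART way https://bit.ly/3Ll6JQU',
    'The customer is not registered to M-PESA and advise the agent to register the customer first.',
])

# Named groups become column names in the result.
pattern = (
    r'^(?P<transaction_id>[A-Z0-9]{10})\s+'
    r'(?P<status>[Cc]onfirmed)\.\s*'
    r'[Oo]n\s+\d{1,2}/\d{1,2}/\d{2}\s+at\s+\d{1,2}:\d{2}\s+[AP]M\s+'
    r'(?P<action>Take|Give)\s+'
    r'Ksh(?P<amount>[\d,]+\.\d{2})'
    r'.*?balance\s+is\s+Ksh(?P<balance>[\d,]+\.\d{2})'
)
fields = bodies.str.extract(pattern)

# Strip thousands separators and convert the money columns to floats.
for col in ['amount', 'balance']:
    fields[col] = pd.to_numeric(fields[col].str.replace(',', '', regex=False))

print(fields)
```

Rows that come back entirely `NaN` are the non-transaction messages, which makes them easy to set aside for separate inspection.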

In [23]:
data['transaction_id'] = data['body'].str.extract(r'^(\w+)')


In [24]:
data['transaction_id'].head()

0    SDJ7N53FS3
1    SDJ9N5M21L
2    SDJ9MU5DIL
3    SDJ2MYL302
4    SDJ4N12JZY
Name: transaction_id, dtype: object

In [31]:
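transaction_id_pattern = r'^[A-Z0-9]{10}$'
#extraction
data['transaction_id'] = data['body'].str.extract(r'^(\w+)')

#validation: the pattern is already anchored, so it is used as is;
#na=False keeps the mask strictly boolean if an ID is missing
data['valid_transaction_id'] = data['transaction_id'].str.match(transaction_id_pattern, na=False)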

#separate the valid transactions from other messages
valid_transactions = data[data['valid_transaction_id']].copy()
others = data[~data['valid_transaction_id']].copy()


In [None]:


def parse_transaction(text):
    pattern = re.compile(
        r'(?P<transaction_id>SD[A-Z0-9]{8})\s+'
        r'(?P<transaction_type>Confirmed|Business\sDeposit\sConfirmed).*?'
        r'(?:(?P<action>Take|Give|transferred from Working to Float)\s+)?'
        r'(?P<currency>\w{3})(?P<amount>[\d,.]+)(?:\s+cash)?'
        r'(?:\s+(?:from|to)\s+(?P<party>[A-Z\s]+?))?\s+'
        r'Your\s+(?:M-PESA float balance|New M-PESA balance|New Working balance)\s+is\s+'
        r'(?P<currency2>\w{3})(?P<new_balance>[\d,.]+)'
        r'.*?on\s+(?P<date>\d{1,2}/\d{1,2}/\d{2})\s+at\s+(?P<time>\d{1,2}:\d{2}\s+[AP]M)'
    )
    
    match = pattern.search(text)
    if not match:
        return {}
    
    result = match.groupdict()
    
    # Clean up and convert data types
    result['amount'] = float(result['amount'].replace(',', ''))
    result['new_balance'] = float(result['new_balance'].replace(',', ''))
    result['time'] = datetime.strptime(f"{result['date']} {result['time']}", "%d/%m/%y %I:%M %p")
    
    # Determine transaction type
    if result['transaction_type'] == 'Business Deposit Confirmed':
        result['transaction_type'] = 'Business Deposit'
    elif result['action'] == 'Take':
        result['transaction_type'] = 'Receive'
    elif result['action'] == 'Give':
        result['transaction_type'] = 'Send'
    elif result['action'] == 'transferred from Working to Float':
        result['transaction_type'] = 'Transfer'
    else:
        result['transaction_type'] = 'Unknown'
    
    # Clean up unnecessary fields
    del result['action'], result['date'], result['currency2']
    
    return result

# Assuming your DataFrame is named 'data' and has a 'body' column
extracted_data = data['body'].apply(parse_transaction)
result = pd.concat([data, pd.DataFrame(extracted_data.tolist())], axis=1)

# Separate valid transactions and others
valid_transactions = result[result['transaction_id'].notna()].copy()
others = result[result['transaction_id'].isna()].copy()

# Display info about the DataFrames
print("Valid Transactions:")
print(valid_transactions.info())
print("\nOthers:")
print(others.info())

# Show the first few rows of each DataFrame
print("\nValid Transactions (first 5 rows):")
print(valid_transactions[['body', 'transaction_id', 'transaction_type', 'amount', 'currency', 'new_balance', 'party', 'time']].head())
print("\nOthers (first 5 rows):")
print(others[['body', 'transaction_id', 'transaction_type', 'amount', 'currency', 'new_balance', 'party', 'time']].head())

In [35]:
import pandas as pd
import re
from datetime import datetime

def parse_transaction(text):
    pattern = re.compile(
        r'(?P<transaction_id>SD[A-Z0-9]{8})\s+'
        r'(?P<transaction_type>Confirmed|Business\sDeposit\sConfirmed).*?'
        r'(?:(?P<action>Take|Give|transferred from Working to Float)\s+)?'
        r'(?P<currency>\w{3})(?P<amount>[\d,.]+)(?:\s+cash)?'
        r'(?:\s+(?:from|to)\s+(?P<party>[A-Z\s]+?))?\s+'
        r'Your\s+(?:M-PESA float balance|New M-PESA balance|New Working balance)\s+is\s+'
        r'(?P<currency2>\w{3})(?P<new_balance>[\d,.]+)'
        r'.*?on\s+(?P<date>\d{1,2}/\d{1,2}/\d{2})\s+at\s+(?P<time>\d{1,2}:\d{2}\s+[AP]M)'
    )
    
    match = pattern.search(text)
    if not match:
        print(f"No match found for text: {text[:100]}...")  # Print first 100 chars for debugging
        return {
            'transaction_id': None,
            'transaction_type': 'Unknown',
            'amount': None,
            'currency': None,
            'new_balance': None,
            'party': None,
            'time': None
        }
    
    result = match.groupdict()
    
    # Clean up and convert data types
    result['amount'] = float(result['amount'].replace(',', ''))
    result['new_balance'] = float(result['new_balance'].replace(',', ''))
    result['time'] = datetime.strptime(f"{result['date']} {result['time']}", "%d/%m/%y %I:%M %p")
    
    # Determine transaction type
    if result['transaction_type'] == 'Business Deposit Confirmed':
        result['transaction_type'] = 'Business Deposit'
    elif result['action'] == 'Take':
        result['transaction_type'] = 'Receive'
    elif result['action'] == 'Give':
        result['transaction_type'] = 'Send'
    elif result['action'] == 'transferred from Working to Float':
        result['transaction_type'] = 'Transfer'
    else:
        result['transaction_type'] = 'Unknown'
    
    # Clean up unnecessary fields
    del result['action'], result['date'], result['currency2']
    
    return result

# Assuming your DataFrame is named 'data' and has a 'body' column
extracted_data = data['body'].apply(parse_transaction)
result = pd.concat([data, pd.DataFrame(extracted_data.tolist())], axis=1)

# Print column names for debugging
print("Columns in result DataFrame:", result.columns.tolist())

# Separate valid transactions and others
valid_transactions = result[result['transaction_id'].notna()].copy()
others = result[result['transaction_id'].isna()].copy()

# Display info about the DataFrames
print("Valid Transactions:")
print(valid_transactions.info())
print("\nOthers:")
print(others.info())

# Show the first few rows of each DataFrame
print("\nValid Transactions (first 5 rows):")
print(valid_transactions[['body', 'transaction_id', 'transaction_type', 'amount', 'currency', 'new_balance', 'party', 'time']].head())
print("\nOthers (first 5 rows):")
print(others[['body', 'transaction_id', 'transaction_type', 'amount', 'currency', 'new_balance', 'party', 'time']].head())

No match found for text: SDJ7N53FS3 Confirmed. On 19/4/24 at 10:38 AM Take Ksh1,100.00 cash from ROBERT MAHINDA Your M-PESA f...
No match found for text: SDJ9N5M21L Confirmed. On 19/4/24 at 10:42 AM Take Ksh1,100.00 cash from brian chacha Your M-PESA flo...
No match found for text: SDJ9MU5DIL Confirmed. On 19/4/24 at 8:57 AM Take Ksh1,100.00 cash from Christopher Rwara Your M-PESA...
No match found for text: SDJ2MYL302 Confirmed. On 19/4/24 at 9:39 AM Take Ksh100.00 cash from john timpanko Your M-PESA float...
No match found for text: SDJ4N12JZY Confirmed. On 19/4/24 at 10:03 AM Take Ksh300.00 cash from NICKSON MIDUNGA CHOGO Your M-P...
No match found for text: SDJ4N6KYGK Confirmed. On 19/4/24 at 10:51 AM Take Ksh100.00 cash from NICKSON MIDUNGA CHOGO Your M-P...
No match found for text: SDJ3NAIRL9 confirmed.You bought Ksh50.00 of airtime on 19/4/24 at 11:25 AM.New M-PESA balance is Ksh...
No match found for text: SDJ4NEIHH0 Confirmed. On 19/4/24 at 11:59 AM Take Ksh500.00 cash from CH

  valid_transactions = result[result['transaction_id'].notna()].copy()


ValueError: cannot reindex on an axis with duplicate labels
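Two things likely explain the output above. First, the pattern expects the date at the end of the message and an ID starting with `SD`, while the sample messages put `On <date> at <time>` right after `Confirmed.`, contain mixed-case names, and later IDs start with `SF` or `SG`, so nothing matches. Second, `data` already has a `transaction_id` column from the earlier cell, so `pd.concat(..., axis=1)` yields two columns with that label, and `result['transaction_id']` then returns a DataFrame rather than a Series, which is a plausible cause of the `ValueError`. Below is a minimal sketch of the second problem and one way around it; `left` and `parsed` are hypothetical stand-ins for `data` and the parsed output.

```python
import pandas as pd

# Stand-ins: 'left' plays the role of data, 'parsed' the parser output.
left = pd.DataFrame({'body': ['msg a', 'msg b'],
                     'transaction_id': ['SDJ7N53FS3', None]})
parsed = pd.DataFrame({'transaction_id': ['SDJ7N53FS3', None],
                       'amount': [1100.0, None]})

# Plain concat keeps both 'transaction_id' columns, so the label is ambiguous.
dup = pd.concat([left, parsed], axis=1)
print(dup.columns.is_unique)

# Drop the overlapping columns from the left frame before joining on the index.
overlap = left.columns.intersection(parsed.columns)
result = left.drop(columns=overlap).join(parsed)

valid = result[result['transaction_id'].notna()].copy()
others = result[result['transaction_id'].isna()].copy()
print(len(valid), len(others))
```

Dropping the older column keeps the freshly parsed value; passing suffixes to `DataFrame.join` (`lsuffix`, `rsuffix`) would keep both for comparison instead.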

In [10]:
def cleaner(x):
    parts = x.split()
    final = []
    amount_pattern = r'Ksh[\d,]+\.\d{2}'
    for part in parts:
        if re.match(amount_pattern, part):
            final.append(part)
        else:
            part = part.replace('.', ' ')
            subsplits = [word.strip(string.punctuation).upper() for word in part.split()]
            final.extend(subsplits)
    return final

trying = data['body'].apply(cleaner).apply(pd.Series)
type(trying)
trying.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,46,47,48,49,50,51,52,53,54,55
0,SDJ7N53FS3,CONFIRMED,ON,19/4/24,AT,10:38,AM,TAKE,"Ksh1,100.00",CASH,...,,,,,,,,,,
1,SDJ9N5M21L,CONFIRMED,ON,19/4/24,AT,10:42,AM,TAKE,"Ksh1,100.00",CASH,...,,,,,,,,,,
2,SDJ9MU5DIL,CONFIRMED,ON,19/4/24,AT,8:57,AM,TAKE,"Ksh1,100.00",CASH,...,,,,,,,,,,
3,SDJ2MYL302,CONFIRMED,ON,19/4/24,AT,9:39,AM,TAKE,Ksh100.00,CASH,...,,,,,,,,,,
4,SDJ4N12JZY,CONFIRMED,ON,19/4/24,AT,10:03,AM,TAKE,Ksh300.00,CASH,...,,,,,,,,,,


In [11]:
trying.shape

(3080, 56)

Phase one of cleaning is done. In phase two we will drop unnecessary columns and standardize the remaining ones so that each has the correct data type.

In [12]:
failed = 0
mpesa = 0
an = 0
the = 0
dear = 0
your = 0
transaction = 0 
other = 0
for _ in trying[0]:
    if _ == 'FAILED':
        failed += 1
    elif _ == 'M-PESA':
        mpesa += 1
    elif _ == 'AN':
        an += 1
    elif _ == 'DEAR':
        dear += 1
    elif _ == 'THE':
        the += 1
    elif _ == 'YOUR':
        your += 1
    elif _ == 'TRANSACTION':
        transaction += 1
    else:
        other += 1


print(f"Failed entries are {failed}, mpesa entries are {mpesa}, your are {your}, `an` entries are {an}, dear entries are {dear}, the entries are {the}, sum of these is {failed + mpesa + an + the + dear}, actual entries {other}")

Failed entries are 138, mpesa entries are 7, your are 11, `an` entries are 30, dear entries are 7, the entries are 9, sum of these is 191, actual entries 2877


In [13]:
trying[0].tail()

3075    SGS245L5GQ
3076    SGS8488HBK
3077    SGS449HBC6
3078    SGS949ZVIH
3079    SGS04DA44U
Name: 0, dtype: object

In [14]:
def is_valid_entry(entry):
    return len(str(entry)) == 10



# Filter the DataFrame
filtered_trying = trying[trying[0].apply(is_valid_entry)]

# Display the result
filtered_trying.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,46,47,48,49,50,51,52,53,54,55
0,SDJ7N53FS3,CONFIRMED,ON,19/4/24,AT,10:38,AM,TAKE,"Ksh1,100.00",CASH,...,,,,,,,,,,
1,SDJ9N5M21L,CONFIRMED,ON,19/4/24,AT,10:42,AM,TAKE,"Ksh1,100.00",CASH,...,,,,,,,,,,
2,SDJ9MU5DIL,CONFIRMED,ON,19/4/24,AT,8:57,AM,TAKE,"Ksh1,100.00",CASH,...,,,,,,,,,,
3,SDJ2MYL302,CONFIRMED,ON,19/4/24,AT,9:39,AM,TAKE,Ksh100.00,CASH,...,,,,,,,,,,
4,SDJ4N12JZY,CONFIRMED,ON,19/4/24,AT,10:03,AM,TAKE,Ksh300.00,CASH,...,,,,,,,,,,


In [15]:
pattern = r'[A-Z0-9]{10}$'

actual = 0
actual1 = []
failed = 0
mpesa = 0
an = 0
the = 0
dear = 0
your = 0
transaction = 0 
congratulations = 0
other = []
for _ in trying[0]:
    if _ == 'FAILED':
        failed += 1
    elif _ == 'M-PESA':
        mpesa += 1
    elif _ == 'AN':
        an += 1
    elif _ == 'DEAR':
        dear += 1
    elif _ == 'THE':
        the += 1
    elif _ == 'YOUR':
        your += 1
    elif _ == 'TRANSACTION':
        transaction += 1
    elif _ == 'CONGRATULATIONS':
        congratulations += 1
    elif re.match(pattern, _):
        actual += 1
        actual1.append(_)
    else:
        other.append(_)


actual2 = pd.Series(actual1)
actual2.info()

<class 'pandas.core.series.Series'>
RangeIndex: 2871 entries, 0 to 2870
Series name: None
Non-Null Count  Dtype 
--------------  ----- 
2871 non-null   object
dtypes: object(1)
memory usage: 22.6+ KB


In [16]:
other = pd.Series(other)
other.info()
other

<class 'pandas.core.series.Series'>
RangeIndex: 4 entries, 0 to 3
Series name: None
Non-Null Count  Dtype 
--------------  ----- 
4 non-null      object
dtypes: object(1)
memory usage: 164.0+ bytes


0    PLEASE
1       YOU
2       YOU
3       YOU
dtype: object

In [30]:
# Define a stricter regex pattern: 'S', two letters, then seven alphanumerics
pattern = r'^S[A-Z]{2}[A-Z0-9]{7}$'

# Filter the DataFrame
filtered_df = trying[trying[0].str.match(pattern)]

# Display the result
filtered_df[0].info()

<class 'pandas.core.series.Series'>
Int64Index: 2155 entries, 0 to 3079
Series name: 0
Non-Null Count  Dtype 
--------------  ----- 
2155 non-null   object
dtypes: object(1)
memory usage: 33.7+ KB


In [31]:
filtered_trying[0].info()

<class 'pandas.core.series.Series'>
Int64Index: 2871 entries, 0 to 3079
Series name: 0
Non-Null Count  Dtype 
--------------  ----- 
2871 non-null   object
dtypes: object(1)
memory usage: 44.9+ KB


In [73]:
colA = set(trying[0])
colB = set(actual2)

missing_entr = colA - colB

missdf = trying[trying[0].isin(missing_entr)]
missdf

<class 'pandas.core.frame.DataFrame'>
Int64Index: 209 entries, 11 to 3063
Data columns (total 56 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       209 non-null    object
 1   1       209 non-null    object
 2   2       209 non-null    object
 3   3       209 non-null    object
 4   4       209 non-null    object
 5   5       209 non-null    object
 6   6       209 non-null    object
 7   7       209 non-null    object
 8   8       209 non-null    object
 9   9       208 non-null    object
 10  10      207 non-null    object
 11  11      206 non-null    object
 12  12      206 non-null    object
 13  13      175 non-null    object
 14  14      173 non-null    object
 15  15      173 non-null    object
 16  16      163 non-null    object
 17  17      163 non-null    object
 18  18      153 non-null    object
 19  19      152 non-null    object
 20  20      152 non-null    object
 21  21      122 non-null    object
 22  22      33 non-null     

In [65]:

def find_missing_entries(df_a, df_b, column_name):
    # Convert the column in both DataFrames to sets
    set_a = set(df_a[column_name])
    set_b = set(df_b[column_name])
    
    # Find entries in A that are not in B
    missing_entries = set_a - set_b
    
    # Create a new DataFrame with the missing entries
    missing_df = df_a[df_a[column_name].isin(missing_entries)]
    
    return missing_df


missing = find_missing_entries(trying, actual2, 0)
missing[0].info()
missing

<class 'pandas.core.series.Series'>
Int64Index: 3080 entries, 0 to 3079
Series name: 0
Non-Null Count  Dtype 
--------------  ----- 
3080 non-null   object
dtypes: object(1)
memory usage: 48.1+ KB


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,46,47,48,49,50,51,52,53,54,55
0,SDJ7N53FS3,CONFIRMED,ON,19/4/24,AT,10:38,AM,TAKE,"Ksh1,100.00",CASH,...,,,,,,,,,,
1,SDJ9N5M21L,CONFIRMED,ON,19/4/24,AT,10:42,AM,TAKE,"Ksh1,100.00",CASH,...,,,,,,,,,,
2,SDJ9MU5DIL,CONFIRMED,ON,19/4/24,AT,8:57,AM,TAKE,"Ksh1,100.00",CASH,...,,,,,,,,,,
3,SDJ2MYL302,CONFIRMED,ON,19/4/24,AT,9:39,AM,TAKE,Ksh100.00,CASH,...,,,,,,,,,,
4,SDJ4N12JZY,CONFIRMED,ON,19/4/24,AT,10:03,AM,TAKE,Ksh300.00,CASH,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3075,SGS245L5GQ,CONFIRMED,ON,28/7/24,AT,10:55,AM,TAKE,Ksh100.00,CASH,...,,,,,,,,,,
3076,SGS8488HBK,CONFIRMED,ON,28/7/24,AT,11:20,AM,GIVE,Ksh100.00,TO,...,,,,,,,,,,
3077,SGS449HBC6,CONFIRMED,ON,28/7/24,AT,11:31,AM,TAKE,Ksh500.00,CASH,...,,,,,,,,,,
3078,SGS949ZVIH,CONFIRMED,ON,28/7/24,AT,11:35,AM,TAKE,Ksh250.00,CASH,...,,,,,,,,,,
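The call above reports all 3080 rows as missing. A likely cause is that `actual2` is a Series, so `df_b[column_name]` with `column_name=0` returns its first element (a single ID string), and `set()` of a string yields individual characters rather than IDs. Below is a minimal sketch that compares against the Series directly; `find_missing_rows` and the small inputs are hypothetical.

```python
import pandas as pd

def find_missing_rows(df, column, known_values):
    """Return the rows of df whose value in `column` is not among known_values."""
    return df[~df[column].isin(known_values)]

# Small stand-ins for `trying` (first token per message) and `actual2` (valid IDs).
tokens = pd.DataFrame({0: ['SDJ7N53FS3', 'FAILED', 'SDJ9N5M21L', 'DEAR']})
ids = pd.Series(['SDJ7N53FS3', 'SDJ9N5M21L'])

# Indexing the Series with 0 gives one string; its set holds characters, not IDs.
chars = set(ids[0])

missing = find_missing_rows(tokens, 0, ids)
print(missing[0].tolist())
```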


In [None]:
""" 
This function takes a row as input, 
goes through each of the entries and populates the money column with 
"""

def cleaner2(x):
    ...

In [11]:
data['body'].iloc[10]

'SDJ6NL185I Confirmed. On 19/4/24 at 12:54 PM Take Ksh5,500.00 cash from MWENDWA KANYAI Your M-PESA float balance is Ksh71,087.00. Click the link to Download M-Pesa Agent App and Transact the SMART way https://bit.ly/3Ll6JQU'

In [12]:
""" 
There seem to be inconsistencies in the entries, so a more robust function (or series of functions) is
needed.
Brainstorming options:
    1. Look for patterns shared by similar entries and filter on those.
    2. After splitting a message into sections, go over each section and split it
    again on every punctuation mark.
"""

In [13]:
# import pandas as pd
# import string

# class StringProcessor:
    
#     def spli(self, x):
#         parts = x.split()
#         part_1 = parts[:5]
#         part_2 = parts[5:11]
#         part_3 = parts[11:16]
#         part_4 = parts[16:20]
#         return part_1, part_2, part_3, part_4

#     def split_and_capitalize(self, parts):
#         processed_parts = []
#         for part in parts:
#             # Split each sublist by punctuation marks
#             split_subparts = [word.strip(string.punctuation).capitalize() for sublist in part for word in sublist.split()]
#             processed_parts.append(split_subparts)
#         return processed_parts

#     def to_dataframe(self, processed_parts):
#         # Create a dictionary with keys as column names and values as the lists of processed parts
#         data_dict = {f'Part_{i+1}': pd.Series(part) for i, part in enumerate(processed_parts)}
#         df = pd.DataFrame(data_dict)
#         return df

# # Example usage
# processor = StringProcessor()

# # Input string
# # input_string = "This is an example string that we will use to test the functionality of the class and its methods correctly"

# # Method 1: Split input string
# split_parts = processor.spli(sample)
# print("Split Parts:", split_parts)

# # Method 2: Split on punctuation and capitalize
# capitalized_parts = processor.split_and_capitalize(split_parts)
# print("Capitalized Parts:", capitalized_parts)

# # Method 3: Convert to DataFrame
# df = processor.to_dataframe(capitalized_parts)
# print("DataFrame:\n", df)


In [14]:
#lets try option 2 before going back to option one
import string

class StringProcessor:
    def spli(self, x):
        parts = x.split()
        part_1 = parts[:5]
        part_2 = parts[5:11]
        part_3 = parts[11:16]
        part_4 = parts[16:20]
        return part_1, part_2, part_3, part_4
    
    def split_processing(self, parts):
        processed_parts = []
        for part in parts:
            #strip surrounding punctuation from each word and uppercase it
            split_subparts = [word.strip(string.punctuation).upper() for sublist in part for word in sublist.split()]
            processed_parts.append(split_subparts)
        return processed_parts
    
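    def to_dataframe(self, processed_parts):
        # One column per part; shorter parts are padded with NaN
        data_dict = {f'Part_{i+1}': pd.Series(part) for i, part in enumerate(processed_parts)}
        return pd.DataFrame(data_dict)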

#initialize the class instance
processor = StringProcessor()

sample = data['body'].iloc[6]
# Method 1: Split input string
split_parts = processor.spli(sample)
# print("Split Parts:", split_parts)

# Method 2: Strip punctuation and uppercase each word
capitalized_parts = processor.split_processing(split_parts)
print("Capitalized Parts:", capitalized_parts)

Capitalized Parts: [['SDJ7N53FS3', 'CONFIRMED', 'ON', '19/4/24', 'AT'], ['10:38', 'AM', 'TAKE', 'KSH1,100.00', 'CASH', 'FROM'], ['ROBERT', 'MAHINDA', 'YOUR', 'M-PESA', 'FLOAT'], ['BALANCE', 'IS', 'KSH82,007.00', 'CLICK']]


In [15]:
#lets create a function to extract information from this column
""" 
Function takes a string as input
Splits it into different sections
Extracts specific information from the text:
    - unique id
    - transaction status
    - transaction type
    - amount
    - customer names
    - balance
"""

import re

def extract(text):
    # Split the text
    parts = text.split()
    
    # Initialize variables
    id = parts[0]
    status = parts[1]

    # Determine the action
    if 'Take' in parts:  # Someone gave money to deposit
        action = 'Take'
        action_index = parts.index('Take')

    elif 'Give' in parts:  # Someone withdrew money
        action = 'Give'
        action_index = parts.index('Give')

    elif 'bought' in parts:  # Sold airtime to customer
        action = 'bought'
        action_index = parts.index('bought')

    elif 'sent' in parts:  # Personal transactions
        action = 'sent'
        action_index = parts.index('sent')

    else:
        action = None
        action_index = -1

    # Extraction of the amount 
    amount_pattern = r'Ksh[\d,]+\.\d{2}'
    amount_match = re.search(amount_pattern, text)
    amount = amount_match.group(0) if amount_match else None

    # Remove 'Ksh' from the amount
    if amount:
        amount = re.sub(r'Ksh\s*', '', amount)
        amount = re.sub(r',', '', amount)
        amount = float(amount)

    # Extraction of customer name
    if action in ['Take', 'Give']:
        name = parts[action_index + 4] + ' ' + parts[action_index + 5]
        name = name.upper()
    else:
        name = None

    # Extraction of Mpesa balance
    balance_start = text.find('balance is') + len('balance is') + 1
    balance_end = text.find('.', balance_start) + 3
    
    if balance_start != -1 and balance_end != -1:
        balance = text[balance_start:balance_end].strip()

        # Remove 'Kshs' from the balance
        balance = re.sub(r'Ksh\s*', '', balance)
        balance = re.sub(r',', '', balance)
        # balance = float(balance)
    else:
        balance = None

    return {
        'Transaction Id': id,
        'Status': status,
        'Action': action,
        'Amount': amount,
        'Customer Name': name,
        'balance': balance,
    }


In [16]:
new_df = data['body'].apply(extract).apply(pd.Series)

new_df.head()

Unnamed: 0,Transaction Id,Status,Action,Amount,Customer Name,balance
0,SDJ7N53FS3,Confirmed.,Take,1100.0,ROBERT MAHINDA,82007.0
1,SDJ9N5M21L,Confirmed.,Take,1100.0,BRIAN CHACHA,80907.0
2,SDJ9MU5DIL,Confirmed.,Take,1100.0,CHRISTOPHER RWARA,97007.0
3,SDJ2MYL302,Confirmed.,Take,100.0,JOHN TIMPANKO,83407.0
4,SDJ4N12JZY,Confirmed.,Take,300.0,NICKSON MIDUNGA,83107.0


In [17]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3080 entries, 0 to 3079
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Transaction Id  3080 non-null   object 
 1   Status          3080 non-null   object 
 2   Action          2725 non-null   object 
 3   Amount          2891 non-null   float64
 4   Customer Name   2491 non-null   object 
 5   balance         3080 non-null   object 
dtypes: float64(1), object(5)
memory usage: 144.5+ KB
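`balance` is reported as non-null for all 3080 rows, even though some messages (the failure notices, for example) carry no balance. One likely reason: `str.find` returns `-1` when `'balance is'` is absent, but `balance_start` has the offset added before it is compared with `-1`, so the `else` branch is never reached. Below is a minimal sketch of a regex-based alternative; `extract_balance` is a hypothetical helper, not part of the notebook.

```python
import re

# Matches e.g. "balance is Ksh83,107.00" and captures the numeric part.
BALANCE_RE = re.compile(r'balance\s+is\s+Ksh([\d,]+\.\d{2})')

def extract_balance(text):
    """Return the balance as a float, or None when the message has no balance."""
    match = BALANCE_RE.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(',', ''))

with_balance = ('SDJ4N12JZY Confirmed. On 19/4/24 at 10:03 AM Take Ksh300.00 cash from '
                'NICKSON MIDUNGA CHOGO Your M-PESA float balance is Ksh83,107.00. '
                'Click the link to Download M-Pesa Agent App and Transact the SMART way '
                'https://bit.ly/3Ll6JQU')
without_balance = ('The customer is not registered to M-PESA and advise the agent '
                   'to register the customer first.')

print(extract_balance(with_balance))
print(extract_balance(without_balance))
```

Returning `None` for unmatched messages lets pandas store the column as `float64` with `NaN`, so the non-null count reflects the messages that actually report a balance.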


In [18]:
def filter_and_extract_ids(df):
    # Define the valid actions
    valid_actions = ['Take', 'Give', 'bought']

    # Filter the DataFrame to only include rows with valid actions
    filtered_df = df[df['Action'].isin(valid_actions)]

    # Extract the unique transaction IDs of the filtered rows
    unique_ids = filtered_df['Transaction Id'].unique().tolist()


    return filtered_df, unique_ids




In [19]:
filtered_df, unique_ids = filter_and_extract_ids(new_df)

In [20]:
filtered_df.head()

Unnamed: 0,Transaction Id,Status,Action,Amount,Customer Name,balance
0,SDJ7N53FS3,Confirmed.,Take,1100.0,ROBERT MAHINDA,82007.0
1,SDJ9N5M21L,Confirmed.,Take,1100.0,BRIAN CHACHA,80907.0
2,SDJ9MU5DIL,Confirmed.,Take,1100.0,CHRISTOPHER RWARA,97007.0
3,SDJ2MYL302,Confirmed.,Take,100.0,JOHN TIMPANKO,83407.0
4,SDJ4N12JZY,Confirmed.,Take,300.0,NICKSON MIDUNGA,83107.0


In [21]:
len(unique_ids)
# joined_df.info()


2422

In [22]:
#continue cleaning of this data. 

In [23]:
data['body'].iloc[-17]

'The customer is not registered to M-PESA and advise the agent to register the customer first.'

In [24]:
["""
 - Failed. Kindly capture the correct mobile number and customer details as they appear on the identification document and attempt the 
    deposit again.
 - SDJ9N5M21L Confirmed. On 19/4/24 at 10:42 AM Take Ksh1,100.00 cash from brian chacha Your M-PESA float balance is Ksh80,907.00. 
    Click the link to Download M-Pesa Agent App and Transact the SMART way https://bit.ly/3Ll6JQU
 - The customer is not registered to M-PESA and advise the agent to register the customer first.

 - SDJ6NL185I Confirmed. On 19/4/24 at 12:54 PM Take Ksh5,500.00 cash from MWENDWA KANYAI Your M-PESA float balance is Ksh71,087.00. 
    Click the link to Download M-Pesa Agent App and Transact the SMART way https://bit.ly/3Ll6JQU
 - SGQ7V4HPMX Confirmed. On 26/7/24 at 9:54 AM Take Ksh328.00 cash from ROSE KAKINDU Your M-PESA float balance is Ksh132,626.00. 
    Click the link to Download M-Pesa Agent App and Transact the SMART way https://bit.ly/3Ll6JQU
 - SF98YJ0DIY Confirmed. On 9/6/24 at 7:59 PM Take Ksh1,651.00 cash from CAROLINE ABUGA Your M-PESA float balance is Ksh63,721.00. 
    Click the link to Download M-Pesa Agent App and Transact the SMART way https://bit.ly/3Ll6JQU
 """]

sample = data['body'].iloc[-1520]


In [46]:
data['date'].iloc[-1520]

Timestamp('2024-06-09 19:59:43')