Encoding: turns string objects (str) into bytes objects.
Decoding: turns bytes objects back into strings (str).

In [6]:
a = 'Ä ist ein Umlaut .'
bytes(a, 'utf-8')   # encodes str to bytes
a.encode('utf-16')  # encodes str to bytes as UTF-16 (the result starts with a byte order mark)
a.encode()          # encodes str to bytes (utf-8 by default)


b'\xc3\x84 ist ein Umlaut .'

To decode bytes back into a string, we must know which codec was used to encode them. Decoding with a different codec either raises an error or silently produces the wrong characters.

In [5]:
b'\xff\xfe\xc4\x00 \x00i\x00s\x00t\x00 \x00e\x00i\x00n\x00 \x00U\x00m\x00l\x00a\x00u\x00t\x00 \x00.\x00'.decode('utf-16')

'Ä ist ein Umlaut .'
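To see why the codec matters, here is a minimal sketch that decodes the same UTF-8 bytes three ways. The variable names are illustrative only and not part of the original notebook.

```python
text = 'Ä ist ein Umlaut .'
data = text.encode('utf-8')          # b'\xc3\x84 ist ein Umlaut .'

# Correct codec: the original string comes back.
print(data.decode('utf-8'))

# Wrong codec that accepts every byte: no error, but the text is garbled.
garbled = data.decode('latin-1')
print(repr(garbled))

# Wrong codec that rejects these bytes: decoding raises an error.
try:
    data.decode('ascii')
except UnicodeDecodeError as exc:
    print('ascii failed:', exc.reason)
```

The Latin-1 case is the dangerous one, because it fails silently: every byte maps to some character, so the mistake only shows up when someone reads the text.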

Keep in mind that open in text mode encodes strings for us when writing, and that it uses a platform-dependent default codec unless we pass the encoding argument. If that default cannot represent a character (for example, an ASCII or Latin-1 default and a character outside its range), the write raises a UnicodeEncodeError.

In Python 3.x all strings are Unicode, so to write a string with non-ASCII characters, A, to a file without an error we can either pass an explicit encoding (such as utf-8) to *open*, or encode the string ourselves with str.encode and write the resulting bytes in wb (binary) mode.
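Both ways of writing a non-ASCII string can be sketched as follows. The file name and the temporary directory are placeholders chosen for this example, not paths from the notebook.

```python
import os
import tempfile

text = 'Ä ist ein Umlaut .'
path = os.path.join(tempfile.mkdtemp(), 'umlaut.txt')   # hypothetical throwaway file

# Option 1: text mode with an explicit encoding; open() encodes for us.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Option 2: binary mode; we encode the string ourselves.
with open(path, 'wb') as f:
    f.write(text.encode('utf-8'))

# Reading back with the same codec restores the original string.
with open(path, 'r', encoding='utf-8') as f:
    restored = f.read()

print(restored == text)
```

Option 1 is usually the simpler choice; option 2 is useful when the bytes are already at hand or the codec is decided elsewhere.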

In [3]:
%load_ext autoreload
%autoreload 2



In [248]:
import pandas as pd
import json

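# 'latin' is an alias for 'iso-8859-1'; it maps every byte to exactly one character,
# so reading cannot fail and errors='ignore' has no effect here

with open('/home/mz/code/MaCoZu/data/message.json',
          'r',
          encoding='latin',
          errors='ignore') as json_data:
    obj = json.load(json_data, strict=False)  # strict=False allows control characters inside strings
frame = pd.DataFrame(obj['messages'])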

In [249]:
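frame.dropna(subset=['content'], inplace=True)
# undo the mojibake: turn each character back into its Latin-1 byte, then decode those bytes as UTF-8
frame['content_decoded'] = [str(x).encode('latin').decode('utf-8') for x in frame['content']]
decode = frame[['sender_name', 'content_decoded']].drop_duplicates()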
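The encode('latin') / decode() round trip above only makes sense if each character in the loaded text really stands for one byte of a UTF-8 sequence. The sketch below shows that situation on a made-up JSON fragment; the fragment and the assumption that the export writes one \u00XX escape per UTF-8 byte are illustrative, not taken from the actual message.json.

```python
import json

# Hypothetical export fragment: the UTF-8 bytes of 'não' (n, 0xC3 0xA3, o),
# written as one \u00XX escape per byte.
raw = '{"content": "n\\u00c3\\u00a3o"}'

# json decodes each escape as its own character, which yields mojibake.
broken = json.loads(raw)['content']
print(broken)

# Latin-1 maps code points 0-255 straight back to bytes 0-255,
# and those bytes are then decoded as the UTF-8 they were all along.
fixed = broken.encode('latin-1').decode('utf-8')
print(fixed)
```

If a string contains characters above U+00FF, the Latin-1 encode step raises a UnicodeEncodeError, so this repair is only safe on text that is known to be damaged in this particular way.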

In [250]:
from datetime import datetime

def ms_timestamp_to_strftime(milliseconds):
    # fromtimestamp() expects seconds, so convert milliseconds by dividing by 1000
    seconds = milliseconds/1000
    # convert the timestamp to a datetime; tz=None means the machine's local time
    dt = datetime.fromtimestamp(seconds, tz=None)
    # format the datetime as 'yyyy-mm-dd, HH:MM:SS'
    dt_str = dt.strftime("%Y-%m-%d, %H:%M:%S")

    return dt_str

frame['time'] = [ms_timestamp_to_strftime(x) for x in frame.timestamp_ms]
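Because tz=None uses the local time zone, the strings produced above depend on the machine that runs the notebook. If reproducible output matters, one possible variant pins the conversion to UTC; the function name ms_to_utc_string is hypothetical and not part of the notebook.

```python
from datetime import datetime, timezone

def ms_to_utc_string(milliseconds):
    """Format a millisecond Unix timestamp as 'YYYY-MM-DD, HH:MM:SS' in UTC."""
    seconds = milliseconds / 1000                            # fromtimestamp() expects seconds
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)    # timezone-aware, always UTC
    return dt.strftime("%Y-%m-%d, %H:%M:%S")

print(ms_to_utc_string(0))                  # 1970-01-01, 00:00:00
print(ms_to_utc_string(1_000_000_000_000))  # 2001-09-09, 01:46:40
```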

In [251]:
import re

def cleaner(text):
    clean_text = " ".join(text.split())                              # collapse runs of whitespace into single spaces
    url_free = re.sub(r'http\S+', '', clean_text)                    # remove URLs
    unattached = re.sub(r'\w+ sent an attachment\.$', '', url_free)  # drop a trailing '... sent an attachment.' (only the one word before it is matched)
    stripped = unattached.strip()                                    # strip first so single-character strings can be detected
    more_than_one = stripped if len(stripped) > 1 else ''            # keep only strings with more than one character

    return more_than_one

In [252]:
frame['content_clean'] = [cleaner(x) for x in frame.content_decoded]

In [253]:
pd.set_option('display.max_colwidth', None)
clean_df = frame[[ 'sender_name', 'time', 'content_decoded', 'content_clean' ]]
clean_df.head(2)

Unnamed: 0,sender_name,time,content_decoded,content_clean
0,Francisco Gileno Santos,"2022-07-04, 18:50:57",Êgôriô nanan abôkirê mandêkisaia êkisaô,Êgôriô nanan abôkirê mandêkisaia êkisaô
1,Francisco Gileno Santos,"2022-07-03, 22:58:52","Se mudar de ideia e queira verdadeiramente se despedir de mim e não só esse teatro que sente por mim,meu contato celular e ZAP 71/999065022.......é só ligar ou instalar o ZAP coisa que não creio vc não ter.......","Se mudar de ideia e queira verdadeiramente se despedir de mim e não só esse teatro que sente por mim,meu contato celular e ZAP 71/999065022.......é só ligar ou instalar o ZAP coisa que não creio vc não ter......."


In [254]:
no_empty = clean_df[clean_df['content_clean'].astype(bool)]
no_na = no_empty.dropna(subset=['content_decoded', 'content_clean'])
no_na_reindex = no_na.reset_index(drop=True)  # reset the index after dropping NaN rows
last_df = no_na_reindex[['sender_name', 'time', 'content_clean']]

In [101]:
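# pass the encoding explicitly so non-ASCII characters are written the same way on every platform
with open('gileno4.txt', 'a', encoding='utf-8') as f: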
    dfAsString = last_df.to_string(
        header=False,
        index=False,
        columns=['sender_name', 'time', 'content_clean'])
    f.write(dfAsString)

In [255]:
last_df.to_csv('gileno.csv', index=False)