# Language Detection Pipeline Preparation

During ETL pipeline preparation, we found that some messages were not in English.
This notebook experiments with ChatGPT (through the OpenAI API) to detect the language of each message and translate non-English messages to English.

Note: the OpenAI API is a paid service, and translating the full dataset takes a long time.

### 1. Import libraries and load datasets

In [1]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine
import os
from openai import OpenAI
import json
import numpy as np
import time
import logging
logger = logging.getLogger(__name__)

# environment settings
pd.set_option('display.max_columns', 400)
pd.set_option('display.max_colwidth', 400)

In [2]:

FILE_LOG = '../data/disaster_response.log'

# activate logging
logging.basicConfig(filename=FILE_LOG, filemode='a', level=logging.INFO)

In [3]:
# load data from database created during ETL pipeline preparation
engine = create_engine('sqlite:///../data/DisasterResponse.db')
conn = engine.connect()
df = pd.read_sql('select * from messages', con=conn, index_col='id')
df.head()

id,message,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
2,Weather update - a cold front from Cuba that could pass over Haiti,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,Is the Hurricane over or is it not over,direct,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
8,Looking for someone but no name,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.,direct,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
12,"says: west side of Haiti, rest of the country today and tonight",direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 25991 entries, 2 to 30265
Data columns (total 38 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   message                 25991 non-null  object
 1   genre                   25991 non-null  object
 2   related                 25991 non-null  int64 
 3   request                 25991 non-null  int64 
 4   offer                   25991 non-null  int64 
 5   aid_related             25991 non-null  int64 
 6   medical_help            25991 non-null  int64 
 7   medical_products        25991 non-null  int64 
 8   search_and_rescue       25991 non-null  int64 
 9   security                25991 non-null  int64 
 10  military                25991 non-null  int64 
 11  child_alone             25991 non-null  int64 
 12  water                   25991 non-null  int64 
 13  food                    25991 non-null  int64 
 14  shelter                 25991 non-null  int64 
 15  clothin

In [91]:
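# try to read messages for which language detection was already executed
# start with an empty frame that has the same index name, so the later merge on 'id' still works
df_language = pd.DataFrame(columns=['is_english', 'translation'],
                           index=pd.Index([], dtype='int64', name='id'))
try:
    df_language = pd.read_sql('select * from message_language',
                              con=conn,
                              index_col='id',
                              dtype={'is_english': 'boolean'})
except Exception as exc:
    # the table does not exist on the first run; continue with the empty frame
    logger.info('message_language could not be read: {}'.format(exc))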

df_language.shape

(7400, 2)
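
The cell above falls back to an empty result when the `message_language` table has not been created yet. A minimal sketch of the same guard, using only the standard-library `sqlite3` module instead of pandas and SQLAlchemy. The helper name `read_processed_ids` and the in-memory database are assumptions for illustration, not part of the notebook:

```python
import sqlite3

def read_processed_ids(conn, table="message_language"):
    """Return the set of message ids stored in `table`, or an empty set if the table is missing."""
    try:
        rows = conn.execute(f"SELECT id FROM {table}").fetchall()
    except sqlite3.OperationalError:
        # sqlite3 raises OperationalError ("no such table") on the first run
        return set()
    return {row[0] for row in rows}

conn_demo = sqlite3.connect(":memory:")
print(read_processed_ids(conn_demo))   # table does not exist yet

conn_demo.execute("CREATE TABLE message_language (id INTEGER, is_english INTEGER, translation TEXT)")
conn_demo.execute("INSERT INTO message_language VALUES (8402, 1, '')")
print(read_processed_ids(conn_demo))   # now contains the stored id
```

Catching only the "table missing" error keeps other problems, such as a wrong database path, visible.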

### 2. Set API key and connect to OpenAI

In [4]:
# Set the API key used to call OpenAI models
openai_api_key = os.environ.get('OPENAI_API_KEY')
client = OpenAI(api_key=openai_api_key)

### 3. Build the API logic in real time

Detect whether each message is in English and, if not, translate it to English.

In [92]:
# Identify messages that have not been checked for language yet
df_remaining = df.merge(df_language, on='id', how='left', indicator=True)
df_remaining = df_remaining[df_remaining['_merge'] == "left_only"]
df_remaining.head()

id,message,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report,is_english,translation,_merge
8402,i would like to know what is the risk now?,direct,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,left_only
8403,Please! give me some informations on the earthquake.,direct,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,,,left_only
8404,what's about the MMS. how is fonctioning?,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,left_only
8405,"i am a victim of the earthquake, i would like to have some informations on all. thanks cause you inform the population!",direct,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,,,left_only
8406,Are we able in area thimo street?,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,left_only


In [90]:
df_remaining.shape

(18591, 41)
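
The merge with `indicator=True` above is a left anti-join: it keeps the messages that have no row in `message_language`. A minimal sketch of an alternative that filters on the index directly, which avoids carrying the empty `is_english`, `translation` and `_merge` columns along. The toy frames `messages` and `checked` are stand-ins invented for this example:

```python
import pandas as pd

# toy stand-ins for df (all messages) and df_language (already checked)
messages = pd.DataFrame({"message": ["a", "b", "c", "d"]},
                        index=pd.Index([1, 2, 3, 4], name="id"))
checked = pd.DataFrame({"is_english": [True, False]},
                       index=pd.Index([1, 2], name="id"))

# keep only the rows whose id does not appear in the checked table
remaining = messages[~messages.index.isin(checked.index)]
print(remaining)
```

Both approaches select the same rows; the merge version is useful when the joined columns are needed for inspection.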

In [94]:
# build API request

system_prompt = """
  You will be provided with text about disaster responses.
  Step 1: Detect if the text is in English. Return the boolean variable isEnglish: 'True' if the sentence is in English or 'False' if it is not.
  Step 2: If the sentence is not in English, translate it to English and return the result as text in json format.

  Example: 'I need food' // isEnglish: True
  Example2: 'Vandaag is het zonnig' // isEnglish: False, Translation: 'Today it is sunny'
  """

In [95]:
# detect whether each text is in English and, if not, translate it (test on the first 2 messages)
mydict = {}

for idx, text in df_remaining['message'][:2].items():

  response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
      {
        "role": "system",
        "content": system_prompt,
      },
      {
        "role": "user",
        "content": text,
        # "content": 'Vandaag is het zonnig'
        # "content": 'I love Wilson and Maya' ,
      }
    ],
    # request options belong at the top level of the call, not inside a message
    response_format={"type": "json_object"},
    temperature=0.1,
    top_p=1
  )

  mydict[idx] = response.choices[0].message.content

print(mydict)

{8402: '{\n  "isEnglish": true\n}', 8403: '{\n  "isEnglish": true\n}'}
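
The reply arrives as a JSON string, so it has to be parsed before the flag can be used. A minimal sketch of a defensive parser, assuming the keys `isEnglish` and `Translation` seen in the prompt examples. The function name `parse_reply` and the fallback value `(None, '')` are choices made for this example:

```python
import json

def parse_reply(content):
    """Return (is_english, translation) from the model's JSON reply, or (None, '') if unusable."""
    try:
        data = json.loads(content)
    except (TypeError, json.JSONDecodeError):
        return None, ""
    if not isinstance(data, dict):
        return None, ""
    flag = data.get("isEnglish")
    if isinstance(flag, str):
        # the prompt mentions 'True'/'False', so the model may answer with a string
        flag = flag.strip().lower() == "true"
    elif not isinstance(flag, bool):
        flag = None
    # key casing is not guaranteed, so accept both spellings
    translation = data.get("Translation") or data.get("translation") or ""
    return flag, translation

print(parse_reply('{\n  "isEnglish": true\n}'))
print(parse_reply('{"isEnglish": "False", "Translation": "Today it is sunny"}'))
print(parse_reply("not json"))
```

Comparing the string against `'true'` matters because `bool('False')` evaluates to `True` in Python.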


### 4. Once we are happy with the API response, kick off translation in batches
Create multiple batch jobs on the OpenAI platform; each one completes within 24 hours.
Due to OpenAI limits, only 90,000 tokens can be processed in a single batch.
Batch API requests are significantly cheaper than real-time requests.

**Important:**
The message id becomes the index and the main identifier of the translation.
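
Two small pieces of this step can be checked without calling the API: how the records are cut into windows, and how the message id travels through `custom_id`. A minimal sketch, where `batch_windows`, `to_custom_id` and `from_custom_id` are hypothetical helper names; unlike a fixed `start`/`end` loop, the window helper also covers a short final window:

```python
def batch_windows(total, interval):
    """Yield (start, stop) slices that cover `total` records, including a short final window."""
    for start in range(0, total, interval):
        yield start, min(start + interval, total)

def to_custom_id(message_id):
    # same "task-<id>" convention as the batch tasks in this notebook
    return f"task-{message_id}"

def from_custom_id(custom_id):
    # the message id is everything after the last hyphen
    return int(custom_id.rsplit("-", 1)[-1])

print(list(batch_windows(1000, 400)))        # [(0, 400), (400, 800), (800, 1000)]
print(from_custom_id(to_custom_id(8402)))    # 8402
```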

In [96]:
# Creating an array of json tasks
start = 0
end = 2000
interval = 400
next = start + interval

while next <= end:
    
    # create an array of json tasks for each batch job
    tasks = []
    for index, text in df_remaining['message'][start:next].items():
    
        task = {
            "custom_id": f"task-{index}",
            "method":"POST",
            "url":"/v1/chat/completions",
            "body": {
              # This is what you would have in your Chat Completions API call
              "model":"gpt-4-turbo",
              "temperature":0.1,
              "response_format": {
                  "type": "json_object"
              },
              "messages": [
                {
                  "role": "system",
                  "content": system_prompt,
                },
                {
                  "role": "user",
                  "content": text,
                },
              ]
            }
          }
        
        tasks.append(task)
        
    # create json file and save it locally
    file_name = '../data/batch_tasks_language_detection.jsonl'
    with open(file_name, 'w') as file:
        for obj in tasks:
            file.write(json.dumps(obj) + '\n') 
            
    # Uploading json file to openai platform (the context manager closes the file handle)
    with open(file_name, 'rb') as file:
        batch_file = client.files.create(
            file=file,
            purpose='batch'
        )
    
    # Creating the batch job on openai
    batch_job = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )    
    
    print('Batch submitted {} for records {}-{}'.format(batch_job.id, start, next))
    logger.info('Batch submitted {} for records {}-{}'.format(batch_job.id, start, next))
    
    # Check status of batch job running on openai platform
    print('Waiting for batch to start, go to sleep 5 minutes')
    time.sleep(300)
    batch_job = client.batches.retrieve(batch_job.id)
    print('Batch {} status {}'.format(batch_job.id, batch_job.status))
    logger.info('Batch {} status {}'.format(batch_job.id, batch_job.status))

    # wait for the batch to complete before kicking off the next one (OpenAI did not allow multiple batches to run in parallel)
    while batch_job.status in ['in_progress', 'validating', 'finalizing']:
        print('Batch {} still running - going to sleep for 5 minutes'.format(batch_job.id))
        logger.info('Batch {} still running - going to sleep for 5 minutes'.format(batch_job.id))
        time.sleep(300)
        batch_job = client.batches.retrieve(batch_job.id)
    
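    # if the batch did not complete, stop instead of resubmitting the same records forever
    if batch_job.status != 'completed':
        logger.error('Batch {} ended with status {}'.format(batch_job.id, batch_job.status))
        break

    # batch completed: set counters to kick off the next batch job of 400 requests
    start = start + interval
    next = next + interval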

Batch submitted batch_Imhp4fh818IA6KjL5qrwfKiY for records 0-400
Waiting for batch to start, go to sleep 5 minutes
Batch batch_Imhp4fh818IA6KjL5qrwfKiY" status in_progress
Batch batch_Imhp4fh818IA6KjL5qrwfKiY still running - going to sleep for 5 minutes
Batch batch_Imhp4fh818IA6KjL5qrwfKiY still running - going to sleep for 5 minutes
Batch batch_Imhp4fh818IA6KjL5qrwfKiY still running - going to sleep for 5 minutes
Batch submitted batch_M9BDpbaETfirAieZCPVJFV1U for records 400-800
Waiting for batch to start, go to sleep 5 minutes
Batch batch_M9BDpbaETfirAieZCPVJFV1U" status validating
Batch batch_M9BDpbaETfirAieZCPVJFV1U still running - going to sleep for 5 minutes
Batch batch_M9BDpbaETfirAieZCPVJFV1U still running - going to sleep for 5 minutes
Batch batch_M9BDpbaETfirAieZCPVJFV1U still running - going to sleep for 5 minutes
Batch batch_M9BDpbaETfirAieZCPVJFV1U still running - going to sleep for 5 minutes
Batch batch_M9BDpbaETfirAieZCPVJFV1U still running - going to sleep for 5 minutes

KeyboardInterrupt: 
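
The wait loop above only ends through a status change or a manual interrupt. A minimal sketch of a polling helper with an upper bound on the number of polls; the status source is passed in as a callable so the logic can be tried without the API. The name `wait_for_batch`, the `TERMINAL` set and the default of 288 polls (24 hours at 5 minutes) are assumptions for this example:

```python
import time

# statuses after which a batch no longer changes (assumed from the Batch API documentation)
TERMINAL = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(fetch_status, poll_seconds=300, max_polls=288, sleep=time.sleep):
    """Poll `fetch_status()` until a terminal status is returned or `max_polls` is reached."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL:
            return status
        sleep(poll_seconds)
    return "timeout"

# dry run with a fake status source and no real sleeping
fake_statuses = iter(["validating", "in_progress", "finalizing", "completed"])
print(wait_for_batch(lambda: next(fake_statuses), sleep=lambda seconds: None))
```

In this notebook the callable could be `lambda: client.batches.retrieve(batch_job.id).status`.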

### 5. Load and Analyze API results

In [69]:
# get all batch jobs submitted (up to 50)
batch_jobs = client.batches.list(limit=50)
print('Number of batch jobs retrieved: {}'.format(len(batch_jobs.data)))

Number of batch jobs retrieved: 38


In [70]:
# select batches to process: completed, no failed requests, more than 10 requests
batches = []
for batch in batch_jobs.data:
    if (batch.status == 'completed'
            and batch.request_counts.failed == 0
            and batch.request_counts.total > 10):
        batches.append(batch.id)
        print(batch.id, batch.status, batch.request_counts.completed, batch.request_counts.failed)

batch_2wmLLLqQ3nbO1ZMRAC6CwTUO completed 400 0
batch_Y9hCYBUvywsgXQoXElUjDQdG completed 400 0
batch_8VCSy1nlUHmindP0ckvPMoho completed 400 0
batch_4YKT54RLg499gRNq1VpxSBo8 completed 400 0
batch_GPsZnrwAKDWoUjMGc2DSXXEJ completed 400 0
batch_BjMJ9PULO5B0655woDVpFslG completed 400 0
batch_iwqth7dekh1soFXPXnVA8CCb completed 400 0
batch_uwBMs2PlUGnsrKxyFmj4AlEw completed 400 0
batch_Kb2dneP8T29DCjmar5bmfNhn completed 400 0
batch_RAkUvY7A7NrdPGm9vLJHOyNw completed 400 0
batch_Iq8pgeY9m18NDwdhnNUAxpgM completed 400 0
batch_TL3SgoBTml4UDQwGEUI97yFy completed 400 0
batch_Rej5cNzg3s3FCtQflGaqKCmM completed 400 0
batch_YUOxLO01Ny4jzPPMx3Ein1xe completed 400 0
batch_crNgtuzYlK5mg9uGEebAJmjW completed 400 0
batch_zCZ4m0VAz84739yYtGfFqzXJ completed 400 0
batch_ArS1VoOwV1mIc2HubhZcOr6S completed 400 0
batch_ka3NBYVcss7JKMnUgtZ4nPTe completed 300 0
batch_rSDdkrv1MVxcidR8122xh7Bd completed 300 0
batch_2wmLLLqQ3nbO1ZMRAC6CwTUO completed 400 0
batch_Y9hCYBUvywsgXQoXElUjDQdG completed 400 0
batch_8VCSy1n

In [71]:
# Writing API results locally as a JSON Lines file
result_file_name = '../data/batch_job_results.jsonl'

# opening in 'wb' mode clears the file if it exists; then write the contents of all batches
with open(result_file_name, 'wb') as file:
    for batch in batches:
        batch_job = client.batches.retrieve(batch)
        result = client.files.content(batch_job.output_file_id).content
        file.write(result)

In [72]:
# Loading json api data from locally saved file
results = []
with open(result_file_name, 'r') as file:
    for line in file:
        # Parsing the JSON string into a dict and appending to the list of results
        json_object = json.loads(line.replace('\n', '').strip())
        results.append(json_object)
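
Because the results file is a concatenation of several batch outputs, a blank line in it would make `json.loads` raise. A minimal sketch of a reader that skips blank lines and reports the line number of any malformed record. The helper name `read_jsonl` and the in-memory sample are assumptions for illustration:

```python
import io
import json

def read_jsonl(handle):
    """Parse one JSON object per line from a text handle, skipping blank lines."""
    records = []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON") from exc
    return records

sample = io.StringIO('{"custom_id": "task-1"}\n\n{"custom_id": "task-2"}\n')
print(read_jsonl(sample))
```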

In [73]:
# Print a few results to check that the API worked
for res in results[100:106]:
    task_id = res['custom_id']
    # Getting the message id from the task id
    index = task_id.split('-')[-1]
    result = res['response']['body']['choices'][0]['message']['content']
    # look up the original message for comparison
    row = df.loc[int(index)]
    description = row['message']
    print(f"\nMESSAGE: {index}-{description}\n\nRESULT: {result}")
    print("\n----------------------------\n")


MESSAGE: 8084-Darling, I love you even if you once had forgot, you make me suffer I do not relax so that you understand me. Thank you 

RESULT: {
  "isEnglish": true
}

----------------------------


MESSAGE: 8085-what must we do when don't feel good we'd like to get about of our mind 

RESULT: {
  "isEnglish": true
}

----------------------------


MESSAGE: 8086-Need help in Corail department Grand'Anse. 

RESULT: {
  "isEnglish": true
}

----------------------------


MESSAGE: 8087-Can you give me a stock market of studies in the communication domain? While hoping your response, receive my greetings the better ones 

RESULT: {
  "isEnglish": true
}

----------------------------


MESSAGE: 8088-Information requiere about of the earthquake. 

RESULT: {
  "isEnglish": true
}

----------------------------


MESSAGE: 8089-I'm hungry,I don't have food to eat,I don't have home,clothes, I count on your supports and help,thank you so much. 

RESULT: {
  "isEnglish": true
}

-----------------

### 6. Load API results into dataframe
The API returns two values we are interested in:
1) A Boolean variable isEnglish that indicates whether the message is in English
2) A translation in English if the message was in another language

Save each of these values in a separate dictionary and create a dataframe with 2 columns:
- is_english
- translation
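
Each result record nests the model reply several levels deep. A minimal sketch of pulling out the message id and the raw reply while tolerating records that carry no usable response. The helper name `record_to_content` and the shape of the failing record are assumptions; only the `custom_id` and `response.body.choices[0].message.content` path is taken from this notebook:

```python
def record_to_content(record):
    """Return (message_id, content) for one batch result record; content is None when missing."""
    # the message id is everything after the last hyphen of "task-<id>"
    message_id = int(record["custom_id"].rsplit("-", 1)[-1])
    try:
        content = record["response"]["body"]["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError):
        # e.g. "response" is null or the body has no choices
        content = None
    return message_id, content

ok = {"custom_id": "task-8084",
      "response": {"body": {"choices": [{"message": {"content": '{"isEnglish": true}'}}]}}}
bad = {"custom_id": "task-8085", "response": None}
print(record_to_content(ok))
print(record_to_content(bad))
```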

In [74]:
# Load all responses into dictionaries keyed by message id
isEnglishs = {}
translatedTexts = {}

for res in results:
    task_id = res['custom_id']
    # Get unique message id from task id
    index = int(task_id.split('-')[-1])
    # get response content and strip newline and tab characters
    result = res['response']['body']['choices'][0]['message']['content']
    result = result.replace('\n', '').replace('\t', '').strip()

    # reset per message so a failed parse cannot reuse the previous message's result
    dict_object = {}
    try:
        dict_object = json.loads(result)
    except json.JSONDecodeError:
        logger.warning('Could not parse response for message {}'.format(index))

    # default to '' when a key is missing
    isEnglish = dict_object.get('isEnglish', '')
    if isinstance(isEnglish, str) and isEnglish:
        # bool('False') would be True, so compare the text instead
        isEnglish = isEnglish.strip().lower() == 'true'
    translation = dict_object.get('Translation', '')

    isEnglishs[index] = isEnglish
    translatedTexts[index] = translation
    

In [75]:
# create dataframe
data = {'is_english': isEnglishs,
        'translation': translatedTexts}

df_translation = pd.DataFrame.from_dict(data,
                                         orient='columns',
                                         )
df_translation.index.name = 'id'
df_translation.head()

id,is_english,translation
7976,True,
7977,True,
7978,True,
7979,True,
7980,True,


In [76]:
df_translation.shape

(7400, 2)

In [77]:
# Add message for analysis
df_translated_tmp = df_translation.merge(df_remaining[['message']], on='id', how='inner')
df_translated_tmp.head()

id,is_english,translation,message
7976,True,,"MAY GOD BLESS HAITI,CHILY AND CHINA. THANK'S"
7977,True,,"We in Canada turjo quote, we need food, water and tents. count on your participation"
7978,True,,Thank you for all the information you gave me.
7979,True,,"We geet,the organisation for the good work for all haitian people:We that living in la sous tijo,for the city of canada we need some aids i think you could."
7980,True,,FLOOD AT CAYES. HELP US EMERGENCY


In [78]:
# how many texts were not in English?
print('Messages not in English:', len(df_translated_tmp[df_translated_tmp.is_english == False]))
df_translated_tmp[df_translated_tmp.is_english == False][:10]

Messages not in English: 71


id,is_english,translation,message
8067,False,Don't you know when the faculty of human sciences will be reopening? When will they start to bring the students back to school?,Don't you know when fakilte syans imen (FASCH) will be reoperning? when will they start to do the students back to school?
8099,False,I need to live.com flags18,R judesan2live.com banderas18
7578,False,"Signeneau, we didn't get anything at all. I would like to know if people in Signeneau don't (...)","Signeneau, we didn't get anything at all. I would like to know if people in Signeneau don't (..)"
7652,False,There is need at Tabarre 52 B # 20 help,There is besion of l' has tabarre 52 B # 20 helps
7700,False,I understand,paklascencion##s
7768,False,The Creole sentences are not written very well.,The Creole sentences do not write very well.
7874,False,The provided text appears to contain non-English characters or symbols that do not form a coherent sentence in any language.,What do those words mean:siw ba y#?
7260,False,Help them when it is not white that you give us will find it there if needed for children in the tent yet found prela neret.,help them when it is not white that you give us will find it there if needed for children in the tent yet found prela neret.
7369,False,Ms. Coq Philomene: Desermithe (Petion-Ville),Mme Coq Philomene: Desermithe (Ption-Ville)
7397,False,"Do something for us NGO, answer","Make something for us ONG, answer"


In [79]:
# drop duplicated translations if any exist
df_translation = df_translation[~df_translation.index.duplicated(keep='first')]
df_translation.shape

(7400, 2)

### 7. Save the dataset to the SQLite database

In [80]:
df_translation.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7400 entries, 7976 to 369
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   is_english   7400 non-null   bool  
 1   translation  7400 non-null   object
dtypes: bool(1), object(1)
memory usage: 380.9+ KB


In [82]:
# make sure is_english contains only the values True and False
print(df_translation.is_english.unique())

[ True False]


In [83]:
# save to csv file
df_translation.to_csv('../data/translations.csv')

In [84]:
# add to existing sqlite database
df_translation.to_sql('message_language', engine, index=True, if_exists='replace')

7400