# Language Detection Pipeline Preparation

During ETL pipeline preparation, we found that some messages were not in English.
This notebook experiments with ChatGPT to detect the language of each message and translate non-English messages to English.

Note: ChatGPT is a paid service, and translating the messages takes a long time.

### 1. Import libraries and load datasets

In [162]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine
import os
from openai import OpenAI
import json
import numpy as np
import time

# environment settings
pd.set_option('display.max_columns', 400)
pd.set_option('display.max_colwidth', 400)

In [2]:
# load data from database created during ETL pipeline preparation
engine = create_engine('sqlite:///../data/DisasterResponse.db')
conn = engine.connect()
df = pd.read_sql('select * from messages', con=conn, index_col='id')
df.head()

id,message,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
2,Weather update - a cold front from Cuba that could pass over Haiti,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,Is the Hurricane over or is it not over,direct,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
8,Looking for someone but no name,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.,direct,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
12,"says: west side of Haiti, rest of the country today and tonight",direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 25991 entries, 2 to 30265
Data columns (total 38 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   message                 25991 non-null  object
 1   genre                   25991 non-null  object
 2   related                 25991 non-null  int64 
 3   request                 25991 non-null  int64 
 4   offer                   25991 non-null  int64 
 5   aid_related             25991 non-null  int64 
 6   medical_help            25991 non-null  int64 
 7   medical_products        25991 non-null  int64 
 8   search_and_rescue       25991 non-null  int64 
 9   security                25991 non-null  int64 
 10  military                25991 non-null  int64 
 11  child_alone             25991 non-null  int64 
 12  water                   25991 non-null  int64 
 13  food                    25991 non-null  int64 
 14  shelter                 25991 non-null  int64 
 15  clothin

### 2. Set API key and connect to OpenAI

In [4]:
# Set the API key used to access the OpenAI models
openai_api_key = os.environ.get('OPENAI_API_KEY')
client = OpenAI(api_key=openai_api_key)
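
If `OPENAI_API_KEY` is not set, `os.environ.get` returns `None` and the problem only surfaces later. A minimal sketch of an explicit check follows; the helper `require_env` and the variable name `DEMO_API_KEY` are hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical variable name, used only to demonstrate the helper.
os.environ["DEMO_API_KEY"] = "sk-demo"
demo_key = require_env("DEMO_API_KEY")
```

In the notebook this could be used as `OpenAI(api_key=require_env('OPENAI_API_KEY'))`.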

### 3. Build the API logic in real time

Detect whether each message is in English and, if not, translate it to English.

In [6]:
# build the API request

system_prompt = """
  You will be provided with text about disaster responses.
  Step 1: Detect if the text is in English. Return 'True' if the sentence is in English or 'False' if it is not, as the isEnglish boolean variable.
  Step 2: If the sentence is not in English, translate it to English and return the result as text in json format.

  Example: 'I need food' // isEnglish: True
  Example2: 'Vandaag is het zonnig' // isEnglish: False, Translation: 'Today it is sunny'
  """

In [7]:
# detect whether each text is in English and, if not, translate it
mydict = {}

for idx, text in df['message'][:2].items():

  response = client.chat.completions.create(
    model="gpt-4-turbo",
    # request-level options belong here, not inside a message dict
    response_format={"type": "json_object"},
    messages=[
      {
        "role": "system",
        "content": system_prompt,
      },
      {
        "role": "user",
        "content": text,
        # "content": 'Vandaag is het zonnig'
        # "content": 'I love Wilson and Maya',
      }
    ],
    temperature=0.1,
    top_p=1
  )

  mydict[idx] = response.choices[0].message.content
  
print(mydict)

{2: '{\n  "isEnglish": true\n}', 7: '{\n  "isEnglish": true\n}'}
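
The values printed above are JSON strings, not Python objects. A minimal sketch of turning them into usable values follows; `parse_detection` is a hypothetical helper, and the keys `isEnglish` and `Translation` are assumed to match the names used in the system prompt.

```python
import json


def parse_detection(content: str) -> tuple:
    """Parse the model's JSON reply into (is_english, translation)."""
    try:
        payload = json.loads(content)
    except json.JSONDecodeError:
        return None, ""
    if not isinstance(payload, dict):
        return None, ""
    # 'Translation' is only expected when the message was not in English
    return payload.get("isEnglish"), payload.get("Translation", "")


# The two replies shown in the output above
sample = {2: '{\n  "isEnglish": true\n}', 7: '{\n  "isEnglish": true\n}'}
parsed = {idx: parse_detection(raw) for idx, raw in sample.items()}
```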


### 4. Once we are happy with the API response, kick off the translation as a single batch
This creates a batch job on the OpenAI platform, which completes within 24 hours.
Batch API requests are also significantly cheaper than real-time requests.

**Important:**
The message id becomes the index and the main identifier of the translation.
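
Because the message id is the only link between a batch result and its source row, the mapping to and from `custom_id` has to round-trip exactly. A minimal sketch, assuming the `task-<id>` format used in this notebook; the helper names `to_custom_id` and `from_custom_id` are hypothetical.

```python
def to_custom_id(message_id: int) -> str:
    """Build the batch custom_id for a message id."""
    return f"task-{message_id}"


def from_custom_id(custom_id: str) -> int:
    """Recover the message id from a custom_id, rejecting unexpected formats."""
    prefix, _, raw_id = custom_id.rpartition("-")
    if prefix != "task" or not raw_id.isdigit():
        raise ValueError(f"Unexpected custom_id: {custom_id!r}")
    return int(raw_id)


round_trip = from_custom_id(to_custom_id(1336))
```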

In [99]:
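# Creating an array of json tasks
# NOTE: the slice bounds are not shown in this notebook; the values below are placeholders
start, stop = 0, 400
tasks = []
for index, text in df['message'][start:stop].items():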

    task = {
        "custom_id": f"task-{index}",
        "method":"POST",
        "url":"/v1/chat/completions",
        "body": {
          # This is what you would have in your Chat Completions API call
          "model":"gpt-4-turbo",
          "temperature":0.1,
          "response_format": {
              "type": "json_object"
          },
          "messages": [
            {
              "role": "system",
              "content": system_prompt,
            },
            {
              "role": "user",
              "content": text,
            },
          ]
        }
      }
    
    tasks.append(task)

In [101]:
# create json file and save it locally
file_name = '../data/batch_tasks_language_detection.jsonl'
with open(file_name, 'w') as file:
    for obj in tasks:
        file.write(json.dumps(obj) + '\n')
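
Before uploading, it can be worth checking that every line of the file parses as JSON and that no `custom_id` repeats. A minimal sketch using only the standard library; `check_jsonl` is a hypothetical helper and the demo tasks are placeholders, not real batch requests.

```python
import json
import os
import tempfile


def check_jsonl(path: str) -> int:
    """Return the number of tasks, raising if a line is invalid or an id repeats."""
    seen = set()
    with open(path, "r", encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            task = json.loads(line)  # raises json.JSONDecodeError on a bad line
            custom_id = task["custom_id"]
            if custom_id in seen:
                raise ValueError(f"line {line_no}: duplicate custom_id {custom_id}")
            seen.add(custom_id)
    return len(seen)


# Demo on a throwaway file with placeholder tasks
demo_tasks = [{"custom_id": f"task-{i}", "method": "POST"} for i in (2, 7, 8)]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    for task in demo_tasks:
        tmp.write(json.dumps(task) + "\n")
task_count = check_jsonl(tmp.name)
os.remove(tmp.name)
```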

In [102]:
# Uploading json file to openai platform
with open(file_name, 'rb') as file:
    batch_file = client.files.create(
        file=file,
        purpose='batch'
    )

print(batch_file)

FileObject(id='file-mDoZYSaqtT50V0uVbB5WxSUc', bytes=329080, created_at=1715012424, filename='batch_tasks_language_detection.jsonl', object='file', purpose='batch', status='processed', status_details=None)


In [103]:
# Creating the batch job on openai
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

### 5. Load and Analyze API results

In [160]:
# Check status of batch job running on openai platform
# batch_job = client.batches.retrieve('batch_crNgtuzYlK5mg9uGEebAJmjW')
batch_job = client.batches.retrieve(batch_job.id)
print(batch_job.status)
print(batch_job)

in_progress
Batch(id='batch_crNgtuzYlK5mg9uGEebAJmjW', completion_window='24h', created_at=1715019010, endpoint='/v1/chat/completions', input_file_id='file-RnlLMvsjfChVTf5c3CCn4VzD', object='batch', status='in_progress', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1715105410, failed_at=None, finalizing_at=None, in_progress_at=1715019057, metadata=None, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=400))
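
Instead of re-running the status cell by hand, the job can be polled until it reaches a terminal status. A minimal sketch follows; `wait_for_batch` is a hypothetical helper, the polling interval is an arbitrary choice, and the demo uses a fake `retrieve` function so it runs without network access.

```python
import time
from types import SimpleNamespace

# Statuses after which the batch will not change any further
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(retrieve, batch_id, poll_seconds=60.0, max_polls=1440):
    """Call retrieve(batch_id) until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = retrieve(batch_id)
        if job.status in TERMINAL_STATUSES:
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"Batch {batch_id} did not finish after {max_polls} polls")


# Fake retrieve function standing in for the real API client
_statuses = iter(["validating", "in_progress", "completed"])


def fake_retrieve(batch_id):
    return SimpleNamespace(id=batch_id, status=next(_statuses))


final_job = wait_for_batch(fake_retrieve, "batch_demo", poll_seconds=0)
```

With the real client this would be called as `wait_for_batch(client.batches.retrieve, batch_job.id)`.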


In [61]:
# Once the batch job is completed, retrieve the results: get the output or error file id and its contents
if batch_job.output_file_id is not None:
    file_id = batch_job.output_file_id
elif batch_job.error_file_id is not None:
    file_id = batch_job.error_file_id
else:
    raise RuntimeError(f"Batch {batch_job.id} has no output or error file yet (status: {batch_job.status})")
result = client.files.content(file_id).content

In [62]:
# Writing api results locally as json file
result_file_name = '../data/batch_job_results.jsonl'

with open(result_file_name, 'wb') as file:
    file.write(result)

In [144]:
# Loading json api data from locally saved file
results = []
with open(result_file_name, 'r') as file:
    for line in file:
        # Parsing the JSON string into a dict and appending to the list of results
        json_object = json.loads(line.replace('\n', '').strip())
        results.append(json_object)

In [145]:
# Reading only a small sample of results to test whether the API worked
for res in results[100:106]:
    task_id = res['custom_id']
    # Getting index from task id
    index = task_id.split('-')[-1]
    result = res['response']['body']['choices'][0]['message']['content']
    row = df.loc[int(index)]
    description = row['message']
    print(f"\nMESSAGE: {index}-{description}\n\nRESULT: {result}")
    print("\n----------------------------\n")


MESSAGE: 1336-Would there be a big response in 30 minutes again?

RESULT: {
  "isEnglish": true
}

----------------------------


MESSAGE: 1337-Food distribution did not reach our area. We live in Paloma, rue Lemoine. .. please

RESULT: {
  "isEnglish": true
}

----------------------------


MESSAGE: 1338-Is it true that there will be more earthquake tonight

RESULT: {
  "isEnglish": true
}

----------------------------


MESSAGE: 1339-My father's work is destroyed and there are many of us in the house. We were destitute before the catastrophe. My mom never worked. Please help us.

RESULT: {
  "isEnglish": true
}

----------------------------


MESSAGE: 1340-4636 please give me information about schools and universities functioning in Port au Prince. In what year do you for see reopening. I'll wait for your response.

RESULT: {
  "isEnglish": true
}

----------------------------


MESSAGE: 1341-Can a citizen take steps to immigration to find a family if he/she is older than 18?

RESUL

### 6. Load API results into dataframe
The API returns two fields we are interested in:
1) A Boolean variable isEnglish indicating whether the message is in English
2) A translation in English if the message was in another language

Save each of these fields in a separate dictionary and create a dataframe with 2 columns:
- is_english
- translation
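
Each line of the results file nests the model's reply several levels deep. A minimal sketch of extracting it defensively follows, assuming the record shape this notebook indexes into (`response` → `body` → `choices`) and assuming that a failed request may carry no usable `response`; `extract_content` and both sample records are hypothetical.

```python
from typing import Optional


def extract_content(record: dict) -> Optional[str]:
    """Return the assistant message text from one batch output record, or None."""
    response = record.get("response") or {}
    body = response.get("body") or {}
    choices = body.get("choices") or []
    if not choices:
        return None
    return choices[0].get("message", {}).get("content")


# Hypothetical records illustrating a successful and a failed request
ok_record = {
    "custom_id": "task-1336",
    "response": {"body": {"choices": [{"message": {"content": '{"isEnglish": true}'}}]}},
}
failed_record = {"custom_id": "task-1337", "response": None}

ok_content = extract_content(ok_record)
failed_content = extract_content(failed_record)
```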

In [146]:
# Load all responses into dictionaries
isEnglishs = {}
translatedTexts = {}

for res in results:
    task_id = res['custom_id']
    # Get unique message id from task id
    index = int(task_id.split('-')[-1])
    # get response content and strip new line and tab characters
    result = res['response']['body']['choices'][0]['message']['content']
    result = result.replace('\n', '').replace('\t', '').strip()

    # parse the JSON reply; fall back to an empty dict so that values from a
    # previous iteration are never reused when parsing fails
    try:
        dict_object = json.loads(result)
    except json.JSONDecodeError:
        dict_object = {}

    isEnglishs[index] = dict_object.get('isEnglish', '')
    translatedTexts[index] = dict_object.get('Translation', '')
    

In [147]:
# create dataframe
data = {'is_english': isEnglishs,
        'translation': translatedTexts}

df_translation = pd.DataFrame.from_dict(data,
                                         orient='columns',
                                         )
df_translation.index.name = 'id'
df_translation

id,is_english,translation
1216,True,
1217,True,
1218,True,
1219,True,
1220,True,
...,...,...
362,True,
363,True,
367,True,
368,True,


In [148]:
# Add message for analysis
df_translated_tmp = df_translation.merge(df[['message']], on='id')
df_translated_tmp.head()

id,is_english,translation,message
1216,True,,but I'm going to pray with faith
1217,True,,If someone do not have access to internet Where can he bring his CV
1218,True,,I thank you very much for your help. I want you to know that I am not from Port-au-Prince. I am from the city Des Cayes. I will give you my resume? very..
1219,True,,"Good Day, I am happy your desk accepted my message. I am a student, I would like to get a job, I am in misery"
1220,True,,We are near Savanne on the road to Jacmel. We need help we don't have food or water


In [149]:
# how many texts were not in English?
not_english = df_translated_tmp[df_translated_tmp.is_english == False]
print('Messages not in English:', len(not_english))
not_english[:10]

Messages not in English: 20


id,is_english,translation,message
1244,False,,"I'm finished with my school. I was born June 29, 1980. I am a ??? ve ki be"
1280,False,I am in the small town of Gree in session. There's a bunch of people dying.,I am in commune gree in session peitit bouquin. There's a bunch of people dying.
1342,False,"Things aren't good at all we ask you to send something Boulevard Royal Charles number 17, we need tents, food and medicine","Things aren't good at all we as you to send something B?l?s riy?l Charles no 17, we need tents, food and medicine"
1567,False,Can I find a job if I only passed level 1 of Business Management?,Can i find a job if i only passed level 1 of Gestion des Affaires (Business Gestion)
1569,False,"In the name of the youth in action from Croix des Missions, we are requesting...","in the name of the young in action from Croix des missions, we are sollic . .."
1584,False,"Okay everyone who is a victim. Peace to the people who have authority, they have organizations... who come to visit, bring what we need: water, medicine, shelter, electricity, send us a small message. Contact persons of Authority we need water, medications, all electricity, send small message.","ok tout le monde qui victime. paix pager gens y ont autority, y ont organisation. .. qui entre viens visiter emmener ce que nous avons besoin de l'eau, medicament, tant electricity, nous envoyer petit message. Contact persons of Authority we need water, medications, all electricity, send small message"
1605,False,I would like to know what you are saying for military site?,i would like to know what you are saying for cite militaire?
796,False,"we need food, water, toilets and security forces need to be present when the distribution of goods happens. We are on the Plaza in Canape Vert ( Place Canape Vert ). Please bring tents. Please save us. We need food, water, toilets. Please come with security forces to ensure the smooth running of the distribution of these items. We are on the Canape Vert square. also bring tents. Please help us.","we need food, water, toilets and security forces need to be present when the distribution of goods happens. We are on the Plaza in Canape Vert ( Place Canape Vert ). Please bring tents. Please save us. ( FRENCH ) Nous avons besoin de nourriture, d'eau, de WC. SVP venez avec les forces de securite pour garantir le bon droulement de la distribution de ces items. Nous sommes sur la place du Canap..."
869,False,Message received. Message received. Your message has gone to the INBOX MESSAGES.,Message received. Message recu. Mesaj ou ale BERIBHTEN INBOX
992,False,"Hello. .. I need to go to Canada, I only have my passport. How can you help me please? Thank you.","Bonjour. .. i need to go to canada, i only have my passport. How can you help me please. ? thank you"


In [154]:
# drop duplicated translations if they exist
df_translation = df_translation[~df_translation.index.duplicated(keep='first')]
df_translation.shape

(1400, 2)

### 7. Save the dataset to the SQLite database

In [155]:
df_translation.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1400 entries, 1216 to 369
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   is_english   1400 non-null   bool  
 1   translation  1400 non-null   object
dtypes: bool(1), object(1)
memory usage: 55.5+ KB


In [156]:
# save to csv file
df_translation.to_csv('../data/translations.csv')

In [157]:
# add to existing sqlite database
df_translation.to_sql('message_language', engine, index=True, if_exists='replace')

1400
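
After saving, the table can be read back to confirm the row count and the number of non-English messages. A minimal sketch with the standard-library `sqlite3` module, run against an in-memory database so it does not touch `DisasterResponse.db`; the column layout and the storage of `is_english` as 0/1 are assumptions about how the dataframe is written.

```python
import sqlite3

# In-memory stand-in for the message_language table (assumed layout)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE message_language (id INTEGER, is_english INTEGER, translation TEXT)"
)
demo_conn.executemany(
    "INSERT INTO message_language VALUES (?, ?, ?)",
    [
        (1216, 1, ""),
        (1217, 1, ""),
        (1605, 0, "I would like to know what you are saying for military site?"),
    ],
)

# Count all rows and the rows flagged as not English
total, non_english = demo_conn.execute(
    "SELECT COUNT(*), SUM(is_english = 0) FROM message_language"
).fetchone()
demo_conn.close()
```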