# Baseline Solution: Fake or Real - The Impostor Hunt in Texts 🔍

---

Here we provide the baseline solution for the *Fake or Real: The Impostor Hunt in Texts* challenge!
In this notebook, we walk you through two **simple, interpretable, and ML-free approaches** to tackle the problem of detecting fake texts.

### 💡 Overview of the first approach:

We use the `langdetect` library to estimate how much of each text is written in **English vs. other languages**. Here's the idea:

1. **Detect Language**: We split the text into chunks of 10 words and detect the language of each chunk.
2. **Calculate Proportion**: We then compute the fraction of chunks detected as English in the entire text.
3. **Assign Label**: The text with the higher English fraction is classified as **Real**, and its number (1 or 2) is saved to the results list.
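
The three steps above can be sketched without the `langdetect` dependency by passing the detector in as a function. This is a minimal sketch under stated assumptions, not the notebook's exact implementation: `english_fraction`, `pick_real`, and `ascii_detector` are hypothetical names introduced for illustration, and the toy detector simply treats pure-ASCII chunks as English. In the notebook itself the detector is `langdetect.detect`.

```python
def english_fraction(text, detect_fn, chunk_size=10):
    """Return the share of word chunks that detect_fn labels as 'en'."""
    words = text.split()
    if not words:
        return 0.0
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    english = 0
    for chunk in chunks:
        try:
            if detect_fn(chunk) == "en":
                english += 1
        except Exception:
            # A chunk the detector cannot handle counts as non-English.
            pass
    return english / len(chunks)


def pick_real(text_1, text_2, detect_fn):
    """Return 1 if text_1 scores strictly higher, otherwise 2 (ties go to 2)."""
    score_1 = english_fraction(text_1, detect_fn)
    score_2 = english_fraction(text_2, detect_fn)
    return 1 if score_1 > score_2 else 2


def ascii_detector(chunk):
    """Toy stand-in for a real language detector (assumption for this demo)."""
    return "en" if chunk.isascii() else "other"


print(pick_real("the telescope observed a distant galaxy",
                "телескоп наблюдал далёкую галактику",
                ascii_detector))  # prints 1
```

Note that a tie between the two scores falls through to label 2, which mirrors the strict `>` comparison used later in the notebook.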

---

### 📦 Getting Started: Install & Import Required Packages


In [1]:
!pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993223 sha256=20d6d2dcb9e60d636ba897b2329505fe1d84d3f1c7ff1646a9424f99da193ee7
  Stored in directory: /root/.cache/pip/wheels/0a/f2/b2/e5ca405801e05eb7c8ed5b3b4bcf1fcabcd6272c167640072e
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [2]:
import os
import pandas as pd
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException
import unicodedata

import string
from sklearn.metrics import accuracy_score
import numpy as np
DetectorFactory.seed = 42

---

### 📄 Load the Data

Now, let's load the data into memory for exploration and processing.

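Each sample is a directory containing two files, `file_1.txt` and `file_2.txt`. We'll use `Pandas` to collect these pairs into a DataFrame indexed by sample `id`, which allows for easy data manipulation and analysis throughout the notebook.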
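
The loading step can be sketched with the standard library alone. This is a minimal sketch, assuming one sub-folder per sample whose name ends in a four-digit id; the helper `read_pairs` and the folder names `article_0000` and `article_0001` are hypothetical, and the sketch returns plain tuples instead of a DataFrame.

```python
import tempfile
from pathlib import Path


def read_pairs(dir_path):
    """Collect (id, text_1, text_2) tuples from every sample sub-folder."""
    rows = []
    for folder in sorted(Path(dir_path).iterdir()):
        if not folder.is_dir():
            continue
        try:
            text_1 = (folder / "file_1.txt").read_text(encoding="utf-8").strip()
            text_2 = (folder / "file_2.txt").read_text(encoding="utf-8").strip()
            rows.append((int(folder.name[-4:]), text_1, text_2))
        except (OSError, ValueError) as err:
            # Skip folders that are incomplete or not named like a sample.
            print(f"Skipping {folder.name}: {err}")
    return rows


# Build a tiny throwaway dataset to show the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    for idx, (a, b) in enumerate([("real text", "fake text"), ("foo", "bar")]):
        sample = Path(tmp) / f"article_{idx:04d}"
        sample.mkdir()
        (sample / "file_1.txt").write_text(a + "\n", encoding="utf-8")
        (sample / "file_2.txt").write_text(b + "\n", encoding="utf-8")
    rows = read_pairs(tmp)

print(rows)
```

Appending rows as they are read, rather than pre-allocating a list sized by a directory count, keeps the result consistent when a folder is skipped.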

In [3]:
def read_texts_from_dir(dir_path):
  """
  Reads the texts from a given directory and saves them in the pd.DataFrame with columns ['id', 'file_1', 'file_2'].

  Params:
    dir_path (str): path to the directory with data
  """
  # Count number of directories in the provided path
  dir_count = sum(os.path.isdir(os.path.join(root, d)) for root, dirs, _ in os.walk(dir_path) for d in dirs)
  data=[]
  print(f"Number of directories: {dir_count}")

  # For each directory, read both file_1.txt and file_2.txt and append the result to the list,
  # so that a directory which fails to load leaves no placeholder row behind
  for folder_name in sorted(os.listdir(dir_path)):
    folder_path = os.path.join(dir_path, folder_name)
    if os.path.isdir(folder_path):
      try:
        with open(os.path.join(folder_path, 'file_1.txt'), 'r', encoding='utf-8') as f1:
          text1 = f1.read().strip()
        with open(os.path.join(folder_path, 'file_2.txt'), 'r', encoding='utf-8') as f2:
          text2 = f2.read().strip()
        index = int(folder_name[-4:])
        data.append((index, text1, text2))
      except Exception as e:
        print(f"Error reading directory {folder_name}: {e}")

  # Change list with results into pandas DataFrame
  df = pd.DataFrame(data, columns=['id', 'file_1', 'file_2']).set_index('id')
  return df

In [4]:
# Use the above function to load both train and test data
train_path="/kaggle/input/fake-or-real-impostor-hunt/data/train"
df_train=read_texts_from_dir(train_path)
test_path="/kaggle/input/fake-or-real-impostor-hunt/data/test"
df_test=read_texts_from_dir(test_path)

Number of directories: 95
Number of directories: 1068


In [5]:
df_train.head()

| id | file_1 | file_2 |
|---|---|---|
| 0 | The VIRSA (Visible Infrared Survey Telescope A... | The China relay network has released a signifi... |
| 1 | China\nThe goal of this project involves achie... | The project aims to achieve an accuracy level ... |
| 2 | Scientists can learn about how galaxies form a... | Dinosaur eggshells offer clues about what dino... |
| 3 | China\nThe study suggests that multiple star s... | The importance for understanding how stars evo... |
| 4 | Dinosaur Rex was excited about his new toy set... | Analyzing how fast stars rotate within a galax... |


In [6]:
df_test.head()

| id | file_1 | file_2 |
|---|---|---|
| 0 | "Music" Music music music Music music Music mu... | Since its launch on Paranal observatory's Very... |
| 1 | underground exploration on SN's birth has prov... | SN 1987A provides valuable insights as newer o... |
| 2 | This research aimed to understand how star sha... | ChromeDriver music player\nThis study focused ... |
| 3 | Using OmegaCAM's wide field capabilities spann... | greek translation :\nvazhi (megaCAM), territor... |
| 4 | AssemblyCulture AssemblyCulture AssemblyCultur... | XClass is software tool that helps astronomers... |


---

### 🏷️ Read the Labels

Next, we’ll load the **labels** associated with each text sample.
Each label, `real_text_id`, is either 1 or 2 and indicates which of the two texts in a pair is **Real**. The labels will serve as our ground truth for evaluation.

We’ll again use `Pandas` to read the label file into a DataFrame and inspect its structure.
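
Evaluation against these labels reduces to the fraction of pairs where the predicted id matches `real_text_id`. A minimal sketch in plain Python follows; the helper name `accuracy` is hypothetical, the `ground_truth` values are the first five labels from `train.csv`, and the `predictions` list is made up purely for illustration.

```python
def accuracy(ground_truth, predictions):
    """Fraction of positions where the prediction equals the label."""
    if len(ground_truth) != len(predictions):
        raise ValueError("ground_truth and predictions must have equal length")
    if not ground_truth:
        raise ValueError("cannot score an empty list")
    hits = sum(gt == pred for gt, pred in zip(ground_truth, predictions))
    return hits / len(ground_truth)


ground_truth = [1, 2, 1, 2, 2]   # real_text_id for ids 0-4 in train.csv
predictions = [1, 2, 2, 2, 1]    # made-up predictions, for illustration only

print(accuracy(ground_truth, predictions))  # 0.6
```

This is the same quantity that `sklearn.metrics.accuracy_score` reports for a list of labels and a list of predictions.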


In [7]:
# Load ground truth for train data
df_train_gt=pd.read_csv("/kaggle/input/fake-or-real-impostor-hunt/data/train.csv")
df_train_gt

| | id | real_text_id |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 2 |
| 2 | 2 | 1 |
| 3 | 3 | 2 |
| 4 | 4 | 2 |
| ... | ... | ... |
| 90 | 90 | 2 |
| 91 | 91 | 1 |
| 92 | 92 | 2 |
| 93 | 93 | 2 |


---
### 🧪 Baseline solution with English words detection

In [8]:
def baseline_method_english_word(df):
  """
  This baseline method predicts which of the texts is Real, based on the percentage of English words in each text.
  It returns list with predictions.

  Params:
    df (pd.DataFrame): dataframe with all texts
  """
  # Create lists in which scores will be saved for file_1 (left_scores) and file_2 (right_scores)
  left_scores=[0 for _ in range(df.shape[0])]
  right_scores=[0 for _ in range(df.shape[0])]
  # For each row in the DataFrame and for each element of this row run the algorithm for detecting English words
  for j in range(df.shape[0]):
    for z in range(df.shape[1]):
      sum_english=0
      n=10
      delete=str.maketrans('', '', string.punctuation+'\n')
      cleaned=df.iloc[j].iloc[z].translate(delete)
      text_to_check=cleaned.split(" ")
      text_to_check=[' '.join(text_to_check[i:i+n]) for i in range(0, len(text_to_check),n)]

      # Run algorithm for detecting English words
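      for i in range(len(text_to_check)):
        try:
          language=detect(text_to_check[i])
        except LangDetectException:
          # Detection can fail on chunks without usable text; count such chunks as non-English
          # instead of reusing the language of the previous chunk
          language=None
        if language=='en':
          sum_english+=1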
      result=sum_english/len(text_to_check)
      if z==0:
        left_scores[j]=result
      elif z==1:
        right_scores[j]=result
      else:
        print('Wrong')
  # Create list with predictions by setting value in list to 1 if the first text is `Real` or 2 when the second seems to be better
  predictions=[1 if left_scores[k]>right_scores[k] else 2 for k in range(len(left_scores))]
  return predictions

In [9]:
def evaluate_baseline(predictions, gt_list, text='Score with english detection:'):
  """
  Evaluates the predictions for train data, when the ground truth is provided.

  Params:
    predictions (list): list of predictions
    gt_list (list): list of ground-truth labels
    text (str): text to be printed together with the result
  """
  acc_score = accuracy_score(gt_list, predictions)
  print(text,acc_score)

---
#### 📊 Results

In [10]:
# Use the algorithm for the train data and check accuracy
predictions_train=baseline_method_english_word(df_train)
gt_train=list(df_train_gt['real_text_id'])
evaluate_baseline(predictions_train, gt_train)

Score with english detection: 0.5473684210526316


In [11]:
# Use the algorithm for the test data
predictions_test=baseline_method_english_word(df_test)

### Prepare format for sample solution
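
The submission needs two columns, `id` and `real_text_id`. A minimal sketch of that layout using only the `csv` module is shown here; the helper `to_submission_csv` is hypothetical, and it assumes the predictions are already ordered by ascending sample id so that the row position can serve as the `id`.

```python
import csv
import io


def to_submission_csv(predictions):
    """Serialize predictions as CSV text with columns id,real_text_id."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "real_text_id"])
    for row_id, real_text_id in enumerate(predictions):
        writer.writerow([row_id, real_text_id])
    return buffer.getvalue()


print(to_submission_csv([2, 2, 2, 1, 1]))
```

The notebook cell that follows produces the same layout with `pandas`, using `reset_index` to turn the positional index into the `id` column.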

In [12]:
# Change the format of predictions into requested format, as described in Overview section of this competition
df_results_test=pd.DataFrame(predictions_test)
output_df = df_results_test.copy()
output_df.columns = ['real_text_id']
output_df.reset_index(inplace=True)
output_df.rename(columns={'index': 'id'}, inplace=True)
output_df

| | id | real_text_id |
|---|---|---|
| 0 | 0 | 2 |
| 1 | 1 | 2 |
| 2 | 2 | 2 |
| 3 | 3 | 1 |
| 4 | 4 | 1 |
| ... | ... | ... |
| 1063 | 1063 | 1 |
| 1064 | 1064 | 1 |
| 1065 | 1065 | 1 |
| 1066 | 1066 | 2 |


In [13]:
output_df.to_csv('sample_submission_1.csv', index=False)

---

### 🔡 Character-Level Baseline

In addition to analyzing words, we can explore a **character-level approach** as an alternative baseline.

This method evaluates the **proportion of Latin characters** in the text, instead of relying on word-based language detection.

By comparing the ratio of Latin characters to total non-space characters, we generate another set of predictions, offering a complementary perspective to our word-level strategy.
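
The ratio described above can be sketched with the standard library. This is a minimal sketch rather than the notebook's exact code: `latin_ratio` is a hypothetical helper, and unlike the cell below it drops all whitespace (not only spaces and newlines) before counting.

```python
import string
import unicodedata


def latin_ratio(text):
    """Share of non-space, non-punctuation characters whose Unicode name contains 'LATIN'."""
    drop = set(string.punctuation) | set(string.whitespace)
    chars = [ch for ch in text if ch not in drop]
    if not chars:
        return 0.0
    # unicodedata.name returns the default "" for characters without a name.
    latin = sum("LATIN" in unicodedata.name(ch, "") for ch in chars)
    return latin / len(chars)


print(latin_ratio("Very Large Telescope"))   # 1.0
print(latin_ratio("star звезда"))            # 0.4
```

Digits have Unicode names such as `DIGIT ONE`, so they count as non-Latin under this rule; a text with many numbers scores lower even if it is entirely English.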

In [14]:
def is_latin_char(char):
  """
  Detect if given character is from Latin alphabet.

  Params:
    char (str): given character
  """
  char=str(char)
  try:
    name=unicodedata.name(char)
    return 'LATIN' in name
  except ValueError:
    return False

def baseline_chars_method(df):
  """
  This baseline method predicts which of the texts is Real, based on the percentage of Latin letters in each text.
  It returns list with predictions.

  Params:
    df (pd.DataFrame): dataframe with all texts
  """
  # Create lists in which scores will be saved for file_1 (left_scores) and file_2 (right_scores)
  left_scores=[0 for _ in range(df.shape[0])]
  right_scores=[0 for _ in range(df.shape[0])]
  # For each row in the DataFrame and for each element of this row run the algorithm for detecting Latin chars
  for j in range(df.shape[0]):
    for z in range(df.shape[1]):
      sum_latin=0
      count_spaces=0
      delete=str.maketrans('', '', string.punctuation+'\n')
      cleaned=df.iloc[j].iloc[z].translate(delete)
      
      # Run algorithm for detecting Latin chars
      for i in range(len(cleaned)):
        if cleaned[i] !=' ':
          if is_latin_char(cleaned[i]):
            sum_latin+=1
        else:
          count_spaces+=1
      # Guard against texts that are empty or contain only spaces after cleaning
      non_space=len(cleaned)-count_spaces
      if non_space==0:
        result=0
      else:
        result=sum_latin/non_space
      if z==0:
        left_scores[j]=result
      elif z==1:
        right_scores[j]=result
      else:
        print('Wrong')
  # Create list with predictions by setting value in list to 1 if the first text is `Real` or 2 when the second seems to be better
  predictions=[1 if left_scores[k]>right_scores[k] else 2 for k in range(len(left_scores))]
  return predictions

---
#### 📊 Results

In [15]:
# Use the algorithm for the train data and check accuracy
predictions_train_char=baseline_chars_method(df_train)
gt_train=list(df_train_gt['real_text_id'])
evaluate_baseline(predictions_train_char, gt_train, text='Score with latin detection:')

Score with latin detection: 0.5684210526315789


In [16]:
# Use the algorithm for the test data
preds_test_char=baseline_chars_method(df_test)

### Prepare format for sample solution

In [17]:
# Change the format of predictions into requested format, as described in Overview section of this competition
df_results_test_char=pd.DataFrame(preds_test_char)
output_df_char = df_results_test_char.copy()
output_df_char.columns = ['real_text_id']
output_df_char.reset_index(inplace=True)
output_df_char.rename(columns={'index': 'id'}, inplace=True)
output_df_char

| | id | real_text_id |
|---|---|---|
| 0 | 0 | 2 |
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 3 | 2 |
| 4 | 4 | 2 |
| ... | ... | ... |
| 1063 | 1063 | 1 |
| 1064 | 1064 | 1 |
| 1065 | 1065 | 1 |
| 1066 | 1066 | 2 |


In [18]:
output_df_char.to_csv('sample_submission_2.csv', index=False)