<a href="https://colab.research.google.com/github/Osakhra/ITAI2373-NewsBot-Final/blob/main/notebooks/06_Multilingual_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 06_Multilingual_Analysis.ipynb

In this notebook, I will demonstrate NewsBot 2.0’s ability to detect the language of news articles and translate them into English for further analysis.

**Specifically, I will:**
- Load my cleaned news dataset
- Use my LanguageDetector to identify article language
- Use my Translator module to convert non-English news to English
- Show before/after samples

---


In [1]:
!pip install langdetect spacy nltk scikit-learn pyldavis textblob transformers torch sumy sentence-transformers numpy matplotlib seaborn googletrans==4.0.0-rc1
import nltk
nltk.download('stopwords')
!git clone https://github.com/Osakhra/ITAI2373-NewsBot-Final.git
import sys
sys.path.append('/content/ITAI2373-NewsBot-Final/src')


Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Collecting pyldavis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting sumy
  Downloading sumy-0.11.0-py2.py3-none-any.whl.metadata (7.5 kB)
Collecting googletrans==4.0.0-rc1
  Downloading googletrans-4.0.0rc1.tar.gz (20 kB)
  Preparing metadata (setup.py) ... done
Collecting httpx==0.13.3 (from googletrans==4.0.0-rc1)
  Download

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Cloning into 'ITAI2373-NewsBot-Final'...
remote: Enumerating objects: 263, done.
remote: Counting objects: 100% (93/93), done.
remote: Compressing objects: 100% (85/85), done.
remote: Total 263 (delta 55), reused 6 (delta 6), pack-reused 170 (from 1)
Receiving objects: 100% (263/263), 284.36 KiB | 5.27 MiB/s, done.
Resolving deltas: 100% (118/118), done.


In [2]:
import pandas as pd
from google.colab import files
uploaded = files.upload()
df = pd.read_csv('news_cleaned.csv')


Saving news_cleaned.csv to news_cleaned.csv


In [3]:
from multilingual.language_detector import LanguageDetector

detector = LanguageDetector()
# Detect the language of every article, then preview the first five rows
df['language'] = df['content'].apply(detector.detect)
df[['content', 'language']].head()


Unnamed: 0,content,language
0,worldcom ex-boss launches defence lawyers defe...,en
1,german business confidence slides german busin...,en
2,bbc poll indicates economic gloom citizens in ...,en
3,lifestyle governs mobile choice faster bett...,en
4,enron bosses in $168m payout eighteen former e...,en


In [4]:
print(df['language'].value_counts())


language
en    1490
Name: count, dtype: int64
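
Every article above is labelled `en`, so the counts say little about how detection behaves on awkward input such as empty or very short text. As a minimal sketch, assuming the detector is a plain callable that may raise on such input, a guard like the one below keeps a single bad row from aborting the whole `apply`. The names `detect_or_unknown`, `min_chars`, and the stub `toy_detect` are hypothetical and are not part of NewsBot's `LanguageDetector`.

```python
from collections import Counter


def detect_or_unknown(text, detect_fn, min_chars=20):
    """Return a language code, or 'unknown' when detection is not trustworthy."""
    # Non-string values (e.g. NaN from a CSV) and very short strings are skipped.
    if not isinstance(text, str) or len(text.strip()) < min_chars:
        return "unknown"
    try:
        return detect_fn(text)
    except Exception:
        # Assumption: a catch-all, because the real detector's exception type
        # is not shown in this notebook.
        return "unknown"


def toy_detect(text):
    """Stand-in detector for illustration only; not a real language model."""
    return "es" if " el " in f" {text.lower()} " else "en"


samples = [
    "german business confidence slides german business",
    "El presidente anunció una nueva estrategia de vacunación en España.",
    "",
    None,
]
labels = [detect_or_unknown(s, toy_detect) for s in samples]
language_counts = Counter(labels)
print(language_counts)
```

In the notebook this could be applied as `df['content'].apply(lambda t: detect_or_unknown(t, detector.detect))`, since `detector.detect` is called with a single string above.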


In [5]:
from multilingual.translator import NewsBotTranslator

translator = NewsBotTranslator()
# Example: translate a hard-coded Spanish sentence (the dataset itself contains no non-English articles)
sample_text = "Este es un artículo de noticias en español sobre tecnología."
translated = translator.translate(sample_text, dest='en')
print("Original:", sample_text)
print("Translated:", translated)


Original: Este es un artículo de noticias en español sobre tecnología.
Translated: This is a Spanish news article about technology.
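
Translation usually depends on a network service, so a single call can fail for reasons unrelated to the text. The sketch below assumes only that the translator is a callable with the `translate(text, dest=...)` shape used above. `translate_with_retry`, `retries`, `delay`, and the stub `flaky_translate` are hypothetical names introduced for illustration, and how `NewsBotTranslator` handles errors internally is not shown in this notebook.

```python
import time


def translate_with_retry(text, translate_fn, dest="en", retries=3, delay=0.0):
    """Call translate_fn up to `retries` times; fall back to the original text."""
    for attempt in range(retries):
        try:
            return translate_fn(text, dest=dest)
        except Exception:
            # Wait before retrying, except after the final attempt.
            if attempt < retries - 1:
                time.sleep(delay)
    return text  # Give up and keep the untranslated text.


calls = {"count": 0}


def flaky_translate(text, dest="en"):
    """Stub that fails once, then returns a canned result (not a real translator)."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated network failure")
    return f"[{dest}] {text}"


result = translate_with_retry("Hola mundo", flaky_translate)
print(result)
```

Returning the original text on failure is one possible design choice; it keeps downstream analysis running, at the cost of leaving some rows untranslated.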


In [6]:
# Only translate articles not detected as English (for the demo, limit to the first 5)
non_en = df[df['language'] != 'en'].head().copy()  # .copy() avoids SettingWithCopyWarning
non_en['translated_content'] = non_en['content'].apply(lambda x: translator.translate(x, dest='en'))
non_en[['content', 'translated_content', 'language']]


Unnamed: 0,content,translated_content,language


No rows are shown because every BBC article in the dataset is in English.
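
Because the filtered frame is empty here, the translation step never actually runs. A rough sketch of the same filter-then-translate logic on plain dictionaries, which makes the empty case explicit, might look like the following. The function `translate_non_english`, its `limit` parameter, and the stub translator are assumptions for illustration rather than part of the NewsBot code.

```python
def translate_non_english(records, translate_fn, limit=5):
    """Translate up to `limit` records whose 'language' is not 'en'."""
    non_en = [r for r in records if r.get("language") != "en"][:limit]
    if not non_en:
        print("No non-English articles found; nothing to translate.")
        return []
    # Build new dicts so the input records are left unmodified.
    return [
        {**r, "translated_content": translate_fn(r["content"], dest="en")}
        for r in non_en
    ]


def stub_translate(text, dest="en"):
    """Placeholder translator for illustration only; it just tags the text."""
    return f"<{dest}> {text}"


english_only = [{"content": "enron bosses in $168m payout", "language": "en"}]
mixed = english_only + [
    {"content": "Este es un artículo de noticias.", "language": "es"}
]

empty_result = translate_non_english(english_only, stub_translate)
mixed_result = translate_non_english(mixed, stub_translate)
print(mixed_result)
```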

In [7]:
spanish_text = "Este es un artículo de noticias en español sobre tecnología."
translated = translator.translate(spanish_text, dest='en')
print("Original:", spanish_text)
print("Translated:", translated)


Original: Este es un artículo de noticias en español sobre tecnología.
Translated: This is a Spanish news article about technology.


In [8]:
# Add a Spanish article for demonstration
sample_non_en = {
    'content': "El presidente anunció una nueva estrategia de vacunación en España.",
    'category': 'politics',
    'clean_content': '',
    'language': ''
}
# Append the new row with pd.concat (DataFrame.append was removed in pandas 2.0)
new_row_df = pd.DataFrame([sample_non_en])
df = pd.concat([df, new_row_df], ignore_index=True)

# Detect language for the new row
df.loc[df.index[-1], 'language'] = detector.detect(df.loc[df.index[-1], 'content'])

# Translate the new article
df.loc[df.index[-1], 'translated_content'] = translator.translate(df.loc[df.index[-1], 'content'], dest='en')

# Show the new row
print(df.tail(1)[['content', 'translated_content', 'language']])

                                                content  \
1490  El presidente anunció una nueva estrategia de ...   

                                     translated_content language  
1490  The president announced a new vaccination stra...       es  
