# Telegram.

A (hopefully) useful guide to get and analyze data from Telegram.

*P. Kessling, Leibniz-Institute for Media Research | Hans-Bredow-Institute (HBI), Hamburg, Germany, 2024-01-22.*

## Table of Contents

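- [Telegram.](#telegram)
  - [Table of Contents](#table-of-contents)
  - [Introduction](#introduction)
    - [Base Objects](#base-objects)
  - [Data Access](#data-access)
    - [Telegram Desktop App](#telegram-desktop-app)
    - [Scraping the Public Web Interface](#scraping-the-public-web-interface)
  - [API-Access](#api-access)
    - [Requirements](#requirements)
  - [tegracli](#tegracli)
    - [Installation](#installation)
  - [Inspect Data](#inspect-data)
  - [Data Repositories](#data-repositories)
  - [The Data Set of this Workshop](#the-data-set-of-this-workshop)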

## Introduction

Telegram is a messenger app that not only offers one-to-one chats but also group chats, channels, bots and more. The data structure is quite complex as it uses a few base objects for all of the different chat and group types.

*Fig. 1: Telegram Web: The public web view of Boris Reitschuster's [channel](https://t.me/reitschusterde)*
![Telegram Web: The public web view of a channel.](../images/screenshot-2024-01-22-21-10-47-reitschusterde.png)


### Base Objects

The base objects are the following:

**Chats**: A chat is a conversation between two or more users. A chat can be a private chat, a group or a channel.

- **User**: A user is a person that uses Telegram. A user can be part of a group or channel.
- **Group**: A group is a chat in which multiple users can talk to each other. A group can be public or private.
- **Channel**: A channel is a chat in which admins broadcast to subscribers. A channel can be public or private.

However, there are a few complexities that need to be considered[^1]. By default, a channel is a one-to-many communication [channel](https://telegram.org/tour/channels) with an unlimited number of subscribers. A group can be converted into a [supergroup](https://telegram.org/tour/groups), which holds up to 200,000 members and is technically also a channel.

[^1]: [Telegram Documentation](https://core.telegram.org/api/channel#channels)

**Messages**: A message is sent in a chat. It usually carries text and may also contain media such as images, videos or documents. A message may contain or represent:

- **Media**: A file attached to a message, e.g. an image, video or document.
- **Sticker**: A special type of media: an image sent in a dedicated sticker format.
- **Location**: A special type of media: a latitude and longitude pair.
- **Contact**: A special type of media: a person saved in the sender's contacts.
- **Poll**: A special type of media: a question with multiple answer options.
- **Action**: A special type of message generated by an event in the chat, e.g. when a user joins or leaves a group or channel.
- **Reply**: A message sent in response to another message.
- **Forward**: A message that passes on another message.
- **Edit**: A revised version of a previously sent message.

**Bots**: A bot is a special type of user that is used to automate tasks. A bot can be part of a group or channel.
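
To make the relationships between these base objects more concrete, they can be sketched as Python dataclasses. This is only an illustrative model: all class and field names (`ChatType`, `Chat`, `Message`, `media_kind`, ...) are assumptions made for this guide and do not mirror Telegram's actual API schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class ChatType(Enum):
    PRIVATE = "private"
    GROUP = "group"
    SUPERGROUP = "supergroup"  # a large group that is technically a channel
    CHANNEL = "channel"


@dataclass
class User:
    user_id: int
    username: Optional[str] = None
    is_bot: bool = False  # bots are modeled as a special kind of user


@dataclass
class Chat:
    chat_id: int
    chat_type: ChatType
    title: Optional[str] = None
    is_public: bool = False


@dataclass
class Message:
    message_id: int
    chat: Chat
    sent_at: datetime
    sender: Optional[User] = None         # hypothetical: may be unknown for channel posts
    text: str = ""
    media_kind: Optional[str] = None      # e.g. "photo", "sticker", "poll"
    reply_to_id: Optional[int] = None     # set if the message is a reply
    forwarded_from: Optional[str] = None  # set if the message is a forward
    edited_at: Optional[datetime] = None  # set if the message was edited

    @property
    def is_reply(self) -> bool:
        return self.reply_to_id is not None

    @property
    def is_forward(self) -> bool:
        return self.forwarded_from is not None


# A public channel with one forwarded post.
channel = Chat(chat_id=1, chat_type=ChatType.CHANNEL, title="Example channel", is_public=True)
post = Message(
    message_id=42,
    chat=channel,
    sent_at=datetime(2024, 1, 18, tzinfo=timezone.utc),
    text="Hello",
    forwarded_from="another_channel",
)
print(post.is_forward, post.is_reply)
```

Modeling replies, forwards and edits as optional fields on one `Message` type, rather than as separate classes, is a design choice of this sketch.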


## Data Access

We have a few different methods at hand to obtain data from Telegram for public channels and groups.
We'll have a look at them in depth in the next sections, starting easy and getting more complex.

### Telegram Desktop App

The Telegram Desktop App offers a few options to export data. The easiest way is to export a chat as a JSON file. This file contains all messages of the chat. However, it does not contain any media like images, videos, etc. The media can be exported separately.
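
An exported chat can then be processed with a few lines of Python. The sketch below assumes the export layout we have commonly seen (a top-level `messages` list whose entries carry `id`, `type`, `date`, `from` and `text`, where `text` is either a string or a list of strings and formatting objects). These key names are assumptions, so check them against your own export. The inline `SAMPLE` stands in for the content of the exported file.

```python
import json


def flatten_text(text):
    """Return plain text; formatted text is assumed to be a list of parts."""
    if isinstance(text, str):
        return text
    parts = []
    for part in text:
        # Parts are either plain strings or dicts with a "text" key.
        parts.append(part if isinstance(part, str) else part.get("text", ""))
    return "".join(parts)


def load_export(raw_json):
    """Parse an exported chat and return one flat dict per regular message."""
    export = json.loads(raw_json)
    rows = []
    for msg in export.get("messages", []):
        if msg.get("type") != "message":
            continue  # skip service entries such as joins or pinned messages
        rows.append(
            {
                "id": msg["id"],
                "date": msg.get("date"),
                "sender": msg.get("from"),
                "text": flatten_text(msg.get("text", "")),
            }
        )
    return rows


# Hypothetical miniature export used instead of a real file.
SAMPLE = json.dumps(
    {
        "name": "Example channel",
        "messages": [
            {"id": 1, "type": "service", "date": "2024-01-18T10:00:00", "action": "create_channel"},
            {
                "id": 2,
                "type": "message",
                "date": "2024-01-18T10:59:48",
                "from": "Example channel",
                "text": ["Read more: ", {"type": "link", "text": "https://example.org"}],
            },
        ],
    }
)

rows = load_export(SAMPLE)
print(rows)
```

For a real export, read the file content first (e.g. with `pathlib.Path(...).read_text(encoding="utf-8")`) and pass it to `load_export`.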



### Scraping the Public Web Interface

Since Telegram offers a publicly accessible web interface for channels, we can scrape the data from there. The web interface is available at `https://t.me/<channel_name>`.

We developed a tool called `ponyexpress-telegram` that can scrape the data from the web interface. It is available on [GitHub](https://www.github.com/Leibniz-HBI/ponyexpress-telegram).


```bash
$ telegram --help

Usage: telegram [OPTIONS] [NAMES]...

  Scrape Telegram Channels.

Options:
  --version                       Show the version and exit.
  -m, --messages-output FILENAME
  -u, --users-output FILENAME
  -p, --prepare-edges
  -l, --log-file PATH
  -v, --verbose
  --help                          Show this message and exit.
```

A scraped message is represented as a JSON record like this:

```json
{
  "post_id": "reitschusterde/8920",
  "views": 14400,
  "datetime": 1705571988000,
  "user": "reitschuster.de",
  "from_author": null,
  "text": "Ampel will „Regenbogen-Familien“ stärken – auf Kosten der Kinder?Anpassung „an soziale Wirklichkeit“.Die Bundesregierung plant weitreichende Reformen bei Adoption und Sorgerecht. Dazu sollen die Mindeststrafen bei Kinderpornografie wieder gesenkt werden. Einige der dabei verwendeten Wörter und Formulierungen müssen aufhorchen lassen. Von Kai Rebmann. https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/",
  "link": [
    "https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/",
    "https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/",
    "https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/",
    "https://reitschuster.de/post/ampel-will-regenbogen-familien-staerken-auf-kosten-der-kinder/"
  ],
  "reply_to_user": null,
  "reply_to_text": null,
  "reply_to_link": null,
  "image_url": [],
  "forwarded_message_url": null,
  "forwarded_message_user": null,
  "video_url": [],
  "video_duration": null,
  "handle": "reitschusterde",
  "post_number": "8920"
}
```

Channel-level metadata is represented as a JSON record like this:

```json
{
  "name": "reitschusterde",
  "fullname": "reitschuster.de",
  "url": "https://t.me/reitschusterde",
  "description": "Offizieller Kanal von Boris Reitschuster",
  "subscriber_count": 235000,
  "photos_count": 754,
  "videos_count": 86,
  "files_count": 9,
  "links_count": 7440
}
```
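
Two details of the message record are worth handling before analysis: the `datetime` value looks like a Unix timestamp in milliseconds, and the `link` list can contain the same URL several times. The following sketch normalizes both. It assumes the millisecond interpretation is correct and uses a shortened, hypothetical record with placeholder URLs; `normalize_message` is a helper name introduced here, not part of `ponyexpress-telegram`.

```python
import json
from datetime import datetime, timezone


def normalize_message(record):
    """Return a copy with a parsed timestamp and de-duplicated links."""
    cleaned = dict(record)
    # Assumption: `datetime` is given in milliseconds since the Unix epoch.
    cleaned["datetime"] = datetime.fromtimestamp(record["datetime"] / 1000, tz=timezone.utc)
    # dict.fromkeys keeps the first occurrence of each URL and preserves order.
    cleaned["link"] = list(dict.fromkeys(record.get("link") or []))
    return cleaned


# Shortened example record; the URLs are placeholders.
raw = json.loads(
    """
    {
      "post_id": "reitschusterde/8920",
      "views": 14400,
      "datetime": 1705571988000,
      "link": ["https://example.org/a", "https://example.org/a", "https://example.org/b"]
    }
    """
)

message = normalize_message(raw)
print(message["datetime"].isoformat())  # 2024-01-18T09:59:48+00:00
print(message["link"])
```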


## API-Access

Telegram offers an [API](https://core.telegram.org/api) to access the data available to your account. The API is not openly accessible: you need to create a developer app to obtain credentials. It is also not very well documented, so expect to figure out a lot of things by yourself.



### Requirements

- **Telegram account**: Install Telegram on your phone and create an account.
- **Developer app**: Create a developer app on [Telegram](https://my.telegram.org/apps) and retrieve the following information:
  - Telegram API ID (`api_id`)
  - Telegram API hash (`api_hash`)
- ...
- Profit!
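
With these credentials, a direct API request can be sketched using the Telethon library. This is a minimal, untested-against-your-account sketch rather than a recommended workflow: the session name, the selected fields and the helper names `message_to_record` and `fetch_messages` are assumptions made for illustration. On first use, Telethon asks interactively for your phone number and login code.

```python
def message_to_record(msg):
    """Reduce a Telethon message object to a flat, JSON-friendly dict."""
    return {
        "id": msg.id,
        "date": msg.date.isoformat() if msg.date else None,
        "text": msg.message or "",             # raw message text, may be empty
        "views": getattr(msg, "views", None),  # only set for channel posts
    }


async def fetch_messages(api_id, api_hash, channel, limit=10, session="research-session"):
    """Fetch the latest `limit` messages of a public channel."""
    # Imported here so that the helper above also works without Telethon installed.
    from telethon import TelegramClient

    async with TelegramClient(session, api_id, api_hash) as client:
        return [
            message_to_record(msg)
            async for msg in client.iter_messages(channel, limit=limit)
        ]


# Usage (requires `pip install telethon` and valid credentials):
#
#   import asyncio
#   records = asyncio.run(fetch_messages(1234567, "<api_hash>", "reitschusterde"))
```

Keeping the conversion in a separate function makes it possible to test the record format without network access.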

---

![Telegram device overview](../images/telegram-device-overview-small.PNG)

![Telegram device removal limits](../images/telegram-device-removal-limits-small.PNG)



## tegracli

[tegracli](https://www.github.com/Leibniz-HBI/tegracli) is a command line interface for Telegram. It is written in Python and uses the [Telethon](https://github.com/LonamiWebs/Telethon) library to access the Telegram API. It is intended for research use, e.g. collecting large account-based datasets.
It also allows you to persist data from a single channel or to search for keywords in the channels your account is subscribed to.

### Installation

`tegracli` is available on [PyPI](https://pypi.org/project/tegracli/) and can be installed via `pip`:

```bash
pip install tegracli
```

Alternatively you can install it with `pipx`:

```bash
# pip install pipx # if not already installed
pipx install tegracli
```


```bash
$ tegracli

Usage: tegracli [OPTIONS] COMMAND [ARGS]...

  Tegracli!! Retrieve messages from *Te*le*gra*m with a *CLI*!

Options:
  -d, --debug              Enable legacy debugging, is overwritten by the
                           other options. Defaults to False.
  -v, --verbose            Logging verbosity.
  -l, --log-file FILENAME  File to log to. Defaults to STDOUT.
  -s, --serialize          Serialize output to JSON.
  --help                   Show this message and exit.

Commands:
  configure  Configure tegracli.
  get        Get messages for the specified channels by either ID or...
  group      Manage account groups.
  hydrate    Hydrate a file with messages-ids.
  search     Searches Telegram content that is available to your account.
```


To configure `tegracli`, run the following command in a terminal[^2]:

```bash
tegracli configure
```

This command guides you interactively through the configuration process using prompts. The configuration is saved in the file `tegracli.conf.yaml`.

The process configures both the Telegram app to use and a user session. You will need:

- the app ID of the Telegram app you want to use,
- the API hash of that Telegram app,
- a session name,
- your phone number,
- the code you receive via Telegram message.

Take the following as an example of the configuration process:

```bash
user@jupyter-server:~$ tegracli configure
api_id: 1234567
api_hash: 3e83889647c268dc1a32abbcea26a15d
session_name: telegramresearchaccessworkshop
Enter your phone number: +491601234578
Enter 2FA code: 12345
```

[^2]: Running this in JupyterLab or a Jupyter notebook is not possible, since they do not allow interactive prompts.

```bash
$ tegracli configure --help

Usage: tegracli configure [OPTIONS]

  Configure tegracli.

Options:
  --help  Show this message and exit.
```


## Inspect Data

```python
import pandas as pd

# Load a JSON Lines file (one JSON object per line) into a DataFrame.
pd.read_json('filename.jsonl', lines=True)
```

## Data Repositories

- **Social Media Observatory**: [SMO](https://leibniz-hbi.de/de/projekte/social-media-observatory)
- **Data4Transparency**: [D4T](https://data4transparency.com/) offers researchers access to a collection of Telegram data.

## The Data Set of this Workshop

The data set of this workshop is centered on the protests in Lützerath in late 2022. Since Telegram does not offer a global search endpoint, we searched our main Telegram data set, which consists of approx. 15,000 channels and 100 million messages, for the term `Lütz*`.
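
How such a wildcard term selects messages can be illustrated with a regular expression. This is a sketch of the matching logic only, not the query that was actually run against our data set; the function name `wildcard_to_regex` and the sample messages are made up for illustration.

```python
import re


def wildcard_to_regex(term):
    """Translate a search term with `*` wildcards into a case-insensitive regex."""
    # Escape regex metacharacters, then let each `*` match any run of word characters.
    escaped = re.escape(term).replace(r"\*", r"\w*")
    # \b anchors the match at the start of a word.
    return re.compile(r"\b" + escaped, re.IGNORECASE)


pattern = wildcard_to_regex("Lütz*")

# Hypothetical messages standing in for records of the data set.
messages = [
    {"text": "Räumung in Lützerath beginnt"},
    {"text": "Wetterbericht für Hamburg"},
    {"text": "LÜTZI BLEIBT"},
]

hits = [m for m in messages if pattern.search(m["text"])]
print(len(hits))  # 2
```

The word-boundary anchor keeps the term from matching in the middle of unrelated words, at the cost of missing compounds in which the term is not the first component.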

The data set is available for participants via the workshop's chat or on request.

