# Notebook: Download Images

This notebook downloads all images attached to the tweets in the dataset and extracts the text they contain with Tesseract OCR. Each step is explained below.
<br>**Contributors:** [Nils Hellwig](https://github.com/NilsHellwig/) 

## Packages

In [1]:
from io import BytesIO
from PIL import Image
import pandas as pd
import pytesseract
import requests
import json
import csv
import re
import os

## Parameters

In [7]:
DATASET_PATH = "../Datasets/dataset_mentions/"
PHOTOS_PATH = "../Datasets/img_dataset_mentions/"
PARTIES = ["CDU_CSU", "SPD", "AFD", "FDP", "GRUENE", "LINKE"]
TESSERACT_PATH = "/opt/homebrew/bin/tesseract"

## Settings

In [3]:
# The executable path must be set on the inner `pytesseract.pytesseract` module
pytesseract.pytesseract.tesseract_cmd = TESSERACT_PATH
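
Because the configured path is machine-specific (a Homebrew location on macOS), it can help to check that an executable exists there before running OCR. The following is a minimal sketch using only the standard library; the helper name `resolve_tesseract` is hypothetical and not part of the notebook or of `pytesseract`.

```python
import os
import shutil


def resolve_tesseract(preferred_path):
    """Return a usable Tesseract executable path, or None if none is found.

    Hypothetical helper for illustration only.
    """
    # Use the configured path if it points to an executable file
    if os.path.isfile(preferred_path) and os.access(preferred_path, os.X_OK):
        return preferred_path
    # Otherwise fall back to whatever "tesseract" resolves to on PATH
    return shutil.which("tesseract")


tesseract_cmd = resolve_tesseract("/opt/homebrew/bin/tesseract")
if tesseract_cmd is None:
    print("Tesseract not found; OCR calls would fail")
else:
    print(f"Using Tesseract at {tesseract_cmd}")
```

The returned value could then be assigned in place of `TESSERACT_PATH` in the settings cell.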

## Code

### 1. Create new Directories

In [4]:
# Create a subdirectory for each party; existing directories are left untouched
for party in PARTIES:
    os.makedirs(DATASET_PATH + party, exist_ok=True)

### 2. Download all Images
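
The download cell in this step pulls image links out of the `photos` column with the regular expression `'(.*?)'` and keeps only links containing `/media/`. The sketch below isolates that extraction so its behavior is easy to inspect. The sample string and the function name `media_links` are assumptions for illustration; the real column values come from the account CSV files.

```python
import re

# Hypothetical value of the `photos` column: the string form of a Python list
photos_string = (
    "['https://pbs.twimg.com/media/abc.jpg', "
    "'https://pbs.twimg.com/profile_images/x.jpg']"
)


def media_links(photos_string):
    """Return the single-quoted links in the string that contain '/media/'."""
    # Same pattern as in the download cell: everything between single quotes
    links = re.findall(r"'(.*?)'", photos_string)
    return [link for link in links if "/media/" in link]


print(media_links(photos_string))
```

An empty list string such as `"[]"` yields no links, so tweets without photos are skipped.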

In [5]:
df_images = pd.DataFrame(columns=['tweet_id', 'image_index', 'filename', 'extracted_text', 'url', 'image_path', 'source_party', 'source_account', 'date'])

# Initialize counter for total images
total_images = 0

for party in PARTIES:
    for subdir, _, files in os.walk(DATASET_PATH + party):
        for file in files:
            if file.endswith('.csv') and subdir[len(DATASET_PATH):] in PARTIES:
                # Get username of CSV file
                username = file[:-4]
                
                # Load dataframe of an account
                df = pd.read_csv(DATASET_PATH + party + "/" + file, sep=",", index_col=0, lineterminator="\n")
                df['image_paths'] = ''
                
                # Initialize counter for current account
                account_images = 0
                possible_images = 0
                
                for row in df.itertuples():
                    photos_string = row.photos
                    photo_links = re.findall(r"'(.*?)'", photos_string)
                    
                    index = 0
                    image_paths_dict = {}
                    for link in photo_links:
                        if '/media/' in link:

                            response = requests.get(link)
                            if response.status_code == 200:
                                image = Image.open(BytesIO(response.content))
                                # Convert image mode to RGB if necessary
                                if image.mode != 'RGB':
                                    image = image.convert('RGB')
                                # Construct the filename from the tweet ID and the image's position within the tweet
                                filename = f"{row.id}_{index}.jpg"
                                # Create the directory if it doesn't exist
                                directory = os.path.join(PHOTOS_PATH, party, username)
                                os.makedirs(directory, exist_ok=True)
                                # Save the image to disk
                                image.save(os.path.join(directory, filename))
                            
                                # Extract text from image using Tesseract
                                image_path = os.path.join(directory, filename)
                                text = pytesseract.image_to_string(Image.open(image_path), lang='deu')

                                # Add the text to the dictionary
                                image_paths_dict[filename] = {"text":text, "url": link}
                            
                                new_row = {'tweet_id': row.id, 'image_index': index, 'filename': filename, 'extracted_text': text, 'url': link, 'image_path': image_path, 'source_party': row.source_party, 'source_account': row.source_account, 'date': row.date}
                                df_images = pd.concat([df_images, pd.DataFrame(new_row, index=[0])], ignore_index=True)
                            
                                # Increment the index and counter for this username
                                index += 1
                                account_images += 1
                                possible_images += 1
                                total_images += 1
                            
                            else:
                                possible_images += 1
                    # Update the dataframe with the JSON string of image paths and text
                    df.at[row.Index, 'image_paths'] = json.dumps(image_paths_dict)
                
                # Save the updated dataframe
                df.to_csv(DATASET_PATH + party + "/" + file, sep=",", index=True, index_label='index')
                
                # Print number of images for current account
                print(f"{account_images}/{possible_images} images downloaded for {party} - {username}")
    
    # Print the running total after this party (total_images accumulates across parties, so this is cumulative, not a per-party count)
    print(f"{total_images}/{total_images} images downloaded for {party}")
    
# Print total number of images for all parties
print(f"{total_images} images downloaded in total")

1950/2028 images downloaded for CDU_CSU - ArminLaschet
192/208 images downloaded for CDU_CSU - HBraun
116/120 images downloaded for CDU_CSU - andreasscheuer
517/538 images downloaded for CDU_CSU - CSU
15/15 images downloaded for CDU_CSU - DerLenzMdB
1691/1812 images downloaded for CDU_CSU - Markus_Soeder
9/9 images downloaded for CDU_CSU - ANiebler
4/4 images downloaded for CDU_CSU - MarkusFerber
64/65 images downloaded for CDU_CSU - Junge_Union
28/29 images downloaded for CDU_CSU - ManfredWeber
129/134 images downloaded for CDU_CSU - DoroBaer
175/178 images downloaded for CDU_CSU - rbrinkhaus
26/26 images downloaded for CDU_CSU - tj_tweets
257/264 images downloaded for CDU_CSU - DaniLudwigMdB
240/240 images downloaded for CDU_CSU - JuliaKloeckner
816/828 images downloaded for CDU_CSU - cducsubt
192/193 images downloaded for CDU_CSU - n_roettgen
1563/1613 images downloaded for CDU_CSU - jensspahn
2/2 images downloaded for CDU_CSU - groehe
925/936 images downloaded for CDU_CSU - _FriedrichMerz
100/101 images downloaded for CDU_CSU - hahnflo
15/15 images downloaded for CDU_CSU - smuellermdb
611/628 images downloaded for CDU_CSU - PaulZiemiak
2041/2126 images downloaded for CDU_CSU - CDU
11678/11678 images downloaded for CDU_CSU
13/13 images downloaded for SPD - KarambaDiaby
430/446 images downloaded for SPD - Ralf_Stegner
117/353 images downloaded for SPD - hubertus_heil
1496/1535 images downloaded for SPD - OlafScholz
101/107 images downloaded for SPD - jusos
616/645 images downloaded for SPD - spdbt
4839/5040 images downloaded for SPD - Karl_Lauterbach
305/318 images downloaded for SPD - KuehniKev
232/239 images downloaded for SPD - larsklingbeil
449/464 images downloaded for SPD - HeikoMaas
45/45 images downloaded for SPD - MiRo_SPD
388/398 images downloaded for SPD - EskenSaskia
1356/1405 images downloaded for SPD - spdde
22065/22065 images downloaded for SPD
170/177 images downloaded for AFD - MalteKaufmann
458/478 images downloaded for AFD - AfD
30/30 images downloaded for AFD - PetrBystronAFD
682/708 images downloaded for AFD - StBrandner
136/159 images downloaded for AFD - JoanaCotar
148/158 images downloaded for AFD - Beatrix_vStorch
32/32 images downloaded for AFD - GtzFrmming
318/334 images downloaded for AFD - Alice_Weidel
224/255 images downloaded for AFD - AfDimBundestag
31/31 images downloaded for AFD - AfDBerlin
8/9 images downloaded for AFD - gottfriedcurio
174/194 images downloaded for AFD - Joerg_Meuthen
137/144 images downloaded for AFD - Tino_Chrupalla
24613/24613 images downloaded for AFD
43/43 images downloaded for FDP - f_schaeffler
16/16 images downloaded for FDP - ria_schroeder
503/521 images downloaded for FDP - fdpbt
922/958 images downloaded for FDP - c_lindner
87/90 images downloaded for FDP - MaStrackZi
48/52 images downloaded for FDP - fdp_nrw
1364/1407 images downloaded for FDP - fdp
36/36 images downloaded for FDP - LindaTeuteberg
145/147 images downloaded for FDP - Wissing
52/53 images downloaded for FDP - Lambsdorff
87/89 images downloaded for FDP - KonstantinKuhle
425/447 images downloaded for FDP - MarcoBuschmann
90/91 images downloaded for FDP - johannesvogel
28431/28431 images downloaded for FDP
251/258 images downloaded for GRUENE - GoeringEckardt
171/173 images downloaded for GRUENE - Ricarda_Lang
90/91 images downloaded for GRUENE - BriHasselmann
271/285 images downloaded for GRUENE - KathaSchulze
400/416 images downloaded for GRUENE - GrueneBundestag
400/417 images downloaded for GRUENE - cem_oezdemir
41/42 images downloaded for GRUENE - nouripour
147/149 images downloaded for GRUENE - MiKellner
74/74 images downloaded for GRUENE - JTrittin
69/70 images downloaded for GRUENE - KonstantinNotz
108/108 images downloaded for GRUENE - RenateKuenast
1838/1884 images downloaded for GRUENE - Die_Gruenen
72/84 images downloaded for GRUENE - gruene_jugend
32363/32363 images downloaded for GRUENE
215/219 images downloaded for LINKE - SWagenknecht
699/718 images downloaded for LINKE - dieLinke
159/173 images downloaded for LINKE - Linksfraktion
106/106 images downloaded for LINKE - Janine_Wissler
152/153 images downloaded for LINKE - dielinkeberlin
111/129 images downloaded for LINKE - DietmarBartsch
124/129 images downloaded for LINKE - SusanneHennig
60/60 images downloaded for LINKE - GregorGysi
22/23 images downloaded for LINKE - jankortemdb
35/36 images downloaded for LINKE - anked
6/6 images downloaded for LINKE - SevimDagdelen
46/47 images downloaded for LINKE - katjakipping
43/43 images downloaded for LINKE - b_riexinger
34141/34141 images downloaded for LINKE
34141 images downloaded in total
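
The party-level lines in the output above are running totals, because `total_images` is never reset between parties. Per-party counts can be recovered by differencing consecutive totals. The sketch below does this with the values copied from the output; the variable names are chosen for illustration.

```python
# Running totals copied from the party-level lines of the output above
running_totals = {
    "CDU_CSU": 11678,
    "SPD": 22065,
    "AFD": 24613,
    "FDP": 28431,
    "GRUENE": 32363,
    "LINKE": 34141,
}

# Subtract the previous running total to get each party's own count
per_party = {}
previous = 0
for party, cumulative in running_totals.items():
    per_party[party] = cumulative - previous
    previous = cumulative

for party, count in per_party.items():
    print(f"{party}: {count}")
```

The differences (for example 22065 - 11678 = 10387 for SPD) match the sums of the per-account lines for each party, and they add up to the overall total of 34141.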

In [6]:
df_images.to_csv(PHOTOS_PATH + "images_dataset.csv")
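
Step 2 also writes an `image_paths` column back into each account CSV as a JSON string that maps every filename to its extracted text and source URL. The following sketch shows how such a cell could be decoded again. The sample cell and the helper name `decode_image_paths` are hypothetical; only the JSON shape is taken from the download cell.

```python
import json

# Hypothetical cell value, shaped like what step 2 writes into `image_paths`
cell = json.dumps({
    "123_0.jpg": {"text": "Beispieltext", "url": "https://example.com/media/abc.jpg"},
})


def decode_image_paths(cell_value):
    """Return a list of (filename, url, text) tuples from one `image_paths` cell."""
    # An empty cell is treated as "no images" rather than as invalid JSON
    entries = json.loads(cell_value) if cell_value else {}
    return [(name, info["url"], info["text"]) for name, info in entries.items()]


print(decode_image_paths(cell))
```

Tweets without downloaded images carry the string `{}`, which decodes to an empty list.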