# Notebook: Download Images

This notebook downloads all images attached to the collected tweets and extracts their text with Tesseract OCR. Each step is explained below.
<br>**Contributors:** [Nils Hellwig](https://github.com/NilsHellwig/) 

## Packages

In [36]:
from io import BytesIO
from PIL import Image
import pandas as pd
import pytesseract
import requests
import json
import csv
import re
import os

## Parameters

In [37]:
RAW_DATASET_PATH = "../Datasets/raw_dataset/"
DATASET_PATH = "../Datasets/dataset/"
PHOTOS_PATH = "../Datasets/img_dataset/"
PARTIES = ["CDU_CSU", "SPD", "AFD", "FDP", "GRUENE", "LINKE"]
TESSERACT_PATH = "/opt/homebrew/bin/tesseract"

## Settings

In [38]:
# pytesseract reads the binary location from the inner module's attribute
pytesseract.pytesseract.tesseract_cmd = TESSERACT_PATH
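
The hard-coded Homebrew path only works on machines where Tesseract is installed at that location. The following is a minimal sketch of a fallback, assuming Tesseract may instead be available on `PATH`. `resolve_tesseract_cmd` is a hypothetical helper, not part of pytesseract.

```python
import os
import shutil


def resolve_tesseract_cmd(preferred_path):
    """Return preferred_path if it is an existing file, else the `tesseract` found on PATH (or None)."""
    if preferred_path and os.path.isfile(preferred_path):
        return preferred_path
    # shutil.which returns None when no executable named "tesseract" is on PATH
    return shutil.which("tesseract")


# Hypothetical usage inside the notebook:
# cmd = resolve_tesseract_cmd(TESSERACT_PATH)
# if cmd is not None:
#     pytesseract.pytesseract.tesseract_cmd = cmd
```

If neither location yields a binary, the helper returns `None`, so the caller can decide whether to stop or to skip the OCR step.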

## Code

### 1. Create new Directories

In [39]:
# Create one subdirectory per party; exist_ok=True makes this a no-op
# if the directory already exists
for party in PARTIES:
    os.makedirs(DATASET_PATH + party, exist_ok=True)

### 2. Clean Dataframe and Store as CSV

* Open question: should tweets that were not posted on the randomly selected days be excluded?

In [40]:
n_tweets_total = 0
for party in PARTIES:
    n_tweets_party = 0
    for subdir, _, files in os.walk(RAW_DATASET_PATH + party):
        for file in files:
            if file.endswith('.csv') and subdir[len(RAW_DATASET_PATH):] in PARTIES:
                # Get username of CSV file
                username = file[:-4]
                
                # Load dataframe of an account
                df = pd.read_csv(RAW_DATASET_PATH + party + "/" + file, sep=",", index_col=0, lineterminator="\n")
                
                # Check whether any tweet was crawled more than once (duplicate IDs; so far never observed with twint)
                if df["id"].nunique() == len(df):
                    print("All values in the column are unique.", username)
                else:
                    print("There are duplicate values in the column.", username)
                
                # 1. Drop tweets written by the politician/party account itself
                df = df[df.username != username]
                
                # 2. Keep only German tweets
                df = df[df.language == "de"]
                
                # Reset the index of the dataframe
                df = df.reset_index(drop=True)
                
                n_tweets_party += df.shape[0]
                print(username, df.shape[0])
                
                # Save dataframe
                df.to_csv(DATASET_PATH + party + "/" + username + ".csv", sep=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
                
    n_tweets_total += n_tweets_party
    print(party, n_tweets_party)
    
print("Total: ", n_tweets_total)

All values in the column are unique. ArminLaschet
ArminLaschet 36161
All values in the column are unique. HBraun
HBraun 3212
All values in the column are unique. andreasscheuer
andreasscheuer 2431
All values in the column are unique. CSU
CSU 9072
All values in the column are unique. DerLenzMdB
DerLenzMdB 236
All values in the column are unique. Markus_Soeder
Markus_Soeder 30495
All values in the column are unique. ANiebler
ANiebler 25
All values in the column are unique. MarkusFerber
MarkusFerber 21
All values in the column are unique. Junge_Union
Junge_Union 931
All values in the column are unique. ManfredWeber
ManfredWeber 527
All values in the column are unique. DoroBaer
DoroBaer 2560
All values in the column are unique. rbrinkhaus
rbrinkhaus 4280
All values in the column are unique. tj_tweets
tj_tweets 396
All values in the column are unique. DaniLudwigMdB
DaniLudwigMdB 3821
All values in the column are unique. JuliaKloeckner
JuliaKloeckner 3357
All values in the column are unique.
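
The two filters applied per account can be isolated into a small function and checked on toy data. This is a sketch rather than the notebook's own code: `clean_account_tweets` is a hypothetical name, and the toy rows are made up for illustration.

```python
import pandas as pd


def clean_account_tweets(df, account_username):
    """Keep German tweets that were not written by the account itself."""
    # Drop tweets authored by the politician/party account
    df = df[df["username"] != account_username]
    # Keep only tweets whose language column is German
    df = df[df["language"] == "de"]
    # Renumber the remaining rows from 0
    return df.reset_index(drop=True)


# Toy data (made up for illustration)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "username": ["party_account", "user_a", "user_b", "user_c"],
    "language": ["de", "de", "en", "de"],
})
cleaned = clean_account_tweets(toy, "party_account")
print(cleaned["id"].tolist())  # [2, 4]
```

Tweet 1 is dropped because the account wrote it, and tweet 3 is dropped because it is not German.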

### 3. Download all Images

In [41]:
df_images = pd.DataFrame(columns=['tweet_id', 'image_index', 'filename', 'extracted_text', 'url', 'image_path', 'source_party', 'source_account', 'date'])

# Initialize counter for total images
total_images = 0

for party in PARTIES:
    for subdir, _, files in os.walk(DATASET_PATH + party):
        for file in files:
            if file.endswith('.csv') and subdir[len(DATASET_PATH):] in PARTIES:
                # Get username of CSV file
                username = file[:-4]
                
                # Load dataframe of an account
                df = pd.read_csv(DATASET_PATH + party + "/" + file, sep=",", index_col=0, lineterminator="\n")
                df['image_paths'] = ''
                
                # Initialize counter for current account
                account_images = 0
                possible_images = 0
                
                for row in df.itertuples():
                    photos_string = row.photos
                    photo_links = re.findall(r"'(.*?)'", photos_string)
                    
                    index = 0
                    image_paths_dict = {}
                    for link in photo_links:
                        if 'tweet_video_thumb' in link:
                            continue
                        response = requests.get(link)
                        if response.status_code == 200:
                            image = Image.open(BytesIO(response.content))
                            # Convert image mode to RGB if necessary
                            if image.mode != 'RGB':
                                image = image.convert('RGB')
                            # Construct the filename using the index for this username
                            filename = f"{row.id}_{index}.jpg"
                            # Create the directory if it doesn't exist
                            directory = os.path.join(PHOTOS_PATH, party, username)
                            if not os.path.exists(directory):
                                os.makedirs(directory)
                            # Save the image to disk
                            image_path = os.path.join(directory, filename)
                            image.save(image_path)
                            
                            # Extract text from the saved image using Tesseract
                            text = pytesseract.image_to_string(Image.open(image_path))

                            # Record the extracted text and the source URL for this file
                            image_paths_dict[filename] = {"text": text, "url": link}
                            
                            new_row = {'tweet_id': row.id, 'image_index': index, 'filename': filename, 'extracted_text': text, 'url': link, 'image_path': image_path, 'source_party': row.source_party, 'source_account': row.source_account, 'date': row.date}
                            df_images = pd.concat([df_images, pd.DataFrame(new_row, index=[0])], ignore_index=True)
                            
                            # Increment the index and counter for this username
                            index += 1
                            account_images += 1
                            possible_images += 1
                            total_images += 1
                            
                        else:
                            
                            possible_images += 1
                    # Update the dataframe with the JSON string of image paths and text
                    df.at[row.Index, 'image_paths'] = json.dumps(image_paths_dict)
                
                # Save the updated dataframe
                df.to_csv(DATASET_PATH + party + "/" + file, sep=",", index=True, index_label='index')
                
                # Print number of images for current account
                print(f"{account_images}/{possible_images} images downloaded for {party} - {username}")
    
    # Print the running total after finishing the current party
    print(f"{total_images} images downloaded in total after {party}")
    
# Print total number of images for all parties
print(f"{total_images} images downloaded in total")


KeyboardInterrupt
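
The link handling inside the download loop can be checked without network access. The sketch below assumes the `photos` column holds the string form of a Python list of URLs, which is what the regex `'(.*?)'` in the cell implies. `extract_photo_links` and `build_filename` are hypothetical helpers, and the example URLs are made up.

```python
import re


def extract_photo_links(photos_string):
    """Return image URLs from the stringified list, skipping video thumbnails."""
    # Missing values (e.g. NaN from pandas) are not strings; treat them as "no photos"
    if not isinstance(photos_string, str):
        return []
    links = re.findall(r"'(.*?)'", photos_string)
    return [link for link in links if "tweet_video_thumb" not in link]


def build_filename(tweet_id, index):
    """Mirror the notebook's naming scheme: <tweet id>_<image index>.jpg."""
    return f"{tweet_id}_{index}.jpg"


# Made-up example value
photos = "['https://pbs.twimg.com/media/a.jpg', 'https://pbs.twimg.com/tweet_video_thumb/b.jpg']"
links = extract_photo_links(photos)
print(links)  # ['https://pbs.twimg.com/media/a.jpg']
print([build_filename(123, i) for i, _ in enumerate(links)])  # ['123_0.jpg']
```

Guarding against non-string values matters because an empty `photos` cell would otherwise make `re.findall` raise a `TypeError`.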



In [42]:
df_images.to_csv(DATASET_PATH + "dataset.csv")

Unnamed: 0,tweet_id,image_index,filename,extracted_text,url,image_path,source_party,source_account,date
0,1345866502268985354,1,1345866502268985354_0.jpg,,https://pbs.twimg.com/media/Eq16c2qXYAwa__x.jpg,../Datasets/img_dataset/CDU_CSU/ArminLaschet/1...,CDU_CSU,ArminLaschet,2021-01-03 22:55:53
1,1345863370579320832,1,1345863370579320832_0.jpg,Asyl- |\nmiBbrauch\n\n,https://pbs.twimg.com/media/Eq13meUXEAMWanr.jpg,../Datasets/img_dataset/CDU_CSU/ArminLaschet/1...,CDU_CSU,ArminLaschet,2021-01-03 22:43:27
2,1345860999602184196,1,1345860999602184196_0.jpg,"kann, dass dieses '\n\nGesindel\n\nwieder vers...",https://pbs.twimg.com/media/Eq11cdkW8AU8leb.jpg,../Datasets/img_dataset/CDU_CSU/ArminLaschet/1...,CDU_CSU,ArminLaschet,2021-01-03 22:34:01
3,1345841888289550345,1,1345841888289550345_0.jpg,"25. Februar 2011, 11:45 Uhr FDP rudert zuriick...",https://pbs.twimg.com/media/Eq1kEA2XUAAyW_f.jpg,../Datasets/img_dataset/CDU_CSU/ArminLaschet/1...,CDU_CSU,ArminLaschet,2021-01-03 21:18:05
4,1345840672113373186,1,1345840672113373186_0.jpg,You can fool some of the people all of the tim...,https://pbs.twimg.com/media/Eq1i03_W4AANqWe.png,../Datasets/img_dataset/CDU_CSU/ArminLaschet/1...,CDU_CSU,ArminLaschet,2021-01-03 21:13:15
...,...,...,...,...,...,...,...,...,...
460,1377255542834540545,1,1377255542834540545_0.jpg,"Laumann warb vorab um Verstandnis, dass die Te...",https://pbs.twimg.com/media/Exz-nV8XEAA9ef6.jpg,../Datasets/img_dataset/CDU_CSU/ArminLaschet/1...,CDU_CSU,ArminLaschet,2021-03-31 14:44:44
461,1377248596765110275,1,1377248596765110275_0.jpg,Allgemeinverfigung des Landkreises Peine zum S...,https://pbs.twimg.com/media/Exz4TM3W8AAVIqR.jpg,../Datasets/img_dataset/CDU_CSU/ArminLaschet/1...,CDU_CSU,ArminLaschet,2021-03-31 14:17:08
462,1377246033542066177,1,1377246033542066177_0.jpg,Mittwoch\n07.04.\n\n,https://pbs.twimg.com/media/Exz1xz3WQAQeAns.png,../Datasets/img_dataset/CDU_CSU/ArminLaschet/1...,CDU_CSU,ArminLaschet,2021-03-31 14:06:57
463,1377246024998264835,1,1377246024998264835_0.jpg,,https://pbs.twimg.com/media/Exz1COpWYAE4oIJ.jpg,../Datasets/img_dataset/CDU_CSU/ArminLaschet/1...,CDU_CSU,ArminLaschet,2021-03-31 14:06:55
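
The `Unnamed: 0` column in the preview above most likely comes from reading back a CSV that was written with its index but read without `index_col=0`. A small round-trip sketch with made-up values, using an in-memory buffer instead of `dataset.csv`:

```python
import io

import pandas as pd

# Made-up rows for illustration
df_demo = pd.DataFrame({"tweet_id": [111, 222], "filename": ["111_0.jpg", "222_0.jpg"]})

buffer = io.StringIO()
df_demo.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
with_extra_column = pd.read_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer, index_col=0)

print(list(with_extra_column.columns))  # ['Unnamed: 0', 'tweet_id', 'filename']
print(list(with_index.columns))         # ['tweet_id', 'filename']
```

Alternatively, passing `index=False` to `to_csv` avoids writing the index in the first place.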
