# 01_artvee_art_webscrape

## Notebook 1/4

## Gabriel del Valle
## 10/08/24
## NYC DATA SCIENCE ACADEMY


## The purpose of this project is to create a simplified context to apply content recommendation techniques in an interactive Shiny app.


### Try out the full interactive app! Read the Home page for project details and instructions

https://gabrielxdelvalle.shinyapps.io/algo_gallery/

### Read the full project details on the blog post:

https://nycdatascience.com/blog/student-works/clustering-artworks-by-ai-quantified-visual-qualities-content-recommendation-app/

### For any questions or inquiries about this project, please feel free to reach out on LinkedIn:

www.linkedin.com/in/gabriel-del-valle-147616152


## This first notebook is used to webscrape the public domain modern artworks that will serve as the 'content' in the content recommendation system.

- The source of public domain art is Artvee.com


- Images must be saved to the user's local files because Artvee provides only temporary links for the images it hosts.

- To host the images in my Shiny app, I uploaded them to a GitHub repository. This creates accessible image links for free, but limits the number of images that can be hosted to 1000.


## Notebook Structure 

### 1. Scrape webdata from Artvee
### 2. Clean data about artwork and artist
### 3. Remove rows with an invalid filename from artworks dataset
### 4. Remove rows with missing data
### 5. Create live image links for each filepath based on GitHub repository


In [3]:
import requests
from bs4 import BeautifulSoup
import os
import pandas as pd
import numpy
import re

In [2]:
from PIL import Image
from io import BytesIO

# 1. The following loop uses the BeautifulSoup library to scrape both images and information about the artwork from Artvee.

## Images are stored in a designated directory on user's computer

## Artwork data is stored and organized in artworks_data_all

## In the case that an image fails to be requested or downloaded, its information will be added to broken_images
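
The loop below splits each artist string into a name, a nationality, and a lifespan. As a minimal sketch of that same parsing, pulled out into a standalone function, the example below assumes strings shaped like `Name (Nationality, 1895 - 1946)`. The function name `parse_artist` is hypothetical, and the fallback for details without a comma (treating them as the lifespan) is an assumption rather than something Artvee documents.

```python
from typing import Tuple


def parse_artist(artist_info: str) -> Tuple[str, str, str]:
    """Split 'Name (Nationality, lifespan)' into its three parts."""
    text = artist_info.strip()

    # No parenthesized details: treat the whole string as the artist's name
    if ' (' not in text:
        return text, 'Unknown', 'Unknown'

    name, details = text.split(' (', 1)
    details = details.rstrip(')')

    if ',' in details:
        # Split on the last comma so multi-word nationalities stay intact
        nationality, lifespan = details.rsplit(',', 1)
    else:
        # Assumption: details without a comma hold only the lifespan
        nationality, lifespan = 'Unknown', details

    return (
        name.strip(),
        nationality.strip() or 'Unknown',
        lifespan.strip() or 'Unknown',
    )


print(parse_artist("László Moholy-Nagy (Hungarian, 1895 - 1946)"))
print(parse_artist("Anonymous"))
```

Checking for the comma before unpacking avoids a `ValueError` when an artist entry has parentheses but no comma.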

In [4]:
artworks_data_all = []

broken_images = []

total_index = 0

#set an output directory for downloaded images
os.makedirs('artvee_downloads_all3', exist_ok=True)


#There are 70 artworks per page
#range(1, 20) scrapes pages 1 through 19 (the end of the range is exclusive)
for n in range(1, 20):

    url = f"https://artvee.com/c/abstract/page/{n}/?per_page=70"

    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Loop through each artwork on the page
    for art_item in soup.find_all('div', class_='product-grid-item'):
        
        # Extract the image URL
        image_url = art_item.find('img')['data-src']
        
        
        
        # Extract the title and year
        title_year = art_item.find('h3', class_='product-title').text.strip()
        
        # Split the title_year into title and year
        if '(' in title_year and ')' in title_year:
            title, year = title_year.rsplit('(', 1)
            year = year.strip(') ')
            title = title.strip()
        else:
            title = title_year
            year = 'Unknown'  # In case the year is not present in the title


       

        # Extract the artist information
        artist_info = art_item.find('div', class_='woodmart-product-brands-links').text.strip()

        # Check if the artist_info contains the expected pattern " ("
        if ' (' in artist_info:
            artist_name, artist_details = artist_info.split(' (', 1)
            nationality, lifespan = artist_details.rsplit(',', 1)
            nationality = nationality.strip()
            lifespan = lifespan.strip(')')
        else:
            artist_name = artist_info  # If no " (" pattern, assume the whole string is the artist's name
            nationality = 'Unknown'
            lifespan = 'Unknown'

       
        
        
        # Download the image
        try:
            img_data = requests.get(image_url).content
            img = Image.open(BytesIO(img_data))
            img.verify()  # Check the integrity of the image

            # Save the image if verified
            img_name = os.path.join('artvee_downloads_all3', f"{title_year}.jpg")
            with open(img_name, 'wb') as handler:
                handler.write(img_data)

            # Append the data as a dictionary to the list
            artworks_data_all.append({
                "Title": title,
                "Year": year,
                "Artist": artist_name,
                "Nationality": nationality,
                "Lifespan": lifespan,
                "Image_URL": image_url,
                "Image_path": f"{title_year}.jpg",
                "total_index": total_index
            })

        except (requests.RequestException, IOError):
            # If there's an error with the request or image integrity, log it as broken
            broken_images.append({
                "Title": title,
                "Year": year,
                "Artist": artist_name,
                "Nationality": nationality,
                "Lifespan": lifespan,
                "Image_URL": image_url,
            })
            
        total_index+=1
            
            
        
        
        
        
            
artworks_all = pd.DataFrame(artworks_data_all)
            
brocken_images_df = pd.DataFrame(broken_images)

In [5]:
print(len(artworks_all))
print(len(brocken_images_df))

1330
0
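
Each image is saved as `{title_year}.jpg`, so two artworks that share a title and year would write to the same file, and a title containing a character such as `/` may fail to save. The sketch below shows one way to build safer, unique filenames. The helper `safe_filename`, the set of replaced characters, and the 150-character cap are all assumptions for illustration and are not part of the scraping loop above.

```python
import re


def safe_filename(title_year: str, seen: set, ext: str = ".jpg") -> str:
    """Return a filesystem-friendly filename that is unique within `seen`."""
    # Replace characters that are invalid in filenames on common systems
    stem = re.sub(r'[\\/:*?"<>|]', '_', title_year).strip()

    # Cap the length and fall back to a placeholder for empty titles
    stem = stem[:150] or "untitled"

    # Append a counter until the name has not been used before
    candidate = stem + ext
    counter = 2
    while candidate.lower() in seen:
        candidate = f"{stem}_{counter}{ext}"
        counter += 1

    seen.add(candidate.lower())
    return candidate


used_names = set()
print(safe_filename("Decoratief ontwerp (1874)", used_names))
print(safe_filename("Decoratief ontwerp (1874)", used_names))
```

If a helper like this were used, the same returned name would need to go into both the saved file and the `Image_path` column so that the two stay in sync.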


In [6]:
#artworks_all.to_csv('modern_scraping.csv', index=False)

## 2. Clean data about artwork and artist

When I first started this project, I considered using data such as the artist's nationality and the year of the artwork as part of the analysis, but then decided that their connection to user preferences would be negligible.

### Cleaning the scraped data about the artwork and artist is not necessary for the visual analysis which constitutes the rest of the project, but is part of making the app look consistent and professional.

Regular expressions and string manipulation are used to standardize the Year column, and turn the Lifespan column into just a Birth column (to account for many cases where a death year is unknown).

In [7]:
test = artworks_all.copy()

In [8]:
def clean_lifespan(lifespan):
    # Normalize the lifespan string: replace thin spaces (\u2009) with regular spaces and remove zero-width spaces (\u200b)
    lifespan = lifespan.replace('\u2009', ' ').replace('\u200b', '').strip()
    
    # Remove prefixes like "ca", "c.", "circa" and extra spaces
    lifespan = re.sub(r'^(ca\.?|c\.?|circa)\s*', '', lifespan, flags=re.IGNORECASE)
    
    # Handle centuries
    if "Century" in lifespan or "century" in lifespan:
        if "20th" in lifespan:
            lifespan = "1900 - 2000"
        elif "19th" in lifespan:
            lifespan = "1800 - 1900"
        # Add more cases as needed
    
    # Handle birth without death
    if lifespan.startswith("b."):
        birth_year = lifespan.replace("b.", "").strip()
        return birth_year, "Unknown"
    
    # Handle birth and death with an error or missing death
    if lifespan.endswith("-") or lifespan.endswith("\u2013"):  # Also handle en dash
        birth_year = lifespan.rstrip("-\u2013").strip()
        return birth_year, "Unknown"
    
    # Handle typical format and split, including en dash
    if "-" in lifespan or "\u2013" in lifespan:
        birth_year, death_year = re.split(r"[-\u2013]", lifespan, maxsplit=1)  # Split once, on the first hyphen or en dash
        birth_year = birth_year.strip()
        death_year = death_year.strip()
        return birth_year, death_year
    
    # Handle cases like "Unknown"
    if lifespan.lower() == "unknown":
        return "Unknown", "Unknown"
    
    # If the format is not recognized, return as unknown
    return "Unknown", "Unknown"


In [9]:
# Apply the function to the Lifespan column
test['Birth'], test['Death'] = zip(*test['Lifespan'].apply(clean_lifespan))

# Convert Birth and Death columns to numeric, handling 'Unknown' with NaN
test['Birth'] = pd.to_numeric(test['Birth'], errors='coerce').astype('Int64')
test['Death'] = pd.to_numeric(test['Death'], errors='coerce').astype('Int64')

# Drop the original Lifespan column if no longer needed
#test.drop(columns=['Lifespan'], inplace=True)


In [10]:
test_lifespan = test[test['Lifespan'] != "Unknown"]
test_lifespan2 = test_lifespan[test_lifespan['Birth'].isna()]
test_lifespan2

Unnamed: 0,Title,Year,Artist,Nationality,Lifespan,Image_URL,Image_path,total_index,Birth,Death


In [11]:
test_lifespan3 = test_lifespan[test_lifespan['Death'].isna()]
test_lifespan3['Lifespan'].unique()

array([' b. 1941', ' 1963-', ' 1923 - ', ' 1913-?', ' b. 1940',
       ' 1928 - ', ' b. 1952'], dtype=object)

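The Birth column now parses for every Lifespan that is not "Unknown". The only missing Death values come from the formats listed above: artists who are still living or whose death year is not recorded. These rows could be removed if a missing Death value interferes with later analysis.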

In [12]:
def clean_year(year):
    # Remove \xa0 (non-breaking space) and other unwanted characters
    year = year.replace('\xa0', ' ').strip()
    
    # Handle simple single year case
    if re.match(r'^\d{4}$', year):
        return int(year)
    
    # Handle two years separated by '-' or 'to', including two-digit year differences
    match = re.match(r'^(\d{4})\s*(?:-|to|–)\s*(\d{2,4})$', year)
    if match:
        first_year = int(match.group(1))
        second_year = match.group(2)
        if len(second_year) == 2:
            # Convert two-digit year to full year
            second_year = int(str(first_year)[:2] + second_year)
        else:
            second_year = int(second_year)
        return second_year  # Return the latter year
    
    # Handle cases like 'ca. 1808–10', 'circa 1922', 'around 1923', etc.
    # The en dash is part of the separator group so that 'ca. 1808–10' returns 1810
    match = re.match(r'^(?:ca\.?|circa|around|from|between|before|c\.)\s*(\d{4})(?:\s*(?:-|–|to|and|until)\s*(\d{2,4}))?', year, re.IGNORECASE)
    if match:
        first_year = int(match.group(1))
        if match.group(2):
            second_year = match.group(2)
            if len(second_year) == 2:
                second_year = int(str(first_year)[:2] + second_year)
            else:
                second_year = int(second_year)
            return second_year
        return first_year
    
    # Handle cases like 'between 1919 and 1921' by reducing to the latter year
    match = re.match(r'^between\s*(\d{4})\s*and\s*(\d{4})$', year, re.IGNORECASE)
    if match:
        return int(match.group(2))
    
    # Handle 'before 1921' by using the year itself
    match = re.match(r'^before\s*(\d{4})$', year, re.IGNORECASE)
    if match:
        return int(match.group(1))
    
    # Handle 'ca 1880', 'ca 1916', etc.
    match = re.match(r'^(?:ca|c\.)\s*(\d{4})$', year, re.IGNORECASE)
    if match:
        return int(match.group(1))

    # Handle non-numeric or unknown year cases
    if year.lower() == "unknown" or not re.search(r'\d{4}', year):
        return "Unknown"
    
    return "Unknown"

In [13]:
# Apply the cleaning function to the Year column
test['Cleaned_Year'] = test['Year'].apply(clean_year)

# Convert to integer where possible, with NaN for unknowns
test['Cleaned_Year'] = pd.to_numeric(test['Cleaned_Year'], errors='coerce').astype('Int64')
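
The rule that `clean_year` applies throughout is "take the latter year, expanding a two-digit ending from the first year". As a rough sketch of that rule on its own, the example below finds the first four-digit year and an optional second year after a separator. The function name `latest_year` and the single combined pattern are assumptions for illustration; this is a simplified alternative, not the function used in this notebook.

```python
import re
from typing import Optional

# A four-digit year, optionally followed by a separator and a second year
_YEAR_RANGE = re.compile(r'(\d{4})(?:\s*(?:-|–|to|and|until)\s*(\d{2,4}))?')


def latest_year(text: str) -> Optional[int]:
    """Return the latter year found in a date string, or None if there is none."""
    text = text.replace('\xa0', ' ')
    match = _YEAR_RANGE.search(text)
    if match is None:
        return None

    first, second = match.group(1), match.group(2)
    if second is None:
        return int(first)

    # Expand shortened endings such as '13' in '1911–13' using the first year
    if len(second) < 4:
        second = first[:4 - len(second)] + second
    return int(second)


for sample in ['1924', '1911–13', 'ca. 1808–10', 'between 1919 and 1921', 'Unknown']:
    print(sample, '->', latest_year(sample))
```

Returning `None` instead of the string "Unknown" means the result can go straight into a nullable integer column without a separate `pd.to_numeric` pass.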

In [14]:
test_year = test[test['Cleaned_Year'] != "Unknown"]
test_year2 = test_year[test_year['Cleaned_Year'].isna()]
test_year2

Unnamed: 0,Title,Year,Artist,Nationality,Lifespan,Image_URL,Image_path,total_index,Birth,Death,Cleaned_Year


In [15]:
test[test['Lifespan']== "Unknown"]

Unnamed: 0,Title,Year,Artist,Nationality,Lifespan,Image_URL,Image_path,total_index,Birth,Death,Cleaned_Year
6,Meinem Theuren Vater and Meiner Theuren Mater,1818,Anonymous,Unknown,Unknown,https://mdl.artvee.com/ft/104183ab.jpg,Meinem Theuren Vater and Meiner Theuren Mater ...,6,,,1818.0
29,Textile Design with Vertical Strips of Pearls ...,1840,Anonymous,Unknown,Unknown,https://mdl.artvee.com/ft/104283ab.jpg,Textile Design with Vertical Strips of Pearls ...,29,,,1840.0
33,Design for a Textile with Stylized Leaf,1911,Anonymous,Unknown,Unknown,https://mdl.artvee.com/ft/104116ab.jpg,Design for a Textile with Stylized Leaf (1911)...,33,,,1911.0
53,Kúpalisko v Hévízi,1930–1939,József Egry,Unknown,Unknown,https://mdl.artvee.com/ft/912628ab.jpg,Kúpalisko v Hévízi (1930–1939).jpg,53,,,1939.0
68,Universal Wheel,ca. 1780,Anonymous,Unknown,Unknown,https://mdl.artvee.com/ft/103972ab.jpg,Universal Wheel (ca. 1780).jpg,68,,,1780.0
92,Gavotte,1912,Anonymous,Unknown,Unknown,https://mdl.artvee.com/ft/103927ab.jpg,Gavotte (1912).jpg,92,,,1912.0
112,Ariel,1911–13,Anonymous,Unknown,Unknown,https://mdl.artvee.com/ft/104112ab.jpg,Ariel (1911–13).jpg,112,,,1913.0
116,Adria,1910–11,Anonymous,Unknown,Unknown,https://mdl.artvee.com/ft/103891ab.jpg,Adria (1910–11).jpg,116,,,1911.0
203,Untitled,Unknown,Anonymous,Unknown,Unknown,https://mdl.artvee.com/ft/912744ab.jpg,Untitled.jpg,203,,,
255,Sommerwinde (Summer Wind) II,1922,Anonymous,Unknown,Unknown,https://mdl.artvee.com/ft/103955ab.jpg,Sommerwinde (Summer Wind) II (1922).jpg,255,,,1922.0


In [16]:
all_known = test[(test['Year'] != "Unknown") & (test['Lifespan'] != "Unknown")]
all_known

Unnamed: 0,Title,Year,Artist,Nationality,Lifespan,Image_URL,Image_path,total_index,Birth,Death,Cleaned_Year
0,Construction,1924,László Moholy-Nagy,Hungarian,1895 - 1946,https://mdl.artvee.com/ft/101991ab.jpg,Construction (1924).jpg,0,1895,1946,1924
1,Dance Abstraction; Isadora Duncan (or ‘Rhythmi...,1920,Abraham Walkowitz,American,1878-1965,https://mdl.artvee.com/ft/106394ab.jpg,Dance Abstraction; Isadora Duncan (or ‘Rhythmi...,1,1878,1965,1920
2,Ohne Titel; aus; ‘Die 150 Blätter’ VII,1940,Karl Wiener,Austrian,1901-1949,https://mdl.artvee.com/ft/104773ab.jpg,Ohne Titel; aus; ‘Die 150 Blätter’ VII (1940).jpg,2,1901,1949,1940
3,Zeilboten op een werfhelling,1906,Reijer Stolk,Dutch,1896 - 1945,https://mdl.artvee.com/ft/103462ab.jpg,Zeilboten op een werfhelling (1906).jpg,3,1896,1945,1906
4,Why,1940,Karl Wiener,Austrian,1901-1949,https://mdl.artvee.com/ft/104895ab.jpg,Why (1940).jpg,4,1901,1949,1940
...,...,...,...,...,...,...,...,...,...,...,...
1325,Váza s kyticí a broskve (Vase of flowers and p...,1932,Emil Filla,Czech,1882-1953,https://mdl.artvee.com/ft/912315ab.jpg,Váza s kyticí a broskve (Vase of flowers and p...,1325,1882,1953,1932
1326,Self-Portrait,1928,Ernst Ludwig Kirchner,German,1880-1938,https://mdl.artvee.com/ft/100178ab.jpg,Self-Portrait (1928).jpg,1326,1880,1938,1928
1327,Head,before 1921,Gustaw Gwozdecki,Polish,1880-1935,https://mdl.artvee.com/ft/106321ab.jpg,Head (before 1921).jpg,1327,1880,1935,1921
1328,Klänge Pl.13,1913,Wassily Kandinsky,Russian,1866 - 1944,https://mdl.artvee.com/ft/910880ab.jpg,Klänge Pl.13 (1913).jpg,1328,1866,1944,1913


In [17]:
all_known[all_known['Death'].isna()]

Unnamed: 0,Title,Year,Artist,Nationality,Lifespan,Image_URL,Image_path,total_index,Birth,Death,Cleaned_Year
64,Swan Carpet,1991,Tiit Pääsuke,Estonian,b. 1941,https://mdl.artvee.com/ft/911173ab.jpg,Swan Carpet (1991).jpg,64,1941,,1991
114,Blossom in the Grass,1976,Tiit Pääsuke,Estonian,b. 1941,https://mdl.artvee.com/ft/911146ab.jpg,Blossom in the Grass (1976).jpg,114,1941,,1976
190,Sommerlinien (Lignes d’été),2014,Myriam Thyes,Swiss,1963-,https://mdl.artvee.com/ft/911273ab.jpg,Sommerlinien (Lignes d’été) (2014).jpg,190,1963,,2014
327,Four spaces with a broken cross,2017,Myriam Thyes,Swiss,1963-,https://mdl.artvee.com/ft/911260ab.jpg,Four spaces with a broken cross (2017).jpg,327,1963,,2017
365,Detail of fountain,1975,Victor Alfred Lundy,American,1923 -,https://mdl.artvee.com/ft/912721ab.jpg,Detail of fountain (1975).jpg,365,1923,,1975
366,Balken und gewellte Linien,2014,Myriam Thyes,Swiss,1963-,https://mdl.artvee.com/ft/911256ab.jpg,Balken und gewellte Linien (2014).jpg,366,1963,,2014
431,Pejzaż wiejski,1940,Kazimierz Wojtanowicz,Polish,1913-?,https://mdl.artvee.com/ft/106569ab.jpg,Pejzaż wiejski (1940).jpg,431,1913,,1940
630,Maastik oksaga (Autoportree),1982,Tiit Pääsuke,Estonian,b. 1941,https://mdl.artvee.com/ft/911163ab.jpg,Maastik oksaga (Autoportree) (1982).jpg,630,1941,,1982
784,Urogallo,2013,José de Martín Simón,Spanish,b. 1940,https://mdl.artvee.com/ft/911234ab.jpg,Urogallo (2013).jpg,784,1940,,2013
829,Laine ja Laine III,2020,Tiit Pääsuke,Estonian,b. 1941,https://mdl.artvee.com/ft/911161ab.jpg,Laine ja Laine III (2020).jpg,829,1941,,2020


Death year probably has less influence on the substance of the art than birth year, so it doesn't need to be included. Every missing Birth value comes from a Lifespan of "Unknown".

In [18]:
modern_gallery = all_known.copy()
modern_gallery.drop("total_index", axis = 1, inplace = True)
modern_gallery.drop("Year", axis = 1, inplace = True)
modern_gallery = modern_gallery.rename(columns = {'Cleaned_Year': "Year"})
modern_gallery

Unnamed: 0,Title,Artist,Nationality,Lifespan,Image_URL,Image_path,Birth,Death,Year
0,Construction,László Moholy-Nagy,Hungarian,1895 - 1946,https://mdl.artvee.com/ft/101991ab.jpg,Construction (1924).jpg,1895,1946,1924
1,Dance Abstraction; Isadora Duncan (or ‘Rhythmi...,Abraham Walkowitz,American,1878-1965,https://mdl.artvee.com/ft/106394ab.jpg,Dance Abstraction; Isadora Duncan (or ‘Rhythmi...,1878,1965,1920
2,Ohne Titel; aus; ‘Die 150 Blätter’ VII,Karl Wiener,Austrian,1901-1949,https://mdl.artvee.com/ft/104773ab.jpg,Ohne Titel; aus; ‘Die 150 Blätter’ VII (1940).jpg,1901,1949,1940
3,Zeilboten op een werfhelling,Reijer Stolk,Dutch,1896 - 1945,https://mdl.artvee.com/ft/103462ab.jpg,Zeilboten op een werfhelling (1906).jpg,1896,1945,1906
4,Why,Karl Wiener,Austrian,1901-1949,https://mdl.artvee.com/ft/104895ab.jpg,Why (1940).jpg,1901,1949,1940
...,...,...,...,...,...,...,...,...,...
1325,Váza s kyticí a broskve (Vase of flowers and p...,Emil Filla,Czech,1882-1953,https://mdl.artvee.com/ft/912315ab.jpg,Váza s kyticí a broskve (Vase of flowers and p...,1882,1953,1932
1326,Self-Portrait,Ernst Ludwig Kirchner,German,1880-1938,https://mdl.artvee.com/ft/100178ab.jpg,Self-Portrait (1928).jpg,1880,1938,1928
1327,Head,Gustaw Gwozdecki,Polish,1880-1935,https://mdl.artvee.com/ft/106321ab.jpg,Head (before 1921).jpg,1880,1935,1921
1328,Klänge Pl.13,Wassily Kandinsky,Russian,1866 - 1944,https://mdl.artvee.com/ft/910880ab.jpg,Klänge Pl.13 (1913).jpg,1866,1944,1913


In [19]:
modern_gallery[modern_gallery['Image_path'].isna()]

Unnamed: 0,Title,Artist,Nationality,Lifespan,Image_URL,Image_path,Birth,Death,Year


## 3. Move the images listed in the modern_gallery dataset to a new folder, and record any filenames that cannot be found

In [22]:
import shutil

In [23]:
# Define the source and destination directories
source_dir = 'artvee_downloads_all3'
destination_dir = 'modern_gallery_images'

unfound_images = []

# Ensure the destination directory exists
os.makedirs(destination_dir, exist_ok=True)

# Iterate over the rows in the modern_gallery DataFrame
for index, row in modern_gallery.iterrows():
    # Construct the full path to the source image file
    image_filename = row['Image_path']
    source_path = os.path.join(source_dir, image_filename)

    # Construct the full path to the destination file
    destination_path = os.path.join(destination_dir, image_filename)

    # Check if the file exists in the source directory
    if os.path.exists(source_path):
        # Move the file to the new directory
        shutil.move(source_path, destination_path)
    else:
        unfound_images.append(image_filename)
        print(f"File not found: {image_filename}")

File not found: Decoratief ontwerp (1874).jpg
File not found: Decoratief ontwerp (1874).jpg
File not found: Untitled (1900 - 1930).jpg
File not found: Dancing Figure (1910 - 1915).jpg
File not found: Schip op een werfhelling (1906).jpg
File not found: Decoratief ontwerp (1874).jpg
File not found: Štubnianske Teplice (1934–1935).jpg
File not found: Designs for theater with black-framed proscenium and boldly colored settings.] [Study for stage light wall decoration, possibly for Caf ̌Crillon (277 Park Avenue) (1926).jpg
File not found: Composition (1921).jpg
File not found: Entwurf Zu ‘grüner Rand’ (Study For ‘green Border’) (1919).jpg
File not found: Abstract design based on leaves and organic shapes (1900).jpg
File not found: Abstract Nude (19th century).jpg
File not found: Untitled (1938).jpg
File not found: Head (before 1921).jpg


In [24]:
len(unfound_images)

14
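
The list above names `Decoratief ontwerp (1874).jpg` three times. One possible explanation is that several rows share a filename: once the first matching row moves the file, later rows with the same name can no longer find it. The sketch below counts repeated names in a list so that this can be checked. The helper `repeated_names` is hypothetical, and the sample list is a shortened copy of the printed output, not the full `unfound_images` list.

```python
from collections import Counter
from typing import Dict, List


def repeated_names(names: List[str]) -> Dict[str, int]:
    """Return each name that appears more than once, with its count."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


# Shortened sample taken from the "File not found" output above
unfound_sample = [
    "Decoratief ontwerp (1874).jpg",
    "Decoratief ontwerp (1874).jpg",
    "Untitled (1900 - 1930).jpg",
    "Decoratief ontwerp (1874).jpg",
    "Head (before 1921).jpg",
]

print(repeated_names(unfound_sample))
```

Running the same check on the full `Image_path` column before moving files would show how many rows point at a shared filename.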

In [25]:
modern_gallery2 = modern_gallery[~modern_gallery['Image_path'].isin(unfound_images)]
modern_gallery2

Unnamed: 0,Title,Artist,Nationality,Lifespan,Image_URL,Image_path,Birth,Death,Year
0,Construction,László Moholy-Nagy,Hungarian,1895 - 1946,https://mdl.artvee.com/ft/101991ab.jpg,Construction (1924).jpg,1895,1946,1924
1,Dance Abstraction; Isadora Duncan (or ‘Rhythmi...,Abraham Walkowitz,American,1878-1965,https://mdl.artvee.com/ft/106394ab.jpg,Dance Abstraction; Isadora Duncan (or ‘Rhythmi...,1878,1965,1920
2,Ohne Titel; aus; ‘Die 150 Blätter’ VII,Karl Wiener,Austrian,1901-1949,https://mdl.artvee.com/ft/104773ab.jpg,Ohne Titel; aus; ‘Die 150 Blätter’ VII (1940).jpg,1901,1949,1940
3,Zeilboten op een werfhelling,Reijer Stolk,Dutch,1896 - 1945,https://mdl.artvee.com/ft/103462ab.jpg,Zeilboten op een werfhelling (1906).jpg,1896,1945,1906
4,Why,Karl Wiener,Austrian,1901-1949,https://mdl.artvee.com/ft/104895ab.jpg,Why (1940).jpg,1901,1949,1940
...,...,...,...,...,...,...,...,...,...
1324,Periphery,Mikuláš Galanda,Slovak,1895 – 1938,https://mdl.artvee.com/ft/101476ab.jpg,Periphery (1924).jpg,1895,1938,1924
1325,Váza s kyticí a broskve (Vase of flowers and p...,Emil Filla,Czech,1882-1953,https://mdl.artvee.com/ft/912315ab.jpg,Váza s kyticí a broskve (Vase of flowers and p...,1882,1953,1932
1326,Self-Portrait,Ernst Ludwig Kirchner,German,1880-1938,https://mdl.artvee.com/ft/100178ab.jpg,Self-Portrait (1928).jpg,1880,1938,1928
1328,Klänge Pl.13,Wassily Kandinsky,Russian,1866 - 1944,https://mdl.artvee.com/ft/910880ab.jpg,Klänge Pl.13 (1913).jpg,1866,1944,1913


In [28]:
len(modern_gallery2)

1180

## 4. Remove Rows with Missing Data

When I built this dataset, Year was the only column with NA values that needed to be removed (missing Death values are left in place, as discussed in section 2).

In [5]:
modern_gallery2[modern_gallery2['Year'].isna()]

Unnamed: 0,Title,Artist,Nationality,Lifespan,Image_URL,Image_path,Birth,Death,Year


In [7]:
modern_gallery2 = modern_gallery2[~modern_gallery2['Year'].isna()]

In [None]:
#modern_gallery2.to_csv('modern_gallery.csv', index=False)

## 5. Create live image links for each filepath based on GitHub repository

In [None]:
image_directory = 'User/Documents/modern_gallery_images' #use your own local path! Step 3 moved the images into this folder

Verify number of existing files:

In [None]:
# Count the number of files in the directory
file_count = len([f for f in os.listdir(image_directory) if os.path.isfile(os.path.join(image_directory, f))])

print(f"Number of files in the directory: {file_count}")

Function to verify a specific file exists:

In [8]:
def image_exists(image_path):
    return os.path.isfile(os.path.join(image_directory, image_path))


In [None]:
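# .copy() makes an independent DataFrame so the column assignments below do not raise SettingWithCopyWarning
filtered_gallery = modern_gallery2[modern_gallery2['Image_path'].apply(image_exists)].copy()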

len(filtered_gallery)

### Combine a GitHub repository link with your image paths to create a live link

### This hosts your images as web links for free, assuming you have created a GitHub repository and uploaded your files with the filenames created in this notebook

In [None]:
base_url = 'https://raw.githubusercontent.com/Your_Github_Account/your_repository/main/'


# Create the full URL for each image and add it as a new column
filtered_gallery.loc[:, 'Image_URL'] = base_url + filtered_gallery['Image_path']

# Web links cannot have spaces! Replace spaces with their coded value %20

In [None]:
filtered_gallery['Image_URL'] = filtered_gallery['Image_URL'].str.replace(' ', '%20')
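
Spaces are not the only characters that can need encoding: filenames in this dataset also contain parentheses and non-ASCII letters such as the `ä` in `Klänge`. As a hedged alternative to replacing spaces alone, the sketch below percent-encodes the whole filename with the standard library's `urllib.parse.quote`. The helper name `to_raw_url` is hypothetical, and the base URL is the same placeholder used above.

```python
from urllib.parse import quote

# Placeholder repository URL, as in the cell above
base_url = 'https://raw.githubusercontent.com/Your_Github_Account/your_repository/main/'


def to_raw_url(base: str, image_path: str) -> str:
    """Join the base URL with a percent-encoded filename."""
    # quote() encodes spaces, parentheses and non-ASCII characters as UTF-8 bytes
    return base + quote(image_path)


print(to_raw_url(base_url, "Klänge Pl.13 (1913).jpg"))
```

With a DataFrame, the same helper could be applied to the path column, for example `filtered_gallery['Image_path'].apply(lambda p: to_raw_url(base_url, p))`, in place of the space replacement.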

Save your final dataset:

In [None]:
#filtered_gallery.to_csv('modern_gallery.csv', index=False)