## **Magic functions**

In [14]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

## **Required installation**

In [15]:
!pip install fastai fastbook nbdev



## **Necessary imports**

In [16]:
import os
import nltk
import numpy as np
import pandas as pd
from fastai import *
from fastbook import *
from fastai.vision.all import *

## **Mounting drive**

In [17]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **Folder initialization**

In [18]:
%cd /content/drive/MyDrive/artwork_description_generator/data/

images_path = 'images'
csv = "csvFiles/"

/content/drive/MyDrive/artwork_description_generator/data


## **Fetching csv files**

In [19]:
files = os.listdir(f'{csv}')
files

['Netherlands.csv', 'UnitedStates.csv', 'China.csv']
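
`os.listdir` returns every entry in the folder, in no guaranteed order, so a stray non-CSV file would later be handed to `pd.read_csv`. A minimal sketch of a stricter listing, assuming only `.csv` files should be merged; the helper name `list_csv_files` is hypothetical and not part of the notebook:

```python
import tempfile
from pathlib import Path


def list_csv_files(folder):
    """Return the names of the .csv files directly inside `folder`, sorted."""
    return sorted(p.name for p in Path(folder).glob("*.csv") if p.is_file())


# Demonstration on a throwaway folder so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Netherlands.csv", "China.csv", "notes.txt"):
        (Path(tmp) / name).touch()
    demo_files = list_csv_files(tmp)

print(demo_files)
```

In the notebook this could be called as `files = list_csv_files(csv)`.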

## **Merging and creation of the dataframe**

In [20]:
# Read each CSV once and concatenate them in a single call
df = pd.concat([pd.read_csv(os.path.join(csv, file)) for file in files], ignore_index=True)

## **Function to view shape**

In [21]:
def df_shape(df):
  print(f'Number of rows: {df.shape[0]}')
  print(f'Number of columns: {df.shape[1]}')

## **Viewing dataframe**

In [22]:
df.head()

Unnamed: 0,ids,artists,mediums,titles,descriptions,urls
0,nl_1,Piet Mondrian,Oil on canvas,"Lozenge Composition with Yellow, Black, Blue, Red, and Gray","Piet Mondrian, a painter of the revolutionary international movement De Stijl (the Style), argued that “the straight line tells the truth.” Why, then, we might wonder, would he choose to hang a painting off axis, where its edges imply dynamic diagonals? Among other motivations, rotating the canvas allowed Mondrian to reconsider a question he spent his career exploring, namely, the relationship between the contents of a painting and what contains them. In Lozenge Composition, the squared-off black lines imply enclosure, while a single line (above the blue area) extends to the slanted edge, ...",https://www.artic.edu/artworks/109819/lozenge-composition-with-yellow-black-blue-red-and-gray
1,nl_2,Jan Sanders van Hemessen,Oil on panel,Judith,"Judith was considered one of the most heroic women of the Old Testament. According to the biblical story, when her city was besieged by the Assyrian army, the beautiful young widow gained access to the quarters of the general Holofernes. After winning his confidence and getting him drunk, she took his sword and cut off his head, thereby saving the Jewish people. Although Judith was often shown richly and exotically clothed, Jan Sanders van Hemessen chose to present her as a monumental nude, aggressively brandishing her sword even after severing Holofernes’s head.Van Hemessen was one of the...",https://www.artic.edu/artworks/4575/judith
2,nl_3,Joachim Antonisz. Wtewael,Oil on copper,The Battle between the Gods and the Giants,"The subject of the victory of the gods of Olympus over the ancient race of giants provided Joachim Wtewael with the opportunity to depict exaggerated athletic poses and striking contrasts of space and light. From the clouds, the Olympian gods wield their attributes as weapons: Jupiter hurls thunderbolts; Neptune brandishes his triton; and Mercury uses his caduceus as a spear. The helmeted figure on the right is Minerva, the goddess of wisdom and war. The painting’s gemlike effect results from the use of a copper support and from its small scale. The artist’s self-conscious display of his s...",https://www.artic.edu/artworks/105466/the-battle-between-the-gods-and-the-giants
3,nl_4,Paulus Potter,Oil on panel,Two Cows and a Young Bull beside a Fence in a Meadow,"Paulus Potter, a prolific painter and etcher during his short life, elevated images of cows, oxen, and other domestic animals to majestic emblems of nature. His lavish attention to the physical appearances of such beasts—the varied texture and coloring of their hair, their characteristic poses, their bulky contours—borders on portraiture and likely derived from drawings he made from life. With Potter, animal painting blossomed into an independent genre in the Dutch Republic.",https://www.artic.edu/artworks/146953/two-cows-and-a-young-bull-beside-a-fence-in-a-meadow
4,nl_5,Pieter Jansz. Quast,Etching in black on paper,"Lame Beggar Asking for Alms, from T is al verwart-gaern (It’s already confusing)",,https://www.artic.edu/artworks/81/lame-beggar-asking-for-alms-from-t-is-al-verwart-gaern-it-s-already-confusing


## **Checking NaN values**

In [72]:
df['titles'][:100]

0                          Lozenge Composition with Yellow, Black, Blue, Red, and Gray
1                                                                               Judith
2                                           The Battle between the Gods and the Giants
3                                 Two Cows and a Young Bull beside a Fence in a Meadow
4     Lame Beggar Asking for Alms, from T is al verwart-gaern (It’s already confusing)
                                            ...                                       
95                                                                       Drawing Hands
96                                                                Tarquin and Lucretia
97                “Hilton Head Island, S.C., USA, June 27, 1992,” from Beach Portraits
98                                                             Ascending and Decending
99                      Amit, Golani Brigade, Orev Unit, Elyacim, Israel, May 26, 1999
Name: titles, Length: 100, dtype: object

In [23]:
df.isna().sum()

ids                0
artists         2946
mediums           71
titles             3
descriptions    7049
urls               0
dtype: int64

## **Fetching images sub-folders**

In [24]:
img_folders = os.listdir(f'{images_path}')
img_folders

['UnitedStates', 'Netherlands', 'China']

In [25]:
# str(get_image_files(f"{images_path}/{img_folders[0]}")[0]).split("/")[2].split(".")[0]

In [13]:
# get_image_files_sorted(f"{images_path}/{img_folders[0]}")

## **Fetching working image ids**

In [26]:
img_ids = []

for folder in img_folders:
  temp = [str(path).split("/")[2].split(".")[0] for path in get_image_files_sorted(f"{images_path}/{folder}")]
  img_ids.extend(temp)

len(img_ids)

8460
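
The comprehension above takes the third `/`-separated component of each path, which only works while paths have the exact shape `images/<country>/<id>.jpg`. A minimal sketch of a less position-dependent extraction, assuming the id is simply the file name without its extension; the helper name `image_id` is hypothetical:

```python
from pathlib import Path


def image_id(path):
    """Return the file name without its extension, e.g. 'nl_1' for '.../nl_1.jpg'."""
    return Path(path).stem


# The same id comes out whether the path is relative or absolute
relative = image_id("images/Netherlands/nl_1.jpg")
absolute = image_id(
    "/content/drive/MyDrive/artwork_description_generator/data/images/Netherlands/nl_1.jpg"
)
print(relative, absolute)
```

One difference from the split-based version: for a name containing several dots, `stem` removes only the last suffix, whereas `split(".")[0]` cuts at the first dot.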

In [27]:
df.columns

Index(['ids', 'artists', 'mediums', 'titles', 'descriptions', 'urls'], dtype='object')

## **Fetching corrupted image ids**

In [28]:
corrupted_image_indices_to_drop = [index for index in range(len(df)) if df.iloc[index]['ids'] not in img_ids]
len(corrupted_image_indices_to_drop)

143
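
The comprehension above checks every row against a Python list of 8460 ids, so each lookup is a linear scan. A minimal sketch of the same check with a set, assuming the goal is only to find the positions of rows whose id has no image; the helper name `missing_positions` is hypothetical:

```python
def missing_positions(ids, available_ids):
    """Return the positions in `ids` whose value is absent from `available_ids`."""
    available = set(available_ids)  # average O(1) membership test
    return [pos for pos, item in enumerate(ids) if item not in available]


demo = missing_positions(["nl_1", "nl_2", "usa_7", "ch_3"], ["nl_1", "ch_3"])
print(demo)  # positions of "nl_2" and "usa_7"
```

In the notebook this could be called as `missing_positions(df['ids'], img_ids)`. Because `df` still has its default integer index at this point, the returned positions can be passed to `df.drop` as labels. pandas also offers `df['ids'].isin(img_ids)`, which returns a boolean mask instead of positions.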

In [29]:
# df.iloc[72]

## **Removing corrupted image rows**

In [30]:
df = df.drop(corrupted_image_indices_to_drop).reset_index(drop=True)

In [31]:
df_shape(df)

Number of rows: 8278
Number of columns: 6


In [None]:
# for path in get_image_files_sorted(f"{images_path}/{img_folders[0]}"):
#   print(path)
#   print(str(path).split("/")[2].split(".")[0])

In [None]:
# df['descriptions'].value_counts().sum()

In [None]:
# df['artists'].value_counts()

In [None]:
# df['titles'].value_counts()

In [None]:
# df['titles'].value_counts().sum()

In [None]:
# df['mediums'].value_counts().sum()

In [None]:
# df['mediums'].value_counts()

## **Images Root folder path**

In [32]:
images_root_path = "/content/drive/MyDrive/artwork_description_generator/data/images/"

## **Mapping abbreviations to root folders**

In [33]:
path_dict = {
    'usa' : 'UnitedStates',
    'nl' : 'Netherlands',
    'ch' : 'China'
}

## **Listing image paths**

In [34]:
image_paths = [ f"{images_root_path}{path_dict[df.iloc[i]['ids'].split('_')[0]]}/{df.iloc[i]['ids']}.jpg" for i in range(len(df))]
image_paths[:5]

['/content/drive/MyDrive/artwork_description_generator/data/images/Netherlands/nl_1.jpg',
 '/content/drive/MyDrive/artwork_description_generator/data/images/Netherlands/nl_2.jpg',
 '/content/drive/MyDrive/artwork_description_generator/data/images/Netherlands/nl_3.jpg',
 '/content/drive/MyDrive/artwork_description_generator/data/images/Netherlands/nl_4.jpg',
 '/content/drive/MyDrive/artwork_description_generator/data/images/Netherlands/nl_5.jpg']
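
The one-line comprehension above packs the prefix lookup and the path assembly into a single f-string. A minimal sketch of the same mapping as a small function, assuming every id has the form `<prefix>_<number>` and every image is a `.jpg`; the names `image_path_for`, `IMAGES_ROOT` and `PREFIX_TO_FOLDER` are hypothetical stand-ins for `images_root_path` and `path_dict`:

```python
from pathlib import PurePosixPath

IMAGES_ROOT = "/content/drive/MyDrive/artwork_description_generator/data/images/"
PREFIX_TO_FOLDER = {"usa": "UnitedStates", "nl": "Netherlands", "ch": "China"}


def image_path_for(image_id, root=IMAGES_ROOT, folders=PREFIX_TO_FOLDER):
    """Build the image path for an id such as 'nl_1'."""
    prefix = image_id.split("_")[0]
    # An unknown prefix raises KeyError, which surfaces bad ids early
    return str(PurePosixPath(root) / folders[prefix] / f"{image_id}.jpg")


print(image_path_for("nl_1"))
```

With this helper the paths could be produced per row with `df['ids'].map(image_path_for)`.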

## **New column with image paths**

In [35]:
df['images_path'] = image_paths

In [None]:
# for i in range(len(df)):
#   prefix = df.iloc[i]['ids'].split("_")[0]
#   country = path_dict[prefix]
#   image_paths.append(f"{images_root_path}{country}/{df.iloc[i]['ids']}.jpg")

In [None]:
# image_paths

## **Counting how often each title occurs**

In [36]:
titles_dict = df['titles'].value_counts().to_dict()

## **Flagging titles that occur more than once**

In [37]:
multiple_same_titles = {key: 0 for key, value in titles_dict.items() if value > 1}

## **Keeping the first occurrence and dropping rows with repeated titles**

In [38]:
indices_to_drop = []

for i in range(len(df)):
  if df.iloc[i]['titles'] in multiple_same_titles.keys() and multiple_same_titles[df.iloc[i]['titles']] == 0:
    multiple_same_titles[df.iloc[i]['titles']] = 1
  elif df.iloc[i]['titles'] in multiple_same_titles.keys() and multiple_same_titles[df.iloc[i]['titles']] == 1:
    indices_to_drop.append(i)

len(indices_to_drop)

2129
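
The loop above keeps the first row for each repeated title and marks later ones for removal, using a counts dictionary plus a flag dictionary. A minimal sketch of the same rule with a single `seen` set, assuming missing titles (NaN) should be skipped here and handled later; the helper name `duplicate_positions` is hypothetical:

```python
def duplicate_positions(titles):
    """Return positions of titles that repeat an earlier title."""
    seen = set()
    to_drop = []
    for position, title in enumerate(titles):
        if not isinstance(title, str):
            continue  # missing titles (NaN) are left in place
        if title in seen:
            to_drop.append(position)
        else:
            seen.add(title)
    return to_drop


demo_titles = ["Still Life", "Judith", "Still Life", float("nan"), "Still Life", float("nan")]
demo_drop = duplicate_positions(demo_titles)
print(demo_drop)
```

In the notebook this could be called as `duplicate_positions(df['titles'])`. pandas has a built-in alternative, `df.drop_duplicates(subset='titles', keep='first')`, but it treats missing titles as duplicates of one another, so it would not leave all NaN rows in place the way the loop does.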

## **Removing indices**

In [39]:
df = df.drop(indices_to_drop).reset_index(drop=True)

## **Viewing shape after removal**

In [40]:
df_shape(df)

Number of rows: 6149
Number of columns: 7


## **Dropping columns not needed for title generation**

In [41]:
df = df.drop(columns=['ids', 'artists', 'mediums', 'descriptions', 'urls'])

In [42]:
!pip install langdetect



In [43]:
from langdetect import detect
# detect('Lozenge Composition with Yellow, Black, Blue, Red, and Gray')
detect('Portrait of a Sixty-year-old Woman, from Recueil d’estampes d’après les plus célèbres tableaux de la Galerie Royale de Dresde')

'fr'

In [44]:
len(df['titles'])

6149

In [45]:
" ".join(df['titles'][6145].split())

'Chair Back'

In [46]:
# df['titles'] = df['titles'].filter(lambda x: x if detect(x) == 'en')

In [47]:
# df.filter(lambda x:)

In [48]:
df.shape

(6149, 2)

## **Checking whether NaN values exist in the final data**

In [49]:
df.isna().sum()

titles         3
images_path    0
dtype: int64

## **Dropping NaN value rows**

In [50]:
df = df.dropna().reset_index(drop=True)

In [51]:
# detect('Interior'), detect('Invitation'), detect('Lisa Lyon, Joshua Tree')

In [62]:
# for i, title in enumerate(df['titles']):
#     print(title)
#     if detect(title) != 'en':
#       print(f'Language: {detect(title)}')

In [63]:
# for i, title in enumerate(df['titles']):
#   # print(title)
#   try:
#     if detect(title) != 'en':
#       print(f'Title: {title}, ')
#       # print(f"English Not Detected......")
#   except:
#     print(title)

In [70]:
s = "Portrait of a Sixty-year-old Woman, from Recueil d’estampes d’après les plus célèbres tableaux de la Galerie Royale de Dresde"
n_s = re.sub(r'[^\w\s]', ' ', s)
print(n_s)
translator.translate(n_s)

Portrait of a Sixty year old Woman  from Recueil d estampes d après les plus célèbres tableaux de la Galerie Royale de Dresde


'Portrait of a Sixty year old Woman from Collection of prints based on the most famous paintings from the Royal Gallery in Dresden'

In [52]:
detect("flesh smaller than tears are the little blue flowers")

'en'

In [76]:
df['titles'].tolist()

['Lozenge Composition with Yellow, Black, Blue, Red, and Gray',
 'Judith',
 'The Battle between the Gods and the Giants',
 'Two Cows and a Young Bull beside a Fence in a Meadow',
 'Lame Beggar Asking for Alms, from T is al verwart-gaern (It’s already confusing)',
 'A Young Man Caressing the Young Hostess',
 'Portrait of a Sixty-year-old Woman, from Recueil d’estampes d’après les plus célèbres tableaux de la Galerie Royale de Dresde',
 'Andromeda',
 'The Marriage of the Virgin',
 'Farm near Duivendrecht',
 'Trompe-l’Oeil Still Life with a Flower Garland and a Curtain',
 'Composition (No. 1) Gray-Red',
 'Mater Dolorosa (Sorrowing Virgin)',
 'Portrait of a Man with a Pink',
 'Still Life',
 'The Garden of Paradise',
 'The Quick and the Dead',
 'Self-Portrait Etching at a Window',
 'A Lady Reading (Saint Mary Magdalene)',
 'Weeping Tree',
 'The Music Lesson',
 'Weeping Woman',
 'Adoration of the Magi',
 'A Family Meal',
 'The Adoration of the Christ Child',
 'Lamentation over the Body of Ch

In [74]:
s = "“Hilton Head Island, S.C., USA, June 27, 1992,” from Beach Portraits"
n_s = re.sub(r'[^\w\s]', ' ', s)
print(n_s)
translator.translate(" ".join(n_s.split()))


 Hilton Head Island  S C   USA  June 27  1992   from Beach Portraits


'Hilton Head Island S C USA June 27 1992 from Beach Portraits'

In [77]:
s = "Watch-Tower Near a River, from Landscapes (Playsante Lantschappen)"
n_s = re.sub(r'[^\w\s]', ' ', s)
print(n_s)
translator.translate(" ".join(n_s.split()))


Watch Tower Near a River  from Landscapes  Playsante Lantschappen 


'Watch Tower Near a River from Landscapes Playsante Lantschappen'

In [54]:
# !pip install translators

In [55]:
# import translators as ts
# ts.google()

In [56]:
# ts.translate_text(from_language='',query_text= "Portrait of a Sixty-year-old Woman, from Recueil d’estampes d’après les plus célèbres tableaux de la Galerie Royale de Dresde", to_language='en')

In [57]:
# translator.translate().text

In [58]:
!pip install deep_translator



In [66]:
from deep_translator import GoogleTranslator
translator = GoogleTranslator(source='auto', target='en')

In [61]:
detect('-')

LangDetectException: No features in text.

In [11]:
translator.translate(" ")

''

In [78]:
# df['titles'].apply(lambda x: x if detect(x) == 'en' else translator.translate(x))

LangDetectException: No features in text.

In [8]:
translator.translate("Portrait of a Sixty-year-old Woman, from Recueil d’estampes d’après les plus célèbres tableaux de la Galerie Royale de Dresde")

'Portrait of a Sixty-year-old Woman, from Collection of prints based on the most famous paintings from the Royal Gallery in Dresden'

In [10]:
translator.translate("1951-52")

'1951-52'

In [62]:
import re
# Earlier test strings, superseded by the title below
# my_string = "Hello! How are you? I'm doing well, thanks."
# my_string = "1951-52"
my_string = 'Portrait of a Sixty-year-old Woman, from Recueil d’estampes d’après les plus célèbres tableaux de la Galerie Royale de Dresde'

new = re.sub(r'[^\w\s]', ' ', my_string)
print(new)

Portrait of a Sixty year old Woman  from Recueil d estampes d après les plus célèbres tableaux de la Galerie Royale de Dresde
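
The substitution above is one of four normalisation steps the notebook applies to titles: punctuation removal, lowercasing, digit removal and whitespace collapsing. A minimal sketch that bundles them into one function so they can be tried on single strings; the name `clean_title` is hypothetical:

```python
import re


def clean_title(title):
    """Normalise a title: drop punctuation and digits, lowercase, collapse spaces."""
    text = re.sub(r"[^\w\s]", " ", title)  # punctuation becomes a space
    text = text.lower()
    text = re.sub(r"[0-9]", "", text)      # ASCII digits are removed
    return " ".join(text.split())          # runs of whitespace become one space


print(clean_title("“Hilton Head Island, S.C., USA, June 27, 1992,” from Beach Portraits"))
print(repr(clean_title("1951-52")))
```

The second call shows that a title made only of digits and punctuation is reduced to an empty string, so any later step has to cope with empty input.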


In [79]:
# Replace punctuation with spaces
df['titles'] = df['titles'].apply(lambda x: re.sub(r'[^\w\s]', ' ', x))
# Lowercase the titles
df['titles'] = df['titles'].apply(lambda x: x.lower())
# Remove digits
df['titles'] = df['titles'].apply(lambda x: re.sub(r'[0-9]', '', x))
# Collapse repeated whitespace
df['titles'] = df['titles'].apply(lambda x: " ".join(x.split()))
# Translate non-English titles to English
df['titles'] = df['titles'].apply(lambda x: x if detect(x) == 'en' else translator.translate(x))

LangDetectException: No features in text.
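
The `LangDetectException: No features in text.` above is the same error that `detect('-')` produced earlier. It presumably appears here because some titles contain no letters once punctuation and digits are stripped. A minimal sketch of a guarded version, assuming such titles should pass through unchanged; `to_english` is a hypothetical helper that receives the detector and translator as arguments, and `fake_detect` / `fake_translate` are stand-ins used only so the sketch runs without langdetect or network access:

```python
def to_english(title, detect, translate):
    """Return `title` in English, leaving titles without letters untouched."""
    if not any(ch.isalpha() for ch in title):
        return title  # empty, digits only, or punctuation only: nothing to detect
    try:
        language = detect(title)
    except Exception:  # assumption: any detection failure means "leave as is"
        return title
    return title if language == "en" else translate(title)


# Stand-in callables, purely for illustration
def fake_detect(text):
    return "fr" if "les" in text.split() else "en"


def fake_translate(text):
    return f"<translated: {text}>"


results = [
    to_english(sample, fake_detect, fake_translate)
    for sample in ["", "judith", "d après les plus célèbres tableaux"]
]
print(results)
```

With the notebook's own objects the call could look like `df['titles'].apply(lambda x: to_english(x, detect, translator.translate))`.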

In [90]:
df['titles']

0                             lozenge composition with yellow black blue red and gray
1                                                                              judith
2                                          the battle between the gods and the giants
3                                two cows and a young bull beside a fence in a meadow
4       lame beggar asking for alms from t is al verwart gaern it s already confusing
                                            ...                                      
6141                                            spouted ewer with twisted rope handle
6142                                                                       chair back
6143                                                               panel trouser band
6144                                                                    child s tunic
6145                                                      fragment from a chair strip
Name: titles, Length: 6146, dtype: object

In [72]:
df.head()

Unnamed: 0,titles,images_path
0,lozenge composition with yellow black blue red and gray,/content/drive/MyDrive/artwork_description_generator/data/images/Netherlands/nl_1.jpg
1,judith,/content/drive/MyDrive/artwork_description_generator/data/images/Netherlands/nl_2.jpg
2,the battle between the gods and the giants,/content/drive/MyDrive/artwork_description_generator/data/images/Netherlands/nl_3.jpg
3,two cows and a young bull beside a fence in a meadow,/content/drive/MyDrive/artwork_description_generator/data/images/Netherlands/nl_4.jpg
4,lame beggar asking for alms from t is al verwart gaern it s already confusing,/content/drive/MyDrive/artwork_description_generator/data/images/Netherlands/nl_5.jpg


In [63]:
detect(new)

'fr'

In [58]:
detect("Portrait of a Sixty-year-old Woman, from Recueil d’estampes d’après les plus célèbres tableaux de la Galerie Royale de Dresde")

'fr'

In [48]:
english_titles = [title if detect(title) == 'en' else None for title in df['titles']]
len(english_titles)

LangDetectException: No features in text.

## **Checking final dataframe shape**

In [None]:
df_shape(df)

Number of rows: 6146
Number of columns: 2


## **Viewing final dataframe**

In [None]:
df.head()

Unnamed: 0,titles,images_path
0,"Lozenge Composition with Yellow, Black, Blue, Red, and Gray",/content/drive/MyDrive/artwork_description_generator/data/images/Netherlands/nl_1.jpg
1,Judith,/content/drive/MyDrive/artwork_description_generator/data/images/Netherlands/nl_2.jpg
2,The Battle between the Gods and the Giants,/content/drive/MyDrive/artwork_description_generator/data/images/Netherlands/nl_3.jpg
3,Two Cows and a Young Bull beside a Fence in a Meadow,/content/drive/MyDrive/artwork_description_generator/data/images/Netherlands/nl_4.jpg
4,"Lame Beggar Asking for Alms, from T is al verwart-gaern (It’s already confusing)",/content/drive/MyDrive/artwork_description_generator/data/images/Netherlands/nl_5.jpg


## **Exporting to final .csv file**

In [None]:
df.to_csv("image_title_generation_data.csv", index=False)