## Preprocessing and Training Data Development

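This notebook preprocesses the cleaned dataset in 'metdata_cleaned.csv' and prepares it for model training. The steps are: loading the data, creating dummy variables, standardizing numeric features, and splitting the data into training and testing sets.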

### Step 1: Loading the Dataset

First, we load the cleaned dataset into a pandas DataFrame and preview the first five rows.

In [2]:
import pandas as pd

# Load the cleaned data
df = pd.read_csv('metdata_cleaned.csv')

df.head()

Unnamed: 0,objectID,isHighlight,accessionNumber,accessionYear,isPublicDomain,primaryImage,primaryImageSmall,additionalImages,constituents,department,...,classification,rightsAndReproduction,linkResource,metadataDate,repository,objectURL,tags,objectWikidata_URL,isTimelineWork,GalleryNumber
0,1,False,1979.486.1,1979.0,False,,,[],"[{'constituentID': 164292, 'role': 'Maker', 'n...",The American Wing,...,,,,2021-04-06T04:41:04.967Z,"Metropolitan Museum of Art, New York, NY",https://www.metmuseum.org/art/collection/search/1,,,False,
1,2,False,1980.264.5,1980.0,False,,,[],"[{'constituentID': 1079, 'role': 'Maker', 'nam...",The American Wing,...,,,,2021-04-06T04:41:04.967Z,"Metropolitan Museum of Art, New York, NY",https://www.metmuseum.org/art/collection/search/2,,,False,
2,3,False,67.265.9,1967.0,False,,,[],,The American Wing,...,,,,2021-04-06T04:41:04.967Z,"Metropolitan Museum of Art, New York, NY",https://www.metmuseum.org/art/collection/search/3,,,False,
3,4,False,67.265.10,1967.0,False,,,[],,The American Wing,...,,,,2024-01-10T04:57:19.843Z,"Metropolitan Museum of Art, New York, NY",https://www.metmuseum.org/art/collection/search/4,,,False,
4,5,False,67.265.11,1967.0,False,,,[],,The American Wing,...,,,,2024-01-10T04:57:19.843Z,"Metropolitan Museum of Art, New York, NY",https://www.metmuseum.org/art/collection/search/5,,,False,
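
The 'Unnamed: 0' column in the preview usually means the row index was written to the CSV when the file was saved. The following is a minimal sketch of how that column arises and how `index_col=0` avoids it. It uses a small in-memory CSV with made-up rows (the `toy` frame is hypothetical), not the real file:

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the real dataset
toy = pd.DataFrame({
    'objectID': [1, 2],
    'department': ['The American Wing', 'The American Wing'],
})

# to_csv writes the index by default, so the first column has an empty header
csv_text = toy.to_csv()

# Reading it back without options yields an 'Unnamed: 0' column
with_extra = pd.read_csv(io.StringIO(csv_text))

# Treating the first column as the index avoids the extra column
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

Whether to drop the column or read it as the index depends on whether later steps rely on it; the notebook above keeps it.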


### Step 2: Creating Dummy Variables

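We convert the categorical variables 'department', 'objectName', and 'culture' into dummy (one-hot) columns. Setting drop_first=True drops the first level of each variable, which avoids perfectly collinear columns; the dropped level becomes the baseline.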

In [3]:
# Creating dummy variables for selected categorical variables
categorical_vars = ['department', 'objectName', 'culture']
df_processed = pd.get_dummies(df, columns=categorical_vars, drop_first=True)

df_processed.head()

Unnamed: 0,objectID,isHighlight,accessionNumber,accessionYear,isPublicDomain,primaryImage,primaryImageSmall,additionalImages,constituents,title,...,"culture_Chinese, for American market",culture_Dutch,"culture_Dutch, probably",culture_European,culture_French,"culture_French, possibly",culture_German,culture_Guatemalan,culture_Mexican,culture_Spanish
0,1,False,1979.486.1,1979.0,False,,,[],"[{'constituentID': 164292, 'role': 'Maker', 'n...",One-dollar Liberty Head Coin,...,0,0,0,0,0,0,0,0,0,0
1,2,False,1980.264.5,1980.0,False,,,[],"[{'constituentID': 1079, 'role': 'Maker', 'nam...",Ten-dollar Liberty Head Coin,...,0,0,0,0,0,0,0,0,0,0
2,3,False,67.265.9,1967.0,False,,,[],,Two-and-a-Half Dollar Coin,...,0,0,0,0,0,0,0,0,0,0
3,4,False,67.265.10,1967.0,False,,,[],,Two-and-a-Half Dollar Coin,...,0,0,0,0,0,0,0,0,0,0
4,5,False,67.265.11,1967.0,False,,,[],,Two-and-a-Half Dollar Coin,...,0,0,0,0,0,0,0,0,0,0
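
To make the effect of drop_first=True concrete, here is a small sketch on a hypothetical four-row frame (the values are invented for illustration). It also shows a caveat worth checking on the real data: with the default settings, a missing category is encoded as all zeros, which is the same encoding the dropped baseline level receives. `dtype=int` is passed only so the output is 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical data: three cultures and one missing value
toy = pd.DataFrame({
    'objectID': [1, 2, 3, 4],
    'culture': ['French', 'Dutch', None, 'German'],
})

# Levels are sorted, so 'Dutch' is the first level and is dropped
encoded = pd.get_dummies(toy, columns=['culture'], drop_first=True, dtype=int)

print(encoded)

# Rows 1 ('Dutch', the baseline) and 2 (missing) are both all zeros
dummy_cols = [c for c in encoded.columns if c.startswith('culture_')]
print(encoded.loc[[1, 2], dummy_cols])
```

If the distinction between "baseline" and "missing" matters for the model, one option is to pass dummy_na=True or to fill missing values with an explicit label before encoding.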


### Step 3: Standardizing Numeric Features

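We standardize numeric features so that each has a mean of 0 and a standard deviation of 1. As examples, we standardize 'accessionYear', 'objectBeginDate', and 'objectEndDate' with scikit-learn's StandardScaler. Note that the scaler is fit on the full dataset here, before the split in Step 4.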

In [4]:
from sklearn.preprocessing import StandardScaler

# Selecting numeric columns for standardization
numeric_columns = ['accessionYear', 'objectBeginDate', 'objectEndDate']
scaler = StandardScaler()
df_processed[numeric_columns] = scaler.fit_transform(df_processed[numeric_columns])

df_processed.head()

Unnamed: 0,objectID,isHighlight,accessionNumber,accessionYear,isPublicDomain,primaryImage,primaryImageSmall,additionalImages,constituents,title,...,"culture_Chinese, for American market",culture_Dutch,"culture_Dutch, probably",culture_European,culture_French,"culture_French, possibly",culture_German,culture_Guatemalan,culture_Mexican,culture_Spanish
0,1,False,1979.486.1,1.51624,False,,,[],"[{'constituentID': 164292, 'role': 'Maker', 'n...",One-dollar Liberty Head Coin,...,0,0,0,0,0,0,0,0,0,0
1,2,False,1980.264.5,1.554907,False,,,[],"[{'constituentID': 1079, 'role': 'Maker', 'nam...",Ten-dollar Liberty Head Coin,...,0,0,0,0,0,0,0,0,0,0
2,3,False,67.265.9,1.052244,False,,,[],,Two-and-a-Half Dollar Coin,...,0,0,0,0,0,0,0,0,0,0
3,4,False,67.265.10,1.052244,False,,,[],,Two-and-a-Half Dollar Coin,...,0,0,0,0,0,0,0,0,0,0
4,5,False,67.265.11,1.052244,False,,,[],,Two-and-a-Half Dollar Coin,...,0,0,0,0,0,0,0,0,0,0
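
Standardization replaces each value x with z = (x - mean) / std, computed per column. As a quick check of what StandardScaler does, the sketch below compares it with the manual formula on a handful of hypothetical accession years (loosely modeled on the preview rows, not the full dataset). StandardScaler uses the population standard deviation, which corresponds to `ddof=0` in NumPy.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical accession years as a single-column 2D array
years = np.array([[1979.0], [1980.0], [1967.0], [1967.0], [1967.0]])

# Scale with scikit-learn
scaled = StandardScaler().fit_transform(years)

# Scale by hand: subtract the column mean, divide by the population std
manual = (years - years.mean(axis=0)) / years.std(axis=0, ddof=0)

print(np.allclose(scaled, manual))
print(scaled.ravel())
```

The scaled values here will not match those in the notebook output, because the mean and standard deviation depend on every row in the column being scaled.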


### Step 4: Splitting the Data

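We split the dataset into training and testing sets, holding out 20% of the rows for testing and fixing random_state so the split is reproducible. Note: 'accessionYear' is used below only as a placeholder target; define the target variable based on your model's goal.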

In [6]:
from sklearn.model_selection import train_test_split

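# Placeholder for feature and target selection: adjust both to your specific case.
# As written, 'features' still contains non-numeric columns (e.g. 'accessionNumber', 'title'),
# and 'accessionYear' was already standardized in Step 3.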
features = df_processed.drop('accessionYear', axis=1)
target = df_processed['accessionYear']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

print(f'Training set size: {X_train.shape[0]}')
print(f'Testing set size: {X_test.shape[0]}')


Training set size: 800
Testing set size: 200
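
Because Step 3 fits the scaler on the full dataset before the split, the test rows influence the mean and standard deviation used for scaling. One common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data: the 1000-row `toy` frame, its random values, and the choice of 'isHighlight' as target are all assumptions made for illustration, not the notebook's actual data or modeling goal.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 1000 rows of invented values
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'objectBeginDate': rng.integers(1600, 1950, size=1000).astype(float),
    'objectEndDate': rng.integers(1600, 1950, size=1000).astype(float),
    'isHighlight': rng.integers(0, 2, size=1000),
})

X = toy[['objectBeginDate', 'objectEndDate']]
y = toy['isHighlight']  # placeholder target, chosen only for this sketch

# Split first, so the test rows stay unseen during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this order the training columns have mean 0 and standard deviation 1 exactly, while the test columns are only approximately standardized, which is expected.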
