## Art History Textbook Cleaning Script
This script reads the raw art history textbook data file and cleans it by dropping the columns the analysis does not need, converting the remaining columns to string types, and removing rows with missing data.

In [None]:
import pandas as pd

# Reading in the art history textbook csv and viewing the first five rows. 
df_textbook = pd.read_csv("../data/raw/art_history.csv")
df_textbook.head(5)

Unnamed: 0,artist_name,edition_number,year,artist_nationality,artist_nationality_other,artist_gender,artist_race,artist_ethnicity,book,space_ratio_per_page_total,artist_unique_id,moma_count_to_year,whitney_count_to_year,artist_race_nwi
0,Aaron Douglas,9.0,1991,American,American,Male,Black or African American,Not Hispanic or Latino origin,Gardner,0.353366,2,0,0,Non-White
1,Aaron Douglas,10.0,1996,American,American,Male,Black or African American,Not Hispanic or Latino origin,Gardner,0.373947,2,0,0,Non-White
2,Aaron Douglas,11.0,2001,American,American,Male,Black or African American,Not Hispanic or Latino origin,Gardner,0.303259,2,0,0,Non-White
3,Aaron Douglas,12.0,2005,American,American,Male,Black or African American,Not Hispanic or Latino origin,Gardner,0.377049,2,0,0,Non-White
4,Aaron Douglas,13.0,2009,American,American,Male,Black or African American,Not Hispanic or Latino origin,Gardner,0.39841,2,0,0,Non-White


I will start by removing the columns this analysis does not need, keeping only the artist name and gender.

In [2]:
df_textbook = df_textbook.drop(columns=[
    'edition_number',
    'year',
    'artist_nationality',
    'artist_nationality_other',
    'artist_race',
    'artist_ethnicity',
    'book',
    'space_ratio_per_page_total',
    'artist_unique_id',
    'moma_count_to_year',
    'whitney_count_to_year',
    'artist_race_nwi',
])

In [3]:
# Viewing the updated dataset with unnecessary columns removed.
df_textbook.head(5)

Unnamed: 0,artist_name,artist_gender
0,Aaron Douglas,Male
1,Aaron Douglas,Male
2,Aaron Douglas,Male
3,Aaron Douglas,Male
4,Aaron Douglas,Male
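
The same result can be reached by naming the columns to keep rather than the columns to drop, which is shorter when only a few columns survive. The sketch below is a minimal illustration on a made-up toy frame (the `toy` data and the `keep_cols` name are assumptions, not part of the original script) and shows that both approaches give the same frame.

```python
import pandas as pd

# Toy frame standing in for the raw file; the values are invented for illustration.
toy = pd.DataFrame({
    "artist_name": ["Artist A", "Artist B"],
    "artist_gender": ["Male", "Female"],
    "year": [1991, 1996],
    "book": ["Gardner", "Gardner"],
})

# Hypothetical keep-list: the two columns the analysis needs.
keep_cols = ["artist_name", "artist_gender"]

# Option 1: select the columns to keep.
by_selection = toy[keep_cols].copy()

# Option 2: drop everything that is not in the keep-list.
by_drop = toy.drop(columns=[c for c in toy.columns if c not in keep_cols])

# Both routes should produce identical frames.
same = by_selection.equals(by_drop)
print(same)
```

Selecting by keep-list also fails loudly with a KeyError if an expected column is absent from the file, which can be a useful early check.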


Next, I will change the artist name and gender data types from object to string.

In [4]:
df_textbook['artist_name'] = df_textbook['artist_name'].astype('string')
df_textbook['artist_gender'] = df_textbook['artist_gender'].astype('string')

In [5]:
# Checking that the data types were updated.
df_textbook.dtypes

artist_name      string[python]
artist_gender    string[python]
dtype: object
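
One point worth confirming about this conversion is that it does not hide missing values: casting an object column to the string dtype keeps missing entries as missing, so later null checks still find them. A minimal sketch on invented toy data (the `toy` frame is an assumption for illustration only):

```python
import numpy as np
import pandas as pd

# Toy column with one missing value; the values are invented for illustration.
toy = pd.DataFrame({"artist_gender": ["Male", np.nan, "Female"]})

# Cast from object to the pandas string dtype, as in the cell above.
converted = toy["artist_gender"].astype("string")

# Count missing values before and after the cast.
missing_before = int(toy["artist_gender"].isna().sum())
missing_after = int(converted.isna().sum())
print(missing_before, missing_after)
```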

Next, I will review the summary statistics for the two remaining columns.

In [6]:
df_textbook.describe()

Unnamed: 0,artist_name,artist_gender
count,3162,3104
unique,413,2
top,Joseph Mallord William Turner,Male
freq,25,2762


Next, I will look for missing data.

In [7]:
df_textbook.isnull().sum()

artist_name       0
artist_gender    58
dtype: int64

As noted in the discovery portion, the 58 missing ("nan") genders belong to records whose artist name is listed as "N/A". Because this analysis needs both the artist's name and gender, these records cannot be used, so I will remove those rows now.

In [8]:
df_textbook.dropna(inplace=True)
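
As a hedged illustration of this step, the sketch below uses invented toy data (the `toy` frame and its values are assumptions, not the real file). It first checks which names are attached to the rows with a missing gender, then drops rows using an explicit `subset` so the columns driving the removal are stated in the code, and counts how many rows were removed.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the textbook frame; names and values are invented.
toy = pd.DataFrame({
    "artist_name": ["Artist A", "N/A", "Artist B", "N/A"],
    "artist_gender": ["Male", np.nan, "Female", np.nan],
})

# Which names do the rows with a missing gender carry?
missing_gender = toy[toy["artist_gender"].isna()]
names_of_missing = missing_gender["artist_name"].value_counts()
print(names_of_missing)

# Drop rows with a missing value in either column the analysis depends on,
# and record how many rows were removed.
rows_before = len(toy)
cleaned = toy.dropna(subset=["artist_name", "artist_gender"])
rows_removed = rows_before - len(cleaned)
print(rows_removed)
```

On the real data, the removed-row count would be expected to match the 58 missing genders reported above if no other values are missing.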

In [9]:
# Checking that the rows with missing values were dropped.
df_textbook.isnull().sum()

artist_name      0
artist_gender    0
dtype: int64

In [10]:
# Writing the cleaned data to a csv in the clean data folder.
df_textbook.to_csv('../data/clean/clean_art_history.csv', index=False)
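
A quick round-trip check can confirm that a file written with `index=False` reloads with only the intended columns and no missing values. The sketch below is a minimal example that writes invented toy data to a temporary folder (the `toy` frame and the `clean_toy.csv` file name are assumptions for illustration) rather than touching the project's data folders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy cleaned frame; the values are invented for illustration.
toy = pd.DataFrame({
    "artist_name": ["Artist A", "Artist B"],
    "artist_gender": ["Male", "Female"],
})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output file inside a temporary directory.
    out_path = Path(tmp) / "clean_toy.csv"
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# With index=False, no extra index column should appear on reload.
columns_match = list(reloaded.columns) == list(toy.columns)
no_missing = int(reloaded.isna().sum().sum()) == 0
print(columns_match, no_missing)
```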