<a href="https://colab.research.google.com/github/ACTH-DKES/ACTH2025/blob/main/Week2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercises Week 1

## Text to tuples

In [None]:
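def text_to_tuples(sentence, integer):
    """Split a sentence into tuples of at most `integer` words each."""
    result = []
    words = sentence.split()
    # Ceiling division: one extra group for any leftover words
    number_of_divisions = len(words) // integer
    if len(words) % integer != 0:
        number_of_divisions += 1
    words.reverse()  # so that pop() returns words in their original order
    for _ in range(number_of_divisions):
        temp_list = []
        for _ in range(integer):
            if len(words) != 0:
                temp_list.append(words.pop())
        result.append(tuple(temp_list))
    return result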


text_to_tuples("Hi My Life Is So Great After Programming", 3)


## A dictionary of operations

In [55]:
import csv

def dictionary_of_operations(dictionary):
    # All three lists must have the same length
    if len(dictionary["numbers1"]) != len(dictionary["numbers2"]) or \
        len(dictionary["numbers1"]) != len(dictionary["operations"]):
            print("Invalid input")
            return None
    # newline="" avoids blank lines between rows on Windows
    with open("result.csv", "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["Number1", "Number2", "Operation", "Result"])
        for i, el in enumerate(dictionary["numbers1"]):
            if dictionary["operations"][i] == "sum":
                writer.writerow([el,dictionary["numbers2"][i],
                                 dictionary["operations"][i],
                                 el + dictionary["numbers2"][i]])
            elif dictionary["operations"][i] == "sub":
                writer.writerow([el,dictionary["numbers2"][i],
                                 dictionary["operations"][i],
                                 el - dictionary["numbers2"][i]])
            elif dictionary["operations"][i] == "mult":
                writer.writerow([el,dictionary["numbers2"][i],
                                 dictionary["operations"][i],
                                 el * dictionary["numbers2"][i]])
            elif dictionary["operations"][i] == "div":
                writer.writerow([el,dictionary["numbers2"][i],
                                 dictionary["operations"][i],
                                 el / dictionary["numbers2"][i]])
            else:
                print("Invalid operation")
                return None

example_dictionary = {"numbers1":[2,4,5.9,10], "numbers2":[12,9,10,11],
                      "operations":["sum", "sub", "mult", "div"]}
dictionary_of_operations(example_dictionary)

# Week 2 - Pandas and Statistical Analysis in Humanities: handling real-world data

## Dataset

Met Museum Collection Metadata (5000 artworks sample)

`met_museum_5000_sample.csv`

## Libraries we will use today
### pandas
* Use: Data manipulation and analysis.
* Key functionalities:
    * Data structures: DataFrame, Series.
    * Data handling: loading, filtering, grouping, aggregating, pivoting.
    * Basic visualization.

[Documentation](https://pandas.pydata.org/docs/)


### scipy
* Use: Advanced scientific computing.
* Key functionalities:
    * Statistical testing (t-tests, ANOVA, correlation).
    * Scientific algorithms and computations.

[Documentation](https://docs.scipy.org/doc/scipy/)

### spacy
* Use: Natural Language Processing
* Key Functionalities:
    * Named Entity Recognition (NER)
    * part of speech (POS) tagging
    * linguistic analysis

[Documentation](https://spacy.io/)




## Step 1: Import pandas and Load the Data




In [None]:
import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv('met_museum_5000_sample.csv')

# Inspect the first few rows to get an initial understanding
df.head()


* By convention, pandas is imported as `pd`.
* `.read_csv()` loads data into a structured, table-like form (`DataFrame`).
* `.head()` displays the first rows of a DataFrame (five by default).
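
To try these calls without the museum file, here is a minimal sketch that reads a tiny CSV from an in-memory string. The column names mirror the dataset, but the rows and the names `csv_text` and `toy_df` are made up for illustration.

```python
import io
import pandas as pd

# Made-up sample rows (not taken from the real dataset)
csv_text = """Title,Object Name,Object Begin Date
Sunflowers,Painting,1887
Vase,Vessel,1750
Portrait,Painting,1820
"""

# read_csv accepts a file path or any file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (5 when n is omitted)
first_two = toy_df.head(2)
print(first_two)
```

Passing a number to `.head()` is handy when the default five rows are too many or too few.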

## Step 2: Examine Data Structure

In [None]:
# Get information on columns, datatypes, and non-null counts
df.info()

# Get descriptive statistics for all columns (numeric and categorical)
df.describe(include='all')


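* `.info()` summarizes column names, data types, and non-null counts.
* `.describe()` gives descriptive statistics: count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum for numeric columns. With `include='all'` it also reports the number of unique values, the most frequent value (top), and its frequency for categorical columns.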
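
The difference between the default call and `include='all'` is easier to see on a small table. The following is a sketch on invented data; `toy_df` is a hypothetical stand-in for the museum DataFrame.

```python
import pandas as pd

# Invented rows for illustration only
toy_df = pd.DataFrame({
    'Object Name': ['Painting', 'Painting', 'Vase'],
    'Object Begin Date': [1800, 1900, 1700],
})

# Default: numeric columns only
numeric_summary = toy_df.describe()

# include='all': text columns get count/unique/top/freq rows as well
full_summary = toy_df.describe(include='all')

print(numeric_summary.loc['50%', 'Object Begin Date'])  # the median
print(full_summary.loc['top', 'Object Name'])           # the most frequent value
```

Cells that do not apply to a column (for example `mean` for a text column) are shown as NaN in the combined summary.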

## Step 3: Selecting Specific Data

In [None]:
# Select specific columns (Artist Display Name, Title, Object Name)
selected_df = df[['Artist Display Name', 'Title', 'Object Name']]

# Filter rows whose 'Object Name' is exactly 'Painting'
paintings_df = df[df['Object Name'] == 'Painting']
paintings_df.head()

## Step 4: Checking Missing Data

Cultural heritage (CH) datasets often have incomplete information.

In [None]:
# Count missing values per column
df.isnull().sum()

In [None]:
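# Fill missing values in 'Artist Display Name'
# Note: fillna() returns a new Series; df itself is not modified here.
# Assign the result back to a column if you want to keep it.
df['Artist Display Name'].fillna('Unknown')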

* `.isnull().sum()` counts the missing values in each column, which reveals gaps in data quality.

* `.fillna()` returns a copy in which missing data is replaced, to maintain consistency in analysis.
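
A minimal sketch, on invented names, of the point that `.fillna()` does not change the DataFrame unless its result is assigned. The column `Artist Filled` is a hypothetical name used only here.

```python
import pandas as pd

# Invented rows; None stands for a missing artist name
toy_df = pd.DataFrame(
    {'Artist Display Name': ['Mary Cassatt', None, 'Tiffany & Co.']}
)

missing_before = toy_df['Artist Display Name'].isnull().sum()

# fillna returns a new Series; the original column is untouched
filled = toy_df['Artist Display Name'].fillna('Unknown')
missing_still = toy_df['Artist Display Name'].isnull().sum()

# To keep the result, assign it (here to a new column)
toy_df['Artist Filled'] = filled
print(missing_before, missing_still, toy_df['Artist Filled'].tolist())
```

Writing to a new column keeps the original NaN values available for later checks.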

## Step 5: Grouping and Counting

In [None]:
# Count artworks per nationality
nationality_counts = df['Artist Nationality'].value_counts()
nationality_counts.head(10)

In [None]:
# What is that "|"? Is that how the dataset stores multiple values in the
# same cell? How can we fix that?

nationality_counts = (
    df['Artist Nationality']
    .dropna()
    .str.split('|')
    .explode()
    .str.strip()
    .value_counts()
)
nationality_counts.head(10)


In [None]:
# Empty values?
nationalities = (
    df['Artist Nationality'].dropna().str.split('|').explode().str.strip()
)
# Keep only the nationalities that are not ""
nationality_counts = nationalities[nationalities != ""].value_counts()
nationality_counts.head(10)


* `str.split('|')` splits the cell at every "|".
* `explode()` transforms each element of a list into its own row.
* `.str.strip()` cleans whitespace around names.
* `.value_counts()` summarizes the frequency of each category.
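
The chain above can be followed step by step on a handful of values. This is a sketch on an invented Series, not on the museum data; the intermediate names are hypothetical.

```python
import pandas as pd

# Invented values, including a missing cell and a trailing empty entry
nationalities = pd.Series(
    ['French', 'American|French', None, 'American| ', 'Italian']
)

# 1. Drop missing cells and split each string into a list
step_split = nationalities.dropna().str.split('|')

# 2. One list element per row, with surrounding whitespace removed
step_rows = step_split.explode().str.strip()

# 3. Discard empty strings, then count each nationality
counts = step_rows[step_rows != ''].value_counts()
print(counts)
```

Inspecting `step_split` and `step_rows` separately shows what each method contributes before the final count.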

## Step 6: Quick Visualization in Pandas

In [None]:
# Simple plot of the top 10 artist nationalities
nationality_counts.head(10).plot(kind='bar',
                                 title='Top 10 Artist Nationalities in MET')

In [None]:
nationality_counts.head(10).plot(kind='pie',
                                 title='Top 10 Artist Nationalities in MET')

## Step 7: Date Conversion and Cleanup
Dates in CH datasets can be tricky and must be standardized.

In [None]:
# Convert Object Begin Date to numeric
df['New Object Begin Date'] = pd.to_numeric(df['Object Begin Date'], errors='coerce')

# Check date statistics
df['New Object Begin Date'].describe()

* `pd.to_numeric()` converts values to a numeric type. With `errors='coerce'`, entries that cannot be parsed as numbers become NaN, so the column can be used for further statistical analysis.
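
A short sketch of what `errors='coerce'` does, using invented date strings rather than values from the dataset:

```python
import pandas as pd

# Invented raw values: two parseable numbers, one free-text date, one missing
raw_dates = pd.Series(['1850', '-500', 'ca. 1900', None])

# Unparseable entries become NaN instead of raising an error
clean_dates = pd.to_numeric(raw_dates, errors='coerce')

print(clean_dates.tolist())
print("Entries that are NaN after conversion:", clean_dates.isna().sum())
```

Counting the NaN values after conversion is a quick way to see how many dates would need manual cleanup.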

## Step 8: Filtering by Date

In [None]:
# Select artworks from 1800-1900

art_19th_century = df[(df['New Object Begin Date'] >= 1800) &
                      (df['New Object Begin Date'] <= 1900)]
art_19th_century.head()
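
The date filter in this step combines two Boolean Series with `&`. Below is a minimal sketch on invented rows; it also shows `Series.between`, which is inclusive on both ends by default and selects the same rows here.

```python
import pandas as pd

# Invented rows for illustration only
toy_df = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'New Object Begin Date': [1750, 1800, 1875, 1901],
})
dates = toy_df['New Object Begin Date']

# Each comparison yields a Boolean Series; & keeps rows where both are True.
# The parentheses are required because & binds tighter than >= and <=.
mask = (dates >= 1800) & (dates <= 1900)
with_mask = toy_df[mask]

# Equivalent selection using between (both bounds included by default)
with_between = toy_df[dates.between(1800, 1900)]

print(with_mask['Title'].tolist())
```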

Boolean conditions, combined with `&`, filter the data to a historical period.

## Step 9: Aggregating different values

`.groupby()` followed by `.agg()` computes several summary statistics per group.

In [None]:
# Mean, median, and counts for dates grouped by classification
classification_stats = df.groupby('Classification')['New Object Begin Date'].agg(['mean', 'median', 'count'])
classification_stats.sort_values(by='count', ascending=False).head()

### Problem: some cells still hold multiple values. Let's find a solution

In [None]:
# Step 1: Expand the Classification values into individual rows
df_classifications = df[['Classification', 'New Object Begin Date']].dropna().copy()
df_classifications['Classification'] = df_classifications['Classification'].str.split('|')

# Step 2: Explode to have one classification per row
df_exploded = df_classifications.explode('Classification')
df_exploded['Classification'] = df_exploded['Classification'].str.strip()

# Step 3: Aggregate mean, median, and counts by Classification
classification_stats = (
    df_exploded
    .groupby('Classification')['New Object Begin Date']
    .agg(['mean', 'median', 'count'])
    .sort_values(by='count', ascending=False)
)

# Display the top 10 classifications
classification_stats.head(10)


#### Why Copy?

Without `.copy()`, the resulting DataFrame (`df_classifications`) may just be a "view" of the original DataFrame (`df`). This means changes made to `df_classifications` could inadvertently affect `df`, and vice versa. An explicit `.copy()` guarantees that the two are independent.
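
A minimal sketch, on invented classifications, showing that edits to an explicit copy leave the source DataFrame untouched. The names `original` and `independent` are hypothetical.

```python
import pandas as pd

# Invented rows for illustration only
original = pd.DataFrame({'Classification': ['Prints|Drawings', 'Ceramics']})

# An explicit copy owns its own data
independent = original[['Classification']].copy()
independent['Classification'] = independent['Classification'].str.split('|')

# The original column still holds the unsplit strings
print(original['Classification'].tolist())
print(independent['Classification'].tolist())
```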


## Step 11: Statistical Analysis: t-test and Hypothesis Testing

A t-test is a statistical hypothesis test used to determine if there's a significant difference between the means (averages) of two groups.

In simple terms, a t-test answers the question:

"Are these two groups different enough that the difference is unlikely to have occurred by random chance alone?"

It helps you determine whether observed differences between groups are meaningful or just random noise. It is commonly used across the sciences (humanities included) to assess experimental results, comparative analyses, or surveys.

The t-statistic is the difference between the two group means divided by the standard error of that difference. It is > 0 if the mean of the first group is higher than the mean of the second group, and < 0 if the mean of the second group is higher. The further it is from 0, the larger the difference relative to the variability in the data.

The p-value of the t-test is used to judge whether the difference is statistically significant.

By convention, a p-value below 0.05 is considered statistically significant. A larger p-value suggests that the observed difference might be due to random chance.
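
To make the sign and size of the t-statistic concrete, here is a sketch that computes Welch's t-statistic (the unequal-variance form that `equal_var=False` selects in `scipy.stats.ttest_ind`) by hand with the standard library. The two groups are invented numbers, not values from the dataset; the p-value is left out because it additionally needs the t distribution and the Welch-Satterthwaite degrees of freedom, which scipy handles.

```python
import math
import statistics

# Made-up creation dates for two small groups (illustration only)
group_a = [1900, 1920, 1950, 1930]
group_b = [1800, 1850, 1820, 1870]

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)

# Sample variances (divide by n - 1)
var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

# Standard error of the difference between the two means
standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))

# Welch's t: difference in means divided by its standard error
t_stat = (mean_a - mean_b) / standard_error

print(f"mean difference: {mean_a - mean_b}, t = {t_stat:.2f}")
```

Swapping the two groups flips the sign of the result but not its absolute value, which is why the order of the arguments matters when reading the t-statistic.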

Let's use a t-test to compare whether American and French Artists' average creation dates differ **significantly**



In [None]:
from scipy.stats import ttest_ind

# Step 1: Expand multiple nationalities into separate rows
df_nationality = df[['Artist Nationality', 'New Object Begin Date']].dropna().copy()
df_nationality['Artist Nationality'] = df_nationality['Artist Nationality'].str.split('|')

# Explode into individual nationality rows
df_nationality = df_nationality.explode('Artist Nationality')
df_nationality['Artist Nationality'] = df_nationality['Artist Nationality'].str.strip()

# Step 2: Filter explicitly for American and French artists
american_dates = df_nationality[
    df_nationality['Artist Nationality'] == 'American'
]['New Object Begin Date'].dropna()

french_dates = df_nationality[
    df_nationality['Artist Nationality'] == 'French'
]['New Object Begin Date'].dropna()

# Step 3: Perform the statistical test (T-test)
t_stat, p_value = ttest_ind(american_dates, french_dates, equal_var=False)

# Report the results
print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")

if p_value < 0.05:
    print("The difference between American and French artists' dates is statistically significant.")
else:
    print("The difference between American and French artists' dates is NOT statistically significant.")


In [None]:
# A positive t-value means the first group (e.g., American artists)
# has a higher mean date than the second group (e.g., French artists).
# Because the p value is very small, it means that
# the difference between the groups is highly unlikely
# to have arisen by random chance alone.

## Step 12: Gender Based Analysis

Let's explore how gender is described in the dataset

In [None]:
df.columns

In [None]:
df['Artist Gender'].unique()

### Only females? Could it be that the Empty values are for non-Female artists?

In [None]:
df['Artist Display Name']

In [None]:
# From this small view, we can see that there seem to be non-Female and
# Institutions
# We need to assign the correct gender (or institution) instead of leaving
# it empty
# We hypothesize that all the empty values are either non-Female or
# Institutions
# How do we distinguish between the two?
# Given that there are empty values, can we make sure that they match
# with the number of artists or are there mistakes in the data?

### Checking Mistakes

In [None]:
# Function to count the number of values separated by "|"
def count_splits(value):
    if pd.isna(value):  # checks if the value is NaN
        return 0
    return len(str(value).split('|'))

# Count artists and gender entries
# .apply() runs a function on every element of a column
df['Artist_Count'] = df['Artist Display Name'].apply(count_splits)
df['Gender_Count'] = df['Artist Gender'].apply(count_splits)

# Adjusted mismatch logic
def check_mismatch(row):
    # Rule: Single artist with NaN gender is NOT a mismatch
    if row['Artist_Count'] == 1 and pd.isna(row['Artist Gender']):
        return False  # no mismatch
    return row['Artist_Count'] != row['Gender_Count']

# Apply this logic; axis=1 passes each row (not each column) to the function
df['Mismatch'] = df.apply(check_mismatch, axis=1)

# Now examine the mismatches
mismatches = df[df['Mismatch']]

print(f"Total mismatches (after correction): {len(mismatches)}")


In [None]:
# No mistakes, we are lucky!

### NER Named Entity Recognition to distinguish between people and institutions

Rationale:

* If gender is explicitly marked "Female", preserve as "Female".

* If gender is missing or empty:

    * Perform NER:

        * Label entities as "Institution" if NER detects organizations.

        * Label entities as "Non-Female" if NER detects people but gender not marked explicitly.

        * Label entities as "Unknown" otherwise, or if spaCy detects both.
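
The rules above can be sketched as a small pure function that takes the gender field and a list of NER labels and returns one category. `label_from_entities` is a hypothetical helper written for illustration; it does not call spaCy, so the labels are passed in by hand.

```python
def label_from_entities(explicit_gender, entity_labels):
    """Hypothetical helper: map a gender field plus NER labels to a category."""
    # An explicit "Female" marking is always preserved
    if explicit_gender == 'Female':
        return 'Female'

    has_org = 'ORG' in entity_labels
    has_person = 'PERSON' in entity_labels

    if has_org and has_person:
        return 'Unknown'        # ambiguous: both kinds of entity detected
    if has_org:
        return 'Institution'
    if has_person:
        return 'Non-Female'
    return 'Unknown'            # nothing recognized


print(label_from_entities('Female', ['ORG']))
print(label_from_entities('', ['PERSON']))
print(label_from_entities('', ['ORG']))
print(label_from_entities('', ['PERSON', 'ORG']))
print(label_from_entities('', []))
```

Keeping the decision rule separate from the NER call makes it possible to test the rule without loading a language model.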

In [76]:
import spacy

# Load spaCy's NER model
# might be necessary to !pip install spacy
# !python -m spacy download en_core_web_sm

nlp = spacy.load('en_core_web_sm')

def classify_artists(names, genders):
    names_list = str(names).split('|')
    # Handle NaN genders: use one empty string per name if gender is NaN
    genders_list = str(genders).split('|') if pd.notna(genders) else [''] * len(names_list)

    # Correct potential length mismatches by padding
    if len(genders_list) < len(names_list):
        genders_list += [''] * (len(names_list) - len(genders_list))

    corrected_genders = []
    for name, gender in zip(names_list, genders_list):
        name_clean = name.strip()
        gender_clean = gender.strip()

        if gender_clean == 'Female':
            corrected_genders.append('Female')
        else:
            # Run NER on names without explicit 'Female' gender
            doc = nlp(name_clean)
            entity_labels = [ent.label_ for ent in doc.ents]

            if 'ORG' in entity_labels and 'PERSON' in entity_labels:
                corrected_genders.append('Unknown')
            elif 'ORG' in entity_labels:
                corrected_genders.append('Institution')
            elif 'PERSON' in entity_labels:
                corrected_genders.append('Non-Female')
            else:
                corrected_genders.append('Unknown')

    return '|'.join(corrected_genders)

# Single- and multi-artist rows go through the same classifier;
# NaN genders are handled inside classify_artists.
def apply_classification(row):
    return classify_artists(row['Artist Display Name'],
                            row['Artist Gender'])

df['Corrected_Gender'] = df.apply(apply_classification, axis=1)


In [None]:
df['Corrected_Gender'].unique()

In [74]:
test1 = "Bryan"
test2 = "Prince Williams of the LMU University"
doc = nlp(test2)

In [75]:
for el in doc.ents:
    print(el.label_)
# Other token attributes to try while iterating over doc:
# token.text: The original text of the token.
# token.lemma_: The base form of the word (e.g., "running" -> "run").
# token.pos_: The part-of-speech tag (e.g., "NOUN", "VERB", "ADJ").
# token.tag_: A more detailed part-of-speech tag.
# token.dep_: The dependency relationship to other words in the sentence.
# token.is_stop: Whether the token is a common stop word (like "the", "a", "is").
# token.is_alpha: Whether the token is alphabetic.
# token.is_punct: Whether the token is punctuation.
# for el in doc:
#     print(el.is_stop)

PERSON
ORG



In [None]:
df["Corrected_Gender"].unique()

# Exercise (At home)

Create a bar plot of the gender distribution in the DataFrame after the correction.

<details>
    <summary>Solution</summary>

    corrected_gender_series = df['Corrected_Gender'].str.split('|').explode()
    gender_counts = corrected_gender_series.value_counts()
    gender_counts.plot(kind='bar', title='Corrected Gender Distribution')
</details>