**In this notebook, I analyze which topics and verbal contribution categories lead to '4h', a label that signifies attack, disagreement, and conflict. I use both Python code and generative AI tools such as Claude 2 for this analysis.**

# Packages

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk import ngrams, FreqDist
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import re
from collections import Counter
from nltk.tokenize import RegexpTokenizer
import unicodedata

# Data

In [3]:
df = pd.read_csv('/Users/ningyuhan/Desktop/combined_dataframe.csv')

In [4]:
def standardize_text(text):
    if pd.isna(text):
        return text  
    # Remove non-ASCII characters directly
    text = ''.join([char for char in text if ord(char) < 128])
    text = text.strip().lower()  
    # Remove everything except letters, digits, and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

df['Standardized_Sentence'] = df['Sentence'].apply(standardize_text)
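
As a quick check of what `standardize_text` does, the sketch below redefines the function so it runs on its own and applies it to two sentences that appear in the data previews in this notebook. The `samples` list is only an illustration, and `\u00d5` stands for the stray `Õ` character found in the raw text.

```python
import re

import pandas as pd


def standardize_text(text):
    """Lowercase, strip, and keep only ASCII letters, digits, and whitespace."""
    if pd.isna(text):
        return text
    # Drop non-ASCII characters such as mis-encoded apostrophes
    text = ''.join(char for char in text if ord(char) < 128)
    text = text.strip().lower()
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)


# Illustrative inputs taken from the previews shown in this notebook
samples = ["It was moved by Mr. Fulton,", "Laverne says he wasn\u00d5t fired,"]
cleaned = [standardize_text(s) for s in samples]
print(cleaned)
```

Note that punctuation inside words is dropped rather than replaced by a space, so contractions collapse into a single token.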

# Only initial 4h

In [8]:
df['Tag_Component'] = df['Tag'].apply(lambda x: x.split(',')[2] if len(x.split(',')) > 2 else None)
df['Is_4h'] = df['Tag_Component'] == '4h'

def get_relevant_rows(df):
    relevant_rows = []
    skip_flag = False  

    for i in range(len(df)):
        if df.iloc[i]['Is_4h']:
            if skip_flag:
                # If it's a consecutive '4h', skip it
                continue
            skip_flag = True  

            # Add up to 3 preceding sentences
            for j in range(1, 4):
                if i-j < 0:
                    break  
                if not df.iloc[i-j]['Is_4h']:
                    relevant_rows.append(df.iloc[i-j])

            relevant_rows.append(df.iloc[i])

        else:
            skip_flag = False  

    return pd.DataFrame(relevant_rows).sort_index()

adjusted_selected_df = get_relevant_rows(df)

adjusted_selected_df.head(5)

| Unnamed: 0 | Tag | Sentence | Standardized_Sentence | Tag_Component | Is_4h |
|---|---|---|---|---|---|
| 9 | [c:10,12,2a,04] | seconded by Mrs. Hammer, and unanimously carri... | seconded by mrs hammer and unanimously carried... | 2a | False |
| 10 | [c:11,15,1a,01] | It was moved by Mr. Fulton, | it was moved by mr fulton | 1a | False |
| 11 | [c:12,11,2a,01] | seconded by Mr. Helstein, and unanimously carr... | seconded by mr helstein and unanimously carrie... | 2a | False |
| 12 | [c:13,13,4h,01] | Dodson got mad at Ralph | dodson got mad at ralph | 4h | True |
| 46 | [c:47,04,2b,05] | Lambert echoing that. | lambert echoing that | 2b | False |
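
The loop in `get_relevant_rows` keeps only the first `4h` of each consecutive run. A vectorized way to find those run starts is to compare `Is_4h` with a shifted copy of itself. The sketch below shows this on a small made-up frame; the `toy` data, the `parse_tag` helper, and the `Is_Initial_4h` column are hypothetical and not part of the notebook.

```python
import pandas as pd


def parse_tag(tag):
    """Split a tag such as '[c:13,13,4h,01]' into its comma-separated parts."""
    return tag.strip('[]').split(',')


# Hypothetical mini-dataset using the same tag format as the `Tag` column
toy = pd.DataFrame({'Tag': [
    '[c:1,11,1b,03]', '[c:2,12,4d,03]', '[c:3,13,4h,03]',
    '[c:4,11,4h,03]', '[c:5,12,1c,04]', '[c:6,13,4h,04]',
]})

toy['Tag_Component'] = toy['Tag'].apply(lambda t: parse_tag(t)[2])
toy['Is_4h'] = toy['Tag_Component'] == '4h'

# A row starts a run when it is '4h' and the row before it is not
toy['Is_Initial_4h'] = toy['Is_4h'] & ~toy['Is_4h'].shift(1, fill_value=False)

initial_positions = toy.index[toy['Is_Initial_4h']].tolist()
print(initial_positions)
```

This only marks the run starts; collecting the preceding rows would still need a windowing step like the one in the loop.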


## Verbal Contribution Analysis

In [9]:
verbal_contribution_codes = {
    '1': 'Initiation activity',
    '1a': 'Motion making',
    '1b': 'Information giving',
    '1c': 'Information seeking',
    '1d': 'Making specific suggestions for action',
    '2': 'Support behavior',
    '2a': 'Motion seconding',
    '2b': 'Making statements in support of another person\'s argument',
    '3': 'System maintenance',
    '3a': 'Tension management',
    '3b': 'Direction of traffic',
    '3c': 'Collective spirit and solidarity moves',
    '4': 'Board discussion, debate, argumentation',
    '4a': 'Personal defensiveness',
    '4b': 'Personal gains',
    '4c': 'Agreeing reluctantly',
    '4d': 'Sensible, nonpersonal arguments',
    '4e1': 'Management',
    '4e2': 'Union',
    '4e3': 'People (workers) as distinct from union or the union leadership',
    '4f': 'Attempts to propose new board topics',
    '4g': 'Corporate interests',
    '4h': 'Disagreements, conflicts, attacks',
    '4i': 'Stonewalling',
    '5': 'Unclassified verbal behaviors',
}

### Verbal Contribution Categories That Lead To 4h

In [11]:
sequences_leading_to_4h = []

for index in range(len(adjusted_selected_df)):
    if adjusted_selected_df.iloc[index]['Is_4h']:
        sequence = []
        for j in range(1, 4):
            if index-j < 0:
                break  
            if not adjusted_selected_df.iloc[index-j]['Is_4h']:
                sequence.append(adjusted_selected_df.iloc[index-j]['Tag_Component'])
        sequences_leading_to_4h.append(tuple(reversed(sequence)))  

# Count the frequency of each sequence
sequence_counts = Counter(sequences_leading_to_4h)
top10_sequences = sequence_counts.most_common(10)
top10_sequences

[(('1b', '1b', '4d'), 5),
 (('2b', '4d', '4d'), 4),
 (('4d', '1c', '4d'), 4),
 (('4d', '4d'), 4),
 (('4a', '4a'), 4),
 (('4d', '4d', '4d'), 4),
 (('4d', '1c', '1b'), 4),
 (('1c', '1b', '1b'), 4),
 (('1b', '4d'), 3),
 (('1b', '4d', '1b'), 3)]
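
The sequence counts can also be flattened to see which individual categories occur most often in these windows. The sketch below reuses only the ten sequences printed above, so its totals cover that subset rather than the full dataset. Weighting each category by its sequence count is one possible aggregation, not the only one.

```python
from collections import Counter

# The top-10 sequences exactly as printed above
top10_sequences = [
    (('1b', '1b', '4d'), 5),
    (('2b', '4d', '4d'), 4),
    (('4d', '1c', '4d'), 4),
    (('4d', '4d'), 4),
    (('4a', '4a'), 4),
    (('4d', '4d', '4d'), 4),
    (('4d', '1c', '1b'), 4),
    (('1c', '1b', '1b'), 4),
    (('1b', '4d'), 3),
    (('1b', '4d', '1b'), 3),
]

# Add each sequence's count once per occurrence of a code in that sequence
category_totals = Counter()
for sequence, count in top10_sequences:
    for code in sequence:
        category_totals[code] += count

print(category_totals.most_common())
```

Within this subset, `4d` (Sensible, nonpersonal arguments) and `1b` (Information giving) account for most of the slots preceding a `4h`.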

In [12]:
translated_top10_sequences = []
for sequence, count in top10_sequences:
    translated_sequence = [verbal_contribution_codes.get(code, "Unknown") for code in sequence]
    translated_top10_sequences.append((translated_sequence, count))

formatted_results = []
for sequence, count in translated_top10_sequences:
    sequence_str = ', '.join(sequence)  # Join the categories into a single string
    sentence = f"[{sequence_str}] leads to 4h {count} times."
    formatted_results.append(sentence)

formatted_results

['[Information giving, Information giving, Sensible, nonpersonal arguments] leads to 4h 5 times.',
 "[Making statements in support of another person's argument, Sensible, nonpersonal arguments, Sensible, nonpersonal arguments] leads to 4h 4 times.",
 '[Sensible, nonpersonal arguments, Information seeking, Sensible, nonpersonal arguments] leads to 4h 4 times.',
 '[Sensible, nonpersonal arguments, Sensible, nonpersonal arguments] leads to 4h 4 times.',
 '[Personal defensiveness, Personal defensiveness] leads to 4h 4 times.',
 '[Sensible, nonpersonal arguments, Sensible, nonpersonal arguments, Sensible, nonpersonal arguments] leads to 4h 4 times.',
 '[Sensible, nonpersonal arguments, Information seeking, Information giving] leads to 4h 4 times.',
 '[Information seeking, Information giving, Information giving] leads to 4h 4 times.',
 '[Information giving, Sensible, nonpersonal arguments] leads to 4h 3 times.',
 '[Information giving, Sensible, nonpersonal arguments, Information giving] leads to 4h 3 times.']

## Topic Analysis

In [13]:
topic_codes = {
    '01': 'Routine Board Functions',
    '02': 'Scope of Board Issues',
    '03': 'Financial Status of Firm',
    '04': 'Management Personnel Issues',
    '05': 'Marketing and Sales',
    '06': 'Employee Stock Ownership Plan (ESOP) Financial',
    '07': 'ESOP Participation',
    '08': 'Employee Benefits',
    '09': 'Plant Production and Manufacturing Process Issues',
    '10': 'The Union and Contract Issues',
    '11': 'Supervision',
    '12': 'hog procurement',
    '13': 'Political Relations with the Community'
}

In [16]:
adjusted_selected_df_copy = adjusted_selected_df.copy()

adjusted_selected_df_copy['Topic_Category'] = adjusted_selected_df_copy['Tag'].apply(
    lambda x: x.split(',')[3].replace(']', '') if len(x.split(',')) > 3 else None
)
def get_relevant_topic_rows(df):
    relevant_rows = []
    skip_flag = False

    for i in range(len(df)):
        if df.iloc[i]['Is_4h']:
            if skip_flag:
                continue
            skip_flag = True

            # Add up to 3 preceding sentences for their topic categories
            for j in range(1, 4):
                if i-j < 0:
                    break
                if not df.iloc[i-j]['Is_4h']:
                    relevant_rows.append(df.iloc[i-j])

            # Add the '4h' row itself
            relevant_rows.append(df.iloc[i])
        else:
            skip_flag = False

    return pd.DataFrame(relevant_rows).sort_index()

df_topic_final = get_relevant_topic_rows(adjusted_selected_df_copy)
df_topic_final


| Unnamed: 0 | Tag | Sentence | Standardized_Sentence | Tag_Component | Is_4h | Topic_Category |
|---|---|---|---|---|---|---|
| 9 | [c:10,12,2a,04] | seconded by Mrs. Hammer, and unanimously carri... | seconded by mrs hammer and unanimously carried... | 2a | False | 04 |
| 10 | [c:11,15,1a,01] | It was moved by Mr. Fulton, | it was moved by mr fulton | 1a | False | 01 |
| 11 | [c:12,11,2a,01] | seconded by Mr. Helstein, and unanimously carr... | seconded by mr helstein and unanimously carrie... | 2a | False | 01 |
| 12 | [c:13,13,4h,01] | Dodson got mad at Ralph | dodson got mad at ralph | 4h | True | 01 |
| 46 | [c:47,04,2b,05] | Lambert echoing that. | lambert echoing that | 2b | False | 05 |
| ... | ... | ... | ... | ... | ... | ... |
| 4835 | [c:87,12,4h,06] | Tove gets angry at Lyle and turns on him and s... | tove gets angry at lyle and turns on him and s... | 4h | True | 06 |
| 4846 | [c:98,22,4e3,04] | which includes Walker, who is an old CEO. | which includes walker who is an old ceo | 4e3 | False | 04 |
| 4847 | [c:99,12,1c,04] | I ask if he wasn't the one who was fired and w... | i ask if he wasnt the one who was fired and wh... | 1c | False | 04 |
| 4848 | [c:100,22,1b,04] | Laverne says he wasnÕt fired, | laverne says he wasnt fired | 1b | False | 04 |


### Conflict topics ranked by frequency

In [17]:
df_4h = df_topic_final[df_topic_final['Tag_Component'] == '4h']

topics_counts = df_4h['Topic_Category'].value_counts()

translated_topics_counts = {topic_codes.get(key, "Unknown"): value for key, value in topics_counts.items()}
sorted_translated_topics = sorted(translated_topics_counts.items(), key=lambda x: x[1], reverse=True)
# Print every topic; slice with [:5] to keep only the five most frequent
ranked_topics = sorted_translated_topics
for topic, count in ranked_topics:
    print(f"{topic}: {count}")

Management Personnel Issues: 55
Financial Status of Firm: 48
The Union and Contract Issues: 40
Marketing and Sales: 21
Routine Board Functions: 20
Plant Production and Manufacturing Process Issues: 20
ESOP Participation: 20
Employee Benefits: 16
Employee Stock Ownership Plan (ESOP) Financial: 14
Supervision: 8
hog procurement: 4
Political Relations with the Community: 2
Scope of Board Issues: 1
Unknown: 1
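
Because three topics tie at 20, a strict "top 5" cut depends on how ties are broken. A minimal sketch, using the counts printed above and pandas' `Series.nlargest`, shows both options. The `topic_counts` series is typed in by hand from that output.

```python
import pandas as pd

# Conflict counts per topic, copied from the output above
topic_counts = pd.Series({
    'Management Personnel Issues': 55,
    'Financial Status of Firm': 48,
    'The Union and Contract Issues': 40,
    'Marketing and Sales': 21,
    'Routine Board Functions': 20,
    'Plant Production and Manufacturing Process Issues': 20,
    'ESOP Participation': 20,
    'Employee Benefits': 16,
    'Employee Stock Ownership Plan (ESOP) Financial': 14,
    'Supervision': 8,
    'hog procurement': 4,
    'Political Relations with the Community': 2,
    'Scope of Board Issues': 1,
    'Unknown': 1,
})

# Default keep='first': exactly five rows, ties resolved by order of appearance
top5_strict = topic_counts.nlargest(5)
# keep='all': every topic tied at the cutoff value is retained
top5_with_ties = topic_counts.nlargest(5, keep='all')

# Share of the counted '4h' rows that fall under the three most frequent topics
share_of_top3 = topic_counts.nlargest(3).sum() / topic_counts.sum()

print(top5_with_ties)
print(f"Top 3 topics: {share_of_top3:.1%} of counted '4h' rows")
```

With `keep='all'` the result has more than five rows whenever several topics share the cutoff value, which makes the tie explicit instead of hiding it.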


### Topics that lead to 4h

In [19]:
from collections import Counter

# Initialize a list for storing sequences
sequences_leading_to_4h_topics = []

# Iterate over df_topic_final to collect topic sequences leading to '4h'
for index in range(len(df_topic_final)):
    if df_topic_final.iloc[index]['Is_4h']:
        topic_sequence = []
        for j in range(1, 4):
            if index-j < 0 or df_topic_final.iloc[index-j]['Is_4h']:
                break
            topic_sequence.append(df_topic_final.iloc[index-j]['Topic_Category'])
        sequences_leading_to_4h_topics.append(tuple(reversed(topic_sequence)))

# Count the frequency of each sequence
topic_sequence_counts = Counter(sequences_leading_to_4h_topics)

# Get the top 10 sequences
top10_topic_sequences = topic_sequence_counts.most_common(10)

top10_topic_sequences


[(('03', '03', '03'), 32),
 (('10', '10', '10'), 27),
 (('04', '04', '04'), 24),
 (('09', '09', '09'), 11),
 (('05', '05', '05'), 10),
 (('08', '08', '08'), 9),
 (('07', '07', '07'), 8),
 (('01', '01', '01'), 8),
 (('04', '04'), 7),
 (('06', '06', '06'), 6)]

In [20]:
# Translate the top 10 topic sequences
translated_top10_topic_sequences = []
for sequence, count in top10_topic_sequences:
    translated_sequence = [topic_codes.get(code, "Unknown") for code in sequence]
    translated_top10_topic_sequences.append((translated_sequence, count))

# Simplified formatting of the translated sequences
formatted_results = []
for sequence, count in translated_top10_topic_sequences:
    # Remove consecutive duplicates
    unique_sequence = [sequence[i] for i in range(len(sequence)) if i == 0 or sequence[i] != sequence[i - 1]]
    sequence_str = ', '.join(unique_sequence)
    result = f"[{sequence_str}], {count}"
    formatted_results.append(result)

formatted_results


['[Financial Status of Firm], 32',
 '[The Union and Contract Issues], 27',
 '[Management Personnel Issues], 24',
 '[Plant Production and Manufacturing Process Issues], 11',
 '[Marketing and Sales], 10',
 '[Employee Benefits], 9',
 '[ESOP Participation], 8',
 '[Routine Board Functions], 8',
 '[Management Personnel Issues], 7',
 '[Employee Stock Ownership Plan (ESOP) Financial], 6']
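
The consecutive-duplicate removal in the formatting cell can also be written with `itertools.groupby`, which groups adjacent equal items. The sketch below is an equivalent formulation; `collapse_runs` is a name introduced here for illustration.

```python
from itertools import groupby


def collapse_runs(sequence):
    """Reduce each run of adjacent equal items to a single item."""
    return [key for key, _ in groupby(sequence)]


same_topic = collapse_runs(['Financial Status of Firm'] * 3)
mixed_topics = collapse_runs(['04', '04', '03', '04'])
print(same_topic)
print(mixed_topics)
```

Only adjacent repeats are merged, so a topic that returns after an interruption still appears twice, which preserves the order of topic changes.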

# AI: Which sentiments appear more often before 4h

## Top 10 analysis

**Prompts:** In the bracket, the third component is the verbal contribution category, the fourth component is the topic category. Below are the codes for those 2 categories:

**Verbal Contribution Codes:**
- '1': 'Initiation activity'
- '1a': 'Motion making'
- '1b': 'Information giving'
- '1c': 'Information seeking'
- '1d': 'Making specific suggestions for action'
- '2': 'Support behavior'
- '2a': 'Motion seconding'
- '2b': 'Making statements in support of another person's argument'
- '3': 'System maintenance'
- '3a': 'Tension management'
- '3b': 'Direction of traffic'
- '3c': 'Collective spirit and solidarity moves'
- '4': 'Board discussion, debate, argumentation'
- '4a': 'Personal defensiveness'
- '4b': 'Personal gains'
- '4c': 'Agreeing reluctantly'
- '4d': 'Sensible, nonpersonal arguments'
- '4e1-3': 'Special interest group gains and positions'
- '4e1': 'Management'
- '4e2': 'Union'
- '4e3': 'People (workers) as distinct from union or the union leadership'
- '4f': 'Attempts to propose new board topics'
- '4g': 'Corporate interests'
- '4h': 'Disagreements, conflicts, attacks'
- '4i': 'Stonewalling'
- '5': 'Unclassified verbal behaviors'

**Topic Codes:**
- '01': 'Routine Board Functions'
- '02': 'Scope of Board Issues'
- '03': 'Financial Status of Firm'
- '04': 'Management Personnel Issues'
- '05': 'Marketing and Sales'
- '06': 'Employee Stock Ownership Plan (ESOP) Financial'
- '07': 'ESOP Participation'
- '08': 'Employee Benefits'
- '09': 'Plant Production and Manufacturing Process Issues'
- '10': 'The Union and Contract Issues'
- '11': 'Supervision'
- '12': 'Hog procurement'
- '13': 'Political Relations with the Community'

I want to do some analysis around the verbal contribution "4h". I want an AI model to find which sentiments (just as a rough guide) appear more often before 4h. I need two results: the top 10 verbal contribution categories that lead to 4h, and the top 10 topics that lead to 4h.

### Result

Based on the data provided, here are the top 10 verbal contribution categories and topics that tend to precede the '4h' (Disagreements, conflicts, attacks) verbal contribution category:

**Top 10 verbal contribution categories preceding '4h':**

1. '1b' (Information giving)
2. '1c' (Information seeking)
3. '3b' (Direction of traffic)
4. '4d' (Sensible, nonpersonal arguments)
5. '4e1' (Special interest group gains and positions - Management)
6. '4e2' (Special interest group gains and positions - Union)
7. '4g' (Corporate interests)
8. '4i' (Stonewalling)
9. '05' (Unclassified verbal behaviors)
10. '2b' (Making statements in support of another person's argument)

**Top 10 topics preceding '4h':**

1. '03' (Financial Status of Firm)
2. '09' (Plant Production and Manufacturing Process Issues)
3. '10' (The Union and Contract Issues)
4. '05' (Marketing and Sales)
5. '08' (Employee Benefits)
6. '04' (Management Personnel Issues)
7. '12' (hog procurement)
8. '01' (Routine Board Functions)
9. '07' (ESOP Participation)
10. '11' (Supervision)

In summary, disagreements and conflicts tend to follow information exchange, attempts to guide the conversation, arguments related to special interests, corporate priorities, evasiveness, and statements supporting others' arguments. The most common preceding topics are finances, production, union issues, sales, benefits, management, and procurement.
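
Since the model's answer is free text, it can help to check the codes it returns against the code book before interpreting them. The sketch below assumes the ranked codes have been copied by hand from the result above into `ai_verbal_codes` (a name introduced here) and flags any code that is not a key of the verbal contribution code book.

```python
# Keys of the verbal contribution code book defined earlier in the notebook
valid_verbal_codes = {
    '1', '1a', '1b', '1c', '1d',
    '2', '2a', '2b',
    '3', '3a', '3b', '3c',
    '4', '4a', '4b', '4c', '4d', '4e1', '4e2', '4e3', '4f', '4g', '4h', '4i',
    '5',
}

# Codes transcribed by hand from the model's ranked list above
ai_verbal_codes = ['1b', '1c', '3b', '4d', '4e1', '4e2', '4g', '4i', '05', '2b']

# Any code outside the code book needs a manual look before it is used
unrecognized = [code for code in ai_verbal_codes if code not in valid_verbal_codes]
print(unrecognized)
```

Here `'05'` is flagged: the code book uses `'5'` for unclassified verbal behaviors, while `'05'` is the topic code for Marketing and Sales.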


## Direct Analysis

Without specific category limits, here is an analysis of the sentiments that tend to precede the '4h' (Disagreements, conflicts, attacks) verbal contribution code in the data:

**Information Exchange**: '4h' often follows the sharing of information, whether it's someone providing details ('1b'), seeking more information ('1c'), or attempting to guide the flow of discussion ('3b'). The disagreement sometimes springs from discrepancies in the information provided by different parties.

**Special Interests**: Arguments related to advancing specific agendas, like those of management ('4e1'), the union ('4e2'), or the company ('4g') often lead to conflicts if the interests are seen as misaligned.

**Evasiveness**: When someone is evading questions or discussions ('4i'), it can cause frustration that boils over into open disagreement.

**Support Statements**: Even when someone expresses support ('2b') for another's argument, there may be disagreement from others who feel an alternative position is being overlooked.

**Financial Concerns**: Discussions about the company's finances ('03') frequently lead to disagreement, especially regarding the causes of financial trouble and how to address it.

**Production Issues**: Problems related to manufacturing and operations ('09') also elicit differences of opinion on root causes and solutions.

**Unions and Contracts**: Negotiations and debates over unions ('10') and contracts often contain inherent conflicts of interest that lead to disputes.

**Benefits and Management**: Disagreements also commonly spring up around benefits ('08'), management changes ('04'), sales and marketing ('05'), procurement ('12'), and other frequent board discussion topics.

In summary, '4h' disagreements tend to follow conversations where interests are not aligned, information gaps exist, or the potential for disputes is already ripe based on the topic. A lack of trust and transparency tends to create the environment for open conflict.
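
One way to relate the model's answer to the code-based results is to compare the two topic lists directly. The sketch below assumes the computed list is taken from the top-10 topic sequences in the Topic Analysis section (each of which contains a single topic) and the model list from its ranked answer; both lists are transcribed by hand and the variable names are introduced here.

```python
# Topics from the computed top-10 sequences, in order of first appearance
computed_topics = ['03', '10', '04', '09', '05', '08', '07', '01', '06']

# Topics from the model's ranked list of topics preceding '4h'
ai_topics = ['03', '09', '10', '05', '08', '04', '12', '01', '07', '11']

shared = [t for t in computed_topics if t in ai_topics]
only_computed = [t for t in computed_topics if t not in ai_topics]
only_ai = [t for t in ai_topics if t not in computed_topics]

print(f"shared: {shared}")
print(f"only in computed list: {only_computed}")
print(f"only in model list: {only_ai}")
```

This compares membership only, not rank order, so it is a coarse check of agreement between the two approaches.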