Author: Madison Laprise

Date: 6/12/2024

Description: Added a column, Contains_Part_of_Score, that counts how many times a name is part of another name within the same article. It can be added back to the original dataframe when it is time. I also made a preliminary probability score model based on the contains/part-of score.

Struggles: Originally we had issues with the indexing on Lydia's dataset. I believe she is working on cleaning it up a bit more. 

Future plans: 




In [23]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('DW_indexed.csv')

# Create a new DataFrame to store results without altering the original
df_new = df.copy()

# Function to check if a name is part of another name
def contains_part_of(name, names_list):
    """
    Check if a given name is part of any other name in the provided list.
    
    Parameters:
    name (str): The name to check.
    names_list (list): The list of names to check against.
    
    Returns:
    bool: True if the name is part of another name, False otherwise.
    """
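    # A missing name (NaN is a float) cannot be part of another name
    if isinstance(name, float):
        return False
    for other_name in names_list:
        # Skip missing entries instead of ending the search early
        if isinstance(other_name, float):
            continue
        if str(name) != str(other_name) and str(name) in str(other_name):
            return True
    return False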

# Create a list of unique names
names = df['Name'].unique()

# Apply the function to each name and create a new column for contains/part-of score
df_new['Contains_Part_of_Score'] = df_new['Name'].apply(lambda x: -1 if contains_part_of(x, names) else 0)

# Display the first few rows of the new dataframe with the new column
print(df_new.head())

# Save the new DataFrame to a CSV file if needed
df_new.to_csv('worse_contains_score.csv', index=False)




  Article_Date_Published                                       Article_Body  \
0    2024-06-11 07:34:17  Reading Time: 3 minutes\nIsrael’s parliament m...   
1    2024-06-11 07:34:17  Reading Time: 3 minutes\nIsrael’s parliament m...   
2    2024-06-11 07:34:17  Reading Time: 3 minutes\nIsrael’s parliament m...   
3    2024-06-11 07:34:17  Reading Time: 3 minutes\nIsrael’s parliament m...   
4    2024-06-11 07:34:17  Reading Time: 3 minutes\nIsrael’s parliament m...   

                 Name Identity_Type Article_Source Voice  \
0         Benny Gantz        People   sabcnews.com   NaN   
1  Benjamin Netanyahu        People   sabcnews.com   NaN   
2        Yoav Gallant        People   sabcnews.com   NaN   
3       Gadi Eisenkot        People   sabcnews.com   NaN   
4    Bezalel Smotrich        People   sabcnews.com   NaN   

                             Article_Themes_AI_Model  \
0  [Primary: Conflict, war and peace|92% |Seconda...   
1  [Primary: Conflict, war and peace|92% |Seconda...

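"worse_contains_score.csv" shows a score of -1 if a name is contained within any other name in the dataset. This first version compares each name against every unique name in the file, not only the names in the same article. Otherwise, it shows a score of 0.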
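
To make the -1/0 rule concrete, here is a minimal sketch on a toy list of names. The names and the helper `is_part_of_another` are made up for illustration and are not taken from the dataset. The sketch assumes missing values should be skipped rather than counted as a match.

```python
import pandas as pd

# Hypothetical toy names, not taken from DW_indexed.csv
toy = pd.DataFrame({"Name": ["Biden", "Joe Biden", "Netanyahu", None]})

def is_part_of_another(name, candidates):
    """Return True if `name` is a substring of any different string in `candidates`."""
    if not isinstance(name, str):
        return False
    return any(
        isinstance(other, str) and name != other and name in other
        for other in candidates
    )

unique_names = toy["Name"].unique()
toy["Contains_Part_of_Score"] = toy["Name"].apply(
    lambda n: -1 if is_part_of_another(n, unique_names) else 0
)
print(toy)
```

Only "Biden" is flagged here, because it appears inside "Joe Biden". The missing name is left at 0.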

In [24]:
import pandas as pd

# Load the provided dataframe (assume df is already loaded)
# df = pd.read_csv('path/to/DW_indexed.csv')  # This line is for context

# Create a new DataFrame to store results without altering the original
df_new = df.copy()

# Ensure 'Article_ID' is a valid column; create a sample one if not
if 'Article_ID' not in df.columns:
    df_new['Article_ID'] = df.groupby(df.index // 10).ngroup()  # Sample grouping every 10 rows as an article

# Function to calculate the frequency-based contains/part-of score
def calculate_contains_part_of_score(name, article_id, df):
    """
    Calculate a frequency-based contains/part-of score for a given name within an article.
    
    Parameters:
    name (str): The name to check.
    article_id (int): The ID of the article to check within.
    df (DataFrame): The dataframe containing the data.
    
    Returns:
    int: A negative score proportional to the frequency of the name being part of another name.
    """
    # Ensure the name is a string
    if not isinstance(name, str):
        return 0
    
    # Get all names in the same article and ensure they are strings
    article_names = df[df['Article_ID'] == article_id]['Name'].dropna().astype(str).unique()
    
    # Count how many times 'name' is part of other names in the article
    frequency_count = sum(1 for other_name in article_names if name != other_name and name in other_name)
    
    # Return a negative score proportional to the frequency count
    return -frequency_count

# Apply the function to each name within the context of the same article
df_new['Contains_Part_of_Score'] = df_new.apply(
    lambda row: calculate_contains_part_of_score(row['Name'], row['Article_ID'], df_new),
    axis=1
)

# Display the first few rows of the new dataframe with the new column
print(df_new.head())

# Save the new DataFrame to a CSV file if needed
df_new.to_csv('contains_score.csv', index=False)



  Article_Date_Published                                       Article_Body  \
0    2024-06-11 07:34:17  Reading Time: 3 minutes\nIsrael’s parliament m...   
1    2024-06-11 07:34:17  Reading Time: 3 minutes\nIsrael’s parliament m...   
2    2024-06-11 07:34:17  Reading Time: 3 minutes\nIsrael’s parliament m...   
3    2024-06-11 07:34:17  Reading Time: 3 minutes\nIsrael’s parliament m...   
4    2024-06-11 07:34:17  Reading Time: 3 minutes\nIsrael’s parliament m...   

                 Name Identity_Type Article_Source Voice  \
0         Benny Gantz        People   sabcnews.com   NaN   
1  Benjamin Netanyahu        People   sabcnews.com   NaN   
2        Yoav Gallant        People   sabcnews.com   NaN   
3       Gadi Eisenkot        People   sabcnews.com   NaN   
4    Bezalel Smotrich        People   sabcnews.com   NaN   

                             Article_Themes_AI_Model  \
0  [Primary: Conflict, war and peace|92% |Seconda...   
1  [Primary: Conflict, war and peace|92% |Seconda...

contains_score.csv shows a frequency score: each other name in the same article that contains the given name contributes -1, so a name contained in three other names scores -3. A name that is not contained in any other name scores 0.
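
The per-article counting can be sketched on toy data as follows. This is an illustration under assumptions, not the cell above: the two articles, the names, and the helper `score_article` are hypothetical, and the sketch groups by `Article_ID` with `groupby(...).transform` so that each article is filtered only once.

```python
import pandas as pd

# Hypothetical toy data: two articles with overlapping names
toy = pd.DataFrame({
    "Article_ID": [0, 0, 0, 1, 1],
    "Name": ["Biden", "Joe Biden", "President Joe Biden", "Biden", "Gantz"],
})

def score_article(names):
    """Score each name in one article as minus the number of other distinct names containing it."""
    distinct = names.dropna().astype(str).unique()
    return names.map(
        lambda n: -sum(1 for other in distinct if n != other and n in other)
        if isinstance(n, str) else 0
    )

toy["Contains_Part_of_Score"] = (
    toy.groupby("Article_ID")["Name"].transform(score_article).astype(int)
)
print(toy)
```

In article 0, "Biden" is contained in two other names and scores -2, while "Joe Biden" scores -1. In article 1, "Biden" scores 0 because no other name in that article contains it.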



I am now going to test a preliminary probability model. Good luck Madi. 
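
For reference, the normalization step used in the model can be sketched on its own with a handful of made-up scores. The helper `min_max_scale` and its `fallback` parameter are hypothetical. Returning the fallback when every score is identical is an assumption made here to avoid dividing by zero.

```python
import pandas as pd

# Hypothetical scores in the same form as Contains_Part_of_Score (zero or negative)
scores = pd.Series([0, -1, -3, 0, -2], name="Contains_Part_of_Score")

def min_max_scale(values, fallback=1.0):
    """Scale values linearly to [0, 1]; return `fallback` everywhere if all values are equal."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        return pd.Series(fallback, index=values.index, dtype=float)
    return (values - lo) / (hi - lo)

scaled = min_max_scale(scores)
print(scaled.tolist())
```

A score of 0 (never contained in another name) maps to 1, and the most negative score maps to 0. The result is a rescaled score rather than a calibrated probability.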

In [2]:
import pandas as pd

# Load the provided dataframe
df = pd.read_csv('contains_score.csv')

# Create a new DataFrame to store results without altering the original
df_new = df.copy()

# Example function to map names to WikiData IDs (for demonstration purposes)
def map_name_to_wikidata_id(name):
    """
    Map a name to a WikiData ID. This is a placeholder function.
    
    Parameters:
    name (str): The name to map
    
    Returns:
    str: A WikiData ID
    """
    # Example mapping (in a real scenario, this would use an actual mapping method)
    mapping = {
        "Joe Biden": "Q6279",
        "Joseph": "Q1569"
    }
    return mapping.get(name, "Unknown")

# Function to calculate the final probability score based on the contains/part of score
def calculate_final_probability_score(row):
    """
    Calculate the final probability score based on the Contains/Part of Score.
    
    Parameters:
    row (pd.Series): A row from the dataframe
    
    Returns:
    float: A probability score between 0 and 1
    """
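    # Use a simple min-max normalization of the contains/part of score to be between 0 and 1
    min_score = df['Contains_Part_of_Score'].min()
    max_score = df['Contains_Part_of_Score'].max()
    # Guard against division by zero when every row has the same score
    # (returning 1.0 in that case is an assumption: no name stands out as contained)
    if max_score == min_score:
        return 1.0
    normalized_score = (row['Contains_Part_of_Score'] - min_score) / (max_score - min_score)
    return normalized_score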

# Apply the WikiData ID mapping function
df_new['WikiData_ID'] = df_new['Name'].apply(map_name_to_wikidata_id)

# Apply the probability score calculation function
df_new['Probability_Score'] = df_new.apply(calculate_final_probability_score, axis=1)

# Display the first few rows of the new dataframe with the new columns
print(df_new.head())

# Save the new DataFrame to a CSV file
output_path = 'test_contains_score_with_probabilities.csv'
df_new.to_csv(output_path, index=False)


  Article_Date_Published                                       Article_Body  \
0    2024-06-11 07:34:17  Reading Time: 3 minutes\nIsrael’s parliament m...   
1    2024-06-11 07:34:17  Reading Time: 3 minutes\nIsrael’s parliament m...   
2    2024-06-11 07:34:17  Reading Time: 3 minutes\nIsrael’s parliament m...   
3    2024-06-11 07:34:17  Reading Time: 3 minutes\nIsrael’s parliament m...   
4    2024-06-11 07:34:17  Reading Time: 3 minutes\nIsrael’s parliament m...   

                 Name Identity_Type Article_Source Voice  \
0         Benny Gantz        People   sabcnews.com   NaN   
1  Benjamin Netanyahu        People   sabcnews.com   NaN   
2        Yoav Gallant        People   sabcnews.com   NaN   
3       Gadi Eisenkot        People   sabcnews.com   NaN   
4    Bezalel Smotrich        People   sabcnews.com   NaN   

                             Article_Themes_AI_Model  \
0  [Primary: Conflict, war and peace|92% |Seconda...   
1  [Primary: Conflict, war and peace|92% |Seconda...