# Wavess Tech assignment

## Task 1

Goal: Build a small prototype that analyses LinkedIn post performance and audience
relevance.

Deliverables:

• Extract post features (text, hashtags, sentiment, etc.) and predict which posts perform best.

• Analyse the post audience (roles, seniority, company type) and rank their relevance to their ICP.

• Provide both outputs in a clean, structured format (CSV or small dashboard).

Tools: Python or R (you choose).


Bonus: Suggest how this workflow could be automated or scaled.

In [None]:
# references
# https://www.b2brocket.ai/blog-posts/how-to-hone-your-icp-and-generate-more-leads-on-linkedin

In [97]:
# actor 1 : https://console.apify.com/actors/D51Mxn506e71aDPys/information/latest/readme
# actor 2 : https://console.apify.com/actors/J9UfswnR3Kae4O6vm/information/latest/readme
# actor 3 : https://console.apify.com/actors/WTiV7eppiChuBc8Xq/information/latest/readme
# actor 4 : https://console.apify.com/actors/2XnpwxfhSW1fAWElp/information/latest/readme

In [40]:
import json  # needed by the json.dump / json.dumps calls in the scraper cells below
import os
import pandas as pd
from dotenv import load_dotenv
from apify_client import ApifyClient

# Load environment variables from .env file
load_dotenv()
api_token = os.getenv("APIFY_API_TOKEN")

# Initialize the ApifyClient
client = ApifyClient(api_token)

# The LinkedIn post URL we are analyzing
linkedin_post_url = 'https://www.linkedin.com/posts/klarna_klarnas-climate-resilience-program-activity-7346877091532959746-748v/'



In [43]:
print("Starting: Post Scraper")

# Actor and input configuration : "pratikdani/linkedin-posts-scraper"
post_scraper_actor_id = "D51Mxn506e71aDPys"

post_scraper_input = {
    "url": linkedin_post_url,
    "max_comments": 0 
}

# Run the actor and wait for it to finish
run = client.actor(post_scraper_actor_id).call(run_input=post_scraper_input)

# Fetch results from the actor's dataset
items = []
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    items.append(item)


with open('post_data.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, ensure_ascii=False, indent=4)

print(f"Success: Saved {len(items)} post object(s) to post_data.json")

if items:
    print("\n--- Sample of first post data ---")
    print(json.dumps(items[0], indent=2))

Starting: Post Scraper


[apify.linkedin-posts-scraper runId:Hudayw0duWsGrV2id] -> Status: RUNNING, Message: 
[apify.linkedin-posts-scraper runId:Hudayw0duWsGrV2id] -> 2025-11-12T13:11:36.492Z ACTOR: Pulling container image of build BIxJejEmBtxrvbg1J from registry.
[apify.linkedin-posts-scraper runId:Hudayw0duWsGrV2id] -> 2025-11-12T13:11:40.132Z ACTOR: Creating container.
[apify.linkedin-posts-scraper runId:Hudayw0duWsGrV2id] -> 2025-11-12T13:11:40.232Z ACTOR: Starting container.
[apify.linkedin-posts-scraper runId:Hudayw0duWsGrV2id] -> 2025-11-12T13:11:41.575Z Actor is running on the Apify platform, `disable_browser_sandbox` was changed to True.
[apify.linkedin-posts-scraper runId:Hudayw0duWsGrV2id] -> 2025-11-12T13:11:41.821Z [apify] INFO  Initializing Actor...
[apify.linkedin-posts-scraper runId:Hudayw0duWsGrV2id] -> 2025-11-12T13:11:41.825Z [apify] INFO  System info ({"apify_sdk_version": "2.7.3", "apif

Success: Saved 1 post object(s) to post_data.json

--- Sample of first post data ---
{
  "account_type": "Organization",
  "author_profile_pic": "https://media.licdn.com/dms/image/v2/D4E0BAQEKo8RSbFptLA/company-logo_100_100/company-logo_100_100/0/1706794863353/klarna_logo?e=2147483647&v=beta&t=ZmfQkR4vj3BEwNzTDEOzgsWswPNd4wpMYITYcTCkIL0",
  "date_posted": "2025-07-04T12:26:36.965Z",
  "document_cover_image": "https://media.licdn.com/dms/image/v2/D4D1FAQHG7VKVbr60DA/feedshare-document-cover-images_480/B4DZfVZuTtHMBU-/0/1751631988520?e=2147483647&v=beta&t=dBkxR6qTON_dN9f8jQMPQZWWEAWzDFgTXp1BVqleTQ0",
  "document_page_count": 3,
  "embedded_links": [
    "https://lnkd.in/d3tFWFHJ"
  ],
  "external_link_data": null,
  "hashtags": null,
  "headline": "\ud83c\udf0d We\u2019re proud to announce the launch of our AI for Climate Resilience Program.",
  "id": "7346877091532959746",
  "images": [
    "https://media.licdn.com/dms/image/v2/D4D1FAQHG7VKVbr60DA/feedshare-document-cover-images_480/B4D

In [45]:
print("Starting: Post Reactions Scraper")

# Actor and input configuration : "apimaestro/linkedin-post-reactions"
reactions_scraper_actor_id = "J9UfswnR3Kae4O6vm"
reactions_scraper_input = {
    "post_urls": [linkedin_post_url],
    "reaction_type": "ALL"
}

# Run the actor and wait for it to finish
run = client.actor(reactions_scraper_actor_id).call(run_input=reactions_scraper_input)

# Fetch results from the actor's dataset
items = []
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    items.append(item)

# Save the raw list of dictionaries to a JSON file
with open('post_reactions.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, ensure_ascii=False, indent=4)

print(f"Success: Saved {len(items)} reactions to post_reactions.json")

# Print the first item to verify
if items:
    print("\n--- Sample of first reaction data ---")
    print(json.dumps(items[0], indent=2))

Starting: Post Reactions Scraper


[apify.linkedin-post-reactions runId:U94EQjMOB0r8Y72zh] -> Status: RUNNING, Message: 
[apify.linkedin-post-reactions runId:U94EQjMOB0r8Y72zh] -> 2025-11-12T13:20:17.362Z ACTOR: Pulling container image of build 1JUeXg5YbqPyJOZYv from registry.
[apify.linkedin-post-reactions runId:U94EQjMOB0r8Y72zh] -> 2025-11-12T13:20:17.364Z ACTOR: Creating container.
[apify.linkedin-post-reactions runId:U94EQjMOB0r8Y72zh] -> 2025-11-12T13:20:17.427Z ACTOR: Starting container.
[apify.linkedin-post-reactions runId:U94EQjMOB0r8Y72zh] -> 2025-11-12T13:20:18.482Z INFO  System info {"apifyVersion":"3.5.0","apifyClientVersion":"2.17.0","crawleeVersion":"3.15.1","osType":"Linux","nodeVersion":"v20.19.5"}
[apify.linkedin-post-reactions runId:U94EQjMOB0r8Y72zh] -> 2025-11-12T13:20:18.671Z Input received: {
[apify.linkedin-post-reactions runId:U94EQjMOB0r8Y72zh] -> 2025-11-12T13:20:18.672Z   "post_urls": [
[apify.linkedin-pos

Success: Saved 98 reactions to post_reactions.json

--- Sample of first reaction data ---
{
  "reaction_type": "LIKE",
  "reactor": {
    "urn": "ACoAADzUcLcBZm6Fm3KykTDh-WWMcGkEJUXqC90",
    "name": "Sagar Mattikere Anand",
    "headline": "AI/ML Engineer & Backend Developer | RAG Systems | Python, Django, Docker | Turning Data Challenges into Scalable Solutions",
    "profile_url": "https://www.linkedin.com/in/ACoAADzUcLcBZm6Fm3KykTDh-WWMcGkEJUXqC90",
    "profile_pictures": {
      "small": "https://media.licdn.com/dms/image/v2/D5635AQEZ-0M5BiL9zw/profile-framedphoto-shrink_100_100/B56Zm4767dHAAk-/0/1759744340075?e=1763560800&v=beta&t=5vD8SODL_efjb1EyzWB2PhPVcCdnUopLzrsdqtl42wY",
      "medium": "https://media.licdn.com/dms/image/v2/D5635AQEZ-0M5BiL9zw/profile-framedphoto-shrink_200_200/B56Zm4767dHAAY-/0/1759744340075?e=1763560800&v=beta&t=8LjgT-oSD47kVcBPdZ5s-CfgQvQBalQdH0dTY19e8X4",
      "large": "https://media.licdn.com/dms/image/v2/D5635AQEZ-0M5BiL9zw/profile-framedphoto-shrink

In [46]:
print("Starting: Post Reshares Scraper")

# Actor and input configuration
reshares_scraper_actor_id = "apimaestro/linkedin-post-reshares"
reshares_scraper_input = {
    "post_urls": [linkedin_post_url]}

# Run the actor and wait for it to finish
run = client.actor(reshares_scraper_actor_id).call(run_input=reshares_scraper_input)

# Fetch results from the actor's dataset
items = []
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    items.append(item)

# Save the raw list of dictionaries to a JSON file
with open('post_reshares.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, ensure_ascii=False, indent=4)

print(f"Success: Saved {len(items)} reshares to post_reshares.json")

# Print the first item to verify
if items:
    print("\n--- Sample of first reshare data ---")
    print(json.dumps(items[0], indent=2))

Starting: Post Reshares Scraper


[apify.linkedin-post-reshares runId:IchCCn1UXJ9GUqCAi] -> Status: RUNNING, Message: 
[apify.linkedin-post-reshares runId:IchCCn1UXJ9GUqCAi] -> 2025-11-12T13:24:38.780Z ACTOR: Pulling container image of build b7UoQhPpVVTv3xzVo from registry.
[apify.linkedin-post-reshares runId:IchCCn1UXJ9GUqCAi] -> 2025-11-12T13:24:38.782Z ACTOR: Creating container.
[apify.linkedin-post-reshares runId:IchCCn1UXJ9GUqCAi] -> 2025-11-12T13:24:42.070Z ACTOR: Starting container.
[apify.linkedin-post-reshares runId:IchCCn1UXJ9GUqCAi] -> 2025-11-12T13:24:43.229Z INFO  System info {"apifyVersion":"3.5.0","apifyClientVersion":"2.17.0","crawleeVersion":"3.15.1","osType":"Linux","nodeVersion":"v20.19.5"}
[apify.linkedin-post-reshares runId:IchCCn1UXJ9GUqCAi] -> 2025-11-12T13:24:43.374Z Input received: {
[apify.linkedin-post-reshares runId:IchCCn1UXJ9GUqCAi] -> 2025-11-12T13:24:43.376Z   "post_urls": [
[apify.linkedin-post-resha

Success: Saved 15 reshares to post_reshares.json

--- Sample of first reshare data ---
{
  "id": "7359525268497100800",
  "url": "https://www.linkedin.com/feed/update/urn:li:activity:7359525268497100800",
  "reposter": {
    "urn": "",
    "name": "Momentswithbren",
    "profile_url": "https://www.linkedin.com/company/momentswithbren/?miniCompanyUrn=urn%3Ali%3Afsd_company%3A29290478",
    "profile_pictures": {
      "small": "",
      "medium": "",
      "large": "",
      "original": ""
    },
    "is_premium": false
  },
  "reshared_post": {
    "id": "7346877089213571072",
    "url": "https://www.linkedin.com/feed/update/urn:li:ugcPost:7346877089213571072",
    "author": {
      "name": "Klarna",
      "urn": "",
      "headline": "384,274 followers",
      "profile_url": "https://www.linkedin.com/company/klarna/posts",
      "profile_picture": ""
    },
    "type": {
      "format": "document",
      "audience": "public",
      "categories": []
    },
    "media": [
      {
       

In [None]:
print("Starting: Post Comments Scraper")

# Actor and input configuration
comments_scraper_actor_id = "apimaestro/linkedin-post-comments-replies-engagements-scraper-no-cookies"
comments_scraper_input = {
    "postIds": [linkedin_post_url],
    "sortOrder": "most recent"
}

# Run the actor and wait for it to finish
run = client.actor(comments_scraper_actor_id).call(run_input=comments_scraper_input)

# Fetch results from the actor's dataset
items = []
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    items.append(item)

# Save the raw list of dictionaries to a JSON file
with open('post_comments.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, ensure_ascii=False, indent=4)

print(f"Success: Saved {len(items)} comments to post_comments.json")

# Print the first item to verify
if items:
    print("\n--- Sample of first comment data ---")
    print(json.dumps(items[0], indent=2))

Starting: Post Comments Scraper


[apify.linkedin-post-comments-replies-engagements-scraper-no-cookies runId:XGRZbTfqKUzoduaXD] -> Status: RUNNING, Message: 
[apify.linkedin-post-comments-replies-engagements-scraper-no-cookies runId:XGRZbTfqKUzoduaXD] -> 2025-11-12T13:27:21.542Z ACTOR: Pulling container image of build 0ZMsbFMCNcaTZrVc9 from registry.
[apify.linkedin-post-comments-replies-engagements-scraper-no-cookies runId:XGRZbTfqKUzoduaXD] -> 2025-11-12T13:27:21.545Z ACTOR: Creating container.
[apify.linkedin-post-comments-replies-engagements-scraper-no-cookies runId:XGRZbTfqKUzoduaXD] -> 2025-11-12T13:27:21.713Z ACTOR: Starting container.
[apify.linkedin-post-comments-replies-engagements-scraper-no-cookies runId:XGRZbTfqKUzoduaXD] -> 2025-11-12T13:27:22.688Z INFO  System info {"apifyVersion":"3.5.0","apifyClientVersion":"2.17.0","crawleeVersion":"3.15.1","osType":"Linux","nodeVersion":"v20.19.5"}
[apify.linkedin-post-comments-replies-engagements-

Success: Saved 8 comments to post_comments.json

--- Sample of first comment data ---
{
  "comment_id": "7350462303886049280",
  "text": "Well done! You make lives of people so much easier and prettier! :) AI will definitely ease things even more!",
  "posted_at": {
    "timestamp": 1752486778232,
    "date": "2025-07-14 11:52:58",
    "relative": "17w"
  },
  "is_edited": false,
  "is_pinned": false,
  "comment_url": "https://www.linkedin.com/feed/update/urn:li:ugcPost:7346877089213571072?commentUrn=urn%3Ali%3Acomment%3A%28ugcPost%3A7346877089213571072%2C7350462303886049280%29&dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287350462303886049280%2Curn%3Ali%3AugcPost%3A7346877089213571072%29",
  "author": {
    "name": "Natalja \u013befonova",
    "headline": "Global service provider with expertise in Networking, SASE, Cloud & Web Security, CDN, Hosting and outstaffing.",
    "profile_url": "https://www.linkedin.com/in/natalja-%C4%BCefonova-7b255a46",
    "profile_picture": "https://media.lic
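
The four scraper cells above repeat the same run, fetch, and save steps. As a minimal sketch of how that pattern could be factored out when automating the workflow, the helper below wraps those steps in one function. The name `run_actor_to_json` is hypothetical; it relies only on the `client.actor(...).call(run_input=...)` and `client.dataset(...).iterate_items()` calls already used in this notebook, so any object exposing those two methods will work.

```python
import json
from pathlib import Path


def run_actor_to_json(client, actor_id, run_input, out_path):
    """Run one Apify actor, collect its dataset items, and save them as JSON.

    `client` is expected to behave like the ApifyClient used in the notebook.
    """
    run = client.actor(actor_id).call(run_input=run_input)
    if not run:
        # Defensive check (assumption: a failed call may yield no run info)
        raise RuntimeError(f"Actor {actor_id} returned no run information")

    # Materialize the dataset iterator into a plain list of dicts
    items = list(client.dataset(run["defaultDatasetId"]).iterate_items())

    # Same output settings as the notebook cells: UTF-8, non-ASCII kept, indent 4
    Path(out_path).write_text(
        json.dumps(items, ensure_ascii=False, indent=4), encoding="utf-8"
    )
    return items
```

With this in place, each scraper cell would reduce to one call, for example `run_actor_to_json(client, reactions_scraper_actor_id, reactions_scraper_input, "post_reactions.json")`, and a list of post URLs could be processed in a loop.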

## Data cleaning


In [87]:
import pandas as pd
import json

def load_json_file(filename):
    """Safely loads a JSON file into a list of dictionaries."""
    try:
        with open(filename, 'r', encoding='utf-8') as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        print(f"Warning: Could not load or parse {filename}. Returning empty list.")
        return []

# Load the raw data from each file
post_data_raw = load_json_file('post_data.json')
reactions_raw = load_json_file('post_reactions.json')
reshares_raw = load_json_file('post_reshares.json')
comments_raw = load_json_file('post_comments.json')

print(f"Loaded {len(post_data_raw)} post records.")
print(f"Loaded {len(reactions_raw)} reaction records.")
print(f"Loaded {len(reshares_raw)} reshare records.")
print(f"Loaded {len(comments_raw)} comment records.")

Loaded 1 post records.
Loaded 98 reaction records.
Loaded 15 reshare records.
Loaded 8 comment records.


In [88]:
import pandas as pd
import re

def clean_post_data(post_data_list):
    """Cleans and transforms the raw post data from the Apify actor."""
    if not post_data_list:
        print("No post data to process.")
        return pd.DataFrame()

    # We are only processing the first post from the list
    post = post_data_list[0]

    # Extract hashtags from post_text if the hashtags field is null
    hashtags = post.get('hashtags')
    if not hashtags:
        text = post.get('post_text') or ''  # also covers an explicit null value
        hashtags = re.findall(r"#(\w+)", text)

    # Prepare the cleaned data in a dictionary
    cleaned_post = {
        'post_id': post.get('id'),
        'post_url': post.get('url'),
        'post_text': (post.get('post_text') or '').strip(),
        'posted_at': pd.to_datetime(post.get('date_posted')),
        'author_name': post.get('user_id'), # Using user_id as it represents the author's handle
        'author_url': post.get('use_url'),  # use_url links to the author's page
        'likes_count': post.get('num_likes', 0),
        'comments_count': post.get('num_comments', 0),
        'shares_count': post.get('shares_count', 0), # Field not present, defaulting to 0
        'hashtags': hashtags,
        'media_urls': post.get('images', []) # 'images' field contains the list of media URLs
    }
    
    # Create a DataFrame from the single dictionary
    df = pd.DataFrame([cleaned_post])
    return df

# --- Execution ---
post_df = clean_post_data(post_data_raw)

print("Cleaned Post DataFrame:")
print(post_df.info())
print("\n")
# Display the full post text for verification
pd.set_option('display.max_colwidth', None)
display(post_df[['post_id', 'author_name', 'likes_count', 'comments_count', 'post_text']])

Cleaned Post DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype              
---  ------          --------------  -----              
 0   post_id         1 non-null      object             
 1   post_url        1 non-null      object             
 2   post_text       1 non-null      object             
 3   posted_at       1 non-null      datetime64[ns, UTC]
 4   author_name     1 non-null      object             
 5   author_url      1 non-null      object             
 6   likes_count     1 non-null      int64              
 7   comments_count  1 non-null      int64              
 8   shares_count    1 non-null      int64              
 9   hashtags        1 non-null      object             
 10  media_urls      1 non-null      object             
dtypes: datetime64[ns, UTC](1), int64(3), object(7)
memory usage: 220.0+ bytes
None




,post_id,author_name,likes_count,comments_count,post_text
0,7346877091532959746,klarna,98,7,"🌍 We’re proud to announce the launch of our AI for Climate Resilience Program. AI already powers everything we do at Klarna — and now we’re turning that same expertise toward the front lines of climate change. We take pride in our legacy as a climate leader, and we’re committed to driving positive change for the future. The AI for Climate Resilience Program will support pioneering projects that harness artificial intelligence to help climate-vulnerable communities adapt and thrive. This is technology in service of both people and the planet. This program will support local, practical, and community-owned solutions. From strengthening food security and improving health systems to building coastal resilience in the face of climate change. What’s on offer: 💸 Grants of up to $300,000 🧑🎓 Mentorship, training, and a supportive community of practice We encourage applications from organizations working to reduce vulnerability of local communities to climate-related risks in low- and middle-income countries. We welcome early stage applications as well, from teams that need support in developing technical details further. Whether you’re using AI to support smallholder farmers, build early warning systems, or translate complex risk data into community action plans, we want to hear from you! Find out more about the program and apply here 👉 https://lnkd.in/d3tFWFHJ"


In [91]:
import pandas as pd
import numpy as np
import re

def clean_and_finalize_audience(reactions_list, reshares_list, comments_list):
    """
    Cleans, normalizes, and consolidates all audience data, grouping by a
    normalized name so that each person or company appears only once.

    Note: grouping by name also merges distinct people who share the same name.
    """
    all_engagements = []

    # 1. Process and combine all sources into a single standardized list
    # Ensure that a name and profile_url exist before adding to the list
    for item in reactions_list:
        actor = item.get('reactor', {})
        if actor and actor.get('profile_url') and actor.get('name'):
            all_engagements.append({
                'profile_url': actor['profile_url'],
                'name': actor['name'],
                'headline': actor.get('headline'),
                'engagement_type': 'reaction'
            })
    for item in reshares_list:
        actor = item.get('reposter', {})
        if actor and actor.get('profile_url') and actor.get('name'):
            all_engagements.append({
                'profile_url': actor['profile_url'],
                'name': actor['name'],
                'headline': actor.get('headline'),
                'engagement_type': 'reshare'
            })
    for item in comments_list:
        if 'comment_id' not in item: continue # Skip summary object
        actor = item.get('author', {})
        if actor and actor.get('profile_url') and actor.get('name'):
            all_engagements.append({
                'profile_url': actor['profile_url'],
                'name': actor['name'],
                'headline': actor.get('headline'),
                'engagement_type': 'comment',
                'comment_text': item.get('text', '').strip()
            })

    if not all_engagements:
        print("No audience data to process.")
        return pd.DataFrame()

    # 2. Create a DataFrame and drop any rows that are missing a name
    df = pd.DataFrame(all_engagements).dropna(subset=['name'])

    # 3. **CRITICAL STEP: Normalize the 'name' column to use as a reliable key**
    df['normalized_name'] = df['name'].str.lower().str.strip()

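    # 4. One-hot encode engagement types
    df = pd.get_dummies(df, columns=['engagement_type'], prefix='', prefix_sep='').rename(
        columns={'reaction': 'reacted', 'comment': 'commented', 'reshare': 'reshared'}
    )
    # Guarantee every column the aggregation below expects, even if a source list was empty
    for col in ['reacted', 'commented', 'reshared']:
        if col not in df.columns:
            df[col] = 0
    if 'comment_text' not in df.columns:
        df['comment_text'] = np.nan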
    
    # 5. Group by the new 'normalized_name' and aggregate the data
    aggregation_rules = {
        'name': 'first',        # Keep one of the original capitalized names
        'profile_url': 'first', # Keep one of the URLs for reference
        'headline': 'first',    # Keep the first headline found
        'reacted': 'max',       # Consolidate one-hot columns
        'commented': 'max',
        'reshared': 'max',
        'comment_text': lambda x: ' || '.join(x.dropna()) if not x.dropna().empty else np.nan
    }
    unified_df = df.groupby('normalized_name').agg(aggregation_rules).reset_index(drop=True)

    # 6. Add profile type one-hot encoding based on the reference URL
    unified_df['is_company'] = unified_df['profile_url'].str.contains('/company/').astype(int)
    unified_df['is_individual'] = unified_df['profile_url'].str.contains('/in/').astype(int)
    
    # 7. Final Cleanup
    # Nullify the 'headline' for companies
    unified_df.loc[unified_df['is_company'] == 1, 'headline'] = np.nan
    
    # Fill any missing one-hot columns and ensure integer type
    for col in ['reacted', 'commented', 'reshared']:
        if col not in unified_df.columns: unified_df[col] = 0
        unified_df[col] = unified_df[col].fillna(0).astype(int)
    
    # Reorder columns for final output
    final_cols = [
        'name', 'headline', 'profile_url', 'is_individual', 'is_company', 
        'reacted', 'commented', 'reshared', 'comment_text'
    ]
    unified_df = unified_df[[col for col in final_cols if col in unified_df.columns]]
    
    return unified_df

# --- Execution ---
audience_final_df = clean_and_finalize_audience(reactions_raw, reshares_raw, comments_raw)

print("--- Final, Corrected Audience DataFrame ---")
print(f"Successfully consolidated into {len(audience_final_df)} unique entities.")
display(audience_final_df.head(10))

--- Final, Corrected Audience DataFrame ---
Successfully consolidated into 101 unique entities.


,name,headline,profile_url,is_individual,is_company,reacted,commented,reshared,comment_text
0,Alexander Farsan,Head of Climate and Environment at Klarna,https://www.linkedin.com/in/ACoAAAL-E6kBmzH0Blq_dzcDp-WtSU3OSYg26gY,1,0,1,0,1,
1,Alexandra Colac,"Sustainability, Circularity, Social and Environmental Impact in Value Chains",https://www.linkedin.com/in/ACoAAA1cp40BWBOz34A22Cgg8L0qaSXm8GTpcJY,1,0,1,0,1,
2,Alexandre Fernandes Ribeiro,"VP of Sales, Orchestration at Solidgate | Forbes 30 Under 30",https://www.linkedin.com/in/ACoAABGzRN8Bd8G_ELQegslYJxfvJJbz2dBH8J8,1,0,1,0,1,
3,Alicia Mireya Daza Castillo,Software Engineer,https://www.linkedin.com/in/ACoAABtWavgBCI8YAhxGSbyRrVz6_X0jnmW4xcw,1,0,1,0,0,
4,Anders Norlin,Venture Builder | Strategic Advisor | Board Executive | Investor | 💚 Impact°Nature°Climate°Tech | 💙💛🇺🇦,https://www.linkedin.com/in/ACoAAADrI4EB6MOg-PQqxjPTaPvRqQktwAtiwPg,1,0,1,0,0,
5,"Andrew Sangjun, Rhee PhD","Consultant | Ethnography Empirical Market Reserach; Behavioral Finance, Decarbonization | PhD | PwC",https://www.linkedin.com/in/ACoAACBUpIQBTN-Bhl3Y-iOH_278PKuXkibozCc,1,0,1,0,0,
6,András László Tölgyes,"We live in a world where there is more and more information, and less and less meaning | Iratkozz fel a STALKER-re!",https://www.linkedin.com/in/ACoAAAO2lIsBHVf3MzMCLSkGPz7mZdjsd4KM0rI,1,0,1,0,1,
7,Anna Samuelsson,Head of Impact at Milkywire,https://www.linkedin.com/in/ACoAAAj0a_YBQhe8tUaTspBVy4peYc4CT9f8a0k,1,0,1,0,0,
8,Arina Lopukhina,Business Analytics @PayPal | MSc. Data Science @UniMi,https://www.linkedin.com/in/ACoAABjIeKUBePaAMYDMYaEEICokC7XJ9xTSLZc,1,0,1,0,0,
9,Arina Timofeeva,Business Operations Specialist at Klarna,https://www.linkedin.com/in/ACoAACxhRwcBiBv3kmwhxpGtHeX1EqLhcpWyJ2I,1,0,1,0,0,
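
The sample outputs above show the reaction scraper returning URN-style profile URLs (`/in/ACoAA...`) while the comment scraper returns vanity URLs (`/in/natalja-...`), so the same person can appear under two different URLs. The sketch below illustrates, in plain Python, the name-keyed consolidation the cleaning function performs: rows are merged on a lowercased, stripped name and the engagement flags are OR-ed together. The function name `consolidate`, the `engagement` field, and the sample rows are hypothetical.

```python
def consolidate(engagements: list[dict]) -> list[dict]:
    """Merge engagement rows per normalized name, keeping the first name and URL seen."""
    merged: dict[str, dict] = {}
    for row in engagements:
        key = row["name"].strip().lower()  # normalized name used as the join key
        entry = merged.setdefault(key, {
            "name": row["name"],
            "profile_url": row["profile_url"],
            "reacted": 0, "commented": 0, "reshared": 0,
        })
        entry[row["engagement"]] = 1  # set the flag for this engagement type
    return list(merged.values())


# Hypothetical rows: one person seen via two scrapers, plus one company
rows = [
    {"name": "Jane Doe", "profile_url": "https://www.linkedin.com/in/ACoAAexample",
     "engagement": "reacted"},
    {"name": "jane doe ", "profile_url": "https://www.linkedin.com/in/jane-doe-example",
     "engagement": "commented"},
    {"name": "Example Co", "profile_url": "https://www.linkedin.com/company/example-co/",
     "engagement": "reshared"},
]
audience = consolidate(rows)
```

The trade-off is that two different people with identical names collapse into one row. If that matters, a composite key (for example name plus headline) is one option to consider.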


## Task 1 analysis


In [92]:
import pandas as pd
from textblob import TextBlob

def analyze_post_features(df):
    """
    Analyzes the post DataFrame to add sentiment and performance features.
    
    Args:
        df (pd.DataFrame): The cleaned DataFrame containing post data.

    Returns:
        pd.DataFrame: The DataFrame with added analysis columns.
    """
    if df.empty:
        print("Post DataFrame is empty. Cannot analyze.")
        return df

    # We are working with the first (and only) row
    post_text = df['post_text'].iloc[0]

    # --- 1.1: Sentiment Analysis ---
    blob = TextBlob(post_text)
    df['sentiment_polarity'] = blob.sentiment.polarity
    df['sentiment_subjectivity'] = blob.sentiment.subjectivity

    # Categorize sentiment based on polarity
    if blob.sentiment.polarity > 0.1:
        df['sentiment'] = 'Positive'
    elif blob.sentiment.polarity < -0.1:
        df['sentiment'] = 'Negative'
    else:
        df['sentiment'] = 'Neutral'

    # --- 1.2: Feature Creation ---
    df['text_length'] = len(post_text)
    df['hashtag_count'] = len(df['hashtags'].iloc[0])
    
    # Engagement metric for prototype (since we don't have follower count)
    df['engagement_sum'] = df['likes_count'] + df['comments_count']

    return df

# --- Execution ---
post_analyzed_df = analyze_post_features(post_df.copy()) # Use .copy() to avoid modifying the original df

# Display the results
print("--- Post Performance and Feature Analysis ---")
display(post_analyzed_df[[
    'sentiment', 'sentiment_polarity', 'likes_count', 'comments_count', 
    'engagement_sum', 'text_length', 'hashtag_count'
]].T.rename(columns={0: 'Value'}))

--- Post Performance and Feature Analysis ---


,Value
sentiment,Positive
sentiment_polarity,0.132487
likes_count,98
comments_count,7
engagement_sum,105
text_length,1374
hashtag_count,0
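
The cell above scores a single post, so it cannot yet say which posts perform best. As a minimal sketch of how the same labelling and engagement metric could extend to several posts, the snippet below enriches a list of post records and sorts them by `engagement_sum`. The ±0.1 polarity threshold and the likes-plus-comments metric come from the notebook; the helper names `label_sentiment` and `rank_posts` and the sample records are hypothetical.

```python
def label_sentiment(polarity: float, threshold: float = 0.1) -> str:
    """Map a polarity score in [-1, 1] to the three labels used in the notebook."""
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"


def rank_posts(posts: list[dict]) -> list[dict]:
    """Add sentiment and engagement_sum to each post, then sort by engagement (highest first)."""
    enriched = []
    for post in posts:
        enriched.append({
            **post,
            "sentiment": label_sentiment(post["sentiment_polarity"]),
            "engagement_sum": post["likes_count"] + post["comments_count"],
        })
    return sorted(enriched, key=lambda p: p["engagement_sum"], reverse=True)


# Hypothetical records; only post "a" mirrors the values shown in the output above
sample_posts = [
    {"post_id": "a", "sentiment_polarity": 0.132487, "likes_count": 98, "comments_count": 7},
    {"post_id": "b", "sentiment_polarity": -0.4, "likes_count": 150, "comments_count": 20},
    {"post_id": "c", "sentiment_polarity": 0.05, "likes_count": 10, "comments_count": 1},
]
ranked = rank_posts(sample_posts)
```

A ranking like this is descriptive only. Predicting performance would need many posts and a fitted model, which is outside what this prototype collects.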


In [94]:
import pandas as pd
import numpy as np

def analyze_audience_by_icp(df):
    """
    Analyzes the unified audience DataFrame to score and rank individuals
    based on a predefined Ideal Customer Profile (ICP).

    Args:
        df (pd.DataFrame): The final, cleaned, and unified audience DataFrame.

    Returns:
        A tuple containing:
        - pd.DataFrame: A ranked DataFrame of individuals with relevance scores.
        - pd.Series: A summary of the top roles found in the audience.
        - pd.Series: A summary of the top seniority levels found in the audience.
    """
    # 1. Filter for only individual profiles and drop rows with no headline
    individuals_df = df[(df['is_individual'] == 1) & (df['headline'].notna())].copy()
    if individuals_df.empty:
        print("No individuals with headlines found to analyze.")
        return pd.DataFrame(), pd.Series(dtype='int64'), pd.Series(dtype='int64')

    # 2. Define the Ideal Customer Profile (ICP) with keywords and scores
    icp_keywords = {
        # Role Keywords
        'sustainability': 3, 'climate': 3, 'esg': 3,
        'ai': 2, 'data scientist': 2, 'engineer': 1,
        'investor': 3, 'venture': 2,
        'founder': 3, 'entrepreneur': 2,
        'innovation': 2, 'tech': 1, 'consultant': 1,
        # Seniority Keywords
        'ceo': 3, 'chief': 3, 'c-level': 3,
        'head of': 2, 'vp': 2, 'director': 2, 'partner': 2,
        'lead': 1, 'manager': 1, 'senior': 1
    }
    
    role_keys = {'sustainability', 'climate', 'esg', 'ai', 'data scientist', 'engineer', 'investor', 'venture', 'founder', 'entrepreneur', 'innovation', 'tech', 'consultant'}
    seniority_keys = {'ceo', 'chief', 'c-level', 'head of', 'vp', 'director', 'partner', 'lead', 'manager', 'senior', 'founder'}

    # 3. Function to score a headline
    def score_headline(headline):
        score = 0
        matches = []
        headline_lower = headline.lower()
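        for keyword, value in icp_keywords.items():
            # Plain substring test: short keywords also match inside longer words
            # (e.g. 'ai' inside 'sustainability', 'tech' inside 'fintech').
            if keyword in headline_lower: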
                score += value
                matches.append(keyword)
        return score, matches

    # 4. Apply the scoring function to create new columns
    individuals_df[['relevance_score', 'matched_keywords']] = individuals_df['headline'].apply(
        lambda h: pd.Series(score_headline(h))
    )

    # 5. Rank the audience by the calculated score
    ranked_df = individuals_df.sort_values(by='relevance_score', ascending=False)
    
    # 6. Create Audience Summaries
    all_keywords = ranked_df[ranked_df['matched_keywords'].apply(lambda x: len(x) > 0)]['matched_keywords'].explode()
    
    top_roles = all_keywords[all_keywords.isin(role_keys)].value_counts()
    top_seniority = all_keywords[all_keywords.isin(seniority_keys)].value_counts()

    return ranked_df, top_roles, top_seniority


# --- Execution ---
# We use the 'audience_final_df' which has been properly de-duplicated
ranked_audience_final_df, top_roles, top_seniority = analyze_audience_by_icp(audience_final_df)


# --- Display Final Results ---
print("--- Audience Analysis & ICP Ranking (Final, Corrected Data) ---")

if not ranked_audience_final_df.empty:
    print("\nTop 10 Most Relevant Individuals in the Audience:")
    display(ranked_audience_final_df[[
        'name', 'headline', 'relevance_score', 'matched_keywords'
    ]].head(10))

    print("\n--- Audience Composition Summary ---")
    print("\nTop Roles Found:")
    display(top_roles.head(10).to_frame(name='Count'))

    print("\nTop Seniority Levels Found:")
    display(top_seniority.head(10).to_frame(name='Count'))
else:
    print("Could not generate audience analysis.")

--- Audience Analysis & ICP Ranking (Final, Corrected Data) ---

Top 10 Most Relevant Individuals in the Audience:


,name,headline,relevance_score,matched_keywords
85,Salah Said,Sustainability I ESG I Corporate Responsibility at Klarna I ex-Zalando | Social Innovation & Entrepreneurship | Design Thinking | BMI | LSP | Sustainable Fashion / Finance / E-commerce | Carbon Removal,12,"[sustainability, esg, ai, entrepreneur, innovation]"
24,Dhanesh Kothari,"Compassionate Capitalism | Founder of pi2Life - Funder, B2B Marketplace, Product Locator, Shopical | Ethics | Entrepreneur | Mentor | Board Advisor | Sustainability | Impact | VeganTech | FinTech | RetailTech | Digital",11,"[sustainability, ai, founder, entrepreneur, tech]"
74,Peter Githinji,"Climate Finance & Carbon Markets Strategist | Co-Founder, KiotaSIC | Accelerating Sustainable Innovation & Economic Opportunity in East Africa.",10,"[climate, ai, founder, innovation]"
62,"Mike Alves, MSAFP, CFP®, CEPA, CRPC®",Founder & Managing Partner | Late-Stage Pre-IPO AI & Humanoid Robotics Fund | Investing in Innovation & Future Tech | Building a Legacy for the Next Generation,10,"[ai, founder, innovation, tech, partner]"
97,Vy Do,Sustainability Strategy & Digital Transformation | Leveraging AI & Automation for Climate Impact | MSc Sustainability Leadership - Let's Connect!,9,"[sustainability, climate, ai, lead]"
4,Anders Norlin,Venture Builder | Strategic Advisor | Board Executive | Investor | 💚 Impact°Nature°Climate°Tech | 💙💛🇺🇦,9,"[climate, investor, venture, tech]"
51,Katarina Wangler Björk,"IMAGINE Leader | Climate, Sustain Tech & Policy | Sustainable Business Innovation Transformation & Finance | Ex-EY | MBA",9,"[climate, ai, innovation, tech, lead]"
50,Jörgen Johnsson,Founder and CEO at Johnson Real Estate Venture INC,8,"[venture, founder, ceo]"
37,Fredrik Andersson Guldkula,Climate tech investor @ R-evolution (Hexagon AB),7,"[climate, investor, tech]"
23,Deqa Abukar,"Co-founder BLING, Investor",6,"[investor, founder]"



--- Audience Composition Summary ---

Top Roles Found:


matched_keywords,Count
ai,17
engineer,15
tech,10
founder,7
climate,7
sustainability,5
innovation,5
investor,4
consultant,4
entrepreneur,3



Top Seniority Levels Found:


matched_keywords,Count
manager,19
lead,10
head of,10
senior,9
founder,7
partner,7
ceo,5
director,3
vp,2
chief,1
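
The scoring above uses a plain substring test, which explains some of the matches in the ranking: the top-ranked headline never mentions AI, yet `ai` is matched because it occurs inside "Sustainability". One possible alternative, sketched below under the assumption that whole-word matches are what the ICP intends, is to compile each keyword with `\b` word boundaries. The names `compile_patterns` and `score_headline_strict` are hypothetical, and only a subset of the notebook's keywords is shown.

```python
import re

# Subset of the notebook's icp_keywords, for illustration only
ICP_KEYWORDS = {"sustainability": 3, "climate": 3, "ai": 2, "tech": 1, "lead": 1}


def compile_patterns(keywords):
    """Build one case-insensitive whole-word regex per keyword."""
    return {
        kw: re.compile(rf"\b{re.escape(kw)}\b", re.IGNORECASE)
        for kw in keywords
    }


PATTERNS = compile_patterns(ICP_KEYWORDS)


def score_headline_strict(headline):
    """Return (score, matched_keywords) counting whole-word matches only."""
    matches = [kw for kw, pattern in PATTERNS.items() if pattern.search(headline)]
    return sum(ICP_KEYWORDS[kw] for kw in matches), matches


# Substring matching finds 'ai' here; whole-word matching does not
score_a, matched_a = score_headline_strict(
    "Sustainability I ESG I Corporate Responsibility at Klarna"
)
# A headline that really mentions AI still matches
score_b, matched_b = score_headline_strict(
    "Leveraging AI & Automation for Climate Impact"
)
```

Whole-word matching is stricter in both directions: it drops false positives such as `ai` in "Sustainability", but it also stops `tech` from matching "FinTech" and `lead` from matching "Leader". Whether those should count is an ICP decision, and variants could be added to the keyword list explicitly.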


In [96]:
# --- Save the final, analyzed DataFrames to CSV files ---

# 1. Save the post analysis summary
post_analyzed_df.to_csv('post_analysis_summary.csv', index=False)
print("Successfully saved: post_analysis_summary.csv")

# 2. Save the final, ranked audience analysis
# We select key columns for a clean final report
final_audience_columns = [
    'name', 'headline', 'relevance_score', 'matched_keywords', 
    'is_individual', 'is_company', 'reacted', 'commented', 'reshared', 
    'comment_text', 'profile_url'
]
ranked_audience_final_df[final_audience_columns].to_csv('audience_icp_ranking.csv', index=False)
print("Successfully saved: audience_icp_ranking.csv")

Successfully saved: post_analysis_summary.csv
Successfully saved: audience_icp_ranking.csv
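
One detail of the export above: `matched_keywords` holds Python lists, and `to_csv` writes a list cell using its string form (for example `['investor', 'founder']`), which downstream tools have to parse back. The sketch below shows one way to flatten list values into a delimited string before writing, using only the standard library. The function name `write_ranking_csv` and the `"; "` separator are assumptions for illustration.

```python
import csv
import io


def write_ranking_csv(rows, fieldnames, sep="; "):
    """Serialize rows to CSV text, joining any list-valued cell with `sep`."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        flat = {
            key: (sep.join(value) if isinstance(value, list) else value)
            for key, value in row.items()
        }
        writer.writerow(flat)
    return buffer.getvalue()


# One row taken from the ranking output above
rows = [{"name": "Deqa Abukar", "relevance_score": 6,
         "matched_keywords": ["investor", "founder"]}]
csv_text = write_ranking_csv(rows, ["name", "relevance_score", "matched_keywords"])
```

In the notebook itself, the equivalent would be to join the list column into strings before calling `to_csv`.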
