# Fine-tuning GPT-4.1 to Write LinkedIn Posts (in my style)
## ABB #7 - Session 5

Code authored by: Shaw Talebi

### imports

In [1]:
import pandas as pd
import json
import random

import os
from openai import OpenAI
from dotenv import load_dotenv

In [2]:
# load secret key (OPENAI_API_KEY) from .env file
load_dotenv()

# connect to openai API
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

### functions

In [3]:
def clean_post(text):
    # Split into lines
    lines = text.split('\n')
    # Remove leading/trailing whitespace and double quotes from each line
    # (empty lines are kept so paragraph breaks survive)
    cleaned_lines = [line.strip().strip('"') for line in lines]
    # Join back into a single string
    return '\n'.join(cleaned_lines)
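
As a quick illustration of what `clean_post` does, here is a minimal sketch run on a made-up string. The sample text is an assumption for illustration, not a row from the dataset, and the function is repeated so the snippet runs on its own.

```python
def clean_post(text):
    # Same logic as the cell above: strip whitespace and double quotes per line
    lines = text.split('\n')
    cleaned_lines = [line.strip().strip('"') for line in lines]
    return '\n'.join(cleaned_lines)

# Hypothetical raw export where lines carry stray double quotes
raw_post = 'First line of the post"\n""\n"Second paragraph"'
cleaned = clean_post(raw_post)

# The quote-only line becomes an empty line, so the paragraph break is preserved
print(repr(cleaned))
```

Note that blank lines still count as lines, which matters later when posts are filtered by line count.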

## Step 1: Input-output Pairs

### Load Data

In [4]:
# read data
df = pd.read_csv('data/LI_posts.csv')

In [5]:
# change column names
df.columns = ['date', 'link', 'post', 'idea']

In [6]:
# Set dtypes
df = df.astype({
    'date': str,
    'link': str,
    'post': str,
    'idea': str
})

# Convert date column to datetime
df['date'] = pd.to_datetime(df['date'])

In [7]:
# set index
df = df.set_index('date')

In [8]:
df.head()

| date | link | post | idea |
|---|---|---|---|
| 2025-05-20 14:12:11 | https://www.linkedin.com/feed/update/urn%3Ali%... | LLM capabilities are doubling every 7 months…"... | METR benchmark. Extrapolation |
| 2025-05-19 20:10:49 | https://www.linkedin.com/feed/update/urn%3Ali%... | The greatest gap I see in AI today is a lack o... | Share LLM fine-tuning bootcamp Survey |
| 2025-05-19 14:39:24 | https://www.linkedin.com/feed/update/urn%3Ali%... | 7 Basic AI Terms (Simply) Explained…"\n""\n"1)... | 7 Basic AI Terms (Simply) Explained… |
| 2025-05-17 16:33:56 | https://www.linkedin.com/feed/update/urn%3Ali%... | Thanks for the shoutout Rami! | |
| 2025-05-16 13:22:43 | https://www.linkedin.com/feed/update/urn%3Ali%... | ML Foundations for AI Engineers (part 5/5)"\n"... | ML For AI Engineers (5/5) What is Reinforcemen... |


### Data Prep

In [9]:
# pre-process posts
df['post'] = df['post'].apply(clean_post)

In [10]:
# replace idea with first line of post
df['idea'] = df['post'].str.split('\n').str[0]

In [21]:
df.head()

| date | link | post | idea |
|---|---|---|---|
| 2025-05-20 14:12:11 | https://www.linkedin.com/feed/update/urn%3Ali%... | LLM capabilities are doubling every 7 months…\... | LLM capabilities are doubling every 7 months… |
| 2025-05-19 20:10:49 | https://www.linkedin.com/feed/update/urn%3Ali%... | The greatest gap I see in AI today is a lack o... | The greatest gap I see in AI today is a lack o... |
| 2025-05-19 14:39:24 | https://www.linkedin.com/feed/update/urn%3Ali%... | 7 Basic AI Terms (Simply) Explained…\n\n1) Lar... | 7 Basic AI Terms (Simply) Explained… |
| 2025-05-16 13:22:43 | https://www.linkedin.com/feed/update/urn%3Ali%... | ML Foundations for AI Engineers (part 5/5)\n\n... | ML Foundations for AI Engineers (part 5/5) |
| 2025-05-15 18:21:53 | https://www.linkedin.com/feed/update/urn%3Ali%... | The best part of running a community for tech ... | The best part of running a community for tech ... |


In [11]:
df.shape

(669, 3)

In [12]:
# drop rows with posts of fewer than 3 lines (blank lines count as lines)
df = df[df['post'].str.split('\n').str.len() >= 3]

In [13]:
df.shape

(638, 3)

## Step 2: Prompt Template

### Create training examples

In [14]:
# construct training examples
example_list = []

system_prompt = """# LinkedIn Ghostwriter

You are a LinkedIn Ghostwriter for Shaw Talebi, an AI educator and entrepreneur.

Given a post idea's first line from the user, generate a post in Shaw's unique style.

Include the following in each post:
- A compelling opening 1-2 lines that hooks the reader
- Copy that expands upon the idea in valuable way
- A call to action or share relevant content
- Don't use hashtags
"""

# pair each idea (first line) with its full post
for idea, post in zip(df['idea'], df['post']):
    system_dict = {"role": "system", "content": system_prompt}
    user_dict = {"role": "user", "content": idea}
    assistant_dict = {"role": "assistant", "content": post}

    messages_list = [system_dict, user_dict, assistant_dict]

    example_list.append({"messages": messages_list})

In [15]:
print(example_list[0]['messages'][0]['content'])
print("---")
print(example_list[0]['messages'][1]['content'])
print("---")
print(example_list[0]['messages'][2]['content'])

# LinkedIn Ghostwriter

You are a LinkedIn Ghostwriter for Shaw Talebi, an AI educator and entrepreneur.

Given a post idea's first line from the user, generate a post in Shaw's unique style.

Include the following in each post:
- A compelling opening 1-2 lines that hooks the reader
- Copy that expands upon the idea in valuable way
- A call to action or share relevant content
- Don't use hashtags

---
LLM capabilities are doubling every 7 months…
---
LLM capabilities are doubling every 7 months…

Here’s the most important LLM benchmark I’ve come across 👇 

A couple of months ago, the team at METR released a new AI benchmark.

Rather than evaluating AI systems in terms of accuracy on well-known datasets or artificial tasks, it evaluates them on real-world tasks measured in average human task completion time.

In other words, they took 170 tasks, measured how long it typically takes a human to do each, then evaluated whether an AI system could do each with >50% accuracy.

Curr

In [16]:
len(example_list)

638
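
Before writing the examples to disk, it can help to sanity-check that each one follows the chat format built above (a `messages` list of system, user, and assistant turns). The following is a minimal sketch, not part of the original notebook: `validate_example` is a hypothetical helper, and the two toy examples are made up for illustration.

```python
def validate_example(example):
    """Return a list of problems found in one chat-format training example."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, message in enumerate(messages):
        # Each message needs a known role and non-blank string content
        if message.get("role") not in {"system", "user", "assistant"}:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")

    # The final turn should be the assistant's target output
    if messages[-1].get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems

# Toy examples (made up for illustration)
good = {"messages": [
    {"role": "system", "content": "You are a ghostwriter."},
    {"role": "user", "content": "Post idea"},
    {"role": "assistant", "content": "Post idea\n\nFull post text"},
]}
bad = {"messages": [
    {"role": "system", "content": "You are a ghostwriter."},
    {"role": "user", "content": "   "},
]}

print(validate_example(good))
print(validate_example(bad))
```

In the notebook this could be applied as `[validate_example(e) for e in example_list]` to flag any rows worth inspecting.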

## Step 3: Create train/validation split

In [17]:
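# randomly pick out validation examples
num_examples = 68
# sample from every index (range's stop is exclusive, so no -1 is needed)
validation_index_list = random.sample(range(len(example_list)), num_examples)
validation_index_set = set(validation_index_list)
validation_data_list = [example_list[index] for index in validation_index_list]

# keep the remaining examples for training
example_list = [example for index, example in enumerate(example_list) if index not in validation_index_set]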

In [18]:
print(len(example_list))
print(len(validation_data_list))

570
68
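
The split above draws from Python's global random state, so the validation set changes on every run. If a repeatable split is wanted, one option is a seeded generator. This is a minimal sketch under assumptions: `split_examples` is a hypothetical helper, the seed value 42 is arbitrary, and the toy data stands in for `example_list`.

```python
import random

def split_examples(examples, num_validation, seed=42):
    """Split examples into (train, validation) using a seeded random generator."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    validation_indices = set(rng.sample(range(len(examples)), num_validation))
    train = [ex for i, ex in enumerate(examples) if i not in validation_indices]
    validation = [ex for i, ex in enumerate(examples) if i in validation_indices]
    return train, validation

# Toy data standing in for example_list
toy_examples = [{"id": i} for i in range(10)]
train, validation = split_examples(toy_examples, num_validation=3)
print(len(train), len(validation))
```

Selecting by index rather than removing by value also avoids surprises if two posts happen to be identical.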


In [19]:
# write examples to file
with open('data/train-data.jsonl', 'w') as train_file:
    for example in example_list:
        json.dump(example, train_file)
        train_file.write('\n')

with open('data/valid-data.jsonl', 'w') as valid_file:
    for example in validation_data_list:
        json.dump(example, valid_file)
        valid_file.write('\n')

### Upload data to OpenAI

In [None]:
# upload training and validation files (the with-blocks close the file handles)
with open("data/train-data.jsonl", "rb") as f:
    train_file = client.files.create(file=f, purpose="fine-tune")

with open("data/valid-data.jsonl", "rb") as f:
    valid_file = client.files.create(file=f, purpose="fine-tune")

## Step 4: Fine-tune model

In [None]:
client.fine_tuning.jobs.create(
    training_file=train_file.id,
    validation_file=valid_file.id,
    suffix="LI-post-writer",
    model="gpt-4.1-mini-2025-04-14",
    method={
        "type": "supervised",
        "supervised": {
            "hyperparameters": {
                "n_epochs": 3,
                "learning_rate_multiplier": 1.25,
                "batch_size": 1,
            }
        }
    }
)
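
The fine-tuning job runs asynchronously, so the fine-tuned model name is only available once the job finishes. Below is a minimal polling sketch, not something the original notebook does. It assumes the return value of `client.fine_tuning.jobs.create(...)` was stored (for example as `job`) so that its `id` can be passed in; `wait_for_job`, the 60-second interval, and the set of terminal status names are assumptions for illustration.

```python
import time

# Statuses treated as final here (an assumption; check the API reference)
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(client, job_id, poll_seconds=60, sleep=time.sleep):
    """Poll a fine-tuning job until it reaches a terminal status, then return it."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)  # wait before asking again

# Hypothetical usage (requires a real job, so it is left commented out):
# job = client.fine_tuning.jobs.create(...)
# finished = wait_for_job(client, job.id)
# print(finished.status, finished.fine_tuned_model)
```

The `sleep` parameter is injectable only so the function can be exercised without real waiting.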

## Step 5: Evaluate fine-tuned model

In [22]:
def generate_post(system_prompt, model_name, idea):
    response = client.chat.completions.create(
        model=model_name,
        messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": idea}
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

In [23]:
idea = "5 things I learned from 5 years on YouTube"

#### GPT-4.1-mini (no fine-tuning)

In [29]:
model_name = "gpt-4.1-mini-2025-04-14"
# model_name = "gpt-4.1-2025-04-14"

# read (long) system prompt
with open("prompts/linkedin-prompt.md", "r") as file:
    system_prompt_long = file.read()

print(generate_post(system_prompt_long, model_name, idea))

Step 1: Purpose and Audience  
- Purpose: To share clear, practical lessons learned from five years of experience on YouTube.  
- Audience: Content creators, entrepreneurs, and anyone interested in growing a YouTube channel or learning from long-term content experience.

Step 2: Wireframe  
- Hook: 5 things I learned from 5 years on YouTube  
- Body: List 5 specific, practical lessons or insights about YouTube content creation, growth, or operations.  
- CTA: Ask readers to share what they have learned from their own content efforts or which of these they find most useful.

Step 3: Write the body (meat)  
1. Consistency in publishing builds momentum.  
2. Video quality matters but don’t wait for perfect gear to start.  
3. Engaging with comments helps build a community.  
4. Titles and thumbnails significantly impact click-through rates.  
5. Analytics guide improvement — check watch time and retention carefully.  

Step 4: CTA  
What’s one lesson you’ve learned from creating c

#### GPT-4.1-mini (fine-tuned)

In [28]:
model_name = "ft:gpt-4.1-mini-2025-04-14:shawhin-talebi-ventures-llc:li-post-writer:CaR7nerw"

# print(system_prompt, "\n--")
print(generate_post(system_prompt, model_name, idea))

5 things I learned from 5 years on YouTube

1) Pick a medium you enjoy and that can scale

When I started making content 5 years ago, I had no idea I’d still be doing it today.

I picked YouTube because 15-min videos felt like a good way to dive deep into a topic and it had better scalability than writing.

2) Intros matter

People have short attention spans.

What’s worked for me is to start videos by clearly stating what I’m going to teach and why people should care.

3) Make the first 30 seconds count

YouTube’s algorithm depends on watch time.

That means if people don’t watch your full video (which is most video), it’s better to make a short one.

4) Help people with a problem you have

I like the phrase “making money while solving your problems.”

I’ve found that my best content is what helps me learn something.

5) Don’t look at the numbers

I don’t think this is unique to content creation.

Looking at metrics like views and subscribers can be a quick way t
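
Comparing a single idea by eye is a useful first check. To compare the two models over many ideas (for instance the held-out validation posts), a few simple surface statistics can help. The following is a rough sketch under assumptions: `style_stats` is a hypothetical helper, the chosen metrics are illustrative rather than a standard evaluation, and the sample text is made up.

```python
import re

def style_stats(post):
    """Compute simple surface statistics for a generated or real post."""
    non_empty_lines = [line for line in post.split("\n") if line.strip()]
    words = post.split()
    return {
        "num_lines": len(non_empty_lines),
        "num_words": len(words),
        # A '#' at the start of a token followed by a word character counts as a hashtag
        "has_hashtag": re.search(r"(?<!\S)#\w", post) is not None,
    }

# Made-up sample text for illustration
sample = "5 things I learned\n\n1) Start small\n\nWhat would you add? #youtube"
stats = style_stats(sample)
print(stats)
```

Running this over outputs from `generate_post` for each model and over the real posts would give a crude view of how closely length and formatting match, including whether the no-hashtag instruction is followed.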

In [26]:
# # delete files (after fine-tuning is done)
# client.files.delete(train_file.id)
# client.files.delete(valid_file.id)