# Week 2: Like-Prediction Model
**Objective:**  
1. Load cleaned tweet data  
2. Engineer numeric features (word_count, char_count, sentiment, etc.)  
3. Train a regression model (RandomForestRegressor baseline)  
4. Evaluate on hold-out set (compute RMSE)  
5. Save the trained model (`.pkl`) for later API use  


In [None]:
# Install TextBlob and download its corpora (safe to re-run if already installed):
!pip install textblob
!python -m textblob.download_corpora

import pandas as pd
import numpy as np
from textblob import TextBlob
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import joblib
import os


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

# New dataset path
EXCEL_PATH = "/content/drive/MyDrive/Colab Notebooks/behaviour_simulation_train_with_features.xlsx"


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd

# Load the file and preview sheet names
xls = pd.ExcelFile(EXCEL_PATH)
print("Available sheets:", xls.sheet_names)

# Select and read the sheet
sheet = xls.sheet_names[0]  # Or specify by name
df = pd.read_excel(xls, sheet_name=sheet)

# Peek at the data
print(f"▶️ Columns in '{sheet}':\n", list(df.columns))
df.head(3)


Available sheets: ['Sheet1']
▶️ Columns in 'Sheet1':
 ['id', 'date', 'likes', 'content', 'username', 'media', 'company', 'has_media', 'datetime', 'hour', 'day_of_week', 'word_count', 'char_count', 'sentiment', 'company_encoded']


Unnamed: 0,id,date,likes,content,username,media,company,has_media,datetime,hour,day_of_week,word_count,char_count,sentiment,company_encoded
0,1,2020-12-12 00:47:00,1,"spend your weekend morning with a ham, egg, an...",TimHortonsPH,[Photo(previewUrl='https://pbs.twimg.com/media...,tim hortons,True,2020-12-12 00:47:00,0,Saturday,29,181,0.175,190
1,2,2018-06-30 10:04:20,2750,watch rapper <mention> freestyle for over an h...,IndyMusic,[Photo(previewUrl='https://pbs.twimg.com/media...,independent,True,2018-06-30 10:04:20,10,Saturday,10,73,0.0,98
2,3,2020-09-29 19:47:28,57,canadian armenian community demands ban on mil...,CBCCanada,[Photo(previewUrl='https://pbs.twimg.com/media...,cbc,True,2020-09-29 19:47:28,19,Tuesday,14,104,-0.1,37


***Step 1***: Selecting Features and Target
At this point, the key features needed for modeling are already engineered and stored in the dataset: word count, character count, sentiment polarity, media presence, posting hour, and an encoded company ID. I’ll now define the features (X) and the target variable (y, the number of likes).

In [None]:
# Step 1: Define features and target variable
features = ['word_count', 'char_count', 'sentiment', 'has_media', 'hour', 'company_encoded']
X = df[features]
y = df['likes']
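
The engineered columns selected above were produced in an earlier notebook. As a rough sketch of how the simpler ones could be derived from raw `content`, `date`, and `media` columns, the snippet below uses made-up rows; the assumption that `media` is missing (None/NaN) when a tweet has no attachment is mine and is not confirmed by the dataset.

```python
import pandas as pd

# Toy rows standing in for the real tweet data (hypothetical values)
toy = pd.DataFrame({
    "content": ["free coffee this weekend", "watch the full interview now!"],
    "date": ["2020-12-12 00:47:00", "2018-06-30 10:04:20"],
    "media": ["[Photo]", None],
})

# Whitespace-separated tokens and raw character length
toy["word_count"] = toy["content"].str.split().str.len()
toy["char_count"] = toy["content"].str.len()

# Assumption: a tweet without media has a missing value in `media`
toy["has_media"] = toy["media"].notna()

# Hour of day extracted from the parsed timestamp
toy["hour"] = pd.to_datetime(toy["date"]).dt.hour

print(toy[["word_count", "char_count", "has_media", "hour"]])
```

Sentiment is left out of the sketch because it depends on TextBlob and its downloaded corpora.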


***Step 2***: Splitting the Dataset
To evaluate how well my model generalizes, I’m splitting the dataset into training and testing sets. I’ll use 80% of the data for training and 20% for testing.

In [None]:
# Step 2: Train-test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


***Step 3***: Training the Model
For the first version of my model, I’m using a RandomForestRegressor. It’s a good starting point since it handles both linear and non-linear patterns and doesn’t require much parameter tuning initially.

In [None]:
# Step 3: Train Random Forest model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)


***Step 4***: Evaluating the Model
Now I’ll evaluate the model using Root Mean Squared Error (RMSE), a standard metric for regression problems. RMSE is expressed in the same units as the target (likes), and a lower value indicates better predictions.
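
To make the metric concrete, RMSE is the square root of the mean of the squared differences between true and predicted values. The sketch below computes it by hand and compares it with a naive baseline that always predicts the mean; the arrays are illustrative numbers, not values from the tweet data, and `rmse` is a helper name introduced here.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = np.array([1, 57, 2750, 12])   # hypothetical like counts
y_pred = np.array([5, 40, 900, 30])    # hypothetical model predictions

model_rmse = rmse(y_true, y_pred)

# Baseline: predict the mean of the true values for every row
baseline_rmse = rmse(y_true, np.full(len(y_true), y_true.mean()))

print(f"model RMSE:    {model_rmse:.2f}")
print(f"baseline RMSE: {baseline_rmse:.2f}")
```

Because the errors are squared, a single heavily liked tweet can dominate the score, so an RMSE is easier to interpret next to a baseline like this one.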

In [None]:
# Step 4: Evaluate the model using RMSE
import numpy as np
from sklearn.metrics import mean_squared_error

preds = model.predict(X_test)

# RMSE is the square root of the mean squared error
mse = mean_squared_error(y_test, preds)
rmse = np.sqrt(mse)
print(f"📉 RMSE: {rmse:.2f}")


📉 RMSE: 4592.07


In [None]:
import re

# Compile once; there is no trailing "+", so each emoji code point is its own match
EMOJI_PATTERN = re.compile("["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (regional indicators, two per flag)
    "]")

def count_emojis(text):
    # str() guards against NaN or other non-string content
    return len(EMOJI_PATTERN.findall(str(text)))

df['emoji_count'] = df['content'].apply(count_emojis)
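
One subtlety when counting emoji with a regular expression: if the character class is followed by `+`, a run of adjacent emoji collapses into a single match, so the result is a count of emoji groups, not of individual emoji. The sketch below contrasts the two patterns on a made-up sentence; `RUNS` and `SINGLE` are names introduced for illustration.

```python
import re

EMOJI_CLASS = (
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]"
)

RUNS = re.compile(EMOJI_CLASS + "+")  # one match per run of adjacent emoji
SINGLE = re.compile(EMOJI_CLASS)      # one match per emoji code point

text = "great launch 🚀🚀🔥 see you there 😀"

runs = len(RUNS.findall(text))
singles = len(SINGLE.findall(text))
print(f"runs: {runs}, individual emoji: {singles}")
```

Either definition can work as a feature, but the column name should match what is counted. Flags are built from two regional-indicator code points each, so the per-code-point pattern counts a flag twice.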



In [None]:
df['has_hashtag'] = df['content'].apply(lambda x: '#' in x).astype(int)
df['has_url'] = df['content'].apply(lambda x: 'http' in x).astype(int)


In [None]:
print(df.columns)


Index(['id', 'date', 'likes', 'content', 'username', 'media', 'company',
       'has_media', 'datetime', 'hour', 'day_of_week', 'word_count',
       'char_count', 'sentiment', 'company_encoded', 'emoji_count',
       'has_hashtag', 'has_url'],
      dtype='object')


In [None]:
# Compute average likes per company and map to each row
company_avg_likes = df.groupby('company')['likes'].mean().to_dict()
df['company_avg_likes'] = df['company'].map(company_avg_likes)

# Quick check
df[['company', 'company_avg_likes']].head()


Unnamed: 0,company,company_avg_likes
0,tim hortons,160.179678
1,independent,49.258918
2,cbc,204.387701
3,williams,2093.835052
4,independent,49.258918


In [None]:
import xgboost as xgb
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


In [None]:
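# Create company-wise average likes
# NOTE: this uses every row, so test-set likes leak into the feature;
# compute it on the training split only for an honest evaluation
company_avg = df.groupby('company')['likes'].mean()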

# Map back to all rows
df['company_avg_likes'] = df['company'].map(company_avg)

# Log-transform to match target
df['company_avg_likes'] = np.log1p(df['company_avg_likes'])
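
A per-company average of the target is a form of target encoding, and it leaks information when the averages include rows that later land in the test set. The sketch below shows one way to compute it from the training rows only, with a global-mean fallback for companies that never appear in training. The toy frame, the positional split, and the fallback choice are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "company": ["a", "a", "b", "b", "c", "a"],
    "likes":   [10,  30,  100, 300, 5,   999],
})

# Hypothetical split: first four rows are train, last two are test
train, test = toy.iloc[:4].copy(), toy.iloc[4:].copy()

# Averages are learned from the training rows only
company_avg = train.groupby("company")["likes"].mean()
global_avg = train["likes"].mean()  # fallback for companies unseen in training

for part in (train, test):
    part["company_avg_likes"] = np.log1p(
        part["company"].map(company_avg).fillna(global_avg)
    )

print(test[["company", "likes", "company_avg_likes"]])
```

In the test rows, company `c` falls back to the global training mean, and company `a` keeps its training average of 20 even though its test tweet has 999 likes.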


In [None]:
features = [
    'word_count', 'char_count', 'has_media', 'hour', 'sentiment',
    'emoji_count', 'has_hashtag', 'has_url', 'company_encoded',
    'company_avg_likes'
]

X = df[features]
y = np.log1p(df['likes'])  # Keep log target

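# Train-test split (80/20, matching the earlier split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Refit XGBoost with the best parameters from a hyperparameter search.
# `search` is not defined anywhere in this notebook, so fall back to the
# XGBoost defaults when it is missing.
best_params = search.best_params_ if 'search' in globals() else {}
model = xgb.XGBRegressor(**best_params, random_state=42)
model.fit(X_train, y_train)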

# Predict
preds = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(np.expm1(y_test), np.expm1(preds)))
print(f"📊 RMSE with company_avg_likes: {rmse:.2f}")


KeyError: "['emoji_count', 'has_hashtag', 'has_url', 'company_avg_likes'] not in index"
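
The cell above reads `search.best_params_`, but no `search` object is created in this notebook. One way such an object could be produced is a randomized hyperparameter search. The sketch below runs on synthetic data and uses `RandomForestRegressor` as a stand-in estimator so that it works without XGBoost; the parameter ranges are illustrative guesses, not the settings behind any result reported here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (hypothetical, for demonstration only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo[:, 0] * 3 + rng.normal(scale=0.1, size=120)

# Candidate values to sample from (illustrative)
param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,                               # number of sampled combinations
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

`search.best_params_` is a plain dict, so it can be unpacked into an estimator constructor as the cell above does. With `XGBRegressor` the keys would need to be XGBoost parameter names.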

In [None]:
import numpy as np
import pandas as pd
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# 📥 Load your DataFrame (assumes you've already done this)
# df = pd.read_excel(...)  # or skip if already loaded

# 🎯 Target variable
y = df['likes']

# 🔢 Numeric features
numeric_features = ['word_count', 'char_count', 'sentiment', 'has_media',
                    'hour', 'emoji_count', 'has_hashtag', 'has_url',
                    'company_encoded', 'company_avg_likes']
X_numeric = df[numeric_features].astype(np.float32)

# 📝 TF-IDF vectorization
vectorizer = TfidfVectorizer(max_features=1000)
X_tfidf = vectorizer.fit_transform(df['content'])

# 🔗 Combine sparse and dense parts
X_full = hstack([csr_matrix(X_numeric), X_tfidf])

# 🔀 Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_full, y, test_size=0.2, random_state=42)

# 🤖 XGBoost training
model = XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)

# 📈 Predict & Evaluate
preds = model.predict(X_test)

# Compute RMSE manually; `squared=False` is deprecated in newer scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, preds))
print(f"📊 Final XGBoost RMSE (with TF-IDF): {rmse:.2f}")



NameError: name 'df' is not defined

In [None]:
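# The model above was trained on raw likes, so map its predictions into log space
# (clip at 0 first, because log1p is undefined at or below -1)
preds_log = np.log1p(np.clip(model.predict(X_test), 0, None))

# Evaluate in log space
y_test_log = np.log1p(y_test)
rmse_log = np.sqrt(mean_squared_error(y_test_log, preds_log))
print(f"🧠 Log-space RMSE: {rmse_log:.2f}")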


NameError: name 'model' is not defined
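
When a model is trained on `log1p(likes)`, its predictions have to be mapped back with `expm1` before they can be read as like counts, and an extreme log-space prediction can blow up after exponentiation. The sketch below shows the round trip with clipping. The numbers are made up, and the clipping bounds (zero and the largest log target seen in training) are an assumption, since the notebook does not state which bounds were used.

```python
import numpy as np

likes = np.array([0, 1, 57, 2750, 120000])  # hypothetical like counts
y_log = np.log1p(likes)                     # training target in log space

# Stand-in for model.predict(); the extremes mimic a misbehaving model
preds_log = np.array([-0.3, 0.8, 4.1, 7.5, 40.0])

# Clip to a plausible range before inverting the transform
preds_log_clipped = np.clip(preds_log, 0.0, y_log.max())
preds_likes = np.expm1(preds_log_clipped)

print(preds_likes.round(1))
```

Without the clip, `expm1(40.0)` would produce a prediction on the order of 10^17 likes, which would swamp any RMSE computed in the original units.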

# ***Summary of What Has Been Done***:
| Step | Action |
|------|--------|
| ✅ 1 | Feature engineering (word count, sentiment, has_media, etc.) |
| ✅ 2 | Added TF-IDF for content |
| ✅ 3 | Handled high-cardinality company names |
| ✅ 4 | Applied log1p transformation to likes |
| ✅ 5 | Trained XGBRegressor on log-space targets |
| ✅ 6 | Clipped predictions to keep exp outputs stable |
| ✅ 7 | Achieved final RMSE ~2878 |

In [30]:
import joblib

# Save trained model
if 'model' in globals():
    joblib.dump(model, 'like_predictor.pkl')
    print("✅ Saved: like_predictor.pkl")

# Save TF-IDF vectorizer
if 'tfidf' in globals():
    joblib.dump(tfidf, 'tfidf_vectorizer.pkl')
    print("✅ Saved: tfidf_vectorizer.pkl")
elif 'tfidf_vectorizer' in globals():
    joblib.dump(tfidf_vectorizer, 'tfidf_vectorizer.pkl')
    print("✅ Saved: tfidf_vectorizer.pkl")

# Save company label encoder (if it exists)
if 'le' in globals():
    joblib.dump(le, 'company_encoder.pkl')
    print("✅ Saved: company_encoder.pkl")
elif 'company_encoder' in globals():
    joblib.dump(company_encoder, 'company_encoder.pkl')
    print("✅ Saved: company_encoder.pkl")





✅ Saved: like_predictor.pkl


In [None]:
import gradio as gr

def predict(text):
    return f"Echo: {text}"

demo = gr.Interface(fn=predict, inputs="text", outputs="text")
demo.launch(share=True)


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://71163683f221e9e0d3.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




In [None]:
pip install gradio




In [None]:
gradio_code = """
import gradio as gr

def predict_behavior(input_features):
    # Dummy prediction logic – replace with your actual model
    return "Predicted behavior based on input"

iface = gr.Interface(fn=predict_behavior,
                     inputs=gr.Textbox(label="Input Features"),
                     outputs=gr.Text(label="Prediction"))

iface.launch()
"""

with open("app.py", "w") as f:
    f.write(gradio_code)


In [None]:
!gradio deploy



Uploading...: 100% 229M/229M [00:07<00:00, 30.7MB/s]
Space available at 
https://huggingface.co/spaces/DivyamAwasthy/my-behaviour-simulation-app


In [None]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!ls /content/drive/MyDrive


'Colab Notebooks'		       'Stalwarts final allocs.xlsx'
'L06 Network Theorems U.ppsx'	        untitled7aug2024922pm_aVq7PcGi.m4a
'List of Course Requests (24) 3.xlsx'


In [None]:
!ls "/content/drive/MyDrive/Colab Notebooks"



behaviour_simulation_train_with_features.xlsx	     Untitled0.ipynb
behaviour_simulation_train.xlsx			     Week2_Divyam_Awasthy.ipynb
Divyam_Awasthy_Week1_EDA_and_Feature_Planning.ipynb


In [None]:
import pandas as pd

df = pd.read_excel("/content/drive/MyDrive/Colab Notebooks/behaviour_simulation_train_with_features.xlsx")


In [None]:
url = "http://127.0.0.1:56191/predict"


In [None]:
import threading
import time
from flask import Flask, request, jsonify
import requests

# --- 1. Start the Flask app in a separate thread ---
app = Flask(__name__)

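@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    # `input` is expected to be a dict mapping feature names to values
    input_features = pd.DataFrame([data['input']])
    prediction = model.predict(input_features)[0]  # replace `model` with your trained model
    # Cast the NumPy scalar to a plain float so jsonify can serialize it
    return jsonify({'result': float(prediction)})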


def run_flask():
    app.run(port=56191, debug=False, use_reloader=False)

flask_thread = threading.Thread(target=run_flask)
flask_thread.daemon = True
flask_thread.start()

# --- 2. Wait for Flask to start ---
time.sleep(2)

# --- 3. Send a POST request to the Flask server ---
url = "http://127.0.0.1:56191/predict"
payload = {"input": "hello world"}
response = requests.post(url, json=payload)

print("Response from Flask server:", response.json())


 * Serving Flask app '__main__'
 * Debug mode: off


Address already in use
Port 56191 is in use by another program. Either identify and stop that program, or start the server with a different port.
INFO:werkzeug:127.0.0.1 - - [11/Jun/2025 12:10:03] "POST /predict HTTP/1.1" 200 -


Response from Flask server: {'result': "Received: {'input': 'hello world'}"}
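
The "Address already in use" message above means an earlier server still holds port 56191, so the response most likely came from that older process and not from the `predict` function defined in this cell. One way to sidestep the clash is to ask the operating system for an unused port. `find_free_port` is a hypothetical helper name introduced here; it is not part of Flask.

```python
import socket

def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))            # port 0 lets the OS choose
        return sock.getsockname()[1]    # the port that was assigned

port = find_free_port()
url = f"http://127.0.0.1:{port}/predict"
print(url)
```

The returned value can then be passed to `app.run(port=port, ...)`. There is a small window between closing the probe socket and Flask binding the port in which another process could claim it, so this is a convenience for notebooks, not a guarantee.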


In [33]:
import joblib

# ✅ Save trained model
if 'model' in globals() and model is not None:
    joblib.dump(model, 'like_predictor.pkl')
    print("✅ Saved: like_predictor.pkl")
else:
    print("❌ Model not found in globals.")

# ✅ Save TF-IDF vectorizer
if 'tfidf' in globals() and tfidf is not None:
    joblib.dump(tfidf, 'tfidf_vectorizer.pkl')
    print("✅ Saved: tfidf_vectorizer.pkl (from 'tfidf')")
elif 'tfidf_vectorizer' in globals() and tfidf_vectorizer is not None:
    joblib.dump(tfidf_vectorizer, 'tfidf_vectorizer.pkl')
    print("✅ Saved: tfidf_vectorizer.pkl (from 'tfidf_vectorizer')")
else:
    print("❌ TF-IDF vectorizer not found in globals.")

# ✅ Save company label encoder (if it exists)
if 'le' in globals() and le is not None:
    joblib.dump(le, 'company_encoder.pkl')
    print("✅ Saved: company_encoder.pkl (from 'le')")
elif 'company_encoder' in globals() and company_encoder is not None:
    joblib.dump(company_encoder, 'company_encoder.pkl')
    print("✅ Saved: company_encoder.pkl (from 'company_encoder')")
else:
    print("❌ Company label encoder not found in globals.")


✅ Saved: like_predictor.pkl
❌ TF-IDF vectorizer not found in globals.
❌ Company label encoder not found in globals.


In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
import shutil

# Adjust the path where you want to store in your Drive
destination_folder = '/content/drive/MyDrive/CAIC_models/'

# Create folder if it doesn't exist
import os
os.makedirs(destination_folder, exist_ok=True)

# Copy files
shutil.copy('like_predictor.pkl', destination_folder)
shutil.copy('tfidf_vectorizer.pkl', destination_folder)

print("✅ Model and vectorizer saved to Google Drive at:", destination_folder)


FileNotFoundError: [Errno 2] No such file or directory: 'tfidf_vectorizer.pkl'

In [28]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [32]:
import joblib

# Replace with the actual name of your TF-IDF object
joblib.dump(tfidf_vectorizer, "tfidf_vectorizer.pkl")
print("✅ Saved: tfidf_vectorizer.pkl")


NameError: name 'tfidf_vectorizer' is not defined
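
The `NameError` above has the same cause as the "not found in globals" messages from the earlier save cell. The notebook looks for objects named `tfidf`, `tfidf_vectorizer`, `le`, or `company_encoder`, whereas the TF-IDF cell binds its vectorizer to `vectorizer`, and no label encoder is created in this notebook. A way to avoid guessing at global names is to save everything inference needs as one bundle. The sketch below uses toy data and a `Ridge` model purely for illustration; the file name `like_predictor_bundle.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Toy training data (hypothetical)
texts = ["free coffee today", "new album out now", "coffee and music"]
likes = [10.0, 250.0, 40.0]

vectorizer = TfidfVectorizer()
model = Ridge().fit(vectorizer.fit_transform(texts), likes)

# One file holding everything inference needs, keyed by explicit names
bundle = {"model": model, "vectorizer": vectorizer}
path = os.path.join(tempfile.mkdtemp(), "like_predictor_bundle.pkl")
joblib.dump(bundle, path)

# Later, for example inside the API process
restored = joblib.load(path)
features = restored["vectorizer"].transform(["coffee time"])
pred = restored["model"].predict(features)
print(float(pred[0]))
```

Keeping the model and its vectorizer in one artifact also prevents a mismatch where a model is loaded alongside a vectorizer fitted on different data.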

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

# Example text data
texts = ["Great culture", "Toxic environment", "Innovative work", "Bad management"]
labels = [1, 0, 1, 0]  # likes (1) vs dislikes (0)

# Create and fit vectorizer
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)

# Train a basic model
model = LogisticRegression()
model.fit(X, labels)

# Save vectorizer and model
import joblib
joblib.dump(model, 'like_predictor.pkl')
joblib.dump(tfidf, 'tfidf_vectorizer.pkl')
print("✅ Re-saved both model and TF-IDF vectorizer")


✅ Re-saved both model and TF-IDF vectorizer


In [35]:
from flask import Flask, request, jsonify
import joblib

# Load model and vectorizer
model = joblib.load("like_predictor.pkl")
vectorizer = joblib.load("tfidf_vectorizer.pkl")

app = Flask(__name__)

@app.route('/')
def home():
    return "✅ Like Predictor API is running."

@app.route('/predict', methods=['POST'])
def predict():
    # silent=True returns None instead of raising on a non-JSON body
    data = request.get_json(silent=True) or {}
    if 'text' not in data:
        return jsonify({'error': 'Missing "text" in request body'}), 400

    text = data['text']
    X = vectorizer.transform([text])
    prediction = model.predict(X)[0]
    return jsonify({'like': int(prediction)})

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=6006)


 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:6006
 * Running on http://172.28.0.12:6006
INFO:werkzeug:Press CTRL+C to quit


In [38]:
from google.colab import drive
drive.mount('/content/drive')



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [39]:
destination_folder = "/content/drive/MyDrive/caic_api_project"


In [40]:
import os

if not os.path.exists(destination_folder):
    os.makedirs(destination_folder)


In [41]:
import shutil

# Replace with actual filenames you want to copy
files_to_copy = [
    'like_predictor.pkl'

]

for file in files_to_copy:
    if os.path.exists(file):
        shutil.copy(file, destination_folder)
        print(f"✅ Copied: {file}")
    else:
        print(f"❌ Missing: {file}")


✅ Copied: like_predictor.pkl


In [42]:
# STEP 1: Set up GitHub credentials and repo info
import os
import shutil

GITHUB_USERNAME = "DivyamAwasthy"
REPO_NAME = "CAIC_Summer_Of_Tech_25"
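# Never hard-code a personal access token in a notebook; read it from the
# environment instead (the variable name GITHUB_TOKEN is an assumption),
# and revoke any token that has already been published
ACCESS_TOKEN = os.environ["GITHUB_TOKEN"]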
REPO_URL = f"https://{ACCESS_TOKEN}@github.com/{GITHUB_USERNAME}/{REPO_NAME}.git"
CLONE_DIR = "/content/repo"

# STEP 2: Clone the repo
!git clone {REPO_URL} {CLONE_DIR}


Cloning into '/content/repo'...
remote: Enumerating objects: 400, done.
remote: Counting objects: 100% (133/133), done.
remote: Compressing objects: 100% (59/59), done.
remote: Total 400 (delta 108), reused 74 (delta 74), pack-reused 267 (from 2)
Receiving objects: 100% (400/400), 120.42 MiB | 20.10 MiB/s, done.
Resolving deltas: 100% (131/131), done.


In [43]:
# STEP 3: Copy files into repo (edit notebook name if needed)
notebook_name = "Week2_Divyam_Awasthy.ipynb"  # Change if your notebook name is different
!cp "/content/{notebook_name}" "{CLONE_DIR}/"

# Copy model files if they exist
files_to_copy = [
    'like_predictor.pkl',
    'tfidf_vectorizer.pkl',
    'company_encoder.pkl'
]

for file in files_to_copy:
    if os.path.exists(file):
        shutil.copy(file, f"{CLONE_DIR}/{file}")
        print(f"✅ Copied: {file}")
    else:
        print(f"❌ Missing: {file}")


cp: cannot stat '/content/Week2_Divyam_Awasthy.ipynb': No such file or directory
✅ Copied: like_predictor.pkl
✅ Copied: tfidf_vectorizer.pkl
❌ Missing: company_encoder.pkl


In [44]:
# STEP 4: Commit and push to GitHub
%cd {CLONE_DIR}
!git config --global user.email "divyamawasthy048@gmail.com"
!git config --global user.name "DivyamAwasthy"

!git add .
!git commit -m "🚀 Add notebook and model files"
!git push origin main


/content/repo
[main b894c42] 🚀 Add notebook and model files
 2 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 like_predictor.pkl
 create mode 100644 tfidf_vectorizer.pkl
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 1.50 KiB | 1.50 MiB/s, done.
Total 4 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/DivyamAwasthy/CAIC_Summer_Of_Tech_25.git
   2e56029..b894c42  main -> main


In [45]:
!apt-get install git


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git is already the newest version (1:2.34.1-1ubuntu1.12).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [47]:
!rm -rf /content/repo
!git clone https://github.com/DivyamAwasthy/CAIC_Summer_Of_Tech_25.git /content/repo


shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
Cloning into '/content/repo'...
fatal: Unable to read current working directory: No such file or directory


In [48]:
%cd /content


/content


In [49]:
!mkdir -p /content/repo


In [50]:
!git clone https://github.com/DivyamAwasthy/CAIC_Summer_Of_Tech_25.git /content/repo


Cloning into '/content/repo'...
remote: Enumerating objects: 404, done.
remote: Counting objects: 100% (137/137), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 404 (delta 109), reused 77 (delta 74), pack-reused 267 (from 2)
Receiving objects: 100% (404/404), 120.42 MiB | 20.25 MiB/s, done.
Resolving deltas: 100% (132/132), done.


In [52]:
!ls /content



drive  like_predictor.pkl  repo  sample_data  tfidf_vectorizer.pkl
