# Introduction to notebook

Insert text

--------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------
# 🚨 **USER SETUP - INPUT REQUIRED** 🚨

## **User Input for the Search Keywords in Twitter Data Scraper** 🐦
This notebook uses the `twikit` library to scrape tweets from Twitter based on specific keywords. To customize the scraping, enter your search keywords and the number of tweets to collect.

You can modify these keywords to any terms of your choice, as they are not limited to the provided examples.

- **Keywords**: We have included the following keywords as examples:
  - 'Cybertruck FSD' 
  - 'Tesla FSD'
  - 'Tesla Full Self-driving' 

- Write the search keywords inside the square brackets `[]`, wrap each keyword in single quotes `''`, and separate the keywords with commas.
- **Example:** `['Cybertruck FSD', 'Tesla FSD', 'Tesla Full Self-driving']` - customize in the code cell below

- **Tweets per Keyword**: The script collects **20 tweets per request** and, by default, retrieves up to **250 tweets** for each keyword you specify. You can change the maximum number of tweets per keyword, but please note the usage limitations below. 📊

- **IMPORTANT: Usage limitations:** 🚨 The Twitter API has a limit of **50 requests per 15 minutes**. Since we fetch **20 tweets at a time**, you can retrieve a maximum of **1000 tweets** every 15 minutes. You should manage your requests accordingly to avoid hitting the rate limit. **The Rate Limit resets every 15 minutes**.
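
The arithmetic above can be sketched as a small budget check. The two constants come from the limits stated above; the helper name `requests_needed` is hypothetical, and the estimate assumes every request returns a full page of 20 tweets.

```python
import math

TWEETS_PER_REQUEST = 20    # page size stated above
REQUESTS_PER_WINDOW = 50   # requests allowed per 15-minute window, as stated above


def requests_needed(keywords, tweets_per_keyword):
    """Estimate the number of search requests, assuming full pages."""
    per_keyword = math.ceil(tweets_per_keyword / TWEETS_PER_REQUEST)
    return per_keyword * len(keywords)


example_keywords = ['Cybertruck FSD', 'Tesla FSD', 'Tesla Full Self-driving']
needed = requests_needed(example_keywords, 250)

# True when the whole run fits inside a single 15-minute window
fits_in_window = needed <= REQUESTS_PER_WINDOW
print(needed, fits_in_window)
```

With three keywords and the default of 250 tweets each, the estimate stays under the limit; raising either number may push a run into a second window.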

## **Together API Key 🚀**: 
#### You must provide your **Together API Key** in the designated variable (`TOGETHER_API_KEY`). This key is essential for creating the Network Analysis later 🔑

### Get a free $5 account here (no credit card required 💳 ❌): [Sign in to Together API](https://api.together.ai/signin)
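
If you prefer not to paste the key into the notebook, one possible pattern is to read it from an environment variable. This is only a sketch: the helper `load_api_key` is hypothetical, and it assumes the variable is named `TOGETHER_API_KEY` like the notebook variable.

```python
import os


def load_api_key(env=os.environ, name="TOGETHER_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise ValueError(f"{name} is not set")
    return key


# Demonstration with a plain dict standing in for the real environment
demo_key = load_api_key({"TOGETHER_API_KEY": " example-key "})
print(demo_key)
```

In the notebook, `TOGETHER_API_KEY = load_api_key()` would then replace the hardcoded string.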

--------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------

In [139]:
## Twitter Data Scraper
keywords = ['Novo Nordisk', 'Ozempic', 'WeGovy']
tweets_per_keyword = 250

# Set your Together API Key directly (Required)
TOGETHER_API_KEY = ""

--------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------

## Installation, Import Libraries + Data, Cleaning & Descriptive Statistics  📦

### Install Requirements 🎛️

In [140]:
!pip install -r requirements.txt -q



### Importing Libraries 🔌

In [141]:

# Data handling
import requests
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import re
import numpy as np
import pandas as pd
import seaborn as sns

# Data scraper
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time
import nest_asyncio
import asyncio
from twikit import Client
import logging
from datetime import datetime

# Tweet Themes & EDA
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.subplots as sp
import nltk
nltk.download('stopwords', quiet=True)
from collections import Counter
from wordcloud import WordCloud, STOPWORDS
from scipy.stats import chi2_contingency
from textblob import TextBlob

# Network analysis
from typing import List, Optional
import json
import matplotlib.patches as mpatches
import networkx as nx
from community import community_louvain
import plotly.graph_objects as go
import random
from collections import defaultdict
import nbformat

# Model prediction
from setfit import SetFitModel, SetFitTrainer
from sklearn.metrics import classification_report
import tiktoken
import tqdm


### Data Import 📨

In [114]:
# Data import: 
data = pd.read_csv('TwitterData_Joined.csv')

### Data Overview 🌎

In [115]:
data.head()

Unnamed: 0,Twitter_User_Name,Twitter_Account,Twitter_User_Description,Tweet_id,Tweet_created_at,Tweet_text,Label,Word_Count,Url_Count,Retweet,...,Adverb_Count,Positive_Word_Ratio,Negative_Word_Ratio,Neutral_Word_Ratio,Following,Followers,Verified,Link,Location,Real_Location
0,Museum Bot,MuseumBot,I am a bot that tweets a random high-res Open ...,8.02758e+17,27-11-2016 06:15,Imperial Theatrical Coat for Court Lady https:...,0,8,2,0,...,0,0.0,0.0,1.0,0,7816,0,https://twitter.com/MuseumBot?s=20,,-1.0
1,Museum Bot,MuseumBot,I am a bot that tweets a random high-res Open ...,8.74692e+17,13-06-2017 18:15,Half-length Figure of St Paul in an Oval. http...,0,10,2,0,...,0,0.0,0.0,1.0,0,7816,0,https://twitter.com/MuseumBot?s=20,,-1.0
2,Museum Bot,MuseumBot,I am a bot that tweets a random high-res Open ...,6.9839e+17,13-02-2016 06:15,Great Exhibition Jurors&amp;#39; Medal https:/...,0,6,2,0,...,0,0.125,0.0,0.875,0,7816,0,https://twitter.com/MuseumBot?s=20,,-1.0
3,Museum Bot,MuseumBot,I am a bot that tweets a random high-res Open ...,6.97665e+17,11-02-2016 06:15,Pair of candelabra https://t.co/KYopSWDSw2 htt...,0,5,2,0,...,0,0.0,0.0,1.0,0,7816,0,https://twitter.com/MuseumBot?s=20,,-1.0
4,Museum Bot,MuseumBot,I am a bot that tweets a random high-res Open ...,6.21745e+17,16-07-2015 18:15,Banner (Nobori)\n http://t.co/yz34Xgo9a5 http:...,0,4,2,0,...,0,0.0,0.0,1.0,0,7816,0,https://twitter.com/MuseumBot?s=20,,-1.0


In [18]:
data.shape

(279691, 29)

In [116]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279691 entries, 0 to 279690
Data columns (total 29 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Twitter_User_Name         279691 non-null  object 
 1   Twitter_Account           279691 non-null  object 
 2   Twitter_User_Description  276027 non-null  object 
 3   Tweet_id                  279691 non-null  float64
 4   Tweet_created_at          279691 non-null  object 
 5   Tweet_text                279691 non-null  object 
 6   Label                     279691 non-null  int64  
 7   Word_Count                279691 non-null  int64  
 8   Url_Count                 279691 non-null  int64  
 9   Retweet                   279691 non-null  int64  
 10  Original_User             58391 non-null   object 
 11  Mentions_Count            279691 non-null  int64  
 12  Hashtags_Count            279691 non-null  int64  
 13  QuesMark_Count            279691 non-null  i

### Cleaning data (for now...) 🧹

In [117]:
columns_to_fill = ['Twitter_User_Description', 'Link', 'Location', 'Original_User']

data[columns_to_fill] = data[columns_to_fill].fillna(0)
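
Filling text columns with `0` removes the missing values but leaves integers mixed in with strings. A minimal alternative sketch, assuming an empty string is an acceptable placeholder for missing text; the toy frame and its values are made up, and only the column names come from the dataset above.

```python
import pandas as pd

# Toy frame with made-up values; only the column names match the dataset
toy = pd.DataFrame({
    "Location": ["Copenhagen", None],
    "Link": [None, "https://twitter.com/MuseumBot?s=20"],
})

# Fill each text column with an empty string so every value stays a string
text_defaults = {"Location": "", "Link": ""}
toy_filled = toy.fillna(text_defaults)

print(toy_filled)
```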

### Descriptive Statistics 📊

In [118]:
# Select the quantitative columns by dtype
quantitative_columns = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
# Show floats with two decimals instead of scientific notation
pd.set_option('display.float_format', lambda x: '%.2f' % x)
desc_stats = data[quantitative_columns].describe()
print("\nDescriptive Statistics:")
desc_stats


Descriptive Statistics:


Unnamed: 0,Tweet_id,Label,Word_Count,Url_Count,Retweet,Mentions_Count,Hashtags_Count,QuesMark_Count,Exclamations_Count,SpecialCharacters_Count,...,Pronouns_Count,Verb_Count,Adverb_Count,Positive_Word_Ratio,Negative_Word_Ratio,Neutral_Word_Ratio,Following,Followers,Verified,Real_Location
count,279691.0,279691.0,279691.0,279691.0,279691.0,279691.0,279691.0,279691.0,279691.0,279691.0,...,279691.0,279691.0,279691.0,279691.0,279691.0,279691.0,279691.0,279691.0,279691.0,279691.0
mean,7.908704324737052e+17,0.52,12.96,0.49,0.21,0.5,0.25,0.1,0.23,8.5,...,0.01,1.38,0.29,0.04,0.02,0.94,2131.42,1496164.02,0.1,0.0
std,2.7501587861790944e+17,0.5,7.16,0.56,0.41,0.96,0.73,0.4,0.76,22.92,...,0.12,1.44,0.6,0.06,0.04,0.07,6731.25,5909723.58,0.44,0.9
min,12796523176.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0
25%,6.19934e+17,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.91,1.0,963.0,0.0,-1.0
50%,8.87113e+17,1.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,1.0,0.0,0.0,0.0,0.96,62.0,5398.0,0.0,0.0
75%,1.02322e+18,1.0,19.0,1.0,0.0,1.0,0.0,0.0,0.0,8.0,...,0.0,2.0,0.0,0.06,0.0,1.0,1440.0,95110.0,0.0,1.0
max,1.07271e+18,1.0,62.0,4.0,1.0,13.0,14.0,49.0,37.0,414.0,...,3.0,14.0,14.0,1.0,1.0,1.0,55000.0,45200000.0,1.0,1.0


## Data Scraper 🪠 (Opens a Chrome instance to fetch X/Twitter Cookies)

### X/Twitter Cookie Saving... 🍪

In [119]:
class ManualTwitterCookieSaver:
    def __init__(self, cookies_file='X_cookies.json'):
        self.cookies_file = cookies_file
        self.driver = None

    def setup_chrome_driver(self):
        """Set up and start a Chrome browser."""
        chrome_options = Options()
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        self.driver = webdriver.Chrome(options=chrome_options)
        return self.driver

    def wait_for_manual_login(self):
        """Wait until the user has logged in and reached the home page."""
        print("Log in manually to Twitter in the browser.")
        while True:
            current_url = self.driver.current_url
            if "https://x.com/home" in current_url:
                print("Login detected! Saving cookies...")
                break
            time.sleep(2)  # Check every 2 seconds

    def save_cookies(self):
        """Save cookies to a JSON file."""
        try:
            cookies = self.driver.get_cookies()
            with open(self.cookies_file, 'w') as file:
                json.dump(cookies, file, indent=4)
            print(f"Cookies saved to {self.cookies_file}")
        except Exception as e:
            print(f"Error while saving cookies: {e}")

    def run(self):
        """Run the whole process."""
        try:
            # Start the browser
            self.setup_chrome_driver()
            self.driver.get("https://x.com/i/flow/login")

            # Wait until the user logs in manually
            self.wait_for_manual_login()

            # Save cookies
            self.save_cookies()

        except Exception as e:
            print(f"Error: {e}")
        finally:
            if self.driver:
                self.driver.quit()


if __name__ == "__main__":
    cookie_saver = ManualTwitterCookieSaver()
    cookie_saver.run()

Log in manually to Twitter in the browser.
Error: Message: disconnected: not connected to DevTools
  (failed to check if window was closed: disconnected: not connected to DevTools)
  (Session info: chrome=131.0.6778.86)
Stacktrace:
0   chromedriver                        0x0000000104777ac4 cxxbridge1$str$ptr + 3651580
1   chromedriver                        0x0000000104770314 cxxbridge1$str$ptr + 3620940
2   chromedriver                        0x00000001041d84b4 cxxbridge1$string$len + 89224
3   chromedriver                        0x00000001041c2424 core::str::slice_error_fail::ha0e52dbcb60e6bae + 63828
4   chromedriver                        0x00000001041c2354 core::str::slice_error_fail::ha0e52dbcb60e6bae + 63620
5   chromedriver                        0x0000000104255788 cxxbridge1$string$len + 601948
6   chromedriver                        0x00000001042110b0 cxxbridge1$string$len + 321668
7   chromedriver                        0x0000000104211d00 cxxbridge1$string$len + 324820
8   chrom
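
The login check above polls with no upper bound, so it keeps waiting until the home page appears or the browser raises an error. A bounded variant could look like the sketch below. The helper `wait_until` and its default values are hypothetical, and the URL iterator is only a stand-in for reading `driver.current_url`.

```python
import time


def wait_until(predicate, timeout=300.0, interval=2.0):
    """Poll predicate() until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


# Stand-in for `lambda: "https://x.com/home" in driver.current_url`
urls = iter(["https://x.com/i/flow/login", "https://x.com/home"])
logged_in = wait_until(
    lambda: "https://x.com/home" in next(urls),
    timeout=5,
    interval=0.01,
)
print(logged_in)
```

Returning `False` on timeout lets the caller decide whether to skip saving cookies or ask the user to try again.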

### Data Scraping from X/Twitter using saved cookies 🐥

In [None]:
# initializes asyncio to run in a jupyter or interactive session
nest_asyncio.apply()

# sets up logging
logging.basicConfig(
    level=logging.INFO, 
    format='%(asctime)s - %(levelname)s: %(message)s'
)
logger = logging.getLogger(__name__)

# cookies file
COOKIES_FILE = 'X_cookies.json'

# function to format cookies correctly for httpx
def format_cookies_for_httpx(cookies):
    return {cookie['name']: cookie['value'] for cookie in cookies}

def extract_tweet_data(tweets, keyword):
    """
    extracts tweet data and creates a pandas dataframe
    
    args:
        tweets (list): list of tweet objects from twikit
        keyword (str): search keyword that returned these tweets
    
    returns:
        pandas.DataFrame: dataframe with tweet information
    """
    tweet_data = []
    
    for tweet in tweets:
        tweet_info = {
            'id': tweet.user.id,
            'Tweet_id': tweet.id,
            'User_name': tweet.user.name,
            'Screen_name': tweet.user.screen_name,
            'Tweet_text': tweet.text,
            'Language': tweet.lang,
            'Hashtags': tweet.hashtags,
            'Created_at': tweet.created_at_datetime,
            'Search_keyword': keyword
        }
        tweet_data.append(tweet_info)
    
    return pd.DataFrame(tweet_data)

async def main():
    try:
        # initializes the client
        client = Client(language='en-US')
        logger.info("client initialized.")
        
        # loads cookies from file
        try:
            with open(COOKIES_FILE, 'r', encoding='utf-8') as f:
                cookies = json.load(f)
            formatted_cookies = format_cookies_for_httpx(cookies)
            client.set_cookies(formatted_cookies)
            logger.info("cookies loaded and applied.")
        except Exception as e:
            logger.error(f"error loading cookies: {e}")
            return None

        all_tweets = []

        for keyword in keywords:
            logger.info(f"searching for tweets for '{keyword}'...")
            tweets = []
            try:
                # start with an initial search
                results = await client.search_tweet(
                    query=keyword,
                    product='Top',
                    count=20  # max number per request
                )

                # add initial tweets
                tweets.extend(results)

                # continue fetching until we reach the limit or run out of tweets
                while len(tweets) < tweets_per_keyword and results.next_cursor:
                    logger.info(f"Fetching more tweets for '{keyword}'...")
                    try:
                        # fetch the next batch of tweets
                        results = await results.next()

                        # add tweets from the next batch
                        tweets.extend(results)

                        # pause to avoid rate limits
                        await asyncio.sleep(0.5)
                    except Exception as next_error:
                        logger.warning(f"error fetching more tweets: {next_error}")
                        break

                # log the result for the specific keyword
                logger.info(f"found {len(tweets)} tweets for '{keyword}'.")

                # add tweet data to all_tweets
                for tweet in tweets:
                    all_tweets.append({
                        'id': tweet.user.id,
                        'Tweet_id': tweet.id,
                        'User_name': tweet.user.name,
                        'Screen_name': tweet.user.screen_name,
                        'Tweet_text': tweet.text,
                        'Language': tweet.lang,
                        'Hashtags': tweet.hashtags,
                        'Created_at': tweet.created_at_datetime,
                        'Search_keyword': keyword
                    })
            except Exception as e:
                logger.error(f"Error during search for '{keyword}': {e}")
                continue

        # convert to dataframe and remove duplicates
        all_tweets_df = pd.DataFrame(all_tweets)
        all_tweets_df.drop_duplicates(subset=['Tweet_id'], inplace=True)

        # filter dataframe to include only rows where language is english
        filtered_tweets_df = all_tweets_df[all_tweets_df['Language'] == 'en']

        # continue with only the english tweets from here on
        all_tweets_df = filtered_tweets_df
        
        # save results to file
        output_file = 'novo.csv'
        all_tweets_df.to_csv(output_file, index=False, encoding='utf-8')
        logger.info(f"Tweets saved in {output_file}")

        return all_tweets_df

    except Exception as e:
        logger.error(f"general error: {e}")
        return None

# run the asynchronous function
scraped_data = asyncio.run(main())

# display dataframe (if it has been created)
if scraped_data is not None:
    print("\nTweet dataframe (after removal of duplicates and only english tweets):")
    print(scraped_data.head())


2024-12-03 21:54:10,133 - INFO: client initialized.
2024-12-03 21:54:10,134 - INFO: cookies loaded and applied.
2024-12-03 21:54:10,135 - INFO: searching for tweets for 'Novo Nordisk'...
2024-12-03 21:54:10,540 - INFO: HTTP Request: GET https://x.com "HTTP/1.1 200 OK"
2024-12-03 21:54:11,080 - INFO: HTTP Request: GET https://abs.twimg.com/responsive-web/client-web/ondemand.s.1eaecb6a.js "HTTP/1.1 200 OK"
2024-12-03 21:54:12,792 - INFO: HTTP Request: GET https://x.com/i/api/graphql/flaR-PUMshxFWZWPNpq4zA/SearchTimeline?variables=%7B%22rawQuery%22%3A+%22Novo+Nordisk%22%2C+%22count%22%3A+20%2C+%22querySource%22%3A+%22typed_query%22%2C+%22product%22%3A+%22Top%22%7D&features=%7B%22creator_subscriptions_tweet_preview_api_enabled%22%3A+true%2C+%22c9s_tweet_anatomy_moderator_badge_enabled%22%3A+true%2C+%22tweetypie_unmention_optimization_enabled%22%3A+true%2C+%22responsive_web_edit_tweet_api_enabled%22%3A+true%2C+%22graphql_is_translatable_rweb_tweet_is_translatable_enabled%


tweet dataframe (after removal of duplicates and only english tweets):
                    id             Tweet_id             User_name  \
0            466333756  1864008894902579706         See It Market   
1            517852962  1864037802112921672         Edward Conard   
2  1246136668014301184  1854252998840680765  App Economy Insights   
3  1451630632799440898  1863659590501511431             Ray Myers   
4  1726975905887649792  1864014237099549130        Adam Filandr ᯅ   

    Screen_name                                         Tweet_text Language  \
0   seeitmarket  NEW Article:  "Novo-Nordisk (NVO), An Investor...       en   
1  EdwardConard  Since 2021 Danish GDP has grown 3.6%, but excl...       en   
2    EconomyApp  $NVO Novo Nordisk Q3 FY24\n\nRevenue +23% to D...       en   
3   TheRayMyers  The weight loss segment is the next gold rush ...       en   
4   NeoFablesVR  Could this be the future of Europe 🇪🇺?\n\nIt's...       en   

  Hashtags                Created_at S
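
The fetch loop above keeps requesting pages until the per-keyword limit is reached or no cursor is left. The sketch below isolates that logic with a fake result type. `FakePage` and `collect` are hypothetical stand-ins, not part of twikit, and the final slice that trims any overshoot from the last page is an addition that the scraper above does not perform.

```python
import asyncio


class FakePage(list):
    """Hypothetical stand-in for a paginated search result."""

    def __init__(self, items, following=None):
        super().__init__(items)
        self._following = following
        # A cursor exists only when another page follows
        self.next_cursor = "cursor" if following is not None else None

    async def next(self):
        return self._following


async def collect(first_page, limit):
    """Gather items page by page, then trim the overshoot from the last page."""
    items = list(first_page)
    page = first_page
    while len(items) < limit and page.next_cursor:
        page = await page.next()
        items.extend(page)
    return items[:limit]


# Three pages of 20 items each, chained together
pages = FakePage(range(0, 20), FakePage(range(20, 40), FakePage(range(40, 60))))
collected = asyncio.run(collect(pages, limit=50))
print(len(collected))
```

Without the final slice, a limit of 50 would yield 60 items here, because pages arrive in blocks of 20.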

## SetFit Bot Classifier

#### Cleaning keyword scraped dataset for bot classifier 🧹

In [146]:
scraped_data = pd.read_csv('novo.csv')

In [None]:
def clean_text(text):
    # remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # remove mentions (@username)
    text = re.sub(r'@\w+', '', text)
    # remove hashtags (but keep the text)
    text = re.sub(r'#', '', text)
    # remove emojis and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # convert to lowercase
    text = text.lower()
    return text

# Data cleaning: drop missing tweets first, because clean_text expects a string
scraped_data.dropna(subset=['Tweet_text'], inplace=True)
scraped_data['Tweet_text'] = scraped_data['Tweet_text'].apply(clean_text)
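
To see what the cleaning steps do to a single tweet, here is a small self-contained sketch. The function repeats the same regular expressions so the snippet runs on its own, and the sample tweet is made up.

```python
import re


def clean_text(text):
    text = re.sub(r'http\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'@\w+', '', text)              # remove mentions
    text = re.sub(r'#', '', text)                 # drop the hashtag sign, keep the word
    text = re.sub(r'[^\w\s]', '', text)           # remove emojis and punctuation
    return text.lower()


sample = "Check https://t.co/abc @user #Ozempic rocks!!! 🚀"
cleaned = clean_text(sample)

# Removed tokens leave extra spaces behind, so split() gives the clearest view
print(cleaned.split())
```

The cleaned string still contains the whitespace left by the removed tokens, which the cleaning step does not collapse.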

In [148]:
# load the saved model
loaded_model = SetFitModel.from_pretrained("setfit_model")

# prepare data for prediction
X_test = scraped_data['Tweet_text'].tolist()  # text data

# make predictions without detailed batch logging
print("Making predictions...")
y_pred = loaded_model.predict(X_test)

# add predictions to scraped_data DataFrame
scraped_data['Label'] = y_pred

# print examples
print("\nexamples of predictions:")
print(scraped_data[['Tweet_text', 'Label']].head())

# save predictions to CSV
scraped_data.to_csv('novo_predictions_output.csv', index=False)

2024-12-04 13:24:14,876 - INFO: Use pytorch device_name: mps
2024-12-04 13:24:14,879 - INFO: Load pretrained SentenceTransformer: setfit_model


Making predictions...




examples of predictions:
                                          Tweet_text  Label
0  NEW Article:  "Novo-Nordisk (NVO), An Investor...      1
1  Since 2021 Danish GDP has grown 3.6%, but excl...      1
2  $NVO Novo Nordisk Q3 FY24\n\nRevenue +23% to D...      1
3  The weight loss segment is the next gold rush ...      1
4  Could this be the future of Europe 🇪🇺?\n\nIt's...      1
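
The first five rows only show one label, so a quick summary of the full prediction column may be more informative. A minimal sketch, assuming the predictions are available as a plain list of labels; the helper `label_shares` is hypothetical and the sample values are made up.

```python
from collections import Counter


def label_shares(labels):
    """Return the share of each predicted label as a fraction of all predictions."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


sample_predictions = [1, 1, 1, 0, 1]  # made-up labels, not results from the model
shares = label_shares(sample_predictions)
print(shares)
```

In the notebook, the list could come from `scraped_data['Label'].tolist()`.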


## Data Visualization (EDA) 🪄

In [None]:
#