<a href="https://colab.research.google.com/github/Azaidi317/LLM-Finetuning-Projects/blob/main/Llama2_Mistral_based_stock_sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
pip install praw

Collecting praw
  Downloading praw-7.8.1-py3-none-any.whl.metadata (9.4 kB)
Collecting prawcore<3,>=2.4 (from praw)
  Downloading prawcore-2.4.0-py3-none-any.whl.metadata (5.0 kB)
Collecting update_checker>=0.18 (from praw)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Downloading praw-7.8.1-py3-none-any.whl (189 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m189.3/189.3 kB[0m [31m10.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading prawcore-2.4.0-py3-none-any.whl (17 kB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: update_checker, prawcore, praw
Successfully installed praw-7.8.1 prawcore-2.4.0 update_checker-0.18.0


In [None]:
import praw
import pandas as pd
from datetime import datetime
import time
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

class StockAnalyzer:
    def __init__(self, reddit_client_id, reddit_client_secret, reddit_user_agent):
        # Initialize Reddit API
        self.reddit = praw.Reddit(
            client_id='Nr1OwEqV_a8GVY3_jxU9-w',
            client_secret='sO0micrWKauECqX7bkR4ztxtLxtsEA',
            user_agent='stock_bot/1.0 by Same_Can_7313',
            check_for_async=False
        )

        # Initialize Mistral model and tokenizer (freely available)
        self.model_name = "mistralai/Mistral-7B-Instruct-v0.2"
        print("Loading model and tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

        # Prompt template for Mistral
        self.prompt_template = """<s>[INST] You are a professional stock market analyst.
        Analyze the following text about {stock_symbol} and determine the market sentiment.

        Text: {text}

        Provide your analysis in JSON format with the following fields:
        - sentiment (Strongly Bearish/Bearish/Neutral/Bullish/Strongly Bullish)
        - confidence (Low/Medium/High)
        - key_points (list of main points)
        - risks (list of risk factors)
        - catalysts (list of potential catalysts)

        Only respond with the JSON, no additional text. [/INST]</s>
        """

    def analyze_text(self, text, stock_symbol):
        """Analyze text using Mistral"""
        try:
            # Format prompt
            prompt = self.prompt_template.format(
                stock_symbol=stock_symbol,
                text=text
            )

            # Tokenize
            inputs = self.tokenizer(
                prompt,
                return_tensors="pt",
                truncation=True,
                max_length=1024
            ).to(self.model.device)

            # Generate analysis
            with torch.no_grad():
                outputs = self.model.generate(
                    inputs.input_ids,
                    max_length=1500,
                    temperature=0.7,
                    do_sample=True,
                    pad_token_id=self.tokenizer.eos_token_id
                )

            # Decode response
            response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

            # Parse response and convert to sentiment score
            sentiment_map = {
                'strongly bearish': -2,
                'bearish': -1,
                'neutral': 0,
                'bullish': 1,
                'strongly bullish': 2
            }

            # Extract sentiment from response (simple parsing)
            response_lower = response.lower()
            sentiment = 'neutral'
            for s in sentiment_map.keys():
                if s in response_lower:
                    sentiment = s
                    break

            return {
                'sentiment': sentiment.title(),
                'sentiment_score': sentiment_map.get(sentiment, 0),
                'raw_analysis': response
            }

        except Exception as e:
            print(f"Analysis error: {str(e)}")
            return {
                'sentiment': 'Neutral',
                'sentiment_score': 0,
                'raw_analysis': str(e)
            }

    def analyze_reddit_posts(self, stock_symbol, time_filter='week', limit=20):
        """Analyze posts across Reddit"""
        all_posts = []

        try:
            print(f"Analyzing Reddit sentiment for {stock_symbol}...")

            # Search r/all
            for submission in tqdm(self.reddit.subreddit("all").search(
                f'"{stock_symbol}"',
                sort='hot',
                time_filter=time_filter,
                limit=limit
            )):
                # Combine title and text
                full_text = f"Title: {submission.title}\nContent: {submission.selftext}"

                # Get analysis
                analysis = self.analyze_text(full_text, stock_symbol)

                post_data = {
                    'title': submission.title,
                    'subreddit': submission.subreddit.display_name,
                    'score': submission.score,
                    'num_comments': submission.num_comments,
                    'created_utc': datetime.fromtimestamp(submission.created_utc),
                    'url': f"https://reddit.com{submission.permalink}",
                    'sentiment': analysis['sentiment'],
                    'sentiment_score': analysis['sentiment_score'],
                    'analysis': analysis['raw_analysis']
                }

                all_posts.append(post_data)
                time.sleep(1)  # Respect rate limits

            # Convert to DataFrame
            df = pd.DataFrame(all_posts)

            # Calculate weighted sentiment (based on post score)
            if not df.empty:
                df['weighted_sentiment'] = df['sentiment_score'] * df['score']
                overall_sentiment = df['weighted_sentiment'].sum() / df['score'].sum()
            else:
                overall_sentiment = 0

            # Save results
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"{stock_symbol}_sentiment_{timestamp}.csv"
            df.to_csv(filename, index=False)

            # Calculate statistics
            stats = {
                'total_posts': len(df),
                'overall_sentiment': overall_sentiment,
                'sentiment_distribution': df['sentiment'].value_counts().to_dict(),
                'top_subreddits': df['subreddit'].value_counts().head(5).to_dict(),
                'high_impact_posts': df.nlargest(5, 'score').to_dict('records')
            }

            return df, stats

        except Exception as e:
            print(f"Analysis error: {str(e)}")
            return None, None

# Example usage
if __name__ == "__main__":
    # Initialize analyzer
    analyzer = StockAnalyzer(
        reddit_client_id="your_client_id",
        reddit_client_secret="your_client_secret",
        reddit_user_agent="StockBot/1.0"
    )

    # Analyze stock
    stock_symbol = "AAPL"
    df, stats = analyzer.analyze_reddit_posts(
        stock_symbol=stock_symbol,
        time_filter='week',
        limit=20  # Reduced limit for testing
    )

    if df is not None and stats is not None:
        print("\nAnalysis Results:")
        print(f"Total Posts Analyzed: {stats['total_posts']}")
        print(f"Overall Sentiment Score: {stats['overall_sentiment']:.2f}")

        print("\nSentiment Distribution:")
        for sentiment, count in stats['sentiment_distribution'].items():
            print(f"{sentiment}: {count}")

        print("\nTop Subreddits:")
        for subreddit, count in stats['top_subreddits'].items():
            print(f"{subreddit}: {count}")

        print("\nMost Impactful Posts:")
        for post in stats['high_impact_posts']:
            print(f"\nTitle: {post['title']}")
            print(f"Sentiment: {post['sentiment']}")
            print(f"Score: {post['score']}")
            print(f"Subreddit: {post['subreddit']}")

Loading model and tokenizer...


OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2.
401 Client Error. (Request ID: Root=1-673fdcd0-0617e8e93ccae0ec4ae605f5;b8633b27-0541-4340-aab1-2ad81a565302)

Cannot access gated repo for url https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/resolve/main/config.json.
Access to model mistralai/Mistral-7B-Instruct-v0.2 is restricted. You must have access to it and be authenticated to access it. Please log in.

In [None]:
!pip install -q transformers accelerate bitsandbytes

# Set up your Hugging Face token as a Colab secret
from google.colab import userdata
userdata.set('HF_TOKEN', 'your_hugging_face_token')

# Use the token in your code
from huggingface_hub import login
login(userdata.get('HF_TOKEN'))

# Now you can load Llama 2
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16

SyntaxError: incomplete input (<ipython-input-4-d4384080be66>, line 19)

In [2]:
pip install praw

Collecting praw
  Downloading praw-7.8.1-py3-none-any.whl.metadata (9.4 kB)
Collecting prawcore<3,>=2.4 (from praw)
  Downloading prawcore-2.4.0-py3-none-any.whl.metadata (5.0 kB)
Collecting update_checker>=0.18 (from praw)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Downloading praw-7.8.1-py3-none-any.whl (189 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/189.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m189.3/189.3 kB[0m [31m14.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading prawcore-2.4.0-py3-none-any.whl (17 kB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: update_checker, prawcore, praw
Successfully installed praw-7.8.1 prawcore-2.4.0 update_checker-0.18.0


In [3]:
import praw
import pandas as pd
from datetime import datetime
import time
from tqdm import tqdm
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login
import os

class Llama2StockAnalyzer:
    def __init__(self, reddit_client_id, reddit_client_secret, reddit_user_agent, hf_token):
        # Initialize Reddit API
        self.reddit = praw.Reddit(
            client_id='Nr1OwEqV_a8GVY3_jxU9-w',
            client_secret='sO0micrWKauECqX7bkR4ztxtLxtsEA',
            user_agent='stock_bot/1.0 by Same_Can_7313',
            check_for_async=False
        )

        # Login to Hugging Face
        print("Logging into Hugging Face...")
        login(hf_token)

        # Initialize Llama 2
        print("Loading Llama 2 model and tokenizer...")
        self.model_name = "meta-llama/Llama-2-7b-chat-hf"
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_name,
            use_auth_token=True
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            torch_dtype=torch.float16,
            device_map="auto",
            use_auth_token=True
        )

        # Prompt template for stock analysis
        self.prompt_template = """
        [INST] You are a professional stock market analyst. Analyze the following text about {stock_symbol} and determine the market sentiment and key insights.

        Text: {text}

        Please analyze considering:
        1. Overall market sentiment
        2. Specific price predictions or targets
        3. Mentioned catalysts or risks
        4. Technical analysis indicators
        5. Company fundamentals
        6. Market context and broader trends

        Provide your analysis in the following format:
        - Sentiment: (Strongly Bearish/Bearish/Neutral/Bullish/Strongly Bullish)
        - Confidence: (Low/Medium/High)
        - Key Points: (List main points)
        - Price Targets: (If mentioned)
        - Risk Factors: (If mentioned)
        [/INST]
        """

    def analyze_text(self, text, stock_symbol):
        """Analyze text using Llama 2"""
        try:
            # Format prompt
            prompt = self.prompt_template.format(
                stock_symbol=stock_symbol,
                text=text
            )

            # Tokenize
            inputs = self.tokenizer(
                prompt,
                return_tensors="pt",
                truncation=True,
                max_length=1024
            ).to(self.model.device)

            # Generate analysis
            with torch.no_grad():
                outputs = self.model.generate(
                    inputs.input_ids,
                    max_length=1500,
                    temperature=0.7,
                    top_p=0.95,
                    do_sample=True,
                    pad_token_id=self.tokenizer.eos_token_id
                )

            # Decode response
            response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

            # Parse sentiment from response
            sentiment_map = {
                'strongly bearish': -2,
                'bearish': -1,
                'neutral': 0,
                'bullish': 1,
                'strongly bullish': 2
            }

            # Extract sentiment (basic parsing)
            response_lower = response.lower()
            sentiment = 'neutral'
            confidence = 'medium'

            for s in sentiment_map.keys():
                if s in response_lower:
                    sentiment = s
                    break

            if 'confidence: high' in response_lower:
                confidence = 'high'
            elif 'confidence: low' in response_lower:
                confidence = 'low'

            return {
                'sentiment': sentiment.title(),
                'sentiment_score': sentiment_map.get(sentiment, 0),
                'confidence': confidence.title(),
                'analysis': response
            }

        except Exception as e:
            print(f"Analysis error: {str(e)}")
            return {
                'sentiment': 'Neutral',
                'sentiment_score': 0,
                'confidence': 'Low',
                'analysis': str(e)
            }

    def analyze_reddit_posts(self, stock_symbol, time_filter='week', limit=20):
        """Analyze posts across Reddit"""
        all_posts = []

        try:
            print(f"Analyzing Reddit sentiment for {stock_symbol}...")

            # Search r/all and specific finance subreddits
            subreddits = ["wallstreetbets", "stocks", "investing", "stockmarket", "all"]

            for subreddit_name in subreddits:
                print(f"\nSearching r/{subreddit_name}...")
                subreddit = self.reddit.subreddit(subreddit_name)

                for submission in tqdm(subreddit.search(
                    f'"{stock_symbol}"',
                    sort='hot',
                    time_filter=time_filter,
                    limit=int(limit/len(subreddits))
                )):
                    # Combine title and text
                    full_text = f"Title: {submission.title}\nContent: {submission.selftext}"

                    # Get Llama 2 analysis
                    analysis = self.analyze_text(full_text, stock_symbol)

                    post_data = {
                        'title': submission.title,
                        'text': submission.selftext,
                        'subreddit': submission.subreddit.display_name,
                        'score': submission.score,
                        'num_comments': submission.num_comments,
                        'created_utc': datetime.fromtimestamp(submission.created_utc),
                        'url': f"https://reddit.com{submission.permalink}",
                        'sentiment': analysis['sentiment'],
                        'sentiment_score': analysis['sentiment_score'],
                        'confidence': analysis['confidence'],
                        'full_analysis': analysis['analysis']
                    }

                    all_posts.append(post_data)
                    time.sleep(1)  # Respect rate limits

            # Convert to DataFrame
            df = pd.DataFrame(all_posts)

            # Calculate weighted sentiment
            if not df.empty:
                df['weighted_sentiment'] = df.apply(
                    lambda x: x['sentiment_score'] * x['score'] *
                    (1.0 if x['confidence'] == 'High' else 0.7 if x['confidence'] == 'Medium' else 0.4),
                    axis=1
                )
                stats = self._calculate_statistics(df)
            else:
                stats = {
                    'total_posts': 0,
                    'overall_sentiment': 0,
                    'sentiment_distribution': {},
                    'top_subreddits': {},
                    'high_impact_posts': []
                }

            # Save results
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"{stock_symbol}_llama2_analysis_{timestamp}.csv"
            df.to_csv(filename, index=False)

            return df, stats

        except Exception as e:
            print(f"Analysis error: {str(e)}")
            return None, None

    def _calculate_statistics(self, df):
        """Calculate comprehensive statistics"""
        return {
            'total_posts': len(df),
            'overall_sentiment': df['weighted_sentiment'].mean(),
            'sentiment_distribution': df['sentiment'].value_counts().to_dict(),
            'confidence_distribution': df['confidence'].value_counts().to_dict(),
            'top_subreddits': df['subreddit'].value_counts().head(5).to_dict(),
            'high_impact_posts': df.nlargest(5, 'weighted_sentiment')[
                ['title', 'subreddit', 'sentiment', 'confidence', 'score', 'url']
            ].to_dict('records')
        }

# Example usage
if __name__ == "__main__":
    # Your credentials
    REDDIT_CLIENT_ID = "your_reddit_client_id"
    REDDIT_CLIENT_SECRET = "your_reddit_client_secret"
    REDDIT_USER_AGENT = "LlamaStockBot/1.0"
    HF_TOKEN = "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"  # Get from https://huggingface.co/settings/tokens

    # Initialize analyzer
    analyzer = Llama2StockAnalyzer(
        reddit_client_id=REDDIT_CLIENT_ID,
        reddit_client_secret=REDDIT_CLIENT_SECRET,
        reddit_user_agent=REDDIT_USER_AGENT,
        hf_token=HF_TOKEN
    )

    # Analyze stock
    stock_symbol = "AAPL"
    df, stats = analyzer.analyze_reddit_posts(
        stock_symbol=stock_symbol,
        time_filter='week',
        limit=20  # Adjust based on your needs
    )

    if df is not None and stats is not None:
        print("\nAnalysis Results:")
        print(f"Total Posts Analyzed: {stats['total_posts']}")
        print(f"Overall Sentiment Score: {stats['overall_sentiment']:.2f}")

        print("\nSentiment Distribution:")
        for sentiment, count in stats['sentiment_distribution'].items():
            percentage = (count / stats['total_posts']) * 100
            print(f"{sentiment}: {count} posts ({percentage:.1f}%)")

        print("\nConfidence Distribution:")
        for confidence, count in stats['confidence_distribution'].items():
            percentage = (count / stats['total_posts']) * 100
            print(f"{confidence}: {count} posts ({percentage:.1f}%)")

        print("\nTop Subreddits:")
        for subreddit, count in stats['top_subreddits'].items():
            print(f"r/{subreddit}: {count} posts")

        print("\nMost Impactful Posts:")
        for post in stats['high_impact_posts']:
            print(f"\nTitle: {post['title']}")
            print(f"Sentiment: {post['sentiment']} (Confidence: {post['confidence']})")
            print(f"Subreddit: r/{post['subreddit']}")
            print(f"Score: {post['score']}")
            print(f"URL: {post['url']}")

Logging into Hugging Face...
Loading Llama 2 model and tokenizer...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

Analyzing Reddit sentiment for AAPL...

Searching r/wallstreetbets...


0it [00:00, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
3it [00:49, 16.48s/it]



Searching r/stocks...


1it [00:04,  4.85s/it]



Searching r/investing...


1it [00:25, 25.19s/it]



Searching r/stockmarket...


1it [00:17, 17.93s/it]



Searching r/all...


4it [00:59, 14.89s/it]


Analysis Results:
Total Posts Analyzed: 10
Overall Sentiment Score: -1219.44

Sentiment Distribution:
Strongly Bearish: 9 posts (90.0%)
Neutral: 1 posts (10.0%)

Confidence Distribution:
Medium: 9 posts (90.0%)
High: 1 posts (10.0%)

Top Subreddits:
r/wallstreetbets: 3 posts
r/stocks: 1 posts
r/investing: 1 posts
r/StockMarket: 1 posts
r/NvidiaStock: 1 posts

Most Impactful Posts:

Title: r/Stocks Daily Discussion & Fundamentals Friday Nov 15, 2024
Sentiment: Neutral (Confidence: Medium)
Subreddit: r/stocks
Score: 21
URL: https://reddit.com/r/stocks/comments/1grtfyq/rstocks_daily_discussion_fundamentals_friday_nov/

Title: Partnership basis software 
Sentiment: Strongly Bearish (Confidence: Medium)
Subreddit: r/tax
Score: 2
URL: https://reddit.com/r/tax/comments/1gwxgsl/partnership_basis_software/

Title: Stocks with unusual options volume: $SPY $QQQ $NVDA $TSLA $SPX $PLTR $IWM $AAPL $SMCI $AMD $VIX $TLT
Sentiment: Strongly Bearish (Confidence: Medium)
Subreddit: r/StockOptionsAlerts



