# Module 04 - Text
## Natural Language Processing
Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that enables machines to understand, interpret, and generate human language. It combines computational linguistics with machine learning to process text or spoken words in a way that's meaningful. By breaking down language into smaller components—such as sentences, phrases, and even individual words—NLP models can analyze their structure and derive context. Tasks like language translation, sentiment analysis, and text summarization are all powered by NLP. Its significance lies in bridging the gap between human communication, which is inherently complex and nuanced, and machine comprehension.

NLP is essential because it empowers machines to interact with humans more naturally and effectively. For instance, virtual assistants like Siri, Alexa, and Google Assistant rely on NLP to understand voice commands, answer questions, and execute tasks such as setting reminders or playing music. Similarly, search engines like Google use NLP to interpret queries and deliver the most relevant results by understanding the context and intent behind the words. Customer service chatbots also employ NLP to provide instant responses to user inquiries, improving efficiency and user experience. In the healthcare sector, NLP is used to analyze clinical notes, enabling faster diagnosis and better patient care. On social media, platforms like Facebook and Twitter utilize NLP for tasks like sentiment analysis, allowing brands to gauge public opinion and respond effectively. These real-world examples highlight how NLP is transforming our daily interactions with technology, making them seamless and intuitive.

<p style="text-align: center"><img src="https://thislondonhouse.com/Jupyter/Images/nlp.png"></p>

### Text Preprocessing 
Text preprocessing is a fundamental step in natural language processing (NLP) that prepares raw text data for analysis by machines. Since raw text is often noisy and unstructured, preprocessing cleans and standardizes it to improve the performance of machine learning models. Common techniques include removing stop words (e.g., "and," "the"), punctuation, and special characters, as well as converting text to lowercase, stemming, and lemmatization to simplify words to their base forms. Additionally, tokenization divides the text into smaller units such as words or sentences for easier processing. Despite its importance, text preprocessing comes with challenges, such as determining which stop words to retain in context-specific tasks and balancing the trade-off between simplifying text and preserving its meaning.
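
The steps named above can be sketched with nothing but the standard library. This is a minimal illustration, not a production cleaner: the `preprocess` function and the tiny `STOP_WORDS` set are assumptions made up for this example, and stemming and lemmatization are left out because they normally rely on a dedicated NLP library.

```python
import re

# A tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in stop_words]

tokens = preprocess("The vaccines ARE safe, and the trials are public!")
print(tokens)  # ['vaccines', 'safe', 'trials', 'public']
```

Note how the trade-off mentioned above shows up even here: dropping "are" and "the" shortens the text, but a different task might need those words.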

### Feature Extraction
Feature extraction involves converting the preprocessed text into numerical representations that machine learning algorithms can understand. Techniques like Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings (e.g., Word2Vec, GloVe, or BERT) are widely used for this purpose. Feature extraction helps to capture the semantics and context of the text, enabling more effective model training and predictions. However, it presents challenges like ensuring the representations are meaningful, especially for complex languages or highly ambiguous text. High-dimensional data generated by some methods, such as BoW, can also lead to issues like the curse of dimensionality, affecting model efficiency and accuracy. The choice of feature extraction method often depends on the specific task and data characteristics, making it a critical aspect of NLP workflows.  

| Technique       | Description                                                                                 | Pros                                               | Cons                                                |
|-----------------|---------------------------------------------------------------------------------------------|----------------------------------------------------|-----------------------------------------------------|
| **Bag of Words (BoW)** | Represents text data by counting the occurrence of each word in the document. Each word is treated as an independent feature.                 | Simple and easy to implement.  Widely used and well-understood. Fast and computationally efficient.                     | Ignores word context and meaning. Results in sparse matrices. Limited to lexical matching.|
| **TF-IDF**      | Weighs the frequency of words by their inverse document frequency. Reduces the impact of common words.                         | Captures important words based on their uniqueness. Better at handling irrelevant words than BoW. Provides interpretable features. | Ignores word context and meaning. Results in high-dimensional sparse matrices. Limited to lexical matching.                    |
| **Word Embeddings** | Represents words as dense vectors in a continuous vector space based on their context and meaning. Techniques include Word2Vec, GloVe, FastText, and BERT.  | Captures semantic relationships between words. Handles polysemy and context effectively. Pre-trained models available for various domains.      | Requires significant computational resources.  Requires large corpora for training. May require fine-tuning for specific tasks.           |
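
To make the first two rows of the table concrete, here is a from-scratch sketch of Bag of Words and TF-IDF on three made-up sentences. The documents are illustrative assumptions. The IDF formula used is the smoothed variant that scikit-learn documents as its default; scikit-learn additionally L2-normalizes each row by default, which this sketch skips for readability.

```python
import math
from collections import Counter

docs = [
    "the vaccine is safe",
    "the vaccine is a hoax",
    "the weather is nice",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})

# Bag of Words: one raw count per vocabulary word, per document
bow = [[Counter(doc)[word] for word in vocab] for doc in tokenized]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
doc_freq = {word: sum(word in doc for doc in tokenized) for word in vocab}
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

# TF-IDF: scale each count by the word's IDF (no length normalization here)
tfidf = [[count * idf[word] for count, word in zip(row, vocab)] for row in bow]

print(vocab)
print(bow[0])
print([round(value, 2) for value in tfidf[0]])
```

Words that appear in every document ("the", "is") keep the minimum weight of 1.0, while a word unique to one document ("safe") is weighted highest. Both representations have one column per vocabulary word, which is why they grow sparse and high-dimensional on real corpora.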


In [None]:
# Libraries
from groq import Groq
import datetime
import pprint
import statistics
import os 
from dotenv import load_dotenv 
import json
from time import time
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.utils.extmath import density
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import ComplementNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

In [None]:
# Functions
def plot_feature_effects(clf, target_names, feature_names, top_n=20):
    features = []
    coefs = []
    classes = []

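    try:
        average_feature_effects = clf.coef_
    except AttributeError:
        # Classifiers without linear coefficients (e.g., kNN, random forest) have nothing to plot
        return None

    if len(target_names) > 2:
        # Multiclass: one row of coefficients per class; take the top_n largest for each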

        for i, label in enumerate(target_names):
            coefs.extend(average_feature_effects[i][np.argsort(average_feature_effects[i])[-top_n:][::-1]])
            features.extend(feature_names[np.argsort(average_feature_effects[i])[-top_n:][::-1]])
            classes.extend([label] * top_n)
    else:
        for i, label in enumerate(target_names):
            if i == 0:
                coefs.extend(average_feature_effects[0][np.argsort(average_feature_effects[0])[:top_n][::-1]])
                features.extend(feature_names[np.argsort(average_feature_effects[0])[:top_n][::-1]])
                classes.extend([label] * top_n)
            else:
                coefs.extend(average_feature_effects[0][np.argsort(average_feature_effects[0])[-top_n:][::-1]])
                features.extend(feature_names[np.argsort(average_feature_effects[0])[-top_n:][::-1]])
                classes.extend([label] * top_n)

    feature_importance_df = pd.DataFrame({'features': features, 
                                          'coefs': coefs, 
                                          'classes': classes})

    # Create the bar plot
    plt.figure(figsize=(10, 6))
    sns.barplot(y='features', x='coefs', hue='classes', data=feature_importance_df)
    plt.title('Feature Coefficients by Class')
    plt.xlabel('Coefficient Value')
    plt.ylabel('Features')
    plt.legend(title='Classes')
    plt.show()


def classifier_performance(y, y_pred, labels_dict=None):
    accuracy = metrics.accuracy_score(y, y_pred)
    precision = metrics.precision_score(y, y_pred, average='weighted')
    recall = metrics.recall_score(y, y_pred, average='weighted')
    balanced_accuracy = metrics.balanced_accuracy_score(y, y_pred)
    f1 = metrics.f1_score(y, y_pred, average='weighted')
    # Use caller-supplied label names when given; otherwise use every label seen in y or y_pred
    if labels_dict is not None:
        display_labels = [labels_dict[i] for i in sorted(labels_dict.keys())]
    else:
        display_labels = np.unique(np.concatenate([np.ravel(y), np.ravel(y_pred)]))
    report = metrics.classification_report(y, y_pred, target_names=display_labels)

    # Display the confusion matrix with custom labels
    conf_matrix = metrics.confusion_matrix(y, y_pred)
    disp = metrics.ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=display_labels)
    disp.plot(cmap=plt.cm.Greens)

    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"Balanced Accuracy: {balanced_accuracy:.4f}")
    print(f"F1-score: {f1:.4f}")
    print("\nDetailed Classification Report:")
    print(report)
    plt.show()

def get_prepared_remarks(transcript):
    prepared_comments_started = False
    prepared_comments = []
    for transcript_line in transcript.split("\n"):
        speaker = transcript_line[:transcript_line.find(":")]
        if speaker.lower() == 'operator':
            if prepared_comments_started:
                # prepared comments have concluded
                break
            else:
                # meeting has just begun
                prepared_comments_started = True
        else:
            prepared_comments.append(transcript_line)

    return "\n".join(prepared_comments)

def is_question(prompt):
    # Check if the prompt ends with a question mark
    if prompt.strip().endswith('?'):
        return True

    # Check for common question words
    question_words = ["who", "what", "where", "when", "why", "how"]
    for word in question_words:
        if word in prompt.lower().split():
            return True
    return False

def retrieve_document(query, documents, top_n=3, query_method='TFIDF'):

    if query_method == 'TFIDF':
        vectorizer = TfidfVectorizer()
        tfidf_matrix = vectorizer.fit_transform(documents)

        query_vec = vectorizer.transform([query])
        cosine_similarities = cosine_similarity(query_vec, tfidf_matrix).flatten()
    else:
        # Imported lazily so the TF-IDF path works without sentence-transformers installed
        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer('all-mpnet-base-v2')
        document_embeddings = model.encode(documents, convert_to_tensor=True)

        query_embedding = model.encode(query, convert_to_tensor=True)
        cosine_similarities = util.pytorch_cos_sim(query_embedding, document_embeddings).flatten()
        cosine_similarities = cosine_similarities.cpu().numpy()

    top_indices = np.argsort(cosine_similarities)[::-1][:top_n]  # Get indices of top_n most similar documents
    return top_indices.tolist()

def build_user_prompt(df):
    user_input = input("")
    if user_input.lower() in ("", "exit", "quit", "bye", "goodbye"):
        return False

    if is_question(user_input):
        # Looks like a question...searching for supporting documents...
        top_document_rows = retrieve_document(user_input, df['Content'].tolist(), top_n=3, query_method='TFIDF')
        search_df = df.iloc[top_document_rows]

        context = "\n\n".join([f"Document {i}:\n{doc}" for i, doc in enumerate(search_df['Content'])])
        user_prompt = f"{user_input}\n\nRelevant documents:\n{context}"
    else:
        user_prompt = user_input
        
    return {'input': user_input, 'prompt': user_prompt}

## Text Exercise 1
### Business Problem

Detecting misinformation on social media is crucial because it helps preserve the integrity and reliability of information shared online. Misinformation can spread rapidly, leading to confusion, fear, and mistrust among users. It can influence public opinion, harm individuals and communities, and even impact critical decisions such as those related to health, safety, and elections. By detecting and addressing misinformation, we can promote informed decision-making, protect public health, and maintain a more accurate and trustworthy information ecosystem.

### Data Collection/Selection
We will be loading data from a Twitter dataset.

Data are organized in tabular format, with each record representing an individual user. The target variable is 'Role', which indicates whether the user supported or refuted misinformation. The data contain the following columns:

| Variable Name              | Role       | Data Type  | Description                                    |
|----------------------------|------------|------------|------------------------------------------------|
| id                         | Feature    | String     | Unique identifier for each user                |
| screen_name                | Feature    | String     | User's screen name or handle                   |
| verified                   | Feature    | Boolean    | Whether the user is verified                   |
| age                        | Feature    | Integer    | Age of the user's account                      |
| description                | Feature    | String     | User's profile description                     |
| tweet_count                | Feature    | Integer    | Total number of tweets posted by the user      |
| listed_count               | Feature    | Integer    | Number of lists the user is a member of        |
| follower_count             | Feature    | Integer    | Total number of followers the user has         |
| friend_count               | Feature    | Integer    | Total number of friends the user has           |
| mindset_period             | Feature    | String     | Period during which the mindset was analyzed   |
| WC                         | Feature    | Integer    | Word count in user's tweets                    |
| fofo_ratio                 | Feature    | Float      | Ratio of certain words or phrases              |
| mindset_count              | Feature    | Integer    | Count of messages influencing mindset          |
| mindset_text               | Feature    | String     | Text associated with the user's mindset        |
| Participation              | Feature    | Boolean    | Total participation                            |
| Support                    | Feature    | Boolean    | Count of supporting messages                   |
| Refute                     | Feature    | Boolean    | Count of refuting messages                     |
| Morality                   | Feature    | Integer    | Count of morality messages                     |
| SafetyConcerns             | Feature    | Integer    | Count of safety concerns messages              |
| CivilLiberties             | Feature    | Integer    | Count of civil liberties messages              |
| Conspiracy                 | Feature    | Integer    | Count of conspiracy theories messages          |
| Original                   | Feature    | Integer    | Count of original tweets                       |
| Retweet                    | Feature    | Integer    | Count of retweets                              |
| Favorite                   | Feature    | Integer    | Count of favorited tweets                      |
| Reply                      | Feature    | Integer    | Count of reply tweets                          |
| Quote                      | Feature    | Integer    | Count of quote tweets                          |
| Role                       | Target     | String     | Role of the user in the discussion             |
| Myth                       | Feature    | String     | Type of myth the user discusses                |
| Supported                  | Feature    | Boolean    | Whether the user supported misinformation      |
| SupportedMorality          | Feature    | Boolean    | Whether supported morality misinformation      |
| SupportedSafetyConcerns    | Feature    | Boolean    | Whether supported safety misinformation        |
| SupportedConspiracy        | Feature    | Boolean    | Whether supported conspiracy misinformation    |
| SupportedCivilLiberties    | Feature    | Boolean    | Whether supported civil liberty misinformation |
| Security                   | Feature    | Integer    | Extent exhibited value of security             |
| Conformity                 | Feature    | Integer    | Extent exhibited value of conformity           |
| Tradition                  | Feature    | Integer    | Extent exhibited value of tradition            |
| Benevolence                | Feature    | Integer    | Extent exhibited value of benevolence          |
| Universalism               | Feature    | Integer    | Extent exhibited value of universalism         |
| SelfDirection              | Feature    | Integer    | Extent exhibited value of self-direction       |
| Stimulation                | Feature    | Integer    | Extent exhibited value of stimulation          |
| Hedonism                   | Feature    | Integer    | Extent exhibited value of hedonism             |
| Achievement                | Feature    | Integer    | Extent exhibited value of achievement          |
| Power                      | Feature    | Integer    | Extent exhibited value of power                |

The following line will load the data as a pandas dataframe.

In [None]:
twitter_df = pd.read_csv("data/twitter_misinformation.csv")

### Data Profiling  
Once the data are loaded, we need to profile the data and prepare it for analysis. This typically involves several steps that may include handling missing data, exploring data, feature selection, among others. The steps will vary depending on the dataset and the business problem, but profiling always precedes model building.  

For analyzing text, you usually want to engage in a text-cleaning process. Fortunately, this dataset is pretty clean and what cleaning remains will be handled in feature extraction.

In [None]:
twitter_df

Here are all of the features available. We could build a complex model for predicting support for misinformation, but for this exercise we will focus only on the user's prior tweets.

In [None]:
twitter_df.info(verbose=True, show_counts=True)

In [None]:
twitter_df.hist(figsize=(12, 10), bins=30, edgecolor="black")
plt.show()

Here is a sample of some users' prior tweets. Looking at these results and the WC histogram will give you a feel for roughly how much text is involved.

In [None]:
twitter_df['mindset_text']

These are the groups we will classify. Support indicates users who tweeted messages supporting misinformation, and refute indicates users who tweeted messages refuting misinformation.

In [None]:
twitter_df['Role'].value_counts().plot.barh()

As with previous exercises, we will subset our data into a dataframe with only features that will be needed for analysis.

In [None]:
target_cols = ['Role']
input_cols = ['mindset_text']
data_cols = input_cols + target_cols

df = twitter_df[data_cols]

In [None]:
df.info()

There are no missing values, but we'll drop NA out of habit.

In [None]:
df = df.dropna()

Finally, we will create our test/train split and then we will move on to model building.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df[input_cols], df[target_cols], test_size=0.25, random_state=16)

### Model Specification

Unlike previous exercises where we transformed features based on their datatype, in this exercise, we will use a vectorizer to turn the tweet text into a vector of numeric values which indicate word usage patterns. In effect, the vectorizer will convert the single feature 'mindset_text' into a larger number of features with each feature representing a word or part of a word that is found within the mindset_text.

In [None]:
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, min_df=5, stop_words="english")

Since we are conducting a classification analysis, we will start with a logistic regression.

In [None]:
base_model = LogisticRegression(C=5, max_iter=1000)

As with previous exercises, we compile these elements into a pipeline of steps. This helps ensure consistency across models. 

In [None]:
logistic_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),  # Step 1: Transform text data using TF-IDF
    ('classifier', LogisticRegression())  # Step 2: Train a classifier on the TF-IDF features
])

First we will fit the model, and then we will assess its performance by predicting results from our testing data. This line may look a little different from previous exercises. The difference lies in the np.ravel(X_train). We have to use ravel() in this case because there is only one input feature (the vectorizer turns that one feature into many features), and ravel() flattens a single-column dataframe into the one-dimensional array of strings that the vectorizer expects. We've used ravel() on the y values in all previous exercises, as we have always had only one dependent variable, but this is the first time we have had only one independent variable.

In [None]:
logistic_pipeline.fit(np.ravel(X_train), np.ravel(y_train))
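
The reshaping that np.ravel() performs in the cell above can be seen on a toy example. This is a minimal sketch in which a made-up 3-by-1 NumPy array stands in for the single-column dataframe; the strings are illustrative assumptions, not rows from the dataset.

```python
import numpy as np

# A single text column stored as a 2-D array: 3 rows, 1 column
column_2d = np.array([["first tweet"], ["second tweet"], ["third tweet"]])
print(column_2d.shape)  # (3, 1)

# ravel() flattens it into the 1-D sequence of strings a text vectorizer iterates over
flat = np.ravel(column_2d)
print(flat.shape)  # (3,)
print(flat[0])     # first tweet
```

Flattening matters because a text vectorizer iterates over whatever it is given, and iterating over a pandas dataframe yields its column labels rather than its rows.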

Next, we predict the results and investigate the classifier performance.

In [None]:
logistic_predicted = logistic_pipeline.predict(np.ravel(X_test))

Surprisingly, the classifier proves to be quite capable (81% accuracy) of discriminating between supporters and refuters of misinformation based on their prior tweets.

In [None]:
classifier_performance(y_test, logistic_predicted)

The following function will plot the 15 features that are most predictive of supporting/refuting misinformation.

In [None]:
plot_feature_effects(logistic_pipeline['classifier'], np.unique(df['Role']), logistic_pipeline['tfidf'].get_feature_names_out(), 15)

### Model Evaluation

The results above are impressive, but other classifiers may improve on them. In the following code block, a tuple of classifiers is created. Then the same steps as above are repeated inside a loop so that each model is evaluated independently. The loop also measures how long it takes to train and test each model so that we can assess not only accuracy but also speed.

In [None]:
results = []
classifiers = ((DummyClassifier(), "Dummy Classifier"),
               (LogisticRegression(C=5, max_iter=1000), "Logistic Regression"),
               (RidgeClassifier(alpha=1.0, solver="sparse_cg"), "Ridge Classifier"),
               (KNeighborsClassifier(n_neighbors=100), "kNN"),
               (RandomForestClassifier(), "Random Forest"),
               (LinearSVC(C=0.1, dual=False, max_iter=1000), "Linear SVC"),
               (SGDClassifier(loss="log_loss", alpha=1e-4, n_iter_no_change=3, early_stopping=True), "log-loss SGD",),
               (NearestCentroid(), "NearestCentroid"),
               (ComplementNB(alpha=0.1), "Complement naive Bayes"))

for clf, name in classifiers:
    print("=" * 80)
    print(name)
    print("_" * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),  
        ('classifier', clf) 
    ])
    pipeline.fit(np.ravel(X_train), np.ravel(y_train))

    train_time = time() - t0
    print(f"train time: {train_time:.3}s")

    t0 = time()
    y_pred = pipeline.predict(np.ravel(X_test))
    test_time = time() - t0
    print(f"test time:  {test_time:.3}s")
    classifier_performance(y_test, y_pred, {0: 'Refute', 1: 'Support'})
    plot_feature_effects(pipeline['classifier'], np.unique(df['Role']), pipeline['tfidf'].get_feature_names_out(), 15)
    print()
    if name:
        clf_descr = str(name)
    else:
        clf_descr = clf.__class__.__name__

    results.append((clf_descr, metrics.accuracy_score(y_test, y_pred), train_time, test_time))

Next we will plot the speed metrics to help us make the tradeoff between speed and accuracy.

In [None]:
results = [[x[i] for x in results] for i in range(4)]

clf_names, score, training_time, test_time = results
training_time = np.array(training_time)
test_time = np.array(test_time)

fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(10, 8))
ax1.scatter(score, training_time, s=60)
ax1.set(
    title="Score-training time trade-off",
    yscale="log",
    xlabel="test accuracy",
    ylabel="training time (s)",
)
ax2.scatter(score, test_time, s=60)
ax2.set(
    title="Score-test time trade-off",
    yscale="log",
    xlabel="test accuracy",
    ylabel="test time (s)",
)

for i, txt in enumerate(clf_names):
    ax1.annotate(txt, (score[i], training_time[i]))
    ax2.annotate(txt, (score[i], test_time[i]))

plt.tight_layout()
plt.show()

### Conclusion

The model performed well, and given the complexity of language, large further gains may be hard to achieve with this approach. However, the TF-IDF vectorizer is based on word counts rather than semantic meaning, so performance could plausibly improve on edge cases where refuters use the same vocabulary as supporters but use it differently. Capturing that difference would require a semantically aware vectorizer, which would greatly increase the processing requirements of our model.
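
The limitation described above is easy to demonstrate. In this minimal sketch, two made-up sentences with opposite meanings contain exactly the same words, so any count-based representation sees them as identical. The sentences and the `cosine_similarity_counts` helper are assumptions written for this illustration.

```python
import math
from collections import Counter

def cosine_similarity_counts(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

refuter = Counter("the vaccine is safe not dangerous".split())
supporter = Counter("the vaccine is dangerous not safe".split())

similarity = cosine_similarity_counts(refuter, supporter)
print(round(similarity, 6))  # 1.0
```

Because word order is discarded, the two count vectors are equal and their similarity is 1. A classifier built on these features cannot tell the two sentences apart.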

Further, this model's applicability is narrow, as it applies only to misinformation about COVID-19 vaccines. Given the proliferation of misinformation and the speed at which it flows and evolves, a static model would have limited utility.

However, in the context of misinformation about COVID-19 vaccines, our model performs admirably and would be a useful tool for identifying rumors and myths swirling around the COVID-19 vaccine. For example, the following collection of words could be fed into our model to predict the likely role this user would play in any conversation about vaccine misinformation.

In [None]:
test_mindset_text = "can t be bothered Like I meant to but then I get to the webpage and I m like Nah I mean it s not like I even had to move That s not even lazy I don t know what it is God bless Regis Peace and love to all his family I loved Regis he came to my home to interview me many many years ago and to this day he called me Big Hasselhoff He I didn t even know there was supposed to be a possible hurricane Seriously tho did these hurricanes pop up out of nowhere I didn t hear anything about either of them 24 hours ago WTF is going on Hurricane Hanna makes landfall in Texas via So did I Regis is in the same category as Carson Superlative He was on our show a million times always the best guest we ever had Regis Philbin truly was the host with the most most joyful most entertaining most unpredictable amp one of the most ama FLAVOR FLAV R I P Regis Philbin Your death was a shocker We are gonna miss you here Thankx for all your hours on I m only like 7 minutes into the BillandTed3 ComicConAtHome thing and I m already making faces No idea what Brigette said at 4 minutes in about pronouns And you know nepotism is one of my pet peeves IDK I did see that slasher wedding movie without realizing the bride was someone s kid I didn t like the movie So we ll see As with most issues in this country what the media presents as black vs white is usually really rich vs poor Police are going to act completely differently in a rich neighborhood than a poor one In my opinion the media s focus on race is a deliberate attempt to divide people I like that they re all in their own phone booths and when they re talking the electricity goes to them Wait I didn t notice this before The band is called Preston Logan instead of Wyld Stallyns WTF So neither of these kids saw Bill amp Ted before That s disturbing Especially Weaving s kid Because her dad s money comes from being in movies with Ted Theodore Logan Who raised these children Always a warm greeting Regis greeted you like a happy surprise he was delighted to see A really sweet man he took One of my favorite interviews with Reg They ll never be another like him RIP my friend XO RIPRegis Now I m getting the bad vibe that I m not going to like this movie BECAUSE I don t have kids It seems like a lot of this is going to be about the kids Younger Weaving is awkwardly giddy What is that I call the beneficiaries of nepotism stuff like younger Weaving or younger Washington or younger Kravitz because I don t think their level of talent would have gotten them there on their own So you might as well call a spade a spade Oh wow 37 minutes in Keanu voluntarily spoke I m actually shocked Yeah that was kinda ok Not psyched about the movie tbqh Something s off  2Priceless It ll just come back like Talking Tina I know I like John Saxon but I don t think it s from ENTER THE DRAGON A quick scan of his IMDb page didn t help me either I must be forgetting something "
test_mindset_text

In [None]:
logistic_pipeline.predict(np.array([test_mindset_text]))

In [None]:
test_mindset_text = 'BREAKING President has withdrawn the US from the World Health Organization if you support Ilhan Omar just called for the dismantling of the U S economy and political systems This socialist has got to go if schools should be fully open this fall I will give one of these coins away to one person who and Likes this post If this post I ll give a Raise your hand if you think we should withdraw from the United Nations next Anyone like Ilhan Omar who calls for the literal DISMANTLING of our whole US system should be immediately removed from o Rt if Bill de Blasio should be REMOVED as Mayor for destroying NYC Trump should have said keep schools closed Just like hydroxychloroquine Marquette Univ threatened to cancel 18 y o incoming freshman Samantha Pfefferle s admission just for posting a pro Tru We ve received so much support in our first 24 hours Help us break 1 000 followers and let s show America that BlueLivesMa It s hard to find this because you scrubbed it from your government website But I saved a picture of it for ya Ilhan Omar s campaign has now officially paid her own husband more than ONE MILLION dollars in the 2020 cycle She is be Now that the White House took control of the coronavirus numbers from the CDC it s going to be interesting to see what they if you trust President Trump more than Dr Fauci 147 Covid deaths recorded today in AZ 106 are death certificate matching which means they died with a if you believe Melania Trump is the most beautiful and most eloquent First Lady of ALL TIME The Left doesn t want you to share this video African American support keeps growing for Retweet if think Chris Wallace belongs on Fake News CNN and you re sick and damn tired of hearing his lying mouth on Fox Ne Joe Biden is basically the bad guy in every civil rights movie ever made I cannot believe that in the year 2020 Americ Rt if Dr Fauci should resign The Key to Defeating COVID 19 Already Exists We Need to Start Using It Opinion if you think we need a national Voter ID law passed before November Remember when pro athletes still loved America Americans will love baseball again when loves America again http '
test_mindset_text

In [None]:
logistic_pipeline.predict(np.array([test_mindset_text]))

## Text Exercise 2
In this exercise, we will be building an LLM-wrapper application. These steps will serve as a model for how we approach LLM-wrappers in the future.  

### Business Problem

Using artificial intelligence to analyze earnings transcripts can unlock valuable insights in an efficient and comprehensive manner. AI can quickly process vast amounts of financial data, identifying key themes, trends, and sentiment across multiple transcripts—a task that would be time-consuming and prone to human error if done manually. By leveraging natural language processing, AI can detect subtle changes in tone, language, or emphasis that might signal shifts in a company's strategy or outlook. Additionally, AI can cross-reference this information with broader market trends, providing a deeper contextual understanding. This level of analysis empowers investors, analysts, and decision-makers to make more informed choices with greater speed and accuracy.

### Data Collection/Selection

For this exercise, I have downloaded an earnings call transcript for Intel (ticker: INTC) from the second quarter of 2024. This transcript will serve as the data source for our wrapper application.

In [None]:
with open("data/intel_transcript.txt") as file_pointer:
    call_transcript = file_pointer.read()

I have written a function (get_prepared_remarks) which scans through the transcript and extracts the comments that precede the Q&A session at the end of earnings calls. These remarks are often finely tuned and serve as indicators of the company's future performance.

In [None]:
get_prepared_remarks(call_transcript)
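
The body of get_prepared_remarks is not shown in this notebook. As a rough illustration of the idea only, the sketch below assumes the transcript marks the start of the Q&A session with a phrase such as "Question-and-Answer Session". The function name, the marker list, and the sample text are all hypothetical and may not match the actual helper.

```python
import re

# Hypothetical markers; real transcripts may label the Q&A section differently.
QA_MARKERS = [
    r"question-and-answer session",
    r"questions? and answers?",
    r"q\s*&\s*a session",
]

def sketch_prepared_remarks(transcript: str) -> str:
    """Return the text that precedes the first Q&A marker (illustrative only)."""
    pattern = re.compile("|".join(QA_MARKERS), flags=re.IGNORECASE)
    match = pattern.search(transcript)
    # If no marker is found, fall back to returning the whole transcript.
    cutoff = match.start() if match else len(transcript)
    return transcript[:cutoff].strip()

# Made-up sample text, not taken from the Intel transcript.
sample = (
    "Operator: Welcome to the call.\n"
    "CEO: Revenue grew this quarter.\n"
    "Question-and-Answer Session\n"
    "Analyst: Can you comment on margins?"
)
print(sketch_prepared_remarks(sample))
```

A marker-based split like this is fragile: if the prepared remarks themselves mention "questions and answers", the cut would land too early, so the marker list would need tuning against real transcripts.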

### LLM Engineering

In this exercise, the LLM is our intelligence, but we have to tell it what kind of intelligence to exhibit. The system_content instructs the system how to respond to user input, and the user_content prompt provides the user's request to the intelligent system.  

I have preloaded the system profile, but you should experiment and see if you can get the LLM to attend to different points of interest.

In [None]:
system_content = """
    you are a financial data analyst. 
    you need to read earnings call transcripts, identify the attendees, and assess the content of the call.
    when assessing the content of the call, pay close attention to the following:
    tone (positive/negative)
    risk (low risk/high risk)
    spending (increasing/decreasing)
    rate each on a scale of 1 (negative, high risk, decreasing spending) to 10 (positive, low risk, increasing spending). 
    for each, include 1 sentence that highlights specific details justifying your rating.
"""

In this line of code, I keep only the first 3,000 words of the prepared remarks to accommodate Groq's token limits.

In [None]:
user_content = " ".join(get_prepared_remarks(call_transcript).split(" ")[:3000])
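
Splitting on spaces and slicing with [:3000] keeps the first 3,000 space-separated words and drops the rest. The sketch below wraps that idea in a helper. The name truncate_words is hypothetical, and it uses split() without an argument (an assumption, not the notebook's exact call) so that newlines and repeated spaces do not produce empty "words". Words are only a rough proxy for tokens, so the right limit depends on the model's tokenizer.

```python
def truncate_words(text: str, max_words: int = 3000) -> str:
    """Keep at most the first `max_words` whitespace-separated words."""
    words = text.split()  # splits on any run of whitespace
    return " ".join(words[:max_words])

demo = "one two  three\nfour five"
print(truncate_words(demo, max_words=3))  # prints: one two three
```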

### Application Building

LLM inference is a computationally expensive task. Though you can run LLM inference on your own device, it will likely be slow (if it runs at all). For this reason, many AI companies have emerged to provide inference compute in the cloud. To access these cloud resources, you will need to [sign up for API access](https://console.groq.com/login). We will use the free tier of service, but there are paid tiers, so it is important to protect your key. Once you have created an API key, you can add it as a variable to a variables.env file to keep the key out of your source code.

In [None]:
dotenv_path = 'variables.env'

load_dotenv(dotenv_path) 
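
A variables.env file is plain text with one NAME=value pair per line, for example a single line reading GROQ_API_KEY=your-key-here (a placeholder value). A missing or misspelled variable otherwise only shows up later as an authentication error, so it can help to check for it right after loading. The helper below is a hypothetical sketch using only the standard library; require_env and DEMO_API_KEY are illustrative names, not part of the notebook.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; check that it is defined in variables.env"
        )
    return value

# Demonstration with a placeholder variable (not a real key).
os.environ["DEMO_API_KEY"] = "placeholder-value"
key = require_env("DEMO_API_KEY")
# Report only the length so the value never appears in notebook output.
print(f"Loaded key of length {len(key)}")
```

In the notebook, the equivalent check would be require_env("GROQ_API_KEY") after the load_dotenv call.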

Here we load the environment variable from the variables.env file and pass it into the Groq library to establish a link to their inference resources.

In [None]:
client = Groq(api_key=os.getenv("GROQ_API_KEY"))

Though this is a large block of code, it is a single statement that makes one call to Groq's inference servers. The call initializes a chat request with messages for the system and the user. It also specifies the LLM (llama-3.1-8b-instant), the temperature (creativity/randomness of the response), the maximum number of tokens (maximum length of the reply), any stop sequences, and whether to stream the LLM's response or wait until the full response is available before sending it back to us.

In [None]:
chat_completion = client.chat.completions.create(
    #
    # Required parameters
    #
    messages=[
        # Set an optional system message. This sets the behavior of the
        # assistant and can be used to provide specific instructions for
        # how it should behave throughout the conversation.
        {
            "role": "system",
            "content": system_content
        },
        # Set a user message for the assistant to respond to.
        {
            "role": "user",
            "content": user_content,
        }
    ],

    # The language model which will generate the completion.
    model="llama-3.1-8b-instant",

    #
    # Optional parameters
    #

    # Controls randomness: lowering results in less random completions.
    # As the temperature approaches zero, the model will become deterministic
    # and repetitive.
    temperature=0.5,

    # The maximum number of tokens to generate. Requests can use up to
    # 32,768 tokens shared between prompt and completion.
    max_tokens=1024,

    # Controls diversity via nucleus sampling: 0.5 means half of all
    # likelihood-weighted options are considered.
    top_p=1,

    # A stop sequence is a predefined or user-specified text string that
    # signals an AI to stop generating content, ensuring its responses
    # remain focused and concise. Examples include punctuation marks and
    # markers like "[end]".
    stop=None,

    # If set, partial message deltas will be sent.
    stream=False,
)
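
To build intuition for the temperature setting, the sketch below applies a temperature-scaled softmax to three made-up scores (logits). This is the standard textbook formulation, not Groq's internal code, and the scores are invented for illustration. Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by temperature (> 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Comparing the printed rows shows the first token's probability shrinking as the temperature rises, which is why higher settings produce more varied replies.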

Now, we can print the response.

In [None]:
# Print the completion returned by the LLM.
print(chat_completion.choices[0].message.content)

## Text Exercise 3

### Business Problem

An internal LLM chatbot with access to private documents can be an invaluable tool for improving efficiency, collaboration, and decision-making within an organization. By securely accessing and processing internal data, the chatbot can provide accurate, context-specific responses to employee queries, reducing the time spent searching through files or waiting for assistance from colleagues. It can synthesize information from multiple sources, summarize lengthy documents, and highlight key insights, enabling users to quickly grasp complex topics. Additionally, such a chatbot can offer personalized recommendations and support tailored to the unique needs of the organization, fostering innovation and improving productivity—all while ensuring the confidentiality of sensitive information remains intact.

### Data Collection/Selection

For this exercise, I have scraped several internal Loyola websites. These websites are commonly accessed by new students and typically provide technical information to Loyola students, faculty, and staff. Because these documents are private, search engines (and by extension LLMs) are unaware of the details of these documents. Without such details, external tools are of little help to internal stakeholders.

In [None]:
df = pd.read_csv('data/loyola_documents.csv', encoding = "ISO-8859-1")

In [None]:
df

### LLM Engineering

Here I define the behavior of the LLM system.

In [None]:
system_content = """
    you are a loyola university maryland chat bot.
    you are aware of the jesuit values and you emulate those as you answer student questions about university life. 
    if your response includes details from a document, please summarize the relevant points.
    do not refer to the documents in your response.
"""

In this code, I let the user make a request to the LLM. This is done to illustrate the difference between base LLM responses and contextually aware responses.

In [None]:
user_prompt = input(">")
user_content = {'input': user_prompt, 'prompt': user_prompt}

In this code, I set the user content to the result of the build_user_prompt() function. This is a function that prompts the user for a message to send to the chatbot. It tries to assess whether the message contains a question. If it does contain a question, the function then searches for relevant documents to pass to the LLM to aid it in its response. This code is commented out at the start. To see how the LLM response differs when context is added, uncomment this line.

In [None]:
# user_content = build_user_prompt(df)
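
The implementation of build_user_prompt is not shown here. The sketch below illustrates the general retrieve-then-prompt pattern under simplifying assumptions: it takes the message and a plain list of document strings instead of prompting for input and reading a DataFrame, treats a message as a question if it contains a question mark or starts with a common question word, and ranks documents by simple word overlap after dropping a few stop words. All names (sketch_user_prompt, looks_like_question, top_documents) and the sample documents are hypothetical; the actual function may detect questions and retrieve documents quite differently.

```python
import re

# Small illustrative word lists; a real application would use fuller ones.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how",
                  "can", "do", "does", "is", "are"}
STOP_WORDS = QUESTION_WORDS | {"i", "the", "a", "an", "to", "my", "for", "of"}

def tokenize(text):
    """Lowercase the text and return its non-stop-word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOP_WORDS

def looks_like_question(message):
    """Heuristic: a question mark, or a leading question word."""
    words = re.findall(r"[a-z0-9]+", message.lower())
    return "?" in message or (bool(words) and words[0] in QUESTION_WORDS)

def top_documents(message, documents, k=2):
    """Return up to k documents sharing at least one word with the message."""
    query = tokenize(message)
    scored = [(len(query & tokenize(doc)), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:k] if score > 0]

def sketch_user_prompt(message, documents):
    """Build a dict with the raw input and a context-augmented prompt."""
    prompt = message
    if looks_like_question(message):
        context = top_documents(message, documents)
        if context:
            prompt = ("Documents:\n" + "\n".join(context)
                      + "\n\nQuestion: " + message)
    return {"input": message, "prompt": prompt}

# Made-up documents standing in for rows of the scraped data.
docs = [
    "Campus wifi setup requires registering your device with technology services.",
    "The library is open until midnight during exam week.",
]
result = sketch_user_prompt("How do I connect to the wifi?", docs)
print(result["prompt"])
```

The returned dictionary uses the same 'input' and 'prompt' keys that the application cell reads, so the original message can be displayed while the augmented prompt is what gets sent to the LLM.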

### Application Building

In [None]:
print(f"\x1B[1m\x1b[33m{'-'*20}USER{'-'*20}\x1B[0m")
print(user_content['input'])

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": system_content
        },
        {
            "role": "user",
            "content": user_content['prompt'],
        }
    ],
    model="llama-3.1-8b-instant",
    temperature=0.5,
    max_tokens=1024,
    top_p=1,
    stop=None,
    stream=False,
)
print(f"\x1B[1m\x1b[32m{'-'*20}BOT{'-'*20}\x1B[0m")
print(f"{chat_completion.choices[0].message.content}")
