# Task 2: Keyword Extraction and Topic Modeling

## Step 1: Perform Keyword Extraction

In this section, we extract keywords using TF-IDF (Term Frequency-Inverse Document Frequency), which scores a term highly when it is frequent within one document but rare across the corpus. The goal is to identify the most significant words in the headlines and bodies of the news articles.


In [None]:
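import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords(texts, top_n=10, min_df=2):
    # Drop terms found in more than 95% of documents or in fewer than min_df documents
    vectorizer = TfidfVectorizer(max_df=0.95, min_df=min_df, stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(texts)
    feature_names = vectorizer.get_feature_names_out()
    # Rank terms by their mean TF-IDF score across the corpus, highest first
    mean_scores = np.asarray(tfidf_matrix.mean(axis=0)).ravel()
    top_idx = mean_scores.argsort()[::-1][:top_n]
    return [feature_names[i] for i in top_idx]

# Example usage (min_df=1 because this toy corpus has only two documents,
# so min_df=2 would prune every term and raise a ValueError)
texts = ["This is a sample news article.", "Keyword extraction from text."]
keywords = extract_keywords(texts, min_df=1)
print(keywords)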


## Step 2: Perform Topic Modeling

Next, we perform topic modeling using Latent Dirichlet Allocation (LDA), which represents each article as a mixture of topics and each topic as a distribution over words. This helps us identify underlying topics in the news articles.


In [None]:
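from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

def perform_topic_modeling(texts, n_topics=5, min_df=2):
    # Note: LDA is usually fit on raw term counts (CountVectorizer);
    # TF-IDF input is kept here to match the rest of the notebook.
    vectorizer = TfidfVectorizer(max_df=0.95, min_df=min_df, stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(tfidf_matrix)
    return lda, vectorizer

# Example usage (min_df=1 because the toy corpus has only two documents)
lda, vectorizer = perform_topic_modeling(texts, n_topics=2, min_df=1)
# components_ has shape (n_topics, n_terms): one row of term weights per topic
print(lda.components_)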


## Visualization of Topics

To better understand the topics identified, we project each article's topic distribution into a 2D space with t-SNE and color each point by the article's dominant topic.


In [None]:
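import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize_topics(lda, tfidf_matrix):
    # Per-document topic distribution, shape (n_documents, n_topics)
    topic_weights = lda.transform(tfidf_matrix)
    # t-SNE requires perplexity < n_documents (the default is 30)
    perplexity = min(30, topic_weights.shape[0] - 1)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    tsne_results = tsne.fit_transform(topic_weights)

    # Color each document by its dominant topic
    plt.scatter(tsne_results[:, 0], tsne_results[:, 1], c=topic_weights.argmax(axis=1))
    plt.title('Topic Visualization using t-SNE')
    plt.show()

# Example usage: rebuild the document-term matrix with the fitted vectorizer.
# A meaningful plot needs far more documents than the toy corpus provides.
tfidf_matrix = vectorizer.transform(texts)
visualize_topics(lda, tfidf_matrix)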


# Task 2: Event Modeling

## Step 1: Cluster News Articles by Events

In this section, we will cluster news articles based on their content to identify different events. We will use the K-Means clustering algorithm for this task.


In [None]:
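from sklearn.cluster import KMeans

def cluster_news_articles(tfidf_matrix, n_clusters=10):
    # n_clusters must not exceed the number of articles
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    clusters = kmeans.fit_predict(tfidf_matrix)
    return clusters

# Example usage in a notebook or script: rebuild the TF-IDF matrix with the
# fitted vectorizer; n_clusters=2 because the toy corpus has only two documents
tfidf_matrix = vectorizer.transform(texts)
clusters = cluster_news_articles(tfidf_matrix, n_clusters=2)
print(clusters)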


## Step 2: Analyze Event Reporting

Now that we have clustered the articles, let's analyze which news sites report events the earliest and which events have the highest reporting frequency.


In [None]:
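# Example analysis code (adjust column names to your data)

import pandas as pd

def analyze_event_reporting(clusters, articles):
    # Work on a copy so the caller's dataframe is not modified
    articles = articles.assign(
        cluster=clusters,
        published_at=pd.to_datetime(articles['published_at']),
    )
    # Full row of the earliest article in each cluster, including its news site
    earliest_reporting = articles.loc[articles.groupby('cluster')['published_at'].idxmin()]
    # Number of articles per cluster, largest first
    highest_reporting = articles['cluster'].value_counts()

    print("Earliest reporting by cluster:")
    print(earliest_reporting)
    print("\nHighest reporting by cluster:")
    print(highest_reporting)

# Assuming you have a dataframe `articles` with a 'published_at' column
# analyze_event_reporting(clusters, articles)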
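
To answer which news sites report events earliest, the earliest article of each cluster has to be linked back to its site. The sketch below shows one way to do that on a toy dataframe. The `source` column name and all values in `demo_articles` are hypothetical; substitute whatever column identifies the news site in your data.

```python
import pandas as pd

# Hypothetical toy data: four articles covering two events (clusters)
demo_articles = pd.DataFrame({
    "source": ["site_a", "site_b", "site_a", "site_c"],
    "published_at": pd.to_datetime([
        "2024-01-02 09:00", "2024-01-02 08:30",
        "2024-01-03 12:00", "2024-01-03 11:00",
    ]),
    "cluster": [0, 0, 1, 1],
})

# Index label of the earliest article within each cluster
first_idx = demo_articles.groupby("cluster")["published_at"].idxmin()
first_reports = demo_articles.loc[first_idx, ["cluster", "source", "published_at"]]

# How often each site was the first to report an event
first_counts = first_reports["source"].value_counts()

print(first_reports)
print(first_counts)
```

Here `site_b` is first for cluster 0 and `site_c` is first for cluster 1, so each is counted once.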


# Task 2: ML Model Versioning

## Step 1: Version ML Models

For this task, we version our ML models using MLflow: each trained model is logged within a run and registered under a name, so that successive versions can be tracked alongside the run that produced them.


In [None]:
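import mlflow
import mlflow.sklearn

def save_model(model, model_name):
    # Log the model inside a run and register it in the Model Registry in one step
    # (requires a tracking backend that supports the Model Registry)
    with mlflow.start_run():
        mlflow.sklearn.log_model(model, model_name, registered_model_name=model_name)

# Example usage
save_model(lda, "LDA_Topic_Model")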
