# Getting News and Financial Data and Using TF-IDF to Extract Features

This notebook is part of a set of experiments on integrating financial and news data. It fetches financial data and related news, then uses TF-IDF to extract n-grams from the news that are relevant to the financial trends.

Insert your NewsAPI credentials below; they are needed to build the news dataset. If you already have a dataset of your own, you can skip this part.

In [None]:
NEWSAPI_APP_KEY = ""
NEWSAPI_APP_ID = ""

Necessary imports. If you want to run a cell individually, run this cell first and then the selected cell.

In [None]:
import yfinance as yf
import pandas as pd
import numpy as np
import news_signals
import datetime
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

This cell uses the yfinance library to download financial data, then classifies the market trend by checking whether the price has moved by more than a given percentage over a given time window.

It is currently set up for TSLA (Tesla) stock for the year 2023, with a 3-day rolling window and a 5% threshold. That is, it downloads Tesla's daily data for 2023, slides a 3-day window over it, and labels each window *+1 for an upward trend, 0 for neutral (within threshold), -1 for a downward trend*. If the stock has moved up or down by more than 5% from the first day's open to the last day's close, and the move also exceeds the standard deviation of the window's daily open-to-close returns, the window gets +1 or -1 respectively; otherwise it gets a neutral 0.


Modify **ticker** to choose the stock you want data for (e.g. TSLA, AAPL, JPM), **start_date** and **end_date** to set the overall time period of the financial data, **window_size** to set the rolling window length in days, and **percent_change** to set the threshold as a decimal fraction (0.05 = 5%).

In [None]:


# Parameters for the financial data
ticker = "TSLA"            # Change ticker if needed
start_date = "2023-01-01"    # Start date for historical data
end_date = "2023-12-31"      # End date for historical data
window_size = 3            # 3-day rolling window
percent_change = 0.05       # 5% minimum price change

# Download daily stock data
data = yf.download(ticker, start=start_date, end=end_date)
data.index = pd.to_datetime(data.index)

def classify_window(window):
    """
        +1 if cumulative return >  percent_change and >  volatility (upward trend)
        -1 if cumulative return < -percent_change and < -volatility (downward trend)
         0 otherwise (neutral)
    """
    first_open = float(window['Open'].iloc[0])
    last_close = float(window['Close'].iloc[-1])
    cumulative_return = (last_close - first_open) / first_open
    daily_returns = (window['Close'] - window['Open']) / window['Open']
    volatility = float(daily_returns.std())
    
    if cumulative_return > percent_change and cumulative_return > volatility:
        return 1
    elif cumulative_return < -percent_change and cumulative_return < -volatility:
        return -1
    else:
        return 0

# Apply a rolling window to classify the trend for each period
trend_results = []
dates = []
for i in range(window_size - 1, len(data)):
    window = data.iloc[i - window_size + 1 : i + 1]
    trend = classify_window(window)
    trend_results.append(trend)
    dates.append(data.index[i])

# Create a DataFrame with the trend classifications (using the last day of each window as the index)
rolling_trend_df = pd.DataFrame({'Trend': trend_results}, index=dates)
print(rolling_trend_df)
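
To make the labelling rule concrete, here is a minimal sketch that applies the same logic to a handful of made-up prices. The prices, the `toy` DataFrame, and the `label_window` helper are hypothetical and exist only for illustration; they are not real market data.

```python
import pandas as pd

# Hypothetical prices for illustration only (not real market data).
toy = pd.DataFrame(
    {
        "Open":  [100.0, 101.0, 104.0, 107.0, 106.0],
        "Close": [101.0, 104.0, 107.0, 106.0, 96.0],
    }
)

def label_window(window, threshold=0.05):
    """Label one window: +1 upward, -1 downward, 0 neutral."""
    first_open = window["Open"].iloc[0]
    last_close = window["Close"].iloc[-1]
    cumulative = (last_close - first_open) / first_open
    # Standard deviation of the daily open-to-close returns in the window
    volatility = ((window["Close"] - window["Open"]) / window["Open"]).std()
    if cumulative > threshold and cumulative > volatility:
        return 1
    if cumulative < -threshold and cumulative < -volatility:
        return -1
    return 0

window_size = 3
toy_labels = [
    label_window(toy.iloc[i - window_size + 1 : i + 1])
    for i in range(window_size - 1, len(toy))
]
print(toy_labels)
```

The first window rises 7% (100 to 107) and is labelled +1, the second rises just under 5% (101 to 106) and stays neutral, and the third falls about 7.7% (104 to 96) and is labelled -1.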


Run this cell to check the distribution of the trend classes.

In [None]:

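# Sort by class label so the bars (and their colors) follow the order -1, 0, +1
class_distribution = rolling_trend_df['Trend'].value_counts(normalize=True).sort_index() * 100
print("Class Distribution (Percentage):")
print(class_distribution)

# Plot the class distribution as percentages
# (red = -1, blue = 0, green = +1 when all three classes are present)
class_distribution.plot(kind='bar', color=['red', 'blue', 'green'])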
plt.title('Class Distribution of Trends (Percentage)')
plt.xlabel('Trend')
plt.ylabel('Percentage')
plt.xticks(rotation=0)
plt.show()

This cell calls the News API and builds a CSV dataset of news articles.

Parameters to modify:

  **published_at.start** and **published_at.end** : The date range to retrieve news from

  **language** : The language of the news articles

  **entities** : The entities to search for and their minimum overall prominence in the news article

  **source.rankings.alexa.rank.min** and **source.rankings.alexa.rank.max** : The range of traffic rankings of the sources to retrieve articles from


For more information, please visit [NewsAPI Documentation](https://docs.aylien.com/newsapi/v6/getting-started/#overview)
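
Because the news query and the financial download should cover the same period and entity, it can help to build the query parameters from shared variables. The sketch below is one way to do that; `build_news_params` is a hypothetical helper (not part of any library), and it only reuses the parameter names and query syntax that appear in this notebook.

```python
def build_news_params(start_date, end_date, surface_forms, min_prominence=0.7, per_page=100):
    """Hypothetical helper: build the query dict from 'YYYY-MM-DD' dates and entity names."""
    quoted_forms = " OR ".join(f'"{form}"' for form in surface_forms)
    entities = (
        "{{surface_forms:(" + quoted_forms + ") AND overall_prominence:>="
        + str(min_prominence) + "}}"
    )
    return {
        "published_at.start": f"{start_date}T00:00:00.000Z",
        "published_at.end": f"{end_date}T00:00:00.000Z",
        "language": "(en)",
        "entities": entities,
        "per_page": per_page,
    }

example_params = build_news_params("2023-01-01", "2023-12-31", ["TSLA", "Tesla", "Elon Musk"])
print(example_params["entities"])
```

The source ranking filters are left out of the sketch; they can be added to the returned dictionary in the same way.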


In [None]:

HEADERS = {
    'X-AYLIEN-NewsAPI-Application-ID': NEWSAPI_APP_ID,
    'X-AYLIEN-NewsAPI-Application-Key': NEWSAPI_APP_KEY
}

params = {
    "published_at.start": "2023-01-01T00:00:00.000Z",
    "published_at.end": "2023-12-31T00:00:00.000Z",
    "language": "(en)",
    "entities": '{{surface_forms:("TSLA" OR "Tesla" OR "Elon Musk") AND overall_prominence:>=0.7}}',
    "source.rankings.alexa.rank.min": "1",
    "source.rankings.alexa.rank.max": 7,
    "per_page": 100,
}

news_data = []
cursor = "*" 
while cursor:
    if cursor != "*":
        params["cursor"] = cursor

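    response = requests.get("https://api.aylien.com/v6/news/stories", params=params, headers=HEADERS)
    # Fail fast on HTTP errors (e.g. bad credentials) instead of treating them as "no stories"
    response.raise_for_status()
    result = response.json()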

    stories = result.get("stories", [])
    if not stories:
        print("No more articles found. Stopping pagination.")
        break  
    for s in stories:
        news_data.append({
            "author": s.get("author", "Unknown"),
            "published_at": s.get("published_at", ""),
            "title": s.get("title", ""),
            "body": s.get("body", ""),
            "source": s.get("source", {}).get("name", ""),
            "url": s.get("links", {}).get("permalink", "")
        })

    print(f"Retrieved {len(stories)} articles. Total so far: {len(news_data)}")

    cursor = result.get("next_page_cursor")

news_df = pd.DataFrame(news_data)
news_csv_file = "entity_news_paged.csv"
news_df.to_csv(news_csv_file, index=False)
print(f"News data saved to {news_csv_file} with {len(news_df)} articles")
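
The pagination loop above stops only when a page comes back empty. The sketch below shows the same cursor-following pattern with two extra guards: a page limit and a check for a cursor that has already been seen. It is tested against fake in-memory pages rather than the real API; `collect_stories`, `fetch_page`, and `fake_pages` are hypothetical names, and the guards are a precaution rather than a statement about how the API behaves.

```python
def collect_stories(fetch_page, max_pages=50):
    """Follow next_page_cursor until a page is empty, the cursor repeats, or max_pages is reached."""
    stories = []
    cursor = "*"
    seen_cursors = set()
    for _ in range(max_pages):
        page = fetch_page(cursor)
        batch = page.get("stories", [])
        if not batch:
            break
        stories.extend(batch)
        seen_cursors.add(cursor)
        cursor = page.get("next_page_cursor")
        if not cursor or cursor in seen_cursors:
            break
    return stories

# Fake two-page response standing in for the API (hypothetical data).
fake_pages = {
    "*": {"stories": [{"title": "a"}, {"title": "b"}], "next_page_cursor": "c1"},
    "c1": {"stories": [{"title": "c"}], "next_page_cursor": "c1"},
}
collected = collect_stories(lambda cursor: fake_pages[cursor])
print([story["title"] for story in collected])
```

In the notebook, `fetch_page` would wrap the `requests.get` call and return the parsed JSON for a given cursor.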


This cell creates a CSV file that pairs each trend value with the news articles published since the previous trend date (previous date inclusive, current date exclusive). The first trend value gets all news published from the finance start date onward.

In [None]:
news_df = pd.read_csv("entity_news_paged.csv")
news_df["published_at"] = pd.to_datetime(news_df["published_at"]).dt.tz_convert(None)

rolling_trend_df_reset = rolling_trend_df.reset_index()
rolling_trend_df_reset.rename(columns={'index': 'Date'}, inplace=True)
rolling_trend_df_reset["Date"] = pd.to_datetime(rolling_trend_df_reset["Date"])
# Ensure the financial data is sorted by Date
rolling_trend_df_reset = rolling_trend_df_reset.sort_values("Date")

# Define the starting boundary for the news window (use your finance start date)
finance_start_date = pd.to_datetime("2023-01-01")

attached_news = []

# Set the initial previous date for the window
prev_date = finance_start_date

# Iterate over each financial date and attach news articles published in the interval [prev_date, current_date)
for current_date in rolling_trend_df_reset["Date"]:
    # Filter news articles: published on or after prev_date and before current_date
    mask = (news_df["published_at"] >= prev_date) & (news_df["published_at"] < current_date)
    window_news = news_df[mask]
    
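    # Combine news articles from this window:
    # Concatenate the title and body for each article, separated by a newline.
    # Empty titles or bodies are read back from the CSV as NaN, so replace them
    # with empty strings before joining.
    combined_text = "\n\n".join(
        (window_news["title"].fillna("") + "\n" + window_news["body"].fillna("")).tolist()
    )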
    attached_news.append(combined_text)
    
    prev_date = current_date

rolling_trend_df_reset["News"] = attached_news
csv_filename = "trend_news.csv"
rolling_trend_df_reset.to_csv(csv_filename, index=False)
print(rolling_trend_df_reset)
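
The half-open interval [prev_date, current_date) is easy to get wrong, so here is a small sketch of the same bucketing on made-up timestamps. The `toy_news` articles and `trend_dates` are hypothetical.

```python
import pandas as pd

# Hypothetical articles and trend dates, for illustration only.
toy_news = pd.DataFrame({
    "published_at": pd.to_datetime([
        "2023-01-02 09:00", "2023-01-04 00:00", "2023-01-04 15:30", "2023-01-05 08:00",
    ]),
    "title": ["A", "B", "C", "D"],
})
trend_dates = pd.to_datetime(["2023-01-04", "2023-01-05", "2023-01-06"])

prev_date = pd.Timestamp("2023-01-01")
attached = []
for current_date in trend_dates:
    # Keep articles with prev_date <= published_at < current_date
    mask = (toy_news["published_at"] >= prev_date) & (toy_news["published_at"] < current_date)
    attached.append(toy_news.loc[mask, "title"].tolist())
    prev_date = current_date

print(attached)
```

Article B, published exactly at midnight on 2023-01-04, falls into the interval that starts on that date, not the one that ends on it. Each trend date therefore only sees news published strictly before it.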


This cell initializes the TF-IDF vectorizer and fits it on the news text.

Parameter information:

**stop_words** : Removes common English words that do not add much meaning to the text.

**ngram_range** : Extracts word sequences of size 2 (bigrams) and 3 (trigrams).

**max_features** : Limits the vocabulary to the 500,000 n-grams with the highest term frequency across the corpus.

**min_df** : Ignores n-grams that appear in fewer than 5 documents.

**max_df** : Ignores n-grams that appear in more than 20% of the documents.

For more information, please visit [TF-IDF Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [None]:
df = pd.read_csv("trend_news.csv")

# Fill missing values in the 'News' column with an empty string
news_text = df["News"].fillna("").astype(str)


vectorizer = TfidfVectorizer(
    stop_words="english",
    ngram_range=(2,3),
    max_features=500000,
    min_df=5,
    max_df=0.2
)


tfidf_matrix = vectorizer.fit_transform(news_text)

# Display the shape of the resulting TF-IDF matrix
print("TF-IDF matrix shape:", tfidf_matrix.shape)

feature_names = vectorizer.get_feature_names_out()
print("First 50 feature names:", feature_names[:50])


Here we train a *Logistic Regression* classifier on the TF-IDF features and report its results (feel free to use any other model). We also print the most important n-gram features for each trend class.

More information on [TF-IDF and Logistic Regression](https://medium.com/@tejasdalvi927/sentiment-analysis-using-tf-idf-and-logisticregression-5ccc4f5c4f81)

In [None]:
y = df["Trend"]

# Split data into training and test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, y, test_size=0.2, random_state=42)

# Initialize and train a Logistic Regression classifier
clf = LogisticRegression(max_iter=5000, random_state=42)
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Display the test set accuracy and classification report
print("Test set accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred) * 100))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Display top features for each trend class
feature_names = vectorizer.get_feature_names_out()
for class_label in np.unique(y_train):
    class_index = list(clf.classes_).index(class_label)
    coef = clf.coef_[class_index]
    top_indices = np.argsort(coef)[-10:][::-1]
    print(f"\nTop features for trend class {class_label}:")
    for i in top_indices:
        print(f"{feature_names[i]}: {coef[i]:.4f}")
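
When the trend classes are imbalanced, raw accuracy can look better than it is. One common sanity check is to compare against a majority-class baseline, sketched here with scikit-learn's `DummyClassifier` on made-up labels. The labels and the placeholder features are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical, imbalanced labels: mostly neutral (0).
y_toy = np.array([0] * 8 + [1, -1])
X_toy = np.zeros((len(y_toy), 1))  # placeholder features; the dummy model ignores them

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.0%}")
```

In the notebook, the same check would fit the baseline on `X_train, y_train` and score it on `X_test, y_test`; the Logistic Regression accuracy is only informative if it beats that number.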

This cell runs a grid search to find the best parameters for the TF-IDF and Logistic Regression pipeline. You can modify the entries in **param_grid** to change which parameter combinations are tested.

For more information, see the [Grid Search documentation](https://scikit-learn.org/stable/modules/grid_search.html)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        stop_words="english",
        ngram_range=(1, 2),
        max_features=3000,
        min_df=3,
        max_df=0.9)),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__max_features': [1000, 3000, 5000],
    'tfidf__min_df': [2, 3, 5],
    'tfidf__max_df': [0.8, 0.9, 1.0],
    'clf__C': [0.01, 0.1, 1, 10]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(news_text, df["Trend"])

print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score: {:.2f}%".format(grid_search.best_score_ * 100))
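
Beyond the single best score, the fitted search object stores the results for every parameter combination in `cv_results_`. The sketch below shows one way to turn that into a ranked table, using a tiny made-up corpus and a one-parameter grid so it runs quickly. The texts, labels, and grid are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus and labels, for illustration only.
toy_texts = [
    "shares rise after strong deliveries", "stock jumps on record sales",
    "shares rise on upbeat guidance", "stock jumps after earnings beat",
    "shares drop after recall news", "stock falls on weak demand",
    "shares drop on price cuts", "stock falls after production halt",
]
toy_targets = [1, 1, 1, 1, -1, -1, -1, -1]

toy_search = GridSearchCV(
    Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression(max_iter=1000))]),
    {"clf__C": [0.1, 1.0]},
    cv=2,
    scoring="accuracy",
)
toy_search.fit(toy_texts, toy_targets)

# One row per parameter combination, best-ranked first.
columns = ["param_clf__C", "mean_test_score", "std_test_score", "rank_test_score"]
cv_table = (
    pd.DataFrame(toy_search.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(cv_table)
```

For the notebook's search, the equivalent table is `pd.DataFrame(grid_search.cv_results_)`. Looking at `std_test_score` next to `mean_test_score` shows whether the best combination is clearly better or only marginally ahead.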
