#### **Import**

In [1]:
import warnings
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

warnings.filterwarnings('ignore')

#### **Data Gathering and Cleaning**

##### API data

I am not using API data because I don't have a paid plan that would let me fetch enough data for this project.

In [2]:
# response = requests.get("https://newsapi.org/v2/everything?q=apple&from=2025-02-23&to=2025-03-23&sortBy=popularity&apiKey=4c48d0bffc7443e7a85ea78ae3bc640f")
# response = requests.get("https://www.alphavantage.co/query?function=NEWS_SENTIMENT&tickers=AAPL&apikey=BP6PQU06XN7TNL7O")

In [3]:
# soup = BeautifulSoup(response.content)

In [4]:
# soup.prettify()

##### CSV file 1

In [5]:
df = pd.read_csv("../data/raw/india-news-headlines.csv")

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3876557 entries, 0 to 3876556
Data columns (total 3 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   publish_date       int64 
 1   headline_category  object
 2   headline_text      object
dtypes: int64(1), object(2)
memory usage: 88.7+ MB


In [7]:
df.tail()

Unnamed: 0,publish_date,headline_category,headline_text
3876552,20230630,city.goa,10 PIs move HC over thwarted seniority
3876553,20230630,city.goa,Govt notifies award in memory of Parrikar for ...
3876554,20230630,city.goa,After youth's death; PWD installs crash barrie...
3876555,20230630,city.goa,Authorities not acting against CRZ violations
3876556,20230630,city.goa,Technicians to hold trial run of mini-EVs in P...


- The `headline_text` column holds the news text we need
- Since we are training our model for stock price prediction of APPLE, we will only keep records related to it
- We need to filter the records using keywords related to the brand
- We will combine all keywords into a single pattern and use it to filter
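A keyword pattern of this kind is matched as a plain substring, so a short keyword such as `mac` also matches words like "machine". The sketch below shows one possible stricter variant that wraps the keywords in word boundaries. It is only an illustration: the sample headlines and the names `bounded_pattern`, `sample`, and `matched` are assumptions, and the outputs in this notebook were produced with the plain substring pattern, not this one.

```python
import re
import pandas as pd

keywords = ['apple', 'mac', 'iphone', 'ipod', 'ipad', 'airpods', 'ios']

# \b anchors each keyword to word boundaries, so "mac" does not match "machinery"
bounded_pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"

# Hypothetical headlines, for illustration only
sample = pd.DataFrame({
    "headline_text": [
        "Apple unveils new iPhone",
        "The well-oiled machinery of civil society",
        "Govt to spend Rs 225cr on info kiosks",
        "Congress seeks to upset SAD applecart",
    ]
})

mask = sample["headline_text"].str.contains(bounded_pattern, case=False, na=False)
matched = sample[mask]
print(matched)
```

The trade-off is that exact word matching also drops plurals such as "iPhones", so the keyword list would likely need to be extended if this variant were used.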

In [8]:
pattern = "|".join(['apple', 'mac', 'iphone', 'ipod', 'ipad', 'airpods', 'ios'])

In [9]:
# filter the headlines; copy so new columns can be added to the result later
appl_df = df[df['headline_text'].str.contains(pattern, case=False, na=False)].copy()

- This filter was not enough: many records contain these keywords but are not related to APPLE
- To filter further we will use the category column
- We only need news related to stocks, technology, or gadgets, since that is APPLE's business
- We first split the category into main and sub category, then filter on the main category
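The split-then-filter idea can also be sketched with the vectorised string methods in pandas. This is a minimal illustration on made-up category values, not the approach used in the cells that follow; the names `sample`, `parts`, and `kept` are assumptions.

```python
import pandas as pd

# Hypothetical category values in the same dotted format as headline_category
sample = pd.DataFrame({
    "headline_category": [
        "tech.tech-news",
        "unknown",
        "business.india-business",
        "city.goa",
    ]
})

# Split on "." into separate columns; rows with no sub category get a missing value
parts = sample["headline_category"].str.lower().str.split(".", expand=True)
sample["main_category"] = parts[0]
sample["sub_category"] = parts[1]

# Keep only the main categories of interest
main_cats = ['tech', 'electronics', 'gadgets-news', 'business']
kept = sample[sample["main_category"].isin(main_cats)]
print(kept)
```

Note that `parts[1]` only exists when at least one value contains a dot, which holds for this sample and for the category values shown in this notebook.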

In [10]:
appl_cats = appl_df.headline_category.value_counts()

In [11]:
appl_cats

headline_category
india                                               2067
city.shimla                                         1661
tech.tech-news                                      1615
unknown                                             1245
city.chandigarh                                     1145
                                                    ... 
web-series.news.hindi                                  1
most-searched-products.electronics.miscellaneous       1
web-series.news                                        1
times-special                                          1
life-style.fashion.buzz                                1
Name: count, Length: 399, dtype: int64

In [12]:
def category_split(col):
    """Split a dotted category such as 'tech.tech-news' into (main, sub)."""
    parts = col.split('.')

    main = parts[0].lower()
    sub = parts[1].lower() if len(parts) > 1 else np.nan

    return main, sub

In [13]:
# returns main and sub category in alternate rows
cats = appl_df.headline_category.apply(category_split).explode()

In [14]:
# add two new columns with main and sub category
appl_df['main_category'] = cats.iloc[0::2]
appl_df['sub_category'] = cats.iloc[1::2]

In [15]:
appl_df.head()

Unnamed: 0,publish_date,headline_category,headline_text,main_category,sub_category
400,20010105,unknown,Mobile isn't a phone but a fun machine now,unknown,
811,20010125,unknown,Congress seeks to upset SAD applecart in Majit...,unknown,
1280,20010204,unknown,What is meant by the term Track-II diplomacy?,unknown,
1286,20010205,unknown,The well-oiled machinery of civil society,unknown,
1375,20010207,unknown,Quake diplomacy could work for India; Pakistan,unknown,


In [16]:
# categories we need to consider
main_cats = ['tech', 'electronics', 'gadgets-news', 'business']

In [17]:
# Filter the rows by main category; copy so publish_date can be converted in place
final_df = appl_df[appl_df['main_category'].isin(main_cats)].copy()

In [18]:
final_df['publish_date'] = pd.to_datetime(final_df['publish_date'].astype(str), format="%Y%m%d")

In [19]:
final_df.head()

Unnamed: 0,publish_date,headline_category,headline_text,main_category,sub_category
10429,2001-07-29,business.india-business,UTI grapples with potential Rs 1;700 cr pay-out,business,india-business
29028,2001-09-17,business.india-business,Gujarat Samachar keeps markets guessing,business,india-business
36534,2001-10-10,business.india-business,Sony to spend Rs 2.5 cr in ads for audios,business,india-business
37356,2001-10-14,business.international-business,US firms grapple with rules after WTC attacks,business,international-business
40567,2001-10-29,business.india-business,Govt to spend Rs 225cr on info kiosks in N-E,business,india-business


In [20]:
# Save the data file
final_df.to_csv("../data/interim/News1.csv", index=False)

##### CSV file 2

In [21]:
df = pd.read_csv("../data/raw/Data.csv", encoding="ISO-8859-1")

In [22]:
df.head(2)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2000-01-03,0,A 'hindrance to operations': extracts from the...,Scorecard,Hughes' instant hit buoys Blues,Jack gets his skates on at ice-cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl's successor drawn into scandal,The difference between men and women,"Sara Denver, nurse turned solicitor",Diana's landmine crusade put Tories in a panic,Yeltsin's resignation caught opposition flat-f...,Russian roulette,Sold out,Recovering a title
1,2000-01-04,0,Scorecard,The best lake scene,Leader: German sleaze inquiry,"Cheerio, boyo",The main recommendations,Has Cubie killed fees?,Has Cubie killed fees?,Has Cubie killed fees?,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man's extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn't know without the ...,Millennium bug fails to bite


- The articles are spread across separate columns (Top1 to Top25)
- Merging them into one column makes filtering rows easier
- `club_articles()` collects the articles of each row into a single list column
- That column is then exploded into one row per article and filtered using the pattern
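For comparison, the same wide-to-long reshaping can be sketched with `DataFrame.melt`, which produces one row per article directly. The data below is a made-up miniature of the layout, and the names `wide` and `long_df` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the wide layout: one row per date, one column per headline
wide = pd.DataFrame({
    "Date": ["2000-01-03", "2000-01-04"],
    "Label": [0, 1],
    "Top1": ["Apple launches iMac", "Scorecard"],
    "Top2": ["Sold out", None],
})

# Stack every headline column into a single 'all_articles' column
long_df = wide.melt(id_vars=["Date", "Label"], value_name="all_articles")

# Drop the helper column holding the original column name, then remove empty articles
long_df = long_df.drop(columns="variable").dropna(subset=["all_articles"])
print(long_df)
```

Either route ends with one article per row, so the keyword filter can be applied to `all_articles` in the same way.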

In [23]:
def club_articles(col):
    """Return all article values of one row as a list."""
    return list(col)

In [24]:
# combine all article columns into a single list column
df['all_articles'] = df.iloc[:, 2:].apply(club_articles, axis=1)

In [25]:
# taking only required cols
df = df[['Date', 'Label', 'all_articles']]

In [26]:
# add each article as an independent row
df = df.explode('all_articles')

In [27]:
# removes null values if any
df.dropna(inplace=True)

In [28]:
appl_df = df[df['all_articles'].str.contains(pattern, case=False, na=False)]

In [29]:
appl_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 738 entries, 8 to 4099
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Date          738 non-null    object
 1   Label         738 non-null    int64 
 2   all_articles  738 non-null    object
dtypes: int64(1), object(2)
memory usage: 23.1+ KB


- This yields only 738 rows, and most of them are redundant with the previously collected data
- I won't include this data