Some instructions and general comments:

* Please install the selenium and beautifulsoup4 packages (pandas is also used in the later cells).


* Download chromedriver from http://chromedriver.chromium.org/downloads, unzip it, and put the path to its executable in the second line of the first cell, "browser = webdriver.Chrome(path)".


* Executing the first cell opens a Chrome browser on the Hindustan Times homepage. Do not close it until the first cell has finished executing.


* When you open the CSV file in Excel, you might see garbled characters like â€˜. To open the file with the proper encoding, follow the instructions at http://64.132.240.34/kbbase/index.php?View=entry&EntryID=320. The problem occurs only when viewing the CSV in Excel; reading from or writing to the CSV in code works without issue, as can be seen below.


* I have not written a general function that takes a URL and extracts headlines from any news website. The reason is that different news websites have different HTML structures, with little in common between them, which makes a general function rather difficult to write. I looked at the pages of Hindustan Times, Times of India, BBC and CNN, and the HTML structure was very different for each. Feel free to correct me if I am wrong.
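
The garbled characters mentioned above are what UTF-8 bytes look like when they are decoded as Windows-1252. The following is a minimal sketch, using only the standard library, that reproduces the effect and shows one common workaround: writing the file with the 'utf-8-sig' encoding, which prepends a byte order mark that Excel generally uses to detect UTF-8. The file name `headlines_demo.csv` and the sample row are assumptions for illustration only.

```python
import csv
import os
import tempfile

# A left single quotation mark, as used in many of the headlines
quote = "\u2018"

# UTF-8 bytes misread as Windows-1252 produce the garbled text seen in Excel
mojibake = quote.encode("utf-8").decode("cp1252")
print(mojibake)

# Hypothetical file name and sample row, for illustration only
path = os.path.join(tempfile.mkdtemp(), "headlines_demo.csv")
rows = [["Headlines"], ["\u2018Real Diwali when misrule of NDA ends\u2019"]]

# 'utf-8-sig' writes a byte order mark (BOM) at the start of the file
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerows(rows)

# Confirm that the file begins with the three BOM bytes
with open(path, "rb") as f:
    starts_with_bom = f.read(3) == b"\xef\xbb\xbf"
print(starts_with_bom)
```

With pandas, the analogous call would be `df.to_csv(path, index=False, encoding='utf-8-sig')`.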

In [363]:
from selenium import webdriver

# Don't forget to put the correct path of chromedriver executable here
browser = webdriver.Chrome('C:/Users/Dell/Downloads/chromedriver.exe')

browser.get("https://www.hindustantimes.com/")

innerHTML = browser.execute_script("return document.body.innerHTML")

In [364]:
from bs4 import BeautifulSoup
import re

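# Name the parser explicitly so the result does not depend on which parsers are installed
soup_ini = BeautifulSoup(innerHTML, 'html.parser')

# Removing texts like 'Terms of Use', 'Disclaimer' etc. present at the bottom of the page
soup_ini_str = re.sub('<li>.*?</li>', '', str(soup_ini))

soup = BeautifulSoup(soup_ini_str, 'html.parser')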

# Extract all the html <a> tags associated with hyperlinks
titles = soup.find_all('a',  attrs={'href': re.compile("^https://")})

# Extract the text from each tag; this list holds all headlines plus some text wrongly construed as headlines
headlines = [t.text.strip() for t in titles]
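
One thing to keep in mind about the `re.sub('<li>.*?</li>', ...)` step: by default `.` does not match a newline, so an `<li>` element that spans several lines is left in place. Whether the real page contains such elements is not known from this notebook. The sketch below uses a made-up HTML string to show the difference that `re.DOTALL` makes.

```python
import re

# Made-up markup: one single-line <li> and one that spans two lines
html = "<ul><li>Terms of Use</li><li>\nDisclaimer</li></ul><a>Headline</a>"

# Pattern as used in the notebook: '.' does not match newlines
default_result = re.sub(r"<li>.*?</li>", "", html)

# Same pattern with re.DOTALL, so '.' also matches newlines
dotall_result = re.sub(r"<li>.*?</li>", "", html, flags=re.DOTALL)

print(repr(default_result))
print(repr(dotall_result))
```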

In [365]:
titles_2 = soup.find_all('li',  attrs={'class' : 'column-head'})

# Extract texts like 'Opinion', 'Cities', 'Sports' etc., these are categories near the bottom of the page
headlines_2 = [t2.text.strip() for t2 in titles_2]

In [366]:
titles_3 = soup.find_all('div',  attrs={'class' : re.compile("^(column|sign)")})

# From top itself page is divided into several categories like 'latest news', 'don't miss', 'must watch' etc., 
# these texts are extracted here
headlines_3 = [t3.text.strip() for t3 in titles_3]
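
To illustrate how a compiled regular expression is used as a class filter in `find_all`, here is a small sketch on made-up markup. The class names `column-title`, `sign-in` and `story` are assumptions for illustration and are not taken from the actual page.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup; the class names are assumptions for illustration
html = """
<div class="column-title">Latest News</div>
<div class="sign-in">Sign In</div>
<div class="story">A real headline</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Matches any <div> with a class that starts with 'column' or 'sign'
matches = soup.find_all("div", attrs={"class": re.compile("^(column|sign)")})
labels = [m.text.strip() for m in matches]
print(labels)
```

Only the first two `<div>` elements are selected, which is why the same kind of pattern picks up section labels rather than story text.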

In [367]:
titles_4 = soup.find_all('a',  attrs={'class' : re.compile("^(cta-link|trc_|item-)")})

# Many articles in the categories described above also carry a sub-category or a second category label.
# Those labels are extracted here, along with advertisements such as
# 'Indians born before 1967 are now eligible to test hearing aidsHearing Aid Trial'
headlines_4 = [t4.text.strip() for t4 in titles_4]

# combine all the texts wrongly construed as headlines in a single list
headlines_to_remove = set(headlines_2 + headlines_3 + headlines_4)

# removing texts wrongly construed as headlines from the list of all texts, thereby getting list of mostly correct headlines
better_hl = list(set(headlines) - set(headlines_to_remove))

print(better_hl)

['', 'Delhi’s air quality recorded in ‘very poor’ category, authorities warn of deterioration', '‘In a sense, I am on the ticket,’ Donald Trump seeks voter support', 'Thousands protest in Aizawl seeking Mizoram CEO’s removal', 'How science can help us prepare for disasters', 'Meet the Indian women who look to win maiden ICC Women’s WT20 title', 'B.Tech student found dead in Delhi, police suspect case of ‘drug overdose’', 'Colour happy: Browns for all seasons and occasions', 'Diwali 2018: Quinoa chips, massage oil and more healthy gift ideas', '‘Real Diwali when misrule of NDA ends’: Chandrababu Naidu', 'ਕਰਨਾਟਕ ਜਿ਼ਮਨੀ ਚੋਣਾਂ: 4 ਸੀਟਾਂ `ਤੇ ਕਾਂਗਰਸ-ਜਨਤਾ ਦਲ ਤੇ 1 `ਤੇ ਭਾਜਪਾ ਅੱਗੇ', 'Bill Gates aims to save $233 billion by reinventing the toilet', 'Move over Batman, Bajrangi takes on supervillains', 'Delhi enveloped in smog as wind blows in smoke from farm fires', 'Ind vs WI: UP rename Lucknow stadium after Atal Bihari Vajpayee', 'Societies to conduct recruitment test for 5 new medical colleges',
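
Converting to sets removes duplicates but also discards the order in which headlines appear on the page. If that order matters, one possible alternative is sketched below. `filter_headlines` is a hypothetical helper, not part of the notebook, and the sample strings are made up.

```python
def filter_headlines(headlines, to_remove):
    """Drop unwanted and duplicate entries while keeping first-seen order."""
    unwanted = set(to_remove)
    seen = set()
    kept = []
    for text in headlines:
        # Skip known non-headlines and anything already kept
        if text in unwanted or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept


# Made-up sample data for illustration
sample = ["Story A", "Opinion", "Story B", "Story A", "Sign In"]
result = filter_headlines(sample, ["Opinion", "Sign In"])
print(result)
```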

In [472]:
# Removing some exceptional texts wrongly identified as headlines which were not removed before
final_hl = list(filter(lambda x: (x != '' and x != '...read more' and x != 'Sign In'), better_hl))

final_hl[:10]

['Delhi’s air quality recorded in ‘very poor’ category, authorities warn of deterioration',
 '‘In a sense, I am on the ticket,’ Donald Trump seeks voter support',
 'Thousands protest in Aizawl seeking Mizoram CEO’s removal',
 'How science can help us prepare for disasters',
 'Meet the Indian women who look to win maiden ICC Women’s WT20 title',
 'B.Tech student found dead in Delhi, police suspect case of ‘drug overdose’',
 'Colour happy: Browns for all seasons and occasions',
 'Diwali 2018: Quinoa chips, massage oil and more healthy gift ideas',
 '‘Real Diwali when misrule of NDA ends’: Chandrababu Naidu',
 'ਕਰਨਾਟਕ ਜਿ਼ਮਨੀ ਚੋਣਾਂ: 4 ਸੀਟਾਂ `ਤੇ ਕਾਂਗਰਸ-ਜਨਤਾ ਦਲ ਤੇ 1 `ਤੇ ਭਾਜਪਾ ਅੱਗੇ']

In [480]:
import pandas as pd

df = pd.DataFrame()

df['Headlines'] = final_hl

df.to_csv('hindustan_times_headlines.csv', index=False)

headlines_csv = pd.read_csv('hindustan_times_headlines.csv')

headlines_csv.head(10)

   Headlines
0  Delhi’s air quality recorded in ‘very poor’ ca...
1  ‘In a sense, I am on the ticket,’ Donald Trump...
2  Thousands protest in Aizawl seeking Mizoram CE...
3  How science can help us prepare for disasters
4  Meet the Indian women who look to win maiden I...
5  B.Tech student found dead in Delhi, police sus...
6  Colour happy: Browns for all seasons and occas...
7  Diwali 2018: Quinoa chips, massage oil and mor...
8  ‘Real Diwali when misrule of NDA ends’: Chandr...
9  ਕਰਨਾਟਕ ਜਿ਼ਮਨੀ ਚੋਣਾਂ: 4 ਸੀਟਾਂ `ਤੇ ਕਾਂਗਰਸ-ਜਨਤਾ ਦ...


In [456]:
# Check that the number of headlines extracted equals the number of headlines written to the csv file;
# the result is True if the counts match
print(headlines_csv.shape[0] == len(final_hl))

True

In [454]:
# Check that the headlines extracted and the headlines written to the csv file are the same. Any headline present in one
# but not the other is displayed below; the result is the empty set 'set()' if all headlines match
set(headlines_csv['Headlines']).symmetric_difference(set(final_hl))

set()
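
A set comparison ignores order and duplicates. A stricter round-trip check compares the lists element by element. The sketch below swaps in the standard library `csv` module and an in-memory buffer in place of pandas and a file on disk, and the sample headlines are made up.

```python
import csv
import io

# Made-up headlines, including a comma and non-ASCII quotes
original = ["Story one, with a comma", "\u2018Quoted\u2019 story", "Story three"]

# Write to an in-memory CSV with a header row, as the notebook does on disk
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Headlines"])
writer.writerows([h] for h in original)

# Read it back and skip the header row
buffer.seek(0)
rows = list(csv.reader(buffer))
restored = [row[0] for row in rows[1:]]

# True only if every headline survived in the same position
print(restored == original)
```

With the notebook's own objects, the equivalent comparison would be `headlines_csv['Headlines'].tolist() == final_hl`.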