In [1]:
!pip3 install newsapi-python

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting newsapi-python
  Downloading newsapi_python-0.2.7-py2.py3-none-any.whl (7.9 kB)
Installing collected packages: newsapi-python
Successfully installed newsapi-python-0.2.7


The "newsapi-python" package is a wrapper for the News API, which is a service that provides access to news articles from various sources around the world. By installing this package, developers can use Python to interact with the News API and retrieve news articles based on specific criteria such as keywords, sources, and time periods.

In [2]:
from tqdm import tqdm
import pandas as pd
import numpy as np
from newsapi import NewsApiClient
import nltk
from nltk.tokenize import word_tokenize
import string
from nltk.corpus import stopwords
import spacy
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import seaborn as sns

This code imports several Python packages:

- tqdm: A package that provides progress bars for long-running loops or tasks.
- pandas: A package for data manipulation and analysis.
- numpy: A package for numerical computing in Python.
- newsapi: A package for accessing news articles from various sources around the world.
- nltk: The Natural Language Toolkit, a package for working with human language data in Python.
- spacy: A package for natural language processing in Python.
- sklearn: The Scikit-learn package, a machine learning library for Python.
- matplotlib: A plotting library for Python.
- seaborn: A data visualization library based on Matplotlib.

In [3]:
api_key = '946472f8ad574082a31c9d0fcaf1e89d'
newsapi = NewsApiClient(api_key=api_key)

The first line assigns a string value to the variable api_key. This value is an authentication key that is required to access the News API. It is a unique identifier that is associated with a specific user account and allows the user to make requests to the API.

The second line creates an instance of the NewsApiClient class and assigns it to the variable newsapi. The NewsApiClient class is defined in the newsapi package and provides a set of methods that allow developers to interact with the News API.

The api_key argument is passed to the NewsApiClient constructor to authenticate the client and enable it to make requests to the API on behalf of the user account associated with the provided API key. Once this instance of the NewsApiClient class is created, it can be used to call various methods that retrieve news articles from the API based on specific criteria such as keywords, sources, and time periods.
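Because the key is tied to a user account, hard-coding it in a shared notebook exposes it to every reader. One common alternative is to read it from an environment variable. The sketch below is only an illustration: the helper load_api_key and the variable name NEWSAPI_KEY are assumptions, not part of the newsapi package.

```python
import os

def load_api_key(env_var="NEWSAPI_KEY"):
    """Return the API key stored in an environment variable.

    Both this helper and the default variable name are hypothetical.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key

# Usage, assuming the variable was exported before starting the notebook:
# newsapi = NewsApiClient(api_key=load_api_key())
```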

In [4]:
def crawl_news(query):
    all_results = []
    for pag in tqdm(range(1, 6)):
        pag_articles = newsapi.get_everything(q=query, sort_by='relevancy', page=pag)['articles']
        if len(pag_articles) == 0:
            break
        all_results.extend(pag_articles)
    return all_results

This is a Python function that uses the newsapi package to crawl news articles from various sources based on a search query.

The function takes a single argument, query, which is a string representing the search term or topic of interest.

The for loop iterates over 5 pages of news articles (i.e., range(1,6)). For each page, the newsapi.get_everything() method is called with the q argument set to the query parameter and the sort_by argument set to 'relevancy'. This retrieves news articles from the News API that match the search query and sorts them by relevance. The page argument is set to the current page number in the loop, which allows the function to retrieve news articles from multiple pages of search results.

The retrieved news articles are stored in the pag_articles variable as a list of dictionaries, with each dictionary representing an article and containing various metadata such as the article's title, author, source, publication date, and content.

If the length of pag_articles is zero, the loop breaks, as this indicates that there are no more pages of search results to retrieve.

Finally, the list of all retrieved news articles across all pages is returned as the output of the function.

The function uses the tqdm package to display a progress bar during the loop, indicating the percentage of pages that have been crawled so far.
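The stop-on-empty-page logic can be tried without network access. The sketch below replaces the News API call with a hypothetical fetch_page callable (an assumption made for illustration) that maps a page number to a list of articles.

```python
def crawl_pages(fetch_page, max_pages=5):
    """Collect articles page by page, stopping at the first empty page.

    fetch_page is a hypothetical callable: page number -> list of articles.
    """
    all_results = []
    for page in range(1, max_pages + 1):
        articles = fetch_page(page)
        if not articles:
            break  # an empty page means there is nothing left to fetch
        all_results.extend(articles)
    return all_results

# Stand-in for the API: two pages of results, then nothing.
fake_pages = {1: [{"title": "a"}, {"title": "b"}], 2: [{"title": "c"}]}
results = crawl_pages(lambda page: fake_pages.get(page, []))
print(len(results))  # 3
```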

In [5]:
tesla_news = crawl_news('tesla')

100%|██████████| 5/5 [00:03<00:00,  1.64it/s]


In [8]:
df = pd.read_csv('BBC news dataset.csv', usecols=range(1, 3))

This line of code reads a CSV file called 'BBC news dataset.csv' using the pandas package, and creates a DataFrame object named df that contains the data from the CSV file.

The usecols parameter is used to select only specific columns from the CSV file. In this case, the range(1, 3) argument selects the columns at zero-based positions 1 and 2 (the second and third columns of the file), which correspond to the 'description' and 'tags' columns of the BBC news dataset.

In [9]:
df.drop_duplicates('description', inplace=True)

This line of code drops any duplicate rows in the DataFrame object df based on the values in the 'description' column, and modifies df in place by setting the inplace parameter to True.

The drop_duplicates() method is a built-in method of the pandas package that removes duplicate rows from a DataFrame. The 'description' column is used as the key to identify duplicates. If there are any rows with identical values in the 'description' column, only the first occurrence of such a row is kept and all subsequent duplicates are removed.

By setting inplace=True, the drop_duplicates() method modifies the DataFrame object df directly, without creating a new object. This means that any subsequent operations on df will reflect the changes made by this method.
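A small sketch of the same call on toy data (the rows below are made up, not taken from the BBC dataset) shows which row survives and what happens to the index.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "description": ["markets rally", "markets rally", "team wins final"],
    "tags": ["business", "finance", "sport"],
})

# Rows are compared on 'description' only; the first occurrence is kept.
deduped = toy.drop_duplicates("description")
print(deduped["tags"].tolist())  # ['business', 'sport']
print(deduped.index.tolist())    # [0, 2]
```

The surviving rows keep their original index labels, so the index has gaps after deduplication. Calling reset_index(drop=True) renumbers it if contiguous positions are needed.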

In [10]:
descriptions = df['description'].values

In [11]:
def pre_process_text(text):

    tokens = word_tokenize(text)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # remove punctuation from each word

    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    words = [word for word in stripped if word.isalpha()]
    # filter out stop words

    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words]
    return words

This is a function called pre_process_text that takes a string of text as its input, performs several text preprocessing steps, and returns a list of preprocessed words.

The text preprocessing steps are as follows:

1. Tokenization: The input text is split into individual words or tokens using the word_tokenize() function from the nltk package.
2. Lowercasing: All words are converted to lowercase using a list comprehension.
3. Removing punctuation: Punctuation marks are removed from each word using the string.punctuation string and the str.translate() method.
4. Removing non-alphabetic tokens: Tokens that are not purely alphabetic (such as numbers, or empty strings left behind when a token consisted only of punctuation) are removed using a list comprehension and the str.isalpha() method.
5. Removing stop words: Common stop words (such as "the", "and", "a") are removed using the stopwords.words() method from the nltk package and another list comprehension.

The preprocessed words are returned as a list.

This function can be used to preprocess the descriptions of news articles in the BBC news dataset before clustering or visualization.
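The five steps above can be followed on a single sentence with a dependency-free sketch. It is a simplified stand-in, not the function above: whitespace splitting replaces word_tokenize, and STOP_WORDS is a tiny hypothetical list rather than NLTK's English stop-word list.

```python
import string

# Tiny illustrative stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "and", "is", "in", "of", "to"}

def simple_pre_process(text):
    tokens = text.split()                               # 1. naive tokenization
    tokens = [t.lower() for t in tokens]                # 2. lowercasing
    table = str.maketrans("", "", string.punctuation)
    stripped = [t.translate(table) for t in tokens]     # 3. strip punctuation
    words = [w for w in stripped if w.isalpha()]        # 4. drop non-alphabetic tokens
    return [w for w in words if w not in STOP_WORDS]    # 5. drop stop words

print(simple_pre_process("The Tesla stock rose 5% in 2023, analysts say."))
# ['tesla', 'stock', 'rose', 'analysts', 'say']
```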

In [12]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

The punkt and stopwords datasets are required for tokenization and stopword removal, respectively.

In [13]:
processed_descriptions = []
for description in tqdm(descriptions):
    processed_descriptions.append(' '.join(pre_process_text(description)))

100%|██████████| 2128/2128 [00:05<00:00, 413.14it/s]


This is a loop that iterates over each description in the descriptions numpy array, applies the pre_process_text() function to each description, and appends the preprocessed description to a new list called processed_descriptions.

The tqdm() function is used to display a progress bar that indicates how far along the loop is in processing the descriptions.

Inside the loop, the pre_process_text() function is applied to each description. The resulting list of preprocessed words is joined into a single string using the join() method with a space separator. The joined string is then appended to the processed_descriptions list.

After the loop is finished, the processed_descriptions list contains all of the preprocessed descriptions of the news articles in the BBC news dataset. Each preprocessed description is a single string containing lowercase words with no punctuation, non-alphabetic characters, or stop words.

In [14]:
nlp = spacy.load('en_core_web_sm')

sent_vecs = {}
docs = []

for index, description in enumerate(tqdm(processed_descriptions)):
    doc = nlp(description)
    docs.append(doc)
    sent_vecs[index] = doc.vector

100%|██████████| 2128/2128 [01:39<00:00, 21.47it/s]


This is a block of code that uses the spacy package to compute the vector representation of each preprocessed description in the processed_descriptions list.

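The first line loads a pre-trained English pipeline called en_core_web_sm. This small model provides linguistic annotations but, according to the spaCy documentation, does not ship with static word vectors; with it, the vector attributes fall back to context-sensitive tensors produced by the pipeline. The larger en_core_web_md and en_core_web_lg models include real word vectors and are generally a better choice when vector similarity matters.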

The sent_vecs dictionary is initialized to an empty dictionary, and the docs list is initialized to an empty list.

The code then loops through each preprocessed description in processed_descriptions. For each description, spacy is used to create a document object doc containing linguistic annotations and vectors for each word in the description.

The document object doc is added to the docs list. The vector representation of the entire document is extracted from doc using the doc.vector attribute, and is stored in the sent_vecs dictionary with the index of the description as the key.

After the loop is finished, the sent_vecs dictionary contains a mapping of each index to the vector representation of the corresponding preprocessed description in processed_descriptions. The docs list contains a spacy document object for each preprocessed description.
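spaCy documents doc.vector as defaulting to an average of the per-token vectors. The sketch below illustrates that averaging with NumPy on made-up three-dimensional vectors; it does not call spaCy, and the numbers are assumptions chosen for readability.

```python
import numpy as np

# Made-up 3-dimensional token vectors for a three-token document.
token_vectors = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 1.0, 1.0],
])

# One vector per document: the element-wise mean over its tokens.
doc_vector = token_vectors.mean(axis=0)
print(doc_vector)  # [2. 1. 1.]
```

Averaging discards word order, so two descriptions that use the same words in a different order end up with the same vector.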

In [15]:
vectors = list(sent_vecs.values())

This line of code extracts the vector representations of the preprocessed descriptions from the sent_vecs dictionary and converts them to a list.

The sent_vecs dictionary maps the index of each preprocessed description to its corresponding vector representation. The values() method of a dictionary returns a view object over the values in the dictionary, in insertion order; here those values are the vector representations of the preprocessed descriptions.

The list() function is used to convert the values() view object to a list. The resulting vectors list contains the vector representations of all the preprocessed descriptions in the BBC news dataset.
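The difference between the view and the list can be seen in a short sketch; sent_vecs_demo is a hypothetical stand-in holding two short vectors.

```python
sent_vecs_demo = {0: [0.1, 0.2], 1: [0.3, 0.4]}

view = sent_vecs_demo.values()  # dynamic view, not a list
snapshot = list(view)           # independent list, in insertion order

sent_vecs_demo[2] = [0.5, 0.6]
print(len(view))      # 3: the view reflects the new entry
print(len(snapshot))  # 2: the list is a fixed copy
```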

In [16]:
vectors = np.array(vectors)

In [17]:
labels_results = {}
for i in tqdm(np.arange(0.001, 1, 0.002)):
    dbscan = DBSCAN(eps=i, min_samples=3, metric='cosine').fit(vectors)
    labels_results[i] = len(pd.Series(dbscan.labels_).value_counts())

100%|██████████| 500/500 [01:05<00:00,  7.64it/s]


This is a block of code that performs density-based clustering on the vector representations of the preprocessed descriptions using the DBSCAN algorithm from the sklearn.cluster module.

The algorithm is run with varying values of the eps parameter, which sets the maximum distance between two points for one to be considered in the neighborhood of the other.

The labels_results dictionary is initialized to an empty dictionary. The code then loops through a range of values between 0.001 and 1 in increments of 0.002 using np.arange().

For each value of eps, DBSCAN is run on the vector representations in vectors with the specified value of eps, min_samples=3 (which sets the minimum number of points required to form a dense region), and metric='cosine' (which specifies that cosine distance is used to measure how far apart two vectors are).

The resulting labels are available in the labels_ attribute of the fitted estimator. The number of distinct labels is counted using len(pd.Series(dbscan.labels_).value_counts()) and stored in the labels_results dictionary with the current value of eps as the key. This count includes the noise label -1 whenever at least one point is marked as noise, so it can exceed the number of actual clusters by one.

After the loop is finished, the labels_results dictionary contains a mapping of each value of eps to the number of unique labels produced by DBSCAN using that value of eps. This information can be used to determine the optimal value of eps to use for clustering.
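Counting distinct labels treats the noise label -1 as if it were a cluster. A minimal sketch of a count that excludes noise is shown below; count_clusters is a hypothetical helper and the label list is made up.

```python
def count_clusters(labels):
    """Number of distinct cluster labels, ignoring the noise label -1."""
    unique = set(labels)
    unique.discard(-1)
    return len(unique)

labels = [0, 0, -1, 1, 1, -1, 0]
print(len(set(labels)))        # 3: distinct labels, noise included
print(count_clusters(labels))  # 2: actual clusters
```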

In [18]:
for i in np.arange(0.001, 1, 0.002):
    print('{}: {}'.format(i, labels_results[i]))

0.001: 1
0.003: 1
0.005: 1
0.007: 1
0.009000000000000001: 4
0.011: 12
0.013000000000000001: 20
0.015: 22
0.017: 7
0.019000000000000003: 4
0.021: 6
0.023: 4
0.025: 4
0.027000000000000003: 4
0.029: 3
0.031: 2
0.033: 2
0.035: 2
0.037000000000000005: 2
0.039: 2
0.041: 2
0.043000000000000003: 2
0.045: 2
0.047: 2
0.049: 2
0.051000000000000004: 2
0.053000000000000005: 2
0.055: 2
0.057: 2
0.059000000000000004: 2
0.061: 2
0.063: 2
0.065: 2
0.067: 2
0.069: 2
0.07100000000000001: 2
0.07300000000000001: 2
0.075: 2
0.077: 2
0.079: 2
0.081: 2
0.083: 2
0.085: 2
0.08700000000000001: 2
0.089: 2
0.091: 2
0.093: 2
0.095: 2
0.097: 2
0.099: 2
0.101: 2
0.10300000000000001: 2
0.10500000000000001: 2
0.107: 2
0.109: 2
0.111: 2
0.113: 2
0.115: 2
0.117: 2
0.11900000000000001: 2
0.121: 2
0.123: 2
0.125: 2
0.127: 2
0.129: 2
0.131: 2
0.133: 2
0.135: 2
0.137: 2
0.139: 2
0.14100000000000001: 2
0.14300000000000002: 2
0.14500000000000002: 2
0.147: 2
0.149: 2
0.151: 2
0.153: 2
0.155: 2
0.157: 2
0.159: 2
0.161: 2
0.163: 

This block of code loops through the same range of eps values as the previous cell, from 0.001 up to (but not including) 1 in increments of 0.002, using np.arange().

For each value of eps, it prints the value of eps followed by the number of distinct labels produced by DBSCAN using that value of eps. The count includes the noise label -1 when noise points are present.

This information can be used to determine the optimal value of eps to use for clustering the preprocessed descriptions. Generally, a good value of eps is one that produces a moderate number of clusters that are neither too large nor too small.
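One simple way to scan such a table is to look for the eps value that yields the most labels. The sketch below applies that heuristic to a few pairs copied from the output above, with keys rounded for readability. It is only an illustrative heuristic, not the selection rule used in this notebook, and the value with the most labels is not necessarily the best clustering.

```python
# A few (eps, label count) pairs copied from the output above, keys rounded.
labels_by_eps = {0.009: 4, 0.011: 12, 0.013: 20, 0.015: 22, 0.017: 7, 0.019: 4}

# Pick the eps whose run produced the largest number of distinct labels.
best_eps = max(labels_by_eps, key=labels_by_eps.get)
print(best_eps, labels_by_eps[best_eps])  # 0.015 22
```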

In [19]:
dbscan = DBSCAN(eps=0.029, min_samples=3, metric='cosine').fit(vectors)


This line of code performs clustering using the DBSCAN algorithm with the following parameters:

- eps: the radius of the neighborhood around each data point. Here, it is set to 0.029.
- min_samples: the minimum number of points required to form a dense region. Here, it is set to 3.
- metric: the distance metric to use. Here, it is set to 'cosine', that is, cosine distance (1 minus cosine similarity).

The input data for clustering are the sentence vectors, vectors, which were computed using the preprocessed descriptions. The fitted estimator is stored in the dbscan variable, and its labels_ attribute contains the cluster label assigned to each description.
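With metric='cosine', eps is a threshold on cosine distance, which is 1 minus the cosine similarity of two vectors. The sketch below computes it directly with NumPy; cosine_distance is a hypothetical helper and the vectors are made up.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_distance([1, 0], [2, 0]))  # 0.0: same direction, length ignored
print(cosine_distance([1, 0], [0, 3]))  # 1.0: orthogonal vectors
```

Because the distance depends only on direction, document vectors that point roughly the same way sit very close together, which is consistent with the small eps values explored in this notebook.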

In [20]:
results = pd.DataFrame({
    'desc': processed_descriptions, 
    'label': dbscan.labels_
})

This line of code creates a Pandas DataFrame named results with two columns:

- desc: the preprocessed descriptions, which were previously stored in the processed_descriptions list.
- label: the cluster labels assigned to each description by DBSCAN, which were stored in the dbscan.labels_ attribute.

Each row in the DataFrame represents a single description and its corresponding cluster label.

In [21]:
results['label'].value_counts()

 0    1946
-1     179
 1       3
Name: label, dtype: int64

This line of code returns a Pandas Series containing the number of occurrences of each distinct value in the 'label' column of the DataFrame results, sorted from most to least frequent.

Since DBSCAN assigns -1 as the label to noise points that are not assigned to any cluster, the value counts will include the count of such noise points as well.

The count of each label indicates the number of descriptions that belong to that cluster.

In [22]:
for index in results[results['label'] == 3].index:
    print(results.loc[index]['desc'])
    print('....')

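This code is meant to print the preprocessed descriptions belonging to cluster label 3. With the clustering above, however, the only labels are 0, 1, and -1, so the filter matches no rows and nothing is printed. Replacing 3 with an existing label such as 1 prints the descriptions of that cluster.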

The code first filters the results DataFrame for rows where the 'label' column is equal to 3, using boolean indexing.

Then, for each row in the filtered DataFrame, it prints the corresponding description by extracting the 'desc' value using the .loc method, and prints a separator '....' between each description to make it easier to distinguish between them.

**eps=0.0165 and min_samples=4**

In [23]:
labels_results = {}
for i in tqdm(np.arange(0.001, 1, 0.002)):
    dbscan = DBSCAN(eps=i, min_samples=4, metric='cosine').fit(vectors)
    labels_results[i] = len(pd.Series(dbscan.labels_).value_counts())

100%|██████████| 500/500 [01:13<00:00,  6.82it/s]


In [24]:
for i in np.arange(0.001, 0.1, 0.002):
    print('{}: {}'.format(i, labels_results[i]))

0.001: 1
0.003: 1
0.005: 1
0.007: 1
0.009000000000000001: 1
0.011: 6
0.013000000000000001: 13
0.015: 11
0.017: 3
0.019000000000000003: 2
0.021: 3
0.023: 2
0.025: 3
0.027000000000000003: 2
0.029: 2
0.031: 2
0.033: 2
0.035: 2
0.037000000000000005: 2
0.039: 2
0.041: 2
0.043000000000000003: 2
0.045: 2
0.047: 2
0.049: 2
0.051000000000000004: 2
0.053000000000000005: 2
0.055: 2
0.057: 2
0.059000000000000004: 2
0.061: 2
0.063: 2
0.065: 2
0.067: 2
0.069: 2
0.07100000000000001: 2
0.07300000000000001: 2
0.075: 2
0.077: 2
0.079: 2
0.081: 2
0.083: 2
0.085: 2
0.08700000000000001: 2
0.089: 2
0.091: 2
0.093: 2
0.095: 2
0.097: 2
0.099: 2


In [28]:
dbscan = DBSCAN(eps=0.0165, min_samples=4, metric='cosine').fit(vectors)

In [29]:
results = pd.DataFrame({
    'desc': processed_descriptions, 
    'label': dbscan.labels_
})

In [30]:
results['label'].value_counts()

-1    1377
 0     743
 1       4
 2       4
Name: label, dtype: int64

**eps=0.015 and min_samples=5**

In [31]:
labels_results = {}
for i in tqdm(np.arange(0.001, 1, 0.002)):
    dbscan = DBSCAN(eps=i, min_samples=5, metric='cosine').fit(vectors)
    labels_results[i] = len(pd.Series(dbscan.labels_).value_counts())

100%|██████████| 500/500 [01:13<00:00,  6.81it/s]


In [32]:
for i in np.arange(0.001, 0.1, 0.002):
    print('{}: {}'.format(i, labels_results[i]))

0.001: 1
0.003: 1
0.005: 1
0.007: 1
0.009000000000000001: 1
0.011: 4
0.013000000000000001: 8
0.015: 4
0.017: 2
0.019000000000000003: 2
0.021: 2
0.023: 2
0.025: 2
0.027000000000000003: 2
0.029: 2
0.031: 2
0.033: 2
0.035: 2
0.037000000000000005: 2
0.039: 2
0.041: 2
0.043000000000000003: 2
0.045: 2
0.047: 2
0.049: 2
0.051000000000000004: 2
0.053000000000000005: 2
0.055: 2
0.057: 2
0.059000000000000004: 2
0.061: 2
0.063: 2
0.065: 2
0.067: 2
0.069: 2
0.07100000000000001: 2
0.07300000000000001: 2
0.075: 2
0.077: 2
0.079: 2
0.081: 2
0.083: 2
0.085: 2
0.08700000000000001: 2
0.089: 2
0.091: 2
0.093: 2
0.095: 2
0.097: 2
0.099: 2


In [33]:
dbscan = DBSCAN(eps=0.015, min_samples=5, metric='cosine').fit(vectors)

In [34]:
results = pd.DataFrame({
    'desc': processed_descriptions, 
    'label': dbscan.labels_
})

In [35]:
results['label'].value_counts()

-1    1712
 0     399
 1      12
 2       5
Name: label, dtype: int64
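The three runs can be compared by the share of descriptions labeled as noise. The sketch below hard-codes the label counts copied from the three value_counts() outputs in this notebook; the dictionary layout and variable names are assumptions made for illustration.

```python
# Label counts copied from the three value_counts() outputs in this notebook,
# keyed by (eps, min_samples).
runs = {
    (0.029, 3): {0: 1946, -1: 179, 1: 3},
    (0.0165, 4): {-1: 1377, 0: 743, 1: 4, 2: 4},
    (0.015, 5): {-1: 1712, 0: 399, 1: 12, 2: 5},
}

noise_shares = {}
for (eps, min_samples), counts in runs.items():
    total = sum(counts.values())
    noise_shares[(eps, min_samples)] = counts.get(-1, 0) / total
    share = noise_shares[(eps, min_samples)]
    print(f"eps={eps}, min_samples={min_samples}: {share:.1%} noise")
# eps=0.029, min_samples=3: 8.4% noise
# eps=0.0165, min_samples=4: 64.7% noise
# eps=0.015, min_samples=5: 80.5% noise
```

In all three runs, most non-noise descriptions fall into cluster 0, and smaller eps values combined with larger min_samples values push a growing share of descriptions into the noise label.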