finloop/data-science-notebooks

About

This repo contains various analyses of different datasets. The current work focuses on time series forecasting and anomaly detection.

Datasets

Wikipedia

Drawing a graph of page links

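import urllib3
import networkx as nx
from wikipedia.parser import get_graph

# Shared HTTP connection pool used to fetch pages
pool = urllib3.PoolManager()

# Build the link graph starting from the Data mining article
G = get_graph(pool, url="https://en.wikipedia.org/wiki/Data_mining", deep=1)
# Draw the nodes on a circle, with node labels
nx.draw(G, nx.circular_layout(G), with_labels=True)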

Data mining graph

Finding the Philosophy page

In this experiment, I test the hypothesis that repeatedly following the first link of any Wikipedia article eventually leads to the Philosophy article.

For more on the subject, see my dev.to article.

crawl(pool, "https://en.wikipedia.org/wiki/Data_mining", phrase="Philosophy", deep=30, n=1, verbose=True)
30 Entering Data_mining
29 Entering Data_set

...

   [('https://en.wikipedia.org/wiki/Thought',
     [('https://en.wikipedia.org/wiki/Ideas',
       ['https://en.wikipedia.org/wiki/Philosophy'])])])])])])])])])])])])])])])])])])])])])])])])])])
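The first-link idea can be illustrated without touching the network. The following is a minimal sketch, not the repo's `crawl` implementation: `follow_first_links` and `TOY_FIRST_LINK` are hypothetical names, the mapping is a hand-written toy table rather than real Wikipedia data, and the step limit of 30 simply mirrors `deep=30` from the call above.

```python
def follow_first_links(first_link, start, target="Philosophy", max_steps=30):
    """Follow first links from `start` until `target`, a loop, a dead end, or max_steps.

    `first_link` maps a page title to the title of its first link.
    Returns (path, reached_target).
    """
    path = [start]
    seen = {start}
    current = start
    for _ in range(max_steps):
        if current == target:
            return path, True
        nxt = first_link.get(current)
        if nxt is None or nxt in seen:
            # Dead end or cycle: this chain can never reach the target.
            return path, False
        path.append(nxt)
        seen.add(nxt)
        current = nxt
    return path, current == target


# Toy data (an assumption for illustration, not scraped from Wikipedia).
TOY_FIRST_LINK = {
    "Page_A": "Page_B",
    "Page_B": "Page_C",
    "Page_C": "Philosophy",
    "Page_X": "Page_Y",
    "Page_Y": "Page_X",  # two pages pointing at each other
}

print(follow_first_links(TOY_FIRST_LINK, "Page_A"))
print(follow_first_links(TOY_FIRST_LINK, "Page_X"))
```

Tracking the visited pages matters because a chain that runs into a cycle would otherwise only stop once the step limit is exhausted.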

Experiment and code

E-commerce dataset from a Brazilian retail store

Figures: Predictions; Dataset, sampled daily; Prophet; Prophet prediction of order volume with confidence intervals

Notebooks:

Animations:

Smoothed with 3-day moving average, yearly seasonality
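As a rough illustration of the 3-day smoothing, here is a minimal sketch using only the standard library. It assumes a trailing window (the notebooks may use a centred one), and `moving_average` and the sample values are hypothetical, not taken from the dataset.

```python
def moving_average(values, window=3):
    """Trailing moving average; the first window-1 points are None."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        smoothed.append(running / window if i >= window - 1 else None)
    return smoothed


daily_orders = [10, 20, 30, 40, 50]  # made-up daily order counts
print(moving_average(daily_orders))  # [None, None, 20.0, 30.0, 40.0]
```

With pandas, the same trailing average is `Series.rolling(3).mean()`, which yields NaN instead of None for the first two points.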

User interactions database

This dataset contains data from a news website. Each CSV file contains information about sessions, article clicks, interaction times, etc.

The notebook frequency_analysis.ipynb analyses the distribution of the position at which a given article appears within a session. For each session, I added the order in which articles were viewed. Then I filtered for a single article and created a histogram of its position for each hour.
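The steps above can be sketched with the standard library alone. This is an assumption-laden illustration, not the notebook's code: the `(session_id, article_id, timestamp)` layout, the function name `article_position_histograms`, and the sample clicks are all hypothetical, and "hour" is taken to be the hour of the click on the article.

```python
from collections import Counter, defaultdict
from datetime import datetime


def article_position_histograms(clicks, article_id):
    """Return {hour: Counter({position: count})} for one article.

    `clicks` is an iterable of (session_id, article_id, timestamp) tuples.
    """
    # Group clicks by session.
    by_session = defaultdict(list)
    for session_id, art, ts in clicks:
        by_session[session_id].append((ts, art))

    histograms = defaultdict(Counter)
    for events in by_session.values():
        events.sort(key=lambda event: event[0])  # order clicks by time
        # Position 1 is the first article opened in the session.
        for position, (ts, art) in enumerate(events, start=1):
            if art == article_id:
                histograms[ts.hour][position] += 1
    return histograms


# Made-up clicks for illustration only.
clicks = [
    ("s1", 111, datetime(2020, 1, 1, 9, 0)),
    ("s1", 119592, datetime(2020, 1, 1, 9, 5)),
    ("s2", 119592, datetime(2020, 1, 1, 9, 30)),
    ("s2", 222, datetime(2020, 1, 1, 9, 40)),
    ("s3", 333, datetime(2020, 1, 1, 14, 0)),
    ("s3", 119592, datetime(2020, 1, 1, 14, 2)),
]
hist = article_position_histograms(clicks, 119592)
for hour in sorted(hist):
    print(hour, dict(hist[hour]))
```

Each per-hour `Counter` can be passed to a plotting library as bar heights to obtain one histogram per hour.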

Histograms of given article (119592) order in sessions

Image processing

Animation of how Sobel edge detection works: Edge detection with Sobel
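For readers without the animation, here is a minimal pure-Python sketch of the Sobel operator on a small grayscale image stored as a list of rows. It is not the code behind the animation; `sobel_magnitude` is a hypothetical helper, border pixels are simply left at 0, and the test image is made up.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Gradient magnitude for the interior pixels of a 2D grayscale image."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0.0
            # Slide both kernels over the 3x3 neighbourhood of (y, x).
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            out[y][x] = math.hypot(gx, gy)
    return out


# Dark left half, bright right half: a single vertical edge.
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
edges = sobel_magnitude(image)
print(edges[2])  # [0.0, 0.0, 4.0, 4.0, 0.0, 0.0]
```

The response is non-zero only in the two columns next to the brightness step, which is what the edge map highlights.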