Author: Brian Erichsen Fagundes
MSD CS 6017 - Summer - 2024
Homework 3: Scraping and Regression

Part 1 Data Acquisiton

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import urllib.request
import re


# Create arrays of data to be scraped
ranks = []
title_lengths = []
age_in_hours = []
points = []
comments_numbers = []

for i in range(5):
    url="http://news.ycombinator.com/?p=" + str(i+1)
# Access website contents
    with urllib.request.urlopen (url) as response:
        html=response.read()
        html=html.decode("utf-8")

# Saves html content into file
    with open ("hackernews" + str(i+1) + ".html", "w") as new_file:
        new_file.write(html)

# Parses html content into a soup
    soup = BeautifulSoup(html, 'html.parser')
    
# Scrapes rank data
    for post in soup.find_all(class_="rank"):
        rank = str(post.text)
        rank = rank.replace('.', '')
        ranks.append(int(rank))

# Loads length of title data
    for title in soup.find_all(class_="titleline"):
        title_lengths.append(len(title.text))

# Loads age data
    for age in soup.find_all(class_="age"):
        age_str =  str(age.text)
        age_str =  age_str.removesuffix(" hours ago")
        age_str = age_str.removesuffix(" hour ago")

        if " day ago" in age_str:
            age_in_hours.append(24)
        elif " days ago" in age_str:
            string = age_str.replace(" days ago", "")
            num = int(string)
            age_in_hours.append(num*24)
        elif " minute ago" in age_str or " minutes ago" in age_str:
            age_in_hours.append(0)
        else:
            age_in_hours.append(int(age_str))

    for subtext in soup.find_all(class_="subtext"):
        point = 0
        comments = 0
        for score in subtext.find_all(class_="score"):
            point = int(re.search(r'\d+', str(score.text)).group())

        for tag in subtext.find_all("a"):
            if tag.text.endswith("comments"):
                a_= str(tag.text)
                comments = int(re.search(r'\d+', a_).group())

        points.append(int(point))
        comments_numbers.append(int(comments))

data_frame = pd.DataFrame({
    "Rank" : ranks, "Title Length" : title_lengths, "Age (hours)" : age_in_hours,
    "Points" : points, "Comments" : comments_numbers
})
print(data_frame.head())


   Rank  Title Length  Age (hours)  Points  Comments
0     1            73            0      20         0
1     2            58           11     485       482
2     3            86            1       9         0
3     4            48           11     594       217
4     5            59            6     128        46


In [2]:
# save data into file so we don't have to load data each time
data_frame.to_csv("hacker_news_stories.csv", index=False)

Part 2 - Regression

In [7]:
import statsmodels.api as sm

dataframe = pd.read_csv("hacker_news_stories.csv")
x = data_frame['Age (hours)']
y = data_frame['Rank']

x = sm.add_constant(x)

model = sm.OLS(y, x).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                   Rank   R-squared:                       0.250
Model:                            OLS   Adj. R-squared:                  0.245
Method:                 Least Squares   F-statistic:                     49.24
Date:                Wed, 05 Jun 2024   Prob (F-statistic):           7.56e-11
Time:                        20:54:09   Log-Likelihood:                -756.52
No. Observations:                 150   AIC:                             1517.
Df Residuals:                     148   BIC:                             1523.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const          48.1656      4.968      9.696      

In [4]:
x = data_frame[['Comments', 'Age (hours)', 'Points', 'Title Length']]
x = sm.add_constant(x)

model = sm.OLS(y, x).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                   Rank   R-squared:                       0.353
Model:                            OLS   Adj. R-squared:                  0.335
Method:                 Least Squares   F-statistic:                     19.76
Date:                Wed, 05 Jun 2024   Prob (F-statistic):           5.34e-13
Time:                        20:47:21   Log-Likelihood:                -745.44
No. Observations:                 150   AIC:                             1501.
Df Residuals:                     145   BIC:                             1516.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           26.5855     12.768      2.082   

In [5]:
from sklearn.preprocessing import PolynomialFeatures

# polynomial regression model: Rank ~ Comments + Comments^2 + Title Length + Title Length^2
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(data_frame[['Comments', 'Title Length']])
x = pd.DataFrame(x_poly, columns=poly.get_feature_names_out(['Comments', 'Title Length']))
x = sm.add_constant(x)

model = sm.OLS(y, x).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                   Rank   R-squared:                       0.081
Model:                            OLS   Adj. R-squared:                  0.049
Method:                 Least Squares   F-statistic:                     2.526
Date:                Wed, 05 Jun 2024   Prob (F-statistic):             0.0318
Time:                        20:47:21   Log-Likelihood:                -771.76
No. Observations:                 150   AIC:                             1556.
Df Residuals:                     144   BIC:                             1574.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                   -21.99

In [9]:
import numpy as np

# inverse linear model
x = 1 / data_frame[['Comments', 'Age (hours)', 'Points', 'Title Length']]
x = sm.add_constant(x.replace([np.inf, -np.inf], 0))
model = sm.OLS(y, x).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                   Rank   R-squared:                       0.263
Model:                            OLS   Adj. R-squared:                  0.242
Method:                 Least Squares   F-statistic:                     12.91
Date:                Wed, 05 Jun 2024   Prob (F-statistic):           5.12e-09
Time:                        21:19:15   Log-Likelihood:                -755.21
No. Observations:                 150   AIC:                             1520.
Df Residuals:                     145   BIC:                             1535.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const          110.1182     11.727      9.390   

In [None]:
import numpy as np

# inverse linear model
x = 1 / data_frame[['Points']]
x = sm.add_constant(x.replace([np.inf, -np.inf], 0))
model = sm.OLS(y, x).fit()
print(model.summary())

Explore several possible least squares regressions to predict a story's rank based on the other variables.

- The model rank ~ Age. R^2 = 0.250 and p value of 0.00 meaning that is this model alone where we hold all other variables constant and compare the likely of the variable age to impact the total rank is significant and it has a direct correlation.

- However, when we put all the data's variables in the same model where rank ~ Comments + Age (hours) + Points + Title Length. R^2 = 0.353 where the variable age has a p value of 0.00 meaning that the older a post is more likely that it will have a higher rank number.
The variable that seems to have the lowest impact on rank seems to be points since it has a negative coefficient of -0.1167 meaning that for each points increase there is a decrease in rank by approximately 0.12.

- The inverse linear model where rank ~ 1 / Comments + Age (hours) + Points + Title Length has R^2 = 0.263. The variable points has a p value of 0.00 meaning that per collected with 0.01% significance level we can assume that points have a inverse relationship with rank. Age variable has p a value of 0.00 where its coefficient has a value of -167.10 meaning that older posts tends to increase the rank.

The model that seems the to be the best fit for the collected data is the multiple linear regression with a R-squared (0.353) compared to inverse model and quadratic model as well.