# Online News Popularity Project
### A Prediction Study
#### By: Mayra Weidner

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# load dataset
df = pd.read_csv('Data/df_cleaned.csv')

## Model Approach

This project uses regression analysis to predict how many times an article will be shared. Below I describe three regression models that I could use and the difficulties I foresee with each. Because we have only studied linear regression as of the date of this document, my analysis draws on internet research and the descriptions in the project sample we were provided.

References:   
https://www.javatpoint.com/linear-regression-vs-logistic-regression-in-machine-learning#:~:text=Linear%20regression%20is%20used%20to,used%20for%20solving%20Regression%20problem

https://www.analyticsvidhya.com/blog/2020/12/beginners-take-how-logistic-regression-is-related-to-linear-regression/

**Linear Regression:**    
Linear regression is a common and simple model that fits a linear relationship between the predictor (independent) variables, here all attributes other than the number of shares, and the dependent variable (number of shares). R-squared is important because it measures how well the model fits the data: it is the proportion of the variance in the dependent variable that the model explains. An R-squared closer to 1 indicates a better fit. An R-squared near 0 means the model explains little of that variance, either because a linear form describes the data poorly, because the error variance is large, or both.
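
To make the R-squared definition concrete, here is a minimal sketch that fits a least-squares model and computes R-squared by hand. It assumes synthetic data, not the project's dataset, and the helper names `fit_ols` and `r_squared` are hypothetical, chosen only for illustration.

```python
import numpy as np

# Synthetic stand-in data (NOT the news popularity dataset):
# 200 rows, 3 predictors, known coefficients plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coefs = np.array([2.0, -1.0, 0.5])
y = 3.0 + X @ true_coefs + rng.normal(scale=1.0, size=200)


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns coefficients, intercept first."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coefs, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coefs


def r_squared(X, y, coefs):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    residuals = y - X1 @ coefs
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


coefs = fit_ols(X, y)
r2 = r_squared(X, y, coefs)
print(f"R-squared: {r2:.3f}")
```

On real data the same calculation applies, with `X` holding the predictor columns and `y` the number of shares.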

I may choose linear regression because it is easy to interpret and implement. It can provide insights into the direction and magnitude of the relationships between the predictor variables and the number of shares.

The boxplots of the attributes show some noise in the data. That noise would make it difficult to detect a more complex shape for the relationship, so a linear regression model is a reasonable starting point.

As previously discussed in the Data Cleaning and EDA sections of this project, I dropped n_non_stop_words, n_unique_tokens, kw_avg_min, and kw_avg_avg because these attributes were strongly correlated with other attributes. Excluding strongly correlated attributes is unlikely to significantly impact the model's performance, and it can even be beneficial by preventing overfitting and simplifying the model. 
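
As an illustration of how strongly correlated attributes can be flagged, the sketch below lists column pairs whose absolute Pearson correlation exceeds a cutoff. The data is synthetic, the function name `strongly_correlated_pairs` is hypothetical, and the 0.9 cutoff is an assumption for the example rather than the threshold used earlier in this project.

```python
import numpy as np

# Synthetic columns (NOT the project's attributes): b is nearly a copy of a.
rng = np.random.default_rng(1)
n = 500
a = rng.normal(size=n)
b = a + rng.normal(scale=0.05, size=n)
c = rng.normal(size=n)
names = ["a", "b", "c"]
data = np.column_stack([a, b, c])


def strongly_correlated_pairs(data, names, threshold=0.9):
    """Return (name_i, name_j, correlation) for pairs with |corr| >= threshold."""
    corr = np.corrcoef(data, rowvar=False)  # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


pairs = strongly_correlated_pairs(data, names)
for first, second, value in pairs:
    print(f"{first} and {second}: correlation {value:.3f}")
```

One attribute from each flagged pair is then a candidate for removal.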

**Multilinear Regression:**  
Multilinear regression (more commonly called multiple linear regression) extends linear regression by including more than one predictor variable. This allows the model to capture the combined effect of multiple predictors on the dependent variable.

I may use multilinear regression if there are multiple relevant attributes that can potentially impact the number of shares, for example data channel and weekday. By incorporating multiple predictors, the model can provide a more comprehensive understanding of the factors influencing shareability.

The ideal scenario for this model is when the predictors are uncorrelated. Because I already removed some of the strongly correlated attributes, the regression coefficients should be easier to interpret, and extensive additional data cleaning may not be necessary. Collinearity (when two or more predictors are highly linearly correlated, with exact correlation as the extreme case) can still arise among the remaining attributes, so a quick check is worthwhile. R-squared is also important with this model because it measures how well the model fits the data.
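
One common collinearity check is the variance inflation factor (VIF), which regresses each predictor on all the others. The sketch below is a hand-rolled version on synthetic data; the function name `vif` is hypothetical and the data is not from this project. It also shows why removing pairwise-correlated attributes may not be enough: `x3` is not strongly correlated with `x1` or `x2` alone, yet it is almost exactly their sum.

```python
import numpy as np


def vif(X):
    """VIF for each column of X: 1 / (1 - R^2) from regressing it on the others."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept
        coefs, *_ = np.linalg.lstsq(A, target, rcond=None)
        ss_res = np.sum((target - A @ coefs) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        out[j] = ss_tot / ss_res  # algebraically equal to 1 / (1 - R^2)
    return out


# Synthetic predictors (NOT the project's attributes).
rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)  # near-linear combination
X_collinear = np.column_stack([x1, x2, x3])
X_independent = rng.normal(size=(n, 3))

pairwise = np.corrcoef(X_collinear, rowvar=False)
vif_collinear = vif(X_collinear)
vif_independent = vif(X_independent)
print("Largest pairwise |corr|:", np.abs(pairwise - np.eye(3)).max().round(2))
print("VIF (collinear):", vif_collinear.round(1))
print("VIF (independent):", vif_independent.round(2))
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity; the exact cutoff is a judgment call.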

**Logistic Regression:**  
Logistic regression is a classification method commonly used when the dependent variable is categorical or binary. Instead of predicting the exact number of shares, I could use it to model the probability that an article falls into a high-shares or low-shares category based on the given predictors. In this case I could use the median number of shares as the threshold: articles with fewer shares than the median would be low shares (unpopular - 0), and articles with shares greater than or equal to the median would be high shares (popular - 1). This approach would require me to change the purpose of my project, which I am unwilling to do at this point.
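
For reference, the median-threshold labeling described above could be sketched as follows. The share counts are made-up values for illustration, not rows from the dataset, and the variable names are hypothetical.

```python
import numpy as np

# Made-up share counts for ten articles (NOT taken from the dataset).
shares = np.array([593, 711, 1500, 1200, 505, 855, 556, 891, 3600, 710])

# Median of the share counts serves as the popularity threshold.
threshold = np.median(shares)

# 1 = popular (shares >= median), 0 = unpopular (shares < median).
popular = (shares >= threshold).astype(int)

print("Median threshold:", threshold)
print("Labels:", popular.tolist())
```

Because ties at the median are labeled popular, the two classes are not guaranteed to be exactly the same size on real data.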

**Conclusion: Model Selection**  
Given my analysis of the 3 regression models that I could use, I decided to start with linear regression as this is the easiest to implement and interpret. Since I have already removed strongly correlated attributes, I will try multilinear regression if time permits. 

I will not use logistic regression because it would require me to change the purpose of my project, and it is too late in the process to do so.